<a href="https://colab.research.google.com/github/a-forty-two/EY_batch8_11Nov_AIplusOpenAI/blob/main/11Nov_006_EDA_Visuals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python for Data Science & Analysis
Exploratory Data Analysis & Visualization

# What is EDA?

## What is Exploratory Data Analysis?

* goal:
    * investigate
    * explain
    * describe
    * understand

* questions?
    * is there enough data?
    * is the data correct?
    * what is the distribution of each column?
    * how do the columns correlate?

* method
    * visual
    * primarily descriptive
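
The questions above map fairly directly onto a few pandas calls. A minimal sketch, assuming a small made-up dataframe (`toy`, with hypothetical columns `age`, `fare` and `sex`) rather than any real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented purely for illustration
toy = pd.DataFrame({
    "age":  [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "sex":  ["male", "female", "female", "female", "male", "male"],
})

# Is there enough data? -> number of rows and columns
n_rows, n_cols = toy.shape

# Is the data correct? -> missing values per column (one simple check among many)
missing = toy.isna().sum()

# What is the distribution of each column? -> summary statistics
summary = toy.describe()

# How do the columns correlate? -> pairwise correlation of the numeric columns
corr = toy[["age", "fare"]].corr()

print(n_rows, n_cols)
print(missing)
print(summary)
print(corr)
```

These numbers are only a first pass; the visual methods later in the notebook cover the same questions graphically.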

## What are the tools for EDA in Python?

A simple pairing: pandas for data exploration, seaborn for visualization.


...but there are many other options (e.g., matplotlib and plotly).

## What are the challenges around EDA?

### Challenges of EDA:

* Strategic
* Organizational
* Technical

## Objectives
* write a program which uses seaborn to:
    * show univariate plots (e.g., histplot)
    * show multivariate plots (eg., scatterplot)
    * EXTRA: customize plots

# Part 3: Visualization

## How do I use pandas to plot?

Pandas has a `.plot` method which you configure using its arguments (e.g., `kind='hist'`); under the hood it uses matplotlib, not seaborn.

In [None]:
import pandas as pd
ti = pd.read_csv('https://raw.githubusercontent.com/a-forty-two/EY_batch8_11Nov_AIplusOpenAI/refs/heads/main/titanic.csv')
ti.head(3)


In [None]:
ti.tail(5)  # last five rows (5 is also the default)

In [None]:
ti.describe().T

In [None]:
ti.info()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

#Create a pie chart presenting the male/female proportion

In [None]:
# sum the instances of males and females
males = (ti['sex'] == 'male').sum()
females = (ti['sex'] == 'female').sum()

# put them into a list called proportions
proportions = [males, females]

# Create a pie chart
plt.pie(
    # using proportions
    proportions,

    # with the labels naming each slice
    labels = ['Males', 'Females'],

    # with no shadows
    shadow = False,

    # with colors
    colors = ['blue','red'],

    # with the first slice (males) exploded out
    explode = (0.15 , 0),

    # with the start angle at 90 degrees
    startangle = 90,

    # with each percentage shown to one decimal place
    autopct = '%1.1f%%'
    )

# Use an equal aspect ratio so the pie is drawn as a circle
plt.axis('equal')

# Set the title
plt.title("Sex Proportion")

# View the plot
plt.tight_layout()
plt.show()

In [None]:
ti['age'].plot(kind='hist');


Pandas plotting is handy for a quick look, but it often doesn't choose the right plot type or the right data series, so it is usually easier to go straight to seaborn.

## How do I use seaborn to visualize data?

In [None]:
import seaborn as sns


### Check for missing values and clean the data

In [None]:
ti.isna().sum()  # count the missing values in each column

In [None]:
#replace NaNs in numerical fields with the mean values
import numpy as np
# Select only numeric columns for mean calculation
numeric_cols = ti.select_dtypes(include=np.number).columns

# Calculate the mean for numeric columns only
ti[numeric_cols] = ti[numeric_cols].fillna(ti[numeric_cols].mean())

In [None]:
# fill every column with its own most frequent value
ti = ti.apply(lambda x:x.fillna(x.value_counts().index[0]))
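
To see what the two cleaning steps above do, here is a minimal sketch on a made-up four-row dataframe (`toy`, with hypothetical columns `age` and `town`), assuming the same mean-then-most-frequent strategy:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column
toy = pd.DataFrame({
    "age":  [20.0, 30.0, np.nan, 40.0],
    "town": ["Cherbourg", "Southampton", "Southampton", None],
})

# Step 1: numeric columns get their own mean (here: 30.0)
numeric_cols = toy.select_dtypes(include=np.number).columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Step 2: any remaining gaps get the column's most frequent value
toy = toy.apply(lambda col: col.fillna(col.value_counts().index[0]))

print(toy)
```

Mean imputation is a simple default, but it narrows the spread of the column, so plots of imputed columns should be read with some caution.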

There are two ways of using seaborn. You can either

* supply x (, y, etc.) as individual arguments
* OR: supply a *dataframe* and name the relevant columns
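
The two calling styles can be sketched side by side. This is only an illustration, assuming a made-up dataframe `toy` with hypothetical `age` and `fare` columns; both calls should draw the same points:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({
    "age":  [22, 38, 26, 35],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# Style 1: pass the values themselves as x and y
sns.scatterplot(x=toy["age"], y=toy["fare"], ax=left)

# Style 2: pass the dataframe and name the columns
sns.scatterplot(data=toy, x="age", y="fare", ax=right)
```

The second style tends to scale better once extra arguments such as `hue` or `size` refer to further columns of the same dataframe.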

In [None]:
# histplot replaces the deprecated distplot; kde=False draws the histogram only
sns.histplot(ti['age'], kde=False);


##### What is the kde?

* kernel density estimate: a smooth approximation of the distribution, built by centering a small normal (Gaussian) curve on each observation and averaging the curves

* further reading: https://seaborn.pydata.org/tutorial/distributions.html#kernel-density-estimation
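
The idea can be sketched by hand with NumPy. This is a simplified illustration, not seaborn's implementation: the function name `gaussian_kde_sketch`, the sample ages and the bandwidth of 5.0 are all assumptions chosen for the example (seaborn chooses a bandwidth automatically).

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one normal curve per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    # z[i, j] = standardized distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    curves = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return curves.mean(axis=1)

# Hypothetical ages, invented for illustration
ages = [22, 25, 26, 35, 38, 54]
grid = np.linspace(-20, 100, 1201)
density = gaussian_kde_sketch(ages, grid, bandwidth=5.0)

# A density should integrate to roughly 1 over a wide enough grid
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve and a smaller one a bumpier curve. In seaborn, passing `kde=True` to `sns.histplot` overlays a curve of this kind on the histogram.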

In [None]:
sns.lineplot(x='age', y='fare', data=ti)

## How do I use a dataframe with seaborn?

We set `data` to be the dataframe, and *name the columns* for `x` and `y`...

In [None]:
sns.scatterplot(data=ti, x='age', y='fare')


## How do I create a distribution plot?

In [None]:
# histplot replaces the deprecated distplot; y= draws the bars horizontally
sns.histplot(data=ti, y='survived', discrete=True)


## How do I create a violin plot?

The width of a violin at a given value shows how common that value is within the column (i.e., its estimated density)...

In [None]:
sns.violinplot(x=ti['age'])


## How do I create a box plot?

Box plots show quartiles (25th, 50th and 75th percentiles) and outliers.

In [None]:
sns.boxplot(x=ti['age'])
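
The numbers behind a box plot can be computed directly with pandas. A minimal sketch, assuming made-up ages and the common 1.5 × IQR rule for flagging outliers (the default whisker length in seaborn and matplotlib):

```python
import pandas as pd

# Hypothetical ages, invented for illustration; 2 and 80 are deliberately extreme
ages = pd.Series([2, 22, 24, 26, 28, 30, 35, 38, 80])

# The three quartiles that form the box and its middle line
q1, median, q3 = ages.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1  # interquartile range: the height of the box

# Points further than 1.5 * IQR from the box are drawn as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(q1, median, q3)
print(outliers.tolist())
```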


## How do I create a violin plot for multiple columns?

A violin plot can show multiple distributions, each a subset of a single column, factored (or grouped) by another.

In [None]:
sns.violinplot(data=ti, x='survived', y='age')


In [None]:
sns.violinplot(data=ti, x="age", y='embark_town')

## How do I create a bar plot for multiple columns?

Bar plots are useful for discrete data. Here, `sns.barplot` shows the mean `age` for each value of `survived` (the mean is its default estimator):

In [None]:
sns.barplot(data=ti, x='survived', y='age');
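
The bar heights can be checked against a plain pandas aggregation. A minimal sketch, assuming made-up toy data with the same column names, that computes the per-group means a bar plot of `age` by `survived` would display:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1],
    "age":      [20.0, 30.0, 40.0, 10.0, 30.0],
})

# One mean per group: this is what each bar's height represents
mean_age = toy.groupby("survived")["age"].mean()
print(mean_age)
```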


## How do I create a line plot?

In [None]:
sns.lineplot(data=ti, x='fare', y='age');


###### NB: would creating a scatter plot be more appropriate?

## How do I create a scatter plot?

In [None]:
sns.scatterplot(data=ti, x='fare', y='age');


In [None]:
sns.scatterplot(data=ti, x='fare', y='age',
                hue = 'sex',
                size = 'pclass');

### EXTRA: Tips
* Often, for the sake of communication, using Excel is both faster and leads to better visuals
* use `df.to_csv()` to save the data behind a visual
* use Excel to heavily customize the layout (and then, e.g., copy it into PowerPoint)
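
Saving the data behind a visual might look like the sketch below. The toy data and the "survival rate by sex" table are assumptions made up for illustration:

```python
import io
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({
    "sex":      ["male", "female", "female", "male"],
    "survived": [0, 1, 1, 1],
})

# The small table behind a "survival rate by sex" bar chart
rates = toy.groupby("sex", as_index=False)["survived"].mean()

# With no path, to_csv returns the CSV text; pass a filename to write a file instead
csv_text = rates.to_csv(index=False)
print(csv_text)

# Round trip: reading the text back gives the same table
restored = pd.read_csv(io.StringIO(csv_text))
```

Exporting the aggregated table rather than the raw rows keeps the file small and matches what the chart actually shows.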

# Python for Data Science & Analysis
## Plotly

## How do you install plotly?

You can install plotly with `pip` (the Python package installer).

In [None]:
!pip install plotly


## How do I import plotly?

In [None]:
import plotly.express as px


## Plotly

In [None]:
import pandas as pd

ti = pd.read_csv("https://raw.githubusercontent.com/a-forty-two/EY_batch8_11Nov_AIplusOpenAI/refs/heads/main/titanic.csv")

In [None]:
px.histogram(ti, x='age')

In [None]:
px.scatter(ti, x='age', y='survived', color='class', size='fare')

## Exercise (30 min), optional

## Step 1 (5 min)
* review seaborn individually and try a few plots

## Step 2 (25 min)

* What affected your chances of survival on the Titanic?
    * brainstorm & execute an analysis of the Titanic dataset
* Consider:
    * old vs. young
    * women vs. men
    * women & children vs. men
    * cheaper vs. expensive tickets
    * deck
    * class
    * ...location...
* Use seaborn plots (and pandas) to obtain helpful visuals which answer the investigatory question.
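
As a possible starting point for one of the comparisons above ("women & children vs. men"), here is a hedged sketch on made-up toy data. The `group` column and the under-16 cutoff for "child" are assumptions chosen for illustration, not part of the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({
    "sex":      ["male", "female", "male", "female", "male", "male"],
    "age":      [40.0, 30.0, 8.0, 5.0, 25.0, 60.0],
    "survived": [0, 1, 1, 1, 0, 1],
})

# Assumption: "child" means under 16 here; the cutoff is a choice, not a given
is_woman_or_child = (toy["sex"] == "female") | (toy["age"] < 16)
toy["group"] = np.where(is_woman_or_child, "women & children", "men")

# Share of survivors in each group
survival_rate = toy.groupby("group")["survived"].mean()
print(survival_rate)
```

A derived column like `group` can then be passed to seaborn as `x` or `hue` in the plots listed below.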

### Consider:
* Univariate (single-col) plots
    * sns.histplot
        * of survived
    * sns.violinplot
        * of fare
    * sns.boxplot
        * of age

* Multivariate (here: 2-col) plots
    * sns.violinplot
        * age by survived
        * fare by survived
    * sns.lineplot
        * fare vs. age
    * sns.scatterplot
        * age vs. fare

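In [None]:
# count passengers aged over 18 vs. 18 and under (rows with a missing age are not counted)
ti[["age", "sex"]].groupby(ti["age"] > 18).count()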