# Essential Exploratory Data Analysis (EDA) Pandas Methods

We have used many different Pandas attributes and methods over the last two weeks. Let's review the most essential ones that you will use when you BEGIN exploring data.

## Import Modules

In [None]:
import numpy as np
import pandas as pd

## Read data

Let's continue to work with the JOINED data set that we created previously.

In [None]:
df = pd.read_csv( 'joined_data.csv' )

In [None]:
df

But we CANNOT look at a dataset that has thousands to hundreds of thousands or even millions of rows!

We cannot look at a data set that has dozens to hundreds of columns!

What are the basic actions that we should perform for ANY data analysis task?

## Exploratory Data Analysis (EDA)

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df.info()

In [None]:
df.dtypes.value_counts()

In [None]:
df.isna().sum()

In [None]:
df.isna().sum()[ df.isna().sum() < 1 ]

In [None]:
df.isna().sum()[ df.isna().sum() > 0 ]
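
The two filtering cells above call `df.isna().sum()` twice each. A minimal sketch of the same idea that stores the counts ONCE and reuses them is shown below. The `toy` DataFrame and its values are made up for illustration and are NOT the joined data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the joined data set
toy = pd.DataFrame({'A': ['x', 'y', None, 'x'],
                    'B': [1.0, np.nan, 3.0, np.nan],
                    'C': [10, 20, 30, 40]})

# Count the missing values per column once and reuse the result
na_counts = toy.isna().sum()

# Columns with at least one missing value
cols_with_na = na_counts[na_counts > 0]

# Columns with no missing values at all
cols_complete = na_counts[na_counts < 1]

print(cols_with_na)
print(cols_complete)
```

Storing the counts in an object also makes it easy to sort them or to divide by the number of rows later.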

In [None]:
df.nunique()

By default, the `.nunique()` method does NOT count MISSING as a VALUE:

In [None]:
df.nunique(dropna=True)

If you set `dropna=False`, then MISSING is counted as a VALUE.

In [None]:
df.nunique(dropna=False)
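
To see the difference between the two settings on something small, here is a sketch on a made-up DataFrame (the `toy` object and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each column has 2 distinct values plus 1 missing
toy = pd.DataFrame({'E': ['aa', 'bb', np.nan, 'aa'],
                    'B': [1.0, 2.0, 2.0, np.nan]})

# Default behavior: the missing value is NOT counted
n_default = toy.nunique()

# dropna=False: the missing value IS counted as one extra value
n_with_na = toy.nunique(dropna=False)

print(n_default)
print(n_with_na)
```

A column where the two counts differ must contain at least one missing value.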

I think it is useful to check for columns with just 1 or 2 unique values. A column with 1 unique value is constant, and a column with 2 unique values is binary.
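
One possible way to find the columns with very few unique values is to filter the result of `.nunique()`. This is just a sketch; the `toy` DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a constant, a binary, and a varied column
toy = pd.DataFrame({'constant': ['k', 'k', 'k', 'k'],
                    'binary': [0, 1, 1, 0],
                    'varied': [1.5, 2.5, 3.5, 4.5]})

# Number of unique values per column
n_unique = toy.nunique()

# Keep only the columns with 1 or 2 unique values
few_values = n_unique[n_unique <= 2]

print(few_values)
```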

Lastly, it is always important to begin summarizing the columns.

In [None]:
df.mean()  # if df has string columns, this raises a TypeError in pandas 2.0 and later

In [None]:
df.mean(numeric_only=True)
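
The `numeric_only=True` argument tells Pandas to skip the non-numeric columns. A sketch on made-up data (the `toy` DataFrame is an assumption) shows that this gives the same averages as selecting the numeric columns first:

```python
import pandas as pd

# Hypothetical toy data with one string column and two numeric columns
toy = pd.DataFrame({'A': ['x', 'y', 'x'],
                    'B': [1.0, 2.0, 6.0],
                    'C': [10, 20, 30]})

# Average of the numeric columns only
avg_numeric = toy.mean(numeric_only=True)

# Same result: select the numeric columns first, then average
avg_selected = toy.select_dtypes(include='number').mean()

print(avg_numeric)
```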

In [None]:
df.describe()

In [None]:
df.describe(include='object')

In [None]:
df.describe(include='all')

The next step is to begin counting the categorical/string columns!

In [None]:
df.A.value_counts()

In [None]:
df.A.value_counts(dropna=False)

In [None]:
df.D.value_counts(dropna=False)

In [None]:
df.E.value_counts(dropna=False)

In [None]:
df.E.value_counts(dropna=False, normalize=True)

In [None]:
df.H.value_counts(dropna=False)

In [None]:
df.H.value_counts(dropna=False, normalize=True)

In [None]:
df.F.value_counts()

In [None]:
df.F.value_counts(dropna=False, normalize=True)
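
The `dropna` and `normalize` arguments change WHAT is counted and HOW the counts are reported. Here is a small sketch on a made-up Series (the `toy` values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
toy = pd.Series(['aa', 'bb', 'aa', np.nan, 'aa'], name='E')

# Default: the missing value is ignored
counts_default = toy.value_counts()

# dropna=False: the missing value gets its own row
counts = toy.value_counts(dropna=False)

# normalize=True: report proportions instead of counts
props = toy.value_counts(dropna=False, normalize=True)

print(counts)
print(props)
```

With `dropna=False` the counts add up to the number of rows, and the proportions add up to 1.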

## Realistic example

Let's follow the same steps to get the same type of BASIC information on a real data set!

In [None]:
import seaborn as sns

In [None]:
titanic = sns.load_dataset('titanic')

In [None]:
titanic.shape

In [None]:
titanic.columns

In [None]:
titanic

In [None]:
titanic.head()

In [None]:
titanic.tail()

In [None]:
titanic.dtypes

In [None]:
titanic.dtypes.value_counts()

In [None]:
titanic.info()

In [None]:
titanic.isna().sum()

In [None]:
titanic.isna().sum()[ titanic.isna().sum() > 0 ]

In [None]:
titanic.nunique()

In [None]:
titanic.nunique(dropna=False)

In [None]:
titanic.describe()

In [None]:
titanic.survived.value_counts()

In [None]:
titanic.survived.value_counts(dropna=False, normalize=True)

In [None]:
titanic.describe(include='object')

In [None]:
titanic.alive.value_counts()

In [None]:
titanic.describe(include='category')

In [None]:
titanic.describe(include='boolean')

In [None]:
titanic.pclass.value_counts()

In [None]:
titanic.pclass.value_counts(dropna=False, normalize=True)

In [None]:
titanic['class'].value_counts()

In [None]:
titanic['class'].value_counts(dropna=False, normalize=True)

In [None]:
titanic.deck.value_counts()

In [None]:
titanic.deck.value_counts(dropna=False)

In [None]:
titanic.deck.value_counts(dropna=False, normalize=True)

# SPLIT-APPLY-COMBINE or GROUPBY and AGGREGATE (summarize)

We have learned how to summarize data in Pandas. But we have summarized INDIVIDUAL columns ignoring all other columns!

Let's now explore how summary stats of one column CHANGE or VARY across the categories of another column!

Or...is the average different for a different group?

## Import Modules

In [None]:
import numpy as np
import pandas as pd

## Read data

Continue working with our JOINED data set.

In [None]:
df = pd.read_csv( 'joined_data.csv' )

In [None]:
df.info()

## Review

We know how to summarize individual columns!

In [None]:
df.nunique()

In [None]:
df.dtypes

In [None]:
df.E.value_counts()

In [None]:
df.E.value_counts(dropna=False)

In [None]:
df.B.mean()

In [None]:
df.C.mean()

But what is the AVERAGE of `B` for each unique value of `E`?

We cannot answer this question by simply applying the same summary methods!

We need to do something else to support our exploration!

## Split-Apply-Combine

Split means we DIVIDE or BREAK the data into distinct and separate groups!

We must partition the data set into the unique categories of a **GROUPING VARIABLE**.

We need to know the unique values of the grouping variable.

In [None]:
df.E.unique()

We must split `df` into smaller data sets. Each data set will have 1 and only 1 value of `E`.

In [None]:
df_E_aa = df.loc[ df.E == 'aa', : ].copy()

In [None]:
df_E_aa.shape

In [None]:
df_E_aa

In [None]:
df_E_aa.nunique()

In [None]:
type( df_E_aa )

We can apply all Pandas methods to our smaller DataFrame!

In [None]:
df_E_aa.B.mean()

In [None]:
df.B.mean()

In [None]:
df

We need to repeat the SPLITTING process for each unique value of `E`.

In [None]:
df_E_bb = df.loc[ df.E == 'bb', : ].copy()

In [None]:
df_E_cc = df.loc[ df.E == 'cc', : ].copy()

In [None]:
df_E_dd = df.loc[ df.E == 'dd', : ].copy()

We also need to SPLIT on the MISSING values of `E`.

In [None]:
df_E_na = df.loc[ df.E.isna(), :].copy()

In [None]:
df_E_na

We need to APPLY the summary method to `B` within each SPLIT smaller data set!

In [None]:
df_E_bb.B.mean()

In [None]:
df_E_cc.B.mean()

In [None]:
df_E_dd.B.mean()

In [None]:
df_E_dd

In [None]:
df_E_bb

In [None]:
df_E_na.B.mean()

But, we do NOT want to just let these values stay as stray numbers in our notebook.

We want to COLLECT or COMBINE the summary statistics PER GROUP into a new DataFrame!!!

SPLIT-APPLY-COMBINE breaks a dataset into groups based on categories, applies methods to summarize variables within each smaller dataset, and then combines the summary statistics per group into a new, easy-to-use DataFrame!

In [None]:
df_E_summary = pd.DataFrame({'E': ['aa', 'bb', 'cc', 'dd', np.nan],  # labels listed in the same order as the averages below
                             'B_avg': [df_E_aa.B.mean(), df_E_bb.B.mean(), df_E_cc.B.mean(), df_E_dd.B.mean(), df_E_na.B.mean()]})

In [None]:
df_E_summary
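
The same SPLIT-APPLY-COMBINE steps can also be written as a loop, which keeps each group label right next to its own summary value. This is only a sketch on a small made-up DataFrame; `toy` and its values are assumptions, NOT the joined data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a grouping column E and a numeric column B
toy = pd.DataFrame({'E': ['aa', 'bb', 'aa', 'bb', np.nan],
                    'B': [1.0, 2.0, 3.0, 6.0, 10.0]})

rows = []
for level in toy.E.unique():
    # SPLIT: missing values need .isna() because NaN == NaN is False
    if pd.isna(level):
        subset = toy.loc[toy.E.isna(), :]
    else:
        subset = toy.loc[toy.E == level, :]
    # APPLY: summarize B within the smaller data set
    rows.append({'E': level, 'B_avg': subset.B.mean()})

# COMBINE: collect the per-group summaries into one DataFrame
manual_summary = pd.DataFrame(rows)

print(manual_summary)
```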

Pandas has a method to manage SPLIT-APPLY-COMBINE for us!!!!

The `.groupby()` method will divide the dataset into smaller groups based on the categories of a grouping variable!!!

In [None]:
df.groupby('E').B.mean()

We can force `.groupby()` to INCLUDE the MISSING!

In [None]:
df.groupby('E', dropna=False).B.mean()
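
A quick sketch of what `dropna=False` changes, using a made-up DataFrame (`toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row has a missing value for E
toy = pd.DataFrame({'E': ['aa', 'bb', 'aa', 'bb', np.nan],
                    'B': [1.0, 2.0, 3.0, 6.0, 10.0]})

# Default: rows with a missing grouping value are dropped
by_default = toy.groupby('E').B.mean()

# dropna=False: the missing value becomes its own group
with_missing = toy.groupby('E', dropna=False).B.mean()

print(by_default)
print(with_missing)
```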

In [None]:
df.groupby('E', dropna=False)['B'].mean()

Bracket notation lets us use a variable that stores the column name to define the column to summarize.

In [None]:
var_to_summarize = 'B'

In [None]:
df.groupby('E', dropna=False)[ var_to_summarize ].mean()

We can also use a variable that stores a string to identify the grouping variable.

In [None]:
var_to_group = 'E'

In [None]:
df.groupby( var_to_group, dropna=False )[ var_to_summarize ].mean()

In [None]:
df.groupby( 'E', dropna=False ).B.std()

In [None]:
df.groupby( 'E', dropna=False ).C.mean()

In [None]:
df.groupby( 'E', dropna=False ).C.std()

Apply the `.sem()` method! It returns the standard error of the mean (SEM).

In [None]:
df.groupby( 'E', dropna=False ).B.sem()

In [None]:
df.B.sem()

In [None]:
df.B.std()
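
The SEM is the standard deviation divided by the square root of the number of NON-MISSING values. A sketch that checks this per group on made-up data (the `toy` DataFrame is an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; group bb has one missing value of B
toy = pd.DataFrame({'E': ['aa', 'aa', 'aa', 'bb', 'bb', 'bb', 'bb'],
                    'B': [1.0, 2.0, 3.0, 2.0, 4.0, 6.0, np.nan]})

grouped = toy.groupby('E').B

# Built-in standard error of the mean per group
sem_builtin = grouped.sem()

# Manual version: sample standard deviation / sqrt(non-missing count)
sem_manual = grouped.std() / np.sqrt(grouped.count())

print(sem_builtin)
print(sem_manual)
```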

If we want MULTIPLE summary stats returned we can apply the `.describe()` method!

In [None]:
df.groupby('E', dropna=False).B.describe()

We do not necessarily need to identify a single column when we summarize.

In [None]:
df.groupby('E', dropna=False).mean()  # if df has other string columns, this raises a TypeError in pandas 2.0 and later

In [None]:
df.groupby('E', dropna=False).mean(numeric_only=True)

In [None]:
df.groupby('E', dropna=False).std(numeric_only=True)

In [None]:
df.groupby('E', dropna=False).sem(numeric_only=True)

In [None]:
df.groupby('E', dropna=False).describe()

In [None]:
df.groupby('E', dropna=False).describe(include='object')

In [None]:
df.groupby('E', dropna=False).describe().columns

In [None]:
df.groupby('E', dropna=False).describe()[ ('B', 'mean') ]

In [None]:
df.groupby('E', dropna=False).describe()[ 'mean' ]  # expect a KeyError: 'mean' is a second-level column label
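
The grouped `.describe()` result has TWO levels of column labels: the variable name and the statistic. A sketch of two ways to select from it on made-up data is shown below. The `toy` DataFrame is an assumption, and `.xs()` is one option among several for picking a statistic across all variables.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns
toy = pd.DataFrame({'E': ['aa', 'aa', 'bb', 'bb'],
                    'B': [1.0, 3.0, 2.0, 6.0],
                    'C': [10.0, 20.0, 30.0, 50.0]})

desc = toy.groupby('E').describe()

# A tuple selects one (variable, statistic) pair
b_mean = desc[('B', 'mean')]

# .xs() selects one statistic across ALL variables
all_means = desc.xs('mean', axis=1, level=1)

print(b_mean)
print(all_means)
```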

Pandas provides multiple different ways to APPLY summary methods to GROUPED dataframes.

I personally like the `.groupby().aggregate()` approach. I feel this is the most flexible yet straightforward way to apply summary methods to DIFFERENT COLUMNS!!!

We can pick the method to apply and pick which column we apply that method to.

In [None]:
df.groupby('E', dropna=False).\
aggregate(B_avg = ('B', 'mean'),
          B_std = ('B', 'std'),
          B_sem = ('B', 'sem'),
          C_avg = ('C', 'mean'),
          C_sem = ('C', 'sem'),
          B_numrows = ('B', 'size'),
          B_nonmissing = ('B', 'count'),
          F_nunique = ('F', 'nunique'),
          G_nunique = ('G', 'nunique'),
          A_nunique = ('A', 'nunique')).\
reset_index()
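
Each argument to `.aggregate()` has the form `new_name = (column, method)`. The sketch below uses made-up data (the `toy` DataFrame is an assumption) to show how `'size'` and `'count'` differ when a column has missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; group aa has one missing value of B
toy = pd.DataFrame({'E': ['aa', 'aa', 'bb', 'bb', 'bb'],
                    'B': [1.0, np.nan, 2.0, 4.0, 6.0],
                    'F': ['u', 'u', 'v', 'w', 'v']})

summary = (toy.groupby('E', dropna=False)
              .aggregate(B_avg=('B', 'mean'),
                         B_numrows=('B', 'size'),       # counts ALL rows
                         B_nonmissing=('B', 'count'),   # counts non-missing rows
                         F_nunique=('F', 'nunique'))
              .reset_index())

print(summary)
```

When `B_numrows` and `B_nonmissing` disagree for a group, that group has missing values of `B`.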

Rather than just displaying the GROUPED and SUMMARIZED result...let's SAVE or ASSIGN it to a new object.

In [None]:
df_E_summary_info = df.groupby('E', dropna=False).\
aggregate(B_avg = ('B', 'mean'),
          B_std = ('B', 'std'),
          B_sem = ('B', 'sem'),
          C_avg = ('C', 'mean'),
          C_sem = ('C', 'sem'),
          B_numrows = ('B', 'size'),
          B_nonmissing = ('B', 'count'),
          F_nunique = ('F', 'nunique'),
          G_nunique = ('G', 'nunique'),
          A_nunique = ('A', 'nunique')).\
reset_index()

In [None]:
df_E_summary_info

In [None]:
df_E_summary_info.info()

Does the AVERAGE of `B` change across the categories of `E`?

In [None]:
df_E_summary_info

We can also group by MULTIPLE COLUMNS if we supply multiple column names within a list!

In [None]:
df.groupby(['E', 'H'], dropna=False).\
aggregate(B_avg = ('B', 'mean'),
          B_numrows = ('B', 'size'),
          B_nonmissing = ('B', 'count'))

In [None]:
df.groupby(['E', 'H'], dropna=False).\
aggregate(B_avg = ('B', 'mean'),
          B_numrows = ('B', 'size'),
          B_nonmissing = ('B', 'count')).\
index

The multi index is annoying and causes a lot of filtering issues.

So again I highly recommend and encourage resetting the index!!

In [None]:
df.groupby(['E', 'H'], dropna=False).\
aggregate(B_avg = ('B', 'mean'),
          B_numrows = ('B', 'size'),
          B_nonmissing = ('B', 'count')).\
reset_index()
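
To see why the reset index is easier to work with, here is a sketch that filters the same grouped result both ways. The `toy` DataFrame is made up for illustration, and `.xs()` is just one way to filter a multi index.

```python
import pandas as pd

# Hypothetical toy data with two grouping columns
toy = pd.DataFrame({'E': ['aa', 'aa', 'bb', 'bb'],
                    'H': ['p', 'q', 'p', 'p'],
                    'B': [1.0, 3.0, 2.0, 6.0]})

# Grouping by two columns produces a multi index
multi = toy.groupby(['E', 'H']).aggregate(B_avg=('B', 'mean'))

# After reset_index() the grouping variables are regular columns
flat = multi.reset_index()

# Filtering regular columns uses the familiar .loc[] pattern
from_flat = flat.loc[flat.H == 'p', :]

# Filtering the multi index needs a dedicated method such as .xs()
from_multi = multi.xs('p', level='H')

print(from_flat)
print(from_multi)
```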

## Realistic example

In [None]:
import seaborn as sns

In [None]:
titanic = sns.load_dataset('titanic')

In [None]:
titanic.info()

In [None]:
titanic.survived.value_counts()

In [None]:
titanic.survived.value_counts(dropna=False, normalize=True)

In [None]:
titanic.survived.mean()

But does the survival rate depend on the passenger class?

In [None]:
titanic.pclass.value_counts()

Let's GROUP BY `pclass` and APPLY methods to SUMMARIZE the `survived` column.

This way we can explore whether the survival rate changes across the `pclass` categories.

In [None]:
titanic.groupby('pclass', dropna=False).\
aggregate(num_rows = ('survived', 'size'),
          num_nonmissing = ('survived', 'count'),
          num_survive = ('survived', 'sum'),
          prop_survive = ('survived', 'mean'),
          survive_sem = ('survived', 'sem')).\
reset_index()
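
The average of a column that only contains 0 and 1 is the PROPORTION of 1s, which is why `'mean'` gives the survival rate. A sketch on made-up numbers checks this below. The `toy` values are assumptions and are NOT the real Titanic counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a 0/1 outcome column
toy = pd.DataFrame({'pclass': [1, 1, 1, 2, 2, 3, 3, 3, 3],
                    'survived': [1, 1, 0, 1, 0, 0, 0, 1, 0]})

rates = (toy.groupby('pclass')
            .aggregate(num_rows=('survived', 'size'),
                       num_survive=('survived', 'sum'),
                       prop_survive=('survived', 'mean'))
            .reset_index())

# The mean of a 0/1 column equals (number of 1s) / (number of rows)
rates['prop_check'] = rates.num_survive / rates.num_rows

print(rates)
```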

Let's group by 2 variables just to see what happens.

Let's group by `pclass` and `class`.

In [None]:
titanic.pclass.value_counts()

In [None]:
titanic['class'].value_counts()

In [None]:
titanic.groupby(['pclass', 'class'], dropna=False).\
aggregate(num_rows = ('survived', 'size'),
          num_nonmissing = ('survived', 'count'),
          num_survive = ('survived', 'sum'),
          prop_survive = ('survived', 'mean'),
          survive_sem = ('survived', 'sem')).\
reset_index()

It can be useful to simplify the grouped and summarized result by only focusing on the OBSERVED combinations!

In [None]:
titanic.groupby(['pclass', 'class'], dropna=False, observed=True).\
aggregate(num_rows = ('survived', 'size'),
          num_nonmissing = ('survived', 'count'),
          num_survive = ('survived', 'sum'),
          prop_survive = ('survived', 'mean'),
          survive_sem = ('survived', 'sem')).\
reset_index()
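
The `observed` argument matters when a grouping variable has the `category` data type, like `class` does. The sketch below compares the two settings on made-up data (the `toy` DataFrame is an assumption). With `observed=False`, the combinations that never occur still get a row.

```python
import pandas as pd

# Hypothetical toy data; 'class' is categorical and matches pclass exactly
toy = pd.DataFrame({'pclass': [1, 1, 2, 3],
                    'class': pd.Categorical(['First', 'First', 'Second', 'Third'],
                                            categories=['First', 'Second', 'Third']),
                    'survived': [1, 0, 1, 0]})

# observed=False: every pclass/class combination gets a row
all_combos = (toy.groupby(['pclass', 'class'], observed=False)
                 .aggregate(num_rows=('survived', 'size'))
                 .reset_index())

# observed=True: only the combinations present in the data get a row
seen_only = (toy.groupby(['pclass', 'class'], observed=True)
                .aggregate(num_rows=('survived', 'size'))
                .reset_index())

print(all_combos)
print(seen_only)
```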