# EDA Quick Reference

## Why is EDA important?
Some of the reasons EDA is a good use of time:

* Identify patterns
* Build up an intuition about the data BASED ON the data, not the description of what the data is supposed to be/show. This leads to/informs:
    * Development of hypotheses
    * Model selection
    * Feature engineering


**Reasons for client/CEO/manager/boss/etc. (the audience)**
* Helps to ensure:
    * that the results are technically sound and based on the data
    * that the right questions are being asked
* Can help uncover other questions that should be asked
* Tests assumptions about the data and the business problem(s)
* Provides context for the application of the results.
* May underscore the value of the results.
* Can lead to insights that would otherwise be missed, or point to further modeling or data collection that could yield more.

## Remember
* EDA is never finished. With every analytical result, it is important to return to EDA to:
    * Make sure the result makes sense according to the data and the problem/questions
    * Test new questions that arise from the results
* Be as objective as possible and listen to the DATA, not your assumptions or the company's assumptions about the data: the goal is to challenge and evaluate the assumptions. 
* Repeat EDA for every new problem even if the data remain the same. New perspectives (via new problems/goals) may reveal new insights about the data.

## Major tasks of EDA
The tasks are presented linearly, but in practice you will not follow this order exactly; the data and the choices you make based on the problem(s) and time constraints will determine the order.

1. Formulate hypotheses and develop investigation themes: understand the question, the assumptions about the data, and what the data would look like if those assumptions were met.
2. Clean and wrangle data
3. Assess data quality
4. Summarize data
5. Explore each individual variable in the dataset
6. Assess relationships/interactions:
   a. between each variable and the target, or the goal/problem if the task is not predictive
   b. between variables
7. Explore the data across multiple dimensions
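
As a rough illustration of steps 3 and 4, a first pass over data quality and summary statistics might look like the sketch below. The toy DataFrame and its column names are invented for the example; substitute your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; replace with your own DataFrame.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 31],
    "city": ["Oslo", "Bergen", "Oslo", None, "Bergen"],
    "income": [52000, 61000, 58000, 75000, 61000],
})

# Data quality: share of missing values per column and fully duplicated rows.
missing_share = df.isna().mean()
n_duplicates = df.duplicated().sum()

# Summary: describe numeric and non-numeric columns separately.
numeric_summary = df.describe()
categorical_summary = df.select_dtypes(exclude="number").describe()

print(missing_share, n_duplicates, numeric_summary, categorical_summary, sep="\n\n")
```

Looking at the missing-value shares and duplicate count before any modeling tends to surface the questions worth recording for later steps.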

Throughout analysis:
* Capture a list of hypotheses and questions that come up that might merit further exploration.
* Record what to watch out for/be aware of
* Show intermediate results, get domain expertise from others, re-form perspective
* Pair visualizations and results to maximize ROI.

## Cleaning/Wrangling 

### Basic things to do 
* Make your data [tidy](https://tomaugspurger.github.io/modern-5-tidy.html).
    1. Each variable forms a column
    2. Each observation forms a row
    3. Each type of observational unit forms a table
* Transform data: sometimes you will need to transform your data to be able to extract information from it. This step will usually occur after some of the other steps of EDA unless domain knowledge can inform these choices beforehand.  
    * Log: highly right-skewed data (as opposed to a bell-shaped normal distribution) sometimes follows a log-normal distribution; taking the log of each (positive) data point then makes it approximately normal.
    * Binning of continuous variables: binning continuous variables and then analyzing the resulting groups of observations can make patterns easier to spot, especially when relationships are non-linear.
    * Simplifying of categories: you really don't want more than 8-10 categories within a single data field. Try to aggregate to higher-level categories when it makes sense.
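
The three transformations above can be sketched in pandas as follows. The data is synthetic and the column names (`income`, `product`) are placeholders; the choice of `log1p`, quartile bins, and "top 5 levels plus other" is only one reasonable option among several.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=1000),   # right-skewed
    "product": rng.choice(list("ABCDEFGHIJKL"), size=1000,
                          p=[0.3, 0.2, 0.15, 0.1, 0.05, 0.05,
                             0.04, 0.03, 0.03, 0.02, 0.02, 0.01]),
})

# Log transform: log1p also handles zeros (but not negative values).
df["log_income"] = np.log1p(df["income"])

# Binning: four equal-frequency bins (quartiles) of the continuous variable.
df["income_bin"] = pd.qcut(df["income"], q=4, labels=["low", "mid", "high", "top"])

# Simplifying categories: keep the 5 most frequent levels, lump the rest together.
top_levels = df["product"].value_counts().nlargest(5).index
df["product_simple"] = df["product"].where(df["product"].isin(top_levels), "other")

print(df["income"].skew(), df["log_income"].skew())
print(df["income_bin"].value_counts())
print(df["product_simple"].nunique())
```

Comparing the skewness before and after the log transform is a quick check on whether the transformation actually helped.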
    
Above text inspired by and adapted from https://github.com/cmawer/pycon-2017-eda-tutorial/blob/master/EDA-cheat-sheet.md

Daniel's EDA notebook
Exploratory Data Analysis
This notebook contains boilerplate code for typical EDA steps:

* Import
* Summary Stats
* Cleaning
* Missing data
* Exploratory plots

# Importing data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Reading a csv file without the annoying index column
a = pd.read_csv("a.csv", index_col = 0)

# Naming columns for a dataframe
colnames= ['column1', 'column2', 'etc']
a.columns = colnames

In [None]:
# Joining dataframes, SQL style
# (with `on`, join() matches a['key'] against the index of b)
df = a.join(b, how = 'left', on = 'key')
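
When both frames carry the key as an ordinary column rather than as an index, `merge` is often easier to reason about than `join`. A minimal sketch with two made-up frames (the column names are illustrative only):

```python
import pandas as pd

a = pd.DataFrame({"key": [1, 2, 3], "left_val": ["x", "y", "z"]})
b = pd.DataFrame({"key": [1, 2, 4], "right_val": [10.0, 20.0, 40.0]})

# Left join on a shared column; `indicator=True` adds a `_merge` column
# showing which rows found a match in `b`.
merged = a.merge(b, how="left", on="key", indicator=True)

# Same rows via join(): the right frame has to be indexed by the key first.
joined = a.join(b.set_index("key"), how="left", on="key")

print(merged)
print(joined)
```

The `_merge` column is a cheap way to check how many rows failed to match before trusting the joined table.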

# Summarizing & cleaning data

In [None]:
# data type per column
df.info()

In [None]:
# It's often helpful to have a count of unique values in each column:
def nunicol(df):
    # DataFrame.nunique() counts distinct non-null values per column;
    # to_frame().T lays the counts out as a single row under the column names.
    return df.nunique().to_frame().T

nunicol(df)

In [None]:
## Make crosstab table for initial overview. Also exposes misspelled feature levels.
ct = pd.crosstab([df.feature_1, df.feature_2, df.feature_3], df.target, normalize='index')
ct.sort_values(by=1, ascending=False)

# normalize by 'index' gives percentages per row
# normalize by 'all' gives overall percentages
# to access a column, use e.g.: ct.iloc[:,-2]
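
To make the `normalize` options concrete, here is a small sketch on invented data. The `plan` column deliberately contains a misspelled level ("Basic") to show how a crosstab exposes it; note that `sort_values(by=1)` only works when the target column is labelled with the integer 1.

```python
import pandas as pd

df = pd.DataFrame({
    "plan":   ["basic", "basic", "pro", "pro", "pro", "Basic"],  # stray "Basic"
    "target": [0, 1, 1, 1, 0, 0],
})

# Row percentages: each row sums to 1.
by_row = pd.crosstab(df["plan"], df["target"], normalize="index")

# Overall percentages: the whole table sums to 1.
overall = pd.crosstab(df["plan"], df["target"], normalize="all")

# Sort by the share of target == 1 (the column label is the integer 1).
print(by_row.sort_values(by=1, ascending=False))
print(overall)
```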

In [None]:
# standardizing spellings/typos using a dictionary
df.replace({'column_name' : { 'wrong_1' : 'correct_1', 'wrong_2': 'correct_2'}},
           inplace=True)

# display levels after replacing misspellings
df.column_name.unique()

In [None]:
# Pandas aggregation with only one function
# (select several columns with a list, i.e. double brackets):
df_by_color = df_nn.groupby(['color'])[['yc_g', 'yr_g', 'rc_g', 'red']].mean()

# to turn the group key in the index back into a regular column
df_by_color = df_by_color.reset_index()
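
When more than one function is needed, named aggregation keeps the output columns flat and readable. The sketch below reuses the column names from the cell above, but the values are made up for illustration.

```python
import pandas as pd

# Invented values; only the column names follow the example above.
df_nn = pd.DataFrame({
    "color": ["light", "light", "dark", "dark", "dark"],
    "red":   [0, 1, 0, 2, 1],
    "yc_g":  [0.1, 0.3, 0.2, 0.4, 0.0],
})

# Named aggregation: output_column=(input_column, function).
summary = (
    df_nn.groupby("color")
         .agg(red_mean=("red", "mean"),
              red_total=("red", "sum"),
              yc_g_max=("yc_g", "max"),
              n_rows=("red", "size"))
         .reset_index()
)
print(summary)
```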

# Dealing with missing data

In [None]:
# count NaNs in dataframe by column
df.isnull().sum(axis=0)

In [None]:
# removing rows or columns with NaNs
df.dropna(axis=0, inplace=True) # axis=0 for rows, axis=1 for columns
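
Dropping every row with any NaN can throw away a lot of data. A gentler approach, sketched here on a toy frame with invented columns and an arbitrary 50% threshold, is to drop mostly-empty columns first and then only the rows missing a required field.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": ["x", "y", None, "z"],
})

# Share of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)

# Drop columns that are more than 50% missing ...
keep_cols = missing_share[missing_share <= 0.5].index
trimmed = df[keep_cols]

# ... then drop only the rows that are missing a required column.
cleaned = trimmed.dropna(subset=["a"])
print(missing_share, cleaned, sep="\n\n")
```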

In [None]:
# Mean imputation (assign the result back; a chained `inplace=True` may not modify df)
df['column'] = df['column'].fillna(df['column'].mean())

In [None]:
# Imputation using sklearn
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean') # or 'median', 'most_frequent',
# 'constant'
# filling the dataframe
df_imped = imp.fit_transform(df)

# when dealing with separate train/test sets, fit on the training set only,
# then transform both sets with the fitted imputer:
imp.fit(X_train)
X_train = imp.transform(X_train)
X_test = imp.transform(X_test)
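
A small worked example of the train/test pattern, using made-up arrays. The median strategy and `add_indicator=True` (which appends 0/1 columns marking where values were missing) are choices made for this sketch, not part of the original notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy arrays, invented for illustration.
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
X_test = np.array([[np.nan, np.nan], [5.0, 50.0]])

# Statistics are learned from the training set only.
imp = SimpleImputer(strategy="median", add_indicator=True)
X_train_imp = imp.fit_transform(X_train)

# The test set is filled with the *training* medians, avoiding leakage.
X_test_imp = imp.transform(X_test)

print(imp.statistics_)
print(X_test_imp)
```

Keeping the missingness indicators lets a later model use the fact that a value was absent, which plain mean or median imputation hides.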

# Histograms, PDFs, etc.

In [None]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Nov 26 10:23:45 2018

@author: whitneyreiner
"""

# =============================================================================
# Histograms
# From: https://realpython.com/python-histograms/
# histograms: tool for quickly assessing a probability distribution. 
# https://github.com/realpython/materials/blob/master/histograms/histograms.py
# =============================================================================
#%%
# =============================================================================
# Need not be sorted, necessarily
a = (0, 1, 1, 1, 2, 3, 7, 7, 23)

# make a dictionary:
def count_elements(seq) -> dict:
    """Tally elements from `seq`."""
    hist = {}
    for i in seq:
        # For each element of the sequence, increment its count in hist by 1.
        hist[i] = hist.get(i, 0) + 1
    return hist

counted = count_elements(a)
counted

In [None]:
#%%
# =============================================================================
# In fact, this is precisely what is done by the collections.Counter class from 
# Python’s standard library, which subclasses a Python dictionary and overrides 
# its .update() method:
# =============================================================================
#%%
# Can also use counter
from collections import Counter

recounted = Counter(a)
recounted
#test the two are equal
recounted.items() == counted.items()

        
import random
random.seed(1)
  

In [None]:
# =============================================================================
# Thus far, you have been working with what could best be called 
# “frequency tables.” But mathematically, a histogram is a mapping of bins 
# (intervals) to frequencies. More technically, it can be used to approximate the
#  probability density function (PDF) of the underlying variable.
# =============================================================================
#%%
# =============================================================================
# Technical Detail: All but the last (rightmost) bin is half-open. That is, all 
# bins but the last are [inclusive, exclusive), and the final bin is
#     [inclusive, inclusive].
# =============================================================================
#%%
# =============================================================================
# Staying in Python’s scientific stack, Pandas’ Series.plot.hist() uses 
# matplotlib.pyplot.hist() to draw a Matplotlib histogram of the input Series:
# =============================================================================
import pandas as pd
import numpy as np
# Generate data on commute times.
size, scale = 1000, 10
commutes = pd.Series(np.random.gamma(scale, size=size) ** 1.5)

commutes.plot.hist(grid=True, bins=20, rwidth=0.9,
                   color='#607c8e')
plt.title('Commute Times for 1,000 Commuters')
plt.xlabel('Commute Time')
plt.ylabel('Counts')
plt.grid(axis='y', alpha=0.75)
#%%
# =============================================================================
# pandas.DataFrame.hist() is similar but produces a histogram for each 
# column of data in the DataFrame.
# =============================================================================
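
The half-open bin convention described in this cell can be checked directly with `np.histogram`. In this small sketch the value 4 sits exactly on the last edge and is still counted, because only the final bin is closed on the right.

```python
import numpy as np

values = np.array([1, 2, 3, 4])
edges_in = [1, 2, 3, 4]

# Bins are [1, 2), [2, 3) and [3, 4]; the last bin includes its right edge.
counts, edges = np.histogram(values, bins=edges_in)
print(counts)  # [1 1 2]

# With density=True the bar areas (height * width) sum to 1,
# which is what lets a histogram approximate a PDF.
density, _ = np.histogram(values, bins=edges_in, density=True)
print((density * np.diff(edges)).sum())
```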

In [None]:
#%%
# =============================================================================
# A kernel density estimation (KDE) is a way to estimate the probability density 
# function (PDF) of the random variable that “underlies” our sample. KDE is a
#  means of data smoothing.
# =============================================================================
#%%
# Sample from two different normal distributions
means = 10, 20
stdevs = 4, 2
dist = pd.DataFrame(np.random.normal(loc=means, scale=stdevs, size=(1000, 2)),
                    columns=['a', 'b'])
dist.agg(['min', 'max', 'mean', 'std']).round(decimals=2)

In [None]:
#Now, to plot each histogram on the same Matplotlib axes:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
dist.plot.kde(ax=ax, legend=False, title='Histogram: A vs. B')
dist.plot.hist(density=True, ax=ax)
ax.set_ylabel('Probability')
ax.grid(axis='y')
ax.set_facecolor('#d8dcd6')
#%%
# =============================================================================
# If you take a closer look at this function, you can see how well it approximates the 
# “true” PDF for a relatively small sample of 1000 data points. Below, you can first build
# the “analytical” distribution with scipy.stats.norm(). This is a class instance that 
# encapsulates the statistical standard normal distribution, its moments, and descriptive 
# functions. Its PDF is “exact” in the sense that it is defined precisely as 
# norm.pdf(x) = exp(-x**2/2) / sqrt(2*pi).
# 
# Building from there, you can take a random sample of 1000 datapoints from this 
# distribution, then attempt to back into an estimation of the PDF with 
# scipy.stats.gaussian_kde():
# 
# =============================================================================

In [None]:
from scipy import stats

# An object representing the "frozen" analytical distribution
# Defaults to the standard normal distribution, N(0, 1)
dist = stats.norm()

# Draw random samples from the population you built above.
# This is just a sample, so the mean and std. deviation should
# be close to (0, 1).
samp = dist.rvs(size=1000)

# `ppf()`: percent point function (inverse of cdf — percentiles).
x = np.linspace(start=stats.norm.ppf(0.01),
                stop=stats.norm.ppf(0.99), num=250)
gkde = stats.gaussian_kde(dataset=samp)

# `gkde.evaluate()` estimates the PDF itself.
fig, ax = plt.subplots()
ax.plot(x, dist.pdf(x), linestyle='solid', c='red', lw=3,
        alpha=0.8, label='Analytical (True) PDF')
ax.plot(x, gkde.evaluate(x), linestyle='dashed', c='black', lw=2,
        label='PDF Estimated via KDE')
ax.legend(loc='best', frameon=False)
ax.set_title('Analytical vs. Estimated PDF')
ax.set_ylabel('Probability')
ax.text(-2., 0.35, r'$f(x) = \frac{\exp(-x^2/2)}{\sqrt{2*\pi}}$',
        fontsize=12)
#%%
# =============================================================================
# This is a bigger chunk of code, so let’s take a second to touch on a few key lines:
# 
# SciPy’s stats subpackage lets you create Python objects that represent analytical 
# distributions that you can sample from to create actual data. So dist = stats.norm() 
# represents a normal continuous random variable, and you generate random numbers from it 
# with dist.rvs(). To evaluate both the analytical PDF and the Gaussian KDE, you need an 
# array x of quantiles (standard deviations above/below the mean, for a normal 
# distribution). stats.gaussian_kde() represents an estimated PDF that you need to 
# evaluate on an array to produce something visually meaningful in this case.
# The last line contains some LaTeX, which integrates nicely with Matplotlib.
# =============================================================================

In [None]:
# =============================================================================
# A Fancy Alternative with Seaborn
# Let’s bring one more Python package into the mix. Seaborn's histplot() function can
# plot the histogram and KDE for a univariate distribution in one step (it replaces
# distplot(), which is deprecated since seaborn 0.11). Using a NumPy array d:
# 
# =============================================================================
import seaborn as sns

import numpy as np
# `numpy.random` uses its own PRNG.
np.random.seed(444)
np.set_printoptions(precision=3)

d = np.random.laplace(loc=15, scale=3, size=500)
d[:5]
sns.set_style('darkgrid')
sns.histplot(d, kde=True, stat='density')


In [None]:
# =============================================================================
# The call above produces a KDE. There is also optionality to fit a specific distribution 
# to the data. This is different than a KDE and consists of parameter estimation for 
# generic data and a specified distribution name:
# =============================================================================
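# distplot() is deprecated since seaborn 0.11; it is kept here because
# histplot() has no `fit=` argument.
sns.distplot(d, fit=stats.laplace, kde=False)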
# =============================================================================
# Again, note the slight difference. In the first case, you’re estimating some unknown 
# PDF; in the second, you’re taking a known distribution and finding what parameters best 
# describe it given the empirical data.
# =============================================================================
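
The difference between the two approaches can also be seen without any plotting. The sketch below, which assumes Laplace-distributed data as in the cell above (but uses NumPy's newer `default_rng` generator, so the sample differs), fits the distribution's parameters with SciPy and builds a KDE on the same sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(444)
d = rng.laplace(loc=15, scale=3, size=500)

# Parametric fit: maximum-likelihood estimates of the Laplace parameters.
loc_hat, scale_hat = stats.laplace.fit(d)

# Non-parametric estimate: a Gaussian KDE assumes no particular family.
kde = stats.gaussian_kde(d)

# Evaluate both estimated PDFs on a common grid for comparison or plotting.
x = np.linspace(d.min(), d.max(), 200)
fitted_pdf = stats.laplace.pdf(x, loc=loc_hat, scale=scale_hat)
kde_pdf = kde.evaluate(x)

print(loc_hat, scale_hat)
```

The fitted parameters should land near the values used to generate the sample (15 and 3), while the KDE smooths out the sharp Laplace peak.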

In [None]:
# =============================================================================
# In addition to its plotting tools, Pandas also offers a convenient .value_counts() 
# method that computes a histogram of non-null values to a Pandas Series:
# 
# =============================================================================
import pandas as pd

data = np.random.choice(np.arange(10), size=10000, p=np.linspace(1, 11, 10) / 60)
s = pd.Series(data)

s.value_counts()

In [None]:
s.value_counts(normalize=True).head()

In [None]:
# =============================================================================
# Elsewhere, pandas.cut() is a convenient way to bin values into arbitrary intervals.
# Let’s say you have some data on ages of individuals and want to bucket them sensibly:
# =============================================================================
ages = pd.Series([1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
bins = (0, 10, 13, 18, 21, np.inf)  # The edges
labels = ('child', 'preteen', 'teen', 'military_age', 'adult')
groups = pd.cut(ages, bins=bins, labels=labels)

groups.value_counts()

In [None]:
pd.concat((ages, groups), axis=1).rename(columns={0: 'age', 1: 'group'})
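
One detail worth knowing about `pd.cut`: by default the bins are closed on the right, so an age of exactly 18 lands in `teen`, and an age of exactly 0 would fall outside the first bin (it becomes NaN unless `include_lowest=True`). The sketch below compares the default with `right=False` on the boundary ages only.

```python
import numpy as np
import pandas as pd

boundary_ages = pd.Series([10, 13, 18, 21])
bins = (0, 10, 13, 18, 21, np.inf)
labels = ('child', 'preteen', 'teen', 'military_age', 'adult')

# Default: intervals are (a, b], so each boundary belongs to the lower group.
right_closed = pd.cut(boundary_ages, bins=bins, labels=labels)

# right=False: intervals are [a, b), so each boundary belongs to the upper group.
left_closed = pd.cut(boundary_ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'age': boundary_ages,
                    'right_closed': right_closed,
                    'left_closed': left_closed}))
```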

# Plotting raw data

In [None]:
# Univariate distribution plot (histogram; set kde=True for a density curve,
# and add sns.rugplot() for a rug plot)
sns.histplot(df.column, kde=False)

In [None]:
# Function to plot lines for  distributions
def plot_with_fill(x, y, label):
    lines = plt.plot(x, y, label=label, lw=2)
    plt.fill_between(x, 0, y, alpha=0.2, color=lines[0].get_c())
    plt.legend(loc='best')

In [None]:
# Correlation matrix
feature_names = list(df.columns[1:10])
label_name = list(df.columns[10:])

features = df[feature_names]

plt.figure(figsize=(10,10))
sns.heatmap(features.corr(), annot=True, square=True, cmap='coolwarm')
plt.show()
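
With many features a heatmap gets hard to read, and a ranked list of the strongest pairs can be a useful complement. A sketch on synthetic data (the three columns are invented so that one pair is strongly correlated by construction):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
features = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.1, size=200),
    "unrelated": rng.normal(size=200),
})

corr = features.corr()

# Keep only the upper triangle (k=1 excludes the diagonal), then flatten
# to one row per pair and sort by absolute correlation.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```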

In [None]:
# Seaborn Catplots
import seaborn as sns

#bar charts
sns.catplot(x="col", y="other_col", kind="bar", data=df.loc[df['col']!='what you do not want'])

In [None]:
#bigger one
sns.catplot(x="col", y="other_col", kind="bar", data=df.loc[df['col']!='what you do not want'], height=7, aspect=1)

In [None]:
#boxplot with catplot
sns.catplot(x="col", y="other_col", kind="box", data=df.loc[df['col']!='what you do not want'], height=8, aspect=1)

In [None]:
# with hue
sns.catplot(x="col", y="other_col", kind="bar", hue='column_to_color_by',
            data=df.loc[df['col'] != 'what you do not want'], height=8, aspect=1)

In [None]:
#scatterplot (scatterplot() takes no `kind` argument; filter rows with ==)
sns.scatterplot(x="col", y="other_col", data=df.loc[df['col'] == 'what you want'])

In [None]:
# facet wrapping/grid
#Look at natalie morse's DC 3 for specifics and for more examples
#salary groups and retention rate
sns.set(font_scale=2)
# x is the column to count; col_name defines the facets
sns.catplot(x="col", col="col_name", col_wrap=4,
            data=og_df, kind="count")

In [None]:
# histogram
# histplot() replaces distplot(), which is deprecated since seaborn 0.11
sns.histplot(df[df['quitters'] == 1]['salary'], kde=True, stat='density', label="Quit")
sns.histplot(df[df['quitters'] == 0]['salary'], kde=True, stat='density', color='C1', label="No Quit")
plt.legend()
plt.show()
# from Scott's DC3 submission

In [None]:
# one panel per department:
# Seniority
depts = df['dept'].unique()  # get unique dept
fig, axs = plt.subplots(len(depts), 1, figsize=(6, 20), squeeze=False)
for ax, dept in zip(axs[:, 0], depts):
    df_tmp = df[df['dept'] == dept]
    sns.histplot(df_tmp[df_tmp['quitters'] == 1]['seniority'], kde=True, stat='density',
                 ax=ax, label="quit")
    sns.histplot(df_tmp[df_tmp['quitters'] == 0]['seniority'], kde=True, stat='density',
                 ax=ax, color='C1', label="No quit")
    ax.set_xlabel(dept)
    ax.legend()  # one legend per panel, not only on the last one

plt.tight_layout()
plt.show()
# from Scott's DC3 submission

In [None]:
# nice faceted bar charts - natalia's DC3 submission

import seaborn as sns
g = sns.catplot(x="dept", y="current", col="company_id", data=ec, kind="bar", height=2.5, aspect=.8, col_wrap=6)
g.set_xticklabels(rotation=30, ha='right')

#I think this is only showing proportion of current employees in each department at each company

Back to my stuff (WBR)
# Graphs and Plots

In [None]:
import scipy.stats as scs

'''To get the PDF for a beta distribution.

A PDF is a function whose value at any given sample (or point) in the sample space
(the set of possible values taken by the random variable) can be interpreted as providing
a relative likelihood that the value of the random variable would equal that sample.
'''

def get_pdf(x, site):
    '''
    Parameters
    -----------
    x : Array of x values
    site : Array corresponding to the site in question

    Returns
    --------
    numpy array
    '''
    alpha = sum(site)
    beta = len(site) - alpha
    return scs.beta(a=alpha, b=beta).pdf(x)

"""Start by looking only at conversion rate for old price.
We assume a uniform prior, i.e., probability of 0 or 1 equally likely.
Specifically, we use a beta distribution with alpha=1 and beta=1"""
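
Note that `get_pdf` above uses the raw counts as the Beta parameters. If the uniform Beta(1, 1) prior mentioned in the note is meant to be included, the usual conjugate update adds the prior parameters to the counts. A sketch of that variant, with made-up 0/1 conversion data and variable names chosen for this example:

```python
import numpy as np
import scipy.stats as scs

# Hypothetical conversion data: 1 = converted, 0 = did not convert.
site = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

prior_alpha, prior_beta = 1, 1          # uniform prior, Beta(1, 1)
conversions = site.sum()
misses = len(site) - conversions

# Posterior for the conversion rate: Beta(prior_alpha + successes, prior_beta + failures).
posterior = scs.beta(a=prior_alpha + conversions, b=prior_beta + misses)

x = np.linspace(0, 1, 101)
pdf = posterior.pdf(x)                  # values to pass to a plotting helper
low, high = posterior.interval(0.95)    # central 95% credible interval

print(posterior.mean(), (low, high))
```

A side benefit of including the prior is that the parameters stay strictly positive even when a site has zero conversions, where the raw-count version would be undefined.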

In [None]:
#make a bunch of different plots at once

features = []  # column names go in here

plt.figure(figsize=(10, 15))
plt.subplots_adjust(hspace=1.0)

for i, j in enumerate(features):
    plt.subplot(4, 2, i + 1)  # 4 x 2 grid, so at most 8 features
    sns.countplot(x=j, data=df)
    plt.xticks(rotation=90)
    plt.title(j)

In [None]:
# Violin plot
sns.violinplot(x='color', y='red',  data=df_by_playerColor, inner='point')

In [None]:
# Graphing (counts) after a group-by
data6 = data2.groupby('skin_tone')['total_reds'].agg(['sum','count'])
data6['percent'] = data6['sum'] / data6['count']
data6.reset_index(inplace=True) # turns the group key into a column so seaborn can plot the groupby result
plt.figure(figsize=(17,3))
plt.title('Frequency Distribution of Red Cards', fontsize = 14)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Skin Tone', fontsize=12)
sns.barplot(x = "skin_tone", y = "sum", data = data6)

In [None]:
# Graphing after a group-by with percentage
data6 = data2.groupby('skin_tone')['total_reds'].agg(['sum','count'])
data6['percent'] = data6['sum'] / data6['count']
data6.reset_index(inplace=True) # turns the group key into a column so seaborn can plot the groupby result
plt.figure(figsize=(17,3))
plt.title('Percentage Distribution of Red Cards', fontsize = 14)
plt.ylabel('Red Cards per Observation', fontsize=12)
plt.xlabel('Skin Tone', fontsize=12)
sns.barplot(x = "skin_tone", y = "percent", data = data6)

In [None]:
# Pair-plots
# only the last sns.set() call takes effect, so a single call is enough
sns.set(style='whitegrid', color_codes=True)

sns.pairplot(player_data)
plt.show()

In [None]:
# Dist plots (histplot() replaces the deprecated distplot())
sns.histplot(player_data.iloc[:, 11], kde=True, stat='density')
plt.show()

In [None]:
# Scatter plots
plt.scatter(player_data.iloc[:, 11], player_data.iloc[:, 8], marker='s', label='yellows')
plt.show()
plt.scatter(player_data.iloc[:, 11], player_data.iloc[:, 9], marker='o', label='yellowreds')
plt.show()
plt.scatter(player_data.iloc[:, 11], player_data.iloc[:, 10], marker='d', label='reds')
plt.show()

In [None]:
# Jointplot
sns.jointplot(x = 'skinTone', y = 'RCRate', data = df)

In [None]:
# Histogram (frequency chart)
plt.hist(df['avg_rate'])
plt.xlabel("avg skin tone rating")
plt.ylabel("Frequency")