# Exploratory Data Analysis - Complete Checklist

## Manual Bespoke EDA Checklist

### 0. Questions to Ask Before You Download the Data

I call this step 0 because it happens before you import any data into Python.
It’s easy to skip, but answering these questions can save you a lot of time and frustration down the road.

- How was this data collected, and where did it come from?
- Why am I interested in this data?
- What would be the target variable of interest? (if applicable)
- Is this data from a reputable source?
- Is there enough data here to build an ML model?
- Have other people conducted a similar analysis or modeling project on this dataset? Do I want to learn from their conclusions or create a novel project?
- Is there a data dictionary for the dataset? Is it complete?
- Are there any additional challenges or problems that I anticipate if I use this data?

It’s helpful to use these questions as a filter when you have a choice of datasets.
It’s really tough to realize halfway through a project that you picked a bad dataset.

### 1. Data Structure & Distributions

Questions to answer:

- How many features do you have?
- How many observations do you have?
- What is the data type of each feature?
- From what you know about the features of your dataset, do the data types make sense? Do you need to change any?
  - Example: Your data has a Customer ID number for every row, and each number is five digits long, stored as an integer. You will never aggregate or analyze the Customer ID as a number, so you should change it to the “object” data type.
- Do you have null values? (to be fixed later)
- How much memory does this dataset use? Could this pose a problem for you later on?

#### How many features do you have?
#### How many observations do you have?
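
A minimal sketch of how these two counts can be read off a DataFrame. The toy `df` and its column names are assumptions made up for illustration; substitute your own data.

```python
import pandas as pd

# Hypothetical toy dataset standing in for your own DataFrame.
df = pd.DataFrame(
    {
        "customer_id": [10001, 10002, 10003, 10004],
        "age": [34, 51, 27, 45],
        "plan": ["basic", "premium", "basic", "basic"],
    }
)

# shape is a (rows, columns) tuple: observations first, features second.
n_observations, n_features = df.shape
print(f"{n_observations} observations, {n_features} features")
```
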

#### What is the data type of each feature?
#### From what you know about the features of your dataset, do the data types make sense? 
#### Do you need to change any data types?
#### Do you have null values? (to be fixed later)
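
One possible way to answer the data type and null value questions with pandas. The toy `df` is hypothetical, and the conversion of `customer_id` simply follows the Customer ID example from the checklist.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names are illustrative only.
df = pd.DataFrame(
    {
        "customer_id": [10001, 10002, 10003, 10004],
        "age": [34.0, np.nan, 27.0, 45.0],
        "plan": ["basic", "premium", None, "basic"],
    }
)

# Data type of each feature.
print(df.dtypes)

# customer_id is an identifier, not a quantity, so store it as text.
# Depending on the pandas version, the resulting dtype is reported
# as "object" or as a dedicated string dtype.
df["customer_id"] = df["customer_id"].astype(str)

# Number of null values per feature (to be handled later).
null_counts = df.isna().sum()
print(null_counts)
```
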

#### How much memory does this dataset use? 
#### Could this pose a problem for you later on?
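
A short sketch for checking memory use, assuming a hypothetical toy `df`. Whether the total is a problem depends on how much RAM you have available, so treat the number as a prompt for judgment rather than a pass/fail check.

```python
import pandas as pd

# Hypothetical toy dataset; substitute your own DataFrame.
df = pd.DataFrame(
    {
        "customer_id": range(10001, 11001),
        "plan": ["basic", "premium"] * 500,
    }
)

# deep=True also measures the contents of object (string) columns,
# not just the references to them.
memory_per_column = df.memory_usage(deep=True)
total_bytes = int(memory_per_column.sum())

print(memory_per_column)
print(f"Total: {total_bytes / 1024**2:.3f} MiB")
```
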

Questions to answer:

- Are the max/min values reasonable for the variables? Do you see any values that look like errors?
- What is the mean for each variable? What do the means tell you about your dataset as a whole?
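
The questions above can be approached with `describe()`. This is a sketch on made-up data: the column names and the cutoff of 120 for a plausible age are assumptions, not rules.

```python
import pandas as pd

# Hypothetical data with one deliberately implausible age.
df = pd.DataFrame(
    {
        "age": [34, 51, 27, 45, 230],
        "monthly_spend": [20.0, 55.5, 18.0, 42.0, 31.5],
    }
)

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
summary = df.describe()
print(summary.loc[["min", "mean", "max"]])

# A max far outside the plausible range hints at a data-entry error.
# The cutoff of 120 is an arbitrary choice for this example.
suspicious = df[df["age"] > 120]
print(suspicious)
```
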

#### Questions to answer:
#### What is the distribution of each variable?
#### Do there appear to be outliers? (to be fixed later)
#### Think about what the variables mean and what the histograms say about their values and their spread — are there any surprises?
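
A rough, text-only sketch of inspecting distributions, using simulated columns as stand-ins for real variables. It prints histogram bin counts and skewness instead of drawing plots.

```python
import numpy as np
import pandas as pd

# Simulated data: one roughly symmetric column and one right-skewed column.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "symmetric": rng.normal(loc=50, scale=5, size=500),
        "right_skewed": rng.exponential(scale=10, size=500),
    }
)

# Bin counts are the numbers a histogram would draw as bars.
for column in df.columns:
    counts, bin_edges = np.histogram(df[column], bins=10)
    print(column, counts.tolist())

# Skewness near 0 suggests symmetry; a large positive value suggests a long right tail.
skewness = df.skew()
print(skewness)
```

If matplotlib is installed, `df.hist()` draws the same information as one histogram per numeric column.
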

### 2. Null Values & Duplicates

Questions to answer:

- Is the null value a result of the way the data was recorded?
  - Example: Survey response data is recorded in columns as “yes,” “no,” and a null value for “prefer not to answer.” In this case, all nulls can be filled in with a single value like “no answer.”
- Can you drop the rows with null values without significantly affecting your analysis?
- Looking at the distributions of the variables, can you justify filling in the missing values with the mean or median for that variable?
  - Be careful! You have to deal with missing values somehow, but sometimes it is better to drop rows than to tinker with the original data: if you put bad data into a model, you cannot get meaningful results.
- If your data is time-series data, can you fill the missing values with interpolation?
- Are there so many missing values for a variable that you should drop that variable from your dataset?
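
The options above could look roughly like this in pandas. Everything here is a sketch on hypothetical data: the column names, the 60% cutoff for dropping a variable, and the choice of the median are assumptions you should revisit for your own dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical survey-style data; all column names are made up.
df = pd.DataFrame(
    {
        "owns_car": ["yes", None, "no", "yes", None],
        "income": [42000.0, 58000.0, np.nan, 61000.0, 39000.0],
        "daily_temp": [20.0, np.nan, 24.0, np.nan, 28.0],
        "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
    }
)

# Share of missing values per column.
null_share = df.isna().mean()

# Drop variables that are missing too often (0.6 is an arbitrary cutoff).
df = df.drop(columns=null_share[null_share > 0.6].index)

# Nulls that mean "prefer not to answer" get one explicit label.
df["owns_car"] = df["owns_car"].fillna("no answer")

# Fill a numeric variable with its median, if its distribution justifies that.
df["income"] = df["income"].fillna(df["income"].median())

# Ordered (time-series) values can be filled by linear interpolation.
df["daily_temp"] = df["daily_temp"].interpolate()

# Count rows that are exact duplicates of an earlier row.
n_duplicates = int(df.duplicated().sum())
print(df)
print("Duplicate rows:", n_duplicates)
```
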

### 3. Outliers

Questions to ask:

- Do you have outliers (represented as dark circles on the boxplots) in your variables?
- Why do you think you have outliers?
- Do the outliers represent real observations (i.e. not errors)?
- Should you exclude these observations? If not, should you winsorize the values?

This is a tricky question. I typically identify my outliers and then leave them alone until I have tried out some models.
If the models have low accuracy, I go back and re-evaluate whether I should winsorize the variable(s) with outliers (if I have no other options).
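
A minimal sketch of both steps, assuming the common boxplot convention that points more than 1.5 × IQR beyond the quartiles are drawn as outliers. The data, the 5th/95th percentile caps, and the variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical variable with one extreme value.
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95], name="order_size")

# Fences used by the usual boxplot rule: 1.5 * IQR beyond the quartiles.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Identify the outliers without changing the data.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)

# Winsorize: cap extreme values at chosen percentiles instead of dropping rows.
# The 5th and 95th percentiles are an arbitrary choice for this example.
winsorized = values.clip(lower=values.quantile(0.05), upper=values.quantile(0.95))
print(winsorized)
```

Keeping `values` untouched and storing the capped version separately makes it easy to compare model results with and without winsorizing.
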

### 4. Correlations/Relationships

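```python
# Pairwise correlations between numeric columns (non-numeric columns are skipped)
df.corr(numeric_only=True)
```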

Questions to ask:

- Which variables are most correlated with your target variable? (if applicable)
- Is there multicollinearity (for example, two features with an absolute correlation above 0.8)? How will this affect your model?
- Do you have variables that represent the same information? Can one be dropped?
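
One way to turn these questions into code, sketched on simulated data. The column names, the `price` target, and the 0.8 threshold are assumptions for illustration; `size_sqm` is deliberately a rescaled copy of `size_sqft` so that one redundant pair shows up.

```python
import numpy as np
import pandas as pd

# Simulated housing-style data; all names and coefficients are made up.
rng = np.random.default_rng(0)
size_sqft = rng.uniform(500, 3000, size=200)
df = pd.DataFrame(
    {
        "size_sqft": size_sqft,
        "size_sqm": size_sqft * 0.0929,  # same information in another unit
        "age_years": rng.uniform(0, 50, size=200),
    }
)
df["price"] = (
    100 * df["size_sqft"] - 500 * df["age_years"] + rng.normal(0, 10000, size=200)
)

corr = df.corr(numeric_only=True)

# Features ranked by absolute correlation with the target.
target_corr = corr["price"].drop("price").abs().sort_values(ascending=False)
print(target_corr)

# Feature pairs whose absolute correlation exceeds the threshold.
threshold = 0.8
features = corr.drop(index="price", columns="price")
high_pairs = [
    (a, b, float(features.loc[a, b]))
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if abs(features.loc[a, b]) > threshold
]
print(high_pairs)
```
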

### 5. Feature Engineering

#### Variable Transformation

The most common transformation is one-hot encoding, which turns categorical variables into numeric (specifically, binary) variables.
This is necessary because most machine learning models cannot handle “object” data types. Pandas makes this easy to do:

```python
import pandas as pd
new_df = pd.get_dummies(df, drop_first=True)
```
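
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical DataFrame. The `plan` column has three categories, so two indicator columns remain after the first category is dropped.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column.
df = pd.DataFrame(
    {
        "plan": ["basic", "premium", "trial", "basic"],
        "age": [34, 51, 27, 45],
    }
)

# Categorical columns are replaced by indicator columns; numeric columns pass through.
# drop_first=True removes the first category ("basic"), which is implied
# whenever both remaining indicators are 0/False.
encoded = pd.get_dummies(df, drop_first=True)
print(encoded.columns.tolist())  # ['age', 'plan_premium', 'plan_trial']
```
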

Another common transformation (which is necessary for some models) is standardizing variables.
Here is the code for that:

```python
from sklearn.preprocessing import StandardScaler

# Rescale each column of X to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)
```
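
For intuition, the arithmetic behind standardization can be sketched with NumPy alone. The matrix `X` is a made-up example, and this is a hand-rolled illustration of the same per-column formula rather than a replacement for `StandardScaler`.

```python
import numpy as np

# Hypothetical feature matrix: rows are observations, columns are features.
X = np.array(
    [
        [1.0, 200.0],
        [2.0, 300.0],
        [3.0, 400.0],
        [4.0, 500.0],
    ]
)

# Per column: subtract the mean, then divide by the population
# standard deviation (ddof=0, NumPy's default).
column_means = X.mean(axis=0)
column_stds = X.std(axis=0)
X_std = (X - column_means) / column_stds

print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```

After this step both columns are on the same scale, even though the raw values differed by two orders of magnitude.
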

Finally, depending on the model you are using, you may want to transform variables so that they more closely follow a normal distribution.
For this, you can try np.log(), np.sqrt(), the Box-Cox transformation, and other functions to transform your data to better fit a normal distribution.
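
A sketch comparing two of these transformations on simulated right-skewed data, using skewness as a rough indicator of how close each result is to symmetric. The `income` series is an assumption for illustration. Note that `np.log()` and Box-Cox both require strictly positive values.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed, strictly positive variable.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000), name="income")

# Candidate transformations.
log_income = np.log(income)
sqrt_income = np.sqrt(income)

# Skewness closer to 0 means a more symmetric distribution.
skew_before = income.skew()
skew_log = log_income.skew()
skew_sqrt = sqrt_income.skew()

print(f"raw:  {skew_before:.2f}")
print(f"log:  {skew_log:.2f}")
print(f"sqrt: {skew_sqrt:.2f}")
```

Which transformation works best depends on the variable, so it is worth comparing a few candidates rather than assuming one always fits.
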