# EDA General Framework Notebook

In [2]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats as st
from sklearn import ensemble, tree, linear_model
#import missingno as msno

### Medium general article
https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce

1. Import libraries
2. Load df + head/tail
3. Check datatypes
4. Drop irrelevant cols
5. Rename cols
6. Drop duplicate rows
7. Drop (or fill) missing/null values
8. Detect outliers w/ visualization
9. Plot features (hist, scatter, heatmap, ...)

-> Full visualization tutorial: https://towardsdatascience.com/a-step-by-step-guide-for-creating-advanced-python-data-visualizations-with-seaborn-matplotlib-1579d6a1a7d0
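
Steps 4 to 7 of the list above are not covered by the cells further down, so here is a minimal sketch of them on a toy frame. The column names (`Id`, `Sale Price`, `Notes`) and all values are made up for illustration, and filling with the median is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a real dataset (all values are made up).
df = pd.DataFrame({
    "Id": [1, 2, 2, 3, 4],
    "Sale Price": [200.0, 150.0, 150.0, np.nan, 310.0],
    "Notes": ["a", "b", "b", "c", "d"],
})

# Step 4: drop columns that carry no information for the analysis.
df = df.drop(columns=["Notes"])

# Step 5: rename columns to short, consistent names.
df = df.rename(columns={"Id": "id", "Sale Price": "sale_price"})

# Step 6: drop exact duplicate rows.
df = df.drop_duplicates()

# Step 7: count missing values, then either drop or fill them.
missing_per_column = df.isna().sum()
df_dropped = df.dropna()
df_filled = df.fillna({"sale_price": df["sale_price"].median()})
```

Whether to drop or fill depends on how many rows are affected and on whether the missingness itself is informative, so it is worth looking at `missing_per_column` before deciding.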

--------------------------------------------
### Kaggle House Prices EDA
https://www.kaggle.com/pavansanagapati/a-simple-tutorial-on-exploratory-data-analysis

Nice conceptual intro, plus an overview of the types and steps of EDA

### Import data
Import data and extract basic stats, plus required preprocessing

In [None]:
# IMPORT DATA

path_train_data = ''
path_test_data = ''

df_train = pd.read_csv(path_train_data)
df_test = pd.read_csv(path_test_data)

# BASIC STATS 

df_train.shape
df_test.shape

df_train.head()
df_train.tail()

df_test.head()
df_test.tail()

In [None]:
# Get target

y_name = ''
y = df_train[y_name]

In [None]:
# CHECK DATA TYPES

df_train.dtypes
df_test.dtypes

If the data type of a variable, especially a date, does not match what the variable actually represents, it is better to convert it right away.

Depending on the method, the conversion may raise an error if the column contains NaNs (for example, `.astype(int)` cannot represent NaN).

For a more complete guide: https://stackoverflow.com/questions/15891038/change-data-type-of-columns-in-pandas
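
The NaN caveat can be seen on a small example. This is a sketch on made-up data (the `rooms` and `area` columns are hypothetical); it shows two ways around the error that pandas offers, the nullable `Int64` dtype and `pd.to_numeric` with `errors="coerce"`.

```python
import numpy as np
import pandas as pd

# Toy data (made up): whole numbers stored as float, numbers stored as text.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0],
    "area": ["120", "n/a", "95"],
})

# .astype(int) cannot represent NaN and raises an error.
try:
    df["rooms"].astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# The nullable "Int64" dtype keeps missing values as <NA>.
rooms_int = df["rooms"].astype("Int64")

# errors="coerce" turns unparseable entries into NaN instead of raising.
area_num = pd.to_numeric(df["area"], errors="coerce")
```

Coercing silently hides bad entries, so comparing `area_num.isna().sum()` with the count of missing values before the conversion helps to see how many values were lost.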

In [None]:
# Date
date_cols = []
date_format = ''

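# pd.to_datetime converts one column at a time; given several columns at once,
# it tries to assemble a single date from year/month/day columns instead
for col in date_cols:
    df_train[col] = pd.to_datetime(df_train[col], format=date_format)
    df_test[col] = pd.to_datetime(df_test[col], format=date_format)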

# Otherwise use .astype and the desired datatype 
# (from standard library, numpy, pandas, ...)
int_cols = []

df_train[int_cols] = df_train[int_cols].astype(int)
df_test[int_cols] = df_test[int_cols].astype(int)

In [None]:
# PANDAS DESCRIPTIVE STATS (Both train and test?)

df_train.describe()
df_test.describe()

# What else?

In [None]:
# SPLIT NUMERIC, CATEGORICAL AND DATES (Only train because need label)
# TODO: what about EDA within test data, ex. covariance matrix?

numeric_features = df_train.select_dtypes(include=np.number)

# np.object was removed in NumPy 1.24; the string 'object' selects the same columns
categorical_features = df_train.select_dtypes(include=['object'])

datetime_features = df_train.select_dtypes(include=['datetime64'])

In [None]:
numeric_features.head()
categorical_features.head()
datetime_features.head()

Ideas for EDA with dates?

1. If there are several date columns, it might be more meaningful to use the difference between them (or other manipulations) to explain the target variable
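
A minimal sketch of the date-difference idea, assuming two hypothetical date columns `built` and `sold` with made-up values:

```python
import pandas as pd

# Toy data (made up): two date columns stored as text.
df = pd.DataFrame({
    "built": ["2001-05-01", "1995-03-15"],
    "sold": ["2010-05-01", "2008-03-15"],
})
for col in ["built", "sold"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")

# Elapsed time between the two dates, expressed in days.
df["days_built_to_sold"] = (df["sold"] - df["built"]).dt.days

# Calendar components can also be extracted as separate features.
df["sold_year"] = df["sold"].dt.year
df["sold_month"] = df["sold"].dt.month
```

The resulting columns are plain integers, so they can be explored with the same tools as the other numeric features.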

### Numeric data: discrete vs continuous

To separate discrete from continuous variables, we can use an arbitrary threshold (e.g. 25) on the number of unique values of each variable. This rule is a heuristic: it avoids separating the variables manually based on their nature and tends to work when the dataset is large enough, but the result is worth a quick check, since a continuous variable with few distinct values would be labeled discrete.

In [None]:
threshold = 25

# Iterating over a DataFrame yields its column names
discrete_features = [feature for feature in numeric_features if len(df_train[feature].unique()) < threshold]
continuous_features = [feature for feature in numeric_features if feature not in discrete_features]

In [None]:
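# The feature lists hold column names, so index df_train to preview the columns
df_train[discrete_features].head()
df_train[continuous_features].head()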

### Discrete features (regression)

In [None]:
# Group by the discrete feature and apply aggregation function on the target
# In this case the function is the median, but can be also average, min, max, etc...

for feature in discrete_features:
    # groupby does not modify df_train, so no copy is needed
    df_train.groupby(feature)[y_name].median().plot.bar()
    plt.xlabel(feature)
    plt.ylabel(y_name)
    plt.title(feature)
    plt.show()

### Discrete features (classification)

TODO
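
A possible starting point, assuming the target is categorical: instead of aggregating the target with the median, look at the share of each target class per value of the discrete feature. The sketch below uses `pd.crosstab` on made-up data; the column names `n_rooms` and `sold` are hypothetical.

```python
import pandas as pd

# Toy data (made up): one discrete feature and a categorical target.
df = pd.DataFrame({
    "n_rooms": [1, 1, 2, 2, 2, 3],
    "sold": ["no", "yes", "yes", "yes", "no", "yes"],
})

# Rows: values of the discrete feature; columns: target classes.
# normalize="index" makes each row sum to 1.
class_share = pd.crosstab(df["n_rooms"], df["sold"], normalize="index")
```

Calling `class_share.plot.bar(stacked=True)` would give a bar chart comparable to the regression plots above, with one bar per feature value split by class.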