# Data analysis

- Inspect columns
  - `df.info()` / `df.describe()` (five-number summary)
  - `df.nunique()` / `df[col].unique()`
- Histograms `df.hist(figsize=(18, 10))` -> skewness / kurtosis / outliers
- Box plots
- Scatter plots `sns.pairplot(data)` / `pd.plotting.scatter_matrix(data_pred, figsize=(9, 9))`
- Data types `df.dtypes`
- Nulls `df.isna()` / `df.isna().sum()`, drop columns
- Multicollinearity
- Sub dataframe
- Scaling and normalization
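
The first inspection steps in the checklist above can be sketched on a toy frame. The column names and values here are hypothetical and only serve to make the calls runnable; this is a minimal sketch, not a full inspection routine.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "price": [10.0, 12.5, np.nan, 11.0, 95.0],
    "units": [3, 4, 4, 5, 3],
    "region": ["north", "south", "south", None, "north"],
})

dtypes = df.dtypes                    # data type of each column
null_counts = df.isna().sum()         # missing values per column
unique_counts = df.nunique()          # distinct non-null values per column
summary = df.describe()               # count, mean, std, min, quartiles, max (numeric columns)
skewness = df["price"].skew()         # > 0 means a longer right tail
excess_kurtosis = df["price"].kurt()  # > 0 means heavier tails than a normal distribution

print(null_counts)
print(summary.loc[["min", "50%", "max"]])
```

A large gap between the median and the maximum in `describe()`, together with a strongly positive skew, is a quick hint that a column may contain outliers worth checking in a histogram or box plot.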

# 1. Data scrubbing

## Nulls

- Binning

`df["binned_markdown_"] = pd.cut(df.Column, 5, labels=['10%', '20%', '30%', '40%', '50%'])`
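
`pd.cut` expects exactly one label per bin. A minimal sketch with explicit bin edges follows; the `markdown` column, the edges, and the labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical markdown percentages; values are illustrative only.
df = pd.DataFrame({"markdown": [2, 7, 12, 18, 25, 33, 41, 49]})

# Explicit edges give intervals (0, 10], (10, 20], ..., (40, 50].
edges = [0, 10, 20, 30, 40, 50]
labels = ["0-10%", "10-20%", "20-30%", "30-40%", "40-50%"]  # one label per bin

df["binned_markdown"] = pd.cut(df["markdown"], bins=edges, labels=labels)
print(df["binned_markdown"].value_counts(sort=False))
```

Passing explicit edges instead of a bin count keeps the labels meaningful even when the range of the data changes.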

- Replacing Nulls

`df["Column"] = df["Column"].replace(np.nan, "NaN")`

- Dropping columns

`to_drop = ['col1', 'col2']
df.drop(to_drop, axis=1, inplace=True)`
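
The null-handling steps above can be combined into one pass. This is a sketch under assumptions: the 50% drop threshold, the median fill, and the `"Unknown"` placeholder are illustrative choices, not rules from these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "city": ["Oslo", None, "Rome", "Oslo"],
    "notes": [None, None, None, "ok"],   # 75% missing
})

# Drop columns where more than half of the values are missing (threshold is an assumption).
null_fraction = df.isna().mean()
to_drop = null_fraction[null_fraction > 0.5].index.tolist()
cleaned = df.drop(columns=to_drop)

# Fill remaining gaps: median for numeric columns, an explicit placeholder for text columns.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
cleaned["city"] = cleaned["city"].fillna("Unknown")

print(cleaned)
```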

## Multicollinearity

`sns.set(style="white")
corr = df.corr(numeric_only=True)
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy
mask[np.triu_indices_from(mask)] = True  # hide the redundant upper triangle
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .5})`

## Normalize

`df.Column = (df.Column - df.Column.mean()) / df.Column.std()`

## Categorical variables

- Binning
- Label encoding
- Dummy variables / one-hot encoding `one_hot_df = pd.get_dummies(df)`
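
The scaling and encoding steps above can be sketched together as follows. The column names and the category order are hypothetical; min-max scaling is shown next to standardization as a common alternative, not as something these notes prescribe.

```python
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "income": [30.0, 45.0, 60.0, 75.0],
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
})

# Standardization (z-score): mean 0, standard deviation 1.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Min-max scaling: rescale to the [0, 1] range.
income_range = df["income"].max() - df["income"].min()
df["income_minmax"] = (df["income"] - df["income"].min()) / income_range

# Label encoding: integer codes, suitable when the categories are ordered.
size_order = ["small", "medium", "large"]
df["size_code"] = pd.Categorical(df["size"], categories=size_order, ordered=True).codes

# One-hot encoding: one indicator column per category.
one_hot_df = pd.get_dummies(df, columns=["color"])

print(one_hot_df.head())
```

Label encoding implies an order between the codes, so one-hot encoding is usually the safer choice for unordered categories such as `color`.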

# 2. EDA

- KDE

`for column in ['Col1','Col2']:
    df[column].plot.hist(density=True)
    df[column].plot.kde(label = column)
    plt.legend()
    plt.show()`
    
- Joint plot

`for column in ['Col1','Col2']:
    sns.jointplot(x=column, y="TargetCol",
                  data=df, 
                  kind='reg', 
                  label=column,
                  joint_kws={'line_kws':{'color':'green'}})
    plt.legend()
    plt.show()`

# 3. Modelling

## Linear Regression

Model steps:
- scatter plot
- distributions of dependent and independent variables

Test:
- Linearity (scatter plots). Check for outliers
- Normality: **model residuals** should follow a normal distribution (histograms or Q-Q plots)
- Homoscedasticity (vs. heteroscedasticity): the variance of the residuals should stay constant across values of the independent variable (scatter plot)

`plt.scatter(df.height, df.weight)
df.plot.kde()`

`df[column].plot.hist(density=True, label = column + ' histogram')
df[column].plot.kde(label = column + ' kde')`

- Linearity

`fig, axs = plt.subplots(1, 3, sharey=True, figsize=(18, 6))
for idx, channel in enumerate(['TV', 'radio', 'newspaper']):
    df.plot(kind='scatter', x=channel, y='sales', ax=axs[idx], label=channel)
plt.legend()
plt.show()`

- OLS (Ordinary Least Squares regression)

`from statsmodels.formula.api import ols
f = 'weight~height'
model = ols(formula=f, data=df).fit()
model.summary()`
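
For a single predictor, the OLS slope and intercept have a closed form, which can help when reading the model summary. The sketch below uses NumPy only; the height and weight values are made up for illustration.

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) values, for illustration only.
height = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight = np.array([52.0, 58.0, 67.0, 74.0, 84.0])

# Closed-form least-squares estimates for weight = intercept + slope * height.
x_mean, y_mean = height.mean(), weight.mean()
slope = np.sum((height - x_mean) * (weight - y_mean)) / np.sum((height - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# R-squared: share of the variance in weight explained by the fitted line.
fitted = intercept + slope * height
ss_res = np.sum((weight - fitted) ** 2)
ss_tot = np.sum((weight - y_mean) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.4f}")
```

The slope is the covariance of the two variables divided by the variance of the predictor, and the intercept places the line through the point of means.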

**Note**: check the intercept in the summary, and remember that a fitted coefficient shows association, not causation.

Prediction:

`new_df = pd.DataFrame({'height': [df.height.min(), df.height.max()]})
model.predict(new_df)`

Error terms:

`import statsmodels.api as sm
fig = plt.figure(figsize=(15,8))
fig = sm.graphics.plot_regress_exog(model, "height", fig=fig)
plt.show()`
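
Alongside the residual plots, a rough numeric check for heteroscedasticity is to compare the residual variance in the lower and upper halves of the predictor. This is a heuristic sketch on synthetic data, not a formal test; the data-generating values are assumptions chosen so that the noise grows with `x`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data where the noise grows with x (heteroscedastic by construction).
x = np.linspace(1, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1 * x)

# Fit a straight line and compute residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Compare residual variance in the lower and upper halves of x.
lower = residuals[x <= np.median(x)]
upper = residuals[x > np.median(x)]
variance_ratio = upper.var(ddof=1) / lower.var(ddof=1)

print(f"variance ratio (upper / lower): {variance_ratio:.2f}")
```

A ratio far from 1 suggests the residual spread changes with the predictor, which matches a funnel shape in the residual scatter plot.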

Q-Q Plots:

`import scipy.stats as stats
residuals = model.resid
fig = sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True)
plt.show()`
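
What a Q-Q plot compares can be sketched without plotting: sorted, standardized residuals against theoretical normal quantiles. The residuals here are simulated as a stand-in for `model.resid`, and the plotting positions `(i - 0.5) / n` are one common convention, assumed for illustration.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=2.0, size=200)  # stand-in for model.resid

# Standardize and sort the residuals: these are the sample quantiles.
sample_q = np.sort((residuals - residuals.mean()) / residuals.std(ddof=1))

# Theoretical standard-normal quantiles at plotting positions (i - 0.5) / n.
n = len(sample_q)
probs = (np.arange(1, n + 1) - 0.5) / n
theory_q = np.array([NormalDist().inv_cdf(p) for p in probs])

# Correlation between the two sets of quantiles; close to 1 for normal data.
qq_corr = np.corrcoef(theory_q, sample_q)[0, 1]
print(f"Q-Q correlation: {qq_corr:.4f}")
```

Points that follow the 45-degree line in the plot correspond to a correlation near 1 here; heavy tails or skew pull the ends of the curve away from the line.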

Jarque-Bera test:

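The JB statistic measures how far the skewness and kurtosis of the residuals are from those of a normal distribution. Under normality it is approximately chi-squared with 2 degrees of freedom, so a value of roughly 6 or higher (the 5% critical value is 5.99) indicates that the errors are not normally distributed. A value close to 0 is consistent with normality.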
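
The statistic can be computed by hand from its definition, JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and K the sample kurtosis. The helper name `jarque_bera` and the simulated residuals below are illustrative assumptions; the value reported in a fitted model's summary should agree with this formula.

```python
import numpy as np

def jarque_bera(residuals):
    """Return the Jarque-Bera statistic n/6 * (S^2 + (K - 3)^2 / 4)."""
    r = np.asarray(residuals, dtype=float)
    n = r.size
    centered = r - r.mean()
    m2 = np.mean(centered ** 2)               # variance (population form)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurt = np.mean(centered ** 4) / m2 ** 2   # equals 3 for a normal distribution
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

rng = np.random.default_rng(42)
normal_resid = rng.normal(size=1000)
skewed_resid = rng.exponential(size=1000)

print(f"JB normal: {jarque_bera(normal_resid):.2f}")
print(f"JB skewed: {jarque_bera(skewed_resid):.2f}")
```

Because the statistic scales with n, even mild departures from normality produce large values on big samples, so it is worth reading it together with the Q-Q plot.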

## Multiple Regression

- Identify multicollinearity
  - Scatter matrix `pd.plotting.scatter_matrix(data_pred, figsize=(9, 9))`
  - Correlation matrix `data_pred.corr()` `abs(data_pred.corr()) > 0.75`
  - Seaborn heatmap `sns.heatmap(data_pred.corr(), center=0)`
- Remove problematic features `df = df.drop('col', axis=1)`
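
The boolean matrix from `abs(data_pred.corr()) > 0.75` is hard to read with many features. One possible way to list each highly correlated pair once is sketched below; the synthetic `data_pred` frame is an assumption for illustration, while the 0.75 threshold comes from the notes above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)

# Synthetic predictors: "b" is nearly a copy of "a", "c" is independent noise.
data_pred = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=100),
    "c": rng.normal(size=100),
})

corr = data_pred.corr().abs()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()                       # (feature_1, feature_2) -> |correlation|
high_pairs = pairs[pairs > 0.75]

print(high_pairs)
```

From each reported pair, one feature is a candidate for removal with `df.drop`; which one to keep depends on interpretability and on its correlation with the target.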