# Data analysis

- Inspect columns
  - info / describe (5 number summary)
  - nunique / unique
- Histograms `df.hist(figsize=(18, 10))` -> skewness / kurtosis / outliers
- Box plots
- Scatterplot / `sns.pairplot(data)` `pd.plotting.scatter_matrix(data_pred,figsize  = [9, 9])`
- Data types `df.dtypes`
- Nulls `df.isna()` / `df.isna().sum()`, drop columns
- Multicollinearity
- Sub dataframe
- Scaling and normalization
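
The inspection steps above can be sketched on a small, made-up DataFrame. The column names and values here are hypothetical and only serve to show which call answers which question; swap in your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; replace with your own DataFrame.
df = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 175.0, 180.0],
    "weight": [50.0, 58.0, 65.0, np.nan, 82.0],
    "group": ["a", "b", "a", "b", "a"],
})

df.info()                      # column dtypes and non-null counts
summary = df.describe()        # count, mean, std, min, quartiles, max (numeric columns)
n_unique = df.nunique()        # number of distinct non-null values per column
null_counts = df.isna().sum()  # number of missing values per column

# Numeric companions to the histograms: skewness and excess kurtosis.
skewness = df.skew(numeric_only=True)
kurtosis = df.kurt(numeric_only=True)

print(df.dtypes)
print(null_counts)
```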

# 1. Data scrubbing

## Nulls

- Binning

`df["binned_markdown_"] = pd.cut(df.Column, 5, labels=['20%', '40%', '60%', '80%', '100%'])`

- Replacing Nulls

`df["Column"] = df["Column"].replace(np.nan, "NaN")`

- Dropping columns

`to_drop = ['col1', 'col2']
df.drop(to_drop, axis=1, inplace=True)`
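
As a minimal sketch, the three scrubbing steps above can be chained on a toy DataFrame. The column names, bin labels and the `"missing"` placeholder are assumptions made for illustration. Note that `pd.cut` with an integer bin count needs exactly one label per bin.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two columns.
df = pd.DataFrame({
    "markdown": [0.0, 2.5, 5.0, 7.5, 10.0, np.nan],
    "store_type": ["A", None, "B", "A", None, "C"],
    "notes": ["x", "y", "z", "x", "y", "z"],
})

# Five equal-width bins need exactly five labels; NaN stays NaN.
labels = ["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]
df["binned_markdown"] = pd.cut(df["markdown"], bins=5, labels=labels)

# Replace missing categories with an explicit placeholder string.
df["store_type"] = df["store_type"].fillna("missing")

# Drop columns that are not needed downstream.
df = df.drop(columns=["notes"])

print(df)
```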

## Multicollinearity

`sns.set(style="white")
corr = df.corr()
# Mask the upper triangle so each pair is shown only once
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .5})`

## Normalize

- `df.Column = (df.Column - df.Column.mean()) / df.Column.std()`
- `walmart_log["Weekly_Sales"]= np.log(walmart_log["Weekly_Sales"])`
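
A small sketch of the transformations above, plus min-max scaling for comparison. The values are made up; only the `Weekly_Sales` name comes from these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical, strictly positive values.
sales = pd.Series([100.0, 200.0, 400.0, 800.0, 1600.0], name="Weekly_Sales")

# Standardization (z-scores): mean 0, standard deviation 1.
standardized = (sales - sales.mean()) / sales.std()

# Min-max scaling: values mapped onto [0, 1].
min_max = (sales - sales.min()) / (sales.max() - sales.min())

# Log transform: compresses right-skewed data; requires values > 0.
logged = np.log(sales)

print(pd.DataFrame({"z": standardized, "min_max": min_max, "log": logged}))
```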

## Categorical variables

- Binning
- Label encoding
- Dummy variables / one-hot encoding `one_hot_df = pd.get_dummies(df)`
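
A minimal sketch of label encoding versus one-hot encoding, assuming a hypothetical `Type` column. `drop_first=True` drops one dummy per categorical variable, which avoids perfectly collinear dummy columns in a regression.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column.
df = pd.DataFrame({"Type": ["A", "B", "C", "A"], "Size": [10, 20, 30, 40]})

# Label encoding: each category becomes an integer code (alphabetical order here).
df["Type_code"] = df["Type"].astype("category").cat.codes

# One-hot encoding of the categorical column only; the first level ("A") is dropped.
one_hot_df = pd.get_dummies(df[["Type", "Size"]], columns=["Type"], drop_first=True)

print(df)
print(one_hot_df)
```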

# 2. EDA

- KDE

`for column in ['Col1','Col2']:
    df[column].plot.hist(density=True)
    df[column].plot.kde(label = column)
    plt.legend()
    plt.show()`
    
- Joint plot

`for column in ['Col1','Col2']:
    sns.jointplot(x=column, y="TargetCol",
                  data=df, 
                  kind='reg', 
                  label=column,
                  joint_kws={'line_kws':{'color':'green'}})
    plt.legend()
    plt.show()`

# 3. Modelling

## Linear Regression

Model steps:
- scatter plot
- distributions of dependent and independent variables

Test:
- Linearity (scatter plots). Check for outliers
- Normality: **model residuals** should follow a normal distribution (histograms or Q-Q plots)
- Homoscedasticity (as opposed to heteroscedasticity): the variability of the dependent variable should be roughly constant across values of the independent variable (scatter plot)

`plt.scatter(df.height, df.weight)
df.plot.kde()`

`df[column].plot.hist(density=True, label = column + ' histogram')
df[column].plot.kde(label = column + ' kde')`

- Linearity

`fig, axs = plt.subplots(1, 3, sharey=True, figsize=(18, 6))
for idx, channel in enumerate(['TV', 'radio', 'newspaper']):
    df.plot(kind='scatter', x=channel, y='sales', ax=axs[idx], label=channel)
plt.legend()
plt.show()`

- OLS (Ordinary Least Square regression)

`import statsmodels.formula.api as smf`

`f = 'weight~height'
model = smf.ols(formula=f, data=df).fit()
model.summary()`

`# Header row first, so results[idx+1] is the row for col_names[idx]
results = [['ind_var', 'r_squared', 'intercept', 'slope', 'p-value']]
for idx, val in enumerate(col_names):
    print("Walmart: Weekly_Sales~" + val)
    print("------------------------------")
    f = 'Weekly_Sales~' + val
    model = smf.ols(formula=f, data=walmart).fit()
    results.append([val, model.rsquared, model.params.iloc[0], model.params.iloc[1], model.pvalues.iloc[1]])
    print(results[idx+1])`
    
`walmart_final = walmart_log.drop(["Fuel_Price","Unemployment","IsHoliday", "Store_9", "Dept_99","Type_B"], axis=1)
walmart_final.columns[92:121]
walmart_final =walmart_final.drop(walmart_final.columns[92:121],axis=1)
walmart_final.describe()`

- Drop continuous variables whose single-predictor linear models had an R-squared value below 0.01 for the walmart_log models.
- Drop one column for each categorical variable we end up using, so the dummy columns are not perfectly collinear.

**Note Intercept**: association vs. causation

Prediction:

`new_df = pd.DataFrame({'TV': [df.TV.min(), df.TV.max()]})
model.predict(new_df)`

Error terms:

`import statsmodels.api as sm
fig = plt.figure(figsize=(15,8))
fig = sm.graphics.plot_regress_exog(model, "height", fig=fig)
plt.show()`

Q-Q Plots:

`import scipy.stats as stats
import statsmodels.api as sm
residuals = model.resid
fig = sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True)
fig.show()`

Jarque-Bera test:

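A JB value of roughly 6 or higher indicates that the errors are not normally distributed (the 5% critical value of a chi-squared distribution with 2 degrees of freedom is about 5.99). A value close to 0 is consistent with normally distributed errors.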
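
To make the statistic concrete, here is a minimal NumPy sketch of the usual textbook formula $JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$, with $S$ the sample skewness and $K$ the sample kurtosis. The helper name `jarque_bera` and the simulated samples are assumptions for illustration. Statsmodels reports the same statistic in `model.summary()`.

```python
import numpy as np

def jarque_bera(residuals):
    """Jarque-Bera statistic from biased sample moments (illustrative helper)."""
    x = np.asarray(residuals, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurt = np.mean(centered ** 4) / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

rng = np.random.default_rng(0)
jb_normal = jarque_bera(rng.normal(size=2000))       # simulated normal errors
jb_skewed = jarque_bera(rng.exponential(size=2000))  # strongly right-skewed errors

print("JB, normal sample:", jb_normal)
print("JB, skewed sample:", jb_skewed)
```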

## Multiple Regression

- Identify multicollinearity
  - Scatter matrix `pd.plotting.scatter_matrix(data_pred,figsize  = [9, 9])`
  - Correlation matrix `data_pred.corr()` `abs(data_pred.corr()) > 0.75`
  - Seaborn heatmap `sns.heatmap(data_pred.corr(), center=0)`
- Remove problematic features `df = df.drop('col', axis=1)`
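
The correlation check above can be turned into a list of offending pairs. This is a sketch on simulated predictors (`x1`, `x2`, `x3` are hypothetical); it reuses the 0.75 threshold from these notes and drops the second feature of each highly correlated pair, which is one simple policy among several.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
data_pred = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.1, size=200),  # nearly a rescaled copy of x1
    "x3": rng.normal(size=200),                       # independent of the others
})

corr = data_pred.corr().abs()

# Keep the strict upper triangle so each pair appears once and the diagonal is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.75]

# Drop the second member of every highly correlated pair.
to_drop = sorted({pair[1] for pair in high_pairs.index})
reduced = data_pred.drop(columns=to_drop)

print(high_pairs)
print("dropped:", to_drop)
```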

# 4. Interpretation

## Linear Regression

### Coef
"How does Y change for each additional unit of X'?" (where X' is the transformed version of X: log-transformed, min-max scaled, standardized, ...)

### $R^2$
- R-squared compares the fitted model against a deliberately naive baseline model. This baseline does not use any independent variables to predict the dependent variable Y; instead, it always predicts the mean of the observed responses. R-squared for a linear regression is defined in terms of the squared errors of the fitted model and of the baseline model. In the formula below, $SS_{RES}$ is the residual sum of squared errors of our model, also known as $SSE$: the sum of squared differences between the observed and predicted values. $SS_{TOT}$ is the total sum of squares: the sum of squared differences between the observed values and their mean.

$$ R^2 = 1-\dfrac{SS_{RES}}{SS_{TOT}}= 1-\dfrac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2}$$
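
A short numeric check of the formula, using made-up height and weight values and `np.polyfit` for the least-squares line. For a single predictor, $R^2$ equals the squared Pearson correlation, which gives an independent cross-check.

```python
import numpy as np

# Hypothetical observations.
height = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight = np.array([52.0, 58.0, 69.0, 77.0, 84.0])

# Least-squares line weight = intercept + slope * height.
slope, intercept = np.polyfit(height, weight, deg=1)
predicted = intercept + slope * height

ss_res = np.sum((weight - predicted) ** 2)      # errors of the fitted model
ss_tot = np.sum((weight - weight.mean()) ** 2)  # errors of the mean-only baseline
r_squared = 1 - ss_res / ss_tot

print("R^2:", r_squared)
print("squared correlation:", np.corrcoef(height, weight)[0, 1] ** 2)
```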

-  The problem with $R^2$ is that whichever predictor you **add** to your model, making it more complex, never decreases your $R^2$ value and usually increases it. That is, the model tends to overfit if $R^2$ is the only model-fitting criterion. This is why a train-test split is essential and why regularization techniques are used to refine more advanced regression models. Adjusted $R^2$ ($R^2_{adj}$) penalizes the number of predictors; read [this blogpost](https://www.statisticshowto.datasciencecentral.com/adjusted-r2/) on the difference between the two to get a better sense of why to use $R^2_{adj}$.
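
The effect can be seen on simulated data. The sketch below fits ordinary least squares with `np.linalg.lstsq` (used here in place of Statsmodels to keep it dependency-light) and applies the standard adjustment $R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-p-1}$, with $n$ observations and $p$ predictors. The helper name and the data are assumptions for illustration.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit with intercept (illustrative helper)."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=(n, 1))
y = 3 * x[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 1))  # predictor unrelated to y

r2_small, adj_small = r2_and_adjusted(x, y)
r2_big, adj_big = r2_and_adjusted(np.hstack([x, noise]), y)

print("one predictor:   R^2 =", r2_small, " adjusted =", adj_small)
print("plus pure noise: R^2 =", r2_big, " adjusted =", adj_big)
```

Plain $R^2$ cannot go down when the noise column is added, while the adjusted value only rises if the new predictor improves the fit enough to offset the penalty.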

### P-value
Just like with simple linear regression, the parameters or coefficients we're calculating have a p-value or *significance* attached to them. The interpretation of the p-value for each parameter is exactly the same as for simple linear regression:

> The p-value is the probability of observing an estimate at least as extreme as the one obtained if the true coefficient were actually zero.

In the Statsmodels output, the p-value can be found in the column named $P>|t|$. A popular threshold for the p-value is 0.05, where $p<0.05$ denotes that a parameter is significant and $p>0.05$ means that the parameter isn't significant.

The two columns to the right of the p-value column give the bounds of the 95% confidence interval. Loosely, after having run the model, we are 95% confident that the parameter value lies within this interval (more precisely, intervals constructed this way would contain the true value in 95% of repeated samples). When you choose a p-value cut-off of 0.05, there is a useful relationship between the 95% confidence interval and the p-value: if the 95% confidence interval does not include 0, the p-value will be smaller than 0.05, and the parameter estimate will be significant.
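
The link between the confidence interval and the p-value can be checked numerically. This is a sketch on simulated data that computes the OLS estimates by hand with NumPy and takes the t distribution from SciPy, rather than reading the values from a Statsmodels summary. The data and variable names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

# OLS with an intercept column.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = n - X.shape[1]

# Standard errors from the estimated error variance.
sigma2 = (resid @ resid) / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Two-sided p-values and 95% confidence intervals from the t distribution.
t_stats = beta / se
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)
t_crit = stats.t.ppf(0.975, dof)
ci_low, ci_high = beta - t_crit * se, beta + t_crit * se

excludes_zero = (ci_low > 0) | (ci_high < 0)
significant = p_values < 0.05

for name, lo, hi, p in zip(["Intercept", "x"], ci_low, ci_high, p_values):
    print(f"{name}: CI=({lo:.3f}, {hi:.3f}), p={p:.4f}")
```

For each coefficient, `excludes_zero` and `significant` agree, because both conditions are equivalent to the absolute t-statistic exceeding the same critical value.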