# OSEMN process

## 1. Data scrubbing

- Subsample
- Data types
  - `df.info()`: check string (`object`) columns; look for categoricals stored as integers and convert them to strings
- Null values
  - Boolean mask `df.isna()` / per-column null counts `df.isna().sum()`
  - Placeholder values `value_counts()`
  - Binning `pd.cut()`
  - Dealing with null values
    - Remove
    - Replace
      - Numeric: fill with the column median, or bin the data and convert it to categorical (coarse classification)
      - Categorical: keep null values as their own category `df.Column = df.Column.fillna("NaN")` (**coarse classification**), or replace them with the most common category `df.Column = df.Column.fillna(df.Column.mode()[0])`
- Multicollinearity
  - Heatmap `sns.heatmap()`
  - Dealing with multicollinearity
    - Remove one of the columns
    - Combine columns
- Normalize (z-score standardization) `df.Column = (df.Column - df.Column.mean()) / df.Column.std()`
- One-hot encoding categorical data
  - `one_hot_df = pd.get_dummies(df)`
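
The scrubbing steps above can be sketched end to end on a small made-up frame. The column names (`age`, `city`, `income`), the values, and the bin edges are hypothetical and chosen only for illustration; median imputation and a separate `"missing"` category are two of the options listed, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0, 61.0, np.nan],
    "city": ["Oslo", "Bergen", None, "Oslo", "Oslo", "Bergen"],
    "income": [30.0, 42.0, 58.0, 39.0, 75.0, 44.0],
})

# Count nulls per column before changing anything.
null_counts = df.isna().sum()

# Numeric: fill nulls with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical: keep nulls as their own category.
df["city"] = df["city"].fillna("missing")

# Coarse classification: bin a numeric column into labelled ranges
# (the bin edges here are assumptions).
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                         labels=["young", "middle", "senior"])

# Standardize a numeric column to mean 0 and sample standard deviation 1.
df["income"] = (df["income"] - df["income"].mean()) / df["income"].std()

# One-hot encode the categorical columns; numeric columns pass through.
one_hot_df = pd.get_dummies(df, columns=["city", "age_group"])
```

Passing `columns=` to `pd.get_dummies` makes the encoded columns explicit, which avoids surprises when a numeric-looking column is actually categorical.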

## 2. Exploratory Data Analysis

- Visualizations
  - Histograms
  - KDE
  - Joint plots `sns.jointplot()`

## 3. Modelling

1. First model / feature selection

`from statsmodels.formula.api import ols`

`predictors = '+'.join(x_cols)
formula = outcome + "~" + predictors
model = ols(formula=formula, data=df).fit()
model.summary()`

Check the p-values in the summary and keep only the features that are statistically significant (for example, p < 0.05) for the model.
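
One way to turn the p-value check into code is sketched below on synthetic data. The column names, the coefficients, and the 0.05 threshold are assumptions for illustration; `model.pvalues` is the statsmodels attribute that holds the per-term p-values shown in the summary.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical synthetic data: y depends on x1 and x2, while x3 is pure noise.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["y"] = 3 * df["x1"] - 2 * df["x2"] + rng.normal(scale=0.5, size=200)

outcome = "y"
x_cols = ["x1", "x2", "x3"]
model = ols(formula=outcome + "~" + "+".join(x_cols), data=df).fit()

# Keep predictors whose p-value is below the threshold (0.05 is an assumption).
alpha = 0.05
p_values = model.pvalues.drop("Intercept")
significant = list(p_values[p_values < alpha].index)

# Refit using only the retained predictors.
refit = ols(formula=outcome + "~" + "+".join(significant), data=df).fit()
```

Dropping features by p-value alone can discard useful predictors when features are correlated, so it is worth comparing `refit.rsquared_adj` against the full model before committing.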

**Check for multicollinearity / variance inflation factor**

`from statsmodels.stats.outliers_influence import variance_inflation_factor`

`X = df[x_cols]
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
list(zip(x_cols, vif))`

2. Check for normality (qq-plots)

`import statsmodels.api as sm
import scipy.stats as stats`

`fig = sm.graphics.qqplot(model.resid, dist=stats.norm, line='45', fit=True)`

3. Check for homoscedasticity

`import matplotlib.pyplot as plt`

`fitted = model.predict(df[x_cols])
plt.scatter(fitted, model.resid)
plt.plot(fitted, [0 for i in range(len(df))])`

4. Remove outliers

`for i in range(90,99):
    q = i / 100
    print('{} percentile: {}'.format(q, df['MPG_Highway'].quantile(q=q)))`
    
`subset = df[df['MPG_Highway']<38]
print('Fraction removed:', (len(df) - len(subset)) / len(df))`
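
Instead of reading a cutoff off the printed percentiles, the quantile itself can serve as the threshold. The sketch below assumes synthetic data with a few extreme values appended and an arbitrary choice of the 99th percentile; only the column name `MPG_Highway` comes from the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: mostly moderate values plus ten extreme ones.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(30, 3, size=990),
                         rng.normal(80, 5, size=10)])
df = pd.DataFrame({"MPG_Highway": values})

# Use a quantile as the cutoff (0.99 is an assumption, not a rule).
cutoff = df["MPG_Highway"].quantile(0.99)
subset = df[df["MPG_Highway"] < cutoff]

fraction_removed = (len(df) - len(subset)) / len(df)
print(f"Cutoff: {cutoff:.2f}, fraction removed: {fraction_removed:.3f}")
```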

### Feature ranking

`import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()`

`r_list = []
adj_r_list = []
list_n = list(range(5, 86, 10))
for n in list_n:
    select_n = RFE(linreg, n_features_to_select=n)
    select_n = select_n.fit(X, np.ravel(y))
    selected_columns = X.columns[select_n.support_]
    linreg.fit(X[selected_columns], y)
    yhat = linreg.predict(X[selected_columns])
    SS_Residual = np.sum((y - yhat)**2)
    SS_Total = np.sum((y - np.mean(y))**2)
    r_squared = 1 - (float(SS_Residual)) / SS_Total
    print(r_squared)
    # adjust for the n features used in this fit, not all columns of X
    adjusted_r_squared = 1 - (1 - r_squared) * (len(y) - 1) / (len(y) - n - 1)
    print(adjusted_r_squared)
    # append inside the loop so the score for every n is recorded
    r_list.append(r_squared)
    adj_r_list.append(adjusted_r_squared)`
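
To see what recursive feature elimination returns, the sketch below fits `RFE` on synthetic data where only two of five features matter. The feature names `f0` to `f4` and the coefficients are hypothetical; `support_` and `ranking_` are the fitted attributes scikit-learn exposes on the selector.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Hypothetical data: only f0 and f1 influence y.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)),
                 columns=[f"f{i}" for i in range(5)])
y = 4 * X["f0"] - 3 * X["f1"] + rng.normal(scale=0.1, size=200)

selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)

# Boolean mask of the features that were kept.
selected_columns = list(X.columns[selector.support_])

# ranking_ is 1 for kept features; larger numbers were eliminated earlier.
ranking = dict(zip(X.columns, selector.ranking_))
```

Because `RFE` with a linear model ranks by coefficient size, features should be on comparable scales before this step.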

## 4. Holdout validation

`import numpy as np
from sklearn.model_selection import train_test_split
X = np.arange(10).reshape((5, 2))
y = range(5)`

`X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)`
                                                     
`from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score`
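
A holdout split is usually followed by fitting on the training part and scoring on the held-out part. The sketch below does this with `mean_squared_error` on synthetic data; the coefficients and noise level are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit on the training split only.
linreg = LinearRegression().fit(X_train, y_train)

# Compare the error on seen and unseen data.
train_mse = mean_squared_error(y_train, linreg.predict(X_train))
test_mse = mean_squared_error(y_test, linreg.predict(X_test))
print(f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

A test error far above the training error would point to overfitting; similar values suggest the model generalizes to the held-out rows.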

`select_85 = RFE(linreg, n_features_to_select=85)
select_85 = select_85.fit(X, np.ravel(y))
selected_columns = X.columns[select_85.support_]`

`cv_10_results = cross_val_score(linreg, X[selected_columns], y, cv=10, scoring="neg_mean_squared_error")
cv_10_results`
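
The scores returned with `scoring="neg_mean_squared_error"` are negated, so they need a sign flip before being read as errors. A minimal self-contained sketch on synthetic data (coefficients and noise level are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

cv_10_results = cross_val_score(LinearRegression(), X, y, cv=10,
                                scoring="neg_mean_squared_error")

# Scores are negated MSE (higher is better), so flip the sign to report MSE.
cv_mse = -cv_10_results
print(f"mean CV MSE: {cv_mse.mean():.3f} (std {cv_mse.std():.3f})")
```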