### Feature Engineering

To get comfortable with common feature engineering techniques and how to implement them in Python, we will use a toy dataset. First, let's import the necessary libraries and create the dataset.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import sklearn.preprocessing as p

df = pd.DataFrame({'response': [2.4, 3.3, -4.2, 5.6, 1.5, 8.7],
                   'x1': ['yes', 'no', 'yes', 'maybe', 'no', 'yes'],
                   'x2': [-1, -3, np.nan, 0, np.nan, 1],
                   'x3': [2.4, 15, 3.3, 2.4, 1.8, 0.4],
                   'x4': [np.nan, np.nan, 1, 1, 1, 1],
                   'x5': ['A', 'B', np.nan, 'A', 'A', 'A']})
df

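| | response | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|---|
| 0 | 2.4 | yes | -1.0 | 2.4 | NaN | A |
| 1 | 3.3 | no | -3.0 | 15.0 | NaN | B |
| 2 | -4.2 | yes | NaN | 3.3 | 1.0 | NaN |
| 3 | 5.6 | maybe | 0.0 | 2.4 | 1.0 | A |
| 4 | 1.5 | no | NaN | 1.8 | 1.0 | A |
| 5 | 8.7 | yes | 1.0 | 0.4 | 1.0 | A |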
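
Before modeling, it can help to see how much of the toy dataset is actually missing. The following is a minimal sketch that rebuilds the same data under the hypothetical name `df_check` (so the notebook's `df` is left untouched) and counts missing values per column and fully observed rows.

```python
import numpy as np
import pandas as pd

# Same toy data as above, under a separate (hypothetical) name
df_check = pd.DataFrame({'response': [2.4, 3.3, -4.2, 5.6, 1.5, 8.7],
                         'x1': ['yes', 'no', 'yes', 'maybe', 'no', 'yes'],
                         'x2': [-1, -3, np.nan, 0, np.nan, 1],
                         'x3': [2.4, 15, 3.3, 2.4, 1.8, 0.4],
                         'x4': [np.nan, np.nan, 1, 1, 1, 1],
                         'x5': ['A', 'B', np.nan, 'A', 'A', 'A']})

# Number of missing entries in each column
missing_per_column = df_check.isna().sum()

# Number of rows that have no missing entry in any column
complete_rows = len(df_check.dropna())

print(missing_per_column)
print(complete_rows)  # 2: only rows 3 and 5 are fully observed
```

Only two of the six rows are complete, which matters for any step that drops rows with missing values.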


`1.` Fit a linear model between the response and the three quantitative x-variables in the dataset (`x2`, `x3`, and `x4`). Also add an intercept. Use the results to answer the first quiz question below.

In [3]:
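# Drop every row with a missing value; inplace=True permanently modifies df,
# so all later cells only see the rows that remain
df.dropna(inplace=True)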

In [4]:
df['intercept'] = 1
lm = sm.OLS(df['response'], df[['intercept','x2','x3','x4']])
results = lm.fit()
results.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | response | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | |
| Method: | Least Squares | F-statistic: | 0.0 |
| Date: | Sun, 24 May 2020 | Prob (F-statistic): | |
| Time: | 19:00:14 | Log-Likelihood: | 64.174 |
| No. Observations: | 2 | AIC: | -124.3 |
| Df Residuals: | 0 | BIC: | -127.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 2.7208 | inf | 0 | | | |
| x2 | 3.2320 | inf | 0 | | | |
| x3 | 0.0660 | inf | 0 | | | |
| x4 | 2.7208 | inf | 0 | | | |

| | | | |
|---|---|---|---|
| Omnibus: | | Durbin-Watson: | 0.2 |
| Prob(Omnibus): | | Jarque-Bera (JB): | 0.333 |
| Skew: | 0.0 | Prob(JB): | 0.846 |
| Kurtosis: | 1.0 | Cond. No. | 2.32 |
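
The summary above reports 2 observations, 0 residual degrees of freedom, a perfect R-squared, and infinite standard errors. The sketch below tries to show why, using only NumPy on the two complete rows (index 3 and 5). It assumes statsmodels solves the least-squares problem with a pseudoinverse, which is its documented default (`method='pinv'`). The names `X`, `y`, and `coefs` are illustrative and not part of the notebook.

```python
import numpy as np

# The two rows that survive dropna(); columns: intercept, x2, x3, x4
X = np.array([[1.0, 0.0, 2.4, 1.0],
              [1.0, 1.0, 0.4, 1.0]])
y = np.array([5.6, 8.7])

n_obs = X.shape[0]
rank = np.linalg.matrix_rank(X)  # only 2 independent directions for 4 columns
df_resid = n_obs - rank          # 0: no information left to estimate error

# Minimum-norm least-squares solution via the pseudoinverse
coefs = np.linalg.pinv(X) @ y
residuals = y - X @ coefs        # essentially zero: the fit passes through both points

print(rank, df_resid)
print(np.round(coefs, 4))
```

With two points and four parameters, the model interpolates the data exactly, so the fit says nothing about how well these features explain the response. The intercept and `x4` get the same coefficient because both columns are constant ones in these rows.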


`2.` Use the sklearn documentation [here](http://scikit-learn.org/stable/modules/preprocessing.html) and the previous video to help fill in the missing values in each quantitative column with the column mean. Then re-fit the linear model from question `1.` using the imputed columns, and use the results to answer quiz 2 below.

In [5]:
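from sklearn.impute import SimpleImputer

# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(df[['x2', 'x3', 'x4']])
df[['x2', 'x3', 'x4']] = imp.transform(df[['x2', 'x3', 'x4']])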
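
Because the rows with missing values were already dropped in place, the imputer in this cell has nothing left to fill. As a rough illustration of what mean imputation does when missing values are still present, the sketch below applies `SimpleImputer` (the imputer class in recent scikit-learn releases) to a fresh copy of the three quantitative columns. The names `raw` and `filled` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Fresh copy of the quantitative columns, with missing values intact
raw = pd.DataFrame({'x2': [-1, -3, np.nan, 0, np.nan, 1],
                    'x3': [2.4, 15, 3.3, 2.4, 1.8, 0.4],
                    'x4': [np.nan, np.nan, 1, 1, 1, 1]})

# Replace each NaN with the mean of the observed values in its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = pd.DataFrame(imputer.fit_transform(raw), columns=raw.columns)

print(filled)
# x2: the two gaps become -0.75, the mean of -1, -3, 0, 1
# x4: the two gaps become 1.0, the mean of the four observed ones
```

Note that after imputation `x4` is constant, so it carries the same information as an intercept column.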

In [6]:
lm = sm.OLS(df['response'], df[['intercept','x2','x3','x4']])
results = lm.fit()
results.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | response | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | |
| Method: | Least Squares | F-statistic: | 0.0 |
| Date: | Sun, 24 May 2020 | Prob (F-statistic): | |
| Time: | 19:00:33 | Log-Likelihood: | 64.174 |
| No. Observations: | 2 | AIC: | -124.3 |
| Df Residuals: | 0 | BIC: | -127.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 2.7208 | inf | 0 | | | |
| x2 | 3.2320 | inf | 0 | | | |
| x3 | 0.0660 | inf | 0 | | | |
| x4 | 2.7208 | inf | 0 | | | |

| | | | |
|---|---|---|---|
| Omnibus: | | Durbin-Watson: | 0.2 |
| Prob(Omnibus): | | Jarque-Bera (JB): | 0.333 |
| Skew: | 0.0 | Prob(JB): | 0.846 |
| Kurtosis: | 1.0 | Cond. No. | 2.32 |


`3.` Another common way to scale features is to subtract the mean and divide by the standard deviation. For certain machine learning algorithms you should always consider this type of scaling (or another form of normalization), as discussed [here](https://stats.stackexchange.com/questions/189652/is-it-a-good-practice-to-always-scale-normalize-data-for-machine-learning). Use the [sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) and the previous video to help perform this scaling on the three quantitative columns in your dataset.

To ensure you performed these transformations correctly, answer quiz 3 below.

In [7]:
norm = p.StandardScaler()
norm.fit(df[['x2','x3','x4']])
norm.transform(df[['x2','x3','x4']])

array([[-1.,  1.,  0.],
       [ 1., -1.,  0.]])
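
The scaled values above can be checked by hand. The sketch below applies the subtract-the-mean, divide-by-the-standard-deviation formula with plain NumPy to the two remaining rows and compares it with `StandardScaler`. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, and it treats a zero standard deviation as 1 so that a constant column maps to zeros instead of dividing by zero. The names `X`, `manual`, and `sklearn_scaled` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns x2, x3, x4 for the two rows left in df
X = np.array([[0.0, 2.4, 1.0],
              [1.0, 0.4, 1.0]])

means = X.mean(axis=0)
stds = X.std(axis=0)  # population standard deviation (ddof=0)

# A constant column has std 0; use 1 instead to avoid dividing by zero
safe_stds = np.where(stds == 0, 1.0, stds)

manual = (X - means) / safe_stds
sklearn_scaled = StandardScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, sklearn_scaled))
```

The third column is all zeros because `x4` is constant in these rows, so standardizing it removes all of its variation.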