# Feature Engineering Review

This notebook walks through a machine learning workflow with some feature engineering, using the cars (`mpg`) dataset from seaborn. The goal is to get more comfortable with feature engineering.

In [1]:
#usual imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

Load the seaborn dataset with information about car fuel efficiency.

In [2]:
df_cars = sns.load_dataset('mpg')

#### Basic dataset inspection

In [3]:
df_cars.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |


In [5]:
df_cars.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
name            0
dtype: int64

#### Drop the name column

In [6]:
df_cars.drop('name', inplace=True, axis=1)

### Deal with missing values

In [7]:
df_cars.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
dtype: int64

In [8]:
df_cars.isna().sum()

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
dtype: int64

### Make X and y
We'll predict mpg using the remaining columns (after some engineering)

In [12]:
# drop the 6 rows with a missing horsepower value
df_cars.dropna(inplace=True)

In [13]:
X = df_cars.drop('mpg', axis=1)

In [14]:
y = df_cars['mpg']

## Encode string values 

Do this after splitting into train and test sets if some category values might show up only in the test data. In that case, build the encoding from the data available in X_train alone, then make sure X_test ends up with the same columns as X_train.
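The note above about encoding after the split can be sketched as follows. This is a minimal illustration on a made-up toy frame (the `X_train`/`X_test` values here are hypothetical, not the real split), using `pd.get_dummies` plus `reindex` to align columns; other approaches, such as scikit-learn's `OneHotEncoder`, would also work.

```python
import pandas as pd

# Hypothetical toy split: 'europe' appears only in the test rows.
X_train = pd.DataFrame({"weight": [3504, 2130, 2372],
                        "origin": ["usa", "japan", "usa"]})
X_test = pd.DataFrame({"weight": [1835, 2672],
                       "origin": ["europe", "japan"]})

# Learn the dummy columns from the training data only.
X_train_enc = pd.get_dummies(X_train, columns=["origin"])

# Encode the test data, then force it onto the training columns:
# categories unseen in training are dropped, and training categories
# missing from the test rows are added as all-zero columns.
X_test_enc = pd.get_dummies(X_test, columns=["origin"]).reindex(
    columns=X_train_enc.columns, fill_value=0
)

print(list(X_train_enc.columns))
print(list(X_test_enc.columns))
```

After the `reindex`, both frames have identical columns, so a model fit on the training frame can score the test frame without a shape mismatch.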

In [18]:
X['origin'].value_counts()

usa       245
japan      79
europe     68
Name: origin, dtype: int64

### Dummify/OneHotEncode

In [25]:
X = pd.get_dummies(X, columns=['origin'])

## Separate into training and test sets

In [23]:
from sklearn.model_selection import train_test_split

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 42)

In [27]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 9 columns):
cylinders        392 non-null int64
displacement     392 non-null float64
horsepower       392 non-null float64
weight           392 non-null int64
acceleration     392 non-null float64
model_year       392 non-null int64
origin_europe    392 non-null uint8
origin_japan     392 non-null uint8
origin_usa       392 non-null uint8
dtypes: float64(3), int64(3), uint8(3)
memory usage: 22.6 KB


### Standardize and scale our data

#### Import, instantiate, fit_transform, and transform

In [28]:
from sklearn.preprocessing import StandardScaler

In [29]:
ss = StandardScaler()

In [30]:
# fit on the training data only so test-set statistics do not leak into the scaler
ss.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)
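The heading above mentions `fit_transform` and `transform`. A minimal sketch of that pattern, on small made-up arrays rather than the cars data (the `_demo` names are hypothetical), might look like this: the scaler learns the mean and standard deviation from the training rows, and the test rows are scaled with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the training and test splits.
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[2.0, 400.0]])

scaler = StandardScaler()

# fit_transform: learn mean/std from the training rows and scale them.
X_train_scaled = scaler.fit_transform(X_train_demo)

# transform: reuse the training mean/std on the test rows (no refitting).
X_test_scaled = scaler.transform(X_test_demo)

print(X_train_scaled.mean(axis=0))  # each training column is centered at 0
print(X_test_scaled)
```

Test values are not guaranteed to be centered or to fall in the training range, which is expected: they are measured relative to the training distribution.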

## Polynomial Features

#### import and instantiate

In [31]:
from sklearn.preprocessing import PolynomialFeatures


In [33]:
poly = PolynomialFeatures(3)

#### fit_transform and transform

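In [34]:
# scale with the fitted scaler, then learn the polynomial expansion from the training data only
X_train_poly = poly.fit_transform(ss.transform(X_train))

In [None]:
# reuse the fitted scaler and polynomial expansion on the test data
X_test_poly = poly.transform(ss.transform(X_test))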

##### What do you get back?

In [None]:
X_train_poly[0]
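To make the answer to "what do you get back?" concrete, here is a small sketch on a made-up two-feature row (the `row`, `poly_demo`, and `wide` names are illustrative only). With degree 2, each row expands to a bias term, the original features, and all products up to degree 2.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical row with two features, a=2 and b=3.
row = np.array([[2.0, 3.0]])

poly_demo = PolynomialFeatures(degree=2)
expanded = poly_demo.fit_transform(row)

# Column order for degree 2: 1, a, b, a^2, a*b, b^2
print(expanded)

# The number of output columns grows quickly: 9 input features
# at degree 3 (as in this notebook) expand to 220 columns.
wide = PolynomialFeatures(degree=3).fit(np.zeros((1, 9)))
print(wide.n_output_features_)
```

With only a few hundred rows in the cars data, 220 columns is a lot for a plain linear regression, which is one reason to compare several degrees with cross validation.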

#### import and instantiate a Linear Regression Model

In [38]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
lr = LinearRegression()

#### import cross_val_score and score the model on the training data

In [39]:
cross_val_score(lr, X, y).mean()

0.5936618440508721

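#### Result

A mean cross-validated R² of about 0.59 means the model explains a little over half of the variance in mpg. Note that this call scores the original `X` and `y`, not the scaled polynomial training features, so it does not yet measure the effect of the feature engineering above.

From here, you could try other degrees for the polynomial features (or try other models -- soon 😀). See what performs best with cross validation on your training data, then use the best model to score on your holdout/test data.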
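Comparing polynomial degrees with cross validation could be sketched as below. This is an illustration on synthetic data, not the cars dataset: the data-generating formula and all variable names are assumptions made for the example. Wrapping the steps in a pipeline means the scaler and the polynomial expansion are refit inside each fold, so nothing leaks from the validation rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic training data with a quadratic relationship (assumed for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 2))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(0, 0.1, size=200)

scores = {}
for degree in (1, 2, 3):
    # Scale, expand, and fit inside one pipeline so each CV fold
    # learns its own scaling and expansion from its training part.
    model = make_pipeline(
        StandardScaler(), PolynomialFeatures(degree), LinearRegression()
    )
    scores[degree] = cross_val_score(model, X_demo, y_demo).mean()

best_degree = max(scores, key=scores.get)
print(scores)
print(best_degree)
```

Only after picking the best degree this way would you refit on all of the training data and score once on the holdout/test data.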

#### Make a baseline model.

In [40]:
y_train.mean()

23.692176870748295

In [42]:
df_baseline = pd.DataFrame(y_test)
df_baseline.head()

| | mpg |
|---|---|
| 79 | 26.0 |
| 276 | 21.6 |
| 248 | 36.1 |
| 56 | 26.0 |
| 393 | 27.0 |


In [44]:
df_baseline['y_train_mean'] = y_train.mean()
df_baseline.head()

| | mpg | y_train_mean |
|---|---|---|
| 79 | 26.0 | 23.692177 |
| 276 | 21.6 | 23.692177 |
| 248 | 36.1 | 23.692177 |
| 56 | 26.0 | 23.692177 |
| 393 | 27.0 | 23.692177 |


#### Score the baseline model

In [46]:
from sklearn.metrics import r2_score

In [48]:
r2_score(df_baseline['mpg'], df_baseline['y_train_mean'])

-0.01923918562170024

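The baseline R² is essentially 0, which is what to expect from a model that always predicts the training mean. Any useful model should score clearly above this.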
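The same mean baseline can also be built with scikit-learn's `DummyRegressor`, which avoids the helper DataFrame. The sketch below uses a few made-up mpg values (the `_demo` arrays are hypothetical), so its score will not match the one above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Hypothetical target values standing in for y_train and y_test.
y_train_demo = np.array([18.0, 15.0, 30.0, 25.0])
y_test_demo = np.array([26.0, 21.6, 36.1])

# DummyRegressor ignores the features, so placeholder columns are enough.
baseline = DummyRegressor(strategy="mean")
baseline.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)

# Every prediction is the training mean.
preds = baseline.predict(np.zeros((len(y_test_demo), 1)))
score = r2_score(y_test_demo, preds)

print(preds)
print(score)
```

A constant prediction can never score above 0 on R², and it lands below 0 whenever the training mean differs from the test mean, which is why the baseline above comes out slightly negative.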