## Preprocessing steps with sklearn 

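We'll start by importing pandas, NumPy, and sklearn. 

You may need to pip install them if you don't have them installed already! (The sklearn package is published on PyPI as scikit-learn.)

In [3]:
import numpy as np  # used later for np.nan in the imputation demo
import pandas as pd
import sklearn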

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')

### Variable selection 

What variables are we going to use for our predictive model? A couple of these might be too specific, so let's choose those which won't cause us to have 10000000 columns after one-hot encoding. 

Since we're predicting cup points, we can see right away that total_cup_points is our **output variable of interest**. Let's save it as our "y" variable. 

In [8]:
y = df.total_cup_points

Now we need to determine what X variables we need. For the sake of this tutorial, we'll only keep a few - but generally, try to include as many as you can! 

Good reasons for throwing out columns, generally:
- Too many irreconcilable values of a string variable - state, plain text fields
- No good theoretical reason to have any predictive power - first name, object id
- Overwhelming number of nulls
- We shouldn't predict on this variable - race, gender for ML applications that can tangibly affect people's lives
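
The first and third reasons in that list can be checked with a quick summary table. Here's a minimal sketch on a made-up DataFrame; the column `mostly_missing` and both cutoffs are hypothetical and only there for illustration, so pick thresholds that make sense for your own data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the coffee data; these values are made up for illustration.
toy = pd.DataFrame({
    "owner": ["a", "b", "c", "d", "e", "f"],
    "color": ["Green", "Green", None, "Blue-Green", "Green", None],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
})

# One row per column: number of distinct non-null values and share of nulls.
summary = pd.DataFrame({
    "n_unique": toy.nunique(),
    "null_fraction": toy.isna().mean(),
})
print(summary)

# Hypothetical cutoffs: flag columns where every value is distinct,
# or where more than half of the rows are null.
too_many_levels = summary.index[summary["n_unique"] == len(toy)].tolist()
too_many_nulls = summary.index[summary["null_fraction"] > 0.5].tolist()
print(too_many_levels, too_many_nulls)
```

The other two reasons (no theoretical justification, and variables we shouldn't predict on) still need a human to decide.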

In [10]:
len(df.owner.value_counts()) #Something like this is way too many to include, unfortunately 

315

In [13]:
len(df.country_of_origin.value_counts()) # This could be fine

36

In [15]:
df.altitude.value_counts() # Lots of cleaning necessary on this, we'll leave it out for now but you should check it out in a project

1100              43
1200              42
1300              32
1400              32
4300              31
                  ..
1877               1
11000 metros       1
688 m              1
4563               1
1200m to 1350m     1
Name: altitude, Length: 396, dtype: int64
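
If you do want to take a first pass at cleaning a column like this, one option is to pull the numbers out of each string with a regular expression. This is only a sketch: the helper name `altitude_numbers` is made up, the sample strings are copied from the output above, and it ignores units entirely (it can't tell meters from feet, or spot a typo like an extra zero).

```python
import re
import pandas as pd

# Sample strings copied from the value_counts() output above, plus one null.
raw = pd.Series(["1100", "11000 metros", "688 m", "1200m to 1350m", None])

def altitude_numbers(text):
    """Return every number found in a raw altitude string, as floats."""
    if pd.isna(text):
        return []
    return [float(n) for n in re.findall(r"\d+(?:\.\d+)?", str(text))]

parsed = raw.apply(altitude_numbers)

# Low and high ends of each range; rows with no number become NaN.
low = parsed.apply(lambda nums: min(nums) if nums else float("nan"))
high = parsed.apply(lambda nums: max(nums) if nums else float("nan"))
print(pd.DataFrame({"raw": raw, "low": low, "high": high}))
```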

In [72]:
#Let's save some variables of interest:
x = df[['country_of_origin', 'variety', 'category_one_defects', 'category_two_defects', 'color', 
       'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters']]

In [73]:
x

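| | country_of_origin | variety | category_one_defects | category_two_defects | color | altitude_low_meters | altitude_high_meters | altitude_mean_meters |
|---|---|---|---|---|---|---|---|---|
| 0 | Ethiopia | NaN | 0 | 0 | Green | 1950.0 | 2200.0 | 2075.0 |
| 1 | Ethiopia | Other | 0 | 1 | Green | 1950.0 | 2200.0 | 2075.0 |
| 2 | Guatemala | Bourbon | 0 | 0 | NaN | 1600.0 | 1800.0 | 1700.0 |
| 3 | Ethiopia | NaN | 0 | 2 | Green | 1800.0 | 2200.0 | 2000.0 |
| 4 | Ethiopia | Other | 0 | 2 | Green | 1950.0 | 2200.0 | 2075.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1334 | Ecuador | NaN | 0 | 1 | Blue-Green | NaN | NaN | NaN |
| 1335 | Ecuador | NaN | 0 | 0 | Blue-Green | 40.0 | 40.0 | 40.0 |
| 1336 | United States | NaN | 0 | 6 | NaN | 795.0 | 795.0 | 795.0 |
| 1337 | India | NaN | 20 | 1 | Green | NaN | NaN | NaN |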


### Split the data into our training and validation sets. 

If you google "split data sklearn", you'll be greeted by the lovely documentation page for train_test_split (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), which shows you not only the parameters to use, but also how to import it. 

It's from the model_selection submodule, so we'll do: 

In [74]:
from sklearn.model_selection import train_test_split

Now we use the parameters listed in the documentation to make an 80-20 train-validation split. The function returns four pieces, which we assign to the variables below. 

random_state can be any integer; it just ensures that the split will be the same every time the code is run. 

In [75]:
X_train, X_val, y_train, y_val = train_test_split(x, y, 
                                                    test_size=0.2, 
                                                    random_state=42)

In [76]:
print(len(X_train))
print(len(y_train))

1071
1071


In [77]:
print(len(X_val))
print(len(y_val))

268
268
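
To see what random_state buys us, here's a small sketch on made-up arrays (`X_toy` and `y_toy` are stand-ins, not the coffee data). Two calls with the same random_state should pick exactly the same rows for validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: ten rows, standing in for x and y above.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two separate calls with the same random_state...
_, X_val_a, _, y_val_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
_, X_val_b, _, y_val_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# ...select the same rows for validation.
print(np.array_equal(y_val_a, y_val_b))
print(len(y_val_a))  # 20% of 10 rows
```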


### Apply our transformations to each dataset.

We're only interested in transforming the X features here. 

**Critically**, we want to make sure that we transform the validation data using the parameters learned from the **train** data, not parameters learned from the validation data itself. 

This ensures that the validation data is encoded and scaled in the same way as the data the model was trained on, and it keeps information about the validation set from leaking into preprocessing. 
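
Here's what "learned parameters" means in practice, sketched with StandardScaler on a few made-up altitude values. The scaler stores the training mean in its `mean_` attribute after fitting, and `transform` reuses it on whatever data comes next.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up altitudes (meters) for illustration.
train_alt = np.array([[1000.0], [1500.0], [2000.0]])
val_alt = np.array([[1500.0], [3000.0]])

scaler = StandardScaler()
scaler.fit(train_alt)   # learns the mean and standard deviation from train only
print(scaler.mean_)     # mean of the training column

# The validation rows are shifted and scaled with the *training* statistics.
val_scaled = scaler.transform(val_alt)
print(val_scaled)
```

A validation value equal to the training mean lands at zero, no matter what the validation set's own mean is.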

This step is going to be a lot, so let's demo out how each of these pieces works before we combine them into pipelines and put them in a ColumnTransformer.

In [78]:
#One-hot encoding:
from sklearn.preprocessing import OneHotEncoder

train_data = pd.DataFrame({'animal': ["fish", "dog", "dog", "fish", "cat", "fish", "dog"],
                           'color': ['blue', 'green', 'green', 'green', 'pink', 'blue', 'blue']})
val_data = pd.DataFrame({'animal': ["fish", "fish", "fish"],
                         'color': ['blue', 'blue', 'blue']})

#Save the encoder so we can reuse it on the validation data
our_encoder = OneHotEncoder()

#Use fit_transform on the X training data
encoded_train = our_encoder.fit_transform(train_data)

#Now that we've already learned the columns, use just .transform 
#on validation data. 
encoded_val = our_encoder.transform(val_data)

Look how our data has been transformed:

In [79]:
train_data

| | animal | color |
|---|---|---|
| 0 | fish | blue |
| 1 | dog | green |
| 2 | dog | green |
| 3 | fish | green |
| 4 | cat | pink |
| 5 | fish | blue |
| 6 | dog | blue |


In [80]:
encoded_train

<7x6 sparse matrix of type '<class 'numpy.float64'>'
	with 14 stored elements in Compressed Sparse Row format>

In [81]:
#sparse matrix --> dense matrix (use .toarray() for a plain NumPy array)
encoded_train.todense()

matrix([[0., 0., 1., 1., 0., 0.],
        [0., 1., 0., 0., 1., 0.],
        [0., 1., 0., 0., 1., 0.],
        [0., 0., 1., 0., 1., 0.],
        [1., 0., 0., 0., 0., 1.],
        [0., 0., 1., 1., 0., 0.],
        [0., 1., 0., 1., 0., 0.]])

In [82]:
val_data

| | animal | color |
|---|---|---|
| 0 | fish | blue |
| 1 | fish | blue |
| 2 | fish | blue |


In [83]:
encoded_val.todense()

matrix([[0., 0., 1., 1., 0., 0.],
        [0., 0., 1., 1., 0., 0.],
        [0., 0., 1., 1., 0., 0.]])

Do you see now why we used transform instead of fit_transform? 

ML models need the dimensions of the data they accept to be the same for training and evaluation, so we need to make sure the non-fish, non-blue columns are there too! 

In [84]:
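#DON'T DO THIS! Refitting on the validation data learns only the categories
#present there, and it overwrites what our_encoder learned from the training data.
bad_val = our_encoder.fit_transform(val_data)
bad_val.todense()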

matrix([[1., 1.],
        [1., 1.],
        [1., 1.]])
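
The opposite problem can also come up: the validation data may contain a category the encoder never saw during training. The sketch below uses made-up data (`"bird"` is the unseen category) to compare the default behavior with `handle_unknown="ignore"`, the setting used in the real pipeline further down.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_toy = pd.DataFrame({"animal": ["fish", "dog", "cat"]})
val_toy = pd.DataFrame({"animal": ["fish", "bird"]})  # "bird" never appears in train

# The default is handle_unknown='error': unseen categories raise at transform time.
strict = OneHotEncoder()
strict.fit(train_toy)
try:
    strict.transform(val_toy)
except ValueError as err:
    print("strict encoder failed:", err)

# With handle_unknown='ignore', an unseen category becomes a row of zeros.
lenient = OneHotEncoder(handle_unknown="ignore")
lenient.fit(train_toy)
print(lenient.categories_)  # the sorted training categories, one column each
encoded = lenient.transform(val_toy).toarray()
print(encoded)
```

Either way the output keeps the columns learned from the training data, so the dimensions stay consistent.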

### Imputing nulls:

In [85]:
from sklearn.impute import SimpleImputer
our_imputer = SimpleImputer(strategy="most_frequent")

In [86]:
train_data = pd.DataFrame({'animal': ["fish", np.nan, "dog", np.nan, "cat", "fish", "dog"],
                           'color': [np.nan, 'green', 'green', 'green', 'pink', 'blue', 'blue']})
val_data = pd.DataFrame({'animal': ["fish", np.nan, "fish"],
                         'color': [np.nan, 'blue', 'blue']})

In [87]:
train_data

| | animal | color |
|---|---|---|
| 0 | fish | NaN |
| 1 | NaN | green |
| 2 | dog | green |
| 3 | NaN | green |
| 4 | cat | pink |
| 5 | fish | blue |
| 6 | dog | blue |


In [88]:
our_imputer.fit_transform(train_data)

array([['fish', 'green'],
       ['dog', 'green'],
       ['dog', 'green'],
       ['dog', 'green'],
       ['cat', 'pink'],
       ['fish', 'blue'],
       ['dog', 'blue']], dtype=object)

In [89]:
val_data

| | animal | color |
|---|---|---|
| 0 | fish | NaN |
| 1 | NaN | blue |
| 2 | fish | blue |


In [90]:
our_imputer.transform(val_data)

array([['fish', 'green'],
       ['dog', 'blue'],
       ['fish', 'blue']], dtype=object)

NOTICE that the imputer fills with the parameters, the most-frequent values, from the TRAINING data. This is what we want, and it's why we called fit_transform on the training data and only transform on the validation data. 
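
The same idea carries over to numeric columns. Here's a small sketch with made-up altitudes and `strategy="median"`. After fitting, the learned fill values are stored in the imputer's `statistics_` attribute, so you can check exactly what will be filled in.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric column for illustration.
train_num = pd.DataFrame({"altitude": [1000.0, 1200.0, np.nan, 5000.0]})
val_num = pd.DataFrame({"altitude": [np.nan, 900.0]})

median_imputer = SimpleImputer(strategy="median")
median_imputer.fit(train_num)
print(median_imputer.statistics_)  # per-column fill values learned from train

# The null in the validation data is filled with the training median.
filled_val = median_imputer.transform(val_num)
print(filled_val)
```

The median is a reasonable choice here because one extreme value (like the 5000.0 above) doesn't drag it around the way it would drag the mean.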

## Doing it for real

We want our categorical and numerical variables to be treated differently. 

We definitely wouldn't want to one-hot encode our numerical variables!

As such, we'll use a ColumnTransformer, which allows us to specify different transformations to be applied to different columns. 

Since we're doing multiple transformations to the same columns, we'll need Pipeline objects, which are well documented and user-friendly. Google them!

In [91]:
from sklearn.compose import ColumnTransformer

In [92]:
#Save the columns of each type...
cat_columns = ['country_of_origin', 'variety', 'color']
num_columns = ['category_one_defects', 
               'category_two_defects', 'altitude_low_meters',
              'altitude_high_meters', 'altitude_mean_meters']

In [94]:
#We'll make a pipeline for each one. 

In [95]:
from sklearn.preprocessing import StandardScaler

Impute before we encode!

In [99]:
from sklearn.pipeline import Pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')), #Median for numbers, not most_frequent
    ('scaler', StandardScaler())
])

In [100]:
#Make the whole thing with column transformer.
full_transformer = ColumnTransformer(
    transformers = [
        ('numeric', numeric_pipeline, num_columns),
        ('categorical', categorical_pipeline, cat_columns)
    ]
)

Now, use it like the individual transformers. Fit_transform on training, transform only on val!

In [101]:
X_train_proc = full_transformer.fit_transform(X_train)
X_val_proc = full_transformer.transform(X_val)

In [102]:
X_train.shape

(1071, 8)

In [103]:
X_train_proc.shape

(1071, 71)

In [104]:
X_val_proc.shape

(268, 71)
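
Where do the 71 columns come from? The 5 numeric columns pass through as 5 columns, so the other 66 are one-hot columns, one per category level learned from the training data. The sketch below shows how to check that kind of breakdown on a small made-up DataFrame. The toy rows and the name `toy_transformer` are assumptions for illustration, but `named_transformers_`, `named_steps`, and `categories_` are the attributes sklearn provides on fitted objects.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration; column names borrowed from the coffee data.
toy = pd.DataFrame({
    "color": ["Green", "Blue-Green", np.nan, "Green"],
    "country_of_origin": ["Ethiopia", "Ecuador", "India", "Ethiopia"],
    "altitude_mean_meters": [2075.0, np.nan, 795.0, 1700.0],
})

toy_transformer = ColumnTransformer([
    ("numeric", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["altitude_mean_meters"]),
    ("categorical", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["color", "country_of_origin"]),
])
toy_proc = toy_transformer.fit_transform(toy)

# The fitted steps live under named_transformers_ and named_steps.
fitted_encoder = toy_transformer.named_transformers_["categorical"].named_steps["encoder"]
levels_per_column = [len(levels) for levels in fitted_encoder.categories_]
print(levels_per_column)  # one entry per categorical column
print(toy_proc.shape)     # 1 numeric column + the sum of the levels
```

Running the same three lines at the end against full_transformer should give the per-column breakdown for the real data.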

### Anyway...

I know that's a lot of work. Use this as a guide and a resource - no need to reinvent the wheel. This data is now ready for some modeling! 