# Preprocessing for Machine Learning in Python
#### Course Description
This course covers the basics of how and when to perform data preprocessing, the essential step in any machine learning project where you get your data ready for modeling. Preprocessing comes into play after you import and clean your data and before you fit your machine learning model. You'll learn how to standardize your data so that it's in the right form for your model, create new features to best leverage the information in your dataset, and select the best features to improve your model fit. Finally, you'll practice preprocessing by getting a dataset on UFO sightings ready for modeling.

In [117]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB



# 1. Introduction to Data Preprocessing

In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.

### Missing data - rows
Taking a look at the volunteer dataset again, we want to drop rows where the category_desc column values are missing. We're going to do this using boolean indexing: check which values are null, then filter the dataset so that only the rows with non-null values remain.

#### Instructions

- Check how many values are missing in the category_desc column using isnull() and sum().
- Subset the volunteer dataset by indexing by where category_desc is notnull(), and store in a new variable called volunteer_subset.
- Take a look at the .shape attribute of the new dataset, to verify it worked correctly.


In [16]:
vol_cols = ['opportunity_id', 'content_id', 'vol_requests', 'event_time', 'title',
       'hits', 'summary', 'is_priority', 'category_id', 'category_desc',
       'amsl', 'amsl_unit', 'org_title', 'org_content_id', 'addresses_count',
       'locality', 'region', 'postalcode', 'primary_loc', 'display_url',
       'recurrence_type', 'hours', 'created_date', 'last_modified_date',
       'start_date_date', 'end_date_date', 'status', 'Latitude', 'Longitude',
       'Community Board', 'Community Council ', 'Census Tract', 'BIN', 'BBL',
       'NTA']
volunteer = pd.read_csv('data/volunteer_opportunities.csv', 
                            usecols=vol_cols)

In [17]:
volunteer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 35 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   opportunity_id      665 non-null    int64  
 1   content_id          665 non-null    int64  
 2   vol_requests        665 non-null    int64  
 3   event_time          665 non-null    int64  
 4   title               665 non-null    object 
 5   hits                665 non-null    int64  
 6   summary             665 non-null    object 
 7   is_priority         62 non-null     object 
 8   category_id         617 non-null    float64
 9   category_desc       617 non-null    object 
 10  amsl                0 non-null      float64
 11  amsl_unit           0 non-null      float64
 12  org_title           665 non-null    object 
 13  org_content_id      665 non-null    int64  
 14  addresses_count     665 non-null    int64  
 15  locality            595 non-null    object 
 16  region  

In [18]:
# Check how many values are missing in the category_desc column
print(volunteer['category_desc'].isnull().sum())

# Subset the volunteer dataset
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]

# Print out the shape of the subset
print(volunteer_subset.shape)

48
(617, 35)
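
As a self-contained sketch of the same idea, the toy DataFrame below (hypothetical data, not the volunteer dataset) suggests that boolean indexing with notnull() and dropna() with subset= keep the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the volunteer dataset
toy = pd.DataFrame({
    "title": ["Web designer", "Tree planting", "Food drive", "Tutoring"],
    "category_desc": ["Education", np.nan, "Health", np.nan],
})

# Count the missing values in the column
n_missing = toy["category_desc"].isnull().sum()

# Keep only the rows where category_desc is present (boolean indexing)
toy_subset = toy[toy["category_desc"].notnull()]

# dropna with subset= is a more compact spelling of the same filter
toy_dropna = toy.dropna(subset=["category_desc"])

print(n_missing)
print(toy_subset.shape)
```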


### Converting a column type

In [19]:
# Print the head of the hits column
print(volunteer["hits"].head())

# Convert the hits column to type int
volunteer["hits"] = volunteer["hits"].astype('int64')

# Look at the dtypes of the dataset
print(volunteer.hits.dtype)

0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int64
int64
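
In the cell above, hits was already stored as int64, so the conversion changes nothing. A minimal sketch of a case where astype() does matter, assuming a hypothetical column that was read in as strings:

```python
import pandas as pd

# Hypothetical column whose numbers arrived as text
toy = pd.DataFrame({"hits": ["737", "22", "62"]})
print(toy["hits"].dtype)

# Convert the text values to 64-bit integers
toy["hits"] = toy["hits"].astype("int64")
print(toy["hits"].dtype)

# Arithmetic now works on numbers rather than on strings
total_hits = toy["hits"].sum()
print(total_hits)
```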


### Stratified sampling
We know that the distribution of values in the category_desc column in the volunteer dataset is uneven. If we wanted to train a model to try to predict category_desc, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.

#### Instructions

- Create a volunteer_X dataset with all of the columns except category_desc.
- Create a volunteer_y training labels dataset.
- Split up the volunteer_X dataset using scikit-learn's train_test_split function and passing volunteer_y into the stratify= parameter.
- Take a look at the category_desc value counts on the training labels.

In [23]:
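# Workaround: fill NaNs with a sentinel so that missing labels form their own
# class for stratify=; the sentinel is swapped back to NaN after the split
volunteer.fillna(123456789, inplace=True)
# Create a dataset with all columns except category_desc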
volunteer_X = volunteer.drop('category_desc', axis=1)

# Create a category_desc labels dataset
volunteer_y = volunteer[['category_desc']]

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X,
                                    volunteer_y, stratify = volunteer_y)
X_train = X_train.replace(123456789, np.nan) 
X_test = X_test.replace(123456789, np.nan)
y_train = y_train.replace(123456789, np.nan)
y_test = y_test.replace(123456789, np.nan)
volunteer = volunteer.replace(123456789, np.nan)

# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())

Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64
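
To see what stratify= does in isolation, here is a small sketch on hypothetical labels with an 80/20 class imbalance. The test_size and random_state values are illustrative choices, not part of the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 rows of "a", 20 rows of "b"
labels = pd.Series(["a"] * 80 + ["b"] * 20, name="label")
features = pd.DataFrame({"x": range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

# With stratification, each split keeps the 80/20 class ratio
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```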


# 2. Standardizing Data

This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance.

### Modeling without normalizing
Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the wine dataset. One of the columns, Proline, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.

The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (knn) as well as the X and y sets you need to fit and score on.

#### Instructions

- Split up the X and y sets into training and test sets using train_test_split().
- Use the knn model's fit() method on the X_train data and y_train labels, to fit the model to the data.
- Print out the knn model's score() on the X_test data and y_test labels to evaluate the model.

In [31]:
wine = pd.read_csv('data/wine_types.csv')
X = wine.drop('Type', axis = 1)
y = wine.Type
knn  = KNeighborsClassifier()

In [32]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test,y_test))

0.7111111111111111


### Log normalization in Python

In [33]:
# Print out the variance of the Proline column
print(wine.Proline.var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine.Proline)

# Check the variance of the normalized Proline column
print(wine.Proline_log.var())

99166.71735542428
0.17231366191842018
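
The effect is easy to reproduce on a handful of numbers. The sketch below uses hypothetical values rather than the wine data. Note that np.log() is only defined for strictly positive inputs.

```python
import numpy as np
import pandas as pd

# Hypothetical large-valued column, loosely modeled on Proline
toy = pd.DataFrame({"proline": [300.0, 500.0, 1000.0, 1500.0]})

# Natural log compresses large values far more than small ones
toy["proline_log"] = np.log(toy["proline"])

print(toy["proline"].var())
print(toy["proline_log"].var())
```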


### Scaling data - standardizing columns
Since we know that the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.

#### Instructions

- Import StandardScaler from sklearn.preprocessing.
- Create a StandardScaler() object and store it in a variable named ss.
- Create a subset of the wine DataFrame of the Ash, Alcalinity of ash, and Magnesium columns, and store it in a variable named wine_subset.
- Apply the ss.fit_transform method to the wine_subset DataFrame.

In [36]:
# Create the scaler
ss = StandardScaler()

# Take a subset of the DataFrame you want to scale 
wine_subset = wine[['Ash', 'Alcalinity of ash',  'Magnesium']]

# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)
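
A quick way to check what the scaler did is to look at the column means and standard deviations afterwards. This sketch uses two hypothetical columns on very different scales. After scaling, each should have a mean of about 0 and a standard deviation of about 1.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical columns: same shape, scales 100x apart
toy = pd.DataFrame({
    "small": [1.0, 2.0, 3.0, 4.0],
    "large": [100.0, 200.0, 300.0, 400.0],
})

ss = StandardScaler()
scaled = ss.fit_transform(toy)  # returns a NumPy array by default

# Wrap the array back into a DataFrame to keep the column names
scaled_df = pd.DataFrame(scaled, columns=toy.columns, index=toy.index)

print(scaled_df.mean().round(6))
print(scaled_df.std(ddof=0).round(6))
```

StandardScaler uses the population standard deviation, which is why the check passes ddof=0 to std().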

### KNN on non-scaled data
Let's first take a look at the accuracy of a K-nearest neighbors model on the wine dataset without standardizing the data. The knn model as well as the X and y data and labels sets have been created already. Most of this process of creating models in scikit-learn should look familiar to you.

#### Instructions

- Split the dataset into training and test sets using train_test_split().
- Use the knn model's fit() method on the X_train data and y_train labels, to fit the model to the data.
- Print out the knn model's score() on the X_test data and y_test labels to evaluate the model.


In [38]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.7333333333333333


### KNN on scaled data
The accuracy score on the unscaled wine dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous exercise, with the added step of scaling the data. Once again, the knn model as well as the X and y data and labels set have already been created for you.

#### Instructions

- Create a StandardScaler() object, stored in a variable named ss.
- Apply the ss.fit_transform method to the X dataset.
- Use the knn model's fit() method on the X_train data and y_train labels, to fit the model to the data.
- Print out the knn model's score() on the X_test data and y_test labels to evaluate the model.


In [39]:
# Create the scaling method.
ss = StandardScaler()

# Apply the scaling method to the dataset used for modeling.
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train,y_train)

# Score the model on the test data.
print(knn.score(X_test,y_test))

0.9777777777777777
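
The cell above fits the scaler on the full dataset before splitting, so statistics from the test rows influence the transformation. One common alternative is to put the scaler and the classifier in a pipeline, so that the scaler is fit on the training split only. This sketch assumes scikit-learn's bundled load_wine data is an acceptable stand-in for wine_types.csv. The stratify and random_state arguments are illustrative choices, not part of the original exercise.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine data stands in for data/wine_types.csv
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# fit() learns the scaling statistics from X_train only;
# score() reuses them to transform X_test before predicting
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)
```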


# 3. Feature Engineering

In this section you'll learn about feature engineering. You'll explore different ways to create new, more useful, features from the ones already in your dataset. You'll see how to encode, aggregate, and extract information from both numerical and textual features.

### Encoding categorical variables - binary
Take a look at the hiking dataset. Several of its columns need encoding before they can be modeled, one of which is the Accessible column. Accessible is a binary feature with two values, Y or N, so it needs to be encoded into 1s and 0s. Use scikit-learn's LabelEncoder class to do that transformation.

#### Instructions

- Store LabelEncoder() in a variable named enc.
- Using the encoder's fit_transform() method, encode the hiking dataset's "Accessible" column. Call the new column Accessible_enc.
- Compare the two columns side-by-side to see the encoding.

In [40]:
hiking = pd.read_json('data/hiking.json')

In [44]:
# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking['Accessible_enc'] = enc.fit_transform(hiking.Accessible)

# Compare the two columns
print(hiking[['Accessible_enc', 'Accessible']].head())

   Accessible_enc Accessible
0               1          Y
1               0          N
2               0          N
3               0          N
4               0          N
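
To see which label maps to which integer, you can inspect the fitted encoder's classes_ attribute. A small sketch on hypothetical values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column
toy = pd.DataFrame({"Accessible": ["Y", "N", "N", "Y"]})

enc = LabelEncoder()
toy["Accessible_enc"] = enc.fit_transform(toy["Accessible"])

# classes_ holds the original labels, ordered by their integer code
print(list(enc.classes_))
print(toy)
```

LabelEncoder sorts the labels, so N receives code 0 and Y receives code 1, matching the output above.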


### Encoding categorical variables - one-hot
One of the columns in the volunteer dataset, category_desc, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. Use Pandas' get_dummies() function to do so.

#### Instructions

- Call get_dummies() on the volunteer["category_desc"] column to create the encoded columns and assign it to category_enc.
- Print out the head() of the category_enc variable to take a look at the encoded columns.

In [45]:
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer.category_desc)

# Take a look at the encoded columns
print(category_enc.head())

   Education  Emergency Preparedness  Environment  Health  \
0          0                       0            0       0   
1          0                       0            0       0   
2          0                       0            0       0   
3          0                       0            0       0   
4          0                       0            1       0   

   Helping Neighbors in Need  Strengthening Communities  
0                          0                          0  
1                          0                          1  
2                          0                          1  
3                          0                          1  
4                          0                          0  
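
Row 0 in the output above is all zeros. That is what get_dummies() produces by default for a missing value, as this sketch on hypothetical data suggests. Recent pandas versions return boolean columns by default, so the sketch passes dtype=int to get the 0/1 display shown here.

```python
import pandas as pd

# Hypothetical categories, with one missing value
toy = pd.DataFrame({"category_desc": ["Education", "Health", None, "Education"]})

# One column per category; dtype=int forces 0/1 instead of True/False
encoded = pd.get_dummies(toy["category_desc"], dtype=int)
print(encoded)
```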


### Engineering numerical features - taking an average
A good use case for taking an aggregate statistic to create a new feature is to take the mean of columns. Here, you have a DataFrame of running times named running_times_5k. For each name in the dataset, take the mean of their 5 run times.

#### Instructions

- Create a list of the columns you want to take the average of and store it in a variable named run_columns.
- Use apply to take the mean() of the list of columns and remember to set axis=1. Use lambda row: in the apply.
- Print out the DataFrame to see the mean column.

In [77]:
running_dic = {'name': ['Sue', 'Mark', 'Sean', 'Erin', 'Jenny', 'Russell'],
             'run1': [20.1, 16.5, 23.5, 21.7, 25.8, 30.9],
             'run2': [18.5, 17.1, 25.1, 21.1, 27.1, 29.6],
             'run3': [19.6, 16.9, 25.2, 20.9, 26.1, 31.4],
             'run4': [20.3, 17.6, 24.6, 22.1, 26.7, 30.4],
             'run5': [18.3, 17.3, 23.9, 22.2, 26.9, 29.9]}
running_times_5k = pd.DataFrame(running_dic)

In [85]:
# Create a list of the columns to average
run_columns = [col for col in running_times_5k.columns  if running_times_5k[col].dtype == 'float64']

# Use apply to create a mean column
running_times_5k["mean"] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)

# Take a look at the results
print(running_times_5k)

      name  run1  run2  run3  run4  run5   mean
0      Sue  20.1  18.5  19.6  20.3  18.3  19.36
1     Mark  16.5  17.1  16.9  17.6  17.3  17.08
2     Sean  23.5  25.1  25.2  24.6  23.9  24.46
3     Erin  21.7  21.1  20.9  22.1  22.2  21.60
4    Jenny  25.8  27.1  26.1  26.7  26.9  26.52
5  Russell  30.9  29.6  31.4  30.4  29.9  30.44
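
The exercise asks for apply() with a lambda, but the same row-wise mean can be computed directly on the selected columns. A sketch on a hypothetical, shortened version of the running data:

```python
import pandas as pd

# Hypothetical subset of the running times
toy = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.1, 16.5],
    "run2": [18.5, 17.1],
    "run3": [19.6, 16.9],
})
run_columns = ["run1", "run2", "run3"]

# Row-wise mean with apply, as in the exercise
mean_apply = toy.apply(lambda row: row[run_columns].mean(), axis=1)

# Same result without a Python-level loop over rows
mean_vectorized = toy[run_columns].mean(axis=1)

print(mean_vectorized)
```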


### Engineering numerical features - datetime
There are several columns in the volunteer dataset comprised of datetimes. Let's take a look at the start_date_date column and extract just the month to use as a feature for modeling.

#### Instructions

- Use pandas' to_datetime() function on the volunteer["start_date_date"] column and store it in a new column called start_date_converted.
- To retrieve just the month, apply a lambda function to volunteer["start_date_converted"] that grabs the .month attribute from the row. Store this in a new column called start_date_month.
- Print the head() of just the start_date_converted and start_date_month columns.

In [86]:
# First, convert string column to date column
volunteer["start_date_converted"] = pd.to_datetime(volunteer["start_date_date"])

# Extract just the month from the converted column
volunteer["start_date_month"] = volunteer["start_date_converted"].apply(lambda row: row.month)

# Take a look at the converted and new month columns
print(volunteer[["start_date_converted", "start_date_month"]].head())


  start_date_converted  start_date_month
0           2011-07-30                 7
1           2011-02-01                 2
2           2011-01-29                 1
3           2011-02-14                 2
4           2011-02-05                 2
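
Once a column holds datetimes, the .dt accessor gives the same month values as the lambda. A sketch on two hypothetical dates:

```python
import pandas as pd

# Hypothetical date strings in ISO format
toy = pd.DataFrame({"start_date_date": ["2011-07-30", "2011-02-01"]})
toy["start_date_converted"] = pd.to_datetime(toy["start_date_date"])

# Month via apply and a lambda, as in the exercise
month_apply = toy["start_date_converted"].apply(lambda ts: ts.month)

# Month via the vectorized .dt accessor
month_dt = toy["start_date_converted"].dt.month

print(month_dt.tolist())
```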


### Engineering features from strings - extraction
The Length column in the hiking dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in Pandas to apply the extraction to the DataFrame.

#### Instructions

- Create a pattern that will extract numbers and decimals from text, using \d+ to get numbers and \. to get decimals, and pass it into re's compile function.
- Use re's match function to search the text, passing in the pattern and the length text.
- Use the group() method of the match object mile to extract the matched pattern, making sure to request group 0, and pass the result into float.
- Apply the return_mileage() function to the hiking["Length"] column.

In [98]:
import re

def return_mileage(length):
    if not isinstance(length, str):
        length = str(length)
    # Write a pattern to extract numbers and decimals
    pattern = re.compile(r"\d+\.\d+")
    
    # Search the text for matches
    mile = re.match(pattern, length)
    
    # If a value is returned, use group(0) to return the found value
    if mile is not None:
        return float(mile.group(0))
        
# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking["Length"].apply(lambda row: return_mileage(row))
print(hiking[["Length", "Length_num"]].head())

       Length  Length_num
0   0.8 miles        0.80
1    1.0 mile        1.00
2  0.75 miles        0.75
3   0.5 miles        0.50
4   0.5 miles        0.50
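
Two details of the function above are worth knowing. re.match() only looks at the start of the string, and the pattern requires a decimal point, so a value such as "3 miles" would return None. The sketch below is a hypothetical variant (return_mileage_loose and its pattern are not part of the exercise) that uses search() and makes the decimal part optional.

```python
import re

# Assumption: a more permissive pattern than the exercise's,
# so whole numbers such as "3 miles" are matched as well
MILEAGE = re.compile(r"\d+(?:\.\d+)?")

def return_mileage_loose(length):
    """Return the first number found in the text as a float, or None."""
    found = MILEAGE.search(str(length))
    return float(found.group(0)) if found else None

print(return_mileage_loose("0.75 miles"))
print(return_mileage_loose("3 miles"))
print(return_mileage_loose("about 1.5 miles"))
print(return_mileage_loose("unknown"))
```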


### Engineering features from strings - tf/idf
Let's transform the volunteer dataset's title column into a text vector, to use in a prediction task in the next exercise.

#### Instructions

- Store the volunteer["title"] column in a variable named title_text.
- Use the tfidf_vec vectorizer's fit_transform() function on title_text to transform the text into a tf-idf vector.

In [127]:
volunteer = volunteer[~volunteer.category_desc.isna()]
# Take the title text
title_text = volunteer["title"]

# Create the vectorizer method
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)
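
To get a feel for what the vectorizer returns, here is a sketch on three hypothetical titles. The result has one row per title and one column per distinct word. A word that appears in several titles gets a lower weight than a word that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles; "cleanup" appears in two of them
titles = ["park cleanup", "beach cleanup", "math tutoring"]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(titles)  # SciPy sparse matrix

print(text_tfidf.shape)
print(sorted(tfidf_vec.vocabulary_))

# Weights for the first title, "park cleanup"
vocab = tfidf_vec.vocabulary_
row0 = text_tfidf[0].toarray()[0]
print(row0[vocab["park"]], row0[vocab["cleanup"]])
```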

### Text classification using tf/idf vectors
Now that we've encoded the volunteer dataset's title column into tf/idf vectors, let's use those vectors to try to predict the category_desc column.

#### Instructions

- Using train_test_split, split the text_tfidf vector, along with your y variable, into training and test sets. Set the stratify parameter equal to y, since the class distribution is uneven. Notice that we have to run the toarray() method on the tf/idf vector in order to get it into the proper format for scikit-learn.
- Use Naive Bayes' fit() method on the X_train and y_train variables.
- Print out the score() of the X_test and y_test variables.


In [130]:
# Split the dataset according to the class distribution of category_desc
y = volunteer["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)

nb = GaussianNB()
# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))

0.5483870967741935
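
The toarray() call is needed here because GaussianNB expects dense input. As a point of comparison only, the sketch below swaps in MultinomialNB, a different Naive Bayes variant that accepts the sparse tf/idf matrix directly. The titles and labels are hypothetical, and this is not the model used in the exercise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training titles and labels
titles = [
    "park cleanup day", "beach cleanup crew", "river cleanup team",
    "math tutoring help", "reading tutoring club", "science tutoring night",
]
labels = ["Environment"] * 3 + ["Education"] * 3

vec = TfidfVectorizer()
X_sparse = vec.fit_transform(titles)  # kept sparse, no toarray()

nb = MultinomialNB()
nb.fit(X_sparse, labels)

# New titles must go through the already fitted vectorizer
new_titles = vec.transform(["harbor cleanup", "history tutoring"])
predictions = nb.predict(new_titles)
print(predictions)
```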


# 4. Selecting features for modeling

This chapter goes over a few different techniques for selecting the most important features from your dataset. You'll learn how to drop redundant features, work with text vectors, and reduce the number of features in your dataset using principal component analysis (PCA).


### Selecting relevant features
Now let's identify the redundant columns in the volunteer dataset and perform feature selection on the dataset to return a DataFrame of the relevant features.

For example, if you explore the volunteer dataset in the console, you'll see three features which are related to location: locality, region, and postalcode. They contain repeated information, so it would make sense to keep only one of the features.

There are also features that have gone through the feature engineering process: columns like Education and Emergency Preparedness are a product of encoding the categorical variable category_desc, so category_desc itself is redundant now.

Take a moment to examine the features of volunteer in the console, and try to identify the redundant features.

#### Instructions

- Create a list of redundant column names and store it in the to_drop variable:
  - Out of all the location-related features, keep only postalcode.
  - Features that have gone through the feature engineering process are redundant as well.
- Drop the columns from the dataset using .drop().
- Print out the .head() of the DataFrame to see the selected columns.


In [135]:
volunteer = pd.read_csv('data/volunteer.csv')

In [148]:
# Create a list of redundant column names to drop
to_drop = ["category_desc", "created_date", "locality", "region", "vol_requests"]

# Drop those columns from the dataset
volunteer_subset = volunteer.drop(to_drop, axis=1)

# Print out the head of the new dataset
print(volunteer_subset.head())

                                               title  hits  postalcode  \
0                                       Web designer    22     10010.0   
1      Urban Adventures - Ice Skating at Lasker Rink    62     10026.0   
2  Fight global hunger and support women farmers ...    14      2114.0   
3                                      Stop 'N' Swap    31     10455.0   
4                               Queens Stop 'N' Swap   135     11372.0   

   vol_requests_lognorm  created_month  Education  Emergency Preparedness  \
0              0.693147              1          0                       0   
1              2.995732              1          0                       0   
2              6.214608              1          0                       0   
3              2.708050              1          0                       0   
4              2.708050              1          0                       0   

   Environment  Health  Helping Neighbors in Need  Strengthening Communities  
0            
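
The same selection step can be sketched end to end on a hypothetical two-row DataFrame: encode category_desc, then drop it along with the location columns that repeat what postalcode already says.

```python
import pandas as pd

# Hypothetical rows loosely modeled on the volunteer dataset
toy = pd.DataFrame({
    "title": ["Web designer", "Stop 'N' Swap"],
    "locality": ["New York", "Bronx"],
    "region": ["NY", "NY"],
    "postalcode": [10010, 10455],
    "category_desc": ["Education", "Environment"],
})

# One-hot encode category_desc and attach the new columns
toy = toy.join(pd.get_dummies(toy["category_desc"], dtype=int))

# The original column and the extra location columns are now redundant
to_drop = ["category_desc", "locality", "region"]
toy_subset = toy.drop(columns=to_drop)

print(list(toy_subset.columns))
```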

# 5. Putting it all together

Now that you've learned all about preprocessing you'll try these techniques out on a dataset that records information on UFO sightings.