This course covers the basics of how and when to perform data preprocessing. This essential step in any machine learning project is when you get your data ready for modeling. Preprocessing comes into play after you import and clean your data and before you fit your machine learning model. You'll learn how to standardize your data so that it's in the right form for your model, create new features to best leverage the information in your dataset, and select the best features to improve your model fit. Finally, you'll practice preprocessing by getting a dataset on UFO sightings ready for modeling.

# Introduction to Data Preprocessing
  
In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring datatypes and dealing with missing data.

In [27]:
import pandas as pd

## Introduction to preprocessing
  
**What is data preprocessing?**
  
Data preprocessing comes after we've explored and cleaned our dataset, so we understand its contents, structure, and quality. Once we've explored our data, we'll probably have a good idea about how we'd like to model it. Having this idea early on will also help us decide how best to preprocess the data so it's ready for modeling. Think of preprocessing as a prerequisite for modeling. Recall that most machine learning models in Python require numerical features, so if our dataset contains categorical features, we'll need to transform them. This is a really common preprocessing step.
  
**Why preprocess?**  
  
The goal of preprocessing is not only to transform our dataset into a form that is suitable for modeling, but also to improve the performance of our models and, in turn, produce more reliable results.

**Recap: exploring data with pandas**
  
The files we'll be working with in this course should be recognizable, and we can use common pandas functions for importing, such as `pd.read_json()` and `pd.read_csv()`. One of the first steps after importing data is to inspect it, which we can do with the `.head()` method.
  
It's also useful to know what features are present in the dataset and what their datatypes are. We can quickly find this information using the `.info()` method, which provides other useful information including the number of rows and columns, and also the number of non-missing values in each column.
  
Finally, we can quickly generate some summary statistics about a DataFrame's features, such as the mean, standard deviation, and quartiles using the `.describe()` method.
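
As a rough sketch of these inspection steps, the example below reads a tiny made-up CSV from a string rather than a file. The column names and values are hypothetical and are not part of the course data.

```python
import io
import pandas as pd

# Hypothetical toy data, used only to illustrate the inspection methods
csv_text = """name,age,score
Ann,34,88.5
Bob,,92.0
Cy,29,
"""
df = pd.read_csv(io.StringIO(csv_text))

# First rows of the DataFrame
print(df.head())

# Column names, datatypes, and non-null counts
df.info()

# Summary statistics (count, mean, std, quartiles) for the numeric columns
summary = df.describe()
print(summary)
```

Note that `.describe()` skips missing values, so the `count` row can differ from column to column.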
  
**Removing missing data**
  
One of the first steps we can take to preprocess our data is to remove missing data. There are many ways to deal with missing data, but here we're only going to cover ways to remove either columns or rows containing missing data. The `.dropna()` method can be used to drop all rows containing missing values. This could be a good option if only a small number of rows contain missing data.
  
We can drop specific rows by passing index labels to the `.drop()` method, which defaults to dropping rows.
  
Usually we'll want to focus on dropping a particular column, especially if all or most of its values are missing. We can use the `.drop()` method here as well, though the arguments are different. The first argument is the column name to drop, in this case, A. We have to specify `axis=1` to designate that we want to drop a column rather than a row.
  
What if we want to drop rows where data is missing in a particular column? First, let's take a look at how many missing values we have in each column, using `.isna()` to identify NaN values, and then using `.sum()` to count them in each column. To filter out rows with missing values in particular columns, such as column B, we can pass a list of column labels to the `subset=` argument of `.dropna()`.
  
Finally, we can specify how many non-missing values we require in each row using the `thresh=` argument.
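
The removal options described above can be sketched on a small made-up DataFrame. The columns A, B, and C and their values are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with missing values in columns A, B, and C
df = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, np.nan],
    "B": [2.0, 5.0, np.nan, 7.0],
    "C": [3.0, 6.0, 9.0, np.nan],
})

# Drop every row that contains at least one missing value
no_missing_rows = df.dropna()

# Drop specific rows by index label (rows are the default axis)
without_first_rows = df.drop([0, 1])

# Drop column A, which is mostly missing
without_a = df.drop("A", axis=1)

# Count the missing values in each column
missing_counts = df.isna().sum()

# Drop only the rows where column B is missing
b_present = df.dropna(subset=["B"])

# Keep only the rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)
```

Each call returns a new DataFrame and leaves `df` unchanged, so the results need to be assigned to a variable to be kept.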

### Exploring missing data
  
You've been given a dataset comprised of volunteer information from New York City, stored in the volunteer DataFrame. Explore the dataset using the plethora of methods and attributes pandas has to offer to answer the following question.

How many missing values are in the locality column?  
70

In [28]:
volunteer = pd.read_csv('../_datasets/volunteer_opportunities.csv')
volunteer.head()

Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,...,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,...,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,...,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,...,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,...,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,February 05 2011,approved,,,,,,,,


In [29]:
volunteer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 35 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   opportunity_id      665 non-null    int64  
 1   content_id          665 non-null    int64  
 2   vol_requests        665 non-null    int64  
 3   event_time          665 non-null    int64  
 4   title               665 non-null    object 
 5   hits                665 non-null    int64  
 6   summary             665 non-null    object 
 7   is_priority         62 non-null     object 
 8   category_id         617 non-null    float64
 9   category_desc       617 non-null    object 
 10  amsl                0 non-null      float64
 11  amsl_unit           0 non-null      float64
 12  org_title           665 non-null    object 
 13  org_content_id      665 non-null    int64  
 14  addresses_count     665 non-null    int64  
 15  locality            595 non-null    object 
 16  region  

In [30]:
# Count the missing values in the requested feature
volunteer.locality.isna().sum()

70

Exploring your data is a crucial first step before preprocessing. Time to start removing missing data!

### Dropping missing data
  
Now that you've explored the volunteer dataset and understand its structure and contents, it's time to begin dropping missing values.
  
In this exercise, you'll drop both columns and rows to create a subset of the volunteer dataset.
  
1. Drop the Latitude and Longitude columns from volunteer, storing as volunteer_cols.
  
2. Subset volunteer_cols by dropping rows containing missing values in the category_desc, and store in a new variable called volunteer_subset.
  
3. Take a look at the .shape attribute of volunteer_subset, to verify it worked correctly.

In [31]:
# Drop the Latitude and Longitude columns from volunteer
volunteer_cols = volunteer.drop(['Latitude', 'Longitude'], axis=1)

# Drop rows with missing category_desc values from volunteer_cols
volunteer_subset = volunteer_cols.dropna(subset=['category_desc'])

# Print out the shape of the subset
print(volunteer_subset.shape)

(617, 33)


Remember that you can use Boolean indexing to effectively subset DataFrames.
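
As a small illustration of that point, the sketch below uses a hypothetical stand-in for the volunteer data (the rows are made up) to show that a Boolean mask built with `.notna()` selects the same rows as `.dropna()` with `subset=`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the volunteer data
toy = pd.DataFrame({
    "title": ["Web designer", "Stop 'N' Swap", "Ice skating"],
    "category_desc": ["Strengthening Communities", "Environment", np.nan],
})

# Boolean mask: True where category_desc is present
mask = toy["category_desc"].notna()

# Keep only the rows where the mask is True
subset_bool = toy[mask]

# The same rows, selected with dropna and subset=
subset_dropna = toy.dropna(subset=["category_desc"])

print(subset_bool)
```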

## Working with datatypes
  
Now that we've reviewed pandas techniques for exploring data and dropping missing values, we need to start thinking about other steps we have to take in order to prepare data for modeling.
  
**Why are types important?**
  
One of these steps is to think about the types that are present in your dataset, because we'll likely have to transform some of these columns to other types later on. Recall that you can check the types of a DataFrame by using the `.info()` method. Pandas datatypes are similar to native Python types, but there are a couple of subtle differences. 
  
The most common types are `object`, `int64`, `float64`, and `datetime64` types. 
  
The `object` type is what pandas uses to refer to a column that consists of string values or contains a mixture of types. `int64` and `float64` correspond to the Python integer and float types, where the 64 refers to the amount of memory allotted for storing each value, in this case, the number of bits.
  
`datetime64` is another common datatype that stores date and time data. This special datatype unlocks a bunch of extra functionality for working with time-series data, such as datetime indexing, adding timezone information, and selecting a datetime sampling frequency. 
  
For this course, though, we'll stick to objects, integers, and floats. Before any preprocessing can begin, we have to understand the datatypes of our features. Sometimes, when importing datasets, pandas accidentally assigns an incorrect or inappropriate datatype to a column, which will need to be converted.
  
**Converting column types**
  
Let's take a look at how to convert the type of a column if pandas has inferred its type incorrectly. Here we have a simple dataset with a couple of columns. If we call the `.info()` method, we can see that the type for column C is object. However, if we look at this DataFrame, we can see that C contains float values: numbers with decimal points. If we want to preprocess and model this data, we're going to have to convert the column type.
  
The pandas `.astype()` method can be used to convert a column's datatype to a specified datatype. We need to reassign the column to overwrite the original datatype when converting it, as shown here. Before converting a column, be extra careful that all of the values it contains can be appropriately converted into this new datatype.
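
A minimal sketch of such a conversion, assuming a made-up DataFrame in which column C holds numbers stored as strings (the data is hypothetical, chosen to mirror the description above):

```python
import pandas as pd

# Hypothetical DataFrame where column C holds floats stored as strings
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": ["x", "y", "z"],
    "C": ["3.5", "4.25", "7.0"],
})

# Datatypes before the conversion: C is a string-like type, not float
print(df.dtypes)

# Convert C to float and reassign to overwrite the original column
df["C"] = df["C"].astype("float")

# Datatypes after the conversion
print(df.dtypes)
```

If any value in the column cannot be parsed as a number, `.astype("float")` raises an error instead of converting, which is why it pays to check the values first.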

### Exploring data types
  
Taking another look at the dataset comprised of volunteer information from New York City, you want to know what types you'll be working with as you start to do more preprocessing.
  
Which datatypes are present in the volunteer dataset?  
object, float64, int64  

In [32]:
volunteer.dtypes.value_counts()

object     14
float64    13
int64       8
dtype: int64

### Converting a column type
  
If you take a look at the volunteer dataset types, you'll see that the column hits is type object. But, if you actually look at the column, you'll see that it consists of integers. Let's convert that column to type int.
  
1. Take a look at the `.head()` of the hits column.
  
2. Convert the hits column to type int.
  
3. Take a look at the `.dtypes` of the dataset again, and notice that the column type has changed.

In [33]:
volunteer.head()

Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,...,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,...,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,...,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,...,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,...,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,February 05 2011,approved,,,,,,,,


In [34]:
# Convert the hits column to type int
volunteer['hits'] = volunteer['hits'].astype('int')
volunteer.dtypes

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL       

You can use `.astype()` to convert between a variety of types.

## Training and test sets
  
One of the key steps for preprocessing that you should be familiar with is splitting the data into training and test sets.
  
**Why split?**
  
We split our dataset into training and test sets for a few main reasons. First, it reduces the risk of overfitting going unnoticed. Recall that overfitting arises when the model fits the training data too closely, resulting in poor performance when predicting on unseen data. Second, if we train a model on our entire set of data, we won't have any way to test and validate our model, as the model will essentially know the dataset by heart. Holding out a test set allows us to preserve some data the model hasn't seen yet, so we can evaluate the model's performance on unseen data.
  
**Splitting up your dataset**
  
The `train_test_split()` function from `sklearn.model_selection` is used to randomly shuffle and then split the features and labels, stored in X and y, into training and test sets. X_train and X_test are the training and test features, and y_train and y_test are the training and test labels. 
  
It's good practice to specify the `random_state=` argument, so we can reproduce the exact same splits if needed. By default, the function will split 75% of the data into the training set and 25% into the test set, but we can adjust the proportion of the data assigned to the test set with the `test_size=` argument. In many scenarios, the default splitting parameters will work well.
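
The sketch below illustrates the default 75/25 split and a custom `test_size=`. The feature array and labels are hypothetical placeholders, not course data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and labels: 100 samples with 2 features each
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 50 + [1] * 50)

# Default split: 75% training, 25% test; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, X_test.shape)

# Hold out 20% of the data instead by setting test_size
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train2.shape, X_test2.shape)
```

Rerunning either call with the same `random_state=` value returns the same rows in each set.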
  
However, if our labels have an uneven distribution, where one label is much more common than another, the test and training sets might not be representative samples of the dataset, which could bias the model we're trying to train. This is called class imbalance. For example, in the training and test sets shown here, we can see that the training set has only samples labeled n, while there is a y label in the test set.
  
**Stratified sampling**
  
A good technique for sampling accurately when you have imbalanced classes is stratified sampling, which is a way of sampling that takes into account the distribution of classes in the dataset. Let's say we have a dataset with 100 samples, 80 of which are class 1 and 20 of which are class 2. We want the class distribution in both our training set and our test set to reflect this, so in both our training and test sets, we'd want 80% of our sample to be class 1 and 20% to be class 2, which means we'd want 60 class 1 samples and 15 class 2 samples in our training set of 75 samples. In our test set of 25 samples, we want to have 20 class 1 samples and 5 of class 2. This is on par with the distribution of classes in the original dataset.
  
There's a nice way to do this using the `train_test_split()` function. The function has a `stratify=` parameter; to stratify according to class labels, pass the dataset labels, y, to that argument. In this example, the dataset contains 100 labels, 80 of which are class 1 and 20 of which are class 2. Running `train_test_split()` and stratifying on the class labels creates training and test labels with the same distribution of classes.
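
The 80/20 example above can be sketched as follows. The single `feature` column and the label name are assumptions made for illustration; only the class counts come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset matching the example: 80 class 1 labels, 20 class 2 labels
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([1] * 80 + [2] * 20, name="label")

# Stratify on y so both splits keep the 80/20 class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Class counts in each split
train_counts = y_train.value_counts()
test_counts = y_test.value_counts()
print(train_counts)
print(test_counts)
```

With the default 75/25 split, the training labels contain 60 class 1 and 15 class 2 samples, and the test labels contain 20 and 5, matching the arithmetic worked out above.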

### Class imbalance
  
In the volunteer dataset, you're thinking about trying to predict the category_desc variable using the other features in the dataset. First, though, you need to know what the class distribution (and imbalance) is for that label.
  
Which descriptions occur fewer than 50 times in the volunteer dataset?  
Emergency Preparedness and Environment  

In [35]:
volunteer.category_desc.value_counts()

Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64

Both Emergency Preparedness and Environment occur fewer than 50 times.

### Stratified sampling
  
You now know that the distribution of class labels in the category_desc column of the volunteer dataset is uneven. If you want to train a model to predict category_desc, you'll need to ensure that the model is trained on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this!  
  
1. Create a DataFrame of features, X, with all of the columns except category_desc.
  
2. Create a DataFrame of labels, y from the category_desc column.
  
3. Split X and y into training and test sets, ensuring that the class distribution in the labels is the same in both sets.
  
4. Print the labels and counts in y_train using `.value_counts()`.

In [37]:
from sklearn.model_selection import train_test_split

# Drop rows with a missing label first, so X and y stay aligned
volunteer_labeled = volunteer.dropna(subset=['category_desc'], axis=0)

# Create a DataFrame with all columns except category_desc
X = volunteer_labeled.drop('category_desc', axis=1)

# Create a category_desc labels dataset
y = volunteer_labeled[['category_desc']]

# Use stratified sampling to split up the dataset according to the y dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Print the category_desc counts from y_train
print(y_train['category_desc'].value_counts())

Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64


One of the perks of keeping the X/y splits as DataFrames when possible is that you can run summary statistics on both the X/y split and the train/test splits.