# <font color=blue>Preprocessing for Machine Learning in Python</font> 

This course covers the basics of how and when to perform data preprocessing, the step in a machine learning project where you get your data ready for modeling. Preprocessing sits between importing and cleaning your data and fitting your machine learning model. You'll learn how to standardize your data so that it's in the right form for your model, create new features that make the most of the information in your dataset, and select the features that improve your model fit. Finally, you'll practice preprocessing by getting a dataset on UFO sightings ready for modeling.

## <font color=red>01 - Introduction to Data Preprocessing</font> 

 In this chapter you'll learn what data preprocessing involves: removing missing data, checking and converting column types, and splitting a dataset with stratified sampling so that imbalanced classes are represented in both the training and test sets. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
import re

In [2]:
volunteer = pd.read_csv('./data/volunteer_opportunities.csv')
print(volunteer.shape)

(665, 35)


### Missing data - columns

<div><p>We have a dataset of volunteer information from New York City. The dataset has a number of features, but we want to keep only the features that have at least 3 non-missing values, which is what <code>thresh=3</code> requires. </p>
<p>How many features are in the original dataset, and how many features are in the set after columns with fewer than 3 non-missing values are removed?</p>
<ul>
<li>The dataset <code>volunteer</code> has been provided.</li>
<li>Use the <code>dropna()</code> function to remove columns.</li>
<li>You'll have to set both the <code>axis=</code> and <code>thresh=</code> parameters.</li>
</ul></div>


In [3]:
print(volunteer.shape)
print(volunteer.dropna(axis=1, thresh=3).shape)

(665, 35)
(665, 24)
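
Because `thresh` counts non-missing values rather than missing ones, it is easy to misread. The sketch below uses a small made-up DataFrame (the column names and values are illustrative, not taken from the volunteer data) to show which columns survive `thresh=3`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): "b" has 1 non-missing value, "c" has 3
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, 5.0, 6.0, 7.0],
})

# thresh=3 keeps the columns that have at least 3 non-missing values
kept = df.dropna(axis=1, thresh=3)
print(list(kept.columns))

# Count of non-missing values per column, for comparison
print(df.notnull().sum())
```

Column `b` is dropped because it has only one non-missing value, while `c` stays even though it contains a missing value.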


### Missing data - rows

<div><p>Taking another look at the <code>volunteer</code> dataset, we want to drop rows where the <code>category_desc</code> value is missing. We'll do this with boolean indexing: count the null values in the column, then filter the dataset so that only rows with a non-null <code>category_desc</code> remain.</p></div>

In [4]:
volunteer['category_desc'].head()

0                          NaN
1    Strengthening Communities
2    Strengthening Communities
3    Strengthening Communities
4                  Environment
Name: category_desc, dtype: object

In [5]:
# Check how many values are missing in the category_desc column
print(volunteer['category_desc'].isnull().sum())

# Subset the volunteer dataset
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]

# Print out the shape of the subset
print(volunteer_subset.shape)

48
(617, 35)


### Exploring data types

<p>Taking another look at the dataset comprised of volunteer information from New York City, we want to know what types we'll be working with as we start to do more preprocessing.</p>
<p>Which data types are present in the <code>volunteer</code> dataset?</p>
<ul>
<li>The dataset <code>volunteer</code> has been provided.</li>
<li>Use the <code>.dtypes</code> attribute to check the datatypes.</li>
</ul>

In [6]:
volunteer.dtypes

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL       

### Converting a column type

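<p>The exercise assumes that the column <code>hits</code> in the <code>volunteer</code> dataset is read in as type <code>object</code> even though it consists of integers. In the copy loaded here it is already <code>int64</code> (see the <code>dtypes</code> output above), so the cast below mainly illustrates <code>astype()</code>; in the run shown here it produces <code>int32</code>. Let's convert that column to type <code>int</code>.</p>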

In [7]:
# Print the head of the hits column
print(volunteer["hits"].head(), end='\n\n')

# Convert the hits column to type int
volunteer["hits"] = volunteer["hits"].astype('int')

# Look at the dtypes of the dataset
print(volunteer.dtypes)

0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int64

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int32
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
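
As a minimal sketch of the conversion on a column that really is stored as strings, the example below uses made-up values. The `int32` result above is probably because `astype('int')` follows the platform's default integer width in some NumPy versions; passing an explicit width such as `'int64'` avoids that. `pd.to_numeric` with `errors="coerce"` is a standard pandas option for columns that contain unparseable entries.

```python
import pandas as pd

# Hypothetical column read in as strings (dtype object)
hits = pd.Series(["737", "22", "62"], dtype="object")

# An explicit width avoids a platform-dependent int32/int64 result
as_int = hits.astype("int64")
print(as_int.dtype)

# errors="coerce" turns unparseable entries into NaN instead of raising
messy = pd.Series(["14", "n/a", "31"])
coerced = pd.to_numeric(messy, errors="coerce")
print(coerced)
```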

### Class imbalance

<p>In the <code>volunteer</code> dataset, we're thinking about trying to predict the <code>category_desc</code> variable using the other features in the dataset. First, though, we need to know what the class distribution (and imbalance) is for that label.</p>
<p>Which descriptions occur fewer than 50 times in the <code>volunteer</code> dataset?</p>
<ul>
<li>The dataset <code>volunteer</code> has been provided.</li>
<li>The column you want to check is <code>category_desc</code>.</li>
<li>Use the <code>value_counts()</code> method to check variable counts.</li>
</ul>


In [8]:
volunteer['category_desc'].value_counts()

Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64

### Stratified sampling

<p>We know that the distribution of classes in the <code>category_desc</code> column in the <code>volunteer</code> dataset is uneven. If we wanted to train a model to predict <code>category_desc</code>, we would want to train it on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this: it keeps each class's share roughly the same in the training and test sets.</p>

In [9]:
# Create a DataFrame with all columns except category_desc
# (volunteer_subset is used because stratifying needs labels without missing values)
volunteer_X = volunteer_subset.drop('category_desc', axis=1)

# Create a category_desc labels dataset
volunteer_y = volunteer_subset[['category_desc']]

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)

# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())

Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64
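
To see what `stratify` does in isolation, here is a small sketch on made-up labels (80 of one class, 20 of another; the names `features` and `labels` are illustrative). Both splits keep the original 80/20 ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class "a", 20 of class "b"
labels = pd.Series(["a"] * 80 + ["b"] * 20, name="label")
features = pd.DataFrame({"x": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, stratify=labels, test_size=0.25, random_state=0
)

# Class shares in the training and test labels
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```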


## <font color=red>02 - Standardizing Data </font> 

 This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance. 

### When to standardize

Now that you've learned when it is appropriate to standardize your data, in which of these scenarios would you NOT want to standardize?

- A column you want to use for modeling has extremely high variance.
- You have a dataset with several continuous columns on different scales and you'd like to use a linear model to train the data.
- The models you're working with use some sort of distance metric in a linear space, like the Euclidean metric.
- **Your dataset is comprised of categorical data.**

In [10]:
wine = pd.read_csv('./data/wine_types.csv')
X = wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]
y = wine['Type']

In [11]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 4 columns):
Proline                 178 non-null int64
Total phenols           178 non-null float64
Hue                     178 non-null float64
Nonflavanoid phenols    178 non-null float64
dtypes: float64(3), int64(1)
memory usage: 5.6 KB


### Modeling without normalizing

<p>Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the <code>wine</code> dataset. One of the columns, <code>Proline</code>, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.</p>
<p>The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (<code>knn</code>) as well as the <code>X</code> and <code>y</code> sets you need to fit and score on.</p>

In [12]:
knn = KNeighborsClassifier()

# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.7333333333333333


### Checking the variance

<p>Check the variance of the columns in the <code>wine</code> dataset. Out of the four columns listed in the multiple choice section, which column is a candidate for normalization?</p>

- Alcohol
- **Proline**
- Proanthocyanins
- Ash

In [13]:
print(np.var(wine.Alcohol))
print(np.var(wine.Proline))
print(np.var(wine.Proanthocyanins))
print(np.var(wine.Ash))

0.6553597304633259
98609.60096578706
0.32575424820098453
0.07484180027774268


### Log normalization in Python

<p>Now that we know that the <code>Proline</code> column in our wine dataset has a large amount of variance, let's log normalize it.</p>
<p><code>Numpy</code> has been imported as <code>np</code> in your workspace.</p>

In [14]:
# Print out the variance of the Proline column
print(wine['Proline'].var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the Proline column again
print(wine['Proline_log'].var())

99166.71735542428
0.17231366191842018
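
The two variance figures printed for `Proline` in this notebook differ slightly (98609.6 from `np.var`, 99166.7 from `.var()`) because NumPy defaults to the population variance (`ddof=0`) while pandas defaults to the sample variance (`ddof=1`). The sketch below reproduces that relationship on made-up values and shows the effect of the log transform.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values standing in for a column like Proline
values = pd.Series([290.0, 520.0, 680.0, 1050.0, 1680.0])

# NumPy default: population variance (divide by n)
pop_var = np.var(values.to_numpy())
# pandas default: sample variance (divide by n - 1)
sample_var = values.var()
print(pop_var, sample_var)

# Variance after the log transform, on the same pandas default
log_var = np.log(values).var()
print(log_var)
```

With n values, the sample variance equals the population variance multiplied by n / (n - 1), which matches the ratio of the two Proline figures for n = 178.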


### Scaling data - investigating columns

<p>We want to use the <code>Ash</code>, <code>Alcalinity of ash</code>, and <code>Magnesium</code> columns in the wine dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model. Using <code>describe()</code> to return descriptive statistics about this dataset, which of the following statements are true about the scale of data in these columns?</p>

- The max of Ash is 3.23, the max of Alcalinity of ash is 30, and the max of Magnesium is 162.
- The means of Ash and Alcalinity of ash are less than 20, while the mean of Magnesium is greater than 90.
- The standard deviations of Ash and Alcalinity of ash are equal.
- __1 and 2 are true.__

In [15]:
wine[['Ash', 'Alcalinity of ash', 'Magnesium']].describe()

| | Ash | Alcalinity of ash | Magnesium |
|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 |
| mean | 2.366517 | 19.494944 | 99.741573 |
| std | 0.274344 | 3.339564 | 14.282484 |
| min | 1.36 | 10.6 | 70.0 |
| 25% | 2.21 | 17.2 | 88.0 |
| 50% | 2.36 | 19.5 | 98.0 |
| 75% | 2.5575 | 21.5 | 107.0 |
| max | 3.23 | 30.0 | 162.0 |


### Scaling data - standardizing columns

<p>Since we know that the <code>Ash</code>, <code>Alcalinity of ash</code>, and <code>Magnesium</code> columns in the wine dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.</p>

In [16]:
pd.options.display.float_format = '{:.3f}'.format

# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

# Create the scaler
ss = StandardScaler()

# Take a subset of the DataFrame you want to scale 
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium' ]]

# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)

wine_subset_scaled_df = pd.DataFrame(wine_subset_scaled, columns=['Ash','Alcalinity of ash', 'Magnesium'])
wine_subset_scaled_df.describe()

| | Ash | Alcalinity of ash | Magnesium |
|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 |
| mean | -0.0 | -0.0 | -0.0 |
| std | 1.003 | 1.003 | 1.003 |
| min | -3.679 | -2.671 | -2.088 |
| 25% | -0.572 | -0.689 | -0.824 |
| 50% | -0.024 | 0.002 | -0.122 |
| 75% | 0.698 | 0.602 | 0.51 |
| max | 3.156 | 3.155 | 4.371 |
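
`StandardScaler` subtracts each column's mean and divides by its population standard deviation (`ddof=0`). The sketch below checks that by hand on made-up data and shows why `describe()` reports a standard deviation slightly above 1: pandas uses `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements on very different scales
df = pd.DataFrame({
    "small": [1.0, 2.0, 3.0, 4.0],
    "large": [100.0, 300.0, 200.0, 400.0],
})

scaled = StandardScaler().fit_transform(df)

# Same result by hand: subtract the mean, divide by the population std
manual = (df - df.mean()) / df.std(ddof=0)
print(np.allclose(scaled, manual.to_numpy()))

# pandas' default std uses ddof=1, so it reports sqrt(n / (n - 1)) rather than 1
scaled_std = pd.DataFrame(scaled, columns=df.columns).std()
print(scaled_std)
```

For the 178 wine rows, sqrt(178 / 177) is about 1.003, which is the value shown in the table above.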


### KNN on non-scaled data

<p>Let's first take a look at the accuracy of a K-nearest neighbors model on the <code>wine</code> dataset without standardizing the data. The <code>knn</code> model as well as the <code>X</code> and <code>y</code> data and labels sets have been created already. Most of this process of creating models in scikit-learn should look familiar to you.</p>

In [17]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state= 23)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.6666666666666666


### KNN on scaled data

<p>The accuracy score on the unscaled <code>wine</code> dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous exercise, with the added step of scaling the data. Once again, the <code>knn</code> model as well as the <code>X</code> and <code>y</code> data and labels set have already been created for you.</p>

In [18]:
# Create the scaling method.
ss = StandardScaler()

# Apply the scaling method to the dataset used for modeling.
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state= 23)

# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train, y_train)

# Score the model on the test data.
print(knn.score(X_test, y_test))

0.9333333333333333
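
The cell above scales the full `X` before splitting, so the test rows contribute to the scaler's mean and standard deviation. A common alternative, sketched here as one possible approach rather than the course's method, is to put the scaler and the model in a pipeline so the scaler is fit on the training rows only. The example uses scikit-learn's bundled wine data as a stand-in for `wine_types.csv`, which is an assumption about the two being comparable.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled wine data (stand-in for the CSV used in this notebook)
data = load_wine()
X_all, y_all = data.data, data.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, stratify=y_all, random_state=23
)

# fit() learns the scaling from the training rows only;
# score() reuses that scaling on the test rows
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_tr, y_tr)
score = model.score(X_te, y_te)
print(score)
```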


## <font color=red>03 -  Feature Engineering </font> 

 This chapter is about feature engineering: creating new features from existing columns. You'll encode categorical variables, aggregate numerical columns, extract parts of dates, and turn text into numbers with regular expressions and tf/idf vectors. 

### Feature engineering knowledge test

Now that you've learned about feature engineering, which of the following examples are good candidates for creating new features?

- A column of timestamps
- A column of newspaper headlines
- A column of weight measurements
- __1 and 2__
- None of the above

### Identifying areas for feature engineering

<p>Take an exploratory look at the <code>volunteer</code> dataset, using the variable of that name. Which of the following columns would you want to perform a feature engineering task on?</p>

- vol_requests
- title
- created_date
- category_desc
- __2, 3, and 4__

In [19]:
volunteer[['vol_requests', 'title' , 'created_date', 'category_desc']].head()

| | vol_requests | title | created_date | category_desc |
|---|---|---|---|---|
| 0 | 50 | Volunteers Needed For Rise Up & Stay Put! Home... | January 13 2011 | |
| 1 | 2 | Web designer | January 14 2011 | Strengthening Communities |
| 2 | 20 | Urban Adventures - Ice Skating at Lasker Rink | January 19 2011 | Strengthening Communities |
| 3 | 500 | Fight global hunger and support women farmers ... | January 21 2011 | Strengthening Communities |
| 4 | 15 | Stop 'N' Swap | January 28 2011 | Environment |


In [20]:
import json
with open('./data/hiking.json', 'r') as json_file:
    json_data = json.load(json_file,)
hiking = pd.DataFrame.from_dict(json_data)

In [21]:
hiking.head()

| | Accessible | Difficulty | Length | Limited_Access | Location | Name | Other_Details | Park_Name | Prop_ID | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Y | | 0.8 miles | N | Enter behind the Salt Marsh Nature Center, loc... | Salt Marsh Nature Trail | \<p>The first half of this mile-long trail foll... | Marine Park | B057 | | |
| 1 | N | Easy | 1.0 mile | N | Enter Park at Lincoln Road and Ocean Avenue en... | Lullwater | Explore the Lullwater to see how nature thrive... | Prospect Park | B073 | | |
| 2 | N | Easy | 0.75 miles | N | Enter Park at Lincoln Road and Ocean Avenue en... | Midwood | Step back in time with a walk through Brooklyn... | Prospect Park | B073 | | |
| 3 | N | Easy | 0.5 miles | N | Enter Park at Lincoln Road and Ocean Avenue en... | Peninsula | Discover how the Peninsula has changed over th... | Prospect Park | B073 | | |
| 4 | N | Easy | 0.5 miles | N | Enter Park at Lincoln Road and Ocean Avenue en... | Waterfall | Trace the source of the Lake on the Waterfall ... | Prospect Park | B073 | | |


### Encoding categorical variables - binary

<p>Take a look at the <code>hiking</code> dataset. Several columns need to be encoded before they can be modeled, one of which is the <code>Accessible</code> column. <code>Accessible</code> is a binary feature with two values, <code>Y</code> or <code>N</code>, so it needs to be encoded into 1s and 0s. Use scikit-learn's <code>LabelEncoder</code> class to do that transformation.</p>

In [22]:
# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])

# Compare the two columns
print(hiking[['Accessible', 'Accessible_enc']].head())

  Accessible  Accessible_enc
0          Y               1
1          N               0
2          N               0
3          N               0
4          N               0
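
`LabelEncoder` assigns integer codes in sorted order of the labels, which is why `N` becomes 0 and `Y` becomes 1 above. A minimal sketch on a made-up column shows the fitted `classes_` attribute and how `inverse_transform` maps the codes back.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column
accessible = pd.Series(["Y", "N", "N", "Y"])

enc = LabelEncoder()
codes = enc.fit_transform(accessible)

# classes_ is sorted, so "N" -> 0 and "Y" -> 1
print(enc.classes_)
print(codes)

# inverse_transform recovers the original labels from the codes
restored = enc.inverse_transform(codes)
print(restored)
```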


### Encoding categorical variables - one-hot

<p>One of the columns in the <code>volunteer</code> dataset, <code>category_desc</code>, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. Use Pandas' <code>get_dummies()</code> function to do so.</p>

In [23]:
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer["category_desc"])

# Take a look at the encoded columns
display(category_enc.head())

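| | Education | Emergency Preparedness | Environment | Health | Helping Neighbors in Need | Strengthening Communities |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 |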
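
Row 0 above is all zeros because its `category_desc` is missing: by default `get_dummies()` sets no indicator for a missing value. The sketch below, on made-up categories, shows that behaviour and the `dummy_na=True` option that adds an explicit indicator column for missing values. Recent pandas versions return boolean columns, so the example casts to `int` for display.

```python
import numpy as np
import pandas as pd

# Hypothetical category column with one missing value
cats = pd.Series(["Health", np.nan, "Education", "Health"], name="category_desc")

# Default: the missing value becomes a row with no category set
dummies = pd.get_dummies(cats)
print(dummies.astype(int))

# dummy_na=True adds an indicator column for missing values
with_na = pd.get_dummies(cats, dummy_na=True)
print(with_na.astype(int))
```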


### Engineering numerical features - taking an average

<p>A good use case for taking an aggregate statistic to create a new feature is to take the mean of columns. Here, you have a DataFrame of running times named <code>running_times_5k</code>. For each <code>name</code> in the dataset, take the mean of their 5 run times.</p>

In [24]:
running_times_5k = pd.read_csv('./data/runs.csv')
display(running_times_5k)

# Create a list of the columns to average
run_columns = ['run1','run2','run3','run4','run5']

# Use apply to create a mean column
running_times_5k["mean"] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)

# Take a look at the results
display(running_times_5k)

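| | name | run1 | run2 | run3 | run4 | run5 |
|---|---|---|---|---|---|---|
| 0 | Ali | 15.5 | 16.8 | 19.5 | 17.6 | 14.5 |
| 1 | Veli | 15.5 | 13.8 | 19.6 | 13.6 | 15.5 |
| 2 | Meria | 15.8 | 17.8 | 18.9 | 17.8 | 14.5 |

| | name | run1 | run2 | run3 | run4 | run5 | mean |
|---|---|---|---|---|---|---|---|
| 0 | Ali | 15.5 | 16.8 | 19.5 | 17.6 | 14.5 | 16.78 |
| 1 | Veli | 15.5 | 13.8 | 19.6 | 13.6 | 15.5 | 15.6 |
| 2 | Meria | 15.8 | 17.8 | 18.9 | 17.8 | 14.5 | 16.96 |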
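
The same row-wise mean can be computed without `apply` by selecting the run columns and averaging across `axis=1`. This sketch rebuilds the three rows shown above (the variable name `runs` is illustrative).

```python
import pandas as pd

# The three rows shown in the output above
runs = pd.DataFrame({
    "name": ["Ali", "Veli", "Meria"],
    "run1": [15.5, 15.5, 15.8],
    "run2": [16.8, 13.8, 17.8],
    "run3": [19.5, 19.6, 18.9],
    "run4": [17.6, 13.6, 17.8],
    "run5": [14.5, 15.5, 14.5],
})
run_columns = ["run1", "run2", "run3", "run4", "run5"]

# Select the run columns, then average across each row
runs["mean"] = runs[run_columns].mean(axis=1)
print(runs)
```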


### Engineering numerical features - datetime

<p>There are several columns in the <code>volunteer</code> dataset comprised of datetimes. Let's take a look at the <code>start_date_date</code> column and extract just the month to use as a feature for modeling.</p>

In [25]:
display(volunteer[["start_date_date"]].head())

# First, convert string column to date column
volunteer["start_date_converted"] = pd.to_datetime(volunteer["start_date_date"])

# Extract just the month from the converted column
volunteer["start_date_month"] = volunteer["start_date_converted"].apply(lambda row: row.month)

# Take a look at the original and new columns
display(volunteer[['start_date_date','start_date_converted', 'start_date_month']].head())

| | start_date_date |
|---|---|
| 0 | July 30 2011 |
| 1 | February 01 2011 |
| 2 | January 29 2011 |
| 3 | February 14 2011 |
| 4 | February 05 2011 |

| | start_date_date | start_date_converted | start_date_month |
|---|---|---|---|
| 0 | July 30 2011 | 2011-07-30 | 7 |
| 1 | February 01 2011 | 2011-02-01 | 2 |
| 2 | January 29 2011 | 2011-01-29 | 1 |
| 3 | February 14 2011 | 2011-02-14 | 2 |
| 4 | February 05 2011 | 2011-02-05 | 2 |
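
As an alternative to `apply` with a lambda, the `.dt` accessor extracts the month for the whole column at once. The sketch below uses three of the date strings shown above and passes an explicit `format`, assuming every value follows the "month name, day, year" layout.

```python
import pandas as pd

# Three of the date strings shown in the output above
dates = pd.Series(["July 30 2011", "February 01 2011", "January 29 2011"])

# An explicit format avoids guessing the layout for each value
converted = pd.to_datetime(dates, format="%B %d %Y")

# The .dt accessor works on the whole column
months = converted.dt.month
print(months.tolist())
```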


### Engineering features from strings - extraction

<p>The <code>Length</code> column in the <code>hiking</code> dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in Pandas to apply the extraction to the DataFrame.</p>

In [26]:
pattern1 = re.compile(r"\d+\.\d+")
print(re.match(pattern1, '0.75 miles'))
print(re.match(pattern1, '0.75 miles').group(0))

<_sre.SRE_Match object; span=(0, 4), match='0.75'>
0.75


In [27]:
display(hiking[['Length']].head())

# Write a function that extracts a leading decimal number from a string
def return_mileage(length):
    # Pattern for digits, a decimal point, and more digits
    pattern = re.compile(r"\d+\.\d+")
    
    # Look for the pattern at the start of the text
    mile = re.match(pattern, length)
    
    # If a match is found, use group(0) to return it as a float; otherwise return None
    if mile is not None:
        return float(mile.group(0))
    return None

| | Length |
|---|---|
| 0 | 0.8 miles |
| 1 | 1.0 mile |
| 2 | 0.75 miles |
| 3 | 0.5 miles |
| 4 | 0.5 miles |


In [28]:
# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking['Length'].apply(lambda row: return_mileage(str(row)))
display(hiking[["Length", "Length_num"]].head())

| | Length | Length_num |
|---|---|---|
| 0 | 0.8 miles | 0.8 |
| 1 | 1.0 mile | 1.0 |
| 2 | 0.75 miles | 0.75 |
| 3 | 0.5 miles | 0.5 |
| 4 | 0.5 miles | 0.5 |
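
The pattern `\d+\.\d+` requires a decimal point, and `re.match` only looks at the start of the string, so a value such as "2 miles" or "about 1.5 miles" would come back as `None`. Whether the hiking data contains such values is not shown here; the sketch below is a hypothetical variant (`extract_mileage` is an illustrative name) that makes the decimal part optional and uses `re.search`.

```python
import re

# Optional decimal part, so "2 miles" matches as well as "0.75 miles"
MILEAGE = re.compile(r"\d+(?:\.\d+)?")

def extract_mileage(text):
    """Return the first number in text as a float, or None if there is none."""
    # search scans the whole string; match would only check the start
    found = MILEAGE.search(text)
    return float(found.group(0)) if found else None

print(extract_mileage("0.75 miles"))
print(extract_mileage("2 miles"))
print(extract_mileage("about 1.5 miles"))
print(extract_mileage("None"))
```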


### Engineering features from strings - tf/idf

<p>Let's transform the <code>volunteer</code> dataset's <code>title</code> column into a text vector, to use in a prediction task in the next exercise.</p>

In [29]:
# Take the title text
title_text = volunteer_subset["title"]

# Create the vectorizer method
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

In [30]:
text_array = text_tfidf.toarray()
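
To make the shape of a tf/idf result concrete, here is a small sketch on made-up titles. It shows the sparse matrix dimensions, the fitted `vocabulary_` mapping from word to column index, and how to read the non-zero weights of one row through a reverse lookup (`index_to_word` is an illustrative name).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles standing in for the volunteer title column
titles = ["web designer needed", "park cleanup volunteers needed", "web tutoring"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(titles)  # sparse matrix: one row per title, one column per word

print(tfidf.shape)
print(vec.vocabulary_)  # maps word -> column index

# Reverse lookup (column index -> word) for reading a row's weights
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}
row = tfidf[0]
row_weights = {index_to_word[i]: round(float(w), 3) for i, w in zip(row.indices, row.data)}
print(row_weights)
```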

### Text classification using tf/idf vectors

<p>Now that we've encoded the <code>volunteer</code> dataset's <code>title</code> column into tf/idf vectors, let's use those vectors to try to predict the <code>category_desc</code> column.</p>

In [33]:
from sklearn.naive_bayes import MultinomialNB, GaussianNB
nb = GaussianNB(priors=None)

y = volunteer_subset["category_desc"]
train_X, test_X, train_y, test_y = train_test_split(text_tfidf.toarray(), y, stratify=y)

In [34]:
# Fit the model to the training data
nb.fit(train_X, train_y)

# Print out the model's accuracy
print(nb.score(test_X, test_y))

0.5870967741935483


## <font color=red>04 -   Selecting features for modeling </font> 

 This chapter covers selecting features for modeling. You'll drop redundant and highly correlated columns, filter tf/idf text vectors down to their top-weighted words, and reduce dimensionality with PCA. 

### When to use feature selection

Let's say you had finished standardizing your data and creating new features. Which of the following scenarios is NOT a good candidate for feature selection?

- Several columns of running times that have been averaged into a new column.
- **A text field that hasn't been turned into a tf/idf vector yet.**
- A column of text that has already had a float extracted out of it.
- A categorical field that has been one-hot encoded.
- Your dataset contains columns related to whether something is a fruit or vegetable, the name of the fruit or vegetable, and the scientific name of the plant.

### Identifying areas for feature selection

<p>Take an exploratory look at the post-feature engineering <code>hiking</code> dataset. Which of the following columns is a good candidate for feature selection?</p>

- Length
- Difficulty
- Accessible
- **All of the above**
- None of the above

In [35]:
hiking[['Length', 'Difficulty', 'Accessible']].head()

| | Length | Difficulty | Accessible |
|---|---|---|---|
| 0 | 0.8 miles | | Y |
| 1 | 1.0 mile | Easy | N |
| 2 | 0.75 miles | Easy | N |
| 3 | 0.5 miles | Easy | N |
| 4 | 0.5 miles | Easy | N |


### Selecting relevant features

<p>Now that you've identified redundant columns in the <code>volunteer</code> dataset, let's perform feature selection on the dataset to return a DataFrame of the relevant features.</p>

In [36]:
# Create a list of redundant column names to drop
to_drop = ["category_desc", "created_date", "locality", "region", "vol_requests"]

# Drop those columns from the dataset
volunteer_subset = volunteer.drop(to_drop, axis=1)

# Print out the head of the new dataset
display(volunteer_subset.head())

| | opportunity_id | content_id | event_time | title | hits | summary | is_priority | category_id | amsl | amsl_unit | ... | Latitude | Longitude | Community Board | Community Council | Census Tract | BIN | BBL | NTA | start_date_converted | start_date_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4996 | 37004 | 0 | Volunteers Needed For Rise Up & Stay Put! Home... | 737 | Building on successful events last summer and ... | | | | | ... | | | | | | | | | 2011-07-30 | 7 |
| 1 | 5008 | 37036 | 0 | Web designer | 22 | Build a website for an Afghan business | | 1.0 | | | ... | | | | | | | | | 2011-02-01 | 2 |
| 2 | 5016 | 37143 | 0 | Urban Adventures - Ice Skating at Lasker Rink | 62 | Please join us and the students from Mott Hall... | | 1.0 | | | ... | | | | | | | | | 2011-01-29 | 1 |
| 3 | 5022 | 37237 | 0 | Fight global hunger and support women farmers ... | 14 | The Oxfam Action Corps is a group of dedicated... | | 1.0 | | | ... | | | | | | | | | 2011-02-14 | 2 |
| 4 | 5055 | 37425 | 0 | Stop 'N' Swap | 31 | Stop 'N' Swap reduces NYC's waste by finding n... | | 4.0 | | | ... | | | | | | | | | 2011-02-05 | 2 |


### Checking for correlated features

<p>Let's take a look at the <code>wine</code> dataset again, which is made up of continuous, numerical features. Run Pearson's correlation coefficient on the dataset to determine which columns are good candidates for eliminating. Then, remove those columns from the DataFrame.</p>

<ul>
<li>Take a minute to look at the correlations. Identify a column where the correlation value is greater than 0.75 at least twice and store it in the <code>to_drop</code> variable.</li>
</ul>

In [37]:
# Print out the column correlations of the wine dataset
display(wine.corr())

# Take a minute to find the column where the correlation value is greater than 0.75 at least twice
to_drop = "Flavanoids"

# Drop that column from the DataFrame
wine = wine.drop(to_drop, axis=1)

| | Type | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline | Proline_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Type | 1.0 | -0.328 | 0.438 | -0.05 | 0.518 | -0.209 | -0.719 | -0.847 | 0.489 | -0.499 | 0.266 | -0.617 | -0.788 | -0.634 | -0.569 |
| Alcohol | -0.328 | 1.0 | 0.094 | 0.212 | -0.31 | 0.271 | 0.289 | 0.237 | -0.156 | 0.137 | 0.546 | -0.072 | 0.072 | 0.644 | 0.637 |
| Malic acid | 0.438 | 0.094 | 1.0 | 0.164 | 0.289 | -0.055 | -0.335 | -0.411 | 0.293 | -0.221 | 0.249 | -0.561 | -0.369 | -0.192 | -0.153 |
| Ash | -0.05 | 0.212 | 0.164 | 1.0 | 0.443 | 0.287 | 0.129 | 0.115 | 0.186 | 0.01 | 0.259 | -0.075 | 0.004 | 0.224 | 0.238 |
| Alcalinity of ash | 0.518 | -0.31 | 0.289 | 0.443 | 1.0 | -0.083 | -0.321 | -0.351 | 0.362 | -0.197 | 0.019 | -0.274 | -0.277 | -0.441 | -0.417 |
| Magnesium | -0.209 | 0.271 | -0.055 | 0.287 | -0.083 | 1.0 | 0.214 | 0.196 | -0.256 | 0.236 | 0.2 | 0.055 | 0.066 | 0.393 | 0.424 |
| Total phenols | -0.719 | 0.289 | -0.335 | 0.129 | -0.321 | 0.214 | 1.0 | 0.865 | -0.45 | 0.612 | -0.055 | 0.434 | 0.7 | 0.498 | 0.431 |
| Flavanoids | -0.847 | 0.237 | -0.411 | 0.115 | -0.351 | 0.196 | 0.865 | 1.0 | -0.538 | 0.653 | -0.172 | 0.543 | 0.787 | 0.494 | 0.41 |
| Nonflavanoid phenols | 0.489 | -0.156 | 0.293 | 0.186 | 0.362 | -0.256 | -0.45 | -0.538 | 1.0 | -0.366 | 0.139 | -0.263 | -0.503 | -0.311 | -0.276 |
| Proanthocyanins | -0.499 | 0.137 | -0.221 | 0.01 | -0.197 | 0.236 | 0.612 | 0.653 | -0.366 | 1.0 | -0.025 | 0.296 | 0.519 | 0.33 | 0.29 |
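
Reading a correlation matrix by eye gets harder as the number of columns grows. One way to count the strong correlations per column programmatically is sketched below on made-up data; the 0.75 threshold comes from the exercise, while the column names and the counting approach are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "b" is nearly a copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Absolute Pearson correlations between all pairs of columns
corr = df.corr().abs()

# For each column, count how many OTHER columns exceed the threshold
threshold = 0.75
high = (corr > threshold).sum() - 1  # subtract the column's correlation with itself
print(high)
```

Columns with the highest counts are the first candidates to drop, since most of their information is carried by other columns.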


### Exploring text vectors, part 1

<p>Let's expand on the text vector exploration method we just learned about, using the <code>volunteer</code> dataset's <code>title</code> tf/idf vectors. In this first part of text vector exploration, we're going to add to that function we learned about in the slides. We'll return a list of numbers with the function. In the next exercise, we'll write another function to collect the top words across all documents, extract them, and then use that list to filter down our <code>text_tfidf</code> vector.</p>

In [None]:
# Add in the rest of the parameters
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Let's transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

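# Build the reverse lookup (column index -> word); vocab is not defined earlier in this notebook
vocab = {v: k for k, v in tfidf_vec.vocabulary_.items()}

# Print out the weighted words
print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 8, 3))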

### Exploring text vectors, part 2

Using the function we wrote in the previous exercise, we're going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.

In [40]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Here we'll call the function from the previous exercise, and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

In [None]:
# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# By converting filtered_words back to a list, we can use it to filter the columns in the text vector
filtered_text = text_tfidf[:, list(filtered_words)]

### Training Naive Bayes with feature selection

<p>Let's re-run the Naive Bayes text classification model we ran at the end of chapter 3, with our selection choices from the previous exercise, on the <code>volunteer</code> dataset's <code>title</code> and <code>category_desc</code> columns.</p>

In [None]:
# Split the dataset according to the class distribution of category_desc, using the filtered_text vector
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit the model to the training data
nb.fit(train_X, train_y)

# Print out the model's accuracy
print(nb.score(test_X,test_y))

In [None]:
wine.head()

### Using PCA

<p>Let's apply PCA to the <code>wine</code> dataset, to see if we can get an increase in our model's accuracy.</p>

In [None]:
from sklearn.decomposition import PCA

# Set up PCA and the X vector for dimensionality reduction
pca = PCA()
wine_X = wine.drop("Type", axis=1)
y=wine.Type

# Apply PCA to the wine dataset X vector
transformed_X = pca.fit_transform(wine_X)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)
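
`explained_variance_ratio_` lists each component's share of the total variance, sorted from largest to smallest. The sketch below, on made-up data where the third feature is almost a sum of the first two, shows the cumulative share and the option of passing a float to `n_components` so that PCA keeps just enough components to reach that share. The 0.95 target is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: the third feature is almost the sum of the first two
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
third = base[:, 0] + base[:, 1] + rng.normal(scale=0.01, size=100)
X_demo = np.column_stack([base, third])

pca_demo = PCA().fit(X_demo)
ratios = pca_demo.explained_variance_ratio_
print(ratios)             # sorted from largest to smallest
print(np.cumsum(ratios))  # cumulative share of variance

# A float n_components keeps enough components to reach that share of variance
reduced = PCA(n_components=0.95, svd_solver="full").fit_transform(X_demo)
print(reduced.shape)
```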

### Training a model with PCA

<p>Now that we have run PCA on the <code>wine</code> dataset, let's try training a model with it.</p>

In [None]:
# Split the transformed X and the y labels into training and test sets
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, y)

# Fit knn to the training data
knn.fit(X_wine_train, y_wine_train)

# Score knn on the test data and print it out
print(knn.score(X_wine_test, y_wine_test))

## <font color=red>05 -  Putting it all together </font> 

Now that you've learned how to standardize data, engineer features, and select features, you'll put those preprocessing steps into practice by getting a dataset on UFO sightings ready for modeling.

### Checking column types

<p>Take a look at the UFO dataset's column types using the <code>dtypes</code> attribute. Two columns jump out for transformation: the <code>seconds</code> column, which is numeric but may be read in as <code>object</code> (in the output below it already arrives as <code>float64</code>, so the cast is a harmless no-op), and the <code>date</code> column, which can be transformed into the <code>datetime</code> type. That will make our feature engineering efforts easier later on.</p>

In [42]:
ufo=pd.read_csv('./data/ufo_sightings_large.csv')

In [43]:
display(ufo.head())
print(ufo.shape)

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long
0,11/3/2011 19:21,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291
1,10/3/2004 19:05,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.696
2,9/25/2009 21:00,coon rapids,mn,us,cigar,0.0,,Green&#44 red&#44 and blue pulses of light tha...,12/12/2009,45.12,-93.287
3,11/21/2002 05:45,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382
4,8/19/2010 12:55,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,-114.083


(4935, 11)


In [44]:
# Check the column types
print(ufo.dtypes)

date               object
city               object
state              object
country            object
type               object
seconds           float64
length_of_time     object
desc               object
recorded           object
lat                object
long              float64
dtype: object


In [45]:
# Change the type of seconds to float
ufo["seconds"] = ufo["seconds"].astype('float')

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo["date"])

# Check the column types
print(ufo[['seconds','date']].dtypes)

seconds           float64
date       datetime64[ns]
dtype: object
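
The `dtypes` listing above also shows `lat` stored as `object`, which usually means at least one value could not be parsed as a number. In that situation `astype('float')` would raise an error. A minimal sketch of a more forgiving conversion follows, using a toy frame. The malformed latitude is hypothetical and only there to show the behavior.

```python
import pandas as pd

# Toy frame; the malformed latitude is hypothetical, for illustration only
demo = pd.DataFrame({
    "date": ["11/3/2011 19:21", "10/3/2004 19:05", "9/25/2009 21:00"],
    "lat": ["44.9530556", "41.4994444", "not a number"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising
demo["lat"] = pd.to_numeric(demo["lat"], errors="coerce")

# An explicit format avoids ambiguous month/day guessing
demo["date"] = pd.to_datetime(demo["date"], format="%m/%d/%Y %H:%M")

print(demo.dtypes)
```

Coercing silently produces missing values, so it is worth checking `isnull().sum()` afterwards to see how many entries were affected.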


### Dropping missing data

<p>Let's remove some of the rows where certain columns have missing values. We're going to look at the <code>length_of_time</code> column, the <code>state</code> column, and the <code>type</code> column. If any of the values in these columns are missing, we're going to drop the rows.</p>

In [46]:
# Check how many values are missing in the length_of_time, state, and type columns
print(ufo[['length_of_time', 'state', 'type']].isnull().sum())

# Keep only rows where length_of_time, state, and type are not null
ufo_no_missing = ufo[ufo['length_of_time'].notnull() & 
          ufo['state'].notnull() & 
          ufo['type'].notnull()]

# Print out the shape of the new dataset
print(ufo_no_missing.shape)

length_of_time    143
state             419
type              159
dtype: int64
(4283, 11)
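
The boolean mask above can also be written with `dropna` and its `subset=` parameter, which restricts the null check to the listed columns. A short sketch on a toy frame (the rows are made up for illustration) shows that the two approaches keep the same rows:

```python
import numpy as np
import pandas as pd

# Toy data: row 1 lacks length_of_time, row 2 lacks state, row 0 lacks only city
demo = pd.DataFrame({
    "length_of_time": ["2 weeks", np.nan, "about 5 minutes", "2"],
    "state": ["wi", "oh", np.nan, "ab"],
    "type": ["unknown", "circle", "triangle", "oval"],
    "city": [np.nan, "cleveland", "clemmons", "calgary (canada)"],
})

# Mask-based filtering, as in the cell above
mask_version = demo[demo["length_of_time"].notnull()
                    & demo["state"].notnull()
                    & demo["type"].notnull()]

# Same filter via dropna; nulls in columns outside the subset are ignored
subset_version = demo.dropna(subset=["length_of_time", "state", "type"])

print(mask_version.equals(subset_version))
print(subset_version.index.tolist())
```

Row 0 survives even though its `city` is missing, because `city` is not in the subset.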


### Extracting numbers from strings

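The <code>length_of_time</code> field in the UFO dataset is a free-text field with the duration written inside the string. The exercise treats the number it contains as minutes, although the sample rows show mixed units such as weeks and seconds. Here, you'll extract that number from the text field using regular expressions.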

In [47]:
def return_minutes(time_string):

    # Use \d+ to grab digits
    pattern = re.compile(r"\d+")
    
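    # Use match on the pattern and column. Note that match only succeeds when
    # the digits start the string, so "about 5 minutes" returns None
    num = re.match(pattern, time_string)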
    if num is not None:
        return int(num.group(0))

In [48]:
# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(lambda x: return_minutes(str(x)))

# Take a look at the head of both of the columns
display(ufo[['length_of_time','minutes']].head())

Unnamed: 0,length_of_time,minutes
0,2 weeks,2.0
1,30sec.,30.0
2,,
3,about 5 minutes,
4,2,2.0
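
Row 3 comes back empty because `re.match` only tries to match at the start of the string, and "about 5 minutes" starts with a letter. If the goal is the first number anywhere in the string, `re.search` is one option. The sketch below uses a hypothetical helper name, `first_number`, and is not the function from the exercise:

```python
import re

def first_number(time_string):
    """Return the first run of digits anywhere in the string, or None."""
    # re.search scans the whole string; re.match only tries position 0
    found = re.search(r"\d+", time_string)
    return int(found.group(0)) if found is not None else None

# The five length_of_time values shown above ("nan" is str() of a missing value)
samples = ["2 weeks", "30sec.", "nan", "about 5 minutes", "2"]
results = [first_number(s) for s in samples]
print(results)
```

Either way the result is a bare number with no unit, so "2 weeks" and "2" both become 2. Converting to a common unit would need extra parsing that this sketch does not attempt.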


### Identifying features for standardization

<p>In this section, you'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the <code>seconds</code> and <code>minutes</code> columns, you'll see that the variance of the <code>seconds</code> column is extremely high. Because <code>seconds</code> and <code>minutes</code> are related to each other (an issue we'll deal with when we select features for modeling), let's log normalize the <code>seconds</code> column.</p>

In [49]:
# Check the variance of the seconds and minutes columns
print(ufo[['seconds','minutes']].var())

# Log normalize the seconds column (rows where seconds is 0 become -inf,
# which is why the variance printed below is nan)
ufo["seconds_log"] = np.log(ufo['seconds'])

# Print out the variance of just the seconds_log column
print(ufo["seconds_log"].var())

seconds   31567346180.215
minutes           870.993
dtype: float64
nan
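
The `nan` arises because some sightings have a duration of `0.0` seconds, and the log of zero is negative infinity. Two common workarounds are sketched below, assuming it is acceptable either to shift the values with `log1p` or to restrict the log to positive durations. Which one is appropriate depends on what a zero duration means in this dataset, which the source does not say.

```python
import numpy as np
import pandas as pd

# First five seconds values from the ufo.head() output shown earlier
seconds = pd.Series([1209600.0, 30.0, 0.0, 300.0, 0.0])

# Plain log: zeros map to -inf (NumPy's divide-by-zero warning silenced here)
with np.errstate(divide="ignore"):
    raw_log = np.log(seconds)

# Option 1: log(1 + x), which is defined at zero
log1p_version = np.log1p(seconds)

# Option 2: take the log only of strictly positive durations
positive_only = np.log(seconds[seconds > 0])

print(int(np.isinf(raw_log).sum()))
print(log1p_version.var(), positive_only.var())
```

Option 2 shrinks the dataset, so the same row filter would have to be applied to every other feature and to the labels.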


  """


### Encoding categorical variables

There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. You'll do that transformation here, using both binary and one-hot encoding methods.

In [50]:
# Use Pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda x: 1 if x=='us' else 0)

# Print the number of unique type values
print(len(ufo['type'].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo['type'])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

22
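
Since `type` has missing values (159 in the null count earlier), the count printed above likely includes `NaN` as one of its entries, because `unique()` keeps it. The sketch below, on a made-up Series, shows how `unique()`, `nunique()`, and `get_dummies` each treat missing values. The `prefix` argument is an optional extra that makes the new column names easier to recognize.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value
types = pd.Series(["circle", "oval", np.nan, "circle", "triangle"], name="type")

# unique() keeps NaN as its own entry; nunique() ignores it by default
print(len(types.unique()), types.nunique())

# get_dummies skips NaN unless dummy_na=True; prefix labels the new columns
dummies = pd.get_dummies(types, prefix="type")
print(list(dummies.columns))

# The row with the missing type has no column switched on
print(int(dummies.iloc[2].sum()))
```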


### Features from dates

<p>Another feature engineering task to perform is month and year extraction. Perform this task on the <code>date</code> column of the <code>ufo</code> dataset.</p>

In [51]:
# Look at the first 5 rows of the date column
display(ufo[['date']].head())

# Extract the month from the date column
ufo["month"] = ufo["date"].dt.month

# Extract the year from the date column
ufo["year"] = ufo["date"].dt.year

# Take a look at the head of all three columns
display(ufo[['date', 'year', 'month']].head())

Unnamed: 0,date
0,2011-11-03 19:21:00
1,2004-10-03 19:05:00
2,2009-09-25 21:00:00
3,2002-11-21 05:45:00
4,2010-08-19 12:55:00


Unnamed: 0,date,year,month
0,2011-11-03 19:21:00,2011,11
1,2004-10-03 19:05:00,2004,10
2,2009-09-25 21:00:00,2009,9
3,2002-11-21 05:45:00,2002,11
4,2010-08-19 12:55:00,2010,8


### Text vectorization

<p>Let's transform the <code>desc</code> column in the UFO dataset into tf-idf vectors, since there's likely something we can learn from this field.</p>

In [52]:
ufo['desc'] = ufo.desc.fillna("-")

In [53]:
# Take a look at the head of the desc field
display(ufo["desc"].head())

# Create the tfidf vectorizer object
vec = TfidfVectorizer()

# Use vec's fit_transform method on the desc field
desc_tfidf = vec.fit_transform(ufo["desc"])

# Look at the number of columns this creates
print(desc_tfidf.shape)

0    Red blinking objects similar to airplanes or s...
1                 Many fighter jets flying towards UFO
2    Green&#44 red&#44 and blue pulses of light tha...
3    It was a large&#44 triangular shaped flying ob...
4       A white spinning disc in the shape of an oval.
Name: desc, dtype: object

(4935, 6433)
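
Each of the 6433 columns corresponds to one word in the fitted vocabulary. The vectorizer's `vocabulary_` attribute maps words to column indices, so looking up which word a column stands for requires inverting it. A small sketch on a made-up three-sentence corpus follows. The names `demo_vec`, `demo_tfidf`, and `index_to_word` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus; "red" appears twice in the first document
corpus = [
    "red red blinking lights in the sky",
    "many fighter jets flying towards ufo",
    "a white spinning disc in the shape of an oval",
]

demo_vec = TfidfVectorizer()
demo_tfidf = demo_vec.fit_transform(corpus)

# vocabulary_ maps word -> column index; invert it to look words up by column
index_to_word = {idx: word for word, idx in demo_vec.vocabulary_.items()}

# Highest-weighted word in the first document
weights = demo_tfidf[0].toarray().ravel()
top_col = int(weights.argmax())

print(demo_tfidf.shape)
print(index_to_word[top_col])
```

The word-filtering helpers from the previous chapter take this kind of inverted mapping as their `vocab` argument, so it has to be built from the same vectorizer that produced the matrix.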


### Selecting the ideal dataset

<p>Let's get rid of some of the unnecessary features. Because we have an encoded country column, <code>country_enc</code>, keep it and drop other columns related to location: <code>city</code>, <code>country</code>, <code>lat</code>, <code>long</code>, <code>state</code>. </p>
<p>We have columns related to <code>month</code> and <code>year</code>, so we don't need the <code>date</code> or <code>recorded</code> columns. </p>
<p>We vectorized <code>desc</code>, so we don't need it anymore. For now we'll keep <code>type</code>. </p>
<p>We'll keep <code>seconds_log</code> and drop <code>seconds</code> and <code>minutes</code>. </p>
<p>Let's also get rid of the <code>length_of_time</code> column, which is unnecessary after extracting <code>minutes</code>.</p>

In [None]:
# Check the correlation between the seconds, seconds_log, and minutes columns
print(ufo[['seconds','seconds_log','minutes']].corr())

# Make a list of features to drop
to_drop = ['city', 'country', 'date','desc', 'lat','length_of_time', 'long','minutes', 'recorded', 'seconds', 'state']

# Drop those features
ufo_dropped = ufo.drop(to_drop, axis=1)

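# Let's also filter some words out of the text vector we created.
# Assumption: vocab should map column index -> word for this vectorizer,
# so rebuild it from vec rather than reusing the one from the volunteer data
vocab = {v: k for k, v in vec.vocabulary_.items()}
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)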

### Modeling the UFO dataset, part 1

In this exercise, we're going to build a k-nearest neighbors model to predict which country the UFO sighting took place in. Our X dataset has the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The y labels are the encoded country column, where 1 is <code>us</code> and 0 is anything else, which the exercise treats as <code>ca</code>.

In [None]:
# Take a look at the features in the X set of data
print(X.columns)

# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(X,y, stratify=y)

# Fit knn to the training sets
knn.fit(train_X, train_y)

# Print the score of knn on the test sets
print(knn.score(test_X, test_y))
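
The cell above relies on `X`, `y`, and `knn` having been defined elsewhere. The sketch below shows one plausible way to assemble them, using a small synthetic frame in place of the prepared UFO data. The random values, the `demo_` names, and `n_neighbors=5` are all assumptions made for illustration, so the printed score says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 40

# Synthetic stand-in for the prepared frame; values are random, not UFO data
demo = pd.DataFrame({
    "seconds_log": rng.normal(5, 2, n),
    "month": rng.integers(1, 13, n),
    "year": rng.integers(1990, 2015, n),
    "type": rng.choice(["circle", "oval", "triangle"], n),
    "country_enc": np.repeat([1, 0], [30, 10]),
})

# X: numeric features plus one-hot type columns; y: the encoded country
demo_X = pd.concat(
    [demo[["seconds_log", "month", "year"]],
     pd.get_dummies(demo["type"], dtype=float)],
    axis=1,
)
demo_y = demo["country_enc"]

# stratify keeps the us/other proportions similar in both splits
train_X, test_X, train_y, test_y = train_test_split(
    demo_X, demo_y, stratify=demo_y, random_state=0
)

demo_knn = KNeighborsClassifier(n_neighbors=5)
demo_knn.fit(train_X, train_y)
demo_score = demo_knn.score(test_X, test_y)
print(demo_score)
```

On the real data, rows with missing or infinite values in any feature would need to be handled first, since k-nearest neighbors cannot compute distances through them.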