First, load all the datasets from DataCamp's website.

In [0]:
import pandas as pd
import numpy as np

In [0]:
hiking = pd.read_json('https://assets.datacamp.com/production/repositories/1816/datasets/4f26c48451bdbf73db8a58e226cd3d6b45cf7bb5/hiking.json',
                      encoding = 'utf-8')

In [135]:
hiking.head(3).T

Unnamed: 0,0,1,2
Prop_ID,B057,B073,B073
Name,Salt Marsh Nature Trail,Lullwater,Midwood
Location,"Enter behind the Salt Marsh Nature Center, loc...",Enter Park at Lincoln Road and Ocean Avenue en...,Enter Park at Lincoln Road and Ocean Avenue en...
Park_Name,Marine Park,Prospect Park,Prospect Park
Length,0.8 miles,1.0 mile,0.75 miles
Difficulty,,Easy,Easy
Other_Details,<p>The first half of this mile-long trail foll...,Explore the Lullwater to see how nature thrive...,Step back in time with a walk through Brooklyn...
Accessible,Y,N,N
Limited_Access,N,N,N
lat,,,


In [0]:
wine = pd.read_csv('https://assets.datacamp.com/production/repositories/1816/datasets/9bd5350dfdb481e0f94eeef6acf2663452a8ef8b/wine_types.csv')

In [137]:
wine.head(3).T

Unnamed: 0,0,1,2
Type,1.0,1.0,1.0
Alcohol,14.23,13.2,13.16
Malic acid,1.71,1.78,2.36
Ash,2.43,2.14,2.67
Alcalinity of ash,15.6,11.2,18.6
Magnesium,127.0,100.0,101.0
Total phenols,2.8,2.65,2.8
Flavanoids,3.06,2.76,3.24
Nonflavanoid phenols,0.28,0.26,0.3
Proanthocyanins,2.29,1.28,2.81


In [0]:
ufo = pd.read_csv('https://assets.datacamp.com/production/repositories/1816/datasets/a5ebfe5d2ed194f2668867603b563963af4769e9/ufo_sightings_large.csv')

In [139]:
ufo.head(3).T

Unnamed: 0,0,1,2
date,11/3/2011 19:21,10/3/2004 19:05,9/25/2009 21:00
city,woodville,cleveland,coon rapids
state,wi,oh,mn
country,us,us,us
type,unknown,circle,cigar
seconds,1.2096e+06,30,0
length_of_time,2 weeks,30sec.,
desc,Red blinking objects similar to airplanes or s...,Many fighter jets flying towards UFO,Green&#44 red&#44 and blue pulses of light tha...
recorded,12/12/2011,10/27/2004,12/12/2009
lat,44.9530556,41.4994444,45.1200000


In [0]:
volunteer = pd.read_csv('https://assets.datacamp.com/production/repositories/1816/datasets/668b96955d8b252aa8439c7602d516634e3f015e/volunteer_opportunities.csv')

In [141]:
volunteer.head(3).T

Unnamed: 0,0,1,2
opportunity_id,4996,5008,5016
content_id,37004,37036,37143
vol_requests,50,2,20
event_time,0,0,0
title,Volunteers Needed For Rise Up & Stay Put! Home...,Web designer,Urban Adventures - Ice Skating at Lasker Rink
hits,737,22,62
summary,Building on successful events last summer and ...,Build a website for an Afghan business,Please join us and the students from Mott Hall...
is_priority,,,
category_id,,1,1
category_desc,,Strengthening Communities,Strengthening Communities


The tutorials start now.

### Part 1 - Introduction to Data Preprocessing

#### Missing data - rows

Taking a look at the volunteer dataset again, we want to drop rows where the `category_desc` column values are missing. We're going to do this using boolean indexing: first check how many null values the column has, then filter the dataset so that only rows with non-null values remain.

In [142]:
volunteer['category_desc'].isnull().sum()

48

There are 48 missing values, so filtering them out leaves 617 of the 665 rows.

In [0]:
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]

In [144]:
volunteer_subset.shape

(617, 35)
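
As a minimal sketch of the same idea on a toy DataFrame (the rows below are made up for illustration), boolean indexing with `notnull()` and `dropna(subset=...)` should give the same result:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the volunteer data (values are made up for illustration)
toy = pd.DataFrame({
    "title": ["Web designer", "Park cleanup", "Tutoring", "Food drive"],
    "category_desc": ["Education", np.nan, "Education", None],
})

# Boolean indexing: keep rows whose category_desc is not null
by_mask = toy[toy["category_desc"].notnull()]

# Equivalent: dropna restricted to a single column
by_dropna = toy.dropna(subset=["category_desc"])

print(by_mask.equals(by_dropna))
print(len(toy), "->", len(by_mask))
```

Which form to use is mostly a matter of taste; `dropna(subset=...)` reads more directly when several columns need checking at once.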

#### Converting a column type

This exercise expects the `hits` column of the volunteer dataset to be of type object even though it consists of integers, in which case you would convert it with `astype('int')`. In this run, the output below shows that pandas already reads the column as `int64`, so the conversion leaves it unchanged.

In [145]:
volunteer['hits'].dtype

dtype('int64')

In [146]:
volunteer['hits'].head()

0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int64

In [147]:
volunteer['hits'].astype('int').head()

0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int64
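
Since the column above is already numeric, here is a small sketch of the case where a conversion is needed. The series below is hypothetical: it holds number-like strings plus one unparseable entry, and `pd.to_numeric` with `errors="coerce"` is one way to handle it.

```python
import pandas as pd

# Hypothetical column that was read in as strings (object dtype)
hits_raw = pd.Series(["737", "22", "62", "n/a"], name="hits", dtype=object)

# astype(int) would raise on "n/a"; to_numeric with errors="coerce"
# turns unparseable entries into NaN instead
hits_num = pd.to_numeric(hits_raw, errors="coerce")

print(hits_num.dtype)          # float64, because NaN forces a float column
print(hits_num.isna().sum())   # number of entries that could not be parsed
```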

#### Stratified sampling

We know that the distribution of classes in the `category_desc` column of the volunteer dataset is uneven. If we wanted to train a model to predict `category_desc`, we would want to train it on a sample that is representative of the entire dataset. Stratified sampling is a way to achieve this.

In [0]:
volunteer_X = volunteer_subset.drop(columns='category_desc')

volunteer_y = volunteer_subset[['category_desc']]


In [0]:
from sklearn.model_selection import train_test_split

Use stratified sampling to split up the dataset according to the `volunteer_y` labels.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)

Print out the `category_desc` counts on the training y labels.

In [151]:
y_train['category_desc'].value_counts()

Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64
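
To see what `stratify` does in isolation, here is a small sketch on synthetic labels (the 80/20 class mix and the fixed `random_state` are assumptions chosen to make the effect easy to check):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 of class "a", 20 of class "b"
y = pd.Series(["a"] * 80 + ["b"] * 20, name="label")
X = pd.DataFrame({"feature": range(100)})

# random_state is fixed only to make the sketch reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Both splits keep the original class ratio
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Without `stratify`, a small test set could easily end up with very few (or no) samples of the rare class.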

### Part 2 - Standardizing Data

#### Modeling without normalizing

Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the wine dataset. One of the columns, Proline, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.

The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. The cells below create a k-nearest neighbors model (`knn`) as well as the X and y sets you need to fit and score on.

In [0]:
wine_X = wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]
wine_y = wine[['Type']]

In [153]:
print(wine_y.shape)
print(wine_X.shape)

(178, 1)
(178, 4)


In [154]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

# ravel() flattens y to 1-D, the shape scikit-learn classifiers expect
X_train, X_test, y_train, y_test = train_test_split(wine_X.values, wine_y.values.ravel())

knn.fit(X_train, y_train)

score = knn.score(X_test, y_test)
print(score)

0.6666666666666666




In [155]:
wine_X.apply(lambda x: np.std(x))

Proline                 314.021657
Total phenols             0.624091
Hue                       0.227929
Nonflavanoid phenols      0.124103
dtype: float64

#### Log normalization in Python

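Now that we know that the Proline column in our wine dataset has a large amount of variance, let's log normalize it. Note that the new `Proline_log` column is added to `wine`, not to `wine_X`, so the standard deviations printed for `wine_X` below are unchanged.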

In [0]:
wine['Proline_log'] = np.log(wine_X.loc[:,'Proline'])

In [157]:
wine_X.apply(lambda x: np.std(x))

Proline                 314.021657
Total phenols             0.624091
Hue                       0.227929
Nonflavanoid phenols      0.124103
dtype: float64
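
To make the effect of the transform visible, here is a sketch on synthetic data. The right-skewed values below are generated, not taken from the wine dataset, and only stand in for a column like Proline:

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed positive values standing in for a column like Proline
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=6.5, sigma=0.4, size=200), name="feature")

# Natural log compresses large values much more than small ones
values_log = np.log(values)

# Spread before and after the transform
print(values.std(), values_log.std())
```

On the notebook's own data, the comparable check would be `wine['Proline_log'].std()` against `wine['Proline'].std()`.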

#### Scaling data - standardizing columns

Since we know that the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.

In [0]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

wine_subset = wine[["Ash", "Alcalinity of ash", "Magnesium"]].copy()

wine_subset_scaled = ss.fit_transform(wine_subset)
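
As a quick check of what `StandardScaler` produces, the sketch below scales two made-up columns and verifies that each ends up with mean 0 and (population) standard deviation 1:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up columns on very different scales
df = pd.DataFrame({
    "small": [2.1, 2.4, 2.3, 2.7, 2.5],
    "large": [100.0, 127.0, 101.0, 113.0, 118.0],
})

scaler = StandardScaler()
# fit_transform returns a NumPy array, so wrap it to keep the column names
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# StandardScaler uses the population standard deviation (ddof=0)
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```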


#### KNN on non-scaled data

Let's first take a look at the accuracy of a K-nearest neighbors model on the wine dataset without standardizing the data. 

In [159]:
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))

0.6666666666666666


#### KNN on scaled data

The accuracy score on the unscaled wine dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous exercise, with the added step of scaling the data. The cell below reuses the `knn` model and rebuilds the X and y sets from the full wine dataset.

In [160]:
ss = StandardScaler()

X_scaled = ss.fit_transform(wine.drop(columns='Type'))
y = wine['Type']


X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))

0.9555555555555556


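We can see that the accuracy increased substantially, from about 0.67 to about 0.96. Keep in mind that this run also uses every feature column rather than the four-column subset, so scaling is not the only change.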
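
The cell above fits the scaler on the whole dataset before splitting, which lets test-set statistics influence the scaling. One common alternative, sketched here under the assumption that scikit-learn's bundled `load_wine` data is an acceptable stand-in for the CSV used in this notebook, is to put the scaler and the model in a pipeline so the scaler is fit on the training rows only:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the wine data, not the CSV above
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# The scaler is fit on the training rows only, then reused on the test rows
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)
```

The exact score depends on the split; the point of the sketch is the order of operations, not the number.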

### Part 3 - Feature Engineering

#### Encoding categorical variables - binary

Take a look at the hiking dataset. Several columns need to be encoded before they can be modeled, one of which is the Accessible column. Accessible is a binary feature with two values, Y or N, so it needs to be encoded into 1s and 0s. Use scikit-learn's LabelEncoder class to do that transformation.

In [0]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()

In [162]:
hiking.head(3)

Unnamed: 0,Prop_ID,Name,Location,Park_Name,Length,Difficulty,Other_Details,Accessible,Limited_Access,lat,lon
0,B057,Salt Marsh Nature Trail,"Enter behind the Salt Marsh Nature Center, loc...",Marine Park,0.8 miles,,<p>The first half of this mile-long trail foll...,Y,N,,
1,B073,Lullwater,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,1.0 mile,Easy,Explore the Lullwater to see how nature thrive...,N,N,,
2,B073,Midwood,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.75 miles,Easy,Step back in time with a walk through Brooklyn...,N,N,,


In [0]:
hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])

In [164]:
print(hiking[["Accessible_enc", "Accessible"]].head())

   Accessible_enc Accessible
0               1          Y
1               0          N
2               0          N
3               0          N
4               0          N
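
To see why Y maps to 1 and N to 0, the following sketch inspects the encoder on a short hand-written list of labels. `LabelEncoder` sorts the classes it finds and numbers them in that order:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["Y", "N", "N", "Y", "N"])

# classes_ is sorted, so "N" -> 0 and "Y" -> 1
print(enc.classes_)
print(codes)

# inverse_transform recovers the original labels
print(enc.inverse_transform(codes))
```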


#### Encoding categorical variables - one-hot

One of the columns in the volunteer dataset, category_desc, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. Use Pandas' get_dummies() function to do so.

In [165]:
category_enc = pd.get_dummies(volunteer['category_desc'])

category_enc.head()

Unnamed: 0,Education,Emergency Preparedness,Environment,Health,Helping Neighbors in Need,Strengthening Communities
0,0,0,0,0,0,0
1,0,0,0,0,0,1
2,0,0,0,0,0,1
3,0,0,0,0,0,1
4,0,0,1,0,0,0
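
In the output above, row 0 is all zeros because its `category_desc` is missing. The sketch below, on a made-up series, shows that default behavior next to the `dummy_na=True` option, which adds an explicit indicator column for missing values:

```python
import numpy as np
import pandas as pd

cats = pd.Series(["Health", np.nan, "Education", "Health"], name="category_desc")

# Default: a missing value becomes a row of all zeros
plain = pd.get_dummies(cats, dtype=int)

# dummy_na=True adds an explicit indicator column for missing values
with_na = pd.get_dummies(cats, dummy_na=True, dtype=int)

print(plain)
print(with_na)
```

Whether the extra column helps depends on the model; dropping the rows with missing labels first, as done in Part 1, is the other obvious option.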


#### Engineering numerical features - taking an average

A good use case for taking an aggregate statistic to create a new feature is to take the mean of columns. Here, you have a DataFrame of running times named `running_time`. For each name in the dataset, take the mean of their 5 run times.

In [0]:
running_dict = {'name': ['Sue', 'Mark', 'Sean', 'Erin', 'Jenny', 'Russell'],
 'run1': [20.1, 16.5, 23.5, 21.7, 25.8, 30.9],
 'run2': [18.5, 17.1, 25.1, 21.1, 27.1, 29.6],
 'run3': [19.6, 16.9, 25.2, 20.9, 26.1, 31.4],
 'run4': [20.3, 17.6, 24.6, 22.1, 26.7, 30.4],
 'run5': [18.3, 17.3, 23.9, 22.2, 26.9, 29.9]}

running_time = pd.DataFrame(running_dict)

In [167]:
running_time

Unnamed: 0,name,run1,run2,run3,run4,run5
0,Sue,20.1,18.5,19.6,20.3,18.3
1,Mark,16.5,17.1,16.9,17.6,17.3
2,Sean,23.5,25.1,25.2,24.6,23.9
3,Erin,21.7,21.1,20.9,22.1,22.2
4,Jenny,25.8,27.1,26.1,26.7,26.9
5,Russell,30.9,29.6,31.4,30.4,29.9


In [0]:
# numeric_only=True skips the string 'name' column (required in pandas 2.0+)
running_time["mean"] = running_time.mean(axis=1, numeric_only=True)

In [169]:
running_time

Unnamed: 0,name,run1,run2,run3,run4,run5,mean
0,Sue,20.1,18.5,19.6,20.3,18.3,19.36
1,Mark,16.5,17.1,16.9,17.6,17.3,17.08
2,Sean,23.5,25.1,25.2,24.6,23.9,24.46
3,Erin,21.7,21.1,20.9,22.1,22.2,21.6
4,Jenny,25.8,27.1,26.1,26.7,26.9,26.52
5,Russell,30.9,29.6,31.4,30.4,29.9,30.44


#### Engineering numerical features - datetime

Several columns in the volunteer dataset contain datetimes. Let's take a look at the start_date_date column and extract just the month to use as a feature for modeling.

In [0]:
volunteer["start_date_converted"] = pd.to_datetime(volunteer['start_date_date'])

In [0]:
volunteer['start_date_month'] = volunteer['start_date_converted'].dt.month

In [172]:
volunteer[["start_date_converted", "start_date_month"]].head()

Unnamed: 0,start_date_converted,start_date_month
0,2011-07-30,7
1,2011-02-01,2
2,2011-01-29,1
3,2011-02-14,2
4,2011-02-05,2
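
The `.dt` accessor exposes other datetime components in the same way. A short sketch on hand-written dates (including one missing value, to show how it propagates):

```python
import pandas as pd

# Two dates from the output above plus a missing value
dates = pd.to_datetime(pd.Series(["2011-07-30", "2011-02-01", None]))

features = pd.DataFrame({
    "month": dates.dt.month,
    "year": dates.dt.year,
    "day_of_week": dates.dt.dayofweek,  # Monday=0, Sunday=6
})

# Missing dates (NaT) turn into NaN in every derived column
print(features)
```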


#### Engineering features from strings - extraction

The Length column in the hiking dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in Pandas to apply the extraction to the DataFrame.

In [173]:
hiking['Length'].head()

0     0.8 miles
1      1.0 mile
2    0.75 miles
3     0.5 miles
4     0.5 miles
Name: Length, dtype: object

In [0]:
import re
pattern = re.compile(r"\d+\.\d+")

# Convert Length to str so the regex can be applied to every row
# (missing values become the string 'nan')
hiking['Length'] = [str(value) for value in hiking['Length']]

In [175]:
def return_mileage(length):
    # match() only looks at the start of the string, which is where
    # the mileage appears in this column
    mile = pattern.match(length)
    
    # If a match is found, group(0) holds the matched text
    if mile is not None:
        return float(mile.group(0))
    # No match (for example a missing value): return None explicitly
    return None
        
# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking["Length"].apply(return_mileage)
print(hiking[["Length", "Length_num"]].head())

       Length  Length_num
0   0.8 miles        0.80
1    1.0 mile        1.00
2  0.75 miles        0.75
3   0.5 miles        0.50
4   0.5 miles        0.50
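
A vectorized alternative to the function above is `Series.str.extract`, sketched here on a few hand-written values. Unlike `re.match`, it searches anywhere in the string, and rows without a match become NaN:

```python
import pandas as pd

# Sample values, including two that do not contain a decimal number
lengths = pd.Series(["0.8 miles", "1.0 mile", "0.75 miles", "nan", "2 miles"])

# expand=False returns a Series; the capture group is the extracted text
miles = lengths.str.extract(r"(\d+\.\d+)", expand=False).astype(float)

print(miles)
```

Note that "2 miles" is not matched because the pattern requires a decimal point; a pattern such as `\d+(?:\.\d+)?` would be one way to accept whole numbers too.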


#### Engineering features from strings - tf/idf

Let's transform the volunteer dataset's title column into a text vector, to use in a prediction task in the next exercise.

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

title_text = volunteer['title']

tfidf_vec = TfidfVectorizer()

text_tfidf = tfidf_vec.fit_transform(title_text)

In [177]:
text_tfidf

<665x1136 sparse matrix of type '<class 'numpy.float64'>'
	with 3397 stored elements in Compressed Sparse Row format>
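
The shape of the sparse matrix is (documents, vocabulary terms). A tiny sketch with three made-up titles shows how the vocabulary is built:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up titles for illustration
titles = [
    "Web designer",
    "Volunteers needed for park cleanup",
    "Web developer needed",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(titles)

# One row per document, one column per distinct lowercased token
print(matrix.shape)
print(sorted(vec.vocabulary_))
```

By default the vectorizer lowercases text and keeps tokens of two or more characters, with no stop-word removal, so common words like "for" get their own column.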

### Part 4 - Selecting features for modeling

#### Selecting relevant features

Now let's identify the redundant columns in the volunteer dataset and perform feature selection on the dataset to return a DataFrame of the relevant features.

For example, if you explore the volunteer dataset in the console, you'll see three features which are related to location: locality, region, and postalcode. They contain repeated information, so it would make sense to keep only one of the features.

There are also features that have gone through the feature engineering process: columns like Education and Emergency Preparedness are a product of encoding the categorical variable category_desc, so category_desc itself is redundant now.

Take a moment to examine the features of volunteer in the console, and try to identify the redundant features.

In [178]:
to_drop = ["category_desc", "created_date", "locality", "region", "vol_requests"]

volunteer_subset = volunteer.drop(to_drop,
                                  axis = 1)

volunteer_subset.head(3).T

Unnamed: 0,0,1,2
opportunity_id,4996,5008,5016
content_id,37004,37036,37143
event_time,0,0,0
title,Volunteers Needed For Rise Up & Stay Put! Home...,Web designer,Urban Adventures - Ice Skating at Lasker Rink
hits,737,22,62
summary,Building on successful events last summer and ...,Build a website for an Afghan business,Please join us and the students from Mott Hall...
is_priority,,,
category_id,,1,1
amsl,,,
amsl_unit,,,


#### Checking for correlated features

Let's take a look at the wine dataset again, which is made up of continuous, numerical features. Run Pearson's correlation coefficient on the dataset to determine which columns are good candidates for eliminating. Then, remove those columns from the DataFrame.

Take a minute to find the column whose correlation with other columns exceeds 0.75 at least twice.

In [179]:
wine_subset = wine[['Flavanoids', 'Total phenols', 'Malic acid',
       'OD280/OD315 of diluted wines', 'Hue']].copy()

wine_subset.corr()  

Unnamed: 0,Flavanoids,Total phenols,Malic acid,OD280/OD315 of diluted wines,Hue
Flavanoids,1.0,0.864564,-0.411007,0.787194,0.543479
Total phenols,0.864564,1.0,-0.335167,0.699949,0.433681
Malic acid,-0.411007,-0.335167,1.0,-0.36871,-0.561296
OD280/OD315 of diluted wines,0.787194,0.699949,-0.36871,1.0,0.565468
Hue,0.543479,0.433681,-0.561296,0.565468,1.0


Flavanoids has a correlation above 0.75 with two other columns: Total phenols (0.86) and OD280/OD315 of diluted wines (0.79).

In [0]:
wine_subset.drop('Flavanoids', axis=1, inplace=True)
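
Reading the matrix by eye works for five columns but not for fifty. The helper below is a hypothetical sketch (the function name, threshold defaults, and synthetic data are all assumptions) of how the "above the threshold at least twice" rule could be automated:

```python
import numpy as np
import pandas as pd

def columns_correlated_at_least(df, threshold=0.75, min_count=2):
    """Return columns whose |r| exceeds threshold with min_count other columns."""
    corr = df.corr().abs()
    # Ignore the diagonal (each column correlates perfectly with itself)
    off_diag = corr.mask(np.eye(len(corr), dtype=bool))
    counts = (off_diag > threshold).sum(axis=1)
    return counts[counts >= min_count].index.tolist()

# Synthetic data: "a" drives both "b" and "c"; "d" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=2000)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.7, size=2000),
    "c": a + rng.normal(scale=0.7, size=2000),
    "d": rng.normal(size=2000),
})

to_remove = columns_correlated_at_least(toy)
print(to_remove)
```

The absolute value is used so that strong negative correlations count as well; whether that is desirable depends on the use case.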

#### Using PCA

Let's apply PCA to the wine dataset, to see if we can get an increase in our model's accuracy.

In [181]:
from sklearn.decomposition import PCA

# Set up PCA and the X vector for dimensionality reduction
pca = PCA()
wine_X = wine.drop("Type", axis=1)

# Apply PCA to the wine dataset
transformed_X = pca.fit_transform(wine_X)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)

[9.98091157e-01 1.73591574e-03 9.49596845e-05 5.02203105e-05
 1.23683570e-05 8.46366883e-06 2.80684492e-06 1.52331150e-06
 1.13031557e-06 7.22017230e-07 3.79083815e-07 2.12869847e-07
 8.25543070e-08 5.87363612e-08]
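
The first component above accounts for about 99.8% of the variance. That pattern is typical when PCA runs on unscaled columns and one of them has a far larger spread than the rest, as Proline does here. The sketch below compares raw and standardized inputs, assuming scikit-learn's bundled `load_wine` data as a stand-in for the CSV used in this notebook:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the CSV above
X, _ = load_wine(return_X_y=True)

raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

# Share of variance captured by the first component, raw vs. standardized
print(raw_ratio[0], scaled_ratio[0])

# Number of components needed to reach 95% of the variance after scaling
n_95 = int(np.searchsorted(np.cumsum(scaled_ratio), 0.95) + 1)
print(n_95)
```

After standardizing, the variance is spread over many more components, which gives a more useful picture of how many dimensions the data really has.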


#### Training a model with PCA

Now that we have run PCA on the wine dataset, let's try training a model with it.

In [182]:
# Split the transformed X and the y labels into training and test sets
y = wine['Type']
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, y)

# Fit knn to the training data
knn.fit(X_wine_train, y_wine_train)

# Score knn on the test data and print it out
print(knn.score(X_wine_test, y_wine_test))

0.6444444444444445


### Part 5 - Putting it all together

#### Checking column types

Take a look at the UFO dataset's column types using the `dtypes` attribute. Two columns jump out for transformation: the `seconds` column, which is numeric and should be stored as a float (in this run the output below shows that pandas already read it as float64, so the cast changes nothing), and the `date` column, which can be transformed into the datetime type. That will make our feature engineering efforts easier later on.

In [183]:
ufo.head(10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
date,11/3/2011 19:21,10/3/2004 19:05,9/25/2009 21:00,11/21/2002 05:45,8/19/2010 12:55,6/16/2012 23:00,7/12/2009 21:30,10/20/2008 18:30,6/9/2013 00:00,4/26/2013 23:27
city,woodville,cleveland,coon rapids,clemmons,calgary (canada),san diego,duluth,fairfield,oakville (canada),lacey
state,wi,oh,mn,nc,ab,ca,mn,tx,on,wa
country,us,us,us,us,ca,us,us,us,ca,us
type,unknown,circle,cigar,triangle,oval,light,oval,other,light,light
seconds,1.2096e+06,30,0,300,0,600,600,0,120,120
length_of_time,2 weeks,30sec.,,about 5 minutes,2,10 minutes,total? maybe around 10 mi,several sightings from 10,2 minutes,2 minutes
desc,Red blinking objects similar to airplanes or s...,Many fighter jets flying towards UFO,Green&#44 red&#44 and blue pulses of light tha...,It was a large&#44 triangular shaped flying ob...,A white spinning disc in the shape of an oval.,Dancing lights that would fly around and then ...,A minor amber color trail&#44 (from where we w...,Multiple sightings in Central Texas (Freestone...,Brilliant orange light or chinese lantern at o...,Bright red light moving north to north west fr...
recorded,12/12/2011,10/27/2004,12/12/2009,12/23/2002,8/24/2010,7/4/2012,3/13/2012,1/10/2009,7/3/2013,5/15/2013
lat,44.9530556,41.4994444,45.1200000,36.0213889,51.083333,32.7152778,46.7833333,31.7244444,43.433333,47.0344444


In [184]:
print(ufo.dtypes)

# Change the type of seconds to float
ufo["seconds"] = ufo["seconds"].astype(float)

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo["date"])
print('-' * 40)
print('-' * 40)
# Check the column types
print(ufo[["seconds", "date"]].dtypes)

date               object
city               object
state              object
country            object
type               object
seconds           float64
length_of_time     object
desc               object
recorded           object
lat                object
long              float64
dtype: object
----------------------------------------
----------------------------------------
seconds           float64
date       datetime64[ns]
dtype: object


#### Identifying features for standardization

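In this section, you'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the seconds and minutes columns, you'll see that the variance of the seconds column is extremely high. Because seconds and minutes are related to each other (an issue we'll deal with when we select features for modeling), let's log normalize the seconds column. Only `seconds` is inspected in the cell below. Note that in this run the variance of the log-transformed column prints as `nan`: the column contains sightings with a duration of 0 seconds, and `np.log(0)` is `-inf`.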

In [185]:
print(ufo[["seconds"]].var())

# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo["seconds"])

# Print out the variance of just the seconds_log column
print(ufo["seconds_log"].var())

seconds    3.156735e+10
dtype: float64
nan
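
Because `np.log(0)` evaluates to `-inf`, a column containing zeros needs some handling before log normalization. Two common options are sketched below on a few duration values like those shown earlier; which one is appropriate depends on whether zero-second sightings are meaningful for the analysis:

```python
import numpy as np
import pandas as pd

seconds = pd.Series([1209600.0, 30.0, 0.0, 300.0, 600.0], name="seconds")

# Option 1: keep only strictly positive durations before taking the log
positive = seconds[seconds > 0]
log_positive = np.log(positive)

# Option 2: log1p computes log(1 + x), which is defined at x = 0
log1p_all = np.log1p(seconds)

# Both variances are finite, unlike the variance of np.log(seconds)
print(log_positive.var(), log1p_all.var())
```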




#### Encoding categorical variables

There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. You'll do that transformation here, using both binary and one-hot encoding methods.

In [186]:
# Use Pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda val: 1 if val == "us" else 0)

# Print the number of unique type values
print(len(ufo["type"].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo["type"])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

22


#### Text vectorization

Let's transform the desc column in the UFO dataset into tf/idf vectors, since there's likely something we can learn from this field.

In [187]:
# Take a look at the head of the desc field
print(ufo["desc"].head())

# Create the tfidf vectorizer object
vec = TfidfVectorizer()

# Use vec's fit_transform method on the desc field
desc_tfidf = vec.fit_transform(ufo["desc"].fillna('0'))

# Look at the number of columns this creates.
print(desc_tfidf.shape)

0    Red blinking objects similar to airplanes or s...
1                 Many fighter jets flying towards UFO
2    Green&#44 red&#44 and blue pulses of light tha...
3    It was a large&#44 triangular shaped flying ob...
4       A white spinning disc in the shape of an oval.
Name: desc, dtype: object
(4935, 6433)
