# Intro to Sklearn - Machine Learning in Python

## by Corey Wade

The following Jupyter Notebook is an introduction to Machine Learning in Python for ODSC West attendees on Nov. 1, 2022. We will be using pandas for data analytics, and sklearn for machine learning. A wide range of models will be covered including Linear and Logistic Regression, Decision Trees, Random Forests, and XGBoost.

This presentation includes ML fundamentals covered in Corey Wade's book [Hands-on Gradient Boosting with XGBoost and scikit-learn](https://www.amazon.com/Hands-Gradient-Boosting-XGBoost-scikit-learn/dp/1839218355). Another recommended text is [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646/). For great web references, check out [Jason Brownlee's Machine Learning Mastery](https://machinelearningmastery.com/about/).

Our focus is on tabular data, that is, rows and columns of data stored in tables; this contrasts with images and text, which are considered unstructured data. For images and text, neural networks usually perform better. For tabular data, neural networks do not necessarily have an edge. We will focus on XGBoost, one of the strongest ML algorithms in the world, which often has an edge on tabular data.

# Module 1 - Preparing data for ML with pandas

The following module provides a brief introduction to pandas. To go more in-depth, try tutorial options from the official documentation: https://pandas.pydata.org/docs/getting_started/tutorials.html.

## Loading Data

### Bike Rentals Dataset

The [Bike Rentals dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset) is from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). It has been modified to include null values so that we can practice correcting them.

In [1]:
# load data into pandas dataframe and show first 5 rows
import pandas as pd
df = pd.read_csv('bike_rentals.csv')
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1.0,0.0,1.0,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1.0,0.0,1.0,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1.0,0.0,1.0,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1.0,0.0,1.0,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1.0,0.0,1.0,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


## General Data Info

In [2]:
# show descriptive statistics
df.describe()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731.0,730.0,730.0,731.0,731.0,731.0,731.0,730.0,730.0,728.0,726.0,731.0,731.0,731.0
mean,366.0,2.49658,0.5,6.512329,0.028728,2.997264,0.682627,1.395349,0.495587,0.474512,0.627987,0.190476,848.176471,3656.172367,4504.348837
std,211.165812,1.110807,0.500343,3.448303,0.167155,2.004787,0.465773,0.544894,0.183094,0.163017,0.142331,0.077725,686.622488,1560.256377,1937.211452
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.336875,0.337794,0.521562,0.134494,315.5,2497.0,3152.0
50%,366.0,3.0,0.5,7.0,0.0,3.0,1.0,1.0,0.499166,0.487364,0.627083,0.180971,713.0,3662.0,4548.0
75%,548.5,3.0,1.0,9.75,0.0,5.0,1.0,2.0,0.655625,0.608916,0.730104,0.233218,1096.0,4776.5,5956.0
max,731.0,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,3410.0,6946.0,8714.0


In [3]:
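# show correlations between numeric columns
# (numeric_only=True skips the text column dteday, which recent pandas versions require)
df.corr(numeric_only=True)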

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
instant,1.0,0.4122242,0.8660262,0.494807,0.016145,-1.6e-05,-0.009415,-0.021477,0.152677,0.154502,0.013773,-0.113047,0.275255,0.659623,0.62883
season,0.412224,1.0,-5.428568e-16,0.836863,-0.010537,-0.00308,0.016433,0.019211,0.336388,0.344739,0.209028,-0.228499,0.210399,0.411623,0.4061
yr,0.866026,-5.428568e-16,1.0,-0.003975,0.008195,-0.004103,-0.002945,-0.050322,0.050979,0.04935,-0.115456,-0.011963,0.249593,0.596168,0.56868
mnth,0.494807,0.8368628,-0.003975295,1.0,0.019599,0.011707,-0.007395,0.041218,0.226546,0.233626,0.227641,-0.206162,0.124549,0.296062,0.282624
holiday,0.016145,-0.01053666,0.008195345,0.019599,1.0,-0.10196,-0.252224,-0.034627,-0.028759,-0.032685,-0.016095,0.006319,0.054274,-0.108745,-0.068348
weekday,-1.6e-05,-0.003079881,-0.00410257,0.011707,-0.10196,1.0,0.038678,0.031087,-0.00183,-0.009003,-0.052728,0.014384,0.059923,0.057367,0.067443
workingday,-0.009415,0.01643296,-0.002945396,-0.007395,-0.252224,0.038678,1.0,0.057866,0.055573,0.055329,0.025879,-0.01772,-0.515692,0.30613,0.063781
weathersit,-0.021477,0.01921103,-0.05032247,0.041218,-0.034627,0.031087,0.057866,1.0,-0.119527,-0.120651,0.592841,0.038912,-0.247353,-0.260388,-0.297391
temp,0.152677,0.3363881,0.05097873,0.226546,-0.028759,-0.00183,0.055573,-0.119527,1.0,0.991702,0.13338,-0.159242,0.5436,0.540327,0.62786
atemp,0.154502,0.3447388,0.04934973,0.233626,-0.032685,-0.009003,0.055329,-0.120651,0.991702,1.0,0.146036,-0.184754,0.544113,0.544442,0.631357


In [4]:
# show histograms and scatter plots of all columns
import seaborn as sns
#sns.pairplot(df)

In [5]:
# get info on columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    float64
 3   yr          730 non-null    float64
 4   mnth        730 non-null    float64
 5   holiday     731 non-null    float64
 6   weekday     731 non-null    float64
 7   workingday  731 non-null    float64
 8   weathersit  731 non-null    int64  
 9   temp        730 non-null    float64
 10  atemp       730 non-null    float64
 11  hum         728 non-null    float64
 12  windspeed   726 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(10), int64(5), object(1)
memory usage: 91.5+ KB


## Null Values

In [6]:
# show total null values per column
df.isna().sum()

instant       0
dteday        0
season        0
yr            1
mnth          1
holiday       0
weekday       0
workingday    0
weathersit    0
temp          1
atemp         1
hum           3
windspeed     5
casual        0
registered    0
cnt           0
dtype: int64

In [7]:
# sum null values
df.isna().sum().sum()

12

In [8]:
# show all rows that contain at least one null value
df[df.isna().any(axis=1)]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
56,57,2011-02-26,1.0,0.0,2.0,0.0,6.0,0.0,1,0.2825,0.282192,0.537917,,424,1545,1969
81,82,2011-03-23,2.0,0.0,3.0,0.0,3.0,1.0,2,0.346957,0.337939,0.839565,,203,1918,2121
128,129,2011-05-09,2.0,0.0,5.0,0.0,1.0,1.0,1,0.5325,0.525246,0.58875,,664,3698,4362
129,130,2011-05-10,2.0,0.0,5.0,0.0,2.0,1.0,1,0.5325,0.522721,,0.115671,694,4109,4803
213,214,2011-08-02,3.0,0.0,8.0,0.0,2.0,1.0,1,0.783333,0.707071,,0.20585,801,4044,4845
298,299,2011-10-26,4.0,0.0,10.0,0.0,3.0,1.0,2,0.484167,0.472846,0.720417,,404,3490,3894
388,389,2012-01-24,1.0,1.0,1.0,0.0,2.0,1.0,1,0.3425,0.349108,,0.123767,439,3900,4339
528,529,2012-06-12,2.0,1.0,6.0,0.0,2.0,1.0,2,0.653333,0.597875,0.833333,,477,4495,4972
701,702,2012-12-02,4.0,1.0,12.0,0.0,0.0,0.0,2,,,0.823333,0.124379,892,3757,4649
730,731,2012-12-31,1.0,,,0.0,1.0,0.0,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729


In [9]:
# change null values in column
df['windspeed'] = df['windspeed'].fillna(df['windspeed'].median())

In [10]:
# change null values for entire dataframe
# (numeric_only=True skips the text column dteday when computing medians)
df = df.fillna(df.median(numeric_only=True))

In [11]:
# show rows
df.iloc[[129,213,730]]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
129,130,2011-05-10,2.0,0.0,5.0,0.0,2.0,1.0,1,0.5325,0.522721,0.627083,0.115671,694,4109,4803
213,214,2011-08-02,3.0,0.0,8.0,0.0,2.0,1.0,1,0.783333,0.707071,0.627083,0.20585,801,4044,4845
730,731,2012-12-31,1.0,0.5,7.0,0.0,1.0,0.0,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729
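As a small illustration of the median fill above, the sketch below uses a made-up DataFrame (the name `toy` and all of its values are hypothetical, not taken from the bike rentals data). It suggests why a column-wide median can be a poor fit for a flag column like yr: the median of 0s and 1s may land on 0.5, which is not a valid value, so correcting individual entries by hand can make more sense.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a text column plus two numeric columns with gaps
toy = pd.DataFrame({
    'dteday': ['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04', '2011-01-05'],
    'hum': [0.5, np.nan, 0.7, 0.9, 0.7],
    'yr': [0.0, 0.0, 1.0, 1.0, np.nan],
})

# numeric_only=True skips the text column when computing medians
medians = toy.median(numeric_only=True)

# fillna with a Series fills each column using the value stored under the same label
filled = toy.fillna(medians)
print(filled)
```

Here the missing hum entry becomes 0.7 and the missing yr entry becomes 0.5, mirroring the yr value of 0.5 that appears in row 730 above.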


In [12]:
# change null values by entry
df.loc[730,'yr']=1.0
df.loc[730, 'season']=4.0
df.loc[[730]]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
730,731,2012-12-31,4.0,1.0,7.0,0.0,1.0,0.0,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729


## Choose X and y

In [13]:
# show order of columns
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1.0,0.0,1.0,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1.0,0.0,1.0,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1.0,0.0,1.0,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1.0,0.0,1.0,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1.0,0.0,1.0,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


In [14]:
# choose X as all rows, and all columns excluding the first 2, and last 3
X = df.iloc[:, 2:-3]
X.head()

Unnamed: 0,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed
0,1.0,0.0,1.0,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446
1,1.0,0.0,1.0,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539
2,1.0,0.0,1.0,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309
3,1.0,0.0,1.0,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296
4,1.0,0.0,1.0,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869


In [15]:
# choose y as the last column
y=df.iloc[:, -1]
y.head()

0     985
1     801
2    1349
3    1562
4    1600
Name: cnt, dtype: int64

## The Census Dataset

The [Census Dataset](https://archive.ics.uci.edu/ml/datasets/Adult) (also called the Adult Dataset) is also from UCI. We include this dataset to balance regression with classification.

In [16]:
# load Census dataset with no header
df2 = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)

# define columns by name
df2.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation',
               'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
               'income']

# show first 5 rows
df2.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [17]:
# get column info
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


## One-hot encoding

One-hot encoding takes each categorical column (say, Color) and transforms it into one new column per value (Red, Green, Blue), with the values serving as the new column headers; each new column holds 1 where the value is present and 0 where it is absent. pd.get_dummies() often works for this purpose. sklearn also provides [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), which may be useful for pipelines.
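To make the Color example concrete, here is a minimal sketch on a made-up DataFrame (the name `toy` and its values are hypothetical). It passes `dtype=int` so the new columns show 1 and 0, since recent pandas versions otherwise return True and False.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column
toy = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Red'],
                    'size': [1, 2, 3, 4]})

# Numeric columns are kept as they are; each category value becomes its own column
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

The result keeps size and adds color_Blue, color_Green, and color_Red, one column per distinct value.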

In [18]:
# Use pd.get_dummies() to transform categorical into numerical columns
df2 = pd.get_dummies(df2)

In [19]:
# show df after one-hot-encoding
df2.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia,income_ <=50K,income_ >50K
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
2,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
3,53,234721,7,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
4,28,338409,13,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [20]:
# get new number of columns
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Columns: 110 entries, age to income_ >50K
dtypes: int64(6), uint8(104)
memory usage: 4.7 MB


In [21]:
# select X as all rows, columns except for last 2
X2 = df2.iloc[:, :-2]

# select y as last column
y2 = df2.iloc[:, -1]

# Module 2 - Supervised learning with sklearn

In this module we cover the essentials of the sklearn suite.

In [22]:
# Split data into training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=0)

## Linear Regression

In [23]:
# import Linear Regression
from sklearn.linear_model import LinearRegression

# initialize model
model = LinearRegression()

# fit model to training data
model.fit(X_train, y_train)

# score model on test data (uses r2 default metric)
model.score(X_test, y_test)

0.8040621482429495
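For regressors, score returns the coefficient of determination (R²). The following is a minimal sketch of that metric on made-up numbers (both arrays are hypothetical), assuming the standard definition of 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical actual values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: squared gaps between actual and predicted values
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: squared gaps between actual values and their mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
print(r2)
```

A score of 1.0 would mean perfect predictions, while 0.0 would mean the model does no better than always predicting the mean.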

In [24]:
# show model coefficients
model.coef_

array([  448.99545626,  1964.15211779,   -26.9008259 ,  -358.1634053 ,
          76.05426566,    89.95264279,  -532.8684617 ,  2586.32297494,
        3140.02805295, -1297.67659702, -2829.67689761])
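Each coefficient acts as a weight on one column: a linear model's prediction is the dot product of a row with coef_, plus intercept_. The sketch below checks this on hypothetical toy data generated from y = 2x + 1 (the names `X_toy`, `y_toy`, and `toy_model` are illustrative only).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data that follows y = 2x + 1 exactly
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

toy_model = LinearRegression().fit(X_toy, y_toy)

# Rebuild the predictions by hand from the fitted weights
manual = X_toy @ toy_model.coef_ + toy_model.intercept_

print(toy_model.coef_, toy_model.intercept_)
print(np.allclose(manual, toy_model.predict(X_toy)))
```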

In [25]:
# show model params
model.get_params()

{'copy_X': True,
 'fit_intercept': True,
 'n_jobs': None,
 'normalize': False,
 'positive': False}

In [26]:
# show model predictions
model.predict(X_test.iloc[-2:])

array([6772.94745769, 4490.76552698])

In [27]:
# compare predictions to actual results
y_test[-2:]

504    8294
239    4334
Name: cnt, dtype: int64

## Regressors

In [28]:
# create function to score regressors
def score_reg(model):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [29]:
# import and score Decision Tree
from sklearn.tree import DecisionTreeRegressor
score_reg(DecisionTreeRegressor())

0.8172726869322732

In [30]:
# import and score Random Forest
from sklearn.ensemble import RandomForestRegressor
score_reg(RandomForestRegressor())

0.8945909330392295

In [31]:
# install XGBoost into the current Python environment
import sys
!{sys.executable} -m pip install xgboost



In [32]:
# import and score XGBoost
from xgboost import XGBRegressor
score_reg(XGBRegressor())

0.8911181007632755

## Classifiers

In [33]:
# write function to score classifiers
def score_clf(model):
    model.fit(X2_train, y2_train)
    return model.score(X2_test, y2_test)

In [34]:
# import and score Logistic Regression
from sklearn.linear_model import LogisticRegression
score_clf(LogisticRegression())

0.7930293259634577

In [35]:
# import and score Decision Tree for classification
from sklearn.tree import DecisionTreeClassifier
score_clf(DecisionTreeClassifier(random_state=0))

0.8136035621065562

In [36]:
# import and score Random Forest for classification
from sklearn.ensemble import RandomForestClassifier
score_clf(RandomForestClassifier(random_state=0))

0.8484569322892677

In [37]:
# import and score XGBoost for classification
from xgboost import XGBClassifier
score_clf(XGBClassifier(random_state=0))

0.869031168432366

## Predictions / Scoring Metrics

Making meaningful predictions is arguably the most important part of Machine Learning. sklearn models accept pandas DataFrames or NumPy arrays as inputs when making predictions.

There are many scoring metrics available in sklearn, especially for classification. See your options here: https://scikit-learn.org/stable/modules/model_evaluation.html

### Root Mean Squared Error

In [38]:
# build XGBoost model
model = XGBRegressor()
model.fit(X_train, y_train)

# get predictions for test set
y_pred = model.predict(X_test)

# for rmse, import mse first
from sklearn.metrics import mean_squared_error
# get mse
mse = mean_squared_error(y_test, y_pred)
# compute rmse
rmse = mse**0.5
# show rmse
print(rmse)

680.5073611167628


In [39]:
# show descriptive stats for y
y.describe()

count     731.000000
mean     4504.348837
std      1937.211452
min        22.000000
25%      3152.000000
50%      4548.000000
75%      5956.000000
max      8714.000000
Name: cnt, dtype: float64
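To make the RMSE computation concrete, here is a minimal sketch on made-up numbers (the `actual` and `predicted` lists are hypothetical). The last lines reuse two figures printed above, the XGBoost RMSE and the mean of cnt, to express the error relative to a typical day; that ratio is one possible way to read the score, not the only one.

```python
import math

# Hypothetical actual and predicted values
actual = [100, 200, 300, 400]
predicted = [110, 190, 320, 380]

# Mean of the squared errors, then the square root
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)
print(rmse)

# Values copied from the notebook output above
rmse_notebook = 680.5073611167628
mean_cnt = 4504.348837

# Error as a fraction of the mean daily count
print(rmse_notebook / mean_cnt)
```

Read this way, an RMSE of roughly 680 is about 15% of the mean daily count of roughly 4504.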

### Predictions

In [40]:
# look at last 5 rows
X_test.tail()

Unnamed: 0,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed
566,3.0,1.0,7.0,0.0,5.0,1.0,2,0.665833,0.613025,0.844167,0.208967
688,4.0,1.0,11.0,0.0,1.0,1.0,2,0.380833,0.375621,0.623333,0.235067
266,4.0,0.0,9.0,0.0,6.0,0.0,2,0.606667,0.564412,0.8625,0.078383
504,2.0,1.0,5.0,0.0,6.0,0.0,1,0.6,0.566908,0.45625,0.083975
239,3.0,0.0,8.0,0.0,0.0,0.0,1,0.707059,0.647959,0.561765,0.304659


In [69]:
# predict last 2 rows
model.predict(X_test.iloc[-2:,:])

array([7707.2   , 4565.5044], dtype=float32)

In [72]:
# check last 2 rows actual value
y_test.iloc[-2:]

504    8294
239    4334
Name: cnt, dtype: int64

### NumPy Arrays

It's often easier to make predictions from ML models when your inputs are NumPy arrays, because then you don't have to worry about column names. Both pandas DataFrames and NumPy arrays work, but NumPy arrays are more convenient for single rows.

In [43]:
# convert data to numpy arrays
import numpy as np
X_train_np = np.array(X_train)
X_test_np = np.array(X_test)
y_train_np = np.array(y_train)
y_test_np = np.array(y_test)

# train model on numpy arrays
model = XGBRegressor()
model.fit(X_train_np, y_train_np)

In [44]:
# select last row to modify
X_test.tail(1)

Unnamed: 0,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed
239,3.0,0.0,8.0,0.0,0.0,0.0,1,0.707059,0.647959,0.561765,0.304659


In [45]:
# show predictions
model.predict(np.array([[3.0, 0.0, 8.0, 0.0, 0.0, 0.0, 2, 0.676667, 0.624388, 0.817500, 0.222633]]))

array([3823.1265], dtype=float32)

In [46]:
# now the prediction works, even though no column headers have been provided
model.predict([[3.0, 0.0, 8.0, 0.0, 0.0, 0.0, 2, 0.676667, 0.624388, 0.817500, 0.222633]])

array([3823.1265], dtype=float32)

### Confusion Matrix and Classification Report

In [47]:
# show confusion matrix and classification report
from sklearn.metrics import confusion_matrix, classification_report
model = XGBClassifier()
model.fit(X2_train, y2_train)
y2_pred = model.predict(X2_test)
print(confusion_matrix(y2_test, y2_pred))
print(classification_report(y2_test, y2_pred))

[[4587  331]
 [ 522 1073]]
              precision    recall  f1-score   support

           0       0.90      0.93      0.91      4918
           1       0.76      0.67      0.72      1595

    accuracy                           0.87      6513
   macro avg       0.83      0.80      0.82      6513
weighted avg       0.87      0.87      0.87      6513



In [48]:
# show the f1-score
from sklearn.metrics import f1_score
f1_score(y2_test, y2_pred)

0.7155718572857619
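The f1-score can be reproduced from the confusion matrix printed above. This sketch assumes sklearn's layout for binary labels, where rows are actual classes and columns are predicted classes, so the four cells are true negatives, false positives, false negatives, and true positives.

```python
# Cells copied from the confusion matrix above
tn, fp = 4587, 331
fn, tp = 522, 1073

# Precision: share of predicted positives that are correct
precision = tp / (tp + fp)

# Recall: share of actual positives that were found
recall = tp / (tp + fn)

# f1-score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), f1)
```

The rounded precision and recall match the row for class 1 in the classification report.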

## Your Turn!

Try different models to get the best f1-score on the census dataset using random_state=0. You must use default params! (Changing params is covered in Module 4 of this notebook.)

In [49]:
# try out different models using f1-scoring metric


# Module 3 - Cross-validation with sklearn

In [50]:
# import cross_val_score to use cross-validation
from sklearn.model_selection import cross_val_score
# choose your model
model=XGBRegressor()
# get scores on five folds of data 
cross_val_score(model, X, y, scoring='r2', cv=5)

array([0.41956178, 0.35431328, 0.14316349, 0.19342386, 0.64222721])

In [51]:
# get mean rmse
mse = cross_val_score(model, X, y, scoring='neg_mean_squared_error')
rmse = (-mse)**0.5
print(rmse.mean())

934.0368073102878


In [52]:
# use KFold for shuffled, consistent folds 
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
model=XGBRegressor()
mse = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=kfold)
rmse = (-mse)**0.5
print(rmse)
print(rmse.mean())

[680.50736112 674.32594581 660.63198199 813.16389678 654.28787799]
696.5834127397749
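When cv is an integer and the model is a regressor, cross_val_score uses KFold without shuffling, so every test fold is a contiguous block of rows. Because the bike rentals rows are ordered by date, this may partly explain the weaker unshuffled scores above. The sketch below compares both settings on ten hypothetical row indices.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for ten rows stored in date order
rows = np.arange(10)

# Without shuffling, each test fold is a consecutive block
plain = KFold(n_splits=5)
plain_folds = [test_idx.tolist() for _, test_idx in plain.split(rows)]
print(plain_folds)

# With shuffling and a fixed random_state, folds are mixed but repeatable
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_folds = [test_idx.tolist() for _, test_idx in shuffled.split(rows)]
print(shuffled_folds)
```

In both cases every row lands in exactly one test fold; only the grouping differs.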


In [53]:
# use stratified Kfold for classification to balance all test sets
from sklearn.model_selection import StratifiedKFold
ksfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
#clf=XGBClassifier()
#f1 = cross_val_score(clf, X2, y2, scoring='f1', cv=ksfold)
#print(f1.mean())
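As a sketch of what "balance all test sets" means, the example below splits hypothetical imbalanced labels (eight 0s and four 1s; the names `labels` and `features` are illustrative) and collects the labels that end up in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: eight 0s and four 1s
labels = np.array([0] * 8 + [1] * 4)

# Placeholder features; only the labels drive the stratified split
features = np.zeros((12, 1))

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Labels that end up in each test fold
fold_labels = [labels[test_idx].tolist() for _, test_idx in skf.split(features, labels)]
print(fold_labels)
```

Each test fold holds two 0s and one 1, the same 2:1 ratio as the full label set.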

# Module 4 - Fine-tuning models with sklearn

## GridSearch

In [54]:
# use GridSearchCV to search grid of hyperparameters for best values
from sklearn.model_selection import GridSearchCV

# GridSearch uses a dictionary of parameters to find optimal values
params = {'max_depth':[1, 2, 3, 4, 5, 6, 8, 10]}

# GridSearchCV takes an ML model, the dictionary of params, etc. as inputs
model = XGBRegressor()
grid_reg = GridSearchCV(model, params, scoring='neg_mean_squared_error', cv=kfold)

# you fit gridsearch on training data just like an ml model
grid_reg.fit(X_train, y_train)

# now you can access the best parameters, with the best score
best_params = grid_reg.best_params_
print("Best params:", best_params)
best_score = (-grid_reg.best_score_)**0.5
print("Best score:", best_score)

Best params: {'max_depth': 4}
Best score: 660.9391538175233


In [55]:
# This function includes all steps in the cell above with XGBoost as the default model
def grid_search(params, reg=XGBRegressor()):
    grid_reg = GridSearchCV(reg, params, scoring='neg_mean_squared_error', cv=kfold)
    grid_reg.fit(X_train, y_train)
    best_params = grid_reg.best_params_
    print("Best params:", best_params)
    best_score = (-grid_reg.best_score_)**0.5
    print("Best score:", best_score)

In [56]:
# show params
model.get_params

<bound method XGBModel.get_params of XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, gamma=None,
             gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, n_estimators=100, n_jobs=None,
             num_parallel_tree=None, predictor=None, random_state=None,
             reg_alpha=None, reg_lambda=None, ...)>

In [57]:
# search 2 params - 12 models total
grid_search({'max_depth':[3, 4, 5],
            'n_estimators':[50, 100, 200, 400]})

Best params: {'max_depth': 4, 'n_estimators': 50}
Best score: 655.9991890520657
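The count of 12 models comes from multiplying the lengths of the two lists. The sketch below enumerates the same grid using only the standard library; it mimics how a grid is laid out and is not GridSearchCV's internal code.

```python
from itertools import product

# Same grid as in the cell above
params = {'max_depth': [3, 4, 5],
          'n_estimators': [50, 100, 200, 400]}

# Every combination of one value per parameter
combos = [dict(zip(params, values)) for values in product(*params.values())]
print(len(combos))
print(combos[0], combos[-1])

# With 5 folds, each combination is fit once per fold during cross-validation
cv_fits = len(combos) * 5
print(cv_fits)
```

Adding another parameter list multiplies the count again, which is why full grids grow quickly.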


In [58]:
# add additional params
grid_search(params={'max_depth':[4],
                    'colsample_bytree':[0.4, 0.6, 0.8, 1],
                   'n_estimators':[25, 50, 100]})

Best params: {'colsample_bytree': 1, 'max_depth': 4, 'n_estimators': 25}
Best score: 654.3963754643038


## RandomSearch

In [59]:
# RandomizedSearchCV works the same way, but checks n (10 by default) random combinations
from sklearn.model_selection import RandomizedSearchCV
def random_search(params, reg=XGBRegressor()):
    grid_reg = RandomizedSearchCV(reg, params, scoring='neg_mean_squared_error', cv=kfold, n_iter=10, random_state=0)
    grid_reg.fit(X_train, y_train)
    best_params = grid_reg.best_params_
    print("Best params:", best_params)
    best_score = (-grid_reg.best_score_)**0.5
    print("Best score:", best_score)

In [60]:
# the following is a reasonable starting sample of params
random_search(params={'subsample':[0.5, 0.6, 0.7, 0.8, 0.9, 1],
        'colsample_bynode':[0.5, 0.6, 0.7, 0.8, 0.9, 1],
        'colsample_bytree':[0.5, 0.6, 0.7, 0.8, 0.9, 1],
        'colsample_bylevel':[0.5, 0.6, 0.7, 0.8, 0.9, 1], 
        'min_child_weight':[1, 2, 3, 4, 5], 
        'learning_rate':[0.001, 0.01, 0.1, 0.2, 0.4, 0.6], 
        'max_depth':[2, 3, 4, 5, 6, 8, 10], 
        'n_estimators':[25, 50, 100, 200, 400]})

Best params: {'subsample': 0.7, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 8, 'learning_rate': 0.1, 'colsample_bytree': 1, 'colsample_bynode': 0.8, 'colsample_bylevel': 0.6}
Best score: 625.181686886698
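To see why a random search is attractive here, the sketch below counts how many combinations a full grid over the same lists would contain, then draws 10 candidates. The sampler is a simplified stand-in written with the standard library (it may repeat a combination) and does not reproduce RandomizedSearchCV's own sampling.

```python
import math
import random

# Same lists as in the cell above
search_space = {
    'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'colsample_bynode': [0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'colsample_bylevel': [0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'min_child_weight': [1, 2, 3, 4, 5],
    'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.4, 0.6],
    'max_depth': [2, 3, 4, 5, 6, 8, 10],
    'n_estimators': [25, 50, 100, 200, 400],
}

# Size of the full grid: the product of the list lengths
total_combinations = math.prod(len(values) for values in search_space.values())
print(total_combinations)

# Draw 10 candidates, picking one value per parameter each time
rng = random.Random(0)
candidates = [{name: rng.choice(values) for name, values in search_space.items()}
              for _ in range(10)]
print(candidates[0])
```

A full grid would need 1,360,800 combinations, while the random search evaluates only 10 of them.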


In [61]:
# adjust based on results
random_search(params={'subsample':[0.6, 0.7, 0.8],
        'colsample_bynode':[0.7, 0.8, 0.9],
        'colsample_bytree':[0.9, 1],
        'colsample_bylevel':[0.5, 0.6, 0.7], 
        'min_child_weight':[1, 2], 
        'learning_rate':[0.05, 0.1, 0.25], 
        'max_depth':[6, 8, 10], 
        'n_estimators':[50, 100, 200]})

Best params: {'subsample': 0.6, 'n_estimators': 200, 'min_child_weight': 2, 'max_depth': 6, 'learning_rate': 0.05, 'colsample_bytree': 0.9, 'colsample_bynode': 0.9, 'colsample_bylevel': 0.7}
Best score: 615.411076649143


## Your turn!

Try your own random and grid searches to get the best possible cv score on 5 folds using random_state=0.

In [62]:
# try your own random searches, and/or grid searches

# Module 5 - Feature Importances

## Finalize Model

In [63]:
# choose your best model, fit on your data, then test against unseen data
model = XGBRegressor(subsample=0.8, n_estimators=100, max_depth=8,
                    learning_rate=0.1, colsample_bytree=1,
                    colsample_bynode=0.4, colsample_bylevel=0.6)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mse**0.5

607.1702124988886

In [64]:
# show the influence of each column
model.feature_importances_

array([0.19686565, 0.42768875, 0.05280402, 0.01175597, 0.01094317,
       0.0068551 , 0.03675866, 0.13856605, 0.05325931, 0.03932855,
       0.02517479], dtype=float32)

In [65]:
# zip columns and feature_importances_ into dict
feature_dict = dict(zip(X.columns, model.feature_importances_))

# import operator
import operator

# sort dict by values (as list of tuples)
sorted(feature_dict.items(), key=operator.itemgetter(1), reverse=True)

[('yr', 0.42768875),
 ('season', 0.19686565),
 ('temp', 0.13856605),
 ('atemp', 0.053259306),
 ('mnth', 0.052804016),
 ('hum', 0.039328545),
 ('weathersit', 0.036758665),
 ('windspeed', 0.025174795),
 ('holiday', 0.011755966),
 ('weekday', 0.010943173),
 ('workingday', 0.006855099)]
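An alternative way to rank the importances is a pandas Series, which keeps the column names attached and sorts in one call. The sketch below copies the column names and importance values from the outputs above rather than refitting the model; the name `ranked` is illustrative.

```python
import pandas as pd

# Column names and importances copied from the outputs above
columns = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
           'weathersit', 'temp', 'atemp', 'hum', 'windspeed']
importances = [0.19686565, 0.42768875, 0.05280402, 0.01175597, 0.01094317,
               0.0068551, 0.03675866, 0.13856605, 0.05325931, 0.03932855,
               0.02517479]

# Label each importance with its column, then sort from largest to smallest
ranked = pd.Series(importances, index=columns).sort_values(ascending=False)
print(ranked)

# The importances sum to approximately 1
print(ranked.sum())
```

With a fitted model, the same Series could be built from model.feature_importances_ and X.columns.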