# Introducing Machine Learning in Python with Scikit-learn

## by Corey Wade

The following Jupyter Notebook is an introduction to Machine Learning in Python designed for ODSC West attendees on Monday, October 30, 2023. We use pandas for preliminary data analytics, and sklearn for machine learning. A wide range of models will be covered including Linear and Logistic Regression, Decision Trees, Random Forests, XGBoost, and a version of LightGBM.

This presentation includes an updated version of ML fundamentals as covered in Corey Wade's book [Hands-on Gradient Boosting with XGBoost and scikit-learn](https://www.amazon.com/Hands-Gradient-Boosting-XGBoost-scikit-learn/dp/1839218355).

Our focus is on tabular data, that is, rows and columns of data stored in tables, as contrasted with images and text, which are considered unstructured data. When it comes to images and text, neural networks usually perform better. For tabular data, neural networks do not necessarily have an edge. We will focus on XGBoost, one of the strongest ML algorithms in the world, which has outperformed neural networks in Kaggle competitions.

Additional note: the regression and classification models presented fall under the umbrella of Supervised Learning, which means that the y-values of the target column are known in advance. The other option is Unsupervised Learning, where the target values are unknown; one example is taking the texts of books as input and grouping them into genres without being told the genres in advance.

# Module 1 - Preparing data for ML with pandas

This module provides a brief introduction to pandas in terms of preparing data for machine learning.

For machine learning algorithms to work, all data must be numerical with no null values. This means that you can't have blanks in the data (null values) or words (categorical columns). Both must be converted into numbers for ML algorithms to work. Why? Because at the core, machine learning models are mathematical models and words or blanks will break them. 

For general information on pandas, visit the official documentation at https://pandas.pydata.org/docs/getting_started/tutorials.html.

## 1.1 Loading Regression Data

### Bike Rentals Dataset

The [Bike Rentals dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset) is from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). It has been modified here to include null values so that we can practice correcting them.

In [1]:
# load data into pandas dataframe and show first 5 rows
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/coreyjwade/odsc/main/bike_rentals.csv')
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1.0,0.0,1.0,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1.0,0.0,1.0,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1.0,0.0,1.0,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1.0,0.0,1.0,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1.0,0.0,1.0,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


## 1.2 Accessing General Data Stats
It's useful to see the data in terms of general statistics to better understand the data at hand.

In [2]:
# show descriptive statistics
df.describe()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731.0,730.0,730.0,731.0,731.0,731.0,731.0,730.0,730.0,728.0,726.0,731.0,731.0,731.0
mean,366.0,2.49658,0.5,6.512329,0.028728,2.997264,0.682627,1.395349,0.495587,0.474512,0.627987,0.190476,848.176471,3656.172367,4504.348837
std,211.165812,1.110807,0.500343,3.448303,0.167155,2.004787,0.465773,0.544894,0.183094,0.163017,0.142331,0.077725,686.622488,1560.256377,1937.211452
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.336875,0.337794,0.521562,0.134494,315.5,2497.0,3152.0
50%,366.0,3.0,0.5,7.0,0.0,3.0,1.0,1.0,0.499166,0.487364,0.627083,0.180971,713.0,3662.0,4548.0
75%,548.5,3.0,1.0,9.75,0.0,5.0,1.0,2.0,0.655625,0.608916,0.730104,0.233218,1096.0,4776.5,5956.0
max,731.0,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,3410.0,6946.0,8714.0


In [3]:
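# show correlations between numeric columns
# (numeric_only=True skips the text column dteday, which newer pandas versions require)
df.corr(numeric_only=True)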

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
instant,1.0,0.4122242,0.8660262,0.494807,0.016145,-1.6e-05,-0.009415,-0.021477,0.152677,0.154502,0.013773,-0.113047,0.275255,0.659623,0.62883
season,0.412224,1.0,-5.428568e-16,0.836863,-0.010537,-0.00308,0.016433,0.019211,0.336388,0.344739,0.209028,-0.228499,0.210399,0.411623,0.4061
yr,0.866026,-5.428568e-16,1.0,-0.003975,0.008195,-0.004103,-0.002945,-0.050322,0.050979,0.04935,-0.115456,-0.011963,0.249593,0.596168,0.56868
mnth,0.494807,0.8368628,-0.003975295,1.0,0.019599,0.011707,-0.007395,0.041218,0.226546,0.233626,0.227641,-0.206162,0.124549,0.296062,0.282624
holiday,0.016145,-0.01053666,0.008195345,0.019599,1.0,-0.10196,-0.252224,-0.034627,-0.028759,-0.032685,-0.016095,0.006319,0.054274,-0.108745,-0.068348
weekday,-1.6e-05,-0.003079881,-0.00410257,0.011707,-0.10196,1.0,0.038678,0.031087,-0.00183,-0.009003,-0.052728,0.014384,0.059923,0.057367,0.067443
workingday,-0.009415,0.01643296,-0.002945396,-0.007395,-0.252224,0.038678,1.0,0.057866,0.055573,0.055329,0.025879,-0.01772,-0.515692,0.30613,0.063781
weathersit,-0.021477,0.01921103,-0.05032247,0.041218,-0.034627,0.031087,0.057866,1.0,-0.119527,-0.120651,0.592841,0.038912,-0.247353,-0.260388,-0.297391
temp,0.152677,0.3363881,0.05097873,0.226546,-0.028759,-0.00183,0.055573,-0.119527,1.0,0.991702,0.13338,-0.159242,0.5436,0.540327,0.62786
atemp,0.154502,0.3447388,0.04934973,0.233626,-0.032685,-0.009003,0.055329,-0.120651,0.991702,1.0,0.146036,-0.184754,0.544113,0.544442,0.631357


In [4]:
# get info on columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    float64
 3   yr          730 non-null    float64
 4   mnth        730 non-null    float64
 5   holiday     731 non-null    float64
 6   weekday     731 non-null    float64
 7   workingday  731 non-null    float64
 8   weathersit  731 non-null    int64  
 9   temp        730 non-null    float64
 10  atemp       730 non-null    float64
 11  hum         728 non-null    float64
 12  windspeed   726 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(10), int64(5), object(1)
memory usage: 91.5+ KB


In [5]:
# show histograms and scatter plots of all columns
import seaborn as sns
#sns.pairplot(df)

## 1.3 Correcting Null Values

Machine learning algorithms are like mathematical models. For the algorithms to work, all inputs must have values. Null values will break the algorithm. We need to find the null values in our data and change them.

You can eliminate columns if they are almost all null values, or eliminate rows if they have too many null values. However, if the rows and columns contain other valuable information it's often better to keep them and to change the null values to a statistical average of the column.
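Both options described above can be sketched on a small made-up table. The column names, the values, and the 50 percent cutoff are assumptions chosen for illustration; they are not taken from the bike rentals data.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, not the bike rentals file
toy = pd.DataFrame({
    'temp': [0.30, np.nan, 0.50, 0.70],
    'hum': [0.80, 0.60, np.nan, 0.40],
    'mostly_empty': [np.nan, np.nan, np.nan, 1.0],
})

# option 1: eliminate columns where more than half of the values are null
keep = toy.columns[toy.isna().mean() <= 0.5]
toy = toy[keep]

# option 2: change the remaining null values to the median of each column
toy = toy.fillna(toy.median(numeric_only=True))
print(toy)
```

The cutoff is a judgment call. A column that is mostly null carries little information, while a column with one or two blanks is usually worth keeping.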

In [6]:
# show total null values per column
df.isna().sum()

instant       0
dteday        0
season        0
yr            1
mnth          1
holiday       0
weekday       0
workingday    0
weathersit    0
temp          1
atemp         1
hum           3
windspeed     5
casual        0
registered    0
cnt           0
dtype: int64

In [7]:
# sum null values
df.isna().sum().sum()

12

In [8]:
# shows all null values by row
df[df.isna().any(axis=1)]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
56,57,2011-02-26,1.0,0.0,2.0,0.0,6.0,0.0,1,0.2825,0.282192,0.537917,,424,1545,1969
81,82,2011-03-23,2.0,0.0,3.0,0.0,3.0,1.0,2,0.346957,0.337939,0.839565,,203,1918,2121
128,129,2011-05-09,2.0,0.0,5.0,0.0,1.0,1.0,1,0.5325,0.525246,0.58875,,664,3698,4362
129,130,2011-05-10,2.0,0.0,5.0,0.0,2.0,1.0,1,0.5325,0.522721,,0.115671,694,4109,4803
213,214,2011-08-02,3.0,0.0,8.0,0.0,2.0,1.0,1,0.783333,0.707071,,0.20585,801,4044,4845
298,299,2011-10-26,4.0,0.0,10.0,0.0,3.0,1.0,2,0.484167,0.472846,0.720417,,404,3490,3894
388,389,2012-01-24,1.0,1.0,1.0,0.0,2.0,1.0,1,0.3425,0.349108,,0.123767,439,3900,4339
528,529,2012-06-12,2.0,1.0,6.0,0.0,2.0,1.0,2,0.653333,0.597875,0.833333,,477,4495,4972
701,702,2012-12-02,4.0,1.0,12.0,0.0,0.0,0.0,2,,,0.823333,0.124379,892,3757,4649
730,731,2012-12-31,1.0,,,0.0,1.0,0.0,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729


In [9]:
# change null values in column to median of column
df['windspeed'] = df['windspeed'].fillna(df['windspeed'].median())

In [10]:
# show rows of changed data
df.iloc[[56,81,128]]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
56,57,2011-02-26,1.0,0.0,2.0,0.0,6.0,0.0,1,0.2825,0.282192,0.537917,0.180971,424,1545,1969
81,82,2011-03-23,2.0,0.0,3.0,0.0,3.0,1.0,2,0.346957,0.337939,0.839565,0.180971,203,1918,2121
128,129,2011-05-09,2.0,0.0,5.0,0.0,1.0,1.0,1,0.5325,0.525246,0.58875,0.180971,664,3698,4362


In [11]:
# show last two rows of data
df.loc[[729, 730]]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
729,730,2012-12-30,1.0,1.0,12.0,0.0,0.0,0.0,1,0.255833,0.2317,0.483333,0.350754,364,1432,1796
730,731,2012-12-31,1.0,,,0.0,1.0,0.0,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729


In [12]:
# change null values by entry
df.loc[730,'yr']=1.0
df.loc[730, 'mnth']=12.0

# show changed data
df.loc[[730]]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
730,731,2012-12-31,1.0,1.0,12.0,0.0,1.0,0.0,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729


In [13]:
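# change null values for entire dataframe
# (numeric_only=True skips the text column dteday, which newer pandas versions require)
df = df.fillna(df.median(numeric_only=True))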

In [14]:
# check that all null values have been corrected
df.isna().sum().sum()

0

## 1.4 Loading Classification Data

### Census Dataset

The [Census Dataset](https://archive.ics.uci.edu/ml/datasets/Adult) (also called the Adult Dataset) is also from UCI. We include this dataset to balance regression with classification.

In [15]:
# load Census dataset with no header (the file provides none, so header=None keeps the first row of data from becoming the header)
df2 = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)

# define columns by name
df2.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation',
                  'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                   'income']

# show first 5 rows
df2.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


The issue with this dataset is that we have many categorical (text) columns. We need to turn these columns into numerical columns to make progress. Remember, since ML algorithms are like mathematical models, they need to have numbers as their inputs.

In [16]:
# get column info
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


Note that columns with dtype object usually hold strings. pandas stores them as pointers to Python string objects rather than as fixed-width values (see https://stackoverflow.com/questions/21018654/strings-in-a-dataframe-but-dtype-is-object for a discussion as to why).

We need to convert the string columns into numbers.

## 1.5 One-hot encoding

One-hot encoding means you take each categorical column (say Color), and transform it into new columns, one for each value (Red, Green, Blue), with the value as the new column header; the new columns' values are 1 for presence, and 0 for absence. pd.get_dummies() often works for this purpose. sklearn includes an additional [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) that works well in pipelines.
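Here is a minimal sketch of the Color example using both tools. The toy DataFrame is made up for illustration, and dtype=int is an optional argument used here so that pandas shows 1 and 0 regardless of version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy column
colors = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

# pandas: one new column per category, named Color_<value>
dummies = pd.get_dummies(colors, dtype=int)
print(dummies)

# sklearn: the same encoding; fit_transform returns a sparse matrix by default
encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()
print(encoder.categories_)
print(encoded)
```

Both approaches order the new columns alphabetically by category. The sklearn encoder remembers the categories it saw during fit, which is what makes it convenient inside pipelines.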

In [17]:
# Use pd.get_dummies() to transform categorical into numerical columns
df2 = pd.get_dummies(df2)

In [18]:
# show df after one-hot-encoding
df2.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia,income_ <=50K,income_ >50K
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
2,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
3,53,234721,7,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
4,28,338409,13,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [19]:
# get new number of columns
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Columns: 110 entries, age to income_ >50K
dtypes: int64(6), uint8(104)
memory usage: 4.7 MB


# Module 2 - Building your first ML Model

Now that the data is all numerical with no null values, we can build a machine learning model. An ML model takes a range of columns, the X-values, as the input, and one column, the y-value, as the output. The model itself will be a mathematical model that tries to match the inputs with the outputs. This is challenging because there are many rows of data and all need to be matched.

## 2.1 Choosing X and y

The y-column is what you are trying to predict. It's what you want to know about the future. It's the column that you want to predict from the other columns.

The X-columns are the data that you already have. It's what you already know. It's what you will use to predict the future.

There is no right or wrong in terms of choosing X (uppercase to denote many), and y (lowercase to denote one). It's all about your data and the problem that you are trying to solve. Machine learning's primary use-case is in solving problems about the future, that is, in using data that you have to predict what you want to know.

Machine learning models are trained, however, on data in which the future is known, data in which you already have the y-values to match the X-values. You need this for machine learning models to learn. This is why it's called machine learning. The models learn their parameters from data that you already have.

In [20]:
# show all the data before choosing X and y
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1.0,0.0,1.0,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1.0,0.0,1.0,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1.0,0.0,1.0,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1.0,0.0,1.0,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1.0,0.0,1.0,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


In [21]:
# choose X as all columns excluding the first 2, and last 3
X = df.iloc[:, 2:-3]
X.head()

Unnamed: 0,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed
0,1.0,0.0,1.0,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446
1,1.0,0.0,1.0,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539
2,1.0,0.0,1.0,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309
3,1.0,0.0,1.0,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296
4,1.0,0.0,1.0,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869


In [22]:
# choose y as the last column
y=df.iloc[:, -1]
y.head()

0     985
1     801
2    1349
3    1562
4    1600
Name: cnt, dtype: int64

In [23]:
# show census column data
df2.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia,income_ <=50K,income_ >50K
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
2,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
3,53,234721,7,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
4,28,338409,13,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [24]:
# select X as all columns except for the last 2 (the two income columns)
X2 = df2.iloc[:, :-2]

# select y as last column
y2 = df2.iloc[:, -1]

## 2.2 Splitting Data into Training and Test Set

It's standard to split the data into a training and a test set. The idea is to keep some of the data back to test the model that you build. That way, after building a model on the training set, you can see how well your model performs on data that it has never seen before.

Splitting the data is important because it helps to detect overfitting. Overfitting is when your model follows the original data too closely, picking up on possible errors and outliers. You want your ML model to generalize well to new data. You don't want it to over-focus on fluctuations within the sample data at hand. In the literature, this is called finding a balance between variance and bias.
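To make overfitting concrete, here is a small sketch on synthetic data (a noisy sine curve invented for illustration, with variable names chosen to avoid clashing with the notebook's). A decision tree with no depth limit can memorize the training rows, so the gap between its training score and its test score is what the held-back data reveals.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic data (an assumption for illustration): a noisy sine curve
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# a tree with no depth limit keeps splitting until it matches every training row
deep_tree = DecisionTreeRegressor(random_state=0).fit(Xd_train, yd_train)
train_r2 = deep_tree.score(Xd_train, yd_train)
test_r2 = deep_tree.score(Xd_test, yd_test)
print(train_r2, test_r2)
```

A near-perfect training score next to a much lower test score is the signature of a model that has learned the noise rather than the pattern.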

In [25]:
# Split data into training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=0)

## 2.3 Linear Regression

Linear Regression is an ML model that tries to fit the data onto a straight line in 2D, or a hyperplane in higher dimensions. The goal is to find the best weights, or multipliers, for each column in X, that when summed together are as close as possible to y. 

One way to find the weights is gradient descent. Initially the chosen weights are random, and the model is scored by its error. The weights are then adjusted in the direction that reduces the error, and each adjustment results in a new score. The model continues adjusting until it converges on the best score. (sklearn's LinearRegression, used below, skips the iterations and solves the least-squares problem directly.)

Linear Regression can be useful if the data is actually linear; otherwise, nonlinear models like trees will perform better.
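The weight-adjustment loop described above can be sketched in plain NumPy. This is a minimal illustration on synthetic data, not what sklearn runs internally; the data, the learning rate of 0.1, and the 2000 iterations are all assumptions picked so that the loop converges. For comparison, the sketch also computes the direct least-squares solution.

```python
import numpy as np

# synthetic data (an assumption for illustration): y is linear in X plus small noise
rng = np.random.RandomState(0)
X_gd = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_gd = X_gd @ true_w + 4.0 + rng.normal(scale=0.1, size=200)

# add a column of ones so the intercept is learned as one more weight
Xb = np.column_stack([X_gd, np.ones(len(X_gd))])

# gradient descent on the mean squared error
w = rng.normal(size=4)                        # random starting weights
learning_rate = 0.1
for _ in range(2000):
    error = Xb @ w - y_gd                     # prediction minus target
    gradient = 2 * Xb.T @ error / len(y_gd)   # slope of the error with respect to each weight
    w -= learning_rate * gradient             # step in the direction that reduces the error

# direct least-squares solution for comparison
w_direct, *_ = np.linalg.lstsq(Xb, y_gd, rcond=None)
print(w)
print(w_direct)
```

Both approaches should land on essentially the same weights, close to the values used to generate the data (2, -1, 0.5, and an intercept of 4).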

In [26]:
# import Linear Regression
from sklearn.linear_model import LinearRegression

# initialize model
model = LinearRegression()

# fit model to training data
model.fit(X_train, y_train)

# score model on test data (uses r2 default metric)
model.score(X_test, y_test)

0.8043199228256777

## 2.4 Logistic Regression

Logistic Regression follows the same steps as Linear Regression except for the final step: after the columns are multiplied by their weights and summed, the resulting value y is placed in the sigmoid equation, 1/(1+e^(-y)), which returns a number between 0 and 1. That number is then converted to an output of 1 if it is greater than 0.5, and 0 otherwise.

See https://en.wikipedia.org/wiki/Sigmoid_function for more details on the Sigmoid.

Note that Logistic Regression is used for datasets that require classification, and not regression. (Regression as a category of ML models and regression as generally used in statistics often overlap but the terms diverge here.)
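The final step of Logistic Regression can be sketched in a few lines. The three input values are hypothetical weighted sums, called z here, chosen to show a negative, zero, and positive case.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# hypothetical weighted sums for three rows of data
z = np.array([-2.0, 0.0, 3.0])
probabilities = sigmoid(z)

# convert to 1 if the probability is greater than 0.5, and 0 otherwise
labels = (probabilities > 0.5).astype(int)
print(probabilities)
print(labels)
```

A weighted sum of exactly 0 gives a probability of exactly 0.5, which is the boundary between the two classes.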

In [27]:
# import Logistic Regression
from sklearn.linear_model import LogisticRegression

# initialize model
model2 = LogisticRegression()

# fit model to training data
model2.fit(X2_train, y2_train)

# score model on test data (uses accuracy, the default metric for classifiers)
model2.score(X2_test, y2_test)

0.7930293259634577

## 2.5 Model Information

After the models are built, additional information may be extracted such as the coefficients (the weights), and the parameters that were used (defaults in this case).

In [29]:
# show model coefficients
model.coef_

array([  463.28677223,  1968.80524346,   -30.32533849,  -369.28610848,
          74.59482084,    79.66752654,  -521.62688875,  2612.11138052,
        3076.78720844, -1332.29012748, -2864.90698442])

In [30]:
# show model params
model.get_params()

{'copy_X': True,
 'fit_intercept': True,
 'n_jobs': None,
 'normalize': False,
 'positive': False}

## 2.6 Model Predictions

The most valuable part of a machine learning model is the predictions that it can make. Models have a method, .predict, that can be used to make predictions provided that the input is in the same format as the data that the model was trained on.

In [31]:
# show model predictions for last 5 rows
model2.predict(X2_test.iloc[-5:])

array([1, 0, 1, 0, 0], dtype=uint8)

In [32]:
# compare predictions to actual results
y2_test[-5:]

7694     1
10410    0
1043     1
30860    0
12467    1
Name: income_ >50K, dtype: uint8

In [34]:
# you can even get the probabilities of the predictions
model2.predict_proba(X2_test.iloc[-5:])

array([[0.43843834, 0.56156166],
       [0.7915125 , 0.2084875 ],
       [0.22969899, 0.77030101],
       [0.67225282, 0.32774718],
       [0.76947398, 0.23052602]])

We have column headers included in the training set. The training data may be converted to numpy arrays to avoid column headers as will be shown later.

## 2.7 Other Regressors

Sklearn provides many other regressors that may be tried on a regression dataset, that is, a dataset where y has a range of numerical values and the models are trying to get as close as possible to those values. We try a sample of tree-based models and ensembles below.

In [35]:
# create function to score regressors
def score_reg(model):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [36]:
# import and score Decision Tree
from sklearn.tree import DecisionTreeRegressor
score_reg(DecisionTreeRegressor())

0.7904487022264318

In [37]:
# import and score Random Forest
from sklearn.ensemble import RandomForestRegressor
score_reg(RandomForestRegressor())

0.9008983119788628

In [38]:
# install XGBoost to your computer
import sys
!{sys.executable} -m pip install xgboost

In [39]:
# import and score XGBoost
from xgboost import XGBRegressor
score_reg(XGBRegressor())

0.8875910428315762

## 2.8 Other Classifiers
Many ML algorithms have their own versions of classification and regression. Here are the classification versions of some tree-based models and ensembles.

In [40]:
# write function to score classifiers
def score_clf(model):
    model.fit(X2_train, y2_train)
    return model.score(X2_test, y2_test)

In [41]:
# import and score Decision Tree for classification
from sklearn.tree import DecisionTreeClassifier
score_clf(DecisionTreeClassifier(random_state=0))

0.8136035621065562

In [42]:
# import and score Random Forest for classification
from sklearn.ensemble import RandomForestClassifier
score_clf(RandomForestClassifier(random_state=0))

0.8484569322892677

In [43]:
# import and score XGBoost for classification
from xgboost import XGBClassifier
score_clf(XGBClassifier(random_state=0))

0.869031168432366

# Module 3 - Cross-validation with sklearn

Cross-validation is a general technique for splitting your data into multiple training and validation sets. By using cross-validation, you get a more reliable estimate of how well your ML models generalize, since they are scored against multiple validation sets instead of just one.

We now distinguish between validation sets and test sets. A validation set is what we use to validate models that we are trying out. We can change the models, and validate them by scoring them on the validation set. The test set, by contrast, is held back until the very end after our best model has been finalized. Think of a validation set as a test set while you are still selecting and tuning models, and the test set as the final word to verify that your model generalizes well.

[Here is a visual example of cross-validation.](https://en.wikipedia.org/wiki/Cross-validation_(statistics)#/media/File:K-fold_cross_validation_EN.svg) Split your training set into 5 evenly split folds. Hold the first fold back as a validation set, then train your model on the remaining four folds of data before scoring it on the validation set. Next, take the second fold of data and hold it back as the validation set, training your model on the remaining four folds before scoring it on the validation set. Continue the process for all 5 folds of the data.

Cross-validation works for n folds where n is commonly 3, 5, 10, or 20.
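The fold rotation described above can be sketched with sklearn's KFold splitter on 100 hypothetical rows. No model is trained here; the loop only shows which rows are held back in each round.

```python
import numpy as np
from sklearn.model_selection import KFold

# 100 hypothetical rows of training data
rows = np.arange(100).reshape(-1, 1)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5).split(rows), start=1):
    # each row lands in the validation set exactly once across the 5 folds
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f'fold {fold}: train on {len(train_idx)} rows, '
          f'validate on rows {val_idx[0]}-{val_idx[-1]}')
```

With 5 folds, every round trains on 80 rows and validates on the remaining 20.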

## 3.1 Cross_val_score
Cross_val_score in sklearn is a great way to score models using cross-validation as follows.

In [44]:
# import cross_val_score to use cross-validation
from sklearn.model_selection import cross_val_score
# choose your model
model=XGBRegressor()
# get scores on five folds of data 
scores = cross_val_score(model, X_train, y_train, scoring='r2', cv=5)
print(scores)
print(scores.mean())

[0.86032332 0.79450923 0.84489855 0.82828609 0.89426178]
0.8444557932746213


## 3.2 Kfold cross-validation
Kfold cross-validation allows for balanced, consistent splits that may also be applied to grid searches and randomized searches as in the next module.

In [45]:
# use KFold for shuffled, consistent folds 
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
model=XGBRegressor()
scores = cross_val_score(model, X_train, y_train, scoring='r2', cv=kfold)
print(scores)
print(scores.mean())

[0.84248305 0.8942442  0.877875   0.88271032 0.86781641]
0.8730257979482665


## 3.3 Stratified Kfold Cross-Validation
Stratified Kfold is used to ensure that classification datasets have roughly the same proportion of positive cases (or categories) in each of the validation sets.
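To see what stratification does to the folds, here is a minimal sketch on hypothetical imbalanced labels (90 negatives followed by 10 positives). It counts the positive cases that land in each validation set, first with plain KFold and then with StratifiedKFold, both without shuffling.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# hypothetical imbalanced labels: 90 negatives followed by 10 positives
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((100, 1))   # the features play no role in how the folds are cut

# number of positive cases in each validation set
plain = [int(y_imb[val].sum()) for _, val in KFold(n_splits=5).split(X_imb)]
strat = [int(y_imb[val].sum()) for _, val in StratifiedKFold(n_splits=5).split(X_imb, y_imb)]
print(plain)
print(strat)
```

Plain KFold puts all 10 positives in the last fold, so four of the five validation sets contain no positive cases at all. StratifiedKFold spreads them evenly, 2 per fold.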

In [46]:
# use stratified Kfold for classification to balance all test sets
from sklearn.model_selection import StratifiedKFold
ksfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf=XGBClassifier()
scores = cross_val_score(clf, X2, y2, scoring='accuracy', cv=ksfold)
print(scores)
print(scores.mean())

[0.86933825 0.87177518 0.872543   0.86916462 0.87392506]
0.8713492217983237


## 3.4 Choosing Scoring Metrics

There are [many scoring metrics available in sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html), especially for classification. Though Accuracy and R2 are sklearn defaults, it's common to use other scoring metrics.

### Root Mean Squared Error
It's standard to choose the RMSE (root mean squared error) as your scoring metric in regression, which tells you how far away your predictions are from the actual values. To implement this, you must select the negative mean squared error metric, negate it, and then take the square root. (sklearn keeps scoring metrics as "the higher the better"; for RMSE lower is better, hence the negative.)
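The arithmetic can be checked by hand on three hypothetical predictions. The errors are 1, 0, and -2, so the squared errors are 1, 0, and 4.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

toy_mse = mean_squared_error(y_true, y_hat)   # mean of the squared errors: (1 + 0 + 4) / 3
toy_neg_mse = -toy_mse                        # the sign that scoring='neg_mean_squared_error' reports
toy_rmse = (-toy_neg_mse) ** 0.5              # negate first, then take the square root
print(toy_rmse)
```

Because of the square root, the RMSE is in the same units as the target, which makes it easier to interpret than the MSE.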

In [47]:
# change scoring to RMSE 
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
model=XGBRegressor()
mse = cross_val_score(model, X_train, y_train, scoring='neg_mean_squared_error', cv=kfold)
rmse = (-mse)**0.5
print(rmse)
print(rmse.mean())

[694.52090903 683.65580679 672.00913665 649.07556961 652.47237418]
670.3467592511886


### Confusion Matrix and Classification Report

The Confusion Matrix and Classification Report show how predictions work in terms of precision, recall and the f1-score, which is their harmonic mean.

Accuracy is often not good enough as a scoring metric. If you have imbalanced data, say only 0.1 percent exoplanets, your model can be 99.9 percent accurate if it predicts no exoplanets. Awesome score, and you have learned nothing. 

If you want to publish results about exoplanets, you likely care more about precision, which is the percentage of your positive predictions that are actually exoplanets. Precision is not concerned with the exoplanets that you missed (the incorrect negatives).

Recall might be useful if you want to conduct further studies and you want to be sure that you are not missing any possible exoplanets. Recall is the percentage of all actual exoplanets that your model found. Recall is not concerned with the non-exoplanets wrongly predicted as exoplanets.

[Here is a visual of precision and recall from wikipedia](https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg).

In [48]:
# show confusion matrix and classification report
from sklearn.metrics import confusion_matrix, classification_report
model = XGBClassifier()
model.fit(X2_train, y2_train)
y2_pred = model.predict(X2_test)
print(confusion_matrix(y2_test, y2_pred))
print(classification_report(y2_test, y2_pred))

[[4587  331]
 [ 522 1073]]
              precision    recall  f1-score   support

           0       0.90      0.93      0.91      4918
           1       0.76      0.67      0.72      1595

    accuracy                           0.87      6513
   macro avg       0.83      0.80      0.82      6513
weighted avg       0.87      0.87      0.87      6513
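
The numbers in the report for class 1 can be reproduced by hand from the confusion matrix printed above. This sketch copies the four counts from that output; in sklearn's layout, rows are the actual classes and columns are the predicted classes.

```python
import numpy as np

# confusion matrix copied from the output above
cm = np.array([[4587, 331],
               [522, 1073]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)     # of the predicted positives, how many are correct
recall = tp / (tp + fn)        # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / cm.sum()
print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

The rounded values match the class 1 row and the accuracy row of the classification report.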



### f1-score examples

Let's see how to use the f1-score generally, and within cross-validation.

In [49]:
# show the f1-score on a test set
model = XGBClassifier()
model.fit(X2_train, y2_train)
y2_pred = model.predict(X2_test)
from sklearn.metrics import f1_score
f1_score(y2_test, y2_pred)

0.7155718572857619

In [50]:
# show the f1-score within cross-validation
from sklearn.model_selection import StratifiedKFold
ksfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf=XGBClassifier()
scores = cross_val_score(clf, X2_train, y2_train, scoring='f1', cv=ksfold)
print(scores)
print(scores.mean())

[0.71910112 0.68494343 0.71667381 0.70818815 0.70639033]
0.707059368934143


## 3.5 First Contest

The following code shows all sklearn classifiers. Try some classifiers out using cross_val_score with the StratifiedKFold, as shown above, using random_state=0. The first person who can beat the default XGBClassifier score of 0.707 will receive a prize (for ODSC attendees on Mon. Oct. 30). You must use the model default parameters as we have been doing thus far. We will change parameters in the next module.

In [51]:
from sklearn.utils import all_estimators

estimators = all_estimators(type_filter='classifier')
for name, class_ in estimators:
    module_name = class_.__module__.split(".")[1]  # e.g. 'ensemble' from 'sklearn.ensemble._forest'
    class_name = class_.__name__
    print(f'from sklearn.{module_name} import {class_name}')

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import CategoricalNB
from sklearn.multioutput import ClassifierChain
from sklearn.naive_bayes import ComplementNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.semi_supervised import LabelPropagation
from sklearn.semi_supervised import LabelSpreading
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
fr

## Enter Your Code Here

In [52]:
# try various classifiers using cv=ksfold as defined above

# Module 4 - Fine-tuning Models with Sklearn
Fine-tuning models is one of the most important concepts in machine learning. Instead of using default hyperparameters, you may adjust hyperparameters to find values better suited to your data.

The technical distinction between parameters and hyperparameters is as follows. Parameters are found by the model during the build-phase when it's making adjustments to fit the model to the data; the weights in Linear Regression are an example of a parameter in this sense. Hyperparameters, by contrast, are set before the model trains on the data; for example, the depth of a Decision Tree may be changed and it is set in advance of the model build-phase; it's up to the machine learning practitioner to make these adjustments.

Note that hyperparameters are often shortened to parameters and it's usually clear from context.
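
As a small illustration of the distinction, the sketch below uses toy data and two plain sklearn models (not XGBoost). The weight and intercept of the linear model are parameters learned during `fit`, while `max_depth` is a hyperparameter fixed before `fit` is called. The data and variable names are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy data following y = 2x + 1 exactly
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2 * X_demo.ravel() + 1

# Parameters: learned from the data during fit
lin = LinearRegression().fit(X_demo, y_demo)
learned_weight = float(lin.coef_[0])
learned_intercept = float(lin.intercept_)

# Hyperparameter: chosen by the practitioner before fit
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)
tree_depth = tree.get_depth()

print(learned_weight, learned_intercept, tree_depth)
```

The linear model recovers a weight close to 2 and an intercept close to 1 on its own; the tree never grows deeper than the depth it was given.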

## XGBoost Hyperparameters

We focus on XGBoost hyperparameters. XGBoost is a tree ensemble, meaning that it is made of many Decision Trees.

Here is a list of some XGBoost hyperparameters along with their meanings and defaults:

**max_depth** - depth of each tree, meaning number of times data is split before making predictions; default=6.

**n_estimators** - number of trees in ensemble; default=100.

**learning_rate** - shrinks tree weights in each round of boosting; default=0.3.

**subsample** - percentage of training rows for each round; default=1.

**colsample_bytree** - percentage of training columns for each round; default=1.

**colsample_bylevel** - percentage of training columns for each depth level of tree; default=1.

**colsample_bynode** - percentage of columns sampled each time a node is split; default=1.

See the [XGBoost Parameters Documentation](https://xgboost.readthedocs.io/en/stable/parameter.html) page for more info.
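
To build intuition for learning_rate, here is a deliberately simplified sketch of shrinkage. It is not XGBoost's actual algorithm (which fits real trees and uses gradient statistics and regularization); the "weak learner" here is just the mean of the current residuals, and all names and values are hypothetical.

```python
# Toy target values (hypothetical)
y = [10.0, 20.0, 30.0]

def boost(learning_rate, n_rounds):
    """Boost a constant learner: each round predicts the mean residual."""
    pred = 0.0  # start from a prediction of zero
    for _ in range(n_rounds):
        residuals = [target - pred for target in y]
        weak_learner_output = sum(residuals) / len(residuals)
        pred += learning_rate * weak_learner_output  # shrink the correction
    return pred

slow = boost(learning_rate=0.1, n_rounds=10)
fast = boost(learning_rate=0.3, n_rounds=10)
print(round(slow, 2), round(fast, 2))
```

With the same number of rounds, the smaller learning rate ends further from the target mean of 20. This is why a lower learning_rate is usually paired with a larger n_estimators.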

## 4.1 GridSearchCV

We can try out different parameters in sklearn's GridSearchCV. Let's start by varying the depth of each tree and see what gives the best results.

In [53]:
# use GridSearchCV to search grid of hyperparameters for best values
from sklearn.model_selection import GridSearchCV

# GridSearch uses a dictionary of parameters to find optimal values
params = {'max_depth':[1, 2, 3, 4, 5, 6, 8, 10]}

# GridSearchCV takes an ML model, the dictionary of params, and CV scoring and folds as inputs
model = XGBRegressor()
grid_reg = GridSearchCV(model, params, scoring='neg_mean_squared_error', cv=kfold)

# you fit gridsearch on training data just like an ml model
grid_reg.fit(X_train, y_train)

# now you can access the best parameters, with the best score
best_params = grid_reg.best_params_
print("Best params:", best_params)
best_score = (-grid_reg.best_score_)**0.5
print("Best score:", best_score)

Best params: {'max_depth': 8}
Best score: 669.1585860647834
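
A note on the score conversion in the cell above: sklearn scorers follow a "higher is better" convention, so the mean squared error is reported as a negative number. Negating `best_score_` and taking the square root gives an RMSE-style figure. The sketch below uses made-up fold scores to show that this is the square root of the mean MSE, which is not quite the same as the mean of the per-fold RMSEs.

```python
import math

# Hypothetical per-fold scores, as returned with scoring='neg_mean_squared_error'
neg_mse_per_fold = [-250000.0, -360000.0, -490000.0]

mean_neg_mse = sum(neg_mse_per_fold) / len(neg_mse_per_fold)

# What the cell above does: negate the mean score, then take the square root
rmse_of_mean = math.sqrt(-mean_neg_mse)

# A different quantity: take the square root per fold, then average
mean_of_rmse = sum(math.sqrt(-s) for s in neg_mse_per_fold) / len(neg_mse_per_fold)

print(round(rmse_of_mean, 2), round(mean_of_rmse, 2))
```

If you prefer the second quantity, sklearn also accepts `scoring='neg_root_mean_squared_error'`, in which case only the sign needs to be flipped.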


In [54]:
# This function includes all steps in the cell above with XGBoost as the default model
def grid_search(params, reg=XGBRegressor()):
    grid_reg = GridSearchCV(reg, params, scoring='neg_mean_squared_error', cv=kfold)
    grid_reg.fit(X_train, y_train)
    best_params = grid_reg.best_params_
    print("Best params:", best_params)
    best_score = (-grid_reg.best_score_)**0.5
    print("Best score:", best_score)

In [55]:
# show params of model
model.get_params

<bound method XGBModel.get_params of XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, gamma=None,
             gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, n_estimators=100, n_jobs=None,
             num_parallel_tree=None, predictor=None, random_state=None,
             reg_alpha=None, reg_lambda=None, ...)>

Now we can narrow down max_depth and combine it with n_estimators, the total number of trees in the ensemble. When doing a grid search, all combinations of parameters will be checked.

In [56]:
# search 3*4=12 different combinations of parameters 
# build 12*5=60 total cv models (each with up to 400 trees)
grid_search({'max_depth':[4, 6, 8],
            'n_estimators':[50, 100, 200, 400]})

Best params: {'max_depth': 8, 'n_estimators': 50}
Best score: 669.0794021817086
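
To make the cost of the search above concrete, the following sketch enumerates the same grid with `itertools.product` and counts the work involved. It assumes 5 folds (as the comments in the cell indicate) and one tree per boosting round.

```python
from itertools import product

# Same grid as the cell above
param_grid = {'max_depth': [4, 6, 8],
              'n_estimators': [50, 100, 200, 400]}
n_folds = 5  # assumption: kfold uses 5 splits

# Every combination of the listed values
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

n_fits = len(combinations) * n_folds
# Each fit builds n_estimators trees (one tree per boosting round)
total_trees = sum(combo['n_estimators'] for combo in combinations) * n_folds

print(len(combinations), n_fits, total_trees)
```

This gives 12 combinations, 60 cross-validation fits and 11,250 trees. By default GridSearchCV also refits the best combination once on the full training set, which is not counted here.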


In [57]:
# add additional params
grid_search(params={'max_depth':[8],
                    'colsample_bytree':[0.4, 0.6, 0.8, 1],
                   'n_estimators':[25, 50, 100]})

Best params: {'colsample_bytree': 1, 'max_depth': 8, 'n_estimators': 25}
Best score: 668.4694813521713


## 4.2 RandomizedSearchCV

It's often a good idea to start with a random search to find a good starting point. Instead of searching all possible options, a random search checks 10 random combinations by default and returns the hyperparameters with the best score.

In [58]:
# RandomizedSearchCV works the same way, but checks n (10 by default) random combinations
from sklearn.model_selection import RandomizedSearchCV
def random_search(params, reg=XGBRegressor()):
    rand_reg = RandomizedSearchCV(reg, params, scoring='neg_mean_squared_error',
                                  cv=kfold, n_iter=10, random_state=0)
    rand_reg.fit(X_train, y_train)
    best_params = rand_reg.best_params_
    print("Best params:", best_params)
    best_score = (-rand_reg.best_score_)**0.5
    print("Best score:", best_score)

In [59]:
# the following is a reasonable starting sample of params
random_search(params={'subsample':[0.5, 0.6, 0.7, 0.8, 0.9, 1],
        'colsample_bynode':[0.5, 0.6, 0.7, 0.8, 0.9, 1],
        'colsample_bytree':[0.5, 0.6, 0.7, 0.8, 0.9, 1],
        'colsample_bylevel':[0.5, 0.6, 0.7, 0.8, 0.9, 1], 
        'min_child_weight':[1, 2, 3, 4, 5], 
        'learning_rate':[0.001, 0.01, 0.1, 0.2, 0.4, 0.6], 
        'max_depth':[2, 3, 4, 5, 6, 8, 10], 
        #'n_estimators':[25, 50, 100, 200, 400]
                     })

Best params: {'subsample': 0.9, 'min_child_weight': 1, 'max_depth': 4, 'learning_rate': 0.1, 'colsample_bytree': 0.8, 'colsample_bynode': 0.9, 'colsample_bylevel': 0.8}
Best score: 628.7523666895315
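
Why not run a full grid search on these lists? The sketch below counts the combinations in the search above and then draws 10 of them without repetition, which is roughly what a random search does when every parameter is given as a list. The `decode` helper is hypothetical, and the draws will not match the ones RandomizedSearchCV makes internally.

```python
import math
import random

# Same value lists as the random search above
param_lists = {
    'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'colsample_bynode': [0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'colsample_bylevel': [0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'min_child_weight': [1, 2, 3, 4, 5],
    'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.4, 0.6],
    'max_depth': [2, 3, 4, 5, 6, 8, 10],
}

# Size of the full grid: the product of the list lengths
total_combinations = math.prod(len(values) for values in param_lists.values())

def decode(index):
    """Map an integer in [0, total_combinations) to one unique combination."""
    combo = {}
    for name, values in param_lists.items():
        index, position = divmod(index, len(values))
        combo[name] = values[position]
    return combo

rng = random.Random(0)
# Sample 10 distinct grid positions and turn each into a parameter dict
sampled = [decode(i) for i in rng.sample(range(total_combinations), 10)]

print(total_combinations)
print(sampled[0])
```

The full grid holds 272,160 combinations, each of which would need one fit per fold, while the random search tries only 10.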


In [60]:
# adjust based on results
random_search(params={'subsample':[0.8, 0.9, 1],
        'colsample_bynode':[0.8, 0.9, 1],
        'colsample_bytree':[0.7, 0.8, 0.9],
        'colsample_bylevel':[0.7, 0.8, 0.9], 
        'learning_rate':[0.05, 0.1, 0.25], 
        'max_depth':[4, 6, 8], 
        #'n_estimators':[50, 100, 200]
                     })

Best params: {'subsample': 1, 'max_depth': 6, 'learning_rate': 0.1, 'colsample_bytree': 0.7, 'colsample_bynode': 0.8, 'colsample_bylevel': 0.7}
Best score: 626.6780890592731


In [61]:
# narrow params
grid_search(params={'subsample':[0.9],
        'colsample_bynode':[0.9],
        'colsample_bytree':[0.9],
        'colsample_bylevel':[0.7], 
        'learning_rate':[0.01, 0.025, 0.05, 0.075], 
        'max_depth':[4, 6, 8],
                   })

Best params: {'colsample_bylevel': 0.7, 'colsample_bynode': 0.9, 'colsample_bytree': 0.9, 'learning_rate': 0.075, 'max_depth': 4, 'subsample': 0.9}
Best score: 617.1317859597806


In [62]:
# narrow params
grid_search(params={'subsample':[0.9],
        'colsample_bynode':[0.9],
        'colsample_bytree':[0.9],
        'colsample_bylevel':[0.5, 0.6, 0.7], 
        'learning_rate':[0.075], 
        'max_depth':[4],
        'n_estimators':[50, 100, 200, 400]
                   })

Best params: {'colsample_bylevel': 0.5, 'colsample_bynode': 0.9, 'colsample_bytree': 0.9, 'learning_rate': 0.075, 'max_depth': 4, 'n_estimators': 200, 'subsample': 0.9}
Best score: 611.1755986598299


## Your turn!

Try your own random and grid searches to get the best possible cv score on 5 folds using random_state=0. You may use additional params/models. Whoever gets the best score (lowest is best, since we are using RMSE) is the winner!

In [63]:
# try your own random searches, and/or grid searches

# Module 5 - Finalizing Models

After selecting the hyperparameters of your best model, you can finalize the model and take advantage of additional sklearn features such as feature_importances_ to determine the most influential columns; pipelines are useful to automate the data cleaning and model-building process for future data.

## 5.1 Check Model on Test Data
After finalizing a best model, you should train the model on the training data and score it against the test set.

In [64]:
# choose your best model, fit on your data, then test against unseen data
model = XGBRegressor(subsample=0.9, n_estimators=200, max_depth=4,
                    learning_rate=0.075, colsample_bytree=0.9,
                    colsample_bynode=0.9, colsample_bylevel=0.5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)  # sklearn convention: (y_true, y_pred)
mse**0.5

628.2274646343187

## 5.2 Train Model on All Data
If your model is ready to move forward for real predictions, you should go back and train on all the data. Why? Because more data is better. At this stage you don't have to worry about overfitting. You have a model that already generalizes well to new data. You want more data for your model to learn from.

### Using NumPy Arrays

It's often easier to make predictions from ML models when your inputs are NumPy arrays. Then you don't have to worry about column names.

In [65]:
# convert data to numpy arrays
import numpy as np
X_np = np.array(X)
y_np = np.array(y)

# train model on all data as numpy arrays
model = XGBRegressor(subsample=0.9, n_estimators=200, max_depth=4,
                    learning_rate=0.075, colsample_bytree=0.9,
                    colsample_bynode=0.9, colsample_bylevel=0.5)
model.fit(X_np, y_np)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=0.5, colsample_bynode=0.9, colsample_bytree=0.9,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.075, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=4, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=200, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)

In [66]:
# select last row to modify
X_test.tail(1)

| | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 239 | 3.0 | 0.0 | 8.0 | 0.0 | 0.0 | 0.0 | 1 | 0.707059 | 0.647959 | 0.561765 | 0.304659 |


In [67]:
# make some predictions
model.predict(np.array([[3.0, 0.0, 8.0, 0.0, 0.0, 0.0, 1, 0.707059, 0.647959, 0.561765, 0.304659],
                       [3.0, 0.0, 8.0, 0.0, 0.0, 0.0, 1, 0.757059, 0.697959, 0.561765, 0.304659],
                       [3.0, 0.0, 8.0, 0.0, 0.0, 0.0, 1, 0.677059, 0.647959, 0.561765, 0.104659]]))

array([4507.306 , 4240.8145, 4947.832 ], dtype=float32)

## 5.3 Saving Models with Pickle

In the event that your final model has value and will be utilized later, you may save your models and access them using pickle.

In [68]:
import pickle

# save model to local machine (the with-block closes the file automatically)
filename = 'final_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

# load the model from disk
with open(filename, 'rb') as f:
    load_model = pickle.load(f)

# check model
print(load_model)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=0.5, colsample_bynode=0.9, colsample_bytree=0.9,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.075, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=4, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=200, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)
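
A quick sanity check after loading a model is to confirm that it predicts exactly what the original did. The sketch below does this round trip with a small sklearn LinearRegression standing in for the XGBoost model, and writes to a temporary directory so nothing is left on disk. The data and file name are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model trained on toy data
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

# Save and reload inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')
    with open(path, 'wb') as f:
        pickle.dump(demo_model, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.allclose(demo_model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Two practical cautions: only unpickle files from sources you trust, since loading a pickle can execute arbitrary code, and load the file with the same library versions that were used to save it where possible.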


## 5.4 Column Importance: feature_importances_

We can check the overall influence of each column using feature_importances_. Whereas checking correlations of columns is useful for linear comparisons, this process gives the influence of columns for non-linear tree ensembles. 

You can use feature_importances_ earlier in the process to help narrow down columns as well.

In [69]:
# show the influence of each column
model.feature_importances_

array([0.1657849 , 0.42669895, 0.06681323, 0.00972347, 0.01182685,
       0.00876925, 0.06133859, 0.07507053, 0.12267406, 0.03070163,
       0.02059859], dtype=float32)

In [70]:
# zip columns and feature_importances_ into dict
feature_dict = dict(zip(X.columns, model.feature_importances_))

# import operator
import operator

# sort dict by values (as list of tuples)
sorted(feature_dict.items(), key=operator.itemgetter(1), reverse=True)

[('yr', 0.42669895),
 ('season', 0.1657849),
 ('atemp', 0.122674055),
 ('temp', 0.07507053),
 ('mnth', 0.06681323),
 ('weathersit', 0.06133859),
 ('hum', 0.030701632),
 ('windspeed', 0.020598589),
 ('weekday', 0.011826854),
 ('holiday', 0.009723474),
 ('workingday', 0.00876925)]
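
The importances printed above sum to roughly 1, so they can be read as shares of the total. The sketch below hard-codes the values from the output, ranks them without the `operator` import, and adds a running total to show how much of the importance the top few columns account for.

```python
# Column names and importances copied from the outputs above
columns = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
           'weathersit', 'temp', 'atemp', 'hum', 'windspeed']
importances = [0.1657849, 0.42669895, 0.06681323, 0.00972347, 0.01182685,
               0.00876925, 0.06133859, 0.07507053, 0.12267406, 0.03070163,
               0.02059859]

# Rank columns from most to least important
ranked = sorted(zip(columns, importances), key=lambda pair: pair[1], reverse=True)

total = sum(importances)

# Print each column with its share and the cumulative share so far
running = 0.0
for name, score in ranked:
    running += score
    print(f'{name:<12}{score:.3f}  cumulative: {running:.3f}')
```

Here `yr`, `season` and `atemp` together account for a little over 70% of the total importance.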

## 5.5 Pipelines

Automating the process of collecting data, cleaning data, and training a machine learning model on the data may be achieved via sklearn pipelines.

Clearing null values via the SimpleImputer and converting categorical columns via the OneHotEncoder are standard. You can design your own classes for more complicated procedures. (See https://github.com/PacktPublishing/Hands-On-Gradient-Boosting-with-XGBoost-and-Scikit-learn/blob/master/Chapter10/XGBoost_Model_Deployment.ipynb for examples.)

The example that follows includes the SimpleImputer and the ML model in a pipeline.

In [71]:
# create pipeline to transform data by clearing null values and fitting xgb model on data
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
full_pipeline = Pipeline([('null', SimpleImputer(missing_values=np.nan, 
                                                         strategy='median')),  
                          ('xgb', XGBRegressor(max_depth=4,
                                               n_estimators=200,
                                               learning_rate=0.075,
                                               subsample=0.9, 
                                               colsample_bytree=0.9, 
                                               colsample_bylevel=0.5,
                                               colsample_bynode=0.9, 
                                              ))])
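
The pipeline above handles only null values. As a sketch of how a categorical column could be handled in the same style, the example below combines SimpleImputer and OneHotEncoder through sklearn's ColumnTransformer. The toy DataFrame, the column names, and the DecisionTreeRegressor stand-in for the XGBoost model are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one numeric column with a null, one categorical column
X_toy = pd.DataFrame({
    'temp': [0.2, np.nan, 0.6, 0.8],
    'weather': ['clear', 'rain', 'clear', 'mist'],
})
y_toy = np.array([100, 40, 300, 250])

# Apply a different transformer to each group of columns
preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), ['temp']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['weather']),
])

toy_pipeline = Pipeline([
    ('prep', preprocess),
    ('model', DecisionTreeRegressor(random_state=0)),
])
toy_pipeline.fit(X_toy, y_toy)

# Inspect the transformed data and make predictions
X_prepared = toy_pipeline.named_steps['prep'].transform(X_toy)
toy_predictions = toy_pipeline.predict(X_toy)
print(X_prepared.shape)
```

The prepared data has four columns: the imputed `temp` plus one indicator column for each of the three weather categories.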

In [72]:
# fit the pipeline on your data
full_pipeline.fit(X, y)

Pipeline(steps=[('null', SimpleImputer(strategy='median')),
                ('xgb',
                 XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
                              colsample_bylevel=0.5, colsample_bynode=0.9,
                              colsample_bytree=0.9, early_stopping_rounds=None,
                              enable_categorical=False, eval_metric=None,
                              gamma=0, gpu_id=-1, grow_policy='depthwise',
                              importance_type=None, interaction_constraints='',
                              learning_rate=0.075, max_bin=256,
                              max_cat_to_onehot=4, max_delta_step=0,
                              max_depth=4, max_leaves=0, min_child_weight=1,
                              missing=nan, monotone_constraints='()',
                              n_estimators=200, n_jobs=0, num_parallel_tree=1,
                              predictor='auto', random_state=0, reg_alpha=0,
                

In [73]:
# make predictions from new data using your pipeline
X_new = X_test.copy()
full_pipeline.predict(X_new)

array([5426.846  , 4604.4785 , 1297.9181 , 1319.1804 , 3677.2258 ,
       1917.0964 , 3004.2808 , 6315.954  , 6653.0938 , 1216.8851 ,
       1571.0406 , 1360.2284 , 1484.4102 , 4810.1396 , 4624.3784 ,
       4094.9329 , 7513.7314 , 6517.717  , 3601.858  , 2005.2458 ,
       7617.31   , 1644.6609 , 5256.659  , 4417.773  , 1723.6769 ,
       5907.545  , 3400.5857 , 5237.753  , 7505.26   , 7921.6865 ,
        705.01184, 5116.439  , 5973.9087 , 5296.1143 , 1956.8448 ,
       3870.8042 , 6745.506  , 5100.06   , 2559.2012 , 3079.6116 ,
       7032.815  , 1115.9336 , 4995.4214 , 3677.9053 , 7254.9844 ,
       7434.9277 , 2203.2358 , 3682.1528 , 2712.967  , 1934.7947 ,
       6108.701  , 6797.68   , 5445.376  , 7176.278  , 4119.6016 ,
       3731.478  , 3506.8882 , 7070.639  , 6899.389  , 3600.1506 ,
       7196.9756 , 4124.8027 , 4047.785  , 8338.744  , 7351.622  ,
       2358.999  , 5074.493  , 5465.236  , 6748.441  , 5289.1367 ,
       6502.138  , 7346.4478 , 4688.1553 , 6846.597  , 3723.49