# Intro to Sklearn - Machine Learning in Python

## by Corey Wade

The following Jupyter Notebook is an introduction to Machine Learning in Python for ODSC West attendees on Nov. 1, 2022. We will be using pandas for data analytics, and sklearn for machine learning. A wide range of models will be covered including Linear and Logistic Regression, Decision Trees, Random Forests, and XGBoost.

This presentation includes ML fundamentals covered in Corey Wade's book [Hands-on Gradient Boosting with XGBoost and scikit-learn](https://www.amazon.com/Hands-Gradient-Boosting-XGBoost-scikit-learn/dp/1839218355). Another recommended text is [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646/). For great web references, check out [Jason Brownlee's Machine Learning Mastery](https://machinelearningmastery.com/about/).

Our focus is on tabular data, that is, data stored in tables of rows and columns; this contrasts with images and text, which are considered unstructured data. When it comes to images and text, neural networks usually perform better. For tabular data, neural networks do not necessarily have an edge. We will focus on XGBoost, one of the strongest ML algorithms in the world, which often has an edge on tabular data.

# Module 1 - Preparing data for ML with pandas

The following module provides a brief introduction to pandas. To go more in-depth, try tutorial options from the official documentation: https://pandas.pydata.org/docs/getting_started/tutorials.html.

## Loading Data

### Bike Rentals Dataset

The [Bike Rentals dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset) is from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). It has been modified to include null values so that you can practice correcting them.

In [None]:
# load data into pandas dataframe and show first 5 rows


## General Data Info

In [None]:
# show descriptive statistics


In [None]:
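# show correlations between numeric columns
# (numeric_only=True skips text columns, which recent pandas versions refuse to correlate)
# df.corr(numeric_only=True)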

In [None]:
# show histograms and scatter plots of all columns
#import seaborn as sns
#sns.pairplot(df)

In [None]:
# get info on columns
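
The three inspection steps above can be sketched on a tiny stand-in table. The DataFrame `toy` and its column names are invented for illustration and are not the Bike Rentals data; the calls themselves (`describe`, `corr`, `info`) are the standard pandas methods.

```python
import pandas as pd

# Hypothetical toy data: one text column and two numeric columns
toy = pd.DataFrame({
    'day': ['Mon', 'Tue', 'Wed', 'Thu'],
    'temp': [10.0, 20.0, 30.0, 40.0],
    'rentals': [100, 200, 300, 400],
})

# descriptive statistics (numeric columns only by default)
stats = toy.describe()
print(stats)

# pairwise correlations, skipping the text column
corr = toy.corr(numeric_only=True)
print(corr)

# column names, dtypes, and non-null counts
toy.info()
```

In this toy table `rentals` is an exact multiple of `temp`, so their correlation is 1; real data will be noisier.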


## Null Values

In [None]:
# show total null values per column
df.isna().sum()

In [None]:
# sum null values
df.isna().sum().sum()

In [None]:
# show all rows that contain at least one null value
df[df.isna().any(axis=1)]

In [None]:
# change null values in column
df['windspeed'] = df['windspeed'].fillna(df['windspeed'].median())

In [None]:
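# change null values for entire dataframe (numeric columns only)
# numeric_only=True avoids an error when the dataframe contains text or date columns
df = df.fillna(df.median(numeric_only=True))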

In [None]:
# show rows
df.iloc[[129,213,730]]

In [None]:
# change null values by entry
df.loc[730,'yr']=1.0
df.loc[730, 'season']=4.0
df.loc[[730]]

## Choose X and y

In [None]:
# show order of columns


In [None]:
# choose X as all rows, and all columns excluding the first 2, and last 3


In [None]:
# choose y as the last column
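
As a minimal sketch of this slicing, the toy DataFrame below mimics a layout with two leading ID-like columns and three trailing count columns. Its column names and values are assumptions made up for the example, not the actual Bike Rentals file.

```python
import pandas as pd

# Hypothetical toy frame: 2 leading ID-like columns, 3 features, 3 trailing count columns
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'date': ['2011-01-01', '2011-01-02', '2011-01-03'],
    'temp': [0.34, 0.36, 0.20],
    'hum': [0.81, 0.70, 0.44],
    'windspeed': [0.16, 0.25, 0.25],
    'casual': [331, 131, 120],
    'registered': [654, 670, 1229],
    'cnt': [985, 801, 1349],
})

# show order of columns
print(list(toy.columns))

# X: all rows, all columns except the first 2 and the last 3
X_toy = toy.iloc[:, 2:-3]

# y: all rows, last column only
y_toy = toy.iloc[:, -1]

print(X_toy.columns.tolist())
print(y_toy.name)
```

Dropping the last three columns from X matters in a layout like this one: if two of them add up to the target, leaving them in would leak the answer to the model.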


## The Census Dataset

The [Census Dataset](https://archive.ics.uci.edu/ml/datasets/Adult) (also called the Adult Dataset) is also from UCI. We include this dataset to balance regression with classification. Sklearn scoring metrics

In [None]:
# load Census dataset (the file has no header row)
df2 = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)

# define columns by name
df2.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation',
               'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
               'income']

# show first 5 rows
df2.head()

In [None]:
# get column info
df2.info()

## One-hot encoding

One-hot encoding takes each categorical column (say, Color) and replaces it with one new column per value (Red, Green, Blue); each new column holds 1 where that value is present and 0 where it is absent. pd.get_dummies() often works for this purpose. sklearn also provides [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), which may be more useful in pipelines.
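
To make the Color example concrete, here is a small sketch on invented data. The `colors` DataFrame is hypothetical; `dtype=int` is passed so the output shows 1/0, since recent pandas versions return True/False by default.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column
colors = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red'],
                       'Size': [1, 2, 3, 4]})

# Only the text column is expanded; the numeric column passes through unchanged
encoded = pd.get_dummies(colors, dtype=int)
print(encoded)
```

Each new column is named after the original column and the value, for example `Color_Red`.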

In [None]:
# Use pd.get_dummies() to transform categorical into numerical columns
df2 = pd.get_dummies(df2)

In [None]:
# show df after one-hot-encoding


In [None]:
# get new number of columns


In [None]:
# select X as all rows, columns except for last 2


# select y as last column


# Module 2 - Supervised learning with sklearn

In this module we cover the essentials of the sklearn suite.

In [None]:
# Split data into training and test set
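
One way to do the split is sklearn's `train_test_split`. The sketch below runs on synthetic data rather than the workshop datasets; the `_demo` variable names are placeholders chosen for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 4 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)

# By default 25% of the rows are held out; random_state makes the split repeatable
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, random_state=0)

print(X_train_demo.shape, X_test_demo.shape)
```

The model only ever sees the training rows during fitting, so the test rows give an honest estimate of performance on unseen data.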


## Linear Regression

In [None]:
# import Linear Regression


# initialize model


# fit model to training data


# score model on test data (uses r2 default metric)


In [None]:
# show model coefficients


In [None]:
# show model params


In [None]:
# show model predictions


In [None]:
# compare predictions to actual results
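
The Linear Regression steps in this section can be sketched end to end as follows. The data is synthetic, generated from known coefficients so the fitted values can be checked by eye; it is an illustration under those assumptions, not the Bike Rentals result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data generated from known coefficients plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_coefs = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# initialize, fit on training data, score on test data (R^2 by default)
lin_reg = LinearRegression()
lin_reg.fit(X_tr, y_tr)
r2 = lin_reg.score(X_te, y_te)
print("R^2:", r2)

# model coefficients (one per column) and model params
print(lin_reg.coef_)
print(lin_reg.get_params())

# predictions next to actual values for the first 5 test rows
y_pred = lin_reg.predict(X_te)
comparison = np.column_stack([y_pred[:5], y_te[:5]])
print(comparison)
```

The same initialize, fit, score, predict pattern applies to every sklearn estimator used later in this notebook.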


## Regressors

In [None]:
# create function to score regressors


In [None]:
# import and score Random Forest
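
A scoring helper might look like the sketch below. The function name `score_regressor` and its signature are hypothetical, chosen for this example, and the data comes from sklearn's `make_regression` rather than the Bike Rentals dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the workshop dataset
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

def score_regressor(model, X_train=X_tr, X_test=X_te, y_train=y_tr, y_test=y_te):
    """Hypothetical helper: fit the model, then return R^2 on the test split."""
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

rf_score = score_regressor(RandomForestRegressor(n_estimators=50, random_state=0))
lr_score = score_regressor(LinearRegression())
print("Random Forest R^2:", rf_score)
print("Linear Regression R^2:", lr_score)
```

Because XGBoost's `XGBRegressor` follows the sklearn estimator interface, it can be passed to a helper like this in the same way once the package is installed.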


In [None]:
# install XGBoost to your computer
import sys
!{sys.executable} -m pip install xgboost

In [None]:
# import and score XGBoost


## Classifiers

In [None]:
# write function to score classifiers
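
For classifiers the pattern is the same, except that `.score()` returns accuracy. The helper name `score_classifier` is hypothetical, and the data is generated with sklearn's `make_classification` rather than taken from the Census dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the Census dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=6, class_sep=2.0,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

def score_classifier(model, X_train=X_tr, X_test=X_te, y_train=y_tr, y_test=y_te):
    """Hypothetical helper: fit the model, then return accuracy on the test split."""
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

log_acc = score_classifier(LogisticRegression(max_iter=1000))
tree_acc = score_classifier(DecisionTreeClassifier(random_state=0))
print("Logistic Regression accuracy:", log_acc)
print("Decision Tree accuracy:", tree_acc)
```

`max_iter=1000` is set here only so the solver has room to converge; it is not a tuned value.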


In [None]:
# import and score Logistic Regression


In [None]:
# import and score Decision Tree for classification


In [None]:
# import and score Random Forest for classification


In [None]:
# import and score XGBoost for classification


## Predictions / Scoring Metrics

Making meaningful predictions is arguably the most important part of Machine Learning. You can pass either pandas DataFrames or NumPy arrays to sklearn models to make predictions.

There are many scoring metrics available in sklearn, especially for classification. See your options here: https://scikit-learn.org/stable/modules/model_evaluation.html

### Root Mean Squared Error

In [None]:
# build XGBoost model

# get predictions for test set

# for rmse, import mse first
#from sklearn.metrics import mean_squared_error


In [None]:
# show descriptive stats for y


### Predictions

In [None]:
# look at last 5 rows


In [None]:
# predict last 2 rows


In [None]:
# check last 2 rows actual value


### NumPy Arrays

It's often easier to make predictions from ML models when your inputs are NumPy arrays, because then you don't have to worry about column names. Both pandas DataFrames and NumPy arrays work, but NumPy arrays are more convenient for single rows.

In [None]:
# convert data to numpy arrays
import numpy as np
X_train_np = np.array(X_train)
X_test_np = np.array(X_test)
y_train_np = np.array(y_train)
y_test_np = np.array(y_test)

# train model on numpy arrays
model = XGBRegressor()
model.fit(X_train_np, y_train_np)
y_pred_np = model.predict(X_test_np)

In [None]:
# select last row to modify
X_test.tail(1)

In [None]:
# show predictions
model.predict(np.array([[3.0, 0.0, 8.0, 0.0, 0.0, 0.0, 2, 0.676667, 0.624388, 0.817500, 0.222633]]))

In [None]:
# now the prediction works, even though no column headers have been provided
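
One detail worth spelling out is that sklearn-style models expect a 2D input even for a single row. The sketch below uses `LinearRegression` on four made-up rows where y = x0 + 2*x1 exactly, so the prediction can be verified by hand; the data and variable names are invented for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data where y = x0 + 2*x1 exactly
X_np = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y_np = np.array([5.0, 2.0, 5.0, 10.0])

demo_model = LinearRegression().fit(X_np, y_np)

# A single row must be 2D: either use double brackets...
pred_a = demo_model.predict(np.array([[5.0, 1.0]]))

# ...or reshape a 1D row into shape (1, n_features)
row = np.array([5.0, 1.0])
pred_b = demo_model.predict(row.reshape(1, -1))

print(pred_a, pred_b)
```

Both calls return an array with one prediction, here 5 + 2*1 = 7 up to floating-point error.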


### Confusion Matrix and Classification Report

In [None]:
# show confusion matrix and classification report


In [None]:
# show the f1-score
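
These metrics are easiest to read on a tiny example. The sketch below uses ten made-up labels and predictions, not Census results, with exactly one false positive and one false negative.

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Made-up true labels and predictions: 1 false positive, 1 false negative
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)

# precision, recall, and f1 per class
print(classification_report(y_true, y_hat))

# f1 for the positive class: harmonic mean of precision and recall
f1 = f1_score(y_true, y_hat)
print("f1:", f1)
```

Here precision and recall for class 1 are both 4/5, so the f1-score is 0.8.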


## Your Turn!

Try different models to get the best f1-score on the census dataset using random_state=0. You must use default params! (Changing params is covered in Module 4 of this notebook.)

In [None]:
# try out different models using f1-scoring metric


# Module 3 - Cross-validation with sklearn

In [None]:
# import cross_val_score to use cross-validation

# choose your model

# get scores on five folds of data 


In [None]:
# get mean rmse
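
A minimal sketch of cross-validated RMSE follows, using `LinearRegression` on synthetic data as a stand-in for whichever model you choose. sklearn reports mean squared error as a negative number (so that higher is always better), which is why the scores are negated before taking the square root.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the workshop dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# one negative-MSE score per fold
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         scoring='neg_mean_squared_error', cv=5)

# convert each fold's score to RMSE, then average
rmse = np.sqrt(-scores)
print("RMSE per fold:", rmse)
print("Mean RMSE:", rmse.mean())
```

Averaging over five folds gives a steadier estimate than a single train/test split.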


In [None]:
# use KFold for shuffled, consistent folds 
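
To illustrate what "shuffled, consistent" means, the sketch below builds a `KFold` splitter on 20 made-up rows and splits twice. With `shuffle=True` and a fixed integer `random_state`, the rows are mixed before splitting, yet every run produces the same folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# 20 made-up rows, 2 features
X_demo = np.arange(40).reshape(20, 2)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=0)

# collect the test-row indices of each fold, twice
folds_a = [test_idx.tolist() for _, test_idx in kfold_demo.split(X_demo)]
folds_b = [test_idx.tolist() for _, test_idx in kfold_demo.split(X_demo)]

print(folds_a)
print("Same folds on both runs:", folds_a == folds_b)
```

A splitter like this can be passed as the `cv` argument of `cross_val_score`, so that different models are compared on identical folds.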


In [None]:
# use StratifiedKFold for classification so that each fold keeps the same class balance
from sklearn.model_selection import StratifiedKFold
ksfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
#clf=XGBClassifier()
#f1 = cross_val_score(clf, X2, y2, scoring='f1', cv=ksfold)
#print(f1.mean())

# Module 4 - Fine-tuning models with sklearn

## GridSearch

In [None]:
# use GridSearchCV to search grid of hyperparameters for best values

# GridSearch uses a dictionary of parameters to find optimal values

# GridSearchCV takes an ML model, the dictionary of params, etc. as inputs


# you fit gridsearch on training data just like an ml model

# now you can access the best parameters, with the best score
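
The GridSearchCV steps can be sketched as below. To keep the example quick and free of extra dependencies, it uses a `DecisionTreeRegressor` on synthetic data; the model, the parameter values, and the variable names are assumptions for illustration, not the workshop's choices.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the training set
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# dictionary of hyperparameters: 2 x 2 = 4 combinations
params = {'max_depth': [2, 4], 'min_samples_leaf': [1, 5]}

# GridSearchCV takes a model, the params dictionary, a scoring metric, and the folds
grid = GridSearchCV(DecisionTreeRegressor(random_state=0), params,
                    scoring='neg_mean_squared_error',
                    cv=KFold(n_splits=5, shuffle=True, random_state=0))

# fit just like a model
grid.fit(X_demo, y_demo)

# best parameters, and best score converted from negative MSE to RMSE
print("Best params:", grid.best_params_)
best_rmse = (-grid.best_score_) ** 0.5
print("Best RMSE:", best_rmse)
```

Every combination is scored with cross-validation, so this small grid already trains 4 x 5 = 20 models.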


In [None]:
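# This function includes all steps in the cell above with XGBoost as the default model
# It assumes XGBRegressor, kfold, X_train, and y_train are defined in earlier cells
from sklearn.model_selection import GridSearchCV
def grid_search(params, reg=XGBRegressor()):
    grid_reg = GridSearchCV(reg, params, scoring='neg_mean_squared_error', cv=kfold)
    grid_reg.fit(X_train, y_train)
    best_params = grid_reg.best_params_
    print("Best params:", best_params)
    # best_score_ is the negative MSE, so negate it and take the square root to get RMSE
    best_score = (-grid_reg.best_score_)**0.5
    print("Best score:", best_score)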

In [None]:
# show params


In [None]:
# search 2 params - 12 models total
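
The count of 12 comes from multiplying the number of values tried for each parameter. The sketch below checks this with sklearn's `ParameterGrid`; the specific values are assumptions chosen so that 3 x 4 = 12, not the grid used in the workshop.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 3 values of max_depth times 4 values of n_estimators
params = {'max_depth': [2, 3, 4],
          'n_estimators': [50, 100, 200, 400]}

combos = list(ParameterGrid(params))
n_models = len(combos)
print("Combinations:", n_models)
print("First combination:", combos[0])
```

With 5-fold cross-validation each combination is fit five times, so a grid of 12 means 60 model fits. Grid sizes grow quickly as parameters are added.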


In [None]:
# add additional params


## RandomSearch

In [None]:
# RandomizedSearchCV works the same way, but checks only n_iter (10 by default) random combinations
# It assumes XGBRegressor, kfold, X_train, and y_train are defined in earlier cells
from sklearn.model_selection import RandomizedSearchCV
def random_search(params, reg=XGBRegressor()):
    rand_reg = RandomizedSearchCV(reg, params, scoring='neg_mean_squared_error', cv=kfold, n_iter=10, random_state=0)
    rand_reg.fit(X_train, y_train)
    best_params = rand_reg.best_params_
    print("Best params:", best_params)
    # best_score_ is the negative MSE, so negate it and take the square root to get RMSE
    best_score = (-rand_reg.best_score_)**0.5
    print("Best score:", best_score)

In [None]:
# the following is a reasonable starting sample of params
random_search(params={'subsample':[0.5, 0.6, 0.7, 0.8, 0.9, 1],
        'colsample_bynode':[0.5, 0.6, 0.7, 0.8, 0.9, 1],
        'colsample_bytree':[0.5, 0.6, 0.7, 0.8, 0.9, 1],
        'colsample_bylevel':[0.5, 0.6, 0.7, 0.8, 0.9, 1], 
        'min_child_weight':[1, 2, 3, 4, 5], 
        'learning_rate':[0.001, 0.01, 0.1, 0.2, 0.4, 0.6], 
        'max_depth':[2, 3, 4, 5, 6, 8, 10], 
        'n_estimators':[25, 50, 100, 200, 400]})

In [None]:
# adjust based on results


## Your turn!

Try your own random and grid searches to get the best possible cv score on 5 folds using random_state=0.

In [None]:
# try your own random searches, and/or grid searches

# Module 5 - Feature Importances

## Finalize Model

In [None]:
# choose your best model, fit on your data, then test against unseen data


In [None]:
# show the influence of each column
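
As a sketch of how feature importances read, the example below fits a `RandomForestRegressor` on synthetic data in which one column drives the target far more than the others. The column names (`strong`, `weak`, `noise`) and the data are invented for illustration; tree-based sklearn models and XGBoost's sklearn-style models expose the same `feature_importances_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 'strong' dominates the target, 'noise' plays no role
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=['strong', 'weak', 'noise'])
y_demo = 5.0 * X_demo['strong'] + 0.5 * X_demo['weak'] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# pair each importance with its column name and sort from most to least influential
importances = pd.Series(forest.feature_importances_,
                        index=X_demo.columns).sort_values(ascending=False)
print(importances)
```

The importances sum to 1, so each value can be read as that column's share of the model's total split gain.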


In [None]:
# zip columns and feature_importances_ into dict
feature_dict = dict(zip(X.columns, model.feature_importances_))

# import operator
import operator

# sort dict by values (as list of tuples)
sorted(feature_dict.items(), key=operator.itemgetter(1), reverse=True)