<a href="https://colab.research.google.com/github/amogh-karnik/AI-resources/blob/main/Week4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 - Models and Experimentation

## Step 1 Training a model

For this demo, we adapt [this tutorial](https://www.datacamp.com/tutorial/xgboost-in-python) to train an XGBoost model, then run experiments and tune its hyperparameters.


If you are running this notebook locally, create a virtual environment first:
- Use Python 3.10 or older
- Create the virtual environment with `venv`

`python -m venv NAME`

- Activate it before installing anything (`source NAME/bin/activate` on Linux/macOS, `NAME\Scripts\activate` on Windows)
- Avoid Anaconda, Poetry, or similar package management platforms
- Install packages with pip

`python -m pip install <package-name>`

- Once you are done working in this virtual environment, deactivate it with `deactivate`
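
The version constraint above can also be checked from inside the notebook. This is a minimal sketch, assuming the constraint means "Python 3.10 or older"; `python_version_ok` is a hypothetical helper, not part of any library.

```python
import sys

def python_version_ok(version_info=sys.version_info, max_version=(3, 10)):
    """Return True if the interpreter is no newer than max_version (major, minor)."""
    return tuple(version_info[:2]) <= max_version

# Warn instead of failing, so the notebook still runs on newer interpreters
if not python_version_ok():
    print(f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} is newer than 3.10")
```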

### Install packages

In [None]:
!pip install wandb -qU


In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


### Import data

We will use the Diamonds dataset, loaded from Seaborn. It is also available on [Kaggle](https://www.kaggle.com/datasets/shivam2503/diamonds).

The Kaggle page describes each feature. Our target is the price of each diamond.

In [None]:
diamonds = sns.load_dataset('diamonds')
diamonds.head()

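|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |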


In [None]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB


In [None]:
diamonds.shape

(53940, 10)

In [None]:
X,y = diamonds.drop('price', axis=1), diamonds[['price']]

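# Cast cut, color and clarity to the pandas category dtype so XGBoost can treat them as categorical features.
# Seaborn already loads them as category (see diamonds.info() above), so this is a safeguard for other sources such as the Kaggle CSV.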

X['cut'] = X['cut'].astype('category')
X['color'] = X['color'].astype('category')
X['clarity'] = X['clarity'].astype('category')

### Split the data and train a model

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

In [None]:
# Define hyperparameters
params = {"objective": "reg:squarederror", "tree_method": "hist"}

n = 100
model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
)

In [None]:
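# Evaluate on the test set with Root Mean Squared Error (RMSE)

predictions = model.predict(dtest)
# The `squared` argument of mean_squared_error was deprecated and later removed
# in newer scikit-learn releases, so take the square root explicitly
rmse = np.sqrt(mean_squared_error(y_test, predictions))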
print(f"RMSE: {rmse}")

RMSE: 545.191877397669
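
RMSE is the square root of the mean squared difference between the true and predicted prices, so it is expressed in the same units as the target. A minimal pure-Python sketch of the formula, where `rmse` is an illustrative helper and the prices are made up:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy prices, made up for illustration (not taken from the dataset)
print(rmse([300, 400, 500], [310, 390, 520]))
```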


### Incorporate validation

In [None]:
params = {"objective": "reg:squarederror", "tree_method": "hist"}
n = 100

# Track both sets during training; the held-out test split doubles as the validation set here
evals = [(dtrain, "train"), (dtest, "validation")]

In [None]:
evals = [(dtrain, "train"), (dtest, "validation")]

model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=10,
)

[0]	train-rmse:2861.71326	validation-rmse:2853.85688
[10]	train-rmse:554.29819	validation-rmse:579.26422
[20]	train-rmse:493.68077	validation-rmse:547.75493
[30]	train-rmse:467.32713	validation-rmse:540.03567
[40]	train-rmse:447.40974	validation-rmse:541.70531
[50]	train-rmse:432.62075	validation-rmse:540.89769
[60]	train-rmse:422.28318	validation-rmse:540.63039
[70]	train-rmse:410.72350	validation-rmse:543.67077
[80]	train-rmse:398.24619	validation-rmse:545.08296
[90]	train-rmse:386.92486	validation-rmse:543.90036
[99]	train-rmse:379.58717	validation-rmse:545.19188


In [None]:
# Incorporate early stopping
n = 10000


model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=50,
   # Activate early stopping
   early_stopping_rounds=50
)

[0]	train-rmse:2861.71326	validation-rmse:2853.85688
[50]	train-rmse:432.62075	validation-rmse:540.89769
[83]	train-rmse:393.82435	validation-rmse:544.68591
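
Early stopping ends training once the validation metric has not improved for `early_stopping_rounds` consecutive rounds. In the log above, training ends at round 83 with a patience of 50, which is consistent with the best validation score occurring around round 33. The rule can be sketched in plain Python; `early_stopping_round` is a hypothetical helper that mimics the idea, not XGBoost's implementation, and it assumes a non-empty list of lower-is-better scores.

```python
def early_stopping_round(validation_scores, patience):
    """Return (best_round, stop_round) for a lower-is-better metric.

    Training stops at the first round that is `patience` rounds past the
    best score seen so far; if that never happens, it runs to the end.
    """
    best_round, best_score = 0, float("inf")
    for round_idx, score in enumerate(validation_scores):
        if score < best_score:
            # New best score: remember where it happened
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, round_idx
    return best_round, len(validation_scores) - 1

# Toy scores, made up for illustration
print(early_stopping_round([10, 8, 7, 7.5, 7.2, 7.1, 6.0], patience=3))
```

Note that the final score of 6.0 in the toy list is never reached: stopping early trades a possible later improvement for a shorter run.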


In [None]:
# Cross-validation

params = {"objective": "reg:squarederror", "tree_method": "hist"}
n = 1000

results = xgb.cv(
   params, dtrain,
   num_boost_round=n,
   nfold=5,
   early_stopping_rounds=20
)


In [None]:
results.head()

|   | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std |
|---|---|---|---|---|
| 0 | 2861.51281 | 8.494816 | 2861.704341 | 37.144992 |
| 1 | 2081.847733 | 5.811005 | 2084.838207 | 31.889208 |
| 2 | 1547.031906 | 5.092391 | 1554.65745 | 30.699908 |
| 3 | 1184.129738 | 3.982239 | 1194.2516 | 26.940062 |
| 4 | 942.998782 | 3.327174 | 960.239319 | 24.392689 |


In [None]:
best_rmse = results['test-rmse-mean'].min()

best_rmse

553.4613038243663
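
With `nfold=5`, `xgb.cv` splits the training data into five folds, trains on four of them, evaluates on the held-out fold, and reports the mean and standard deviation of the metric per boosting round. The splitting itself can be sketched without any library; `k_fold_indices` is a hypothetical helper, and unlike `xgb.cv` (which shuffles by default) it uses contiguous, unshuffled folds to keep the example short.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of n_folds contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Each sample lands in exactly one test fold
for train_idx, test_idx in k_fold_indices(10, 5):
    print(len(train_idx), test_idx)
```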

## Start W&B


- Log in to your W&B profile using the code below
- Alternatively, set environment variables. Several environment variables change the behavior of W&B logging; the most important are:
    - WANDB_API_KEY - your API key, found under "Profile" -> "Settings" in the W&B App
    - WANDB_BASE_URL - the URL of the W&B server
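
Before calling `wandb.login()`, it can help to check which of these variables are already set. A minimal sketch using only the standard library; `missing_wandb_settings` is a hypothetical helper, and the variable names are the ones listed above.

```python
import os

def missing_wandb_settings(env=None, required=("WANDB_API_KEY",)):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_wandb_settings()
if missing:
    # Without an API key in the environment, the login call asks for one interactively
    print("Not set:", ", ".join(missing))
```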



In [None]:
# Log in to your W&B account
import wandb

wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
# TO DO
# Start experiment tracking with W&B
# Do at least 5 experiments with various hyperparameters
# Choose any method for hyperparameter tuning: grid search, random search, bayesian search
# Describe your findings and what you see

In [None]:
from wandb.xgboost import WandbCallback
import wandb

def train(config=None):
    with wandb.init(config=config):
        config = wandb.config

        config_dict = config.as_dict().copy()

        evals = [(dtrain, "train"), (dtest, "validation")]

        model = xgb.train(
            params=config_dict,
            dtrain=dtrain,
            num_boost_round=100,
            evals=evals,
            verbose_eval=50,
            callbacks=[WandbCallback()]
        )

        # Evaluate the model on the validation set to get the final RMSE
        y_pred = model.predict(dtest)
        # `squared=False` is no longer accepted by newer scikit-learn releases
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))

        # Log RMSE to Weights & Biases
        wandb.log({"rmse": rmse})

sweep_config = {
    "method": "grid",
    "parameters": {
        "objective": {"value": "reg:squarederror"},
        "tree_method": {"value": "hist"},
        "eta": {"values": [0.01, 0.1, 0.5]},
        "max_depth": {"values": [6, 9, 12]},
    },
    "metric": {"name": "rmse", "goal": "minimize"}
}

# Initialize the sweep
sweep_id = wandb.sweep(sweep_config, project='diamond_price_prediction')

num_experiments = 9

# Run the sweep with specified number of experiments
wandb.agent(sweep_id, train, count=num_experiments)

wandb.finish()


Create sweep with ID: 8plq0ise
Sweep URL: https://wandb.ai/pensieve/diamond_price_prediction/sweeps/8plq0ise


[34m[1mwandb[0m: Agent Starting Run: rq33yjw4 with config:
[34m[1mwandb[0m: 	eta: 0.01
[34m[1mwandb[0m: 	max_depth: 6
[34m[1mwandb[0m: 	objective: reg:squarederror
[34m[1mwandb[0m: 	tree_method: hist


[0]	train-rmse:3951.98705	validation-rmse:3949.09091
[50]	train-rmse:2478.32156	validation-rmse:2472.35943
[99]	train-rmse:1617.38600	validation-rmse:1612.48917


Run history:

| Metric | History |
|---|---|
| epoch | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| rmse | ▁ |
| train-rmse | ██▇▇▇▇▆▆▆▆▆▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁ |
| validation-rmse | ██▇▇▇▇▆▆▆▆▆▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁ |

Run summary:

| Metric | Value |
|---|---|
| epoch | 99.0 |
| rmse | 1612.48917 |


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: oouarbm0 with config:
[34m[1mwandb[0m: 	eta: 0.01
[34m[1mwandb[0m: 	max_depth: 9
[34m[1mwandb[0m: 	objective: reg:squarederror
[34m[1mwandb[0m: 	tree_method: hist


[0]	train-rmse:3951.33626	validation-rmse:3948.58189
[50]	train-rmse:2447.60123	validation-rmse:2446.36498
[99]	train-rmse:1562.44131	validation-rmse:1569.04328


Run history:

| Metric | History |
|---|---|
| epoch | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| rmse | ▁ |
| train-rmse | ██▇▇▇▇▆▆▆▆▆▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁ |
| validation-rmse | ██▇▇▇▇▆▆▆▆▆▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁ |

Run summary:

| Metric | Value |
|---|---|
| epoch | 99.0 |
| rmse | 1569.04328 |


[34m[1mwandb[0m: Agent Starting Run: lctwxg91 with config:
[34m[1mwandb[0m: 	eta: 0.01
[34m[1mwandb[0m: 	max_depth: 12
[34m[1mwandb[0m: 	objective: reg:squarederror
[34m[1mwandb[0m: 	tree_method: hist


[0]	train-rmse:3951.26255	validation-rmse:3948.54131
[50]	train-rmse:2442.97615	validation-rmse:2445.24152
[99]	train-rmse:1553.07828	validation-rmse:1569.85589


Run history:

| Metric | History |
|---|---|
| epoch | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| rmse | ▁ |
| train-rmse | ██▇▇▇▇▆▆▆▆▆▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁ |
| validation-rmse | ██▇▇▇▇▆▆▆▆▆▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁ |

Run summary:

| Metric | Value |
|---|---|
| epoch | 99.0 |
| rmse | 1569.85589 |


[34m[1mwandb[0m: Agent Starting Run: tdncz2e2 with config:
[34m[1mwandb[0m: 	eta: 0.1
[34m[1mwandb[0m: 	max_depth: 6
[34m[1mwandb[0m: 	objective: reg:squarederror
[34m[1mwandb[0m: 	tree_method: hist


[0]	train-rmse:3611.16120	validation-rmse:3606.77371
[50]	train-rmse:498.27426	validation-rmse:535.92046
[99]	train-rmse:453.38076	validation-rmse:525.22220


Run history:

| Metric | History |
|---|---|
| epoch | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| rmse | ▁ |
| train-rmse | █▇▅▄▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| validation-rmse | █▇▅▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |

Run summary:

| Metric | Value |
|---|---|
| epoch | 99.0 |
| rmse | 525.2222 |


[34m[1mwandb[0m: Agent Starting Run: mlg09uz3 with config:
[34m[1mwandb[0m: 	eta: 0.1
[34m[1mwandb[0m: 	max_depth: 9
[34m[1mwandb[0m: 	objective: reg:squarederror
[34m[1mwandb[0m: 	tree_method: hist


[0]	train-rmse:3604.25864	validation-rmse:3601.39313
[50]	train-rmse:385.07484	validation-rmse:530.95324
[99]	train-rmse:322.40650	validation-rmse:535.48672


Run history:

| Metric | History |
|---|---|
| epoch | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| rmse | ▁ |
| train-rmse | █▇▅▄▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| validation-rmse | █▇▅▄▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |

Run summary:

| Metric | Value |
|---|---|
| epoch | 99.0 |
| rmse | 535.48672 |


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: 4ejopusz with config:
[34m[1mwandb[0m: 	eta: 0.1
[34m[1mwandb[0m: 	max_depth: 12
[34m[1mwandb[0m: 	objective: reg:squarederror
[34m[1mwandb[0m: 	tree_method: hist


[0]	train-rmse:3603.46421	validation-rmse:3600.95704
[50]	train-rmse:236.99153	validation-rmse:548.06311
[99]	train-rmse:165.97571	validation-rmse:554.15969


Run history:

| Metric | History |
|---|---|
| epoch | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| rmse | ▁ |
| train-rmse | █▇▅▄▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| validation-rmse | █▆▅▄▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |

Run summary:

| Metric | Value |
|---|---|
| epoch | 99.0 |
| rmse | 554.15969 |


[34m[1mwandb[0m: Agent Starting Run: 42qcd6kx with config:
[34m[1mwandb[0m: 	eta: 0.5
[34m[1mwandb[0m: 	max_depth: 6
[34m[1mwandb[0m: 	objective: reg:squarederror
[34m[1mwandb[0m: 	tree_method: hist


[0]	train-rmse:2132.65626	validation-rmse:2121.02777
[50]	train-rmse:403.62002	validation-rmse:559.95577
[99]	train-rmse:339.28198	validation-rmse:566.10353


Run history:

| Metric | History |
|---|---|
| epoch | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| rmse | ▁ |
| train-rmse | █▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| validation-rmse | █▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |

Run summary:

| Metric | Value |
|---|---|
| epoch | 99.0 |
| rmse | 566.10353 |


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: szav3b1k with config:
[34m[1mwandb[0m: 	eta: 0.5
[34m[1mwandb[0m: 	max_depth: 9
[34m[1mwandb[0m: 	objective: reg:squarederror
[34m[1mwandb[0m: 	tree_method: hist


[0]	train-rmse:2082.17246	validation-rmse:2082.39186
[50]	train-rmse:221.43633	validation-rmse:586.35858
[99]	train-rmse:144.37209	validation-rmse:591.81728


Run history:

| Metric | History |
|---|---|
| epoch | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| rmse | ▁ |
| train-rmse | █▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| validation-rmse | █▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |

Run summary:

| Metric | Value |
|---|---|
| epoch | 99.0 |
| rmse | 591.81728 |


[34m[1mwandb[0m: Agent Starting Run: qz45ym7k with config:
[34m[1mwandb[0m: 	eta: 0.5
[34m[1mwandb[0m: 	max_depth: 12
[34m[1mwandb[0m: 	objective: reg:squarederror
[34m[1mwandb[0m: 	tree_method: hist


[0]	train-rmse:2075.81890	validation-rmse:2078.95777
[50]	train-rmse:83.09915	validation-rmse:598.72105
[99]	train-rmse:32.36722	validation-rmse:600.35191


Run history:

| Metric | History |
|---|---|
| epoch | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| rmse | ▁ |
| train-rmse | █▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| validation-rmse | █▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |

Run summary:

| Metric | Value |
|---|---|
| epoch | 99.0 |
| rmse | 600.35191 |


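In these experiments, we ran a grid search over the learning rate (`eta`: 0.01, 0.1, 0.5) and the maximum tree depth (`max_depth`: 6, 9, 12) of the XGBoost model, giving nine runs of 100 boosting rounds each. The best validation RMSE, 525.22, came from `eta=0.1` with `max_depth=6`. With `eta=0.01` the model was still underfitting after 100 rounds (validation RMSE between roughly 1569 and 1612), while `eta=0.5` and deeper trees overfit: at `eta=0.5` and `max_depth=12` the training RMSE fell to 32.37 but the validation RMSE was 600.35.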
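
The size of the grid and the ranking of the runs can be reproduced offline. The sketch below enumerates the same `eta` and `max_depth` values with `itertools.product` and picks the configuration with the lowest final validation RMSE; the numbers are copied from the run summaries above, and the variable names are illustrative.

```python
from itertools import product

etas = [0.01, 0.1, 0.5]
max_depths = [6, 9, 12]

# A grid search tries every combination: 3 x 3 = 9 runs
grid = list(product(etas, max_depths))
print(len(grid))

# Final validation RMSE per (eta, max_depth), copied from the run summaries
final_rmse = {
    (0.01, 6): 1612.48917,
    (0.01, 9): 1569.04328,
    (0.01, 12): 1569.85589,
    (0.1, 6): 525.2222,
    (0.1, 9): 535.48672,
    (0.1, 12): 554.15969,
    (0.5, 6): 566.10353,
    (0.5, 9): 591.81728,
    (0.5, 12): 600.35191,
}

# Lowest RMSE wins
best_config = min(final_rmse, key=final_rmse.get)
print(best_config, final_rmse[best_config])
```

The same comparison is available in the W&B sweep dashboard; doing it in code is mainly useful for a quick sanity check of the logged values.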

