# Exercise 1: Prediction Models

You will practice the basic steps of fitting and using a machine learning model.

In [3]:
conda install xgboost

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: c:\Users\verba\anaconda3

  added / updated specs:
    - xgboost


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _py-xgboost-mutex-2.0      |            cpu_2          16 KB
    ca-certificates-2025.7.15  |       haa95532_0         127 KB
    certifi-2025.8.3           |  py310haa95532_0         160 KB
    libxgboost-3.0.1           |       h585ebfc_0         2.7 MB
    py-xgboost-3.0.1           |  py310haa95532_0         309 KB
    ucrt-10.0.22621.0          |       haa95532_0         620 KB
    vc14_runtime-14.44.35208   |      h4927774_10         825 KB
    vs2015_runtime-14.44.35208 |      ha6b5a95_10          19 KB
    xgboost-3.0.1              |  py310haa95532_0          14 KB
    -------------------------------------------



  current version: 23.3.1
  latest version: 25.7.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=25.7.0




In [4]:
# import packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import xgboost as xgb

In [5]:
# Load data
X_train = pd.read_csv("ex1_train.csv", header=None)
X_test = pd.read_csv("ex1_test.csv", header=None)
y_train = pd.read_csv("ex1_class_train.csv", header=None)
y_test = pd.read_csv("ex1_class_test.csv", header=None)

# Part 1: Default XGBoost Classifier

**TODO: Fit the model and predict for test data in the following cell**

In [None]:
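# 1) create an XGBoost classifier instance
# Note: use_label_encoder is ignored by the XGBoost 3.0.1 installed above
# (hence the "are not used" warning in the output); it can safely be dropped.
classifier = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
# 2) fit the classifier using X_train and y_train
# y_train is a one-column DataFrame; .values.ravel() flattens it to a 1-D label array
classifier.fit(X_train, y_train.values.ravel())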
# 3) make prediction over X_test. The prediction output should be named y_pred_default
y_pred_default = classifier.predict(X_test)

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


In [8]:
# Evaluate the default model
accuracy_default = accuracy_score(y_test, y_pred_default)
precision_default = precision_score(y_test, y_pred_default)
recall_default = recall_score(y_test, y_pred_default)
f1_default = f1_score(y_test, y_pred_default)

print("Default Model Performance:")
print(f"Accuracy: {accuracy_default:.4f}")
print(f"Precision: {precision_default:.4f}")
print(f"Recall: {recall_default:.4f}")
print(f"F1 Score: {f1_default:.4f}")

Default Model Performance:
Accuracy: 0.7032
Precision: 0.7216
Recall: 0.7186
F1 Score: 0.7201


You should achieve an F1 score > 0.65 to pass Part 1.
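
All four scores above derive from the counts of true/false positives and negatives. The sketch below recomputes them by hand for a small made-up label vector, then checks that the F1 score reported in Part 1 is the harmonic mean of the reported precision and recall. The toy labels and the helper name `binary_metrics` are illustrative assumptions, not part of the exercise data, and returning 0.0 for an undefined ratio is a choice made here for simplicity.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Return 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical toy labels, only to exercise the helper.
toy_true = [1, 1, 1, 0, 0, 1, 0, 0]
toy_pred = [1, 0, 1, 0, 1, 1, 0, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
print(toy_scores)

# Consistency check on the Part 1 output: F1 is the harmonic mean of
# the reported precision (0.7216) and recall (0.7186).
reported_precision, reported_recall = 0.7216, 0.7186
f1_from_reported = (2 * reported_precision * reported_recall
                    / (reported_precision + reported_recall))
print(f"{f1_from_reported:.4f}")
```

The second print reproduces the reported F1 score of 0.7201 up to rounding.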

# Part 2: Hyperparameter Tuning with Cross-Validation

In [14]:
# Define candidate hyperparameters
param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2]
}
# These are the hyperparameters we will tune

**TODO: Find the best hyperparameters and use them to fit an improved classifier in the following cell**

In [15]:
# 1) use GridSearchCV to find the best hyperparameters
# estimator is the classifier instance, param_grid is the hyperparameter grid;
# candidates are ranked by their mean 5-fold cross-validated accuracy
grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid,
                            scoring='accuracy', cv=5, verbose=1, n_jobs=-1)
# 2) run the search; with the default refit=True, GridSearchCV then refits an
#    XGBoost classifier with the best hyperparameters on the full training set
grid_search.fit(X_train, y_train.values.ravel())
# 3) make prediction over X_test with the refit best estimator.
#    The prediction output should be named y_pred_tuned
y_pred_tuned = grid_search.predict(X_test)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
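
The counts in this message follow directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. A minimal sketch of that arithmetic using only the standard library (the variable names `candidates`, `n_candidates` and `n_fits` are illustrative):

```python
from itertools import product

# Same grid as defined in the notebook.
param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
}
cv_folds = 5  # matches cv=5 passed to GridSearchCV

# Every combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
n_candidates = len(candidates)    # 4 * 3 * 3
n_fits = n_candidates * cv_folds  # one fit per candidate per fold

print(n_candidates, n_fits)
```

The final refit of the best candidate on the full training set is an additional fit that this message does not count.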


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
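
After fitting, a `GridSearchCV` object exposes the winning combination and its cross-validated score. The sketch below shows those attributes on stand-ins so that it runs without the exercise files: the synthetic data from `make_classification`, the `DecisionTreeClassifier` estimator and the one-parameter grid are all assumptions made for illustration. It also uses `scoring="f1"` rather than the notebook's `scoring='accuracy'`, which may be worth trying here since the pass criterion is stated in terms of F1.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for ex1_train.csv).
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=0)

# Small stand-in search; scoring="f1" ranks candidates by F1, not accuracy.
demo_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6]},
    scoring="f1",
    cv=5,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)          # winning hyperparameter combination
print(f"{demo_search.best_score_:.4f}")  # its mean cross-validated F1

# With the default refit=True, best_estimator_ is already refit on all of
# X_demo, and demo_search.predict delegates to it.
best_model = demo_search.best_estimator_
```

In the notebook, the same attributes would be read from `grid_search` after the call to `grid_search.fit`.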


In [16]:
# Evaluate the tuned model
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
precision_tuned = precision_score(y_test, y_pred_tuned)
recall_tuned = recall_score(y_test, y_pred_tuned)
f1_tuned = f1_score(y_test, y_pred_tuned)

print("Tuned Model Performance:")
print(f"Accuracy: {accuracy_tuned:.4f}")
print(f"Precision: {precision_tuned:.4f}")
print(f"Recall: {recall_tuned:.4f}")
print(f"F1 Score: {f1_tuned:.4f}")

# Analysis
print(f"Improvement in F1 Score: {f1_tuned - f1_default:.4f}")

Tuned Model Performance:
Accuracy: 0.7186
Precision: 0.7276
Recall: 0.7519
F1 Score: 0.7396
Improvement in F1 Score: 0.0195


To pass Part 2, your new F1 score should be higher than both 0.65 and the F1 score from Part 1.
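
The pass condition can be written as a small check. This is a sketch: the helper name `passes_part2` is hypothetical, and the inputs are the rounded F1 scores printed in the outputs above.

```python
def passes_part2(f1_tuned, f1_default, threshold=0.65):
    """True when the tuned F1 beats both the threshold and the Part 1 F1."""
    return f1_tuned > threshold and f1_tuned > f1_default


# Rounded values copied from the printed outputs of Parts 1 and 2.
f1_default_reported = 0.7201
f1_tuned_reported = 0.7396

part2_passed = passes_part2(f1_tuned_reported, f1_default_reported)
improvement = f1_tuned_reported - f1_default_reported
print(part2_passed, f"{improvement:.4f}")
```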

In [17]:
# Improvements
print("Improvements made by tuning hyperparameters:")
print(f"Accuracy: {accuracy_tuned - accuracy_default:.4f}")
print(f"Precision: {precision_tuned - precision_default:.4f}")
print(f"Recall: {recall_tuned - recall_default:.4f}")
print(f"F1 Score: {f1_tuned - f1_default:.4f}")

Improvements made by tuning hyperparameters:
Accuracy: 0.0155
Precision: 0.0060
Recall: 0.0334
F1 Score: 0.0195
