# Gradient Descent
Gradient descent is an optimization algorithm that searches for a minimum of a function (such as the error function of a machine learning model) by moving iteratively in the direction of the negative gradient, i.e. the direction of steepest descent. Imagine standing on a foggy mountain and wanting to reach the lowest point: gradient descent takes a step in the steepest downhill direction and repeats the process until it reaches the valley. The "gradient" is the vector that points in the direction of steepest increase of the function, so moving in the opposite direction decreases the function.
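
As a minimal sketch of the idea, the loop below applies the update `x = x - learning_rate * gradient` to a toy quadratic function. The function `f`, its gradient, the starting point, the learning rate and the number of steps are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)  # move opposite to the gradient
    return x

# Toy objective (assumed for illustration): f(x) = (x0 - 3)^2 + (x1 + 1)^2
def f(x):
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2

def grad_f(x):
    return np.array([2.0 * (x[0] - 3.0), 2.0 * (x[1] + 1.0)])

x_min = gradient_descent(grad_f, x0=[0.0, 0.0])
print(x_min, f(x_min))
```

The iterates approach the minimum at (3, -1). A learning rate that is too large would make the steps overshoot, while a very small one would need many more iterations.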

# scikit-learn GradientBoosting
A built-in, teaching-oriented implementation, useful for understanding how gradient boosting works and for testing prototypes. For a production-ready implementation we will move on to XGBoost / LightGBM / CatBoost.

In [1]:
from sklearn.datasets import make_classification, make_regression
import numpy as np
import matplotlib.pyplot as plt

# create a synthetic dataset for classification
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)  # 1000 samples, 10 features: 5 informative and 5 redundant (linear combinations of the informative ones)

# and one for regression
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)  # 1000 samples, 10 features: 5 informative, the other 5 unrelated to the target


# GradientBoostingClassifier - scikit-learn

In [17]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# create the model

sgb = GradientBoostingClassifier()
accuracy_cross_array = cross_val_score(sgb, X, y, scoring='accuracy', cv=10, n_jobs=-1, error_score='raise')  # n_jobs sets the number of parallel jobs used for cross-validation; -1 uses all processors
# error_score='raise' re-raises any error that occurs while fitting a fold instead of substituting a placeholder score
print(f'mean accuracy: {np.mean(accuracy_cross_array):.3f}, std: {np.std(accuracy_cross_array):.3f}')
# cross_val_score fits clones of the estimator, so fit a fresh model on the whole dataset before predicting
sgb = GradientBoostingClassifier()
sgb.fit(X, y)

# feature values of a single new sample
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]

yhat = sgb.predict(row)
print(f'Prediction: {yhat}')  # predicted class (0/1) for the input row

mean accuracy: 0.912, std: 0.034
Prediction: [1]
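
To make the `cv=10` argument less opaque, here is a hedged sketch of roughly what `cross_val_score` does for a classifier: split the data into stratified folds, fit a fresh clone on each training part, and score it on the held-out part. The reduced `n_estimators` and the fixed `random_state` are assumptions that keep the sketch fast and repeatable; they are not the settings used in the cell above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# Fewer trees and a fixed seed (assumptions of this sketch) keep the run short and repeatable
model = GradientBoostingClassifier(n_estimators=20, random_state=0)

# For a classifier, an integer cv is turned into stratified folds without shuffling
folds = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    fold_model = clone(model)                      # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])     # train on 9 folds
    y_pred = fold_model.predict(X[test_idx])       # predict the held-out fold
    manual_scores.append(accuracy_score(y[test_idx], y_pred))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print(f'manual:  {manual_scores.mean():.3f}')
print(f'library: {library_scores.mean():.3f}')
```

Because the original estimator is only cloned, it is still unfitted after cross-validation, which is why the notebook calls `fit` again before `predict`.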


# GradientBoostingRegressor - scikit-learn

In [19]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# create the model
sgb = GradientBoostingRegressor()
neg_mean_absolute_error_array = cross_val_score(sgb, X, y, scoring='neg_mean_absolute_error', cv=10, n_jobs=-1, error_score='raise')  # n_jobs sets the number of parallel jobs used for cross-validation; -1 uses all processors
# error_score='raise' re-raises any error that occurs while fitting a fold instead of substituting a placeholder score
print(f'mean neg_mean_absolute_error: {np.mean(neg_mean_absolute_error_array):.3f}, std: {np.std(neg_mean_absolute_error_array):.3f}')
sgb = GradientBoostingRegressor()
sgb.fit(X, y)

# feature values of a single new sample
row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]

yhat = sgb.predict(row)
print(f'Prediction: {yhat[0]:.3f}')  # continuous value predicted for the input row

mean neg_mean_absolute_error: -12.188, std: 1.023
Prediction: -80.661
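
The score above is negative because scikit-learn scorers follow a "higher is better" convention, so error metrics are reported with the sign flipped. The small sketch below shows the relationship on made-up numbers; the arrays are toy values, not results from this notebook.

```python
import numpy as np

y_true = np.array([3.0, -1.0, 2.0, 7.0])   # toy targets (made up for illustration)
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # toy predictions (made up for illustration)

mae = np.mean(np.abs(y_true - y_pred))     # mean absolute error: lower is better
neg_mae = -mae                             # sign-flipped value, as reported by 'neg_mean_absolute_error'

print(f'MAE: {mae:.3f}, neg MAE: {neg_mae:.3f}')
```

To read a cross-validated `neg_mean_absolute_error` as an ordinary error, negate it: a score closer to zero means a better model.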


# HistGradientBoostingClassifier - scikit-learn (improves on GradientBoostingClassifier)
Instead of building the trees by evaluating every possible split point of each continuous feature, it discretizes the features into histograms: continuous inputs are mapped to a small number of buckets (for example up to 255 buckets, with values from 0 to 30 going to bucket 1, values from 31 to 60 to bucket 2, and so on). Split candidates are then searched only among bucket boundaries, which makes training much faster on large datasets.
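
The bucketing step can be illustrated with plain NumPy. This is only a sketch of the idea: the synthetic feature and the equal-width edges (every 30 units, echoing the example above) are assumptions, and the estimator itself derives its bin edges from the training data.

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.uniform(0, 300, size=1000)   # a continuous feature (synthetic)

# Equal-width edges 30, 60, ..., 270 give 10 buckets over [0, 300)
edges = np.arange(30, 300, 30)
bucket = np.digitize(feature, edges)       # integer bucket index (0-based) for every value

print(feature[:3].round(1), bucket[:3])
print('distinct values before:', np.unique(feature).size, 'after:', np.unique(bucket).size)
```

After binning, a tree only has to consider a handful of thresholds per feature instead of one per distinct value.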

In [18]:
from sklearn.ensemble import HistGradientBoostingClassifier

# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# evaluate the model
model = HistGradientBoostingClassifier()
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=10, n_jobs=-1, error_score='raise')
print(f'mean accuracy: {np.mean(n_scores):.3f}, std: {np.std(n_scores):.3f}')
# fit the model on the whole dataset
model = HistGradientBoostingClassifier()
model.fit(X, y)
# make a single prediction
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = model.predict(row)
print('Prediction: %d' % yhat[0])

mean accuracy: 0.927, std: 0.030
Prediction: 1


# HistGradientBoostingRegressor - scikit-learn (improves on GradientBoostingRegressor)
As in HistGradientBoostingClassifier, the trees are not built by evaluating every possible split point of each continuous feature: the features are first discretized into histograms, mapping continuous inputs to buckets (for example up to 255 buckets, with values from 0 to 30 going to bucket 1, values from 31 to 60 to bucket 2, and so on).

In [23]:
from sklearn.ensemble import HistGradientBoostingRegressor

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# evaluate the model
model = HistGradientBoostingRegressor()
neg_mean_absolute_error_array = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=10, n_jobs=-1, error_score='raise')
print(f'neg_mean_absolute_error: {np.mean(neg_mean_absolute_error_array):.3f}, std: {np.std(neg_mean_absolute_error_array):.3f}')  # label matches the scoring actually used
# fit the model on the whole dataset
model = HistGradientBoostingRegressor()
model.fit(X, y)
# make a single prediction
row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])

neg_mean_absolute_error: -12.924, std: 1.203
Prediction: -77.837


# XGBoost (best)
In outline, boosting works as follows:

1. Build an initial decision tree.
2. Compute the residual error (residual = true value − prediction).
3. Build a new tree that predicts these residuals.
4. Add up the predictions of all the trees.
5. Repeat until a maximum number of trees is reached or the error is low enough.

Very robust; it often beats neural networks on structured (tabular) data.
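
The residual-fitting loop listed above can be sketched with scikit-learn decision trees. This is a simplified illustration of the general boosting idea, not XGBoost's actual algorithm: the constant starting prediction, the `learning_rate` shrinkage, the tree depth and the number of trees are assumptions of the sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

learning_rate = 0.1   # shrinkage applied to every tree (an assumption of this sketch)
n_trees = 50
trees = []

prediction = np.full(len(y), y.mean())             # start from a constant prediction
initial_mae = np.mean(np.abs(y - prediction))
for _ in range(n_trees):
    residual = y - prediction                      # residual = true value - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residual)                          # the new tree predicts the residuals
    prediction += learning_rate * tree.predict(X)  # add its output to the running sum
    trees.append(tree)

final_mae = np.mean(np.abs(y - prediction))
print(f'training MAE: {initial_mae:.1f} -> {final_mae:.1f}')
```

For squared-error loss the residual is the negative gradient of the loss with respect to the prediction, which is where the "gradient" in gradient boosting comes from.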

# XGBClassifier

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

parameter_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 0.1, 0.3, 0.5, 1],
    'reg_alpha': [0, 0.01, 0.1, 1],
    'reg_lambda': [0.1, 1, 10]
}

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
model = XGBClassifier()

random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=parameter_grid,
    cv=5,
    n_iter=30,
    scoring='f1', 
    n_jobs=-1
)

random_search.fit(X, y)
# note: X, y were also used to tune the hyperparameters, so this cross-validated score is optimistic
scores = cross_val_score(random_search.best_estimator_, X, y, scoring='accuracy', cv=10, n_jobs=-1, error_score='raise')
print(f'mean accuracy: {np.mean(scores):.3f}, std: {np.std(scores):.3f}')

row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = random_search.best_estimator_.predict(row)
print(f'Prediction: {yhat[0]}')

mean accuracy: 0.937, std: 0.023
Prediction: 1


# XGBRegressor

In [None]:
from xgboost import XGBRegressor

parameter_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'reg_alpha': [0, 0.01, 0.1, 1],
    'reg_lambda': [0.1, 1, 10]
}


X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
model = XGBRegressor(objective='reg:squarederror')
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=parameter_grid,
    cv=5,
    n_iter=30,
    scoring='neg_mean_absolute_error',
    n_jobs=-1
)
random_search.fit(X, y)
model = random_search.best_estimator_
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=10, n_jobs=-1, error_score='raise')
print(f'neg_mean_absolute_error: {np.mean(scores):.3f}, std: {np.std(scores):.3f}')

row = [2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]
row = np.asarray(row).reshape((1, len(row)))  # reshape to 2-D: one sample, len(row) features

yhat = model.predict(row)
print(f'Prediction: {yhat[0]}')

neg_mean_absolute_error: -9.969, std: 0.787
Prediction: -86.41383361816406


# LightGBM
Faster on large datasets: it uses leaf-wise tree growth and handles very large datasets better.
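
To give a feel for what leaf-wise growth means, the sketch below contrasts two scikit-learn trees with the same budget of 8 leaves: one limited by depth (level-wise), and one limited by `max_leaf_nodes`, which scikit-learn grows best-first by always splitting the leaf with the largest impurity reduction. This is an analogy built with scikit-learn, not LightGBM's own implementation, and the dataset and leaf budget are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# Level-wise: every node is split until the depth limit (at most 2**3 = 8 leaves)
level_wise = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Leaf-wise (best-first): keep splitting the most promising leaf until 8 leaves exist;
# the depth is left free, so the tree may become unbalanced
leaf_wise = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)

for name, tree in [('level-wise', level_wise), ('leaf-wise', leaf_wise)]:
    mse = np.mean((y - tree.predict(X)) ** 2)
    print(f'{name}: leaves={tree.get_n_leaves()}, depth={tree.get_depth()}, train MSE={mse:.1f}')
```

Leaf-wise trees spend their leaves where the loss reduction is largest, which is also why deep, unbalanced trees can overfit small datasets unless the number of leaves or the depth is capped.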

# LGBMClassifier


In [None]:
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV
import pandas as pd

parameter_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'reg_alpha': [0, 0.01, 0.1, 1],
    'reg_lambda': [0.1, 1, 10]
}

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
model = LGBMClassifier()

random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=parameter_grid,
    cv=5,
    n_iter=30,
    scoring='f1', 
    n_jobs=-1
)

random_search.fit(X, y)
scores = cross_val_score(random_search.best_estimator_, X, y, scoring='accuracy', cv=10, n_jobs=-1, error_score='raise')
print(f'mean accuracy: {np.mean(scores):.3f}, std: {np.std(scores):.3f}')

row = [2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]
rows = pd.DataFrame([row], columns=[f'feature{i}' for i in range(len(row))])  # the first argument is the list of records; a flat list would be read as one value per row, so row is wrapped in []
yhat = random_search.best_estimator_.predict(rows)  # predict expects 2-D input: one row per record, one column per feature
print(f'Prediction: {yhat[0]}')

mean accuracy: 0.935, std: 0.027
Prediction: 1


# LGBMRegressor

In [None]:
from lightgbm import LGBMRegressor

parameter_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'reg_alpha': [0, 0.01, 0.1, 1],
    'reg_lambda': [0.1, 1, 10]
}

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
model = LGBMRegressor()
random_search = RandomizedSearchCV(estimator=model, param_distributions=parameter_grid, n_jobs=-1, n_iter=30, cv=5, scoring='neg_mean_absolute_error')
random_search.fit(X, y)
scores = cross_val_score(random_search.best_estimator_, X, y, scoring='neg_mean_absolute_error', cv=10, n_jobs=-1, error_score='raise')
print(f'neg_mean_absolute_error: {np.mean(scores):.3f}, std: {np.std(scores):.3f}')

row = [2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]
rows = pd.DataFrame([row], columns=[f'feature {i}' for i in range(len(row))])

yhat = random_search.best_estimator_.predict(rows)
print(f'Prediction: {yhat[0]}')

neg_mean_absolute_error: -11.373, std: 0.817
Prediction: -77.22435873098215


# CatBoost
Excellent on categorical features: it reduces the need for feature engineering (encoding).
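
One way a library can consume categorical columns without manual encoding is to replace each category with a statistic of the target. The sketch below is a simplified take on the ordered target statistics idea described in the CatBoost documentation: each row is encoded using only the targets of rows that come before it in a random order, so a row never sees its own label. The function name, `prior_weight`, the seed and the toy data are assumptions for illustration; this is not the library's implementation.

```python
import numpy as np

def ordered_target_encoding(categories, targets, prior_weight=1.0, seed=0):
    """Encode each row using only the targets of rows that precede it in a random order."""
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    prior = targets.mean()                     # fallback value for categories not seen yet
    order = np.random.default_rng(seed).permutation(len(targets))

    sums, counts = {}, {}                      # running target sum and count per category
    encoded = np.empty(len(targets))
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)
        sums[c] = s + targets[i]               # the row's own target is added only afterwards
        counts[c] = n + 1
    return encoded

colors = ['red', 'blue', 'red', 'green', 'blue', 'red']   # toy categorical feature
clicked = [1, 0, 1, 0, 0, 1]                              # toy binary target
encoded = ordered_target_encoding(colors, clicked)
print(encoded.round(3))
```

A category that appears only once (here `'green'`) is encoded with the prior alone, since no earlier row can contribute information about it.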

# CatBoostClassifier

In [None]:
from catboost import CatBoostClassifier

parameter_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'reg_lambda': [0.1, 1, 10]
}

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
model = CatBoostClassifier()

random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=parameter_grid,
    cv=5,
    n_iter=30,
    scoring='f1', 
    n_jobs=-1
)  # X and y are not passed here: cross-validation and the resulting model selection run later, when .fit() is called

random_search.fit(X, y)
scores = cross_val_score(random_search.best_estimator_, X, y, scoring='accuracy', cv=10, n_jobs=-1, error_score='raise')
print(f'accuracy media: {np.mean(scores):.3f}, std: {np.std(scores):.3f}')

row = [2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]
rows = pd.DataFrame([row], columns=[f'feature{i}' for i in range(len(row))])  # the first argument is the list of records; a flat list would be read as one value per row, so row is wrapped in []
yhat = random_search.best_estimator_.predict(rows)  # predict expects 2-D input: one row per record, one column per feature
print(f'Prediction: {yhat[0]}')



0:	learn: 0.6764235	total: 150ms	remaining: 1m 14s
1:	learn: 0.6611875	total: 154ms	remaining: 38.2s
2:	learn: 0.6474106	total: 157ms	remaining: 26s
3:	learn: 0.6326389	total: 161ms	remaining: 19.9s
4:	learn: 0.6190676	total: 164ms	remaining: 16.3s
5:	learn: 0.6049172	total: 168ms	remaining: 13.8s
6:	learn: 0.5932366	total: 171ms	remaining: 12.1s
7:	learn: 0.5778507	total: 175ms	remaining: 10.8s
8:	learn: 0.5641654	total: 178ms	remaining: 9.72s
9:	learn: 0.5520019	total: 182ms	remaining: 8.9s
10:	learn: 0.5387101	total: 185ms	remaining: 8.22s
11:	learn: 0.5282337	total: 189ms	remaining: 7.69s
12:	learn: 0.5175359	total: 193ms	remaining: 7.22s
13:	learn: 0.5072605	total: 196ms	remaining: 6.82s
14:	learn: 0.4977937	total: 200ms	remaining: 6.47s
15:	learn: 0.4874774	total: 204ms	remaining: 6.16s
16:	learn: 0.4763996	total: 207ms	remaining: 5.88s
17:	learn: 0.4646172	total: 210ms	remaining: 5.63s
18:	learn: 0.4564620	total: 214ms	remaining: 5.42s
19:	learn: 0.4456876	total: 217ms	remaining

# CatBoostRegressor

In [56]:
from catboost import CatBoostRegressor

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
model = CatBoostRegressor()

parameter_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'reg_lambda': [0.1, 1, 10]
}

random_search = RandomizedSearchCV(estimator=model, param_distributions=parameter_grid, n_iter=30, n_jobs=-1, cv=5, scoring='neg_mean_absolute_error')

random_search.fit(X, y)

scores = cross_val_score(random_search.best_estimator_, X, y, scoring='neg_mean_absolute_error', n_jobs=-1, cv=5, error_score='raise')

print(f'neg_mean_absolute_error: {np.mean(scores):.3f}, std: {np.std(scores):.3f}')



row = [2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]
rows = pd.DataFrame([row], columns=[f'feature {i}' for i in range(len(row))])

yhat = random_search.best_estimator_.predict(rows)
print(f'Prediction: {yhat[0]}')


0:	learn: 125.3014818	total: 610us	remaining: 305ms
1:	learn: 121.5928773	total: 1.07ms	remaining: 266ms
2:	learn: 117.2640807	total: 1.45ms	remaining: 240ms
3:	learn: 113.1748953	total: 1.82ms	remaining: 225ms
4:	learn: 109.0544135	total: 2.2ms	remaining: 218ms
5:	learn: 105.0253948	total: 2.57ms	remaining: 212ms
6:	learn: 101.7136815	total: 2.89ms	remaining: 204ms
7:	learn: 98.8594437	total: 3.23ms	remaining: 199ms
8:	learn: 95.9428830	total: 3.61ms	remaining: 197ms
9:	learn: 92.9817714	total: 3.96ms	remaining: 194ms
10:	learn: 90.3419583	total: 4.33ms	remaining: 193ms
11:	learn: 87.5970002	total: 4.79ms	remaining: 195ms
12:	learn: 84.9336608	total: 5.22ms	remaining: 196ms
13:	learn: 82.4312382	total: 5.83ms	remaining: 202ms
14:	learn: 80.2208588	total: 6.23ms	remaining: 201ms
15:	learn: 77.9358243	total: 6.57ms	remaining: 199ms
16:	learn: 75.6937407	total: 6.91ms	remaining: 196ms
17:	learn: 73.4032702	total: 7.32ms	remaining: 196ms
18:	learn: 71.2860782	total: 7.71ms	remaining: 195m