# Gradient Boosting

Gradient boosting combines the predictions of many weak models, typically shallow decision trees, fitted one after another so that each new tree corrects the errors of the ensemble built so far. In this project, we aim to predict each customer's average monthly spending on six product categories (wine, fruits, meat, fish, sweets, gold) from their income and the number of purchases made through each channel (web, catalog, store).
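To make the idea concrete, here is a minimal sketch of gradient boosting for squared error, written in plain Python with one-feature decision stumps. It is not how XGBoost is implemented (XGBoost adds regularization, second-order gradients, and deeper trees); the names `fit_stump`, `boost`, and `predict` and the toy data are illustrative assumptions.

```python
# Minimal gradient boosting for squared error using one-feature decision stumps.
# Illustrative only: fit_stump, boost, and predict are hypothetical helpers.

def fit_stump(xs, residuals):
    """Pick the threshold that best splits the residuals by squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    total = base
    for threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if x <= threshold else right_mean)
    return total


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (made up for illustration): a step from about 1 to about 5.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]

model = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print("baseline MSE:", baseline_mse)
print("boosted MSE:", boosted_mse)
```

Each round adds only a fraction (`learning_rate`) of the new stump's prediction, which is the same shrinkage idea that XGBoost exposes through its own learning-rate setting.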

In [None]:
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost import plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [None]:
import graphviz

In [None]:
df = pd.read_csv('Datasets/cleaned_customer.csv')
df = df[['Income', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'Dt_Customer', 'Recency']]
df.head()

In [None]:
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y')
df['CurrentDate'] = datetime.datetime(2021, 1, 1)
df['numMonths'] = (df['CurrentDate'] - df['Dt_Customer']) / np.timedelta64(1, 'D') / 30
df['Recency'] = df['Recency'] / 30

In [None]:
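# Average monthly spend per category over each customer's active period
# (months between enrollment and the most recent purchase).
spend_cols = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
active_months = df['numMonths'] - df['Recency']
df[spend_cols] = df[spend_cols].div(active_months, axis=0)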

df.drop(columns = ['Dt_Customer', 'CurrentDate', "numMonths", "Recency"], inplace = True)
df
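The averaging step divides by the number of months between enrollment and the most recent purchase, which is only meaningful when that span is positive. The sketch below works through the arithmetic for a single customer in plain Python and returns `None` when the span is zero or negative. The helper `monthly_average` and the example values are hypothetical and not part of the notebook; the snapshot date and the 30-day month match the cells above.

```python
from datetime import date

SNAPSHOT = date(2021, 1, 1)  # same reference date as the notebook


def monthly_average(total_spend, enrolled, recency_days, snapshot=SNAPSHOT):
    """Hypothetical helper: spend per active month, or None if the span is not positive."""
    months_enrolled = (snapshot - enrolled).days / 30   # 30-day months, as above
    months_since_purchase = recency_days / 30
    active_months = months_enrolled - months_since_purchase
    if active_months <= 0:
        return None  # avoids dividing by zero or by a negative span
    return total_spend / active_months


# Made-up customer: enrolled 731 days before the snapshot, last purchase 30 days ago.
example = monthly_average(701.0, date(2019, 1, 1), 30)
print(example)  # about 30 per month

# Made-up customer whose last purchase falls on the enrollment date.
degenerate = monthly_average(100.0, date(2020, 12, 2), 30)
print(degenerate)
```

In the DataFrame version, the equivalent check would be to inspect how many rows have `numMonths - Recency <= 0` before dividing, since those rows would otherwise produce infinite or negative averages.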

In [None]:
x=df[['Income', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']]
y= df[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']]
train_x, test_x, train_y, test_y = train_test_split(x,y, train_size=0.8, random_state=42)

In [None]:
# random_state is the scikit-learn style name for the seed in XGBRegressor
reg = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

In [None]:
reg.fit(train_x, train_y)

In [None]:
y_pred = reg.predict(test_x)

In [None]:
mse = mean_squared_error(test_y, y_pred)
print("Mean Squared Error:", mse)
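Because `test_y` has six columns, the single number printed above is the unweighted mean of the six per-target errors, so a category with large spending can dominate it. The plain-Python sketch below spells out that arithmetic on made-up toy values; `per_target_mse` is a hypothetical helper. In the notebook itself, passing `multioutput='raw_values'` to `mean_squared_error` should return the same per-column breakdown.

```python
def per_target_mse(y_true_rows, y_pred_rows):
    """Hypothetical helper: mean squared error computed separately for each column."""
    n_rows = len(y_true_rows)
    n_targets = len(y_true_rows[0])
    totals = [0.0] * n_targets
    for true_row, pred_row in zip(y_true_rows, y_pred_rows):
        for j in range(n_targets):
            totals[j] += (true_row[j] - pred_row[j]) ** 2
    return [total / n_rows for total in totals]


# Toy values (not from the dataset): two samples, two targets on different scales.
y_true = [[1.0, 10.0], [3.0, 30.0]]
y_pred = [[2.0, 10.0], [3.0, 20.0]]

mse_by_target = per_target_mse(y_true, y_pred)
overall_mse = sum(mse_by_target) / len(mse_by_target)
print("per-target MSE:", mse_by_target)
print("overall MSE:", overall_mse)
```

Here the second target contributes almost all of the overall error, which is why reporting the per-target values alongside the average gives a clearer picture.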

In [None]:
xgb.plot_importance(reg)
plt.show()

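In the feature importance plot, 'Income' has the highest score. By default, `plot_importance` ranks features by 'weight', the number of times a feature is used to split across all trees, so this suggests that 'Income' is the most influential predictor of the spending targets.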

In [None]:
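# figsize is in inches; (500, 500) would produce an impractically large image
fig, ax = plt.subplots(figsize=(30, 30))

# num_trees=2 selects the tree at index 2 (zero-based); rendering requires graphviz
xgb.plot_tree(reg, num_trees=2, ax=ax)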
plt.tight_layout()
plt.show()