<a href="https://colab.research.google.com/github/ab-sa/Statistical-Machine-Learning3/blob/main/Lecture8_GAM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RepeatedKFold, cross_val_score
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.pipeline import Pipeline
# Uncomment the next line to install pygam if it is missing:
#!pip install pygam
from pygam import GAM, s, te

In [None]:
Credit = pd.read_csv('Credit.csv')
print('Dimension of the data: ' + str(Credit.shape))
Credit.head()

Generalised Additive Model (GAM):
pyGAM (https://github.com/dswah/pyGAM) is a Python library for fitting GAMs, but it is not yet as comprehensive as the R gam package. We therefore call the R package from Python through rpy2 first, and then fit a similar model with pyGAM at the end.
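
For reference, a GAM replaces the linear terms of a regression with smooth functions of the individual predictors. In generic notation (the symbols are standard but are an assumption here, since the notebook does not define them), the Gaussian model for a response $y$ with predictors $x_1, \dots, x_p$ can be written as:

```latex
y = \beta_0 + \sum_{j=1}^{p} f_j(x_j) + \varepsilon
```

Each $f_j$ is a smooth function estimated from the data. Categorical predictors enter as ordinary linear (dummy-coded) terms rather than smooths.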

To fit a GAM:

- Import the R gam library (through rpy2's importr)
- Build a formula, wrapping each variable we want a smooth for in s()
- Call gam(formula, family=...), where family names the probability distribution assumed for the response variable (Gaussian by default).
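
The formula step above can be sketched in plain Python. The helper name `build_gam_formula` is hypothetical (it is not part of rpy2 or the R gam package); it only assembles the formula string, wrapping each smoothed variable in s() and leaving the categorical terms as they are.

```python
def build_gam_formula(response, smooth_terms, linear_terms=()):
    """Return an R-style formula string, wrapping each smooth term in s().

    Hypothetical helper for illustration; it only builds a string.
    """
    smooth_part = [f"s({name})" for name in smooth_terms]
    right_hand_side = " + ".join(smooth_part + list(linear_terms))
    return f"{response} ~ {right_hand_side}"


# Column names taken from the Credit example used in this notebook
formula = build_gam_formula(
    "Balance",
    smooth_terms=["Age", "Income", "Limit", "Rating", "Cards", "Education"],
    linear_terms=["Gender_str", "Student_str", "Married_str", "Ethnicity_str"],
)
print(formula)
```

The resulting string can be passed to robjects.Formula(...) in place of a hand-typed formula, which makes it easier to try different sets of smoothed variables.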

A good guide to fitting different smoothers, as well as GAMs, to a dataset in Python:
https://harvard-iacs.github.io/2019-CS109B/labs/lab2/solutions/

Here is an example of how to fit a GAM to the Credit data:

In [None]:
# Keep string-typed copies of the categorical columns; they become R factors below
for col in ['Gender', 'Student', 'Married', 'Ethnicity']:
    Credit[col + '_str'] = Credit[col].astype('string')

display(Credit.dtypes)

In [None]:
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
# if you need to install gam package first:
#utils = importr('utils')
#utils.install_packages('gam')

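# Keep the raw columns: recent pandas versions also dummy-encode 'string' columns in
# get_dummies, which would remove the *_str columns used below. R encodes factors itself.
X = Credit.drop(['Balance', 'ID'], axis=1)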
y = Credit['Balance']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
r_gam_lib = importr('gam')
r_gam = r_gam_lib.gam

# Pass 1-D arrays (single brackets) so each element converts cleanly to an R numeric
r_Age = robjects.FloatVector(x_train["Age"].values)
r_Income = robjects.FloatVector(x_train["Income"].values)
r_Limit = robjects.FloatVector(x_train["Limit"].values)
r_Rating = robjects.FloatVector(x_train["Rating"].values)
r_Cards = robjects.FloatVector(x_train["Cards"].values)
r_Education = robjects.FloatVector(x_train["Education"].values)
r_Balance = robjects.FloatVector(y_train.values)
r_Gender = robjects.FactorVector(x_train["Gender_str"].values)
r_Student = robjects.FactorVector(x_train["Student_str"].values)
r_Married = robjects.FactorVector(x_train["Married_str"].values)
r_Ethnicity = robjects.FactorVector(x_train["Ethnicity_str"].values)

r_fmla = robjects.Formula("Balance ~ s(Age) + s(Income) + s(Limit) + s(Rating) + s(Cards) + s(Education) + Gender_str + Student_str + Married_str + Ethnicity_str")
r_fmla.environment['Balance'] = r_Balance
r_fmla.environment['Age'] = r_Age
r_fmla.environment['Limit'] = r_Limit
r_fmla.environment['Income'] = r_Income
r_fmla.environment['Rating'] = r_Rating
r_fmla.environment['Cards'] = r_Cards
r_fmla.environment['Education'] = r_Education
r_fmla.environment['Gender_str'] = r_Gender
r_fmla.environment['Student_str'] = r_Student
r_fmla.environment['Married_str'] = r_Married
r_fmla.environment['Ethnicity_str'] = r_Ethnicity

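# No family is given, so the R default (gaussian) is used
Balance_gam = r_gam(r_fmla)
print(Balance_gam.names)
print(Balance_gam.rx2('coefficients'))
print(Balance_gam.rx2('aic'))
# sMSE: mean squared error of the residuals on the training sample
print('sMSE: ', np.mean(np.square(np.array(Balance_gam.rx2('residuals')))))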

Prediction for GAM:

In [None]:
# Build the R test set with the same column names that appear in the formula
r_testSet = robjects.DataFrame({'Age': robjects.FloatVector(x_test["Age"].values),
                               'Limit': robjects.FloatVector(x_test["Limit"].values),
                               'Income': robjects.FloatVector(x_test["Income"].values),
                               'Rating': robjects.FloatVector(x_test["Rating"].values),
                               'Cards': robjects.FloatVector(x_test["Cards"].values),
                               'Education': robjects.FloatVector(x_test["Education"].values),
                               'Gender_str': robjects.FactorVector(x_test["Gender_str"].values),
                               'Student_str': robjects.FactorVector(x_test["Student_str"].values),
                               'Married_str': robjects.FactorVector(x_test["Married_str"].values),
                               'Ethnicity_str': robjects.FactorVector(x_test["Ethnicity_str"].values)})
r_predict = robjects.r['predict']
gam_preds = np.array(r_predict(Balance_gam, r_testSet))
# MSPE: mean squared prediction error on the held-out test set
print('MSPE: ', np.mean((gam_preds - y_test.values) ** 2))
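
Both error figures printed in this notebook are mean squared errors: sMSE is computed from the residuals on the training data, and MSPE from predictions on the held-out test set. The sketch below spells out the calculation in plain Python. The helper name and the toy numbers are assumptions made up for illustration; they are not taken from the Credit data.

```python
def mean_squared_error_manual(observed, predicted):
    """Average of the squared differences between observed and predicted values."""
    observed = list(observed)
    predicted = list(predicted)
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("need at least one observation")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Toy values for illustration only (not from the Credit data)
y_obs = [100.0, 250.0, 400.0]
y_hat = [110.0, 240.0, 430.0]
toy_mse = mean_squared_error_manual(y_obs, y_hat)
print(toy_mse)  # (100 + 100 + 900) / 3
```

Applied to training data this gives sMSE, and applied to test data it gives MSPE. The `mean_squared_error` function from sklearn.metrics, imported at the top of the notebook, performs the same calculation.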

In [None]:
%load_ext rpy2.ipython
%R -i Balance_gam plot(Balance_gam, residuals=TRUE,se=TRUE, scale=20);

Fit a similar GAM with pyGAM (simpler):

In [None]:
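# All five predictors are numeric, so no dummy encoding is needed
X = Credit[['Age', 'Income', 'Rating', 'Limit', 'Cards']]
y = Credit['Balance']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# s(i) puts a spline smooth on column i of X; GAM() defaults to a normal distribution
gam = GAM(s(0) + s(1) + s(2) + s(3) + s(4)).fit(x_train, y_train)
# Try 100 candidate smoothing penalties (lam) between 0 and 10
lams = np.linspace(0, 10, 100)
gam.gridsearch(X=x_train, y=y_train, lam=lams, return_scores=True)
#gam.summary()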