# Machine Learning: Linear Regression

## Black Friday Sales Prediction:

We will use a dataset of product purchases made during Black Friday in the US. The goal is to build a model that predicts the `purchase amount`.

To build a good predictor, we will apply the concepts we have been learning:

* `Exploration`
* `Feature Engineering`
* `Modeling`
* `Evaluation`

The dataset is a sample of the transactions made in a retail store. The store wants to better understand customer `purchase` behaviour across different products. This is a `regression problem`: we predict the dependent variable (the purchase amount) from the information contained in the other variables.

### You can try different Scikit-Learn models from [Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)

# Load the dataset

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, RobustScaler
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv("https://raw.githubusercontent.com/anyoneai/notebooks/main/datasets/BlackFriday.csv")
data.sample(5)

# Explore the dataset

In [None]:
print(data.shape)
print(list(data.columns))

In [None]:
data.describe()

In [None]:
# Explore the data
data.info()

In [None]:
data.isnull().sum()
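
Raw counts of missing values are easier to judge as a share of all rows. A minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not the Black Friday data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    "Product_Category_1": [1, 5, 8, 3],
    "Product_Category_2": [2.0, np.nan, 6.0, np.nan],
    "Product_Category_3": [np.nan, np.nan, np.nan, 14.0],
})

# isnull().mean() is the fraction of missing values per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Columns with a very high share of missing values may be better dropped than imputed; where to draw that line is a judgment call.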

In [None]:
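# Correlations are only defined for numeric columns, so skip the object-typed ones
# (recent pandas versions raise an error otherwise)
corr = data.corr(numeric_only=True)
# Hide the upper triangle, which mirrors the lower one
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm')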

In [None]:
sns.histplot(data['Purchase'], kde=True)
plt.xlabel('Purchase amount')
plt.ylabel('Frequency')
plt.show()

# Feature engineering

In [None]:
object_features = data.select_dtypes(include=["object"]).nunique()
binary_features = object_features[object_features == 2].index
non_binary_features = object_features[object_features != 2].index

In [None]:
object_features

## Encode features:

In [None]:
# Dummies transformation:
## data = pd.get_dummies(data, columns=['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years'], drop_first=True)

In [None]:
# Encode Gender as a single 0/1 column (get_dummies drops the first category)
data["Gender"] = pd.get_dummies(data["Gender"], drop_first=True).iloc[:, 0].astype(int)

In [None]:
# Encode Age using one-hot encoding
age_encoder = OneHotEncoder(categories=[['0-17', '18-25', '26-35', '36-45', '46-50', '51-55', '55+']])
age_encoded = age_encoder.fit_transform(data["Age"].values.reshape(-1, 1))
data[["Age_"+cat for cat in age_encoder.categories_[0]]] = pd.DataFrame(age_encoded.toarray(), index=data.index).astype(int)

In [None]:
# Encode City_Category using one-hot encoding
city_encoder = OneHotEncoder(categories=[['A', 'B', 'C']])
city_encoded = city_encoder.fit_transform(data["City_Category"].values.reshape(-1, 1))
data[["City_"+cat for cat in city_encoder.categories_[0]]] = pd.DataFrame(city_encoded.toarray(), index=data.index).astype(int)
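
One caveat when one-hot columns feed an unregularized `LinearRegression`: if every category is kept, the columns sum to 1 in each row, which is perfectly collinear with the intercept. A minimal sketch of dropping the first category, using a made-up `cities` frame rather than the real dataset:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of the City_Category column
cities = pd.DataFrame({"City_Category": ["A", "B", "C", "A", "C"]})

# Full encoding: one column per category
full_encoder = OneHotEncoder(categories=[["A", "B", "C"]])
full = full_encoder.fit_transform(cities[["City_Category"]]).toarray()

# Reduced encoding: the first category ("A") becomes the all-zeros baseline
reduced_encoder = OneHotEncoder(categories=[["A", "B", "C"]], drop="first")
reduced = reduced_encoder.fit_transform(cities[["City_Category"]]).toarray()

print(full.sum(axis=1))   # every row of the full encoding sums to 1
print(reduced.shape)      # one column fewer than the full encoding
```

With Ridge the penalty keeps the fit well defined either way, so this mostly matters for interpreting the plain linear model's coefficients.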

In [None]:
# Encode Stay_In_Current_City_Years using label encoding
stay_encoder = LabelEncoder()
data["Stay_In_Current_City_Years"] = stay_encoder.fit_transform(data["Stay_In_Current_City_Years"]).astype(int)
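
`LabelEncoder` assigns codes in the sorted order of the labels, which happens to match the natural order here. A sketch of an explicit mapping that makes the intended order visible, assuming the column holds the strings `'0'` to `'4+'` (an assumption about the data; `stay` and `stay_order` are hypothetical names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of Stay_In_Current_City_Years values
stay = pd.Series(["2", "4+", "0", "1", "3", "4+"])

# LabelEncoder sorts the labels and numbers them from 0
encoded = LabelEncoder().fit_transform(stay)

# Explicit ordinal mapping: the order is stated rather than implied by sorting
stay_order = {"0": 0, "1": 1, "2": 2, "3": 3, "4+": 4}
mapped = stay.map(stay_order)

print(list(encoded))
print(list(mapped))
```

An explicit mapping is safer for ordinal columns whose labels do not sort alphabetically in their natural order.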

In [None]:
data

# Handle missing values

In [None]:
# Create an instance of KNNImputer to complete missing values (default n_neighbors=5):
knn_imputer = KNNImputer()

# Replace NaN values in Product_Category_2 and Product_Category_3 with the mean of
# that column over the nearest neighbours (not the global column mean):
data[['Product_Category_2', 'Product_Category_3']] = knn_imputer.fit_transform(data[['Product_Category_2', 'Product_Category_3']])
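
To see how this differs from mean imputation, here is a minimal sketch on a made-up 4x2 array (`toy` is hypothetical; `SimpleImputer` is brought in only for comparison and is not used in this notebook):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data with one missing value in the first column
toy = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [np.nan, 6.0],
    [8.0, 8.0],
])

# KNN: average the first column over the 2 rows closest in the observed column
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy)

# Mean: average the first column over all rows where it is observed
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)

print(knn_filled[2, 0])
print(mean_filled[2, 0])
```

The two strategies give different fills because the KNN version only looks at the rows whose second column is closest to 6.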

# Scale and normalize

In [None]:
# Optional: scale the 'Purchase' target (left disabled):
#robust_scaler = RobustScaler()
#data["Purchase"] = robust_scaler.fit_transform(data[["Purchase"]])

# Modeling

In [None]:
# We drop unwanted columns:
data.drop(['User_ID', 'Product_ID', 'Age', 'City_Category'], axis=1, inplace=True)

In [None]:
# We split the dataset into training and testing sets:
X = data.drop(['Purchase'], axis=1)
y = data['Purchase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

# We create the model:
lr = LinearRegression()
lr.fit(X_train, y_train)

# We make predictions:
y_pred_lr = lr.predict(X_test)

# Finally, we evaluate the model:
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)
print("Linear Regression:")
print("Mean Squared Error:", mse_lr)
print("R^2 Score:", r2_lr)
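
MSE is expressed in squared purchase units, so it is hard to read directly. A minimal sketch of two complementary checks, the root mean squared error and a predict-the-mean baseline, on made-up numbers (`y_true` and `y_hat` are hypothetical, not model output):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted purchase amounts
y_true = np.array([8000.0, 12000.0, 5000.0, 15000.0])
y_hat = np.array([9000.0, 11000.0, 7000.0, 13000.0])

# RMSE is in the same units as the target
mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)

# Baseline: always predict the mean of the target; its R^2 is 0 by definition
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)

print("RMSE:", rmse)
print("Baseline R^2:", r2_baseline)
```

An R^2 close to 0 on the real data would mean the model is barely better than predicting the average purchase.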

In [None]:
# Create Ridge regression model with alpha=0.1 (the scikit-learn default is alpha=1.0)
ridge_reg = Ridge(alpha=0.1)

# Fit the model on the training data
ridge_reg.fit(X_train, y_train)

# Predict purchase amounts for the test set
y_pred_ridge = ridge_reg.predict(X_test)

# Calculate the model's R-squared score and MSE on the test set
r2_ridge = r2_score(y_test, y_pred_ridge)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)

print("Ridge Model:")
print("R-squared: {:.4f}".format(r2_ridge))
print("MSE: {:.4f}".format(mse_ridge))

In [None]:
# Create Ridge regression model with cross-validation to select alpha
ridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])

# Fit the model on the training data
ridge_cv.fit(X_train, y_train)

# Predict purchase amounts for the test set
y_pred_ridgecv = ridge_cv.predict(X_test)

# Calculate the model's R-squared score on the test set
r2_ridge_cv = ridge_cv.score(X_test, y_test)

# Get the best value of alpha selected by cross-validation
alpha = ridge_cv.alpha_

# Calculate and print the MSE
mse_ridge_cv = mean_squared_error(y_test, y_pred_ridgecv)

# Print the R-squared score and best alpha value
print("R-squared score:", r2_ridge_cv)
print("Best alpha:", alpha)
print("MSE:", mse_ridge_cv)



In [None]:
# Get the coefficients learned by the Ridge regression model
coefficients = ridge_reg.coef_

# Create a DataFrame to display the coefficients for each feature
coef_df = pd.DataFrame({'Feature': X_train.columns, 'Coefficient': coefficients})
coef_df.sort_values(by='Coefficient', ascending=False, inplace=True)
print(coef_df)

In [None]:
# Create Ridge regression model with different alpha values
alphas = [0.01, 0.1, 1, 10, 100]
for alpha in alphas:
    ridge_reg = Ridge(alpha=alpha)
    ridge_reg.fit(X_train, y_train)
    y_pred_ridge = ridge_reg.predict(X_test)
    r2_ridge = r2_score(y_test, y_pred_ridge)
    mse_ridge = mean_squared_error(y_test, y_pred_ridge)
    print("Ridge Model (alpha = {}):".format(alpha))
    print("R-squared: {:.4f}".format(r2_ridge))
    print("MSE: {:.4f}".format(mse_ridge))

# Create Ridge regression model with cross-validation to select alpha
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100])
ridge_cv.fit(X_train, y_train)
y_pred_ridgecv = ridge_cv.predict(X_test)
r2_ridge_cv = ridge_cv.score(X_test, y_test)
alpha = ridge_cv.alpha_
mse_ridge_cv = mean_squared_error(y_test, y_pred_ridgecv)

print("Ridge Model with Cross-Validation:")
print("Best alpha:", alpha)
print("R-squared: {:.4f}".format(r2_ridge_cv))
print("MSE: {:.4f}".format(mse_ridge_cv))
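
The Ridge penalty acts on the size of the coefficients, so it depends on the scale of each feature. A minimal sketch of standardizing inside a pipeline before `RidgeCV`, on synthetic data from `make_regression` (the data and the names `X_demo`, `model`, `best_alpha` are hypothetical and not part of this notebook):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data; one feature is put on a much larger scale
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# The scaler is fit on the training split only, then applied to the test split
alphas = [0.01, 0.1, 1, 10, 100]
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
model.fit(X_tr, y_tr)

# make_pipeline names each step after its lowercased class name
best_alpha = model.named_steps["ridgecv"].alpha_
print("Best alpha:", best_alpha)
print("Test R-squared:", model.score(X_te, y_te))
```

`RidgeCV` picks alpha using the training data alone (by default with an efficient leave-one-out scheme), so the test set stays untouched for the final evaluation, unlike the manual loop above, which compares alphas on the test set.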