# Machine Learning: Linear Regression

## Black Friday Sales Prediction

We will use a dataset of product purchases made on Black Friday in the US. The goal is to build a model that predicts the `purchase amount`.

To build a good predictor, we will apply the concepts covered so far:

* `Exploration`
* `Feature Engineering`
* `Modeling`
* `Evaluation`

The dataset is a sample of the transactions made in a retail store. The store wants to better understand customer `purchase` behaviour across different products. This is a `regression problem`: we predict the dependent variable (the purchase amount) from the information contained in the other variables.

### You can try different Scikit-Learn models from [Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
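
As a minimal sketch of what trying different linear models can look like, the snippet below fits `LinearRegression`, `Ridge`, and `Lasso` behind the same preprocessing step and compares their test RMSE. The toy data frame, its column names (`city`, `years_in_city`, `purchase`), and the `alpha` values are assumptions made for illustration; none of them come from the Black Friday data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns, NOT the real dataset)
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "city": rng.choice(["A", "B", "C"], size=n),
    "years_in_city": rng.integers(0, 5, size=n),
})
city_effect = toy["city"].map({"A": 0.0, "B": 500.0, "C": 1000.0})
toy["purchase"] = (
    5000 + city_effect + 200 * toy["years_in_city"] + rng.normal(0, 300, size=n)
)

X = toy.drop(columns="purchase")
y = toy["purchase"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One-hot encode the categorical column, standardise the numeric one
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("num", StandardScaler(), ["years_in_city"]),
])

# Candidate linear models; the alpha values are arbitrary starting points
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=1.0),
}

rmse_by_model = {}
for name, model in models.items():
    pipe = Pipeline([("preprocess", preprocess), ("model", model)])
    pipe.fit(X_train, y_train)
    predictions = pipe.predict(X_test)
    rmse_by_model[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:.1f}")
```

Keeping the preprocessing inside the pipeline means each model sees exactly the same transformed features, so differences in RMSE come from the model rather than from the data preparation.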

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("https://raw.githubusercontent.com/anyoneai/notebooks/main/datasets/BlackFriday.csv")
data.sample(5)

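| Unnamed: 0 | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21489 | 1003391 | P00000542 | M | 18-25 | 4 | A | 0 | 0 | 5 | NaN | NaN | 3656 |
| 230191 | 1005509 | P00105542 | M | 26-35 | 0 | A | 1 | 0 | 8 | NaN | NaN | 5959 |
| 152526 | 1005573 | P00156742 | F | 36-45 | 1 | B | 1 | 1 | 5 | 6.0 | 8.0 | 3523 |
| 293001 | 1003207 | P00028242 | M | 36-45 | 7 | A | 0 | 0 | 6 | 8.0 | NaN | 11984 |
| 257453 | 1003691 | P00000142 | M | 26-35 | 7 | B | 1 | 1 | 3 | 4.0 | 5.0 | 13541 |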


In [None]:
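# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (same file as the URL used above; swap in a local path
# such as 'black_friday.csv' if you have downloaded a copy)
black_friday = pd.read_csv("https://raw.githubusercontent.com/anyoneai/notebooks/main/datasets/BlackFriday.csv")

# Explore the data
print(black_friday.head())
black_friday.info()  # info() prints its own summary and returns None
print(black_friday.describe())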
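
Beyond `head`, `info`, and `describe`, two quick checks are often useful at this stage: the share of missing values per column and the mean of the target per category. The sketch below runs both on a small frame built from the five sample rows shown earlier (a subset of their columns). Five rows are far too few to draw conclusions, so treat this only as an illustration of the calls; the variable names `sample`, `missing_share`, and `mean_by_city` are made up for the example.

```python
import numpy as np
import pandas as pd

# Five rows copied from the sample output shown earlier (subset of columns)
sample = pd.DataFrame({
    "Gender": ["M", "M", "F", "M", "M"],
    "Age": ["18-25", "26-35", "36-45", "36-45", "26-35"],
    "City_Category": ["A", "A", "B", "A", "B"],
    "Product_Category_1": [5, 8, 5, 6, 3],
    "Product_Category_2": [np.nan, np.nan, 6.0, 8.0, 4.0],
    "Product_Category_3": [np.nan, np.nan, 8.0, np.nan, 5.0],
    "Purchase": [3656, 5959, 3523, 11984, 13541],
})

# Fraction of missing values per column, largest first
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)

# Mean purchase amount per city category
mean_by_city = sample.groupby("City_Category")["Purchase"].mean()
print(mean_by_city)
```

On the full dataset the same two calls would be applied to `black_friday` instead of `sample`.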

In [None]:
# Create a new feature Age_Range_Num holding the lower bound of each age range.
# Extracting the leading digits also handles open-ended ranges such as '55+',
# where int(x.split('-')[0]) would raise a ValueError.
black_friday['Age_Range_Num'] = black_friday['Age'].str.extract(r'^(\d+)', expand=False).astype(int)

# Fill missing values in Product_Category_2 and Product_Category_3 with 0
# (assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas)
black_friday['Product_Category_2'] = black_friday['Product_Category_2'].fillna(0)
black_friday['Product_Category_3'] = black_friday['Product_Category_3'].fillna(0)

# Combine the product categories into a single feature
black_friday['Product_Categories'] = (
    black_friday['Product_Category_1'].astype(str) + ','
    + black_friday['Product_Category_2'].astype(str) + ','
    + black_friday['Product_Category_3'].astype(str)
)

# Drop the columns that are no longer needed
black_friday.drop(['User_ID', 'Product_ID', 'Age'], axis=1, inplace=True)
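
Feature-engineering steps like these are easy to sanity-check on a handful of values before running them on the full frame. The sketch below does that for the age lower bound and the combined category key. The toy values are assumptions: in particular, the open-ended `55+` bracket does not appear in the sample rows shown earlier and is included only to exercise that case.

```python
import numpy as np
import pandas as pd

# Toy values; the "55+" bracket is an assumption used to test the open-ended case
toy = pd.DataFrame({
    "Age": ["0-17", "26-35", "55+"],
    "Product_Category_1": [5, 8, 3],
    "Product_Category_2": [np.nan, 6.0, 4.0],
    "Product_Category_3": [np.nan, 8.0, np.nan],
})

# Lower bound of each bracket: take the leading run of digits
toy["Age_Range_Num"] = toy["Age"].str.extract(r"^(\d+)", expand=False).astype(int)

# Treat a missing secondary category as 0, then build the combined key
category_cols = ["Product_Category_1", "Product_Category_2", "Product_Category_3"]
filled = toy[category_cols].fillna(0).astype(int)
toy["Product_Categories"] = filled.astype(str).agg(",".join, axis=1)

print(toy[["Age_Range_Num", "Product_Categories"]])
```

Casting to `int` before joining is a cosmetic choice in this sketch: it keeps the key short (`5,0,0` rather than `5,0.0,0.0`) without changing which rows share a key.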

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Split the dataset into training and testing sets
X = black_friday.drop('Purchase', axis=1)
y = black_friday['Purchase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the categorical and numerical columns
cat_cols = ['Gender', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital