# Machine Learning: Linear Regression

## Black Friday Sales Prediction:

We will use a dataset of product purchases made during Black Friday in the US. The goal is to build a model that predicts the `purchase amount`.

To build a good predictor, we apply the concepts covered so far:

* `Exploration`
* `Feature Engineering`
* `Modeling`
* `Evaluation`

The dataset is a sample of the transactions made in a retail store. The store wants to better understand customer `purchase` behaviour across different products. This is a `regression problem`: we try to predict the dependent variable (the purchase amount) from the information contained in the other variables.

### You can try different Scikit-Learn models from [Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
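As a rough illustration of what trying several linear models could look like, the sketch below fits three estimators from `sklearn.linear_model` on synthetic stand-in data. The data, the `alpha` values, and the names `X_demo`, `y_demo`, and `candidate_models` are assumptions made for this example, not part of the Black Friday dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 5 features, two of them irrelevant.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 5))
true_coef = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The alpha values are arbitrary starting points, not tuned settings.
candidate_models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}

# Fit each model on the same split and compare R-squared on the test part.
demo_scores = {}
for name, model in candidate_models.items():
    model.fit(X_tr, y_tr)
    demo_scores[name] = r2_score(y_te, model.predict(X_te))
    print(f"{name}: R-squared = {demo_scores[name]:.3f}")
```

The same loop should work with the real `X_train` and `y_train` once the features are fully numeric.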

# Load the dataset

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, RobustScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv("https://raw.githubusercontent.com/anyoneai/notebooks/main/datasets/BlackFriday.csv")
data.sample(5)

# Explore the dataset

In [None]:
print(data.shape)
print(list(data.columns))

In [None]:
data.describe()

In [None]:
# Explore the data
data.info()

In [None]:
data.isnull().sum()

In [None]:
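# Correlation is only defined for numeric columns, so text columns are excluded.
corr = data.corr(numeric_only=True)
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm')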

In [None]:
sns.histplot(data['Purchase'], kde=True)
plt.xlabel('Purchase amount')
plt.ylabel('Frequency')
plt.show()

# Feature engineering

In [None]:
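# Count distinct values per text column. Product_ID is an identifier that is
# dropped before modeling, so it is excluded from encoding.
text_columns = data.select_dtypes(include=["object"]).columns.difference(["Product_ID"])
object_features = data[text_columns].nunique()
binary_features = object_features[object_features == 2].index
non_binary_features = object_features[object_features != 2].index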

In [None]:
object_features

## Encode features:

In [None]:
# We transform binary features:
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(data[binary_features])
bin_encoded = ordinal_encoder.transform(data[binary_features])

In [None]:
bin_encoded_df = pd.DataFrame(bin_encoded, columns=binary_features)
bin_encoded_df

In [None]:
# Now, we transform non-binary features (fit_transform fits and encodes in one step):
one_hot_encoder = OneHotEncoder()
nonbin_encoded = one_hot_encoder.fit_transform(data[non_binary_features])

In [None]:
# Convert encoded data to a DataFrame
nonbin_encoded_df = pd.DataFrame(nonbin_encoded.toarray(), columns=one_hot_encoder.get_feature_names_out(non_binary_features))
nonbin_encoded_df


In [None]:
# Alternative (not used here): encode with pandas get_dummies in a single call:
# data = pd.get_dummies(data, columns=['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years'], drop_first=True)

# Handle missing values

In [None]:
# Create an instance of SimpleImputer to complete missing values:
simple_imputer = SimpleImputer()

# Replace NaN values in Product_Category_2 and Product_Category_3 columns with the column means:
data[['Product_Category_2', 'Product_Category_3']] = simple_imputer.fit_transform(data[['Product_Category_2', 'Product_Category_3']])
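A small sketch of what mean imputation does, using a made-up two-column frame (the names `toy_missing` and `imputed` are hypothetical). Since the product category columns hold category codes, `strategy="most_frequent"` might also be worth comparing, although that is a suggestion and not something the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (hypothetical values).
toy_missing = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# strategy="mean" is the default: each NaN becomes its column mean.
imputed = SimpleImputer(strategy="mean").fit_transform(toy_missing)
print(imputed)
```

Column `a` has mean 2.0 and column `b` has mean 6.0, so those are the values written into the gaps.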

# Scale and normalize

In [None]:
# Scale the target 'Purchase'. The RMSE reported later is therefore in scaled units, not purchase amounts:
robust_scaler = RobustScaler()
data["Purchase"] = robust_scaler.fit_transform(data[["Purchase"]])
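`RobustScaler` subtracts the median and divides by the interquartile range, which keeps outliers from dominating the scale. The sketch below checks that on made-up values (the array is hypothetical, not taken from the dataset) and shows that `inverse_transform` maps scaled values back to the original units.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical values with one large outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()
scaled = scaler.fit_transform(values)

# Manual version: (x - median) / (Q3 - Q1).
iqr = np.percentile(values, 75) - np.percentile(values, 25)
manual = (values - np.median(values)) / iqr

# inverse_transform undoes the scaling.
restored = scaler.inverse_transform(scaled)

print(scaled.ravel())
print(restored.ravel())
```

Applied to model predictions, `inverse_transform` would let the error be reported in purchase amounts instead of scaled units.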

In [None]:
# Concatenate the encoded columns with the remaining columns.
# All three frames share the default RangeIndex, so rows line up.
encoded_columns = binary_features.union(non_binary_features)
data = pd.concat(
    [
        bin_encoded_df,
        nonbin_encoded_df,
        data.drop(columns=encoded_columns),
    ],
    axis=1,
)

In [None]:
data.head()

In [None]:
# We drop unwanted columns. errors='ignore' skips Product_ID if the encoding step already removed it:
data.drop(columns=['User_ID', 'Product_ID'], errors='ignore', inplace=True)

In [None]:
# We split the dataset into training and testing sets:
X = data.drop(['Purchase'], axis=1)
y = data['Purchase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

# We create the model:
lr = LinearRegression()
lr.fit(X_train, y_train)

# Finally, we evaluate the model:
y_pred = lr.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test,y_pred)
print('Root Mean Squared Error:', rmse)
print('R-squared:', r2)
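One possible way to bundle the encoding, imputation, and model steps is a scikit-learn `Pipeline` with a `ColumnTransformer`, so that the preprocessing is fitted on the training split only. This is a sketch under assumptions: the toy frame below is randomly generated, only the column names are borrowed from the dataset, and the names `toy`, `preprocess`, and `pipeline` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Randomly generated toy data (hypothetical); only the column names mirror the dataset.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Gender": rng.choice(["F", "M"], size=n),
    "City_Category": rng.choice(["A", "B", "C"], size=n),
    "Product_Category_2": rng.choice([2.0, 8.0, np.nan], size=n),
})
toy["Purchase"] = (
    5000
    + 1000 * (toy["Gender"] == "M")
    + 500 * (toy["City_Category"] == "C")
    + rng.normal(scale=100, size=n)
)

# Encode text columns and impute the numeric column in one transformer.
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Gender", "City_Category"]),
    ("numeric", SimpleImputer(strategy="mean"), ["Product_Category_2"]),
])
pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

X_toy = toy.drop(columns="Purchase")
y_toy = toy["Purchase"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# fit() learns the encoder categories and imputation means from the training rows only.
pipeline.fit(X_tr, y_tr)
pipeline_r2 = pipeline.score(X_te, y_te)
print("R-squared on the toy test split:", round(pipeline_r2, 3))
```

Because the transformers live inside the pipeline, `pipeline.predict` applies the same fitted preprocessing to any new rows.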