## Problem Statement

A retail company wants to understand customer purchase behaviour (specifically, purchase amount) across products in different categories. It has shared a purchase summary for various customers covering selected high-volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category), and the total purchase_amount from last month.

The company now wants to build a model that predicts a customer's purchase amount for a given product, which will help it create personalized offers for customers across different products.

## Data Acquisition

- User_ID: User ID
- Product_ID: Product ID
- Gender: Sex of user
- Age: Age in bins
- Occupation: Occupation (masked)
- City_Category: Category of the city (A, B, C)
- Stay_In_Current_City_Years: Number of years of stay in the current city
- Marital_Status: Marital status
- Product_Category_1: Product category (masked)
- Product_Category_2: Second category the product may also belong to (masked)
- Product_Category_3: Third category the product may also belong to (masked)
- Purchase: Purchase amount (target variable)

## Importing Libraries and Loading Data

In [None]:
#Loading Packages
import pandas as pd 
import numpy as np 
from sklearn.preprocessing import LabelEncoder

import seaborn as sns                 
import matplotlib.pyplot as plt       
%matplotlib inline 
import plotly.express as px

import warnings  
warnings.filterwarnings("ignore")

In [None]:
data1 = pd.read_csv('user_demographics.csv')
data2 = pd.read_csv('User_product_purchase_details_p2.csv')
data = data1.merge(data2, on='User_ID')
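
Before relying on the merged frame, it can help to confirm that the join behaves as expected. The sketch below uses small made-up frames (the values are illustrative, not taken from the real CSV files) and assumes there is one demographics row per user. `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are made up for illustration)
demo = pd.DataFrame({"User_ID": [1, 2, 3], "Gender": ["F", "M", "M"]})
purchases = pd.DataFrame({
    "User_ID": [1, 1, 2, 4],
    "Product_ID": ["P001", "P002", "P001", "P003"],
    "Purchase": [100, 250, 80, 40],
})

# validate="one_to_many" raises if User_ID is not unique in the left frame
merged = demo.merge(purchases, on="User_ID", how="inner", validate="one_to_many")

# An outer merge with indicator=True shows which rows failed to match
check = demo.merge(purchases, on="User_ID", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]

print(merged.shape)
print(unmatched[["User_ID", "_merge"]])
```

If `unmatched` is non-empty on the real data, some users or purchases are silently dropped by the default inner join.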

## Descriptive Analysis

In [None]:
data.head(5)

In [None]:
#Checking The Dimension Of the DataSet
data.shape

In [None]:
#Columns Present in the DataSet
data.columns

#### Unique Elements in Each Attribute

In [None]:
data.nunique()

#### Frequency & Relative Frequency for Each Column

In [None]:
data['Gender'].value_counts()

In [None]:
data['Gender'].value_counts(normalize=True)

In [None]:
data['Age'].value_counts()

In [None]:
data['Age'].value_counts(normalize=True)

In [None]:
data['Occupation'].value_counts()

In [None]:
data['Occupation'].value_counts(normalize=True)

In [None]:
data['City_Category'].value_counts()

In [None]:
data['City_Category'].value_counts(normalize=True)

In [None]:
data['Stay_In_Current_City_Years'].value_counts()

In [None]:
data['Stay_In_Current_City_Years'].value_counts(normalize=True)

In [None]:
data['Marital_Status'].value_counts()

In [None]:
data['Marital_Status'].value_counts(normalize=True)

In [None]:
data.info()

In [None]:
data.dtypes

#### Summary Of Numeric Data

In [None]:
data.describe()

#### Summary Of Object Data

In [None]:
data.describe(include="object")

#### Identifying The Duplicate Data

In [None]:
# Count fully duplicated rows
data.duplicated().sum()
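
As a minimal sketch on a made-up frame (not the real data), fully duplicated rows can be counted with `duplicated().sum()` and removed with `drop_duplicates()`:

```python
import pandas as pd

# Toy frame: the second row repeats the first exactly
toy = pd.DataFrame({"User_ID": [1, 1, 2, 2], "Purchase": [100, 100, 80, 90]})

n_dupes = toy.duplicated().sum()   # rows identical to an earlier row
deduped = toy.drop_duplicates()    # keeps the first occurrence of each row

print(n_dupes, len(deduped))
```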

#### Identifying the Missing Values in Each Column

In [None]:
data.isnull().sum()

In [None]:
# Total missing values
data.isnull().sum().sum()

#### Percentage Of Missing Value in Each Column

In [None]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data(data)

In [None]:
data.hist(figsize=(8,8))
plt.show()

## Exploratory Data Analysis(EDA)

In [None]:
# histplot replaces the deprecated distplot
sns.histplot(data["Purchase"], kde=True, color='b')
plt.title("Purchase Distribution")
plt.xlabel('Purchase Amount')
plt.ylabel('Number of Purchases')
plt.show()

We can observe that the same purchase amounts recur for many customers. This may be because on Black Friday many customers buy the same discounted products in large numbers. Overall, the distribution looks roughly Gaussian.

In [None]:
sns.boxplot(x=data["Purchase"])
plt.title("Boxplot of Purchase")
plt.show()

In [None]:
data["Purchase"].describe()

In [None]:
data["Purchase"].skew()

In [None]:
data["Purchase"].kurtosis()

The purchase distribution is right-skewed and shows multiple peaks, so a log transformation of the purchase amount could be applied.
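
A minimal sketch of such a log transformation, run on synthetic right-skewed amounts rather than the real `Purchase` column (the lognormal parameters are arbitrary assumptions). `np.log1p` is used here instead of a plain log so that a zero amount would not break it; `np.expm1` maps values back to the original scale.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed amounts (lognormal), NOT the real Purchase column
purchase = pd.Series(rng.lognormal(mean=9, sigma=0.6, size=5000), name="Purchase")

log_purchase = np.log1p(purchase)   # log(1 + x)

print("skew before:", purchase.skew())
print("skew after: ", log_purchase.skew())

# Values (or predictions) on the log scale are mapped back with expm1
restored = np.expm1(log_purchase)
```

If a model is trained on the transformed target, its predictions need the same `np.expm1` step before RMSE is compared with models trained on the raw amounts.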

In [None]:
fig=px.histogram(data,x='Purchase')
fig.show()

In [None]:
# Identify outliers with the IQR rule (fences at 3*IQR; nothing is removed here)
# IQR=Q3-Q1
Q1=np.quantile(data['Purchase'],0.25)
Q3=np.quantile(data['Purchase'],0.75)
IQR=Q3-Q1

UB=Q3+3*IQR
LB=Q1-3*IQR

print(Q1,Q3,LB,UB)

outliers=data[(data['Purchase']<=LB) | (data['Purchase']>=UB)]
print(len(outliers))
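
The fence computation above can be wrapped in a small helper. This is a sketch on a made-up column; `iqr_bounds` is a hypothetical name, and `k=1.5` is the conventional default while the cell above uses `k=3`.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data with one obviously extreme value
toy = pd.DataFrame({"Purchase": [10, 12, 11, 13, 12, 11, 10, 13, 12, 100]})

lb, ub = iqr_bounds(toy["Purchase"], k=3)
is_outlier = (toy["Purchase"] < lb) | (toy["Purchase"] > ub)
cleaned = toy[~is_outlier]          # rows inside the fences

print(lb, ub, is_outlier.sum(), len(cleaned))
```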

## Gender

In [None]:
sns.countplot(x ='Gender', data = data)
plt.ylabel('Number Of Purchase')

In [None]:
data.groupby("Gender")["Purchase"].mean().plot(kind='bar')
plt.title("Gender and Purchase Analysis")
plt.ylabel('Purchase Amount')
plt.show()

On average, male customers spend more per purchase than female customers, and the same trend appears when the total purchase value is summed.

## Occupation

In [None]:
sns.countplot(x ='Occupation', data = data)
plt.ylabel('Number Of Purchase')

In [None]:
data.groupby("Occupation")["Purchase"].mean().plot(kind='bar')
plt.title("Occupation and Purchase Analysis")
plt.ylabel('Purchase Amount')
plt.show()

Although there are some occupations which have higher representations, it seems that the amount each user spends on average is more or less the same for all occupations. Of course, in the end, occupations with the highest representations will have the highest amounts of purchases.
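
The distinction between representation, average spend and total spend can be checked in one table. The sketch below uses made-up numbers; the output column names are illustrative.

```python
import pandas as pd

# Toy data: occupation 0 is over-represented but has a similar mean
toy = pd.DataFrame({
    "Occupation": [0, 0, 0, 4, 4, 7],
    "Purchase": [100, 120, 110, 105, 115, 112],
})

summary = (
    toy.groupby("Occupation")["Purchase"]
       .agg(n_purchases="count", mean_purchase="mean", total_purchase="sum")
       .sort_values("total_purchase", ascending=False)
)
print(summary)
```

Here the means are nearly equal, so the ranking by total is driven almost entirely by the row counts.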

## Age

In [None]:
sns.countplot(x ='Age', data = data)
plt.ylabel('Number Of Purchase')

The 26-35 age group makes the largest number of purchases.

In [None]:
data.groupby("Age")["Purchase"].mean().plot(kind='bar')
plt.title("Age and Purchase Analysis")
plt.ylabel('Purchase Amount')
plt.show()

The mean purchase amount is about the same across age groups, except that the 51-55 age group has a slightly higher average.

## City_Category

In [None]:
sns.countplot(x ='City_Category', data = data)
plt.ylabel('Number Of Purchase')

City category B has the largest number of purchases.

In [None]:
data.groupby("City_Category")["Purchase"].mean().plot(kind='bar')
plt.title("City Category and Purchase Analysis")
plt.ylabel('Purchase Amount')
plt.show()

However, customers in city category C spend the most on average.

## Stay_In_Current_City_Years

In [None]:
sns.countplot(x ='Stay_In_Current_City_Years', data = data)

It looks like the longer someone has lived in a city, the less prone they are to buy new things. For example, someone who is new in town and needs many new things for their house may take advantage of the low Black Friday prices to purchase everything they need.

In [None]:
data.groupby("Stay_In_Current_City_Years")["Purchase"].mean().plot(kind='bar')
plt.title("Stay_In_Current_City_Years and Purchase Analysis")
plt.show()

We see the same pattern as before: on average, people spend about the same amount per purchase regardless of their group. People who are new to the city account for a higher number of purchases, but individually they spend about the same amount regardless of how many years they have lived in their current city.

## Marital_Status

In [None]:
sns.countplot(x ='Marital_Status', data = data)

In [None]:
data.groupby("Marital_Status")["Purchase"].mean().plot(kind='bar')
plt.title("Marital_Status and Purchase Analysis")
plt.show()

Married and unmarried purchasers have almost the same average purchase amount.

## Product_Category_1

In [None]:
sns.countplot(x ='Product_Category_1', data = data)

It is clear that Product_Category_1 numbers 1,5 and 8 stand out. Unfortunately we don't know which product each number represents as it is masked.

In [None]:
data.groupby('Product_Category_1')['Purchase'].mean().plot(kind='bar',figsize=(18,5))
plt.title("Product_Category_1 and Purchase Mean Analysis")
plt.show()

Looking at the average amount spent per Product_Category_1, although more products were bought in categories 1, 5 and 8, the average amount spent for those three is not the highest. It is interesting to see other categories with high average purchase values despite accounting for few sales.

## Product_Category_2

In [None]:
sns.countplot(x ='Product_Category_2', data = data)

It is clear that Product_Category_2 numbers 2.0,8.0 and 14.0 stand out. Unfortunately we don't know which product each number represents as it is masked

In [None]:
data.groupby('Product_Category_2')['Purchase'].mean().plot(kind='bar',figsize=(18,5))
plt.title("Product_Category_2 and Purchase Mean Analysis")
plt.show()

## Product_Category_3

In [None]:
sns.countplot(x ='Product_Category_3', data = data)

It is clear that Product_Category_3 numbers 15.0 and 16.0 stand out. Unfortunately we don't know which product each number represents as it is masked.

In [None]:
data.groupby('Product_Category_3')['Purchase'].mean().plot(kind='bar',figsize=(18,5))
plt.title("Product_Category_3 and Purchase Mean Analysis")
plt.show()

In [None]:
#Occupations and City Category

plt.figure(figsize=(15,5))
sns.countplot(x='Occupation',data=data,hue='City_Category')
plt.title('Comparing Occupations and City Category')
plt.show()

Customers in occupations 4, 0 and 7 make the most purchases, and most of them belong to City_Category B.

### Relation Between the Different Attributes of the Data

In [None]:
# Restrict to numeric columns; the object columns cannot be correlated
data.corr(numeric_only=True)

### HeatMap

In [None]:
sns.heatmap(data.corr(numeric_only=True),annot=True)
plt.show()

From the correlation heatmap, we can observe that the dependent feature 'Purchase' is highly correlated with 'Product_Category_1' and 'Product_Category_2'.
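
To read the heatmap's target row more directly, the correlations with `Purchase` can be ranked by absolute value. The sketch below runs on synthetic data in which `Purchase` is constructed to depend on `Product_Category_1`; that relationship is an assumption made purely for illustration and says nothing about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
cat1 = rng.integers(1, 19, size=n)

# Synthetic frame: Purchase is built from cat1 plus noise (illustrative only)
toy = pd.DataFrame({
    "Product_Category_1": cat1,
    "Occupation": rng.integers(0, 21, size=n),
    "Gender": rng.choice(["F", "M"], size=n),   # non-numeric, excluded below
    "Purchase": 15000 - 600 * cat1 + rng.normal(0, 1000, size=n),
})

corr_with_target = (
    toy.corr(numeric_only=True)["Purchase"]
       .drop("Purchase")                        # drop the self-correlation
       .sort_values(key=abs, ascending=False)   # strongest first, either sign
)
print(corr_with_target)
```

Sorting by absolute value matters because a strong negative correlation is as informative as a strong positive one.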

In [None]:
df = data.copy()

## Data Pre Processing

### Discrepancies / Inconsistencies in data
#### Replacing '+' in 'Age' and 'Stay_In_Current_City_Years'

In [None]:
df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].replace(to_replace="4+",value="5")

In [None]:
df['Age'] = df['Age'].replace(to_replace='55+',value='55')

In [None]:
df['Age'].value_counts()

In [None]:
df.isnull().sum()

In [None]:
df.isnull().sum().sum()

In [None]:
# Reuses missing_data() defined in the Descriptive Analysis section
missing_data(df)

In [None]:
df.boxplot(figsize=(10,10))

### Dropping Irrelevant Data

In [None]:
df.drop('Product_Category_3', axis = 1, inplace = True)

In [None]:
df.drop(["User_ID"],axis=1,inplace=True)

## PreProcess 

In [None]:
# Product_ID preprocess e.g. P00069042 -> 69042

df['Product_ID'] = df['Product_ID'].str.replace('P00', '')

#object to int
df['Product_ID'] = pd.to_numeric(df['Product_ID'],errors='coerce')

In [None]:
df.isnull().sum()

In [None]:
df.head(5)

### Fixing null values in 'Product_Category_2'

In [None]:
#imputed missing values with random values in the same probability distribution as given feature already had

vc = df.Product_Category_2.value_counts(normalize = True)
miss = df.Product_Category_2.isna()
df.loc[miss, 'Product_Category_2'] = np.random.choice(vc.index, size = miss.sum(), p = vc.values)
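
Because the imputation above draws random values, its result changes on every run. A minimal sketch of the same idea with a seeded NumPy `Generator`, on a made-up column (the seed value 42 is an arbitrary choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # fixed seed so the imputation is repeatable

toy = pd.DataFrame(
    {"Product_Category_2": [2.0, 2.0, 8.0, np.nan, 14.0, np.nan, 2.0, 8.0]}
)

# Empirical distribution of the observed values (NaN is ignored by default)
freq = toy["Product_Category_2"].value_counts(normalize=True)
missing = toy["Product_Category_2"].isna()

# Draw one replacement per missing row, with the observed frequencies as weights
toy.loc[missing, "Product_Category_2"] = rng.choice(
    freq.index.to_numpy(), size=missing.sum(), p=freq.to_numpy()
)
print(toy["Product_Category_2"].tolist())
```

This keeps the column's marginal distribution roughly intact, but the imputed values carry no information about the individual row.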

### Feature Encoding

In [None]:
label_encoder_gender = LabelEncoder()
df['Gender'] = label_encoder_gender.fit_transform(df['Gender'])

In [None]:
label_encoder_age = LabelEncoder()
df['Age'] = label_encoder_age.fit_transform(df['Age'])

In [None]:
label_encoder_city = LabelEncoder()
df['City_Category'] = label_encoder_city.fit_transform(df['City_Category'])
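
`LabelEncoder` assigns integer codes in sorted label order, so it is worth inspecting the mapping it produced. A minimal sketch on a few made-up age bins:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

ages = pd.Series(["26-35", "0-17", "55", "18-25", "26-35"])

enc = LabelEncoder()
codes = enc.fit_transform(ages)

# classes_ lists the original labels in the order of their integer codes
mapping = dict(zip(enc.classes_, range(len(enc.classes_))))
print(mapping)

# inverse_transform recovers the original labels from the codes
print(enc.inverse_transform(codes))
```

For these age bins the sorted order happens to match the natural order; for other ordinal columns an explicit mapping may be safer.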

### Convert 'Stay_In_Current_City_Years' and 'Product_Category_2' into numeric data type

In [None]:
df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].astype('int')

In [None]:
df['Product_Category_2'] = df['Product_Category_2'].astype('int')

In [None]:
df.head()

### Creating a Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split the features (X) from the target (Y)
X = df.drop(columns=['Purchase'])
Y = df['Purchase']

In [None]:
X.head()

In [None]:
Y.head()

In [None]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=42)

In [None]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", Y_train.shape)
print("Y_test shape:", Y_test.shape)

## Data Modelling

### (1) Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

In [None]:
lin_reg.fit(X_train, Y_train)

In [None]:
Y_pred_lin_reg = lin_reg.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [None]:
print("RMSE:",np.sqrt(mean_squared_error(Y_test, Y_pred_lin_reg)))
print("R2 score:", r2_score(Y_test, Y_pred_lin_reg))
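
For reference, the two metrics used throughout this section can be computed by hand. The sketch below uses four made-up values and checks the manual results against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions; every prediction is off by exactly 10
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# RMSE: square root of the mean squared residual (same units as Purchase)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
print(np.sqrt(mean_squared_error(y_true, y_pred)), r2_score(y_true, y_pred))
```

RMSE is in the same units as the purchase amount, while R2 is unitless and compares the model against always predicting the mean.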

### (2) KNN Regression

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()

In [None]:
knn.fit(X_train, Y_train)

In [None]:
Y_pred_knn = knn.predict(X_test)

In [None]:
print("KNN regression: ")
print("RMSE:",np.sqrt(mean_squared_error(Y_test, Y_pred_knn)))
print("R2 score:", r2_score(Y_test, Y_pred_knn))

### (3) Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor
dec_tree = DecisionTreeRegressor()

In [None]:
dec_tree.fit(X_train, Y_train)

In [None]:
Y_pred_dec = dec_tree.predict(X_test)

In [None]:
print("Decision tree regression: ")
print("RMSE:",np.sqrt(mean_squared_error(Y_test, Y_pred_dec)))
print("R2 score:", r2_score(Y_test, Y_pred_dec))

In [None]:
DT2 = DecisionTreeRegressor(max_depth=8, min_samples_leaf=150)

DT2.fit(X_train, Y_train)

y_pred = DT2.predict(X_test)

print('rmse:', np.sqrt(mean_squared_error(Y_test,y_pred)))
print('r2_score:',r2_score(Y_test,y_pred)) 

### (4) Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
ran_for = RandomForestRegressor()

In [None]:
ran_for.fit(X_train, Y_train)

In [None]:
Y_pred_ran_for = ran_for.predict(X_test)

In [None]:
print("Random forest regression: ")
print("RMSE:",np.sqrt(mean_squared_error(Y_test, Y_pred_ran_for)))
print("R2 score:", r2_score(Y_test, Y_pred_ran_for))

In [None]:
rf = RandomForestRegressor(random_state = 3,max_depth=10,n_estimators=25)

rf.fit(X_train,Y_train)

y_pred = rf.predict(X_test)

In [None]:
print('r2_score:',r2_score(Y_test,y_pred)) 
print('rmse:', np.sqrt(mean_squared_error(Y_test,y_pred)))

### (5) XGB Regressor

In [None]:
from xgboost import XGBRegressor
xgb = XGBRegressor(random_state = 42)

In [None]:
xgb.fit(X_train, Y_train)

In [None]:
Y_pred_xgb = xgb.predict(X_test)

In [None]:
print("XGB regression: ")
print("RMSE:",np.sqrt(mean_squared_error(Y_test, Y_pred_xgb)))
print("R2 score:", r2_score(Y_test, Y_pred_xgb))

In [None]:
xgb6 = XGBRegressor(n_estimators=470,max_depth=9,learning_rate=0.06)

In [None]:
xgb6.fit(X_train,Y_train)

In [None]:
y_pred = xgb6.predict(X_test)

In [None]:
print('r2_score:',r2_score(Y_test,y_pred)) 
print('rmse:', np.sqrt(mean_squared_error(Y_test,y_pred)))

In [None]:
# XGBoost model with more trees and default depth
xgb1 = XGBRegressor(n_estimators=1000, learning_rate=0.05)

xgb1.fit(X_train,Y_train)

y_pred = xgb1.predict(X_test)

In [None]:
print('r2_score:',r2_score(Y_test,y_pred)) 
print('rmse:', np.sqrt(mean_squared_error(Y_test,y_pred)))

In [None]:
xgb2 = XGBRegressor(n_estimators=500,max_depth=10,learning_rate=0.05)

xgb2.fit(X_train,Y_train)

y_pred = xgb2.predict(X_test)

In [None]:
print('r2_score:',r2_score(Y_test,y_pred)) 
print('rmse:', np.sqrt(mean_squared_error(Y_test,y_pred)))
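
The scores printed above are spread across many cells, which makes the models hard to compare. One possible way to collect them is a single results table, sketched here on synthetic data with scikit-learn models only. The data-generating function is made up, and the hyperparameters merely echo values used earlier in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=8, random_state=0),
    "forest": RandomForestRegressor(n_estimators=25, max_depth=10, random_state=3),
}

# Fit each model and record its test-set metrics in one row
rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "rmse": np.sqrt(mean_squared_error(y_te, pred)),
        "r2": r2_score(y_te, pred),
    })

results = pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)
print(results)
```

The same loop could take the fitted models from this notebook with `X_test` and `Y_test` in place of the synthetic split.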