<a href="https://colab.research.google.com/github/delcorej/hello-world/blob/main/Drizly_Fun_by_Justice_DelCore.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<b>Goal of this Notebook</b>

This notebook predicts the Gross Merchandise Value (GMV) of customers based on their first purchase order. The prediction is made by performing a linear regression on the attributes of each customer's first order.

In [1]:
#first import all of the necessary libs for the exploratory data analysis portion
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime 
import warnings
warnings.simplefilter('ignore') 
#this is to ignore any warnings that occur that do not interrupt the running of the cell
%matplotlib inline 
#this is to ensure that all plots are visible once I run the cell

In [2]:
#let's mount my Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [3]:
#now let's load the different dataframes that I will be working with
order_items = pd.read_csv('store_order_items.csv.gz')
orders = pd.read_csv('store_orders.csv.gz')
first_orders = pd.read_csv('users_first_orders.csv.gz')

# Data Preprocessing

In [4]:
#lets look at the different components of each dataframe
order_items.info()
orders.info()
first_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1616900 entries, 0 to 1616899
Data columns (total 6 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   store_id            1616900 non-null  object 
 1   store_order_id      1616900 non-null  object 
 2   master_item_id      1612863 non-null  float64
 3   top_level_category  1612640 non-null  object 
 4   unit_price          1616900 non-null  float64
 5   quantity            1616900 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 74.0+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 561167 entries, 0 to 561166
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   store_id        561167 non-null  object
 1   user_id         561167 non-null  object
 2   order_id        561167 non-null  object
 3   store_order_id  561167 non-null  object
dtypes: object(4)
memory usage: 17.1+

In [5]:
#null values for each column
order_items.isnull().sum()

store_id                 0
store_order_id           0
master_item_id        4037
top_level_category    4260
unit_price               0
quantity                 0
dtype: int64

In [6]:
#null values for each column
orders.isnull().sum()


store_id          0
user_id           0
order_id          0
store_order_id    0
dtype: int64

In [7]:
#null values for each column
first_orders.isnull().sum()

user_id               0
dma_name              0
is_gift              68
is_corporate_user     0
device                0
gmv                   0
dtype: int64

## Null Values

These datasets contain null values, which is important to point out because null values affect the model-building process: too many of them can throw off the fit of a model considerably. In this notebook the nulls occur in three columns: is_gift, top_level_category, and master_item_id. Since only 68 values out of hundreds of thousands are missing from the is_gift column, I think it is okay to ignore them. The other two columns are each missing roughly four thousand values (4,260 for top_level_category and 4,037 for master_item_id). In a typical notebook I might replace these with the mean or the median; however, for this project I am going to ignore them. The reason is that the goal of this notebook is not to study the impact of missing values but to see whether I can build a successful multiple linear regression model.
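
To make the options above concrete, here is a minimal sketch of measuring and handling missing values on a toy frame. The toy values and the "Unknown" fill label are my own assumptions for illustration and are not taken from the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for order_items; the values are made up for illustration.
toy_items = pd.DataFrame({
    "top_level_category": ["Wine", None, "Liquor", "Wine"],
    "master_item_id": [1252.0, 7005.0, np.nan, 2179.0],
})

# Share of missing values per column, to judge how much data is affected.
missing_share = toy_items.isnull().mean()

# Option 1: drop every row that is missing a value in either column.
dropped = toy_items.dropna(subset=["top_level_category", "master_item_id"])

# Option 2: keep every row and mark missing categories with an explicit label
# ("Unknown" is a hypothetical placeholder, not a category from the dataset).
filled = toy_items.fillna({"top_level_category": "Unknown"})

print(missing_share)
print(len(dropped), len(filled))
```

Dropping rows is the simplest choice when the share of missing values is small, while filling keeps the row count unchanged.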


In [8]:
#Now we have to merge all of the dataframes together. The first pair to merge is the store orders
#and the store order items, which share the store_id and store_order_id columns

merge_one = pd.merge(orders, order_items, on=['store_id', 'store_order_id'])
merge_one.shape

(1616900, 8)

Merging the dataframes builds a single training dataset, from which I can fit a model that predicts GMV as accurately as possible.
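
As a small illustration of this kind of merge, the sketch below joins two toy frames on their shared key columns and asks pandas to check the join. The toy IDs and prices are made up; the validate and indicator arguments are optional checks that I am adding here, not something the notebook itself uses.

```python
import pandas as pd

# Toy stand-ins for orders and order_items (values are made up).
toy_orders = pd.DataFrame({
    "store_id": ["s1", "s2"],
    "user_id": ["u1", "u2"],
    "store_order_id": ["so1", "so2"],
})
toy_items = pd.DataFrame({
    "store_id": ["s1", "s1", "s2"],
    "store_order_id": ["so1", "so1", "so2"],
    "unit_price": [19.99, 21.99, 12.85],
    "quantity": [1, 1, 2],
})

merged = pd.merge(
    toy_orders,
    toy_items,
    on=["store_id", "store_order_id"],  # name the join keys explicitly
    how="inner",
    validate="one_to_many",  # raises if an order key is duplicated on the left
    indicator=True,          # adds a _merge column showing where each row came from
)

print(merged.shape)
print(merged["_merge"].value_counts())
```

Naming the keys and validating the relationship makes it easier to notice when a merge silently duplicates or drops rows.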

In [9]:
#I want to just double check that all column values are within my merge
merge_one.head()

Unnamed: 0,store_id,user_id,order_id,store_order_id,master_item_id,top_level_category,unit_price,quantity
0,31928e8b164b9bc3c0ff06f6babe09a8,06e42070c778a50a1485a7f6cd480961,8410f4008654f7a88f209ed59b0dd94f,ff5172de1c9e840af71d8bce0ff6834a,4304.0,Liquor,19.99,1
1,31928e8b164b9bc3c0ff06f6babe09a8,06e42070c778a50a1485a7f6cd480961,8410f4008654f7a88f209ed59b0dd94f,ff5172de1c9e840af71d8bce0ff6834a,4510.0,Liquor,21.99,1
2,67aecfcf3981614f6e5a89dcd19c3b03,0cf662033efeff66a17e03797960b8e3,81f5bd06ca42eba4876880b06a2d9d51,8e5e14442db17499be390ce5bd415e05,1252.0,Wine,19.99,1
3,67aecfcf3981614f6e5a89dcd19c3b03,0cf662033efeff66a17e03797960b8e3,81f5bd06ca42eba4876880b06a2d9d51,8e5e14442db17499be390ce5bd415e05,7005.0,Wine,12.85,1
4,9f87b8cacdd1cc6bbe4f09dff920a125,c56e808028271e41b15ca6274c260a04,8b98b3164460bb57d295249bd9ae5632,78d77cb7d80eff8be6dea2f5263be90d,2179.0,Wine,11.95,1


In [10]:
#now let's add in the third dataframe, joined on user_id, so that we can then build a model off of it
main_data = pd.merge(merge_one, first_orders, on='user_id')
main_data.shape

(1616900, 13)

In [17]:
#now let's calculate total_gmv, the gross merchandise value of each line item (unit_price * quantity)

main_data['total_gmv'] = main_data['unit_price'] * main_data['quantity']
main_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1616900 entries, 0 to 1616899
Data columns (total 14 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   store_id            1616900 non-null  object 
 1   user_id             1616900 non-null  object 
 2   order_id            1616900 non-null  object 
 3   store_order_id      1616900 non-null  object 
 4   master_item_id      1612863 non-null  float64
 5   top_level_category  1612640 non-null  object 
 6   unit_price          1616900 non-null  float64
 7   quantity            1616900 non-null  int64  
 8   dma_name            1616900 non-null  object 
 9   is_gift             1616228 non-null  object 
 10  is_corporate_user   1616900 non-null  bool   
 11  device              1616900 non-null  object 
 12  gmv                 1616900 non-null  float64
 13  total_gmv           1616900 non-null  float64
dtypes: bool(1), float64(4), int64(1), object(8)
memory usage: 238.7+ M
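
The multiplication above produces one value per line item. If a per-customer total were wanted instead, one possible approach is to sum the line values for each user. This is only a sketch on toy data; the line_value column name and the numbers are hypothetical and do not come from the notebook.

```python
import pandas as pd

# Toy line items for two users (values are made up).
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "unit_price": [19.99, 21.99, 12.85],
    "quantity": [1, 1, 2],
})

# Value of each line item, mirroring unit_price * quantity in the notebook.
toy["line_value"] = toy["unit_price"] * toy["quantity"]

# Sum the line values per user to get one total per customer.
per_user = toy.groupby("user_id", as_index=False)["line_value"].sum()

print(per_user)
```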

In [25]:
#time to change all our values to numeric format
from sklearn import preprocessing

data_encoder = preprocessing.LabelEncoder()
data_encoded = main_data.apply(data_encoder.fit_transform)

data_encoded.head()

Unnamed: 0,store_id,user_id,order_id,store_order_id,master_item_id,top_level_category,unit_price,quantity,dma_name,is_gift,is_corporate_user,device,gmv,total_gmv
0,143,5520,283066,559631,1844,2,1719,0,2,0,0,1,4842,1972
1,143,5520,283066,559631,1931,2,1881,0,2,0,0,1,4842,2160
2,143,5520,512089,529375,1844,2,1719,0,2,0,0,1,4842,1972
3,143,5520,512089,529375,1931,2,1881,0,2,0,0,1,4842,2160
4,702,5520,358568,259567,4252,2,1475,0,2,0,0,1,4842,1690


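Encoding every value makes it easier for me to build a model, because I don't have to go through each column and make sure the datatypes are compatible. Note that LabelEncoder also replaces the numeric columns (unit_price, quantity, gmv, total_gmv) with integer codes based on sorted order, so the model below is fit on those codes rather than on the original dollar amounts.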
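
For comparison, here is a minimal sketch of an alternative that encodes only the categorical columns and leaves numeric columns untouched, using pd.get_dummies. This is a swapped-in technique, not the approach used in this notebook, and the toy device and category values are assumptions.

```python
import pandas as pd

# Toy frame with two categorical and two numeric columns (values are made up).
toy = pd.DataFrame({
    "device": ["iOS", "Android", "Web", "iOS"],
    "top_level_category": ["Wine", "Liquor", "Beer", "Wine"],
    "unit_price": [19.99, 21.99, 12.85, 11.95],
    "quantity": [1, 1, 2, 1],
})

categorical_cols = ["device", "top_level_category"]

# One-hot encode only the categorical columns; numeric columns keep their values.
encoded = pd.get_dummies(toy, columns=categorical_cols, dtype=int)

print(encoded.columns.tolist())
print(encoded.head())
```

One-hot encoding avoids implying an order between categories, which matters for a linear model because it treats its inputs as magnitudes.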

In [26]:
#Now time to define our variables
x_var = data_encoded[['device', 'is_gift','is_corporate_user', 'gmv','top_level_category']]
y_var = data_encoded['total_gmv']

In [33]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x_var, y_var, test_size = 0.3, random_state = 42) #as The Hitchhiker's Guide to the Galaxy says, 42 is the answer to life

lr = LinearRegression()
lr.fit(X_train, y_train)

gmv_predict = lr.predict(X_test)

In [42]:
#time to calculate some metrics to see how well this model is predicting
#for a regressor, score() returns the R-squared value on the data it is given
R_squared = lr.score(X_test, y_test)
print('My R-squared Value is: ', R_squared)

My R-squared Value is:  0.10430102589502876
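
To show what this number measures, the sketch below computes R-squared by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with sklearn's r2_score. The toy arrays are made up and unrelated to the model above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy actual and predicted values (made up for illustration).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: the error left over after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: the error of always predicting the mean of y_true.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

An R-squared near 0.10, as above, means the model explains only about a tenth of the variance in the target relative to always predicting the mean.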


In [45]:
#now let's look at the mean squared error as well
from sklearn.metrics import mean_squared_error

#compare the test-set predictions against y_test (not y_train) so both arrays have the same length
y_true = y_test
y_pred = gmv_predict
print('My mean squared error was: ', mean_squared_error(y_true, y_pred))

In [40]:
#time to plot out my model
#kdeplot draws the density curves that distplot(hist=False) used to draw; distplot is deprecated
sns.kdeplot(gmv_predict, color='r', label = 'Predicted Values')
sns.kdeplot(y_test, color = 'b', label ='Actual Values')
plt.title('Actual versus Predicted Values', fontsize = 16)
plt.xlabel('Values', fontsize = 12)
plt.ylabel('Density', fontsize = 12)
plt.legend(loc = 'upper left', fontsize = 13)

plt.show()