# Using Random Forest Algorithm to Predict Retail Purchase Patterns
## *by Charlie Glover, M.A.* 
## *June 1, 2017* 

---
>## Abstract

>Big Data, and specifically the ability to analyze vast amounts of data instantly, is a key tool in helping companies position themselves for retail shelf space and customer sales. Accurate sales forecasts from distributors and retailers, as well as actual point-of-sale (POS) data, are critical in helping determine production volumes, distribution, and pricing strategy.

> Hence, the purpose of this project is to develop a statistical model that allows a business to predict the likelihood of a customer purchasing a particular product, so that the model can suggest that product to the customer. This will help convert visits into tangible outcomes, such as repeat purchases, and predict which products customers will purchase. To achieve this goal, I will use the Random Forest algorithm for training and testing on a free transactional dataset downloaded from Tableau. The aim of using this algorithm is to discover relationships and patterns among the purchases a customer makes over several transactions and to produce a suggestion matrix across all customers.

---
>## Motivation:  "Everybody is coming, but No One is Buying"
    
>Having high foot or online traffic to your store is good, but every business needs customers to purchase its products. Hence, the holy grail of any business is to figure out buying behaviors in order to convince customers to buy its product or service. This has created one of the biggest challenges for businesses that collect high volumes of data about specific individuals or groups of customers: making that data both meaningful and actionable (affinity analysis). But the potential of the data collected is often limited by “small data” (the “data iceberg effect”) or is left unexplored. As a result, the intentions and reactions of individual customers can be overlooked.
        
>Enter the world of machine learning, where algorithms evolve from the study of pattern recognition and computational learning. According to Lisa Burton, co-founder and chief scientist with AdMass Inc., advertisers and e-commerce businesses have the highest potential gain from machine learning because of the ease of measurement and the quick feedback needed to train and improve machine learning algorithms (https://www.marketingprofs.com/articles/2017/32097/the-marketing-impact-of-ai-and-machine-learning-3-predictions-by-51-ml-marketing-executives). This seems evident with Amazon, which announced on May 14, 2017 its efforts to build an online tailor with machine learning. The purpose of that effort is to help cut back on customer returns of apparel that doesn't fit (http://www.bizjournals.com/seattle/news/2017/05/16/amazon-is-building-an-online-tailor.html).
        
>My work on this project is an introductory piece on this hot topic and can definitely be developed further into more complex algorithms that fit a wide array of purchase predictions across a wide array of industries. I chose Random Forest for its robustness and for the opportunity to learn one algorithm well enough to use it as a springboard for learning others.

># Problem Statement

>For the purposes of this project, we will try to answer the following questions about predicting customer purchases with the Random Forest algorithm:
    
>1. Given the transactional history of a customer, predict the next product that customer may purchase.
>2. Is it possible to improve the prediction accuracy of customer purchases by increasing the number of trees without overfitting?
  
    

># Background Discussion: The Random Forest Algorithm 

>Random Forest is a machine learning algorithm used for classification, regression, and feature selection. The algorithm combines the predictions of multiple decision trees, averaging the trees' outputs for regression and aggregating their votes or class probabilities for classification. It can also rank features: trees are scored against the known outputs in the training data, and the features used within the better-performing trees are deemed more important. A Random Forest that generalizes well has high accuracy in each tree and high diversity among its trees. For instance, a single decision tree built from a small sample might not generalize to future, larger samples. To overcome this, multiple decision trees are constructed, each from a random sample of the training data and a randomized combination of the variables. The aggregated result from this forest of trees forms an ensemble, known as a random forest.
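
>As a minimal sketch of the ensemble idea, the snippet below fits a small forest on synthetic data (generated with `make_classification`, an assumption for illustration and not the Tableau dataset) and checks that, in scikit-learn, the forest's class probabilities are the average of its individual trees' probabilities. It also prints the impurity-based feature importances.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption for illustration): 6 features, 3 of them informative
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# Average the per-tree class probabilities by hand
tree_probs = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)
forest_probs = forest.predict_proba(X)

# The forest's own probabilities should match the hand-computed average
print("forest equals mean of trees:", np.allclose(tree_probs, forest_probs))

# One importance value per feature; the values are normalized to sum to 1
print("feature importances:", forest.feature_importances_.round(3))
```

>Each tree is trained on a bootstrap sample and considers a random subset of features at each split, which is where the diversity among trees comes from.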

># Methods/Procedure

> In this project, I am going to train a Random Forest algorithm to predict a customer's next purchase using a dataset downloaded from the Tableau website. To do this, I will discuss each piece as follows:
    
>* Import the Modules
>* Import and review the dataset
>* Build lagged purchase features and fit the Random Forest model
    
    


>#### Import Modules

>First, I will import the scikit-learn, pandas, NumPy, and SciPy modules needed for this effort.

In [4]:
# import required modules
%matplotlib inline
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
# The error metric using c-stat (aka roc/auc)
from sklearn.metrics import roc_auc_score


>## Data

>The dataset contains transaction-level data from customers. The transactional data shows what each customer ended up buying. The first sheet shows transactional purchase data for customers.
    

In [5]:
trans_data = pd.read_excel("/home/x7/Desktop/projxdata/01_superstore_data__trans_cg7.xlsx", sheet_name=0)

In [6]:
trans_data.head()

Unnamed: 0,1,Aaron Bergman,117,765,1145,413,1392,1432,0,0.1,...,0.34,0.35,0.36,0.37,0.38,0.39,0.40,0.41,0.42,0.43
0,2,Aaron Hawkins,1501,69,908,544,716,1604,220,662,...,0,0,0,0,0,0,0,0,0,0
1,3,Aaron Smayling,1752,790,169,429,178,266,275,1797,...,0,0,0,0,0,0,0,0,0,0
2,4,Adam Bellavance,1660,876,612,894,77,1371,1670,50,...,0,0,0,0,0,0,0,0,0,0
3,5,Adam Hart,1264,1202,1308,1203,131,682,1817,1615,...,0,0,0,0,0,0,0,0,0,0
4,6,Adam Shillingsburg,466,405,254,1197,290,327,317,2,...,0,0,0,0,0,0,0,0,0,0


In [7]:
trans_data.describe()

Unnamed: 0,1,117,765,1145,413,1392,1432,0,0.1,0.2,...,0.34,0.35,0.36,0.37,0.38,0.39,0.40,0.41,0.42,0.43
count,792.0,792.0,792.0,792.0,792.0,792.0,792.0,792.0,792.0,792.0,...,792,792,792,792,792,792,792,792,792,792
mean,397.5,923.811869,904.526515,889.626263,889.334596,872.974747,799.689394,777.762626,724.039141,699.261364,...,0,0,0,0,0,0,0,0,0,0
std,228.774999,524.380577,538.338744,534.790621,552.725346,572.611366,580.984949,590.484949,609.678595,608.107793,...,0,0,0,0,0,0,0,0,0,0
min,2.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
25%,199.75,487.5,441.75,439.25,428.5,361.0,256.75,201.0,69.75,0.0,...,0,0,0,0,0,0,0,0,0,0
50%,397.5,901.5,897.5,846.5,866.0,865.0,759.5,753.5,676.0,692.0,...,0,0,0,0,0,0,0,0,0,0
75%,595.25,1378.25,1376.5,1345.25,1380.25,1410.0,1304.25,1287.25,1276.75,1222.25,...,0,0,0,0,0,0,0,0,0,0
max,793.0,1848.0,1850.0,1850.0,1850.0,1848.0,1850.0,1842.0,1850.0,1850.0,...,0,0,0,0,0,0,0,0,0,0


The following are the definitions for each of these fields.

<dl class="dl-horizontal">
<dt>Row_ID</dt>
<dd>Row identification for transaction</dd>
<dt>Order_ID</dt>
<dd>Number representing the receipt number</dd>
<dt>Order_Date</dt>
<dd>The date the customer made the purchase</dd>
<dt>Ship_Date</dt>
<dd>The date the order was shipped to the customer if the purchase was made over the phone or online</dd>
<dt>Ship_Mode_ID</dt>
<dd>Unique number representing the ship mode type</dd>
<dt>Ship_Mode</dt>
<dd>The ship mode description</dd>
<dt>Customer_ID2</dt>
<dd>Unique number assigned by the store to identify the customer. For simplicity, this number was not used in the random forest and was replaced by the Customer_ID field for algorithm use.</dd>
<dt>Customer_ID</dt>
<dd>Unique number used to identify the customer for purposes of applying the random forest algorithm</dd>
<dt>Gender</dt>
<dd>Indicator of the customer's gender</dd>
<dt>Customer_Number</dt>
<dd>Customer's first and last name</dd>
<dt>Segment_ID</dt>
<dd>Represents the id number of the customer segment type</dd>
<dt>Segment</dt>
<dd>Text representing the description of the segment type</dd>
<dt>City_ID</dt>
<dd>Unique number representing the city</dd>
</dl>



For the purposes of the specific questions stated at the top of this notebook, we only need a subset of the available columns, namely the customer identifiers, gender, product ID, order date, and sales amount. We'll ignore the other fields.
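
A minimal sketch of restricting a transaction table to the columns the model needs is shown here. The `orders` frame is a hypothetical stand-in, with a few rows that mirror the field names used in this notebook; the `Ship_Mode` values are made up.

```python
import pandas as pd

# Hypothetical rows mirroring the notebook's field names (not the real file)
orders = pd.DataFrame({
    "Row_ID": [1, 2, 3],
    "Customer_ID": [1, 1, 2],
    "Customer_Name": ["Aaron Bergman", "Aaron Bergman", "Aaron Hawkins"],
    "Gender": ["M", "M", "M"],
    "Prod_ID": [117, 413, 69],
    "Order_Date": pd.to_datetime(["2013-02-18", "2013-03-07", "2013-04-22"]),
    "Sales": [12.624, 242.940, 9.912],
    "Ship_Mode": ["Standard", "Standard", "First Class"],  # made-up values
})

# Keep only the columns used for modeling; .copy() avoids modifying a view
keep = ["Customer_ID", "Customer_Name", "Gender", "Prod_ID", "Order_Date", "Sales"]
subset = orders[keep].copy()
print(subset.dtypes)
```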

In [8]:
trans_data

Unnamed: 0,1,Aaron Bergman,117,765,1145,413,1392,1432,0,0.1,...,0.34,0.35,0.36,0.37,0.38,0.39,0.40,0.41,0.42,0.43
0,2,Aaron Hawkins,1501,69,908,544,716,1604,220,662,...,0,0,0,0,0,0,0,0,0,0
1,3,Aaron Smayling,1752,790,169,429,178,266,275,1797,...,0,0,0,0,0,0,0,0,0,0
2,4,Adam Bellavance,1660,876,612,894,77,1371,1670,50,...,0,0,0,0,0,0,0,0,0,0
3,5,Adam Hart,1264,1202,1308,1203,131,682,1817,1615,...,0,0,0,0,0,0,0,0,0,0
4,6,Adam Shillingsburg,466,405,254,1197,290,327,317,2,...,0,0,0,0,0,0,0,0,0,0
5,7,Adrian Barton,195,898,1038,777,1541,266,656,400,...,0,0,0,0,0,0,0,0,0,0
6,8,Adrian Hane,1195,966,1609,1125,338,1473,909,273,...,0,0,0,0,0,0,0,0,0,0
7,9,Adrian Shami,577,265,1688,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,10,Aimee Bixby,1161,1022,77,722,1681,1748,1793,1605,...,0,0,0,0,0,0,0,0,0,0
9,11,Alan Barnes,1385,914,1774,307,1508,717,1557,673,...,0,0,0,0,0,0,0,0,0,0


In [9]:
trans_data.shape

(792, 52)

In [10]:
trans_prod = pd.read_excel("/home/x7/Desktop/projxdata/01_superstore_data__trans_prod_cg4.xlsx", sheet_name=0)
trans_prod.columns = ["Prod_ID", "Prod_Descrip"]
trans_prod['n'] = 1
trans_prod.head()

Unnamed: 0,Prod_ID,Prod_Descrip,n
0,1,"""While you Were Out"" Message Book, One Form pe...",1
1,2,"#10- 4 1/8"" x 9 1/2"" Recycled Envelopes",1
2,3,"#10- 4 1/8"" x 9 1/2"" Security-Tint Envelopes",1
3,4,"#10 Gummed Flap White Envelopes, 100/Box",1
4,5,#10 Self-Seal White Envelopes,1


In [11]:
trans_prod

Unnamed: 0,Prod_ID,Prod_Descrip,n
0,1,"""While you Were Out"" Message Book, One Form pe...",1
1,2,"#10- 4 1/8"" x 9 1/2"" Recycled Envelopes",1
2,3,"#10- 4 1/8"" x 9 1/2"" Security-Tint Envelopes",1
3,4,"#10 Gummed Flap White Envelopes, 100/Box",1
4,5,#10 Self-Seal White Envelopes,1
5,6,"#10 White Business Envelopes,4 1/8 x 9 1/2",1
6,7,"#10-4 1/8"" x 9 1/2"" Premium Diagonal Seam Enve...",1
7,8,#6 3/4 Gummed Flap White Envelopes,1
8,9,"1.7 Cubic Foot Compact ""Cube"" Office Refrigera...",1
9,10,1/4 Fold Party Design Invitations & White Enve...,1


In [12]:
# join the product and transactions table
df = pd.merge(trans_data, trans_prod)
# create a "pivot table" which will give us the number of times each customer bought product
matrix = df.pivot_table(index=['Customer_ID','Customer_Name','Gender', 'Prod_ID','Order_Date','Sales'],values='n');
# a little tidying up. fill NA values with 0 and make the index into a column
matrix = matrix.fillna(0).reset_index()
# save a list of the 0/1 columns. we'll use these a bit later
x_cols = matrix.columns[1:]
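
>The join-then-pivot step can also be sketched on toy data to show how a customer-by-product count matrix might be built. The frames and product descriptions below are hypothetical, and the explicit `on="Prod_ID"` key and `aggfunc="sum"` are assumptions about the intent (counting purchases), not the notebook's exact call.

```python
import pandas as pd

# Hypothetical long-format transactions: one row per purchased product
transactions = pd.DataFrame({
    "Customer_ID": [1, 1, 1, 2, 2],
    "Prod_ID": [117, 413, 117, 69, 413],
})
products = pd.DataFrame({
    "Prod_ID": [69, 117, 413],
    "Prod_Descrip": ["Product A", "Product B", "Product C"],  # placeholder names
})
products["n"] = 1  # each joined row counts as one purchase

# Join on the shared key, stated explicitly rather than inferred
joined = pd.merge(transactions, products, on="Prod_ID", how="left")

# Customers as rows, products as columns, purchase counts as values
counts = joined.pivot_table(index="Customer_ID", columns="Prod_ID",
                            values="n", aggfunc="sum", fill_value=0)
print(counts)
```

>A wide matrix like this is one possible starting point for the suggestion matrix described in the abstract: zero cells mark products a customer has not yet bought.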

In [13]:
matrix

Unnamed: 0,Customer_ID,Customer_Name,Gender,Prod_ID,Order_Date,Sales,n
0,1,Aaron Bergman,M,117,2013-02-18,12.624,1
1,1,Aaron Bergman,M,413,2013-03-07,242.940,1
2,1,Aaron Bergman,M,765,2013-03-07,48.712,1
3,1,Aaron Bergman,M,1145,2013-03-07,17.940,1
4,1,Aaron Bergman,M,1392,2015-11-11,221.980,1
5,1,Aaron Bergman,M,1432,2015-11-11,341.960,1
6,2,Aaron Hawkins,M,69,2013-04-22,9.912,1
7,2,Aaron Hawkins,M,220,2013-12-31,18.900,1
8,2,Aaron Hawkins,M,245,2016-12-18,18.704,1
9,2,Aaron Hawkins,M,544,2013-05-13,8.000,1


In [66]:
matrix2 = matrix['Data_lagged'] = matrix.groupby(['Customer_ID'])['Prod_ID'].shift(1)

In [67]:
vv= pd.concat([matrix,matrix2])

In [68]:
vv

Unnamed: 0,0,Customer_ID,Customer_Name,Data_lagged,Gender,Order_Date,Prod_ID,Sales,n
0,,1,Aaron Bergman,,M,2013-02-18,117,12.624,1
1,,1,Aaron Bergman,117,M,2013-03-07,413,242.940,1
2,,1,Aaron Bergman,413,M,2013-03-07,765,48.712,1
3,,1,Aaron Bergman,765,M,2013-03-07,1145,17.940,1
4,,1,Aaron Bergman,1145,M,2015-11-11,1392,221.980,1
5,,1,Aaron Bergman,1392,M,2015-11-11,1432,341.960,1
6,,2,Aaron Hawkins,,M,2013-04-22,69,9.912,1
7,,2,Aaron Hawkins,69,M,2013-12-31,220,18.900,1
8,,2,Aaron Hawkins,220,M,2016-12-18,245,18.704,1
9,,2,Aaron Hawkins,245,M,2013-05-13,544,8.000,1
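
>In the output above, rows within each customer are ordered by `Prod_ID` rather than by `Order_Date`, so the lag reflects product-number order. If the goal is "previous purchase in time", one option is to sort chronologically before shifting. The sketch below uses a few rows taken from the tables above; the column names `Prev_Prod_ID` and `Naive_Prev` are hypothetical.

```python
import pandas as pd

purchases = pd.DataFrame({
    "Customer_ID": [1, 1, 1, 2, 2],
    "Prod_ID": [413, 117, 1392, 220, 69],
    "Order_Date": pd.to_datetime(["2013-03-07", "2013-02-18", "2015-11-11",
                                  "2013-12-31", "2013-04-22"]),
})

# Sort chronologically within each customer so the lag means "previous purchase"
purchases = purchases.sort_values(["Customer_ID", "Order_Date"]).reset_index(drop=True)

# Per-customer lag: each customer's first purchase has no predecessor (NaN)
purchases["Prev_Prod_ID"] = purchases.groupby("Customer_ID")["Prod_ID"].shift(1)

# A plain shift ignores customer boundaries and carries over the
# previous customer's last product into the next customer's first row
purchases["Naive_Prev"] = purchases["Prod_ID"].shift(1)
print(purchases)
```

>The comparison between the two lag columns shows why the grouped shift is the safer choice when several customers share one table.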


In [49]:
gg= pd.concat([matrix, matrix.shift(), matrix.shift(2)], axis=1)

In [50]:
gg

Unnamed: 0,Customer_ID,Customer_Name,Gender,Prod_ID,Order_Date,Sales,n,Data_lagged,Customer_ID.1,Customer_Name.1,...,n.1,Data_lagged.1,Customer_ID.2,Customer_Name.2,Gender.1,Prod_ID.1,Order_Date.1,Sales.1,n.2,Data_lagged.2
0,1,Aaron Bergman,M,117,2013-02-18,12.624,1,,,,...,,,,,,,NaT,,,
1,1,Aaron Bergman,M,413,2013-03-07,242.940,1,,1,Aaron Bergman,...,1,,,,,,NaT,,,
2,1,Aaron Bergman,M,765,2013-03-07,48.712,1,,1,Aaron Bergman,...,1,,1,Aaron Bergman,M,117,2013-02-18,12.624,1,
3,1,Aaron Bergman,M,1145,2013-03-07,17.940,1,117,1,Aaron Bergman,...,1,,1,Aaron Bergman,M,413,2013-03-07,242.940,1,
4,1,Aaron Bergman,M,1392,2015-11-11,221.980,1,413,1,Aaron Bergman,...,1,117,1,Aaron Bergman,M,765,2013-03-07,48.712,1,
5,1,Aaron Bergman,M,1432,2015-11-11,341.960,1,765,1,Aaron Bergman,...,1,413,1,Aaron Bergman,M,1145,2013-03-07,17.940,1,117
6,2,Aaron Hawkins,M,69,2013-04-22,9.912,1,1145,1,Aaron Bergman,...,1,765,1,Aaron Bergman,M,1392,2015-11-11,221.980,1,413
7,2,Aaron Hawkins,M,220,2013-12-31,18.900,1,1392,2,Aaron Hawkins,...,1,1145,1,Aaron Bergman,M,1432,2015-11-11,341.960,1,765
8,2,Aaron Hawkins,M,245,2016-12-18,18.704,1,,2,Aaron Hawkins,...,1,1392,2,Aaron Hawkins,M,69,2013-04-22,9.912,1,1145
9,2,Aaron Hawkins,M,544,2013-05-13,8.000,1,69,2,Aaron Hawkins,...,1,,2,Aaron Hawkins,M,220,2013-12-31,18.900,1,1392


In [15]:
num_variables = list(matrix.dtypes[matrix.dtypes !="object"].index)

In [16]:
matrix[num_variables].head()

Unnamed: 0,Customer_ID,Prod_ID,Order_Date,Sales,n,Data_lagged
0,1,117,2013-02-18,12.624,1,
1,1,413,2013-03-07,242.94,1,117.0
2,1,765,2013-03-07,48.712,1,413.0
3,1,1145,2013-03-07,17.94,1,765.0
4,1,1392,2015-11-11,221.98,1,1145.0


In [17]:
matrix[num_variables]

Unnamed: 0,Customer_ID,Prod_ID,Order_Date,Sales,n,Data_lagged
0,1,117,2013-02-18,12.624,1,
1,1,413,2013-03-07,242.940,1,117
2,1,765,2013-03-07,48.712,1,413
3,1,1145,2013-03-07,17.940,1,765
4,1,1392,2015-11-11,221.980,1,1145
5,1,1432,2015-11-11,341.960,1,1392
6,2,69,2013-04-22,9.912,1,
7,2,220,2013-12-31,18.900,1,69
8,2,245,2016-12-18,18.704,1,220
9,2,544,2013-05-13,8.000,1,245


In [18]:
y = matrix[0:6]

In [19]:
matrix[7:16]

Unnamed: 0,Customer_ID,Customer_Name,Gender,Prod_ID,Order_Date,Sales,n,Data_lagged
7,2,Aaron Hawkins,M,220,2013-12-31,18.9,1,69
8,2,Aaron Hawkins,M,245,2016-12-18,18.704,1,220
9,2,Aaron Hawkins,M,544,2013-05-13,8.0,1,245
10,2,Aaron Hawkins,M,579,2015-03-21,86.45,1,544
11,2,Aaron Hawkins,M,662,2014-12-27,323.1,1,579
12,2,Aaron Hawkins,M,716,2013-10-25,49.408,1,662
13,2,Aaron Hawkins,M,733,2014-12-27,668.16,1,716
14,2,Aaron Hawkins,M,908,2013-05-13,279.456,1,733
15,2,Aaron Hawkins,M,1501,2013-04-22,247.84,1,908


In [20]:
model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)

In [21]:
model.fit(matrix[num_variables],y)

TypeError: float() argument must be a string or a number
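
>The error above is consistent with passing non-numeric input to the model: `num_variables` still includes the datetime column `Order_Date`, `Data_lagged` contains missing values, and `y` is a six-row slice of the whole table rather than one target value per row. A minimal sketch of one way to prepare the inputs follows. It runs on synthetic data (all values are made up), the derived columns `Order_Month` and `Order_Year` are hypothetical, and it swaps in `RandomForestClassifier` instead of the regressor used above, on the assumption that product IDs are labels rather than quantities.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(42)
n_rows = 200

# Synthetic stand-in for the lagged purchase table (values are made up)
frame = pd.DataFrame({
    "Customer_ID": rng.randint(1, 21, size=n_rows),
    "Order_Date": pd.Timestamp("2013-01-01")
                  + pd.to_timedelta(rng.randint(0, 1000, size=n_rows), unit="D"),
    "Sales": rng.uniform(5, 500, size=n_rows).round(2),
    "Prod_ID": rng.randint(1, 6, size=n_rows),
})
frame = frame.sort_values(["Customer_ID", "Order_Date"]).reset_index(drop=True)
frame["Data_lagged"] = frame.groupby("Customer_ID")["Prod_ID"].shift(1)

# Drop each customer's first purchase, whose lag is NaN
frame = frame.dropna(subset=["Data_lagged"]).copy()

# Encode the date as plain numbers, since the model expects numeric input
frame["Order_Month"] = frame["Order_Date"].dt.month
frame["Order_Year"] = frame["Order_Date"].dt.year

feature_cols = ["Data_lagged", "Sales", "Order_Month", "Order_Year"]
X = frame[feature_cols]
y = frame["Prod_ID"]  # target: one label per row, same length as X

clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
clf.fit(X, y)
print("OOB accuracy:", clf.oob_score_)
```

>Because the synthetic labels are random, the out-of-bag accuracy here only confirms that the pipeline runs; it says nothing about how well the real data can be predicted.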

In [None]:
model.oob_score_

In [None]:
y_oob = model.oob_prediction_
# Compute the area under the ROC curve (AUC, also called the c-statistic) from the out-of-bag predictions.
# Note: roc_auc_score expects class labels in y, not a continuous regression target.
print("C-stat is:  ", roc_auc_score(y, y_oob))
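
>The second question in the problem statement, whether more trees improve accuracy without overfitting, could be explored by comparing out-of-bag scores across forest sizes. The sketch below does this on synthetic multi-class data (an assumption standing in for "next product" labels); the tree counts are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic multi-class data standing in for "next product" labels
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=4, random_state=42)

oob_by_trees = {}
for n_trees in [50, 100, 200, 400]:
    clf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                 random_state=42)
    clf.fit(X, y)
    # Out-of-bag accuracy: each sample is scored only by trees that did not see it
    oob_by_trees[n_trees] = clf.oob_score_

for n_trees, score in oob_by_trees.items():
    print(n_trees, round(score, 3))
```

>If the out-of-bag score levels off as trees are added, extra trees mainly cost computation time. Whether that pattern holds on the purchase data would need to be checked on the real dataset.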