## Sprocket Central Pty Ltd: Customer Recommendation Data Modeling

### About the Dataset

**Sprocket Central Pty Ltd** is a medium-sized bikes and cycling accessories organisation. It has a large dataset relating to its customers, but its team is unsure how to analyse it effectively to help optimise the marketing strategy.

The client provided us with 3 datasets:

- Customer Demographic

- Customer Addresses

- Transaction data for the past 3 months

### Objective of the report

In Phase #1 I cleaned each dataset and joined them together.

In Phase #2 we carried out exploratory data analysis and RFM analysis, and segmented the customers into four segments: Platinum, Gold, Silver and Bronze.

In this Phase #3 the client has provided one extra dataset, **New Customers Demographic**, with 1000 records of new customers who have not yet purchased any products from us. The client wants recommendations on which of these customers to target with marketing campaigns.

To solve this task I will train a machine learning classification model on the old customers dataset, using the RFM segments we created as labels, and then predict which segment each new customer is most likely to fall into.

In [1]:
# import all packages and set plots to be embedded inline
import pandas as pd
import numpy as np
import datetime as dt
import calendar

# suppress warnings 
import warnings
warnings.simplefilter("ignore")

In [2]:
#Importing the rfm segmented dataset 
CTA_rfm=pd.read_csv('CTA_rfm_allinfo.csv')

In [3]:
# checking the data first row
CTA_rfm.head(1)

Unnamed: 0.1,Unnamed: 0,transaction_id,product_id,customer_id,transaction_date,online_order,order_status,brand,product_line,product_class,...,property_valuation,recency,frequency,monetary,R,F,M,RFMClass,RFMscore,RFM_loyalty_level
0,0,1,2,2950,2017-02-25,False,Approved,Solex,Standard,medium,...,6,76,3,1953.15,3,4,4,344,11,Bronze


I will create a new dataframe containing the columns used to train the classification model. The model will then predict RFM_loyalty_level for a fresh dataset of 1000 new customers with similar features.

In [4]:
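# .copy() makes this an independent dataframe, so later column assignments do not trigger SettingWithCopyWarning
old_customers=CTA_rfm[['gender','past_3_years_bike_related_purchases','job_industry_category','wealth_segment','owns_car','tenure','Age','property_valuation','RFM_loyalty_level']].copy()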

### Old customers dataset feature engineering

In [5]:
old_customers.shape

(19125, 9)

Since the gender, job_industry_category and owns_car columns are nominal, I will use one-hot encoding to transform them into binary columns for the ML model.

In [6]:
# one-hot encoding the gender column into a binary column
gender=old_customers[['gender']]
gender=pd.get_dummies(gender,drop_first=True)
gender.head()

Unnamed: 0,gender_Male
0,1
1,1
2,1
3,0
4,0


In [7]:
# one-hot encoding the job_industry_category column into binary columns

job_industry_category=old_customers[['job_industry_category']]
job_industry_category=pd.get_dummies(job_industry_category,drop_first=True)
job_industry_category.head()

Unnamed: 0,job_industry_category_Entertainment,job_industry_category_Financial Services,job_industry_category_Health,job_industry_category_IT,job_industry_category_Manufacturing,job_industry_category_Property,job_industry_category_Retail,job_industry_category_Telecommunications
0,0,1,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0
4,0,0,1,0,0,0,0,0


In [8]:
# one-hot encoding the owns_car column into a binary column
owns_car=old_customers[['owns_car']]
owns_car=pd.get_dummies(owns_car,drop_first=True)
owns_car.head()

Unnamed: 0,owns_car_Yes
0,1
1,1
2,1
3,1
4,1
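
The three encoding cells above could also be collapsed into a single call. The following is a minimal sketch on a toy frame (the `toy` dataframe and its values are illustrative assumptions, not the real customer data); it relies on the `columns` argument of `pd.get_dummies`, which encodes only the listed columns and passes the rest through.

```python
import pandas as pd

# Toy frame standing in for old_customers (values are illustrative only)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "owns_car": ["Yes", "No", "Yes"],
    "tenure": [10, 4, 7],
})

# Encode both nominal columns at once; the other columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["gender", "owns_car"], drop_first=True)
print(encoded.columns.tolist())
```

With `drop_first=True` the first category of each column (in sorted order) is dropped, which is why only `gender_Male` and `owns_car_Yes` remain.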


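Converting the wealth_segment column into integer codes using a label encoder, since it is an ordinal categorical column. Note that LabelEncoder assigns codes in alphabetical order of the category names, which need not match the intended ranking.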

In [9]:
# encoding the wealth_segment column as integer codes using LabelEncoder
from sklearn.preprocessing import LabelEncoder
old_customers['wealth_segment']=LabelEncoder().fit_transform(old_customers['wealth_segment'])
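
Because `LabelEncoder` sorts the class names before numbering them, the codes follow alphabetical order. The sketch below compares that with an explicit mapping. It assumes the three segment names are `Mass Customer`, `Affluent Customer` and `High Net Worth` (only `Mass Customer` is visible in the outputs here), and the ranking in `assumed_order` is a hypothetical choice, not something stated in the data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed segment names; only "Mass Customer" appears in the notebook output
segments = pd.Series(["Mass Customer", "Affluent Customer", "High Net Worth"])

# LabelEncoder sorts the class names, so the codes follow alphabetical order
encoder = LabelEncoder().fit(segments)
alphabetical = {name: int(code) for code, name in enumerate(encoder.classes_)}

# Hypothetical ordinal ranking (an assumption, not taken from the source data)
assumed_order = {"Mass Customer": 0, "Affluent Customer": 1, "High Net Worth": 2}
ordinal_codes = segments.map(assumed_order)

print(alphabetical)
print(ordinal_codes.tolist())
```

For a tree-based model the difference may matter little, since splits can isolate any single code, but an explicit mapping keeps the encoding identical across datasets.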

I will create a new dataframe for the ML model, consisting of the binary-encoded columns and the numerical columns.

In [10]:
old_customers1=old_customers[['past_3_years_bike_related_purchases','tenure','Age','property_valuation','wealth_segment']]

In [11]:
#Concatenating transformed categorical columns with the old_customers dataframe
old_customers1=pd.concat([gender,job_industry_category,owns_car,old_customers1],axis=1)

In [12]:
old_customers1.shape

(19125, 15)

In [13]:
# final result
old_customers1.head(1)

Unnamed: 0,gender_Male,job_industry_category_Entertainment,job_industry_category_Financial Services,job_industry_category_Health,job_industry_category_IT,job_industry_category_Manufacturing,job_industry_category_Property,job_industry_category_Retail,job_industry_category_Telecommunications,owns_car_Yes,past_3_years_bike_related_purchases,tenure,Age,property_valuation,wealth_segment
0,1,0,1,0,0,0,0,0,0,1,19,10.0,66,6,2


Now I will import the new customers dataset and prepare it for the machine learning model.

### New customers dataset

In [14]:
# Retrieving new customers dataset
new_customers=pd.read_excel('New customer list.xlsx')

In [15]:
#checking data first rows
new_customers.head(1)

Unnamed: 0,Note: The data and information in this document is reflective of a hypothetical situation and client. This document is to be used for KPMG Virtual Internship purposes only.,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22
0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,...,state,country,property_valuation,,,,,,Rank,Value


In [16]:
# Making first row as header
new_customers.rename(columns=new_customers.iloc[0], inplace = True)
new_customers.drop([0], inplace = True)

In [17]:
#checking results
new_customers.head(1)

Unnamed: 0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,...,state,country,property_valuation,NaN,NaN.1,NaN.2,NaN.3,NaN.4,Rank,Value
1,Chickie,Brister,Male,86,1957-07-12,General Manager,Manufacturing,Mass Customer,N,Yes,...,QLD,Australia,6,0.46,0.575,0.71875,0.610938,1.0,1,1.71875


In [18]:
#Dropping the columns with a NaN header, as they do not exist in the original dataset
new_customers.columns = new_customers.columns.fillna('to_drop')
new_customers.drop('to_drop', axis = 1, inplace = True)
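
The header promotion and the removal of NaN-named columns can be expressed without the temporary `to_drop` label. This is a small sketch on a toy frame (`raw` and its values are illustrative assumptions) that mimics the shape of the raw sheet, where row 0 holds the real column names.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the raw sheet: row 0 holds the real column names
raw = pd.DataFrame([
    ["first_name", "gender", np.nan, "Rank"],
    ["Chickie", "Male", 0.46, 1],
])

# Promote row 0 to the header and keep only the remaining rows as data
clean = raw.iloc[1:].copy()
clean.columns = raw.iloc[0].tolist()

# Keep only the columns whose header is not missing
clean = clean.loc[:, clean.columns.notna()]
print(clean.columns.tolist())
```

Passing `header=1` to `pd.read_excel` would likely achieve the promotion at load time, assuming the real header sits on the second row of the sheet.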

In [19]:
# Checking dataset info
new_customers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 1 to 1000
Data columns (total 18 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   first_name                           1000 non-null   object
 1   last_name                            971 non-null    object
 2   gender                               1000 non-null   object
 3   past_3_years_bike_related_purchases  1000 non-null   object
 4   DOB                                  983 non-null    object
 5   job_title                            894 non-null    object
 6   job_industry_category                835 non-null    object
 7   wealth_segment                       1000 non-null   object
 8   deceased_indicator                   1000 non-null   object
 9   owns_car                             1000 non-null   object
 10  tenure                               1000 non-null   object
 11  address                              1000 n

In [20]:
# checking for duplicates in the dataset
new_customers.duplicated().sum()

0

No duplicates in the data

Checking the consistency of values in the categorical columns

In [21]:
# Collecting the object-typed (categorical) columns into a list
cat_col=new_customers.select_dtypes(include='object').columns.tolist()
cat_col

['first_name',
 'last_name',
 'gender',
 'past_3_years_bike_related_purchases',
 'DOB',
 'job_title',
 'job_industry_category',
 'wealth_segment',
 'deceased_indicator',
 'owns_car',
 'tenure',
 'address',
 'postcode',
 'state',
 'country',
 'property_valuation',
 'Rank',
 'Value']
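
The list above includes columns that are numeric in nature (for example `tenure` and `property_valuation`), most likely because the header row was read in as data and every column became `object`. A minimal sketch of converting such columns back to numbers follows; the `toy` frame is an illustrative assumption, and `errors="coerce"` turns unparseable entries into NaN rather than raising.

```python
import pandas as pd

# Toy frame: numeric values stored with object dtype (illustrative only)
toy = pd.DataFrame({
    "tenure": ["14", 16, "9"],
    "property_valuation": [6, "11", 5],
}, dtype="object")

# Convert the selected columns to numeric dtypes
numeric_cols = ["tenure", "property_valuation"]
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
```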

In [22]:
#checking the value counts of each categorical column to spot duplicated or inaccurate values
for col in cat_col:
    print(col)
    print(new_customers[col].value_counts())
    print()
    print('*******')
    print()

first_name
Rozamond     3
Dorian       3
Mandie       3
Inglebert    2
Ricki        2
            ..
Diego        1
Lucilia      1
Eddy         1
Caron        1
Sylas        1
Name: first_name, Length: 940, dtype: int64

*******

last_name
Sissel       2
Minshall     2
Borsi        2
Shoesmith    2
Sturch       2
            ..
O'Moylane    1
Axtens       1
Moxted       1
Conrad       1
Duffill      1
Name: last_name, Length: 961, dtype: int64

*******

gender
Female    513
Male      470
U          17
Name: gender, dtype: int64

*******

past_3_years_bike_related_purchases
60    20
59    18
42    17
70    17
11    16
      ..
19     5
9      5
92     5
85     4
20     3
Name: past_3_years_bike_related_purchases, Length: 100, dtype: int64

*******

DOB
1965-07-03    2
1978-01-15    2
1979-07-28    2
1995-08-13    2
1941-07-21    2
             ..
1978-05-27    1
1945-08-08    1
1943-08-27    1
1999-10-24    1
1955-10-02    1
Name: DOB, Length: 961, dtype: int64

*******

job_title
Assoc

The general information about the dataframe points to several issues:

- the date of birth column must be converted to a date column so that Age can be extracted from it
- some columns contain null values

### New customers dataset feature engineering

In [23]:
# How many missing points in each variable
count_missing_new_customers = new_customers.isnull().sum()
percent_missing_new_customers = round(new_customers.isnull().sum()/len(new_customers) * 100, 1)
missing_new_customers = pd.concat([count_missing_new_customers, percent_missing_new_customers], axis = 1)
missing_new_customers.columns = ["Missing (count)", "Missing (%)"]
missing_new_customers

Unnamed: 0,Missing (count),Missing (%)
first_name,0,0.0
last_name,29,2.9
gender,0,0.0
past_3_years_bike_related_purchases,0,0.0
DOB,17,1.7
job_title,106,10.6
job_industry_category,165,16.5
wealth_segment,0,0.0
deceased_indicator,0,0.0
owns_car,0,0.0


In [24]:
# fill the missing job_title and job_industry_category values with the mode (the most frequent value in each column)
new_customers['job_title'] = new_customers['job_title'].fillna(new_customers['job_title'].mode()[0])
new_customers['job_industry_category'] = new_customers['job_industry_category'].fillna(new_customers['job_industry_category'].mode()[0])

In [25]:
# removing rows with a missing DOB (dropna on the column alone would not remove rows from the dataframe)
new_customers=new_customers.dropna(subset=['DOB'])

In [26]:
# last_name is not a model feature, so rows with a missing last_name are kept
# (dropna on the column alone would not remove rows from the dataframe anyway)

We will drop the rows where gender is U, because that value is inconsistent with the other values in the column.

In [27]:
# dropping rows where gender is U (.copy() avoids SettingWithCopyWarning on later assignments)
new_customers=new_customers[new_customers.gender!='U'].copy()

In [28]:
#changing Date of birth column into date column to extract Age from it
new_customers['DOB']=pd.to_datetime(new_customers['DOB'])

In [29]:
# This function converts given date to age
def from_dob_to_age(born):
    today = dt.date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

In [30]:
#creating a new Age column in the dataset
new_customers['Age']=new_customers['DOB'].apply(from_dob_to_age)
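
The age function above depends on the date the notebook is run, so the Age values change over time. The sketch below shows the same arithmetic with an explicit reference date; `age_on` is a hypothetical helper introduced only for illustration.

```python
import datetime as dt

def age_on(born, reference):
    """Age in whole years on the reference date (hypothetical helper)."""
    # Subtract one year if the birthday has not yet occurred in the reference year
    had_birthday = (reference.month, reference.day) >= (born.month, born.day)
    return reference.year - born.year - (0 if had_birthday else 1)

# One day before and on the 64th birthday
print(age_on(dt.date(1957, 7, 12), dt.date(2021, 7, 11)))  # 63
print(age_on(dt.date(1957, 7, 12), dt.date(2021, 7, 12)))  # 64
```

Fixing the reference date would also make it easier to compute Age the same way for the old and the new customers.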

In [31]:
#checking final results
new_customers.head(2)

Unnamed: 0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,address,postcode,state,country,property_valuation,Rank,Value,Age
1,Chickie,Brister,Male,86,1957-07-12,General Manager,Manufacturing,Mass Customer,N,Yes,14,45 Shopko Center,4500,QLD,Australia,6,1,1.71875,64
2,Morly,Genery,Male,69,1970-03-22,Structural Engineer,Property,Mass Customer,N,No,16,14 Mccormick Park,2113,NSW,Australia,11,1,1.71875,51


In [32]:
# one-hot encoding the gender column into a binary column

gender_new=new_customers[['gender']]
gender_new=pd.get_dummies(gender_new,drop_first=True)
gender_new.head()

Unnamed: 0,gender_Male
1,1
2,1
3,0
4,0
5,0


In [33]:
# one-hot encoding the job_industry_category column into binary columns
job_industry_category_new=new_customers[['job_industry_category']]
job_industry_category_new=pd.get_dummies(job_industry_category_new,drop_first=True)
job_industry_category_new.head()

Unnamed: 0,job_industry_category_Entertainment,job_industry_category_Financial Services,job_industry_category_Health,job_industry_category_IT,job_industry_category_Manufacturing,job_industry_category_Property,job_industry_category_Retail,job_industry_category_Telecommunications
1,0,0,0,0,1,0,0,0
2,0,0,0,0,0,1,0,0
3,0,1,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0
5,0,1,0,0,0,0,0,0


In [34]:
# one-hot encoding the owns_car column into a binary column
owns_car_new=new_customers[['owns_car']]
owns_car_new=pd.get_dummies(owns_car_new,drop_first=True)
owns_car_new.head()

Unnamed: 0,owns_car_Yes
1,1
2,0
3,0
4,1
5,0


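Converting the wealth_segment column into integer codes using a label encoder, since it is an ordinal categorical column. A separate encoder is fitted here, so the codes match the training data only if both datasets contain the same set of wealth_segment values.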

In [35]:
#Transforming using label_encoder
from sklearn.preprocessing import LabelEncoder
new_customers['wealth_segment']=LabelEncoder().fit_transform(new_customers['wealth_segment'])

In [36]:
#creating a new dataframe with numerical values only
new_customers1=new_customers[['past_3_years_bike_related_purchases','tenure','Age','property_valuation','wealth_segment']]

In [37]:
#Concatenating transformed categorical columns with the new_customer numerical dataframe
new_customers1=pd.concat([gender_new,job_industry_category_new,owns_car_new,new_customers1],axis=1)
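
Encoding the new customers separately works only if the resulting columns match the training columns in both name and order. In this notebook they do, but a category missing from the new data would silently produce a different layout. A minimal sketch of a guard, using toy data (`train_columns` and `new_encoded` are illustrative assumptions):

```python
import pandas as pd

# Column layout the model was trained on (shortened for illustration)
train_columns = ["gender_Male", "job_industry_category_IT", "owns_car_Yes", "tenure"]

# Toy encoded frame for new customers: one training column is missing
# and the remaining columns are in a different order
new_encoded = pd.DataFrame({
    "tenure": [14, 16],
    "owns_car_Yes": [1, 0],
    "gender_Male": [1, 1],
})

# Reorder to the training layout and fill absent dummy columns with 0
aligned = new_encoded.reindex(columns=train_columns, fill_value=0)
print(aligned.columns.tolist())
```

In the notebook this would amount to reindexing `new_customers1` against `old_customers1.columns` before calling `predict`.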

Now checking the two transformed datasets

In [38]:
old_customers1.shape

(19125, 15)

In [39]:
old_customers1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19125 entries, 0 to 19124
Data columns (total 15 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   gender_Male                               19125 non-null  uint8  
 1   job_industry_category_Entertainment       19125 non-null  uint8  
 2   job_industry_category_Financial Services  19125 non-null  uint8  
 3   job_industry_category_Health              19125 non-null  uint8  
 4   job_industry_category_IT                  19125 non-null  uint8  
 5   job_industry_category_Manufacturing       19125 non-null  uint8  
 6   job_industry_category_Property            19125 non-null  uint8  
 7   job_industry_category_Retail              19125 non-null  uint8  
 8   job_industry_category_Telecommunications  19125 non-null  uint8  
 9   owns_car_Yes                              19125 non-null  uint8  
 10  past_3_years_bike_related_purchase

In [40]:
new_customers1.shape

(983, 15)

In [41]:
new_customers1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 983 entries, 1 to 1000
Data columns (total 15 columns):
 #   Column                                    Non-Null Count  Dtype 
---  ------                                    --------------  ----- 
 0   gender_Male                               983 non-null    uint8 
 1   job_industry_category_Entertainment       983 non-null    uint8 
 2   job_industry_category_Financial Services  983 non-null    uint8 
 3   job_industry_category_Health              983 non-null    uint8 
 4   job_industry_category_IT                  983 non-null    uint8 
 5   job_industry_category_Manufacturing       983 non-null    uint8 
 6   job_industry_category_Property            983 non-null    uint8 
 7   job_industry_category_Retail              983 non-null    uint8 
 8   job_industry_category_Telecommunications  983 non-null    uint8 
 9   owns_car_Yes                              983 non-null    uint8 
 10  past_3_years_bike_related_purchases       983 non

### Model building

We will train the ML model on the old customers dataset and predict on the new customers dataset

In [42]:
# Split the old_customers1 features and the RFM_loyalty_level labels into train and test sets
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(
    old_customers1,
    old_customers['RFM_loyalty_level'],
    test_size=0.25,
    random_state=10,
)

In [43]:
# Decision tree
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=10)
tree.fit(train_features, train_labels)

# Predict the labels for the test data
pred_labels_tree = tree.predict(test_features)

In [44]:
# Create the classification report
from sklearn.metrics import classification_report
class_rep_tree = classification_report(test_labels, pred_labels_tree)

In [45]:
#View the performance of the model
print("Decision Tree: \n", class_rep_tree)

Decision Tree: 
               precision    recall  f1-score   support

      Bronze       0.99      0.94      0.96       470
        Gold       1.00      1.00      1.00      1744
    Platinum       0.99      1.00      0.99      1826
      Silver       0.99      0.98      0.98       742

    accuracy                           0.99      4782
   macro avg       0.99      0.98      0.98      4782
weighted avg       0.99      0.99      0.99      4782



In [46]:
# Random forest classifier
from sklearn.ensemble import RandomForestClassifier
rs = RandomForestClassifier()
rs.fit(train_features, train_labels)

# Predict the labels for the test data
pred_labels_rs = rs.predict(test_features)

In [47]:
# Create the classification report
class_rep_rs = classification_report(test_labels, pred_labels_rs)

In [48]:
#View the performance of the model
print("RandomForestClassifier: \n", class_rep_rs)

RandomForestClassifier: 
               precision    recall  f1-score   support

      Bronze       1.00      0.92      0.96       470
        Gold       0.99      1.00      0.99      1744
    Platinum       0.99      1.00      0.99      1826
      Silver       0.99      0.98      0.99       742

    accuracy                           0.99      4782
   macro avg       0.99      0.98      0.98      4782
weighted avg       0.99      0.99      0.99      4782
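
The near-perfect scores above are worth a sanity check. The training frame appears to hold one row per transaction (19,125 rows, with a `transaction_id` column in the source file), while the features describe the customer. A random row-level split can therefore place rows of the same customer in both the train and the test set, which may inflate the scores. The sketch below shows a group-aware split on toy data; the arrays are illustrative assumptions, and in the notebook the grouping key would presumably be `customer_id`.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 6 transaction rows belonging to 3 customers (illustrative only)
features = np.arange(12).reshape(6, 2)
labels = np.array(["Gold", "Gold", "Silver", "Silver", "Bronze", "Bronze"])
customer_ids = np.array([101, 101, 102, 102, 103, 103])

# Hold out one whole customer (group) for testing
splitter = GroupShuffleSplit(n_splits=1, test_size=1, random_state=10)
train_idx, test_idx = next(splitter.split(features, labels, groups=customer_ids))

# No customer should appear on both sides of the split
overlap = set(customer_ids[train_idx]) & set(customer_ids[test_idx])
print(overlap)
```

Scores measured on such a split would be a closer proxy for performance on customers the model has never seen, which is the situation for the new customers dataset.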



Now I will use the decision tree model to predict segments for the new customers

In [49]:
# predict the new segments using decision tree model
output_label = tree.predict(new_customers1)

Now adding the predicted array to the new customers dataset as a dataframe column

In [50]:
#adding the predicted array as a dataframe column
new_customers['RFM_segments_predicted']=output_label.tolist()

In [51]:
#checking final results
new_customers[['first_name','last_name','gender','RFM_segments_predicted']]

Unnamed: 0,first_name,last_name,gender,RFM_segments_predicted
1,Chickie,Brister,Male,Gold
2,Morly,Genery,Male,Gold
3,Ardelis,Forrester,Female,Silver
4,Lucine,Stutt,Female,Platinum
5,Melinda,Hadlee,Female,Gold
...,...,...,...,...
996,Ferdinand,Romanetti,Male,Gold
997,Burk,Wortley,Male,Platinum
998,Melloney,Temby,Female,Bronze
999,Dickie,Cubbini,Male,Bronze
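
To turn the predictions into a target list for the marketing campaign, the predicted segments can be counted and filtered. This is a minimal sketch on toy data; the `predicted` frame is illustrative, and treating Platinum and Gold as the priority segments is an assumption rather than something the analysis above establishes.

```python
import pandas as pd

# Toy predictions standing in for new_customers (values are illustrative only)
predicted = pd.DataFrame({
    "first_name": ["A", "B", "C", "D"],
    "RFM_segments_predicted": ["Gold", "Silver", "Platinum", "Bronze"],
})

# Assumed priority: Platinum and Gold are the segments to target first
target_segments = ["Platinum", "Gold"]
targets = predicted[predicted["RFM_segments_predicted"].isin(target_segments)]

# Segment sizes, then the shortlisted customers
print(predicted["RFM_segments_predicted"].value_counts())
print(targets["first_name"].tolist())
```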
