# Developing two recommendation engines based on product pairs and customer purchases

For this project I analyse a dataset from an online store and use it to build two recommendation engines for the products that the store sells. The two engines will:
1. Identify other products commonly sold with a particular item
2. Identify other products that a customer has frequently purchased in the past

The main part of the project is carried out in Tableau Public. This Python notebook contains the code that transforms the initial dataset into the reference tables needed by the two recommendation engines.

In [1]:
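# set up the working environment
import pandas as pd
# pandas_profiling has since been renamed ydata-profiling;
# newer installs use `from ydata_profiling import ProfileReport`
from pandas_profiling import ProfileReport
import numpy as np
import os
from itertools import permutations, combinations
from collections import Counter
import datetime as dt
# third-party random name generator (pip install names)
import names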

## Dataset
The dataset I am using for this project is the Online_Retail dataset available from https://www.kaggle.com/datasets/lakshmi25npathi/online-retail-dataset, which has information about approximately 500k orders received by an online store between December 2010 and December 2011.

In [2]:
# load the dataset
dataset = pd.read_csv('Online_Retail.csv')

## Cleaning the dataset

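Because I am focussed here on the products purchased by customers, I will:
- drop any orders where there is no description for a product
- drop any product with a price of 1000 or more, as these are admin items, e.g. postage
- drop any products without a meaningful UnitPrice (0.10 or less)
- drop any orders where there is no CustomerID
- drop the orders relating to 'DOTCOM POSTAGE'
- drop cancelled orders, whose InvoiceNo starts with a 'C'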

In [3]:
# clean the dataset

# drop any orders where there is no description for a product
dataset = dataset.dropna(axis=0, subset=['Description'])
# drop products where the price is 1000 or more - these are admin items, e.g. postage
dataset = dataset[dataset['UnitPrice'] < 1000]
# drop items without a meaningful UnitPrice (0.10 or less)
dataset = dataset[dataset['UnitPrice'] > 0.1]
# drop any orders where there is no CustomerID
dataset = dataset[dataset['CustomerID'].notna()]
# remove the orders relating to 'DOTCOM POSTAGE'
dataset = dataset[dataset['Description'] != 'DOTCOM POSTAGE']
# drop any orders where the InvoiceNo starts with a 'C' - these were cancelled orders
dataset = dataset[~dataset['InvoiceNo'].astype(str).str.startswith('C')]
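
As a quick sanity check on the filters above, the sketch below applies the same rules to a tiny hand-made frame. The rows and the `clean` helper are illustrative assumptions and not part of the notebook; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows that mimic the Online_Retail columns used above
toy = pd.DataFrame({
    'InvoiceNo':   ['536365', '536366', 'C536379', '536370', '536371', '536372'],
    'Description': ['WHITE HANGING HEART', None, 'WHITE HANGING HEART',
                    'DOTCOM POSTAGE', 'Manual', 'JAM MAKING SET'],
    'UnitPrice':   [2.55, 1.85, 2.55, 15.00, 1500.00, 0.0],
    'CustomerID':  [17850.0, 17850.0, 14527.0, 12583.0, 12583.0, None],
})

def clean(df):
    """Apply the same filters as the cleaning cell, in the same order."""
    df = df.dropna(subset=['Description'])
    df = df[df['UnitPrice'] < 1000]
    df = df[df['UnitPrice'] > 0.1]
    df = df[df['CustomerID'].notna()]
    df = df[df['Description'] != 'DOTCOM POSTAGE']
    df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
    return df

cleaned = clean(toy)
print(f"{len(toy) - len(cleaned)} of {len(toy)} toy rows removed")
```

Running the same helper on the real data and comparing row counts before and after each step is one way to see which filter removes the most orders.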

In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354031 entries, 0 to 495477
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    354031 non-null  object 
 1   StockCode    354031 non-null  object 
 2   Description  354031 non-null  object 
 3   Quantity     354031 non-null  int64  
 4   InvoiceDate  354031 non-null  object 
 5   UnitPrice    354031 non-null  float64
 6   CustomerID   354031 non-null  float64
 7   Country      354031 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 24.3+ MB


## Look at a summary of the dataset

In [None]:
profile = ProfileReport(dataset)
profile


## Working on the orders information

In [None]:
# Add a total cost for each product to the orders table
dataset['Product_cost'] = dataset['Quantity']*dataset['UnitPrice']
dataset

In [None]:
order_totals = dataset.groupby('InvoiceNo')['Product_cost'].sum()
order_totals=order_totals.reset_index()
order_totals.columns = ['InvoiceNo', 'Total_cost']
order_totals.head()
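
To make the aggregation concrete, here is a minimal sketch on made-up rows: each line's cost is quantity times unit price, and the lines are then summed per invoice. The values are invented for illustration, and `reset_index(name=...)` is used as a one-step alternative to renaming the columns afterwards.

```python
import pandas as pd

# Hypothetical order lines: two lines on one invoice, one line on another
toy = pd.DataFrame({
    'InvoiceNo': ['536365', '536365', '536366'],
    'Quantity':  [6, 2, 10],
    'UnitPrice': [2.50, 4.00, 1.25],
})
toy['Product_cost'] = toy['Quantity'] * toy['UnitPrice']

# Sum the line costs per invoice and name the resulting column in one step
toy_totals = (
    toy.groupby('InvoiceNo')['Product_cost']
       .sum()
       .reset_index(name='Total_cost')
)
print(toy_totals)
```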

## Including customer's names

Customer IDs are so impersonal, so let's use a random name generator to put names against the numbers!

In [None]:
dataset.head()

In [None]:
# Create the new column that will contain the names
dataset['CustomerName'] = ''

In [None]:
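# Create a table with one row per unique CustomerID.
# The second column initially holds each customer's row count; it is only a
# placeholder and is overwritten with generated names in the next cell.
customer_names = dataset.groupby(['CustomerID']).size()
customer_names = customer_names.reset_index()
customer_names.columns = ['CustomerID', 'CustomerName']
customer_names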

In [None]:
# Create a function that returns a random full name.
# The argument is ignored; it exists only so the function can be used with .apply()
def generate(_):
    return names.get_full_name()

# Run the function on the customer_names dataframe, giving one random name per CustomerID
customer_names['CustomerName'] = customer_names['CustomerName'].apply(generate)

In [None]:
# Check it has worked
customer_names

In [None]:
# Update the CustomerName column in the original dataset table with the relevant names from the customer_names table
name_lookup = customer_names.set_index('CustomerID')['CustomerName']
dataset['CustomerName'] = dataset['CustomerID'].map(name_lookup).fillna('Unknown')
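
One thing to watch with random names is that two different customers could be given the same name, which would merge them in any later grouping by `CustomerName`. The sketch below shows one possible way to guard against that. The `unique_names` helper and the small name pools are assumptions made for illustration (they stand in for `names.get_full_name()`), and the seed is only there so reruns give the same result.

```python
import random
import pandas as pd

# Stand-in name pools (assumption); the notebook uses names.get_full_name() instead
FIRST = ['Ann', 'Ben', 'Cara', 'Dev']
LAST = ['Stone', 'Reed', 'Vale', 'Marsh']

def unique_names(ids, seed=0):
    """Return {id: name} with no name used twice; seeded so reruns agree.

    Only suitable when there are fewer ids than possible names (16 here).
    """
    rng = random.Random(seed)
    used, mapping = set(), {}
    for cid in ids:
        name = f"{rng.choice(FIRST)} {rng.choice(LAST)}"
        while name in used:  # redraw on collision
            name = f"{rng.choice(FIRST)} {rng.choice(LAST)}"
        used.add(name)
        mapping[cid] = name
    return mapping

# Hypothetical order lines: customer 17850 appears twice
toy = pd.DataFrame({'CustomerID': [17850.0, 13047.0, 17850.0, 12583.0]})
name_map = unique_names(toy['CustomerID'].unique())
toy['CustomerName'] = toy['CustomerID'].map(name_map).fillna('Unknown')
print(toy)
```

Repeated rows for the same customer receive the same name, and distinct customers never share one.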

In [None]:
# check it has worked
dataset

## Saving the updated Online_Retail file and the new order_totals file

In [None]:
# save the cleansed and transformed Online-Retail dataset and the order_totals dataset as CSV files
dataset.to_csv(r'C:\Users\annsc\OneDrive\Documents\3 - Data Science work\3 - Product Popularity Recommendation Engine\Online_Retail_cleansed.csv')
order_totals.to_csv(r'C:\Users\annsc\OneDrive\Documents\3 - Data Science work\3 - Product Popularity Recommendation Engine\order_totals.csv')

## Working on the products information

#### Creating a function that finds all products that were purchased together. 

In [None]:
# create a function that finds all items that were purchased together, listing them in two columns (Item 'A' and Item 'B')
def find_pairs(x):
    pairs = pd.DataFrame(list(permutations(x.values,2)), columns=['A', 'B'])
    return pairs
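
It may help to see what `permutations` produces for a single invoice. With length 2 it yields every ordered pair, so each pairing appears in both directions, (A, B) and (B, A). That suits a lookup keyed on column A, at the cost of doubling the rows; `combinations` would give each unordered pair once. The basket contents below are hypothetical.

```python
from itertools import combinations, permutations

basket = ['TEA SET', 'CAKE STAND', 'NAPKINS']  # hypothetical single invoice

ordered = list(permutations(basket, 2))    # both (A, B) and (B, A)
unordered = list(combinations(basket, 2))  # each pair once

print(len(ordered))    # 3 * 2 = 6 ordered pairs
print(len(unordered))  # 3 choose 2 = 3 unordered pairs

# permutations works on positions, not values, so a product listed twice
# on one invoice ends up paired with itself
dupes = list(permutations(['TEA SET', 'TEA SET'], 2))
print(dupes)
```

If self-pairs are unwanted, one option is to drop duplicate descriptions within each invoice before building the pairs.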

#### Run the function on the dataset

In [None]:
# Group the products by InvoiceNo then apply the function
dataset_combo = dataset.groupby('InvoiceNo')['Description'].apply(find_pairs).reset_index(drop=True)
dataset_combo.head(20)

#### Calculate the frequency of each pairing of products

In [None]:
# Calculate the frequency of item_A being purchased with item_B
dataset_combo2 = dataset_combo.groupby(['A', 'B']).size()
dataset_combo2
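
The same pair-frequency idea can be sketched without pandas, using `Counter` from the standard library. This is an illustrative alternative, not the notebook's method: the invoices are made up, and unlike the cell above it deduplicates products within an invoice before pairing.

```python
from collections import Counter
from itertools import permutations

# Hypothetical invoices mapped to the products on each
invoices = {
    '536365': ['TEA SET', 'CAKE STAND', 'NAPKINS'],
    '536366': ['TEA SET', 'CAKE STAND'],
    '536367': ['NAPKINS'],  # a single-item invoice produces no pairs
}

pair_counts = Counter()
for items in invoices.values():
    # set() so a product repeated on one invoice is counted once
    pair_counts.update(permutations(sorted(set(items)), 2))

# How often was CAKE STAND bought on the same invoice as TEA SET?
print(pair_counts[('TEA SET', 'CAKE STAND')])
print(pair_counts.most_common(2))
```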

#### Format the results from the pairing and frequency table into something to work with in Tableau

In [None]:
# create a sorted dataframe by the most frequent combinations
products_combo=dataset_combo2.reset_index()
products_combo.columns = ['A', 'B', 'Frequency']
products_combo.sort_values(by='Frequency', ascending=False, inplace=True)
products_combo.head()

#### Export the final products table as a csv file

In [None]:
products_combo.to_csv(r'C:\Users\annsc\OneDrive\Documents\3 - Data Science work\3 - Product Popularity Recommendation Engine\Product_pairs.csv')

#### Create a table that maps products against their prices.
The same item appears at several different prices across the dataset, so I shall use the maximum price for each item here.

In [None]:
# create a table of the product and its price
product_prices = dataset.groupby('Description')['UnitPrice'].max()
product_prices = pd.DataFrame(data=product_prices).rename(columns={"UnitPrice": "Unit Price (excl tax)"})
product_prices
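
Before settling on the maximum, it can be useful to see how much the prices for an item actually vary. The sketch below, on invented rows, summarises the minimum, maximum and number of distinct prices per product; the `price_summary` name is hypothetical.

```python
import pandas as pd

# Hypothetical rows: the same product sold at different prices
toy = pd.DataFrame({
    'Description': ['TEA SET', 'TEA SET', 'NAPKINS', 'NAPKINS', 'NAPKINS'],
    'UnitPrice':   [4.95, 5.45, 0.85, 0.85, 1.25],
})

# One row per product with its lowest price, highest price and count of distinct prices
price_summary = toy.groupby('Description')['UnitPrice'].agg(['min', 'max', 'nunique'])
print(price_summary)
```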

#### Export the product prices table for use in Tableau

In [None]:
product_prices.to_csv(r'C:\Users\annsc\OneDrive\Documents\3 - Data Science work\3 - Product Popularity Recommendation Engine\Product_prices.csv')

## Working on the customers information

#### Find all of the different products that each customer has purchased, with the relevant frequency

In [None]:
customer_purchases = dataset.groupby(['CustomerName', 'Description']).size()
customer_purchases

#### Put this into a suitable format for Tableau to work with

In [None]:
# create a sorted dataframe by the most frequent products bought by customers
customer_purchases=customer_purchases.reset_index()
customer_purchases.columns = ['CustomerName', 'Products purchased', 'Frequency']
customer_purchases.sort_values(by='Frequency', ascending=False, inplace=True)

#### Looking at the results for a particular customer

In [None]:
# Let's look at a particular customer's top 10 most frequently purchased products
customer_purchases[customer_purchases['CustomerName']=='Angela Riles'].head(10)
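
The lookup above handles one customer at a time. As a rough sketch of how the same top-N idea could be applied to every customer at once, the example below sorts by frequency and keeps the first few rows in each group. The customer names and counts are invented, and N is set to 2 to keep the output short.

```python
import pandas as pd

# Hypothetical rows in the same shape as customer_purchases
toy = pd.DataFrame({
    'CustomerName': ['Ann Stone', 'Ann Stone', 'Ann Stone', 'Ben Reed', 'Ben Reed'],
    'Products purchased': ['TEA SET', 'NAPKINS', 'CAKE STAND', 'TEA SET', 'NAPKINS'],
    'Frequency': [5, 2, 7, 1, 3],
})

top_n = (
    toy.sort_values('Frequency', ascending=False)
       .groupby('CustomerName')
       .head(2)  # keep each customer's 2 most frequent products
       .sort_values(['CustomerName', 'Frequency'], ascending=[True, False])
)
print(top_n)
```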

#### Exporting the customer table for use in Tableau

In [None]:
customer_purchases.to_csv(r'C:\Users\annsc\OneDrive\Documents\3 - Data Science work\3 - Product Popularity Recommendation Engine\Customer_purchases.csv')