# Product Recommendations

____

## Table of Contents
- [Importing libraries](#Importing-libraries)
- [Load data](#Load-data)
- [Data Cleaning and Preparation](#Data-Cleaning-and-Preparation)
- [Data Exploration & Visualization](#Data-Exploration-&-Visualization)
- [Machine Learning for Product Recommendations](#Machine-Learning-for-Product-Recommendations)

____

## Importing libraries 

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import squarify as sq

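If this cell fails with `ModuleNotFoundError: No module named 'squarify'`, install the package first (for example with `pip install squarify`) and re-run the cell.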

____

## Load data

Source: https://www.kaggle.com/hellbuoy/online-retail-k-means-hierarchical-clustering/data

In [None]:
df = pd.read_excel('../data/Online_Retail.xlsx')

### Check dimensions 

In [None]:
df.head(10)

In [None]:
df.shape

____

## Data Cleaning and Preparation

### Check data types

In [None]:
df.info()

### Drop some columns 

In [None]:
df.drop(['StockCode', 'CustomerID'], axis=1, inplace=True)

In [None]:
df.head()

### Eliminate the white spaces

In [None]:
df['Description'] = df['Description'].str.strip()

In [None]:
df.head()

### Check for nulls

In [None]:
df.isnull().any()

#### Drop nulls

In [None]:
df.dropna(inplace=True)

In [None]:
df.shape

### Remove returns (invoices whose InvoiceNo contains 'C')

In [None]:
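# Cast to str first: on a mixed int/str column, .str methods return NaN for the integers
df['InvoiceNo'].astype(str).str.contains('C').value_counts()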

In [None]:
df['InvoiceNo'] = df['InvoiceNo'].astype(str)

In [None]:
# Use ~ (boolean NOT) to invert the mask; unary minus is not supported on boolean Series
df = df[~df['InvoiceNo'].str.contains('C')]

In [None]:
df.shape
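The filter above can be checked on a small example. This is a minimal sketch on made-up invoice numbers, assuming that returned or cancelled invoices are marked with a leading `C`. It uses `startswith` rather than `contains`, which is a slightly stricter test than the one in the notebook.

```python
import pandas as pd

# Toy invoices (made up): integers for normal sales, 'C'-prefixed strings for returns.
toy = pd.DataFrame({
    'InvoiceNo': [536365, 'C536379', 536370, 'C536383'],
    'Quantity': [6, -1, 24, -1],
})

# Cast first so that every value is a string before using .str methods.
invoice_str = toy['InvoiceNo'].astype(str)
is_cancelled = invoice_str.str.startswith('C')

# ~ is the element-wise boolean NOT for a pandas Series.
kept = toy[~is_cancelled]
print(kept)
```

Only the two rows without the `C` marker remain in `kept`.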

### Remove duplicates

In [None]:
df = df.drop_duplicates()

In [None]:
df.shape

### Remove Postage Invoice

In [None]:
postage = df['Description'] == 'POSTAGE'
postage.value_counts()

In [None]:
df = df.drop(df[postage].index)

In [None]:
df.shape

### Check for unique country values

In [None]:
df['Country'].value_counts()

#### Filter for Germany, Spain, France, Netherlands & Belgium

In [None]:
country_list = ['Germany', 'France', 'Spain', 'Netherlands', 'Belgium']
df = df.loc[df['Country'].isin(country_list)].reset_index(drop=True)

In [None]:
df.shape

### Add a column for Total Prices

In [None]:
df['TotalPrice'] = df['Quantity']*df['UnitPrice']

In [None]:
df.head()

### Add date-derived columns

In [None]:
df.dtypes

In [None]:
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month
df['Hour'] = df['InvoiceDate'].dt.hour
df['DayOfWeek'] = df['InvoiceDate'].dt.dayofweek
df['DayName'] = df['InvoiceDate'].dt.day_name()

In [None]:
df.sample(5)

#### Order by day of the week

In [None]:
day_names = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
df['DayName'] = pd.Categorical(df['DayName'], categories = day_names, ordered = True )

In [None]:
df.head()
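The effect of the ordered categorical can be seen on a tiny example. This is a sketch with made-up values; it only illustrates why the conversion matters for sorting and for the axis order in later plots.

```python
import pandas as pd

day_names = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
             'Thursday', 'Friday', 'Saturday']

days = pd.Series(['Tuesday', 'Sunday', 'Friday'])

# Plain strings sort alphabetically.
plain_sorted = days.sort_values().tolist()

# An ordered categorical sorts by position in `day_names`.
ordered = pd.Categorical(days, categories=day_names, ordered=True)
sorted_days = pd.Series(ordered).sort_values().tolist()

print(plain_sorted)
print(sorted_days)
```

The alphabetical sort puts Friday first, while the categorical sort follows the calendar order given in `day_names`.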

____

## Data Exploration & Visualization

#### Sales by days

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(x = 'DayName', y = 'TotalPrice', data=df)
plt.title('Sales by days')
plt.xlabel('Day')
plt.ylabel('Sales')
plt.show()

There appear to be no sales on Saturdays. The best days for sales are Tuesday and Thursday.
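One caveat when reading this chart: `sns.barplot` aggregates with the mean by default, so each bar is the average `TotalPrice` per invoice line, not the total revenue for that day. The sketch below, on made-up numbers, shows how the two aggregations can rank days differently.

```python
import pandas as pd

# Made-up invoice lines: Monday has many small lines, Tuesday one large line.
toy = pd.DataFrame({
    'DayName': ['Monday', 'Monday', 'Monday', 'Tuesday'],
    'TotalPrice': [10.0, 20.0, 30.0, 40.0],
})

mean_per_day = toy.groupby('DayName')['TotalPrice'].mean()  # what barplot shows by default
sum_per_day = toy.groupby('DayName')['TotalPrice'].sum()    # total revenue per day

print(mean_per_day)
print(sum_per_day)
```

Here Tuesday wins on the mean while Monday wins on the total. To plot totals instead, one option is to pass `estimator=sum` to `sns.barplot`, or to plot the grouped sums directly.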

#### Sales by Month

In [None]:
df_pivot = df.pivot_table(index='Month', columns='DayName', values='TotalPrice', aggfunc='mean')
df_pivot

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_pivot, linewidths=1, annot=True)
plt.title('Avg Revenue')
plt.xlabel('Days of the Week')
plt.ylabel('Month')
plt.show()

There is no clear pattern across months and days of the week.

#### 10 Most popular items

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
# value_counts() already sorts in descending order, so no extra sort is needed
df['Description'].value_counts().head(10).plot.bar(ax=ax)
plt.title('Top 10 Most Popular Items')
plt.xlabel('Description')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.figure(figsize=(16,8))
x = df['Description'].value_counts().head(10)
color = ['lime', 'pink', 'lightgreen', 'yellow', 'orange', 'red', 'lightblue', 'cyan', 'azure']
sq.plot(sizes=x, label=x.index, color=color).axis('off')
plt.title('Top 10 Most Popular Items')
plt.show()

____

## Machine Learning for Product Recommendations

### Items sold together

#### Keep only InvoiceNo & Description

In [None]:
df = df[['InvoiceNo', 'Description']]

In [None]:
df.head()

In [None]:
df = df.groupby('InvoiceNo').agg(','.join).reset_index()

In [None]:
df.head()

In [None]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

In [None]:
item_list = [item.split(',') for item in df.Description]
item_list
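The join-then-split round trip above works only if no description contains a comma. The sketch below uses made-up rows to show the failure mode and one possible alternative, which collects each invoice's descriptions into a list directly. It would have to run on the per-line frame, before the `groupby(...).agg(','.join)` step.

```python
import pandas as pd

# Made-up invoice lines; one description contains a comma.
toy = pd.DataFrame({
    'InvoiceNo': ['1', '1', '2', '2'],
    'Description': ['LUNCH BOX', 'MUG, LARGE', 'LUNCH BOX', 'LUNCH BOX'],
})

# Join then split: the comma inside 'MUG, LARGE' produces two fragments.
joined = toy.groupby('InvoiceNo')['Description'].agg(','.join)
split_items = [s.split(',') for s in joined]

# Alternative: build the list of items per invoice without any string round trip.
direct_items = toy.groupby('InvoiceNo')['Description'].apply(list).tolist()

print(split_items[0])
print(direct_items[0])
```

The direct version keeps `'MUG, LARGE'` as a single item.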

### Encode

In [None]:
te = TransactionEncoder()
te_array = te.fit(item_list).transform(item_list)
te_array

In [None]:
item_df = pd.DataFrame(te_array, columns=te.columns_)
item_df
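Conceptually, the encoding step turns each transaction into a row of booleans with one column per distinct item. The plain-Python sketch below mirrors that idea on made-up baskets. It is an illustration only, not mlxtend's implementation.

```python
# Made-up transactions: each inner list is one invoice.
transactions = [
    ['MUG', 'LUNCH BOX'],
    ['LUNCH BOX'],
    ['MUG', 'PLATE'],
]

# One column per distinct item, in sorted order.
columns = sorted({item for basket in transactions for item in basket})

# One boolean row per transaction: True where the item is in the basket.
rows = [[item in basket for item in columns] for basket in transactions]

print(columns)
for row in rows:
    print(row)
```

Each row has the same length as `columns`, which is what makes the result usable as a table of item indicators.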

In [None]:
item_df[''].value_counts()

In [None]:
item_df = item_df.drop('', axis=1)

In [None]:
item_df.shape

### Example queries

In [None]:
# The encoded columns are already boolean, so they can be used directly as masks
space_boy_df = item_df[item_df['SPACEBOY LUNCH BOX']]

In [None]:
space_boy_df[['SPACEBOY LUNCH BOX']]

In [None]:
space_dolly_df = item_df[item_df['SPACEBOY LUNCH BOX'] & item_df['DOLLY GIRL LUNCH BOX']]
space_dolly_df[['SPACEBOY LUNCH BOX', 'DOLLY GIRL LUNCH BOX']]

### Apriori algorithm 
Apriori is an algorithm for mining frequent itemsets (sets of products that are often bought together) and deriving association rules from them. It is typically run on a database containing a large number of transactions.
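The core idea can be sketched in a few lines of plain Python. The sketch relies on the fact that an itemset can only be frequent if all of its subsets are frequent, so candidates of size k are built only from frequent itemsets of size k-1. The function name `apriori_sketch` and the baskets are made up for illustration; the notebook itself uses mlxtend's `apriori`, which is the better choice for real data.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items.
    items = {item for basket in baskets for item in basket}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join: unions of frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

toy = [['CUP', 'BOWL', 'PLATE'], ['CUP', 'BOWL'], ['CUP', 'PLATE'], ['BOWL']]
result = apriori_sketch(toy, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), sup)
```

In this toy run the three-item candidate is discarded in the prune step because the pair `{'BOWL', 'PLATE'}` is not frequent, so its support never has to be counted.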

In [None]:
pd.set_option('display.max_colwidth', None)

### Minimum support: 10% of transactions

In [None]:
freq_items = apriori(item_df, min_support = 0.1, use_colnames=True)
freq_items.sort_values('support', ascending=False)

### Minimum support: 5% of transactions

In [None]:
freq_items = apriori(item_df, min_support = 0.05, use_colnames=True)
freq_items.sort_values('support', ascending=False)

### Minimum support: 2% of transactions

In [None]:
freq_items = apriori(item_df, min_support = 0.02, use_colnames=True)
freq_items.sort_values('support', ascending=False)

### Association Rules

#### Support

In [None]:
assoc_rules = association_rules(freq_items, metric='support', min_threshold = 0.10)
assoc_rules

#### Confidence 

In [None]:
assoc_rules = association_rules(freq_items, metric='confidence', min_threshold = 1)
assoc_rules

In [None]:
assoc_rules = association_rules(freq_items, metric='confidence', min_threshold = 0.9)
assoc_rules

#### Lift

In [None]:
assoc_rules = association_rules(freq_items, metric='lift', min_threshold = 28)
assoc_rules
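The three metrics used above can be computed by hand for a single rule. The sketch below uses made-up baskets and the standard definitions: support is the share of baskets containing the itemset, confidence is support(A and B) / support(A), and lift is confidence / support(B).

```python
# Made-up baskets (8 in total).
baskets = [
    {'CUP', 'BOWL'}, {'CUP', 'BOWL'}, {'CUP'}, {'CUP'},
    {'PLATE'}, {'PLATE'}, {'PLATE'}, {'PLATE'},
]

def support(itemset):
    # Fraction of baskets that contain every item of the itemset.
    return sum(itemset <= basket for basket in baskets) / len(baskets)

antecedent = {'CUP'}
consequent = {'BOWL'}

rule_support = support(antecedent | consequent)   # baskets with both CUP and BOWL
confidence = rule_support / support(antecedent)   # how often BOWL appears given CUP
lift = confidence / support(consequent)           # confidence relative to BOWL's base rate

print(rule_support, confidence, lift)
```

A lift above 1 means the consequent shows up with the antecedent more often than its overall frequency would suggest. A lift threshold as high as the one used above therefore keeps only item pairs that are rare overall but almost always bought together.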

### Filter

In [None]:
assoc_rules = association_rules(freq_items, metric='support', min_threshold = 0.02)
assoc_rules.shape

In [None]:
assoc_rules[(assoc_rules['confidence'] >= 0.9) & (assoc_rules['lift'] >= 25)].sort_values('confidence', ascending=False)

In [None]:
plt.figure(figsize=(12,8))
sns.scatterplot(x = assoc_rules['support'], y = assoc_rules['confidence'], hue = assoc_rules['lift'], s=100)
plt.title('Support vs Confidence')
plt.xlabel('Support')
plt.ylabel('Confidence')
plt.show()

____

## Make Recommendations

In [None]:
# .iloc[0] takes the first row by position, regardless of the index labels
type(assoc_rules['antecedents'].iloc[0])

In [None]:
assoc_rules[assoc_rules['antecedents'] == {'SPACEBOY CHILDRENS BOWL'}]

In [None]:
assoc_rules[assoc_rules['antecedents'] == {'SPACEBOY CHILDRENS CUP', 'DOLLY GIRL CHILDRENS CUP'}]

In [None]:
assoc_rules[assoc_rules['consequents'] == {'ALARM CLOCK BAKELIKE RED'}]
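The lookups above match the antecedent set exactly. For a basket that holds several items, a looser variant fires every rule whose antecedents are all in the basket. The sketch below shows that idea on hand-written rules. The `recommend` helper, the rule values and the ranking by confidence are assumptions made for illustration, not part of mlxtend or of the original analysis.

```python
# Hand-written rules (made up), shaped like rows of an association-rules table.
rules = [
    {'antecedents': frozenset({'CUP'}), 'consequents': frozenset({'BOWL'}), 'confidence': 0.9},
    {'antecedents': frozenset({'CUP'}), 'consequents': frozenset({'PLATE'}), 'confidence': 0.6},
    {'antecedents': frozenset({'CUP', 'BOWL'}), 'consequents': frozenset({'SPOON'}), 'confidence': 0.8},
    {'antecedents': frozenset({'MUG'}), 'consequents': frozenset({'CUP'}), 'confidence': 0.7},
]

def recommend(basket, rules, top_n=3):
    """Return up to top_n items suggested by rules whose antecedents are in the basket."""
    basket = frozenset(basket)
    scores = {}
    for rule in rules:
        if rule['antecedents'] <= basket:
            # Only suggest items the customer does not already have.
            for item in rule['consequents'] - basket:
                scores[item] = max(scores.get(item, 0.0), rule['confidence'])
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend({'CUP'}, rules))
print(recommend({'CUP', 'BOWL'}, rules))
```

With a real rules DataFrame, the same logic could be applied by iterating over `assoc_rules.to_dict('records')`.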

## Frequent items

In [None]:
freq_items = apriori(item_df, min_support = 0.01, use_colnames=True)
freq_items

In [None]:
assoc_rules = association_rules(freq_items, metric='support', min_threshold = 0.01)
assoc_rules.sort_values('support', ascending =False)

In [None]:
round_snacks = assoc_rules[assoc_rules['antecedents'] == {'ROUND SNACK BOXES SET OF4 WOODLAND'}]
round_snacks = round_snacks.sort_values('support', ascending =False).head(5)
round_snacks

____

## Top 5 Product Recommendations 

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(x = 'support', y = 'consequents', data = round_snacks, color='red')
plt.title('Top 5 Recommendations')
plt.xlabel('Support')
plt.ylabel('Recommended item')
plt.show()