# Data understanding

In [None]:
#Used for displaying plots below the cell
%matplotlib inline
import math
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

from collections import defaultdict
from scipy.stats import pearsonr

In [None]:
df = pd.read_csv('customer_supermarket.csv', sep='\t', index_col=0)

In [None]:
df.head()

The dataset seems to contain data about the shopping habits of the customers of a grocery store chain.  
Each row represents an object purchased:  
- BasketID: represents a batch of items bought during the same shopping session  
- BasketDate: date in which the shopping session took place  
- Sale: represents the value of the item; we need to figure out whether it refers to a single item or to item * quantity
- CustomerID: identifies a unique customer
- ProdID: identifies a unique product for sale
- ProdDescr: describes the product
- Qta: number of items with id ProdID bought

In [None]:
df.info()

In [None]:
len(df.index)

Only ProdDescr and CustomerID contain null values.

In [None]:
df.describe()

The statistics regarding the CustomerID are meaningless since the assignment of an ID is usually done progressively and without having any additional information on the customer.  
We need to fix the data type situation in order to get a better understanding of the data set.

## Data type conversion  
Let's start by checking out the data type that pandas assigns to the attributes, in order to get an idea of the potential problems.

In [None]:
df.dtypes

In [None]:
df = df.convert_dtypes()

In [None]:
df.dtypes

### CustomerID

CustomerID got converted to a reasonable data type while the others became a generic "string".  
However, we don't care about CustomerID as a number, since it is only an identifier.

In [None]:
df["CustomerID"] = df["CustomerID"].astype("string")

### BasketDate
Let's convert the BasketDate type from String to datetime, just in case we need to perform some analysis that requires ordinal data.

In [None]:
df.BasketDate = pd.to_datetime(df.BasketDate)
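
When the date strings are ambiguous (day first or month first), letting pandas guess can silently swap days and months. The following is a minimal sketch on made-up strings, assuming a day-first layout such as `01/12/10 08:26`; the layout actually used by BasketDate is an assumption here and should be checked on the raw column before adopting an explicit `format`.

```python
import pandas as pd

# Made-up date strings; the day-first layout is an assumption, not taken from the data set
raw_dates = pd.Series(["01/12/10 08:26", "13/01/11 10:00"])

# An explicit format avoids any day/month guessing
parsed = pd.to_datetime(raw_dates, format="%d/%m/%y %H:%M")
print(parsed)
```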

### Sale

The "Sale" attribute is considered a generic object while it should be recognised as a float.  
Let's see why.

In [None]:
df.Sale.map(type)

In [None]:
df.Sale

It seems that Sale uses a comma instead of a point to separate the decimal part, so it is considered a "str" instead of a "float64".  
Let's replace the commas in "Sale" with dots in order to have them be recognised as float64 by pandas.

In [None]:
df.Sale = df.Sale.apply(lambda x: x.replace(',','.'))

In [None]:
df.Sale = df.Sale.astype("float64")

Sale is now correctly identified as a float64.
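
As a possible alternative, the decimal separator can be declared when reading the file, so that the column is parsed as a float from the start. The sketch below uses a small in-memory sample (made-up values) instead of `customer_supermarket.csv`; only the `decimal` argument of `pd.read_csv` is the point here.

```python
import io
import pandas as pd

# Tab-separated sample with comma decimals (values are made up)
sample_text = "BasketID\tSale\tQta\n536365\t2,5\t6\n536366\t3,25\t-1\n"

# decimal="," tells the parser that the comma is the decimal separator
sample = pd.read_csv(io.StringIO(sample_text), sep="\t", decimal=",")
print(sample.dtypes)
```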

## Data exploration

### Exploration data frame
Used for exploration purposes but not necessarily useful for clustering.  
Initialised with some additional features that could prove useful.

In [None]:
#Auxiliary df to be used throughout the data understanding phase
df_expl = df[["BasketID", "Qta", "Sale"]].copy()

df_expl["QtaPositive"] = 0
df_expl.loc[df_expl["Qta"] > 0, "QtaPositive"] = 1 #Indicates whether the record's Qta is positive

df_expl["SalePositive"] = 0
df_expl.loc[df_expl["Sale"] > 0, "SalePositive"] = 1 #Indicates whether the record's Sale is positive

df_expl.head()

### BasketID

Let's check why BasketID is not considered an int64 like CustomerID.

In [None]:
nonNumSeries = pd.to_numeric(df.BasketID, errors='coerce').isnull()
# Print the records with BasketIDs containing a non-numeric value
df[nonNumSeries].head()

In [None]:
df.loc[nonNumSeries, "BasketID"].str.slice(0,1).unique()

It seems that a good chunk of the BasketID values start with a "C" and some with "A" instead of being just numbers.  

In [None]:
basket_c_df = df.loc[df["BasketID"].str.get(0) == "C"]
len_basket_c = len(basket_c_df)
print(f"Records starting with 'C' (Size: {len_basket_c}):\n")
basket_c_df.head(5)

In [None]:
basket_a_df = df.loc[df["BasketID"].str.get(0) == "A"]
len_basket_a = len(basket_a_df)
print(f"Records starting with 'A' (Size: {len_basket_a}):\n")
basket_a_df.head(10)

There seems to be a strong association between the "C" prefix and a negative quantity; this could indicate a customer who asked for a refund.  

There is also an interesting association between the "A" prefix and a ProdDescr containing "Adjust bad debt": maybe the "A" stands for "adjust". Since the CustomerID in both cases is NaN, this could be an operation that concerns only the management of the shop and not the customers (who are our primary focus).  
These records, however, are too few to be meaningful, they heavily skew the distribution of the Sale data (outliers) and they don't concern the activities of the customers.

Let's try to add a "BasketID type A" and "BasketID type C" binary attribute (0/1) and see if there are correlations.

In [None]:
#Initialise all the cells to 0
df_expl["BasketIDTypeA"] = 0
df_expl["BasketIDTypeC"] = 0

#Set the cells appropriately depending on the BasketID type
df_expl.loc[df["BasketID"].str.get(0) == "A", "BasketIDTypeA"] = 1
df_expl.loc[df["BasketID"].str.get(0) == "C", "BasketIDTypeC"] = 1

df_expl["NewBasketID"] = df_expl["BasketID"]

#Remove the initial letter from BasketID where necessary
has_letter = df_expl["BasketID"].str.get(0).isin(["A", "C"])
df_expl.loc[has_letter, "NewBasketID"] = df_expl.loc[has_letter, "BasketID"].str.slice(start=1)

df_expl.corr(numeric_only=True) #Skip the string columns, which recent pandas versions refuse to correlate

The BasketID of type C has a strong negative correlation with the sign of Qta.
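
A contingency table can make this relationship easier to read than a correlation coefficient. The sketch below runs on a handful of made-up records (not the real data set) and cross-tabulates the BasketID prefix against the sign of Qta; the names `basket_type` and `qta_sign` are introduced only for this illustration.

```python
import pandas as pd

# Toy records standing in for the real data set (values are made up)
toy = pd.DataFrame({
    "BasketID": ["536365", "C536379", "536380", "C536383", "A563185"],
    "Qta": [6, -1, 24, -3, 1],
})

# Keep the first character when it is a letter, otherwise label the basket "numeric"
first_char = toy["BasketID"].str.get(0)
basket_type = first_char.where(first_char.str.isalpha(), "numeric")

# Sign of the quantity as a readable label
qta_sign = (toy["Qta"] > 0).map({True: "positive", False: "non-positive"})

table = pd.crosstab(basket_type, qta_sign, rownames=["BasketType"], colnames=["QtaSign"])
print(table)
```

If every type C row falls in the "non-positive" column, the prefix fully determines the sign for those records.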

In [None]:
df.loc[df["BasketID"].str.get(0) == "C", "ProdDescr"].unique()

What could this mean for the C type? It probably indicates discounts/refunds.

In [None]:
df_expl["NewBasketID"] = df_expl["NewBasketID"].astype("int64")
df_expl.info()

There are no more anomalies inside BasketID, since it can now be converted to int64.

In [None]:
df_expl["NewBasketID"] = df_expl["NewBasketID"].astype("string")

Let's check if we now have fewer unique BasketIDs in our records, after removing the type from the BasketID attribute.

In [None]:
print(f'The original number of unique BasketIDs is: {df_expl["BasketID"].unique().size}')
print(f'The current number of unique BasketIDs is: {df_expl["NewBasketID"].unique().size}')

The number is the same, therefore no BasketID of type A or C merged with a pre-existing shopping session.  
It could prove useful to take into account the BasketDate and see if it would make sense to merge the type C records with the ones from the same day.

### BasketDate
Let's see how the entries are distributed over time.

In [None]:
nonNullEntries = df[df["BasketID"].notna()]
k = math.ceil(math.log(len(nonNullEntries), 2) + 1) #Sturges' rule
df["BasketDate"].hist(bins=k, figsize=(10,5))
plt.show()

The number of transactions increases month by month.
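
The number of bins used for the histogram comes from Sturges' rule, $k = \lceil \log_2 n + 1 \rceil$, where $n$ is the number of observations. A small helper (the name `sturges_bins` is hypothetical, introduced only here) shows how slowly the bin count grows with the sample size:

```python
import math

def sturges_bins(n: int) -> int:
    """Number of histogram bins suggested by Sturges' rule for n observations."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(math.log2(n) + 1)

for n in (1, 1000, 1024):
    print(n, sturges_bins(n))
```

The rule assumes roughly bell-shaped data, so for heavily skewed attributes the resulting histogram may be too coarse.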

In [None]:
#Distributions of Sale and Qta taking into account the BasketDate
fig = plt.figure(figsize=(20, 5)) 
fig_dims = (1, 2)
fig.subplots_adjust(hspace=0.2, wspace=0.2)

plt.subplot2grid(fig_dims, (0, 0))

plt.scatter(df['BasketDate'], 
            df['Sale'], color='g', marker='*', label='Data')
plt.xlabel('BasketDate')
plt.xticks(rotation='vertical')
plt.ylabel('Sale')


plt.subplot2grid(fig_dims, (0, 1))

plt.scatter(df['BasketDate'], 
            df['Qta'], color='g', marker='*', label='Data')
plt.xlabel('BasketDate')
plt.xticks(rotation='vertical')
plt.ylabel('Qta')
plt.show()

Let's see the number of shopping sessions per customer per day.

In [None]:
#Number of distinct shopping sessions (BasketIDs) per customer per calendar day
df.groupby(by=["CustomerID", df["BasketDate"].dt.date])["BasketID"].nunique()

There doesn't seem to be a way to easily merge the type C BasketID records with other shopping sessions.  
The discounts/refunds will be considered as separate orders.

### Sale
We need to figure out whether the Sale value refers to the cost of a single item or to the cost of the item multiplied by Qta.

In [None]:
df.sort_values(by="ProdID").head()

It seems that Sale doesn't change if the Qta is changed... let's verify it further.

In [None]:
df.corr(numeric_only=True) #Only Sale and Qta style numeric columns can be correlated

There doesn't seem to be a correlation in general between Sale and Qta, which suggests that Sale does not scale with the quantity and is therefore the cost of a single item.
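
A more direct check than the correlation is to look, product by product, at whether Sale stays constant while Qta varies. The sketch below uses made-up records; `per_product` and `share_constant` are names introduced only for this illustration.

```python
import pandas as pd

# Toy records (values are made up): the same product bought in different quantities
toy = pd.DataFrame({
    "ProdID": ["85123A", "85123A", "85123A", "22423", "22423"],
    "Sale": [2.5, 2.5, 2.5, 12.75, 12.75],
    "Qta": [6, 12, 1, 2, 24],
})

# Number of distinct Sale and Qta values observed for each product
per_product = toy.groupby("ProdID").agg(n_sale=("Sale", "nunique"), n_qta=("Qta", "nunique"))

# Among products bought in more than one quantity, share whose Sale never changes
varied_qta = per_product["n_qta"] > 1
share_constant = (per_product.loc[varied_qta, "n_sale"] == 1).mean()
print(share_constant)
```

A share close to 1 on the real data would support reading Sale as a unit price; price changes over time would lower it without contradicting that reading.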

In [None]:
#Visualize the Sale distribution
fig = plt.figure(figsize=(20, 10)) 
fig_dims = (1, 2)
fig.subplots_adjust(hspace=0.2, wspace=0.2)

plt.subplot2grid(fig_dims, (0, 0))
k = math.ceil(math.log(len(df["Sale"]), 2) + 1) #Sturges' rule
df["Sale"].hist(bins=k)

plt.subplot2grid(fig_dims, (0, 1))
df.boxplot(column=["Sale"])
plt.show()

As expected, the vast majority of Sale values are near 0. We need, however, to check for 0 values, since they don't make sense in the context of Sale and should therefore be considered missing values.  
Note also that the median is near 0.

In [None]:
len(df.loc[df["Sale"] == 0]) #Number of records with Sale equal to 0 (.size would count cells, i.e. rows * columns)

Almost a quarter of the Sale values are 0; this needs to be fixed in the Data preparation phase.
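
When counting such records it helps to keep in mind that `.size` on a DataFrame returns the number of cells (rows times columns), while the sum and the mean of a boolean mask give the number and the fraction of matching records. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy records (values are made up)
toy = pd.DataFrame({"Sale": [0.0, 2.5, 0.0, 1.25], "Qta": [1, 6, -2, 3]})

zero_mask = toy["Sale"] == 0
n_zero = int(zero_mask.sum())         # number of records with Sale equal to 0
frac_zero = float(zero_mask.mean())   # fraction of records with Sale equal to 0
n_cells = toy.loc[zero_mask].size     # rows * columns of the filtered frame

print(n_zero, frac_zero, n_cells)
```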

### CustomerID

Let's see why the number of non-null CustomerID entries is so low and if there are any interesting properties to be found.

In [None]:
df_expl["CustomerID"] = df["CustomerID"]

df_expl["CustomerIDNull"] = 0
df_expl.loc[df_expl["CustomerID"].isna(), "CustomerIDNull"] = 1

df_expl.corr(numeric_only=True)["CustomerIDNull"]

No interesting correlation.  
Let's check if we can retrieve some missing CustomerIDs by using the records referencing the same BasketID.

In [None]:
df.groupby(by="BasketID").filter(lambda x: x["CustomerID"].isna().any() & x["CustomerID"].notna().any())

Since the code above didn't return any record, there doesn't seem to be an easy way to fill in the missing CustomerID values.
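
For reference, this is how such a fill could look if some baskets did mix known and missing CustomerIDs. The sketch runs on made-up records and assumes that every record of a basket belongs to the same customer; `recoverable` and `filled` are names introduced only here.

```python
import numpy as np
import pandas as pd

# Toy records (values are made up): basket "1" has one known and one missing CustomerID
toy = pd.DataFrame({
    "BasketID": ["1", "1", "2", "2"],
    "CustomerID": ["17850", np.nan, np.nan, np.nan],
})

by_basket = toy.groupby("BasketID")["CustomerID"]

# Records with a missing CustomerID whose basket has at least one known CustomerID
recoverable = toy["CustomerID"].isna() & (by_basket.transform("count") > 0)

# Fill the gaps with the first known CustomerID of the same basket
filled = toy["CustomerID"].fillna(by_basket.transform("first"))
print(recoverable.tolist())
print(filled.tolist())
```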

### Customer country

In [None]:
df["CustomerCountry"].value_counts().plot(kind='bar')

The majority of the operations take place in the United Kingdom.  
It could, however, be interesting to take into account the revenue by country and see which country is more profitable relative to its number of orders.

In [None]:
countryList = df["CustomerCountry"].sort_values().unique()
country_df = pd.DataFrame(data=countryList, columns=["Country"])

df["ProductSaleQta"] = df["Sale"]*df["Qta"]

for country in countryList:
    country_df.loc[country_df["Country"] == country, "TotalSale"] = df.loc[df["CustomerCountry"] == country, "ProductSaleQta"].sum()

df = df.drop("ProductSaleQta", axis=1)
country_df.sort_values(by="TotalSale", ascending=False).head(10)
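
The same totals can be obtained with a single `groupby`, which also makes it easy to relate the revenue to the number of orders. The sketch below runs on made-up records and assumes that Sale is a unit price, so that revenue is Sale * Qta; `by_country` and `SalePerOrder` are names introduced only for this illustration.

```python
import pandas as pd

# Toy records (values are made up)
toy = pd.DataFrame({
    "CustomerCountry": ["United Kingdom", "United Kingdom", "France", "France"],
    "BasketID": ["1", "2", "3", "3"],
    "Sale": [2.0, 1.0, 5.0, 1.0],
    "Qta": [3, 4, 2, 10],
})

# Revenue of each record, assuming Sale is the price of a single item
toy["Revenue"] = toy["Sale"] * toy["Qta"]

# Total revenue and number of distinct orders per country
by_country = toy.groupby("CustomerCountry").agg(
    TotalSale=("Revenue", "sum"),
    Orders=("BasketID", "nunique"),
)
by_country["SalePerOrder"] = by_country["TotalSale"] / by_country["Orders"]
print(by_country.sort_values(by="SalePerOrder", ascending=False))
```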

### ProdID
Let's find out why this wasn't converted to a number.

In [None]:
df.loc[df["ProdID"].str.isnumeric(), ["ProdID", "ProdDescr"]].value_counts() #Records with ProdIDs containing only numbers

In [None]:
df.loc[df["ProdID"].str.isalpha(), ["ProdID", "ProdDescr"]].value_counts() #Records with ProdIDs containing only letters

In [None]:
#Records with ProdID terminating with a letter
term_letter_prodid = df.loc[(df["ProdID"].str.slice(start=-1).str.isalpha()) & (df["ProdID"].str.slice(0, -1).str.isnumeric())]
term_letter_prodid.sort_values(by="ProdID")

In [None]:
term_letter_prodid["ProdID"].str.slice(start=-1).sort_values().unique()

Given the diversity and lack of structure of the ProdIDs there doesn't seem to be interesting information to obtain.  
There doesn't even seem to be consistency between the descriptions and ProdIDs.

### Qta

In [None]:
k = math.ceil(math.log(len(df["Qta"]), 2) + 1) #Sturges' rule
df["Qta"].hist(bins=k)

Let's check for 0 values.

In [None]:
len(df.loc[df["Qta"] == 0]) #Number of records with Qta equal to 0

There are no records with Qta equal to 0.

In [None]:
df_expl.corr(numeric_only=True)["QtaPositive"]

As noted in the BasketID section, there is a strong correlation between the sign of Qta and a BasketID of type C.  
Let's see if there is anything interesting in the distribution of the remaining negative quantities.

In [None]:
expl_result = df_expl.loc[(df_expl["Qta"] < 0) & (df_expl["BasketIDTypeC"] == 0)]
expl_result.head()

Let's check if the trend of Sale equal to 0 continues throughout the subset of records.

In [None]:
expl_result["Sale"].describe()

It does.  
Let's check if all CustomerIDs in the subset are null.

In [None]:
expl_result.describe()["CustomerIDNull"]

They are all null.  
It might be a good idea to remove this data in the Data preparation phase.  
This way we will also have a correlation of 1 between the BasketID class C and negative quantities.
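
A minimal sketch of that removal, on made-up records: rows with a negative Qta are dropped unless their BasketID is of type C. The names `drop_mask` and `cleaned` are introduced only for this illustration.

```python
import pandas as pd

# Toy records (values are made up): one regular sale, one type C return, one stray negative
toy = pd.DataFrame({
    "BasketID": ["536365", "C536379", "536400"],
    "Qta": [6, -1, -5],
})

is_type_c = toy["BasketID"].str.startswith("C")

# Negative quantities that do not belong to a type C basket
drop_mask = (toy["Qta"] < 0) & ~is_type_c
cleaned = toy.loc[~drop_mask]

# After the removal, a negative Qta and a type C BasketID coincide on every record
same = ((cleaned["Qta"] < 0) == cleaned["BasketID"].str.startswith("C")).all()
print(cleaned)
print(same)
```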

# Data preparation

TODO: fill in missing values of sale, remove outliers, remove records with neg Qta except class C.

## Additional features

Let's add some new features to the data frame.

In [None]:
unq_cust_id = df["CustomerID"].dropna().sort_values().unique() #Unique customer IDs, excluding missing values
cust_df = pd.DataFrame(data=unq_cust_id, columns=["CustomerID"]) #Dataframe containing customer features

#Total number of items bought by customer
IFeature = df.groupby(["CustomerID"]).Qta.sum()
cust_df = cust_df.merge(IFeature, on="CustomerID").rename(columns={"Qta":"I"})

#Total number of unique items bought by customer
IuFeature = df.groupby(["CustomerID"]).ProdID.nunique()
cust_df = cust_df.join(IuFeature, on="CustomerID").rename(columns={"ProdID":"Iu"})

#Max number of items bought by customer across all shopping sessions
BasketIDQtaSum= df.groupby(["CustomerID", "BasketID"]).Qta.sum()
ImaxFeature = BasketIDQtaSum.groupby(["CustomerID"]).max()
cust_df = cust_df.join(ImaxFeature, on="CustomerID").rename(columns={"Qta":"Imax"})

#The Shannon entropy of the purchasing behaviour of the customer (sum of -p_item * log2(p_item))
#Potential problem: p can be negative when a customer's records include negative Qta, which makes log2 undefined (NaN)
cust_prod_tot = df.groupby(["CustomerID", "ProdID"]).Qta.sum()
probSeries = cust_prod_tot.div(IFeature, level="CustomerID") #Share of each product in the customer's total Qta
logSeries = np.log2(probSeries)
entropy = -1*probSeries*logSeries
EFeature = entropy.groupby(["CustomerID"]).sum()
cust_df = cust_df.join(EFeature, on="CustomerID").rename(columns={"Qta":"E"})

cust_df.head()
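
One possible way around the negative probabilities is to compute the entropy on purchases only, i.e. on records with a positive Qta. This is an assumption about how returns should be treated, not a decision taken in this notebook. The sketch below runs on made-up records:

```python
import numpy as np
import pandas as pd

# Toy records (values are made up); customer "1" also has a return with negative Qta
toy = pd.DataFrame({
    "CustomerID": ["1", "1", "1", "2", "2"],
    "ProdID": ["a", "b", "a", "a", "a"],
    "Qta": [2, 2, -1, 4, 4],
})

# Assumption: only positive quantities count as purchases
purchases = toy[toy["Qta"] > 0]

per_prod = purchases.groupby(["CustomerID", "ProdID"])["Qta"].sum()
per_cust = purchases.groupby("CustomerID")["Qta"].sum()

# Probability of each product within the purchases of the same customer
p = per_prod.div(per_cust, level="CustomerID")

# Shannon entropy per customer: sum of -p * log2(p)
entropy = (-(p * np.log2(p))).groupby(level="CustomerID").sum()
print(entropy)
```

Customer "1" splits the purchases evenly between two products and gets an entropy of 1 bit, while customer "2" buys a single product and gets 0.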

In [None]:
cust_df.corr(numeric_only=True) #CustomerID is a string and is excluded

There is an interesting correlation between E and Iu.

In [None]:
plt.scatter(cust_df['I'], 
            cust_df['Iu'], color='g', marker='*', label='Data')
plt.xlabel('I')
plt.xticks(rotation='vertical')
plt.ylabel('Iu')
plt.show()