# 1. Business Problem
## 1.1 Problem Context
Our client is an online retailer based in the UK. They sell all-occasion gifts, and many of their customers are wholesalers.
* Most of their customers are from the UK, but they have a small percent of customers from other countries.
* They want to create groups of these international customers based on their previous purchase patterns.
* Their goal is to provide more tailored services and improve the way they market to these international customers.

## 1.2 Problems with current approach
Currently, the retailer simply groups their international customers by country. As you'll see in the project, this is quite inefficient because:
* There's a large number of countries (which kind of defeats the purpose of creating groups).
* Some countries have very few customers.
* This approach treats large and small customers the same, regardless of their purchase patterns.

## 1.3 Problem Statement
The retailer has hired us to help them create customer clusters, a.k.a. **"customer segments"**, through a data-driven approach.
* They've provided us a dataset of past purchase data at the transaction level.
* Our task is to build a clustering model using that dataset.
* Our clustering model should factor in both aggregate sales patterns and specific items purchased.

## 1.4 Business Objectives and constraints

# 2. Machine Learning Problem
## 2.1 Data Overview
For this project:
* The dataset has 35116 observations for previous international transactions.
* The observations span 37 different countries.
* **Note:** There is no target variable.

We have the following features:

Invoice information
* 'InvoiceNo' – Unique ID for invoice
* 'InvoiceDate' – Invoice date

Item information
* 'StockCode' – Unique ID for item
* 'Description' – Text description for item
* 'Quantity' – Units per pack for item
* 'UnitPrice' – Price per unit in GBP

Customer information
* 'CustomerID' – Unique ID for customer
* 'Country' – Country of customer

## 2.2 Mapping Business problem to ML problem
### 2.2.1 Type of ML Problem
It is an unsupervised learning task, where given the features about each transaction, we need to segment the customers based on their buying patterns.
* It is important to note that the given data is transaction-level, while the clusters (or segments) we need to create are customer-level.

# 3. Exploratory Data Analysis
Import the libraries

In [None]:
# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd
pd.set_option('display.max_columns', 100)

# Matplotlib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline 

# Seaborn for easier visualization
import seaborn as sns

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

# StandardScaler from Scikit-Learn
from sklearn.preprocessing import StandardScaler

# PCA from Scikit-Learn
from sklearn.decomposition import PCA

# Scikit-Learn's KMeans algorithm
from sklearn.cluster import KMeans

# Adjusted Rand index
from sklearn.metrics import adjusted_rand_score

## 3.1 Load the dataset

In [None]:
# Load international online transactions data from CSV
df = pd.read_csv('Files/int_online_tx.csv')

In [None]:
#Dataframe dimension
df.shape

In [None]:
# First 10 rows of data
df.head(10)

Here are some questions to consider:
* In the first 10 observations, how many different customers are there?
* How many different invoices are there?
* Given answers to the first two questions, how many unique purchases are shown?
* Technically, aren't these observations actually line-items within each transaction? 
* Do you expect the customer-level dataset to be much smaller?
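The first three questions can be answered by counting unique IDs. Here is a minimal sketch on a made-up frame (`toy_tx` and its values are hypothetical, not the client's data); on the real data the same calls would be applied to `df.head(10)`.

```python
import pandas as pd

# Hypothetical line-item data: 5 rows, 2 invoices, 2 customers
toy_tx = pd.DataFrame({
    'InvoiceNo':  ['A1', 'A1', 'A1', 'B7', 'B7'],
    'StockCode':  ['s1', 's2', 's3', 's1', 's4'],
    'CustomerID': [101, 101, 101, 202, 202],
})

n_rows = len(toy_tx)                          # number of line items
n_invoices = toy_tx['InvoiceNo'].nunique()    # number of unique purchases (invoices)
n_customers = toy_tx['CustomerID'].nunique()  # number of unique customers

print(n_rows, n_invoices, n_customers)
```

Each row is a line item, so the row count overstates both the number of purchases and the number of customers.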

Display the distribution of transactions by country.

In [None]:
# Make figsize
plt.figure(figsize=(9,10))

# Bar plot by country
sns.countplot(y='Country', data=df)

There are many **sparse classes**. Countries like Lithuania, Brazil, and even the USA have a tiny number of transactions.

**Note:** This is at the transaction/line-item level. The number of customers for each country is even smaller because each customer has multiple transactions! Therefore, it's plain to see that clustering by country is not very efficient.
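To check the claim about customers per country directly, one option is to count unique customer IDs within each country. The sketch below uses a small hypothetical frame (`toy_tx`) so it runs on its own; the column names match the dataset described above.

```python
import pandas as pd

# Hypothetical line items: France has 4 rows but only 2 customers
toy_tx = pd.DataFrame({
    'Country':    ['France', 'France', 'France', 'Brazil', 'France'],
    'CustomerID': [101, 101, 102, 201, 102],
})

# Line items per country (what the bar plot counts)
line_items_per_country = toy_tx['Country'].value_counts()

# Unique customers per country (what matters for segmentation)
customers_per_country = toy_tx.groupby('Country')['CustomerID'].nunique()

print(customers_per_country)
```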

# 4. Transaction-level Cleaning
Before we aggregate to the customer level, we need to tidy up a few things at the transaction level.
* Technically, this is the **"line-item" level** because one invoice (a.k.a. transaction) spans multiple rows. However, we'll just refer to it as the "transaction level" for simplicity.
* Also, the terms **"aggregating up"** and **"rolling up"** are used interchangeably.

First, we need to check for any missing data before rolling up. If we don't, we might get confusing results later because rolling up to a higher level can sometimes hide the fact that data was missing at the lower level.
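As a small illustration of how a roll-up can hide missing data, consider the sketch below. The frame `toy_tx` is hypothetical. With pandas defaults, `groupby` drops rows whose key is missing and `sum` skips missing values, so the aggregated result shows no trace of either gap.

```python
import numpy as np
import pandas as pd

# Hypothetical transactions: one missing CustomerID and one missing Sales value
toy_tx = pd.DataFrame({
    'CustomerID': [101, 101, np.nan, 202],
    'Sales':      [10.0, np.nan, 5.0, 7.0],
})

# Two missing values are visible at the transaction level
missing_before = toy_tx.isnull().sum().sum()

# After rolling up, the row with no CustomerID is silently dropped
# and the missing Sales value is silently skipped
rolled_up = toy_tx.groupby('CustomerID')['Sales'].sum()
missing_after = rolled_up.isnull().sum()

print(missing_before, missing_after)
print(rolled_up)
```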

Display the number of missing observations for each feature.

In [None]:
# Missing data by feature
df.isnull().sum()

'CustomerID' has missing observations
* Should we **label them as missing** (as for categorical features) or should we **flag and fill** them (as for numeric features)?
* We should do neither. Instead, we simply need to drop transactions with missing CustomerID.
* Think back to the project scope: We are trying to cluster customers in order to provide more tailored service!
* That means transactions with missing 'CustomerID' are actually pointless to keep.
* In other words, they should be considered "unwanted observations" instead of "missing data".

Drop observations with missing customer ID's.

In [None]:
# Only keep transactions with CustomerID's
# (.copy() avoids a SettingWithCopyWarning when we add or modify columns later)
df = df[df.CustomerID.notnull()].copy()

Next, just for clarity, convert the CustomerID's from floats into integers. (This is technically not required, but it's good practice to save ID's as either strings or integers so they don't get mixed up with other numeric features.)

In [None]:
# Convert customer ID's into integers
df['CustomerID'] = df.CustomerID.astype(int)

# Display first 5 CustomerID's in the transaction dataset
df.CustomerID.head()

There's one feature we need to create at the transaction level.

Look back at the first 10 observations you displayed earlier.
* Are there any features that tell you how much money the customer spent on each transaction?
* Well, we have 'Quantity' and 'UnitPrice', but those are for individual units, not for the transaction.
* We need to multiply them together to get the Sales amount for that transaction.

In [None]:
# Create 'Sales' interaction feature
df['Sales'] = df.Quantity * df.UnitPrice

# Display first 5 Sales values in the transaction dataset
df.Sales.head()

Before moving on, save your cleaned transaction-level data as **cleaned_transactions.csv.**

In [None]:
# Save cleaned transaction-level data
df.to_csv('Files/cleaned_transactions.csv', index=None)

# 5. Customer-level feature engineering
Now that we have a cleaned transaction-level dataset, it's time to **roll it up** (aggregate up) to the customer level, which we'll feed into our machine learning algorithms later.

We want 1 customer per row, and we want the features to represent information such as:
* Number of unique purchases by the customer
* Average cart value for the customer
* Total sales for the customer
* Etc.

To do so, we'll use two tools seen already:
* groupby() to roll up by customer.
* agg() to engineer aggregated features.

Aggregate invoice data by customer. We'll engineer 1 feature:
* 'total_transactions' - the total number of unique transactions for each customer.

In [None]:
# Aggregate invoice data
# (named aggregation; passing a renaming dict to agg() on a single column was removed in pandas 1.0)
invoice_data = df.groupby('CustomerID').InvoiceNo.agg(total_transactions='nunique')

# Display invoice data for first 5 customers
invoice_data.head()

In [None]:
# Aggregate product data
# (named aggregation; dict-based renaming was removed in pandas 1.0)
product_data = df.groupby('CustomerID').StockCode.agg(total_products='count',
                                                      total_unique_products='nunique')

# Display product data for first 5 customers
product_data.head()

By definition, 'total_unique_products' should always be less than or equal to 'total_products'.
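A quick way to guard this invariant is an assertion right after the aggregation. The sketch below demonstrates the idea on a hypothetical frame (`toy_tx`); on the real data the same check would be applied to `product_data`.

```python
import pandas as pd

# Hypothetical line items: customer 101 bought item s1 twice and s2 once
toy_tx = pd.DataFrame({
    'CustomerID': [101, 101, 101, 202],
    'StockCode':  ['s1', 's1', 's2', 's3'],
})

toy_product_data = toy_tx.groupby('CustomerID')['StockCode'].agg(
    total_products='count',            # all line items
    total_unique_products='nunique',   # distinct items
)

# Sanity check: distinct items can never exceed total line items
assert (toy_product_data['total_unique_products']
        <= toy_product_data['total_products']).all()

print(toy_product_data)
```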

Finally, aggregate sales data by customer. Engineer 2 features:
* 'total_sales' - the total sales for each customer.
* 'avg_product_value' - the average value of the products purchased by the customer (not the UnitPrice!).

In [None]:
# Roll up sales data
# (named aggregation; dict-based renaming was removed in pandas 1.0)
sales_data = df.groupby('CustomerID').Sales.agg(total_sales='sum',
                                                avg_product_value='mean')

# Display sales data for first 5 customers
sales_data.head()

# 6. Intermediary levels
You won't always be able to easily roll up to customer-level directly. Sometimes, it will be easier to create intermediary levels first.

For example, let's say we wanted to calculate the average cart value for each customer.
* 'avg_product_value' isn't the same thing because it doesn't first group products that were purchased within the same "cart" (i.e. invoice).

Therefore, let's first aggregate cart data at the "cart-level."
* We'll group by 'CustomerID' AND by 'InvoiceNo'. Remember, we're treating each invoice as a "cart."
* Then, we'll calculate 'cart_value' by taking the sum of the Sales column. This is the total sales by invoice (i.e. cart).
* Finally, we'll call .reset_index() to turn CustomerID and InvoiceNo back into regular columns so we can perform another aggregation.
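To see why the two averages differ, here is a small sketch on hypothetical data (`toy_tx` is made up for illustration). One customer has two invoices: the first holds two line items, the second holds one.

```python
import pandas as pd

# Hypothetical line items for a single customer
toy_tx = pd.DataFrame({
    'CustomerID': [101, 101, 101],
    'InvoiceNo':  ['A', 'A', 'B'],
    'Sales':      [10.0, 30.0, 20.0],
})

# Average over line items: (10 + 30 + 20) / 3
avg_line_item_value = toy_tx.groupby('CustomerID')['Sales'].mean()

# Intermediary level: one row per cart (invoice), then average over carts
toy_carts = (toy_tx.groupby(['CustomerID', 'InvoiceNo'])['Sales']
             .sum()
             .rename('cart_value')
             .reset_index())
avg_cart_value = toy_carts.groupby('CustomerID')['cart_value'].mean()

print(avg_line_item_value.loc[101], avg_cart_value.loc[101])
```

The line-item average is 20, while the cart average is (40 + 20) / 2 = 30, which is why the intermediary roll-up is needed.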

In [None]:
# Aggregate cart-level data (i.e. invoice-level)
# (named aggregation; dict-based renaming was removed in pandas 1.0)
cart_data = df.groupby(['CustomerID', 'InvoiceNo']).Sales.agg(cart_value='sum')

# Display cart data for first 20 CARTS
cart_data.head(20)

In [None]:
# Reset index
cart_data.reset_index(inplace=True)

# Display cart data for first 10 CARTS
cart_data.head(10)

Now that we have cart-level cart data, all we need to do is roll up by CustomerID again to get customer-level cart data.

Aggregate cart data by customer. Engineer 3 features:
* 'avg_cart_value' - average cart value by customer.
* 'min_cart_value' - minimum cart value by customer.
* 'max_cart_value' - maximum cart value by customer.

In [None]:
# Aggregate cart data (at customer-level)
# (named aggregation; dict-based renaming was removed in pandas 1.0)
agg_cart_data = cart_data.groupby('CustomerID').cart_value.agg(avg_cart_value='mean',
                                                               min_cart_value='min',
                                                               max_cart_value='max')

# Display cart data for first 5 CUSTOMERS
agg_cart_data.head()

# 7. Joining various customer level dataframes
We have multiple dataframes that each contain customer-level features:
* invoice_data
* product_data
* sales_data
* agg_cart_data

Let's join the various customer-level datasets together with the .join() function.
* Just pick one of the customer-level dataframes and join it to a list of the others.
* By default, it will join the dataframes on their index... In this case, it will join by CustomerID, which is exactly what we want.
* You can read more about the .join() function in the official documentation.
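The index-based behaviour of .join() can be seen on a toy example. The three frames below are hypothetical and share a 'CustomerID' index. Because the default is a left join, a customer missing from one of the joined frames gets NaN for that frame's columns.

```python
import pandas as pd

# Hypothetical customer-level frames indexed by CustomerID
left = pd.DataFrame({'total_transactions': [3, 1]},
                    index=pd.Index([101, 202], name='CustomerID'))
middle = pd.DataFrame({'total_sales': [90.0, 15.0]},
                      index=pd.Index([101, 202], name='CustomerID'))
right = pd.DataFrame({'avg_cart_value': [30.0]},
                     index=pd.Index([101], name='CustomerID'))

# Join one frame to a list of the others; rows are matched on the index
joined = left.join([middle, right])

print(joined)
```

In this project every customer-level frame is built from the same transactions, so all four should share the same set of CustomerIDs.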

In [None]:
# Join together customer-level data
customer_df = invoice_data.join([product_data, sales_data, agg_cart_data])

# Display customer-level data for first 5 customers
customer_df.head()

Finally, let's save customer_df as our **analytical base table** to use later.

**Very Important:** We will not set index=None because we want to keep the CustomerID's as the index (this will be important, as we'll see later).

In [None]:
# Save analytical base table
customer_df.to_csv('Files/analytical_base_table.csv')

# 8. Curse of Dimensionality.
Let's import the cleaned dataset (not the analytical base table) that we saved previously.

In [None]:
# Read cleaned_transactions.csv
df = pd.read_csv('Files/cleaned_transactions.csv')

## 8.1 So what is "The Curse of Dimensionality?"

"Dimensionality" refers to the number of features in your dataset. The basic idea is that as the number of features increases, you'll need more and more observations to build any sort of meaningful model, especially for clustering.

Because clustering models are based on the "distance" between observations, and distance is calculated from the differences between feature values, all observations start to seem "far away" from one another as the number of features increases.

The reference link provides an excellent analogy : https://www.quora.com/What-is-the-curse-of-dimensionality

    Let's say you have a straight line 100 yards long and you dropped a penny somewhere on it. It wouldn't be too hard to find. You walk along the line and it takes two minutes.

    Now let's say you have a square 100 yards on each side and you dropped a penny somewhere on it. It would be pretty hard, like searching across two football fields stuck together. It could take days.

    Now a cube 100 yards across. That's like searching a 30-story building the size of a football stadium. Ugh.

    The difficulty of searching through the space gets a lot harder as you have more dimensions.

For our practical purposes, it's enough to remember that when you have many features (high dimensionality), clustering becomes especially hard because all observations are "far away" from one another.

The amount of "space" that a data point could potentially exist in becomes larger and larger, and clusters become very hard to form.
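The effect can be simulated with random points. The sketch below is only an illustration under assumed settings (50 uniform random points, a fixed seed, and a hypothetical helper `nearest_to_farthest_ratio`); it compares the smallest and largest pairwise Euclidean distance as the number of dimensions grows. A ratio near 0 means some pairs are much closer than others; a ratio near 1 means every pair is roughly equally far apart.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_points, n_dims):
    """Ratio of the smallest to the largest pairwise distance among random points."""
    X = rng.random((n_points, n_dims))
    # Pairwise Euclidean distances via broadcasting
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep each pair once, excluding the zero distances on the diagonal
    pair_dists = dists[np.triu_indices(n_points, k=1)]
    return pair_dists.min() / pair_dists.max()

ratios = {d: nearest_to_farthest_ratio(50, d) for d in (2, 10, 100, 1000)}

for d, r in ratios.items():
    print(d, round(r, 3))
```

As the dimension grows, the ratio moves toward 1: the nearest and farthest neighbours become hard to tell apart, which is exactly what makes distance-based clusters hard to form.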

## 8.2 Item Data
So how does The Curse of Dimensionality arise in this problem?

Well, in the previous module, we created a customer-level analytical base table with important features such as total sales by customer and average cart value by customer.

However, remember, the client would also like to include information about individual items that were purchased.
* For example, if two customers purchased similar items, our model should be more likely to group them into the same cluster.
* In other words, we care not just about how much a customer purchases, but also what they purchase.
* So, alongside the features we have already built, each customer-level observation needs to record which products that customer purchased, i.e. we need to find some way to represent each unique item.

One way is to create a vector of unique values of the StockCode column and if the customer has purchased a particular product, fill it by 1, else fill it by 0.

This is like the binary CountVectorizer technique.

In [None]:
# Get item_dummies - creates the vector of StockCode
item_dummies = pd.get_dummies(df.StockCode)

item_dummies.head()

Now, add 'CustomerID' to this new dataframe so that we can roll up (aggregate) by customer later.

In [None]:
# Add CustomerID to item_dummies
item_dummies['CustomerID'] = df.CustomerID

# Display first 5 rows of item_dummies
item_dummies.head()

Next, roll up the item dummies data into customer-level item data.

In [None]:
# Create item_data by aggregating at customer level
item_data = item_dummies.groupby('CustomerID').sum()

# Display first 5 rows of item_data
item_data.head()

As you can see, even after rolling up to the customer level, most of the values are still 0. That means that most customers are not buying a huge array of different items, which is to be expected.
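Note that summing the dummies gives purchase counts per item rather than strict 0/1 flags. If binary "bought it or not" flags are preferred, one option is to threshold the counts. The sketch below shows this on a hypothetical frame (`toy_tx`); the names `item_counts` and `item_flags` are illustrative only.

```python
import pandas as pd

# Hypothetical line items: customer 101 bought s1 twice and s2 once
toy_tx = pd.DataFrame({
    'CustomerID': [101, 101, 101, 202],
    'StockCode':  ['s1', 's1', 's2', 's2'],
})

# Same steps as above: dummies, attach CustomerID, roll up by customer
toy_dummies = pd.get_dummies(toy_tx['StockCode'])
toy_dummies['CustomerID'] = toy_tx['CustomerID']
item_counts = toy_dummies.groupby('CustomerID').sum()

# Convert counts to 0/1 flags
item_flags = (item_counts > 0).astype(int)

print(item_counts)
print(item_flags)
```

Whether counts or flags work better is a modelling choice; the rest of this notebook keeps the counts.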

Finally, let's display the total number of times each item was purchased.

In [None]:
# Total times each item was purchased
item_data.sum()

As you can see, most items were purchased fewer than a handful of times!

First of all, we've just created 2574 customer-level item features, which leads to The Curse of Dimensionality.
To make matters even worse, most of the values for many of those features are 0!

So, we'll introduce a couple of strategies for reducing the number of item features that we actually keep.

Before moving on, let's save this customer-level item dataframe as 'item_data.csv'. We'll use it again in the next module.

In [None]:
# Save item_data.csv
item_data.to_csv('Files/item_data.csv')

## 8.3 Method 1 - Thresholding
One very simple and straightforward way to reduce the dimensionality of this item data is to set a threshold for keeping features.
* The rationale is that you might only want to keep **popular items.**
* For example, let's say item A was only purchased by 2 customers. Well, the feature for item A will be 0 for almost all observations, which isn't very helpful.
* On the other hand, let's say item B was purchased by 100 customers. The feature for item B will allow more meaningful comparisons.
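The bullets above describe popularity in terms of how many customers bought an item, which is not always the same as how many times it was purchased. As a hedged sketch of that variant (the frame `toy_item_data` and the cutoff `min_customers` are hypothetical), we can count the customers with a non-zero value per item and keep items above a cutoff.

```python
import pandas as pd

# Hypothetical customer-level item counts (rows: customers, columns: items)
toy_item_data = pd.DataFrame(
    {'A': [5, 0, 0, 0], 'B': [1, 1, 1, 0], 'C': [0, 2, 0, 1]},
    index=pd.Index([101, 102, 103, 104], name='CustomerID'),
)

# Total times each item was purchased (item A looks most popular here)
times_purchased = toy_item_data.sum()

# Number of distinct customers who bought each item (item A has only one buyer)
customers_buying = (toy_item_data > 0).sum()

# Keep items bought by at least `min_customers` customers (hypothetical cutoff)
min_customers = 2
popular_items = customers_buying[customers_buying >= min_customers].index
toy_threshold_data = toy_item_data[popular_items]

print(list(popular_items))
```

The walkthrough that follows ranks items by total purchases instead, which is simpler and often gives a similar list.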

To make this concrete, assume we only wish to keep item features for the 20 most popular items.

First, we can see which items those are and the number of times they were purchased.
* Take the sum by column.
* Sort the values.
* Look at the last 20 (since they are sorted in ascending order by default)

In [None]:
# Display most popular 20 items
item_data.sum().sort_values().tail(20)

Next, if we take the .index of the above series, we can get just a list of the StockCodes for those 20 items.

In [None]:
# Get list of StockCodes for the 20 most popular items
top_20_items = item_data.sum().sort_values().tail(20).index

top_20_items

Finally, we can keep only the features for those 20 items.

In [None]:
# Keep only features for top 20 items
top_20_item_data = item_data[top_20_items]

# Shape of remaining dataframe
top_20_item_data.shape

In [None]:
top_20_item_data.head()

These 20 features are much more manageable than the 2574 from earlier, and they are arguably the most important features because they are the most popular items.

Finally, save this top 20 items dataframe as 'threshold_item_data.csv'.

In [None]:
# Save threshold_item_data.csv
top_20_item_data.to_csv('Files/threshold_item_data.csv')

## 8.4 PCA
Let's import the full item data that we saved in the previous module (before applying thresholds)

This time, we'll also pass in the argument index_col=0 to tell Pandas to treat the first column (CustomerID) as the index.

In [None]:
# Read item_data.csv
item_data = pd.read_csv('Files/item_data.csv', index_col=0)

In [None]:
# Display item_data's shape
item_data.shape

**Principal Component Analysis (PCA)** is an unsupervised learning technique that creates a sequence of new, uncorrelated features, each of which tries to maximize the "explained variance" of the original dataset.
* It does so by generating linear combinations from your original features.
* These new features are meant to replace the original ones.

Here's where dimensionality reduction comes into play, and it's brilliantly simple:
* You don't need to keep all of the principal components!
* You can just keep some number of the "best" components, a.k.a. the ones that explain the most variance.
* Remember, PCA creates a sequence of principal components and each one tries to capture the most variance after accounting for the ones before it.
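These properties can be checked on a small synthetic example. The sketch below is an illustration under assumed data (three made-up features, two of them strongly correlated), not the item data itself. It confirms that the explained variance ratios come out in decreasing order, sum to 1, and that the resulting components are uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: feature 2 is almost a multiple of feature 1, feature 3 is independent
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Scale, then fit PCA keeping all components
X_scaled = StandardScaler().fit_transform(X)
toy_pca = PCA()
pcs = toy_pca.fit_transform(X_scaled)

# Share of variance explained by each component, in decreasing order
ratios = toy_pca.explained_variance_ratio_

# Correlation matrix of the new features (should be close to the identity)
pc_corr = np.corrcoef(pcs, rowvar=False)

print(ratios.round(3))
```

Because two of the three original features carry nearly the same information, the last component explains very little variance and could be dropped with almost no loss.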

First, **scale item_data**, which you imported.

In [None]:
# Initialize instance of StandardScaler
scaler = StandardScaler()

# Fit and transform item_data
item_data_scaled = scaler.fit_transform(item_data)

# Display first 5 rows of item_data_scaled
item_data_scaled[:5]

Next, initialize and fit an instance of the PCA transformation.

Keep all of the components for now (just don't pass in any argument).

In [None]:
# Initialize and fit a PCA transformation
pca = PCA()
pca.fit(item_data_scaled)

Finally, generate new "principal component features" from item_data_scaled.

In [None]:
# Generate new features
PC_items = pca.transform(item_data_scaled)

# Display first 5 rows
PC_items[:5]

### Explained Variance
It's very helpful to calculate and plot the cumulative explained variance.
* This will tell us the total amount of variance we'd capture if we kept up to the n-th component.
* First, we'll use np.cumsum() to calculate the cumulative explained variance.
* Then, we'll plot it so we can see how many PC features we'd need to keep in order to capture most of the original variance.

In [None]:
# Cumulative explained variance
cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)

# Plot cumulative explained variance
plt.grid()
plt.plot(range(len(cumulative_explained_variance)), cumulative_explained_variance)

This chart is saying: To capture about 98% of the variance, we'd need to keep around 300 components.

We can confirm:

In [None]:
# How much variance we'd capture with the first 300 components
# (index 299, because array indexing starts at 0)
cumulative_explained_variance[299]

Reducing 2574 features down to 300 (about 88% fewer features) while still capturing most of the original variance is certainly not bad!
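Instead of reading the number of components off the chart, it can also be computed from the cumulative curve. The sketch below uses made-up explained variance ratios and a hypothetical `target` of 0.8 to show the idea; on the real data the same two lines would be applied to `cumulative_explained_variance`.

```python
import numpy as np

# Hypothetical explained variance ratios for 5 components
explained = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
cumulative = np.cumsum(explained)

# Hypothetical target share of variance to retain
target = 0.8

# np.argmax returns the first index where the condition is True;
# add 1 to convert the zero-based index into a component count.
# (If the target were never reached, argmax would return 0, so check that case separately.)
n_needed = int(np.argmax(cumulative >= target)) + 1

print(n_needed, cumulative[n_needed - 1])
```

scikit-learn's PCA also accepts a float between 0 and 1 for n_components (with svd_solver='full') to select the number of components by target variance directly.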

Initialize and fit another PCA transformation.
* This time, only keep 300 components.
* Generate the principal component features from the fitted instance and name the new matrix PC_items.
* Then, display the shape of PC_items to confirm it only has 300 features.

In [None]:
pca = PCA(n_components=300)

# Fit and transform item_data_scaled
PC_items = pca.fit_transform(item_data_scaled)

# Display shape of PC_items
PC_items.shape

Next, for convenience, let's put PC_items into a new dataframe.

We'll also name the columns and update its index to be the same as the original item_data's index.

In [None]:
# Put PC_items into a dataframe
items_pca = pd.DataFrame(PC_items)

# Name the columns
items_pca.columns = ['PC{}'.format(i + 1) for i in range(PC_items.shape[1])]

# Update its index
items_pca.index = item_data.index

# Display first 5 rows
items_pca.head()

* Now we have a dataframe of 300 customer-level principal component features.
* These were generated from the 300 principal components that explained the most variance for the original features.
* The index of this PCA item dataframe contains CustomerID's, which will make it possible for us to join this to our analytical base table.

Finally, save this item dataframe with PCA features as 'pca_item_data.csv'.
* Next, we'll compare the clusters made from using these features against those in 'threshold_item_data.csv'.
* Do not set index=None because we want to keep the CustomerID's as the index.

In [None]:
# Save pca_item_data.csv
items_pca.to_csv('Files/pca_item_data.csv')

# 9. KMeans Clustering
Let's import 3 CSV files we've saved throughout this project.
* Let's import 'analytical_base_table.csv' as base_df.
* Let's import 'threshold_item_data.csv' as threshold_item_data.
* Let's import 'pca_item_data.csv' as pca_item_data.
* Set index_col=0 for each one to use CustomerID as the index.

In [None]:
# Import analytical base table
base_df = pd.read_csv('Files/analytical_base_table.csv', index_col=0)

# Import thresholded item features
threshold_item_data = pd.read_csv('Files/threshold_item_data.csv', index_col=0)

# Import PCA item features
pca_item_data = pd.read_csv('Files/pca_item_data.csv', index_col=0)

Print the shape of each one to make sure we're on the same page.

In [None]:
# Print shape of each dataframe
print( base_df.shape )
print( threshold_item_data.shape )
print( pca_item_data.shape )

Because K-Means creates clusters based on distances, and because distances between observations are calculated from their feature values, **the features you choose to input into the algorithm heavily influence the clusters that are created.**

For this project, we will look at 3 possible feature sets and compare the clusters created from them. We'll try:
1. Only purchase pattern features ("Base DF")
2. Purchase pattern features + item features chosen by thresholding ("Threshold DF")
3. Purchase pattern features + principal component features from items ("PCA DF")

Create a threshold_df by joining base_df with threshold_item_data.

In [None]:
# Join base_df with threshold_item_data
threshold_df = base_df.join(threshold_item_data)

# Display first 5 rows of threshold_df
threshold_df.head()

Create a pca_df by joining base_df with pca_item_data.

In [None]:
# Join base_df with pca_item_data
pca_df = base_df.join(pca_item_data)

# Display first 5 rows of pca_df
pca_df.head()

### Number of clusters
So, how many clusters should you set?
* As with much of Unsupervised Learning, there's no right or wrong answer.
* Typically, you should consider how your client/key stakeholder will use the clusters.
* For example, let's say our client, the online gift retailer, employs 3 customer service reps, and they want to assign one cluster to each rep.
* In that case, the obvious answer is 3.
* For this project, we'll set the number of clusters to 3. However, you should always feel free to adapt this number depending on what you need.
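When there is no business reason to prefer a particular number, a common heuristic (not used in this project) is the "elbow" method: fit K-Means for several values of k and look at where the within-cluster sum of squares stops dropping sharply. The sketch below runs on synthetic blobs from scikit-learn's make_blobs, so the data and the range of k are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (illustration only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=123)

# Fit K-Means for k = 1..6 and record the within-cluster sum of squares
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=123)
    km.fit(X)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On data like this, the drop in inertia is steep up to k=3 and flattens afterwards; real customer data rarely gives such a clean elbow, which is why stakeholder needs usually decide.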

First scale both the dataframes

In [None]:
# Initialize instance of StandardScaler
t_scaler = StandardScaler()
p_scaler = StandardScaler()

# Fit and transform
threshold_df_scaled = t_scaler.fit_transform(threshold_df)
pca_df_scaled = p_scaler.fit_transform(pca_df)

K-Means with threshold_df

In [None]:
t_kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 123)

In [None]:
# fit_predict() fits the model and returns the cluster labels in one step
threshold_df['cluster'] = t_kmeans.fit_predict(threshold_df_scaled)

In [None]:
# Scatterplot, colored by cluster
sns.lmplot(x='total_sales', y='avg_cart_value', hue='cluster', data=threshold_df, fit_reg=False)

K-Means with pca_df

In [None]:
p_kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 123)

In [None]:
# fit_predict() fits the model and returns the cluster labels in one step
pca_df['cluster'] = p_kmeans.fit_predict(pca_df_scaled)

In [None]:
# Scatterplot, colored by cluster
sns.lmplot(x='total_sales', y='avg_cart_value', hue='cluster', data=pca_df, fit_reg=False)

In [None]:
# Similarity between pca_df.cluster and threshold_df.cluster
adjusted_rand_score(pca_df.cluster, threshold_df.cluster)
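To help interpret the score: the adjusted Rand index compares how observations are grouped, not what the clusters are called. A value of 1 means the two clusterings are identical up to relabelling, and values near 0 (or below) mean agreement no better than chance. The label lists in the sketch below are made up for illustration.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster labels for 6 observations
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]   # same grouping as labels_a, different cluster names
labels_c = [0, 1, 0, 1, 0, 1]   # a different grouping

# Identical groupings score 1.0 even though the label values differ
same = adjusted_rand_score(labels_a, labels_b)

# Unrelated groupings score much lower
different = adjusted_rand_score(labels_a, labels_c)

print(same, different)
```

This is why the cluster numbers assigned by the two K-Means models do not need to match before comparing them.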