In [None]:
!pip install mlxtend

In [1]:
# libraries you will need for following through this notebook.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import csv

# allow us to view more rows at a time
pd.options.display.max_rows = 999
pd.options.display.max_colwidth = 999

# the functions we need from mlxtend are here
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

In [2]:
# The following code uploads a file to the Google Colab
# (colab.research.google.com) environment.

# library for uploading files
from google.colab import files 

def upload_files():
    # initiates the upload - follow the dialogues that appear
    uploaded = files.upload()

    # verify the upload
    for fn in uploaded.keys():
        print('User uploaded file "{name}" with length {length} bytes'.format(
            name=fn, length=len(uploaded[fn])))

    # uploaded files need to be written to file to interact with them
    # as part of a file system
    for filename in uploaded.keys():
        with open(filename, 'wb') as f:
            f.write(uploaded[filename])

## Assignment Dataset Description

We will be looking at a dataset from an e-commerce company. Each purchased item appears on a separate line, and the data spans a year of purchases. The fields InvoiceNo and CustomerID can be used to identify the transactions. More information about the dataset is available here:

https://www.kaggle.com/carrie1/ecommerce-data

https://archive.ics.uci.edu/ml/datasets/online+retail

Please upload the ecommerce-data.csv file we have supplied:

In [None]:
# upload the ecommerce-data.csv file
upload_files()

## Parsing the data

We have provided functions for loading the data and for grouping it into transactions. The functions can be found below. Run them so you can call them later.

The `load_data` function only needs to be called once if you save the output to a variable.

The `load_transactions` function allows you to filter by time and to change how transactions are aggregated. Try changing the parameters to see what kind of output you get.

Refer to the comments under each function header for more information.

You should load the data and prepare the transactions now.
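
To make the `by_customer` option concrete, here is a minimal plain-Python sketch of the two aggregation modes. It is an illustration only, not the provided implementation: the helper name `group_transactions`, the toy rows, and the IDs in them are made up for this example, and the order of the resulting lists may differ from what the pandas-based function returns.

```python
from collections import defaultdict

# Toy rows of (CustomerID, InvoiceNo, Description); all values are invented.
rows = [
    (1, "A1", "WHITE MUG"),
    (1, "A1", "RED TEAPOT"),
    (1, "A2", "CANDLE"),
    (2, "B1", "WHITE MUG"),
]

def group_transactions(rows, by_customer=False):
    """Group item descriptions per customer or per (customer, invoice) pair."""
    groups = defaultdict(list)
    for customer_id, invoice_no, description in rows:
        # Hypothetical helper: the key decides what counts as one transaction.
        key = customer_id if by_customer else (customer_id, invoice_no)
        groups[key].append(description)
    return list(groups.values())

per_checkout = group_transactions(rows, by_customer=False)
per_customer = group_transactions(rows, by_customer=True)

print(per_checkout)
print(per_customer)
```

Grouping by checkout keeps customer 1's two invoices apart, while grouping by customer merges them into a single, longer transaction. Longer transactions tend to produce more co-occurring item pairs, which is worth keeping in mind when choosing a minimum support.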

In [3]:
def load_data():
    """
    Loads the ecommerce dataset as a pandas DataFrame
    Returns the DataFrame
    """
    ecom_df = pd.read_csv("ecommerce-data.csv", encoding="latin1")
    ecom_df["InvoiceDate"] = pd.to_datetime(ecom_df["InvoiceDate"])
    return ecom_df

In [4]:

def load_transactions(ecom_df, start_date=None, end_date=None, by_customer=False):
    """
    Turns the DataFrame of the ecommerce data into a list of lists representing transactions
    Params:
        ecom_df - the DataFrame returned by load_data()
        start_date - A start date to filter by, if None then does not filter by start date
                     Argument should be a string, format should be "YYYY-MM-DD"
        end_date - An end date to filter by, if None then does not filter by end date
                   Argument should be a string, format should be "YYYY-MM-DD"
        by_customer - if True, transactions represent all items bought by a customer
                      if False, transactions represent each checkout
    
    Returns the list of lists representing transactions
    """
    if start_date and end_date:
        ecom_df_filtered = ecom_df.loc[(ecom_df["InvoiceDate"] >= start_date) & (ecom_df["InvoiceDate"] < end_date), :]
    elif start_date:
        ecom_df_filtered = ecom_df.loc[(ecom_df["InvoiceDate"] >= start_date), :]
    elif end_date:
        ecom_df_filtered = ecom_df.loc[(ecom_df["InvoiceDate"] < end_date), :]
    else:
        ecom_df_filtered = ecom_df
    
    group_cols = "CustomerID" if by_customer else ["CustomerID", "InvoiceNo"]
    
    # Collect each group's item descriptions into one list per transaction.
    # Note: groupby drops rows whose CustomerID is missing (pandas default).
    transactions = ecom_df_filtered.groupby(group_cols)["Description"].apply(list).tolist()
    return transactions


In [None]:
# Load data here

## Task

Use the functions we discussed in the example notebook to do an Association Rule analysis of the ecommerce data. The basic analysis will have you use the complete dataset to generate some rules and draw some conclusions. There are three steps to the analysis:

1) Encode the transactions as a one-hot DataFrame. Use the TransactionEncoder() as shown in the Example notebook.

2) Generate the candidate item sets using the apriori function. The tricky part here is choosing a min_support that yields useful item sets, but doesn't take forever to run.

3) Generate the association rules using the association_rules function. The function takes the candidate item sets generated by the apriori function. Choose a confidence level that includes some interesting rules. When you display the rules, order them by lift.
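
As a reminder of what the support, confidence, and lift values in the rules table mean, the sketch below computes them by hand for a single rule on a toy dataset. It is a plain-Python illustration under stated assumptions, not a replacement for the mlxtend functions: the helpers `support` and `rule_metrics` and the toy transactions are invented for this example.

```python
# Toy transactions; the item names are invented for illustration.
toy_transactions = [
    ["mug", "teapot"],
    ["mug", "teapot"],
    ["candle"],
    ["candle", "mug"],
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    # Confidence: how often the consequent appears when the antecedent does.
    confidence = both / support(antecedent, transactions)
    # Lift: confidence relative to the consequent's baseline frequency.
    lift = confidence / support(consequent, transactions)
    return {"support": both, "confidence": confidence, "lift": lift}

metrics = rule_metrics(["teapot"], ["mug"], toy_transactions)
print(metrics)
```

Here every transaction with a teapot also contains a mug, so the confidence is 1, and because mugs appear in only three of the four transactions the lift is above 1. A lift close to 1 would suggest the two items appear together about as often as chance predicts.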



In [None]:
# begin analysis

## Discussion

Answer the following general questions about the rules you found:

1) What rules did you find that you think are obvious?

2) What rules did you find that you think are surprising?

3) What rules could yield actions for the e-commerce company? What could these actions be?

4) What additional investigations would you do using this data or another data source that could aid in the interpretation of the rules?


## Optional Tasks

There are some further analyses you can do if time permits:

1) Investigate how the time range affects the rules generated. Choosing different time ranges should expose different patterns. What rules exist only at Christmas? What rules are found in the summer?

2) Investigate how forming transactions by customer vs by checkout event can affect the rules found. What rules are unique to each method of aggregating transactions? What can this tell you about your customer base?
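
One possible way to compare two rule sets (for example, two time ranges or the two aggregation methods) is to reduce each rule to its antecedent/consequent pair and use set operations. The sketch below assumes the pairs have already been extracted, for instance with `zip(rules["antecedents"], rules["consequents"])` on a rules table; the helper names and the toy rules are hypothetical.

```python
def rule_keys(pairs):
    """Turn (antecedent, consequent) pairs into a set of hashable keys."""
    return {(frozenset(a), frozenset(c)) for a, c in pairs}

def compare_rule_sets(pairs_a, pairs_b):
    """Split rules into those unique to each set and those shared by both."""
    a, b = rule_keys(pairs_a), rule_keys(pairs_b)
    return {"only_a": a - b, "only_b": b - a, "shared": a & b}

# Toy rules, invented for illustration only.
christmas = [({"mug"}, {"teapot"}), ({"wrapping paper"}, {"ribbon"})]
summer = [({"mug"}, {"teapot"}), ({"picnic basket"}, {"blanket"})]

diff = compare_rule_sets(christmas, summer)
print(diff["only_a"])   # rules found only in the first set
print(diff["only_b"])   # rules found only in the second set
print(diff["shared"])   # rules found in both
```

This comparison ignores the metric values, so a rule counts as shared even if its lift differs between the two runs. Comparing the lift of shared rules would be a natural follow-up.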