# Module 4: Exercise

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In this exercise, we work with a bread basket data set. It contains 20,507 item records in five columns. The data were collected from customer transactions between January 11, 2016, and December 3, 2017, and cover a total of 9,000 transactions, each of which can include several items ordered from the store.

## Data Preprocessing

In [2]:
bakery = pd.read_csv('bread_basket.csv')
bakery.head()

| Unnamed: 0 | Transaction | Item | date_time | period_day | weekday_weekend |
|---|---|---|---|---|---|
| 0 | 1 | Bread | 30-10-2016 09:58 | morning | weekend |
| 1 | 2 | Scandinavian | 30-10-2016 10:05 | morning | weekend |
| 2 | 2 | Scandinavian | 30-10-2016 10:05 | morning | weekend |
| 3 | 3 | Hot chocolate | 30-10-2016 10:07 | morning | weekend |
| 4 | 3 | Jam | 30-10-2016 10:07 | morning | weekend |


>__Task 1__
>
>Count the number of rows and columns

In [None]:
...

>__Task 2__
>
>Find the data type of each column

In [None]:
...
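
As a hint for Tasks 1 and 2, here is a minimal sketch on a small made-up table. The `toy` DataFrame is hypothetical and only stands in for the bakery data; the same attributes should apply to `bakery`.

```python
import pandas as pd

# Hypothetical toy table standing in for the bakery data
toy = pd.DataFrame({
    "Transaction": [1, 2, 2],
    "Item": ["Bread", "Scandinavian", "Scandinavian"],
    "date_time": ["30-10-2016 09:58", "30-10-2016 10:05", "30-10-2016 10:05"],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# .dtypes lists one data type per column
print(toy.dtypes)
```

`toy.info()` reports the same information in a single call, along with non-null counts.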

### Data Type

Let’s modify the __date_time__ column to make handling of date and time information easier.

>__Task 3__
>
>- Convert the __date_time__ column to `datetime` type 
>- Print the column type to verify the result

In [None]:
...

In [None]:
# Print the column type
...
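
One possible approach to Task 3, sketched on a hypothetical `toy` column. The sample rows above look day-first (`30-10-2016 09:58`), so this sketch assumes the format `%d-%m-%Y %H:%M`; check that assumption against the full file before relying on it.

```python
import pandas as pd

# Hypothetical toy column in the same text format as the sample rows
toy = pd.DataFrame({"date_time": ["30-10-2016 09:58", "30-10-2016 10:05"]})

# Stating the format explicitly avoids day/month ambiguity
toy["date_time"] = pd.to_datetime(toy["date_time"], format="%d-%m-%Y %H:%M")

# Verify the conversion
print(toy["date_time"].dtype)
```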

We can now extract particular properties, such as date and hour.

>__Task 4__
>
>- Add the following four columns
>     - `date`: the date of the transaction
>     - `hour`: the hour of the transaction
>     - `month`: the month of the transaction, using `%Y-%m` format
>     - `weekday`: the day of the week for the transaction (e.g., Monday, Tuesday, etc.)
>- Drop the __date_time__ column from the data
>- Print the first 5 rows of the data to verify the result

In [None]:
...
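
For Task 4, the `.dt` accessor exposes the parts of a datetime column. The sketch below uses a hypothetical `toy` table whose `date_time` column is already of `datetime` type.

```python
import pandas as pd

# Hypothetical toy table with an already-converted datetime column
toy = pd.DataFrame({
    "Item": ["Bread", "Jam"],
    "date_time": pd.to_datetime(["2016-10-30 09:58", "2016-10-31 10:07"]),
})

toy["date"] = toy["date_time"].dt.date               # calendar date only
toy["hour"] = toy["date_time"].dt.hour               # hour as an integer
toy["month"] = toy["date_time"].dt.strftime("%Y-%m") # year-month as text
toy["weekday"] = toy["date_time"].dt.day_name()      # e.g. Sunday, Monday

# Drop the source column once the parts are extracted
toy = toy.drop(columns="date_time")
print(toy.head())
```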

### Strings

Since we have strings in the data, let’s ensure the data format is consistent for a better presentation of the information.

>__Task 5__
>
>- Change the names in the __Item__ column to lowercase
>- Remove spaces at the beginning and the end of the string
>- Print the first 5 rows of the data to verify the result

In [None]:
...
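
A minimal sketch for Task 5 using the `.str` accessor, which applies string methods to every value in a column. The `toy` table and its messy values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy column with inconsistent case and stray spaces
toy = pd.DataFrame({"Item": ["  Bread", "Hot chocolate ", "JAM"]})

# Lowercase first, then trim leading and trailing whitespace
toy["Item"] = toy["Item"].str.lower().str.strip()
print(toy.head())
```

Note that `str.strip()` only trims the ends, so the space inside `hot chocolate` is kept.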

## Exploratory Data Analysis

>__Task 6__
>
>Count how many times each unique item was sold
>
>- Sort items from the most frequent to the least frequent
>- Place these items in a new DataFrame called `freq_items` with __Item__ and __Count__ columns
>- Print the first 5 rows of the data to verify the result

In [None]:
...
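
One way to approach Task 6 is `value_counts`, which already sorts from most to least frequent. The names `toy` and `freq_toy` below are hypothetical stand-ins for `bakery` and `freq_items`.

```python
import pandas as pd

# Hypothetical toy column of sold items
toy = pd.DataFrame({"Item": ["coffee", "bread", "coffee", "tea", "coffee", "bread"]})

# Number of distinct items
print(toy["Item"].nunique())

# Frequency of each item, most frequent first, as a two-column DataFrame
freq_toy = (
    toy["Item"]
    .value_counts()
    .rename_axis("Item")
    .reset_index(name="Count")
)
print(freq_toy.head())
```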

>__Task 7__
>
>Calculate each item's share of all items sold, as a percentage
>
>- Display them as a third column in the `freq_items` data
>- Name this new column `Percentage`
>- Print the first 5 rows of the data to verify the result

In [None]:
...
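
A sketch of the percentage calculation for Task 7, assuming "percentage" means each item's count divided by the total count of all items. The `freq_toy` table is hypothetical.

```python
import pandas as pd

# Hypothetical frequency table in the same layout as freq_items
freq_toy = pd.DataFrame({"Item": ["coffee", "bread", "tea"], "Count": [3, 2, 1]})

# Share of each item in the total, expressed in percent
freq_toy["Percentage"] = freq_toy["Count"] / freq_toy["Count"].sum() * 100
print(freq_toy.head())
```

Because the denominator is the total over all rows, the percentages should be computed before the table is cut down to its top rows.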

## Visualization

Since there are many data points and displaying them all might be confusing, let’s limit our focus to the first 15 rows of the `freq_items` table.

In [11]:
freq_items = freq_items.head(15)

>__Task 8__
>
>Create a bar plot using seaborn, with the x-axis showing the items from `freq_items` and the y-axis showing how often each item appears across all transactions
>
>- Set `figsize` to (14,6)
>- Use a palette of your choice
>- Rotate the x-axis tick labels to avoid overlapping
>- Label the graph and axes correspondingly

In [None]:
...
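
A possible shape for the plot in Task 8, sketched on a hypothetical `freq_toy` table. Passing the item column as `hue` as well as `x` is one way to apply a palette per bar; the palette name, rotation angle, and title are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical frequency table in the same layout as freq_items
freq_toy = pd.DataFrame({"Item": ["coffee", "bread", "tea"], "Count": [3, 2, 1]})

fig, ax = plt.subplots(figsize=(14, 6))

# hue mirrors x so that each bar gets its own palette color
sns.barplot(data=freq_toy, x="Item", y="Count", hue="Item",
            dodge=False, palette="viridis", ax=ax)

# The legend would only repeat the x-axis labels, so remove it if present
if ax.get_legend() is not None:
    ax.get_legend().remove()

ax.tick_params(axis="x", rotation=45)  # rotate tick labels to avoid overlap
ax.set_title("Most frequently sold items (toy data)")
ax.set_xlabel("Item")
ax.set_ylabel("Count")
```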

>__Task 9__
>
>Display the number of transactions for each month
>
>- Use the `bakery` data to access the information for each month (hint: `groupby`)
>- Use a palette of your choice
>- Rotate the x-axis tick labels to avoid overlapping

In [None]:
...
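
For Task 9, the sketch below assumes that "number of transactions" means distinct transaction IDs per month, hence `nunique`; counting item rows instead would use `count`. The `toy` table is hypothetical and assumes a `month` column like the one added in Task 4.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy table: transaction 2 contains two items
toy = pd.DataFrame({
    "Transaction": [1, 2, 2, 3, 4, 5],
    "month": ["2016-10", "2016-10", "2016-10", "2016-11", "2016-11", "2016-11"],
})

# Distinct transactions per month
monthly = (
    toy.groupby("month")["Transaction"]
    .nunique()
    .reset_index(name="Transactions")
)
print(monthly)

fig, ax = plt.subplots(figsize=(10, 4))
sns.barplot(data=monthly, x="month", y="Transactions", hue="month",
            dodge=False, palette="crest", ax=ax)
if ax.get_legend() is not None:
    ax.get_legend().remove()
ax.tick_params(axis="x", rotation=45)
```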

## The Apriori Algorithm

Let’s begin by selecting the variables required for the implementation of the algorithm.

>__Task 10__
>
>Create a new DataFrame named `baskets`, with columns __Transaction__, __Item__, and __Count__
>
>- Fill out these columns using the information from the `bakery` data
>- Record the count of each item in the `Count` column
>- Print the first 5 rows of the data to verify the result

In [None]:
...
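
One way to build such a table for Task 10 is to group by both keys and count the rows in each group. The names `toy` and `baskets_toy` are hypothetical stand-ins for `bakery` and `baskets`.

```python
import pandas as pd

# Hypothetical toy table: transaction 2 contains the same item twice
toy = pd.DataFrame({
    "Transaction": [1, 2, 2, 3, 3],
    "Item": ["bread", "scandinavian", "scandinavian", "hot chocolate", "jam"],
})

# One row per (transaction, item) pair with the number of occurrences
baskets_toy = (
    toy.groupby(["Transaction", "Item"])
    .size()
    .reset_index(name="Count")
)
print(baskets_toy.head())
```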

Since the items are categorical, we need to encode them as binary indicators: one column per item, marking whether that item appears in each transaction.

>__Task 11__
>
>Create a new DataFrame named `transactions`, where each row indicates if an item is in the basket (transaction)
>
>- Use the `pivot_table` function for this task, and set `aggfunc` to `any`
>- Print the first 5 rows of the data to verify the result

In [None]:
...
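
A sketch of the pivot for Task 11 on a hypothetical `baskets_toy` table. Each transaction becomes a row and each item a column; combinations that never occur are left as `NaN`.

```python
import pandas as pd

# Hypothetical table in the same layout as baskets
baskets_toy = pd.DataFrame({
    "Transaction": [1, 2, 3, 3],
    "Item": ["bread", "scandinavian", "hot chocolate", "jam"],
    "Count": [1, 2, 1, 1],
})

# aggfunc="any" marks an item as present if any of its counts is non-zero
transactions_toy = baskets_toy.pivot_table(
    index="Transaction", columns="Item", values="Count", aggfunc="any"
)
print(transactions_toy.head())
```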

>__Task 12__
>
>- Convert the `NaN` values to 0 in the `transactions` data
>- Print the table to verify the result

In [None]:
...
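
For Task 12, `fillna` replaces the missing entries. The sketch uses a hypothetical `transactions_toy` table; the extra cast to `bool` is an assumption on our part, added because a clean boolean table is a convenient input for the next step.

```python
import numpy as np
import pandas as pd

# Hypothetical pivoted table with NaN where an item is absent
transactions_toy = pd.DataFrame(
    {"bread": [True, np.nan, np.nan], "jam": [np.nan, np.nan, True]},
    index=pd.Index([1, 2, 3], name="Transaction"),
)

# Replace missing values with 0
filled = transactions_toy.fillna(0)
print(filled)

# Optional: convert the mixed True/0 values to plain booleans
as_bool = filled.astype(bool)
print(as_bool.dtypes)
```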

We can now use the Apriori algorithm to find items that are associated with each other.

>__Task 13__
>
>Define a DataFrame named `freq_itemsets` using the Apriori method
>
>- Use the data from `transactions`
>- Set a minimum support value of 0.01
>- Retain column names in the returned DataFrame

In [None]:
...

Now that we have identified the itemsets that meet the threshold, we can use the lift factor to find stronger association rules.

>__Task 14__
>
>- Create rules using the `lift` metric with a minimum threshold of 1.1 and store them in a `rules` table
>- Sort the rules in descending order of `confidence`
>- Print the table to verify the result

In [None]:
...

Note that after sorting, the row index still shows each rule's position in the unsorted table.

>__Task 15__
>
>- Reset the indexes so the rows start from 0
>- Print the table to verify the result

In [None]:
...
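
A minimal sketch for Task 15 on a hypothetical `rules_toy` table. With `drop=True`, the old index is discarded instead of being kept as a new column.

```python
import pandas as pd

# Hypothetical rules table, sorted so that the index is out of order
rules_toy = pd.DataFrame({
    "rule": ["a -> b", "c -> d", "e -> f"],
    "confidence": [0.2, 0.9, 0.5],
})
rules_toy = rules_toy.sort_values("confidence", ascending=False)
print(rules_toy.index.tolist())  # [1, 2, 0]

# Renumber the rows from 0 and discard the old index
rules_toy = rules_toy.reset_index(drop=True)
print(rules_toy)
```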