In [1]:
import pandas as pd
import seaborn as sn
import plotly.express as px

# Assignment

In this assignment, we want to read in the `retail-churn.csv` dataset and run some EDA on the data. Generally speaking, when we run EDA on a dataset, we don't have a particular goal in mind. Instead we want to get a "gut-feel" for what the data looks like. The goal of the assignment is to show your ability to examine a dataset with increasing depth as you go.

In [None]:
col_names = ['user_id', 'gender', 'address', 'store_id', 'trans_id', 'timestamp', 'item_id', 'quantity', 'dollar']
churn = pd.read_csv("../../data/retail-churn.csv", sep = ",", skiprows = 1, names = col_names, index_col=3)
churn

In [None]:
churn.sort_values('store_id')

Here are some examples of questions we can be asking:

1. What are the columns, their types and their distribution (when it makes sense)? <span style="color:red" float:right>[1 point]</span>

> The columns are expanded and described by the code below. Right away we can see that `trans_id` and `store_id` are counters, with `store_id` effectively acting as the index for this dataset, which is why I set it as the index of the dataframe above.

In [None]:
for col in churn.columns:
    print(str(churn[col].describe()) + '\n\n')

In [None]:
churn.columns
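
> As a compact complement to the per-column loop, numeric and categorical columns can be summarized separately. This is only a sketch on a made-up stand-in frame (`toy` and its values are hypothetical, not taken from `retail-churn.csv`):

```python
import pandas as pd

# Toy stand-in for `churn`; the values are invented for illustration only.
toy = pd.DataFrame({
    "gender": ["F", "M", "F", "F"],
    "quantity": [1, 2, 1, 5],
    "dollar": [10.0, 25.5, 8.0, 60.0],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = toy.describe()

# Non-numeric columns: how often each category occurs.
gender_counts = toy["gender"].value_counts()

print(toy.dtypes)
print(numeric_summary)
print(gender_counts)
```

> By default `describe()` skips non-numeric columns, so a frequency table from `value_counts()` is the more informative view for a column like `gender`.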

2. Do the columns have the right types for the analysis? If not, convert them to the right type. <span style="color:red" float:right>[1 point]</span>

In [None]:
churn.dtypes

> Based on the questions asked below and a review of the data file, I decided that the following changes were needed to better understand the data:
>   1. `timestamp` needed to become actual `pd.Timestamps` for better analysis (and the next question asks us to convert it)
>   2. `item_id` & `user_id` are better served as string objects so that we can use categorical techniques with them for questions 5, 6, & 7

In [7]:
churn.timestamp = pd.to_datetime(churn.timestamp)
churn.item_id = churn.item_id.astype(int).astype(str)
churn.user_id = churn.user_id.astype(int).astype(str)

In [None]:
churn.dtypes
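
> A possible variation on the conversions above, sketched on made-up values: `errors="coerce"` surfaces unparseable timestamps as `NaT` instead of raising, and the `category` dtype is an alternative to plain strings for ID columns. The frame `toy` and its contents are hypothetical.

```python
import pandas as pd

# Invented values; the second timestamp is deliberately malformed.
toy = pd.DataFrame({
    "timestamp": ["2000-11-01", "not a date", "2001-02-28"],
    "user_id": [101.0, 102.0, 101.0],
})

# errors="coerce" turns unparseable strings into NaT, so bad rows can be counted.
toy["timestamp"] = pd.to_datetime(toy["timestamp"], errors="coerce")
n_bad = int(toy["timestamp"].isna().sum())

# `category` keeps the IDs non-numeric while storing each distinct value once.
toy["user_id"] = toy["user_id"].astype(int).astype(str).astype("category")

print(f"{n_bad} timestamp(s) could not be parsed")
print(toy.dtypes)
```

> Whether `category` is preferable to `str` here depends on how the IDs are used later; it mainly saves memory when the same IDs repeat many times.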

> These updates deserve a new view of the column descriptions:

In [None]:
for col in churn.columns:
    print(str(churn[col].describe()) + '\n\n')

3. Do any columns appear to have all rows with unique categories? How do we show that? <span style="color:red" float:right>[1 point]</span>

In [None]:
churn.shape

In [None]:
churn.nunique()

> Based on the shape `(252204, 9)` and the output of `.nunique()` above, it appears that every value of `store_id` and of `trans_id` is unique.

> Due to the results of this analysis I've set `store_id` (column 3 of the data CSV) to be the index for this assignment.
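
> The uniqueness check can also be made explicit rather than read off by eye. A minimal sketch on a hypothetical frame (`toy` is invented): a column is fully unique when its distinct-value count equals the row count. Note that `DataFrame.nunique()` only reports columns, so a column that has been moved into the index needs `Index.is_unique` instead.

```python
import pandas as pd

# Invented data: `trans_id` is unique per row, `user_id` repeats.
toy = pd.DataFrame({
    "trans_id": [1, 2, 3, 4],
    "user_id": ["a", "b", "a", "c"],
})

# True for each column whose number of distinct values equals the number of rows.
all_unique = toy.nunique() == len(toy)

# The index is not covered by nunique(); it has its own uniqueness flag.
index_unique = toy.index.is_unique

print(all_unique)
print("index unique:", index_unique)
```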

4. What are some "obvious" questions we can ask about the data? We can ask many questions here, but we limit it to two:
   - Is quantity or volume ever negative and why? <span style="color:red" float:right>[1 point]</span>
   - What is the date range covered by the data? <span style="color:red" float:right>[1 point]</span>  
     HINT: You will need to convert `timestamp` into a `datetime` column. You can use `pd.to_datetime` for that. We leave it to you to learn more about working with `datetime` columns.

> I suppose some "obvious" questions would be:
>   - How many times does one user have transactions?
>   - Are any regions showing more transactions?
>   - Is there a minimum transaction quantity that should be ignored to better understand which users purchase higher quantities, and why?

In [None]:
f"There are {'no' if not (churn['quantity'] < 0).any() else 'some'} values below 0 in the quantity column"

In [None]:
f"There are {'no' if not (churn['dollar'] < 0).any() else 'some'} values below 0 in the dollar column"

> I would be surprised to see quantity be negative, as that would *very likely* be an outlier as any sale should include at least one item.
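
> If either check did report negative values, the next step would be to count them and look at the rows themselves. A sketch on invented data (`toy` is hypothetical, and the negative row is planted on purpose):

```python
import pandas as pd

# Invented data with one deliberately negative row.
toy = pd.DataFrame({
    "quantity": [1, -2, 3, 1],
    "dollar": [9.5, -19.0, 30.0, 4.0],
})

# Count negative entries per column in one pass.
negative_counts = (toy[["quantity", "dollar"]] < 0).sum()

# Pull out the offending rows for a closer look.
negative_rows = toy[(toy["quantity"] < 0) | (toy["dollar"] < 0)]

print(negative_counts)
print(negative_rows)
```

> Looking at the rows, rather than just the boolean, is what would let us judge whether a negative value is an entry error or something meaningful.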

In [None]:
print(f"The date range starts on {churn.timestamp.min()} and ends on {churn.timestamp.max()}")
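
> The length of the range can be computed directly by subtracting the two timestamps, which yields a `Timedelta`. A sketch on made-up dates (the series `ts` is hypothetical):

```python
import pandas as pd

# Invented dates, not taken from the dataset.
ts = pd.Series(pd.to_datetime(["2000-11-01", "2000-12-15", "2001-02-28"]))

# Subtracting two Timestamps gives a Timedelta.
span = ts.max() - ts.min()
n_days = span.days

# Number of distinct calendar days that actually have data.
n_active_days = ts.dt.normalize().nunique()

print(f"The data spans {n_days} days, with records on {n_active_days} distinct days")
```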

5. What are some "not-so-obvious" questions we can ask about this data? What are some important summary statistics and visualizations we should look at to answer them? Note that having domain knowledge can make this easier, so here's a list of questions and your task is to pick at least two questions and answer them using statistical summaries or visualizations: <span style="color:red" float:right>[2 point]</span>
   - How many transactions on average do users have in a given week? 
   - Are there items that are more commonly sold in bulk (quantity greater than 1)? 
   - ~~How do quantity and volume tend to change over the course of the day (hour by hour)?~~

***Note that the above questions are intentionally phrased to sound non-technical. It is up to you to "translate" them into something that can be answered by a query on the data or a visualization.***

> To approach the first question, I focus on the number of transactions and the number of unique users per week in the `churn` dataset.

In [None]:
# group and resample based on user and week
user_trans_by_week = churn.groupby('user_id').resample('W-Mon', on='timestamp')['trans_id'].count()
# build a new dataset for calculating averages
trans_per_user_per_week = pd.DataFrame()
weeks = user_trans_by_week.index.levels[1]
for n,week in enumerate(weeks):
    # print(n,week.strftime('%Y-%m-%d'),weeks[n+1])
    try:
        # half-open window [week, next week) so rows stamped exactly on a boundary are counted once
        in_week = churn[(churn['timestamp'] >= week) & (churn['timestamp'] < weeks[n+1])]
        trans_count = in_week['trans_id'].count()
        uniq_users = in_week['user_id'].nunique()
    except IndexError:
        # trans_count = churn[(churn['timestamp'] > week)]['trans_id'].count()
        # uniq_users = churn[(churn['timestamp'] > week)]['user_id'].nunique()
        ## Turns out there is no transaction data after 2001-03-05, so we drop the last week
        break
    
    trans_per_user_per_week.loc[week.strftime('%Y-%m-%d'), 'trans_count'] = trans_count
    trans_per_user_per_week.loc[week.strftime('%Y-%m-%d'), 'unique_users'] = uniq_users
    trans_per_user_per_week.loc[week.strftime('%Y-%m-%d'), 'trans_per_user'] = trans_count/uniq_users if uniq_users>0 else None
    

trans_per_user_per_week

In [None]:
trans_per_user_per_week.describe()

In [None]:
px.bar(trans_per_user_per_week,y='trans_per_user',template='seaborn')

> For the 17 weeks within the dataset we have an average of 7.9 transactions/user without a large standard deviation. 

> Unfortunately, without more context about the data I can't tell what this figure really means, or whether it is consistent with the per-user view plotted below. That plot shows some users with very high weekly transaction counts. Taken together, the two analyses lead me to conclude that the vast majority of users have 2 or fewer transactions per week.

In [None]:
px.scatter(
    user_trans_by_week.reset_index().sample(5000), 
    x='timestamp',y='trans_id',color='user_id',
    labels={'trans_id': 'count of transactions'},
    )
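
> The weekly loop earlier in this section can also be written as one grouped aggregation. This is a sketch on invented data, not a replacement for the results above: `toy` is hypothetical, and `pd.Grouper` with `freq="W-MON"` labels each bin by the Monday that ends it, whereas the loop labels each week by its start.

```python
import pandas as pd

# Invented transactions spread over two weeks.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2000-11-01", "2000-11-02", "2000-11-02", "2000-11-08", "2000-11-09"]
    ),
    "user_id": ["a", "a", "b", "a", "c"],
    "trans_id": [1, 2, 3, 4, 5],
})

# One row per week: number of transactions and number of distinct users.
weekly = toy.groupby(pd.Grouper(key="timestamp", freq="W-MON")).agg(
    trans_count=("trans_id", "count"),
    unique_users=("user_id", "nunique"),
)

# Weeks without any users produce NaN here rather than a division error.
weekly["trans_per_user"] = weekly["trans_count"] / weekly["unique_users"]

print(weekly)
```

> Using named aggregation keeps the transaction count and the user count in one pass over the data, which avoids filtering the full dataframe once per week.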

> Now to approach the second question: are there items that are more commonly sold in bulk (quantity greater than 1)?

> The question requires us to look solely at the `quantity` column, but on a per-`item_id` basis.

In [None]:
items_stats = churn.groupby('item_id')['quantity'].describe()
items_stats

> Now we limit our results to only those `item_id`s whose `quantity` values show some variance (a standard deviation above 0):

In [None]:
items_stats[items_stats['std'] > 0]

In [None]:
px.bar(items_stats[items_stats['std'] > 0],y='max',log_y=True)

> This approach shows that a little more than a third of the `item_id`s have at least one transaction with a quantity greater than 1. That doesn't feel very descriptive, so we look for those with a `max` greater than 2 to see which `item_id`s have a transaction with 3 or more items sold.

In [None]:
items_stats[(items_stats['std'] > 0) & (items_stats['max'] > 2)]

In [None]:
px.bar(items_stats[(items_stats['std'] > 0) & (items_stats['max'] > 2)], y='max', log_y=True)

> In the end, there is definitely a subset of `item_id`s that have been sold in bulk at least once. Whether they are sold in bulk *often* would require comparing each item's bulk transactions against its total number of transactions.
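
> One way to get at "more commonly sold in bulk" is the share of each item's transactions that have a quantity above 1. A minimal sketch on invented data (`toy`, `is_bulk`, and `bulk_by_item` are hypothetical names):

```python
import pandas as pd

# Invented data: item "x" is sometimes bulk, "y" never, "z" always.
toy = pd.DataFrame({
    "item_id": ["x", "x", "x", "y", "y", "z"],
    "quantity": [1, 3, 2, 1, 1, 6],
})

# Flag bulk rows, then take the per-item mean of the flag (a share between 0 and 1).
bulk_by_item = (
    toy.assign(is_bulk=toy["quantity"] > 1)
    .groupby("item_id")["is_bulk"]
    .agg(n_trans="count", bulk_share="mean")
    .sort_values("bulk_share", ascending=False)
)

print(bulk_by_item)
```

> Keeping `n_trans` next to `bulk_share` matters: an item with a single bulk transaction scores 1.0, so a minimum transaction count would likely be needed before ranking items on the real data.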

6. Do the results mesh with what we expected? Note that to answer this we need to have some domain knowledge, so you can ignore this for the assignment. <span style="color:red" float:right>[0 point]</span>

> Well, following the prompt above, I'm ignoring this question because I have zero expectations of what this data is or what it should look like. I'm not even sure what an outlier could be.

7. What are additional features we could extract from the data? This is especially relevant if the data contains a timestamp column or raw text column (such as a full address for example). <span style="color:red" float:right>[1 point]</span>

> Running a simple pair plot can help:

In [None]:
sn.pairplot(churn.sample(200))

> But since most of the data is not numerical anymore, I'll use a different approach:

In [None]:
f"The available columns are {list(churn.columns)}"

In [None]:
fig = px.scatter_matrix(churn, dimensions=['user_id', 'gender', 'address', 'timestamp', 'item_id', 'quantity', 'dollar'], template='seaborn')
fig.update_traces(showupperhalf=False, diagonal_visible=False)
fig.show()

> Based on the above plot, I'd add the following as potential correlations to investigate:
>   - dollars vs gender
>   - dollars vs address
>   - quantity vs gender
>   - quantity vs address
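
> Since the question singles out timestamp columns, calendar features are another natural set of candidates. A sketch on made-up rows (`toy` and the new column names are hypothetical), using the `.dt` accessor:

```python
import pandas as pd

# Invented rows; the times of day are made up as well.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2000-11-01 09:30", "2000-11-04 18:05", "2001-01-15 12:00"]
    ),
    "dollar": [12.0, 40.0, 7.5],
})

# Calendar features derived from the timestamp.
toy["day_of_week"] = toy["timestamp"].dt.day_name()
toy["hour"] = toy["timestamp"].dt.hour
toy["month"] = toy["timestamp"].dt.month
# dayofweek counts Monday as 0, so 5 and 6 are Saturday and Sunday.
toy["is_weekend"] = toy["timestamp"].dt.dayofweek >= 5

print(toy)
```

> The `hour` feature is only useful if the real timestamps carry a time of day, which would need to be checked first.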

8. Do I see any relationships between the features in the data? You will need to back this up with some statistical summaries or visualizations like what we covered in the lab. <span style="color:red" float:right>[2 point]</span>

> Based on the scatter matrix above, I see the following questions/relationships that could exist:
>   - are there more items with time?
>   - are there more users with time? do old users stop returning?
>     - this may be something like count of users with transactions per week
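
> These questions could be backed by a weekly summary of active users, distinct items, and first-time users. The following is only a sketch on invented data: `toy` is hypothetical, and weeks are defined here as Tuesday-to-Monday periods (`W-MON`) labeled by their start.

```python
import pandas as pd

# Invented transactions across three weeks.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2000-11-01", "2000-11-02", "2000-11-08", "2000-11-09", "2000-11-15"]
    ),
    "user_id": ["a", "b", "a", "c", "a"],
    "item_id": ["x", "y", "x", "z", "y"],
})

# Label every row with the start of its week.
toy["week"] = toy["timestamp"].dt.to_period("W-MON").dt.start_time

# Distinct users and distinct items seen in each week.
weekly = toy.groupby("week").agg(
    active_users=("user_id", "nunique"),
    distinct_items=("item_id", "nunique"),
)

# The week in which each user first appears, counted per week.
first_week = toy.groupby("user_id")["week"].min()
weekly["new_users"] = first_week.value_counts().reindex(weekly.index, fill_value=0)

print(weekly)
```

> A week where `active_users` stays flat while `new_users` is high would suggest that earlier users are not returning, which is the pattern the last question asks about.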

Run EDA on the data and answer the above questions and any additional questions that may cross your mind along the way. As you can imagine, there isn't a single way to proceed, and the answer doesn't always have to be exact. It is up to you to decide how you want to convey the results, but assume that your audience is non-technical and not familiar with some of the terminology we learned in the lecture.

There are also third-party libraries we can use to run EDA. One example is the `pandas-profiling` library (since renamed `ydata-profiling`), which provides us with a full report. You do not need to use it in this assignment, but we recommend that you install it and take a look on your own time.

# End of assignment