# University of Chicago - Applied Machine Learning Fellows Program Analysis Task | Aram Dovlatyan | Task 1
I searched for relevant datasets to build a minimum viable product on and found none that met the requirements listed in the task. Therefore, I will create a sample transaction dataset and work through its transformations up to the final model output.

For the sake of example, I will generate data that is random to a degree. Given enough time, you could simulate idealized datasets that act as proxies for types of people, such as working professionals, students, kids, or parents. Looking at the variables listed in the task, I can already see patterns, which I briefly cover in the bullet points below.
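
A minimal sketch of how such a partly random dataset could be generated. The business IDs, amounts, and the `generate_transactions` helper are all hypothetical and chosen only for illustration: two fixed monthly bills are mixed with a few random one-off purchases per month.

```python
import random
from datetime import date

def generate_transactions(start, months, seed=0):
    """Return (business_id, date, value) tuples mixing recurring bills with random purchases."""
    rng = random.Random(seed)
    # Hypothetical recurring bills: (business ID, day of month, amount).
    recurring = [("A", 2, 14.99), ("B", 1, 9.99)]
    transactions = []
    year, month = start.year, start.month
    for _ in range(months):
        for business_id, day, value in recurring:
            transactions.append((business_id, date(year, month, day), value))
        # A few random one-off purchases; days 1-28 are valid in every month.
        for _ in range(rng.randint(3, 6)):
            business_id = rng.choice(["C", "D", "E", "F"])
            day = rng.randint(1, 28)
            value = round(rng.uniform(3.0, 80.0), 2)
            transactions.append((business_id, date(year, month, day), value))
        # Advance to the next calendar month.
        month += 1
        if month > 12:
            month, year = 1, year + 1
    transactions.sort(key=lambda t: t[1])
    return transactions

transactions = generate_transactions(date(2019, 1, 1), months=6)
print(len(transactions), transactions[:3])
```

Fixing the seed keeps the sample reproducible, which makes it easier to compare runs while tuning the model.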

# Outline of Variables
Business ID
- Recurring bills are paid to the same company unless a person changes their plan. Examples include utilities, subscriptions (more common now), insurance, internet, and memberships.
- People purchase goods from the same businesses and have makeshift routines. Examples include daily coffee from Starbucks, lunch from a local food joint, and weekend shopping.

Transaction Date
- Bills usually get charged on a set date every month. Examples include the first day of the month, the end of the month, or a particular first/last day of the week (e.g., Monday/Tuesday).
- People create day-of-week patterns in their purchase history. Examples include brunch on weekends, breakfast/lunch on weekdays, shopping on weekends, and going out on Friday and/or Saturday.

Transaction Time
- There will probably be patterns in the time blocks of purchases. Examples include breakfast between 7-10am, lunch from 12-1pm, and drinks on Friday/Saturday evenings. This would be an interesting variable to dig into and extract patterns from.
- Higher-level patterns could form from seasonality and behavioral effects due to weather.

Transaction Value
- Some bills, such as subscriptions and memberships, are charged the same amount every time, but utilities might differ from month to month.
- Some purchases are repeated, implying the same dollar value. Examples include daily coffee, lunch, or other foods, as well as fixed-price purchases like drop-ins at a spa or gym, or an entrance fee.

# Analysis
**Problem:** The goal is to facilitate financial management and planning for Americans: the tool is intended to provide convenience and improve lives through money management. We want to search transaction statements and identify recurring transactions in order to predict the bills due in the next month. Users will be aware of upcoming expenses and can plan accordingly, and the same awareness helps with tracking expenses, budgeting, and financial planning, letting people extract insights from the data they create. Every person generates a vast amount of information, and extracting relevant knowledge from it and delivering it back to people is a powerful service. Using the business ID and the transaction characteristics (date, time, and value), we have to build a model that mines the data for patterns, extracts the recurring transactions we are most confident in, and notifies the user with a forecast of expenses for the next month. In addition to the recurring bills, it would also be useful to calculate a weighted moving average of expenses as a prediction of total expenses for the month. The CitiBank app on my iPhone does this. In the screenshot below, the circles are predicted upcoming bills, and the dots represent a piecewise function where unknown expenses fall in between.

<img src="https://lh3.googleusercontent.com/VBXcUtCXTaSYbrxVa2l-yDkpEhuQIwtjl3oKa8nZ9TsSwrxkpd0NReMFk9PauZsb63oicza-UPVHFFfjJqGN1bGfij0=s268" alt="A screenshot of CitiBank monthly expense prediction features" title="An example from my transaction history" />
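
The weighted moving average mentioned in the problem statement could be sketched as follows. The window of three months, the linearly increasing weights, and the monthly totals are assumptions made for illustration; the source does not specify a weighting scheme.

```python
def weighted_moving_average(monthly_totals, window=3):
    """Forecast next month's total as a linearly weighted average of the last `window` months."""
    recent = monthly_totals[-window:]
    # Weights 1, 2, ..., n: the most recent month counts the most.
    weights = range(1, len(recent) + 1)
    return sum(w * x for w, x in zip(weights, recent)) / sum(weights)

# Hypothetical total expenses for the last four months, oldest first.
monthly_totals = [1200.0, 1350.0, 1100.0, 1500.0]
forecast = weighted_moving_average(monthly_totals)
print(round(forecast, 2))
```

This forecast covers total spending; the recurring bills found by the pattern mining would be reported separately alongside it.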

**Methodology and Approach:** My approach is to use association rule mining algorithms to search the transaction statement for frequent sets of values across business ID, transaction date, time, and value. I would start with Apriori and, if necessary, move to improved versions of it. I believe this is a lightweight solution that is relatively easy to implement and a good starting point for the problem, possibly even an ideal final solution. Computational complexity might be an issue, but there is an array of improved Apriori variants that address it. When I approach data science problems, I consider the simplest solutions first and work my way up in complexity only if necessary. The association rule algorithm can solve the problem by discovering frequent transactions, provided the time-series dataset is large enough and the *support* and *confidence* parameters are tuned properly. The algorithm can pick up frequent association rules that span several variables, and we can then deduce whether each one is a recurring transaction. For example, there could be an association between Business ID "A", Transaction Date "2nd of the Month", and a value of "14.99" that has high confidence relative to Business ID "A", meaning we see this set of values in a high percentage of the transactions with Business ID "A". The confidence is the percentage of Business ID "A" transactions in which the date "2nd of the Month" and the value "14.99" occur. We can gather associations like this and deduce whether they are recurring transactions. We could also apply further constraints and criteria to the results of the algorithm's search, then use the remaining associations as the transactions our model predicts for upcoming months.
You can choose which variable the confidence rule is relative to; however, I believe Business ID would be the strongest indicator of recurring transactions when using the confidence metric.
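
To make the Business ID "A" example concrete, here is a small sketch that computes confidence relative to each business ID by simple counting. It is not a full Apriori run; the transactions and the `confidence_by_business` helper are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical transactions: (business ID, day of month, value).
transactions = [
    ("A", 2, 14.99), ("A", 2, 14.99), ("A", 2, 14.99), ("A", 17, 3.50),
    ("B", 5, 42.10), ("B", 19, 8.75),
]

def confidence_by_business(transactions):
    """For each business ID, return {(day, value): confidence relative to that business ID}."""
    patterns = defaultdict(Counter)
    for business_id, day, value in transactions:
        patterns[business_id][(day, value)] += 1
    result = {}
    for business_id, counts in patterns.items():
        total = sum(counts.values())  # all transactions with this business ID
        result[business_id] = {pattern: n / total for pattern, n in counts.items()}
    return result

confidences = confidence_by_business(transactions)
print(confidences["A"][(2, 14.99)])  # 3 of the 4 "A" transactions, i.e. 0.75
```

A high value such as 0.75 for ("A", 2nd of the month, 14.99) is the kind of signal that would flag a candidate recurring bill, while the two unrelated "B" purchases each score only 0.5 on a support of a single transaction.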

The first thing I would do is filter down to one or several customer IDs to build a minimum viable product on and test ideas. I want to feature engineer Transaction Date into the weekday (e.g., Monday, Tuesday) and the day of the month (1st, 2nd, ...). Deriving these variables will help in mining transactions for similar patterns across business ID, date, and value. In addition, the number of binary variables constructed from transaction values should be limited, both for the sake of computational complexity and to avoid overfitting. A good method is to use exploratory data analysis and descriptive statistics such as histograms, box plots, summary statistics, and frequency plots to find the most frequent transaction values (e.g., 29.99, 39.99, ...). A similar pre-processing step can be applied to business IDs, since that is also a categorical variable: it would not be efficient to keep one-off business IDs in a dataset meant for frequent pattern mining. After all this pre-processing I would expect to have a binary variable for every unique Business ID (that I have selected to use), each weekday, each day of the month, and certain transaction values. Using these variables, the association rule mining algorithm will search the space of transactions as I tune the thresholds for *support* and *confidence*.
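
The feature engineering described above might look like the following sketch. It uses only the standard library; the item naming scheme, the `engineer_items` helper, and the `min_value_count` cutoff are assumptions for illustration, and the same steps map naturally onto pandas.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical raw rows: (business ID, transaction date, transaction value).
rows = [
    (1, "3/1/19", 9.99), (2, "3/15/19", 14.99), (3, "4/1/19", 9.99),
    (4, "4/12/19", 29.99), (3, "5/1/19", 9.99),
]

def engineer_items(rows, min_value_count=2):
    """Turn each row into a set of categorical items suitable for frequent pattern mining."""
    # Keep only transaction values that occur often enough to be worth a variable.
    value_counts = Counter(value for _, _, value in rows)
    frequent_values = {v for v, n in value_counts.items() if n >= min_value_count}
    baskets = []
    for business_id, date_text, value in rows:
        day = datetime.strptime(date_text, "%m/%d/%y")
        items = {
            f"Business ID {business_id}",
            f"Weekday {WEEKDAYS[day.weekday()]}",
            f"Day of Month {day.day}",
        }
        if value in frequent_values:
            items.add(f"Value {value}")
        baskets.append(items)
    return baskets

baskets = engineer_items(rows)
for basket in baskets:
    print(sorted(basket))
```

Dropping rare values before encoding is what keeps the number of binary variables, and therefore the search space, manageable.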




**Methodology Outline:**

Data Pre-Processing
- Search through the dataset for any missing values and errors, clean the data
- Identify relevant features to use from the dataset

Exploratory Data Analysis
- Explore the dataset and relationships between the variables with visualizations
- Utilize descriptive statistics like histograms, distributions, plots to understand the structure of the dataset

Feature Engineering
- Construct features from the original variables to improve model performance
- Create relevant categorical variables from Business ID, Transaction Date, and Value

Model Testing
- Observe the associations discovered with the variables created, refine the features if necessary to reach desired results (Recurring transactions)
- Split the data into a training and a test set. For example, see whether you can find recurring bills in 6 months of transactions, then use the strong association rules discovered to predict the upcoming bills for the next 6 months.
- Tune the model parameters, the support and confidence thresholds

Evaluation
- Is the solution presented an efficient model to scale and work in real-time?
- Are the patterns mined accurate in regards to predicting upcoming bills?
- Gather feedback, refine and continue to optimize the model
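
The train/test idea from the outline can be sketched as below. To keep the example short, it swaps full association rule mining for a simpler stand-in: a pattern counts as recurring if it appears in a large share of the training months. The `mine_recurring` and `hit_rate` helpers, the 0.8 cutoff, and the data are hypothetical.

```python
from collections import defaultdict
from datetime import date

def mine_recurring(transactions, min_month_share=0.8):
    """Return (business ID, day of month, value) patterns seen in enough distinct months."""
    months = {(d.year, d.month) for _, d, _ in transactions}
    seen_in = defaultdict(set)
    for business_id, d, value in transactions:
        seen_in[(business_id, d.day, value)].add((d.year, d.month))
    return {p for p, m in seen_in.items() if len(m) / len(months) >= min_month_share}

def hit_rate(predicted, test_transactions):
    """Share of (pattern, test month) pairs in which the predicted bill actually occurred."""
    months = {(d.year, d.month) for _, d, _ in test_transactions}
    actual = {(b, d.day, v, d.year, d.month) for b, d, v in test_transactions}
    checks = [(b, day, v, y, m) in actual for (b, day, v) in predicted for (y, m) in months]
    return sum(checks) / len(checks) if checks else 0.0

# Hypothetical year of data: a 14.99 bill on the 2nd of each month plus one-off purchases.
history = [("A", date(2019, m, 2), 14.99) for m in range(1, 13)]
history += [("C", date(2019, 3, 9), 23.40), ("D", date(2019, 8, 21), 61.00)]

# First 6 months for mining, last 6 months for checking the predictions.
train = [t for t in history if t[1] < date(2019, 7, 1)]
test = [t for t in history if t[1] >= date(2019, 7, 1)]

predicted = mine_recurring(train)
print(predicted, hit_rate(predicted, test))
```

Splitting by time rather than at random matters here, because the point is to check whether patterns found in the past still hold in later months.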

**Alternative Method 1:** An alternative method is to treat transaction value over time as a time series, model it with a method such as ARIMA, and separate it into trend, seasonality, and residuals. With the time series, you can search for periodic behavior using periodograms, setting different time intervals and looking for recurring deviations. In addition, you can use **time series motifs**, which are repeated patterns in a signal over time. I think these are very practical methods, as they are well suited to the task at hand and use statistical techniques that many people are familiar with. My best guess is that this is what personal banks use in their class of prediction algorithms, as in the image shown earlier from my bank's application.
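
As a rough illustration of the periodogram idea, the sketch below computes the power at each Fourier frequency of a daily spending series with a plain discrete Fourier transform and reports the strongest period. The spending series is hypothetical (heavier spending on two consecutive days of every week), and a real series would be far noisier.

```python
import cmath

def periodogram(series):
    """Return {period_in_samples: power} for each Fourier frequency of a mean-centered series."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    powers = {}
    for k in range(1, n // 2 + 1):
        # Discrete Fourier coefficient at frequency k / n cycles per sample.
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(centered))
        powers[n / k] = abs(coeff) ** 2 / n
    return powers

# Hypothetical daily spend over 12 weeks: 50.00 on two consecutive days each week, else 0.
daily_spend = [50.0 if day % 7 in (5, 6) else 0.0 for day in range(84)]
powers = periodogram(daily_spend)
dominant_period = max(powers, key=powers.get)
print(dominant_period)  # 7.0, the weekly cycle
```

A peak at a period of 7 days points to a weekly habit; a monthly bill would show up as a peak near 30 days given a long enough history.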

**Alternative Method 2:** Another alternative is to categorize Business ID and Transaction Date and use clustering algorithms with *Gower Distance*. Gower Distance is appropriate because we are clustering business IDs, transaction dates, and values, which are a mix of categorical and numerical variables. It is designed to handle this mix and can discover clusters of similar records in the data. We would infer that clustered transactions are similar and possibly recurring as well. Traditional clustering methods are typically built on Euclidean distance, which is a meaningful metric for numerical variables only.
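
A minimal sketch of Gower distance for records with mixed types: categorical features contribute 0 when equal and 1 otherwise, numeric features contribute the absolute difference scaled by the feature's range, and the result is the average over features. The record layout and the `gower_distance` helper are hypothetical.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower distance between two records given as dicts.

    numeric_ranges maps each numeric feature to its (max - min) over the dataset;
    every other feature is treated as categorical.
    """
    total = 0.0
    for feature in a:
        if feature in numeric_ranges:
            span = numeric_ranges[feature]
            # Range-scaled absolute difference, always between 0 and 1.
            total += abs(a[feature] - b[feature]) / span if span else 0.0
        else:
            # Simple matching for categorical features.
            total += 0.0 if a[feature] == b[feature] else 1.0
    return total / len(a)

# Hypothetical engineered transactions.
records = [
    {"business_id": "A", "weekday": "Tuesday", "value": 14.99},
    {"business_id": "A", "weekday": "Tuesday", "value": 14.99},
    {"business_id": "B", "weekday": "Friday", "value": 54.99},
]
values = [r["value"] for r in records]
ranges = {"value": max(values) - min(values)}

print(gower_distance(records[0], records[1], ranges))  # identical records: 0.0
print(gower_distance(records[0], records[2], ranges))  # differs on every feature: 1.0
```

The resulting pairwise distances could then be passed to any clustering method that accepts a precomputed distance matrix, such as hierarchical clustering.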

**Model Description:** Association rule algorithms like Apriori, and the many improved versions of it, sift through rows looking for frequent and recurring associations. Apriori uses a breadth-first search rather than depth-first, but there are variants with better algorithmic complexity that use depth-first search. The first metric it uses is *support*: out of the total set of rows, what proportion contain a certain set of attribute values. A set of attribute values is *frequent* if that proportion is at least the *support threshold*. An example will help illustrate this. Imagine we are using one of the business IDs as an attribute and have 100 transactions, and we set a *support threshold* of 0.5. The support for that business ID is the number of transactions containing it divided by the total number of transactions. If there were 20 transactions with that business ID, its support would be 20 / 100 = 0.20. This is less than our *support threshold* of 0.5, so that business ID is not *frequent* and would not appear in any discovered rule.
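
The level-wise, breadth-first search can be sketched in a few lines. This is a simplified teaching version of Apriori rather than an optimized implementation; the baskets and the 0.4 threshold are hypothetical.

```python
from itertools import combinations

def support(itemset, baskets):
    """Proportion of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def apriori(baskets, min_support):
    """Level-wise search: frequent k-itemsets are built from frequent (k-1)-itemsets."""
    items = sorted({item for basket in baskets for item in basket})
    level = [frozenset([i]) for i in items if support(frozenset([i]), baskets) >= min_support]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset, baskets)
        k += 1
        # Candidate generation: unions of frequent itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        level = [c for c in candidates if support(c, baskets) >= min_support]
    return frequent

# Hypothetical transactions, each written as a set of engineered items.
baskets = [
    {"Business ID 1", "1st of Month", "Value 9.99"},
    {"Business ID 2", "Friday"},
    {"Business ID 3", "1st of Month", "Value 9.99"},
    {"Business ID 4", "Friday"},
    {"Business ID 3", "1st of Month", "Value 9.99"},
]

frequent = apriori(baskets, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning step is what gives Apriori its name: an itemset can only be frequent if all of its subsets are, so most candidates never need to be counted.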

Building on support, the *confidence* metric comes next. Confidence measures how often a full set of attributes occurs relative to the rows that contain one chosen item from that set. It is easier to explain through an example. Let's say we want to find the confidence between a business ID and a date. Take Business ID "1" and the date 03/01/19; we can feature engineer the date into the first of the month or into its weekday (Friday in this case). The *confidence* of the association rule "Business ID 1 -> First of the Month" is the number of transactions that both have Business ID 1 and land on the first of a month, *divided* by the total number of transactions containing Business ID 1. This tells you the percentage of Business ID 1 transactions that occur on the first of the month. The same calculation can be done with Fridays instead of the 1st of the month. You can also think of this as a probability: the *confidence* is the probability that both items A and B occur *divided* by the probability that item A occurs, Confidence(A => B) = P(A ∩ B) / P(A). We also set the confidence threshold before running the algorithm's search.
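
The confidence formula translates directly into code. The sketch below uses hypothetical baskets and also shows that confidence is not symmetric: swapping the two sides of a rule changes the denominator.

```python
def confidence(antecedent, consequent, baskets):
    """Confidence(A => B) = count(A and B together) / count(A)."""
    with_a = [b for b in baskets if antecedent <= b]
    if not with_a:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return sum(consequent <= b for b in with_a) / len(with_a)

# Hypothetical transactions, each written as a set of engineered items.
baskets = [
    {"Business ID 1", "1st of Month", "Value 9.99"},
    {"Business ID 2", "Friday"},
    {"Business ID 3", "1st of Month", "Value 9.99"},
    {"Business ID 4", "Friday"},
    {"Business ID 3", "1st of Month", "Value 9.99"},
]

# Both "Business ID 3" transactions fall on the 1st: 2 / 2.
print(confidence({"Business ID 3"}, {"1st of Month"}, baskets))
# Only 2 of the 3 first-of-month transactions are "Business ID 3": 2 / 3.
print(confidence({"1st of Month"}, {"Business ID 3"}, baskets))
```

This asymmetry is why it matters which variable the rule is stated relative to.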

Now I will work through a small sample to demonstrate the process of applying the model.

In [11]:
import pandas as pd
import numpy as np


In [12]:
sample_snippet = pd.DataFrame(
{"Business ID" : [1,2,3,4,3],
"Transaction Date" : ["3/1/19", "3/15/19", "4/1/19", "4/12/19", "5/1/19"],
"Transaction Value" : [9.99, 14.99, 9.99, 29.99, 9.99]},
index = [1, 2, 3, 4, 5])

sample_snippet
# This is what a snippet of the transaction statement would look like represented
# as a tabular data frame. Notice how a 9.99 transaction value with the same business ID
# shows up twice on the first day of the month. This could be a hint, and the association
# rule algorithm would be able to pick up this pattern.

Unnamed: 0,Business ID,Transaction Date,Transaction Value
1,1,3/1/19,9.99
2,2,3/15/19,14.99
3,3,4/1/19,9.99
4,4,4/12/19,29.99
5,3,5/1/19,9.99


In [13]:
# Before the data is put into the Apriori algorithm, the necessary pre-processing
# is done: each categorical value is turned into its own binary variable.
# This is sometimes referred to as one-hot encoding.
# This is what it could look like after the preparation. We can use different feature
# engineered versions of Transaction Date, such as weekday and/or day of the month.
# Feature engineering and relevant data are among the most important components of
# the machine learning process and would have the most impact on model performance.
# Notice how the representation is now simple binary values, so we can
# perform the probability operations from the Apriori algorithm rules.

sample_prepared = pd.DataFrame(
{ "Business ID 1" : [1,0,0,0,0],
"Business ID 2" : [0,1,0,0,0],
"Business ID 3" : [0,0,1,0,1],
"Business ID 4" : [0,0,0,1,0],
"1st of Month" : [1,0,1,0,1],
"Friday" : [1,1,0,1,0],
"Value 9.99" : [1,0,1,0,1]},
index = [1,2,3,4,5])

sample_prepared

Unnamed: 0,Business ID 1,Business ID 2,Business ID 3,Business ID 4,1st of Month,Friday,Value 9.99
1,1,0,0,0,1,1,1
2,0,1,0,0,0,1,0
3,0,0,1,0,1,0,1
4,0,0,0,1,0,1,0
5,0,0,1,0,1,0,1


In [16]:
# As an example of what the final results would look like, I printed another dataframe
# that shows a sample of rules and their metric values. Using this returned list
# we can set additional constraints as well as test our predictions on an out-of-sample
# dataset to see how the model performs in finding recurring transactions.

apriori_results = pd.DataFrame(
{ "Rule" : ["Business ID 3 -> 1st of Month","Business ID 3 -> 1st of Month -> Value 9.99"],
"Support" : [0.40, 0.40],
"Confidence" : [1, 1]},
index = [1,2])

apriori_results

Unnamed: 0,Rule,Support,Confidence
1,Business ID 3 -> 1st of Month,0.4,1
2,Business ID 3 -> 1st of Month -> Value 9.99,0.4,1


The rules that pass the support and confidence filters are the mined patterns we treat as predicted recurring transactions.
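
As a final illustration, the filtered rules could be turned into a forecast for a given month. This is a sketch under stated assumptions: the rule dictionaries, the `forecast_bills` helper, and both thresholds are hypothetical, and each rule is assumed to carry a business ID, a day of the month, and a value.

```python
from datetime import date

# Hypothetical mined rules with their metrics.
rules = [
    {"business_id": 3, "day_of_month": 1, "value": 9.99, "support": 0.40, "confidence": 1.0},
    {"business_id": 2, "day_of_month": 15, "value": 14.99, "support": 0.20, "confidence": 1.0},
]

def forecast_bills(rules, year, month, min_support=0.3, min_confidence=0.8):
    """Keep rules passing both thresholds and place them on the calendar of the given month."""
    kept = [r for r in rules
            if r["support"] >= min_support and r["confidence"] >= min_confidence]
    # Assumes the day exists in the target month; days 29-31 would need extra handling.
    bills = [(date(year, month, r["day_of_month"]), r["business_id"], r["value"]) for r in kept]
    return sorted(bills)

upcoming = forecast_bills(rules, 2019, 6)
for due, business_id, value in upcoming:
    print(due.isoformat(), business_id, value)
```

Only the Business ID 3 rule survives the support filter here, so the forecast for June 2019 contains a single 9.99 bill on the 1st.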