# Unsupervised Machine Learning with Association
Michaela Webster - mawebster9

_This notebook is an introductory guide to machine learning and walks you through the machine learning process with a sample dataset. You can use your own dataset with some minor adjustments to the code. This notebook provides guidance on how to implement both text-based and boolean-based solutions to machine learning._

_This guide only touches on a few machine learning techniques and should not be used as your one-stop shop for all machine learning problems._

__Dataset was downloaded from https://drive.google.com/file/d/1y5DYn0dGoSbC22xowBq2d4po6h1JxcTQ/view__

In [78]:
#import statements
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori

## Import Dataset

For this example, we will focus on the relationships between different food items purchased. The data is separated by transaction and each transaction can have multiple items purchased. Our goal is to see if there are any relationships between two food items based on the given transactions. 

In [79]:
#connect to CSV file that contains our data
path_to_file = "https://raw.githubusercontent.com/mawebster9/MachineLearningMadeEasy/master/store_data.csv"

#open, read, and store our data into a pandas dataframe
df = pd.read_csv(path_to_file, encoding='latin-1', header=None)

In [80]:
#print first 10 entries to see data
df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
5,low fat yogurt,,,,,,,,,,,,,,,,,,,
6,whole wheat pasta,french fries,,,,,,,,,,,,,,,,,,
7,soup,light cream,shallot,,,,,,,,,,,,,,,,,
8,frozen vegetables,spaghetti,green tea,,,,,,,,,,,,,,,,,
9,french fries,,,,,,,,,,,,,,,,,,,


## Data Preprocessing

We have our data in a pandas dataframe, but apriori expects a list of lists: one outer list of all transactions, where each transaction is itself a list of the items purchased. We need to convert the dataframe into that format before the algorithm can process it.

In [81]:
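#build a list of transactions, where each transaction is a list of item names
#note: str() turns empty cells (NaN) into the literal string 'nan'
n_rows, n_cols = df.shape  #7501 rows and 20 columns for this dataset
list_data = []

for i in range(n_rows):
    list_data.append([str(df.values[i, j]) for j in range(n_cols)])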
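The loop above converts every cell with `str()`, so empty cells end up in each transaction as the literal string `'nan'`. If you would rather leave those out, the following is a minimal sketch of one way to do it. The helper names `is_missing` and `rows_to_transactions` are our own and are not part of pandas or apyori.

```python
import math

def is_missing(value):
    """Return True for None or a float NaN (how pandas marks empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def rows_to_transactions(rows):
    """Turn rows of cells into lists of item names, skipping empty cells."""
    return [[str(v) for v in row if not is_missing(v)] for row in rows]

# Two toy rows shaped like the dataframe preview, padded with NaN
rows = [
    ["burgers", "meatballs", "eggs", float("nan")],
    ["chutney", float("nan"), float("nan"), float("nan")],
]
transactions = rows_to_transactions(rows)
print(transactions)
# [['burgers', 'meatballs', 'eggs'], ['chutney']]
```

With the notebook's dataframe you could call `rows_to_transactions(df.values.tolist())`. Keep in mind that removing `'nan'` as an item would likely change the rules and the rule count reported later in this notebook.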

## Applying apriori

Now that our data is in the correct format, we can pass it into the algorithm. The algorithm calculates several metrics to determine the relationships between items. To control which relationships are returned, we specify a few parameters:

* __min_support:__ keeps only the itemsets whose support is at least this value. Support is the frequency with which the items in the rule appear together in the dataset _(# transactions containing the items / total # of transactions)_
* __min_confidence:__ keeps only the rules whose confidence is at least this value _(probability of Y given X)_
* __min_lift:__ specifies the minimum lift value _(Lift(I1->I2) = confidence(I1->I2) / support(I2))_
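* __min_length:__ is meant to set the minimum number of items you want in the rules. As far as we know, apyori reads `max_length` but not `min_length` and ignores keyword arguments it does not recognize, so check the behavior of your installed version before relying on it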
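To make the three metrics concrete, here is a small hand calculation on a toy set of five transactions. The data and the helper functions are made up for illustration and are not taken from the store dataset or from apyori.

```python
# Toy transactions (made up for illustration)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(base, add, transactions):
    """Estimated probability of seeing `add` given that `base` is present."""
    both = set(base) | set(add)
    return support(both, transactions) / support(base, transactions)

def lift(base, add, transactions):
    """How much more often `add` appears with `base` than it does overall."""
    return confidence(base, add, transactions) / support(add, transactions)

# Rule: bread -> milk
print("support:   ", round(support({"bread", "milk"}, transactions), 4))
print("confidence:", round(confidence({"bread"}, {"milk"}, transactions), 4))
print("lift:      ", round(lift({"bread"}, {"milk"}, transactions), 4))
```

A lift below 1, as in this toy rule, means the two items appear together slightly less often than they would if they were independent. This is why a `min_lift` well above 1 is used to keep only strong relationships.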

In [82]:
#pass data into the apriori algorithm
relationships = apriori(list_data, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)  

#turn results into a list
model_results = list(relationships) 
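If you are curious what happens inside a call like this, the sketch below shows the general Apriori idea in plain Python: grow itemsets one item at a time and keep only the ones that meet the minimum support. It is a simplified illustration written for this guide, not apyori's actual implementation, and the function name `frequent_itemsets` is our own.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search for all itemsets with support >= min_support."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)

    def support(itemset):
        return sum(itemset <= t for t in sets) / n

    # Level 1: frequent single items
    items = sorted({item for t in sets for item in t})
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]

    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        k += 1
        # Candidates of size k are unions of two frequent (k-1)-itemsets
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        previous = set(current)
        candidates = {c for c in candidates
                      if all(frozenset(sub) in previous
                             for sub in combinations(c, k - 1))}
        current = [c for c in candidates if support(c) >= min_support]
    return result

toy = [["bread", "milk"], ["bread", "butter"],
       ["bread", "milk", "butter"], ["milk"]]
found = frequent_itemsets(toy, min_support=0.5)
for itemset, value in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

The pruning step is what keeps the search manageable: an itemset can only be frequent if all of its smaller subsets are frequent, so most combinations never need to be counted.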

## Viewing Results

Now that we have run the model, we can analyze our results and see the relationships between items. We can see how many rules we have (relationships between items) and the metrics for each rule:
* __support__ (share of all transactions that contain both item 1 and item 2)
* __confidence__ (share of the transactions containing item 1 that also contain item 2)
* __lift__ (how many times more likely item 2 is to be bought by a customer who has bought item 1 than by customers overall)

In [83]:
#print number of rules found in dataset
numRelationships = len(model_results)
print("Our model has found that we have",numRelationships,"association rules.")  

Our model has found that we have 48 association rules.


In [84]:
#print out all rules and their corresponding metrics
print("RESULTS OF THE MODEL: ")
print("============================================================")
for item in model_results:

    #item[0] is the itemset (a frozenset) that the rule was built from
    pair = item[0] 
    items = [x for x in pair]
    
    #header for output results (frozenset order is not guaranteed,
    #and only the first two items of the itemset are shown)
    print("Rule (relationship): " + items[0] + " -> " + items[1])

    #item[1] is the support of the itemset
    print("Support: " + str(item[1]))

    #item[2] is the list of rule statistics; in its first entry,
    #index 2 is the confidence and index 3 is the lift
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("-----------------------------------------------------------")

RESULTS OF THE MODEL: 
Rule (relationship): light cream -> chicken
Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
-----------------------------------------------------------
Rule (relationship): escalope -> mushroom cream sauce
Support: 0.005732568990801226
Confidence: 0.3006993006993007
Lift: 3.790832696715049
-----------------------------------------------------------
Rule (relationship): escalope -> pasta
Support: 0.005865884548726837
Confidence: 0.3728813559322034
Lift: 4.700811850163794
-----------------------------------------------------------
Rule (relationship): herb & pepper -> ground beef
Support: 0.015997866951073192
Confidence: 0.3234501347708895
Lift: 3.2919938411349285
-----------------------------------------------------------
Rule (relationship): tomato sauce -> ground beef
Support: 0.005332622317024397
Confidence: 0.3773584905660377
Lift: 3.840659481324083
-----------------------------------------------------------
[remaining output truncated]
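The printing loop reaches into each result with numeric indexes such as `item[2][0][2]`, which is hard to read. The sketch below shows the same layout with named fields. We assume here that apyori's result records are namedtuples with the field names `items`, `support`, `ordered_statistics`, `items_base`, `items_add`, `confidence`, and `lift`; the two namedtuples defined below are stand-ins so the example runs without apyori, and the sample values are rounded from the first rule in the output above.

```python
from collections import namedtuple

# Stand-ins for the result layout (assumption: apyori uses these field names)
OrderedStatistic = namedtuple(
    "OrderedStatistic", "items_base items_add confidence lift")
RelationRecord = namedtuple(
    "RelationRecord", "items support ordered_statistics")

# Sample record, rounded from the first rule printed above
sample = RelationRecord(
    items=frozenset({"light cream", "chicken"}),
    support=0.0045,
    ordered_statistics=[
        OrderedStatistic(frozenset({"light cream"}),
                         frozenset({"chicken"}), 0.29, 4.84),
    ],
)

def describe(record):
    """Return one readable line per rule stored in a result record."""
    lines = []
    for stat in record.ordered_statistics:
        base = ", ".join(sorted(stat.items_base))
        add = ", ".join(sorted(stat.items_add))
        lines.append(
            f"{base} -> {add} | support={record.support:.4f} "
            f"confidence={stat.confidence:.2f} lift={stat.lift:.2f}")
    return lines

for line in describe(sample):
    print(line)
# light cream -> chicken | support=0.0045 confidence=0.29 lift=4.84
```

Reading the rule direction from `items_base` and `items_add` avoids depending on the iteration order of a frozenset, and it also handles itemsets with more than two items.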

In the example of spaghetti and cooking oil, we can see a support of 0.004799 - _meaning that 0.4799% of all transactions contained both spaghetti and cooking oil_.

We can also see a confidence of 57% - which means that _out of all of the transactions containing spaghetti, 57% also contain cooking oil_.

The lift for this same example is 3.28, which means that _cooking oil is 3.28 times more likely to be bought by customers who buy spaghetti than by customers overall_.
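As a quick sanity check, we can turn these figures back into more tangible numbers. The sketch below only uses the values quoted above together with the 7501 rows in the dataset; the variable names are our own, and the baseline it derives is simply what the quoted confidence and lift imply, not a value measured from the data.

```python
total_transactions = 7501   # number of rows (transactions) in the dataset
support = 0.004799          # support quoted for the rule
confidence = 0.57           # confidence quoted for the rule
lift = 3.28                 # lift quoted for the rule

# Support is a fraction of all transactions, so multiply to get a count
matching_transactions = support * total_transactions

# Lift = confidence / support(item 2), so the baseline share of item 2
# implied by the quoted figures is confidence / lift
implied_baseline = confidence / lift

print(round(matching_transactions))   # 36
print(round(implied_baseline, 3))     # 0.174
```

In other words, a support of 0.004799 corresponds to roughly 36 transactions, which is a useful reminder that rules with very low support rest on a small number of purchases.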