# Market Basket Analysis and Association Rules
## MovieLens 1M Dataset

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

pd.set_option("display.max_colwidth", 150)

<b>NOTE:</b> If you do not have `mlxtend` installed, run `pip install mlxtend` or `conda install -c conda-forge mlxtend` (if using Anaconda).

## Dataset

In [2]:
base_url = 'https://raw.githubusercontent.com/cs6220/cs6220.summer2021/master/data/ml-1m/'
kwargs = {'sep': '::', 'header': None, 'engine': 'python', 'encoding': 'latin1'}

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table(base_url + 'users.dat', names=unames, **kwargs)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(base_url + 'ratings.dat', names=rnames, **kwargs)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(base_url + 'movies.dat', names=mnames, **kwargs)

In [3]:
df = pd.merge(pd.merge(ratings, users), movies)
df.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama


In [4]:
age_dict = {
     1: "Under 18",
    18: "18-24",
    25: "25-34",
    35: "35-44",
    45: "45-49",
    50: "50-55",
    56: "56+"
}

occupation_dict = { 
     0: "not specified", 
     1: "academic/educator",
     2: "artist",
     3: "clerical/admin",
     4: "college/grad student",
     5: "customer service",
     6: "doctor/health care",
     7: "executive/managerial",
     8: "farmer",
     9: "homemaker",
    10: "K-12 student",
    11: "lawyer",
    12: "programmer",
    13: "retired",
    14: "sales/marketing",
    15: "scientist",
    16: "self-employed",
    17: "technician/engineer",
    18: "tradesman/craftsman",
    19: "unemployed",
    20: "writer"
}

df['age'] = df['age'].replace(age_dict)
df['occupation'] = df['occupation'].replace(occupation_dict)
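# No decoding step is needed for 'title': read_table(..., encoding='latin1') already
# returns str values, and .str.decode() on str values would yield NaN for every row.
df.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,Under 18,K-12 student,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56+,self-employed,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,M,25-34,programmer,32793,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,M,25-34,executive/managerial,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50-55,academic/educator,95350,One Flew Over the Cuckoo's Nest (1975),Drama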


### What is the distribution of the ratings?
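
One possible starting point, sketched as a small helper. The name `rating_distribution` and the toy frame are illustrative assumptions, not part of the original notebook; the helper only relies on a `rating` column like the one in `df`.

```python
import pandas as pd

def rating_distribution(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and share of each star rating, ordered by rating value."""
    counts = frame["rating"].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})

# Toy stand-in for the merged `df` (hypothetical data, for illustration only).
toy = pd.DataFrame({"rating": [5, 4, 4, 3, 5, 5]})
print(rating_distribution(toy))
```

In this notebook the call would be `rating_distribution(df)`; plotting the `count` column with `.plot.bar()` gives a quick visual check.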

### What are the top 5 most rated movies?
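
A minimal sketch, assuming that "most rated" means "has the largest number of rating rows". The helper name `most_rated_movies` and the toy data are hypothetical.

```python
import pandas as pd

def most_rated_movies(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Number of ratings per title, largest first."""
    # value_counts() sorts in descending order of frequency.
    return frame["title"].value_counts().head(n)

# Toy stand-in for `df`: each row is one rating.
toy = pd.DataFrame({"title": ["A", "A", "A", "B", "B", "C"]})
print(most_rated_movies(toy, n=2))
```

In this notebook the call would be `most_rated_movies(df)`.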

### What are the top 5 most rated genres?
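
In MovieLens 1M a movie can carry several genres in one pipe-separated string (for example `Comedy|Drama`), so the `genres` column has to be split before counting. The sketch below assumes a rating counts once for each genre of its movie; `genre_rating_counts` is a hypothetical helper name.

```python
import pandas as pd

def genre_rating_counts(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Number of ratings per individual genre, largest first."""
    # Split "Comedy|Drama" into ["Comedy", "Drama"], then give each genre its own row.
    genres = frame["genres"].str.split("|").explode()
    return genres.value_counts().head(n)

# Toy stand-in for `df`: one row per rating.
toy = pd.DataFrame({"genres": ["Drama", "Comedy|Drama", "Comedy|Romance|Drama"]})
print(genre_rating_counts(toy))
```

The same counts also answer single-genre questions: a lookup such as `genre_rating_counts(df, n=None)["Film-Noir"]` would give the number of ratings for one genre.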

### How many ratings are there for movies in the "Film-Noir" genre?

### Which users have the greatest number of ratings? What are their average ratings?
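
Both parts of the question can come from a single `groupby` with named aggregation. This is a sketch under that reading; `most_active_users` and the toy values are illustrative assumptions.

```python
import pandas as pd

def most_active_users(frame: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Rating count and mean rating per user, most active users first."""
    stats = frame.groupby("user_id")["rating"].agg(
        n_ratings="count", mean_rating="mean"
    )
    return stats.sort_values("n_ratings", ascending=False).head(n)

# Toy stand-in for `df`.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "rating":  [5, 4, 3, 2, 4, 1],
})
print(most_active_users(toy, n=2))
```

In this notebook the call would be `most_active_users(df)`.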

### Plot a boxplot of the ratings by occupation.

### How many people below the age of 18 are "retired"? ¯\\_(ツ)_/¯
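
One thing to watch for here: `df` has one row per rating, so counting matching rows would count ratings rather than people. The sketch below counts distinct `user_id` values instead. The helper name `count_users` is hypothetical, and the labels are the ones produced by `age_dict` and `occupation_dict`.

```python
import pandas as pd

def count_users(frame: pd.DataFrame, age: str, occupation: str) -> int:
    """Number of distinct users with the given age bracket and occupation."""
    mask = (frame["age"] == age) & (frame["occupation"] == occupation)
    return frame.loc[mask, "user_id"].nunique()

# Toy stand-in for `df`: user 1 appears twice because they rated two movies.
toy = pd.DataFrame({
    "user_id":    [1, 1, 2, 3],
    "age":        ["Under 18", "Under 18", "Under 18", "56+"],
    "occupation": ["retired", "retired", "K-12 student", "retired"],
})
print(count_users(toy, "Under 18", "retired"))
```

In this notebook the call would be `count_users(df, "Under 18", "retired")`.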

### What are the most well-rated genres?

### What are the ratings for each genre by users of each occupation? (Hint: use pivot tables.)
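
A sketch of the pivot-table approach, assuming "ratings" means the mean rating and that multi-genre movies count toward each of their genres. `genre_by_occupation` is a hypothetical helper name.

```python
import pandas as pd

def genre_by_occupation(frame: pd.DataFrame) -> pd.DataFrame:
    """Mean rating with one row per genre and one column per occupation."""
    # Give each genre of a pipe-separated list its own row before pivoting.
    exploded = frame.assign(genre=frame["genres"].str.split("|")).explode("genre")
    return exploded.pivot_table(
        values="rating", index="genre", columns="occupation", aggfunc="mean"
    )

# Toy stand-in for `df`.
toy = pd.DataFrame({
    "genres":     ["Drama", "Comedy|Drama", "Comedy"],
    "occupation": ["artist", "artist", "writer"],
    "rating":     [5, 3, 2],
})
print(genre_by_occupation(toy))
```

Cells with no ratings for a genre and occupation pair come out as `NaN`. Calling `.mean(axis=1)` on the result is one rough way to rank genres overall, although it weights every occupation equally.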

### Which movies do men and women most disagree on?
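
One way to read "disagree" is the gap between the mean rating given by men and the mean rating given by women. The sketch below ranks titles by the absolute size of that gap. The helper name `gender_disagreement`, the column name `gap`, and the default threshold of 250 ratings are all assumptions chosen for illustration; the code also assumes both `"F"` and `"M"` occur in the data.

```python
import pandas as pd

def gender_disagreement(frame: pd.DataFrame, min_ratings: int = 250) -> pd.DataFrame:
    """Mean rating per title and gender, ordered by the absolute M - F gap."""
    # Keep only titles with enough ratings for the means to be stable.
    n_ratings = frame.groupby("title").size()
    popular = n_ratings[n_ratings >= min_ratings].index
    means = frame[frame["title"].isin(popular)].pivot_table(
        values="rating", index="title", columns="gender", aggfunc="mean"
    )
    # Titles rated by only one gender have no gap to compare.
    means = means.dropna(subset=["F", "M"]).assign(gap=lambda m: m["M"] - m["F"])
    order = means["gap"].abs().sort_values(ascending=False).index
    return means.loc[order]

# Toy stand-in for `df`.
toy = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "B", "C", "C"],
    "gender": ["M", "M", "F", "M", "F", "M", "F"],
    "rating": [5, 5, 1, 3, 3, 2, 4],
})
print(gender_disagreement(toy, min_ratings=1))
```

A positive `gap` means men rated the title higher on average; a negative one means women did.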

## Association Rules

In [5]:
df_trans = pd.pivot_table(df[['user_id', 'movie_id', 'title']],
                          values='movie_id', index='user_id', columns='title',
                          aggfunc=lambda x: 1, fill_value=0)
df_trans.head(10)

### Which three movies are most frequently rated together?

### What is the highest support for 1-, 2-, and 3-item sets of movies?
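
The support of an itemset is the fraction of transactions (here, users) that contain every item in it. The sketch below computes that definition by brute force over all k-column combinations of a one-hot table. This is deliberately not the Apriori algorithm: it does no pruning, so it is only practical on a small subset of columns, not on the full `df_trans`. The helper name `top_itemsets` and the toy table are hypothetical; the output columns merely mirror the `support` and `itemsets` names used by `mlxtend`.

```python
from itertools import combinations

import pandas as pd

def top_itemsets(onehot: pd.DataFrame, k: int, n: int = 1) -> pd.DataFrame:
    """Support of every k-item combination of columns, highest first."""
    data = onehot.astype(bool)
    combos = list(combinations(data.columns, k))
    # Support = share of rows in which all k items are present.
    supports = [data[list(c)].all(axis=1).mean() for c in combos]
    result = pd.DataFrame({
        "support": supports,
        "itemsets": pd.Series([frozenset(c) for c in combos], dtype="object"),
    })
    return result.sort_values("support", ascending=False).head(n)

# Toy one-hot table: rows are users, columns are movies.
toy = pd.DataFrame({"A": [1, 1, 1, 0], "B": [1, 1, 0, 0], "C": [1, 0, 1, 1]})
for k in (1, 2, 3):
    print(top_itemsets(toy, k))
```

On the real data, `apriori` with a `min_support` threshold avoids enumerating combinations that cannot be frequent, which is what makes the 3-item case tractable.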

### What is the highest confidence association rule with confidence greater than 0.5 and support greater than 0.25?
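
For a rule A → B, confidence is support(A, B) / support(A): among users who rated A, the share who also rated B. The sketch below computes this for single-item antecedents and consequents only, using a co-occurrence matrix. It is a hand-rolled illustration of the definitions, not `mlxtend`'s `association_rules`, which also handles multi-item rules. The helper name `pair_rules` and the toy table are assumptions.

```python
import numpy as np
import pandas as pd

def pair_rules(onehot: pd.DataFrame, min_support: float = 0.25,
               min_confidence: float = 0.5) -> pd.DataFrame:
    """Rules a -> b between single items, filtered and sorted by confidence."""
    X = onehot.astype(int)
    # Entry (a, b): share of rows that contain both a and b.
    pair_support = (X.T.dot(X) / len(X)).rename_axis(
        index="antecedent", columns="consequent"
    )
    # The diagonal holds the support of each single item.
    item_support = pd.Series(np.diag(pair_support.to_numpy()), index=pair_support.index)
    # Row a, column b: support(a, b) / support(a) = confidence(a -> b).
    confidence = pair_support.div(item_support, axis=0)
    rules = pd.DataFrame({
        "support": pair_support.stack(),
        "confidence": confidence.stack(),
    })
    # Drop trivial a -> a rules.
    rules = rules[
        rules.index.get_level_values("antecedent")
        != rules.index.get_level_values("consequent")
    ]
    keep = (rules["support"] > min_support) & (rules["confidence"] > min_confidence)
    return rules[keep].sort_values("confidence", ascending=False)

# Toy one-hot table: rows are users, columns are movies.
toy = pd.DataFrame({"A": [1, 1, 1, 0], "B": [1, 1, 0, 0], "C": [1, 0, 1, 1]})
print(pair_rules(toy))
```

The default thresholds match the ones in the question; the first row of the result is the highest-confidence pairwise rule that passes both filters.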