# Association Rules

#### Import libraries

In [1]:
import os
import sys
from pathlib import Path
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

In [2]:
# Import local libraries
root_dir = Path.cwd().resolve().parents[0]
sys.path.append(str(root_dir))

# Visualization functions
from src.utils.helpers import *

# Load the "autoreload" extension so that code can change
%load_ext autoreload
#%reload_ext autoreload

# Always reload modules so that as you change code in src, it gets loaded
%autoreload 2

### Example 1: Lecture

Because this example dataset is small, a sparse (`SparseDtype`) structure is not required. See the documentation for [pandas.SparseDtype](https://pandas.pydata.org/docs/reference/api/pandas.SparseDtype.html) for more information.

**Use TransactionEncoder() to create matrix from arrays**

<h4 style="color:blue"> Write Your Code Below: </h4>

In [None]:
transactions = [
 ['Pizza', 'Salad', 'Soda'],
 ['Garlic Bread', 'Soda'],
 ['Pizza', 'Garlic Bread', 'Salad', 'Soda'],
 ['Burger', 'Fries', 'Water'],
 ['Burger', 'Fries' , 'Ice Cream'],
 ['Pizza', 'Soda'],
 ['Burger', 'Fries', 'Soda'],
 ['Soup', 'Salad', 'Water'],
 ['Pizza', 'Soup', 'Salad', 'Soda'],
 ['Pizza', 'Salad', 'Soda', 'Ice Cream']
]

te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)

ex_1 = pd.DataFrame(te_array, columns=te.columns_)
# Emulate tran_id from index
ex_1.index = ex_1.index + 1
ex_1.index.name = "tran_id"
# Sort the columns index
ex_1 = ex_1.sort_index(axis=1)
ex_1

<h3 style="color:teal"> Expected Output: </h3>

tran_id,Burger,Fries,Garlic Bread,Ice Cream,Pizza,Salad,Soda,Soup,Water
1,False,False,False,False,True,True,True,False,False
2,False,False,True,False,False,False,True,False,False
3,False,False,True,False,True,True,True,False,False
4,True,True,False,False,False,False,False,False,True
5,True,True,False,True,False,False,False,False,False
6,False,False,False,False,True,False,True,False,False
7,True,True,False,False,False,False,True,False,False
8,False,False,False,False,False,True,False,True,True
9,False,False,False,False,True,True,True,True,False
10,False,False,False,True,True,True,True,False,False


**Display all frequent itemsets for a minimum support (`min_sup`) of 0.4.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,support,itemsets
2,0.7,(Soda)
0,0.5,(Pizza)
1,0.5,(Salad)
4,0.5,"(Pizza, Soda)"
3,0.4,"(Pizza, Salad)"
5,0.4,"(Salad, Soda)"
6,0.4,"(Pizza, Salad, Soda)"
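The supports above can be double-checked without mlxtend. The sketch below is a brute-force count over candidate itemsets, not the Apriori implementation that `apriori` provides, and it is only practical because this example has nine items. The function name `frequent_itemsets` is illustrative.

```python
from itertools import combinations

transactions = [
    ['Pizza', 'Salad', 'Soda'],
    ['Garlic Bread', 'Soda'],
    ['Pizza', 'Garlic Bread', 'Salad', 'Soda'],
    ['Burger', 'Fries', 'Water'],
    ['Burger', 'Fries', 'Ice Cream'],
    ['Pizza', 'Soda'],
    ['Burger', 'Fries', 'Soda'],
    ['Soup', 'Salad', 'Water'],
    ['Pizza', 'Soup', 'Salad', 'Soda'],
    ['Pizza', 'Salad', 'Soda', 'Ice Cream'],
]

def frequent_itemsets(transactions, min_sup):
    """Brute-force support counting: test each candidate against every basket."""
    baskets = [set(t) for t in transactions]
    items = sorted(set().union(*baskets))
    n = len(baskets)
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            count = sum(1 for b in baskets if set(candidate) <= b)
            support = count / n
            if support >= min_sup:
                result[candidate] = support
                found = True
        if not found:
            break  # no frequent itemset of this size, so no larger one exists
    return result

freq = frequent_itemsets(transactions, min_sup=0.4)
for itemset, support in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(support, itemset)
```

The early `break` relies on the same downward-closure property that Apriori uses: every subset of a frequent itemset must itself be frequent.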


**Display strong association rules with a minimum confidence (`min_conf`) of 0.9 and only include the `antecedents`, `consequents`, `support`, `confidence`, and `lift` columns.**

<h4 style="color:blue"> Write Your Code Below: </h4>

In [None]:
# Convert Into Rules
rule_cols = ['antecedents','consequents','support','confidence','lift']


<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,antecedents,consequents,support,confidence,lift
2,"(Salad, Soda)",(Pizza),0.4,1.0,2.0
0,(Pizza),(Soda),0.5,1.0,1.428571
1,"(Pizza, Salad)",(Soda),0.4,1.0,1.428571
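To see where the confidence and lift values above come from, the three rules can be recomputed from raw support counts. This is a minimal sketch with illustrative helper names (`support`, `rule_metrics`); it does not use mlxtend's `association_rules`.

```python
transactions = [
    ['Pizza', 'Salad', 'Soda'],
    ['Garlic Bread', 'Soda'],
    ['Pizza', 'Garlic Bread', 'Salad', 'Soda'],
    ['Burger', 'Fries', 'Water'],
    ['Burger', 'Fries', 'Ice Cream'],
    ['Pizza', 'Soda'],
    ['Burger', 'Fries', 'Soda'],
    ['Soup', 'Salad', 'Water'],
    ['Pizza', 'Soup', 'Salad', 'Soda'],
    ['Pizza', 'Salad', 'Soda', 'Ice Cream'],
]
baskets = [set(t) for t in transactions]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for b in baskets if set(itemset) <= b) / len(baskets)

def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    sup_both = support(set(antecedent) | set(consequent))
    confidence = sup_both / support(antecedent)  # P(consequent | antecedent)
    lift = confidence / support(consequent)      # confidence relative to baseline
    return sup_both, confidence, lift

for ante, cons in [(('Salad', 'Soda'), ('Pizza',)),
                   (('Pizza',), ('Soda',)),
                   (('Pizza', 'Salad'), ('Soda',))]:
    sup, conf, lift = rule_metrics(ante, cons)
    print(ante, '->', cons, round(sup, 2), round(conf, 2), round(lift, 6))
```

A lift above 1 means the consequent appears more often alongside the antecedent than it does across all baskets.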


### Problem 15.2: Identifying Course Combinations

The Institute for Statistics Education at Statistics.com offers online courses in statistics and analytics, and is seeking information that will help in packaging and sequencing courses.  Consider the data in the file _CourseTopics.csv_, the first few rows of which are shown in the Table. These data are for purchases of online statistics courses at Statistics.com. Each row represents the courses attended by a single customer.
The firm wishes to assess alternative sequencings and bundling of courses. Use association rules to analyze these data, and interpret several of the resulting rules.

**Read in data from `coursetopics.csv` file.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [6]:
ct_df = pd.read_csv(os.path.join('..', 'data', 'coursetopics.csv'))
# Cast to bool for the association-rule functions
ct_df = ct_df.astype(bool)
ct_df.head()

Unnamed: 0,Intro,DataMining,Survey,Cat Data,Regression,Forecast,DOE,SW
0,True,True,False,False,False,False,False,False
1,False,False,True,False,False,False,False,False
2,False,True,False,True,True,False,False,True
3,True,False,False,False,False,False,False,False
4,True,True,False,False,False,False,False,False


**Display all frequent itemsets for a minimum support (`min_sup`) of 0.01.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,support,itemsets
0,0.394521,(Intro)
7,0.221918,(SW)
3,0.208219,(Cat Data)
4,0.208219,(Regression)
2,0.186301,(Survey)
...,...,...
58,0.010959,"(DataMining, Survey, Forecast)"
85,0.010959,"(Cat Data, Regression, DOE, Intro)"
86,0.010959,"(Cat Data, Regression, SW, Intro)"
57,0.010959,"(DataMining, Regression, Survey)"


**Display strong association rules with a minimum confidence (`min_conf`) of 0.1 and only include the `antecedents`, `consequents`, `support`, `confidence`, and `lift` columns.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,antecedents,consequents,support,confidence,lift
321,"(DOE, Intro)","(SW, Regression)",0.019178,0.411765,7.514706
316,"(SW, Regression)","(DOE, Intro)",0.019178,0.35,7.514706
319,"(Regression, DOE)","(SW, Intro)",0.019178,0.636364,6.636364
318,"(SW, Intro)","(Regression, DOE)",0.019178,0.2,6.636364
249,"(Regression, Forecast)","(DataMining, Intro)",0.013699,0.357143,6.517857
248,"(DataMining, Intro)","(Regression, Forecast)",0.013699,0.25,6.517857


**Filter rules to have only one consequent.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,antecedents,consequents,support,confidence,lift
245,"(Regression, Intro, Forecast)",(DataMining),0.013699,0.714286,4.010989
264,"(Survey, DOE, Intro)",(Cat Data),0.010959,0.8,3.842105
233,"(DataMining, Cat Data, Intro)",(Regression),0.016438,0.75,3.601974
253,"(Survey, Cat Data, Intro)",(Forecast),0.013699,0.5,3.578431
243,"(DataMining, Regression, Intro)",(Forecast),0.013699,0.5,3.578431
315,"(Regression, DOE, Intro)",(SW),0.019178,0.777778,3.504801


# Collaborative Filtering

#### Import libraries

In [10]:
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Dataset, Reader, KNNBasic

### Example 1: Lecture - Netflix Prize

**Data from rating matrix in Collaborative_Filtering_Examples.xlsx sheet 1.**

Set Customer ID as the index and make sure the data types are float.

<h4 style="color:blue"> Write Your Code Below: </h4>

In [None]:
nf_df = pd.read_csv(os.path.join('..', 'data', 'netflix_example.csv'))
nf_df.set_index('Customer ID', inplace=True)
nf_df.index = nf_df.index.astype(int)
nf_df.columns = nf_df.columns.astype(int)
nf_df = nf_df.astype(float)
nf_df

<h3 style="color:teal"> Expected Output: </h3>

Customer ID,1,5,8,17,18,28,30,44,48
30878,4.0,1.0,,,3.0,3.0,4.0,5.0,
124105,4.0,,,,,,,,
822109,5.0,,,,,,,,
823519,3.0,,1.0,4.0,,4.0,5.0,,
885013,4.0,5.0,,,,,,,
893988,3.0,,,,,,4.0,4.0,
1248029,3.0,,,,,2.0,4.0,,3.0
1503895,4.0,,,,,,,,
1842128,4.0,,,,,,3.0,,
2238063,3.0,,,,,,,,


**Calculate correlation similarity for customers 30878, 823519.**

Utilize helper `calc_corr_sim` function.

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Corr(30878, 823519): 0.3441


**Calculate cosine similarity for customers 30878, 823519.**

Utilize `cosine_similarity` function as well as helper `calc_cos_sim` function.

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

array([[1.        , 0.97179743],
       [0.97179743, 1.        ]])

Cos(30878, 823519): 0.9718
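The cosine value above can be reproduced by hand by restricting both rating vectors to the movies that both customers rated. The sketch below is a hypothetical stand-in written for illustration (`cosine_corated` is not the notebook's `calc_cos_sim` helper, whose source is not shown); the ratings are copied from the matrix above.

```python
import math

# Movie ID -> rating, copied from the rating matrix; unrated movies are omitted
ratings = {
    30878:  {1: 4, 5: 1, 18: 3, 28: 3, 30: 4, 44: 5},
    823519: {1: 3, 8: 1, 17: 4, 28: 4, 30: 5},
    124105: {1: 4},
    885013: {1: 4, 5: 5},
}

def cosine_corated(u, v):
    """Cosine similarity restricted to items both users rated; NaN if none."""
    shared = sorted(set(u) & set(v))
    if not shared:
        return float('nan')
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

print(round(cosine_corated(ratings[30878], ratings[823519]), 4))
print(round(cosine_corated(ratings[30878], ratings[885013]), 4))
print(round(cosine_corated(ratings[30878], ratings[124105]), 4))
```

When two customers share only one rated movie, the restricted vectors are one-dimensional and the cosine is 1.0 regardless of the ratings, so a value of 1.0 does not by itself indicate strong agreement.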


**Create a Pearson correlation similarity matrix for all customers.**

The intent of the next two steps is to demonstrate similarity with NaN masking and user-centered means.

Utilize helper `sim_matrix_nan` function.

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Customer ID,30878,124105,822109,823519,885013,893988,1248029,1503895,1842128,2238063
30878,1.0,,,0.34415,-0.874981,0.206216,0.70475,,0.0,
124105,,,,,,,,,,
822109,,,,,,,,,,
823519,0.34415,,,1.0,,0.646233,0.402911,,-0.857493,
885013,-0.874981,,,,1.0,,,,,
893988,0.206216,,,0.646233,,1.0,0.44185,,-0.946773,
1248029,0.70475,,,0.402911,,0.44185,1.0,,-0.707107,
1503895,,,,,,,,,,
1842128,0.0,,,-0.857493,,-0.946773,-0.707107,,1.0,
2238063,,,,,,,,,,


**Create a Cosine similarity matrix for all customers.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Customer ID,30878,124105,822109,823519,885013,893988,1248029,1503895,1842128,2238063
30878,1.0,1.0,1.0,0.971797,0.795432,0.992915,0.986025,1.0,0.989949,1.0
124105,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
822109,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
823519,0.971797,1.0,1.0,1.0,1.0,0.994692,0.971668,1.0,0.926092,1.0
885013,0.795432,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
893988,0.992915,1.0,1.0,0.994692,1.0,1.0,1.0,1.0,0.96,1.0
1248029,0.986025,1.0,1.0,0.971668,1.0,1.0,1.0,1.0,0.96,1.0
1503895,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1842128,0.989949,1.0,1.0,0.926092,1.0,0.96,0.96,1.0,1.0,1.0
2238063,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**Create a correlation matrix for all items.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,1,5,8,17,18,28,30,44,48
1,1.0,0.0,,,,0.0,-0.550482,0.928477,
5,0.0,1.0,,,,,,,
8,,,,,,,,,
17,,,,,,,,,
18,,,,,,,,,
28,0.0,,,,,1.0,0.707107,,
30,-0.550482,,,,,0.707107,1.0,,
44,0.928477,,,,,,,1.0,
48,,,,,,,,,


**Create a cosine matrix for all items.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,1,5,8,17,18,28,30,44,48
1,1.0,0.83205,1.0,1.0,1.0,0.955395,0.963256,0.999512,1.0
5,0.83205,1.0,,,1.0,1.0,1.0,1.0,
8,1.0,,1.0,1.0,,1.0,1.0,,
17,1.0,,1.0,1.0,,1.0,1.0,,
18,1.0,1.0,,,1.0,1.0,1.0,1.0,
28,0.955395,1.0,1.0,1.0,1.0,1.0,0.983838,1.0,1.0
30,0.963256,1.0,1.0,1.0,1.0,0.983838,1.0,0.993884,1.0
44,0.999512,1.0,,,1.0,1.0,0.993884,1.0,
48,1.0,,,,,1.0,1.0,,1.0


### Problem 15.3: Recommending Courses

We again consider the data in _CourseTopics.csv_ describing course purchases at Statistics.com (see Problem 15.2 and data sample in Table). We want to provide a course recommendation to a student who purchased the Regression and Forecast courses. Apply user-based collaborative filtering to the data.

**Use the same data as 15.2 from `coursetopics.csv` file.**

Keep the *int* dtype for collaborative filtering.

<h4 style="color:blue"> Write Your Code Below: </h4>

ct_df2 = pd.read_csv(os.path.join('..', 'data', 'coursetopics.csv'))
ct_df2.head()

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,Intro,DataMining,Survey,Cat Data,Regression,Forecast,DOE,SW
0,1,1,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0
2,0,1,0,1,1,0,0,1
3,1,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0


For the surprise Dataset loader, data should be presented in columns with a customer/user, item and purchase/rating.

Utilize helper `create_long_data` function. This function handles NaN and binary = 1 data.

**Transform data from matrix to long form columns.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,customer,course,purchase
0,0,Intro,1
1,0,DataMining,1
2,1,Survey,1
3,2,DataMining,1
4,2,Cat Data,1
...,...,...,...
619,361,Cat Data,1
620,361,SW,1
621,362,SW,1
622,363,Cat Data,1


**Make predictions for all users.**

<h4 style="color:blue"> Write Your Code Below: </h4>

In [None]:
reader = Reader(rating_scale=(1, 1))
data = Dataset.load_from_df(purchases, reader)
trainset = data.build_full_trainset()
# compute cosine similarities between users (user-based filtering)
sim_options = {'name': 'cosine', 'user_based': True}
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)

predictions = []
for user in ct_df2.index:
    predictions.append([algo.predict(user, course).est for course in ct_df2])
predictions = pd.DataFrame(predictions, columns=ct_df2.columns)
predictions.head()

<h3 style="color:teal"> Expected Output: </h3>

Computing the cosine similarity matrix...
Done computing similarity matrix.


Unnamed: 0,Intro,DataMining,Survey,Cat Data,Regression,Forecast,DOE,SW
0,1,1,1,1,1,1,1,1
1,1,1,1,1,1,1,1,1
2,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1
4,1,1,1,1,1,1,1,1


**Why are all predictions 1?**

<h4 style="color:purple"> Write Your Free-Form Response Below: </h4>

If you train `KNNBasic` on purchase data in which every observed value equals 1, the model predicts 1 for every user-item pair. `KNNBasic` computes each prediction as a **normalized weighted average** of neighbor ratings, and a weighted average of values that are all 1 is always 1, whatever the similarity weights are. When no neighbors are available, Surprise falls back to the global mean rating, which is also 1, and the `rating_scale=(1, 1)` passed to `Reader` clips every estimate to 1 in any case. Since the rating values do not vary, the algorithm cannot distinguish between stronger and weaker preferences, so the predictions carry no ranking information.
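The weighted-average argument can be checked with a few lines of arithmetic. The sketch below implements only the formula sum(sim * rating) / sum(sim); it is not Surprise's code, and the similarity values are made up for illustration.

```python
def knn_basic_estimate(neighbors):
    """Similarity-weighted mean of neighbor ratings: sum(sim * r) / sum(sim)."""
    total_sim = sum(sim for sim, _ in neighbors)
    return sum(sim * rating for sim, rating in neighbors) / total_sim

# (similarity, rating) pairs; the similarities are hypothetical
purchase_only = [(0.9, 1), (0.4, 1), (0.1, 1)]
graded = [(0.9, 4), (0.4, 2), (0.1, 5)]

print(knn_basic_estimate(purchase_only))  # all ratings are 1, so the mean is 1
print(knn_basic_estimate(graded))         # varied ratings give a varied estimate
```

Only the second case lets the similarity weights influence the estimate, which is why purchase-only data needs a different scoring approach.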

### Problem 15.5: Course Ratings

The Institute for Statistics Education at Statistics.com asks students to rate a variety of aspects of a course as soon as the student completes it. The Institute is contemplating instituting a recommendation system that would provide students with recommendations for additional courses as soon as they submit their rating for a completed course. Consider the excerpt from student ratings of online statistics courses shown in Table 14.7, and the problem of what to recommend to student EN.

**Read in data from `courserating.csv` file.**

<h4 style="color:blue"> Write Your Code Below: </h4>

In [None]:
cr_df = pd.read_csv(os.path.join('..', 'data', 'courserating.csv'))
cr_df.set_index('Unnamed: 0', inplace=True)
cr_df.index.name = 'User'
cr_df

<h3 style="color:teal"> Expected Output: </h3>

User,SQL,Spatial,PA1,DM in R,Python,Forecast,R Prog,Hadoop,Regression
LN,4.0,,,,3.0,2.0,4.0,,2.0
MH,3.0,4.0,,,4.0,,,,
JH,2.0,2.0,,,,,,,
EN,4.0,,,4.0,,,4.0,,3.0
DU,4.0,4.0,,,,,,,
FL,,4.0,,,,,,,
GL,,4.0,,,,,,,
AH,,3.0,,,,,,,
SA,,,4.0,,,,,,
RW,,,2.0,,,,,4.0,


**1. First consider a user-based collaborative filter.  This requires computing correlations between all student pairs. 
For which students is it possible to compute correlations with EN? Compute them.**

<h4 style="color:purple"> Write Your Free-Form Response Below: </h4>

We need to identify the users who share ratings with EN. These are LN, MH, JH, DU, and DS. However, only LN and DS share more than one rating with EN, and a correlation needs at least two co-rated courses, so it can be computed only for those two.

To compute the correlations, we first compute the average rating for each of these students. Note that the average is computed over a different number of courses for each student, because each rated a different set of courses.

Average ratings:

- LN: (4 + 3 + 2 + 4 + 2) / 5 = 3
- EN: (4 + 4 + 4 + 3) / 4 = 3.75
- DS: (4 + 2 + 4) / 3 = 3.33

Co-rated courses for users EN and LN: SQL, R Prog, Regression.

- Denominator LN: sqrt((4-3)^2 + (4-3)^2 + (2-3)^2) = 1.732051
- Denominator EN: sqrt((4-3.75)^2 + (4-3.75)^2 + (3-3.75)^2) = 0.8291562

**Corr(LN, EN) = ((4-3)*(4-3.75) + (4-3)*(4-3.75) + (2-3)*(3-3.75)) / (1.732051 * 0.8291562) = 0.8703882**

Co-rated courses for users EN and DS: SQL, DM in R, R Prog.

- Denominator EN: sqrt((4-3.75)^2 + (4-3.75)^2 + (4-3.75)^2) = 0.4330127
- Denominator DS: sqrt((4-3.33)^2 + (2-3.33)^2 + (4-3.33)^2) = 1.633003

**Corr(EN, DS) = ((4-3.75)*(4-3.33) + (4-3.75)*(2-3.33) + (4-3.75)*(4-3.33)) / (0.4330127 * 1.633003) = 0.003535513**
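The hand calculation above can be sketched in code. The function below centers each user's ratings on that user's overall mean and sums over co-rated courses only; `corr_user_means` and its `mean_decimals` parameter are illustrative names, not the notebook's `calc_corr_sim` helper. DS's ratings are taken from the hand calculation (SQL 4, DM in R 2, R Prog 4).

```python
import math

ratings = {
    'LN': {'SQL': 4, 'Python': 3, 'Forecast': 2, 'R Prog': 4, 'Regression': 2},
    'EN': {'SQL': 4, 'DM in R': 4, 'R Prog': 4, 'Regression': 3},
    'DS': {'SQL': 4, 'DM in R': 2, 'R Prog': 4},
}

def corr_user_means(u, v, mean_decimals=None):
    """Correlation over co-rated items, centered on each user's overall mean."""
    mean_u = sum(u.values()) / len(u)
    mean_v = sum(v.values()) / len(v)
    if mean_decimals is not None:  # optionally mimic hand-rounded means
        mean_u = round(mean_u, mean_decimals)
        mean_v = round(mean_v, mean_decimals)
    shared = sorted(set(u) & set(v))
    num = sum((u[i] - mean_u) * (v[i] - mean_v) for i in shared)
    den_u = math.sqrt(sum((u[i] - mean_u) ** 2 for i in shared))
    den_v = math.sqrt(sum((v[i] - mean_v) ** 2 for i in shared))
    return num / (den_u * den_v)

print(round(corr_user_means(ratings['LN'], ratings['EN']), 4))
print(round(corr_user_means(ratings['EN'], ratings['DS'], mean_decimals=2), 4))
print(round(corr_user_means(ratings['EN'], ratings['DS']), 4))
```

The small positive value for EN and DS comes from rounding DS's mean to 3.33. With the unrounded mean 10/3, the numerator is 0.25 * (2/3 - 4/3 + 2/3) = 0, so the correlation is zero up to floating-point error. Either way, LN remains the nearer neighbor.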

**Programmatically calculate correlation similarity for users LN, EN and EN, DS.**

Use the helper function `calc_corr_sim`.

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Corr(LN, EN): 0.8704
Corr(EN, DS): 0.0035


**2. Based on the single nearest student to EN, which single course should we recommend to EN? Explain why.**

<h4 style="color:purple"> Write Your Free-Form Response Below: </h4>

From the correlations computed in part 1 above, student LN is nearest to EN. Among the courses that LN has taken but EN has not (Python and Forecast), LN rated Python higher (3 versus 2). So Python should be recommended to EN.

**3. Compute the cosine similarity between users.**

<h4 style="color:purple"> Write Your Free-Form Response Below: </h4>

Co-rated courses for users EN and LN: SQL, R Prog, Regression.

- Denominator LN: sqrt(4^2 + 4^2 + 2^2) = 6
- Denominator EN: sqrt(4^2 + 4^2 + 3^2) = 6.403124

**Cosine(LN, EN) = (4*4 + 4*4 + 2*3) / (6 * 6.403124) = 0.9891005**

Co-rated courses for users EN and DS: SQL, DM in R, R Prog.

- Denominator EN: sqrt(4^2 + 4^2 + 4^2) = 6.928203
- Denominator DS: sqrt(4^2 + 2^2 + 4^2) = 6

**Cosine(EN, DS) = (4*4 + 4*2 + 4*4) / (6.928203 * 6) = 0.9622505**

**Programmatically calculate cosine similarity for users LN, EN and EN, DS.**

Use the scikit-learn `cosine_similarity` function and the helper function `calc_cos_sim`.

*Demonstrate `NaN` issues with the scikit-learn function.*

<h4 style="color:blue"> Write Your Code Below: </h4>

In [None]:
#cosine_similarity(cr_df.loc[['LN', 'EN'], :])

In [None]:
print(f"Cos(LN, EN): {cosine_similarity(cr_df.loc[['LN', 'EN'], ['SQL', 'R Prog', 'Regression']])[0, 1]:.4f}")
print(f"Cos(EN, DS): {cosine_similarity(cr_df.loc[['EN', 'DS'], ['SQL', 'DM in R', 'R Prog']])[0, 1]:.4f}")

In [None]:
print(f"Cos(LN, EN): {calc_cos_sim(ln, en):.4f}")
print(f"Cos(EN, DS): {calc_cos_sim(en, ds):.4f}")

<h3 style="color:teal"> Expected Output: </h3>

Cos(LN, EN): 0.9891
Cos(EN, DS): 0.9623


Cos(LN, EN): 0.9891
Cos(EN, DS): 0.9623


**4. Based on the cosine similarities of the nearest students to EN, which course should be recommended to EN?**

<h4 style="color:purple"> Write Your Free-Form Response Below: </h4>

From the cosine similarities based on course ratings, student LN is nearest to EN. Among the courses that LN has taken but EN has not (Python and Forecast), LN rated Python higher (3 versus 2). So Python should be recommended to EN.

If we use the binary matrix, student DS is more similar to EN based on courses taken. However, as DS hasn't taken any courses other than the ones EN already took, we cannot make a recommendation in this case.

**5. What is the conceptual difference between using the correlation as opposed to cosine similarities? (Hint: how are the missing values in the matrix handled in each case?)**

<h4 style="color:purple"> Write Your Free-Form Response Below: </h4>

If we consider the rating matrix, both methods sum only over co-rated items. Correlation, however, also uses the items that are not co-rated when it computes each user's average rating, and those averages affect the correlation.

If we calculate the cosine similarity after converting to binary form, we use all items in the similarity calculation. Using the actual ratings only on co-rated items ignores the items that are not co-rated, which may be useful information.

The binary form is more useful when most users have rated only some of the items. On the other hand, if most items are rated by most users, using the actual ratings adds power to the analysis compared to using only binary data.

**6. With large datasets, it is computationally difficult to compute user-based recommendations in real time, and an item-based approach is used instead. Returning to the rating data (not the binary matrix), let's now take that approach.**

**6i. If the goal is still to find a recommendation for EN, for which course pairs is it possible and useful to calculate correlations?**

<h4 style="color:purple"> Write Your Free-Form Response Below: </h4>

There is enough data to find correlations for the following pairs:

* SQL - Spatial
* SQL - DM in R
* SQL - Python
* DM in R - R Prog
* Spatial - Python
  
However, EN has already taken SQL, DM in R, and R Prog. Hence, only the Spatial and Python correlations are useful.

**6ii. Just looking at the data, and without yet calculating course pair correlations, which course would you recommend to EN, relying on item‐based filtering? Calculate two course pair correlations involving your guess, and report the results.**

<h4 style="color:purple"> Write Your Free-Form Response Below: </h4>

The SQL - Spatial ratings match the best, and there are more co-rated items, so Spatial would be the best guess.

**7. Apply item-based collaborative filtering to this dataset (using Python) and based on the results, recommend a course to EN.**

*Convert the `cr_df` dataframe into a format suitable for the Surprise package.*

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,user,course,rating
0,LN,SQL,4.0
1,LN,Python,3.0
2,LN,Forecast,2.0
3,LN,R Prog,4.0
4,LN,Regression,2.0


**Make predictions for EN.**

<h4 style="color:blue"> Write Your Code Below: </h4>

In [None]:
reader = Reader(rating_scale=(1, 4))
data = Dataset.load_from_df(ratings, reader)
trainset = data.build_full_trainset()
# compute cosine similarities between items
sim_options = {'name': 'cosine', 'user_based': False}  
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)

courses = cr_df.columns
for course in courses: 
    print(course, algo.predict('EN', course).est)

<h3 style="color:teal"> Expected Output: </h3>

Computing the cosine similarity matrix...
Done computing similarity matrix.
SQL 3.7504416393899813
Spatial 4
PA1 3.433333333333333
DM in R 3.743416490252569
Python 3.6621621621621623
Forecast 3.6666666666666665
R Prog 3.7504416393899813
Hadoop 3.433333333333333
Regression 3.747548783981962


**Interpret the results.**

<h4 style="color:purple"> Write Your Free-Form Response Below: </h4>

The item-based collaborative filtering recommends the **Spatial** course to EN.
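The recommendation step can be made explicit by dropping the courses EN has already rated and ranking the rest. The estimates below are copied from the expected output above and rounded to four decimals; the variable names are illustrative.

```python
# Item-based KNN estimates for EN, copied from the expected output
estimates = {
    'SQL': 3.7504, 'Spatial': 4.0, 'PA1': 3.4333, 'DM in R': 3.7434,
    'Python': 3.6622, 'Forecast': 3.6667, 'R Prog': 3.7504,
    'Hadoop': 3.4333, 'Regression': 3.7475,
}
already_rated = {'SQL', 'DM in R', 'R Prog', 'Regression'}  # courses EN rated

# Only courses EN has not rated are candidates for a recommendation
candidates = {c: est for c, est in estimates.items() if c not in already_rated}
ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)

for course, est in ranked:
    print(course, est)
```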

**Conceptual review of collaborative filtering with a binary purchase matrix**

In PA2, you will be working with a large binary purchase/cart dataset from Instacart and will be tasked to make item-based and user-based recommendations for a specific user and the products they have added to their cart.

When working with purchase data, remember that a "non-purchase" (0) usually means unknown, not dislike. If you include many 0 values in KNNBasic, the algorithm treats them like real ratings. Because KNNBasic predicts using a normalized weighted average within the rating scale, the large number of zeros can dominate the math. This may produce predictions that look reasonable (they fall between 0 and 1), but they can be misleading. Similarities may be inflated by shared zeros, and recommendations may be biased toward predicting low scores simply because most entries are 0. Always think carefully about what your zeros represent and interpret results with caution.

Now, we will convert the course rating matrix into binary form (course taken or not) and talk through conceptually how you might handle these tasks for this smaller dataset.

**Convert course ratings to binary purchase matrix**

We use this smaller dataset instead of the course topics data because it records user interactions and is easier to relate to the predictions we observed earlier.

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

User,SQL,Spatial,PA1,DM in R,Python,Forecast,R Prog,Hadoop,Regression
LN,1,0,0,0,1,1,1,0,1
MH,1,1,0,0,1,0,0,0,0
JH,1,1,0,0,0,0,0,0,0
EN,1,0,0,1,0,0,1,0,1
DU,1,1,0,0,0,0,0,0,0
FL,0,1,0,0,0,0,0,0,0
GL,0,1,0,0,0,0,0,0,0
AH,0,1,0,0,0,0,0,0,0
SA,0,0,1,0,0,0,0,0,0
RW,0,0,1,0,0,0,0,1,0


**Calculate the cosine similarity using the binary matrix.**

Use the scikit-learn `cosine_similarity` function and the helper function `calc_cos_sim`.

Since we no longer have NaN, scikit-learn's `cosine_similarity` is recommended.

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,SQL,Spatial,PA1,DM in R,Python,Forecast,R Prog,Hadoop,Regression
SQL,1.0,0.5,0.0,0.57735,0.57735,0.288675,0.707107,0.0,0.57735
Spatial,0.5,1.0,0.0,0.0,0.288675,0.0,0.0,0.0,0.0
PA1,0.0,0.0,1.0,0.0,0.0,0.288675,0.0,0.408248,0.0
DM in R,0.57735,0.0,0.0,1.0,0.0,0.0,0.816497,0.0,0.5
Python,0.57735,0.288675,0.0,0.0,1.0,0.5,0.408248,0.0,0.5
Forecast,0.288675,0.0,0.288675,0.0,0.5,1.0,0.408248,0.0,0.5
R Prog,0.707107,0.0,0.0,0.816497,0.408248,0.408248,1.0,0.0,0.816497
Hadoop,0.0,0.0,0.408248,0.0,0.0,0.0,0.0,1.0,0.0
Regression,0.57735,0.0,0.0,0.5,0.5,0.5,0.816497,0.0,1.0


Unnamed: 0,SQL,Spatial,PA1,DM in R,Python,Forecast,R Prog,Hadoop,Regression
SQL,1.0,0.5,0.0,0.57735,0.57735,0.288675,0.707107,0.0,0.57735
Spatial,0.5,1.0,0.0,0.0,0.288675,0.0,0.0,0.0,0.0
PA1,0.0,0.0,1.0,0.0,0.0,0.288675,0.0,0.408248,0.0
DM in R,0.57735,0.0,0.0,1.0,0.0,0.0,0.816497,0.0,0.5
Python,0.57735,0.288675,0.0,0.0,1.0,0.5,0.408248,0.0,0.5
Forecast,0.288675,0.0,0.288675,0.0,0.5,1.0,0.408248,0.0,0.5
R Prog,0.707107,0.0,0.0,0.816497,0.408248,0.408248,1.0,0.0,0.816497
Hadoop,0.0,0.0,0.408248,0.0,0.0,0.0,0.0,1.0,0.0
Regression,0.57735,0.0,0.0,0.5,0.5,0.5,0.816497,0.0,1.0


**Calculate the cosine similarity for courses taken by EN.**

As we know, EN has already taken **SQL**, **DM in R**, **R Prog**, and **Regression** so that only leaves the following as potential recommendations:

* Spatial
* PA1
* Python
* Forecast
* Hadoop

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,SQL,DM in R,R Prog,Regression
SQL,1.0,0.57735,0.707107,0.57735
Spatial,0.5,0.0,0.0,0.0
PA1,0.0,0.0,0.0,0.0
DM in R,0.57735,1.0,0.816497,0.5
Python,0.57735,0.0,0.408248,0.5
Forecast,0.288675,0.0,0.408248,0.5
R Prog,0.707107,0.816497,1.0,0.816497
Hadoop,0.0,0.0,0.0,0.0
Regression,0.57735,0.5,0.816497,1.0


**Item-based similarity-based voting model**

One way to approach this problem with a binary purchase matrix is to use the similarities for each course that EN has taken and calculate how much each of the other courses contributes overall, summing the similarities for the courses EN has not taken. Based on the above similarities, which course would you recommend for EN?

<h4 style="color:purple"> Write Your Free-Form Response Below: </h4>

We can calculate the following total similarities for each of the courses that EN has not taken:

**Spatial** only has one co-occurrence with SQL.  
**Spatial = 0.50**

**PA1** does not have any co-occurrences. ❌

**Python** has three co-occurrences with SQL, R Prog and Regression.  
**Python = 0.577 + 0.408 + 0.50 = 1.485**

**Forecast** also has three co-occurrences with SQL, R Prog and Regression.  
**Forecast = 0.289 + 0.408 + 0.50 = 1.197**

**Hadoop** does not have any co-occurrences. ❌

Based on this item-based similarity voting approach, **Python** would be the recommended course for EN.
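The voting totals above amount to a row sum over the columns for the courses EN has taken. A short sketch, using similarity values copied from the table above (variable names are illustrative):

```python
# Similarity of each candidate course to the courses EN has taken,
# in column order: SQL, DM in R, R Prog, Regression
similarity_to_taken = {
    'Spatial':  [0.5, 0.0, 0.0, 0.0],
    'PA1':      [0.0, 0.0, 0.0, 0.0],
    'Python':   [0.57735, 0.0, 0.408248, 0.5],
    'Forecast': [0.288675, 0.0, 0.408248, 0.5],
    'Hadoop':   [0.0, 0.0, 0.0, 0.0],
}

# Each candidate's score is the sum of its similarities to EN's courses
scores = {course: sum(sims) for course, sims in similarity_to_taken.items()}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for course, score in ranked:
    print(course, round(score, 3))
```

The totals can differ from the hand calculation in the third decimal because the hand sums add terms that were already rounded; the ranking is unchanged.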

Think about how you might take a similar approach that is user-based. Do you think the recommendation would be the same?

**Calculate the cosine similarity for users.**

Display similarities with **EN** in descending order.

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

Unnamed: 0,User,EN
0,DS,0.866025
1,LN,0.67082
2,JH,0.353553
3,DU,0.353553
4,MH,0.288675
5,FL,0.0
6,GL,0.0
7,AH,0.0
8,SA,0.0
9,RW,0.0
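For 0/1 vectors, cosine similarity reduces to the number of shared courses divided by the square root of the product of the two course counts. The sketch below checks several of the values above using sets of courses; DS's courses (SQL, DM in R, R Prog) are taken from the earlier hand calculations, and `binary_cosine` is an illustrative name.

```python
import math

# Courses taken per user, read off the binary matrix
taken = {
    'LN': {'SQL', 'Python', 'Forecast', 'R Prog', 'Regression'},
    'MH': {'SQL', 'Spatial', 'Python'},
    'JH': {'SQL', 'Spatial'},
    'EN': {'SQL', 'DM in R', 'R Prog', 'Regression'},
    'DU': {'SQL', 'Spatial'},
    'FL': {'Spatial'},
    'DS': {'SQL', 'DM in R', 'R Prog'},
}

def binary_cosine(a, b):
    """Cosine of two 0/1 vectors: shared items / sqrt(|a| * |b|)."""
    return len(a & b) / math.sqrt(len(a) * len(b))

sims = {u: binary_cosine(taken['EN'], c) for u, c in taken.items() if u != 'EN'}
for user, sim in sorted(sims.items(), key=lambda kv: kv[1], reverse=True):
    print(user, round(sim, 6))
```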


**User-based similarity-based voting model**

Another way to approach this problem is to use each user's similarity with EN and calculate how much each user contributes overall, counting a user's similarity only if that user has taken a course EN has not taken. Based on the above similarities, which course would you recommend for EN?

<h4 style="color:purple"> Write Your Free-Form Response Below: </h4>

Assume we use k = 5 for the number of neighbors. In this situation, there are only 5 users that have overlap with EN:

* **DS 0.866**
* **LN 0.671**
* **JH 0.354**
* **DU 0.354**
* **MH 0.289**

**DS** has taken 3 of the same courses as **EN** and has the highest similarity but has not taken any new courses not taken by EN. ❌  
**LN** has also taken 3 of the same courses as **EN** and has taken 2 new courses **Python** and **Forecast**.  
**JH** has only taken 1 of the same courses as **EN** and has only taken 1 new course **Spatial**.  
**DU** has also only taken 1 of the same courses as **EN** and has also only taken 1 new course **Spatial**.  
**MH** has also only taken 1 of the same courses as **EN** and has taken 2 new courses **Spatial** and **Python**.  

**Spatial = 0.354 + 0.354 + 0.289 = 0.997**  
**Python = 0.671 + 0.289 = 0.96**  
**Forecast = 0.671**  

Based on this user-based similarity voting approach, **Spatial** would be the recommended course for EN.

What potential issues do you notice with this method?

* Did the number of neighbors influence the predictions?
* Should we consider a similarity threshold instead of an arbitrary k cutoff?
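One way to explore these questions is to rerun the vote with different neighbor counts and with a similarity threshold. The sketch below uses the rounded similarities and new-course lists from the response above; the `vote` function and its `k` and `min_sim` parameters are illustrative, not part of any library.

```python
# (user, similarity to EN, courses the user took that EN has not)
neighbors = [
    ('DS', 0.866, set()),
    ('LN', 0.671, {'Python', 'Forecast'}),
    ('JH', 0.354, {'Spatial'}),
    ('DU', 0.354, {'Spatial'}),
    ('MH', 0.289, {'Spatial', 'Python'}),
]

def vote(neighbors, k=None, min_sim=None):
    """Sum neighbor similarities per new course, using top-k or a threshold."""
    chosen = sorted(neighbors, key=lambda n: n[1], reverse=True)
    if k is not None:
        chosen = chosen[:k]
    if min_sim is not None:
        chosen = [n for n in chosen if n[1] >= min_sim]
    scores = {}
    for _, sim, new_courses in chosen:
        for course in new_courses:
            scores[course] = scores.get(course, 0.0) + sim
    return scores

for k in (1, 2, 5):
    print('k =', k, vote(neighbors, k=k))
print('min_sim = 0.5', vote(neighbors, min_sim=0.5))
```

With these numbers, k = 1 yields no candidate at all (DS has no new courses), k = 2 leaves Python and Forecast tied at 0.671, and k = 5 puts Spatial first. The recommendation therefore depends heavily on the cutoff, and several weakly similar neighbors can outvote one strongly similar neighbor.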