# Assignment 2
## Twitch Recommendation System
Authors: Alex Cojocaru, Kyle Jorrin, Diego Otero-Caldwell, Alec Drumm

Research articles to get started:\
[YouTube Recommendation Algorithm](https://dl.acm.org/doi/abs/10.1145/1864708.1864770)\
[Overview Of Recommender Systems](https://search-library.ucsd.edu/permalink/01UCS_SDI/1vtf07t/cdi_doaj_primary_oai_doaj_org_article_e1fff15ae9b64b96915b66bc5dc81ac5)\
[FPMC](https://dl.acm.org/doi/abs/10.1145/1772690.1772773)\
...




In [None]:
# If first time, run this script to set up virtual environment
# Requires python3.11
!chmod +x scripts/setup_env.sh
!./scripts/setup_env.sh

Data link: https://cseweb.ucsd.edu/~jmcauley/datasets.html#twitch

Start and stop times are integers measured in 10-minute intervals. The stream ID could be used to retrieve a single broadcast segment from a streamer (not used in our work). Each row has the following fields:

    User ID (anonymized)
    Stream ID
    Streamer username
    Time start
    Time stop
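Since the times count 10-minute intervals, an approximate watch duration can be derived from the difference between stop and start. The sketch below uses a few invented rows in the same column layout as the notebook; the `watch_minutes` column and the `INTERVAL_MINUTES` constant are names introduced here for illustration only.

```python
import pandas as pd

# Hypothetical toy rows (values invented), using the notebook's column names
toy = pd.DataFrame({
    'user_id': [1, 1, 2],
    'stream_id': [100, 101, 100],
    'streamer_username': ['streamer_a', 'streamer_b', 'streamer_a'],
    'time_start': [10, 15, 10],
    'time_stop': [12, 16, 13],
})

# Each time unit is one 10-minute interval
INTERVAL_MINUTES = 10

# Approximate watch duration in minutes for each interaction
toy['watch_minutes'] = (toy['time_stop'] - toy['time_start']) * INTERVAL_MINUTES
print(toy[['user_id', 'streamer_username', 'watch_minutes']])
```

The durations are only accurate to the 10-minute granularity of the data.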

[Original research paper of the data](https://search-library.ucsd.edu/permalink/01UCS_SDI/1vtf07t/cdi_unpaywall_primary_10_1145_3460231_3474267)


Load the data

In [None]:
# Imports
import pandas as pd
import numpy as np

In [None]:
# Data with header names
data = pd.read_csv('100k_a.csv', names=['user_id', 'stream_id', 'streamer_username', 'time_start', 'time_stop'])
data.head()

In [None]:
# Temporal split: sort by time_start, train on the earliest 80%, test on the latest 20%
data = data.sort_values('time_start').reset_index(drop=True)
split_point = int(len(data) * 0.8)

# Shuffle rows within each split; the train/test boundary itself stays temporal
train_data = data.iloc[:split_point].sample(frac=1)
test_data = data.iloc[split_point:].sample(frac=1)

print('train_data entries:', len(train_data), train_data.head(), sep='\n')
print('test_data entries:', len(test_data), test_data.head(), sep='\n')
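A quick way to confirm that a temporal split does not leak future interactions into training is to compare the latest training start time with the earliest test start time. The following is a minimal sketch on invented toy data; `toy`, `toy_train`, `toy_test`, and `no_leak` are illustrative names, and `random_state` is set only to make the toy example repeatable.

```python
import pandas as pd

# Hypothetical toy interactions (values invented for illustration)
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'time_start': [30, 10, 50, 20, 40],
})

# Same recipe as the split above: sort by time, then cut at 80%
toy = toy.sort_values('time_start').reset_index(drop=True)
split_point = int(len(toy) * 0.8)

toy_train = toy.iloc[:split_point].sample(frac=1, random_state=0)
toy_test = toy.iloc[split_point:].sample(frac=1, random_state=0)

# Shuffling within each split must not move rows across the boundary
no_leak = toy_train['time_start'].max() <= toy_test['time_start'].min()
print('train rows:', len(toy_train), 'test rows:', len(toy_test), 'no leak:', no_leak)
```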

### Training the model (Part 3)
#### From section 4.2 of the paper

In [None]:
import pandas as pd
from scipy.sparse import csr_matrix

# Group by user_id and streamer_username, count interactions
interaction_counts = train_data.groupby(['user_id', 'streamer_username']).size().reset_index(name='count')

# Map user_id and streamer_username to indices for the matrix
user_ids = interaction_counts['user_id'].unique()
streamer_usernames = interaction_counts['streamer_username'].unique()

user_to_idx = {user: idx for idx, user in enumerate(user_ids)}
streamer_to_idx = {streamer: idx for idx, streamer in enumerate(streamer_usernames)}
idx_to_streamer = {idx: streamer for idx, streamer in enumerate(streamer_usernames)}


# Prepare data for sparse matrix
rows = interaction_counts['user_id'].map(user_to_idx)
cols = interaction_counts['streamer_username'].map(streamer_to_idx)
values = interaction_counts['count']

# Create sparse matrix (users x streamers)
user_streamer_matrix = csr_matrix((values, (rows, cols)), shape=(len(user_ids), len(streamer_usernames)))
user_streamer_matrix.shape
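To make the matrix construction concrete, here is a small sketch that applies the same steps to a handful of invented interactions and prints the dense result. The toy data and the short names (`counts`, `u2i`, `s2i`, `dense`) are assumptions for illustration, not part of the notebook's pipeline.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy interactions: user 7 watched 'a' twice and 'b' once, user 9 watched 'b' once
toy = pd.DataFrame({
    'user_id': [7, 7, 7, 9],
    'streamer_username': ['a', 'a', 'b', 'b'],
})

# Count interactions per (user, streamer) pair
counts = toy.groupby(['user_id', 'streamer_username']).size().reset_index(name='count')

# Map raw IDs to contiguous row and column indices
users = counts['user_id'].unique()
streamers = counts['streamer_username'].unique()
u2i = {u: i for i, u in enumerate(users)}
s2i = {s: i for i, s in enumerate(streamers)}

rows = counts['user_id'].map(u2i).to_numpy()
cols = counts['streamer_username'].map(s2i).to_numpy()
vals = counts['count'].to_numpy()

# Rows are users, columns are streamers, values are interaction counts
matrix = csr_matrix((vals, (rows, cols)), shape=(len(users), len(streamers)))
dense = matrix.toarray()
print(dense)
```

Note that row `0` corresponds to user `7` here, so a matrix row index is generally not the same as the raw `user_id`.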

In [None]:
# Use the implicit library (in place of surprise) for matrix factorization
import implicit

# Train ALS model
model = implicit.als.AlternatingLeastSquares(factors=50)
model.fit(user_streamer_matrix)

# Recommend for the user at matrix row 0 (an index from user_to_idx, not a raw user_id)
recommendations, scores = model.recommend(0, user_streamer_matrix[0])
print("Recommended items:", recommendations)
print("Scores:", scores)
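For intuition about what a factorization model returns, the sketch below scores items as dot products of latent factors and keeps the top-N items the user has not already seen. It uses random NumPy factors rather than a trained model, and `top_n` is a hypothetical helper; it is a conceptual illustration under those assumptions, not the `implicit` library's implementation.

```python
import numpy as np

# Random stand-ins for learned latent factors (3 users, 5 items, 4 factors)
rng = np.random.default_rng(0)
user_factors = rng.normal(size=(3, 4))
item_factors = rng.normal(size=(5, 4))

def top_n(user_idx, seen_items, n=2):
    """Return the n highest-scoring unseen item indices and their scores."""
    # Score of each item = dot product of the item and user factor vectors
    scores = item_factors @ user_factors[user_idx]
    # Exclude items the user already interacted with
    scores[list(seen_items)] = -np.inf
    # Sort descending and keep the first n
    ranked = np.argsort(-scores)[:n]
    return ranked, scores[ranked]

recs, rec_scores = top_n(0, seen_items={1, 3})
print('Recommended item indices:', recs)
print('Scores:', rec_scores)
```

The returned values are column indices of the interaction matrix, so they need a lookup such as `idx_to_streamer` to become streamer usernames.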

In [None]:
# Testing the model - collect matrix indices of test users that also appear in training
test_users = test_data['user_id']
user_set = set()

for user in test_users:
    if user in user_to_idx:
        user_set.add(user_to_idx[user])

# Get recommendations for each of those users
all_recommendations = []
all_scores = []

for user_idx in user_set:
    recs, scores = model.recommend(user_idx, user_streamer_matrix[user_idx])
    all_recommendations.append((user_idx, recs))
    all_scores.append((user_idx, scores))

print("Recommended items:", all_recommendations)
print("Scores:", all_scores)

In [None]:
# Evaluating the model with hit@1
total_num_recs = len(all_recommendations)
num_hit_at_1 = 0

for uid, rec in all_recommendations:
    # uid is a matrix row index, so map it back to the original user_id before filtering
    # the test data for that user's rows, then keep rows with the top recommended streamer
    temp_df = test_data[test_data['user_id'] == user_ids[uid]]
    temp_df = temp_df[temp_df['streamer_username'] == idx_to_streamer[rec[0]]]
    if not temp_df.empty:
        num_hit_at_1 += 1

print('Hit@1 prediction accuracy:', num_hit_at_1 / total_num_recs)

In [None]:
# Evaluating the model with hit@10
num_hit_at_10 = 0

for uid, rec in all_recommendations:
    # uid is a matrix row index, so map it back to the original user_id,
    # then check whether any recommended streamer appears in that user's test rows
    watched = set(test_data.loc[test_data['user_id'] == user_ids[uid], 'streamer_username'])
    for r in rec:
        if idx_to_streamer[r] in watched:
            num_hit_at_10 += 1
            break


print('Hit@10 prediction accuracy:', num_hit_at_10 / total_num_recs)
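The two evaluation cells share the same logic and differ only in how many recommendations are checked. One possible way to express that as a single function is sketched below; `hit_rate_at_k` and the toy dictionaries are hypothetical names and invented data, and the function assumes recommendations and ground truth are keyed by the same user identifiers.

```python
def hit_rate_at_k(recommendations, relevant, k):
    """Fraction of users with at least one relevant item in their top-k list."""
    hits = 0
    evaluated = 0
    for user, ranked_items in recommendations.items():
        # Skip users with no ground-truth interactions
        if user not in relevant:
            continue
        evaluated += 1
        # A hit means any of the first k recommendations is relevant
        if any(item in relevant[user] for item in ranked_items[:k]):
            hits += 1
    return hits / evaluated if evaluated else 0.0

# Hypothetical ranked recommendations and held-out streamers per user
toy_recs = {'u1': ['a', 'b', 'c'], 'u2': ['c', 'a', 'b'], 'u3': ['b', 'c', 'a']}
toy_truth = {'u1': {'a'}, 'u2': {'b'}, 'u3': {'z'}}

hit1 = hit_rate_at_k(toy_recs, toy_truth, 1)
hit3 = hit_rate_at_k(toy_recs, toy_truth, 3)
print('Hit@1:', hit1)
print('Hit@3:', hit3)
```

Storing each user's held-out streamers in a set keeps the membership test cheap compared with filtering the full test DataFrame for every user.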
