## Recommender System From Scratch 

#### By: Aurelio Barrios

### What Is Shown In This Notebook

This notebook follows the `twitter_scraping.ipynb` file and builds a recommender system from scratch using scraped Twitter data. It contains three implementations of a recommender system: a simple recommender that recommends what is popular, a user-based collaborative filtering recommender, and the final deployed item-based collaborative filtering recommender. The deployed model takes the Twitch streamers a user follows and recommends other streamers based on them.

### Imports

Load the imports needed for the recommender system implementation.

In [1]:
import os
import json
import numpy as np
import pandas as pd
from collections import defaultdict
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine

### Data Preparation

Here we build the data needed to implement the recommender system. The data is a list of lists, where each sublist represents one user and the streamers that user follows. We can then use this list to build connections via user-based or item-based collaborative filtering.

In [2]:
#read in twitch streamer twitter handle dataset
twitch_df = pd.read_csv('data/twitter_handles.csv')
twitch_df.head()

Unnamed: 0,channel,twitter_handle
0,xQcOW,xQc
1,summit1g,summit1g
2,Tfue,Tfue
3,Asmongold,Asmongold
4,NICKMERCS,NICKMERCS


In [3]:
#get a list of all the unique streamers twitter handles
streamers = sorted(list(twitch_df['twitter_handle']))

#helper function to find streamers amongst twitter handles
def include_streamer(name, handles=streamers):
    return name in handles

In [None]:
#build streamer connection data for recommender
streamer_data = []
#gather datasets from each streamer
for streamer in twitch_df['twitter_handle']:
    outdir = 'data/streamers/' + streamer
    if os.path.exists(outdir):        
        #loop through all the files in this directory
        for file in os.listdir(outdir):
            filename = outdir + '/' + file
            if file[-4:] == '.csv' and os.path.getsize(filename) > 20:
                #read in dataset
                curr_df = pd.read_csv(filename)
                curr_df['include'] = curr_df['screen_name'].apply(include_streamer)
                curr_df = curr_df[curr_df['include']]
                #get the streamers they follow
                streamers_list = list(curr_df['screen_name'])
                if streamer not in streamers_list:
                    streamers_list.append(streamer)
                if len(streamers_list) > 1:
                    streamer_data.append(streamers_list)
            else:
                print(filename)

In [None]:
#save dataset for easier access in the future
with open('data/outfiles/connections.json', 'w') as f:
    json.dump(streamer_data, f)

In [4]:
#load the saved connections dataset, closing the file when done
with open('data/outfiles/connections.json') as f:
    streamer_data = json.load(f)

In [5]:
#build columns for twitch dataset
twitch_df['lower_handles'] = twitch_df['twitter_handle'].apply(lambda x: x.lower())
twitch_df['lower_channel'] = twitch_df['channel'].apply(lambda x: x.lower())

In [None]:
#save our twitch dataset to file for easy access
twitch_df.to_csv('data/outfiles/twitch_data.csv', index=False)

### Recommender One: Popularity

The first recommender that we build from scratch simply recommends popular streamers that the current user is not already following. This is a simple approach with no personalization, so it is unlikely to be the best recommender. Where it does make sense is for a new user for whom we don't have enough data: recommending what's popular is a safe bet in that case.

In [6]:
from collections import Counter

#gets the most popular streamers observed
popular = Counter(streamer for stream_list in streamer_data for streamer in stream_list).most_common()

In [7]:
#helper function that recommends the popular streamers a user is not already interested in
def recommend_popular(streamers, num_streamers=5):
    recommended = [streamer for streamer, _ in popular if streamer not in streamers]
    return recommended[:num_streamers]

In [8]:
#see which streamers user 9 follows
streamer_data[9]

['fuslie', 'xQc', 'Sykkuno', 'Thebuddha_3', 'AnthonyZ1O', 'cyr', 'summit1g']

In [9]:
#recommend popular streamers user 9 does not follow already
recommend_popular(streamer_data[9])

['pokimanelol', 'NICKMERCS', 'Tfue', 'shroud', 'Myth_']
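
The new-user scenario described above can be sketched as a fallback rule: use a personalized recommender when a user follows enough streamers, and fall back to popularity otherwise. This is only an illustration, not part of the deployed system; `recommend_with_fallback`, `rank_by_popularity`, the `min_follows` threshold, and the toy follow lists are all hypothetical.

```python
from collections import Counter

def rank_by_popularity(follow_lists):
    """Return streamers ordered from most to least followed."""
    counts = Counter(s for follows in follow_lists for s in follows)
    return [s for s, _ in counts.most_common()]

def recommend_with_fallback(followed, follow_lists, personalized=None,
                            min_follows=3, num_rec=5):
    """Use `personalized` when there is enough data, else recommend by popularity.

    `personalized` is a hypothetical callable that takes the set of followed
    streamers and returns a ranked list of recommendations.
    """
    followed = set(followed)
    if personalized is not None and len(followed) >= min_follows:
        return personalized(followed)[:num_rec]
    # cold start: too few follows to personalize, so recommend what is popular
    ranking = rank_by_popularity(follow_lists)
    return [s for s in ranking if s not in followed][:num_rec]

# toy follow lists, for illustration only
toy_data = [['xQc', 'Tfue'], ['xQc', 'summit1g'], ['xQc', 'Tfue', 'shroud']]
print(recommend_with_fallback(['xQc'], toy_data))
```

With only one followed streamer, the toy user is below the assumed threshold, so the popularity ranking is used with the streamer they already follow filtered out.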

### Recommender Two: User-Based Collaborative Filtering

The second recommender system is personalized, so it can provide more relevant recommendations than popularity alone. In this recommender we look for similarities between users. Given a user, we find other users who are similar to them and recommend the streamers those users follow that the given user does not already follow. The key here is to measure the similarity between users, and to do this we use the cosine similarity metric, defined below.

In [10]:
from scipy import spatial
#helper function to find the cosine similarity between two vectors
def similarity(v, w):
    return 1 - spatial.distance.cosine(v, w)
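
Because each user is represented by a 0/1 vector, cosine similarity reduces to a simple count: the number of streamers two users share, divided by the square root of the product of how many streamers each follows. The sketch below shows this with plain Python sets. The function name `cosine_from_follow_sets` and the follow lists are hypothetical, and returning 0 for an empty follow list is an assumption made here for convenience.

```python
from math import sqrt

def cosine_from_follow_sets(follows_a, follows_b):
    """Cosine similarity of two binary follow vectors, computed from sets."""
    a, b = set(follows_a), set(follows_b)
    if not a or not b:
        return 0.0  # assumption: an empty follow list is similar to nothing
    # dot product of 0/1 vectors = number of shared streamers
    # norm of a 0/1 vector = sqrt(number of streamers followed)
    return len(a & b) / sqrt(len(a) * len(b))

# hypothetical follow lists, for illustration only
user_a = ['xQc', 'summit1g', 'Tfue']
user_b = ['xQc', 'Tfue', 'Asmongold', 'NICKMERCS']
print(cosine_from_follow_sets(user_a, user_b))  # 2 / sqrt(3 * 4), about 0.577
```

Under this formula, two users who share exactly one streamer while following 5 and 6 streamers respectively would score 1 / sqrt(30), roughly 0.18, which gives a feel for how small similarities between sparse users tend to be.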

In [11]:
#helper function to build followed streamer vector
def build_vector(streamer_list):
    return np.array([1 if streamer in streamer_list else 0 for streamer in streamers])

In [12]:
#build matrix holding the binary (multi-hot) followed streamer vector for every user
streamer_matrix = np.array(list(map(build_vector, streamer_data)))

In [13]:
#get the similarities between each user
users = 1 - pairwise_distances(streamer_matrix, metric='cosine')

In [14]:
#here we see the similarity rating between user 38 and user 36
users[38][36]

0.18257418583505536

In [15]:
#function that returns the most similar users to the current user
def get_similar_users(curr_user):
    #get other users and their similarity ratings to the current user
    user_pairs = [(other_user, similarity) 
                  for other_user, similarity in enumerate(users[curr_user]) 
                  if curr_user != other_user and similarity > 0]
    return sorted(user_pairs, key=lambda x: x[1], reverse=True)

In [16]:
#function that recommends using user-based collaborative filtering
def recommend_user(curr_user, add_streamers_followed=False, num_rec=5):
    rec_streamers = defaultdict(float)
    for other_user, similarity in get_similar_users(curr_user):
        for streamer in streamer_data[other_user]:
            rec_streamers[streamer] += similarity
    rec_streamers = sorted(rec_streamers.items(), key=lambda x: x[1], reverse=True)
    if add_streamers_followed:
        return [streamer for streamer, _ in rec_streamers][:num_rec]
    else:
        return [streamer for streamer, _ in rec_streamers
               if streamer not in streamer_data[curr_user]][:num_rec]

In [17]:
#see which streamers user 9 follows
streamer_data[9]

['fuslie', 'xQc', 'Sykkuno', 'Thebuddha_3', 'AnthonyZ1O', 'cyr', 'summit1g']

In [18]:
#recommend 5 streamers using user-based filtering
recommend_user(9)

['pokimanelol', 'LilyPichu', 'REALMizkif', 'QuarterJade', 'shroud']

### Recommender Three: Item-Based Collaborative Filtering

The third recommender, and the one deployed on the website for this project, is an item-based collaborative filtering recommender. It addresses a shortcoming of the user-based recommender: when the vector space is large, user vectors are sparse and the distances between them are large, so even the user most similar to a given user may not be very similar at all. Rather than compare users in this large vector space, we can recommend using similarities between the items a user follows. So instead of recommending based on similar users, we recommend by aggregating streamers that are similar to the user's current interests.
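
The idea can be illustrated on a tiny example before building the full matrices. The sketch below is a minimal, dictionary-based version under stated assumptions: the helper names `item_similarities` and `recommend_from_items` and the toy follow lists are hypothetical, and it stores only nonzero similarities rather than a dense matrix.

```python
from collections import defaultdict
from math import sqrt

def item_similarities(follow_lists):
    """Cosine similarity between streamers, based on which users follow them."""
    followers = defaultdict(set)  # streamer -> set of user indices
    for user, follows in enumerate(follow_lists):
        for streamer in follows:
            followers[streamer].add(user)
    sims = defaultdict(dict)
    for a in followers:
        for b in followers:
            if a != b:
                shared = len(followers[a] & followers[b])
                if shared:  # keep only nonzero similarities
                    sims[a][b] = shared / sqrt(len(followers[a]) * len(followers[b]))
    return sims

def recommend_from_items(followed, sims, num_rec=5):
    """Sum the similarities of every streamer related to the followed ones."""
    followed = set(followed)
    scores = defaultdict(float)
    for streamer in followed:
        for other, sim in sims.get(streamer, {}).items():
            if other not in followed:
                scores[other] += sim
    return sorted(scores, key=scores.get, reverse=True)[:num_rec]

# toy follow lists, for illustration only
toy_data = [['xQc', 'Tfue'], ['xQc', 'Tfue', 'shroud'],
            ['shroud', 'summit1g'], ['xQc', 'summit1g']]
sims = item_similarities(toy_data)
print(recommend_from_items(['Tfue'], sims))
```

In the toy data, 'Tfue' shares both of its followers with 'xQc' and one with 'shroud', so a user who follows only 'Tfue' is recommended 'xQc' first and 'shroud' second, without any need to find a similar user.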

In [19]:
#build matrix that stores whether streamer (item) i is followed by user j
# item_matrix[i][j] = 1, if streamer i is followed by user j
# this is the transpose of streamer_matrix
item_matrix = streamer_matrix.T

In [20]:
#get the similarities between each streamer(item) rather than user
items = 1 - pairwise_distances(item_matrix, metric='cosine')

In [21]:
#here we see the similarity rating between streamer 20 and streamer 10
items[20][10]

0.019596545041740465

In [22]:
#function that returns most similar items(streamer) to the current item(current streamer)
def get_similar_items(curr_item):
    item_pairs = [(streamers[other_item], similarity)
                 for other_item, similarity in enumerate(items[curr_item])
                 if curr_item != other_item and similarity > 0]
    return sorted(item_pairs, key=lambda x: x[1], reverse=True)

In [23]:
#function that recommends using item-based filtering
def recommend_item(curr_user, add_self_items=False, num_rec=5):
    rec_streamers = defaultdict(float)
    curr_user_streamers = streamer_matrix[curr_user]
    for curr_item, follows in enumerate(curr_user_streamers):
        if follows:
            similar_streams = get_similar_items(curr_item)
            for streamer, similarity in similar_streams:
                rec_streamers[streamer] += similarity
    rec_streamers = sorted(rec_streamers.items(), key=lambda x: x[1], reverse=True)
    if add_self_items:
        return [streamer for streamer, _ in rec_streamers][:num_rec]
    else:
        return [streamer for streamer, _ in rec_streamers
               if streamer not in streamer_data[curr_user]][:num_rec]

In [24]:
#see which streamers user 9 follows
streamer_data[9]

['fuslie', 'xQc', 'Sykkuno', 'Thebuddha_3', 'AnthonyZ1O', 'cyr', 'summit1g']

In [25]:
#recommend five streamers using item-based filtering
recommend_item(9)

['LilyPichu', 'QuarterJade', 'Curtisryan__', 'pokimanelol', 'REALMizkif']