# Subreddit recommender system in Python
**Recommender systems** personalize your experience on the web, telling you what to buy, where to eat, or even who you should be friends with. People's tastes vary, but they generally follow patterns: people tend to like things that are similar to other things they like, and they tend to share tastes with the people they are close to. Recommender systems try to capture these patterns to predict what else you might like.

<img src='http://etailment.de/news/media/1/humor-e-commerce-5672-detailp.jpeg'/>


#### I will be using the subreddit interaction dataset.

<img src='https://assets.ifttt.com/images/channels/1352860597/icons/on_color_large.png'/>

It contains subreddit interactions from approximately 25k users. Add the unzipped subreddit dataset folder to your notebook directory. You can download the dataset [here](https://www.kaggle.com/colemaclean/subreddit-interactions/downloads/subreddit-interactions-for-25000-users.zip).

#### Importing all the necessary modules

In [1]:
import numpy as np
import pandas as pd
import math as mt
from pandas import read_csv
from sparsesvd import sparsesvd        # truncated SVD of a sparse matrix
from scipy.sparse import csc_matrix    # compressed sparse column matrix

#### Get a sneak peek of the first 5 rows in the dataset. Next, let's count the number of unique users and subreddits.

In [2]:
subreddit_df = read_csv('reddit_data.csv')
print("Top 5 rows of the dataset - \n" + str(subreddit_df.head()))

Top 5 rows of the dataset - 
    username         subreddit           utc
0  kabanossi  photoshopbattles  1.482748e+09
1  kabanossi      GetMotivated  1.482748e+09
2  kabanossi            vmware  1.482748e+09
3  kabanossi           carporn  1.482748e+09
4  kabanossi               DIY  1.482747e+09


In [3]:
n_users = subreddit_df.username.unique().shape[0]
n_items = subreddit_df.subreddit.unique().shape[0]
print('Number of users = ' + str(n_users) + '\nNumber of subreddits = ' + str(n_items))

Number of users = 22610
Number of subreddits = 34967


In [4]:
print("Number of rows in the dataset = " + str(subreddit_df.shape[0]) +
      '\nNumber of columns = ' + str(subreddit_df.shape[1]))

Number of rows in the dataset = 14000000
Number of columns = 3


#### Gathering some information from the data and checking if it contains Null values

In [5]:
print("Information from the data set - \n ")
subreddit_df.info()

Information from the data set - 
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14000000 entries, 0 to 13999999
Data columns (total 3 columns):
username     object
subreddit    object
utc          float64
dtypes: float64(1), object(2)
memory usage: 320.4+ MB


In [6]:
print("Columns containing Null entries - \n" + str(subreddit_df.isnull().any()))

Columns containing Null entries - 
username     False
subreddit    False
utc          False
dtype: bool
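
`isnull().any()` only reports whether a column contains at least one null. If the actual number of nulls per column is wanted, `isnull().sum()` gives it. A minimal sketch on a made-up frame (`toy_df` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for reddit_data.csv (values are made up)
toy_df = pd.DataFrame({
    "username": ["alice", "bob", None],
    "subreddit": ["python", "aww", "python"],
    "utc": [1.48e9, np.nan, 1.48e9],
})

# True/False per cell, summed column-wise, gives the null count per column
null_counts = toy_df.isnull().sum()
print(null_counts)
```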


#### Let's list the most popular subreddits based on user interactions.

In [7]:
subreddit_grouped = subreddit_df.groupby(['subreddit']).agg({'username': 'count'}).\
                    rename(columns={'username':'subreddit_count'}).reset_index()
subreddit_grouped.head()

| | subreddit | subreddit_count |
|---|---|---|
| 0 | 007 | 3 |
| 1 | 065_082_071 | 1 |
| 2 | 0ad | 4 |
| 3 | 0x10c | 3 |
| 4 | 0x3642 | 39 |


In [8]:
grouped_sum = subreddit_grouped['subreddit_count'].sum()
grouped_sum

14000000L

In [9]:
subreddit_grouped['percentage']  = subreddit_grouped['subreddit_count'].div(grouped_sum)*100
subreddit_grouped.sort_values(['subreddit_count', 'subreddit'], ascending=[False, True]).head(10)

| | subreddit | subreddit_count | percentage |
|---|---|---|---|
| 1402 | AskReddit | 1030290 | 7.359214 |
| 29812 | politics | 367860 | 2.627571 |
| 17058 | The_Donald | 216939 | 1.549564 |
| 28536 | nfl | 173883 | 1.242021 |
| 26783 | leagueoflegends | 157663 | 1.126164 |
| 34646 | worldnews | 156605 | 1.118607 |
| 24419 | funny | 152921 | 1.092293 |
| 28380 | nba | 150985 | 1.078464 |
| 29564 | pics | 143496 | 1.024971 |
| 28497 | news | 140492 | 1.003514 |


#### The table above shows the top 10 subreddits by number of user interactions.
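
The same popularity share can probably be obtained in one step with `value_counts(normalize=True)`. This is a sketch on made-up data, not the notebook's own code; `toy_df` is a hypothetical stand-in for `subreddit_df`:

```python
import pandas as pd

# Toy interaction log: one row per (user, subreddit) interaction
toy_df = pd.DataFrame({
    "subreddit": ["AskReddit", "AskReddit", "politics", "AskReddit", "nfl"],
})

# Fraction of all interactions per subreddit, expressed as a percentage
# and sorted from most to least popular
share = toy_df["subreddit"].value_counts(normalize=True) * 100
print(share.head(10))
```

Unlike the `groupby` version, this yields a Series indexed by subreddit name rather than a DataFrame with a separate count column.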

--------------------------------------------------------------------------

# Preprocessing data to convert into User - Subreddit Matrix Format

In [10]:
# size() counts rows per (username, subreddit) pair; aggregating a grouping
# column by name can fail on newer pandas versions
user_sub_df = subreddit_df.groupby(['username','subreddit']).size().\
              reset_index(name='subreddit_count')
user_sub_df.head(10)

| | username | subreddit | subreddit_count |
|---|---|---|---|
| 0 | --ANUSTART- | AOImmortals | 2 |
| 1 | --ANUSTART- | Addons4Kodi | 1 |
| 2 | --ANUSTART- | AdviceAnimals | 7 |
| 3 | --ANUSTART- | AskReddit | 14 |
| 4 | --ANUSTART- | Assistance | 9 |
| 5 | --ANUSTART- | CombatFootage | 1 |
| 6 | --ANUSTART- | Documentaries | 1 |
| 7 | --ANUSTART- | FantasyPL | 3 |
| 8 | --ANUSTART- | FiftyFifty | 1 |
| 9 | --ANUSTART- | Fitness | 7 |


#### Let's join each user's interactions into a single "document" of subreddit names, which makes the conversion to matrix format easier

In [11]:
user_sub_doc_df = subreddit_df.groupby('username')['subreddit'].apply(' '.join).reset_index()

In [12]:
user_sub_doc_df.head()

| | username | subreddit |
|---|---|---|
| 0 | --ANUSTART- | Testosterone Testosterone Testosterone Testost... |
| 1 | --Sko-- | DestinyTheGame DestinyTheGame DestinyTheGame D... |
| 2 | --UNKN0WN-- | AceAttorney AceAttorney AceAttorney AceAttorne... |
| 3 | --harley--quinn-- | LGBTeens Patriots asktransgender Patriots Patr... |
| 4 | -A-p-r-i-l- | tdi tdi tdi AskReddit tdi tdi tdi tdi tdi tdi ... |


In [13]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
document = user_sub_doc_df.iloc[:, 1]
document = document.apply(lambda row: tokenizer.tokenize(row))
document.head()

0    [Testosterone, Testosterone, Testosterone, Tes...
1    [DestinyTheGame, DestinyTheGame, DestinyTheGam...
2    [AceAttorney, AceAttorney, AceAttorney, AceAtt...
3    [LGBTeens, Patriots, asktransgender, Patriots,...
4    [tdi, tdi, tdi, AskReddit, tdi, tdi, tdi, tdi,...
Name: subreddit, dtype: object

In [14]:
user_queries = subreddit_df['subreddit'].unique()
users = user_sub_doc_df['username'].unique()
print("The number of unique users - " + str(len(users)))
print("The number of unique subreddits - " + str(len(user_queries)))
corpus_of_subreddits = list(user_queries)

The number of unique users - 22610
The number of unique subreddits - 34967


#### Let's create the user vs. item matrix

In [15]:
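# Map each subreddit name to a column index
voc2id = dict(zip(corpus_of_subreddits, range(len(corpus_of_subreddits))))
rows, cols, vals = [], [], []
for r, d in enumerate(document):        # r = user (row) index, d = that user's tokens
    for e in d:
        if voc2id.get(e) is not None:   # skip tokens that are not known subreddit names
            rows.append(r)
            cols.append(voc2id[e])
            vals.append(1)
# Duplicate (row, col) pairs are summed, so each cell holds an interaction count
user_subreddit_matrix = csc_matrix((vals, (rows, cols)), dtype=np.float32)
print("The dimensions of user subreddit matrix are - " + str(user_subreddit_matrix.shape))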

The dimensions of user subreddit matrix are - (22610, 34967)
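
As an alternative to the document-and-tokenizer route, the matrix can likely be built straight from the interaction log by turning usernames and subreddit names into integer codes. The sketch below assumes a toy frame (`toy_df`) with the same two columns as `subreddit_df`; the names `user_codes`, `sub_codes` and `toy_matrix` are introduced here for illustration only:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

# Toy interaction log with the same columns as the real dataset
toy_df = pd.DataFrame({
    "username":  ["alice", "alice", "alice", "bob", "bob"],
    "subreddit": ["python", "python", "aww", "aww", "nfl"],
})

# factorize() assigns integer codes in order of first appearance
user_codes, user_index = pd.factorize(toy_df["username"])
sub_codes, sub_index = pd.factorize(toy_df["subreddit"])

# One entry of 1 per interaction; repeated (row, col) pairs are summed
# when the matrix is built, which yields interaction counts
toy_matrix = csc_matrix(
    (np.ones(len(toy_df), dtype=np.float32), (user_codes, sub_codes)),
    shape=(len(user_index), len(sub_index)),
)
print(toy_matrix.toarray())
```

Here `user_index` and `sub_index` play the roles of `users` and `user_queries`: position `i` in each holds the name behind row or column `i`.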


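#### Let's calculate the sparsity of the user-subreddit matrix

In [16]:
# len(subreddit_df) counts every interaction, including repeat visits to the same
# subreddit, so this figure is a lower bound on the sparsity of the matrix itself.
sparsity=round(1.0-len(subreddit_df)/float(len(users)*len(user_queries)),3)
print('The sparsity level of Subreddit dataframe is ' +  str(sparsity*100) + '%')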

The sparsity level of Subreddit dataframe is 98.2%
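
Because a user can interact with the same subreddit many times, the number of rows in the log overstates the number of filled cells in the matrix. A stricter measure uses the count of stored non-zero entries (`nnz`). A minimal sketch on a toy matrix (the values are made up):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Toy user x subreddit count matrix: 2 users, 3 subreddits
toy_matrix = csc_matrix(np.array([[2, 1, 0],
                                  [0, 1, 1]], dtype=np.float32))

n_cells = toy_matrix.shape[0] * toy_matrix.shape[1]
# nnz is the number of stored non-zero cells (4 of 6 here)
toy_sparsity = 1.0 - toy_matrix.nnz / float(n_cells)
print("Sparsity: %.1f%%" % (toy_sparsity * 100))
```

Applied to `user_subreddit_matrix`, the same expression would give a sparsity at least as high as the figure printed above.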


### SVD
A well-known matrix factorization method is **singular value decomposition (SVD)**. Collaborative filtering can be formulated as approximating the user-item matrix `X` by a low-rank product obtained from its singular value decomposition. The winning team in the Netflix Prize competition used SVD matrix factorization models to produce product recommendations. For more information, I recommend these articles: [Netflix Recommendations: Beyond the 5 stars](http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html) and [Netflix Prize and SVD](http://buzzard.ups.edu/courses/2014spring/420projects/math420-UPS-spring-2014-gower-netflix-SVD.pdf).
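
To make the low-rank idea concrete, here is a small sketch that keeps only the `k` largest singular values of a toy matrix and rebuilds an approximation from them. It uses `scipy.sparse.linalg.svds` rather than the `sparsesvd` package used in this notebook, and the matrix values are made up:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

# Toy user x item matrix with two loose "taste groups"
toy = csc_matrix(np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 3, 4],
    [0, 1, 4, 3],
], dtype=np.float64))

k = 2  # number of latent factors to keep (must be < min(toy.shape))
U_k, s_k, Vt_k = svds(toy, k=k)

# Rank-k approximation X ~ U_k * diag(s_k) * Vt_k
approx = U_k.dot(np.diag(s_k)).dot(Vt_k)

# Frobenius-norm error of the approximation
err = np.linalg.norm(toy.toarray() - approx)
print("Reconstruction error with k=%d: %.4f" % (k, err))
```

The error equals the energy in the discarded singular values, which is why keeping more factors gives a closer fit at the cost of a larger model.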

#### Let's define a function that factorizes the user-subreddit matrix into its SVD matrices

In [17]:
def computeSVD(user_subreddit_matrix, no_of_latent_factors):
    """Compute a truncated SVD of the given matrix.

    :user_subreddit_matrix: sparse (csc) user x subreddit matrix
    :no_of_latent_factors : number of singular values/vectors to keep

    :U  : User to concept matrix (users x factors)
    :S  : Diagonal matrix holding the square roots of the singular values
    :Vt : Concept to subreddit matrix (factors x subreddits)
    """
    # sparsesvd returns the left singular vectors already transposed (factors x users)
    Ut, s, Vt = sparsesvd(user_subreddit_matrix, no_of_latent_factors)

    # Note: the diagonal stores sqrt(s), not s, so U*S*Vt damps the dominant
    # factors compared with the exact rank-k reconstruction.
    S = np.diag(np.sqrt(s)).astype(np.float32)

    U = csc_matrix(np.transpose(Ut), dtype=np.float32)
    S = csc_matrix(S, dtype=np.float32)
    Vt = csc_matrix(Vt, dtype=np.float32)

    return U, S, Vt

In [18]:
#Compute estimated recommendations for the given user
def computeEstimatedRecommendation(U, S, Vt, uTest):
    """Compute the recommendation for the given user.
    
    :U     : User to concept matrix 
    :S     : Strength of the concepts matrix
    :Vt    : Concept to subreddit matrix
    :uTest : List of user indices; only the last user's ranking is returned
    
    :recom : Indices of the 250 subreddits with the highest estimated scores
    """
    rightTerm = S*Vt 

    for userTest in uTest:
        prod = U[userTest, :]*rightTerm
        # Densify only this user's row (instead of allocating the full
        # users x subreddits matrix) to rank subreddits by estimated score.
        # float16 matches the precision used for the results shown below.
        estimated = np.asarray(prod.todense(), dtype=np.float16).ravel()
        recom = (-estimated).argsort()[:250]
    return recom

#### Checking the contribution of the top 250 subreddits to decide the number of latent factors for SVD

In [19]:
subreddit_contributions = subreddit_grouped.sort_values(['subreddit_count', 'subreddit'],
                                                        ascending=[False, True]).head(250)

print("Top 250 subreddits account for %s percent of all interactions in the dataset"
      % sum(subreddit_contributions.percentage))

Top 250 subreddits account for 61.8459428571 percent of all interactions in the dataset
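
The same popularity heuristic can be turned around: instead of fixing 250 and reading off the coverage, pick a target coverage and find how many top subreddits are needed to reach it. This is only a sketch of that idea on made-up counts; `toy_counts` and the 80% target are assumptions for illustration:

```python
import pandas as pd

# Toy interaction counts per subreddit (values are made up)
toy_counts = pd.Series({"AskReddit": 50, "politics": 20, "nfl": 15,
                        "aww": 10, "vmware": 5})

# Percentage share per subreddit, most popular first
share = toy_counts.sort_values(ascending=False) / toy_counts.sum() * 100

# Running total of coverage as more subreddits are included
cumulative = share.cumsum()

# Smallest number of top subreddits whose combined share reaches the target
target = 80
k_needed = int((cumulative < target).sum()) + 1
print("Subreddits needed for %d%% coverage: %d" % (target, k_needed))
```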


In [20]:
no_of_latent_factors = 250 #Selected number of latent factors as 250
no_of_recommendations_for_each_user = 5
uTest = [np.where(users == 'kabanossi')[0][0]]
U, S, Vt = computeSVD(user_subreddit_matrix, no_of_latent_factors)

In [21]:
print("------------------------------------------------------------------------------------\n")
print("User for whom recommendations are needed: %s\n" % users[uTest[0]])
print("------------------------------------------------------------------------------------\n")
print("Previous Subreddit interactions - \n")
previous_subredit_history = user_queries[np.where(user_subreddit_matrix[uTest[0],:].todense().T != 0)[0]]
for previous_subredits in previous_subredit_history:
    print(previous_subredits)
print("\n------------------------------------------------------------------------------------\n")

------------------------------------------------------------------------------------

User for whom recommendations are needed: kabanossi

------------------------------------------------------------------------------------

Previous Subreddit interactions - 

photoshopbattles
GetMotivated
vmware
carporn
DIY
food
CatastrophicFailure
techsupport
VapePorn
nottheonion
Citrix
sysadmin
HyperV
Vaping
wheredidthesodago
networking
HOTandTRENDING
creepy
Catloaf
CozyPlaces
MechanicalKeyboards
freenas
Justrolledintotheshop
trippinthroughtime
nevertellmetheodds
homelab
oddlysatisfying
AnimalsBeingJerks
pcmasterrace
Technology_
techsupportmacgyver
techsupportgore
OSHA
GifRecipes
europe
interestingasfuck
funny
sports
synology
AskEngineers
mallninjashit
knitting
philadelphia
tattoos
EarthPorn
gaming
AnimalsBeingBros
pics
gifs
OldSchoolCool
mildlyinteresting
WTF
HomeNetworking
DataHoarder
battlefield_one
virtualization
RealGirls
cats
titanfall
DestinyTheGame
news
aww
Showerthoughts
natureismetal
thala

In [22]:
#Get estimated recommendations for test user
recommended_items = computeEstimatedRecommendation(U, S, Vt, uTest)
final_recommendation = []
for r in user_queries[recommended_items]:
    if r not in previous_subredit_history:
        final_recommendation.append(r)
        if len(final_recommendation) == no_of_recommendations_for_each_user:
            break

print("------------------------------------------------------------------------------------\n")
print("Recommendations for %s are as follows - \n" % users[uTest[0]])
print("------------------------------------------------------------------------------------\n")

for recommendation in final_recommendation:
    print(recommendation)
print("------------------------------------------------------------------------------------\n")


------------------------------------------------------------------------------------

Recommendations for kabanossi are as follows - 

------------------------------------------------------------------------------------

apple
hardwareswap
vancouver
programming
elderscrollsonline
------------------------------------------------------------------------------------
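
The filtering step above walks the ranked list and skips subreddits the user already knows. One possible vectorized variant masks the seen items before ranking. The helper `top_n_unseen` and the toy scores below are hypothetical and not part of the notebook:

```python
import numpy as np

def top_n_unseen(scores, seen_mask, n):
    """Return indices of the n highest-scoring items the user has not seen."""
    # Push already-seen items to the bottom of the ranking
    masked = np.where(seen_mask, -np.inf, scores)
    order = np.argsort(-masked)
    return order[:n]

# Toy estimated scores for five subreddits and the user's history
scores = np.array([0.9, 0.1, 0.8, 0.4, 0.7])
seen = np.array([True, False, False, False, True])

print(top_n_unseen(scores, seen, 2))
```

If `n` exceeds the number of unseen items, the tail of the result will contain seen items, so a caller would need to guard against that.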



------------------------------------------------------------