*****************************************************
# The Social Web Assignment 4: Recommendation

- Instructors: Jacco van Ossenbruggen.
- TAs: Ayesha Noorain, Alex Boyko, Caio Silva, Elena Beretta, Mirthe Dankloff.
- Exercises for Hands-on session 4 
*****************************************************

In this notebook you will use the similarity measures to provide recommendations by comparing users and content based on expressed preferences (ratings). You will also explore textual similarity using a very popular natural language processing library, NLTK. Finally, you will explore recommendations on the Reddit platform.

Required packages:
* feedparser, praw,  nltk

In [1734]:
import sys

!pip install feedparser
!pip install praw
!pip install nltk



In the snippets below, you can find:
* creation of a small toy database in form of a dictionary of dictionaries;
* issuing several similarity measures based on critics' preferences; and
* use those values to obtain meaningful statistics pertaining a user.

# Movie preferences of movie critics
As example data, let us define a python dictionary of movie critics and their ratings of a small set of movies


In [1735]:

critics = {
    'Lisa Rose': {
        'Lady in the Water': 2.5,
        'Snakes on a Plane': 3.5,
        'Just My Luck': 3.0,
        'Superman Returns': 3.5,
        'You, Me and Dupree': 2.5,
        'The Night Listener': 3.0,
    },
    'Gene Seymour': {
        'Lady in the Water': 3.0,
        'Snakes on a Plane': 3.5,
        'Just My Luck': 1.5,
        'Superman Returns': 5.0,
        'The Night Listener': 3.0,
        'You, Me and Dupree': 3.5,
    },
    'Michael Phillips': {
        'Lady in the Water': 2.5,
        'Snakes on a Plane': 3.0,
        'Superman Returns': 3.5,
        'The Night Listener': 4.0,
    },
    'Claudia Puig': {
        'Snakes on a Plane': 3.5,
        'Just My Luck': 3.0,
        'The Night Listener': 4.5,
        'Superman Returns': 4.0,
        'You, Me and Dupree': 2.5,
    },
    'Mick LaSalle': {
        'Lady in the Water': 3.0,
        'Snakes on a Plane': 4.0,
        'Just My Luck': 2.0,
        'Superman Returns': 3.0,
        'The Night Listener': 3.0,
        'You, Me and Dupree': 2.0,
    },
    'Jack Matthews': {
        'Lady in the Water': 3.0,
        'Snakes on a Plane': 4.0,
        'The Night Listener': 3.0,
        'Superman Returns': 5.0,
        'You, Me and Dupree': 3.5,
    },
    'Toby': {'Snakes on a Plane': 4.5,
             'You, Me and Dupree': 1.0,
             'Superman Returns': 4.0},
}

# **Exercise 1: Finding Similar Users**

In the code below, two different simililarity measures are used: Euclidean distance and the Pearson correlation. If you are not familiar with them, we recommend you look them up to deepen your understanding.

## Euclidian distance

To assess the degree similarity between critics given their respective preferences, we can use the euclidian distance.
Its formula for an N-dimensional space is is: ![image.png](attachment:image.png)
Because we want a smaller distance to indicate a larger similarity, we will use 1/d(p,q) as our similarity value:

In [1736]:
from math import sqrt


def sim_distance(p1, p2, show_common_dims=False, prefs=critics):
    '''
    Returns a distance-based similarity score between two critics.
    '''

    # Get the list of shared_items
    common_items = []
    for movie in prefs[p1]:
        if movie in prefs[p2]:
            common_items.append(movie)
    # If they have no ratings in common, return 0
    if len(common_items) == 0:
        return 0
    if show_common_dims:
        print("common dimensions between {} and {}: ".format(p1, p2) + str(len(common_items)))
    # Add up the squares of all the differences
    sum_of_squares = sum([pow(prefs[p1][movie] - prefs[p2][movie], 2) for movie in common_items])

    # return sqrt(sum_of_squares)
    return 1 / sqrt(sum_of_squares)

Using this simple formula, you can calculate a similarity between two critics:

In [1737]:
# get the distance between 'Lisa Rose' and 'Gene Seymour'
sim_distance('Lisa Rose', 'Gene Seymour', True)

common dimensions between Lisa Rose and Gene Seymour: 6


0.41702882811414954

Try this with other names so you can see who is closer or further.

Name at least two problems with the sim_distance function as it is defined above. 

Problems:
1. first problem is what funciton returns. Imagine the distance of 2 person is 0. we can not perform division to 0. Our suggestion is to change result as follows:
    return 1 / (1 + sqrt(sum_of_squares))
2. Second problem is existence of "sqrt". If the sum_of_squares is between 0 and 1, the sqrt increases the distance unwillingly. So it is better to remove the sqrt. So the final result would be:
    1/ (1 + sum_of_squares)
    

A different measure of similarity can be given by pearson correlation.
Which follows: ![image.png](attachment:image.png)

Where the dividend represents a measure of covariance between dimensions, whereas the divisor is the product of the standard deviation of the scores given by each user.

In [1738]:
def sim_pearson(p1, p2, prefs=critics, verbose=False):
    '''
    Returns the Pearson correlation coefficient for p1 and p2.
    '''

    '''Step 1: Get the list of mutually rated items'''
    common_items = []
    dic = {}
    for movie in prefs[p1]:
        if movie in prefs[p2]:
            common_items.append(movie)
    # If they are no ratings in common, return 0
    if len(common_items) == 0:
        return 0
    '''Step 2: Sum calculations'''
    n_common_items = len(common_items)
    sum1 = sum([prefs[p1][movie] for movie in common_items])
    sum2 = sum([prefs[p2][movie] for movie in common_items])
    # Sums of squares
    sum1Sq = sum([pow(prefs[p1][movie], 2) for movie in common_items])
    sum2Sq = sum([pow(prefs[p2][movie], 2) for movie in common_items])
    # Sum of the products
    pSum = sum([prefs[p1][movie] * prefs[p2][movie] for movie in common_items])
    # Calculate r (Pearson score)
    num = pSum - sum1 * sum2 / n_common_items
    den = sqrt((sum1Sq - pow(sum1, 2) / n_common_items) * (sum2Sq - pow(sum2, 2) / n_common_items))
    if den == 0:
        return 0
    r = num / den
    if verbose:
        print("common dimensions: %s" % len(common_items))
        print("Similarity Score for {} and {}: {}".format(p1, p2, r))
    return r


#for k in critics.keys():
#    sim_pearson('Michael Phillips', k, verbose=True)

Try the examples you used for the eucledian distance again, but now using the pearson correlation:

In [1739]:
sim_pearson('Lisa Rose', 'Gene Seymour')

0.39605901719066977

### Ranking critics on similarity
The topMatches function below calculates all similarities of a given critic with his peers:

In [1740]:
def topMatches(person, n=5, similarity=sim_pearson, prefs=critics):
    '''
    Returns the best matches for person from the prefs dictionary. 
    Number of results and similarity function are optional params.
    '''
    if similarity not in [sim_distance, sim_pearson]:
        # NB: here we are comparing FUNCTION DEFINITION.
        # We do that only in a jupyter notebook for the sake of simplicity.
        raise ValueError("Callback functions should be: 'sim_pearson' or 'sim_distance'.")

    scores = [(similarity(person, other, prefs=prefs), other) for other in prefs
              if other != person]
    scores.sort()
    scores.reverse()
    return scores[0:n]

So you can now get the 3 critics closest to Toby by calling:

In [1741]:
topMatches('Toby', n=3)

[(0.9912407071619299, 'Lisa Rose'),
 (0.9244734516419049, 'Mick LaSalle'),
 (0.8934051474415647, 'Claudia Puig')]

*****************************************************
### Task: Effect of similarity function used
Call the topMatches function on a number of critics with both the default sim_pearson, but also with the sim_distance function. Would you have preference of one over the other? 
*****************************************************

<b style="color: red">Answer:</b>
the Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space while Pearson Correlation measures the similarity in shape between two profiles.

The pearson considers which movie is the most favorable to a user. for example if 3 movies A B C are respectively most favorable movies of user1, and A B C also in same order are respectively the most favorable movies of user2, These 2 users have pearson similarity of 1. while in the Euclidean similarity only simple linear distance is calculated.

So based on the our research and trying the topMatch function, we prefer the pearson similarity.

### **Exercise 2: Recommending Items**

One way to recommend movies to a person would be to rate the movies she has not seen yet by using the scores of the others weighted by the similarity.

In [1742]:
def getRecommendations(person, similarity=sim_pearson, prefs=critics):
    '''
    Gets recommendations for a person by using a weighted average
    of every other user's rankings
    '''
    if similarity not in [sim_distance, sim_pearson]:
        raise ValueError("Callback functions should be: 'sim_pearson' or 'sim_distance'.")

    totals = {}
    simSums = {}
    for other in prefs:
        # Don't compare me to myself
        if other == person:
            continue
        sim = similarity(person, other, prefs=prefs)
        # Ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in prefs[other]:
            # Only score movies I haven't seen yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # Similarity * Score
                totals.setdefault(item, 0)
                # The final score is calculated by multiplying each item by the
                #   similarity and adding these products together
                totals[item] += prefs[other][item] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim
    # Create the normalized list
    rankings = [(total / simSums[item], item) for (item, total) in
                totals.items()]
    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings

In [1743]:
getRecommendations('Toby', similarity=sim_distance)

[(3.4721701369256524, 'The Night Listener'),
 (2.7709066207646793, 'Lady in the Water'),
 (2.4349456273856207, 'Just My Luck')]

In [1744]:
getRecommendations('Toby')


[(3.3477895267131017, 'The Night Listener'),
 (2.8325499182641614, 'Lady in the Water'),
 (2.530980703765565, 'Just My Luck')]

Note that the output does not only consist of a movie title, but also a guess at what the user's rating for each movie would be.

*****************************************************
### Task: Explainable recommendations
Can you also find out how to give information on how the recommendation is built up. For example about the 'closest' person that also watched this movie?
*****************************************************

In [1745]:
# so based on what we understood of this task we need to write a function
# that not only returns score and movie, but also more information like closest people that also watched that movie

def getCompleteRecommendation(person, similarity=sim_pearson, prefs=critics):
    '''
    Gets recommendations for a person by using a weighted average
    of every other user's rankings
    '''
    if similarity not in [sim_distance, sim_pearson]:
        raise ValueError("Callback functions should be: 'sim_pearson' or 'sim_distance'.")

    totals = {}
    simSums = {}
    for other in prefs:
        # Don't compare me to myself
        if other == person:
            continue
        sim = similarity(person, other, prefs=prefs)
        # Ignore scores of zero or lower
        if sim <= 0:
            continue

        closest_person = ''
        for item in prefs[other]:
            # Only score movies I haven't seen yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # Similarity * Score
                totals.setdefault(item, 0)
                # The final score is calculated by multiplying each item by the
                #   similarity and adding these products together
                totals[item] += prefs[other][item] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim

    # Create the normalized list
    rankings = [(total / simSums[item], item, close_person(person, item, total / simSums[item])) for (item, total) in
                totals.items()]


    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings

def close_person(person, movie_name, person_score, prefs=critics):
    storage = {}
    for other in prefs:
        if other == person:
            continue
        for movie in prefs[other]:
            if movie == movie_name:
                diff = abs(person_score - prefs[other][movie]);
                # there could be multiple people with minimum difference in score
                if diff not in storage:
                    storage[diff] = []
                storage[diff].append(other)
    minimum_diff = min(storage.keys())
    return storage[minimum_diff]

getCompleteRecommendation('Toby')

# above line returns movies that Toby did not rate and also closest people related to each movie to toby. (those people have exact same closeness to toby in specified movie)

[(3.3477895267131017,
  'The Night Listener',
  ['Lisa Rose', 'Gene Seymour', 'Mick LaSalle', 'Jack Matthews']),
 (2.8325499182641614,
  'Lady in the Water',
  ['Gene Seymour', 'Mick LaSalle', 'Jack Matthews']),
 (2.530980703765565, 'Just My Luck', ['Lisa Rose', 'Claudia Puig'])]

### **Exercise 3: Transformations** 
**You have been building recommendations based on similar users in Exercise 2, but you could of course also build recommendations based on similar items. In this exercise you will do this.** 

The function is essentially the same, but you need to transfer your data, from:

<code>{'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5},
'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5}}</code>

to

<code>{'Lady in the Water': {'Lisa Rose': 2.5,'Gene Seymour': 3.0},
'Snakes on a Plane': {'Lisa Rose': 3.5,'Gene Seymour': 3.5}}</code>

This is what the transformPrefs function does. 

You can now create a dictionary for movies with their scores assigned by different people by invoking:

In [1746]:
def transformPrefs(prefs=critics):
    '''
    Transform the recommendations into a mapping where persons are described
    with interest scores for a given title e.g. {title: person} instead of
    {person: title}.
    '''
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            # Flip item and person
            result[item][person] = prefs[person][item]
    return result

In [1747]:
movies = transformPrefs()
print(movies)

{'Lady in the Water': {'Lisa Rose': 2.5, 'Gene Seymour': 3.0, 'Michael Phillips': 2.5, 'Mick LaSalle': 3.0, 'Jack Matthews': 3.0}, 'Snakes on a Plane': {'Lisa Rose': 3.5, 'Gene Seymour': 3.5, 'Michael Phillips': 3.0, 'Claudia Puig': 3.5, 'Mick LaSalle': 4.0, 'Jack Matthews': 4.0, 'Toby': 4.5}, 'Just My Luck': {'Lisa Rose': 3.0, 'Gene Seymour': 1.5, 'Claudia Puig': 3.0, 'Mick LaSalle': 2.0}, 'Superman Returns': {'Lisa Rose': 3.5, 'Gene Seymour': 5.0, 'Michael Phillips': 3.5, 'Claudia Puig': 4.0, 'Mick LaSalle': 3.0, 'Jack Matthews': 5.0, 'Toby': 4.0}, 'You, Me and Dupree': {'Lisa Rose': 2.5, 'Gene Seymour': 3.5, 'Claudia Puig': 2.5, 'Mick LaSalle': 2.0, 'Jack Matthews': 3.5, 'Toby': 1.0}, 'The Night Listener': {'Lisa Rose': 3.0, 'Gene Seymour': 3.0, 'Michael Phillips': 4.0, 'Claudia Puig': 4.5, 'Mick LaSalle': 3.0, 'Jack Matthews': 3.0}}


And find similar items for a particular movie like this:

In [1748]:
topMatches('Superman Returns', prefs=movies)

[(0.6579516949597695, 'You, Me and Dupree'),
 (0.4879500364742689, 'Lady in the Water'),
 (0.11180339887498941, 'Snakes on a Plane'),
 (-0.1798471947990544, 'The Night Listener'),
 (-0.42289003161103106, 'Just My Luck')]

Or find people who may like a particular movie:

In [1749]:
getRecommendations('Just My Luck', prefs=movies)

[(4.0, 'Michael Phillips'), (3.0, 'Jack Matthews')]

*****************************************************
#### Task: why does the example above work?
Try to follow exactly what is going on in the last call. Notice that Michael and Jack did not rate 'Just my Luck'. How is their rating for it built up?
*****************************************************

<b style="color: red">Answer:</b>
In the last task, We were recommending movies to people who did not watch them and tried to predict what would be their rating for those movies.
But in this task we transformed the dictionary and basically exchanged movies and people. So we are now recommending the audience for a specific movie and predict their ratings for that movie.

So the code works because we simply just exchanged people and movies. And lack of movie in dictionary of specific person in last assignment is the same as lack of person in dictionary of movies in this task. And both means that the user did not rate the movie.

for example: for micheal and movie called 'Just my Luck' we have following dictionaries:

<code>'Michael Phillips': {
        'Lady in the Water': 2.5,
        'Snakes on a Plane': 3.0,
        'Superman Returns': 3.5,
        'The Night Listener': 4.0,
    }</code>

and transformed one is:

<code>"Just My Luck":{
      "Lisa Rose":3.0,
      "Gene Seymour":1.5,
      "Claudia Puig":3.0,
      "Mick LaSalle":2.0
   }</code>

so in first case "Just My Luck" movie does not exist in list of movies that Michael watched. In the second case, Michael is not in the list of people that saw movie "Just My Luck".

So the result of this excersise ([(4.0, 'Michael Phillips'), (3.0, 'Jack Matthews')]) predicts that Michael will rate "Just My Luck" with score of 4.

### **Exercise 4: Sentence Similarity**

In [1750]:
# import natural language processing software we need later.
import nltk
from nltk.stem import WordNetLemmatizer


In [1751]:
# Download wordnet and punkt sentence tokenizer
nltk.download('wordnet')
nltk.download('punkt')


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/alirezasoltaninezhad/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/alirezasoltaninezhad/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Below we have some example sentences to compare later on.

In [1752]:
movies = ["I saw a really good movie last night.",
          "The movie is based on the director's life.",
          "The movie starts at ten.",
          "I took her to a movie.",
          "The movie stars Al Pacino.",
          "The movie opened last weekend.",
          "The movie lasted two hours.",
          "He directed several movies.",
          "We just shot another movie.",
          "The movie was set in New York."]

In [1753]:
def get_jaccard_sim(str1, str2):
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

In [1754]:
def compare(s1, s2):
    #Import a Lemmatizer to get the root form of certain words
    lemmatizer = WordNetLemmatizer()

    #Tokenize both sentences to get each word separately
    word_list1 = nltk.word_tokenize(s1)
    word_list2 = nltk.word_tokenize(s2)

    #     print("Tokenized sentence", word_list1) #Uncomment to see an example of the tokenized sentence

    #Lemmatize both sentences
    lemmatized_output1 = ' '.join([lemmatizer.lemmatize(w, 'v') for w in word_list1])
    lemmatized_output2 = ' '.join([lemmatizer.lemmatize(w, 'v') for w in word_list2])

    return lemmatized_output1, lemmatized_output2

In [1755]:
from nltk.stem import WordNetLemmatizer

for x in range(len(movies)):
    l1, l2 = compare(movies[0], movies[x])
    print("Sentence 1:", l1, '\n', "Sentence 2:", l2, '\n', "Similarity Score:", get_jaccard_sim(l1, l2))

Sentence 1: I saw a really good movie last night . 
 Sentence 2: I saw a really good movie last night . 
 Similarity Score: 1.0
Sentence 1: I saw a really good movie last night . 
 Sentence 2: The movie be base on the director 's life . 
 Similarity Score: 0.11764705882352941
Sentence 1: I saw a really good movie last night . 
 Sentence 2: The movie start at ten . 
 Similarity Score: 0.15384615384615385
Sentence 1: I saw a really good movie last night . 
 Sentence 2: I take her to a movie . 
 Similarity Score: 0.3333333333333333
Sentence 1: I saw a really good movie last night . 
 Sentence 2: The movie star Al Pacino . 
 Similarity Score: 0.15384615384615385
Sentence 1: I saw a really good movie last night . 
 Sentence 2: The movie open last weekend . 
 Similarity Score: 0.25
Sentence 1: I saw a really good movie last night . 
 Sentence 2: The movie last two hours . 
 Similarity Score: 0.25
Sentence 1: I saw a really good movie last night . 
 Sentence 2: He direct several movies . 
 Si

*****************************************************
#### Task: In what scenario's could the Jaccard Similarity be more useful than the Euclidean distance and the Pearson Similarity metrics? Why is that? 
*****************************************************

<b style="color: red">Answer:</b>
The Jaccard similarity function compares two binary vectors (sets). Jaccard similarity is a widely used metric for determining text similarity. Calculating the Jaccard similarity is more computationally expensive because it compares all of the terms in one document to others. The Jaccard similarity proves useful in detecting duplicates. Euclidean distance is less commonly used in natural language processing  (NLP)  than Jaccard similarity. It is appropriate for continuous numerical variables. Because the Euclidean distance is not scale invariant, the data should be scaled before computing the distance. The redundant information impact in the dataset is also multiplied by Euclidean distance. We would weigh the redundancy effect n times if I had n highly correlated variables and used all variables as input. WordNetLemmatizer is used to normalize speech parts in sentences, reducing the effects of redundant vocabulary. For example, be can be used instead of am/is/are to find more relevant sentences. The distance between two data vectors that have no attribute values in common may be smaller than the distance between another pair of data vectors that have the same attribute values. Although WordNetLemmatizer reduce precision, it increases recall.

### **Exercise 5: Building a Reddit Recommender**

After having created your Reddit account, go to User Settings -> Safety & Privacy -> Manage third-party app authorization.
Here, you will create your own app. Give it a name, and add "https://www.reddit.com/prefs/apps/" to the redirect uri. Keep the other settings as they are.

* replace the '???' in the user_agent string with your name (or any unique string).
* replace the '???' in the client_id with the id right underneath your web app name.
* replace the '???' in the client_secret with the key next to 'secret'.

NOTE: install praw v. 3.5 


In [1756]:
# -*- coding: utf-8 -*-

import praw
import time

#Delete keys before handing in the notebook
r = praw.Reddit(user_agent='behnam', client_id='FUF106C4ntH8mTYHzyA17Q',
                client_secret='s0Bcm8lVyKA8nkHsm45bfVV_LEiv6Q',
                redirect_url='https://www.reddit.com/prefs/apps/'
                             'authorize_callback')


def initializeUserDict(subreddit, count=10):
    user_dict = {}
    # get the top count' popular posts
    for post in r.subreddit(subreddit).top(limit=count):
        # find all users who commented in this
        flat_comments = post.comments.list()
        for comment in flat_comments:
            try:
                user = comment.author.name
                user_dict[user] = {}
            except AttributeError:
                pass
    return user_dict


def fillItems(user_dict, count=100):
    all_items = {}
    # Find links posted by all users
    for user in user_dict:
        # print("finding subreddits where user " + user + "has commented")
        # find new comments for given user
        comments = r.redditor(user).comments.new(limit=count)
        for c in comments:
            # Get the subreddit where the comment was made
            subreddit = c.subreddit
            sub_name = subreddit.display_name
            # print(sub_name)
            if sub_name in user_dict[user]:
                user_dict[user][sub_name] += 1.0
            else:
                user_dict[user][sub_name] = 1.0

            all_items[sub_name] = 1
    #     Fill in missing items with 0
    #     for subr_counts in user_dict.values():
    #         for item in all_items:
    #             if item not in subr_counts:
    #                 subr_counts[item]=0.0

    return user_dict


You can get a list of popular recent posts about programming from the programming subreddit (https://www.reddit.com/r/VUAmsterdam) by invoking the code below.  Don't forget to replace the '???' in the user_agent string with your name (or any unique string).

In [1757]:
print("praw version == " + praw.__version__)

# subreddit = r.subreddit("programming")
for post in r.subreddit('programming').top(limit=15):
    print(end='\n * ')
    print(post.title)

praw version == 7.5.0

 * There’s a reason that programmers always want to throw away old code and start over: they think the old code is a mess. They are probably wrong. The reason that they think the old code is a mess is because of a cardinal, fundamental law of programming: It’s harder to read code than to write it.

 * YouTube page load is 5x slower in Firefox and Edge than in Chrome because YouTube's Polymer redesign relies on the deprecated Shadow DOM v0 API only implemented in Chrome.

 * How I Got Paid $0 From the Uber Security Bug Bounty

 * Google wins trial against Oracle as jury finds Android is “fair use”

 * TIL there's a community called "dwitter" where people compose 140 character JavaScript programs that produce interesting visuals

 * Eye tracking software for sufferers of ALS/MND can cost tens of thousands of dollars, so I've spent 3.5 years of my spare time writing a free & open-source alternative - meet OptiKey (C#, Rx, WPF) (x-post from r/Software)

 * Cool websi

See here a list of other subreddits you can explore with this code: https://www.reddit.com/reddits/

To automatically create a data set of reddit users similar to the movie watchers you can invoke the initializeUserDict function in redditrec.py 

In [1758]:
red_users = initializeUserDict('university', count=50)  # or for any other subreddit
print(red_users)

{'Curlyzza': {}, 'theshavedyeti': {}, 'xekatinaz': {}, 'Leprol': {}, 'rubot78': {}, 'airbornemint': {}, 'Sagittarius712': {}, 'hellothere96_2': {}, 'ImperialDiet': {}, 'HappySebs': {}, 'DamnYouRichardParker': {}, 'lexi2706': {}, 'anonymoususer762': {}, 'Waaarrrggghhh': {}, 'LISK2AC': {}, 'divingforducks': {}, 'Meshuggah1166': {}, 'sovietanada': {}, 'reverse_mango': {}, 'harolddawizard': {}, 'Matt14451': {}, 'SirHurDurr': {}, 'scoffburn': {}, '1Killag123': {}, 'bombarclart': {}, 'EFDcoach': {}, 'dizzyhazza': {}, 'BiteMeMaybe': {}, 'ultralillers': {}, 'rheetkd': {}, 'qarxyna': {}, 'Filsafat': {}, 'RSCyka': {}, 'BadDadBot': {}, 'thelifesponge': {}, 'sweetcandyfloss': {}, 'dr_LauraM': {}, 'juanjuansilvasilva': {}, '_Grotesque_': {}, 'serennow': {}, 'worldofbooks444': {}, 'laken_alexandra': {}, 'lordofthepages': {}, 'Oblivious_Chicken': {}, 'auzierrichard': {}, 'aanamt': {}, 'sir_ramen': {}, 'SighFTL': {}, 'nadamama': {}, 'triplezeroAR': {}, 'crimpmaveric': {}, 'Illustrious-Brother': {}, 'S

Now initializeUserDict has only created the user keys. We of course also want to know what subreddits they posted comments on. You can pull those in through:

In [1759]:
fillItems(red_users, count=15)
# here you can see how often each user commented in what sub.

{'Curlyzza': {'thenetherlands': 1.0, 'University': 3.0},
 'theshavedyeti': {'CellToSingularity': 1.0,
  'Showerthoughts': 1.0,
  'AbruptChaos': 1.0,
  'nextfuckinglevel': 1.0,
  'TIHI': 2.0,
  'mildlyinteresting': 1.0,
  'Wellthatsucks': 2.0,
  'technicallythetruth': 2.0,
  'ConvenientCop': 2.0,
  'IdiotsInCars': 2.0},
 'xekatinaz': {'MovieSuggestions': 2.0,
  'cinematography': 1.0,
  'TrueFilm': 5.0,
  'AskProfessors': 2.0,
  'scifi': 2.0,
  'assholedesign': 1.0,
  'minimalism': 2.0},
 'Leprol': {'cwru': 1.0,
  'University': 1.0,
  'ApplyingToCollege': 2.0,
  'college': 1.0},
 'rubot78': {'funny': 2.0,
  'blessedimages': 1.0,
  'pics': 1.0,
  'Damnthatsinteresting': 1.0,
  'TheDepthsBelow': 1.0,
  'nextfuckinglevel': 2.0,
  'ScarlettJohansson': 1.0,
  'BeAmazed': 1.0,
  'WTF': 2.0,
  'interestingasfuck': 2.0,
  'aww': 1.0},
 'airbornemint': {'biology': 8.0,
  'prephysicianassistant': 6.0,
  'NewToEMS': 1.0},
 'Sagittarius712': {'africanparents': 1.0,
  'MechanicalEngineering': 1.0,
  

This script may take a few minutes to collect all the data. Use this time to review what is going on in the code. Notice that users don't give ratings to subreddits, instead we are counting how many comments they posted in each subreddit. 

To recommend a similar user, we can use our topMatches function again.

First choose a random user for whom you're going to find neighbours

In [1760]:
import random

user = random.choice(list(red_users.keys()))
print(user)  # print the username
topMatches(user, prefs=red_users)  # from all redditors, get the most similar to user

comatose_papaya


[(0, 'zola129'),
 (0, 'younglemon4'),
 (0, 'ylata'),
 (0, 'xlez'),
 (0, 'xekatinaz')]

If no similar user was found, you can try increasing the count of users or comments for each initializeUserDict and fillItems.

*****************************************************
#### Task: Recommend subreddits for a user based on what subreddits similar users have commented in. Recommend posts for a user based on posts they have commented on. 
*****************************************************

In [1765]:
print(red_users)


print ('Distances:')
print (sim_distance('houseplantfortheroad', 'HauntingArmadillo519',prefs=red_users))
print (sim_distance('houseplantfortheroad', 'amguiltyguiltyguilty',prefs=red_users))
print (sim_distance('HauntingArmadillo519', 'amguiltyguiltyguilty', prefs=red_users))

print ('Pearsons:')
print (sim_pearson('houseplantfortheroad', 'HauntingArmadillo519', prefs=red_users))
print (sim_pearson('houseplantfortheroad', 'amguiltyguiltyguilty', prefs=red_users))
print (sim_pearson('HauntingArmadillo519', 'amguiltyguiltyguilty', prefs=red_users))



print ('Recommendations:')
print (getRecommendations('houseplantfortheroad',prefs=red_users)[0:10])
print (getRecommendations('HauntingArmadillo519',prefs=red_users)[0:10])
print (getRecommendations('amguiltyguiltyguilty',prefs=red_users)[0:10])

{'Curlyzza': {'thenetherlands': 1.0, 'University': 3.0}, 'theshavedyeti': {'CellToSingularity': 1.0, 'Showerthoughts': 1.0, 'AbruptChaos': 1.0, 'nextfuckinglevel': 1.0, 'TIHI': 2.0, 'mildlyinteresting': 1.0, 'Wellthatsucks': 2.0, 'technicallythetruth': 2.0, 'ConvenientCop': 2.0, 'IdiotsInCars': 2.0}, 'xekatinaz': {'MovieSuggestions': 2.0, 'cinematography': 1.0, 'TrueFilm': 5.0, 'AskProfessors': 2.0, 'scifi': 2.0, 'assholedesign': 1.0, 'minimalism': 2.0}, 'Leprol': {'cwru': 1.0, 'University': 1.0, 'ApplyingToCollege': 2.0, 'college': 1.0}, 'rubot78': {'funny': 2.0, 'blessedimages': 1.0, 'pics': 1.0, 'Damnthatsinteresting': 1.0, 'TheDepthsBelow': 1.0, 'nextfuckinglevel': 2.0, 'ScarlettJohansson': 1.0, 'BeAmazed': 1.0, 'WTF': 2.0, 'interestingasfuck': 2.0, 'aww': 1.0}, 'airbornemint': {'biology': 8.0, 'prephysicianassistant': 6.0, 'NewToEMS': 1.0}, 'Sagittarius712': {'africanparents': 1.0, 'MechanicalEngineering': 1.0, 'banned': 5.0, 'PersonalFinanceCanada': 6.0, 'blackladies': 2.0}, 'hel