<h1>User recommendation system based on sentiment analysis and matrix factorization</h1>

This notebook contains the code for the user recommendation system. It crawls tweets from Twitter, performs a sentiment analysis on those tweets, uses matrix factorization to estimate missing ratings, and computes cosine similarity to measure user similarity.
    
The notebook was tested on Windows in an Anaconda environment.

In [17]:
# install the necessary libraries tweepy and textblob
# the other libraries, such as pandas or numpy, are already pre-installed through anaconda
!pip install tweepy
!pip install textblob




In [18]:
# import all needed libraries
import tweepy as tw
import re
from textblob import TextBlob
import numpy
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

In [19]:
# keys needed to authenticate this application against a Twitter developer account
# real credentials should never be hard-coded in a notebook; they are read from environment variables instead
# (the variable names below are placeholders chosen for this notebook)
import os
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

# authenticate tweepy to the twitter developer account
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)


In [20]:
# initialize global variables
# matrix stores the user-rating matrix, which will be created through sentiment analysis
matrix = []
# a list of all users whose ratings are stored in the user-rating matrix
# the index of a user in this list is the index of that user's row in the matrix
users = []


<h2>Crawling data from Twitter</h2>

The following cell crawls tweets from Twitter. The search strings are initialized and passed to tweepy, which collects the tweets that match each search string. Only tweets created since 2021-08-15 are considered.


In [21]:
# hashtag search strings for each topic
# retweets are being filtered out
climatechange = "#climate+change -filter:retweets"
movies = "#movies -filter:retweets"
football = "#football -filter:retweets"
coronavirus = "#coronavirus -filter:retweets"
cars = "#cars -filter:retweets"
date_since = "2021-08-15"
# the number of tweets, which will be collected per topic
number_of_tweets = 100

# collect tweets per topic using tweepy with the search strings
# language of the tweets must be english
climateTweets = tw.Cursor(api.search,
              q=climatechange,
              lang="en",
              since=date_since).items(number_of_tweets)

movieTweets = tw.Cursor(api.search,
              q=movies,
              lang="en",
              since=date_since).items(number_of_tweets)
    
footballTweets = tw.Cursor(api.search,
              q=football,
              lang="en",
              since=date_since).items(number_of_tweets)

coronaviursTweets = tw.Cursor(api.search,
              q=coronavirus,
              lang="en",
              since=date_since).items(number_of_tweets)

carTweets = tw.Cursor(api.search,
              q=cars,
              lang="en",
              since=date_since).items(number_of_tweets)

In [22]:
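# helper method, used before the sentiment analysis
# removes all URLs from a text string, strips every character that is not a letter, digit, space or tab,
# and collapses repeated whitespace
def remove_url(txt):
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())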
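
A short usage sketch of the helper, repeated here so the snippet runs on its own. The sample text is made up for illustration; note that hashtags, punctuation and emoticons are stripped along with the URL.

```python
import re

def remove_url(txt):
    # same pattern as the helper above: drop URLs and every character
    # that is not a letter, digit, space or tab, then collapse whitespace
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())

# made-up tweet text, for illustration only
sample = "Great match today!! #football https://example.com/xyz :)"
cleaned = remove_url(sample)
print(cleaned)
```

Because the hash sign is removed as well, a hashtag such as `#football` stays in the text as the plain word `football` and still contributes to the sentiment value.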

<h2>Sentiment Analysis</h2>

The following cells perform a sentiment analysis on the collected tweets. The Python library TextBlob computes the sentiment value. The results are written into a user-rating matrix, which stores each user as a row and the sentiment values for the topics as columns. If a user did not tweet about a topic, the value for that topic is 0, which marks a missing rating. One method is implemented per topic; the methods share the same structure and differ only in the column into which the sentiment value is stored.
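
Since the per-topic methods differ only in the target column, the bookkeeping can also be sketched as one function that takes the topic as a parameter. This is a minimal sketch, not the notebook's implementation: the names `TOPIC_COLUMNS`, `rescale_polarity` and `store_rating` are hypothetical, and the polarity values are passed in directly instead of being computed by TextBlob.

```python
# hypothetical column layout, following the topic order used in this notebook
TOPIC_COLUMNS = {"climatechange": 0, "movies": 1, "football": 2, "coronavirus": 3, "cars": 4}

def rescale_polarity(polarity):
    # map a polarity in [-1.0, 1.0] linearly to a rating in [1.0, 5.0]
    return (polarity + 1) * 2 + 1

def store_rating(matrix, users, user, topic, polarity):
    # write the rating for one tweet into the row of its user
    column = TOPIC_COLUMNS[topic]
    rating = rescale_polarity(polarity)
    if user in users:
        # the user already has a row: overwrite the value for this topic
        matrix[users.index(user)][column] = rating
    else:
        # first entry for this user: all other topics stay 0 (missing)
        row = [0] * len(TOPIC_COLUMNS)
        row[column] = rating
        matrix.append(row)
        users.append(user)

# made-up users and polarity values, for illustration only
matrix, users = [], []
store_rating(matrix, users, "alice", "climatechange", 0.5)
store_rating(matrix, users, "bob", "movies", -1.0)
store_rating(matrix, users, "alice", "cars", 0.0)
print(users)
print(matrix)
```

As in the per-topic methods, a later tweet by the same user on the same topic overwrites the earlier rating; averaging the values would be an alternative design.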

In [23]:
# methods for performing the sentiment analysis
# one method for each topic
def analyzeClimatechangeTweets(tweets):
    # using the global variables
    global matrix
    global users
    # iterate over all tweets of this topic
    for tweet in tweets:
        # save the user of the tweet
        user = tweet.user.screen_name
        # remove all URLs from the tweet
        text = remove_url(tweet.text)
        # create a TextBlob object, which performs the sentiment analysis
        sentiment_object = TextBlob(text)
        # read the sentiment polarity value from the TextBlob object
        # the value is a float within the range [-1.0, 1.0]
        sentiment_value_temp = sentiment_object.sentiment.polarity
        # rescale the sentiment value to a range of [1.0, 5.0]
        sentiment_value = (((sentiment_value_temp - (-1)) * 4) / 2) + 1

        # check whether an entry for the same user already exists in the user-rating matrix
        # each user gets exactly one row, not one new row per tweet
        if user in users:
            # get the index of the user in the user-rating matrix
            i = users.index(user)
            # store the sentiment value in the first column (climate change tweets are being stored in the first column) 
            # of the row
            matrix[i][0] = sentiment_value
        # otherwise this is the first entry of the user    
        else:
            # append a new row for the user in the user rating matrix and save the sentiment value in the first column
            # the other columns are zero, because this is the first entry for this user and we do not have any tweets
            # regarding the other topics
            matrix.append([sentiment_value,0,0,0,0])
            # to remember that we now have an entry for that user, save that user in the user list
            users.append(user)

def analyzeMovieTweets(tweets):
    # using the global variables
    global matrix
    global users
    # iterate over all tweets of this topic
    for tweet in tweets:
        # save the user of the tweet
        user = tweet.user.screen_name
        # remove all URLs from the tweet
        text = remove_url(tweet.text)
        # create a TextBlob object, which performs the sentiment analysis
        sentiment_object = TextBlob(text)
        # read the sentiment polarity value from the TextBlob object
        # the value is a float within the range [-1.0, 1.0]
        sentiment_value_temp = sentiment_object.sentiment.polarity
        # rescale the sentiment value to a range of [1.0, 5.0]
        sentiment_value = (((sentiment_value_temp - (-1)) * 4) / 2) + 1

        # check whether an entry for the same user already exists in the user-rating matrix
        # each user gets exactly one row, not one new row per tweet
        if user in users:
            # get the index of the user in the user-rating matrix
            i = users.index(user)
            # store the sentiment value in the second column (movie tweets are being stored in the second column) 
            # of the row
            matrix[i][1] = sentiment_value
        # otherwise this is the first entry of the user      
        else:
            # append a new row for the user in the user rating matrix and save the sentiment value in the second column
            # the other columns are zero, because this is the first entry for this user and we do not have any tweets
            # regarding the other topics
            matrix.append([0,sentiment_value,0,0,0])
            # to remember that we now have an entry for that user, save that user in the user list
            users.append(user)

def analyzeFootballTweets(tweets):
    # using the global variables
    global matrix
    global users
    # iterate over all tweets of this topic
    for tweet in tweets:
        # save the user of the tweet
        user = tweet.user.screen_name
        # remove all URLs from the tweet
        text = remove_url(tweet.text)
        # create a TextBlob object, which performs the sentiment analysis
        sentiment_object = TextBlob(text)
        # read the sentiment polarity value from the TextBlob object
        # the value is a float within the range [-1.0, 1.0]
        sentiment_value_temp = sentiment_object.sentiment.polarity
        # rescale the sentiment value to a range of [1.0, 5.0]
        sentiment_value = (((sentiment_value_temp - (-1)) * 4) / 2) + 1

        # check whether an entry for the same user already exists in the user-rating matrix
        # each user gets exactly one row, not one new row per tweet
        if user in users:
            # get the index of the user in the user-rating matrix
            i = users.index(user)
            # store the sentiment value in the third column (football tweets are being stored in the third column) 
            # of the row
            matrix[i][2] = sentiment_value
        # otherwise this is the first entry of the user
        else:
            # append a new row for the user in the user rating matrix and save the sentiment value in the third column
            # the other columns are zero, because this is the first entry for this user and we do not have any tweets
            # regarding the other topics
            matrix.append([0,0,sentiment_value,0,0])
            # to remember that we now have an entry for that user, save that user in the user list
            users.append(user)

def analyzeCoronavirusTweets(tweets):
    # using the global variables
    global matrix
    global users
    # iterate over all tweets of this topic
    for tweet in tweets:
        # save the user of the tweet
        user = tweet.user.screen_name
        # remove all URLs from the tweet
        text = remove_url(tweet.text)
        # create a TextBlob object, which performs the sentiment analysis
        sentiment_object = TextBlob(text)
        # read the sentiment polarity value from the TextBlob object
        # the value is a float within the range [-1.0, 1.0]
        sentiment_value_temp = sentiment_object.sentiment.polarity
        # rescale the sentiment value to a range of [1.0, 5.0]
        sentiment_value = (((sentiment_value_temp - (-1)) * 4) / 2) + 1

        # check whether an entry for the same user already exists in the user-rating matrix
        # each user gets exactly one row, not one new row per tweet
        if user in users:
            # get the index of the user in the user-rating matrix
            i = users.index(user)
            # store the sentiment value in the fourth column (coronavirus tweets are being stored in the fourth column) 
            # of the row
            matrix[i][3] = sentiment_value
        # otherwise this is the first entry of the user
        else:
            # append a new row for the user in the user rating matrix and save the sentiment value in the fourth column
            # the other columns are zero, because this is the first entry for this user and we do not have any tweets
            # regarding the other topics
            matrix.append([0,0,0,sentiment_value,0])
            # to remember that we now have an entry for that user, save that user in the user list
            users.append(user)
            
def analyzeCarTweets(tweets):
    # using the global variables
    global matrix
    global users
    # iterate over all tweets of this topic
    for tweet in tweets:
        # save the user of the tweet
        user = tweet.user.screen_name
        # remove all URLs from the tweet
        text = remove_url(tweet.text)
        # create a TextBlob object, which performs the sentiment analysis
        sentiment_object = TextBlob(text)
        # read the sentiment polarity value from the TextBlob object
        # the value is a float within the range [-1.0, 1.0]
        sentiment_value_temp = sentiment_object.sentiment.polarity
        # rescale the sentiment value to a range of [1.0, 5.0]
        sentiment_value = (((sentiment_value_temp - (-1)) * 4) / 2) + 1

        # check whether an entry for the same user already exists in the user-rating matrix
        # each user gets exactly one row, not one new row per tweet
        if user in users:
            # get the index of the user in the user-rating matrix
            i = users.index(user)
            # store the sentiment value in the fifth column (car tweets are being stored in the fifth column) 
            # of the row
            matrix[i][4] = sentiment_value
        # otherwise this is the first entry of the user
        else:
            # append a new row for the user in the user rating matrix and save the sentiment value in the fifth column
            # the other columns are zero, because this is the first entry for this user and we do not have any tweets
            # regarding the other topics
            matrix.append([0,0,0,0,sentiment_value])
            # to remember that we now have an entry for that user, save that user in the user list
            users.append(user)

In [24]:
# perform the sentiment analysis for each climate change tweet
# the results are being saved in the user-rating matrix
print("Starting sentiment analysis for climate change tweets!")
analyzeClimatechangeTweets(climateTweets)
print("Sentiment analysis done!")

Starting sentiment analysis for climate change tweets!
Sentiment analysis done!


In [25]:
# perform the sentiment analysis for each movie tweet
# the results are being saved in the user-rating matrix
print("Starting sentiment analysis for movie tweets!")
analyzeMovieTweets(movieTweets)
print("Sentiment analysis done!")

Starting sentiment analysis for movie tweets!
Sentiment analysis done!


In [26]:
# perform the sentiment analysis for each football tweet
# the results are being saved in the user-rating matrix
print("Starting sentiment analysis for football tweets!")
analyzeFootballTweets(footballTweets)
print("Sentiment analysis done!")

Starting sentiment analysis for football tweets!
Sentiment analysis done!


In [27]:
# perform the sentiment analysis for each coronavirus tweet
# the results are being saved in the user-rating matrix
print("Starting sentiment analysis for coronavirus tweets!")
analyzeCoronavirusTweets(coronaviursTweets)
print("Sentiment analysis done!")

Starting sentiment analysis for coronavirus tweets!
Sentiment analysis done!


In [28]:
# perform the sentiment analysis for each car tweet
# the results are being saved in the user-rating matrix
print("Starting sentiment analysis for car tweets!")
analyzeCarTweets(carTweets)
print("Sentiment analysis done!")

Starting sentiment analysis for car tweets!
Sentiment analysis done!


<h2>Matrix Factorization</h2>

The following cells implement a matrix factorization, which predicts user ratings on topics. The user-feature matrix P and the topic-feature matrix Q are initialized randomly, with 10 latent features. Gradient descent, with a maximum of 500 iterations, a learning rate of 0.0002 and a regularization parameter of 0.02, reduces the squared difference between the known ratings and the ratings reconstructed from P and Q, and stops early if the regularized error falls below 0.001. At the end, the user-rating matrix with the estimated ratings is constructed as the product of P and the transpose of Q.
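
Written out, the quantities that gradient descent works with can be summarized as follows. The symbols are notation chosen here, not taken from the code: r_ij is a known rating (a matrix entry greater than 0), p_ik and q_kj are entries of the user-feature and topic-feature matrices, alpha is the learning rate, beta the regularization parameter and K the number of latent features. The two updates are shown as simultaneous; an implementation may also apply them one after the other.

```latex
\begin{aligned}
e_{ij} &= r_{ij} - \sum_{k=1}^{K} p_{ik}\, q_{kj} \\
p_{ik} &\leftarrow p_{ik} + \alpha \left( 2\, e_{ij}\, q_{kj} - \beta\, p_{ik} \right) \\
q_{kj} &\leftarrow q_{kj} + \alpha \left( 2\, e_{ij}\, p_{ik} - \beta\, q_{kj} \right) \\
E &= \sum_{(i,j)\,:\, r_{ij} > 0} \left[ e_{ij}^{2} + \frac{\beta}{2} \sum_{k=1}^{K} \left( p_{ik}^{2} + q_{kj}^{2} \right) \right]
\end{aligned}
```

Here alpha = 0.0002, beta = 0.02 and K = 10, and the iteration stops early once E drops below 0.001.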

In [29]:
# the method for performing the matrix factorization
# input is the user-rating matrix, which has been created by the sentiment analysis
def mft(R):
    # N denotes the number of users
    N = len(R)  
    # M denotes the number of topics
    M = len(R[0])
    # K denotes the number of latent features
    K = 10
    
    # P is the user feature matrix
    # Q is the topic feature matrix
    # both matrices are being initialized randomly
    P = numpy.random.rand(N,K)
    Q = numpy.random.rand(M,K)
    
    Q = Q.T
    # alpha denotes the learning rate used by gradient descent
    alpha = 0.0002
    # beta denotes the regularization parameter used by gradient descent
    beta = 0.02
    # iterations denotes the number of iterations of gradient descent
    iterations = 500

    # perform up to 500 iterations of gradient descent
    # P and Q are updated in every iteration until the error falls below 0.001 or the iteration limit is reached
    for iteration in range(iterations):
        # iterate over the user-rating matrix
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    # calculate the error
                    eij = R[i][j] - numpy.dot(P[i,:],Q[:,j])

                    for k in range(K):
                        # calculate gradient with alpha and beta parameter
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])

        # initialize the error
        e = 0
    
        # calculate the regularized squared error over all known ratings
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - numpy.dot(P[i,:],Q[:,j]), 2)
                    for k in range(K):
                        e = e + (beta/2) * (pow(P[i][k],2) + pow(Q[k][j],2))
                        
        # stop gradient descent if the error converges to its local minimum (< 0.001) 
        if e < 0.001:
            break
            
    Q = Q.T
    # compute the final user-rating matrix and return this matrix
    nR = numpy.dot(P, Q.T)
    return nR
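
To make the arithmetic of a single update concrete, the following sketch applies one gradient step to one known rating using plain Python lists. The function name `sgd_step` and the feature vectors are made up for illustration. Unlike the loop in `mft`, which updates `P[i][k]` first and then uses the new value when updating `Q[k][j]`, this sketch computes both updates from the values before the step.

```python
def sgd_step(r, p, q, alpha=0.0002, beta=0.02):
    """Apply one gradient step for a single known rating r.

    p and q are the latent feature vectors of one user and one topic.
    Both updates use the values from before the step.
    """
    # prediction error for this rating
    e = r - sum(pk * qk for pk, qk in zip(p, q))
    # move both vectors along the negative gradient of the regularized error
    new_p = [pk + alpha * (2 * e * qk - beta * pk) for pk, qk in zip(p, q)]
    new_q = [qk + alpha * (2 * e * pk - beta * qk) for pk, qk in zip(p, q)]
    return new_p, new_q, e

# made-up feature vectors with two latent features
p = [0.5, 0.5]
q = [1.0, 1.0]
rating = 4.0
new_p, new_q, error_before = sgd_step(rating, p, q)
error_after = rating - sum(pk * qk for pk, qk in zip(new_p, new_q))
print(error_before, error_after)
```

With a learning rate this small, a single step reduces the error only slightly, which is why many iterations over all known ratings are needed.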

In [30]:
np_array = numpy.array(matrix)
# call the matrix factorization method
rating_matrix = mft(np_array)
# To check the performance of matrix factorization, print some original and estimated ratings:
print("Original rating from User 1 of topic 1: " + str(matrix[0][0]))
print("Estimated rating from User 1 of topic 1: "+ str(rating_matrix[0][0]))
print("Original rating from User 121 of topic 2: " + str(matrix[120][1]))
print("Estimated rating from User 121 of topic 2: "+ str(rating_matrix[120][1]))
# print a snippet of the result matrix
print("Snippet of result matrix:")
print(rating_matrix)

Original rating from User 1 of topic 1: 4.0
Estimated rating from User 1 of topic 1: 3.4703481495884634
Original rating from User 121 of topic 2: 4.6
Estimated rating from User 121 of topic 2: 4.4296923807216775
Snippet of result matrix:
[[3.47034815 3.83017351 4.05417249 3.74525239 3.86464074]
 [3.0025354  3.29682384 3.67092534 3.01741231 3.59319616]
 [2.87460188 3.77041979 3.33502218 3.28419803 3.25597761]
 ...
 [2.80830211 3.02412361 3.35346654 2.74489136 3.20137839]
 [3.22831835 3.36947189 3.58654634 3.26339422 3.33855978]
 [3.36917483 3.0253206  3.33371442 3.21677487 3.82202904]]


<h2>Measure Cosine Similarity</h2>

The user similarity is computed with the cosine similarity measure from the library sklearn. After that, the 10 most similar users (those with the highest similarity values) are collected for each user and written to that user's recommendation list.
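
The idea can be sketched without any third-party library: the cosine similarity of two rating rows is their dot product divided by the product of their lengths, and the recommendations are the k highest-scoring other users. This is an illustrative sketch only; the helper names `cosine` and `top_k_similar`, the ratings and the user names are made up, and all rating rows are assumed to be non-zero.

```python
import heapq
import math

def cosine(u, v):
    # dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(ratings, names, user_index, k):
    # score every other user against the chosen user
    scores = [
        (cosine(ratings[user_index], row), names[j])
        for j, row in enumerate(ratings)
        if j != user_index
    ]
    # keep the names of the k users with the highest similarity
    return [name for _, name in heapq.nlargest(k, scores)]

# made-up ratings for four hypothetical users and three topics
ratings = [
    [5.0, 1.0, 1.0],
    [4.0, 1.0, 1.0],
    [1.0, 5.0, 1.0],
    [1.0, 2.0, 5.0],
]
names = ["u0", "u1", "u2", "u3"]
result = top_k_similar(ratings, names, 0, 2)
print(result)
```

Excluding the user's own row up front avoids having to filter out the self-similarity of 1 afterwards.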

In [31]:
# transform the rating matrix into a pandas dataframe
df = pd.DataFrame(rating_matrix)
# measure the cosine similarity using sklearn
cos_sim = cosine_similarity(df,df)

In [32]:
numOfRows = cos_sim.shape[0]
numOfRow = 0
# the top 10 most similar users should be recommended for each user
k = 10
# the recommendation list stores the top k most similar users for each user
recommendationlist = []

# iterate over all rows of the cosine similarity result
while numOfRow < numOfRows:
    # the current row
    row = cos_sim[numOfRow,:]
    
    # create a copy of the current row
    # the copy is needed because the similarity values found so far are removed from it,
    # so that the next iteration finds the next most similar users
    # the original row stays unchanged, so its indices still map to the users list
    tempRow = row.copy()
    
    # topkRow saves the top k most similar users only for the current user
    topkRow = []
    # iterate until the top k most similar users have been found or no candidates are left
    while len(topkRow) < k and tempRow.size > 0:
        # find the highest remaining similarity value in the copy of the row
        max_sim = numpy.amax(tempRow)
        # get the indices of all users with that similarity value in the original row
        # (several users can share the same value)
        for index in numpy.where(row == max_sim)[0]:
            # skip the current user, who is always most similar to himself (cosine similarity == 1)
            if index != numOfRow and len(topkRow) < k:
                # append the name of the similar user to the top-k list of the current user
                topkRow.append(users[index])
        
        # remove the processed similarity value from the copy of the row
        tempRow = tempRow[tempRow != max_sim]
    
    # after the top k most similar users for a user have been found, append those to the recommendation list, which stores the 
    # top k most similar users for each user
    recommendationlist.append(topkRow)
    numOfRow += 1

# print recommendation lists of the first three users (printing the recommendations lists for all users would be a big print)
print("Recommendation list for user " + users[0] + ":")
print(recommendationlist[0])
print("Recommendation list for user " + users[1] + ":")
print(recommendationlist[1])
print("Recommendation list for user " + users[2] + ":")
print(recommendationlist[2])

Recommendation list for user PeterGleick:
['DERVIEMOO', 'CindyBearsDen', 'Infamous_Raptor', 'ETObligations', 'Mango_News', 'weweia', 'Super_Said', '_covid_19_bot_', 'TheReviewWire', 'BradHopwood']
Recommendation list for user pirone_art:
['KODIMANBUILDS05', 'Fandoro4U', 'RevolutionsCen', 'JLawsFunhouse', 'MascotMotors', 'MGDezigns', 'MrsimpleJesse', 'zerostriker_', 'screenshotmag', 'antonio_m_reed']
Recommendation list for user ChristopherNFox:
['pm_girl', 'vthomas14', 'RCDavieHonda', 'Workshedanimate', 'PartsTown', 'scotgov', 'CovidFactoid', 'UpulTR', 'Tokyo_Tom', 'phtosbyakhughes']
