# Recommendation Engines Lab 2

- **Author:** Annie Pi
- **Last Updated:** Feb. 28, 2018
- **Assignment:** Provide a recommendation engine for a Quora-like question-and-answer website. For the first part of the assignment, the engine must be based on users' binary feedback and the question topics. For the second part, a set of hybrid methods must be used to fine-tune the recommendations, and additional user actions can be used.

## 0. Data setup

I begin by loading in all the available tables of Quora information:
- Topics: This table shows what topics are covered by each question. A question can have 1 or more relevant topics, ranging from Sports to Superheroes.
- User Feedback: This table shows user actions for a question with 1 indicating that the user wanted an answer and -1 indicating that the user downvoted the question. 
- User Answers: This table shows the number of upvotes or downvotes for a user answer for a particular question. 

In [1]:
# Load libraries
import pandas as pd
import numpy as np
import math

In [12]:
# Load data
topics = pd.read_csv("Quora_topics.csv")
user_feedback = pd.read_csv("Quora_userfeedback.csv")
user_answers = pd.read_csv("Quora_useranswers.csv")

In [13]:
# Examine topics
topics.head()

Unnamed: 0,Question,Sports,Books,Leadership,Philosophy,Society,Fiction,Security,Love,VideoGames,Superheroes
0,question1,1,0,1,0,1,1,0,0,0,1
1,question2,0,1,1,1,0,0,0,1,0,0
2,question3,0,0,0,1,1,1,0,0,0,0
3,question4,0,0,1,1,0,0,1,1,0,0
4,question5,0,1,0,0,0,0,0,0,1,1


In [14]:
# Examine user_feedback
user_feedback.head()

Unnamed: 0,Question,User 1,User 2,User 3,User 4
0,question1,1.0,-1.0,,
1,question2,-1.0,1.0,,
2,question3,,,,
3,question4,,1.0,,
4,question5,,,1.0,


In [15]:
# Examine user_answers
user_answers.head()

Unnamed: 0,Question,User 1,User 2,User 3,User 4
0,question1,15.0,,,
1,question2,,,40.0,
2,question3,,,,
3,question4,,,,
4,question5,,2.0,,


Two of the tables, user_feedback and user_answers, were loaded with missing or NaN values. As missing values will throw off some of my functions later in the notebook, I fill these in with 0. 
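
As a small illustration of why the fill matters, the sketch below uses made-up values (not the lab data) to show how a single NaN propagates through a sum of products, and how replacing it with 0 turns missing feedback into a neutral contribution. The variable names are illustrative only.

```python
import math

# Hypothetical topic indicators and feedback for three questions (not the lab data)
topic_flags = [1, 0, 1]
feedback = [1.0, float("nan"), 1.0]

# A single NaN makes the whole sum of products NaN
raw_total = sum(x * y for x, y in zip(topic_flags, feedback))

# Replacing NaN with 0 treats missing feedback as "no opinion"
filled = [0.0 if math.isnan(v) else v for v in feedback]
filled_total = sum(x * y for x, y in zip(topic_flags, filled))

print(raw_total, filled_total)
```

One consequence of this choice is that "no feedback" and "neutral" become indistinguishable, which is acceptable here because the feedback is only ever 1 or -1.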

In [3]:
# Replace missing values with 0 
user_feedback.fillna(0, inplace=True)
user_answers.fillna(0, inplace=True)

Next, I store column headers as lists, so that I can access these later for both loops and column headers. 

In [6]:
# Get list of topics
topiclist = topics.columns[1:].tolist()

# Get list of users
userlist = user_feedback.columns[1:].tolist()

# Get list of question numbers
questionlist = topics["Question"].tolist()

# Create list of prediction numbers
predictionlist = ["Pred1", "Pred2", "Pred3", "Pred4"]

Finally, I define the functions that I will use multiple times throughout this lab. The only function that will change is how the user profile is calculated, but the rest, such as calculating predictions or selecting the top 5 predictions, will remain the same. 

In [7]:
# Define a function to calculate the sumproduct of two lists
def sumproduct(list1, list2):
    return sum([x*y for x,y in zip(list1,list2)])

In [8]:
# Define cosine function to return predictions
def cosine(list1, list2):
    return(sumproduct(list1, list2)/(math.sqrt(sumproduct(list2,list2))*math.sqrt(sumproduct(list1,list1))))
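
The `cosine` function above divides by the product of the two vector lengths, so an all-zero vector (for example, a user with no feedback) makes the denominator zero. The following is a minimal sketch of a guarded variant; the name `cosine_safe` is hypothetical and not part of the notebook. The example profile happens to be the User 1 row from the simple unary profile in section 1.

```python
import math

def cosine_safe(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# question1 topic flags, taken from the topics table shown earlier
question1 = [1, 0, 1, 0, 1, 1, 0, 0, 0, 1]
# Example profile over the same ten topics
profile = [3, -2, -1, 0, 0, 2, -1, -1, 1, 0]

score = cosine_safe(profile, question1)    # 4 / (sqrt(21) * sqrt(5))
empty = cosine_safe([0] * 10, question1)   # zero-length profile
print(round(score, 5), empty)
```

Returning 0.0 for an empty profile mirrors the effect of filling missing predictions with 0 after the fact.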

In [16]:
# Define function to populate predictions dataframe based on cosine function for each user & question
def calc_predictions(user, pred):
    col = 0
    row = 0
    
    #loop through all users
    for index1, row1 in user.iterrows():
        list1 = list(row1)
        #loop through all questions (rows of the topics table)
        for index2, row2 in topics.iterrows():
            list2 = list(row2)[1:]
            #populate predictions using cosine function
            pred.iloc[row, col] = cosine(list1, list2)
            row += 1
        col += 1
        row = 0
    
    #fill missing predictions with 0
    pred.fillna(0, inplace=True)
    
    return(pred)

In [17]:
# Define function to calculate totals for likes, dislikes, and neutral based on prediction scores
def calc_likes(pred):
    for i in predictionlist:
        print(i)
        print("Likes: " + str(len(pred[pred[i] > 0])))
        print("Dislikes: " + str(len(pred[pred[i] < 0])))
        print("Neutral: " + str(len(pred[pred[i] == 0])))
        print("\n")

In [18]:
# Define function to calculate top five questions predicted for each user based on predictions scores
def calc_top(pred):
    count = 0 

    for i in predictionlist:
        print(i)
        print(pred.sort_values(by=[i],ascending=False).head(5).iloc[:,count])
        print("\n")
        count += 1
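
As an aside, pandas offers `Series.nlargest` for picking the top entries of a single column. The sketch below uses hypothetical scores (not the lab output) to show the idea; it is an alternative to sorting the whole frame, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical prediction scores for one user (not the lab output)
scores = pd.Series(
    [0.39, -0.44, 0.25, -0.33, 0.76, 0.46, -0.15],
    index=[f"question{n}" for n in range(1, 8)],
    name="Pred1",
)

# nlargest returns the k highest values in descending order
top3 = scores.nlargest(3)
print(top3)
```

Note that `nlargest` expects a numeric column, so a prediction frame that was created empty (object dtype) would likely need `.astype(float)` first.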

## 1. CB Filtering - Simple Unary

The first approach aggregates item vectors with a simple unary scheme: for each topic, I sum the user's feedback (1 if the user wanted an answer, -1 if the user downvoted the question) over all questions that cover that topic. It does not take into account the number of topics in a question or the frequency of certain topics.

In [19]:
# Create an empty dataframe with list of topics and users
userprofile1 = pd.DataFrame(index=userlist, columns=topiclist)

In [20]:
# Populate user profile dataframe based on sumproduct of topics and user feedback
for i in userlist:
    row = userlist.index(i)
    for j in topiclist:
        col = topiclist.index(j)
        userprofile1.iloc[row, col] = (sumproduct(topics[j], user_feedback[i]))
        
# Display user profile
userprofile1

Unnamed: 0,Sports,Books,Leadership,Philosophy,Society,Fiction,Security,Love,VideoGames,Superheroes
User 1,3,-2,-1,0,0,2,-1,-1,1,0
User 2,-2,2,2,3,-1,-2,0,3,0,-1
User 3,-2,1,1,0,0,-3,-1,-2,0,1
User 4,0,0,0,0,0,0,0,0,0,0
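
The nested loop above computes, for every user and topic, the sum over questions of feedback times topic flag. That is a matrix product, so it can also be sketched in one step. The toy frames below are hypothetical (not the lab data) and only illustrate the shape of the computation.

```python
import pandas as pd

# Hypothetical toy data: 3 questions, 2 topics, 2 users (not the lab data)
toy_topics = pd.DataFrame(
    {"Sports": [1, 0, 1], "Books": [0, 1, 1]},
    index=["q1", "q2", "q3"],
)
toy_feedback = pd.DataFrame(
    {"User 1": [1, -1, 0], "User 2": [0, 1, 1]},
    index=["q1", "q2", "q3"],
)

# Rows become users, columns become topics:
# profile[u, t] = sum over questions of feedback[q, u] * topics[q, t]
toy_profile = toy_feedback.T.dot(toy_topics)
print(toy_profile)
```

In this toy case User 1 ends up with Sports 1 and Books -1, and User 2 with Sports 1 and Books 2.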


In [21]:
# Create an empty dataframe for predictions
predictions1 = pd.DataFrame(index=questionlist, columns=predictionlist)

In [22]:
# Calculate predictions using simple unary method
calc_predictions(userprofile1, predictions1)



Unnamed: 0,Pred1,Pred2,Pred3,Pred4
question1,0.39036,-0.298142,-0.29277,0
question2,-0.436436,0.833333,0.0,0
question3,0.251976,0.0,-0.377964,0
question4,-0.327327,0.666667,-0.218218,0
question5,-0.125988,0.096225,0.251976,0
question6,0.46291,0.117851,-0.308607,0
question7,-0.154303,0.235702,-0.154303,0
question8,-0.218218,0.333333,0.109109,0
question9,0.46291,-0.235702,-0.46291,0
question10,-0.377964,0.096225,0.0,0


In [23]:
# For simple unary, calculate totals for likes, dislikes, and neutrals for each user
calc_likes(predictions1)

Pred1
Likes: 7
Dislikes: 11
Neutral: 2


Pred2
Likes: 15
Dislikes: 4
Neutral: 1


Pred3
Likes: 5
Dislikes: 10
Neutral: 5


Pred4
Likes: 0
Dislikes: 0
Neutral: 20




In [24]:
# For simple unary, calculate top 5 questions for each prediction/user
calc_top(predictions1)

Pred1
question16    0.755929
question12    0.503953
question9     0.462910
question6     0.462910
question1     0.390360
Name: Pred1, dtype: float64


Pred2
question17    0.833333
question2     0.833333
question4     0.666667
question13    0.583333
question14    0.583333
Name: Pred2, dtype: float64


Pred3
question5     0.251976
question14    0.218218
question19    0.195180
question11    0.125988
question8     0.109109
Name: Pred3, dtype: float64


Pred4
question1     0
question2     0
question19    0
question18    0
question17    0
Name: Pred4, dtype: int64




## 2. CB Filtering - Unit Weight

The next approach involves aggregating item vectors through unit weights. Unlike the last approach, it normalizes preferences based on the number of topics in a question, so I start by calculating the total number of topics per question and then tweak the formula for calculating user profiles by dividing each question topic by the total number of topics per question. 

In [25]:
# Calculate sum for # of topics per question (topic columns only, excluding the Question label)
numtopics = topics[topiclist].sum(axis=1).tolist()
numtopics

[5, 4, 3, 4, 3, 2, 2, 4, 2, 3, 3, 3, 4, 4, 4, 3, 4, 2, 5, 4]

In [26]:
# Create an empty dataframe with list of topics and users
userprofile2 = pd.DataFrame(index=userlist, columns=topiclist)

In [27]:
# Populate user profile dataframe based on sumproduct of topics/numtopics and user feedback
for i in userlist:
    row = userlist.index(i)
    for j in topiclist:
        col = topiclist.index(j)
        userprofile2.iloc[row, col] = sumproduct((topics[j]/numtopics), user_feedback[i])
        
# Display user profile
userprofile2

Unnamed: 0,Sports,Books,Leadership,Philosophy,Society,Fiction,Security,Love,VideoGames,Superheroes
User 1,1.03333,-0.45,-0.25,0.25,0.0,0.533333,-0.2,-0.25,0.333333,0.0
User 2,-0.533333,0.5,0.55,0.75,-0.2,-0.533333,-0.0833333,0.75,0.0,-0.2
User 3,-0.666667,0.333333,0.25,0.0,0.0,-0.916667,-0.333333,-0.75,0.0,0.0833333
User 4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
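
To make the normalization step concrete, here is a minimal sketch on hypothetical toy data (not the lab data). It divides each question's topic flags by that question's topic count, as in the loop above, and then sums the weighted feedback per topic.

```python
import pandas as pd

# Hypothetical toy data: 3 questions, 3 topics, one user (not the lab data)
toy_topics = pd.DataFrame(
    {"Sports": [1, 0, 1], "Books": [1, 1, 0], "Love": [1, 0, 0]},
    index=["q1", "q2", "q3"],
)
toy_feedback = pd.Series([1, -1, 1], index=["q1", "q2", "q3"], name="User 1")

# Divide each question row by its number of topics (q1: 3, q2: 1, q3: 1)
topics_per_question = toy_topics.sum(axis=1)
unit_weights = toy_topics.div(topics_per_question, axis=0)

# Profile value per topic = sum over questions of weight * feedback
toy_profile = unit_weights.mul(toy_feedback, axis=0).sum(axis=0)
print(toy_profile)
```

Here q1 covers three topics, so its positive feedback contributes only 1/3 to each of them, while q3 covers a single topic and contributes its full weight to Sports.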


In [28]:
# Create an empty dataframe for predictions
predictions2 = pd.DataFrame(index=questionlist, columns=predictionlist)

In [29]:
# Populate predictions dataframe based on cosine function for each user & question
calc_predictions(userprofile2, predictions2)



Unnamed: 0,Pred1,Pred2,Pred3,Pred4
question1,0.427934,-0.268373,-0.382235,0
question2,-0.254363,0.834683,-0.05698,0
question3,0.328679,0.006299,-0.361873,0
question4,-0.163519,0.643743,-0.284901,0
question5,-0.048952,0.113389,0.164488,0
question6,0.659494,0.100297,-0.322329,0
question7,-0.128473,0.254601,-0.322329,0
question8,-0.072675,0.332782,0.0,0
question9,0.445373,-0.246885,-0.443203,0
question10,-0.272734,0.081892,0.0,0


In [30]:
# For unit weight, calculate totals for likes, dislikes, and neutrals for each user
calc_likes(predictions2)

Pred1
Likes: 10
Dislikes: 10
Neutral: 0


Pred2
Likes: 16
Dislikes: 4
Neutral: 0


Pred3
Likes: 4
Dislikes: 13
Neutral: 3


Pred4
Likes: 0
Dislikes: 0
Neutral: 20




In [31]:
# For unit weight, calculate top 5 questions for each prediction/user
calc_top(predictions2)

Pred1
question16    0.797222
question6     0.659494
question12    0.573441
question9     0.445373
question1     0.427934
Name: Pred1, dtype: float64


Pred2
question17    0.834683
question2     0.834683
question4     0.643743
question13    0.605555
question14    0.589188
Name: Pred2, dtype: float64


Pred3
question14    0.199431
question5     0.164488
question19    0.101929
question11    0.098693
question8     0.000000
Name: Pred3, dtype: float64


Pred4
question1     0
question2     0
question19    0
question18    0
question17    0
Name: Pred4, dtype: int64




## 3. CB Filtering - IDF

The last approach aggregates item vectors through TF-IDF (term frequency-inverse document frequency), which both normalizes based on the number of topics in a question and adjusts based on the frequency of different topics. For this approach, I calculate DF (the number of questions in which each topic appears) and IDF (a measure of how rare each topic is across all questions, here the base-10 logarithm of the number of questions divided by DF). In the formula, I multiply the sumproduct by the IDF so that rare topics carry more weight.

In [32]:
# Calculate DF for each topic (topic columns only)
DF = topics[topiclist].sum(axis=0).tolist()
DF

[4, 6, 10, 11, 6, 6, 7, 6, 7, 5]

In [33]:
# Calculate IDF for each topic
IDF = []

for i in range(0, len(DF)):
    IDF.append(math.log(len(topics)/DF[i],10))
    
IDF

[0.6989700043360187,
 0.5228787452803376,
 0.30102999566398114,
 0.2596373105057561,
 0.5228787452803376,
 0.5228787452803376,
 0.4559319556497243,
 0.5228787452803376,
 0.4559319556497243,
 0.6020599913279623]
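
The same IDF values can be reproduced in a single vectorized step. The sketch below assumes the DF list printed above and the 20 questions in the topics table; the names `df_counts` and `n_questions` are illustrative.

```python
import numpy as np

# Document frequencies as printed above; 20 questions in total
df_counts = np.array([4, 6, 10, 11, 6, 6, 7, 6, 7, 5])
n_questions = 20

# IDF with a base-10 logarithm, matching math.log(x, 10) in the loop above
idf = np.log10(n_questions / df_counts)
print(idf.round(4))
```

The rarest topic (Sports, in 4 of 20 questions) gets the largest weight, and the most common one (Philosophy, in 11 of 20) gets the smallest.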

In [34]:
# Create an empty dataframe with list of topics and users
userprofile3 = pd.DataFrame(index=userlist, columns=topiclist)

In [35]:
# Populate user profile dataframe based on sumproduct of topics/numtopics and user feedback * IDF
counter = 0
for i in userlist:
    row = userlist.index(i)
    for j in topiclist:
        col = topiclist.index(j)
        userprofile3.iloc[row, col] = sumproduct((topics[j]/numtopics), user_feedback[i]) * IDF[counter]
        counter += 1
    counter = 0
    
# Check values for userprofile3
userprofile3

Unnamed: 0,Sports,Books,Leadership,Philosophy,Society,Fiction,Security,Love,VideoGames,Superheroes
User 1,0.722269,-0.235295,-0.0752575,0.0649093,0.0,0.278869,-0.0911864,-0.13072,0.151977,0.0
User 2,-0.372784,0.261439,0.165566,0.194728,-0.104576,-0.278869,-0.0379943,0.392159,0.0,-0.120412
User 3,-0.46598,0.174293,0.0752575,0.0,0.0,-0.479306,-0.151977,-0.392159,0.0,0.0501717
User 4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [36]:
# Create an empty dataframe for predictions
predictions3 = pd.DataFrame(index=questionlist, columns=predictionlist)

In [37]:
# Populate predictions dataframe based on cosine function for each user & question
calc_predictions(userprofile3, predictions3)



Unnamed: 0,Pred1,Pred2,Pred3,Pred4
question1,0.490309,-0.436363,-0.450526,0
question2,-0.222832,0.695633,-0.087616,0
question3,0.235027,-0.149509,-0.340032,0
question4,-0.13751,0.490191,-0.28807,0
question5,-0.056961,0.111728,0.159241,0
question6,0.659111,-0.172767,-0.404874,0
question7,-0.109453,0.263674,-0.297141,0
question8,-0.060115,0.138516,-0.016311,0
question9,0.360751,-0.270584,-0.416452,0
question10,-0.223202,0.094173,0.015831,0


In [38]:
# For IDF, calculate totals for likes, dislikes, and neutrals for each user
calc_likes(predictions3)

Pred1
Likes: 10
Dislikes: 10
Neutral: 0


Pred2
Likes: 14
Dislikes: 6
Neutral: 0


Pred3
Likes: 5
Dislikes: 14
Neutral: 1


Pred4
Likes: 0
Dislikes: 0
Neutral: 20




In [39]:
# For IDF, calculate top 5 questions for each prediction/user
calc_top(predictions3)

Pred1
question16    0.788337
question6     0.659111
question12    0.622096
question1     0.490309
question9     0.360751
Name: Pred1, dtype: float64


Pred2
question17    0.695633
question2     0.695633
question4     0.490191
question13    0.444510
question14    0.426572
Name: Pred2, dtype: float64


Pred3
question5     0.159241
question14    0.153319
question19    0.081188
question11    0.053390
question10    0.015831
Name: Pred3, dtype: float64


Pred4
question1     0
question2     0
question19    0
question18    0
question17    0
Name: Pred4, dtype: int64




## 4. Hybrid - Switching

For a hybrid switching model, I use the same IDF approach as in #3, but if there are no predictions for the user (because it's a new user), I switch to a non-personalized approach that takes the average of everyone's user profiles to estimate what is generally popular amongst all users.

In [40]:
# Create an empty dataframe with list of topics and users
userprofile4 = pd.DataFrame(index=userlist, columns=topiclist)

In [41]:
# Populate user profile dataframe based on sumproduct of topics/numtopics and user feedback * IDF
counter = 0
for i in userlist:
    row = userlist.index(i)
    for j in topiclist:
        col = topiclist.index(j)
        userprofile4.iloc[row, col] = sumproduct((topics[j]/numtopics), user_feedback[i]) * IDF[counter]
        counter += 1
    counter = 0
    
# Check values for userprofile4
userprofile4

Unnamed: 0,Sports,Books,Leadership,Philosophy,Society,Fiction,Security,Love,VideoGames,Superheroes
User 1,0.722269,-0.235295,-0.0752575,0.0649093,0.0,0.278869,-0.0911864,-0.13072,0.151977,0.0
User 2,-0.372784,0.261439,0.165566,0.194728,-0.104576,-0.278869,-0.0379943,0.392159,0.0,-0.120412
User 3,-0.46598,0.174293,0.0752575,0.0,0.0,-0.479306,-0.151977,-0.392159,0.0,0.0501717
User 4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [42]:
# Calculate average of all user profiles, excluding User 4
profilemeans = userprofile4[0:3].mean().tolist()
profilemeans

[-0.03883166690755661,
 0.06681228411915424,
 0.05518883253839654,
 0.08654577016858538,
 -0.03485858301868918,
 -0.15976850550232535,
 -0.09371934643911,
 -0.04357322877336147,
 0.050659106183302695,
 -0.023413444107198537]

In [43]:
# Create an empty dataframe for predictions
predictions4 = pd.DataFrame(index=questionlist, columns=predictionlist)

In [44]:
# Define function to populate predictions dataframe based on cosine function for each user & question
def calc_predictions2(user, pred):
    col = 0
    row = 0

    for index1, row1 in user.iterrows():
        list1 = list(row1)
        for index2, row2 in topics.iterrows():
            list2 = list(row2)[1:]
            # Add an if statement to check if prediction exists
            if np.isnan(pred.iloc[row, col]):
                # Populate with cosine predictions using mean user profile and topics covered
                pred.iloc[row, col] = cosine(profilemeans, list2)
            else:
                # Populate with cosine predictions using user profile and topics covered
                pred.iloc[row, col] = cosine(list1, list2)
            row += 1
        col += 1
        row = 0
    
    return(pred)
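
One thing to note about the function above: the predictions frame passed in is freshly created and therefore entirely NaN, so the `np.isnan` check holds for every cell and every user receives the mean-profile score. An alternative, sketched below under my own assumptions, is to switch on whether the user's profile is all zeros. The helper names `cosine_np` and `switching_scores` and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np

def cosine_np(a, b):
    """Cosine similarity for 1-D arrays; 0.0 when either norm is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def switching_scores(profile, mean_profile, question_vectors):
    """Use the user's own profile unless it is all zeros (no feedback yet)."""
    profile = np.asarray(profile, dtype=float)
    if np.any(profile != 0):
        source = profile
    else:
        source = np.asarray(mean_profile, dtype=float)
    return [cosine_np(source, np.asarray(q, dtype=float)) for q in question_vectors]

# Hypothetical toy data: two topics, two questions
questions = [[1, 0], [0, 1]]
mean_profile = [0.2, -0.1]
known_user = [0.5, 0.5]
new_user = [0.0, 0.0]

known_scores = switching_scores(known_user, mean_profile, questions)
new_scores = switching_scores(new_user, mean_profile, questions)
print(known_scores)
print(new_scores)
```

With this variant, only the user without feedback falls back to the non-personalized scores, while users with feedback keep their own profile-based predictions.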

In [45]:
# Using hybrid switching, calculate predictions for each user & question
calc_predictions2(userprofile4, predictions4)

Unnamed: 0,Pred1,Pred2,Pred3,Pred4
question1,-0.377637,-0.377637,-0.377637,-0.377637
question2,0.345362,0.345362,0.345362,0.345362
question3,-0.261264,-0.261264,-0.261264,-0.261264
question4,0.00929911,0.00929911,0.00929911,0.00929911
question5,0.227366,0.227366,0.227366,0.227366
question6,0.141261,0.141261,0.141261,0.141261
question7,-0.198319,-0.198319,-0.198319,-0.198319
question8,0.0515024,0.0515024,0.0515024,0.0515024
question9,-0.323026,-0.323026,-0.323026,-0.323026
question10,-0.149306,-0.149306,-0.149306,-0.149306


In [46]:
# For hybrid switching, calculate totals for likes, dislikes, and neutrals for each user
calc_likes(predictions4)

Pred1
Likes: 11
Dislikes: 9
Neutral: 0


Pred2
Likes: 11
Dislikes: 9
Neutral: 0


Pred3
Likes: 11
Dislikes: 9
Neutral: 0


Pred4
Likes: 11
Dislikes: 9
Neutral: 0




In [47]:
# For hybrid switching, calculate top 5 questions for each prediction/user
calc_top(predictions4)

Pred1
question14    0.542631
question18    0.406204
question2     0.345362
question17    0.345362
question5     0.227366
Name: Pred1, dtype: object


Pred2
question14    0.542631
question18    0.406204
question2     0.345362
question17    0.345362
question5     0.227366
Name: Pred2, dtype: object


Pred3
question14    0.542631
question18    0.406204
question2     0.345362
question17    0.345362
question5     0.227366
Name: Pred3, dtype: object


Pred4
question14    0.542631
question18    0.406204
question2     0.345362
question17    0.345362
question5     0.227366
Name: Pred4, dtype: object




## 5. Hybrid - Challenge

To define my own hybrid solution, I want to account for the fact that Quora is not just about ratings but also about its community of users. I therefore explore user similarity and user answers, neither of which has been incorporated in any of the recommendation systems so far.

In [48]:
# Create an empty dataframe with list of topics and users
userprofile5 = pd.DataFrame(index=userlist, columns=topiclist)

In [49]:
# Populate user profile dataframe based on sumproduct of topics/numtopics and user feedback * IDF
counter = 0
for i in userlist:
    row = userlist.index(i)
    for j in topiclist:
        col = topiclist.index(j)
        userprofile5.iloc[row, col] = sumproduct((topics[j]/numtopics), user_feedback[i]) * IDF[counter]
        counter += 1
    counter = 0
    
# Check values for userprofile5
userprofile5

Unnamed: 0,Sports,Books,Leadership,Philosophy,Society,Fiction,Security,Love,VideoGames,Superheroes
User 1,0.722269,-0.235295,-0.0752575,0.0649093,0.0,0.278869,-0.0911864,-0.13072,0.151977,0.0
User 2,-0.372784,0.261439,0.165566,0.194728,-0.104576,-0.278869,-0.0379943,0.392159,0.0,-0.120412
User 3,-0.46598,0.174293,0.0752575,0.0,0.0,-0.479306,-0.151977,-0.392159,0.0,0.0501717
User 4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


I start by trying to categorize similar users. If we look at the user feedback table, there are very few questions that have the same feedback from two users. Moreover, the feedback is binary, so it does not provide us a lot of information on user similarity. 

In [50]:
# Look at user feedback
user_feedback

Unnamed: 0,Question,User 1,User 2,User 3,User 4
0,question1,1.0,-1.0,0.0,0.0
1,question2,-1.0,1.0,0.0,0.0
2,question3,0.0,0.0,0.0,0.0
3,question4,0.0,1.0,0.0,0.0
4,question5,0.0,0.0,1.0,0.0
5,question6,1.0,0.0,0.0,0.0
6,question7,0.0,0.0,-1.0,0.0
7,question8,0.0,0.0,1.0,0.0
8,question9,0.0,0.0,0.0,0.0
9,question10,0.0,0.0,0.0,0.0


Therefore, it seems to make more sense to look at the correlations between user profiles and how interested they are in certain topics, based on all of their available question feedback.

In [51]:
# Check correlation coefficients between user profiles

# Correlation between User 1 and 2
print(np.corrcoef(userprofile5.iloc[0,].tolist(), userprofile5.iloc[1,].tolist())[0,1])

# Correlation between User 1 and 3
print(np.corrcoef(userprofile5.iloc[0,].tolist(), userprofile5.iloc[2,].tolist())[0,1])

# Correlation between User 2 and 3
print(np.corrcoef(userprofile5.iloc[1,].tolist(), userprofile5.iloc[2,].tolist())[0,1])

-0.779051147152
-0.628542549622
0.424715999996
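
The three pairwise calls above can also be collapsed into one, since `np.corrcoef` accepts a 2-D array and treats each row as a variable. The sketch below copies the profiles of Users 1 to 3 from the userprofile5 table as displayed (rounded), so results may differ from the printed values in the last digits.

```python
import numpy as np

# Profiles for Users 1-3, copied from the userprofile5 table above (rounded as displayed)
profiles = np.array([
    [0.722269, -0.235295, -0.0752575, 0.0649093, 0.0,
     0.278869, -0.0911864, -0.13072, 0.151977, 0.0],
    [-0.372784, 0.261439, 0.165566, 0.194728, -0.104576,
     -0.278869, -0.0379943, 0.392159, 0.0, -0.120412],
    [-0.46598, 0.174293, 0.0752575, 0.0, 0.0,
     -0.479306, -0.151977, -0.392159, 0.0, 0.0501717],
])

# Each row is one user; the result is the full 3x3 correlation matrix
corr = np.corrcoef(profiles)
print(corr.round(3))
```

Entry `corr[i, j]` is the correlation between the profiles of user i+1 and user j+1, so the off-diagonal entries correspond to the three printed values.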


The results show that the correlation coefficients are either strongly negative (User 1 against Users 2 and 3) or positive but weak (about 0.42 between Users 2 and 3). Based on these results, there do not appear to be very similar users in this data set, and thus it doesn't make sense to try to recommend questions to a user based on what a similar user likes. With more users or more data on user preferences, aside from the simple binary feedback collected, it might be possible to use this method, but for now I discard this approach.

Next, I examine the user answers table.

In [52]:
user_answers

Unnamed: 0,Question,User 1,User 2,User 3,User 4
0,question1,15.0,0.0,0.0,0.0
1,question2,0.0,0.0,40.0,0.0
2,question3,0.0,0.0,0.0,0.0
3,question4,0.0,0.0,0.0,0.0
4,question5,0.0,2.0,0.0,0.0
5,question6,25.0,0.0,0.0,0.0
6,question7,0.0,0.0,0.0,0.0
7,question8,0.0,-4.0,0.0,0.0
8,question9,0.0,0.0,0.0,0.0
9,question10,0.0,0.0,0.0,0.0


Looking at the answers table, I see that User 3 has a lot of upvotes (ranging from 20 all the way to 110), showing that he is a popular user on Quora and may even be an authority on certain topics. In contrast, User 1 has a medium number of upvotes (ranging from 15 to 26), and User 2 seems to be very unreliable, with mostly downvotes for his answers and a low number of interactions from the rest of the community (ranging from 2 to 4).

Thus I conclude it may be useful to try to calculate a "trustworthiness" score for each user based on the number of upvotes and downvotes.

In [53]:
# Create an empty dataframe with list of topics and users for user trust
usertrust = pd.DataFrame(index=userlist, columns=topiclist)

I populate my user trust dataframe using a formula similar to the unit weight approach seen earlier. However, this time, instead of calculating a product of question topics and user feedback, I use the question topics and user answers. I want a weighted score because I don't want questions with lots of topics to inflate someone's trustworthiness, but I do not multiply by IDF, as topic rarity/relevance is not as important in determining trustworthiness.

In [54]:
# Populate user trust dataframe based on sumproduct of topics/numtopics and user answers
for i in userlist:
    row = userlist.index(i)
    for j in topiclist:
        col = topiclist.index(j)
        usertrust.iloc[row, col] = sumproduct((topics[j]/numtopics), user_answers[i])
    
# Check values for usertrust
usertrust

Unnamed: 0,Sports,Books,Leadership,Philosophy,Society,Fiction,Security,Love,VideoGames,Superheroes
User 1,24.1667,0.0,3.0,12.5,3.0,11.6667,0.0,0.0,8.66667,3.0
User 2,0.0,-0.733333,-2.4,-2.0,-1.15,0.0,-0.65,-1.75,1.41667,-0.733333
User 3,0.0,52.5,70.0,48.0,27.0,0.0,34.5,35.5,12.5,22.0
User 4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now I have my absolute trust values for each user and topic, but I need to consider the community of users - that is, how does each user's trust in a topic rank or compare to others? 

Therefore, I transform my dataframe to calculate each trust value as a proportion of all trust values in that topic.

In [55]:
# Calculate proportion of user's trust value to the sum of all trust values for that topic 
row = 0
col = 0

for i in userlist:
    for j in topiclist:
        usertrust.iloc[row, col] = usertrust.iloc[row, col] / usertrust.sum(axis=0)[col]
        col += 1
    row += 1
    col = 0

# Check new values for usertrust
usertrust

Unnamed: 0,Sports,Books,Leadership,Philosophy,Society,Fiction,Security,Love,VideoGames,Superheroes
User 1,1,0.0,0.0424929,0.213675,0.103986,1,0.0,0.0,0.383764,0.123626
User 2,0,-0.0141661,-0.0354807,-0.0432772,-0.0443092,0,-0.0192024,-0.0518519,0.0990646,-0.0342835
User 3,0,1.00027,0.9999,0.996463,0.997795,0,1.00056,1.00146,0.96281,0.995955
User 4,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0
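
Because the loop above overwrites cells in place, the column total used for later rows already includes the normalized values of earlier rows, which appears to be why some of User 3's values come out slightly above 1. A minimal sketch of an alternative is to compute the column totals once and then divide. The values below are copied (rounded) from three columns of the earlier usertrust table; the names `trust`, `totals`, and `proportions` are illustrative.

```python
import pandas as pd

# Absolute trust values for three topics, copied (rounded) from the earlier usertrust table
trust = pd.DataFrame(
    {"Sports": [24.1667, 0.0, 0.0, 0.0],
     "Books": [0.0, -0.733333, 52.5, 0.0],
     "Leadership": [3.0, -2.4, 70.0, 0.0]},
    index=["User 1", "User 2", "User 3", "User 4"],
)

# Compute each column total once, then divide every column by its own total
totals = trust.sum(axis=0)
proportions = trust.div(totals, axis=1)
print(proportions.round(4))
```

With fixed totals, each topic column sums to 1. A topic whose total is zero would still need separate handling, which this sketch does not cover.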


Now I have my user trust table, but I don't want to completely disregard the user profiles. There are some topics, such as Sports, where only User 1 has a relative trust score because the other users did not answer these questions. I don't want these to zero out a user's profile and remove the effect of user feedback.

So I decide to multiply each user profile value by a constant raised to the power of the corresponding trust value. If the user trust is 0, the constant to the power of 0 is 1, so the user profile value remains the same. The constant needs to be above 1, because 1 raised to any power is still 1 and the trust value would then have no effect. Therefore, I make my constant 2.

In [56]:
#  Calculate final profile and updated profile means
finalprofile = userprofile5 * 2**usertrust
profilemeans = finalprofile[0:3].mean().tolist()
finalprofile

Unnamed: 0,Sports,Books,Leadership,Philosophy,Society,Fiction,Security,Love,VideoGames,Superheroes
User 1,1.44454,-0.235295,-0.0775071,0.0752714,0.0,0.557737,-0.0911864,-0.13072,0.198291,0.0
User 2,-0.372784,0.258885,0.161544,0.188973,-0.101413,-0.278869,-0.037492,0.378315,0.0,-0.117584
User 3,-0.46598,0.348651,0.150505,0.0,0.0,-0.479306,-0.304072,-0.785114,0.0,0.100062
User 4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
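
As a quick check of the scaling rule, the sketch below applies it by hand to three of User 1's values, copied (rounded) from the profile and trust tables above. The variable names are illustrative.

```python
# Values for User 1 copied (rounded) from the tables above
profile_sports, trust_sports = 0.722269, 1.0
profile_leadership, trust_leadership = -0.0752575, 0.0424929
profile_books, trust_books = -0.235295, 0.0

base = 2  # the constant chosen in the text

# Trust of 1 doubles the profile value; trust of 0 leaves it unchanged
scaled_sports = profile_sports * base ** trust_sports
scaled_leadership = profile_leadership * base ** trust_leadership
scaled_books = profile_books * base ** trust_books

print(scaled_sports, scaled_leadership, scaled_books)
```

The scaling only stretches a value away from zero and never flips its sign, so a strong trust score amplifies both likes and dislikes for that topic.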


In [57]:
# Create an empty dataframe for predictions
predictions5 = pd.DataFrame(index=questionlist, columns=predictionlist)

In [58]:
# Calculate predictions using hybrid switching function and new profilemeans
calc_predictions2(finalprofile, predictions5)

Unnamed: 0,Pred1,Pred2,Pred3,Pred4
question1,0.212973,0.212973,0.212973,0.212973
question2,0.152438,0.152438,0.152438,0.152438
question3,-0.0198473,-0.0198473,-0.0198473,-0.0198473
question4,-0.215503,-0.215503,-0.215503,-0.215503
question5,0.29187,0.29187,0.29187,0.29187
question6,0.562381,0.562381,0.562381,0.562381
question7,-0.358779,-0.358779,-0.358779,-0.358779
question8,0.0221746,0.0221746,0.0221746,0.0221746
question9,-0.00138702,-0.00138702,-0.00138702,-0.00138702
question10,-0.0854603,-0.0854603,-0.0854603,-0.0854603


In [59]:
# For hybrid challenge, calculate totals for likes, dislikes, and neutrals for each user
calc_likes(predictions5)

Pred1
Likes: 12
Dislikes: 8
Neutral: 0


Pred2
Likes: 12
Dislikes: 8
Neutral: 0


Pred3
Likes: 12
Dislikes: 8
Neutral: 0


Pred4
Likes: 12
Dislikes: 8
Neutral: 0




In [60]:
# For hybrid challenge, calculate top 5 questions for each prediction/user
calc_top(predictions5)

Pred1
question6     0.562381
question14    0.488758
question16    0.318586
question18    0.298984
question5      0.29187
Name: Pred1, dtype: object


Pred2
question6     0.562381
question14    0.488758
question16    0.318586
question18    0.298984
question5      0.29187
Name: Pred2, dtype: object


Pred3
question6     0.562381
question14    0.488758
question16    0.318586
question18    0.298984
question5      0.29187
Name: Pred3, dtype: object


Pred4
question6     0.562381
question14    0.488758
question16    0.318586
question18    0.298984
question5      0.29187
Name: Pred4, dtype: object


