
# Useful Imports


In [0]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances 

# Recommender System

In [0]:
user=pd.read_excel(r"/content/revid.xlsx",sheet_name='USER')

## Data

The data is described below. It covers 100 users, each of whom interacted with 100 posts drawn from a total of 1000 posts.


> The columns represent the following:


1.   userID : The unique ID assigned to each user.
2.   postID : The unique ID assigned to each post. It identifies the post that the user saw. Every user has seen 100 random posts out of 1000.
3. vote : +1 represents an upvote, -1 a downvote, and 0 no vote.
4. t_spent : The time the user spent on the post, scaled to the range 0 to 1.
5. shrd : 1 means the user shared the post, 0 means they did not.
6. comm : 1 means the user commented on the post, 0 means they did not.
7. score : Initially empty. The score measures how much the user liked the post and is calculated in a later step.





In [5]:
no_user = user.userID.unique().shape[0]
no_post = user.postID.unique().shape[0]
user.describe()

Unnamed: 0,userID,postID,vote,t_spent,shrd,comm,score
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,0.0
mean,50.5,505.5862,0.0046,0.507173,0.501,0.504,
std,28.867513,289.825601,0.81893,0.288886,0.500024,0.500009,
min,1.0,1.0,-1.0,6.8e-05,0.0,0.0,
25%,25.75,257.0,-1.0,0.260805,0.0,0.0,
50%,50.5,510.0,0.0,0.50896,1.0,1.0,
75%,75.25,760.0,1.0,0.760199,1.0,1.0,
max,100.0,1000.0,1.0,0.999993,1.0,1.0,


Score Formula

---
The score measures how much the user liked a particular post.
I scaled every factor to at most 1 in magnitude so that no factor dominates because of its scale.
> The weight of each feature is chosen as follows:


1.   Vote is the most decisive signal of whether the user liked the post, so I gave it a weight of 0.5. A downvote therefore contributes -0.5, and because the other three terms add up to at most 0.5, the score of a downvoted post never becomes positive.
2.   t_spent carries more information than commenting on or sharing the post, so I gave it a weight of 0.3.
3. shrd and comm each get a weight of 0.1.

I also considered one more feature, post_value, describing how a post performed across all users. It would be calculated from the post's total upvotes, downvotes, comments and views, on the assumption that a post that performed well with others is likely to be liked by this user too. I made another dataset (post) in my Excel worksheet to calculate this.

However, including it would bias the recommendations towards globally popular posts and could undermine the purpose of recommending what each user likes personally.
Hence I did not include that feature in the model.
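
For reference, a per-post aggregate of this kind could be computed roughly as follows. This is only a sketch: the formula used for post_value, the name post_stats and the toy data are assumptions made for illustration, not the calculation from the Excel worksheet.

```python
import pandas as pd

# Toy interaction log with some of the USER sheet columns (values are made up).
toy = pd.DataFrame({
    'userID': [1, 1, 2, 2, 3],
    'postID': [10, 20, 10, 20, 10],
    'vote':   [1, -1, 1, 0, -1],
    'comm':   [1, 0, 0, 1, 1],
})

# Aggregate per post; every row counts as one view of that post.
post_stats = toy.groupby('postID').agg(
    upvotes=('vote', lambda v: (v == 1).sum()),
    downvotes=('vote', lambda v: (v == -1).sum()),
    comments=('comm', 'sum'),
    views=('userID', 'size'),
)

# Hypothetical formula (an assumption): net positive reactions per view.
post_stats['post_value'] = (
    post_stats['upvotes'] - post_stats['downvotes'] + post_stats['comments']
) / post_stats['views']

print(post_stats)
```

Because such a value is identical for every user, adding it to the score would shift all users towards the same posts, which is the bias described above.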






In [0]:
user['score'] = (user['vote'] * 0.5 ) + (user['t_spent'] *0.3) + (user['shrd']*0.1) + (user['comm']*0.1)

In [7]:
user.head(10)

Unnamed: 0,userID,postID,vote,t_spent,shrd,comm,score
0,1,337,-1,0.680855,0,1,-0.195744
1,1,787,-1,0.72171,1,1,-0.083487
2,1,116,-1,0.762514,1,0,-0.171246
3,1,592,-1,0.32237,0,1,-0.303289
4,1,621,1,0.173896,1,0,0.652169
5,1,191,1,0.654372,0,0,0.696312
6,1,714,0,0.143644,1,1,0.243093
7,1,883,0,0.972166,1,0,0.39165
8,1,143,0,0.509148,1,1,0.352744
9,1,354,-1,0.126354,0,0,-0.462094
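
The scores in the table can be checked by hand against the weights. The helper name score_row below is hypothetical and exists only for this check. The inputs are the rounded values printed in rows 0 and 4, so the results agree with the table only up to rounding.

```python
# Weights from the score formula: vote 0.5, t_spent 0.3, shrd 0.1, comm 0.1.
def score_row(vote, t_spent, shrd, comm):
    return vote * 0.5 + t_spent * 0.3 + shrd * 0.1 + comm * 0.1

# Row 0 of the table: vote=-1, t_spent=0.680855, shrd=0, comm=1.
row0 = score_row(-1, 0.680855, 0, 1)

# Row 4 of the table: vote=1, t_spent=0.173896, shrd=1, comm=0.
row4 = score_row(1, 0.173896, 1, 0)

# Best case for a downvoted post: every other signal at its maximum.
downvote_cap = score_row(-1, 1.0, 1, 1)

print(row0, row4, downvote_cap)
```

The last value shows why a downvoted post cannot end up with a positive score: even with full time spent, a share and a comment, the other terms only cancel the -0.5 from the vote.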


In [8]:
print("Number of users:",no_user)
print("Number of posts:",no_post)
user.describe()

Number of users: 100
Number of posts: 1000


Unnamed: 0,userID,postID,vote,t_spent,shrd,comm,score
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,50.5,505.5862,0.0046,0.507173,0.501,0.504,0.254952
std,28.867513,289.825601,0.81893,0.288886,0.500024,0.500009,0.424439
min,1.0,1.0,-1.0,6.8e-05,0.0,0.0,-0.499648
25%,25.75,257.0,-1.0,0.260805,0.0,0.0,-0.165036
50%,50.5,510.0,0.0,0.50896,1.0,1.0,0.255198
75%,75.25,760.0,1.0,0.760199,1.0,1.0,0.673554
max,100.0,1000.0,1.0,0.999993,1.0,1.0,0.999998


The data matrix holds the score of every user for every post, with one row per user and one column per post. Posts a user has not interacted with are scored zero.

In [0]:
data_matrix = np.zeros((no_user, no_post))
for row in user.itertuples():
    # userID and postID start at 1, so subtract 1 to get zero-based indices
    data_matrix[row.userID - 1, row.postID - 1] = row.score
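
The same matrix can also be filled without a Python loop by using NumPy integer-array indexing. The sketch below uses a small made-up frame named toy instead of the real data, and assumes, as above, that IDs start at 1.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the user frame: 2 users, 3 posts, made-up scores.
toy = pd.DataFrame({
    'userID': [1, 1, 2],
    'postID': [1, 3, 2],
    'score':  [0.7, -0.2, 0.4],
})

n_users, n_posts = 2, 3
matrix = np.zeros((n_users, n_posts))

# IDs start at 1 and matrix indices at 0, hence the "- 1".
rows = toy['userID'].to_numpy() - 1
cols = toy['postID'].to_numpy() - 1
matrix[rows, cols] = toy['score'].to_numpy()

print(matrix)
```

Cells that never receive a value keep the initial zero, which matches the "not interacted" convention used here.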

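post_similarity is meant to capture how alike two posts are: if a user likes post a, they are likely to like post b as well when both posts received similar scores from the other users. Note that pairwise_distances with metric='cosine' returns the cosine distance (1 minus the cosine similarity), so in this matrix 0 means identical score patterns and larger values mean less similar posts.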

In [0]:
post_similarity = pairwise_distances(data_matrix.T, metric='cosine')
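
To make the relation between cosine similarity and cosine distance concrete, the sketch below computes both with plain NumPy on a tiny made-up matrix. The names R, cos_sim and cos_dist are illustrative. The distance computed here is the quantity that metric='cosine' yields for non-zero columns.

```python
import numpy as np

# Toy score matrix: rows are users, columns are posts (values are made up).
R = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
])

# Cosine similarity between post columns: dot products divided by norm products.
cols = R.T
norms = np.linalg.norm(cols, axis=1)
cos_sim = (cols @ cols.T) / np.outer(norms, norms)

# Cosine distance is one minus the cosine similarity.
cos_dist = 1.0 - cos_sim

print(np.round(cos_sim, 3))
print(np.round(cos_dist, 3))
```

Posts 1 and 2 have proportional score columns here, so their similarity is 1 and their distance is 0. If similarity weights are wanted in the prediction step, one option is to use 1 minus the distance matrix.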

In [0]:
prediction = data_matrix.dot(post_similarity) / np.array([np.abs(post_similarity).sum(axis=1)])
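
The prediction step is a weighted sum: for each user, every post's predicted value is the sum of that user's scores on all posts, weighted by the post-to-post matrix and divided by the total weight attached to the target post. A minimal sketch on made-up numbers (scores, weights and pred are illustrative names, and the weights are not taken from the real data):

```python
import numpy as np

# Toy inputs (made up): 2 users x 3 posts, plus a symmetric 3 x 3 weight matrix.
scores = np.array([
    [1.0, 0.0, 0.5],
    [0.0, -0.5, 0.0],
])
weights = np.array([
    [0.0, 0.2, 0.8],
    [0.2, 0.0, 0.6],
    [0.8, 0.6, 0.0],
])

# Weighted sum of each user's scores, normalised by the total weight of each post.
pred = scores.dot(weights) / np.abs(weights).sum(axis=1)

print(pred)
```

The result has the same shape as the score matrix, one predicted value per user and post, including posts the user has already seen.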

In [13]:
pred = prediction.argsort()
print(pred)

[[919 190  57 ... 304  93 821]
 [  0 632 672 ... 153 117 421]
 [ 99 369 588 ... 887 639 424]
 ...
 [752 449 761 ... 973 298 840]
 [ 54 637 140 ... 478 256 334]
 [550 246 508 ... 492 329 592]]


In [17]:
n_post = 10   #change to get more predicted posts

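# Sort each user's predictions in descending order and keep the first n_post column indices
pred_post=np.argsort(-prediction)[:,:n_post]
pred_post[0]   # column indices recommended for the first user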

array([821,  93, 304, 874, 840, 865, 990, 548, 353, 645])
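
Note that argsort returns column positions in the prediction matrix, not postID values. Since the data matrix was filled at column postID - 1, the corresponding postID is the position plus 1. A small sketch on made-up predictions (toy_prediction, top_idx and top_post_ids are illustrative names):

```python
import numpy as np

# Toy predicted scores (made up): 2 users x 4 posts.
toy_prediction = np.array([
    [0.1, 0.9, 0.4, 0.7],
    [0.8, 0.2, 0.6, 0.3],
])
top_n = 2

# Negating the scores makes argsort return the highest-scoring columns first.
top_idx = np.argsort(-toy_prediction)[:, :top_n]

# Column indices are zero-based; adding 1 maps them back to postID values.
top_post_ids = top_idx + 1

print(top_idx)
print(top_post_ids)
```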

In [0]:
# DataFrame.append was removed in pandas 2.0, so the frame is built in one call
# instead of row by row; user_id is created as an integer column directly.
pred_item = pd.DataFrame({
    'user_id': np.arange(1, no_user + 1),    # userIDs start at 1
    'recommended_post': list(pred_post),     # one array of n_post indices per user
})

In [27]:
pred_item

Unnamed: 0,user_id,recommended_post
0,1,"[821, 93, 304, 874, 840, 865, 990, 548, 353, 645]"
1,2,"[421, 117, 153, 94, 402, 850, 912, 269, 510, 458]"
2,3,"[424, 639, 887, 162, 184, 833, 136, 531, 446, 74]"
3,4,"[991, 378, 75, 617, 431, 772, 976, 233, 291, 529]"
4,5,"[321, 266, 579, 188, 134, 547, 646, 519, 659, ..."
...,...,...
95,96,"[380, 472, 189, 116, 950, 526, 850, 73, 327, 262]"
96,97,"[856, 537, 507, 515, 651, 978, 678, 897, 404, ..."
97,98,"[840, 298, 973, 140, 514, 347, 256, 312, 45, 971]"
98,99,"[334, 256, 478, 30, 58, 202, 876, 979, 781, 363]"


Exporting the predictions to Excel

In [0]:
pred_item.to_excel("output.xlsx")