# Predict Number of Upvotes

An online question-and-answer platform has hired you as a data scientist to identify the best question authors on the platform. Identifying them should give more insight into how to increase user engagement. Given the tag of the question, the number of views and answers it received, and the username and reputation of the question author, the task is to predict the upvote count that the question will receive.

Data Dictionary

 

| Variable | Definition |
| --- | --- |
| ID | Question ID |
| Tag | Anonymised tags representing question category |
| Reputation | Reputation score of question author |
| Answers | Number of times question has been answered |
| Username | Anonymised user id of question author |
| Views | Number of times question has been viewed |
| Upvotes | (Target) Number of upvotes for the question |
 

Evaluation Metric

The evaluation metric for this competition is RMSE (root mean squared error).
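
As a reminder of what the metric measures, here is a minimal sketch of RMSE written with NumPy. The helper name `rmse` and the toy numbers are illustrative assumptions; they are not part of the competition material.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the per-sample errors, average them, then take the square root.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values for illustration only (errors are 1, 0 and -2).
example_score = rmse([3, 5, 2], [2, 5, 4])
```

Because the errors are squared before averaging, a few badly mispredicted questions with very high upvote counts can dominate the score.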

In [667]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt

In [668]:
training_data = pd.read_csv('train_NIR5Yl1.csv')
testing_data = pd.read_csv('test_8i3B3FC.csv')

In [669]:
print(training_data.head())

       ID Tag  Reputation  Answers  Username    Views  Upvotes
0   52664   a      3942.0      2.0    155623   7855.0     42.0
1  327662   a     26046.0     12.0     21781  55801.0   1175.0
2  468453   c      1358.0      4.0     56177   8067.0     60.0
3   96996   a       264.0      3.0    168793  27064.0      9.0
4  131465   c      4271.0      4.0    112223  13986.0     83.0


In [670]:
print(training_data.shape)

(330045, 7)


In [671]:
print(training_data['ID'].nunique())

330045


In [672]:
print(training_data['Tag'].unique())

['a' 'c' 'r' 'j' 'p' 's' 'h' 'o' 'i' 'x']


In [673]:
## Use the 'Upvotes' column as the label

labels = training_data[['Upvotes']]

## Remove 'Upvotes', 'ID' and 'Username' from the training data
training_data = training_data.drop(['Upvotes', 'ID', 'Username'], axis=1)

## Remove 'ID' and 'Username' from the testing data
testing_data = testing_data.drop(['ID', 'Username'], axis=1)

In [674]:
## One-hot encode the 'Tag' feature
## (scikit-learn >= 1.2 names this argument `sparse_output`; older releases use `sparse`)

oh_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

ohe_train = pd.DataFrame(oh_encoder.fit_transform(training_data['Tag'].to_numpy().reshape(-1, 1)))
ohe_test = pd.DataFrame(oh_encoder.transform(testing_data['Tag'].to_numpy().reshape(-1, 1)))
## Give the encoded frames the same index as the original data

ohe_train.index = training_data.index
ohe_test.index = testing_data.index
## Drop the 'Tag' feature from the original tables

new_training_data = training_data.drop('Tag', axis=1)
new_testing_data = testing_data.drop('Tag', axis=1)
## Concatenate the remaining features with the encoded columns

training_data = pd.concat([new_training_data, ohe_train], axis=1)
testing_data = pd.concat([new_testing_data, ohe_test], axis=1)

In [675]:
print(training_data.head())

   Reputation  Answers    Views    0    1    2    3    4    5    6    7    8  \
0      3942.0      2.0   7855.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
1     26046.0     12.0  55801.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
2      1358.0      4.0   8067.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
3       264.0      3.0  27064.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
4      4271.0      4.0  13986.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   

     9  
0  0.0  
1  0.0  
2  0.0  
3  0.0  
4  0.0  


In [676]:
print(testing_data.head())

   Reputation  Answers    Views    0    1    2    3    4    5    6    7    8  \
0      5645.0      3.0  33200.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
1     24511.0      6.0   2730.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
2       927.0      1.0  21167.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0   
3        21.0      6.0  18528.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0   
4      4475.0     10.0  57240.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0   

     9  
0  0.0  
1  0.0  
2  0.0  
3  0.0  
4  0.0  


In [677]:
## Normalize the numeric features to the [0, 1] range
norm_train_data = training_data[['Reputation', 'Answers', 'Views']]
norm_test_data = testing_data[['Reputation', 'Answers', 'Views']]
col_names = norm_train_data.columns
scaler = MinMaxScaler()
norm_train_data = scaler.fit_transform(norm_train_data)
norm_test_data = scaler.transform(norm_test_data)

norm_train_data = pd.DataFrame(norm_train_data, columns=col_names)
norm_test_data = pd.DataFrame(norm_test_data, columns=col_names)

training_data = training_data.drop(col_names, axis=1)
training_data = pd.concat([training_data, norm_train_data], axis=1)

testing_data = testing_data.drop(col_names, axis=1)
testing_data = pd.concat([testing_data, norm_test_data], axis=1)
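
The scaler is fitted on the training data only and then reused on the test data. A small sketch of what that implies, on made-up numbers (the arrays and names are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-column data for illustration only.
toy_train_views = np.array([[100.0], [300.0], [500.0]])
toy_test_views = np.array([[200.0], [700.0]])

toy_scaler = MinMaxScaler()
# fit_transform learns min=100 and max=500 from the training column.
scaled_train = toy_scaler.fit_transform(toy_train_views)
# transform reuses those training statistics: (x - 100) / (500 - 100).
scaled_test = toy_scaler.transform(toy_test_views)
```

Here `scaled_train` spans exactly 0 to 1, while the test value 700 maps to 1.5: test rows that exceed the training range fall outside [0, 1], which is expected when the scaler is not refitted.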

In [678]:
print(training_data.columns)

Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 'Reputation', 'Answers', 'Views'], dtype='object')


In [679]:
print(training_data.head())

     0    1    2    3    4    5    6    7    8    9  Reputation   Answers  \
0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    0.003782  0.026316   
1  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    0.024986  0.157895   
2  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    0.001303  0.052632   
3  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    0.000253  0.039474   
4  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    0.004097  0.052632   

      Views  
0  0.001500  
1  0.010666  
2  0.001540  
3  0.005172  
4  0.002672  


In [680]:
training_data.drop(col_names, axis=1)

          0    1    2    3    4    5    6    7    8    9
0       1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
1       1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2       0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
3       1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
4       0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
...     ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
330040  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
330041  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
330042  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
330043  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0


In [681]:
## using linear regression algorithm to train

lr = LinearRegression()
lr.fit(training_data,labels)
test_prediction = np.rint(lr.predict(testing_data)).flatten()


In [682]:
print(test_prediction)

[ 376.  248.  177. ...  588. -124. -161.]
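
The predictions above include negative values, and the notebook does not estimate RMSE before submitting. The sketch below shows one possible way to check both on a held-out split, using `train_test_split` and clipping predictions at zero. It runs on synthetic data, assumes that upvote counts cannot be negative, and all names in it are illustrative rather than taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, non-negative target for illustration only.
rng = np.random.default_rng(0)
toy_features = rng.uniform(0.0, 1.0, size=(200, 3))
noise = rng.normal(0.0, 5.0, size=200)
toy_upvotes = np.maximum(0.0, 50.0 * toy_features[:, 2] - 10.0 + noise)

# Hold out 20% of the rows to estimate RMSE on data the model has not seen.
X_fit, X_val, y_fit, y_val = train_test_split(
    toy_features, toy_upvotes, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_fit, y_fit)
raw_pred = model.predict(X_val)
# Assumption: a count cannot be negative, so floor the predictions at 0.
clipped_pred = np.clip(raw_pred, 0, None)


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


rmse_raw = rmse(y_val, raw_pred)
rmse_clipped = rmse(y_val, clipped_pred)
```

When the true values are all non-negative, moving a negative prediction up to zero can only bring it closer to the target, so `rmse_clipped` is never larger than `rmse_raw`.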


In [683]:
testing_data = pd.read_csv('test_8i3B3FC.csv')
testing_data = testing_data.loc[:, 'ID'].values

test_report_array = np.array(list(zip(testing_data, test_prediction)))
#testing_data['Upvotes'] = test_prediction.tolist()

In [684]:
print(testing_data)

[366953  71864 141692 ... 282334 386629 107271]


In [685]:
print(test_report_array)

[[ 3.66953e+05  3.76000e+02]
 [ 7.18640e+04  2.48000e+02]
 [ 1.41692e+05  1.77000e+02]
 ...
 [ 2.82334e+05  5.88000e+02]
 [ 3.86629e+05 -1.24000e+02]
 [ 1.07271e+05 -1.61000e+02]]


In [686]:
test_report = pd.DataFrame(data=test_report_array, columns = ['ID', 'Upvotes'])
print(test_report.head())
test_report.to_csv('test_report1', index=False)

         ID  Upvotes
0  366953.0    376.0
1   71864.0    248.0
2  141692.0    177.0
3  316833.0    -89.0
4  440445.0    684.0
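
In the report above the `ID` column is printed as floats (`366953.0`) because `np.array(list(zip(...)))` stores the integer IDs and the float predictions in a single float array. A minimal sketch of an alternative that keeps the IDs as integers by building the DataFrame column by column; the variable names are illustrative, and the three rows reuse the first IDs and predictions printed earlier:

```python
import numpy as np
import pandas as pd

# First three IDs and predictions from the printed output above.
toy_ids = np.array([366953, 71864, 141692])
toy_predictions = np.array([376.0, 248.0, 177.0])

# Each column keeps its own dtype when passed as a dict of arrays.
toy_report = pd.DataFrame({'ID': toy_ids, 'Upvotes': toy_predictions})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = toy_report.to_csv(index=False)
```

Whether the submission checker accepts float-formatted IDs is not stated in the problem description, so this is a precaution rather than a required fix.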
