 ## **Clean the Data (deal with missing values):**
 
 There are no missing or mismatched values in any instance of this dataset (https://www.kaggle.com/datasets/umairnsr87/predict-the-number-of-upvotes-a-post-will-get).
 
*The dataset is also already split into training and testing sets* by the Kaggle entry's author, with a 70/30 split. The training set contains about 330k entries, and the test set contains approximately 141k entries.

Thinking ahead to the ultimate prediction task: a Google search turned up no set of "bad IDs" (users associated with trolling, etc.), so the ID column was removed.
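The no-missing-values claim can be checked directly by counting nulls per column. The sketch below uses a small made-up frame in place of `train_upvotes.csv`; the values are illustrative and only the column names follow the dataset.

```python
import pandas as pd

# Toy stand-in for the real training file; values are made up for illustration.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Tag": ["a", "c", "j"],
    "Reputation": [3942.0, 26046.0, 1358.0],
    "Answers": [2.0, 12.0, 4.0],
    "Views": [7855.0, 55801.0, 8067.0],
})

# Number of missing values in each column; all zeros means nothing to impute.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single boolean answer for the whole frame.
has_missing = bool(toy.isna().any().any())
print("Any missing values:", has_missing)
```

Running the same two calls on the real `train_set` and `test_set` would confirm the statement above for the full data.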

In [60]:
import pandas as pd

train_set = pd.read_csv("train_upvotes.csv")
test_set = pd.read_csv("test_upvotes.csv")

In [61]:
train_set.describe()

Unnamed: 0,ID,Reputation,Answers,Username,Views,Upvotes
count,330045.0,330045.0,330045.0,330045.0,330045.0,330045.0
mean,235748.682789,7773.147,3.917672,81442.888803,29645.07,337.505358
std,136039.418471,27061.41,3.579515,49215.10073,80956.46,3592.441135
min,1.0,0.0,0.0,0.0,9.0,0.0
25%,117909.0,282.0,2.0,39808.0,2594.0,8.0
50%,235699.0,1236.0,3.0,79010.0,8954.0,28.0
75%,353620.0,5118.0,5.0,122559.0,26870.0,107.0
max,471493.0,1042428.0,76.0,175738.0,5231058.0,615278.0


In [62]:
test_set.describe()

Unnamed: 0,ID,Reputation,Answers,Username,Views
count,141448.0,141448.0,141448.0,141448.0,141448.0
mean,235743.073497,7920.927,3.914873,81348.231117,29846.33
std,136269.867118,27910.72,3.57746,49046.098215,80343.74
min,7.0,0.0,0.0,4.0,9.0
25%,117797.0,286.0,2.0,40222.75,2608.0
50%,235830.0,1245.0,3.0,78795.5,8977.0
75%,353616.0,5123.0,5.0,122149.0,26989.25
max,471488.0,1042428.0,73.0,175737.0,5004669.0


In [63]:
train_set = train_set.drop(["ID"], axis = 1)
test_set = test_set.drop(["ID"], axis = 1)

## Use a One Hot Encoder

One Hot Encoding turns categorical variables (which cannot be fed into most mathematical ML tools) into equivalent numerical variables that can be operated on. This dataset has one categorical variable: the *tag* that denotes what section of Reddit the post belongs to (denoted by a letter).

Because the number of different sections is relatively small (10), it can be easily one-hot-encoded without an influx of training features bogging down a potential model's training time.

In [64]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330045 entries, 0 to 330044
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Tag         330045 non-null  object 
 1   Reputation  330045 non-null  float64
 2   Answers     330045 non-null  float64
 3   Username    330045 non-null  int64  
 4   Views       330045 non-null  float64
 5   Upvotes     330045 non-null  float64
dtypes: float64(4), int64(1), object(1)
memory usage: 15.1+ MB


In [65]:
train_set['Tag'].value_counts() #There's only 10- this is easy to OHE!

c    72458
j    72232
p    43407
i    32400
a    31695
s    23323
h    20564
o    14546
r    12442
x     6978
Name: Tag, dtype: int64

In [66]:
from sklearn.preprocessing import OneHotEncoder

#the only categorical variable we need is the tag
upvote_tag_train = train_set[['Tag']]
upvote_tag_test = test_set[['Tag']]

#create the one hot encoder
categorical_encoder = OneHotEncoder()

#fit on the training tags only, then reuse the same encoding for the test tags
upvote_tag_train = categorical_encoder.fit_transform(upvote_tag_train)
upvote_tag_test = categorical_encoder.transform(upvote_tag_test)

In [67]:
upvote_tag_train.toarray()[0:10] #Properly converted/OHE'd

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]])
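
The column order of the encoded array follows the encoder's `categories_` attribute. The sketch below, on made-up tags, shows how to read that order and how an encoder fitted once on training tags can be reused on test tags. `handle_unknown="ignore"` is an assumption of this sketch and is not used in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy tag columns; the letters mirror the dataset's style but are illustrative.
toy_train = pd.DataFrame({"Tag": ["a", "c", "j", "c", "p"]})
toy_test = pd.DataFrame({"Tag": ["j", "a", "x"]})  # "x" never appears in toy_train

# handle_unknown="ignore" is a choice made for this sketch: unseen tags
# become an all-zero row instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")

train_encoded = encoder.fit_transform(toy_train)  # learn the categories here
test_encoded = encoder.transform(toy_test)        # reuse them, do not refit

print(encoder.categories_)     # column order of the encoded matrix
print(test_encoded.toarray())
```

Reusing the fitted encoder keeps the train and test matrices aligned column for column, even when one set is missing a tag.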

## Scale/normalize/standardize features using sklearn.preprocessing


The features in this data sit on very different scales. While the average answer count is in the single digits, the reputation of a user and the number of views a post receives are orders of magnitude larger. With such disparate scales, the data needs to be standardized so that the larger-scale features do not dominate and leave the smaller-scale ones with no bearing on the final result.
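
As a small worked example of what standardization does, the sketch below compares `StandardScaler` with the formula z = (x - mean) / std computed by hand. The two toy columns are made up to mimic a small-scale and a large-scale feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (values are illustrative).
X = np.array([
    [2.0,   300.0],
    [3.0,  1200.0],
    [5.0,  5100.0],
    [1.0, 90000.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# StandardScaler computes z = (x - mean) / std per column,
# using the population standard deviation (ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither outweighs the other purely because of its units.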

In [68]:
from sklearn.preprocessing import StandardScaler

upvote_numerical = ['Reputation', 'Answers', 'Username', 'Views']

#train_set[['Tag']]

standard_scaler = StandardScaler()

upvote_train_scaled = standard_scaler.fit_transform(train_set[upvote_numerical])
upvote_test_scaled = standard_scaler.transform(test_set[upvote_numerical])

In [69]:
upvote_train_scaled

array([[-0.14157253, -0.53573597,  1.5072655 , -0.26915833],
       [ 0.67523751,  2.25794312, -1.21226978,  0.32308687],
       [-0.23705919,  0.02299985, -0.51337753, -0.26653963],
       ...,
       [-0.05894553, -0.53573597,  0.20843454, -0.33588566],
       [-0.2839526 , -0.53573597, -0.0243399 , -0.34015957],
       [-0.21329838,  0.02299985,  1.48834852, -0.33463807]])

In [70]:
#solution adapted from https://stackoverflow.com/questions/64161419/how-can-i-convert-the-standardscaler-transformation-back-to-dataframe
cols = ['Reputation', 'Answers', 'Username', 'Views']

X_train_sc = pd.DataFrame(standard_scaler.fit_transform(train_set[upvote_numerical]), columns=cols)
X_test_sc = pd.DataFrame(standard_scaler.transform(test_set[upvote_numerical]), columns=cols)

In [71]:
X_train_sc.head()

Unnamed: 0,Reputation,Answers,Username,Views
0,-0.141573,-0.535736,1.507266,-0.269158
1,0.675238,2.257943,-1.21227,0.323087
2,-0.237059,0.023,-0.513378,-0.26654
3,-0.277486,-0.256368,1.774867,-0.031882
4,-0.129415,0.023,0.625421,-0.193426


In [72]:
y_train = train_set['Upvotes']

## Use sklearn.linear_model.LinearRegression

Now that the data is properly scaled, we can attempt to predict with it. First, the classic example: a Linear Regression. There is reason to expect a positive trend between reputation and the number of upvotes, so this may be a good fit.

In [73]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train_sc, y_train)

LinearRegression()

In [74]:
some_data = X_train_sc.iloc[:5]
some_labels = y_train.iloc[:5]

print("Predictions:", lin_reg.predict(some_data))
print("Actual:", some_labels)

Predictions: [-167.44011809 1177.38633173 -303.11939673   49.30290753 -100.62149145]
Actual: 0      42.0
1    1175.0
2      60.0
3       9.0
4      83.0
Name: Upvotes, dtype: float64


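Well, one prediction came out very close (about 1177 against an actual 1175)! That's a start. The other positive prediction (about 49 against an actual 9) is well off, and the remaining three are negative, which is impossible for an upvote count. Can we find a more accurate predictor?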
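
Rather than eyeballing the rows, the error can be put into a single number. The sketch below computes the root mean squared error (RMSE) on the five values copied from the printed output above; five rows are far too few to judge the model, so this only illustrates the calculation.

```python
import numpy as np

# Values copied from the printed output above (first five training rows).
predictions = np.array([-167.44011809, 1177.38633173, -303.11939673,
                        49.30290753, -100.62149145])
actual = np.array([42.0, 1175.0, 60.0, 9.0, 83.0])

errors = predictions - actual
rmse = np.sqrt(np.mean(errors ** 2))

print("Per-row error:", errors)
print("RMSE on these five rows:", rmse)
print("Negative predictions:", int((predictions < 0).sum()))
```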

## Use sklearn.tree.DecisionTreeRegressor

In [75]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train_sc, y_train)

DecisionTreeRegressor()

In [76]:
some_data = X_train_sc.iloc[:5]
some_labels = y_train.iloc[:5]

print("Predictions:", tree_reg.predict(some_data))
print("Actual:", some_labels)

Predictions: [  42. 1175.   60.    9.   83.]
Actual: 0      42.0
1    1175.0
2      60.0
3       9.0
4      83.0
Name: Upvotes, dtype: float64


This looks significantly better! But a perfect score on rows the model was trained on is likely to be the result of overfitting.
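
To illustrate why a perfect score on training rows is suspicious, the sketch below fits an unconstrained tree on synthetic data (a noisy linear signal, not the upvotes dataset) and compares the error on the rows used for fitting with the error on held-out rows.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy linear signal, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=500)

X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_fit, y_fit)

# An unconstrained tree can memorise its training rows, so the error
# on those rows says little about unseen data.
rmse_fit = np.sqrt(mean_squared_error(y_fit, tree.predict(X_fit)))
rmse_val = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))

print("RMSE on rows used for fitting:", rmse_fit)
print("RMSE on held-out rows:", rmse_val)
```

The gap between the two numbers is the symptom of overfitting; cross-validation measures the same gap more robustly by repeating the split several times.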

In [77]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, X_train_sc, y_train,
                         scoring="neg_mean_squared_error", cv=10)


In [78]:
import numpy as np
tree_rmse_scores = np.sqrt(-scores)

In [79]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
    
display_scores(tree_rmse_scores)

Scores: [1641.15935069 2005.67754592 2024.31555296 1249.60744179 1779.41194592
 1205.55095199 1341.7221639  1375.8279352  1267.75289087 1607.05444618]
Mean: 1549.808022543073
Standard deviation: 293.09926825999435


Spoiler: Decision Trees did NOT do well here. Not even remotely: the cross-validated RMSE averages about 1,550 upvotes, while the median post in the training set has 28. At least the tree's predictions were all positive, which is better than the Linear Regression model, which predicted NEGATIVE numbers.

## Use sklearn.ensemble.RandomForestRegressor

In [80]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(X_train_sc, y_train)

#cross_val_score returns negative MSE, so negate and take the root to get RMSE
forest_scores = cross_val_score(forest_reg, X_train_sc, y_train,
                         scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)

display_scores(forest_rmse_scores)

## Evaluate your system on the Test Data

Of the models chosen, the best-working one was the Random Decision Tree Model; the cell below predicts with `tree_reg`, the `DecisionTreeRegressor` fitted earlier. Let's see what happens when it is run against the test set. The test set has no `Upvotes` column, so there *is* no ground truth to score against and the predictions only give a general idea. This is a limitation of the dataset's test set and is noted here as such.

In [81]:
print("Predictions:", tree_reg.predict(X_test_sc[:20]))

Predictions: [260.  67.  48.   4. 430. 104.  13. 101.  34.   2.  14.  58.  21. 100.
  20. 456. 151. 138. 731.   0.]


In [82]:
test_set.head(20)

Unnamed: 0,Tag,Reputation,Answers,Username,Views
0,a,5645.0,3.0,50652,33200.0
1,c,24511.0,6.0,37685,2730.0
2,i,927.0,1.0,135293,21167.0
3,i,21.0,6.0,166998,18528.0
4,i,4475.0,10.0,53504,57240.0
5,c,3252.0,1.0,115109,2307.0
6,x,859.0,1.0,88355,6507.0
7,c,770.0,4.0,74489,57775.0
8,s,8727.0,2.0,37904,4459.0
9,p,170.0,4.0,162810,4899.0


These predictions *seem* reasonable, though the model appears to weigh views more heavily than reputation, unless the reputation happens to be *very large* (as in the case of the 15th entry, with almost 300,000 reputation).

## Create a single pipeline that does full process from data preparation to final prediction.

In [111]:
#starting from scratch so that the pipeline can go the full way
train_set = pd.read_csv("train_upvotes.csv")
test_set = pd.read_csv("test_upvotes.csv")

X_train = train_set[['ID', 'Tag', 'Reputation', 'Answers', 'Username', 'Views']]
y_train = train_set["Upvotes"]

X_test = test_set[['ID', 'Tag', 'Reputation', 'Answers', 'Username', 'Views']]
##y_test = test_set["Upvotes"] - there actually is no y_test in this dataset for whatever reason

In [106]:
#To do this, we will need a custom transformer 
#https://stackoverflow.com/questions/68402691/adding-dropping-column-instance-into-a-pipeline
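from sklearn.base import BaseEstimator, TransformerMixin

#inheriting from the sklearn base classes adds get_params/set_params and fit_transform
class columnDropperTransformer(BaseEstimator, TransformerMixin):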
    def __init__(self,columns):
        self.columns=columns

    def transform(self,X,y=None):
        return X.drop(self.columns,axis=1)

    def fit(self, X, y=None):
        return self 

In [107]:
#Numerical
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("std-scaler", StandardScaler())
])

In [127]:
from sklearn.compose import ColumnTransformer

num_attribs = ['Reputation', 'Answers', 'Username', 'Views']
cat_attribs = ['Tag']

full_pipeline = ColumnTransformer([
        ('clmn_drpr', 'drop', ['ID']),
        ("num", pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

In [128]:
train_prepared = full_pipeline.fit_transform(X_train)

tree_reg = DecisionTreeRegressor()
tree_reg.fit(train_prepared, y_train)


In [130]:
print("Predictions:", tree_reg.predict(train_prepared)[:20])
print("Labels:", list(y_train)[:20])

Predictions: [  42. 1175.   60.    9.   83.    4.   17.    3.   79.    0.  166.   42.
   19.    2.   10.  223.   13.    8.    9.   79.]
Labels: [42.0, 1175.0, 60.0, 9.0, 83.0, 4.0, 17.0, 3.0, 79.0, 0.0, 166.0, 42.0, 19.0, 2.0, 10.0, 223.0, 13.0, 8.0, 9.0, 79.0]


In [None]:
#The same overfitting problem as with the plain DecisionTreeRegressor(): after cross-validation, it is still WILDLY off

In [131]:
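scores_pipe = cross_val_score(tree_reg, train_prepared, y_train,
                         scoring="neg_mean_squared_error", cv=10)
#these are negative MSE values, not RMSE; np.sqrt(-scores_pipe) gives numbers comparable to the earlier RMSE scores
display_scores(scores_pipe)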

Scores: [-4101712.28407817 -3917583.50086351 -1885833.01578549 -1602262.68259355
 -3321645.80527193  -998785.5661132  -1691758.7502121  -1697028.09607926
 -1690590.18664404 -2462991.62965095]
Mean: -2337019.151729219
Standard deviation: 1018617.1654271611
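
The cells above prepare the data with the `ColumnTransformer` and then fit the model as a separate step. One possible way to carry the process through to the final prediction in a single object is to put the model in as the last step of a `Pipeline`. The sketch below runs on a synthetic frame that only borrows the column names; the step names `prep` and `tree`, the `handle_unknown="ignore"` setting, and the fold count are assumptions of the sketch, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for train_upvotes.csv; only the column names match.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "ID": np.arange(n),
    "Tag": rng.choice(list("acj"), size=n),
    "Reputation": rng.uniform(0, 5000, size=n),
    "Answers": rng.integers(0, 10, size=n).astype(float),
    "Username": rng.integers(0, 1000, size=n),
    "Views": rng.uniform(10, 50000, size=n),
})
toy["Upvotes"] = 0.01 * toy["Views"] + rng.normal(scale=5.0, size=n)

num_attribs = ["Reputation", "Answers", "Username", "Views"]
cat_attribs = ["Tag"]

# Columns not listed (here "ID") are dropped by the default remainder="drop".
preparation = ColumnTransformer([
    ("num", StandardScaler(), num_attribs),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs),
])

model = Pipeline([
    ("prep", preparation),
    ("tree", DecisionTreeRegressor(random_state=0)),
])

X = toy.drop(columns=["Upvotes"])
y = toy["Upvotes"]

# The scaler and encoder are refit on the training part of every fold,
# so the held-out part never influences the preprocessing statistics.
neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
rmse = np.sqrt(-neg_mse)
print("RMSE per fold:", rmse)

model.fit(X, y)
print("Predictions:", model.predict(X.head()))
```

With this layout, raw rows go in and predictions come out of one `model.predict` call, and the same object can be handed to `cross_val_score` without preparing the data beforehand.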
