 ## **Clean the Data (deal with missing values):**
 
 There are no missing values in this dataset: according to the Kaggle entry, none of the instances (about 330k training and 141k test) has missing or mismatched data (https://www.kaggle.com/datasets/umairnsr87/predict-the-number-of-upvotes-a-post-will-get).
 
*The dataset is also already split into training and testing sets* by the Kaggle entry's author, with a roughly 70/30 split. The training set contains about 330k entries, and the test set contains about 141k entries.

In [3]:
import pandas as pd

train_set = pd.read_csv("train_upvotes.csv")
test_set = pd.read_csv("test_upvotes.csv")

In [4]:
train_set.describe()

| | ID | Reputation | Answers | Username | Views | Upvotes |
|---|---|---|---|---|---|---|
| count | 330045.0 | 330045.0 | 330045.0 | 330045.0 | 330045.0 | 330045.0 |
| mean | 235748.682789 | 7773.147 | 3.917672 | 81442.888803 | 29645.07 | 337.505358 |
| std | 136039.418471 | 27061.41 | 3.579515 | 49215.10073 | 80956.46 | 3592.441135 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 9.0 | 0.0 |
| 25% | 117909.0 | 282.0 | 2.0 | 39808.0 | 2594.0 | 8.0 |
| 50% | 235699.0 | 1236.0 | 3.0 | 79010.0 | 8954.0 | 28.0 |
| 75% | 353620.0 | 5118.0 | 5.0 | 122559.0 | 26870.0 | 107.0 |
| max | 471493.0 | 1042428.0 | 76.0 | 175738.0 | 5231058.0 | 615278.0 |


In [5]:
test_set.describe()

| | ID | Reputation | Answers | Username | Views |
|---|---|---|---|---|---|
| count | 141448.0 | 141448.0 | 141448.0 | 141448.0 | 141448.0 |
| mean | 235743.073497 | 7920.927 | 3.914873 | 81348.231117 | 29846.33 |
| std | 136269.867118 | 27910.72 | 3.57746 | 49046.098215 | 80343.74 |
| min | 7.0 | 0.0 | 0.0 | 4.0 | 9.0 |
| 25% | 117797.0 | 286.0 | 2.0 | 40222.75 | 2608.0 |
| 50% | 235830.0 | 1245.0 | 3.0 | 78795.5 | 8977.0 |
| 75% | 353616.0 | 5123.0 | 5.0 | 122149.0 | 26989.25 |
| max | 471488.0 | 1042428.0 | 73.0 | 175737.0 | 5004669.0 |
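
The no-missing-values claim and the 70/30 split can also be checked directly in pandas rather than taken from the Kaggle page. The sketch below is a minimal illustration: `missing_report` and `split_fraction` are hypothetical helper names, and the toy frames stand in for `train_set` and `test_set` so the snippet runs without the CSV files.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing (NaN/None) values in each column."""
    return df.isna().sum()


def split_fraction(train: pd.DataFrame, test: pd.DataFrame) -> float:
    """Return the share of all rows that fall in the training set."""
    return len(train) / (len(train) + len(test))


# Toy stand-ins (hypothetical values, not the real data).
toy_train = pd.DataFrame({"Tag": ["a", "c", "j"], "Views": [10.0, None, 30.0]})
toy_test = pd.DataFrame({"Tag": ["p"], "Views": [5.0]})

report = missing_report(toy_train)          # one missing value in "Views"
frac = split_fraction(toy_train, toy_test)  # 3 of 4 rows are training rows
print(report)
print(frac)
```

On the real data, `missing_report(train_set)` and `missing_report(test_set)` should be all zeros if the dataset is as clean as described, and `split_fraction(train_set, test_set)` should come out near 0.70.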


## Use a One Hot Encoder

One-hot encoding turns categorical variables (which most mathematical ML tools cannot take as input) into equivalent numerical variables that can be operated on. This dataset has one categorical variable: the *tag*, a single letter that denotes what section of Reddit the post belongs to.

Because the number of different sections is relatively small (10), it can be easily one-hot-encoded without an influx of training features bogging down a potential model's training time.

In [8]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330045 entries, 0 to 330044
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   ID          330045 non-null  int64  
 1   Tag         330045 non-null  object 
 2   Reputation  330045 non-null  float64
 3   Answers     330045 non-null  float64
 4   Username    330045 non-null  int64  
 5   Views       330045 non-null  float64
 6   Upvotes     330045 non-null  float64
dtypes: float64(4), int64(2), object(1)
memory usage: 17.6+ MB


In [10]:
train_set['Tag'].value_counts() #There are only 10 tags, so this is easy to one-hot encode

c    72458
j    72232
p    43407
i    32400
a    31695
s    23323
h    20564
o    14546
r    12442
x     6978
Name: Tag, dtype: int64

In [11]:
from sklearn.preprocessing import OneHotEncoder

#the only categorical variable we need is the tag
upvote_tag_train = train_set[['Tag']]
upvote_tag_test = test_set[['Tag']]

#create the one hot encoder
categorical_encoder = OneHotEncoder()

#fit on the training tags only, then reuse the learned categories for the test tags
upvote_tag_train = categorical_encoder.fit_transform(upvote_tag_train)
upvote_tag_test = categorical_encoder.transform(upvote_tag_test)

In [15]:
upvote_tag_train.toarray()[0:10] #Properly converted/OHE'd

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]])

## Scale/normalize/standardize features using sklearn.preprocessing


The features in this data are on very different scales. The average answer count is in the single digits (about 4), while a user's reputation and the number of views a post receives are orders of magnitude larger (means of roughly 7,800 and 29,600). With such disparate scales, the data needs to be standardized so that large-scale features do not dominate and leave smaller-scale features with almost no bearing on the final result.
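
To make the transformation concrete: standardization maps each value to z = (x - mean) / std of its column. The plain-Python sketch below applies that formula to the `Answers` column using the mean and standard deviation reported by `train_set.describe()`; `standardize` is a hypothetical helper name. Note that `describe()` reports the sample standard deviation while `StandardScaler` uses the population standard deviation, a difference that is negligible with about 330k rows.

```python
# Standardization by hand: z = (x - mean) / std.
# Mean and std of the Answers column, copied from train_set.describe().
answers_mean = 3.917672
answers_std = 3.579515


def standardize(x: float, mean: float, std: float) -> float:
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / std


z_two = standardize(2.0, answers_mean, answers_std)   # a post with 2 answers
z_four = standardize(4.0, answers_mean, answers_std)  # a post with 4 answers
print(round(z_two, 4), round(z_four, 4))
```

These come out at about -0.5357 and 0.0230, consistent with the values -0.53573597 and 0.02299985 that appear in the scaled `Answers` column in this section's output.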

In [19]:
from sklearn.preprocessing import StandardScaler

upvote_numerical = ['ID', 'Reputation', 'Answers', 'Username', 'Views']

#train_set[['Tag']]

standard_scaler = StandardScaler()

upvote_train_scaled = standard_scaler.fit_transform(train_set[upvote_numerical])
upvote_test_scaled = standard_scaler.transform(test_set[upvote_numerical])

In [20]:
upvote_train_scaled

array([[-1.34582287, -0.14157253, -0.53573597,  1.5072655 , -0.26915833],
       [ 0.67563841,  0.67523751,  2.25794312, -1.21226978,  0.32308687],
       [ 1.71056795, -0.23705919,  0.02299985, -0.51337753, -0.26653963],
       ...,
       [-0.18371676, -0.05894553, -0.53573597,  0.20843454, -0.33588566],
       [-1.3206463 , -0.2839526 , -0.53573597, -0.0243399 , -0.34015957],
       [ 0.47636498, -0.21329838,  0.02299985,  1.48834852, -0.33463807]])

In [27]:
#solution adapted from https://stackoverflow.com/questions/64161419/how-can-i-convert-the-standardscaler-transformation-back-to-dataframe
cols = ['ID','Reputation', 'Answers', 'Username', 'Views']

#reuse the arrays scaled above instead of refitting the scaler
X_train_sc = pd.DataFrame(upvote_train_scaled, columns=cols)
X_test_sc = pd.DataFrame(upvote_test_scaled, columns=cols)

In [28]:
X_train_sc.head()

| | ID | Reputation | Answers | Username | Views |
|---|---|---|---|---|---|
| 0 | -1.345823 | -0.141573 | -0.535736 | 1.507266 | -0.269158 |
| 1 | 0.675638 | 0.675238 | 2.257943 | -1.21227 | 0.323087 |
| 2 | 1.710568 | -0.237059 | 0.023 | -0.513378 | -0.26654 |
| 3 | -1.019946 | -0.277486 | -0.256368 | 1.774867 | -0.031882 |
| 4 | -0.766571 | -0.129415 | 0.023 | 0.625421 | -0.193426 |
