# Modelling

In the modelling notebook, I imported the necessary libraries and then created the training and testing data sets. From each of these I took two text views:
- training and testing data for the Reddit post bodies (`selftext`)
- training and testing data for the Reddit post titles (`title`)
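
As a rough illustration of how a single split can serve both text views, the sketch below shuffles a handful of made-up rows once and then pulls the post bodies and the titles out of the same partitions. The rows, the `split_rows` helper, and the shuffle are assumptions for illustration only; `train_test_split` does its own shuffling, so the rows it selects will differ.

```python
import math
import random

def split_rows(rows, test_size=0.25, seed=42):
    """Shuffle rows once and return (train, test) lists.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    shuffled = rows[:]  # copy so the input list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up rows standing in for the cleaned posts.
rows = [
    {
        "id": f"p{i}",
        "selftext": f"body {i}",
        "title": f"title {i}",
        "subreddit": "a" if i % 2 else "b",
    }
    for i in range(8)
]

train, test = split_rows(rows)

# Both text views come from the same partition, so they share one label list.
selftext_train = [r["selftext"] for r in train]
title_train = [r["title"] for r in train]
y_train = [r["subreddit"] for r in train]

print(len(train), len(test))
```

Because the bodies and titles are sliced from the same rows, the models built on either view can be scored against the same labels.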

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

In [2]:
series = pd.read_csv('./series.csv')

In [3]:
# Drop the old index column that was written out with the CSV.
series.drop('Unnamed: 0', axis=1, inplace=True)

In [4]:
series.dropna(inplace=True)

In [5]:
series.isnull().sum()

created_utc     0
id              0
num_comments    0
score           0
selftext        0
spoiler         0
subreddit       0
title           0
dtype: int64

In [6]:
X_train, X_test, y_train, y_test = train_test_split(series[['id', 'selftext','title']],
                                                    series['subreddit'],
                                                    test_size = 0.25,
                                                    random_state = 42)

In [7]:
X_train = X_train.reset_index(drop = True)
y_train = y_train.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)
y_test = y_test.reset_index(drop = True)

In [8]:
X_train.head()

,id,selftext,title
0,92pvn0,need charlott charact real charlott becam dolor,actress go play dolor delor charlott
1,7nima3,thought whole deal fli wormhol also mcpoyl sai...,spoiler black woman jock escap dali server ship
2,8u669p,final made confus understand whole timelin get...,anyon explain whole season proper timelin
3,8tqlr6,thank anoth great season hope regardless neg m...,lisa jonathan cast crew
4,8sbpo9,sourc peopl social network let collect spoiler...,first reaction lucki watcher final uk spoiler ...


## TF-IDF

In our case, TF-IDF is a better basis for the models than raw word counts. It weights each term's frequency within a post by how rare that term is across all posts, so words that appear everywhere contribute little and words that are distinctive to one series contribute more.
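
To make the difference from raw counts concrete, here is a minimal pure-Python sketch of TF-IDF on three made-up token lists. It uses the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the default form documented for `TfidfVectorizer`. The L2 normalisation that the vectorizer also applies by default is left out for brevity, and the toy documents and the `tfidf` function name are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    # Weight = raw count in the document times the term's idf.
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

docs = [
    ["season", "final", "dolor"],
    ["season", "final", "timelin"],
    ["season", "spoiler", "spoiler"],
]
weights = tfidf(docs)

# "season" occurs in every document, so its idf stays at the floor of 1.0,
# while the rarer "dolor" gets a higher weight despite the same raw count.
print(weights[0]["season"], weights[0]["dolor"])
```

With plain counts, "season" and "dolor" would both score 1 in the first document; the idf factor is what separates them.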

I used two kinds of model, logistic regression and random forest, and built four models in total: two on the post bodies and two on the titles. Let's see which one predicts the series better:

### Selftext

In [9]:
vect = TfidfVectorizer()
X_text_train = vect.fit_transform(X_train['selftext'].tolist())
X_text_test = vect.transform(X_test['selftext'].tolist())
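# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed.
posts_train = pd.DataFrame(X_text_train.toarray(), columns=vect.get_feature_names_out())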
posts_train.head(5)

,aa,aaaaaa,aaaaaaand,aaaaaahhhh,aaaaahhhh,aaaahhhhh,aaaay,aaf,aaghh,aamir,...,zwvfo,zx,zxj,zy,zylx,zyq,zz,zzivyep,zzlrmto,zzzzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Now we can start modelling using the arrays!

#### Logistic Regression

In [10]:
lr = LogisticRegression()
lr.fit(X_text_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [11]:
# Score on training data.
lr.score(X_text_train, y_train)

0.9674298173889343

In [12]:
# Score on testing data.
lr.score(X_text_test, y_test)

0.9255928045789044

In [13]:
X_text_test

<2446x15658 sparse matrix of type '<class 'numpy.float64'>'
	with 102757 stored elements in Compressed Sparse Row format>

In [14]:
pred_text = lr.predict(X_text_test)

In [15]:
accuracy_score(y_test, pred_text)

0.9255928045789044

#### Random Forest

In [16]:
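rf = RandomForestClassifier()
rf_param = {
    'n_estimators': [13, 14, 15, 16, 17],
    'max_depth': [11, 12, 13],
    # 'sqrt' is what the 'auto' alias meant for classifiers; newer scikit-learn removed 'auto'.
    'max_features': ['sqrt', 1.0]
}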
gs = GridSearchCV(rf, param_grid=rf_param, cv=5)
gs.fit(X_text_train, y_train)
print(gs.best_score_)
gs.best_params_

0.868629054238212


{'max_depth': 13, 'max_features': 1.0, 'n_estimators': 17}
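
For a sense of how much work this search does, the sketch below enumerates the same grid with `itertools.product` and counts the cross-validation fits. The grid values are copied from the cell above; the variable names are illustrative, and the count excludes the final refit on the full training set that `GridSearchCV` performs by default.

```python
from itertools import product

param_grid = {
    "n_estimators": [13, 14, 15, 16, 17],
    "max_depth": [11, 12, 13],
    # 'sqrt' is what the older 'auto' alias meant for classifiers.
    "max_features": ["sqrt", 1.0],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per cross-validation fold.
n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)
```

The best `max_depth` (13) and `n_estimators` (17) both sit at the upper edge of their ranges, which suggests that widening the grid might be worth trying.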

### Title

In [17]:
vect = TfidfVectorizer()
X_title_train = vect.fit_transform(X_train['title'].tolist())
X_title_test = vect.transform(X_test['title'].tolist())
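# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed.
titles_train = pd.DataFrame(X_title_train.toarray(), columns=vect.get_feature_names_out())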
titles_train.head(5)

,aaaaah,aaron,abbrevi,abernathi,abher,abi,abil,abl,abort,abram,...,yul,yuval,za,zen,zero,zodiac,zoe,zombi,zombifi,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Now we can start modelling using the arrays!

#### Logistic Regression

In [18]:
lr = LogisticRegression()
lr.fit(X_title_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [19]:
# Score on training data.
lr.score(X_title_train, y_train)

0.9309076042518397

In [20]:
# Score on testing data.
lr.score(X_title_test, y_test)

0.884709730171709

In [21]:
pred_title = lr.predict(X_title_test)

In [22]:
accuracy_score(y_test, pred_title)

0.884709730171709

#### Random Forest

In [25]:
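rf = RandomForestClassifier()
rf_param = {
    'n_estimators': [13, 14, 15, 16, 17],
    'max_depth': [15, 16, 17],
    # 'sqrt' is what the 'auto' alias meant for classifiers; newer scikit-learn removed 'auto'.
    'max_features': ['sqrt', 1.0]
}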
gs = GridSearchCV(rf, param_grid=rf_param, cv=5)
gs.fit(X_title_train, y_train)
print(gs.best_score_)
gs.best_params_

0.8626328699918234


{'max_depth': 17, 'max_features': 1.0, 'n_estimators': 14}

## Conclusion

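In conclusion, logistic regression on the post bodies had the best predictive power of the four models, with a test accuracy of about 92.6%. The random forest figures are mean cross-validation scores from the grid search rather than test-set accuracies, so the comparison is indicative rather than exact.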
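
The four reported figures can be lined up in a few lines of Python. The numbers are copied from the notebook outputs above; the dictionary layout and variable names are just one convenient way to arrange them.

```python
# Scores copied from the notebook outputs above.
# Logistic regression: accuracy on the held-out test set.
# Random forest: best mean 5-fold cross-validation score from GridSearchCV.
scores = {
    ("selftext", "logistic regression"): 0.9255928045789044,
    ("selftext", "random forest"): 0.868629054238212,
    ("title", "logistic regression"): 0.884709730171709,
    ("title", "random forest"): 0.8626328699918234,
}

# Pick the (text, model) pair with the highest score.
best = max(scores, key=scores.get)

# Print the models from best to worst.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for (text, model), score in ranked:
    print(f"{text:<9} {model:<20} {score:.3f}")
```

Both logistic regression models come out ahead of both random forests, and for each model type the post bodies score higher than the titles.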