# Project 3: Web APIs & NLP

---

## Part 3: Data Preprocessing & Modelling

### Contents:

- [Data Preprocessing](#Data-Preprocessing)
- [Modelling](#Modelling)
- [Summary](#Summary)
- [Recommendations and Future Works](#Recommendations-and-Future-Works)

---

#### Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

from bs4 import BeautifulSoup

from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (confusion_matrix, ConfusionMatrixDisplay,
                             accuracy_score, f1_score, roc_auc_score, RocCurveDisplay)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier

#### Read in datasets

In [None]:
ps4 = pd.read_csv("../datasets/ps4_clean.csv")
ps5 = pd.read_csv("../datasets/ps5_clean.csv")

In [None]:
ps4.info()

In [None]:
ps5.info()

## Data Preprocessing

In [None]:
# Merge both dataframes
df = pd.concat([ps4, ps5], axis=0) 

In [None]:
df.shape

In [None]:
df.info()

#### Binarize target variable

Convert PS4 and PS5 in `subreddit` into binary labels:
 - 0 for PS4
 - 1 for PS5

In [None]:
df['subreddit'] = df['subreddit'].map({'PS4': 0, 'PS5': 1})

In [None]:
df['subreddit'].value_counts()

After data cleaning, more valid posts remain from the PS5 subreddit than from the PS4 subreddit.

As the PS5 is Sony's latest gaming console, we can expect more active discussion of new game releases, announcements and reviews than for the PS4, resulting in a higher number of valid posts.

Because the dataset is imbalanced, accuracy alone can be misleading, so we also use the F1 score to assess the classifiers.

#### Engineer new feature `post`

During data cleaning, many `selftext` rows were found to be blank. However, we can still use the words in `title`. We engineer a new feature, `post`, by combining `title` and `selftext`, so that every row has text available for modelling.

In [None]:
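# Create new `post` consisting of `title` and `selftext`.
# If blank `selftext` values were read back from CSV as NaN, fill them first so `post` is never NaN.
df['post'] = df['title'] + ' ' + df['selftext'].fillna('')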

In [None]:
df.head(2)
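
The snippet below is a minimal sketch, on a hypothetical two-row frame rather than the project data, of how string concatenation behaves when `selftext` is missing. It assumes blank `selftext` values arrive as missing values (as they typically do after a round trip through CSV); the names `toy`, `naive` and `safe` are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame: the second post has no body text.
toy = pd.DataFrame({
    "title": ["PS5 restock today", "Best co-op games?"],
    "selftext": ["Saw units at the store.", None],
})

# Plain concatenation propagates the missing value to the whole post.
naive = toy["title"] + " " + toy["selftext"]

# Filling missing bodies with an empty string keeps the title.
safe = (toy["title"] + " " + toy["selftext"].fillna("")).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```

If `df.info()` shows the same non-null count for `selftext` as for `title`, the fill is a no-op and the plain concatenation is already safe.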

In [None]:
# Drop unwanted features
df.drop(['id', 'title', 'selftext'], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
df.to_csv("../datasets/combined.csv", index=False)

#### Train/test split dataset

In [None]:
# Define features (X) and target (y) for the train/test split
X = df['post']
y = df['subreddit']

In [None]:
df.info()

In [None]:
y.value_counts(normalize=True)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    stratify=y,
                                                    random_state=42
                                                   )
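
As a rough illustration of what `stratify=y` does, the sketch below splits a hypothetical 70/30 label vector (chosen to resemble the class balance reported in this notebook, not taken from the data) and checks that both partitions keep the same proportion of positives. The variable names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 70 positives (PS5) and 30 negatives (PS4).
y_toy = pd.Series([1] * 70 + [0] * 30)
X_toy = pd.Series([f"post {i}" for i in range(100)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=42
)

# With stratification, the share of positives is preserved in both splits.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the test-set proportion would vary with the random seed, which matters more when one class is in the minority.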

In [None]:
X_train.shape

In [None]:
X_test.shape

#### Tokenize and lemmatize

In [None]:
# Update list of stopwords
stop_words = set(stopwords.words('english'))
add_stopwords = ['ps4', 'ps5', 'playstation', 'game', 'video', 'sony', 'ps', 'plus', 'got', 'just']

stop_words = stop_words.union(add_stopwords)

In [None]:
# Instantiate lemmatizer.
lemmatizer = WordNetLemmatizer()

In [None]:
# Create function to further clean `post`

def clean_post(raw_post):
    
    # 1. Remove HTML.
    html_removed = BeautifulSoup(raw_post, "html.parser").get_text()
    
    # 2. Remove http.
    http_removed = re.sub(r"http\S+", "", html_removed)
    
    # 3. Remove www.
    www_removed = re.sub(r"www\S+", "", http_removed)
    
    # 4. Remove non-letters.
    letters_only = re.sub("[^a-zA-Z]", " ",  www_removed)
    
    # 5. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    
    # 6. Lemmatize word tokens.
    tokens_lem = [lemmatizer.lemmatize(i) for i in words]
   
    # 7. Remove stopwords.
    meaningful_words = [w for w in tokens_lem if w not in stop_words]
    
    # 8. Join the words back into one string separated by space and return the result.
    return " ".join(meaningful_words)
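
To make the regex steps (2 to 5) of `clean_post` concrete, here is a small sketch that applies only those steps to a hypothetical sample string. It deliberately omits HTML removal, lemmatization and stop-word removal, so it is not a substitute for the full function.

```python
import re

# Hypothetical raw post text.
raw = "Check https://example.com and www.example.org for PS5 news!! 60fps"

no_http = re.sub(r"http\S+", "", raw)             # step 2: drop http(s) links
no_www = re.sub(r"www\S+", "", no_http)           # step 3: drop www links
letters_only = re.sub("[^a-zA-Z]", " ", no_www)   # step 4: keep letters only
tokens = letters_only.lower().split()             # step 5: lower-case and split

print(tokens)
```

One consequence worth noting: step 4 strips digits, so a token such as `PS5` loses its digit before the stop-word filter runs, and stop-word entries that contain digits cannot match any token.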

In [None]:
# Get the number of posts based on the dataframe size.
total_posts = df.shape[0]
print(f'There are {total_posts} posts.')

In [None]:
# Initialize an empty list to hold clean posts.
clean_train_post = []
clean_test_post = []

In [None]:
# Clean 'post' 

print("Cleaning and parsing the training set posts...")

# Instantiate counter.
j = 0

# For every post in our training set...
for train_post in X_train:
    
    # Convert post to words, then append to clean_train_post.
    clean_train_post.append(clean_post(train_post))
    
    # If the index is divisible by 1000, print a message.
    if (j + 1) % 1000 == 0:
        print(f'{j + 1} of {total_posts} posts.')
    
    j += 1

# Let's do the same for our testing set.
print("Cleaning and parsing the testing set posts...")

# For every post in our testing set...
for test_post in X_test:
    
    # Convert post to words, then append to clean_test_post.
    clean_test_post.append(clean_post(test_post))
    
    # If the index is divisible by 1000, print a message.
    if (j + 1) % 1000 == 0:
        print(f'{j + 1} of {total_posts} posts.')
        
    j += 1
print('Cleaning for all posts completed')

In [None]:
# Store cleaned posts back in train/test sets
X_train = clean_train_post
X_test = clean_test_post

## Modelling

We will create our models using a mixture of transformers and classifiers:
1. CountVectorizer and Multinomial Naive Bayes
2. TF-IDF Vectorizer and Multinomial Naive Bayes
3. CountVectorizer and Logistic Regression
4. TF-IDF Vectorizer and Logistic Regression
5. CountVectorizer and Decision Tree Classifier
6. TF-IDF Vectorizer and Decision Tree Classifier
7. CountVectorizer and Random Forest Classifier
8. TF-IDF Vectorizer and Random Forest Classifier
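
The eight combinations listed above can also be generated programmatically. The sketch below is one possible way to do this, not the approach used in the cells that follow (which define each pipeline by hand); the step names and dictionary keys are illustrative.

```python
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "nb": MultinomialNB,
    "lr": LogisticRegression,
    "dt": DecisionTreeClassifier,
    "rf": RandomForestClassifier,
}
vectorizers = {"cvec": CountVectorizer, "tvec": TfidfVectorizer}

# Classifier-major order matches the numbering 1 to 8 in the list above.
pipelines = {}
for (c_name, c_cls), (v_name, v_cls) in product(classifiers.items(), vectorizers.items()):
    pipelines[f"{v_name}+{c_name}"] = Pipeline([(v_name, v_cls()), (c_name, c_cls())])

print(list(pipelines))
```

Defining each pipeline separately, as the notebook does, makes it easier to give every model its own hyperparameter grid.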

#### Baseline model

We will first establish a baseline model for our prediction before running the classification models.

In [None]:
y_train.value_counts(normalize=True)

The baseline model, which always predicts the majority class (PS5), is correct about 70% of the time. We will evaluate whether the classification models improve on this by achieving higher accuracy scores.
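
A majority-class baseline can also be expressed as a scikit-learn estimator, which makes it easy to compute further metrics for it. The sketch below uses hypothetical labels with a 70/30 split to mirror the proportion quoted above; it does not use the project data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Hypothetical labels: 70% class 1 (PS5), 30% class 0 (PS4).
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.zeros((100, 1))  # features are ignored by DummyClassifier

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
acc = baseline.score(X_toy, y_toy)
f1 = f1_score(y_toy, baseline.predict(X_toy))

print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```

On such a split, always predicting class 1 gives perfect recall and a precision of 0.70, so the baseline's F1 for class 1 is about 0.82. This is a useful reference point when reading the F1 scores of the fitted models.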

#### Model 1 - CountVectorizer + MultinomialNB

In [None]:
# Set up pipeline, hyperparameters and GridSearch
pipe1 = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())
])

pipe1_params = {
    'cvec__max_features': [4000],      #[1000, 2000, 3000, 3500, 4500, 5000]
    'cvec__min_df': [1],               #[2, 3]
    'cvec__max_df': [.90],             #[0.95, 0.85, 0.80, ]
    'cvec__ngram_range': [(1,4)]       #[(1,1), (1,2), (1,3), (1,5) ]
}

gs1 = GridSearchCV(pipe1,
                   param_grid=pipe1_params, 
                   cv=5,
                   n_jobs=-1,
                   verbose=1
                  )

In [None]:
gs1.get_params().keys()

In [None]:
%%time
# Fit GridSearch to training data.
gs1.fit(X_train, y_train)

In [None]:
# Model 1 best parameters
gs1.best_params_

In [None]:
print(f"Model 1 Train score: {gs1.score(X_train, y_train)}")
print(f"Model 1 Test score: {gs1.score(X_test, y_test)}")
print(f"Model 1 CV score: {gs1.best_score_}")

In [None]:
gs1_features = gs1.best_estimator_[0].get_feature_names_out()
gs1_features

In [None]:
log_prob_diff = gs1.best_estimator_.steps[1][1].feature_log_prob_[1] - gs1.best_estimator_.steps[1][1].feature_log_prob_[0]
log_prob_diff

In [None]:
gs1_df = pd.DataFrame(log_prob_diff, index=gs1_features, columns=['log_prob'])

In [None]:
gs1_df.sort_values(by='log_prob', ascending=False).head(30)

In [None]:
# Get predictions
preds = gs1.predict(X_test)

# Save confusion matrix values
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
cm = confusion_matrix(y_test, preds)

In [None]:
# View confusion matrix
disp = ConfusionMatrixDisplay(cm)
disp.plot();

In [None]:
# Calculate F1 score
print(f"Model 1 F1 score: {f1_score(y_test, preds)}")

#### Model 2 - TF-IDF Vectorizer + MultinomialNB

In [None]:
# Set up pipeline, hyperparameters and GridSearch
pipe2 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

pipe2_params = {
    'tvec__max_features': [4500],     #[1000, 3000, 3500, 4000, 5000]
    'tvec__min_df': [2],              #[1,3]
    'tvec__max_df': [.85],            #[0.95, 0.9, 0.8]
    'tvec__ngram_range': [(1,2)]      #[(1,4), (1,3)]
}

gs2 = GridSearchCV(pipe2,
                   param_grid=pipe2_params, 
                   cv=5,
                   n_jobs=-1,
                   verbose=1
                  )

In [None]:
gs2.get_params().keys()

In [None]:
%%time
# Fit GridSearch to training data.
gs2.fit(X_train, y_train)

In [None]:
# Model 2 best parameters
gs2.best_params_

In [None]:
print(f"Model 2 Train score: {gs2.score(X_train, y_train)}")
print(f"Model 2 Test score: {gs2.score(X_test, y_test)}")
print(f"Model 2 CV score: {gs2.best_score_}")

In [None]:
gs2_features = gs2.best_estimator_[0].get_feature_names_out()

In [None]:
log_prob_diff = gs2.best_estimator_.steps[1][1].feature_log_prob_[1] - gs2.best_estimator_.steps[1][1].feature_log_prob_[0]

In [None]:
gs2_df = pd.DataFrame(log_prob_diff, index=gs2_features, columns=['log_prob'])

In [None]:
gs2_df.sort_values(by='log_prob', ascending=False).head(30)

In [None]:
# Get predictions
preds = gs2.predict(X_test)

# Save confusion matrix values
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
cm = confusion_matrix(y_test, preds)

In [None]:
# View confusion matrix
disp = ConfusionMatrixDisplay(cm)
disp.plot();

In [None]:
# Calculate F1 score
print(f"Model 2 F1 score: {f1_score(y_test, preds)}")

#### Model 3 - CountVectorizer + Logistic Regression

In [None]:
# Set up pipeline, hyperparameters and GridSearch
pipe3 = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression())
])

pipe3_params = {
    'cvec__max_features': [5000],           #[4000, 6000, 4500, 5500]
    'cvec__min_df': [2],                    #[1,3]
    'cvec__max_df': [0.85],                 #[0.9, 0.8]
    'cvec__ngram_range': [(1,2)],           #[(1,3), (1,4)]
    'lr__C': [0.1],                         #[10, 1, 0.01]
    'lr__penalty': ['l2'],                  #['l1']
    'lr__solver': ['sag'],                  #['liblinear', 'newton-cg', 'lbfgs']
    'lr__max_iter': [7000],
    'lr__class_weight': [None]              #['balanced']
}

gs3 = GridSearchCV(pipe3,
                   param_grid=pipe3_params, 
                   cv=5,
                   n_jobs=-1,
                   verbose=1
                  )

In [None]:
gs3.get_params().keys()

In [None]:
%%time
# Fit GridSearch to training data.
gs3.fit(X_train, y_train)

In [None]:
# Model 3 best parameters
gs3.best_params_

In [None]:
print(f"Model 3 Train score: {gs3.score(X_train, y_train)}")
print(f"Model 3 Test score: {gs3.score(X_test, y_test)}")
print(f"Model 3 CV score: {gs3.best_score_}")

In [None]:
# Get predictions
preds = gs3.predict(X_test)

# Save confusion matrix values
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
cm = confusion_matrix(y_test, preds)

In [None]:
# View confusion matrix
disp = ConfusionMatrixDisplay(cm)
disp.plot();

In [None]:
# Calculate F1 score
print(f"Model 3 F1 score: {f1_score(y_test, preds)}")

In [None]:
gs3_features = gs3.best_estimator_[0].get_feature_names_out()
gs3_features.shape

In [None]:
gs3_coef = np.exp(gs3.best_estimator_.named_steps.lr.coef_).T
gs3_coef.shape

In [None]:
gs3_df = pd.DataFrame(gs3_coef, index=gs3_features, columns=['coef'])

In [None]:
gs3_df.sort_values(by='coef', ascending=False).head(30)
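
Note that the `coef` column holds exponentiated coefficients. For logistic regression, `exp(coefficient)` is the factor by which the odds of class 1 are multiplied when the feature increases by one unit, so a value of 1 means no effect. The sketch below uses hypothetical coefficients and feature names to show that exponentiating changes the scale but not the ranking.

```python
import numpy as np

# Hypothetical logistic regression coefficients (log-odds scale).
features = np.array(["trailer", "console", "dualshock"])
coefs = np.array([1.2, 0.0, -0.7])

# Odds ratios: >1 pushes towards class 1, <1 towards class 0, 1 is neutral.
odds_ratios = np.exp(coefs)

for name, beta, ratio in zip(features, coefs, odds_ratios):
    print(f"{name:10s} coef={beta:+.2f} odds ratio={ratio:.3f}")

# exp() is monotonic, so sorting by either quantity gives the same order.
same_order = np.array_equal(np.argsort(coefs), np.argsort(odds_ratios))
print(same_order)
```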

#### Model 4 - TF-IDF Vectorizer + Logistic Regression

In [None]:
# Set up pipeline, hyperparameters and GridSearch
pipe4 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression())
])

pipe4_params = {
    'tvec__max_features': [5000],           #[4000, 6000, 4500, 5500]
    'tvec__min_df': [1,2],                  #[1,3]
    'tvec__max_df': [0.85],                 #[0.9, 0.8]
    'tvec__ngram_range': [(1,2)],           #[(1,3), (1,4)]
    'lr__C': [1],                           #[10, 0.1, 0.01]
    'lr__penalty': ['l2'],                  #['l1']
    'lr__solver': ['sag'],                  #['sag','liblinear', 'newton-cg', 'lbfgs']
    'lr__max_iter': [7000],
    'lr__class_weight': [None]              #['balanced']
}

gs4 = GridSearchCV(pipe4,
                   param_grid=pipe4_params, 
                   cv=5,
                   n_jobs=-1,
                   verbose=1
                  )

In [None]:
gs4.get_params().keys()

In [None]:
%%time
# Fit GridSearch to training data.
gs4.fit(X_train, y_train)

In [None]:
# Model 4 best parameters
gs4.best_params_

In [None]:
print(f"Model 4 Train score: {gs4.score(X_train, y_train)}")
print(f"Model 4 Test score: {gs4.score(X_test, y_test)}")
print(f"Model 4 CV score: {gs4.best_score_}")

In [None]:
# Get predictions
preds = gs4.predict(X_test)

# Save confusion matrix values
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
cm = confusion_matrix(y_test, preds)

In [None]:
# View confusion matrix
disp = ConfusionMatrixDisplay(cm)
disp.plot();

In [None]:
# Calculate F1 score
print(f"Model 4 F1 score: {f1_score(y_test, preds)}")

In [None]:
gs4_features = gs4.best_estimator_[0].get_feature_names_out()


In [None]:
gs4_coef = np.exp(gs4.best_estimator_.named_steps.lr.coef_).T


In [None]:
gs4_coef.shape

In [None]:
gs4_df = pd.DataFrame(gs4_coef, index=gs4_features, columns=['coef'])

In [None]:
gs4_df.sort_values(by='coef', ascending=False).head(30)

#### Model 5 - CountVectorizer + Decision Tree

In [None]:
# Set up pipeline, hyperparameters and GridSearch
pipe5 = Pipeline([
    ('cvec', CountVectorizer()),
    ('dt', DecisionTreeClassifier())
])

pipe5_params = {
    'cvec__max_features': [4000],            #[4500, 5000]
    'cvec__min_df': [3],                     #[1, 2]
    'cvec__max_df': [0.85],                  #[0.90, 0.80]
    'cvec__ngram_range': [(1,3)],            #[(1,2), (1,4)]
    'dt__ccp_alpha': [0],                   #[0.1, 1]                              
    'dt__max_depth': [20],                   #[10, 30]             
    'dt__min_samples_leaf': [1],             #[2, 5]
    'dt__min_samples_split': [30],           #[10, 20, 40]
    'dt__random_state': [42]                       
}

gs5 = GridSearchCV(pipe5,
                   param_grid=pipe5_params, 
                   cv=5,
                   n_jobs=-1,
                   verbose=1
                  )

In [None]:
gs5.get_params().keys()

In [None]:
%%time
# Fit GridSearch to training data.
gs5.fit(X_train, y_train)

In [None]:
# Model 5 best parameters
gs5.best_params_

In [None]:
print(f"Model 5 Train score: {gs5.score(X_train, y_train)}")
print(f"Model 5 Test score: {gs5.score(X_test, y_test)}")
print(f"Model 5 CV score: {gs5.best_score_}")

In [None]:
# Get predictions
preds = gs5.predict(X_test)

# Save confusion matrix values
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
cm = confusion_matrix(y_test, preds)

In [None]:
# View confusion matrix
disp = ConfusionMatrixDisplay(cm)
disp.plot();

In [None]:
# Calculate F1 score
print(f"Model 5 F1 score: {f1_score(y_test, preds)}")

In [None]:
gs5_features = gs5.best_estimator_[0].get_feature_names_out()

In [None]:
gs5_coef = gs5.best_estimator_.steps[1][1].feature_importances_
gs5_coef

In [None]:
gs5_df = pd.DataFrame(gs5_coef, index=gs5_features, columns=['node_prob'])

In [None]:
gs5_df.sort_values(by='node_prob', ascending=False).head(30)
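
The values stored in the `node_prob` column come from `feature_importances_`, which for scikit-learn trees are impurity-based importances normalised to sum to 1. The sketch below, on a hypothetical four-row dataset, shows a feature that perfectly separates the classes receiving all of the importance.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: column 0 separates the classes, column 1 carries no signal.
X_toy = np.array([[0, 5], [0, 3], [1, 5], [1, 3]])
y_toy = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)
importances = tree.feature_importances_

print(importances, importances.sum())
```

Because these are importances rather than probabilities or signed coefficients, they indicate how much a feature is used for splitting but not which subreddit it points to.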

#### Model 6 - TF-IDF Vectorizer + Decision Tree

In [None]:
# Set up pipeline, hyperparameters and GridSearch
pipe6 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('dt', DecisionTreeClassifier())
])

pipe6_params = {
    'tvec__max_features': [4500],            #[4000, 5000, 5500]
    'tvec__min_df': [1],                     #[2]
    'tvec__max_df': [0.85],                  #[0.80, 0.90]
    'tvec__ngram_range': [(1,2)],            #[(1,3)]
    'dt__ccp_alpha': [0] ,                   #[0.01, 0.1, 1, 10]                              
    'dt__max_depth': [20],                   #[None, 5, 10]             
    'dt__min_samples_leaf': [1],             #[2, 5, 10]
    'dt__min_samples_split': [20],           #[2,10,20,50]
    'dt__random_state': [42]                       
}

gs6 = GridSearchCV(pipe6,
                   param_grid=pipe6_params, 
                   cv=5,
                   n_jobs=-1,
                   verbose=1
                  )

In [None]:
gs6.get_params().keys()

In [None]:
%%time
# Fit GridSearch to training data.
gs6.fit(X_train, y_train)

In [None]:
# Model 6 best parameters
gs6.best_params_

In [None]:
print(f"Model 6 Train score: {gs6.score(X_train, y_train)}")
print(f"Model 6 Test score: {gs6.score(X_test, y_test)}")
print(f"Model 6 CV score: {gs6.best_score_}")

In [None]:
# Get predictions
preds = gs6.predict(X_test)

# Save confusion matrix values
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
cm = confusion_matrix(y_test, preds)

In [None]:
# View confusion matrix
disp = ConfusionMatrixDisplay(cm)
disp.plot();

In [None]:
# Calculate F1 score
print(f"Model 6 F1 score: {f1_score(y_test, preds)}")

In [None]:
gs6_features = gs6.best_estimator_[0].get_feature_names_out()
gs6_features

In [None]:
gs6_coef = gs6.best_estimator_.steps[1][1].feature_importances_
gs6_coef

In [None]:
gs6_df = pd.DataFrame(gs6_coef, index=gs6_features, columns=['node_prob'])

In [None]:
gs6_df.sort_values(by='node_prob', ascending=False).head(30)

#### Model 7 - CountVectorizer + Random Forest

In [None]:
# Set up pipeline, hyperparameters and GridSearch
pipe7 = Pipeline([
    ('cvec', CountVectorizer()),
    ('rt', RandomForestClassifier())
])

pipe7_params = {
    'cvec__max_features': [4500],         #[4000, 5000]      
    'cvec__min_df': [1],                  #[2]        
    'cvec__max_df': [0.85],               #[0.90]
    'cvec__ngram_range': [(1,2)],         #[(1,3)]     
    'rt__ccp_alpha': [0],                 #[1, 0.1]         
    'rt__max_depth': [None],              #[20]         
    'rt__min_samples_leaf': [1],            
    'rt__min_samples_split': [40],        #[20, 60] 
    'rt__n_estimators': [200],            #[100, 300]
    'rt__random_state': [42]                   
}

gs7 = GridSearchCV(pipe7,
                   param_grid=pipe7_params, 
                   cv=5,
                   n_jobs=-1,
                   verbose=1
                  )

In [None]:
gs7.get_params().keys()

In [None]:
%%time
# Fit GridSearch to training data.
gs7.fit(X_train, y_train)

In [None]:
# Model 7 best parameters
gs7.best_params_

In [None]:
print(f"Model 7 Train score: {gs7.score(X_train, y_train)}")
print(f"Model 7 Test score: {gs7.score(X_test, y_test)}")
print(f"Model 7 CV score: {gs7.best_score_}")

In [None]:
# Get predictions
preds = gs7.predict(X_test)

# Save confusion matrix values
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
cm = confusion_matrix(y_test, preds)

In [None]:
# Calculate F1 score
print(f"Model 7 F1 score: {f1_score(y_test, preds)}")

In [None]:
gs7_features = gs7.best_estimator_[0].get_feature_names_out()

In [None]:
gs7_coef = gs7.best_estimator_.steps[1][1].feature_importances_
gs7_coef

In [None]:
gs7_df = pd.DataFrame(gs7_coef, index=gs7_features, columns=['node_prob'])

In [None]:
gs7_df.sort_values(by='node_prob', ascending=False).head(30)

#### Model 8 - TF-IDF Vectorizer + Random Forest

In [None]:
# Set up pipeline, hyperparameters and GridSearch
pipe8 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('rt', RandomForestClassifier())
])

pipe8_params = {
    'tvec__max_features': [4500],      #[4000, 5000]      
    'tvec__min_df': [2],               #[1,3 ]        
    'tvec__max_df': [0.85],                  
    'tvec__ngram_range': [(1,3)],      #[(1,2), (1,4)]     
    'rt__ccp_alpha': [0],                              
    'rt__max_depth': [None],           #[20]         
    'rt__min_samples_leaf': [1],            
    'rt__min_samples_split': [40],    #[20, 60] 
    'rt__n_estimators': [200],         # [100, 300]
    'rt__random_state': [42]                     
}

gs8 = GridSearchCV(pipe8,
                   param_grid=pipe8_params, 
                   cv=5,
                   n_jobs=-1,
                   verbose=1
                  )

In [None]:
%%time
# Fit GridSearch to training data.
gs8.fit(X_train, y_train)

In [None]:
# Model 8 best parameters
gs8.best_params_

In [None]:
print(f"Model 8 Train score: {gs8.score(X_train, y_train)}")
print(f"Model 8 Test score: {gs8.score(X_test, y_test)}")
print(f"Model 8 CV score: {gs8.best_score_}")

In [None]:
# Get predictions
preds = gs8.predict(X_test)

# Save confusion matrix values
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
cm = confusion_matrix(y_test, preds)

In [None]:
# View confusion matrix
disp = ConfusionMatrixDisplay(cm)
disp.plot();

In [None]:
# Calculate F1 score
print(f"Model 8 F1 score: {f1_score(y_test, preds)}")

In [None]:
gs8_features = gs8.best_estimator_[0].get_feature_names_out()

In [None]:
gs8_coef = gs8.best_estimator_.steps[1][1].feature_importances_
gs8_coef

In [None]:
gs8_df = pd.DataFrame(gs8_coef, index=gs8_features, columns=['node_prob'])

In [None]:
gs8_df.sort_values(by='node_prob', ascending=False).head(30)

## Summary

#### Classification Models

| Classification Model                       | Train Score | Test Score | Cross Val Score | F1 Score |
|--------------------------------------------|-------------|------------|-----------------|----------|
| 1. CountVectorizer + Multinomial NB        | 0.79849     | 0.73531    | 0.74214         | 0.82652  |
| 2. TF-IDF Vectorizer + Multinomial NB      | 0.79223     | 0.74093    | 0.74840         | 0.83910  |
| 3. CountVectorizer + Logistic Regression   | 0.83453     | 0.74604    | 0.75019         | 0.83710  |
| **4. TF-IDF Vectorizer + Logistic Regression** | **0.81919**     | **0.75013**    | **0.75300**         | **0.84077**  |
| 5. CountVectorizer + Decision Tree         | 0.78469     | 0.72305    | 0.72451         | 0.82750  |
| 6. TF-IDF Vectorizer + Decision Tree       | 0.78776     | 0.71794    | 0.72170         | 0.82432  |
| 7. CountVectorizer + Random Forest         | 0.93087     | 0.73020    | 0.73869         | 0.83088  |
| 8. TF-IDF Vectorizer + Random Forest       | 0.96179     | 0.73684    | 0.74240         | 0.83252  |

Results from the various models are summarised in the table above. The best-performing model combines the TF-IDF Vectorizer with Logistic Regression; it obtains the highest test, cross-validation and F1 scores among all the models.

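All of the models overfit to some degree: each scores higher on the training set than on the test set, and the gap is largest for the two Random Forest models (train scores of 0.93 and 0.96 against test scores of about 0.73). The cross-validation score is the mean accuracy over the 5 held-out folds of the training set, so it estimates performance on unseen data rather than reducing overfitting by itself. This is why the cross-validation scores are much closer to their respective test scores than the train scores are.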
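
The relationship between train, cross-validation and test scores can be illustrated on synthetic data. The sketch below is an assumption-laden toy (noisy labels from `make_classification`, an unpruned decision tree), not a re-run of the models above: the tree memorises the training set, while the cross-validation mean gives a far more sober estimate that can be compared with the held-out test score.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (30% of labels are randomly reassigned).
X_syn, y_syn = make_classification(n_samples=500, n_features=20, flip_y=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # accuracy on data the tree has seen
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=42), X_tr, y_tr, cv=5).mean()
test_acc = tree.score(X_te, y_te)   # accuracy on held-out data

print(f"train={train_acc:.3f}, cv={cv_acc:.3f}, test={test_acc:.3f}")
```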

The classification models did not score much higher than the baseline: the best test accuracy is about 0.75 against a baseline of about 0.70. This is likely due to the imbalanced dataset. We also report the F1 score, the harmonic mean of precision and recall, which is low whenever either of the two is low and so gives a better picture of misclassified posts than accuracy alone. In our case, false positives and false negatives are of equal importance, hence the F1 score is used.
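
As a worked example of the harmonic mean, the sketch below computes precision, recall, F1 and accuracy from a hypothetical confusion matrix. The counts are made up for illustration and are not taken from any of the models above.

```python
# Hypothetical confusion-matrix counts for 200 test posts (class 1 = PS5).
tn, fp, fn, tp = 50, 40, 10, 100

precision = tp / (tp + fp)   # share of predicted PS5 posts that really are PS5
recall = tp / (tp + fn)      # share of actual PS5 posts that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}, accuracy={accuracy:.3f}")
```

The true negatives do not appear in the F1 formula, so with the default settings `f1_score` describes performance on class 1 (PS5) only. Reporting it alongside accuracy, as the table does, covers both views.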


In [None]:
# ROC curve
RocCurveDisplay.from_estimator(gs4, X_test, y_test)

# add a diagonal reference line (a classifier with no skill)
plt.plot([0,1], [0,1], label='baseline', linestyle='--')

# add a legend
plt.legend();

In [None]:
# Calculate ROC AUC.
roc_auc_score(y_test, gs4.predict_proba(X_test)[:,1])
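
ROC AUC can be read as the probability that a randomly chosen positive post receives a higher score than a randomly chosen negative one. The sketch below checks this interpretation on four hypothetical scores by comparing `roc_auc_score` with a direct pairwise count.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for class 1.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, scores)

# Pairwise view: fraction of (positive, negative) pairs ranked correctly.
pos = [s for s, label in zip(scores, y_true) if label == 1]
neg = [s for s, label in zip(scores, y_true) if label == 0]
pairwise = sum(p > n for p, n in product(pos, neg)) / (len(pos) * len(neg))

print(auc, pairwise)
```

A value of 0.5 corresponds to the diagonal line in the plot, and 1.0 to a perfect ranking.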

#### Top features

In [None]:
model_df = gs4_df.sort_values(by='coef', ascending=False).head(30)

In [None]:
plt.figure(figsize = (15,8))
ax = sns.barplot(data=model_df, y=model_df.index, x='coef', color='slateblue')
plt.ylabel('features', size=14)
plt.xlabel('coefficient', size=14)
plt.title('Top 30 features with highest coefficients', size=14);

Among the top 30 features, 'trailer' has the largest exponentiated coefficient, which likely reflects discussions of upcoming game releases. Other words such as 'upgrade', 'preview' and 'development' also rank highly for r/PS5, reflecting the ongoing changes and development for the console. Technical terms specific to the PS5, such as the DualSense wireless controller and Variable Refresh Rate (VRR), appear in the list as well. Lastly, game titles exclusive to the PS5 such as 'Bloodhunt' and 'Returnal' are included.


## Recommendations and Future Works



We have created a classification model using TF-IDF Vectorizer and Logistic Regression to differentiate between r/PS5 and r/PS4 posts. Gamers can use the classifier to determine which subreddit is more suitable to submit their post.

Our best model did not score much higher than the baseline model, likely due to the imbalanced dataset. To improve the model, we can consider scraping more data from the subreddits to obtain a more balanced dataset. Other classification models, such as Support Vector Machines or K-Nearest Neighbors, can also be explored and reviewed.
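
As a starting point for that exploration, the sketch below shows how a linear Support Vector Machine could be dropped into the same pipeline structure. It is an untested suggestion rather than part of this project: the toy posts are hypothetical, and `class_weight="balanced"` is an assumption about how the class imbalance might be handled.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Same two-step layout as the pipelines above, with a linear SVM as classifier.
svm_pipe = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("svc", LinearSVC(class_weight="balanced")),  # reweights classes inversely to frequency
])

# Hypothetical cleaned posts: 1 = r/PS5, 0 = r/PS4.
toy_posts = [
    "dualsense haptic feedback trailer",
    "returnal upgrade preview",
    "dualshock light bar battery",
    "old console disc drive noise",
]
toy_labels = [1, 1, 0, 0]

svm_pipe.fit(toy_posts, toy_labels)
preds = svm_pipe.predict(["dualsense trailer preview", "dualshock battery"])
print(preds)
```

Such a pipeline could be wrapped in `GridSearchCV` in the same way as the eight models, with `svc__C` as a natural hyperparameter to tune.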

For future work, we could expand the project to classify other similar subreddits such as [r/playstation](https://www.reddit.com/r/playstation/) or [r/gaming](https://www.reddit.com/r/gaming/). Given that images and videos are commonly used in Reddit posts, we can also consider extracting features from such media and adding them to our models.