## Problem Statement

We want to distinguish those who are **willfully wrong** about history from those who are merely **uninformed**. The former are more dangerous: because they present themselves as correct, more people are likely to follow their misinformation. The two categories can be analyzed by using the Pushshift API to collect posts from two subreddits, **r/badhistory** and **r/AskHistorians**.

The point of this exercise is to determine which classification model best predicts which subreddit a post comes from. The final model could be used by media companies (including Reddit itself) that want to combat misinformation, by showing which words are associated with actively false and misleading claims.

## 1. Data Collection and Cleanup

In [None]:
import requests
import pandas as pd
import time
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import text
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, plot_confusion_matrix

In [None]:
url_badhistory = "https://api.pushshift.io/reddit/search/submission?subreddit=badhistory"
res_badhistory = requests.get(url_badhistory)
res_badhistory.status_code

In [None]:
bh = res_badhistory.json()

In [None]:
# Double-checking to see if "data" is the only key
bh.keys()

In [None]:
df = pd.DataFrame(columns = ['Post', 'Subreddit'])
df

In [None]:
for post in bh['data']:
    if 'selftext' in post.keys():
        # Had to use this website for help:
        # https://www.geeksforgeeks.org/how-to-create-an-empty-dataframe-and-append-rows-columns-to-it-in-pandas/
        # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
        row = pd.DataFrame([{'Post': post['selftext'], 'Subreddit': 'badhistory'}])
        df = pd.concat([df, row], ignore_index = True)

# Check to make sure I have gotten the right posts
df.head()

In [None]:
# Creating a loop so I can get more data points
for i in range(50):
    before = bh['data'][-1]['created_utc']
    url_bh = "https://api.pushshift.io/reddit/search/submission?subreddit=badhistory&before=" + str(before)
    res_bh = requests.get(url_bh)
    bh = res_bh.json()
    
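    for post in bh['data']:
        if 'selftext' in post.keys():
            # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
            row = pd.DataFrame([{'Post': post['selftext'], 'Subreddit': 'badhistory'}])
            df = pd.concat([df, row], ignore_index = True)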
    
    # Creating a timer so the requests do not come too fast
    if i % 3 == 0:
        time.sleep(2)

In [None]:
url_askhistorians = "https://api.pushshift.io/reddit/search/submission?subreddit=askhistorians"
res_askhistorians = requests.get(url_askhistorians)
res_askhistorians.status_code

In [None]:
ah = res_askhistorians.json()

In [None]:
# Double-checking to see if "data" is the only key
ah.keys()

In [None]:
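for post in ah['data']:
    if 'selftext' in post.keys():
        # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
        row = pd.DataFrame([{'Post': post['selftext'], 'Subreddit': 'askhistorians'}])
        df = pd.concat([df, row], ignore_index = True)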

# Check to make sure I have gotten the right posts
df.tail()

In [None]:
# Creating a loop so I can get more data points
for i in range(50):
    before = ah['data'][-1]['created_utc']
    url_ah = "https://api.pushshift.io/reddit/search/submission?subreddit=askhistorians&before=" + str(before)
    res_ah = requests.get(url_ah)
    ah = res_ah.json()
    
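    for post in ah['data']:
        if 'selftext' in post.keys():
            # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
            row = pd.DataFrame([{'Post': post['selftext'], 'Subreddit': 'askhistorians'}])
            df = pd.concat([df, row], ignore_index = True)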
    
    # Creating a timer so the requests do not come too fast
    if i % 3 == 0:
        time.sleep(2)

In [None]:
df.drop_duplicates(inplace = True)

In [None]:
# Checking if there are blank posts
df['Post'].sort_values()

In [None]:
df = df[df['Post'] != '']

In [None]:
df.to_csv('../data/history_data.csv')

## 2. Word Counts

In [None]:
wnl = WordNetLemmatizer()

# Lemmatizing words for better predictions
def lemmatize_columns(df):
    for column in df.columns:
        lem = wnl.lemmatize(column)
    
        if lem != column:
            # If lemmatized word is already a column, add counts to total; otherwise replace column
            if lem in df.columns:
                df[lem] = df[lem] + df[column]
            else:
                df[lem] = df[column]

            df.drop(columns = column, inplace = True)
    return df

In [None]:
# The most common non-stop words in r/AskHistorians posts
X_askhistorians = df[df['Subreddit'] == 'askhistorians']['Post']
cv_askhistorians = text.CountVectorizer(stop_words = text.ENGLISH_STOP_WORDS)
Xcv_askhistorians = cv_askhistorians.fit_transform(X_askhistorians)
Xcv_askhistorians_df = pd.DataFrame(Xcv_askhistorians.toarray(), 
                                    columns = cv_askhistorians.get_feature_names())
Xcv_askhistorians_df = lemmatize_columns(Xcv_askhistorians_df)

In [None]:
Xcv_askhistorians_df.sum().sort_values(ascending = False).head(15)

In [None]:
Xcv_askhistorians_df.sum().sort_values(ascending = False).head(15).plot(kind='barh');

In [None]:
# The most common non-stop words in r/badhistory posts
X_badhistory = df[df['Subreddit'] == 'badhistory']['Post']
cv_badhistory = text.CountVectorizer(stop_words = text.ENGLISH_STOP_WORDS)
Xcv_badhistory = cv_badhistory.fit_transform(X_badhistory)
Xcv_badhistory_df = pd.DataFrame(Xcv_badhistory.toarray(), 
                                    columns = cv_badhistory.get_feature_names())
Xcv_badhistory_df = lemmatize_columns(Xcv_badhistory_df)

In [None]:
Xcv_badhistory_df.sum().sort_values(ascending = False).head(15)

In [None]:
Xcv_badhistory_df.sum().sort_values(ascending = False).head(15).plot(kind='barh');

Eight words appear in the top 15 for both subreddits and occur less than 10 times as often in r/badhistory as in r/AskHistorians: history, war, people, time, did, like, just, and year. These words will be excluded from my predictive model.
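The selection rule described above can be sketched as a small helper. This is only an illustration: the function name `shared_top_words`, its parameters, and the counts below are hypothetical and are not the notebook's actual totals.

```python
from collections import Counter

def shared_top_words(ask_counts, bad_counts, top_n=15, max_ratio=10):
    """Return words that are in both top-N lists and whose r/badhistory
    count is less than max_ratio times their r/AskHistorians count."""
    top_ask = {word for word, _ in Counter(ask_counts).most_common(top_n)}
    top_bad = {word for word, _ in Counter(bad_counts).most_common(top_n)}
    shared = top_ask & top_bad
    return sorted(w for w in shared if bad_counts[w] < max_ratio * ask_counts[w])

# Hypothetical word counts, for illustration only
ask_counts = {'history': 120, 'war': 90, 'roman': 40, 'people': 10}
bad_counts = {'history': 400, 'war': 300, 'http': 900, 'people': 150}

# 'people' is shared but appears 15 times as often in bad_counts, so it is kept out
shared = shared_top_words(ask_counts, bad_counts, top_n=4)
print(shared)
```

With the real data, the two dictionaries could be built from the summed count columns, for example `Xcv_askhistorians_df.sum().to_dict()`.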

Given that http, com, www, amp, and org appear very frequently in r/badhistory, it seems that posts with links are much more likely to appear in r/badhistory than in r/AskHistorians.
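One way to check the link hypothesis directly, rather than inferring it from token counts, is to measure the share of posts in each subreddit that contain a link. The sketch below assumes a simple regular expression is enough to spot links; `LINK_PATTERN`, `link_share`, and the sample posts are hypothetical.

```python
import re

# Assumption: a post "has a link" if it contains http://, https://, or www.
LINK_PATTERN = re.compile(r'https?://|www\.')

def link_share(posts):
    """Fraction of posts that contain at least one link."""
    if not posts:
        return 0.0
    return sum(1 for post in posts if LINK_PATTERN.search(post)) / len(posts)

# Hypothetical posts, for illustration only
bad_posts = [
    'Debunking http://example.com/video',
    'See www.example.org for the source',
    'No link in this one',
    'Also https://example.com',
]
ask_posts = ['Why did Rome fall?', 'How accurate is https://example.com/article?']

print(link_share(bad_posts), link_share(ask_posts))
```

On the notebook's data, the inputs could be `list(X_badhistory)` and `list(X_askhistorians)`.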

## 3. Vectorizing

In [None]:
X = df['Post']
y = df['Subreddit'].map({'askhistorians': 1, 'badhistory': 0})

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

In [None]:
y.value_counts(normalize = True)

The majority class accounts for 55% of the posts, so 55% is the baseline accuracy that my model needs to beat.
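The baseline is the accuracy of a model that always predicts the most common class. A minimal sketch of that calculation follows; `baseline_accuracy` is a hypothetical helper, and the 55/45 label split is made up to mirror the figure above (which class is the larger one here is arbitrary).

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels with a 55/45 split, for illustration only
labels = [1] * 55 + [0] * 45
baseline = baseline_accuracy(labels)
print(baseline)
```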

In [None]:
# Only including words in at least 2 posts
cv = text.CountVectorizer(stop_words = text.ENGLISH_STOP_WORDS.union(['history', 'war', 'people', 'time', 
                                                                 'did', 'like', 'just', 'year']), 
                     min_df = 2)
Xcv_train = cv.fit_transform(X_train)
Xcv_test = cv.transform(X_test)

In [None]:
Xcv_train_df = pd.DataFrame(Xcv_train.toarray(), columns = cv.get_feature_names())
Xcv_test_df = pd.DataFrame(Xcv_test.toarray(), columns = cv.get_feature_names())

In [None]:
Xcv_train_df = lemmatize_columns(Xcv_train_df);
Xcv_test_df = lemmatize_columns(Xcv_test_df);

## 4. Model Analysis

In [None]:
mnb = MultinomialNB()
mnb.fit(Xcv_train_df, y_train)
mnb.score(Xcv_train_df, y_train)

In [None]:
mnb.score(Xcv_test_df, y_test)

In [None]:
lr = LogisticRegression()
lr.fit(Xcv_train_df, y_train)
lr.score(Xcv_train_df, y_train)

In [None]:
lr.score(Xcv_test_df, y_test)

In [None]:
rfc = RandomForestClassifier()
rfc.fit(Xcv_train_df, y_train)
rfc.score(Xcv_train_df, y_train)

In [None]:
rfc.score(Xcv_test_df, y_test)

The Multinomial Naive Bayes, Logistic Regression, and Random Forest models all have high scores on both the training and test sets. Random Forest has the highest training score and Logistic Regression has the highest test score, but all three show roughly the same gap between the two, so further analysis is needed to determine which model is best. Since every model's training score exceeds its test score by more than 0.1, the models are likely overfit.
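The overfitting check above compares each model's training and test accuracy against a 0.1 threshold. As a rough sketch, it could be written as follows; `overfit_gaps` is a hypothetical helper, and the score pairs are invented for illustration rather than taken from the cells above.

```python
def overfit_gaps(scores, threshold=0.1):
    """Map each model name to (train minus test gap, whether the gap exceeds threshold)."""
    report = {}
    for name, (train_score, test_score) in scores.items():
        gap = train_score - test_score
        report[name] = (round(gap, 3), gap > threshold)
    return report

# Hypothetical (train, test) accuracy pairs, not the notebook's actual results
scores = {
    'naive_bayes': (0.95, 0.82),
    'logistic_regression': (0.99, 0.86),
    'random_forest': (0.90, 0.85),
}
report = overfit_gaps(scores)
for name, (gap, flagged) in report.items():
    print(name, gap, 'likely overfit' if flagged else 'ok')
```

In the notebook, each pair would come from calling `.score` on the training and test frames, as in the cells above.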

In [None]:
mnb_preds = mnb.predict(Xcv_test_df)
tn, fp, fn, tp = confusion_matrix(y_test, mnb_preds).ravel()
plot_confusion_matrix(mnb, Xcv_test_df, y_test, cmap='Reds');

In [None]:
# Sensitivity
print(tp/(tp+fn))

# Specificity
print(tn/(tn+fp))

# Precision
print(tp/(tp+fp))

In [None]:
lr_preds = lr.predict(Xcv_test_df)
tn, fp, fn, tp = confusion_matrix(y_test, lr_preds).ravel()
plot_confusion_matrix(lr, Xcv_test_df, y_test, cmap='Reds');

In [None]:
# Sensitivity
print(tp/(tp+fn))

# Specificity
print(tn/(tn+fp))

# Precision
print(tp/(tp+fp))

In [None]:
rfc_preds = rfc.predict(Xcv_test_df)
tn, fp, fn, tp = confusion_matrix(y_test, rfc_preds).ravel()
plot_confusion_matrix(rfc, Xcv_test_df, y_test, cmap='Reds');

In [None]:
# Sensitivity
print(tp/(tp+fn))

# Specificity
print(tn/(tn+fp))

# Precision
print(tp/(tp+fp))

The logistic regression model heavily outscores the naive Bayes model in sensitivity, but is worse in precision and much worse in specificity. The random forest model has both specificity and sensitivity in between those of the other two models.

Since the point of this problem is to correctly identify those who are **wrong** about history, it is more important to correctly identify r/badhistory posts, which are the negative class (0) in this encoding. Thus, it would be better to have higher specificity than higher sensitivity.
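To make the three metrics concrete, here is a small sketch that computes them from the four confusion-matrix counts. Because `askhistorians` is mapped to 1 and `badhistory` to 0, specificity is the share of r/badhistory posts that are labeled correctly. The helper name `rates` and the counts are hypothetical.

```python
def rates(tn, fp, fn, tp):
    """Sensitivity, specificity, and precision from confusion-matrix counts."""
    return {
        'sensitivity': tp / (tp + fn),   # share of r/AskHistorians posts found
        'specificity': tn / (tn + fp),   # share of r/badhistory posts found
        'precision': tp / (tp + fp),     # share of predicted r/AskHistorians posts that are correct
    }

# Hypothetical counts, for illustration only
metrics = rates(tn=40, fp=10, fn=5, tp=45)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

The counts can be passed straight from `confusion_matrix(y_test, preds).ravel()`, which returns them in the order `tn, fp, fn, tp` for binary labels.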

Given both its high specificity and its accuracy, I would recommend using the **random forest classifier** to determine whether a post comes from r/AskHistorians or r/badhistory based on its contents, and thus to determine which words are more closely associated with false and misleading information.
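A fitted random forest exposes `feature_importances_`, which can be used to rank words by how much they contribute to the classification. The sketch below uses a tiny made-up corpus, so the corpus, labels, and variable names are assumptions rather than the notebook's data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
posts = [
    'why did rome fall', 'how did rome trade',
    'what did rome eat', 'when did rome expand',
    'this video claims myth', 'that article repeats myth',
    'another myth debunked here', 'popular myth about vikings',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = askhistorians, 0 = badhistory

cv = CountVectorizer()
X = cv.fit_transform(posts)
rfc = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, labels)

# vocabulary_ maps each term to its column index, so sorting by index
# lines the terms up with feature_importances_
terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
ranked = sorted(zip(terms, rfc.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for term, importance in ranked[:5]:
    print(f'{term}: {importance:.3f}')
```

Importances only show how useful a word is for separating the classes, not which subreddit it points to; comparing the word's counts per subreddit would be needed for that.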

As r/badhistory contains so many posts with links, and link terms do not have much to do with misinformation, future research would need to show whether an algorithm can reliably tell non-link posts in r/badhistory from those in r/AskHistorians.
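One possible starting point for that follow-up is to strip links from every post before vectorizing, so that tokens such as http, www, and com cannot drive the prediction. This is a sketch under the assumption that a simple regular expression catches most links; `URL_PATTERN`, `strip_links`, and the sample posts are hypothetical.

```python
import re

# Assumption: links start with http://, https://, or www. and run to the next whitespace
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def strip_links(post):
    """Remove URLs from a post and collapse the leftover whitespace."""
    return ' '.join(URL_PATTERN.sub(' ', post).split())

# Hypothetical posts, for illustration only
posts = [
    'See https://example.com/page for the claim',
    'www.example.org',
    'No links here',
]
cleaned = [strip_links(post) for post in posts]

# Posts that consisted only of a link become empty and can be dropped
non_empty = [post for post in cleaned if post]
print(non_empty)
```

Applied to the notebook's data, this could look like `df['Post'] = df['Post'].apply(strip_links)` followed by the same blank-post filter used in the cleanup step.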