### Project 3

### Classification Modeling on Subreddits: Futurists / Scientists

#### Library imports

In [72]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score


import warnings
# Hide deprecation/future warnings (note: this suppresses all other warnings too)
warnings.simplefilter('ignore')

from IPython.display import Markdown, display
pd.set_option('display.max_rows', 200) # Set pandas' max row display
pd.set_option('display.max_columns', 85) # Set pandas' max column count
pd.set_option('display.max_colwidth', 1_000) # Set pandas' max column width

# pseudo-markdown in code cells
def printmd(string):
    display(Markdown(string))
# ref: https://discuss.analyticsvidhya.com/t/how-to-make-a-text-bold-within-print-statement-in-ipython-notebook/14552/2

#### Data collection

I used the Python Reddit API Wrapper (PRAW) to collect the data. The steps I took are in the `PRAW_data_collection` notebook, located in the code folder of this repository.

The East Coast local instructors were very generous with their walkthrough of the process, so I definitely credit them for the ease of the data collection.
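For readers without the collection notebook at hand, here is a minimal sketch of what a PRAW pull into a DataFrame could look like. It is not the code from `PRAW_data_collection`: the helper names `submissions_to_frame` and `fetch_new_posts` are hypothetical, and the mapping from PRAW submission attributes (`num_comments`, `created_utc`, `selftext`) to the columns used later (`comms_num`, `created`, `body`) is my assumption. The demo at the bottom uses a made-up stand-in object, so it runs without Reddit credentials.

```python
from types import SimpleNamespace

import pandas as pd


def submissions_to_frame(submissions, subreddit_label):
    """Turn an iterable of submission-like objects into a DataFrame indexed by post id."""
    rows = [
        {
            "id": s.id,
            "title": s.title,
            "score": s.score,
            "url": s.url,
            "comms_num": s.num_comments,   # assumed source of `comms_num`
            "created": s.created_utc,      # assumed source of `created`
            "body": s.selftext or None,    # link posts have an empty selftext
            "subreddit": subreddit_label,
        }
        for s in submissions
    ]
    return pd.DataFrame(rows).set_index("id")


def fetch_new_posts(client_id, client_secret, user_agent, name, limit=1000):
    """Hypothetical helper: pull the newest posts of one subreddit (needs API credentials)."""
    import praw  # imported here so the demo below runs without PRAW installed

    reddit = praw.Reddit(
        client_id=client_id, client_secret=client_secret, user_agent=user_agent
    )
    return submissions_to_frame(reddit.subreddit(name).new(limit=limit), name.lower())


# Offline demo with a made-up submission (not real data)
fake_posts = [
    SimpleNamespace(
        id="abc123", title="A made-up title", score=1, url="https://example.com",
        num_comments=0, created_utc=1602664000.0, selftext="",
    )
]
demo = submissions_to_frame(fake_posts, "futurology")
print(demo)
```

Keeping the DataFrame construction separate from the API call makes the column mapping easy to check without hitting Reddit.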

#### Preprocessing and EDA

In [31]:
subred1 = pd.read_csv('../data/df_with_both_subs.csv', index_col = 'id')
subred1 = subred1.drop(columns = 'Unnamed: 0')
pd.set_option('display.max_colwidth', 50)  # narrow the columns for this preview
display(subred1.head())

| id | title | score | url | comms_num | created | body | subreddit |
|---|---|---|---|---|---|---|---|
| japxs3 | What would the point be to do anything if AI c... | 0 | https://www.reddit.com/r/Futurology/comments/j... | 4 | 1602664000.0 | The more I look into AI and new projects like ... | futurology |
| japq75 | ELCC Explained: the Critical Renewable Energy ... | 4 | https://blog.ucsusa.org/mark-specht/elcc-expla... | 0 | 1602663000.0 | NaN | futurology |
| jaojnb | There's a 50-50 chance we're living in a simul... | 8 | https://boingboing.net/2020/10/13/new-research... | 4 | 1602659000.0 | NaN | futurology |
| jaofpy | Eight nations sign NASA's Artemis Accords, ple... | 3 | https://www.engadget.com/nasa-artemis-accords-... | 1 | 1602658000.0 | NaN | futurology |
| jaocqq | Mercedes benz AVTR - In Action | 8 | https://www.youtube.com/watch?v=ChqM3zqTREQ&ab... | 0 | 1602658000.0 | NaN | futurology |


---

In [24]:
printmd('**Value counts:**')
display(subred1['subreddit'].value_counts())

printmd('**Value counts by weight:**')
subred1['subreddit'].value_counts(normalize = True)

**Value counts:**

science       931
futurology    869
Name: subreddit, dtype: int64

**Value counts by weight:**

science       0.517222
futurology    0.482778
Name: subreddit, dtype: float64

* I may want to return to the subreddits to collect a bigger dataset; 1,800 posts is not much to work with.

* The classes are slightly unbalanced (about 52% science vs. 48% futurology), so that will be a consideration during the preprocessing / get-more-data phase.
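One practical consequence of the class split is the baseline a classifier has to beat: always guessing the majority class. The sketch below recomputes it from the counts printed above; the variable names are mine.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"science": 931, "futurology": 869}, name="subreddit")

# Share of each class, same as value_counts(normalize=True)
shares = counts / counts.sum()

# Accuracy of a model that always predicts the majority class
baseline_accuracy = shares.max()
print(f"Majority class: {shares.idxmax()}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any model scoring close to this figure on held-out data is doing little better than guessing.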

In [33]:
subred1[['title', 'subreddit']]

| id | title | subreddit |
|---|---|---|
| japxs3 | What would the point be to do anything if AI c... | futurology |
| japq75 | ELCC Explained: the Critical Renewable Energy ... | futurology |
| jaojnb | There's a 50-50 chance we're living in a simul... | futurology |
| jaofpy | Eight nations sign NASA's Artemis Accords, ple... | futurology |
| jaocqq | Mercedes benz AVTR - In Action | futurology |
| ... | ... | ... |
| ir136p | Adults with positive SARS-CoV-2 test results w... | science |
| ir0vem | 50% of Phosphorus Lost to Erosion | science |
| ir0pvm | Motivated Helplessness in the Context of the C... | science |
| iqzn0v | Political ideology may explain why despair spr... | science |


In [97]:
# Keep only the rows whose body is not null
non_null_body = subred1.loc[subred1['body'].notnull(), ['body', 'subreddit']]

printmd(f"**Only {len(non_null_body)} non-null `body` records.**")

**Only 69 non-null `body` records.**
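With 69 of 1,800 posts carrying body text, `body` covers under 4% of the data. A quick way to see whether that coverage differs by subreddit is to group the non-null flag by class. The sketch below runs on a made-up toy frame, not the real data; only the last line uses figures from this notebook.

```python
import pandas as pd

# Toy stand-in for subred1 (made-up rows, for illustration only)
toy = pd.DataFrame(
    {
        "body": ["some text", None, None, "more text", None],
        "subreddit": ["futurology", "futurology", "science", "science", "science"],
    }
)

# Count and share of posts with a body, per subreddit
has_body = toy["body"].notna()
summary = has_body.groupby(toy["subreddit"]).agg(["sum", "mean"])
print(summary)

# Overall share in the real data, from the counts reported above
overall_share = 69 / (931 + 869)
print(f"Overall share with body text: {overall_share:.1%}")
```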

In [95]:
# Check out the non-null body text

for i in range(5):
    display(non_null_body.iloc[i, :])

printmd('<br>')

print(non_null_body[-5:])


body         The more I look into AI and new projects like GPT-3, we humans suck. Sure GPT-3 isn't on the level as a human brain, but that doesn't mean gpt-4 or gpt-5 won't be.  I can already see the massive jobs loss as AI will offset more jobs than it will create. I'm just feeling a little existential crisis here
subreddit                                                                                                                                                                                                                                                                                                         futurology
Name: japxs3, dtype: object

body         As our dependence on computer technologies increases, will kids who were not allowed to use electronics (phones, computers, tablets) as toddlers become stunted because they wouldn't know how to use computers?\n\nWill computers be so ingrained in society that toddlers need to learn how to use them just as how they need to learn language?\n\nBy this metric, is it child abuse if you DON'T teach your kids how to use computers?
subreddit                                                                                                                                                                                                                                                                                                                                                                                                                                    futurology
Name: jamgnw, dtype: object

body         Please post all climate change news here unless the submission is an unique event that is a global headline across several trusted news sources.
subreddit                                                                                                                                          futurology
Name: jag7ht, dtype: object

body         Is there any effort by others to bring the other technologies such as, Molecular Manufacturing into reality?
subreddit                                                                                                      futurology
Name: ja7ygc, dtype: object

body         Everyday we're learing more efficient ways to synthesize food. Eventually there's going to be a synthetic dietary movement similar to vegetarianism and veganism. Probably sooner rather than later. And it won't be restricted to meat, synthetic vegetables and fruit will be included.\n\nSo what would a term for people who only eat synthetic food be? Artificivore? Synthetivore? Synthetarian? \n\nPitch ideas bellow.
subreddit                                                                                                                                                                                                                                                                                                                                                                                                                        futurology
Name: ja5exx, dtype: object

<br>


In [98]:
printmd('**Can we get enough content from post titles?**')

**Can we get enough content from post titles?**
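One rough way to start answering this is to look at how many words a title holds in each subreddit. The sketch below uses a toy frame with made-up titles; on the real data the same two lines would run against `subred1[['title', 'subreddit']]`. The `n_words` column name is my own.

```python
import pandas as pd

# Toy titles (made up for illustration, not taken from the dataset)
titles = pd.DataFrame(
    {
        "title": [
            "Solar panels get cheaper again",
            "Robots may take more jobs",
            "Study links sleep to memory",
            "New result on soil erosion rates",
        ],
        "subreddit": ["futurology", "futurology", "science", "science"],
    }
)

# Whitespace-delimited word count per title
titles["n_words"] = titles["title"].str.split().str.len()

# Distribution of title length per class
summary = titles.groupby("subreddit")["n_words"].describe()
print(summary[["count", "mean", "min", "max"]])
```

Word counts only show how much text there is to work with; whether the vocabulary separates the two classes is a question for the modeling step.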