#### Project 3: Reddit NLP
#### Corey J Sinnott
# Data Cleaning, Initial EDA and Early Featurization

## Executive Summary

This report applies natural language processing (NLP) to posts from two subreddits on Reddit.com. The data comprises 8,000 posts: 4,000 from r/AskALiberal and 4,000 from r/AskAConservative. The problem statement: can we classify which subreddit a post belongs to? Conclusions and recommendations follow the in-depth analysis.

*See model_classification_exec_summary.ipynb for the full summary, data dictionary, and findings.*

## Contents:
- [Initial EDA & Cleaning](#Initial-EDA-&-Cleaning)
- [EDA and Featurization](#EDA-and-Featurization)

#### Importing Libraries

In [94]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import language_tool_python
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
from textblob import Blobber
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [12]:
#cd project-3

# Initial EDA & Cleaning
 - Reading in data and exploring basic metrics.
 - Cleaning data, which includes removing "removed by moderator" posts.

In [13]:
df = pd.read_csv('./data/full_pull_4000_each_incl_self_text.csv')

In [14]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,selftext,subreddit,created_utc
0,0,Biden plans to cancel the Keystone XL pipeline...,[https://www.cbc.ca/amp/1.5877038](https://www...,AskALiberal,1610945588
1,1,2020 Best of r/AskALiberal Results,"#Good afternoon, everyone!\n\n\nThe winners an...",AskALiberal,1610943721
2,2,Place your bets: will Trump be removed by forc...,We already know Trump will not attend Biden's ...,AskALiberal,1610942754
3,3,Have you ever gotten conservatives to rethink ...,I’m had both positive/negative conversations f...,AskALiberal,1610942080
4,4,Who is winning the culture war right now?,Liberals? Conservatives? China?,AskALiberal,1610939740


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   8000 non-null   int64 
 1   title        8000 non-null   object
 2   selftext     7209 non-null   object
 3   subreddit    8000 non-null   object
 4   created_utc  8000 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 312.6+ KB


In [16]:
df = df.drop(columns = ['Unnamed: 0'])

In [17]:
df.title.describe()

count                                                  8000
unique                                                 7761
top       Trump fires Esper. McConnell meets with Barr a...
freq                                                      6
Name: title, dtype: object

In [18]:
# drops only fully identical rows; repeated titles with different timestamps or bodies remain
df = df.drop_duplicates()

In [19]:
#dropping nulls for now; will most likely use all titles for further exploration later
df = df.dropna()
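
The cleaning goal above mentions removing "removed by moderator" posts, while the cells shown only drop exact duplicates and nulls. A minimal sketch of an explicit filter follows. It assumes, without confirmation from the source, that such posts carry the placeholder strings `[removed]` or `[deleted]` in `selftext`; `PLACEHOLDERS` and `is_placeholder` are hypothetical names introduced here for illustration.

```python
# Placeholder strings ASSUMED (not confirmed by the source) to mark
# moderator-removed or user-deleted posts in the selftext column.
PLACEHOLDERS = {'[removed]', '[deleted]'}

def is_placeholder(text):
    """Return True if text consists only of a removal/deletion placeholder."""
    return isinstance(text, str) and text.strip().lower() in PLACEHOLDERS

# Small stand-in for the selftext column
sample = ['[removed]', 'Liberals? Conservatives? China?', ' [deleted] ', None]
kept = [t for t in sample if isinstance(t, str) and not is_placeholder(t)]
print(kept)
```

On the DataFrame itself, the same helper could be applied with `df = df[~df['selftext'].map(is_placeholder)]`. Checking `df['selftext'].value_counts().head()` first would show whether such placeholders are present at all.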

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7209 entries, 0 to 7999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        7209 non-null   object
 1   selftext     7209 non-null   object
 2   subreddit    7209 non-null   object
 3   created_utc  7209 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 281.6+ KB


# EDA and Featurization
 - Creating new features:
     - Length metrics using character and word counts
     - Sentiment analysis using TextBlob
     - Grammar analysis using language_tool_python (LanguageTool)

In [21]:
#creating new columns for post and title length (characters) and word count
df['post_length'] = [len(i) for i in df['selftext']]
df['post_word_count'] = [len(i.split()) for i in df['selftext']]
df['title_length'] = [len(i) for i in df['title']]
df['title_word_count'] = [len(i.split()) for i in df['title']]
df.head(2)

Unnamed: 0,title,selftext,subreddit,created_utc,post_length,post_word_count,title_length,title_word_count
0,Biden plans to cancel the Keystone XL pipeline...,[https://www.cbc.ca/amp/1.5877038](https://www...,AskALiberal,1610945588,83,3,143,26
1,2020 Best of r/AskALiberal Results,"#Good afternoon, everyone!\n\n\nThe winners an...",AskALiberal,1610943721,3412,416,34,5


In [22]:
# separating for simpler EDA
df_lib = df[df.subreddit == 'AskALiberal']
df_lib.head(2)

Unnamed: 0,title,selftext,subreddit,created_utc,post_length,post_word_count,title_length,title_word_count
0,Biden plans to cancel the Keystone XL pipeline...,[https://www.cbc.ca/amp/1.5877038](https://www...,AskALiberal,1610945588,83,3,143,26
1,2020 Best of r/AskALiberal Results,"#Good afternoon, everyone!\n\n\nThe winners an...",AskALiberal,1610943721,3412,416,34,5


In [23]:
df_cons = df[df.subreddit == 'askaconservative']
df_cons.head(2)

Unnamed: 0,title,selftext,subreddit,created_utc,post_length,post_word_count,title_length,title_word_count
4000,Why are do so many liberals think conservatism...,Its like saying communism is right leaning con...,askaconservative,1599996054,258,45,89,15
4001,Are you familiar with the CIA’s “Operation Con...,Basically the CIA supported a bunch of South A...,askaconservative,1599985914,380,54,61,9


In [24]:
df_lib.sort_values('post_length').describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
created_utc,3979.0,1605385000.0,3189884.0,1600006000.0,1602755000.0,1605012000.0,1608112000.0,1610946000.0
post_length,3979.0,439.433,729.7412,1.0,22.0,223.0,538.5,12956.0
post_word_count,3979.0,71.51872,118.8088,1.0,2.0,37.0,89.0,2315.0
title_length,3979.0,83.191,49.31885,3.0,49.0,71.0,105.0,299.0
title_word_count,3979.0,14.34406,8.543239,1.0,8.0,12.0,18.0,56.0


In [25]:
df_cons.sort_values('post_length').describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
created_utc,3230.0,1592490000.0,4194537.0,1584374000.0,1589224000.0,1592442000.0,1595972000.0,1599996000.0
post_length,3230.0,378.7715,662.8354,1.0,9.0,211.0,498.0,19893.0
post_word_count,3230.0,62.587,97.78238,1.0,1.0,35.0,84.0,2007.0
title_length,3230.0,75.11486,48.55969,3.0,42.0,63.0,93.0,299.0
title_word_count,3230.0,12.67399,8.33288,1.0,7.0,11.0,16.0,59.0


#### Analyzing Grammar

In [26]:
# testing grammar evaluation tool
# https://github.com/jxmorris12/language_tool_python
tool = language_tool_python.LanguageTool('en-US')
text = 'This are bad.'
matches = tool.check(text)
len(matches)

2

In [28]:
#defining a function to rate grammar
#later simply added columns for grammar EDA
def grammarater(source_1, source_2, samples):
    # compare the first `samples` texts of each source: total errors and errors per 100 characters
    for label, source in (('source_1', source_1), ('source_2', source_2)):
        total_errors = 0
        total_chars = 0
        for text in list(source)[:samples]:
            total_errors += len(tool.check(text))
            total_chars += len(text)
        error_rate = total_errors / total_chars * 100 if total_chars else 0.0
        print(f'{label} total errors: {total_errors}')
        print(f'{label} has an error rate of {error_rate:.2f} per 100 characters')

In [29]:
#grammarater(df_lib['title'], df_cons['title'], 5)

In [30]:
df['grammar_errors'] = [tool.check(i) for i in df['selftext']]

In [31]:
df['num_of_grammar_errors'] = [len(i) for i in df['grammar_errors']]

In [134]:
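# NOTE: the denominator is the number of rows (7209), so this is a rescaled error count, not a per-post rate
df['gramm_err_rate'] = df.num_of_grammar_errors / len(df.num_of_grammar_errors) * 100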
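
The `gramm_err_rate` values displayed in this notebook equal the error count divided by the number of rows: 55 / 7209 × 100 ≈ 0.7629 for the second post. If the intent was a rate relative to each post's own length, a sketch along these lines may be closer. `errors_per_100_chars` is a hypothetical helper, and normalising by character count rather than word count is an assumption.

```python
def errors_per_100_chars(num_errors, post_length):
    """Grammar errors per 100 characters of a single post."""
    if post_length <= 0:
        return 0.0  # avoid dividing by zero for empty posts
    return num_errors / post_length * 100

# (num_of_grammar_errors, post_length) for two rows shown in this notebook
rows = [(55, 3412), (5, 726)]
rates = [round(errors_per_100_chars(e, n), 3) for e, n in rows]
print(rates)  # [1.612, 0.689]
```

The column-wise equivalent in pandas would be `df['num_of_grammar_errors'] / df['post_length'] * 100`, which is safe here because the minimum `post_length` reported above is 1.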

In [135]:
#error counts also pick up formatting text in the selftext body
#so results will be skewed if one subreddit uses more formatting
df.head(3)

Unnamed: 0,title,selftext,subreddit,created_utc,post_length,post_word_count,title_length,title_word_count,num_of_grammar_errors,selftext_polarity,selftext_subjectivity,gramm_err_rate
0,Biden plans to cancel the Keystone XL pipeline...,[https://www.cbc.ca/amp/1.5877038](https://www...,AskALiberal,1610945588,83,3,143,26,0,0.0,0.0,0.0
1,2020 Best of r/AskALiberal Results,"#Good afternoon, everyone!\n\n\nThe winners an...",AskALiberal,1610943721,3412,416,34,5,55,0.263574,0.449266,0.762935
2,Place your bets: will Trump be removed by forc...,We already know Trump will not attend Biden's ...,AskALiberal,1610942754,726,117,69,13,5,-0.11,0.375952,0.069358


In [56]:
df = df.drop(columns = ['grammar_errors'])

In [33]:
df[df['subreddit'] == 'AskALiberal']['num_of_grammar_errors'].describe()

count    3979.000000
mean        2.104046
std         4.596240
min         0.000000
25%         0.000000
50%         0.000000
75%         2.000000
max        83.000000
Name: num_of_grammar_errors, dtype: float64

In [34]:
df[df['subreddit'] == 'askaconservative']['num_of_grammar_errors'].describe()

count    3230.000000
mean        1.947368
std         5.645045
min         0.000000
25%         0.000000
50%         0.000000
75%         2.000000
max       216.000000
Name: num_of_grammar_errors, dtype: float64

#### Analyzing Sentiment

In [39]:
df['selftext_polarity'] = [TextBlob(i).polarity for i in df['selftext']]

In [42]:
df['selftext_subjectivity'] = [TextBlob(i).subjectivity for i in df['selftext']]

In [47]:
#df['pos_or_neg'] = [TextBlob(i).sentiment for i in df['selftext']]

In [125]:
df.sort_values(by = ['selftext_polarity']).tail(10)

Unnamed: 0,title,selftext,subreddit,created_utc,post_length,post_word_count,title_length,title_word_count,num_of_grammar_errors,selftext_polarity,selftext_subjectivity
5360,Donald Trump's reelection chances and whether ...,If the election was held today do you think Do...,askaconservative,1594871677,169,34,71,11,0,0.8,0.4
2435,Be honest: who do you think is going to win an...,I think Biden will win but I'm still nervous.,AskALiberal,1604104911,45,9,60,14,1,0.8,0.4
7441,Reagan vs Trump,I see many people hail Donald Trump as the gre...,askaconservative,1587188308,348,57,15,3,2,0.833333,0.833333
1668,What's your opinion on Toll Roads?,"Do you think that they're a good idea, or not ...",AskALiberal,1605812413,81,16,34,6,0,0.85,0.45
7616,how do you feel about appropriate ppe?,how do you feel knowing that the “best healthc...,askaconservative,1586503567,209,34,38,7,2,1.0,0.3
3546,"From a strategic standpoint, why is Biden even...",65% of Americans say that their minds won't be...,AskALiberal,1601248123,539,45,62,10,2,1.0,0.3
6024,Would you be happy if Candace Owens or Nikki H...,I'm impressed by both of them!,askaconservative,1592503483,30,6,74,14,0,1.0,1.0
1980,What's the best way to help in the Georgia run...,"This January, we have a chance to pick up 2 se...",AskALiberal,1605031947,146,27,51,10,0,1.0,0.3
771,"Former conservatives and other right-wingers, ...",And what do you think is the best way to get o...,AskALiberal,1609017746,108,21,77,11,0,1.0,0.3
7403,Solution to pollution?,I feel that plastic pollution in the ocean sho...,askaconservative,1587353428,317,60,22,3,0,1.0,0.3


In [124]:
df.sort_values(by = ['selftext_subjectivity']).tail(10)

Unnamed: 0,title,selftext,subreddit,created_utc,post_length,post_word_count,title_length,title_word_count,num_of_grammar_errors,selftext_polarity,selftext_subjectivity
5920,What are your thoughts on the 95 percent decre...,https://www.independent.co.uk/news/world/ameri...,askaconservative,1592840791,328,30,105,17,0,0.0,1.0
6501,What do you think of qualified immunity?,Do you view it as a valid doctrine? Judicial a...,askaconservative,1591105492,94,17,40,7,1,0.0,1.0
2379,"The Siena/NYT Upshot polls of AZ, PA, WI, and ...",NYT: [“Final polls show Biden ahead.The presid...,AskALiberal,1604248779,170,13,78,14,1,-0.375,1.0
4505,Simply hypothetical...,"Imagine it has happened, the extreme leftists ...",askaconservative,1598068722,217,37,22,2,0,-0.125,1.0
2009,What do you think of lockdowns and mandates?,"As the title suggests, I want to know what you...",AskALiberal,1604981503,113,19,44,8,0,0.0,1.0
4585,How do conservatives feel about the smoking gu...,Does it not matter or should he be perp walked...,askaconservative,1597775986,100,23,86,15,1,-0.5,1.0
550,What are your thoughts on Jovan Pulitzers test...,Link to video in comments. Only 10 minutes.,AskALiberal,1609789074,43,8,71,11,0,0.0,1.0
3140,How do you feel about the ACLU?,Are you generally supportive of it? Why or why...,AskALiberal,1602367736,51,10,31,7,0,0.5,1.0
6528,What do you think about the UK vetoing Trumps ...,Just curious about what the perspective over h...,askaconservative,1591044459,218,41,66,13,0,-0.1,1.0
122,Can a president be impeached after leaving off...,Like can Obama get impeached today or is only ...,AskALiberal,1610650121,101,17,50,8,0,0.0,1.0


In [133]:
df.sort_values(by = ['selftext_polarity']).describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
created_utc,7209.0,1599607000.0,7390874.0,1584374000.0,1593113000.0,1601048000.0,1605473000.0,1610946000.0
post_length,7209.0,412.2536,701.1563,1.0,13.0,217.0,521.0,19893.0
post_word_count,7209.0,67.51685,109.9689,1.0,1.0,36.0,87.0,2315.0
title_length,7209.0,79.57248,49.14119,3.0,46.0,68.0,99.0,299.0
title_word_count,7209.0,13.59578,8.489777,1.0,8.0,12.0,17.0,59.0
num_of_grammar_errors,7209.0,2.033847,5.093163,0.0,0.0,0.0,2.0,216.0
selftext_polarity,7209.0,0.06031956,0.1588505,-1.0,0.0,0.0,0.1333333,1.0
selftext_subjectivity,7209.0,0.3109562,0.2598272,0.0,0.0,0.3666667,0.5,1.0


In [130]:
df.sort_values(by = ['selftext_subjectivity']).describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
created_utc,7209.0,1599607000.0,7390874.0,1584374000.0,1593113000.0,1601048000.0,1605473000.0,1610946000.0
post_length,7209.0,412.2536,701.1563,1.0,13.0,217.0,521.0,19893.0
post_word_count,7209.0,67.51685,109.9689,1.0,1.0,36.0,87.0,2315.0
title_length,7209.0,79.57248,49.14119,3.0,46.0,68.0,99.0,299.0
title_word_count,7209.0,13.59578,8.489777,1.0,8.0,12.0,17.0,59.0
num_of_grammar_errors,7209.0,2.033847,5.093163,0.0,0.0,0.0,2.0,216.0
selftext_polarity,7209.0,0.06031956,0.1588505,-1.0,0.0,0.0,0.1333333,1.0
selftext_subjectivity,7209.0,0.3109562,0.2598272,0.0,0.0,0.3666667,0.5,1.0


In [72]:
lib_pol = df[df['subreddit'] == 'AskALiberal']['selftext_polarity'].mean()

lib_sub = df[df['subreddit'] == 'AskALiberal']['selftext_subjectivity'].mean()

con_pol = df[df['subreddit'] == 'askaconservative']['selftext_polarity'].mean()

con_sub = df[df['subreddit'] == 'askaconservative']['selftext_subjectivity'].mean()

In [93]:
# df['grammar_err_rate'] = [(i / x) if x > 0 else 0 for i, x in zip(df['num_of_grammar_errors'], df['post_length'])]
# df.head()

In [80]:
print(f'r/askaliberal polarity = {round((lib_pol), 3)}')
print(f'r/askaliberal subjectivity = {round((lib_sub), 3)}')
print(f'r/askaconservative polarity = {round((con_pol), 3)}')
print(f'r/askaconservative subjectivity = {round((con_sub), 3)}')

r/askaliberal polarity = 0.064
r/askaliberal subjectivity = 0.322
r/askaconservative polarity = 0.056
r/askaconservative subjectivity = 0.297


In [95]:
cvzr = CountVectorizer(stop_words='english', max_features=200)

In [96]:
df_lib = df[df.subreddit == 'AskALiberal']
df_cons = df[df.subreddit == 'askaconservative']

## CountVectorizing to Analyze Top Words, Bi-grams, and Tri-grams

In [99]:
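# fit_transform returns a sparse matrix, so wrap each result in a DataFrame labelled with
# that fit's own vocabulary (get_feature_names_out needs scikit-learn >= 1.0) before refitting
self_text_lib = pd.DataFrame(cvzr.fit_transform(df_lib['selftext']).toarray(), columns=cvzr.get_feature_names_out())
title_text_lib = pd.DataFrame(cvzr.fit_transform(df_lib['title']).toarray(), columns=cvzr.get_feature_names_out())
self_text_cons = pd.DataFrame(cvzr.fit_transform(df_cons['selftext']).toarray(), columns=cvzr.get_feature_names_out())
title_text_cons = pd.DataFrame(cvzr.fit_transform(df_cons['title']).toarray(), columns=cvzr.get_feature_names_out())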

In [105]:
self_text_lib.sum().sort_values(ascending = False).head(10)

systemic       1476
news           1322
left           1233
state          1154
having         1062
best            925
immigration     923
protests        823
china           720
democratic      712
dtype: int64

In [107]:
title_text_lib.sum().sort_values(ascending = False).head(10)

thoughts    636
systemic    595
beliefs     458
left        222
tax         210
economy     205
non         202
does        179
leftists    159
use         158
dtype: int64

In [109]:
self_text_cons.sum().sort_values(ascending = False).head(10)

point          1069
tax             821
republicans     819
trying          818
left            805
immigration     690
come            654
healthcare      591
getting         580
community       570
dtype: int64

In [110]:
title_text_cons.sum().sort_values(ascending = False).head(10)

trump            495
think            391
conservatives    368
conservative     311
people           163
does             151
thoughts         138
feel             134
support          114
like             113
dtype: int64

In [101]:
# askaliberal bigrams
cvzr_bi = CountVectorizer(stop_words='english', max_features=200, ngram_range=(2,2))

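# wrap each sparse result in a DataFrame labelled with that fit's vocabulary so .sum() is per bigram
self_text_lib_bigrams = pd.DataFrame(cvzr_bi.fit_transform(df_lib['selftext']).toarray(), columns=cvzr_bi.get_feature_names_out())
title_text_lib_bigrams = pd.DataFrame(cvzr_bi.fit_transform(df_lib['title']).toarray(), columns=cvzr_bi.get_feature_names_out())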

In [111]:
self_text_lib_bigrams.sum().sort_values(ascending = False).head(10)

https www           624
org wiki            120
wikipedia org       120
supreme court       118
https en            115
en wikipedia        115
joe biden           109
don think            96
united states        91
democratic party     91
dtype: int64

In [112]:
title_text_lib_bigrams.sum().sort_values(ascending = False).head(10)

joe biden               79
supreme court           64
think trump             41
democratic party        39
trump supporters        34
covid 19                28
donald trump            27
united states           22
biden administration    21
look like               21
dtype: int64

In [102]:
# askaconservative bigrams
cvzr_bi = CountVectorizer(stop_words='english', max_features=200, ngram_range=(2,2))

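# wrap each sparse result in a DataFrame labelled with that fit's vocabulary so .sum() is per bigram
self_text_cons_bigrams = pd.DataFrame(cvzr_bi.fit_transform(df_cons['selftext']).toarray(), columns=cvzr_bi.get_feature_names_out())
title_text_cons_bigrams = pd.DataFrame(cvzr_bi.fit_transform(df_cons['title']).toarray(), columns=cvzr_bi.get_feature_names_out())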

In [113]:
self_text_cons_bigrams.sum().sort_values(ascending = False).head(10)

https www        351
gt gt            169
covid 19          86
don know          70
ve seen           63
amp x200b         61
united states     59
feel like         58
14 2020           54
don think         53
dtype: int64
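
Several of the top bigrams above (`https www`, `gt gt`, `amp x200b`) look like residue from URLs and escaped HTML entities such as `&gt;` and `&amp;#x200B;` rather than from the writing itself. One possible pre-cleaning step is sketched below. `clean_text` is a hypothetical helper, and the assumption that the raw `selftext` contains doubly escaped entities and markdown links is inferred from the output, not stated by the source.

```python
import html
import re

MD_LINK = re.compile(r'\[([^\]]*)\]\([^)]*\)')  # [label](target)
URL = re.compile(r'https?://\S+')               # bare http/https URLs

def clean_text(text):
    """Strip entity, link, and URL residue before vectorizing."""
    # Unescape until stable: "&amp;#x200B;" needs two passes to become U+200B
    previous = None
    while previous != text:
        previous, text = text, html.unescape(text)
    text = MD_LINK.sub(r'\1', text)      # keep the link label, drop the target
    text = URL.sub(' ', text)            # drop bare URLs
    text = text.replace('\u200b', ' ')   # zero-width spaces are not split() whitespace
    return ' '.join(text.split())        # collapse runs of whitespace

raw = ('&gt;&gt; quote &amp;#x200B; see https://www.example.com/a '
       'and [this](https://en.wikipedia.org/wiki/X)')
print(clean_text(raw))  # >> quote see and this
```

Applied as `df['selftext'].map(clean_text)` before `fit_transform`, a step like this would keep the `max_features` budget for actual words. Whether links should be dropped or kept as a feature is a modelling choice the source leaves open.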

In [114]:
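# NOTE: repeats the r/AskALiberal title bigrams; the r/askaconservative counterpart is title_text_cons_bigrams
title_text_lib_bigrams.sum().sort_values(ascending = False).head(10)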

joe biden               79
supreme court           64
think trump             41
democratic party        39
trump supporters        34
covid 19                28
donald trump            27
united states           22
biden administration    21
look like               21
dtype: int64

In [103]:
# askaliberal trigrams
cvzr_tri = CountVectorizer(stop_words='english', max_features=200, ngram_range=(3,3))

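# wrap each sparse result in a DataFrame labelled with that fit's vocabulary so .sum() is per trigram
self_text_lib_trigrams = pd.DataFrame(cvzr_tri.fit_transform(df_lib['selftext']).toarray(), columns=cvzr_tri.get_feature_names_out())
title_text_lib_trigrams = pd.DataFrame(cvzr_tri.fit_transform(df_lib['title']).toarray(), columns=cvzr_tri.get_feature_names_out())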

In [115]:
title_text_lib_trigrams.sum().sort_values(ascending = False).head(10)

weekly general chat           16
amy coney barrett             16
askaliberal weekly general    16
think joe biden                9
worst things trump             7
green new deal                 7
esper mcconnell meets          6
think democratic party         6
meets barr backs               6
mcconnell meets barr           6
dtype: int64

In [104]:
# askaconservative trigrams
cvzr_tri = CountVectorizer(stop_words='english', max_features=200, ngram_range=(3,3))

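# wrap each sparse result in a DataFrame labelled with that fit's vocabulary so .sum() is per trigram
self_text_cons_trigrams = pd.DataFrame(cvzr_tri.fit_transform(df_cons['selftext']).toarray(), columns=cvzr_tri.get_feature_names_out())
title_text_cons_trigrams = pd.DataFrame(cvzr_tri.fit_transform(df_cons['title']).toarray(), columns=cvzr_tri.get_feature_names_out())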

In [117]:
title_text_cons_trigrams.sum().sort_values(ascending = False).head(10)

black lives matter           14
ranked choice voting          6
people socio political        5
think cringe people           5
cringe people socio           5
lives matter movement         5
response covid 19             5
socio political spectrum      5
000 millionaires stimulus     4
sexist negative things        4
dtype: int64