# Project 3 - Elliot Richardson

*This is the notebook in which I do my data acquisition, cleaning, and processing. For analysis, modeling, and findings, please check out the other jupyter notebook in this repository.*


## Table of Contents:

I. [Requesting content from subreddits](#Requesting-content-from-the-subreddits)

II. [Formatting requested content](#Formatting-requested-posts)

III. [Initial processing and cleaning](#Processing-part-1)

IV. [Processing and word selection](#Processing-part-2)

V. [Exporting for analysis](#Exporting-features-and-target-variable)

# Importing packages/libraries

In [26]:
import requests
import pandas as pd
import numpy as np
import string
import spacy
import time
import re

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import PunktSentenceTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer

# Requesting content from the subreddits

I created a function to grab the maximum number of posts allowed by the Pushshift API so that I could automate the process with a while loop.

In [25]:
def get_posts(subreddit,n=10,to_extend=None):
    """By default returns a list of 1,000 posts from the subreddit passed unless errors encountered"""
    """If no errors encountered, it will return n*100 posts"""
    """If you are tacking more posts on to an existing list, pass the list through the to_extend parameter"""
    
    url = 'https://api.pushshift.io/reddit/search/submission'
    
    # if no existing list given to tack these posts onto
    if to_extend == None:
        posts = []
        
    # if an existing list is given
    else:        
        posts = to_extend
              
    # extending list by n sets of 100 posts if no errors encountered
    for i in range(n):
        try:
            # if there are posts already in the list
            before = posts[-1]['created_utc']

        except:
            # if there aren't any posts in list yet
            before = None

    # tacking on the 100 posts before the last one already in the list
        params = {'subreddit': subreddit,'size': 100,'before': before}
        res = requests.get(url,params)
        try:
            data = res.json()
            posts.extend(data['data'])
        except:
            pass
        
    return posts

I used the above function and while loops to get at least 2,500 substantive posts from each subreddit. I exported the posts and their corresponding data to CSVs and read them back in. I'm going to do modeling and analysis in another notebook so I don't have to run this whole thing everytime!

In [26]:
s_posts = get_posts('Socialism')
s_df = pd.DataFrame(s_posts)
s_actual_posts = ~s_df['selftext'].isin(['removed','','[removed]','deleted'])
s_not_null = ~s_df['selftext'].isnull()

while s_df[s_actual_posts & s_not_null].shape[0] < 2800:
    s_posts = get_posts('Socialism',n=50,to_extend=s_posts)
    s_df = pd.DataFrame(s_posts)
    s_actual_posts = ~s_df['selftext'].isin(['removed','','[removed]','deleted'])
    s_not_null = ~s_df['selftext'].isnull()
    time.sleep(5)
    print(s_df[s_actual_posts & s_not_null].shape)

s_df = s_df.loc[(s_actual_posts & s_not_null),:]
s_df.duplicated('selftext').sum()

In [27]:
c_posts = get_posts('Capitalism')
c_df = pd.DataFrame(c_posts)
c_actual_posts = ~c_df['selftext'].isin(['removed','','[removed]','deleted'])
c_not_null = ~c_df['selftext'].isnull()

while c_df[c_actual_posts & c_not_null].shape[0] < 2800:
    c_posts = get_posts('Capitalism',n=50,to_extend=c_posts)
    c_df = pd.DataFrame(c_posts)
    c_actual_posts = ~c_df['selftext'].isin(['removed','','[removed]','deleted'])
    c_not_null = ~c_df['selftext'].isnull()
    time.sleep(5)
    print(c_df[c_actual_posts & c_not_null].shape)

c_df = c_df.loc[(c_actual_posts & c_not_null),:]
c_df.duplicated('selftext').sum()

In [20]:
s_df.to_csv('./data/socialism.csv',index=False)
c_df.to_csv('./data/capitalism.csv',index=False)

s = pd.read_csv('./data/socialism.csv')
c = pd.read_csv('./data/capitalism.csv')

# Formatting requested posts
Now that I have 2,500+ substantive posts from each subreddit, I am going to combine them into one dataframe by eliminating any columns that they don't have in common and adding a binary indicator of the sourced subreddit. Then I'm going to do some work to clean and process the text.

In [21]:
s_cols = set(s.columns)
c_cols = set(c.columns)

s.drop(columns=s_cols.difference(c_cols),inplace=True)
c.drop(columns=c_cols.difference(s_cols),inplace=True)

In [22]:
s.shape,c.shape,set(s.columns) == set(c.columns)

((2985, 79), (2862, 79), True)

In [23]:
c = c.loc[:,s.columns]

In [24]:
s['subreddit_s'] = 1
c['subreddit_s'] = 0

In [42]:
df = pd.concat([s,c])
df.reset_index(drop=True,inplace=True)
print(df.shape)

df.to_csv('./data/unprocessed_df.csv',index=False)

(5847, 80)


# Processing part 1
After all of that, I ended up with a dataframe of 5,847 rows and 80 columns. That should be plenty of information to train a model! I'm going to start creating some features that measure the structure and sentiment of each post before breaking down the posts into their most important words.

In [82]:
df = pd.read_csv('./data/unprocessed_df.csv')
df['all_text'] = [df.loc[i,'title'] + ' '+df.loc[i,'selftext'] for i in range(df.shape[0])]

In [83]:
df['all_text'] = df['all_text'].map(lambda x: re.sub('http[s]://\S+','',x))

df['all_text'] = df['all_text'].str.replace('publsih','publish') # just a typo i noticed in the head()

for s in ['\n','\xa0','&gt;']:
    df['all_text'] = df['all_text'].map(lambda x: x.replace(s,''))

In [84]:
senter = PunktSentenceTokenizer()
df['sentences'] = [0] * df.shape[0]
df['avg_sent_len'] = [0] * df.shape[0]
df['words'] = [0] * df.shape[0]
df['avg_word_len'] = [0] * df.shape[0]


for i,text in enumerate(df['all_text']):
    sentences = senter.sentences_from_text(text)
    num_sent = len(sentences)
    sent_len = []
    for sent in sentences:
        words = sent.split(' ')
        sent_len.append(len(words))
        word_len = []
        for word in words:
            word_len.append(len(word))
    
    df.loc[i,'sentences'] = num_sent
    df.loc[i,'avg_sent_len'] = np.mean(sent_len)
    df.loc[i,'words'] = sum(sent_len)
    df.loc[i,'avg_word_len'] = np.mean(word_len)
    
    if (i % 1000) == 0:
        print(i)       

0
1000
2000
3000
4000
5000


In [85]:
df['all_text'] = df.all_text.map(lambda x: ''.join([y if y in string.ascii_lowercase else ' ' for y in list(str(x).lower())]))
df['all_text'] = df.all_text.map(lambda x: x.replace('-',' '))

In [86]:
en_nlp = spacy.load('en')

In [87]:
# A:
parsed_quotes = []
for i, parsed in enumerate(en_nlp.pipe(df.all_text.values, batch_size=50, n_threads=4)):
    assert parsed.is_parsed
    if (i % 1000) == 0:
        print(i)
    parsed_quotes.append(parsed)

0
1000
2000
3000
4000
5000


In [88]:
unique_pos = []
for parsed in parsed_quotes:
    unique_pos.extend([t.pos_ for t in parsed])
unique_pos = np.unique(unique_pos)

for pos in unique_pos:
    df[pos+'_prop'] = 0.

for i, parsed in enumerate(parsed_quotes):
    if (i % 1000) == 0:
        print(i)
    parsed_len = len(parsed)
    for pos in unique_pos:
        count = len([x for x in parsed if x.pos_ == pos])
        try:
            df.loc[i, pos+'_prop'] = float(count)/parsed_len    
        except:
            pass

0
1000
2000
3000
4000
5000


In [89]:
analyzer = SentimentIntensityAnalyzer()

df['vader_neg'] = 0
df['vader_pos'] = 0
df['vader_neu'] = 0
df['vader_compound'] = 0

for i, t in enumerate(df.all_text.values):
    vs = analyzer.polarity_scores(t)
    df.loc[i, 'vader_neg'] = vs['neg']
    df.loc[i, 'vader_pos'] = vs['pos']
    df.loc[i, 'vader_neu'] = vs['neu']
    df.loc[i, 'vader_compound'] = vs['compound']
    if (i % 1000) == 0:
        print(i)


0
1000
2000
3000
4000
5000


In [90]:
full_df = df.copy(deep=True)

In [91]:
df.columns[60:]

Index(['thumbnail_height', 'thumbnail_width', 'url_overridden_by_dest',
       'crosspost_parent', 'crosspost_parent_list', 'author_flair_template_id',
       'author_flair_text_color', 'author_flair_background_color', 'media',
       'media_embed', 'secure_media', 'secure_media_embed', 'media_metadata',
       'link_flair_text', 'banned_by', 'edited', 'suggested_sort',
       'distinguished', 'author_cakeday', 'subreddit_s', 'all_text',
       'sentences', 'avg_sent_len', 'words', 'avg_word_len', 'ADJ_prop',
       'ADP_prop', 'ADV_prop', 'AUX_prop', 'CCONJ_prop', 'DET_prop',
       'INTJ_prop', 'NOUN_prop', 'NUM_prop', 'PART_prop', 'PRON_prop',
       'PROPN_prop', 'PUNCT_prop', 'SCONJ_prop', 'SPACE_prop', 'SYM_prop',
       'VERB_prop', 'X_prop', 'vader_neg', 'vader_pos', 'vader_neu',
       'vader_compound'],
      dtype='object')

In [92]:
df = df.loc[:,[  'subreddit_s',
                 'all_text',
                 'sentences',
                 'avg_sent_len',
                 'words',
                 'avg_word_len',
                 'ADJ_prop',
                 'ADP_prop',
                 'ADV_prop',
                 'AUX_prop',
                 'CCONJ_prop',
                 'DET_prop',
                 'INTJ_prop',
                 'NOUN_prop',
                 'NUM_prop',
                 'PART_prop',
                 'PRON_prop',
                 'PROPN_prop',
                 'PUNCT_prop',
                 'SCONJ_prop',
                 'SPACE_prop',
                 'SYM_prop',
                 'VERB_prop',
                 'X_prop',
                 'vader_neg',
                 'vader_pos',
                 'vader_neu',
                 'vader_compound']]

df.to_csv('./data/processed_df.csv',index=False)

# Processing part 2

Now that I have done some syntax and sentiment analysis, I want to narrow down the huge list of words in these posts to the most important stems.

In [31]:
df = pd.read_csv('./data/processed_df.csv')

cv = CountVectorizer(stop_words='english')
cv.fit(df['all_text'])

words_cv = cv.transform(df['all_text'])
words_df = pd.DataFrame(words_cv.todense(), columns=cv.get_feature_names())

In [32]:
words_df.shape

(5847, 26518)

In [33]:
stemmer = PorterStemmer()

word_list = words_df.columns
stemmed_words = list(set([stemmer.stem(word) for word in word_list]))

In [34]:
len(word_list),len(stemmed_words)

(26518, 17083)

Great! We narrowed the list pretty significantly. I'm going to keep pushing forward to narrow even more! I'm going to transfer the stems into a dictionary so that I can edit and add stems as I see fit. You can see some of the words I decided to tweak below.

In [35]:
stem_dict = {}

for word in word_list:
    stem = stemmer.stem(word)
    stem_dict[word]= (stem)    

In [36]:
stem_dict['capitalism'] = 'capitalis'
stem_dict['capitalist'] = 'capitalis'
stem_dict['capitalistic'] = 'capitalis'
stem_dict['capitalists'] = 'capitalis'
stem_dict['capitalize'] = 'capitalize'

In [37]:
stem_dict['socialism'] = 'socialis'
stem_dict['socialiam'] = 'socialis'
stem_dict['socialist'] = 'socialis'
stem_dict['socialists'] = 'socialis'
stem_dict['socialized'] = 'socialize'
stem_dict['socialismvcapitalism'] = 'socialis versus capitalis'

In [38]:
stem_dict['communist'] = 'communis'
stem_dict['communists'] = 'communis'
stem_dict['communism'] = 'communis'
stem_dict['communicate'] = 'communicat'
stem_dict['communications'] = 'communicat'
stem_dict['communication'] = 'communicat'
stem_dict['communicated'] = 'communicat'

In [39]:
stem_dict['youtu'] = 'youtub'

Now that I have stems assigned to all the non-stop words in my posts, I'm going to create strings with only the stems for further processing.

In [40]:
df['stemmed'] = [np.nan]*df.shape[0]

for i, text in enumerate(list(df['all_text'])):
    stemmed = []
    for word in text.split(' '):
        if word in stem_dict.keys():
            stem = stem_dict[word]
            stemmed.append(stem)
            
    df.loc[i,'stemmed'] = ' '.join(stemmed)
    if (i % 1000) == 0:
        print(i)

0
1000
2000
3000
4000
5000


The sentences have now been narrowed down to their stems, so I'm going to vectorize again and use TF IDF to figure out which words will be most important in my modeling.

In [41]:
cv = CountVectorizer(stop_words='english')
cv.fit(df['stemmed'])

words_cv = cv.transform(df['stemmed'])

stems_df = pd.DataFrame(words_cv.todense(), columns=cv.get_feature_names())

In [42]:
s_string = ''
c_string = ''

for i in range(df.shape[0]):
    string = df.loc[i,'stemmed']
    if df.loc[i,'subreddit_s'] == 1:
        s_string += ' '+string
    else: 
        c_string += ' '+string

tvec = TfidfVectorizer(stop_words='english')

tvec.fit([s_string,c_string])

tv  = pd.DataFrame(tvec.transform([s_string, c_string]).todense(),
                   columns=tvec.get_feature_names(),
                   index=['socialism', 'capitalism'])

I'm selecting the top 1,000 words from each subreddit as features as I think these will be the most useful in a predictive model.

In [43]:
s_words = set(tv.T.sort_values('socialism', ascending=False).head(1000).T.columns)
c_words = set(tv.T.sort_values('capitalism', ascending=False).head(1000).T.columns)
selected_words = list(s_words.union(c_words))
selected_words.sort()
len(selected_words)

1197

# Exporting features and target variable
Awesome! I have a bunch of information about the grammatical makeup, sentiment, and vocabulary used in these posts. Now I'm going to combine my features and target variable into one clean dataframe for use in my other notebook.

In [44]:
model_df = df.drop(columns = ['all_text','stemmed'])

for word in selected_words:
    model_df[word] = stems_df[word]

In [45]:
model_df.to_csv('./data/model_df.csv',index=False)