# Sentiment Analysis

The purpose of this section is to glean some preliminary insights from the data. I am investigating whether the two subreddits distinctly differ in their usage of extreme language, represented here by the use of **superlatives**. The idea is that the more superlatives used, the more polarizing or extreme the language is. 

After analyzing superlative usage, I will look at the rate of **positive and negative word usage**. Whether a word reads as positive or negative usually depends on context, but here I will use a simple "bag of words" analysis that counts the occurrences of a predefined set of negative and positive words.
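
As a minimal sketch of the bag-of-words counting described above, the helper below counts how many tokens appear in a word list and reports that count as a percentage of all tokens. The function name `usage_rate` and the toy data are hypothetical and are not part of the notebook's own code.

```python
def usage_rate(tokens, lexicon):
    """Return (count, rate_percent) for tokens that appear in lexicon."""
    lexicon = set(lexicon)  # set membership is a constant-time lookup per token
    count = sum(1 for t in tokens if t in lexicon)
    rate = 100 * count / len(tokens) if tokens else 0.0
    return count, rate

# Toy data for illustration only; not taken from either subreddit
toy_tokens = ['this', 'is', 'the', 'best', 'and', 'worst', 'idea', 'ever']
toy_lexicon = ['best', 'worst', 'greatest']

count, rate = usage_rate(toy_tokens, toy_lexicon)
print(f'{count} matches, {rate:.1f}% of tokens')  # 2 matches, 25.0% of tokens
```

Context is ignored entirely: "not the best" still counts one hit for "best", which is the trade-off of this approach.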

## Superlative Sentiment Analysis

In [1]:
import pandas as pd
import requests
import time
import numpy as np
from bs4 import BeautifulSoup

from nltk.tokenize import RegexpTokenizer

In [2]:
# A function to convert the comments into one flat list of tokenized words

def convert_to_tokens(series):
    # Build the tokenizer once; the raw string keeps \w from being read as a string escape
    tokenizer = RegexpTokenizer(r'[a-z]\w+')
    big_list = []
    for comment in series.tolist():
        # Lowercase first so the [a-z] pattern also matches capitalized words
        big_list += tokenizer.tokenize(comment.lower())
    return big_list
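
To see what the pattern `[a-z]\w+` keeps and drops, here is a small sketch that uses the standard library's `re.findall` as a stand-in for `RegexpTokenizer`. It assumes the two behave the same way for this simple pattern, and the sample sentence is made up.

```python
import re

# Same pattern as in convert_to_tokens: one lowercase letter, then one or more word characters
pattern = re.compile(r'[a-z]\w+')

sample = "I am the BEST! It's 2nd-best."
tokens = pattern.findall(sample.lower())
print(tokens)  # ['am', 'the', 'best', 'it', 'nd', 'best']
```

Single-letter words such as "i" are dropped, contractions are split at the apostrophe, and a token can never start with a digit, so "2nd" becomes "nd". All of these affect the total word count used as the denominator of the rates below.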

In [3]:
# Scraping a list of superlatives from the internet

df_superlatives = pd.read_html(
    'https://www.easypacelearning.com/all-lessons/grammar/1436-comparative-superlative-adjectives-list-from-a-to-z')

In [4]:
# Creating the superlatives words list

df_superlatives = df_superlatives[0]

superlatives = list(df_superlatives[2][1:])

In [5]:
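# Adding superlative modifiers to the list.
# extend() adds each word as its own entry; append() would add the
# two-word list as a single element that no token could ever match.

superlatives.extend(['least', 'most'])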
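
Because the later membership checks compare single tokens against this list, the two modifiers need to be stored as separate strings. A minimal sketch, using a toy list in place of the scraped one, of how `list.append` and `list.extend` differ here:

```python
words = ['best', 'worst']            # toy stand-in for the scraped superlatives

appended = words.copy()
appended.append(['least', 'most'])   # adds ONE element: the inner list itself

extended = words.copy()
extended.extend(['least', 'most'])   # adds two separate string elements

print(appended)   # ['best', 'worst', ['least', 'most']]
print(extended)   # ['best', 'worst', 'least', 'most']
print('most' in appended, 'most' in extended)   # False True
```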

### Conservative Subreddit Superlative Analysis

In [8]:
df_con = pd.read_csv('./data/df_con.csv')

In [10]:
# Calling the tokenizing function on the comments to create a 'bag of words'

con_words = convert_to_tokens(df_con['body'])

In [21]:
# Calculating the number of occurrences of our scraped superlatives in the subreddit comments

con_superlatives = [w for w in con_words if w in superlatives]
print(f'Total number of superlatives used: {len(con_superlatives)}')

# Rate of superlative use in Conservative subreddit comments

print(
    f'Rate of superlative use: {round(((len(con_superlatives) / len(con_words))*100), 3)}%')

Total number of superlatives used: 821
Rate of superlative use: 0.121%


### Socialism Subreddit Superlative Analysis

In [23]:
df_lib = pd.read_csv('./data/df_lib.csv')

In [25]:
# Calling the tokenizing function on the comments to create a 'bag of words'

lib_words = convert_to_tokens(df_lib['body'])

In [26]:
# Calculating the number of occurrences of our scraped superlatives in the subreddit comments

lib_superlatives = [w for w in lib_words if w in superlatives]

len(lib_superlatives)

844

In [29]:
# Calculating the number of occurrences of our scraped superlatives in the subreddit comments

lib_superlatives = [w for w in lib_words if w in superlatives]

print(f'Total number of superlatives used: {len(lib_superlatives)}')

# Rate of superlative use in Socialism subreddit comments

print(
    f'Rate of superlative use: {round(((len(lib_superlatives) / len(lib_words))*100), 3)}%')

Total number of superlatives used: 844
Rate of superlative use: 0.117%


### Conclusion on Superlative Analysis

The rates of superlative use are very similar for both subreddits, with 'r/conservatives' using slightly more: a margin of 0.004 percentage points, or about 4 more superlatives per 100,000 words.
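
The arithmetic behind that margin can be checked with a short sketch. It uses only the two rounded rates printed in the outputs above; the variable names are hypothetical.

```python
# Rates as printed in the cell outputs above, in percent of all tokens
con_rate_pct = 0.121
lib_rate_pct = 0.117

# Difference in percentage points, rounded to the precision of the inputs
diff_pct_points = round(con_rate_pct - lib_rate_pct, 3)

# Convert a percentage to a count per 100,000 words
per_100k = round(diff_pct_points / 100 * 100_000)

print(f'{diff_pct_points} percentage points = {per_100k} per 100,000 words')
```

Since the inputs are already rounded to three decimals, the result is only approximate.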

## Positive/Negative Sentiment Analysis

In [30]:
# Scraping to create a list of positive words

positive_words_url = 'http://www.creativeaffirmations.com/positive-words.html'
pos_res = requests.get(positive_words_url)
positive_soup = BeautifulSoup(pos_res.content, 'lxml')

In [31]:
# Isolating the body of positive words

table = positive_soup.find('table', {'cellpadding' : '2'})

In [32]:
# Creating a list of positive words from a table scrape

positive_words = [i.text.lower() for i in table.find_all('td')]

In [33]:
"party" in positive_words

True

In [34]:
"leader" in positive_words

True

In [35]:
positive_words.remove('leader')
positive_words.remove('party')

I am removing these words because they may come up often, but in either a positive or negative political sense depending on context. I think that including them in the positive words list would be a mistake that could throw off the results.
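
If more context-dependent words turn up later, one alternative to repeated `remove` calls is to filter against a set. This is a sketch on a toy list: `positive_words_demo` and `ambiguous` are hypothetical names, and the words other than 'leader' and 'party' are made up.

```python
# Toy stand-in for the scraped list; the real list comes from the scrape above
positive_words_demo = ['joy', 'leader', 'party', 'success', 'party']

# Words judged too context-dependent for political discussion
ambiguous = {'leader', 'party'}

# Keep every word that is not in the ambiguous set
filtered = [w for w in positive_words_demo if w not in ambiguous]
print(filtered)  # ['joy', 'success']
```

Unlike `list.remove`, which deletes only the first match and raises `ValueError` when the word is absent, the comprehension drops every occurrence and tolerates words that are not in the list.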

In [36]:
# Scraping to create a list of negative words

negative_words_url = "https://www.enchantedlearning.com/wordlist/negativewords.shtml"
neg_res = requests.get(negative_words_url)
negative_soup = BeautifulSoup(neg_res.content, 'lxml')
neg_table = negative_soup.find_all('div', {'class' : 'wordlist-item'})

In [37]:
# Creating a list of negative words from the scraped word-list items,
# normalized to lowercase so they can match the lowercased tokens

negative_words = [i.text.strip().lower() for i in neg_table]

In [38]:
# Counting the frequency with which positive or negative words
# show up in the two subreddits' comments

lib_positive = [w for w in lib_words if w in positive_words]
con_positive = [w for w in con_words if w in positive_words]

lib_negative = [w for w in lib_words if w in negative_words]
con_negative = [w for w in con_words if w in negative_words]

print(
    f'''Rate of positive words in r/socialism: 
    {round((len(lib_positive) / len(lib_words) *100), 2)}%''')
print(
    f'''Rate of positive words in r/conservatives:
    {round((len(con_positive)/ len(con_words)*100), 2)}%''')
print(
    f'''Rate of negative words in r/socialism:
    {round((len(lib_negative)/len(lib_words)*100), 2)}%''')
print(
    f'''Rate of negative words in r/conservatives:
    {round((len(con_negative)/len(con_words)*100), 2)}%''')

Rate of positive words in r/socialism: 
    3.4%
Rate of positive words in r/conservatives:
    3.27%
Rate of negative words in r/socialism:
    1.88%
Rate of negative words in r/conservatives:
    2.19%


### Positive/Negative Sentiment Conclusion

The rates of use for both positive and negative words are similar across the two subreddits. "r/socialism" has a slightly higher rate of positive word usage than "r/conservatives", by 0.13 percentage points (3.40% versus 3.27%). It also has a lower rate of negative word usage, by 0.31 percentage points (1.88% versus 2.19%).

The only result that I would conclude shows potential for an actual difference in the language of the two subreddits is the gap in negative word usage, although a difference of three tenths of a percentage point is very small.
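
One way to judge whether a gap of this size could be noise is a two-proportion z-test. It is not part of the analysis above and is offered only as a rough sketch. The counts below are hypothetical: they mimic the reported negative-word rates at a made-up sample size of 10,000 words per subreddit, and would need to be replaced with the real `len(...)` values from the notebook.

```python
from math import sqrt, erfc

def two_proportion_z(count_a, n_a, count_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = count_a / n_a, count_b / n_b
    pooled = (count_a + count_b) / (n_a + n_b)        # pooled rate under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))                  # two-sided, normal approximation
    return z, p_value

# HYPOTHETICAL counts for illustration only (1.88% and 2.19% of 10,000 words)
z, p = two_proportion_z(188, 10_000, 219, 10_000)
print(f'z = {z:.2f}, p = {p:.3f}')
```

The test treats every word as an independent draw, which comment text does not satisfy, so the p-value is at best a rough guide. With much larger word counts the same gap in rates would produce a larger z.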