# CISC 351/372 Advanced Data Analytics Group Project
## Group 8: Political Sentiment Analysis In Liberal and Conservative Reddit Communities
### RQ1: What are the most common topics discussed by each political class?

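This notebook answers the first research question by fitting a Latent Dirichlet Allocation (LDA) topic model to a random sample of articles from each class (Conservative and Liberal) and labelling the resulting topics.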

#### Imports

In [1]:
import json
import pandas as pd
import random

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

import re
from pprint import pprint

import gensim
from gensim.utils import simple_preprocess
import gensim.corpora as corpora

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Cynthia\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Dataset Extraction

In [2]:
with open("Conservative.json", "r") as file:
    con_data = json.load(file)

with open("Liberal.json", "r") as file:
    lib_data = json.load(file)

con_articles = [doc["article"] for doc in con_data]
lib_articles = [doc["article"] for doc in lib_data]

# Preview the dataset
print(con_articles[:5])
print(lib_articles[:5])



#### Dataset Sampling

In [3]:
# Fixed sample size of 10000
sample_size = 10000

con_sample = random.sample(con_articles, sample_size)
lib_sample = random.sample(lib_articles, sample_size)

# Preview the sampled datasets
print(con_sample[:5])
print(lib_sample[:5])

['Advertisement Houston immigration attorney Raed Gonzalez claims the controversy surrounding the Obama administration\'s release of tens of thousands of criminal illegal aliens is just a "publicity stunt" generated by the authors of a book documenting the case for impeaching Obama. Gonzalez has been the liaison between the Executive Office for Immigration Review, which administers the immigration court system, and the American Immigration Lawyers Association. Advertisement - story continues below Asked about the scandal on Fox News\' "Justice with Judge Jeanine," Gonzalez told the audience, "I think all of you fell for the publicity stunt of Mr. Elliott and miss, miss, what was it, Brenda Elliot and Mr. Aaron Klein, with their book, \'Impeachable Offenses,\' that came out a couple of weeks ago. And they were denouncing the Obama administration for releasing all of these people." Host Jeanine Pirro interjected, "Raed, I am not here to promote anybody\'s book. We are having a discussion
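
`random.sample` draws a different subset on every run, so the topics found later can change between executions. A minimal sketch of one way to make the draw repeatable follows. The helper name `sample_articles`, the seed value 42, and the placeholder data are assumptions for illustration and are not part of the original notebook. The helper also caps the sample at the population size, since `random.sample` raises `ValueError` when asked for more items than are available.

```python
import random

def sample_articles(articles, sample_size, seed=42):
    """Draw a repeatable random sample without replacement.

    `seed` is an arbitrary illustrative value, not one used by the notebook.
    """
    rng = random.Random(seed)            # local generator; global state is untouched
    k = min(sample_size, len(articles))  # never request more items than exist
    return rng.sample(articles, k)

# Placeholder data standing in for con_articles / lib_articles
toy_articles = [f"article {i}" for i in range(20)]

first = sample_articles(toy_articles, 5)
second = sample_articles(toy_articles, 5)
print(first == second)  # True: the same seed yields the same sample
```

Using a local `random.Random` instance keeps the seed from affecting any other code that relies on the global generator.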

#### Text Data Preprocessing

In [None]:
# Convert to dataframe
con_sample = pd.DataFrame({"text": con_sample})
lib_sample = pd.DataFrame({"text": lib_sample})

# Remove punctuation
# A raw string avoids the invalid "\." escape warning; inside [...] the dot is already literal
con_sample["text_processed"] = con_sample["text"].map(lambda x: re.sub(r"[,.!?]", "", x))
lib_sample["text_processed"] = lib_sample["text"].map(lambda x: re.sub(r"[,.!?]", "", x))

# Convert the text to lowercase
con_sample["text_processed"] = con_sample["text_processed"].map(lambda x: x.lower())
lib_sample["text_processed"] = lib_sample["text_processed"].map(lambda x: x.lower())

# Preview the processed dataset
print(con_sample["text_processed"].head())
print(lib_sample["text_processed"].head())

0    advertisement houston immigration attorney rae...
1    e-edition get the latest news delivered daily ...
2    illinois’ comeback story starts here house spe...
3    where politics meets the press get alerts from...
4    moe lane i am an evil giraffe who no longer bl...
Name: text_processed, dtype: object
0    columbia sc (ap) — joe biden scored a thunderi...
1    we use cookies and other tracking technologies...
2    © first look institute a division of first loo...
3    edward herman and noam chomsky demolish one of...
4    login our latest edition is out in print and o...
Name: text_processed, dtype: object
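
The cleaning step can be tried on a single string to see exactly what it removes. The sketch below wraps the same two operations (strip `,` `.` `!` `?`, then lowercase) in a hypothetical helper `clean_text`; the helper name and the sample sentence are illustrative assumptions.

```python
import re

# Only these four punctuation marks are stripped, as in the cell above
PUNCTUATION = re.compile(r"[,.!?]")

def clean_text(text):
    """Remove commas, periods, exclamation marks and question marks, then lowercase."""
    return PUNCTUATION.sub("", text).lower()

print(clean_text("Joe Biden scored a victory, didn't he? Yes!"))
# joe biden scored a victory didn't he yes
```

Other characters such as apostrophes, parentheses and dashes are left in place, which is why the preview above still shows text like `(ap) —`. The tokenization in the next step discards them.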


#### Data Preparation

In [5]:
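# A set makes each stop-word membership test O(1)
stop_words = set(stopwords.words("english"))

def to_words(sentences):
    # Tokenize each document into lowercase tokens; deacc=True strips accents
    for sentence in sentences:
        yield simple_preprocess(str(sentence), deacc=True)

def remove_stopwords(texts):
    # texts is already a list of token lists, so filter the tokens directly
    return [[word for word in doc if word not in stop_words] for doc in texts]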

con_data = con_sample.text_processed.values.tolist()
lib_data = lib_sample.text_processed.values.tolist()

con_words = list(to_words(con_data))
lib_words = list(to_words(lib_data))

con_words = remove_stopwords(con_words)
lib_words = remove_stopwords(lib_words)

# Preview the word list
print(con_words[:1][0][:10])
print(lib_words[:1][0][:10])

['advertisement', 'houston', 'immigration', 'attorney', 'raed', 'gonzalez', 'claims', 'controversy', 'surrounding', 'obama']
['columbia', 'sc', 'ap', 'joe', 'biden', 'scored', 'thundering', 'victory', 'saturday', 'south']
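
To see what tokenization and stop-word removal do without installing gensim or NLTK, here is a standard-library stand-in. `simple_tokenize` is a hypothetical, simplified approximation of `simple_preprocess` (lowercase alphabetic tokens of 2 to 15 characters, with no accent stripping), and `TOY_STOP_WORDS` is a small hand-picked set rather than NLTK's English list.

```python
import re

# Assumption: a tiny illustrative stop list, NOT the NLTK English stop words
TOY_STOP_WORDS = {"the", "a", "of", "is", "and", "in", "to"}

def simple_tokenize(text, min_len=2, max_len=15):
    """Split text into lowercase alphabetic tokens within a length range."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    return [t for t in tokens if min_len <= len(t) <= max_len]

def drop_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t not in stop_words]

sentence = "The controversy surrounding the release is a publicity stunt"
print(drop_stop_words(simple_tokenize(sentence)))
# ['controversy', 'surrounding', 'release', 'publicity', 'stunt']
```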


#### Dictionary Creation

In [6]:
# Create dictionary
con_dict = corpora.Dictionary(con_words)
lib_dict = corpora.Dictionary(lib_words)

# Create corpus
con_texts = con_words
lib_texts = lib_words

con_corp = [con_dict.doc2bow(text) for text in con_texts]
lib_corp = [lib_dict.doc2bow(text) for text in lib_texts]

# Preview the corpus
print(con_corp[:1][0][:10])
print(lib_corp[:1][0][:10])

[(0, 1), (1, 1), (2, 4), (3, 1), (4, 1), (5, 1), (6, 5), (7, 11), (8, 1), (9, 1)]
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 3), (6, 5), (7, 1), (8, 1), (9, 1)]
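
Each pair in the preview is `(token_id, count)`, so `(7, 11)` in the first conservative document means the token with id 7 occurs 11 times there. The plain-Python sketch below illustrates the bag-of-words idea. `build_vocabulary` and `to_bow` are hypothetical helpers, not gensim functions, and the ids they assign are not guaranteed to match those of `corpora.Dictionary`.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    token_to_id = {}
    for doc in documents:
        for token in sorted(set(doc)):
            # Only unseen tokens receive a new id
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bow(doc, token_to_id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())

docs = [["tax", "health", "tax"], ["health", "climate"]]
vocab = build_vocabulary(docs)
print(vocab)                   # {'health': 0, 'tax': 1, 'climate': 2}
print(to_bow(docs[0], vocab))  # [(0, 1), (1, 2)]
```

Word order is discarded in this representation; only the per-document counts reach the topic model.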


#### Latent Dirichlet Allocation (LDA)

In [9]:
# LDA model
con_lda = gensim.models.LdaMulticore(con_corp, id2word=con_dict, num_topics=10, passes=10)
lib_lda = gensim.models.LdaMulticore(lib_corp, id2word=lib_dict, num_topics=10, passes=10)
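
To make concrete what an LDA model estimates, the sketch below fits the same generative model to a toy corpus with a collapsed Gibbs sampler written in plain Python. This is a different inference technique from the one gensim uses (gensim's LDA implementation is based on online variational Bayes), so it illustrates the model rather than reproducing `LdaMulticore`. The function name `toy_lda_gibbs`, the hyperparameter values, and the toy documents are all assumptions.

```python
import random

def toy_lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns one word distribution per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                  # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                                # total words per topic
    assignments = []

    # Start from a random topic assignment for every token
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample from the conditional distribution over topics
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed estimate of each topic's word distribution
    return [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta) for w in vocab}
        for k in range(num_topics)
    ]

toy_docs = [
    ["tax", "health", "tax", "government"],
    ["climate", "global", "climate", "world"],
    ["tax", "government", "health"],
    ["global", "world", "climate"],
]
toy_topics = toy_lda_gibbs(toy_docs, num_topics=2)
for k, dist in enumerate(toy_topics):
    print(k, sorted(dist, key=dist.get, reverse=True)[:3])
```

Each returned dictionary plays the role of one topic string in the results below: a probability for every vocabulary word, of which only the top few are usually reported.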

#### Results

In [10]:
# Print the resulting topics
pprint(con_lda.print_topics())
pprint(lib_lda.print_topics())

[(0,
  '0.008*"people" + 0.006*"one" + 0.004*"would" + 0.004*"like" + 0.004*"new" + '
  '0.003*"said" + 0.003*"time" + 0.003*"get" + 0.003*"us" + 0.002*"think"'),
 (1,
  '0.010*"percent" + 0.007*"yang" + 0.006*"online" + 0.004*"slot" + '
  '0.004*"say" + 0.004*"black" + 0.004*"blacks" + 0.004*"dan" + 0.004*"di" + '
  '0.003*"obama"'),
 (2,
  '0.006*"reply" + 0.004*"one" + 0.004*"posted" + 0.003*"view" + '
  '0.003*"replies" + 0.003*"post" + 0.003*"private" + 0.003*"pdt" + '
  '0.003*"crusades" + 0.003*"crusade"'),
 (3,
  '0.020*"see" + 0.005*"climate" + 0.004*"one" + 0.004*"lottery" + '
  '0.004*"spell" + 0.003*"dr" + 0.003*"would" + 0.003*"world" + 0.003*"people" '
  '+ 0.003*"global"'),
 (4,
  '0.008*"said" + 0.007*"president" + 0.006*"trump" + 0.006*"would" + '
  '0.005*"republican" + 0.005*"party" + 0.005*"obama" + 0.005*"democrats" + '
  '0.005*"house" + 0.005*"republicans"'),
 (5,
  '0.011*"news" + 0.010*"abortion" + 0.009*"life" + 0.007*"live" + '
  '0.007*"action" + 0.007*"pro"
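
The topic strings above are convenient to read but awkward to process further. A small sketch of turning one such string into `(word, weight)` pairs with a regular expression follows; `parse_topic` is a hypothetical helper and the example string is copied from the first conservative topic. gensim's `show_topics` also accepts `formatted=False`, which should return word and probability pairs directly, but the string-parsing route works on saved output as well.

```python
import re

# Matches terms of the form 0.008*"people"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Extract (word, weight) pairs from a formatted topic string."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

example = '0.008*"people" + 0.006*"one" + 0.004*"would"'
print(parse_topic(example))
# [('people', 0.008), ('one', 0.006), ('would', 0.004)]
```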

#### Results Interpretation
##### Conservative Topics
1. people, one, would, like, new, said, time, get, us, think --> **Personal Opinions**
2. percent, yang, online, slot, say, black, blacks, dan, di, obama --> **Race**
3. reply, one, posted, view, replies, post, private, pdt, crusades, crusade --> **Online Discourse**
4. see, climate, one, lottery, spell, dr, would, world, people, global --> **Climate Discussions**
5. said, president, trump, would, republican, party, obama, democrats, house, republicans --> **Trump vs. Obama**
6. news, abortion, life, live, action, pro, said, women, children, would --> **Pro-Life and Abortion**
7. trump, said, one, people, would, like, media, president, clinton, know --> **Trump vs. Clinton**
8. would, government, tax, one, people, us, obama, percent, new, health --> **Taxes and Healthcare**
9. said, us, pm, obama, posted, president, state, government, news, would --> **Obama and Government Statements**
10. government, people, american, one, us, would, state, states, political, law --> **American Government and States' Rights**
##### Liberal Topics
1. people, one, like, would, think, us, get, even, new, time --> **Personal Opinions**
2. us, said, obama, president, people, one, would, mr, war, police --> **Obama's Statements on War and Police**
3. trump, said, president, obama, would, clinton, democratic, party, campaign, republican --> **Trump vs. Obama vs. Clinton**
4. qe, stevt, rdf, li, action, length, application, parameters, mcconnel, obj --> **Economics and Data Science**
5. trump, said, new, us, climate, would, also, president, one, million --> **Trump's Statements on Climate**
6. economic, china, us, countries, state, world, one, income, would, growth --> **Global Economics**
7. tax, would, people, said, state, one, workers, new, us, jones --> **Tax Policies and Workers' Rights**
8. people, new, political, would, government, party, one, us, also, social --> **Political Discussions**
9. gun, obama, would, health, president, people, care, new, said, one --> **Obama's Statements on Guns and Healthcare**
10. people, occupy, said, one, new, movement, says, students, women, city --> **Occupy Movement and Social Justice**
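
Several topics share many generic words, which makes hand labelling harder. As a rough aid, the overlap between two topics' top words can be quantified with the Jaccard index (size of the intersection divided by size of the union). The sketch below is an illustrative addition, not part of the original analysis; it compares the two "Personal Opinions" topics listed above using a hypothetical helper `jaccard`.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two word lists, treated as sets."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Top words copied from Conservative topic 1 and Liberal topic 1
con_topic_1 = ["people", "one", "would", "like", "new", "said", "time", "get", "us", "think"]
lib_topic_1 = ["people", "one", "like", "would", "think", "us", "get", "even", "new", "time"]

print(round(jaccard(con_topic_1, lib_topic_1), 2))  # 0.82 (9 shared words out of 11)
```

A high score like this suggests the topic is dominated by generic vocabulary common to both classes rather than by class-specific subject matter.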