Text mining project using data from Reddit.

1 - Detect subreddits that act as echo chambers in Reddit

2 - Find the main topics discussed in those subreddits

3 - Find subreddits with opposing views and analyze the sentiment towards each topic

4 - Finally, I want to measure whether users change their sentiment, vocabulary, or opinion on a topic depending on the subreddit they are interacting with. I would also like to see whether spending more time on a subreddit that acts as an echo chamber shifts their sentiment on specific topics from negative/neutral to positive (or vice versa).

List of methods that I need to apply in the project:
1. tf-idf
2. show the most frequent words. Use wordcloud
3. correlation between words (network)
4. topic modeling and connect topics with the most frequent words
5. sentiment analysis
6. Maybe cluster analysis
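
Item 1 (tf-idf) has no dedicated cell in this part of the notebook. The following is a minimal sketch, assuming each subreddit is treated as a single document and that term counts are available as plain dictionaries. The helper name `tf_idf` and the toy counts are hypothetical and are not taken from the Reddit data.

```python
import math

def tf_idf(counts_by_doc: dict) -> dict:
    """Compute tf-idf per term, treating each key of counts_by_doc as one document."""
    n_docs = len(counts_by_doc)

    # Document frequency: number of documents that contain each term
    doc_freq = {}
    for counts in counts_by_doc.values():
        for term in counts:
            doc_freq[term] = doc_freq.get(term, 0) + 1

    scores = {}
    for doc, counts in counts_by_doc.items():
        total = sum(counts.values())
        scores[doc] = {
            # tf = relative frequency in the document, idf = log(N / df)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

# Toy counts (hypothetical, for illustration only)
toy_counts = {
    "sub_a": {"tax": 6, "vote": 4},
    "sub_b": {"vote": 5, "climate": 5},
}
toy_tfidf = tf_idf(toy_counts)
```

With this plain `log(N / df)` weighting, a term that occurs in every subreddit gets a score of zero, so only vocabulary that distinguishes one community from another stands out.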

# Outline

In [2]:
# Load libraries
import pandas as pd
import numpy as np
import datetime, json, os, re, zstandard
from nltk.corpus import stopwords
from zst_reader import read_lines_zst, write_line_zst

In [3]:
subreddits = ["Conservative", "progressive",
              "democrats", "Republican",
              "NeutralPolitics", "PoliticalDiscussion", "politics"]

## 1. Detecting Echo Chambers

### 1.a. Retrieve Reddit data

Posts: Retrieve the following information for each post:
* Title: The title of the post.
* Content: The text content of the post.
* Upvotes and downvotes: The number of upvotes and downvotes received by the post.
* Timestamp: The date and time when the post was created.

Comments: Retrieve the following information for each comment:
* Content: The text content of the comment.
* Upvotes and downvotes: The number of upvotes and downvotes received by the comment.
* Timestamp: The date and time when the comment was posted.
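
As a rough illustration of the fields listed above, the sketch below pulls them out of one JSON line describing a post. The keys `title`, `selftext`, `ups`, and `downs` appear in later cells of this notebook; `created_utc` is an assumption about how the dump stores the timestamp, and `extract_post_fields` is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone

def extract_post_fields(line: str) -> dict:
    """Pull title, content, votes, and timestamp out of one JSON line."""
    obj = json.loads(line)

    # Assumed field name for the Unix timestamp of the post
    created = obj.get("created_utc")

    return {
        "title": obj.get("title", ""),
        "content": obj.get("selftext", ""),
        # Missing or empty vote counts are treated as 0
        "ups": int(obj.get("ups") or 0),
        "downs": int(obj.get("downs") or 0),
        "timestamp": (
            datetime.fromtimestamp(int(float(created)), tz=timezone.utc)
            if created is not None else None
        ),
    }

# Hypothetical sample line, for illustration only
sample_line = json.dumps({"title": "Example title", "selftext": "Example body",
                          "ups": 12, "downs": 3, "created_utc": 1672531200})
post = extract_post_fields(sample_line)
```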
 

In [4]:
from collections import Counter
from nltk.util import ngrams

In [4]:
# Count terms frequencies (unigram and bigram) in the comments
def count_terms_frequency(input_comments: list, output_frequencies: list) -> None:

    # Load stop words using nltk
    stop_words = stopwords.words('english')

    # Loop through input paths
    for in_comment, out_grams in zip(input_comments, output_frequencies):

        unigrams = Counter()
        bigrams = Counter()

        for line, file_bytes_processed in read_lines_zst(in_comment):

            # Load the json object
            obj = json.loads(line)

            # Skip if body doesn't exist
            if 'body' not in obj:
                continue

            # Get body of comment
            body = obj['body']

            # Skip if body is deleted or removed
            if (body == 'deleted') or (body == 'removed'):
                continue

            # Clean the text
            body = clean_comments(body, stop_words)

            # Split the text into unigrams and bigrams
            unigrams_list = body.split()
            bigrams_list = list(ngrams(unigrams_list, 2))

            unigrams.update(unigrams_list)
            bigrams.update(bigrams_list)

        # Create the zst handler
        handle_unigram = zstandard.ZstdCompressor().stream_writer(open(out_grams[0], 'wb'))

        # Write the unigrams to the zst file
        for unigram in unigrams:
            line = {'term': unigram, 'frequency': unigrams[unigram]}
            line = json.dumps(line)
            write_line_zst(handle_unigram, line)

        # Close the handler so the final zst frame is flushed to disk
        handle_unigram.close()

        # Create the zst handler
        handle_bigram = zstandard.ZstdCompressor().stream_writer(open(out_grams[1], 'wb'))

        # Write the bigrams to the zst file (tuples are serialized as JSON lists)
        for bigram in bigrams:
            line = {'term': bigram, 'frequency': bigrams[bigram]}
            line = json.dumps(line)
            write_line_zst(handle_bigram, line)

        # Close the handler so the final zst frame is flushed to disk
        handle_bigram.close()

    return

In [5]:
input_comments = [f"data/{s}/{s}_comments_clean.zst" for s in subreddits]
output_frequencies = [(f"analysis/1a/{s}_comments_unigrams.zst", f"analysis/1a/{s}_comments_bigrams.zst") for s in subreddits]

In [None]:
count_terms_frequency(input_comments, output_frequencies)

In [6]:
# Load unigrams
unigrams = {}
for subreddit in subreddits:
    unigrams[subreddit] = {}
    for line, file_bytes_processed in read_lines_zst(f"analysis/1a/{subreddit}_comments_unigrams.zst"):
        obj = json.loads(line)
        unigrams[subreddit][obj["term"]] = obj["frequency"]

In [124]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns

In [22]:
stop_words = stopwords.words('english')
stop_words = [re.sub("'", "", word) for word in stop_words]
stop_words += ["like", "ever", "ive", "always", "final", "people", "would", "rrepublican", "rneutralpolitics", "get", "one", "thats", "trump", "karma", "said"]
# Function to clean unigrams
def clean_unigrams(unigrams: dict, min_frequency: int=10, max_frequency: int=1000) -> dict:

    # Strip whitespace
    unigrams = {k.strip(): v for k, v in unigrams.items()}
    
    # Remove stop words
    unigrams = {k: v for k, v in unigrams.items() if k not in stop_words}
    
    # Remove terms with frequency less than min_frequency
    unigrams = {k: v for k, v in unigrams.items() if v >= min_frequency}

    # Remove terms with frequency more than max_frequency
    unigrams = {k: v for k, v in unigrams.items() if v <= max_frequency}

    return unigrams

In [23]:
unigrams_clean = {}
for subreddit in subreddits:
    unigrams_clean[subreddit] = clean_unigrams(unigrams[subreddit], min_frequency=10, max_frequency=10000)

In [9]:
# Calculate unigram frequencies for each subreddit
unigrams_frequencies = {}
for subreddit in subreddits:
    total = sum(unigrams[subreddit].values())
    unigrams_frequencies[subreddit] = {k: v/total for k, v in unigrams[subreddit].items()}

    # Remove stop words
    unigrams_frequencies[subreddit] = {k: v for k, v in unigrams_frequencies[subreddit].items() if k not in stop_words}

    # Remove terms whose relative frequency is 0.002 (0.2%) or higher
    unigrams_frequencies[subreddit] = {k: v for k, v in unigrams_frequencies[subreddit].items() if v < 0.002}
    

In [24]:
# Plot a word cloud for each subreddit using up to 150 of the most frequent cleaned unigrams
for subreddit in subreddits:

    # Create the word cloud
    wc = WordCloud(background_color="white", max_words=150, width=800, height=400)
    wc.generate_from_frequencies(unigrams_clean[subreddit])

    # Plot the word cloud
    plt.figure(figsize=(16,8))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Word Cloud for {subreddit}")
    #plt.show()

    # Save the image
    plt.tight_layout(pad=0)
    plt.savefig(f"analysis/1a/plots/{subreddit}_comments_2_wordcloud.png")

    # Close the plot
    plt.close()

In [30]:
# Plot a bar chart of the 20 most frequent unigrams for each subreddit
for subreddit in subreddits:
    
    # Get the top 20 unigrams
    top_20_unigrams = dict(sorted(unigrams_clean[subreddit].items(), key=lambda item: item[1], reverse=True)[:20])

    # Plot the bar chart horizontally using seaborn with the reversed yellow-green-blue palette
    plt.figure(figsize=(8,8))
    sns.barplot(x=list(top_20_unigrams.values()), y=list(top_20_unigrams.keys()), palette="YlGnBu_r")

    # Set the title and axis labels
    plt.title(f"Top 20 Unigrams for {subreddit}")
    plt.xlabel("Frequency")
    plt.ylabel("Unigram")

    # Save the image
    plt.tight_layout(pad=0)
    plt.savefig(f"analysis/1a/plots/{subreddit}_comments_clean_unigrams_top20.png")
    #plt.show()
    plt.close()

In [40]:
# Build a dataframe of relative unigram frequencies, one column per subreddit
# (terms missing from a subreddit get a frequency of 0)
df_unigrams = pd.DataFrame(unigrams_frequencies)
df_unigrams = df_unigrams.fillna(0)

In [41]:
df_unigrams

term,Conservative,progressive,democrats,Republican,NeutralPolitics,PoliticalDiscussion,politics
die,0.000354,0.000349,0.000297,0.000294,0.000089,0.000216,2.248652e-04
able,0.000597,0.000572,0.000522,0.000486,0.000614,0.000690,5.463000e-04
afford,0.000153,0.000256,0.000134,0.000125,0.000132,0.000161,1.656714e-04
healthcare,0.000283,0.000585,0.000412,0.000274,0.000296,0.000553,4.594411e-04
aca,0.000015,0.000115,0.000081,0.000016,0.000073,0.000129,7.595825e-05
...,...,...,...,...,...,...,...
predictespecially,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.098700e-09
looksa,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.098700e-09
massprivatizationof,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.098700e-09
falsedillemma,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.098700e-09


In [58]:
# Plot scatterplot of unigram frequency between two subreddits (all unigrams in df_unigrams)
s_1 = "democrats"
s_2 = "politics"

# Create the scatterplot
plt.figure(figsize=(8,8))
plt.scatter(df_unigrams[s_1], df_unigrams[s_2], alpha=0.5)

# Set the title and axis labels
plt.title(f"Unigram Frequency Comparison between {s_1} and {s_2}")
plt.xlabel(f"Frequency of {s_1}")
plt.ylabel(f"Frequency of {s_2}")

# Annotate the first 10 unigrams in the dataframe index
for i, unigram in enumerate(df_unigrams.index):
    if i < 10:
        plt.annotate(unigram, (df_unigrams.loc[unigram, s_1], df_unigrams.loc[unigram, s_2]),
                     fontweight="bold", backgroundcolor="white")
    else:
        break

# Add a line with slope 1
plt.plot([0, 0.002], [0, 0.002], color="grey", linestyle="--")

plt.savefig(f"analysis/1a/plots/{s_1}_{s_2}_unigram_frequency_comparison.png")
#plt.show()
plt.close()

In [None]:
# Compare unigram frequency between Republican and Democrat subreddits, and between Conservative and Progressive subreddits
for subreddit1, subreddit2 in [("Republican", "democrats"), ("Conservative", "progressive")]:

    # Get the top 20 unigrams
    top_20_unigrams1 = {k: v for k, v in sorted(unigrams[subreddit1].items(), key=lambda item: item[1], reverse=True)[:20]}
    top_20_unigrams2 = {k: v for k, v in sorted(unigrams[subreddit2].items(), key=lambda item: item[1], reverse=True)[:20]}

    # Plot the bar chart
    plt.figure(figsize=(16,8))
    plt.bar(top_20_unigrams1.keys(), top_20_unigrams1.values(), label=subreddit1)
    plt.bar(top_20_unigrams2.keys(), top_20_unigrams2.values(), label=subreddit2)
    plt.xticks(rotation=45)
    plt.title(f"Top 20 Unigrams for {subreddit1} and {subreddit2}")
    plt.legend()
    plt.show()

    # Save the image
    #plt.tight_layout(pad=0)
    #plt.savefig(f"analysis/1a/plots/{subreddit1}_{subreddit2}_comments_unigrams_top20.png")

In [83]:
# Load bigrams
bigrams = {}
for subreddit in subreddits:
    bigrams[subreddit] = {}
    for line, file_bytes_processed in read_lines_zst(f"analysis/1a/{subreddit}_comments_bigrams.zst"):
        obj = json.loads(line)
        bigrams[subreddit][tuple(obj["term"])] = obj["frequency"]

In [97]:
# Count the total frequency of bigrams in each subreddit
total_bigrams = {}
for subreddit in subreddits:
    total_bigrams[subreddit] = sum(bigrams[subreddit].values())

In [91]:
results_bigrams = {}

In [103]:
for s in subreddits:
    f = bigrams[s][("dont", "agree")]
    print(f"{s}: {f/total_bigrams[s]:.8f}")

Conservative: 0.00011325
progressive: 0.00007762
democrats: 0.00007669
Republican: 0.00010815
NeutralPolitics: 0.00003501
PoliticalDiscussion: 0.00006631
politics: 0.00006496


In [None]:
result = {}
b_1, b_2 = ("i", "argue")
for s in subreddits:
    f = bigrams[s][(b_1, b_2)]
    f_total = f/total_bigrams[s]
    result[s] = f_total

# Plot result using seaborn
plt.figure(figsize=(8,8))

# Get the x and y values sorted by x
y = sorted(result.keys(), key=lambda k: result[k], reverse=True)
x = [result[k] for k in y]

# Plot the bar chart horizontally using seaborn with the reversed yellow-green-blue palette
sns.barplot(x=x, y=y, palette="YlGnBu_r")

# Set the title and axis labels
plt.title(f"Frequency of Bigram ({b_1}, {b_2}) in Subreddits")
plt.xlabel("Frequency")

# Save the image
plt.savefig(f"analysis/1a/plots/{b_1}_{b_2}_bigram_frequency.png")
plt.show()

### 1.b. Creating a network of subreddits

Find users that post in multiple subreddits. Create a network of subreddits based on the users that post in them.

In [None]:
def get_unique_users(file: str) -> set:
    """Get unique users from a file"""

    # Initialize set
    users = set()

    # Read file line by line
    with open(file, 'r') as f:

        lines = f.readlines()

        # Iterate through lines, skipping header
        for line in lines[1:]:
                            
            # Get user from line
            user = line.split(',')[0]

            # Skip if user is deleted or AutoModerator
            if (user == '[deleted]') or (user == 'AutoModerator'):
                continue

            # Add user to set
            users.add(user)

    return users


In [None]:
# Create empty dictionary to store unique users for each subreddit
unique_users = {}

# Loop through all files to get unique users
for file in files:

    # Get unique users for each file
    print(f'Getting unique users for {file}')
    users = get_unique_users(file)

    # Save to dictionary
    subreddit = file.split('/')[-1].split('_')[0]
    unique_users[subreddit] = users
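
The network described in this section is not built in the cells above. One possible way to derive its edges is sketched below, assuming `unique_users` maps each subreddit name to a set of usernames. The helper `subreddit_edges`, the `min_shared` threshold, and the use of Jaccard similarity as the edge weight are illustrative choices, and the toy data is hypothetical.

```python
from itertools import combinations

def subreddit_edges(unique_users: dict, min_shared: int = 1) -> list:
    """Return (subreddit_a, subreddit_b, shared_count, jaccard) for each connected pair."""
    edges = []
    for a, b in combinations(sorted(unique_users), 2):
        shared = unique_users[a] & unique_users[b]
        union = unique_users[a] | unique_users[b]

        # Skip pairs with too few shared users (or no users at all)
        if len(shared) < min_shared or not union:
            continue

        # Jaccard similarity: shared users divided by all users of either subreddit
        edges.append((a, b, len(shared), len(shared) / len(union)))
    return edges

# Toy data (hypothetical usernames, for illustration only)
toy_users = {
    "sub_a": {"u1", "u2", "u3"},
    "sub_b": {"u2", "u3", "u4"},
    "sub_c": {"u5"},
}
edges = subreddit_edges(toy_users)
```

A subreddit whose users rarely appear elsewhere ends up with few or weak edges, which is one possible signal of an echo chamber.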

### 1.c. Topic Modeling

Find main topics being discussed in the subreddits.

* Text preprocessing: Clean and preprocess the textual data by removing stop words, punctuation, and special characters. Perform stemming or lemmatization to normalize the text.

* Topic modeling: Apply topic modeling techniques such as Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF) to extract the main topics discussed within each subreddit. This will help you identify the prevalent themes and subjects.


In [61]:
input_submissions = [f"data/{s}/{s}_submissions_clean.zst" for s in subreddits]

**Latent Dirichlet Allocation (LDA)**

In [5]:
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.utils import tokenize
from nltk.corpus import stopwords

In [62]:
def clean_submission(text: str, stop_words: list) -> str:
    """Clean text by removing non-alphabetical characters, stop words,
    and other words"""

    # Lowercase first so capitalized stop words (e.g. "The") are also matched
    text = text.lower()

    # Remove stop words
    text = ' '.join([word for word in text.split() if word not in stop_words])

    # Remove 's 
    text = text.replace("'s ", ' ')

    # Remove non-alphabetical characters
    text = re.sub(r'[^a-zA-Z ]+', '', text)

    return text

In [66]:
# Function to build the corpus, dictionary, and tokenized texts used by the LDA model
def create_corpus(input_paths: list) -> tuple:
    """Build a bag-of-words corpus, dictionary, and tokenized texts from submissions"""

    # Create empty list to store texts
    texts = []

    # Load stop words using nltk
    stop_words = stopwords.words('english')

    # Custom stop words
    custom_stop_words = ['biden', 'trump', 'republican', 'democrat', 'politics']
    stop_words.extend(custom_stop_words)

    # Loop through input paths
    for path in input_paths:

        # Read lines
        lines = read_lines_zst(path)

        # Loop through lines
        for line, _ in lines:

            # Convert the line to a json object
            obj = json.loads(line)

            # Get text and title
            text = obj['selftext']
            title = obj['title']

            # Skip if text is deleted, or removed
            if (text == 'deleted') or (text == 'removed'):
                continue

            # Combine title and text
            full_text = title + ' ' + text

            # Clean text
            full_text = clean_submission(full_text, stop_words)

            # Skip if text is empty
            if len(full_text) == 0:
                continue

            # Add to list
            texts.append(full_text)

    # Tokenize texts
    tokenized_texts = [list(tokenize(text, lowercase=True)) for text in texts]
 
    # Create dictionary
    dictionary = Dictionary(tokenized_texts)

    # Filter extremes (remove words that appear in more than 30% of documents and less than 10 documents)
    dictionary.filter_extremes(no_below=10, no_above=0.3)

    # Create corpus
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_texts]

    return corpus, dictionary, tokenized_texts


In [67]:
# Create corpus
corpus, dictionary, tokenized_texts = create_corpus(input_submissions)

In [None]:
# Find the optimal number of topics using coherence scores
def find_optimal_num_topics(corpus: list, dictionary: Dictionary, texts: list, limit: int, start: int=2, step: int=3) -> tuple:
    """Train one LDA model per topic count in range(start, limit, step) and score its coherence"""

    # Create empty list to store models
    models = []

    # Create empty list to store coherence scores
    coherence_scores = []

    # Loop through number of topics
    for num_topics in range(start, limit, step):

        # Create model
        model = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)

        # Save model
        models.append(model)

        # Create coherence model
        coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')

        # Save coherence score
        coherence_scores.append(coherence_model.get_coherence())

    # Create dataframe of coherence scores
    df = pd.DataFrame({'num_topics': range(start, limit, step), 'coherence_score': coherence_scores})

    return models, coherence_scores, df
    

In [None]:
# Find optimal number of topics
lda_models, lda_coherence_scores, lda_df = find_optimal_num_topics(corpus, dictionary, tokenized_texts, limit=10, start=4, step=1)
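
Once the coherence scores are available, the topic count to keep can be read off directly. The sketch below uses a hypothetical helper, `best_num_topics`, and made-up scores purely for illustration. It assumes the two lists are aligned, as they are in the dataframe returned above.

```python
def best_num_topics(num_topics: list, coherence_scores: list) -> int:
    """Return the topic count with the highest coherence score (first one on ties)."""
    best_index = max(range(len(coherence_scores)), key=coherence_scores.__getitem__)
    return num_topics[best_index]

# Hypothetical scores, for illustration only (not results from this project)
example_num_topics = [4, 5, 6, 7, 8, 9]
example_scores = [0.41, 0.44, 0.43, 0.47, 0.45, 0.42]
best = best_num_topics(example_num_topics, example_scores)
```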

In [None]:
FLAG_SAVE_1c = True

In [None]:
if FLAG_SAVE_1c:
     
    # Save results
    lda_df.to_csv('analysis/1c/lda_coherence_scores.csv', index=False)

    # Save models, naming each file by its actual number of topics
    for model in lda_models:
        model.save(f'analysis/1c/lda_model_{model.num_topics}.model')

**Hierarchical Dirichlet Process (HDP)**

In [None]:
#from gensim.models import HdpModel

In [None]:
# Run HDP model
#hdp_model = HdpModel(corpus=corpus, id2word=dictionary)


**Classify submissions**

In [68]:
# Select lda model with highest coherence score
optimal_lda_model = LdaModel.load('analysis/1c/lda_model_7.model')

In [75]:
for idx, topic in optimal_lda_model.show_topics(formatted=False, num_words=10):
    print('Topic: {} \nWords: {}'.format(idx, ', '.join([w[0] for w in topic])))

Topic: 0 
Words: house, court, supreme, white, news, committee, judge, state, georgia, senate
Topic: 1 
Words: jan, says, s, us, desantis, election, maralago, donald, fbi, joe
Topic: 2 
Words: us, ukraine, covid, arizona, war, says, health, years, china, care
Topic: 3 
Words: new, bill, s, texas, states, law, tax, governor, plan, state
Topic: 4 
Words: s, senate, vote, capitol, gop, says, race, doj, us, general
Topic: 5 
Words: s, abortion, election, gop, voting, voters, poll, climate, party, ic
Topic: 6 
Words: people, one, dont, opinion, time, like, student, right, get, president


In [71]:
# Write function to classify submissions using lda model
def classify_submissions(input_paths: list,
                         output_paths: list,
                         lda_model: LdaModel,
                         dictionary: Dictionary) -> None:
    """Classify submissions using lda model"""

    # Loop through input paths
    for in_path, out_path in zip(input_paths, output_paths):

        # Create the zst handler (the output file is opened only once, in binary mode)
        handle = zstandard.ZstdCompressor().stream_writer(open(out_path, 'wb'))

        for line, file_bytes_processed in read_lines_zst(in_path):
            obj = json.loads(line)

            # Get text and title
            text = obj['selftext']
            title = obj['title']

            # Skip if text is deleted, or removed
            if (text == 'deleted') or (text == 'removed'):
                continue

            # Combine title and text
            full_text = title + ' ' + text

            # Skip if text is empty
            if len(full_text.strip()) == 0:
                continue

            # Get topic distribution
            topic_dist = lda_model.get_document_topics(dictionary.doc2bow(full_text.split()), minimum_probability=0.0)

            # Add topic distribution to object (plain ints and floats keep it parseable later)
            obj['topic_dist'] = str([(int(t), float(p)) for t, p in topic_dist])

            # Write the data to the zst file
            new_line = json.dumps(obj)
            write_line_zst(handle, new_line)

        # Close the handler so the final zst frame is flushed to disk
        handle.close()

In [56]:
output_submissions = [f"data/{s}/{s}_submissions_classified.zst" for s in subreddits]

In [80]:
# Classify submissions
classify_submissions(input_submissions, output_submissions, optimal_lda_model, dictionary)

### 1.d. Sentiment Analysis

Assess the homogeneity of the discussions within each subreddit.

* Sentiment analysis: measure the sentiment (positive, negative, or neutral) of (most engaged) comments on a specific post in each subreddit.
Then, aggregate those sentiments, using upvotes and downvotes as weights, to get a sentiment score for each post. Finally, aggregate the sentiment scores of all posts in a subreddit to get a sentiment score for each subreddit on a specific topic.
(consider using sentiment entropy)

* Visualization and interpretation: Visualize the overall sentiment towards the topic using charts, histograms, or other visual representations. Analyze the results to interpret the subreddit's sentiment and understand the prevailing sentiment towards the topic.
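
The aggregation and the sentiment entropy mentioned above can be sketched as follows. This is only an illustration under stated assumptions: each comment is a `(sentiment, votes)` pair, votes below 1 are clamped to 1 (mirroring the `max(score, 1)` convention used in the cells below), and both inputs are non-empty. The function names and toy values are hypothetical.

```python
import math
from collections import Counter

def weighted_sentiment(comments: list) -> float:
    """Vote-weighted mean sentiment; each comment is a (sentiment, votes) pair."""
    total_weight = sum(max(votes, 1) for _, votes in comments)
    return sum(s * max(votes, 1) for s, votes in comments) / total_weight

def sentiment_entropy(labels: list) -> float:
    """Shannon entropy (in bits) of the distribution of sentiment labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy values (hypothetical, for illustration only)
toy_comments = [(0.8, 10), (-0.2, 5), (0.1, 0)]
toy_labels = ["positive", "positive", "negative", "neutral"]
post_score = weighted_sentiment(toy_comments)
post_entropy = sentiment_entropy(toy_labels)
```

A lower entropy would suggest more homogeneous sentiment within a thread, which is the kind of signal this section looks for.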


In [None]:
# Calculate average sentiment towards each submission 
def calculate_sentiment_submission(input_comments: list, output_comments: list) -> None:
    """Calculate average sentiment towards each submission"""

    # Loop through input paths
    for in_comment, out_comment in zip(input_comments, output_comments):

        # Create the zst handler (the output file is opened only once, in binary mode)
        handle = zstandard.ZstdCompressor().stream_writer(open(out_comment, 'wb'))

        submissions = {}

        for line, file_bytes_processed in read_lines_zst(in_comment):
            obj = json.loads(line)

            # Skip if sentiment or link_id doesn't exist
            if 'sentiment' not in obj or 'link_id' not in obj:
                continue

            # Read ups and downs, treating missing or empty values as 0
            ups = int(obj.get('ups') or 0)
            downs = int(obj.get('downs') or 0)

            # Give the comment a higher weight the more votes it has
            if ups != 0 or downs != 0:
                interactions = max(ups + abs(downs), 1)

            # Otherwise, use the score
            else:
                interactions = max(obj['score'], 1)

            # Weight the sentiment by the number of interactions
            sentiment = obj['sentiment'] * interactions

            # Accumulate weighted sentiment and interactions per submission
            if obj['link_id'] not in submissions:
                submissions[obj['link_id']] = [sentiment, interactions]
            else:
                submissions[obj['link_id']][0] += sentiment
                submissions[obj['link_id']][1] += interactions

        # Save the data to zst file
        for link_id, (sentiment, interactions) in submissions.items():
            obj = {'link_id': link_id,
                   'sentiment': sentiment,
                   'interactions': interactions}
            new_line = json.dumps(obj)
            write_line_zst(handle, new_line)

        # Close the handler so the final zst frame is flushed to disk
        handle.close()

    return

In [None]:
# Create input and output paths
output_overall_sentiment = [f"analysis/1d/{s}_submissions_overall_sentiment.zst" for s in subreddits]

In [None]:
calculate_sentiment_submission(output_comments, output_overall_sentiment)

## 2. Compare Echo Chambers

### 2.a. Analyzing sentiment and opposing views:

- Sentiment analysis: Utilize sentiment analysis techniques to determine the sentiment (positive, negative, or neutral) of the posts and comments related to specific topics within each subreddit.

- Identify opposing subreddits: Identify subreddits with opposing views by comparing the sentiment and language used in their discussions.

- Sentiment analysis across subreddits: Compare the sentiment distribution and polarity scores of the same topic discussed in different subreddits with opposing views.
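
One simple way to compare the same topic across two subreddits is to normalize the positive and negative scores into a single balance and take the difference. The sketch below assumes a nested dictionary of the form `{topic: {subreddit: {'positive': ..., 'negative': ...}}}` with non-negative scores. The function names and toy numbers are hypothetical and are not results from this project.

```python
def net_sentiment(positive: float, negative: float) -> float:
    """Normalized balance in [-1, 1]; 0.0 when both scores are zero."""
    total = positive + negative
    return (positive - negative) / total if total else 0.0

def sentiment_gap(topic_scores: dict, sub_a: str, sub_b: str) -> dict:
    """Per-topic difference in net sentiment between two subreddits."""
    gaps = {}
    for topic, by_sub in topic_scores.items():
        # Only compare topics that both subreddits discuss
        if sub_a in by_sub and sub_b in by_sub:
            a = net_sentiment(by_sub[sub_a]["positive"], by_sub[sub_a]["negative"])
            b = net_sentiment(by_sub[sub_b]["positive"], by_sub[sub_b]["negative"])
            gaps[topic] = a - b
    return gaps

# Toy scores (hypothetical, for illustration only)
toy_topic_scores = {
    "topic_x": {"sub_a": {"positive": 30.0, "negative": 10.0},
                "sub_b": {"positive": 10.0, "negative": 30.0}},
    "topic_y": {"sub_a": {"positive": 5.0, "negative": 5.0}},
}
gaps = sentiment_gap(toy_topic_scores, "sub_a", "sub_b")
```

Normalizing first keeps large subreddits from dominating the comparison simply because they accumulate more votes.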

In [16]:
input_comments_sentiment = [f"data/{s}/{s}_comments_sentiment.zst" for s in subreddits if s != 'politics']
output_submission_sentiment = [f"analysis/1d/{s}_submissions_sentiment.zst" for s in subreddits if s != 'politics']

In [94]:
# Function to calculate average emotion towards each submission
def calculate_emotion_submission(input_comments_sentiment: list, output_submission_sentiment: list) -> None:
    
    emotions = {'fear': 0, 'anger': 0, 'anticip': 0, 'trust': 0,
            'surprise': 0, 'sadness': 0, 'joy': 0, 'disgust': 0,
            'positive': 0, 'negative': 0}

    # Loop through input paths
    for in_comment, out_submission in zip(input_comments_sentiment, output_submission_sentiment):

        # Start a fresh accumulator so each output file only holds its own subreddit
        submissions = {}

        # Create the zst handler (the output file is opened only once, in binary mode)
        handle = zstandard.ZstdCompressor().stream_writer(open(out_submission, 'wb'))

        for line, file_bytes_processed in read_lines_zst(in_comment):
            obj = json.loads(line)

            # Skip if link_id doesn't exist
            if 'link_id' not in obj:
                continue

            # Read ups and downs, treating missing or empty values as 0
            ups = int(obj.get('ups') or 0)
            downs = int(obj.get('downs') or 0)

            # Weight by votes if ups or downs exist
            if ups != 0 or downs != 0:
                interactions = ups + downs
                if interactions == 0: interactions = 1

            # Otherwise, use the score
            else:
                interactions = max(obj['score'], 1)

            # Initialize the accumulators for a new submission
            if obj['link_id'] not in submissions:
                submissions[obj['link_id']] = {'polarity': 0,
                                               'subjectivity': 0,
                                               'emotions': emotions.copy(),
                                               'interactions': 0}

            # Accumulate weighted polarity, subjectivity, and emotions
            # ('polaritiy' is kept as spelled in the original cell; assumed to match the key in the input files)
            submissions[obj['link_id']]['polarity'] += obj['polaritiy'] * interactions
            submissions[obj['link_id']]['subjectivity'] += obj['subjectivity'] * interactions

            for e in emotions:
                submissions[obj['link_id']]['emotions'][e] += obj['emotions'][e] * interactions

            # Add interactions
            submissions[obj['link_id']]['interactions'] += interactions

        # Save the data to zst file, including link_id
        for link_id, obj in submissions.items():
            obj['link_id'] = link_id
            new_line = json.dumps(obj)
            write_line_zst(handle, new_line)

        # Close the handler so the final zst frame is flushed to disk
        handle.close()

    return

In [95]:
calculate_emotion_submission(input_comments_sentiment, output_submission_sentiment)

In [59]:
output_submissions_classified = [f"data/{s}/{s}_submissions_classified.zst" for s in subreddits if s != 'politics']

In [71]:
import ast

In [76]:
# For a list of tuples, find the tuple with the highest value in its second element
def find_max_tuple(tuples: list) -> tuple:
    """Find tuple with highest value in second element"""

    # Initialize max tuple
    max_tuple = ('', 0)

    # Loop through tuples
    for t in tuples:

        # Update max tuple if second element is greater than current max
        if t[1] > max_tuple[1]:
            max_tuple = t

    return max_tuple

In [85]:
# Load submissions classified and get topic with highest probability
submissions_classified = {s: {} for s in subreddits if s != 'politics'}

for sub in subreddits[:-1]:

    s_class = f"data/{sub}/{sub}_submissions_classified.zst"

    # Load topic distribution from each submission classified
    for line, file_bytes_processed in read_lines_zst(s_class):

            # Convert the line to a json object
            obj = json.loads(line)
    
            # Get link_id
            link_id = obj['id']

            # Find topic with highest probability
            topic_dist = ast.literal_eval(obj['topic_dist'])
            topic = find_max_tuple(topic_dist)[0]

            # Add to dictionary
            submissions_classified[sub][link_id] = {'topic': topic}


In [107]:
# Load submissions sentiment and get average sentiment
submissions_sentiment = {s: {} for s in subreddits if s != 'politics'}

for sub in subreddits[:-1]:
    
    s_sent = f"analysis/1d/{sub}_submissions_sentiment.zst"

    # Load polarity, subjectivity, and emotions for each submission
    for line, file_bytes_processed in read_lines_zst(s_sent):

            # Convert the line to a json object
            obj = json.loads(line)

            # Get link_id
            link_id = obj['link_id'].split('_')[-1]

            # Get emotions, polarity, and subjectivity
            emotions = obj['emotions']
            polarity = obj['polarity']
            subjectivity = obj['subjectivity']

            # Add to dictionary
            submissions_sentiment[sub][link_id] = {'emotions': emotions,
                                                    'polarity': polarity,
                                                    'subjectivity': subjectivity}

In [116]:
# Iterate over all submissions_classified in each subreddit, and calculate the summed sentiment and emotions for each topic
submissions_classified_sentiment = {s: {} for s in subreddits if s != 'politics'}
count = 0
for sub in subreddits[:-1]:
    count += 1
    # Loop through submissions classified
    for link_id, obj in submissions_classified[sub].items():

        count += 1
        if link_id not in submissions_sentiment[sub]:
            continue

        # Get topic
        topic = obj['topic']

        # Get sentiment and emotions from submissions sentiment
        sentiment = submissions_sentiment[sub][link_id]
        emotions = sentiment['emotions']
        polarity = sentiment['polarity']
        subjectivity = sentiment['subjectivity']

        # Add to dictionary
        if topic not in submissions_classified_sentiment[sub]:
            submissions_classified_sentiment[sub][topic] = {'emotions': emotions,
                                                            'polarity': polarity,
                                                            'subjectivity': subjectivity,
                                                            'count': 1}
        else:
            submissions_classified_sentiment[sub][topic]['emotions'] = {k: v + submissions_classified_sentiment[sub][topic]['emotions'][k] for k, v in emotions.items()}
            submissions_classified_sentiment[sub][topic]['polarity'] += polarity
            submissions_classified_sentiment[sub][topic]['subjectivity'] += subjectivity
            submissions_classified_sentiment[sub][topic]['count'] += 1

In [121]:
# For each topic, plot the positive and negative emotions for each subreddit
topics = {0: {}, 1: {}, 2: {}, 3: {}, 4: {}, 5: {}, 6: {}}
for sub in subreddits[:-1]:

    for t in range(7):
        positive = submissions_classified_sentiment[sub][t]['emotions']['positive']
        negative = submissions_classified_sentiment[sub][t]['emotions']['negative']
        count = submissions_classified_sentiment[sub][t]['count']
        topics[t][sub] = {'positive': positive/count, 'negative': negative/count}



In [170]:
topic_mapping = {0: 'Judicial System', 1: 'Political Figures & Investigations', 2: 'International Affairs', 3: 'State Government', 4: 'Federal Government', 5: 'Social Issues', 6: 'Other'}

In [173]:
# Map the topic ids to their descriptive names
topics_renamed = {topic_mapping[k]: v for k, v in topics.items()}

In [177]:
# Remove from the keys the subreddits that are not needed (progressive, Republican, PoliticalDiscussion)
topics_renamed = {k: {k2: v2 for k2, v2 in v.items() if k2 not in ['progressive', 'Republican', 'PoliticalDiscussion']} for k, v in topics.items()}

# Rename the remaining keys: "Conservative" to "Right-Leaning", "democrats" to "Left-Leaning", and "NeutralPolitics" to "Neutral"
rename_dict = {'Conservative': 'Right-Leaning', 'democrats': 'Left-Leaning', 'NeutralPolitics': 'Neutral'}
topics_renamed = {k: {rename_dict[k2]: v2 for k2, v2 in v.items()} for k, v in topics_renamed.items()}

In [181]:
topics_renamed

{0: {'Right-Leaning': {'positive': 4510.43842364532,
   'negative': 4535.679802955665},
  'Left-Leaning': {'positive': 521.4, 'negative': 494.23},
  'Neutral': {'positive': 11995.953125, 'negative': 11434.296875}},
 1: {'Right-Leaning': {'positive': 6675.825980392156,
   'negative': 6602.651960784314},
  'Left-Leaning': {'positive': 674.5031055900621,
   'negative': 623.8322981366459},
  'Neutral': {'positive': 9115.82142857143, 'negative': 9143.330357142857}},
 2: {'Right-Leaning': {'positive': 8824.87012987013,
   'negative': 8349.298701298701},
  'Left-Leaning': {'positive': 340.4117647058824,
   'negative': 321.97058823529414},
  'Neutral': {'positive': 12648.6875, 'negative': 11405.125}},
 3: {'Right-Leaning': {'positive': 3007.7134831460676,
   'negative': 3071.938202247191},
  'Left-Leaning': {'positive': 214.7704918032787,
   'negative': 221.0983606557377},
  'Neutral': {'positive': 12876.446808510638, 'negative': 12592.31914893617}},
 4: {'Right-Leaning': {'positive': 3746.738

In [217]:
# Plot each topic as a bar chart
import matplotlib.pyplot as plt  # plotting imports, in case they were not loaded in an earlier cell
import seaborn as sns

# range(6) skips topic 6 ("Other")
for t in range(6):

    values = topics_renamed[t]
    df = pd.DataFrame(values).T
    # Net emotion: average positive score minus average negative score
    df["overall"] = df["positive"] - df["negative"]

    # Plot the bar chart using seaborn
    plt.figure(figsize=(12,8))

    # Plot the net emotion as a vertical bar chart, one bar per political leaning
    sns.barplot(y=df["overall"], x=df.index, palette="YlGnBu_r")

    # Set the title and axis labels
    title = f"Average Net Emotion (Positive - Negative) for Topic: {topic_mapping[t]}"
    plt.title(title)
    plt.ylabel("Average Net Emotion")
    plt.xlabel("Subreddits by Political Leaning")

    # Save the image
    plt.tight_layout(pad=1)
    plt.savefig(f"analysis/1d/plots/topic_{t}_positive_negative_emotions.png")
    #plt.show()
    plt.close()

State Government
Social Issues


___
# Ideas !

Possible research questions:

Does the echo chamber effect have a stronger influence on certain topics?
What are the dominant topics, sentiments, and differences in language usage between liberal and conservative subreddits, and how do these findings contribute to our understanding of echo chambers?

1. Detect subreddits that act as echo chambers in Reddit:
   - Get Reddit data (posts and comments) using the Reddit API or any available dataset.
   - Identify different subreddits that can be good for the analysis.
   - Measure the homogeneity of the discussions within each subreddit by looking at the similarity of language and opinions.
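
One possible way to put a number on "similarity of language" is the average pairwise cosine similarity between TF-IDF vectors of the posts in a subreddit. The sketch below is only an illustration under that assumption: the function names, the smoothed IDF formula, and the toy texts are all hypothetical choices, and it captures shared vocabulary, not agreement of opinion.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Lowercase and keep alphabetic tokens (apostrophes allowed)
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def homogeneity(docs):
    """Average pairwise cosine similarity; higher means more similar language."""
    pairs = list(combinations(tfidf_vectors(docs), 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy texts, made up for illustration
similar = ["taxes are too high", "taxes are far too high", "high taxes hurt"]
mixed = ["taxes are too high", "the game last night was great", "new phone review"]
print(homogeneity(similar), homogeneity(mixed))
```

On real data the pairwise loop grows quadratically with the number of posts, so a random sample of posts per subreddit is probably needed.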

2. Find the main topics discussed in those subreddits:
   - Clean the textual data by removing stop words, punctuation, and special characters. Maybe stemming or lemmatization too?
   - Identify main topics discussed in those subreddits using Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF).
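
The cleaning step could look roughly like the sketch below. It is a standard-library illustration only: the small `STOP_WORDS` set is a hypothetical stand-in for a full list such as the NLTK English stop words, and `clean_text` is a name invented here. Stemming or lemmatization would be an extra step applied to the returned tokens.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; a real run would use a full list
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of",
              "in", "on", "for", "it", "this", "that"}

URL_RE = re.compile(r"https?://\S+")
# Map every ASCII punctuation character to a space
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, drop URLs, punctuation, non-alphabetic tokens and stop words."""
    text = URL_RE.sub(" ", text.lower())
    text = text.translate(PUNCT_TABLE)
    # isalpha() also discards numbers and tokens that mix letters with digits
    return [tok for tok in text.split() if tok.isalpha() and tok not in stop_words]


tokens = clean_text("The Senate is voting on this bill: https://example.com/bill?id=1 !!!")
print(tokens)
```

The token lists produced this way are the usual input for building the document-term matrix that LDA or NMF is fitted on.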

3. Analyze sentiment:
   - Determine the sentiment (positive, negative, or neutral) of the posts and comments related to specific topics within each subreddit.
   - Identify subreddits with opposing views by comparing the sentiment and language used in their discussions.
   - (Optional) Analyze sentiment at a more granular level by looking at individual emotions (anger, joy, sadness, etc.).
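
To make the positive/negative/neutral labelling concrete, here is a minimal lexicon-based sketch. The two word sets, the 0.1 threshold, and the function names are hypothetical and far too small for real use. It only shows the general idea of counting matches and thresholding the score, and is not the scoring used elsewhere in this notebook.

```python
import re

# Hypothetical mini-lexicons, for illustration only
POSITIVE = {"good", "great", "support", "win", "hope"}
NEGATIVE = {"bad", "terrible", "corrupt", "fail", "fear"}


def polarity(text):
    """Score in [-1, 1]: (positive hits - negative hits) / total hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    if pos + neg == 0:
        return 0.0  # no lexicon words found, treat as neutral
    return (pos - neg) / (pos + neg)


def label(score, threshold=0.1):
    """Turn a polarity score into a positive / negative / neutral label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


for text in ["great win, good hope", "a corrupt and terrible plan", "the vote is on tuesday"]:
    print(text, "->", label(polarity(text)))
```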

4. Analyzing sentiment change, vocabulary, and opinion shift:
   - I would also like to see whether spending more time on a subreddit that acts as an echo chamber changes a user's sentiment from negative/neutral to positive (or vice versa) on specific topics.
   - Finally, I want to measure whether a user's sentiment changes based on the subreddit they are interacting with. This is VERY difficult to do without being able to track users consistently across subreddits, look at their exposure to the same topics/comments, and measure their sentiment towards them.
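
If comments can be tied to a stable user name, a first rough check is to compare each user's mean sentiment across two subreddits. The sketch below assumes every comment is a dictionary with `author`, `subreddit`, and `polarity` keys. Those field names, the function names, and the toy records are assumptions for illustration. It does not control for topic or exposure, which is the hard part noted above.

```python
from collections import defaultdict


def mean_polarity_by_user_and_subreddit(comments):
    """Return {(author, subreddit): mean polarity} from a list of comment dicts."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [sum of polarity, number of comments]
    for comment in comments:
        key = (comment["author"], comment["subreddit"])
        sums[key][0] += comment["polarity"]
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


def cross_subreddit_shift(means, sub_a, sub_b):
    """For users active in both subreddits, mean polarity in sub_b minus sub_a."""
    users_a = {user for (user, sub) in means if sub == sub_a}
    users_b = {user for (user, sub) in means if sub == sub_b}
    return {user: means[(user, sub_b)] - means[(user, sub_a)]
            for user in users_a & users_b}


# Toy records, made up for illustration
toy_comments = [
    {"author": "user_1", "subreddit": "Conservative", "polarity": 0.25},
    {"author": "user_1", "subreddit": "Conservative", "polarity": 0.75},
    {"author": "user_1", "subreddit": "democrats", "polarity": -0.5},
    {"author": "user_2", "subreddit": "democrats", "polarity": 0.5},
]
means = mean_polarity_by_user_and_subreddit(toy_comments)
shift = cross_subreddit_shift(means, "Conservative", "democrats")
print(shift)
```

Only users who appear in both subreddits are compared, so the result says nothing about users who stay inside a single community.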
   