## **Network and Sentiment Analysis using Reddit Posts - Proposal**

**Submitted by:** Euclides Rodriguez 

**Course:** DATA 620

**Data Source:** https://github.com/linanqiu/reddit-dataset

### **Introduction**

Reddit is a social network platform that allows users to create posts within topic-specific groups called subreddits. Subreddits cover a vast range of topics, from news and entertainment to physics. Subreddits with similar topics are grouped under one umbrella category, which the dataset used here calls a metareddit. The goal of this project is to perform network and sentiment analysis on the subreddits under the news metareddit. Reddit serves as a rich platform for gauging public opinion and emotional response, offering valuable insights for political organizations and other stakeholders interested in understanding which issues resonate with different communities.

Key questions guiding this analysis include:

* Are discussions dominated by certain users?
* What is the prevailing sentiment of posts?
* Are there multiple users influencing multiple Reddit posts?


### **Data Collection and Preparation**
For this project, we will leverage several Python libraries to carry out both the network and sentiment analyses. NetworkX will be used to construct and analyze user interaction graphs within and across subreddits. This includes building edgelists to model user relationships, identifying central or highly connected users, and detecting distinct communities or clusters within the network. NLTK (Natural Language Toolkit) will be employed for the sentiment analysis component, enabling us to process and analyze the textual content of user comments. This includes calculating sentiment scores for individual comments, categorizing them as positive, negative, or neutral, and tracking sentiment trends over time. Together, these tools will allow us to visualize the structure of subreddit communities, highlight influential participants, and interpret the emotional tone of discussions, providing insights into both the social dynamics and collective sentiment of the selected Reddit groups.

Data is obtained from the GitHub user Linan Qiu. Each data set has the following fields:

* **text:** Text of the comment / thread
* **id:** Unique reddit id for the comment / thread
* **subreddit:** Subreddit that the comment / thread belongs to
* **meta:** Metareddit that the comment / thread belongs to. Subreddits belong to metareddits. A subreddit can be leagueoflegends. The metareddit for that subreddit would be gaming, which can also include the subreddit dota2
* **time:** UNIX timestamp of the comment / thread
* **author:** Username of the author of the comment / thread
* **ups:** Number of upvotes the comment / thread received
* **downs:** Number of downvotes the comment / thread received
* **authorlinkkarma:** The author's link karma.
* **authorkarma:** The author's karma.
* **authorisgold:** Boolean indicator for the gold status of the user. 1 for gold users, 0 for non-gold (normal) users.

Data preparation for this project will be divided into two key components to support both the network and sentiment analyses.

**Network Construction (Edgelist Creation):**  
The first stage involves building an edgelist that represents relationships between users within each subreddit. In this context, an edgelist is a structured list of user interactions, where each edge connects two users based on a defined interaction—such as replying to a comment or participating in the same discussion thread. This structure will allow us to model the community as a graph, with nodes representing users and edges representing the interactions between them. This graph will serve as the foundation for conducting network analysis using tools like NetworkX.
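As a rough illustration of the edgelist step, the sketch below links two users whenever they comment in the same discussion thread and uses the number of shared threads as the edge weight. It assumes each comment can be tagged with a hypothetical `thread_id`; the fields listed above do not include one, so that key would have to be derived during preparation. The rows are toy data, not taken from the dataset.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_edgelist(rows):
    """Return {(user_a, user_b): weight} from (thread_id, author) pairs.

    `thread_id` is a hypothetical field: it stands in for whatever key
    ends up identifying a discussion thread in the prepared data.
    """
    # Collect the set of distinct authors seen in each thread.
    authors_by_thread = defaultdict(set)
    for thread_id, author in rows:
        authors_by_thread[thread_id].add(author)

    # Every unordered pair of co-participants gets one count per shared thread.
    edges = Counter()
    for authors in authors_by_thread.values():
        for a, b in combinations(sorted(authors), 2):
            edges[(a, b)] += 1
    return dict(edges)

# Toy rows for illustration only.
toy_rows = [
    ("t1", "alice"), ("t1", "bob"), ("t1", "carol"),
    ("t2", "alice"), ("t2", "bob"),
    ("t3", "dave"),
]
edgelist = build_edgelist(toy_rows)
```

The resulting pairs could then be loaded into NetworkX with `G.add_weighted_edges_from((a, b, w) for (a, b), w in edgelist.items())`. Users who never share a thread with anyone (such as `dave` here) produce no edges and would need to be added as isolated nodes if they should appear in the graph.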

**Text Cleaning and Preprocessing:**  
The second stage focuses on preparing the user comments for sentiment analysis by cleaning the textual data. Raw comments from Reddit often contain elements that do not contribute meaningful information to the analysis, such as markdown formatting, URLs, special characters, and extraneous punctuation. These will be removed as part of the preprocessing pipeline. In addition, common stopwords (e.g., “the,” “and,” “is”) will be eliminated to focus on the most informative terms. All text will be converted to lowercase to ensure uniformity and reduce redundancy caused by case sensitivity. This cleaned and standardized text will then be ready for further processing, including tokenization, frequency analysis, and sentiment scoring.
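The cleaning steps described above could look roughly like the following. This is a minimal sketch: the regular expressions and the short stopword set are illustrative assumptions, and NLTK's English stopword list (`nltk.corpus.stopwords`) could replace the inline set once it has been downloaded.

```python
import re

# Small illustrative stopword set (an assumption, not the final list).
STOPWORDS = {"the", "and", "is", "a", "an", "of", "to", "in", "it", "this"}

MD_LINK_RE = re.compile(r"\[([^\]]*)\]\([^)]*\)")  # [text](url) -> text
URL_RE = re.compile(r"https?://\S+|www\.\S+")      # bare URLs
NON_ALPHA_RE = re.compile(r"[^a-z\s]")             # digits, punctuation, markdown symbols

def clean_comment(text):
    """Lowercase, strip markdown links, URLs and punctuation, drop stopwords."""
    text = text.lower()
    text = MD_LINK_RE.sub(r"\1", text)   # keep the link text, drop the target
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]

tokens = clean_comment(
    "This is **great**: see [the report](http://example.com) and https://example.org!"
)
# tokens == ["great", "see", "report"]
```

Markdown links are handled before bare URLs so that the link text survives while the target is discarded.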

In [5]:
import pandas as pd
import warnings
import matplotlib.pyplot as plt



In [6]:
df_conservative = pd.read_csv('https://raw.githubusercontent.com/linanqiu/reddit-dataset/refs/heads/master/news_conservative.csv', encoding='unicode_escape')
df_conspiracy = pd.read_csv('https://raw.githubusercontent.com/linanqiu/reddit-dataset/refs/heads/master/news_conspiracy.csv', encoding='unicode_escape')
df_libertarian = pd.read_csv('https://raw.githubusercontent.com/linanqiu/reddit-dataset/refs/heads/master/news_libertarian.csv', encoding='unicode_escape')
df_news = pd.read_csv('https://raw.githubusercontent.com/linanqiu/reddit-dataset/refs/heads/master/news_news.csv', encoding='unicode_escape')
df_offbeat = pd.read_csv('https://raw.githubusercontent.com/linanqiu/reddit-dataset/refs/heads/master/news_offbeat.csv', encoding='unicode_escape')
df_politics = pd.read_csv('https://raw.githubusercontent.com/linanqiu/reddit-dataset/refs/heads/master/news_politics.csv', encoding='unicode_escape')
df_truereddit = pd.read_csv('https://raw.githubusercontent.com/linanqiu/reddit-dataset/refs/heads/master/news_truereddit.csv', encoding='unicode_escape')
df_worldnews = pd.read_csv('https://raw.githubusercontent.com/linanqiu/reddit-dataset/refs/heads/master/news_worldnews.csv', encoding='unicode_escape')

In [7]:
df_conservative.head()

| Unnamed: 0.1 | Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | to be honest i do nt completely understand th... | d02n8mf | conservative | news | 1455671951 | promethean7 | 1 | 0 | 21 | 5178 | 0.0 |
| 1 | 1 | ugh i clicked out after reading this libertard... | d02giht | conservative | news | 1455661637 | wmegenney | 1 | 0 | 821 | 7065 | 1.0 |
| 2 | 2 | like or dislike anyone i do nt think i would ... | d02lqvl | conservative | news | 1455669637 | gizayabasu | 1 | 0 | 509 | 3241 | 1.0 |
| 3 | 3 | kasich is already talking about states after s... | d01xuob | conservative | news | 1455635818 | SonyXperiaZ3c | 9 | 0 | 21 | 3428 | 0.0 |
| 4 | 4 | why just one | d01wy1x | conservative | news | 1455634221 | propshaft | 7 | 0 | 42134 | 27119 | 0.0 |


In [8]:
df_conspiracy.head()

| Unnamed: 0.1 | Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 |  | 465unr | conspiracy | news | 1455674000.0 | yyhhggt | 16.0 | 0.0 | 172356 | 6301 | 0.0 |
| 1 | 1 | i have a feeling we are going to start seeing ... | d02jw2c | conspiracy | news | 1455667000.0 | Putin_loves_cats | 5.0 | 0.0 | 4402 | 14843 | 1.0 |
| 2 | 2 | here s another article http virologydownunde... | d029ao3 | conspiracy | news | 1455652000.0 | Irishpunk72 | -8.0 | 0.0 | 114 | -3 | 0.0 |
| 3 | 3 | i too can post articles just like the shills ... | d02f5sn | conspiracy | news | 1455660000.0 | docmongre | 8.0 | 0.0 | 787 | 6661 | 0.0 |
| 4 | 4 | this guy is awesome wish we had multitudes of ... | d02co1v | conspiracy | news | 1455656000.0 | nonconformist3 | 5.0 | 0.0 | 3352 | 36148 | 0.0 |


In [9]:
df_libertarian.head()

| Unnamed: 0.1 | Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 |  | 4644c4 | libertarian | news | 1455651000.0 | DrWinters | 3.0 | 0.0 | 12728 | 1641 | 0.0 |
| 1 | 1 |  | 464kvf | libertarian | news | 1455657000.0 | unknownman19 | 20.0 | 0.0 | 50364 | 14436 | 0.0 |
| 2 | 2 |  | 466hv9 | libertarian | news | 1455683000.0 | hp_chabanais | 1.0 | 0.0 | 178 | 3 | 0.0 |
| 3 | 3 |  | 462wog | libertarian | news | 1455636000.0 | ghostofpennwast | 18.0 | 0.0 | 231424 | 80539 | 1.0 |
| 4 | 4 | if you include the citizens conscripted with t... | d01nh1w | libertarian | news | 1455605000.0 | PacificBreeze | 0.0 | 0.0 | 333 | 17315 | 0.0 |


In [10]:
df_news.head()

| Unnamed: 0.1 | Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 |  | 46579k | news | news | 1455664000.0 | the_last_broadcast | 73.0 | 0.0 | 385513 | 1971 | 0.0 |
| 1 | 1 | protesters lose jobs for not showing up to wo... | d02jvmq | news | news | 1455667000.0 | tiamdi | 9.0 | 0.0 | 943 | 76932 | 0.0 |
| 2 | 2 | i do believe they are nt understanding that th... | d02rgsy | news | news | 1455678000.0 | TechnologyIsAmazing | 8.0 | 0.0 | 1 | 74 | 0.0 |
| 3 | 3 | why did nt they care this much about their fre... | d02sy4c | news | news | 1455681000.0 | BlueSardines | 1.0 | 0.0 | 90 | 17376 | 0.0 |
| 4 | 4 | even if they wrote a program to stop the wait ... | d02svp7 | news | news | 1455681000.0 | MagicalMick | 1.0 | 0.0 | 19 | 1691 | 0.0 |


In [11]:
df_offbeat.head()

| Unnamed: 0.1 | Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 |  | 465rlz | offbeat | news | 1455672503 | Fitbitnitwit | 6 | 0 | 14331 | 451 | 0.0 |
| 1 | 1 |  | 4628rq | offbeat | news | 1455626105 | Fitbitnitwit | 31 | 0 | 14331 | 451 | 0.0 |
| 2 | 2 | have you seen their drive thru ha | d01tfw3 | offbeat | news | 1455626120 | Fitbitnitwit | 2 | 0 | 14331 | 451 | 0.0 |
| 3 | 3 | something about the phrase intestinal flora ... | d027kv5 | offbeat | news | 1455649681 | PatchClark | 2 | 0 | 243 | 471 | 0.0 |
| 4 | 4 | well by saying outlandish stuff that generate... | d027ue8 | offbeat | news | 1455650040 | Doctor_Sportello | 13 | 0 | 798 | 4381 | 0.0 |


In [12]:
df_politics.head()

| Unnamed: 0.1 | Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | disclaimer i think obama should nominate some... | d028c5d | politics | news | 1455651000.0 | degausse | 3.0 | 0.0 | 1 | 1941 | 0.0 |
| 1 | 1 |  | 463sa9 | politics | news | 1455647000.0 | trash_reason | 371.0 | 0.0 | 5613 | 1361 | 0.0 |
| 2 | 2 | either way the process will be dragged out unt... | d026od6 | politics | news | 1455648000.0 | cyberspyder | 5.0 | 0.0 | 1300 | 6462 | 1.0 |
| 3 | 3 | republicans have always battled with severe ca... | d026wk3 | politics | news | 1455649000.0 | jabb0 | 15.0 | 0.0 | 44981 | 141501 | 0.0 |
| 4 | 4 | politics were so different back then people on... | d02a3k3 | politics | news | 1455653000.0 | Hypertension123456 | 2.0 | 0.0 | 46 | 42036 | 0.0 |


In [13]:
df_truereddit.head()

| Unnamed: 0.1 | Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | a tie in the supreme court means that the lowe... | d00s8oc | truereddit | news | 1455555000.0 | Wierd_Carissa | 1.0 | 0.0 | 914 | 24073 | 0.0 |
| 1 | 1 | gt memories of his regrettable prejudices wil... | d01wyc3 | truereddit | news | 1455634000.0 | joefuf | 0.0 | 0.0 | 8941 | 2570 | 0.0 |
| 2 | 2 | the article undermines its own headline i beli... | d0246a0 | truereddit | news | 1455645000.0 | paulrpotts | 4.0 | 0.0 | 833 | 6425 | 0.0 |
| 3 | 3 |  | 462man | truereddit | news | 1455632000.0 | Schlagv | 5.0 | 0.0 | 3010 | 35576 | 0.0 |
| 4 | 4 | i do nt take pleasure in anyone s death but i... | d00u99b | truereddit | news | 1455558000.0 | DrOil | 24.0 | 0.0 | 2047 | 13234 | 0.0 |


In [14]:
df_worldnews.head()

| Unnamed: 0.1 | Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 |  | 465vcp | worldnews | news | 1455674000.0 | JonnyTheRobot | 7.0 | 0.0 | 49.0 | 32.0 | 0.0 |
| 1 | 1 | gt more mistakes than achievements if that ... | d02k5wp | worldnews | news | 1455667000.0 | bob_marley98 | 1.0 | 0.0 | 254.0 | 14415.0 | 0.0 |
| 2 | 2 | i am certain the guardian will issue a retract... | d02s7uf | worldnews | news | 1455680000.0 | Imapopulistnow | 1.0 | 0.0 | 413.0 | 30756.0 | 0.0 |
| 3 | 3 | gt when the fishing gets tough penguins simp... | d02g2rx | worldnews | news | 1455661000.0 | \_The-Big-Giant-Head\_ | 2.0 | 0.0 | 17485.0 | 9456.0 | 1.0 |
| 4 | 4 | i m doing my part to keep those hard working v... | d02odz1 | worldnews | news | 1455674000.0 | CaramelApplesRock | 1.0 | 0.0 | 6.0 | 1347.0 | 0.0 |
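The previews above show two leftover index columns (`Unnamed: 0.1`, `Unnamed: 0`) and eleven data columns labeled `0` through `10`. Comparing the values with the field list given earlier suggests the data columns follow that list in order, so one possible way to label them is sketched below. The positional mapping is an assumption inferred from the previews, and `label_columns` is a hypothetical helper name.

```python
import pandas as pd

# Field names in the order given in the dataset description.
FIELD_NAMES = ["text", "id", "subreddit", "meta", "time", "author",
               "ups", "downs", "authorlinkkarma", "authorkarma", "authorisgold"]

def label_columns(df):
    """Drop leftover index columns and name the 11 data columns."""
    # Columns such as 'Unnamed: 0' are saved row indices, not data.
    keep = [c for c in df.columns if not str(c).startswith("Unnamed")]
    out = df[keep].copy()
    out.columns = FIELD_NAMES  # raises ValueError if the column count differs
    return out

# Toy frame mimicking the preview layout (one row from the conservative preview).
raw = pd.DataFrame(
    [[4, 4, "why just one", "d01wy1x", "conservative", "news",
      1455634221, "propshaft", 7, 0, 42134, 27119, 0.0]],
    columns=["Unnamed: 0.1", "Unnamed: 0"] + [str(i) for i in range(11)],
)
labeled = label_columns(raw)
```

The same helper could be applied to each loaded frame, for example `df_conservative = label_columns(df_conservative)`. The `time` field can then be converted with `pd.to_datetime(labeled["time"], unit="s")`.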


### **Visualization**
To effectively illustrate the relationships and interactions within each subreddit, we will use Matplotlib in combination with NetworkX to generate network visualizations. These networks will be constructed based on user interactions, such as replies or comment threads, highlighting the structure and centrality of user engagement within the subreddit communities. Due to the potential length and clutter caused by usernames, node labels will be omitted in the visualizations to maintain clarity and readability. Instead, visual emphasis will be placed on node size and color to represent user activity or connectivity levels.
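A label-free network plot of this kind might be produced as follows. The graph is a toy example with placeholder usernames, and scaling node size and color by degree is one assumed choice among several possible measures of activity or connectivity.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Toy interaction graph; usernames are placeholders.
G = nx.Graph()
G.add_edges_from([("u1", "u2"), ("u1", "u3"), ("u1", "u4"),
                  ("u2", "u3"), ("u4", "u5")])

degrees = dict(G.degree())
node_sizes = [100 * degrees[n] for n in G.nodes()]  # larger node = more connections
node_colors = [degrees[n] for n in G.nodes()]       # color also encodes degree

pos = nx.spring_layout(G, seed=42)  # fixed seed for a reproducible layout
fig, ax = plt.subplots(figsize=(6, 6))
nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color=node_colors,
                       cmap=plt.cm.viridis, ax=ax)
nx.draw_networkx_edges(G, pos, alpha=0.4, ax=ax)  # no draw_networkx_labels call
ax.set_axis_off()
# In a notebook the figure renders inline; in a script, fig.savefig(...) would write it out.
```

Drawing nodes and edges separately, and simply never calling the label function, keeps usernames out of the figure.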

In addition to the network graphs, we will provide visualizations showcasing the most frequently used words and most active users within each subreddit. These charts will help identify recurring themes and key contributors to discussions. Furthermore, a dedicated sentiment analysis visualization will be included, displaying the sentiment score distribution across posts within each subreddit. This will enable comparisons of emotional tone between the communities and may reveal shifts in sentiment over time.
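The counts behind the "most frequent words" and "most active users" charts can be obtained with a simple tally. The sketch below uses toy data and a hypothetical helper `top_counts`; the token lists are assumed to come from the cleaning step.

```python
from collections import Counter

def top_counts(items, n=3):
    """Return the n most common items with their counts."""
    return Counter(items).most_common(n)

# Toy data for illustration only.
toy_authors = ["alice", "bob", "alice", "carol", "alice", "bob"]
toy_tokens = [["court", "vote"], ["vote", "senate"], ["vote", "court"]]

most_active = top_counts(toy_authors, n=2)
most_frequent_words = top_counts((tok for doc in toy_tokens for tok in doc), n=2)
# most_active == [("alice", 3), ("bob", 2)]
# most_frequent_words == [("vote", 3), ("court", 2)]
```

Each result is a list of `(item, count)` pairs that can be unpacked directly into a Matplotlib bar chart.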

### **Centrality Analysis**

Centrality measures will be calculated using NetworkX:

* **Degree Centrality:** Identify nodes with the highest number of direct connections.
* **Betweenness Centrality:** Determine nodes that act as bridges in the network.
* **Closeness Centrality:** Assess nodes that are close to many others in the network.

**Data Representation:** Store centrality measures in a DataFrame for easy analysis and comparison.
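As a small sketch of this step, the three measures can be computed with NetworkX and collected into one pandas DataFrame indexed by user. The graph below is a toy example with placeholder usernames.

```python
import networkx as nx
import pandas as pd

# Toy graph: u1 is a hub, u4 bridges u5 to the rest.
G = nx.Graph()
G.add_edges_from([("u1", "u2"), ("u1", "u3"), ("u1", "u4"), ("u4", "u5")])

# Each NetworkX function returns a {node: value} dict, so the dicts
# line up as columns of a DataFrame indexed by node.
centrality = pd.DataFrame({
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
}).sort_values("degree", ascending=False)
```

In this toy graph `u1` ranks first on all three measures, while `u4` has a low degree but a comparatively high betweenness because it is the only route to `u5`.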

### **Sentiment Analysis**
Sentiment analysis in this project will be conducted using a lexicon-based approach, which relies on a predefined list of words associated with positive or negative sentiment. Specifically, we will utilize the Bing Liu Lexicon, a widely used sentiment lexicon that categorizes words as either positive or negative based on their typical usage and emotional connotation. Each user comment will be tokenized into individual words, and the sentiment of the comment will be determined by counting the number of positive and negative words it contains. A comment will be classified as positive if it contains more positive words than negative, negative if the opposite is true, or neutral if there is a balance or no sentiment-bearing words are present.
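The counting rule described above can be sketched as follows. The two word sets are tiny stand-ins chosen for illustration, not the actual lexicon; NLTK distributes the Bing Liu word lists as `nltk.corpus.opinion_lexicon`, which could be loaded instead after downloading the corpus.

```python
# Tiny stand-in word lists (assumptions for illustration only).
POSITIVE = {"good", "great", "awesome", "love", "win"}
NEGATIVE = {"bad", "terrible", "hate", "corrupt", "lose"}

def score_comment(tokens):
    """Return (score, label) where score = #positive - #negative tokens."""
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    score = pos - neg
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"  # balanced, or no sentiment-bearing words
    return score, label

examples = {
    "pos": score_comment(["this", "guy", "is", "awesome"]),
    "neg": score_comment(["terrible", "corrupt", "but", "good"]),
    "neu": score_comment(["why", "just", "one"]),
}
# examples == {"pos": (1, "positive"), "neg": (-1, "negative"), "neu": (0, "neutral")}
```

Returning the numeric score alongside the label keeps the information needed for averaging later, while the label supports simple category counts.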

After all comments are assigned sentiment scores, we will aggregate the results to perform a comparative analysis across the subreddits included in the dataset. This will allow us to examine overall sentiment trends, identify which communities lean more positively or negatively in their discourse, and detect variations in emotional tone between different subreddit groups. Additionally, sentiment scores may be analyzed over time to observe how user sentiment evolves in response to events or shifts in community dynamics. This method provides a transparent and interpretable way to evaluate emotional expression in text-based data at scale.
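The aggregation could be done with pandas group-by operations, as in the sketch below. The frame is toy data; `sentiment` is a hypothetical column holding the per-comment score, and `subreddit` and `time` follow the dataset's field names.

```python
import pandas as pd

# Toy scored comments for illustration only.
scored = pd.DataFrame({
    "subreddit": ["news", "news", "politics", "politics", "politics"],
    "time": [1455667000, 1455753400, 1455651000, 1455648000, 1455739800],
    "sentiment": [1, -1, -2, 0, 1],
})

# Mean score and number of comments per subreddit.
by_subreddit = scored.groupby("subreddit")["sentiment"].agg(["mean", "count"])

# Daily mean sentiment per subreddit, derived from the UNIX timestamp (UTC).
scored["date"] = pd.to_datetime(scored["time"], unit="s").dt.date
daily = scored.groupby(["subreddit", "date"])["sentiment"].mean()
```

`by_subreddit` supports the cross-community comparison, and `daily` gives one time series per subreddit for tracking how sentiment changes.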

### **Conclusion**
In the final section, we will summarize our key findings, interpreting the results in relation to the original questions posed. We will also provide insights into how the findings can be used.