## **Centrality Measures using a Reddit Hyperlink Network**

**Submitted by:** Euclides, Umais

**Course:** DATA 620

**Data Source:** https://snap.stanford.edu/data/soc-RedditHyperlinks.html

### **Introduction**

The hyperlink network represents the directed connections between pairs of subreddits (a subreddit is a community on Reddit). The dataset's authors also provide subreddit embeddings. The network is extracted from 2.5 years of publicly available Reddit data, from Jan 2014 to April 2017.

Subreddit Hyperlink Network: the subreddit-to-subreddit hyperlink network is extracted from the posts that create hyperlinks from one subreddit to another. We say a hyperlink originates from a post in the source community and links to a post in the target community. Each hyperlink is annotated with three properties: the timestamp, the sentiment of the source community post towards the target community post, and the text property vector of the source post. The network is directed, signed, temporal, and attributed. 

### **Load Data and Separate Positive and Negative Sentiment**

In [3]:

# Import packages
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import os

In [None]:
#Load .tsv into a Dataframe
current_dir = os.getcwd()
filename = "soc-redditHyperlinks-body.tsv"
file_path = os.path.join(current_dir, filename)

#imported data with select columns only
df = pd.read_csv(file_path, sep='\t', usecols=[0,1,4])
df.head()

Unnamed: 0,SOURCE_SUBREDDIT,TARGET_SUBREDDIT,LINK_SENTIMENT
0,leagueoflegends,teamredditteams,1
1,theredlion,soccer,-1
2,inlandempire,bikela,1
3,nfl,cfb,1
4,playmygame,gamedev,1


In [21]:
#Create a subset for positive sentiment data
df_pos = df[df['LINK_SENTIMENT']==1]
df_pos.head()


Unnamed: 0,SOURCE_SUBREDDIT,TARGET_SUBREDDIT,LINK_SENTIMENT
0,leagueoflegends,teamredditteams,1
2,inlandempire,bikela,1
3,nfl,cfb,1
4,playmygame,gamedev,1
5,dogemarket,dogecoin,1


In [22]:
#Create a subset for negative sentiment data 
df_neg = df[df['LINK_SENTIMENT']==-1]
df_neg.head()

Unnamed: 0,SOURCE_SUBREDDIT,TARGET_SUBREDDIT,LINK_SENTIMENT
1,theredlion,soccer,-1
34,karmaconspiracy,funny,-1
43,badkarma,gamesell,-1
53,casualiama,teenagers,-1
55,australia,sydney,-1
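
Before building any graph, a quick check of how the links split between the two sentiment labels gives a first answer to the bias question. The sketch below uses a small hypothetical edge list (`toy`) in place of the real `df`; only the column names are taken from the data above.

```python
import pandas as pd

# Hypothetical stand-in for the loaded edge list (not real rows from the dataset)
toy = pd.DataFrame({
    "SOURCE_SUBREDDIT": ["a", "b", "c", "a", "d"],
    "TARGET_SUBREDDIT": ["b", "c", "a", "d", "a"],
    "LINK_SENTIMENT": [1, -1, 1, 1, 1],
})

# Fraction of links carrying each sentiment label (1 = positive, -1 = negative)
sentiment_share = toy["LINK_SENTIMENT"].value_counts(normalize=True)
print(sentiment_share)
```

Running the same `value_counts(normalize=True)` call on `df["LINK_SENTIMENT"]` would give the corresponding shares for the full dataset.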


### **Proposal and Project Plan for Network Analysis**

**1. Introduction Objective:** To analyze the positive and negative relationships between subreddit communities. Do communities tend to be biased toward positive or negative links? We will use network analysis to investigate this question.

**2. Data Collection** The data set will be obtained from the Stanford Large Network Dataset Collection. In particular, the subreddit-to-subreddit hyperlink network will be used.

**3. Data Preparation**

Organize the data into positive and negative subsets. The timestamp and text-property columns will be dropped for simplicity.

**4. Graph Construction**

The NetworkX package will be used to construct the graph. Given the large dataset, the graph will be limited to the top 100 nodes for each subset.
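
A minimal sketch of this step follows. The plan does not say how the "top" nodes are ranked, so the example assumes total degree (in plus out) as the ranking criterion. The helper name `build_top_subgraph` and the toy edge list are hypothetical.

```python
import networkx as nx
import pandas as pd

def build_top_subgraph(edges: pd.DataFrame, n_top: int = 100) -> nx.DiGraph:
    """Build a directed graph and keep only the n_top highest-degree nodes."""
    g = nx.from_pandas_edgelist(
        edges,
        source="SOURCE_SUBREDDIT",
        target="TARGET_SUBREDDIT",
        create_using=nx.DiGraph(),
    )
    # Assumption: "top" means highest total degree (in-degree + out-degree)
    ranked = sorted(g.degree, key=lambda pair: pair[1], reverse=True)
    top_nodes = [node for node, _ in ranked[:n_top]]
    # Copy so the result is an independent graph rather than a view
    return g.subgraph(top_nodes).copy()

# Hypothetical edge list standing in for df_pos or df_neg
toy_edges = pd.DataFrame({
    "SOURCE_SUBREDDIT": ["a", "a", "b", "c", "d"],
    "TARGET_SUBREDDIT": ["b", "c", "c", "a", "a"],
})
g_top = build_top_subgraph(toy_edges, n_top=3)
print(g_top.number_of_nodes(), g_top.number_of_edges())
```

Note that `nx.DiGraph` stores at most one edge per ordered pair, so repeated hyperlinks between the same two subreddits collapse into a single edge. If link counts matter, they would need to be aggregated into an edge weight first.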

**5. Visualization**

Graph Visualization: Use Matplotlib with NetworkX to visualize the constructed networks. Given that subreddit names can be long, the node labels will be omitted.
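
One possible way to draw an unlabeled directed graph is sketched below. The toy graph, the spring layout, and the styling values are illustrative assumptions rather than the project's final choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical toy graph standing in for the top-100 subgraph
g = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")])

fig, ax = plt.subplots(figsize=(6, 6))
pos = nx.spring_layout(g, seed=42)  # fixed seed for a reproducible layout

# Draw nodes and edges separately and skip draw_networkx_labels,
# so no node labels appear on the plot
nx.draw_networkx_nodes(g, pos, ax=ax, node_size=80)
nx.draw_networkx_edges(g, pos, ax=ax, arrows=True, alpha=0.5)
ax.set_axis_off()
```

In a notebook the figure renders at the end of the cell. Elsewhere, `fig.savefig(...)` can write it to a file.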

**6. Centrality Analysis**

Centrality Measures: Calculate various centrality measures using NetworkX.

Degree Centrality: Identify nodes with the highest number of direct connections.

Betweenness Centrality: Determine nodes that act as bridges in the network.

Closeness Centrality: Assess nodes that are close to many others in the network.

Data Representation: Store centrality measures in a DataFrame for easy analysis and comparison.
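
The three measures and the DataFrame representation can be sketched as follows, using a hypothetical toy graph. For a directed graph, NetworkX's `degree_centrality` counts both incoming and outgoing edges, and `closeness_centrality` is based on inward distances to each node.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy graph standing in for one of the sentiment subgraphs
g = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "a")])

# Each NetworkX function returns a {node: score} dict;
# pandas aligns the dicts on node name to form one row per node
centrality = pd.DataFrame({
    "degree": nx.degree_centrality(g),
    "betweenness": nx.betweenness_centrality(g),
    "closeness": nx.closeness_centrality(g),
})
centrality.index.name = "subreddit"
print(centrality.round(3))
```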


**7. Analysis of Results**

Identify Key Nodes: A table will report the measures listed above. This will help us identify any key nodes or trends in the network.
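
As a rough illustration of how key nodes could be read off such a table, the sketch below lists the highest-scoring nodes for each measure side by side. The helper `top_k_per_measure` and the scores in `toy_scores` are hypothetical.

```python
import pandas as pd

def top_k_per_measure(centrality: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Return the k highest-scoring node names for each centrality column."""
    return pd.DataFrame({
        col: centrality[col].nlargest(k).index.to_list()
        for col in centrality.columns
    })

# Made-up scores for three nodes, indexed by node name
toy_scores = pd.DataFrame(
    {"degree": [0.9, 0.4, 0.7], "betweenness": [0.1, 0.6, 0.3]},
    index=["a", "b", "c"],
)
top_nodes = top_k_per_measure(toy_scores, k=2)
print(top_nodes)
```

Nodes that appear near the top of several columns are natural candidates for closer inspection. Building the same table for the positive and negative graphs allows a direct comparison between them.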


**8. Conclusion**

Summary: Summarize the findings and their implications for understanding these two Reddit networks.

