
# Hashtag Analysis - Hashtag network of BLM-related tweets

## Co-occurrence Network: centrality measures (Python)


We first explored our scraped tweets with rtweet. Then, using R, we generated a dataset containing all the scraped tweets.

Let's generate the co-occurrence network in Python to explore the central hashtags. In this case, we focus on the hashtags directly connected to #blacklivesmatter and on the hashtags directly connected to those. As a result, the graph will be connected.

The choice of focusing on the *#blacklivesmatter* hashtag is motivated by the fact that this slogan represents the social struggle for the emancipation of Black people across the Western context. By analyzing the related hashtags, we can explore the features of the communication flow around this socio-political reality.

First of all, we load all the necessary packages and our data:

In [None]:
#Loading our packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import pickle
import itertools
import igraph as ig
from igraph import Graph
import requests
import httpimport
import urllib.request
import networkx as nx
from pyvis.network import Network

In [None]:
#Reading the dataset
df = pd.read_csv("blm_dataset.csv")

Now let's generate the co-occurrence hashtag matrix:

In [None]:
#Collecting the set of hashtags that occur in tweets containing the #blacklivesmatter hashtag
hs = set()
for x in df.text:
    x = x.lower()
    y = x.split()
    listx = []
    if "#blacklivesmatter" not in y:
        continue
    for el in y:
        if el.startswith("#") and el not in listx:
            listx.append(el)
            hs.add(el)


In [None]:
#Putting all the hashtags-per-tweet into a list
docs = []
for x in df.text:
    x = x.lower()
    y = x.split()
    listx = []
    for el in y:
        if el.startswith("#") and el not in listx:
            listx.append(el)
    if len(listx)>0:
        docs.append(listx)
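
Splitting on whitespace keeps any punctuation attached to a hashtag, so `#blm,` and `#blm` would end up as different nodes. The following is a minimal sketch of a regex-based alternative, assuming a hashtag is `#` followed by word characters. The helper name `extract_hashtags` and the sample tweets are hypothetical and are not part of the original notebook:

```python
import re

# Assumption: a hashtag is "#" followed by one or more word characters.
HASHTAG_RE = re.compile(r"#\w+")

def extract_hashtags(text):
    """Return the unique lowercase hashtags of a tweet, in order of appearance."""
    seen = []
    for tag in HASHTAG_RE.findall(text.lower()):
        if tag not in seen:
            seen.append(tag)
    return seen

# Hypothetical sample tweets, used only for illustration.
sample_tweets = [
    "Justice now! #BlackLivesMatter, #BLM.",
    "New mural downtown #art #blm #Art",
    "No hashtags here",
]

# Keep only tweets that contain at least one hashtag, as in the cell above.
sample_docs = [tags for tags in map(extract_hashtags, sample_tweets) if tags]
print(sample_docs)
```

Whether this changes the resulting network depends on how often hashtags are followed by punctuation in the corpus, which the notebook does not report.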

In [None]:
#Generating all the co-occurring hashtag couples (edges) in a list of tuples
edgeslist = []
for i in range(len(docs)):
    edgeslist.extend(list(itertools.combinations(docs[i],2)))
    
#Keeping only the edges with at least one endpoint among the hashtags that co-occur with #blacklivesmatter
edgeslist = [(x,y) for (x,y) in edgeslist if (x in hs or y in hs)]

In [None]:
#Creating a dictionary that maps each co-occurring hashtag pair to the number of tweets in which the two hashtags co-occur
ret = {}
for el in edgeslist:
    sel = tuple(sorted(el))
    if sel in ret:
        ret[sel]+=1
    else:
        ret[sel]=1

In [None]:
#Generating a list of edges with the associated weight per edge, as a list of tuples
weightededges = []
for el in ret:
    weightededges.append((el[0],el[1],ret[el]))
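
The counting steps above can also be expressed with `collections.Counter`. This is only a sketch on hypothetical toy data (`toy_docs` is made up for illustration), and it leaves out the filtering on hashtags that co-occur with #blacklivesmatter:

```python
import itertools
from collections import Counter

# Hypothetical per-tweet hashtag lists (already lowercased and de-duplicated).
toy_docs = [
    ["#blacklivesmatter", "#blm", "#art"],
    ["#blm", "#blacklivesmatter"],
    ["#art", "#photography"],
]

# Sort each pair so that (a, b) and (b, a) count as the same undirected edge.
pair_counts = Counter(
    tuple(sorted(pair))
    for tags in toy_docs
    for pair in itertools.combinations(tags, 2)
)

# Same shape as `weightededges`: (hashtag, hashtag, number of tweets).
toy_weighted_edges = [(a, b, w) for (a, b), w in pair_counts.items()]
print(toy_weighted_edges)
```

Here the pair `#blacklivesmatter` / `#blm` appears in two toy tweets, so its weight is 2, while every other pair has weight 1.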

We now have our list of edges. The edges are "weighted" according to the frequency with which the two hashtags co-occur (i.e., the number of tweets in which they appear together). This allows us to generate a weighted graph:

In [None]:
#Generating the co-occurrence graph
G = nx.Graph()
G.add_weighted_edges_from(weightededges)

In [None]:
#Seeing if the graph is connected
nx.is_connected(G)

True
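
If the check had returned `False`, one option would be to keep only the component that contains #blacklivesmatter. The sketch below illustrates this on a hypothetical toy graph (the `#cats` / `#dogs` edge is invented to force a disconnected graph); it is not a step the notebook needed to perform:

```python
import networkx as nx

toy = nx.Graph()
toy.add_weighted_edges_from([
    ("#blacklivesmatter", "#blm", 3),
    ("#blm", "#art", 1),
    ("#cats", "#dogs", 2),  # hypothetical pair, disconnected from the rest
])

print(nx.is_connected(toy))

# Keep only the nodes reachable from #blacklivesmatter.
component = nx.node_connected_component(toy, "#blacklivesmatter")
core = toy.subgraph(component).copy()
print(sorted(core.nodes), nx.is_connected(core))
```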

# Centrality measures
As evidenced by the existing sociological literature, we can analyze the relationships between the hashtags in a corpus of tweets by applying the typical centrality measures from social network analysis (SNA) to the hashtag co-occurrence network.


**Degree Centrality**

First of all, let's explore the most connected hashtags by using the degree centrality:


In [None]:
#Degree centrality
degree = list(G.degree())
degree.sort(key = lambda x: x[1],reverse=True)

In [None]:
degree[:10]

[('#blacklivesmatter', 622),
 ('#blm', 598),
 ('#black', 205),
 ('#blackhistorymonth', 120),
 ('#racism', 108),
 ('#lives', 107),
 ('#art', 104),
 ('#usa', 102),
 ('#matter', 89),
 ('#georgefloyd', 88)]
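
The values above are raw degree counts. To compare them across networks of different sizes, they can be divided by the maximum possible degree, n - 1, which is what `nx.degree_centrality` computes. A minimal sketch on a hypothetical four-node graph:

```python
import networkx as nx

# Hypothetical toy graph, used only for illustration.
toy = nx.Graph()
toy.add_edges_from([
    ("#blacklivesmatter", "#blm"),
    ("#blacklivesmatter", "#art"),
    ("#blacklivesmatter", "#racism"),
    ("#blm", "#art"),
])

n = toy.number_of_nodes()

# Manual normalization: degree divided by the maximum possible degree.
manual = {node: deg / (n - 1) for node, deg in toy.degree()}

# NetworkX built-in, which applies the same normalization.
builtin = nx.degree_centrality(toy)

for node in toy.nodes:
    print(node, round(manual[node], 3), round(builtin[node], 3))
```

In this toy graph `#blacklivesmatter` is linked to all three other nodes, so its normalized degree is 1.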

We can also replicate it with network weights, based on the frequency of the co-occurrence in the corpus:

In [None]:
#Normalized weighted degree centrality
degree = G.degree(weight='weight')
max_degree = max(dict(degree).values())
degree_centrality_weighted = [deg/max_degree for deg in dict(degree).values()]
degree_centrality_weighted = list(zip(G.nodes,degree_centrality_weighted))
degree_centrality_weighted.sort(key = lambda x: x[1],reverse=True)

In [None]:

degree_centrality_weighted[:10]

[('#blacklivesmatter', 1.0),
 ('#blm', 0.6897703549060543),
 ('#art', 0.4697286012526096),
 ('#black', 0.3974947807933194),
 ('#racism', 0.3594989561586639),
 ('#lives', 0.2580375782881002),
 ('#photography', 0.2384133611691023),
 ('#artists', 0.21711899791231734),
 ('#crisisart', 0.21711899791231734),
 ('#hate', 0.21711899791231734)]

Note how the *#art* hashtag rises significantly with this approach, moving from seventh to third place. The *#artists* hashtag also enters the top ten.
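
A small hypothetical example can show why the two rankings differ: a hashtag with few but very frequent co-occurrences can overtake one with many rare co-occurrences. The graph and weights below are invented for illustration:

```python
import networkx as nx

# Hypothetical toy graph: #blm has more neighbours, #art has a heavier tie.
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("#blacklivesmatter", "#blm", 2),
    ("#blacklivesmatter", "#art", 5),
    ("#blm", "#racism", 1),
    ("#blm", "#usa", 1),
    ("#art", "#photography", 1),
])

# Unweighted degree: number of distinct co-occurring hashtags.
plain_degree = dict(toy.degree())

# Weighted degree: sum of the co-occurrence counts, scaled by the maximum.
strength = dict(toy.degree(weight="weight"))
max_strength = max(strength.values())
normalized = {node: s / max_strength for node, s in strength.items()}

print(plain_degree["#blm"], plain_degree["#art"])
print(round(normalized["#blm"], 3), round(normalized["#art"], 3))
```

In the toy graph `#blm` has the higher plain degree (3 against 2), but `#art` has the higher weighted degree (6 against 4).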

**Closeness centrality**

The hashtags with the highest closeness are, on average, the fewest steps away from all the other hashtags, so they are the ones through which information can reach the rest of the network most quickly.

In [None]:
#Closeness centrality (unweighted)
closeness_noweights = list(nx.closeness_centrality(G).items())
closeness_noweights.sort(key = lambda x: x[1],reverse=True)
closeness_noweights = [(x,round(y,3)) for (x,y) in closeness_noweights]

In [None]:
closeness_noweights[:10]

[('#blacklivesmatter', 0.628),
 ('#blm', 0.602),
 ('#black', 0.49),
 ('#racism', 0.472),
 ('#blackhistorymonth', 0.468),
 ('#art', 0.466),
 ('#georgefloyd', 0.464),
 ('#equality', 0.463),
 ('#usa', 0.462),
 ('#lives', 0.46)]

In [None]:
#Closeness centrality  (weighted)
G_distance_dict = {(e1, e2): 1 / weight for e1, e2, weight in G.edges(data='weight')}
nx.set_edge_attributes(G, G_distance_dict, 'distance')

closenesses = list(nx.closeness_centrality(G, distance='distance').items())
closenesses = [(x,round(y,3)) for (x,y) in closenesses]
closenesses.sort(key = lambda x: x[1],reverse=True)

In [None]:
closenesses[:10]

[('#blacklivesmatter', 1.063),
 ('#blm', 1.061),
 ('#art', 1.056),
 ('#racism', 1.053),
 ('#photography', 1.05),
 ('#artists', 1.048),
 ('#crisisart', 1.048),
 ('#hate', 1.048),
 ('#whitepower', 1.048),
 ('#kkk', 1.048)]

In this case too, the weighted closeness shows how hashtags related to the visual arts allow a rapid diffusion of content related to the BLM sphere.
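
The weighted closeness values exceed 1 because the distances are defined as 1 / weight, so any pair that co-occurs more than once is less than one unit apart. A minimal sketch on a hypothetical three-node path, using the same conversion as the cell above:

```python
import networkx as nx

# Hypothetical toy path: #blacklivesmatter -(4)- #blm -(1)- #art
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("#blacklivesmatter", "#blm", 4),
    ("#blm", "#art", 1),
])

# Convert co-occurrence counts to distances: frequent pairs are "closer".
toy_distance = {(u, v): 1 / w for u, v, w in toy.edges(data="weight")}
nx.set_edge_attributes(toy, toy_distance, "distance")

closeness = nx.closeness_centrality(toy, distance="distance")
print({node: round(value, 3) for node, value in closeness.items()})
```

For `#blm` the distances to the other two nodes are 0.25 and 1, so its closeness is 2 / 1.25 = 1.6, which is above 1 even in this tiny graph.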

**Betweenness centrality**

The hashtags that best meet the status of *bridging words* will have higher values on betweenness centrality:

In [None]:
#Betweenness centrality (unweighted)
betweenness_centrality_unweighted = nx.betweenness_centrality(G)
betweennessesu = list(betweenness_centrality_unweighted.items())
betweennessesu = [(x,round(y,3)) for (x,y) in betweennessesu]
betweennessesu.sort(key = lambda x: x[1],reverse=True)

In [None]:
betweennessesu[:10]

[('#blacklivesmatter', 0.479),
 ('#blm', 0.41),
 ('#black', 0.074),
 ('#blackhistorymonth', 0.047),
 ('#usa', 0.042),
 ('#letsgobrandon', 0.031),
 ('#racism', 0.026),
 ('#mlkday', 0.023),
 ('#etsy', 0.021),
 ('#mlk', 0.015)]

In [None]:
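#Betweenness centrality (weighted)
#Note: NetworkX uses the edge weight as a distance when computing shortest paths,
#so here pairs that co-occur more often are treated as farther apart
betweenness_centrality_weighted = nx.betweenness_centrality(G, weight='weight')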
betweennesses = list(betweenness_centrality_weighted.items())
betweennesses = [(x,round(y,3)) for (x,y) in betweennesses]
betweennesses.sort(key = lambda x: x[1],reverse=True)

In [None]:
betweennesses[:10]

[('#blacklivesmatter', 0.446),
 ('#blm', 0.389),
 ('#black', 0.082),
 ('#blackhistorymonth', 0.052),
 ('#usa', 0.041),
 ('#letsgobrandon', 0.033),
 ('#etsy', 0.026),
 ('#racism', 0.021),
 ('#mlkday', 0.019),
 ('#mlk', 0.019)]

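Note that the art-related hashtags that ranked highly on weighted degree and weighted closeness (*#art*, *#photography*, *#artists*) are absent from the top ten here.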
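
When reading the weighted betweenness, it helps to remember that `nx.betweenness_centrality` uses the `weight` argument as a distance in its shortest-path computation. The sketch below, on a hypothetical triangle, compares passing the raw co-occurrence counts with passing the inverse-weight `distance` attribute used for closeness. It is an illustration of the difference, not a re-analysis of the results above:

```python
import networkx as nx

# Hypothetical triangle: two strong ties through #blm and one weak direct tie.
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("#blacklivesmatter", "#blm", 5),
    ("#blm", "#art", 5),
    ("#blacklivesmatter", "#art", 1),
])
toy_distance = {(u, v): 1 / w for u, v, w in toy.edges(data="weight")}
nx.set_edge_attributes(toy, toy_distance, "distance")

# Raw counts as path lengths: the weak direct tie (length 1) is the shortest path.
by_count = nx.betweenness_centrality(toy, weight="weight")

# Inverse counts as path lengths: the route through #blm (0.2 + 0.2) is shorter.
by_distance = nx.betweenness_centrality(toy, weight="distance")

print(by_count["#blm"], by_distance["#blm"])
```

With the raw counts `#blm` lies on no shortest path, while with inverse weights it lies on the only shortest path between the other two hashtags.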

**Eigenvector centrality**

With eigenvector centrality, we can consider the importance of hashtags based on the centrality of their direct links:

In [None]:
#Eigenvector centrality (unweighted)
eigenvector_centrality_u = nx.eigenvector_centrality(G)
eigenvectorsu = list(eigenvector_centrality_u.items())
eigenvectorsu = [(x,round(y,3)) for (x,y) in eigenvectorsu]
eigenvectorsu.sort(key = lambda x: x[1],reverse=True)

In [None]:
eigenvectorsu[:10]

[('#blacklivesmatter', 0.404),
 ('#blm', 0.358),
 ('#black', 0.187),
 ('#racism', 0.143),
 ('#lives', 0.128),
 ('#matter', 0.114),
 ('#equality', 0.109),
 ('#art', 0.103),
 ('#georgefloyd', 0.1),
 ('#love', 0.096)]

In [None]:
#Eigenvector centrality (weighted)
eigenvector_centrality_weighted = nx.eigenvector_centrality(G, weight='weight')
eigenvectors = list(eigenvector_centrality_weighted.items())
eigenvectors = [(x,round(y,3)) for (x,y) in eigenvectors]
eigenvectors.sort(key = lambda x: x[1],reverse=True)

In [None]:
eigenvectors[:10]

[('#blacklivesmatter', 0.44),
 ('#art', 0.374),
 ('#blm', 0.358),
 ('#racism', 0.309),
 ('#artists', 0.267),
 ('#crisisart', 0.267),
 ('#hate', 0.267),
 ('#whitepower', 0.267),
 ('#kkk', 0.267),
 ('#photography', 0.155)]

Note how *#art* returns as an important hashtag, ranking second in the weighted version.
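
The effect of the weights on eigenvector centrality can be illustrated on a small hypothetical graph: a hashtag with a single but very frequent link to a central node can overtake a hashtag with two ordinary links. The graph, the weights, and the `max_iter` value are assumptions chosen for illustration:

```python
import networkx as nx

# Hypothetical toy graph: #art has one heavy tie to #blacklivesmatter.
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("#blacklivesmatter", "#blm", 1),
    ("#blacklivesmatter", "#art", 3),
    ("#blacklivesmatter", "#racism", 1),
    ("#blm", "#racism", 1),
    ("#blm", "#usa", 1),
])

# Unweighted: only the pattern of links matters.
unweighted = nx.eigenvector_centrality(toy, max_iter=1000)

# Weighted: each link counts in proportion to its co-occurrence frequency.
weighted = nx.eigenvector_centrality(toy, max_iter=1000, weight="weight")

for node in toy.nodes:
    print(node, round(unweighted[node], 3), round(weighted[node], 3))
```

In this toy graph `#racism` scores above `#art` without weights, and the order is reversed once the weights are taken into account.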

**Network visualization**

To get a good visualization of this network, it can be useful to explore packages other than NetworkX, such as pyvis.

With the commented-out code below, you can save a graphical representation of the network in HTML format:

In [None]:
#Network visualization with pyvis (uncomment to run)
#net = Network(notebook = True)
#net.from_nx(G)

#net.show("network.html")

You can also export the graph in graphml format to explore and visualize it with other software:

In [None]:
nx.write_graphml(G,"hashtagnetwork_BLM.graphml")
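
To check that an exported file can be loaded again, the graph can be read back with `nx.read_graphml`. The sketch below does a round trip on a hypothetical toy graph in a temporary directory (the file name `toy_network.graphml` is made up), so it does not touch the file written above:

```python
import os
import tempfile

import networkx as nx

# Hypothetical toy graph, used only for illustration.
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("#blacklivesmatter", "#blm", 3),
    ("#blm", "#art", 1),
])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_network.graphml")
    nx.write_graphml(toy, path)
    restored = nx.read_graphml(path)

# Node labels and edge weights should survive the round trip.
print(sorted(restored.nodes))
print(restored["#blacklivesmatter"]["#blm"]["weight"])
```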