# Explainer-Notebook

In [3]:
import json
import os
import glob
import pickle
import random
import re
from ast import literal_eval
from collections import defaultdict

import matplotlib.pyplot as plt
import netwulf as nw
import networkx as nx
import numpy as np
import pandas as pd
from scipy import stats

# Motivation

* What is your dataset?
* Why did you choose this/these particular dataset(s)?
* What was your goal for the end user’s experience?

### What is the data?

### Raw data

The raw data consists of tweets made by users related to the U.S. Congress, either as members, committees or caucuses. Each tweet comes with its ID, the user's handle name, a timestamp, a link to the tweet and finally the text of the tweet. The tweets were collected daily from 2017 to 2023 and were arranged in JSON files by date with a time resolution of months.

We also had a JSON file called "historical users" consisting of data on the different twitter users. It contained the name of the user, which chamber they were a part of, their political affiliation, what type of user they were (committee, political party, member or caucus) and a list of the twitter accounts associated with that user. Furthermore, if the user was a member of Congress, they also had a field stating which state they represented.

Ten of the members had a prev_props field, which indicated which political party other than their current one they had previously served, and when.

The accounts list consisted of dictionaries with the handle names, what the accounts had been used for (campaign, office etc.) and the unique IDs of the different twitter accounts belonging to that user. Some of the twitter accounts also had old handle names, which were included in the accounts list.

# Basic stats. Let’s understand the dataset better

* Write about your choices in data cleaning and preprocessing
* Write a short section that discusses the dataset stats (here you can recycle the work you did for Project Assignment A)

## Write about your choices in data cleaning and preprocessing

### Data frame

We made multiple choices when extracting and processing the data from the raw JSON files described above.

User dataframe:

- Removing prev_props: As mentioned, there was an attribute called "prev_props" which, if the member had previously been elected for another party or house, described this party or house. However, only 10 members in total had this attribute, so we decided to treat it as an outlying attribute and removed it from our graph.

Tweets dataframe:

- Multiple mentions in the same tweet: One could argue that multiple mentions in the same tweet should count as a single edge, as the tweet could be viewed as a single instance of communication. However, one most likely does not mention another twitter user by accident, so each mention is an explicit attempt at connection between the mentioner and the mentionee. As we are researching how the different congress members relate to one another, it would not make sense to throw away this information, even though it may cause overrepresentation of individuals who are connected to users that use far more mentions in their tweets than average.

- Only including mentions between accounts in our accounts dataframe: We did not include mentions directed at or coming from accounts outside our accounts dataframe, as we would have no political information on these accounts, most importantly which party they belonged to. Including these extra edges and nodes would therefore most likely just increase the complexity of our graph without helping us answer our main research question: whether or not the U.S. Congress is politically polarized.

- Including retweets: We included retweets, as we saw a retweet as a piece of information about the relationship between two users.

We had four dataframes: the user, accounts, tweets and text dataframes.

User dataframe: A dataframe of the biographical data surrounding a single twitter user, a user being the organization or person(s) actually using twitter.

Size: 806 entries

- name: This was the unique identifier of each user, as no two users had exactly the same name.

- chamber: Which chamber (if any) the user was a part of; either house or senate.

- type: Whether they were a (congress) member, caucus, party or committee

- party: Which political party they were part of

- id: Their ID

- state: Which state they represented, if they were a member of Congress

- prev_props: 10 users had served under different parties previously, prev_props told us which parties and the timeframe.

In [None]:
"""Create the tweeters fundamental DB with:
name, chamber, type, party, (list) accounts, id, state, prev_props
"""

users_DF = pd.read_json(filteredDataFilePath)

#Check for duplicate names
names_list = users_DF["name"]
duplicate_names_counter = 0
for name in names_list:
    name_query = users_DF[users_DF["name"] == name]  #Compare against the current name, not always the first one
    if(len(name_query)>1):
        print(name_query)
        duplicate_names_counter += 1

if(duplicate_names_counter>0):
    print(f"Number of duplicate names: {duplicate_names_counter}")
else:
    print("We can use name as a unique identifier for Twitter users")


Accounts dataframe: Each user could have multiple accounts. The accounts dataframe was therefore a way to link twitter accounts' handles and ids to the user behind them.

Here each row is a single twitter account.

Size: 1754 entries

- screen_name: The username of that account

- account_type: What the account was used for, e.g. campaign, office etc.

- name: The name of the user behind the account

In [None]:
"""
We create a separate dataFrame for quickly connecting twitter handles with names.
We need this in order to iterate over the tweets and filter mentions based on whether they're part of our filtered dataset.
"""

#Get all the account dicts as rows
accounts_dataFrame = users_DF[["accounts"]].explode("accounts")["accounts"].apply(pd.Series)
users_DF = users_DF.drop(columns="accounts")

#The exploded frame keeps the original row index, so this assignment aligns each account with its user
accounts_dataFrame["name"] = users_DF["name"]
accounts_dataFrame = accounts_dataFrame.drop(columns=["party","type","deleted","chamber"])

#Some of the twitter accounts have had previous handles, which we also need to explode and add to the dataframe
#We swap screen_name with prev_names, and add the resulting handle-to-name rows to the original accounts dataframe

prev_names_DF = accounts_dataFrame[accounts_dataFrame["prev_names"].notna()].explode("prev_names").drop(columns="screen_name")

prev_names_DF.columns = ["id","account_type","screen_name","name"]
accounts_dataFrame = pd.concat([accounts_dataFrame,prev_names_DF]).drop(columns="prev_names")
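
To make the explode step concrete, here is a minimal sketch on two invented users. The names, handles and IDs are hypothetical, and `pd.json_normalize` is used here as a stand-in for `.apply(pd.Series)`; it is an illustration of the idea, not the code that built the actual accounts dataframe.

```python
import pandas as pd

# Hypothetical toy data: one row per user, each with a list of account dicts
users = pd.DataFrame({
    "name": ["Alice Example", "Bob Example"],
    "accounts": [
        [{"id": 1, "screen_name": "RepAlice", "account_type": "office"},
         {"id": 2, "screen_name": "AliceForCongress", "account_type": "campaign"}],
        [{"id": 3, "screen_name": "SenBob", "account_type": "office"}],
    ],
})

# explode() gives one row per account dict and repeats the user's name
exploded = users.explode("accounts")

# Turn the dicts into columns, then attach the owning user's name row by row
accounts = pd.json_normalize(exploded["accounts"].tolist())
accounts["name"] = exploded["name"].to_numpy()

print(accounts)
```

Each account ends up on its own row with the name of the user behind it, which is the lookup structure the mention filtering needs.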

Tweets dataframe:

The tweets dataframe stores information surrounding a tweet. This made it easy for us to create a graph later on, as all the edges could be found in the tweets dataframe. Each entry is a single tweet.

Size: 4777249 entries

- user_name: The username of the user behind the tweet

- screen_name: The name of the account from which the tweet was posted

- tweet_account_ID: The ID of that account

- tweet_ID: The unique identifier identifying that single tweet

- handles_mentioned: All the twitter handles mentioned in the tweet that are in our accounts dataframe.

- names_mentioned: The names of the users behind the mentioned accounts.

Text dataframe:

The text dataframe consists of all the text of every tweet in the tweets dataframe. The primary purpose of this dataframe was to provide an organized way to do text analysis.

Each row is a single tweet.

Size: 4777249 entries

- user_name: The username of the user behind the tweet

- screen_name: The name of the account from which the tweet was posted

- text: The actual text of that tweet

- tweet_ID: The unique identifier identifying that single tweet

In [None]:
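RT_pattern = "RT @[a-zA-Z0-9]*"

def searchText(formula,twitter_message):
    """Return all matches of the regex `formula` in the text, or None if there are none."""
    try:
        matches = re.findall(formula,twitter_message)
    except TypeError:
        #The text was not a string (e.g. a NaN value)
        return None
    return matches if matches else None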

#Takes a tweet's text and returns the mentioned handles (including those in retweets) that are in our accounts_dataFrame,
#together with the corresponding users' names
def get_Handles_Names(tweet):
    mentions_re = "@[a-zA-Z0-9]*"
    mentions_excl_RT_re = "(?<!RT )@[a-zA-Z0-9]*" #Would exclude all mentions that follow "RT"; unused, as we include retweets
    mentions = searchText(mentions_re,tweet)
    screen_names = []
    names_result = []

    #If there is a mention
    if(mentions != None):
        mentions = [handle[1:] for handle in mentions] #Remove @

        #Add to our dict, if the mentions exist in our dataFrame
        for idx,handle in enumerate(mentions):
            names = accounts_dataFrame[accounts_dataFrame['screen_name'] == handle]['name']

            #If the handle exists in our filtered accounts_dataFrame
            if(len(names)>0):
                screen_names += [handle]
                names_result += list(names)
    if(len(names_result)>0):
        return screen_names,names_result
    else:
        return None,None

def get_tweeter_screen_name(tweet):
    return tweet["screen_name"]



#In case the way we accept tweets changes
def accept_tweet(tweet):
    screen_name = [get_tweeter_screen_name(tweet)]

    #We check to see if the sender is a part of our network
    if(accounts_dataFrame["screen_name"].isin(screen_name).sum() == 1):
        return True
    else:
        return False



def get_tweets_row(tweet,year):
        screen_name = get_tweeter_screen_name(tweet)
        tweet_text = tweet["text"]
        tweet_ID = tweet["id"]
        tweet_account_ID = tweet["user_id"] #What they call user, we call account

        handles_mentioned,names_mentioned = get_Handles_Names(tweet_text)
        hashtags = searchText(hashtags_re,tweet_text)
        user_name = list(accounts_dataFrame[accounts_dataFrame["screen_name"] == screen_name]["name"])[0]

        return {"user_name":user_name,"screen_name":screen_name,"tweet_account_ID":tweet_account_ID,"tweet_ID":tweet_ID,"handles_mentioned":handles_mentioned,"names_mentioned":names_mentioned,"hashtags":hashtags,"year":year}

def get_text_row(tweet):
        screen_name = get_tweeter_screen_name(tweet)
        user_name = list(accounts_dataFrame[accounts_dataFrame["screen_name"] == screen_name]["name"])[0]
        tweet_text = tweet["text"]
        tweet_ID = tweet["id"]
        return {"user_name":user_name,"screen_name":screen_name,"text":tweet_text,"tweet_ID":tweet_ID}
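
As a sketch of the mention filtering described above, the lookup can also be expressed with a plain dictionary from handle to user name instead of filtering the dataframe for every handle. The mapping below is a small illustrative stand-in for the accounts dataframe, and `mentioned_users` is a hypothetical helper name. Unlike the pattern in the cell above, this version uses `+` rather than `*`, so a bare "@" is not matched.

```python
import re

# Same character class as the notebook's pattern, with a capture group for the handle
MENTION_RE = re.compile(r"@([a-zA-Z0-9]+)")

# Illustrative stand-in for the accounts dataframe: handle -> user name
handle_to_name = {
    "SpeakerPelosi": "Nancy Pelosi",
    "SenSanders": "Bernie Sanders",
}

def mentioned_users(text, handle_to_name):
    """Return (handles, names) for mentions found in the lookup, or (None, None)."""
    handles = [h for h in MENTION_RE.findall(text) if h in handle_to_name]
    if not handles:
        return None, None
    names = [handle_to_name[h] for h in handles]
    return handles, names

handles, names = mentioned_users(
    "RT @SpeakerPelosi: thank you @SenSanders and @NotInOurData", handle_to_name
)
print(handles, names)
```

The handle that is missing from the lookup is dropped, which mirrors the choice to ignore mentions of accounts outside the dataset.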


In [None]:
"""We will now create the Tweets and text dataframes"""

tweets_DF = pd.DataFrame(columns=["user_name","screen_name","tweet_account_ID","tweet_ID","handles_mentioned","names_mentioned","hashtags"])

hashtags_re = "#[a-zA-Z0-9]*"

#Matches a backslash, the year digits and a dash in the file path, e.g. "\2019-" (this assumes Windows-style paths)
#A raw string avoids having to escape the backslash twice
year_re = r"\\[0-9]*-"
tweets_rows = []
text_rows = []
false_tweets = 0
total_tweets = 0

for idx,filename in enumerate(os.listdir(dataFilePath)):
    print(idx)

    full_path = os.path.join(dataFilePath, filename)
    tweet_year = re.search(year_re,full_path)[0][1:-1]

    with open(full_path, encoding="utf8") as f:
        json_file = json.load(f)

    for tweet in json_file:
        try:
            if(accept_tweet(tweet)):
                tweets_rows += [get_tweets_row(tweet,tweet_year)]
                text_rows += [get_text_row(tweet)]
                total_tweets += 1

        except Exception as e:
            #Skip the malformed tweet instead of aborting the rest of the file
            false_tweets += 1
            print(tweet)
            print(e)

#We now have our four different dataFrames.
tweets_DF = pd.DataFrame.from_dict(tweets_rows,orient="columns")
text_DF = pd.DataFrame.from_dict(text_rows,orient="columns")

### Main Graph creation

- Directed: We chose a directed graph over an undirected one because there is a huge difference between a user who mentions a lot of other users and a user who is mentioned by a lot of other users, with the latter playing a more central role in our network.

- Multigraph: We couldn't just assign the number of tweets between two nodes as a weight, as we needed to separate the graph by year later on, and we would therefore have to record how many mentions fell in each year. Also, we had originally intended to use semantic analysis, and with a multigraph each edge could carry a single semantic weight. This made it much easier to use the standard tools we had, like Netwulf and NetworkX.

- Users are nodes, not accounts: We chose twitter users rather than their accounts as the nodes of our graph. Having the accounts as nodes would have added more detail, as the same user could then have accounts in different communities, and we could have seen how the different election periods affected the use of different accounts. However, it would also have added much more complexity to our graph, and could have made it much harder to get an overview of our network from visual cues, like our Netwulf graph. As we had limited time for this assignment, we therefore went with users as nodes.

In [None]:
#We now create the network graph


#When we load the dataFrame, it interprets the names_mentioned as a string, not a list
def read_names_mentioned(tweet):
    return literal_eval(tweet["names_mentioned"])

def load_DF(path):
    return pd.read_csv(path).drop(columns="Unnamed: 0")

#And load the different dataFrames
tweets_DF = load_DF("tweetsDF")
text_DF = load_DF("textDF")
users_DF = load_DF("userDF")
accounts_dataFrame = load_DF("accountsDF")

After having created our different dataframes, we reached the last data extraction stage.

We wanted to model the relationship between the different users of the US Congress, not their twitter accounts, so each node in our graph consists of a single user and its data from the user dataframe.

To construct the actual graph, we started with the edges, by looking at each tweet in our tweets dataframe that mentioned at least one other user from the accounts dataframe. We then created an edge between the user who posted the tweet and each user mentioned in it. If the same user was mentioned multiple times, an edge was created for every mention.

As we wanted to do text analysis on the communities of the graph and a temporal analysis later on, we added the year of the tweet as well as the tweet's ID to every edge. This way we could look at all the edges of each community and get the text from those tweets from the text dataframe, and we could separate the different edges into different graphs by year.

In [None]:
tweets_w_edges = tweets_DF[tweets_DF["names_mentioned"].notna()]
#%%
#First we create the edges
edges = []
for idx in range(len(tweets_w_edges)):
    tweet = tweets_w_edges.iloc[idx]
    origin = tweet["user_name"]
    name_mentions = read_names_mentioned(tweet)
    try:
        for name_mention in name_mentions:
            if (origin != name_mention):
                edges += [(origin, name_mention,{"year":tweet["year"],"ID":tweet["tweet_ID"]})]
                if (len(name_mention) == 1):
                    print(f"Origin:{origin} and the mentioned: {name_mention}")
    except:
        print(tweet)


We then created a directed multigraph based on the edges. 
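
The cell that builds the graph is not shown here. As a minimal sketch, a list of `(origin, mention, attributes)` tuples like the one above can be turned into a directed multigraph with NetworkX as follows. The toy edges and the name `G_toy` are made up for illustration.

```python
import networkx as nx

# Toy edges in the same shape as the `edges` list: (origin, mentioned, attribute dict)
toy_edges = [
    ("A", "B", {"year": "2019", "ID": 1}),
    ("A", "B", {"year": "2020", "ID": 2}),  # a second mention becomes a parallel edge
    ("B", "C", {"year": "2020", "ID": 3}),
]

G_toy = nx.MultiDiGraph()
G_toy.add_edges_from(toy_edges)

# Parallel edges are kept separately, each with its own year and tweet ID
print(G_toy.number_of_edges("A", "B"))
print(list(G_toy.edges(data=True)))
```

Because every mention keeps its own edge, the in-degree of a node counts how many times that user was mentioned.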

The next step was adding the attributes of each node:

In [None]:
#Now we add our attributes to the nodes in the graph
attribute_dicts = dict()

#For each node, we create an attributes dict
for idx in range(len(users_DF)):
    row = users_DF.iloc[idx]
    node_dict = dict()

    #We take all the attributes but name and id
    for attr in users_DF.drop(columns=["name","id"]).columns:
        value = row[attr]
        node_dict[attr] = value

    attribute_dicts[row["name"]] = node_dict

nx.set_node_attributes(G,attribute_dicts)
print(f"A quick litmus test: {G.nodes['Bernie Sanders']}\nThe values should be equal to senate, member, I, VT and nan")

Finally, we created a directed multigraph for every year we had tweets. 

For year i, every edge with the attribute year = i was included in the graph for that year, together with the two nodes it connects.

In [None]:
#Finally we group the edges by the year of the tweet
graph_by_year = {}  # Maps each year to the list of edges (with data) from that year


for edge in G.edges(data=True):
    year = edge[2]["year"]

    # Start a new edge list the first time we see a year
    if year not in graph_by_year:
        graph_by_year[year] = []

    graph_by_year[year] += [edge]  # Add the edge to the edge list of its year


In [None]:
#%%
#We now need to populate the nodes' attributes

#We were not successful in extracting a subgraph based on the edges directly,
#or in using nx.set_node(), so we iterate over all the nodes and set their attributes "manually"
all_node_attributes = {"chamber":None,"type":None,"party":None,"state":None,"prev_props":None}

for idx,year in enumerate(graph_by_year.keys()):

    yearGraph = nx.MultiDiGraph(graph_by_year[year])

    #First we populate our "all_node_attributes"
    for attr in all_node_attributes.keys():
        all_node_attributes[attr] = nx.get_node_attributes(G,attr)

    #We then iterate over all the attributes and set them in our graph
    for node_attr in all_node_attributes.keys():

        #single node attribute, like chamber,type, etc.
        single_node_attribute = all_node_attributes[node_attr]


        for node in single_node_attribute.keys():
            if(node in yearGraph.nodes):
                yearGraph.nodes[node][node_attr] = single_node_attribute[node]

    #Use a context manager so the file is closed after writing
    with open(f'directed_multi_twitter_graph_fixed{year}.pickle', 'wb') as f:
        pickle.dump(yearGraph, f)
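
As a possible alternative to copying the node attributes by hand, NetworkX's `edge_subgraph` keeps the node attributes of the original graph. The sketch below runs on a made-up toy graph; `graph_for_year` is a hypothetical helper, and we have not run it on the full dataset.

```python
import networkx as nx

# Toy multigraph with node attributes and a year on every edge
G_toy = nx.MultiDiGraph()
G_toy.add_node("A", party="D")
G_toy.add_node("B", party="R")
G_toy.add_node("C", party="I")
G_toy.add_edges_from([
    ("A", "B", {"year": "2019", "ID": 1}),
    ("A", "B", {"year": "2020", "ID": 2}),
    ("B", "C", {"year": "2020", "ID": 3}),
])

def graph_for_year(G, year):
    """Return an independent copy of G restricted to the edges from `year`."""
    # For a multigraph, edge_subgraph expects (u, v, key) triples
    keep = [(u, v, key) for u, v, key, y in G.edges(keys=True, data="year") if y == year]
    return G.edge_subgraph(keep).copy()

g2019 = graph_for_year(G_toy, "2019")
g2020 = graph_for_year(G_toy, "2020")
print(g2019.nodes(data=True))
```

Only nodes that are endpoints of an edge from the given year end up in the yearly graph, which matches the rule described above.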

## Basic data statistics

In [None]:
print(DiG)
print(nx.density(DiG))

degree_list = list(DiG.degree())
degree_list = [degree_tuple[1] for degree_tuple in degree_list]
print(f"average {np.mean(degree_list)}, \nmedian {np.median(degree_list)}, \nmode {stats.mode(degree_list,keepdims=False)[0]}, \nminimum {np.min(degree_list)}, \nmaximum {np.max(degree_list)} \nvalue of the degree\n\n")

In [None]:
in_degrees = [degree for node, degree in DiG.in_degree()]
out_degrees = [degree for node, degree in DiG.out_degree()]
print(f"In degree: average {np.mean(in_degrees):0.1f}, \tmedian {np.median(in_degrees)}, \tmode {stats.mode(in_degrees,keepdims=False)[0]}, \tminimum {np.min(in_degrees)}, \tmaximum {np.max(in_degrees)} ")
print(f"Out degree: average {np.mean(out_degrees):0.1f}, \tmedian {np.median(out_degrees)}, \tmode {stats.mode(out_degrees,keepdims=False)[0]}, \tminimum {np.min(out_degrees)}, \tmaximum {np.max(out_degrees)}")

fig, ax = plt.subplots(1,2,figsize=(10,4))

ax[0].hist(in_degrees,bins=30)
ax[0].set_title('In Degree distribution of DiG')
ax[0].set_ylabel('count')
ax[0].set_xlabel('degree')

ax[1].hist(out_degrees,bins=30)
ax[1].set_title('Out Degree distribution of DiG')
ax[1].set_ylabel('count')
ax[1].set_xlabel('degree')

plt.show()

We then wanted to look at the users with the highest in- and out-degree.

In [None]:
in_degrees_sorted = [(node, degree) for node, degree in DiG.in_degree()]
in_degrees_sorted.sort(key=lambda x: x[1], reverse=True)
out_degrees_sorted = [(node, degree) for node, degree in DiG.out_degree()]
out_degrees_sorted.sort(key=lambda x: x[1], reverse=True)


print("in degrees sorted : \n", in_degrees_sorted[:5])
print("out degrees sorted : \n", out_degrees_sorted[:5])

From this we can see that the most mentioned twitter accounts across all years are 'VP' and 'SpeakerPelosi'. This comes as no surprise, because they are some of the main twitter accounts of the United States government and they have been active in all the years we have data for.

For the out-degree we have 'RepDonBeyer', 'RepDonBacon' and 'auctnr1', which is the account of Billy Long. They are all very active twitter users who average multiple tweets each day; Don Bacon and Billy Long are Republicans, while Don Beyer is a Democrat. 'NRCC' is another interesting account, because it is not a person but a group, as stated in its twitter bio: "The NRCC is dedicated to defending our conservative majority in the House".

# Tools, theory and analysis. Describe the process of theory to insight

* Talk about how you've worked with text, including regular expressions, unicode, etc.
* Describe which network science tools and data analysis strategies you've used, how those network science measures work, and why the tools you've chosen are right for the problem you're solving.
* How did you use the tools to understand your dataset?

## Communities

We used NetworkX's modularity function to measure the quality of our network partitions. Modularity compares the fraction of edges that fall within communities with the fraction we would expect in a random graph with the same degree sequence. The higher the modularity, the less likely it is that the partition arose by chance, and we therefore assume the communities to be better defined. That is, if a partition's modularity could be achieved by pure chance, it most likely does not describe real communities.

One of the qualitative tools we used was the Netwulf graph visualization tool. It uses a force-directed layout that simulates physical forces on the nodes and edges: nodes repel each other while edges pull connected nodes together, which makes it easier to distinguish groups of nodes from one another.

In conjunction with modularity, Netwulf allowed us to recognize that partitioning the network by party-chamber combination was a better partition than partitioning by party alone. Modularity then allowed us to confirm this hypothesis quantitatively.
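
To illustrate how modularity separates a good partition from a poor one, here is a minimal sketch on a toy graph of two triangles joined by a single edge. The graph and both partitions are invented for the example and are unrelated to the Congress network.

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Two triangles (1-2-3 and 4-5-6) connected by one bridge edge (3-4)
G_toy = nx.Graph()
G_toy.add_edges_from([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)])

# A partition that follows the triangles, and one that cuts across them
good_partition = [{1, 2, 3}, {4, 5, 6}]
bad_partition = [{1, 2, 4}, {3, 5, 6}]

q_good = modularity(G_toy, good_partition)
q_bad = modularity(G_toy, bad_partition)

print(f"good partition: {q_good:.3f}, bad partition: {q_bad:.3f}")
```

The partition that follows the triangles scores well above zero, while the one that cuts across them scores below zero, i.e. worse than what a random graph with the same degrees would give.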

In [4]:
#Example of the difference between the two

G = pickle.load(open('directed_Multi_twitter_graph.pickle', 'rb'))

def replace_na(graph,code_for_NaN = ''):
    #We turn NaNs into '' - this is only for Netwulf, as JSON cannot handle Nan values
    for node_key in graph.nodes:
        node = graph.nodes[node_key]
        for attribute in node.keys():
            if(type(node[attribute]) == float):

                #NaN is the only float that is neither >= 0 nor <= 0
                if not(node[attribute] >= 0 or node[attribute] <= 0):
                    node[attribute] = code_for_NaN

def get_party_affiliation(G):
    normalizing_constant = len(G.nodes())
    affiliation_dict = defaultdict(int)
    replace_na(G,"None")
    for x,y in G.nodes(data=True):
        affiliation_dict[y["party"]] += 1

    for key in affiliation_dict.keys():
        affiliation_dict[key] = affiliation_dict[key]/normalizing_constant

    return affiliation_dict

# Specify node colors based on "party" attribute
for node in G.nodes:
    party = G.nodes[node]["party"]
    if party == "D":
        G.nodes[node]["color"] = "blue"
    elif party == "R":
        G.nodes[node]["color"] = "red"
    elif party == "I":
        G.nodes[node]["color"] = "yellow"
    else:
        G.nodes[node]["color"] = "orange"

#We give each node a size based on their indegree
for node in G.nodes:
  G.nodes[node]["size"] = G.in_degree[node]


replace_na(G,"")
nw.visualize(G)


In the data processing stage, we used the following regex formulas:

In [None]:
RT_pattern = "RT @[a-zA-Z0-9]*" #Pattern for retweets -> it looks for "RT @SomeUserName"
mentions_re = "@[a-zA-Z0-9]*" #Pattern for finding mentions -> it looks for "@SomeUserName"
hashtags_re = "#[a-zA-Z0-9]*" #Pattern for finding hashtags -> it looks for "#SomeHashtag"
#Note: the character class has no underscore, so a handle like "@some_user_name" is only matched up to "@some"

We used them in the following functions, which we then further used to extract the mentions and hashtags in the different tweet texts.

In [None]:
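RT_pattern = "RT @[a-zA-Z0-9]*"

def searchText(formula,twitter_message):
    """Return all matches of the regex `formula` in the text, or None if there are none."""
    try:
        matches = re.findall(formula,twitter_message)
    except TypeError:
        #The text was not a string (e.g. a NaN value)
        return None
    return matches if matches else None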

#Takes a tweet and returns a list of mentions that are in our accounts_dataFrame and their corresponding users' names
def get_Handles_Names(tweet):
    mentions_re = "@[a-zA-Z0-9]*"
    mentions_excl_RT_re = "(?<!RT )@[a-zA-Z0-9]*" #Will exclude all mentions that follow "RT"
    mentions = searchText(mentions_re,tweet)
    screen_names = []
    names_result = []

    #If there is a mention
    if(mentions != None):
        mentions = [handle[1:] for handle in mentions] #Remove @

        #Add to our dict, if the mentions exist in our dataFrame
        for idx,handle in enumerate(mentions):
            names = accounts_dataFrame[accounts_dataFrame['screen_name'] == handle]['name']

            #If the handle exists in our filtered accounts_dataFrame
            if(len(names)>0):
                screen_names += [handle]
                names_result += list(names)
    if(len(names_result)>0):
        return screen_names,names_result
    else:
        return None,None

def get_tweeter_screen_name(tweet):
    return tweet["screen_name"]


### Tools used in the tweet analysis section

The main tool used in the text analysis section is TF-IDF. TF-IDF works by calculating the term frequency of each token in each community, and then weighting the term frequency by the inverse community frequency, i.e. the logarithm of the number of communities divided by the number of communities in which the token appears. Because of this weighting, it is a good way to find frequent words that are unique to each community.
We used a regex to clean each tweet before tokenising it: ```https?://\S+|@\w+|&\w+|#\w+|[^\w\s]|[^a-zA-Z\s]+```. It is a set of 'or' criteria that each remove some unwanted item from the text. The first part removes all links; the next three parts remove the "@", "&" and "#" symbols together with the word characters that directly follow them. The last part, '[^\w\s]|[^a-zA-Z\s]+', removes punctuation and all remaining non-alphabetic characters.
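
As a small illustration of what the cleaning regex removes, here is a sketch applied to one made-up tweet. The sample text is invented, and `str.split()` is used as a simple stand-in for the NLTK tokenizer so that the example has no extra dependencies.

```python
import re

# The cleaning pattern from the text: links, mentions, HTML entities,
# hashtags, punctuation and any remaining non-alphabetic characters
CLEAN_RE = re.compile(r'https?://\S+|@\w+|&\w+|#\w+|[^\w\s]|[^a-zA-Z\s]+')

def clean_tweet(raw_tweet):
    """Lower-case the tweet and delete everything the pattern matches."""
    return CLEAN_RE.sub('', raw_tweet.lower())

sample = "RT @SpeakerPelosi: Proud to vote for #HR1 today! https://t.co/abc123 &amp; more"
cleaned = clean_tweet(sample)

# Whitespace split as a stand-in for nltk.word_tokenize
tokens = cleaned.split()
print(tokens)
```

The mention, hashtag, link and `&amp;` entity disappear, while "rt" survives the regex and is only removed later by the stop-word list.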

The full code is shown below

In [None]:
# Loads tweets
textdf = pd.read_csv('data/textDF')

# drop all columns except for the text and tweet_id column
textdf = textdf.drop(['Unnamed: 0','user_name','screen_name'],axis=1)

textdf_dict = dict(zip(textdf.tweet_ID, textdf.text))

In [None]:
import nltk
from nltk.corpus import stopwords

def tokenize_tweet(raw_tweet: str):
    # Remove URLs, mentions, HTML entities, hashtags, punctuation, and any non-alphabetic characters
    raw_tweet = re.sub(r'https?://\S+|@\w+|&\w+|#\w+|[^\w\s]|[^a-zA-Z\s]+', '', raw_tweet.lower())

    # tokenize -> converts the text to a list of words
    tokens = nltk.word_tokenize(raw_tweet, language='english')

    # loads the predefined English stop words
    stop_words = set(stopwords.words('english'))
    # adds domain-specific stop words for twitter, including rt for retweets and qt for quoted tweets
    stop_words.update(['rt','qt','amp','lr','cm','im', 'today', 'us','must','would','thank','new','bcbra','cm','rsc','rnh'])

    tokens = [token for token in tokens if token not in stop_words]
    
    return tokens

In [None]:
from tqdm import tqdm

G_partitions_tokens = []

for i, G_partition in enumerate(G_partitions):
    tweets_ids_in_partition = []
    tweets_tokenized_in_partition= []

    for node in G_partition:
        for mention_node in G[node].keys():
            edge_data = G[node][mention_node]
            for tweet_id in edge_data['ID']:
                tweets_ids_in_partition.append(tweet_id)

    for tweet_id in tqdm(tweets_ids_in_partition, desc=f'Tokenizing tweets {i+1} of {len(G_partitions)}'):
        tweet_text = textdf_dict[tweet_id]
        tokenized_tweet = tokenize_tweet(tweet_text)
        tweets_tokenized_in_partition += tokenized_tweet
    
    G_partitions_tokens.append(tweets_tokenized_in_partition)

Calculates TF for each community

In [None]:
tf_dict = defaultdict(lambda: defaultdict(lambda: 0)) # term documents

for i, partition_tokens in enumerate(G_partitions_tokens):
    # calculates TF for community
    for token in partition_tokens:
        tf_dict[token][i] += 1 / len(partition_tokens)

top10_tokens = []
for partition_index in range(len(G_partitions_tokens)):
    # .get() reads the frequency without inserting new keys into the inner defaultdicts,
    # which would otherwise distort the community counts used for the IDF
    sorted_tokens = sorted(tf_dict, key=lambda x: tf_dict[x].get(partition_index, 0), reverse=True)

    top10_tokens.append(sorted_tokens[:10])

print('Top 10 tokens for partitions')
# print matrix as table using markdown
print('| Partition | Top 10 Tokens |')
print('| --- | --- |')
for i, top10 in enumerate(top10_tokens):
    print(f'| {i} | {", ".join(top10)} |')
    

Calculates IDF for each token

In [None]:
# IDF for each token, treating every community as one document
idf_dict = {}
num_docs = len(G_partitions_tokens)
for token in tf_dict:
    idf_dict[token] = np.log(num_docs / (len(tf_dict[token]))) # every token in tf_dict appears in at least one community, so we do not need to add 1

Calculates tf-idf

In [None]:
# Calculate the TF-IDF weight for each term in each document
tfidf_dict = defaultdict(lambda: defaultdict(lambda: 0)) 
for token in tf_dict:
    for community_index in tf_dict[token]:
        tfidf_dict[community_index][token] = tf_dict[token][community_index] * idf_dict[token]
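
To show the arithmetic on something small, here is a minimal sketch of the same TF-IDF calculation on two invented communities. The token lists and the helper name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Two invented communities, each reduced to a flat list of tokens
community_tokens = {
    "A": ["tax", "tax", "border", "jobs"],
    "B": ["climate", "climate", "health", "jobs"],
}

def tf_idf(docs):
    """Return {community: {token: tf * idf}} with idf = ln(N / communities containing the token)."""
    num_docs = len(docs)
    # Term frequency: share of the community's tokens that are this token
    tf = {
        name: {tok: count / len(tokens) for tok, count in Counter(tokens).items()}
        for name, tokens in docs.items()
    }
    # Document frequency: in how many communities does the token occur
    doc_freq = Counter(tok for freqs in tf.values() for tok in freqs)
    idf = {tok: math.log(num_docs / df) for tok, df in doc_freq.items()}
    return {name: {tok: f * idf[tok] for tok, f in freqs.items()} for name, freqs in tf.items()}

scores = tf_idf(community_tokens)
top_a = max(scores["A"], key=scores["A"].get)
print(top_a, scores["A"])
```

"jobs" occurs in both communities, so its IDF is ln(2/2) = 0 and it drops out, while "tax" is frequent in community A only and gets the highest score there.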

In [None]:
# Prints top 10 TF-IDF for each community 
top_10_TFIDF = []
for community_index in tfidf_dict:
    top_10_TFIDF.append(sorted(tfidf_dict[community_index], key=lambda x: tfidf_dict[community_index][x],reverse=True)[:10])
    #print("Top 10 TF-IDF words for community: ",community_index, )

print(f'Top 10 TF-IDF for partitions')
# print matrix as table using markdown
print('| Partition | Top 10 TF-IDF |')
print('| --- | --- |')
for i, top10 in enumerate(top_10_TFIDF):
    print(f'| {i} | {", ".join(top10)} |')

Creates wordcloud based on TF-IDF (to get better words than simply the most frequent)

In [None]:
from wordcloud import WordCloud

# Generate a word cloud image
fig, axs = plt.subplots(3,2, figsize=(20,10))
fig.subplots_adjust(wspace=-0.5)

for i, community_index in enumerate(tfidf_dict):
    # The 3x2 grid only has room for six communities
    if i >= 6:
        break

    wordcloud = WordCloud(max_font_size=40, max_words=50, background_color="white").generate_from_frequencies(tfidf_dict[community_index])

    ax = axs[i // 2, i % 2]
    ax.imshow(wordcloud, interpolation="bilinear")
    ax.axis("off")
    ax.set_title(f'Community: {community_index}', fontsize=20)

plt.show()

Finally, the biggest nodes in each community

In [None]:
top_3_nodes_per_community = []

for i, G_partition in enumerate(G_partitions):
    # Rank the nodes of the community by their weighted degree
    top_3_nodes_per_community.append(sorted(G_partition, key=lambda x: G.degree(x,weight='weight'), reverse=True)[:3])

print('Top 3 accounts for partitions')
# print matrix as table using markdown
print('| Partition | Top 3 accounts |')
print('| --- | --- |')
for i, top3 in enumerate(top_3_nodes_per_community):
    print(f'| {i} | {", ".join(top3)} |')

In our community section of the webpage, we used 

# Discussion. Think critically about your creation

From our analysis we have found a level of polarisation in the US Government. We think that this is a worrying trend.
