# README

This notebook creates the graphs and performs a basic analysis of the data.

We use the data stored in the `data.json` file, which has the following structure:

- It is a dictionary and the keys are the initial target tags
- The values for each of these keys are lists of post data dictionaries
- Each dictionary has information about the post and a list of tags
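
As a rough illustration of that layout, the sketch below builds a tiny stand-in for `data.json`. The field names follow the sample post printed later in this notebook; the values are placeholders, not real data.

```python
# Hypothetical miniature of data.json: key tag -> list of post dictionaries.
# Field names mirror a post from the file; every value here is a placeholder.
sample_data = {
    "thirdmolar": [
        {
            "id_post": "0000000000000000000",
            "id_owner": "00000000000",
            "shortcode": "XXXXXXXXXXX",
            "text": "Example caption #oralsurgery #thirdmolar",
            "post_url": "https://www.instagram.com/p/XXXXXXXXXXX/",
            "tags": ["oralsurgery", "thirdmolar"],
        }
    ]
}

# Walk the structure: outer loop over key tags, inner loop over their posts.
for key_tag, posts in sample_data.items():
    for post in posts:
        print(key_tag, post["shortcode"], post["tags"])
```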

# Preparing Data

With all the data collected in the previous notebook, it is time to analyse it.

## Loading Previous Data

In [1]:
# link to access the data.json file
share_url = 'https://drive.google.com/open?id=1L0vboR9Y7u7VH6gwf78AlLCFh7p-iZ0D'

# link to download the data.json file
download_url = 'https://drive.google.com/uc?export=download&id=' + '1L0vboR9Y7u7VH6gwf78AlLCFh7p-iZ0D'

In [2]:
# download the data.json file if the notebook is running on google colab
!wget "https://drive.google.com/uc?export=download&id=1L0vboR9Y7u7VH6gwf78AlLCFh7p-iZ0D" -nc -q -O "data.json"

In [11]:
# reading the data.json file

import json

# the context manager closes the file once it has been read
with open('data.json') as file:
    data_json = json.load(file)

## Creating Lists of Edges

We want to construct a graph, so we need to retrieve the nodes and establish the edges between them.

Since we are analysing the hashtags in the post texts, the hashtags are our nodes.

Two nodes are connected if their tags appeared together in at least one post.

As expected, data collection is not perfect, and a lot of junk ended up in the posts' tag lists.

So we need to filter them, and the function below does this task.

In [12]:
import re

def validate_tag(tag):
    """
    Checks if a tag is valid according to its contents and size
    """

    # length bounds are exclusive: valid tags have 2 to 24 characters
    MAX_LEN = 25
    MIN_LEN = 1

    # letters (including accented ones), digits, hyphens and apostrophes
    #pattern = '^[a-zA-Z0-9]+$'
    pattern = "^[-'a-zA-ZÀ-ÖØ-öø-ÿ0-9]+$" # allow accent

    return bool(re.match(pattern, tag)) and MIN_LEN < len(tag) < MAX_LEN
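
To make the filtering rule concrete, here is a small usage sketch. The function is reproduced in compact form so the snippet runs on its own, and the example tags are made up to exercise each rule.

```python
import re

def validate_tag(tag):
    """Same rule as in the notebook: allowed characters only, 2 to 24 characters."""
    pattern = "^[-'a-zA-ZÀ-ÖØ-öø-ÿ0-9]+$"
    return bool(re.match(pattern, tag)) and 1 < len(tag) < 25

# Hypothetical inputs: two valid tags, then one failure per rule.
examples = ["lulalivre", "eleições", "a", "two words", "x" * 25, "tag!"]

for tag in examples:
    print(repr(tag), validate_tag(tag))
```

The single-character tag and the 25-character tag fail the length check; the tags with a space or `!` fail the character check.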

**Note**

In the code snippet below, we create two lists of edges:

- The `keys` one refers to the edges between our initial tags (also referred to as key tags) and all others.

- The `all` one refers to the edges between all tags.

After some earlier analysis and problems, we made some improvements:

- Importing large graphs into Gephi caused many errors, so we initially limited the number of posts to work with.

- Many self-loop edges were occurring on the key tags. We created a new list without the key tag to build the connections.

- The same edge between two tags was being added more than once for each post. We take a slice of the tags to prevent it.

- When calculating the edge weight in later steps, the count was split in two because the source and target nodes were sometimes swapped. So we append each edge with its nodes sorted alphabetically.
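
The last two fixes can also be obtained in a single step. The sketch below is an alternative, not the code this notebook uses: it removes repeated tags, sorts them, and lets `itertools.combinations` produce every unordered pair exactly once. The helper name `edges_for_post` and the sample tags are hypothetical.

```python
from itertools import combinations

def edges_for_post(tags):
    """Return each unordered pair of distinct tags once, in alphabetical order."""
    unique_sorted = sorted(set(tags))
    # combinations() on a sorted list yields pairs that are already ordered,
    # so neither self-loops nor swapped duplicates can appear.
    return list(combinations(unique_sorted, 2))

print(edges_for_post(["lula", "brasil", "lula", "moro"]))
# [('brasil', 'lula'), ('brasil', 'moro'), ('lula', 'moro')]
```

Because repeated tags are removed first, edge counts produced this way can differ from the ones reported in this notebook.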

In [17]:
data_json['thirdmolar'][0]

{'id_post': '2432898161887553073',
 'id_owner': '44427672855',
 'shortcode': 'CHDY_z6AiYx',
 'text': 'Classical caldwell approach (canine fossa radical antrostomy) for retrieval of displaced root of maxillary tooth from maxillary antrum/sinus.24 year old female .. reffered for removal of displaced mesiobuccal root of maxillary second molar from sinus.. lateral window antrostomy.. The tooth root was visualized and removed with help of suction tip.. ((video))Incision was distally extended to d socket site and primary closure with buccal advancement was done to avoid possible oro antral communication..#oralsurgery #extraction #impacted #thirdmolar #dentistry #dentistrystudents #dentist #minorsurgery #dental #wisdomtooth #wisdomtoothremoval #wisdomtoothextraction #wisdomtoothsurgery #dentalsurgery #dentalpassion #maxillarysinus #sinusitis #caldwell #antrum',
 'post_url': 'https://www.instagram.com/p/CHDY_z6AiYx/',
 'tags': ['oralsurgery',
  'extraction',
  'impacted',
  'thirdmolar',
  'de

In [5]:
# limiting the number of posts used per key tag
POSTS_MAX = 100

# this list contains just edges from initial target (keys) tags to related post tags
edges_list_keys = []

# this list contains all edges between pairs of tags from the same post
edges_list_all = []

# populating the lists of edges
for person, posts in data_json.items() :
    
    # traversing each post for each key tag
    for post in posts[:POSTS_MAX] :
        
        # list of tags in the post including trash tags
        post_tags = post['tags']
        
        # list of tags in the post after filtering
        post_tags = [tag for tag in post_tags if validate_tag(tag)]
        
        # list of tags without the key tag
        post_tags_drop_person = [tag for tag in post_tags if not tag == person]
        
        # creating edges between key tag and all others
        for tag in post_tags_drop_person :
            
            edge_keys = (person, tag)
            
            edges_list_keys.append( edge_keys )
        
        # creating the edges between all the tags
        for tag in post_tags :
            
            # index of the current tag in the list
            tag_index = post_tags.index(tag)
            
            # this slice is needed in order to connect each pair of tags once and only once
            post_tags_slice = post_tags[tag_index+1:]
            
            for stag in post_tags_slice :
                
                edge_all_pre = (tag, stag)
                
                # creating the edge element in alphabetical order
                edge_all = ( min(edge_all_pre) , max(edge_all_pre) )
                
                edges_list_all.append( edge_all )

In [6]:
print('Numbers of edges:')

print(len(edges_list_keys))

print(len(edges_list_all))

Numbers of edges:
10518
123209


In [7]:
# checking a sample of edges
edges_list_all[:10]

[('calabria', 'italia'),
 ('biologico', 'italia'),
 ('ciro', 'italia'),
 ('italia', 'santavenere'),
 ('enotecar', 'italia'),
 ('italia', 'italiawineshop'),
 ('italia', 'sardellacalabrese'),
 ('biologico', 'calabria'),
 ('calabria', 'ciro'),
 ('calabria', 'santavenere')]

In [8]:
# checking a sample of edges
edges_list_keys[:10]

[('ciro', 'italia'),
 ('ciro', 'calabria'),
 ('ciro', 'biologico'),
 ('ciro', 'santavenere'),
 ('ciro', 'enotecar'),
 ('ciro', 'italiawineshop'),
 ('ciro', 'sardellacalabrese'),
 ('ciro', 'rock'),
 ('ciro', 'rocknacional'),
 ('ciro', 'rockargento')]

# Handling List of All Edges

## Initial Graph

In [9]:
import networkx as nx

In [10]:
G = nx.from_edgelist(edges_list_all)

In [11]:
list(G.nodes)[:10]

['abaladissimaa',
 'cultura',
 'juizsergiomoro',
 'astrofotografia',
 'lapulga',
 'liverpool',
 'morovazajatotheintercept',
 'gafe',
 'jessika',
 'summer']

In [12]:
list(G.edges())[:10]

[('abaladissimaa', 'partidoalto'),
 ('abaladissimaa', 'corrupcao'),
 ('abaladissimaa', 'provas'),
 ('abaladissimaa', 'moro'),
 ('abaladissimaa', 'happy'),
 ('abaladissimaa', 'lula'),
 ('abaladissimaa', 'vazajato'),
 ('abaladissimaa', 'bolsonaro'),
 ('abaladissimaa', 'juizsergiomoro'),
 ('abaladissimaa', 'meme')]

In [13]:
len(G.nodes)

2796

In [14]:
len(G.edges)

44477

In [15]:
# percentage from graph edges to list of edges
100 * len(G.edges)/len(edges_list_all)

36.09882394954914

**Note**

What should we do with duplicate edges, which disappear when added to the graph? They can be counted and stored as a weight attribute on the edge.
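
As a minimal illustration of the idea, using only the standard library and a made-up edge list, repeated edges can be counted like this:

```python
from collections import Counter

# Toy edge list with one repeated pair (values are illustrative).
edges = [("calabria", "italia"), ("ciro", "italia"), ("calabria", "italia")]

# Counter maps each distinct edge to the number of times it occurs.
edge_weights = Counter(edges)

for (source, target), weight in edge_weights.items():
    print(source, target, weight)
```

The notebook reaches the same kind of count with a pandas `groupby` in the next section.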

## Grouping and Counting Edges

In [16]:
import pandas as pd

In [17]:
edges_df = pd.DataFrame(edges_list_all, columns=['source', 'target'])

In [18]:
edges_df.head()

Unnamed: 0,source,target
0,calabria,italia
1,biologico,italia
2,ciro,italia
3,italia,santavenere
4,enotecar,italia


In [19]:
# edges_df.to_csv('edges_list_all.csv')

In [20]:
edges_df['tuple'] = pd.Series(zip(edges_df.source, edges_df.target))

In [21]:
edges_df.head()

Unnamed: 0,source,target,tuple
0,calabria,italia,"(calabria, italia)"
1,biologico,italia,"(biologico, italia)"
2,ciro,italia,"(ciro, italia)"
3,italia,santavenere,"(italia, santavenere)"
4,enotecar,italia,"(enotecar, italia)"


In [22]:
edges_grouped = edges_df.groupby('tuple').count()

In [23]:
edges_grouped.sample(5)

tuple,source,target
"(ptbrasil, racismo)",4,4
"(bolsonaro, brasilfelizdenovo)",1,1
"(antoroccuzzo, couplegoals)",3,3
"(dilmabolada, longliverocknroll)",1,1
"(likes, neymar)",3,3


**Note**

We can add the count for each connection between tags as a parameter of the edge.

Let's improve the dataframe for this task.
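
The cells that follow build the weight column step by step. For comparison, here is a more compact sketch, assuming pandas is installed, that groups on the two columns directly and names the count `weight`. It is an alternative rather than a description of the notebook's cells, and the toy rows are illustrative.

```python
import pandas as pd

edges_df = pd.DataFrame(
    [("calabria", "italia"), ("ciro", "italia"), ("calabria", "italia")],
    columns=["source", "target"],
)

# size() counts the rows in each (source, target) group;
# reset_index(name=...) turns the result back into a flat DataFrame.
edges_counted = (
    edges_df.groupby(["source", "target"]).size().reset_index(name="weight")
)
print(edges_counted)
```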

In [24]:
edges_grouped.drop(columns='target', inplace=True, errors='ignore')

In [25]:
edges_grouped.columns=['weight']

In [26]:
edges_grouped.reset_index(inplace=True)

In [27]:
edges_grouped.sample(5)

Unnamed: 0,tuple,weight
31431,"(holland, win)",1
28513,"(gaming, pes2019)",1
17513,"(d10s, leoandresmessi)",4
41590,"(penal, theintercept)",2
43662,"(sanches, semedo)",1


In [28]:
edges_grouped.shape

(44477, 2)

In [29]:
edges_grouped['source'] = edges_grouped.tuple.str[0]

In [30]:
edges_grouped['target'] = edges_grouped.tuple.str[1]

In [31]:
edges_grouped = edges_grouped.drop(columns='tuple')

In [32]:
edges_grouped.sample(5)

Unnamed: 0,weight,source,target
38214,1,mancity,proevolutionsoccer
14476,3,ciromessi,leo10
1473,1,amanha,poesias
14764,1,colt,mexico
44414,1,vamos,vermelho


In [33]:
# edges_grouped.to_csv('edges_counted.csv')

**Note**

Now, let's finally create the graph.

## Creating New Graph

In [34]:
G = nx.from_pandas_edgelist(edges_grouped, edge_attr=True)

In [35]:
list(G.nodes)[:10]

['abaladissimaa',
 'cultura',
 'juizsergiomoro',
 'astrofotografia',
 'lapulga',
 'ecuador',
 'liverpool',
 'avenidapaulista',
 'gafe',
 'jessika']

In [36]:
list(G.edges(data=True))[:10]

[('abaladissimaa', 'meme', {'weight': 2}),
 ('abaladissimaa', 'corrupcao', {'weight': 2}),
 ('abaladissimaa', 'provas', {'weight': 2}),
 ('abaladissimaa', 'moro', {'weight': 2}),
 ('abaladissimaa', 'happy', {'weight': 2}),
 ('abaladissimaa', 'juizsergiomoro', {'weight': 2}),
 ('abaladissimaa', 'politica', {'weight': 2}),
 ('abaladissimaa', 'bolsonaro', {'weight': 2}),
 ('abaladissimaa', 'pt', {'weight': 2}),
 ('abaladissimaa', 'lula', {'weight': 2})]

In [37]:
len(G.nodes)

2796

In [38]:
len(G.edges)

44477

**Note**

We have the same number of nodes and edges as before, but each edge now carries a weight.

In [39]:
# the same percentage as before, but now with the grouped dataframe
100 * len(G.edges)/edges_grouped.shape[0]

100.0

In [40]:
nx.write_graphml(G, "edges_counted_" + str(POSTS_MAX) + ".graphml")

**Note**

Let's take a closer look.

Let's check the most important nodes.

## Inspecting Edges

In [41]:
edges_grouped.sort_values(by='weight', ascending=False).head(10)

Unnamed: 0,weight,source,target
21904,198,elenao,lulalivre
37306,175,lulalivre,manueladavila
21726,162,eleicoes2018,manueladavila
30728,153,haddad,lulalivre
36755,138,lula,lulalivre
6972,127,bolsonaro,brasil
7445,115,bolsonaro,moro
21854,113,elenao,haddad
21725,109,eleicoes2018,lulalivre
6938,108,bolsonaro,bolsonaro2018


In [42]:
# defining masks to select data

mask_source_lulalivre = edges_grouped.source == 'lulalivre'
mask_source_lulapresopolitico = edges_grouped.source == 'lulapresopolitico'

mask_target_lulalivre = edges_grouped.target == 'lulalivre'
mask_target_lulapresopolitico = edges_grouped.target == 'lulapresopolitico'

In [43]:
edges_grouped[mask_source_lulalivre & mask_target_lulapresopolitico]

Unnamed: 0,weight,source,target
37298,54,lulalivre,lulapresopolitico


In [44]:
edges_grouped[mask_target_lulalivre & mask_source_lulapresopolitico]

Unnamed: 0,weight,source,target


**Note**

The pair shows up in one order only, which suggests that no pair of tags is duplicated.
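
The two lookups above test a single pair. A sketch of the same check over every edge, using only the standard library, could look like this (the function name and the toy edges are hypothetical):

```python
def swapped_pairs(edges):
    """Return the pairs that occur in both orders, reported once as (a, b) with a < b."""
    seen = set(edges)
    return sorted((s, t) for (s, t) in seen if s < t and (t, s) in seen)

edges = [
    ("lulalivre", "lulapresopolitico"),
    ("elenao", "haddad"),
    ("haddad", "elenao"),
]
print(swapped_pairs(edges))
# [('elenao', 'haddad')]
```

An empty result means every pair is stored in one order only.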

## Inspecting Weights

In [45]:
edges_grouped.weight.sort_values(ascending=False).sample(15)

31836    1
4171     4
26557    1
23275    8
8997     1
8746     2
35385    4
11173    2
25279    1
4205     2
43769    1
31843    1
37702    1
39127    1
34061    2
Name: weight, dtype: int64

In [46]:
weight_counts = edges_grouped.weight.value_counts().sort_index(ascending=False)

In [47]:
weight_counts.head(10)

198    1
175    1
162    1
153    1
138    1
127    1
115    1
113    1
109    1
108    1
Name: weight, dtype: int64

In [48]:
weight_counts.tail(15)

15       69
14       56
13       70
12      330
11       70
10      154
9       186
8       470
7       426
6       780
5       914
4      2085
3      3079
2      6460
1     28090
Name: weight, dtype: int64

**Note**

Most of the edges are insignificant (28090 of the 44477 have weight 1) and can be dropped for a clearer visual analysis.

## Dropping Insignificant Edges

The dropping was done here, but Gephi also has filtering tools.
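
To judge a cut-off before applying it, one can compute the share of edges it would remove. The sketch below reuses the counts printed earlier in this notebook; the function name is hypothetical.

```python
# Edge counts per weight, copied from the tail of weight_counts.
low_weight_counts = {1: 28090, 2: 6460, 3: 3079, 4: 2085, 5: 914}
TOTAL_EDGES = 44477  # number of rows in edges_grouped

def share_dropped(threshold):
    """Fraction of all edges whose weight is <= threshold (thresholds 1 to 5 only)."""
    dropped = sum(n for w, n in low_weight_counts.items() if w <= threshold)
    return dropped / TOTAL_EDGES

for threshold in range(1, 6):
    print(threshold, f"{share_dropped(threshold):.1%}")
```

With these counts, a threshold of 5 removes 40628 of the 44477 edges.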

In [49]:
THRESHOLD = 5

# edges with weight at or below the threshold are considered insignificant
mask_insignificant = edges_grouped.weight <= THRESHOLD

In [50]:
edges_grouped_dropped = edges_grouped[~mask_insignificant]

In [51]:
edges_grouped_dropped.weight.value_counts().sort_index(ascending=False).head(10)

198    1
175    1
162    1
153    1
138    1
127    1
115    1
113    1
109    1
108    1
Name: weight, dtype: int64

In [52]:
edges_grouped_dropped.weight.value_counts().sort_index(ascending=False).tail(15)

20    100
19     29
18     58
17     28
16    214
15     69
14     56
13     70
12    330
11     70
10    154
9     186
8     470
7     426
6     780
Name: weight, dtype: int64

In [53]:
# creating a new graph with dropped data
G_dropped = nx.from_pandas_edgelist(edges_grouped_dropped, edge_attr=True)

## Selfloop Edges

### Dropped Graph

In [54]:
list(nx.selfloop_edges(G_dropped, data=True))[:10]

[('family', 'family'),
 ('elenao', 'elenao'),
 ('bolsominionsarrependidos', 'bolsominionsarrependidos'),
 ('manueladavila', 'manueladavila'),
 ('politica', 'politica'),
 ('ptnao', 'ptnao'),
 ('lulalivre', 'lulalivre'),
 ('bolsonaro', 'bolsonaro'),
 ('conservadores', 'conservadores'),
 ('lula', 'lula')]

In [55]:
len(list(nx.selfloop_edges(G_dropped, data=True)))

35

### Complete Graph

In [56]:
list(nx.selfloop_edges(G, data=True))[:10]

[('ptnao', 'ptnao'),
 ('repost', 'repost'),
 ('globolixo', 'globolixo'),
 ('justica', 'justica'),
 ('glock', 'glock'),
 ('morocaboeleitoral', 'morocaboeleitoral'),
 ('euavisei', 'euavisei'),
 ('eleicoes2018', 'eleicoes2018'),
 ('instafood', 'instafood'),
 ('family', 'family')]

In [57]:
len(list(nx.selfloop_edges(G, data=True)))

93

**Note**

The self-loop edges may occur because the same tag is written more than once in the same post.

For now they were not dropped from the graph (Gephi has an option to remove them).

A further verification can be done about this issue.
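
A small check supports this explanation. The sketch below reruns the pairing logic from the edge-building cell on one made-up post in which a tag is repeated. `list.index` returns the first position of a repeated tag, so the slice taken after it still contains the second copy.

```python
def pair_edges(post_tags):
    """Pairing loop as in the edge-building cell, applied to a single post."""
    edges = []
    for tag in post_tags:
        tag_index = post_tags.index(tag)  # first occurrence only
        for stag in post_tags[tag_index + 1:]:
            edges.append((min(tag, stag), max(tag, stag)))
    return edges

edges = pair_edges(["lula", "brasil", "lula"])  # hypothetical post
selfloops = [edge for edge in edges if edge[0] == edge[1]]
print(edges)
print(selfloops)
```

The repeated tag yields `('lula', 'lula')` twice and also adds extra copies of `('brasil', 'lula')`, so removing repeated tags from each post before pairing would avoid both effects.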

In [58]:
nx.write_graphml(G_dropped, "edges_counted_" + str(POSTS_MAX) + "_dropped.graphml")

## Plotting Graph

Plotting is left to Gephi.

The `nx.draw` call below takes about 45 to 50 minutes to complete, so it is disabled.

In [59]:
import matplotlib.pyplot as plt

In [60]:
%%time

# set to True to run this long operation (left disabled here)
if False :

    nx.draw(G)

    plt.show()

CPU times: user 11 µs, sys: 0 ns, total: 11 µs
Wall time: 21.5 µs


# Handling List of Key Edges

## Creating Keys Graph

In [61]:
edges_list_keys[:10]

[('ciro', 'italia'),
 ('ciro', 'calabria'),
 ('ciro', 'biologico'),
 ('ciro', 'santavenere'),
 ('ciro', 'enotecar'),
 ('ciro', 'italiawineshop'),
 ('ciro', 'sardellacalabrese'),
 ('ciro', 'rock'),
 ('ciro', 'rocknacional'),
 ('ciro', 'rockargento')]

In [62]:
g = nx.from_edgelist(edges_list_keys)

In [63]:
len(g.nodes)

2799

In [64]:
len(g.edges)

3817

In [65]:
# percentage from graph edges to list of edges
100 * len(g.edges)/len(edges_list_keys)

36.29016923369462

**Note**

The same problem occurred here: repeated edges collapse when added to the graph, so the percentage is well below 100%.

## Grouping and Counting Keys Edges

In [66]:
edges_df_keys = pd.DataFrame(edges_list_keys, columns=['source', 'target'])

In [67]:
edges_df_keys.sample(5)

Unnamed: 0,source,target
4176,bolsonaro,regrann
7325,guedes,cristiano
6734,dilma,elenao
6433,dilma,vazamoro
7519,guedes,todoscampeoes


In [68]:
# edges_df_keys.to_csv('edges_list_keys.csv')

In [69]:
edges_df_keys['tuple'] = pd.Series(zip(edges_df_keys.source, edges_df_keys.target))

In [70]:
edges_df_keys.sample(5)

Unnamed: 0,source,target,tuple
161,ciro,larenga,"(ciro, larenga)"
3250,moro,alvorada,"(moro, alvorada)"
2621,lula,juventude,"(lula, juventude)"
9372,haddad,lulalivre,"(haddad, lulalivre)"
9099,haddad,lulalivre,"(haddad, lulalivre)"


In [71]:
edges_grouped_keys = edges_df_keys.groupby('tuple').count()

In [72]:
edges_grouped_keys.sample(5)

tuple,source,target
"(haddad, sdv)",13,13
"(lula, deltan)",2,2
"(dilma, neymar)",4,4
"(guedes, spain)",1,1
"(bolsonaro, cabelobranco)",2,2


In [73]:
edges_grouped_keys.drop(columns='target', inplace=True, errors='ignore')

In [74]:
edges_grouped_keys.columns=['weight']

In [75]:
edges_grouped_keys.reset_index(inplace=True)

In [76]:
edges_grouped_keys.sample(5)

Unnamed: 0,tuple,weight
562,"(bolsonaro, sbtonline)",2
3187,"(lula, randolferodrigues)",3
746,"(ciro, ciro2022)",1
568,"(bolsonaro, seguroempresarial)",1
78,"(bolsonaro, bolsonaropresidente17)",1


In [77]:
edges_grouped_keys['source'] = edges_grouped_keys.tuple.str[0]

In [78]:
edges_grouped_keys['target'] = edges_grouped_keys.tuple.str[1]

In [79]:
edges_grouped_keys.shape

(3828, 4)

In [80]:
edges_grouped_keys = edges_grouped_keys.drop(columns='tuple')

In [81]:
edges_grouped_keys.sample(5)

Unnamed: 0,weight,source,target
1517,8,dilma,haddadpresidente
3339,2,moro,augustonunes
2643,1,lula,54
1516,20,dilma,haddad
170,1,bolsonaro,deltan


In [82]:
# edges_grouped_keys.to_csv('edges_counted_keys.csv')

## Creating New Keys Graph

In [83]:
g = nx.from_pandas_edgelist(edges_grouped_keys, edge_attr=True)

In [84]:
list(g.nodes)[:10]

['abaladissimaa',
 'cultura',
 'juizsergiomoro',
 'astrofotografia',
 'lapulga',
 'liverpool',
 'avenidapaulista',
 'gafe',
 'jessika',
 'summer']

In [85]:
list(g.edges(data=True))[:10]

[('abaladissimaa', 'haddad', {'weight': 1}),
 ('abaladissimaa', 'dilma', {'weight': 1}),
 ('cultura', 'lula', {'weight': 1}),
 ('juizsergiomoro', 'bolsonaro', {'weight': 4}),
 ('juizsergiomoro', 'haddad', {'weight': 1}),
 ('juizsergiomoro', 'lula', {'weight': 2}),
 ('juizsergiomoro', 'moro', {'weight': 3}),
 ('juizsergiomoro', 'dilma', {'weight': 1}),
 ('astrofotografia', 'bolsonaro', {'weight': 5}),
 ('lapulga', 'ciro', {'weight': 1})]

In [86]:
len(g.nodes)

2799

In [87]:
len(g.edges)

3817

In [88]:
# the same percentage as before, but now with the grouped dataframe
100 * len(g.edges)/edges_grouped_keys.shape[0]

99.71264367816092

**Note**

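This percentage is not 100% because key-tag edges are not stored with their nodes in alphabetical order. A pair of key tags can therefore appear in both orders, for example (bolsonaro, dilma) and (dilma, bolsonaro), and the undirected graph merges the two rows into one edge: 3828 rows become 3817 edges.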

In [89]:
nx.write_graphml(g, "edges_counted_keys_" + str(POSTS_MAX) + ".graphml")

## Inspecting Keys Edges

In [90]:
edges_grouped_keys.sample(10)

Unnamed: 0,weight,source,target
3784,6,moro,stf
2281,1,guedes,winn
596,1,bolsonaro,stfvergonha
3442,1,moro,deliciousfood
213,7,bolsonaro,eduardobolsonaro
1097,1,ciro,messifans
2043,3,guedes,liverpool
2087,1,guedes,mrsatan
421,3,bolsonaro,military
1582,1,dilma,moroedallagnol


In [91]:
# checking whether there are any null values
edges_grouped_keys.isnull().sum()

weight    0
source    0
target    0
dtype: int64

In [92]:
# checking different weight values
edges_grouped_keys.weight.sort_values().unique()

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  26,  27,
        28,  29,  30,  31,  32,  33,  34,  35,  36,  41,  42,  43,  44,
        46,  48,  49,  51,  54,  55,  62,  86, 113])

In [93]:
# checking for empty tags
(edges_grouped_keys.source == '').sum()

0

In [94]:
# checking for empty tags
(edges_grouped_keys.target == '').sum()

0

In [95]:
# checking for self loop edges
list(nx.selfloop_edges(g))

[]

In [96]:
# checking for swapped key tags

key_tags = edges_grouped_keys.source.unique().tolist()

mask_key_tags = edges_grouped_keys.target.isin(key_tags)

edges_grouped_keys[mask_key_tags]

Unnamed: 0,weight,source,target
187,2,bolsonaro,dilma
322,2,bolsonaro,haddad
381,7,bolsonaro,lula
430,21,bolsonaro,moro
704,7,ciro,bolsonaro
787,1,ciro,dilma
923,5,ciro,haddad
1051,4,ciro,lula
1110,4,ciro,moro
1382,36,dilma,bolsonaro


**Note**

The number of swapped key-tag pairs is low. They can be handled later in Gephi with merge operations.
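
Outside Gephi, one possible way to merge them is to put each pair in a canonical order and add the weights. This is a sketch under that assumption: summing is a choice made here, not something the notebook does. The three rows are taken from the grouped keys table.

```python
from collections import Counter

# (source, target, weight) rows; the first two are a swapped pair.
rows = [
    ("bolsonaro", "dilma", 2),
    ("dilma", "bolsonaro", 36),
    ("ciro", "haddad", 5),
]

merged = Counter()
for source, target, weight in rows:
    # Sorting the pair makes (a, b) and (b, a) share one key.
    merged[tuple(sorted((source, target)))] += weight

print(dict(merged))
# {('bolsonaro', 'dilma'): 38, ('ciro', 'haddad'): 5}
```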

## Plotting Keys Graph

In [97]:
import matplotlib.pyplot as plt

In [98]:
%%time

# set to True to run this long operation (left disabled here)
if False :
    
    nx.draw(g)

    plt.show()

CPU times: user 11 µs, sys: 1e+03 ns, total: 12 µs
Wall time: 20.7 µs


# Node Weights

Another important parameter for a better network layout in Gephi is the node weight.

The node weight is the number of times a tag appears in the collected posts.
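
In other words, a node's weight is a plain occurrence count. A minimal sketch with `collections.Counter` on made-up posts (an alternative to the explicit dictionary loop used in the next cell):

```python
from collections import Counter

# Hypothetical posts, already filtered; each inner list is one post's tags.
posts_tags = [
    ["lula", "brasil"],
    ["lula", "moro", "brasil"],
    ["lula"],
]

# Counter.update() adds one to a tag's count for every occurrence it sees.
node_weights = Counter()
for tags in posts_tags:
    node_weights.update(tags)

print(node_weights.most_common())
# [('lula', 3), ('brasil', 2), ('moro', 1)]
```

A tag repeated inside a single post is counted each time it appears, which matches the behaviour of the notebook's own loop.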

## Calculating Weights

In [99]:
# creating a dictionary of weights
node_weights = {}

# populating the dictionary
for person, posts in data_json.items() :
    
    for post in posts[:POSTS_MAX] :
        
        post_tags = post['tags']
        
        post_tags = [tag for tag in post_tags if validate_tag(tag)]
        
        for tag in post_tags :
            
            # count one more occurrence of this tag
            node_weights[tag] = node_weights.get(tag, 0) + 1

In [100]:
# checking the nodes before assigning weights
list(G.nodes(data=True))[:10]

[('abaladissimaa', {}),
 ('cultura', {}),
 ('juizsergiomoro', {}),
 ('astrofotografia', {}),
 ('lapulga', {}),
 ('ecuador', {}),
 ('liverpool', {}),
 ('avenidapaulista', {}),
 ('gafe', {}),
 ('jessika', {})]

In [101]:
# checking the nodes before assigning weights
list(g.nodes(data=True))[:10]

[('abaladissimaa', {}),
 ('cultura', {}),
 ('juizsergiomoro', {}),
 ('astrofotografia', {}),
 ('lapulga', {}),
 ('liverpool', {}),
 ('avenidapaulista', {}),
 ('gafe', {}),
 ('jessika', {}),
 ('summer', {})]

In [102]:
len(G.nodes)

2796

In [103]:
len(G_dropped.nodes)

441

In [104]:
len(g.nodes)

2799

## Assigning Weights

### All Edges Graph

In [105]:
nx.set_node_attributes(G, node_weights, 'weight')

In [106]:
list(G.nodes(data=True))[:10]

[('abaladissimaa', {'weight': 2}),
 ('cultura', {'weight': 1}),
 ('juizsergiomoro', {'weight': 11}),
 ('astrofotografia', {'weight': 5}),
 ('lapulga', {'weight': 1}),
 ('ecuador', {'weight': 1}),
 ('liverpool', {'weight': 5}),
 ('avenidapaulista', {'weight': 1}),
 ('gafe', {'weight': 1}),
 ('jessika', {'weight': 1})]

In [107]:
nx.write_graphml(G, "edges_counted_" + str(POSTS_MAX) + "_nw.graphml")

### Dropped All Edges Graph

In [108]:
nx.set_node_attributes(G_dropped, node_weights, 'weight')

In [109]:
list(G_dropped.nodes(data=True))[:10]

[('bolsonarosempre', {'weight': 8}),
 ('antoroccuzzo', {'weight': 12}),
 ('juizsergiomoro', {'weight': 11}),
 ('sbtonline', {'weight': 7}),
 ('family', {'weight': 12}),
 ('coaf', {'weight': 8}),
 ('direitaunida', {'weight': 6}),
 ('comedia', {'weight': 11}),
 ('elenao', {'weight': 149}),
 ('cartacapital', {'weight': 27})]

In [110]:
nx.write_graphml(G_dropped, "edges_counted_" + str(POSTS_MAX) + "_dropped_nw.graphml")

### Key Edges Graph

In [111]:
nx.set_node_attributes(g, node_weights, 'weight')

In [112]:
list(G.nodes(data=True))[:10]

[('abaladissimaa', {'weight': 2}),
 ('cultura', {'weight': 1}),
 ('juizsergiomoro', {'weight': 11}),
 ('astrofotografia', {'weight': 5}),
 ('lapulga', {'weight': 1}),
 ('ecuador', {'weight': 1}),
 ('liverpool', {'weight': 5}),
 ('avenidapaulista', {'weight': 1}),
 ('gafe', {'weight': 1}),
 ('jessika', {'weight': 1})]

In [113]:
nx.write_graphml(g, "edges_counted_keys_" + str(POSTS_MAX) + "_nw.graphml")

# Discarded