# Exercise 3: Community detection on the network of Computational Social Scientists.
## Week 6, Exercise 4

- Consider the network you built in Week 4.

In [14]:
import json
import networkx as nx
import numpy as np
import netwulf as nw
from netwulf import visualize
from tqdm import tqdm


f = open('ndata/graph.json')
data = jso.load(f)


In [4]:
G = nx.node_link_graph(data) 
print(f"The number of nodes before the GCC has been found: {len(list(G.nodes))}")
largest_cc = max(nx.connected_components(G), key=len)
# update graph to only include the larget connected component. 
G = G.subgraph(largest_cc)
print(f"The number of nodes after the GCC has been found: {len(list(G.nodes))}")

The number of nodes before the GCC has been found: 2162
The number of nodes after the GCC has been found: 1271


- Use the Python Louvain-algorithm implementation to find communities. How many communities do you find? What are their sizes? Report the value of modularity found by the algorithm. Is the modularity significantly different than 0?
> The modularity is 0.899, which is significantly different than 0. Modularity measures <> and is found within the range [-1/2, 1]. Thus, this partition is well done. 

In [31]:
import community


# Find all communities in the graph
partition = community.best_partition(G)  # This returns inconsistent results
num_communities = len(set(partition.values()))

# Number of communities
print(f"Number of communities found: {num_communities}")
_, counts = np.unique(list((partition.values())), return_counts=True)

# Community sizes
print("\nCommunity : Count")
kek = 4
for i in range(0, len(counts), kek):
    for com, c in zip(list(range(i, i+kek)), counts[i:i+kek]):
        print(f"{com} : {c} ", end="| ")
    print()


# Modularity
modularity = community.modularity(partition, G)
print(f"\nThe modularity of the graph is: {modularity}")

Number of communities found: 31


- If you are curious, you can also try the Infomap algorithm. Go to [this page]. (https://mapequation.github.io/infomap/python/). It's harder to install, but a better community detection algorithm. You can read about it in advanced topics 9B.

- Visualize the network, using netwulf (see Week 5). This time assign each node a different color based on their community. Describe the structure you observe.
> The structure looks fine as hell. XYZ PROPER TERMS HERE.

# INCLUDE SEXY SCREEN SHOT HERE

- Make sure you save the assignment of authors to communities.

In [6]:
nx.set_node_attributes(G, partition, name="group")  # group controls color

# Exercise 4: TF-IDF and the Computational Social Science communities.
The goal for this exercise is to find the words charachterizing each of the communities of Computational Social Scientists.

## 4.1) Check wikipedia for TF-IDF.
Explain in your own words the point of TF-IDF. What does TF and IDF stand for?
> Short for `term frequency–inverse document frequency`, it is a method applied in information retrieval (IR) that down weighs frequent terms. This is important, because word frequencies are relatively logaritmic cf. Zipf's law, and we want to avoid that the most frequent words dominate the analysis. 


## 4.2) Community abstracts
Now, we want to find out which words are important for each community, so we're going to create several **large documents, one for each community**. Each document includes all the tokens of abstracts written by members of a given community.

- Consider a community c
- Find all the abstracts of papers written by a (ALL) member(S) of community c.
- Create a long array that stores all the abstract tokens
- Repeat for all the communities.

> This is quite the task. For completeness the tokenized abstracts are genereated here with code from week 7.

In [7]:
# Tokenizer code written in week7

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import re

urls = '\S+www\S+\w'    # remove urls by searching for www
symbols = '[^\w\s]'     # remove punctuation
numbers = '\d+'         # remove numbers
stop_words = stopwords.words('english')

def tokenize(text):
    if text is None:
        return None
    text = text.lower()
    text = re.sub(fr'{symbols}|{urls}|{numbers}','',text)
    text = [word for word in text.split() if word not in stop_words]
    return text

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jason\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
import pickle  # HACK DELETE LATER

# abstracDataSet contains paperId and their abstracts
# paperDataSet contains paperId	title	year	externalId.DOI	citationCount	fields	authorIds	author_field

with open('data/paperAbstractDataSet.pkl', 'rb') as f:
    abstractDataSet = pickle.load(f)
abstractDataSet=abstractDataSet.drop_duplicates(subset=['papersId'])

with open('data/ccs_papers.pkl', 'rb') as f:
    paperDataSet = pickle.load(f)

paperDataSet.shape
# 0.5 min

(969493, 8)

> We need access to the papers written by the authors in the graph. We have a paperDataSet that has up to a million entrees. This is cut down by filtering out papers where none of the contributors exists in the graph. The `explode` command is used, because the `authors` column is a list of authors.

In [9]:
temp = dict(G.nodes)
valid = paperDataSet['authorIds'].apply(lambda x: any(elem in temp for elem in x))
papers = paperDataSet[valid]         # Filter out papers with authors not in graph
papers = papers.explode('authorIds') # Explode the authorIds column

> Now collect all unique paperIDs for each community using sets.

In [10]:
communityPaperIDs = [set() for _ in range(num_communities)]
for node in tqdm(G.nodes(data=True)):
    author = node[0]
    writtenPapers=papers[papers["authorIds"].isin([author])]["paperId"]
    community = partition[author]
    communityPaperIDs[community].update(writtenPapers)

> With the paperIDS, the corresponding abstracts are found in the abstracts dataset. 
The abstracts are then tokenized and stored in a list for each community.

In [24]:
communityTokens = [[] for _ in range(len(communityPaperIDs))]


for i, paperIDs in tqdm(enumerate(communityPaperIDs)):
    abstracts=abstractDataSet[abstractDataSet["papersId"].isin(paperIDs)]["papersAbstract"]

    abstracts = abstracts.dropna()    # Drop all rows with None values
    abstracts.apply(lambda x: communityTokens[i].extend(tokenize(x)))

# 1 minute

32it [01:09,  2.18s/it]


## 4.3) Calculate TF
Now, we're ready to calculate the TF for each word. Use the method of your choice to find the top 5 terms within the top 5 communities (by number of authors).

> First isolate the tokens of the top 5 communities. Then Cacluate TF for each token in each community.

In [45]:
# First find top 5 comminutes by size
top5communities = np.argsort(counts)[-5:][::-1]
top5tokens = [communityTokens[x] for x in top5communities]
print("Top 5 communities token count:")
for i, tokens in zip(top5communities, top5tokens):
    print(f"{i}) : {len(tokens)}")

Top 5 communities token count:
14) : 287162
15) : 626335
9) : 187775
21) : 541651
2) : 245486


# JASON: I GET DIFFERENT RESULTS WHEN DOING FOR ALL OPPOSED TO DOING FOR TOP 5. WHY?

In [76]:
# Calculate the tf for each community
from collections import Counter
TF = [Counter(tokens) for tokens in top5tokens]  # non-normalized term frequency
# Normliaze the tf
for i, tf in enumerate(TF):
    for key in tf:
        tf[key] /= len(top5tokens[i])

In [78]:
# Calculate the tf for ALL communities
from collections import Counter
TF = [Counter(tokens) for tokens in communityTokens]  # non-normalized term frequency
# Normliaze the tf
for i, tf in enumerate(TF):
    for key in tf:
        tf[key] /= len(communityTokens[i])

In [80]:
# Find the top 5 terms for each community
top5terms = [tf.most_common(5) for tf in TF]
top5terms = [list(zip(*terms))[0] for terms in top5terms] # Extract the terms

print("Top 5 terms for each community:")
for i, terms in enumerate(top5terms):
    print(f"{i}) : {terms}")

Top 5 terms for each community:
0) : ('data', 'social', 'information', 'music', 'research')
1) : ('social', 'study', 'results', 'research', 'model')
2) : ('data', 'social', 'network', 'information', 'model')
3) : ('social', 'media', 'political', 'data', 'news')
4) : ('agents', 'agent', 'systems', 'paper', 'system')
5) : ('data', 'model', 'systems', 'paper', 'social')
6) : ('data', 'social', 'network', 'information', 'networks')
7) : ('data', 'social', 'information', 'paper', 'results')
8) : ('data', 'users', 'social', 'information', 'network')
9) : ('social', 'data', 'information', 'time', 'using')
10) : ('urban', 'systems', 'model', 'data', 'cities')
11) : ('social', 'model', 'data', 'network', 'study')
12) : ('data', 'information', 'system', 'systems', 'paper')
13) : ('data', 'model', 'using', 'results', 'network')
14) : ('social', 'data', 'work', 'users', 'study')
15) : ('social', 'data', 'users', 'information', 'research')
16) : ('social', 'data', 'political', 'public', 'study')
17

In [81]:
# Find the top 5 terms for each community
top5terms = [tf.most_common(5) for tf in TF]
top5terms = [list(zip(*terms))[0] for terms in top5terms] # Extract the terms

print("Top 5 terms for each community:")
for i, terms in zip(top5communities, top5terms):
    print(f"{i}) : {terms}")

Top 5 terms for each community:
14) : ('data', 'social', 'information', 'music', 'research')
15) : ('social', 'study', 'results', 'research', 'model')
9) : ('data', 'social', 'network', 'information', 'model')
21) : ('social', 'media', 'political', 'data', 'news')
2) : ('agents', 'agent', 'systems', 'paper', 'system')


- Describe similarities and differences between the communities.
> Similarities include **social, data, and information**. 

> Differences include **network, time, show, and results**.

- Why aren't the TFs not necessarily a good description of the communities?
> TF alone does not consider the significance of a word, i.e. words that in general appear often will unfailry score high. 

- Next, we calculate IDF for every word.
> The IDF is calculated where it is assumed that each community has one document, such that N = number of communities. First determine the number of communities that contain a given word. Then calculate the IDF for each word.

In [83]:
# Compute IDF for every term
IDF = Counter()
N = num_communities

# Count the number of communities that contain a term
for i, tf in enumerate(TF):
    for key in tf:
        IDF[key] += 1

# Compute the IDF
for key in IDF:
    IDF[key] = np.log(N / IDF[key])  # natural log

# Compute TF-IDF
TFIDF = [tf.copy() for tf in TF]
for i, tfidf in enumerate(TFIDF):
    for key in tfidf:
        tfidf[key] *= IDF[key]

- What base logarithm did you use? Is that important?
> Natural logirithm. Not important, we are just interested in projecting onto the log scale.

## 4) TF-IDF
We're ready to calculate TF-IDF. Do that for the top 9 communities (by number of authors). Then for each community:

- List the 10 top TF words
- List the 10 top TF-IDF words
- List the top 3 authors (by degree)
- Are these 10 words more descriptive of the community? If yes, what is it about IDF that makes the words more informative?

In [82]:
# Find top 9 communities by size
top9communities = np.argsort(counts)[-9:][::-1]
top9tokens = [communityTokens[x] for x in top9communities]

# List top 10 terms for each community
top10terms = [tf.most_common(10) for tf in TF]
top10terms = [list(zip(*terms))[0] for terms in top10terms] # Extract the terms

print("Top 10 terms for each community:")
for i, terms in zip(top9communities, top10terms):
    print(f"{i}) : {terms}")

# Compute TF-IDF




Top 10 terms for each community:
14) : ('data', 'social', 'information', 'music', 'research', 'different', 'paper', 'users', 'study', 'results')
15) : ('social', 'study', 'results', 'research', 'model', 'crime', 'data', 'agents', 'using', 'also')
9) : ('data', 'social', 'network', 'information', 'model', 'paper', 'mobile', 'results', 'networks', 'using')
21) : ('social', 'media', 'political', 'data', 'news', 'information', 'study', 'research', 'public', 'networks')
2) : ('agents', 'agent', 'systems', 'paper', 'system', 'model', 'use', 'social', 'using', 'used')
8) : ('data', 'model', 'systems', 'paper', 'social', 'agents', 'models', 'coordination', 'different', 'system')
6) : ('data', 'social', 'network', 'information', 'networks', 'model', 'users', 'results', 'study', 'using')
7) : ('data', 'social', 'information', 'paper', 'results', 'model', 'using', 'different', 'problem', 'study')
17) : ('data', 'users', 'social', 'information', 'network', 'using', 'results', 'paper', 'use', 'syst

In [None]:

# GARBAGE CODE

import pandas as pd
degree = dict(G.degree())
nx.set_node_attributes(G, degree, name="degree") 
df=pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')
df = df.reset_index(names="authorID")
df = df[['authorID','group', 'degree']]  # Only interested in subset of columns

Unnamed: 0,authorID,group,degree
0,2101037,0,10
1,3001795,0,2
2,2080155085,0,2
3,33570565,0,4
4,66118125,0,3
...,...,...,...
1266,10852593,8,2
1267,1734917,7,1
1268,9486542,14,2
1269,144188281,27,1
