# Exercise 3: Community detection on the network of Computational Social Scientists.
## Week 6, Exercise 4

- Consider the network you built in Week 4.

In [90]:
import json
import networkx as nx
import numpy as np
import netwulf as nw
from netwulf import visualize
from tqdm import tqdm


# Use a context manager so the file handle is closed after loading
with open('data/graph.json') as f:
    data = json.load(f)


In [91]:
G = nx.node_link_graph(data) 
print(f"The number of nodes before the GCC has been found: {len(list(G.nodes))}")
largest_cc = max(nx.connected_components(G), key=len)
# Restrict the graph to its largest connected component.
G = G.subgraph(largest_cc)
print(f"The number of nodes after the GCC has been found: {len(list(G.nodes))}")

The number of nodes before the GCC has been found: 2162
The number of nodes after the GCC has been found: 1271


- Use the Python Louvain-algorithm implementation to find communities. How many communities do you find? What are their sizes? Report the value of modularity found by the algorithm. Is the modularity significantly different than 0?
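> The modularity is 0.899, well above 0. Modularity compares the fraction of edges that fall inside communities with the fraction expected in a random network with the same degree sequence. It lies in the range [-1/2, 1], and a value of 0 means the partition is no better than random. A value this close to 1 indicates strong community structure. Strictly, calling it *significant* would require comparing it against the modularity obtained on randomized versions of the network.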
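
To make the definition concrete, here is a minimal sketch that computes modularity by hand on a toy graph. The graph, the `modularity` helper and the community labels are all hypothetical illustrations and are not part of the notebook's pipeline, which relies on `community.modularity`.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Newman modularity of an undirected, unweighted graph.

    edges: list of (u, v) pairs; membership: dict node -> community id.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends inside community c
    degree_sum = defaultdict(int)  # summed degree of the nodes in community c
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    # Q = sum over communities of (internal edge fraction - expected fraction)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Toy graph (hypothetical): two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "a" for n in range(6)}

q_split = modularity(edges, two_groups)   # one community per triangle
q_single = modularity(edges, one_group)   # everything in one community
print(q_split, q_single)
```

Splitting the toy graph into its two triangles gives a clearly positive score, while putting every node into one community gives 0, which is the baseline the question refers to.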

In [92]:
import community


# Find all communities in the graph
partition = community.best_partition(G)  # Louvain is stochastic, so the partition varies between runs
num_communities = len(set(partition.values()))

# Number of communities
print(f"Number of communities found: {num_communities}")
_, counts = np.unique(list((partition.values())), return_counts=True)

# Community sizes
print("\nCommunity : Count")
per_row = 4  # number of communities printed per output line
for i in range(0, len(counts), per_row):
    for com, c in zip(range(i, i + per_row), counts[i:i + per_row]):
        print(f"{com} : {c} ", end="| ")
    print()


# Modularity
modularity = community.modularity(partition, G)
print(f"\nThe modularity of the graph is: {modularity}")

Number of communities found: 33

Community : Count
0 : 27 | 1 : 29 | 2 : 77 | 3 : 30 | 
4 : 29 | 5 : 52 | 6 : 46 | 7 : 65 | 
8 : 80 | 9 : 22 | 10 : 29 | 11 : 86 | 
12 : 38 | 13 : 120 | 14 : 2 | 15 : 53 | 
16 : 43 | 17 : 30 | 18 : 22 | 19 : 29 | 
20 : 76 | 21 : 18 | 22 : 17 | 23 : 5 | 
24 : 19 | 25 : 31 | 26 : 41 | 27 : 37 | 
28 : 34 | 29 : 18 | 30 : 20 | 31 : 29 | 
32 : 17 | 

The modularity of the graph is: 0.8985766049389848


- If you are curious, you can also try the Infomap algorithm. Go to [this page](https://mapequation.github.io/infomap/python/). It's harder to install, but a better community detection algorithm. You can read about it in advanced topics 9B.

- Visualize the network, using netwulf (see Week 5). This time assign each node a different color based on their community. Describe the structure you observe.
> TODO: describe the observed structure using proper network terminology.

TODO: include a screenshot of the netwulf visualization here.

- Make sure you save the assignment of authors to communities.

In [93]:
nx.set_node_attributes(G, partition, name="group")  # group controls color
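
Setting the node attribute keeps the assignment in memory only. A minimal sketch of also writing it to disk is shown below. The file name and the toy `partition_example` dict are assumptions made for illustration; the real mapping is the `partition` dict from the Louvain cell.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the `partition` dict returned by Louvain
# (author id -> community id); the real ids come from the graph.
partition_example = {"author_a": 0, "author_b": 0, "author_c": 1}

# Assumed output location; a temporary directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "author_communities.json")
with open(out_path, "w") as fh:
    # Cast to plain str/int: JSON keys must be strings, and NumPy ints are not serializable.
    json.dump({str(a): int(c) for a, c in partition_example.items()}, fh)

# Reload to confirm the assignment survives the round trip.
with open(out_path) as fh:
    restored = json.load(fh)
```

Since JSON turns all keys into strings, author IDs stored as integers would need to be cast back after loading.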

# Exercise 4: TF-IDF and the Computational Social Science communities.
The goal for this exercise is to find the words characterizing each of the communities of Computational Social Scientists.
> Student critique: Calculate TF-IDF for each word in each community, then find the top-10 lists across communities. As you'll see, there will be some redundancy, because we strictly follow your instructions. Suggestion: for better coding practice, have us write specific functions that we can call in the main code to answer your questions.

## 4.1) Check wikipedia for TF-IDF.
Explain in your own words the point of TF-IDF. What does TF and IDF stand for?
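> Short for `term frequency–inverse document frequency`, it is a weighting scheme from information retrieval (IR). TF (term frequency) measures how often a term occurs in a document, and IDF (inverse document frequency) down-weights terms that occur in many documents. This is important because word frequencies are heavily skewed: by Zipf's law, a word's frequency is roughly inversely proportional to its rank, so without the IDF factor a handful of very common words would dominate the analysis.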
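
For reference, one common variant of the weighting can be written as follows. This is the length-normalized TF and plain logarithmic IDF that the code cells in this notebook compute; other variants (smoothed IDF, log-scaled TF) exist. Here $f_{t,d}$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $n_t$ is the number of documents that contain $t$.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

A term that occurs in every document has $n_t = N$, so its IDF, and therefore its TF-IDF, is 0.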


## 4.2) Community abstracts
Now, we want to find out which words are important for each community, so we're going to create several **large documents, one for each community**. Each document includes all the tokens of abstracts written by members of a given community.

- Consider a community c
- Find all the abstracts of papers written by the members of community c.
- Create a long array that stores all the abstract tokens
- Repeat for all the communities.

> This is quite the task. For completeness, the tokenized abstracts are generated here with code from Week 7.

In [94]:
# Tokenizer code written in Week 7

import re

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Raw strings avoid invalid-escape warnings for \S, \w and \d
urls = r'\S+www\S+\w'    # remove urls by searching for www
symbols = r'[^\w\s]'     # remove punctuation
numbers = r'\d+'         # remove numbers
stop_words = set(stopwords.words('english'))  # set for fast membership tests
ps = PorterStemmer()     # stemming

def tokenize(text):
    """Lowercase, strip urls/punctuation/numbers, drop stop words and stem."""
    if text is None:
        return []  # empty list, so callers can always extend() with the result
    text = text.lower()
    text = re.sub(fr'{symbols}|{urls}|{numbers}', '', text)
    return [ps.stem(word) for word in text.split() if word not in stop_words]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jason\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [95]:
import pickle

# abstractDataSet contains paper IDs and their abstracts
# paperDataSet columns: paperId, title, year, externalId.DOI, citationCount, fields, authorIds, author_field

with open('data/paperAbstractDataSet.pkl', 'rb') as f:
    abstractDataSet = pickle.load(f)
abstractDataSet = abstractDataSet.drop_duplicates(subset=['papersId'])

with open('data/ccs_papers.pkl', 'rb') as f:
    paperDataSet = pickle.load(f)

paperDataSet.shape
# Loading takes about 0.5 min

(969493, 8)

> We need access to the papers written by the authors in the graph. `paperDataSet` has close to a million entries (969,493). It is cut down by filtering out papers where none of the contributors exist in the graph. The `explode` command is then used because the `authorIds` column holds a list of authors per paper, and we want one row per (paper, author) pair.

In [96]:
graph_authors = set(G.nodes)  # set for fast membership tests
valid = paperDataSet['authorIds'].apply(lambda x: any(elem in graph_authors for elem in x))
papers = paperDataSet[valid]         # Keep papers with at least one author in the graph
papers = papers.explode('authorIds') # One row per (paper, author) pair

> Now collect all unique paperIDs for each community using sets.

In [97]:
communityPaperIDs = [set() for _ in range(num_communities)]
for author in tqdm(G.nodes):
    writtenPapers = papers[papers["authorIds"] == author]["paperId"]
    com = partition[author]  # not named `community`, which would shadow the imported module
    communityPaperIDs[com].update(writtenPapers)

100%|██████████| 1271/1271 [00:16<00:00, 77.73it/s]
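
The loop above filters the exploded table once per author. An alternative is a single pass over the (paper, author) pairs, sketched here in plain Python. The toy records and the names `papers_example`, `partition_example` and `community_papers` are hypothetical, and this is not the method used to produce the results in this notebook.

```python
from collections import defaultdict

# Hypothetical toy records: (paper id, list of author ids).
papers_example = [
    ("p1", ["a1", "a2"]),
    ("p2", ["a2", "a3"]),
    ("p3", ["a9"]),  # no author in the graph, so it is skipped
]
partition_example = {"a1": 0, "a2": 0, "a3": 1}  # author id -> community id

# community id -> set of paper ids written by at least one member
community_papers = defaultdict(set)
for paper_id, author_ids in papers_example:
    for author in author_ids:
        if author in partition_example:
            community_papers[partition_example[author]].add(paper_id)
```

Using sets means a paper co-authored by several members of the same community is counted once, while a paper whose authors span two communities appears in both.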


> With the paperIDS, the corresponding abstracts are found in the abstracts dataset. 
The abstracts are then tokenized and stored in a list for each community.

In [98]:
corpus = [[] for _ in range(len(communityPaperIDs))]


for i, paperIDs in tqdm(enumerate(communityPaperIDs)):
    abstracts=abstractDataSet[abstractDataSet["papersId"].isin(paperIDs)]["papersAbstract"]

    abstracts = abstracts.dropna()    # Drop all rows with None values
    for abstract in abstracts:        # Tokenize and append to the community's document
        corpus[i].extend(tokenize(abstract))

# 7 minutes

33it [06:54, 12.57s/it]


## 4.3) Calculate TF
Now, we're ready to calculate the TF for each word. Use the method of your choice to find the top 5 terms within the top 5 communities (by number of authors).

> First isolate the tokens of the top 5 communities. Then calculate TF for each token in each community.

In [126]:
# First find the top 5 communities by size
top5communities = np.argsort(counts)[-5:][::-1]
print("Top 5 communities:", top5communities)
corpus5 = [corpus[x] for x in top5communities]


Top 5 communities: [13 11  8  2 20]


In [117]:
# Calculate the tf for each community
from collections import Counter
TF5 = [Counter(tokens) for tokens in corpus5]  # non-normalized term frequency

# Normalize the tf by the document length
for i, tf in enumerate(TF5):
    for key in tf:
        tf[key] /= len(corpus5[i])

In [118]:
# Find the top 5 terms for each community
top5terms = [tf.most_common(5) for tf in TF5]
top5terms = [list(zip(*terms))[0] for terms in top5terms] # Extract the terms

print("Top 5 terms for each community:")
for i, terms in zip(top5communities, top5terms):
    print(f"{i}) : {terms}")

Top 5 terms for each community:
13) : ('use', 'user', 'social', 'studi', 'data')
11) : ('use', 'user', 'social', 'data', 'design')
8) : ('use', 'social', 'data', 'inform', 'user')
2) : ('use', 'network', 'data', 'social', 'model')
20) : ('model', 'use', 'algorithm', 'data', 'inform')


- Describe similarities and differences between the communities.
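> Similarities: all five communities share **use** and **data**; **social** appears in four of the five (all but community 20), and **user** in three (13, 11 and 8).

> Differences: **network** appears only in community 2, **algorithm** only in 20, **design** only in 11 and **studi** only in 13, while **model** appears in 2 and 20.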

- Why aren't the TFs necessarily a good description of the communities?
> TF alone does not consider how distinctive a word is, i.e. words that appear often in general (such as `use` and `data`) unfairly score high in every community.

- Next, we calculate IDF for every word.
> Each community is treated as a single document, so N is the number of communities. First count the number of communities that contain a given word (n), then compute IDF = log(N / n) for each word.

In [121]:
# Compute IDF for every term
# IDF = log(N / n), 
# where N is the number of communities and n is the number of communities that contain the term

IDF5 = Counter()
N = len(corpus5)

# For each term, count the number of communities that contain it
for c in corpus5:
    for term in set(c):
        IDF5[term] += 1

# Compute the IDF
IDF5 = {key: np.log(N / value) for key, value in IDF5.items()}

- What base logarithm did you use? Is that important?
> Natural logarithm. The base is not important: changing it multiplies every IDF value by the same constant, so the ranking of terms is unchanged.
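
The claim about the base can be checked with a small sketch. The document frequencies below are made up for illustration and do not come from the corpus.

```python
import math

# Hypothetical document frequencies: term -> number of communities containing it.
doc_freq = {"use": 9, "network": 6, "polit": 2, "zika": 1}
N = 9  # number of community documents

idf_ln = {t: math.log(N / n) for t, n in doc_freq.items()}     # natural log
idf_log2 = {t: math.log2(N / n) for t, n in doc_freq.items()}  # base 2

# The two versions differ only by the constant factor 1 / ln(2) ...
ratios = [idf_log2[t] / idf_ln[t] for t in doc_freq if idf_ln[t] > 0]

# ... so sorting by either one gives the same order of terms.
rank_ln = sorted(doc_freq, key=idf_ln.get, reverse=True)
rank_log2 = sorted(doc_freq, key=idf_log2.get, reverse=True)
```

The absolute TF-IDF scores do change with the base, so scores computed with different bases should not be compared directly.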

## 4.4) TF-IDF
We're ready to calculate TF-IDF. Do that for the top 9 communities (by number of authors). Then for each community:

- List the 10 top TF words

In [129]:
# First find top 9 communities by size
N = 9
top9communities = np.argsort(counts)[-N:][::-1]
print(f"Top {N} communities:", top9communities)
corpus9 = [corpus[x] for x in top9communities]

# Calculate the tf for each community
from collections import Counter
TF9 = [Counter(tokens) for tokens in corpus9]  # non-normalized term frequency

# Normalize the tf by the document length
for i, tf in enumerate(TF9):
    for key in tf:
        tf[key] /= len(corpus9[i])

top10terms9 = [tf.most_common(10) for tf in TF9]
top10terms9 = [list(zip(*terms))[0] for terms in top10terms9] # Extract the terms

print("Top 10 terms for each community:")
for i, terms in zip(top9communities, top10terms9):
    print(f"{i}) : {terms}")

Top 9 communities: [13 11  8  2 20  7 15  5  6]
Top 10 terms for each community:
13) : ('use', 'user', 'social', 'studi', 'data', 'research', 'work', 'system', 'inform', 'model')
11) : ('use', 'user', 'social', 'data', 'design', 'system', 'research', 'inform', 'studi', 'commun')
8) : ('use', 'social', 'data', 'inform', 'user', 'algorithm', 'studi', 'time', 'model', 'result')
2) : ('use', 'network', 'data', 'social', 'model', 'system', 'mobil', 'inform', 'studi', 'result')
20) : ('model', 'use', 'algorithm', 'data', 'inform', 'result', 'problem', 'show', 'effect', 'learn')
7) : ('use', 'user', 'network', 'data', 'system', 'social', 'model', 'inform', 'result', 'show')
15) : ('use', 'network', 'social', 'data', 'model', 'studi', 'inform', 'research', 'result', 'polit')
5) : ('network', 'use', 'data', 'model', 'social', 'user', 'inform', 'studi', 'result', 'commun')
6) : ('use', 'data', 'model', 'propos', 'social', 'method', 'user', 'network', 'result', 'studi')


- List the 10 top TF-IDF words


In [133]:
IDF9 = Counter()
N = len(corpus9)

# For each term, count the number of communities that contain it
for c in corpus9:
    for term in set(c):
        IDF9[term] += 1

# Compute the IDF
IDF9 = {key: np.log(N / value) for key, value in IDF9.items()}

# Compute TF-IDF for each community
TFIDF9 = [Counter() for _ in range(N)]
for i, tf in enumerate(TF9):
    for term in tf:
        TFIDF9[i][term] = tf[term] * IDF9[term]

# Extract the top 10 terms for each community
top10terms9 = [tf.most_common(10) for tf in TFIDF9]
top10terms9 = [list(zip(*terms))[0] for terms in top10terms9] # Extract the terms

print("Top 10 terms for each community:")
for i, terms in zip(top9communities, top10terms9):
    print(f"{i}) : {terms}")



Top 10 terms for each community:
13) : ('dram', 'codemix', 'ictd', 'sci', 'phish', 'streamit', 'transliter', 'mooc', 'hindi', 'bangalor')
11) : ('earthworm', 'microfilm', 'hci', 'psl', 'searcher', 'ubuntu', 'vape', 'odk', 'hcai', 'spreadsheet')
8) : ('corros', 'maritim', 'anod', 'gull', 'calv', 'childless', 'zika', 'qatar', 'childbear', 'eubalaena')
2) : ('mbb', 'aria', 'latrin', 'tota', 'deli', 'saper', 'pfpr', 'roam', 'deaggreg', 'inod')
20) : ('dpp', 'bidder', 'ç', 'estimand', 'multirobot', 'wager', 'ewa', 'pprl', 'actr', 'timber')
7) : ('bitext', 'superp', 'diacrit', 'rumour', 'multicast', 'uma', 'tl', 'anycast', 'crosslanguag', 'gaminganywher')
15) : ('mdd', 'ora', 'lithium', 'gasolin', 'antidepress', 'bipolar', 'bd', 'roosevelt', 'olanzapin', 'carley')
5) : ('smallsid', 'eip', 'dasymetr', 'knot', 'ghsl', 'nonconserv', 'hashcod', 'tvg', 'ssg', 'antisci')
6) : ('reid', 'spancor', 'qatar', 'influenzanet', 'giorno', 'dtd', 'adblock', 'antiadblock', 'samoa', 'queryflow')
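
The TF, IDF and TF-IDF steps are repeated for the top 5 and top 9 communities. Following the student suggestion at the start of this exercise, they could be wrapped in one function. The sketch below is one possible shape for such a helper; the name `top_tfidf_terms` and the toy corpus are hypothetical, and the function was not used to produce the output above.

```python
import math
from collections import Counter

def top_tfidf_terms(docs, k=10):
    """Return the k highest TF-IDF terms for each tokenized document.

    docs: list of token lists (one per community). Uses length-normalized TF
    and IDF = ln(N / n_t), the same variant as the cells above.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n_docs / doc_freq[t])
                  for t, c in counts.items()}
        # Highest score first; ties are broken alphabetically for stable output
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result

# Hypothetical toy corpus of three "community documents".
toy_docs = [
    ["use", "data", "network", "network"],
    ["use", "data", "polit", "polit"],
    ["use", "data", "model"],
]
top_terms = top_tfidf_terms(toy_docs, k=1)
```

In the toy corpus, `use` and `data` occur in every document and score 0, so the single top term of each document is the one word unique to it.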


- List the top 3 authors (by degree)

In [170]:
# Determine degree of all nodes and store it as a node attribute
degree = dict(G.degree())
nx.set_node_attributes(G, degree, name="degree")

# Extract top 3 authors for each community by degree
top3authors = []
for i in range(num_communities):
    authors = [node for node in G.nodes(data=True) if node[1]['group'] == i]
    top3authors.append(sorted(authors, key=lambda x: x[1]['degree'], reverse=True)[:3])
    # Extract their names
    top3authors[i] = [x[1]['name'] for x in top3authors[i]]

print("Top 3 authors for each community:")
for i in top9communities:
    authors = top3authors[i]
    print(f"{i}) : {authors}")

Top 3 authors for each community:
13) : ['Joyojeet Pal', 'Priyanka Chandra', 'Vaishnav Kameswaran']
11) : ['Munmun De Choudhury', 'Sarita Yardi Schoenebeck', 'Neha Kumari Pawan Kumar']
8) : ['Ingmar G. Weber', 'Masoomali Fatehkia', 'Ridhi Kashyap']
2) : ['Alexander Sandy Pentland', 'Iyad Rahwan', 'Johannes Bjelland']
20) : ['Duncan J. Watts', 'Markus M. Mobius', 'Sharad Chandra Goel']
7) : ['Haewoon Kwak', 'Daniele Quercia', 'Krishna P. Gummadi']
15) : ['David M. J. Lazer', 'Jon Green', 'Katherine Ognyanova']
5) : ['Michael D. Conover', "M'arton Karsai", 'Filippo Menczer']
6) : ['Yelena A Mejova', 'Kyriaki Kalimeri', 'Daniela Paolotti']


- Are these 10 words more descriptive of the community? If yes, what is it about IDF that makes the words more informative?
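> Yes. Compared with the TF lists two cells above, where the communities overlapped heavily, the TF-IDF words are almost entirely exclusive to each community. IDF is log(N / n): it is 0 for words that occur in every community, so generic terms such as `use` and `data` drop out, and it is largest for words that occur in only one community. The flip side is that rare, niche tokens are rewarded, so some of the top words are obscure rather than broadly descriptive.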