# This notebook

In this notebook, we explore the co-occurrence between keywords and all other terms that appear in the keywords' opinion contexts across the three lockdown-defined time windows, to detect patterns of change in co-occurrence.

Temporal windows are defined according to key events in the timeline of the COVID-19 pandemic in the UK:

- up to 23 March 2020 (excluded): pre-lockdown
- 23 March to 10 May 2020: strict lockdown
- 11 May 2020 onwards: post-strict lockdown (lockdown eases)



We will:

- [co-occurrence] For each temporal window, calculate the co-occurrence of keyword-word pairs as the Dice coefficient.
- Identify changes in the co-occurrence of keywords and emerging words across the temporal windows.
- Create networks of keyword co-occurrences for each of the three temporal windows and compare network and node characteristics across the three.


## Settings

In [None]:
import os

In [None]:
import random

In [None]:
import numpy as np

In [None]:
from ast import literal_eval

In [None]:
from math import log2

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
import matplotlib.colors as mcol
import matplotlib.cm as cm

In [None]:
from gensim.models.phrases import Phraser
from gensim.models import Phrases

In [None]:
from nltk import pos_tag
from nltk.util import ngrams
from nltk.tokenize import word_tokenize, sent_tokenize

from src.utils import chain_functions
from src.preproc_text import tokenise_sent, tokenise_word, remove_punctuation, remove_stopwords, flatten_irregular_listoflists

In [None]:
import networkx as nx
from operator import itemgetter

In [None]:
%matplotlib inline

In [None]:
from src.news_media.get_keywords_trend import *

In [None]:
pd.set_option('display.max_colwidth', None)

In [None]:
pd.set_option('display.max_rows', None)

The config file

In [None]:
CONFIG.keys()

In [None]:
DIR_DATA = os.environ.get("DIR_DATA_INTERIM")

In [None]:
DIR_DATA_EXTRA = os.environ.get("DIR_DATA_EXTRA")

In [None]:
# prominence
term_freqs_nm = "kword_rawfreq.csv"
doc_freqs_nm = "kword_yn_occurrence.csv"

In [None]:
# keywords to be excluded because of low frequency in the corpus
EXCLUDE_KWORDS = ['behav_insight', 'behavioural_economist', 'behav_analysis', 
                  'chater', 'american_behav_scientists', 'irrational_econ', 'nudge_choice']

In [None]:
# we also keep some important words even though they are not keywords
KEY_WORDS = ['herd_immunity', 'behavioural_fatigue']

## Import data

In [None]:
uk_news = NewsArticles()

In [None]:
dir(uk_news)

In [None]:
opinions_data = uk_news.data_raw.drop('full_text', axis=1).copy()

In [None]:
opinions_data.shape

In [None]:
opinions_data.opinion_context_id.nunique()

##### Exclude opinion contexts that only contain non-keywords

In [None]:
opinions_data = opinions_data[~opinions_data['kword'].isin(KEY_WORDS)].copy()

In [None]:
opinions_data.shape

In [None]:
opinions_data.opinion_context_id.nunique()

In [None]:
#DO NOT RUN opinions_data = opinions_data[~opinions_data['kword'].isin(EXCLUDE_KWORDS)].copy()

#### Only keep each unique opinion context once (at this point each context is repeated once per sub-keyword it contains)

In [None]:
opinions_data.columns

In [None]:
opinions_data_uniq = opinions_data[['article_id', 'title', 'opinion_context',
                                    'pub_date_dt']].drop_duplicates()

In [None]:
opinions_data_uniq.shape

In [None]:
#opinions_data_uniq.sort_values("opinion_context")

#### Get nouns and adjectives only from opinion context

In [None]:
def remove_special_symbols(text: str) -> str:
    symbols = [
        "©", "\xad", "•", "…", "●", "“", "”", "•'", "\u200b", "£", "'", "'s",
        "·", "»", "com/"
    ]
    for symb in symbols:
        if symb not in text:
            continue
        text = text.replace(symb, "")
    return text


In [None]:
KWORDS = CONFIG['Actors'] + CONFIG['BehavSci'] + CONFIG['Behav_ins'] + CONFIG[
    'Behav_chan'] + CONFIG['Behav_pol'] + CONFIG['Behav_anal'] + CONFIG[
        'Psych'] + CONFIG['Econ_behav'] + CONFIG['Econ_irrational'] + CONFIG[
            'Nudge'] + CONFIG['Nudge_choice'] + CONFIG['Nudge_pater']
OTHER_IMPORTANT_WORDS = CONFIG['Covid'] + CONFIG['Fatigue'] + CONFIG['Immunity']


In [None]:
# these will be removed as we know they will co-occur with the related surnames without adding anything to the analysis
RELEVANT_FIRST_NAMES = ['susan', 'david', 'stephen', 'boris', 
                        'nick', 'daniel', 'robert', 'richard']

In [None]:
# We only keep NOUNS, ADJECTIVES and keywords
# Assumptions: anything else is not adding to the content of the articles
TEXT_PREPROC_PIPE = chain_functions(
    lambda x: x.lower()
    ,remove_special_symbols
    ,lambda x: re.sub(r'[.]+(?![0-9])', r' ', x)
    ,tokenise_sent
    ,tokenise_word
    ,lambda x: [[
        word for (word, pos) in pos_tag(sent) if 
        (pos.startswith("N")) 
        or (pos.startswith("J")) 
        or (word in KWORDS) or (word in OTHER_IMPORTANT_WORDS)
    ] for sent in x]
    ,lambda x: [[word for word in sent if word not in RELEVANT_FIRST_NAMES] for sent in x]
    ,remove_punctuation
    ,remove_stopwords
    ,flatten_irregular_listoflists
    ,list
    ,lambda x: ' '.join(x)
)

In [None]:
# convert opinion_context variable from str to tuple, only extract text
opinions_data_uniq["opinion_context_text"] = [literal_eval(opinion)[1] for opinion in 
                                              opinions_data_uniq.opinion_context]

In [None]:
# only keep nouns and adjectives, plus additional data cleaning
opinions_data_uniq["opinion_context_nj"] = [TEXT_PREPROC_PIPE(opinion) for opinion in 
                                            opinions_data_uniq.opinion_context_text]

In [None]:
opinions_data_uniq[:3]

In [None]:
opinions_data_uniq.shape

### Replace common phrases made of separate words with their combination
e.g., "university college london" > "university_college_london" 

### Create the bigram and trigram models based on collocation statistics

Detect phrases, based on collected collocation counts. Adjacent words that appear together more frequently than expected are joined together with the `_` character.

We use the articles' full texts to learn collocations, as the model requires a large enough sample to infer statistically reliable collocation patterns.

##### Prepare the articles' texts

In [None]:
# we remove the keywords from the collocation as otherwise, they may be replaced with a combination of 
# keyword + other common word the keyword tends to appear with (e.g., "prof_michie")

ARTICLES_PREPROC_PIPE = chain_functions(
    lambda x: x.lower()
    ,tokenise_sent
    ,tokenise_word
    ,lambda x: [[
        word for word in sent if ((word not in KWORDS) and (word not in OTHER_IMPORTANT_WORDS)
                                 and (word not in RELEVANT_FIRST_NAMES) )
    ] for sent in x]
    ,flatten_irregular_listoflists
    ,list
)

In [None]:
# As corpus to learn the common phrases, we use the text from all the articles
corpus = [ARTICLES_PREPROC_PIPE(article) for article in uk_news.data_raw['full_text']]

In [None]:
corpus[3]

In [None]:
opinions_data_uniq["opinion_context_nj"][10]

In [None]:
dir(Phrases)
#dir(Phraser)

We use gensim's `Phrases` model, which detects phrases from collocation statistics.

Automatically detect common phrases (multiword expressions) from a stream of texts as token strings.

The phrases are collocations (frequently co-occurring tokens).
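As a rough illustration of the default collocation score that gensim's `Phrases` applies, the sketch below computes `(cnt(a, b) - min_count) * N / (cnt(a) * cnt(b))` by hand. The function name `phrase_score` and all counts are hypothetical and chosen only to show one pair passing the threshold and one failing it; this is not gensim's own code.

```python
def phrase_score(count_a: int, count_b: int, count_ab: int,
                 vocab_size: int, min_count: int) -> float:
    """Score for joining tokens a and b: (cnt(a, b) - min_count) * N / (cnt(a) * cnt(b))."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


# Hypothetical counts: the first pair is frequent relative to its parts,
# the second pair is made of two very common words that rarely need joining.
strong = phrase_score(count_a=60, count_b=80, count_ab=50, vocab_size=5000, min_count=10)
weak = phrase_score(count_a=4000, count_b=900, count_ab=300, vocab_size=5000, min_count=10)

threshold = 10.0
print(strong > threshold, weak > threshold)
```

Only the first pair would be joined with `_` under these made-up counts, which is why a higher `threshold` yields fewer phrases.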

In [None]:
# learn bigrams from the corpus of full article texts
phrases = Phrases(corpus,
                min_count=10, # ignore all words and bigrams with total collected count lower than this
                threshold=10.0 # score threshold for forming phrases (higher means fewer phrases).
                  # A phrase of words a and b is accepted if
                  # (cnt(a, b) - min_count) * N / (cnt(a) * cnt(b)) > threshold,
                  # where N is the total vocabulary size. The default value is 10.0.
                )

In [None]:
bigram = Phraser(phrases)

In [None]:
trigram = Phrases(bigram[corpus], 
                  min_count=10,
                 threshold=1)

Checks

In [None]:
bigram[word_tokenize(opinions_data_uniq["opinion_context_nj"][10])]

In [None]:
trigram[word_tokenize(opinions_data_uniq["opinion_context_nj"][10])]

In [None]:
opinions_data_uniq["opinion_context_nj"][39]

In [None]:
bigram[word_tokenize(opinions_data_uniq["opinion_context_nj"][39])]

In [None]:
trigram[word_tokenize(opinions_data_uniq["opinion_context_nj"][39])]

In [None]:
trigram[word_tokenize("she teaches at university college london")]

In [None]:
trigram[word_tokenize("prof michie teaches at university college london")]

In [None]:
trigram[word_tokenize("prof michie teaches at university college london and public health england")]

In [None]:
trigram[word_tokenize("prof susan michie teaches at university college london and public health england")]

#### Assign a new unique id for each opinion context

In [None]:
opinions_data_uniq["context_id"] = range(1, opinions_data_uniq.shape[0]+1)

In [None]:
opinions_data_uniq.tail()

In [None]:
opinions_data_uniq.shape

#### Replace sub-keywords with higher level keyword in opinion context

In [None]:
opinions_data_uniq["opinion_context_nj_tok"] = [word_tokenize(opinion) for opinion in 
                                                opinions_data_uniq.opinion_context_nj]

In [None]:
# function and sub-keyword to keyword dictionary (loaded as part of src.news_media.get_keywords_trend)
SUBKEYWORDS_TO_KEYWORDS_DICT = expand_dict(SUBKEYWORDS_TO_KEYWORDS_MAP)

In [None]:
opinions_data_uniq.tail()

In [None]:
opinions_data_uniq["opinion_context_nj_tok_kw"] = [[SUBKEYWORDS_TO_KEYWORDS_DICT.get(word, word) for word in 
                                                    opinion] for opinion in opinions_data_uniq.opinion_context_nj_tok]
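The replacement step relies on `dict.get(word, word)`, which returns the mapped keyword when one exists and the token itself otherwise. A minimal sketch, assuming a flat sub-keyword to keyword dictionary (the entries below are hypothetical and are not the project's actual mapping):

```python
# Hypothetical mapping: each sub-keyword points to its higher-level keyword.
subkeyword_to_keyword = {
    "behavioural_science": "behav_science",
    "behavioural_sciences": "behav_science",
    "nudging": "nudge",
}

tokens = ["behavioural_sciences", "advice", "nudging", "public"]

# dict.get(tok, tok) leaves any token without a mapping unchanged
replaced = [subkeyword_to_keyword.get(tok, tok) for tok in tokens]
print(replaced)
```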

In [None]:
opinions_data_uniq.tail()

### Replace common two-word and three-word phrases with the corresponding bi- and tri-grams

Based on the collocation statistics learnt from the whole corpus of articles above.

In [None]:
opinions_data_uniq["opinion_context_nj_tok_kw_tri"] = [trigram[opinion_tok] for opinion_tok in 
                                                       opinions_data_uniq.opinion_context_nj_tok_kw]

In [None]:
opinions_data_uniq.tail()

In [None]:
opinions_data_uniq.shape

## Compute word document (opinion context) frequency

In [None]:
# convert each list of tokens back into a single space-separated string
opinions_data_uniq['opinion_context_nj_tok_kw_tri_str'] = opinions_data_uniq["opinion_context_nj_tok_kw_tri"].apply(
    lambda text: ' '.join(text))

In [None]:
# check
#opinions_data_uniq[["opinion_context_nj_tok_kw_tri", "opinion_nj_kw_str"]][:-10]

In [None]:
[t for t in opinions_data_uniq['opinion_context_nj_tok_kw_tri_str'] if "susan" in t]

In [None]:
def compute_words_raw_tf(df: pd.DataFrame,
                            text_col='') -> pd.DataFrame:
        """
        Computes the document-term frequency matrix for all the unigrams in the preprocessed texts.

        Args:
            df: pandas.Dataframe. Must contain columns "pub_date_dt", "article_id", "context_id"
            text_col: name of column containing the text for which to compute TD matrix. The dtype must be string.

        Returns:
            The document-term frequency matrix for all the unigrams in the preprocessed corpus of articles.
        """
        
        required_cols = ["pub_date_dt", "article_id", "context_id"]
        if not all(item in df.columns for item in required_cols):
            raise KeyError(f"df must contain the columns {required_cols}")
            
        vec = CountVectorizer(stop_words=None,
                              tokenizer=word_tokenize,
                              ngram_range=(1, 1),
                              token_pattern=r"(?u)\b\w\w+\b")
        
        results_mat = vec.fit_transform(df[text_col])
        
        # sparse to dense matrix
        results_mat = results_mat.toarray()

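        # get the feature names from the already-fitted vectorizer
        # (get_feature_names was removed in scikit-learn 1.2 in favour of get_feature_names_out)
        if hasattr(vec, "get_feature_names_out"):
            vec_feature_names = vec.get_feature_names_out()
        else:
            vec_feature_names = vec.get_feature_names()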

        # make a table with word frequencies as values and vocab as columns
        out_df = pd.DataFrame(data=results_mat, columns=vec_feature_names)
        
        # add opinion context id and pub date as indexes
        # we use the property of CountVectorizer to keep the order of the original texts
        out_df["pub_date_dt"] = df["pub_date_dt"].tolist()
        out_df["article_id"] = df["article_id"].tolist()
        out_df["context_id"] = df["context_id"].tolist()
        out_df[text_col] = df[text_col].tolist()

        out_df.set_index(['pub_date_dt', 'article_id', 'context_id', text_col ],
                         append=True,
                         inplace=True)


        
        return out_df

In [None]:
#opinions_data_uniq[['article_id', 'context_id','pub_date_dt','opinion_context_nj_tok_kw']]


In [None]:
opinions_termfreqs = compute_words_raw_tf(df=opinions_data_uniq, text_col="opinion_context_nj_tok_kw_tri_str")

#### Check keywords frequencies

In [None]:
# checks
"susan" in opinions_termfreqs.columns

In [None]:
opinions_termfreqs[['american_behav_scientists', 'halpern', 'michie', 'chater', 'spi-b', 
                    'behav_insights_team', 'behav_science', 'behav_insight', 'behav_change', 
                    'behav_scientist', 'behav_analysis', 'psychology', 'psychologist', 
                    'behav_econ', 'behavioural_economist','nudge', 'herd_immunity', 'behavioural_fatigue']].sum().sort_values()

In [None]:
#pd.read_csv(os.path.join(DIR_DATA, term_freqs_nm)).iloc[:, 4:].sum().sort_values()

### Calculate document frequency for keywords and words

This takes the value 1 if the (key)word appears in the opinion context, regardless of how many times, and 0 otherwise.

In [None]:
opinions_termfreqs.columns

In [None]:
opinions_term_yn_occurrence = opinions_termfreqs.applymap(lambda cell: 1 if cell > 0 else 0)
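To make the binarisation concrete, here is a small plain-Python sketch that turns raw counts into yes/no occurrences and sums them into document frequencies. The counts are invented for illustration only.

```python
# Hypothetical raw term counts: one dict per opinion context.
raw_counts = [
    {"nudge": 3, "psychology": 0, "lockdown": 1},
    {"nudge": 0, "psychology": 2, "lockdown": 4},
    {"nudge": 1, "psychology": 0, "lockdown": 0},
]

# 1 if the term appears at least once in the context, 0 otherwise
occurrence = [{term: int(count > 0) for term, count in row.items()} for row in raw_counts]

# document frequency = number of contexts in which each term appears
doc_freq = {term: sum(row[term] for row in occurrence) for term in occurrence[0]}
print(doc_freq)
```

Note that "lockdown" has a raw count of 5 but a document frequency of 2, since repeated mentions within one context are counted once.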

In [None]:
opinions_term_yn_occurrence[['american_behav_scientists', 'halpern', 'michie', 'chater', 'spi-b', 
                    'behav_insights_team', 'behav_science', 'behav_insight', 'behav_change', 
                    'behav_scientist', 'behav_analysis', 'psychology', 'psychologist', 
                    'behav_econ', 'behavioural_economist','nudge']].sum().sort_values()

## Group data into time windows

According to dates: before 23 March, from 23 March to 10 May, from 11 May onwards.

In [None]:
def label_weeks(date):
    """Assigns and labels weeks to a time window."""
    if date <= datetime.strptime("2020-03-22", '%Y-%m-%d'):
        return "before-lockdown"
    if (date > datetime.strptime("2020-03-22", '%Y-%m-%d')) and (date <= datetime.strptime("2020-05-10", '%Y-%m-%d')):
        return "lockdown"
    if date > datetime.strptime("2020-05-10", '%Y-%m-%d'):
        return "post-lockdown"
    

In [None]:
opinions_term_yn_occurrence["time_window"] = opinions_term_yn_occurrence.index.get_level_values('pub_date_dt').map(label_weeks)

In [None]:
opinions_term_yn_occurrence.set_index([opinions_term_yn_occurrence.index, "time_window"], inplace=True)

In [None]:
opinions_term_yn_occurrence[:5]

### Remove words with low document frequency (other than keywords and terms of conceptual interest, i.e., "behavioural fatigue" and "herd immunity")

Words appearing in 20 or fewer (TBC) opinion contexts

In [None]:
list(set(SUBKEYWORDS_TO_KEYWORDS_DICT.values()))

In [None]:
KEY_WORDS

In [None]:
KEEP_KEY_WORDS = list(set(SUBKEYWORDS_TO_KEYWORDS_DICT.values())) + KEY_WORDS

In [None]:
EXCLUDE_KWORDS

In [None]:
KEEP_KEY_WORDS = [word for word in KEEP_KEY_WORDS if word not in EXCLUDE_KWORDS]

Frequency of words (other than selected keywords)

In [None]:
opinions_term_yn_occurrence.drop(KEEP_KEY_WORDS, axis=1).sum(axis=0).sort_values(ascending=False)

How many non-keywords have a document frequency below vs. above 20? 

In [None]:
docfreqs_nonkeywords = opinions_term_yn_occurrence.drop(KEEP_KEY_WORDS, axis=1).sum(axis=0).sort_values(ascending=False)

In [None]:
len(docfreqs_nonkeywords[docfreqs_nonkeywords > 20])

In [None]:
len(docfreqs_nonkeywords[docfreqs_nonkeywords <= 20])

In [None]:
min(docfreqs_nonkeywords)

In [None]:
max(docfreqs_nonkeywords)

In [None]:
print(f"median: {np.percentile(docfreqs_nonkeywords, [50])}")

In [None]:
q75, q25 = np.percentile(docfreqs_nonkeywords, [75 ,25])
iqr = q75 - q25
print(f"interquartile range: {iqr}")

In [None]:
above20_terms = docfreqs_nonkeywords[docfreqs_nonkeywords > 20].index.to_list()

In [None]:
opinions_term_yn_occurrence[above20_terms].groupby('time_window').sum().sum(axis=1)

#### Only keep terms that are either our selected keywords or words with a document frequency above 20

In [None]:
words_toolow_docfreq = docfreqs_nonkeywords[docfreqs_nonkeywords <= 20].reset_index()["index"].to_list()

In [None]:
#words_toolow_docfreq

In [None]:
opinions_term_yn_occurrence.drop(words_toolow_docfreq, axis=1, inplace=True)
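The filtering rule used here (keep a term if it is a protected keyword or if its document frequency is strictly above 20) can be sketched in plain Python as follows. The frequencies and the `protected` set are hypothetical stand-ins for the real data and keyword list.

```python
MIN_DOC_FREQ = 20  # terms at or below this value are dropped unless protected

# Hypothetical document frequencies.
doc_freq = {"nudge": 5, "lockdown": 140, "government": 95, "garden": 3, "policy": 20}
protected = {"nudge"}  # stands in for the selected keywords, which are always kept

kept = sorted(
    term for term, freq in doc_freq.items()
    if term in protected or freq > MIN_DOC_FREQ
)
print(kept)
```

A term with a document frequency of exactly 20 ("policy" in this toy example) is dropped, matching the `<= 20` condition in the cell that builds the list of low-frequency words.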

In [None]:
opinions_term_yn_occurrence.shape

# Co-occurrence

## Separate the data by time window: before-lockdown, lockdown and post-lockdown

In [None]:
doc_freqs_before = opinions_term_yn_occurrence[
    opinions_term_yn_occurrence.index.get_level_values('time_window').isin(['before-lockdown'])]
doc_freqs_lock = opinions_term_yn_occurrence[
    opinions_term_yn_occurrence.index.get_level_values('time_window').isin(['lockdown'])]
doc_freqs_post = opinions_term_yn_occurrence[
    opinions_term_yn_occurrence.index.get_level_values('time_window').isin(['post-lockdown'])]

How many opinion contexts per time window?

In [None]:
opinions_term_yn_occurrence.groupby("time_window").count()

# Dice coefficient

We define co-occurrence as two words appearing in the same opinion context, regardless of how many times each appears. So our measure of co-occurrence is based on document co-occurrence, where the document is the opinion context.

We normalise co-occurrence using the Dice coefficient, a measure used in corpus linguistics that should not inflate the importance of co-occurrence for keywords with a very low appearance count in the corpus. Raw counts can mislead because words that are frequent by themselves tend to have frequent relations to other words. Metrics like the Dice coefficient, mutual information or the log-likelihood ratio indicate whether a relation is more frequent than one would expect given the frequency of the individual words.

- The co-occurrence relationship between any two keywords is expressed by the Dice coefficient (pseudocode shown below), which describes the strength of association between the two keywords.
- The main reason for choosing mutual information instead of selecting the number of frequent words could be analysed by the following process, ADD
- Note that the co-occurrence count of two keywords can be relatively small, but if they almost always appear together, their Dice coefficient will be high, and they are considered to be in a co-occurrence relationship.
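The last point can be illustrated with a minimal sketch that computes the Dice coefficient from the sets of opinion-context ids in which each word appears. The helper `dice` and the ids are hypothetical and only serve to contrast a rare pair that always appears together with a frequent pair that mostly appears apart.

```python
def dice(docs_a: set, docs_b: set) -> float:
    """Dice coefficient from the sets of context ids in which each word appears."""
    if not docs_a and not docs_b:
        return float("nan")  # undefined when neither word occurs
    return 2 * len(docs_a & docs_b) / (len(docs_a) + len(docs_b))


# Rare pair that always appears together: 3 shared contexts out of 3 each.
rare_pair = dice({1, 2, 3}, {1, 2, 3})

# Frequent pair that shares more contexts (5) but mostly appears apart (50 each).
frequent_pair = dice(set(range(0, 50)), set(range(45, 95)))

print(rare_pair, frequent_pair)
```

The rare pair has fewer shared contexts in absolute terms, yet its coefficient is much higher because every occurrence of either word is a joint occurrence.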

Ref: 
https://onlinelibrary.wiley.com/doi/pdf/10.1002/ecj.10347

https://www.aclweb.org/anthology/C12-2049.pdf

https://www.aclweb.org/anthology/J05-4002.pdf

`Dice coefficient = (2 * count(w1, w2)) / (count(w1) + count(w2))`

That is, twice the number of co-occurrences divided by the sum of the two keywords' individual document occurrences.

E.g., given the following document occurrences and co-occurrence (keyword A: 10, keyword B: 6, co-occurrence: 4),
the Dice coefficient representing the co-occurrence of A and B is:

In [None]:
2*4/(10+6)

Compare this to a case where keyword A: 10, keyword B: 6, co-occurrence: 6

In [None]:
2*6/(10+6)

In [None]:
from itertools import combinations

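def calc_dice(yn_occurence_data: pd.DataFrame, cooccurence_threshold: int=None, prefix=""):
    """Compute the Dice coefficient for each pair of terms in a binary (0/1) occurrence table.

    Rows are opinion contexts and columns are terms. If cooccurence_threshold is given,
    only pairs co-occurring in at least that many contexts are kept. The output columns
    are named using the given prefix.
    """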
    
    # keyword document occurrence
    kword_docfreqs = yn_occurence_data.sum(axis=0)
    
    # keywords co-occurrence matrix
    kword_cooccurences = yn_occurence_data.values.T.dot(yn_occurence_data.values)
    np.fill_diagonal(kword_cooccurences, 0)
    kwords = yn_occurence_data.columns
    kword_cooccurences = pd.DataFrame(kword_cooccurences, index=kwords, columns=kwords)
    kword_cooccurences = kword_cooccurences.stack()
    
    # if specified, only keep word pairs whose co-occurrence is at or above the specified value
    if cooccurence_threshold:
        kword_cooccurences = kword_cooccurences[kword_cooccurences >= cooccurence_threshold].copy()
        
    # extract the list of words, sorted so that every pair comes out in the same
    # (alphabetical) source/target order in every time window
    words_list = sorted(set([tup[0] for tup in kword_cooccurences.index.values] + 
                            [tup[1] for tup in kword_cooccurences.index.values]))
    
    # filter word doc frequency accordingly
    kword_docfreqs = kword_docfreqs[words_list].copy()
    
    
    def _dice(w1, w2):
        # print(f"{w1}: {kword_docfreqs[w1]}")
        # print(f"{w2}: {kword_docfreqs[w2]}")
        # print(f"coocc: {kword_cooccurences[w1][w2]}")
        try:
            return (2 * kword_cooccurences[w1][w2]) / (kword_docfreqs[w1] + kword_docfreqs[w2])
        except (ValueError, ZeroDivisionError) as err: # one of the two individual counts are 0
            return np.nan
        except KeyError as err:
            print(f"{w1} and {w2} do not co-occur. Computation of Dice coefficient is skipped.")
            pass
        
    def dice(kwords_list: list) -> list:
        coefs = []
        for pair in combinations(kwords_list, r=2):
            try:
                coefs.append((*pair, _dice(*pair), kword_cooccurences[pair[0]][pair[1]], 
                              kword_docfreqs[pair[0]], kword_docfreqs[pair[1]] ))
            except KeyError as err:
                print(f"{pair} do not co-occur. Computation of Dice coefficient is skipped.")
                pass  
        return coefs
    
    dices = dice(kwords_list=words_list)
    dices_df = pd.DataFrame(dices, columns=['source', 'target', f'{prefix}_weight', f'{prefix}_co-occ', f'{prefix}_source_docfreq', f'{prefix}_target_docfreq'])
    
    return dices_df
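The matrix product used inside `calc_dice` deserves a short illustration: for a binary occurrence table X (contexts by terms), the entry (i, j) of X^T X counts the contexts in which terms i and j both occur, and the diagonal holds each term's document frequency. The sketch below reproduces this in plain Python on a made-up table; the term names and values are for illustration only.

```python
# Rows are opinion contexts, columns are terms; 1 means the term occurs.
terms = ["michie", "psychology", "nudge"]
X = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
]

n = len(terms)
# (X^T X)[i][j] counts the contexts in which terms i and j both occur
cooc = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
# the diagonal holds each term's document frequency
doc_freq = [cooc[i][i] for i in range(n)]

# Dice coefficient for each unordered pair of distinct terms
dice = {
    (terms[i], terms[j]): 2 * cooc[i][j] / (doc_freq[i] + doc_freq[j])
    for i in range(n) for j in range(i + 1, n)
}
print(dice)
```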
    

In [None]:
before_dice_coefs = calc_dice(yn_occurence_data=doc_freqs_before, 
                              cooccurence_threshold=10, 
                              prefix="bef")

In [None]:
lock_dice_coefs = calc_dice(yn_occurence_data=doc_freqs_lock, 
                            cooccurence_threshold=10, 
                            prefix="lock")

In [None]:
post_dice_coefs = calc_dice(yn_occurence_data=doc_freqs_post, 
                            cooccurence_threshold=10, 
                            prefix="post")

## Filter results so that at least one of the words in the pair is a selected keyword

In [None]:
SELECTED_KEYWORDS = [kword for kword in KEEP_KEY_WORDS if kword not in ['herd_immunity', 'behavioural_fatigue']]

In [None]:
SELECTED_KEYWORDS

In [None]:
before_dice_coefs = before_dice_coefs[(before_dice_coefs['source'].isin(SELECTED_KEYWORDS)) |  
                  (before_dice_coefs['target'].isin(SELECTED_KEYWORDS))].copy()

In [None]:
lock_dice_coefs = lock_dice_coefs[(lock_dice_coefs['source'].isin(SELECTED_KEYWORDS)) |  
                  (lock_dice_coefs['target'].isin(SELECTED_KEYWORDS))].copy()

In [None]:
post_dice_coefs = post_dice_coefs[(post_dice_coefs['source'].isin(SELECTED_KEYWORDS)) |  
                  (post_dice_coefs['target'].isin(SELECTED_KEYWORDS))].copy()

## How has co-occurrence (as Dice coefficient) evolved before vs. during vs. post lockdown

CHECK : is Dice coefficient a fair representation?

#### Before

In [None]:
before_dice_coefs.sort_values('bef_weight', ascending=False)[:30]

#### During

In [None]:
lock_dice_coefs.sort_values('lock_weight', ascending=False)[:30]

#### Post

In [None]:
post_dice_coefs.sort_values('post_weight', ascending=False)[:30]

Merge the coefficients from the three time blocks to compare them more easily

In [None]:
dice_coefs = before_dice_coefs.merge(lock_dice_coefs, 
                                     how='outer', 
                                     on = ['source', 'target']).merge(post_dice_coefs, 
                                                                      how='outer', 
                                                                      on = ['source', 'target']) 
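One caveat worth checking: an outer merge on `source` and `target` only lines up a pair across windows if the two words are stored in the same order in each table. The sketch below, which uses plain dictionaries and invented coefficients rather than the notebook's data frames, shows one way to canonicalise pair order before joining. The helper `canonical` is hypothetical.

```python
def canonical(pair):
    """Return the pair with its two words in alphabetical order."""
    return tuple(sorted(pair))


# Hypothetical Dice coefficients; note the reversed order of the same pair.
before = {("michie", "spi-b"): 0.40, ("nudge", "halpern"): 0.25}
lockdown = {("spi-b", "michie"): 0.55}

before_c = {canonical(pair): weight for pair, weight in before.items()}
lockdown_c = {canonical(pair): weight for pair, weight in lockdown.items()}

# Outer join: every pair seen in either window, None where it is absent.
merged = {
    pair: (before_c.get(pair), lockdown_c.get(pair))
    for pair in sorted(set(before_c) | set(lockdown_c))
}
print(merged)
```

Without the canonical ordering, the "michie" and "spi-b" pair would appear as two separate rows, each with a missing value in one window.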
                                   

In [None]:
#dice_coefs[['source', 'target', 'bef_weight', 'lock_weight', 'post_weight']][40:100]

### Top 10 co-occurring keyword pairs in each time window and how their co-occurrence has changed across the three time windows

Here we look at the 10 pairs of keywords with the highest Dice coefficients in each of the three time windows, and see how their co-occurrence (as Dice coefficient) has evolved across the three periods.

In [None]:
before_top10_coocurring_pairs = before_dice_coefs.sort_values('bef_weight', ascending=False)[:10].copy()

In [None]:
lock_top10_coocurring_pairs = lock_dice_coefs.sort_values('lock_weight', ascending=False)[:10].copy()

In [None]:
post_top10_coocurring_pairs = post_dice_coefs.sort_values('post_weight', ascending=False)[:10].copy()

Let's take a look (some pairs will be present in more than one time window)

In [None]:
before_top10_coocurring_pairs

In [None]:
lock_top10_coocurring_pairs

In [None]:
post_top10_coocurring_pairs

Merge the three datasets

In [None]:
top_coocurring_pairs = before_top10_coocurring_pairs.merge(lock_top10_coocurring_pairs, 
                                                           how='outer', 
                                                           on = ['source', 'target']).merge(post_top10_coocurring_pairs, 
                                                                                            how='outer', 
                                                                                            on = ['source', 'target']) 
                                   

In [None]:
top_coocurring_pairs[['source', 'target', 'bef_weight', 'lock_weight', 'post_weight']]

### Let's visualise how the pattern and strength of their co-occurrence have changed across the three time windows

In [None]:
import chart_studio.plotly as py
import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode()
import plotly.figure_factory as ff

In [None]:
dice_coefs_matrix = np.array(top_coocurring_pairs[['bef_weight', 'lock_weight', 'post_weight']])


In [None]:
top_coocurring_pairs['keywords_pair'] = [s + "  |  " + t for s, t in 
                                         zip(top_coocurring_pairs['source'], top_coocurring_pairs['target'])]

In [None]:
top_coocurring_pairs[['keywords_pair', 'bef_weight', 'lock_weight', 'post_weight']]

In [None]:
# Make Annotated Heatmap
z_text = np.array(pd.DataFrame(np.around(dice_coefs_matrix, decimals=2)).fillna(''), dtype=str)

# Set Colorscale
colorscale=[[0.0, 'rgb(247, 232, 246)'], [1.0, 'rgb(255, 77, 148)']]


fig = ff.create_annotated_heatmap(
    z=dice_coefs_matrix,
    x=['before lockdown', 'during lockdown', 'post lockdown'],
    y=top_coocurring_pairs['keywords_pair'].tolist(), 
    annotation_text=z_text, 
    text=z_text,
    colorscale=colorscale, 
    hoverinfo='z',
    font_colors=['black'])

fig.update_layout(
    title_text='',
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    xaxis=dict(showline=True, linewidth=1, linecolor='black', mirror=True),
    yaxis=dict(visible=True,autorange='reversed', showline=True, linewidth=1, linecolor='black', mirror=True,
             ),
    )

fig.show()


## Co-occurrences that started, ended or continued

Here we summarise the trend in co-occurrence to identify whether the co-occurrence between two keywords:
- started / ended / stayed during lockdown (compared to before lockdown)
- started / ended / stayed post lockdown (compared to during lockdown)


In [None]:
def trend_in_cooccurrence(score1, score2):
    """Label how a pair's co-occurrence changed between two time windows."""
    # a pair co-occurs in a window only if its score is neither NaN nor zero
    present1 = not (np.isnan(score1) or score1 == 0.0)
    present2 = not (np.isnan(score2) or score2 == 0.0)
    if present1 and present2:
        return "stayed"
    if present1:
        return "ended"
    if present2:
        return "started"
    return "none"

In [None]:
dice_coefs['coocc_trend_lock_vs_before'] = dice_coefs.apply(lambda row: 
                                                trend_in_cooccurrence(row['bef_weight'], row['lock_weight']), axis=1)

In [None]:
dice_coefs['coocc_trend_post_vs_lock'] = dice_coefs.apply(lambda row: 
                                                trend_in_cooccurrence(row['lock_weight'], row['post_weight']), axis=1)

In [None]:
dice_coefs[['source', 'target', 'bef_weight', 'lock_weight', 
            'post_weight', 'coocc_trend_lock_vs_before', 'coocc_trend_post_vs_lock']].sort_values("bef_weight", ascending=False)

## Co-occurrences that started during lock-down

In [None]:
np.array(dice_coefs[dice_coefs.coocc_trend_lock_vs_before == "started"][['source', 'target']])

## Co-occurrences that ended during lock-down

In [None]:
np.array(dice_coefs[dice_coefs.coocc_trend_lock_vs_before == "ended"][['source', 'target']])

## Co-occurrences that remained during lock-down

In [None]:
np.array(dice_coefs[dice_coefs.coocc_trend_lock_vs_before == "stayed"][['source', 'target']])

## Co-occurrences that started post lock-down

In [None]:
np.array(dice_coefs[dice_coefs.coocc_trend_post_vs_lock == "started"][['source', 'target']])

## Co-occurrences that ended post lock-down

In [None]:
np.array(dice_coefs[dice_coefs.coocc_trend_post_vs_lock == "ended"][['source', 'target']])

## Co-occurrences that remained post lock-down

In [None]:
np.array(dice_coefs[dice_coefs.coocc_trend_post_vs_lock == "stayed"][['source', 'target']])

## Network based on Dice coefficient

Here we display the co-occurrence network of keywords in the three time windows.

- **Nodes** represent keywords.
- **Edges** represent their co-occurrence strength (Dice coefficient).
- **The node size** represents the node degree, that is, the number of neighbor nodes directly connected to the node. The degree captures the importance of the keyword in the network: a higher degree indicates a highly connected keyword.
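As a small illustration of the degree definition above, the following sketch counts neighbours and sums edge weights from a weighted edge list in plain Python. The edges and their weights are invented; in the notebook itself these quantities come from `networkx`.

```python
from collections import defaultdict

# Hypothetical weighted edges: (source, target, Dice coefficient)
edges = [
    ("michie", "spi-b", 0.5),
    ("michie", "psychology", 0.2),
    ("nudge", "halpern", 0.3),
]

degree = defaultdict(int)             # number of directly connected neighbours
weighted_degree = defaultdict(float)  # sum of the weights of incident edges
for source, target, weight in edges:
    for node in (source, target):
        degree[node] += 1
        weighted_degree[node] += weight

print(dict(degree))
print(dict(weighted_degree))
```

The unweighted degree matches the "number of neighbours" definition, while the weighted variant also reflects how strong each connection is.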

In [None]:
csfont = {'fontname':'Helvetica'}

### Community detection

We will use the Louvain modularity algorithm to detect communities of co-occurring words in our networks, as implemented by the `python-louvain` Python package (https://python-louvain.readthedocs.io/en/latest/api.html). The algorithm returns the partition of highest modularity, i.e. the highest partition of the dendrogram generated by the Louvain algorithm.

The Louvain algorithm works by maximising modularity (Blondel et al., 2008). Modularity measures the density of connections within communities compared to the density of connections between communities. It takes on values between -1 and 1, and a higher value represents better community definition (Newman & Girvan, 2004). See supplementary material for additional information.
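To make the modularity definition concrete, the sketch below computes it by hand for a toy graph as the sum over communities of (internal edge weight / m) minus (community degree / 2m) squared, where m is the total edge weight. This is a plain-Python illustration under the standard Newman-Girvan definition, not the `python-louvain` implementation, and the graph is invented.

```python
def modularity(edges, community):
    """Newman-Girvan modularity of an undirected weighted graph.

    edges: list of (u, v, weight); community: dict mapping node -> community label.
    """
    total = sum(w for _, _, w in edges)  # m, the total edge weight
    internal = {}  # weight of the edges lying inside each community
    degree = {}    # summed weighted degree of the nodes in each community
    for u, v, w in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(internal.get(c, 0.0) / total - (d / (2 * total)) ** 2
               for c, d in degree.items())


# Two triangles joined by a single edge, split into their natural communities.
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1), ("c", "d", 1)]
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q = modularity(edges, community)
print(round(q, 3))
```

Placing all six nodes in a single community gives a modularity of 0, so the positive value for the two-triangle split reflects denser connections within communities than between them.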

The Louvain algorithm takes a resolution parameter γ > 0. Higher resolutions lead to more communities, while lower resolutions lead to fewer communities. We iterated over possible values of the resolution parameter r (start 0.1, end 5, step 0.1 in the code below) and opted for the lowest value of r that led to the maximum modularity value.

Blondel, V. D., Guillaume, J. L., Lambiotte, R., Lefebvre, E. (2008), Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment, Nr.10, P10008

Found to be one of the fastest and best performing algorithms in comparative analyses: Lancichinetti, A. & Fortunato, S. Community detection algorithms: A comparative analysis. Phys. Rev. E 80, 056117, https://doi.org/10.1103/PhysRevE.80.056117 (2009).
Yang, Z., Algesheimer, R. & Tessone, C. J. A Comparative Analysis of Community Detection Algorithms on Artificial Networks. Sci. Rep. 6, 30750, https://doi.org/10.1038/srep30750 (2016).


In [None]:
import community as community_louvain

### Before lockdown

In [None]:
# drop NaN cases and 0.0 values
before_dice_coefs.dropna(inplace=True)

In [None]:
before_dice_coefs = before_dice_coefs[before_dice_coefs.bef_weight > 0.0]

In [None]:
# Dice coefficient above a certain threshold?

#### network

In [None]:
before_dice_graph = nx.from_pandas_edgelist(before_dice_coefs[['source', 'target', 'bef_weight']], edge_attr=True)

In [None]:
# take a look at one
print(nx.to_dict_of_dicts(before_dice_graph).get('michie'))

#### communities

In [None]:
community_louvain.generate_dendrogram(before_dice_graph, weight='bef_weight', resolution=1)

Iterate over a range of values for gamma, the resolution parameter, and detect the value leading to the highest modularity.

In [None]:
bef_resolution_iter = {}
for r in [x / 10.0 for x in range(1, 50, 1)]:
    comp = community_louvain.best_partition(before_dice_graph, weight='bef_weight',  resolution=r)
    Q = community_louvain.modularity(comp, before_dice_graph, weight='bef_weight')
    bef_resolution_iter[r] = Q
    

In [None]:
bef_resolution_iter    #1.0: 0.505713233567339,
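The selection rule described earlier (the lowest resolution that reaches the maximum modularity) can be applied programmatically rather than read off the printed dictionary. The following is a minimal sketch: the helper name `lowest_best_resolution`, the rounding tolerance and the toy scores are assumptions for illustration, not values from this analysis.

```python
def lowest_best_resolution(modularity_by_resolution, ndigits=6):
    """Return the smallest resolution whose rounded modularity equals the maximum.

    Rounding guards against tiny floating-point differences between runs.
    """
    best = max(round(q, ndigits) for q in modularity_by_resolution.values())
    return min(r for r, q in modularity_by_resolution.items()
               if round(q, ndigits) == best)

# Hypothetical scores in the same shape as bef_resolution_iter: {resolution: modularity}
toy_scores = {0.1: 0.31, 0.5: 0.47, 1.0: 0.51, 1.5: 0.51, 2.0: 0.44}
chosen = lowest_best_resolution(toy_scores)
print(chosen)
```

With a real result one would call `lowest_best_resolution(bef_resolution_iter)`. Because `best_partition` is randomised unless a `random_state` is passed, the selected value may vary slightly between runs.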

In [None]:
# before

bef_partition = community_louvain.best_partition(before_dice_graph, weight='bef_weight',  resolution=1.0)


In [None]:
bef_partition

In [None]:
print(f"Modularity (before network): {round(community_louvain.modularity(bef_partition, before_dice_graph, weight='bef_weight'), 2)}")

#### nodes' betweenness

to use as node size


In [None]:
before_betweenness_dict = nx.betweenness_centrality(
    before_dice_graph, 
    weight='bef_weight'
    ) 

# Assign each to an attribute in your network
nx.set_node_attributes(before_dice_graph, before_betweenness_dict, 'betweenness')


#### visualise

In [None]:
# extract weights, we'll use them for plotting
before_dice_graph_weights = list(nx.get_edge_attributes(before_dice_graph,'bef_weight').values())

In [None]:
# calculate nodes' degree (alternative to use as node size)
before_dice_graph_degrees = dict(nx.degree(before_dice_graph, weight='bef_weight'))
#before_dice_graph_degrees

In [None]:
# color the edges, starting from a darker tone of gray
#https://stackoverflow.com/questions/26102515/select-starting-color-in-matplotlib-colormap

lvTmp = np.linspace(0.1,1.0,len(before_dice_graph_weights)-1)
lvTmp

In [None]:
cmTmp = plt.cm.Greys(lvTmp)
newCmap = mcol.ListedColormap(cmTmp)

In [None]:
# color the nodes according to their partition
# cm.get_cmap is deprecated in recent matplotlib releases; build the ListedColormap directly
cmap = mcol.ListedColormap(["lightblue", "peru",  "lightgreen", "fuchsia", "orange"])


In [None]:
plt.figure(figsize = (80,80))

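# layout parameters (k, seed) belong to spring_layout; nx.draw only takes styling arguments
before_pos = nx.spring_layout(before_dice_graph, k=10, seed=20)

nx.draw(before_dice_graph, 
        pos=before_pos,
        with_labels=True, 
        edge_color=before_dice_graph_weights,
        width=[v*30 for v in before_dice_graph_weights],
        #node_color='lightgray',
        # look values up by node so colours and sizes follow the graph's node order
        node_color=[bef_partition[n] for n in before_dice_graph.nodes],
        cmap=cmap,
        linewidths=2,
        #node_size=[v * 10000 for v in before_dice_graph_degrees.values()],
        node_size=[(before_betweenness_dict[n] * 100000)+2000 for n in before_dice_graph.nodes],
        font_size=100,
        font_family="Helvetica",
        font_color="black",
        font_weight=3,
        edge_cmap=newCmap,
        edge_vmin=0
       )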

plt.title("Before strict lockdown".upper(), fontsize=150, **csfont)
plt.axis('off')
 
plt.savefig('before_network.png')
plt.show()

### During lockdown

In [None]:
# drop NaN cases and 0.0 values
lock_dice_coefs.dropna(inplace=True)

In [None]:
lock_dice_coefs = lock_dice_coefs[lock_dice_coefs.lock_weight > 0.0]

#### network

In [None]:
lock_dice_graph = nx.from_pandas_edgelist(lock_dice_coefs[['source', 'target', 'lock_weight']], edge_attr=True)

In [None]:
# check nodes
#sorted(lock_dice_graph.nodes)

In [None]:
# take a look at one
print(nx.to_dict_of_dicts(lock_dice_graph).get('michie'))

In [None]:
# extract weights, we'll use them for plotting
lock_dice_weights = list(nx.get_edge_attributes(lock_dice_graph,'lock_weight').values())

#### communities

Iterate over values of the resolution parameter to find the one leading to the highest modularity.

In [None]:
lock_resolution_iter = {}
for r in [x / 10.0 for x in range(1, 50, 1)]:
    comp = community_louvain.best_partition(lock_dice_graph, weight='lock_weight',  resolution=r)
    Q = community_louvain.modularity(comp, lock_dice_graph, weight='lock_weight')
    lock_resolution_iter[r] = Q
    

In [None]:
lock_resolution_iter    #1.0: 0.48165320554268937,

In [None]:
# during
lock_partition = community_louvain.best_partition(lock_dice_graph, weight='lock_weight',  resolution=1.0)


In [None]:
lock_partition

In [None]:
print(f"Modularity (lockdown network): {round(community_louvain.modularity(lock_partition, lock_dice_graph, weight='lock_weight'), 2)}")

#### nodes' betweenness

In [None]:
# calculate nodes' degree (alternative to use as node size)
lock_dice_graph_degrees = dict(nx.degree(lock_dice_graph, weight='lock_weight'))

In [None]:
# calculate nodes' betweenness to use as node size
lock_betweenness_dict = nx.betweenness_centrality(
    lock_dice_graph, 
    weight='lock_weight'
    ) 

# Assign each to an attribute in your network
nx.set_node_attributes(lock_dice_graph, lock_betweenness_dict, 'betweenness')


In [None]:
# let color of edges start from a darker tone of gray
#https://stackoverflow.com/questions/26102515/select-starting-color-in-matplotlib-colormap

lvTmp_lock = np.linspace(0.1,1.0,len(lock_dice_weights)-1)
lvTmp_lock

In [None]:
cmTmp_lock = plt.cm.Greys(lvTmp_lock)
newCmap_lock = mcol.ListedColormap(cmTmp_lock)

In [None]:
# color the nodes according to their partition
# cm.get_cmap is deprecated in recent matplotlib releases; build the ListedColormap directly
cmap = mcol.ListedColormap(["lightblue", "crimson",  "lightgreen", "pink", "orange"])


In [None]:
plt.figure(figsize = (80,80))

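# layout parameters (k, seed) belong to spring_layout; nx.draw only takes styling arguments
lock_pos = nx.spring_layout(lock_dice_graph, k=10, seed=28)

nx.draw(lock_dice_graph, 
        pos=lock_pos,
        with_labels=True, 
        edge_color=lock_dice_weights,
        width=[v*30 for v in lock_dice_weights],
        #node_color='lightgray',
        # look values up by node so colours and sizes follow the graph's node order
        node_color=[lock_partition[n] for n in lock_dice_graph.nodes],
        cmap=cmap,
        linewidths=2,
        #node_size=[v * 10000 for v in lock_dice_graph_degrees.values()],
        node_size=[(lock_betweenness_dict[n] * 100000)+2000 for n in lock_dice_graph.nodes],
        font_size=100,
        font_family="Helvetica",
        font_color="black",
        font_weight=3,
        edge_cmap=newCmap_lock,  # the lockdown edge colormap, not the pre-lockdown one
        edge_vmin=0
       )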

plt.title("During strict lockdown".upper(), fontsize=150, **csfont)
plt.axis('off')
 
plt.savefig('lock_network.png')
plt.show()

#### Sub-networks (for plotting purposes only)

The full lockdown network is very hard to visualise legibly, so we plot its two sub-networks separately.

In [None]:
sub_data1 = lock_dice_coefs[['source', 'target', 'lock_weight']][
    (lock_dice_coefs['source'].isin(['halpern', 'behav_insights_team', 'dr'])) |  
                  (lock_dice_coefs['target'].isin(['halpern', 'behav_insights_team', 'dr']))].copy()

In [None]:
sub_network1 = nx.from_pandas_edgelist(sub_data1, edge_attr=True)

In [None]:
print(nx.to_dict_of_dicts(sub_network1))

In [None]:
sub_data2 = lock_dice_coefs[['source', 'target', 'lock_weight']][
    ~(lock_dice_coefs['source'].isin(['halpern', 'behav_insights_team', 'dr'])) & 
                  ~(lock_dice_coefs['target'].isin(['halpern', 'behav_insights_team', 'dr']))].copy()

In [None]:
sub_network2 = nx.from_pandas_edgelist(sub_data2, edge_attr=True)

In [None]:
# extract weights, we'll use them for plotting
lock_dice_weights_2 = list(nx.get_edge_attributes(sub_network2,'lock_weight').values())
lock_dice_weights_1 = list(nx.get_edge_attributes(sub_network1,'lock_weight').values())

In [None]:
lock_betweenness_dict_2 = dict((k, lock_betweenness_dict[k]) for k in lock_betweenness_dict.keys() 
     if k not in ['halpern', 'behav_insights_team', 'dr'])


In [None]:
lock_betweenness_dict_1 = dict((k, lock_betweenness_dict[k]) for k in ['halpern', 'behav_insights_team', 'dr'])


In [None]:
lock_partition_2 = dict((k, lock_partition[k]) for k in lock_partition.keys() 
     if k not in ['halpern', 'behav_insights_team', 'dr']) 

In [None]:
lvTmp_lock_2 = np.linspace(0.1,1.0,len(lock_dice_weights_2)-1)

cmTmp_lock_2 = plt.cm.Greys(lvTmp_lock_2)
newCmap_lock_2 = mcol.ListedColormap(cmTmp_lock_2)

In [None]:
# cm.get_cmap is deprecated in recent matplotlib releases; build the ListedColormap directly
cmap2 = mcol.ListedColormap(["orange", "peru",  "lightgreen", "fuchsia"])


In [None]:
plt.figure(figsize = (80,80))

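# layout parameters (k, seed) belong to spring_layout; nx.draw only takes styling arguments
sub2_pos = nx.spring_layout(sub_network2, k=10, seed=37)

nx.draw(sub_network2, 
        pos=sub2_pos,
        with_labels=True, 
        edge_color=lock_dice_weights_2,
        #width=8,
        width=[v*30 for v in lock_dice_weights_2],
        # the sub-network's node order can differ from the full network's,
        # so look up partition and betweenness node by node
        node_color=[lock_partition[n] for n in sub_network2.nodes],
        cmap=cmap2,
        linewidths=1,
        #node_size=[v * 10000 for v in lock_dice_graph_degrees.values()],
        node_size=[(lock_betweenness_dict[n] * 100000)+2000 for n in sub_network2.nodes],
        font_size=100,
        font_family="Helvetica",
        font_color='black',
        font_weight=3,
        edge_cmap=newCmap_lock_2,
        edge_vmin=0
       )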

plt.title("During strict lockdown".upper(), fontsize=150, **csfont)
plt.axis('off')
 
plt.savefig('lock_network_2.png')
plt.show()

### After strict lockdown

In [None]:
post_dice_coefs.dropna(inplace=True)

In [None]:
post_dice_coefs = post_dice_coefs[post_dice_coefs.post_weight > 0.0]

#### network

In [None]:
post_dice_graph = nx.from_pandas_edgelist(post_dice_coefs[['source', 'target', 'post_weight']], edge_attr=True)

In [None]:
# take a look at one
print(nx.to_dict_of_dicts(post_dice_graph).get('michie'))

In [None]:
# extract weights, we'll use them for plotting
post_dice_weights = list(nx.get_edge_attributes(post_dice_graph,'post_weight').values())

#### communities

Iterate over values of the resolution parameter to find the one leading to the highest modularity.

In [None]:
post_resolution_iter = {}
for r in [x / 10.0 for x in range(1, 50, 1)]:
    comp = community_louvain.best_partition(post_dice_graph, weight='post_weight',  resolution=r)
    Q = community_louvain.modularity(comp, post_dice_graph, weight='post_weight')
    post_resolution_iter[r] = Q
    

In [None]:
post_resolution_iter   #1.0: 0.44476083753777085,

In [None]:
# post
post_partition = community_louvain.best_partition(post_dice_graph, weight='post_weight',  resolution=1.0)


In [None]:
post_partition

In [None]:
print(f"Modularity (post network): {round(community_louvain.modularity(post_partition, post_dice_graph, weight='post_weight'), 2)}")

#### nodes' betweenness

In [None]:
# calculate nodes' degree to use as node size
post_dice_graph_degrees = dict(nx.degree(post_dice_graph, weight='post_weight'))


In [None]:
# calculate nodes' betweenness to use as node size
post_betweenness_dict = nx.betweenness_centrality(
    post_dice_graph, 
    weight='post_weight'  # the post network stores its weights under 'post_weight'
    ) 

# Assign each to an attribute in your network
nx.set_node_attributes(post_dice_graph, post_betweenness_dict, 'betweenness')


In [None]:
# let color of edges start from a darker tone of gray
#https://stackoverflow.com/questions/26102515/select-starting-color-in-matplotlib-colormap

lvTmp_post = np.linspace(0.1,1.0,len(post_dice_weights)-1)
lvTmp_post

In [None]:
cmTmp_post = plt.cm.Greys(lvTmp_post)
newCmap_post = mcol.ListedColormap(cmTmp_post)

In [None]:
plt.figure(figsize = (80,80))

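# layout parameters (k, seed) belong to spring_layout; nx.draw only takes styling arguments
post_pos = nx.spring_layout(post_dice_graph, k=10, seed=20)

nx.draw(post_dice_graph, 
        pos=post_pos,
        with_labels=True, 
        edge_color=post_dice_weights,
        width=[v*30 for v in post_dice_weights],
        #node_color='lightgray',
        # look values up by node so colours and sizes follow the graph's node order
        node_color=[post_partition[n] for n in post_dice_graph.nodes],
        cmap=cmap,
        linewidths=2,
        #node_size=[v * 10000 for v in post_dice_graph_degrees.values()],
        node_size=[(post_betweenness_dict[n] * 100000)+2000 for n in post_dice_graph.nodes],
        font_size=100,
        font_family="Helvetica",
        font_color="black",
        font_weight=3,
        edge_cmap=newCmap_post,  # the post-lockdown edge colormap, not the pre-lockdown one
        edge_vmin=0
       )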

plt.title("Post strict lockdown".upper(), fontsize=150, **csfont)
plt.axis('off')
 
plt.savefig('post_network.png')
plt.show()

## Characteristics of the three networks and nodes

Main ref: https://programminghistorian.org/en/lessons/exploring-and-analyzing-network-data-with-python

### Number of nodes (keywords that co-occurred)

In [None]:
print("Number of keywords co-occurring before-lockdown:", len(before_dice_graph.nodes))

In [None]:
print("Number of keywords co-occurring during-lockdown:", len(lock_dice_graph.nodes))

In [None]:
print("Number of keywords co-occurring post-lockdown:", len(post_dice_graph.nodes))

### Network density

Network density is the ratio between the actual number of connections between nodes and the maximum possible number of connections.

It gives a sense of how closely knit the network is: a higher value (within [0, 1]) indicates a more cohesive network, i.e. a set of keywords that tend to co-occur.
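To make the ratio concrete, here is a small sketch that computes density by hand for an undirected graph without self-loops, where the maximum number of connections is n(n-1)/2. The function name `density` and the toy edge list are hypothetical; the notebook itself uses `nx.density`.

```python
def density(n_nodes, n_edges):
    """Observed edges divided by possible edges in an undirected simple graph."""
    if n_nodes < 2:
        return 0.0  # no pair of nodes exists, so no edge is possible
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

# Hypothetical toy network: 4 keywords, 4 co-occurrence links out of 6 possible
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
toy_nodes = {node for edge in toy_edges for node in edge}
toy_density = density(len(toy_nodes), len(toy_edges))
print(round(toy_density, 3))
```

Note that density depends only on which pairs are linked, not on the Dice weights, so two networks with the same links but different co-occurrence strengths have the same density.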



In [None]:
before_density = nx.density(before_dice_graph)
print("Network density (before lockdown):", before_density)

In [None]:
lock_density = nx.density(lock_dice_graph)
print("Network density (during lockdown):", lock_density)

In [None]:
post_density = nx.density(post_dice_graph)
print("Network density (post lockdown):", post_density)

- Network density has decreased during lockdown compared to pre-lockdown. 
    Interpretation: a decrease in the general tendency of keywords to co-occur together in the same documents. 
    
- Network density decreased post lockdown compared to lockdown.

### Network Clustering Coefficient

= number of connections between the neighbour nodes of a node / maximum possible number of connections between its neighbour nodes

(neighbour nodes are the nodes directly connected to a node).

A measure of the degree to which nodes in a graph tend to cluster together.
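The definition above is the unweighted one. The sketch below applies it to a hypothetical toy adjacency structure (the names `local_clustering` and `toy_adjacency` are illustrative only). The cells that follow pass `weight=` to `nx.average_clustering`, which uses a weighted generalisation of this quantity, so their values will generally differ from the unweighted ratio shown here.

```python
from itertools import combinations

def local_clustering(adjacency, node):
    """Share of a node's neighbour pairs that are themselves connected."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pair to check
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return links / (k * (k - 1) / 2)

# Hypothetical toy network: a triangle (a, b, c) with a pendant node d attached to c
toy_adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
coefs = {node: local_clustering(toy_adjacency, node) for node in toy_adjacency}
average_coef = sum(coefs.values()) / len(coefs)
print(coefs, round(average_coef, 3))
```

Node `c` has three neighbours but only one connected pair among them, so its coefficient is 1/3, while `a` and `b` score 1. The network-level figure is the average over all nodes.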

In [None]:
before_clustcoef = nx.average_clustering(before_dice_graph, weight='bef_weight')
print("Network clustering coefficient (before lockdown):", before_clustcoef)

In [None]:
lock_clustcoef = nx.average_clustering(lock_dice_graph, weight='lock_weight')
print("Network clustering coefficient (during lockdown):", lock_clustcoef)

In [None]:
post_clustcoef = nx.average_clustering(post_dice_graph, weight='post_weight')
print("Network clustering coefficient (post lockdown):", post_clustcoef)

The clustering coefficient decreased during lockdown compared to pre-lockdown, and decreased again in the post-lockdown period compared to lockdown. 

## Centrality measures

Identify the nodes (keywords) that are most important in the networks and compare their ranking over time.

### Node Degree

The number of connections a node has. For a weighted network, this is the sum of the weights of the edges adjacent to the node. 

Here: with how many different keywords does each keyword co-occur?
Note that this is likely to be proportional to the keyword's frequency, which we can also report.
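The two readings of degree given above (number of distinct co-occurring keywords versus sum of edge weights) can rank nodes differently. A minimal sketch with a hypothetical weighted edge list shows the distinction; the node names and weights are made up for illustration.

```python
from collections import defaultdict

# Hypothetical Dice-weighted edges: x has many weak ties, y has one strong tie
toy_edges = [("x", "p", 0.125), ("x", "q", 0.125), ("x", "r", 0.125),
             ("y", "z", 0.5)]

n_neighbours = defaultdict(int)   # unweighted degree: number of co-occurring keywords
strength = defaultdict(float)     # weighted degree: sum of adjacent edge weights
for u, v, w in toy_edges:
    for node in (u, v):
        n_neighbours[node] += 1
        strength[node] += w

print(dict(n_neighbours))
print(dict(strength))
```

Here `x` co-occurs with more keywords than `y` (3 versus 1) but has a lower weighted degree (0.375 versus 0.5). Calling the degree functions with `weight=...`, as the cells that follow do, returns the weighted version; omitting `weight` returns the count of neighbours.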

In [None]:
def get_node_degree(graph, weight=None):
    node_degree_dict = {}
    for node in graph.nodes:
        node_degree_dict[node] = nx.degree(graph, node, weight)
    return node_degree_dict    

Before lockdown

In [None]:
before_node_degrees = pd.Series(get_node_degree(before_dice_graph, weight='bef_weight')).sort_values(ascending=False)
print(before_node_degrees)

During lockdown

In [None]:
lock_node_degrees = pd.Series(get_node_degree(lock_dice_graph, weight='lock_weight')).sort_values(ascending=False)
print(lock_node_degrees)

Post lockdown

In [None]:
post_node_degrees = pd.Series(get_node_degree(post_dice_graph, weight='post_weight')).sort_values(ascending=False)
print(post_node_degrees)

In [None]:
# alternative way to calculate it

In [None]:
before_degree_dict = dict(before_dice_graph.degree(before_dice_graph.nodes(), weight='bef_weight'))
nx.set_node_attributes(before_dice_graph, before_degree_dict, 'degree')

In [None]:
lock_degree_dict = dict(lock_dice_graph.degree(lock_dice_graph.nodes(), weight='lock_weight'))
nx.set_node_attributes(lock_dice_graph, lock_degree_dict, 'degree')

In [None]:
post_degree_dict = dict(post_dice_graph.degree(post_dice_graph.nodes(), weight='post_weight'))
nx.set_node_attributes(post_dice_graph, post_degree_dict, 'degree')

### Node Betweenness Centrality

Betweenness centrality doesn’t care about the number of edges any one node or set of nodes has. Betweenness centrality looks at all the shortest paths that pass through a particular node.

So a keyword with a high betweenness centrality works as a bridge connecting several different other keywords, i.e. it is discussed in articles with a wider variety of other keywords.

https://networkx.org/documentation/networkx-1.10/reference/generated/networkx.algorithms.centrality.betweenness_centrality.html
for weighted networks
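One point worth keeping in mind: networkx interprets the `weight` used for betweenness as a distance when computing shortest paths, whereas a Dice coefficient is a similarity. The sketch below uses a small pure-Python Dijkstra on a hypothetical three-node graph to show how the shortest path changes when weights are inverted. The 1/w transformation is only one possible assumption, not something this analysis applies, and the helper name `shortest_path` is illustrative.

```python
import heapq

def shortest_path(edges, source, target, to_distance):
    """Dijkstra on an undirected edge list; to_distance maps a weight to a length."""
    adjacency = {}
    for u, v, w in edges:
        adjacency.setdefault(u, []).append((v, to_distance(w)))
        adjacency.setdefault(v, []).append((u, to_distance(w)))
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, length in adjacency[node]:
            if neighbour not in visited:
                heapq.heappush(queue, (dist + length, neighbour, path + [neighbour]))
    return None

# Hypothetical Dice weights: a-b and b-c co-occur strongly, a-c only weakly
toy_edges = [("a", "b", 0.9), ("b", "c", 0.9), ("a", "c", 0.1)]

raw_path = shortest_path(toy_edges, "a", "c", lambda w: w)             # weight used as distance
inverted_path = shortest_path(toy_edges, "a", "c", lambda w: 1.0 / w)  # strong tie = short distance
print(raw_path, inverted_path)
```

With raw weights the weak a-c link is the "shortest" route, so `b` lies on no shortest path between `a` and `c`. With inverted weights the route runs through `b`, which then acts as the bridge. Which reading is appropriate is a modelling choice that affects how the betweenness rankings should be interpreted.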

#### Pre-lockdown

In [None]:
before_betweenness_dict = nx.betweenness_centrality(
    before_dice_graph, 
    weight='bef_weight'
    ) 

# Assign each to an attribute in your network
nx.set_node_attributes(before_dice_graph, before_betweenness_dict, 'betweenness')


In [None]:
sorted(before_betweenness_dict.items(), key=itemgetter(1), reverse=True)

Compare degree and betweenness centrality

In [None]:
#Then find and print their degree
for tb in sorted(before_betweenness_dict.items(), key=itemgetter(1), reverse=True): 
    degree = before_degree_dict[tb[0]] # Use degree_dict to access a node's degree
    print("Name:", tb[0], "| Betweenness Centrality:", tb[1], "| Degree:", degree)

#### During lockdown

In [None]:
lock_betweenness_dict = nx.betweenness_centrality(
    lock_dice_graph,
    weight='lock_weight') 

# Assign each to an attribute in your network
nx.set_node_attributes(lock_dice_graph, lock_betweenness_dict, 'betweenness')


sorted(lock_betweenness_dict.items(), key=itemgetter(1), reverse=True)

Compare degree and betweenness centrality

In [None]:
for tb in sorted(lock_betweenness_dict.items(), key=itemgetter(1), reverse=True): 
    degree = lock_degree_dict[tb[0]] # Use degree_dict to access a node's degree
    print("Name:", tb[0], "| Betweenness Centrality:", tb[1], "| Degree:", degree)

#### Post lockdown

In [None]:
post_betweenness_dict = nx.betweenness_centrality(
    post_dice_graph,
    weight='post_weight'
) 

# Assign each to an attribute in your network
nx.set_node_attributes(post_dice_graph, post_betweenness_dict, 'betweenness')


sorted(post_betweenness_dict.items(), key=itemgetter(1), reverse=True)

In [None]:
for tb in sorted(post_betweenness_dict.items(), key=itemgetter(1), reverse=True): 
    degree = post_degree_dict[tb[0]] # Use degree_dict to access a node's degree
    print("Name:", tb[0], "| Betweenness Centrality:", tb[1], "| Degree:", degree)

## Detect communities and re-draw networks

Using the Louvain algorithm for weighted undirected graphs

Ref: https://python-louvain.readthedocs.io/en/latest/

In [None]:
import community as community_louvain

##### Before

In [None]:
# before

bef_partition = community_louvain.best_partition(before_dice_graph, weight='bef_weight')


In [None]:
bef_partition

In [None]:
bef_partition.values()

In [None]:
max(bef_partition.values()) + 1

In [None]:
import matplotlib.cm as cm

In [None]:
# color the nodes according to their partition
# cm.get_cmap is deprecated in recent matplotlib releases; build the ListedColormap directly
cmap = mcol.ListedColormap(["lightblue", "crimson",  "lightgreen", "pink", "orange"])


In [None]:
# color for the edges start from a darker tone of gray
#https://stackoverflow.com/questions/26102515/select-starting-color-in-matplotlib-colormap

lvTmp = np.linspace(0.1,1.0,len(before_dice_graph_weights)-1)
lvTmp

In [None]:
cmTmp = plt.cm.Greys(lvTmp)
newCmap = mcol.ListedColormap(cmTmp)

In [None]:
plt.figure(figsize = (80,80))

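# layout parameters (k, seed) belong to spring_layout; nx.draw only takes styling arguments
before_pos = nx.spring_layout(before_dice_graph, k=10, seed=9)

nx.draw(before_dice_graph, 
        pos=before_pos,
        with_labels=True, 
        edge_color=before_dice_graph_weights,
        width=[v*30 for v in before_dice_graph_weights],
        #node_color='lightgray',
        # look values up by node so colours and sizes follow the graph's node order
        node_color=[bef_partition[n] for n in before_dice_graph.nodes],
        cmap=cmap,
        linewidths=2,
        #node_size=[v * 10000 for v in before_dice_graph_degrees.values()],
        node_size=[(before_betweenness_dict[n] * 100000)+2000 for n in before_dice_graph.nodes],
        font_size=100,
        font_family="Helvetica",
        font_color="black",
        font_weight=3,
        edge_cmap=newCmap,
        edge_vmin=0
       )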

plt.title("Before strict lockdown".upper(), fontsize=150, **csfont)
plt.axis('off')
 
plt.savefig('output.png')
plt.show()

##### During

In [None]:
lock_partition = community_louvain.best_partition(lock_dice_graph, weight='lock_weight')


In [None]:
lock_partition

##### Post

In [None]:
post_partition = community_louvain.best_partition(post_dice_graph, weight='post_weight')


In [None]:
post_partition

#### Alternative: Using Girvan-Newman algorithm with edge-betweenness-centrality

https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.community.centrality.girvan_newman.html#networkx.algorithms.community.centrality.girvan_newman

Ref: https://www.pnas.org/content/101/9/2658
Ref: 


In [None]:
from networkx import edge_betweenness_centrality as betweenness

In [None]:
from networkx.algorithms.community import girvan_newman

In [None]:
def _most_central_edge(G, weights):
    centrality = betweenness(G, weight=weights)
    return max(centrality, key=centrality.get)

#### before

In [None]:
def bef_most_central_edge(G):
    return _most_central_edge(G, weights='bef_weight')

In [None]:
_most_central_edge(before_dice_graph, weights='bef_weight')

In [None]:
bef_gn = girvan_newman(before_dice_graph, 
                   most_valuable_edge=bef_most_central_edge)

In [None]:
bef_comp = tuple(sorted(c) for c in next(bef_gn))

In [None]:
bef_comp

#### During

In [None]:
def lock_most_central_edge(G):
    return _most_central_edge(G, weights='lock_weight')

In [None]:
lock_gn = girvan_newman(lock_dice_graph, 
                   most_valuable_edge=lock_most_central_edge)

In [None]:
tuple(sorted(c) for c in next(lock_gn))

#### Post

In [None]:
def post_most_central_edge(G):
    return _most_central_edge(G, weights='post_weight')

In [None]:
post_gn = girvan_newman(post_dice_graph, 
                   most_valuable_edge=post_most_central_edge)

In [None]:
tuple(sorted(c) for c in next(post_gn))

https://pypi.org/project/ForceAtlas2/
https://noduslabs.com/wp-content/uploads/2019/06/InfraNodus-Paranyushkin-WWW19-Conference.pdf

We then apply community detection algorithm [37], [38] based on modularity. This is an iterative algorithm that detects the groups of nodes that are more densely connected together than with the rest of the network. As a result, we obtain the groups of nodes (words) which tend to appear together in the text: topical clusters. We then apply Force-Atlas algorithm [39], which aligns densely connected clusters together while pushing the most connected nodes apart, so that the network structure is more visible on the graph. We then get a visual network representation of the text with a clearly defined community structure (using both color and network topology) and the specific topical clusters