# This notebook

In this notebook, we explore the co-occurrence of keywords across a set of temporal subsets to detect patterns of change in co-occurrence.

Temporal subsets are defined according to key events in the timeline of the COVID-19 pandemic in the UK:

- up to 23 March 2020 (excluded): pre-lockdown
- 23 March to 10 May 2020: strict lockdown
- 11 May 2020 onwards: post-strict lockdown (lockdown eases)

Note that there are additional dates that we could have considered (e.g., 14 March 2020, when the "herd immunity" approach was mentioned; 13 June 2020, when "social bubbles" were introduced; 15 June 2020, when non-essential shops reopened), but these would create temporal sub-windows with too little data.

We will:

- [keyword class] Classify keywords according to their normalised corpus frequency and relative document frequency values, for each of the three main time windows
- [co-occurrence] For each temporal window, calculate the co-occurrence of keyword pairs as Positive Pointwise Mutual Information and Simpson coefficient
- Identify changes in keyword class and keyword co-occurrence across the temporal windows.
- Create networks of keyword co-occurrences for each of the three temporal windows and compare network and node characteristics across the three.


## Settings

In [1]:
import os

In [2]:
import numpy as np

In [3]:
from math import log2

In [4]:
import seaborn as sns
import matplotlib.pyplot as plt

In [5]:
import networkx as nx
from operator import itemgetter

In [6]:
%matplotlib inline

In [7]:
from src.news_media.get_keywords_trend import *

/Users/alessiatosi/DS_projects/behavioural-sci-perception/docs/ext/keywords.yaml has been successfully loaded as a dict
/Users/alessiatosi/DS_projects/behavioural-sci-perception/docs/ext/subkw_to_kw_map.yaml has been successfully loaded as a dict


In [8]:
pd.set_option('display.max_colwidth', None)

The config file

In [9]:
CONFIG.keys()

dict_keys(['NgramRange', 'Actors', 'BehavSci', 'Behav_ins', 'Behav_chan', 'Behav_pol', 'Behav_anal', 'Psych', 'Econ_behav', 'Econ_irrational', 'Nudge', 'Nudge_choice', 'Nudge_pater', 'Covid', 'Fatigue', 'Immunity'])

## Import UK's news articles

In [10]:
news_uk = NewsArticles(country="uk")

`news_uk` is a `NewsArticles` class instance, with the following public attributes and methods:

In [11]:
[d for d in dir(news_uk) if not d.startswith("_")]

['allwords_raw_tf',
 'country',
 'data',
 'dates',
 'expand_dict',
 'get_num_ngrams',
 'kword_docfreq_week',
 'kword_rawfreq',
 'kword_rawfreq_week',
 'kword_reldocfreq_week',
 'kword_relfreq_week',
 'kword_rfrdf_week',
 'kword_yn_occurrence',
 'subkword_raw_tf',
 'unigram_count_perdoc']

`news_uk.data` contains the original dataset of articles:

In [12]:
news_uk.data.shape

(464, 13)

In [13]:
## Extract data needed for analysis

In [15]:
news_uk.subkword_raw_tf
kword_rawfreqs = news_uk.kword_rawfreq.copy()

In [16]:
kword_df = news_uk.kword_yn_occurrence.copy()

## Group data into time windows

According to dates: before 23 March, from 23 March to 10 May, from 11 May onwards.

In [17]:
kword_rawfreqs.index

MultiIndex([(  0, '2020-01-23', 149),
            (  1, '2020-01-23', 336),
            (  2, '2020-01-23', 295),
            (  3, '2020-01-23', 185),
            (  4, '2020-01-26', 123),
            (  5, '2020-01-26', 272),
            (  6, '2020-01-27', 658),
            (  7, '2020-01-28', 194),
            (  8, '2020-01-29', 158),
            (  9, '2020-01-29',  78),
            ...
            (454, '2020-05-10', 352),
            (455, '2020-05-10',  46),
            (456, '2020-05-10', 111),
            (457, '2020-05-10', 210),
            (458, '2020-05-10', 234),
            (459, '2020-05-10', 207),
            (460, '2020-05-10', 137),
            (461, '2020-05-10', 275),
            (462, '2020-05-10', 150),
            (463, '2020-05-10', 215)],
           names=['id', 'pub_date', 'word_count'], length=464)

In [18]:
from datetime import datetime

def label_weeks(date):
    """Assign a publication date to one of the three time windows."""
    last_pre_lockdown_day = datetime.strptime("2020-03-22", '%Y-%m-%d')
    last_strict_lockdown_day = datetime.strptime("2020-05-10", '%Y-%m-%d')
    if date <= last_pre_lockdown_day:
        return "before-lockdown"
    if date <= last_strict_lockdown_day:
        return "lockdown"
    # any later date, i.e. 11 May 2020 onwards
    return "post-lockdown"
    

In [19]:
# indexes as columns
kword_rawfreqs.reset_index(['pub_date'], inplace=True)
kword_df.reset_index(['pub_date'], inplace=True)

In [20]:
kword_rawfreqs["time_window"] = kword_rawfreqs.pub_date.apply(label_weeks)

In [21]:
kword_df["time_window"] = kword_df.pub_date.apply(label_weeks)

In [22]:
kword_df.set_index([kword_df.index, kword_df.pub_date, kword_df.time_window], inplace=True, drop=True)
kword_df.drop(['pub_date', 'time_window'], inplace=True, axis=1)

In [23]:
kword_rawfreqs.set_index([kword_rawfreqs.index, kword_rawfreqs.pub_date, kword_rawfreqs.time_window], inplace=True, drop=True)
kword_rawfreqs.drop(['pub_date', 'time_window'], inplace=True, axis=1)

# Keywords corpus frequency per time window

Count of keyword occurrences divided by the total word count in each time window.
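
As a worked example of this definition, the sketch below recomputes the normalised frequency of `behav_science` from the counts reported in the aggregated table of this notebook (104 occurrences in 48,496 words before lockdown; 224 occurrences in 85,084 words during lockdown). The helper name `normalised_keyword_frequency` is introduced here for illustration only and is not part of the project code.

```python
# Counts copied from the aggregated table in this notebook.
word_counts = {"before-lockdown": 48496, "lockdown": 85084}
behav_science_counts = {"before-lockdown": 104, "lockdown": 224}


def normalised_keyword_frequency(keyword_count, total_words):
    """Keyword occurrences divided by the total word count of the window."""
    return keyword_count / total_words


nkf = {
    window: normalised_keyword_frequency(behav_science_counts[window], word_counts[window])
    for window in word_counts
}

for window, value in nkf.items():
    print(f"{window}: {value:.6f}")
```

Both values are of the order of 0.002, with the lockdown value slightly higher, which shows why raw counts (104 vs 224) should not be compared directly across windows of different size.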

In [24]:
kword_rawfreqs_agg = kword_rawfreqs.reset_index(['id', 'word_count']).groupby('time_window').agg(
    word_count=('word_count', 'sum')).merge(
                        kword_rawfreqs.groupby('time_window').sum(),
                        on='time_window')

In [25]:
kword_rawfreqs_agg

| time_window | word_count | american_behav_scientists | behav_analysis | behav_change | behav_econ | behav_insight | behav_insights_team | behav_science | chater | halpern | michie | nudge | nudge_choice | nudge_paternalism | psychology | spi-b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| before-lockdown | 48496 | 17 | 1 | 5 | 10 | 2 | 67 | 104 | 1 | 35 | 44 | 47 | 1 | 1 | 47 | 7 |
| lockdown | 85084 | 23 | 5 | 30 | 24 | 8 | 52 | 224 | 1 | 30 | 85 | 52 | 0 | 1 | 100 | 83 |


In [26]:
kword_agg_nkf = kword_rawfreqs_agg.iloc[:, 1:].div(kword_rawfreqs_agg.word_count, axis=0)

In [27]:
kword_agg_nkf_long = pd.melt(
                    kword_agg_nkf.reset_index(),
                    id_vars=['time_window'],
                    var_name='kword',
                    value_name='nkf')

In [28]:
kword_agg_nkf_long

| | time_window | kword | nkf |
|---|---|---|---|
| 0 | before-lockdown | american_behav_scientists | 0.000351 |
| 1 | lockdown | american_behav_scientists | 0.00027 |
| 2 | before-lockdown | behav_analysis | 2.1e-05 |
| 3 | lockdown | behav_analysis | 5.9e-05 |
| 4 | before-lockdown | behav_change | 0.000103 |
| 5 | lockdown | behav_change | 0.000353 |
| 6 | before-lockdown | behav_econ | 0.000206 |
| 7 | lockdown | behav_econ | 0.000282 |
| 8 | before-lockdown | behav_insight | 4.1e-05 |
| 9 | lockdown | behav_insight | 9.4e-05 |


## Keyword document frequency per time window

In [None]:
kword_df_agg = kword_df.reset_index(['id']).groupby('time_window').agg(
    article_count=('id', 'count')).merge(
                        kword_df.groupby('time_window').sum(),
                        on='time_window')

In [None]:
kword_agg_rdf = kword_df_agg.iloc[:, 1:].div(kword_df_agg.article_count, axis=0)

In [None]:
kword_rdf_agg_long = pd.melt(
                    kword_agg_rdf.reset_index(),
                    id_vars=['time_window'],
                    var_name='kword',
                    value_name='rdf')

In [None]:
kword_rdf_agg_long

## Median normalised keyword frequency

Note that the `median nkf` is calculated from the keywords' nkf values only, not from the nkf of all words or all nouns. So "high-frequency" and "low-frequency" keywords (i.e., keywords above and below the median) must be interpreted relative to the use of the other keywords, not relative to all words or nouns used in the articles.

An alternative approach would be to set a threshold value, or to calculate the normalised word frequencies for all words (or nouns) in the articles and take their median (more time-consuming, as it would require redoing some pre-existing steps).
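
To make the median split concrete, here is a minimal sketch that applies it to the ten nkf values displayed above (five keywords, two windows). Because it uses only those ten rows, its median and labels are illustrative and will not match the ones obtained from the full keyword set; the variable names are assumptions made for this example.

```python
from statistics import median

# (time_window, keyword) -> nkf, copied from the ten rows displayed above.
nkf = {
    ("before-lockdown", "american_behav_scientists"): 0.000351,
    ("lockdown", "american_behav_scientists"): 0.00027,
    ("before-lockdown", "behav_analysis"): 2.1e-05,
    ("lockdown", "behav_analysis"): 5.9e-05,
    ("before-lockdown", "behav_change"): 0.000103,
    ("lockdown", "behav_change"): 0.000353,
    ("before-lockdown", "behav_econ"): 0.000206,
    ("lockdown", "behav_econ"): 0.000282,
    ("before-lockdown", "behav_insight"): 4.1e-05,
    ("lockdown", "behav_insight"): 9.4e-05,
}

# One median pooled over both windows, as in the cells below.
pooled_median = median(nkf.values())

# A keyword is "high" in a window only if it is strictly above the median.
nkf_type = {key: ("high" if value > pooled_median else "low") for key, value in nkf.items()}

# Keywords whose label differs between the two windows.
changed = sorted(
    kword
    for (window, kword) in nkf
    if window == "before-lockdown"
    and nkf_type[(window, kword)] != nkf_type[("lockdown", kword)]
)
print(pooled_median)
print(changed)
```

In this subset only `behav_change` moves from "low" to "high"; the other four keywords keep the same label, which illustrates how coarse a median split over keywords alone can be.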

In [None]:
# nkf_medians = kword_agg_nkf_long.groupby('time_window').agg({'nkf':'median'})
# print(nkf_medians)

In [None]:
# thanks to "transformation"
# kword_agg_nkf_long['above_median'] = kword_agg_nkf_long['nkf'] - kword_agg_nkf_long.groupby('time_window')['nkf'].transform('median') > 0

In [None]:
nkf_median = kword_agg_nkf_long.nkf.median()
kword_agg_nkf_long['nkf_above_median'] = kword_agg_nkf_long['nkf'] - nkf_median > 0

In [None]:
kword_agg_nkf_long

In [None]:
truthvalue2type_dict = {
    False: "low",
    True: "high"
}

In [None]:
kword_agg_nkf_long["nkf_type"] = kword_agg_nkf_long.nkf_above_median.apply(lambda row: truthvalue2type_dict.get(row))

### Which keywords have high vs low (above vs below median) normalised corpus frequency in the two time windows? 

In [None]:
kword_agg_nkf_long.groupby(['time_window', 'nkf_type']).kword.apply(list).reset_index(
    name='kwords').pivot(index='time_window', columns='nkf_type')['kwords']

### How has a keyword's corpus frequency changed across time windows?

In [None]:
kword_agg_nkf_long.pivot(index='kword', columns='time_window')['nkf_type']

The lack of change for most keywords may be partly explained by the fact that high vs low is defined relative to the median frequency of the keywords themselves rather than the median frequency calculated across all words.

## Median relative document frequency by time window

In [None]:
rdf_median = kword_rdf_agg_long.rdf.median()

In [None]:
kword_rdf_agg_long['rdf_above_median'] = kword_rdf_agg_long['rdf'] - rdf_median > 0

In [None]:
kword_rdf_agg_long

In [None]:
kword_rdf_agg_long["rdf_type"] = kword_rdf_agg_long.rdf_above_median.apply(lambda row: truthvalue2type_dict.get(row))

In [None]:
kword_rdf_agg_long

### Which keywords have high vs low (above vs below median) relative doc frequency in the two time windows? 

In [None]:
kword_rdf_agg_long.groupby(['time_window', 'rdf_type']).kword.apply(list).reset_index(
    name='kwords').pivot(index='time_window', columns='rdf_type')['kwords']

In [None]:
kword_rdf_agg_long.pivot(index='kword', columns='time_window')['rdf_type']

## Combine the two datasets together

In [None]:
kword_agg_nkf_rdf = kword_agg_nkf_long.merge(kword_rdf_agg_long, on = ['time_window', 'kword'])

Let's take a look

Pre-lockdown

In [None]:
kword_agg_nkf_rdf[kword_agg_nkf_rdf.time_window == "before-lockdown"].groupby(['nkf_type', 'rdf_type']).kword.apply(list).reset_index(
    name='kwords').pivot(index='nkf_type', columns='rdf_type')['kwords']

Lockdown

In [None]:
kword_agg_nkf_rdf[kword_agg_nkf_rdf.time_window == "lockdown"].groupby(['nkf_type', 'rdf_type']).kword.apply(list).reset_index(
    name='kwords').pivot(index='nkf_type', columns='rdf_type')['kwords']

This does not seem to provide much insight, as keywords that have an above-median normalised keyword frequency also have an above-median relative document frequency.

# Co-occurrence

## Remove keywords that do not appear in our corpus

Our keywords were theory-driven, so some do not appear in the corpus. Let's remove them.

In [None]:
kword_df.drop(['irrational_econ', 'behav_policy'], inplace=True, axis=1)
kword_rawfreqs.drop(['irrational_econ', 'behav_policy'], inplace=True, axis=1)

## Separate before-lockdown vs lockdown data

In [None]:
kword_df_before = kword_df[kword_df.index.get_level_values('time_window').isin(['before-lockdown'])]
kword_df_lock = kword_df[kword_df.index.get_level_values('time_window').isin(['lockdown'])]
kword_rawfreqs_before = kword_rawfreqs[kword_rawfreqs.index.get_level_values('time_window').isin(['before-lockdown'])]
kword_rawfreqs_lock = kword_rawfreqs[kword_rawfreqs.index.get_level_values('time_window').isin(['lockdown'])]

# Simpson's coefficient

One approach is the Simpson coefficient, which has been reported to represent co-occurrence well even for keywords with a low appearance count in a document.

Ref: 
https://onlinelibrary.wiley.com/doi/pdf/10.1002/ecj.10347

https://www.aclweb.org/anthology/C12-2049.pdf

https://www.aclweb.org/anthology/J05-4002.pdf

`Simpson coefficient = count(w1, w2) / min(count(w1), count(w2))`
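
The formula can be illustrated on toy data. The sketch below is not the notebook's implementation: it represents each keyword by the set of article ids in which it occurs (the ids and the function name `simpson_coefficient` are made up for this example) and returns NaN when one of the keywords never occurs.

```python
def simpson_coefficient(docs_w1, docs_w2):
    """Shared documents divided by the smaller of the two document counts."""
    smaller = min(len(docs_w1), len(docs_w2))
    if smaller == 0:
        # One keyword never occurs, so the coefficient is undefined.
        return float("nan")
    return len(docs_w1 & docs_w2) / smaller


# Hypothetical article ids in which each keyword occurs.
docs_frequent = {1, 2, 3, 4, 5, 6}
docs_rare = {5, 6}
docs_other = {3, 4, 7}

nested = simpson_coefficient(docs_frequent, docs_rare)    # 2 shared / min(6, 2)
partial = simpson_coefficient(docs_frequent, docs_other)  # 2 shared / min(6, 3)
print(nested, partial)
```

Note that the rare keyword reaches the maximum value of 1.0 because all of its (two) articles also contain the frequent keyword, regardless of how often the frequent keyword appears elsewhere.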

In [None]:
from itertools import combinations
def calc_simpson(yn_occurence_data, kwords_list, prefix=""):
    # keyword document occurrence
    kword_docfreqs = yn_occurence_data.sum(axis=0)
    # keywords co-occurrence matrix
    kword_cooccurences = yn_occurence_data.values.T.dot(yn_occurence_data.values)
    np.fill_diagonal(kword_cooccurences, 0)
    kwords = yn_occurence_data.columns
    kword_cooccurences = pd.DataFrame(kword_cooccurences, index=kwords, columns=kwords)
    kword_cooccurences = kword_cooccurences.stack()
    
    
    def _simpson(w1, w2):
        # print(f"{w1}: {kword_docfreqs[w1]}")
        # print(f"{w2}: {kword_docfreqs[w2]}")
        # print(f"coocc: {kword_cooccurences[w1][w2]}")
        try:
            return kword_cooccurences[w1][w2]/ (min(kword_docfreqs[w1],kword_docfreqs[w2]))
        except (ValueError, ZeroDivisionError) as err: # one of the two individual counts are 0
            return np.nan
        
    def simpson(kwords_list: list) -> list:
        coefs = []
        for pair in combinations(kwords_list, r=2):
            coefs.append((*pair, _simpson(*pair), kword_cooccurences[pair[0]][pair[1]], kword_docfreqs[pair[0]], kword_docfreqs[pair[1]] ))
        return coefs
    
    simpsons = simpson(kwords_list=kwords_list)
    simpsons_df = pd.DataFrame(simpsons, columns=['source', 'target', f'{prefix}_weight', f'{prefix}_co-occ', f'{prefix}_source_docfreq', f'{prefix}_target_docfreq'])
    
    return simpsons_df
    

In [None]:
kwords = kword_df_before.columns.tolist()

In [None]:
before_simpson_coefs = calc_simpson(yn_occurence_data=kword_df_before, kwords_list=kwords, prefix="bef")

In [None]:
lock_simpson_coefs = calc_simpson(yn_occurence_data=kword_df_lock, kwords_list=kwords, prefix="lock")

Let's take a look

In [None]:
before_simpson_coefs.sort_values('bef_weight', ascending=False)[:30]

In [None]:
lock_simpson_coefs.sort_values('lock_weight', ascending=False)[:30]

Merge the two to compare them more easily

In [None]:
simpsons_coefs = before_simpson_coefs.merge(lock_simpson_coefs, on = ['source', 'target'])

In [None]:
simpsons_coefs[['source', 'target', 'bef_weight', 'lock_weight']][:40]

In [None]:
def trend_in_cooccurrence(score1, score2):
    if ((score1 == 0.0) or (np.isnan(score1))) and ((score2 == 0.0) or (np.isnan(score2))):
        return "never"
    if ((score1 != 0.0) and (~np.isnan(score1))) and ((score2 != 0.0) and (~np.isnan(score2))):
        return "stayed"
    if ((score1 != 0.0) and (~np.isnan(score1))) and ((score2 == 0.0) or (np.isnan(score2))):
        return "ended"
    if ((score1 == 0.0) or (np.isnan(score1))) and ((score2 != 0.0) and (~np.isnan(score2))):
        return "started"

In [None]:
simpsons_coefs['weights_trend1'] = simpsons_coefs.apply(lambda row: trend_in_cooccurrence(row['bef_weight'], row['lock_weight']), axis=1)

In [None]:
simpsons_coefs[['source', 'target', 'bef_weight', 'lock_weight','weights_trend1']][:50]

## Co-occurrences that started during lock-down

In [None]:
np.array(simpsons_coefs[simpsons_coefs.weights_trend1 == "started"][['source', 'target']])

## Co-occurrences that ended during lock-down

In [None]:
np.array(simpsons_coefs[simpsons_coefs.weights_trend1 == "ended"][['source', 'target']])

## Co-occurrences that remained during lock-down

In [None]:
np.array(simpsons_coefs[simpsons_coefs.weights_trend1 == "stayed"][['source', 'target']])

## Network based on Simpson's coefficient

### Before lockdown

In [None]:
# drop NaN cases and 0.0 values
before_simpson_coefs.dropna(inplace=True)

In [None]:
before_simpson_coefs = before_simpson_coefs[before_simpson_coefs.bef_weight > 0.0]

In [None]:
before_simpson_graph = nx.from_pandas_edgelist(before_simpson_coefs[['source', 'target', 'bef_weight']], edge_attr=True)

In [None]:
# take a look at one
print(nx.to_dict_of_dicts(before_simpson_graph).get('michie'))

In [None]:
# extract weights, we'll use them for plotting
before_simpson_graph_weights = list(nx.get_edge_attributes(before_simpson_graph,'bef_weight').values())

In [None]:
fig, ax = plt.subplots(figsize=(30,30))   
nx.draw_networkx(before_simpson_graph, 
                 with_labels=True, 
                 edge_color=before_simpson_graph_weights,
                 width=3,
                 node_color='lightgreen',
                 font_size=20,
                 font_color='red',
                 font_weight=3,
                 edge_cmap=plt.cm.Blues
                )

### During lockdown

In [None]:
# drop NaN cases and 0.0 values
lock_simpson_coefs.dropna(inplace=True)

In [None]:
lock_simpson_coefs = lock_simpson_coefs[lock_simpson_coefs.lock_weight > 0.0]

In [None]:
lock_simpson_graph = nx.from_pandas_edgelist(lock_simpson_coefs[['source', 'target', 'lock_weight']], edge_attr=True)

In [None]:
# take a look at one
print(nx.to_dict_of_dicts(lock_simpson_graph).get('michie'))

In [None]:
# extract weights, we'll use them for plotting
lock_simpson_graph_weights = list(nx.get_edge_attributes(lock_simpson_graph,'lock_weight').values())

In [None]:
fig, ax = plt.subplots(figsize=(30,30))   
nx.draw_networkx(lock_simpson_graph, 
                 with_labels=True, 
                 edge_color=lock_simpson_graph_weights,
                 width=3,
                 node_color='lightgreen',
                 font_size=20,
                 font_color='red',
                 font_weight=3,
                 edge_cmap=plt.cm.Blues
                )

In [None]:
lock_simpson_coefs[(lock_simpson_coefs.source == "nudge_paternalism")]

In [None]:
lock_simpson_coefs[(lock_simpson_coefs.target == "nudge_paternalism")]

# Dice coefficient

Another approach is the Dice coefficient, which should not inflate the importance of co-occurrence for keywords with a very low appearance count in the corpus.

Ref: 
https://onlinelibrary.wiley.com/doi/pdf/10.1002/ecj.10347

https://www.aclweb.org/anthology/C12-2049.pdf

https://www.aclweb.org/anthology/J05-4002.pdf

`Dice coefficient = (2 * count(w1, w2)) / (count(w1) + count(w2))`
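
A small comparison on toy data shows the difference from the Simpson coefficient. This is a sketch under assumed data: the article ids and both helper functions are hypothetical and only mirror the two formulas given in this notebook.

```python
def simpson_coefficient(docs_w1, docs_w2):
    """Shared documents divided by the smaller document count."""
    return len(docs_w1 & docs_w2) / min(len(docs_w1), len(docs_w2))


def dice_coefficient(docs_w1, docs_w2):
    """Twice the shared documents divided by the sum of the document counts."""
    return 2 * len(docs_w1 & docs_w2) / (len(docs_w1) + len(docs_w2))


# Hypothetical article ids: one keyword occurs in ten articles, the other in one.
docs_frequent = set(range(1, 11))
docs_rare = {1}

simpson = simpson_coefficient(docs_frequent, docs_rare)  # 1 / min(10, 1)
dice = dice_coefficient(docs_frequent, docs_rare)        # 2 * 1 / (10 + 1)
print(simpson, dice)
```

A single shared article is enough to give the pair the maximum Simpson value, whereas the Dice value stays low (2/11) because the denominator also accounts for the frequent keyword's document count.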

In [None]:
from itertools import combinations
def calc_dice(yn_occurence_data, kwords_list, prefix=""):
    # keyword document occurrence
    kword_docfreqs = yn_occurence_data.sum(axis=0)
    # keywords co-occurrence matrix
    kword_cooccurences = yn_occurence_data.values.T.dot(yn_occurence_data.values)
    np.fill_diagonal(kword_cooccurences, 0)
    kwords = yn_occurence_data.columns
    kword_cooccurences = pd.DataFrame(kword_cooccurences, index=kwords, columns=kwords)
    kword_cooccurences = kword_cooccurences.stack()
    
    
    def _dice(w1, w2):
        # print(f"{w1}: {kword_docfreqs[w1]}")
        # print(f"{w2}: {kword_docfreqs[w2]}")
        # print(f"coocc: {kword_cooccurences[w1][w2]}")
        try:
            return (2 * kword_cooccurences[w1][w2]) / (kword_docfreqs[w1] + kword_docfreqs[w2])
        except (ValueError, ZeroDivisionError) as err: # one of the two individual counts are 0
            return np.nan
        
    def dice(kwords_list: list) -> list:
        coefs = []
        for pair in combinations(kwords_list, r=2):
            coefs.append((*pair, _dice(*pair), kword_cooccurences[pair[0]][pair[1]], kword_docfreqs[pair[0]], kword_docfreqs[pair[1]] ))
        return coefs
    
    dices = dice(kwords_list=kwords_list)
    dices_df = pd.DataFrame(dices, columns=['source', 'target', f'{prefix}_weight', f'{prefix}_co-occ', f'{prefix}_source_docfreq', f'{prefix}_target_docfreq'])
    
    return dices_df
    

In [None]:
kwords = kword_df_before.columns.tolist()

In [None]:
before_dice_coefs = calc_dice(yn_occurence_data=kword_df_before, kwords_list=kwords, prefix="bef")

In [None]:
lock_dice_coefs = calc_dice(yn_occurence_data=kword_df_lock, kwords_list=kwords, prefix="lock")

Let's take a look

In [None]:
before_dice_coefs.sort_values('bef_weight', ascending=False)[:30]

In [None]:
lock_dice_coefs.sort_values('lock_weight', ascending=False)[:30]

Merge the two to compare them more easily

In [None]:
dice_coefs = before_dice_coefs.merge(lock_dice_coefs, on = ['source', 'target'])

In [None]:
dice_coefs[['source', 'target', 'bef_weight', 'lock_weight']][40:100]

In [None]:
def trend_in_cooccurrence(score1, score2):
    if ((score1 == 0.0) or (np.isnan(score1))) and ((score2 == 0.0) or (np.isnan(score2))):
        return "never"
    if ((score1 != 0.0) and (~np.isnan(score1))) and ((score2 != 0.0) and (~np.isnan(score2))):
        return "stayed"
    if ((score1 != 0.0) and (~np.isnan(score1))) and ((score2 == 0.0) or (np.isnan(score2))):
        return "ended"
    if ((score1 == 0.0) or (np.isnan(score1))) and ((score2 != 0.0) and (~np.isnan(score2))):
        return "started"

In [None]:
dice_coefs['weights_trend1'] = dice_coefs.apply(lambda row: trend_in_cooccurrence(row['bef_weight'], row['lock_weight']), axis=1)

In [None]:
dice_coefs[['source', 'target', 'bef_weight', 'lock_weight','weights_trend1']][:50]

## Co-occurrences that started during lock-down

In [None]:
np.array(dice_coefs[dice_coefs.weights_trend1 == "started"][['source', 'target']])

## Co-occurrences that ended during lock-down

In [None]:
np.array(dice_coefs[dice_coefs.weights_trend1 == "ended"][['source', 'target']])

## Co-occurrences that remained during lock-down

In [None]:
np.array(dice_coefs[dice_coefs.weights_trend1 == "stayed"][['source', 'target']])

## Network based on Dice coefficient

### Before lockdown

In [None]:
# drop NaN cases and 0.0 values
before_dice_coefs.dropna(inplace=True)

In [None]:
before_dice_coefs = before_dice_coefs[before_dice_coefs.bef_weight > 0.0]

In [None]:
before_dice_graph = nx.from_pandas_edgelist(before_dice_coefs[['source', 'target', 'bef_weight']], edge_attr=True)

In [None]:
# take a look at one
print(nx.to_dict_of_dicts(before_dice_graph).get('michie'))

In [None]:
# extract weights, we'll use them for plotting
before_dice_graph_weights = list(nx.get_edge_attributes(before_dice_graph,'bef_weight').values())

In [None]:
fig, ax = plt.subplots(figsize=(30,30))   
nx.draw_networkx(before_dice_graph, 
                 with_labels=True, 
                 edge_color=before_dice_graph_weights,
                 width=3,
                 node_color='lightgreen',
                 font_size=20,
                 font_color='red',
                 font_weight=3,
                 edge_cmap=plt.cm.Blues
                )

### During lockdown

In [None]:
# drop NaN cases and 0.0 values
lock_dice_coefs.dropna(inplace=True)

In [None]:
lock_dice_coefs = lock_dice_coefs[lock_dice_coefs.lock_weight > 0.0]

In [None]:
lock_dice_graph = nx.from_pandas_edgelist(lock_dice_coefs[['source', 'target', 'lock_weight']], edge_attr=True)

In [None]:
# take a look at one
print(nx.to_dict_of_dicts(lock_dice_graph).get('michie'))

In [None]:
# extract weights, we'll use them for plotting
lock_dice_weights = list(nx.get_edge_attributes(lock_dice_graph,'lock_weight').values())

In [None]:
fig, ax = plt.subplots(figsize=(30,30))   
nx.draw_networkx(lock_dice_graph, 
                 with_labels=True, 
                 edge_color=lock_dice_weights,
                 width=3,
                 node_color='lightgreen',
                 font_size=20,
                 font_color='red',
                 font_weight=3,
                 edge_cmap=plt.cm.Blues
                )

## Characteristics of the two networks and nodes

Main ref: https://programminghistorian.org/en/lessons/exploring-and-analyzing-network-data-with-python

### Number of nodes (keywords that co-occurred)

In [None]:
print("Number of keywords co-occurring before-lockdown:", len(before_dice_graph.nodes))

In [None]:
print("Number of keywords co-occurring during-lockdown:", len(lock_dice_graph.nodes))

### Network density

Network density
= ratio between the actual number of connections between nodes and the maximum possible number of connections.

It gives a sense of how closely knit the network is: a higher value (within [0,1]) indicates a more cohesive network, that is, a set of keywords that tend to co-occur.
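
As an illustration of the ratio, the sketch below computes the density of a small undirected graph by hand and checks it against `nx.density`. The four keywords are taken from this notebook, but the edges between them are invented for the example.

```python
import networkx as nx

# Hypothetical co-occurrence edges forming a simple chain of four keywords.
edges = [
    ("nudge", "psychology"),
    ("psychology", "behav_science"),
    ("behav_science", "spi-b"),
]
graph = nx.Graph()
graph.add_edges_from(edges)

n_nodes = graph.number_of_nodes()
n_edges = graph.number_of_edges()

# In an undirected graph without self-loops, at most n * (n - 1) / 2 edges exist.
max_edges = n_nodes * (n_nodes - 1) / 2
manual_density = n_edges / max_edges

print(manual_density, nx.density(graph))
```

With 3 of the 6 possible edges present, both computations give a density of 0.5.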



In [None]:
before_density = nx.density(before_dice_graph)
print("Network density (before lockdown):", before_density)

In [None]:
lock_density = nx.density(lock_dice_graph)
print("Network density (during lockdown):", lock_density)

Network density has increased during lockdown compared to pre-lockdown. 

Interpretation: an increase in the general tendency of keywords to co-occur in the same documents.

### Network Clustering Coefficient

= number of connections between the neighbour nodes of a node / maximum possible number of connections between its neighbour nodes

(neighbour nodes are the nodes directly connected to a node).

A measure of the degree to which nodes in a graph tend to cluster together.
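
A toy example may help to read the coefficient. The sketch below uses an invented four-node graph (a triangle plus one pendant node) and the unweighted versions of `nx.clustering` and `nx.average_clustering`; the node labels are placeholders, not keywords from the corpus.

```python
import networkx as nx

# Triangle a-b-c, plus node d attached to a only.
graph = nx.Graph()
graph.add_edges_from([("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")])

# Local coefficient of each node: connected neighbour pairs / possible neighbour pairs.
local = nx.clustering(graph)

# Network-level value: the mean of the local coefficients.
average = nx.average_clustering(graph)

print(local)
print(average)
```

Node `a` has three neighbours (`b`, `c`, `d`), of which only the pair `b`-`c` is connected, so its coefficient is 1/3; `b` and `c` score 1, and `d`, with a single neighbour, scores 0. The network average is therefore 7/12.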

In [None]:
before_clustcoef = nx.average_clustering(before_dice_graph, weight='bef_weight')
print("Network clustering coefficient (before lockdown):", before_clustcoef)

In [None]:
lock_clustcoef = nx.average_clustering(lock_dice_graph, weight='lock_weight')
print("Network clustering coefficient (during lockdown):", lock_clustcoef)

Remained stable. 

## Centrality measures

Identify the nodes (keywords) that are most important in the networks and compare their ranking over time.

### Node Degree

The number of connections a node has.

Here: with how many different keywords does each keyword co-occur?
Note that this is likely to be proportional to the keyword's frequency, which is something we can also report.

In [None]:
def get_node_degree(graph):
    node_degree_dict = {}
    for node in graph.nodes:
        node_degree_dict[node] = nx.degree(graph, node)
    return node_degree_dict    

Before lockdown

In [None]:
before_node_degrees = pd.Series(get_node_degree(before_dice_graph)).sort_values(ascending=False)
print(before_node_degrees)

During lockdown

In [None]:
lock_node_degrees = pd.Series(get_node_degree(lock_dice_graph)).sort_values(ascending=False)
print(lock_node_degrees)

In [None]:
# alternative way to calculate it

In [None]:
before_degree_dict = dict(before_dice_graph.degree(before_dice_graph.nodes()))
nx.set_node_attributes(before_dice_graph, before_degree_dict, 'degree')

In [None]:
lock_degree_dict = dict(lock_dice_graph.degree(lock_dice_graph.nodes()))
nx.set_node_attributes(lock_dice_graph, lock_degree_dict, 'degree')

### Node Betweenness Centrality

Betweenness centrality doesn’t care about the number of edges any one node or set of nodes has. Betweenness centrality looks at all the shortest paths that pass through a particular node.

So a keyword with a high betweenness centrality is a keyword that works as a bridge by connecting several different other keywords - i.e., it is discussed in articles with a wider variety of other keywords.
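
The bridging idea can be seen on an invented graph in which two tightly connected groups are linked only through one node. The labels (`a1`..`a4`, `b1`..`b4`, `bridge`) are placeholders chosen for this sketch, not keywords from the corpus.

```python
from itertools import combinations

import networkx as nx

group_a = ["a1", "a2", "a3", "a4"]
group_b = ["b1", "b2", "b3", "b4"]

graph = nx.Graph()
# Every node is connected to every other node within its own group.
graph.add_edges_from(combinations(group_a, 2))
graph.add_edges_from(combinations(group_b, 2))
# The two groups are linked only through the "bridge" node.
graph.add_edges_from([("a1", "bridge"), ("bridge", "b1")])

betweenness = nx.betweenness_centrality(graph)  # normalised by default
degree = dict(graph.degree())

top_node = max(betweenness, key=betweenness.get)
print(top_node, betweenness[top_node], degree[top_node])
```

Here `bridge` has the fewest connections (degree 2) but the highest betweenness, since all 16 shortest paths between the two groups pass through it (16 of the 28 node pairs that exclude it).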

Pre-lockdown

In [None]:
before_betweenness_dict = nx.betweenness_centrality(before_dice_graph) 

# Assign each to an attribute in your network
nx.set_node_attributes(before_dice_graph, before_betweenness_dict, 'betweenness')


In [None]:
sorted(before_betweenness_dict.items(), key=itemgetter(1), reverse=True)

Compare degree and betweenness centrality

In [None]:
#Then find and print their degree
for tb in sorted(before_betweenness_dict.items(), key=itemgetter(1), reverse=True): 
    degree = before_degree_dict[tb[0]] # Use degree_dict to access a node's degree
    print("Name:", tb[0], "| Betweenness Centrality:", tb[1], "| Degree:", degree)

During lockdown

In [None]:
lock_betweenness_dict = nx.betweenness_centrality(lock_dice_graph) 

# Assign each to an attribute in your network
nx.set_node_attributes(lock_dice_graph, lock_betweenness_dict, 'betweenness')


sorted(lock_betweenness_dict.items(), key=itemgetter(1), reverse=True)