# A bag-of-words topic model for BirdLife

## Extracting sentences from the BirdLife text

Load the text data scraped from BirdLife assessments. The cleaned text that we will use is in the field 'text_short'. 

In [1]:
import pandas as pd

datapath = "../../data/"
bli_master = datapath + "master-BLI-11107.csv"

# the suffix 11107 is the number of species in the data frame

df = pd.read_csv(bli_master, index_col = None).fillna('')
df.head()

| Unnamed: 0 | link | name_com | name_sci | SISRecID | date | text_main | text_short | x | y | status |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://datazone.birdlife.org/species/factsheet... | Cream-browed White-eye | Heleia superciliaris | 22714307 | 2022-01-31 | \n Justification of Red List Category\nAlthoug... | Although this species may have a restricted ra... | 10.250856 | -0.3714 | LC |
| 1 | http://datazone.birdlife.org/species/factsheet... | Striped Sparrow | Oriturus superciliosus | 22721301 | 2022-01-31 | \n Justification of Red List Category\nThis sp... | This species has a very large range, and hence... | 14.209733 | -2.746421 | LC |
| 2 | http://datazone.birdlife.org/species/factsheet... | White-chinned Prinia | Schistolais leucopogon | 22713643 | 2022-01-31 | \n Justification of Red List Category\nThis sp... | This species has an extremely large range, and... | 4.975323 | 4.849581 | LC |
| 3 | http://datazone.birdlife.org/species/factsheet... | Masked Water-tyrant | Fluvicola nengeta | 22700284 | 2022-01-31 | \n Justification of Red List Category\nThis sp... | This species has an extremely large range, and... | 6.9901 | -0.280215 | LC |
| 4 | http://datazone.birdlife.org/species/factsheet... | Lendu Crombec | Sylvietta chapini | 22715107 | 2022-01-31 | \n Justification of Red List Category\nThis sp... | This species is listed as Critically Endangere... | 5.344564 | 4.205028 | CR |


Using <a href="https://spacy.io/">spaCy</a>, we'll extract a list of distinct sentences from across all the 'text_short' values.

In [2]:
import spacy
from spacy.matcher import Matcher                                                                                                                                                                                         

# load a language model
nlp = spacy.load('en_core_web_md') 

# find verb phrases: an optional verb, any number of adverbs,
# then one or more verbs
def verbs(sent):
    pattern = [
        {'POS': 'VERB', 'OP': '?'},
        {'POS': 'ADV', 'OP': '*'},
        {'POS': 'VERB', 'OP': '+'}
    ]
    # instantiate a Matcher instance
    matcher = Matcher(nlp.vocab)
    # add pattern to matcher
    matcher.add('verb-phrases', [pattern])
    # re-parse the sentence text on its own
    d = nlp(sent.text)
    # call the matcher to find matches
    matches = matcher(d)
    spans = [d[start:end] for _, start, end in matches]
    return spans

# keep sentences with more than 3 tokens and at least one verb phrase
# (the cheap length test runs first, so short sentences are never re-parsed)
def clean_sentences(sents):
    sentences = [s for s in sents if len(s) > 3 and
                    len(verbs(s)) > 0]
    return sentences

# build our list, called 'sentences'
sentences = []
count = 0

for i in range(df.shape[0]):
    txt = df.at[i, 'text_short']
    doc = nlp(txt)
    new_sents = clean_sentences(list(doc.sents))
    count += len(new_sents)
    sentences += [str(x) for x in new_sents]
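    if i % 500 == 0:
        # dedupe and show progress
        sentences = list(set(sentences))
        print(f'{i}: {count} --> {len(sentences)}')

# final dedupe: rows processed after the last progress step
# have not been deduplicated yet
sentences = list(set(sentences))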


0: 6 --> 6
500: 5876 --> 3772
1000: 12448 --> 8263
1500: 19036 --> 12734
2000: 25598 --> 17116
2500: 31899 --> 21098
3000: 38509 --> 25490
3500: 44571 --> 29263
4000: 50407 --> 32734
4500: 56887 --> 36945
5000: 63580 --> 41319
5500: 69746 --> 45191
6000: 76384 --> 49520
6500: 84476 --> 55540
7000: 93551 --> 62462
7500: 103195 --> 69869
8000: 109954 --> 74205
8500: 115461 --> 77119
9000: 119839 --> 78991
9500: 125234 --> 82063
10000: 130274 --> 84643
10500: 135180 --> 87115
11000: 140673 --> 90070


That took a while, so let's write the sentences to disk for re-use.

In [3]:
outfile = datapath + "bli_sentences_11107.txt"
with open(outfile, 'w') as fp:
    for s in sentences:
        fp.write(s + '\n')

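Returning to the extraction loop: an alternative to deduplicating every 500 rows is to accumulate straight into a `set`, which is distinct at every step and needs no separate final pass. The following is a minimal standard-library sketch of that idea, not the notebook's own code; the helper `collect_distinct` and the toy batches are hypothetical stand-ins for the per-species sentence lists produced by spaCy.

```python
def collect_distinct(batches):
    """Return (total_seen, distinct_sentences) for an iterable of sentence lists."""
    distinct = set()
    total = 0
    for batch in batches:
        total += len(batch)      # running count, duplicates included
        distinct.update(batch)   # the set silently drops repeats
    return total, distinct

# toy stand-in for the sentences extracted from two species accounts
batches = [
    ["This species has a very large range.", "The population is stable."],
    ["This species has a very large range.", "It occurs in montane forest."],
]
total, distinct = collect_distinct(batches)
print(f"{total} --> {len(distinct)}")   # 4 --> 3
```

As with `list(set(...))`, the order of the resulting sentences is not preserved, which is fine here because the sentences are treated as an unordered bag.
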
## Building a topic from the sentences

Load the sentences. (If cloning from the GitHub repo, you can start here, as the sentence file is included in the <i>data</i> directory.)

In [4]:
datapath = "../../data/"

sentfile = datapath + "bli_sentences_11107.txt"
with open(sentfile, 'r') as fp:
    sentences = [s.strip() for s in fp.readlines()]

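Load spaCy and build a list 'words'. Each entry is a list of normalised words (tokens) from a sentence in our set: lower-cased lemmas, with stop words, non-alphabetic tokens and the parts of speech listed in 'removal' dropped.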

In [5]:
import spacy

nlp = spacy.load('en_core_web_md') 

# tags we want to remove from the text
removal= ['ADV','PRON','CCONJ','PUNCT','PART','DET','ADP','SPACE', 'NUM', 'SYM']

# build token list
words = []
for s in nlp.pipe(sentences):
    toks = [token.lemma_.lower() for token in s
               if token.pos_ not in removal 
               and not token.is_stop 
               and token.is_alpha]
    words.append(toks) 

# check number of distinct words:
word_set = {tok for toks in words for tok in toks}
print(f'Found {len(word_set)} words in {len(sentences)} distinct sentences.')

Found 31489 words in 91070 distinct sentences.


We'll use the 'gensim' package to build a dictionary and track frequencies.

In [6]:
from gensim.corpora.dictionary import Dictionary

bli_dictionary = Dictionary(words)
#bli_dictionary.filter_extremes(no_below = 10, 
#                           no_above = 0.5, 
#                           keep_n = 5000)
bli_vocab = bli_dictionary.token2id.keys()

# show the 20 tokens that occur in the most sentences
top = sorted(bli_dictionary.dfs.items(), key=lambda x: x[1], reverse=True)[:20]
for tok_id, n_docs in top:
    print(f'{n_docs:5} {bli_dictionary[tok_id]}')

22913 population
22042 specie
12508 habitat
11148 forest
11101 decline
10133 range
 8882 area
 8048 individual
 7335 estimate
 7135 species
 6689 breeding
 6256 occur
 5861 small
 5707 size
 5693 bird
 4984 record
 4910 year
 4734 island
 4647 find
 4584 nest
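
Note that these figures count sentences containing the token, not total occurrences: a token repeated within one sentence is counted once. A minimal standard-library sketch of that distinction on toy data (the `doc_freq` helper and the token lists are illustrative only and not part of the notebook or of gensim):

```python
from collections import Counter

def doc_freq(tokenised_sentences):
    """Count, for each token, the number of sentences that contain it."""
    counts = Counter()
    for toks in tokenised_sentences:
        counts.update(set(toks))  # set() so repeats within a sentence count once
    return counts

toy = [
    ["population", "decline", "population"],
    ["population", "stable"],
    ["habitat", "decline"],
]
dfs = doc_freq(toy)

# 'population' appears three times in total, but in only two sentences
print(dfs["population"])
```
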


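We want to assign importance weightings to these words in a natural way.
The 'dfs' attribute of the dictionary maps each token to the number of documents (here, BLI sentences) that contain it. Dividing that count by the total number of sentences gives the fraction of sentences containing the token, and we take the log of this fraction as the log-likelihood of seeing the token in a sentence drawn from the topic.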

In [7]:
from math import log

log_nsents = log(len(words))
bli_loglik = dict()

# compute the log-likelihood of every token and show the top 20
count = 20
for x in sorted(bli_dictionary.dfs.items(), key=lambda x: x[1], reverse=True):
    tok = bli_dictionary[x[0]]
    bli_loglik[tok] = log(x[1]) - log_nsents
    if count > 0:
        print(f'{bli_loglik[tok]:5} {bli_dictionary[x[0]]}')
        count -= 1

-1.379924006502522 population
-1.4186787173103763 specie
-1.9852600019841322 habitat
-2.1003683319600253 forest
-2.104593247229218 decline
-2.1958310171135462 range
-2.3276016847098173 area
-2.4262048282223443 individual
-2.5189710299847814 estimate
-2.546616190650928 species
-2.6111640554434565 breeding
-2.6780874383367177 occur
-2.7433082040854355 small
-2.7699349500251564 size
-2.772391091610782 bird
-2.905395660094575 record
-2.920354499773298 year
-2.956857930487935 island
-2.9754065914768866 find
-2.989056462167289 nest
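
As a check on the arithmetic, the top value can be reproduced by hand from the counts shown earlier: 'population' occurs in 22913 of the 91070 sentences. A small sketch of that calculation (the `log_likelihood` helper is a hypothetical name used only for this illustration):

```python
from math import exp, log

def log_likelihood(n_docs_with_token, n_docs):
    """Log of the fraction of documents that contain the token."""
    return log(n_docs_with_token) - log(n_docs)

ll = log_likelihood(22913, 91070)
print(round(ll, 4))        # -1.3799
print(round(exp(ll), 3))   # 0.252, i.e. about a quarter of all sentences
```

Values closer to zero therefore mark tokens that appear in a larger share of sentences, and a token found in every sentence would score exactly 0.
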


The 'bli_loglik' dict is now the model which we'll use for LitScan.

In [8]:
# write to disk 

import json

outfile = datapath + "bli_model_11107.json"
with open(outfile, 'w') as jf:
    json.dump(bli_loglik, jf)