# A bag-of-words topic model for BirdLife

## Extracting sentences from the BirdLife text

Load the text data scraped from BirdLife assessments. The cleaned text that we will use is in the field 'text_short'. 

In [8]:
import pandas as pd

datapath = "../../data/"
bli_master = datapath + "master-BLI.csv"

df = pd.read_csv(bli_master, index_col = None).fillna('')
df.head()

Unnamed: 0,link,name_com,name_sci,SISRecID,date,text_main,text_short,x,y,status
0,http://datazone.birdlife.org/species/factsheet...,Cream-browed White-eye,Heleia superciliaris,22714307,2022-01-31,\n Justification of Red List Category\nAlthoug...,Although this species may have a restricted ra...,10.250856,-0.3714,LC
1,http://datazone.birdlife.org/species/factsheet...,Striped Sparrow,Oriturus superciliosus,22721301,2022-01-31,\n Justification of Red List Category\nThis sp...,"This species has a very large range, and hence...",14.209733,-2.746421,LC
2,http://datazone.birdlife.org/species/factsheet...,White-chinned Prinia,Schistolais leucopogon,22713643,2022-01-31,\n Justification of Red List Category\nThis sp...,"This species has an extremely large range, and...",4.975323,4.849581,LC
3,http://datazone.birdlife.org/species/factsheet...,Masked Water-tyrant,Fluvicola nengeta,22700284,2022-01-31,\n Justification of Red List Category\nThis sp...,"This species has an extremely large range, and...",6.9901,-0.280215,LC
4,http://datazone.birdlife.org/species/factsheet...,Lendu Crombec,Sylvietta chapini,22715107,2022-01-31,\n Justification of Red List Category\nThis sp...,This species is listed as Critically Endangere...,5.344564,4.205028,CR


Using <a href="https://spacy.io/">spaCy</a>, we'll extract a list of distinct sentence from across all the 'text_short' values.

In [10]:
import spacy
from spacy.matcher import Matcher                                                                                                                                                                                         

# load a language model
nlp = spacy.load('en_core_web_md') 

# recognise verbs
def verbs(sent):
    pattern=[
        {'POS': 'VERB', 'OP': '?'},
        {'POS': 'ADV', 'OP': '*'},
        {'POS': 'VERB', 'OP': '+'}
    ]
    # instantiate a Matcher instance
    matcher = Matcher(nlp.vocab) 
    # add pattern to matcher
    matcher.add('verb-phrases', [pattern])
    d = nlp(sent.text)
    # call the matcher to find matches 
    matches = matcher(d)
    spans = [d[start:end] for _, start, end in matches] 
    return spans

# recognise clean sentences
def clean_sentences(sents):
    sentences = [s for s in sents if len(verbs(s)) > 0 and
                    len(s) > 3]
    return sentences

# build our list, called 'sentences'
texts = list(df['text_short'])
sentences = []
count = 0

for i in range(df.shape[0]):
    txt = df.at[i, 'text_short']
    doc = nlp(txt)
    new_sents = clean_sentences(list(doc.sents))
    count += len(new_sents)
    sentences += [str(x) for x in new_sents]
    if i % 100 == 0: 
        # dedupe and show progress
        sentences = list(set(sentences))
        print(f'{i}: {count} --> {len(sentences)}')


0: 6 --> 6
100: 1259 --> 871
200: 2418 --> 1631
300: 3520 --> 2292
400: 4803 --> 3172
500: 5876 --> 3772
600: 6984 --> 4453
700: 8402 --> 5456
800: 9736 --> 6371
900: 11119 --> 7327
1000: 12448 --> 8263
1100: 13621 --> 8994
1200: 14823 --> 9783
1300: 16202 --> 10738
1400: 17442 --> 11552
1500: 19036 --> 12734
1600: 20410 --> 13666
1700: 21823 --> 14663
1800: 23168 --> 15572
1900: 24201 --> 16132
2000: 25598 --> 17116
2100: 26800 --> 17864
2200: 27987 --> 18591
2300: 29271 --> 19403
2400: 30498 --> 20170
2500: 31899 --> 21098
2600: 33210 --> 21970
2700: 34617 --> 22919
2800: 35977 --> 23840
2900: 37197 --> 24615
3000: 38509 --> 25490
3100: 39791 --> 26322
3200: 40973 --> 27058
3300: 42174 --> 27791
3400: 43253 --> 28379
3500: 44571 --> 29263
3600: 45686 --> 29905
3700: 46790 --> 30547
3800: 47937 --> 31231
3900: 49179 --> 31986
4000: 50407 --> 32734
4100: 51697 --> 33573
4200: 52855 --> 34297
4300: 54341 --> 35328
4400: 55544 --> 36060
4500: 56887 --> 36945
4600: 58181 --> 37788
4700: 5

That took a while, so let's write the sentences to disk for re-use.

In [11]:
outfile = datapath + "bli_sentences.txt"
with open(outfile, 'w') as fp:
    for s in sentences:
        fp.write(s + '\n')
fp.close()

## Building a topic from the sentences

Load the sentences. (If cloning from the GitHub repo, you can start here as the sentece file is included in the <i>data</i> directory.)

In [12]:
datapath = "../../data/"

sentfile = datapath + "bli_sentences.txt"
with open(sentfile, 'r') as fp:
    sentences = [s.strip() for s in fp.readlines()]
fp.close()

Load spaCy and build a list 'words'. Each entry is a list of normalised words (tokens) from a sentence in our set. 

In [13]:
import spacy

nlp = spacy.load('en_core_web_md') 

# tags we want to remove from the text
removal= ['ADV','PRON','CCONJ','PUNCT','PART','DET','ADP','SPACE', 'NUM', 'SYM']

# build token list
words = []
for s in nlp.pipe(sentences):
    toks = [token.lemma_.lower() for token in s
               if token.pos_ not in removal 
               and not token.is_stop 
               and token.is_alpha]
    words.append(toks) 

# check number of distinct words:
word_set = list(set(sum(words, [])))
print(f'Found {len(word_set)} words in {len(sentences)} distinct sentences.')

Found 20162 words in 39800 distinct sentences.


We'll use the 'gensim' package to build a dictionary and track frequencies.

In [19]:
from gensim.corpora.dictionary import Dictionary

bli_dictionary = Dictionary(words)
#bli_dictionary.filter_extremes(no_below = 10, 
#                           no_above = 0.5, 
#                           keep_n = 5000)
bli_vocab = bli_dictionary.token2id.keys()

# show top 20 most frequent
count = 20
for x in sorted(bli_dictionary.dfs.items(), key=lambda x: x[1], reverse=True):
    if count <= 0: break
    print(f'{x[1]:5} {bli_dictionary[x[0]]}')
    count -= 1

10121 population
 9610 specie
 5520 habitat
 4892 decline
 4794 forest
 4348 range
 3890 area
 3477 individual
 3267 estimate
 3155 species
 2992 breeding
 2668 occur
 2616 size
 2606 bird
 2552 small
 2124 year
 2087 island
 2036 record
 2000 find
 1972 nest


We want to assign importance weightings to these words in a natural way.
The attribute 'dfs' is the number of documents (i.e. BLI sentences) containing a given token. We will convert this into a log-likelihood of seeing the token in a sentence drawn from the topic.

In [21]:
from math import log

log_nsents = log(len(words))
bli_loglik = dict()

# show top 20 
count = 20
for x in sorted(bli_dictionary.dfs.items(), key=lambda x: x[1], reverse=True):
    tok = bli_dictionary[x[0]]
    bli_loglik[tok] = log(x[1]) - log_nsents
    if count > 0:
        print(f'{bli_loglik[tok]:5} {bli_dictionary[x[0]]}')
        count -= 1

-1.3692544390836279 population
-1.421062689308192 specie
-1.9754890520013877 habitat
-2.096265694465467 decline
-2.1165017762782004 forest
-2.214150943031429 range
-2.3254577546600377 area
-2.437697059264668 individual
-2.499994779671459 estimate
-2.534878416297216 species
-2.587924852178162 breeding
-2.7025377842370153 occur
-2.72222047869544 size
-2.726050433587739 bird
-2.746989546807849 small
-2.9305658089107 year
-2.948139284195329 island
-2.972879813602116 record
-2.9907197317304473 find
-3.004818656109949 nest


This dict is now the model which we'll use for LitScan.

In [24]:
# write to disk 

import json

outfile = datapath + "bli_model.json"
with open(outfile, 'w') as jf:
    json.dump(bli_loglik, jf)
jf.close()    