In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from flair.models import SequenceTagger, TextClassifier
from flair.data import Sentence
from flair.embeddings import WordEmbeddings
from flair.tokenization import SegtokSentenceSplitter

DATA_DIR = '../data'

In [2]:
df = pd.read_json(f'{DATA_DIR}/fine_filtered2020_attrs.json.bz2', compression='bz2')
df.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,speaker_qid,gender,nationality,date_of_birth,ethnic_group,occupation,party,academic_degree,domains
0,2020-02-13-100687,This diet focuses on lifelong changes to healt...,,[],2020-02-13 18:22:19,1,"[[None, 0.8001], [Dr. Sebi, 0.1999]]",[https://parade.com/995529/christinperry/dr-se...,E,,,,,,,,,[parade.com]
1,2020-02-07-007719,"As an empowered female leader myself, I am thr...",,[],2020-02-07 13:09:21,1,"[[None, 0.9507], [Maye Musk, 0.0493]]",[https://www.perishablenews.com/produce/celebr...,E,,,,,,,,,[perishablenews.com]
2,2020-01-21-098307,We stack the freshly plucked fruits into three...,Sahi Ram,[Q19605026],2020-01-21 12:49:00,1,"[[Sahi Ram, 0.9035], [None, 0.0965]]",[http://freshplaza.com/article/9181949/the-tin...,E,Q19605026,[male],[India],[+1959-10-10T00:00:00Z],,[politician],[Aam Aadmi Party],,[freshplaza.com]
3,2020-02-12-011869,Burger Burger will use local produce as much a...,,[],2020-02-12 18:56:09,1,"[[None, 0.9078], [El Chapo, 0.0922]]",[https://www.belfastlive.co.uk/whats-on/food-d...,E,,,,,,,,,[belfastlive.co.uk]
4,2020-01-07-042453,It's likely that as we're seeing drier conditi...,,[],2020-01-07 23:40:00,2,"[[None, 0.9688], [Scott Morrison, 0.0312]]",[http://msn.com/en-au/news/australia/kookaburr...,E,,,,,,,,,"[msn.com, msn.com]"


# Tagging

Here we experiment with named-entity tagging on our dataset.

In [3]:
# Load NER
tagger = SequenceTagger.load('ner')
splitter = SegtokSentenceSplitter()

2021-11-12 05:28:04,408 --------------------------------------------------------------------------------
2021-11-12 05:28:04,409 The model key 'ner' now maps to 'https://huggingface.co/flair/ner-english' on the HuggingFace ModelHub
2021-11-12 05:28:04,409  - The most current version of the model is automatically downloaded from there.
2021-11-12 05:28:04,409  - (you can alternatively manually download the original model at https://nlp.informatik.hu-berlin.de/resources/models/ner/en-ner-conll03-v0.4.pt)
2021-11-12 05:28:04,410 --------------------------------------------------------------------------------
2021-11-12 05:28:04,956 loading file /home/romain/.flair/models/ner-english/4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4


In [4]:
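# Join all quotations into one text and split it into sentences.
full_text = '\n'.join(df['quotation'])
sentences = splitter.split(full_text)
# Annotate every sentence in place with NER spans.
tagger.predict(sentences)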

In [5]:
spans = [span for sentence in sentences for span in sentence.get_spans('ner')]
spans_df = pd.DataFrame([{
    'text': s.text,
    'tag': s.tag,
    'score': s.score
} for s in spans])
spans_df.head()

Unnamed: 0,text,tag,score
0,Burger Burger,ORG,0.981547
1,KVD Vegan Beauty,ORG,0.499235
2,Kat,PER,0.572307
3,Auckland,LOC,0.999819
4,Auckland,LOC,0.996379


## Most cited organizations

In the following cells, we use the tagger output to see which organizations are cited most often in our dataset.

In [6]:
org_spans = spans_df[spans_df['tag'] == 'ORG']

print(f"Most cited organizations (total = {len(org_spans)}):")
org_spans['text'].value_counts()[:15]

Most cited organizations (total = 610):


HFPA                       10
USDA                       10
Beyond Meat                 8
PETA India                  8
Hollywood Foreign Press     8
KENDO                       7
KFC                         7
BJP                         7
Hälsa                       5
KVD Vegan Beauty            5
APMC                        4
USMCA                       4
MPC                         4
CBD                         4
PETA                        4
Name: text, dtype: int64

These results are promising. We expect to learn more once we run the tagger on the complete dataset (previous years included) for the next milestone.

It could also be worth tagging the context around the quotations, because the organizations a speaker talks about are sometimes named there rather than in the quotation itself. Acronyms also distort the counts: "HFPA" and "Hollywood Foreign Press" are counted separately above. One way to address this is Wikidata, which stores the acronyms and aliases of organizations.
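A minimal sketch of how alias merging could work, assuming we already have one record per organization with a label and a list of aliases. The `ENTITIES` records below are hand-written and hypothetical (they only mimic the label/alias shape of a Wikidata entity and are not fetched from Wikidata), and the helper names `build_alias_lookup` and `merge_aliases` are ours.

```python
from collections import Counter

# Hypothetical, hand-written records shaped like a Wikidata entity
# (one label plus a list of aliases); not fetched from Wikidata.
ENTITIES = [
    {"label": "Hollywood Foreign Press Association",
     "aliases": ["HFPA", "Hollywood Foreign Press"]},
    {"label": "United States Department of Agriculture",
     "aliases": ["USDA"]},
]


def build_alias_lookup(entities):
    """Map every lowercased label and alias to its canonical label."""
    lookup = {}
    for entity in entities:
        for name in [entity["label"], *entity["aliases"]]:
            lookup[name.lower()] = entity["label"]
    return lookup


def merge_aliases(counts, lookup):
    """Sum mention counts per canonical label; unknown names are kept as-is."""
    merged = Counter()
    for name, n in counts.items():
        merged[lookup.get(name.lower(), name)] += n
    return merged


# A few counts copied from the output above.
org_counts = {"HFPA": 10, "USDA": 10, "Hollywood Foreign Press": 8, "KFC": 7}
merged = merge_aliases(org_counts, build_alias_lookup(ENTITIES))
print(merged.most_common())
```

In the notebook, `org_spans['text'].value_counts().to_dict()` would provide the `counts` argument. Names without a record, such as "KFC" here, pass through unchanged, so the merge only ever combines rows that the lookup explicitly links.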

## Most cited locations

In the following cells, we use the tagger output to see which locations are cited most often in our dataset.

In [7]:
loc_spans = spans_df[spans_df['tag'] == 'LOC']

print(f"Most cited locations (total = {len(loc_spans)}):")
loc_spans['text'].value_counts()[:15]

Most cited locations (total = 888):


India            30
UK               26
Australia        20
U.S.             17
China            14
US               12
Delhi            12
Vermont           9
Canada            8
Goa               8
Florida           8
Europe            7
Kolkata           7
Topsia            7
United States     7
Name: text, dtype: int64

The same processing as before could be applied here. In addition, some mentions are regions rather than countries; if we want a per-country analysis, we could use Wikidata to map them to their countries.
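As a rough sketch of the per-country roll-up, assuming a region-to-country table is available: the `COUNTRY_OF` dictionary below is hand-written and hypothetical (in practice it could be derived from Wikidata), and `roll_up_to_country` is a helper name of ours. Mapping a name to `None` is our own convention for "not attributable to a single country".

```python
from collections import Counter

# Hypothetical hand-written lookup; spelling variants of a country are
# folded into the same table. None marks names that are not one country.
COUNTRY_OF = {
    "Delhi": "India", "Goa": "India", "Kolkata": "India",
    "Vermont": "United States", "Florida": "United States",
    "U.S.": "United States", "US": "United States",
    "Europe": None,
}


def roll_up_to_country(counts, country_of):
    """Aggregate mention counts per country; names mapped to None are dropped."""
    per_country = Counter()
    for name, n in counts.items():
        # Names missing from the table are assumed to be countries already.
        country = country_of.get(name, name)
        if country is not None:
            per_country[country] += n
    return per_country


# A few counts copied from the output above.
loc_counts = {"India": 30, "Delhi": 12, "Goa": 8, "U.S.": 17,
              "US": 12, "Vermont": 9, "Europe": 7}
per_country = roll_up_to_country(loc_counts, COUNTRY_OF)
print(per_country.most_common())
```

The assumption that unmapped names are already countries is convenient for a first pass but optimistic; checking the unmapped names by hand would show how much of the data the table actually covers.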

## Most cited persons

In the following cells, we use the tagger output to see which persons are cited most often in our dataset.

In [8]:
per_spans = spans_df[spans_df['tag'] == 'PER']

print(f"Most cited persons (total = {len(per_spans)}):")
per_spans['text'].value_counts()[:10]

Most cited persons (total = 551):


Trump              13
Joaquin Phoenix    12
God                10
Alex                8
Joaquin             8
Jon Richardson      8
Greggs              6
Thor                5
Donald Trump        4
Alzheimer           4
Name: text, dtype: int64

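The same comments apply here. Partial and full names are counted separately ("Trump" and "Donald Trump", "Joaquin" and "Joaquin Phoenix"), and some entries tagged as persons, such as "Greggs", are probably not people.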
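One possible heuristic for the split names, sketched under a strong assumption: a one-word name refers to the same person as a longer name when exactly one longer name in the counts contains that word. The helper `merge_partial_names` is ours, and this is string matching, not real entity linking, so it can merge mentions of different people who share a name.

```python
from collections import Counter


def merge_partial_names(counts):
    """Fold a one-word name into a longer name when exactly one longer name contains it."""
    full_names = [name for name in counts if len(name.split()) > 1]
    merged = Counter()
    for name, n in counts.items():
        target = name
        if len(name.split()) == 1:
            candidates = [full for full in full_names if name in full.split()]
            # Merge only when the match is unambiguous.
            if len(candidates) == 1:
                target = candidates[0]
        merged[target] += n
    return merged


# A few counts copied from the output above.
per_counts = {"Trump": 13, "Joaquin Phoenix": 12, "God": 10, "Alex": 8,
              "Joaquin": 8, "Jon Richardson": 8, "Donald Trump": 4}
merged_persons = merge_partial_names(per_counts)
print(merged_persons.most_common())
```

Names with no matching longer form ("God", "Alex") are left alone, and a one-word name that matches several longer names is also left alone rather than guessed.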

## Events tagging

We experiment with event tagging, using the OntoNotes NER model and its `EVENT` tag.

In [9]:
# Load NER Ontonotes
ontonotes_tagger = SequenceTagger.load('ner-ontonotes')

2021-11-12 05:28:48,573 --------------------------------------------------------------------------------
2021-11-12 05:28:48,574 The model key 'ner-ontonotes' now maps to 'https://huggingface.co/flair/ner-english-ontonotes' on the HuggingFace ModelHub
2021-11-12 05:28:48,574  - The most current version of the model is automatically downloaded from there.
2021-11-12 05:28:48,575  - (you can alternatively manually download the original model at https://nlp.informatik.hu-berlin.de/resources/models/ner-ontonotes/en-ner-ontonotes-v0.4.pt)
2021-11-12 05:28:48,575 --------------------------------------------------------------------------------
2021-11-12 05:28:49,080 loading file /home/romain/.flair/models/ner-english-ontonotes/f46dcd14689a594a7dd2a8c9c001a34fd55b02fded2528410913c7e88dbe43d4.1207747bf5ae24291205b6f3e7417c8bedd5c32cacfb5a439f3eff38afda66f7


In [10]:
full_text = '\n'.join(df['quotation'])
sentences = splitter.split(full_text)
ontonotes_tagger.predict(sentences)

In [11]:
spans = [span for sentence in sentences for span in sentence.get_spans('ner')]
spans_df = pd.DataFrame([{
    'text': s.text,
    'tag': s.tag,
    'score': s.score
} for s in spans])
spans_df.head()

Unnamed: 0,text,tag,score
0,Musk,PERSON,0.951681
1,every single day,DATE,0.91475
2,three,CARDINAL,0.999228
3,One,CARDINAL,0.99895
4,third,ORDINAL,0.999984


In [12]:
event_spans = spans_df[spans_df['tag'] == 'EVENT']

print(f"Most cited events (total = {len(event_spans)}):")
event_spans['text'].value_counts()[:10]

Most cited events (total = 38):


Veganuary                                  2
Olympic                                    2
the Plant Based World Conference & Expo    2
Australian Open                            2
Annual Golden Globe Awards                 2
New Year                                   2
the UAE Innovation Month                   2
the Adelaide Cup                           1
Paralympic                                 1
Valentine 's Day                           1
Name: text, dtype: int64

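The results are acceptable but sparse: only 38 events are tagged, and none is cited more than twice. We think that adding the context around the quotations could yield more events and help us link them to our analysis.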

## Most cited products

We look at the most cited products, reusing the OntoNotes spans computed above.

In [13]:
product_spans = spans_df[spans_df['tag'] == 'PRODUCT']

print(f"Most cited products (total = {len(product_spans)}):")
product_spans['text'].value_counts()[:10]

Most cited products (total = 62):


the Beyond Burger                          3
Covid-19                                   3
Big Mac                                    2
Sewa                                       2
Marmite                                    2
Beyond Meat                                2
Impossible Foods'                          1
Fanta                                      1
Vegan Jack Wings                           1
Biolage HydraSource Deep Treatment Pack    1
Name: text, dtype: int64

# Word embedding

In [14]:
glove_embedding = WordEmbeddings('glove')

In [15]:
sentence = Sentence(df.loc[0].quotation)
glove_embedding.embed(sentence)

for token in sentence:
    print(token)
    print(token.embedding)

Token: 1 This
tensor([-0.5706,  0.4418,  0.7010, -0.4171, -0.3406,  0.0234, -0.0715,  0.4818,
        -0.0131,  0.1683, -0.1339,  0.0406,  0.1583, -0.4434, -0.0194, -0.0097,
        -0.0463,  0.0932, -0.2733,  0.2285,  0.3309, -0.3647,  0.0787,  0.3585,
         0.4476, -0.2299,  0.1808, -0.6265,  0.0539, -0.2915, -0.4256,  0.6290,
         0.1439, -0.0460, -0.2101,  0.4888, -0.0577,  0.3743, -0.0301, -0.3449,
        -0.2970,  0.1509,  0.2825, -0.1658,  0.0761, -0.0930,  0.7936, -0.6049,
        -0.1887, -1.0173,  0.3196, -0.1634,  0.5418,  1.1725, -0.4787, -3.3842,
        -0.0813, -0.3528,  1.8372,  0.4452, -0.5267,  0.9979, -0.3218,  0.0335,
         1.1783, -0.0729,  0.3974,  0.2617,  0.3311, -0.3563, -0.1656, -0.4438,
        -0.1418, -0.3798,  0.2899, -0.0291, -0.3517, -0.2769, -1.3440,  0.1955,
         0.1689,  0.0402, -0.8021,  0.2337, -1.3837, -0.0231,  0.0854, -0.7405,
        -0.0739, -0.5884, -0.0857, -0.1053, -0.5157,  0.1504, -0.1669, -0.1637,
        -0.2270, -0.6610, 