Antonia Sanhueza

# Omelas Project

In [None]:
#! pip install transformers
#! pip install sentence_transformers
#! pip install bertopic 

In [2]:
import json
import pandas as pd
import numpy as np
import transformers
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

## EDA

In [4]:
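# Load the article data; the context manager closes the file after reading
with open('../data/article_data.json', encoding='utf-8') as f:
    data = json.load(f)
df = pd.DataFrame(data['articles'])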

In [5]:
df.isnull().sum()

translated_title    0
translated_text     0
text                0
dtype: int64

In [6]:
df.columns

Index(['translated_title', 'translated_text', 'text'], dtype='object')

In [7]:
df.shape

(1500, 3)

In [8]:
# Faulty data: row 21 holds a cookie-consent notice instead of an article
print(df['translated_text'][21])
print('\n Same text in', (df['text'] == df['text'][21]).sum(), 'entries')

To improve the functioning of our website by showing you the most relevant news and announcements, we collect the technical information of your account in a completely anonymous way using third-party tools. If you want to know the details of how we treat your personal data, you can review our Privacy Policy. You will also find in our 'Cookie Policy and Automatic Login' detailed information about the tools we use for that purpose.

By clicking 'OK and close', you consent to us to collect your personal data in order to comply with the above.

You can withdraw your consent by following the steps in our Privacy Policy.

 Same text in 480 entries
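
The boilerplate entry above was found by inspecting a single row. A minimal sketch of how such repeated texts could be surfaced automatically, assuming a toy DataFrame with a `text` column (the data, the helper name `repeated_texts` and the `min_count` threshold are illustrative and not part of the original analysis):

```python
import pandas as pd

# Toy stand-in for the article data (hypothetical values).
toy = pd.DataFrame({
    "text": ["cookie notice", "article A", "cookie notice", "article B", "cookie notice"]
})

def repeated_texts(frame, column="text", min_count=2):
    """Return the texts that occur at least `min_count` times, with their counts."""
    counts = frame[column].value_counts()
    return counts[counts >= min_count]

dupes = repeated_texts(toy)
print(dupes)
```

Any text that appears many times in a news corpus is a good candidate for scraped boilerplate rather than article content.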


In [13]:
# Blank out the translated text of every row whose source text is the cookie notice
df.loc[df['text'] == df['text'][21], 'translated_text'] = ''
df

Unnamed: 0,translated_title,translated_text,text
0,"The Ministry of Defence took out some 1,500 ai...","From Almaty to Moscow and Yekaterinburg, some ...",Самолеты ВКС России доставили из Алма-Аты в Мо...
1,"Kosachev: In Kazakhstan, the US distinguished ...",The crisis in Kazakhstan has affected Washingt...,По кризису в Казахстане Вашингтон отметился тр...
2,Chinese banks use currency 'swaps' to absorb d...,,Para mejorar el funcionamiento de nuestra web ...
3,Kyrgyzstan reported the detention of five of i...,Kyrgyzstan reported the detention of five of i...,Киргизия сообщила о задержании в Казахстане пя...
4,Kosachev: U.S. statements on Kazakhstan are a ...,The statements made by the United States autho...,"Заявления властей США, касающиеся ситуации в К..."
...,...,...,...
1495,Video: Brazilian footballer gets brutal head k...,,Para mejorar el funcionamiento de nuestra web ...
1496,Uruguayan Vice Foreign Minister will travel to...,,Para mejorar el funcionamiento de nuestra web ...
1497,Bank exchange offices in Kazakhstan have been ...,"In Kazakhstan, for security reasons, banks &ap...",В Казахстане в целях безопасности временно при...
1498,A Kyrgyz musician detained in Kazakhstan was r...,"A citizen of Kyrgyzstan, Vikram Ruzahunov, who...","Гражданин Киргизии Викрам Рузахунов, которого ..."


In [14]:
df['translated_title_text'] = df['translated_title'] + ' .' + df['translated_text']
df['len_text'] = df['translated_text'].str.len()
print('Shortest text is length', df.len_text.min())
print('Longest text is length', df.len_text.max())
print('Std of text length is', df.len_text.std())

Shortest text is length 0
Longest text is length 79099
Std of text length is 3768.9963656488862


In [15]:
print(df['translated_title'][0])
print(df['translated_text'][0])
print(df['translated_title_text'][0])

The Ministry of Defence took out some 1,500 aircraft from Kazakhstan. Russian
From Almaty to Moscow and Yekaterinburg, some 1,500 aircraft were transported by the Russian Federation. Russians wishing to return from Kazakhstan to Russia were informed by the military department. &quot; All Russian military transport aircraft are transported from Kazakhstan to Russia 1461 Russian citizens&raquo; &quot; transmits to TASS a message from the Ministry of Defence. According to the agency, on 9 January, 1,422 Russian citizens and 14 Russians were transported to Yekaterinburg from Kazakhstan by military transport aircraft of the Russian Federation. As a reminder, on Sunday, a Russian FCS aircraft transported 100 Russian citizens from Kazakhstan to Moscow, who were in the Republic with relatives or on a tourist trip.
The Ministry of Defence took out some 1,500 aircraft from Kazakhstan. Russian .From Almaty to Moscow and Yekaterinburg, some 1,500 aircraft were transported by the Russian Federation

## Preprocessing

In [16]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

In [17]:
from preprocessing import clean_text

df["cleaned_text"] = [clean_text(text, stemming=False) for text in df["translated_title_text"]]
df[['translated_title_text', 'cleaned_text']]


Unnamed: 0,translated_title_text,cleaned_text
0,"The Ministry of Defence took out some 1,500 ai...",the ministry defence took aircraft kazakhstan ...
1,"Kosachev: In Kazakhstan, the US distinguished ...",kosachev in kazakhstan u distinguished three t...
2,Chinese banks use currency 'swaps' to absorb d...,chinese bank use currency swap absorb dollar
3,Kyrgyzstan reported the detention of five of i...,kyrgyzstan reported detention five citizen kaz...
4,Kosachev: U.S. statements on Kazakhstan are a ...,kosachev u statement kazakhstan refusal recogn...
...,...,...
1495,Video: Brazilian footballer gets brutal head k...,video brazilian footballer get brutal head kic...
1496,Uruguayan Vice Foreign Minister will travel to...,uruguayan vice foreign minister travel russia ...
1497,Bank exchange offices in Kazakhstan have been ...,bank exchange office kazakhstan suspended in k...
1498,A Kyrgyz musician detained in Kazakhstan was r...,a kyrgyz musician detained kazakhstan released...
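
The `clean_text` function comes from a local `preprocessing` module that is not shown here. The following is only a rough sketch of the kind of cleaning the output suggests (lowercasing, dropping digits and punctuation, removing stop words). The function name `clean_text_sketch` and the tiny hard-coded stop-word list are assumptions; the notebook itself downloads NLTK resources, and the real function also appears to lemmatize, which this sketch does not do.

```python
import re

# Assumption: a tiny illustrative stop-word list, not NLTK's full list.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "is", "are", "was", "were"}

def clean_text_sketch(text, stop_words=STOP_WORDS):
    """Lowercase, keep only alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(tok for tok in tokens if tok not in stop_words)

cleaned = clean_text_sketch("The Ministry of Defence took out some 1,500 aircraft.")
print(cleaned)
```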


## Topic modelling

In [18]:
# Create embeddings
#sentence_model = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
sentence_model = SentenceTransformer('all-MiniLM-L12-v2')
embeddings = sentence_model.encode(df['cleaned_text'], show_progress_bar=True)

# Create topic model
topic_model = BERTopic() 
topics, _ = topic_model.fit_transform(df['cleaned_text'], embeddings)
df['topic'] = topics

# Show list of clusters
topic_model.get_topic_info() 


Unnamed: 0,Topic,Count,Name
0,-1,306,-1_gt_and_nt_re
1,0,96,0_covid_vaccine_astrazeneca_coronavirus
2,1,75,1_almaty_the_city_possession
3,2,65,2_video_photo_electric_car
4,3,59,3_satanovsky_dialogue_current_united
5,4,50,4_session_csto_putin_videoconference
6,5,44,5_bashir_suicide_officer_scientist
7,6,39,6_detained_reported_the_apos
8,7,33,7_russian_aircraft_moscow_citizen
9,8,33,8_putin_allow_scenario_colour


In [19]:
# Interactive visualization
topic_model.visualize_topics() 

In [20]:
topic_model.get_topic(0)

[('covid', 0.2711517335438114),
 ('vaccine', 0.25455795736181996),
 ('astrazeneca', 0.1425674979129989),
 ('coronavirus', 0.07886968121975618),
 ('chile', 0.07800953684849742),
 ('dos', 0.07324573318013813),
 ('johnson', 0.06771336530953448),
 ('death', 0.06471963348078562),
 ('pfizer', 0.05938276088300465),
 ('vaccinated', 0.05585153862657356)]

In [21]:
topic_model.visualize_barchart()

In [22]:
topic_model.visualize_heatmap()

In [23]:
representative_docs = topic_model.get_representative_docs()
representative_docs

{0: ['a total people die switzerland receiving vaccine pfizer moderna',
  'boris johnson claim given very soon astrazeneca vaccine',
  'france resume vaccination astrazeneca march'],
 1: ['putin csto force kazakhstan long necessary the situation kazakhstan worsened resident city janaosen aktau almaty january came street calling reduction fuel price when authority made concession demonstrator protest intensified several major city turned riot the rioter blocked work ambulance fire service attacked took possession weapon army stormed administrative building sizo the attack began track handwriting islamist radical — beheaded the marauder also began operate smashing shop bank window trying take possession money jewelry the situation almaty considered difficult on morning january military police personnel launched anti terrorist operation bandit',
  'no yelling resident almaty told survived two worst day kazakhstan photo telegram sputnik kazakhstan the terrible day protest kazakh city almat

## NER

In [24]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm

nlp = en_core_web_sm.load()

In [25]:
entities = {}
for text in df['cleaned_text']:
    doc = nlp(text)
    for ent in doc.ents:
        # Group entity strings by label (PERSON, GPE, ...) without reusing the name `text`
        entities.setdefault(ent.label_, []).append(ent.text)

In [26]:
entities.keys()

dict_keys(['PERSON', 'LANGUAGE', 'GPE', 'NORP', 'ORG', 'DATE', 'FAC', 'CARDINAL', 'ORDINAL', 'LOC', 'PRODUCT', 'EVENT', 'TIME', 'QUANTITY', 'MONEY', 'LAW', 'WORK_OF_ART'])
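
The grouping step can be tried without loading a spaCy model. This sketch uses made-up `(label, text)` pairs as a stand-in for `(ent.label_, ent.text)`; the helper name `group_entities` and the sample values are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Toy (label, text) pairs standing in for spaCy entities (hypothetical values).
toy_ents = [
    ("PERSON", "putin"), ("GPE", "kazakhstan"), ("PERSON", "putin"),
    ("GPE", "russia"), ("PERSON", "vladimir putin"), ("GPE", "kazakhstan"),
]

def group_entities(pairs):
    """Collect entity strings into one list per entity label."""
    grouped = defaultdict(list)
    for label, text in pairs:
        grouped[label].append(text)
    return dict(grouped)

grouped = group_entities(toy_ents)
# Most frequent entity string per label, with its count.
top = {label: Counter(texts).most_common(1)[0] for label, texts in grouped.items()}
print(top)
```

Note that "putin" and "vladimir putin" are counted as different strings here, so counts for the same person can be split across aliases.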

In [43]:
most_common_ent = {}
for typ, ents in entities.items():
    print('\n Type of entity', typ)
    print('most common entities')
    most_com_tuples = Counter(ents).most_common(5)
    print('\t', most_com_tuples)
    # Keep only the entity strings, dropping the counts
    most_common_ent[typ] = [ent for ent, _ in most_com_tuples]



 Type of entity PERSON
most common entities
	 [('putin', 501), ('jomart tokaev', 300), ('vladimir putin', 288), ('nur sultan', 170), ('astrahani altai', 118)]

 Type of entity LANGUAGE
most common entities
	 [('russian', 77), ('english', 2)]

 Type of entity GPE
most common entities
	 [('kazakhstan', 3859), ('russia', 931), ('kyrgyzstan', 460), ('moscow', 385), ('tokayev', 366)]

 Type of entity NORP
most common entities
	 [('russian', 2337), ('kazakh', 152), ('american', 140), ('csto', 112), ('belarusian', 106)]

 Type of entity ORG
most common entities
	 [('kazakhstan', 404), ('united state', 158), ('national security committee', 136), ('security council', 132), ('session csto collective security council', 100)]

 Type of entity DATE
most common entities
	 [('january', 1324), ('monday', 172), ('today', 150), ('day', 56), ('thursday', 53)]

 Type of entity FAC
most common entities
	 [('kazakh security force', 62), ('kazakh protest direct', 60), ('kazakh statehood', 60), ('revolution 

In [46]:
# Keep the entity types of interest; pd.DataFrame needs lists of equal length
res = {k: most_common_ent[k] for k in ['PERSON', 'NORP', 'ORG', 'FAC']
       if k in most_common_ent}
pd.DataFrame(res)

Unnamed: 0,PERSON,NORP,ORG,FAC
0,putin,russian,kazakhstan,kazakh security force
1,jomart tokaev,kazakh,united state,kazakh protest direct
2,vladimir putin,american,national security committee,kazakh statehood
3,nur sultan,csto,security council,revolution kazakhstan built
4,astrahani altai,belarusian,session csto collective security council,west kazakhstan mangistu railway road


## Sentiment Analysis

In [30]:
from transformers import pipeline

sentiment_model = pipeline('sentiment-analysis',
                           model = 'cardiffnlp/twitter-roberta-base-sentiment-latest',
                           max_length=512, truncation=True)
sentiment_model(df['cleaned_text'][0])

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'label': 'Neutral', 'score': 0.9018459916114807}]

In [32]:
df['sentiment'] = [sentiment_model(text)[0]['label'] for text in df['cleaned_text']]

In [33]:
df.sentiment.value_counts()

Neutral     1018
Negative     463
Positive      19
Name: sentiment, dtype: int64

In [34]:
# Sentiment shares for the outlier topic (-1) and the six largest topics (0-5)
top_topics = df[df['topic'] <= 5]
print(top_topics.groupby('topic').sentiment.value_counts(normalize=True))

topic  sentiment
-1     Neutral      0.751634
       Negative     0.232026
       Positive     0.016340
 0     Neutral      0.802083
       Negative     0.197917
 1     Negative     0.680000
       Neutral      0.320000
 2     Neutral      0.846154
       Negative     0.092308
       Positive     0.061538
 3     Negative     0.796610
       Neutral      0.203390
 4     Neutral      1.000000
 5     Negative     0.863636
       Neutral      0.136364
Name: sentiment, dtype: float64
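
The same per-topic shares can also be laid out as a table with one row per topic and one column per label. A minimal sketch with `pd.crosstab`, using a toy DataFrame rather than the real `df` (the values are made up):

```python
import pandas as pd

# Toy topic/sentiment assignments (hypothetical values).
toy = pd.DataFrame({
    "topic": [0, 0, 0, 0, 1, 1],
    "sentiment": ["Neutral", "Neutral", "Neutral", "Negative", "Negative", "Neutral"],
})

# normalize="index" turns the counts in each row (topic) into shares that sum to 1.
shares = pd.crosstab(toy["topic"], toy["sentiment"], normalize="index")
print(shares)
```

The wide layout makes it easier to compare topics side by side, and labels that never occur for a topic show up as 0 instead of being omitted.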


## Text Summarization

In [35]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")

model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")

def generate_summary(text):
    """Summarize `text`; anything beyond the first 512 tokens is truncated."""
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    output = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

In [36]:
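for topic in set(topics):
    # NOTE: ''.join adds no separator, so the last word of one article fuses with
    # the first word of the next (e.g. 'vaccinemexico' in the output below)
    entire_text = ''.join(df[df['topic'] == topic]['cleaned_text'])
    print('Topic', topic, 'summary')
    print('\t', generate_summary(entire_text))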

Topic 0 summary
	 france italy suspend use astrazeneca vaccinemexico authorizes antiviral remdesivir covid patientgermany suspends use Astrazeneca Vaccinewhat number one cause chronic coughecuador receives dos polio vaccine donated mexicosweden decides suspend immunization astrazenec vaccine.
Topic 1 summary
	  Protest erupted several kazakh city january escalating mass riot government building getting ransacked several city day later they accompanied attack police military governance body many city country primarily almaty. The ensuing violence left score people injured and fatality also reported subsequently. Foreign tv channel resume broadcasting kazakstan ’ almatya almatyan ’ region morning january foreign tv channel resumed broadcasting  almaty city.
Topic 2 summary
	 stopasianhate hashtag revives network cartoon grammysvideo unusual trick rejuvenate. The biggest sandstorm decade lash beijingvideo a severe sandstorm hit persian gulf countryvideo boreal aurora fill sky russiavideo 
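
Two things limit these summaries: the articles are joined without a separator, and the tokenizer truncates each input to 512 tokens, so only the beginning of each topic's text reaches the model. One possible workaround, sketched under assumptions: join the documents with a separator, split the result into word-based chunks, and summarize each chunk. The names `chunk_words`, `summarize_documents` and `fake_summarize` are hypothetical, the word count is only a rough proxy for the token count, and `fake_summarize` merely stands in for a real summarizer such as `generate_summary`.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_documents(docs, summarize, max_words=300):
    """Join non-empty documents with a separator and summarize chunk by chunk."""
    joined = " . ".join(doc for doc in docs if doc)
    return [summarize(chunk) for chunk in chunk_words(joined, max_words)]

def fake_summarize(chunk):
    """Stand-in summarizer for illustration: keeps the first five words."""
    return " ".join(chunk.split()[:5])

docs = ["germany suspends use astrazeneca vaccine", "", "mexico authorizes antiviral remdesivir"]
result = summarize_documents(docs, fake_summarize, max_words=6)
print(result)
```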

In [38]:
df[df['topic']==0]

Unnamed: 0,translated_title,translated_text,text,translated_title_text,len_text,cleaned_text,topic,sentiment
68,"France, Italy suspend use of AstraZeneca vaccine",,Para mejorar el funcionamiento de nuestra web ...,"France, Italy suspend use of AstraZeneca vacci...",0,france italy suspend use astrazeneca vaccine,0,Neutral
75,Mexico authorizes antiviral Remdesivir for COV...,,Para mejorar el funcionamiento de nuestra web ...,Mexico authorizes antiviral Remdesivir for COV...,0,mexico authorizes antiviral remdesivir covid p...,0,Neutral
86,Germany suspends use of AstraZeneca vaccine,,Para mejorar el funcionamiento de nuestra web ...,Germany suspends use of AstraZeneca vaccine .,0,germany suspends use astrazeneca vaccine,0,Negative
122,What is the number one cause of chronic cough?,,Para mejorar el funcionamiento de nuestra web ...,What is the number one cause of chronic cough? .,0,what number one cause chronic cough,0,Negative
135,"Ecuador receives 95,000 doses of polio vaccine...",,Para mejorar el funcionamiento de nuestra web ...,"Ecuador receives 95,000 doses of polio vaccine...",0,ecuador receives dos polio vaccine donated mexico,0,Neutral
...,...,...,...,...,...,...,...,...
1481,Argentina closes deal with Sinopharm and will ...,,Para mejorar el funcionamiento de nuestra web ...,Argentina closes deal with Sinopharm and will ...,0,argentina close deal sinopharm receive million...,0,Neutral
1485,Hungary defends legality of anti-COVID vaccine...,,Para mejorar el funcionamiento de nuestra web ...,Hungary defends legality of anti-COVID vaccine...,0,hungary defends legality anti covid vaccine au...,0,Neutral
1486,Uruguay's Senate suspends sessions over COVID-...,,Para mejorar el funcionamiento de nuestra web ...,Uruguay's Senate suspends sessions over COVID-...,0,uruguay s senate suspends session covid case,0,Neutral
1490,"Chile's Interior Minister, Hospitalized Preemp...",,Para mejorar el funcionamiento de nuestra web ...,"Chile's Interior Minister, Hospitalized Preemp...",0,chile s interior minister hospitalized preempt...,0,Negative


In [39]:
df.to_csv('../data/analyzed_df.csv')
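
As written, `to_csv` also stores the row index, and reading the file back yields an extra `Unnamed: 0` column. A short sketch of the difference, using an in-memory buffer and a toy DataFrame rather than the real file (the values are made up):

```python
import io
import pandas as pd

toy = pd.DataFrame({"topic": [0, 1], "sentiment": ["Neutral", "Negative"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out, so the round trip preserves the columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```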