# Greek Political Discourse during the Recession Era

Dimitris Tsirmpas f3352315

In this short assignment we create and analyze Greek word embeddings trained on the dataset released with [this paper](https://openreview.net/pdf?id=4u252OfG-xh).

## Preparing the dataset

We begin by downloading and extracting the data. We use the Python standard library instead of shell commands so that the script stays platform-agnostic.

In [1]:
from zipfile import ZipFile 
import shutil
import os
import urllib.request
import pandas as pd


DATA_DIR = "data"

print("Downloading dataset...")
# download dataset and save it in a zip archive
url = "https://zenodo.org/record/6626316/files/Greek%20Parliament%20Proceedings%20Dataset_Support%20Files_Word%20Usage%20Change%20Computations.zip"
with urllib.request.urlopen(url) as f:
    zip_contents = f.read()

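# make sure the target directory exists before writing into it
os.makedirs(DATA_DIR, exist_ok=True)
zip_path = os.path.join(DATA_DIR, "greek.zip")
with open(zip_path, "wb") as file:
    file.write(zip_contents)

# unzip the saved archive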
print("Unzipping...")
with ZipFile(zip_path, 'r') as zfile:
    zfile.extractall(path=DATA_DIR)

# organize zipped directory
old_dir = os.path.join(DATA_DIR, "Greek Parliament Proceedings Dataset_Support Files_Word Usage Change Computations")
new_dir = os.path.join(DATA_DIR, "greek")

# delete the __MACOSX directory, if the archive contained one
shutil.rmtree(os.path.join(DATA_DIR, "__MACOSX"), ignore_errors=True)
# rename directory to a sensible name
os.rename(old_dir, new_dir) 


# load dataset
print("Loading...")
df = pd.read_csv(os.path.join(new_dir, "tell_all_cleaned.csv"), encoding="utf-8")
df 

Downloading dataset...
Unzipping...
Loading...


Unnamed: 0,member_name,sitting_date,parliamentary_period,parliamentary_session,parliamentary_sitting,political_party,government,member_region,roles,member_gender,speaker_info,speech
0,κρητικος νικολαου παναγιωτης,03/07/1989,period 5,session 1,sitting 1,πανελληνιο σοσιαλιστικο κινημα,['τζαννετακη τζαννη(02/07/1989-12/10/1989)'],β' πειραιως,['δ αντιπροεδρος βουλης(07/03/1989-21/11/1989)'],male,προεδρευων,παρακαλειται @sw γραμματεας βουλγαρακης @sw συ...
1,κρητικος νικολαου παναγιωτης,03/07/1989,period 5,session 1,sitting 1,πανελληνιο σοσιαλιστικο κινημα,['τζαννετακη τζαννη(02/07/1989-12/10/1989)'],β' πειραιως,['δ αντιπροεδρος βουλης(07/03/1989-21/11/1989)'],male,προεδρευων,παρακαλειται @sw κυριος γραμματεας @sw συνοδευ...
2,κρητικος νικολαου παναγιωτης,03/07/1989,period 5,session 1,sitting 1,πανελληνιο σοσιαλιστικο κινημα,['τζαννετακη τζαννη(02/07/1989-12/10/1989)'],β' πειραιως,['δ αντιπροεδρος βουλης(07/03/1989-21/11/1989)'],male,προεδρευων,κυριοι συναδελφοι παρακαλω @sw βουλη @sw εξουσ...
3,,03/07/1989,period 5,session 1,sitting 1,βουλη,['τζαννετακη τζαννη(02/07/1989-12/10/1989)'],,,,βουλευτης/ες,@sw @sw
4,κρητικος νικολαου παναγιωτης,03/07/1989,period 5,session 1,sitting 1,πανελληνιο σοσιαλιστικο κινημα,['τζαννετακη τζαννη(02/07/1989-12/10/1989)'],β' πειραιως,['δ αντιπροεδρος βουλης(07/03/1989-21/11/1989)'],male,προεδρευων,@sw βουλη παρεσχε @sw ζητηθεισα εξουσιοδοτηση....
...,...,...,...,...,...,...,...,...,...,...,...,...
1280913,κωνσταντινοπουλος κωνσταντινου οδυσσεας,24/07/2020,period 18 review 9,session 1,sitting 187,κινημα αλλαγης,['μητσοτακη κυριακου(08/07/2019-28/07/2020)'],αρκαδιας,['ε αντιπροεδρος βουλης(18/07/2019-28/07/2020)'],male,προεδρευων,κυριες @sw κυριοι συναδελφοι παρακαλω @sw σωμα...
1280914,,24/07/2020,period 18 review 9,session 1,sitting 187,βουλη,['μητσοτακη κυριακου(08/07/2019-28/07/2020)'],,,,βουλευτης/ες,@sw @sw
1280915,κωνσταντινοπουλος κωνσταντινου οδυσσεας,24/07/2020,period 18 review 9,session 1,sitting 187,κινημα αλλαγης,['μητσοτακη κυριακου(08/07/2019-28/07/2020)'],αρκαδιας,['ε αντιπροεδρος βουλης(18/07/2019-28/07/2020)'],male,προεδρευων,@sw σωμα παρεσχε @sw ζητηθεισα εξουσιοδοτηση κ...
1280916,,24/07/2020,period 18 review 9,session 1,sitting 187,βουλη,['μητσοτακη κυριακου(08/07/2019-28/07/2020)'],,,,βουλευτης/ες,@sw @sw


We focus on the aftermath of the Greek recession (2010-2020). Since our analysis is confined to word embeddings, we keep only the `sitting_date` and `speech` columns of the original dataset.

In [2]:
from tqdm.notebook import tqdm_notebook


# enable progress bar functionality (registers `progress_apply` on pandas objects)
tqdm_notebook.pandas()

In [3]:
speech_df = df.loc[:, ["sitting_date", "speech"]].copy()

speech_df.sitting_date = pd.to_datetime(speech_df.sitting_date, dayfirst=True)
speech_df.set_index("sitting_date", inplace=True)
speech_df.sort_index(inplace=True) # required to avoid wrong slicing (and warnings)

speech_df = speech_df[~speech_df.speech.isnull()]

speech_df = speech_df.loc["2010-01-01":] 
speech_df

sitting_date,speech
2010-01-11,κυριες @sw κυριοι συναδελφοι @sw @sw ευχηθω @s...
2010-01-11,ευχαριστουμε κυριε συναδελφε.κυριες @sw κυριοι...
2010-01-11,@sw @sw
2010-01-11,@sw βουλη ενεκρινε @sw ζητηθεισα αδεια.@sw @sw...
2010-01-11,@sw @sw
...,...
2020-07-24,κυριες @sw κυριοι συναδελφοι παρακαλω @sw σωμα...
2020-07-24,@sw @sw
2020-07-24,@sw σωμα παρεσχε @sw ζητηθεισα εξουσιοδοτηση κ...
2020-07-24,@sw @sw


Before training, one preprocessing step is crucial. According to the paper:

> We replaced all references to political parties with the symbol “@” followed by an abbreviation of the party name. We removed accents, strings with length less than 2 characters, all punctuation except full stops, and replaced stopwords with “@sw”.

Since we won't be running our analysis on political parties, we remove all placeholders (tokens beginning with `@`) without replacing them.

Additionally, many words are joined by a period with no space between them (e.g. `συναδελφε.κυριες`), which would cause our model to treat each such pair as a single word. We address this below by replacing periods with spaces.

In [4]:
# substantially faster than regex
def remove_placeholders(x: str) -> str:
    words = x.split()
    new_words = [word for word in words if not word.startswith("@")]
    return " ".join(new_words)


print("Reformatting words with periods...")
speech_df.speech = speech_df.speech.progress_apply(lambda x: x.replace(".", " "))

print("Removing placeholders...")
speech_df.speech = speech_df.speech.progress_apply(remove_placeholders)

speech_df = speech_df.loc[speech_df.speech.apply(lambda x: len(x.strip()) != 0)]
speech_df

Reformatting words with periods...


  0%|          | 0/539230 [00:00<?, ?it/s]

Removing placeholders...


  0%|          | 0/539230 [00:00<?, ?it/s]

sitting_date,speech
2010-01-11,κυριες κυριοι συναδελφοι ευχηθω κυριες κυριους...
2010-01-11,ευχαριστουμε κυριε συναδελφε κυριες κυριοι συν...
2010-01-11,βουλη ενεκρινε ζητηθεισα αδεια βουλευτης αικατ...
2010-01-11,βουλη ενεκρινε ζητηθεισα αδεια εισελθουμε ημερ...
2010-01-11,κυριες κυριοι συναδελφοι εισερχομαστε ημερησια...
...,...
2020-07-24,θεσεις κομματων αποτυπωθηκαν ψηφιση ηλεκτρονικ...
2020-07-24,ολοκληρωση ψηφοφοριας ηλεκτρονικο συστημα σχεδ...
2020-07-24,κυριες κυριοι συναδελφοι παρακαλω σωμα εξουσιο...
2020-07-24,σωμα παρεσχε ζητηθεισα εξουσιοδοτηση κυριοι συ...
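
To make the two cleaning steps concrete, here is a minimal sketch applied to a single string. The input is a made-up example modelled on the rows shown above, not an actual record, and `split_on_periods` is a helper name we introduce only for this illustration.

```python
def split_on_periods(text: str) -> str:
    # Turn full stops into spaces so "word.word" becomes two separate tokens.
    return text.replace(".", " ")


def remove_placeholders(text: str) -> str:
    # Drop every whitespace-separated token that starts with "@" (e.g. "@sw").
    return " ".join(word for word in text.split() if not word.startswith("@"))


# Made-up input in the dataset's format (illustration only).
raw = "ευχαριστουμε κυριε συναδελφε.κυριες @sw κυριοι @sw βουλη"
cleaned = remove_placeholders(split_on_periods(raw))
print(cleaned)
```

The order matters: periods are replaced first, so a token such as `@sw.βουλη` is split before the placeholder filter runs, and the real word is not discarded together with the placeholder.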


## The Embedding model

We use a FastText model. We keep the vector size at 100, which appears sufficient for the semantic distinctions we examine. The context window size is kept at 5 words; larger values did not seem to affect the results and substantially increased the training time. We discard rare words that occur fewer than 3 times in our dataset and use 8 worker threads to parallelize training.

Additionally, we enable logging to monitor the training of our model and do not use further pre-processing, since the dataset is already processed.
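
FastText differs from plain word2vec in that each word vector is built from the vectors of the word's character n-grams. As a rough sketch of that idea, the snippet below lists the n-grams of a word, assuming gensim's default `min_n=3` and `max_n=6` (the model in this notebook does not override them). The helper names `char_ngrams` and `shared_ngram_ratio` are our own, and the Jaccard ratio is only an illustrative measure of surface overlap, not something FastText itself computes.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    # FastText wraps each word in boundary markers before extracting n-grams.
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def shared_ngram_ratio(a: str, b: str) -> float:
    # Jaccard overlap of the two words' n-gram sets (illustrative measure only).
    grams_a, grams_b = set(char_ngrams(a)), set(char_ngrams(b))
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(char_ngrams("βια", max_n=4))
print(shared_ngram_ratio("αστυνομια", "αστυνομιας"))
```

Words that share many n-grams start from similar subword vectors, which is worth keeping in mind when reading nearest-neighbour lists produced by the model.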

In [5]:
from nltk.tokenize import word_tokenize
import logging
import gensim


# https://stackoverflow.com/questions/77096387/how-to-get-a-progess-bar-for-gensim-models-fasttext-train
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

docs = speech_df.speech.apply(word_tokenize).to_list()
model = gensim.models.FastText(docs, 
                               vector_size=100, 
                               window=5, 
                               epochs=5,
                               min_count=3,
                               workers=8)

2023-12-12 11:35:24,036 : INFO : collecting all words and their counts
2023-12-12 11:35:24,037 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2023-12-12 11:35:24,254 : INFO : PROGRESS: at sentence #10000, processed 852790 words, keeping 60179 word types
2023-12-12 11:35:24,462 : INFO : PROGRESS: at sentence #20000, processed 1672178 words, keeping 81559 word types
2023-12-12 11:35:24,650 : INFO : PROGRESS: at sentence #30000, processed 2389021 words, keeping 94475 word types
2023-12-12 11:35:24,852 : INFO : PROGRESS: at sentence #40000, processed 3155198 words, keeping 106175 word types
2023-12-12 11:35:25,089 : INFO : PROGRESS: at sentence #50000, processed 4040750 words, keeping 122331 word types
2023-12-12 11:35:25,274 : INFO : PROGRESS: at sentence #60000, processed 4744397 words, keeping 129977 word types
2023-12-12 11:35:25,458 : INFO : PROGRESS: at sentence #70000, processed 5431045 words, keeping 136917 word types
2023-12-12 11:35:25,649 : INFO : PRO

## Analysis

Since our embeddings are derived from Greek parliamentary proceedings, it makes sense to probe them with queries on political subjects.

### Perception of the police

An issue touched on in the original paper is the perception of the Greek police during this period, when several controversial events involving the police led to significant public outcry.

The words most similar to "police" are:

In [29]:
model.wv.most_similar("αστυνομια")

[('aστυνομια', 0.9618648290634155),
 ('ευρωαστυνομια', 0.8789981603622437),
 ('αστυνομοκρατια', 0.8481607437133789),
 ('αστυνομο', 0.7763506770133972),
 ('αστυνομιες', 0.7658748626708984),
 ('αστυνομευση', 0.762896716594696),
 ('αστυνομικινες', 0.7588578462600708),
 ('αστυνομιας', 0.7513791918754578),
 ('αστυνομοκρατουμενα', 0.749001145362854),
 ('αστυνομοκρατουμενο', 0.7475293874740601)]

If we combine it with "violence", we get words associated with the events of the early 2010s.

In [28]:
model.wv.most_similar(["αστυνομια", "βια"])

[('αστυνομοκρατια', 0.8734363913536072),
 ('aστυνομια', 0.8299976587295532),
 ('εμβια', 0.7736663222312927),
 ('οπλοφοβια', 0.7638539671897888),
 ('τρομοκρατια', 0.75651615858078),
 ('αστυνομοκρατουμενα', 0.734204888343811),
 ('αστυνομευση', 0.7258228063583374),
 ('ευρωαστυνομια', 0.7232462167739868),
 ('φοβια', 0.7230506539344788),
 ('κακαβια', 0.7204627394676208)]

On the other hand, if we explicitly subtract the semantics of "violence", we get words associated with police divisions that are held in comparatively higher regard and did not participate in the aforementioned controversial events, such as the European police force and the coast guard.

In [30]:
model.wv.most_similar("αστυνομια", negative="βια")

[('αστυνομιας', 0.5167123675346375),
 ('ευρωαστυνομια', 0.5038996934890747),
 ('ευρωαστυνομιας', 0.49796822667121887),
 ('aστυνομια', 0.4861074686050415),
 ('σεπε', 0.45815831422805786),
 ('ακτοφυλακης', 0.4525245130062103),
 ('λιμενοφυλακων', 0.4479447305202484),
 ('λιμενικο', 0.44715550541877747),
 ('υπηρεσιαν', 0.4403153657913208),
 ('λιμενοφυλακας', 0.4393709897994995)]
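
To clarify what the `negative` argument does, here is a minimal sketch of the arithmetic as we understand it from the gensim documentation: the query is the normalised mean of the unit-length key vectors, with negative keys weighted by -1, and candidates are ranked by cosine similarity to it. The 2-d vectors and English labels below are made up purely for illustration and do not come from the trained model.

```python
import math


def unit(v):
    # Scale a vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def query_vector(positive, negative):
    # Add unit-normalised positive vectors, subtract negative ones, renormalise.
    total = [0.0] * len(positive[0])
    for sign, vectors in ((1.0, positive), (-1.0, negative)):
        for v in vectors:
            for i, x in enumerate(unit(v)):
                total[i] += sign * x
    return unit(total)


def cosine(a, b):
    # Cosine similarity of two vectors.
    return sum(x * y for x, y in zip(unit(a), unit(b)))


# Made-up toy vectors: axis 0 ~ "policing", axis 1 ~ "violence".
toy = {
    "police": [1.0, 1.0],
    "violence": [0.0, 1.0],
    "riot_unit": [0.6, 0.8],
    "coast_guard": [1.0, 0.1],
}

q = query_vector([toy["police"]], [toy["violence"]])
ranked = sorted(
    (w for w in toy if w not in ("police", "violence")),
    key=lambda w: cosine(q, toy[w]),
    reverse=True,
)
print(ranked)
```

In this toy setup, subtracting the "violence" direction moves the candidate with little violence component to the top, which mirrors the shift we observe in the real results above.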

### Perception of refugees

The refugee crisis of 2015 rattled both Greece and its political discourse at the height of the crisis. This is evident in the words associated with "immigrant":

In [40]:
model.wv.most_similar("μεταναστης")

[('λαθρομεταναστης', 0.8822091817855835),
 ('αλλοδαπος', 0.7875159382820129),
 ('μεταναστριας', 0.7716271281242371),
 ('προσφυγας', 0.7496047616004944),
 ('μεταναστη', 0.7475847601890564),
 ('ασιγαστης', 0.74226975440979),
 ('αλλοκοτης', 0.7232744693756104),
 ('ισλαμιστης', 0.7221237421035767),
 ('μεταναστρια', 0.7171958684921265),
 ('ανθρωπιστης', 0.7108303904533386)]

If we remove the economic dimension, however, the associated words change significantly, although not necessarily in a more positive direction.

In [43]:
model.wv.most_similar("μεταναστης", negative="οικονομια")

[('λαθρομεταναστης', 0.5967414379119873),
 ('μουζακιτης', 0.5943003296852112),
 ('ρατσιστης', 0.5847377777099609),
 ('φρανκφουρτης', 0.5718047022819519),
 ('ισλαμιστης', 0.5652790665626526),
 ('φυγας', 0.5618626475334167),
 ('τυφλοσουρτης', 0.5599074363708496),
 ('μουφτης', 0.5581765174865723),
 ('αλλοδαπος', 0.5540549159049988),
 ('μολυβιατης', 0.5525160431861877)]

However, disassociating immigrants from "bad" meanings in general does yield neutral or positive language such as "refugee" (προσφυγας) or "foreigner" (αλλοδαπος).

In [49]:
model.wv.most_similar("μεταναστης", negative="κακο")

[('προσφυγας', 0.565301775932312),
 ('αλλοδαπος', 0.5410193800926208),
 ('νοτης', 0.5362069010734558),
 ('ηπειρωτης', 0.5320284366607666),
 ('οιτης', 0.5311363339424133),
 ('φρικτης', 0.5296379327774048),
 ('ρεκτης', 0.5291882157325745),
 ('βωξιτης', 0.527543842792511),
 ('κνιτης', 0.5240750908851624),
 ('μεταναστριας', 0.52401202917099)]
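
Several neighbours in the list above end in "της", like the query word itself. Because FastText builds vectors from character n-grams, such entries may rank highly partly because of their spelling. As a rough aid for reading the output, the sketch below drops neighbours that share the query's last three characters. The helper `drop_suffix_matches` and its cut-off of three characters are our own assumptions and are not part of the original analysis.

```python
def drop_suffix_matches(
    query: str, neighbours: list[tuple[str, float]], n: int = 3
) -> list[tuple[str, float]]:
    # Keep only neighbours that do NOT end with the query's last n characters.
    suffix = query[-n:]
    return [(word, score) for word, score in neighbours if not word.endswith(suffix)]


# Neighbours copied from the output above.
neighbours = [
    ("προσφυγας", 0.565301775932312),
    ("αλλοδαπος", 0.5410193800926208),
    ("νοτης", 0.5362069010734558),
    ("ηπειρωτης", 0.5320284366607666),
    ("οιτης", 0.5311363339424133),
    ("φρικτης", 0.5296379327774048),
    ("ρεκτης", 0.5291882157325745),
    ("βωξιτης", 0.527543842792511),
    ("κνιτης", 0.5240750908851624),
    ("μεταναστριας", 0.52401202917099),
]

filtered = drop_suffix_matches("μεταναστης", neighbours)
print([word for word, _ in filtered])
```

The filter is deliberately crude: it would also discard genuinely related words that happen to share the ending, so it is best treated as a reading aid rather than a correction to the model.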