# NLP and visualizations based on .txt files
The data consists of New Year's speeches given by Danish prime ministers. The speeches included have been transcribed and published on the government's website.

## This notebook is divided into the following sections:
1. Overview of speeches and prime ministers
2. Installing and importing libraries
3. Loading in the data, i.e., the transcripts
4. Cleaning the transcripts and preparing the dataframe
5. Making dataframe subsets
6. Sentiment analysis using Sentida
7. Creating Word Clouds
8. Finding the top 25 words
9. Lix score calculation
10. BERT Sentiment Analysis


Government periods:
- RED [1998-2001]
- BLUE [2003-2011]
- RED [2012-2015]
- BLUE [2016-2019]
- RED [2020-2021]

## 1. Overview of speeches and prime ministers

- TS: Thorvald Stauning / Socialdemokratiet / R
- HH: Hans Hedtoft / Socialdemokratiet / R
- EE: Erik Eriksen / Venstre / B
- JOK: Jens Otto Krag / Socialdemokratiet / R
- HB: Hilmar Baunsgaard / Radikale Venstre / M
- AJ: Anker Jørgensen / Socialdemokratiet / R
- PS: Poul Schlüter / Det Konservative Folkeparti / B
- PNR: Poul Nyrup Rasmussen / Socialdemokratiet / R
- AF: Anders Fogh / Venstre / B
- LL: Lars Løkke / Venstre / B
- HTS: Helle Thorning Schmidt / Socialdemokratiet / R
- MF: Mette Frederiksen / Socialdemokratiet / R

## 2. Installing and importing libraries

In [None]:
# Standard library
import glob
import os
import re

# Third-party libraries
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import sentida
from nltk.tokenize import word_tokenize
from sentida import Sentida
from wordcloud import ImageColorGenerator, WordCloud

# Project helpers used below, e.g. opentxtfile and func_cleaning_transcripts
from helper_functions import *

# Download the tokenizer models that word_tokenize relies on
nltk.download('punkt')

## 3. Loading in the data, i.e., the transcripts
I'll load the .txt files containing the transcripts into a dictionary keyed by file name. The transcripts are cleaned in the next section with the func_cleaning_transcripts function.

In [None]:
os.chdir('/Users/emmaolsen/Library/CloudStorage/OneDrive-Aarhusuniversitet/cognitive_science/5th_semester/cultural_datascience/au650627_olsen_emma/NLP_nytaarstaler')

In [None]:
all_files = glob.glob("scraped_txts/*.txt")
all_files

In [None]:
speeches = {}

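for file in all_files:
    loaded_file = opentxtfile(file)
    # use the file name without directory or extension as the key
    # (os.path handles both '/' and '\' path separators)
    filename = os.path.splitext(os.path.basename(file))[0]

    speeches[filename] = loaded_file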
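
`opentxtfile` is defined in `helper_functions` and is not shown in this notebook. As a rough sketch of the loading step, assuming the transcripts are UTF-8 encoded, the loop above could also be written with `pathlib`. The function name `load_speeches` is hypothetical and only used for illustration.

```python
from pathlib import Path


def load_speeches(folder):
    """Read every .txt file in `folder` into a {file name stem: text} dict."""
    speeches = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Path.stem drops both the directory and the .txt extension
        speeches[path.stem] = path.read_text(encoding="utf-8")
    return speeches
```

Calling `load_speeches("scraped_txts")` would then give a dictionary with the same kind of keys as `speeches` above.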

In [None]:
speeches

## 4. Cleaning the transcripts and preparing the dataframe

In [None]:
from helper_functions import *
# apply the func_cleaning_transcripts function to each speech and save the result as a new variable
cleaned_speeches = {k: func_cleaning_transcripts(v) for k, v in speeches.items()}

#cleaned_speeches
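
`func_cleaning_transcripts` also lives in `helper_functions`, so its exact steps are not visible here. The sketch below shows what a typical cleaning step might look like (lowercasing, removing punctuation and digits, collapsing whitespace). `clean_transcript` is a hypothetical stand-in, not the actual helper.

```python
import re


def clean_transcript(text):
    """Lowercase, strip punctuation and digits, and collapse whitespace."""
    text = text.lower()
    # replace anything that is not a letter or whitespace with a space;
    # \w is Unicode-aware, so Danish letters such as æ, ø and å are kept
    text = re.sub(r"[^\w\s]|\d|_", " ", text)
    # collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", text).strip()


print(clean_transcript("Godt nytår! I 2020 står vi\nsammen."))
```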

### Remove stopwords

The stop word list used in this project is taken from the following GitHub gist: https://gist.github.com/berteltorp/0cf8a0c7afea7f25ed754f24cfc2467b

In [None]:
# danish stopwords
# npm install stopwords-da
# Run npm install -g npm@9.1.1 to update!

NOTE: I want to keep some of the stop words, and loop this.

In [None]:
cleaned = {k: remove_stopwords(v) for k, v in cleaned_speeches.items()}
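
Since some stop words should be kept, one possible approach is to subtract a keep list from the stop word list before filtering. This is only a sketch: `remove_stopwords_keeping`, the short `stopwords` list and the `keep` list are all made up for illustration, and the sketch assumes the text is tokenised on whitespace.

```python
def remove_stopwords_keeping(text, stopwords, keep=()):
    """Drop stop words from a whitespace-tokenised text, except those in `keep`."""
    to_remove = set(stopwords) - set(keep)
    return " ".join(word for word in text.split() if word not in to_remove)


# Tiny illustrative lists; the real stop word list is much longer
stopwords = ["og", "i", "vi", "ikke", "det"]
keep = ["ikke", "vi"]  # hypothetical examples of words worth keeping

print(remove_stopwords_keeping("vi skal ikke glemme det i aften", stopwords, keep))
```

The same function could be applied to every speech with a dictionary comprehension, in the same way `remove_stopwords` is applied above.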

In [None]:
cleaned

In [None]:
# turn cleaned dictionary into dataframe
df = pd.DataFrame.from_dict(cleaned, orient='index', columns=['transcripts'])

In [None]:
df

In [None]:
# copy the index (the file name) into a 'year' column
df['year'] = df.index
# keep only the last four characters, i.e., the year
df['year'] = df['year'].str[-4:]
df['minister'] = df.index
# remove all digits from the minister col (regex=True is needed in pandas >= 2.0)
df['minister'] = df['minister'].str.replace(r'\d', '', regex=True)
# replace underscores in the minister col with spaces
df['minister'] = df['minister'].str.replace('_', ' ')
# make year numeric
df['year'] = df['year'].astype(int)
# remove the words "nytaarstale", "statsminister" and "januar" from the minister col
df['minister'] = df['minister'].str.replace('nytaarstale', '').str.replace('statsminister','').str.replace('januar','')

# correct names
df['minister'] = df['minister'].str.replace('rasmussens', 'rasmussen')
df['minister'] = df['minister'].str.replace('schmidts', 'schmidt')
df['minister'] = df['minister'].str.replace('frederiksens', 'frederiksen')
df['minister'] = df['minister'].str.replace('schlueters', 'schlueter')

In [None]:
df

In [None]:
# Capitalise minister names
df['minister'] = df['minister'].str.title()

In [None]:
# remove leading and trailing whitespace from minister names
df['minister'] = df['minister'].str.strip()
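
The string operations above pull the year and the minister's name out of the file name. The same idea can be sketched as a single function. The file names used here are hypothetical examples of the pattern, and `parse_filename`, `NOISE_WORDS` and `CORRECTIONS` are illustrative names, not part of the notebook.

```python
import re

# Words that appear in the file names but are not part of a minister's name
NOISE_WORDS = {"nytaarstale", "statsminister", "januar"}

# Genitive forms in the file names mapped to the plain surname
CORRECTIONS = {
    "rasmussens": "rasmussen",
    "schmidts": "schmidt",
    "frederiksens": "frederiksen",
    "schlueters": "schlueter",
}


def parse_filename(stem):
    """Split a file name stem into (minister, year)."""
    year = int(stem[-4:])  # the last four characters are the year
    words = re.sub(r"\d", "", stem).split("_")  # drop digits, split on underscores
    words = [CORRECTIONS.get(w, w) for w in words if w and w not in NOISE_WORDS]
    return " ".join(words).title(), year


print(parse_filename("poul_nyrup_rasmussens_nytaarstale_1998"))
```

Unlike the column-wise version above, this sketch matches whole words only, so a noise word embedded inside a name would be left untouched.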

### Adding party information to dataframe

In [None]:
df

In [None]:
# list the distinct minister names to check that the clean-up worked
print(df['minister'].unique())

In [None]:
# Map each minister to their party
def party(row):
    if row['minister']== 'Poul Nyrup Rasmussen':
        return 'Socialdemokratiet'
    if row['minister']== 'Anders Fogh Rasmussen':
        return 'Venstre'
    if row['minister']== 'Lars Loekke Rasmussen':
        return 'Venstre'
    if row['minister']== 'Helle Thorning Schmidt':
        return 'Socialdemokratiet'
    if row['minister']== 'Mette Frederiksen':
        return 'Socialdemokratiet'
    if row['minister']== 'Poul Schlueter':
        return 'Det Konservative Folkeparti'
    # if minister is none of the above, return 'other'
    else: 
        return 'other'

In [None]:
df['Party'] = df.apply(party, axis=1)
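
An alternative to the chain of `if` statements is a dictionary lookup with `Series.map`. The sketch below uses a small made-up dataframe (`df_example`) rather than the real `df`; ministers missing from the dictionary get `'other'`, as in `party` above.

```python
import pandas as pd

PARTY_BY_MINISTER = {
    "Poul Nyrup Rasmussen": "Socialdemokratiet",
    "Anders Fogh Rasmussen": "Venstre",
    "Lars Loekke Rasmussen": "Venstre",
    "Helle Thorning Schmidt": "Socialdemokratiet",
    "Mette Frederiksen": "Socialdemokratiet",
    "Poul Schlueter": "Det Konservative Folkeparti",
}

# Made-up rows for illustration; the last name is deliberately not in the dict
df_example = pd.DataFrame(
    {"minister": ["Mette Frederiksen", "Poul Schlueter", "Unknown Name"]}
)

# map() returns NaN for names missing from the dict; fillna() turns those into 'other'
df_example["Party"] = df_example["minister"].map(PARTY_BY_MINISTER).fillna("other")
print(df_example)
```

Keeping the mapping in a dictionary makes it easy to add the earlier prime ministers from the overview in section 1 without touching the logic.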

In [None]:
df

In [None]:
# Map each party to its political wing
def wing(row):
    if row['Party']== 'Socialdemokratiet':
        return 'Red'
    if row['Party']== 'Venstre':
        return 'Blue'
    if row['Party']== 'Det Konservative Folkeparti':
        return 'Blue'
    if row['Party']== 'Radikale Venstre':
        return 'Middle'
    return 'Other' 

In [None]:
df['Wing'] = df.apply(wing, axis=1)

In [None]:
df

In [None]:
# make sure the 'year' column is numeric (astype returns a copy, so assign it back)
df['year'] = df['year'].astype(int)

In [None]:
df.to_csv('speeches_df.csv', index=False)