# Data preparation

This notebook reads the `TXT` files of the speeches and builds a single dataframe with all the tokenized and normalized content we'll use.

Please note that none of the code chunks in this notebook were actually run from here. Because the processing took a long time to complete, we converted this notebook into a `Python` script and submitted it through the Slurm workload manager of the Computer Science department. The script is the file `2_data_preparation.py` in this same directory, and the file `run_data_preparation.sbatch` submits it to the Slurm environment.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import re
import spacy
from nltk.tokenize import sent_tokenize
#from lucem_illud_2020 import word_tokenize

import sys
sys.path.insert(0, '../../../scripts')
import data_cleaning as clean

Defining the directory of the speeches:

In [3]:
speeches_dir = '../../../data/presidentialSpeechPeru/txt'

We load the corpus from this path using a helper function we created:

In [4]:
speeches_raw = clean.loadcorpus(speeches_dir)

mensaje-1956-mpu.txt
mensaje-1957-mpu.txt
mensaje-1958-mpu.txt
mensaje-1959-mpu.txt
mensaje-1960-mpu.txt
mensaje-1961-mpu.txt
mensaje-1962-nll.txt
mensaje-1963-fbt.txt
mensaje-1964-fbt.txt
mensaje-1965-fbt.txt
mensaje-1966-fbt.txt
mensaje-1967-fbt.txt
mensaje-1968-fbt.txt
mensaje-1969-jva.txt
mensaje-1970-jva.txt
mensaje-1971-jva.txt
mensaje-1972-jva.txt
mensaje-1973-jva.txt
mensaje-1974-jva.txt
mensaje-1975-jva.txt
mensaje-1976-fmb.txt
mensaje-1977-fmb.txt
mensaje-1978-fmb.txt
mensaje-1979-fmb.txt
mensaje-1980-fbt.txt
mensaje-1981-fbt.txt
mensaje-1982-fbt.txt
mensaje-1983-fbt.txt
mensaje-1984-fbt.txt
mensaje-1985-ag.txt
mensaje-1986-ag.txt
mensaje-1987-ag.txt
mensaje-1988-ag.txt
mensaje-1989-ag.txt
mensaje-1990-af.txt
mensaje-1991-af.txt
mensaje-1992-af.txt
mensaje-1993-af.txt
mensaje-1994-af.txt
mensaje-1995-af.txt
mensaje-1996-af.txt
mensaje-1997-af.txt
mensaje-1998-af.txt
mensaje-1999-af.txt
mensaje-2000-af.txt
mensaje-2000-vp-noviembre.txt
mensaje-2001-at.txt
mensaje-2002-at.txt
m
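The helper `clean.loadcorpus` is not shown in this notebook. As a rough idea of what such a loader could look like, here is a minimal sketch. The name `load_corpus_sketch`, the UTF-8 encoding and the choice to return a list of paragraphs per file are all assumptions made for illustration; the real helper may behave differently.

```python
from pathlib import Path


def load_corpus_sketch(corpus_dir):
    """Hypothetical stand-in for clean.loadcorpus: map each TXT filename to its paragraphs."""
    corpus = {}
    # Sort the paths so the files are read and printed in a stable order.
    for path in sorted(Path(corpus_dir).glob('*.txt')):
        print(path.name)
        text = path.read_text(encoding='utf-8')  # encoding is an assumption
        # Keep non-empty lines only; each one is treated as a paragraph.
        corpus[path.name] = [line.strip() for line in text.splitlines() if line.strip()]
    return corpus
```

Calling `load_corpus_sketch(speeches_dir)` would print one filename per file, much like the listing above, and return a dictionary keyed by filename.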

After this, we load the result into a data frame and start adding some metadata columns:

In [5]:
speech = pd.DataFrame()
filenames = []
raw = []
# Collect filenames and raw texts in the same order so the two columns stay aligned.
for filename, raw_speech in speeches_raw.items():
    print(filename)
    filenames.append(filename)
    raw.append(raw_speech)
speech['filename'] = filenames
speech['raw text'] = raw

# The year is the first run of four digits in the filename; it is kept as a string.
pattern = re.compile(r'[0-9]{4}')
speech['year'] = speech['filename'].apply(lambda x: pattern.search(x).group(0))
speech = speech.sort_values(by='year').reset_index(drop=True)

mensaje-1956-mpu.txt
mensaje-1957-mpu.txt
mensaje-1958-mpu.txt
mensaje-1959-mpu.txt
mensaje-1960-mpu.txt
mensaje-1961-mpu.txt
mensaje-1962-nll.txt
mensaje-1963-fbt.txt
mensaje-1964-fbt.txt
mensaje-1965-fbt.txt
mensaje-1966-fbt.txt
mensaje-1967-fbt.txt
mensaje-1968-fbt.txt
mensaje-1969-jva.txt
mensaje-1970-jva.txt
mensaje-1971-jva.txt
mensaje-1972-jva.txt
mensaje-1973-jva.txt
mensaje-1974-jva.txt
mensaje-1975-jva.txt
mensaje-1976-fmb.txt
mensaje-1977-fmb.txt
mensaje-1978-fmb.txt
mensaje-1979-fmb.txt
mensaje-1980-fbt.txt
mensaje-1981-fbt.txt
mensaje-1982-fbt.txt
mensaje-1983-fbt.txt
mensaje-1984-fbt.txt
mensaje-1985-ag.txt
mensaje-1986-ag.txt
mensaje-1987-ag.txt
mensaje-1988-ag.txt
mensaje-1989-ag.txt
mensaje-1990-af.txt
mensaje-1991-af.txt
mensaje-1992-af.txt
mensaje-1993-af.txt
mensaje-1994-af.txt
mensaje-1995-af.txt
mensaje-1996-af.txt
mensaje-1997-af.txt
mensaje-1998-af.txt
mensaje-1999-af.txt
mensaje-2000-af.txt
mensaje-2000-vp-noviembre.txt
mensaje-2001-at.txt
mensaje-2002-at.txt
m
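The year extraction relies on every filename containing a four-digit run. The sketch below isolates that step on a few of the filenames listed above. The function name `extract_year` is introduced here for illustration only; unlike the lambda in the cell, it returns `None` instead of raising an `AttributeError` when a filename has no year.

```python
import re

YEAR_PATTERN = re.compile(r'[0-9]{4}')


def extract_year(filename):
    """Return the first four-digit run in the filename as a string, or None if there is none."""
    match = YEAR_PATTERN.search(filename)
    return match.group(0) if match else None


sample = ['mensaje-2000-vp-noviembre.txt', 'mensaje-1956-mpu.txt', 'mensaje-2000-af.txt']
years = [extract_year(name) for name in sample]
# Both files from 2000 get the same year, so sorting by year alone does not
# decide which of the two comes first.
```

Since two speeches share the year 2000, a second sort key (for example the filename) would be needed if their relative order matters.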

Now we clean these raw texts using another ad-hoc function:

In [6]:
speech['cleaned text'] = speech['raw text'].apply(clean.clean_raw_text)

Adding the administration and president of each speech:

In [7]:
# The year is stored as a string; convert it once for the range comparisons.
year = speech['year'].astype('int32')

# (first year, last year, administration, president); both bounds are inclusive.
terms = [
    (1956, 1961, 'Prado', 'Prado'),
    (1962, 1962, 'Lindley', 'Lindley'),
    (1963, 1968, 'Belaunde(1)', 'Belaunde'),
    (1969, 1975, 'Velasco', 'Velasco'),
    (1976, 1979, 'Morales Bermudez', 'Morales Bermudez'),
    (1980, 1984, 'Belaunde(2)', 'Belaunde'),
    (1985, 1989, 'Garcia(1)', 'Garcia'),
    (1990, 1994, 'Fujimori(1)', 'Fujimori'),
    (1995, 2000, 'Fujimori(2)', 'Fujimori'),
    (2001, 2005, 'Toledo', 'Toledo'),
    (2006, 2010, 'Garcia(2)', 'Garcia'),
    (2011, 2015, 'Humala', 'Humala'),
    (2016, 2017, 'Kuzcynski/Vizcarra', 'Kuzcynski'),
    (2018, float('inf'), 'Kuzcynski/Vizcarra', 'Vizcarra'),  # open-ended last term
]
for first, last, administration, president in terms:
    in_term = year.between(first, last)
    speech.loc[in_term, 'administration'] = administration
    speech.loc[in_term, 'president'] = president

speech['year-president'] = speech['year'] + '-' + speech['president']

In [8]:
speech.head()

|   | filename | raw text | year | cleaned text | administration | president | year-president |
|---|---|---|---|---|---|---|---|
| 0 | mensaje-1956-mpu.txt | [MENSAJE DEL PRESIDENTE CONSTITUCIONAL DEL PER... | 1956 | MENSAJE DEL PRESIDENTE CONSTITUCIONAL DEL PERÚ... | Prado | Prado | 1956-Prado |
| 1 | mensaje-1957-mpu.txt | [MENSAJE DEL PRESIDENTE CONSTITUCIONAL DEL PER... | 1957 | MENSAJE DEL PRESIDENTE CONSTITUCIONAL DEL PERÚ... | Prado | Prado | 1957-Prado |
| 2 | mensaje-1958-mpu.txt | [MENSAJE DEL PRESIDENTE CONSTITUCIONAL DEL PER... | 1958 | MENSAJE DEL PRESIDENTE CONSTITUCIONAL DEL PERÚ... | Prado | Prado | 1958-Prado |
| 3 | mensaje-1959-mpu.txt | [MENSAJE DEL PRESIDENTE CONSTITUCIONAL DEL PER... | 1959 | MENSAJE DEL PRESIDENTE CONSTITUCIONAL DEL PERÚ... | Prado | Prado | 1959-Prado |
| 4 | mensaje-1960-mpu.txt | [MENSAJE DEL PRESIDENTE CONSTITUCIONAL DEL PER... | 1960 | MENSAJE DEL PRESIDENTE CONSTITUCIONAL DEL PERÚ... | Prado | Prado | 1960-Prado |


Tokenizing words:

In [None]:
speech['tokenized_words'] = speech['cleaned text'].apply(clean.word_tokenize)

Now we normalize:

In [None]:
speech['normalized_words'] = speech['tokenized_words'].apply(clean.normalize_tokens)
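The helpers `clean.word_tokenize` and `clean.normalize_tokens` are defined outside this notebook, so their exact behavior is not visible here. To make the two steps concrete, the following is a minimal pure-Python sketch. The function names, the regular expression and the tiny stop-word set are assumptions for illustration; the real helpers may do more (for example, lemmatize with spaCy).

```python
import re

# Tiny illustrative stop-word list; the real clean.STOP_WORDS is not shown here.
STOP_WORDS_SKETCH = {'el', 'la', 'de', 'del', 'y', 'que', 'en'}


def word_tokenize_sketch(text):
    """Split text into word tokens; accented letters stay inside their word."""
    return re.findall(r'\w+', text)


def normalize_tokens_sketch(tokens, stop_words=STOP_WORDS_SKETCH):
    """Lowercase the tokens, then drop stop words and non-alphabetic tokens."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token.isalpha() and token not in stop_words]


tokens = word_tokenize_sketch('El Congreso de la República aprobó 3 leyes.')
normalized = normalize_tokens_sketch(tokens)
# tokens     -> ['El', 'Congreso', 'de', 'la', 'República', 'aprobó', '3', 'leyes']
# normalized -> ['congreso', 'república', 'aprobó', 'leyes']
```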

Now we check the most frequent words:

In [None]:
countsDict = {}
for word in speech['normalized_words'].sum():
    word = word.lower()
    if word in countsDict:
        countsDict[word] += 1
    else:
        countsDict[word] = 1
    if word in clean.STOP_WORDS:
        print('stop word detected:', word)
word_counts = sorted(countsDict.items(), key=lambda x: x[1], reverse=True)
word_counts[:100]
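The same frequency count can also be sketched with `collections.Counter`, which avoids concatenating every token list with `.sum()` before counting. The toy `normalized_words` list below stands in for the dataframe column and is invented for illustration.

```python
from collections import Counter

# Toy stand-in for the 'normalized_words' column: one token list per speech.
normalized_words = [
    ['congreso', 'leyes', 'país'],
    ['país', 'educación', 'país'],
]

# Feed the tokens one by one instead of building one long list first.
counts = Counter(word for words in normalized_words for word in words)
top_two = counts.most_common(2)  # the most frequent entry here is ('país', 3)
```

With the real column, `counts.most_common(100)` would play the role of `word_counts[:100]`.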

Then we split each speech into sentences using `sent_tokenize` from `nltk`:

In [None]:
speech['tokenized_sentences'] = speech['cleaned text'].apply(sent_tokenize)

Now we tokenize each word in each sentence:

In [None]:
speech['tokenized_words_in_sentences'] = speech['tokenized_sentences'].apply(lambda x: [clean.word_tokenize(s) for s in x])

Finally, we normalize each tokenized word within each sentence:

In [None]:
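# `stop_words` is never defined in this notebook, so normalize_tokens is called
# with a single argument, as in the whole-speech normalization above.
speech['normalized_words_in_sentences'] = speech['tokenized_words_in_sentences'].apply(lambda x: [clean.normalize_tokens(s) for s in x])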
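The sentence-level columns hold nested lists: one list per sentence, each containing that sentence's tokens. The sketch below shows that shape on a toy string. It uses a naive regular-expression splitter rather than `nltk`, and both function names are hypothetical, so treat it only as an illustration of the structure.

```python
import re


def split_sentences_sketch(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def tokens_by_sentence_sketch(text):
    """Return one list of lowercase word tokens per sentence."""
    return [re.findall(r'\w+', s.lower()) for s in split_sentences_sketch(text)]


nested = tokens_by_sentence_sketch('Señores representantes. El país avanza. ¿Qué falta?')
# nested -> [['señores', 'representantes'], ['el', 'país', 'avanza'], ['qué', 'falta']]
```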

In [None]:
speech.head()