<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Headlines-preprocessing" data-toc-modified-id="Headlines-preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Headlines preprocessing</a></span></li><li><span><a href="#Texts-preprocessing" data-toc-modified-id="Texts-preprocessing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Texts preprocessing</a></span></li></ul></div>

This notebook walks through the preprocessing steps that will be implemented in src/features/build_features.py.

# Imports

In [92]:
import pandas as pd
import numpy as np
import pickle
import spacy
import nltk
from tqdm import tqdm

import tensorflow as tf

from string import punctuation

import os

In [93]:
data_dir_interim = '../data/interim'
data_dir_processed = '../data/processed'

In [53]:
folha_articles = pd.read_csv(os.path.join(data_dir_interim, 'news-of-the-site-folhauol/articles.csv'))

In [54]:
texts = folha_articles['text'].tolist()
headlines = folha_articles['title'].tolist()

# Headlines preprocessing

In [55]:
headlines_tokenizer = tf.keras.preprocessing.text.Tokenizer()
headlines_tokenizer.fit_on_texts(headlines)

In [56]:
encoded_headlines = headlines_tokenizer.texts_to_sequences(headlines)

In the previous notebook (1.0-bgm-inital-data-exploration) we saw that 16 is a good length for padding the headlines.

In [57]:
padded_headlines = tf.keras.preprocessing.sequence.pad_sequences(encoded_headlines, maxlen=16, 
                                                                 padding='post', truncating='post')
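
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a minimal pure-Python sketch of the same behaviour. `pad_post` is a hypothetical helper written only for illustration; it is not part of Keras and it ignores options such as `dtype`.

```python
def pad_post(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(..., padding='post', truncating='post')."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # truncating='post': drop tokens from the end
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding='post': append zeros at the end
    return padded


# Toy encoded headlines (made-up token ids), padded to length 4 instead of 16 for readability
toy_padded = pad_post([[5, 2], [7, 1, 9, 4, 3, 8]], maxlen=4)
print(toy_padded)  # [[5, 2, 0, 0], [7, 1, 9, 4]]
```

Short headlines are filled with zeros on the right, and long ones keep only their first `maxlen` tokens.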

# Texts preprocessing

For the articles, we only need enough context to generate the headlines, so we drop stopwords and lemmatize the text.

In [61]:
def preprocess_string(stn):
    '''
    Remove punctuation and lower case sentence
    '''
    if pd.isna(stn):
        return ''
    stn = stn.lower()
    return ''.join([c for c in stn if c not in punctuation])
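
One thing worth checking: `string.punctuation` only contains ASCII symbols, so typographic characters such as curly quotes, which may appear in news text, pass through `preprocess_string` unchanged. The sketch below repeats the same filtering logic without the pandas check; `strip_ascii_punctuation` and the sample sentence are made up for illustration.

```python
from string import punctuation


def strip_ascii_punctuation(stn):
    """Same filtering logic as preprocess_string, minus the NaN check."""
    stn = stn.lower()
    return ''.join(c for c in stn if c not in punctuation)


sample = 'O presidente disse: “Não vou renunciar”, e saiu.'
cleaned = strip_ascii_punctuation(sample)
print(cleaned)  # o presidente disse “não vou renunciar” e saiu
```

The colon, comma, and full stop are removed, but the curly quotes remain attached to their words. If that matters for the vocabulary, the set of characters to strip would need to be extended.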

In [85]:
def drop_stopwords(texts):
    '''
    Remove Portuguese stopwords from corpus
    '''
    new_texts = []
    # Use the public corpus reader; a set makes each membership test O(1)
    stop_words = set(nltk.corpus.stopwords.words('portuguese'))
    for stn in texts:
        if pd.isna(stn):
            new_texts.append('')
        else:
            new_texts.append(' '.join([word for word in stn.split() 
                                       if word.lower() not in stop_words]))
    return new_texts

In [None]:
def lemmatize_string(stn, lemmatizer):
    '''
    Lemmatize words in sentence
    '''
    new_stn = []
    for token in lemmatizer(stn):
        new_stn.append(token.lemma_)
        
    return ' '.join(new_stn)

In [90]:
texts = [preprocess_string(stn) for stn in texts]

In [64]:
texts = drop_stopwords(texts)

In [65]:
lemmatizer = spacy.load("pt_core_news_sm")

In [112]:
lemmatized_path = os.path.join(data_dir_processed, "lemmatized-texts.pickle")
try:
    with open(lemmatized_path, 'rb') as f:
        lemmatized_texts = pickle.load(f)
    # Report success only after the file has actually been loaded
    print("Lemmatized texts restored from {}".format(lemmatized_path))
except FileNotFoundError:
    # No checkpoint yet: start from scratch
    lemmatized_texts = []

Lemmatized texts restored from ../data/processed/lemmatized-texts.pickle


In [113]:
start = len(lemmatized_texts)
end = len(texts)

In [111]:
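for i in range(start, end):
    stn = texts[i]
    lemmatized_texts.append(lemmatize_string(stn, lemmatizer))
    # Checkpoint every 1000 texts so an interrupted run can be resumed
    if i % 1000 == 0:
        with open(os.path.join(data_dir_processed, "lemmatized-texts.pickle"), 'wb') as f:
            pickle.dump(lemmatized_texts, f)
            
        print("Lemmatized texts saved at {}".format(os.path.join(data_dir_processed, "lemmatized-texts.pickle")))

# Final save so the texts processed after the last checkpoint are not lost
with open(os.path.join(data_dir_processed, "lemmatized-texts.pickle"), 'wb') as f:
    pickle.dump(lemmatized_texts, f)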

Lemmatized texts saved at ../data/processed/lemmatized-texts.pickle
Lemmatized texts saved at ../data/processed/lemmatized-texts.pickle
Lemmatized texts saved at ../data/processed/lemmatized-texts.pickle
Lemmatized texts saved at ../data/processed/lemmatized-texts.pickle
Lemmatized texts saved at ../data/processed/lemmatized-texts.pickle
Lemmatized texts saved at ../data/processed/lemmatized-texts.pickle
Lemmatized texts saved at ../data/processed/lemmatized-texts.pickle
Lemmatized texts saved at ../data/processed/lemmatized-texts.pickle
Lemmatized texts saved at ../data/processed/lemmatized-texts.pickle
Lemmatized texts saved at ../data/processed/lemmatized-texts.pickle
Lemmatized texts saved at ../data/processed/lemmatized-texts.pickle
Lemmatized texts saved at ../data/processed/lemmatized-texts.pickle


KeyboardInterrupt: 
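
The restore-then-resume pattern used above can be sketched in isolation. This is a simplified illustration, not the code in src/features/build_features.py: `process_with_checkpoints`, `fake_lemmatize`, and the temporary file are hypothetical stand-ins. The sketch saves after every `save_every` processed items and once more at the end.

```python
import os
import pickle
import tempfile


def process_with_checkpoints(items, func, path, save_every=1000):
    """Apply func to items, resuming from a pickled list at path if one exists."""
    try:
        with open(path, 'rb') as f:
            done = pickle.load(f)
    except FileNotFoundError:
        done = []

    # Only items that are not in the checkpoint yet get processed
    for i in range(len(done), len(items)):
        done.append(func(items[i]))
        if len(done) % save_every == 0:
            with open(path, 'wb') as f:
                pickle.dump(done, f)

    # Final save so the tail of the list is also persisted
    with open(path, 'wb') as f:
        pickle.dump(done, f)
    return done


def fake_lemmatize(stn):
    # Stand-in for the spaCy call; upper-casing just makes the effect visible
    return stn.upper()


checkpoint = os.path.join(tempfile.mkdtemp(), 'demo.pickle')
first_run = process_with_checkpoints(['a', 'b', 'c'], fake_lemmatize, checkpoint, save_every=2)
# A second call with more input only processes the new item 'd'
second_run = process_with_checkpoints(['a', 'b', 'c', 'd'], fake_lemmatize, checkpoint, save_every=2)
print(second_run)  # ['A', 'B', 'C', 'D']
```

Because the number of already processed items doubles as the resume index, the checkpoint file is the only state that has to survive an interruption.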

In [66]:
texts_tokenizer = tf.keras.preprocessing.text.Tokenizer()
texts_tokenizer.fit_on_texts(texts)

From the previous notebook, a good length for the padded articles is around 1000.

In [76]:
encoded_texts = texts_tokenizer.texts_to_sequences(texts)
padded_texts = tf.keras.preprocessing.sequence.pad_sequences(encoded_texts, maxlen=1000,
                                                             padding='post', truncating='post')
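
A quick way to sanity-check a `maxlen` choice is to measure how many sequences fit within it without truncation. The helper below is a hypothetical sketch (`coverage` is not defined elsewhere in the project), and the lengths are toy values, not figures from the Folha dataset.

```python
def coverage(encoded, maxlen):
    """Fraction of sequences whose length is at most maxlen (i.e. not truncated)."""
    if not encoded:
        return 0.0
    return sum(len(seq) <= maxlen for seq in encoded) / len(encoded)


# Toy sequences with lengths 2, 3, 5 and 8 (made-up token ids)
toy_encoded = [[1, 2], [3, 4, 5], [1] * 5, [2] * 8]
print(coverage(toy_encoded, maxlen=5))  # 0.75
```

On the real data the equivalent call would be `coverage(encoded_texts, 1000)`.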