<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Headlines-preprocessing" data-toc-modified-id="Headlines-preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Headlines preprocessing</a></span></li><li><span><a href="#Texts-preprocessing" data-toc-modified-id="Texts-preprocessing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Texts preprocessing</a></span></li></ul></div>

This notebook walks through the preprocessing steps that will be implemented in the src/features/build_features.py file.

# Imports

In [6]:
import pandas as pd
import numpy as np
import pickle
import spacy
import nltk
from tqdm import tqdm

import tensorflow as tf

from string import punctuation

import os

In [7]:
data_dir_interim = '../data/interim'
data_dir_processed = '../data/processed'

In [8]:
folha_articles = pd.read_csv(os.path.join(data_dir_interim, 'news-of-the-site-folhauol/articles.csv'))

In [9]:
texts = folha_articles['text'].tolist()
headlines = folha_articles['title'].tolist()

# Headlines preprocessing

In [10]:
headlines_tokenizer = tf.keras.preprocessing.text.Tokenizer()
headlines_tokenizer.fit_on_texts(headlines)

In [11]:
encoded_headlines = headlines_tokenizer.texts_to_sequences(headlines)
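
As a rough illustration of what the two cells above do, the sketch below mimics the idea behind the Keras `Tokenizer` in plain Python: words are ranked by frequency (index 1 is the most frequent word, 0 stays free for padding) and each headline becomes a list of those indices. The helper names `build_word_index` and `encode` and the toy headlines are assumptions made for this example. The real tokenizer also filters punctuation, which is left out here.

```python
from collections import Counter


def build_word_index(corpus):
    """Map each word to an integer rank, 1 being the most frequent word."""
    counts = Counter()
    for text in corpus:
        counts.update(text.lower().split())
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(corpus, word_index):
    """Replace every known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in corpus]


# Hypothetical toy headlines, not taken from the dataset
toy_headlines = ["governo anuncia novo plano",
                 "novo plano divide governo e oposição"]

word_index = build_word_index(toy_headlines)
encoded = encode(toy_headlines, word_index)
print(word_index)
print(encoded)
```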

In the previous notebook (1.0-bgm-inital-data-exploration) we saw that 16 tokens is a good length for padding the headlines.

In [12]:
padded_headlines = tf.keras.preprocessing.sequence.pad_sequences(encoded_headlines, maxlen=16, 
                                                                 padding='post', truncating='post')
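
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a minimal pure-Python sketch of the same behaviour: sequences longer than `maxlen` lose their tail, shorter ones are filled with zeros on the right. The helper `pad_post` is a hypothetical name used only for this illustration and is not how Keras implements it internally.

```python
def pad_post(sequence, maxlen, value=0):
    """Cut the sequence to maxlen items, then right-fill it with `value`."""
    trimmed = sequence[:maxlen]
    return trimmed + [value] * (maxlen - len(trimmed))


short_headline = [5, 12, 7]            # 3 token ids
long_headline = list(range(1, 21))     # 20 token ids

padded_short = pad_post(short_headline, maxlen=16)
padded_long = pad_post(long_headline, maxlen=16)

print(padded_short)
print(padded_long)
```

Both results have exactly 16 entries, so they can be stacked into a single array for the model.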

# Texts preprocessing

For the articles, we only need enough context to generate the headlines, so we drop stopwords and lemmatize the text. We also keep only the first paragraph of each article, since the previous notebook showed that it contains enough information to generate the headline.

In [18]:
def get_first_paragraph(stn):
    '''
    Return the first paragraph of a text (paragraphs are separated by two spaces)
    '''
    if pd.isna(stn):
        return ''
    return stn.split('  ')[0]

In [14]:
def preprocess_string(stn):
    '''
    Lowercase the text and remove punctuation
    '''
    if pd.isna(stn):
        return ''
    stn = stn.lower()
    return ''.join([c for c in stn if c not in punctuation])

In [15]:
def drop_stopwords(texts):
    '''
    Remove Portuguese stopwords from the corpus
    '''
    new_texts = []
    # A set makes the membership test below O(1) per word
    stop_words = set(nltk.corpus.stopwords.words('portuguese'))
    for stn in texts:
        if pd.isna(stn):
            new_texts.append('')
        else:
            new_texts.append(' '.join([word for word in stn.split() 
                                       if word.lower() not in stop_words]))
    return new_texts

In [16]:
def lemmatize_string(stn, lemmatizer):
    '''
    Lemmatize words in sentence
    '''
    new_stn = []
    for token in lemmatizer(stn):
        new_stn.append(token.lemma_)
        
    return ' '.join(new_stn)
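
The first three helpers can be chained on a single string to see what each step contributes. The sketch below is a standard-library-only approximation: the tiny `TOY_STOPWORDS` set stands in for the NLTK Portuguese list, the sample article is invented, and the lemmatization step is left out because it needs the spaCy model. All names in it are assumptions made for this example.

```python
from string import punctuation

# Hypothetical mini stopword list, used only for this illustration
TOY_STOPWORDS = {"o", "a", "os", "as", "de", "do", "da",
                 "em", "um", "uma", "que", "e"}


def first_paragraph(text):
    """Keep everything before the first double space."""
    return text.split("  ")[0]


def strip_punctuation(text):
    """Lowercase the text and drop ASCII punctuation characters."""
    return "".join(c for c in text.lower() if c not in punctuation)


def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated word found in stop_words."""
    return " ".join(w for w in text.split() if w not in stop_words)


# Invented sample article with two paragraphs separated by a double space
article = ("O governo anunciou, em Brasília, um novo plano de metas.  "
           "Segundo o ministro, a medida vale em 2018.")

step1 = first_paragraph(article)
step2 = strip_punctuation(step1)
step3 = remove_stopwords(step2, TOY_STOPWORDS)

print(step1)
print(step2)
print(step3)
```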

In [19]:
first_paragraph_texts = [get_first_paragraph(stn) for stn in texts]

In [20]:
processed_texts = [preprocess_string(stn) for stn in first_paragraph_texts]

In [21]:
processed_texts = drop_stopwords(processed_texts)

In [22]:
lemmatizer = spacy.load("pt_core_news_sm")

In [24]:
lemmatized_texts = []

In [28]:
# Lemmatize only the first 1/1000 of the corpus to time the step
for i in range(len(processed_texts)//1000):
    stn = processed_texts[i]
    lemmatized_texts.append(lemmatize_string(stn, lemmatizer))

CPU times: user 2.57 s, sys: 28.2 ms, total: 2.6 s
Wall time: 1.36 s
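
The loop above only covers `len(processed_texts)//1000` strings, so the reported wall time can be scaled up for a rough estimate of the full run. The arithmetic below assumes that the timing belongs to that cell, that the cost grows linearly with the number of texts, and that the sampled texts are representative. It is a back-of-the-envelope figure, not a measurement.

```python
sample_fraction = 1 / 1000     # the loop covers len(processed_texts) // 1000 items
sample_wall_seconds = 1.36     # wall time reported for the sample

# Linear extrapolation to the whole corpus (assumption: cost scales with size)
estimated_full_seconds = sample_wall_seconds / sample_fraction
estimated_full_minutes = estimated_full_seconds / 60

print(f"~{estimated_full_seconds:.0f} s (~{estimated_full_minutes:.0f} min)")
```

Under these assumptions, lemmatizing every article one string at a time would take a bit over 20 minutes. spaCy's `Language.pipe` processes texts in batches and is usually faster than calling the pipeline once per string.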


In [66]:
texts_tokenizer = tf.keras.preprocessing.text.Tokenizer()
# Note: fitted on the raw articles (`texts`), not on the preprocessed first paragraphs
texts_tokenizer.fit_on_texts(texts)

In the previous notebook we saw that a good length for the padded articles is around 1000 tokens.

In [76]:
encoded_texts = texts_tokenizer.texts_to_sequences(texts)
padded_texts = tf.keras.preprocessing.sequence.pad_sequences(encoded_texts, maxlen=1000,
                                                             truncating='post', padding='post')