<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports-and-loading-the-datasets" data-toc-modified-id="Imports-and-loading-the-datasets-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports and loading the datasets</a></span></li><li><span><a href="#Analyzing-the-first-paragraph-of-the-articles" data-toc-modified-id="Analyzing-the-first-paragraph-of-the-articles-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Analyzing the first paragraph of the articles</a></span></li><li><span><a href="#Analyzing-the-categories-vocabulary" data-toc-modified-id="Analyzing-the-categories-vocabulary-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Analyzing the categories vocabulary</a></span></li></ul></div>

# Imports and loading the datasets

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter

import os

In [3]:
interim_data_dir = '../data/interim'

In [4]:
folha_articles = pd.read_csv(os.path.join(interim_data_dir, 'news-of-the-site-folhauol/articles.csv'))

The dataset is large, so our model will use only the first words of each news article. To keep the vocabulary small, we will also restrict it to a single news category.

# Analyzing the first paragraph of the articles

In [5]:
# Print the headline and first paragraph of a few randomly chosen articles.
for _ in range(5):
    idx = np.random.randint(len(folha_articles))
    print("Headline: {}".format(folha_articles.loc[idx, 'title']))
    # The split assumes paragraphs are separated by two spaces in the raw text.
    print("First paragraph: {}".format(folha_articles.loc[idx, 'text'].split('  ')[0]))
    print()

Headline: Congresso faz 'contrarreforma', afirma presidente do PT
First paragraph: O presidente nacional do PT, Rui Falcão, classificou de "contrarreforma" a reforma política que tramita no Congresso Nacional e voltou a defender a instalação de uma Assembleia Constituinte para analisar o tema.

Headline: Andrés nega atraso no pagamento de contas da Arena Corinthians
First paragraph: O ex-presidente do Corinthians Andrés Sanchez afirmou neste domingo (15) que a Arena Corinthians não está com as dívidas atrasadas.

Headline: Astrologia
First paragraph: Astral eletrizante de hoje sinaliza oscilações nos mercados. Lua minguante em Câncer: 4/10.

Headline: Janot recomenda que STF arquive queixa-crime de Lula contra senador
First paragraph: Em parecer enviado ao STF (Supremo Tribunal Federal), o procurador-geral da República, Rodrigo Janot, recomendou o arquivamento da queixa-crime do ex-presidente Luiz Inácio Lula da Silva contra o senador Ronaldo Caiado (DEM-GO), que acusou o petista de "b

As the samples show, the first few sentences often summarize the content of the entire article, so we are going to use only the first words of each text to generate the headlines.
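One way to keep only the opening words of each article is sketched below. It is a minimal illustration on a toy DataFrame, not the notebook's actual preprocessing: the helper `first_words`, its `n_words` cutoff, and the `lead` column are hypothetical names, and the two-space paragraph separator is carried over from the sampling cell above.

```python
import pandas as pd


def first_words(text, n_words=30):
    """Return the first `n_words` tokens of the first paragraph of `text`.

    Hypothetical helper: the cutoff and the two-space paragraph
    separator are assumptions, not values fixed by the notebook.
    """
    if pd.isna(text):
        return ""
    first_paragraph = text.split("  ")[0]
    return " ".join(first_paragraph.split()[:n_words])


# Toy data standing in for folha_articles.
toy_articles = pd.DataFrame({
    "title": ["Headline A", "Headline B", "Headline C"],
    "text": [
        "one two three four  second paragraph here",
        "single paragraph only",
        None,
    ],
})

# Keep only the first three words of each first paragraph.
toy_articles["lead"] = toy_articles["text"].apply(
    lambda text: first_words(text, n_words=3)
)
print(toy_articles[["title", "lead"]])
```

Articles with missing text get an empty string here so that later steps can filter them out explicitly.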

# Analyzing the categories vocabulary

In [6]:
# Build the set of distinct whitespace-separated tokens seen in each category.
category_vocab = {cat: set() for cat in folha_articles['category'].unique()}

for cat, text in zip(folha_articles['category'], folha_articles['text']):
    # Skip articles with missing text.
    if pd.isna(text):
        continue
    category_vocab[cat].update(text.split())

In [7]:
# Count the articles per category once, outside the loop.
articles_per_category = folha_articles['category'].value_counts()

for cat in category_vocab:
    print(cat)
    print("Number of unique words: {}".format(len(category_vocab[cat])))
    print("Unique word per article: {}".format(len(category_vocab[cat]) / articles_per_category.loc[cat]))
    print()

poder
Number of unique words: 264968
Unique word per article: 12.031968031968033

ilustrada
Number of unique words: 370409
Unique word per article: 22.66191495870297

mercado
Number of unique words: 287498
Unique word per article: 13.709966618979495

mundo
Number of unique words: 289442
Unique word per article: 16.896789258610625

esporte
Number of unique words: 228998
Unique word per article: 11.60658895083629

tec
Number of unique words: 79953
Unique word per article: 35.377433628318585

cotidiano
Number of unique words: 252583
Unique word per article: 14.886721282489539

ambiente
Number of unique words: 38538
Unique word per article: 78.4887983706721

equilibrioesaude
Number of unique words: 76250
Unique word per article: 58.11737804878049

sobretudo
Number of unique words: 57136
Unique word per article: 54.05487228003784

colunas
Number of unique words: 445647
Unique word per article: 20.610813060771438

educacao
Number of unique words: 71415
Unique word per article: 33.71813031161

From the output above, the best option for our model is the esporte category, since it has the fewest unique words per article (about 11.6), with poder close behind (about 12.0).
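The same selection can be done programmatically instead of by reading the printed output. The sketch below uses a toy DataFrame with the same `category` and `text` columns; the function name `unique_words_per_article` and the variables `best_category` and `selected` are hypothetical. As in the cells above, articles with missing text still count toward the number of articles in their category.

```python
import pandas as pd

# Toy data standing in for folha_articles.
toy_articles = pd.DataFrame({
    "category": ["esporte", "esporte", "esporte", "tec", "tec", "mundo"],
    "text": [
        "gol gol time",
        "time gol jogo",
        None,
        "app novo chip",
        "rede dados nuvem",
        "guerra paz acordo tratado",
    ],
})


def unique_words_per_article(articles):
    """Return unique words per article for each category, sorted ascending."""
    # Vocabulary size: distinct whitespace-separated tokens per category.
    vocab_size = (
        articles.dropna(subset=["text"])
        .groupby("category")["text"]
        .agg(lambda texts: len(set(" ".join(texts).split())))
    )
    # Article counts include rows with missing text, as in the cells above.
    article_counts = articles["category"].value_counts()
    return (vocab_size / article_counts).sort_values()


ratios = unique_words_per_article(toy_articles)
best_category = ratios.idxmin()

# Keep only the articles from the category with the lowest ratio.
selected = toy_articles[toy_articles["category"] == best_category]
print(ratios)
print("Selected category:", best_category)
```

Note that this ratio tends to fall as a category grows, because new articles reuse words already seen, so it favors large categories as well as lexically narrow ones.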