# News recommendation with WMD and bias analysis through subjectivity

The data used in this analysis comprise the news articles published by `Carta Capital`, `O Antagonista`, `O Globo`, and `Veja` throughout 2018. A detailed analysis of the data is available [here](https://pages.github.com/).

The goal of this notebook is to implement a news recommendation system based on `WMD` (Word Mover's Distance), so that ...

### Similarity by Word Mover's Distance

In [1]:
# importing modules and setting log format
import os
import re
import nltk
import gensim
import logging
import pandas as pd

from nltk.corpus import stopwords
from gensim.similarities import WmdSimilarity

nltk.download('stopwords')
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
PUNCTUATION = r'[^a-zA-Z0-9áéíóúÁÉÍÓÚâêîôÂÊÎÔãõÃÕçÇ%]' # every character outside the listed letters, digits and '%' is treated as punctuation

[nltk_data] Downloading package stopwords to /home/diogo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# load data
carta_capital = pd.read_csv("data_2018/carta_capital.csv") 
oantagonista = pd.read_csv("data_2018/oantagonista.csv") 
oglobo = pd.read_csv("data_2018/oglobo.csv") 
veja = pd.read_csv("data_2018/veja.csv") 
# concat all news
news = pd.concat((carta_capital, oantagonista, oglobo, veja), sort=False, ignore_index=True)

In [3]:
# load the Portuguese stop words once, as a set for fast membership tests
STOP_WORDS = set(stopwords.words('portuguese'))

# function for processing sentences: returns a list of lowercase tokens
def processSentences(text):
    text = re.sub(PUNCTUATION, ' ', str(text)) # replace punctuation with spaces
    words = text.lower().split() # lowercase and split the text into words
    return [word for word in words if word not in STOP_WORDS] # remove stop words
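
To make the effect of the preprocessing step concrete, here is a minimal sketch that applies the same regular expression to a single sentence. The sentence, the function name `process_sentence_demo`, and the tiny stop-word set `DEMO_STOP_WORDS` are made up for illustration; the notebook itself uses the full NLTK Portuguese list.

```python
import re

# Same character class as PUNCTUATION in the notebook
PUNCTUATION = r'[^a-zA-Z0-9áéíóúÁÉÍÓÚâêîôÂÊÎÔãõÃÕçÇ%]'
# Hypothetical, tiny stop-word set standing in for stopwords.words('portuguese')
DEMO_STOP_WORDS = {'a', 'o', 'em', 'de'}

def process_sentence_demo(text, stop_words=DEMO_STOP_WORDS):
    text = re.sub(PUNCTUATION, ' ', str(text))  # replace punctuation with spaces
    words = text.lower().split()                # lowercase and split into words
    return [w for w in words if w not in stop_words]  # drop stop words

# Hypothetical sentence, not taken from the dataset
tokens = process_sentence_demo('A inflação subiu 3% em 2018, diz o jornal.')
print(tokens)
```

The `%` sign survives because it is part of the character class. Characters that are not listed, such as `à` or `ü`, are replaced by spaces like any punctuation mark, so a word containing them is split at that position.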

In [4]:
# processing news sentences
news['text'] = news['text'].apply(processSentences)

In [5]:
model = gensim.models.Word2Vec.load('model/news-w2v.bin')

In [None]:
# export only the word vectors in word2vec format (written as text unless binary=True is passed)
model.wv.save_word2vec_format("model/news_vectors.bin")

### Query

In [6]:
# select the news published within one hour of a given date
def news_by_date(date):
    date_range = 3600 # 1 hour, assuming 'date' holds timestamps in seconds
    # keep the rows whose date falls inside [date - date_range, date + date_range]
    selected = news.loc[(news['date'] >= (date - date_range)) & (news['date'] <= (date + date_range))]
    return selected.reset_index() # reset dataframe indexes (the old index is kept as a column)
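
The window above only makes sense if `date` is a numeric timestamp in seconds, which is an assumption read off the `3600 # 1 hour` comment rather than something the data description states. Under that assumption, the sketch below shows how a calendar date maps to such a timestamp and which neighbours the inclusive window keeps. The date and the offsets are hypothetical, and `in_window` is an illustrative helper, not part of the notebook.

```python
from datetime import datetime, timezone

WINDOW_SECONDS = 3600  # same one-hour window as date_range in news_by_date

def in_window(timestamp, center, window=WINDOW_SECONDS):
    """True when timestamp lies in [center - window, center + window]."""
    return center - window <= timestamp <= center + window

# Hypothetical publication time, converted to seconds since the epoch
center = datetime(2018, 4, 7, 12, 0, tzinfo=timezone.utc).timestamp()
# Hypothetical offsets (in seconds) of other articles relative to it
offsets = [-7200, -3600, 0, 1800, 3601]
kept = [off for off in offsets if in_window(center + off, center)]
print(kept)
```

Both boundaries are inclusive, so an article published exactly one hour earlier is kept, while one published 3601 seconds later is dropped.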

In [7]:
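article_url = 'https://www.oantagonista.com/brasil/103078/' # url of the article to be compared
article = news.loc[news['url'] == article_url] # get article by url

# candidate news: everything published within one hour of the article
news_date_range = news_by_date(float(article['date'].iloc[0]))
# pass the word vectors (model.wv); recent gensim versions expect KeyedVectors rather than the full model
instance = WmdSimilarity(news_date_range['text'], model.wv, num_best=5)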
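
For intuition about what is being measured, the sketch below computes the relaxed WMD lower bound: each word sends all of its weight to the nearest word of the other document, and the larger of the two directions is kept. This is not the exact optimal-transport computation that `gensim` performs, only a simplified stand-in. The 2-D vectors in `TOY_VECTORS` are invented for illustration and do not come from the trained model.

```python
import math
from collections import Counter

# Hypothetical 2-D embeddings, invented for illustration only
TOY_VECTORS = {
    'greve': (1.0, 0.0),
    'paralisação': (0.9, 0.1),
    'fome': (0.0, 1.0),
    'jejum': (0.1, 0.9),
    'eleição': (-1.0, 0.0),
}

def nbow(tokens):
    """Normalized bag-of-words weights: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_sided_cost(src, dst, vectors):
    """Cost of moving every source word entirely to its nearest target word."""
    targets = set(dst)
    return sum(
        weight * min(euclidean(vectors[word], vectors[t]) for t in targets)
        for word, weight in nbow(src).items()
    )

def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed lower bound on WMD: the larger of the two one-sided costs."""
    return max(one_sided_cost(doc_a, doc_b, vectors),
               one_sided_cost(doc_b, doc_a, vectors))

doc_a = ['greve', 'fome']
doc_b = ['paralisação', 'jejum']  # different words, nearby vectors
doc_c = ['eleição']               # unrelated vector

print(relaxed_wmd(doc_a, doc_b, TOY_VECTORS))
print(relaxed_wmd(doc_a, doc_c, TOY_VECTORS))
```

Documents that share no words can still be close when their words have nearby embeddings, which is the property that makes WMD attractive for matching articles about the same event across outlets.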

In [9]:
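sims = instance[article['text'].tolist()[0]] # list of (index, similarity) pairs, best first
print('Article:')
print(article['url'].tolist()[0])
for idx, score in sims: # iterate over the matches actually returned (at most num_best)
    print('\n')
    print('sim = %.4f' % score)
    print(news_date_range['url'][idx])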


Article:
https://www.oantagonista.com/brasil/103078/


sim = 1.0000
https://www.oantagonista.com/brasil/103078/


sim = 0.4823
https://www.oantagonista.com/brasil/greve-de-fome-nao-e-dele/


sim = 0.4760
https://www.oantagonista.com/brasil/farsa-das-greves-de-fome-de-lula/


sim = 0.4758
https://www.oantagonista.com/brasil/gilmar-exposto/


sim = 0.4752
https://www.oantagonista.com/brasil/gosto-alckmin-mas-nao-vou-votar-nele-para-presidente/



### Subjectivity Analysis