### Problema


Uma revista precisa catalogar todas as suas notícias em diferentes categorias. O objetivo desta competição é desenvolver o melhor modelo de aprendizagem profunda para prever a categoria de novas notícias.


<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/04/Untitled-Diagram.png
" style="width: 400px;"/>


As categorias possíveis são:

* ambiente
* equilibrioesaude
* sobretudo
* educacao
* ciencia
* tec
* turismo
* empreendedorsocial
* comida

In [2]:
!pip install unidecode

Collecting unidecode
[?25l  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
[K     |█▍                              | 10kB 36.0MB/s eta 0:00:01[K     |██▊                             | 20kB 6.0MB/s eta 0:00:01[K     |████▏                           | 30kB 8.6MB/s eta 0:00:01[K     |█████▌                          | 40kB 5.7MB/s eta 0:00:01[K     |██████▉                         | 51kB 6.9MB/s eta 0:00:01[K     |████████▎                       | 61kB 8.2MB/s eta 0:00:01[K     |█████████▋                      | 71kB 9.4MB/s eta 0:00:01[K     |███████████                     | 81kB 10.5MB/s eta 0:00:01[K     |████████████▍                   | 92kB 11.7MB/s eta 0:00:01[K     |█████████████▊                  | 102kB 9.2MB/s eta 0:00:01[K     |███████████████▏                | 112kB 9.2MB/s eta 0:00:01[K     |████████████████▌               | 122kB 9.2M

In [0]:
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

import pandas as pd, xgboost, numpy as np, textblob, string
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers
import re
from unidecode import unidecode

In [3]:
import nltk 
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from urllib.parse import urlparse
from nltk import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
import nltk 
nltk.download('rslp')
from nltk.stem import RSLPStemmer

[nltk_data] Downloading package rslp to /root/nltk_data...
[nltk_data]   Unzipping stemmers/rslp.zip.


In [5]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


### Pre Processamento

In [0]:
# Constantes
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
PONTUACTION = re.compile('[^\w\s]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('portuguese'))
STOPWORDS2 = ['r', 'h', 'u', 'ub']

In [0]:
stemmer = RSLPStemmer()
lemmatizer = WordNetLemmatizer()

In [0]:
# Funcao pra remorar URL
def is_url(url):
  try:
    result = urlparse(url)
    return all([result.scheme, result.netloc])
  except ValueError:
    return False

# Funcao para extrair o radical da palavra
def stemming(text):
  word_tokens = word_tokenize(text) 
  filtering = [stemmer.stem(w) for w in word_tokens]

  text_final = ' '.join(filtering)
  return text_final

# tratamento lexical das palavras
def lemmatizing(text):
  list_filter = []
  for word in text:
    word_tokens = word_tokenize(word) 
    filtering = [lemmatizer.lemmatize(w) for w in word_tokens]
    
  text_final = ' '.join(filtering)
  return text_final

# remove palavras recorrente sem valor significativo
def remove_palavras_recorrentes(text):
  text = ' '.join(word2 for word2 in text.split() if word2 not in STOPWORDS2)
  return text

# pre processamento geral do texto
def limpa_texto(text):
    text = text.lower()
    text = ' '.join(unidecode(word3) for word3 in text.split())
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    text = REPLACE_BY_SPACE_RE.sub('', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    text = PONTUACTION.sub('', text)
    text = ''.join([i for i in text if not i.isdigit()])
    text = text.strip()
    return text

In [0]:
# carrega o dataset
df = pd.read_csv("train.csv")

# concatenei titulo e texto
df = df.reset_index(drop=True)
df['content'] = df['text'] + '\n' + df['text']

# remove URLs
df['content'] = [' '.join(y for y in x.split() if not is_url(y)) for x in df['content']]

# limpa texto
df['content'] = df['content'].apply(limpa_texto)
df['content'] = df['content'].apply(stemming)
df['content'] = df['content'].apply(remove_palavras_recorrentes)

In [13]:
trainDF = pd.DataFrame()
trainDF['text'] = df.content
trainDF['label'] = df.category
Y_classes = pd.get_dummies(df['category']).columns

trainDF.head()

Unnamed: 0,text,label
0,urban anarqu bairr antig lisbo escond real col...,turismo
1,empr soc mostr possi unir negoci impact posi s...,empreendedorsocial
2,menos quatr estaco esqu centr chil vir obrig p...,turismo
3,gravid provoc mudanc fisic durado cerebr mulh ...,ciencia
4,algum vez voc ja ouv fras o facebook ar so cel...,tec


### Divisao do DataFrame

In [0]:
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])

# enconde das labels
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

In [0]:
# criaçãão do vetor de valores
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])

# transformacao em vetor os datasets de treino e teste
xtrain_count =  count_vect.transform(train_x)
xvalid_count =  count_vect.transform(valid_x)

In [0]:
#diversas metricas
trainDF['char_count'] = trainDF['text'].apply(len)
trainDF['word_count'] = trainDF['text'].apply(lambda x: len(x.split()))
trainDF['word_density'] = trainDF['char_count'] / (trainDF['word_count']+1)
trainDF['punctuation_count'] = trainDF['text'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
trainDF['title_word_count'] = trainDF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
trainDF['upper_case_word_count'] = trainDF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

In [17]:
trainDF.head()

Unnamed: 0,text,label,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count
0,urban anarqu bairr antig lisbo escond real col...,turismo,4979,816,6.094247,0,0,0
1,empr soc mostr possi unir negoci impact posi s...,empreendedorsocial,1531,250,6.099602,0,0,0
2,menos quatr estaco esqu centr chil vir obrig p...,turismo,2897,520,5.560461,0,0,0
3,gravid provoc mudanc fisic durado cerebr mulh ...,ciencia,3173,508,6.233792,0,0,0
4,algum vez voc ja ouv fras o facebook ar so cel...,tec,1263,234,5.374468,0,0,0


In [0]:
lda_model = decomposition.LatentDirichletAllocation(n_components=20, learning_method='online', max_iter=20)
X_topics = lda_model.fit_transform(xtrain_count)
topic_word = lda_model.components_ 
vocab = count_vect.get_feature_names()

# analise das principais palavras do texto
n_top_words = 10
topic_summaries = []
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))

In [19]:
topic_summaries

['mulh sex violenc feminin gravid dor marid hormoni ginecolog desej',
 'jog atlet futebol pinguim the batman kim titul burm peptide',
 'medic paci trat saud doenc canc hospit drog test cas',
 'abort ga oxigeni sall recolh recall carcac conversa branque tcu',
 'fotograf fot imag viag tir selfi album juni edico registr',
 'animal pesquis human celul gen dna gene vinh produz outr',
 'noit brasil pesso aere inclu caf val hotel sao soc',
 'ano especi agu pesquis are pod outr terr sol viru',
 'curs estud ano prov vag en ensin ser not univers',
 'sao cidad nao cas ond fic visit ano local ha',
 'aliment com pes consum gord obes diet acuc alimentaca sauda',
 'carr sao motor empr veicul ser cust model paul mil',
 'dia fas vestibul prim prov candidat segund list univers questo',
 'empr us usuari diss nao nov ser ano serv red',
 'cerebr beb son estud pesquis pel doenc pod cient corp',
 'russ bact antibio dron contribuica resist uerj schmidt min resistenc',
 'carn porc g butantan produt protoindoeu

###Definicoes do Modelo

In [20]:
value_max_feature = 7000
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=value_max_feature)
tfidf_vect.fit(trainDF['text'])
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)

# ngram level tf-idf 
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=value_max_feature)
tfidf_vect_ngram.fit(trainDF['text'])
xtrain_tfidf_ngram =  tfidf_vect_ngram.transform(train_x)
xvalid_tfidf_ngram =  tfidf_vect_ngram.transform(valid_x)

# characters level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=value_max_feature)
tfidf_vect_ngram_chars.fit(trainDF['text'])
xtrain_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(train_x) 
xvalid_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(valid_x) 



### Criacao dos Embedding do Texto

In [21]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.pt.300.vec.gz

--2020-03-18 00:42:49--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.pt.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.20.6.166, 104.20.22.166, 2606:4700:10::6814:16a6, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.20.6.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1271093660 (1.2G) [binary/octet-stream]
Saving to: ‘cc.pt.300.vec.gz’


2020-03-18 00:44:31 (12.0 MB/s) - ‘cc.pt.300.vec.gz’ saved [1271093660/1271093660]



In [0]:
!gzip -d cc.pt.300.vec.gz

In [25]:
!ls

 cc.pt.300.vec						    test.csv
'download.php?file=embeddings%2Fword2vec%2Fcbow_s300.zip'   train.csv
 sample_data


In [0]:
# carregamento do embedding pre treinado
embeddings_index = {}
for i, line in enumerate(open('cc.pt.300.vec')):
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# criacao dos tokens
token = text.Tokenizer()
token.fit_on_texts(trainDF['text'])
word_index = token.word_index

# converter texto em sequencia de vetores
train_seq_x = sequence.pad_sequences(token.texts_to_sequences(train_x), maxlen=70)
valid_seq_x = sequence.pad_sequences(token.texts_to_sequences(valid_x), maxlen=70)

# mapeamento do texto com o embedding
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

### Criacao dos Modelos

In [0]:
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    classifier.fit(feature_vector_train, label)
    predictions = classifier.predict(feature_vector_valid)
    
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
        
    return  classifier, metrics.f1_score(valid_y, predictions, average='weighted')

In [28]:
# Naive Bayes on Count Vectors
base_model, accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count)
print("NB, Count Vectors: ", accuracy)

# Naive Bayes on Word Level TF IDF Vectors
base_model, accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
print("NB, WordLevel TF-IDF: ", accuracy)

# Naive Bayes on Ngram Level TF IDF Vectors
base_model, accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print("NB, N-Gram Vectors: ", accuracy)

# Naive Bayes on Character Level TF IDF Vectors
base_model, accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print("NB, CharLevel Vectors: ", accuracy)

NB, Count Vectors:  0.9034749384151586
NB, WordLevel TF-IDF:  0.821393493574874
NB, N-Gram Vectors:  0.7948897679717226
NB, CharLevel Vectors:  0.5083857663372462


In [29]:
# Linear Classifier on Count Vectors
base_model, accuracy = train_model(linear_model.LogisticRegression(random_state=1, max_iter=200, class_weight='balanced'), xtrain_count, train_y, xvalid_count)
print("LR, Count Vectors: ", accuracy)

# Linear Classifier on Word Level TF IDF Vectors
base_model, accuracy = train_model(linear_model.LogisticRegression(random_state=1, class_weight='balanced', C=2.0), xtrain_tfidf, train_y, xvalid_tfidf)
print("LR, WordLevel TF-IDF: ", accuracy)

# Linear Classifier on Ngram Level TF IDF Vectors
base_model, accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print("LR, N-Gram Vectors: ", accuracy)

# Linear Classifier on Character Level TF IDF Vectors
base_model, accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print("LR, CharLevel Vectors: ", accuracy)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LR, Count Vectors:  0.8878176209060492
LR, WordLevel TF-IDF:  0.9159347017018749


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LR, N-Gram Vectors:  0.8173047187194916
LR, CharLevel Vectors:  0.8708550547389966


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [30]:
# SVM Classifier on Count Vectors
base_model, accuracy = train_model(svm.SVC(), xtrain_count, train_y, xvalid_count)
print("LR, Count Vectors: ", accuracy)

# SVM Classifier on Word Level TF IDF Vectors
base_model, accuracy = train_model(svm.SVC(C=2, class_weight='balanced'), xtrain_tfidf, train_y, xvalid_tfidf)
print("LR, WordLevel TF-IDF: ", accuracy)

# SVM on Ngram Level TF IDF Vectors
base_model, accuracy = train_model(svm.SVC(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print("SVM, N-Gram Vectors: ", accuracy)

LR, Count Vectors:  0.8704933600442418
LR, WordLevel TF-IDF:  0.9114085550020787
SVM, N-Gram Vectors:  0.8063254554658117


In [31]:
# RF on Count Vectors
base_model, accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xvalid_count)
print("RF, Count Vectors: ", accuracy)

# RF on Word Level TF IDF Vectors
base_model, accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xvalid_tfidf)
print("RF, WordLevel TF-IDF: ", accuracy)

RF, Count Vectors:  0.8348484285986371
RF, WordLevel TF-IDF:  0.8515364828693002


In [32]:
# Extereme Gradient Boosting on Count Vectors
base_model, accuracy = train_model(xgboost.XGBClassifier(), xtrain_count.tocsc(), train_y, xvalid_count.tocsc())
print( "Xgb, Count Vectors: ", accuracy)

# Extereme Gradient Boosting on Word Level TF IDF Vectors
base_model, accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf.tocsc(), train_y, xvalid_tfidf.tocsc())
print( "Xgb, WordLevel TF-IDF: ", accuracy)

# Extereme Gradient Boosting on Character Level TF IDF Vectors
base_model, accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf_ngram_chars.tocsc(), train_y, xvalid_tfidf_ngram_chars.tocsc())
print( "Xgb, CharLevel Vectors: ", accuracy)

Xgb, Count Vectors:  0.8772081867294187
Xgb, WordLevel TF-IDF:  0.8746564297832835
Xgb, CharLevel Vectors:  0.8472514601644491


In [33]:
 # SGD Classifier on Count Vectors
base_model, accuracy = train_model(linear_model.SGDClassifier(max_iter=1000, tol=1e-3), xtrain_count, train_y, xvalid_count)
print("SGD, Count Vectors: ", accuracy)

# SGD Classifier on Word Level TF IDF Vectors
base_model, accuracy = train_model(linear_model.SGDClassifier(max_iter=2000, tol=1e-2), xtrain_tfidf, train_y, xvalid_tfidf)
print("SGD, WordLevel TF-IDF: ", accuracy)

SGD, Count Vectors:  0.8855914969599166
SGD, WordLevel TF-IDF:  0.9119467663816223


##Ensemble Learning

In [0]:
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

clf1 = LogisticRegression(random_state=1, class_weight='balanced', C=2.0)
clf2 = svm.SVC(C=2, class_weight='balanced')
clf3 = linear_model.SGDClassifier(max_iter=2000, tol=1e-2)

eclf = VotingClassifier(
    estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    voting='hard')

In [35]:
# Linear Classifier on Count Vectors
base_model, accuracy = train_model(eclf, xtrain_count, train_y, xvalid_count)
print("LR, Count Vectors: ", accuracy)

base_model, accuracy = train_model(eclf, xtrain_tfidf, train_y, xvalid_tfidf)
print("LR, WordLevel TF-IDF: ", accuracy)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LR, Count Vectors:  0.8974643819169975
LR, WordLevel TF-IDF:  0.9156799334125799


In [36]:
# Extra Trees Classification (Bagging)
import pandas
from sklearn import model_selection
from sklearn.ensemble import ExtraTreesClassifier

seed = 7
num_trees = 100
max_features = 7
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = ExtraTreesClassifier(n_estimators=num_trees)
results = model_selection.cross_val_score(model, xtrain_tfidf, train_y, cv=kfold)
print(results.mean())



0.8471283783783783


In [37]:
# AdaBoost Classification (Boosting)
import pandas
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier

seed = 1
num_trees = 30
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, xtrain_count, train_y, cv=kfold)
print(results.mean())



0.7118243243243243


In [38]:
# Stochastic Gradient Boosting Classification (Boosting)
import pandas
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier

seed = 42
num_trees = 100
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, xtrain_tfidf, train_y, cv=kfold)
print(results.mean())



0.8618243243243244


### Validacao e criacao arquivo submissao

In [39]:
# Leitura do Dataset de validação dos resultados
test_df = pd.read_csv('test.csv')
print(test_df.shape)
test_df.head()

(4251, 3)


Unnamed: 0,article_id,title,text
0,4763,Enem 2016 aumenta 46% o uso do nome social por...,O número de travestis e transexuais que usará ...
1,52,'Viagem ao Japão é aula de cultura e tradição'...,"O ator Jayme Matarazzo, 31, aproveita os inter..."
2,7682,Fotógrafo registra a beleza natural de países ...,"O fotógrafo Vitor Schietti, 29, passou quase u..."
3,10292,Azar genético explica preferência do Aedes aeg...,"Enquanto alguns sofrem, outros escapam incólum..."
4,7435,Parto humanizado e capital humano ganham apoio...,A Womanity Foundation anunciou no início do mê...


In [0]:
test_df = test_df.reset_index(drop=True)
test_df['content'] =  test_df['text'] + '\n' + test_df['text']

test_df['content'] = [' '.join(y for y in x.split() if not is_url(y)) for x in test_df['content']]

test_df['content'] = test_df['content'].apply(limpa_texto)

test_df['content'] = test_df['content'].apply(stemming)

test_df['content'] = test_df['content'].apply(remove_palavras_recorrentes)

In [0]:
def predict():
    new_text = tfidf_vect.transform(test_df.content)
    pred     = base_model.predict(new_text)
    return pred

In [43]:
pred         = predict()
pred_classes = [Y_classes[c] for c in pred]
pred_classes[:5]

['educacao', 'turismo', 'turismo', 'ciencia', 'empreendedorsocial']

In [44]:
# Atualizando a categoria dos artigos no dataset de validação
test_df['category'] = pred_classes
test_df.head()

Unnamed: 0,article_id,title,text,content,category
0,4763,Enem 2016 aumenta 46% o uso do nome social por...,O número de travestis e transexuais que usará ...,numer travestil transex us nom soc en exam nac...,educacao
1,52,'Viagem ao Japão é aula de cultura e tradição'...,"O ator Jayme Matarazzo, 31, aproveita os inter...",at jaym matarazz aproveit interval gravaco via...,turismo
2,7682,Fotógrafo registra a beleza natural de países ...,"O fotógrafo Vitor Schietti, 29, passou quase u...",fotograf vit schiett pass quas me viaj quatr p...,turismo
3,10292,Azar genético explica preferência do Aedes aeg...,"Enquanto alguns sofrem, outros escapam incólum...",enquant algum sofr outr escap incolum comum qu...,ciencia
4,7435,Parto humanizado e capital humano ganham apoio...,A Womanity Foundation anunciou no início do mê...,womanity foundation anunci inici me nov fellow...,empreendedorsocial


In [0]:
#criacao do arquivo de submissao para a competicao
test_df[["article_id", "category"]].to_csv("submission.csv", index=False)