# **Natural Language Processing Class**

Instructor: Leandro Romualdo

Date: 17/04/2023

Contact: leandroromualdo@uni9.pro.br

In this class we build a text classifier. The walkthrough uses two classes, and the exercise at the end uses 10 classes.

#### **Libraries**

In [None]:
from sklearn.datasets import fetch_20newsgroups # text classification dataset with 20 classes
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer # token counts and TF-IDF weighting
from sklearn.naive_bayes import MultinomialNB # Naive Bayes classifier
from sklearn.pipeline import Pipeline # chains all transformations and the model into one estimator
from nltk.stem.snowball import SnowballStemmer # stemmer that reduces a word to its stem
import numpy as np
from sklearn.linear_model import SGDClassifier # linear classifier trained with stochastic gradient descent
from sklearn.model_selection import GridSearchCV # exhaustive search over a parameter grid with cross-validation
import nltk
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay
import warnings
import matplotlib.pyplot as plt

warnings.simplefilter('ignore') # silence library warnings in the notebook output
#nltk.download()

#### **Load the Dataset**

In [None]:
newsgroups = fetch_20newsgroups(subset='train') # Load the training split of the 20 newsgroups dataset

In [None]:
list(newsgroups.target_names) # List the classes available for training

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [None]:
categories = ['alt.atheism', 'soc.religion.christian'] # The two classes used in this walkthrough
df_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
df_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42) # fixed seed so the test order is reproducible

In [None]:
df_train.target_names # Show the selected classes

['alt.atheism', 'soc.religion.christian']

In [None]:
print("\n".join(df_train.data[1].split("\n")[:30]))

From: marshall@csugrad.cs.vt.edu (Kevin Marshall)
Subject: Re: some thoughts.
Organization: Virginia Tech Computer Science Dept, Blacksburg, VA
Lines: 72
NNTP-Posting-Host: csugrad.cs.vt.edu
Keywords: Dan Bissell

bissda@saturn.wwc.edu (DAN LAWRENCE BISSELL) writes:

>	Some reasons why he wouldn't be a liar are as follows.  Who would 
>die for a lie?  Wouldn't people be able to tell if he was a liar?  People 
>gathered around him and kept doing it, many gathered from hearing or seeing 
>someone who was or had been healed.  Call me a fool, but I believe he did 
>heal people.  

Anyone who dies for a "cause" runs the risk of dying for a lie.  As for
people being able to tell if he was a liar, well, we've had grifters and
charlatans since the beginning of civilization.  If David Copperfield had
been the Messiah, I bet he could have found plenty of believers.  
Jesus was hardly the first to claim to be a faith healer, and he wasn't the
first to be "witnessed."  What sets him apart?

>	Niet

#### **Feature Engineering**

In [None]:
# Convert a collection of text documents into a matrix of token counts.
# This implementation produces a sparse representation of the counts.

count_vect = CountVectorizer() # Instantiate the count vectorizer
X_train_counts = count_vect.fit_transform(df_train.data)
X_train_counts.shape

(1079, 19666)

In [None]:
# TF-IDF: reweight the raw counts so that terms common to most documents count less
tfidf_transformer = TfidfTransformer() 
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
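As a rough illustration of what the transformer computes, the sketch below applies the weighting by hand to a small hypothetical count matrix. It assumes the default settings of `TfidfTransformer` (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); other settings would give different numbers.

```python
import math

# Hypothetical count matrix: 2 documents x 5 terms.
terms = ["god", "is", "love", "no", "there"]
counts = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
]

n_docs = len(counts)
# Document frequency: number of documents that contain each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]
# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]    # term count times IDF
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    tfidf.append([v / norm for v in weighted])

for term, weight in zip(terms, tfidf[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "love" appears in only one of the two documents, so it ends up with a larger weight than "god" and "is", which appear in both.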

#### **Approach I**

#### **Model Training**

In [None]:
clf = MultinomialNB() # Instantiate the Naive Bayes classifier
clf.fit(X_train_tfidf, df_train.target)
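To see what a multinomial Naive Bayes classifier does internally, here is a simplified sketch written from scratch. It is not the scikit-learn implementation: it works on raw word counts rather than TF-IDF values, the four training sentences are hypothetical, and the function names `train_nb` and `predict_nb` are made up for this example. The `alpha` argument plays the same smoothing role as the `alpha` parameter of `MultinomialNB`.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        word_counts = Counter(w for d in class_docs for w in d.split())
        # Additive smoothing: every vocabulary word gets alpha extra counts.
        total = sum(word_counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unknown words are skipped."""
    log_prior, log_lik = model
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in doc.split() if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy training set.
docs = ["god exists faith church", "faith church prayer",
        "no evidence god", "evidence reason logic"]
labels = ["christian", "christian", "atheism", "atheism"]

model = train_nb(docs, labels)
print(predict_nb(model, "church prayer faith"))
print(predict_nb(model, "reason evidence"))
```

A smaller `alpha` trusts the observed counts more and penalizes unseen words harder, which is the trade-off explored later when tuning `clf__alpha`.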

In [None]:
# Machine learning pipeline
clf_1 = Pipeline([
    ('vect', CountVectorizer()),    # Step 1: turn the texts into token counts
    ('tfidf', TfidfTransformer()),  # Step 2: apply TF-IDF weighting to the counts
    ('clf', MultinomialNB())])      # Step 3: fit the Naive Bayes classifier

In [None]:
clf_trained = clf_1.fit(df_train.data, df_train.target) # Train the whole pipeline on the raw training texts

In [None]:
pred = clf_trained.predict(df_test.data) # Predict on the test data
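The practical benefit of the pipeline is that `fit` and `predict` both take raw strings, and the same vectorizer and TF-IDF weights learned in training are reused at prediction time. A minimal sketch on a hypothetical toy dataset (the sentences, labels, and `toy_` names are invented for this example):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, used only for illustration.
toy_train = ["faith church prayer", "god church faith",
             "evidence reason logic", "no evidence reason"]
toy_labels = ["christian", "christian", "atheism", "atheism"]

toy_pipe = Pipeline([
    ("vect", CountVectorizer()),    # raw text -> token counts
    ("tfidf", TfidfTransformer()),  # counts -> TF-IDF weights
    ("clf", MultinomialNB()),       # TF-IDF weights -> class
])

toy_pipe.fit(toy_train, toy_labels)  # fits all three steps in order
# Words never seen in training ("and") are simply ignored by the vectorizer.
toy_pred = toy_pipe.predict(["church prayer", "reason and logic"]).tolist()
print(toy_pred)
```

Without the pipeline, the test texts would have to go through `count_vect.transform` and `tfidf_transformer.transform` by hand before calling `predict`, and it is easy to call `fit_transform` there by mistake.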

In [None]:
acc = np.mean(pred == df_test.target)
print('>>>> Accuracy: ', acc)

>>>> Accuracy:  0.8521617852161785


In [None]:
# Generate the model's evaluation metrics (precision, recall, F1 per class)
creport = classification_report(df_test.target, pred, target_names=df_test.target_names)
print(creport)

                        precision    recall  f1-score   support

           alt.atheism       0.99      0.68      0.80       319
soc.religion.christian       0.79      0.99      0.88       398

              accuracy                           0.85       717
             macro avg       0.89      0.83      0.84       717
          weighted avg       0.88      0.85      0.85       717
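In the report above, alt.atheism has precision 0.99 but recall 0.68: almost every post predicted as atheism really is atheism, but roughly a third of the atheism posts are labeled as christian. The sketch below computes the same three metrics by hand on hypothetical labels that show the same pattern. The function name `per_class_metrics` and the label lists are invented for this example.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: class 0 is under-predicted, class 1 is over-predicted.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

for cls in (0, 1):
    p, r, f = per_class_metrics(y_true, y_pred, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The support column in the report is simply the number of true examples of each class in the test set.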



In [None]:
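# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces it
ConfusionMatrixDisplay.from_estimator(clf_trained, df_test.data, df_test.target, display_labels=df_test.target_names)
plt.show()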

#### **Hyperparameter Tuning**

#### **Naive Bayes** model with Grid Search

In [None]:
# Grid Search
# Here we list the parameter values we want to try when tuning the model.
# Each key is the name of a pipeline step (the arbitrary name we gave it), two underscores, then the parameter name.
# For example, vect__ngram_range tries unigrams only, (1, 1), and unigrams plus bigrams, (1, 2), and keeps whichever scores best.

parameters = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf__alpha': (1e-2, 1e-3)}
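A grid search tries every combination of the listed values. The sketch below only enumerates them with `itertools.product`, to show how many models will be trained. The 5-fold figure assumes the default `cv` of `GridSearchCV` when none is passed.

```python
from itertools import product

grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

names = list(grid)
# One dict per combination: 2 x 2 x 2 = 8 candidates.
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

n_folds = 5  # assumption: the default number of cross-validation folds
for combo in combinations:
    print(combo)
print(len(combinations), "combinations,", len(combinations) * n_folds, "cross-validation fits")
```

The cost grows multiplicatively with each parameter added to the grid, so it pays to keep the lists short.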

In [None]:
gs_clf = GridSearchCV(clf_trained, parameters, n_jobs=-1) # Set up the grid search over the parameter grid
gs_clf = gs_clf.fit(df_train.data, df_train.target) # Fit one model per combination and refit the best one on all training data
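Each candidate is scored by cross-validation: the training data is split into folds, the model is trained on all folds but one and validated on the one left out, and the scores are averaged. The sketch below shows a plain k-fold split of sample indices. It is a simplification: for classifiers, scikit-learn uses stratified folds that preserve the class proportions, and the function name `k_fold_indices` is invented for this example.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous validation folds."""
    # The first (n_samples % n_folds) folds get one extra sample.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    splits, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        splits.append((train, val))
        start += size
    return splits

splits = k_fold_indices(10, 5)
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

Every sample is used for validation exactly once, so the averaged score uses all the training data without ever scoring a model on samples it was trained on.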

In [None]:
# Print the best mean cross-validation score and the parameters that produced it

print(gs_clf.best_score_)
gs_clf.best_params_

# best_score_ is the mean cross-validation accuracy on the training data, not the test accuracy.
# In this run it is ~98.2% for the NB classifier (not so naive anymore! 😄), with the parameters shown below.

0.9823901808785529


{'clf__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}
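The selected `vect__ngram_range` of (1, 2) means the features include both single words and pairs of adjacent words. A minimal sketch of that extraction follows. It splits on whitespace only, which is a simplification of the default `CountVectorizer` tokenizer (that one also drops punctuation and one-character tokens), and the function name `word_ngrams` is invented for this example.

```python
def word_ngrams(text, ngram_range):
    """Return all word n-grams with n between the two bounds, inclusive."""
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("Who would die for a lie", (1, 1)))
print(word_ngrams("Who would die for a lie", (1, 2)))
```

Bigrams let the model tell phrases apart that share the same words, at the price of a much larger vocabulary.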

In [None]:
pred = gs_clf.predict(df_test.data) # Predict on the test data

In [None]:
acc = np.mean(pred == df_test.target)
print('>>>> Accuracy: ', acc)

>>>> Accuracy:  0.9539748953974896


In [None]:
creport = classification_report(df_test.target, pred, target_names=df_test.target_names)
print(creport)

                        precision    recall  f1-score   support

           alt.atheism       0.99      0.91      0.95       319
soc.religion.christian       0.93      0.99      0.96       398

              accuracy                           0.95       717
             macro avg       0.96      0.95      0.95       717
          weighted avg       0.96      0.95      0.95       717



#### **Approach II**

In [None]:
# Machine learning pipeline
clf_2 = Pipeline([
    ('vect', CountVectorizer()), # Step 1: turn the texts into token counts
    ('tfidf', TfidfTransformer()), # Step 2: apply TF-IDF weighting to the counts
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=25, random_state=42))]) # Step 3: fit a linear SVM (hinge loss) trained with SGD
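With `loss='hinge'` and `penalty='l2'`, `SGDClassifier` trains a linear SVM one sample at a time. The sketch below shows the idea on hypothetical 2D points. It is a simplification, not the scikit-learn algorithm: it uses a constant learning rate and a fixed sample order, whereas `SGDClassifier` decays the learning rate and shuffles by default. The function names and the data are invented for this example.

```python
def train_hinge_sgd(X, y, alpha=1e-3, lr=0.1, epochs=20):
    """Linear SVM via SGD on the hinge loss with an L2 penalty. Labels are +1 / -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Gradient of the L2 penalty: shrink the weights a little on every step.
            w = [wj * (1 - lr * alpha) for wj in w]
            # Gradient of the hinge loss: only samples inside the margin push the weights.
            if margin < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict_hinge(w, b, x):
    """Classify by the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical linearly separable points.
X = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
y = [1, 1, -1, -1]

w, b = train_hinge_sgd(X, y)
print("weights:", w, "bias:", b)
print([predict_hinge(w, b, x) for x in X])
```

Here `alpha` is the regularization strength, the same role it plays in the pipeline step above, which is why it is a natural parameter to tune.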

In [None]:
svm_trained = clf_2.fit(df_train.data, df_train.target) # Train the whole pipeline on the raw training texts

In [None]:
pred = svm_trained.predict(df_test.data) # Predict on the test data

In [None]:
acc = np.mean(pred == df_test.target)
print('>>>> Accuracy: ', acc)

>>>> Accuracy:  0.9330543933054394


#### **Hyperparameter Tuning**

#### **SGD** model with Grid Search

In [None]:
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf-svm__alpha': (1e-2, 1e-3)}

In [None]:
gs_clf_svm = GridSearchCV(svm_trained, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(df_train.data, df_train.target)

In [None]:
# Best cross-validation score and best parameters
print(gs_clf_svm.best_score_)
gs_clf_svm.best_params_

0.9805297157622739


{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

In [None]:
pred = gs_clf_svm.predict(df_test.data) # Predict on the test data

In [None]:
acc = np.mean(pred == df_test.target)
print('>>>> Accuracy: ', acc)

>>>> Accuracy:  0.9428172942817294


In [None]:
creport = classification_report(df_test.target, pred, target_names=df_test.target_names)
print(creport)

                        precision    recall  f1-score   support

           alt.atheism       0.99      0.88      0.93       319
soc.religion.christian       0.91      0.99      0.95       398

              accuracy                           0.94       717
             macro avg       0.95      0.94      0.94       717
          weighted avg       0.95      0.94      0.94       717



#### **Approach III**

Remove the stop words

In [None]:
nb_clf_stp = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),    # Step 1: turn the texts into token counts, dropping English stop words
    ('tfidf', TfidfTransformer()),  # Step 2: apply TF-IDF weighting to the counts
    ('clf', MultinomialNB())])      # Step 3: fit the Naive Bayes classifier
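Stop words are very frequent words ("the", "of", "for") that carry little information about the topic. Passing `stop_words='english'` makes the vectorizer drop them using a list built into scikit-learn. The sketch below shows the effect with a tiny hand-written list. `TOY_STOP_WORDS` is a hypothetical subset for illustration, not the scikit-learn list.

```python
# Hypothetical subset of English stop words, for illustration only.
TOY_STOP_WORDS = {"the", "of", "a", "for", "to", "is", "and", "who", "would"}

def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, split on whitespace, and drop tokens found in the stop list."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

print(remove_stop_words("Who would die for a lie"))
```

Dropping these words shrinks the vocabulary and keeps the classifier from spending probability mass on words that appear in both classes.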

In [None]:
nb_clf_stp_trained = nb_clf_stp.fit(df_train.data, df_train.target) # Train the whole pipeline on the raw training texts

In [None]:
pred = nb_clf_stp_trained.predict(df_test.data) # Predict on the test data

In [None]:
acc = np.mean(pred == df_test.target)
print('Accuracy = ', acc)

creport = classification_report(df_test.target, pred, target_names=df_test.target_names)
print(creport)

Accuracy =  0.9065550906555091
                        precision    recall  f1-score   support

           alt.atheism       0.98      0.80      0.88       319
soc.religion.christian       0.86      0.99      0.92       398

              accuracy                           0.91       717
             macro avg       0.92      0.90      0.90       717
          weighted avg       0.92      0.91      0.91       717



In [None]:
nb_clf_tuned = GridSearchCV(nb_clf_stp_trained, parameters, n_jobs=-1) # Set up the grid search over the parameter grid
nb_clf_tuned = nb_clf_tuned.fit(df_train.data, df_train.target) # Fit one model per combination and refit the best one on all training data

In [None]:
# Best cross-validation score and best parameters
print(nb_clf_tuned.best_score_)
nb_clf_tuned.best_params_

0.9870198105081824


{'clf__alpha': 0.01, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}

In [None]:
pred = nb_clf_tuned.predict(df_test.data) # Predict on the test data

In [None]:
acc = np.mean(pred == df_test.target)
print('>>>> Accuracy: ', acc)

>>>> Accuracy:  0.9693165969316597


In [None]:
creport = classification_report(df_test.target, pred, target_names=df_test.target_names)
print(creport)

                        precision    recall  f1-score   support

           alt.atheism       0.98      0.95      0.96       319
soc.religion.christian       0.96      0.99      0.97       398

              accuracy                           0.97       717
             macro avg       0.97      0.97      0.97       717
          weighted avg       0.97      0.97      0.97       717



#### **Approach IV**

Stop-word removal with SVM

In [None]:
# Machine learning pipeline
svm_stp_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')), # Step 1: turn the texts into token counts, dropping English stop words
    ('tfidf', TfidfTransformer()), # Step 2: apply TF-IDF weighting to the counts
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=25, random_state=42))]) # Step 3: fit a linear SVM (hinge loss) trained with SGD

In [None]:
svm_stp_trained = svm_stp_clf.fit(df_train.data, df_train.target) # Train the whole pipeline on the raw training texts

In [None]:
pred = svm_stp_trained.predict(df_test.data) # Predict on the test data

In [None]:
acc = np.mean(pred == df_test.target)
print('>>>> Accuracy: ', acc)

>>>> Accuracy:  0.9372384937238494


In [None]:
svm_stp_tun = GridSearchCV(svm_stp_trained, parameters_svm, n_jobs=-1)
svm_stp_tun = svm_stp_tun.fit(df_train.data, df_train.target)

In [None]:
# Best cross-validation score and best parameters
print(svm_stp_tun.best_score_)
svm_stp_tun.best_params_

0.9842420327304048


{'clf-svm__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}

In [None]:
pred = svm_stp_tun.predict(df_test.data) # Predict on the test data

In [None]:
acc = np.mean(pred == df_test.target)
print('>>>> Accuracy: ', acc)

>>>> Accuracy:  0.9358437935843794


In [None]:
creport = classification_report(df_test.target, pred, target_names=df_test.target_names)
print(creport)

                        precision    recall  f1-score   support

           alt.atheism       0.98      0.87      0.92       319
soc.religion.christian       0.91      0.99      0.94       398

              accuracy                           0.94       717
             macro avg       0.94      0.93      0.93       717
          weighted avg       0.94      0.94      0.94       717
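To compare the four approaches side by side, the sketch below gathers the test accuracies printed in the cells above, rounded to three decimals. The dictionary layout and the labels are just one possible way to organize them.

```python
# Test-set accuracies as printed in this notebook, rounded to three decimals.
results = {
    "I: NB": {"baseline": 0.852, "tuned": 0.954},
    "II: SGD (linear SVM)": {"baseline": 0.933, "tuned": 0.943},
    "III: NB + stop words": {"baseline": 0.907, "tuned": 0.969},
    "IV: SGD + stop words": {"baseline": 0.937, "tuned": 0.936},
}

for name, acc in results.items():
    gain = acc["tuned"] - acc["baseline"]  # change in test accuracy after grid search
    print(f"{name:22s} baseline={acc['baseline']:.3f} tuned={acc['tuned']:.3f} gain={gain:+.3f}")

best = max(results, key=lambda name: results[name]["tuned"])
print("Best tuned model:", best)
```

Tuning helped Naive Bayes the most. In Approach IV the tuned model scored slightly lower on the test set than the untuned one, a reminder that the best cross-validation score does not guarantee the best test score.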



## **Exercise for Your Portfolio**

1. Open a new Colab notebook
2. Download the same text dataset we used in this class
3. Select 10 classes that interest you
4. Train two models, tune them, and organize the work by approach

**Submissions are due on 24/04 only. The exercise is worth 3 points.**

**Keep in mind that someone will read this code after you write it**

Grading criteria:
- I will not focus on model accuracy
- Grading is based on the organization and the conclusions written throughout the code.