<a href="https://colab.research.google.com/github/dparaujo/Mineracao_Dados/blob/main/Trabalho_Minera%C3%A7%C3%A3o_de_Dados_David_Araujo2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Data Mining Project (EDA)

**Dataset:** BBC News


*   https://www.kaggle.com/datasets/gpreda/bbc-news
*   https://www.kaggle.com/code/gpreda/bbc-news-rss-feeds
*   https://www.kaggle.com/datasets/pariza/bbc-news-summary/data
*   https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv

Other datasets:

*   https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification/code
*   https://www.kaggle.com/c/learn-ai-bbc/overview
*   https://www.kaggle.com/datasets/hgultekin/bbcnewsarchive
*   https://www.kaggle.com/datasets/sahilkirpekar/bbcnews-dataset
*   https://www.kaggle.com/code/warcoder/chromadb-semantic-search
*   https://www.kaggle.com/code/anubhavgoyal10/getting-started-with-hugging-face
*   https://www.kaggle.com/datasets/khushikyad001/fake-news-detection/data
*   https://www.kaggle.com/datasets/mahdimashayekhi/fake-news-detection-dataset
*   https://github.com/payamesfandiari/fake_news_finder
*   https://www.kaggle.com/code/asif00/text-generation-with-tensorflow-nlp-rnn




### ***Some Tests:***

1. Environment Setup (Google Colab).
Installing and importing the required libraries:

In [None]:
# Install the libraries
!pip install pandas numpy seaborn matplotlib wordcloud nltk sentence-transformers faiss-cpu
# !pip install openai # optional, only needed to use the OpenAI API


In [None]:
!pip install openai==0.28

In [None]:
# from google.colab import userdata
# userdata.get('HF_TOKEN')

In [None]:
# Import the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams
from collections import Counter
from sentence_transformers import SentenceTransformer
import faiss

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


2. Loading and Inspecting the BBC News Dataset.
The dataset can be loaded from Kaggle or from a direct link:

In [None]:
# Example using a direct CSV URL
url = "https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv"
df = pd.read_csv(url)

# First rows
df.head()


In [None]:
# @title category

from matplotlib import pyplot as plt
import seaborn as sns
df.groupby('category').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)
plt.xlabel('Count')
plt.ylabel('Categories')
plt.savefig('category.png', bbox_inches='tight', dpi=600)

In [None]:
# Plot a histogram

df["category"].hist()

plt.xlabel('Category')
plt.ylabel('Amount')
plt.savefig('category-histograma.png', bbox_inches='tight', dpi=600)

In [None]:
df


In [None]:
df.shape

In [None]:
df.size

In [None]:
df.info()

In [None]:
# df.mean()
# df.max()
# df.min()

3. Exploratory Data Analysis (EDA).

a) Category Distribution

In [None]:
# Category distribution chart

plt.figure(figsize=(8,5))
sns.countplot(y='category', data=df, order=df['category'].value_counts().index)
plt.title('Category Distribution')
plt.xlabel('Count')
plt.ylabel('Categories')
plt.savefig('distro_category.png', bbox_inches='tight', dpi=600)
plt.show()


b) Histogram and Boxplot of Text Length

In [None]:
# Histogram and boxplot charts
df['text_length'] = df['text'].apply(lambda x: len(x.split()))

# Histogram
plt.figure(figsize=(10,4))
sns.histplot(df['text_length'], bins=30, kde=True)
plt.title('Text Length Distribution')
plt.xlabel('Number of words')
plt.ylabel('Frequency')
plt.savefig('hist_length.png', bbox_inches='tight', dpi=600)
plt.show()

# Boxplot
plt.figure(figsize=(10,4))
sns.boxplot(x='category', y='text_length', data=df)
# plt.title('Boxplot of Text Length by Category')
plt.xlabel('Categories')
plt.ylabel('Number of words')
plt.xticks(rotation=45)
plt.savefig('boxplot_length.png', bbox_inches='tight', dpi=600)
plt.show()
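
To make the boxplot easier to read, the sketch below computes the quantities it draws: the quartiles, the interquartile range, and the 1.5 × IQR bounds beyond which points are plotted as outliers. The function name `boxplot_summary` and the sample word counts are hypothetical and do not come from the dataset.

```python
import statistics

def boxplot_summary(lengths):
    """Return quartiles, IQR and the values outside the 1.5 * IQR bounds."""
    # "inclusive" uses linear interpolation between the observed data points
    q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in lengths if x < lower or x > upper]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}

# Hypothetical word counts for a handful of articles
sample_lengths = [120, 150, 180, 200, 210, 230, 250, 900]
summary = boxplot_summary(sample_lengths)
print(summary)
```

With the notebook's data, the same function could be applied to `df['text_length']` for each category to list the unusually long articles by number rather than by eye.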


c) Word Cloud

In [None]:
# Word cloud chart
text = ' '.join(df['text']).lower()
words = [word for word in text.split() if word not in stop_words]

wordcloud = WordCloud(width=800, height=400).generate(' '.join(words))

plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of the Most Frequent Words')
plt.savefig('wordcloud.png', bbox_inches='tight', dpi=600)
plt.show()


d) N-gram Distribution (e.g., bigrams)

In [None]:
# N-gram distribution chart
bigrams = list(ngrams(words, 2))
bigram_counts = Counter(bigrams).most_common(10)

bigram_df = pd.DataFrame(bigram_counts, columns=['bigram', 'count'])
bigram_df['bigram'] = bigram_df['bigram'].apply(lambda x: ' '.join(x))

sns.barplot(y='bigram', x='count', data=bigram_df)
plt.title('Top 10 Most Frequent Bigrams')
plt.xlabel('Frequency')
plt.ylabel('Bigram')
plt.savefig('top_bigrams.png', bbox_inches='tight', dpi=600)
plt.show()
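
The cell above builds bigrams from one long string that joins every article, so the last word of one article gets paired with the first word of the next. A possible alternative, sketched here with a hypothetical helper `bigrams_per_document` and made-up sentences, counts bigrams inside each document separately.

```python
from collections import Counter

def bigrams_per_document(documents, stop_words=frozenset()):
    """Count bigrams within each document, never across document boundaries."""
    counts = Counter()
    for doc in documents:
        tokens = [t for t in doc.lower().split() if t not in stop_words]
        # zip(tokens, tokens[1:]) yields consecutive word pairs
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Hypothetical mini-corpus, only to show the behaviour
docs = ["the prime minister spoke", "prime minister resigns"]
counts = bigrams_per_document(docs, stop_words={"the"})
print(counts.most_common(2))
```

In the notebook this could be called as `bigrams_per_document(df['text'], stop_words)`; the pair `("spoke", "prime")` never appears because it would span two documents.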


4. Applying RAG (Retrieval-Augmented Generation)

a) Creating the text embeddings

In [None]:
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", token=False)

# Create the embeddings
embeddings = model.encode(df['text'].tolist())
print(embeddings)

# Build the FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))
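
`faiss.IndexFlatL2` performs an exact search and reports squared Euclidean distances. The NumPy sketch below reproduces that idea on toy vectors so the ranking can be checked by hand; `l2_search` and the 2-dimensional vectors are hypothetical stand-ins for the real index and the 384-dimensional MiniLM embeddings.

```python
import numpy as np

def l2_search(doc_vectors, query_vector, top_k=5):
    """Exact nearest neighbours by squared Euclidean distance."""
    diffs = doc_vectors - query_vector            # broadcast over all documents
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)  # row-wise sum of squares
    order = np.argsort(sq_dists)[:top_k]          # smallest distance first
    return sq_dists[order], order

# Toy "embeddings" chosen so the distances are easy to verify
toy_docs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]], dtype="float32")
toy_query = np.array([0.9, 0.1], dtype="float32")
dists, ids = l2_search(toy_docs, toy_query, top_k=3)
print(ids, dists)
```

The brute-force version scans every document for each query, which is also what a flat index does; it is shown only to make the distance computation explicit.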


b) Semantic search with RAG

In [None]:
def retrieve_documents(question, top_k=5):
    query_embedding = model.encode([question])
    distances, indices = index.search(query_embedding, top_k)
    return df.iloc[indices[0]]

# Test example:
query = "What happened recently in UK politics?"
retrieved_docs = retrieve_documents(query)
print(retrieved_docs[['category', 'text']])


In [None]:
# Example 2:
# query = "What can be inferred or summarized from the last five articles about technology?"
query = "Please provide a one-paragraph summary of your interpretation of the last five technology articles?"
retrieved_docs = retrieve_documents(query)
print(retrieved_docs[['category', 'text']])

**Without RAG**

In [None]:
def retrieve_documents_semRAG(question, top_k=5):
    query_embedding = model.encode([question])
    distances, indices = index.search(query_embedding, top_k)
    return df.iloc[indices[0]]

# Example:
query = "What happened recently in UK politics?"
retrieved_docs = retrieve_documents_semRAG(query)
print(retrieved_docs[['category', 'text']])

**Without the text generation step of RAG: semantic retrieval only, based on embeddings and similarity:**

In [None]:

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pandas as pd

# Load the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Text dataset
df = pd.read_csv("https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv")

# Generate the document embeddings
corpus = df['text'].tolist()
document_embeddings = model.encode(corpus, show_progress_bar=True)

# Build the FAISS index
index = faiss.IndexFlatL2(document_embeddings.shape[1])
index.add(np.array(document_embeddings))


**Semantic search function without RAG**

In [None]:
def semantic_search(query, top_k=5):
    # Encode the query as an embedding
    query_embedding = model.encode([query])
    # Find the top_k nearest documents in the FAISS index
    distances, indices = index.search(query_embedding, top_k)
    # Return the most similar documents
    return df.iloc[indices[0]]


**Run the search**

In [None]:
query = "What happened recently in UK politics?"
results = semantic_search(query, top_k=5)
print(results[['category', 'text']])


c) Generating answers with an LLM (optional, using OpenAI GPT)

In [None]:
import openai
import os
from google.colab import userdata

# openai.api_key = 'YOUR_API_KEY'
# openai.api_key = 'OPENAI_TOKEN'
# openai.api_key = userdata.get('OPENAI_API_KEY')
openai.api_key = userdata.get('OPENAI_TOKEN')

os.environ["API_TOKEN"] = userdata.get('OPENAI_TOKEN')


def generate_answer(question, context_texts):
    prompt = f"""
    Context: {' '.join(context_texts)}

    Question: {question}

    Answer:
    """

    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'user', 'content': prompt}]
    )

    return response['choices'][0]['message']['content']

# Practical example
contexts = retrieved_docs['text'].tolist()
# contexts = retrieve_documents['text'].tolist()
answer = generate_answer(query, contexts)
print(answer)
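
The prompt above concatenates the full text of every retrieved article, which can get long. One possible way to keep it bounded is sketched below: each document is truncated and labelled before it goes into the prompt. The helper `build_prompt` and the 1500-character default are assumptions for illustration, not part of the original notebook or of the OpenAI API.

```python
def build_prompt(question, context_texts, max_chars_per_doc=1500):
    """Assemble a RAG prompt, truncating each retrieved text to a character budget."""
    blocks = []
    for i, text in enumerate(context_texts, start=1):
        snippet = text[:max_chars_per_doc]  # simple cut; no sentence-boundary handling
        blocks.append(f"[Document {i}]\n{snippet}")
    context = "\n\n".join(blocks)
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

# Hypothetical inputs, only to show the resulting layout
prompt = build_prompt("What happened?", ["a" * 5000, "short text"], max_chars_per_doc=100)
print(prompt)
```

The returned string could be passed as the `content` of the user message in `generate_answer`; labelling the documents also makes it easier to ask the model which source it relied on.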

## **Classification Task with BBC News:**

**Goal:** Classify news texts into categories (e.g., politics, business, sport, technology, entertainment).

**Dataset structure:**

*   Column text: Full text of each news article.
*   Column category: Class label of each article.

**Classification type:** Multiclass.

**Example classes:**

*   business
*   politics
*   sport
*   tech
*   entertainment

Example using Python (scikit-learn):

In [None]:
# Import the libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['category'], test_size=0.3, random_state=42
)

# TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Classification with Naive Bayes
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)

# Evaluate the model
y_pred = clf.predict(X_test_vec)
print(classification_report(y_test, y_pred))
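
To show what the TF-IDF step computes, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation per document. The function `tfidf_vectors` is hypothetical, and it simplifies tokenisation to whitespace splitting with no stop-word removal, so its numbers will not match the notebook's vectorizer exactly.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Hypothetical three-document corpus
toy = ["market shares rise", "market falls", "team wins match"]
vecs = tfidf_vectors(toy)
print(vecs[0])
```

In the first toy document, "market" receives a lower weight than "shares" because it also occurs in the second document, which is the effect that lets rare, category-specific words dominate the features.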


### **How can RAG be combined with classification?**

Although RAG is traditionally used for retrieval-based text generation, it can also support a classification task indirectly:

**Use the RAG embeddings to improve the text representation:**
The embeddings used in RAG (e.g., Sentence-BERT) can be fed directly as input to more advanced classifiers (e.g., neural networks, SVM, or Random Forest).

In [None]:
# Embeddings
embeddings = model.encode(df['text'].tolist())

# Classification on the embeddings with Random Forest
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, df['category'], test_size=0.3, random_state=42
)

clf_rf = RandomForestClassifier(n_estimators=100, random_state=42)
clf_rf.fit(X_train, y_train)

y_pred = clf_rf.predict(X_test)
print(classification_report(y_test, y_pred))


## **Complete Step-by-Step in Python on Google Colab**

**1. Installing and importing the libraries**

In [None]:
# Install the required libraries
!pip install pandas numpy matplotlib seaborn nltk wordcloud sentence-transformers scikit-learn faiss-cpu


In [None]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sentence_transformers import SentenceTransformer
import faiss

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


**2. Loading the BBC News Dataset**

In [None]:
# Load the dataset directly from the CSV link
url = 'https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv'
df = pd.read_csv(url)

# Show the first rows
df.head()


**3. Exploratory Data Analysis (EDA)**

3.1. Dataset Structure

In [None]:
print(f"Data shape: {df.shape}")
print("Available categories:", df['category'].unique())
df.info()


3.2. Category Distribution (Bar Chart)

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(y='category', data=df, order=df['category'].value_counts().index, palette='viridis')
plt.title('Category Distribution')
plt.xlabel('Number of Articles')
plt.ylabel('Categories')
plt.savefig('distro_category2.png', bbox_inches='tight', dpi=600)
plt.show()


3.3. Text Length Distribution (Histogram and Boxplot)

In [None]:
df['text_length'] = df['text'].apply(lambda x: len(x.split()))

plt.figure(figsize=(10,5))
sns.histplot(df['text_length'], bins=30, kde=True)
plt.title('Text Length Distribution')
plt.xlabel('Number of words')
plt.ylabel('Frequency')
plt.savefig('hist_length2.png', bbox_inches='tight', dpi=600)
plt.show()

plt.figure(figsize=(10,6))
sns.boxplot(x='category', y='text_length', data=df, palette='pastel')
plt.title('Boxplot of Text Length by Category')
plt.xticks(rotation=45)
plt.savefig('boxplot_length2.png', bbox_inches='tight', dpi=600)
plt.show()


3.4. Word Cloud

In [None]:
text = ' '.join(df['text']).lower()
filtered_words = [word for word in text.split() if word not in stop_words]

wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(filtered_words))

plt.figure(figsize=(10,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of the Most Frequent Words')
plt.savefig('wordcloud2.png', bbox_inches='tight', dpi=600)
plt.show()


3.5. Most Frequent Bigrams (N-Grams)

In [None]:
from nltk.util import ngrams

bigrams = list(ngrams(filtered_words, 2))
bigram_counts = Counter(bigrams).most_common(10)

bigram_df = pd.DataFrame(bigram_counts, columns=['Bigram', 'Contagem'])
bigram_df['Bigram'] = bigram_df['Bigram'].apply(lambda x: ' '.join(x))

plt.figure(figsize=(10,5))
sns.barplot(y='Bigram', x='Contagem', data=bigram_df, palette='coolwarm')
plt.title('Top 10 Most Frequent Bigrams')
plt.xlabel('Count')
plt.ylabel('Bigram')
plt.savefig('top_bigrams2.png', bbox_inches='tight', dpi=600)
plt.show()


In [None]:
from nltk.util import ngrams

# Same analysis with n=3 (trigrams)
trigrams = list(ngrams(filtered_words, 3))
trigram_counts = Counter(trigrams).most_common(10)

trigram_df = pd.DataFrame(trigram_counts, columns=['Trigram', 'Contagem'])
trigram_df['Trigram'] = trigram_df['Trigram'].apply(lambda x: ' '.join(x))

plt.figure(figsize=(10,5))
sns.barplot(y='Trigram', x='Contagem', data=trigram_df, palette='coolwarm')
plt.title('Top 10 Most Frequent Trigrams')
plt.xlabel('Count')
plt.ylabel('Trigram')
plt.savefig('top_trigrams2.png', bbox_inches='tight', dpi=600)
plt.show()

**4. Applying RAG (Sentence-BERT embeddings)**

4.1. Creating the Embeddings

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create the embeddings
embeddings = model.encode(df['text'].tolist())


4.2. Building the Semantic Search Index (FAISS)

In [None]:
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))


4.3. Retrieval Example (simplified RAG)

In [None]:
def retrieve_docs(query, top_k=5):
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding, top_k)
    return df.iloc[indices[0]]

# Example
# question = "What recent technology developments were reported?"
question = "What can be inferred or summarized from the last five articles about technology?"
results = retrieve_docs(question)

print(results[['category', 'text']].head())


**5. Text Classification Using Embeddings**

5.1. Splitting the Data (Training and Test)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, df['category'], test_size=0.3, random_state=42
)


5.2. Training with Random Forest (a robust classifier)

In [None]:
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)

# Detailed evaluation
print(classification_report(y_test, y_pred))


5.3 Report

In [None]:
# Additional import required
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Compute the classification metrics
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred, average=None, labels=clf.classes_)

# Build a DataFrame for display
metrics_df = pd.DataFrame({
    'Categoria': clf.classes_,
    'Precisão': precision,
    'Recall': recall,
    'F1-Score': f1_score
})

print("Overall model accuracy: {:.2f}%".format(accuracy * 100))
# print("\n"+metrics_df)
print(metrics_df)
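
The per-category numbers in this report come from counting true positives, false positives and false negatives for each class. The sketch below spells that out in plain Python on made-up labels; `per_class_metrics` is a hypothetical helper, not the scikit-learn implementation.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1)
    return report

# Hypothetical labels and predictions
y_true_toy = ["tech", "tech", "sport", "sport", "sport"]
y_pred_toy = ["tech", "sport", "sport", "sport", "tech"]
report = per_class_metrics(y_true_toy, y_pred_toy)
print(report)
```

Here "tech" has one correct prediction, one false positive and one false negative, so precision, recall and F1 are all 0.5.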


1. Overall Accuracy Chart

In [None]:
# Overall accuracy chart
plt.figure(figsize=(6,4))
sns.barplot(x=['Overall Accuracy'], y=[accuracy*100], palette='Greens')
plt.ylim(0,100)
plt.ylabel('Accuracy (%)')
plt.title('Overall Model Accuracy')
plt.text(0, accuracy*100 + 1, f'{accuracy*100:.2f}%', ha='center')
# Save once, after the annotation has been drawn
plt.savefig('acuracia_geral2.png', bbox_inches='tight', dpi=600)
plt.show()


2. Precision by Category Chart

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x='Categoria', y='Precisão', data=metrics_df, palette='Blues_d')
plt.ylim(0,1)
plt.title('Precision by Category')
plt.ylabel('Precision')
plt.xlabel('Category')
plt.xticks(rotation=45)
for i, p in enumerate(precision):
    plt.text(i, p + 0.01, f'{p:.2f}', ha='center')
# Save once, after all bar labels have been drawn
plt.savefig('precisao_por_categoria2.png', bbox_inches='tight', dpi=600)
plt.show()


3. Recall by Category Chart

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x='Categoria', y='Recall', data=metrics_df, palette='Oranges_d')
plt.ylim(0,1)
plt.title('Recall by Category')
plt.ylabel('Recall')
plt.xlabel('Category')
plt.xticks(rotation=45)
for i, r in enumerate(recall):
    plt.text(i, r + 0.01, f'{r:.2f}', ha='center')
# Save once, after all bar labels have been drawn
plt.savefig('recall_por_categoria2.png', bbox_inches='tight', dpi=600)
plt.show()


4. F1-Score by Category Chart

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x='Categoria', y='F1-Score', data=metrics_df, palette='Purples_d')
plt.ylim(0,1)
plt.title('F1-Score by Category')
plt.ylabel('F1-Score')
plt.xlabel('Category')
plt.xticks(rotation=45)
for i, f1 in enumerate(f1_score):
    plt.text(i, f1 + 0.01, f'{f1:.2f}', ha='center')
# Save once, after all bar labels have been drawn
plt.savefig('f1_score_por_categoria2.png', bbox_inches='tight', dpi=600)
plt.show()


**6. Graphical Evaluation of the Results**

Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=clf.classes_, yticklabels=clf.classes_)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.savefig('matriz_confusao2.png', bbox_inches='tight', dpi=600)
plt.show()
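
As a reading aid for the heatmap, the sketch below builds a confusion matrix by counting (actual, predicted) pairs, with rows as actual classes and columns as predicted classes, the same orientation as the plot above. `confusion_counts` and the toy labels are hypothetical.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(actual, predicted)] for predicted in labels]
            for actual in labels]

# Hypothetical labels and predictions
toy_true = ["tech", "tech", "sport", "sport", "sport"]
toy_pred = ["tech", "sport", "sport", "sport", "tech"]
matrix = confusion_counts(toy_true, toy_pred, labels=["sport", "tech"])

# Accuracy is the diagonal (correct predictions) divided by the total
correct = sum(matrix[i][i] for i in range(len(matrix)))
toy_accuracy = correct / len(toy_true)
print(matrix, toy_accuracy)
```

Off-diagonal cells show which categories are confused with each other, which a single accuracy figure hides.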


A **complete and detailed** practical guide (without embeddings) to **exploratory data analysis (EDA), visualizations**, and **text classification** on the **BBC News** dataset.

**1. Initial setup in Google Colab**

In [None]:
# Install the required libraries
!pip install pandas numpy matplotlib seaborn nltk wordcloud scikit-learn


In [None]:
# Essential imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_recall_fscore_support

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


**2. Loading and Exploring the Datasets**

2.1. BBC News Dataset

In [None]:
url_bbc = 'https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv'
df_bbc = pd.read_csv(url_bbc)
print(df_bbc.head())


**3. EDA (Exploratory Analysis): BBC News Example**

In [None]:
# Category distribution
sns.countplot(y='category', data=df_bbc, palette='Set2')
plt.title('BBC News Category Distribution')
plt.xlabel('Count')
plt.ylabel('Category')
plt.savefig('distro_category_bbc.png', bbox_inches='tight', dpi=600)
plt.show()

# Text length
df_bbc['text_length'] = df_bbc['text'].apply(lambda x: len(x.split()))
sns.histplot(df_bbc['text_length'], bins=30, kde=True)
plt.title('BBC News Text Length')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.savefig('hist_length_bbc.png', bbox_inches='tight', dpi=600)
plt.show()

# Wordcloud
text = ' '.join(df_bbc['text']).lower()
filtered_words = [word for word in text.split() if word not in stop_words]
wordcloud = WordCloud(width=800, height=400).generate(' '.join(filtered_words))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Wordcloud BBC News')
plt.savefig('wordcloud_bbc.png', bbox_inches='tight', dpi=600)
plt.show()


**4. Classification without Embeddings (TF-IDF)**

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    df_bbc['text'], df_bbc['category'], test_size=0.3, random_state=42
)

# TF-IDF vectorization
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

# Random Forest classifier
clf_rf = RandomForestClassifier(n_estimators=100, random_state=42)
clf_rf.fit(X_train_vec, y_train)
y_pred = clf_rf.predict(X_test_vec)

# Evaluation
print(classification_report(y_test, y_pred))


**5. Graphical Evaluation of the Metrics (BBC News example)**

**Confusion Matrix**

In [None]:
cm = confusion_matrix(y_test, y_pred, labels=clf_rf.classes_)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=clf_rf.classes_, yticklabels=clf_rf.classes_)
plt.title('BBC News Confusion Matrix (without embeddings)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
# plt.xticks(rotation=45)
# plt.yticks(rotation=0)
plt.savefig('matriz_confusao_bbc.png', bbox_inches='tight', dpi=600)
plt.show()


**Precision, Recall, F1-Score**

In [None]:
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred, labels=clf_rf.classes_)

metrics_df = pd.DataFrame({
    'Categoria': clf_rf.classes_,
    'Precisão': precision,
    'Recall': recall,
    'F1-Score': f1_score
})

print(f"Overall accuracy: {accuracy*100:.2f}%")
print(metrics_df)

# Metrics chart
metrics_df.set_index('Categoria').plot.bar(rot=0, figsize=(10,6), colormap='Pastel1')
plt.title('BBC News Precision, Recall and F1-Score (without embeddings)')
plt.ylabel('Value')
plt.ylim(0,1)
plt.grid(axis='y')
plt.savefig('precisao_recall_f1_bbc.png', bbox_inches='tight', dpi=600)
plt.show()


In [None]:
# Installation
!pip install pandas numpy seaborn matplotlib nltk wordcloud scikit-learn sentence-transformers tensorflow keras

# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_recall_fscore_support
from sentence_transformers import SentenceTransformer
import tensorflow as tf

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


In [None]:
url_bbc = 'https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv'
df = pd.read_csv(url_bbc)
df.head()


In [None]:
# Category distribution
sns.countplot(y='category', data=df)
plt.title('Category Distribution')
plt.show()

# Wordcloud
text = ' '.join(df['text']).lower()
wordcloud = WordCloud(width=800, height=400).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Wordcloud BBC News')
plt.show()


## **4. Data Preparation**

**Without embeddings (TF-IDF)**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['category'], test_size=0.3, random_state=42)

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)


**With embeddings (Sentence-BERT)**

In [None]:
# model_emb = SentenceTransformer('all-MiniLM-L6-v2')
model_emb = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", token=False)

X_embeddings = model_emb.encode(df['text'])

X_train_emb, X_test_emb, y_train_emb, y_test_emb = train_test_split(
    X_embeddings, df['category'], test_size=0.3, random_state=42
)


## **5. Classification without Embeddings (Naive Bayes and SVM)**

**Naive Bayes**

In [None]:
nb_clf = MultinomialNB()
nb_clf.fit(X_train_tfidf, y_train)
y_pred_nb = nb_clf.predict(X_test_tfidf)
print("Naive Bayes without embeddings:\n", classification_report(y_test, y_pred_nb))


**Support Vector Machines (SVM)**

In [None]:
svm_clf = SVC()
svm_clf.fit(X_train_tfidf, y_train)
y_pred_svm = svm_clf.predict(X_test_tfidf)
print("SVM without embeddings:\n", classification_report(y_test, y_pred_svm))


## **6. Classification with Embeddings (Naive Bayes and SVM)**

**Naive Bayes with embeddings (GaussianNB)**

In [None]:
from sklearn.naive_bayes import GaussianNB

gnb_clf = GaussianNB()
gnb_clf.fit(X_train_emb, y_train_emb)
y_pred_gnb = gnb_clf.predict(X_test_emb)
print("Naive Bayes with embeddings:\n", classification_report(y_test_emb, y_pred_gnb))


**SVM with embeddings**

In [None]:
svm_emb_clf = SVC()
svm_emb_clf.fit(X_train_emb, y_train_emb)
y_pred_svm_emb = svm_emb_clf.predict(X_test_emb)
print("SVM with embeddings:\n", classification_report(y_test_emb, y_pred_svm_emb))


## **7. Graphical Evaluation (example: Naive Bayes without embeddings)**

In [None]:
cm = confusion_matrix(y_test, y_pred_nb, labels=nb_clf.classes_)
sns.heatmap(cm, annot=True, cmap='Blues', fmt='d', xticklabels=nb_clf.classes_, yticklabels=nb_clf.classes_)
plt.title('Confusion Matrix – Naive Bayes without embeddings')
plt.show()


In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_recall_fscore_support
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

def plot_confusion_matrix(y_true, y_pred, labels, title='Confusion Matrix'):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    plt.figure(figsize=(8,6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=labels, yticklabels=labels)
    plt.title(title)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

def plot_metrics(y_true, y_pred, labels, model_name, title_suffix=''):
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=labels)

    df_metrics = pd.DataFrame({
        'Categoria': labels,
        'Precisão': precision,
        'Recall': recall,
        'F1-Score': f1
    })

    print(f"Overall Accuracy – {model_name} {title_suffix}: {accuracy*100:.2f}%")
    display(df_metrics)

    df_metrics.set_index('Categoria').plot.bar(rot=0, figsize=(10,6))
    plt.title(f'{model_name} – Metrics by Category {title_suffix}')
    plt.ylabel('Value')
    plt.ylim(0, 1)
    plt.grid(axis='y')
    plt.show()


## **2. Applying to Models WITHOUT Embeddings (TF-IDF)**

**Random Forest without embeddings**

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_tfidf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_tfidf.fit(X_train_tfidf, y_train)
y_pred_rf = rf_tfidf.predict(X_test_tfidf)

plot_metrics(y_test, y_pred_rf, rf_tfidf.classes_, model_name='Random Forest', title_suffix='(Without Embeddings)')

# plot_confusion_matrix() ends with plt.show(), so in a notebook a plt.savefig() placed after it
# writes a blank image. The matrix is drawn here instead and saved before it is shown.
cm = confusion_matrix(y_test, y_pred_rf, labels=rf_tfidf.classes_)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=rf_tfidf.classes_, yticklabels=rf_tfidf.classes_)
plt.title('Random Forest – Without Embeddings')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('matriz_rf_tfidf2.png', bbox_inches='tight', dpi=600)
plt.show()


**Naive Bayes without embeddings**

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf, y_train)
y_pred_nb = nb_tfidf.predict(X_test_tfidf)

plot_metrics(y_test, y_pred_nb, nb_tfidf.classes_, model_name='Naive Bayes', title_suffix='(Without Embeddings)')

# Draw and save the confusion matrix before plt.show(); saving after a helper that
# already called plt.show() writes a blank image in a notebook.
cm = confusion_matrix(y_test, y_pred_nb, labels=nb_tfidf.classes_)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=nb_tfidf.classes_, yticklabels=nb_tfidf.classes_)
plt.title('Naive Bayes – Without Embeddings')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('matriz_rf_NB.png', bbox_inches='tight', dpi=300)
plt.show()


**SVM without embeddings**

In [None]:
from sklearn.svm import SVC

svm_tfidf = SVC()
svm_tfidf.fit(X_train_tfidf, y_train)
y_pred_svm = svm_tfidf.predict(X_test_tfidf)

plot_metrics(y_test, y_pred_svm, svm_tfidf.classes_, model_name='SVM', title_suffix='(Without Embeddings)')

# Draw and save the confusion matrix before plt.show(); saving after a helper that
# already called plt.show() writes a blank image in a notebook.
cm = confusion_matrix(y_test, y_pred_svm, labels=svm_tfidf.classes_)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=svm_tfidf.classes_, yticklabels=svm_tfidf.classes_)
plt.title('SVM – Without Embeddings')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('matriz_rf_SVM.png', bbox_inches='tight', dpi=300)
plt.show()


## **3. Applying to Models WITH Embeddings**

**Random Forest with embeddings**

In [None]:
rf_emb = RandomForestClassifier(n_estimators=100, random_state=42)
rf_emb.fit(X_train_emb, y_train_emb)
y_pred_rf_emb = rf_emb.predict(X_test_emb)

plot_metrics(y_test_emb, y_pred_rf_emb, rf_emb.classes_, model_name='Random Forest', title_suffix='(With Embeddings)')

# Draw and save the confusion matrix before plt.show(); saving after a helper that
# already called plt.show() writes a blank image in a notebook.
cm = confusion_matrix(y_test_emb, y_pred_rf_emb, labels=rf_emb.classes_)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=rf_emb.classes_, yticklabels=rf_emb.classes_)
plt.title('Random Forest – With Embeddings')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('matriz_RF_com.png', bbox_inches='tight', dpi=300)
plt.show()


**Naive Bayes with embeddings**

In [None]:
from sklearn.naive_bayes import GaussianNB

nb_emb = GaussianNB()
nb_emb.fit(X_train_emb, y_train_emb)
y_pred_nb_emb = nb_emb.predict(X_test_emb)

plot_metrics(y_test_emb, y_pred_nb_emb, nb_emb.classes_, model_name='Naive Bayes', title_suffix='(With Embeddings)')

# Draw and save the confusion matrix before plt.show(); saving after a helper that
# already called plt.show() writes a blank image in a notebook.
cm = confusion_matrix(y_test_emb, y_pred_nb_emb, labels=nb_emb.classes_)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=nb_emb.classes_, yticklabels=nb_emb.classes_)
plt.title('Naive Bayes – With Embeddings')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('matriz_NB_com.png', bbox_inches='tight', dpi=300)
plt.show()


**SVM with embeddings**

In [None]:
svm_emb = SVC()
svm_emb.fit(X_train_emb, y_train_emb)
y_pred_svm_emb = svm_emb.predict(X_test_emb)

plot_metrics(y_test_emb, y_pred_svm_emb, svm_emb.classes_, model_name='SVM', title_suffix='(With Embeddings)')

# Draw and save the confusion matrix before plt.show(); saving after a helper that
# already called plt.show() writes a blank image in a notebook.
cm = confusion_matrix(y_test_emb, y_pred_svm_emb, labels=svm_emb.classes_)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=svm_emb.classes_, yticklabels=svm_emb.classes_)
plt.title('SVM – With Embeddings')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('matriz_SVM_com.png', bbox_inches='tight', dpi=300)
plt.show()
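
With six models evaluated across separate cells, a single ranked summary can make the comparison easier to read. The sketch below is one possible way to collect it; `accuracy`, `rank_models`, the model names and the toy labels are all hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rank_models(results):
    """Sort models by accuracy, best first.

    `results` maps a model name to a (y_true, y_pred) pair.
    """
    scores = {name: accuracy(y_true, y_pred)
              for name, (y_true, y_pred) in results.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical labels and predictions, only to show the call shape
truth = ["tech", "sport", "tech", "business"]
toy_results = {
    "model_a": (truth, ["tech", "sport", "sport", "sport"]),
    "model_b": (truth, ["tech", "sport", "tech", "sport"]),
}
ranking = rank_models(toy_results)
for name, score in ranking:
    print(f"{name}: {score:.2%}")
```

In the notebook, the dictionary could hold entries such as `"Random Forest (TF-IDF)": (y_test, y_pred_rf)` and `"SVM (embeddings)": (y_test_emb, y_pred_svm_emb)`, pairing each model with the test labels from its own split.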
