# Summary: Scikit-Learn and PyTorch with 20 Newsgroups on CPU and GPU

## 1. Scikit-Learn Workflow (20 Newsgroups)

1. **Load public data**


In [12]:
from sklearn.datasets import fetch_20newsgroups
# Select 4 categories to keep the example simple
categories = ['alt.atheism','comp.graphics','sci.space','talk.politics.misc']
data = fetch_20newsgroups(subset='all',
                            categories=categories,
                            remove=('headers','footers','quotes'))
X, y = data.data, data.target

2. **Train/test split**

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
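
The `stratify=y` argument keeps the class proportions the same in both splits. A minimal sketch on toy labels (the 80/20 class imbalance and the names `X_toy` and `y_toy` are assumptions for illustration, not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumed for illustration): 80 samples of class 0, 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_counts = np.bincount(y_tr)  # samples per class in the training split
test_counts = np.bincount(y_te)   # samples per class in the test split
print(train_counts, test_counts)
```

Both splits keep the 4:1 ratio of the full label vector, which matters when some newsgroups have fewer posts than others.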

3. **TF–IDF + Logistic Regression pipeline**

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=10_000)),
    ('clf', LogisticRegression(max_iter=1000))
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

4. **Evaluation**

In [16]:
from sklearn.metrics import accuracy_score, classification_report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=data.target_names))

Accuracy: 0.8712871287128713
                    precision    recall  f1-score   support

       alt.atheism       0.89      0.78      0.83       160
     comp.graphics       0.93      0.93      0.93       195
         sci.space       0.80      0.95      0.87       197
talk.politics.misc       0.90      0.79      0.84       155

          accuracy                           0.87       707
         macro avg       0.88      0.86      0.87       707
      weighted avg       0.88      0.87      0.87       707
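
The per-class precision and recall in the report can be read off a confusion matrix. The sketch below uses toy labels (an assumption, not the notebook's predictions) to show how one misclassified document lowers recall for its true class and precision for the class it was assigned to:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (assumed): one class-0 document is wrongly predicted as class 2
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 0, 2, 1, 1, 1, 1, 2, 2, 2, 2])

cm = confusion_matrix(y_true, y_hat)       # rows: true class, columns: predicted class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / everything predicted as the class
recall = cm.diagonal() / cm.sum(axis=1)     # correct / everything truly in the class

print(cm)
print(precision, recall)
```

In the report above, `sci.space` shows the same pattern (recall 0.95, precision 0.80), which suggests that documents from other categories are being assigned to it. Running `confusion_matrix(y_test, y_pred)` on the real predictions would show which ones.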



## 2. PyTorch Workflow


### 2.1 Prepare TF–IDF Features and DataLoader (CPU/GPU)

In [17]:
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.feature_extraction.text import TfidfVectorizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Vectorize text as TF–IDF (converted to dense arrays for PyTorch)
tfidf = TfidfVectorizer(stop_words='english', max_features=10_000)
X_train_tf = tfidf.fit_transform(X_train).toarray()
X_test_tf  = tfidf.transform(X_test).toarray()

class TfidfDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)
    def __len__(self):
        return len(self.y)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_ds = TfidfDataset(X_train_tf, y_train)
test_ds  = TfidfDataset(X_test_tf,  y_test)

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
test_loader  = DataLoader(test_ds,  batch_size=32)
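
Calling `.toarray()` materializes the whole matrix: at 10,000 features in float32, every 1,000 documents take about 40 MB. If that becomes a problem, one possible alternative is to keep the matrix sparse and densify a single row per item. The sketch below is an assumption, not part of the notebook: `SparseRowDataset` is a hypothetical class, and a random sparse matrix stands in for the TF–IDF output.

```python
import numpy as np
import torch
from scipy import sparse
from torch.utils.data import Dataset, DataLoader


class SparseRowDataset(Dataset):
    """Hypothetical variant of TfidfDataset that densifies one row at a time."""

    def __init__(self, X_sparse, y):
        self.X = X_sparse.tocsr()  # CSR supports fast row slicing
        self.y = torch.as_tensor(np.asarray(y), dtype=torch.long)

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        row = self.X[idx].toarray().ravel().astype(np.float32)
        return torch.from_numpy(row), self.y[idx]


# Random sparse matrix standing in for tfidf.fit_transform(X_train)
X_toy = sparse.random(100, 50, density=0.05, format="csr", random_state=0)
y_toy = np.arange(100) % 4

loader = DataLoader(SparseRowDataset(X_toy, y_toy), batch_size=32, shuffle=True)
Xb, yb = next(iter(loader))
print(Xb.shape, Xb.dtype, yb.shape, yb.dtype)
```

The trade-off is extra work per item in `__getitem__`. For a dataset of this size the dense version in the cell above is simpler and fits comfortably in memory.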

### 2.2 Defining the Neural Network

In [18]:
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, in_features, hidden, out_classes):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden, out_classes)
    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)

model = SimpleNet(in_features=10_000, hidden=128, out_classes=len(categories))
model.to(device)

SimpleNet(
  (fc1): Linear(in_features=10000, out_features=128, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=128, out_features=4, bias=True)
)
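
To get a feel for the size of this model, the trainable parameters can be counted directly. The sketch rebuilds the same layer shapes with `nn.Sequential` as a stand-in for `SimpleNet`, so that it runs on its own:

```python
import torch.nn as nn

# Same layer shapes as SimpleNet(in_features=10_000, hidden=128, out_classes=4)
net = nn.Sequential(
    nn.Linear(10_000, 128),
    nn.ReLU(),
    nn.Linear(128, 4),
)

# Each Linear layer contributes in*out weights plus out biases
n_params = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(f"{n_params:,}")  # 10_000*128 + 128 + 128*4 + 4 = 1,280,644
```

Almost all of the parameters sit in the first layer, so `max_features` is the main knob controlling model size here.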

### 2.3 Loss Function and Optimizer

In [19]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
# or: optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
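
`nn.CrossEntropyLoss` expects raw logits and integer class labels, which is why `SimpleNet.forward` has no softmax at the end. A small check on random tensors (assumed values, for illustration only) shows that it is equivalent to `log_softmax` followed by the negative log-likelihood loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 4)          # batch of 8 examples, 4 classes, raw scores
labels = torch.randint(0, 4, (8,))  # integer class indices (dtype long)

loss_ce = nn.CrossEntropyLoss()(logits, labels)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(loss_ce.item(), loss_manual.item())
```

Applying a softmax inside the model before this loss would normalize the outputs twice and usually slows down learning.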

### 2.4 Training and Evaluation Loop

In [20]:
from sklearn.metrics import accuracy_score

num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for Xb, yb in train_loader:
        Xb, yb = Xb.to(device), yb.to(device)
        optimizer.zero_grad()
        logits = model(Xb)
        loss   = criterion(logits, yb)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, mean loss: {total_loss/len(train_loader):.4f}")

    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for Xb, yb in test_loader:
            Xb, yb = Xb.to(device), yb.to(device)
            preds = model(Xb).argmax(dim=1)
            all_preds.append(preds.cpu())
            all_labels.append(yb.cpu())
    preds = torch.cat(all_preds)
    labs  = torch.cat(all_labels)
    acc = accuracy_score(labs, preds)
    print(f"→ Test accuracy: {acc:.4f}\n")

Epoch 1, mean loss: 1.2084
→ Test accuracy: 0.8529

Epoch 2, mean loss: 0.5369
→ Test accuracy: 0.8699

Epoch 3, mean loss: 0.2351
→ Test accuracy: 0.8883

Epoch 4, mean loss: 0.1353
→ Test accuracy: 0.8925

Epoch 5, mean loss: 0.0956
→ Test accuracy: 0.8911
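
In the log above the training loss keeps falling while the accuracy peaks at epoch 4, a common sign that further epochs start to overfit. One way to handle this is to keep a copy of the best weights seen so far. The sketch below is an assumption rather than the notebook's method: it replays the accuracies reported above instead of training, and uses a small `nn.Linear` as a stand-in for `SimpleNet`.

```python
import copy
import torch.nn as nn

model = nn.Linear(4, 2)  # hypothetical stand-in for SimpleNet

# Accuracies reported in the run above, one per epoch
epoch_accuracies = [0.8529, 0.8699, 0.8883, 0.8925, 0.8911]

best_acc, best_epoch, best_state = 0.0, None, None
for epoch, acc in enumerate(epoch_accuracies, start=1):
    # A real loop would train for one epoch and evaluate here
    if acc > best_acc:
        best_acc, best_epoch = acc, epoch
        best_state = copy.deepcopy(model.state_dict())  # snapshot of the weights

model.load_state_dict(best_state)  # restore the best checkpoint
print(best_epoch, best_acc)
```

In practice the checkpoint should be selected on a separate validation split rather than on the test set, so that the final test accuracy remains an unbiased estimate.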



## 3. Generating Embeddings with BERT

### 3.1 Tokenization and Preprocessing

In [21]:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

enc = tokenizer(X_train, padding=True, truncation=True,
                max_length=128, return_tensors='pt')
enc = {k:v.to(device) for k,v in enc.items()}

### 3.2 Extracting Embeddings

In [22]:
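bert = BertModel.from_pretrained('bert-base-uncased').to(device)
chunks = []
with torch.no_grad():
    # Mini-batches of 32: a single forward pass over the whole training set
    # is slow and memory-hungry.
    for i in range(0, enc['input_ids'].size(0), 32):
        batch = {k: v[i:i+32] for k, v in enc.items()}
        chunks.append(bert(**batch).last_hidden_state[:, 0, :].cpu())
cls_emb = torch.cat(chunks)  # [CLS] embeddings, shape (n_docs, 768)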

### 3.3 Usage and Visualization
- Classification: feed `cls_emb` into SimpleNet (with `in_features=768` for `bert-base-uncased`) or LogisticRegression.
- Visualization:

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

emb2d = PCA(n_components=2).fit_transform(cls_emb.cpu().numpy())
plt.scatter(emb2d[:,0], emb2d[:,1], c=y_train, cmap='tab10')
plt.show()
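
The classification option can be sketched as follows. Because computing real BERT embeddings requires downloading the model, the block uses synthetic 768-dimensional vectors as a stand-in for `cls_emb`. The data, cluster structure, and variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_class, dim, n_classes = 50, 768, 4  # 768 = hidden size of bert-base-uncased

# Synthetic stand-in for cls_emb.cpu().numpy(): one Gaussian cluster per class
labels = np.repeat(np.arange(n_classes), n_per_class)
centers = rng.normal(size=(n_classes, dim))
emb = centers[labels] + rng.normal(size=(labels.size, dim))

E_tr, E_te, l_tr, l_te = train_test_split(
    emb, labels, test_size=0.2, random_state=42, stratify=labels
)
clf = LogisticRegression(max_iter=1000).fit(E_tr, l_tr)
acc = clf.score(E_te, l_te)
print(f"Accuracy on synthetic embeddings: {acc:.3f}")
```

With real embeddings, `emb` would be `cls_emb.cpu().numpy()` and `labels` would be `y_train`. The test documents would need to go through the same tokenizer and BERT model before scoring.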

## 4. Padding (Zero-pad) and Truncation
- In the tokenizer: padding=True, truncation=True, max_length=128.
- Manually with pad_sequence:

In [None]:
from torch.nn.utils.rnn import pad_sequence
padded = pad_sequence(list_of_tensors, batch_first=True, padding_value=0)
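
A runnable version of the manual approach, with truncation and an attention mask added. The token ids and `max_length = 4` are made-up values for illustration:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token-id sequences of different lengths
list_of_tensors = [
    torch.tensor([101, 7592, 102]),
    torch.tensor([101, 102]),
    torch.tensor([101, 2023, 2003, 1037, 102]),
]
max_length = 4

# Truncate first, then pad everything to the longest remaining sequence
truncated = [t[:max_length] for t in list_of_tensors]
lengths = torch.tensor([len(t) for t in truncated])
padded = pad_sequence(truncated, batch_first=True, padding_value=0)

# 1 for real tokens, 0 for padding positions
attention_mask = (torch.arange(padded.size(1))[None, :] < lengths[:, None]).long()

print(padded)
print(attention_mask)
```

Plain slicing drops whatever sits at the end of a long sequence, including a closing special token, whereas the Hugging Face tokenizer keeps special tokens when it truncates. For BERT inputs the tokenizer route is therefore usually the safer choice.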


## 5. Key Parameters and When to Use Them

| Parameter | Typical values | When to use |
|----|----|-----|
| batch_size | 16, 32, 64 | Limited GPU memory → 16; larger GPU → 32+ |
| learning_rate (lr) | 1e-3 (AdamW), 1e-2 (SGD) | AdamW: 1e-5–1e-3; SGD: 1e-2–1e-1 |
| num_epochs | 2–10 | Small datasets → 5–10; large corpora → 2–3 |
| max_length | 128, 256, 512 | Short texts → 128; long texts → 256/512 |
| optimizer | AdamW, SGD, RMSprop | AdamW for transformers; SGD/RMSprop for simple models |
| criterion | CrossEntropy, BCEWithLogits | Multiple classes → CrossEntropy; binary → BCEWithLogits |
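
The `criterion` row can be illustrated with a short sketch on random tensors (assumed values). `BCEWithLogitsLoss` takes one logit per example and float targets, and it matches a two-class `CrossEntropyLoss` in which the first class has a fixed logit of zero:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
z = torch.randn(8)                          # one raw logit per example
targets = torch.randint(0, 2, (8,)).float()  # binary targets as floats

loss_bce = nn.BCEWithLogitsLoss()(z, targets)

# Same loss written as sigmoid followed by plain binary cross-entropy
loss_sigmoid = F.binary_cross_entropy(torch.sigmoid(z), targets)

# Same loss as a 2-class cross-entropy with logits [0, z]
two_class_logits = torch.stack([torch.zeros_like(z), z], dim=1)
loss_ce = nn.CrossEntropyLoss()(two_class_logits, targets.long())

print(loss_bce.item(), loss_sigmoid.item(), loss_ce.item())
```

The practical difference lies in shapes and dtypes: `CrossEntropyLoss` wants logits of shape `(N, C)` with `long` labels, while `BCEWithLogitsLoss` wants logits and float targets of the same shape.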

Final tips:
- Keep tensors and the model on the same device.
- Monitor loss and metrics every epoch.
- Adjust max_features and max_length to fit available memory.
- Combine TF–IDF (fast) with BERT (contextual) as needed.
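
The first tip can be sketched as below. The layer sizes are arbitrary assumptions. The point is that `Tensor.to()` returns a new tensor that must be reassigned, whereas `Module.to()` moves the parameters in place.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

layer = nn.Linear(16, 4).to(device)  # module parameters are moved in place
x = torch.randn(2, 16)               # tensors are created on the CPU by default
x = x.to(device)                     # returns a new tensor, so reassign it

out = layer(x)
same_device = next(layer.parameters()).device == out.device
print(device, same_device)
```

Forgetting the reassignment on a GPU machine leads to a runtime error about tensors being on different devices. On a CPU-only machine the mistake goes unnoticed, which is why it often surfaces only after moving a notebook to a GPU.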

