In [None]:
# Run this cell only when using a remote Colab server (e.g., through ngrok)
from google.colab import files
uploaded = files.upload() 
%pip install -r requirements.txt


KeyboardInterrupt: 

## NLTK vs PyTorch

**Tokenizers in NLTK** are regex- or heuristic-based, sometimes using statistics.  
After tokenization, the tokens are typically mapped to frequency-based features, e.g., with a **TF-IDF vectorizer** or a manually created vocabulary dictionary (lexicon).  
NLTK is **not optimized** for training neural networks.  

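**Tokenizers in the PyTorch ecosystem** (here, `torchtext`), on the other hand, range from simple rule-based ones such as `basic_english` to **pretrained mappers** of character sequences into subword units.  
They split the input text into those units, and each unit is then assigned an integer ID.  

A **vocab object** in `torchtext` wraps the token-to-ID lookup in a dictionary-like structure.  
Calling it on a list of tokens returns a list of integer IDs, which is then converted into a **tensor of integers**.  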
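
As a rough illustration of the token-to-ID mapping described above, the following sketch builds a tiny vocabulary by hand in plain Python. The regex tokenizer, the `build_vocab` and `encode` helpers, and the sample sentences are assumptions made for this example; they are not part of NLTK or torchtext.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical rule-based tokenizer: lowercase, then keep runs of letters/digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(texts, specials=("<unk>",)):
    # Special tokens get the first IDs; the rest are ordered by frequency, then alphabetically.
    counts = Counter(tok for t in texts for tok in tokenize(t))
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

def encode(text, vocab):
    # Unknown tokens fall back to the ID of "<unk>".
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokenize(text)]

corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)
ids = encode("the bird sat", vocab)  # "bird" is not in the vocabulary
print(ids)
```

The word "bird" never appears in the toy corpus, so it receives the ID reserved for `<unk>`, which mirrors what `set_default_index` does for a torchtext vocab.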


In [1]:
import torch
import torchtext
from torchtext.datasets import DBpedia
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
print(torchtext.__version__)

# DBpedia already ships with a train split.
# The returned iterator yields (label, text) pairs one at a time.
tokenizer = get_tokenizer("basic_english")
train_iter = DBpedia(root=".data", split="train")

def yield_tokens(data_iter):
    # Generator: tokenizes one text at a time, on demand
    for _, text in data_iter:
        yield tokenizer(text)

# Build the vocabulary from the tokenized training texts
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
# Out-of-vocabulary tokens map to the index of "<unk>"
vocab.set_default_index(vocab["<unk>"])

# The vocab maps each known token to a unique integer ID
text_pipeline = lambda z: vocab(tokenizer(z))
label_pipeline = lambda x: int(x) - 1  # DBpedia labels start at 1; shift them to start at 0
text_pipeline("this will be converted to tokens and then each token to a number")




ModuleNotFoundError: No module named 'torchtext'

Before end-to-end neural embeddings became standard, text had to be heavily preprocessed and simplified so that methods like **TF-IDF** or **word2vec** could be applied.  

Now we define the function `collate_batch`. By default, the `DataLoader` stacks the samples of a batch into tensors using a built-in collate function, which assumes that all samples have the same shape (for example, the same length).  

In our case, instead of padding the sentences to make them all the same length, we concatenate them into one long tensor and record where each sentence starts in the **offsets** variable. The order of the sentences is preserved, so the labels remain aligned with the offsets array.  
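
To make the offsets idea concrete, here is a minimal plain-Python sketch. The three sentences and their token IDs are made up for illustration; the notebook's `collate_batch` performs the same bookkeeping with tensors.

```python
from itertools import accumulate

# Three hypothetical tokenized sentences of different lengths (token IDs are made up).
sentences = [[4, 8, 15], [16, 23], [42, 7, 9, 1]]

# Concatenate all token IDs into one flat list instead of padding.
flat = [tok for sent in sentences for tok in sent]

# Each offset is the index in `flat` where a sentence starts:
# the running sum of the lengths of all previous sentences.
lengths = [len(sent) for sent in sentences]
offsets = [0] + list(accumulate(lengths))[:-1]

# Recover each sentence from the flat list using consecutive offsets.
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds, bounds[1:])]
print(offsets)
```

Because the offsets are stored in batch order, the i-th offset and the i-th label always refer to the same sentence.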

---

#### Pooling and EmbeddingBag  

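**Pooling** is the process of compressing many vectors into just one.  
- **Max pooling**: take the largest value of each feature across the vectors.  
- **Mean pooling**: take the average of each feature across the vectors.  

`nn.EmbeddingBag` looks up the embeddings of a bag of tokens and pools them in a single step.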
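
The sketch below checks, on toy sizes chosen only for illustration, that `nn.EmbeddingBag` with `mode="mean"` gives the same result as looking up each token's vector and averaging by hand. The token IDs and dimensions are assumptions for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only.
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")

# Two "sentences" packed into one flat tensor: [1, 2, 3] and [4, 5].
text = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])

pooled = bag(text, offsets)  # shape (2, 4): one pooled vector per sentence

# The same result by hand: look up each token's vector, then average per sentence.
manual = torch.stack([
    bag.weight[text[0:3]].mean(dim=0),
    bag.weight[text[3:5]].mean(dim=0),
])

# Max pooling over the first sentence, for comparison:
# the largest value of each feature across the three token vectors.
max_pooled = bag.weight[text[0:3]].max(dim=0).values
```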


In [None]:
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    # Q: Is it necessary always to define this function?
    # A: No. Only when you need custom batching logic. 
    # For example, with text data of variable lengths, 
    # we need to pad/offset them properly before putting them in a single tensor.

    label_list = []
    text_list = []
    offsets = [0]

    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))

        # Q: What are the labels of DBpedia and what are they used for?
        # A: DBpedia is a dataset for text classification. 
        # Each "label" is a category (e.g., Company, Artist, Place, etc.).
        # They are the target classes for supervised learning.

        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))

    label_list = torch.tensor(label_list, dtype=torch.int64)
    # we compute cumulative sum of text lengths so we know where each new text starts
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)

    # Q: Why is it important to send this to the device again?
    # A: Because DataLoader may create tensors on CPU by default. 
    # Sending to device ensures data is on GPU (if available) before the model uses it.
    return label_list.to(device), text_list.to(device), offsets.to(device)


# "iter" should be replaced with your actual dataset (e.g., train_dataset)
# Example:
# dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_batch)




Now we will create the architecture, using `torch.nn.functional` (imported as `F`), which provides the functional building blocks that the modules in `torch.nn` are built on.

Q: I had the idea that when a vector is passed through a sigmoid function it gets normalized between -1 and 1, but you are telling me that this is not what the batch norm layer does, and that it instead uses conventional z-score normalization. Does it normalize over a batch or over the whole dataset?

A: Yes, and during training it does so per feature using the current mini-batch, not the entire dataset (in evaluation mode it uses running estimates collected during training). Also, sigmoid outputs lie in (0, 1), not (-1, 1); sigmoid is used at the end of a network to map the output to probabilities.

The big difference between the two: sigmoid compresses, z-normalization rescales.
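
A small sketch of that difference, assuming a made-up mini-batch of 8 vectors with 3 features. It compares `torch.sigmoid` with `nn.BatchNorm1d` in training mode and reproduces the batch norm result with a hand-written z-normalization.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A mini-batch of 8 vectors with 3 features, deliberately shifted and scaled.
x = torch.randn(8, 3) * 2.0 + 1.0

# Sigmoid squashes every value independently into the interval (0, 1).
squashed = torch.sigmoid(x)

# BatchNorm1d in training mode standardizes each feature (column)
# using the mean and variance of the current mini-batch.
bn = nn.BatchNorm1d(3)
bn.train()
normalized = bn(x)

# The same z-normalization by hand (biased variance, as used for the batch statistics).
manual = (x - x.mean(dim=0)) / torch.sqrt(x.var(dim=0, unbiased=False) + bn.eps)

print(squashed.min().item(), squashed.max().item())
print(normalized.mean(dim=0), normalized.std(dim=0, unbiased=False))
```

After batch norm each feature has roughly zero mean and unit variance, but individual values are not confined to a fixed interval the way sigmoid outputs are.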

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModelClasificationText(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        # A: This initializes the parent class (nn.Module). 
        # It sets up all the internal machinery that PyTorch needs 
        # to register layers, track parameters, and build the computation graph.
        super(ModelClasificationText, self).__init__()

        # A: nn.EmbeddingBag creates a lookup table for vocab_size words,
        # each represented by a vector of size embed_dim.
        # When you pass text+offsets, it fetches embeddings for the tokens
        # and pools them (sum, mean, or max).
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")

        # Batch Normalization layer
        # This is NOT a sigmoid function. It's a normalization step that
        # helps training converge faster and reduces internal covariate shift.
        self.bn1 = nn.BatchNorm1d(embed_dim)

        # Fully connected layer
        # Q: What does this do?
        # A: It maps the normalized + activated embedding vector (size embed_dim)
        # into the number of classes (num_classes) you want to predict.
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, text, offsets):
        # Apply embedding bag: turns tokens into pooled sentence embedding
        embedded = self.embedding(text, offsets)

        # Normalize features
        normal = self.bn1(embedded)

        # Non-linear activation
        # Q: Why use ReLU here?
        # A: ReLU introduces non-linearity. Without it, the model would be 
        # equivalent to a single linear transformation, limiting its capacity.
        # ReLU prevents collapsing to a simple linear mapping and lets the
        # network learn more complex decision boundaries.
        activated = F.relu(normal)

        # Map to class scores (logits; CrossEntropyLoss applies softmax internally)
        return self.fc(activated)


Rules of thumb for choosing an activation: for a shallow model, use ReLU; if dead neurons appear, try Leaky ReLU; if the data is symmetric around zero, tanh can work well; for very deep networks, GELU is a common choice. For the output layer, stick to sigmoid for binary classification, softmax for multiclass classification, or no activation for regression.
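
The following sketch illustrates the output-layer choices and the ReLU versus Leaky ReLU difference on made-up numbers. It also shows why the model above returns raw logits: `F.cross_entropy` applies log-softmax internally, so no softmax layer is needed before the loss.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # made-up scores for 3 classes
target = torch.tensor([0])

# Multiclass output: softmax turns logits into probabilities that sum to 1.
probs = F.softmax(logits, dim=1)

# Binary output: sigmoid turns a single logit into one probability.
binary_prob = torch.sigmoid(torch.tensor(0.0))

# cross_entropy takes raw logits and applies log-softmax internally.
loss_from_logits = F.cross_entropy(logits, target)
loss_by_hand = -torch.log(probs[0, target[0]])

# Leaky ReLU keeps a small slope for negative inputs, unlike ReLU.
x = torch.tensor([-2.0, 3.0])
relu_out = F.relu(x)
leaky_out = F.leaky_relu(x, negative_slope=0.01)
```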



In [None]:
num_categories = len(set(label for label, _ in train_iter))  # number of distinct classes in the training split
vocab_size = len(vocab)
embedding_size = 100  # dimension of each token's embedding vector
# Move the model to the same device that collate_batch sends the batches to
model = ModelClasificationText(vocab_size=vocab_size, embed_dim=embedding_size, num_classes=num_categories).to(device)
print(model)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def train(dataloader, epoch):  # epoch is only used in the progress message
    model.train()  # put the model in training mode
    epoch_acc = 0
    epoch_loss = 0
    total_count = 0

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()  # reset gradients from the previous batch
        prediction = model(text, offsets)
        loss = criterion(prediction, label)

        # backpropagation
        loss.backward()

        # argmax returns the index of the class with the highest score in each row
        acc = (prediction.argmax(1) == label).sum()

        # limit the gradient norm to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()

        epoch_acc += acc.item()
        # loss is a batch mean; weight it by the batch size to get a per-sample average
        epoch_loss += loss.item() * label.size(0)
        total_count += label.size(0)

        if idx % 500 == 0 and idx > 0:
            print(f"epoch {epoch} | {idx}/{len(dataloader)} batches | "
                  f"loss {epoch_loss/total_count:.4f} | accuracy {epoch_acc/total_count:.4f}")
    return epoch_acc / total_count, epoch_loss / total_count
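
The gradient clipping step used in the training loop can be observed in isolation. This is a minimal sketch with a stand-in linear layer and deliberately large random inputs (both assumptions for the example); it measures the total gradient norm before and after `clip_grad_norm_`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(4, 2)
x = torch.randn(16, 4) * 100.0  # large inputs to provoke large gradients
loss = layer(x).pow(2).mean()
loss.backward()

max_norm = 0.1
# clip_grad_norm_ rescales the gradients in place and returns
# the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm)

# Recompute the total gradient norm after clipping.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))

print(norm_before.item(), norm_after.item())
```

Clipping changes only the length of the gradient, not its direction, so the update still points the same way but cannot take an arbitrarily large step.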


In [None]:
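EPOCHS = 3
LEARNING_RATE = 0.2  # note: large for Adam, whose default is 1e-3; lower it if training diverges
BATCH_SIZE = 64
# The loss function and the optimizer are also hyperparameter choices
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)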



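def eval(dataloader):  # note: this name shadows Python's built-in eval()
    model.eval()  # evaluation mode: BatchNorm uses its running statistics
    epoch_acc = 0
    total_count = 0
    epoch_loss = 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for idx, (label, text, offsets) in enumerate(dataloader):
            prediction = model(text, offsets)
            loss = criterion(prediction, label)
            acc = (prediction.argmax(1) == label).sum()
            # loss is a batch mean; weight it by the batch size to get a per-sample average
            epoch_loss += loss.item() * label.size(0)
            epoch_acc += acc.item()
            total_count += label.size(0)
    return epoch_acc / total_count, epoch_loss / total_count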


In [None]:
## DATA SPLIT
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

# Named train_iter/test_iter so that the train() function defined above is not overwritten
train_iter, test_iter = DBpedia()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
train_len = int(len(train_dataset) * 0.95)  # 95% for training, 5% for validation
train_split, val_split = random_split(train_dataset, [train_len, len(train_dataset) - train_len])
train_dataloader = DataLoader(train_split, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
val_dataloader = DataLoader(val_split, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)


# Now we are ready to train

best_val_loss = float("inf")
for epoch in range(1, EPOCHS + 1):
    train_acc, train_loss = train(train_dataloader, epoch)
    val_acc, val_loss = eval(val_dataloader)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_saved.pt")
# Q: Why keep the model with the smallest validation loss instead of using another set's loss?
# A: Training loss rewards memorizing the training data, and the test set must stay unseen
#    until the end, so the validation set is the fair way to choose between epochs.
# Q: Does every epoch define a new model?
# A: No. It is the same model; each epoch continues from the weights left by the previous one.

# Now the test dataset.
# eval() runs under torch.no_grad() and never calls optimizer.step(), so no parameters are updated;
# model.eval() itself only switches layers such as BatchNorm to their evaluation behavior.
test_acc, test_loss = eval(test_dataloader)
print(f"Test accuracy -> {test_acc}")
print(f"Test loss -> {test_loss}")
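# Now inference

# Class names in the order of the dataset's classes.txt (indices shifted to start at 0)
DBpedia_label = {
    0: "Company", 1: "EducationalInstitution", 2: "Artist", 3: "Athlete",
    4: "OfficeHolder", 5: "MeanOfTransportation", 6: "Building", 7: "NaturalPlace",
    8: "Village", 9: "Animal", 10: "Plant", 11: "Album", 12: "Film", 13: "WrittenWork",
}

def predict(text, text_pipeline):
    model.eval()
    with torch.no_grad():
        tokens = torch.tensor(text_pipeline(text), dtype=torch.int64)
        offsets = torch.tensor([0])  # a single text that starts at position 0
        # Optional (PyTorch >= 2.0): torch.compile(model, mode="reduce-overhead") can speed up repeated calls
        output = model(tokens, offsets)
        return output.argmax(1).item()

eg1 = "Don't think sorry is easily said, don't try turning tables instead, you have taken lots of chances before, but I ain't gonna give you any more, that's how it goes, because part of me knows what you are thinking"
eg2 = "Don't say words you are going to regret, don't let the fire rush to your head, I have heard those accusations before, and I'm not gonna take any more, believe me, the sun in your eyes made some of your lies worth believing"
eg3 = "Don't leave false illusions behind, don't cry, I ain't changing my mind, so find another fool, like before, because I am not going to keep any more, believing, some of your lies, while all of the signs are deceiving"

# predict() builds its tensors on the CPU, so the model must be on the CPU as well
model = model.to("cpu")
print(f"The example is from category {DBpedia_label[predict(eg1, text_pipeline)]}")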

Now, saving and loading the model.

In [None]:
model_state=model.state_dict()
optimizer_state=optimizer.state_dict()
checkpoint={
    "model_state":model_state,
    "optimizer_state":optimizer_state,
    "epoch":EPOCHS,
    "loss":train_loss,
}
torch.save(checkpoint,"model_checkpoint.pth")
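
The cell above covers saving; a matching load step could look like the sketch below. It is self-contained, so it uses a small stand-in `nn.Linear` model, a temporary file path, and placeholder values for the epoch and loss. In the notebook the same pattern would apply to `ModelClasificationText` and `model_checkpoint.pth`.

```python
import os
import tempfile

import torch
import torch.nn as nn

# A small stand-in model and optimizer; the notebook would use its own.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

path = os.path.join(tempfile.mkdtemp(), "model_checkpoint.pth")
torch.save({
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "epoch": 3,
    "loss": 0.25,  # placeholder value
}, path)

# To resume: rebuild the same architecture, then load the saved states into it.
restored = nn.Linear(4, 2)
restored_optimizer = torch.optim.Adam(restored.parameters(), lr=1e-3)

checkpoint = torch.load(path, map_location="cpu")
restored.load_state_dict(checkpoint["model_state"])
restored_optimizer.load_state_dict(checkpoint["optimizer_state"])
start_epoch = checkpoint["epoch"]

restored.eval()  # switch to evaluation mode before inference
```

Saving the optimizer state alongside the model state matters mainly when training is going to continue; for inference only, the model state is enough.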