#### **Text Classification**

- Assigning labels to text
- Giving meaning to words and sentences
- Organizing and structuring unstructured data
- Applications:
  - Analyzing customer sentiment in reviews
  - Detecting spam in emails
  - Tagging news articles with relevant topics


#### **Types of Classification**

- Binary Classification 
- Multi-class Classification
- Multi-label Classification


#### **Binary Classification**

- Sorting text into two categories
- Example - Email Spam Detection *(**spam** / **not spam**)*
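
As a rough sketch of the binary case in PyTorch: assume a model emits one raw score (logit) per email. The logits and labels below are made-up values for illustration, not output from a real model.

```python
import torch
from torch import nn

# Hypothetical logits for three emails (in practice a model produces these)
logits = torch.tensor([2.0, -1.0, 0.5])
# Targets: 1.0 = spam, 0.0 = not spam
targets = torch.tensor([1.0, 0.0, 1.0])

# BCEWithLogitsLoss applies the sigmoid internally, so it takes raw logits
loss = nn.BCEWithLogitsLoss()(logits, targets)

# Turn logits into probabilities, then threshold at 0.5 for a hard decision
probs = torch.sigmoid(logits)
preds = (probs > 0.5).long()
print(preds)  # tensor([1, 0, 1])
```

The 0.5 threshold is only a common default; it can be moved if false positives and false negatives have different costs.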

#### **Multi-class Classification**

- Sorting text into multiple categories
- Example - News classification *(**Politics** / **Sports** / **Technology**)* 
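
A minimal sketch of the multi-class case, assuming the model emits one logit per category and each article has exactly one correct category. The logits are made-up values for illustration.

```python
import torch
from torch import nn

classes = ["Politics", "Sports", "Technology"]

# Hypothetical logits for two articles: shape (batch_size, num_classes)
logits = torch.tensor([[2.0, 0.1, -1.0],
                       [0.2, 0.3, 3.0]])
# Targets are class indices, one per article
targets = torch.tensor([0, 2])

# CrossEntropyLoss combines log-softmax and negative log-likelihood
loss = nn.CrossEntropyLoss()(logits, targets)

# The predicted class is the one with the highest logit
pred_idx = logits.argmax(dim=1)
predicted = [classes[i] for i in pred_idx.tolist()]
print(predicted)  # ['Politics', 'Technology']
```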


#### **Multi-label Classification**

- Each text can be assigned multiple labels
- Example - Book classification *(the same book can belong to several classes at once - **Action**, **Adventure**, **Fantasy**)*
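
A minimal sketch of the multi-label case, assuming one logit per genre and a multi-hot target vector per book, so each genre is an independent yes/no decision. The logits are made-up values for illustration.

```python
import torch
from torch import nn

genres = ["Action", "Adventure", "Fantasy"]

# Hypothetical logits for two books: shape (batch_size, num_labels)
logits = torch.tensor([[1.5, 0.8, 2.0],
                       [-2.0, 1.0, -0.5]])
# Multi-hot targets: a book can have any number of genres switched on
targets = torch.tensor([[1.0, 1.0, 1.0],
                        [0.0, 1.0, 0.0]])

# Each label is scored independently with a sigmoid, not a softmax
loss = nn.BCEWithLogitsLoss()(logits, targets)

probs = torch.sigmoid(logits)
predicted = [
    [g for g, p in zip(genres, row) if p > 0.5]
    for row in probs.tolist()
]
print(predicted)  # [['Action', 'Adventure', 'Fantasy'], ['Adventure']]
```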

---

To classify text, the model may need an understanding of the meaning of words.

![image.png](attachment:image.png)

#### **Word Embeddings**

![image.png](attachment:image.png)

In PyTorch, we can use `torch.nn.Embedding` to map integer word indices to learnable word vectors.

In [10]:
# Example
import torch
from torch import nn

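words = ['the', 'cat', 'sat', 'on', 'the', 'mat']
# Build the vocabulary from unique words so each word gets exactly one index
vocab = sorted(set(words))
word_to_idx = {word: i for i, word in enumerate(vocab)}
inputs = torch.LongTensor([word_to_idx[word] for word in words])
# One embedding row per vocabulary entry
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=10)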
output = embedding(inputs)
output

tensor([[ 1.1809, -0.6755,  0.3682, -0.1397, -0.9883,  0.0777,  0.6384, -1.7109,
         -1.3935, -0.3082],
        [ 0.8623,  1.1949,  0.8835,  0.2848,  1.3900, -0.1441, -0.9749, -0.6267,
         -0.7231,  1.9871],
        [ 0.2417, -0.9565, -0.6208,  0.5341,  0.2989,  0.0454,  0.9910,  0.3484,
          0.7737, -0.1763],
        [ 1.2447,  0.8927, -0.8323, -1.6464,  0.2252, -1.2045,  2.4671, -0.6624,
         -0.0414, -0.6542],
        [ 1.1809, -0.6755,  0.3682, -0.1397, -0.9883,  0.0777,  0.6384, -1.7109,
         -1.3935, -0.3082],
        [-0.7660,  0.2267,  1.0981,  0.8187, -0.8026,  1.7551, -0.3461, -0.2252,
          0.4187,  0.3654]], grad_fn=<EmbeddingBackward0>)
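
Note that the first and fifth rows of the output are identical, since both positions hold the word `the`. A small sketch of why: an embedding layer is a lookup table, and calling it selects rows from its weight matrix. The sizes and indices below are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # only to make the random initial weights repeatable

embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)
indices = torch.LongTensor([0, 2, 0])  # index 0 appears twice

vectors = embedding(indices)
print(vectors.shape)  # torch.Size([3, 3])

# The lookup is equivalent to indexing rows of the weight matrix
same_as_indexing = torch.equal(vectors, embedding.weight[indices])
# Repeated indices return the same vector
repeated_match = torch.equal(vectors[0], vectors[2])
print(same_as_indexing, repeated_match)  # True True
```

The weights start out random and only become meaningful once they are trained as part of a model.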

In [None]:
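from torch import nn
from torch.utils.data import Dataset, DataLoader


def preprocess_sentences(text):
    # Tokenization: lowercase and split on whitespace (a deliberately simple choice)
    tokens = text.lower().split()
    # Stemming and other normalization steps would go here (omitted in this sketch)
    # Word to index mapping, built from the unique tokens
    word_to_idx = {word: i for i, word in enumerate(sorted(set(tokens)))}
    encoded = [word_to_idx[token] for token in tokens]
    return encoded, word_to_idx


class TextDataset(Dataset):
    def __init__(self, encoded_sentences):
        self.data = encoded_sentences

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


def text_preprocessing_pipeline(text):
    # `vectorizer` is assumed here to be the word-to-index mapping
    tokens, vectorizer = preprocess_sentences(text)
    dataset = TextDataset(tokens)
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
    return dataloader, vectorizer


text = 'The cat sat on the mat'
dataloader, vectorizer = text_preprocessing_pipeline(text)
# Size the embedding table to the vocabulary
embedding = nn.Embedding(num_embeddings=len(vectorizer), embedding_dim=50)

for batch in dataloader:
    output = embedding(batch)  # shape: (batch_size, 50)
    print(output)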
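
To connect embeddings back to classification, here is a minimal sketch of one possible classifier: embed each token, average the vectors, and pass the result through a linear layer. The class name `BagOfEmbeddingsClassifier`, the sizes, and the token indices are all hypothetical choices for illustration; mean pooling is just one simple way to get a fixed-size sentence vector.

```python
import torch
from torch import nn


class BagOfEmbeddingsClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embedding_dim)
        pooled = embedded.mean(dim=1)         # (batch, embedding_dim)
        return self.fc(pooled)                # (batch, num_classes)


model = BagOfEmbeddingsClassifier(vocab_size=10, embedding_dim=50, num_classes=3)

# One sentence of six token indices (hypothetical values below vocab_size)
batch = torch.LongTensor([[4, 0, 3, 2, 4, 1]])
logits = model(batch)
print(logits.shape)  # torch.Size([1, 3])
```

The logits from an untrained model are meaningless; training with a suitable loss is what makes the embeddings and the linear layer useful.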