### Torchtext Beginner Project

For this project you need torchtext and spaCy. If you already have them installed, you can carry on; if you don't, run the following commands in the `conda powershell prompt` to install them.

 
- [ ]  `pip install torchtext`
- [ ]  `pip install spacy`
- [ ]  `python -m spacy download en_core_web_sm`


With that done, what we need to do next is get the Twitter dataset: [Sentiment140](https://www.kaggle.com/kazanova/sentiment140).

Now, let's begin our project.

We will start by importing the packages.

In [None]:
## Importing necessary packages ##

import spacy
import pandas as pd
import numpy as np
import torch
import torchtext
import torch.nn as nn

With that done, let's load our dataset.

Since it is a CSV file, we are going to use `pandas` to read it with the following code.

```python
pd.read_csv('training.1600000.processed.noemoticon.csv')
```

But that raises a `UnicodeDecodeError`:

```python
'utf-8' codec can't decode bytes in position 7970-7971: invalid continuation byte after engine=python
```

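The file is not valid UTF-8, so we need to change the encoding to `ISO-8859-1` and the engine to `python`. ISO-8859-1 assigns a character to every possible byte, so decoding cannot fail.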

In [None]:
## Importing the dataset ##

tweets = pd.read_csv('training.1600000.processed.noemoticon.csv' ,
                     engine = 'python',
                     encoding = 'ISO-8859-1',
                     names = ['score' , 'id' , 'date' , 'query' , 'name' , 'tweet'],
                     header = None)

## Displaying first 5 rows ##
tweets.head()
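
To see why the encoding matters, here is a minimal sketch on a made-up byte string (not taken from the dataset). The byte `0xE9` is a complete character in ISO-8859-1, but in UTF-8 it announces a multi-byte sequence, so decoding it as UTF-8 fails.

```python
# A byte string that is valid ISO-8859-1 (Latin-1) but not valid UTF-8.
# The content is an illustrative assumption, not a line from Sentiment140.
raw = b"caf\xe9 latte"

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as err:
    # Keep the reason so we can inspect what went wrong
    utf8_error = err.reason

# Latin-1 maps each byte to exactly one character, so this always succeeds
latin1_text = raw.decode("ISO-8859-1")

print("UTF-8 decoding failed with:", utf8_error)
print("ISO-8859-1 decoding gave:", latin1_text)
```

This is the same kind of failure `pd.read_csv` runs into when it assumes UTF-8 for this file.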

Now with that out of the way, let's look at the dataset a bit.

Column 0 (`score`) is the target sentiment and column 5 (`tweet`) holds the tweet text.

Let's check the unique sentiment values and how many rows each one has.

In [None]:
## Checking number of class values ##

tweets['score'].value_counts()

Wow! The dataset is a gem.

It has evenly distributed classes, each with 800,000 tweets.

The classes are 0, which corresponds to negative sentiment, and 4, which is positive sentiment.

Since the values are 0 and 4, we can convert them to a categorical datatype and use the category codes, which turns 0 into 0 and 4 into 1.

In [None]:
## Transforming column 0 ##

tweets['sentiment_cat'] = tweets['score'].astype('category')

tweets['sentiment'] = tweets['sentiment_cat'].cat.codes.astype('float')

tweets.head()

Let's check the values now!!

In [None]:
tweets['sentiment'].value_counts()
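
If the category trick feels like magic, here is a small sketch on a toy column (the five values are made up) showing what `cat.codes` does: the distinct values are sorted and each one is replaced by its position in that sorted list.

```python
import pandas as pd

# Toy stand-in for the 'score' column; the values are illustrative only
toy = pd.DataFrame({"score": [0, 4, 4, 0, 4]})

# Same two steps as in the notebook: category dtype, then integer codes as floats
codes = toy["score"].astype("category").cat.codes.astype("float")

# Pair each original score with the code it received
mapping = dict(zip(toy["score"], codes))
print(mapping)
```

So the negative class stays 0 and the positive class becomes 1, which is the form a binary loss function expects.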

Let's save this as a new CSV file.

In [None]:
## Saving the dataset ##

tweets.to_csv('twitter_data.csv' , index = False)

Now to move ahead we need to create fields.

What do fields do?

A `Field` describes how a column of sequence data is processed: it tokenizes the text and later maps the tokens to integer ids.

Furthermore, there is `LabelField`, which does the same job for the label column.

So, let's create them.

In [None]:
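## Creating Field objects for the data ##
## Note: Field and LabelField live in torchtext.data only in older torchtext releases (0.8 and earlier) ##

spacy_en = spacy.load('en_core_web_sm')

def token(text):
    # Split a raw tweet into a list of token strings with spaCy's tokenizer
    return [tok.text for tok in spacy_en.tokenizer(text)]

label = torchtext.data.LabelField()

tweet = torchtext.data.Field(tokenize = token , 
                             lower = True)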
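
To make the preprocessing step less abstract, here is a rough sketch in plain Python of what happens to one tweet: tokenize first, then lowercase every token. The regex tokenizer `simple_tokenize` is a hypothetical stand-in for spaCy, so its splits will not always match spaCy's.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for the spaCy tokenizer:
    # runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def preprocess(text, lower=True):
    # Tokenize first, then lowercase each token
    tokens = simple_tokenize(text)
    return [tok.lower() for tok in tokens] if lower else tokens

tokens = preprocess("Loving the NEW phone!")
print(tokens)
```

The real field does more later on (padding and mapping tokens to ids), but this is the part controlled by the `tokenize` and `lower` arguments.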

Now let's move ahead and build our torchtext dataset.

In [None]:
## Setting our dataset ##

fields = [('score' , None) , ('id' , None) , ('date' , None) , ('query' , None) ,
          ('name' , None) , ('tweet' , tweet) , ('sentiment_cat' , None) , ('sentiment' , label)]

# twitter_data.csv was saved with a header row, so it has to be skipped
tweet_dataset = torchtext.data.TabularDataset(path = 'twitter_data.csv',
                                              format = 'CSV' ,
                                              fields = fields,
                                              skip_header = True)

Boom!! We have our dataset.

Let's split the dataset into three parts: training, validation, and testing.

In [None]:
## Splitting the dataset ##

(train , val , test) = tweet_dataset.split(split_ratio = [0.8 , 0.1 , 0.1])

## Building Vocabulary ##

tweet.build_vocab(train , max_size = 50000)
label.build_vocab(train)
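
Here is a simplified sketch of what building a vocabulary with a size cap amounts to. It is not torchtext's implementation: `build_vocab` and `numericalize` are hypothetical helpers, and the two special tokens are an assumption based on the `<unk>` and `<pad>` defaults of a `Field`.

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    # Count every token, keep the max_size most frequent ones,
    # and put the special tokens in front
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # Tokens that did not make it into the vocabulary map to <unk>
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

corpus = [["good", "day"], ["bad", "day"], ["good", "good", "movie"]]
itos, stoi = build_vocab(corpus, max_size=2)
ids = numericalize(["good", "movie", "day"], stoi)

print(itos)
print(ids)
```

Note that the vocabulary ends up with `max_size + 2` entries because of the special tokens. Anything that is sized by the vocabulary, such as an embedding layer, needs that many rows.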

Now let's build the iterators.

In [None]:
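## Setting device ##

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Building iterator ##

# sort_key tells BucketIterator how to group tweets of similar length
train_iterator , val_iterator , test_iterator = torchtext.data.BucketIterator.splits(datasets = (train , val , test),
                                                                                     batch_size = 32 ,
                                                                                     sort_key = lambda x: len(x.tweet) ,
                                                                                     device = device)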
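
The idea behind bucketing can be sketched in a few lines of plain Python: put sequences of similar length in the same batch so that little padding is needed. This is only an illustration of the idea; the real iterator also shuffles and returns tensors. `bucket_batches` is a hypothetical helper, and `pad_id = 1` is an assumption.

```python
def bucket_batches(sequences, batch_size, pad_id=1):
    # Sort by length so each batch holds sequences of similar length
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        # Pad every sequence in the batch up to the longest one in that batch
        width = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_id] * (width - len(seq)) for seq in chunk])
    return batches

seqs = [[5, 6, 7, 8, 9], [2], [3, 4], [10, 11, 12, 13]]
batches = bucket_batches(seqs, batch_size=2)

for batch in batches:
    print(batch)
```

With these four toy sequences only two padding ids are added in total; batching them in their original order would need seven.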

Now it's time to build our model.

In [None]:
## Building our model ##

class SentimentModel(nn.Module):
    
    def __init__(self , hidden_size , embedding_dim , vocab_size):
        super().__init__()
        # One embedding row per vocabulary entry
        self.embedding = nn.Embedding(vocab_size , embedding_dim)
        self.encoder = nn.LSTM(input_size = embedding_dim ,
                               hidden_size = hidden_size ,
                               num_layers = 1)
        self.predictor = nn.Linear(hidden_size , 1)
        
    def forward(self , x):
        # x : (seq_len , batch) token ids
        out = self.embedding(x)
        out , (hidden , _) = self.encoder(out)
        # hidden : (1 , batch , hidden_size) --> (batch , hidden_size)
        pred = self.predictor(hidden.squeeze(0))
        return pred
    

# The vocabulary holds up to max_size tokens plus the <unk> and <pad> specials
tweet_model = SentimentModel(100 , 300 , len(tweet.vocab))
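
It can help to follow the tensor shapes through the three layers once. The sketch below uses random token ids and small toy sizes (all of them assumptions chosen for readability) and the same layer types as the model.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only to keep the printout small
seq_len, batch_size = 7, 4
vocab_size, embedding_dim, hidden_size = 50, 8, 16

embedding = nn.Embedding(vocab_size, embedding_dim)
encoder = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, num_layers=1)
predictor = nn.Linear(hidden_size, 1)

# Random token ids shaped (seq_len, batch), the layout nn.LSTM expects by default
x = torch.randint(0, vocab_size, (seq_len, batch_size))

embedded = embedding(x)                      # (seq_len, batch, embedding_dim)
outputs, (hidden, cell) = encoder(embedded)  # hidden: (num_layers, batch, hidden_size)
logits = predictor(hidden.squeeze(0))        # (batch, 1)

print(embedded.shape, hidden.shape, logits.shape)
```

The model produces one raw score (a logit) per tweet, taken from the LSTM's final hidden state.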

Done!! Our model is set.

Let's set the optimizer and loss function.

In [None]:
## Optimizer ##

optim = torch.optim.Adam(tweet_model.parameters() , lr = 3e-4)

## Loss Function ##
criterion = nn.BCEWithLogitsLoss()
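
`BCEWithLogitsLoss` applies a sigmoid to the raw model output and then computes binary cross-entropy. A hand-rolled sketch for a single example shows the formula; PyTorch uses a numerically more stable form and averages over the batch by default, and `bce_with_logits` here is a hypothetical helper.

```python
import math

def bce_with_logits(logit, target):
    # Sigmoid turns the raw score into a probability of the positive class
    prob = 1.0 / (1.0 + math.exp(-logit))
    # Binary cross-entropy for one example
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

uncertain = bce_with_logits(0.0, 1.0)        # model has no opinion
confident_right = bce_with_logits(4.0, 1.0)  # strongly positive, label positive
confident_wrong = bce_with_logits(4.0, 0.0)  # strongly positive, label negative

print(uncertain, confident_right, confident_wrong)
```

The loss is small for confident correct predictions and grows quickly for confident wrong ones. This is also why the model's last layer has no sigmoid of its own.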

Let's train our model.

But first, let's move the model to the GPU.

In [None]:
## Moving model to the device ##

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tweet_model = tweet_model.to(device = device)

In [None]:
## Training ##

num_epochs = 10

train_loss = []
val_loss = []

for epoch in range(num_epochs):
    
    tweet_model.train()
    for batch in train_iterator:
        pred = tweet_model(batch.tweet)
        # BCEWithLogitsLoss expects float targets shaped like the predictions
        target = batch.sentiment.reshape(-1 , 1).float()
        optim.zero_grad()
        loss = criterion(pred , target)
        loss.backward()
        optim.step()
    
    # Record the loss of the last training batch of this epoch
    train_loss.append(loss.item())
    
    tweet_model.eval()
    with torch.no_grad():  # no gradients needed during validation
        for batch in val_iterator:
            pred = tweet_model(batch.tweet)
            target = batch.sentiment.reshape(-1 , 1).float()
            loss = criterion(pred , target)
    
    # Record the loss of the last validation batch of this epoch
    val_loss.append(loss.item())
    
    print('Epoch : {} / {} --> Training Loss : {:.3f} , Validation Loss : {:.3f}'.format(epoch + 1 , num_epochs , train_loss[epoch] , val_loss[epoch]))
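
The loop above reports only the loss. If you also want an accuracy number, for example on the held-out test batches, one possible sketch is below. `binary_accuracy` is a hypothetical helper, and the logits and targets are made-up values, not model output.

```python
import torch

def binary_accuracy(logits, targets):
    # A logit above 0 corresponds to a sigmoid probability above 0.5
    preds = (logits > 0).float()
    return (preds == targets).float().mean().item()

# Made-up logits and labels shaped (batch, 1), like the model output
logits = torch.tensor([[2.0], [-1.5], [0.3], [-0.2]])
targets = torch.tensor([[1.0], [0.0], [0.0], [1.0]])

acc = binary_accuracy(logits, targets)
print('Accuracy : {:.2f}'.format(acc))
```

In practice you would call it on each batch inside a `torch.no_grad()` block and average the results over all batches.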

Awesome!!

Thank You!!