# Genre classification by plot summary
The corpus contains descriptions of 30,000 films with high veriability. 
I assume that it has a high complexity and hidden semantics.
For instance, the word "death" can be at drama, horror, action, fight, and even comedy. 

Therefor, I choose to mimic semantic with **transformers**, and not just words counting/distributions/LDA.

Because this is an asymmetric semantic task, I used MSMARCO Models. 

Models trained with a causal language modeling (CLM) objective are better than BERT. 

# Model input: 
Embbedings vector for each plot (674 features). 

Few options: 1. Extract Sementic will be at the start and at the end of each plot. 2. slicing according to the model capacity, and feeding it separately. 3. slice + mix the vectorized representation. 


# Model output:
**Vector of logits to describe each class probability, per plot summary.**
This will give a feature representation, *instead* of one vs all. This is because different combinations of classes hide a different semantic. 

**Labels:** One hot encoding. (vector with size of: n_of_classes)

#imbalanced data
This data set is multy labeled and imbalanced. 

Adjustments relative to the imbalanced data need to be considered. {such as Random Undersampling (Tomek Link), Oversampling (SMOTE), Class weights in the models, Change Evaluation Metric and so on.}

#validation
For quantifing the performance, I need to find the best threshold (relative to accuracy) and fine tune it with validation set. 



In [2]:
pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
[K     |████████████████████████████████| 78 kB 3.0 MB/s 
[?25hCollecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
[K     |████████████████████████████████| 2.9 MB 26.6 MB/s 
[?25hCollecting tokenizers>=0.10.3
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 43.3 MB/s 
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[K     |████████████████████████████████| 1.2 MB 45.7 MB/s 
[?25hCollecting huggingface-hub
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
[K     |████████████████████████████████| 56 kB 4.8 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
[K     |████████████████████

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from torch import nn, optim
from torch.utils.data import Dataset
import pandas as pd
import warnings
warnings.filterwarnings('ignore')


In [95]:
from google.colab import files
uploaded = files.upload()

Saving 10.csv to 10.csv


In [97]:
df = pd.read_csv("10.csv", header=None)

In [98]:
# train-evaluation split: 
from sklearn.model_selection import train_test_split
train_df, eval_df = train_test_split(df, test_size=0.2)

In [9]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/msmarco-distilroberta-base-v2')

Downloading:   0%|          | 0.00/391 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.73k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/683 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/122 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/229 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/329M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/772 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [99]:
# Making the embeddings
embeddings = [model.encode(train_df.iloc[i][0]) for i in range(train_df.shape[0])]

In [100]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cpu'

In [101]:
embeddings = torch.tensor(embeddings)
print(f"I have got {embeddings.shape[0]} samples, each one represented by embedding vector with {embeddings.shape[1]} parameters.")

I have got 8 samples, each one represented by embedding vector with 768 parameters.


In [103]:
class Net(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super(Net, self).__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.out1 = (self.input_size + self.output_size)//2
        self.fc1 = torch.nn.Linear(in_features=self.input_size, out_features=self.out1)
        self.dropout = nn.Dropout(p=0.2)
        self.bn = nn.BatchNorm1d(self.out1)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(in_features=self.out1, out_features=self.output_size)
        self.act = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        # x = self.bn(x)
        x = self.dropout(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.act(x)
        return x

    def predict(self, instance, label):
        embeddings = model.encode(instance)

# I'm using the nn.BCELoss for multiclass-multilabled classification. 
BCELoss and ***not*** BCEWithLogitsLoss because I want to use the same model for validation/testing.

In [62]:
input_size = len(embeddings[0])
output_size = train_df.iloc[0, 1:].shape[0]
net = Net(input_size=input_size, output_size=output_size).to(device)
net

Net(
  (fc1): Linear(in_features=768, out_features=562, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
  (bn): BatchNorm1d(562, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (fc2): Linear(in_features=562, out_features=356, bias=True)
  (act): Sigmoid()
)

In [104]:
criterion = nn.BCELoss(reduction='mean')
optimizer = optim.Adam(net.parameters(), lr=0.0001)


# Training and Evaluation
Due to time constraints, I only tried to catch the overfiting point.

Unfortunately I did not had time to add:
- Data loader
- Creating Batches
- Randomization
- Revaluation of the Validation_LOSS with Batch and not just single item
- BN
- Evaluation matrix 

In [131]:
eval_sentences = [eval_df.iloc[i][0] for i in range(eval_df.shape[0])]

In [132]:
eval_embeddings = torch.tensor(eval_embeddings)

In [133]:
epochs=100
for epoch in range(epochs):
  
    for i, embed in enumerate(embeddings):
        l = train_df.iloc[i, 1:].values
        
        labels = torch.tensor(l.astype(float)).to(device)
        optimizer.zero_grad()

        outputs = net(embed.to(device))

        labels = labels.float()
        outputs = outputs.float()
        loss = criterion(outputs.unsqueeze(-1), labels.unsqueeze(-1))

        loss.backward()
        optimizer.step()

    if epoch%10==0:
        for j, eval_embed in enumerate(eval_embeddings):
            val_l = eval_df.iloc[j, 1:].values
            val_labels = torch.tensor(val_l.astype(float)).to(device)
            val_labels = val_labels.float()

            val_outputs = net(eval_embed.to(device))
            val_outputs = val_outputs.float()

            val_loss = criterion(val_outputs.unsqueeze(-1), val_labels.unsqueeze(-1))
        torch.save(net.state_dict(), "model_weights_"+str(epoch))

        print('Epoch [%d/%d], Iter [%d]:' %(epoch+1, epochs, i+1))
        print('                        Train loss: %.4f' % (loss))
        print('                        Evaluation loss: %.4f' % (val_loss))




Epoch [1/100], Iter [8]:
                        Train loss: 0.0000
                        Evaluation loss: 0.0033
Epoch [11/100], Iter [8]:
                        Train loss: 0.0000
                        Evaluation loss: 0.0025
Epoch [21/100], Iter [8]:
                        Train loss: 0.0000
                        Evaluation loss: 0.0038
Epoch [31/100], Iter [8]:
                        Train loss: 0.0000
                        Evaluation loss: 0.0051
Epoch [41/100], Iter [8]:
                        Train loss: 0.0000
                        Evaluation loss: 0.0024
Epoch [51/100], Iter [8]:
                        Train loss: 0.0000
                        Evaluation loss: 0.0022
Epoch [61/100], Iter [8]:
                        Train loss: 0.0000
                        Evaluation loss: 0.0024
Epoch [71/100], Iter [8]:
                        Train loss: 0.0000
                        Evaluation loss: 0.0019
Epoch [81/100], Iter [8]:
                        Train loss: 0.0

In [106]:
torch.save(net.state_dict(), "model_weights.pth")


# Creating an inference class

In [145]:
class Test():
    def __init__(self, plot):
        self.plot = plot

    def inference(self):

        embedTest = model.encode(self.plot)
        embedTest = torch.tensor(embedTest)
        net_pred = Net(input_size=768, output_size=356).to(device)
        net_pred.load_state_dict(torch.load('/content/sample_data/model_weightsLast.pth', map_location='cpu'))
        net_pred.eval()
        outputs = net_pred(embedTest.to(device))
        out ={}
        for i, prob in enumerate(outputs):
            if prob>0.1:
                out[i] = prob
        return out

In [154]:
test = Test(plot= "Hi Omri, I realy like this task. Thanks!!")

test.inference()


{8: tensor(0.9909, grad_fn=<UnbindBackward>),
 19: tensor(0.9009, grad_fn=<UnbindBackward>),
 76: tensor(0.9847, grad_fn=<UnbindBackward>),
 138: tensor(0.9983, grad_fn=<UnbindBackward>),
 294: tensor(0.9259, grad_fn=<UnbindBackward>)}

In [None]:
from google.colab import drive
drive.mount('/model_weights')