Dataset:
    
    The label ‘1’ means the tweet is discriminatory (racist/sexist) and the label ‘0’ means the tweet is not racist/sexist. The goal is to predict the labels for the provided test data.
    
    The columns in our dataset are:

    1. id: unique id of the tweet

    2. label: 0 or 1 (0 for a positive tweet, 1 for a negative tweet)

    3. tweet: text of the tweet

In [None]:
# Install the required libraries
# !pip3 install transformers
# !pip3 install tensorflow
# !pip3 install torch


In [None]:
# import necessary libraries
import numpy as np
import pandas as pd
from transformers import DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments
import matplotlib.pyplot as plt
import torch

In [None]:
# Data Preparation
train = pd.read_csv("train_2kmZucJ.csv")
test = pd.read_csv("test_oJQbWVk.csv")
ss = pd.read_csv("sample_submission_LnhVWA4.csv")

In [None]:
train.head()

In [None]:
test.head()

In [None]:
ss.head()

We are only interested in the label and tweet columns. The tweet is the input and the label is the output variable. The label is either 0 or 1, with 0 marking a positive tweet and 1 a negative tweet.

In [None]:
# dropping id as it is not of use
train.drop("id", axis = 1, inplace = True)
test.drop("id", axis = 1, inplace = True)

In [None]:
train.head()

In [None]:
test.head()

In [None]:
# Distribution of positive and negative labels in the data

# Plot the label distribution for the training set
ax = train.groupby('label').count().plot(kind='bar', title="Distribution of data", legend=False)
# groupby sorts the labels, so the bars are 0 (positive) followed by 1 (negative)
ax.set_xticklabels(['Positive', 'Negative'], rotation=0)
# Store the tweets and labels as lists
text, sentiment = list(train['tweet']), list(train['label'])
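
The bar chart shows the class balance visually; the same information can also be read off as numbers. The following is a minimal sketch using a short, made-up list of labels (sample_labels is hypothetical and stands in for the real label column), so the counts it prints say nothing about the actual dataset.

```python
from collections import Counter

# Hypothetical labels standing in for train['label'].tolist()
sample_labels = [0, 0, 1, 0, 1, 0, 0, 0]

# Count how many tweets carry each label
counts = Counter(sample_labels)
total = len(sample_labels)

# Share of each label, useful for spotting class imbalance
shares = {label: counts[label] / total for label in sorted(counts)}

print(counts[0], counts[1])
print(shares)
```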

We need to convert the tweet and label columns to lists so that we can pass them to the tokenizer.

In [None]:
# Converting labels and tweet to list
labels = train['label'].tolist()
tweets = train['tweet'].tolist()

Tokenization and Encoding of data:

The tokenizer we will use is DistilBertTokenizerFast. DistilBertTokenizerFast is identical to BertTokenizerFast and runs end-to-end tokenization: punctuation splitting and WordPiece.

The parameters of DistilBertTokenizerFast are:

( vocab_file, do_lower_case = True, do_basic_tokenize = True, never_split = None, unk_token = '[UNK]', sep_token = '[SEP]', pad_token = '[PAD]', cls_token = '[CLS]', mask_token = '[MASK]', tokenize_chinese_chars = True, strip_accents = None, **kwargs )

The tokenizer splits each sentence into tokens, adds the [CLS] and [SEP] tokens, and maps each token to its id.
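
The three steps described above can be illustrated with a toy sketch. This is not the real tokenizer: TOY_VOCAB and toy_encode are hypothetical names, the vocabulary has only six made-up entries, and plain whitespace splitting stands in for WordPiece.

```python
# Toy illustration only: a hypothetical six-entry vocabulary and plain
# whitespace splitting stand in for the real WordPiece tokenizer.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "good": 4, "phone": 5}

def toy_encode(sentence):
    """Lowercase, split on whitespace, add [CLS]/[SEP], and map tokens to ids."""
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    # Words missing from the vocabulary fall back to the [UNK] id
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    return tokens, ids

tokens, ids = toy_encode("Good phone indeed")
print(tokens)
print(ids)
```

The word "indeed" is not in the toy vocabulary, so it is mapped to the [UNK] id. The real tokenizer would instead try to break an unknown word into known sub-word pieces first.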

In [None]:
# Tokenization and Encoding of data
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

In [None]:
# Pad every tweet to the model's maximum input length and truncate longer ones
inputs = tokenizer(tweets, padding="max_length", truncation=True)

We give the tokenizer a list of tweets and get back the token ids, which we then feed to the model.

Padding, truncation, and the rest of the preprocessing are handled by the DistilBert tokenizer itself.
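
To make padding and truncation concrete, here is a rough sketch of what they do to a single list of token ids. The helper pad_or_truncate is hypothetical, the ids are made up, and a padding id of 0 is an assumption. It is also a simplification: it cuts long inputs off at the end, whereas the real tokenizer keeps the closing [SEP] token when it truncates.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = input_ids[:max_length]          # truncation: drop anything past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)         # how many padding positions are needed
    # padding: fill the remainder with pad_id and mark it with 0 in the mask
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# Made-up ids for a short and a long input
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence comes out with the same length, which is what allows the encodings to be stacked into a single tensor later on.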

In [None]:
# Wrap the encodings and labels in a PyTorch Dataset that returns tensors
import torch
class twitterDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        # Take the idx-th row of every encoding field and convert it to a tensor
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

In [None]:
train_dataset = twitterDataset(inputs, labels)

In [None]:
#print(train_dataset.__getitem__(2))

'''
    The output is a dictionary containing 3 key-value pairs:

    Input ids: a tensor of integers, where each integer is the id of a token from the original sentence.

    Attention mask: a tensor of 1s and 0s, where 1 marks a real token and 0 marks padding.

    Labels: the target variable
'''
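
The structure of one dataset item can be sketched without PyTorch. In the sketch below, plain lists stand in for tensors, and toy_encodings, toy_labels and get_item are hypothetical names with made-up values; it only mirrors how an item is assembled from the encodings and the labels.

```python
# Made-up encodings for two already padded tweets (plain lists instead of tensors)
toy_encodings = {
    "input_ids": [[2, 4, 3, 0], [2, 5, 3, 0]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 0]],
}
toy_labels = [0, 1]

def get_item(idx):
    """Collect the idx-th row of every encoding field and attach its label."""
    item = {key: val[idx] for key, val in toy_encodings.items()}
    item["labels"] = toy_labels[idx]
    return item

print(get_item(1))
```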

In [None]:
# Model Building
# Model: DistilBERT with a sequence classification head for the two labels
from transformers import DistilBertForSequenceClassification
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

In [None]:
torch.cuda.is_available()

In [None]:
# device = torch.device("cuda:1")
# device

In [None]:
#conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia

In [None]:
# Enable the GPU if it is available by uncommenting the next line; the CPU is used here
#device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device = torch.device("cpu")
model.to(device)

In [None]:
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir ='./results',
    num_train_epochs =2,
    per_device_eval_batch_size =64,
    warmup_steps = 500,
    weight_decay = 0.01,
    logging_dir ='./logs',
    logging_steps = 10
)
trainer = Trainer(
    model = model,
    args= training_args,
    train_dataset= train_dataset
)
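
The warmup_steps = 500 setting means that the learning rate is not applied at full strength from the first step. The sketch below shows the shape of such a schedule, assuming the Trainer defaults of a linear scheduler and a base learning rate of 5e-5, neither of which is set explicitly above. The function linear_warmup_lr is a hypothetical helper, and total_steps = 2000 is a made-up value, since the real number depends on the dataset size and batch size.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=2000):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 250, 500, 1250, 2000):
    print(step, linear_warmup_lr(step))
```

If a run has fewer total steps than warmup_steps, the learning rate never reaches its full value, which is worth checking on small datasets.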

In [None]:
trainer.train()

In [None]:
torch.cuda.empty_cache()

### Checking the model on test data and finding the polarity of the sentiment:

In [None]:
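# to_check_result returns the prediction as 0 or 1
# 0 being positive and 1 being negative

import numpy as np
def to_check_result(test_encoding):
    model.eval()  # switch off dropout so that predictions are deterministic
    # Use the model's own device, since the Trainer may have moved it to the GPU
    input_ids = torch.tensor(test_encoding["input_ids"]).to(model.device)
    attention_mask = torch.tensor(test_encoding["attention_mask"]).to(model.device)
    with torch.no_grad():
        outputs = model(input_ids.unsqueeze(0), attention_mask.unsqueeze(0))
    # outputs[0] holds the logits; the index of the largest logit is the predicted label
    y = np.argmax(outputs[0].to('cpu').numpy())
    return y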
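
The function above takes the argmax of the two logits directly. As a small sketch of why that is enough, the example below converts made-up logits to probabilities with a softmax and shows that the largest logit and the largest probability sit at the same index. The helper predict_label is hypothetical, and the softmax step is an addition for illustration that the notebook itself does not perform.

```python
import math

def predict_label(logits):
    """Return (label, probabilities) for the logits of one tweet."""
    top = max(logits)
    # Subtract the largest logit before exponentiating for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the largest logit, which is also the index of the largest probability
    label = max(range(len(logits)), key=lambda i: logits[i])
    return label, probs

label, probs = predict_label([1.2, -0.7])  # made-up logits
print(label, probs)
```

The probabilities are only needed when a confidence score is wanted alongside the label.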


In [None]:
# Tokenize each test tweet and pass it to the model
l2 = []  # predicted label (0 or 1) for every tweet in the test data
for tweet in test['tweet']:
    test_encoding = tokenizer(tweet, truncation=True, padding=True)
    op = to_check_result(test_encoding)
    l2.append(op)