# **Overview of the notebook**

Given the title and abstract of a research article, we have to predict the subject category (Computer Science, Physics, Mathematics, Statistics, Quantitative Biology, Quantitative Finance) to which the text belongs. It is a multilabel classification problem on imbalanced data.

In this notebook I try several NLP approaches: TF-IDF, Word2Vec, dimensionality reduction of word vectors, transformers, and Doc2Vec.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory

# For example, running this (by clicking run or pressing Shift+Enter) will
#list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Import necessary libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spacy
from time import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
import multiprocessing

import gensim
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
cores=multiprocessing.cpu_count()
cores
import logging  # Setting up the loggings to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

from collections import defaultdict
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Flatten,Dropout
import tensorflow_addons as tfa
from sklearn.preprocessing import LabelEncoder

import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from sklearn.manifold import TSNE




**Reading Data Files**

In [None]:
train_df=pd.read_csv("../input/topic-modeling-for-research-articles/train.csv")
test_df=pd.read_csv("../input/topic-modeling-for-research-articles/test.csv")

In [None]:
train_df.head()

In [None]:
train_df.shape

In [None]:
label_totals=train_df.iloc[:,3:].sum()       # articles per category (columns after ID, TITLE, ABSTRACT)
rowsums=train_df.iloc[:,3:].sum(axis=1)      # number of labels per article
no_label_count=int((rowsums==0).sum())       # articles that carry no label at all

print("Total number of articles = ",len(train_df))
print("Total number of articles without label = ",no_label_count)
print("Total labels = ",label_totals.sum())

In [None]:
print("Check for missing values in Train dataset")
print(train_df.isnull().sum().sum())
print("Check for missing values in Test dataset")
null_check=test_df.isnull().sum()
print(null_check)

Let's now check the data types of the columns to make sure that each column has the type it should have. (In some datasets a column holds float or integer values but is stored with dtype object, in which case we need to convert it.)

In [None]:
train_df.dtypes


Now let's check how many abstracts belong to each category.

In [None]:
categories=["Computer Science","Physics","Mathematics","Statistics","Quantitative Biology","Quantitative Finance"]
category_count=[]
for i in categories:
  category_count.append(train_df[i].sum())

In [None]:
category_count

In [None]:
plt.figure(figsize=(15,5))
plt.bar(categories,category_count)



From the above plot it is clear that "Quantitative Biology" and "Quantitative Finance" have far fewer articles than the other categories, which means the dataset is imbalanced.

As the dataset is imbalanced, we can apply resampling techniques to balance it. The dataset is small, so we can try oversampling these two classes.

We will implement oversampling later; first we will try to build a basic classification model.
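
As a rough illustration of what random oversampling does, here is a minimal sketch using only the standard library. The function name `random_oversample` and the toy data are assumptions made for this example and are not part of the notebook; in practice the duplication should be applied to the training split only, so that copies of a row never end up in the evaluation split.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=42):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(rows) for rows in by_class.values())
    out_texts, out_labels = [], []
    for label, rows in by_class.items():
        # draw extra copies (with replacement) from the existing rows of this class
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for text in rows + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Toy data: hypothetical rows, not taken from the competition files.
toy_texts = ["a", "b", "c", "d", "e"]
toy_labels = ["Physics", "Physics", "Physics",
              "Quantitative Finance", "Quantitative Biology"]
balanced_texts, balanced_labels = random_oversample(toy_texts, toy_labels)
print(Counter(balanced_labels))
```

After the call every class has the same number of rows as the largest one; the original rows are all kept and only the minority classes gain duplicates.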

Now let's plot the total word count for each category.

In [None]:
total_word_count_in_each_category=[]
for i in categories:
  abstracts=train_df.loc[train_df[i]==1,'ABSTRACT']   # abstracts tagged with category i
  count=abstracts.str.split().str.len().sum()         # .str.len() alone would count characters, not words
  total_word_count_in_each_category.append(count)

In [None]:
plt.figure(figsize=(15,5))
plt.bar(categories,total_word_count_in_each_category)

The word counts are in almost the same proportion as the number of texts in each category. The only difference is that Statistics has more words than Mathematics, even though Mathematics has more articles.







Now let's analyze the average abstract length, in words, for each category.

In [None]:
avg_abstract_len_for_each_category=[]
for i in range(6):
  avg_abstract_len_for_each_category.append(total_word_count_in_each_category[i]/category_count[i])

In [None]:


plt.figure(figsize=(15,5))
plt.bar(categories,avg_abstract_len_for_each_category)

From the above plot it is clear that Quantitative Biology abstracts are the longest and Mathematics abstracts are the shortest.

For each category, how many abstracts contain numeric values? (I want to check whether the share of abstracts that contain digits differs significantly between Computer Science, Mathematics, Statistics, and so on.)

In [None]:
count_numeric_contained_texts=dict()
for category in categories:
    count_numeric_contained_texts[category]=0
    

In [None]:
count_numeric_contained_texts

In [None]:
import re 

In [None]:
for category in categories:
    for text in train_df[train_df[category]==1]["ABSTRACT"]:
        if re.search(r"\d",text):  # raw string avoids the invalid-escape warning; one match is enough
            count_numeric_contained_texts[category]+=1 

In [None]:
for i in range(len(categories)):
    count_numeric_contained_texts[categories[i]]=count_numeric_contained_texts[categories[i]]/category_count[i]

In [None]:
count_numeric_contained_texts

In [None]:
plt.figure(figsize=(15,5))
plt.bar(categories,count_numeric_contained_texts.values())

Roughly 40% of the texts in every category contain numeric values, except Physics, and even Physics does not differ significantly. Numeric values therefore do not help to separate the categories, so we will simply remove them in the text preprocessing step.

Let's concatenate the title and the abstract into one longer text.

In [None]:
train_df["text"]=train_df["TITLE"]+" "+train_df["ABSTRACT"]

Dropping the TITLE and ABSTRACT columns.

In [None]:
train_df.drop(["TITLE","ABSTRACT"],axis=1,inplace=True)

In [None]:
train_df.head()

Let's write a train/test split function that we will need later.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
def split(X,y,test_size):
    X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=test_size,random_state=42)
    return (X_train,X_test,y_train,y_test)



# **Cleaning the text**

Cleaning the text generally consists of two steps:

1. Normalizing the text (lowercasing all words), removing punctuation, removing unicode characters, removing stopwords, etc.

2. Applying stemming or lemmatization to the words.

I will apply lemmatization, because lemmatized words carry more information than stemmed words, although stemming is faster than lemmatization.

There are two libraries available for lemmatization: 1. NLTK 2. spaCy

I will try both and then compare the results.
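
The normalization step (step 1 above) can be sketched with the standard library alone. The stopword set below is a tiny illustrative assumption; NLTK and spaCy ship much larger lists, and the lemmatization of step 2 is not covered by this sketch.

```python
import re

# Tiny illustrative stopword list (an assumption made for this example).
TOY_STOPWORDS = {"the", "a", "an", "of", "is", "in", "and", "for", "we"}

def normalize(text):
    """Lowercase, replace every run of non-letters with a space,
    then drop stopwords and one-character tokens."""
    text = re.sub(r"[^a-zA-Z]+", " ", text).lower()
    return [tok for tok in text.split()
            if tok not in TOY_STOPWORDS and len(tok) > 1]

tokens = normalize("We propose a 3-layer CNN for the\nclassification of X-ray images.")
print(tokens)
# ['propose', 'layer', 'cnn', 'classification', 'ray', 'images']
```

Note how the digit, the hyphens, and the line break all disappear, and how "X-ray" loses its one-letter half; that is the price of the aggressive letters-only rule.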

In [None]:
# cleaned_title1=[] #after cleaning the each row's title , it will be appended in this list.
# for i in range(train_df.shape[0]):#iterating through the whole dataset.
#     title=re.sub("\n"," ",train_df.text[i])#text contains "/n" each time when line changes so we have to remove them 
#     title=re.sub("[^a-zA-Z]+ "," ",title)# removing all characters other than a-z, A-Z, 0-9 and space
#     title=title.lower()#changing the texts into lowercase.
#     title=title.split()#it will give a list of words
#     title=[lemmatizer.lemmatize(word) for word in title if not word in stopwords.words("english")]#removing stopords and applying lemmatization.
#     title=" ".join(title)#.join() will return string of lemmatized words.
#     cleaned_title1.append(title)#cleaned text is appended into the list.


In [None]:
# Saving this n_corpus to a csv file so that we don't have run above cell every time because it is taking too much time
#import pandas as pd
# Create a local file to upload.
# df_ = pd.DataFrame(cleaned_title1)
# df_.to_csv("cleaned_text1.csv")

In [None]:
# df_.loc[0,0],train_df.text[0]


We use the spaCy library to lemmatize the words and a regular expression to clean the text (removing numeric characters and stopwords).

In [None]:
nlp=spacy.load("en_core_web_sm",disable=['ner','parser']) # disabling Named Entity Recognition and the parser for speed

In [None]:
brief_cleaning=(re.sub("[^a-zA-Z]+"," ",str(text)).lower() for text in train_df["text"])

In [None]:
def cleaning(doc):
    txt=[token.lemma_ for token in doc if not token.is_stop]
    txt=[txt for txt in txt if len(txt)>1]
    return " ".join(txt)

Taking advantage of spaCy's .pipe() method to speed up the cleaning process:

In [None]:
t = time()

# n_threads has been deprecated since spaCy 2.1 and is no longer accepted in spaCy 3, so it is dropped here
txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=2000)]

print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
train_df.text[9]




We will store the cleaned text in a list.

In [None]:
# cleaned_title2=[]
# for i in range(train_df.shape[0]):
#     title1=train_df.text[i]
#     title1=title1.lower().strip()
#     title1=re.sub("\n"," ",title1)
#     title1=re.sub("[^a-zA-Z0-9 +]"," ",title1)
#     title1=nlp(title1)
#     title1=[token.lemma_.strip() for token in title1 if (not token.is_stop) and (len(token)>1)]
#     title1=" ".join(title1)
#     title1=re.sub(" +"," ",title1)
#     cleaned_title2.append(title1)

In [None]:
# cleaned_title1[0]

In [None]:
# len(cleaned_title2)

spaCy lemmatizes better than NLTK, so we will use the spaCy-cleaned text (cleaned_title2, saved as cleaned_text_2.csv) to build our model.

In [None]:
# Saving this cleaned text to a csv file so that we don't have run above cell every time because it is time taking
# import pandas as pd
# # Create a local file to upload.
# df1 = pd.DataFrame(cleaned_title2)
# df1.to_csv("cleaned_text_2.csv")

In [None]:
cleaned_text=pd.read_csv("../input/cleaned-text-s/cleaned_text_2.csv")
cleaned_text=list(cleaned_text['0'].values)
cleaned_text[:3]

In [None]:
len(txt)

To create the vocabulary, we will join all the texts.

In [None]:
vocabulary=" ".join(txt)
x=set(vocabulary.split())
len(x)

In [None]:
train_df.head()

There are 45593 unique words in total across all the articles, and only those will be used in our model.

In [None]:
train_df['cleaned_text']=txt
train_df.head()

In [None]:
train_df.cleaned_text[0]

In [None]:
X_train,X_test,y_train,y_test=split(train_df.loc[:,"cleaned_text"],train_df.loc[:,categories],0.2)

In [None]:

X_train.shape,X_test.shape,y_train.shape,y_test.shape

In [None]:
y_train

In [None]:

y_test=y_test.idxmax(axis='columns')
y_train=y_train.idxmax(axis='columns')

In [None]:
y_test

In [None]:

le=LabelEncoder()
y_train=le.fit_transform(y_train)
y_test=le.transform(y_test)

In [None]:
y_train,y_test






# **Converting text into numerical features using TF-IDF**
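
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF under the same options the vectorizer below switches on: sublinear term frequency (`1 + ln(tf)`), smoothed inverse document frequency (`ln((1 + n) / (1 + df)) + 1`), and L2 normalisation of each document vector. The helper name `toy_tfidf` and the three toy documents are assumptions for illustration; this is not scikit-learn's implementation, only the same formulas applied by hand.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """TF-IDF for tokenised documents with sublinear tf, smoothed idf
    and L2-normalised document vectors."""
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return idf, vectors

toy_docs = [
    "graph neural network".split(),
    "neural network training".split(),
    "quantum graph state".split(),
]
idf, vectors = toy_tfidf(toy_docs)
for term, value in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} idf={value:.3f}")
```

Terms that occur in two of the three documents ("graph", "neural", "network") get a lower idf than terms that occur in only one ("quantum", "state", "training"), which is exactly the effect that down-weights words shared across many abstracts.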

In [None]:
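# boolean options are passed as True rather than 1, since recent scikit-learn versions validate parameter types
tfv=TfidfVectorizer(min_df=3,max_features=30000,strip_accents="unicode",analyzer="word",token_pattern=r"\w{1,}",ngram_range=(1,3),use_idf=True,smooth_idf=True,sublinear_tf=True,stop_words="english")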

In [None]:
def tfidf_vectors(X_train,X_test):
    tfv.fit(list(X_train))  # fit on the training texts only, so no test-set statistics leak into the features
    xtrain_tfv=tfv.transform(X_train)
    xtest=tfv.transform(X_test)
    return xtrain_tfv,xtest
xtrain_tfv,xtest=tfidf_vectors(X_train,X_test)

In [None]:
xtrain_tfv.shape

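The target originally has 6 binary columns. They were already collapsed above into a single label column (idxmax keeps the first positive category of each article, and LabelEncoder turns the names into integers), so we reuse those labels here.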
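
The collapse from six indicator columns to one label can be illustrated without pandas. The sketch below assumes the column order used in `categories`; the helper `first_positive_label` is a hypothetical stand-in for `idxmax(axis='columns')`, and the final step mimics the alphabetical ordering that `LabelEncoder` applies to class names.

```python
CATEGORIES = ["Computer Science", "Physics", "Mathematics", "Statistics",
              "Quantitative Biology", "Quantitative Finance"]

def first_positive_label(row):
    """Return the name of the first column holding the row maximum,
    which is what an idxmax over the columns returns."""
    best = max(range(len(row)), key=lambda i: row[i])  # max() keeps the first of tied values
    return CATEGORIES[best]

rows = [
    [1, 0, 0, 1, 0, 0],  # tagged Computer Science and Statistics
    [0, 0, 1, 0, 0, 0],  # tagged Mathematics only
    [0, 0, 0, 1, 0, 0],  # tagged Statistics only
]
labels = [first_positive_label(r) for r in rows]
print(labels)

# Integer codes follow the alphabetically sorted class names, not the column order.
sorted_classes = sorted(CATEGORIES)
encoded = [sorted_classes.index(label) for label in labels]
print(encoded)
```

Two things are worth noticing: the article tagged with both Computer Science and Statistics keeps only Computer Science, so the multilabel problem becomes a single-label one, and Statistics receives code 5 rather than 3 because the codes follow alphabetical order.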

In [None]:
y_train_new=y_train
y_test_new=y_test

**Let's apply a simple LogisticRegression model to classify.**

In [None]:
def clf_(xtrain_tfv,y_train_new,C):
    clf=LogisticRegression(C=C,solver="sag")
    clf.fit(xtrain_tfv,y_train_new)
    return clf
clf=clf_(xtrain_tfv,y_train_new,1.0)

Apply grid search to optimize the hyperparameters.

In [None]:


# from sklearn.model_selection import GridSearchCV
# params={
#     'C':[0.1,0.3,0.5,0.8,1],
#     'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
#     'penalty':['l1', 'l2', 'elasticnet', 'none']
# }
# gs_knn = GridSearchCV(LogisticRegression(),
#                       param_grid=params,
#                       scoring='accuracy',
#                       cv=5)

# gs_model = gs_knn.fit(X_train_vect_avg, y_train.values.ravel())

In [None]:
#gs_model.best_params_

In [None]:
clf=clf_(xtrain_tfv,y_train_new,1.2)
clf

In [None]:
train_preds=clf.predict(xtrain_tfv)
test_preds=clf.predict(xtest)
train_preds


**Evaluation metric**

Our dataset is imbalanced and all the classes are equally important, so the macro-averaged F1 score is the most suitable metric here, and the confusion matrix gives a good overall picture of the predictions for every class.
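
To see why macro F1 is preferred over accuracy on imbalanced data, consider this small hand-rolled sketch. The function `macro_f1` and the toy labels are assumptions for illustration; the notebook itself uses `sklearn.metrics.f1_score` with `average="macro"`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy case: 8 majority-class rows, 2 minority-class rows,
# and a model that always predicts the majority class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1(y_true, y_pred):.3f}")
# accuracy = 0.800, macro F1 = 0.444
```

The always-majority model looks respectable on accuracy, but its F1 for the minority class is 0, and because macro averaging gives every class the same weight the overall score drops to about 0.44.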

In [None]:
print("train",f1_score(y_train_new,clf.predict(xtrain_tfv),average='macro'))
print("test",f1_score(y_test_new,clf.predict(xtest),average="macro"))

We can clearly see that the model is failing to generalize to the test data, so we need to try some more complex models.

In [None]:
print("train_accuracy",accuracy_score(y_train_new,clf.predict(xtrain_tfv)))
print("test_accuracy",accuracy_score(y_test_new,clf.predict(xtest)))

In [None]:
def plot_confusion_matrix(y,y_preds):
    # LabelEncoder sorts the class names alphabetically, so the axis labels are
    # taken from le.classes_ instead of the original column order.
    labels=list(le.classes_)
    c_matrix=confusion_matrix(y,y_preds,labels=list(range(len(labels))))
    c_matrix=pd.DataFrame(c_matrix,columns=labels,index=labels)

    fig,ax=plt.subplots(figsize=(12,12))
    sns.set(font_scale=1.4)
    # each cell is shown as a share of all samples
    sns.heatmap(c_matrix/c_matrix.values.sum(),fmt="0.2%",annot=True,cmap="Blues",ax=ax)
    ax.set_title("Confusion matrix ",fontsize=26)
    ax.set_xlabel("Predicted",fontsize=26)
    ax.set_ylabel("Actual",fontsize=26)


In [None]:
plot_confusion_matrix(y_train_new,clf.predict(xtrain_tfv))

From the confusion matrix, it is clear that our machine learning model is not able to classify two classes accurately:

1. Computer Science: it has the largest number of texts, so a model should classify it more accurately than the others; instead its accuracy is low, which suggests that the text in this category is noisy.

2. Quantitative Finance: it has the smallest number of texts, which is why our model does not classify it accurately; for this class we can apply oversampling.


Let's apply other techniques first; afterwards we will modify our text and apply oversampling.
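
Besides duplicating rows, a lighter way to counter imbalance is to weight each class by the inverse of its frequency, which is the idea behind the `class_weights` used in the PyTorch cells further down. The sketch below is a minimal, standard-library illustration; the function name and the toy label counts are assumptions.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by 1 / (number of rows in that class)."""
    counts = Counter(labels)
    return {label: 1.0 / count for label, count in counts.items()}

# Toy labels: class 0 is common, class 2 is rare.
toy_class_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = inverse_frequency_weights(toy_class_labels)

# Per-row weights, as a weighted sampler would receive them.
sample_weights = [weights[label] for label in toy_class_labels]

# Total weight per class: equal for every class, so each class is
# drawn about equally often when sampling proportionally to the weights.
per_class_mass = Counter()
for label, w in zip(toy_class_labels, sample_weights):
    per_class_mass[label] += w
print(weights)
print(dict(per_class_mass))
```

Because every class ends up with the same total weight, a sampler that draws rows in proportion to these weights sees the rare class about as often as the common one.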

In [None]:
plot_confusion_matrix(y_test_new,clf.predict(xtest))

From the confusion matrix it is very clear that the model finds it difficult to differentiate between Mathematics and Statistics. Because Quantitative Finance has so few rows, the model did not learn it properly, so accuracy for this class is poor.

Using a pipeline, we can organise this task better, and it becomes very easy to plug in a new machine learning model (let's apply multinomial Naive Bayes) with the pipeline.

In [None]:

tfidf_pipeline=Pipeline([("tfidf",TfidfVectorizer(min_df=3,max_features=30000,strip_accents="unicode",analyzer="word",token_pattern=r"\w{1,}",ngram_range=(1,2),use_idf=True,smooth_idf=True,sublinear_tf=True,stop_words="english"))])

tfidf_pipeline.fit_transform(X_train)
tfidf_pipeline.transform(X_test)

lg_regression_pipeline=Pipeline([("tfidf_pipeline",tfidf_pipeline),
                        ("clf",LogisticRegression(C=1,solver="sag"))])



lg_regression_pipeline.fit(X_train,y_train_new)

lg_regression_pipeline.predict(X_train)

In [None]:
print("macro_f1_score_train",f1_score(y_train_new,lg_regression_pipeline.predict(X_train),average="macro"))
print("macro_f1_score_test",f1_score(y_test_new,lg_regression_pipeline.predict(X_test),average='macro'))

In [None]:
plot_confusion_matrix(y_train_new,lg_regression_pipeline.predict(X_train))


In [None]:
plot_confusion_matrix(y_test_new,lg_regression_pipeline.predict(X_test))

In [None]:
mnb_regression_pipeline=Pipeline([("tfidf_pipeline",tfidf_pipeline),
                        ("mnb",MultinomialNB())])
mnb_regression_pipeline.fit(X_train,y_train_new)





In [None]:
print("macro_f1_score_train",f1_score(y_train_new,mnb_regression_pipeline.predict(X_train),average="macro"))
print("macro_f1_score_test",f1_score(y_test_new,mnb_regression_pipeline.predict(X_test),average='macro'))

We only need to copy the code and change the model, and a new model is applied to the data.

# **Converting words into vectors using gensim**

In [None]:
# gensim 3.x argument names; in gensim >= 4 `size` is called `vector_size`
w2v_model=Word2Vec(min_count=2,
                  window=2,
                  size=200,
                  sample=6e-5,
                  alpha=0.03,
                  min_alpha=0.0007,
                  negative=20)

In [None]:
sent=[text.split() for text in train_df["cleaned_text"]]
sent[0]

In [None]:
phrases=Phrases(sent,min_count=20,progress_per=2000)

In [None]:
bigram=Phraser(phrases)
bigram

In [None]:
sentences=bigram[sent]

In [None]:
sentences

In [None]:
word_freq = defaultdict(int)
for sentence in sentences:   # separate loop name, so the `sent` list above is not overwritten
    for word in sentence:
        word_freq[word] += 1
len(word_freq)

In [None]:
sorted(word_freq,key=word_freq.get,reverse=True)[:10]

In [None]:
sorted(word_freq, key=word_freq.get, reverse=False)[:10]

In [None]:
t = time()

w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
t = time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
w2v_model.wv.most_similar("computer")

In [None]:
w2v_model.wv["biology"]

In [None]:
w2v_model.wv.most_similar(positive=["biology"])

In [None]:
# gensim 3.x attribute names; in gensim >= 4 use wv.index_to_key and wv.vectors
w2v=dict(zip(w2v_model.wv.index2word,w2v_model.wv.syn0))

In [None]:
len(w2v['model'])

In [None]:
X_train

In [None]:
train_mean_word_vectors=[]
for i in X_train:
    words=i.split()
    # fall back to a single zero vector so that an empty text still yields a 200-dim vector
    train_mean_word_vectors.append(np.mean([w2v[word] for word in words if word in w2v]
                      or [np.zeros(200)],axis=0))

In [None]:
test_mean_word_vectors=[]
for i in X_test:
    words=i.split()
    # fall back to a single zero vector so that an empty text still yields a 200-dim vector
    test_mean_word_vectors.append(np.mean([w2v[word] for word in words if word in w2v]
                      or [np.zeros(200)],axis=0))

In [None]:
train_mean_word_vectors[0],test_mean_word_vectors[0]

In [None]:
df_mean_word_vectors=pd.DataFrame(train_mean_word_vectors,columns=range(1,201))
df_mean_word_vectors_test=pd.DataFrame(test_mean_word_vectors,columns=range(1,201))

In [None]:
df_mean_word_vectors

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler


In [None]:
y_train_new

In [None]:
# applying labelencoder
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
y_train=le.fit_transform(y_train_new)
y_train

In [None]:
y_test_new

In [None]:
y_test=le.transform(y_test_new)
y_test

In [None]:
X_train=df_mean_word_vectors
X_test=df_mean_word_vectors_test
X_train=X_train.values
X_train

In [None]:
X_test=X_test.values
X_test

In [None]:
class ClassifierDataset(Dataset):
    
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
        
    def __len__ (self):
        return len(self.X_data)


train_dataset = ClassifierDataset(torch.from_numpy(X_train).float(), torch.from_numpy(y_train).long())
# val_dataset = ClassifierDataset(torch.from_numpy(X_val).float(), torch.from_numpy(y_val).long())
test_dataset = ClassifierDataset(torch.from_numpy(X_test).float(), torch.from_numpy(y_test).long())

In [None]:
target_list = []
for _, t in train_dataset:
    target_list.append(t)
    
target_list = torch.tensor(target_list)
target_list

In [None]:
def get_class_distribution(obj):
    count_dict = {
        0: 0,
        1: 0,
        2: 0,
        3: 0,
        4: 0,
        5: 0,
    }
    
    for i in obj:
        if i == 0: 
            count_dict[0] += 1
        elif i == 1: 
            count_dict[1] += 1
        elif i == 2: 
            count_dict[2] += 1
        elif i == 3: 
            count_dict[3] += 1
        elif i == 4: 
            count_dict[4] += 1  
        elif i == 5: 
            count_dict[5] += 1              
        else:
            print("Check classes.")
            
    return count_dict

In [None]:
class_count = [i for i in get_class_distribution(y_train).values()]
class_weights = 1./torch.tensor(class_count, dtype=torch.float) 
print(class_weights)

In [None]:
class_weights_all = class_weights[target_list]
class_weights_all

In [None]:
weighted_sampler = WeightedRandomSampler(
    weights=class_weights_all,
    num_samples=len(class_weights_all),
    replacement=True
)

In [None]:
EPOCHS = 50
BATCH_SIZE = 16
LEARNING_RATE = 0.0007
NUM_FEATURES = len(df_mean_word_vectors.columns)
NUM_CLASSES = 6

In [None]:
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=BATCH_SIZE,
                          sampler=weighted_sampler
)
# val_loader = DataLoader(dataset=val_dataset, batch_size=1)
test_loader = DataLoader(dataset=test_dataset, batch_size=1)

In [None]:
class MulticlassClassification(nn.Module):
    def __init__(self, num_feature, num_class):
        super(MulticlassClassification, self).__init__()
        
        self.layer_1 = nn.Linear(num_feature, 512)
        self.layer_2 = nn.Linear(512, 128)
        self.layer_3 = nn.Linear(128, 64)
        self.layer_out = nn.Linear(64, num_class) 
        
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.2)
        self.batchnorm1 = nn.BatchNorm1d(512)
        self.batchnorm2 = nn.BatchNorm1d(128)
        self.batchnorm3 = nn.BatchNorm1d(64)
        
    def forward(self, x):
        x = self.layer_1(x)
        x = self.batchnorm1(x)
        x = self.relu(x)
        
        x = self.layer_2(x)
        x = self.batchnorm2(x)
        x = self.relu(x)
        x = self.dropout(x)
        
        x = self.layer_3(x)
        x = self.batchnorm3(x)
        x = self.relu(x)
        x = self.dropout(x)
        
        x = self.layer_out(x)
        
        return x

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

In [None]:
model = MulticlassClassification(num_feature = NUM_FEATURES, num_class=NUM_CLASSES)
model.to(device)

criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
print(model)


In [None]:
def multi_acc(y_pred, y_test):
    y_pred_softmax = torch.log_softmax(y_pred, dim = 1)
    _, y_pred_tags = torch.max(y_pred_softmax, dim = 1)    
    
    correct_pred = (y_pred_tags == y_test).float()
    acc = correct_pred.sum() / len(correct_pred)
    
    acc = torch.round(acc * 100)
    
    return acc

In [None]:
accuracy_stats = {
    'train': [],
    "val": []
}
loss_stats = {
    'train': [],
    "val": []
}

In [None]:
val_loader=test_loader

In [None]:
# pip install torchmetrics

In [None]:
# from torchmetrics.classification import MulticlassF1Score
# metric = MulticlassF1Score(num_classes=6)

In [None]:
# def metric(y_train_pred,y_train_batch):
#     print(y_train_pred,y_train_batch)
#     for i in range(len(y_train_pred)):
#         if y_train_pred[i]==y_ytrain_batch[i]:
#             acc+=1
#     return acc/len(y_train_pred)

In [None]:
from tqdm.notebook import tqdm
print("Begin training.")
for e in tqdm(range(1, EPOCHS+1)):
    # TRAINING
    train_epoch_loss = 0
    train_epoch_acc = 0
    model.train()
    for X_train_batch, y_train_batch in train_loader:
        X_train_batch, y_train_batch = X_train_batch.to(device), y_train_batch.to(device)
        optimizer.zero_grad()

        y_train_pred = model(X_train_batch)
        
        train_loss = criterion(y_train_pred, y_train_batch)
        train_acc = multi_acc(y_train_pred, y_train_batch)
        
        train_loss.backward()
        optimizer.step()
        
        train_epoch_loss += train_loss.item()
        train_epoch_acc += train_acc.item()
        
        
    # VALIDATION    
    with torch.no_grad():
        
        val_epoch_loss = 0
        val_epoch_acc = 0
        
        model.eval()
        for X_val_batch, y_val_batch in val_loader:
            X_val_batch, y_val_batch = X_val_batch.to(device), y_val_batch.to(device)
            
            y_val_pred = model(X_val_batch)
                        
            val_loss = criterion(y_val_pred, y_val_batch)
            val_acc = multi_acc(y_val_pred, y_val_batch)
            
            val_epoch_loss += val_loss.item()
            val_epoch_acc += val_acc.item()
    loss_stats['train'].append(train_epoch_loss/len(train_loader))
    loss_stats['val'].append(val_epoch_loss/len(val_loader))
    accuracy_stats['train'].append(train_epoch_acc/len(train_loader))
    accuracy_stats['val'].append(val_epoch_acc/len(val_loader))


    print(f'Epoch {e+0:03}: | Train Loss: {train_epoch_loss/len(train_loader):.5f} | Val Loss: {val_epoch_loss/len(val_loader):.5f} | Train Acc: {train_epoch_acc/len(train_loader):.3f}| Val Acc: {val_epoch_acc/len(val_loader):.3f}')

In [None]:
# Create dataframes
train_val_acc_df = pd.DataFrame.from_dict(accuracy_stats).reset_index().melt(id_vars=['index']).rename(columns={"index":"epochs"})
train_val_loss_df = pd.DataFrame.from_dict(loss_stats).reset_index().melt(id_vars=['index']).rename(columns={"index":"epochs"})
# Plot the dataframes
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,7))
sns.lineplot(data=train_val_acc_df, x = "epochs", y="value", hue="variable",  ax=axes[0]).set_title('Train-Val Accuracy/Epoch')
sns.lineplot(data=train_val_loss_df, x = "epochs", y="value", hue="variable", ax=axes[1]).set_title('Train-Val Loss/Epoch')

# **DOC2VEC**

In [None]:
X_train,X_test,y_train,y_test=split(train_df.loc[:,"cleaned_text"],train_df.loc[:,categories],0.2)


In [None]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape


In [None]:
y_test=y_test.idxmax(axis='columns')
y_train=y_train.idxmax(axis='columns')


In [None]:
le=LabelEncoder()
y_train=le.fit_transform(y_train)
y_test=le.transform(y_test)

In [None]:
list_of_list_of_words_train=[sent.split(" ") for sent in X_train]
list_of_list_of_words_test=[sent.split(" ") for sent in X_test]



In [None]:
def tagged_document(list_of_list_of_words):
    for i,list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words,[i])
training_data=list(tagged_document(list_of_list_of_words_train))

In [None]:
model=gensim.models.doc2vec.Doc2Vec(vector_size=100,min_count=2,epochs=30)


In [None]:
model.build_vocab(training_data)


In [None]:
model.train(training_data, total_examples=model.corpus_count, epochs=model.epochs)

In [None]:
v=model.infer_vector
v

In [None]:
def change_into_vector(doc):
    return model.infer_vector(doc)

In [None]:
train_doc_vectors=[]
for doc in list_of_list_of_words_train:
    train_doc_vectors.append(model.infer_vector(doc))

In [None]:
train_doc_vectors=np.array(train_doc_vectors)

In [None]:
train_doc_vectors=pd.DataFrame(train_doc_vectors,columns=range(1,101))

In [None]:
test_doc_vectors=[]
for doc in list_of_list_of_words_test:
    test_doc_vectors.append(model.infer_vector(doc))

In [None]:

test_doc_vectors=np.array(test_doc_vectors)
test_doc_vectors

In [None]:
test_doc_vectors=pd.DataFrame(test_doc_vectors,columns=range(1,101))

In [None]:

test_doc_vectors.head()

In [None]:
from torch.utils.data import Dataset, DataLoader
class Data(Dataset):
    def __init__(self,X_train,Y_train):
        self.X=torch.from_numpy(X_train).float()
        self.Y=torch.from_numpy(Y_train).long()
        self.len=self.X.shape[0]
    def __getitem__(self,index):      
        return self.X[index], self.Y[index]
    def __len__(self):
        return self.len

In [None]:
train_doc_vectors=np.array(train_doc_vectors)
test_doc_vectors=np.array(test_doc_vectors)


In [None]:
train_data=Data(train_doc_vectors,y_train)
test_data=Data(test_doc_vectors,y_test)

In [None]:
train_data.X[0:5],train_data.Y[0:5]

In [None]:
import torch.nn as nn
class NN(nn.Module):
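    def __init__(self,input_layer,input_layer1,Hidden_layer,Hidden_layer2,output):
        super(NN,self).__init__()
        self.fc1=nn.Linear(input_layer,Hidden_layer)
        # the third positional argument of nn.LSTM is num_layers (input_layer1 here)
        self.lstm=nn.LSTM(Hidden_layer,Hidden_layer2,input_layer1,batch_first=True)
        self.fc3=nn.Linear(Hidden_layer2,output)
        self.relu=nn.ReLU()
        self.dropout=nn.Dropout(0.25)

    def forward(self,x):
        x=self.fc1(x)
        x=self.relu(x)
        x=self.dropout(x)
        x=x.unsqueeze(1)      # (batch, features) -> (batch, seq_len=1, features) for the LSTM
        x,_=self.lstm(x)      # nn.LSTM returns (output, (h_n, c_n)); keep the output only
        x=x[:,-1,:]           # take the last (and only) time step
        x=self.relu(x)
        x=self.dropout(x)
        x=self.fc3(x)
        return x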

In [None]:
input_layer=100
input_layer1=1
Hidden_layer=50
Hidden_layer2=25
output=6


In [None]:
clf=NN(input_layer,input_layer1,Hidden_layer,Hidden_layer2,output)

In [None]:
print(clf.parameters)

In [None]:

category_count

In [None]:
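# The weights must follow the LabelEncoder class order (0..5), so count the encoded
# training labels instead of reusing category_count, which is in column order.
class_count=np.bincount(y_train,minlength=6)
class_weights=1/torch.tensor(class_count,dtype=torch.float)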

In [None]:
target_list=[]
for _,t in train_data:
    target_list.append(t)
target_list=torch.tensor(target_list)  # index tensor, as in the Word2Vec section


In [None]:
class_weights_all=class_weights[target_list]


class_weights_all


In [None]:
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler

weighted_sampler = WeightedRandomSampler(
    weights=class_weights_all,
    num_samples=len(class_weights_all),
    replacement=True
)

In [None]:
train_loader=DataLoader(dataset=train_data,batch_size=256,sampler=weighted_sampler)

In [None]:
test_loader = DataLoader(dataset=test_data, batch_size=1)

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

In [None]:
clf.to(device)

In [None]:
EPOCHS = 100
LEARNING_RATE = 0.05
NUM_FEATURES = 100
NUM_CLASSES = 6


In [None]:
import torch.optim as optim
from sklearn.preprocessing import MinMaxScaler    
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))
optimizer = optim.Adam(clf.parameters(), lr=LEARNING_RATE)
print(clf)


In [None]:
def multi_acc(y_pred, y_test):
    y_pred_softmax = torch.log_softmax(y_pred, dim = 1)
    _, y_pred_tags = torch.max(y_pred_softmax, dim = 1)    
    
    correct_pred = (y_pred_tags == y_test).float()
    acc = correct_pred.sum() / len(correct_pred)
    
    acc = torch.round(acc * 100)
    return acc


In [None]:
accuracy_stats = {
    'train': [],
    "val": []
}
loss_stats = {
    'train': [],
    "val": []
}


In [None]:
# from tqdm.notebook import tqdm
# print("Begin training.")
# for e in tqdm(range(1, EPOCHS+1)):
#     # TRAINING
#     train_epoch_loss = 0
#     train_epoch_acc = 0
#     clf.train()
#     for X_train_batch, y_train_batch in train_loader:
#         X_train_batch, y_train_batch = X_train_batch.to(device), y_train_batch.to(device)
#         optimizer.zero_grad()

#         y_train_pred = clf(X_train_batch)
        
#         train_loss = criterion(y_train_pred, y_train_batch)
#         train_acc = multi_acc(y_train_pred, y_train_batch)
        
#         train_loss.backward()
#         optimizer.step()
        
#         train_epoch_loss += train_loss.item()
#         train_epoch_acc += train_acc.item()
        
        
#     # VALIDATION    
#     with torch.no_grad():
        
#         val_epoch_loss = 0
#         val_epoch_acc = 0
        
#         clf.eval()
#         for X_val_batch, y_val_batch in val_loader:
#             X_val_batch, y_val_batch = X_val_batch.to(device), y_val_batch.to(device)
            
#             y_val_pred = clf(X_val_batch)
                        
#             val_loss = criterion(y_val_pred, y_val_batch)
#             val_acc = multi_acc(y_val_pred, y_val_batch)
            
#             val_epoch_loss += val_loss.item()
#             val_epoch_acc += val_acc.item()
#     loss_stats['train'].append(train_epoch_loss/len(train_loader))
#     loss_stats['val'].append(val_epoch_loss/len(val_loader))
#     accuracy_stats['train'].append(train_epoch_acc/len(train_loader))
#     accuracy_stats['val'].append(val_epoch_acc/len(val_loader))


#     print(f'Epoch {e+0:03}: | Train Loss: {train_epoch_loss/len(train_loader):.5f} | Val Loss: {val_epoch_loss/len(val_loader):.5f} | Train Acc: {train_epoch_acc/len(train_loader):.3f}| Val Acc: {val_epoch_acc/len(val_loader):.3f}')