In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

## Dataset

In [2]:
df = pd.read_csv("/kaggle/input/topic-modeling-articles-dataset/Train.csv")

In [3]:
df.shape

(14004, 31)

In [4]:
df = df.iloc[:,:6]
df.head()

Unnamed: 0,id,ABSTRACT,Computer Science,Mathematics,Physics,Statistics
0,1824,a ever-growing datasets inside observational a...,0,0,1,0
1,3094,we propose the framework considering optimal $...,1,0,0,0
2,8463,nanostructures with open shell transition meta...,0,0,1,0
3,2082,stars are self-gravitating fluids inside which...,0,0,1,0
4,8687,deep neural perception and control networks ar...,1,0,0,0


## Input - output split

In [5]:
X = df["ABSTRACT"]
y = df.iloc[:,2:]

## Train - test split

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)
y_train = pd.DataFrame(y_train)
y_test = pd.DataFrame(y_test)

In [8]:
y1_train = pd.DataFrame(y_train.iloc[:,0])
y2_train = pd.DataFrame(y_train.iloc[:,1])
y3_train = pd.DataFrame(y_train.iloc[:,2])
y4_train = pd.DataFrame(y_train.iloc[:,3])
y1_test = pd.DataFrame(y_test.iloc[:,0])
y2_test = pd.DataFrame(y_test.iloc[:,1])
y3_test = pd.DataFrame(y_test.iloc[:,2])
y4_test = pd.DataFrame(y_test.iloc[:,3])

## Analyzing output variables

In [9]:
y.sum(axis = 1).value_counts()

1    11800
2     2047
3      157
Name: count, dtype: int64

In [10]:
for i in range(4):
    print(y_train.iloc[:,i].value_counts())

Computer Science
0    6511
1    4692
Name: count, dtype: int64
Mathematics
0    8932
1    2271
Name: count, dtype: int64
Physics
0    8116
1    3087
Name: count, dtype: int64
Statistics
0    8194
1    3009
Name: count, dtype: int64


## Analyzing input variables

In [11]:
X_train.shape

(11203, 1)

In [12]:
words_count = []
for i in range(X_train.shape[0]):
    words_count += [len(X_train.iloc[i,0].split())]
np.mean(words_count)

157.70507899669732

The abstracts are short on average (about 158 words) and the training set is small (11,203 samples), so deep learning models are unlikely to perform well here. We therefore use non-sequential models such as tree-based algorithms and Naive Bayes.

# Data preparation

## Lowercasing

In [13]:
X_train["ABSTRACT"] = X_train["ABSTRACT"].str.lower()
X_test["ABSTRACT"] = X_test["ABSTRACT"].str.lower()
X_train.head()

Unnamed: 0,ABSTRACT
10864,recent work has explored a syntactic abilities...
6518,this research presents an innovative and uniqu...
11212,"recently, a fabrication of cdse nanoplatelets ..."
3589,"we present the simple, self-consistent model t..."
6927,imposing constraints on a output of the deep n...


## Removing Punctuation

In [14]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [15]:
exclude = string.punctuation

In [16]:
def remove_punc(text):
    for char in exclude:
        if char in text:
            text = text.replace(char," ")
    return text
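
An equivalent single-pass variant is sketched below. It builds a translation table with `str.maketrans` and applies it with `str.translate`, so every punctuation character becomes a space in one call. The names `PUNCT_TO_SPACE` and `remove_punc_translate` are chosen for this sketch and are not part of the notebook.

```python
import string

# Translation table that maps every punctuation character to one space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_punc_translate(text):
    """Replace each punctuation character in text with a space."""
    return text.translate(PUNCT_TO_SPACE)

cleaned = remove_punc_translate("self-consistent model, in 3d!")
print(repr(cleaned))  # 'self consistent model  in 3d '
```

Because punctuation is replaced rather than deleted, a comma followed by a space leaves two spaces; the later `split()` calls collapse them.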

In [17]:
X_train["ABSTRACT"] = X_train["ABSTRACT"].apply(remove_punc)
X_test["ABSTRACT"] = X_test["ABSTRACT"].apply(remove_punc)
X_train.head()

Unnamed: 0,ABSTRACT
10864,recent work has explored a syntactic abilities...
6518,this research presents an innovative and uniqu...
11212,recently a fabrication of cdse nanoplatelets ...
3589,we present the simple self consistent model t...
6927,imposing constraints on a output of the deep n...


## Removing Stop Words

- Because the models used here are non-sequential machine learning algorithms, word order does not matter.
- Removing stop words can help the algorithm focus on meaningful words, which may improve classification accuracy.
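
As a small illustration of the filtering step, the sketch below removes stop words from one sentence. The tiny `toy_stopwords` set and the helper `drop_stopwords` are stand-ins invented for this example; the notebook itself uses NLTK's English stop-word list.

```python
# Hypothetical mini stop-word list, used only for this illustration.
toy_stopwords = {"we", "the", "a", "of", "on", "is"}

def drop_stopwords(text, stop_words):
    """Keep only the whitespace-separated words that are not stop words."""
    return " ".join(word for word in text.split() if word not in stop_words)

sentence = "we present the simple model of a star"
filtered = drop_stopwords(sentence, toy_stopwords)
print(filtered)  # present simple model star
```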

In [19]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
# Store the stop words in a set so the membership test in remove_stopwords is O(1).
stopwords = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
def remove_stopwords(text):
    temp = []
    for word in text.split():
        if word not in stopwords:
            temp.append(word)
    temp = " ".join(temp)
    return temp

In [21]:
X_train["ABSTRACT"] = X_train["ABSTRACT"].apply(remove_stopwords)
X_test["ABSTRACT"] = X_test["ABSTRACT"].apply(remove_stopwords)
X_train.head()

Unnamed: 0,ABSTRACT
10864,recent work explored syntactic abilities rnns ...
6518,research presents innovative unique way solvin...
11212,recently fabrication cdse nanoplatelets became...
3589,present simple self consistent model predict m...
6927,imposing constraints output deep neural net on...


## Stemming

In [22]:
from nltk.stem.porter import PorterStemmer
import time

ps = PorterStemmer()
def stemming(text):
    # Stem each whitespace-separated word and join the stems back into one string.
    return " ".join(ps.stem(word) for word in text.split())

In [23]:
X_train["ABSTRACT"] = X_train["ABSTRACT"].apply(stemming)
X_test["ABSTRACT"] = X_test["ABSTRACT"].apply(stemming)

## Tokenization


In [24]:
def tokenize(text):
    return text.split()

In [25]:
X_train["ABSTRACT"] = X_train["ABSTRACT"].apply(tokenize)
X_test["ABSTRACT"] = X_test["ABSTRACT"].apply(tokenize)
X_train.head()

Unnamed: 0,ABSTRACT
10864,"[recent, work, explor, syntact, abil, rnn, hel..."
6518,"[research, present, innov, uniqu, way, solv, a..."
11212,"[recent, fabric, cdse, nanoplatelet, becam, im..."
3589,"[present, simpl, self, consist, model, predict..."
6927,"[impos, constraint, output, deep, neural, net,..."


## Vectorize

In [27]:
import gensim

corpus = list(X_train["ABSTRACT"])
len(corpus)

11203

In [28]:
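vector_size = 100
# Passing `corpus` to the constructor already builds the vocabulary and trains the model.
model = gensim.models.Word2Vec(corpus,
    window = 4,
    min_count = 1,
    vector_size = vector_size,
    workers = 4
    )
# This explicit call trains for another `model.epochs` passes over the same corpus.
model.train(corpus, total_examples = model.corpus_count, epochs = model.epochs)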

(5598936, 6034605)

In [29]:
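def vectorize_abstract(tokenized_list):
    # key_to_index is a dict, so the lookup is O(1); index_to_key is a list that is scanned in full.
    text = [word for word in tokenized_list if word in model.wv.key_to_index]
    if not text:
        # No in-vocabulary token (possible for test abstracts): fall back to a zero vector.
        return np.zeros(model.wv.vector_size)
    return np.mean(model.wv[text],axis = 0)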

X_train_vec = X_train.copy()
X_test_vec = X_test.copy()
X_train_vec["ABSTRACT"] = X_train["ABSTRACT"].apply(vectorize_abstract).tolist()
X_test_vec["ABSTRACT"] = X_test["ABSTRACT"].apply(vectorize_abstract).tolist()
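
The cell above represents each abstract as the mean of its word vectors. The sketch below shows the same idea on hypothetical 3-dimensional embeddings in plain Python; `toy_vectors`, `mean_pool`, and the zero-vector fallback for documents with no known token are assumptions of this illustration, not part of the notebook.

```python
# Hypothetical 3-dimensional embeddings; real Word2Vec vectors have vector_size entries.
toy_vectors = {
    "neural": [1.0, 0.0, 2.0],
    "network": [3.0, 2.0, 0.0],
}

def mean_pool(tokens, vectors, dim=3):
    """Average the vectors of known tokens; out-of-vocabulary tokens are skipped."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all known tokens.
    return [sum(column) / len(known) for column in zip(*known)]

doc_vector = mean_pool(["deep", "neural", "network"], toy_vectors)
print(doc_vector)  # [2.0, 1.0, 1.0]
```

Averaging discards word order, which is consistent with the choice of non-sequential classifiers in this notebook.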

In [30]:
X_train_in = pd.DataFrame(X_train_vec['ABSTRACT'].tolist())
X_test_in = pd.DataFrame(X_test_vec['ABSTRACT'].tolist())
X_train_in

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.434169,0.497067,-0.137996,0.392593,0.312611,-0.447021,-0.561747,0.431754,-0.067340,-0.066511,...,0.550199,-0.319110,-0.202827,-0.440934,0.331354,-0.154107,0.023623,-0.264015,0.334725,-0.090740
1,-0.160964,0.139512,0.257017,0.463671,0.298604,-0.496796,-0.603057,0.138651,-0.251160,-0.129881,...,0.339514,0.055877,-0.075594,-0.252529,0.716801,-0.008356,-0.055552,-0.750366,-0.016194,-0.192978
2,-0.171138,-0.068740,0.409032,0.522558,0.343593,-0.438994,-0.086851,0.389179,-0.380887,-0.080176,...,-0.049513,0.194185,-0.127655,-0.017834,0.720038,-0.127321,0.198543,-0.333079,-0.211227,0.208561
3,-0.022124,0.510649,0.468935,0.007119,0.411399,-0.757856,0.052787,0.376282,-0.389018,0.376670,...,-0.104888,-0.372516,-0.272091,-0.169860,0.590384,0.018837,0.090778,-0.292178,0.041225,0.842303
4,-0.172022,0.440761,0.153715,0.407142,0.535757,-0.203469,-0.454783,0.477946,-0.049865,-0.028681,...,0.340345,-0.160666,0.030015,-0.167863,0.305392,-0.336355,-0.035989,-0.747500,-0.088499,0.247618
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11198,0.008307,0.161205,0.191842,-0.032319,0.407150,-0.289143,-0.473460,0.455931,-0.238869,0.030997,...,0.337980,0.325986,-0.051514,0.057061,0.418096,-0.002156,0.041988,-0.531555,0.122515,0.022464
11199,-0.171733,0.343394,0.284930,-0.295850,0.581179,-0.096978,-0.418676,0.502478,-0.148693,-0.356061,...,-0.028691,0.403866,-0.146093,-0.310352,0.679379,-0.105074,0.318533,-0.688415,-0.007885,0.066226
11200,0.084813,0.568134,0.007406,-0.235805,0.351063,0.035523,-0.446390,0.316369,-0.388147,-0.002179,...,0.008354,-0.013045,-0.310144,-0.485191,1.002162,-0.269859,0.265832,-0.544944,-0.106754,-0.052202
11201,-0.177524,0.175197,-0.006524,0.147917,0.473700,-0.232786,-0.118700,0.315585,-0.148813,0.097392,...,0.317880,0.130595,-0.059054,-0.164593,0.710671,-0.140382,0.194303,-0.703287,0.235591,-0.017372


# Model Selection and Evaluation

## 1. Random Forest Classifier

In [32]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

rfc = RandomForestClassifier(max_depth = 5,n_jobs = -1)
rfc.fit(X_train_in,y_train)
y_pred_train = rfc.predict(X_train_in)
y_pred = rfc.predict(X_test_in)

f1_score(y_train,y_pred_train,average = "macro"),f1_score(y_test,y_pred,average = "macro")

(0.7738344286296932, 0.7598236158427114)

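Scores noted from other Word2Vec runs, in the same (train, test) format; the two values before each pair appear to be the window and the vector size:

- window = 1, 300: (0.7688740952166848, 0.7556735556754897)
- window = 3, 300: (0.7754760431286739, 0.7696386951182455)
- setting not recorded: (0.7686592700639805, 0.7532132585105826)
- window = 7, 300: (0.7751692803647058, 0.7589523639952945)
- window = 10, 100: (0.7641491463691902, 0.7505651480280441)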

## 2. AdaBoost

In [33]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.multioutput import MultiOutputClassifier

abc = MultiOutputClassifier(AdaBoostClassifier(n_estimators = 100, learning_rate = 0.5), n_jobs=-1)

abc.fit(X_train_in,y_train)
y_pred_train = abc.predict(X_train_in)
y_pred = abc.predict(X_test_in)
f1_score(y_train,y_pred_train,average = "macro"),f1_score(y_test,y_pred,average = "macro")

(0.8295371152981884, 0.8096038498691326)

## 3. Naive Bayes

In [34]:
from sklearn.naive_bayes import GaussianNB


GaussianNB accepts only a 1-D target array, so we fit four separate models in a loop, one per label.

In [35]:
test_scores = []
# Fit one GaussianNB per label column; f1_score expects (y_true, y_pred).
for label in y_train.columns:
    nbc = GaussianNB()
    nbc.fit(X_train_in, y_train[label])
    y_pred_train = nbc.predict(X_train_in)
    y_pred = nbc.predict(X_test_in)
    print("train: ", f1_score(y_train[label], y_pred_train))
    test_score = f1_score(y_test[label], y_pred)
    print("test: ", test_score)
    test_scores.append(test_score)
print("Average F1 score in test data:", np.mean(test_scores))

train:  0.7910994272838231
test:  0.7838338895068595
train:  0.7481722177091795
test:  0.7455919395465995
train:  0.9313531353135314
test:  0.9312169312169313
train:  0.6306660499537464
test:  0.6561938958707361
Average F1 score in test data: 0.7792091640352816
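
The average printed above is the same kind of quantity that `average = "macro"` produces in the other cells: one F1 score per label, then their unweighted mean. The sketch below spells this out in plain Python on made-up toy labels; `binary_f1` and `macro_f1` are illustrative helpers, not scikit-learn functions.

```python
def binary_f1(y_true, y_pred):
    """F1 for one label: 2*TP / (2*TP + FP + FN), or 0.0 if the denominator is zero."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(per_label_scores):
    """Unweighted mean of the per-label F1 scores."""
    return sum(per_label_scores) / len(per_label_scores)

# Toy data: 4 samples (rows) and 2 labels (columns).
Y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
Y_pred = [[1, 0], [0, 1], [0, 1], [1, 0]]

# zip(*rows) turns the row-wise lists into one sequence per label column.
per_label = [binary_f1(t, p) for t, p in zip(zip(*Y_true), zip(*Y_pred))]
macro = macro_f1(per_label)
print(per_label, macro)  # [0.5, 1.0] 0.75
```

Since every label counts equally, a weak label such as the fourth one above (test F1 of about 0.66) pulls the macro average down regardless of how many samples carry it.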


## 4. XGBoost

In [37]:
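from xgboost import XGBClassifier

# NOTE: despite the heading and the variable name, this cell fits a RandomForestClassifier,
# so the scores below are random-forest scores, not XGBoost scores. XGBClassifier is imported
# but never used; evaluating XGBoost would require constructing it here and re-running the cell.
xgb = RandomForestClassifier(max_depth = 5,n_jobs = -1)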

xgb.fit(X_train_in,y_train)
y_pred_train = xgb.predict(X_train_in)
y_pred = xgb.predict(X_test_in)
f1_score(y_train,y_pred_train,average = "macro"),f1_score(y_test,y_pred,average = "macro")

(0.7691605559879291, 0.7583907431783062)

## Optimizing Hyperparameters for AdaBoost with Word2Vec

In [None]:
for vector_size in [100,200,300]:
    for window in [4,7,10]:
        for n_estimators in [50,100]:
            for learning_rate in [0.1,0.5,1]:
                t = time.time()
                model= gensim.models.Word2Vec(corpus,window = window,
                       min_count= 1,
                       vector_size = vector_size,
                       workers = 4)
                model.train(corpus, total_examples = model.corpus_count,epochs=model.epochs)

                def vectorize_abstract(tokenized_list):
                    # key_to_index is a dict (O(1) lookup); index_to_key is a list that is scanned in full.
                    text = [word for word in tokenized_list if word in model.wv.key_to_index]
                    return np.mean(model.wv[text],axis = 0)
                
                X_train_vec = X_train.copy()
                X_test_vec = X_test.copy()
                X_train_vec["ABSTRACT"] = X_train["ABSTRACT"].apply(vectorize_abstract)
                X_test_vec["ABSTRACT"] = X_test["ABSTRACT"].apply(vectorize_abstract)
                X_train_in = pd.DataFrame(X_train_vec['ABSTRACT'].tolist())
                X_test_in = pd.DataFrame(X_test_vec['ABSTRACT'].tolist())

                classifier = MultiOutputClassifier(AdaBoostClassifier(n_estimators = n_estimators, learning_rate = learning_rate), n_jobs=-1)
                classifier.fit(X_train_in,y_train)
                y_pred_train = classifier.predict(X_train_in)
                y_pred = classifier.predict(X_test_in)
                print(time.time()-t)
                print("Hyperparameters: ","vector_size = ",vector_size,"window = ",window,"n_estimator = ",n_estimators,"learning_rate = ",learning_rate)
                print("Train score: ",f1_score(y_train,y_pred_train,average = "macro"),"Test score: ", f1_score(y_test,y_pred,average = "macro"))

Hyperparameters:  vector_size =  100 window =  4 n_estimator =  50 learning_rate =  0.1
Train score:  0.7888647954089151 Test score:  0.7833197806603203

Hyperparameters:  vector_size =  100 window =  4 n_estimator =  50 learning_rate =  0.5
Train score:  0.8183276736697701 Test score:  0.8075680903034081

Hyperparameters:  vector_size =  100 window =  4 n_estimator =  50 learning_rate =  1
Train score:  0.8190600566205557 Test score:  0.8008788838377146

Hyperparameters:  vector_size =  100 window =  4 n_estimator =  100 learning_rate =  0.1
Train score:  0.8044657044523095 Test score:  0.7985527473334413

Hyperparameters:  vector_size =  100 window =  4 n_estimator =  100 learning_rate =  0.5
Train score:  0.8287248232427425 Test score:  0.8141099215691027

Hyperparameters:  vector_size =  100 window =  4 n_estimator =  100 learning_rate =  1
Train score:  0.8371698315512807 Test score:  0.8041273834602976

Hyperparameters:  vector_size =  100 window =  7 n_estimator =  50 learning_rate =  0.1
Train score:  0.7908713844636447 Test score:  0.7866689781249941

Hyperparameters:  vector_size =  100 window =  7 n_estimator =  50 learning_rate =  0.5
Train score:  0.8214136108305716 Test score:  0.81392822408212

Hyperparameters:  vector_size =  100 window =  7 n_estimator =  50 learning_rate =  1
Train score:  0.8247351564218608 Test score:  0.8074778006134192

Hyperparameters:  vector_size =  100 window =  7 n_estimator =  100 learning_rate =  0.1
Train score:  0.8115296270075976 Test score:  0.8074653844202999

Hyperparameters:  vector_size =  100 window =  7 n_estimator =  100 learning_rate =  0.5
Train score:  0.8329382795515746 Test score:  0.8212894855427767

Hyperparameters:  vector_size =  100 window =  7 n_estimator =  100 learning_rate =  1
Train score:  0.8389373842595793 Test score:  0.8190983095827689

Hyperparameters:  vector_size =  100 window =  10 n_estimator =  50 learning_rate =  0.1
Train score:  0.7937267167437966 Test score:  0.7951062282192839

Hyperparameters:  vector_size =  100 window =  10 n_estimator =  50 learning_rate =  0.5
Train score:  0.8227459107308934 Test score:  0.8097438321472485

Hyperparameters:  vector_size =  100 window =  10 n_estimator =  50 learning_rate =  1
Train score:  0.8236634941993541 Test score:  0.8068745777322421

Hyperparameters:  vector_size =  100 window =  10 n_estimator =  100 learning_rate =  0.1
Train score:  0.8089049437543426 Test score:  0.8030716244434193

Hyperparameters:  vector_size =  100 window =  10 n_estimator =  100 learning_rate =  0.5
Train score:  0.8333416334890049 Test score:  0.8171821653512464

Hyperparameters:  vector_size =  100 window =  10 n_estimator =  100 learning_rate =  1
Train score:  0.8409969240931676 Test score:  0.8115190857833857

Hyperparameters:  vector_size =  200 window =  4 n_estimator =  50 learning_rate =  0.1
Train score:  0.7946991085349088 Test score:  0.7865253527033136

Hyperparameters:  vector_size =  200 window =  4 n_estimator =  50 learning_rate =  0.5
Train score:  0.8212818989011118 Test score:  0.8129098047356516

Hyperparameters:  vector_size =  200 window =  4 n_estimator =  50 learning_rate =  1
Train score:  0.8232561792761892 Test score:  0.8012920589635786

Hyperparameters:  vector_size =  200 window =  4 n_estimator =  100 learning_rate =  0.1
Train score:  0.8105600435838346 Test score:  0.8046765400054733

Hyperparameters:  vector_size =  200 window =  4 n_estimator =  100 learning_rate =  0.5
Train score:  0.834447949666691 Test score:  0.8168334589395274

Hyperparameters:  vector_size =  200 window =  4 n_estimator =  100 learning_rate =  1
Train score:  0.8401692843178505 Test score:  0.8097146468257053

Hyperparameters:  vector_size =  200 window =  7 n_estimator =  50 learning_rate =  0.1
Train score:  0.7997661735177206 Test score:  0.7967999887453208

Hyperparameters:  vector_size =  200 window =  7 n_estimator =  50 learning_rate =  0.5
Train score:  0.8215898166690483 Test score:  0.8080544260315742

Hyperparameters:  vector_size =  200 window =  7 n_estimator =  50 learning_rate =  1
Train score:  0.8242195330765387 Test score:  0.8106114167362274

Hyperparameters:  vector_size =  200 window =  7 n_estimator =  100 learning_rate =  0.1
Train score:  0.8143031033640369 Test score:  0.8107102794654606

Hyperparameters:  vector_size =  200 window =  7 n_estimator =  100 learning_rate =  0.5
Train score:  0.8360911601142774 Test score:  0.8197034867549734

Hyperparameters:  vector_size =  200 window =  7 n_estimator =  100 learning_rate =  1
Train score:  0.8439004036437036 Test score:  0.8111898505116573

Hyperparameters:  vector_size =  200 window =  10 n_estimator =  50 learning_rate =  0.1
Train score:  0.7991874187180988 Test score:  0.7985241632521124

Hyperparameters:  vector_size =  200 window =  10 n_estimator =  50 learning_rate =  0.5
Train score:  0.8230168123537445 Test score:  0.8163129392463889

Hyperparameters:  vector_size =  200 window =  10 n_estimator =  50 learning_rate =  1
Train score:  0.8277082146356952 Test score:  0.8107588446912674

Hyperparameters:  vector_size =  200 window =  10 n_estimator =  100 learning_rate =  0.1
Train score:  0.8143375249695566 Test score:  0.8107023516452269

Hyperparameters:  vector_size =  200 window =  10 n_estimator =  100 learning_rate =  0.5
Train score:  0.8361847158508455 Test score:  0.8213002351138347

Hyperparameters:  vector_size =  200 window =  10 n_estimator =  100 learning_rate =  1
Train score:  0.8453509501612227 Test score:  0.8129616926711318

Hyperparameters:  vector_size =  300 window =  4 n_estimator =  50 learning_rate =  0.1
Train score:  0.7928245533573252 Test score:  0.7902557094153544

Hyperparameters:  vector_size =  300 window =  4 n_estimator =  50 learning_rate =  0.5
Train score:  0.8223704691421638 Test score:  0.8098302391690684

Hyperparameters:  vector_size =  300 window =  4 n_estimator =  50 learning_rate =  1
Train score:  0.822442867600612 Test score:  0.8058847028746517

Hyperparameters:  vector_size =  300 window =  4 n_estimator =  100 learning_rate =  0.1
Train score:  0.8160158412321356 Test score:  0.8128208075327648

Hyperparameters:  vector_size =  300 window =  4 n_estimator =  100 learning_rate =  0.5
Train score:  0.8385322682633494 Test score:  0.8144937793525562

Hyperparameters:  vector_size =  300 window =  4 n_estimator =  100 learning_rate =  1
Train score:  0.8429703320983177 Test score:  0.8090254169993912

Hyperparameters:  vector_size =  300 window =  7 n_estimator =  50 learning_rate =  0.1
Train score:  0.7984635148913778 Test score:  0.7903276535154157

Hyperparameters:  vector_size =  300 window =  7 n_estimator =  50 learning_rate =  0.5
Train score:  0.8241265702476338 Test score:  0.8153644621972239

Hyperparameters:  vector_size =  300 window =  7 n_estimator =  50 learning_rate =  1
Train score:  0.8271812637847401 Test score:  0.8085161112950283

Hyperparameters:  vector_size =  300 window =  7 n_estimator =  100 learning_rate =  0.1
Train score:  0.812241735355691 Test score:  0.8096401217861039

Hyperparameters:  vector_size =  300 window =  7 n_estimator =  100 learning_rate =  0.5
Train score:  0.8406403759143766 Test score:  0.8139381447616445

Hyperparameters:  vector_size =  300 window =  7 n_estimator =  100 learning_rate =  1
Train score:  0.8449886939572137 Test score:  0.8079062394377027

Hyperparameters:  vector_size =  300 window =  10 n_estimator =  50 learning_rate =  0.1
Train score:  0.8011211885503766 Test score:  0.7992571041545528

Hyperparameters:  vector_size =  300 window =  10 n_estimator =  50 learning_rate =  0.5
Train score:  0.8228099522411267 Test score:  0.8114610537975122

Hyperparameters:  vector_size =  300 window =  10 n_estimator =  50 learning_rate =  1
Train score:  0.8281410959698303 Test score:  0.8089894901988152

Hyperparameters:  vector_size =  300 window =  10 n_estimator =  100 learning_rate =  0.1
Train score:  0.8184711618611453 Test score:  0.8151350929504934

Hyperparameters:  vector_size =  300 window =  10 n_estimator =  100 learning_rate =  0.5
Train score:  0.8396558446887468 Test score:  0.8238581528730363

Hyperparameters:  vector_size =  300 window =  10 n_estimator =  100 learning_rate =  1
Train score:  0.8486119727912368 Test score:  0.8156460238975898

### Best Score

- Test score: 0.8238581528730363 (train score: 0.8396558446887468)
- AdaBoost hyperparameters: n_estimators = 100, learning_rate = 0.5
- Word2Vec hyperparameters: vector_size = 300, window = 10

This configuration has the highest test macro F1 score in the grid, and the gap to its train score is small (about 0.016), so it shows little overfitting.
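
The selection can also be done programmatically. The sketch below copies four rows from the log above into a list and picks the one with the highest test score; the `results` structure and its field names are choices made for this illustration, and only a handful of the 54 logged runs are included.

```python
# A few (train, test) macro F1 results copied from the grid-search log above.
results = [
    {"vector_size": 100, "window": 7, "n_estimators": 100, "learning_rate": 0.5,
     "train": 0.8329382795515746, "test": 0.8212894855427767},
    {"vector_size": 200, "window": 10, "n_estimators": 100, "learning_rate": 0.5,
     "train": 0.8361847158508455, "test": 0.8213002351138347},
    {"vector_size": 300, "window": 10, "n_estimators": 100, "learning_rate": 0.5,
     "train": 0.8396558446887468, "test": 0.8238581528730363},
    {"vector_size": 300, "window": 10, "n_estimators": 100, "learning_rate": 1,
     "train": 0.8486119727912368, "test": 0.8156460238975898},
]

# Pick the run with the highest test score and measure its train-test gap.
best = max(results, key=lambda run: run["test"])
gap = best["train"] - best["test"]

print(best["vector_size"], best["window"], best["n_estimators"], best["learning_rate"])
print(round(gap, 4))  # 0.0158
```

For comparison, the last row (learning_rate = 1) has a higher train score but a lower test score, so its gap is roughly twice as large.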
 