# TEXT CLASSIFICATION

### Aim:

To construct a method to classify the unlabelled sentences according to the labels given in few articles (training set).

###  Introduction:

This corpus contains sentences from the abstract and introduction of 30 scientific articles that have been annotated (i.e. labeled or tagged) according to a modified version of the Argumentative Zones annotation scheme. These 30 scientific articles come from three different domains:

1. PLoS Computational Biology (PLOS)
2. The machine learning repository on arXiv (ARXIV)
3. The psychology journal Judgment and Decision Making (JDM)

There are 10 articles from each domain. In addition to the labeled data, this corpus also contains a corresponding set of unlabeled articles. These unlabeled articles also come from PLOS, ARXIV, and JDM. There are 300 unlabeled articles from each domain (again, only the sentences from the abstract and introduction). These unlabeled articles can be used for unsupervised or semi-supervised approaches to sentence classification which rely on a small set of labeled data and a larger set of unlabeled data.

Labels in the article includes AIMX, OWNX, CONT, BASE, MISC

### Methods:

#### Text Pre-processing
1. Extracted the labelled text files (training set) from the folder and joined all the sentences to form a dataframe.
2. All the sentences and its corresponding categories are separated.
3. Stopwords are removed and lematized to increase classsification accuracy.
4. Counter vectorized the data for classification.
4. Words in the word_list folder are extracted and labelled with the corresponding categories.
5. Above words are lematizzed and added to the counter vectorized output of data.

##### Naive Bayes:
6. Unlabelled data (test data) is extracted and preprocessed as mentioned above.
7. Naive bayes classifier is used for classification since we have small training set and large testing set, this gives better accuracy.
8. Accuracy of 87% is achieved with Navie Bayes classification.
9. Further to improve accuracy of text classification XG Boost is used.

##### XG Boost:
10. Using XG Boost method the accuracy is increased to 92.87%

#### Future Work:
At the end I used grid search to improve the XG boost parameters. But the running time was too long and not efficient.
We may improve the classification by using some deep learning algorithms also in future.

##### Prerequisite  libraries to instal in python before running

nltk, xgboost

### Conslusion:

Thus, the classifier is designed with an accuracy of 92.87% and the resulted output with labelled text of 300 unlabelled articles. The resulted output is saved in a text file with filename "belled_prediction.txt".

### Python Code
Before running the code, change the path of input files where ever required.

Example:
My path includes: "C:/Users/naren/Documents/Divya/SentenceCorpus/labeled_articles"



In [53]:
# Import all required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import glob
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
pd.set_option('display.max_colwidth', -1)
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
import warnings
warnings.filterwarnings('ignore')

In [54]:
#importing wordlist
aim_word_list=pd.read_csv("C:/Users/naren/Documents/Divya/SentenceCorpus/word_lists/aim.txt",header=None)
aim_words=list(aim_word_list[0])

own_word_list=pd.read_csv("C:/Users/naren/Documents/Divya/SentenceCorpus/word_lists/own.txt",header=None)
own_words=list(own_word_list[0])

base_word_list=pd.read_csv("C:/Users/naren/Documents/Divya/SentenceCorpus/word_lists/base.txt",header=None)
base_words=list(base_word_list[0])

cont_word_list=pd.read_csv("C:/Users/naren/Documents/Divya/SentenceCorpus/word_lists/contrast.txt",header=None)
cont_words=list(cont_word_list[0])

vocab=aim_words+own_words+base_words+cont_words
vocab=list(set(vocab))

In [55]:
# load all txt files
path=r"C:/Users/naren/Documents/Divya/SentenceCorpus/labeled_articles"
labeled_articles=[pd.read_table(file,error_bad_lines=False,warn_bad_lines=False) for file in glob.glob(path+"/*.txt")]

#concat all the files into one daatframe
df_labeled=pd.concat(labeled_articles)
df_labeled=df_labeled.reset_index()

#renaming the columns
df_labeled.columns=["category","text"]

#changing the order of the columns
df_labeled=df_labeled[["text","category"]]

In [56]:
#checking for missing values
print(df_labeled.isnull().sum())

#As we see the category is introduction, we better delete the null value
print(df_labeled[df_labeled["text"].isnull()]["category"].unique())

#removing null values
df_labeled=df_labeled.dropna()

text        43
category    0 
dtype: int64
['### introduction ###']


In [57]:
#checking for unique categories in category column
print(df_labeled["category"].unique())

#if you look at the unique categories,
#there are many categories which are not specified by annotators,lets inspect those rows
df_labeled[df_labeled["category"]==0].head()
#looking at those rows we conclude that text data has to be processed to be categorised

['MISC' 'AIMX' 'OWNX' 'CONT' 0L 1L 2L 3L 4L 5L 6L 7L 8L 9L 10L 11L 12L 13L
 14L 15L 16L 17L 18L 19L 20L 21L 22L 23L 24L 25L 26L 27L 28L 29L 30L 31L
 32L 33L 34L 35L 36L 37L 38L 39L 40L 41L 42L 43L 44L 45L 46L 47L 48L 49L
 50L 51L 52L 53L 54L 55L 56L 57L 58L 59L 60L 61L 62L 63L 64L 65L 66L 67L
 'BASE' 'OWNX ' 68L 69L 70L 71L 72L 73L 74L 75L 76L 77L]


Unnamed: 0,text,category
136,OWNX The Minimum Description Length principle for online sequence estimation/prediction in a proper learning setup is studied,0
273,"MISC Although the Internet AS-level topology has been extensively studied over the past few years, little is known about the details of the AS taxonomy",0
333,AIMX In this paper we derive the equations for Loop Corrected Belief Propagation on a continuous variable Gaussian model,0
358,AIMX In this paper we derive the equations for Loop Corrected Belief Propagation on a continuous variable Gaussian model,0
411,MISC Defensive forecasting is a method of transforming laws of probability (stated in game-theoretic terms as strategies for Sceptic) into forecasting algorithms,0


In [58]:
#dividing the data 
cat_labels=['AIMX','OWNX','MISC','CONT','BASE','OWNX']
df_label=df_labeled[df_labeled["category"].isin(cat_labels)]
df_to_be_labeled=df_labeled[~df_labeled["category"].isin(cat_labels)]

#data processing for df_to_be_labled
del df_to_be_labeled["category"]
df_to_be_labeled["category"]=[x[:4] for x in df_to_be_labeled["text"] ]
df_to_be_labeled["text"]=[x[4:] for x in df_to_be_labeled["text"] ]

#concatenating data
labeled_df=pd.concat([df_label,df_to_be_labeled])

labeled_df=labeled_df[labeled_df["category"]!='### ']

#replacing the values of OWNX
labeled_df["category"].replace('OWNX ','OWNX',inplace=True)

# Text Preprocessing

In [59]:
#create another dataframe
df=labeled_df.copy()

# remove punctuation from text
df["text"] = df['text'].str.replace('[^\w\s]','')

#change text to lower case
df["text"]=df["text"].map(lambda x: x.lower())

#remove stopwords
stop_words_txt=pd.read_csv("C:/Users/naren/Documents/Divya/SentenceCorpus/word_lists/stopwords.txt",header=None)
stop_words=list(stop_words_txt[0])
df['text'] = df["text"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

#tokenization
df['word_tokenize'] = df["text"].apply(lambda x: word_tokenize(x))

#lemmatization
lmtzr = WordNetLemmatizer()
df['lemmatized'] = df['word_tokenize'].apply(lambda x: [lmtzr.lemmatize(y) for y in x])

#lemmatization of vocab
vocab_lemmatized=[lmtzr.lemmatize(x) for x in vocab]
#stemming
#ps=PorterStemmer()
#df['stemmed'] = df['word_tokenize'].apply(lambda x: [ps.stem(y) for y in x])

df["new_text"]=df['lemmatized'].apply(lambda x: " ".join(x))

# Label Encoding the category variable
le=LabelEncoder()
df.category=le.fit_transform(df.category)
le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(le_name_mapping)

{'a si': 5, 'CONT': 2, 'AIMX': 0, 'MISC': 3, 'the ': 6, 'BASE': 1, 'OWNX': 4}


# Test Data Processing


In [60]:
#loading test files
path_test=r"C:/Users/naren/Documents/Divya/SentenceCorpus/unlabeled_articles/"

arxiv_articles=[pd.read_table(file,error_bad_lines=False,warn_bad_lines=False,engine="python") for file in glob.glob(path_test+"arxiv_unlabeled/*.txt")]
jdm_articles=[pd.read_table(file,error_bad_lines=False,warn_bad_lines=False,engine="python") for file in glob.glob(path_test+"jdm_unlabeled/*.txt")]
plos_articles=[pd.read_table(file,error_bad_lines=False,warn_bad_lines=False,engine="python") for file in glob.glob(path_test+"plos_unlabeled/*.txt")]

arxiv=pd.concat(arxiv_articles)
jdm=pd.concat(jdm_articles)
plos=pd.concat(plos_articles)
df_unlabelled=pd.concat([arxiv,jdm,plos])

#changing the name of the column
df_unlabelled.columns=["text"]

#finding removing null values
#print(df_unlabelled.isnull().sum())
df_unlabelled=df_unlabelled.dropna()

#removing duplicates
df_unlabelled=df_unlabelled.drop_duplicates(keep="first")
df_unlabelled=df_unlabelled[df_unlabelled["text"] !="### introduction ###"]

#create another dataframe
df_u=df_unlabelled.copy()

# remove punctuation from text
df_u["text"] = df_u['text'].str.replace('[^\w\s]','')

#change text to lower case
df_u["text"]=df_u["text"].map(lambda x: x.lower())

#remove stopwords
stop_words_txt=pd.read_csv("C:/Users/naren/Documents/Divya/SentenceCorpus/word_lists/stopwords.txt",header=None)
stop_words=list(stop_words_txt[0])
df_u['text'] = df_u["text"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

#tokenization
df_u['word_tokenize'] = df_u["text"].apply(lambda x: word_tokenize(x))

#lemmatization
lmtzr = WordNetLemmatizer()
df_u['lemmatized'] = df_u['word_tokenize'].apply(lambda x: [lmtzr.lemmatize(y) for y in x])

#stemming
#ps=PorterStemmer()
#df['stemmed'] = df['word_tokenize'].apply(lambda x: [ps.stem(y) for y in x])

df_u["new_text"]=df_u['lemmatized'].apply(lambda x: " ".join(x))




## Word Representation

In [61]:
#Getting the term frequency
count_vect = CountVectorizer(min_df=0.001,ngram_range=(1,2))
df_counts = count_vect.fit_transform(df.new_text)
count=count_vect.get_feature_names()

#combining the most frequency words and the word list given 
new_vocab=set(count+vocab_lemmatized)



In [62]:
#tfidf vectorization
vectorizer = TfidfVectorizer(vocabulary=set(count))
transformed = vectorizer.fit_transform(df.new_text)
print(transformed.shape)
#print(vectorizer.get_feature_names())

(3116, 3650)


## Modelling
### 1.Naive Bayes Classifier

In [63]:
#Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha=0.01).fit(transformed, df.category)

#predicting on train data
predicted = clf.predict(transformed)
np.mean(predicted == df.category)

0.89569961489088579

In [64]:
#using the train data vocab on test data
df_test_tfidf = vectorizer.transform(df_u.new_text)

#predicting on test data
y_preds=clf.predict_proba(df_test_tfidf )
y_preds=pd.DataFrame(y_preds)
y_preds["max"]=y_preds.apply(lambda x: x.argmax(), axis=1)
df_u["category"]=y_preds["max"]
df_u["category"]=df_u["category"].astype("object")
df_u["label_category"]=df_u["category"].replace({0:'AIMX',1:'BASE', 2:'CONT',3:'MISC', 4:'OWNX'})

#test data labelled

test_unlabelled=df_u[["label_category"]]
test_unlabelled["text"]=df_unlabelled["text"]

test_unlabelled["label_category"].value_counts()

MISC    29354
OWNX    4217 
CONT    291  
AIMX    5    
Name: label_category, dtype: int64

In [65]:
labeled_df["category"].value_counts()

MISC    1824
OWNX    865 
AIMX    194 
CONT    170 
BASE    61  
a si    1   
the     1   
Name: category, dtype: int64

## 2.XG Boost

In [66]:
import xgboost as xgb
xgb1 = xgb.XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'multi:softprob',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
xg_boost=xgb1.fit( transformed, df.category)

In [67]:

#predicting on train data
predicted_xgb = xg_boost.predict(transformed)
np.mean(predicted_xgb == df.category)

0.92875481386392811

In [68]:
#predicting on test data
y_preds=xg_boost.predict_proba(df_test_tfidf )
y_preds=pd.DataFrame(y_preds)
y_preds["max"]=y_preds.apply(lambda x: x.argmax(), axis=1)
df_u["category"]=y_preds["max"]
df_u["category"]=df_u["category"].astype("object")
df_u["label_category"]=df_u["category"].replace({0:'AIMX',1:'BASE', 2:'CONT',3:'MISC', 4:'OWNX'})

#test data labelled

test_unlabelled=df_u[["label_category"]]
test_unlabelled["text"]=df_unlabelled["text"]
test_unlabelled["label_category"].value_counts()
test_unlabelled.to_csv('C:/Users/naren/Documents/Divya/SentenceCorpus/Labelled_Prediction.txt',index = False)

MISC    21224
OWNX    6257 
AIMX    4820 
CONT    1566 
Name: label_category, dtype: int64

In [None]:
    "#from sklearn.grid_search import GridSearchCV \n",
    "#param_test1 = {\n",
    "# 'max_depth':[3,4,5,6,7,8,9,10],\n",
    "## 'min_child_weight':[1,2,3,4,5,6,7,8,9,10,11,12],\n",
    "#'gamma':[i/10.0 for i in range(0,5)]\n",
    "#}\n",
    "#gsearch = GridSearchCV(estimator = xgb.XGBClassifier( learning_rate =0.1, n_estimators=140, \n",
    "                                                     max_depth=5,\n",
    "# min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,\n",
    "# objective= 'multi:softmax', nthread=4, scale_pos_weight=1, seed=27), \n",
    "# param_grid = param_test1, scoring='accuracy',n_jobs=4,iid=False, cv=5)\n",
    "#grid_search=gsearch.fit(transformed, df.category)\n",
    "#gsearch.grid_scores_, gsearch.best_params_, gsearch.best_score_"