# Welcome to Trio de Informática's Second IART Project about Topic Modelling using NLP

Throughout this notebook one can observe the step by step data analysis of the training datasets as well as different approaches to this challenge. Several algorithms will also be implemented such as Naive Bayes, decision trees, SVM, etc. Be aware that different data processing techniques can match different algorithms, so in order to test all the combinations several cells will be provided to present all the results, as well as a comparison of accuracy of each implemented technique. 

The following notebook will record the group's approach to the proposed project, starting with the data analysis, followed by the data preprocessing. Afterwards, various algorithms will be implemented and a result comparison between different approaches will be performed.

# Statistical analysis

As one can observe when executing the code below, the provided dataset is composed of a table with several columns referring to different informations. Every entry in this dataset is identified by an ID, generated incrementally. Each entry contains a Title and an Abstract, reffering to a report/article to be classified as belonging to some topic. Each entry has an additional 6 columns related to topic classification. Each of these columns reffers to a topic: Computer Science, Physics, Mathematics, Statistics, Quantitative Biology and Quantitative Finance, and will contain a 1 if the article belongs to that said topic and a 0 otherwise.

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

# Importing the dataset
train = pd.read_csv('archive/train.csv')

train.head()

## Size of the training dataset

It is important to know the magnitude of the dataset used to train the predictive model. Therefore, the following block of code will allow us to check the dimention of the used dataset. The bigger the dataset, the more data is taken into account when training the model, thus allowing it to acquire more information.

In [1]:
print("Train: Rows & Columns")
print(train.shape)

Train: Rows & Columns


NameError: name 'train' is not defined

## Training Dataset Info

By showing the info of the training dataset, one can see that there are no null values, meaning that there is no missing data. 

In [None]:
train.info()

## Dataset description

Additionally, as shown below, the training dataset seems very clean because:

- Number of rows for all the data columns are same
- The max of all these 0-1 inputs is always 1
- The min and max value of the column ID match the expected values and so do the 25%, 50% and 75%, thus showing that the ID is incremented correctly


In [3]:
train.describe()

NameError: name 'train' is not defined

## Topic Distribution

Each entry in the training dataset belongs to a topic. In order to evaluate the topics with the most and least entries, below is presented a numerical analysis where we can find the topic distribution of the training dataset. The model will classify more easily papers related to topics with the most entries in the training dataset, therefore the predictive model will be better trained to identify the more recurring topics of the training dataset.

In [None]:
print('Number of articles that belong the the topic:\n')
print('- Computer Science: ',train['Computer Science'].sum())
print('- Physics: ',train['Physics'].sum())
print('- Mathematics: ',train['Mathematics'].sum())
print('- Statistics: ',train['Statistics'].sum())
print('- Quantitative Biology: ',train['Quantitative Biology'].sum())
print('- Quantitative Finance: ',train['Quantitative Finance'].sum())

print('\nAs a percentage:\n')
print('- Computer Science: ',round(train['Computer Science'].sum()/train.shape[0]*100), '%')
print('- Physics: ',round(train['Physics'].sum()/train.shape[0]*100),'%')
print('- Mathematics: ',round(train['Mathematics'].sum()/train.shape[0]*100),'%')
print('- Statistics: ',round(train['Statistics'].sum()/train.shape[0]*100),'%')
print('- Quantitative Biology: ',round(train['Quantitative Biology'].sum()/train.shape[0]*100),'%')
print('- Quantitative Finance: ',round(train['Quantitative Finance'].sum()/train.shape[0]*100),'%')

## Variable correlation?

In [5]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sb

corrmat = train.corr()
f, ax = plt.subplots(figsize=(7, 7))
# Draw the heatmap using seaborn
sb.heatmap(corrmat, square=False, annot=True)
plt.show()

NameError: name 'train' is not defined

## Evaluation of the length distribution of Titles and Abstracts

In [7]:
train_title_len = train['TITLE'].apply(len)
sb.displot(train_title_len)

AttributeError: module 'seaborn' has no attribute 'displot'

In [8]:
train_abstract_len = train['ABSTRACT'].apply(len)
sb.displot(train_abstract_len)

AttributeError: module 'seaborn' has no attribute 'displot'

# Preprocessing Techniques

Be aware that each preprocessing technique applied includes some other steps such as: removing non-alphanumeric characters, lowercasing all words and tokenizing. After applying the preprocesing algorithm, there is an aditional step to remove all stopwords. After all this steps are completed, each preprocessing function returns the corpus, a structure composed of tuples encapsulating each preprocessed title-abstract pair.

## Porter Stemming

The Porter Stemming algorithm being the one of the oldest, originally developed in 1979, uses Suffix Stripping to produce stems. PorterStemmer is known for its simplicity and speed. It is commonly useful in Information Retrieval Environments known as IR Environments for fast recall and fetching of search queries. In a typical IR, environment documents are represented as vectors of words or terms. Words having the same stem will therefore have a similar meaning. 


In [9]:
from nltk.stem.porter import PorterStemmer

def porterStemming():

    corpus=[]
    # Initialize PorterStemmer
    ps = PorterStemmer()

    for i in range(0,train.shape[0]):
        # get title and abstract and remove non alpha chars
        title = re.sub('[^a-zA-Z]', ' ', train['TITLE'][i])
        abstract = re.sub('[^a-zA-Z]', ' ', train['ABSTRACT'][i])
        # to lower-case and tokenize
        title = title.lower().split()
        abstract = abstract.lower().split()
        # stemming and stop word removal
        title = ' '.join([ps.stem(w) for w in title if not w in set(stopwords.words('english'))])
        abstract = ' '.join([ps.stem(w) for w in abstract if not w in set(stopwords.words('english'))])
        corpus.append((title, abstract))

    return corpus

## Lancaster Stemming

Lancaster Stemming ( also known as Paice-Husk Stemmer) was developed in 1990 and uses a more aggressive approach than the Porter Stemming Algorithm and the Snowball Algorithm, although it is not as efficient as the latter. The LancasterStemmer (Paice-Husk stemmer) is an iterative algorithm with rules saved externally. On each iteration, it tries to find an applicable rule by the last character of the word. Each rule specifies either a deletion or replacement of an ending. If there is no such rule, it terminates. It also terminates if a word starts with a vowel and there are only two letters left or if a word starts with a consonant and there are only three characters left. Otherwise, the rule is applied, and the process repeats. LancasterStemmer is simple, but heavy stemming due to iterations and over-stemming may occur, specially on smaller words. Over-stemming causes the stems to be not linguistic, or they may have no meaning. 

In [10]:
from nltk.stem import LancasterStemmer

def lancasterStemming():

    corpus=[]
    # Initialize LancasterStemmer
    lc = LancasterStemmer()

    for i in range(0,train.shape[0]):
        # get title and abstract and remove non alpha chars
        title = re.sub('[^a-zA-Z]', ' ', train['TITLE'][i])
        abstract = re.sub('[^a-zA-Z]', ' ', train['ABSTRACT'][i])
        # to lower-case and tokenize
        title = title.lower().split()
        abstract = abstract.lower().split()
        # stemming and stop word removal
        title = ' '.join([lc.stem(w) for w in title if not w in set(stopwords.words('english'))])
        abstract = ' '.join([lc.stem(w) for w in abstract if not w in set(stopwords.words('english'))])
        corpus.append((title, abstract))
        
    return corpus

## Snowball Stemming

When compared to the Porter Stemmer, the Snowball Stemmer can map non-English words too. Since it supports other languages the Snowball Stemmers can be called a multi-lingual stemmer. This stemmer is based on a programming language called ‘Snowball’ that processes small strings and is the most widely used stemmer. The Snowball stemmer is way more aggressive than Porter Stemmer and is also referred to as Porter2 Stemmer. Because of the improvements added when compared to the Porter Stemmer, the Snowball stemmer has greater computational speed.

In [11]:
from nltk.stem.snowball import SnowballStemmer

def snowballStemming():

    corpus=[]
    # Initialize LancasterStemmer
    ss = SnowballStemmer(language='english')

    for i in range(0,train.shape[0]):
        # get title and abstract and remove non alpha chars
        title = re.sub('[^a-zA-Z]', ' ', train['TITLE'][i])
        abstract = re.sub('[^a-zA-Z]', ' ', train['ABSTRACT'][i])
        # to lower-case and tokenize
        title = title.lower().split()
        abstract = abstract.lower().split()
        # stemming and stop word removal
        title = ' '.join([ss.stem(w) for w in title if not w in set(stopwords.words('english'))])
        abstract = ' '.join([ss.stem(w) for w in abstract if not w in set(stopwords.words('english'))])
        corpus.append((title, abstract))

    return corpus

## Some comparisons on stemming algorithms

Difference Between Porter Stemmer and Snowball Stemmer:

- Snowball Stemmer is more aggressive than Porter Stemmer.
- Some issues in Porter Stemmer were fixed in Snowball Stemmer.
- There is only a little difference in the working of these two.
- Words like ‘fairly‘ and ‘sportingly‘ were stemmed to ‘fair’ and ‘sport’ in the snowball stemmer but when you use the porter stemmer they are stemmed to ‘fairli‘ and ‘sportingli‘.
- The difference between the two algorithms can be clearly seen in the way the word ‘Sportingly’ in stemmed by both. Clearly Snowball Stemmer stems it to a more accurate stem.

## Lemmatization

Lemmatization, unlike Stemming, reduces the inflected words properly ensuring that the root word belongs to the language. In Lemmatization root word is called Lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

In [12]:
from nltk.stem import WordNetLemmatizer

def lemmatization():
    corpus=[]
    # Initialize Word Net Lemmatizer
    lemm = WordNetLemmatizer()

    for i in range(0, train.shape[0]):
        # get title and abstract and remove non alpha chars
        title = re.sub('[^a-zA-Z]', ' ', train['TITLE'][i])
        abstract = re.sub('[^a-zA-Z]', ' ', train['ABSTRACT'][i])
        # to lower-case and tokenize
        title = title.lower().split()
        abstract = abstract.lower().split()
        # lemmatization and stop word removal
        title = ' '.join([lemm.lemmatize(w) for w in title if not w in set(stopwords.words('english'))])
        abstract = ' '.join([lemm.lemmatize(w) for w in abstract if not w in set(stopwords.words('english'))])
        corpus.append((title, abstract))
        
    return corpus

## Stemming vs. Lemmatization

Stemming and Lemmatization are itself form of NLP and widely used in Text mining. Text Mining is the process of analysis of texts written in natural language and extract high-quality information from text. It involves looking for interesting patterns in the text or to extract data from the text to be inserted into a database. Developers have to prepare text using lexical analysis, POS (Parts-of-speech) tagging, stemming and other Natural Language Processing techniques to gain useful information from text.

### When should one use Stemming or Lemmatization?
Stemming and Lemmatization both generate the root form of the inflected words. The difference is that stem might not be an actual word whereas, lemma is an actual language word.

Stemming follows an algorithm with steps to perform on the words which makes it faster. Whereas, in lemmatization, using WordNet corpus and a corpus for stop words as well to produce lemma, which makes it slower than stemming.

The above points show that if speed is the focus then stemming should be used since lemmatizers scan a corpus which consumes time and processing. It widely depends on the problem characteristics if stemmers should be used or lemmatizers.

### Pros of Stemming:

### Cons of Stemming:
- Issues of over stemming and under stemming may lead to not so meaningful or inappropriate stems.
- Stemming does not consider how the word is being used. For example – the word ‘saw‘ will be stemmed to ‘saw‘ itself but it won’t be considered whether the word is being used as a noun or a verb in the context. For this reason, Lemmatization is used as it keeps this fact in consideration and will return either ‘see’ or ‘saw’ depending on whether the word ‘saw’ was used as a verb or a noun.


### Pros of Lemmatization:

### Cons of Lemmatization:




## Bag-of-words


After forming the corpus, the bag-of-words model can be built!

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

def bagOfWords(corpus):
    # Create bag-of-words model
    data = []
    for (title, abstract) in corpus:
        data.append(title + abstract)

    vectorizer = CountVectorizer(max_features = 1500)
    X = vectorizer.fit_transform(data).toarray()
    y = train.iloc[:,-1].values

    # print(X.shape, y.shape)
    # print(vectorizer.get_feature_names())
    return X,y



## Splitting the training dataset 

In order to create a test dataset where the accuracy of the models can be evaluated, the training dataset must be split into a smaller training dataset and a test dataset, in order to compare the model predictions to the correct topic modelling answer.

In [14]:
# Split dataset into training and test sets
from sklearn.model_selection import train_test_split

def splittingDataset(X,y):

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

    # print(X_train.shape, y_train.shape)
    # print(X_test.shape, y_test.shape)

    return X_train, X_test, y_train, y_test 

## SVM

In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. SVMs are one of the most robust prediction methods, being based on statistical learning frameworks or VC theory. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

In [15]:
from sklearn.svm import SVC

def SVM(X_train, y_train, X_test):
    classifier = SVC()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

    return y_pred


## Decision Trees

Decision tree learning is one of the predictive modelling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity.

In [16]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# import graphviz

def decisionTree(X_train, y_train, X_test):
    classifier = DecisionTreeClassifier(min_samples_leaf=10)
    model = classifier.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    #Visualizing the decision tree
    # dot_data = tree.export_graphviz(model, feature_names=list(X), class_names=sorted(y.unique()), filled=True)
    # graphviz.Source(dot_data)

    return y_pred 


## Naïve Bayes

Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for solving classification problems. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object. It is mainly used in text classification that includes a high-dimensional training dataset. This algorithm is one of the simple and most effective classification algorithms which helps in building the fast machine learning models that can make quick predictions.

In [17]:
from sklearn.naive_bayes import GaussianNB

def naiveBayes(X_train, y_train, X_test):
    classifier = GaussianNB()
    model = classifier.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    return y_pred

# Running the code


In [18]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

def evaluatePerformance(y_test, y_pred):
    print(confusion_matrix(y_test, y_pred))
    print('Accuracy: ', accuracy_score(y_test, y_pred))
    print('Precision: ', precision_score(y_test, y_pred))
    print('Recall: ', recall_score(y_test, y_pred))
    print('F1: ', f1_score(y_test, y_pred))

## SVM

### SVM with Porter Stemming

In [19]:
corpus = porterStemming()
X,y = bagOfWords(corpus)
X_train, X_test, y_train, y_test = splittingDataset(X,y)
y_pred = SVM(X_train, y_train, X_test)
evaluatePerformance(y_test, y_pred)

KeyboardInterrupt: 

### SVM with Lancaster Stemming

In [None]:
corpus = lancasterStemming()
X,y = bagOfWords(corpus)
X_train, X_test, y_train, y_test = splittingDataset(X,y)
y_pred = SVM(X_train, y_train, X_test)
evaluatePerformance(y_test, y_pred)

### SVM with Snowball Stemming

In [48]:
corpus = snowballStemming()
X,y = bagOfWords(corpus)
X_train, X_test, y_train, y_test = splittingDataset(X,y)
y_pred = SVM(X_train, y_train, X_test)
evaluatePerformance(y_test, y_pred)

[[4144    1]
 [  50    0]]
Accuracy:  0.9878426698450536
Precision:  0.0
Recall:  0.0
F1:  0.0


### SVM with Lemmatization

In [55]:
corpus = lemmatization()
X,y = bagOfWords(corpus)
X_train, X_test, y_train, y_test = splittingDataset(X,y)
y_pred = SVM(X_train, y_train, X_test)
evaluatePerformance(y_test, y_pred)

[[4145    0]
 [  50    0]]
Accuracy:  0.9880810488676997
Precision:  0.0
Recall:  0.0
F1: 0.0
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


## Decision Trees

### Decision Trees with Porter Stemming

In [None]:
corpus = porterStemming()
X,y = bagOfWords(corpus)
X_train, X_test, y_train, y_test = splittingDataset(X,y)
y_pred = decisionTree(X_train, y_train, X_test)
evaluatePerformance(y_test, y_pred)

### Decision Trees with Lancaster Stemming

In [51]:
corpus = lancasterStemming()
X,y = bagOfWords(corpus)
X_train, X_test, y_train, y_test = splittingDataset(X,y)
y_pred = decisionTree(X_train, y_train, X_test)
evaluatePerformance(y_test, y_pred)

[[4130   15]
 [  35   15]]
Accuracy:  0.9880810488676997
Precision:  0.5
Recall:  0.3
F1:  0.37499999999999994


### Decision Trees with Snowball Stemming

In [52]:
corpus = snowballStemming()
X,y = bagOfWords(corpus)
X_train, X_test, y_train, y_test = splittingDataset(X,y)
y_pred = decisionTree(X_train, y_train, X_test)
evaluatePerformance(y_test, y_pred)

[[4133   12]
 [  42    8]]
Accuracy:  0.9871275327771156
Precision:  0.4
Recall:  0.16
F1:  0.22857142857142856


### Decision Trees with Lemmatization

In [None]:
corpus = lemmatization()
X,y = bagOfWords(corpus)
X_train, X_test, y_train, y_test = splittingDataset(X,y)
y_pred = decisionTree(X_train, y_train, X_test)
evaluatePerformance(y_test, y_pred)

## Naïve Bayes



### Naïve Bayes with Porter Stemming


In [None]:
corpus = porterStemming()
X,y = bagOfWords(corpus)
X_train, X_test, y_train, y_test = splittingDataset(X,y)
y_pred = decisionTree(X_train, y_train, X_test)
evaluatePerformance(y_test, y_pred)

### Naïve Bayes with Lancaster Stemming

In [None]:
corpus = lancasterStemming()
X,y = bagOfWords(corpus)
X_train, X_test, y_train, y_test = splittingDataset(X,y)
y_pred = decisionTree(X_train, y_train, X_test)
evaluatePerformance(y_test, y_pred)

### Naïve Bayes with Snowball Stemming

In [None]:
corpus = snowballStemming()
X,y = bagOfWords(corpus)
X_train, X_test, y_train, y_test = splittingDataset(X,y)
y_pred = naiveBayes(X_train, y_train, X_test)
evaluatePerformance(y_test, y_pred)

### Naïve Bayes with Lemmatization

In [None]:
corpus = lemmatization()
X,y = bagOfWords(corpus)
X_train, X_test, y_train, y_test = splittingDataset(X,y)
y_pred = naiveBayes(X_train, y_train, X_test)
evaluatePerformance(y_test, y_pred)

In [None]:
# Performance Comparison Between Different Approaches

# References


- https://www.geeksforgeeks.org/introduction-to-stemming/
- https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
- https://en.wikipedia.org/wiki/Support-vector_machine
- https://en.wikipedia.org/wiki/Decision_tree_learning
- https://www.javatpoint.com/machine-learning-naive-bayes-classifier
