## Welcome to Trio de Informática's Second IART Project about Topic Modelling using NLP

This notebook walks through a step-by-step analysis of the training dataset and several approaches to this challenge. Several algorithms are implemented, such as Naive Bayes, decision trees and SVM. Different data processing techniques suit different algorithms, so to test the combinations we provide separate cells that present each result.

To give better insight into the training dataset, we start with a statistical overview of the provided data.

In [28]:
import pandas as pd
import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

# Importing the dataset
dataset = pd.read_csv('archive/train.csv')

print(dataset)

          ID                                              TITLE  \
0          1        Reconstructing Subject-Specific Effect Maps   
1          2                 Rotation Invariance Neural Network   
2          3  Spherical polyharmonics and Poisson kernels fo...   
3          4  A finite element approximation for the stochas...   
4          5  Comparative study of Discrete Wavelet Transfor...   
...      ...                                                ...   
20967  20968  Contemporary machine learning: a guide for pra...   
20968  20969  Uniform diamond coatings on WC-Co hard alloy c...   
20969  20970  Analysing Soccer Games with Clustering and Con...   
20970  20971  On the Efficient Simulation of the Left-Tail o...   
20971  20972   Why optional stopping is a problem for Bayesians   

                                                ABSTRACT  Computer Science  \
0        Predictive models allow subject-specific inf...                 1   
1        Rotation invariance and transl

In [29]:
# DataFrame.info() prints its summary directly and returns None,
# so it is called on its own instead of being concatenated into a string.
print("Info:")
dataset.info()
print("Shape: " + str(dataset.shape))

In [30]:
# Topic label columns, in the order they are reported
label_columns = ['Computer Science', 'Physics', 'Mathematics', 'Statistics',
                 'Quantitative Biology', 'Quantitative Finance']

print('As count:\n')
for label in label_columns:
    print(label + ': ', dataset[label].sum())

print('\nAs a percentage:\n')
for label in label_columns:
    print(label + ': ', round(dataset[label].sum() / dataset.shape[0] * 100), '%')

As count:

Computer Science:  8594
Physics:  6013
Mathematics:  5618
Statistics:  5206
Quantitative Biology:  587
Quantitative Finance:  249

As a percentage:

Computer Science:  41 %
Physics:  29 %
Mathematics:  27 %
Statistics:  25 %
Quantitative Biology:  3 %
Quantitative Finance:  1 %
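
The counts above add up to 26267 labels over 20972 rows (and the percentages to more than 100 %), so some papers must carry more than one topic. A minimal sketch of how the number of topics per paper could be tallied, using a small made-up DataFrame (the rows are hypothetical; only the column names mirror the dataset):

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror train.csv.
label_columns = ['Computer Science', 'Physics', 'Mathematics', 'Statistics',
                 'Quantitative Biology', 'Quantitative Finance']
toy = pd.DataFrame(
    [[1, 0, 0, 1, 0, 0],
     [0, 1, 0, 0, 0, 0],
     [1, 0, 1, 1, 0, 0],
     [0, 0, 0, 0, 0, 1]],
    columns=label_columns,
)

# Number of topics assigned to each paper.
labels_per_row = toy[label_columns].sum(axis=1)

# How many papers have 1, 2, 3, ... topics.
cardinality = labels_per_row.value_counts().sort_index()
print(cardinality)
```

Running the same two lines on `dataset` instead of `toy` would show how often topics overlap in the real data.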


In [31]:
def preProcessStem():

    corpus = []
    # Initialize PorterStemmer
    ps = PorterStemmer()
    # Build the stop-word set once instead of once per token
    # (requires the NLTK 'stopwords' corpus to be downloaded)
    stop_words = set(stopwords.words('english'))

    for i in range(0, dataset.shape[0]):
        # get title and abstract, replacing non-alphabetic chars with spaces
        title = re.sub('[^a-zA-Z]', ' ', dataset['TITLE'][i])
        abstract = re.sub('[^a-zA-Z]', ' ', dataset['ABSTRACT'][i])
        # to lower-case and tokenize
        title = title.lower().split()
        abstract = abstract.lower().split()
        # stemming and stop word removal
        title = ' '.join([ps.stem(w) for w in title if w not in stop_words])
        abstract = ' '.join([ps.stem(w) for w in abstract if w not in stop_words])
        corpus.append((title, abstract))
        print((title, abstract))
    return corpus

In [32]:
from nltk.stem import WordNetLemmatizer

def preProcessLem():
    corpus = []
    # Initialize WordNet lemmatizer
    lemmatizer = WordNetLemmatizer()
    # Build the stop-word set once instead of once per token
    # (requires the NLTK 'stopwords' and 'wordnet' corpora to be downloaded)
    stop_words = set(stopwords.words('english'))

    for i in range(0, dataset.shape[0]):
        # get title and abstract, replacing non-alphabetic chars with spaces
        title = re.sub('[^a-zA-Z]', ' ', dataset['TITLE'][i])
        abstract = re.sub('[^a-zA-Z]', ' ', dataset['ABSTRACT'][i])
        # to lower-case and tokenize
        title = title.lower().split()
        abstract = abstract.lower().split()
        # lemmatization and stop word removal
        title = ' '.join([lemmatizer.lemmatize(w) for w in title if w not in stop_words])
        abstract = ' '.join([lemmatizer.lemmatize(w) for w in abstract if w not in stop_words])
        corpus.append((title, abstract))
        print((title, abstract))
    return corpus

corpus = preProcessLem()

 'overset method commonly employed enable effective simulation problem involving complex geometry moving object rotorcraft paper present novel overset domain connectivity algorithm based upon direct cut approach suitable use gpu accelerated solver high order curved grid contrast previous method capable exploiting highly data parallel nature modern accelerator approach also substantially efficient handling curved grid arise within context high order method implementation new algorithm presented combined high order fluid dynamic code algorithm validated several benchmark problem including flow spinning golf ball reynolds number')
('data assimilation algorithm paradigm leray alpha model turbulence', 'paper survey various implementation new data assimilation downscaling algorithm based spatial coarse mesh measurement paradigm demonstrate application algorithm leray alpha subgrid scale turbulence model importantly use paradigm show always necessary one collect coarse mesh measurement state 
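
The stemming and lemmatization functions differ only in how each token is normalized. A minimal sketch of one way the shared steps could be factored out, with the normalizer passed in as a function. The names `clean_text`, `preprocess` and `STOP_WORDS` are hypothetical, the stop-word list is a tiny stand-in so the sketch runs without NLTK data, and the example abstract is made up:

```python
import re

# Hypothetical, tiny stop-word list so the sketch runs without NLTK data;
# the notebook uses stopwords.words('english') instead.
STOP_WORDS = {'a', 'an', 'the', 'of', 'for', 'and', 'in', 'on', 'with'}


def clean_text(text, normalize, stop_words=STOP_WORDS):
    """Replace non-letters with spaces, lower-case, drop stop words, normalize tokens."""
    tokens = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(normalize(w) for w in tokens if w not in stop_words)


def preprocess(rows, normalize):
    """Apply clean_text to every (title, abstract) pair."""
    return [(clean_text(t, normalize), clean_text(a, normalize)) for t, a in rows]


# Identity normalizer for illustration; a stemmer's stem method or a
# lemmatizer's lemmatize method could be passed in its place.
rows = [('Analysing Soccer Games with Clustering',
         'We cluster 2-D positions of the players.')]
result = preprocess(rows, normalize=lambda w: w)
print(result)
```

With this shape, switching between the two approaches is a one-argument change, which makes it easier to compare them against each classifier.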

In [33]:
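# Join title and abstract with a space so the last title word
# and the first abstract word are not fused into a single token
data = []
for (title, abstract) in corpus:
    data.append(title + ' ' + abstract)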

In [34]:
# Create bag-of-words model

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features = 1500)
X = vectorizer.fit_transform(data).toarray()
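# Target: only the last column of train.csv, i.e. a single binary label
y = dataset.iloc[:,-1].values

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())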
print(X.shape, y.shape)

['ability', 'able', 'absorption', 'abstract', 'abundance', 'access', 'according', 'account', 'accuracy', 'accurate', 'accurately', 'achieve', 'achieved', 'achieves', 'achieving', 'acoustic', 'across', 'action', 'activation', 'active', 'activity', 'ad', 'adaptation', 'adaptive', 'addition', 'additional', 'additionally', 'address', 'advance', 'advantage', 'adversarial', 'affect', 'affine', 'age', 'agent', 'agreement', 'aim', 'al', 'algebra', 'algebraic', 'algorithm', 'alignment', 'allocation', 'allow', 'allowing', 'allows', 'almost', 'along', 'alpha', 'already', 'also', 'alternative', 'although', 'always', 'among', 'amount', 'amplitude', 'analysis', 'analytic', 'analytical', 'analyze', 'analyzed', 'analyzing', 'angle', 'angular', 'anisotropic', 'anisotropy', 'anomaly', 'another', 'answer', 'appear', 'applicable', 'application', 'applied', 'apply', 'applying', 'approach', 'appropriate', 'approximate', 'approximately', 'approximation', 'arbitrary', 'architecture', 'area', 'argument', 'aris

In [35]:
# Split dataset into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(16777, 1500) (16777,)
(4195, 1500) (4195,)
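
Because the target label is rare, a plain random split can leave the test set with more or fewer positives than the overall rate. A minimal sketch of a stratified alternative, on made-up labels (the `toy_` arrays are hypothetical; `stratify` is a standard `train_test_split` parameter):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10 positives among 100 samples.
toy_y = np.array([1] * 10 + [0] * 90)
toy_X = np.arange(100).reshape(-1, 1)

# stratify keeps the positive/negative ratio the same in both splits.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=0, stratify=toy_y
)

print(toy_y_train.sum(), toy_y_test.sum())
```

Applying this to the notebook's split would change which rows land in each set, so the scores reported further down would not be directly comparable.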


In [36]:
# imports

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [37]:


# SVM

from sklearn.svm import SVC

classifier = SVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Labels present in y_test that the classifier never predicted (empty set = none)
print(set(y_test) - set(y_pred))

print(confusion_matrix(y_test, y_pred))
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('F1: ', f1_score(y_test, y_pred))

set()
[[4144    1]
 [  45    5]]
Accuracy:  0.9890345649582837
Precision:  0.8333333333333334
Recall:  0.1
F1:  0.17857142857142858
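
The accuracy looks high mainly because only 50 of the 4195 test samples are positive. A short worked computation from the confusion matrix above shows how each score is derived, together with the accuracy of a trivial baseline that always predicts the negative class (the baseline is added here for comparison and is not part of the original run):

```python
# Confusion matrix reported above: rows are true classes, columns are predictions.
tn, fp = 4144, 1
fn, tp = 45, 5
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial baseline that always predicts the negative class.
baseline_accuracy = (tn + fp) / total

print('Accuracy: ', accuracy)
print('Precision: ', precision)
print('Recall: ', recall)
print('F1: ', f1)
print('Baseline accuracy: ', baseline_accuracy)
```

The baseline reaches about 0.988 accuracy without finding a single positive, so recall and F1 are the more informative numbers for this label.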
