## Welcome to Trio de Informática's Second IART Project about Topic Modelling using NLP

Throughout this notebook you will see a step-by-step analysis of the training dataset, as well as different approaches to this challenge. Several algorithms are implemented, such as Naive Bayes, decision trees and SVM. Different data processing techniques may suit different algorithms, so to test the combinations we provide several cells that present the results.
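As a rough illustration of what testing the combinations can look like, the sketch below pairs two text representations with the three classifier families named above. The toy texts, labels and the `results` dictionary are assumptions made up for this example, and the score is training accuracy only, so it says nothing about how the models generalise.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data (assumption): 1 = computer science, 0 = anything else.
texts = [
    "neural network training for image data",
    "graph algorithm complexity analysis",
    "galaxy cluster dark matter observation",
    "quantum phase transition in magnetic lattice",
    "deep learning model for text classification",
    "stellar wave propagation in giant planet",
]
labels = [1, 1, 0, 0, 1, 0]

# Factories, so every combination gets fresh, unfitted objects.
vectorizers = {"counts": CountVectorizer, "tfidf": TfidfVectorizer}
classifiers = {
    "naive_bayes": MultinomialNB,
    "decision_tree": lambda: DecisionTreeClassifier(random_state=0),
    "svm": lambda: SVC(kernel="linear"),
}

results = {}
for (vec_name, make_vec), (clf_name, make_clf) in product(
    vectorizers.items(), classifiers.items()
):
    model = make_pipeline(make_vec(), make_clf())
    model.fit(texts, labels)
    # Training accuracy only; a real comparison needs a held-out split.
    results[(vec_name, clf_name)] = model.score(texts, labels)

for combo, accuracy in results.items():
    print(combo, round(accuracy, 2))
```

Keeping the vectorizer and the classifier inside one pipeline means each combination is fitted from scratch, so no state leaks between runs.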

To give better insight into the training dataset, we start with a statistical overview of the provided data.

In [9]:
import pandas as pd
import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

# Importing the dataset
dataset = pd.read_csv('archive/train_smaller.csv')

print(dataset)

        ID                                              TITLE  \
0        1        Reconstructing Subject-Specific Effect Maps   
1        2                 Rotation Invariance Neural Network   
2        3  Spherical polyharmonics and Poisson kernels fo...   
3        4  A finite element approximation for the stochas...   
4        5  Comparative study of Discrete Wavelet Transfor...   
...    ...                                                ...   
4994  4995  Model Order Selection Rules For Covariance Str...   
4995  4996  Enhancing Interpretability of Black-box Soft-m...   
4996  4997  CollaGAN : Collaborative GAN for Missing Image...   
4997  4998                       A Kuroda-style j-translation   
4998  4999  Electrical 2π phase control of infrared light ...   

                                               ABSTRACT  Computer Science  \
0       Predictive models allow subject-specific inf...                 1   
1       Rotation invariance and translation invarian...          

In [10]:
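# Note: `dataset.info` without parentheses is the bound method itself, so str() of it
# embeds the whole frame. Calling dataset.info() prints the column summary instead.
print("Info:" + str(dataset.info))
print("Shape: " + str(dataset.shape))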

Info:<bound method DataFrame.info of         ID                                              TITLE  \
0        1        Reconstructing Subject-Specific Effect Maps   
1        2                 Rotation Invariance Neural Network   
2        3  Spherical polyharmonics and Poisson kernels fo...   
3        4  A finite element approximation for the stochas...   
4        5  Comparative study of Discrete Wavelet Transfor...   
...    ...                                                ...   
4994  4995  Model Order Selection Rules For Covariance Str...   
4995  4996  Enhancing Interpretability of Black-box Soft-m...   
4996  4997  CollaGAN : Collaborative GAN for Missing Image...   
4997  4998                       A Kuroda-style j-translation   
4998  4999  Electrical 2π phase control of infrared light ...   

                                               ABSTRACT  Computer Science  \
0       Predictive models allow subject-specific inf...                 1   
1       Rotation invariance 
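A structural summary is often more useful at this stage than the full frame. The sketch below is only an illustration on a made-up three-row frame (the `toy` frame and its values are assumptions, not rows from the training file). It shows `DataFrame.info()` called as a method, the shape, and a per-column count of missing values.

```python
import pandas as pd

# Made-up stand-in for the training frame (assumption, not real data).
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "TITLE": ["first title", "second title", "third title"],
    "ABSTRACT": ["first abstract", None, "third abstract"],
})

# info() prints column names, non-null counts and dtypes; it returns None.
toy.info()

# shape is an attribute: (number of rows, number of columns).
print("Shape:", toy.shape)

# Missing values per column, useful before any text processing.
missing = toy.isna().sum()
print(missing)
```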

In [11]:
labels = ['Computer Science', 'Physics', 'Mathematics', 'Statistics',
          'Quantitative Biology', 'Quantitative Finance']

print('As count:\n')
for label in labels:
    # Each topic column holds 0/1 flags, so the sum is the number of papers
    print(label + ': ', dataset[label].sum())

print('\nAs a percentage:\n')
for label in labels:
    print(label + ': ', round(dataset[label].sum() / dataset.shape[0] * 100), '%')

As count:

Computer Science:  2080
Physics:  1391
Mathematics:  1347
Statistics:  1298
Quantitative Biology:  146
Quantitative Finance:  64

As a percentage:

Computer Science:  42 %
Physics:  28 %
Mathematics:  27 %
Statistics:  26 %
Quantitative Biology:  3 %
Quantitative Finance:  1 %
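
The counts above add up to 6326, more than the 4999 rows in the frame, so some papers carry more than one topic. A minimal sketch of how the same overview, plus the number of topics per paper, could be computed in a vectorised way is shown below. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the training frame (assumption, not real data).
LABELS = ["Computer Science", "Physics", "Mathematics"]
toy = pd.DataFrame({
    "TITLE": ["a", "b", "c", "d"],
    "Computer Science": [1, 0, 1, 0],
    "Physics": [0, 1, 0, 0],
    "Mathematics": [1, 0, 0, 1],
})

# Papers per topic: column-wise sum of the 0/1 flags.
counts = toy[LABELS].sum()

# Share of papers per topic, rounded to whole percent.
percent = (toy[LABELS].mean() * 100).round().astype(int)

# Topics per paper: row-wise sum of the same flags.
topics_per_doc = toy[LABELS].sum(axis=1)

summary = pd.DataFrame({"count": counts, "percent": percent})
print(summary)
print("Papers with more than one topic:", int((topics_per_doc > 1).sum()))
```

Because the topics overlap, the percentages are not expected to sum to 100.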


In [12]:
def preProcessStem():

    corpus = []
    # Initialize PorterStemmer
    ps = PorterStemmer()
    # Build the stop word set once instead of once per token
    # (needs the NLTK 'stopwords' corpus, e.g. nltk.download('stopwords'))
    stop_words = set(stopwords.words('english'))

    for i in range(0, dataset.shape[0]):
        # get title and abstract and replace non-alphabetic chars with spaces
        title = re.sub('[^a-zA-Z]', ' ', dataset['TITLE'][i])
        abstract = re.sub('[^a-zA-Z]', ' ', dataset['ABSTRACT'][i])
        # to lower-case and tokenize
        title = title.lower().split()
        abstract = abstract.lower().split()
        # stemming and stop word removal
        title = ' '.join([ps.stem(w) for w in title if w not in stop_words])
        abstract = ' '.join([ps.stem(w) for w in abstract if w not in stop_words])
        corpus.append((title, abstract))
        print((title, abstract))
    return corpus

count complet corioli acceler latitud thu gener previou work find transmiss incid intern wave strongli affect presenc densiti staircas even wave initi pure inerti wave restor corioli acceler particular low frequenc wave wavelength perfectli transmit near critic latitud otherwis short wavelength wave effici transmit reson free mode interfaci graviti wave short wavelength inerti mode staircas case wave primarili reflect unless wavelength longer vertic extent entir staircas singl step expect incid intern wave strongli affect presenc densiti staircas frequenc latitud wavelength depend manner first could lead new criteria probe interior giant planet seismolog second may import consequ tidal dissip understand evolut giant planet system')
('new weber type integr equat relat weber titchmarsh problem', 'deriv solvabl condit close form solut weber type integr equat relat familiar weber orr integr transform old weber titchmarsh problem pose proc lond math soc pp recent solv author method involv p

In [17]:
from nltk.stem import WordNetLemmatizer

def preProcessLem():
    corpus = []
    # Initialize WordNet lemmatizer
    # (needs the NLTK 'wordnet' corpus, e.g. nltk.download('wordnet'))
    lemmatizer = WordNetLemmatizer()
    # Build the stop word set once instead of once per token
    stop_words = set(stopwords.words('english'))

    for i in range(0, dataset.shape[0]):
        # get title and abstract and replace non-alphabetic chars with spaces
        title = re.sub('[^a-zA-Z]', ' ', dataset['TITLE'][i])
        abstract = re.sub('[^a-zA-Z]', ' ', dataset['ABSTRACT'][i])
        # to lower-case and tokenize
        title = title.lower().split()
        abstract = abstract.lower().split()
        # lemmatization and stop word removal
        title = ' '.join([lemmatizer.lemmatize(w) for w in title if w not in stop_words])
        abstract = ' '.join([lemmatizer.lemmatize(w) for w in abstract if w not in stop_words])
        corpus.append((title, abstract))
        print((title, abstract))
    return corpus

corpus = preProcessLem()

 as r rm cd mathfrak r mathfrak p dim')
('translation matrix element spherical gauss laguerre basis function', 'spherical gauss laguerre sgl basis function e normalized function type l n l l r r l lm vartheta varphi leq l n mathbb n constitute orthonormal polynomial basis space l mathbb r radial gaussian weight exp r recently described reliable fast fourier transforms sgl basis function main application sgl basis function fast algorithm solving certain three dimensional rigid matching problem center prioritized periphery purpose called sgl translation matrix element required describe spectral behavior sgl basis function translation paper derive closed form expression translation matrix element allowing direct computation quantity practice')
('theoretical analysis extending frequency bin entanglement photon photon atom photon hybrid system', 'inspired recent development research atom photon quantum interface energy time entanglement single photon pulse propose establish concept special 
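
Both preprocessing functions share the same cleaning steps and differ only in how each token is normalised. The sketch below isolates those shared steps in one hypothetical helper, `clean_text`, which is not part of the notebook. The tiny stop word list is an assumption standing in for `stopwords.words('english')`, and the default `normalize` leaves tokens unchanged so the example runs without any NLTK data.

```python
import re

# Tiny stand-in stop word list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = frozenset({"a", "an", "the", "of", "for", "and", "in", "on"})


def clean_text(text, normalize=lambda word: word, stop_words=STOP_WORDS):
    """Keep letters only, lower-case, drop stop words, then normalise each token."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(normalize(word) for word in tokens if word not in stop_words)


cleaned = clean_text("A Kuroda-style j-translation")
print(cleaned)  # kuroda style j translation
```

Passing `PorterStemmer().stem` or `WordNetLemmatizer().lemmatize` as `normalize` would give the stemming and lemmatization variants respectively.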

In [20]:
data = []
for (title, abstract) in corpus:
    # Join with a space so the last title word and first abstract word stay separate tokens
    data.append(title + ' ' + abstract)

In [21]:
# Create bag-of-words model

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features = 1500)
X = vectorizer.fit_transform(data).toarray()
# Target: only the last column of the frame, i.e. a single binary label
y = dataset.iloc[:,-1].values

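# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())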
print(X.shape, y.shape)

['ability', 'able', 'absorption', 'accelerated', 'acceleration', 'access', 'according', 'account', 'accuracy', 'accurate', 'accurately', 'achieve', 'achieved', 'achieves', 'achieving', 'acoustic', 'across', 'action', 'active', 'activity', 'adaptation', 'adaptive', 'addition', 'additional', 'additionally', 'address', 'advance', 'advantage', 'adversarial', 'affect', 'age', 'agent', 'agreement', 'aim', 'al', 'algebra', 'algebraic', 'algorithm', 'alignment', 'allocation', 'allow', 'allowing', 'allows', 'almost', 'along', 'alpha', 'already', 'also', 'alternative', 'although', 'always', 'among', 'amount', 'amplitude', 'analysis', 'analytic', 'analytical', 'analyze', 'analyzed', 'analyzing', 'angle', 'angular', 'anomaly', 'another', 'answer', 'appear', 'applicable', 'application', 'applied', 'apply', 'applying', 'approach', 'appropriate', 'approximate', 'approximately', 'approximation', 'arbitrary', 'architecture', 'area', 'argument', 'arithmetic', 'around', 'array', 'art', 'article', 'artifi