<a href="https://colab.research.google.com/github/danielmcarthur/fit5212-assignment-1/blob/main/FIT5212_Assignment_1_31421393.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FIT5212 Assignment 1
Author: Daniel McArthur | 31421393 | dmca0006@student.monash.edu

## Contents

1. [Introduction](#1introduction)
2. [Part 1: Text Classification](#2textclassification) <br>
  2.1 [Text Pre-Processing](#2.1textpre)


<a id='1introduction'></a>

## 1. Introduction

This assignment has two parts: a Text Classification task and a Topic Modelling task. The models I create will be trained on a dataset of 55,000 computer science abstracts from 1990 to 2014, sourced from arXiv.org. They will then be tested on a smaller dataset from 2015 from the same source.

**Part 1: Text Classification**

Each computer science article in the dataset can carry any combination of three tags: *InfoTheory*, *CompVis* and *Math*. The `Abstract` text must be pre-processed before training. To compare two different methods, I will vary both the pre-processing and the model used for the text classifier and compare the results.

**Part 2: Topic Modelling**

The same training dataset from Part 1 will be used in this task for Latent Dirichlet Allocation (LDA) Topic Modelling. The text will be pre-processed and prepared appropriately, and the model will be built with the `gensim.models.LdaModel` class. Again, two variations of the text pre-processing and LDA topic configuration will be set up and compared.

<a id='2textclassification'></a>
## 2. Part 1: Text Classification

In [13]:
# import our libraries
import pandas as pd
import numpy as np
import re
# pre-processing
from nltk.corpus import stopwords
from nltk import word_tokenize    
from nltk.tokenize import wordpunct_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# download the NLTK resources used below (only needed once per environment)
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
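# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
train = pd.read_csv('axcs_train.csv', engine="python", sep=',', quotechar='"', on_bad_lines='skip')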

<a id='2.1textpre'></a>
## 2.1 Text Pre-Processing
In order to train our text classification model, appropriate text pre-processing must be conducted. To compare how the eventual model is affected by different pre-processing techniques, I will create two different configurations: **Limited Pre-Processing (LPP)** and **Extensive Pre-Processing (EPP)**. 

**LPP** will comprise:
1. Removal of non-alpha text
2. Conversion to lower case
3. Tokenisation
4. Removal of stopwords
5. Word stemming/lemmatisation
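
As a rough illustration, the five LPP steps could be chained as below. This is a minimal sketch rather than the final pipeline. `limited_preprocess` and the tiny `STOPWORDS` set are hypothetical names introduced here: the set stands in for `stopwords.words('english')` so the sketch runs without downloading NLTK data, `wordpunct_tokenize` is used in place of `word_tokenize` for the same reason, and the Porter stemmer is one assumed choice for step 5.

```python
import re

from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical stand-in for stopwords.words('english'), kept tiny on purpose
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "this", "with"}

stemmer = PorterStemmer()


def limited_preprocess(text):
    """Apply the five LPP steps to one abstract and return its tokens."""
    text = re.sub(r"[^A-Za-z]+", " ", text)              # 1. remove non-alpha text
    text = text.lower()                                  # 2. convert to lower case
    tokens = wordpunct_tokenize(text)                    # 3. tokenise
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. remove stopwords
    return [stemmer.stem(t) for t in tokens]             # 5. stem each token


print(limited_preprocess("The Programs and the Agents"))
```

Applied to the whole column, this would look like `train.Abstract.apply(limited_preprocess)`.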

**EPP** will comprise:
1. Removal of non-alpha text
2. Conversion to lower case
3. Tokenisation
4. Removal of stopwords
5. Removal of numbers
6. TF-IDF for term importance in corpus
7. Top 1000 bi-gram inclusion
8. Word stemming/lemmatisation
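
One possible way to combine the TF-IDF and bi-gram steps of EPP with scikit-learn is sketched below. It rests on several assumptions: the toy `docs` list stands in for the `Abstract` column, the "top" bi-grams are taken to be the most frequent ones in the corpus, and scikit-learn's built-in English stopword list is used instead of NLTK's. Stemming/lemmatisation (step 8) is left out to keep the sketch short. The names `common`, `top_bigrams` and `vocab` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for the Abstract column (assumption, not real data)
docs = [
    "Neural networks for image classification",
    "Image classification with convolutional neural networks",
    "Channel capacity bounds in information theory",
]

# Steps 1-5: lower-case, keep purely alphabetic tokens of 2+ letters
# (which also drops numbers), and remove English stopwords
common = dict(
    lowercase=True,
    stop_words="english",
    token_pattern=r"(?u)\b[a-zA-Z]{2,}\b",
)

# Step 7: keep at most the 1000 most frequent bi-grams in the corpus
bigram_counter = CountVectorizer(ngram_range=(2, 2), max_features=1000, **common)
bigram_counter.fit(docs)
top_bigrams = set(bigram_counter.vocabulary_)

# Collect every unigram, then merge with the selected bi-grams
unigram_counter = CountVectorizer(ngram_range=(1, 1), **common)
unigram_counter.fit(docs)
vocab = sorted(set(unigram_counter.vocabulary_) | top_bigrams)

# Step 6: TF-IDF weights over the fixed unigram + bi-gram vocabulary
tfidf = TfidfVectorizer(ngram_range=(1, 2), vocabulary=vocab, **common)
X = tfidf.fit_transform(docs)

print(X.shape)  # (number of documents, vocabulary size)
```

Fixing the vocabulary up front keeps the bi-gram cap separate from the unigram features; setting `max_features` on a single `(1, 2)` vectoriser would instead cap unigrams and bi-grams together.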





```
Steps: 
1. Text Pre-processing
2. Train the model

```



In [14]:
# Callable tokeniser: splits a document with word_tokenize and lemmatises each token with WordNet
class LemmaTokenizerWordnet(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
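
A callable tokeniser of this shape can be handed to a scikit-learn vectoriser through its `tokenizer` argument. The sketch below shows the pattern with a hypothetical `StemTokenizer` that has the same interface as `LemmaTokenizerWordnet` but uses a Porter stemmer and `wordpunct_tokenize`, purely so that it runs without any NLTK data downloads. The two example sentences are made up for illustration.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize
from sklearn.feature_extraction.text import CountVectorizer


class StemTokenizer:
    """Hypothetical stand-in with the same callable shape as LemmaTokenizerWordnet."""

    def __init__(self):
        self.stemmer = PorterStemmer()

    def __call__(self, doc):
        # keep alphabetic tokens only, then stem each one
        return [self.stemmer.stem(t) for t in wordpunct_tokenize(doc) if t.isalpha()]


# token_pattern=None because the custom tokenizer replaces the default pattern
vectorizer = CountVectorizer(tokenizer=StemTokenizer(), token_pattern=None)
X = vectorizer.fit_transform(["Learning logic programs", "Agents learn programs"])

print(sorted(vectorizer.vocabulary_))
```

Swapping in `LemmaTokenizerWordnet()` for `StemTokenizer()` should work the same way once the `punkt` and `wordnet` resources are available.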

In [23]:
# take rows 10-19 of the Abstract column as a small test sample
test = train.Abstract[10:20]
test

10     Software Agents: Completing Patterns and Cons...
11     The Difficulties of Learning Logic Programs w...
12     Decidable Reasoning in Terminological Knowled...
13     Mini-indexes for literate programs This paper...
14     Teleo-Reactive Programs for Agent Control A f...
15     Bias-Driven Revision of Logical Domain Theori...
16     Learning the Past Tense of English Verbs: The...
17     Substructure Discovery Using Minimum Descript...
18     Exploring the Decision Forest: An Empirical I...
19     An Alternative Conception of Tree-Adjoining D...
Name: Abstract, dtype: object

In [27]:
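# Tokenise a lower-cased copy of each abstract. Series.apply returns a new Series;
# rebinding the loop variable inside a plain for-loop would discard the tokens.
test_tokens = test.apply(lambda x: word_tokenize(x.lower()))
# `test` itself is unchanged; the tokens live in test_tokens
test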

10     Software Agents: Completing Patterns and Cons...
11     The Difficulties of Learning Logic Programs w...
12     Decidable Reasoning in Terminological Knowled...
13     Mini-indexes for literate programs This paper...
14     Teleo-Reactive Programs for Agent Control A f...
15     Bias-Driven Revision of Logical Domain Theori...
16     Learning the Past Tense of English Verbs: The...
17     Substructure Discovery Using Minimum Descript...
18     Exploring the Decision Forest: An Empirical I...
19     An Alternative Conception of Tree-Adjoining D...
Name: Abstract, dtype: object
