# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Wing Hang CHAN
#### Student ID: 3939713

Date: 15-Sep-2022

Version: 1.0

Environment: Python 3 and Jupyter notebook

## Introduction
The data set contains 776 job advertisements in 4 different folders. The class ```JobAd``` stores information about each advertisement, such as the file path, raw data, job category and content dictionary. It provides functions for loading raw data, tokenizing the description, counting tokens, etc. The class can be re-used in Task 2 & 3. This notebook follows the assignment requirements and saves the vocabulary to the file vocab.txt.

## Importing libraries

The internal modules import the following external libraries:
1. from itertools import chain
1. from nltk.tokenize import RegexpTokenizer
1. from nltk.tokenize import sent_tokenize
1. import re
1. import os
1. import numpy as np
1. import pandas as pd
1. from nltk.probability import *
1. from scipy.sparse import csr_matrix
1. from sklearn.linear_model import LogisticRegression
1. from sklearn.model_selection import KFold

In [1]:
# internal module
from module.jobAd import JobAd
from module.Utils import *

# external libraries
from itertools import chain
from nltk.probability import *
from pylab import *

### 1.1 Examining and loading data
- Examine the data folder, including the categories and job advertisement txt documents, etc. Explain your findings here, e.g., number of folders and format of txt files, etc.
- Load the data into proper data structures and get it ready for processing.
- Extract webIndex and description into proper data structures.


Function ```read_job_ad```:
 1. reads all files from the different sub-folders (i.e. job categories)
 2. uses each sub-folder name as the job category
 3. converts each file into a ```JobAd``` object for pre-processing
 4. stores all contents of a file in its ```JobAd``` object (e.g. title, webIndex, company, description, job category)
 5. returns a list of all ```JobAd``` objects and a list of strings with all job categories
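
The steps above can be sketched as follows. This is only a minimal illustration, not the actual ```module.jobAd``` code: the names ```SimpleJobAd``` and ```load_job_ads``` are hypothetical, and the sketch assumes that each txt file holds one ```Key: value``` field per line, which is not confirmed by this notebook.

```python
import os
from dataclasses import dataclass, field


@dataclass
class SimpleJobAd:
    """Hypothetical stand-in for the notebook's JobAd class."""
    path: str
    category: str
    data: dict = field(default_factory=dict)


def load_job_ads(root):
    """Read every txt file under root/<category>/ into a SimpleJobAd."""
    # Each sub-folder name is treated as one job category.
    categories = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    job_ads = []
    for category in categories:
        folder = os.path.join(root, category)
        for file_name in sorted(os.listdir(folder)):
            if not file_name.endswith(".txt"):
                continue
            path = os.path.join(folder, file_name)
            data = {}
            with open(path, encoding="utf-8") as f:
                for line in f:
                    # ASSUMPTION: one "Key: value" field per line.
                    key, sep, value = line.partition(":")
                    if sep:
                        data[key.strip()] = value.strip()
            job_ads.append(SimpleJobAd(path, category, data))
    return job_ads, categories
```

Under these assumptions, ```load_job_ads("./data")``` would return the list of advertisements and the list of category names, in the same order as ```result[0]``` and ```result[1]``` in the cell below.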

In [2]:
result = read_job_ad("./data")
job_ad_list = result[0]
job_category = result[1]

print("Number of folders: {}, {}".format(len(job_category), job_category))
print("Number of files: {}".format(len(job_ad_list)))
print(set([str(job_ad.data.keys()) for job_ad in job_ad_list]))


Number of folders: 4, ['Accounting_Finance', 'Engineering', 'Healthcare_Nursing', 'Sales']
Number of files: 776
{"dict_keys(['Title', 'Webindex', 'Company', 'Description'])", "dict_keys(['Title', 'Webindex', 'Description'])"}


From the above output, there are 4 job categories: 'Accounting_Finance', 'Engineering', 'Healthcare_Nursing' & 'Sales'.
There are 776 job advertisements. Some job advertisements have 4 types of content, i.e. 'Title', 'Webindex', 'Company' & 'Description', while others do not have 'Company'.

### 1.2 Pre-processing data
Perform the required text pre-processing steps.

#### Tokenizing Description (Pt. 1, 2 & 3)
1. Convert words into lower case
2. Tokenize each job advertisement description with the regular expression
    ```python
    r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
    ```

In [3]:
# start tokenizing job advertisement description
for job_ad in job_ad_list:
    job_ad.tokenizeDesc(pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?") # Task 1 pt. 1, 2 & 3

all_words = summarise_words(job_ad_list)

Words: 186952
Vocabs: 9834


In total, the above regular expression yields 186952 words and 9834 vocabs (i.e. distinct words) across all descriptions.
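
To make the behaviour of the pattern concrete, here is a small sketch using only the standard ```re``` module. It is a stand-in for ```JobAd.tokenizeDesc```, whose internals are not shown in this notebook; the function name ```tokenize``` and the sample sentence are made up for illustration.

```python
import re

TOKEN_PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"


def tokenize(text, pattern=TOKEN_PATTERN):
    """Lower-case the text, then return every match of the pattern."""
    # The group is non-capturing, so findall returns the whole match.
    return re.findall(pattern, text.lower())


sample = "Full-time nurse's role, 3 years' experience in A&E."
print(tokenize(sample))
# ['full-time', "nurse's", 'role', 'years', 'experience', 'in', 'a', 'e']
```

The pattern keeps one internal hyphen or apostrophe (```full-time```, ```nurse's```), drops digits and trailing apostrophes, and splits ```A&E``` into two single-letter tokens, which the next step removes.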

#### Remove words with length less than 2 (Pt. 4)
```JobAd``` function ```remove_words(length=2)``` keeps only the description tokens whose number of characters is larger than or equal to ```length```

In [4]:
for job_ad in job_ad_list:
    job_ad.remove_words() # Task 1 pt. 4

all_words_removed = summarise_words(job_ad_list, all_words)

Words: 180913
Vocabs: 9808
Removed Words: 6039
Removed Vocabs: 26


This step removes 6039 words and 26 vocabs with a length of 1.
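
The length filter amounts to a single list comprehension. The sketch below is an assumption about what ```remove_words``` does, based only on its description above; ```remove_short_tokens``` is a hypothetical name.

```python
def remove_short_tokens(tokens, length=2):
    """Keep the tokens that have at least `length` characters."""
    return [token for token in tokens if len(token) >= length]


tokens = ["a", "registered", "nurse", "e", "uk"]
print(remove_short_tokens(tokens))
# ['registered', 'nurse', 'uk']
```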


#### Remove stopwords (Pt.5)
1. ```Utils``` function ```read_stopwords(path="stopwords_en.txt", permission="r", encoding="utf-8")``` reads the stopwords from a file
2. ```JobAd``` function ```remove_by_list(self, remove_list)``` removes every description token that appears in the provided list of words

In [5]:
stopword_lists = read_stopwords()

for job_ad in job_ad_list:
    job_ad.remove_by_list(stopword_lists) #Task 1 pt. 5

all_words_removed_stopword = summarise_words(job_ad_list, all_words_removed)

Words: 107161
Vocabs: 9404
Removed Words: 73752
Removed Vocabs: 404


This step removes 73752 words and 404 vocabs that are treated as stopwords.
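
A list-based removal like ```remove_by_list``` could look roughly like the sketch below. The function name ```remove_by_word_list``` and the tiny stopword list are illustrative only; converting the list to a set first is a design choice made here, not something the notebook confirms about ```JobAd```.

```python
def remove_by_word_list(tokens, remove_list):
    """Return the tokens that are not in remove_list, keeping their order."""
    remove_set = set(remove_list)  # set membership is O(1) on average
    return [token for token in tokens if token not in remove_set]


stopwords = ["the", "and", "of", "to"]  # tiny illustrative list, not stopwords_en.txt
tokens = ["care", "of", "the", "elderly", "and", "support", "to", "families"]
print(remove_by_word_list(tokens, stopwords))
# ['care', 'elderly', 'support', 'families']
```

With several hundred stopwords and over 180000 tokens, the set lookup avoids scanning the whole stopword list for every token.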

#### Remove words that appear only once based on Term Frequency (Pt.6)
Remove every word that appears exactly once across all job advertisements:
1. ```[job_ad.desc_tokens for job_ad in job_ad_list]``` collects the desc_tokens of every job advertisement, giving a list of token lists
2. ```chain.from_iterable(...)``` flattens that list of token lists into a single list of all tokens
3. ```nltk.FreqDist``` counts the frequency of each word in the flattened list
4. ```FreqDist.items()``` yields one pair per word: the first element is the word and the second element is its frequency within the list provided
5. the words with a frequency of 1 are collected into a list
6. ```JobAd``` function ```remove_by_list``` removes these words from the description tokens


In [6]:
words = list(chain.from_iterable([job_ad.desc_tokens for job_ad in job_ad_list]))
term_fd = FreqDist(words)

# Collect the words whose term frequency is exactly 1
appear_1_list = [word for word, count in term_fd.items() if count == 1]

# remove word appear once
for job_ad in job_ad_list:
    job_ad.remove_by_list(appear_1_list)

all_words_removed_appear_1 = summarise_words(job_ad_list, all_words_removed_stopword)

Words: 102975
Vocabs: 5218
Removed Words: 4186
Removed Vocabs: 4186


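There are 4186 words that appear only once across all documents combined. Each of them accounts for exactly one token, so the number of removed words equals the number of removed vocabs.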
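
The term-frequency step can be reproduced on a toy corpus with ```collections.Counter```, which ```nltk.FreqDist``` subclasses. The three small documents below are invented for illustration.

```python
from collections import Counter
from itertools import chain

docs = [
    ["nurse", "care", "ward", "care"],
    ["sales", "care", "target"],
    ["nurse", "sales", "quota"],
]

# Term frequency: every occurrence counts, across all documents.
term_freq = Counter(chain.from_iterable(docs))

# Words that occur exactly once in the whole corpus.
once_only = {word for word, count in term_freq.items() if count == 1}

filtered = [[t for t in doc if t not in once_only] for doc in docs]
print(sorted(once_only))  # ['quota', 'target', 'ward']
print(filtered)
# [['nurse', 'care', 'care'], ['sales', 'care'], ['nurse', 'sales']]
```

Three words are removed and exactly three tokens disappear, mirroring the equal "Removed Words" and "Removed Vocabs" counts above.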

#### Remove top 50 most frequent words based on Document Frequency (Pt.7)
Remove the 50 words that appear in the most job advertisements, counting each word at most once per job advertisement:
1. ```[set(job_ad.desc_tokens) for job_ad in job_ad_list]``` builds the set of distinct desc_tokens for each job advertisement
2. ```chain.from_iterable(...)``` flattens these sets into a single list, in which a word occurs once for every job advertisement that contains it
3. ```FreqDist``` counts the words in this list, which gives the document frequency of each word
4. ```FreqDist``` function ```most_common()``` returns the list of the top 50 words by document frequency
5. ```JobAd``` function ```remove_by_list()``` removes these words from the description tokens

In [7]:
words_2 = list(chain.from_iterable([set(job_ad.desc_tokens) for job_ad in job_ad_list]))
doc_fd = FreqDist(words_2)  # compute document frequency for each unique word/type

# Collect the 50 words with the highest document frequency
most_50_common_keys = [word for word, _ in doc_fd.most_common(50)]

for job_ad in job_ad_list:
    job_ad.remove_by_list(most_50_common_keys)

final_words_list = summarise_words(job_ad_list, all_words_removed_appear_1)

Words: 81205
Vocabs: 5168
Removed Words: 21770
Removed Vocabs: 50


The top 50 vocabs by document frequency account for 21770 words across all descriptions.
After removing them, 81205 words and 5168 vocabs are left in the description tokens.
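
The difference between term frequency and document frequency is easy to see on a toy corpus. The sketch below again uses ```collections.Counter``` in place of ```FreqDist```, and the documents are invented for illustration.

```python
from collections import Counter
from itertools import chain

docs = [
    ["team", "team", "nurse", "care"],
    ["team", "sales", "care"],
    ["team", "engineer", "engineer", "engineer"],
]

# Term frequency counts every occurrence.
term_freq = Counter(chain.from_iterable(docs))

# Document frequency counts each word once per document: set() removes
# repeats inside a document before the counts are added up.
doc_freq = Counter(chain.from_iterable(set(doc) for doc in docs))

print(term_freq["engineer"], doc_freq["engineer"])  # 3 1
print([word for word, _ in doc_freq.most_common(2)])
# ['team', 'care']
```

Here ```engineer``` has a high term frequency but appears in only one document, so it is not among the most common words by document frequency.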

## Saving required outputs (Pt. 8 & 9)
Save the vocabulary, bigrams and job advertisement txt as per specification.
- vocab.txt (Pt.9)
- job_ad_all.txt (for Task 2, also requirement for Task 1 Pt.8)
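
The vocabulary file holds one ```word:index``` entry per line, with the words sorted alphabetically and the index starting at 0. The sketch below writes to an in-memory buffer instead of vocab.txt so it can be run without touching the file system; ```write_vocab``` is a hypothetical helper name.

```python
import io
from itertools import chain


def write_vocab(tokens_per_doc, handle):
    """Write the sorted vocabulary as word:index lines and return it."""
    vocab = sorted(set(chain.from_iterable(tokens_per_doc)))
    for index, word in enumerate(vocab):
        handle.write("{}:{}\n".format(word, index))
    return vocab


buffer = io.StringIO()
write_vocab([["nurse", "care"], ["sales", "care"]], buffer)
print(buffer.getvalue())
# care:0
# nurse:1
# sales:2
```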

In [8]:
# code to save output data...
words = list(chain.from_iterable([job_ad.desc_tokens for job_ad in job_ad_list]))
fd = FreqDist(words)

# write vocab.txt; mode "w" creates the file if it does not exist
with open("vocab.txt", "w", encoding="utf-8") as f:
    # sort the vocabulary alphabetically
    for idx, k in enumerate(sorted(fd.keys())):
        # write one line per word in the format word:index
        f.write("{}:{}\n".format(k, idx))
    # no explicit close is needed, the with block closes the file

In [9]:
generate_corpus_file(job_ad_list)

## Summary
Pre-processing should also include stemming and lemmatization, since some words could be combined (e.g. communicate, communicated, communicating, communication & communicator).<br/>
Some job advertisement descriptions are not complete sentences. The regex pattern ```r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"``` may also break web URLs into fragments.<br/>
Also, short-form terms like ASAP are changed to asap, which is not easy to recognize as a short-form term in lower case.
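
The last two points can be checked directly with the standard ```re``` module. The sample text and the URL in it are made up for illustration.

```python
import re

pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
text = "Apply ASAP at www.example-jobs.co.uk"

# Same order as in the notebook: lower-case first, then match the pattern.
tokens = re.findall(pattern, text.lower())
print(tokens)
# ['apply', 'asap', 'at', 'www', 'example-jobs', 'co', 'uk']
```

The URL is split at every dot, and ```ASAP``` can no longer be told apart from an ordinary lower-case word.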