# FIT5196 Assessment 1: Task 2: Text Pre-Processing (30%)
#### Student Name:  Shih Ting Chu
#### Student ID:  29286875

Date: Aug 15

Environment: Python 3.6.3 and Jupyter notebook
Libraries used: 
* os (Miscellaneous operating system interfaces, e.g. os.listdir, for finding files by path)
* re (Regular expression operations)
* nltk (Natural Language Toolkit, e.g. tokenizer, lemmatizer, stopwords, collocations and probabilities)
* OrderedDict (A dictionary subclass to remember the order in which its contents are added)


### 1. Introduction
This assessment covers the next step in analyzing textual data: converting the extracted data into a proper format. You are required to write Python code to preprocess a set of resumes and convert them into numerical representations suitable for input into recommender-system or information-retrieval algorithms.
The dataset provided contains 250 CVs for each student. The file resume\_dataset.txt lists the PDF files in each student's dataset. Each line in that file contains the IDs of the resumes a student needs to include (for example, 1111111111: [3 34 5 ...] means that the dataset for student 1111111111 includes resume\_(3), resume\_(34), resume\_(5), ...). The CVs contain information about the applicants in PDF format.
The information includes, for example, personal information, skills, work experience, and education. Your task is to extract and transform the information for each applicant.
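
As a minimal sketch of the mapping described above, the following parses one line of the form `student_id: [id id ...]` into resume filenames. The function name `parse_dataset_line` and the sample line are hypothetical, and the `.txt` extension is an assumption based on the text files read later in this notebook.

```python
import re

def parse_dataset_line(line):
    """Split 'student_id: [3 34 5]' into (student_id, [filenames])."""
    student_id, _, id_part = line.partition(':')
    # Pull out every run of digits after the colon
    resume_ids = re.findall(r'\d+', id_part)
    # Assumed filename pattern: resume_(N).txt
    filenames = ['resume_(' + n + ').txt' for n in resume_ids]
    return student_id.strip(), filenames

# Hypothetical sample line, not taken from the real dataset file
student, files = parse_dataset_line('1111111111: [3 34 5]')
print(student, files)
```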


### 2. Import Libraries

In [4]:
# Import miscellaneous operating system interfaces
import os
# library for regular expression
import re
# Import RegexpTokenizer (a smarter way to pick out words)
from nltk.tokenize import RegexpTokenizer
# Import PorterStemmer
from nltk.stem import PorterStemmer
# A list of n-grams can be extracted from a text with the function ngrams()
from nltk.util import ngrams
# NLTK provides a built-in function FreqDist to compute this distribution directly from a set of word tokens.
from nltk.probability import *

# Import libraries for bigrams analysis
import nltk
nltk.download('stopwords')
from nltk.util import bigrams
from nltk import BigramCollocationFinder

# Sort dictionary in ascending/descending order based on values
from collections import OrderedDict

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/eileen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 3. Get My Selected Resume Filenames

In [7]:
# My dataset (student 29286875), copied from Moodle
myDataset = '262 76 778 565 6 623 \
200 742 645 418 520 814 467 680 216 12 762 63 \
276 66 236 722 82 779 626 206 276 649 750 20 167 738 602 6 251 808 \
729 115 80 71 539 38 426 44 556 779 251 760 61 728 251 268 421 750 \
832 807 55 452 683 303 560 555 547 265 649 441 633 108 517 744 137 \
446 182 397 442 8 266 630 857 17 498 32 293 711 413 445 392 606 322 \
422 54 582 657 239 309 382 286 105 336 786 788 712 738 516 29 523 \
51 428 43 47 819 649 598 297 615 208 865 720 122 789 372 860 813 \
219 700 275 4 466 455 277 427 728 71 812 284 646 327 397 262 22 \
450 80 377 26 332 45 325 24 274 153 847 311 168 99 361 362 139 602 \
551 517 111 812 1 364 467 323 398 207 837 716 733 685 498 817 336 \
395 154 265 492 117 389 833 705 703 270 218 435 475 338 513 621 \
208 839 560 370 81 445 89 12 645 336 157 47 91 32 797 837 803 195 \
861 237 31 810 634 211 246 278 788 556 726 684 815 460 596 237 640 \
764 251 304 474 154 384 743 749 593 327 749 515 543 310 833 464 724 \
66 331 649 397 426 383 492'

# Create an empty list
myDataset_ls = []

# Convert string to list
myDataset_ls = myDataset.split(' ')

# Remove the duplicate values
clean_myDataset_ls = list(set(myDataset_ls))

# Format the filenames and assign them to cv_name_ls
cv_name_ls = ['resume_(' + s + ').txt' for s in clean_myDataset_ls]

207
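
Note that `set` does not preserve the order of the IDs. If a stable order matters, one option (an alternative sketch, not what the cell above does) is `dict.fromkeys`, which drops duplicates while keeping first-seen order. The short `sample_ids` string is made up for illustration.

```python
# Hypothetical shortened ID string with repeats
sample_ids = '262 76 778 6 76 262 6'

# split() with no argument also tolerates repeated spaces and newlines
id_list = sample_ids.split()

# dict keys are unique and keep insertion order
# (guaranteed from Python 3.7; CPython 3.6 also behaves this way)
unique_ids = list(dict.fromkeys(id_list))

sample_names = ['resume_(' + s + ').txt' for s in unique_ids]
print(len(id_list), len(unique_ids))
print(sample_names)
```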


### 4. Function for Bigrams

In [3]:
# Extract bigrams ()
def generate_collocations(tokens):
    '''
    Parameter: given list of tokens
    Return: bigrams
    '''

    # Create an empty list to store stopwords
    stopwords_list = []
    # Open and read the stopword file, then add into stopwords_list
    with open('stopwords_en.txt') as f:
        stopwords_list = f.read().splitlines()
    # Association measures used to score bigrams
    bigram_measures = nltk.collocations.BigramAssocMeasures()

    # Build the finder over adjacent token pairs (window_size = 2)
    finder = BigramCollocationFinder.from_words(tokens, window_size = 2)
    # Drop bigrams containing a word shorter than 3 characters or a stopword (case-insensitive)
    finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in stopwords_list)
    # Keep bigrams that occur at least once (a minimum frequency of 1 removes nothing)
    finder.apply_freq_filter(1)

    # The best 20 bigrams
    collocations = finder.nbest(bigram_measures.likelihood_ratio, 20)

    return collocations 
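
To make the filtering step concrete, here is a standard-library sketch of what counting adjacent pairs (`window_size = 2`) with a word filter looks like. It only counts raw frequencies and does not reproduce the likelihood-ratio scoring that `finder.nbest` applies. The names `adjacent_bigram_counts`, `toy_tokens`, and `toy_stopwords` are invented for the example.

```python
from collections import Counter

def adjacent_bigram_counts(tokens, stopwords):
    """Count adjacent token pairs, skipping pairs with a filtered word."""
    def is_filtered(word):
        # Same rule as the word filter above: too short or a stopword
        return len(word) < 3 or word.lower() in stopwords

    counts = Counter()
    # zip(tokens, tokens[1:]) yields each pair of neighbouring tokens
    for first, second in zip(tokens, tokens[1:]):
        if not (is_filtered(first) or is_filtered(second)):
            counts[(first, second)] += 1
    return counts

toy_tokens = ['machine', 'learning', 'and', 'data', 'science',
              'with', 'machine', 'learning']
toy_stopwords = {'and', 'with'}
toy_counts = adjacent_bigram_counts(toy_tokens, toy_stopwords)
print(toy_counts.most_common(2))
```

Pairs are formed from the original token sequence first and filtered afterwards, so removing a stopword never makes two previously separated words adjacent.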

### 5. Function for Tokens Analysis

In [4]:
def analyze(resume, stopword):    
    
    
    '''Word tokenization'''
    token_regex = r"\w+(?:[-']\w+)?"
    # Find all matching tokens in resume
    match_token = re.findall(token_regex, resume)
    
    # Create a list to store alphabet tokens
    tokens = []   
    for each in match_token:
        if each.isalpha():
            tokens.append(each)
   

    '''Keep only capital words which are in the middle of sentences'''
    # Set regex for getting the first word in a sentence
    first_regex = r'(?:(?:[^\w,]\s)|(?:\.\s))([A-Z]\w+)\s'
    # Find the first token in each sentence
    match_first = re.findall(first_regex, resume)
    
    # Create empty lists
    mix = [] # Store lowercase & uppercase(in the middle of lines) tokens 
    upper = [] # Store those capital tokens in the middle of sentences
    gonna_stem = [] # mix - upper ("stem" will make all tokens lowercase so need to do it separately)
        
    # Check each token in tokens
    for each in tokens:
        # If that token is the first word of a sentence
        if each in match_first:
            # Make it lowercase and then add into mix
            mix.append(each.lower())            
        else:
            # Add into mix unchanged
            mix.append(each)
    
    # Split uppercase tokens and lowercase tokens
    for each in mix:
        if each.istitle():
            upper.append(each)
        else:
            gonna_stem.append(each)    

            
    '''Stem'''
    # Create a new Porter stemmer
    stemmer = PorterStemmer()
    # Stem tokens using the Porter stemmer
    stemmed_tokens = [stemmer.stem(word) for word in gonna_stem]
    
    # Combine stemmed_tokens and upper (stemmed lowercase tokens & capital tokens in the middle lines)
    vocab = []
    # Make each token unique
    vocab = list(set(stemmed_tokens + upper))


    '''Remove stopwords'''
    # Create an empty list to store stopwords
    stopwords_list = []

    # Open and read the stopword file, then store its lines in stopwords_list
    with open(stopword) as f:
        stopwords_list = f.read().splitlines()
    
    # Create a tokens list without stopwords
    filtered_tokens = [each for each in vocab if each not in stopwords_list]


    '''Remove tokens with the length less than 3'''
    # Create an empty list
    no_less3_tokens = []

    # Check each token in filtered_tokens
    for each in filtered_tokens:
        # If the length of each token >= 3
        if len(each) >= 3:
            # Append that token into no_less3_tokens
            no_less3_tokens.append(each)


    '''Remove rare tokens (with the threshold set to 2%)'''
    # Create a dictionary to store tokens and count their frequency 
    token_freq = {}
    # Check each token in all resumes
    for each in tokens:
        # If each token is in no_less3_tokens (list)
        if each in no_less3_tokens:
            # If each token is not in token_freq (dict)
            if each not in token_freq:
                # Add key into token_freq
                token_freq[each] = 0 
            # Count the token    
            token_freq[each] += 1

    # Sort tokens frequency in a dictionary in ascending order
    ascending_dict = OrderedDict(sorted(token_freq.items(), key=lambda key: key[1]))
    # Convert dictionary to list
    ascending_list = list(ascending_dict.items())
    # Remove the rarest tokens (the lowest 2% of distinct tokens by frequency)
    no_rare_tokens = ascending_list[int(len(ascending_list)*0.02):]
    
    # Create a list to store only the tokens, without their frequencies
    clean_uni_tokens = []
    for each in no_rare_tokens:
        clean_uni_tokens.append(each[0])


    '''First 200 meaningful bigrams'''
    # Get the bigrams list like [('I', 'am'), ('good', 'day'), ...]
    # Use tokens (document order); vocab went through set(), so its word order is arbitrary
    bigram_ls = generate_collocations(tokens)
    
    # Join each bigram tuple into one string, e.g. ('good', 'day') -> 'good day'
    best200 = [' '.join(bigram) for bigram in bigram_ls]
        
    
    ''' Vocab (token_string:integer_index) '''
    final_tokens = sorted(set(clean_uni_tokens+best200))
    
    vocab_str = ''
    # Make it readable
    for integer_index, token_string in enumerate(final_tokens, 1):
        vocab_str += token_string + ": " + str(integer_index) + "\n"
        
    # Write into a file
    with open('29286875_vocab.txt', 'a+') as f:
        f.write(vocab_str)
    
    
    ''' CountVec (file_name, token_index:count, token_index:count,...) '''
    token_index = []
    for num in range(1, len(final_tokens)+1):
        token_index.append(num)
    
    token_dict = dict(zip(final_tokens, token_index))
    
    
    # Directory that holds the resume text files
    resume_dir = "/Users/eileen/Jupyter/5196 data wrangling/Assignments/A1/PDF/resumeTxt+task2_pdf"
    # os.listdir returns the names of the entries in resume_dir
    for file in os.listdir(resume_dir):
        check_ls = []
        # If file is my selected resume
        if file in cv_name_ls:
            # Start from an empty string for each resume
            each_resume = ''
            # Read each resume from resume_dir rather than the working directory
            with open(os.path.join(resume_dir, file), 'r') as f:
                # Read line by line
                for line in f.readlines():
                    clean_line = line.strip()
                    # Add into each_resume, keeping a space between lines
                    each_resume += clean_line + ' '
                check_ls = each_resume.split(" ")

            count_ls = []    
            # Check each token in all resumes, check_ls(messy tokens)
            for each in list(set(check_ls)):
                # If the token is in final_tokens(clean tokens)
                if each in final_tokens:
                    count_str = ''
                    
                    # Get the index of the token and count its frequency
                    count_str = str(token_dict[each]) + ": " + str(check_ls.count(each))
                    count_ls.append(count_str)

            # Write into a file
            f = open('29286875_countVec.txt','a+')
            f.write("\n\n")
            f.write(file)
            f.write(":\n")
            for item in count_ls:
                f.write("%s, " % item)
            f.close() 
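
The vocabulary and count-vector formats built above can be illustrated on toy data. This is a simplified sketch that uses `collections.Counter` rather than repeated `list.count` calls; `toy_vocab`, `toy_resume`, and `count_vector_line` are hypothetical names, and the exact output layout required by the assessment is not reproduced here.

```python
from collections import Counter

def count_vector_line(file_name, text, token_to_index):
    """Return 'file_name, index:count, ...' for vocabulary tokens in text."""
    # One pass over the text gives the frequency of every word
    word_counts = Counter(text.split())
    pairs = [
        (token_to_index[word], count)
        for word, count in word_counts.items()
        if word in token_to_index
    ]
    # Sort by vocabulary index so the line is deterministic
    pairs.sort()
    body = ', '.join('%d:%d' % pair for pair in pairs)
    return file_name + ', ' + body

# Vocabulary indices start at 1, as in the enumerate(..., 1) call above
toy_vocab = sorted({'python', 'java', 'sql'})
toy_index = {token: i for i, token in enumerate(toy_vocab, 1)}

toy_resume = 'python sql python excel java python'
toy_line = count_vector_line('resume_(1).txt', toy_resume, toy_index)
print(toy_line)
```

Looking tokens up in a dictionary is also faster than testing membership in a list when the vocabulary is large.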

### 6. Open and Read Each Resume

In [5]:
# Create an empty string to store all my resumes
my_CVs = ''

# Directory that holds the resume text files
resume_dir = "/Users/eileen/Jupyter/5196 data wrangling/Assignments/A1/PDF/resumeTxt+task2_pdf"
# os.listdir returns the names of the entries in resume_dir
for file in os.listdir(resume_dir):
    # If the file is in cv_name_ls
    if file in cv_name_ls:
        # Open and read that file from resume_dir rather than the working directory
        with open(os.path.join(resume_dir, file), 'r') as f:
            # Read line by line
            for line in f.readlines():
                # Add into my_CVs
                my_CVs += line

# Call the function to analyze the combined resume text
analyze(my_CVs, 'stopwords_en.txt')
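
For reference, the token pattern used inside `analyze` can be tried on a short string. The sample sentence is made up. One thing this sketch shows is that hyphenated matches are found by the regular expression but then dropped by the `isalpha()` check, since a hyphen is not an alphabetic character.

```python
import re

token_regex = r"\w+(?:[-']\w+)?"
sample = 'A well-known Java developer since 2015'

matches = re.findall(token_regex, sample)
# Keep purely alphabetic tokens, as the loop in analyze does
alpha_tokens = [t for t in matches if t.isalpha()]

print(matches)
print(alpha_tokens)
```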