# FIT5196 Assessment 1, Task 2: Text Processing
#### Student Name: Akshay Sapra
#### Student ID: 29858186

Date: 14/04/2019

Environment: Python 3.6.6 and Jupyter notebook
Libraries used: 
- re (for regular expressions, included in Anaconda Python 3.6)
- nltk (for text pre-processing, included in Anaconda Python 3.6)
- nltk.data (for sentence segmentation, included in Anaconda Python 3.6)
- nltk.tokenize (for tokenization using RegexpTokenizer and MWETokenizer, included in Anaconda Python 3.6)
- nltk.collocations (for finding bigrams, included in Anaconda Python 3.6)
- nltk.probability (for frequency distributions, included in Anaconda Python 3.6)
- nltk.stem (for stemming the tokens, included in Anaconda Python 3.6)
- itertools (for flattening lists, part of the Python standard library)



## 1. Introduction
This assignment demonstrates text pre-processing: the information for each unit is extracted and transformed into a vector space model.
The input is a structured PDF file containing information about several Monash University units, crawled from the Monash website. The PDF file contains a table in which each row holds
the information about one unit: its unit code, synopsis, and outcomes.
The tasks are carried out in the following order:

1. Normalisation of tokens to lowercase, except capitalised tokens that appear in the middle of a sentence/line.
2. Word tokenization using the regular expression "\w+(?:[-']\w+)?"
3. The first 200 meaningful bigrams (i.e., collocations) are included in the vocab using the PMI measure.
4. Tokens with a length of less than 3 are removed from the vocab.
5. The context-independent and context-dependent (with the threshold set to 95%) stop words are removed from the vocab.
6. Rare tokens (with the threshold set to 5%) are removed from the vocab.
7. Tokens are stemmed using the Porter stemmer.





## 2. Importing Libraries

In [72]:
import re
from nltk.tokenize import RegexpTokenizer 
from nltk.tokenize import MWETokenizer
from nltk.corpus import stopwords
import nltk.data
from nltk.probability import *
from itertools import chain
from nltk.stem import PorterStemmer 
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import sent_tokenize

## 3. Defining functions 

- Defining functions for sentence segmentation, tokenization and stop word removal

In [73]:
# Alternative sentence splitter, kept for reference:
# sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
def senten(raw_text):
    """Split raw text into a list of sentences."""
    sentences = nltk.sent_tokenize(raw_text)
    return sentences

def tonkenize(unit):
    """Tokenize text with the regular expression required by the task."""
    tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")
    unigram_tokens = tokenizer.tokenize(unit)
    return unigram_tokens

# Context-independent stop words, loaded from the supplied list.
# Note: this rebinds the name `stopwords` imported from nltk.corpus above.
with open('stopwords_en.txt','r', encoding='UTF-8') as infile:
    stopwords = set(infile.read().split())

def stopwords_filter(mwe_tokens):
    """Remove context-independent stop words from a list of tokens."""
    stopped_tokens = [w for w in mwe_tokens if w not in stopwords]
    return stopped_tokens

## 3. Extracting data

The PDF was converted to text with https://www.Pdftotext.com, and the text file generated by the website is loaded here.

In [74]:
with open('task2_29858186.txt','r', encoding='UTF-8') as infile:
    text = infile.read()


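Basic cleansing is applied before any features are extracted: the table header and the page-break characters are removed, line breaks are replaced by spaces, and the text is split into one chunk per unit.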
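As a rough illustration of this cleansing, the sketch below applies the same substitutions to a small made-up string. The sample text and the helper name `split_units` are assumptions for illustration and are not part of the assignment data.

```python
import re

def split_units(raw_text):
    """Split converted PDF text into one chunk per unit (illustrative helper)."""
    # Drop the table header and the form-feed page separators.
    cleaned = raw_text.replace('Title\n\nSynopsis\n\nOutcomes', '')
    cleaned = cleaned.replace('\x0c', '')
    # Put everything on one line, then cut after each closing bracket,
    # which ends the outcomes list of a unit.
    cleaned = cleaned.replace('\n', ' ')
    return re.sub(r'\]', ']<splitter>', cleaned).split('<splitter>')

# Made-up sample laid out like the converted file.
sample = ("Title\n\nSynopsis\n\nOutcomes\n\nABC1234\n\nFirst synopsis.\n\n"
          "['Outcome one.']\x0c\n\nDEFG5678\n\nSecond synopsis.\n\n['Outcome two.']")
chunks = split_units(sample)
print(len(chunks))
print(repr(chunks[-1]))
```

The split always leaves one extra element after the final `]` (here an empty string), which is why a loop over the chunks would stop one element early.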

In [75]:
#Removing the table header
text = re.sub('Title\n\nSynopsis\n\nOutcomes','',text)
#Removing the form-feed character that the pdf to text conversion leaves at each new page
text = re.sub('\x0c','',text)
text= re.sub('\n',' ',text)
#Each unit ends with the closing bracket of its outcomes list, so it is marked as a split point
text = re.sub(r'\]',']<splitter>',text)
#Splitting the entire text into one chunk per unit
text = text.split('<splitter>')
text

["  BEH1041  This unit uses the framework of human development throughout the lifespan to identify health and, specifically, emergency health issues at various stages of the lifespan. Students will investigate the roles of paramedics and allied health professionals in assessing human development and maintaining health across the lifespan and will explore issues relating to death and grieving. Included in this unit will be clinical visits to selected agencies to provide clinical context to the theoretical background.  ['Describe the physical, personal, psychological and social milestones of human development throughout the lifespan.', 'Identify the social and cultural determinants that impact upon human development.', 'Communicate effectively with individuals across the lifespan within an appropriate developmental framework.', 'Identify common acute and chronic health issues that occur across the lifespan.', 'Apply contemporary theories of development to specific health issues across th

In the step below, the unit code of each chunk is extracted and the remaining text of the unit is saved in a dictionary named text_dict, keyed by unit code.
There are 196 entries because **duplicate units are not considered for processing.**
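The extraction can be sketched on made-up chunks as follows. The helper name `build_unit_dict` and the sample chunks are assumptions, and the pattern `[A-Z]{3,4}[0-9]{4}` is a condensed form of the three-or-four-letter alternation used in the notebook.

```python
import re
from collections import Counter

# Two leading spaces, then 3 or 4 capital letters followed by 4 digits.
UNIT_CODE = re.compile(r'^\s{2}([A-Z]{3,4}[0-9]{4})')

def build_unit_dict(chunks):
    """Map unit code -> remaining text; a repeated code overwrites the earlier entry."""
    units, codes = {}, []
    for chunk in chunks:
        match = UNIT_CODE.match(chunk)
        if match is None:  # e.g. the empty tail left by the split
            continue
        codes.append(match.group(1))
        units[match.group(1)] = chunk[match.end():]
    duplicates = [code for code, n in Counter(codes).items() if n > 1]
    return units, duplicates

# Made-up chunks, including one repeated unit code.
chunks = ["  ABC1234  First.", "  DEFG5678  Second.", "  ABC1234  First again.", ""]
units, duplicates = build_unit_dict(chunks)
print(len(units), duplicates)
```

Because a dictionary holds one value per key, a repeated unit code leaves a single entry, so the dictionary ends up smaller than the number of chunks.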

In [76]:
text_dict={}
# The last element of the split is the text after the final ']', so it is skipped
for i in range(len(text)-1):
    # Unit code: three or four capital letters followed by four digits
    unit= re.findall(r'^\s{2}([A-Z]{3}[0-9]{4}|[A-Z]{4}[0-9]{4})', text[i])[0]
    tex = re.sub(r'^\s{2}([A-Z]{3}[0-9]{4}|[A-Z]{4}[0-9]{4})','', text[i])
    # A repeated unit code overwrites the earlier entry, so each unit is kept once
    text_dict[unit]=tex
# Duplicated unit codes found while checking: ['NUR3022', 'SWM5101', 'ATS3164']
len(text_dict) 

196

In [77]:
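# Stripping the list syntax around the outcomes so that only plain text remains
for each in text_dict:
    text_dict[each] = re.sub(r"\['", "", text_dict[each])
    text_dict[each] = re.sub("', '", " ",text_dict[each] )
    text_dict[each] = re.sub(r"'\]", "", text_dict[each])
    text_dict[each] = re.sub(r'"\]', "", text_dict[each])
    # The lazy quantifier +? matches a single character, so only one leading whitespace character is removed
    text_dict[each] = re.sub(r'^\s+?','', text_dict[each])

text_dict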

{'ACB2851': ' The objective of this unit is two-fold. First, the unit provides students with a broad introduction to accounting information systems and the role technology plays in accounting. The focus will be on an introduction to: enterprise systems; database management; documentation methods; internal controls; and the core business processes found in organisations. Second, the unit focuses on corporate modelling theory; models as decision support tools; types and uses of models; benefits and limitations of models; effective spreadsheet design; auditing spreadsheet models and development of various models using an industry standard spreadsheet.  examine the role of accounting information systems in analysing and providing decision support to managers explain the design of accounting information systems and financial models develop financial models to assist in decision making apply critical thinking, problem solving and presentation skills to individual and / or group activities de

## 4. Data Transformation through Tasks

###### 4.1. Normalisation (Task 5) of token to lowercase is done first 

Sentence segmentation is done first so that only the token at the start of each sentence is lowercased.

In [78]:
#doing the sentence segmentation
for each in text_dict:
    text_dict[each]= senten(text_dict[each])

Normalisation of the token at the start of each sentence.
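A minimal sketch of this normalisation on made-up sentences is shown here. The helper name `lowercase_sentence_starts` is an assumption. Only the first character of each sentence is lowercased, so capitalised tokens inside a sentence are preserved.

```python
def lowercase_sentence_starts(sentences):
    """Lowercase only the first character of each sentence."""
    # s[:1] is used instead of s[0] so that an empty string is handled safely.
    return [s[:1].lower() + s[1:] for s in sentences]

sentences = ["The unit covers Bayes Theorem.", "Students visit Hong Kong."]
print(lowercase_sentence_starts(sentences))

# A sentence that begins with whitespace is returned unchanged,
# because its first character is the space rather than a letter.
print(lowercase_sentence_starts([" The objective of this unit is two-fold."]))
```

The second call shows why a sentence with a leading space keeps its capital letter.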

In [79]:
#normalization of the tokens in the start of sentence. Task 2.5
for each in text_dict:
    # Lowercase only the first character of each sentence
    text_dict[each] = [sentence[0].lower() + sentence[1:] for sentence in text_dict[each]]
text_dict

{'ACB2851': [' The objective of this unit is two-fold.',
  'first, the unit provides students with a broad introduction to accounting information systems and the role technology plays in accounting.',
  'the focus will be on an introduction to: enterprise systems; database management; documentation methods; internal controls; and the core business processes found in organisations.',
  'second, the unit focuses on corporate modelling theory; models as decision support tools; types and uses of models; benefits and limitations of models; effective spreadsheet design; auditing spreadsheet models and development of various models using an industry standard spreadsheet.',
  'examine the role of accounting information systems in analysing and providing decision support to managers explain the design of accounting information systems and financial models develop financial models to assist in decision making apply critical thinking, problem solving and presentation skills to individual and / or

###### 4.2. Tokenisation (Task 1) by the use of following regular expression, "\w+(?:[-']\w+)?"
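To show what this pattern matches, the sketch below uses the standard library `re.findall` in place of NLTK's `RegexpTokenizer`, which should give the same matches for a pattern without capturing groups. The sample sentence and the helper name `tokenize` are made up for illustration.

```python
import re

# A run of word characters, optionally followed by ONE hyphen or apostrophe
# and another run of word characters.
TOKEN_PATTERN = r"\w+(?:[-']\w+)?"

def tokenize(text):
    return re.findall(TOKEN_PATTERN, text)

print(tokenize("Homer's two-fold, semester-long plan (80-100 hrs)."))
# Only one joined part is allowed, so longer chains are split in two.
print(tokenize("state-of-the-art"))
```

Punctuation and brackets are dropped, while hyphenated words, possessives and numeric ranges such as `80-100` survive as single tokens.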


In [80]:
#tokenization task 2.1
for each in text_dict:
    for i in range(len(text_dict[each])):
        text_dict[each][i]= tonkenize(text_dict[each][i])
text_dict

{'ACB2851': [['The', 'objective', 'of', 'this', 'unit', 'is', 'two-fold'],
  ['first',
   'the',
   'unit',
   'provides',
   'students',
   'with',
   'a',
   'broad',
   'introduction',
   'to',
   'accounting',
   'information',
   'systems',
   'and',
   'the',
   'role',
   'technology',
   'plays',
   'in',
   'accounting'],
  ['the',
   'focus',
   'will',
   'be',
   'on',
   'an',
   'introduction',
   'to',
   'enterprise',
   'systems',
   'database',
   'management',
   'documentation',
   'methods',
   'internal',
   'controls',
   'and',
   'the',
   'core',
   'business',
   'processes',
   'found',
   'in',
   'organisations'],
  ['second',
   'the',
   'unit',
   'focuses',
   'on',
   'corporate',
   'modelling',
   'theory',
   'models',
   'as',
   'decision',
   'support',
   'tools',
   'types',
   'and',
   'uses',
   'of',
   'models',
   'benefits',
   'and',
   'limitations',
   'of',
   'models',
   'effective',
   'spreadsheet',
   'design',
   'auditing',
 

Each dictionary value is converted from a list of lists (one list per sentence) into a single flat list of tokens to make the later operations easier.

In [81]:
#Converting the value of each dictionary to list instead of list of list
for each in text_dict:
    # chain.from_iterable joins the per-sentence token lists into one list
    text_dict[each] = list(chain.from_iterable(text_dict[each]))
text_dict



{'ACB2851': ['The',
  'objective',
  'of',
  'this',
  'unit',
  'is',
  'two-fold',
  'first',
  'the',
  'unit',
  'provides',
  'students',
  'with',
  'a',
  'broad',
  'introduction',
  'to',
  'accounting',
  'information',
  'systems',
  'and',
  'the',
  'role',
  'technology',
  'plays',
  'in',
  'accounting',
  'the',
  'focus',
  'will',
  'be',
  'on',
  'an',
  'introduction',
  'to',
  'enterprise',
  'systems',
  'database',
  'management',
  'documentation',
  'methods',
  'internal',
  'controls',
  'and',
  'the',
  'core',
  'business',
  'processes',
  'found',
  'in',
  'organisations',
  'second',
  'the',
  'unit',
  'focuses',
  'on',
  'corporate',
  'modelling',
  'theory',
  'models',
  'as',
  'decision',
  'support',
  'tools',
  'types',
  'and',
  'uses',
  'of',
  'models',
  'benefits',
  'and',
  'limitations',
  'of',
  'models',
  'effective',
  'spreadsheet',
  'design',
  'auditing',
  'spreadsheet',
  'models',
  'and',
  'development',
  'of',
 

###### 4.3. First 200 Bigrams (Task 7) are included in the vocab using PMI measure
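For intuition about the PMI measure, the sketch below scores adjacent pairs by hand as log2( c(x,y) * N / (c(x) * c(y)) ), where c counts occurrences and N is the number of tokens. It is a hand-rolled approximation written for illustration, not the NLTK implementation, and the token list and the helper name `pmi_scores` are made up.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score every adjacent pair by pointwise mutual information (base 2)."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (x, y): math.log2(c_xy * n / (word_counts[x] * word_counts[y]))
        for (x, y), c_xy in pair_counts.items()
    }

tokens = ["the", "unit", "visits", "Hong", "Kong", "and", "the", "unit", "ends"]
scores = pmi_scores(tokens)
print(round(scores[("Hong", "Kong")], 3))
print(round(scores[("the", "unit")], 3))
```

A pair whose words each occur only once scores higher than a pair that repeats, because PMI rewards words that appear almost exclusively together. This is consistent with the top-ranked bigrams being dominated by rare names and titles.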

In [82]:
#Task 2.7 making list of bigrams
words=[]
for each in text_dict:
        for toke in text_dict[each]:
            words.append(toke)
            
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(words)
bigram=finder.nbest(bigram_measures.pmi, 200)
bigram

[('1895-1920', 'Russian'),
 ('6000-10', '000'),
 ('80-100', 'hrs'),
 ('APG5048', 'Translation'),
 ('ATS2296', 'ATS3296'),
 ('ATS3296', 'Musical'),
 ('An', 'acquaintance'),
 ('Applied', 'Pharmacy'),
 ('Array', 'SKA'),
 ('Assisted', 'Reproductive'),
 ('Associate', 'Dean'),
 ('Athenian', 'playwrights'),
 ('Bayes', 'Theorem'),
 ('Biodiversity', 'Conservation'),
 ('Business', 'Continuity'),
 ('CTDI', 'DLP'),
 ('Care', 'Paramedic'),
 ('Century', 'actor'),
 ('Creative', 'Writing'),
 ('Dean', 'Education'),
 ('Dieticians', 'Association'),
 ('Disaster', 'Recovery'),
 ('Dynamic', 'Traffic'),
 ('Galileo', 'Galilei'),
 ('Germany', 'Hungary'),
 ('Gravitational-Wave', 'Observatory'),
 ('Hamilton', 'Jacobi'),
 ("Homer's", 'Iliad'),
 ('Hong', 'Kong'),
 ('Implantation', 'Screening'),
 ('Insurance', 'Contracts'),
 ('Intelligent', 'Transport'),
 ('Intensive', 'Care'),
 ('Interferometer', 'Gravitational-Wave'),
 ('Kilometre', 'Array'),
 ('Laser', 'Interferometer'),
 ('Legal', 'System'),
 ('Locate', "design

In [83]:
# Retokenizing 
mwetokenizer = MWETokenizer(bigram)
text_dict =  dict((unit, mwetokenizer.tokenize(field)) for unit,field in text_dict.items())
text_dict

{'ACB2851': ['The',
  'objective',
  'of',
  'this',
  'unit',
  'is',
  'two-fold',
  'first',
  'the',
  'unit',
  'provides',
  'students',
  'with',
  'a',
  'broad',
  'introduction',
  'to',
  'accounting',
  'information',
  'systems',
  'and',
  'the',
  'role',
  'technology',
  'plays',
  'in',
  'accounting',
  'the',
  'focus',
  'will',
  'be',
  'on',
  'an',
  'introduction',
  'to',
  'enterprise',
  'systems',
  'database',
  'management',
  'documentation',
  'methods',
  'internal',
  'controls',
  'and',
  'the',
  'core',
  'business',
  'processes',
  'found',
  'in',
  'organisations',
  'second',
  'the',
  'unit',
  'focuses',
  'on',
  'corporate',
  'modelling',
  'theory',
  'models',
  'as',
  'decision',
  'support',
  'tools',
  'types',
  'and',
  'uses',
  'of',
  'models',
  'benefits',
  'and',
  'limitations',
  'of',
  'models',
  'effective',
  'spreadsheet',
  'design',
  'auditing',
  'spreadsheet',
  'models',
  'and',
  'development',
  'of',
 

###### 4.4. Tokens (Task 6) with length less than 3 are removed.

In [84]:
#Task 2.6 Tokens with length less than 3 are removed
for each in text_dict:
    text_dict[each] = [toke for toke in text_dict[each] if len(toke)>2]
text_dict


{'ACB2851': ['The',
  'objective',
  'this',
  'unit',
  'two-fold',
  'first',
  'the',
  'unit',
  'provides',
  'students',
  'with',
  'broad',
  'introduction',
  'accounting',
  'information',
  'systems',
  'and',
  'the',
  'role',
  'technology',
  'plays',
  'accounting',
  'the',
  'focus',
  'will',
  'introduction',
  'enterprise',
  'systems',
  'database',
  'management',
  'documentation',
  'methods',
  'internal',
  'controls',
  'and',
  'the',
  'core',
  'business',
  'processes',
  'found',
  'organisations',
  'second',
  'the',
  'unit',
  'focuses',
  'corporate',
  'modelling',
  'theory',
  'models',
  'decision',
  'support',
  'tools',
  'types',
  'and',
  'uses',
  'models',
  'benefits',
  'and',
  'limitations',
  'models',
  'effective',
  'spreadsheet',
  'design',
  'auditing',
  'spreadsheet',
  'models',
  'and',
  'development',
  'various',
  'models',
  'using',
  'industry',
  'standard',
  'spreadsheet',
  'examine',
  'the',
  'role',
  'acco

###### 4.5. Stop words removal (Task 2)

In [85]:
#Task 2.2 removing context independent stop words
for each in text_dict:
    text_dict[each]=stopwords_filter(text_dict[each])
text_dict


{'ACB2851': ['The',
  'objective',
  'unit',
  'two-fold',
  'unit',
  'students',
  'broad',
  'introduction',
  'accounting',
  'information',
  'systems',
  'role',
  'technology',
  'plays',
  'accounting',
  'focus',
  'introduction',
  'enterprise',
  'systems',
  'database',
  'management',
  'documentation',
  'methods',
  'internal',
  'controls',
  'core',
  'business',
  'processes',
  'found',
  'organisations',
  'unit',
  'focuses',
  'corporate',
  'modelling',
  'theory',
  'models',
  'decision',
  'support',
  'tools',
  'types',
  'models',
  'benefits',
  'limitations',
  'models',
  'effective',
  'spreadsheet',
  'design',
  'auditing',
  'spreadsheet',
  'models',
  'development',
  'models',
  'industry',
  'standard',
  'spreadsheet',
  'examine',
  'role',
  'accounting',
  'information',
  'systems',
  'analysing',
  'providing',
  'decision',
  'support',
  'managers',
  'explain',
  'design',
  'accounting',
  'information',
  'systems',
  'financial',
  'm

Creating a list of words to generate the vocabulary and to remove context-dependent stop words.
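Context-dependent stop words are judged by document frequency, that is, the number of units a token appears in, rather than by its total count. The sketch below shows one way to filter on document frequency with a lower and an upper fraction. The helper names, the sample data and the thresholds in the example call are assumptions chosen to keep the example small.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each token, the number of documents that contain it."""
    # set(tokens) ensures a token is counted once per document.
    return Counter(token for tokens in docs.values() for token in set(tokens))

def filter_by_document_frequency(docs, low=0.05, high=0.95):
    """Keep tokens whose document frequency lies within [low, high] of all documents."""
    n_docs = len(docs)
    df = document_frequency(docs)
    keep = {t for t, c in df.items() if low * n_docs <= c <= high * n_docs}
    return {unit: [t for t in tokens if t in keep] for unit, tokens in docs.items()}

docs = {
    "A": ["unit", "health", "unit"],
    "B": ["unit", "design"],
    "C": ["unit", "health"],
    "D": ["unit", "law"],
}
print(document_frequency(docs)["unit"])
print(filter_by_document_frequency(docs, low=0.5, high=0.95))
```

In this toy data `unit` occurs in every document and is dropped as too frequent, while `design` and `law` occur in one document each and are dropped as too rare.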

In [86]:
words=[]
for each in text_dict:
    for eacher in text_dict[each]:
        words.append(eacher)
words

['This',
 'unit',
 'framework',
 'human',
 'development',
 'lifespan',
 'identify',
 'health',
 'specifically',
 'emergency',
 'health',
 'issues',
 'stages',
 'lifespan',
 'students',
 'investigate',
 'roles',
 'paramedics',
 'allied',
 'health',
 'professionals',
 'assessing',
 'human',
 'development',
 'maintaining',
 'health',
 'lifespan',
 'explore',
 'issues',
 'relating',
 'death',
 'grieving',
 'included',
 'unit',
 'clinical',
 'visits',
 'selected',
 'agencies',
 'provide',
 'clinical',
 'context',
 'theoretical',
 'background',
 'describe',
 'physical',
 'personal',
 'psychological',
 'social',
 'milestones',
 'human',
 'development',
 'lifespan',
 'identify',
 'social',
 'cultural',
 'determinants',
 'impact',
 'human',
 'development',
 'communicate',
 'effectively',
 'individuals',
 'lifespan',
 'developmental',
 'framework',
 'identify',
 'common',
 'acute',
 'chronic',
 'health',
 'issues',
 'occur',
 'lifespan',
 'apply',
 'contemporary',
 'theories',
 'development',
 '

In [87]:
fd = FreqDist(words)
fd.most_common(25)


[('unit', 270),
 ('students', 212),
 ('research', 194),
 ('skills', 159),
 ('practice', 131),
 ('design', 111),
 ('knowledge', 110),
 ('issues', 107),
 ('health', 102),
 ('understanding', 99),
 ('analyse', 99),
 ('This', 98),
 ('apply', 94),
 ('management', 92),
 ('evaluate', 87),
 ('work', 86),
 ('identify', 83),
 ('social', 83),
 ('including', 83),
 ('development', 82),
 ('develop', 82),
 ('clinical', 77),
 ('project', 76),
 ('international', 75),
 ('demonstrate', 73)]

In [88]:
#Task 2.2 removing context dependent stop words
#document frequency 
words_2 = list(chain.from_iterable([set(value) for value in text_dict.values()]))
fd_2 = FreqDist(words_2)
fd_2.most_common(25)

#The most frequent token, 'unit', is present in 162 units, which is below the threshold of 190 (95% of 200; 95% of the 196 unique units is about 186), so no context dependent stop words need to be removed

[('unit', 162),
 ('students', 111),
 ('This', 98),
 ('skills', 87),
 ('knowledge', 77),
 ('research', 73),
 ('analyse', 72),
 ('issues', 68),
 ('understanding', 67),
 ('including', 66),
 ('practice', 65),
 ('apply', 62),
 ('evaluate', 62),
 ('develop', 61),
 ('critically', 56),
 ('identify', 53),
 ('The', 50),
 ('demonstrate', 50),
 ('relevant', 50),
 ('development', 49),
 ('context', 49),
 ('work', 45),
 ('critical', 44),
 ('management', 44),
 ('range', 43)]

###### 4.6. Rare tokens (Task 4) are identified and removed from the vocab.

In [89]:
rare_tokens= list(filter(lambda x: x[1]<10,fd_2.items())) #because 5% of 200 is 10
rare_tokens

[('maintaining', 3),
 ('death', 1),
 ('lifespan', 8),
 ('agencies', 3),
 ('background', 7),
 ('paramedics', 3),
 ('physical', 9),
 ('stages', 4),
 ('psychological', 5),
 ('allied', 2),
 ('acute', 3),
 ('milestones', 2),
 ('essential', 8),
 ('meet', 4),
 ('grieving', 1),
 ('visits', 3),
 ('assessing', 3),
 ('developmental', 5),
 ('included', 3),
 ('loss', 3),
 ('chronic', 2),
 ('specifically', 5),
 ('selected', 8),
 ('emergency', 6),
 ('promotion', 2),
 ('determinants', 5),
 ('occur', 2),
 ('summarise', 3),
 ('full', 2),
 ('articulation', 6),
 ('singing', 1),
 ('piece', 3),
 ('sustained', 4),
 ('capacity', 7),
 ('staff', 6),
 ('inception', 1),
 ('distinctive', 1),
 ('analyses', 6),
 ('frameworks', 8),
 ('text', 4),
 ('semester-long', 1),
 ('clear', 6),
 ('original', 5),
 ('theatre', 2),
 ('dancing', 1),
 ('languages', 4),
 ('increased', 3),
 ('exploratory', 1),
 ('methodological', 5),
 ('systematic', 5),
 ('musical', 2),
 ('entails', 3),
 ('argument', 5),
 ('structural', 3),
 ('workshop

In [90]:
#task 2.4 removing rare tokens
# A set gives fast membership tests when filtering
rare_tok = {each[0] for each in rare_tokens}

for each in text_dict:
    text_dict[each]= [w for w in text_dict[each] if w not in rare_tok]
text_dict

{'ACB2851': ['The',
  'unit',
  'unit',
  'students',
  'broad',
  'introduction',
  'information',
  'systems',
  'role',
  'technology',
  'focus',
  'introduction',
  'systems',
  'management',
  'methods',
  'core',
  'business',
  'processes',
  'organisations',
  'unit',
  'focuses',
  'theory',
  'models',
  'decision',
  'support',
  'tools',
  'types',
  'models',
  'models',
  'effective',
  'design',
  'models',
  'development',
  'models',
  'industry',
  'standard',
  'examine',
  'role',
  'information',
  'systems',
  'decision',
  'support',
  'explain',
  'design',
  'information',
  'systems',
  'models',
  'develop',
  'models',
  'decision',
  'making',
  'apply',
  'critical',
  'thinking',
  'problem',
  'solving',
  'presentation',
  'skills',
  'individual',
  'group',
  'activities',
  'information',
  'systems',
  'demonstrate',
  'individual',
  'assessment',
  'understanding',
  'topics',
  'covered'],
 'ACF5080': ['This',
  'unit',
  'examines',
  'key',
  

###### 4.7. Stemming (Task 3) is done with the Porter stemmer

In [91]:
#task 2.3
stemmer = PorterStemmer()
for each in text_dict:
    for i in range(len(text_dict[each])):
        # Tokens starting with a capital letter are left unstemmed
        if not text_dict[each][i][0].isupper():
            text_dict[each][i]=stemmer.stem(text_dict[each][i])


## 5. Identifying the lexical diversity
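The measure computed in this section is the ratio of total tokens to distinct tokens. A minimal sketch on a made-up token list is given here, and the helper name `token_type_ratio` is an assumption. Some references define lexical diversity as the inverse ratio (distinct tokens over total tokens), so the direction should be stated when reporting it.

```python
def token_type_ratio(tokens):
    """Average number of occurrences per distinct token (tokens / types)."""
    return len(tokens) / len(set(tokens))

sample_tokens = ["unit", "student", "unit", "research"]  # 4 tokens, 3 distinct
print(token_type_ratio(sample_tokens))
```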

In [92]:
processed_words=[]
for each in text_dict:
    for i in range(len(text_dict[each])):
        processed_words.append( text_dict[each][i])
        
vocab = list(set(processed_words))
# Ratio of total tokens to distinct tokens, i.e. average occurrences per vocabulary entry
lexical_diversity = len(processed_words)/len(vocab)
print ("Vocabulary size: ",len(vocab),"\nTotal number of tokens: ", len(processed_words),"\nLexical diversity: ", lexical_diversity)

Vocabulary size:  234 
Total number of tokens:  9554 
Lexical diversity:  40.82905982905983


## 6. Creating the vocabulary and count vector file 
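The two output files pair a sorted vocabulary index with one sparse line per unit. The sketch below builds both structures in memory for made-up data. The helper names are assumptions, and the `index:count` pairs are sorted by index here for readability, which is a choice of this sketch rather than something the notebook does.

```python
from collections import Counter

def build_index(docs):
    """Assign each distinct token its position in the sorted vocabulary."""
    vocab = sorted({t for tokens in docs.values() for t in tokens})
    return {token: i for i, token in enumerate(vocab)}

def count_vector_line(unit, tokens, index):
    """Format one unit as 'UNITCODE,index:count,index:count,...'."""
    counts = Counter(tokens)
    pairs = sorted((index[t], c) for t, c in counts.items())
    return ",".join([unit] + [f"{i}:{c}" for i, c in pairs])

docs = {"ABC1234": ["unit", "model", "unit"], "DEFG5678": ["health"]}
index = build_index(docs)
print(index)
for unit, tokens in docs.items():
    print(count_vector_line(unit, tokens, index))
```

Only tokens that occur in a unit are written, which keeps each line short even when the vocabulary is large.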

In [93]:
vocab.sort()
# Map each vocabulary token to its position in the sorted vocabulary
index = {each: i for i, each in enumerate(vocab)}
# One "token:index" entry per line, separated by commas
with open('29858186_vocab.txt','w') as file:
    file.write(',\n'.join(str(each)+":"+str(i) for i, each in enumerate(vocab)))

In [94]:
# One line per unit: the unit code followed by comma-separated "index:count" pairs
lines = []
for each in text_dict:
    f=FreqDist(text_dict[each])
    lines.append(str(each) + ''.join(","+str(index[key])+":"+str(value) for key, value in f.items()))
with open('29858186_countVec.txt','w') as file:
    file.write("\n".join(lines))



## 7. Summary

This Task 2 assessment covers the basics of text pre-processing in Python and the generation of a sparse representation for each unit crawled from the Monash website.

The main outcomes of the task are:
- **.txt parsing and data extraction:** Data is extracted from the text file converted from the PDF, and the unit code is separated from the text of each unit.
- **re library:** re is used for pattern matching and for extracting data from the file.
- **nltk library:** nltk is used for tokenization with RegexpTokenizer, sentence segmentation, finding bigrams and stemming.
- **vocabulary and sparse count vector:** The vocabulary consists of the tokens from all unit descriptions after normalisation, removal of stop words, rare tokens and tokens with length less than 3, and stemming. Finally, a sparse count vector is calculated for every unit by counting the occurrences of each vocabulary token.