# FIT5196 Data wrangling 
## Assignment 4 - Text Preprocessing



### Author: Ramprasath Karunakaran

ID: 26994437

Date written: 18/10/2016

Version: 1.0

Program: Python 2.7.12 and Jupyter notebook

In [None]:
#checking python version
!python --version

In [None]:
# importing libraries that will be used in this report
import pandas as pd
import re
import numpy as np
import os
import nltk
import xml.etree.ElementTree as ET
import csv

## Task 1 - Parsing the XML files

Each patent is supplied as an XML file. From these XML files, we have to extract the following details:
> - Patent ID
- Section ID
- Abstract
- Description
- Claims

In [None]:
#folder which has all the patent files
patents_folder="./100"

In [None]:
#class that holds the extracted fields of one patent; one object is instantiated per patent
class Patent(object):
    def __init__(self, documentID, sectionID, abstract, description, claims):
        self.documentID = documentID
        self.sectionID = sectionID
        self.abstract = abstract
        self.description = description
        self.claims = claims

**All the functions used below are derived from the three chapters of Fundamentals of Text Preprocessing given in Alexandria materials.**

The given XML files can be parsed using the ElementTree module in Python's XML library. Each parsed document is held in a tree structure where each XML tag serves as a node.
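
As a minimal sketch of this tree lookup (Python 3, standard library only), the snippet below parses a made-up, heavily simplified record. The sample XML and the variable names are assumptions for illustration, and `itertext()` is used here as an alternative to `ET.tostring(..., method="text")` because it returns a plain string directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified patent record; real files carry far more markup.
SAMPLE_XML = """<us-patent-grant>
  <doc-number>01234567</doc-number>
  <section>A</section>
  <abstract><p>A folding <b>chair</b> frame.</p></abstract>
</us-patent-grant>"""

root = ET.fromstring(SAMPLE_XML)

# './/tag' searches all descendants and returns the first match (or None).
document_id = root.find('.//doc-number').text
section_id = root.find('.//section').text

# itertext() walks the subtree and yields only the text, dropping the tags.
abstract_text = ''.join(root.find('.//abstract').itertext()).strip()
```

Because `find` returns `None` when a tag is missing, a defensive version would check the result before reading `.text`.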

In [None]:
#empty list to store all the patents
patentList =[]
#iterating through all the files and subfolders in the folder
for root,subfolders,files in os.walk(patents_folder):
    for file_name in files:
        #combining all the file path which will be fed to element tree for parsing
        file_path  = os.path.join(root, file_name)
        # to parse only xml files
        if file_path.endswith('XML'):
            # converting the file to a xml tree
            tree = ET.parse(file_path)
            # locating the required nodes and storing them in variables
            documentID = tree.find('.//doc-number').text
            sectionID = tree.find('.//section').text
            abst = tree.find('.//abstract')
            abstract = ET.tostring(abst, method="text",encoding='UTF-8') #XML uses UTF-8 codec
            desc = tree.find('.//description')
            description = ET.tostring(desc, method="text",encoding='UTF-8')
            clm = tree.find('.//claims')
            claims = ET.tostring(clm, method="text",encoding='UTF-8')
            # create a Patent object holding the extracted fields
            patent = Patent(documentID, sectionID, abstract, description,claims)
            # appending the patent to the list
            patentList.append(patent)

Once all the XML files are parsed, the patent IDs and the section IDs are written to a text file called 'section_labels.txt'.
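
The row layout of such a label file can be sketched as follows (Python 3, standard library only). The two ID and section pairs are made up, and an in-memory buffer stands in for the real file so that nothing is written to disk.

```python
import csv
import io

# Hypothetical (patent ID, section ID) pairs standing in for the parsed patents.
label_info = {"01234567": "A", "07654321": "H"}

buffer = io.StringIO()  # in-memory stand-in for section_labels.txt
writer = csv.writer(buffer, lineterminator="\n")
for patent_id, section_id in sorted(label_info.items()):
    # one comma-separated row per patent
    writer.writerow([patent_id, section_id])

lines = buffer.getvalue().splitlines()
```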

In [None]:
# extracting the document ID and section ID alone from the patentList
labelInfo = dict([(patent.documentID, patent.sectionID) for patent in patentList])
# opening the output file in write mode; the with block closes it so every row is flushed to disk
with open("section_labels.txt","w") as labelOut:
    labelFile = csv.writer(labelOut)
    for key, value in labelInfo.items():
        # writing each patent ID and its section ID as one row
        labelFile.writerow([key, value])

## Task 2 - Tokenisation of Text

From the patent list created above, the abstract, description and claims sections are chosen to be tokenised. Hence we create a dictionary with patent IDs and their respective text details.

In [None]:
#extracting text of patents with document IDs
patentText = dict([ (patent.documentID ,",".join((patent.abstract,patent.description,patent.claims))) for patent in patentList])

All the given files are patent documents that are properly formatted as XML, so the extracted text did not contain special characters that needed custom handling. Had there been any, regular expressions would have been needed to customise the tokenisation. We therefore use the word tokeniser available in the Natural Language Toolkit (NLTK) to tokenise every text in the dictionary 'patentText', keeping only the tokens made up entirely of alphabetic letters.
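
To illustrate the alphabetic filter without depending on NLTK, here is a small regex-based stand-in (Python 3, standard library only). It is not `nltk.word_tokenize`: the function name `tokenize_alpha`, the splitting pattern and the sample sentence are assumptions chosen only to show the filtering idea.

```python
import re

def tokenize_alpha(text):
    """Lowercase the text, split it into word and punctuation tokens, keep alphabetic ones."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [token for token in tokens if token.isalpha()]

sample = "The device, shown in FIG. 2, has 3 parts."
alpha_tokens = tokenize_alpha(sample)
# punctuation and the numbers 2 and 3 are dropped by isalpha()
```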

In [None]:
def tokenize_sent(sent):
    """
    Tokenizes a sentence and returns a list of the words that contain only alphabetic
    letters.
    """
    #lower() converts the words to lowercase which will avoid confusion while creating a dictionary of vocabulary
    return [word for word in nltk.word_tokenize(sent.lower()) if word.isalpha()]

In [None]:
# dictionary to store patent ids and their corresponding set of tokens
tokenized_sents = {} 
# iterating through the patent text
for keys in patentText.iterkeys():
    #decode('utf-8') turns the UTF-8 byte string produced by ET.tostring into a unicode string before tokenising
    tokenized_sents[keys] = tokenize_sent(patentText[keys].decode('utf-8')) 

Hence, all the patent IDs and their respective tokens are stored in a dictionary for further processing.

Now, we try to find the most common tokens in the data using the FreqDist function in NLTK probability.
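
The counting step can be sketched with `collections.Counter` (Python 3, standard library only). In NLTK 3, `FreqDist` is a subclass of `Counter`, so `most_common` behaves the same way. The token list is made up for illustration.

```python
from collections import Counter

# Hypothetical tokens from a few patents, already lowercased.
tokens = ["the", "valve", "of", "the", "pump", "and", "the", "valve", "seat"]

freq = Counter(tokens)
top_two = freq.most_common(2)   # the two most frequent (token, count) pairs
n_tokens = len(tokens)          # total number of tokens
n_types = len(freq)             # number of distinct tokens (types)
```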

In [None]:
from nltk.probability import *

def word_concat(dsd):
    """
    Concatenate all the tokens stored in the values of a given dictionary. Each value is the
    list of tokens of one patent.
    """
    all_words = []
    # iterates over all patents and stores the tokens in a single list.
    for sent in dsd.values():
        all_words += sent
    print "tokens:", len(all_words) # total number of tokens 
    print "types:", len(set(all_words)) # total number of types
    return all_words

In [None]:
#listing the common words and their number of occurences
freq_dist = FreqDist(word_concat(tokenized_sents))
freq_dist.most_common(20)

From the above frequencies, it is evident that stop words such as 'the' and 'of' dominate the counts although they carry little information, so they would distort the later preprocessing steps.

So we remove the stop words before finding bigrams and trigrams.

In [None]:
# import set of 127 stop words from NLTK corpus
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
stopwords_list

Remove the above stop words from the tokenised sentences.
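
A minimal sketch of this filtering step (Python 3, standard library only), assuming a tiny made-up stop word set and two made-up patents. It keeps one filtered list per patent and then flattens them into a single list for frequency counting.

```python
# Hypothetical miniature stop word list; NLTK's English list is much longer.
STOP_WORDS = {"the", "of", "and", "a", "is"}

tokenized = {
    "01234567": ["the", "valve", "of", "the", "pump"],
    "07654321": ["a", "rotor", "is", "mounted"],
}

# A set gives constant-time membership tests; one filtered list is kept per patent.
filtered = {
    patent_id: [token for token in tokens if token not in STOP_WORDS]
    for patent_id, tokens in tokenized.items()
}

# Flatten the per-patent lists into one list of tokens, in patent ID order.
all_filtered = [token for patent_id in sorted(filtered) for token in filtered[patent_id]]
```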

In [None]:
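# collecting the tokens of every patent (not only the last one visited) after removing the NLTK stop words
filtered_sents = []
for keys in tokenized_sents.iterkeys():
    filtered_sents += [token for token in tokenized_sents[keys] if token not in stopwords_list]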

Furthermore, we take a second list of 570 stop words from [Kevin Bouge's website](https://sites.google.com/site/kevinbouge/stopwords-lists), store it in a separate text file and remove those words from the data as well.

In [None]:
import nltk
#list to store the stop words
stopwords_list_570 = []
# all the stopwords are stored in a text file
with open('./stopwords_en.txt') as f:
    stopwords_list_570 = f.read().splitlines()

In [None]:
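# removing the 570 stop words from the tokens left after the NLTK filter
stopwords_set_570 = set(stopwords_list_570)  # a set makes the membership test fast
filtered_sents = [w for w in filtered_sents if w not in stopwords_set_570]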

Now, we list the most common words among the remaining tokens using FreqDist, as above.

In [None]:
from nltk.probability import *
freq_dist = FreqDist(filtered_sents)
freq_dist.most_common(20)

Now, we try to find bigrams and trigrams using the NLTK in-built methods.
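
What these n-gram helpers produce can be sketched in plain Python (Python 3, standard library only). The helper name `ngrams` and the token list are assumptions for illustration, and this is a stand-in rather than NLTK's own implementation.

```python
def ngrams(tokens, n):
    """Return the list of n-token tuples found at consecutive positions."""
    # zip over n shifted copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["first", "end", "portion", "second", "end", "portion"]
bigram_list = ngrams(tokens, 2)
trigram_list = ngrams(tokens, 3)
```

A list of `k` tokens yields `k - 1` bigrams and `k - 2` trigrams, so the six tokens above give five bigrams and four trigrams.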

In [None]:
#finding bigrams and trigrams
my_bigrams = nltk.bigrams(filtered_sents)
my_trigrams = nltk.trigrams(filtered_sents)

In [None]:
#storing bigrams and trigrams in a list
list_bigrams=list(my_bigrams)
list_trigrams=list(my_trigrams)

In [None]:
# trying to find the frequency of bigrams and trigrams
common_trigrams = FreqDist(list_trigrams)
common_bigrams = FreqDist(list_bigrams)

Now, we extract the top 200 bigrams and top 100 trigrams to retokenise.
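
Selecting the most frequent n-grams can be sketched with `collections.Counter` (Python 3, standard library only). The bigram list is made up, and a cut-off of 1 replaces the 200 and 100 used in the notebook.

```python
from collections import Counter

# Hypothetical bigrams; ("end", "portion") occurs twice, the others once.
bigram_list = [("first", "end"), ("end", "portion"), ("portion", "second"),
               ("second", "end"), ("end", "portion")]

bigram_counts = Counter(bigram_list)
# most_common(n) returns (item, count) pairs; only the item is kept.
top_bigrams = [gram for gram, count in bigram_counts.most_common(1)]
```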

In [None]:
# extract the top 200 bigram and 100 trigram based on the frequency distribution
trigrams =[item[0] for item in common_trigrams.most_common(100)]
bigrams = [item[0] for item in common_bigrams.most_common(200)]

Now, we retokenise the given text by adding the bigrams and trigrams to our set of tokens using Multiword Tokeniser in NLTK.
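
The effect of multiword retokenisation can be sketched with a greedy longest-match merge (Python 3, standard library only). This is a simplified stand-in and not NLTK's `MWETokenizer` implementation: the function name `merge_multiwords`, the underscore separator argument and the sample data are assumptions for illustration.

```python
def merge_multiwords(tokens, multiwords, separator="_"):
    """Merge known multiword expressions into single tokens, longest match first."""
    mwe_set = set(tuple(m) for m in multiwords)
    max_len = max([len(m) for m in mwe_set] or [1])
    merged = []
    i = 0
    while i < len(tokens):
        # try the longest candidate starting at position i, then shorter ones
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in mwe_set:
                merged.append(separator.join(candidate))
                i += size
                break
        else:
            # no multiword expression starts here, keep the single token
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["the", "first", "end", "portion", "is", "fixed"]
multiwords = [("first", "end"), ("first", "end", "portion")]
merged_tokens = merge_multiwords(tokens, multiwords)
```

Trying the longest candidate first means the trigram wins over the bigram that shares its first two words.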

In [None]:
# Merge the list of trigram and bigram
multiwordlist = []
multiwordlist.extend(bigrams)
multiwordlist.extend(trigrams)

In [None]:
from nltk.tokenize import MWETokenizer
mwe_tokens = {}
# the tokenizer is built once, since the multiword list is the same for every patent
mwe_tokenizer = MWETokenizer(multiwordlist)
for keys in tokenized_sents.iterkeys():
    # merging the bigrams and trigrams in the tokens of each patent (stop words still included)
    mwe_tokens[keys] = mwe_tokenizer.tokenize(tokenized_sents[keys])

Now, we have a dictionary of patent IDs and their respective tokens, with the selected multiword expressions merged into single tokens.

## Task 3 - Vectors
To build the vectors, we use the tokens without removing stop words. For each patent, we join all its tokens into a single space-separated string, referred to below as a sentence. These sentences make it easy to fit the data in the vectoriser functions.

In [None]:
#list of patent ids
sents_ids = []
#list of sentences
sents_words =[]
for key, value in mwe_tokens.items():
    sents_ids.append("{0}".format(key))
    txt = ' '.join(value) # joining all the tokens to form a sentence
    sents_words.append(txt) # appending the sentences to the list

## TF-IDF Vectors

To produce TF-IDF vectors, we use TfidfVectorizer from the scikit-learn library, which is fitted on the sentences formed above and returns a matrix of TF-IDF weights.
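
How such weights arise can be sketched by hand (Python 3, standard library only), assuming scikit-learn's default settings for `TfidfVectorizer`: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row. The function name `tfidf_rows` and the two sample sentences are made up, and the sketch splits on whitespace only, which is simpler than the vectoriser's own token pattern.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) where each row holds L2-normalised TF-IDF weights."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted(set(token for doc in tokenised for token in doc))
    n_docs = len(tokenised)
    # document frequency: number of documents containing each term
    df = Counter(token for doc in tokenised for token in set(doc))
    # smoothed inverse document frequency
    idf = {t: math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(["valve body valve", "valve seat"])
```

In the second sentence, "seat" receives a higher weight than "valve" because "valve" appears in both documents and therefore has a lower IDF.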

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# vectorizer is set to analyse word by word
tfidf_vectorizer = TfidfVectorizer(input = 'content', analyzer = 'word')
#fitting the sentences into the vectorizer
tfidf_vectors = tfidf_vectorizer.fit_transform(sents_words)
tfidf_vectors.shape

We get a matrix with 800 rows, one per patent ID, and 34534 columns, one per word in the vocabulary. The entry at the intersection of a row and a column is the TF-IDF weight of that word in that patent.
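
Since most entries of such a matrix are zero, only the non-zero cells are worth storing. A minimal sketch (Python 3, standard library only) with made-up IDs and weights, where `nonzero_triples` is a hypothetical helper name:

```python
def nonzero_triples(doc_ids, rows):
    """Turn dense rows into (document ID, column index, weight) triples, skipping zeros."""
    triples = []
    for doc_id, row in zip(doc_ids, rows):
        for column, weight in enumerate(row):
            if weight > 0:
                triples.append((doc_id, column, weight))
    return triples

doc_ids = ["01234567", "07654321"]
rows = [[0.0, 0.57, 0.82], [0.71, 0.0, 0.71]]  # made-up weights
triples = nonzero_triples(doc_ids, rows)
```

The column index doubles as the word index, because the vocabulary list is in the same order as the matrix columns.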

In [None]:
# get the set of entire columns as the vocabulary.
vocab = tfidf_vectorizer.get_feature_names()
vocab_list = []
# convert the sparse matrix to a dense array once instead of once per patent
tfidf_array = tfidf_vectors.toarray()
# combining the patent id, word index and the weight
for i, val in enumerate(sents_ids):
    # the column position is the word's index in vocab
    for word_index, weight in enumerate(tfidf_array[i]):
        if weight > 0: #weight 0 is of no use
            vocab_list.append([val, word_index, weight])


Store all the weights in a separate text file.

In [None]:
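# the with block closes the file so that every row is flushed to disk
with open("tf_idf_vectors.txt","w") as tf_idfOut:
    tf_idfFile = csv.writer(tf_idfOut)
    for value in vocab_list:
        # writing each [patent id, word index, weight] entry as one row
        tf_idfFile.writerow(value)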


## Count Vectors and Binary Vectors
To build count vectors, we make use of the CountVectorizer. The binary vectors are derived from the count vectors: wherever the count is greater than 0, the binary weight is set to 1.
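
The relation between the two representations can be sketched in a few lines (Python 3, standard library only). The vocabulary and the sentence are made up for illustration.

```python
from collections import Counter

vocabulary = ["pump", "seat", "valve"]
sentence = "valve seat valve"

counts = Counter(sentence.split())
# count vector: number of occurrences of each vocabulary word
count_vector = [counts[word] for word in vocabulary]
# binary vector: 1 where the word occurs at least once, otherwise 0
binary_vector = [1 if value > 0 else 0 for value in count_vector]
```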

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(input = 'content',analyzer = "word")

In [None]:
#fitting the sentences into the vectorizer
data_features = vectorizer.fit_transform(sents_words)
print data_features.shape

In [None]:
countvocab = vectorizer.get_feature_names()
countvocab_list = []
binaryvocab_list = []
# convert the sparse matrix to a dense array once instead of once per patent
count_array = data_features.toarray()

#combining the patent id, word index and the weight for count and binary vector
for i, val in enumerate(sents_ids):
    # the column position is the word's index in countvocab
    for word_index, weight in enumerate(count_array[i]):
        if weight > 0:
            countvocab_list.append([val, word_index, weight])
            binaryvocab_list.append([val, word_index, 1])

In [None]:
#create a separate file for count vectors
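# the with block closes the file so that every row is flushed to disk
with open("count_vectors.txt","w") as countOut:
    countVectorsFile = csv.writer(countOut)
    for value in countvocab_list:
        # writing each [patent id, word index, count] entry as one row
        countVectorsFile.writerow(value)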

In [None]:
# create a separate file for binary vectors
with open("binary_vectors.txt", "wb") as outfile:
    writer = csv.writer(outfile, delimiter = ",")
    #each [patent id, word index, 1] entry in the list is written as one row in the file
    for val in binaryvocab_list:
        writer.writerow(val)

## Task 4

Running SVM classifier without removing stop words.

In [None]:
!python svm_classifier.py count_vectors.txt section_labels.txt

AUC 0.66

In [None]:
!python svm_classifier.py binary_vectors.txt  section_labels.txt

AUC 0.67

In [None]:
!python svm_classifier.py tf_idf_vectors.txt  section_labels.txt

AUC 0.75

Removing the stop words from the multiword token set.

In [None]:
filtered_sents_retoken = {}
for keys in mwe_tokens.iterkeys():
    filtered_sents_retoken[keys] = [token for token in mwe_tokens[keys] if token not in stopwords_list ]

Now that the stop words are removed, we generate a new set of sentences for Task 3 above using the code below.

In [None]:
#list of patent ids
sents_ids = []
#list of sentences
sents_words =[]
#using a different set of tokens after removing stop words
for key, value in filtered_sents_retoken.items():
    sents_ids.append("{0}".format(key))
    txt = ' '.join(value) # joining all the tokens to form a sentence
    sents_words.append(txt) # appending the sentences to the list

Re-running all the vectoriser cells from Task 3 on these sentences generates different sets of weights.

In [None]:
!python svm_classifier.py count_vectors.txt section_labels.txt

AUC 0.74

In [None]:
!python svm_classifier.py binary_vectors.txt  section_labels.txt

AUC 0.68

In [None]:
!python svm_classifier.py tf_idf_vectors.txt  section_labels.txt

AUC 0.74

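When stop words are removed, the AUC increases for count vectors (0.66 to 0.74) and marginally for binary vectors (0.67 to 0.68), but decreases slightly for TF-IDF vectors (0.75 to 0.74).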