# FIT5196 Data wrangling 
## Assignment 2 - Data Cleansing



### Author: Bharath Kumar Devakumar

**ID: 27559807**

Date written: 15/10/2016

Version: 1.0

Program: Python 2.7.12 and Jupyter notebook

Libraries used:
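- `pandas`: For data frames
- `numpy`: For numpy arrays
- `re`: For regular expressions
- `os`: For walking the folder of XML files
- `nltk`: For tokenizing the patent text
- `xml.etree.ElementTree`: For parsing the XML files
- `csv`: For writing the section labels file
- `datetime`: For datetime data types
- `seaborn`: For visual graphs
- `HTML`: To define CSS
- `matplotlib.pyplot`: For visual plots
- `regression`: To train the model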

In [1]:
# importing libraries that will be used in this report
import pandas as pd
import re
import numpy as np
import os
import nltk
import xml.etree.ElementTree as ET 

## Task 1: Parsing XML files

- Parsing XML files: Read all the XML files with any Python XML library (e.g., ElementTree, BeautifulSoup, lxml) and extract the following data:
- Patent IDs: Extract the document number from the `publication-reference` tag as the patent ID for each patent.
- IPC (International Patent Classification) section labels: Extract the section labels (which take values from A to H) from the `classifications-ipcr` tag for all the patents and save them in a file called `section_labels.txt`, in which each line has the form `patent_id,section_label`, for example `07640598,A`.

### Reading XML files

In [2]:
text_file_folder = "./100"

In [3]:
# https://developmentality.wordpress.com/2012/03/30/three-ways-of-creating-dictionaries-in-python/
class Patent(object):
    """Fields extracted from one patent XML file."""
    def __init__(self, documentID, sectionID, abstract, description, claims):
        self.documentID = documentID    # document number, used as the patent ID
        self.sectionID = sectionID      # IPC section label (A to H)
        self.abstract = abstract
        self.description = description
        self.claims = claims

In [4]:
patents = []
for root, subFolders, files in os.walk(text_file_folder):
    for file_name in files:  # visit every file below the folder
        file_path = os.path.join(root, file_name)
        if not file_path.endswith('XML'):
            continue  # skip anything that is not a patent XML file
        tree = ET.parse(file_path)
        # find() returns the first matching element in document order
        doc = tree.find('.//doc-number').text
        sec = tree.find('.//section').text
        # method="text" drops the markup and keeps only the text content
        abst = ET.tostring(tree.find('.//abstract'), method="text", encoding='UTF-8')
        desc = ET.tostring(tree.find('.//description'), method="text", encoding='UTF-8')
        clm = ET.tostring(tree.find('.//claims'), method="text", encoding='UTF-8')
        patents.append(Patent(doc, sec, abst, desc, clm))
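
`find` returns `None` when a tag is absent, and `ET.tostring(None, ...)` then fails, so a single file without an abstract or claims would stop the whole loop. The following is a minimal sketch of a more defensive extraction. It assumes a heavily simplified XML layout: the inline document and the helper name `element_text` are illustrative and not taken from the patent files.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a patent file; real files carry many more tags.
SAMPLE_XML = """<us-patent-grant>
  <publication-reference><document-id><doc-number>07640598</doc-number></document-id></publication-reference>
  <classifications-ipcr><classification-ipcr><section>A</section></classification-ipcr></classifications-ipcr>
  <abstract><p>A clamping <b>device</b> for a plate.</p></abstract>
</us-patent-grant>"""


def element_text(root, path):
    """Return all text below the first element matching path, or '' if absent."""
    node = root.find(path)
    if node is None:
        return ""
    # itertext() walks nested tags, so inline markup such as <b> is flattened.
    return "".join(node.itertext()).strip()


root = ET.fromstring(SAMPLE_XML)
# Scoping the path to the parent tag avoids picking up a same-named
# element from another part of the document.
doc_id = element_text(root, ".//publication-reference//doc-number")
section = element_text(root, ".//classifications-ipcr//section")
abstract = element_text(root, ".//abstract")
claims = element_text(root, ".//claims")  # tag missing in the sample
```

With this helper a missing tag yields an empty string rather than an exception, which leaves the decision about incomplete patents to a later cleansing step.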

In [5]:
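# Map each patent ID to its section label, print the pairs, and save them
import csv

info = dict((p.documentID, p.sectionID) for p in patents)
print "\n".join(",".join((k, str(v))) for k, v in sorted(info.items()))
# Python 2's csv module expects a binary-mode file; "with" closes it on exit
with open("section_labels.txt", "wb") as f:
    w = csv.writer(f, lineterminator="\n")
    for key, val in sorted(info.items()):
        w.writerow([key, val])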

07640598,A
07640599,A
07640600,A
07640601,A
07640602,A
07640603,E
07640604,E
07640605,A
07640606,A
07640607,A
07640608,A
07640609,A
07640610,A
07640611,A
07640612,B
07640613,A
07640614,A
07640615,A
07640616,A
07640617,B
07640618,A
07640619,B
07640620,A
07640622,E
07640623,A
07640624,A
07640625,A
07640626,A
07640627,E
07640628,E
07640629,E
07640630,E
07640633,A
07640634,F
07640636,A
07640638,A
07640639,A
07640640,A
07640641,H
07640642,B
07640643,F
07640644,B
07640645,B
07640646,B
07640647,G
07640648,H
07640651,H
07640652,H
07640654,H
07640655,H
07640656,H
07640657,H
07640658,H
07640659,H
07640660,H
07640661,B
07640662,B
07640663,B
07640664,B
07640665,B
07640666,B
07640667,B
07640668,B
07640670,G
07640671,B
07640672,G
07640673,G
07640674,G
07640676,B
07640677,G
07640678,F
07640679,A
07640680,A
07640681,A
07640682,E
07640684,E
07640685,E
07640686,D
07640687,G
07640688,F
07640689,F
07640690,F
07640691,F
07640692,A
07640693,A
07640694,A
07640695,A
07640696,E
07640697,E
07640698,B
07640699,H
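
Each line of this listing follows the required `patent_id,section_label` form. A minimal sketch of a format check is given below. It assumes that IDs are purely numeric, as in the listing, and that labels are restricted to A through H. The helper name `invalid_lines` and the sample lines are illustrative.

```python
import re

# One numeric patent ID, a comma, and a single IPC section letter A-H.
LINE_PATTERN = re.compile(r"^\d+,[A-H]$")


def invalid_lines(lines):
    """Return (line_number, text) pairs for lines that break the format."""
    bad = []
    for number, line in enumerate(lines, 1):
        # Strip either line ending; csv.writer emits "\r\n" by default.
        text = line.rstrip("\r\n")
        if not LINE_PATTERN.match(text):
            bad.append((number, text))
    return bad


sample = ["07640598,A\n", "07640603,E\r\n", "07640686,D\n", "0764068X,D\n", "07640699,J\n"]
problems = invalid_lines(sample)
```

Because the function accepts any iterable of lines, an open file handle for `section_labels.txt` could be passed to it directly.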

In [6]:
# Concatenate abstract, description and claims into one text per patent ID
pat_all = dict((p.documentID, ",".join((p.abstract, p.description, p.claims))) for p in patents)

In [7]:
pat_all['07640855']

'\nDisclosed is a clamping device (02) for fastening a plate (03) to the periphery of a cylinder (01). Said clamping device (02) comprises a first clamping element (04), a pivotally mounted second clamping element (06), a spring part (07), and a bracing element that is embodied as a pivotable spindle (08) and is movable between a clamping position in which the bracing element maintains the plate in a clamped state between the clamping elements and a released position in which the clamping elements release the plate. The spindle (08) is mounted in a groove so as to be displaceable, is fixed in an intermediate space between the spring part (07) and the second clamping element (06), and is pressed against the second clamping element by means of the spring part in the clamping position.\n\n,\n\nFIELD OF THE INVENTION\nThe invention relates to a clamping device for a printing plate on a cylinder according to the preamble of claim 1.\nThe subject of the invention is a cylinder having a clamp

In [10]:
def tokenize_sent(sent):
    """
    Tokenize a text and return a list of its lowercased words, keeping only
    the tokens that consist entirely of alphabetic characters.
    """
    return [word for word in nltk.word_tokenize(sent.lower()) if word.isalpha()]
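
The `isalpha` filter is what removes reference numerals such as `(02)` and punctuation from the patent text. The sketch below illustrates that effect, using a simple regular expression as a stand-in for `nltk.word_tokenize` so that it runs without NLTK's tokenizer data. The regex and the name `simple_tokenize` are assumptions, and the stand-in will not split text exactly as NLTK does.

```python
import re

# Crude stand-in for nltk.word_tokenize: runs of word characters, or
# single punctuation marks. NLTK's real rules are more elaborate.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Lowercase text, split it into tokens, and keep purely alphabetic ones."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token.isalpha()]


words = simple_tokenize("Said clamping device (02) comprises a spring part (07).")
```

Note that the filter also discards tokens mixing letters with digits or hyphens, which may or may not be wanted for patent vocabulary.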

In [11]:
tokenized_sents = {}  # key: patent ID, value: list of word tokens for that patent
# The texts are UTF-8 byte strings. Decode them before tokenizing, because
# tokenizing the raw bytes raises
# "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 11".
for doc_id, text in pat_all.iteritems():
    tokenized_sents[doc_id] = tokenize_sent(text.decode('utf-8'))
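
Byte `0xc2` is the first byte of a two-byte UTF-8 sequence (for example a degree sign or a non-breaking space), so the `UnicodeDecodeError` reported for this cell means the patent text contains non-ASCII characters that an implicit ASCII decode cannot handle. A minimal sketch of the failure and of the explicit decode that avoids it follows. The sample bytes are made up for illustration.

```python
# UTF-8 bytes for "angle 90" followed by a degree sign (0xc2 0xb0).
raw = b"angle 90\xc2\xb0"

try:
    raw.decode("ascii")  # what an implicit conversion would attempt
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding explicitly as UTF-8 yields proper text, with one character
# for the degree sign instead of two raw bytes.
text = raw.decode("utf-8")
```

Decoding once, right after the text is extracted, keeps every later step working on text rather than bytes.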

In [12]:
print tokenized_sents['07640855']

['disclosed', 'is', 'a', 'clamping', 'device', 'for', 'fastening', 'a', 'plate', 'to', 'the', 'periphery', 'of', 'a', 'cylinder', 'said', 'clamping', 'device', 'comprises', 'a', 'first', 'clamping', 'element', 'a', 'pivotally', 'mounted', 'second', 'clamping', 'element', 'a', 'spring', 'part', 'and', 'a', 'bracing', 'element', 'that', 'is', 'embodied', 'as', 'a', 'pivotable', 'spindle', 'and', 'is', 'movable', 'between', 'a', 'clamping', 'position', 'in', 'which', 'the', 'bracing', 'element', 'maintains', 'the', 'plate', 'in', 'a', 'clamped', 'state', 'between', 'the', 'clamping', 'elements', 'and', 'a', 'released', 'position', 'in', 'which', 'the', 'clamping', 'elements', 'release', 'the', 'plate', 'the', 'spindle', 'is', 'mounted', 'in', 'a', 'groove', 'so', 'as', 'to', 'be', 'displaceable', 'is', 'fixed', 'in', 'an', 'intermediate', 'space', 'between', 'the', 'spring', 'part', 'and', 'the', 'second', 'clamping', 'element', 'and', 'is', 'pressed', 'against', 'the', 'second', 'clampin