# Pre-processing XML Patents 


## 1. Introduction
This tutorial focuses on preprocessing a set of patent documents stored in XML format and generating sparse representations for those patents. The final output file should be exactly the same as the one stored in "patents.txt".

To finish this task, you should
1. Extract the abstract and claims for each patent from its XML file, using Beautiful Soup
2. Tokenize the patents
3. Generate 100 bigram collocations
4. Re-tokenize the patents with those bigram collocations
5. Generate the TF-IDF vectors for the re-tokenized patents
6. Save the vectors in the form shown in "patents.txt"

## 2.  Import libraries 

Here we rely on existing packages as much as possible.

In [17]:
from bs4 import BeautifulSoup as bsoup
import re
import os
import nltk
from nltk.collocations import *
from itertools import chain
import itertools
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer

## 3. Extract each patent's abstract and claims

The first task is to parse each patent stored in the "xml_files" folder. The information to be extracted includes
1. patent document number (doc-number) stored in "publication-reference"
2. patent's abstract
3. patent's claims 

Hint: you can use a dictionary to store the patents, where the key is the doc-number and the value is one long string containing both the abstract and all the claims.

In [16]:
xml_file_path = "./xml_files"

In [2]:
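def parsing(t):
    """Return (doc-number, abstract + claims text) for one patent XML document."""
    xmlSoup = bsoup(t, "lxml-xml")

    # The doc-number is stored inside <publication-reference>
    pid = xmlSoup.find("publication-reference").find('doc-number').string

    text = ""

    # Extract the paragraphs of <abstract>, if the patent has one
    abt = xmlSoup.find('abstract')
    if abt is not None:
        for p in abt.find_all('p'):
            text = text + p.text + " "

    # Extract the claims
    for tag in xmlSoup.find_all('claim-text'):
        text = text + tag.text

    return (pid, text)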
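
To see what the extraction is expected to return, here is a minimal, self-contained sketch run on a made-up patent. It uses the standard library's `xml.etree.ElementTree` instead of Beautiful Soup so that it runs without `lxml`; the sample XML, its doc-number, and the helper name `parse_patent_et` are illustrative assumptions and are not taken from the real data.

```python
import xml.etree.ElementTree as ET

# A made-up, heavily simplified patent; real files contain many more elements.
SAMPLE_XML = """<us-patent-grant>
  <us-bibliographic-data-grant>
    <publication-reference>
      <document-id><doc-number>00000001</doc-number></document-id>
    </publication-reference>
  </us-bibliographic-data-grant>
  <abstract><p>A folding chair.</p></abstract>
  <claims>
    <claim><claim-text>1. A chair comprising a seat.</claim-text></claim>
  </claims>
</us-patent-grant>"""


def parse_patent_et(xml_text):
    """Hypothetical stand-in for parsing(): return (doc-number, abstract + claims)."""
    root = ET.fromstring(xml_text)
    pid = root.find(".//publication-reference").find(".//doc-number").text

    text = ""
    # Abstract paragraphs, each followed by a space (as in parsing())
    abstract = root.find(".//abstract")
    if abstract is not None:
        for p in abstract.iter("p"):
            text += "".join(p.itertext()) + " "
    # All claim texts, appended one after another
    for claim in root.iter("claim-text"):
        text += "".join(claim.itertext())
    return pid, text


pid, text = parse_patent_et(SAMPLE_XML)
print(pid, repr(text))
```

The result is a `(doc-number, text)` pair, which is the shape that each entry of `patents_raw` is built from.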

In [4]:
patents_raw = {}
for xfile in os.listdir(xml_file_path): 
    xfile = os.path.join(xml_file_path, xfile)
    if os.path.isfile(xfile) and xfile.endswith('.XML'): 
        with open(xfile) as f:  # close the file handle once it has been parsed
            (pid, text) = parsing(f)
        patents_raw[pid] = text

## 4. Tokenize the patents
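After extracting the text, tokenize the patents with the regular expression tokenizer implemented in NLTK. The pattern used below, `[a-zA-Z]{2,}`, keeps only runs of two or more letters, so digits, punctuation, and single-letter words are dropped.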

In [None]:
tokenizer = RegexpTokenizer(r'[a-zA-Z]{2,}') 

In [6]:
def tokenizePatent(pid):
    """
        Tokenize a single patent.
        The one argument is the patent id (pid).
        First, normalize the case.
        Then, use the regular expression tokenizer to tokenize the patent with the specified id.
    """
    raw_patent = patents_raw[pid].lower()
    tokenized_patents = tokenizer.tokenize(raw_patent)
    return (pid, tokenized_patents) # return a tuple of patent_id and a list of tokens

patents_tokenized = dict(tokenizePatent(pid) for pid in patents_raw.keys())
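
To make the effect of the pattern concrete, the following sketch applies the same regular expression with the standard `re` module to a made-up sentence. With its default settings, `RegexpTokenizer.tokenize` returns the non-overlapping matches of its pattern, so `re.findall` is used here as an assumed equivalent; the sample sentence is invented.

```python
import re

PATTERN = r'[a-zA-Z]{2,}'  # same pattern as the tokenizer above

sample = "A 3-wheel walk-behind mower, e.g. US7654321B2."
tokens = re.findall(PATTERN, sample.lower())  # lowercase first, then match
print(tokens)
```

The article "a", the digits, and the abbreviation "e.g." all disappear, and the patent number contributes only the token `us`.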

## 5.  Generate the 100 bigram collocations
The next task is to generate the bigram collocations, given the tokenized patents.

The first step is to concatenate all the tokenized patents using the chain.from_iterable function. Wrapping its result in list() gives one flat list that contains the tokens of every patent, in order.
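
As a small illustration of the flattening step, the sketch below runs `chain.from_iterable` on two toy token lists; the dictionary `toy_tokenized` and its doc-numbers are made up.

```python
from itertools import chain

# Hypothetical tokenized patents keyed by made-up doc-numbers
toy_tokenized = {
    "00000001": ["folding", "chair"],
    "00000002": ["hip", "joint", "implant"],
}

# Concatenate the per-patent token lists into one flat list
all_toy_words = list(chain.from_iterable(toy_tokenized.values()))
print(all_toy_words)
print("total number of tokens: ", len(all_toy_words))
print("the size of the unigram vocabulary: ", len(set(all_toy_words)))
```

Note that because the lists are joined end to end, the last token of one patent becomes adjacent to the first token of the next one in the flat list.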

In [8]:
all_words = list(chain.from_iterable(patents_tokenized.values()))

total number of tokens:  140764
the size of the unigram vocabulary:  3318 

['feed', 'inhibit', 'converter', 'grass', 'precluding', 'additives', 'otherwise', 'increasing', 'design', 'disengages', 'temporary', 'slide', 'wetter', 'catalytic', 'transport', 'vent', 'relates', 'forwards', 'produced', 'oxygen', 'without', 'starter', 'sleeves', 'issue', 'waterproofed', 'controls', 'axially', 'limiting', 'brought', 'protrusions', 'forth', 'handling', 'arrayed', 'securing', 'high', 'recording', 'walled', 'connective', 'returning', 'direction', 'grooves', 'strand', 'effective', 'apart', 'snag', 'envelop', 'but', 'base', 'tilting', 'ceramic']


The second step is to generate the 100 bigram collocations. The functions you need include
* BigramAssocMeasures()
* BigramCollocationFinder.from_words()
* apply_freq_filter(20)
* apply_word_filter(lambda w: len(w) < 3)
* nbest(bigram_measures.pmi, 100)

Please do not change the parameters given in the last three functions. More information about generating collocations with NLTK can be found at http://www.nltk.org/howto/collocations.html.
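
To show what the two filters and the PMI ranking do, here is a hand-rolled sketch on a toy token list that uses only the standard library. It assumes the usual definition PMI(w1, w2) = log2(c(w1, w2) * N / (c(w1) * c(w2))), where N is the total number of tokens, and this is assumed to match what `bigram_measures.pmi` computes. The function `top_pmi_bigrams` and the toy sentence are hypothetical, and the frequency threshold is lowered to fit the toy data.

```python
import math
from collections import Counter


def top_pmi_bigrams(tokens, min_freq, min_len, n):
    """Rank adjacent word pairs by PMI after frequency and word-length filters."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    scored = []
    for (w1, w2), count in bigram_counts.items():
        if count < min_freq:                        # role of apply_freq_filter
            continue
        if len(w1) < min_len or len(w2) < min_len:  # role of apply_word_filter
            continue
        pmi = math.log2(count * total / (unigram_counts[w1] * unigram_counts[w2]))
        scored.append(((w1, w2), pmi))

    # Highest PMI first, then keep the n best
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


toy = "walk behind mower and walk behind machine and the mower and the machine".split()
ranked = top_pmi_bigrams(toy, min_freq=2, min_len=3, n=3)
for bigram, score in ranked:
    print(bigram, round(score, 3))
```

In this toy data "walk behind" ranks first: both words occur only twice and always together, whereas "and" also appears in other contexts, which lowers the PMI of its bigrams.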

In [9]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_words)
bigram_finder.apply_freq_filter(20)
bigram_finder.apply_word_filter(lambda w: len(w) < 3)  # drop bigrams that contain a word shorter than 3 letters
top_100_bigrams = bigram_finder.nbest(bigram_measures.pmi, 100) # Top-100 bigrams
top_100_bigrams

[('harmonic', 'flex'),
 ('centrifugally', 'balanced'),
 ('robotic', 'harmonic'),
 ('expandable', 'chuck'),
 ('group', 'consisting'),
 ('charge', 'consistent'),
 ('walk', 'behind'),
 ('improperly', 'swapped'),
 ('saw', 'resonator'),
 ('behind', 'mowing'),
 ('jute', 'fibers'),
 ('actuator', 'flag'),
 ('elastic', 'band'),
 ('lead', 'frames'),
 ('drain', 'vent'),
 ('fresh', 'food'),
 ('high', 'humidity'),
 ('paper', 'particles'),
 ('fringe', 'maker'),
 ('ultrasonic', 'test'),
 ('foot', 'pedal'),
 ('elastomeric', 'mat'),
 ('capacitor', 'devices'),
 ('loaded', 'bag'),
 ('hammermilled', 'straw'),
 ('flash', 'tank'),
 ('tank', 'receiver'),
 ('hip', 'joint'),
 ('does', 'not'),
 ('duty', 'belt'),
 ('drier', 'solid'),
 ('solid', 'phase'),
 ('removable', 'joining'),
 ('cooler', 'box'),
 ('not', 'exceed'),
 ('cross', 'sectional'),
 ('case', 'packer'),
 ('vacuum', 'electronic'),
 ('driver', 'pulley'),
 ('mowing', 'machine'),
 ('fastened', 'together'),
 ('storage', 'capacity'),
 ('bus', 'bars'),
 ('p

## 6. Re-tokenize the patents

The task in Section 4 tokenized the patents into unigrams only. Now that we have introduced 100 collocations, we need to make sure they are not split into two individual words. The tokenizer that you need is <a href="http://www.nltk.org/api/nltk.tokenize.html">MWETokenizer</a>, which merges each listed word sequence into a single token joined by an underscore by default (for example, `hip_joint`).
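
The merging step can be pictured with the following simplified sketch, which handles bigrams only and scans each token list from left to right. It is a stand-in written for illustration, not NLTK's implementation; the function name `merge_bigrams` is hypothetical and the toy tokens are made up.

```python
def merge_bigrams(tokens, bigrams, separator="_"):
    """Join each listed adjacent pair into one token, scanning left to right."""
    bigram_set = set(bigrams)
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigram_set:
            merged.append(separator.join(pair))  # e.g. "walk_behind"
            i += 2                               # skip both words of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


toy_bigrams = [("walk", "behind"), ("behind", "mowing"), ("mowing", "machine")]
toy_tokens = ["the", "walk", "behind", "mowing", "machine"]
merged_tokens = merge_bigrams(toy_tokens, toy_bigrams)
print(merged_tokens)
```

Once "walk" and "behind" have been merged, "behind" is no longer available to form "behind_mowing", so overlapping collocations are resolved in favor of the one that starts first.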


In [10]:
mwetokenizer = MWETokenizer(top_100_bigrams)
colloc_patents =  dict((pid, mwetokenizer.tokenize(patent)) for pid,patent in patents_tokenized.items())
all_words_colloc = list(chain.from_iterable(colloc_patents.values()))
colloc_voc = list(set(all_words_colloc))
print(len(colloc_voc))

3372


You can check the difference between the output of MWETokenizer and RegexpTokenizer by adapting the following code:

```python
for pid in patents_tokenized.keys():
    diff = set(colloc_patents[pid])-set(patents_tokenized[pid])
    if len(diff) != 0:
        print (pid, diff)
```

## 7. Generate the TF-IDF vectors for all the patents.
Please refer to 
* http://scikit-learn.org/stable/modules/feature_extraction.html
* http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
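
For intuition about the numbers `TfidfVectorizer` produces, the sketch below computes TF-IDF by hand for two toy documents. It assumes the vectorizer's default settings as described in the scikit-learn documentation linked above: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The function `toy_tfidf` and the documents are hypothetical.

```python
import math
from collections import Counter


def toy_tfidf(docs):
    """Return one {term: weight} dict per document (assumed sklearn defaults)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}

    vectors = []
    for doc in docs:
        weights = {term: count * idf[term] for term, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm of the row
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


toy_docs = [["hip_joint", "implant", "implant"], ["hip_joint", "mower"]]
toy_vectors = toy_tfidf(toy_docs)
for vector in toy_vectors:
    print({term: round(w, 3) for term, w in vector.items()})
```

A term that occurs in every document (`hip_joint` here) gets the minimum idf of 1, so terms specific to one patent end up with larger weights in that patent's row.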

In [13]:
pids = []
patent_words = []
for pid, tokens in colloc_patents.items():
    pids.append(pid)
    txt = ' '.join(tokens)
    patent_words.append(txt)

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(input = 'content', analyzer = 'word')
tfidf_vectors = tfidf_vectorizer.fit_transform(patent_words)
tfidf_vectors.shape

(100, 3372)

## 8. Save the TF-IDF vector into the specified format
Hint: you can use 
* the <a href="https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csc_matrix.tocoo.html">tocoo()</a> function
* itertools.zip_longest()
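
The coordinate (COO) form stores one entry per non-zero cell as three parallel arrays: row indices, column indices, and values. The sketch below imitates that layout with plain Python lists standing in for `cx.row`, `cx.col`, and `cx.data`, and writes one `pid,term,weight` line per entry to an in-memory buffer. All the values are made up, and the line format is an assumption inferred from the writing code in this section, so check it against "patents.txt".

```python
import io
import itertools

# Made-up stand-ins for pids, vocab and the COO arrays cx.row, cx.col, cx.data
toy_pids = ["00000001", "00000002"]
toy_vocab = ["chair", "hip_joint", "implant"]
rows = [0, 1, 1]         # index into toy_pids
cols = [0, 1, 2]         # index into toy_vocab
data = [1.0, 0.6, 0.8]   # TF-IDF weights

buffer = io.StringIO()
# The three lists have equal length, so zip_longest never pads with None here
for i, j, v in itertools.zip_longest(rows, cols, data):
    buffer.write(toy_pids[i] + ',' + toy_vocab[j] + ',' + str(v) + '\n')

print(buffer.getvalue(), end="")
```

Only non-zero entries are written, which is what keeps the saved representation sparse.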

In [None]:
save_file = open("patent_student.txt", 'w')

In [None]:
vocab = tfidf_vectorizer.get_feature_names_out()  # on scikit-learn < 1.0, use get_feature_names() instead
#########please write the missing code below#######
cx = tfidf_vectors.tocoo() # return the coordinate representation of a sparse matrix
for i,j,v in itertools.zip_longest(cx.row, cx.col, cx.data):
    save_file.write(pids[i] + ',' + vocab[j] + ',' + str(v) + '\n')

In [15]:
save_file.close()