# Data Parsing

Date: 25/08/2019

Version: 2.0

Environment: Python 3.7.3 and Anaconda 4.3.0 (64-bit)

#### Libraries Used

- **re**: 
Python's built-in regular expression library, used to match patterns of characters in text. It handles patterns that plain string methods cannot express concisely. For more information, see https://docs.python.org/3/library/re.html
- **pandas**: 
A third-party library widely used in data science. In this assignment, pandas is used to build a DataFrame from the extracted data and write it to a csv file. For more information, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html 

## 1. Introduction
This assignment is about parsing patent documents stored as semi-structured data (i.e., XML) and extracting the required fields from them. There are 150 patents in a single 19,229 KB file named `Group060.txt`. The required tasks are as follows:

1. Extract the grant_id for each patent in `Group060.txt` using the user-defined function grant_id(element).
2. Extract the patent_title for each patent in `Group060.txt` using the user-defined function patent_title(element).
3. Extract the patent_kind for each patent in `Group060.txt` using the user-defined function patent_kind(element).
4. Extract the number_of_claims for each patent in `Group060.txt` using the user-defined function number_of_claims(element).
5. Extract the inventors for each patent in `Group060.txt` using the user-defined function inventors(element).
6. Extract the citations_applicant_count for each patent in `Group060.txt` using the user-defined function citations_applicant_count(element).
7. Extract the citations_examiner_count for each patent in `Group060.txt` using the user-defined function citations_examiner_count(element).
8. Extract the claims_text for each patent in `Group060.txt` using the user-defined function claims_text(element).
9. Extract the abstract text for each patent in `Group060.txt` using the user-defined function abstract(element).


After extraction, the data must be stored in structured formats, as csv and json files.

More details for each task will be given in the following sections.

## 2. Import Libraries

In [1]:
import re
import pandas as pd

## 3. Opening, Reading and Splitting Data

As the first step, the file `Group060.txt` is opened in read mode.

In [2]:
file = open('Group060.txt', 'r')

The file.read() method returns the content of the file as a single string. Because the file is a concatenation of XML documents, one per patent, the string is split on the XML declaration <?xml version="1.0" encoding="UTF-8"?> so that each piece holds one patent. The split leaves empty strings in the list (for example, the text before the first declaration), and filter(None, ...) removes them.

In [3]:
string_file = file.read()
string_list = string_file.split('<?xml version="1.0" encoding="UTF-8"?>')
str_list = list(filter(None, string_list))
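
The split-and-filter step can be seen on a small example. This is only a sketch: the two-document string below is a hypothetical stand-in for `Group060.txt`, and the names `XML_DECLARATION`, `sample`, `pieces` and `documents` are introduced here for illustration.

```python
# Hypothetical miniature of the concatenated file: two XML documents back to back.
XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'
sample = (
    XML_DECLARATION + "\n<us-patent-grant>first</us-patent-grant>\n"
    + XML_DECLARATION + "\n<us-patent-grant>second</us-patent-grant>\n"
)

pieces = sample.split(XML_DECLARATION)
# Nothing precedes the first declaration, so split() yields '' as the first piece.
documents = list(filter(None, pieces))  # filter(None, ...) drops falsy items such as ''

print(len(pieces), len(documents))  # 3 2
```

Note that `filter(None, ...)` removes every falsy item, so it drops empty strings here; there are no `None` values in the output of `split()`.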

## 4. Extracting the information for grant_id

- **grant_id**: The grant_id is unique for each patent grant and consists of alphanumeric characters. It starts with the country code followed by the doc-number, which together represent the grant_id of the patent.

A regex is applied to the opening us-patent-grant tag because the value of its file attribute is unique for each patent grant. In the pattern:

1. lang = .* is used because the language can be anything. As the match for lang should not be captured, ?: is placed before .*, which together becomes (?:.*)
2. dtd-version = .* is used because the version can be anything. As the match for dtd-version should not be captured, ?: is placed before .*, which together becomes (?:.*)
3. The value of the file attribute is enclosed in quotation marks and has two parts separated by a hyphen. The pattern therefore has " at the start and at the end, a hyphen in the middle, and .* before and after the hyphen, as either part can be anything. The part before the hyphen is captured as the grant_id. The part after the hyphen is not needed, so it is written as the non-capturing group (?:.*)
4. status = .* is used because the value of status can be anything. As the match for status should not be captured, ?: is used, which together becomes (?:.*)
5. id = .* is used because the value of id can be anything. As the match for id should not be captured, ?: is used, which together becomes (?:.*)
6. country = .* is used because the value of country can be anything. As the match for country should not be captured, ?: is used, which together becomes (?:.*)
7. date-produced = .* is used because the value of date-produced can be anything. As the match for date-produced should not be captured, ?: is used, which together becomes (?:.*)
8. date-publ = .* is used because the value of date-publ can be anything. As the match for date-publ should not be captured, ?: is used, which together becomes (?:.*)

search method is used for matching the regex as there is only one match available for each us-patent-grant

The use of search method can be found at https://docs.python.org/3/library/re.html

Parentheses () denote a group; a group without ?: captures the text it matches.

In [4]:
def grant_id(element):
    
    patent_grant = re.search(r'<us-patent-grant lang=(?:.*) dtd-version=(?:.*) file="(.*)-(?:.*)" status=(?:.*) id=(?:.*) country=(?:.*) date-produced=(?:.*) date-publ=(?:.*)>',element)
    #retrieving the value for grant_id if the regex finds the match
    if patent_grant:
        return patent_grant.group(1)
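
As a rough illustration of the capture described above, the sketch below applies a narrower pattern to a single opening tag. The tag is a hypothetical example whose attribute values are assumptions, and `grant_id_sketch` is an illustrative variant that uses negated character classes instead of `.*`; it is not the function used in this notebook.

```python
import re

# Hypothetical opening tag, modelled on the attribute order described above.
sample_tag = ('<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" '
              'file="US10361000-20190723.XML" status="PRODUCTION" id="us-patent-grant" '
              'country="US" date-produced="20190709" date-publ="20190723">')

def grant_id_sketch(element):
    # Capture only the part of the file attribute that precedes the first hyphen.
    match = re.search(r'<us-patent-grant\b[^>]*\bfile="([^"-]+)-[^"]*"', element)
    return match.group(1) if match else None

print(grant_id_sketch(sample_tag))  # US10361000
```

Because `[^"-]+` cannot cross a quotation mark or a hyphen, the capture stays inside the file attribute regardless of hyphens elsewhere in the tag.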


## 5. Extracting the information for patent_kind

- **patent_kind**: A regular expression searches for the block that starts and ends with the publication-reference tags, as this block specifies the exact kind code for the patent grant. Groups are used wherever the values vary between patent grants: country, doc-number, and date. As only the value between the kind tags is needed, all the remaining groups are made non-capturing using '?:'
1. country = .* is used as the value of the country can be anything. As it is not the value to be extracted, it is made a non-capturing group, which together becomes (?:.*)
2. doc-number = .* is used as the value of the doc-number can be anything. As it is not the value to be extracted, it is made a non-capturing group, which together becomes (?:.*)
3. kind = .* is used, as it represents the kind code for each us-patent-grant and its value can be anything. This is the only captured group, (.*)
4. date = .* is used as the value of the date can be anything. As it is not the value to be extracted, it is made a non-capturing group, which together becomes (?:.*)

Note: kind tags appear in several places within each us-patent-grant, but only the kind code inside publication-reference is used. Each us-patent-grant refers to other documents, each with its own kind code, and those codes may differ from the kind code of the publication-reference. As all of these documents relate to the one patent identified by the publication reference, the kind code of that publication-reference is the one used.

search method is used for matching the regex as there is only one match available for each us-patent-grant

The use of search method can be found at https://docs.python.org/3/library/re.html

In [5]:
def patent_kind(element):
    
    patent_kind_code_regex = re.search(r'<publication-reference>\n<document-id>\n<country>(?:.*)</country>\n<doc-number>(?:.*)</doc-number>\n<kind>(.*)</kind>\n<date>(?:.*)</date>\n</document-id>\n</publication-reference>',element)
    '''
    A dictionary of key-value pairs, where the key is the kind code found in the publication-reference of a us-patent-grant
    and the value is the description of that kind code.
    
    '''
    my_dictionary = { 'B2' :'Utility Patent Grant (with a published application) issued on or after January 2, 2001.',
                      'B1' :'Utility Patent Grant (no published application) issued on or after January 2, 2001.',
                      'P2' :'Plant Patent Grant (no published application) issued on or after January 2, 2001' ,
                      'P3' :'Plant Patent Grant (with a published application) issued on or after January 2, 2001',
                      'S1' :'Design Patent',
                      'E1' :'Reissue Patent'
                      
        }
    #retrieving the value for patent_kind if the regex finds the match
    if patent_kind_code_regex:
        patent_kind_code = patent_kind_code_regex.group(1)
        #retrieving the value if the patent_kind_code key is present in the keys of my_dictionary
        if patent_kind_code in my_dictionary.keys():
            return my_dictionary[patent_kind_code] 
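
The point of anchoring the pattern at publication-reference can be sketched on a small fragment. The XML below is a hypothetical example (the `patcit` block and its values are assumptions), and `KIND_DESCRIPTIONS` and `kind_code_sketch` are illustrative names holding a subset of the mapping above.

```python
import re

KIND_DESCRIPTIONS = {
    'B1': 'Utility Patent Grant (no published application) issued on or after January 2, 2001.',
    'B2': 'Utility Patent Grant (with a published application) issued on or after January 2, 2001.',
}

# Hypothetical fragment: the patent's own kind code, then a cited document with another code.
sample = """<publication-reference>
<document-id>
<country>US</country>
<doc-number>10361000</doc-number>
<kind>B2</kind>
<date>20190723</date>
</document-id>
</publication-reference>
<patcit num="00001">
<document-id>
<country>US</country>
<doc-number>5000000</doc-number>
<kind>A</kind>
</document-id>
</patcit>"""

def kind_code_sketch(element):
    # Only a <kind> directly inside <publication-reference> can match this pattern.
    match = re.search(r'<publication-reference>\n<document-id>\n<country>.*</country>\n'
                      r'<doc-number>.*</doc-number>\n<kind>(.*)</kind>', element)
    return match.group(1) if match else None

code = kind_code_sketch(sample)
description = KIND_DESCRIPTIONS.get(code)  # .get() returns None for an unknown code
print(code)  # B2
```

The second `<kind>` tag is ignored because it does not follow a `<publication-reference>` opening tag.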



## 6. Extracting the information for patent_title

- **patent_title**: A regular expression searches for the invention-title tags, the only place where the title of each us-patent-grant (the name given by the inventor) appears. The id of invention-title varies between patent grants and can be anything, so it is matched with .* in a non-capturing group, as it is not needed. Only the title itself is captured, also with .*, as the title can be anything.

search method is used for matching the regex as there is only one match available for each us-patent-grant

The use of search method can be found at https://docs.python.org/3/library/re.html

In [6]:
def patent_title(element):
    
    patent_title_re = re.search(r'<invention-title id="(?:.*)">(.*)</invention-title>',element)
    #retrieving the value for patent_title if the regex finds the match
    if patent_title_re:
        return patent_title_re.group(1)
    

## 7. Extracting the information for number_of_claims

- **number_of_claims**: This function returns an integer, the number of claims made in a given patent grant. Each claim is enclosed between "claim" tags. The "id" and "num" attributes of each claim are not needed, so they are matched with non-capturing groups. The count is obtained with the "findall" method, which returns all occurrences in a list, and the "len" function, which gives the length of that list and hence the number of claims.


Note: The use of findall method can be found at https://docs.python.org/3/library/re.html

In [7]:
def number_of_claims(element):
    no_of_claims=[]
    no_of_claims=re.findall(r'<claim id="(?:.*)" num="(?:.*)">',element)
    return len(no_of_claims)
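
A minimal sketch of the counting idea, run on a hypothetical two-claim fragment (the claim wording is an assumption). The pattern here uses `[^"]*` for the attribute values, which is a slightly stricter variant of the `.*` used above.

```python
import re

# Hypothetical claims block with two claims.
sample = """<claims id="claims">
<claim id="CLM-00001" num="00001">
<claim-text>1. An apparatus comprising a processor.</claim-text>
</claim>
<claim id="CLM-00002" num="00002">
<claim-text>2. The apparatus of claim 1, further comprising memory.</claim-text>
</claim>
</claims>"""

# '<claim ' requires a space after 'claim', so '<claims id="claims">' is not counted.
claim_tags = re.findall(r'<claim id="[^"]*" num="[^"]*">', sample)
print(len(claim_tags))  # 2
```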


## 8. Extracting the information for inventors

- **inventors**: This function returns the names of the inventors for each patent grant. The names are enclosed in the "last-name" and "first-name" tags, which are in turn enclosed by the "addressbook" and "inventor" tags. "sequence" and "designation" are attributes of the "inventor" tag and are not needed. The inventor element also has sub-tags other than "first-name" and "last-name", such as the address of the inventor, which this function ignores. As only the names are of interest, all groups except the name groups are non-capturing. "findall" returns all the inventor names as a list of (last-name, first-name) tuples. Each name is rebuilt as first name followed by last name, and the names are converted into a single string using the join and map methods. If no occurrence of the inventor tag is found, "NA" is returned.

Note: The use of findall method can be found at https://docs.python.org/3/library/re.html

In [8]:
def inventors(element):
    
    i_list=[]
    if re.search(r"<inventor sequence=(?:.*) designation=(?:.*)>\n<addressbook>\n<last-name>(?:.*)</last-name>\n<first-name>(?:.*)</first-name>\n<address>",element):
        inventor_list=re.findall(r"<inventor sequence=(?:.*) designation=(?:.*)>\n<addressbook>\n<last-name>(.*)</last-name>\n<first-name>(.*)</first-name>\n<address>",element)
        for name in inventor_list:
            i_list.append(name[1]+" "+name[0])
        inventor_str = ','.join(map(str,i_list))
        inventor_str = '[' + inventor_str +']'
        return inventor_str    
    else:
        return "NA"
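
The tuple handling can be sketched as follows. The fragment is hypothetical: the two names come from the output table in this notebook, while the attribute values and the empty address blocks are assumptions.

```python
import re

# Hypothetical inventor fragment with two inventors.
sample = """<inventor sequence="001" designation="us-only">
<addressbook>
<last-name>Greene</last-name>
<first-name>Jonathan W.</first-name>
<address>
</address>
</addressbook>
</inventor>
<inventor sequence="002" designation="us-only">
<addressbook>
<last-name>Li</last-name>
<first-name>Fei</first-name>
<address>
</address>
</addressbook>
</inventor>"""

# With two capturing groups, findall returns one (last, first) tuple per inventor.
pairs = re.findall(r'<inventor sequence=.* designation=.*>\n<addressbook>\n'
                   r'<last-name>(.*)</last-name>\n<first-name>(.*)</first-name>\n<address>',
                   sample)
names = [first + " " + last for last, first in pairs]
inventor_str = "[" + ",".join(names) + "]"
print(inventor_str)  # [Jonathan W. Greene,Fei Li]
```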


## 9. Extracting the information for citations_applicant_count

- **citations_applicant_count**: This function returns an integer, the number of citations made by the applicant for a given patent grant. The origin of each citation is recorded between the category tags. To count only the citations made by the applicant, the pattern captures category values that end with applicant. findall returns all the matches in a list, and the len function gives the count.

Note: Only the category tags are used to count the citations made by the applicant because the category tag in each us-patent-grant clearly specifies whether the citation was made by the applicant.

The use of findall method can be found at https://docs.python.org/3/library/re.html

In [9]:
def citations_applicant_count(element):
    
    citations_applicant_re=re.findall(r"<category>(.*applicant)</category>",element)
    #Returning the length of number of citations made by the applicant
    if citations_applicant_re:
        return len(citations_applicant_re)
    else:
        return 0
    

## 10. Extracting the information for citations_examiner_count

- **citations_examiner_count**: This function returns an integer, the number of citations made by the examiner for a given patent grant. The origin of each citation is recorded between the category tags. To count only the citations made by the examiner, the pattern captures category values that end with examiner. findall returns all the matches in a list, and the len function gives the count.

Note: Only the category tags are used to count the citations made by the examiner because the category tag in each us-patent-grant clearly specifies whether the citation was made by the examiner.

The use of findall method can be found at https://docs.python.org/3/library/re.html

In [10]:
def citations_examiner_count(element):
    
    citations_examiner_re = re.findall('<category>(.*examiner)</category>',element)
    #Returning the length of number of citations made by the examiner
    if citations_examiner_re:
        return len(citations_examiner_re)
    else:
        return 0
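
Both citation counters follow the same idea, which can be sketched on a hypothetical list of category tags (the three category values below are assumed examples of the "cited by ..." wording).

```python
import re

# Hypothetical category tags taken out of a citation list.
sample = """<category>cited by applicant</category>
<category>cited by examiner</category>
<category>cited by applicant</category>"""

applicant_count = len(re.findall(r'<category>.*applicant</category>', sample))
examiner_count = len(re.findall(r'<category>.*examiner</category>', sample))
no_match_count = len(re.findall(r'<category>.*third party</category>', sample))

print(applicant_count, examiner_count, no_match_count)  # 2 1 0
```

Since `findall` returns an empty list when nothing matches, `len` already yields 0 in that case, so a separate branch for "no match" is optional.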


## 11. Extracting the information for claims_text

- **claims_text**: First, search for claims id="claims". If it is present, retrieve the content that follows each claim id up to the next claim id, because the claim-text elements appear only inside the claim elements.

1. As the claim id starts with CLM- followed by one or more digits, CLM-[0-9]+ is used, and as num is one or more digits, [0-9]+ is used.
2. As all the text inside a claim is wanted, .* is used. To stop at the next claim rather than at the last one, it is made lazy by adding ? after .*, which becomes .*?. The positive lookahead ?= checks that the next claim tag follows without consuming it, so the search for the second match starts at that tag.
3. A smaller alternative is added at the end using the or operator (|) to capture the last claim, as no further claim id follows it.

The DOTALL flag (integer value 16) is passed so that . also matches newline characters, as a claim spans several lines.

Note:- The text for claims_text appears only inside <claim-text> tags, which are enclosed in <claims id="claims">. If no claim text is present for the given us-patent-grant, 'NA' is returned, which means the claims_text for that us-patent-grant is 'Not Available'.
    
After retrieving the data from claim-text using the regex, some substitutions are made to match the expected output.

The use of findall method can be found at https://docs.python.org/3/library/re.html

In [11]:
def claims_text(element):
    
    if re.search(r'<claims id="claims">',element):
        claim_id_regex1 = re.findall(r'<claim id="CLM-[0-9]+" num="[0-9]+">(.*?)(?=<claim id="CLM-[0-9]+" num="[0-9]+">)|<claim id="CLM-[0-9]+" num="[0-9]+">(.*)',element,flags=re.DOTALL)
        
        '''
        findall returns a list of strings, but in the above regular expression there are two groups one bigger and one smaller
        so, it returns a tuple. create a list which returns a tuples with the values only, that is the with the matching group.
        
        '''
        claim_id_regex2 = [tuple(claimmed for claimmed in claimm if claimmed) for claimm in claim_id_regex1]
        #concatenate item into a list-of-strings using join operator
        claim_id_regex = ["".join(line) for line in claim_id_regex2]
        
        '''
        if claim_id_regex is present then preprocessing the text using regular expressions by substituting some characters or 
        symbols and tags between <> and as the content inside it can be of anything we are using .*, and as we want to make it
        lazy we are adding ? at the end, which becomes together as <.*?>.
        
        '''
        if claim_id_regex:
            claim_id_string1 = re.sub(r"<.*?>","",str(claim_id_regex))
            new_claim_id_stringg = re.sub(r'(\n|\\n|;;|\s{2,})','',claim_id_string1)
            new_claim_id_stringg = re.sub(r'(\.\'\,)','.,',new_claim_id_stringg)
            new_claim_id_stringg = re.sub(r'( \')','',new_claim_id_stringg)
            new_claim_id_stringg = re.sub(r'(\\\')','\'',new_claim_id_stringg)
            new_claim_id_stringg = re.sub(r'(\[\'1)|(\[\"1)','[1',new_claim_id_stringg)
            new_claim_id_stringg = re.sub(r'(\[\")|(\[\')','[',new_claim_id_stringg)
            new_claim_id_stringg = re.sub(r'(\.\'\])|(\.\"\])|(\'\])','.]',new_claim_id_stringg)
            new_claim_id_stringg = re.sub(r'(.,\s")|(.",)','.,',new_claim_id_stringg)
            
            #If the length of claim_id_string1 is more than one, returning the string directly
            if len(claim_id_string1) > 1:
                return new_claim_id_stringg
            
            #If the length of claim_id_string1 is one, then substitute [` with [ .
            else:
                claim_str = new_claim_id_stringg
                claim_str = re.sub(r"\['","[",claim_str)
                return claim_str
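        else:
            return 'NA'
    #returning 'NA' as well when <claims id="claims"> is not present, instead of an implicit None
    return 'NA'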
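
The interplay of the lazy group, the lookahead and the fallback alternative can be sketched on a hypothetical two-claim string. `CLAIM_TAG`, `bodies` and `texts` are names introduced here for illustration, and the claim wording is an assumption.

```python
import re

# Hypothetical claims: each claim spans more than one line.
sample = ('<claim id="CLM-00001" num="00001">first claim\n</claim>\n'
          '<claim id="CLM-00002" num="00002">second claim\n</claim>\n'
          '</claims>')

CLAIM_TAG = r'<claim id="CLM-[0-9]+" num="[0-9]+">'
# Alternative 1: lazily capture up to (but not including) the next claim tag.
# Alternative 2: for the last claim, capture everything that remains.
pattern = CLAIM_TAG + r'(.*?)(?=' + CLAIM_TAG + r')|' + CLAIM_TAG + r'(.*)'

matches = re.findall(pattern, sample, flags=re.DOTALL)
bodies = [first or last for first, last in matches]  # one group per tuple is empty
texts = [re.sub(r'<.*?>', '', body).strip() for body in bodies]

print(texts)  # ['first claim', 'second claim']
```

The lookahead does not consume the next claim tag, so the following match can start exactly there. `re.DOTALL` has the integer value 16.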



## 12. Extracting the information for abstract 

- **abstract**: A regular expression searches for the abstract. The abstract text is enclosed between the abstract tags with id = "abstract", so the starting and ending tags are fixed. As all the content between them is wanted and can be anything, .* is used.

Note:- In the given `Group060.txt` file, abstract text appears only inside the <abstract id="abstract"> tag.
If no abstract text is present for the given us-patent-grant, 'NA' is returned, which means the abstract text for that us-patent-grant is 'Not Available'.
    
The use of search method can be found at https://docs.python.org/3/library/re.html

In [12]:
def abstract(element):
    
    abstract_re = re.search(r'<abstract id="abstract">(.*)</abstract>', element,flags=re.DOTALL)
    #retrieving the information if abstract text is present , otherwise returning 'NA'
    if abstract_re:
        abstract_regular = abstract_re.group(1)
        #using regular expressions, substituting all the tags, and new line symbols with empty strings
        abstract_string = re.sub(r"<.*?>","", str(abstract_regular))
        new_abstract_string = re.sub('\n','',abstract_string)
        return new_abstract_string
    else:
        return 'NA'
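
The effect of the DOTALL flag on a multi-line abstract can be sketched as below. The abstract fragment, including the `p` tag and its attributes, is a hypothetical example.

```python
import re

# Hypothetical abstract whose content sits on its own line.
sample = """<abstract id="abstract">
<p id="p-0001" num="0000">A method and apparatus for measuring parameters.</p>
</abstract>"""

PATTERN = r'<abstract id="abstract">(.*)</abstract>'
without_flag = re.search(PATTERN, sample)                # '.' stops at the newline
with_flag = re.search(PATTERN, sample, flags=re.DOTALL)  # '.' also matches '\n'

text = re.sub(r'<.*?>', '', with_flag.group(1)).replace('\n', '')
print(without_flag)  # None
print(text)          # A method and apparatus for measuring parameters.
```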

## 13. Calling all the functions, Storing into a list and Writing the data into a json format

In [13]:
'''
An empty list is created to store the data of all the required columns. 
Initially, append the column names to represent the name for each column, followed by method calling.

'''

new_complete_columns = []
new_complete_columns.append("grant_id")
new_complete_columns.append("patent_title")
new_complete_columns.append("kind")
new_complete_columns.append("number_of_claims")
new_complete_columns.append("inventors")
new_complete_columns.append("citations_applicant_count")
new_complete_columns.append("citations_examiner_count")
new_complete_columns.append("claims_text")
new_complete_columns.append("abstract")

'''
Iterating the total content in the file, which is stored as a list in str_list, and then for each string in a list, 
calling the method and appending the returned value to a list

'''
for element in str_list:
    new_complete_columns.append(grant_id(element))
    new_complete_columns.append(patent_title(element))
    new_complete_columns.append(patent_kind(element))
    new_complete_columns.append(number_of_claims(element))
    new_complete_columns.append(inventors(element))
    new_complete_columns.append(citations_applicant_count(element))
    new_complete_columns.append(citations_examiner_count(element))
    new_complete_columns.append(claims_text(element))
    new_complete_columns.append(abstract(element))
    
'''
An empty list is created which consist of sub-lists where each sub-list represents one particular us-patent-grant extracted 
information

'''

data = []

#opening a file in write mode to write the data into it

with open("Group060.json","w+") as out_handle:
    
    out_handle.write("{")
    
    '''
    Dividing the total data present in new_complete_columns into a sub-lists, by using for loop and appending it to a data
    list, so each list in the data list represents one us-patentgrant
    
    '''
    
    for i in range(0,len(new_complete_columns),9):
        new_data = new_complete_columns[i:i+9]
        data.append(new_data)
        '''
        Writing the total data in the form of json, depending upon its size of length
        
        '''
        if i==0:
            continue
        elif len(data)==len(new_complete_columns)//9:
            s='"'+new_data[0]+'":{"patent_title":"'+new_data[1]+'","kind":"'+new_data[2]+'","number_of_claims":'+str(new_data[3])+',"inventors":"'+new_data[4]+'","citations_applicant_count":'+str(new_data[5])+',"citations_examiner_count":'+str(new_data[6])+',"claims_text":"'+new_data[7]+'","abstract":"'+new_data[8]+'"}}'                                
            out_handle.write(s)
        else:
            s='"'+new_data[0]+'":{"patent_title":"'+new_data[1]+'","kind":"'+new_data[2]+'","number_of_claims":'+str(new_data[3])+',"inventors":"'+new_data[4]+'","citations_applicant_count":'+str(new_data[5])+',"citations_examiner_count":'+str(new_data[6])+',"claims_text":"'+str(new_data[7])+'","abstract":"'+new_data[8]+'"},\n'                                
            out_handle.write(s)
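
The json file above is assembled by string concatenation, which works only while every field is a string free of characters that need escaping, such as double quotes. For comparison, and purely as an optional sketch that is not part of the submitted solution, the standard `json` module could build the same nested structure. The record below is hypothetical and `records` is an illustrative name.

```python
import json

header = ["grant_id", "patent_title", "kind", "number_of_claims", "inventors",
          "citations_applicant_count", "citations_examiner_count", "claims_text", "abstract"]
# Hypothetical rows: the header row followed by one record, mirroring the `data` list.
rows = [
    header,
    ["US0000001", 'A "quoted" title', None, 2, "[Jane Doe]", 1, 0, "[1. A method.]", "NA"],
]

# Key each record by grant_id and map the remaining column names to their values.
records = {row[0]: dict(zip(header[1:], row[1:])) for row in rows[1:]}
json_text = json.dumps(records, indent=2)  # escapes quotes and writes None as null

print(json.loads(json_text)["US0000001"]["patent_title"])  # A "quoted" title
```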



## 14. Writing the data into a csv file using to_csv function, by converting into dataframe using pandas

In [14]:
my_dataframe = pd.DataFrame(data)
my_dataframe.to_csv('Group060.csv',index=False,header=False)


### Displaying the dataframe

In [15]:
my_dataframe.head(151)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | grant_id | patent_title | kind | number_of_claims | inventors | citations_applicant_count | citations_examiner_count | claims_text | abstract |
| 1 | US10361000 | System and method for protocol adherence | Utility Patent Grant (with a published applica... | 20 | [Christopher Donald Johnson,Peter Henry Tu,Pie... | 56 | 1 | [1. An apparatus comprising:a processor and me... | The system and method disclosed herein provide... |
| 2 | US10360377 | Device, system, and method of obfuscating an u... | Utility Patent Grant (with a published applica... | 19 | [Tailim Song] | 0 | 19 | [1. A method, comprising:receiving, via a firs... | A message is received via a first mobile devic... |
| 3 | US10360641 | Hybrid electronic lockbox | Utility Patent Grant (with a published applica... | 20 | [Janet M. Mauller,Tera L. Howe,Amy Lamb,James ... | 45 | 11 | [1. A method comprising:sending, by a driver f... | A method of managing an electric lockbox, such... |
| 4 | US10358435 | Triazolyl pyrimidinone compounds as PDE2 inhib... | Utility Patent Grant (with a published applica... | 20 | [Dong-Ming Shen,Christopher J. Sinz,Alejandro ... | 89 | 1 | [1. A compound represented by structural formu... | The present invention is directed to pyrimidin... |
| 5 | US10358466 | Modified lectin derived from Wisteria floribunda | Utility Patent Grant (with a published applica... | 3 | [Takashi Sato,Yasunori Chiba,Hiroaki Tateno,Hi... | 23 | 1 | [1. A Wisteria floribunda monomeric lectin pol... | A Wisteria floribunda monomeric lectin polypep... |
| 6 | US10358905 | Ultrasonic logging methods and apparatus for m... | Utility Patent Grant (with a published applica... | 23 | [Lucio N. Tello,Edwin K. Roberts,Thomas J. Bla... | 13 | 26 | [1. A method using an acoustic logging system ... | A method and apparatus for measuring parameter... |
| 7 | US10361702 | FPGA math block with dedicated connections | Utility Patent Grant (with a published applica... | 9 | [Jonathan W. Greene,Fei Li] | 4 | 4 | [1. An architecture in a user-programmable int... | An architecture in a user-programmable integra... |
| 8 | US10361865 | Signature method and system | Utility Patent Grant (with a published applica... | 20 | [Eliphaz Hibshoosh,Aviad Kipnis,Nir Moshe,Alon... | 8 | 11 | [1. A method for digitally signing blocks of d... | In one embodiment, a method, system, and appar... |
| 9 | US10357564 | Virus-PCION complex having enhanced antitumor ... | Utility Patent Grant (with a published applica... | 4 | [Chae Ok Yun,Joung-Woo Choi] | 1 | 3 | [1. A method of enhancing transduction efficie... | The present disclosure relates to a compositio... |


## 15. Summary

This assessment measured the understanding of reading a text file, extracting the required information using regular expressions, and representing it in popular structured data formats such as csv and json.

As this is the basic step of data wrangling, this understanding lays the groundwork for the other steps in the data wrangling process, such as Data Cleaning, Data Integration and Data Manipulation.

The main outcomes achieved in this assessment are:

- **Usage of regular expressions**: Searching for particular patterns and extracting the required information with the search() and findall() functions, using character sets, groups, lazy and greedy quantifiers, and lookaheads to make the patterns more precise.

- **Usage of pandas**: Creating a DataFrame using pandas.

- **Writing the data into a csv file**: Creating a DataFrame using pandas and writing the data into a csv file using the built-in method to_csv.

- **Writing the data into a json file**: Writing the data to a file in json format, without using built-in functions.



## 16. References
- livibetter. (2010, October 2).  *Remove-empty-strings-from-a-list-of-strings?* [Response to]. Retrieved from http://stackoverflow.com/a/3845423
- Rueta. (2010, May 13). *How-to-make-regular-expression-into-non-greedy?* [Response to]. Retrieved from http://stackoverflow.com/a/2824302
- Nadia Alramli. (2010, Feb 03). *Find-an-element-in-a-list-of-tuples?* [Response to]. Retrieved from http://stackoverflow.com/a/2191699
- Burhan Khalid. (2012, Sep 17). *Concatenate-item-in-list-to-strings?* [Response to]. Retrieved from http://stackoverflow.com/a/12453580
- Akavall. (2015, Jan 22). *Writing-a-python-list-of-lists-to-a-csv-file?* [Response to]. Retrieved from http://stackoverflow.com/a/14037540
- Pythex was used for checking the regular expressions: https://pythex.org/
