# FIT5196 Assessment 1
#### Group number: 023
#### Student Name: JIANHAN MA & HSIAO YI
#### Student ID: 29332885 & 29360595

Date: 19/08/2019

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:

* pandas 0.19.2 (for data frame, included in Anaconda Python 3.6) 
* re 2.2.1 (for regular expression, included in Anaconda Python 3.6) 

## 1. Introduction
This assessment covers the very first step of analyzing textual data, i.e., extracting data from semi-structured text files. The dataset in `<group_number>.txt` contains information about grants given for IP patent claims. For each patent grant it records, e.g., the patent title, patent ID, citation network and abstract. The task is to extract the data and transform it into CSV and JSON format with the following elements:
1. grant_id: a unique ID for a patent grant consisting of alphanumeric characters.
2. patent_kind: a category to which the patent grant belongs.
3. patent_title: a title given by the inventor to the patent claim.
4. number_of_claims: an integer denoting the number of claims for a given grant.
5. citations_examiner_count: an integer denoting the number of citations made by the examiner for a given patent grant (0 if None)
6. citations_applicant_count: an integer denoting the number of citations made by the applicant for a given patent grant (0 if None)
7. inventors: a list of the patent inventors’ names ([NA] if the value is Null).
8. claims_text: a list of claim texts for the different patent claims ([NA] if the value is Null).
9. abstract: the patent abstract text (‘NA’ if the value is Null)

## 2. Import libraries

In [1]:
import pandas as pd
from pandas import Series 
from pandas import DataFrame 
import re

## 3. Loading and checking data from dataset

First, the file `Group023.txt` is opened from the local machine and its first 10 lines are printed for inspection.

In [2]:
##read the first 10 lines of the file.
with open('Group023.txt','r') as inputfile:
    print('\n'.join([inputfile.readline().strip() for i in range(0, 10)]))

<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US10360376-20190723.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20190709" date-publ="20190723">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>10360376</doc-number>
<kind>B2</kind>
<date>20190723</date>
</document-id>


We can see that each patent starts with an XML declaration, `<?xml ...?>`. Based on this, the whole text file can be split into a list in which each element is one patent.

The pattern below matches that declaration: it starts with `<?xml` and runs to the first `>` that follows. The non-greedy quantifier `*?` keeps the match as short as possible; a greedy `.*` would run to the last `>` on the line rather than the first. `re.split` then splits the text wherever the pattern matches. Next, a `for` loop removes every line break (`\n`) from each element of the list. Finally, `pop(0)` discards the first element, which is an empty string because the file begins with the delimiter.

In [3]:
## read the whole inputfile 
with open('Group023.txt','r') as inputfile:
    text = inputfile.read()

##define a regular expression (raw string, so the backslash reaches the regex engine unchanged) and split file
pattern = r'<\?xml.*?>'
result = re.split(pattern, text)

##preprocessing the content(such as remove line breaks, pop first blank element)
for i in range(len(result)):
    result[i] = result[i].replace('\n','')
result.pop(0)
print(len(result))

150


The length of the `result` list is 150. All patents have been successfully extracted from the main file. Next, we check the last patent in the list.

In [4]:
lp_lines = result[-1] #get the last patent from extracted result
lp_lines

'<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US10360087-20190723.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20190709" date-publ="20190723"><us-bibliographic-data-grant><publication-reference><document-id><country>US</country><doc-number>10360087</doc-number><kind>B2</kind><date>20190723</date></document-id></publication-reference><application-reference appl-type="utility"><document-id><country>US</country><doc-number>15795351</doc-number><date>20171027</date></document-id></application-reference><us-application-series-code>15</us-application-series-code><us-term-of-grant><us-term-extension>9</us-term-extension></us-term-of-grant><classifications-ipcr><classification-ipcr><ipc-version-indicator><date>20060101</date></ipc-version-indicator><classification-level>A</classification-level><section>G</section><class>06</class><subclass>F</subclass><main-group>9</main-group><subgroup>54</subgroup><symbol-position>F</symbol-position><classificat
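
To make the splitting step concrete, here is a minimal sketch on a made-up two-document string (the sample text and the names `sample`, `declaration`, `pieces` and `docs` are illustrative and not taken from `Group023.txt`). It shows why `re.split` yields an empty first element when the text begins with the delimiter, and one way to drop such pieces without calling `pop`.

```python
import re

# Hypothetical miniature input: two documents, each starting with an XML declaration.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n<doc>first</doc>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n<doc>second</doc>\n'
)

# Non-greedy match from '<?xml' up to the first '>' that follows it.
declaration = r'<\?xml.*?>'

pieces = re.split(declaration, sample)
# The text starts with the delimiter, so the piece before it is an empty string.
print(repr(pieces[0]))

# Remove line breaks and discard empty pieces in one pass.
docs = [p.replace('\n', '') for p in pieces if p.strip()]
print(docs)
```

Filtering on `p.strip()` is only one possible design choice; it would also silently drop a genuinely empty document, which `pop(0)` would not.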

# 4. Parsing file and extracting

This part defines one function for each element to be extracted.

## 4.1 Match grant id

The first feature to extract is the grant ID, a unique alphanumeric ID for each patent grant. In the text file it appears near the beginning of each patent, inside the attribute `file="..."`. The regular expression uses `\w`, which does not match the dash, so the capture stops before the date part of the file name. Because the same patent contains more than one ID, `re.search` is used so that only the first match is returned. Finally, `group(1)` retrieves the captured value from the match object, and the function is tested on the first patent.

In [5]:
def match_grant_id(result):
    grant_id_pattern = r'file="(\w*)' ## '\w' matches any alphanumeric character or the underscore
    grant_id = re.search(grant_id_pattern, result)
    grant_id = grant_id.group(1) ## get the grant id
    return grant_id
match_grant_id(result[0]) ##test the function

'US10360376'
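
As a small illustration of why the capture stops at the dash, the sketch below applies the same pattern to a shortened, hand-written opening tag (the `header` string is an abbreviated stand-in for the real tag). It prints the whole match and the captured group separately.

```python
import re

# Hypothetical opening tag, abbreviated from the first patent shown earlier.
header = '<us-patent-grant lang="EN" file="US10360376-20190723.XML" status="PRODUCTION">'

# \w matches letters, digits and underscore, so the capture stops at the dash.
match = re.search(r'file="(\w*)', header)
print(match.group(0))  # whole match, including the literal prefix
print(match.group(1))  # only the captured group
```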

## 4.2 Match patent title

The patent title is easy to find: it sits inside the `<invention-title ...>` element, so a similar function can be defined. Some titles also contain nested tags, and every tag is enclosed in angle brackets `<>`. Hence `re.sub` is used to find those tags and remove them.

In [6]:
def patent_title(result):
    title_pattern = r'<invention-title.*?>(.*?)</invention-title>'
    html_tag = r'<.*?>' ## match anything enclosed in angle brackets
    title = re.search(title_pattern, result)
    title = title.group(1)
    title = re.sub(html_tag,'',title) ##remove all the html tags
    return title
patent_title(result[0])

'Method for operating a computer unit, and such a computer unit'
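
Titles can also carry character references such as `&#x2018;`, which tag stripping alone leaves untouched. The sketch below is one possible way to decode them with the standard library's `html.unescape`; the helper name `clean_markup` and the sample title are hypothetical, and whether decoded text is wanted depends on the expected output of the assessment, which is not shown here.

```python
import html
import re

def clean_markup(fragment):
    """Strip tags, then decode character references such as &#x2018;."""
    without_tags = re.sub(r'<.*?>', '', fragment)
    return html.unescape(without_tags)

# Made-up title containing an inline tag and two hexadecimal character references.
raw_title = 'Hosta plant named <i>&#x2018;Etched Glass&#x2019;</i>'
print(clean_markup(raw_title))
```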

## 4.3 Match kind of patent

The next feature is the kind of patent. Following the reference [The ABCs of Patent Kind Codes](https://www.finnegan.com/en/insights/blogs/prosecution-first/the-abcs-of-patent-kind-codes.html), only `B1` and `B2` need special treatment: a description is appended after the application type. An `if`/`elif`/`else` statement therefore decides which description applies to each kind code.

In [7]:
def match_kind(result):
    kind_pattern = r'<kind>(\w*)</kind>'
    type_pattern = r'<application-reference\sappl-type="(\w*)">'
    kind = re.search(kind_pattern, result)
    kind = kind.group(1)
    types = re.search(type_pattern, result)
    types = types.group(1)
    types = types.capitalize() ## capitalize the first character
    
    ## judgment for kind
    if kind == 'B1': 
        return types + ' Patent Grant (no published application) issued on or after January 2, 2001.'
    elif kind == 'B2':
        return types + ' Patent Grant (with a published application) issued on or after January 2, 2001.'
    elif types == 'Plant':
        return '2001"'
    else:
        return types + ' Patent'
match_kind(result[0])

'Utility Patent Grant (with a published application) issued on or after January 2, 2001.'
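
A dictionary lookup is one alternative to a chain of `if`/`elif` branches when more kind codes need a description. The sketch below is illustrative only: `KIND_DESCRIPTIONS` and `describe_kind` are hypothetical names, the `B1`/`B2` wording is copied from `match_kind`, and the `P2`/`P3` entries are an assumption based on the USPTO kind-code list rather than something this notebook confirms.

```python
# Descriptions keyed by kind code. B1/B2 wording is taken from match_kind;
# the P2/P3 entries are an assumption based on the USPTO kind-code list.
KIND_DESCRIPTIONS = {
    'B1': 'Patent Grant (no published application) issued on or after January 2, 2001.',
    'B2': 'Patent Grant (with a published application) issued on or after January 2, 2001.',
    'P2': 'Patent Grant (no published application) issued on or after January 2, 2001.',
    'P3': 'Patent Grant (with a published application) issued on or after January 2, 2001.',
}

def describe_kind(kind_code, appl_type):
    """Return '<Type> Patent Grant (...)' for known codes, '<Type> Patent' otherwise."""
    prefix = appl_type.capitalize()
    suffix = KIND_DESCRIPTIONS.get(kind_code, 'Patent')
    return prefix + ' ' + suffix

print(describe_kind('B2', 'utility'))
print(describe_kind('S1', 'design'))
```

With a table like this, supporting another kind code means adding one entry instead of another branch.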

## 4.4 Count number of claims

For the number of claims, `re.findall` returns a list of the opening `<claim id=...>` tags. Because each claim ID is unique, the length of this list is the number of claims, and there is no need to worry about counting duplicates.

In [8]:
def count_num_claims(result):
    claims_pattern = r'<claim\sid.*?>' ## opening tag of each claim
    claims = re.findall(claims_pattern, result) ## return a list of opening claim tags
    claims_count = len(claims) ## length of the list = number of claims
    return claims_count
count_num_claims(result[0])

11

## 4.5 Match inventors

Extracting the inventors' names is a little more complicated. First, the search must be limited to the `<inventors>` element, because other parties in a patent also have a first name and a last name. The names are then captured from within that element. Finally, a `for` loop puts a space between first name and last name and a comma after each inventor.

In [9]:
def match_inventors(result):
    name_pattern = r'<inventors>(.*?)</inventors>' ## limit the search to the inventors element so only inventors' names are matched
    last_name = r'<last-name>(.*?)</last-name>'
    first_name = r'<first-name>(.*?)</first-name>'
    names = re.search(name_pattern, result) ## search the inventors' information from the whole content
    if names is None:
        return '[NA]' ## return '[NA]' if there is no inventors element
    names = names.group()
    lnames = re.findall(last_name, names) 
    fnames = re.findall(first_name, names) ## return a list of names
    
    if not lnames or not fnames: ## no names found inside the element
        return '[NA]'
    inv_name = '' ## create a blank string to store names
    for i in range(len(lnames)):
        name = fnames[i] + ' ' + lnames[i] ## add a space between fname and lname
        inv_name += name ## append into string
        inv_name += ', ' ## add a comma behind full name
    inv_name = inv_name.strip(', ') ## to remove the last useless comma
    return '[' + inv_name +']'
match_inventors(result[1])

'[Edward Michael Leonard, Robert Howard Sawers, Kip Christopher Larson, Devesh Mishra, Eric Mathew Mack, Jeffrey B. Maurer]'
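
Collecting first names and last names in two separate lists relies on both lists having the same length and order. A possible variant, sketched below under the assumption that each inventor sits in its own `<inventor ...>` element, reads one element at a time and returns a real Python list. The function name `inventor_names` and the attributes in the sample markup are hypothetical.

```python
import re

def inventor_names(patent_text):
    """Return a list of 'First Last' names, or ['NA'] when no inventor is found."""
    section = re.search(r'<inventors>(.*?)</inventors>', patent_text)
    if section is None:
        return ['NA']
    names = []
    # Handle one <inventor ...> element at a time so names cannot be mismatched.
    for block in re.findall(r'<inventor\b.*?</inventor>', section.group(1)):
        first = re.search(r'<first-name>(.*?)</first-name>', block)
        last = re.search(r'<last-name>(.*?)</last-name>', block)
        parts = [m.group(1) for m in (first, last) if m is not None]
        if parts:
            names.append(' '.join(parts))
    return names or ['NA']

# Hand-written sample; the element layout is an assumption about the file format.
sample = (
    '<inventors>'
    '<inventor sequence="00"><addressbook>'
    '<last-name>Marton</last-name><first-name>Laszlo</first-name>'
    '</addressbook></inventor>'
    '<inventor sequence="01"><addressbook>'
    '<last-name>Mihatsch</last-name><first-name>Oliver</first-name>'
    '</addressbook></inventor>'
    '</inventors>'
)
print(inventor_names(sample))
```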

## 4.6 Citations_applicant_count

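Sections 4.6 and 4.7 follow the same approach as 4.4: `re.findall` collects every matching `<category>` tag, and the length of the resulting list is the count.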

In [10]:
def count_applicant(result):
    app_pattern = r'<category>cited\sby\sapplicant</category>' ## '\s' matches a whitespace character
    apps = re.findall(app_pattern, result) ##return a list
    app_count = len(apps)
    return app_count
count_applicant(result[0])

7

## 4.7 Citations_examiner_count

In [11]:
##same as 4.6
def count_examiner(result):
    exa_pattern = r'<category>cited\sby\sexaminer</category>'
    examiner = re.findall(exa_pattern, result)
    exa_count = len(examiner)
    return exa_count
count_examiner(result[1])

6
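
Since both citation counts come from the same `<category>` tag, they could also be gathered in a single pass. The sketch below uses `collections.Counter` on a made-up fragment; `count_citations` is a hypothetical helper and not part of the notebook.

```python
import re
from collections import Counter

def count_citations(patent_text):
    """Count citations per category in a single pass over the text."""
    categories = re.findall(
        r'<category>cited by (applicant|examiner)</category>', patent_text
    )
    return Counter(categories)

# Made-up fragment with two examiner citations and one applicant citation.
fragment = (
    '<category>cited by examiner</category>'
    '<category>cited by applicant</category>'
    '<category>cited by examiner</category>'
)
counts = count_citations(fragment)
# A Counter returns 0 for keys it has never seen, which matches the "0 if None" rule.
print(counts['applicant'], counts['examiner'])
```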

## 4.8 Claims_text

This part is similar to the patent title function. However, the claim text contains not only tags but also leftover text with character entities (for example `&#x26;`), so all of this must be removed to keep the output readable.

In [12]:
def match_claim(result):
    claim_pattern = r'<claims id="claims">(.*?)</us-patent-grant>' ## define main body
    tags_pattern = r'<.*?>' ## define tags
    messy_text = r'\(?S&#x26;OP\)?' ## define leftover text containing an entity
    claims = re.search(claim_pattern, result)
    if claims is None:
        return '[NA]'
    claims = claims.group(1)
    
    claim = re.sub(tags_pattern,'',claims)
    claim = re.sub(messy_text,'',claim) ## remove all irrelevant text
    return '[' + claim + ']'
match_claim(result[1])

'[1. A computer-implemented method, comprising:hosting, by a computer system, a forecast module, the computer system interfacing with a planning system, a management system, a data store, and a computing resource that is associated with an operator of an inventory of an electronic marketplace, wherein:the planning system is configured to plan a flow of units of items through the inventory by at least generating simulations of supply and demand for the items during a planning horizon, and a labor forecast and a capacity forecast associated with inventorying the units of the items during the planning the planning horizon,the management system is configured to manage the flow based at least in part on a sales and operations planning  forecast that is associated with an item of the items and that is generated by the forecast module,the data store stores the simulations, the labor forecast, and the capacity forecast, andentries in the data store are updated based at least in part on input o
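
The task description asks for a list of claim texts, while the function above returns all claims as one bracketed string. As a rough sketch of the list form, the code below extracts one string per `<claim ...>` element. The helper `claim_texts` is hypothetical, and the sample claims and their attributes are invented for illustration.

```python
import re

def claim_texts(patent_text):
    """Return one cleaned string per <claim> element, or ['NA'] if there are none."""
    # '\s' after '<claim' skips <claims>, <claim-text> and <claim-ref> tags.
    claims = re.findall(r'<claim\s[^>]*>(.*?)</claim>', patent_text)
    cleaned = [re.sub(r'<.*?>', '', c).strip() for c in claims]
    return cleaned or ['NA']

# Invented sample with a nested <claim-ref> tag inside the second claim.
sample = (
    '<claims id="claims">'
    '<claim id="CLM-00001" num="00001">'
    '<claim-text>1. A bicycle sprocket.</claim-text></claim>'
    '<claim id="CLM-00002" num="00002">'
    '<claim-text>2. The sprocket of <claim-ref idref="CLM-00001">claim 1</claim-ref>'
    ', further comprising teeth.</claim-text></claim>'
    '</claims>'
)
print(claim_texts(sample))
```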

## 4.9 Abstract

First, the scope of the abstract is identified so that a missing abstract can be detected easily. The content is then searched for within the abstract. Finally, the leftover tags and entities are removed.

In [13]:
def match_abstract(result):
    abs_pattern = r'<abstract\sid="abstract">(.*?)</abstract>'
    content_pattern = r'<.*?>(.*?)</p>' ## content ends with '</p>'
    tags_pattern = r'<.*?>'
    messy_text = r'&#x\w{0,4}' ## match something like '&#x2019'; {0,4} means 0 to 4 word characters
    
    abstract = re.search(abs_pattern, result)
    if abstract is None:
        return 'NA'
    abstract = abstract.group(1)
    
    content = re.search(content_pattern, abstract) ## get the main body from abstract
    if content is None: ## abstract element without a paragraph
        return 'NA'
    content = content.group(1)
    
    content = re.sub(tags_pattern,'',content)
    content = re.sub(messy_text,'',content) ## remove all useless information
    return content
match_abstract(result[33])

'NA'
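
The function above reads only the first `<p>` paragraph of the abstract. If an abstract were split over several paragraphs, a variant along the following lines could join them. This is a sketch under that assumption; `abstract_text` and the sample markup are hypothetical.

```python
import re

def abstract_text(patent_text):
    """Join every <p> paragraph of the abstract; return 'NA' if missing or empty."""
    section = re.search(r'<abstract\b[^>]*>(.*?)</abstract>', patent_text)
    if section is None:
        return 'NA'
    paragraphs = re.findall(r'<p\b[^>]*>(.*?)</p>', section.group(1))
    text = ' '.join(re.sub(r'<.*?>', '', p).strip() for p in paragraphs)
    return text or 'NA'

# Invented abstract with two paragraphs and one inline tag.
sample = (
    '<abstract id="abstract">'
    '<p id="p-0001" num="0000">A method is supplied.</p>'
    '<p id="p-0002">It uses <b>two</b> steps.</p>'
    '</abstract>'
)
print(abstract_text(sample))
```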

# 5. Write all features into file

## 5.1 To csv file

pandas can build a data frame directly from a dictionary of lists: each key becomes a column name and each list becomes that column's values. The `for` loop below therefore collects every feature into its own list, and the lists are then placed in a dictionary under their column names.

In [14]:
grant_id = []
title = []
kind = []
number_of_claims = []
inventors = []
citations_applicant_count = []
citations_examiner_count = []
claims_text = []
abstract = []

##append all features into different lists
for i in range(len(result)):
    grant_id.append(match_grant_id(result[i]))
    
    title.append(patent_title(result[i]))
    
    kind.append(match_kind(result[i]))
    
    number_of_claims.append(count_num_claims(result[i]))
    
    inventors.append(match_inventors(result[i]))
    
    citations_applicant_count.append(count_applicant(result[i]))
    
    citations_examiner_count.append(count_examiner(result[i]))
    
    claims_text.append(match_claim(result[i]))
    
    abstract.append(match_abstract(result[i]))

In [15]:
dic = {
    "grant_id": grant_id,
    "patent_title": title,
    "kind": kind,
    "number_of_claims": number_of_claims,
    "inventors": inventors,
    "citations_applicant_count": citations_applicant_count,
    "citations_examiner_count": citations_examiner_count,
    "claims_text": claims_text,
    "abstract": abstract,
}

In [16]:
df = pd.DataFrame(dic)
df.to_csv("Group023.csv",index = False) ## 'index = False' avoid writing the row index into file
df.head(5)

Unnamed: 0,grant_id,patent_title,kind,number_of_claims,inventors,citations_applicant_count,citations_examiner_count,claims_text,abstract
0,US10360376,"Method for operating a computer unit, and such...",Utility Patent Grant (with a published applica...,11,"[Laszlo Marton, Oliver Mihatsch]",7,3,"[1. A method for operating a computer unit, th...",A method is supplied for operating a computer ...
1,US10360522,Updating a forecast based on real-time data as...,Utility Patent Grant (no published application...,19,"[Edward Michael Leonard, Robert Howard Sawers,...",0,6,"[1. A computer-implemented method, comprising:...",Techniques for generating a forecast associate...
2,US10358186,Bicycle sprocket and bicycle sprocket assembly,Utility Patent Grant (with a published applica...,34,[Akinobu Sugimoto],9,37,[1. A bicycle sprocket comprising:a sprocket b...,"A bicycle sprocket comprises a sprocket body, ..."
3,USPP030748,Hosta plant named &#x2018;Etched Glass&#x2019;,"2001""",1,[Hans A. Hansen],0,0,[1. A new and distinct ornamental plant cultiv...,A new and distinct Hosta plant named ;Etched G...
4,US10357593,Malleable demineralized bone composition and m...,Utility Patent Grant (with a published applica...,11,"[Edgar S. Maldonado, Silvia Daniela Gonzales]",52,2,[1. A method of making a malleable demineraliz...,A malleable demineralized bone composition con...
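
The mapping from dictionary keys to column names can also be seen without pandas. The sketch below uses only the standard library's `csv` module on two made-up rows (all values are invented) and writes to an in-memory buffer rather than a file. It also shows that a title containing a comma is quoted automatically.

```python
import csv
import io

# Two made-up rows; in the notebook each column is a list built in the loop above.
columns = {
    'grant_id': ['US0000001', 'US0000002'],
    'patent_title': ['Widget, and method of making it', 'Gadget'],
    'number_of_claims': [3, 12],
}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(columns.keys())            # header row from the dictionary keys
writer.writerows(zip(*columns.values()))   # transpose the column lists into rows
print(buffer.getvalue())
```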


## 5.2 To json file

Writing the JSON file is similar to writing the CSV file, but the output must be keyed by grant ID. The features of each patent are therefore gathered into their own dictionary, and that dictionary is stored under the patent's grant ID, so the result is a dictionary of dictionaries. Finally, `json.dump` writes the whole dictionary to the file as valid JSON.

In [17]:
import json

patents = {}
for i in range(len(grant_id)):
    ## values for one patent, stored under its grant id
    patents[grant_id[i]] = {
        "patent_title": title[i],
        "kind": kind[i],
        "number_of_claims": number_of_claims[i],
        "inventors": inventors[i],
        "citations_applicant_count": citations_applicant_count[i],
        "citations_examiner_count": citations_examiner_count[i],
        "claims_text": claims_text[i],
        "abstract": abstract[i],
    }

## write all patents at once; mode 'w' overwrites the file, so re-running the cell does not duplicate records
with open("Group023.json",'w') as outputjson:
    json.dump(patents, outputjson)
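
A note on serialization: Python's `str()` of a dictionary and `json.dumps` produce different text. The sketch below, using a made-up record with a hypothetical grant ID, checks which of the two a JSON parser accepts.

```python
import json

# Made-up record keyed by a hypothetical grant id.
record = {'US0000001': {'patent_title': "Driver's aid", 'number_of_claims': 3}}

as_json = json.dumps(record)
round_trip_ok = json.loads(as_json) == record   # json.dumps output parses back

try:
    json.loads(str(record))                     # Python repr uses single quotes
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

print(round_trip_ok, repr_is_valid_json)
```

Text written with `str()` may therefore need extra handling before other tools can read it as JSON.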
    

# 6. Summary
This assessment measured the understanding of basic text file processing techniques in the Python programming language. The main outcomes achieved while applying these techniques were:

- **Reading and splitting the text file into a list.** This used `re.split` from the built-in `re` module and the string method `replace`. With the help of a `re` tutorial, preprocessing the file was not hard.

- **Extraction.** Defining one function per feature is a clear and readable way to parse the patent features. [regex101](https://regex101.com/) is a very useful online tool for checking a regular expression, and it also shows which part of the match belongs to each group. `re.search`, `re.findall` and `re.sub` are used frequently in this part to extract particular text from the main file. Basic constructs such as `if`/`else` and loops then handle the decisions and the appending.

- **Writing text into file.** With `pd.DataFrame`, the dictionary is converted into a data frame, and `.to_csv` writes the data frame to a CSV file. It is worth noting that the `index` parameter should be set to `False` to avoid saving the row index to the file.

# 7. References
- The `pandas` Project. (2016a). *pandas 0.19.2 documentation: pandas.DataFrame.to_csv*. Retrieved from http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv
- W3Schools. (2019). *Python RegEx*. Retrieved from https://www.w3schools.com/python/python_regex.asp