# Parsing Raw Text Files

Date: 25/08/2019

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:
* pandas 0.19.2 (for data frames and reading/writing CSV; included in Anaconda Python 3.6)
* re 2.2.1 (for regular expressions; included in Anaconda Python 3.6)


## 1. Introduction
*The task is to analyse textual data, i.e., to extract data from the given semi-structured text file. The XML text file contains information about several patent grants, e.g., patent title, patent ID, citation network, abstract, etc.
Our task is to extract this data and transform it into CSV and JSON formats. There are a total of 150 patents in the file named `Group055.txt`. The required tasks are the following:*

1. Extract 'grant_id', 'patent_title', 'kind', 'no_of_claims', 'inventors', 'citation_applicant_count', 'citation_examiner_count', 'claim_text' and 'abstract' from each patent.
2. Load the features related to each patent into .csv and .json format files.

## 2.  Import libraries

In [1]:
import pandas as pd
import re

## 3. Loading data

**Glimpse of the data**

In [2]:
# print first ten lines of the file
with open('Group055.txt','r') as infile:
    print('\n'.join([infile.readline().strip() for i in range(0, 10)]))

<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US10357251-20190723.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20190709" date-publ="20190723">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>10357251</doc-number>
<kind>B2</kind>
<date>20190723</date>
</document-id>


We can see that the first XML document has an XML declaration `<?xml...?>` and a root tag `<us-patent-grant>`. Based on this information, it is possible to delimit each XML document so that it can be extracted individually.<br>The input data is in XML format, and the features to be extracted are enclosed within specific tags.

In [3]:
# %%time
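# read the whole file (closing it afterwards) and split it on the XML declaration
with open('Group055.txt','r',encoding='UTF-8') as infile:
    patent_split_data = infile.read().split('<?xml version="1.0" encoding="UTF-8"?>')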

Each patent begins with the XML declaration `<?xml version="1.0" encoding="UTF-8"?>`, so splitting on that declaration gives a list of patents.<br>
NOTE: The value at index 0 of the list is an empty string (the text before the first declaration), so it is skipped.
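A minimal sketch of this split on a made-up two-patent string. The grant numbers and the names `sample`, `parts` and `patents` are illustrative assumptions, not taken from `Group055.txt`.

```python
XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'

# Hypothetical input containing two tiny "patents"
sample = (
    XML_DECLARATION + '\n<us-patent-grant file="US0000001-20190723.XML">\n</us-patent-grant>\n'
    + XML_DECLARATION + '\n<us-patent-grant file="US0000002-20190723.XML">\n</us-patent-grant>\n'
)

parts = sample.split(XML_DECLARATION)

# Nothing precedes the first declaration, so parts[0] is an empty string
patents = [part for part in parts if part.strip()]

print(len(parts))    # 3
print(len(patents))  # 2
```

Filtering out blank pieces is one alternative to starting the loop at index 1; both skip the empty leading element.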

## 4. Functions to extract features from a single patent

### Methods used in feature extraction functions: <br>
*  <font size= '3'> **re.sub(x,y,z)**   :   Replaces every match of pattern x in string z with y. It operates on strings and returns a string.
* **re.findall(x,y)**   :   Finds every match of pattern x in string y and returns the matches as a list.
* **zip()**     :    Returns a zip object, an iterator of tuples in which the first items of the passed iterables are paired together, then the second items, and so on.
* **len()**    :    Returns the number of items in an object.
* **value_counts()**    :    Returns an object containing the counts of unique values.
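The calls listed above can be tried on toy inputs, as in the short sketch below. The strings and names are invented for illustration, and `collections.Counter` is used here only as a standard-library stand-in for pandas `value_counts()`.

```python
import re
from collections import Counter

markup = '<b>Hello</b> <i>world</i>'

# re.sub(pattern, replacement, string): strip anything that looks like a tag
plain = re.sub(r'<[^<]+>', '', markup)

# re.findall(pattern, string): collect every captured opening-tag name
tag_names = re.findall(r'<(\w+)>', markup)

# zip(): pair items position by position
pairs = list(zip(['Ada', 'Alan'], ['Lovelace', 'Turing']))

# Counter tallies unique values, much like value_counts() does for a Series
counts = Counter(['examiner', 'applicant', 'examiner'])

print(plain)               # Hello world
print(tag_names)           # ['b', 'i']
print(pairs)               # [('Ada', 'Lovelace'), ('Alan', 'Turing')]
print(len(pairs))          # 2
print(counts['examiner'])  # 2
```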



**1. Function to extract Citation counts from each patent** <br>

In [4]:
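def get_citation_counts(each_patent):
    cited_by = re.findall(r'<category>cited by (.*)</category>',each_patent)
    # Count the 'examiner' and 'applicant' occurrences. An empty list gives an empty
    # Series, so a later test such as `'examiner' in citation_counts` is simply False
    # (returning the integer 0 here would make that test raise a TypeError).
    return pd.Series(cited_by, dtype=object).value_counts()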

**Regex used:** `<category>cited by (.*)</category>` <br> With re.findall(), the regex returns a list of 'examiner' and 'applicant' strings. <br> value_counts() then counts the occurrences of each unique value.
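To see what the citation regex captures, here is a small sketch on a made-up fragment. The `<us-citation>` wrapper and the sample text are assumptions for illustration, and `collections.Counter` replaces pandas so that the snippet needs only the standard library.

```python
import re
from collections import Counter

# Hypothetical citation fragment with two examiner citations and one applicant citation
sample = (
    '<us-citation>\n<category>cited by examiner</category>\n</us-citation>\n'
    '<us-citation>\n<category>cited by applicant</category>\n</us-citation>\n'
    '<us-citation>\n<category>cited by examiner</category>\n</us-citation>\n'
)

cited_by = re.findall(r'<category>cited by (.*)</category>', sample)
counts = Counter(cited_by)

print(cited_by)             # ['examiner', 'applicant', 'examiner']
print(counts['examiner'])   # 2
print(counts['applicant'])  # 1
```

Because `.` does not match a newline by default, each match stays within its own `<category>` line.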

**2. Function to extract claim text and count of claims for each patent** <br>

In [5]:
def get_claimText_and_count(each_patent):
    # re.DOTALL lets '.' match newlines, so the greedy group runs from the first
    # <claim-text> to the last </claim-text> of the patent
    pattern = re.compile(r'<claim id="[A-Za-z0-9\S]+ num="[0-9]+">\n<claim-text>(.*)</claim-text>',re.DOTALL)
    claims_xml = ''.join(pattern.findall(each_patent)) # join the list of matches into a single string
    claims_no_tags = re.sub(r'<[^<]+>','',claims_xml) # remove xml tags
    claims_text = re.sub(r'[\n\t]','',claims_no_tags) # remove new lines (\n) and tabs (\t) from the text
    if len(claims_text)!=0: # check that claim text was extracted
        claim_id = re.findall(r'<claim id="[A-Za-z0-9\S]+ num="([\d]+)">',each_patent) # extract the claim numbers
        no_of_claims = len(claim_id) # number of claims
        return claims_text,no_of_claims
    else:
        # no claim text found: return the default values
        return 'NA',0

**Regex used:** `<claim id="[A-Za-z0-9\S]+ num="[0-9]+">\n<claim-text>(.*)</claim-text>` for *Claim Text* <br>
`<claim id="[A-Za-z0-9\S]+ num="([\d]+)">` for *Claim Count* <br> The claim text is extracted using the first pattern; tags, newline characters (\n) and tabs (\t) are then removed using re.sub(). <br> For the claim count, the number of every claim (its `num` attribute) is extracted into a list using re.findall(), and len() is applied to that list.
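The two claim patterns can be checked on a tiny, invented claims block. The claim wording and IDs below are assumptions, not data from the file.

```python
import re

# Hypothetical claims section with two claims
sample = (
    '<claims id="claims">\n'
    '<claim id="CLM-00001" num="00001">\n'
    '<claim-text>1. A widget comprising a frame.</claim-text>\n'
    '</claim>\n'
    '<claim id="CLM-00002" num="00002">\n'
    '<claim-text>2. The widget of claim 1, wherein the frame is steel.</claim-text>\n'
    '</claim>\n'
    '</claims>'
)

# DOTALL makes '.' match newlines, so one greedy match covers all claims
text_pattern = re.compile(
    r'<claim id="[A-Za-z0-9\S]+ num="[0-9]+">\n<claim-text>(.*)</claim-text>', re.DOTALL)
claims_xml = ''.join(text_pattern.findall(sample))
claims_text = re.sub(r'[\n\t]', '', re.sub(r'<[^<]+>', '', claims_xml))

no_of_claims = len(re.findall(r'<claim id="[A-Za-z0-9\S]+ num="([\d]+)">', sample))

print(claims_text)
print(no_of_claims)  # 2
```

Note that removing the newlines joins consecutive claims without a separating space, so the printed text reads `...a frame.2. The widget...`.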

**3. Function to extract abstract for each patent** 

In [6]:
def get_abstract(each_patent):
    abstract_tags = ''.join(re.findall('<abstract id="abstract">\\n<p id="[A-Za-z0-9\S]+ num="[0-9]+">(.*)</p>',each_patent))
    abstract = re.sub('<[^<]+>','',abstract_tags)
    abstract = re.sub('\n','',abstract)
    if len(abstract)!=0:
        return abstract
    else:
        abstract='NA'
        return abstract

**Regex used:** `<abstract id="abstract">\n<p id="[A-Za-z0-9\S]+ num="[0-9]+">(.*)</p>` <br> The abstract text is extracted using the regex. All tags and newline characters within the text are then removed. The default value 'NA' is returned when no abstract is found.
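As a quick check of the abstract pattern, the sketch below applies it to an invented abstract containing an inline tag. The sample text and paragraph ID are assumptions.

```python
import re

# Hypothetical abstract with an inline <b> tag
sample = (
    '<abstract id="abstract">\n'
    '<p id="p-0001" num="0000">A widget with a <b>steel</b> frame is disclosed.</p>\n'
    '</abstract>'
)

abstract_tags = ''.join(re.findall(
    r'<abstract id="abstract">\n<p id="[A-Za-z0-9\S]+ num="[0-9]+">(.*)</p>', sample))
abstract = re.sub(r'<[^<]+>', '', abstract_tags)  # drop the inline tags
abstract = re.sub(r'\n', '', abstract)            # drop any newlines

print(abstract)  # A widget with a steel frame is disclosed.
```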

**4. Function to extract grant id for each patent**

In [7]:
def get_grant_id(each_patent):
    return ''.join(re.findall('file="([A-Za-z0-9]+)-[\d]+.XML',each_patent)) 

**Regex used:** `file="([A-Za-z0-9]+)-[\d]+.XML` <br> The grant ID is the part of the `file` attribute before the hyphen, which is captured by the group in the pattern above.
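Applied to the root tag shown in the data glimpse (abbreviated here), the pattern captures the grant ID as sketched below. The dot before `XML` is escaped in this sketch, a slight tightening of the pattern above, where an unescaped `.` would match any character.

```python
import re

# Root tag abbreviated from the first patent in the data glimpse
root_tag = '<us-patent-grant lang="EN" file="US10357251-20190723.XML" status="PRODUCTION">'

# The group captures the alphanumeric part before the hyphen
grant_id = ''.join(re.findall(r'file="([A-Za-z0-9]+)-[\d]+\.XML', root_tag))

print(grant_id)  # US10357251
```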

**5. Function to extract title of each patent** 

In [8]:
def get_patent_title(each_patent):
    return re.sub('<[^<]+>','',''.join(re.findall('<invention-title id="[A-Za-z0-9\S]+>(.*)</invention-title>',each_patent)))

**Regex used:** `<invention-title id="[A-Za-z0-9\S]+>(.*)</invention-title>` <br> The patent title is extracted from the invention-title tags. Any tags remaining inside the title are then removed using re.sub().
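A short sketch of the title extraction on an invented title that contains an inline tag. The title text and the `id` value are assumptions.

```python
import re

# Hypothetical invention title with an inline <i> tag
sample = '<invention-title id="d2e53">Widget with <i>steel</i> frame</invention-title>'

raw_title = ''.join(re.findall(
    r'<invention-title id="[A-Za-z0-9\S]+>(.*)</invention-title>', sample))
patent_title = re.sub(r'<[^<]+>', '', raw_title)  # remove tags left inside the title

print(raw_title)     # Widget with <i>steel</i> frame
print(patent_title)  # Widget with steel frame
```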

**6. Function to extract full names of all the inventors from each patent**

In [9]:
def get_inventors(each_patent):
    xml_inventors_data = ''.join(re.findall('<inventors>[A-Za-z\\n\S\W]+</inventors>',each_patent))
    if len(xml_inventors_data)!=0:
        last_name = re.findall('<last-name>([A-Za-z\s\S]+?)</last-name>',xml_inventors_data)
        first_name = re.findall('<first-name>([A-Za-z\s\S]+?)</first-name>',xml_inventors_data)
        # pair each first name with its last name and join every pair into a full name
        names = [first+' '+last for first,last in zip(first_name,last_name)]
        return names
    else:
        return 'NA'

**Regex used:** `<inventors>[A-Za-z\n\S\W]+</inventors>` extracts the inventor details. <br> `<last-name>([A-Za-z\s\S]+?)</last-name>` extracts the last names from the inventor details. <br> `<first-name>([A-Za-z\s\S]+?)</first-name>` extracts the first names from the inventor details. <br> The first and last names of the inventors are embedded within the inventors tags. <br> re.findall() extracts the names into lists, and zip() pairs each first name with the corresponding last name. <br> Each pair is joined into a full name and collected in a list.<br> The function returns this list; the default return value is 'NA'.
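The name pairing can be sketched on a made-up inventors block. The names and the inner tag layout below are illustrative assumptions.

```python
import re

# Hypothetical inventors section with two inventors
sample = (
    '<inventors>\n'
    '<inventor sequence="001">\n<addressbook>\n'
    '<last-name>Lovelace</last-name>\n<first-name>Ada</first-name>\n'
    '</addressbook>\n</inventor>\n'
    '<inventor sequence="002">\n<addressbook>\n'
    '<last-name>Turing</last-name>\n<first-name>Alan</first-name>\n'
    '</addressbook>\n</inventor>\n'
    '</inventors>'
)

inventors_xml = ''.join(re.findall(r'<inventors>[A-Za-z\n\S\W]+</inventors>', sample))
last_names = re.findall(r'<last-name>([A-Za-z\s\S]+?)</last-name>', inventors_xml)
first_names = re.findall(r'<first-name>([A-Za-z\s\S]+?)</first-name>', inventors_xml)

# zip() aligns the two lists by position
names = [first + ' ' + last for first, last in zip(first_names, last_names)]

print(names)  # ['Ada Lovelace', 'Alan Turing']
```

The pairing relies on both lists having the same length and order, so an inventor without a first name would shift the later pairs.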

**7. Function to extract the kind code for each patent**

In [10]:
def get_kind(each_patent):
    kind_tags = ''.join(re.findall('<publication-reference>([\W\w]+)</publication-reference>',each_patent))
    kind = ''.join(re.findall('<kind>(.*)</kind>',kind_tags))
    return kind

**Regex used:** `<publication-reference>([\W\w]+)</publication-reference>` extracts the text between the `<publication-reference>` and `</publication-reference>` tags. <br> `<kind>(.*)</kind>` extracts the kind code embedded within the kind tags. <br> The extracted kind code is returned as a string.
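The sketch below suggests why the search is first narrowed to the publication reference. The first block mirrors the data glimpse, while the `<related-publication>` block is a hypothetical addition used only to show a second `<kind>` tag elsewhere in a document.

```python
import re

sample = (
    '<publication-reference>\n<document-id>\n<country>US</country>\n'
    '<doc-number>10357251</doc-number>\n<kind>B2</kind>\n<date>20190723</date>\n'
    '</document-id>\n</publication-reference>\n'
    # hypothetical later block that also carries a <kind> tag
    '<related-publication>\n<kind>A1</kind>\n</related-publication>\n'
)

# Searching the whole document would pick up both kind codes
all_kinds = re.findall(r'<kind>(.*)</kind>', sample)

# Narrowing to the publication reference keeps only the patent's own kind code
kind_tags = ''.join(re.findall(
    r'<publication-reference>([\W\w]+)</publication-reference>', sample))
kind = ''.join(re.findall(r'<kind>(.*)</kind>', kind_tags))

print(all_kinds)  # ['B2', 'A1']
print(kind)       # B2
```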

## 5. Calling the functions and creating a data frame for csv and a dictionary for json


In [11]:
%%time
# column[] contains the names of all the columns to be used for the data frame
column=['grant_id','patent_title','kind','no_of_claims','inventors','citation_applicant_count','citation_examiner_count','claim_text','abstract']
df = pd.DataFrame(columns=column)  # created a data frame using pandas pd
d_frame=df.iloc[0:0]   # re-setting a new data frame to avoid overloading on multiple code runs 
dict_for_json = {}     # dictionary to be written for json output
# dictionary build to hold values for each kind code
kind_dictionary = {"B2":"Utility Patent Grant (with a published application) issued on or after January 2, 2001.","B1":"Utility Patent Grant (no published application) issued on or after January 2, 2001.",'E1':'Reissue Patent','S1':'Design Patent',"P2":"Plant Patent Grant (no published application) issued on or after January 2, 2001","P3":"Plant Patent Grant (with published application) issued on or after January 2, 2001"}

# The loop runs once for each patent in patent_split_data. For each patent it extracts all the
# features using the functions defined above. The features are saved in a list which, after
# conversion to a Series, is appended to the data frame. The same feature values are used to
# build a dictionary that is later written out in json format.
for i in range(1,len(patent_split_data)):
    dict_json = {} 
    grant_id = get_grant_id(patent_split_data[i])
    patent_title = get_patent_title(patent_split_data[i])
    kind_code =get_kind(patent_split_data[i])
    kind = kind_dictionary.get(kind_code, 'NA')  # 'NA' when the kind code is not in the dictionary
        
    claim_text,no_of_claims = get_claimText_and_count(patent_split_data[i])
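    inventor_names = get_inventors(patent_split_data[i])
    # get_inventors returns the string 'NA' when no inventors are found; joining that string directly would give 'N,A'
    inventors = 'NA' if inventor_names == 'NA' else '[{}]'.format(','.join(inventor_names))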
    citation_counts= get_citation_counts(patent_split_data[i])
    
    # limiting conditions for citation_counts.Set default values as 0
    if 'examiner' in citation_counts:
        citation_examiner_count = citation_counts['examiner']
    else:
        citation_examiner_count=0
        
    if 'applicant' in citation_counts:
        citation_applicant_count = citation_counts['applicant']
    else:
        citation_applicant_count=0         
    abstract = get_abstract(patent_split_data[i])
    
    # list stores all features for each patent
    all_patent_feature = [grant_id,patent_title,kind,no_of_claims,inventors,citation_applicant_count,citation_examiner_count,claim_text,abstract]
    # populate dataframe with all_patent_features
    d_frame = d_frame.append(pd.Series(all_patent_feature,index=column),ignore_index=True)
    
    # This loop populates the dictionary with the key,value pairs required to generate the json format text file.
    # It starts at index 1 because grant_id (index 0) is used as the outer key.
    for j in range(1,len(column)):
        dict_json[column[j]]=all_patent_feature[j]
    dict_for_json[grant_id] = dict_json # dictionary with grant id's as keys and the respective other features as values

Wall time: 1.49 s


**Glimpse of the data frame**

In [12]:
d_frame.head(2)

| | grant_id | patent_title | kind | no_of_claims | inventors | citation_applicant_count | citation_examiner_count | claim_text | abstract |
|---|---|---|---|---|---|---|---|---|---|
| 0 | US10357251 | Surgical staples comprising hardness variation... | Utility Patent Grant (with a published applica... | 19 | [Frederick E. Shelton, IV,Jeffrey S. Swayze,Ch... | 4653 | 15 | 1. A surgical staple cartridge for use with a ... | A surgical staple cartridge is disclosed compr... |
| 1 | US10358657 | IS-targeting system for gene insertion and gen... | Utility Patent Grant (with a published applica... | 12 | [R&#xe9;mi Bernard,Esther Gerber,Elena Hauser,... | 8 | 0 | 1. A method for introducing a nucleic acid mol... | The present invention relates to methods and c... |


## 6. Writing into csv and json

In [13]:
# writing output into a json text file
with open('Group055.json', 'w+') as f:
    entries = []
    # formatting the key, value pairs of the dictionary for .json format
    for s,v in dict_for_json.items():
        first = '{%s}' % ','.join(['"%s": "%s"' % (x, y) for x, y in v.items()]) # placing all key, value pairs in double quotes
        entries.append('"%s": %s' % (s, first))  # creating new key,value pairs. Key = grant_id, value = all other features

    # joining the entries with commas leaves no trailing comma to strip afterwards
    json_text_final = '{' + ','.join(entries) + '}' # enclosing the json text in {}
    print(json_text_final,file=f)
    
# writing the d_frame data frame into a csv file
d_frame.to_csv('Group055.csv',mode='w+',index=False)
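One caveat with hand-formatted JSON is that a double quote inside a value is not escaped. The sketch below uses the standard-library `json` module, which this notebook does not use for writing, to show the difference on a made-up record (`sample_records` and its contents are hypothetical).

```python
import json

# Hypothetical record whose title contains double quotes
sample_records = {
    'US0000001': {'patent_title': 'A "smart" widget', 'no_of_claims': 2},
}

# json.dumps escapes the embedded quotes, so the text parses back unchanged
json_text = json.dumps(sample_records)
round_trip = json.loads(json_text)

# Plain string formatting leaves the quotes unescaped and yields invalid JSON
manual_text = '{"patent_title": "%s"}' % sample_records['US0000001']['patent_title']
try:
    json.loads(manual_text)
    manual_is_valid = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    manual_is_valid = False

print(json_text)
print(manual_is_valid)  # False
```

Unlike the manual formatting, `json.dumps` keeps numbers as numbers rather than quoting them, so the resulting file would differ slightly from the one written above.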

### Reading json

In [16]:
# import json

# with open('Group055.json') as f:
#     d = json.load(f)
#     print(d)


### Reading csv

In [17]:
# data_fr = pd.read_csv('Group055.csv')
# data_fr.head()

## References
* Pandas 0.25.1 documentation https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
* re - Regular Expression Operations (Python 2) https://docs.python.org/2/library/re.html
* re - Regular Expression Operations https://docs.python.org/3/library/re.html
