## 1. Introduction
This notebook extracts patent grant data from the input file and transforms it into CSV and JSON format with the following elements:
1. <i>grant_id</i>: a unique ID for a patent grant consisting of alphanumeric characters.  
2. <i>patent_kind</i>: a category to which the patent grant belongs.  
3. <i>patent_title</i>: a title given by the inventor to the patent claim.
4. <i>number_of_claims</i>: an integer denoting the number of claims for a given grant.
5. <i>citations_examiner_count</i>: an integer denoting the number of citations made by the examiner for a given patent grant (0 if None).
6. <i>citations_applicant_count</i>: an integer denoting the number of citations made by the applicant for a given patent grant (0 if None).
7. <i>inventors</i>: a list of the patent inventors’ names ([NA] if the value is Null).
8. <i>claims_text</i>: a list of claim texts for the different patent claims ([NA] if the value is Null).
9. <i>abstract</i>: the patent abstract text (‘NA’ if the value is Null)

CSV file has columns as follows:

['grant_id', 'patent_title', 'kind', 'number_of_claims', 'inventors', 'citations_applicant_count', 'citations_examiner_count', 'claims_text', 'abstract']


## 2. Import libraries 

In [1]:
#!pip install pandas #uncomment to install pandas if required
import pandas as pd
import re

## 3. Examining and loading data

As a first step, the XML file is opened so that its first 10 lines can be inspected.

In [2]:
with open('data.txt','r') as infile:
    print('\n'.join([infile.readline().strip() for i in range(0, 10)]))

<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US10360783-20190723.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20190709" date-publ="20190723">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>10360783</doc-number>
<kind>B2</kind>
<date>20190723</date>
</document-id>


We can see that the first XML document has an XML declaration <?xml version="1.0" encoding="UTF-8"?> and a root tag <us-patent-grant>. Based on this information it's possible to properly delimit an XML document so it can be extracted individually.
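
The delimiting idea can be tried on a small string before touching the real file. The following is a minimal sketch, not the notebook's own loading code: `split_documents` is a hypothetical helper, and the two-document sample is made up (only the first file name comes from the output above). It assumes every document starts with exactly the declaration shown above.

```python
import re

# One document = the XML declaration plus everything up to the next
# declaration (or the end of the text). The lookahead does not consume
# the next declaration, so it stays with the following document.
DOC_PATTERN = re.compile(
    r'<\?xml version="1\.0" encoding="UTF-8"\?>.*?(?=<\?xml version="1\.0"|\Z)',
    re.DOTALL,
)

def split_documents(text):
    """Return the individual XML documents found in a concatenated text."""
    return DOC_PATTERN.findall(text)

# Made-up sample with two very short documents
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant file="US10360783-20190723.XML"></us-patent-grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant file="US10362643-20190723.XML"></us-patent-grant>\n'
)

documents = split_documents(sample)
print(len(documents))
```

Each element of `documents` can then be handed to the extraction functions one at a time.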

#### From observing Sample_input.txt, we find the following:

1. Each XML section begins with <b><?xml version="1.0" encoding="UTF-8"?\></b> followed by a new line.

2. The content of each XML section lies between <us-patent-grant .....> and <\/us-patent-grant>
    
3. <i>grant_id</i> is the string of 10 characters that follows file=" in the pattern file="....xml", e.g. US10357643

4. <i>patent_title</i> is the content that lies between <invention-title ...> and <\/invention-title>
    
5. <i>kind</i> is the content between the tags <kind\> and <\/kind>
    * The kind values in the sample input are short codes of the form '\w+\d?|\w?', e.g. <i>A, A1, B2</i>, whereas the values in Sample_output.csv are descriptive texts, e.g. Utility Patent Grant (with a published app...). A dictionary is therefore needed to pair each code (e.g. A1) with its corresponding text. To make sure the pairs are correct, we took the code definitions from the website of the United States Patent and Trademark Office (https://www.uspto.gov/), in the document PATENT ASSIGNMENT DAILY XML FILE DESCRIPTION (Version 2.0). The <i>type_kind</i> dictionary was created from these definitions to match kind codes to kind texts.
    
    
6. <i>number_of_claims</i> is the content between <number-of-claims\> and <\/number-of-claims>

7. <i>inventors</i> are the full names of all the inventors. They can be found between <last-name\> and <\/last-name>, and between <first-name\> and <\/first-name>

8. - 8.1 <i>citations_applicant_count</i> is the count of citations by applicants, which appear in the format <category\>cited by applicant<\/category>
   - 8.2 <i>citations_examiner_count</i> is the count of citations by examiners, which appear in the format <category\>cited by examiner<\/category>
   
9. - 9.1 <i>claims_text</i> can be found between the tags <claim-text\> and <\/claim-text>
   - 9.2 some claim texts are long and are split across several <claim-text\> tags. To deal with this, we check the beginning characters of each claim text to decide whether it starts a new claim or continues the previous one.


10. <i>abstract</i> can be found between the tags <abstract .....> and <\/abstract>
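
Several of these observations can be checked with plain regular expressions. The sketch below is only an illustration under stated assumptions: the `sample` fragment is made up (the file name and kind code are taken from the output shown earlier, the claim count and citations are invented), and the notebook's own functions use the Grab class rather than these one-liners.

```python
import re

# Made-up fragment that follows the tag layout observed above
sample = (
    '<us-patent-grant lang="EN" file="US10360783-20190723.XML">'
    '<kind>B2</kind>'
    '<number-of-claims>20</number-of-claims>'
    '<category>cited by applicant</category>'
    '<category>cited by examiner</category>'
    '<category>cited by examiner</category>'
    '</us-patent-grant>'
)

# Observation 3: ten word characters after file="
grant_id = re.search(r'file="(\w{10})', sample).group(1)

# Observation 5: the kind code between <kind> and </kind>
kind_code = re.search(r'<kind>(.*?)</kind>', sample).group(1)

# Observation 6: the number of claims, converted to an integer
number_of_claims = int(
    re.search(r'<number-of-claims>(\d+)</number-of-claims>', sample).group(1)
)

# Observation 8: count the two citation categories
applicant_count = len(re.findall(r'<category>cited by applicant</category>', sample))
examiner_count = len(re.findall(r'<category>cited by examiner</category>', sample))

print(grant_id, kind_code, number_of_claims, applicant_count, examiner_count)
```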

## 3. Parse XML File

The XML file (e.g. Sample_input.txt, Group102.txt, etc.) should be in the <b>same folder</b> as this notebook for it to run. 

### 3.1 Create <i>type_kind</i> and <i>spec_dict</i>

<i>type_kind</i>: a dictionary whose keys are the kind codes found in the input and whose values are the corresponding texts used in the output file, with reference to the USPTO document. 

<i>spec_dict</i>: a dictionary that maps XML character references to the characters they stand for, so that special characters are displayed correctly in the CSV and JSON files, as in Sample_output.csv. As advised by the teaching team, this requirement is optional.

https://www.w3schools.com/charsets/ref_utf_box.asp

In [3]:
# create a dictionary to store patent kinds. This code block was created in accordance to
# PATENT ASSIGNMENT DAILY XML FILE DESCRIPTION (Version 2.0)

type_kind = {
            "A": 'Utility Patent Grant issued prior to January 2, 2001.',   
            "A1": 'Utility Patent Application published on or after January 2, 2001.',
            "A2": 'Second or subsequent publication of a Utility Patent Application.',  
            'A9': 'Corrected published Utility Patent Application', 
            'Bn': 'Reexamination Certificate issued prior to January 2, 2001.', # NOTE: "n" represents a value 1 through 9. 
            'B1': 'Utility Patent Grant (no published application) issued on or after January 2, 2001.', # 
            'B2': 'Utility Patent Grant (with a published application) issued on or after January 2, 2001.', #
            'Cn': 'Reexamination Certificate issued on or after January 2, 2001.', # NOTE: "n" represents a value 1 through 9 denoting the publication level. 
            'E1': 'Reissue Patent', 
            'Fn': 'Reexamination Certificate of a Reissue Patent.', # NOTE: "n" represents a value 1 through 9 denoting the publication level. 
            'H1': 'Statutory Invention Registration (SIR) Patent Documents. SIR documents began with the December 3, 1985 issue.', # 
            'I1': '"X" Patents issued from July 31, 1790 to July 13, 1836.', 
            'I2': 'Reissue Patents', #"X"  issued from July 31, 1790 to July 13, 1836. 
            'I3': 'Additional Improvements - Patents issued between 1838 and 1861.', # 
            'I4': 'Defensive Publication - Documents issued from November 5, 1968 through May 5, 1987.', # 
            'I5': 'Trial Voluntary Protest Program (TVPP) Patent Documents.',
            'NP': 'Non-Patent Literature.', 
            'P': 'Plant Patent issued prior to January 2, 2001',
            'P1': 'Plant Patent Grant issued prior to January 2, 2001.', 
            'P2': 'Plant Patent Grant (no published application) issued on or after January 2, 2001.', # 
            'P3': 'Plant Patent Grant (with a published application) issued on or after January 2, 2001.', #
            'P4': 'Second or subsequent publication of a Plant Patent Application.', # 
            'P9': 'Correction publication of a Plant Patent Application.', 
            'S1': 'Design Patent',
            'S' : 'design',
            'NULL': 'NULL' #Placeholder for NULL values for duplicates and such.
             }

# build lookup strings that map XML hexadecimal character references (e.g. &#x20AC;) to the characters they represent
string_key_1 = """&#x20AC; &#x201A; &#x0192; &#x201E; &#x2026; &#x2020; &#x2021; &#x02C6; &#x2030; &#x0160; &#x2039; &#x0152; &#x017D; &#x2018; &#x2019; &#x201C; &#x201D; &#x2022; &#x2013; &#x2014; &#x02DC; &#x2122; &#x0161; &#x203A; &#x0153; &#x017E; &#x0178;"""
string_key_2 = """&#x21; &#x22; &#x23; &#x24; &#x25; &#x26; &#x27; &#x28; &#x29; &#x2a; &#x2b; &#x2c; &#x2d; &#x2e; &#x2f; &#x30; &#x31; &#x32; &#x33; &#x34; &#x35; &#x36; &#x37; &#x38; &#x39; &#x3a; &#x3b; &#x3c; &#x3d; &#x3e; &#x3f; &#x40; &#x41; &#x42; &#x43; &#x44; &#x45; &#x46; &#x47; &#x48; &#x49; &#x4a; &#x4b; &#x4c; &#x4d; &#x4e; &#x4f; &#x50; &#x51; &#x52; &#x53; &#x54; &#x55; &#x56; &#x57; &#x58; &#x59; &#x5a; &#x5b; &#x5c; &#x5d; &#x5e; &#x5f; &#x60; &#x61; &#x62; &#x63; &#x64; &#x65; &#x66; &#x67; &#x68; &#x69; &#x6a; &#x6b; &#x6c; &#x6d; &#x6e; &#x6f; &#x70; &#x71; &#x72; &#x73; &#x74; &#x75; &#x76; &#x77; &#x78; &#x79; &#x7a; &#x7b; &#x7c; &#x7d; &#x7e; &#xa1; &#xa2; &#xa3; &#xa4; &#xa5; &#xa6; &#xa7; &#xa8; &#xa9; &#xaa; &#xab; &#xac; &#xad; &#xae; &#xaf; &#xb0; &#xb1; &#xb2; &#xb3; &#xb4; &#xb5; &#xb6; &#xb7; &#xb8; &#xb9; &#xba; &#xbb; &#xbc; &#xbd; &#xbe; &#xbf; &#xc0; &#xc1; &#xc2; &#xc3; &#xc4; &#xc5; &#xc6; &#xc7; &#xc8; &#xc9; &#xca; &#xcb; &#xcc; &#xcd; &#xce; &#xcf; &#xd0; &#xd1; &#xd2; &#xd3; &#xd4; &#xd5; &#xd6; &#xd7; &#xd8; &#xd9; &#xda; &#xdb; &#xdc; &#xdd; &#xde; &#xdf; &#xe0; &#xe1; &#xe2; &#xe3; &#xe4; &#xe5; &#xe6; &#xe7; &#xe8; &#xe9; &#xea; &#xeb; &#xec; &#xed; &#xee; &#xef; &#xf0; &#xf1; &#xf2; &#xf3; &#xf4; &#xf5; &#xf6; &#xf7; &#xf8; &#xf9; &#xfa; &#xfb; &#xfc; &#xfd; &#xfe; &#xff;"""
string_value_1 = """€ ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ"""
string_value_2 = """! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ"""
string_key_3 = """&#x22; &#x26; &#x27; &#x3c; &#x3e; &#xa0; &#xa1; &#xa2; &#xa3; &#xa4; &#xa5; &#xa6; &#xa7; &#xa8; &#xa9; &#xaa; &#xab; &#xac; &#xad; &#xae; &#xaf; &#xb0; &#xb1; &#xb2; &#xb3; &#xb4; &#xb5; &#xb6; &#xb7; &#xb8; &#xb9; &#xba; &#xbb; &#xbc; &#xbd; &#xbe; &#xbf; &#xc0; &#xc1; &#xc2; &#xc3; &#xc4; &#xc5; &#xc6; &#xc7; &#xc8; &#xc9; &#xca; &#xcb; &#xcc; &#xcd; &#xce; &#xcf; &#xd0; &#xd1; &#xd2; &#xd3; &#xd4; &#xd5; &#xd6; &#xd7; &#xd8; &#xd9; &#xda; &#xdb; &#xdc; &#xdd; &#xde; &#xdf; &#xe0; &#xe1; &#xe2; &#xe3; &#xe4; &#xe5; &#xe6; &#xe7; &#xe8; &#xe9; &#xea; &#xeb; &#xec; &#xed; &#xee; &#xef; &#xf0; &#xf1; &#xf2; &#xf3; &#xf4; &#xf5; &#xf6; &#xf7; &#xf8; &#xf9; &#xfa; &#xfb; &#xfc; &#xfd; &#xfe; &#xff; &#x152; &#x153; &#x160; &#x161; &#x178; &#x192; &#x2c6; &#x2dc; &#x391; &#x392; &#x393; &#x394; &#x395; &#x396; &#x397; &#x398; &#x399; &#x39a; &#x39b; &#x39c; &#x39d; &#x39e; &#x39f; &#x3a0; &#x3a1; &#x3a3; &#x3a4; &#x3a5; &#x3a6; &#x3a7; &#x3a8; &#x3a9; &#x3b1; &#x3b2; &#x3b3; &#x3b4; &#x3b5; &#x3b6; &#x3b7; &#x3b8; &#x3b9; &#x3ba; &#x3bb; &#x3bc; &#x3bd; &#x3be; &#x3bf; &#x3c0; &#x3c1; &#x3c2; &#x3c3; &#x3c4; &#x3c5; &#x3c6; &#x3c7; &#x3c8; &#x3c9; &#x3d1; &#x3d2; &#x3d6; &#x2013; &#x2014; &#x2018; &#x2019; &#x201a; &#x201c; &#x201d; &#x201e; &#x2020; &#x2021; &#x2022; &#x2026; &#x2030; &#x2032; &#x2033; &#x2039; &#x203a; &#x203e; &#x2044; &#x20ac; &#x2111; &#x2118; &#x211c; &#x2122; &#x2135; &#x2190; &#x2191; &#x2192; &#x2193; &#x2194; &#x21b5; &#x21d0; &#x21d1; &#x21d2; &#x21d3; &#x21d4; &#x2200; &#x2202; &#x2203; &#x2205; &#x2207; &#x2208; &#x2209; &#x220b; &#x220f; &#x2211; &#x2212; &#x2217; &#x221a; &#x221d; &#x221e; &#x2220; &#x2227; &#x2228; &#x2229; &#x222a; &#x222b; &#x2234; &#x223c; &#x2245; &#x2248; &#x2260; &#x2261; &#x2264; &#x2265; &#x2282; &#x2283; &#x2284; &#x2286; &#x2287; &#x2295; &#x2297; &#x22a5; &#x22c5; &#x2308; &#x2309; &#x230a; &#x230b; &#x25ca; &#x2660; &#x2663; &#x2665; &#x2666; &#x27e8; &#x27e9; &#x2003; &#x2061; &#x2062; 
&#x212b; &#x22ee; &#x22f1; &#xf603; &#xf604;"""
string_value_3 = """"& ' < >   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ Œ œ Š š Ÿ ƒ ˆ ˜ Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ ς σ τ υ φ χ ψ ω ϑ ϒ ϖ – — ‘ ’ ‚ “ ” „ † ‡ • … ‰ ′ ″ ‹ › ‾ ⁄ € ℑ ℘ ℜ ™ ℵ ← ↑ → ↓ ↔ ↵ ⇐ ⇑ ⇒ ⇓ ⇔ ∀ ∂ ∃ ∅ ∇ ∈ ∉ ∋ ∏ ∑ − ∗ √ ∝ ∞ ∠ ∧ ∨ ∩ ∪ ∫ ∴ ∼ ≅ ≈ ≠ ≡ ≤ ≥ ⊂ ⊃ ⊄ ⊆ ⊇ ⊕ ⊗ ⊥ ⋅ ⌈ ⌉ ⌊ ⌋ ◊ ♠ ♣ ♥ ♦ ⟨ ⟩ \u2003 \u2061 \u2062 Å ⋮ ⋱ \uF603 \uF604"""
string_key_4 = """&#x2500; &#x2501; &#x2502; &#x2503; &#x2504; &#x2505; &#x2506; &#x2507; &#x2508; &#x2509; &#x250A; &#x250B; &#x250C; &#x250D; &#x250E; &#x250F; &#x2510; &#x2511; &#x2512; &#x2513; &#x2514; &#x2515; &#x2516; &#x2517; &#x2518; &#x2519; &#x251A; &#x251B; &#x251C; &#x251D; &#x251E; &#x251F; &#x2520; &#x2521; &#x2522; &#x2523; &#x2524; &#x2525; &#x2526; &#x2527; &#x2528; &#x2529; &#x252A; &#x252B; &#x252C; &#x252D; &#x252E; &#x252F; &#x2530; &#x2531; &#x2532; &#x2533; &#x2534; &#x2535; &#x2536; &#x2537; &#x2538; &#x2539; &#x253A; &#x253B; &#x253C; &#x253D; &#x253E; &#x253F; &#x2540; &#x2541; &#x2542; &#x2543; &#x2544; &#x2545; &#x2546; &#x2547; &#x2548; &#x2549; &#x254A; &#x254B; &#x254C; &#x254D; &#x254E; &#x254F; &#x2550; &#x2551; &#x2552; &#x2553; &#x2554; &#x2555; &#x2556; &#x2557; &#x2558; &#x2559; &#x255A; &#x255B; &#x255C; &#x255D; &#x255E; &#x255F; &#x2560; &#x2561; &#x2562; &#x2563; &#x2564; &#x2565; &#x2566; &#x2567; &#x2568; &#x2569; &#x256A; &#x256B; &#x256C; &#x256D; &#x256E; &#x256F; &#x2570; &#x2571; &#x2572; &#x2573; &#x2574; &#x2575; &#x2576; &#x2577; &#x2578; &#x2579; &#x257A; &#x257B; &#x257C; &#x257D; &#x257E; &#x257F;"""
string_value_4 = """─ ━ │ ┃ ┄ ┅ ┆ ┇ ┈ ┉ ┊ ┋ ┌ ┍ ┎ ┏ ┐ ┑ ┒ ┓ └ ┕ ┖ ┗ ┘ ┙ ┚ ┛ ├ ┝ ┞ ┟ ┠ ┡ ┢ ┣ ┤ ┥ ┦ ┧ ┨ ┩ ┪ ┫ ┬ ┭ ┮ ┯ ┰ ┱ ┲ ┳ ┴ ┵ ┶ ┷ ┸ ┹ ┺ ┻ ┼ ┽ ┾ ┿ ╀ ╁ ╂ ╃ ╄ ╅ ╆ ╇ ╈ ╉ ╊ ╋ ╌ ╍ ╎ ╏ ═ ║ ╒ ╓ ╔ ╕ ╖ ╗ ╘ ╙ ╚ ╛ ╜ ╝ ╞ ╟ ╠ ╡ ╢ ╣ ╤ ╥ ╦ ╧ ╨ ╩ ╪ ╫ ╬ ╭ ╮ ╯ ╰ ╱ ╲ ╳ ╴ ╵ ╶ ╷ ╸ ╹ ╺ ╻ ╼ ╽ ╾ ╿"""
string_key_5 = """&#x2100; &#x2101; &#x2102; &#x2103; &#x2104; &#x2105; &#x2106; &#x2107; &#x2108; &#x2109; &#x210A; &#x210B; &#x210C; &#x210D; &#x210E; &#x210F; &#x2110; &#x2111; &#x2112; &#x2113; &#x2114; &#x2115; &#x2116; &#x2117; &#x2118; &#x2119; &#x211A; &#x211B; &#x211C; &#x211D; &#x211E; &#x211F; &#x2120; &#x2121; &#x2122; &#x2123; &#x2124; &#x2125; &#x2126; &#x2127; &#x2128; &#x2129; &#x212A; &#x212B; &#x212C; &#x212D; &#x212E; &#x212F; &#x2130; &#x2131; &#x2132; &#x2133; &#x2134; &#x2135; &#x2136; &#x2137; &#x2138; &#x2139; &#x213A; &#x213B; &#x213C; &#x213D; &#x213E; &#x213F; &#x2140; &#x2141; &#x2142; &#x2143; &#x2144; &#x2145; &#x2146; &#x2147; &#x2148; &#x2149; &#x214A; &#x214B; &#x214C; &#x214D; &#x214E; &#x214F;"""
string_value_5 = """℀ ℁ ℂ ℃ ℄ ℅ ℆ ℇ ℈ ℉ ℊ ℋ ℌ ℍ ℎ ℏ ℐ ℑ ℒ ℓ ℔ ℕ № ℗ ℘ ℙ ℚ ℛ ℜ ℝ ℞ ℟ ℠ ℡ ™ ℣ ℤ ℥ Ω ℧ ℨ ℩ K Å ℬ ℭ ℮ ℯ ℰ ℱ Ⅎ ℳ ℴ ℵ ℶ ℷ ℸ ℹ ℺ ℻ ℼ ℽ ℾ ℿ ⅀ ⅁ ⅂ ⅃ ⅄ ⅅ ⅆ ⅇ ⅈ ⅉ ⅊ ⅋ ⅌ ⅍ ⅎ ⅏"""


string_key = ' '.join([string_key_1, string_key_2, string_key_3, string_key_4, string_key_5])
# concatenate the XML entities, separated by a space for later operation

string_value = ' '.join([string_value_1, string_value_2, string_value_3, string_value_4, string_value_5])
# concatenate the special characters in the same order, separated by a space for later operation

key = string_key.split(' ')
# use split method to create a list of the keys (xml entities)

value = string_value.split(' ')
# use split method to create a list of the values (special characters)

spec_dict = dict(zip(key, value)) # create a special character dictionary: entity -> character
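
A quick way to cross-check entries of a hand-built table like this is the standard library's `html.unescape`, which decodes hexadecimal character references directly. This is a sketch of an alternative, not what the notebook uses: `decode_entities` is a hypothetical helper name, and note that `html.unescape` also decodes named references such as `&amp;`, which may or may not be wanted for the final output.

```python
import html

def decode_entities(text):
    """Decode character references such as &#x20AC; into their characters."""
    return html.unescape(text)

# The three references below are taken from the key strings above
decoded = decode_entities('5 &#x2264; 6 &#x20AC; (&#x3b1;)')
print(decoded)
```

Comparing `decode_entities(k)` with `spec_dict[k]` for every key would reveal any pair that has slipped out of alignment.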


### 3.2 Utility Functions to strip unnecessary tag, and multiple spaces

* 3.2.1 <i>strip_tag()</i> function to get rid of unnecessary tags in the format of <> or <\/>
  * to achieve this goal, we substitute every match of the pattern '<!\[CDATA\[(.*?)\]\]>|<.*?>' with its first capture group, or with an empty string when that group is empty
  * a CDATA section is in the format of <!\[CDATA\[this is a CDATA section\]\]\>; its text is kept and only the wrapper is removed
  
* 3.2.2 <i>clean()</i> function to translate special characters and get rid of multiple whitespaces.
  * substitute the content that matches the pattern &#x with the corresponding value from spec_dict (through multiple_replace())
  * remove runs of 2 to 30 whitespace characters; we found that runs in the input txt file usually have at least 2 but fewer than 30 whitespaces.
  
* 3.2.3 <i>multiple_replace()</i> function to properly display special characters by searching for and replacing the pattern '&#x\w+'
  * build one re expression with % formatting by joining all the keys of the spec_dict dictionary with '|' (the or operator)
  * replace each match with the value stored under that key, using the re.sub method

In [4]:
def strip_tag(text):
    """
    pass text to the function and get rid of the tags but preserve the content within the tags
    """
    # the pattern matches either a CDATA section <![CDATA[...]]> (character data) or any tag <...>
    # group(1) is the text inside a CDATA section; it is None when an ordinary tag matched, so '' is used instead
    # the '!' is a literal character of the CDATA marker, it does not negate anything
    # flags=re.DOTALL lets '.' match newlines, so tags and CDATA sections spanning several lines are handled
    return re.sub(r'<!\[CDATA\[(.*?)\]\]>|<.*?>', lambda m: m.group(1) or '', text, flags=re.DOTALL) 

    
def clean(text):
    strip_and_sharp = multiple_replace(spec_dict, text)
    # substitutes character references like '&#x2264;' with the corresponding value from spec_dict
    
    # remove runs of 2 to 30 whitespace characters left over after removing tags and special characters
    return re.sub(r'\s{2,30}', '', strip_and_sharp, flags=re.DOTALL)
    

def multiple_replace(spec_dict, text):
  
    regex = re.compile("(%s)" % "|".join(map(re.escape, spec_dict.keys())))
    # use % formatting to build one alternation (key1|key2|...) of all the keys in the spec_dict dictionary
    # re.escape makes each key match literally, even if it contains regular expression metacharacters
    # for each match, look up the corresponding value in the dictionary and substitute it
    return regex.sub(lambda mo: spec_dict[mo.string[mo.start(): mo.end()]], text) 
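
The two techniques above can be tried on a single string. This is a self-contained sketch under stated assumptions: `demo_dict` is a two-entry stand-in for spec_dict, and `strip_markup` and `replace_entities` are hypothetical names that mirror strip_tag() and multiple_replace() so the block runs on its own.

```python
import re

# Tiny stand-in for spec_dict; the real table is far larger
demo_dict = {'&#x2264;': '≤', '&#x3b1;': 'α'}

def strip_markup(text):
    # Keep the text of CDATA sections, drop every other tag
    return re.sub(r'<!\[CDATA\[(.*?)\]\]>|<.*?>',
                  lambda m: m.group(1) or '', text, flags=re.DOTALL)

def replace_entities(mapping, text):
    # One alternation of all escaped keys; each match is looked up in the mapping
    regex = re.compile("|".join(map(re.escape, mapping)))
    return regex.sub(lambda mo: mapping[mo.group(0)], text)

raw = '<p>&#x3b1; &#x2264; <![CDATA[1 < 2]]></p>'
without_tags = strip_markup(raw)
cleaned = replace_entities(demo_dict, without_tags)
print(cleaned)
```

The `<` inside the CDATA section survives because the CDATA alternative is tried first and consumes the whole section before the generic tag alternative can see it.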



### 3.3 Define a Grab class to parse content given specific tag
The idea behind the Grab class is to capture the content for a given tag name, e.g. Grab('kind') captures the block from <kind\> to </kind\> inclusive.

- 3.3.1 initiate the Grab object with a tag, e.g. 'kind'. The Grab object grabs the block from the beginning tag to the ending tag inclusive, e.g. <kind\> A1 </kind\>

- 3.3.2 Inside Grab we define a <i>search</i> method to find the first match and a <i>search_all</i> method to find all occurrences throughout the XML block.  

- 3.3.3 define <i>get_pattern</i> method
 * the attributes written inside the opening tag are stored in the group <content_wotag\>, e.g. id="CLM-00001" num="00001"
 * everything between the opening and closing tags, including any nested tags, is stored in the group <content_with_tag\>, e.g. A1 for <kind\>A1</kind\>
 * We need to keep the nested tags in content_with_tag because some blocks contain tags that appear multiple times. For example, one block of XML can have multiple <b><claim-text\></b> tags
    
##### For example: 
* <b>grant_id.group('content_wotag')</b> will return: 
'lang="EN" dtd-version="v4.5 2014-04-03" file="US10362643-20190723.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20190709" date-publ="20190723"' 
* <b>grant_id.group('content_with_tag')</b> will return: 
'<us-bibliographic-data-grant\><publication-reference\><document-id\><country\>US</country><doc-number\>10362643</doc-number\><kind\>B2</kind\><date\>20190723</date\>....
* <b>no_claim.group('content_wotag')</b>: 'id="CLM-00001" num="00001"'
* <b>no_claim.group('content_with_tag')</b> : <claim-text\>1. A light emitting diode (LED) driver circuit, comprising:<claim-text\>a dimmer connected to an AC input;</claim-text\><claim-text\>a rectifier circuit that is configured to receive the AC input from the dimmer, and to rectify the AC input to generate an input voltage;</claim-text\>...

<b>Note</b>: This lets us see whether we can search deeper into the content when more tags appear. In the examples above, we can search no_claim.group('content_with_tag') again to find nested tags such as <claim-text\>. If content_with_tag contains no more tags, there are no more nested XML elements.  
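
The behaviour of the two named groups can be checked on a one-line snippet. The sketch below rebuilds the same regular expression that get_pattern constructs; `build_pattern` is a hypothetical stand-alone function used only so the block runs without the class, and the snippet is a shortened, made-up claim.

```python
import re

def build_pattern(tag):
    # Attributes of the opening tag go to content_wotag,
    # everything up to the closing tag goes to content_with_tag
    re_string = r"<{0}(?:\s*|\s+(?P<content_wotag>[^&<]*))"
    re_string += r"(?:>(?P<content_with_tag>.*?)</{0}>|/>)"
    return re.compile(re_string.format(tag), re.DOTALL)

snippet = '<claim id="CLM-00001" num="00001"><claim-text>1. A device.</claim-text></claim>'

claim = build_pattern('claim').search(snippet)
attributes = claim.group('content_wotag')
inner = claim.group('content_with_tag')

# Searching the inner content again reaches the nested <claim-text> element
nested = build_pattern('claim-text').search(inner)
nested_text = nested.group('content_with_tag')
nested_attributes = nested.group('content_wotag')  # None: the tag has no attributes

print(attributes)
print(inner)
print(nested_text)
```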

In [5]:
class Grab:
    
    """
    Grab all the content in txt/xml file in the format of '<....>'   
    """

    def __init__(self, tag):
        
        """
        tag : text pattern eg: kind
        """

        self.tag = tag
        self.pattern = self.get_pattern(tag)
    
    def search_all(self, text):
        """
        lazily yield every match of the tag pattern in text
        """
        # finditer scans the whole text and returns the matches one by one
        yield from self.pattern.finditer(text)

    def search(self, text):
        """
        thin wrapper around the compiled pattern's search method, to shorten the calling code
        """
        return self.pattern.search(text) # search finds the first match, or None

    @classmethod
    def get_pattern(cls, tag):
        
        """
        Return a regular expression object for parsing XML by a given tag name.
        ----------
        tag : text pattern eg: kind
        
        Returns
        -------
        o : regex object
            A compiled Python regex object
            
        """
        
        re_string = r"<{0}(?:\s*|\s+(?P<content_wotag>[^&<]*))"
        
        # (?:....) A non-capturing version of regular parentheses. 
        # Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
        
        # (?P<name>...) re syntax that creates a named group. This group can be referenced later by the name
        # content_wotag captures the attributes written inside the opening tag, if there are any

        re_string += r"(?:>(?P<content_with_tag>.*?)</{0}>|/>)"
        # content_with_tag captures everything between the opening and closing tags, nested tags included
        # the alternative '/>' matches a self-closing tag, which has no content

        return re.compile(
            re_string.format(tag), # source
            re.DOTALL # mode
        )

### 3.4 Functions to get the content corresponding to its dataframe columns.

Use Grab class to produce <b>re object</b> with tags as its pattern, then parse through the input file and find patterns that match each corresponding column in the dataframe and store the cleaned content. Grab object will return a block of particular tags defined by the input arguments. 

##### Break downs:

3.4.1 <i>get_grantid</i>():
  * initiate a Grab object then set pattern as 'us-patent-grant'
  * search the pattern of 'file="' in the Grab object, and capture the following ten alphanumeric characters. eg:US10357643
  * the pattern for grantid is file=" followed by 10 word characters. Hence the re expression is r'file="\w{10}'


3.4.2 <i>get_invention_title()</i>:
  * initiate a Grab object then set pattern as 'invention-title'
  * search the pattern of 'invention-title' in the Grab object
  * take the content_with_tag group of the matching object and return it. Passing the content through the clean() function to translate special characters is optional; either way will work in this case.
  

3.4.3 <i>get_no_claim()</i>:
  * initiate a Grab object then set pattern as 'number-of-claims'
  * search the pattern of 'number-of-claims' in the Grab object
  * group the matching object with content_with_tag and return the content 


3.4.4 <i>get_fullname_list()</i>:
  * initiate a Grab object then set pattern as 'inventors'
  * search the pattern of 'inventors' in the Grab object
  * these two steps are to locate the inventors section
  * next we find the pattern of 'last-name' and 'first-name' then store them in a full_name list then concatenate them
   * we found that there are usually more than one inventor, so we use for loop and <i>search_all</i> to collect all the inventor names.



3.4.5 <i>get_cited_by_applicant()</i>:
  * initiate a Grab object then set pattern as 'us-patent-grant'
  * search the pattern of 'us-patent-grant' in the Grab object
  * these two steps are to locate the us-patent-grant section  
  * next we found that the text 'cited by applicant' lies between the 'category' tags
  * use control condition and a counter to calculate the count of the matching citations.


3.4.6 <i>get_cited_by_examiner()</i>:
  * initiate a Grab object then set pattern as 'us-patent-grant'
  * search the pattern of 'us-patent-grant' in the Grab object
  * these two steps are to locate the us-patent-grant section  
  * next we found that the text 'cited by examiner' lies between the 'category' tags
  * use control condition and a counter to calculate the count of the matching citations.


3.4.7 <i>get_kind()</i>:
  * while inspecting the sample input and sample output, we found four kinds of 'patent kind'. To make sure that we are not hard coding them, we sought advice from Bruce Chen and got approval to refer to the PADX document from the USPTO
  * initiate a Grab object then set pattern as 'us-patent-grant'
  * search the pattern of 'us-patent-grant' in the Grab object
  * create another Grab object, kind_re, and set pattern as 'kind'
  * search the pattern of 'kind' within the 'us-patent-grant' block
  * use the matched kind code as the key into the type_kind dictionary, which stores the patent-kind information
  

3.4.8 <i>get_claim_list() </i>:
  * initiate a Grab object then set pattern as 'us-patent-grant'
  * search the pattern of 'us-patent-grant' in the Grab object
  * create another Grab object, claim_text, and set pattern as 'claim-text'
  * search the object that matches the content within 'claim-text' tags.
  * use a for loop to concatenate all claim texts into claim_string.
   * we found that some claim texts are long and are separated by the <claim-text\> tags although they are actually one claim.
   * we also observed that different claims start with a number followed by a dot, e.g. 1. (claim 1), 2. (claim 2). A continuation of the same claim does not start with a number followed by a dot.
     * so we use string manipulations to distinguish different claims.
     * we first append '#####' to each piece added to claim_string
     * after that, we substitute patterns like '#####\d' (the start of a different claim) with '#\$#\\\2', where \\\2 is a backreference to the second group, so the digit is kept unchanged.
     * next we substitute the remaining '#####' with '' in claim_string and split claim_string by '#\$#' to get a proper claim_list. Now we have the desired claim_text

3.4.9 <i>get_abstract()</i>:
  * initiate a Grab object then set pattern as 'us-patent-grant'
  * search the pattern of 'abstract' in the Grab object
  * if content with tags 'abstract' is not empty, use strip_tag function to get rid of the CDATA and return the content, else return 'NA'
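
The marker technique described in 3.4.8 can be tried in isolation. This is a minimal sketch under stated assumptions: the `fragments` list is made up and stands for the already tag-stripped contents of consecutive <claim-text\> matches, and `split_claims` is a hypothetical helper name.

```python
import re

def split_claims(fragments):
    # Append the '#####' marker after every fragment
    claim_string = ''.join(fragment + '#####' for fragment in fragments)
    # A marker followed by a digit starts a new claim: turn it into '#$#'
    # and keep the digit through the backreference \2
    claim_string = re.sub(r'(#####)(\d)', r'#$#\2', claim_string)
    # The remaining markers only join pieces of the same claim
    claim_string = claim_string.replace('#####', '')
    return [claim.strip() for claim in claim_string.split('#$#')]

# Made-up fragments: claim 1 is spread over three pieces, claim 2 is one piece
fragments = [
    '1. A device, comprising: ',
    'a sensor; and ',
    'a controller.',
    '2. The device of claim 1, wherein the sensor is optical.',
]

claims = split_claims(fragments)
print(len(claims))
for claim in claims:
    print(claim)
```

One limitation follows directly from the rule: a continuation piece that happens to begin with a digit would be treated as the start of a new claim.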

In [6]:
def get_grantid(temp_xml):
    
    '''
    Obtain grantid from temp_xml, temporary xml line
    '''
    
    grant_id_re = Grab("us-patent-grant")
    # set pattern as us-patent-grant
    
    grant_id = grant_id_re.search(temp_xml)
    # search the tag that contains pattern 'us-patent-grant'
   
    grant_no = ' ' # fallback value (a single space) returned when no id is found
    
    id_match = re.search(r'file="(\w{10})', grant_id.group('content_wotag'))
    # the patent id is the 10 characters after file=". \w matches letters and digits alike,
    # so the parentheses capture e.g. US10357643 from the value stored in content_wotag
    
    if id_match is not None: # if found
        grant_no = id_match.group(1)
        # .group(1) returns the captured 10 characters, without the leading file="
        
    return grant_no


def get_invention_title(temp_xml):
    '''
    Obtain invention title from temp_xml
    '''
    
    invention_title_re = Grab("invention-title")
    #set pattern as invention-title
    
    invention_title = invention_title_re.search(temp_xml)
    # search the pattern of invention-title
    
    invention_temp = invention_title.group('content_with_tag')
    # retrieve the content within tags contain 'invention-title'

    return invention_temp # the title text exactly as found between the tags

def get_no_claim(temp_xml) :
    '''
    Obtain number of claims
    '''
    
    no_claim_re = Grab("number-of-claims")
    no_claim = no_claim_re.search(temp_xml)
    no_claim_temp = no_claim.group('content_with_tag')
    
    return no_claim_temp
    
def get_fullname_list(temp_xml):
    '''
    Obtain the list of inventors' full names
    '''
    
    inventors_re = Grab("inventors")
    inventors = inventors_re.search(temp_xml)
    # locate the content within the 'inventors' tags
    
    if inventors is None:
        return '[NA]'
        # no inventors section was found
    
    inventors_block = inventors.group('content_with_tag')
    
    last_name_inventor_re = Grab('last-name')
    first_name_inventor_re = Grab('first-name')
    
    # use search_all() to loop through the content and find all first names
    fname_list = [fname.group('content_with_tag')
                  for fname in first_name_inventor_re.search_all(inventors_block)]
    
    # use search_all() to loop through the content and find all last names
    lname_list = [lname.group('content_with_tag')
                  for lname in last_name_inventor_re.search_all(inventors_block)]
    
    # concatenate each first name and last name with a space
    fullname_inventor = [fname + ' ' + lname for fname, lname in zip(fname_list, lname_list)]
    
    fullname_list = ",".join(fullname_inventor)
    
    if len(fullname_list) > 0:
        return '[' + fullname_list + ']'
    else:
        return '[NA]'

def get_cited_by_applicant(temp_xml):
    '''
    Obtain citations by applicant
    '''
    
    grant_id_re = Grab("us-patent-grant")
    grant_id = grant_id_re.search(temp_xml)
    # locate the content within tags contain 'us-patent-grant'
    
    citation_re = Grab("us-citation")
    #citation = citation_re.search(temp_xml)
    # retrieve the citation by searching pattern as 'us-citation'
    
    cited_by_applicant = 0 
    for citation in citation_re.search_all(grant_id.group("content_with_tag")):
        # loop through citations within tags that contain 'us-patent-grant'
        
        category_re = Grab('category') # narrow down the search area to between tags of <'category'>
        category = category_re.search(citation.group('content_with_tag'))
        if  category.group('content_with_tag') == 'cited by applicant':
            cited_by_applicant+=1
            # increment the count by 1 if 'cited by applicant' matches the content within tags of 'category'
            
    return cited_by_applicant

def get_cited_by_examiner(temp_xml):
    '''
    Count the citations made by the examiner
    '''
    
    grant_id_re = Grab("us-patent-grant")
    grant_id = grant_id_re.search(temp_xml)
    # locate the content inside the 'us-patent-grant' block
    
    citation_re = Grab("us-citation")
    # pattern for each 'us-citation' block
    
    cited_by_examiner = 0
    for citation in citation_re.search_all(grant_id.group("content_with_tag")):
        # loop through the citations inside the 'us-patent-grant' block
        
        category_re = Grab('category') # narrow the search to the content of the 'category' tag
        category = category_re.search(citation.group('content_with_tag'))
        if category is not None and category.group('content_with_tag') == 'cited by examiner':
            cited_by_examiner+=1
            # increment the count when the category is 'cited by examiner';
            # a citation without a category is skipped instead of raising an error
            
    return cited_by_examiner

def get_kind(temp_xml):
    '''
    Obtain the kind of patent
    '''
    
    grant_id_re = Grab("us-patent-grant")
    grant_id = grant_id_re.search(temp_xml)
    # locate the content inside the 'us-patent-grant' block
    
    kind_re = Grab("kind")
    kind = kind_re.search(grant_id.group('content_with_tag'))
    # retrieve the kind code from the first 'kind' tag inside the 'us-patent-grant' block
    
    return type_kind[kind.group('content_with_tag')] # look up the descriptive string in the type_kind dictionary

def get_claim_list(temp_xml):
    '''
    Obtain the claim texts as one bracketed string
    '''
    
    grant_id_re = Grab("us-patent-grant")
    grant_id = grant_id_re.search(temp_xml)
    # locate the content inside the 'us-patent-grant' block
    
    claim_text_re = Grab("claim-text")
    # pattern for the content of each 'claim-text' tag
    
    claim_string = ''
    
    for claim in claim_text_re.search_all(grant_id.group("content_with_tag")):
        claim_string = claim_string + str(strip_tag(claim.group('content_with_tag'))) + '#####'
        # append '#####' after the content of each <claim-text> tag

    claim_string = re.sub(r'(#####)(\d)', r'#$#\2', claim_string)
    # where '#####' is followed by a digit (ex. #####2.), a new numbered claim starts:
    # replace '#####' with '#$#' and keep the digit unchanged (ex. #$#2.)
    
    claim_string = re.sub('#####','',claim_string)
    # remove the remaining '#####' markers, because the pieces they join belong to the same claim
    
    claim_list = claim_string.split('#$#')
    # split claim_string on '#$#', which marks the boundary between two claims
        
    result_claim_list = [claim.strip() for claim in claim_list] # remove leading and trailing spaces

    claim_string = ",".join(str(claim) for claim in result_claim_list)
    # join the claims with ',' as Sample_output.txt suggests
    
    if claim_string != '':
        return '[' + claim_string + ']' # make it look the same as Sample_output.txt
    else:
        return '[NA]' # no claim text found

def get_abstract(temp_xml):
    '''
    Obtain abstract
    '''
    
    abstract_re = Grab("abstract") # pattern for the content of the 'abstract' tag
    abstract_match = abstract_re.search(temp_xml) # search once and reuse the result
    # if an 'abstract' block exists, use strip_tag to get rid of the CDATA and return the content;
    # otherwise the abstract is 'NA'
    abstract = strip_tag(abstract_match.group('content_with_tag')) if abstract_match is not None else 'NA'
    
    return abstract
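
The two citation counters above differ only in the category string they compare against, so they could share one helper. The sketch below illustrates that idea with plain `re` patterns rather than the notebook's `Grab` class. The names `count_citations`, `CITATION_RE` and `CATEGORY_RE` are hypothetical, and the sample XML is made up for the demonstration.

```python
import re

# Hypothetical stand-ins for the notebook's Grab helper: plain regexes over raw XML.
CITATION_RE = re.compile(r"<us-citation>(?P<body>.*?)</us-citation>", re.DOTALL)
CATEGORY_RE = re.compile(r"<category>(?P<category>.*?)</category>", re.DOTALL)


def count_citations(xml_text, category):
    """Count <us-citation> blocks whose <category> text equals `category`."""
    count = 0
    for citation in CITATION_RE.finditer(xml_text):
        match = CATEGORY_RE.search(citation.group("body"))
        # A citation without a <category> tag is skipped rather than raising.
        if match is not None and match.group("category").strip() == category:
            count += 1
    return count


# Made-up sample: two applicant citations and one examiner citation.
sample = (
    "<us-citation><category>cited by applicant</category></us-citation>"
    "<us-citation><category>cited by examiner</category></us-citation>"
    "<us-citation><category>cited by applicant</category></us-citation>"
)
print(count_citations(sample, "cited by applicant"))
print(count_citations(sample, "cited by examiner"))
```

With a helper like this, each of the two counters would reduce to a single call with a different category string.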

### 3.5 Define a function to build the dataframe and store the extracted contents in the corresponding columns

Use pandas to create a dataframe whose columns are named as in Sample_output.csv, that is, this list:

['grant_id',
 'patent_title',
 'kind',
 'number_of_claims',
 'inventors',
 'citations_applicant_count',
 'citations_examiner_count',
 'claims_text',
 'abstract']

In [7]:
def create_data_frame(inputFile):
    '''
    Create dataframe using pandas
    '''
    
    colnames =['xml_raw','grant_id',
                 'patent_title',
                 'kind',
                 'number_of_claims',
                 'inventors',
                 'citations_applicant_count',
                 'citations_examiner_count',
                 'claims_text',
                 'abstract'] #columns names

    df = pd.DataFrame(columns=colnames) #initiate a dataframe

    # Get content from inputFile
    with open(inputFile, 'r', encoding='utf-8') as f: 
        content = f.readlines() # read all lines; the with block closes the file automatically
    
    # each patent block starts with a line containing the string 'xml version'
    beginning_tag = 'xml version'
    xml_list = [''] # list seeded with one empty placeholder string
    string = '' # accumulates the lines of the current patent block
    
    # walk through the lines and append each completed patent block to xml_list
    for line in content:   
        if beginning_tag in line:
            xml_list.append(clean(string))
            # the clean function is defined in the section above
            string = '' # reset string for the next block
        else:
            string+=line.replace('\n','') # drop the newline character (\n) and accumulate the line in string
    xml_list.append(clean(string)) # append the last patent block to xml_list
    xml_list = xml_list[2:] # skip the first two entries: the '' placeholder and the empty block collected before the first 'xml version' line
    
    df['xml_raw'] = xml_list
    # store the content in column 'xml_raw' of the dataframe
    
    # use the functions defined above to assign data to the corresponding columns.
    # For example, the grant_id column contains the grant IDs.
    df.grant_id = df.xml_raw.apply(get_grantid)
    df.patent_title = df.xml_raw.apply(get_invention_title)
    df.kind = df.xml_raw.apply(get_kind)
    df.number_of_claims = df.xml_raw.apply(get_no_claim)
    df.inventors = df.xml_raw.apply(get_fullname_list)
    df.citations_applicant_count= df.xml_raw.apply(get_cited_by_applicant)
    df.citations_examiner_count = df.xml_raw.apply(get_cited_by_examiner)
    df.claims_text = df.xml_raw.apply(get_claim_list)
    df.abstract = df.xml_raw.apply(get_abstract)
    
    df = df.drop("xml_raw", axis=1) 
    # drop column 'xml_raw'. Now df contains only the other named columns and their corresponding content
    # axis = 1 refers to columns, axis = 0 to rows
    
    return df
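
The splitting step inside `create_data_frame` can be tried on its own. The sketch below is a simplified illustration, not the notebook's exact code: `split_xml_documents` is a hypothetical name, it does not call the notebook's `clean` function, and it uses `rstrip` instead of `replace` to drop newlines. It starts a new document whenever a line contains the `xml version` marker, which avoids the placeholder entries that the notebook later slices off.

```python
def split_xml_documents(lines, marker="xml version"):
    """Group lines into documents; each document starts at a line containing `marker`."""
    documents = []
    current = None  # stays None until the first declaration line is seen
    for line in lines:
        if marker in line:
            if current is not None:
                documents.append("".join(current))
            current = []  # the declaration line itself is dropped, as in the notebook
        elif current is not None:
            current.append(line.rstrip("\n"))
    if current is not None:
        documents.append("".join(current))  # flush the last document
    return documents


# Made-up input: two tiny documents concatenated in one file.
sample_lines = [
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    "<us-patent-grant>\n",
    "<kind>B2</kind>\n",
    "</us-patent-grant>\n",
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    "<us-patent-grant>\n",
    "<kind>S1</kind>\n",
    "</us-patent-grant>\n",
]
docs = split_xml_documents(sample_lines)
print(len(docs))
print(docs[0])
```

Each returned string holds one patent block on a single line, which is the shape the extraction functions expect in the `xml_raw` column.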

### 3.6 Generate output files in CSV and JSON format
- 3.6.1 The CSV file is created with to_csv() from the pandas package.
- 3.6.2 The JSON file is created by writing a formatted string to a .json file. A for loop builds one entry per row from the values gathered in the pandas dataframe.

In [12]:
def produce_csv_json(inputFile):
    '''
    Produce csv file then generate JSON file
    '''
    
    df = create_data_frame(inputFile)
    
    df.to_csv('data_output.csv',index = False) # index = False avoids storing indices
    
    json_list = [] #initialize an empty json list
    
    for index,patent in df.iterrows():
        df_row = '"{}":{{"patent_title":"{}","kind":"{}","number_of_claims":{},"inventors":"{}","citations_applicant_count":{},"citations_examiner_count":{},"claims_text":"{}","abstract":"{}"}}'.\
              format(patent['grant_id'],
                     patent['patent_title'],
                     patent['kind'],
                     patent['number_of_claims'],
                     patent['inventors'],
                     patent['citations_applicant_count'],
                     patent['citations_examiner_count'],
                     patent['claims_text'],
                     patent['abstract'])
        json_list.append(df_row)
        # each entry has the pattern "grant_id":{"field":value,...}
        
    json_string = ','.join(json_list)
    json_string = '{' + json_string +'}'
    
    # the with block closes the file automatically after writing
    with open('data_json.json','w+', encoding = 'utf-8') as json_file:
        json_file.write(json_string)
    
    return    
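
One caveat of building the JSON text with `.format()` is that a double quote or backslash inside a title or abstract is written unescaped, which can make the file invalid JSON. The sketch below shows an alternative using the standard `json` module, which escapes such characters itself. It is only an illustration: the rows are made-up stand-ins for the dataframe rows, only a few fields are shown, and whether the resulting layout matches the expected sample output is not verified here.

```python
import json

# Made-up rows standing in for the dataframe rows; note the quotes in the title.
rows = [
    {"grant_id": "US0000001", "patent_title": 'Hinge with "smart" lock', "number_of_claims": 3},
    {"grant_id": "US0000002", "patent_title": "Plain widget", "number_of_claims": 1},
]

# Same overall shape as the notebook's output: {grant_id: {field: value, ...}, ...}
records = {
    row["grant_id"]: {
        "patent_title": row["patent_title"],
        "number_of_claims": int(row["number_of_claims"]),
    }
    for row in rows
}

# json.dumps escapes quotes and backslashes; ensure_ascii=False keeps non-ASCII text readable.
json_string = json.dumps(records, ensure_ascii=False)
print(json_string)

# Round-trip check: the text parses back to the same structure.
parsed = json.loads(json_string)
print(parsed["US0000001"]["patent_title"])
```

The round trip through `json.loads` is a quick way to confirm that a generated file is valid JSON.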

### 3.7 Putting it all together

In [None]:
InputFilePath = "data.txt" #change input as desired
produce_csv_json(InputFilePath)
# produce the output with csv and json format

## 4. Summary/Conclusion
The assignment tests text-file processing skills using regular expressions and pandas. The techniques used to achieve the objectives are:

* Data parsing and extraction using the re library.
 * re.compile() is used to build the patterns that locate the content we need.
 * re.sub() is used to substitute matching patterns, optionally reusing content captured in named groups such as (?P<name\>).
 * re.search() is useful for finding the first match of a pattern.
 * re.finditer() is useful for finding all matches of a pattern.
 * m.group() is used to retrieve the content of a group from a match object.
 * .join(map(re.escape, argument)) is useful when you want to match an arbitrary literal string that may contain special regular expression characters. For example: +, *, ?
 
* Export data using a pandas dataframe.
 * pd.DataFrame(columns = ...) creates an empty dataframe with the named columns.
 * apply() applies a function to every value of a column.

* String manipulation
 * .join() built-in method is useful when concatenating strings.
 * .format() built-in method is useful for substituting values into a template string (in this assignment mostly for building matching patterns)

## 5. References

_PyFormat: Using % and .format() for great good!_ Retrieved from https://pyformat.info/

Jay S. (2016, December 30) _'What does <![CDATA[]]> in xml mean?'_ Retrieved from
https://www.novixys.com/blog/what-does-cdata-in-xml-mean/

_re--Regular expression operations (2009)_ Retrieved from https://docs.python.org/3/library/re.html

_HTML ISO-8859-1 Reference_ Retrieved from https://www.w3schools.com/charsets/ref_utf_box.asp

United States Patent and Trademark Office. Retrieved from https://www.uspto.gov/
