**Oliver Seager<br>
o.j.seager@lse.ac.uk<br>
Python 3.9.7**

**Created:** 07/05/2023 <br>
**Last Modified:** 23/06/2023

This script extracts application years from the individual OCRs of 1926-1975 patents from Fleming, Greene, Li, Marx and Yao (2019).

**Infiles**: <br>
- ***xxxxxxx.txt*; x in {0,1,...,9}** - 1,653,786 individual OCRs of 1926-1975 patents, each of a single patent with number *xxxxxxx*, from Fleming, Greene, Li, Marx and Yao (2019).

**Outfiles**: <br>
- **014_FGLMY2675appYears.dta** - USPTO patents from 1926-1975 with their application years, as inferred from the Fleming, Greene, Li, Marx and Yao (2019) OCR.

**External Packages**
- `numpy` by Travis Oliphant
- `pandas` by Wes McKinney

### Preamble

In [None]:
# Import Packages

import pandas as pd
import numpy as np
import os
import re
from time import time


# Set Directory Shorthands

subsample_dir = "C:\\Users\\Ollie\\Dropbox\\State and innovation\\orig\\Fleming\\uspto.1926-1975\\"

### Get A List of the File Names for All Patents

Each patent is an individual file which can be read as a single-line *.txt*. We get the list of file names here.

In [None]:
allPatents = os.listdir(subsample_dir)

### Initiate the DataFrame for Export

This is the DataFrame we'll write to the final *.dta*. Its variables are:
- **patent_id** is just the USPTO patent number
- **appDate** is the year in which the application was filed according to the parse (0 if no four-digit sequence is found)
- **foundFiled** is an indicator for whether the word "Filed" was found in the patent. Where it is found, the parse takes the first four-digit sequence after it as the application year, which (if the OCR is correct) appears shortly afterwards. A value of 0 generally indicates a poor OCR, so the application year found for that patent should be treated with scepticism.

In [None]:
df = pd.DataFrame(columns = ["patent_id", "appDate", "foundFiled"])

### Function for Putting Commas in Numbers

Borrowed from code I've written elsewhere. Just used for the time update.

In [None]:
def commafy(x):
    ## Put Commas into the Pre-decimal Portion of the Number ##
    xIntFloor = int(np.floor(x)) # 1.234 will return 1
    str_xIntFloor = str(xIntFloor) # Get string version of integer
    len_str = len(str_xIntFloor) # Length of integer in characters
    nr_commas = int(np.floor((len_str - 1)/3)) # Gets number of commas required
    if nr_commas > 0:
        strList = [] # Used to store the (max 3 character) strings that compose the number
        for i in range(nr_commas):
            if i == 0:
                strList = strList + [str_xIntFloor[-3:]] # For the first iteration, we just take the last three characters
            else:
                subStrEnd = (i)*-3 # Where the iteration's substring of str_xIntFloor will end (negative index)
                subStrStart = (i + 1)*-3 # Where the iteration's substring of str_xIntFloor will start (negative index)
                strList = strList + [str_xIntFloor[subStrStart:subStrEnd]] # Append three characters to list
        strList = strList + [str_xIntFloor[:(-3*nr_commas)]] # Finally, add the characters that precede the first comma
        strList.reverse() # Previous to this, the string "12345678" will produce list ["678","345","12"].
        comma_xIntFloor = ",".join(strList) # Join the list by commas
    else: # If no commas are required, we just return the integer string as it is
        comma_xIntFloor = str_xIntFloor
    ## Add on Decimals if Needed ##
    if x != xIntFloor:
        xDecStr = str(x).split(".")[1] # Extracts the portion after the decimal place
        comma_x = comma_xIntFloor + "." + xDecStr
    else:
        comma_x = comma_xIntFloor
    return comma_x
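
As a cross-check on `commafy`, Python's built-in format specification has a `,` grouping option which, for non-negative integers such as the counts printed in the time update, should give the same strings. The sketch below is not part of the original script, and `commafy_builtin` is a hypothetical name used only for illustration.

```python
def commafy_builtin(x):
    # Hypothetical helper (not in the original script):
    # the "," format option inserts thousands separators
    return f"{x:,}"

# A few integer examples of the kind used in the progress messages
examples = [commafy_builtin(n) for n in (999, 10000, 1653786)]
print(examples)
```

This only illustrates the integer case; the hand-written function above also carries decimals through unchanged.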

### Parse Patent Documents

This loops through each patent, taking the first four-digit sequence \[after the word "Filed", if found\] as the application year.
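
The rule can be tried on a single string before running it over every file. The following is a minimal sketch under stated assumptions: `first_year_after_filed` is a hypothetical helper name, and the sample OCR text is made up for illustration rather than taken from the data.

```python
import re

def first_year_after_filed(ocr):
    # Hypothetical helper mirroring the rule described above
    ocr_upper = ocr.replace(" ", "").upper()  # strip spaces, upper-case
    pos = ocr_upper.find("FILED")
    found_filed = pos > -1
    # Skip past "FILED" (5 characters) when present; otherwise search the whole string
    trimmed = ocr_upper[pos + 5:] if found_filed else ocr_upper
    match = re.search(r"\d{4}", trimmed)  # first run of four digits
    app_year = match.group(0) if match else 0  # 0 when no such run exists
    return app_year, found_filed

# Made-up OCR text, for illustration only
sample = ("Patented Mar. 3, 1931 UNITED STATES PATENT OFFICE "
          "Application filed June 12, 1928. Serial No. 284,911.")
print(first_year_after_filed(sample))
```

Because spaces are stripped before the search, a date whose comma was lost in the OCR (for example "June 12 1928") collapses to "JUNE121928" and would yield "1219" rather than "1928". This is one more reason to treat the parsed years with some caution.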

In [None]:
## Initiations ##

# We generate a new carrier DataFrame every 10,000 observations, as the time taken to append a single row
# to a DataFrame is increasing in the length of the DataFrame
sub_df = pd.DataFrame(columns = ["patent_id", "appDate", "foundFiled"])

i = 0 # Just a simple counter

length = len(allPatents) # The total number of patents to be parsed

time1 = time() # Initiate the time

for patent in allPatents: # Iterates through all patents
        
    with open(subsample_dir + patent) as f: # Opens the patent's file as *.txt*
        
        ocr = f.readlines()[0] # Gets the string that constitutes the whole file
        
        ocrUpper = ocr.replace(" ", "").upper() # Gets characters to upper case and removes spaces for ease of parsing
        
        foundFiled = (ocrUpper.find("FILED") > -1) # Indicates whether the word "Filed" is found
        
        ocrTrimmed = ocrUpper[(ocrUpper.find("FILED") + 1 + 4*foundFiled):] # Isolates string after the word "FILED"
                                                                           # appears (if it appears)
        
        try:
        
            appDate = re.search(r"\d{4}", ocrTrimmed).group(0) # Try to find the first four-digit sequence
        
        except AttributeError: # re.search returns None when there is no match, so .group raises AttributeError
            
            appDate = 0 # Return year 0 if no four-digit sequence is found
        
        appendDict = {
            "patent_id":patent[:-4], # the [:-4] removes the ".txt" from the filename, leaving just the patent number
            "appDate":appDate, 
            "foundFiled":foundFiled
        }
        
        sub_df = pd.concat([sub_df, pd.DataFrame([appendDict])], ignore_index = True) # Append data to the sub DataFrame
                                                                                      # (DataFrame.append was removed in pandas 2.0)
        
        i = i + 1 # Update the counter
        
        if i % 10000 == 0: # Executes every 10,000 patents
            
            df = pd.concat([df, sub_df]) # Adds the DataFrame containing the last 10,000 patents to the main DataFrame
            
            sub_df = pd.DataFrame(columns = ["patent_id", "appDate", "foundFiled"]) # Initiates a new DataFrame
            
            time2 = time() # Time after 10,000
            
            last10k = time2 - time1 # Time taken for the last 10,000, in seconds
            
            m = int(np.floor(last10k/60)) # Minutes digit(s)
            
            s = int(last10k - np.floor(last10k/60)*60) # Second digit(s)
            
            print(f"{commafy(i)} of {commafy(length)} patents complete")
            
            print(f"Last 10,000 patents took {m}m{s}s")
            
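            time1 = time() # Restart the clock

df = pd.concat([df, sub_df]) # After the loop, adds the patents parsed since the last multiple of 10,000, which
                             # would otherwise be dropped as the total is not a multiple of 10,000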
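
The 10,000-row carrier DataFrame works around the growing cost of appending rows one at a time. An alternative, sketched here on the assumption that all parsed rows fit comfortably in memory, is to collect plain dictionaries in a list and build the DataFrame once at the end. The names `fake_results`, `rows` and `df_sketch` are hypothetical, and the patent numbers and years are made up.

```python
import pandas as pd

# Made-up parse results standing in for the output of the loop (illustrative only)
fake_results = [("1700001", "1928", True),
                ("1700002", "1930", True),
                ("1700003", 0, False)]

rows = []  # appending to a Python list stays cheap however long the list gets
for patent_id, app_date, found_filed in fake_results:
    rows.append({"patent_id": patent_id,
                 "appDate": app_date,
                 "foundFiled": found_filed})

# Build the DataFrame in a single step once every row has been collected
df_sketch = pd.DataFrame(rows, columns=["patent_id", "appDate", "foundFiled"])
print(len(df_sketch))
```

With this pattern there is no batch left over at the end of the loop that needs to be flushed separately.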

### Export to *.dta* via *.csv*

Pandas has an easy time reading *.dta* but is sometimes stubborn about exporting to it, so to save time we export to *.csv*, reimport, and then export to Stata.

In [None]:
# Export to .csv

df.to_csv("C:\\Users\\Ollie\\Dropbox\\State and innovation\\data\\Fleming\\appYears.csv")

# Import from .csv

df = pd.read_csv("C:\\Users\\Ollie\\Dropbox\\State and innovation\\data\\Fleming\\appYears.csv").drop(
    columns = "Unnamed: 0").set_index("patent_id")

# Export to .dta

df.to_stata("C:\\Users\\Ollie\\Dropbox\\State and innovation\\data\\014_FGLMY2675appYears.dta")

# Erase .csv file

os.remove("C:\\Users\\Ollie\\Dropbox\\State and innovation\\data\\Fleming\\appYears.csv")
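
The `Unnamed: 0` column dropped on reimport comes from the DataFrame index, which `to_csv` writes by default. A small sketch on a made-up toy DataFrame, using an in-memory buffer instead of a file on disk, shows the difference that `index=False` makes; `toy` and the buffer names are hypothetical.

```python
import io
import pandas as pd

# Made-up toy data, for illustration only
toy = pd.DataFrame({"patent_id": ["1700001", "1700002"],
                    "appDate": [1928, 1930]})

# Default behaviour: the index is written as an unnamed first column
buf_default = io.StringIO()
toy.to_csv(buf_default)
buf_default.seek(0)
cols_default = list(pd.read_csv(buf_default).columns)

# With index=False the index is not written at all
buf_no_index = io.StringIO()
toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = list(pd.read_csv(buf_no_index).columns)

print(cols_default)
print(cols_no_index)
```

Either route gives the same data on reimport; writing without the index simply removes the need for the later `drop`.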