## API key

### Get one

To run this code, you need an API key from OpenAI. This involves giving them your credit card and setting up spending limits.

### Using it

I run this file locally via JupyterLab, so it sits in a folder with `gpt_api.txt`, which contains my API key.

To run this file in Google Colab, you _could_ directly type your API key into the notebook below, **but this is a bad idea.** 

Instead, one common way is to store the API key in a file on your Google Drive and then access it from the Colab notebook. Here's how you can do it:

1.    Create a new text file on your Google Drive and store your API key in it. Name the file something like `gpt_api.txt`.
1.    Mount your Google Drive to the Google Colab notebook by running the following code block.
    ```python
    import openai
    from google.colab import drive
    drive.mount('/content/drive')
    # files at the top level of your Drive are mounted under MyDrive/
    with open('/content/drive/MyDrive/gpt_api.txt', 'r') as f:
        openai.api_key = f.read().strip()
    ```
1.    This will prompt you to click on a link to authorize the connection. Follow the instructions, and copy the authorization code into the input box that appears in the Colab notebook. You can then continue.
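
If you switch between running locally and in Colab, a small helper can keep the key lookup in one place. The sketch below is only an illustration: the function name `load_api_key` and the environment variable name `OPENAI_API_KEY` are assumptions, not part of the original notebook. It checks the environment first and falls back to the text file.

```python
import os
from pathlib import Path


def load_api_key(path="gpt_api.txt", env_var="OPENAI_API_KEY"):
    """Return the API key from an environment variable, else from a text file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    key_file = Path(path)
    if not key_file.exists():
        raise FileNotFoundError(f"No key in ${env_var} and no file at {key_file}")
    # strip() removes the trailing newline most editors add
    return key_file.read_text().strip()
```

Locally this could be used as `openai.api_key = load_api_key()`; in Colab, pass the path to the key file on the mounted Drive instead.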

In [1]:
# !pip install openai 

In [2]:
import openai

# Don't type the key in this file! Read it from a file that is listed in .gitignore, from GitHub secrets, or from your Google Drive.

with open('gpt_api.txt', 'r') as f:
    openai.api_key = f.read().strip()

## Load firms to look for

In [3]:
import pandas as pd

firms = pd.read_csv('inputs/compu_cust_random100_new.csv')
firms['cik'] = firms['cik'].astype(str).str.zfill(10)
firms

Unnamed: 0,gvkey,conm,cik,nobs
0,176768,BRIDGELINE DIGITAL INC,0001378590,7
1,19998,MEDPACE HOLDINGS INC,0001668397,22
2,4340,EMULEX CORP,0000350917,56
3,20129,MATERIALISE NV -ADR,0001091223,66
4,22653,VISTA OUTDOOR INC,0001616318,9
...,...,...,...,...
95,66726,APPLIED NEUROSOLUTIONS INC,0000872947,3
96,63489,INTERLINK COMPUTER SCIENCES,0000745597,1
97,9799,SOLITRON DEVICES INC,0000091668,39
98,30172,LXR BIOTECHNOLOGY INC,0000899504,2


In [106]:
firms.nobs.sum()

1832

## Load the filings of those firms.

Also, add the count of E10s to the firms data. 

These are all coded as either asset purchase or purchase agreement.

Some should be removed: "stock purchase".

In [4]:
# load and reduce to our firms
filings = pd.read_stata('inputs/EX10_10K_1997_2022_purchase.dta')
filings = filings.merge(firms['cik'])

In [107]:
filings

Unnamed: 0,cik,fdate,form,coname,wrdsfname,fname,fsize,rdate,secadate,secatime,sequence,type,description,filename,accession,removed,asset_purchase,purchase_agreement,path
0,0000016058,2003-09-29,10-K,CACI INTERNATIONAL INC /DE/,000001/16058/0001193125-03-055255.txt,edgar/data/16058/0001193125-03-055255.txt,2448497.0,2003-06-30,2003-09-29,1960-01-01 14:26:27,3.0,EX-10.12,EXHIBIT 10.12,dex1012.htm,0001193125-03-055255,,1,1,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/16058/000119312503055255/dex1012.htm
1,0000016058,2003-09-29,10-K,CACI INTERNATIONAL INC /DE/,000001/16058/0001193125-03-055255.txt,edgar/data/16058/0001193125-03-055255.txt,2448497.0,2003-06-30,2003-09-29,1960-01-01 14:26:27,5.0,EX-10.14,EXHIBIT 10.14,dex1014.htm,0001193125-03-055255,,0,1,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/16058/000119312503055255/dex1014.htm
2,0000016058,2003-09-29,10-K,CACI INTERNATIONAL INC /DE/,000001/16058/0001193125-03-055255.txt,edgar/data/16058/0001193125-03-055255.txt,2448497.0,2003-06-30,2003-09-29,1960-01-01 14:26:27,6.0,EX-10.15,EXHIBIT 10.15,dex1015.htm,0001193125-03-055255,,1,1,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/16058/000119312503055255/dex1015.htm
3,0000016058,2004-09-13,10-K,CACI INTERNATIONAL INC /DE/,000001/16058/0001193125-04-155541.txt,edgar/data/16058/0001193125-04-155541.txt,2830739.0,2004-06-30,2004-09-13,1960-01-01 14:08:14,2.0,EX-10.18,EXHIBIT 10.18,dex1018.htm,0001193125-04-155541,,0,1,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/16058/000119312504155541/dex1018.htm
4,0000016058,2004-09-13,10-K,CACI INTERNATIONAL INC /DE/,000001/16058/0001193125-04-155541.txt,edgar/data/16058/0001193125-04-155541.txt,2830739.0,2004-06-30,2004-09-13,1960-01-01 14:08:14,3.0,EX-10.19,EXHIBIT 10.19,dex1019.htm,0001193125-04-155541,,0,1,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/16058/000119312504155541/dex1019.htm
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84,0001828182,2021-03-25,10-K,"Signify Health, Inc.",000182/1828182/0001828182-21-000006.txt,edgar/data/1828182/0001828182-21-000006.txt,,2020-12-31,2021-03-25,1960-01-01 16:40:57,3.0,EX-10.4,EX-10.4,exhibit104.htm,0001828182-21-000006,,0,1,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/1828182/000182818221000006/exhibit104.htm
85,0001828182,2021-03-25,10-K,"Signify Health, Inc.",000182/1828182/0001828182-21-000006.txt,edgar/data/1828182/0001828182-21-000006.txt,,2020-12-31,2021-03-25,1960-01-01 16:40:57,6.0,EX-10.9,EX-10.9,exhibit109.htm,0001828182-21-000006,,0,1,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/1828182/000182818221000006/exhibit109.htm
86,0001828182,2022-03-03,10-K,"Signify Health, Inc.",000182/1828182/0001828182-22-000018.txt,edgar/data/1828182/0001828182-22-000018.txt,2983571.0,2021-12-31,2022-03-03,1960-01-01 17:03:32,2.0,EX-10.9,EX-10.9,exhibit109123121.htm,0001828182-22-000018,N,0,1,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/1828182/000182818222000018/exhibit109123121.htm
87,0001828182,2022-03-03,10-K,"Signify Health, Inc.",000182/1828182/0001828182-22-000018.txt,edgar/data/1828182/0001828182-22-000018.txt,2983571.0,2021-12-31,2022-03-03,1960-01-01 17:03:32,3.0,EX-10.10,EX-10.10,exhibit1010123121.htm,0001828182-22-000018,N,0,1,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/1828182/000182818222000018/exhibit1010123121.htm


In [5]:
# add E10 count var to firms df
# (
#     filings.merge(firms['cik'])
#     .groupby('cik',as_index=False)
#     ['fname'].count()
#     .rename(columns={'fname':'E10_count'})
#     .merge(firms,how='right')
#     .assign(E10_count=lambda x: x['E10_count'].fillna(0).astype(int)>0)
# )['E10_count'].sum()

28
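
The commented-out chain above is hard to follow, so here is a minimal sketch of the same idea on toy data: count exhibits per `cik`, then attach the count to the firm list, filling 0 for firms with no exhibits. The toy frames and the helper name `add_e10_count` are made up for illustration; only the column names `cik` and `fname` come from the data above.

```python
import pandas as pd


def add_e10_count(firms, filings):
    """Return a copy of firms with an integer E10_count column (0 if no filings)."""
    counts = (
        filings.groupby("cik")["fname"]
        .count()
        .rename("E10_count")
        .reset_index()
    )
    out = firms.merge(counts, on="cik", how="left")
    out["E10_count"] = out["E10_count"].fillna(0).astype(int)
    return out


# toy inputs, not the real data
toy_firms = pd.DataFrame({"cik": ["0000000001", "0000000002", "0000000003"]})
toy_filings = pd.DataFrame(
    {
        "cik": ["0000000001", "0000000001", "0000000003"],
        "fname": ["a.txt", "b.txt", "c.txt"],
    }
)
toy_counts = add_e10_count(toy_firms, toy_filings)

# number of firms with at least one exhibit
firms_with_e10 = int((toy_counts["E10_count"] > 0).sum())
```

The last line counts firms with at least one exhibit, which appears to be what the commented-out cell was reporting.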

## Define key functions to do the lift

Read this

https://platform.openai.com/docs/guides/chat/introduction


In [6]:
# ask_openai() in the next cell uses text-davinci-002, not the super cheap gpt-3.5-turbo

# the cheaper option is something like this:
openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "user", "content": "Where was it played?"}
    ]
)

<OpenAIObject chat.completion id=chatcmpl-78qNuLEJRoDYJD8PLnilPe3SCxxTs at 0x1f8935b4090> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The World Series in 2020 was won by the Los Angeles Dodgers. Due to the COVID-19 pandemic, the World Series was played entirely at Globe Life Field in Arlington, Texas.",
        "role": "assistant"
      }
    }
  ],
  "created": 1682342346,
  "id": "chatcmpl-78qNuLEJRoDYJD8PLnilPe3SCxxTs",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 38,
    "prompt_tokens": 28,
    "total_tokens": 66
  }
}

In [7]:
# gpt 4.0 wrote this mostly

import os
import glob

import numpy as np
import pandas as pd
from IPython.display import (  # used during dev - display(Markdown(markdown_table)) prints nice
    Markdown,
    display,
)
from tqdm import tqdm
from bs4 import BeautifulSoup

# Set Pandas display options to show full string
pd.set_option("display.max_colwidth", None)

def ask_openai(question, data):
    prompt = f"{data}\n---\n{question}"
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=70,
        n=1,
        stop=None,
        temperature=0.5,
    )
    return response.choices[0].text.strip()

def parse_file(filename):

    # Define the question to ask about the purchase agreement
    question = "Output a tab separated list containing two items: the name of the buyer, and the name of the seller."

    # remove the html
    with open(filename, "r") as fp:
        raw = BeautifulSoup(fp.read(), 'html.parser').get_text()

    return ask_openai(question, raw[:1850])
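
Since `ask_openai` goes through the completion endpoint, here is a hedged sketch of how the same prompt could be sent through the chat endpoint instead. The names `build_messages` and `ask_chat` are hypothetical, and the API call is passed in as `create_fn` so the logic can be checked without spending tokens; in this notebook that argument would be `openai.ChatCompletion.create`.

```python
def build_messages(question, data, max_chars=1850):
    """Put the (truncated) document and the question into one user message."""
    return [{"role": "user", "content": f"{data[:max_chars]}\n---\n{question}"}]


def ask_chat(question, data, create_fn, model="gpt-3.5-turbo"):
    """Send the prompt via a chat-style create function and return the reply text."""
    response = create_fn(model=model, messages=build_messages(question, data))
    return response["choices"][0]["message"]["content"].strip()
```

Inside `parse_file`, the last line would then become something like `return ask_chat(question, raw, openai.ChatCompletion.create)`.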

## Loop over files and do the thing


In [None]:
# filings.query('purchase_agreement==1 | asset_purchase==1')
# filings.query('asset_purchase==1')
# path format: f'{cik, no leading zeros}/{accession, no dashes}/{filename}'

In [41]:
dirpath = r'C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files'

def create_path(row):
    return f"{dirpath}/{int(row['cik'])}/{row['accession'].replace('-', '')}/{row['filename']}"

filings['path'] = filings.apply(create_path, axis=1)
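
As a quick check of the path format, the sketch below rebuilds the relative part of the path for the first row of `filings` shown earlier. The helper name `exhibit_relpath` is an assumption for illustration; the three values are copied from that row.

```python
def exhibit_relpath(cik, accession, filename):
    """Build '<cik without leading zeros>/<accession without dashes>/<filename>'."""
    return f"{int(cik)}/{accession.replace('-', '')}/{filename}"


example = exhibit_relpath("0000016058", "0001193125-03-055255", "dex1012.htm")
# example == '16058/000119312503055255/dex1012.htm'
```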

In [44]:
file_sentence_dict = {}
files = filings['path'].to_list()

for file in tqdm(files,total=len(files)):
    if os.path.exists(file):    
        file_sentence_dict.update({file: parse_file(file)}) #update the dictionary 
    else:
        print("No file: ",file)

100%|██████████████████████████████████████████████████████████████████████████████████| 89/89 [02:13<00:00,  1.50s/it]


## Examine output

In [46]:
file_sentence_dict[r'C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/16058/000119312503055255/dex1012.htm']

'Condor\nCACI'

In [60]:
import re 

In [104]:
delimiters = [r'\n\n', r'\n', r' – ', r' - ']

def col_split(row):
    for delimiter in delimiters:
        split_col = fr"{row['buyer_seller']}".split(delimiter)
        if len(split_col) == 2:
            return split_col
    return ['', '']

df = pd.DataFrame(file_sentence_dict.items(), columns=['document', 'buyer_seller'])
df[['buy','sell']] =  df.apply(col_split, axis = 1,result_type='expand')

df
# ['buyer_seller'].str.split('\t', expand=True)
# df = df.drop('buyer_seller', axis=1)
df

Unnamed: 0,document,buyer_seller,buy,sell
0,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/16058/000119312503055255/dex1012.htm,Condor\nCACI,,
1,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/16058/000119312503055255/dex1014.htm,"CACI INTERNATIONAL INC CACI, INC. – FEDERAL APPLIED TECHNOLOGY SOLUTIONS OF NORTHERN VA, INC.","CACI INTERNATIONAL INC CACI, INC.","FEDERAL APPLIED TECHNOLOGY SOLUTIONS OF NORTHERN VA, INC."
2,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/16058/000119312503055255/dex1015.htm,"CACI International Inc, CACI, Inc. - Federal\n\nPremier Technology, Inc, Premier Technology Group, Inc.","CACI International Inc, CACI, Inc.","Federal\n\nPremier Technology, Inc, Premier Technology Group, Inc."
3,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/16058/000119312504155541/dex1018.htm,"CACI INTERNATIONAL INC\nCACI, INC. - FEDERAL","CACI INTERNATIONAL INC\nCACI, INC.",FEDERAL
4,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/16058/000119312504155541/dex1019.htm,"CACI International Inc, C-CUBED Corporation",,
...,...,...,...,...
84,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/1828182/000182818221000006/exhibit104.htm,"Cure TopCo, LLC\nNew Remedy Corp.\nNew Mountain Partners V (AIV-C), L.P.\nCure Aggregator, LLC\nTTCP Executive Fund - CA, LLC\nHV Special Situations Fund L.P. (UAW)\nTHV COH Blocker Corp","Cure TopCo, LLC\nNew Remedy Corp.\nNew Mountain Partners V (AIV-C), L.P.\nCure Aggregator, LLC\nTTCP Executive Fund","CA, LLC\nHV Special Situations Fund L.P. (UAW)\nTHV COH Blocker Corp"
85,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/1828182/000182818221000006/exhibit109.htm,"Signify Health, Inc.\nNew Remedy Corp.",,
86,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/1828182/000182818222000018/exhibit109123121.htm,"Cure Aggregator, LLC, Cure TopCo, LLC, and [●]",,
87,C:\Users\DonsLaptop\Dropbox\SECExhibits\data\Raw_Exhibit_Files/1828182/000182818222000018/exhibit1010123121.htm,"Signify Health, Inc.\nNew Remedy Corp.",,


In [103]:
# re.split(delimiter_pattern,'CACI INTERNATIONAL INC CACI, INC. – FEDERAL APPLIED TECHNOLOGY SOLUTIONS OF NORTHERN VA, INC.')

delimiters = [r'\n\n', r'\n', r' – ', r' - ']

def col_split(string):
    for delimiter in delimiters:
        split_col = string.split(delimiter)
        if len(split_col) == 2:
            return split_col
    return ['', '']

col_split(r'Condor\nCACI') # raw-string input, so its literal backslash-n matches r'\n'

['Condor', 'CACI']

In [105]:
df.loc[0,'buyer_seller']

'Condor\nCACI'
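
A likely reason the first row of `df` comes back empty: `r'\n'` is a raw string, so it is a backslash followed by `n` (two characters), while the model's reply contains a real newline character. The call `col_split(r'Condor\nCACI')` succeeds only because its input is a raw string too. Here is a minimal sketch of a fix, using ordinary (non-raw) delimiters and trying the tab that the prompt asked for first. The function name `split_buyer_seller` is made up here.

```python
# ordinary string literals, so "\n" is a real newline and "\t" a real tab
DELIMITERS = ["\t", "\n\n", "\n", " – ", " - "]


def split_buyer_seller(text):
    """Split on the first delimiter that yields exactly two parts."""
    for delimiter in DELIMITERS:
        parts = text.split(delimiter)
        if len(parts) == 2:
            return [part.strip() for part in parts]
    # no delimiter produced exactly two pieces
    return ["", ""]
```

It could be applied the same way as before, for example `df.apply(lambda row: split_buyer_seller(row['buyer_seller']), axis=1, result_type='expand')`. Replies that list more than two parties would still come back empty and need a manual look.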

## Fermi estimate of the project cost

Price: $0.002 per 1k tokens for `gpt-3.5-turbo`, counting the prompt and the reply together.

So

Cost = # docs * # tokens (prompt + reply) per doc * 0.002/1000

The reply to the sample contract was 10 tokens:

In [22]:
# !pip install --upgrade tiktoken

In [23]:
# open AI's tokenizer

import tiktoken
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
sent = 'Baxter Healthcare Corporation\tCFC International, Inc.'
len(encoding.encode(sent))


10

We have 0.25 million docs. 

## THE ESTIMATE, IN DOLLARS

In [58]:
files = 250000
toks_per = 400
cost_per_tok = 0.002/1000

files*toks_per*cost_per_tok

200.0

I can't even believe that.
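
To make the arithmetic reusable, here is a small sketch of the estimate as a function. The function name is made up, and the default price is simply the $0.002 per 1k tokens used above; prices change, so check the current rate before relying on it.

```python
def estimate_cost(n_docs, tokens_per_doc, price_per_1k=0.002):
    """Dollar cost of sending n_docs documents at tokens_per_doc tokens each."""
    return n_docs * tokens_per_doc * price_per_1k / 1000


base = estimate_cost(250_000, 400)        # the estimate above
longer = estimate_cost(250_000, 4 * 400)  # if four times as much text were sent per doc
```

Because the cost is linear in tokens per document, sending a longer slice of each contract scales the bill proportionally.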

## Speed and rate limits

I sent 1850 characters, which OpenAI says is 376 tokens. 

In [54]:
sent = '''<FILENAME>ex10.txt
<DESCRIPTION>CFC INTERNATIONAL, INC.-BAXTER PURCHASE AGREEMENT
<TEXT>
Exhibit 10.9


                               PURCHASE AGREEMENT

         This Agreement, effective March 1, 2001 is between CFC International, a
Delaware corporation, with offices at 500 State Street, Chicago Heights,
Illinois 60411 ("Seller") and Baxter Healthcare Corporation, a Delaware
corporation, with offices at One Baxter Parkway, Deerfield, Illinois 60015 on
behalf or its self and its affiliates (entities controlling, controlled by, or
under common control with Baxter)("Buyer").

                                 1.0 Background


         1.1 Seller produces hot stamping foil which conforms and meets the
Specification Requirements submitted, accepted and in Seller's possession for
the Specification numbers listed attached in the Exhibit A., hereafter referred
to as "Products". Product Specifications may be revised from time to time and
new Specifications and numbers added by mutual agreement between parties. Buyer
requires foil for use in printing flexible packaging.


                                2.0 Distribution


         2.1 Subject to the terms and conditions of this Agreement, Seller shall
manufacture and sell Products to Buyer, and Buyer shall purchase Products for
manufacture into goods for use or resale in any country in the world. Buyer
agrees to purchase all their global foiling requirements from seller, or as
stated in Section 13.2.


                            3.0 Shipment of Products


         3.1 Seller will ship Products, F.O.B. Seller's facility, freight
collect, to locations specified by Buyer and via carriers specified by Buyer.

         3.2 Seller agrees to maintain negotiated consignment inventory at
Baxter's locations per specific plant consignment agreements.
'''

len(encoding.encode(sent))

376

Input token rate limit is 60000 per minute:

In [37]:
# can do this many contracts per minute 
(
    60000 # token limit per minute
    /
    400   # conservative guess tokens per contract 
)


150.0

Only allowed to do 60 requests a minute. But a single request can "batch" multiple prompts.

So, 50 times a minute, send 3 contracts. sleep(1.2) between calls. 
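
The plan above (3 contracts per request, 1.2 seconds between requests) can be sketched as a small generator. The name `throttled_batches` is hypothetical, and `sleep_fn` is a parameter only so the pacing can be tested without actually waiting.

```python
import time


def throttled_batches(items, batch_size=3, pause=1.2, sleep_fn=time.sleep):
    """Yield lists of up to batch_size items, pausing between consecutive batches."""
    for start in range(0, len(items), batch_size):
        if start > 0:
            sleep_fn(pause)  # wait before every batch except the first
        yield items[start:start + batch_size]
```

A loop such as `for batch in throttled_batches(files): ...` would then send each batch in one request. The time each request itself takes is not counted, so the real rate would be somewhat below 50 requests a minute.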

In [53]:
# contracts per day
(
    # num contracts per minute
    (
        60000 # tokens per minute
        /
        400   # tokens per contract (if the length above is kept)
    )
    *
    60*24 # minutes in a day
) 

216000.0
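
Combining this throughput with the document count gives a rough run time for the whole sample. The sketch assumes the figures quoted above (60,000 tokens per minute, 400 tokens per contract, 250,000 documents) and treats the token limit as the only bottleneck; the function name is made up.

```python
def days_to_process(n_docs, tokens_per_doc=400, tokens_per_minute=60_000):
    """Days needed if the token-per-minute limit is the only bottleneck."""
    docs_per_day = tokens_per_minute / tokens_per_doc * 60 * 24
    return n_docs / docs_per_day


days = days_to_process(250_000)  # a bit more than one day
```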

In [57]:
question = "Output a tab separated list containing two items: the name of the buyer, and the name of the seller."

openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "user", "content": f"{sent}\n---\n{question}"},
    ]
)

<OpenAIObject chat.completion id=chatcmpl-6zsvWPEb5epV348uxX6Dau1rfE6G4 at 0x1afae2ca450> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Baxter Healthcare Corporation\tCFC International, Inc.",
        "role": "assistant"
      }
    }
  ],
  "created": 1680207166,
  "id": "chatcmpl-6zsvWPEb5epV348uxX6Dau1rfE6G4",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 10,
    "prompt_tokens": 407,
    "total_tokens": 417
  }
}