# Health Score Scraping and Extraction

First, we install and import the required libraries, then create the sec-api clients used to extract and render filings from the SEC database, along with the OpenAI client used for model requests.

In [None]:
!pip install openai
!pip install sec-api
from google.colab import userdata
openai_key = userdata.get('openai')
sec_key = userdata.get('sec')

import pandas as pd
import numpy as np
from openai import OpenAI
import json
from sec_api import ExtractorApi, RenderApi

extractorApi = ExtractorApi(sec_key)
renderApi = RenderApi(sec_key)
client = OpenAI(api_key=openai_key)

We first need a statement that details each company's goals and current ideologies from its annual SEC report (10-K). To start, we collected a list of links to the 10-Ks of 5 companies from the SEC's website.

In [None]:
with open('10k_links_5_companies.json') as f:  # close the file handle once loaded
    tenk_links = json.load(f)

Then, we wrote a function to extract just Item 1 (Business) from each 10-K, as this section usually contains information on a company's goals.

In [None]:
def get_goals(url):
    #Works on MCD < 2019, CMG, WEN, DMZ, TXRH
    business = extractorApi.get_section(url,"1",'text')
    return business

Then, we iterate through all the filings in the tenk_links JSON data, extract the goals section from each 10-K, and save the results to a new CSV.

In [None]:
goals = []
for filing in tenk_links['filings']:
    # get_goals doesn't work on MCD > 2018, so we need to make sure it's not a MCD filing from those years
    if not(filing['filedAt'][:4] in {'2019', '2020', '2021'} and filing['ticker'] == 'MCD'):
        for form in filing['documentFormatFiles']:
            if form['type'] == '10-K':
                document_url = form['documentUrl'].replace("/ix?doc=", "")
                goals.append((get_goals(document_url), filing['filedAt'], filing['ticker']))
                break

goal_df = pd.DataFrame(goals, columns=['goals', 'date', 'ticker'])
goal_df.to_csv('company_goals.csv', index=False)
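
The URL selection inside the loop above can be illustrated in isolation. The following is a minimal sketch, assuming each filing is a dictionary with the same keys the loop reads (`documentFormatFiles`, `type`, `documentUrl`). The helper name `pick_10k_url` and the sample record are hypothetical and exist only for illustration.

```python
def pick_10k_url(filing):
    """Return the cleaned URL of the first 10-K document in a filing, or None."""
    for form in filing.get("documentFormatFiles", []):
        if form.get("type") == "10-K":
            # Drop the "/ix?doc=" segment, as the loop above does, so the URL
            # points at the document itself rather than a viewer page.
            return form["documentUrl"].replace("/ix?doc=", "")
    return None

# Hypothetical filing record shaped like the entries in tenk_links['filings'].
sample_filing = {
    "ticker": "XYZ",
    "filedAt": "2017-02-24T16:05:00-05:00",
    "documentFormatFiles": [
        {"type": "EX-21", "documentUrl": "https://example.com/ex21.htm"},
        {"type": "10-K", "documentUrl": "https://example.com/ix?doc=/Archives/form10k.htm"},
    ],
}

url = pick_10k_url(sample_filing)
print(url)
```

Returning `None` when no 10-K entry exists makes it easy to skip such filings explicitly instead of silently appending nothing.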

Next, we needed to determine each company's devotion to healthier food based on its goals. We first tried an embedding-based approach to natural language processing. The get_embedding function creates an embedding vector for a given section of text using OpenAI's text-embedding-3-large model, truncating the text to 32,000 characters first. We also define the cosine_similarity function, which gives a similarity score between two vectors and so lets us compare two embeddings.

In [None]:
def get_embedding(text):
    newtext = text.replace("\n", " ")
    if len(newtext) > 32000:
        newtext = newtext[:32000]
    response = client.embeddings.create(
        input=[newtext],
        model="text-embedding-3-large"
    )
    embedding = response.data[0].embedding
    return np.array(embedding)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
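
To make the similarity score concrete, here is a small dependency-free sketch of the same formula (dot product divided by the product of the norms) applied to toy 2-D vectors. The function name `cosine_similarity_plain` and the vectors are illustrative assumptions, not part of the original pipeline.

```python
import math

def cosine_similarity_plain(a, b):
    """Cosine similarity for two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, orthogonal, and opposite direction.
same = cosine_similarity_plain([1.0, 0.0], [2.0, 0.0])
orth = cosine_similarity_plain([1.0, 0.0], [0.0, 3.0])
opp = cosine_similarity_plain([1.0, 0.0], [-4.0, 0.0])
print(same, orth, opp)
```

The score depends only on direction, not magnitude, so it ranges from -1 (opposite) through 0 (orthogonal) to 1 (same direction).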

To determine how much the text indicates a concern for healthier food, we wrote the following function, which compares the embedding vector of the text against the embedding vectors of two reference statements: one expressing concern for healthier food and one expressing the opposite.

In [None]:
from math import floor

def healthy_goal_score(statement):
    healthier_food_related = get_embedding("One thing we care about is making our food healthier.")
    not_healthier_food_related = get_embedding("We do not care about making our food healthier.")

    statement_embedding = get_embedding(statement)

    pos = cosine_similarity(statement_embedding, healthier_food_related)
    neg = cosine_similarity(statement_embedding, not_healthier_food_related)

    lindiff = (pos-neg+1)/2

    return floor(lindiff*100)
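
The rescaling step in healthy_goal_score can be checked by hand. The sketch below isolates that arithmetic in a hypothetical helper, `rescale`, and feeds it made-up similarity values. Because each cosine similarity lies in [-1, 1], the difference `pos - neg` lies in [-2, 2], so the unclamped score can in principle fall outside 0 to 100. The clamped variant is an optional assumption, not something the original code does.

```python
from math import floor

def rescale(pos, neg):
    """Same arithmetic as healthy_goal_score: map (pos - neg) onto a 0-100 style score."""
    return floor((pos - neg + 1) / 2 * 100)

def rescale_clamped(pos, neg):
    """Hypothetical variant that forces the result into [0, 100]."""
    return max(0, min(100, rescale(pos, neg)))

equal = rescale(0.5, 0.5)        # identical similarities land on the midpoint
leaning = rescale(0.5, 0.25)     # a modest lead for the positive reference
extreme_hi = rescale(1.0, -1.0)  # theoretical upper extreme
extreme_lo = rescale(-1.0, 1.0)  # theoretical lower extreme
print(equal, leaning, extreme_hi, extreme_lo)
```

When the two similarities are nearly equal, the score stays close to 50 regardless of what the text says.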

However, since companies' statements are usually quite long, we found that the embeddings were often not similar to either reference statement, and the positive and negative similarities were very close to each other, so the resulting scores were neither interesting nor accurate. Thus, we decided to use OpenAI's GPT-4o-mini model to analyze the text and score each company's dedication to healthier food.

In [None]:
def chat_goal_score(statement):
    system_message = 'You are an expert in NLP. Given an excerpt describing a certain company\'s current business position, please provide a score from 0 to 100 indicating how much they care about making their food healthier. Provide your answer as a JSON object of the form {"score": 50}.'
    classifier = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[
            {'role': 'system', 'content': system_message},
            {'role': 'user', 'content': statement}
            ],
        response_format={'type':'json_object'}
    )
    response = classifier.choices[0].message.content
    jsonresponse = json.loads(response)
    return jsonresponse['score']
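
Since the model's reply is parsed directly, a malformed or out-of-range answer would raise an error or distort the data. The following is a minimal sketch of a defensive parser, assuming the reply is expected to be a JSON object of the form `{"score": N}`. The helper `parse_score` is hypothetical and makes no API calls.

```python
import json

def parse_score(raw, low=0, high=100):
    """Parse a '{"score": N}' reply; return N clamped to [low, high], or None if unusable."""
    try:
        value = json.loads(raw)["score"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not JSON, missing the "score" key, or not a JSON object at all.
        return None
    # Reject booleans and non-numeric values such as strings or lists.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return max(low, min(high, value))

print(parse_score('{"score": 72}'))
print(parse_score('{"score": 140}'))
print(parse_score('{"rating": 5}'))
```

Returning `None` for unusable replies lets the caller decide whether to retry the request or drop the row.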

This method was much more successful, so we then loaded the saved goal data from the CSV, and applied our chat_goal_score function to every 10-K excerpt.

In [None]:
goaldf = pd.read_csv('company_goals.csv')
goaldf['health_score'] = goaldf['goals'].apply(chat_goal_score)
goaldf.to_csv('health_scores.csv', index=False)

# Statewide Food Law Data Scraping and Extraction

First, we visited the NCSL website to obtain information on statewide laws. We collected a list of all health-related laws passed by states from 2014 to 2022 and copied it into a TXT file.

In [None]:
with open('rawncslfull.txt') as f:
    fullncsl = f.readlines()

Then, we iterated over this file and extracted the year and state from each entry. Every entry was counted in a map of all laws by state and year. If "Food Safety" was among the entry's topics, we also counted it in a separate map that tracks only the food safety laws.

In [None]:
from collections import defaultdict
food_laws_map = defaultdict(lambda: defaultdict(int))
all_laws_map = defaultdict(lambda: defaultdict(int))
for i, line in enumerate(fullncsl):
    if line[:4] in {'2014','2015','2016','2017','2018','2019','2020','2021','2022'}:
        year = line[:4]
        state = fullncsl[i-1][:2]
        topics = fullncsl[i+5][8:].replace('\n','').split(', ')
        if 'Food Safety' in topics:
            food_laws_map[state][year] += 1
        all_laws_map[state][year] += 1
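
The offsets used above (`i-1` for the state, `i+5` for the topics, and the 8-character prefix) depend on the layout of the copied text. The sketch below runs the same counting logic on a small synthetic sample. The record layout, the filler lines, and the `Topics: ` prefix are assumptions made for illustration, not the actual NCSL format, and the bounds check is an added safeguard that the original loop does not have.

```python
from collections import defaultdict

YEARS = {str(y) for y in range(2014, 2023)}

def count_laws(lines):
    """Count food safety laws and all laws per state and year."""
    food = defaultdict(lambda: defaultdict(int))
    total = defaultdict(lambda: defaultdict(int))
    for i, line in enumerate(lines):
        # Require a preceding state line and a topics line five rows down.
        if line[:4] in YEARS and i >= 1 and i + 5 < len(lines):
            year = line[:4]
            state = lines[i - 1][:2]
            topics = lines[i + 5][8:].replace("\n", "").split(", ")
            if "Food Safety" in topics:
                food[state][year] += 1
            total[state][year] += 1
    return food, total

# Two hypothetical records, each laid out as: state line, year line,
# four filler lines, then a topics line.
sample = [
    "CA H 1\n", "2015\n", "Status line\n", "Author line\n",
    "Title line\n", "Summary line\n", "Topics: Food Safety, Nutrition\n",
    "CA S 2\n", "2015\n", "Status line\n", "Author line\n",
    "Title line\n", "Summary line\n", "Topics: Tobacco\n",
]

food, total = count_laws(sample)
print(food["CA"]["2015"], total["CA"]["2015"])
```

Running the parser on a handful of hand-checked records like this is a quick way to confirm that the offsets match the real file before trusting the counts.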

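Finally, we calculate the proportion of health laws that are food related by dividing each food safety count by the corresponding count of all laws. Only state and year pairs with at least one food safety law appear in the resulting map.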

In [None]:
weighted_map = {state: {year: food_laws_map[state][year]/all_laws_map[state][year] for year in food_laws_map[state]} for state in food_laws_map}
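
As a worked example of this division, the sketch below applies the same comprehension to small hand-made counts. The dictionaries `food_counts` and `all_counts` are hypothetical stand-ins for food_laws_map and all_laws_map.

```python
# Hypothetical counts: food safety laws and all health laws per state and year.
food_counts = {"CA": {"2015": 1, "2016": 2}, "TX": {"2015": 1}}
all_counts = {"CA": {"2015": 4, "2016": 2}, "TX": {"2015": 2}, "NY": {"2015": 3}}

# Share of health laws that are food safety laws, per state and year.
share = {
    state: {year: food_counts[state][year] / all_counts[state][year] for year in years}
    for state, years in food_counts.items()
}
print(share)
```

Here "NY" has health laws but no food safety laws, so it is absent from `share`, which mirrors how the comprehension above iterates only over states present in food_laws_map.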

Then, we save this map to an XLSX file.

In [None]:
out = pd.DataFrame(weighted_map).fillna(0).astype(float)
out = out.sort_index().sort_index(axis=1)
out.to_excel('weighted_food_safety_laws_by_state_and_year.xlsx', sheet_name='Food Safety Laws')