## Amanda Cesario's notebook for DS Final Project
### Question 5: How do IonQ's risk factors change over time?
### Group 3 members: Cole Barrett, Caterina Grossi, Connor Steward
This notebook uses OpenAI's API to extract noun phrases, named entities (NER), entity relationships, and sentiment scores from the Item 1A (risk factors) text of IonQ's 10-K filings. The results are saved to CSV files that are cleaned later in the "Cleaning - Final Project" notebook.

In [1]:
import os
from openai import OpenAI
from dotenv import load_dotenv
import pandas as pd
import numpy as np

In [2]:
load_dotenv()

# Initialize OpenAI client with API key
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Test API call
chat_completion = client.chat.completions.create(
    model="gpt-3.5-turbo",  # Model selection
    messages=[
        {"role": "user", "content": "Say this is a test"},
    ]
)

# Print response
print(chat_completion.choices[0].message.content)

This is a test
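
The test call above only succeeds if `OPENAI_API_KEY` was actually loaded from the `.env` file. A minimal sketch of an explicit check, assuming a hypothetical helper named `require_env` (not part of the original notebook), could fail early with a readable message instead of an authentication error from the API:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Hypothetical guard: stop before any API call is attempted
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value
```

It could be used as `client = OpenAI(api_key=require_env("OPENAI_API_KEY"))` right after `load_dotenv()`.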


# Noun Phrase Extraction

In [3]:
# Create a function to read my txt files
def extract_text_from_txt(txt_path):
    """Reads text from a .txt file."""
    with open(txt_path, "r", encoding="utf-8") as file:
        return file.read()

In [4]:
RF_2021 = extract_text_from_txt("IONQ 2021 10K_Item1A.txt")

In [8]:
# Define chunk size in characters (not tokens); adjust to stay within the model's token limit
chunk_size = 200
content_chunks = [RF_2021[i:i + chunk_size] for i in range(0, len(RF_2021), chunk_size)]

# List to store results
results = []

# Process chunks one by one
for i, chunk in enumerate(content_chunks):
    conversation = [
        {"role": "system", "content": "Extract noun phrases from the following text:"},
        {"role": "user", "content": chunk},
    ]

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=conversation
    )

    # Extract output
    output = response.choices[0].message.content.strip()

    # Append results to the list
    results.append({"Chunk Number": i + 1, "Noun Phrases": output})

# Convert results to DataFrame
df = pd.DataFrame(results)

# Save to CSV
csv_filename = "2021 NPs.csv"
df.to_csv(csv_filename, index=False, encoding="utf-8")

print(f"✅ Results saved to {csv_filename}")

In [9]:
# Check that the file was saved properly (it was)
NP = pd.read_csv('2021 NPs.csv')
NP.head()

Unnamed: 0,Chunk Number,Noun Phrases
0,1,- Risk Factors\n- investment\n- securities\n- ...
1,2,- Form 10-K\n- Decision\n- Investment\n- Units...
2,3,- event\n- trading price\n- securities\n- inve...
3,4,- rating history\n- no revenues\n- basis\n- ab...
4,5,"initial business combination, vote, holders, f..."
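
The chunking in this notebook slices the text every 200 characters, so a chunk can start or end in the middle of a word. A minimal sketch of an alternative, assuming a hypothetical helper `chunk_on_whitespace` (not used in the original runs), breaks only on whitespace. Note that it collapses newlines and repeated spaces, and a single word longer than `max_chars` becomes its own oversized chunk:

```python
from typing import List

def chunk_on_whitespace(text: str, max_chars: int = 200) -> List[str]:
    """Split text into chunks of up to max_chars characters without cutting words."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            # Either the word fits, or the chunk is empty and must take the word anyway
            current = candidate
        else:
            # The word does not fit: close the current chunk and start a new one
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

print(chunk_on_whitespace("aaa bbb ccc", max_chars=7))  # ['aaa bbb', 'ccc']
```

Switching to this helper would change the number of chunks, so the chunk numbers would no longer line up with the CSV files produced here.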


In [10]:
# Since extraction takes a while, I'm only running 2021 and 2024 for now.
# This allows a good comparison between when they first went public and how they're doing three years later
RF_2024 = extract_text_from_txt("IONQ 2024 10K_Item1A.txt")

# Define chunk size in characters (not tokens); adjust to stay within the model's token limit
chunk_size = 200
content_chunks = [RF_2024[i:i + chunk_size] for i in range(0, len(RF_2024), chunk_size)]

# List to store results
results = []

# Process chunks one by one
for i, chunk in enumerate(content_chunks):
    conversation = [
        {"role": "system", "content": "Extract noun phrases from the following text:"},
        {"role": "user", "content": chunk},
    ]

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=conversation
    )

    # Extract output
    output = response.choices[0].message.content.strip()

    # Append results to the list
    results.append({"Chunk Number": i + 1, "Noun Phrases": output})

# Convert results to DataFrame
df = pd.DataFrame(results)

# Save to CSV
csv_filename = "2024 NPs.csv"
df.to_csv(csv_filename, index=False, encoding="utf-8")

print(f"✅ Results saved to {csv_filename}")

In [11]:
# Checking to make sure it was saved properly
NP24 = pd.read_csv('2024 NPs.csv')
NP24.head()

Unnamed: 0,Chunk Number,Noun Phrases
0,1,- Item 1A\n- Risk factors\n- Our securities\n-...
1,2,- Cautionary Note \n- Forward-Looking Statemen...
2,3,- Annual Report\n- events\n- developments\n- b...
3,4,- common stock\n- all or part of your investme...
4,5,- risks\n- our business\n- a number\n- immater...
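
The 2021 and 2024 cells repeat the same chunk-and-loop logic with only the input text and output column changing. One possible refactor, sketched here under the assumption of a hypothetical helper `process_chunks` and a stand-in `fake_extractor` (neither is in the original notebook), takes the extraction step as a function argument so the loop is written once:

```python
from typing import Callable, Dict, List

def process_chunks(
    text: str,
    extract_fn: Callable[[str], str],
    column: str,
    chunk_size: int = 200,
) -> List[Dict[str, object]]:
    """Slice text into fixed-size character chunks and apply extract_fn to each one."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    rows = []
    for number, chunk in enumerate(chunks, start=1):
        rows.append({"Chunk Number": number, column: extract_fn(chunk)})
    return rows

# Stand-in extractor so the sketch runs without an API call (assumption for illustration)
def fake_extractor(chunk: str) -> str:
    return chunk.upper()

rows = process_chunks("abcdefghij", fake_extractor, "Noun Phrases", chunk_size=4)
print(rows)
```

In the notebook, `extract_fn` would wrap the `client.chat.completions.create` call, and the returned rows could be passed to `pd.DataFrame(rows).to_csv(...)` as in the cells above.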


# Entity Relationship Extraction and Sentiment Analysis

In [8]:
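# Note: this import rebinds the name OpenAI to LangChain's LLM wrapper,
# shadowing the OpenAI client class imported from the openai package above
from langchain.llms import OpenAI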
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
import pandas as pd
from tqdm import tqdm # To see progress on extraction

In [4]:
RF_2021 = extract_text_from_txt("IONQ 2021 10K_Item1A.txt")

In [5]:
RF_2024 = extract_text_from_txt("IONQ 2024 10K_Item1A.txt")

# NER

In [12]:
# 2021 Extraction

# Define a function to extract named entities using an LLMChain
def extract_ner(text):
    llm = OpenAI(temperature=0.7, max_tokens=200)
    
    template = """
    For the following passage, please identify all named entities and return them with their corresponding entity types in JSON format:
    {text}
    """
    prompt = PromptTemplate(template=template, input_variables=["text"])
    
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    result = llm_chain.invoke({"text": text})
    return result

# Change your file path/variable here
chunk_size = 200  
content_chunks = [RF_2021[i:i + chunk_size] for i in range(0, len(RF_2021), chunk_size)]

results = []

# Process each chunk with a progress tracker
for i, chunk in enumerate(tqdm(content_chunks, desc="Processing chunks")):
    ner_output = extract_ner(chunk)
    results.append({"Chunk Number": i + 1, "Named Entities": ner_output})

# Convert results to a DataFrame and save as CSV
df = pd.DataFrame(results)
csv_filename = "2021_NER.csv"
df.to_csv(csv_filename, index=False, encoding="utf-8")
print(f"✅ Results saved to {csv_filename}")

Processing chunks: 100%|██████████████████████| 828/828 [17:31<00:00,  1.27s/it]

✅ Results saved to 2021_NER.csv





In [13]:
NER21 = pd.read_csv('2021_NER.csv')
NER21.head()

Unnamed: 0,Chunk Number,Named Entities
0,1,"{'text': '\n\n{""securities"": ""Item 1A"", ""risk ..."
1,2,"{'text': '\n {\n ""Form 10-K"": ""Produ..."
2,3,"{'text': '\n {\n ""that event"": ""Even..."
3,4,"{'text': '\n [{""entity"": ""rating history"", ..."
4,5,"{'text': '\n{\n ""initial business combination..."


In [9]:
# 2024 Extraction
# Change your variable here (from RF_2021 to RF_2024)
chunk_size = 200  
content_chunks = [RF_2024[i:i + chunk_size] for i in range(0, len(RF_2024), chunk_size)]

results = []

# Process each chunk with a progress tracker
for i, chunk in enumerate(tqdm(content_chunks, desc="Processing chunks")):
    ner_output = extract_ner(chunk)
    results.append({"Chunk Number": i + 1, "Named Entities": ner_output})

# Convert results to a DataFrame and save as CSV
df = pd.DataFrame(results)
csv_filename = "2024_NER.csv"
df.to_csv(csv_filename, index=False, encoding="utf-8")
print(f"✅ Results saved to {csv_filename}")

Processing chunks: 100%|██████████████████████| 949/949 [21:29<00:00,  1.36s/it]

✅ Results saved to 2024_NER.csv





In [10]:
# Checking to see if it was saved properly
NER24 = pd.read_csv('2024_NER.csv')
NER24.head()

Unnamed: 0,Chunk Number,Named Entities
0,1,"{'text': '\n {\n ""Item 1A"": ""entity""..."
1,2,"{'text': '\n {\n ""Above"": ""Organizat..."
2,3,"{'text': '\n{\n ""Annual Report"": ""Publicati..."
3,4,"{'text': '\n {""entities"": [\n {""name..."
4,5,"{'text': '\n{\n ""named_entities"": [\n {\n ..."
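
The previews above show that each saved cell is the string form of a Python dict with a single `'text'` key, and that the model's JSON sits inside that value. A minimal sketch of recovering it, assuming hypothetical helpers `unwrap_text` and `parse_entities` and a made-up sample cell shaped like the preview rows (this is not necessarily how the cleaning notebook does it):

```python
import ast
import json

def unwrap_text(cell: str) -> str:
    """Recover the model output from a cell saved as the string form of {'text': ...}."""
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return cell.strip()
    if isinstance(parsed, dict) and "text" in parsed:
        return str(parsed["text"]).strip()
    return cell.strip()

def parse_entities(cell: str):
    """Return the parsed JSON from a cell, or None if the model output is not valid JSON."""
    try:
        return json.loads(unwrap_text(cell))
    except json.JSONDecodeError:
        return None

# Made-up sample cell for illustration only
sample = r"""{'text': '\n{"Form 10-K": "Document"}'}"""
print(parse_entities(sample))
```

Cells whose JSON was cut off by `max_tokens=200` would come back as `None` here and need separate handling.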


# Relationships

In [11]:
# 2021 extraction
# Define a function to extract relationships using an LLMChain
def extract_relationship(text):
    llm = OpenAI(temperature=0.7, max_tokens=200)
    
    template = """
    Find all noun phrases in the passage and their semantic types, construct relationships between pairs of 
    these entities in the form of a triple, identify the semantic type of the triple and output in a JSON structure.
    : {text} 
    """
    
    prompt = PromptTemplate(template=template, input_variables=["text"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    
    result = llm_chain.invoke({"text": text})
    return result

# Change your file path/variable here
chunk_size = 200  # Characters per chunk (not tokens)
content_chunks = [RF_2021[i:i + chunk_size] for i in range(0, len(RF_2021), chunk_size)]

results = []

# Process each chunk with a progress tracker
for i, chunk in enumerate(tqdm(content_chunks, desc="Processing chunks")):
    relationship_output = extract_relationship(chunk)
    results.append({"Chunk Number": i + 1, "Relationship Extraction": relationship_output})

# Convert results to a DataFrame and save as CSV
df = pd.DataFrame(results)
csv_filename = "2021_Relationships.csv"
df.to_csv(csv_filename, index=False, encoding="utf-8")
print(f"✅ Results saved to {csv_filename}")

Processing chunks: 100%|██████████████████████| 828/828 [29:46<00:00,  2.16s/it]

✅ Results saved to 2021_Relationships.csv





In [12]:
relations21 = pd.read_csv('2021_Relationships.csv')
relations21.head()

Unnamed: 0,Chunk Number,Relationship Extraction
0,1,{'text': '\n JSON Output: \n [\n ...
1,2,"{'text': '""triples"": [\n {\n ..."
2,3,"{'text': '\n [\n {\n ""ent..."
3,4,{'text': '\n JSON output: \n {\n ...
4,5,"{'text': '\n\n[\n {\n ""entity"": ""ini..."


In [13]:
# 2024 extraction
# Change your file path/variable here (from RF_2021 to RF_2024)
chunk_size = 200  # Characters per chunk (not tokens)
content_chunks = [RF_2024[i:i + chunk_size] for i in range(0, len(RF_2024), chunk_size)]

results = []

# Process each chunk with a progress tracker
for i, chunk in enumerate(tqdm(content_chunks, desc="Processing chunks")):
    relationship_output = extract_relationship(chunk)
    results.append({"Chunk Number": i + 1, "Relationship Extraction": relationship_output})

# Convert results to a DataFrame and save as CSV
df = pd.DataFrame(results)
csv_filename = "2024_Relationships.csv"
df.to_csv(csv_filename, index=False, encoding="utf-8")
print(f"✅ Results saved to {csv_filename}")

Processing chunks: 100%|██████████████████████| 949/949 [33:40<00:00,  2.13s/it]

✅ Results saved to 2024_Relationships.csv





In [14]:
relations24 = pd.read_csv('2024_Relationships.csv')
relations24.head()

Unnamed: 0,Chunk Number,Relationship Extraction
0,1,"{'text': '\n\n ""triples"": [\n {\n ..."
1,2,"{'text': '\n[\n {\n ""entity"": ""Cauti..."
2,3,"{'text': 'r stock could decline, and the\n\n\n..."
3,4,{'text': '\n \n JSON Output: \n {\n ...
4,5,"{'text': '\nJSON Output:\n[\n {\n ""triple""..."
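
Several relationship outputs above begin with extra text such as `JSON Output:` or a bare `"triples":` key before the JSON itself. A minimal sketch of pulling out the first parseable JSON value from such text, assuming a hypothetical helper `first_json_value` (not part of the original notebook), uses the standard library's `json.JSONDecoder.raw_decode`:

```python
import json

def first_json_value(text: str):
    """Return the first JSON object or array found in text, or None if none parses."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char in "[{":
            try:
                # raw_decode parses one JSON value starting at `start` and ignores what follows
                value, _ = decoder.raw_decode(text, start)
                return value
            except json.JSONDecodeError:
                continue
    return None

print(first_json_value('JSON Output: \n [{"subject": "a", "object": "b"}]'))
```

When an output was truncated partway through a list, this may return only the first complete inner object rather than the whole list, so the result should be checked before use.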


# Sentiments/Polarity

In [15]:
# 2021
# Define a function to predict sentiment as a polarity score using an LLMChain
def predict_sentiment(text):
    llm = OpenAI(temperature=0.7, max_tokens=200)
    
    template = """
    Analyze the sentiment of the following text and return a polarity score between -1 (most negative) and 1 (most positive). Provide only the score:
    {text}
    """
    
    prompt = PromptTemplate(template=template, input_variables=["text"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    
    result = llm_chain.invoke({"text": text})
    return result

# Change your file path/variable here
chunk_size = 200  
content_chunks = [RF_2021[i:i + chunk_size] for i in range(0, len(RF_2021), chunk_size)]

results = []

# Process each chunk with a progress tracker
for i, chunk in enumerate(tqdm(content_chunks, desc="Processing sentiment chunks")):
    sentiment_output = predict_sentiment(chunk)
    results.append({"Chunk Number": i + 1, "Polarity Score": sentiment_output})

# Convert results to a DataFrame and save as CSV
df = pd.DataFrame(results)
csv_filename = "2021_Polarity.csv"
df.to_csv(csv_filename, index=False, encoding="utf-8")
print(f"✅ Results saved to {csv_filename}")

Processing sentiment chunks: 100%|████████████| 828/828 [08:31<00:00,  1.62it/s]

✅ Results saved to 2021_Polarity.csv





In [16]:
sentiment21 = pd.read_csv('2021_Polarity.csv')
sentiment21.head()

Unnamed: 0,Chunk Number,Polarity Score
0,1,{'text': '\n-0.8'}
1,2,{'text': '\n-0.3'}
2,3,{'text': '\n-0.6'}
3,4,{'text': '\n-0.7'}
4,5,{'text': '\n-0.2'}


In [17]:
# 2024
# Change your file path/variable here (from RF_2021 to RF_2024)
chunk_size = 200  
content_chunks = [RF_2024[i:i + chunk_size] for i in range(0, len(RF_2024), chunk_size)]

results = []

# Process each chunk with a progress tracker
for i, chunk in enumerate(tqdm(content_chunks, desc="Processing sentiment chunks")):
    sentiment_output = predict_sentiment(chunk)
    results.append({"Chunk Number": i + 1, "Polarity Score": sentiment_output})

# Convert results to a DataFrame and save as CSV
df = pd.DataFrame(results)
csv_filename = "2024_Polarity.csv"
df.to_csv(csv_filename, index=False, encoding="utf-8")
print(f"✅ Results saved to {csv_filename}")

Processing sentiment chunks: 100%|████████████| 949/949 [10:15<00:00,  1.54it/s]

✅ Results saved to 2024_Polarity.csv





In [18]:
sentiment24 = pd.read_csv('2024_Polarity.csv')
sentiment24.head()

Unnamed: 0,Chunk Number,Polarity Score
0,1,{'text': '\n-0.2'}
1,2,{'text': '\n\n0.0'}
2,3,{'text': '\n-0.2'}
3,4,{'text': '\n-1'}
4,5,{'text': '\n-0.5'}
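
The polarity cells are stored as strings like `{'text': '\n-0.8'}` rather than numbers, so they need converting before any averaging or comparison between years. A minimal sketch, assuming a hypothetical helper `parse_polarity` (not necessarily what the cleaning notebook does), takes the first number in the cell and discards anything outside the requested range of -1 to 1:

```python
import re
from typing import Optional

def parse_polarity(cell: str) -> Optional[float]:
    """Extract the first number from a saved polarity cell, or None if absent or out of range."""
    match = re.search(r"-?\d+(?:\.\d+)?", cell)
    if match is None:
        return None
    score = float(match.group())
    # The prompt asks for a score between -1 and 1; treat anything else as invalid
    return score if -1.0 <= score <= 1.0 else None

print(parse_polarity(r"{'text': '\n-0.8'}"))  # -0.8
```

With pandas this could be applied as `sentiment24["Polarity Score"].apply(parse_polarity)`. The regex takes the first number it sees, so a response that includes other digits before the score would be misread.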
