# OpenAI sentiment for the headlines

1. Using the category probabilities from the classifier (stored in `predictions.json`), we build a prediction set for each headline from the categories whose predicted probability exceeds a threshold. Two thresholds are used: one for the broad categories and one for the society categories. This lets a headline be assigned to one or more categories according to its prediction scores.
2. We then prompt OpenAI for the sentiment of each headline in relation to the categories in its prediction set, so that the sentiment takes the category context into account.
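
As a minimal sketch of step 1, the snippet below turns a probability vector into a prediction set by thresholding. The category names are the broad categories used in this notebook; the probabilities and the helper name `threshold_set` are made up for illustration and are not taken from `predictions.json`.

```python
def threshold_set(categories, probs, threshold):
    """Return (category, prob) pairs whose probability exceeds the threshold,
    sorted from most to least probable."""
    kept = [(c, p) for c, p in zip(categories, probs) if p > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


broad_categories = ["{Education}", "{Careers & Workforce}", "{Society}", "{Other}"]
# Illustrative values only.
example_probs = [0.05, 0.45, 0.90, 0.10]

print(threshold_set(broad_categories, example_probs, threshold=0.3))
```

With a threshold of 0.3, both `{Society}` and `{Careers & Workforce}` are kept, so the headline would belong to two categories.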


**NOTE:**
1. Because the dataset is large, the prompts were run in batches of up to 2000 headlines. The first three batches (2000 headlines each) finished in about 20 minutes apiece. The fourth batch (1000 headlines) took about 2 hours 24 minutes according to the progress output in this notebook, and the remaining headlines took hours while the runtime kept disconnecting, so they were processed in chunks. The results are stored in 5 files:
  1. sentiment_results_0_2000.csv (0-2000)
  2. sentiment_results_2_2000.csv (2000-4000)
  3. sentiment_results_3_2000.csv (4000-6000)
  4. sentiment_results_4_1000.csv (6000-7000)
  5. responses7k10k.json (7k-10k)
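
One way to make long runs survive a disconnect is to checkpoint progress to disk and skip finished rows on restart. The sketch below is an assumption about how that could be done, not the code used to produce the files above; `label_in_chunks`, `label_fn`, and `checkpoint_path` are hypothetical names.

```python
import json
import os


def label_in_chunks(items, label_fn, checkpoint_path, chunk_size=100):
    """Label items one by one, saving progress to disk after every chunk.

    If checkpoint_path already exists, previously labeled indices are skipped,
    so the loop can be rerun after a disconnect without repeating work.
    """
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            # JSON object keys are always strings, so convert them back to ints.
            results = {int(k): v for k, v in json.load(f).items()}

    pending = [i for i in range(len(items)) if i not in results]
    for n, i in enumerate(pending, start=1):
        results[i] = label_fn(items[i])
        # Write a checkpoint after each full chunk and at the very end.
        if n % chunk_size == 0 or n == len(pending):
            with open(checkpoint_path, "w") as f:
                json.dump(results, f)
    return results
```

Rerunning the same call after an interruption only processes the indices missing from the checkpoint file.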


## Importing and preparing data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json


import re
import nltk
from nltk.util import ngrams
from collections import Counter
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
import json
# Load JSON
df = pd.read_json("datasets/predictions.json", orient="records", lines=True)
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,date,title,source,number_of_characters_title,number_of_words_title,day_of_week,month,year,quarter,is_weekend,category,title1,Topic,broad_probs,society_probs
0,0,0,2024-10-14,"will.i.am and Fyilicia on the AI revolution, i...",Evening Standard,122,20,Monday,October,2024,4,False,Other,"will.i.am and Fyilicia on the AI revolution, i...",72,"[0.0454943627, 0.0981117189, 0.0001882864, 1.0]","[0.9634657502, 0.3475753069, 0.0822043419, 0.1..."
1,1,1,2024-02-21,Intel Launches World’s First Systems Foundry D...,Investor Relations :: Intel Corporation (INTC),117,18,Wednesday,February,2024,1,False,Other,Intel Launches World’s First Systems Foundry D...,-1,"[0.10503153500000001, 0.1746044159, 0.96005296...","[0.9027467370000001, 0.8772780299, 0.092794001..."
2,2,2,2024-02-05,The Unique Challenges of Selling Enterprise AI,Emerge,54,9,Monday,February,2024,1,False,Career,The Unique Challenges of Selling Enterprise AI,11,"[0.048380736300000005, 1.0, 0.0002314206000000...","[0.9934456348, 0.5423408151, 0.224681139, 0.09..."
3,3,3,2024-08-28,Contentious California AI bill passes legislat...,Reuters,88,11,Wednesday,August,2024,3,False,Other,Contentious California AI bill passes legislat...,27,"[0.1038618386, 0.16605721410000002, 0.95684748...","[0.1792803556, 0.1055336893, 0.1054050848, 0.0..."
4,4,4,2024-10-15,"Exploring Genius, Creation, and Humanity in th...",University of Aberdeen,82,14,Tuesday,October,2024,4,False,Other,"Exploring Genius, Creation, and Humanity in th...",-1,"[0.1261354387, 0.2093674242, 0.898952662900000...","[0.5140131116000001, 0.1558585018, 0.101669400..."


In [None]:
def combine_and_filter_predictions(broad_probs, society_probs, broad_categories, society_categories, broad_threshold=0.3, society_threshold=0.6):
    # Filter broad categories based on probability threshold
    broad_pred_set = [(broad_categories[i], broad_probs[i]) for i in range(len(broad_probs)) if broad_probs[i] > broad_threshold]

    # Filter society categories based on probability threshold (only if broad category is "Society")
    society_pred_set = []
    if "{Society}" in [pred[0] for pred in broad_pred_set]:
        society_pred_set = [(society_categories[i], society_probs[i]) for i in range(len(society_probs)) if society_probs[i] > society_threshold]

    # Combine broad_pred_set and society_pred_set, replacing "Society" with specific society categories
    combined_pred_set = []
    combined_probs = []

    for category, prob in broad_pred_set:
        if category == "{Society}":
            # Replace "Society" with the specific society categories
            for society_category, society_prob in society_pred_set:
                combined_pred_set.append(society_category)
                combined_probs.append(society_prob)
        else:
            # Keep broad category
            combined_pred_set.append(category)
            combined_probs.append(prob)

    # Sort by probabilities in descending order
    sorted_indices = sorted(range(len(combined_probs)), key=lambda k: combined_probs[k], reverse=True)
    sorted_combined_pred_set = [combined_pred_set[i] for i in sorted_indices]
    sorted_combined_probs = [combined_probs[i] for i in sorted_indices]

    return sorted_combined_pred_set, sorted_combined_probs

In [None]:
broad_categories = ['{Education}', '{Careers & Workforce}', '{Society}', '{Other}']
society_categories = ['{AI in various Industries}', '{AI in companies & Enterprises}',
       '{AI Investments & Market Trends}', '{AI Ethics, Law & Policy}',
       '{AI Governance & Geopolitics}', '{AI overview, risks & impact}']

In [None]:
# Apply the function directly on the dataframe
df[['pred_set', 'pred_probs']] = df.apply(
    lambda row: pd.Series(combine_and_filter_predictions(
        row['broad_probs'],
        row['society_probs'],
        broad_categories,
        society_categories,
        broad_threshold=0.3,
        society_threshold=0.7
    )),
    axis=1
)
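
With these thresholds a row can end up with an empty `pred_set`, for example when `{Society}` is the only broad category above 0.3 but no society category exceeds 0.7. The prompt built later would then contain an empty category list. One possible guard, which is not part of the original pipeline, is to fall back to the single most probable category. The helper name `with_fallback` and the sample probabilities are hypothetical.

```python
def with_fallback(pred_set, pred_probs, categories, probs):
    """Return the prediction set unchanged, or the top-1 category if it is empty."""
    if pred_set:
        return pred_set, pred_probs
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [categories[best]], [probs[best]]


society_categories = [
    "{AI in various Industries}", "{AI in companies & Enterprises}",
    "{AI Investments & Market Trends}", "{AI Ethics, Law & Policy}",
    "{AI Governance & Geopolitics}", "{AI overview, risks & impact}",
]
# Illustrative probabilities: nothing clears the 0.7 society threshold.
society_probs = [0.55, 0.20, 0.10, 0.65, 0.30, 0.40]

print(with_fallback([], [], society_categories, society_probs))
```

Counting rows where `pred_set` is empty before prompting would show whether such a guard is needed at all.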

## Implementing OpenAI prompting

In [None]:
import os
import pandas as pd
from tqdm import tqdm
from openai import OpenAI
tqdm.pandas()
# Initialize OpenAI client. The key is read from the OPENAI_API_KEY environment
# variable so that it is never stored in the notebook.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define the prompt template

prompt_template = """Analyze the sentiment of the following news headline and classify it as Positive, Negative, or Neutral, considering the listed categories.

Headline: "{headline}" , Categories: {categories}

Instructions:
1.Focus on the sentiment in the context of the given categories.
2.Provide only one sentiment label: Positive, Negative, or Neutral.
3.Respond with just the sentiment label—no explanation needed."
"""

# Function to call OpenAI API
def openAI(prompt):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        store=True,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content.strip()  # Strip to remove extra spaces

# Function to get sentiment for a single headline
def get_sentiment(row):
    headline = row["title"]
    categories = ", ".join(row["pred_set"])
    formatted_prompt = prompt_template.format(headline=headline,categories=categories )

    try:
        return openAI(formatted_prompt)
    except Exception as e:
        print(f"Error processing headline: {headline} - {e}")
        return "Error"
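
`get_sentiment` records `"Error"` as soon as a single request fails. A hedged alternative is to retry transient failures with exponential backoff before giving up. The wrapper below is a generic sketch; `call_with_retries` and its parameters are hypothetical names, not part of the OpenAI client.

```python
import time


def call_with_retries(fn, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying on any exception with exponentially growing waits.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Inside `get_sentiment`, the call `openAI(formatted_prompt)` could then be replaced by `call_with_retries(openAI, formatted_prompt)`, leaving the existing `except` branch as the final fallback.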


In [None]:
# Define batch size 0-2k
batch_size = 2000

# First batch (0-1999)
batch_1 = df.iloc[0:batch_size].copy()
batch_1["openAI_sentiment"] = batch_1.progress_apply(get_sentiment, axis=1)
# Save results for batch 1
batch_1.to_csv("datasets/sentiment_results_0_2000.csv", index=False)

100%|██████████| 2000/2000 [21:18<00:00,  1.56it/s]


In [None]:
batch_1.tail()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,date,title,source,number_of_characters_title,number_of_words_title,day_of_week,month,year,quarter,is_weekend,category,title1,Topic,broad_probs,society_probs,pred_set,pred_probs,openAI_sentiment
1995,1995,1995,2024-09-25,Day Two at GSX: School Security Standard Updat...,Security Management Magazine,113,16,Wednesday,September,2024,3,False,Education,Day Two at GSX: School Security Standard Updat...,1,"[0.9946895838, 0.1067305729, 0.2557640672, 0.2...","[0.8349314332000001, 0.7156846523, 0.113179899...",[{Education}],[0.9946895838],Neutral
1996,1996,1996,2024-01-19,Google DeepMind Scientists in Talks to Leave a...,Bloomberg,76,13,Friday,January,2024,1,False,Career,Google DeepMind Scientists in Talks to Leave a...,-1,"[0.108533822, 0.1926451474, 0.9486312866000001...","[0.0965695456, 0.9999314547, 0.0970316678, 0.1...",[{AI in companies & Enterprises}],[0.9999314547],Neutral
1997,1997,1997,2024-10-19,Looking to Buy Your First AI Stock? This Is th...,Yahoo Finance,100,19,Saturday,October,2024,4,True,Other,Looking to Buy Your First AI Stock? This Is th...,4,"[0.10927193610000001, 0.183262676, 0.958851993...","[0.1512391865, 0.127711907, 0.9999843836000001...",[{AI Investments & Market Trends}],[0.9999843836000001],Positive
1998,1998,1998,2024-02-17,Opinion: Are artificial intelligence and autom...,Los Angeles Times,104,16,Saturday,February,2024,1,True,Other,Opinion: Are ai and automation a cure or poiso...,0,"[0.1120269299, 0.2027468383, 0.9455354214, 0.2...","[0.8230310082000001, 0.0864791721, 0.054725594...","[{AI overview, risks & impact}, {AI in various...","[0.8784251809, 0.8230310082000001]",Neutral
1999,1999,1999,2024-04-14,‘A Murder At The End Of The World’ Creators Ta...,Deadline,151,27,Sunday,April,2024,2,True,Other,‘A Murder At The End Of The World’ Creators Ta...,-1,"[0.1226332262, 0.2111879438, 0.871663689600000...","[0.9685612321, 0.1405998617, 0.106032416200000...","[{AI in various Industries}, {Other}]","[0.9685612321, 0.49269515280000004]",Neutral


In [None]:
# 2nd batch (2k-4k)
batch_2 = df.iloc[2000:4000].copy()
batch_2["openAI_sentiment"] = batch_2.progress_apply(get_sentiment, axis=1)
# Save results for batch 2
batch_2.to_csv("datasets/sentiment_results_2_2000.csv", index=False)

100%|██████████| 2000/2000 [22:35<00:00,  1.48it/s]


In [None]:
# 3rd batch (4k-6k)
batch_3 = df.iloc[4000:6000].copy()
batch_3["openAI_sentiment"] = batch_3.progress_apply(get_sentiment, axis=1)
# Save results for batch 3
batch_3.to_csv("datasets/sentiment_results_3_2000.csv", index=False)

100%|██████████| 2000/2000 [18:20<00:00,  1.82it/s]


In [None]:
# 4th batch (6k-7k)
batch_4 = df.iloc[6000:7000].copy()
batch_4["openAI_sentiment"] = batch_4.progress_apply(get_sentiment, axis=1)
# Save results for batch 4
batch_4.to_csv("datasets/sentiment_results_4_1000.csv", index=False)

100%|██████████| 1000/1000 [2:23:41<00:00,  8.62s/it]


In [None]:
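# Earlier single-call attempt at the 5th batch (7k-10k), kept for reference only.
# batch_5 = df.iloc[7000:10000].copy()
# batch_5["openAI_sentiment"] = batch_5.progress_apply(get_sentiment, axis=1)
# # Save results for batch 5 (note: this filename would overwrite the batch 4 results)
# batch_5.to_csv("sentiment_results_4_1000.csv", index=False)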

'FTC Launches Operation AI Comply with Five Enforcement Actions Involving AI Misuse – AI: The Washington Report'

In [None]:
# 5th batch 7k-10k
# prompts for 7k-10k data points
prompts = []

for i in range(0,3000):
    headline = df['title'].iloc[7000+i]
    categories = ', '.join(list(df["pred_set"].iloc[7000+i]))
    s = prompt_template.format(headline=headline,categories=categories )
    prompts.append(s)
prompts

['Analyze the sentiment of the following news headline and classify it as Positive, Negative, or Neutral, considering the listed categories.\n\nHeadline: "Anthropic’s New AI Feature Mimics Human Travel Agents" , Categories: {AI in various Industries}\n\nInstructions:\n1.Focus on the sentiment in the context of the given categories.\n2.Provide only one sentiment label: Positive, Negative, or Neutral.\n3.Respond with just the sentiment label—no explanation needed."\n',
 'Analyze the sentiment of the following news headline and classify it as Positive, Negative, or Neutral, considering the listed categories.\n\nHeadline: "Guillermo del Toro on AI: \'It can do semi-compelling screensavers\'" , Categories: {AI in various Industries}, {Other}\n\nInstructions:\n1.Focus on the sentiment in the context of the given categories.\n2.Provide only one sentiment label: Positive, Negative, or Neutral.\n3.Respond with just the sentiment label—no explanation needed."\n',
 'Analyze the sentiment of the f

In [None]:
# 5th batch 7k-10k
# Store responses for 7k-10k data points
responses = {}

for i in range(0,3000):
    key = 7000+i
    prompt = prompts[i]  # prompts[i] was built from row 7000+i, so it matches key
    # Generate completion for each prompt
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        store=True,  # Save interactions for future analysis
        temperature=0,  # Ensure reproducibility
        messages=[{"role": "user", "content": prompt}]
    )
    # Store the response content
    responses[key] = completion.choices[0].message.content
    #print(f"response for prompt {i}: {responses[key]}")

In [None]:
# Save responses to a JSON file for future use
with open("datasets/responses7k10k.json", "w") as file:
    json.dump(responses, file, indent=4)
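
When this file is read back, note that `json.dump` writes dictionary keys as strings, so the row indices need converting back to integers. The sketch below also tidies the raw labels; `load_sentiments` and `VALID_LABELS` are hypothetical names, and the cleanup rules are an assumption about what the responses may look like.

```python
import json

VALID_LABELS = ("Positive", "Negative", "Neutral")


def load_sentiments(path):
    """Load saved responses, restoring integer row keys and tidying labels.

    Anything that is not exactly one of the three expected labels after
    cleanup is mapped to None so it can be reviewed manually.
    """
    with open(path) as f:
        raw = json.load(f)
    cleaned = {}
    for key, text in raw.items():
        # Drop surrounding whitespace, periods, and quotes, then fix casing.
        label = text.strip().strip('."\'').capitalize()
        cleaned[int(key)] = label if label in VALID_LABELS else None
    return cleaned
```

The returned dictionary can be mapped onto rows 7000 to 9999 of `df` by index to line up with the four CSV batches.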