This script processes a dataset of financial news articles to generate summaries and sentiment labels using the OpenAI API. It reads the articles from a CSV file, drops entries with missing text, and converts the `llm_output` column to strings. It then iterates through the articles and sends each one to the OpenAI model with a prompt that asks for a five-line summary, the sentiment towards a specific company, and the relevance of the article to that company. The generated output is stored back in the dataset for further analysis or modeling.


This code block reads a CSV file of full articles into a pandas DataFrame and performs basic data cleaning. It prints the total number of articles, drops any rows where the `full_article` column is missing (`NaN`), and prints the number of articles that remain. Finally, the `llm_output` column is converted to string type so that the later comparison against the string `"0"` behaves consistently.


In [None]:
import pandas as pd
from tqdm.auto import tqdm
from openai import OpenAI

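# Load the articles; one row per article.
full_articles = pd.read_csv("articles_full.csv")
print("Number of articles before cleaning: {}".format(len(full_articles)))
# Drop rows with no article text; .copy() avoids a SettingWithCopyWarning below.
full_articles = full_articles[full_articles['full_article'].notna()].copy()
print("Number of articles after cleaning: {}".format(len(full_articles)))
# Store llm_output as strings so unprocessed rows can be matched against "0".
full_articles["llm_output"] = full_articles["llm_output"].astype(str)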
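
To see why the string conversion matters, here is a small sketch on a made-up three-row DataFrame (the rows and values are hypothetical, not taken from `articles_full.csv`). It applies the same two cleaning steps and checks that a numeric `0` placeholder becomes the string `"0"`, which is what the later loop compares against.

```python
import pandas as pd

# Hypothetical toy data standing in for articles_full.csv.
toy = pd.DataFrame({
    "full_article": ["Acme beats earnings estimates.", None, "Globex recalls product."],
    "holding": ["Acme", "Initech", "Globex"],
    "llm_output": [0, 0, 0],  # 0 marks an article that has not been processed yet
})

# Same cleaning steps as above: drop missing text, then cast to string.
cleaned = toy[toy["full_article"].notna()].copy()
cleaned["llm_output"] = cleaned["llm_output"].astype(str)

# Rows still waiting for a model response.
unprocessed = cleaned[cleaned["llm_output"] == "0"]
print(len(toy), len(cleaned), len(unprocessed))
```

One caveat: if pandas reads the column as floats (for example because some cells are empty), `astype(str)` yields `"0.0"` rather than `"0"`, and those rows would be skipped by a comparison against `"0"`.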

In this section, the script uses the OpenAI API to generate a summary and sentiment label for each financial news article. A `client` is created with the OpenAI API key, and a detailed system prompt (`message`) tells the model how to summarize the articles. The prompt asks for a five-line summary (two lines for very short articles), the overall sentiment towards the given company (positive or negative, or "Unsure" when the model is not confident), and the relevance of the article to that company (High, Low, or NA).

The script then loops over every article in the dataset. For each article that has not been processed yet (where `llm_output` is `"0"`), the article text and the associated company name from the `holding` column are sent to the OpenAI model. The model returns text containing the summary, sentiment, relevance, and company name, which is written back into the `llm_output` column. Progress is saved to `articles_with_llm.csv` periodically during the loop and once more at the end, and the resulting column can then serve as input for building a sentiment analysis model.


In [None]:

# Replace the placeholder with your own API key; avoid committing real keys.
client = OpenAI(api_key="{YOUR_KEY}")
message = """
Consider you are providing a TLDR of a finance news article for a project.
The summary needs to be valuable to build a sentiment analysis model on it.
If the article is already less than 5 lines, then provide a 2 line summary, else always provide 5 line summary.
I will also provide the company name for which the article was extracted.
Following should be the format of your output.
Try to make the summary with respect to the company name I am providing.
The sentiment should be with respect to the company name I am providing.
If you are not confident on the sentiment, put "Unsure".
For relevance, based on the article provide if the article is high or low relevant. If not relevant put NA.

Summary: <Summary of the article in 5 lines>
Overall sentiment: <Positive, Negative or Unsure>
Relevance to company: <High, Low or NA>
Company Name: <Repeat the company I am providing>
"""
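processed = 0  # number of articles sent to the model in this run
for index, row in tqdm(full_articles.iterrows(), total=full_articles.shape[0]):
    # Skip rows that already hold a model response ("0" marks unprocessed rows).
    if row["llm_output"] != "0":
        continue
    # Checkpoint every 100 processed articles so an interrupted run loses little work.
    # Counting processed rows is more reliable than index % 100, because the
    # index has gaps after the NaN filter.
    if processed > 0 and processed % 100 == 0:
        full_articles.to_csv("articles_with_llm.csv")
    article = row['full_article']
    holding = row['holding']
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            # The system message carries the summarization instructions.
            {"role": "system", "content": message},
            # The user message carries the company name and the article text.
            {"role": "user", "content": "company name: {} \n{}".format(holding, article)},
        ],
    )
    full_articles.at[index, 'llm_output'] = response.choices[0].message.content
    processed += 1

# Final save once the loop has finished.
full_articles.to_csv("articles_with_llm.csv")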
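
The model's reply is stored as one free-text string per article. For downstream modeling it is usually convenient to split it into separate fields. The following is a minimal sketch, assuming the reply follows the four-label format requested in the prompt. The function `parse_llm_output`, the column names in `FIELDS`, and the sample reply are hypothetical and introduced only for illustration. Real replies may deviate from the format, in which case missing fields come back as `None`.

```python
import re

# Labels requested in the prompt, mapped to hypothetical column names.
FIELDS = {
    "Summary": "summary",
    "Overall sentiment": "sentiment",
    "Relevance to company": "relevance",
    "Company Name": "company",
}

def parse_llm_output(text):
    """Split a reply in the prompt's format into a dict; missing fields are None."""
    labels = "|".join(re.escape(label) for label in FIELDS)
    # Capture each label and everything up to the next label or the end of the text.
    pattern = re.compile(
        r"^({}):\s*(.*?)(?=^(?:{}):|\Z)".format(labels, labels),
        re.MULTILINE | re.DOTALL,
    )
    parsed = {column: None for column in FIELDS.values()}
    for label, value in pattern.findall(text):
        parsed[FIELDS[label]] = value.strip()
    return parsed

# Hypothetical reply, written by hand to match the requested format.
sample = """Summary: Acme reported higher quarterly revenue.
Margins also improved.
Overall sentiment: Positive
Relevance to company: High
Company Name: Acme"""

print(parse_llm_output(sample))
```

Applied to the whole column, for example with `full_articles["llm_output"].apply(parse_llm_output)`, this would give one dict per article, which can then be expanded into separate columns. Rows where `sentiment` is `None` are worth inspecting by hand, since they indicate that the model did not follow the format.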