# Introduction


In this notebook, we explore a dataset of financial articles scraped from Yahoo Finance. The primary goal is to analyze the sentiment of these articles and derive meaningful insights from the textual data. We will perform the following steps:

1. **Imports**: We will import the necessary libraries and modules required for data manipulation, natural language processing, and sentiment analysis. This includes libraries for web scraping, data handling, sentiment analysis, and visualization.

2. **Data Loading**: We will load the dataset containing the scraped articles, which are stored in JSON format. This dataset comprises various financial topics, allowing us to conduct a comprehensive sentiment analysis.

3. **Data Cleaning and Preprocessing**: The loaded data will undergo several preprocessing steps, including:
   - Removing duplicates and handling missing values.
   - Cleaning the text by removing URLs, special characters, and stop words.
   - Converting dates to the appropriate format for analysis.

4. **Sentiment Analysis**: We will utilize Natural Language Processing (NLP) techniques to analyze the sentiment of the articles. This will involve:
   - Applying the VADER sentiment analysis tool for initial scoring.
   - Leveraging pre-trained transformer models, such as FinBERT, to gain deeper insights into the sentiment conveyed in the articles.

5. **Entity and Emotion Extraction**: Using spaCy and the NRC emotion lexicon (via the `nrclex` package), we will extract entities and associated emotions from the text to enrich our analysis further.

6. **Data Visualization and Reporting**: Finally, we will visualize the sentiment scores and present the findings in a structured format, allowing for an easier interpretation of the financial sentiment trends over time.

By the end of this notebook, we aim to obtain a consolidated DataFrame that captures essential information and insights derived from the financial articles, paving the way for informed decision-making in financial contexts.

# Imports and Data Loading

### Imports

In this section, we import essential libraries for data processing and analysis. Key libraries include:
- **Requests** and **BeautifulSoup** for web scraping.
- **Pandas** and **NumPy** for data manipulation and numerical operations.
- **NLTK** for natural language processing tasks, including sentiment analysis with VADER.
- **Transformers** for pre-trained models such as FinBERT (financial sentiment) and BART (summarization).
- **spaCy** for efficient entity extraction.
- **nrclex** for emotion analysis based on the NRC lexicon.
- **BERTopic** and **SentenceTransformers** for topic modeling and semantic similarity.

These libraries form the backbone of our analysis workflow.


In [None]:
# ! pip install requests beautifulsoup4 pandas numpy dash dash-bootstrap-components plotly wordcloud matplotlib nltk textblob nrclex spacy bertopic sentence_transformers transformers

In [1]:
import requests
from bs4 import BeautifulSoup
import urllib.parse
import time
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize
from transformers import pipeline
import spacy
from nrclex import NRCLex
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer


In [2]:
# Download the 'stopwords' corpus
nltk.download('stopwords')

# Optionally, download 'punkt' if you haven't already
nltk.download('punkt')

nltk.download('punkt_tab')

nltk.download('vader_lexicon')

[nltk_data] Downloading package stopwords to C:\Users\Arie
[nltk_data]     Gruber\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Arie
[nltk_data]     Gruber\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\Arie
[nltk_data]     Gruber\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package vader_lexicon to C:\Users\Arie
[nltk_data]     Gruber\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

### Data Loading

In this section, we load the dataset containing scraped articles from Yahoo Finance, specifically focusing on Microsoft. The dataset features the following columns: **Date**, **Title**, **Author**, and **Text**. It encompasses articles from January 2024 to mid-October 2024. After completing the data cleaning process in this notebook, we expect to have 1,083 unique articles for analysis.

In [20]:
# Read one JSON file per month (January to October 2024) and combine them in a single pass
paths = [f'./articles/2024/{month}_articles.json' for month in range(1, 11)]
df = pd.concat([pd.read_json(path) for path in paths], ignore_index=True)
df.head()

Unnamed: 0,Date,Title,Author,Text
0,2024-01-31 11:56:19+00:00,"Microsoft beats Q2 earnings on AI, cloud strength",Daniel Howley · Technology Editor,Microsoft (MSFT) announced its second quarter ...
1,2024-01-03 15:00:00+00:00,"If You Invested $10,000 in Microsoft When Saty...","Jeremy Bowman, The Motley Fool",Microsoft (NASDAQ: MSFT) is back on top of the...
2,2024-01-09 11:00:00+00:00,Why Microsoft Stock Rallied 57% in 2023,"Danny Vena, The Motley Fool",Shares of Microsoft (NASDAQ: MSFT) charged sha...
3,2024-01-30 21:35:57+00:00,Microsoft Corp (MSFT) Reports Robust Growth wi...,GuruFocus Research,"Revenue: $62.0 billion, an 18% increase year-o..."
4,2024-01-04 08:01:09+00:00,Microsoft is adding an AI button to PC keyboar...,Daniel Howley · Technology Editor,Microsoft (MSFT) is doubling down on its commi...


# Data Cleaning and Preprocessing

In this section, we focus on cleaning and preprocessing the dataset to ensure it is ready for analysis. The first step involves converting the **Date** column into a datetime format to facilitate time-based operations. After sorting the DataFrame by the date, we remove any duplicate entries based on the **Title** and **Date** columns, ensuring each article is unique. We also drop any rows with missing values and reset the index for a clean DataFrame. Finally, we extract the month from the date for potential grouping and analysis.

In [21]:
# Convert the date column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Sort the DataFrame by the 'Date' column
df = df.sort_values(by='Date')
print('shape before removing duplicates' +str(df.shape))
# Remove duplicate rows based on 'Title' and 'Date'
df = df.drop_duplicates(subset=['Title', 'Date'])
df.dropna(inplace=True)
# Optionally, reset the index if you want a clean index after dropping duplicates
df.reset_index(drop=True, inplace=True)

print('shape after removing duplicates' +str(df.shape))
# Extract month for grouping
df['Month'] = df['Date'].dt.to_period('M')


df.head()

shape before removing duplicates(1491, 4)
shape after removing duplicates(1083, 4)


Unnamed: 0,Date,Title,Author,Text,Month
0,2024-01-01 12:00:27+00:00,Investors in Microsoft (NASDAQ:MSFT) have seen...,editorial-team@simplywallst.com (Simply Wall...,The most you can lose on any stock (assuming y...,2024-01
1,2024-01-02 15:09:36+00:00,Best AI Stock 2024: Alphabet Stock vs. Microso...,"Parkev Tatevosian, CFA, The Motley Fool",Fool.com contributor Parkev Tatevosian compare...,2024-01
2,2024-01-03 14:42:51+00:00,1 Artificial Intelligence (AI) Stock Poised to...,"Parkev Tatevosian, CFA, The Motley Fool",Fool.com contributor Parkev Tatevosian highlig...,2024-01
3,2024-01-03 15:00:00+00:00,"If You Invested $10,000 in Microsoft When Saty...","Jeremy Bowman, The Motley Fool",Microsoft (NASDAQ: MSFT) is back on top of the...,2024-01
4,2024-01-03 15:29:29+00:00,Microsoft Copilot is now available on iOS and ...,Aisha Malik,"Over the holiday season, Microsoft quietly lau...",2024-01


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1083 entries, 0 to 1082
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype              
---  ------  --------------  -----              
 0   Date    1083 non-null   datetime64[ns, UTC]
 1   Title   1083 non-null   object             
 2   Author  1083 non-null   object             
 3   Text    1083 non-null   object             
 4   Month   1083 non-null   period[M]          
dtypes: datetime64[ns, UTC](1), object(3), period[M](1)
memory usage: 42.4+ KB


### Text Cleaning

In the upcoming cell, we define a function to clean the text data in the **Text** column of our DataFrame. This function aims to remove any URLs, special characters, numbers, and excessive whitespace from the text. Additionally, all text is converted to lowercase to standardize it, making it easier to analyze later. The cleaned text will be stored in a new column, **clean_content_context**.


In [23]:
# Function to clean text
def clean_text(text):
    # Remove URLs, special characters, numbers, and extra spaces
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\@\w+|\#','', text)
    text = re.sub(r'[^A-Za-z\s]', '', text)
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply cleaning
df['clean_content_context'] = df['Text'].apply(clean_text)

# Display the first few rows of the cleaned dataset
df.head()

Unnamed: 0,Date,Title,Author,Text,Month,clean_content_context
0,2024-01-01 12:00:27+00:00,Investors in Microsoft (NASDAQ:MSFT) have seen...,editorial-team@simplywallst.com (Simply Wall...,The most you can lose on any stock (assuming y...,2024-01,the most you can lose on any stock assuming yo...
1,2024-01-02 15:09:36+00:00,Best AI Stock 2024: Alphabet Stock vs. Microso...,"Parkev Tatevosian, CFA, The Motley Fool",Fool.com contributor Parkev Tatevosian compare...,2024-01,foolcom contributor parkev tatevosian compares...
2,2024-01-03 14:42:51+00:00,1 Artificial Intelligence (AI) Stock Poised to...,"Parkev Tatevosian, CFA, The Motley Fool",Fool.com contributor Parkev Tatevosian highlig...,2024-01,foolcom contributor parkev tatevosian highligh...
3,2024-01-03 15:00:00+00:00,"If You Invested $10,000 in Microsoft When Saty...","Jeremy Bowman, The Motley Fool",Microsoft (NASDAQ: MSFT) is back on top of the...,2024-01,microsoft nasdaq msft is back on top of the te...
4,2024-01-03 15:29:29+00:00,Microsoft Copilot is now available on iOS and ...,Aisha Malik,"Over the holiday season, Microsoft quietly lau...",2024-01,over the holiday season microsoft quietly laun...
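
To see what the cleaning step keeps and discards, the sketch below runs the same regular expressions on a made-up sentence (the sample text is an assumption for illustration, not a row from the dataset). The function is repeated so the snippet runs on its own.

```python
import re

def clean_text(text):
    # Same steps as the cell above, repeated so this snippet is self-contained
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\@\w+|\#', '', text)
    text = re.sub(r'[^A-Za-z\s]', '', text)
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Made-up sample text, not taken from the dataset
sample = "Microsoft (MSFT) rose 3.5% on Q2 results. See https://example.com for details!"
cleaned = clean_text(sample)
print(cleaned)  # microsoft msft rose on q results see for details
```

Two side effects are worth keeping in mind: digits disappear entirely (so "Q2" becomes "q" and "3.5%" is dropped), and sentence punctuation is removed, which leaves a punctuation-based sentence tokenizer with little to split on in later steps.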


### Stop Words Removal

In this section, we address the removal of stop words from the cleaned text. Stop words are common words (like "and," "the," "is," etc.) that may not contribute significant meaning to the analysis. By eliminating these words, we aim to enhance the quality of our textual data and focus on more informative terms. The cleaned text without stop words will be stored in a new column, **clean_content_no_stopwords**.

In [24]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])

df['clean_content_no_stopwords'] = df['clean_content_context'].apply(remove_stopwords)
df.head()

Unnamed: 0,Date,Title,Author,Text,Month,clean_content_context,clean_content_no_stopwords
0,2024-01-01 12:00:27+00:00,Investors in Microsoft (NASDAQ:MSFT) have seen...,editorial-team@simplywallst.com (Simply Wall...,The most you can lose on any stock (assuming y...,2024-01,the most you can lose on any stock assuming yo...,lose stock assuming dont use leverage money br...
1,2024-01-02 15:09:36+00:00,Best AI Stock 2024: Alphabet Stock vs. Microso...,"Parkev Tatevosian, CFA, The Motley Fool",Fool.com contributor Parkev Tatevosian compare...,2024-01,foolcom contributor parkev tatevosian compares...,foolcom contributor parkev tatevosian compares...
2,2024-01-03 14:42:51+00:00,1 Artificial Intelligence (AI) Stock Poised to...,"Parkev Tatevosian, CFA, The Motley Fool",Fool.com contributor Parkev Tatevosian highlig...,2024-01,foolcom contributor parkev tatevosian highligh...,foolcom contributor parkev tatevosian highligh...
3,2024-01-03 15:00:00+00:00,"If You Invested $10,000 in Microsoft When Saty...","Jeremy Bowman, The Motley Fool",Microsoft (NASDAQ: MSFT) is back on top of the...,2024-01,microsoft nasdaq msft is back on top of the te...,microsoft nasdaq msft back top tech world days...
4,2024-01-03 15:29:29+00:00,Microsoft Copilot is now available on iOS and ...,Aisha Malik,"Over the holiday season, Microsoft quietly lau...",2024-01,over the holiday season microsoft quietly laun...,holiday season microsoft quietly launched copi...
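
One caveat with stop word removal is that NLTK's English list includes negations such as "not" and "no". The sketch below uses a small hand-picked subset of stop words (an assumption for illustration, not the full NLTK list) to show how two sentences with opposite meaning can collapse to the same text.

```python
# Small hand-picked subset for illustration; the notebook uses NLTK's full English list
stop_words_subset = {"the", "is", "not", "a", "and", "no"}

def remove_stopwords_demo(text, stop_words):
    # Keep only the tokens that are not in the stop word set
    return ' '.join(word for word in text.split() if word not in stop_words)

negated = remove_stopwords_demo("the outlook is not good", stop_words_subset)
plain = remove_stopwords_demo("the outlook is good", stop_words_subset)
print(negated)  # outlook good
print(plain)    # outlook good
```

This is one reason to run sentiment models on **clean_content_context**, which still contains stop words, and to reserve **clean_content_no_stopwords** for steps where negation matters less.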


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1083 entries, 0 to 1082
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   Date                        1083 non-null   datetime64[ns, UTC]
 1   Title                       1083 non-null   object             
 2   Author                      1083 non-null   object             
 3   Text                        1083 non-null   object             
 4   Month                       1083 non-null   period[M]          
 5   clean_content_context       1083 non-null   object             
 6   clean_content_no_stopwords  1083 non-null   object             
dtypes: datetime64[ns, UTC](1), object(5), period[M](1)
memory usage: 59.4+ KB


# Sentiment and Emotion Analysis of Financial News Articles

### Sentiment Analysis with VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool originally designed for social media text. It works well on short texts and accounts for emoticons, slang, and common expressions. For each text it returns the proportions of negative, neutral, and positive content (`neg`, `neu`, `pos`) and a normalized `compound` score between -1 and 1. Its lexicon is general-purpose rather than finance-specific, so we use it here as a fast baseline for the Yahoo Finance articles before turning to a finance-tuned model.

In [26]:
sia = SentimentIntensityAnalyzer()

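def get_vader_sentiment_scores(text):
    sentences = sent_tokenize(text)
    if not sentences:
        return {}

    # Score each sentence, then average the scores so that later sentences
    # do not overwrite earlier ones. The cleaned text has no punctuation,
    # so most articles are scored as a single sentence.
    scores = [sia.polarity_scores(sentence) for sentence in sentences]
    return {key: sum(score[key] for score in scores) / len(scores) for key in scores[0]}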

vader_sentiments_list = list(df['clean_content_context'].apply(get_vader_sentiment_scores).values)
vader_sentiments_df = pd.DataFrame(vader_sentiments_list)
vader_sentiments_df.dropna(inplace=True)

vader_sentiments_df.head()

Unnamed: 0,neg,neu,pos,compound
0,0.026,0.752,0.222,0.9994
1,0.082,0.834,0.084,0.3612
2,0.083,0.832,0.084,0.0772
3,0.062,0.789,0.149,0.9976
4,0.014,0.867,0.119,0.9869


In [27]:
vader_sentiments_df.shape

(1083, 4)
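
The `compound` score can be turned into a coarse label. The sketch below assumes the commonly used cutoffs of ±0.05 suggested in the VADER documentation; the helper name `label_from_compound` is hypothetical and the threshold can be adjusted.

```python
def label_from_compound(compound, threshold=0.05):
    # Map a VADER compound score in [-1, 1] to a coarse sentiment label
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# The first and third compound values from the table above, plus two made-up values
labels = [label_from_compound(value) for value in (0.9994, 0.0772, 0.0, -0.3)]
print(labels)  # ['positive', 'positive', 'neutral', 'negative']
```

Applied to the notebook's data, this would be `vader_sentiments_df['compound'].apply(label_from_compound)`.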

### Sentiment Analysis with FinBERT

FinBERT is a pre-trained transformer-based model fine-tuned specifically for financial sentiment analysis. It provides a deeper understanding of sentiments in financial texts by leveraging the contextual information that transformer models offer. This is particularly important for our analysis of Yahoo Finance articles, as financial language often contains industry-specific terminology and nuanced meanings that traditional sentiment analysis tools might miss.

In [None]:
# Initialize FinBERT pipeline for sentiment analysis using a model fine-tuned for financial text
try:
    finbert_pipeline = pipeline("sentiment-analysis", model="yiyanghkust/finbert-tone", tokenizer="yiyanghkust/finbert-tone")
except Exception as e:
    print(f"Error loading FinBERT model: {e}")
    raise  # the cells below cannot run without the pipeline

def get_finbert_sentiment_scores(text):
    """
    Function to compute FinBERT sentiment scores on financial text.
    It splits the text into sentences and handles cases where sentences are too long for the model.

    Args:
        text (str): The input financial text to analyze.

    Returns:
        list: A list of dictionaries containing sentiment scores for each sentence.
    """
    try:
        # Tokenize the text into individual sentences
        sentences = sent_tokenize(text)
        finbert_scores = []

        # Iterate over sentences for sentiment analysis
        for sentence in sentences:
            # If the sentence is longer than 512 characters, split it into 512-character chunks
            # (a rough proxy for the model's 512-token input limit)
            if len(sentence) > 512:
                chunks = [sentence[i:i + 512] for i in range(0, len(sentence), 512)]
                for chunk in chunks:
                    try:
                        result = finbert_pipeline(chunk)
                        finbert_scores.append(result[0])  # Store the result of each chunk
                    except Exception as e:
                        print(f"Error analyzing chunk: {chunk[:30]}... -> {e}")
            else:
                # Analyze normally if the sentence is within the 512-character limit
                try:
                    result = finbert_pipeline(sentence)
                    finbert_scores.append(result[0])  # Store the result for each sentence
                except Exception as e:
                    print(f"Error analyzing sentence: {sentence[:30]}... -> {e}")

        return finbert_scores
    except Exception as e:
        print(f"Error during sentiment analysis: {e}")
        return []

# Example: Apply the function to a dataframe column containing financial text
# df['finbert_sentiments'] = df['clean_content_context'].apply(get_finbert_sentiment_scores)
finbert = df['clean_content_context'].apply(get_finbert_sentiment_scores)
finbert.dropna(inplace=True)

finbert.head()

### Mean Sentiment Scores

Averaging FinBERT's per-chunk results gives one row per article. For each article, the chunk-level predictions are grouped by predicted label and their confidence scores are averaged. Each value in the resulting table is therefore the mean confidence of the chunks assigned that label, with 0 where no chunk received the label. The three columns do not sum to 1 and do not show how many chunks fell under each label, so they indicate which tones are present in an article rather than which tone dominates.

In [28]:
# Function to return a DataFrame with mean scores for each label
def get_mean_scores_df(label_list, labels=('Negative', 'Neutral', 'Positive')):
    # Convert the list of dictionaries to DataFrame
    row_data = pd.DataFrame(label_list).groupby('label').mean().T

    # Ensure all columns (labels) are present, and fill missing ones with 0
    for label in labels:
        if label not in row_data.columns:
            row_data[label] = 0

    # Reorder the columns to match the desired order (only if they exist in the data)
    row_data = row_data[[label for label in labels if label in row_data.columns]]

    return row_data

# Build one row of mean scores per article and stack the rows
finbert_mean_df = pd.concat([get_mean_scores_df(x) for x in finbert], ignore_index=True)
finbert_mean_df.dropna(inplace=True)

finbert_mean_df.head()

label,Negative,Neutral,Positive
0,0.0,0.995125,0.99999
1,0.0,0.989776,0.999763
2,0.0,0.999922,0.962712
3,0.956777,0.999995,0.999816
4,0.0,0.992111,0.951249
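
Mean confidence per label does not show which label is most frequent within an article. A complementary view is the share of chunks assigned to each label. The sketch below is one possible way to compute it; the helper name `label_shares` and the example input are assumptions, with the input shaped like the list of `{'label', 'score'}` dictionaries returned by `get_finbert_sentiment_scores`.

```python
from collections import Counter

def label_shares(results, labels=('Negative', 'Neutral', 'Positive')):
    # Count how many chunks were assigned each label
    counts = Counter(item['label'] for item in results)
    total = sum(counts.values())
    if total == 0:
        # No chunks were scored, so every share is zero
        return {label: 0.0 for label in labels}
    # Convert counts to fractions that sum to 1
    return {label: counts.get(label, 0) / total for label in labels}

# Made-up pipeline-style output for a single article
example = [
    {'label': 'Neutral', 'score': 0.99},
    {'label': 'Neutral', 'score': 0.95},
    {'label': 'Positive', 'score': 0.98},
    {'label': 'Negative', 'score': 0.90},
]
shares = label_shares(example)
print(shares)  # {'Negative': 0.25, 'Neutral': 0.5, 'Positive': 0.25}
```

With the notebook's data, `pd.DataFrame(finbert.apply(label_shares).tolist())` would give one row of shares per article.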


### Summarization Using BART

BART (Bidirectional and Auto-Regressive Transformers) is a transformer model designed for various natural language processing tasks, including text summarization. It effectively generates coherent and contextually relevant summaries from long documents. In our project, we utilize BART to distill the essential information from the financial news articles, allowing us to focus on the main insights without wading through extensive text.

In [112]:
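from transformers import pipeline

bart_limit = 512  # maximum number of words per chunk

# Load the model once; creating the pipeline inside the function would reload it for every article
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def get_summary(article):
    words = article.split()

    if len(words) > bart_limit:
        # Split long articles into chunks of at most bart_limit words (not characters)
        chunks = [' '.join(words[index:index + bart_limit]) for index in range(0, len(words), bart_limit)]
        # truncation=True is a safety net in case a chunk still exceeds the model's input limit
        summaries = summarizer(chunks, max_length=100, truncation=True)
        return ' '.join(summary['summary_text'] for summary in summaries).strip()
    else:
        return summarizer(article, max_length=130, truncation=True)[0]['summary_text']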

In [None]:
# get_summary is defined in the previous cell; redefining it here is unnecessary

# Ensure that the input data is a string
df['clean_content_context'] = df['clean_content_context'].astype(str)

# Summarize the text and create a DataFrame for summaries
summary_df = pd.DataFrame(df['clean_content_context'].apply(get_summary)).rename({'clean_content_context': 'Summary'}, axis=1)
summary_df.head()

Your max_length is set to 100, but your input_length is only 97. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)
Your max_length is set to 100, but your input_length is only 92. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=46)
Your max_length is set to 100, but your input_length is only 94. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=47)
Your max_length is set to 100, but your input_length is only 97. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)
Your

### Entity and Emotion Extraction

To complement the sentiment scores, we extract named entities and emotions from each article. spaCy splits the text into sentences and tags the entities in each one, and NRCLex counts how many words in the sentence are associated with each NRC emotion category (such as anger, fear, joy, or trust). Because the input here is the lowercased, punctuation-free text, entity recognition may miss names that it would catch in the original text.

In [None]:
! pip install spacy
! python -m spacy download en_core_web_sm

In [40]:
import spacy
from nrclex import NRCLex

# Load SpaCy model (ensure you have it installed, e.g., 'en_core_web_sm')
nlp = spacy.load('en_core_web_sm')

def extract_entities_and_emotions(text):
    doc = nlp(text)
    results = []

    for sent in doc.sents:
        sent_text = sent.text
        entities = [ent.text for ent in sent.ents]
        emotions = NRCLex(sent_text).raw_emotion_scores
        results.append({
            'sentence': sent_text,
            'entities': entities,
            'emotions': emotions
        })

    return results

# df['entities_and_emotions'] = df['clean_content_context'].apply(extract_entities_and_emotions)
entities_and_emotions = df['clean_content_no_stopwords'].apply(extract_entities_and_emotions)

In [41]:
import numpy as np
import pandas as pd

def extract_emotions(row):
    # Initialize an empty dictionary to store sums of emotions
    summed_emotions = {}
    count = len(row)  # Number of dictionaries in the row

    # Loop through each index in the row
    for index in range(count):  # Assuming row is a list of dictionaries
        emotions_dict = row[index]['emotions']  # Extract the emotions dictionary

        # Add up emotion values, initializing keys if they don't exist yet
        for emotion, value in emotions_dict.items():
            if emotion in summed_emotions:
                summed_emotions[emotion] += value
            else:
                summed_emotions[emotion] = value

    # Calculate the mean by dividing the summed values by the number of entries
    mean_emotions = {emotion: value / count for emotion, value in summed_emotions.items()}

    # Convert the dictionary to a DataFrame row
    return pd.DataFrame([mean_emotions])

# Apply the function to every article in the entities_and_emotions Series
emotions_df = pd.concat([extract_emotions(row) for row in entities_and_emotions], ignore_index=True)

# Replace NaN values with 0 if necessary
emotions_df.replace(np.nan, 0, inplace=True)
emotions_df.head()

Unnamed: 0,anger,disgust,fear,negative,sadness,surprise,positive,anticipation,joy,trust
0,2.0,2.0,5.0,6.0,3.0,3.0,49.0,29.0,25.0,35.0
1,1.0,6.0,1.0,10.0,0.0,1.0,9.0,6.0,2.0,6.0
2,1.0,6.0,3.0,11.0,0.0,1.0,13.0,6.0,4.0,9.0
3,4.0,7.0,6.0,20.0,5.0,6.0,36.0,15.0,7.0,20.0
4,0.0,0.0,3.0,0.0,0.0,0.0,24.0,16.0,7.0,4.0
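
Raw emotion counts grow with article length, which makes long and short articles hard to compare. One option is to convert the counts to proportions. The sketch below is a minimal illustration; the helper name `normalize_emotions` and the example counts are assumptions.

```python
def normalize_emotions(raw_scores):
    # Convert raw emotion word counts into proportions that sum to 1
    total = sum(raw_scores.values())
    if total == 0:
        # No emotion words found, so every proportion is zero
        return {emotion: 0.0 for emotion in raw_scores}
    return {emotion: count / total for emotion, count in raw_scores.items()}

# Made-up counts in the same shape as NRCLex raw_emotion_scores
proportions = normalize_emotions({'positive': 6, 'trust': 3, 'fear': 1})
print(proportions)  # {'positive': 0.6, 'trust': 0.3, 'fear': 0.1}
```

For the table above, the same idea can be applied row-wise with `emotions_df.div(emotions_df.sum(axis=1), axis=0)`.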


In [None]:
# Adjust the extract_emotions function to work with entities
def extract_emotions_by_entity(row):
    # Initialize an empty dictionary to store sums of emotions per entity
    summed_emotions = {}
    count_dict = {}

    # Loop through each dictionary in the row (list of dictionaries)
    for entry in row:
        emotions_dict = entry['emotions']

        # extract_entities_and_emotions stores a list under 'entities',
        # so the sentence's emotions are credited to every entity it mentions
        for entity in entry['entities']:
            # Initialize entity emotion tracking if it doesn't exist
            if entity not in summed_emotions:
                summed_emotions[entity] = {}
                count_dict[entity] = 0

            # Add up emotion values for the entity
            for emotion, value in emotions_dict.items():
                summed_emotions[entity][emotion] = summed_emotions[entity].get(emotion, 0) + value

            # Increase the count for the entity
            count_dict[entity] += 1

    # Calculate the mean for each entity's emotions
    mean_emotions = {
        entity: {emotion: value / count_dict[entity] for emotion, value in emotions.items()}
        for entity, emotions in summed_emotions.items()
    }

    # Convert the dictionary to a DataFrame with each entity's average emotions
    return pd.DataFrame([{'entity': entity, **emotions} for entity, emotions in mean_emotions.items()])


# Apply the function to every row in the DataFrame
tmp = pd.concat(
    entities_and_emotions.apply(lambda row: extract_emotions_by_entity(row)).tolist(),
    ignore_index=True
)
tmp.head()

# Data Preparation and Final DataFrame Creation

In this section, we will prepare our data by adding prefixes to each DataFrame for better identification and then concatenate them into a single final DataFrame. This consolidated DataFrame will allow for easier analysis and export of results.

### Adding Prefixes and Concatenating DataFrames

To differentiate between the various DataFrames we have generated, we will add specific prefixes to each DataFrame. This will help in identifying which columns belong to which sentiment analysis technique or original content. After adding the prefixes, we will compile a list of all the DataFrames and concatenate them into a single DataFrame for streamlined analysis.


In [None]:
# Add prefixes to each DataFrame
final_df = df.add_prefix('original_')  # Prefix for the original df
vader_sentiments_df_with_prefix = vader_sentiments_df.add_prefix('vader_')  # Prefix for VADER sentiment
finbert_df_with_prefix = finbert_mean_df.add_prefix('finbert_')  # Prefix for FinBERT sentiment
summary_df_prefix = summary_df.add_prefix('summary_')
emotions_df_with_prefix = emotions_df.add_prefix('emotions_')  # Prefix for emotions

In [None]:
print(final_df.shape)
print(vader_sentiments_df_with_prefix.shape)
print(finbert_df_with_prefix.shape)
print(summary_df_prefix.shape)
print(emotions_df_with_prefix.shape)

In [None]:
# List of DataFrames to concatenate
dfs = [final_df, vader_sentiments_df_with_prefix, finbert_df_with_prefix, 
       summary_df_prefix, emotions_df_with_prefix]

# Concatenate all DataFrames along axis=1
final_df = pd.concat(dfs, axis=1)

# Show the resulting DataFrame
final_df.head()
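
`pd.concat(..., axis=1)` aligns rows by index label, not by position, so the result is only correct if every DataFrame shares the index of the original `df`. The sketch below uses two made-up frames (the values are assumptions) to show what happens when one frame has lost a row.

```python
import pandas as pd

left = pd.DataFrame({'original_Title': ['a', 'b', 'c']})            # index 0, 1, 2
right = pd.DataFrame({'vader_compound': [0.1, 0.9]}, index=[0, 2])  # row 1 was dropped earlier

# Rows are matched on index labels; the missing label 1 becomes NaN
combined = pd.concat([left, right], axis=1)
print(combined)

# Resetting the index first pairs rows by position instead,
# which would silently attach 0.9 to title 'b'
misaligned = pd.concat([left, right.reset_index(drop=True)], axis=1)
print(misaligned)
```

Comparing the shapes printed in the previous cell is a quick check that no intermediate DataFrame has lost rows before concatenation.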


### Final DataFrame Processing

After concatenating our DataFrames, we will perform some final processing, including dropping any rows with missing values to ensure the integrity of our data.

In [None]:
final_df.dropna(inplace=True)
final_df.columns

In [None]:
final_df.to_csv(path_or_buf='./final_df.csv')

In [None]:
# final_df = pd.read_csv('/content/drive/MyDrive/microsoft_articles/final_df.csv')

In [None]:
final_df.columns
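
As a possible next step toward the sentiment trends over time mentioned in the introduction, the consolidated DataFrame can be grouped by month. The sketch below uses a made-up frame with the column names that the prefixes above would produce (`original_Month`, `vader_compound`); the values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows that mimic two columns of final_df
demo = pd.DataFrame({
    'original_Month': pd.PeriodIndex(['2024-01', '2024-01', '2024-02'], freq='M'),
    'vader_compound': [0.5, 0.1, -0.2],
})

# Average compound score and number of articles per month
monthly = demo.groupby('original_Month')['vader_compound'].agg(['mean', 'count'])
print(monthly)
```

The same call on `final_df` would give one row per month, which could then be plotted or exported alongside the CSV.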