# Sentiment and Embedding Analysis of News Data

This notebook focuses on analyzing news data and fake news data using various Natural Language Processing (NLP) techniques. The workflow includes:

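1. **Sentiment Analysis**: Classifying each text as negative, neutral, or positive with a pre-trained RoBERTa model (`cardiffnlp/twitter-roberta-base-sentiment`).
2. **Emotion Analysis**: Scoring seven emotions (joy, anger, fear, sadness, disgust, surprise, neutral) per text with a pre-trained DistilRoBERTa model (`j-hartmann/emotion-english-distilroberta-base`).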
3. **Model Integration**: Utilizing pre-trained models from Hugging Face Transformers and SentenceTransformers for advanced NLP tasks.
4. **Text Embeddings**: Generating embeddings using techniques like TF-IDF, Word2Vec, BERT, Bag of Words, and MiniLM.

Because of its computational requirements, the analysis was run in Google Colab. To reproduce the results, first complete the VADER analysis in the `../nlp` notebook, then make sure the resulting dataset is available at the following path:

`/content/drive/MyDrive/Dataset`
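
The notebook does not show how Google Drive was mounted. A minimal sketch of one way to do it is given below, assuming the standard Colab helper `google.colab.drive`. The function names `mount_drive_if_in_colab` and `dataset_available` are hypothetical and are not part of the original workflow.

```python
import os

DATASET_DIR = "/content/drive/MyDrive/Dataset"

def mount_drive_if_in_colab(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; do nothing elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True

def dataset_available(path=DATASET_DIR):
    """Return True if the dataset folder exists on the current machine."""
    return os.path.isdir(path)

if mount_drive_if_in_colab():
    print("Drive mounted")
print("Dataset folder found:", dataset_available())
```

Outside Colab the mount step is skipped, so the same cell can be used to check a local copy of the folder by passing a different `path`.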

In [None]:
!pip install numpy==1.26.4

In [None]:
!pip install gensim

In [1]:
# Loading the dataset from Google Drive
import os
import pandas as pd
folder_path = '/content/drive/MyDrive/Dataset'

In [None]:
# Function to load the dataset
def load_dataframes_from_folder(folder_path, startswith, key_position):
    """
    Load CSV files from a folder into a dictionary of dataframes.
    Each dataframe is keyed by the symbol extracted from the filename.

    Parameters:
        - folder_path (str): Path to the folder containing CSV files.
        - startswith (str): The prefix that the filenames should start with.
        - key_position (int): The position of the symbol in the filename when split by '_'.

    Returns:
        - dict: A dictionary where keys are symbols and values are dataframes.
    """

    dataframes_dict = {}

    # Loop through files in the folder
    for filename in os.listdir(folder_path):
        if filename.startswith(startswith):

            # Extract the key symbol
            ticker = filename.split('_')[key_position]

            # Read the CSV file into a dataframe
            file_path = os.path.join(folder_path, filename)
            df = pd.read_csv(file_path)

            # Add the dataframe to the dictionary
            dataframes_dict[ticker] = df

    return dataframes_dict
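
To illustrate how `key_position` selects the dictionary key, the sketch below applies the same `split('_')` logic to two file names. They follow the naming pattern this notebook uses when it saves its results (`news_{key}_roberta.csv` and `fake_news_{key}_roberta.csv`). The helper name `extract_key` is hypothetical.

```python
def extract_key(filename, key_position):
    """Return the part of the file name found at key_position after splitting on '_'."""
    return filename.split('_')[key_position]

# 'news_AAPL_roberta.csv' splits into ['news', 'AAPL', 'roberta.csv']
print(extract_key('news_AAPL_roberta.csv', 1))          # AAPL

# 'fake_news_testing_roberta.csv' has one extra leading part, hence position 2
print(extract_key('fake_news_testing_roberta.csv', 2))  # testing
```

If the key were the last part of the name (for example a hypothetical `news_AAPL.csv`), the extension would stay attached to it (`AAPL.csv`), so this approach relies on a suffix following the key.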

In [10]:
# Loading the datasets
news_dict_sent = load_dataframes_from_folder(folder_path, startswith = 'news_', key_position = 1)
fake_news_dict_sent = load_dataframes_from_folder(folder_path, startswith = 'fake_news', key_position = 2)

In [11]:
# Having a look at one of the stock dataframes
news_dict_sent['AAPL'].head(10)

Unnamed: 0,date,title,source,text_column_tokens,text_without_puntutation,lemmatizers,without_stop_words,text_column,compound_score,positive_score,neutral_score,negative_score
0,2019-12-11,One of the top Apple 5 analysts predicts next ...,Markets Insider,"['one', 'of', 'the', 'top', 'apple', '5', 'ana...","['one', 'of', 'the', 'top', 'apple', '5', 'ana...","['one', 'of', 'the', 'top', 'apple', '5', 'ana...","['one', 'top', 'apple', '5', 'analyst', 'predi...",one top apple 5 analyst predicts next year 5g ...,0.2023,0.114,0.886,0.0
1,2019-12-11,Microsoft Stock Poised for a Positive 2020,Markets Insider,"['microsoft', 'stock', 'poised', 'for', 'a', '...","['microsoft', 'stock', 'poised', 'for', 'a', '...","['microsoft', 'stock', 'poise', 'for', 'a', 'p...","['microsoft', 'stock', 'poise', 'positive', '2...",microsoft stock poise positive 2020,0.5574,0.474,0.526,0.0
2,2019-12-11,"Wednesdays Vital Data: Peloton, Netflix and Tesla",Markets Insider,"['wednesdays', 'vital', 'data', ':', 'peloton'...","['wednesdays', 'vital', 'data', 'peloton', 'ne...","['wednesday', 'vital', 'data', 'peloton', 'net...","['wednesday', 'vital', 'data', 'peloton', 'net...",wednesday vital data peloton netflix tesla,0.296,0.306,0.694,0.0
3,2019-12-11,Apple CEO Tim Cook is striking back at critics...,Markets Insider,"['apple', 'ceo', 'tim', 'cook', 'is', 'strikin...","['apple', 'ceo', 'tim', 'cook', 'is', 'strikin...","['apple', 'ceo', 'tim', 'cook', 'be', 'strike'...","['apple', 'ceo', 'tim', 'cook', 'strike', 'bac...",apple ceo tim cook strike back critic say inno...,0.0,0.16,0.617,0.222
4,2019-12-11,"The fully upgraded Mac Pro costs 50,000, but y...",Markets Insider,"['the', 'fully', 'upgraded', 'mac', 'pro', 'co...","['the', 'fully', 'upgraded', 'mac', 'pro', 'co...","['the', 'fully', 'upgraded', 'mac', 'pro', 'co...","['fully', 'upgraded', 'mac', 'pro', 'cost', '5...","fully upgraded mac pro cost 50,000 add wheel 4...",0.0,0.0,1.0,0.0
5,2019-12-11,"Apple's pricey new 6,000 screen for the Mac Pr...",Markets Insider,"['apple', ""'s"", 'pricey', 'new', '6,000', 'scr...","['apple', 'pricey', 'new', '6,000', 'screen', ...","['apple', 'pricey', 'new', '6,000', 'screen', ...","['apple', 'pricey', 'new', '6,000', 'screen', ...","apple pricey new 6,000 screen mac pro clean sp...",0.6597,0.351,0.649,0.0
6,2019-12-11,"3 Restaurant Stocks, 2 Buys and 1 Warning",Markets Insider,"['3', 'restaurant', 'stocks', ',', '2', 'buys'...","['3', 'restaurant', 'stocks', '2', 'buys', 'an...","['3', 'restaurant', 'stock', '2', 'buy', 'and'...","['3', 'restaurant', 'stock', '2', 'buy', '1', ...",3 restaurant stock 2 buy 1 warn,-0.1027,0.0,0.682,0.318
7,2019-12-11,Wednesday Apple Rumors: Do Not Expect a Price ...,Markets Insider,"['wednesday', 'apple', 'rumors', ':', 'do', 'n...","['wednesday', 'apple', 'rumors', 'do', 'not', ...","['wednesday', 'apple', 'rumor', 'do', 'not', '...","['wednesday', 'apple', 'rumor', 'not', 'expect...",wednesday apple rumor not expect price increas...,-0.2411,0.0,0.803,0.197
8,2019-12-11,Stock Market Today: Federal Reserve in Focus; ...,Markets Insider,"['stock', 'market', 'today', ':', 'federal', '...","['stock', 'market', 'today', 'federal', 'reser...","['stock', 'market', 'today', 'federal', 'reser...","['stock', 'market', 'today', 'federal', 'reser...",stock market today federal reserve focus iphon...,0.0,0.0,1.0,0.0
9,2019-12-11,"Dow Jones Today: Federal Reserve Holds, Boosti...",Markets Insider,"['dow', 'jones', 'today', ':', 'federal', 'res...","['dow', 'jones', 'today', 'federal', 'reserve'...","['dow', 'jones', 'today', 'federal', 'reserve'...","['dow', 'jones', 'today', 'federal', 'reserve'...",dow jones today federal reserve hold boost stock,0.4019,0.278,0.722,0.0


In [12]:
# Having a look at one of the fake news dataframes
fake_news_dict_sent['testing'].head(10)

Unnamed: 0,fake_news,title,text_column_tokens,text_without_puntutation,lemmatizers,without_stop_words,text_column,compound_score,positive_score,neutral_score,negative_score
0,2,copycat muslim terrorist arrested with assault...,"['copycat', 'muslim', 'terrorist', 'arrested',...","['copycat', 'muslim', 'terrorist', 'arrested',...","['copycat', 'muslim', 'terrorist', 'arrest', '...","['copycat', 'muslim', 'terrorist', 'arrest', '...",copycat muslim terrorist arrest assault weapon,-0.9201,0.0,0.132,0.868
1,2,wow! chicago protester caught on camera admits...,"['wow', '!', 'chicago', 'protester', 'caught',...","['wow', 'chicago', 'protester', 'caught', 'on'...","['wow', 'chicago', 'protester', 'caught', 'on'...","['wow', 'chicago', 'protester', 'caught', 'cam...",wow chicago protester caught camera admits vio...,0.2732,0.335,0.447,0.218
2,2,germany's fdp look to fill schaeuble's big shoes,"['germany', ""'s"", 'fdp', 'look', 'to', 'fill',...","['germany', 'fdp', 'look', 'to', 'fill', 'scha...","['germany', 'fdp', 'look', 'to', 'fill', 'scha...","['germany', 'fdp', 'look', 'fill', 'schaeuble'...",germany fdp look fill schaeuble big shoe,0.0,0.0,1.0,0.0
3,2,mi school sends welcome back packet warning ki...,"['mi', 'school', 'sends', 'welcome', 'back', '...","['mi', 'school', 'sends', 'welcome', 'back', '...","['mi', 'school', 'sends', 'welcome', 'back', '...","['mi', 'school', 'sends', 'welcome', 'back', '...",mi school sends welcome back packet warn kid w...,0.3818,0.208,0.694,0.097
4,2,you.n. seeks 'massive' aid boost amid rohingya...,"['you.n', '.', 'seeks', ""'massive"", ""'"", 'aid'...","['you.n', 'seeks', 'aid', 'boost', 'amid', 'ro...","['you.n', 'seek', 'aid', 'boost', 'amid', 'roh...","['you.n', 'seek', 'aid', 'boost', 'amid', 'roh...",you.n seek aid boost amid rohingya within emer...,0.0258,0.239,0.531,0.23
5,2,did oprah just leave ‚nasty‚ hillary wishing s...,"['did', 'oprah', 'just', 'leave', '‚nasty‚', '...","['did', 'oprah', 'just', 'leave', 'hillary', '...","['do', 'oprah', 'just', 'leave', 'hillary', 'w...","['oprah', 'leave', 'hillary', 'wish', 'endorse...",oprah leave hillary wish endorse video,0.5859,0.543,0.326,0.13
6,2,france's macron says his job not 'cool' cites ...,"['france', ""'s"", 'macron', 'says', 'his', 'job...","['france', 'macron', 'says', 'his', 'job', 'no...","['france', 'macron', 'say', 'his', 'job', 'not...","['france', 'macron', 'say', 'job', 'not', 'cit...",france macron say job not cite talk turkey erd...,0.0,0.0,1.0,0.0
7,2,flashback: chilling ‚60 minutes‚ interview wit...,"['flashback', ':', 'chilling', '‚60', 'minutes...","['flashback', 'chilling', 'interview', 'with',...","['flashback', 'chill', 'interview', 'with', 'g...","['flashback', 'chill', 'interview', 'george', ...",flashback chill interview george soros nearly ...,0.0,0.0,1.0,0.0
8,2,spanish foreign ministry says to expel north k...,"['spanish', 'foreign', 'ministry', 'says', 'to...","['spanish', 'foreign', 'ministry', 'says', 'to...","['spanish', 'foreign', 'ministry', 'say', 'to'...","['spanish', 'foreign', 'ministry', 'say', 'exp...",spanish foreign ministry say expel north korea...,-0.4404,0.0,0.707,0.293
9,2,trump says cuba 'did some bad things' aimed at...,"['trump', 'says', 'cuba', ""'did"", 'some', 'bad...","['trump', 'says', 'cuba', 'some', 'bad', 'thin...","['trump', 'say', 'cuba', 'some', 'bad', 'thing...","['trump', 'say', 'cuba', 'bad', 'thing', 'aim'...",trump say cuba bad thing aim you.s diplomat,-0.5423,0.0,0.667,0.333


In [13]:
# NLP libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from scipy.special import softmax

In [14]:
# Load the pre-trained model and tokenizer for sentiment analysis
MODEL = 'cardiffnlp/twitter-roberta-base-sentiment'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

In [None]:
# Function for sentiment analysis with RoBERTa
def perform_sentiment_analysis_roberta(data_dict, model, tokenizer, column_name='title', batch_size=16):
    """
    Perform sentiment analysis on a dictionary of dataframes using a pre-trained RoBERTa model.
    Each dataframe is processed in batches to manage memory usage.
    The results are added to the original dataframe.

    Parameters:
        - data_dict (dict): Dictionary of dataframes to process.
        - model: Pre-trained RoBERTa model for sentiment analysis.
        - tokenizer: Tokenizer for the RoBERTa model.
        - column_name (str): Name of the column containing text data.
        - batch_size (int): Number of samples to process in each batch.

    Returns:
        - dict: Updated dictionary of dataframes with sentiment analysis results.
    """

    # Uses GPU if available, otherwise CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Moves the model to the selected device and sets it to evaluation mode (disables training-specific behaviors)
    model.to(device)
    model.eval()

    # Loop through each dataframe
    for key in data_dict.keys():

        # Select the dataframe to analyse
        df = data_dict[key]

        # Convert the text column to a list of strings
        texts = df[column_name].astype(str).tolist()

        # Store results
        all_probs = []

        # Process in batches
        for i in range(0, len(texts), batch_size):

            # Get the current batch of texts
            batch_texts = texts[i:i+batch_size]

            # Tokenize
            # padding=True pads every sequence to the length of the longest one in the batch
            # truncation=True with max_length=512 cuts sequences longer than 512 tokens
            # return_tensors='pt' returns the data as PyTorch tensors
            tokens = tokenizer(batch_texts, padding=True, truncation=True, return_tensors='pt', max_length=512)

            # Move tokens to GPU if available
            # The tokenizer returns a dictionary of tensors, so we need to move each tensor to the device
            tokens = {k: v.to(device) for k, v in tokens.items()}

            # Inference
            # The model's forward method is called with the tokenized inputs
            with torch.no_grad():
                outputs = model(**tokens)

            # Move to CPU, apply softmax
            # The logits are the raw output of the model, and softmax converts them into probabilities
            # The axis=1 argument specifies that we want to apply softmax across the columns (for each row)
            # The outputs.logits is a tensor of shape (batch_size, num_classes)
            # where num_classes is the number of classes in the classification task
            # The resulting probs is a numpy array of shape (batch_size, num_classes)
            # The probs array contains the predicted probabilities for each class
            probs = softmax(outputs.logits.cpu().numpy(), axis=1)

            # The all_probs list is extended with the probabilities for the current batch
            all_probs.extend(probs)

            # Free memory
            del tokens, outputs
            torch.cuda.empty_cache()

        # Save predictions into DataFrame
        df['roberta_neg'] = [float(x[0]) for x in all_probs]
        df['roberta_neu'] = [float(x[1]) for x in all_probs]
        df['roberta_pos'] = [float(x[2]) for x in all_probs]
        df['roberta_sentiment'] = [int(x.argmax()) for x in all_probs]

    return data_dict
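
The step from logits to the `roberta_*` columns can be sketched on a single row without any model. This is a plain-Python illustration of what `softmax(..., axis=1)` and `argmax` do per row, not the notebook's implementation. The logits are made-up values, and the class order (negative, neutral, positive) is the order in which the function above writes its columns.

```python
import math

def softmax_row(logits):
    """Convert one row of raw logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one headline, ordered as (negative, neutral, positive)
example_logits = [-1.0, 0.5, 2.0]

probs = softmax_row(example_logits)
predicted_class = probs.index(max(probs))  # plays the role of argmax

print([round(p, 3) for p in probs], predicted_class)
```

The largest logit always yields the largest probability, so `roberta_sentiment` is simply the index of the highest of the three probability columns.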

In [17]:
# Sentiment analysis using RoBERTa model in the stock news
news_dict_sent = perform_sentiment_analysis_roberta(news_dict_sent, model, tokenizer, column_name='text_column')

In [18]:
# Sentiment analysis using RoBERTa model in the fake news data
fake_news_dict_sent = perform_sentiment_analysis_roberta(fake_news_dict_sent, model, tokenizer, column_name='text_column')

In [None]:
# NLP libraries
from transformers import pipeline

# Other libraries
from tqdm import tqdm

In [None]:
def apply_transformers_emotion_analysis(dataframes_dict,
                                        analysis="text-classification",
                                        model="j-hartmann/emotion-english-distilroberta-base",
                                        text_column='title',
                                        batch_size=32):

    """
    Apply emotion analysis using a pre-trained model from Hugging Face Transformers on a dictionary of DataFrames.
    Each DataFrame is processed in batches to manage memory usage.
    The results are added to the original DataFrame.

    Parameters:
        - dataframes_dict (dict): Dictionary of DataFrames to process.
        - analysis (str): Type of analysis to perform (default is "text-classification").
        - model (str): Pre-trained model name from Hugging Face (default is "j-hartmann/emotion-english-distilroberta-base").
        - text_column (str): Name of the column containing text data (default is 'title').
        - batch_size (int): Number of samples to process in each batch (default is 32).

    Returns:
        - dict: Updated dictionary of DataFrames with emotion analysis results.
    """
    
    # Set device to GPU if available
    device = 0 if torch.cuda.is_available() else -1

    # Initialize the emotion analysis pipeline
    emotion_pipeline = pipeline(analysis, model=model, top_k=None, device=device)

    # Iterate through the dictionary of DataFrames
    for key, df in dataframes_dict.items():

        # Ensure the text column exists in the DataFrame
        if text_column in df.columns:

            # Convert the text column to a list of strings
            texts = df[text_column].astype(str).tolist()

            # Store results
            results = []

            # Process in batches
            for i in tqdm(range(0, len(texts), batch_size), desc=f"Processing {key}"):
                # Get the current batch of texts
                batch = texts[i:i+batch_size]

                # Tokenize and analyze emotions
                # The pipeline handles tokenization, padding, and truncation internally
                batch_results = emotion_pipeline(batch)

                # Store the results
                results.extend(batch_results)

            # Save predictions into DataFrame
            df['emotion_scores'] = results

            # Extract scores for each emotion
            emotions = ['joy', 'anger', 'fear', 'sadness', 'disgust', 'surprise', 'neutral']
            for emotion in emotions:
                df[emotion] = df['emotion_scores'].apply(
                    lambda x: next((item['score'] for item in x if item['label'] == emotion), 0)
                )

            # Drop the original emotion_scores column
            df.drop(columns=['emotion_scores'], inplace=True)
        else:
            print(f"Warning: '{text_column}' not found in DataFrame with key '{key}'.")

    return dataframes_dict
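
The column extraction at the end of the function assumes that, with `top_k=None`, the pipeline returns one list of `{'label', 'score'}` dictionaries per text. The sketch below shows that flattening step on a single result. The scores are those of the first `testing` row shown in this notebook's output, and the helper name `scores_by_emotion` is hypothetical.

```python
# One pipeline result in the shape the function above expects
sample_result = [
    {'label': 'fear', 'score': 0.446019},
    {'label': 'anger', 'score': 0.36282},
    {'label': 'neutral', 'score': 0.138607},
    {'label': 'disgust', 'score': 0.034569},
    {'label': 'sadness', 'score': 0.008094},
    {'label': 'surprise', 'score': 0.00718},
    {'label': 'joy', 'score': 0.002711},
]

EMOTIONS = ['joy', 'anger', 'fear', 'sadness', 'disgust', 'surprise', 'neutral']

def scores_by_emotion(result, emotions=EMOTIONS):
    """Map each emotion to its score, defaulting to 0 when a label is missing."""
    lookup = {item['label']: item['score'] for item in result}
    return {emotion: lookup.get(emotion, 0) for emotion in emotions}

row = scores_by_emotion(sample_result)
top_emotion = max(row, key=row.get)
print(top_emotion, row[top_emotion])  # fear 0.446019
```

Building the lookup dictionary once per row avoids scanning the list once per emotion, which is what the `next(...)` expression in the function does.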

In [21]:
# Selecting the model for emotion analysis
model =  "j-hartmann/emotion-english-distilroberta-base"
analysis = "text-classification"

# Emotion analysis using DistilRoBERTa model
news_dict_sent = apply_transformers_emotion_analysis(news_dict_sent, analysis = analysis, model = model, text_column='text_column')
fake_news_dict_sent = apply_transformers_emotion_analysis(fake_news_dict_sent, analysis = analysis, model = model, text_column='text_column')


Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Processing AAPL: 100%|██████████| 720/720 [02:04<00:00,  5.80it/s]
Processing BAC: 100%|██████████| 519/519 [01:29<00:00,  5.77it/s]
Processing GME: 100%|██████████| 213/213 [00:36<00:00,  5.78it/s]
Processing AMZN: 100%|██████████| 722/722 [02:05<00:00,  5.76it/s]
Processing NVDA: 100%|██████████| 692/692 [01:59<00:00,  5.79it/s]
Processing GS: 100%|██████████| 491/491 [01:24<00:00,  5.80it/s]
Processing TSLA: 100%|██████████| 722/722 [02:05<00:00,  5.77it/s]
Device set to use cuda:0
Processing testing: 100%|██████████| 288/288 [00:49<00:00,  5.77it/s]
Processing training: 100%|██████████| 1007/1007 [02:53<00:00,  5.80it/s]
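
The totals in the progress bars are batch counts, not row counts. They follow from the `range(0, len(texts), batch_size)` loop with `batch_size=32`. The sketch below reproduces the arithmetic for the GS frame, assuming it has 15,696 rows, as the last displayed row number (15695) suggests. The helper name `batch_slices` is hypothetical.

```python
def batch_slices(n_items, batch_size):
    """Return the (start, end) row ranges visited by the batching loop."""
    return [(start, min(start + batch_size, n_items))
            for start in range(0, n_items, batch_size)]

slices = batch_slices(15696, 32)
print(len(slices))   # 491, matching "Processing GS: 491/491"
print(slices[-1])    # (15680, 15696): the last batch holds only 16 rows
```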


In [None]:
def apply_transformers_sentiment_analysis(dataframes_dict,
                                          model="cardiffnlp/twitter-roberta-base-sentiment",
                                          text_column='title',
                                          batch_size=32):
    """
    Apply sentiment analysis using a pre-trained model from Hugging Face Transformers on a dictionary of DataFrames.
    Each DataFrame is processed in batches to manage memory usage.
    The results are added to the original DataFrame.

    Parameters:
        - dataframes_dict (dict): Dictionary of DataFrames to process.
        - model (str): Pre-trained model name from Hugging Face (default is "cardiffnlp/twitter-roberta-base-sentiment").
        - text_column (str): Name of the column containing text data (default is 'title').
        - batch_size (int): Number of samples to process in each batch (default is 32).

    Returns:
        - dict: Updated dictionary of DataFrames with sentiment analysis results.
    """
    
    # Set device to GPU if available
    device = 0 if torch.cuda.is_available() else -1

    # Initialize the sentiment analysis pipeline
    sentiment_pipeline = pipeline("sentiment-analysis", model=model, device=device)

    for key, df in dataframes_dict.items():
        if text_column in df.columns:
            texts = df[text_column].astype(str).tolist()
            results = []

            # Process in batches
            for i in tqdm(range(0, len(texts), batch_size), desc=f"Sentiment for {key}"):
                batch = texts[i:i+batch_size]
                batch_results = sentiment_pipeline(batch)
                results.extend(batch_results)

            # Extract labels and scores
            df['sentiment_label'] = [res['label'] for res in results]
            df['sentiment_score'] = [res['score'] for res in results]
        else:
            print(f"Warning: '{text_column}' not found in DataFrame with key '{key}'.")

    return dataframes_dict
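
The pipeline for this model reports generic labels (`LABEL_0`, `LABEL_1`, `LABEL_2`). A small sketch for turning them into readable names is shown below, assuming the same class order as the `roberta_neg`, `roberta_neu`, and `roberta_pos` columns created earlier. The names `LABEL_NAMES` and `readable_label` are hypothetical.

```python
# Assumed mapping: index 0 = negative, 1 = neutral, 2 = positive
LABEL_NAMES = {
    'LABEL_0': 'negative',
    'LABEL_1': 'neutral',
    'LABEL_2': 'positive',
}

def readable_label(raw_label):
    """Translate a generic pipeline label; unknown labels are returned unchanged."""
    return LABEL_NAMES.get(raw_label, raw_label)

print(readable_label('LABEL_0'))  # negative
print(readable_label('LABEL_2'))  # positive
```

On a dataframe this could be applied as `df['sentiment_label'].map(readable_label)`.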

In [23]:
# Sentiment score analysis
news_dict_sent = apply_transformers_sentiment_analysis(news_dict_sent)
fake_news_dict_sent = apply_transformers_sentiment_analysis(fake_news_dict_sent)

Device set to use cuda:0
Sentiment for AAPL: 100%|██████████| 720/720 [03:32<00:00,  3.40it/s]
Sentiment for BAC: 100%|██████████| 519/519 [02:33<00:00,  3.38it/s]
Sentiment for GME: 100%|██████████| 213/213 [01:03<00:00,  3.37it/s]
Sentiment for AMZN: 100%|██████████| 722/722 [03:35<00:00,  3.36it/s]
Sentiment for NVDA: 100%|██████████| 692/692 [03:24<00:00,  3.38it/s]
Sentiment for GS: 100%|██████████| 491/491 [02:24<00:00,  3.39it/s]
Sentiment for TSLA: 100%|██████████| 722/722 [03:33<00:00,  3.38it/s]
Device set to use cuda:0
Sentiment for testing: 100%|██████████| 288/288 [01:24<00:00,  3.40it/s]
Sentiment for training: 100%|██████████| 1007/1007 [04:56<00:00,  3.40it/s]


In [24]:
# Saving the new dataframes
for key, df in fake_news_dict_sent.items():
    output_path = os.path.join(folder_path, f'fake_news_{key}_roberta.csv')
    df.to_csv(output_path, index=False)

for key, df in news_dict_sent.items():
    output_path = os.path.join(folder_path, f'news_{key}_roberta.csv')
    df.to_csv(output_path, index=False)

In [2]:
# Function to load the saved datasets, filtering file names by both prefix and suffix
def load_dataframes_from_folder_roberta(folder_path, startswith, endswith, key_position):

    dataframes_dict = {}

    # Loop through files in the folder
    for filename in os.listdir(folder_path):
        if filename.startswith(startswith) and filename.endswith(endswith):

            # Extract the key symbol
            ticker = filename.split('_')[key_position]

            # Read the CSV file into a dataframe
            file_path = os.path.join(folder_path, filename)
            df = pd.read_csv(file_path)

            # Add the dataframe to the dictionary
            dataframes_dict[ticker] = df

    return dataframes_dict

In [3]:
# Reloading the saved datasets (useful when resuming work in a new session)
news_dict_sent = load_dataframes_from_folder_roberta(folder_path, startswith = 'news_', endswith = '_roberta.csv', key_position = 1)
fake_news_dict_sent = load_dataframes_from_folder_roberta(folder_path, startswith = 'fake_news', endswith = '_roberta.csv', key_position = 2)

In [4]:
# Having a look at the dataframe
fake_news_dict_sent['testing'].head(2)

Unnamed: 0,fake_news,title,text_column_tokens,text_without_puntutation,lemmatizers,without_stop_words,text_column,compound_score,positive_score,neutral_score,...,roberta_sentiment,joy,anger,fear,sadness,disgust,surprise,neutral,sentiment_label,sentiment_score
0,2,copycat muslim terrorist arrested with assault...,"['copycat', 'muslim', 'terrorist', 'arrested',...","['copycat', 'muslim', 'terrorist', 'arrested',...","['copycat', 'muslim', 'terrorist', 'arrest', '...","['copycat', 'muslim', 'terrorist', 'arrest', '...",copycat muslim terrorist arrest assault weapon,-0.9201,0.0,0.132,...,0,0.002711,0.36282,0.446019,0.008094,0.034569,0.00718,0.138607,LABEL_0,0.878712
1,2,wow! chicago protester caught on camera admits...,"['wow', '!', 'chicago', 'protester', 'caught',...","['wow', 'chicago', 'protester', 'caught', 'on'...","['wow', 'chicago', 'protester', 'caught', 'on'...","['wow', 'chicago', 'protester', 'caught', 'cam...",wow chicago protester caught camera admits vio...,0.2732,0.335,0.447,...,0,0.007536,0.077617,0.032438,0.014817,0.002431,0.857197,0.007963,LABEL_0,0.830263


In [5]:
news_dict_sent['GS']

Unnamed: 0,date,title,source,text_column_tokens,text_without_puntutation,lemmatizers,without_stop_words,text_column,compound_score,positive_score,...,roberta_sentiment,joy,anger,fear,sadness,disgust,surprise,neutral,sentiment_label,sentiment_score
0,2011-05-20,"RBS, Goldman Sachs raise price target on LT",Markets Insider,"['rbs', ',', 'goldman', 'sachs', 'raise', 'pri...","['rbs', 'goldman', 'sachs', 'raise', 'price', ...","['rb', 'goldman', 'sachs', 'raise', 'price', '...","['rb', 'goldman', 'sachs', 'raise', 'price', '...",rb goldman sachs raise price target lt,0.0000,0.000,...,1,0.159468,0.113403,0.018421,0.081246,0.014139,0.019173,0.594150,LABEL_1,0.621827
1,2011-05-24,Oil up after Goldman Sachs raises forecasts,Markets Insider,"['oil', 'up', 'after', 'goldman', 'sachs', 'ra...","['oil', 'up', 'after', 'goldman', 'sachs', 'ra...","['oil', 'up', 'after', 'goldman', 'sachs', 'ra...","['oil', 'goldman', 'sachs', 'raise', 'forecast']",oil goldman sachs raise forecast,0.0000,0.000,...,1,0.248039,0.019998,0.005001,0.014369,0.002702,0.049541,0.660349,LABEL_1,0.542376
2,2011-06-02,Goldman Sachs gets subpoena from Manhattan DA,Markets Insider,"['goldman', 'sachs', 'gets', 'subpoena', 'from...","['goldman', 'sachs', 'gets', 'subpoena', 'from...","['goldman', 'sachs', 'get', 'subpoena', 'from'...","['goldman', 'sachs', 'get', 'subpoena', 'manha...",goldman sachs get subpoena manhattan da,0.0000,0.000,...,1,0.004383,0.722493,0.027779,0.013314,0.024096,0.007398,0.200537,LABEL_1,0.856827
3,2011-06-07,Goldman Sachs shareholder sues exdirector Raja...,Markets Insider,"['goldman', 'sachs', 'shareholder', 'sues', 'e...","['goldman', 'sachs', 'shareholder', 'sues', 'e...","['goldman', 'sachs', 'shareholder', 'sue', 'ex...","['goldman', 'sachs', 'shareholder', 'sue', 'ex...",goldman sachs shareholder sue exdirector rajat...,0.0000,0.000,...,1,0.002633,0.974094,0.006784,0.006124,0.002480,0.001868,0.006017,LABEL_1,0.545970
4,2011-07-19,Goldman Sachs profit misses expectations,Markets Insider,"['goldman', 'sachs', 'profit', 'misses', 'expe...","['goldman', 'sachs', 'profit', 'misses', 'expe...","['goldman', 'sachs', 'profit', 'miss', 'expect...","['goldman', 'sachs', 'profit', 'miss', 'expect...",goldman sachs profit miss expectation,0.3182,0.387,...,0,0.005472,0.018356,0.011508,0.902993,0.008218,0.008627,0.044826,LABEL_0,0.522701
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15692,2025-04-05,"The Best Stocks to Invest $1,000 in Right Now",Alpha Vantage,"['the', 'best', 'stocks', 'to', 'invest', '$',...","['the', 'best', 'stocks', 'to', 'invest', '1,0...","['the', 'best', 'stock', 'to', 'invest', '1,00...","['best', 'stock', 'invest', '1,000', 'right']","best stock invest 1,000 right",0.6369,0.512,...,2,0.020427,0.006542,0.002523,0.005657,0.001133,0.016178,0.947541,LABEL_2,0.637301
15693,2025-04-05,3 No-Brainer Warren Buffett Stocks to Buy Righ...,Alpha Vantage,"['3', 'no-brainer', 'warren', 'buffett', 'stoc...","['3', 'no-brainer', 'warren', 'buffett', 'stoc...","['3', 'no-brainer', 'warren', 'buffett', 'stoc...","['3', 'no-brainer', 'warren', 'buffett', 'stoc...",3 no-brainer warren buffett stock buy right,0.0000,0.000,...,1,0.117450,0.022918,0.010167,0.006184,0.002519,0.049155,0.791606,LABEL_1,0.840640
15694,2025-04-05,The Fed Is Not Rushing to Save the Markets Thi...,Google News,"['the', 'fed', 'is', 'not', 'rushing', 'to', '...","['the', 'fed', 'is', 'not', 'rushing', 'to', '...","['the', 'fed', 'be', 'not', 'rush', 'to', 'sav...","['fed', 'not', 'rush', 'save', 'market', 'time']",fed not rush save market time,-0.3875,0.000,...,1,0.010416,0.165085,0.025353,0.040946,0.001951,0.053457,0.702793,LABEL_1,0.605530
15695,2025-04-06,1 AI Robotics Stock to Buy Before It Soars 160...,Alpha Vantage,"['1', 'ai', 'robotics', 'stock', 'to', 'buy', ...","['1', 'ai', 'robotics', 'stock', 'to', 'buy', ...","['1', 'ai', 'robotics', 'stock', 'to', 'buy', ...","['1', 'ai', 'robotics', 'stock', 'buy', 'soar'...",1 ai robotics stock buy soar 160 % 2 trillion ...,0.2732,0.160,...,1,0.599865,0.113913,0.037667,0.003472,0.002605,0.180111,0.062367,LABEL_1,0.774427


In [None]:
# Math library
import numpy as np

# Machine Learning Libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# NLP libraries
import gensim
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer
from transformers import BertTokenizer, BertModel
import spacy

# Load the pre-trained model and tokenizer for embeddings
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

In [None]:
# Load MiniLM model once
miniLM_model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
# 3. Transformer-based (BERT) Embeddings
def bert_embedding(column, batch_size=16):
        """
        Generate BERT embeddings for a given column of text data.
        The function processes the data in batches to manage memory usage.
        Each batch is tokenized, passed through the BERT model, and the CLS token embeddings are extracted.
        The resulting embeddings are returned as a list.
        
        Parameters:
            - column (pd.Series): The column of text data to process.
            - batch_size (int): The number of samples to process in each batch.

        Returns:
            - list: A list of BERT embeddings for the input text data.
        """

        # Uses GPU if available, otherwise CPU
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Moves the model to the selected device and sets it to evaluation mode (disables training-specific behaviors)
        model.to(device)
        model.eval()
        texts = column.astype(str).tolist()

        # Store results (named all_embeddings to avoid shadowing the built-in all())
        all_embeddings = []

        # Process in batches
        for i in range(0, len(texts), batch_size):

            # Get the current batch of texts
            batch_texts = texts[i:i+batch_size]

            # Tokenize
            # padding=True pads every sequence to the length of the longest one in the batch
            # truncation=True with max_length=512 cuts sequences longer than 512 tokens
            # return_tensors='pt' returns the data as PyTorch tensors
            tokens = tokenizer(batch_texts, padding=True, truncation=True, return_tensors='pt', max_length=512)

            # Move tokens to GPU if available
            # The tokenizer returns a dictionary of tensors, so we need to move each tensor to the device
            tokens = {k: v.to(device) for k, v in tokens.items()}

            # Inference
            # The model's forward method is called with the tokenized inputs
            with torch.no_grad():
                outputs = model(**tokens)

            # Take the hidden state of the first token ([CLS]) as the embedding of each text
            cls_embeddings = outputs.last_hidden_state[:, 0, :]
            embeddings = cls_embeddings.cpu().numpy()

            # Extend the results with the embeddings for the current batch
            all_embeddings.extend(embeddings)

            # Free memory
            del tokens, outputs
            torch.cuda.empty_cache()

        return all_embeddings
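
The slice `outputs.last_hidden_state[:, 0, :]` keeps, for every text in the batch, only the vector at token position 0, which is where BERT's tokenizer places the `[CLS]` token. The sketch below mimics that indexing on a toy nested list instead of a tensor. The values and the helper name `first_token_vectors` are purely illustrative.

```python
# Toy stand-in for last_hidden_state with shape (batch=2, tokens=3, hidden=4)
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

def first_token_vectors(hidden_states):
    """Plain-Python equivalent of hidden_states[:, 0, :] on a nested list."""
    return [sequence[0] for sequence in hidden_states]

cls_vectors = first_token_vectors(last_hidden_state)
print(len(cls_vectors), len(cls_vectors[0]))  # 2 4: one 4-dimensional vector per text
```

With `bert-base-uncased` the hidden dimension is 768 rather than 4, so each text is represented by one 768-dimensional vector.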

In [None]:
import joblib

def generate_embeddings(dataframe, fit = True, type_='fake', column_name='text_column'):
    """
    Generate various text embeddings for a given dataframe.
    The function computes TF-IDF, Word2Vec, BERT, and Bag of Words embeddings.
    It also computes the length of each embedding and adds them to the dataframe.

    Parameters:
        - dataframe (pd.DataFrame): The input dataframe containing text data.
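        - fit (bool): If True, fit the TF-IDF and Bag of Words vectorizers and save them. If False, load the previously fitted vectorizers.
        - type_ (str): The type of data ('fake' or 'stocks') to determine the file paths for saving/loading.
        - column_name (str): The name of the column containing the preprocessed text, used for TF-IDF, Word2Vec, and Bag of Words.
          The BERT and MiniLM embeddings are computed from the 'title' column.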

    Returns:
        - pd.DataFrame: The updated dataframe with embeddings and their lengths.
    """

    # Uses GPU if available, otherwise CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    df = dataframe.copy()

    # 1. TF-IDF Vectorizer
    if not fit:
        # Reuse the vectorizer fitted earlier so the feature space stays the same
        if type_ == 'fake':
            tfidf_vectorizer = joblib.load(os.path.join(folder_path, 'tfidf_vectorizer_fake.pkl'))
        else:
            tfidf_vectorizer = joblib.load(os.path.join(folder_path, 'tfidf_vectorizer_stocks.pkl'))
        tfidf_matrix = tfidf_vectorizer.transform(df[column_name])
    else:
        # Fit a new vectorizer and save it for later reuse
        tfidf_vectorizer = TfidfVectorizer()
        tfidf_matrix = tfidf_vectorizer.fit_transform(df[column_name])
        if type_ == 'fake':
            joblib.dump(tfidf_vectorizer, os.path.join(folder_path, "tfidf_vectorizer_fake.pkl"))
        else:
            joblib.dump(tfidf_vectorizer, os.path.join(folder_path, "tfidf_vectorizer_stocks.pkl"))

    df['tfidf_embedding'] = [tfidf_matrix[i].toarray() for i in range(tfidf_matrix.shape[0])]
    df['tfidf_length'] = [len(tfidf_matrix[i].toarray()[0]) for i in range(tfidf_matrix.shape[0])]

    # 2. Word2Vec Embeddings (trained on the current dataframe)
    # Each text is represented by the mean of the vectors of its words
    word2vec_model = Word2Vec(sentences=[text.split() for text in df[column_name]], vector_size=100, window=5, min_count=1, workers=4)
    word2vec_matrix = np.array([np.mean([word2vec_model.wv[word] for word in text.split() if word in word2vec_model.wv], axis=0)
                                for text in df[column_name]])
    df['word2vec_embedding'] = [vec for vec in word2vec_matrix]
    df['word2vec_length'] = [len(vec) for vec in word2vec_matrix]

    # 3. Transformer-based (BERT) Embeddings, computed from the raw titles
    bert_matrix = bert_embedding(df['title'])
    df['bert_embedding'] = [vec for vec in bert_matrix]
    df['bert_length'] = [len(vec) for vec in bert_matrix]

    # 4. Bag of Words (BoW)
    if not fit:
        # Reuse the vectorizer fitted earlier so the feature space stays the same
        if type_ == 'fake':
            bow_vectorizer = joblib.load(os.path.join(folder_path, 'bow_vectorizer_fake.pkl'))
        else:
            bow_vectorizer = joblib.load(os.path.join(folder_path, 'bow_vectorizer_stocks.pkl'))
        bow_matrix = bow_vectorizer.transform(df[column_name])
    else:
        # Fit a new vectorizer limited to the 1000 most frequent terms and save it
        bow_vectorizer = CountVectorizer(max_features=1000)
        bow_matrix = bow_vectorizer.fit_transform(df[column_name])
        if type_ == 'fake':
            joblib.dump(bow_vectorizer, os.path.join(folder_path, "bow_vectorizer_fake.pkl"))
        else:
            joblib.dump(bow_vectorizer, os.path.join(folder_path, "bow_vectorizer_stocks.pkl"))

    df['bow_embedding'] = [bow_matrix[i].toarray() for i in range(bow_matrix.shape[0])]
    df['bow_length'] = [len(bow_matrix[i].toarray()[0]) for i in range(bow_matrix.shape[0])]

    # 5. MiniLM (SentenceTransformer), computed from the raw titles
    miniLM_matrix = miniLM_model.encode(df['title'].tolist(), device=device)
    df['miniLM_embedding'] = miniLM_matrix.tolist()
    df['miniLM_shape'] = [len(vec) for vec in miniLM_matrix]

    return df
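
The Word2Vec step represents each text by the mean of its word vectors. When a text contains no tokens from the vocabulary, for example an empty string, `np.mean` over an empty list returns `nan` rather than a vector. The sketch below shows one possible guard, a zero-vector fallback. This is an assumption for illustration, not what the notebook does: `mean_vector` and the toy `vectors` dictionary are hypothetical, and plain Python lists stand in for the model's NumPy vectors.

```python
def mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all known tokens
    return [sum(component) / len(known) for component in zip(*known)]


# Toy 3-dimensional "word vectors" (hypothetical values)
vectors = {
    "tesla": [1.0, 0.0, 2.0],
    "stock": [3.0, 2.0, 0.0],
}

print(mean_vector("tesla stock split".split(), vectors, dim=3))
print(mean_vector("".split(), vectors, dim=3))
```

Whether a zero vector is an appropriate placeholder depends on the downstream model; dropping such rows beforehand is an equally reasonable choice.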

In [None]:
# Main function that processes the dictionary of DataFrames
def process_dataframes(data_dict, type_='fake'):
    """
    Process a dictionary of DataFrames to generate embeddings for each DataFrame.
    The function applies the generate_embeddings function to each DataFrame in the dictionary.
    It also handles the fitting of vectorizers and saves them for future use.

    Parameters:
        - data_dict (dict): Dictionary of DataFrames to process.
        - type_ (str): The type of data ('fake' or 'stocks') to determine the file paths for saving/loading.

    Returns:
        - dict: Updated dictionary of DataFrames with embeddings and their lengths.
    """
    
    for key, df in tqdm(data_dict.items()):
        # Fit the vectorizers unless the key is the test split or one of the
        # listed tickers, which load the previously saved vectorizers instead
        fit = key not in ['testing', 'BAC', 'GME', 'AMZN', 'NVDA', 'GS', 'TSLA']
        data_dict[key] = generate_embeddings(df, fit=fit, type_=type_)
    return data_dict
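
The `fit` flag matters because a vectorizer's vocabulary, and therefore the length of the resulting vectors, is fixed when it is fitted. The toy class below is a simplified stand-in for scikit-learn's `CountVectorizer`, written only to illustrate that reusing a fitted vocabulary yields vectors of the same length for every dataset. The name `ToyBowVectorizer` and the example texts are hypothetical.

```python
class ToyBowVectorizer:
    """Simplified bag-of-words vectorizer (illustration only)."""

    def fit(self, texts):
        # The vocabulary is learned once, from the fitting corpus only
        words = sorted({word for text in texts for word in text.split()})
        self.vocabulary_ = {word: index for index, word in enumerate(words)}
        return self

    def transform(self, texts):
        rows = []
        for text in texts:
            row = [0] * len(self.vocabulary_)
            for word in text.split():
                index = self.vocabulary_.get(word)
                if index is not None:  # words unseen during fit are ignored
                    row[index] += 1
            rows.append(row)
        return rows


train = ["fake news alert", "stock news"]
test = ["breaking stock alert alert"]

vectorizer = ToyBowVectorizer().fit(train)  # analogous to fit=True
train_rows = vectorizer.transform(train)
test_rows = vectorizer.transform(test)      # analogous to fit=False

print(vectorizer.vocabulary_)
print(test_rows)
```

Fitting a second vectorizer on the test texts would instead produce a different vocabulary, so the columns of the two matrices would no longer refer to the same words.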

In [None]:
# Generate the embeddings for the stock news data
news_dict_sent = process_dataframes(news_dict_sent, type_='stocks')

100%|██████████| 7/7 [03:50<00:00, 32.92s/it]


In [None]:
# Reverse the dictionary order so that 'training' is processed first:
# the vectorizers are fitted on the training set and then reused for 'testing'
fake_news_dict_sent = dict(reversed(list(fake_news_dict_sent.items())))

In [22]:
fake_news_dict_sent.keys()

dict_keys(['training', 'testing'])
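
Since Python 3.7, dictionaries preserve insertion order, so rebuilding the dictionary from its reversed items changes the order in which its keys are visited in a loop. A small sketch with placeholder values (the strings stand in for the DataFrames), assuming the original order was `testing` followed by `training`:

```python
# Placeholder values stand in for the DataFrames
original = {"testing": "test_df", "training": "train_df"}

# Reverse the (key, value) pairs and build a new dictionary from them
reordered = dict(reversed(list(original.items())))

print(list(original))
print(list(reordered))
```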

In [None]:
fake_news_dict_sent['training']['text_column'].isna().sum()

In [None]:
# Dropping rows that are empty after preprocessing
fake_news_dict_sent['training'] = fake_news_dict_sent['training'].dropna()

In [None]:
# Generating the embeddings for the fake news data
fake_news_dict_sent = process_dataframes(fake_news_dict_sent)

100%|██████████| 2/2 [01:12<00:00, 36.46s/it]


In [26]:
# Saving the new dataframes
for key, df in fake_news_dict_sent.items():
    output_path = os.path.join(folder_path, f'fake_news_{key}_embedings.csv')
    df.to_csv(output_path, index=False)

for key, df in news_dict_sent.items():
    output_path = os.path.join(folder_path, f'news_{key}_embedings.csv')
    df.to_csv(output_path, index=False)
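
When list- or array-valued columns are written to CSV, each cell is stored as text and has to be parsed again on load; the string form of a large NumPy array can also be abbreviated with `...`. The sketch below shows one possible way to make the round trip explicit by storing each vector as a JSON string. It uses only the standard library and is an illustration rather than the notebook's approach; the column names and values are made up.

```python
import csv
import io
import json

rows = [
    {"title": "tesla stock split", "embedding": [0.1, -0.2, 0.3]},
    {"title": "silver line inflation woe", "embedding": [0.0, 0.5, -0.4]},
]

# Write: serialize each vector to a JSON string before it goes into the CSV
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "embedding"])
writer.writeheader()
for row in rows:
    writer.writerow({"title": row["title"], "embedding": json.dumps(row["embedding"])})

# Read: parse the JSON string back into a list of floats
buffer.seek(0)
loaded = [
    {"title": r["title"], "embedding": json.loads(r["embedding"])}
    for r in csv.DictReader(buffer)
]

print(loaded[0]["embedding"])
```

Binary formats that keep array columns intact are an alternative when the files are only read back by Python code.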

In [21]:
# Having a look at the dataframe
fake_news_dict_sent['testing'].head(2)

Unnamed: 0,fake_news,title,text_column_tokens,text_without_puntutation,lemmatizers,without_stop_words,text_column,compound_score,positive_score,neutral_score,...,roberta_sentiment,joy,anger,fear,sadness,disgust,surprise,neutral,sentiment_label,sentiment_score
0,2,copycat muslim terrorist arrested with assault...,"['copycat', 'muslim', 'terrorist', 'arrested',...","['copycat', 'muslim', 'terrorist', 'arrested',...","['copycat', 'muslim', 'terrorist', 'arrest', '...","['copycat', 'muslim', 'terrorist', 'arrest', '...",copycat muslim terrorist arrest assault weapon,-0.9201,0.0,0.132,...,0,0.002711,0.36282,0.446019,0.008094,0.034569,0.00718,0.138607,LABEL_0,0.878712
1,2,wow! chicago protester caught on camera admits...,"['wow', '!', 'chicago', 'protester', 'caught',...","['wow', 'chicago', 'protester', 'caught', 'on'...","['wow', 'chicago', 'protester', 'caught', 'on'...","['wow', 'chicago', 'protester', 'caught', 'cam...",wow chicago protester caught camera admits vio...,0.2732,0.335,0.447,...,0,0.007536,0.077617,0.032438,0.014817,0.002431,0.857197,0.007963,LABEL_0,0.830263


In [16]:
news_dict_sent['TSLA']

Unnamed: 0,date,title,source,text_column_tokens,text_without_puntutation,lemmatizers,without_stop_words,text_column,compound_score,positive_score,...,tfidf_embedding,tfidf_length,word2vec_embedding,word2vec_length,bert_embedding,bert_length,bow_embedding,bow_length,miniLM_embedding,miniLM_shape
0,2022-04-01,"Faux News Alert: Elon Musk Arrested, Queen Gra...",Markets Insider,"['faux', 'news', 'alert', ':', 'elon', 'musk',...","['faux', 'news', 'alert', 'elon', 'musk', 'arr...","['faux', 'news', 'alert', 'elon', 'musk', 'arr...","['faux', 'news', 'alert', 'elon', 'musk', 'arr...",faux news alert elon musk arrest queen grab ol...,-0.4767,0.126,...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...",15385,"[-0.35821086, 0.14543921, 0.07205629, -0.08992...",100,"[0.16203997, -0.16324638, -0.28278008, -0.1302...",768,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1000,"[-0.012160414829850197, 0.053922440856695175, ...",384
1,2022-04-01,Is There a Silver Lining to Inflation Woes?,Markets Insider,"['is', 'there', 'a', 'silver', 'lining', 'to',...","['is', 'there', 'a', 'silver', 'lining', 'to',...","['be', 'there', 'a', 'silver', 'line', 'to', '...","['silver', 'line', 'inflation', 'woe']",silver line inflation woe,-0.4215,0.000,...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...",15385,"[-0.34320086, 0.10270837, 0.14207323, -0.15861...",100,"[0.18812853, 0.17853445, -0.013467773, 0.03424...",768,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1000,"[-0.06830117851495743, -0.026948226615786552, ...",384
2,2022-04-01,Why Tesla's Stock Split Is Really About 'Memes...,Markets Insider,"['why', 'tesla', ""'s"", 'stock', 'split', 'is',...","['why', 'tesla', 'stock', 'split', 'is', 'real...","['why', 'tesla', 'stock', 'split', 'be', 'real...","['tesla', 'stock', 'split', 'really', 'dream',...",tesla stock split really dream theme,0.3167,0.314,...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...",15385,"[-0.45050678, 0.17536835, 0.034953408, 0.24373...",100,"[0.13284218, -0.21777655, -0.6656369, 0.329884...",768,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1000,"[-0.006405037362128496, -0.009813237003982067,...",384
3,2022-04-01,TSLA Stock News: 6 Biggest Headlines That Tesl...,Markets Insider,"['tsla', 'stock', 'news', ':', '6', 'biggest',...","['tsla', 'stock', 'news', '6', 'biggest', 'hea...","['tsla', 'stock', 'news', '6', 'big', 'headlin...","['tsla', 'stock', 'news', '6', 'big', 'headlin...",tsla stock news 6 big headline tesla investor ...,0.0000,0.000,...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...",15385,"[-0.74041533, 0.19829059, 0.27118728, 0.265206...",100,"[-0.21676604, -0.14409831, -0.023208717, 0.211...",768,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1000,"[-0.012592598795890808, 0.013282055035233498, ...",384
4,2022-04-01,HSBC Compares This EV Maker To Tesla Read Why,Markets Insider,"['hsbc', 'compares', 'this', 'ev', 'maker', 't...","['hsbc', 'compares', 'this', 'ev', 'maker', 't...","['hsbc', 'compare', 'this', 'ev', 'maker', 'to...","['hsbc', 'compare', 'ev', 'maker', 'tesla', 'r...",hsbc compare ev maker tesla read,0.0000,0.000,...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...",15385,"[-0.36505875, 0.19596665, 0.0003031989, 0.1619...",100,"[-0.082745396, -0.25338537, 0.23409839, 0.1389...",768,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1000,"[-0.03992275521159172, 0.026881005614995956, -...",384
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23095,2025-04-05,"ALDX SECURITIES ALERT: Aldeyra Therapeutics, I...",Alpha Vantage,"['aldx', 'securities', 'alert', ':', 'aldeyra'...","['aldx', 'securities', 'alert', 'aldeyra', 'th...","['aldx', 'security', 'alert', 'aldeyra', 'ther...","['aldx', 'security', 'alert', 'aldeyra', 'ther...",aldx security alert aldeyra therapeutic inc. i...,0.7184,0.292,...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...",15385,"[-0.3966381, 0.09935118, 0.11077504, 0.0453653...",100,"[-0.5136765, -0.19561636, -0.18028484, -0.2756...",768,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1000,"[-0.016185497865080833, -0.01937377080321312, ...",384
23096,2025-04-06,1 AI Robotics Stock to Buy Before It Soars 160...,Alpha Vantage,"['1', 'ai', 'robotics', 'stock', 'to', 'buy', ...","['1', 'ai', 'robotics', 'stock', 'to', 'buy', ...","['1', 'ai', 'robotics', 'stock', 'to', 'buy', ...","['1', 'ai', 'robotics', 'stock', 'buy', 'soar'...",1 ai robotics stock buy soar 160 % 2 trillion ...,0.2732,0.160,...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...",15385,"[-0.34043753, 0.33370942, 0.21432164, 0.170055...",100,"[-0.80835897, -0.3036112, -0.059404142, 0.0366...",768,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1000,"[0.030140457674860954, -0.09314361214637756, -...",384
23097,2025-04-06,Elon Musk Thinks Tesla Will Become the World's...,Alpha Vantage,"['elon', 'musk', 'thinks', 'tesla', 'will', 'b...","['elon', 'musk', 'thinks', 'tesla', 'will', 'b...","['elon', 'musk', 'think', 'tesla', 'will', 'be...","['elon', 'musk', 'think', 'tesla', 'become', '...",elon musk think tesla become world valuable co...,0.1027,0.165,...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...",15385,"[-0.43512595, 0.41087106, -0.0015948871, -0.15...",100,"[-0.24465694, -0.14306791, 0.120551914, 0.3829...",768,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1000,"[0.00170706317294389, 0.018161239102482796, 0....",384
23098,2025-04-06,"Trump Backs Biden's Project, China Tariffs, Te...",Alpha Vantage,"['trump', 'backs', 'biden', ""'s"", 'project', '...","['trump', 'backs', 'biden', 'project', 'china'...","['trump', 'back', 'biden', 'project', 'china',...","['trump', 'back', 'biden', 'project', 'china',...",trump back biden project china tariff tesla st...,0.0000,0.000,...,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...",15385,"[-0.6120287, 0.2939689, 0.13553897, 0.0385757,...",100,"[-0.2703923, -0.45928058, -0.46089938, -0.0327...",768,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1000,"[0.00040450674714520574, 0.06807352602481842, ...",384
