# BeigeSage

This notebook loads the pre-trained, fine-tuned BeigeSage model and runs it on text.

### Reading in testing data

In [1]:
# Import necessary libraries
import pandas as pd
import os
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import shutil

# Load the sentiment scores CSV
csv_path = r"C:/Users/MCOB PHD 14/Dropbox/Charlie's Dissertation/Beige Books/manual_sentiment.csv"
sentiment_data = pd.read_csv(csv_path)

# Define the label function
def label_sentiment(score):
    if score <= -0.3:
        return 0  # Negative
    elif score <= 0.2:
        return 1  # Mixed
    else:
        return 2  # Positive

# Apply the label function to the sentiment scores
sentiment_data['label'] = sentiment_data['human_sentiment'].apply(label_sentiment)

# Define path where text files are stored
text_files_dir = r"C:/Users/MCOB PHD 14/Dropbox/Charlie's Dissertation/Beige Books/selected_chunks2"

# Load the text files and create a DataFrame
text_data = {}
for filename in os.listdir(text_files_dir):
    if filename.endswith('.txt'):
        file_path = os.path.join(text_files_dir, filename)
        with open(file_path, 'r', encoding='utf-8') as file:
            text_data[filename] = file.read()

# Combine text data with sentiment data
text_df = pd.DataFrame(list(text_data.items()), columns=['file_names', 'text'])
combined_data = pd.merge(sentiment_data, text_df, on='file_names')

# Split data into training and testing sets
train_data, test_data = train_test_split(combined_data, test_size=0.2, random_state=42, stratify=combined_data['label'])
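
The cutoffs in `label_sentiment` place the boundary scores -0.3 and 0.2 in the lower class (Negative and Mixed, respectively). A quick sanity check of that boundary behaviour is sketched below. The function is restated so the snippet runs on its own, and the sample scores are illustrative, not taken from the dataset.

```python
def label_sentiment(score):
    # Same thresholds as the labelling function in this notebook
    if score <= -0.3:
        return 0  # Negative
    elif score <= 0.2:
        return 1  # Mixed
    else:
        return 2  # Positive

class_names = {0: "Negative", 1: "Mixed", 2: "Positive"}

# Illustrative scores, chosen to sit on and around the two cutoffs
samples = [-1.0, -0.3, -0.29, 0.0, 0.2, 0.21, 1.0]
labels = [label_sentiment(s) for s in samples]

for score, label in zip(samples, labels):
    print(f"{score:+.2f} -> {class_names[label]}")
```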



In [2]:
# Average length of the texts in combined_data, in whitespace-separated words
print(combined_data['text'].apply(lambda x: len(x.split())).mean())

228.603
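
This average counts whitespace-separated words, whereas the `max_length=512` limit used later in the notebook counts tokenizer tokens, and a subword tokenizer usually produces more tokens than words. The sketch below shows one way to estimate the share of texts that would be truncated. The helper `share_truncated` and the stand-in `toy_encode` are hypothetical; in this notebook one could pass something like `lambda t: tokenizer(t)["input_ids"]` once the tokenizer is loaded.

```python
def share_truncated(texts, encode, max_length=512):
    """Return the fraction of texts whose encoded length exceeds max_length."""
    if not texts:
        return 0.0
    too_long = sum(1 for t in texts if len(encode(t)) > max_length)
    return too_long / len(texts)

def toy_encode(text):
    # Stand-in for a real tokenizer (assumption): one token per word
    return text.split()

# Two illustrative texts: one short, one with 600 words
example_texts = ["short text", "word " * 600]
print(share_truncated(example_texts, toy_encode))
```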


### Load BeigeSage

In [3]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch

# Define the path where the model and tokenizer are saved
saved_model_path = 'C:/Users/MCOB PHD 14/Desktop/bbFinal/Notebooks/BeigeSage'

# Load the saved tokenizer
tokenizer = RobertaTokenizer.from_pretrained(saved_model_path)

# Load the saved model
model = RobertaForSequenceClassification.from_pretrained(saved_model_path)

# Set the model to evaluation mode
model.eval()


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

In [4]:
from tqdm import tqdm  # imported here so this cell does not depend on a later one

# Define a function to tokenize and predict sentiment
def predict_sentiment_with_status(texts):
    predictions = []
    total_texts = len(texts)
    for idx, text in enumerate(tqdm(texts, desc="Predicting Sentiment", ncols=100)):
        # Tokenize the text
        inputs = tokenizer(
            text,
            return_tensors="pt",       # Return as PyTorch tensors
            truncation=True,           # Truncate longer sequences
            padding='max_length',      # Pad to max length
            max_length=512             # Set maximum length
        )
        
        # Perform prediction
        with torch.no_grad():
            outputs = model(**inputs)
        
        # Get the predicted label
        predicted_label = torch.argmax(outputs.logits, dim=1).item()
        predictions.append(predicted_label)
        
        # Print status update for every 10 texts
        if (idx + 1) % 10 == 0 or (idx + 1) == total_texts:
            print(f"Processed {idx + 1}/{total_texts} texts")

    return predictions
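
The function above runs one forward pass per text. If throughput matters, the texts could be grouped into batches first. Below is a minimal sketch of the batching logic only. The names `predict_in_batches` and `predict_batch` are hypothetical: `predict_batch` stands for whatever turns a list of texts into a list of labels (in this notebook, presumably a tokenizer call on the whole list followed by one `model(**inputs)` call and an `argmax` over `dim=1`). The dummy classifier exists only to make the example runnable.

```python
def predict_in_batches(texts, predict_batch, batch_size=16):
    """Apply predict_batch to consecutive slices of texts, preserving order."""
    predictions = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        predictions.extend(predict_batch(batch))
    return predictions

def dummy_predict_batch(batch):
    # Placeholder classifier (assumption): label = word count modulo 3
    return [len(text.split()) % 3 for text in batch]

example_texts = ["one", "one two", "one two three", "one two three four", "a b"]
batched = predict_in_batches(example_texts, dummy_predict_batch, batch_size=2)
print(batched)
```

Batching does not change the predicted labels as long as `predict_batch` treats each text independently; it only reduces the number of calls.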

In [5]:
from tqdm import tqdm

# Apply the prediction function to the 'text' column of test_data DataFrame
test_texts = test_data['text'].tolist()  # Convert text column to list
test_data['predicted_label'] = predict_sentiment_with_status(test_texts)

# Map the numerical labels back to the class names
label_map = {0: "Negative", 1: "Mixed", 2: "Positive"}
test_data['predicted_class'] = test_data['predicted_label'].map(label_map)

# Display the DataFrame with predictions
print(test_data[['text', 'predicted_label', 'predicted_class']])

Predicting Sentiment: 100%|███████████████████████████████████████| 200/200 [01:35<00:00,  2.10it/s]

Processed 200/200 texts
                                                  text  predicted_label  \
528  reshaping their job mix . Longer-term , many r...                1   
491  December 8 , 1999 The Fifth District economy c...                2   
888  over the year in Massachusetts , Boston , and ...                1   
899  April 17 , 2019 Summary of Economic Activity S...                2   
960  economic conditions visit : https : //www.atla...                1   
..                                                 ...              ...   
672  , you do n't get the money . '' Looking ahead ...                0   
340  split between those expecting the trade balanc...                1   
244  activity has been `` unexpectedly quiet '' sin...                1   
471  outlook for 1998 , Third District bankers see ...                2   
550  for capital investment in the industry , dampe...                1   

    predicted_class  
528           Mixed  
491        Positive  
888      




Note to self: ran the model on Sept. 26, 2024; predicting the 200 test texts took 1 min 35.1 s.

In [6]:
# Map the numeric human labels to class names, reusing label_map from above
test_data['label_actual'] = test_data['label'].map(label_map)

# Display the classification report
print(classification_report(test_data['label_actual'], test_data['predicted_class']))

              precision    recall  f1-score   support

       Mixed       0.67      0.65      0.66        83
    Negative       0.69      0.64      0.67        39
    Positive       0.77      0.82      0.80        78

    accuracy                           0.71       200
   macro avg       0.71      0.70      0.71       200
weighted avg       0.71      0.71      0.71       200
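
In this report, the `macro avg` row is the unweighted mean of the per-class scores, while `weighted avg` weights each class by its support. The sketch below recomputes both precision averages from the rounded values printed above. Because the inputs are already rounded to two decimals, figures recomputed this way can differ from scikit-learn's in the last digit.

```python
# Per-class precision and support, copied from the report above
report = {
    "Mixed": {"precision": 0.67, "support": 83},
    "Negative": {"precision": 0.69, "support": 39},
    "Positive": {"precision": 0.77, "support": 78},
}

total_support = sum(row["support"] for row in report.values())

# Macro average: every class counts equally
macro_precision = sum(row["precision"] for row in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support
weighted_precision = sum(
    row["precision"] * row["support"] for row in report.values()
) / total_support

print(f"macro precision:    {macro_precision:.2f}")
print(f"weighted precision: {weighted_precision:.2f}")
```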



In [19]:
test_data.head()

| | Document | file_names | human_sentiment | scorer | label | text | label_actual | predicted_label | predicted_class |
|---|---|---|---|---|---|---|---|---|---|
| 528 | 529 | 2002_bo (5)_chunk_6.txt | 0.2 | CS | 1 | reshaping their job mix . Longer-term , many r... | Mixed | 1 | Mixed |
| 491 | 492 | 1999_ri (1)_chunk_1.txt | 0.6 | CS | 2 | December 8 , 1999 The Fifth District economy c... | Positive | 2 | Positive |
| 888 | 889 | 2019_bo (6)_chunk_6.txt | 0.3 | CS | 2 | over the year in Massachusetts , Boston , and ... | Positive | 1 | Mixed |
| 899 | 900 | 2019_ri (6)_chunk_1.txt | 0.3 | CS | 2 | April 17 , 2019 Summary of Economic Activity S... | Positive | 2 | Positive |
| 960 | 961 | 2023_at (7)_chunk_6.txt | 0.0 | CS | 1 | economic conditions visit : https : //www.atla... | Mixed | 1 | Mixed |


In [20]:
# Rename predicted_class as Sentiment_BeigeSage
test_data.rename(columns={'predicted_class': 'Sentiment_BeigeSage'}, inplace=True)

# Save the test_data DataFrame to a CSV file
test_data.to_csv('BeigeSage_predictions.csv', index=False)

## Testing a single sentence or paragraph

In [27]:
# Run the model on a single sample text and print its predicted label

text = "March 12 , 1975 Recent evidence indicates further weakening in District business activity . Unemployment rose sharply in January , and manufacturers are not so optimistic as they were last November . Nevertheless , bank directors cited several sectors where business activity has improved . In addition , directors report that District businesses have not been confronted with overly excessive inventories ; inflationary pressures have also eased . The District 's rural areas continue to be relatively unaffected by the recession . Unemployment continues to rise . The District 's unemployment rate , seasonally adjusted , increased from 5.9 percent to 6.4 percent between December and January , with the Minneapolis-St. Paul area rate jumping from 5.1 percent to 6.2 percent . In early 1975 , District help-wanted advertising was down 30 percent from a year ago and initial claims for unemployment insurance were up 35 percent . District manufacturers are not so optimistic about their sales outlook as they were last November . After increasing 15.9 percent from a year earlier in the fourth quarter , District manufacturers look for their sales to surpass year-ago levels by around 9 percent during the first nine months of this year . ( Last November respondents had anticipated a 14 percent sales gain during the first half of 1974 . ) Both durables and nondurables producers revised downward their sales expectations , with price increases probably accounting for most , if not all , of the anticipated sales gains . Recent declines in District manufacturing employment also denote"

inputs = tokenizer(
    text,
    return_tensors="pt",       # Return as PyTorch tensors
    truncation=True,           # Truncate longer sequences
    padding='max_length',      # Pad to max length
    max_length=512             # Set maximum length
)

# Perform prediction
with torch.no_grad():
    outputs = model(**inputs)

# Get the predicted label
predicted_label = torch.argmax(outputs.logits, dim=1).item()
predicted_class = label_map[predicted_label]
print(f"Predicted Sentiment: {predicted_class}")

Predicted Sentiment: Negative
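
`argmax` returns only the top class. To see how confident the model is, the logits can be converted to class probabilities with a softmax (in PyTorch, `torch.softmax(outputs.logits, dim=1)`). The sketch below shows the computation in plain Python. The logit values are made up for illustration and are not output from BeigeSage.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

label_map = {0: "Negative", 1: "Mixed", 2: "Positive"}

# Hypothetical logits for one text, ordered Negative, Mixed, Positive
example_logits = [2.0, 0.5, -1.0]
probs = softmax(example_logits)

# Index of the most probable class, equivalent to argmax over the logits
best = max(range(len(probs)), key=probs.__getitem__)

for idx, p in enumerate(probs):
    print(f"{label_map[idx]:<8} {p:.3f}")
print(f"Predicted Sentiment: {label_map[best]}")
```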
