# Introduction

In this notebook we introduce a method for using an open-source language model as a classifier that assigns strings to an arbitrary set of categories.

The language model we will use is pretrained on a large dataset (just like the models underpinning OpenAI's recently famous ChatGPT interface). This particular model was trained and released by Facebook and has been optimized for classification rather than text generation, but the underlying techniques are similar. It can therefore perform the tasks we want out of the box, without any further tweaking!

Classification is the act of assigning a probability between zero and one that a label (or each label in a set) applies to a string of text.

To give an example, the sentence "The sun is out today." seems pretty happy. If a person were asked to give it a score on "happiness", they might assign it a score of 90%. But what about the sentence "It is raining outside.."? That perhaps should score only 10%, if not lower. The language model we use will be able to draw similar conclusions automatically, based on its understanding of the English language.

Let's get started!

# Setup

In [2]:
# Import the pipeline helper from the transformers package, which we will use to load our model
from transformers import pipeline

In [2]:
# Download and instantiate the facebook/bart-large-mnli model and pipeline. The first time this cell runs, it downloads a large file containing the model weights (1.5GB of parameters all in all!).
# Let it finish; from then on it will be cached on disk.

# Be aware that instantiating this model uses a lot of memory/RAM, as it has to load the full 1.5GB of model parameters. If you load the model as below in several notebooks at the same time, you might overload your machine.
# To avoid this, when you're done with a notebook, shut down its "kernel" from the left side of the JupyterLab interface (the second icon, a circle with a square inside, lists the active kernels). This unloads the model from memory.

c = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')

# Testing whether a label is applicable

Let's see if we can replicate the ideas above. Is the sentence "The sun is out today." indeed classifiably happy?

To find out, we call our classifier, which takes (at least) the following arguments: a sentence and a list of labels.

Perhaps it's easier to just look at an example!

In [3]:
c('The sun is out today.', ['Happy'])

{'sequence': 'The sun is out today.',
 'labels': ['Happy'],
 'scores': [0.958601176738739]}

Great result! It looks like our sentence is indeed very happy. With a score of almost 96%, it doesn't get much happier than that.

Challenge: Can you find a way to modify the sentence such that it is even happier?

In [4]:
# Modify the sentence to something even happier
c('The sun is out today.', ['Happy'])

{'sequence': 'The sun is out today.',
 'labels': ['Happy'],
 'scores': [0.958601176738739]}

And how about our "unhappy" sentence? Will the model agree?

In [5]:
c('It is raining outside..', ['Happy'])

{'sequence': 'It is raining outside..',
 'labels': ['Happy'],
 'scores': [0.00019142446399200708]}

Nobody likes rain..

# Multiple labels

If that's not a happy sentence, then perhaps we can conclude it's a sad one? Does our model agree?

We can ask it to make a choice between several available labels.

In [6]:
c('It is raining outside..', ['Happy', 'Sad'])

{'sequence': 'It is raining outside..',
 'labels': ['Sad', 'Happy'],
 'scores': [0.9559348821640015, 0.04406508430838585]}

The mathematics of the model works out slightly differently from before when several labels are given, but the conclusion remains the same: it certainly considers the sentence to be much more sad than happy.

What about other labels, like wet, dry, high, low and colorful or gray?

c('It is raining outside..', ['Happy', 'Sad', 'Wet', 'Dry', 'High', 'Low', 'Colorful', 'Gray'])

Hard question, but a clear and agreeable answer: it's more Wet than any of the other labels. But isn't rain both wet and sad?

To answer that question we can ask the model to assign a probability per label, rather than forcing it to make a choice. Note that in the above result, all label probabilities add up to 100%.

In [7]:
c('It is raining outside..', ['Happy', 'Sad', 'Wet', 'Dry', 'High', 'Low', 'Colorful', 'Gray'], multi_label=True)

{'sequence': 'It is raining outside..',
 'labels': ['Wet', 'Sad', 'Low', 'Gray', 'High', 'Colorful', 'Dry', 'Happy'],
 'scores': [0.9990817308425903,
  0.8136840462684631,
  0.7355218529701233,
  0.41334474086761475,
  0.03862270340323448,
  0.001768096350133419,
  0.0003273721958976239,
  0.00019142446399200708]}

This makes sense!
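
The difference between the two modes comes down to how the raw scores are normalized. The sketch below is a minimal illustration in plain Python, not the pipeline's actual code: the logits are made-up numbers, and the names `softmax`, `exclusive` and `independent` are ours. As far as we understand the `transformers` zero-shot pipeline, the default mode applies a softmax over the "entailment" scores of all labels, so the results sum to 100%, whereas `multi_label=True` (or a single label) compares each label's "entailment" score against its own "contradiction" score, so every label gets an independent probability.

```python
import math

def softmax(values):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up (entailment, contradiction) logits per label, for illustration only.
logits = {
    "Wet": (4.0, -3.0),
    "Sad": (1.5, 0.0),
    "Happy": (-4.0, 4.5),
}

# Default mode: the labels compete with each other, so the scores sum to 1.
entailment = [ent for ent, _ in logits.values()]
exclusive = dict(zip(logits, softmax(entailment)))

# multi_label=True: each label is judged on its own (entailment vs. contradiction).
independent = {
    label: softmax([ent, con])[0] for label, (ent, con) in logits.items()
}

print(exclusive)
print(independent)
```

This would also explain why "Happy" receives exactly the same score here (about 0.00019) as it did when it was the only label: in both cases the label is scored without competition from other labels.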

# Edge cases

Negation. Surely, if rain is bad, then the opposite should be good?

In [8]:
c('It is not raining outside..', ['Happy', 'Sad'])

{'sequence': 'It is not raining outside..',
 'labels': ['Happy', 'Sad'],
 'scores': [0.8534077405929565, 0.14659228920936584]}

It depends on the context though, and the model can handle that pretty well too.

In [9]:
c('After the drought it is finally raining outside..', ['Happy', 'Sad'])

{'sequence': 'After the drought it is finally raining outside..',
 'labels': ['Happy', 'Sad'],
 'scores': [0.7387923002243042, 0.2612077295780182]}

# Multiple sentences/strings

We can also ask the model to classify multiple sentences in one go. That might be useful when you use this model for your trading strategy later.

In [10]:
c(['The sun is out today.', 'It is raining outside..'], ['Happy', 'Sad'])

[{'sequence': 'The sun is out today.',
  'labels': ['Happy', 'Sad'],
  'scores': [0.9836673140525818, 0.016332658007740974]},
 {'sequence': 'It is raining outside..',
  'labels': ['Sad', 'Happy'],
  'scores': [0.9559348821640015, 0.04406508430838585]}]
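
When classifying many strings at once, it is often handy to reduce each result to its winning label. The following is a small sketch under the assumption, consistent with every output shown so far, that the labels in each result are listed in descending order of score. The helper name `top_label` is hypothetical, and the `results` list is simply the output of the cell above pasted in, so the model does not need to be loaded to try it.

```python
def top_label(result):
    """Return the (label, score) pair with the highest score from one result dict."""
    # Labels are listed in descending score order, so index 0 is the winner.
    return result["labels"][0], result["scores"][0]

# Output copied from the batch classification cell above.
results = [
    {"sequence": "The sun is out today.",
     "labels": ["Happy", "Sad"],
     "scores": [0.9836673140525818, 0.016332658007740974]},
    {"sequence": "It is raining outside..",
     "labels": ["Sad", "Happy"],
     "scores": [0.9559348821640015, 0.04406508430838585]},
]

# Map every input sentence to its best label and score.
summary = {r["sequence"]: top_label(r) for r in results}
for sentence, (label, score) in summary.items():
    print(f"{sentence!r}: {label} ({score:.1%})")
```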

# Free-form exploration

Below, try a few different examples, different labels, sentence structures, and so on, to get a feel for what the model can and can't handle very well.

In [11]:
c('@PharmaNews: Pfizer faces backlash over possible closure of regional office. #PharmaNews #RegionalOffice', ['sad','NVDA', 'ING', 'SAN', 'PFE', 'CSCO'])

{'sequence': '@PharmaNews: Pfizer faces backlash over possible closure of regional office. #PharmaNews #RegionalOffice',
 'labels': ['PFE', 'sad', 'NVDA', 'CSCO', 'SAN', 'ING'],
 'scores': [0.8763680458068848,
  0.0623045489192009,
  0.017301151528954506,
  0.01568964123725891,
  0.014519426971673965,
  0.013817159458994865]}
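
Mixing a sentiment label with ticker labels, as above, means the top label is not guaranteed to be a company. One way to handle this, sketched here purely as an assumption, is to take the best-scoring ticker and accept it only above a cut-off. The helper `pick_ticker` and the threshold of 0.5 are hypothetical choices, not something the model or the library prescribes; the `result` dict is the output shown above.

```python
TICKERS = {"NVDA", "ING", "SAN", "PFE", "CSCO"}

def pick_ticker(result, tickers=TICKERS, threshold=0.5):
    """Return the best-scoring ticker label, or None if it scores below the threshold."""
    for label, score in zip(result["labels"], result["scores"]):
        if label in tickers:
            # Labels are sorted by score, so the first ticker found is the best one.
            return label if score >= threshold else None
    return None

# Output copied from the cell above.
result = {
    "sequence": "@PharmaNews: Pfizer faces backlash over possible closure of "
                "regional office. #PharmaNews #RegionalOffice",
    "labels": ["PFE", "sad", "NVDA", "CSCO", "SAN", "ING"],
    "scores": [0.8763680458068848, 0.0623045489192009, 0.017301151528954506,
               0.01568964123725891, 0.014519426971673965, 0.013817159458994865],
}

print(pick_ticker(result))                 # accepted with the default threshold
print(pick_ticker(result, threshold=0.9))  # rejected with a stricter threshold
```

A suitable threshold would have to be tuned on labelled data such as `training.csv`; the value used here is only a placeholder.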

In [3]:
import pandas as pd

In [4]:
# Read the data
df = pd.read_csv('training.csv')
df.head(10)

Unnamed: 0,SocialMediaFeed,NVDA,ING,SAN,PFE,CSCO
0,@PharmaNews: Pfizer faces backlash over possib...,0.0,0.0,0.0,-0.029512,0.0
1,@BusinessReport: A recent study found that mos...,0.0,0.0,0.0,0.0,0.0
2,@HardwareHubs: NVIDIA's contributions to a maj...,0.026125,0.0,0.0,0.0,0.0
3,@HealthWatch: Johnson & Johnson faces lawsuits...,0.0,0.0,0.0,0.0,0.0
4,@IndustryInsider: Magnificent Honary faces pro...,0.0,0.0,0.0,0.0,0.0
5,@SocialMediaRumor: Unverified sources hint at ...,0.0,0.0,0.0,0.0,0.0
6,@USFastFoodNews: McDonald's facing heat over p...,0.0,0.0,0.0,0.0,0.0
7,@TechTrends: Cisco faces challenges in its sup...,0.0,0.0,0.0,0.0,-0.028257
8,@DigitalDaily: Nvidia's stock feels the heat a...,-0.030776,0.0,0.0,0.0,0.0
9,@PharmaFlash: Pfizer faces minor product recal...,0.0,0.0,0.0,0.0,0.0


In [14]:
from transformers import pipeline

# Load the zero-shot classification pipeline
classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')

# Function to analyze sentiment and predict stock movement
def analyze_sentiment_and_stock(text):
    # Define the possible labels (sentiments)
    possible_labels = ['positive', 'negative', 'neutral']

    # Perform zero-shot classification
    result = classifier(text, possible_labels)

    # Extract the predicted label and score
    predicted_label = result['labels'][0]
    predicted_score = result['scores'][0]

    # Map sentiment to stock movement
    if predicted_label == 'positive':
        stock_movement = 1.0  # Replace with actual positive movement
    elif predicted_label == 'negative':
        stock_movement = -1.0  # Replace with actual negative movement
    else:
        stock_movement = 0.0  # Neutral sentiment

    return stock_movement, predicted_label, predicted_score

# Example usage
text_example = "@PharmaNews: Pfizer faces backlash over possible closure of regional office. #PharmaNews #RegionalOffice"
movement, label, score = analyze_sentiment_and_stock(text_example)

print(f"Predicted Stock Movement: {movement}")
print(f"Predicted Sentiment: {label}")
print(f"Confidence Score: {score}")

Predicted Stock Movement: -1.0
Predicted Sentiment: negative
Confidence Score: 0.9809707999229431


In [16]:
from stock_sentiment_analysis import analyze_sentiment_and_stock

# Example usage
text_example = "@FinancialLeaks: Unconfirmed report of operational inefficiencies at ING Bank. #Banking #OperationalInnovation ING"
detected_company, predicted_sentiment, sentiment_score, adjusted_output = analyze_sentiment_and_stock(text_example)

print(f"Detected Company: {detected_company}")
print(f"Predicted Sentiment: {predicted_sentiment}")
print(f"Sentiment Score: {sentiment_score}")
print(f"Adjusted Output: {adjusted_output}")

Detected Company: ING
Predicted Sentiment: negative
Sentiment Score: 0.813266396522522
Adjusted Output: 0.0


In [20]:
company_training_data = []

for index, row in df.iterrows():
    non_zero_companies = [company for company in ['NVDA', 'ING', 'SAN', 'PFE', 'CSCO'] if row[company] != 0]
    if non_zero_companies:
        # Use the first non-zero company as the label
        label = non_zero_companies[0]
    else:
        # No non-zero companies, assign an empty label
        label = ''

    text_data = row['SocialMediaFeed']
    company_training_data.append((text_data, label))

In [23]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Split the data into training and test sets
train_data, test_data = train_test_split(company_training_data, test_size=0.2, random_state=42)

# Define a simple text classification model
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, output_size):
        super(TextClassifier, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, 32, sparse=True)
        self.fc = nn.Linear(32, output_size)
    
    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

# Define the dataset class
class CompanyDataset(Dataset):
    def __init__(self, data, vectorizer):
        self.data = data
        self.vectorizer = vectorizer
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        text, label = self.data[index]
        vectorized_text = torch.tensor(self.vectorizer.transform([text]).toarray()[0])
        label = torch.tensor([label], dtype=torch.long)  # Expects an integer class index; a string label fails here
        return vectorized_text, label.item()  # .item() returns the label as a plain Python int

# Initialize the text vectorizer
vectorizer = CountVectorizer()
vectorizer.fit([text for text, _ in train_data])

# Initialize the datasets and data loaders
train_dataset = CompanyDataset(train_data, vectorizer)
test_dataset = CompanyDataset(test_data, vectorizer)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Initialize the model, loss function, and optimizer
model = TextClassifier(vocab_size=len(vectorizer.vocabulary_), output_size=len(set(label for _, label in train_data)))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 5
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    for text, label in train_loader:
        optimizer.zero_grad()
        output = model(text, None)
        
        # Ensure labels are converted to LongTensor
        label_tensor = torch.tensor([int(label)], dtype=torch.long)  # Convert label to integer
        
        loss = criterion(output, label_tensor)
        loss.backward()
        optimizer.step()

# Evaluate the model
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for text, label in test_loader:
        output = model(text, None)
        _, predicted = torch.max(output, 1)
        total += label.size(0)
        correct += (predicted == label).sum().item()

accuracy = correct / total
print(f'Test Accuracy: {accuracy}')

ValueError: too many dimensions 'str'
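
The `ValueError` above is most likely raised by `torch.tensor([label], dtype=torch.long)` inside `__getitem__`: the labels in `company_training_data` are strings such as `'PFE'` or `''`, while `nn.CrossEntropyLoss` expects integer class indices. One possible fix, sketched below under that assumption, is to map each label to an integer before building the dataset. The helper `build_label_index` and the sample pairs are hypothetical, and the sketch deliberately stops short of the PyTorch part.

```python
def build_label_index(pairs):
    """Map every distinct label to a stable integer class index."""
    labels = sorted({label for _, label in pairs})
    return {label: index for index, label in enumerate(labels)}

# Hypothetical (text, label) pairs in the same shape as company_training_data.
sample_pairs = [
    ("example post about Pfizer", "PFE"),
    ("example post about no tracked company", ""),
    ("example post about Nvidia", "NVDA"),
    ("another post about no tracked company", ""),
]

label_to_index = build_label_index(sample_pairs)

# Replace each string label with its integer index.
encoded_pairs = [(text, label_to_index[label]) for text, label in sample_pairs]

print(label_to_index)
print([label for _, label in encoded_pairs])
```

With integer labels, converting a label to a `torch.long` tensor should succeed, and `len(label_to_index)` gives a natural `output_size` for the model. Building the mapping from the full dataset rather than only the training split avoids missing labels that appear only in the test split. Other parts of the cell may need attention as well; for example, `nn.EmbeddingBag` expects token indices, not the word counts that `CountVectorizer` produces.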

In [7]:
sentiment_training_data = []

for index, row in df.iterrows():
    # Sum the values of the five companies
    total_value = sum(row[['NVDA', 'ING', 'SAN', 'PFE', 'CSCO']])
    
    # Assign a label based on the sign of the sum
    sentiment_label = 'positive' if total_value > 0 else ('negative' if total_value < 0 else 'neutral')
    
    text_data = row['SocialMediaFeed']
    sentiment_training_data.append((text_data, sentiment_label))

# Print a few example rows
for example in sentiment_training_data[:5]:
    print(example)


('@PharmaNews: Pfizer faces backlash over possible closure of regional office. #PharmaNews #RegionalOffice', 'negative')
('@BusinessReport: A recent study found that most CEOs only read business books. That explains a lot. #CEOReads #BusinessBooks', 'neutral')
("@HardwareHubs: NVIDIA's contributions to a major industry collaboration have given the stock a boost. #IndustryCollaboration #GraphicsChip", 'positive')
('@HealthWatch: Johnson & Johnson faces lawsuits over product safety concerns. #Lawsuits #ProductSafety', 'neutral')
('@IndustryInsider: Magnificent Honary faces production delays. #ProductionDelays #IndustryNews', 'neutral')


Traceback (most recent call last):
  File "testmodel.py", line 71, in <module>
    loss = criterion(output, label)
  File "/home/ec2-user/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/.local/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 1152, in forward
    label_smoothing=self.label_smoothing)
  File "/home/ec2-user/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
TypeError: cross_entropy_loss(): argument 'target' (position 2) must be Tensor, not tuple
