## Project 6: Analyzing Stock Sentiment from Twits

In [None]:
import json
import nltk
import os
import random
import re
import torch

from torch import nn, optim
import torch.nn.functional as F

## Introduction
When deciding the value of a company, it's important to follow the news, for example a product recall or a natural disaster that disrupts the company's supply chain. You want to be able to turn this information into a signal. Neural networks are currently among the most effective tools for this job.

For this project, you'll use posts from the social media site [StockTwits](https://en.wikipedia.org/wiki/StockTwits). The community on StockTwits is full of investors, traders, and entrepreneurs. Each message posted is called a Twit. This is similar to Twitter's version of a post, called a Tweet. You'll build a model around these twits that generates a sentiment score.

We've collected a bunch of twits, then hand labeled the sentiment of each. To capture the degree of sentiment, we'll use a five-point scale: very negative, negative, neutral, positive, very positive. Each twit is labeled -2 to 2 in steps of 1, from very negative to very positive respectively. You'll build a sentiment analysis model that will learn to assign sentiment to twits on its own, using this labeled data.

The first thing we should do is load the data.

## Import Twits 
### Load Twits Data 
This JSON file contains a list of objects, one per twit, in the `'data'` field:

```
{'data': [
  {'message_body': 'Neutral twit body text here',
   'sentiment': 0},
  {'message_body': 'Happy twit body text here',
   'sentiment': 1},
   ...
]}
```

The fields represent the following:

* `'message_body'`: The text of the twit.
* `'sentiment'`: Sentiment score for the twit, ranges from -2 to 2 in steps of 1, with 0 being neutral.


To see what the data looks like, print the first 5 twits from the list.

In [None]:
with open('twits.json') as f:
    twits = json.load(f)

twits['data'][:5]

## Length of Data


In [None]:
len(twits['data'])

## Split Message Body and Sentiment Score
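
The scores run from -2 to 2, while class indices for a five-way classifier conventionally start at 0, which is presumably why the cell below adds 2 to each score. A minimal sketch of the mapping and its inverse; `CLASS_NAMES`, `score_to_class`, and `class_to_score` are illustrative names that are not used elsewhere in this notebook:

```python
# Hypothetical helpers illustrating the label shift; not part of the notebook's pipeline.
CLASS_NAMES = ['very negative', 'negative', 'neutral', 'positive', 'very positive']

def score_to_class(score):
    """Map a sentiment score in [-2, 2] to a class index in [0, 4]."""
    return score + 2

def class_to_score(class_index):
    """Invert the shift: class index in [0, 4] back to a score in [-2, 2]."""
    return class_index - 2

scores = [-2, -1, 0, 1, 2]
classes = [score_to_class(s) for s in scores]
print(classes)                         # [0, 1, 2, 3, 4]
print(CLASS_NAMES[score_to_class(0)])  # neutral
```

After the shift, the neutral class has index 2, which is the value the balancing step later checks for.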

In [None]:
messages = [twit['message_body'] for twit in twits['data']]
sentiments = [twit['sentiment'] + 2 for twit in twits['data']]


In [None]:
sentiments[:40]

### Pre-Processing

In [None]:
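nltk.download('wordnet')

def preprocess(message):
    """Turn a raw twit into a list of lowercase, lemmatized word tokens."""
    # Lowercase the twit
    text = message.lower()

    # Replace URLs with a space
    text = re.sub(r'https?://[^\s]+', ' ', text)

    # Replace ticker symbols (a "$" followed by letters or digits) with a space
    text = re.sub(r'\$[a-zA-Z0-9]*', ' ', text)

    # Replace usernames (an "@" followed by letters or digits) with a space
    text = re.sub(r'@[a-zA-Z0-9]*', ' ', text)

    # Replace everything that is not a letter with a space
    text = re.sub(r'[^a-z]', ' ', text)

    # Tokenize by splitting on whitespace
    tokens = text.split()

    # Lemmatize the words and drop single-character tokens
    wnl = nltk.stem.WordNetLemmatizer()
    tokens = [wnl.lemmatize(w) for w in tokens if len(w) > 1]

    return tokens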

### Preprocess All the Twits 
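Now we can preprocess each of the twits in our dataset by applying the function `preprocess` to all the twit messages. The cell below loads the tokenized twits from a pickle file, presumably saved from an earlier run, rather than recomputing them.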
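
To make the cleaning steps concrete, here is a simplified sketch that applies the same regular expressions as `preprocess` to two made-up messages. It deliberately skips the lemmatization step so that it runs without the WordNet download; `preprocess_no_lemma` and the sample messages are assumptions for illustration only:

```python
import re

def preprocess_no_lemma(message):
    """Simplified variant of `preprocess` that skips lemmatization (illustrative only)."""
    text = message.lower()
    text = re.sub(r'https?://[^\s]+', ' ', text)    # drop URLs
    text = re.sub(r'\$[a-zA-Z0-9]*', ' ', text)     # drop ticker symbols such as $aapl
    text = re.sub(r'@[a-zA-Z0-9]*', ' ', text)      # drop @usernames
    text = re.sub(r'[^a-z]', ' ', text)             # keep letters only
    return [w for w in text.split() if len(w) > 1]  # drop one-letter tokens

sample_messages = [
    "Check $AAPL now! @trader says BUY https://example.com/x 100%",
    "$GOOG",
]
sample_tokenized = [preprocess_no_lemma(m) for m in sample_messages]
print(sample_tokenized)  # [['check', 'now', 'says', 'buy'], []]
```

The second message consists only of a ticker symbol, so it comes out empty. Twits like this are why empty token lists need to be filtered out afterwards.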

In [None]:
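import pickle

# Load the saved tokenized twits, presumably produced by something like:
# tokenized = [preprocess(message) for message in messages]
with open('P6_tokenized.p', 'rb') as f:
    tokenized = pickle.load(f)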

In [None]:
messages[0]

In [None]:
tokenized[0]

## Filter Away Empty Tokenized Twits and Their Labels

In [None]:
print("Number of twits before removing empty ones: {}".format(len(tokenized)))

In [None]:
good_tokens = [idx for idx, token in enumerate(tokenized) if len(token) > 0]
tokenized = [tokenized[idx] for idx in good_tokens]
sentiments = [sentiments[idx] for idx in good_tokens]


In [None]:
print("Number of twits after removing empty ones: {}".format(len(good_tokens)))

In [None]:
total_words = [word for twit in tokenized for word in twit]
total_words

## Bag of Words
We want to create a vocabulary and count how often each word appears in the entire corpus. Use the [`Counter`](https://docs.python.org/3.1/library/collections.html#collections.Counter) class to count all the tokens.
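
As a small illustration of the counting step, the sketch below flattens a made-up tokenized corpus and counts it with `Counter`. The `toy_` names and data are assumptions, not part of the real dataset:

```python
from collections import Counter

# Three made-up tokenized twits
toy_tokenized = [['buy', 'stock', 'buy'], ['sell', 'stock'], ['buy']]

# Flatten the list of twits into one list of tokens, then count them
toy_tokens = [word for twit in toy_tokenized for word in twit]
toy_bow = Counter(toy_tokens)

print(toy_bow['buy'])          # 3
print(toy_bow.most_common(2))  # [('buy', 3), ('stock', 2)]
```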

In [None]:
stacked_tokens = [word for twit in tokenized for word in twit]


In [None]:
stacked_tokens

In [None]:
from collections import Counter
bow = Counter(stacked_tokens)

bow


## Filter Away Too Frequent/Rare Words in Messages
- We want to remove some of the most common words, such as 'the', 'and', and 'it'. These words don't help identify sentiment and are so common that they add a lot of noise to our input.
- If we can filter these out, then our network should have an easier time learning.

- We also want to remove really rare words that show up in only a few twits.
- To do so, divide the count of each word by the total number of tokens in the corpus (the code in this notebook uses this in place of the number of messages). Then remove words whose frequency falls below a small cutoff.
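
The two filters can be sketched on made-up counts as follows. The words, counts, and cutoff values are assumptions chosen so the effect is easy to see; the notebook's real cutoffs are set in the cells that follow:

```python
from collections import Counter

# Made-up word counts for illustration
toy_counts = Counter({'the': 50, 'stock': 30, 'bullish': 15, 'recall': 4, 'zzyzx': 1})
toy_total = sum(toy_counts.values())  # 100 tokens in total
toy_freqs = {w: c / toy_total for w, c in toy_counts.items()}

toy_low_cutoff = 0.02  # drop words that make up 2% or less of all tokens
toy_top_k = 1          # drop the single most common word

too_common = {w for w, _ in toy_counts.most_common(toy_top_k)}
kept = [w for w in toy_freqs if toy_freqs[w] > toy_low_cutoff and w not in too_common]
print(kept)  # ['stock', 'bullish', 'recall']
```

Note that the high cutoff is a count of words to drop, while the low cutoff is a frequency threshold, so the two are not on the same scale.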

### Since stopwords have already been removed, it may not be necessary to remove the most frequent words

In [None]:
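total_num_words = len(stacked_tokens)

# Frequency of each word: its count divided by the total number of tokens
freqs = {key: value/total_num_words for key, value in bow.items()}

# Words with a frequency at or below this cutoff are dropped as too rare
low_cutoff = 5e-6

# Number of most common words to drop (a count, not a frequency)
high_cutoff = 15

freqs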

In [None]:
K_most_common = [word[0] for word in bow.most_common(high_cutoff)]


In [None]:
bow.most_common(high_cutoff)


In [None]:
[word[0] for word in bow.most_common(high_cutoff)]

In [None]:

filtered_words = [word for word in freqs if (freqs[word] > low_cutoff and word not in K_most_common)]



In [None]:

print(K_most_common)
print(len(filtered_words))

filtered_words

## Updating Vocabulary by Removing Filtered Words
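
The vocabulary built below starts its ids at 1, which leaves id 0 free to act as the padding value when sequences are batched later. A minimal sketch of such a mapping and its inverse on made-up words; the `toy_` names are assumptions for illustration:

```python
toy_words = ['stock', 'bullish', 'recall']

# Ids start at 1 so that 0 stays reserved for padding
toy_vocab = {word: i for i, word in enumerate(toy_words, 1)}

# Build the inverse from the forward mapping so the two always agree
toy_id2vocab = {i: word for word, i in toy_vocab.items()}

encoded = [toy_vocab[w] for w in ['bullish', 'stock']]
decoded = [toy_id2vocab[i] for i in encoded]
print(encoded)  # [2, 1]
print(decoded)  # ['bullish', 'stock']
```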

In [None]:
vocab = {word: i for i, word in enumerate(filtered_words,1)}

vocab

In [None]:
len(vocab)


In [None]:
# Start at 1 so that id2vocab is the exact inverse of vocab (id 0 is unused by vocab)
id2vocab = {i: word for i, word in enumerate(filtered_words, 1)}
id2vocab

In [None]:
filtered = [ [w for w in twit if w in vocab] for twit in tokenized ]


In [None]:
tokenized[:2]

In [None]:
len(filtered)


In [None]:
filtered

## Balancing the classes
- Let's do a few last pre-processing steps. We find that about 50% of the twits are labeled as neutral.
- This means that our network would be 50% accurate just by guessing neutral (score 0) every single time.
- We should balance our classes to help our model learn.
- We want each sentiment score to show up roughly equally often in the data.
- We can go through each of our examples and randomly drop twits with neutral sentiment.
- What should be the probability of dropping these twits if we want to go from 50% neutral to around 20% neutral? We should also take this opportunity to remove messages with length 0.
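
One way to reason about the answer: if all `n_other` non-neutral twits are kept and `k` neutral twits survive, requiring `k / (k + n_other) = 0.2` gives `k = n_other / 4`. Each neutral twit should therefore be kept with probability `n_other / (4 * n_neutral)` and dropped with one minus that, which matches the `keep_prob` formula used in this notebook. The sketch below generalizes this to any target fraction; `neutral_keep_prob` and the example counts are assumptions for illustration:

```python
def neutral_keep_prob(n_total, n_neutral, target_fraction=0.2):
    """Probability of keeping a neutral twit so that neutrals make up
    `target_fraction` of the balanced data in expectation."""
    n_other = n_total - n_neutral
    # Solve kept / (kept + n_other) = target_fraction for kept
    kept_neutral = target_fraction * n_other / (1 - target_fraction)
    return kept_neutral / n_neutral

# Made-up counts: 1000 twits, half of them neutral
toy_keep_prob = neutral_keep_prob(n_total=1000, n_neutral=500)
expected_neutral = toy_keep_prob * 500
expected_fraction = expected_neutral / (expected_neutral + 500)
print(round(toy_keep_prob, 6))      # 0.25
print(round(expected_fraction, 6))  # 0.2
```

Because twits are dropped at random, the realized fraction will only be close to the target, not exactly equal to it.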

In [None]:
n_neutral = sum(1 for each in sentiments if each == 2)

n_neutral



In [None]:
sentiments

In [None]:
N_examples = len(sentiments)
keep_prob = (N_examples - n_neutral)/4/n_neutral
keep_prob

In [None]:
filtered


In [None]:
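balanced = {'messages': [], 'sentiments':[]}

for idx, sentiment in enumerate(sentiments):
    message = filtered[idx]
    
    if len(message) == 0:
        # Skip twits that have no words left after vocabulary filtering
        continue
    elif sentiment != 2 or random.random() < keep_prob:
        # Keep every non-neutral twit; keep neutral twits (class 2) with probability keep_prob
        balanced['messages'].append(message)
        balanced['sentiments'].append(sentiment) 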

In [None]:
balanced

In [None]:
n_neutral = sum(1 for each in balanced['sentiments'] if each == 2)
print(n_neutral)
N_examples = len(balanced['sentiments'])
n_neutral/N_examples

## Convert Vocab Tokens into Integers

In [None]:
balanced['messages']

In [None]:
token_ids = [[vocab[word] for word in twit] for twit in balanced['messages']]

token_ids

In [None]:
sentiments = balanced['sentiments']
sentiments

In [None]:
len(token_ids)


## Neural Network

#### Embed -> RNN -> Dense -> Softmax
### Implement the text classifier
We use softmax instead of sigmoid because the output of the network is not binary: sentiment has 5 possible classes. We are looking for the class with the highest probability, and softmax produces a probability distribution over all five classes, so it is the better choice.
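
To illustrate what softmax does with five class scores, here is a minimal pure-Python sketch. The logits are made up, and the model itself uses `nn.LogSoftmax` together with `nn.NLLLoss` rather than this helper:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [0.1, -1.2, 0.3, 2.0, 0.5]  # one made-up score per sentiment class
toy_probs = softmax(toy_logits)
predicted_class = toy_probs.index(max(toy_probs))

print(round(sum(toy_probs), 6))  # 1.0
print(predicted_class)           # 3
```

The probabilities are all positive and sum to 1, so the largest one can be read as the model's confidence in its predicted class.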

In [None]:
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, lstm_size, output_size, lstm_layers=1, dropout=0.1):        
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.lstm_size = lstm_size
        self.output_size = output_size        
        self.lstm_layers = lstm_layers
        self.dropout = dropout     
        
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_size)   
        
        self.lstm = nn.LSTM(input_size=embed_size, 
                            hidden_size=lstm_size,
                            num_layers=lstm_layers, 
                            batch_first=False, 
                            dropout=self.dropout)

        
        self.dropout = nn.Dropout(dropout)
        
        self.fc = nn.Linear(in_features=lstm_size, out_features=output_size)
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def init_hidden(self, batch_size):        
        weight = next(self.parameters()).data 
 
        hidden = (weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_(),
                  weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_())        
        return hidden

    def forward(self, x, hidden_state):
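        # x has shape (sequence_length, batch_size) because batch_first=False
        batch_size = x.size(1)

        # Cast to integer indices on the same device as the embedding weights
        x = x.long().to(self.embedding.weight.device)
        embeds = self.embedding(x)

        lstm_out, hidden_state = self.lstm(embeds, hidden_state)

        # Keep only the output of the last time step
        lstm_out = lstm_out[-1, :, :]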
        
        out = self.dropout(lstm_out)
        out = self.fc(out)        
        log_probs = self.logsoftmax(out)
        
        return log_probs, hidden_state

## Prepare the Training and Validation Sets

In [None]:
split=0.8
valid_split = int(len(token_ids) * split)

train_features = token_ids[:valid_split]
valid_features = token_ids[valid_split:]
train_labels = sentiments[:valid_split]
valid_labels = sentiments[valid_split:]

## Prepare DataLoaders and Batching
Now we should build a generator that we can use to loop through our data. It'll be more efficient if we can pass our sequences in as batches. Our input tensors should look like `(sequence_length, batch_size)`. So if our sequences are 40 tokens long and we pass in 25 sequences, then we'd have an input size of `(40, 25)`.

If we set our sequence length to 40, what do we do with messages that are more or less than 40 tokens? For messages with fewer than 40 tokens, we will pad the empty spots with zeros. We should be sure to **left** pad so that the RNN starts from nothing before going through the data. If the message has 20 tokens, then the first 20 spots of our 40 long sequence will be 0. If a message has more than 40 tokens, we'll just keep the first 40 tokens.
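
The padding and truncation rule can be sketched on plain Python lists before looking at the batched tensor version. `left_pad` is a hypothetical helper used only for illustration:

```python
def left_pad(tokens, sequence_length, pad_value=0):
    """Keep the first `sequence_length` tokens, then left-pad with `pad_value`."""
    tokens = tokens[:sequence_length]
    return [pad_value] * (sequence_length - len(tokens)) + tokens

print(left_pad([7, 3, 9], 5))           # [0, 0, 7, 3, 9]
print(left_pad([1, 2, 3, 4, 5, 6], 5))  # [1, 2, 3, 4, 5]
```

Padding with 0 is safe here because the vocabulary ids start at 1, so 0 never collides with a real word.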

In [None]:
def dataloader(twit_content, labels, sequence_length=30, batch_size=32, shuffle=True):
    """Yield (batch, labels) pairs; each batch has shape (sequence_length, batch_size)."""
    if shuffle:
        # Shuffle messages and labels with the same permutation
        indices = list(range(len(twit_content)))
        random.shuffle(indices)
        twit_content = [twit_content[idx] for idx in indices]
        labels = [labels[idx] for idx in indices]

    total_sequences = len(twit_content)

    for ii in range(0, total_sequences, batch_size):
        batch_twit_content = twit_content[ii: ii+batch_size]
        
        # Start from all zeros so that unfilled positions act as padding
        batch = torch.zeros((sequence_length, len(batch_twit_content)), dtype=torch.int64)
        for batch_num, tokens in enumerate(batch_twit_content):
            token_tensor = torch.tensor(tokens)
            # Left pad: short messages fill the end of the column; long ones are truncated
            start_idx = max(sequence_length - len(token_tensor), 0)
            batch[start_idx:, batch_num] = token_tensor[:sequence_length]
        
        label_tensor = torch.tensor(labels[ii: ii+len(batch_twit_content)])
        
        yield batch, label_tensor

## Draw data from first batch and check the shape

In [None]:
text_batch, labels = next(iter(dataloader(train_features, train_labels, sequence_length=20, batch_size=512)))
print(text_batch.shape)
print(labels.shape)

## Training Parameters


In [None]:
vocab_size=len(vocab)+1
embed_size=1024
output_size=5
lstm_size=512
lstm_layers=2
dropout=0.1
batch_size=512

model = TextClassifier(vocab_size, embed_size, lstm_size, output_size, lstm_layers, dropout)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model

## Initialize the Weights and Hidden State Before the First Training Run

In [None]:
model.embedding.weight.data.uniform_(-1, 1)
hidden = model.init_hidden(batch_size) 

## Forward Pass the First Batch of Data Through the Model to Check That It Works

In [None]:
logps, hidden = model.forward(text_batch, hidden)
logps.shape


In [None]:
logps[0]

In [None]:
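# Load a previously saved model (the whole pickled object, not just a state_dict).
# Note: recent PyTorch releases may need weights_only=False for this to work.
model = torch.load('p6_epoch.pth')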

## Train the model with data

In [None]:
epochs = 5
learning_rate = 0.001
clip = 5
print_every = 100
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
model.train()

train_losses = []
valid_losses = []
valid_accuracies = []
best_val_accuracy=0
for epoch in range(epochs):
    print(f'Training with epoch {epoch + 1}')
    
    train_loss = 0
    steps = 0    

    for text_batch, labels in dataloader(train_features, train_labels, 
                                         batch_size=batch_size, sequence_length=20, 
                                         shuffle=True):
        hidden = model.init_hidden(batch_size=labels.shape[0])

        steps += 1
       
        text_batch, labels = text_batch.to(device), labels.to(device)
        # Rebinding the loop variable would not move anything, so rebuild the tuple instead
        hidden = tuple(each.to(device) for each in hidden)
        
        model.zero_grad() 
        
        log_probs, hidden = model.forward(text_batch, hidden)        
        
        loss = criterion(log_probs, labels)
        
        loss.backward()  
        
        nn.utils.clip_grad_norm_(model.parameters(), clip)  
        
        optimizer.step()
        
        train_loss += loss.item()  
        
        if steps % print_every == 0:
            model.eval()
            valid_loss_sum, n_correct, n_seen = 0.0, 0, 0

            # Gradients are not needed for validation
            with torch.no_grad():
                for valid_batch, valid_targets in dataloader(valid_features, valid_labels,
                                                             batch_size=batch_size, sequence_length=20,
                                                             shuffle=False):

                    valid_hidden = model.init_hidden(valid_targets.shape[0])

                    valid_batch, valid_targets = valid_batch.to(device), valid_targets.to(device)
                    valid_hidden = tuple(each.to(device) for each in valid_hidden)

                    valid_log_probs, valid_hidden = model.forward(valid_batch, valid_hidden)

                    # Accumulate over all validation batches, weighting by batch size
                    n_batch = valid_targets.shape[0]
                    valid_loss_sum += criterion(valid_log_probs, valid_targets).item() * n_batch
                    n_correct += (valid_log_probs.argmax(dim=1) == valid_targets).sum().item()
                    n_seen += n_batch

            valid_loss = valid_loss_sum / n_seen
            valid_accuracy = n_correct / n_seen

            train_losses.append(loss.item())
            valid_losses.append(valid_loss)
            valid_accuracies.append(valid_accuracy)

            model.train()

            print(f'Epoch: {epoch+1} / {epochs} \tStep: {steps}',
                  f'\n  Train Loss: {loss.item():.3f}',
                  f'  Valid Loss: {valid_loss:.3f}',
                  f'  Valid Accuracies: {valid_accuracy:.3f}')

            # Save the model whenever the validation accuracy improves
            if valid_accuracy > best_val_accuracy:
                torch.save(model, 'p6_epochwostop.pth')
                best_val_accuracy = valid_accuracy
                print("New best accuracy model is saved")


In [None]:
print(best_val_accuracy)

In [None]:
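import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure(figsize=(10, 8))

# Losses on the left y-axis
ax1 = fig.add_subplot(111)
ax1.set_ylabel('Loss')
ax1.plot(train_losses, label='train_losses', c='purple')
ax1.plot(valid_losses, label='valid_losses', c='red')

# Validation accuracy on a second y-axis
ax2 = ax1.twinx()
ax2.set_ylabel('Accy', color='tab:blue')
ax2.plot(valid_accuracies, label='valid_accuracies', c='tab:blue')

# Collect the legend entries from both axes
fig.legend()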

## Making Predictions
Now that we have a trained model, try it on some new twits and see if it works appropriately. Remember that each twit needs to be preprocessed before it is passed to the network.

In [None]:
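def predict(text, model, vocab):
    """Return the class probabilities, shape (1, output_size), for one piece of text."""
    tokens = preprocess(text)

    # Keep only words that are in the vocabulary, then convert them to ids
    tokens = [vocab[word] for word in tokens if word in vocab]

    # Add a batch dimension: shape (sequence_length, 1)
    text_input = torch.tensor(tokens).unsqueeze(1)

    hidden = model.init_hidden(text_input.size(1))

    logps, _ = model.forward(text_input, hidden)
    # Exponentiate the log-probabilities to get probabilities
    pred = torch.exp(logps)

    return pred.to('cpu').detach().numpy()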


In [None]:
text = "Google is working on self driving cars, I'm bullish on $goog"
model.eval()
model.to(device)
predict(text, model, vocab)

In [None]:
text = "Google is working on self driving cars, I'm bullish on $goog"
tokens = preprocess(text)
tokens

In [None]:
tokens = [word for word in tokens if word in filtered_words]
tokens = [vocab[word] for word in tokens] 
tokens

In [None]:
text_input = torch.tensor(tokens)
text_input.shape

In [None]:
text_input = torch.tensor(tokens).unsqueeze(1)
text_input.shape

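## Questions: What is the prediction of the model? What is the uncertainty of the prediction?
- The model predicts label 4 with a probability of about 71%, so roughly 29% of the probability mass falls on the other classes.
- Here label 4 = index 3 = shifted score 3, which corresponds to +1 (positive) on the original -2 to 2 scale.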
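
Reading a prediction back on the original scale can be sketched as follows. The probability vector is made up for illustration and is not actual model output (only the 0.71 echoes the figure quoted above); `SENTIMENT_NAMES` and `interpret` are hypothetical names:

```python
SENTIMENT_NAMES = ['very negative', 'negative', 'neutral', 'positive', 'very positive']

def interpret(probs):
    """Turn a list of five class probabilities into a readable prediction."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {
        'class_index': best,
        'score': best - 2,  # undo the +2 shift applied to the labels
        'label': SENTIMENT_NAMES[best],
        'probability': probs[best],
    }

example_probs = [0.02, 0.05, 0.12, 0.71, 0.10]  # made-up values, not model output
result = interpret(example_probs)
print(result['class_index'], result['score'], result['label'])  # 3 1 positive
```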

## Testing

In [None]:
with open('test_twits.json', 'r') as f:
    test_data = json.load(f)

### Twit Stream

In [None]:
def twit_stream():
    for twit in test_data['data']:
        yield twit

next(twit_stream())

In [None]:
next(twit_stream())

Using the `predict` function, let's apply it to a stream of twits.
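
The symbol-matching part of that step can be tried in isolation. The sketch below uses the same pattern as `score_twits` on a made-up message; `extract_symbols` and `toy_universe` are assumptions for illustration:

```python
import re

def extract_symbols(text, universe):
    """Return the ticker symbols mentioned in `text` that belong to `universe`."""
    # A "$" followed by 2 to 4 uppercase letters
    symbols = re.findall(r'\$[A-Z]{2,4}', text)
    return [s for s in symbols if s in universe]

toy_universe = {'$AAPL', '$GOOG'}
found = extract_symbols("Long $AAPL, short $TSLA, watching $goog", toy_universe)
print(found)  # ['$AAPL']
```

Two caveats follow from the pattern: lowercase tickers such as `$goog` are not matched, and a longer ticker is cut off after four letters, so `$GOOGL` would be reported as `$GOOG`.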

In [None]:
def score_twits(stream, model, vocab, universe):

    for twit in stream:

        text = twit['message_body']
        # Raw string avoids the invalid "\$" escape warning
        symbols = re.findall(r'\$[A-Z]{2,4}', text)
        score = predict(text, model, vocab)

        for symbol in symbols:
            if symbol in universe:
                yield {'symbol': symbol, 'score': score, 'timestamp': twit['timestamp']}

In [None]:
universe = {'$BBRY', '$AAPL', '$AMZN', '$BABA', '$YHOO', '$LQMT', '$FB', '$GOOG', '$BBBY', '$JNUG', '$SBUX', '$MU'}
score_stream = score_twits(twit_stream(), model, vocab, universe)

next(score_stream)