### Mauer Cory
# CS 614 Assignment 3: NLP
## News article classifier

## Pitch: 
In a previous project I developed a stock viewing website (https://mauer-stocks.azurewebsites.net/), with the goal of generating future stock price predictions to support better investment decisions. My theory is that automatically reading breaking news articles and classifying whether they shed a positive or negative light on a given stock will enable more accurate price predictions. The goal of this assignment is to leverage a news article API and build a model that can classify stories as positive/negative, for future use as a model feature or as an engineered feature that can improve stock price predictions.

## Data source:
For this project I chose to build my own dataset, using the newsapi Python client (https://newsapi.org/docs/client-libraries/python) via this homebrewed Python script (https://github.com/cwma86/cs614/blob/main/nlp_assignment/pull_descriptions.py) to pull news articles for various stock tickers and store them in a SQL database. Once the stories were stored in the database, they were manually labeled with the assistance of this Python script (https://github.com/cwma86/cs614/blob/main/nlp_assignment/label_data.py), which provides a simple human interface.
The final dataset contains 416 positive and 268 negative articles (684 in total) for various stocks over the last 1.5 months.
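
The collect-then-label workflow described above can be sketched roughly as follows. This is a minimal illustration, not the code in the linked scripts: the `articles` table, its columns, and the helper functions are hypothetical names, an in-memory SQLite database stands in for whatever SQL database the project actually uses, and the two sample rows reuse headlines quoted later in this report.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical schema: one row per article; label stays NULL until a human sets it.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY,
            ticker TEXT NOT NULL,
            description TEXT NOT NULL UNIQUE,
            label TEXT CHECK (label IN ('positive', 'negative'))
        )
        """
    )


def store_articles(conn, ticker, descriptions):
    # INSERT OR IGNORE skips descriptions that are already stored.
    conn.executemany(
        "INSERT OR IGNORE INTO articles (ticker, description) VALUES (?, ?)",
        [(ticker, text) for text in descriptions],
    )
    conn.commit()


def label_article(conn, article_id, label):
    # Record the human-assigned label for one article.
    conn.execute("UPDATE articles SET label = ? WHERE id = ?", (label, article_id))
    conn.commit()


def label_counts(conn):
    # Class balance over the labeled rows only.
    rows = conn.execute(
        "SELECT label, COUNT(*) FROM articles WHERE label IS NOT NULL GROUP BY label"
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
create_schema(conn)
store_articles(conn, "NVDA", [
    "Nvidia gains $185bn in value after predicting AI-driven boom in chip demand",
])
store_articles(conn, "BRK.B", [
    "Berkshire Hathaway Reports Major Investment Losses in 2022",
])
# Re-pulling the same story is a no-op because of the UNIQUE constraint.
store_articles(conn, "BRK.B", [
    "Berkshire Hathaway Reports Major Investment Losses in 2022",
])
label_article(conn, 1, "positive")
label_article(conn, 2, "negative")
print(label_counts(conn))
```

Keeping the label in the same row as the article text makes it easy to check class balance (such as the 416/268 split above) with a single query.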

## Model and data justification:
For the model I chose a deep learning architecture built with TensorFlow/Keras. The largest issue with the model is its simplicity, which was required to limit overfitting (the gap between train and test performance) given the small dataset of only about 700 articles. Additionally, it would likely be beneficial to move from an LSTM to a Transformer layer, which has been shown to outperform LSTMs because language structure is not purely sequential.

In [1]:
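    import os

    # Silence TensorFlow's C++ logging; must be set before importing tensorflow
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
    import tensorflow as tf

    number_words = 7000  # vocabulary size for the embedding
    maxlen = 20          # input sequence length (tokens per article)
    model = tf.keras.models.Sequential([
        tf.keras.layers.Embedding(number_words, 16, input_length=maxlen),
        tf.keras.layers.Dropout(0.6),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(5)),
        tf.keras.layers.Dropout(0.5),
        # Two-way softmax output, ordered [negative, positive]
        tf.keras.layers.Dense(2, activation='softmax')
    ])

    model.summary()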


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 16)            112000    
                                                                 
 dropout (Dropout)           (None, 20, 16)            0         
                                                                 
 bidirectional (Bidirectiona  (None, 10)               880       
 l)                                                              
                                                                 
 dropout_1 (Dropout)         (None, 10)                0         
                                                                 
 dense (Dense)               (None, 2)                 22        
                                                                 
Total params: 112,902
Trainable params: 112,902
Non-trainable params: 0
__________________________________________________
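
The parameter counts in the summary above can be reproduced by hand. The sketch below uses only the layer sizes from the model definition and the standard parameter formulas for embedding, LSTM, and dense layers; the variable names are my own.

```python
number_words = 7000   # vocabulary size passed to the Embedding layer
embedding_dim = 16    # embedding vector size
lstm_units = 5        # units per LSTM direction
num_classes = 2       # negative / positive

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = number_words * embedding_dim

# LSTM: 4 gates, each with an input kernel, a recurrent kernel, and a bias.
lstm_params_one_direction = 4 * (
    lstm_units * (embedding_dim + lstm_units) + lstm_units
)
# Bidirectional wraps two independent LSTMs (forward and backward).
bidirectional_params = 2 * lstm_params_one_direction

# Dense: weights from the concatenated 2 * lstm_units outputs, plus biases.
dense_params = (2 * lstm_units) * num_classes + num_classes

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
# prints: 112000 880 22 112902
```

Almost all of the 112,902 parameters sit in the embedding table, so the vocabulary size and embedding width are the main levers for shrinking the model further.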

## Commented examples:
The following are outputs produced by my model training script (https://github.com/cwma86/cs614/blob/main/nlp_assignment/test_train_model.py)

### Story 1: 
         NVIDIA today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the...

Prediction: 

[negative positive]

[0.6487615  0.35123858]
    
Description: While this article notes that revenue is down from a year ago, it also mentions a recent increase, resulting in a mix of positive/negative signals.

### Story 2: 
         GameStop (GME) Gains As Market Dips: What You Should Know

Prediction:

[negative positive]

[0.35599473 0.6440053 ]

Description: This article mentions that the stock market has been dipping, but overall the stock itself shows growth

### Story 3: 
         General Electric (NYSE:GE) shareholders have earned a 36% CAGR over the last three years

Prediction: 

[negative positive]

[0.9395874  0.06041263]

Description: This is an incorrect classification, as the article provides positive information on the stock. I suspect this is caused by "CAGR" not showing up anywhere else in the dataset. Further data collection should correct this.

### Story 4: 
         Berkshire Hathaway Reports Major Investment Losses in 2022

Prediction: 

[negative positive]

[0.30550626 0.6944938 ]

Description: This is an incorrect classification. I suspect it is caused by the model overfitting to Berkshire Hathaway, which appears mostly in positive stories. Collecting a larger dataset should help correct these issues.

### Story 5: 
         Retail investors are sitting on heavy losses despite a 2023 stock rally

Prediction:

[negative positive]

[0.9596902  0.04030982]

Description: accurate classification

### Story 6: 
         Nvidia stock surge could signal the start of an AI bubble

Prediction:

[negative positive]

[0.69276685 0.30723315]

Description: Likely an incorrect classification, as I would consider an AI bubble a short-term positive for NVIDIA. This is likely caused by "bubble" typically being used in a negative context throughout most of this dataset.

### Story 7: 
         Nvidia gains $185bn in value after predicting AI-driven boom in chip demand

Prediction:

[negative positive]

[0.06687223 0.9331278 ]

Description: accurate classification
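
Each prediction above is a softmax pair ordered `[negative positive]`, and the predicted class is simply the larger entry. A minimal sketch of that decoding step, applied to the seven outputs listed above (the `decode_prediction` helper is a hypothetical name, not a function from the training script):

```python
CLASS_NAMES = ("negative", "positive")  # same order as the printed outputs


def decode_prediction(probabilities, class_names=CLASS_NAMES):
    # Return the most likely class name and its probability.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best], probabilities[best]


# Softmax outputs copied from the seven stories above.
outputs = {
    "Story 1": [0.6487615, 0.35123858],
    "Story 2": [0.35599473, 0.6440053],
    "Story 3": [0.9395874, 0.06041263],
    "Story 4": [0.30550626, 0.6944938],
    "Story 5": [0.9596902, 0.04030982],
    "Story 6": [0.69276685, 0.30723315],
    "Story 7": [0.06687223, 0.9331278],
}

decoded = {story: decode_prediction(probs) for story, probs in outputs.items()}
for story, (label, confidence) in decoded.items():
    print(f"{story}: {label} ({confidence:.2f})")
```

The confidence value is useful on its own: the two mixed-signal stories (1 and 2) both sit near 0.65, while the clear-cut stories 5 and 7 are above 0.93.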

## Testing:
In-depth run instructions can be found in the test/train script (); a summary of the results is as follows:

Training with the Adam optimizer at a learning rate of 2e-4 for 40 epochs resulted in:

Epoch 40/40
21/21 [==============================] - 0s 9ms/step - loss: 0.1172 - accuracy: 0.9757 - val_loss: 0.3395 - val_accuracy: 0.8485

From scikit-learn's classification_report:

                  precision    recall  f1-score   support

        negative       0.85      0.84      0.85        82
        positive       0.85      0.86      0.85        83

        accuracy                           0.85       165
       macro avg       0.85      0.85      0.85       165
    weighted avg       0.85      0.85      0.85       165
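
For readers unfamiliar with the report columns, the sketch below shows how precision, recall, and F1 are computed per class and how the macro average is formed. The labels are a small made-up example for illustration only, not the 165-article test set, and `per_class_metrics` is a hypothetical helper rather than part of scikit-learn.

```python
def per_class_metrics(y_true, y_pred, target):
    # Treat `target` as the class of interest and count hits and misses.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Illustrative labels only (assumed data, not the real test split).
y_true = ["negative", "negative", "negative", "positive", "positive", "positive"]
y_pred = ["negative", "negative", "positive", "positive", "positive", "negative"]

neg = per_class_metrics(y_true, y_pred, "negative")
pos = per_class_metrics(y_true, y_pred, "positive")

# Macro average: unweighted mean of the per-class scores.
macro_f1 = (neg[2] + pos[2]) / 2
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"negative P/R/F1: {neg[0]:.2f} {neg[1]:.2f} {neg[2]:.2f}")
print(f"positive P/R/F1: {pos[0]:.2f} {pos[1]:.2f} {pos[2]:.2f}")
print(f"accuracy: {accuracy:.2f}  macro F1: {macro_f1:.2f}")
```

Because the test split is nearly balanced (82 negative, 83 positive), the macro and weighted averages in the report coincide.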


## Code and run instructions
Code to pull data, label data, and test/train the model can be found at this repo (https://github.com/cwma86/cs614/blob/main/nlp_assignment) 

For environment setup and run instructions please see the readme (https://github.com/cwma86/cs614/blob/main/nlp_assignment/README.md)

From there, there are three main scripts:
* Data collection (https://github.com/cwma86/cs614/blob/main/nlp_assignment/pull_descriptions.py)
* Data labeling (https://github.com/cwma86/cs614/blob/main/nlp_assignment/label_data.py)
* Model training and evaluation (https://github.com/cwma86/cs614/blob/main/nlp_assignment/test_train_model.py)

### I agree to sharing this assignment with other students in the course after grading has occurred. 