# Output.ipynb

This notebook contains detailed instructions on how to reproduce the results of this project.  For details on the scripts that are run in the following cells, please see the README.md in the corresponding directory.

## Prerequisites

The following subsections download the necessary data so that each Python notebook or script in this project can be run.  The required prerequisites section **must** be completed before proceeding further.  The optional prerequisites can be skipped; each later section that depends on them is marked **Optional Prerequisites Required**.  Note, when you execute `download_output.py` the outputs and results of each model are automatically downloaded.  This allows the user to execute **most** cells but not all (i.e. the user can skip running the actual models, but will need to download the models if they want to regenerate the results).

### Required

In [None]:
from IPython.display import Image
import shutil
import os

In [None]:
working_directory = os.getcwd()

In [None]:
# Install all requirements to run everything in this project
!pip install -r requirements.txt

In [None]:
# This script downloads and propagates the datafiles required to run each script
%run download_output.py

### Optional

#### Configuring GPU for PyTorch and TensorFlow

The included requirements.txt file installs PyTorch and TensorFlow for CPU **only**.  If the user would like to use a GPU, we refer the reader to the [PyTorch installation page](https://pytorch.org/get-started/locally/) or the [TensorFlow installation page](https://www.tensorflow.org/install/pip) so that they can install the appropriate GPU version.  Note, in this project we used PyTorch version 2.0.0 and TensorFlow version 2.12.0.
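After installing a GPU build, it can help to confirm that each framework actually sees the GPU before running the longer cells.  The following is a minimal sketch, not part of the original project: the helper name `gpu_report` is hypothetical, and it only relies on `torch.cuda.is_available()` and `tf.config.list_physical_devices("GPU")`.

```python
def gpu_report():
    """Return GPU visibility per framework (hypothetical helper).

    True/False means the framework is installed and does/does not see a GPU.
    None means the framework could not be imported.
    """
    report = {"torch": None, "tensorflow": None}

    try:
        import torch
        report["torch"] = bool(torch.cuda.is_available())
    except ImportError:
        pass  # PyTorch is not installed in this environment

    try:
        import tensorflow as tf
        report["tensorflow"] = len(tf.config.list_physical_devices("GPU")) > 0
    except ImportError:
        pass  # TensorFlow is not installed in this environment

    return report


print(gpu_report())
```

With the CPU-only install described above, both entries are expected to be `False`.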

#### Download Models

In [None]:
# This script downloads the final models used in this project.  Note, depending on your internet speed this could take quite a while!
!python download_models.py

## Generating Training Data

The following cell allows the user to generate the training data used to train both models.

In [None]:
# Cleans the Sentiment140 dataset and outputs training_data_formulation/results/tweets.csv
os.chdir(f"{working_directory}/training_data_formulation")
!python clean_tweets.py

In [None]:
# Cleans the NewsMTSC dataset and outputs training_data_formulation/results/news.csv
os.chdir(f"{working_directory}/training_data_formulation")
!python clean_news.py

In [None]:
# Merges the training data and generates all the splits (i.e. train/test/validation etc.)
os.chdir(f"{working_directory}/training_data_formulation")
!papermill data_formulation.ipynb /dev/null

# We copy all the resulting data splits to where they are used elsewhere in the project
os.chdir(f"{working_directory}")
for model_dir in ("bert", "lstm"):
    # "" is the merged data; "_news" and "_tweets" are the per-source splits
    for suffix in ("", "_news", "_tweets"):
        for split in ("test", "train", "val"):
            filename = f"{split}_4000{suffix}.csv"
            shutil.copy(f"training_data_formulation/results/{filename}", f"{model_dir}/data/{filename}")
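
Since later sections read the splits from `bert/data` and `lstm/data`, it may be worth checking that the copy step above left every file in place.  The sketch below is an illustration only; `missing_splits` is a hypothetical helper, and the file names are taken from the copy cell above.

```python
from pathlib import Path


def missing_splits(project_root, model_dirs=("bert", "lstm")):
    """List expected split files that are absent under <model_dir>/data (hypothetical helper)."""
    # Same naming scheme as the copy cell: merged, news-only, and tweets-only splits
    names = [
        f"{split}_4000{suffix}.csv"
        for split in ("train", "val", "test")
        for suffix in ("", "_news", "_tweets")
    ]
    root = Path(project_root)
    return [
        str(Path(model_dir) / "data" / name)
        for model_dir in model_dirs
        for name in names
        if not (root / model_dir / "data" / name).is_file()
    ]
```

Calling `missing_splits(working_directory)` after the copy cell should return an empty list; any paths it returns are relative to the project root.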


In [None]:
# Summary statistics of NewsMTSC data used in training data
os.chdir(f"{working_directory}/training_data_formulation")
Image(filename='results/news_stats.png')

In [None]:
# Summary statistics of Sentiment140 data used in training data
os.chdir(f"{working_directory}/training_data_formulation")
Image(filename='results/tweets_stats.png')

In [None]:
# Summary statistics of merged data used in training/validation/test splits
os.chdir(f"{working_directory}/training_data_formulation")
Image(filename='results/split_stats.png')

## Generating BERT Model Results

**Optional Prerequisites Required**

The following cell executes the `test.ipynb` notebook, which runs the BERT models on the training, validation and test datasets and produces a PNG of the results that is used in the final report.  The resulting PNG is shown below.  Note, actually running this could take quite a while (~10 minutes on an RTX 3090).  If you do not have an NVIDIA GPU with **at least** 8GB of VRAM, do not try to execute this: it will either take too long or fail with an out-of-memory error.
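One way to check the memory requirement before starting is to ask PyTorch how much memory the first CUDA device reports.  This is a sketch under the assumption that the BERT notebook runs on PyTorch, which this notebook does not state explicitly; `cuda_total_gib` is a hypothetical helper name.

```python
def cuda_total_gib(device_index=0):
    """Return total memory of a CUDA device in GiB, or None if unavailable (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed
    if not torch.cuda.is_available():
        return None  # CPU-only build or no visible NVIDIA GPU
    total_bytes = torch.cuda.get_device_properties(device_index).total_memory
    return total_bytes / 1024**3


print(cuda_total_gib())
```

The reported total can be slightly below the card's nominal size, so treat a value just under 8 as borderline rather than as a hard failure.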

In [None]:
os.chdir(f"{working_directory}/bert")
!papermill test.ipynb /dev/null

In [None]:
os.chdir(f"{working_directory}/bert")
Image(filename='results/bert_results_table.png')

## Generating LSTM Model Results

**Optional Prerequisites Required**

The following cell executes the `test.ipynb` notebook in the `lstm` directory, which runs the LSTM model and produces a PNG of the results.  The resulting PNG (`lstm/results/lstm_results_table.png`) is shown below.

In [None]:
os.chdir(f"{working_directory}/lstm")
!papermill test.ipynb /dev/null

In [None]:
os.chdir(f"{working_directory}/lstm")
Image(filename='results/lstm_results_table.png')

## Model Results on Scraped Data

The following cells execute several scripts that allow the user to reproduce the results of both models on the scraped data from the New York Times and Twitter.
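
The cells in this notebook pair `os.chdir` with `!papermill`, which changes the notebook's working directory as a side effect.  As an alternative sketch (not what the project itself does), the same command can be launched with `subprocess.run` and its `cwd` argument, leaving the current directory untouched.  The helper names `papermill_command` and `run_notebook` are hypothetical.

```python
import os
import subprocess


def papermill_command(notebook):
    """Build the papermill command line, discarding the executed copy of the notebook."""
    # os.devnull is the platform's null device ("/dev/null" on POSIX);
    # this has not been verified with papermill on Windows.
    return ["papermill", notebook, os.devnull]


def run_notebook(project_root, subdir, notebook):
    """Run a notebook from inside its own directory without changing the caller's cwd."""
    return subprocess.run(
        papermill_command(notebook),
        cwd=os.path.join(project_root, subdir),
        check=True,  # raise CalledProcessError if papermill exits with an error
    )
```

For example, `run_notebook(working_directory, "bert", "real_data_testing.ipynb")` would correspond to the first cell of the next subsection.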

### Generating Results of the BERT Model

**Optional Prerequisites Required**

The following cell executes the BERT model on the scraped data from Twitter and the New York Times.  This produces `bert/results/twitter_bert_results.csv` and `bert/results/nyt_bert_results.csv`.  These files are then copied to the `result_visualizations/data` directory and are used to produce the visualizations.  Note, actually running this could take quite a while (~10 minutes on an RTX 3090).  If you do not have an NVIDIA GPU with **at least** 8GB of VRAM, do not try to execute this: it will either take too long or fail with an out-of-memory error.

In [None]:
os.chdir(f"{working_directory}/bert")
!papermill real_data_testing.ipynb /dev/null

In [None]:
os.chdir(f"{working_directory}")
shutil.copy("bert/results/twitter_bert_results.csv", "result_visualizations/data/twitter_bert_results.csv")
shutil.copy("bert/results/nyt_bert_results.csv", "result_visualizations/data/nyt_bert_results.csv")

### Generating Results of the LSTM Model

**Optional Prerequisites Required**

The following cell executes the LSTM model on the scraped data from Twitter and the New York Times.  This produces `lstm/results/twitter_lstm_results.csv` and `lstm/results/nyt_lstm_results.csv`.  These files are then copied to the `result_visualizations/data` directory and are used to produce the visualizations.

In [None]:
os.chdir(f"{working_directory}/lstm")
!papermill real_data_lstm_testing.ipynb /dev/null

In [None]:
os.chdir(f"{working_directory}")
shutil.copy("lstm/results/twitter_lstm_results.csv", "result_visualizations/data/twitter_lstm_results.csv")
shutil.copy("lstm/results/nyt_lstm_results.csv", "result_visualizations/data/nyt_lstm_results.csv")

### Generating Visualizations of the Results

The following cell generates the visualizations of the results from the previous two sections.  The visualizations are then shown in the subsequent cells.

In [None]:
os.chdir(f"{working_directory}/result_visualizations")
!papermill viz.ipynb /dev/null

#### Twitter Results

In [None]:
# Initial summary of LSTM model results on Twitter data
os.chdir(f"{working_directory}/result_visualizations")
Image(filename='results/lstm_initial_twitter_summary_result.png')

In [None]:
# Initial summary of BERT model results on Twitter data
os.chdir(f"{working_directory}/result_visualizations")
Image(filename='results/bert_initial_twitter_summary_result.png')

In [None]:
# Time series visualization of both model results on Twitter data
os.chdir(f"{working_directory}/result_visualizations")
Image(filename='results/timeseries_distribution_grid.png')

In [None]:
# Distribution visualization of both model results on Twitter data
os.chdir(f"{working_directory}/result_visualizations")
Image(filename='results/distribution_grid.png')

#### New York Times Results

In [None]:
# Initial summary of LSTM model results on NYT data
os.chdir(f"{working_directory}/result_visualizations")
Image(filename='results/lstm_initial_nyt_summary_result.png')

In [None]:
# Initial summary of BERT model results on NYT data
os.chdir(f"{working_directory}/result_visualizations")
Image(filename='results/bert_initial_nyt_summary_result.png')

In [None]:
# Time series visualization of both model results on NYT data
os.chdir(f"{working_directory}/result_visualizations")
Image(filename='results/nyt_timeseries_distribution_grid.png')

In [None]:
# Distribution visualization of both model results on NYT data
os.chdir(f"{working_directory}/result_visualizations")
Image(filename='results/nyt_distribution_grid.png')