## **_Using News Data to Predict Movements in the Financial Markets_**

We'll be using two approaches here:

* Continuous Bag of Words Model
* RNN Models using Word Embeddings

In [46]:
import torch
import torch.utils.data as tud
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

from collections import Counter, defaultdict
import operator
import os, math
import random
import copy
import string
import multiprocessing as mp

[nltk_data] Downloading package punkt to /home/antimony/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/antimony/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [31]:
# set the random seeds so the experiments can be replicated exactly
random.seed(72689)
np.random.seed(72689)
torch.manual_seed(72689)
if torch.cuda.is_available():
    torch.cuda.manual_seed(72689)

# Global class labels.
POS_LABEL = 'up'
NEG_LABEL = 'down'
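
As a small illustration of why the seeds are fixed, the sketch below uses only the standard-library `random` module to show that a generator seeded with the same value produces the same sequence. The helper name `draw` is hypothetical and is not part of the notebook. Seeding PyTorch this way may still leave some GPU operations non-deterministic, so treat this as an illustration of the idea rather than a guarantee.

```python
import random


def draw(seed, n=3):
    """Return n pseudo-random integers from a freshly seeded generator."""
    # A private generator is used so the global random state is left untouched.
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


# The same seed always reproduces the same sequence.
first_run = draw(72689)
second_run = draw(72689)
print(first_run == second_run)
```

The same principle applies to `np.random.seed` and `torch.manual_seed`: each library keeps its own generator, which is why all of them are seeded above.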

**Reading in all the Data**

In [3]:
data = pd.read_csv("ProcessedData/CombinedData.csv")
data.drop(columns=['Unnamed: 0'], inplace=True)
data.head()

| | Title | Date | Content | OpenMove | CloseMove |
|---|---|---|---|---|---|
| 0 | Top U.S. General Praises Iran-Backed Shiite Mi... | 2017-01-04 | The top commander of the U.S.-led coalition ag... | 1.0 | 1.0 |
| 1 | Extremists Turn to a Leader to Protect Western... | 2017-01-04 | As the founder of the Traditionalist Worker Pa... | 1.0 | 1.0 |
| 2 | How Julian Assange evolved from pariah to paragon | 2017-01-04 | President-elect Donald Trump tweeted some pra... | 1.0 | 1.0 |
| 3 | House panel recommends cutting funding for Pla... | 2017-01-04 | A House panel formed by Republicans to invest... | 1.0 | 1.0 |
| 4 | Missouri Bill: Gun-Banning Businesses Liable f... | 2017-01-04 | As Missouri lawmakers convene for the 2017 leg... | 1.0 | 1.0 |


### Preprocessing the Data For Feeding Into The Model

Preprocessing involves (in our case):
* Normalization: converting all words to lower case
* Removing punctuation, accent marks, and other diacritics
* Removing stop words, sparse terms, and particular words
* Stemming with the Porter stemmer from NLTK
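
To make these steps concrete, here is a minimal standard-library sketch that runs lower-casing, accent removal, punctuation removal, and stop-word filtering on a single sentence. It is an illustration under stated assumptions, not the notebook's pipeline: `DEMO_STOP_WORDS`, `strip_accents`, and `normalize` are hypothetical names, the stop-word list is deliberately tiny, `str.split()` stands in for a real tokenizer, and stemming is left out because it needs NLTK.

```python
import string
import unicodedata

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook uses NLTK's English stop-word list instead.
DEMO_STOP_WORDS = {"the", "of", "a", "to", "in"}


def strip_accents(text):
    """Drop combining marks so that accented letters become plain ones."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text, stop_words=DEMO_STOP_WORDS):
    """Lower-case, strip accents and punctuation, then drop stop words."""
    text = strip_accents(text.lower())
    text = text.translate(str.maketrans("", "", string.punctuation))
    # str.split() is a stand-in for a real tokenizer such as word_tokenize.
    return " ".join(word for word in text.split() if word not in stop_words)


print(normalize("The Café of Zürich, in Détente!"))  # prints "cafe zurich detente"
```

Accent removal is done before punctuation removal here so that the decomposed combining marks are dropped explicitly rather than left behind in the text.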

In [39]:
# Remove all punctuation characters
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Remove all stop words; the stop-word set is passed in explicitly
def remove_stopwords(text, stop_words):
    tokens = word_tokenize(text)
    return " ".join([i for i in tokens if i not in stop_words])

# Reduce every token to its stem
def stem(text, stemmer):
    tokens = word_tokenize(text)
    return " ".join([stemmer.stem(i) for i in tokens])

The pre_process function below applies the helper functions defined above.

In [44]:
def pre_process(df):
    # Work on a copy so the chunk passed in is not modified in place
    df = df.copy()

    # Normalization (operate on the argument df, not the global data)
    df['Title'] = df['Title'].str.lower()
    df['Content'] = df['Content'].str.lower()

    # Removing Punctuation
    df['Title'] = df['Title'].apply(remove_punctuation)
    df['Content'] = df['Content'].apply(remove_punctuation)

    # Remove Stopwords
    stop_words = set(stopwords.words('english'))
    df['Title'] = df['Title'].apply(remove_stopwords, args=(stop_words,))
    df['Content'] = df['Content'].apply(remove_stopwords, args=(stop_words,))

    # Stemming (extra arguments must be passed through the args keyword)
    stemmer = PorterStemmer()
    df['Title'] = df['Title'].apply(stem, args=(stemmer,))
    df['Content'] = df['Content'].apply(stem, args=(stemmer,))

    return df

**We run the pre_process function in parallel with the multiprocessing module to make it faster**

In [45]:
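# Processing in parallel. The NLTK 'stopwords' corpus must already be downloaded
# (nltk.download('stopwords')) before this cell runs; otherwise the workers raise
# the LookupError shown below.
n_workers = max(1, mp.cpu_count() - 1)  # leave one core free, but never use fewer than one worker
data_pieces = np.array_split(data, n_workers)

pool = mp.Pool(n_workers)
data = pd.concat(pool.map(pre_process, data_pieces))
pool.close()
pool.join()

data.head()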

LookupError: 
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/stopwords

  Searched in:
    - '/home/antimony/nltk_data'
    - '/home/antimony/miniconda3/envs/MachineLearning/nltk_data'
    - '/home/antimony/miniconda3/envs/MachineLearning/share/nltk_data'
    - '/home/antimony/miniconda3/envs/MachineLearning/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
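
The split, map, and concatenate pattern used in the parallel cell can be sketched with plain lists and the standard library. This is an illustration under assumptions rather than the notebook's code: `split_chunks`, `lowercase_chunk`, and `parallel_map` are hypothetical names, and lists of strings stand in for the DataFrame pieces. The worker function is defined at module level because `multiprocessing` has to pickle it to send it to the worker processes.

```python
import multiprocessing as mp


def split_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous pieces of nearly equal size."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each receive one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def lowercase_chunk(chunk):
    """Example worker: lower-case every string in one chunk."""
    return [text.lower() for text in chunk]


def parallel_map(func, items, n_workers):
    """Apply func to each chunk, in worker processes when n_workers > 1."""
    chunks = split_chunks(items, n_workers)
    if n_workers == 1:
        # A single worker does not need a process pool.
        results = [func(chunk) for chunk in chunks]
    else:
        with mp.Pool(n_workers) as pool:
            results = pool.map(func, chunks)
    # Flatten the per-chunk results back into one list, preserving order.
    return [row for part in results for row in part]
```

A call such as `parallel_map(lowercase_chunk, titles, 4)` would process a hypothetical list `titles` in four chunks. On platforms that start workers by spawning a fresh interpreter, the call should sit under an `if __name__ == "__main__":` guard.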
