# Smart Stock Prediction #
John, a hacker, has gained access to the database of one of the leading and most profitable trading firms in India. The database holds daily share prices along with buy dates and sell dates from January 2000 to November 2019.
In another table he also found expected share prices for the next two years. Looking for a way to profit from this data, he searched the web and ran into many unfamiliar terms such as RNNs and attention models. He then remembered that a few of his friends work on deep learning and asked them for help. Can you help him by building a deep-learning-based model that predicts the buy and sell dates?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## Data Visualization ##

Training Data (share_train_data.csv)
1. Each row represents one month of data for one particular share.
2. A share is bought only once in that month and sold in the same month. The buy date is always before the sell date.
3. In the example below, share_4 was bought on 8 Jan 2000 and sold on 10 Jan 2000.

	Share_Name, Year, Month, Share Prices (28-31 values depending on month), Buy Date, Sell Date

	'share_4', 2000, 'JAN', 46.6, 54.7, 66.7, 44.7, 46.0, 72.8, 73.6, 37.9, 49.7, 74.3, 71.2, 51.0, 40.3, 57.8, 70.5, 66.2, 69.8, 45.8, 57.4, 50.2, 62.6, 40.6, 48.7, 71.2, 51.4, 61.4, 49.2, 52.7, 54.9, 37.3, 47.2, 8, 10
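
Because the number of price columns depends on the month, the positions of the buy and sell dates shift from row to row. A minimal sketch of splitting one raw row into its parts, assuming the layout described above (the helper name `split_row` and the `MONTHS` mapping are hypothetical and only mirror the month abbreviations seen in the data):

```python
from calendar import monthrange

# Month abbreviations as they appear in the data, mapped to month numbers.
MONTHS = {'JAN': 1, 'FEB': 2, 'MAR': 3, 'APR': 4, 'MAY': 5, 'JUN': 6,
          'JUL': 7, 'AUG': 8, 'SEP': 9, 'OCT': 10, 'NOV': 11, 'DEC': 12}

def split_row(row):
    """Split one raw row into (name, year, month, prices, buy_day, sell_day)."""
    name, year, month = row[0], int(row[1]), row[2]
    n_days = monthrange(year, MONTHS[month])[1]   # 28-31 depending on month and year
    prices = list(row[3:3 + n_days])              # daily prices follow the three metadata fields
    buy_day, sell_day = row[3 + n_days], row[4 + n_days]
    return name, year, month, prices, buy_day, sell_day

# The example row shown above.
example = ['share_4', 2000, 'JAN',
           46.6, 54.7, 66.7, 44.7, 46.0, 72.8, 73.6, 37.9, 49.7, 74.3,
           71.2, 51.0, 40.3, 57.8, 70.5, 66.2, 69.8, 45.8, 57.4, 50.2,
           62.6, 40.6, 48.7, 71.2, 51.4, 61.4, 49.2, 52.7, 54.9, 37.3,
           47.2, 8, 10]

name, year, month, prices, buy_day, sell_day = split_row(example)
print(len(prices), buy_day, sell_day)
```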


In [None]:
train_data = pd.read_csv('/kaggle/input/sharepricedateprediction/share_train_data.csv', header=None)
print(len(train_data))
train_data.head(10)

In [None]:
train_data[40:60]

In [None]:
# the first column holds the share name; let's check all the available values
print(len(train_data[0].unique()))
train_data[0].unique()

In [None]:
# the second column holds the year; let's check all the available values
print(len(train_data[1].unique()))
train_data[1].unique()

In [None]:
# the third column holds the month; let's check all the available values
print(len(train_data[2].unique()))
train_data[2].unique()

## Preprocessing ##

In [None]:
# extract the buy and sell dates (the two values that follow the last price) from the training data

month_dict={
    'JAN':1, 'FEB':2, 'MAR':3, 'APR':4, 'MAY':5, 'JUN':6,'JUL':7, 'AUG':8,'SEP':9, 'OCT':10, 'NOV':11, 'DEC':12
}

from calendar import monthrange


def dateExtractor(train_data):
    buy_dates=[]
    sell_dates=[]
    
    for row in train_data.values:
        days_in_month = monthrange(int(row[1]), int(month_dict[row[2]]))[1]
        sell_index=days_in_month+4
        buy_index=days_in_month+3
        buy_dates.append(row[buy_index])
        sell_dates.append(row[sell_index])
        
    return buy_dates, sell_dates
                               
buy_dates, sell_dates = dateExtractor(train_data)

In [None]:
print('Unique Buy Dates: '+str(set(buy_dates))+'\nUnique Sell Date: '+str(set(sell_dates)))

In [None]:
# Fill all the NaN values with 0
train_data=train_data.fillna(0)
train_data[99:100]

Records in our dataset do not all contain the same number of share prices, because months differ in length. An RNN expects every record to have the same size, so we need to preprocess the data to produce consistent records.
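
One way to get fixed-length records is to pad each month's prices with zeros up to 31 values. A small sketch, assuming prices arrive as plain Python lists (`pad_prices` is a hypothetical helper, not part of this notebook):

```python
import numpy as np

def pad_prices(prices, length=31, fill=0.0):
    """Right-pad a list of daily prices with `fill` so every month has `length` values."""
    padded = np.full(length, fill, dtype=float)
    padded[:len(prices)] = prices
    return padded

feb = [50.0] * 28          # a 28-day month
jan = [60.0] * 31          # a 31-day month
batch = np.stack([pad_prices(feb), pad_prices(jan)])
print(batch.shape)
```

Padding with zeros is a simple choice; zero is also a valid (if unlikely) price, so a mask or a dedicated padding value may be worth considering.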

In [None]:
def targetDateReplacer(train_data):
    '''
    Replace the buy date and sell date with zeros so that we can generate consistent data
    '''
    df = pd.DataFrame(columns=list(range(36)))
    for index,row in enumerate(train_data.values):
        days_in_month = monthrange(int(row[1]), int(month_dict[row[2]]))[1]
        sell_index = days_in_month+4
        buy_index = days_in_month+3
        row[sell_index] = 0
        row[buy_index] = 0
        df.loc[index]=row
    return df

modified_df = targetDateReplacer(train_data)
modified_df.head()

In [None]:
modified_df=modified_df.drop([34,35], axis=1)

In [None]:
# we need to label encode the share name, year, and month columns

from sklearn.preprocessing import LabelEncoder
def label_encoder(modified_df):
    cols = (0, 1, 2)
    # process columns, apply LabelEncoder to categorical features
    for c in cols:
        lbl = LabelEncoder() 
        lbl.fit(list(modified_df[c].values)) 
        modified_df[c] = lbl.transform(list(modified_df[c].values))
    
    return modified_df

one_hot_df=label_encoder(modified_df)
one_hot_df.head()

In [None]:
# label-encoded values of the first column
modified_df[0].unique()

In [None]:
# label-encoded values of the second column
modified_df[1].unique()

In [None]:
# label-encoded values of the third column
modified_df[2].unique()
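
Note that `LabelEncoder` assigns each category a single integer code, which is not the same thing as one-hot encoding. A short illustration of the difference using pandas only (the toy column below is made up):

```python
import pandas as pd

months = pd.Series(['JAN', 'FEB', 'MAR', 'JAN'])

# Label encoding: one integer per category (codes follow sorted category order).
label_codes = months.astype('category').cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(months)

print(list(label_codes))
print(one_hot.shape)
```

Integer codes impose an ordering on the categories, which may or may not be meaningful for share names.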

### Normalization ###

In [None]:
from sklearn import preprocessing
def normalize_data(df):
    min_max_scaler = preprocessing.MinMaxScaler()
    for i in range(34):
        df[i] = min_max_scaler.fit_transform(df[i].values.reshape(-1,1))
    return df

norm_df=normalize_data(modified_df)
norm_df.head()

In [None]:
norm_df.describe()

In [88]:
# we also need to normalize target values
# use one scaler per target so that each can be inverted later
buy_scaler = preprocessing.MinMaxScaler()
sell_scaler = preprocessing.MinMaxScaler()
n_buy_dates = buy_scaler.fit_transform(np.asarray(buy_dates).reshape(-1,1))
n_sell_dates = sell_scaler.fit_transform(np.asarray(sell_dates).reshape(-1,1))

# lets create target value using list comprehension 
n_buy_dates=list(n_buy_dates)
n_sell_dates=list(n_sell_dates)
target=[[n_buy_dates[i].item(),n_sell_dates[i].item()] for i in range(len(n_sell_dates))]
target[:10]

[[0.06896551724137931, 0.17241379310344826],
 [0.2068965517241379, 0.7586206896551724],
 [0.4827586206896552, 0.9655172413793104],
 [0.41379310344827586, 0.5517241379310345],
 [0.7586206896551725, 0.7586206896551724],
 [0.1724137931034483, 0.9655172413793104],
 [0.3103448275862069, 0.7586206896551724],
 [0.3103448275862069, 0.689655172413793],
 [0.7241379310344828, 0.793103448275862],
 [0.5172413793103449, 0.8275862068965517]]
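
To turn model outputs back into calendar days, the scaling has to be inverted. A sketch, assuming one fitted `MinMaxScaler` is kept per target (the toy values and the names `demo_scaler` and `pred_scaled` are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

buy_days = np.array([2, 8, 15, 30]).reshape(-1, 1)

demo_scaler = MinMaxScaler()
scaled = demo_scaler.fit_transform(buy_days)      # values in [0, 1]

# Pretend these are network outputs in the scaled space.
pred_scaled = np.array([[0.0], [0.5], [1.0]])

# Invert the scaling, then round to whole days.
pred_days = np.rint(demo_scaler.inverse_transform(pred_scaled)).astype(int).ravel()
print(pred_days)
```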

In [89]:
# split dataset into 90% training and 10% validation data
train_x=np.array(norm_df[:int(0.90*len(norm_df))])
train_y=np.array(target[:int(0.90*len(target))])
valid_x=np.array(norm_df[int(0.90*len(norm_df)):])
valid_y=np.array(target[int(0.90*len(target)):])
print(len(train_y), len(train_x), len(valid_y), len(valid_x))

10970 10970 1219 1219
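
The cell above holds out the last 10% of the rows for validation. If a separate test set is also wanted, a sketch of an explicit three-way split by position (the 80/10/10 fractions and the helper `three_way_split` are assumptions for illustration):

```python
import numpy as np

def three_way_split(x, y, train_frac=0.8, valid_frac=0.1):
    """Split arrays by position into train / validation / test parts."""
    n = len(x)
    i = int(train_frac * n)
    j = int((train_frac + valid_frac) * n)
    return (x[:i], y[:i]), (x[i:j], y[i:j]), (x[j:], y[j:])

x = np.arange(100).reshape(50, 2)   # 50 toy records with 2 features
y = np.arange(50)
train, valid, test = three_way_split(x, y)
print(len(train[0]), len(valid[0]), len(test[0]))
```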


In [97]:
train_x = np.reshape(train_x, (train_x.shape[0], 1, train_x.shape[1]))
valid_x = np.reshape(valid_x, (valid_x.shape[0], 1, valid_x.shape[1]))

In [98]:
train_x.shape

(10970, 1, 34)
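
With shape `(samples, 1, 34)` the LSTM sees a single time step per record. An alternative, shown here only as a sketch with NumPy, is to treat the 31 daily prices as a sequence of 31 steps with one feature each (the layout assumed below is three metadata columns followed by 31 price columns, and the random data is a stand-in):

```python
import numpy as np

records = np.random.rand(5, 34)          # 5 toy records: 3 metadata + 31 prices
prices = records[:, 3:]                  # keep only the daily prices
as_sequence = prices.reshape(prices.shape[0], prices.shape[1], 1)
print(as_sequence.shape)                 # (samples, time steps, features)
```

Whether this layout trains better is an open question for this dataset; it would also require changing `input_shape` in the model accordingly.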

In [104]:
n_features=34
from keras.models import Sequential
from keras.layers import Dense, LSTM

model = Sequential()
model.add(LSTM(100, input_shape=(1, n_features)))
model.add(Dense(2))
model.compile(optimizer='adam', loss='mse')
model.summary()

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_7 (LSTM)                (None, 100)               54000     
_________________________________________________________________
dense_7 (Dense)              (None, 2)                 202       
Total params: 54,202
Trainable params: 54,202
Non-trainable params: 0
_________________________________________________________________


In [105]:
model.fit(train_x, train_y, epochs=10, validation_split=0.2, batch_size=50)

Train on 8776 samples, validate on 2194 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7f4ccc069fd0>
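
The network outputs two unconstrained real numbers per record, so predictions need post-processing before they can be read as dates. A sketch that rounds to whole days, clips to the month length, and forces the buy day to precede the sell day (`to_valid_dates` is a hypothetical helper, and moving the sell day to the day after the buy day is an arbitrary tie-breaking choice):

```python
import numpy as np

def to_valid_dates(buy, sell, days_in_month):
    """Round raw day predictions and make them a valid (buy < sell) pair."""
    buy = int(np.clip(np.rint(buy), 1, days_in_month - 1))
    sell = int(np.clip(np.rint(sell), 2, days_in_month))
    if sell <= buy:
        sell = buy + 1      # buy is at most days_in_month - 1, so this stays in range
    return buy, sell

print(to_valid_dates(8.2, 9.7, 31))
print(to_valid_dates(30.6, 12.0, 31))
```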

### PyTorch ###

In [None]:
import torch
from torch.utils.data import DataLoader, TensorDataset

#create Tensor Dataset
train_data=TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data=TensorDataset(torch.from_numpy(valid_x), torch.from_numpy(valid_y))

#dataloader
batch_size=50
train_loader=DataLoader(train_data, batch_size=batch_size, shuffle=True)
valid_loader=DataLoader(valid_data, batch_size=batch_size, shuffle=True)

In [None]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)
print('Sample input size: ', sample_x.size()) # batch_size, seq_length (1), n_features
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size, 2 (buy date, sell date)
print('Sample label: \n', sample_y)

In [None]:
import torch.nn as nn

class RNN(nn.Module):
    """
    The RNN model that will be used to predict the buy and sell dates.
    """

    def __init__(self, output_size, input_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(RNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.input_dim = input_dim
        
        # LSTM layers
        self.lstm = nn.LSTM(input_dim, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        # dropout layer
        self.dropout = nn.Dropout(0.3)
        
        # linear layers (the sigmoid is defined but not applied in forward)
        self.fc1 = nn.Linear(hidden_dim, 256)
        self.fc2 = nn.Linear(256, 32)
        self.fc3 = nn.Linear(32, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        # x already has shape (batch_size, seq_len, input_dim), as required by
        # an LSTM with batch_first=True; cast to float32 to match the weights
        lstm_out, hidden = self.lstm(x.float(), hidden)
    
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # dropout and fully-connected layers
        out = self.dropout(lstm_out)
        out = self.fc1(out)
        out = self.fc2(out)
        out = self.fc3(out)

        # return the output of the last linear layer and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden

In [None]:
# Instantiate the model w/ hyperparams
output_size = 2
input_dim = 34
hidden_dim = 100
n_layers = 2

net = RNN(output_size, input_dim, hidden_dim, n_layers)

print(net)

In [None]:
for i in range(len(list(net.parameters()))):
    print(list(net.parameters())[i].size())
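
With `batch_first=True`, `nn.LSTM` expects input of shape `(batch, seq_len, input_dim)` and hidden and cell states of shape `(n_layers, batch, hidden_dim)`. A quick shape check on random data, using the same sizes as above (the tensors here are stand-ins, not the real dataset):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=34, hidden_size=100, num_layers=2, batch_first=True)

x = torch.randn(50, 1, 34)           # (batch, seq_len, input_dim)
h0 = torch.zeros(2, 50, 100)         # (n_layers, batch, hidden_dim)
c0 = torch.zeros(2, 50, 100)

out, (hn, cn) = lstm(x, (h0, c0))
print(out.shape, hn.shape, cn.shape)
```

A batch whose size differs from the one used to build `h0` and `c0` raises an error, which is why the final, smaller batch of an epoch needs special handling.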

In [None]:
# loss and optimization functions
lr=0.001

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

In [None]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

In [None]:
# training params

epochs = 2 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip=5 # gradient clipping
# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
for e in range(epochs):
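    # initialize hidden state
    h = net.init_hidden(batch_size)
    # batch loop
    for inputs, labels in train_loader:
        # skip the final, smaller batch: the hidden state is sized for batch_size
        if(inputs.shape[0] != batch_size):
            continue
        counter += 1    
        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()
        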
        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])
        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:
                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])
                if(inputs.shape[0] != batch_size):
                    continue
                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_3 (LSTM)                (None, 100)               54000     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 101       
Total params: 54,101
Trainable params: 54,101
Non-trainable params: 0
_________________________________________________________________


ValueError: Error when checking input: expected lstm_2_input to have 3 dimensions, but got array with shape (10970, 34)

Testing Data (share_test_data.csv)
1. Predict the buy and sell dates from the estimated share prices for the next 2 years.
2. You can buy only once and sell only once per month for a given share.
3. The testing data has the same format as the training data, except that the buy and sell dates are not present.

In [None]:
test_data = pd.read_csv('/kaggle/input/sharepricedateprediction/share_test_data.csv', header=None)
test_data.head()

In [None]:
# fill missing days with 0, as was done for the training data
test_data=test_data.fillna(0)
en_test=label_encoder(test_data)
norm_data=normalize_data(en_test)
norm_data.head()
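
Fitting new encoders and scalers on the test set means the same raw value can map to different numbers in training and test. A common alternative, sketched here with toy data, is to fit on the training data once and reuse the fitted object (the arrays and the name `price_scaler` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_prices = np.array([[10.0], [20.0], [30.0]])
test_prices = np.array([[15.0], [40.0]])

price_scaler = MinMaxScaler().fit(train_prices)     # learn min/max from training data only
train_scaled = price_scaler.transform(train_prices)
test_scaled = price_scaler.transform(test_prices)   # values may fall outside [0, 1]
print(test_scaled.ravel())
```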

Expected Output
1. Python code for training a model on the given training data using the TensorFlow, PyTorch, or Keras framework.
2. Python code for predicting on the test data using the trained model.
3. Generated predictions for the test data in the same format as the training data.
4. A PDF report explaining your approach and results.
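
For item 3, a sketch of writing predictions back in the training layout, where the buy and sell days follow directly after the last price of each month (`append_dates`, `MONTHS`, and the toy row are hypothetical; this notebook does not define a submission routine):

```python
import csv
import io
from calendar import monthrange

MONTHS = {'JAN': 1, 'FEB': 2, 'MAR': 3, 'APR': 4, 'MAY': 5, 'JUN': 6,
          'JUL': 7, 'AUG': 8, 'SEP': 9, 'OCT': 10, 'NOV': 11, 'DEC': 12}

def append_dates(row, buy_day, sell_day):
    """Return name, year, month and the month's prices, followed by the two predicted days."""
    n_days = monthrange(int(row[1]), MONTHS[row[2]])[1]
    return list(row[:3 + n_days]) + [buy_day, sell_day]

row = ['share_1', 2021, 'FEB'] + [50.0] * 28     # a toy 28-day test row
out = append_dates(row, 3, 17)

# Write the row as one CSV line (to an in-memory buffer for this demo).
buffer = io.StringIO()
csv.writer(buffer).writerow(out)
print(buffer.getvalue().strip()[-5:])
```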