<a href="https://colab.research.google.com/github/azizamari/stock_price_prediction/blob/main/Stock_Market_Prediction_using_Numerical_and_Textual_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install yfinance

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Importing Dependencies

In [None]:
import yfinance as yf
import pandas as pd
import nltk
import re
import numpy as np
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from pandas import DataFrame as df
from pandas import concat
import matplotlib.pyplot as plt

In [None]:
# define dates
start='2021-10-01'
end='2022-01-01'
startstamp=20211001

## Get data from Yahoo Finance

In [None]:
past_data = yf.download('^BSESN', start=start, end=end)

[*********************100%***********************]  1 of 1 completed


In [None]:
print(past_data.columns)
past_data.head()

Index(['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')


Unnamed: 0_level_0,Open,High,Low,Close,Adj Close,Volume
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2021-10-01,58889.769531,58890.078125,58551.140625,58765.578125,58765.578125,10200
2021-10-04,59143.0,59548.820312,58952.109375,59299.320312,59299.320312,10000
2021-10-05,59320.140625,59778.871094,59127.039062,59744.878906,59744.878906,12900
2021-10-06,59942.0,59963.570312,59079.859375,59189.730469,59189.730469,7000
2021-10-07,59632.808594,59914.910156,59597.058594,59677.828125,59677.828125,5700


To better understand the open, high, low, and close columns in stock data, I read through [this article](https://analyzingalpha.com/open-high-low-close-stocks).


In [None]:
past_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 63 entries, 2021-10-01 to 2021-12-31
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       63 non-null     float64
 1   High       63 non-null     float64
 2   Low        63 non-null     float64
 3   Close      63 non-null     float64
 4   Adj Close  63 non-null     float64
 5   Volume     63 non-null     int64  
dtypes: float64(5), int64(1)
memory usage: 3.4 KB


## Get India news headlines dataset

In [None]:
! pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
! mkdir -p ~/.kaggle


In [None]:
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d therohk/india-headlines-news-dataset

Downloading india-headlines-news-dataset.zip to /content
 98% 85.0M/86.6M [00:02<00:00, 34.4MB/s]
100% 86.6M/86.6M [00:02<00:00, 33.4MB/s]


In [None]:
!unzip india-headlines-news-dataset.zip

Archive:  india-headlines-news-dataset.zip
  inflating: india-news-headlines.csv  


In [None]:
cols = ['Date','News']
news_data = pd.read_csv('india-news-headlines.csv', names = cols)
news_data = news_data.dropna(axis = 0, how ='any') 
news_data.head()



Unnamed: 0,Date,News
publish_date,headline_category,headline_text
20010102,unknown,Status quo will not be disturbed at Ayodhya; s...
20010102,unknown,Fissures in Hurriyat over Pak visit
20010102,unknown,America's unwanted heading for India?
20010102,unknown,For bigwigs; it is destination Goa


### Data prep: news data

In [None]:
news=news_data['News'].iloc[1:].values

In [None]:
dates=news_data.index[1:].values

In [None]:
news_data=pd.DataFrame(data={'date':dates,'news':news})

In [None]:
news_data.head()

Unnamed: 0,date,news
0,20010102,Status quo will not be disturbed at Ayodhya; s...
1,20010102,Fissures in Hurriyat over Pak visit
2,20010102,America's unwanted heading for India?
3,20010102,For bigwigs; it is destination Goa
4,20010102,Extra buses to clear tourist traffic
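The cells above work around `names=cols`, which pushes the first CSV column into the index and leaves the header row in the data. A minimal sketch of a more direct load, assuming the file really has the three header columns shown in the output above (`publish_date`, `headline_category`, `headline_text`); `load_headlines` is a hypothetical helper, not part of the notebook:

```python
import io
import pandas as pd

def load_headlines(source):
    """Read the headlines CSV into a frame with an integer `date` and a `news` column."""
    # Let pandas consume the header row and keep only the two columns used later
    frame = pd.read_csv(source, usecols=['publish_date', 'headline_text'])
    frame = frame.rename(columns={'publish_date': 'date', 'headline_text': 'news'})
    frame = frame.dropna(how='any').reset_index(drop=True)
    frame['date'] = frame['date'].astype(int)
    return frame

# Tiny in-memory stand-in for india-news-headlines.csv (two rows from the output above)
sample_csv = io.StringIO(
    "publish_date,headline_category,headline_text\n"
    "20010102,unknown,Fissures in Hurriyat over Pak visit\n"
    "20010102,unknown,For bigwigs; it is destination Goa\n"
)
sample = load_headlines(sample_csv)
print(sample)
```

With the real file, `load_headlines('india-news-headlines.csv')` should give the same two columns without the manual index slicing.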


In [None]:
news_data.describe()

Unnamed: 0,date,news
count,3650970,3650970
unique,7718,3387380
top,20141215,Straight Answers
freq,706,6723


In [None]:
news_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3650970 entries, 0 to 3650969
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   date    object
 1   news    object
dtypes: object(2)
memory usage: 55.7+ MB


In [None]:
news_data.isnull().sum()

date    0
news    0
dtype: int64

In [None]:
news_data.duplicated().sum()

162098

In [None]:
# drop duplicated data
news_data.drop_duplicates(keep='first', inplace=True, ignore_index=True)

In [None]:
news_data.duplicated().sum()

0

In [None]:
news_data['date'] = pd.to_numeric(news_data['date'])

In [None]:
# drop news before the start date (startstamp = 20211001, defined above)
news_data.drop(news_data[news_data['date'] < startstamp].index, inplace=True)

In [None]:
# dropping rows leaves gaps in the index, so rebuild it from 0
news_data.reset_index(inplace = True, drop = True)

In [None]:
len(news_data)

87701
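The filter above compares integer `YYYYMMDD` stamps. A small sketch of deriving those stamps from the `start` and `end` strings and applying both bounds, assuming the end date should be exclusive (the downloaded prices above stop at 2021-12-31); `to_stamp` is a hypothetical helper:

```python
import pandas as pd

def to_stamp(iso_date):
    """Turn 'YYYY-MM-DD' into the integer YYYYMMDD used in the date column."""
    return int(iso_date.replace('-', ''))

start, end = '2021-10-01', '2022-01-01'
demo = pd.DataFrame({
    'date': [20210930, 20211001, 20211231, 20220101],
    'news': ['a', 'b', 'c', 'd'],  # placeholder headlines
})
# Keep rows with start <= date < end
mask = (demo['date'] >= to_stamp(start)) & (demo['date'] < to_stamp(end))
window = demo[mask].reset_index(drop=True)
print(window)
```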

### Data prep: Yahoo Finance data

In [None]:
past_data.reset_index(inplace=True)

In [None]:
past_data.rename(columns={
      'Date':'date',
      'Open': 'open',
      'High': 'high',
      'Low': 'low',
      'Close': 'close',
      'Adj Close': 'adjclose',
      'Volume': 'volume'},
      inplace = True
)

In [None]:
past_data.describe()

Unnamed: 0,open,high,low,close,adjclose,volume
count,63.0,63.0,63.0,63.0,63.0,63.0
mean,59233.257564,59493.432602,58762.142981,59095.553447,59095.553447,9368.253968
std,1534.928686,1453.983828,1540.406671,1505.290274,1505.290274,7131.299149
min,56320.019531,56538.148438,55132.679688,55822.011719,55822.011719,1800.0
25%,57938.128906,58236.519531,57660.949219,57800.404297,57800.404297,6450.0
50%,59320.140625,59778.371094,58952.109375,59189.730469,59189.730469,7600.0
75%,60327.935547,60557.830078,59956.109375,60303.339844,60303.339844,9200.0
max,62156.480469,62245.429688,61624.648438,61765.589844,61765.589844,48400.0


In [None]:
# convert the datetime column to integer YYYYMMDD stamps to match the news data
past_data['date'] = past_data['date'].dt.strftime('%Y%m%d').astype(int)

In [None]:
past_data

Unnamed: 0,date,open,high,low,close,adjclose,volume
0,20211001,58889.769531,58890.078125,58551.140625,58765.578125,58765.578125,10200
1,20211004,59143.000000,59548.820312,58952.109375,59299.320312,59299.320312,10000
2,20211005,59320.140625,59778.871094,59127.039062,59744.878906,59744.878906,12900
3,20211006,59942.000000,59963.570312,59079.859375,59189.730469,59189.730469,7000
4,20211007,59632.808594,59914.910156,59597.058594,59677.828125,59677.828125,5700
...,...,...,...,...,...,...,...
58,20211227,56948.328125,57512.011719,56543.078125,57420.238281,57420.238281,5700
59,20211228,57751.210938,57952.480469,57650.289062,57897.480469,57897.480469,5500
60,20211229,57892.308594,58097.070312,57684.578125,57806.488281,57806.488281,5300
61,20211230,57755.398438,58010.031250,57578.988281,57794.320312,57794.320312,7300


## Perform Sentiment Analysis

In [None]:
# Clean and stem the news headlines
# this takes some time

nltk.download('stopwords')
ps = PorterStemmer()

c = []
stopwrds = set(stopwords.words('english'))
for i in range(len(news_data['news'])):
    # keep letters only, lowercase, and split into words
    news = re.sub('[^a-zA-Z]', ' ', news_data['news'][i])
    news = news.lower()
    news = news.split()
    # drop stopwords and reduce each remaining word to its stem
    news = [ps.stem(word) for word in news if word not in stopwrds]
    news = ' '.join(news)
    c.append(news)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
news_data['news']=pd.Series(c)

### Subjectivity and Polarity

We will use TextBlob to calculate both scores.<br>Polarity ranges from -1 (negative) to 1 (positive), with 0 being neutral.<br>Subjectivity ranges from 0 (factual) to 1 (personal opinion): the higher the value, the more the text reads as an opinion.

In [None]:
from textblob import TextBlob

# we will apply these functions to the news column
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

def getPolarity(text):
    return TextBlob(text).sentiment.polarity

In [None]:
# Slow: TextBlob runs once per headline for each of the two columns
news_data['subjectivity'] = news_data['news'].apply(getSubjectivity)
news_data['polarity'] = news_data['news'].apply(getPolarity)
news_data.head()

Unnamed: 0,date,news,subjectivity,polarity
0,20211001,dogecoin dark hors among cryptocurr,0.4,-0.15
1,20211001,cop dhananjaya play childhood dream salaga,0.0,0.0
2,20211001,five oat dish love,0.6,0.5
3,20211001,horoscop today octob check astrolog predict ar...,0.0,0.0
4,20211001,durga puja committe opt low key celebr,0.65,0.0


In [None]:
news_data.to_csv('checkpoint.csv')

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


In [None]:
# score each headline once and reuse the result for all four columns
scores = [sia.polarity_scores(v) for v in news_data['news']]
news_data['compound'] = [s['compound'] for s in scores]
news_data['negative'] = [s['neg'] for s in scores]
news_data['neutral'] = [s['neu'] for s in scores]
news_data['positive'] = [s['pos'] for s in scores]

In [None]:
news_data.head()

Unnamed: 0,date,news,subjectivity,polarity,compound,negative,neutral,positive
0,20211001,dogecoin dark hors among cryptocurr,0.4,-0.15,0.0,0.0,1.0,0.0
1,20211001,cop dhananjaya play childhood dream salaga,0.0,0.0,0.5267,0.0,0.476,0.524
2,20211001,five oat dish love,0.6,0.5,0.6369,0.0,0.417,0.583
3,20211001,horoscop today octob check astrolog predict ar...,0.0,0.0,-0.6597,0.306,0.694,0.0
4,20211001,durga puja committe opt low key celebr,0.65,0.0,-0.2732,0.259,0.741,0.0


## Merge data for hybrid model

In [None]:
merged_data = pd.merge(past_data, news_data, how='inner', on='date')

In [None]:
merged_data.isnull().sum()

date            0
open            0
high            0
low             0
close           0
adjclose        0
volume          0
news            0
subjectivity    0
polarity        0
compound        0
negative        0
neutral         0
positive        0
dtype: int64

In [None]:
merged_data.drop(['news','date'], axis=1, inplace=True)

In [None]:
merged_data.head()

Unnamed: 0,open,high,low,close,adjclose,volume,subjectivity,polarity,compound,negative,neutral,positive
0,58889.769531,58890.078125,58551.140625,58765.578125,58765.578125,10200,0.4,-0.15,0.0,0.0,1.0,0.0
1,58889.769531,58890.078125,58551.140625,58765.578125,58765.578125,10200,0.0,0.0,0.5267,0.0,0.476,0.524
2,58889.769531,58890.078125,58551.140625,58765.578125,58765.578125,10200,0.6,0.5,0.6369,0.0,0.417,0.583
3,58889.769531,58890.078125,58551.140625,58765.578125,58765.578125,10200,0.0,0.0,-0.6597,0.306,0.694,0.0
4,58889.769531,58890.078125,58551.140625,58765.578125,58765.578125,10200,0.65,0.0,-0.2732,0.259,0.741,0.0
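As the output above shows, the inner merge repeats each day's prices once per headline. A toy sketch of that behaviour, next to one possible alternative (not what this notebook does) that averages the sentiment per day before merging; all values below are illustrative:

```python
import pandas as pd

prices = pd.DataFrame({'date': [20211001, 20211004],
                       'close': [58765.58, 59299.32]})
headlines = pd.DataFrame({
    'date': [20211001, 20211001, 20211002, 20211004],
    'compound': [0.0, 0.5267, 0.1, -0.2732],
})

# Row-per-headline merge: prices repeat for every headline of the day,
# and 20211002 (no price row) is dropped by the inner join
per_headline = pd.merge(prices, headlines, how='inner', on='date')

# Alternative: average sentiment per day first, giving one row per trading day
daily = headlines.groupby('date', as_index=False)['compound'].mean()
per_day = pd.merge(prices, daily, how='inner', on='date')

print(len(per_headline), len(per_day))
```

Which shape is preferable depends on whether each row should represent a headline or a trading day.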


## Multivariate time series forecasting

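Convert the dataset to a supervised-learning format: the features at step **t-1** are the inputs and a value at step **t** is the target.<br>
The `var(t-1)` and `var(t)` column names mean the same thing as (t) and (t+1): each row pairs one step of history (X) with the step that follows it (Y).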

In [None]:
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
	n_vars = 1 if type(data) is list else data.shape[1]
	df1 = df(data)
	cols, names = list(), list()
	# input sequence (t-n, ... t-1)
	for i in range(n_in, 0, -1):
		cols.append(df1.shift(i))
		names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
	# forecast sequence (t, t+1, ... t+n)
	for i in range(0, n_out):
		cols.append(df1.shift(-i))
		if i == 0:
			names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
		else:
			names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
	# put it all together
	agg = concat(cols, axis=1)
	agg.columns = names
	# drop rows with NaN values
	if dropnan:
		agg.dropna(inplace=True)
	return agg
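To see what this framing produces, here is a minimal sketch of the one-step case (`n_in=1`, `n_out=1`) written directly with `shift`, on a made-up two-variable series:

```python
import pandas as pd

series = pd.DataFrame({'var1': [10, 20, 30, 40], 'var2': [1, 2, 3, 4]})

# Inputs: the previous row's values
lagged = series.shift(1)
lagged.columns = ['var1(t-1)', 'var2(t-1)']

# Targets: the current row's values
current = series.copy()
current.columns = ['var1(t)', 'var2(t)']

# The first row has no predecessor, so it contains NaN and is dropped
framed = pd.concat([lagged, current], axis=1).dropna()
print(framed)
```

Each remaining row holds the features at t-1 followed by the features at t, so four input rows become three supervised samples.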

In [None]:
from sklearn.preprocessing import MinMaxScaler
values = merged_data.values
print(merged_data.head())
print(values)
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# keep the 12 lagged features plus var4(t) (close at time t) as the target; drop the other var(t) columns
reframed.drop(reframed.columns[[12,13,14,16,17,18,19,20,21,22,23]], axis=1, inplace=True)
print(reframed.columns)

           open          high           low         close      adjclose  \
0  58889.769531  58890.078125  58551.140625  58765.578125  58765.578125   
1  58889.769531  58890.078125  58551.140625  58765.578125  58765.578125   
2  58889.769531  58890.078125  58551.140625  58765.578125  58765.578125   
3  58889.769531  58890.078125  58551.140625  58765.578125  58765.578125   
4  58889.769531  58890.078125  58551.140625  58765.578125  58765.578125   

   volume  subjectivity  polarity  compound  negative  neutral  positive  
0   10200          0.40     -0.15    0.0000     0.000    1.000     0.000  
1   10200          0.00      0.00    0.5267     0.000    0.476     0.524  
2   10200          0.60      0.50    0.6369     0.000    0.417     0.583  
3   10200          0.00      0.00   -0.6597     0.306    0.694     0.000  
4   10200          0.65      0.00   -0.2732     0.259    0.741     0.000  
[[5.88897695e+04 5.88900781e+04 5.85511406e+04 ... 0.00000000e+00
  1.00000000e+00 0.00000000e+00]


### Build and train the LSTM model

In [None]:
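# split into train and test sets
values = reframed.values
print((values).shape)
n_test = 6000  # number of rows held out for testing
# note: the first n_test rows form the test set; the remaining rows are used for training
train = values[n_test:, :]
test = values[:n_test, :]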
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

(30577, 13)
(24577, 1, 12) (24577,) (6000, 1, 12) (6000,)
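The split above holds out the first 6000 rows. For comparison, a sketch of the more common time-series arrangement that holds out the last rows instead, assuming the rows are in date order; `chronological_split` is a hypothetical helper and the demo array is synthetic:

```python
import numpy as np

def chronological_split(values, n_test):
    """Hold out the last n_test rows so evaluation follows training in time."""
    if not 0 < n_test < len(values):
        raise ValueError('n_test must be between 1 and len(values) - 1')
    train, test = values[:-n_test], values[-n_test:]
    # Last column is the target, the rest are features
    train_X, train_y = train[:, :-1], train[:, -1]
    test_X, test_y = test[:, :-1], test[:, -1]
    # LSTM layers expect [samples, timesteps, features]; one timestep here
    train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
    test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
    return train_X, train_y, test_X, test_y

demo = np.arange(10 * 13, dtype='float32').reshape(10, 13)
train_X, train_y, test_X, test_y = chronological_split(demo, n_test=3)
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
```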


In [None]:
import keras

model = keras.models.Sequential([
    keras.layers.LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])),
    keras.layers.Dense(1)
])
model.compile(loss='mae', optimizer='adam') 

history = model.fit(train_X, train_y, epochs=200, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)

Epoch 1/200
342/342 - 3s - loss: 0.0867 - val_loss: 0.2025 - 3s/epoch - 9ms/step
Epoch 2/200
342/342 - 1s - loss: 0.0464 - val_loss: 0.1771 - 1s/epoch - 4ms/step
Epoch 3/200
342/342 - 1s - loss: 0.0411 - val_loss: 0.1361 - 1s/epoch - 3ms/step
Epoch 4/200
342/342 - 1s - loss: 0.0376 - val_loss: 0.1223 - 1s/epoch - 3ms/step
Epoch 5/200
342/342 - 1s - loss: 0.0342 - val_loss: 0.0771 - 1s/epoch - 4ms/step
Epoch 6/200
342/342 - 1s - loss: 0.0316 - val_loss: 0.0713 - 1s/epoch - 3ms/step
Epoch 7/200
342/342 - 1s - loss: 0.0283 - val_loss: 0.0640 - 1s/epoch - 4ms/step
Epoch 8/200
342/342 - 1s - loss: 0.0235 - val_loss: 0.0584 - 1s/epoch - 3ms/step
Epoch 9/200
342/342 - 1s - loss: 0.0186 - val_loss: 0.0529 - 1s/epoch - 3ms/step
Epoch 10/200
342/342 - 1s - loss: 0.0138 - val_loss: 0.0490 - 1s/epoch - 4ms/step
Epoch 11/200
342/342 - 1s - loss: 0.0117 - val_loss: 0.0434 - 1s/epoch - 4ms/step
Epoch 12/200
342/342 - 1s - loss: 0.0116 - val_loss: 0.0438 - 1s/epoch - 3ms/step
Epoch 13/200
342/342 - 1s

## Visualizing results

In [None]:
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test', color='red')
plt.legend()
plt.show()

### Run inference on test data

In [None]:
from sklearn.metrics import mean_squared_error

# make a prediction
print(test_X.shape)
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[2]))
print(test_X.shape)
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
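# invert scaling for forecast
# note: yhat sits in column 0 here, so inverse_transform rescales it with
# the min/max of the first feature (open) rather than close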
inv_yhat = np.concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = np.concatenate((test_y, test_X[:, 1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = np.sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)

(6000, 1, 12)
(6000, 1, 12)
Test RMSE: 111.046
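The cell above places the prediction in column 0 before calling `scaler.inverse_transform`, so it is rescaled with the first column's range. Since the target is the scaled `close`, one option is to invert that single column by hand. A minimal NumPy sketch on toy data, assuming `feature_range=(0, 1)`; with a fitted `MinMaxScaler`, the per-column minimum and maximum are available as `scaler.data_min_` and `scaler.data_max_`:

```python
import numpy as np

def invert_column(scaled_col, data_min, data_max):
    """Undo (0, 1) min-max scaling for a single feature."""
    return scaled_col * (data_max - data_min) + data_min

# Toy stand-in for the unscaled features: columns are open, high, low, close
raw = np.array([[100.0, 110.0, 90.0, 105.0],
                [120.0, 130.0, 115.0, 125.0],
                [140.0, 150.0, 135.0, 145.0]])
col_min, col_max = raw.min(axis=0), raw.max(axis=0)
scaled = (raw - col_min) / (col_max - col_min)

close_idx = 3  # position of 'close' in the column order used for scaling
restored = invert_column(scaled[:, close_idx], col_min[close_idx], col_max[close_idx])
print(restored)
```

Applied to the notebook's variables, this would mean inverting `yhat` and `test_y` with the `close` entries of the fitted scaler instead of concatenating them into column 0.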


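An RMSE of about 111 points is small relative to the Sensex closing values over this period, which range from roughly 55,800 to 61,800 (about 0.2% of the average close).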