<a href="https://colab.research.google.com/gist/ashwinkey04/de066da294792f198b7f74ad6ec702e8/fyp_phase_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phase 1

> Gathering news articles and stock prices for a specific stock and preparing the dataset for sentiment analysis and stock value prediction


#### How it works
> This notebook fetches daily stock market data for a specified stock through [yfinance](https://www.yahoofinanceapi.com/) and that stock's daily news articles from the [mediastack](https://mediastack.com/) API, then creates a derived dataset that contains **news sentiment**. The notebook can be scheduled to run daily on Google Cloud Run to collect new data each day.




#### Structure of dataset generated

1. stock_history.csv
2. news.json
3. news_sentiment.csv

#### stock_history.csv 

1. Date - Trading date
2. Open - Opening price of the day
3. High - Highest price of the day
4. Low - Lowest price of the day
5. Close - Closing price of the day
6. Volume - Number of shares traded during the day
7. Dividends - Dividend paid per share on that date (0 if none)
8. Stock Splits - Split ratio applied on that date (0 if none)


#### news.json

1. author - author of news article 
2. title - title of news article 
3. description - description of news article
4. url - url of news article
5. source - source of news article
6. image - image of news article
7. category - category of news article
8. language - language of news article
9. country - country name
10. published_at - published date

#### news_sentiment.csv

1. published_at - published date
2. title - title of news article 
3. description - description of news article
4. url - url of news article
5. sentiment - news sentiment
6. sentiment_score - news sentiment score between 0 and 1
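
To relate `news_sentiment.csv` to the one-row-per-day layout of `stock_history.csv`, the article-level scores can be collapsed to one value per date. The following is a minimal sketch of one way to do that, not something this notebook does itself: the sign mapping in `SIGN`, the helper name `daily_signed_sentiment`, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows shaped like news_sentiment.csv (values are made up)
rows = [
    {"published_at": "2023-01-24T10:15:00+00:00", "sentiment": "positive", "sentiment_score": 0.90},
    {"published_at": "2023-01-24T16:40:00+00:00", "sentiment": "negative", "sentiment_score": 0.60},
    {"published_at": "2023-01-25T08:05:00+00:00", "sentiment": "neutral", "sentiment_score": 0.80},
]

# Assumed mapping from label to sign; neutral articles contribute zero
SIGN = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}

def daily_signed_sentiment(rows):
    """Average the signed sentiment score per calendar date (YYYY-MM-DD)."""
    per_day = defaultdict(list)
    for row in rows:
        day = row["published_at"].split("T")[0]
        per_day[day].append(SIGN[row["sentiment"]] * row["sentiment_score"])
    return {day: sum(scores) / len(scores) for day, scores in per_day.items()}

daily = daily_signed_sentiment(rows)
```

Here `daily` maps each date to a single number in the range -1 to 1, which could then be joined to the stock rows on the trading date.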



## Installing packages


In [1]:
# Install Yahoo Finance package
!pip install yfinance

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting yfinance
  Downloading yfinance-0.2.4-py2.py3-none-any.whl (51 kB)
Collecting cryptography>=3.3.2
  Downloading cryptography-39.0.0-cp36-abi3-manylinux_2_28_x86_64.whl (4.2 MB)
Collecting frozendict>=2.3.4
  Downloading frozendict-2.3.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (110 kB)
Collecting beautifulsoup4>=4.11.1
  Downloading beautifulsoup4-4.11.1-py3-none-any.whl (128 kB)

In [2]:
# Install the Firebase Admin SDK (used below to reach Google Cloud Firestore)
!pip install firebase-admin

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [7]:
# Install Transformers package for transfer learning
!pip install transformers

#Install pmdarima package for ARIMA
!pip install pmdarima

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pmdarima
  Downloading pmdarima-2.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (1.9 MB)
Collecting statsmodels>=0.13.2
  Downloading statsmodels-0.13.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.9 MB)
Installing collected packages: statsmodels, pmdarima
  Attempting uninstall: statsmodels
    Found existing installation: statsmodels 0.12.2
    Uninstalling statsmodels-0.12.2:
      Successfully uninstalled statsmodels-0.12.2
Successfully installed pmdarima-2.0.2 statsmodels-0.13.5


In [4]:
# Import other libraries 
import yfinance as yf
import pandas as pd
import os
import json
import datetime
from datetime import date,timedelta
import warnings
import http.client, urllib.parse

warnings.filterwarnings("ignore")
from transformers import AutoTokenizer, AutoModelForSequenceClassification,pipeline
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [11]:
# company ticker symbol (Reliance Industries on the NSE)
company_symbol="RELIANCE.NS"

# today's and yesterday's dates as YYYY-MM-DD strings
today = str(date.today())
yesterday = str(date.today()- timedelta(days = 1))

# flag variables
news_inserted=False

#secret key of mediastack api
mediastack_api_token = "9f774391108184659c6eecd8dfdcd269"

# input file paths
stock_history_file_path='./input/reliance_stock_history.csv'
news_file_path='./input/reliance_news.json'
news_sentiment_file_path='./input/reliance_news_sentiment.csv'

# output file paths
output_stock_history_file_path='./reliance_stock_history.csv'
output_news_file_path='./reliance_news.json'
output_news_sentiment_file_path='./reliance_news_sentiment.csv'

# parameters for mediastack api
search_query='reliance'
conn = http.client.HTTPConnection('api.mediastack.com')
params = urllib.parse.urlencode({
    'keywords': search_query,
    'access_key': mediastack_api_token,
    'sort': 'published_desc',
    'limit': 10,
    'languages': 'en',
    'countries': 'in',
    })
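
The cell above stores the mediastack key directly in the notebook. For a scheduled job it may be safer to read the key from the environment instead. The sketch below shows one way to build the same query string that way; the environment variable name `MEDIASTACK_API_TOKEN` and the helper `build_mediastack_params` are hypothetical and are not part of the original notebook.

```python
import os
import urllib.parse

def build_mediastack_params(search_query, env_var="MEDIASTACK_API_TOKEN"):
    """Return the URL-encoded query string, reading the API key from the environment.

    The variable name is an assumption; use whatever name the deployment defines.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    # Same parameters as the notebook cell, with the key taken from the environment
    return urllib.parse.urlencode({
        'keywords': search_query,
        'access_key': token,
        'sort': 'published_desc',
        'limit': 10,
        'languages': 'en',
        'countries': 'in',
    })

# Usage, once the variable is set in the runtime:
# params = build_mediastack_params('reliance')
```

Raising early when the variable is missing makes a misconfigured run fail at start-up rather than at the first API request.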

## Create/Update stock history dataset

In [12]:
ticker_object=yf.Ticker(company_symbol)

In [13]:
def create_stock_history_dataset():
    # First run: fetch the last 10 trading days
    reliance_stock_history=ticker_object.history(period="10d").reset_index()
    return reliance_stock_history

def update_stock_history_dataset():
    # Later runs: load the saved history, then refresh or append the latest trading day
    reliance_stock_history=pd.read_csv(stock_history_file_path)
    # Saved dates look like 2023-01-12 00:00:00+05:30, so let pandas infer the format
    reliance_stock_history.Date=pd.to_datetime(reliance_stock_history.Date)
    today_reliance_stock_data=ticker_object.history(period="1d").reset_index()
    last_stock_date=str(today_reliance_stock_data.loc[0,'Date']).split()[0]
    last_saved_date=reliance_stock_history['Date'].dt.strftime('%Y-%m-%d').iloc[-1]
    if last_stock_date == last_saved_date: # latest day already saved: overwrite it
        reliance_stock_history.iloc[-1:,:]=today_reliance_stock_data.iloc[-1].tolist()
    else: # new trading day: append it
        last_position=len(reliance_stock_history)
        reliance_stock_history.loc[last_position]=today_reliance_stock_data.iloc[-1].tolist()
    return reliance_stock_history
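
The update step above either overwrites the last saved row or appends a new one, depending on whether the latest trading date is already present. A compact alternative, sketched here under the assumption that dates are stored as `YYYY-MM-DD` strings, is to concatenate and drop duplicate dates. The helper name `upsert_latest_row` and the prices are made up for illustration.

```python
import pandas as pd

def upsert_latest_row(history, latest):
    """Append `latest` to `history`, replacing any row that has the same Date."""
    combined = pd.concat([history, latest], ignore_index=True)
    # keep='last' makes the freshly fetched row win over the saved one
    combined = combined.drop_duplicates(subset="Date", keep="last")
    # ISO date strings sort chronologically, so a plain sort is enough here
    return combined.sort_values("Date").reset_index(drop=True)

# Made-up prices, only to exercise the function
history = pd.DataFrame({"Date": ["2023-01-24", "2023-01-25"], "Close": [2415.95, 2380.00]})
same_day = pd.DataFrame({"Date": ["2023-01-25"], "Close": [2382.55]})
next_day = pd.DataFrame({"Date": ["2023-01-26"], "Close": [2390.10]})

refreshed = upsert_latest_row(history, same_day)  # replaces the 2023-01-25 row
extended = upsert_latest_row(history, next_day)   # appends a new row
```

This avoids positional assignment, so it does not depend on the column order of the saved CSV matching the freshly fetched frame.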

In [14]:
# create stock market history dataset
ticker_object=yf.Ticker(company_symbol)
if not os.path.exists(stock_history_file_path):
    reliance_stock_history=create_stock_history_dataset()
else:
    reliance_stock_history=update_stock_history_dataset()


reliance_stock_history.to_csv(output_stock_history_file_path,index=False)

In [15]:
reliance_stock_history

| | Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-12 00:00:00+05:30 | 2524.850098 | 2532.5 | 2465.0 | 2471.600098 | 8163366 | 0.0 | 0.0 |
| 1 | 2023-01-13 00:00:00+05:30 | 2458.399902 | 2472.899902 | 2434.600098 | 2467.600098 | 9515473 | 0.0 | 0.0 |
| 2 | 2023-01-16 00:00:00+05:30 | 2472.699951 | 2479.649902 | 2427.0 | 2444.100098 | 6287407 | 0.0 | 0.0 |
| 3 | 2023-01-17 00:00:00+05:30 | 2458.0 | 2483.0 | 2450.600098 | 2478.800049 | 4961585 | 0.0 | 0.0 |
| 4 | 2023-01-18 00:00:00+05:30 | 2473.5 | 2491.100098 | 2460.350098 | 2474.699951 | 6206382 | 0.0 | 0.0 |
| 5 | 2023-01-19 00:00:00+05:30 | 2472.899902 | 2481.149902 | 2456.649902 | 2472.050049 | 5510333 | 0.0 | 0.0 |
| 6 | 2023-01-20 00:00:00+05:30 | 2475.0 | 2475.0 | 2437.25 | 2442.649902 | 6890325 | 0.0 | 0.0 |
| 7 | 2023-01-23 00:00:00+05:30 | 2449.0 | 2466.199951 | 2425.0 | 2430.300049 | 5055324 | 0.0 | 0.0 |
| 8 | 2023-01-24 00:00:00+05:30 | 2440.0 | 2443.649902 | 2387.350098 | 2415.949951 | 7609558 | 0.0 | 0.0 |
| 9 | 2023-01-25 00:00:00+05:30 | 2412.449951 | 2414.699951 | 2380.0 | 2382.550049 | 5713152 | 0.0 | 0.0 |


## Create/Update news dataset

In [16]:
def create_news_dataset():
    # First run: fetch the latest articles matching the search query
    conn.request('GET', '/v1/news?{}'.format(params))
    res = conn.getresponse().read()
    reliance_news=json.loads(res.decode('utf-8'))["data"]
    return reliance_news

def update_news_dataset():
    # Later runs: fetch new articles only if none dated yesterday are saved yet
    global news_inserted
    with open(news_file_path,'r') as file:
        reliance_news=json.load(file)
    for news in reliance_news['articles']:
        if news['published_at'].split('T')[0]==yesterday:
            news_inserted=True
            break
    current_reliance_news=None
    if not news_inserted:
        conn.request('GET', '/v1/news?{}'.format(params))
        res = conn.getresponse().read()
        current_reliance_news=json.loads(res.decode('utf-8'))["data"]
        reliance_news['articles']+=current_reliance_news
    # Returns (all saved articles, newly fetched articles or None)
    return reliance_news['articles'],current_reliance_news
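
The update above decides whether to fetch by looking for any saved article dated yesterday, and then appends everything the API returns. If the same article comes back on two runs it would be stored twice. One possible guard, sketched here as an assumption rather than as the notebook's method, is to skip fetched articles whose `url` is already saved. The helper name `merge_articles` and the sample articles are hypothetical.

```python
def merge_articles(saved, fetched):
    """Return saved + fetched, skipping fetched articles whose URL is already present."""
    seen = {article["url"] for article in saved}
    merged = list(saved)
    for article in fetched:
        if article["url"] not in seen:
            merged.append(article)
            seen.add(article["url"])
    return merged

# Made-up articles, only to exercise the function
saved = [{"url": "https://example.com/a", "title": "A"}]
fetched = [
    {"url": "https://example.com/a", "title": "A"},  # already saved
    {"url": "https://example.com/b", "title": "B"},  # new
]
merged = merge_articles(saved, fetched)
```

Using the URL as the key assumes each article has a stable, unique `url` field, which is what the `news.json` layout described earlier suggests.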

In [17]:
#create news dataset 
if not os.path.exists(news_file_path):
    reliance_news=create_news_dataset()
    current_reliance_news=reliance_news.copy()
else:
    reliance_news,current_reliance_news=update_news_dataset()

with open(output_news_file_path,'w') as file:
    json.dump({"articles":reliance_news},file)

## Predict sentiment on articles


In [18]:
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

Downloading:   0%|          | 0.00/252 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/758 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/438M [00:00<?, ?B/s]

In [19]:
def create_news_sentiment_dataset(news_sentiments):
    last_position=len(news_sentiments)
    article_ind=last_position
    title_description=[]
    if current_reliance_news: # skip when there are no new articles (None or empty list)
        for article in current_reliance_news:
            # a missing description is treated as empty text
            title_description.append(article['title']+' '+(article['description'] or ''))
            news_sentiments.at[article_ind,'published_at']=article['published_at']
            news_sentiments.at[article_ind,'title']=article['title']
            news_sentiments.at[article_ind,'description']=article['description']
            news_sentiments.at[article_ind,'url']=article['url']
            article_ind+=1
        news_label_and_scores=classifier(title_description)
        labels=[pred['label'] for pred in news_label_and_scores]
        scores=[pred['score'] for pred in news_label_and_scores]
        # .at addresses a single cell; .loc assigns the whole slice of new rows
        news_sentiments.loc[last_position:,'sentiment']=labels
        news_sentiments.loc[last_position:,'sentiment_score']=scores

    news_sentiments.to_csv(output_news_sentiment_file_path,index=False)
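
The function above feeds "title + description" strings to the classifier and copies each prediction's `label` and `score` into the new rows. The sketch below shows that data flow without downloading a model: `fake_classifier` is a made-up stand-in that only mimics the list-of-dicts shape returned by the `pipeline` object, and `score_articles` and the sample articles are likewise hypothetical.

```python
def fake_classifier(texts):
    """Stand-in for the transformers pipeline: one {'label', 'score'} dict per text."""
    return [
        {"label": "positive" if "profit" in text else "neutral", "score": 0.5}
        for text in texts
    ]

def score_articles(articles, classifier):
    """Attach sentiment and sentiment_score to each article."""
    # Treat a missing description as empty text so concatenation cannot fail
    texts = [f"{a['title']} {a.get('description') or ''}".strip() for a in articles]
    predictions = classifier(texts) if texts else []
    return [
        {
            "published_at": a["published_at"],
            "title": a["title"],
            "description": a.get("description"),
            "url": a["url"],
            "sentiment": p["label"],
            "sentiment_score": p["score"],
        }
        for a, p in zip(articles, predictions)
    ]

# Made-up articles, only to exercise the function
articles = [
    {"published_at": "2023-01-24T10:15:00+00:00", "title": "Quarterly profit rises",
     "description": None, "url": "https://example.com/a"},
    {"published_at": "2023-01-25T08:05:00+00:00", "title": "Board meeting scheduled",
     "description": "Agenda not disclosed", "url": "https://example.com/b"},
]
scored = score_articles(articles, fake_classifier)
```

Passing the real `classifier` in place of `fake_classifier` should produce rows with the same six keys as the columns of `news_sentiment.csv`.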

## Add news sentiment

In [20]:
news_sentiments=None
if os.path.exists(news_sentiment_file_path):
    news_sentiments=pd.read_csv(news_sentiment_file_path,index_col=None)                     
else:
    news_sentiments=pd.DataFrame(columns=['published_at','title','description','url','sentiment','sentiment_score'])
create_news_sentiment_dataset(news_sentiments)

### Connect to Google Cloud backend

In [21]:
#Authenticate with Google Cloud using a service account private key
import firebase_admin
from firebase_admin import credentials
from firebase_admin import firestore

cred = credentials.Certificate("./firestore-service.json")
# if not firebase_admin._apps:
firebase_admin.initialize_app(credential = cred, name = "PRED")


FileNotFoundError: ignored

In [None]:
#Connect to the database
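# The app was initialised under the name "PRED", so pass it explicitly;
# firestore.client() without arguments looks for the default app
db = firestore.client(firebase_admin.get_app("PRED"))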
sentiment_coll = db.collection('news_sentiments')
reliance_doc_ref = sentiment_coll.document(u'RELIANCE')

In [None]:
news_sentiments['published'] = news_sentiments['published_at']
news_sentiments.set_index('published_at').to_dict()

In [None]:
#Update DB with predicted sentiments
reliance_doc_ref.set(news_sentiments.set_index('published_at').to_dict())
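
The document written above is whatever `DataFrame.to_dict()` returns, and its default orientation nests values by column first. The sketch below, using made-up rows, contrasts that with `orient="index"`, which nests by article instead. Which shape suits the Firestore document better is a design choice the notebook does not discuss, so treat this only as an illustration of the two layouts.

```python
import pandas as pd

# Made-up rows shaped like news_sentiment.csv
frame = pd.DataFrame({
    "published_at": ["2023-01-24T10:15:00+00:00", "2023-01-25T08:05:00+00:00"],
    "sentiment": ["positive", "neutral"],
    "sentiment_score": [0.9, 0.8],
}).set_index("published_at")

by_column = frame.to_dict()                 # {column: {published_at: value}}
by_article = frame.to_dict(orient="index")  # {published_at: {column: value}}
```

With `by_column`, all sentiments sit under one `sentiment` key; with `by_article`, each timestamp maps to its own small record.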

# Phase 2


## ARIMA Model for Time Series Forecasting

In [22]:
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
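# Current ARIMA location; the legacy statsmodels.tsa.arima_model.ARIMA no longer works in statsmodels 0.13
from statsmodels.tsa.arima.model import ARIMA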
from pmdarima.arima import auto_arima
from sklearn.metrics import mean_squared_error, mean_absolute_error
import math