<a href="https://colab.research.google.com/gist/ashwinkey04/de066da294792f198b7f74ad6ec702e8/fyp_phase_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phase 1

> Gather news articles and stock prices for a specific stock, and prepare the dataset for sentiment analysis and stock value prediction.


#### How it works
> This notebook fetches daily stock market data for a specified stock through the [yfinance](https://www.yahoofinanceapi.com/) package and the stock's daily news articles from the [mediastack](https://mediastack.com/) API, then creates a derived dataset that contains the **news sentiment** of each article. The notebook can be scheduled to run once a day on Google Cloud Run so that new data is gathered daily.




#### Structure of dataset generated

1. stock_history.csv
2. news.json
3. news_sentiment.csv

#### stock_history.csv 

1. Date - Trading date
2. Open - Opening price of the day
3. High - Highest price of the day
4. Low - Lowest price of the day
5. Close - Closing price of the day
6. Volume - Number of shares traded during the day
7. Dividends - Dividend paid per share on that date (0 if none)
8. Stock Splits - Split ratio applied on that date (0 if none)


#### news.json

1. author - author of news article 
2. title - title of news article 
3. description - description of news article
4. url - url of news article
5. source - source of news article
6. image - image of news article
7. category - category of news article
8. language - language of news article
9. country - country name
10. published_at - published date

#### news_sentiment.csv

1. published_at - published date
2. title - title of news article 
3. description - description of news article
4. url - url of news article
5. sentiment - predicted sentiment label of the article
6. sentiment_score - confidence score of the predicted label, between 0 and 1



In [2]:
# Install Yahoo Finance package
!pip install yfinance

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting yfinance
  Downloading yfinance-0.1.87-py2.py3-none-any.whl (29 kB)
Collecting requests>=2.26
  Downloading requests-2.28.1-py3-none-any.whl (62 kB)
Installing collected packages: requests, yfinance
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
Successfully installed requests-2.28.1 yfinance-0.1.87


In [None]:
# Install the Firebase Admin SDK, used later to write to Cloud Firestore
!pip install firebase-admin

In [None]:
# Install the Transformers package, used to load the pretrained FinBERT sentiment model
!pip install transformers

In [3]:
# Import libraries
import json
import os
import warnings
import http.client, urllib.parse
from datetime import date, timedelta

import pandas as pd
import yfinance as yf

# Silence library warnings before loading Transformers
warnings.filterwarnings("ignore")
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Show wide and long DataFrames in full
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [4]:
# company ticker symbol (the .NS suffix selects the NSE listing on Yahoo Finance)
company_symbol="RELIANCE.NS"

# today's and yesterday's dates as YYYY-MM-DD strings
today = str(date.today())
yesterday = str(date.today()- timedelta(days = 1))

# flag variables
news_inserted=False

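# secret key of mediastack api, read from the environment rather than hard-coded in the notebook
# (the variable name MEDIASTACK_API_TOKEN is an assumption; set it before running this cell)
mediastack_api_token = os.environ.get("MEDIASTACK_API_TOKEN", "")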

# input file paths
stock_history_file_path='./input/reliance_stock_history.csv'
news_file_path='./input/reliance_news.json'
news_sentiment_file_path='./input/reliance_news_sentiment.csv'

# output file paths
output_stock_history_file_path='./reliance_stock_history.csv'
output_news_file_path='./reliance_news.json'
output_news_sentiment_file_path='./reliance_news_sentiment.csv'

# parameters for mediastack api
search_query='reliance'
conn = http.client.HTTPConnection('api.mediastack.com')
params = urllib.parse.urlencode({
    'keywords': search_query,
    'access_key': mediastack_api_token,
    'sort': 'published_desc',
    'limit': 10,
    'languages': 'en',
    'countries': 'in',
    })
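The request parameters above end up as the query string of a `GET /v1/news` call. As a minimal sketch of what that path looks like, the snippet below rebuilds it with a placeholder key. The helper name `build_news_path` and the value `YOUR_ACCESS_KEY` are illustrative assumptions, not part of the notebook.

```python
import urllib.parse

def build_news_path(keywords, access_key, limit=10):
    """Return the request path for the news endpoint used in this notebook."""
    params = urllib.parse.urlencode({
        "keywords": keywords,
        "access_key": access_key,  # placeholder here; never commit a real key
        "sort": "published_desc",
        "limit": limit,
        "languages": "en",
        "countries": "in",
    })
    return "/v1/news?{}".format(params)

path = build_news_path("reliance", "YOUR_ACCESS_KEY")
print(path)

# urlencode also escapes characters that are not URL-safe, such as spaces
spaced_path = build_news_path("reliance industries", "YOUR_ACCESS_KEY")
print(spaced_path)
```

Printing the path before sending the request is a cheap way to check the filters, as long as the real key is not left in the saved output.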

## Create dataset

In [5]:
ticker_object=yf.Ticker(company_symbol)

In [6]:
def create_stock_history_dataset():
    # first run: download the last 10 trading days
    reliance_stock_history=ticker_object.history(period="10d").reset_index()
    return reliance_stock_history

def update_stock_history_dataset():
    reliance_stock_history=pd.read_csv(stock_history_file_path)
    today_reliance_stock_data=ticker_object.history(period="1d").reset_index()
    # keep dates as strings so the new row matches the values already stored in the CSV
    today_reliance_stock_data['Date']=today_reliance_stock_data['Date'].astype(str)
    latest_row=today_reliance_stock_data.iloc[-1].tolist()
    # compare calendar dates only (first 10 characters, YYYY-MM-DD)
    last_stock_date=today_reliance_stock_data['Date'].iloc[-1][:10]
    last_saved_date=str(reliance_stock_history['Date'].iloc[-1])[:10]
    if last_stock_date == last_saved_date: # latest trading day already saved, so overwrite it
        reliance_stock_history.iloc[-1]=latest_row
    else: # new trading day, so append it
        reliance_stock_history.loc[len(reliance_stock_history)]=latest_row
    return reliance_stock_history

In [10]:
# create the stock history dataset on the first run, otherwise update the saved one
ticker_object=yf.Ticker(company_symbol)
if not os.path.exists(stock_history_file_path):
    reliance_stock_history=create_stock_history_dataset()
else:
    reliance_stock_history=update_stock_history_dataset()

reliance_stock_history.to_csv(output_stock_history_file_path,index=False)
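The update step either overwrites the last saved row or appends a new one, depending on whether the latest trading date is already stored. The sketch below shows one alternative way to get the same effect, using `concat` followed by date-based de-duplication; it is not the notebook's own method. The function name `upsert_latest_row` and the sample prices and dates are illustrative assumptions.

```python
import pandas as pd

def upsert_latest_row(history: pd.DataFrame, latest: pd.DataFrame) -> pd.DataFrame:
    """Append `latest` to `history`, replacing any row with the same calendar date."""
    combined = pd.concat([history, latest], ignore_index=True)
    # Compare on the first 10 characters (YYYY-MM-DD) so time and UTC offset are ignored
    day = combined["Date"].astype(str).str[:10]
    # keep="last" keeps the newest copy of a date and drops the older one
    return combined.loc[~day.duplicated(keep="last")].reset_index(drop=True)

# Made-up sample rows in the same date format as the saved CSV
history = pd.DataFrame({
    "Date": ["2022-11-23 00:00:00+05:30", "2022-11-24 00:00:00+05:30"],
    "Close": [2557.05, 2558.30],
})
same_day = pd.DataFrame({"Date": ["2022-11-24 00:00:00+05:30"], "Close": [2560.00]})
next_day = pd.DataFrame({"Date": ["2022-11-25 00:00:00+05:30"], "Close": [2570.00]})

refreshed = upsert_latest_row(history, same_day)  # same date: last row is replaced
extended = upsert_latest_row(history, next_day)   # new date: one row is added
print(refreshed)
print(extended)
```

This form does not depend on the new row being the last one in the file, which may help if the notebook is ever run for a backfilled date.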

In [9]:
reliance_stock_history

| | Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022-11-11 00:00:00+05:30 | 2600.0 | 2636.0 | 2588.0 | 2631.800049 | 5681124 | 0 | 0 |
| 1 | 2022-11-14 00:00:00+05:30 | 2630.75 | 2642.0 | 2605.149902 | 2619.050049 | 4173395 | 0 | 0 |
| 2 | 2022-11-15 00:00:00+05:30 | 2622.300049 | 2626.399902 | 2590.0 | 2607.300049 | 3270388 | 0 | 0 |
| 3 | 2022-11-16 00:00:00+05:30 | 2610.0 | 2615.949951 | 2581.350098 | 2592.350098 | 4484007 | 0 | 0 |
| 4 | 2022-11-17 00:00:00+05:30 | 2584.949951 | 2613.0 | 2580.0 | 2599.050049 | 3074259 | 0 | 0 |
| 5 | 2022-11-18 00:00:00+05:30 | 2606.75 | 2609.0 | 2571.100098 | 2597.649902 | 2447425 | 0 | 0 |
| 6 | 2022-11-21 00:00:00+05:30 | 2588.0 | 2588.0 | 2543.100098 | 2550.899902 | 2949108 | 0 | 0 |
| 7 | 2022-11-22 00:00:00+05:30 | 2545.0 | 2568.5 | 2536.5 | 2565.050049 | 3051201 | 0 | 0 |
| 8 | 2022-11-23 00:00:00+05:30 | 2575.0 | 2577.899902 | 2552.25 | 2557.050049 | 2959787 | 0 | 0 |
| 9 | 2022-11-24 00:00:00+05:30 | 2566.0 | 2567.0 | 2550.300049 | 2558.300049 | 942996 | 0 | 0 |


## Create/Update news dataset

In [12]:
def create_news_dataset():
    # first run: fetch the latest articles matching the search query
    conn.request('GET', '/v1/news?{}'.format(params))
    res = conn.getresponse().read()
    reliance_news=json.loads(res.decode('utf-8'))["data"]
    return reliance_news

def update_news_dataset():
    global news_inserted
    with open(news_file_path,'r') as file:
        reliance_news=json.load(file)
    # skip the API call if an article published yesterday is already stored
    for news in reliance_news['articles']:
        if news['published_at'].split('T')[0]==yesterday:
            news_inserted=True
            break
    # stays None when nothing new is fetched in this run
    current_reliance_news=None
    if not news_inserted:
        conn.request('GET', '/v1/news?{}'.format(params))
        res = conn.getresponse().read()
        current_reliance_news=json.loads(res.decode('utf-8'))["data"]
        reliance_news['articles']+=current_reliance_news
    return reliance_news['articles'],current_reliance_news

In [14]:
# create the news dataset on the first run, otherwise update the saved one
if not os.path.exists(news_file_path):
    reliance_news=create_news_dataset()
    current_reliance_news=reliance_news.copy()
else:
    reliance_news,current_reliance_news=update_news_dataset()

with open(output_news_file_path,'w') as file:
    json.dump({"articles":reliance_news},file)
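The update logic decides whether to call the API by checking if any stored article was published yesterday, then appends whatever the API returns. Below is a small sketch of that date check, plus a URL-based merge that avoids storing the same article twice. The merge is a suggested addition rather than something the notebook does, and the helper names and sample articles are hypothetical.

```python
def has_article_from(articles, day):
    """Return True if any article's published_at falls on `day` (YYYY-MM-DD)."""
    return any(str(a.get("published_at", "")).split("T")[0] == day for a in articles)

def merge_articles(existing, fetched):
    """Append fetched articles whose URL is not already stored."""
    seen = {a.get("url") for a in existing}
    merged = list(existing)
    for article in fetched:
        if article.get("url") not in seen:
            merged.append(article)
            seen.add(article.get("url"))
    return merged

# Made-up articles using the same timestamp layout as the stored news file
stored = [{"url": "https://example.com/a", "published_at": "2022-11-23T10:15:00+00:00"}]
fetched = [
    {"url": "https://example.com/a", "published_at": "2022-11-23T10:15:00+00:00"},
    {"url": "https://example.com/b", "published_at": "2022-11-24T08:00:00+00:00"},
]

print(has_article_from(stored, "2022-11-23"))  # True
print(has_article_from(stored, "2022-11-24"))  # False
print(len(merge_articles(stored, fetched)))    # 2
```

Because each request returns the 10 most recent matches, consecutive runs can overlap, which is where a merge like this would matter.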

## Predict sentiment on articles


In [15]:
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)


In [17]:
def create_news_sentiment_dataset(news_sentiments):
    last_position=len(news_sentiments)
    article_ind=last_position
    title_description=[]
    # current_reliance_news is None (or empty) when no new articles were fetched in this run
    if current_reliance_news:
        for article in current_reliance_news:
            # title or description can be missing (None), so fall back to an empty string
            title=article['title'] or ''
            description=article['description'] or ''
            title_description.append((title+' '+description).strip())
            news_sentiments.at[article_ind,'published_at']=article['published_at']
            news_sentiments.at[article_ind,'title']=title
            news_sentiments.at[article_ind,'description']=description
            news_sentiments.at[article_ind,'url']=article['url']
            article_ind+=1
        # truncation=True keeps long texts within the model's maximum input length
        news_label_and_scores=classifier(title_description,truncation=True)
        labels=[pred['label'] for pred in news_label_and_scores]
        scores=[pred['score'] for pred in news_label_and_scores]
        # .at is meant for single values, so use .loc to fill all newly added rows
        new_rows=news_sentiments.index[last_position:]
        news_sentiments.loc[new_rows,'sentiment']=labels
        news_sentiments.loc[new_rows,'sentiment_score']=scores

    news_sentiments.to_csv(output_news_sentiment_file_path,index=None)

## Add news sentiment

In [18]:
news_sentiments=None
if os.path.exists(news_sentiment_file_path):
    news_sentiments=pd.read_csv(news_sentiment_file_path,index_col=None)                     
else:
    news_sentiments=pd.DataFrame(columns=['published_at','title','description','url','sentiment','sentiment_score'])
create_news_sentiment_dataset(news_sentiments)
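The sentiment step joins each article's title and description, passes the texts to the classifier, and splits the returned predictions into a label list and a score list. The sketch below walks through that data flow without downloading the model. `fake_classifier` is a stand-in that only imitates the list-of-dicts shape the pipeline returns, and its fixed output, like the sample articles, is made up.

```python
def build_inputs(articles):
    """Join title and description, skipping whichever one is missing."""
    return [
        " ".join(part for part in (a.get("title"), a.get("description")) if part)
        for a in articles
    ]

def split_predictions(predictions):
    """Turn [{'label': ..., 'score': ...}, ...] into parallel label and score lists."""
    labels = [pred["label"] for pred in predictions]
    scores = [pred["score"] for pred in predictions]
    return labels, scores

def fake_classifier(texts):
    # Stand-in for the FinBERT pipeline: same output shape, constant made-up values
    return [{"label": "neutral", "score": 0.5} for _ in texts]

articles = [
    {"title": "Example headline A", "description": None},
    {"title": "Example headline B", "description": "Short summary."},
]

texts = build_inputs(articles)
labels, scores = split_predictions(fake_classifier(texts))
print(texts)
print(labels, scores)
```

Swapping `fake_classifier` for the real `classifier` object would leave the rest of the flow unchanged, since only the shape of the predictions is relied on.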

### Connect to Google Cloud backend

In [19]:
#Authenticate with Google Cloud using a service account private key
import firebase_admin
from firebase_admin import credentials
from firebase_admin import firestore

cred = credentials.Certificate("./firestore-service.json")
# if not firebase_admin._apps:
firebase_admin.initialize_app(credential = cred, name = "PRED")


In [20]:
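#Connect to the database
# The app was registered under the name "PRED", so fetch it by name;
# firestore.client() without an argument looks for the default app and raises ValueError
db = firestore.client(firebase_admin.get_app("PRED"))
sentiment_coll = db.collection('news_sentiments')
reliance_doc_ref = sentiment_coll.document(u'RELIANCE')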

In [None]:
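# Keep a copy of the publication date as a regular column, since published_at becomes the index
news_sentiments['published'] = news_sentiments['published_at']
# Preview the nested dict that will be written to Firestore
news_sentiments.set_index('published_at').to_dict()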

In [None]:
#Update DB with predicted sentiments
reliance_doc_ref.set(news_sentiments.set_index('published_at').to_dict())
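With its default orientation, `DataFrame.to_dict()` returns one mapping per column, each keyed by the index value, so the document written above is grouped by column rather than by article. The plain-Python sketch below mimics that shape and contrasts it with a per-article layout, which may be easier to query. Both helper functions and the sample rows are illustrative assumptions.

```python
def to_column_maps(rows, key):
    """Mimic set_index(key).to_dict(): {column: {key value: cell}}."""
    columns = [c for c in rows[0] if c != key]
    return {c: {row[key]: row[c] for row in rows} for c in columns}

def to_row_maps(rows, key):
    """Alternative layout: {key value: {column: cell}}, one entry per article."""
    return {row[key]: {c: v for c, v in row.items() if c != key} for row in rows}

# Made-up rows with a subset of the news_sentiment.csv columns
rows = [
    {"published_at": "2022-11-24T05:30:00+00:00", "title": "Example headline A",
     "sentiment": "positive", "sentiment_score": 0.91},
    {"published_at": "2022-11-24T09:00:00+00:00", "title": "Example headline B",
     "sentiment": "neutral", "sentiment_score": 0.77},
]

by_column = to_column_maps(rows, "published_at")
by_article = to_row_maps(rows, "published_at")
print(by_column["sentiment"])
print(by_article["2022-11-24T05:30:00+00:00"])
```

In both layouts of this sketch, two articles that share the same `published_at` value would overwrite one another, so a unique key such as the URL may be a safer choice.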