# Stock Prediction from Financial News
#### I will be using the FinViz website (https://finviz.com/) as the data source
#### FinViz provides the information needed for stock analysis: fundamental ratios, news headlines, and technical indicators
#### For example, to get the news headlines for Apple, use the link https://finviz.com/quote.ashx?t=AAPL. Note the ticker AAPL at the end; replace it with another company's ticker to get that company's news

## Idea
#### Step 1: Scrape the FinViz website to get news headlines
#### Step 2: Process each headline using Natural Language Processing (NLP)
#### Step 3: Assign a score to each news headline
#### Step 4: Average the scores per day or per week, depending on whether my investment horizon is short or long term
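
A minimal sketch of Step 4, assuming each headline has already been reduced to a (date, score) pair. The dates and scores below are made-up values for illustration, and `average_by_day` is a hypothetical helper name. It uses only the standard library; a pandas `groupby` on the date column would do the same job.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (date, score) pairs; the scores are invented for illustration
scored_headlines = [
    ("2020-10-19", 0.5),
    ("2020-10-19", -0.25),
    ("2020-10-18", 0.0),
    ("2020-10-18", 0.5),
    ("2020-10-18", 0.25),
]

def average_by_day(pairs):
    """Return {date: mean score} for an iterable of (date, score) pairs."""
    buckets = defaultdict(list)
    for day, score in pairs:
        buckets[day].append(score)
    return {day: mean(scores) for day, scores in buckets.items()}

daily_scores = average_by_day(scored_headlines)
print(daily_scores)
```

For a weekly horizon, the same grouping could key on the ISO week of each date instead of the date itself.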

## 1. Import Libraries
#### Libraries required
#### 1. urllib.request (urlopen, Request) to get data from FinViz
#### 2. BeautifulSoup to parse data from FinViz
#### 3. Pandas to store the data
#### 4. Matplotlib to plot the sentiment score
#### 5. nltk.sentiment.vader to perform the sentiment analysis on news

In [8]:
# Import libraries
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import os
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# NLTK VADER for sentiment analysis
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# News Data pull url
finwiz_url = 'https://finviz.com/quote.ashx?t='

## 2. Understand the HTML and identify the tags on FinViz that will be used to pull news data
### Steps to locate the news, using Apple (AAPL) as an example
#### Step 1: Go to the link https://finviz.com/quote.ashx?t=AAPL and scroll to the section of the page where the news is shown
![image.png](attachment:d34ca7e6-88b1-4e36-b7ec-3a870835a264.png)
#### Step 2: Right-click the news section and choose Inspect to view the HTML
![image.png](attachment:e75d7dfc-490e-4b6c-b74d-99606e814200.png)
#### Step 3: In the inspector, search for id="news-table". You will notice that the news is stored in a table
![image.png](attachment:1cf48ed9-e663-419d-8619-346879e6b777.png)
#### Step 4: In the inspector, look for the td tag that holds the date and time. Note that only the first news item of each day has both date and time; the remaining td tags for that day hold only the time
![image.png](attachment:060b6351-3779-4d05-970f-58862d283ed7.png)
#### Step 5: In the inspector, look for the a tag with class="tab-link-news"; it holds the news headline
![image.png](attachment:043b74b0-eb25-4f80-9a92-08b092f0523f.png)
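
The table layout described in the steps above can be tried out offline. The sketch below is only an illustration: `SAMPLE_HTML` is a hand-written fragment that imitates the structure (it is not copied from FinViz), and `NewsTableParser` is a hypothetical helper. It deliberately uses the standard library's `html.parser` so that it runs without network access; the notebook itself uses BeautifulSoup for the real page.

```python
from html.parser import HTMLParser

# Hand-written fragment imitating the structure described above (assumption)
SAMPLE_HTML = """
<table id="news-table">
  <tr>
    <td>Oct-19-20 12:01AM</td>
    <td><div><a class="tab-link-news" href="#">First headline</a></div></td>
  </tr>
  <tr>
    <td>10:16PM</td>
    <td><div><a class="tab-link-news" href="#">Second headline</a></div></td>
  </tr>
</table>
"""

class NewsTableParser(HTMLParser):
    """Collect (timestamp text, headline) pairs from rows of id="news-table"."""

    def __init__(self):
        super().__init__()
        self.in_table = False
        self.in_first_td = False
        self.in_link = False
        self.td_count = 0
        self.stamp = ""
        self.headline = ""
        self.rows = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == "news-table":
            self.in_table = True
        elif self.in_table and tag == "tr":
            # New row: reset the per-row state
            self.td_count, self.stamp, self.headline = 0, "", ""
        elif self.in_table and tag == "td":
            self.td_count += 1
            # The first cell of a row holds the date and/or time
            self.in_first_td = self.td_count == 1
        elif self.in_table and tag == "a" and attrs.get("class") == "tab-link-news":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "td":
            self.in_first_td = False
        elif tag == "a":
            self.in_link = False
        elif tag == "tr" and self.in_table:
            self.rows.append((self.stamp.strip(), self.headline.strip()))

    def handle_data(self, data):
        if self.in_first_td:
            self.stamp += data
        elif self.in_link:
            self.headline += data

parser = NewsTableParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

Note how the second row carries only a time, matching the observation in Step 4.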

## 3. Pull the data and store it in a Python dictionary. Add stocks to the list named tickers to pull their data

In [12]:
# Declare the dictionary
news_tables = {}

# Declare stocks for which data will be pulled in a list
tickers = ['AMZN', 'TSLA', 'GOOG', 'AAPL']

for ticker in tickers:
    url = finwiz_url + ticker
    # Send a user-agent header with the request
    req = Request(url=url, headers={'user-agent': 'my-app/0.0.1'})
    response = urlopen(req)
    # Parse the response into 'html', naming the parser explicitly
    html = BeautifulSoup(response, 'html.parser')
    # Find 'news-table' in the Soup and load it into 'news_table'
    news_table = html.find(id='news-table')
    # Add the table to our dictionary
    news_tables[ticker] = news_table

## 4. Test what is read from FinViz for one stock to understand how the data is read and stored
#### We will iterate through the first 4 `<tr></tr>` rows to obtain the headline between the `<a></a>` tags and the date and time between the `<td></td>` tags

In [15]:
# Get the news table for 'AMZN'
amzn = news_tables['AMZN']
# Get all the table rows tagged in HTML with <tr> into 'amzn_tr'
amzn_tr = amzn.find_all('tr')

for i, table_row in enumerate(amzn_tr):
    # Read the text of the element 'a' into 'a_text'
    a_text = table_row.a.text
    # Read the text of the element 'td' into 'td_text'
    td_text = table_row.td.text
    # Print the contents of 'a_text' and 'td_text'
    print(a_text)
    print(td_text)
    # Exit after printing 4 rows of data
    if i == 3:
        break

Dundas World Set to Open Store on Amazon, Adds Myriad Categories
Oct-19-20 12:01AM  
Dow Jones Futures: Stock Market Rally At Turning Point; Pelosi Sets Stimulus Deal Deadline
Oct-18-20 10:16PM  
Stock Market Rally At Turning Point; Pelosi Sets Stimulus Deal Deadline
05:39PM  
Why I'll Never Sell My Amazon Stock
03:05PM  


## 5. Data parsing to get date, time and news headline
### Understanding code block
#### Step 1: The 1st loop iterates over the dictionary to get the key-value pairs. The key is the company ticker and the value is the news table extracted from the HTML
#### Step 2: The 2nd loop runs over each `<tr>` tag in the HTML stored in the dictionary value.
    Note:
    a. Each <tr> has a <td> that holds the time; only the first <tr> of a day has both date and time, the rest have only the time
    b. In each <tr> there is a <td> containing a <div> that contains an <a>. This <a> has the news title and its link. We are only interested in the news title
#### Step 3: Get the news headline from the `<a>` only and store it in a variable
#### Step 4: Split the text of the first `<td>` in each `<tr>` to get the date and time. Note that only one `<td>` per day gives both date and time; the rest give only the time
#### Step 5: If the length of the split is 1, the cell holds only the time; otherwise it holds both date and time
#### Step 6: Take the ticker from the dictionary key
#### Step 7: Build a list of ticker, date, time and news headline for the row, and append it to parsed_news to create a list of lists

In [None]:
# Parse the HTML data in the dictionary to get date, time and headlines. These are stored in a Python list for further processing
parsed_news = []

# Iterate through the news
for file_name, news_table in news_tables.items():
    # Iterate through all tr tags in 'news_table'
    for x in news_table.find_all('tr'):
        # Skip any row without a headline link, which would otherwise raise AttributeError
        if x.a is None:
            continue
        # Get the headline text from the 'a' tag only
        text = x.a.get_text()
        # Split the text in the td tag into a list
        date_scrape = x.td.text.split()

        # If the length of 'date_scrape' is 1, load 'time' as the only element
        # and keep the 'date' carried over from an earlier row
        if len(date_scrape) == 1:
            time = date_scrape[0]

        # Else load 'date' as the 1st element and 'time' as the second
        else:
            date = date_scrape[0]
            time = date_scrape[1]

        # Extract the ticker from the dictionary key, taking the string up to the 1st '_'
        # The split only matters if keys look like 'AMZN_Amazon'; here the key is the ticker itself
        ticker = file_name.split('_')[0]

        # Append ticker, date, time and headline as a list to the 'parsed_news' list
        parsed_news.append([ticker, date, time, text])
        
# Uncomment below to understand what is included in parsed list
#parsed_news
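
The date carry-forward in the loop above can be isolated and taken one step further, turning each cell into a real `datetime`. This is a sketch under two assumptions: that cells look like the sample output in section 4 (`Oct-19-20 12:01AM` or just `05:39PM`), and that month abbreviations are parsed in an English locale. `to_datetimes` is a hypothetical helper name; any other cell format raises `ValueError`.

```python
from datetime import datetime

def to_datetimes(cells):
    """Convert date/time cell texts to datetimes, carrying the last seen date forward."""
    current_date = None
    result = []
    for cell in cells:
        parts = cell.split()
        if len(parts) == 2:
            # Cell holds both date and time: remember the date for later rows
            current_date, time_part = parts
        elif len(parts) == 1 and current_date is not None:
            # Cell holds only the time: reuse the most recent date
            time_part = parts[0]
        else:
            raise ValueError(f"unexpected timestamp cell: {cell!r}")
        # Assumed format, taken from the sample output shown in section 4
        result.append(datetime.strptime(f"{current_date} {time_part}", "%b-%d-%y %I:%M%p"))
    return result

# Cell texts copied from the sample output in section 4
cells = ["Oct-19-20 12:01AM  ", "Oct-18-20 10:16PM  ", "05:39PM  ", "03:05PM  "]
stamps = to_datetimes(cells)
for stamp in stamps:
    print(stamp.isoformat())
```

Having real timestamps makes it straightforward to sort headlines or group them by day or week later on.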

## 6. Read each news headline and perform sentiment analysis using the NLP library VADER
### Understanding code block to get sentiment score from the news headline
#### Step 1: Initialize the sentiment analyzer
#### Step 2: Convert the list created in section 5 to a pandas data frame.
    Note: The pandas data frame will hold the ticker, date, time, news headline and the resulting sentiment score
#### Step 3: Iterate through each headline and get the polarity score from VADER
#### Step 4: Add the score to the data frame
#### Step 5: View the updated data frame with the score from VADER