<a href="https://colab.research.google.com/github/Zantorym/AIDI-1100-Project/blob/main/AIDI_1100_01_FINAL_PROJECT_GROUP_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AIDI1100 Final Project


---
**Group Number:** 2

**Group Members:**

Jaspreet Singh Marwah

Lawrence Wanderi Mwangi

Sherap Gyaltsen

Ayobami Banjoko

Oluwaseun Ogunnubi

Simrandeep Singh Rahi

**Course:** AIDI1100 - Introduction To AI Development

**Submission Date:** 

---

**Strategy of Code Distribution:**

Part 1 (Scan/Parse): Jaspreet

Part 2 (Track/Store): Lawrence Wanderi Mwangi

Part 3 (Retrieve Data): Sherap Gyaltsen

Part 4 (Visualize): Ayobami Banjoko, Oluwaseun Ogunnubi, Simrandeep Singh Rahi



---

**Program Description:**

*Part 1:* 

First, the _get_urls() function goes through the list of articles at https://www.prnewswire.com/news-releases/news-releases-list. It fetches the raw HTML of the page and, using the BeautifulSoup module, parses it to collect the URLs of all articles released back to a given cutoff date and time. If necessary, the program moves on to the next page of the listing until the URLs of all articles released since the cutoff have been obtained.

Then, for each URL obtained (one per article), _get_articles() downloads the raw HTML and parses it with BeautifulSoup. It collects the text of each article's body as a list of paragraphs and stores these per-article lists in one list.

The scrape() function conveniently runs both functions for a specified number of days' worth of articles and returns a list of all the text from the articles. This function allows this part of the code to act as an individual module. More information regarding this can be found in the "Bonus work" section of this file.

*Part 2:*

All the text from the articles is merged into one large string. This string is searched with a regex pattern to find all the exchange and stock symbol pairs mentioned in the collected articles, and duplicates are removed.

*Part 3:*

Three of the stock symbols that were collected are chosen for analysis, namely TSLA, GM and LCID. Given a start date, end date, and stock symbol, the stock_data() function returns the closing price and volume of the specified stock for each day within the given date range. It does this by calling the Yahoo Finance API with the help of the yahoo_fin package.

*Part 4:*

generate_visualisations() uses the plotly library to generate interactive time-series plots of the volume and closing price of a given stock over a roughly 60-day period (September 23 to November 23, 2021). matplotlib is also imported, but the plots themselves are produced with plotly. When the function is called with a ticker symbol, it displays the two plots.

---

**Bonus work:**



*   GitHub was used to collaborate on the project. GitHub link: https://github.com/Zantorym/AIDI-1100-Project
*   The code for part 1 can be used as a module. When imported, the scrape() function will scrape all the PRNewswire articles published within a specified number of days before the current date and time. This module is available on the GitHub repository as "prnewswire_scraper.ipynb".
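
The effect of the underscore naming convention on imports can be illustrated with a small stand-in module. This is only a sketch: `scraper_demo` and its function bodies are hypothetical placeholders, not the real scraper, and the module is built in memory so that the example runs on its own.

```python
import sys
import types

# Hypothetical stand-in for the scraper module: one 'private' and one public function
source = '''
def _get_urls(end_date):
    return []

def scrape(num_days):
    return _get_urls(num_days)
'''

# Build the module in memory and register it so that it can be imported by name
demo_module = types.ModuleType("scraper_demo")
exec(source, demo_module.__dict__)
sys.modules["scraper_demo"] = demo_module

# A star import copies only the names that do not start with an underscore
namespace = {}
exec("from scraper_demo import *", namespace)
star_imported = sorted(name for name in namespace if not name.startswith("__"))
print(star_imported)  # ['scrape']

# A plain import still exposes the 'private' function as an attribute
import scraper_demo
print(hasattr(scraper_demo, "_get_urls"))  # True
```

With a star import only scrape() becomes available, while a plain import keeps the underscore-prefixed helper reachable through the module object.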


# PART 1 - Scan/Parse

In [None]:
"""PRNewswire article scraper

This script scrapes all the PRNewswire articles released within a specified 
number of days from the current date and time.

This script can be imported as a module and contains the following functions:
    * _get_urls - returns the urls for all the articles released up till a 
                  certain date
    * _get_articles - returns the text of the body of the articles corresponding
                      to each url
    * scrape - the main function of the script that runs everything

Functions that start with an underscore are 'private' and are not meant to be 
called when this module is imported. These functions will not be copied over 
when this module is imported using the line 'from prnewswire_scraper import *'.
However, they will be copied over if the module is imported using the line
'import prnewswire_scraper' but this is unavoidable as there is no real way to
maintain 'private' functions of modules in python.


"""

# Importing modules
from datetime import datetime
from datetime import timedelta
import requests
from bs4 import BeautifulSoup
from pytz import timezone
import re

%pip install yahoo_fin
from yahoo_fin.stock_info import get_data

import plotly.express as px
import matplotlib.pyplot as plt

In [None]:
# Gets urls for all the articles released between end_date (the cutoff) and now
# Returns a list of urls, newest first
def _get_urls(end_date):
  urls = [] # List of URLs to visit

  website = "https://www.prnewswire.com/news-releases/news-releases-list/?page=" # Website we need to scrape from
  page_num = 1 # Page number of the website we need to scrape from

  end_date_reached = False

  while not end_date_reached:
    current_site = website + str(page_num) + "&pagesize=100" # Link to visit with page number, set number of articles per page to 100 so that we don't need to visit as many pages
    response = requests.get(current_site)

    if response.status_code == 200: # 200 is the standard response for a successful HTTP request
      soup = BeautifulSoup(response.content, 'html.parser') # Converting the raw html code of the website into a BeautifulSoup object for easy parsing; the parser is named explicitly to avoid a warning and keep results consistent
      anchors = soup.find_all('a', {'class': 'newsreleaseconsolidatelink display-outline', 'href': True}) # Getting all the anchors for news articles within the webpage

      for anchor in anchors:
        date = anchor.find('small').get_text()
        try: 
          date = datetime.strptime(date, '%b %d, %Y, %H:%M ET') # Convert to datetime
        except ValueError: # If the conversion fails, the article was released today and the time is written as "HH:MM ET" instead of "Month DD, YYYY, HH:MM ET"
          date = datetime.strptime(date, '%H:%M ET') # Convert the time into a datetime variable
          now = datetime.now(timezone('America/New_York')).date() # Today's date in US Eastern time; Google Colab runs on UTC, while PRNewswire timestamps are in ET
          now_t = date.time() # Time the article was released
          date = datetime.combine(now, now_t) # Date and time combined
        
        if (date < end_date):
          end_date_reached = True
          break
        else:
          href = "https://www.prnewswire.com" + anchor['href'] # Retrieving href for the article and converting it to visitable link
          urls.append(href) # Adding to list of urls to visit
      
      page_num += 1
    else:
      break # Stop on a failed request instead of requesting the same page forever
    


  return urls
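
The two timestamp formats handled in _get_urls() can be tried in isolation. The following is a minimal sketch, assuming the listing shows either "Mon DD, YYYY, HH:MM ET" or just "HH:MM ET" for same-day releases; `parse_release_time` and its `today` parameter are hypothetical names introduced for illustration, and the sample strings are made up.

```python
from datetime import date, datetime

def parse_release_time(text, today):
    """Hypothetical helper: parse a listing timestamp into a naive datetime.

    'today' supplies the calendar date for stamps that contain only a time.
    """
    try:
        # Full form, e.g. "Nov 22, 2021, 16:05 ET"
        return datetime.strptime(text, '%b %d, %Y, %H:%M ET')
    except ValueError:
        # Time-only form, e.g. "08:30 ET": combine it with the supplied date
        time_only = datetime.strptime(text, '%H:%M ET').time()
        return datetime.combine(today, time_only)

today = date(2021, 11, 23)
full_stamp = parse_release_time('Nov 22, 2021, 16:05 ET', today)
time_stamp = parse_release_time('08:30 ET', today)
print(full_stamp)  # 2021-11-22 16:05:00
print(time_stamp)  # 2021-11-23 08:30:00
```

Passing the date in as an argument keeps the helper independent of the clock and the time zone of the machine, which makes it easy to check by hand.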

In [None]:
def _get_articles(urls):
  articles = []

  for url in urls:
    response = requests.get(url)

    if response.status_code == 200:
      soup = BeautifulSoup(response.content, 'html.parser') # Naming the parser explicitly avoids a warning and keeps results consistent

      divs = soup.find_all('div', class_ = 'col-sm-10 col-sm-offset-1') # Finds all div containers of the class that's meant for the body of the webpage

      # Getting all the text out of the divs collected above
      article = [] # For storing all the text within this article
      for div in divs:
        p_tags = div.find_all('p') # All the p tags in the current div, since the body text of PRNewswire articles is stored within p tags
        for p in p_tags:
          article.append(p.get_text())
      articles.append(article)

  return articles

In [None]:
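def scrape(NUM_DAYS):
  """Returns the body text of all articles released within the last NUM_DAYS days."""
  # Current time in US Eastern time as a naive datetime, so it is comparable with the article timestamps parsed in _get_urls()
  now = datetime.now(timezone('America/New_York')).replace(tzinfo=None)
  end_date = now - timedelta(days=NUM_DAYS) # Date and time NUM_DAYS days ago
  urls = _get_urls(end_date)
  articles = _get_articles(urls)

  return articles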
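
The cutoff logic shared by scrape() and _get_urls() can be sketched without any network access. The example below assumes the listing is ordered newest first, which is what allows the loop to stop at the first article older than the cutoff; `keep_until_cutoff` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def keep_until_cutoff(timestamps, cutoff):
    """Hypothetical helper: keep timestamps until the first one older than cutoff."""
    kept = []
    for stamp in timestamps:  # assumed to be ordered newest first
        if stamp < cutoff:
            break  # everything after this point is older still
        kept.append(stamp)
    return kept

now = datetime(2021, 11, 23, 12, 0)   # fixed "current time" for the example
cutoff = now - timedelta(days=7)      # 2021-11-16 12:00
listing = [
    datetime(2021, 11, 23, 9, 0),
    datetime(2021, 11, 18, 15, 30),
    datetime(2021, 11, 16, 11, 59),   # one minute before the cutoff
    datetime(2021, 11, 10, 8, 0),
]
recent = keep_until_cutoff(listing, cutoff)
print(len(recent))  # 2
```

Because the comparison uses the full date and time, an article released one minute before the cutoff is already excluded.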

In [None]:
dataset = scrape(7)

# PART 2 - Track/Store/Search

In [None]:
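# Joining the paragraphs of every article into one string, with a space between paragraphs and between articles
corpus = ' '.join(' '.join(str(e) for e in x) for x in dataset)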

In [None]:
def tickerCapture(corpus:str) -> list:
    """Returns a list of tuples that have this format (<Exchange>, <ticker>)

    Args:
        corpus (str): String that contains exchange and ticker info

    Returns:
        list: [(<Exchange>, <ticker>)] list of <exchange> <ticker> data
    """
    # Expected format examples: (NYSE: HMLP) and (NYSE/LSE: CCL; NYSE:CUK)
    # Example return: [('NYSE', 'HMLP'), ('NYSE/LSE', 'CCL'), ('NYSE', 'CUK')]

    regexpattern = r'\b\(?(?P<exchange>[A-Z\/]+):\s?(?P<ticker>[A-Z]+)(?:\)|;)'

    result = re.findall(regexpattern, corpus)
    
    return result

In [None]:
tickers = tickerCapture(corpus)
tickers = list(set(tickers))
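
As a quick sanity check, the pattern used in tickerCapture() can be run on a small hand-written string. The sample sentence is made up for illustration, and `extract_symbols` is a hypothetical helper that is not part of the notebook; it shows one way to reduce the (exchange, ticker) pairs to a sorted list of unique symbols.

```python
import re

# Same pattern as in tickerCapture()
TICKER_PATTERN = r'\b\(?(?P<exchange>[A-Z\/]+):\s?(?P<ticker>[A-Z]+)(?:\)|;)'

def extract_symbols(text):
    """Hypothetical helper: return the unique ticker symbols in text, sorted alphabetically."""
    pairs = re.findall(TICKER_PATTERN, text)  # list of (exchange, ticker) tuples
    return sorted({ticker for _, ticker in pairs})

sample = "Acme (NYSE: HMLP) and Carnival (NYSE/LSE: CCL; NYSE:CUK) issued releases."
pairs = re.findall(TICKER_PATTERN, sample)
symbols = extract_symbols(sample)
print(pairs)    # [('NYSE', 'HMLP'), ('NYSE/LSE', 'CCL'), ('NYSE', 'CUK')]
print(symbols)  # ['CCL', 'CUK', 'HMLP']
```

Sorting the symbols gives a stable order, which the plain list(set(...)) conversion does not guarantee.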

# PART 3 - Retrieve Data (Web (API))

From the list of tickers that we retrieved, we will select the following 3 that belong to the automobile industry: TSLA, GM, and LCID.

In [None]:
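def stock_data(start_date,end_date,ticker):
  """Returns a DataFrame of the daily closing price and volume of ticker between start_date and end_date."""
  # Daily historical prices from Yahoo Finance, indexed by date
  history = get_data(ticker, start_date=start_date, end_date=end_date, index_as_date = True, interval='1d')
  return history[['close','volume']]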

# PART 4 - Visualize

In [None]:
def generate_visualisations(ticker):
  # Getting info regarding the stock for the roughly 60-day window from 09/23/2021 to 11/23/2021 using the ticker
  stock_info = stock_data('09/23/2021', '11/23/2021', ticker)
  stock_info = stock_info.rename_axis('date').reset_index()

  # Plotting time-series for volume of stock
  fig = px.line(stock_info, x='date', y="volume", title= f"Volume of {ticker} stock")
  fig.update_xaxes(
      rangeslider_visible=True,
      rangeselector=dict(
          buttons=list([
              dict(count=7, label="7d", step="day", stepmode="backward"),
              dict(count=1, label="1m", step="month", stepmode="backward"),
              dict(step="all")
          ])
      )
  )

  fig.show()

  # Generating time-series for closing price of stock
  fig = px.line(stock_info, x='date', y="close", title= f"Closing price of {ticker} stock")
  fig.update_xaxes(
      rangeslider_visible=True,
      rangeselector=dict(
          buttons=list([
              dict(count=7, label="7d", step="day", stepmode="backward"),
              dict(count=1, label="1m", step="month", stepmode="backward"),
              dict(step="all")
          ])
      )
  )

  fig.show()

In [None]:
generate_visualisations('TSLA')

In [None]:
generate_visualisations('GM')

In [None]:
generate_visualisations('LCID')