<a href="https://colab.research.google.com/github/Zantorym/AIDI-1100-Project/blob/main/AIDI_1100_01_FINAL_PROJECT_GROUP_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AIDI1100 Final Project


---
**Group Number:** 2

**Group Members:**

Jaspreet Singh Marwah

Lawrence Wanderi Mwangi

Sherap Gyaltsen

Ayobami Banjoko

Oluwaseun Ogunnubi

Simrandeep Singh Rahi

**Course:** AIDI1100 - Introduction To AI Development

**Submission Date:** 30th October, 2021

---

**Strategy of Code Distribution:**

Part 1 (Scan/Parse): Jaspreet

Part 2 (Track/Store): Lawrence Wanderi Mwangi

Part 3 (Retrieve Data): Sherap Gyaltsen

Part 4 (Visualize): Ayobami Banjoko, Oluwaseun Ogunnubi, Simrandeep Singh Rahi



---

**Program Description:**

*Part 1 (Scan/Parse):* 

First, using the _get_urls() function, the program goes through the list of articles found on https://www.prnewswire.com/news-releases/news-releases-list and retrieves the raw HTML of each listing page. Using the BeautifulSoup module, it parses the HTML and collects the URLs of all the articles released back to a specified cutoff date and time. If necessary, the program keeps moving to the next page of the listing until the URLs of all the articles released since the cutoff have been obtained.

Then, using _get_articles(), for each URL obtained (where each URL corresponds to one article), the raw HTML code is extracted and parsed through using BeautifulSoup. The program finds all the text from the body of the articles and stores it in a list.

The scrape() function conveniently runs both functions for a specified number of days' worth of articles and returns a list of all the text from the articles. This allows this part of the code to act as a standalone module. More information regarding this can be found in the "Bonus work" section of this file.
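
As the _get_urls() docstring explains, a listing timestamp is either a full date ("Oct 28, 2021, 08:30 ET") or, for articles released today, only a time ("08:30 ET"). The following is a minimal, offline sketch of that fallback parsing and of stopping at the cutoff. The helper names `parse_release_time` and `collect_until` and the sample entries are made up for illustration and are not part of the project code.

```python
from datetime import datetime

FULL_FORMAT = '%b %d, %Y, %H:%M ET'   # e.g. "Oct 28, 2021, 08:30 ET"
TIME_ONLY_FORMAT = '%H:%M ET'         # e.g. "08:30 ET" (released today)


def parse_release_time(text, today):
    """Parse a listing timestamp; `today` is the date assumed for time-only stamps."""
    try:
        return datetime.strptime(text, FULL_FORMAT)
    except ValueError:
        time_only = datetime.strptime(text, TIME_ONLY_FORMAT).time()
        return datetime.combine(today, time_only)


def collect_until(entries, cutoff, today):
    """Return the URLs of entries (newest first) released at or after `cutoff`."""
    urls = []
    for stamp, url in entries:
        if parse_release_time(stamp, today) < cutoff:
            break  # Entries are newest first, so everything after this is older
        urls.append(url)
    return urls


# Made-up listing entries, newest first (hypothetical URLs)
sample_entries = [
    ('09:15 ET', '/news-releases/example-a.html'),
    ('Oct 29, 2021, 16:00 ET', '/news-releases/example-b.html'),
    ('Oct 20, 2021, 10:00 ET', '/news-releases/example-c.html'),
]
today = datetime(2021, 10, 30).date()
cutoff = datetime(2021, 10, 23, 12, 0)
recent_urls = collect_until(sample_entries, cutoff, today)
print(recent_urls)
```

Only the first two entries fall inside the window, so the loop stops at the third one without inspecting anything older.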

*Part 2 (Track/Store):*

All the text from the articles is merged into one large string. This string is then searched with a regex pattern to find all the stock symbols within the collected articles.
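
To illustrate the search, the sketch below applies the same pattern used in tickerCapture() to a short sample sentence. The sentence is invented for this example; only the pattern comes from the project.

```python
import re

# Same pattern as in tickerCapture()
TICKER_PATTERN = r'\b\(?(?P<exchange>[A-Z\/]+):\s?(?P<ticker>[A-Z]+)(?:\)|;)'

# Invented sample text containing both formats the pattern is meant to catch
sample_text = (
    'Example Corp (NYSE: HMLP) and Other plc (NYSE/LSE: CCL; NYSE:CUK) '
    'announced results today.'
)

matches = re.findall(TICKER_PATTERN, sample_text)  # List of (exchange, ticker) tuples
unique_pairs = sorted(set(matches))                # Duplicates removed, stable order
print(matches)
```

Note that the pattern accepts any run of capital letters before the colon, so text such as "(NOTE: SEE)" would also match. Filtering the results against a list of known exchange names is one possible safeguard, although the project does not do this.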

*Part 3 (Retrieve Data):*

3 of the stock symbols that were collected are chosen for analysis, namely TSLA, GM and LCID. Given a start date, end date, and stock symbol, the stock_data() function will provide information regarding the closing price and volume of the specified stock for each day within the range of dates provided. This function does this by calling the Yahoo Finance API with the help of the yahoo_fin package.
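
When no dates are given, stock_data() falls back to a 60-day window ending yesterday, and it swaps the two dates if they arrive in the wrong order. The sketch below isolates that date handling so it can be tried without calling the API. `resolve_window` is a hypothetical helper written for this example, and it uses `None` rather than `0` as the "not provided" marker.

```python
from datetime import datetime, timedelta


def resolve_window(start_date=None, end_date=None, today=None, days=60):
    """Return (start, end) datetimes; hypothetical helper, not project code."""
    today = today or datetime.today()
    if start_date is None or end_date is None:
        end_date = today - timedelta(days=1)          # Yesterday
        start_date = end_date - timedelta(days=days)  # `days` days before that
    if start_date > end_date:                         # Arguments were mixed up
        start_date, end_date = end_date, start_date
    return start_date, end_date


fixed_today = datetime(2021, 11, 24)  # Fixed so the example is reproducible
default_window = resolve_window(today=fixed_today)
swapped_window = resolve_window(datetime(2021, 11, 23), datetime(2021, 9, 23))
print(default_window)
print(swapped_window)
```

Comparing datetime objects rather than date strings keeps the swap check chronological, which matters when the two dates fall in different years.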

*Part 4 (Visualize):*

generate_visualisations() uses the plotly library to generate interactive time-series plots of the volume and closing price of a given stock, by default over a 60-day period. When the function is called with a stock symbol, it returns the two plots, which can then be annotated and displayed.

---

**Bonus work:**



*   GitHub was used to collaborate on the project. GitHub link: https://github.com/Zantorym/AIDI-1100-Project
*   The code for part 1 can be used as a module. When imported, the scrape() function will scrape all the PRNewswire articles published within a specified number of days of the current date. This module is available on the GitHub repository as "prnewswire_scraper.ipynb".
*   The datasets obtained from part 1 (the text from all the articles) and part 2 (the list of stock symbols) were pickled and uploaded to our GitHub repository for convenience as well as to ensure that all group members were working on the same dataset.
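
The pickling step mentioned in the last bullet can be sketched as follows. The file name `dataset.pkl`, the dictionary layout, and the sample values are assumptions made for this example; the files in the repository may be organised differently.

```python
import os
import pickle
import tempfile

# Made-up stand-ins for the scraped articles and the extracted tickers
articles = [['First paragraph.', 'Second paragraph.'], ['Another article.']]
tickers = [('NYSE', 'GM'), ('NASDAQ', 'TSLA')]

# Hypothetical file name, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'dataset.pkl')

# Saving both datasets in one file
with open(path, 'wb') as f:
    pickle.dump({'articles': articles, 'tickers': tickers}, f)

# Loading them back, as another group member would
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored['tickers'])
```

Because pickle can execute code while loading, files like this should only be loaded from a trusted source such as the group's own repository.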

# PART 1 - Scan/Parse

In [1]:
"""Scraping PRNewswire articles

This part of the program scrapes all the PRNewswire articles released within a 
specified number of days from the current date and time.

This part of the program is available on our GitHub repository as an individual 
script that can be imported as a module and contains the following functions:
    * _get_urls - returns the urls for all the articles released up till a 
                  certain date
    * _get_articles - returns the text of the body of the articles corresponding
                      to each url
    * scrape - the main function of the script that runs everything

Functions that start with an underscore are 'private' and are not meant to be 
called when the module version of this code is imported. These functions will 
not be copied over when the module is imported using the line 'from prnewswire_scraper import *'.
However, they will be copied over if the module is imported using the line
'import prnewswire_scraper' but this is unavoidable as there is no real way to
maintain 'private' functions of modules in python.


"""

# Importing modules
from datetime import date
from datetime import datetime
from datetime import timedelta
import requests
from bs4 import BeautifulSoup
from pytz import timezone
import re

!pip install yahoo_fin
from yahoo_fin.stock_info import get_data

import plotly.express as px
import matplotlib.pyplot as plt


In [None]:
def _get_urls(end_date):
  """
  Gets urls for all the articles released on PRNewswire from the time this
  function is called till a specified end date.

  Args:
      end_date (datetime object): The oldest allowed date and time. If an article 
                                  was published before this date and time, it 
                                  will not be scraped.

  Returns:
      urls (list): A list of URLs corresponding to each article released within
                  the specified time period

  Detailed description:
  We start by visiting the first page of the news release list on PRNewswire.
  When we visit the page, in order to avoid errors, we confirm if the visit was 
  successful by checking if we got a response status code of 200 (what is 
  conventionally the standard response for a successful HTTP request). If the
  response status code is not 200, the program will keep trying to visit the
  website until it succeeds. 
  
  Once a successful visit has been established, using BeautifulSoup we parse 
  through the raw HTML code for the website. When looking at the raw html code
  manually, a pattern was observed. Each entry for an article is contained within
  an anchor tag of the class 'newsreleaseconsolidatelink display-outline'. So,
  our first step is to get all these anchor tags and store them in a list.

  Then, for each anchor tag, we find the date and time that article was released.
  Within each anchor tag, the date and time is enclosed within a 'small' tag.
  Moreover, it is always the only thing enclosed within a 'small' tag. So, to
  get the date and time of the release of the article, we extract the text
  enclosed within the 'small' tag. We then convert it into a date. The date
  itself can be represented in two ways, depending on when the article was
  released. If the article was released today, it would only have the hour and
  minute it was released in a 24-hour format. If it was released any other day,
  it would have the full date and time. The code accounts for both possibilities
  via a try-except statement. The date is then converted into a datetime variable.
  If the date is older than our end date, we know that we don't need to explore
  any more articles and we stop the process. Otherwise, we get the link to the
  article and store it in a list. If all the articles from the page have been
  covered and we still have not reached our end date, we move on to the next
  page.

  """
  urls = [] # List of URLs to visit

  website = "https://www.prnewswire.com/news-releases/news-releases-list/?page=" # Website we need to scrape from
  page_num = 1 # Page number of the website we need to scrape from

  end_date_reached = False # A boolean to keep track of whether we've collected all the necessary articles

  while not end_date_reached:
    current_site = website + str(page_num) + "&pagesize=100" # Link to visit with page number, set number of articles per page to 100 so that we don't need to visit as many pages
    response = requests.get(current_site)

    if response.status_code == 200: # 200 is the standard response for a successful HTTP request. This condition ensures that we were successful in retrieving the HTML code for the website.
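      soup = BeautifulSoup(response.content, 'html.parser') # Converting the plain text html code of the website into a BeautifulSoup object for easy parsing; the parser is named explicitly so that results do not depend on which parsers are installed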
      anchors = soup.find_all('a', {'class': 'newsreleaseconsolidatelink display-outline', 'href': True}) # Getting all the anchors for news articles within the webpage

      for anchor in anchors:
        date = anchor.find('small').get_text()
        try: 
          date = datetime.strptime(date, '%b %d, %Y, %H:%M ET') # Convert to datetime
        except ValueError: # If the conversion fails, that is because the article was released today and the time is written as "HH:MM ET" instead of "Mon DD, YYYY, HH:MM ET"
          released_at = datetime.strptime(date, '%H:%M ET').time() # Time the article was released
          today = datetime.now(timezone('US/Eastern')).date() # Today's date in Eastern Time; the timezone is needed because Google Colab operates on UTC, while PRNewswire uses ET
          date = datetime.combine(today, released_at) # Date and time combined
        
        
        if (date < end_date):
          end_date_reached = True
          break
        else:
          href = "https://www.prnewswire.com" + anchor['href'] # Retrieving href for the article and converting it to visitable link
          urls.append(href) # Adding to list of urls to visit
      
      page_num += 1
    


  return urls

In [None]:
def _get_articles(urls):
  """
  Gets and returns articles from a list of URLs

  Args:
      urls (list): A list of URLs, each corresponding to an article

  Returns:
      articles (list): A list of lists. Each list corresponds to one article
                      and contains strings that represent the text within 
                      that article.

  Detailed description:
  We visit each url from the list of URLs provided as input. We first check for
  a response status code of 200 to ensure the visit was successful. Then we feed
  the HTML code we retrieve from the website associated with the URL into
  BeautifulSoup for easy parsing. 
  
  When observing the HTML codes for articles on PRNewswire, it was observed that 
  any content within the body of the article was contained within div containers 
  of class 'col-sm-10 col-sm-offset-1'. Furthermore, within these div containers, 
  all the text was stored within paragraph (p) tags.

  So, for each article, we extract all the p tags within all the div tags of the
  class 'col-sm-10 col-sm-offset-1' and add them to a list as strings. Once we
  have successfully extracted all the p tags, we add this list to the 'articles' 
  list which stores all the articles.
  """
  articles = []

  for url in urls:
    response = requests.get(url)

    if response.status_code == 200:
      soup = BeautifulSoup(response.content, 'html.parser')

      divs = soup.find_all('div', class_ = 'col-sm-10 col-sm-offset-1') # Finds all div containers of the class that's meant for the body of the webpage

      # Getting all the text out of the divs collected above
      article = [] # For storing all the text within this article
      for div in divs:
        p_tags = div.find_all('p') # All the p tags in the current div, since all the text in the body of the PRNewswire articles is always stored within p tags
        for p in p_tags:
          article.append(p.get_text()) # Getting the plain text within the p tag
      articles.append(article)

  return articles

In [None]:
def scrape(NUM_DAYS):
  """
  Finds all the articles published on PRNewswire within a given number of days 
  from the time this function is called.

  Args:
      NUM_DAYS (int): The number of days of articles to scrape, starting from
                      the current day.

  Returns:
      articles (list): A list of lists. Each list corresponds to one article
                      and contains strings that represent the text within 
                      that article.

  Detailed description:
  Calculates the end date and time based on the current date and time as well as
  the number of days worth of articles to scrape. Then gets urls for all the
  articles published within the time period. Then visits all the urls to get the
  contents of the articles. Finally, returns the articles.
  """
  
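  end_date = datetime.now(timezone('US/Eastern')).replace(tzinfo=None) - timedelta(days=NUM_DAYS) # Date and time NUM_DAYS days ago, in Eastern Time so that it matches the article timestamps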
  urls = _get_urls(end_date)
  articles = _get_articles(urls)

  return articles

In [None]:
dataset = scrape(7)

# PART 2 - Track/Store/Search

In [None]:
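# Each string within the list of articles is merged into one large string
# A space is placed between articles so that the last word of one article does not run into the first word of the next
corpus = ' '.join(' '.join(str(e) for e in x) for x in dataset) # All the articles in one large string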

In [None]:
def tickerCapture(corpus:str) -> list:
    """
    Applies a regex to a large string to find all matching instances of stock
    symbols.

    Args:
        corpus (str): String that contains exchange and ticker info

    Returns:
        result (list): [(<exchange>, <ticker>)] list of (exchange, ticker) tuples

    Detailed Description:
    The stock symbols are always presented in the articles in a certain format.
    This function parses through a string and stores all instances that match 
    that format into a list.
    """
    # Expected format examples: (NYSE: HMLP) and (NYSE/LSE: CCL; NYSE:CUK)
    # Example return: [('NYSE', 'HMLP'), ('NYSE/LSE', 'CCL'), ('NYSE', 'CUK')]

    regexpattern = r'\b\(?(?P<exchange>[A-Z\/]+):\s?(?P<ticker>[A-Z]+)(?:\)|;)'

    result = re.findall(regexpattern, corpus)
    
    return result

In [None]:
tickers = tickerCapture(corpus)
tickers = list(set(tickers)) # Removing duplicates

# PART 3 - Retrieve Data (Web (API))

From the list of tickers that we retrieved, we will select the following 3 that belong to the automobile industry: TSLA, GM, and LCID.

In [2]:
import pandas as pd # Used to parse date strings

def stock_data(ticker, end_date = 0, start_date = 0):
  """
  Gets daily historical data for a stock, by default over the past 60 days. Uses
  the Yahoo Finance API to get the historical data.

  Args:
      ticker (str): The stock symbol
      end_date (datetime object or date string): Default value is 0. The date to
                                                 end searching at
      start_date (datetime object or date string): Default value is 0. The date
                                                   to start searching at

  Returns:
      resultant (pandas dataframe): The stock close price and volume over the
                                    specified time period
  """

  # If either date isn't entered, we default to the 60 days ending yesterday. Otherwise, we go with the time period specified.
  if end_date == 0 or start_date == 0:
    end_date = datetime.today() - timedelta(days=1) # Yesterday
    start_date = end_date - timedelta(days=60) # 60 days before the end date

  # Parsing both dates so that date strings such as '09/23/2021' are compared chronologically rather than alphabetically
  start_date, end_date = pd.to_datetime(start_date), pd.to_datetime(end_date)

  # If the start date is later than the end date (which may happen if the user mixes them up), we swap them
  if start_date > end_date:
    start_date, end_date = end_date, start_date
  
  k = get_data(ticker, start_date=start_date, end_date=end_date, index_as_date = True, interval='1d') # Getting the historical stock data via yahoo finance api
  resultant = k[['close','volume']] # Selecting only the stock close price and volume
  return resultant

# PART 4 - Visualize

In [3]:
def generate_visualisations(ticker, end_date = 0, start_date = 0):
  """
  Generates time-series plots for stock closing price and volume, provided the
  stock symbol. Covers the past 60 days unless both dates are given.

  Args:
      ticker (str): A string representing a stock symbol
      end_date (datetime object or date string): Default value is 0. The date to
                                                 end searching at
      start_date (datetime object or date string): Default value is 0. The date
                                                   to start searching at
  
  Returns:
      fig1 (plotly plot object): Time-series plot for volume of stock
      fig2 (plotly plot object): Time-series plot for closing price of stock

  Detailed Description:
  The function first gets the data of the stock over the requested period. It
  then uses plotly to generate 2 time-series plots of the data.
  """
  # Getting info regarding the stock for the requested period using the ticker
  stock_info = stock_data(ticker, end_date, start_date)
  stock_info = stock_info.rename_axis('date').reset_index() # Info regarding stock

  # Plotting time-series for volume of stock
  fig1 = px.line(stock_info, x='date', y="volume", title= f"Volume of {ticker} stock") # Plotting
  fig1.update_xaxes( # Adding the ability to conveniently narrow down or expand the window
      rangeslider_visible=True,
      rangeselector=dict(
          buttons=list([
              dict(count=7, label="7d", step="day", stepmode="backward"),
              dict(count=1, label="1m", step="month", stepmode="backward"),
              dict(step="all")
          ])
      )
  )

  # Generating time-series for closing price of stock
  fig2 = px.line(stock_info, x='date', y="close", title= f"Closing price of {ticker} stock") # Plotting
  fig2.update_xaxes( # Adding the ability to conveniently narrow down or expand the window
      rangeslider_visible=True,
      rangeselector=dict(
          buttons=list([
              dict(count=7, label="7d", step="day", stepmode="backward"),
              dict(count=1, label="1m", step="month", stepmode="backward"),
              dict(step="all")
          ])
      )
  )

  return fig1, fig2  

In [4]:
# Generating visualisations for Tesla
vol_viz, price_viz = generate_visualisations('TSLA', start_date='09/23/2021', end_date='11/23/2021')

# Adding annotations to the volume visualisation
vol_viz.add_annotation(x=date(2021, 10, 4), y=30483340,
            text="Better-than-expected third-quarter",
            showarrow=True,
            arrowhead=1)

vol_viz.add_annotation(x=date(2021, 10, 25), y=62852100,
            text="Hertz announces plan to buy <br> 100k Tesla vehicles",
            showarrow=True,
            arrowhead=1)

vol_viz.add_annotation(x=date(2021, 11, 1), y=56048720,
            text="Tesla announces program to let 3rd party <br> vehicles use their charging stations",
            showarrow=True,
            arrowhead=1)

vol_viz.add_annotation(x=date(2021, 11, 9), y=59105840,
            text="Elon Musk proposes selling <br> 10% of his Tesla stock",
            showarrow=True,
            arrowhead=1)

# Adding annotations to the closing price visualisation
price_viz.add_annotation(x=date(2021, 10, 4), y=781.53,
            text="Better-than-expected third-quarter",
            showarrow=True,
            arrowhead=1)

price_viz.add_annotation(x=date(2021, 10, 25), y=1024.86,
            text="Hertz announces plan to buy <br> 100k Tesla vehicles",
            showarrow=True,
            arrowhead=1)

price_viz.add_annotation(x=date(2021, 11, 1), y=1208.59,
            text="Tesla announces program to let 3rd party <br> vehicles use their charging stations",
            showarrow=True,
            arrowhead=1)

price_viz.add_annotation(x=date(2021, 11, 9), y=1023.5,
            text="Elon Musk proposes selling <br> 10% of his Tesla stock",
            showarrow=True,
            arrowhead=1)

vol_viz.show() # Displaying the volume visualisation
price_viz.show() # Displaying the price visualisation

In [19]:
# Generating visualisations for General Motors
vol_viz, price_viz = generate_visualisations('GM', start_date='09/23/2021', end_date='11/23/2021')

# Adding annotations to the volume visualisation
vol_viz.add_annotation(x=date(2021, 10, 4), y=29206880,
            text="Launch of GM cruise <br> autonomous vehicles",
            showarrow=True,
            arrowhead=1)

vol_viz.add_annotation(x=date(2021, 10, 8), y=33724940,
            text="Expansion of OnStar Insurance",
            showarrow=True,
            arrowhead=1)

vol_viz.add_annotation(x=date(2021, 10, 27), y=36397560,
            text="CEO announces intention to introduce <br> 30 new EV models by 2025",
            showarrow=True,
            arrowhead=1)

vol_viz.add_annotation(x=date(2021, 11, 4), y=24026100,
            text="Production of Hummer EV on schedule",
            showarrow=True,
            arrowhead=1)

vol_viz.add_annotation(x=date(2021, 11, 12), y=31152370,
            text="GM introducing new EV <br> models for 2023",
            showarrow=True,
            arrowhead=1)

vol_viz.add_annotation(x=date(2021, 11, 17), y=29983380,
            text="GM files to trademark <br> new audiovisual <br> application, Dashpaper",
            showarrow=True,
            arrowhead=1)

# Adding annotations to the closing price visualisation
price_viz.add_annotation(x=date(2021, 10, 4), y=53.98,
            text="Launch of GM cruise <br> autonomous vehicles",
            showarrow=True,
            arrowhead=1)

price_viz.add_annotation(x=date(2021, 10, 8), y=58.57,
            text="Expansion of OnStar Insurance",
            showarrow=True,
            arrowhead=1)

price_viz.add_annotation(x=date(2021, 10, 26), y=57.53,
            text="CEO announces intention to introduce <br> 30 new EV models by 2025",
            showarrow=True,
            arrowhead=1)

price_viz.add_annotation(x=date(2021, 11, 4), y=58.64,
            text="Production of Hummer EV on schedule",
            showarrow=True,
            arrowhead=1)

price_viz.add_annotation(x=date(2021, 11, 12), y=63.4,
            text="GM introducing new EV <br> models for 2023",
            showarrow=True,
            arrowhead=1)

price_viz.add_annotation(x=date(2021, 11, 17), y=64.61,
            text="GM files to trademark <br> new audiovisual <br> application, Dashpaper",
            showarrow=True,
            arrowhead=1)

vol_viz.show() # Displaying the volume visualisation
price_viz.show() # Displaying the price visualisation

In [27]:
# Generating Visualisation for Lucid Motors
vol_viz, price_viz = generate_visualisations('LCID', start_date='09/23/2021', end_date='11/23/2021')

# Adding annotations to the volume visualisation
vol_viz.add_annotation(x=date(2021, 9, 29), y=120691800,
            text="Production of first luxury car announced",
            showarrow=True,
            arrowhead=1)

vol_viz.add_annotation(x=date(2021, 10, 28), y=377220900,
            text="Launch of deliveries for first luxury car",
            showarrow=True,
            arrowhead=1)

vol_viz.add_annotation(x=date(2021, 11, 8), y=154195400,
            text="Reviews of car from major sources released",
            showarrow=True,
            arrowhead=1)

vol_viz.add_annotation(x=date(2021, 11, 16), y=248654600,
            text="Confirmed plans to build 20k vehicles next year",
            showarrow=True,
            arrowhead=1)

# Adding annotations to the closing price visualisation
price_viz.add_annotation(x=date(2021, 9, 29), y=26.28,
            text="Production of first luxury car announced",
            showarrow=True,
            arrowhead=1)

price_viz.add_annotation(x=date(2021, 10, 28), y=35.48,
            text="Launch of deliveries for first luxury car",
            showarrow=True,
            arrowhead=1)

price_viz.add_annotation(x=date(2021, 11, 8), y=45.92,
            text="Reviews of car from major sources released",
            showarrow=True,
            arrowhead=1)

price_viz.add_annotation(x=date(2021, 11, 16), y=55.52,
            text="Confirmed plans to build 20k vehicles next year",
            showarrow=True,
            arrowhead=1)

vol_viz.show() # Displaying the volume visualisation
price_viz.show() # Displaying the price visualisation

## Analysis of visualisations

All 3 stocks appear to be worth buying. For each stock, we notice that trading volume generally increases only when the price of the stock increases. Rising volume tends to indicate that the current price trend will continue, and the closing price of all 3 stocks has generally risen over the 60-day period, so we conclude that all of them are worth buying.
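
One simple way to check the volume and price observation numerically is to count, among the days on which volume rose, how often the closing price also rose. The sketch below does this on invented numbers. `up_day_agreement` is a hypothetical helper that is not part of the project, and the toy values are not real market data.

```python
def up_day_agreement(closes, volumes):
    """Fraction of volume-increase days on which the close also rose."""
    volume_up_days = 0
    both_up_days = 0
    for i in range(1, len(closes)):
        if volumes[i] > volumes[i - 1]:
            volume_up_days += 1
            if closes[i] > closes[i - 1]:
                both_up_days += 1
    return both_up_days / volume_up_days if volume_up_days else 0.0


# Made-up numbers for illustration only, not real market data
toy_closes = [10.0, 10.5, 10.4, 11.0, 11.2, 11.1]
toy_volumes = [100, 150, 120, 200, 210, 260]
agreement = up_day_agreement(toy_closes, toy_volumes)
print(agreement)
```

With the dataframe returned by stock_data(), the same helper could be called as `up_day_agreement(list(df['close']), list(df['volume']))`. A value close to 1 would support the observation, while a value near 0.5 would suggest that volume and price move independently.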