# Attempting to scrape info on upcoming earnings from earningswhispers.com

The endpoint of interest is https://www.earningswhispers.com/calendar, which is queried with a GET request. Each request/response covers one day's worth of earnings. The query accepts a few parameters:


```
"sb" - Sort by. Four options:
  "p" - Popularity
  "t" - Time
  "c" - Name
  "s" - Sales
"d" - Offset in days (including weekends) from today of the day to target, i.e. 0 targets today. NOTE: EW determines "today" by EST
"t" - Seems to be a switch for the "All Earnings" vs "Earnings Whispers" option. We generally set it to "all"
"v" - View. Either "s" or "t":
  "s" - Standard view
  "t" - List view. We default to list view in requests; it makes no difference to the information returned
```

The results for the given day are listed in an HTML `ul` tag with id="epscalendar". Only a certain number (TODO: Determine cutoff) of results are returned here. When browsing manually, the "Show More" button invokes a JavaScript function which sends another request to [.../morecalendar](https://www.earningswhispers.com/morecalendar?sb=p&d=0&t=all&v=t) and appends the results to the existing page. We simply make two GET requests ourselves, one to each endpoint.
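As a rough sketch of how the two request URLs fit together, the snippet below builds the query string for both endpoints from the parameters described above. The helper name `build_calendar_url` and its defaults are illustrative assumptions, not anything defined by the site.

```python
from urllib.parse import urlencode

EW_BASE = "https://www.earningswhispers.com/"
VALID_SORTS = {"p", "t", "c", "s"}  # popularity, time, name, sales


def build_calendar_url(sort_by="p", day_offset=0, endpoint="calendar"):
    """Build the URL for "calendar" or "morecalendar" (hypothetical helper)."""
    if sort_by not in VALID_SORTS:
        raise ValueError("sort_by must be one of p, t, c, s")
    # "t" = "all" and "v" = "t" (list view) mirror the defaults used in this notebook
    query = urlencode({"sb": sort_by, "d": day_offset, "t": "all", "v": "t"})
    return "{}{}?{}".format(EW_BASE, endpoint, query)


# One URL per request: the initial page, then the "Show More" results
urls = [build_calendar_url("p", 0, ep) for ep in ("calendar", "morecalendar")]
```

Each URL can then be fetched with `requests.get`. Passing the parameters through `params=` as done later in this notebook is equivalent.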

I have yet to determine the maximum possible value of "d". At the time of writing (11/07/2022), 87 is the highest value being accepted. When browsing the site normally, only the current month and next month are visible and clickable, but this parameter seems to reach closer to three months ahead. More testing to do.
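Since EW determines "today" in Eastern time and the upper bound on "d" is still uncertain, a helper along the following lines could compute the offset. This is only a sketch: the names `eastern_today`, `offset_for`, and `MAX_OFFSET_OBSERVED` are made up for illustration, the `America/New_York` zone is an assumption about what "EST" means on the site, and the cap of 87 is just the value observed on 11/07/2022.

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo  # Python 3.9+

MAX_OFFSET_OBSERVED = 87  # provisional: highest "d" seen accepted on 11/07/2022


def eastern_today():
    """Current calendar date in US Eastern time (assumed to match EW's "today")."""
    return datetime.now(ZoneInfo("America/New_York")).date()


def offset_for(target, today=None):
    """Return the "d" offset for a target date, relative to `today`."""
    if today is None:
        today = eastern_today()
    offset = (target - today).days
    if offset < 0:
        raise ValueError("past dates are not supported")
    if offset > MAX_OFFSET_OBSERVED:
        raise ValueError("offset exceeds the largest value observed to work")
    return offset


# Deterministic example with an explicit reference date
example_offset = offset_for(date(2022, 11, 10), today=date(2022, 11, 7))
```

Passing `today` explicitly keeps the function easy to test. Leaving it out falls back to the Eastern-time date.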




# What info is returned
As mentioned, results are listed in an HTML `ul` element. This list contains one list item (`li` tag) per result. Note that the first list item has id="calhead" and acts as the header/column labels.

Result list (`ul`) definition and first item (headers):

https://imgur.com/a/ON1wBMZ

Subsequent list items, one result per reporting company, contain information regarding the earnings report and respective estimates. There are two possible classes that a result can fall into: already reported, or yet to report. The structure of the list items for these two classes of results varies slightly.
The structure of an `li` for a ticker which has not yet reported as of time of query:

https://imgur.com/a/dHuZPp4

The structure of an `li` for a ticker which has already reported as of time of query:


https://imgur.com/a/Qyp1OGl

Note the additional fields in results for companies which have already filed. Those which have not yet filed contain a div object of class "time", while those which have filed contain various div objects with classes containing "act" and "actual". 
The two possible list item structures will be handled appropriately, and the existence of a "time" div will be the deciding factor on how to treat any given list item. 
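To illustrate the "time" div check in isolation, here is a small sketch that uses only the standard library's `html.parser` rather than BeautifulSoup (which the real code below relies on). The class `DivFieldCollector`, the function `classify_list_item`, and both sample fragments are hypothetical. The sample markup is a simplified guess at the structure, not a copy of the live page.

```python
from html.parser import HTMLParser


class DivFieldCollector(HTMLParser):
    """Records which div classes appear and the text inside each div."""

    def __init__(self):
        super().__init__()
        self._stack = []   # first class name of each currently open <div>
        self.seen = set()  # every class name seen on any <div>
        self.fields = {}   # first class name -> text content

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            self.seen.update(classes)
            self._stack.append(classes[0] if classes else None)

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack and self._stack[-1]:
            key = self._stack[-1]
            self.fields[key] = (self.fields.get(key, "") + " " + text).strip()


def classify_list_item(li_html):
    """Return (status, fields); a "time" div means the company has yet to report."""
    collector = DivFieldCollector()
    collector.feed(li_html)
    collector.close()
    status = "yet_to_report" if "time" in collector.seen else "already_reported"
    return status, collector.fields


# Made-up fragments for illustration only
PENDING = ('<li class="cors amc"><div id="T-ABC"><div class="company">Abc Corp</div>'
           '<div class="ticker">ABC</div><div class="time">4:05 PM ET</div>'
           '<div class="estimate">$0.10</div></div></li>')
REPORTED = ('<li class="cors bmo"><div id="T-XYZ"><div class="company">Xyz Inc</div>'
            '<div class="ticker">XYZ</div><div class="actestimate">$1.00</div>'
            '<div class="actual green">$1.10</div></div></li>')

pending_status, pending_fields = classify_list_item(PENDING)
reported_status, reported_fields = classify_list_item(REPORTED)
```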

# Yet-to-report companies
The fields of interest from a list item of a company that has not yet reported (determined by existence of a "time" div):


```
div class="company"...> Company name </...
div class="ticker"...> Company ticker </...
div class="time"...> Time of earnings release </...
div class="estimate"...> EPS estimate </...
div class="revest"...> Revenue estimate </...
div class="revgrowthprint"...> Expected revenue growth % </... # TODO: Better understanding of rev and eps growth metrics. This field is not present in morecalendar but the information is available. More JS learning to do
```

These fields exist within a `div` tag within the `li` tag per company. Some information on the earnings conference call time and how to access it seems to be given in a sibling `div` that groups "options" relevant to the company (add to watchlist, etc., including info on the conference call). We may want to scrape this information too after some more investigation into its structure.


# Already-reported companies
In addition to company and ticker, as in yet-to-report companies, those which have already reported contain the following fields of interest in the main `div` of their list items:


```
div class="actestimate"...> EPS estimate </...
div class="actrevest"...> Revenue estimate </...
div class="actual red/green"...> Actual EPS. "red" on miss, "green" on beat </...
div class="revactual red/green"...> Actual revenue (color according to miss/beat) </...
div class="revsurpfull red/green"...> % by which revenue estimate was missed/beat </...
div class="revgrowthfull red/green"...> %, unsure. Seems to represent either change in revenue or revenue estimates from last report. TODO: More poking </...
div class="epssurpfull red/green"...> % by which EPS estimate was missed/beat </...
div class="epsgrowthfull red/green"...> %, unsure. Seems to represent either change in EPS or EPS estimates from last report. TODO: More poking </...
div class="guidance pos/neg/neut"...> *OPTIONAL* If available, indicative of forward guidance sentiment given by reporting company </...
```
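The color and guidance suffixes in the class attributes above can be turned into explicit labels. The sketch below assumes the class attribute has already been split into a list of names (as BeautifulSoup does for multi-valued attributes). The function names `outcome_from_classes` and `guidance_from_classes` are invented for this example.

```python
GUIDANCE_VALUES = ("pos", "neg", "neut")


def outcome_from_classes(classes):
    """Map a class list such as ["actual", "green"] to "beat", "miss", or None."""
    tokens = {c.lower() for c in classes}
    if "green" in tokens:
        return "beat"
    if "red" in tokens:
        return "miss"
    return None  # neither color present


def guidance_from_classes(classes):
    """Return "pos", "neg", or "neut" from a guidance div's class list, else None."""
    for name in classes:
        if name.lower() in GUIDANCE_VALUES:
            return name.lower()
    return None


eps_outcome = outcome_from_classes(["actual", "green"])
rev_outcome = outcome_from_classes(["revactual", "red"])
guidance = guidance_from_classes(["guidance", "neut"])
```

Looking for a known value instead of taking the last class name is slightly more defensive if the site ever adds extra classes to these divs.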




# Consensus estimates vs. "whispers"
EarningsWhispers offers a proprietary "whisper" which they argue is more meaningful than the consensus estimate. This whisper value is not given directly on the calendar page. Instead it is found at [.../stocks/ticker](https://www.earningswhispers.com/stocks/vrm) if the company has not yet reported at the time of the request, or at [.../epsdetails/ticker](https://www.earningswhispers.com/epsdetails/vrm) if it has. We can rely on the existence of a "time" field to mean the result of the report is not yet available, in which case we should reference .../stocks/... If "time" does not exist, reference .../epsdetails/... as the report's result is available.

These two endpoints are useful in their own right for quickly obtaining results of the last earnings report of any company via ticker. 
TODO: Write a few more methods to pull relevant information from these pages
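The endpoint choice described above could be captured in a small helper. This is a sketch only: `whisper_page_url` is a hypothetical name, and lowercasing the ticker is an assumption based on the example URLs rather than a confirmed requirement.

```python
EW_BASE = "https://www.earningswhispers.com/"


def whisper_page_url(ticker, has_time_field):
    """Pick the page holding the whisper number for a ticker.

    has_time_field: True when the calendar entry contains a "time" div,
    i.e. the company has not yet reported.
    """
    page = "stocks" if has_time_field else "epsdetails"
    # Assumption: tickers appear lowercase in the path, as in the example links
    return "{}{}/{}".format(EW_BASE, page, ticker.strip().lower())


pending_url = whisper_page_url("VRM", has_time_field=True)
reported_url = whisper_page_url("VRM", has_time_field=False)
```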

In [None]:
import requests
import re
from datetime import date
from bs4 import BeautifulSoup

In [None]:
# Quick method to convert a target date into an offset to feed parameter "d" in request
def date_to_offset(target_date):
  target_offset = 0

  if not isinstance(target_date, date):
    print("Error: a datetime.date object must be passed to date_to_offset")
    print("Returning an offset of 0")
    return target_offset

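  # Grab today's date
  # NOTE: date.today() uses the local timezone, while EW determines "today" by EST,
  # so the offset can be off by one day around midnight
  todays_date = date.today()

  # I have yet to discover how (if it is possible at all) to request a past date from earningswhispers.
  # Thus for now, make sure the target date is not before today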
  if todays_date > target_date:
    print("Cannot retrieve EarningsWhispers data for past dates. Returning offset of 0 days\nTodays date: {}\nEntered target date: {}".format(todays_date, target_date))
    return target_offset

  # Subtract the date objects
  offset_timedelta = target_date - todays_date
  target_offset = offset_timedelta.days

  return target_offset

In [None]:
def parse_yet_to_report(list_item):
  """
  Parse a list item determined to be a "yet-to-report" company
  Returns a dict of structure:
  {
    "b_o_a" : "b" or "a", depending on whether company reports before market open or after market close. Blank string if unable to tell
    "company" : "Company name",
    "ticker" : "Ticker",
    "time" : "Reporting time",
    "eps_est" : "EPS estimate",
    "rev_est" : "Revenue estimate",
    TODO: EPS and rev growth estimates, earnings call info
  }
  """
  ret_dict = {} # To be returned
  
  # Step into the main div 
  main_div = list_item.find("div", { "id" : re.compile("^T-")})
  if not main_div:
    print("parse_yet_to_report could not find a <div> of id 'T-' in list_item: {}".format(list_item))
    return ret_dict

  # Check if reporting before open or after close. li class="cor(s) bmo/amc showconf nwh"..., with bmo/amc meaning before open or after close
  # HTML 4/5 define class as a multi-valued attribute, so bs4 returns it as a list of individual class names
  ret_dict["b_o_a"] = ""
  for class_val in list_item.get("class", []):
    if class_val.lower() == "bmo":
      ret_dict["b_o_a"] = "b"
    elif class_val.lower() == "amc":
      ret_dict["b_o_a"] = "a"

  # Grab our other fields of interest
  ret_dict["company"] = main_div.find("div", { "class" : "company" }).text.strip()
  ret_dict["ticker"] = main_div.find("div", { "class" : "ticker" }).text.strip()
  ret_dict["time"] = main_div.find("div", { "class" : "time" }).text.strip()
  ret_dict["eps_est"] = main_div.find("div", { "class" : "estimate" }).text.strip()
  ret_dict["rev_est"] = main_div.find("div", { "class" : "revest" }).text.strip()

  # TODO: EPS and revenue growth estimates

  # TODO: Earnings conference call info

  return ret_dict

In [None]:
def parse_already_reported(list_item):
  """
  Parse a list item determined to be an already-reported company
  Returns a dict of structure:
  {
    "b_o_a" : "b" or "a", depending on whether company reports before market open or after market close. Blank string if unable to tell
    "company" : "Company name",
    "ticker" : "Ticker",
    "eps_est" : "EPS estimate",
    "eps_act" : "Actual EPS",
    "eps_surp" : "Percentage of EPS miss/beat",
    "rev_est" : "Revenue estimate",
    "rev_act" : "Actual revenue",
    "rev_surp" : "Percentage of revenue miss/beat",
    "guidance" : "pos/neg/neut" (only present if a guidance div exists),
    TODO: EPS and rev growth fields, earnings call info
  }
  """
  ret_dict = {} # To be returned

  # Step into the main div
  main_div = list_item.find("div", { "id" : re.compile("^T-")})
  if not main_div:
    print("parse_already_reported could not find a <div> of id 'T-' in list_item: {}".format(list_item))
    return ret_dict

  # Check if reporting before open or after close. li class="cor(s) bmo/amc showconf nwh"..., with bmo/amc meaning before open or after close
  # HTML 4/5 define class as a multi-valued attribute, so bs4 returns it as a list of individual class names
  ret_dict["b_o_a"] = ""
  for class_val in list_item.get("class", []):
    if class_val.lower() == "bmo":
      ret_dict["b_o_a"] = "b"
    elif class_val.lower() == "amc":
      ret_dict["b_o_a"] = "a"

  # Grab our other fields of interest
  ret_dict["company"] = main_div.find("div", { "class" : "company" }).text.strip()
  ret_dict["ticker"] = main_div.find("div", { "class" : "ticker" }).text.strip()
  ret_dict["eps_est"] = main_div.find("div", { "class" : "actestimate" }).text.strip()
  ret_dict["rev_est"] = main_div.find("div", { "class" : "actrevest" }).text.strip()
  ret_dict["eps_act"] = main_div.find("div", { "class" : re.compile("^actual") }).text.strip()
  ret_dict["rev_act"] = main_div.find("div", { "class" : re.compile("^revactual") }).text.strip()
  ret_dict["rev_surp"] = main_div.find("div", { "class" : re.compile("^revsurpfull") }).text.strip()
  #ret_dict["rev_growth"] = main_div.find("div", { "class" : re.compile("^revgrowthfull") }).text.strip()
  ret_dict["eps_surp"] = main_div.find("div", { "class" : re.compile("^epssurpfull") }).text.strip()
  #ret_dict["eps_growth"] = main_div.find("div", { "class" : re.compile("^epsgrowthfull") }).text.strip()

  # Remember class is a multi-valued attribute, so the space between "guidance" and pos/neg/neut causes the two words to be returned as two values. Hence grab the last item in the list
  guidance_tag = main_div.find("div", attrs = { "class" : re.compile("guidance") })
  if guidance_tag:
    ret_dict["guidance"] = guidance_tag.get("class")[-1]

  # TODO: Earnings conference call info

  return ret_dict

In [None]:
# Queries earningswhispers.com for a given date, with a sort type of "p", "t", "c", or "s" as outlined in text above
# Returns a list of dictionary objects, one per reporting company 
def list_earnings_on_date(target_date, sort_type):
  earnings_master_list = []

  # Check arguments are sane
  if not isinstance(target_date, date):
    print("Error: a datetime.date object must be passed to list_earnings_on_date")
    return earnings_master_list

  if sort_type not in ['p', 't', 'c', 's']:
    print("Error: invalid sort type passed to list_earnings_on_date: {}\nValid options are: p,t,c,s".format(sort_type))
    return earnings_master_list
  
  # Prepare for request
  ew_base = "https://www.earningswhispers.com/"
  ew_cal = "calendar"
  ew_morecal = "morecalendar"
  params_dict = {
      "sb" : sort_type,
      "d" : date_to_offset(target_date),
      "t" : "all",
      "v" : "t"
  }

  # Send it
  response = requests.get(url = ew_base + ew_cal, params = params_dict)
  response.raise_for_status()

  # Find the EPS calendar ul tag. If there are multiple, loop through them (have not yet encountered this)
  html_cont = response.content
  html_soup = BeautifulSoup(html_cont, "lxml")

  for res_list in html_soup.find_all("ul", { "id" : "epscalendar"}):

    # Loop through the li tags that have a class attribute set (this skips the first entry, i.e. the column labels)
    for list_item in res_list.find_all("li", { "class" : True }):
      reporter_dict = {} # Will hold information on the current result/company and be appended to the master list

      # Multiple div elements exist here. The id attribute of the main one is "T-(ticker)" and that of the other (which contains info on the earnings call as mentioned above) is "O-(ticker)"
      # For now, we are targeting the fields within the "T-" div
      main_div = list_item.find("div", { "id" : re.compile("^T-")})
      if not main_div:
        print("Unexpected structure of list item in epscalendar. List item: {}".format(list_item))
        continue

      # Determine if yet to report or already reported. Handle accordingly
      found_time = main_div.find("div", { "class" : "time"})
      if found_time:
        reporter_dict = parse_yet_to_report(list_item)
      else:
        reporter_dict = parse_already_reported(list_item)

      # Append to master list
      earnings_master_list.append(reporter_dict)

  #
  # Now check morecalendar
  #
  response = requests.get(url = ew_base + ew_morecal, params = params_dict)
  if response.status_code != 200:
    return earnings_master_list

  # Same process but look for ul tag with id "morecalendar" instead of "epscalendar". List is in the same format
  html_cont = response.content
  html_soup = BeautifulSoup(html_cont, "lxml")

  for res_list in html_soup.find_all("ul", { "id" : "morecalendar"}):

    # Loop through the li tags that have a class attribute set (this skips the first entry, i.e. the column labels)
    for list_item in res_list.find_all("li", { "class" : True }):
      reporter_dict = {} # Will hold information on the current result/company and be appended to the master list

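      # Multiple div elements exist here. The id attribute of the main one is "T-(ticker)" and that of the other (which contains info on the earnings call as mentioned above) is "O-(ticker)"
      # For now, we are targeting the fields within the "T-" div
      main_div = list_item.find("div", { "id" : re.compile("^T-")})
      if not main_div:
        # Without this check, main_div.find below raises AttributeError on None
        print("Unexpected structure of list item in morecalendar. List item: {}".format(list_item))
        continue
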
      # Determine if yet to report or already reported. Handle accordingly
      found_time = main_div.find("div", { "class" : "time"})
      if found_time:
        reporter_dict = parse_yet_to_report(list_item)
      else:
        reporter_dict = parse_already_reported(list_item)

      # Append to master list
      earnings_master_list.append(reporter_dict)

  return earnings_master_list

In [None]:
target_date = date.today()
print(list_earnings_on_date(target_date, "p"))
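
The returned list mixes both kinds of result, and the parsers return an empty dict when a list item has an unexpected structure. A possible way to separate them afterwards is sketched below. `split_by_status` is a hypothetical helper and the sample entries are invented placeholders, not real output.

```python
def split_by_status(results):
    """Split parsed results into (yet_to_report, already_reported).

    Relies on the "time" key only being set by parse_yet_to_report.
    Empty dicts (failed parses) are dropped.
    """
    pending = [r for r in results if "time" in r]
    reported = [r for r in results if r and "time" not in r]
    return pending, reported


# Invented sample data shaped like the dicts built above
sample = [
    {"b_o_a": "a", "company": "Abc Corp", "ticker": "ABC", "time": "4:05 PM ET",
     "eps_est": "$0.10", "rev_est": "$1.0 B"},
    {"b_o_a": "b", "company": "Xyz Inc", "ticker": "XYZ", "eps_est": "$1.00",
     "eps_act": "$1.10", "guidance": "pos"},
    {},  # what a parser returns when the main div is missing
]
pending, reported = split_by_status(sample)
pending_tickers = [r["ticker"] for r in pending]
reported_tickers = [r["ticker"] for r in reported]
```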