**Created by: Christopher Brown**  
**Last revised: 9/13/20**

# Overview
This notebook provides the Python code and functions I created to acquire 94,000+ UFO reports from nuforc.org. Retrieving an individual report along with its full summary requires three distinct scrapes: the index of month/year pages, the report table on each month/year page, and the individual report page that holds the full summary.

In [1]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
# import numpy as np 
# pip install lxml # for pd.read_html()
# pip install html5lib
import time # time.time() for tracking time spent on large scrapes

# Code for more robust scraping/fault tolerance (necessary for level 3 scraping due to the long duration)
# Acquired via: https://findwork.dev/blog/advanced-usage-python-requests-timeouts-retries-hooks/
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry # urllib3 is installed alongside requests

retry_strategy = Retry(
    total=5,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["HEAD", "GET", "OPTIONS"] # named method_whitelist before urllib3 1.26
)
adapter = HTTPAdapter(max_retries=retry_strategy)
http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)
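
The `Retry` configuration above does not set a `backoff_factor`, so by default no growing delay is added between attempts. To illustrate the idea of spacing retries out, here is a minimal standard-library sketch of retry with exponential backoff. The names `retry_with_backoff` and `flaky`, and the delay values, are hypothetical; this is not how the adapter works internally, only a picture of the general technique.

```python
import time

def retry_with_backoff(func, retries=5, base_delay=0.5, sleep=time.sleep):
    """Call func() until it succeeds, waiting base_delay * 2**attempt between failures.

    Hypothetical helper for illustration; `sleep` is injectable so the
    waiting can be skipped when experimenting.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:  # Out of retries: re-raise the last error
                raise
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, 4s, ...

# Demo with a stand-in for a flaky request: fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

waits = []  # Record the delays instead of actually sleeping
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)  # ok [0.5, 1.0]
```

Within the session itself, passing a `backoff_factor` argument to `Retry` should give a comparable effect without any extra helper.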

# First level scraping
The primary goal of the first level scrape is to gather a list of urls for each month/year combination on the [web reports page](http://www.nuforc.org/webreports/ndxevent.html).

In [2]:
# 1st level scraping function
def nuforc_scrape_L1(url='http://www.nuforc.org/webreports/ndxevent.html'):
    """
    Scrape the National UFO Reporting Center Report Index by Month.
    
    Returns:
    dict mapping each relative month/year url to a (month/year, report count) tuple
    
    Example usage:
    nuforc_scrape_L1()
    """
    # Generate the request
    page = requests.get(url, timeout=10)
    if page: # A response is truthy when its status code is below 400
        soup = BeautifulSoup(page.content, 'html.parser')
    else: # Return a helpful message if the page could not be retrieved
        return 'Please provide a valid url leading to an html page.'
    
    table = soup.find('tbody')
    
    url_dict = {} # Build a dict as follows: {url: (date, count)}
    for t in table.find_all('tr'):
        date,count = t.find_all('td')
        url_dict[date.find('a')['href']] = (date.text, count.text)
    return url_dict

In [3]:
# Run the first scrape
url_dict_L1 = nuforc_scrape_L1()

# Second and third level scraping
Next, each month/year url must be visited to extract its report table. However, because the `Summary` field is truncated at 135 characters, the url of each sighting page must also be retrieved so that the full `Summary` can be extracted.  
  
Because running the 3rd level scrape on all month/year urls can take a considerable amount of time, I designed the workflow to accept a partial list of month/year urls, which gives some control over both the run time and the volume of records returned.
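
One way to work with partial lists is to split the full url list into fixed-size batches and scrape one batch at a time. The sketch below is only an assumption about how such batching might look; `chunked`, the batch size of 3, and the placeholder urls are hypothetical.

```python
def chunked(items, size):
    """Yield successive slices of `items` containing at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder urls standing in for the real month/year list
example_urls = [f"http://example.com/month{i}.html" for i in range(7)]

batches = list(chunked(example_urls, 3))
print([len(b) for b in batches])  # [3, 3, 1]
```

Each batch could then be passed to `nuforc_scrape_L2()` and saved before moving on to the next, so an interruption costs at most one batch of work.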

In [4]:
# The hrefs are relative, so prepend the base url 'http://www.nuforc.org/webreports/'
urls = ['http://www.nuforc.org/webreports/' + k for k in url_dict_L1]
len(urls) # 865 month/year urls as of 9/8/2020

865
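
String concatenation works here because every href is relative to the same directory. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves a relative link against a base url. The monthly href below is a made-up example of the assumed form; the sighting href is taken from the report urls shown later in this notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.nuforc.org/webreports/'

hrefs = [
    'ndxe202009.html',    # Hypothetical month/year href (assumed form)
    '159/S159151.html',   # Relative sighting href, as seen in the report urls
]

full_urls = [urljoin(BASE_URL, href) for href in hrefs]
for u in full_urls:
    print(u)
# http://www.nuforc.org/webreports/ndxe202009.html
# http://www.nuforc.org/webreports/159/S159151.html
```

Note that `urljoin` relies on the trailing slash in `BASE_URL`; without it, the last path segment of the base would be replaced rather than extended.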

Pandas has a handy `read_html()` function for extracting html tables, which is very useful for the level 2 scrape. In the next cell, I run it on the most recent month (09/2020 as of this writing) to give a sense of what the raw results look like. Convenient as it is, `read_html()` returns only the cell text by default and not the link targets, so the individual report urls need to be extracted with the help of `BeautifulSoup`.

In [5]:
# Pass the first (most recent) url, extract first and only html table,
# then take a look at the resulting dataframe
df = pd.read_html(urls[0])[0]
df

| | Date / Time | City | State | Shape | Duration | Summary | Posted |
|---|---|---|---|---|---|---|---|
| 0 | 9/4/20 05:36 | Buffalo | NY | Other | Unknown | The shape consisted of two chevrons with a ver... | 9/4/20 |
| 1 | 9/4/20 04:58 | Lakeland | FL | Fireball | 4 seconds | Bright Falling Green Object In A Moonlit Early... | 9/4/20 |
| 2 | 9/4/20 03:20 | York County | VA | Other | 1-2 seconds | I was heading back from a friends house coming... | 9/4/20 |
| 3 | 9/4/20 03:18 | Beausejour (Canada) | MB | Formation | 5 seconds | 3 white dots flying East ward in Eastern sky. | 9/4/20 |
| 4 | 9/4/20 01:15 | Pahrump | NV | Light | 2-3 minutes | Blue ziggzaging and sudden stops light that ti... | 9/4/20 |
| 5 | 9/4/20 01:00 | Williamstown | WV | Teardrop | Seconds | Looked up and caught a glimpse of a light movi... | 9/4/20 |
| 6 | 9/3/20 23:45 | Allen | TX | Oval | 5 seconds | Large oval shaped bright green ufo traveling t... | 9/4/20 |
| 7 | 9/3/20 23:42 | Granbury | TX | Light | 5-6 seconds | I was watching the East sky by lake Granbury a... | 9/4/20 |
| 8 | 9/3/20 21:15 | Mukwonago | WI | Triangle | 2 minutes | Triangle with 3 white lights on corners. | 9/4/20 |
| 9 | 9/3/20 20:00 | Dansville/Mason | MI | Light | 10 seconds | Triple lights in the sky. | 9/4/20 |


In [6]:
def nuforc_scrape_L2(monthly_urls):
    """
    Scrapes a list of urls retrieved from the National UFO Reporting Center Report
    Index by month/year pair. See http://www.nuforc.org/webreports/ndxevent.html. From each url
    an html table is extracted and stored as a dataframe.
    
    Additionally, this function calls upon nuforc_scrape_L3() as a 3rd level scrape,
    extracting the full summary for each sighting and adding the text to the corresponding
    report/dataframe.
    
    Returns:
    A list of dataframes, one for each monthly url supplied as input.
    
    Example usage:
    nuforc_scrape_L2(urls)
    nuforc_scrape_L2(['http://www.nuforc.org/webreports/'+k for k in url_dict_L1.keys()])
    """
    dfs = []
    for monthly_url in monthly_urls:
        monthly_ufo_page = http.get(monthly_url, timeout=10) # timeout keeps a stalled request from hanging the scrape
        
        df = pd.read_html(monthly_ufo_page.content)[0] # The table of interest is always the first table element
        
        # Acquire urls for each sighting in a month
        sighting_urls=[]
        for tr in BeautifulSoup(monthly_ufo_page.content, 'html.parser').find('tbody').find_all('tr'):
            sighting_urls.append(tr.find('td').find('a')['href']) # Find the url for each sighting
        
        # Add URL and Full Summary columns to the dataframe
        df['URL'] = ['http://www.nuforc.org/webreports/' + url for url in sighting_urls] 
        df['Full Summary'] = df['URL'].map(nuforc_scrape_L3) # Get full summary for each sighting
        
        dfs.append(df)
    return dfs # return list of dataframes

In [7]:
def nuforc_scrape_L3(url):
    """
    Takes a report url and returns the full report sighting summary description. Protects against
    empty or missing webpages, providing diagnostic information in these cases.
    """
    page = http.get(url, timeout=10)
    if not page: # Handle missing webpage (Posting date of 5/4/19 has numerous examples)
        return '[[The page cannot be found]]'
    soup = BeautifulSoup(page.content, 'html.parser')
    # Extract full summary description
    try:
        return soup.find('tbody').find_all('td')[1].text
    except (AttributeError, IndexError): # Handle incomplete/empty webpages (no tbody or no second cell)
        return url # Use corresponding url to examine these pages if needed
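
Because `nuforc_scrape_L3()` signals problems through its return value, the affected rows can be separated out afterwards. The following is a minimal sketch of one way to label each result; `classify_summary` and its labels are hypothetical, and it assumes a genuine summary never starts with `http://`.

```python
NOT_FOUND = '[[The page cannot be found]]'  # Sentinel returned for missing pages

def classify_summary(value):
    """Label a value returned by nuforc_scrape_L3() (hypothetical helper)."""
    if value == NOT_FOUND:
        return 'missing'
    if value.startswith('http://'):  # The url itself comes back for empty pages
        return 'empty'
    return 'ok'

results = [
    'Triple lights in the sky.',
    '[[The page cannot be found]]',
    'http://www.nuforc.org/webreports/159/S159151.html',
]
print([classify_summary(r) for r in results])  # ['ok', 'missing', 'empty']
```

Applied to the `Full Summary` column, for example with `df['Full Summary'].map(classify_summary)`, this gives a quick count of pages that may be worth re-checking.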

Next, a 'brief' demo runs the scrape on the two most recent months. A note of caution: running the scrape on the full list of month/year urls will likely take at least several hours.

In [8]:
a = time.time()
df_list = nuforc_scrape_L2(urls[0:2]) # 2 most recent month/year tables worth of UFO reports
b = time.time()
b-a # time elapsed, in seconds

344.32057666778564
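
For a rough sense of scale, the timing above can be extrapolated to the full dataset. This is a back-of-the-envelope sketch: it assumes the two months contain 10 and 650 reports (inferred from the row indices of the dataframes displayed in this notebook) and that the cost per report stays constant, which it may not.

```python
elapsed_seconds = 344.32057666778564   # Measured above for two months
reports_scraped = 10 + 650             # Assumed row counts for 09/2020 and 08/2020
total_reports = 94_000                 # Approximate size of the full dataset

seconds_per_report = elapsed_seconds / reports_scraped
estimated_hours = seconds_per_report * total_reports / 3600

print(f"{seconds_per_report:.2f} s per report, about {estimated_hours:.1f} hours in total")
# 0.52 s per report, about 13.6 hours in total
```

Under these assumptions the estimate is consistent with a run time of well over several hours for the complete url list.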

In [9]:
# The dataframes in the resulting list can be combined with a single call:
ufo_df = pd.concat(df_list)
ufo_df

| | Date / Time | City | State | Shape | Duration | Summary | Posted | URL | Full Summary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 9/4/20 05:36 | Buffalo | NY | Other | Unknown | The shape consisted of two chevrons with a ver... | 9/4/20 | http://www.nuforc.org/webreports/159/S159151.html | The shape consisted of two chevrons with a ver... |
| 1 | 9/4/20 04:58 | Lakeland | FL | Fireball | 4 seconds | Bright Falling Green Object In A Moonlit Early... | 9/4/20 | http://www.nuforc.org/webreports/159/S159152.html | Bright Falling Green Object In A Moonlit Early... |
| 2 | 9/4/20 03:20 | York County | VA | Other | 1-2 seconds | I was heading back from a friends house coming... | 9/4/20 | http://www.nuforc.org/webreports/159/S159147.html | I was heading back from a friends house coming... |
| 3 | 9/4/20 03:18 | Beausejour (Canada) | MB | Formation | 5 seconds | 3 white dots flying East ward in Eastern sky. | 9/4/20 | http://www.nuforc.org/webreports/159/S159154.html | 3 white dots flying East ward in Eastern sky3 ... |
| 4 | 9/4/20 01:15 | Pahrump | NV | Light | 2-3 minutes | Blue ziggzaging and sudden stops light that ti... | 9/4/20 | http://www.nuforc.org/webreports/159/S159153.html | Blue ziggzaging and sudden stops light that ti... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 646 | 8/1/20 04:00 | Lehighton | PA | Light | 60 minutes | This is the 3rd experiences I've had in a 4-5 ... | 8/6/20 | http://www.nuforc.org/webreports/158/S158093.html | This is the 3rd experiences I've had in a 4-5 ... |
| 647 | 8/1/20 04:00 | N. Wales | PA | Chevron | 30 minutes | Three bright lights in the sky that disappeare... | 8/6/20 | http://www.nuforc.org/webreports/158/S158153.html | Three bright lights in the sky that disappeare... |
| 648 | 8/1/20 03:04 | Melbourne | FL | Oval | 6 minutes | I don't know what this is. If you could confor... | 8/6/20 | http://www.nuforc.org/webreports/158/S158088.html | I don't know what this is. If you could confor... |
| 649 | 8/1/20 01:10 | Big Falls | MN | Triangle | 3 seconds | I was sitting on my back porch smoking, when I... | 8/6/20 | http://www.nuforc.org/webreports/158/S158100.html | I was sitting on my back porch smoking, when I... |


In [10]:
# Finally, save the results, tailoring the filename to reflect the timeframe captured
dates = '2020_09_08'
ufo_df.to_csv('nuforc_reports_'+dates+'.csv', index=False)
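
To avoid editing the date string by hand, the filename stamp could be generated from a date object. This is a small sketch using the standard library; `make_filename` is a hypothetical helper, and with `date.today()` it stamps the run date rather than the date range of the sightings.

```python
from datetime import date

def make_filename(day, prefix='nuforc_reports_'):
    """Build a csv filename such as nuforc_reports_2020_09_08.csv."""
    return f"{prefix}{day.strftime('%Y_%m_%d')}.csv"

print(make_filename(date(2020, 9, 8)))   # nuforc_reports_2020_09_08.csv
print(make_filename(date.today()))       # Stamped with the current date
```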