#### Laying a pipeline... 

We are going to create a simple mechanism to have your scrapers repeat every so often, store the data they find, and then prepare it for the Free Press. Our approach will involve a so-called filesystem database. For each day, we will want to 1) pull the content from the site and store it locally, 2) run a process to pull the data from the PDF or HTML and return a CSV, say, and 3) prepare the data for the Free Press. And we'll want to do it at regular intervals.

To create structures on the filesystem, we are going to use a module from Python's standard library called `os`, short for "operating system".

In [None]:
import os
from datetime import datetime

We use the `.getcwd()` method to tell what folder we're in.

In [None]:
os.getcwd()

We use the `.listdir()` method to list all the files (including other folders) in a directory.

In [None]:
os.listdir()

We use `.mkdir()` to make a folder. Here we make one for our project `freepress`. 

In [None]:
os.mkdir("freepress")
os.listdir()

See that our new folder is among our files. Now, we can pass an argument to `.listdir()` to see the contents of a particular folder.

In [None]:
os.listdir("freepress")

We can move to that directory using `.chdir()` and the name of the folder we'd like to drop into.

In [None]:
os.chdir("freepress")

In [None]:
os.getcwd()

Now we are going to create a folder named for the date of the material we are fetching. We can call `datetime.today()` to get a datetime object for the current date and time, and then use `.strftime()` to convert that object into a string. Here we use the `%` format codes we saw last time.

In [None]:
today = datetime.today()
folder_name = today.strftime("%Y-%m-%d")

folder_name
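
To make the format codes concrete, here is a small sketch that uses a fixed, made-up date instead of `.today()`, so it gives the same result whenever it is run. The names `example_day`, `example_folder` and `parsed_day` are just for illustration.

```python
from datetime import datetime

# a fixed example date (an assumption for illustration; the notebook uses datetime.today())
example_day = datetime(2020, 4, 13)

# %Y = four-digit year, %m = zero-padded month, %d = zero-padded day
example_folder = example_day.strftime("%Y-%m-%d")
print(example_folder)

# strptime() goes the other way, from a string back to a datetime object
parsed_day = datetime.strptime(example_folder, "%Y-%m-%d")
print(parsed_day == example_day)
```

Because the year comes first and the month and day are zero-padded, folder names in this format sort chronologically when they are listed alphabetically.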

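Now, our program is going to run several times a day, so we don't want to try to make the folders we need every time: `os.mkdir()` raises an error if the folder already exists. Instead, we check whether the folder is there and create it only if it is missing. Once it's created, you move into it...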

In [None]:
# we imported os (not path), so isdir() lives at os.path.isdir()
if not os.path.isdir(folder_name): os.mkdir(folder_name)

And then, after changing directory to the folder for our date, we make three new folders: one for the HTML or PDF `source`, one for the `data`, and one for the information we are sending to the `freepress`.

In [None]:
os.chdir(folder_name)

if not os.path.isdir("source"): os.mkdir("source")
if not os.path.isdir("data"): os.mkdir("data")
if not os.path.isdir("freepress"): os.mkdir("freepress")
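
One possible alternative to the check-then-create pattern is `os.makedirs()` with `exist_ok=True`, which creates any missing parent folders and does nothing if the folder is already there. The sketch below also builds paths with `os.path.join()` rather than changing directory. The function name `make_day_folders` and the `base_dir` argument are hypothetical, and the example runs in a temporary directory so it leaves your project folder alone.

```python
import os
import tempfile
from datetime import datetime

def make_day_folders(base_dir, day):
    """Create base_dir/YYYY-MM-DD/{source,data,freepress} and return the day folder path."""
    day_folder = os.path.join(base_dir, day.strftime("%Y-%m-%d"))
    for name in ("source", "data", "freepress"):
        # exist_ok=True makes this safe to call again and again
        os.makedirs(os.path.join(day_folder, name), exist_ok=True)
    return day_folder

# try it out in a throwaway directory with a fixed example date
with tempfile.TemporaryDirectory() as base_dir:
    day_folder = make_day_folders(base_dir, datetime(2020, 4, 13))
    make_day_folders(base_dir, datetime(2020, 4, 13))  # second call: no error
    created = sorted(os.listdir(day_folder))
    print(created)
```

Not changing directory means the rest of the program can keep using paths relative to where it started, which tends to matter once the code runs unattended.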

We'll walk through the process. Here we take a URL and read the data. We save it in a file in our `source` folder.

In [None]:
from requests import get

url = 'https://www.michigan.gov/coronavirus/0,9753,7-406-98163_98173---,00.html'
response = get(url)

# the with block closes the file for us, even if the write fails
with open("source/michigan_cases.html", "w") as html:
    html.write(response.text)

We can also read it into a pandas DataFrame, keeping just the table of cases, say. Remember `read_html()`, which returns a list of all the tables it finds on the page.

In [None]:
from pandas import read_html

tables = read_html(response.text)
tables[0]

Finally, we save the CSV file in our `data` directory.

In [None]:
tables[0].to_csv("data/cases.csv")

Now, let's tidy this up a little and write some functions that make the process simpler to run.

In [None]:
# rig
import os
import pandas
from datetime import datetime
import requests

def get_folder_name(directory_type):
    ''' get the folder name
        pass in the directory type: source, data, freepress
        and this function will return a string (folder name) as:
        YYYY-MM-DD/type

        e.g.:
        2020-04-13/source
        2020-04-13/data
        2020-04-13/freepress
        
    '''
    
    today = datetime.today()
    folder_name = f'{today.strftime("%Y-%m-%d")}/{directory_type}'

    return folder_name

def save_source_file(file_name, file_data):
    ''' create the "source" directory and save the file_data
    '''
    
    folder_name = get_folder_name("source")
    
    # does the folder exist? if not, create it
    if not os.path.isdir(folder_name):
        os.makedirs(folder_name)
    
    file_name_with_folder = f'{folder_name}/{file_name}'
    print(f'saving source file data to {file_name_with_folder}')
    
    # open the file so that we can "write" to it
    with open(file_name_with_folder, 'w') as data_file:
        data_file.write(file_data)
    
    
def save_data_file(file_name, file_data):
    ''' like "save_source_file", create the "data" directory and save your csv to it
        **notice the duplicate code in here and save_source_file() - there are ways we can clean this up**
    '''
    
    folder_name = get_folder_name("data")
    
    # does the folder exist? if not, create it
    if not os.path.isdir(folder_name):
        os.makedirs(folder_name)
    
    file_name_with_folder = f'{folder_name}/{file_name}'
    print(f'saving data file to {file_name_with_folder}')
    
    # open the file so that we can "write" to it    
    with open(file_name_with_folder, 'w') as data_file:
        data_file.write(file_data)
    
def fetch_and_save(url, file_name):
    ''' fetch the URL and save it to a file
    '''
    
    # fetch the data from the url
    response = requests.get(url)
    
    # save the text (HTML, JSON, etc) to our "source" directory
    save_source_file(file_name, response.text)

    # return the response object
    return response

def process_and_save(response):
    ''' put your custom parsing code in here
        and make sure you save your csv file(s) in here
    '''
    
    # pass the HTML of our page into pandas read_html
    tables = pandas.read_html(response.text)
    print(len(tables))
    
    # take out the first table, which is the cases table
    cases_df = tables[0]
    
    print(cases_df)

    # we can convert our dataframe to csv (as a string)
    csv_data = cases_df.to_csv(index=False)
    
    # save our csv to a file
    save_data_file("cases_by_county.csv", csv_data)
    
    # if you wanted to save multiple dataframes, you can do that here
    #save_data_file("cases_by_age.csv", tables[1].to_csv(index=False))
    
    # return the dataframe
    return cases_df
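
The docstring of `save_data_file()` points out that it duplicates `save_source_file()`. Here is one way that cleanup might look, as a sketch rather than the definitive version: a single helper, here called `save_file` (a hypothetical name), does the work, and the two original functions become one-line wrappers. The `base_dir` argument is also an addition, included so the example can run in a temporary directory, and the CSV text is a made-up placeholder.

```python
import os
import tempfile
from datetime import datetime

def save_file(directory_type, file_name, file_data, base_dir="."):
    """Save file_data under base_dir/YYYY-MM-DD/directory_type/file_name and return the path."""
    day = datetime.today().strftime("%Y-%m-%d")
    folder_name = os.path.join(base_dir, day, directory_type)

    # create the folder (and its parents) only if it is missing
    os.makedirs(folder_name, exist_ok=True)

    file_path = os.path.join(folder_name, file_name)
    print(f"saving {directory_type} file to {file_path}")
    with open(file_path, "w", encoding="utf-8") as out_file:
        out_file.write(file_data)
    return file_path

def save_source_file(file_name, file_data, base_dir="."):
    return save_file("source", file_name, file_data, base_dir)

def save_data_file(file_name, file_data, base_dir="."):
    return save_file("data", file_name, file_data, base_dir)

# quick check in a throwaway directory, using placeholder CSV text
with tempfile.TemporaryDirectory() as base_dir:
    csv_path = save_data_file("cases.csv", "county,cases\nExample,1\n", base_dir)
    with open(csv_path, encoding="utf-8") as saved:
        saved_text = saved.read()
    parent_folder = os.path.basename(os.path.dirname(csv_path))
```

With the shared logic in one place, a fix such as a corrected log message only has to be made once.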
    
    
    
    

In [None]:
# michigan covid page
url = 'https://www.michigan.gov/coronavirus/0,9753,7-406-98163_98173---,00.html'

# fetch the URL, save the data to a file
response = fetch_and_save(url, "cases.html")

# do our custom processing and save it to csv file in this method
df = process_and_save(response)
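
The last piece is running these steps at regular intervals. This notebook does not show how that is scheduled, so the following is only a minimal sketch of one simple possibility: a loop that calls a task and sleeps between calls. The names `run_every` and `example_task` are hypothetical, and `example_task` is a stand-in for the `fetch_and_save()` and `process_and_save()` calls so that the example runs without touching the network.

```python
import time

def run_every(task, interval_seconds, max_runs):
    """Call task() max_runs times, sleeping interval_seconds between calls."""
    results = []
    for run_number in range(max_runs):
        results.append(task())
        # no need to sleep after the final run
        if run_number < max_runs - 1:
            time.sleep(interval_seconds)
    return results

def example_task():
    # stand-in for fetch_and_save(...) followed by process_and_save(...)
    return "done"

# an interval of 0 seconds keeps the demonstration instant
outcomes = run_every(example_task, interval_seconds=0, max_runs=3)
print(outcomes)
```

For a scraper that should run several times a day, `interval_seconds` might be a few hours' worth of seconds. A system scheduler such as cron is another common choice, since it keeps working after the notebook or script has been closed.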
