# Textbook Data Retrieval 

This notebook provides code to create the data sets used in this textbook. It also contains a summary of the data, their original source location, the license for the data, the date the data were accessed to generate the committed repository version of the original/processed data, and any other relevant meta-information.

Running all cells in this notebook will:

1. obtain all data from their original sources
2. validate the original data:
    - if the original data do not already exist in the repository, a warning will be generated and the newly obtained data will be stored.
    - if the original data already exist and do not match the newly downloaded data, a warning will be generated and the newly obtained data will be set aside in a separate `.new` file rather than used.
3. process the data into the format(s) required for generating the textbook
    - if a processed version of the data does not exist in the repository, a warning will be generated prior to storing the processed data.
    - if the processed data already exist and do not match the newly processed data, a warning will be generated and the newly processed data will be set aside in a separate `.new` file rather than used.

## Adding New Data

When adding a new data source to the textbook, append it to the end of this notebook, using the same cell formatting and arrangement as the existing entries. The best way to do this is to copy and paste the block of cells for one of the examples below and then replace the information.

## Load Packages

This notebook uses the following Python 3 packages to obtain and process data.

In [58]:
import tqdm              #for progress bars
import numpy as np       #for manipulating arrays
import pandas as pd      #for loading/writing/manipulating tabular data
import requests, ftplib  #for downloading files
import os                #for handling files
import hashlib           #for validating files

## Common Functions

In [109]:
def hash_data(data):
    data_bytes = data.encode()
    sha1 = hashlib.sha1()
    sha1.update(data_bytes)
    return sha1.hexdigest()

def validate_data(new_data, stored_data, compare_hash, stored_filename):
    print('Validating new/stored data')
    new_hash = hash_data(new_data)
    stored_hash = hash_data(stored_data)
    print('Comparison hash:  ' + compare_hash)
    print('New data hash:    ' + new_hash)
    print('Stored data hash: ' + stored_hash)
    new_valid = (new_hash == compare_hash)
    stored_valid = (stored_hash == compare_hash)
    data = None
    if not new_valid:
        print('The newly obtained data hash is different from the comparison hash')
        new_filename = stored_filename + '.new'
        # keep the mismatching download next to the stored file for manual inspection
        with open(new_filename, 'w') as f:
            f.write(new_data)
        print('New data was saved in ' + new_filename)
        print('If you want to use the new data, you must replace ' + stored_filename +
              ' with the contents of ' + new_filename +
              ' and update the hash in this notebook to ' + new_hash)
        if stored_valid:
            print('Stored data hash matches; continuing with stored data.')
            data = stored_data
        else:
            print('Stored data hash is also different from the comparison hash.')
            print('Please follow the above directions and update the hash in this notebook to ' + new_hash)
    else:
        print('Newly obtained data hash matches.')
        if not stored_valid:
            print('Stored data hash does not match; replacing stored data with newly obtained data')
            with open(stored_filename, 'w') as f:
                f.write(new_data)
        else:
            print('Stored data hash matches too.')
        print('Continuing with new data')
        data = new_data
    return data
    
def load_file(filename):
    print('Loading ' + str(filename))
    stored_data = ''
    try:
        with open(filename, 'r') as f:
            stored_data = f.read()
    except Exception as e:
        print('Exception while loading '+filename)
        print(e)

    return stored_data

def download_ftp(url, folder_path, filename):
    print('Downloading ' + filename + ' from ' + url)
    raw_data = ''
    try:
        with ftplib.FTP(url) as ftp:
            ftp.login()
            ftp.cwd(folder_path)
            resp = []
            ftp.retrlines('RETR '+filename, callback = lambda ln : resp.append(ln))
            raw_data = '\n'.join(resp)
    except Exception as e:
        print('Exception while downloading ' + filename + ' from ' + url)
        print(e)
    
    return raw_data
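
When adding a new data source, the comparison hash pasted into the notebook has to be computed once from a copy of the data you trust. The sketch below mirrors what `hash_data` does (SHA-1 over the encoded text). The name `sha1_of_text` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import hashlib

def sha1_of_text(text):
    """Return the SHA-1 hex digest of a string, encoded as UTF-8 (as hash_data does)."""
    return hashlib.sha1(text.encode()).hexdigest()

# Illustrative strings only; in practice, pass the text returned by a download or by load_file.
original = 'abc'
modified = 'abc\n'

original_hash = sha1_of_text(original)
print(original_hash)  # a9993e364706816aba3e25717850c26c9cd0d89d
print(sha1_of_text(modified) == original_hash)  # False: a trailing newline changes the hash
```

Because the digest depends on every character, a change as small as a trailing newline or a different line ending in the source is reported as a mismatch.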
    


# Mauna Loa CO2 Data

## Meta-info

- **Source:** [National Oceanic and Atmospheric Administration (NOAA)](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html)
- **Data URL:** ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt
- **Date Accessed:** July 4, 2020
- **Attribution:** Dr. Pieter Tans, [NOAA/GML](https://www.esrl.noaa.gov/gmd/ccgg/trends/) and Dr. Ralph Keeling, [Scripps Institution of Oceanography](https://scrippsco2.ucsd.edu/).
- **License:** Custom:


```
# --------------------------------------------------------------------
# USE OF NOAA ESRL DATA
# 
# These data are made freely available to the public and the
# scientific community in the belief that their wide dissemination
# will lead to greater understanding and new scientific insights.
# The availability of these data does not constitute publication
# of the data.  NOAA relies on the ethics and integrity of the user to
# ensure that ESRL receives fair credit for their work.  If the data 
# are obtained for potential use in a publication or presentation, 
# ESRL should be informed at the outset of the nature of this work.  
# If the ESRL data are essential to the work, or if an important 
# result or conclusion depends on the ESRL data, co-authorship
# may be appropriate.  This should be discussed at an early stage in
# the work.  Manuscripts using the ESRL data should be sent to ESRL
# for review before they are submitted for publication so we can
# ensure that the quality and limitations of the data are accurately
# represented.
# 
# Contact:   Pieter Tans (303 497 6678; pieter.tans@noaa.gov)
# 
# File Creation:  Sat Jul  4 05:00:25 2020
# 
# RECIPROCITY
# 
# Use of these data implies an agreement to reciprocate.
# Laboratories making similar measurements agree to make their
# own data available to the general public and to the scientific
# community in an equally complete and easily accessible form.
# Modelers are encouraged to make available to the community,
# upon request, their own tools used in the interpretation
# of the ESRL data, namely well documented model code, transport
# fields, and additional information necessary for other
# scientists to repeat the work and to run modified versions.
# Model availability includes collaborative support for new
# users of the models.
# --------------------------------------------------------------------
#  
#  
# See www.esrl.noaa.gov/gmd/ccgg/trends/ for additional details.
#  
# NOTE: DATA FOR THE LAST SEVERAL MONTHS ARE PRELIMINARY, ARE STILL SUBJECT
# TO QUALITY CONTROL PROCEDURES.
# NOTE: The week "1 yr ago" is exactly 365 days ago, and thus does not run from
# Sunday through Saturday. 365 also ignores the possibility of a leap year.
# The week "10 yr ago" is exactly 10*365 days +3 days (for leap years) ago.
```

## Obtain / validate data

In [110]:
#download data from source, load it from storage
new_data = download_ftp('aftp.cmdl.noaa.gov', 'products/trends/co2/', 'co2_weekly_mlo.txt')
stored_filename = 'mauna_loa_raw.txt'
stored_data = load_file(stored_filename)

Loading mauna_loa_raw.txt


In [111]:
#validate both newly downloaded and stored data
compare_hash = '464aa24fcd8cdb051d92361da956be3d3a2818eb'
data = validate_data(new_data, stored_data, compare_hash, stored_filename)

Validating new/stored data
Comparison hash:  464aa24fcd8cdb051d92361da956be3d3a2818eb
New data hash:    464aa24fcd8cdb051d92361da956be3d3a2818eb
Stored data hash: 464aa24fcd8cdb051d92361da956be3d3a2818eb
Newly obtained data hash matches.
Stored data hash matches too.
Continuing with new data


## Preprocess data

In [113]:
# remove empty lines and lines beginning with # (these are for meta information)
no_meta_info = [s for s in data.split('\n') if s and not s.startswith('#')]
# split each line on spaces, drop empty fields, keep only the first 5 cols, and join them with ', '
standardized_whitespace = [', '.join([num for num in s.strip().split(' ') if len(num)>0][:5]) for s in no_meta_info]
# stitch together into a string with col names at the head
clean_data = 'year, month, day, date_decimal, ppm\n'+'\n'.join(standardized_whitespace)

#load previously processed data
stored_filename = 'mauna_loa.csv'
stored_clean_data = load_file(stored_filename)

Loading mauna_loa.csv


In [115]:
#validate the preprocessed data
compare_hash = '033baea66bf56351ad858e10ed7996c6ac7b9aa7'
data = validate_data(clean_data, stored_clean_data, compare_hash, stored_filename)

Validating new/stored data
Comparison hash:  033baea66bf56351ad858e10ed7996c6ac7b9aa7
New data hash:    033baea66bf56351ad858e10ed7996c6ac7b9aa7
Stored data hash: 033baea66bf56351ad858e10ed7996c6ac7b9aa7
Newly obtained data hash matches.
Stored data hash matches too.
Continuing with new data
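
The preprocessing step above can be tried on a small input without downloading anything. The sketch below is a self-contained approximation: the sample rows are made up (they only imitate a layout of `#` comment lines followed by space-separated columns), the helper name `clean` is hypothetical, and reading the result back with the standard `csv` module is just one possible way to check it.

```python
import csv
import io

# Hypothetical input: '#' comment lines followed by space-separated columns.
# The values are made up for illustration and are not real measurements.
sample_raw = '\n'.join([
    '# header comment',
    '# another comment',
    '  2000   1   2  2000.0041    400.00  7  399.00',
    '  2000   1   9  2000.0232    401.50  6  399.50',
    '',
])

def clean(raw_text):
    """Drop comment/empty lines and keep the first five columns as ', '-separated rows."""
    rows = [s for s in raw_text.split('\n') if s.strip() and not s.startswith('#')]
    body = [', '.join(s.split()[:5]) for s in rows]
    return 'year, month, day, date_decimal, ppm\n' + '\n'.join(body)

cleaned = clean(sample_raw)

# skipinitialspace=True drops the space that follows each comma,
# so the column names and values come back without leading blanks.
reader = csv.DictReader(io.StringIO(cleaned), skipinitialspace=True)
records = list(reader)
print(reader.fieldnames)
print(records[0]['ppm'])  # 400.00
```

Note that the processed file separates columns with a comma followed by a space, so any reader needs to strip that leading space (for example, `skipinitialspace=True` in `csv` or in `pandas.read_csv`).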


# Island Landmasses Data

# Historical Vote Data

# Wisconsin Breast Cancer Data

# State Property Vote Data

# Income and Housing Data