# Textbook Data Retrieval 

This notebook provides code to create the data sets used in this textbook. It also contains a summary of the data, their original source location, the license for the data, the date the data were accessed to generate the committed repository version of the original/processed data, and any other relevant meta-information.

Running all cells in this notebook will:

1. obtain all data from their original sources
2. validate the original data:
    - if the original data do not already exist in the repository, a warning will be generated and the newly obtained data will be stored.
    - if the original data already exist and do not match the newly downloaded data, a warning will be generated and the newly obtained data will be discarded.
3. process the data into the format(s) required for generating the textbook
    - if a processed version of the data does not exist in the repository, a warning will be generated prior to storing the processed data.
    - if the processed data already exist and do not match the newly processed data, a warning will be generated and the newly processed data will be discarded.
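
The store/discard rules above can be sketched in a few lines. This is only an illustration under stated assumptions, not the notebook's actual driver: `reconcile`, `sha1_hex`, and the in-memory `stored` dict (standing in for files in the repository) are hypothetical names introduced here.

```python
import hashlib
import warnings

def sha1_hex(text):
    """Return the SHA-1 hex digest of a string (UTF-8 encoded)."""
    return hashlib.sha1(text.encode()).hexdigest()

def reconcile(name, new_data, stored):
    """Hypothetical helper: apply the store/discard rules to one dataset.

    `stored` is an in-memory dict standing in for the repository.
    Returns the data that should be used downstream.
    """
    if name not in stored:
        # No stored copy yet: warn, then keep the newly obtained data
        warnings.warn(f"{name}: no stored copy; storing newly obtained data")
        stored[name] = new_data
    elif sha1_hex(stored[name]) != sha1_hex(new_data):
        # Stored copy exists but differs: warn and discard the new data
        warnings.warn(f"{name}: new data differs from stored copy; discarding new data")
    return stored[name]

stored = {"demo": "a,b\n1,2\n"}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = reconcile("demo", "a,b\n1,3\n", stored)   # mismatch: stored copy is kept
    fresh = reconcile("other", "x\n9\n", stored)     # missing: new copy is stored

print(kept == "a,b\n1,2\n", fresh == "x\n9\n", len(caught))
```

Both calls emit a warning, which matches the list above: a warning is raised whenever the repository copy is missing or disagrees with the fresh download.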

## List of Datasets

| **Name** | **Chapters**| **R Dataset** | **Remote Database** |
| -------- | ----------- | ----------- | ---------------- |
| US 2016 Census/Vote Data | 1, 2, 3 | N | N |
| Canadian Movies | 2 | N | Y |
| Historical US Vote | 3 | N | N |
| mtcars | 3 | Y| N |
| Mauna Loa CO2 | 4 | N | N |
| Islands | 4 | Y | N |
| Old Faithful | 4 | Y | N |
| Speed of Light | 4 | Y | N |
| Wisconsin Breast Cancer | 6, 7 | N | N |
| Sacramento Real Estate | 8, 9 | Y | N | 
| Marketing Data | 10 | N | N |



## Load Packages

This notebook uses the following Python 3 packages to obtain and process data.

In [137]:
import numpy as np                #for manipulating arrays
import pandas as pd               #for loading/writing/manipulating tabular data
import requests, ftplib           #for downloading files
import os                         #for handling files
import hashlib                    #for validating files
import io                         #for creating byte streams for xlsx files
import rpy2.robjects as robjects  #for obtaining datasets included in R
robjects.r('library(caret)')      #for the Sacramento dataset in R

0,1,2,3,4,5,6
'caret','ggplot2','lattice',...,'datasets','methods','base'


## Common Functions

In [151]:
datasets = {}

# This function takes in a string and outputs its SHA1 hash
def hash_data(data):
    data_bytes = data.encode()
    sha1 = hashlib.sha1()
    sha1.update(data_bytes)
    return sha1.hexdigest()

# This function takes in new/stored data strings, a hash to compare, and a filename to store data in
# If the new/stored data have a matching hash with compare_hash, just return the data
# If the stored one matches but new one doesn't, output a message with instructions on how to update the data and continue with old data
# If the new one matches but the stored one doesn't, just overwrite the stored data
def validate_data(new_data, stored_data, compare_hash, stored_filename):
    print('Validating new/stored data')
    new_hash = hash_data(new_data)
    stored_hash = hash_data(stored_data)
    print('Comparison hash:  ' + compare_hash)
    print('New data hash:    ' + new_hash)
    print('Stored data hash: ' + stored_hash)
    # reuse the hashes computed above rather than hashing the data a second time
    new_valid = (new_hash == compare_hash)
    stored_valid = (stored_hash == compare_hash)
    
    data = None
    if not new_valid:
        print('The newly obtained data hash does not match')
        if len(new_data) == 0:
            print('The new data is empty, please check for errors in downloading')
        else:
            new_filename = stored_filename+'.new'
            f = open(new_filename, 'w')
            f.write(new_data)
            f.close()
            print('New data was saved in ' + new_filename)
            print('If you want to use the new data, you must replace ' + stored_filename + ' with the contents of ' + new_filename + ' and update the hash in this notebook to ' + new_hash)
        
        if stored_valid:
            print('Stored data hash matches; continuing with stored data.')
            data = stored_data
        else:
            print('Stored data hash also does not match.')
            print('Please follow the above directions and update the hash in this notebook')
    else:
        print('Newly obtained data hash matches.')
        if not stored_valid:
            print('Stored data hash does not match')
            if len(stored_data) == 0:
                print('Stored data is empty')
            else:
                old_filename = stored_filename+'.old'
                f = open(old_filename, 'w')
                f.write(stored_data)
                f.close()
                print('Moved stored data to ' + old_filename)
                print('If you want to use the old data, you must replace ' + stored_filename + ' with the contents of ' + old_filename + ' and update the hash in this notebook to ' + stored_hash)
            f = open(stored_filename, 'w')
            f.write(new_data)
            f.close()
            print('Saved new data to ' + stored_filename)
        else:
            print('Stored data hash matches too.')
        print('Continuing with new data')
        data = new_data
    return data
    
def load_file(filename):
    print('Loading ' + str(filename))
    stored_data = ''
    try:
        with open(filename, 'r') as f:
            stored_data = f.read()
    except Exception as e:
        print('Exception while loading '+filename)
        print(e)

    return stored_data

def download_ftp(url, folder_path, filename):
    print('Downloading ' + filename + ' from ' + url)
    raw_data = ''
    try:
        with ftplib.FTP(url) as ftp:
            ftp.login()
            ftp.cwd(folder_path)
            resp = []
            ftp.retrlines('RETR '+filename, callback = lambda ln : resp.append(ln))
            raw_data = '\n'.join(resp)
    except Exception as e:
        print('Exception while downloading ' + filename + ' from ' + url)
        print(e)
    
    return raw_data

def download_http(url):
    return requests.get(url).content.decode('utf-8')    

def retrieve_r_table(name):
    robjects.r('data('+name+')')
    table = robjects.r(name)
    colnames = list(table.names)
    cols = []
    for colname in colnames:
        if type(table.rx2(colname)) == robjects.vectors.FactorVector:
            levels = list(table.rx2(colname).levels)
            cols.append([levels[lv-1] for lv in list(table.rx2(colname))])
        else:
            cols.append(list(table.rx2(colname)))
    data = ','.join(colnames)+'\n'
    for r in range(len(cols[0])):
        data += ','.join([str(col[r]) for col in cols]) + '\n'
    return data
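
As a quick check of how `hash_data` behaves, the sketch below repeats its steps in a single expression and hashes a few short strings. The name `sha1_hex` is introduced here only for illustration; the digests shown for the empty string and `'abc'` are the standard SHA-1 test values.

```python
import hashlib

def sha1_hex(text):
    # Same steps as hash_data: encode to UTF-8 bytes, then take the SHA-1 hex digest
    return hashlib.sha1(text.encode()).hexdigest()

print(sha1_hex(''))     # da39a3ee5e6b4b0d3255bfef95601890afd80709
print(sha1_hex('abc'))  # a9993e364706816aba3e25717850c26c9cd0d89d

# Any change to the content, even a trailing newline, yields a different digest
same = sha1_hex('a,b\n1,2') == sha1_hex('a,b\n1,2\n')
print(same)             # False
```

Because the comparison is byte-exact, a stray trailing newline or a change in number formatting is enough to make `validate_data` report a mismatch.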


# US 2016 Census / Vote Data

## Meta-info

- **Source:** [DataUSA](https://datausa.io)
- **Data URL:**
    - Census data: http://datausa.io/api/data?drilldowns=State&measure=Average%20Commute%20Time,Property%20Value,Median%20Household%20Income,Population&year=2016
    - Election data: https://www.fec.gov/documents/1890/federalelections2016.xlsx
- **Date Accessed:** July 5, 2020
- **License:** 

```
You can copy, download or print content for your own use, and you can also include excerpts from Data USA, databases and multimedia products in your own documents, presentations, blogs, websites and teaching materials, provided that suitable acknowledgment of Data USA as source is given.

All requests for commercial use and translation rights should be submitted to usage@datausa.io.
```

## Processing Code

In [None]:
def retrieve_state_property_vote():
    #add a user agent header so that the API doesn't return 403
    useragent = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    #create the query URL
    state_data_url = "http://datausa.io/api/data?drilldowns=State&measure=Average Commute Time,Property Value,Median Household Income,Population&year=2016"
    state_data = requests.get(state_data_url, headers=useragent).json()['data']
    #convert to a list of lists, removing DC, Puerto Rico
    #format numerical entries as either a 2-decimal float or integer
    colnames = ['State', 'Population', 'Property Value', 'Median Household Income', 'Average Commute Time']
    datalines = [[d[colname] if type(d[colname]) == str else format(d[colname], '.2f').rstrip('0').rstrip('.') for colname in colnames] for d in state_data if d['State'] != 'Puerto Rico' and d['State'] != 'District of Columbia']
    #obtain general presidential election 2016 results data
    elec_data_url = "https://www.fec.gov/documents/1890/federalelections2016.xlsx"
    stream = io.BytesIO(requests.get(elec_data_url).content)
    elec_data = pd.read_excel(stream,sheet_name=8)
    #extract one row per state with winner
    elec_data = elec_data[elec_data['WINNER INDICATOR'] == 'W']
    elec_dict = {}
    for i in range(elec_data.shape[0]):
        party = elec_data.iloc[i]['PARTY']
        #some rows assign the "winner" to combined parties; check the winner name for these rows
        #otherwise convert DEM/REP to long form names
        if party == 'Combined Parties:':
            if elec_data.iloc[i]['LAST NAME'] == 'Trump':
                party = 'Republican'
            else:
                party = 'Democratic'
        elif party == 'REP':
            party = 'Republican'
        elif party == 'DEM':
            party = 'Democratic'
        elec_dict[elec_data.iloc[i]['STATE']] = party
    #combine the two data
    data = [','.join(colnames+['Party'])]
    for line in datalines:
        data.append(','.join(line)+',' + elec_dict[line[0]])
    #return the CSV text (rather than printing it) so it can be validated like the other datasets
    return '\n'.join(data)
    

datasets['state_property_vote'] = {}
datasets['state_property_vote']['compare_hash'] = '033baea66bf56351ad858e10ed7996c6ac7b9aa7'
datasets['state_property_vote']['retrieve_data'] = retrieve_state_property_vote
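
The numeric formatting used when building `datalines` is compact enough to be easy to misread. The sketch below isolates it in a hypothetical helper, `format_number` (not part of the notebook), and applies it to made-up values rather than real API output.

```python
def format_number(value):
    """Hypothetical helper mirroring the formatting expression in the cell above."""
    if isinstance(value, str):
        return value  # strings such as state names pass through unchanged
    # Format with two decimals, then drop trailing zeros and a dangling decimal point
    return format(value, '.2f').rstrip('0').rstrip('.')

print(format_number(26.789))     # 26.79
print(format_number(3.5))        # 3.5
print(format_number(4863300))    # 4863300
print(format_number('Alabama'))  # Alabama
```

The order of the two `rstrip` calls matters: stripping zeros first stops at the decimal point, so the zeros in a whole number such as `4863300` are left intact.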
        

# Canadian Movies Data

## Meta-info

- **Source:** [National Ocean and Atmospheric Administration (NOAA)](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html)
- **Data URL:** ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt
- **Date Accessed:** July 4, 2020
- **Attribution:** Dr. Pieter Tans, [NOAA/GML](www.esrl.noaa.gov/gmd/ccgg/trends/) and Dr. Ralph Keeling, [Scripps Institution of Oceanography](https://scrippsco2.ucsd.edu/).
- **License:**



## Processing Code

# Historical US Vote Data

## Meta-info

- **Source:** [National Ocean and Atmospheric Administration (NOAA)](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html)
- **Data URL:** ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt
- **Date Accessed:** July 4, 2020
- **Attribution:** Dr. Pieter Tans, [NOAA/GML](www.esrl.noaa.gov/gmd/ccgg/trends/) and Dr. Ralph Keeling, [Scripps Institution of Oceanography](https://scrippsco2.ucsd.edu/).
- **License:**



## Processing Code

# Motor Trend Car Road Tests Data

## Meta-info

- **Source:** [National Ocean and Atmospheric Administration (NOAA)](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html)
- **Data URL:** ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt
- **Date Accessed:** July 4, 2020
- **Attribution:** Dr. Pieter Tans, [NOAA/GML](www.esrl.noaa.gov/gmd/ccgg/trends/) and Dr. Ralph Keeling, [Scripps Institution of Oceanography](https://scrippsco2.ucsd.edu/).
- **License:**



## Processing Code

In [3]:
def retrieve_mtcars():
    return retrieve_r_table('mtcars')

datasets['mtcars'] = {}
datasets['mtcars']['compare_hash'] = '033baea66bf56351ad858e10ed7996c6ac7b9aa7'
datasets['mtcars']['retrieve_data'] = retrieve_mtcars

# Mauna Loa CO2 Data

## Meta-info

- **Source:** [National Oceanic and Atmospheric Administration (NOAA)](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html)
- **Data URL:** ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt
- **Date Accessed:** July 4, 2020
- **Attribution:** Dr. Pieter Tans, [NOAA/GML](https://www.esrl.noaa.gov/gmd/ccgg/trends/) and Dr. Ralph Keeling, [Scripps Institution of Oceanography](https://scrippsco2.ucsd.edu/).
- **License:**


```
# --------------------------------------------------------------------
# USE OF NOAA ESRL DATA
# 
# These data are made freely available to the public and the
# scientific community in the belief that their wide dissemination
# will lead to greater understanding and new scientific insights.
# The availability of these data does not constitute publication
# of the data.  NOAA relies on the ethics and integrity of the user to
# ensure that ESRL receives fair credit for their work.  If the data 
# are obtained for potential use in a publication or presentation, 
# ESRL should be informed at the outset of the nature of this work.  
# If the ESRL data are essential to the work, or if an important 
# result or conclusion depends on the ESRL data, co-authorship
# may be appropriate.  This should be discussed at an early stage in
# the work.  Manuscripts using the ESRL data should be sent to ESRL
# for review before they are submitted for publication so we can
# ensure that the quality and limitations of the data are accurately
# represented.
# 
# Contact:   Pieter Tans (303 497 6678; pieter.tans@noaa.gov)
# 
# File Creation:  Sat Jul  4 05:00:25 2020
# 
# RECIPROCITY
# 
# Use of these data implies an agreement to reciprocate.
# Laboratories making similar measurements agree to make their
# own data available to the general public and to the scientific
# community in an equally complete and easily accessible form.
# Modelers are encouraged to make available to the community,
# upon request, their own tools used in the interpretation
# of the ESRL data, namely well documented model code, transport
# fields, and additional information necessary for other
# scientists to repeat the work and to run modified versions.
# Model availability includes collaborative support for new
# users of the models.
# --------------------------------------------------------------------
#  
#  
# See www.esrl.noaa.gov/gmd/ccgg/trends/ for additional details.
#  
# NOTE: DATA FOR THE LAST SEVERAL MONTHS ARE PRELIMINARY, ARE STILL SUBJECT
# TO QUALITY CONTROL PROCEDURES.
# NOTE: The week "1 yr ago" is exactly 365 days ago, and thus does not run from
# Sunday through Saturday. 365 also ignores the possibility of a leap year.
# The week "10 yr ago" is exactly 10*365 days +3 days (for leap years) ago.
```

## Processing Code

In [4]:
def retrieve_mauna_loa():
    data = download_ftp('aftp.cmdl.noaa.gov', 'products/trends/co2/', 'co2_weekly_mlo.txt')
    # remove blank lines and the lines beginning with # (these are for meta information)
    no_meta_info = [s for s in data.split('\n') if s.strip() and not s.startswith('#')]
    # replace all whitespace with a single space, strip from beginning and end, keep only first 5 cols
    standardized_whitespace = [', '.join([num for num in s.strip().split(' ') if len(num)>0][:5]) for s in no_meta_info]
    # stitch together into a string with col names at the head
    clean_data = 'year, month, day, date_decimal, ppm\n'+'\n'.join(standardized_whitespace)
    return clean_data

datasets['mauna_loa'] = {}
datasets['mauna_loa']['compare_hash'] = '033baea66bf56351ad858e10ed7996c6ac7b9aa7'
datasets['mauna_loa']['retrieve_data'] = retrieve_mauna_loa
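
The cleaning steps in `retrieve_mauna_loa` can be tried without an FTP connection. The sketch below applies the same idea to a small made-up sample in a whitespace-separated layout; `clean_weekly_co2` is a hypothetical helper, and the rows are invented for illustration, not taken from the NOAA file.

```python
def clean_weekly_co2(raw, n_cols=5):
    """Hypothetical helper: drop '#' comment lines and blank lines,
    then keep the first n_cols whitespace-separated fields of each row."""
    rows = []
    for line in raw.split('\n'):
        if not line.strip() or line.startswith('#'):
            continue
        # str.split() with no argument collapses runs of whitespace
        rows.append(', '.join(line.split()[:n_cols]))
    return 'year, month, day, date_decimal, ppm\n' + '\n'.join(rows)

sample = (
    "# made-up header line\n"
    "  2000   1   2  2000.0041    369.50  7\n"
    "\n"
    "  2000   1   9  2000.0232    369.75  7\n"
)
cleaned = clean_weekly_co2(sample)
print(cleaned)
```

Splitting each line with no separator argument is an alternative to splitting on a single space and filtering out empty fields; both give the same fields for space-padded input.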


# Island Landmasses Data

## Meta-info

- **Source:** The World Almanac and Book of Facts, 1975, page 406. See https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/islands.html 
- **Date Accessed:** July 4, 2020
- **License:** Custom:


## Processing Code

In [5]:
def retrieve_islands():
    isl = robjects.r('islands')
    names = list(isl.names)
    vals = list(isl)
    data = 'Island,Landmass\n'
    data = data + '\n'.join([d[0]+','+str(int(d[1])) for d in zip(names, vals)])
    return data

datasets['islands'] = {}
datasets['islands']['compare_hash'] = '033baea66bf56351ad858e10ed7996c6ac7b9aa7'
datasets['islands']['retrieve_data'] = retrieve_islands

# Old Faithful Data

## Meta-info

- **Source:** [National Ocean and Atmospheric Administration (NOAA)](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html)
- **Data URL:** ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt
- **Date Accessed:** July 4, 2020
- **Attribution:** Dr. Pieter Tans, [NOAA/GML](www.esrl.noaa.gov/gmd/ccgg/trends/) and Dr. Ralph Keeling, [Scripps Institution of Oceanography](https://scrippsco2.ucsd.edu/).
- **License:**



## Processing Code

In [6]:
def retrieve_faithful():
    return retrieve_r_table('faithful')

datasets['faithful'] = {}
datasets['faithful']['compare_hash'] = '033baea66bf56351ad858e10ed7996c6ac7b9aa7'
datasets['faithful']['retrieve_data'] = retrieve_faithful

# Michelson Speed of Light Data

## Meta-info

- **Source:** [National Ocean and Atmospheric Administration (NOAA)](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html)
- **Data URL:** ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt
- **Date Accessed:** July 4, 2020
- **Attribution:** Dr. Pieter Tans, [NOAA/GML](www.esrl.noaa.gov/gmd/ccgg/trends/) and Dr. Ralph Keeling, [Scripps Institution of Oceanography](https://scrippsco2.ucsd.edu/).
- **License:**



## Processing Code

In [7]:
def retrieve_michelson():
    return retrieve_r_table('morley')

datasets['michelson'] = {}
datasets['michelson']['compare_hash'] = '033baea66bf56351ad858e10ed7996c6ac7b9aa7'
datasets['michelson']['retrieve_data'] = retrieve_michelson

# Wisconsin Breast Cancer Data

## Meta-info

- **Source:** [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))
- **Data URL:** https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
- **Date Accessed:** July 5, 2020
- **Attribution:** Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian, University of Wisconsin.
- **License:** Custom:


## Processing Code

In [None]:
def retrieve_wdbc():
    data = download_http('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')
    #create list of variable names
    names = ['Class', 'Radius', 'Texture', 'Perimeter', 'Area', 'Smoothness', 'Compactness', 'Concavity', 'Concave Points', 'Symmetry', 'Fractal Dimension']
    #remove all but the class label (B/M) and first 10 entries (means of each value)
    #skip blank lines (e.g. from a trailing newline) so no empty rows are emitted
    data_lines = [line.split(',')[1:12] for line in data.split('\n') if line.strip()]
    clean_data = ','.join(names) + '\n' + '\n'.join([','.join(line) for line in data_lines])
    return clean_data
    
datasets['wdbc'] = {}
datasets['wdbc']['compare_hash'] = '033baea66bf56351ad858e10ed7996c6ac7b9aa7'
datasets['wdbc']['retrieve_data'] = retrieve_wdbc
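
To see what the `[1:12]` slice in `retrieve_wdbc` keeps, the sketch below applies it to one made-up row. It assumes each row holds an ID, a class label, and then 30 feature values with the ten means first, as the comments in the cell above describe; the row contents are invented for illustration.

```python
names = ['Class', 'Radius', 'Texture', 'Perimeter', 'Area', 'Smoothness',
         'Compactness', 'Concavity', 'Concave Points', 'Symmetry', 'Fractal Dimension']

# Made-up row: ID, class label, then 30 placeholder feature values (1..30)
line = '1001,M,' + ','.join(str(i) for i in range(1, 31))

fields = line.split(',')
kept = fields[1:12]  # drops the ID at index 0; keeps the label plus the first 10 features

record = dict(zip(names, kept))
print(len(fields), len(kept))
print(record['Class'], record['Radius'], record['Fractal Dimension'])
```

The slice yields 11 values, one per entry in `names`, so the header and the data rows stay aligned.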

# Sacramento Real Estate Data

## Meta-info

- **Source:** [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))
- **Data URL:** https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
- **Date Accessed:** July 5, 2020
- **Attribution:** Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian, University of Wisconsin.
- **License:** Custom:


## Processing Code

In [12]:
def retrieve_sacramento():
    return retrieve_r_table('Sacramento')

datasets['sacramento'] = {}
datasets['sacramento']['compare_hash'] = '033baea66bf56351ad858e10ed7996c6ac7b9aa7'
datasets['sacramento']['retrieve_data'] = retrieve_sacramento

R[write to console]: Loading required package: lattice

R[write to console]: Loading required package: ggplot2



city,zip,beds,baths,sqft,type,price,latitude,longitude
34,64,2,1.0,836,3,59222,38.631913,-121.434879
34,52,3,1.0,1167,3,68212,38.478902,-121.431028
34,44,2,1.0,796,3,68880,38.618305,-121.443839
34,44,2,1.0,852,3,69307,38.616835,-121.439146
34,53,2,1.0,797,3,81900,38.51947,-121.435768
34,65,3,1.0,1122,1,89921,38.662595,-121.327813
34,66,3,2.0,1104,3,90895,38.681659,-121.351705
34,49,3,1.0,1177,3,91002,38.535092,-121.481367
29,24,2,2.0,941,1,94905,38.621188,-121.270555
31,25,3,2.0,1146,3,98937,38.700909,-121.442979
34,64,3,2.0,909,3,100309,38.637663,-121.45152
34,52,3,2.0,1289,3,106250,38.470746,-121.458918
34,44,1,1.0,871,3,106852,38.618698,-121.435833
34,51,3,1.0,1020,3,107502,38.482215,-121.492603
34,66,2,2.0,1022,3,108750,38.672914,-121.35934
34,66,2,2.0,1134,1,110700,38.700051,-121.351278
31,25,2,1.0,844,3,113263,38.689591,-121.452239
5,6,2,1.0,795,1,116250,38.679776,-121.314089
34,61,2,1.0,588,3,120000,38.612099,-121.469095
31,25,3,2.0,1356,3,121630,38.689999,-121.46322
5,6,3,2.0,1
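
The integer values in the `city`, `zip`, and `type` columns of the output above look like raw R factor codes rather than labels, which is the case `retrieve_r_table` handles by mapping each code to its level. A minimal sketch of that mapping in plain Python, with no R dependency; `decode_factor` and the level names are illustrative assumptions, not values read from the dataset, and missing values are not handled.

```python
def decode_factor(codes, levels):
    """Map 1-based R factor codes to their level labels."""
    # R indexes levels from 1, Python lists from 0, hence the code - 1
    return [levels[code - 1] for code in codes]

levels = ['Condo', 'Multi_Family', 'Residential']  # illustrative level names
codes = [3, 3, 1, 3]                               # illustrative factor codes

decoded = decode_factor(codes, levels)
print(decoded)  # ['Residential', 'Residential', 'Condo', 'Residential']
```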

# Marketing Data

## Meta-info

- **Source:** [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))
- **Data URL:** https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
- **Date Accessed:** July 5, 2020
- **Attribution:** Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian, University of Wisconsin.
- **License:** Custom:


## Processing Code

In [12]:
def retrieve_marketing():
    return retrieve_r_table('Sacramento')

datasets['marketing'] = {}
datasets['marketing']['compare_hash'] = '033baea66bf56351ad858e10ed7996c6ac7b9aa7'
datasets['marketing']['retrieve_data'] = retrieve_marketing

R[write to console]: Loading required package: lattice

R[write to console]: Loading required package: ggplot2



city,zip,beds,baths,sqft,type,price,latitude,longitude
34,64,2,1.0,836,3,59222,38.631913,-121.434879
34,52,3,1.0,1167,3,68212,38.478902,-121.431028
34,44,2,1.0,796,3,68880,38.618305,-121.443839
34,44,2,1.0,852,3,69307,38.616835,-121.439146
34,53,2,1.0,797,3,81900,38.51947,-121.435768
34,65,3,1.0,1122,1,89921,38.662595,-121.327813
34,66,3,2.0,1104,3,90895,38.681659,-121.351705
34,49,3,1.0,1177,3,91002,38.535092,-121.481367
29,24,2,2.0,941,1,94905,38.621188,-121.270555
31,25,3,2.0,1146,3,98937,38.700909,-121.442979
34,64,3,2.0,909,3,100309,38.637663,-121.45152
34,52,3,2.0,1289,3,106250,38.470746,-121.458918
34,44,1,1.0,871,3,106852,38.618698,-121.435833
34,51,3,1.0,1020,3,107502,38.482215,-121.492603
34,66,2,2.0,1022,3,108750,38.672914,-121.35934
34,66,2,2.0,1134,1,110700,38.700051,-121.351278
31,25,2,1.0,844,3,113263,38.689591,-121.452239
5,6,2,1.0,795,1,116250,38.679776,-121.314089
34,61,2,1.0,588,3,120000,38.612099,-121.469095
31,25,3,2.0,1356,3,121630,38.689999,-121.46322
5,6,3,2.0,1