### Introduction
The checks below validate the raw data and inform how the import library under `src/utils.py` is defined.

In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [13]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
# Column names are provided in data/auto-mpg.names
colnames = ['mpg', 'cylinders', 'displacement', 'horsepower',
            'weight', 'acceleration', 'year', 'origin', 'name']
# sep=r'\s+' splits on whitespace; it replaces delim_whitespace=True, which newer pandas deprecates
dat = pd.read_csv(url, sep=r'\s+', names=colnames)

In [14]:
dat.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |


In [26]:
# Type check: horsepower and origin do not have the expected dtypes
print(dat.dtypes)

mpg             float64
cylinders         int64
displacement    float64
horsepower       object
weight          float64
acceleration    float64
year              int64
origin            int64
name             object
dtype: object


### Typecheck - Horsepower

In [25]:
# Find the rows where horsepower is not numeric
converted = pd.to_numeric(dat['horsepower'], errors='coerce')
idx = converted.isna()
dat[idx]  # Rows whose horsepower value could not be parsed as a number

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | ? | 2046.0 | 19.0 | 71 | 1 | ford pinto |
| 126 | 21.0 | 6 | 200.0 | ? | 2875.0 | 17.0 | 74 | 1 | ford maverick |
| 330 | 40.9 | 4 | 85.0 | ? | 1835.0 | 17.3 | 80 | 2 | renault lecar deluxe |
| 336 | 23.6 | 4 | 140.0 | ? | 2905.0 | 14.3 | 80 | 1 | ford mustang cobra |
| 354 | 34.5 | 4 | 100.0 | ? | 2320.0 | 15.8 | 81 | 2 | renault 18i |
| 374 | 23.0 | 4 | 151.0 | ? | 3035.0 | 20.5 | 82 | 1 | amc concord dl |


* **Problem**: The initial type check shows that `horsepower` is loaded as `object`, even though it is documented as continuous.
* **Solution**: The string `?` marks the 6 missing values in this column. Convert `?` to NA during import.
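
The fix described above could be sketched as follows, by passing `na_values='?'` to `pd.read_csv`. The function name `load_auto_mpg` and the inline sample rows are assumptions made for illustration; they are not the actual contents of `src/utils.py`.

```python
import io
import pandas as pd

COLNAMES = ['mpg', 'cylinders', 'displacement', 'horsepower',
            'weight', 'acceleration', 'year', 'origin', 'name']

def load_auto_mpg(source):
    """Hypothetical loader: treat '?' as missing so horsepower parses as float."""
    return pd.read_csv(source, sep=r'\s+', names=COLNAMES, na_values='?')

# Two rows taken from the outputs above, written as whitespace-delimited
# text with a quoted name field (the exact raw layout is an assumption).
sample = io.StringIO(
    '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"\n'
    '25.0 4 98.0 ? 2046. 19.0 71 1 "ford pinto"\n'
)
df = load_auto_mpg(sample)
print(df.dtypes['horsepower'])        # numeric dtype instead of object
print(df['horsepower'].isna().sum())  # number of rows that held '?'
```

Handling the marker at read time keeps the column numeric from the start, so no separate `pd.to_numeric` pass is needed after loading.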

### Typecheck - Origin 

In [28]:
dat['origin'].value_counts()

1    249
3     79
2     70
Name: origin, dtype: int64

In [29]:
dat['origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})

0         USA
1         USA
2         USA
3         USA
4         USA
        ...  
393       USA
394    Europe
395       USA
396       USA
397       USA
Name: origin, Length: 398, dtype: object

* **Problem**: `origin` is encoded as integers, but it should be categorical.
* **Solution**: Decode `origin` back to its string labels.
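
A minimal sketch of that decoding step, assuming the same code-to-label mapping used in the cell above. The helper name `decode_origin` and the check for unknown codes are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# Mapping copied from the notebook cell above
ORIGIN_LABELS = {1: 'USA', 2: 'Europe', 3: 'Japan'}

def decode_origin(codes):
    """Hypothetical helper: map integer origin codes to a categorical column."""
    unknown = set(codes.dropna().unique()) - set(ORIGIN_LABELS)
    if unknown:
        # Fail loudly rather than letting .map() silently produce NaN
        raise ValueError(f'Unexpected origin codes: {sorted(unknown)}')
    return codes.map(ORIGIN_LABELS).astype('category')

origin = pd.Series([1, 1, 3, 2, 1], name='origin')
decoded = decode_origin(origin)
print(decoded.tolist())  # ['USA', 'USA', 'Japan', 'Europe', 'USA']
print(decoded.dtype)     # category
```

Casting to `category` is one possible choice here; keeping plain strings, as `.map` returns them, would also satisfy the stated solution.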