In [1]:
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

In [2]:
# HIDDEN
def df_interact(df, nrows=7, ncols=7):
    '''
    Outputs sliders that show rows and columns of df
    '''
    def peek(row=0, col=0):
        return df.iloc[row:row + nrows, col:col + ncols]

    row_arg = (0, len(df), nrows) if len(df) > nrows else fixed(0)
    col_arg = ((0, len(df.columns), ncols)
               if len(df.columns) > ncols else fixed(0))
    
    interact(peek, row=row_arg, col=col_arg)
    print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))

In [3]:
# HIDDEN
# Raw string so that \s is passed to the regex engine unchanged
babies = pd.read_csv('data/babies23.data', delimiter=r'\s+')
babies_small = babies[['wt', 'race', 'ed']]

## Data Types

We often begin exploratory data analysis by examining the types of data that occur in a table. Although there are multiple ways of categorizing data types, in this book we discuss three broad types of data:

1. **Nominal data**, which represents categories that do not have a natural ordering. For example: names of people, beverage titles, and zip codes.
1. **Ordinal data**, which represents ordered categories. For example: T-shirt sizes (small, medium, large), Likert-scale responses (disagree, neutral, agree), and level of education (high school, college, graduate school).
1. **Numerical data**, which represents amounts or quantities. For example: heights, prices, and distances.

We refer to these types as **statistical data types**, or simply **data types**.

`pandas` assigns each column of a DataFrame a **computational data type** that represents how the data are stored in the computer's memory. It is essential to remember that the statistical data type can differ from the computational data type.

For example, consider the table below, which records each baby's weight at birth, the mother's race, and the mother's educational level.

In [4]:
babies_small

Unnamed: 0,wt,race,ed
0,120,8,5
1,113,0,5
2,128,0,2
...,...,...,...
1233,130,1,2
1234,125,0,4
1235,117,0,4


Every column of this DataFrame has a numeric computational data type. In this case, the `int64` type signifies that each column contains integers.

In [5]:
babies_small.dtypes

wt      int64
race    int64
ed      int64
dtype: object

However, it would be foolish to work with all three columns as if they have a numeric statistical data type. In order to understand the dataset's data types, we almost always need to consult the dataset's **data dictionary**. A data dictionary is a document included with the data that describes what each column in the data records. For example, the data dictionary for this dataset states the following:

```
wt -  birth weight in ounces (999 unknown)
race - mother's race 0-5=white 6=mex 7=black 8=asian 9=mixed 99=unknown
ed - mother's education 0= less than 8th grade, 
   1 = 8th -12th grade - did not graduate, 
   2= HS graduate--no other schooling , 3= HS+trade,
   4=HS+some college 5= College graduate, 6&7 Trade school HS unclear, 9=unknown
```

Although the `wt`, `race`, and `ed` columns are stored as integers in `pandas`, the `race` column contains nominal data and `ed` contains ordinal data.
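One way to make the nominal nature of `race` visible is to replace the integer codes with the labels from the data dictionary. The following is a minimal sketch on made-up codes rather than the real `babies_small` column; the names `race_codes` and `race_labels` are hypothetical, and the label strings are copied from the data dictionary above.

```python
import pandas as pd

# Made-up codes for illustration; the real values live in babies_small['race'].
race_codes = pd.Series([8, 0, 3, 7, 6, 99])

# Labels from the data dictionary: 0-5=white 6=mex 7=black 8=asian 9=mixed 99=unknown
race_labels = {code: 'white' for code in range(6)}
race_labels.update({6: 'mex', 7: 'black', 8: 'asian', 9: 'mixed', 99: 'unknown'})

# Map each code to its label, then store the result as an unordered categorical.
race = race_codes.map(race_labels).astype('category')

print(race.dtype)     # category
print(race.tolist())  # ['asian', 'white', 'white', 'black', 'mex', 'unknown']
```

After this step, the computational data type (`category`) matches the statistical data type more closely than `int64` did.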

In fact, we must exercise caution even with the `wt` column. Computing the average birth weight by taking the average of the `wt` column will not give an accurate result because unknown values are recorded as `999`. If left as is, the unknown values will cause our average to be higher than it should be.
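To see how much a sentinel value can distort the mean, consider this small sketch. The weights are made up for illustration, and replacing `999` with `np.nan` is one common way to handle the problem, not necessarily the only reasonable one.

```python
import numpy as np
import pandas as pd

# Made-up birth weights in ounces; 999 marks an unknown weight.
wt = pd.Series([120, 113, 128, 999, 130])

# Treating 999 as a real weight inflates the average.
naive_mean = wt.mean()

# Replace the sentinel with NaN; pandas skips NaN when computing the mean.
cleaned = wt.replace(999, np.nan)
clean_mean = cleaned.mean()

print(naive_mean)  # 298.0
print(clean_mean)  # 122.75
```

A single unknown value among five more than doubles the naive average.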

**The Importance of Data Types**

Data types guide further data analysis by specifying the operations, visualizations, and models we can apply to values in the data. For example, differences between numerical values are meaningful, while differences between ordinal values are not. This means that for the `babies_small` DataFrame, the average birth weight has meaning but the "average" educational level does not.

`pandas` will not complain if we attempt to compute the mean of the values in the educational level column:

In [10]:
# Don't use this value in actual data analysis
babies_small['ed'].mean()

2.9215210355987056

This quantity, however, provides little useful information. We could just as easily have replaced the values in the `ed` column with their string descriptions: for example, replacing each `0` with `'less than 8th grade'`, each `1` with `'8th-12th grade'`, and so on. We would not say that the "average" of these strings is meaningful, and the average of the numeric codes that stand in for them is no more meaningful.

Although the size of the difference between two ordinal values is not meaningful, the direction of the difference is. For example, we can say that a mother with `ed=5` (college graduate) has a higher education level than a mother with `ed=2` (high school graduate).
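`pandas` can represent this idea with an ordered categorical, which permits comparisons but not arithmetic. The sketch below uses made-up codes, and it assumes that codes 0 through 5 in the data dictionary are listed from least to most education; the name `ed_levels` is hypothetical.

```python
import pandas as pd

# Education levels 0-5, in the order listed in the data dictionary.
ed_levels = [
    'less than 8th grade',
    '8th-12th grade, did not graduate',
    'HS graduate',
    'HS+trade',
    'HS+some college',
    'College graduate',
]

# Made-up codes; each code is a position in ed_levels.
ed = pd.Series(
    pd.Categorical.from_codes([5, 2, 4, 1], categories=ed_levels, ordered=True)
)

# Direction is meaningful: comparisons and max follow the category order.
more_than_hs = ed > 'HS graduate'
highest = ed.max()
print(more_than_hs.tolist())  # [True, False, True, False]
print(highest)                # College graduate

# Size is not meaningful: pandas refuses to average categorical values.
try:
    ed.mean()
except TypeError:
    print('mean is not defined for categorical data')
```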

For nominal data, by contrast, not even the direction of a difference has meaning. A mother with `race=6` (Mexican) and a mother with `race=7` (Black) simply have different races.
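An unordered categorical captures this restriction: we can count values and test for equality, but we cannot rank them. The following sketch uses made-up labels, not the real `race` column.

```python
import pandas as pd

# Made-up labels stored as an unordered categorical.
race = pd.Series(['mex', 'black', 'white', 'white', 'asian']).astype('category')

# Counting and equality checks make sense for nominal data.
counts = race.value_counts()
is_white = race == 'white'
print(counts.loc['white'])  # 2
print(is_white.tolist())    # [False, False, True, True, False]

# Ordering does not: pandas raises TypeError for < on unordered categoricals.
try:
    race < 'white'
    ordering_allowed = True
except TypeError:
    ordering_allowed = False
print(ordering_allowed)     # False
```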

### Example: Infant Health



In [11]:
babies

Unnamed: 0,id,pluralty,outcome,date,...,inc,smoke,time,number
0,15,5,1,1411,...,1,0,0,0
1,20,5,1,1499,...,4,0,0,0
2,58,5,1,1576,...,2,1,1,1
...,...,...,...,...,...,...,...,...,...
1233,9213,5,1,1672,...,3,1,1,2
1234,9229,5,1,1680,...,1,0,0,0
1235,9263,5,1,1668,...,6,0,0,0


In [6]:
scores = pd.read_csv('data/SFBusinesses/inspections.csv')
scores

Unnamed: 0,business_id,score,date,type
0,19,94,20160513,routine
1,19,94,20171211,routine
2,24,98,20171101,routine
...,...,...,...,...
14219,94142,100,20171220,routine
14220,94189,96,20171130,routine
14221,94231,85,20171214,routine


In [7]:
housing = pd.read_csv('data/SFHousing.csv')
housing

Unnamed: 0.1,Unnamed: 0,county,city,zip,...,lat,quality,match,wk
0,1,Alameda County,Alameda,94501.0,...,37.76,gpsvisualizer,Exact,2003-04-21
1,2,Alameda County,Alameda,94501.0,...,37.76,QUALITY_ADDRESS_RANGE_INTERPOLATION,Exact,2003-04-21
2,3,Alameda County,Alameda,94501.0,...,37.77,QUALITY_ADDRESS_RANGE_INTERPOLATION,Exact,2003-04-21
...,...,...,...,...,...,...,...,...,...
281503,348191,Sonoma County,Sonoma,95476.0,...,38.28,QUALITY_CITY_CENTROID,Exact,2006-05-29
281504,348192,Sonoma County,Windsor,95492.0,...,38.55,QUALITY_EXACT_PARCEL_CENTROID,Relaxed; Soundex,2006-05-29
281505,348193,Sonoma County,Windsor,95492.0,...,38.54,QUALITY_CITY_CENTROID,Exact,2006-05-29
