# Data Science is Data ... and Science

__Great... What does that mean?__

<img src="https://materials.s3.amazonaws.com/i/QS5fKym.jpg" width=700>

__Wow ... that's a whole lot of ... stuff to learn.__

We'll take it one step at a time. Note that the *model* part that everyone loves comes near the end, and isn't the biggest part. (You may have seen a much less pretty Google image that makes this same point.)

*We're going to start with __Data__ and the __questions__ we might ask*

## Data

<img src="https://materials.s3.amazonaws.com/i/rbcnjIr.jpg">
<br/><br/>
* __Kinds__
  * String, discrete / categorical, continuous
  * Encodings, precision, etc.
* __Records and Fields__
  * Data Dictionaries
    * e.g., "1=North America, 2=South America"
  * Where does the "buck stop" on business knowledge?
    * e.g., Who knows how/why/when Costa Rica was assigned to ... ?
* Sources
* Size
* Sampling
* vs. "capta"
* Value? (e.g., neighbor)
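
To make the idea of kinds and data dictionaries concrete, here is a minimal sketch using pandas. The records, the column names (`customer`, `region_code`, `spend`), and the `region_dictionary` name are all made up for illustration; only the "1=North America, 2=South America" mapping comes from the notes above.

```python
import pandas as pd

# Hypothetical raw records: the region is stored as an integer code.
records = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],      # string field
    "region_code": [1, 2, 2, 1],           # categorical, encoded as integers
    "spend": [10.5, 3.25, 8.0, 1.0],       # continuous field
})

# Data dictionary based on the example "1=North America, 2=South America".
region_dictionary = {1: "North America", 2: "South America"}

# Decode the integer codes and store the result as a categorical column.
records["region"] = (
    records["region_code"].map(region_dictionary).astype("category")
)

print(records.dtypes)
print(records["region"].value_counts())
```

Any code that is missing from the dictionary becomes a missing value after `map`, which is one quick way to discover that the data dictionary is incomplete.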

## Questions 
### Experiments, the Science part of Data Science

<img src="https://materials.s3.amazonaws.com/i/QfaKplG.jpg" width=600>
<br><br>

* Prediction
  * Past data to predict future (e.g., weather)
  * Local data to predict general (e.g., handwriting recognition)
* Experiments need (ideally)...
  * Clear hypothesis
  * Clear "null hypothesis"
  * Reproducibility
  * Methodology chosen *before* looking at the data/results
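
As a rough illustration of the checklist above, the sketch below runs a permutation test with NumPy. The two groups of measurements are invented, and the significance threshold `alpha` is an assumed value that is fixed before looking at the result. The null hypothesis is that the group labels do not matter, and the fixed random seed makes the run reproducible.

```python
import numpy as np

# Methodology fixed up front: two-sided permutation test on the difference
# in means, with a threshold chosen before looking at the results.
alpha = 0.05
n_permutations = 10_000
rng = np.random.default_rng(seed=42)  # fixed seed for reproducibility

# Hypothetical measurements for two groups.
group_a = np.array([12.1, 11.8, 13.0, 12.6, 12.9, 13.4])
group_b = np.array([11.2, 11.9, 11.5, 12.0, 11.1, 11.7])

# Hypothesis: the group means differ.
# Null hypothesis: the labels are exchangeable (no difference).
observed = group_a.mean() - group_b.mean()

pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
extreme = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:n_a].mean() - shuffled[n_a:].mean()
    if abs(diff) >= abs(observed):
        extreme += 1

# Fraction of shuffles at least as extreme as what we observed.
p_value = extreme / n_permutations
print(f"observed difference: {observed:.3f}, p-value: {p_value:.4f}")
print("reject null" if p_value < alpha else "fail to reject null")
```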

## Why is this Hard?

Poor understanding of the data
* Incomplete picture
  * e.g., small diamonds
* Biased picture
  * Cognitive biases, reasoning difficulties
    * Monty Hall
    * Simpson's paradox
  * Induced / selection biases
    * Berkson's paradox
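
Simpson's paradox is easier to believe once you see the arithmetic. The sketch below uses invented counts (the `treatment` and `case_type` labels are hypothetical) in which option A has the higher success rate inside every subgroup, yet option B looks better once the subgroups are pooled, because A was mostly applied to the hard cases.

```python
import pandas as pd

# Made-up counts, chosen only to illustrate the effect.
counts = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "case_type": ["easy", "hard", "easy", "hard"],
    "successes": [90, 600, 800, 50],
    "trials": [100, 1000, 1000, 100],
})

# Success rate within each subgroup: A beats B for easy and for hard cases.
by_group = counts.assign(rate=counts["successes"] / counts["trials"]).pivot(
    index="case_type", columns="treatment", values="rate"
)

# Success rate after pooling the subgroups: B now appears to beat A.
totals = counts.groupby("treatment")[["successes", "trials"]].sum()
overall = totals["successes"] / totals["trials"]

print(by_group)
print(overall)
```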

Poor separation of signal from noise

<img src="https://materials.s3.amazonaws.com/i/xvh6lXe.png" width=200>

Ill-posed problems ("Who are the best customers?" ... "What is this a picture of?")

## Where do we Start?

#### Exploratory Data Analysis

We can dive right in ... assuming we have easy access to query the data

## Lab Activity: SQL

Take a look at the SQLite example database schema: http://www.sqlitetutorial.net/sqlite-sample-database/

You can interactively query this database using the SQL GUI at http://www.sqlitetutorial.net/tryit/

Working in small groups, try to answer the following questions...

1. How many invoices are there?
2. What country has the highest average invoice amount?
3. How many distinct prices are there?
4. On average, how many tracks does an artist have in the database?
5. What artist has the fewest tracks in the database? The most?
6. What genre has the longest average tracks?
7. Are any employees also customers?
8. Report the top 10 artists by total revenue

*What are some limitations of this database?*
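
If you would rather practice from Python than from the web GUI, the standard-library `sqlite3` module is enough. The sketch below is not the sample database: it builds a tiny in-memory table named `orders`, with made-up columns and rows, just to show the shape of the counting, grouping, and distinct-value queries the questions call for.

```python
import sqlite3

# A throwaway in-memory database with a hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, country TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (country, total) VALUES (?, ?)",
    [("USA", 10.0), ("USA", 20.0), ("Chile", 30.0), ("Chile", 5.0), ("Norway", 40.0)],
)

# How many rows are there?
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# Average total per country, highest first.
avg_by_country = conn.execute(
    """
    SELECT country, AVG(total) AS avg_total
    FROM orders
    GROUP BY country
    ORDER BY avg_total DESC
    """
).fetchall()

# How many distinct totals are there?
distinct_totals = conn.execute(
    "SELECT COUNT(DISTINCT total) FROM orders"
).fetchone()[0]

conn.close()
print(order_count, avg_by_country, distinct_totals)
```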

### Querying a SQL database is pretty easy ... why isn't everything this easy?

Why wouldn't a company put every piece of data into a relational (SQL) database?

What are some tradeoffs associated with relational systems?

When are the tradeoffs worth it?

### But if we can get the data and query it, it's downhill from there, right?

<img src="https://materials.s3.amazonaws.com/i/f6reoje.png" width=200> <img src="https://materials.s3.amazonaws.com/i/guyiu4h.jpg"> <img src="https://materials.s3.amazonaws.com/i/3IFawyV.png" width=180>

* 2005 Dom Perignon Rose, 100 pts, $270
* 2009 Roederer Cristal, 96 pts, $179
* NV Mumm Napa Brut Prestige, 91-92 pts, $19

#### Try this...

1. What is the lowest-rated wine you can find online?
2. What is the range of wine ratings?

Suppose we build a dataset of wines and their ratings. 
1. What kind of business questions (and answers) might be useful?
2. Leaving aside the subjectivity of human tasters, what are some problems with this data?

Everything would be easier if one organization (?) used consistent standards (?) and data (?) to provide ratings, right?
* https://www.consumerreports.org/consumerist/can-you-trust-those-awards-you-see-in-auto-ads/
* https://money.cnn.com/2015/09/30/news/better-business-bureau-millions/index.html

It all works better at Internet scale, where millions of data points can be aggregated, right?
* https://www.forbes.com/sites/jimhandy/2012/08/16/think-yelp-is-unbiased-think-again
* https://www.sfgate.com/news/article/Yelp-can-give-paying-clients-better-ratings-5731200.php

But some companies just tabulate the data, and the scale is clear -- like the ubiquitous 1-5 stars, right?
* https://therideshareguy.com/how-to-avoid-deactivation-as-an-uber-and-lyft-driver/

At least it will be straightforward when we're analyzing physical, natural, plentiful data with simple sensors, right?
* https://gizmodo.com/why-cant-this-soap-dispenser-identify-dark-skin-1797931773

## What's the Point?

* Make sure we have the "right" data
* Understand the data
* Be careful about the interpretive tools we apply
* Consider the predictive power of the dataset

We'll talk more about statistics in a little bit.

### Where to start when someone just "hands us" a pile of data?

#### Data access, parsing, ingestion, "wrangling"

https://en.wikipedia.org/wiki/Data_wrangling

Let's begin with data that is real but, unlike most real-world data, fairly "clean".

### Lab Activity: CSV

In [2]:
! wget https://raw.githubusercontent.com/Yoctol/decision-tree-example/master/part_1_data.csv -O /tmp/housing.csv

/bin/sh: wget: command not found


In [4]:
import pandas as pd

__What is Pandas?!__

https://pandas.pydata.org/

https://pandas.pydata.org/pandas-docs/stable/10min.html

In [6]:
df = pd.read_csv("data/housing.csv", comment="#")

In [7]:
df

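| | in_sf | beds | bath | price | year_built | sqft | price_per_sqft | elevation |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2.0 | 1.0 | 999000 | 1960 | 1000 | 999 | 10 |
| 1 | 0 | 2.0 | 2.0 | 2750000 | 2006 | 1418 | 1939 | 0 |
| 2 | 0 | 2.0 | 2.0 | 1350000 | 1900 | 2150 | 628 | 9 |
| 3 | 0 | 1.0 | 1.0 | 629000 | 1903 | 500 | 1258 | 9 |
| 4 | 0 | 0.0 | 1.0 | 439000 | 1930 | 500 | 878 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 487 | 1 | 5.0 | 2.5 | 1800000 | 1890 | 3073 | 586 | 76 |
| 488 | 1 | 2.0 | 1.0 | 695000 | 1923 | 1045 | 665 | 106 |
| 489 | 1 | 3.0 | 2.0 | 1650000 | 1922 | 1483 | 1113 | 106 |
| 490 | 1 | 1.0 | 1.0 | 649000 | 1983 | 850 | 764 | 163 |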


In [8]:
df.describe()

| | in_sf | beds | bath | price | year_built | sqft | price_per_sqft | elevation |
|---|---|---|---|---|---|---|---|---|
| count | 492.0 | 492.0 | 492.0 | 492.0 | 492.0 | 492.0 | 492.0 | 492.0 |
| mean | 0.544715 | 2.155488 | 1.905691 | 2020696.0 | 1959.103659 | 1522.989837 | 1195.632114 | 39.845528 |
| std | 0.498503 | 1.305133 | 1.06815 | 2824055.0 | 40.579602 | 1014.366252 | 733.765622 | 44.673248 |
| min | 0.0 | 0.0 | 1.0 | 187518.0 | 1880.0 | 310.0 | 270.0 | 0.0 |
| 25% | 0.0 | 1.0 | 1.0 | 749000.0 | 1924.0 | 832.75 | 730.5 | 10.0 |
| 50% | 1.0 | 2.0 | 2.0 | 1145000.0 | 1960.0 | 1312.0 | 960.0 | 18.5 |
| 75% | 1.0 | 3.0 | 2.0 | 1908750.0 | 2001.0 | 1809.0 | 1419.0 | 61.0 |
| max | 1.0 | 10.0 | 10.0 | 27500000.0 | 2016.0 | 7800.0 | 4601.0 | 238.0 |


Some basic queries...

In [20]:
df.groupby('in_sf').mean()

In [21]:
df.groupby('in_sf').mean()['price']

In [22]:
df.groupby('in_sf').mean()[['price', 'sqft']]
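
The queries above compute the mean of every column and then pick out the ones of interest. One alternative, sketched here on a small made-up frame named `toy` that only borrows the column names `in_sf`, `price`, and `sqft`, is to name each aggregate explicitly, which also lets us mix different statistics in one result.

```python
import pandas as pd

# A tiny stand-in for the housing data; the values are invented.
toy = pd.DataFrame({
    "in_sf": [0, 0, 1, 1, 1],
    "price": [500_000, 700_000, 900_000, 1_100_000, 1_000_000],
    "sqft": [800, 1000, 900, 1500, 1200],
})

# Named aggregation: output column name = (input column, statistic).
summary = toy.groupby("in_sf").agg(
    listings=("price", "size"),
    mean_price=("price", "mean"),
    median_sqft=("sqft", "median"),
)

print(summary)
```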

In [23]:
import matplotlib.pyplot as plt

Ok, __what's Matplotlib??__

https://matplotlib.org/

https://matplotlib.org/tutorials/introductory/usage.html#sphx-glr-tutorials-introductory-usage-py

In [25]:
fig, ax = plt.subplots()
ax.plot(df.sqft, df.price, '.')
display(fig)

In [26]:
fig, ax = plt.subplots()
ny = df[df.in_sf == 0]
ax.plot(ny.sqft, ny.price, '.')
display(fig)

In [27]:
fig, ax = plt.subplots()
sf = df[df.in_sf == 1]
ax.plot(sf.sqft, sf.price, '.')
display(fig)

In [28]:
fig, ax = plt.subplots()
ax.hist(df[df.in_sf == 0].sqft, bins=40)
display(fig)

In [29]:
fig, ax = plt.subplots()
ax.hist(df[df.in_sf == 1].sqft, bins=40)
display(fig)

### Lab Activity

1. What is the most expensive NY property (in price per square foot)? In SF?
2. What are the top 10 highest elevation properties in SF? In NY?
3. If you *didn't know* whether a property was in SF or NY, what info in the dataset (besides the `in_sf` flag, obviously) might be useful to figure out the property's location?
4. How well would that work? Why?
5. Do properties seem to have a "price per bedroom"? How would you ask (or answer) that kind of question?
6. In SF, higher elevation properties might have a view, or they might be in nice neighborhoods like Nob Hill or Russian Hill. Does the elevation correlate with price?

## Data Cleansing

<img src="https://materials.s3.amazonaws.com/i/Data-Cleansing-tool.jpg" width=500>

Typical Problems
* Incomplete records / missing values
* Duplicate (or partial duplicate) records
* Impossible values
* Values that violate business rules
* Sampling/distribution problems
* Skewed values

Approaches to Cleansing/Repair
* Dropping records
* Repairing values from alternate sources
* Imputing values
* Upsampling/downsampling/stratified sampling
* Deskewing calculations
* Normalization (scale to 0-1) / standardization (mean 0, sd 1)
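
Several of these approaches can be sketched in a few lines of pandas. Everything in the example is hypothetical: the `raw` frame, its columns, and the business rule that an age below zero is impossible. Median imputation is just one choice among many, and whether it is appropriate depends on the data and the use case.

```python
import numpy as np
import pandas as pd

# Invented records: one exact duplicate, one missing age, one impossible age.
raw = pd.DataFrame({
    "id": [1, 1, 2, 3, 4],
    "age": [34.0, 34.0, np.nan, 51.0, -5.0],
    "income": [40_000.0, 40_000.0, 52_000.0, 61_000.0, 75_000.0],
})

# Dropping records: remove exact duplicates, then rows with impossible ages
# (missing ages are kept so they can be imputed below).
clean = raw.drop_duplicates().copy()
clean = clean[clean["age"].isna() | (clean["age"] >= 0)].copy()

# Imputing values: fill the missing age with the median of the known ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

# Normalization: rescale income to the range 0-1.
income = clean["income"]
clean["income_minmax"] = (income - income.min()) / (income.max() - income.min())

# Standardization: mean 0, sd 1 (pandas uses the sample sd, ddof=1).
clean["income_z"] = (income - income.mean()) / income.std()

print(clean)
```

Note that every step here changes what a downstream model sees, which is exactly why the warning below matters.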

__BEWARE!__

*Cleansing your data is like doing surgery: if you get it right, everyone will be happy ... and may not even notice anything happened.*

*But: if you don't understand the data and problems thoroughly, and if you are not thoughtful about the effect of your intervention, you can create worse problems:*
* System crashes
* Financial (business) losses due to poor human or machine decision-making from the data
* Legal liability for your company, your business unit, or yourself, due to violation of US or EU law around privacy, discrimination, accounting rules, etc.

## Terminology and "Features"

On the business side, we often refer to parts of a dataset as "records" and "fields"...

For example, in a denormalized dataset of customer purchases,
* information associated with a particular purchase is a record
* the data elements inside that record (e.g., customer last name, item name, price, time, etc.) are fields

__As we move our focus toward machine learning, we focus more on *features* associated with each record.__

So what is a feature? And how is it different from a field?

*Features* are values that
* we as data scientists (or in some cases an algorithm or program) select or create
* are chosen specifically to capture key information that will inform our model or goal
* may be related to one or more fields in a dataset
* may involve a change of representation from the underlying data

Here are some examples:

* Customer last name is a field in some records. Is it a feature? I.e., would we likely include it for its informational or predictive value about some business project? Probably not.

* Customer phone number is a field. Might it be a feature?
  * Area code and/or exchange (prefix) used to be geographically correlated ... so maybe ... especially if we are talking land lines only
  * On the other hand, the last 4 digits are rarely meaningful
  * For cell numbers, there might be some valuable info in the area code and exchange, but it's hard to know -- we might want to formulate an experiment to decide
  
* Customer age might be a field in some data. Is it a feature?
  * For some use cases, it might be extremely informative
  * For other use cases, it might be flat out illegal to use

* Suppose age is informative and legal for our project. But suppose that the exact age (26 vs. 27, say, or 53 vs. 55) is not terribly useful.
  * We might transform the raw age *field* into a discrete feature.
  * For example, we might group 0-17 as minors, 18-34 as young adult, 35-55 as middle age, etc.
  * This pattern is called "bucketing" because we group values into a smaller number of buckets (which stand in for groups of raw values)
  * This throws away some data but might make our model simpler or more accurate
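
Bucketing can be sketched with `pandas.cut`. The sample ages are invented. The first three buckets follow the ranges given above; the fourth label, "older adult", and the upper bound of 200 are assumptions standing in for the "etc."

```python
import pandas as pd

# Hypothetical raw ages (the *field*).
ages = pd.Series([5, 17, 18, 34, 35, 55, 56, 80], name="age")

# Bucket edges: with right=False each bucket includes its left edge and
# excludes its right edge, e.g. [0, 18) covers ages 0-17.
bins = [0, 18, 35, 56, 200]
labels = ["minor", "young adult", "middle age", "older adult"]

# The discrete *feature* derived from the raw field.
age_bucket = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"age": ages, "age_bucket": age_bucket}))
```

The choice of edges is itself a modeling decision, so it is worth checking how sensitive the results are to where the boundaries fall.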
  
There are many such patterns for extracting, transforming, and selecting features from a dataset for further use. There is no one right way to do it, and many machine learning experts have asserted that data and feature preparation is substantially *more* important to the end result than fancy mathematical algorithms.

#### APPENDIX

__CIFAR-10 Categories__
* airplane
* automobile
* bird
* cat
* deer
* dog
* frog
* horse
* ship
* truck

__Some info around wine scores:__

* https://www.winespectator.com/display/show/id/scoring-scale
* https://medium.com/the-global-wine-score/the-global-wine-score-data-distribution-a97cb67a1182