<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# What is Data Science?
 
_Authors: Alexander Egorenkov (DC), Amy Roberts (NYC)_
 
---


## Demo

### Building a Model to Predict Housing Prices

Let's build a model to predict the price that a house will sell for. Such a model might be useful for identifying underpriced houses that you could flip for a profit.

Press `Shift + Return` to run each of the code cells below. You won't understand everything that is happening here; the goal is just to get the big picture.

In [1]:
# Load the "Pandas" library -- think of it as spreadsheets in Python
import pandas as pd

In [2]:
# Use pandas to load in the data
ames_df = pd.read_csv('../assets/data/ames_train.csv')

In [3]:
# Look at the first five rows
ames_df.head(5)

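Unnamed: 0 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice
---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000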


In [4]:
# To get us started, use just the numeric columns without missing data
ames_df = ames_df.select_dtypes(['int64', 'float64']).dropna(axis='columns')

In [5]:
# Split the data into the column `y` we want to predict and the 
# columns `X` we will use to make the predictions
X = ames_df.drop('SalePrice', axis='columns')
y = ames_df.loc[:, 'SalePrice']

In [6]:
# Set aside 25% of the data for testing the model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [7]:
# Import a model class
from sklearn.ensemble import RandomForestRegressor

# Create a model from that class
rfc = RandomForestRegressor(n_estimators=200)

# Ask the model to learn a function that predicts `y` from `X`
rfc.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

That's it! We just built a machine learning model.

Is it any good? Let's test it on data that it hasn't seen.

In [8]:
# Score the model on the test data
rfc.score(X_test, y_test)

0.884335749450735

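This score (known as R²) tells us, roughly, what fraction of the error we would make by guessing the average price every time is eliminated by our model. A score of 1 means perfect predictions, a score of 0 means the model does no better than guessing the average, and a negative score means it does worse. Our score is substantially greater than 0, so our model works!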
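
To make the score more concrete, here is a minimal sketch of the calculation behind it, assuming the score is the R² statistic that scikit-learn regressors report from `.score`. The function name `r_squared` and all of the prices are made up for illustration; they are not taken from the Ames data.

```python
def r_squared(actual, predicted):
    """Return 1 - (model's squared error) / (squared error of guessing the mean)."""
    mean_actual = sum(actual) / len(actual)
    model_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    baseline_error = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - model_error / baseline_error


# Hypothetical sale prices and two sets of predictions
actual = [100_000, 150_000, 200_000, 250_000]
good_predictions = [110_000, 140_000, 210_000, 240_000]
mean_guess = [175_000] * 4  # always guess the average price

print(r_squared(actual, good_predictions))  # close to 1 (about 0.97)
print(r_squared(actual, mean_guess))        # 0.0: no better than the average
```

Notice that the errors are squared before they are added up, so a few large misses hurt the score more than many small ones.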

We can also look at the model's predictions and compare them to the actual values:

In [9]:
# Pair each prediction with the actual sale price from the test set
list(zip(rfc.predict(X_test), y_test))

[(197717.14, 184000),
 (105418.25, 84500),
 (158912.41, 141000),
 (249336.315, 239000),
 (165023.18, 154000),
 (145496.3, 149000),
 (321410.27, 336000),
 (134338.5, 139500),
 (136696.11, 140000),
 (83430.235, 105000),
 (186888.775, 192000),
 (91595.415, 106250),
 (189340.705, 196000),
 (319238.21, 309000),
 (180807.2, 155835),
 (242161.185, 268000),
 (301019.005, 395000),
 (236698.995, 239799),
 (141212.205, 161500),
 (155716.12, 176000),
 (102753.355, 108500),
 (81908.175, 68400),
 (237098.295, 232600),
 (79574.465, 80500),
 (129869.81, 131000),
 (346863.33, 301500),
 (73540.0, 60000),
 (236625.7, 214000),
 (127267.385, 89500),
 (170709.495, 178000),
 (248455.425, 245000),
 (129056.235, 159434),
 (171676.625, 178000),
 (273880.25, 339750),
 (134744.925, 143250),
 (280617.575, 290000),
 (143399.945, 140000),
 (145373.57, 117500),
 (145148.695, 143900),
 (130866.805, 125000),
 (140065.175, 137500),
 (114439.565, 85000),
 (103373.75, 91000),
 (127197.465, 128950),
 (275776.585, 236500),


### Summary

Here are the high-level steps we performed:

- Load some data.
- Split the data by columns into the "target variable" we want to predict and the "feature variables" we want to use to predict it.
- Split the data by rows into a "training set" that we will use to teach the model and a "test set" that we will use to evaluate its performance.
- Fit a model on the training set. (This is where the model learns a relationship between the features and the target.)
- Evaluate the model on the test set by measuring the accuracy of its predictions.

These steps are the bare minimum for the kind of data modeling that is the focus of this course.
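
The two splits are easy to mix up, so here is a rough sketch of both using only plain Python lists. The helper `split_rows` and the house data are hypothetical, and this is not how `train_test_split` is implemented; it only illustrates the idea of splitting by rows and then by columns.

```python
import random


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows, then hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical rows: (square footage, sale price)
houses = [(850, 120_000), (1200, 165_000), (1500, 210_000), (1750, 240_000),
          (2000, 275_000), (2300, 310_000), (2600, 355_000), (3000, 400_000)]

# Split by rows into a training set and a test set
train_rows, test_rows = split_rows(houses)

# Split each set by columns into features (X) and target (y)
X_train = [[sqft] for sqft, _ in train_rows]
y_train = [price for _, price in train_rows]
X_test = [[sqft] for sqft, _ in test_rows]
y_test = [price for _, price in test_rows]

print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Shuffling before splitting matters: if the rows were sorted by price, an unshuffled split would test the model only on the most expensive houses.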

## Types of Data Work

This course focuses on **supervised learning** with **tabular**, **cross-sectional** data.

### Supervised Learning (a.k.a. "Predictive Modeling")

Given a bunch of examples with input features and an output label, predict the output label for new examples.

**Examples:**

- Predict the price of a house based on its neighborhood, number of bedrooms, etc.
- Predict whether an email is spam or "ham" based on its contents.

Predicting a *continuous value* such as house price is called **regression**.

Predicting a *discrete category* such as spam or ham is called **classification**.
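
The sketch below shows that the same "learn from labeled examples" idea covers both cases. It uses a deliberately simple stand-in technique (copy the label of the most similar known example, often called a nearest-neighbor approach), not the random forest from the demo. The function name, features, and data are all hypothetical.

```python
def nearest_label(examples, new_features):
    """Return the label of the labeled example closest to new_features."""
    def squared_distance(features):
        return sum((a - b) ** 2 for a, b in zip(features, new_features))

    _, label = min(examples, key=lambda example: squared_distance(example[0]))
    return label


# Regression: the label is a continuous value (a price in dollars).
# Features: (bedrooms, bathrooms)
house_examples = [((2, 1), 150_000), ((3, 2), 220_000), ((5, 3), 410_000)]
print(nearest_label(house_examples, (4, 2)))   # 220000

# Classification: the label is a discrete category ("spam" or "ham").
# Features: (number of links, number of exclamation marks)
email_examples = [((0, 0), "ham"), ((1, 1), "ham"), ((7, 9), "spam")]
print(nearest_label(email_examples, (6, 8)))   # spam
```

The code is identical in both cases; only the type of label changes.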

**Major challenges:**

- Getting good labeled data.
- Keeping focus on business value rather than just building the most accurate model.

### Unsupervised Learning

Given a bunch of examples with features, find some kind of structure.

**Examples:**

- Put coins into groups that are similar to one another in terms of weight, composition, etc.
- Identify five traits that capture a large proportion of the personality variation among people.
- Flag unusual-looking credit card transactions.

Representing objects as members of *groups* is called **clustering**.

Representing objects in terms of a smaller number of features than you started with is called **dimensionality reduction**.

Identifying unusual objects is called **anomaly detection**.

**Major challenge:** Evaluating performance in the absence of labels.
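
As one illustration, here is a minimal anomaly-detection sketch for the credit card example. It assumes a very simple rule (flag amounts that are more than two standard deviations from the mean), which is only one of many possible approaches. The function name, the threshold, and the transaction amounts are made up.

```python
from statistics import mean, stdev


def flag_unusual(amounts, threshold=2.0):
    """Return the amounts that lie more than `threshold` standard deviations from the mean."""
    average = mean(amounts)
    spread = stdev(amounts)
    return [x for x in amounts if abs(x - average) / spread > threshold]


# Hypothetical transaction amounts in dollars
amounts = [12.50, 8.99, 23.10, 15.00, 9.75, 18.40, 11.20, 950.00, 14.30, 20.05]

print(flag_unusual(amounts))  # [950.0]
```

Note that nothing in the data says which transactions are actually fraudulent, which is exactly why evaluating this kind of model is hard.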

**Exercise (6 mins, in groups).**

Apply two of the following labels to each task below. For instance, the following task would get the labels "supervised learning" and "regression": "Given data on prior home sales that includes home features (e.g. number of bedrooms) and sales price, predict sales prices for a new set of homes described by the same features."

**Labels:**

- Supervised learning
- Unsupervised learning
- Regression
- Classification
- Clustering
- Dimensionality reduction
- Anomaly detection

**Tasks:**

- Given a set of music audio files, group files that seem to have similar musical styles (without labeling them as belonging to particular genres).

- Derive "musical fingerprints" that allow an algorithm running on a remote server to identify what song a phone user is hearing with as little data transmission as possible. (For this exercise, focus on the process of creating the musical fingerprints -- using those fingerprints to identify songs is a separate step.)

- Given sensor data from a locomotive and times within the data set in which the engine failed, use new sensor data to predict whether or not the engine will fail in the next two weeks.

- Given sensor data from a locomotive, identify periods of time in which the engine is behaving abnormally.

- Given a set of chest X-rays with physicians' diagnoses of pneumonia and other diseases, identify which patients represented in a new set of chest X-rays have pneumonia.

- Given sensor data from a locomotive that includes GPS and fuel consumption information, predict how much fuel a locomotive will consume on a trip between two specified points.

$\blacksquare$

### Other Types of Data Work

There are other kinds of data work that are not machine learning:

- **Machine learning** makes **many decisions** by producing a model that can be applied again and again (e.g. predicting prices for as many houses as you can give it).
- **Statistics** makes **one decision** by testing a hypothesis or estimating a parameter (e.g. does making a button red rather than yellow lead to higher click-through rates?).
- **Analytics** makes **no decisions**. Instead, it surfaces information to human decision makers through metrics and visualization.

"Data science" encompasses both machine learning and statistics. Data analytics is generally considered "not data science."

## Types of Data

### Tabular Data (a.k.a. "Structured Data")

*Tabular data* is data arranged in tables (rows and columns).

### Unstructured Data

We can also do machine learning with "unstructured data" such as images, video, text, and audio. For instance, you could train a model to extract the text from images of street signs.

<img src="../assets/images/street_signs.jpg" width=400>

### Cross-Sectional Data

<img src="../assets/images/cross_sectional_data.png" width=400>

Data collected at a single point in time for each individual is called **cross-sectional data**.

### Time Series/Longitudinal Data

<img src="../assets/images/time_series_data.png" width=400>

Data on the same variables collected at multiple time points from the same individuals is called **longitudinal** or **time-series** data.
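
The difference shows up in the shape of the table. In the sketch below, the cross-sectional rows use two sale prices from the demo, while the longitudinal rows (yearly assessed values) are entirely hypothetical and exist only to show the layout.

```python
from collections import defaultdict

# Cross-sectional: one row per house, observed once
cross_sectional = [
    {"house_id": 1, "sale_price": 208_500},
    {"house_id": 2, "sale_price": 181_500},
]

# Longitudinal: several rows per house, one per time point (hypothetical values)
longitudinal = [
    {"house_id": 1, "year": 2006, "assessed_value": 190_000},
    {"house_id": 1, "year": 2007, "assessed_value": 200_000},
    {"house_id": 1, "year": 2008, "assessed_value": 205_000},
    {"house_id": 2, "year": 2006, "assessed_value": 170_000},
    {"house_id": 2, "year": 2007, "assessed_value": 178_000},
    {"house_id": 2, "year": 2008, "assessed_value": 180_000},
]

# Group the longitudinal rows by house to recover each house's history
history = defaultdict(list)
for row in longitudinal:
    history[row["house_id"]].append((row["year"], row["assessed_value"]))

for house_id, observations in history.items():
    print(house_id, observations)
```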

## Summary

This course focuses on getting a computer to learn from examples how to predict a target variable from various feature variables, using data that is organized into rows and columns (one column per variable, one row per individual), where time is not an important factor and we are not interfering in the system.

**Exercise.**

List five specific product features that you think use data science (e.g. Netflix movie recommendations).

$\blacksquare$

# The Data Science Workflow

![](../assets/images/Data-Framework-White-BG.png)

**Notes**

- The idea of a "hypothesis-driven approach" is more appropriate for statistics than machine learning. For us, the "frame" step has more to do with identifying a practically significant use case for supervised learning.
- For us, the primary output of the "Analyze" phase is a model that can be used to automate decisions. The "Interpret" and "Communicate" phases are primarily about quantifying the value of the model and getting people to use it. However, the process of creating the model can also generate more one-off "insights" that you will want to interpret and communicate like an analyst or statistician.
- This process is typically iterative.
- Talking with subject-matter experts early and often greatly increases your chances of producing a useful result.

## Application: Data Science Workflow Through Ames Data

### 1. Frame

Identify:

- High-level business objectives
- Deliverables
- Success criteria
- Relevant data sets

#### High-Level Business Objectives

Suppose a real estate company wants to predict prices for houses so that it can more reliably buy them at a discount, make cost-effective improvements, and sell them for a large profit.

#### Deliverables

E.g.

* Presentation to the real estate team
* Business report discussing results, procedures used, and rationales
* API that provides estimated returns

#### Success Criteria

This project will be considered a success if the estimated returns provided by the API are at least as accurate as the estimates that the company currently produces manually (while saving time).

**Note:** It can be difficult to predict what level of performance a data science model will be able to achieve before you dig into the data and start building models. **Keep your criteria for success minimal** and **figure out as quickly as possible whether you are going to fail**.
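
A success criterion like this one is most useful when it can be checked with numbers. The sketch below assumes we measure accuracy with mean absolute error and that we have past deals where the actual return is known; both the metric choice and all of the figures are hypothetical.

```python
def mean_absolute_error(actual, estimated):
    """Average size of the gap between estimates and actual values."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)


# Hypothetical returns (in dollars) on five past deals
actual_returns = [30_000, 12_000, 45_000, 8_000, 22_000]
manual_estimates = [20_000, 20_000, 30_000, 15_000, 25_000]  # current process
model_estimates = [27_000, 15_000, 40_000, 10_000, 20_000]   # from the API

manual_error = mean_absolute_error(actual_returns, manual_estimates)
model_error = mean_absolute_error(actual_returns, model_estimates)

# Success criterion: the model is at least as accurate as the manual estimates
meets_criterion = model_error <= manual_error
print(manual_error, model_error, meets_criterion)  # 8600.0 3000.0 True
```

Running a check like this on a small sample early in the project is one way to find out quickly whether you are going to fail.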

#### Relevant Data Sets

**Key questions:**

- What data would be ideal?
- What data is available?
- What can we do to close that gap?
- Is it plausible that we can succeed with the data we can get?

**Subsidiary questions:**

- Where is the data set coming from? How was it collected? Can it be trusted?
- What variables does it contain?
- If the data is spread across multiple pieces, how do those pieces fit together?
- Do our data appropriately align with the question/problem statement?
- Is this data set aggregated? Can we use the aggregation, or do we need to obtain it pre-aggregation?
- Is there enough data?
- Does the data cover all of the types of situations (times, places, etc.) to which we want to apply our model?
- Is the data representative?
- How can we access it (e.g. file, database, web API, web scraping)?
- What are the most appropriate tools for working with the data, given its size and format?

**Exercise (in groups).**

Answer the following questions about the [Ames housing data set](../assets/data/ames_data_documentation.txt).

- What would an ideal data set look like for the project of predicting housing prices?

- How closely does the Ames housing data set match the ideal data that you envisioned?

- Would it be sufficient for our purposes?

- What limitations does it have?

$\blacksquare$

### 2. Prepare

Data scientists often work with data that they did not collect ("secondary data"), so they have to use *data dictionaries* and other documentation to learn what the data set contains and how it was gathered.

Here's an example of a data dictionary:

Variable | Description | Type of Variable
---| ---| ---
Square Footage | Floating Point | Continuous
Street Type | 1 - Gravel, 2 - Paved | Categorical
Neighborhood | String, e.g., 'Lake View' | Categorical
Number of Bedrooms | Integer | Discrete
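
A data dictionary becomes even more useful when you can check records against it. Here is a minimal sketch that encodes the table above as a Python dictionary and reports problems in a record. The key names, the structure of `DATA_DICTIONARY`, and the helper `find_problems` are all assumptions made for illustration.

```python
# Hypothetical machine-readable version of the data dictionary above
DATA_DICTIONARY = {
    "square_footage": {"python_type": float, "kind": "continuous"},
    "street_type": {"python_type": int, "kind": "categorical",
                    "codes": {1: "Gravel", 2: "Paved"}},
    "neighborhood": {"python_type": str, "kind": "categorical"},
    "bedrooms": {"python_type": int, "kind": "discrete"},
}


def find_problems(record):
    """Return a list of ways in which a record disagrees with the data dictionary."""
    problems = []
    for name, spec in DATA_DICTIONARY.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, spec["python_type"]):
            problems.append(f"{name}: expected {spec['python_type'].__name__}")
        elif "codes" in spec and value not in spec["codes"]:
            problems.append(f"{name}: unknown code {value}")
    return problems


good = {"square_footage": 2201.3, "street_type": 2,
        "neighborhood": "Lake View", "bedrooms": 3}
bad = {"square_footage": "2201.3", "street_type": 3, "neighborhood": "Lake View"}

print(find_problems(good))  # []
print(find_problems(bad))
```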

**Common data preparation steps**:

- Addressing missing values
- Addressing outliers
- Restructuring
- Reformatting
- Aggregating
- Transforming
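
To give a feel for the first two steps, here is a small sketch that fills in a missing value with the median and caps an extreme value. These are only two of many possible strategies, and the right choice depends on the data. The helper names, the limits, and the lot frontage values (including the missing entry and the outlier) are hypothetical.

```python
from statistics import median


def fill_missing(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    return [fill_value if v is None else v for v in values]


def clip(values, lower, upper):
    """Pull values outside [lower, upper] back to the nearest limit."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical lot frontage column with one missing value and one outlier
lot_frontage = [65.0, 80.0, None, 60.0, 84.0, 700.0]

filled = fill_missing(lot_frontage)
cleaned = clip(filled, lower=20.0, upper=200.0)

print(filled)   # [65.0, 80.0, 80.0, 60.0, 84.0, 700.0]
print(cleaned)  # [65.0, 80.0, 80.0, 60.0, 84.0, 200.0]
```

The median is used here rather than the mean because the outlier (700.0) would drag the mean upward.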

![](../assets/images/clean_data_borat.png)

### 3. Analyze

#### Descriptive Modeling

Data scientists use statistics such as frequencies, means, and standard deviations to give compact descriptions of their data sets.

Variable | Mean or Frequency (%)
---| ---
Square Footage | 2201.3
Street Type - Gravel | 8%
Street Type - Paved | 92%
Number of Bedrooms | 1.8
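
A table like this takes only a few lines of code to produce. The sketch below uses Python's standard library on a tiny, made-up data set, so its numbers will not match the table above.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical data set with one row per house
houses = [
    {"square_footage": 1500.0, "street_type": "Paved", "bedrooms": 2},
    {"square_footage": 2000.0, "street_type": "Paved", "bedrooms": 3},
    {"square_footage": 2500.0, "street_type": "Gravel", "bedrooms": 3},
    {"square_footage": 3000.0, "street_type": "Paved", "bedrooms": 4},
]

# Numeric variables: summarize with the mean and standard deviation
square_footage = [house["square_footage"] for house in houses]
mean_square_footage = mean(square_footage)
stdev_square_footage = stdev(square_footage)

# Categorical variables: summarize with the frequency of each category (%)
street_counts = Counter(house["street_type"] for house in houses)
street_frequencies = {street: 100 * count / len(houses)
                      for street, count in street_counts.items()}

print(mean_square_footage)   # 2250.0
print(street_frequencies)    # {'Paved': 75.0, 'Gravel': 25.0}
```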

#### Predictive Modeling

Data scientists build models to predict either discrete outcomes (e.g. this house will / will not sell in the next month) or continuous values (e.g. this house will sell for $358,000).

### 4. Interpret

- Check your model for correctness.
- Determine what your model is really telling you, keeping in mind the limitations of your data and modeling techniques.
- Determine what one-off recommendations your model supports and/or what kinds of ongoing decisions it can support.
- Get input from subject-matter experts!

### 5. Communicate

Without effective communication, your work will not be used.

- Identify your goals.
- Put the bottom line front and center: _"Kitchen renovations have a positive return on investment, while other renovations do not."_
- Speak the language of your audience (often $$$).
- Practice, ideally with a real audience that can give useful feedback.

**Iterate, iterate, iterate.**

<a id="summary1"></a>
## Summary

Use the data science workflow to develop solutions.
  - **Frame** a problem by identifying an opportunity to save money and/or generate new revenue through supervised learning.
  - **Prepare** your data.
  - **Analyze** your data.
  - **Interpret** the results of your analysis in terms of your business.
  - **Communicate** your results to different audiences.

# Why Not Use GUI Tools (Excel, Power BI, Tableau, etc.)?

Excel tends to show the data and hide the code, while programming languages tend to show the code and hide the data. Excel is fantastic for non-mission-critical work that involves fairly simple logic.

Tableau and similar tools are fantastic for analytics work.

Programming languages such as Python have a number of advantages over GUI tools for machine learning and other data work:

- **Flexibility:** Standard programming languages are [Turing complete](https://en.wikipedia.org/wiki/Turing_completeness), so they can (in principle) solve any problem that can be solved with a computer.
- **Performance:** Numerical computing libraries such as NumPy for Python are highly optimized for speed and memory efficiency. If that's not enough, you can drop down to a lower-level language for further optimizations and/or run many jobs in parallel across multiple computers. By contrast, an Excel worksheet tops out at 1,048,576 rows, so you cannot even open larger data sets in full.
- **Extensibility:** Anyone in the world can create and publish a package that extends the language. As a result, the package ecosystem for doing data science in Python is incredible, and it is constantly improving.
- **Transparency and Reproducibility:** When you use code to analyze data, that code serves as a record of exactly what you did. Having such a record makes it easier to catch errors and to apply the same steps to new data. By contrast, [Excel is notoriously good at hiding errors](https://www.theverge.com/2013/4/17/4234136/excel-calculation-error-infamous-economic-study).
- **Deployability:** Code can be set up to run automatically, e.g. to create scheduled reports or to recommend products on a website in real time.
- **Version Control:** Code can be put into a version control system so that you can inspect how it has changed over time and roll it back when something goes wrong.
- **Cost:** Python is free.