<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# What is Data Science?
 
_Instructor_: Tim Book, General Assembly DC
 
---


<a id="learning-objectives"></a>
## Learning Objectives
*After this lesson, you will be able to:*

- **Describe** the roles and components of a successful development environment.
- **Define** data science and the data science workflow.
- **Apply** the data science workflow to solve a task.
- **Discuss** common data science terminology and processes.

<a id="ds-real-world"> </a>
# Activity: Data Science in the Real World

- Give me five products or services that you think utilize data science.

- **Examples**

- Providing movie recommendations on Netflix.
- Making product suggestions on Amazon.
- Offering election and sports coverage on the stats site FiveThirtyEight.
- Calculating daily bet predictions on the fantasy sports site DraftKings.
- Returning auto-translate and search results on Google.

<a id="question"> </a>
# How to Ask Good Questions

### How Do Data Scientists Solve Problems?

Most practitioners apply a version of the **scientific method** in order to logically deconstruct and analyze an issue. At General Assembly, we'll call this the **data science workflow**, which we've broken down into a series of steps.

This problem-solving framework will help you produce results that are **reliable** (so that your findings will be more accurate) and **reproducible** (so that others can follow your steps and achieve the same results).

Note that, depending on the problem, this process is not always linear. You may require lots of iteration and repetition before any conclusions can be drawn!

<a id="good_q"></a>
## Asking a Good Question

All data science projects are different, but they all start with a **problem**. Thus, it is important to have a **problem statement**.

From this problem statement arise questions: questions we will ask of the data to gain more information so that we can attempt to solve the problem.

**Why do we need a good question?**

_“A problem well stated is half solved.”_ — Charles Kettering

A good question: 

- Sets you up for success as you begin analysis.

- Establishes the basis for reproducibility.

- Enables collaboration through clear goals.
    - It's hard to collaborate without a vision.

## Are you SMART?

One way to approach formulating a question is through goal-setting via the **SMART Goals Framework**:

- **Specific**: The data set and key variables are clearly defined.

- **Measurable**: The type of analysis and major assumptions are articulated.

- **Attainable**: The question you are asking is feasible for your data set and not likely to be biased.

- **Reproducible**: Another person (or future you) can read and understand exactly how your analysis is performed.

- **Time-bound**: You clearly state the time period and population to which this analysis pertains.

## What Are Some Common Questions Asked in Data Science?

**Machine learning more or less asks the following questions:**

- Does X predict Y? (Where X is a set of predictor variables and Y is an outcome.)

- Are there any distinct groups in our data?

- What are the key components of our data?

- Is one of our observations “weird”?

## What Are Some Common Questions Asked in Data Science?
**From a business perspective, we can ask:**

- What is the likelihood that a customer will buy this product?

- How much demand will there be for my service tomorrow?

- What groups of products are customers purchasing together?

- Can we automate this simple yes/no decision?

<a id="dswf"></a>
## Introduction: The Data Science Workflow

---

- **Frame**: Develop a hypothesis-driven approach to your analysis.
- **Prepare**: Select, import, explore, and clean your data. _**(80% of your work is here!)**_
- **Analyze**: Structure, visualize, and complete your analysis.
- **Interpret**: Derive recommendations and business decisions from your data.
- **Communicate**: Present (edited) insights from your data to different audiences.

![](./assets/Data-Framework-White-BG.png)

#### Notes about GA's Data Workflow

_Remember, these steps are not hard-set rules; instead, think of them as problem-solving guidelines._

- These steps are **iterative**; it's normal to go back and repeat certain steps a few times in a row.
- The process is **cyclical**; after completing the process, you may restart it on new findings.

# Application: Data Science Workflow Through Ames Data

## Frame
---
We work for a real estate company interested in using data science to determine the best properties to buy and resell. Specifically, our company would like to identify the characteristics of residential houses that predict their sale price and the cost-effectiveness of doing renovations.

#### What is the Business/Product Objective?

The client tells us their goals are to:
* Accurately predict house prices (so they can be sold for a larger profit).
* Identify which house features are more likely to lead to a foreclosure. (This could represent more profitable sales for the company.)

#### Identify and Hypothesize Goals and Criteria for Success

Ultimately, the customer wants us to:
* Deliver a presentation to the real estate team.
* Write a business report discussing results, procedures used, and rationales.
* Build an API that provides estimated returns.

#### Create a Set of Questions to Help You Identify the Correct Data Set

* Can you think of questions that would help this customer deliver on their business goals? 
* What sort of features or columns would you want to see in the data?

## Ideal Data vs. Available Data

Oftentimes, we'll start by identifying the *ideal data* we would want for a project.

_Then_, we learn about the limitations. We have a data dictionary for the Ames data [here](./extra-materials/ames_data_documentation.txt):

- 20 continuous variables indicating square footage.
- 14 discrete variables indicating number of each room type.
- 46 categorical variables containing 2–28 classes each, e.g., street type (gravel/paved) and neighborhood (city district name).

---

Do you think we can solve our problem with what we have?

Sometimes, we might realize we can't answer our question with the given data.

If we can't solve our problem, perhaps there is another (solvable) question we can answer? Go back to **Frame**.

### Data acquisition

---

- **What are some questions we should ask during the acquisition process?**

- Our Ames data set contains the following information:
    - [Ames Data Set Introduction PDF](./extra-materials/ames.pdf) (from the "Journal of Statistics Education")
    - "Data set contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010."

### Data Quality

---

- **What are some questions we should ask when checking the data for quality?**
  - [Ames Data Set Documentation](./extra-materials/ames_data_documentation.txt)

##  Prepare

---

Often, we are given **secondary data**, or data that were collected previously. In these cases, we have to learn as much as possible about our data using tools like data dictionaries and source documentation to determine how the set was gathered.

Here's an example of a data dictionary:

Variable | Description | Type of Variable
---| ---| ---
Square Footage | Floating Point | Continuous
Street Type | 1 - Gravel, 2 - Paved | Categorical
Neighborhood | String, e.g., 'Tenderloin' | Categorical
Number of Bedrooms | Integer | Discrete

## Prepare

What does "data cleaning" mean to you? How can data be "dirty"?

- Missing data
- Numbers in wrong format
- Dates in wrong format
- Text formatting

**Cleaning data is usually the most time-intensive part of the process.**
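
As a minimal sketch of what cleaning can look like, the snippet below tidies a few made-up records that show each kind of "dirt" listed above. The records, field names, and helper functions are hypothetical and are not taken from the Ames data; only the standard library is used so the example stays self-contained.

```python
from datetime import datetime

# Hypothetical raw records (not from the Ames data) showing common problems.
raw_rows = [
    {"sqft": "1,850", "sold": "2008-05-17", "street": " Paved "},
    {"sqft": "", "sold": "05/17/2009", "street": "GRAVEL"},
    {"sqft": "2400.0", "sold": "2010-01-03", "street": "paved"},
]


def clean_number(text):
    """Turn '1,850' into 1850.0; return None when the value is missing."""
    text = text.strip().replace(",", "")
    return float(text) if text else None


def clean_date(text):
    """Try each date format we expect; return None if none of them match."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean_row(row):
    """Return a cleaned copy of one record."""
    return {
        "sqft": clean_number(row["sqft"]),
        "sold": clean_date(row["sold"]),
        "street": row["street"].strip().lower(),
    }


clean_rows = [clean_row(row) for row in raw_rows]
for row in clean_rows:
    print(row)
```

Note that the missing square footage is kept as `None` rather than guessed; deciding how to handle missing values is a separate judgment call.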

## Analyze

---

Analysis often starts with basic statistics, commonly called **summary statistics**.

Statistics that we might expect for the earlier housing variables include:

Variable | Mean or Frequency (%)
---| ---
Square Footage | 2201.3
Street Type - Gravel | 8%
Street Type - Paved | 92%
Number of Bedrooms | 1.8

**Besides the mean, what other types of summary statistics might we be interested in?**
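
One possible way to compute a few summary statistics is sketched below with Python's built-in `statistics` module. The sample values are made up for illustration and do not come from the Ames data.

```python
import statistics
from collections import Counter

# Hypothetical sample of homes (values are made up for illustration).
sqft = [1500, 1800, 2100, 2400, 3200]
street = ["paved", "paved", "gravel", "paved", "paved"]

# Numeric variable: measures of center and spread.
sqft_mean = statistics.mean(sqft)      # arithmetic average
sqft_median = statistics.median(sqft)  # middle value, less sensitive to outliers
sqft_stdev = statistics.stdev(sqft)    # sample standard deviation
sqft_range = (min(sqft), max(sqft))    # smallest and largest values

# Categorical variable: share of each class, as a percentage.
counts = Counter(street)
street_pct = {label: 100 * n / len(street) for label, n in counts.items()}

print(sqft_mean, sqft_median, round(sqft_stdev, 1), sqft_range)
print(street_pct)
```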

## Creating a Predictive Model 
---
We generate predictive models based on the SMART goal we decided upon earlier.

**What are some other business goals we can support as data scientists for this realty company? What are some values we would like to guess?**

**What do you think are the steps for model building?**

_We'll be spending much of our time in this course on data analysis and predictive modeling._

## Interpret

---

### Develop Recommendations and Decisions

**Now that you have a model, what are some things you should check?**

Can you convert your model's finding into a conclusion or next step for your employer?

Think:
* Understandability
* Actionability

## Communicate

---

#### Share the Results of Your Analysis  

**NO ONE CARES ABOUT YOUR RESULTS IF THEY CAN'T UNDERSTAND THEM!** The best analysis in the world is useless if you can't communicate it.

The most basic form of a data science presentation should include a simple sentence that describes your results:

_"Customers from large companies had twice (CI 1.9, 2.1) the odds for placing another order with Planet Express compared to customers from small companies."_

Need some inspiration? I'm a huge fan of [FiveThirtyEight.com](http://fivethirtyeight.com). They do an amazing job of presenting complex statistical findings to a general audience. 

## Communicate
---
#### Public Speaking

When crafting a presentation, always consider your audience and make sure to **practice** your presentation beforehand. I practice for these lectures by presenting the entire class alone in my room. This is when I realize the things I wrote down might not work when speaking them aloud.

**Anticipate** issues or questions your audience might have.

Consider **who your audience is**. A presentation created for your fellow data scientists will be vastly different from a presentation intended for executives trying to make a business decision.

## Wisdom from Albert
---
#### _"Everyone is your client." -Albert Lee_

* Your client is your client.

* Your manager is your client.

* Your coworker is your client.

* Your subordinate is your client.

If you communicate with _everyone_ as if they are your client, you will go far.

## A Note About Iteration

We went through all five steps of the **Data Science Workflow**. That does not mean you will pass through each step exactly once. It's almost certain you will need to iterate. Go back and reformulate your question. Edit code that you thought was perfect. The work of a data scientist is never linear.

**What are some things you may want to redo or iterate over after presenting your findings?**

<a id="summary1"></a>
# Summary

---

1) **Crafting good questions is key.** <br>
  - Without a thoughtful, targeted, and SMART question, it can be difficult to create an effective model.
  
2) **Use the data science workflow to iteratively develop solutions.** <br>
  - **Frame**: Develop a hypothesis-driven approach to your analysis.
  - **Prepare**: Select, import, explore, and clean your data.
  - **Analyze**: Structure, visualize, and complete your analysis.
  - **Interpret**: Derive recommendations and business decisions from your data.
  - **Communicate**: Present (edited) insights from your data to different audiences.
  
3) **Informed by your past work, continue to refine your findings and models.** <br>
  - While the data science workflow may appear to be linear, we consistently return to past steps to implement new findings.

<a id="ML"></a>

# Introduction: Machine Learning

<a id="common-ml-defs"> </a>
## Common Machine Learning Definitions

There are two main categories of machine learning: **supervised learning** and **unsupervised learning**.

**Supervised learning (a.k.a., “predictive modeling”):**  
_Classification and regression_
- Predicts an outcome based on input data.
    - Example: Predicts whether an email is spam or ham.
- Attempts to generalize.
- Requires past data on the element we want to predict (the target).

**Unsupervised learning:**  
_Clustering and dimensionality reduction_
- Extracts structure from data.
    - Example: Segmenting grocery store shoppers into “clusters” that exhibit similar behaviors.
- Attempts to represent.

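Oftentimes, we may combine both types of machine learning in a project, for example by learning a better representation from unlabeled data to reduce the cost of collecting labeled data. When a representation learned on one task or data set is reused for another, this is referred to as **transfer learning**.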

Unsupervised learning tends to present more difficult problems because its goals are sometimes unclear.

<a id="supervised"></a>
## Supervised Learning

Supervised learning tends to be the most frequent type of work that data scientists do and will be the main focus of this course. How does supervised learning work?

1) We train a **machine learning model** (more on that shortly) using **labeled data** (the _"Y"_ variable). <br>
- The model learns some kind of relationship between the features (X) and the response (Y).

2) We make predictions on **new data** for which the response is unknown. <br>

The primary goal of supervised learning is to build a model that “generalizes” — i.e., accurately predicts the **future** rather than the **past**!

## Practice: Classification vs. Regression

There are two categories of supervised learning:

**Regression**
- The outcome we are trying to predict is a continuous value.
    - **Can you think of anything we would want to predict like this?** 

**Classification**
- The outcome we are trying to predict is categorical (i.e., it comes in one of a set number of classes).
    - **Can you think of anything that we would want to predict like this?**

The type of supervised learning problem has nothing to do with the features; only the response matters!

## Unsupervised Learning

#### Common Types of Unsupervised Learning

- **Clustering:** Groups “similar” data points together.
- **Dimensionality reduction:** Reduces the number of variables used without sacrificing too much information.

## Clustering

Imagine that we had a bunch of coins we wanted to automatically split into groups. An unsupervised learning technique would involve the following steps:

1) Clustering the coins based on “similarity” — this could be through the size, the material, or the language on the coins. <br>
2) Inspecting the grouping that the algorithm found.

Hopefully this would put the coins into sets of related groups.
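
The two steps above can be sketched with a basic k-means loop, one of several clustering techniques and chosen here only because it is short. The coin diameters are made up, and the function name `k_means_1d` is hypothetical; it clusters on a single feature (size) to keep the example small.

```python
# Hypothetical coin diameters in millimeters (made-up values).
diameters = [17.9, 18.0, 18.1, 21.1, 21.2, 21.3, 24.2, 24.3, 24.4]


def k_means_1d(values, k, n_iter=20):
    """Cluster numbers into k groups (k >= 2) with a basic k-means loop."""
    ordered = sorted(values)
    # Start the centers at evenly spaced positions in the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups, centers


# Step 1: cluster the coins. Step 2: inspect the grouping that was found.
groups, centers = k_means_1d(diameters, k=3)
for center, group in zip(centers, groups):
    print(round(center, 2), group)
```

Notice that we had to choose the number of groups ourselves, and nothing in the output says what each group "means"; that interpretation is left to us.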

## Dimensionality Reduction

Imagine that we had a huge number of features related to those coins: country of origin, size, weight, mass, density, condition, chemical makeup, etc. Moreover, say that we had thousands (or, in some cases, millions) of different features. Not all of these features are helpful!

You probably already know the physical relationship (treating a coin's size as its volume):

$$\dfrac{\text{mass}}{\text{size}} = \text{density}$$

Here, density could take the place of two different features from before. Instead of using three variables, we only need one! Many dimensionality reduction algorithms do this kind of recognition on a much grander scale. They're often used on data with hundreds of variables.
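
As a toy illustration of the density idea, the sketch below replaces two features with one derived feature. The coin measurements are made up, and "size" is assumed to mean volume in cubic centimeters. Real dimensionality reduction algorithms discover useful combinations from the data rather than from a known formula.

```python
# Hypothetical coins: mass in grams, volume in cubic centimeters (made-up values).
coins = [
    {"mass": 2.5, "volume": 0.36},
    {"mass": 5.0, "volume": 0.70},
    {"mass": 5.67, "volume": 0.81},
]

# Replace the two original features with a single derived feature.
reduced = [{"density": round(c["mass"] / c["volume"], 2)} for c in coins]
print(reduced)
```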

Sometimes unsupervised learning is used as a “preprocessing” step for supervised learning. (Can you guess how?)

### Examples

**Supervised Learning: Coin Classifier**

- **Observations:** Coins.
- **Features:** Size and mass.
- **Response or target variable:** Hand-labeled coin type.

- Train a machine learning model using labeled data.
    - The model learns the relationship between the features and the coin type.

- Make predictions on new data for which the response is unknown.
    - Give the model a new coin and it will predict the coin type automatically.
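
A minimal sketch of the coin classifier is shown below. It uses a 1-nearest-neighbor rule, picked here only for brevity and not because the lesson prescribes it; the training coins, their measurements, and the function name `predict_coin` are all hypothetical.

```python
import math

# Hypothetical hand-labeled training coins: (size in mm, mass in g) -> coin type.
training_coins = [
    ((19.0, 2.5), "penny"),
    ((21.2, 5.0), "nickel"),
    ((17.9, 2.3), "dime"),
    ((24.3, 5.7), "quarter"),
]


def predict_coin(size, mass, labeled=training_coins):
    """Return the label of the closest training coin (1-nearest neighbor)."""
    def distance(example):
        (s, m), _label = example
        # Straight-line distance between the new coin and a training coin.
        return math.hypot(size - s, mass - m)

    _features, label = min(labeled, key=distance)
    return label


# A new, unlabeled coin: the model predicts its type from size and mass alone.
print(predict_coin(24.0, 5.6))
```

With this rule, "training" amounts to storing the labeled examples; other models learn a more compact relationship between the features and the response.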
    
**Unsupervised Learning: Types of Customers at a Bar**

- **Observations:** Customers.
- **Features:** Drink purchases, people they interact with, etc.
- **Response or target variable:** There isn’t one — instead, we group similar customers together.

<a id = 'algorithm'></a>

## Algorithms

The underlying engine of all ML models is some **algorithm**. These algorithms are used to help identify trends, represent said trends, and explain the overall variance of the data. If they already sound complicated, it's because they usually are. While we'll learn about the algorithms hiding behind most of the techniques we learn in this class, we'll let the computer handle their implementation.

Let's say we are a real estate agent looking to price a house using data of other home sales. Suppose we only have **sale price** and **square footage** as variables to work with. 

Suppose we also think that a **linear model** is a good technique to use here. That is, we think that the traditional algebra equation

$$y = mx + b$$

can describe home prices as follows:

$$price = m \cdot sqft + b$$

Combining all of our data, Python tells us that the equation

$$price = 114 \cdot sqft + 0$$

adequately predicts sales price. How much can we expect a 2,500 square foot home to sell for?

## Final Algorithm

$$price = 114 \cdot sqft$$

This is an example of a model built with the intent of predicting price. The algorithm is simple and built off of limited information. Typically, our models will be more complex, and we'll consider a greater amount of prior data to help us develop a final algorithm.  
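
To make the prediction step concrete, here is a small sketch that applies the fitted equation to a new home. The function name `predict_price` is hypothetical; the coefficient and intercept are the ones from the example above.

```python
def predict_price(sqft, m=114, b=0):
    """Apply the linear model price = m * sqft + b."""
    return m * sqft + b


print(predict_price(2500))  # 114 * 2500 + 0 = 285000
```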

## Algorithm Training 

In our example, we used previously known information to find our coefficients. This action is also referred to as "training." But, let's make something clear:

- Model building would be the task of constructing an actual algorithm.
    - This is the linear model of $y = mx + b$.
- Training involves figuring out the coefficient and the _y_-intercept the model uses for _our intended purpose_.  
    - The coefficients uncovered via training were $m = 114$ and $b = 0$.
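
The distinction can be sketched in code. Below, the model is fixed as a straight line, and training is done with ordinary least squares, one common way to fit a line and an assumption here, since the lesson does not say which method Python used. The sales data are made up to lie exactly on the line $price = 114 \cdot sqft$ so that the result is easy to check.

```python
# Hypothetical past sales: (square footage, sale price). The prices are made up
# to lie exactly on the line price = 114 * sqft so the result is easy to check.
sales = [(1000, 114000), (1500, 171000), (2000, 228000), (3000, 342000)]


def fit_line(points):
    """Ordinary least squares for y = m * x + b with a single feature."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope: covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    m = cov_xy / var_x
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Training: estimate the coefficient and intercept from the known sales.
m, b = fit_line(sales)
print(m, b)
```

Real sales data will not fall exactly on a line, so the fitted values would only approximate the prices rather than reproduce them.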

<a id="conclusion2"></a>
## Conclusion

---

Check to see if you can answer the following questions easily:

- What is data science?
- What is the data science workflow?
- What is the difference between supervised and unsupervised learning?
- What is the difference between regression and classification? 
- What is an algorithm?

<a id="course-info"> </a>
# Course Information
    
### GA offers a special learning environment.

- What you should know: GA is a global community of individuals and organizations empowered to pursue the work we love.
- Who we are: Meet your instructional team.
- How to provide feedback: exit tickets, mid-course survey, and end-of-course survey. We want to hear from you!

## Road to Success

- **The emotional cycle of change**: This course is **fast** and covers **a lot** of material. There will be times when you may feel discouraged or overwhelmed, but don't give up - this is natural (and part of the design). By the end of the course, you'll feel more confident in your ability to define problems, analyze data, and prototype solutions. 
- **Student learning responsibility**: Our lessons cover topic foundations, but there is always more to learn! You are responsible for your learning experience - but don't get overwhelmed! Instead, just make sure you follow along, practice as much as possible, and ask questions.
- **GA requirements**: Show up. Be on time. Participate. Submit your projects. Allow yourself to struggle. Read the docs. Have fun!
- Q&A?


### Course Outline and Project Due Dates

General Assembly's part-time Data Science materials are organized into **four** units.

| Unit   | Title  | Topics Covered  | Length | 
| ---    | ---    |  ---     | ---    |
| Unit 1 | Foundations       | Python Syntax, Development Environment | Lessons 1–4 |
| Unit 2 | Working with Data | Stats Review, Visualization, & EDA     | Lessons 5–9  | 
| Unit 3 | Data Modeling     | Regression, Classification, & KNN      | Lessons 10–14  | 
| Unit 4 | Applications      | Decision Trees, NLP, & Flex Topics     | Lessons 15–19  | 

> **Instructor Note:** If there is time, briefly walk through the entire `course-info` repository with your students. If not, refer them to it for class information.