# Application: Data Science Workflow Through Ames Data

## Frame

---

- Identify the business/product objectives.
- Identify and hypothesize goals and criteria for success.
- Create a set of questions to help you identify the correct data set.

We work for a real estate company interested in using data science to determine the best properties to buy and resell. Specifically, the company would like to identify the characteristics of residential houses that predict their sale price and to assess the cost-effectiveness of renovations.

#### Identify the Business/Product Objectives
The customer tells us their business goals are to accurately predict prices for houses (so that they can sell them for as large a profit as possible) and to identify which kinds of features in the housing market would be more likely to lead to foreclosure and other abnormal sales (which could represent more profitable sales for the company).

#### Identify and Hypothesize Goals and Criteria for Success

Ultimately, the customer wants us to:
* Deliver a presentation to the real estate team.
* Write a business report discussing results, procedures used, and rationales.
* Build an API that provides estimated returns.

#### Create a Set of Questions to Help You Identify the Correct Data Set

* Can you think of questions that would help this customer deliver on their business goals? 
* What sort of features or columns would you want to see in the data?

**Ideal Data vs. Available Data**  

Oftentimes, we'll start by identifying the *ideal data* we would want for a project.

Then, during the data acquisition phase, we'll learn about the limitations on the types of data actually available. We have to decide if these limitations will inhibit our ability to answer our question of interest or if we can work with what we have to find a reasonable and reliable answer.

For example, we provide a set of housing data for Ames, Iowa, which [includes](./extra-materials/ames_data_documentation.txt):

- 20 continuous variables indicating square footage.
- 14 discrete variables indicating number of each room type.
- 46 categorical variables containing 2–28 classes each, e.g., street type (gravel/paved) and neighborhood (city district name).

---

#### **Optional Check**

Take a moment to look through the data description. How closely does the set match the ideal data that you envisioned? Would it be sufficient for our purposes? What limitations does it have?

---

This is possibly the hardest step in the data science workflow. At this stage, it's common to realize that the problem you're trying to solve may not be solvable with the information available. The data could be incomplete, nonexistent, or unable to meet the criteria necessary to answer your question.

That said, you now have a better feel for the data that's available and the information they could contain. You can now identify a new, answerable question that ultimately helps you solve or better understand your problem.

### Data Acquisition

---

- **What are some questions we should ask during the acquisition process?**

- Here is what we know about the source of our Ames data set:
    - [Ames Data Set Introduction PDF](./extra-materials/ames.pdf) (from the "Journal of Statistics Education")
    - "Data set contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010."

### Data Quality

---

- **What are some questions we should ask when checking the data for quality?**
  - [Ames Data Set Documentation](./extra-materials/ames_data_documentation.txt)

## Prepare

---

Often, we are given *secondary data*, or data that were collected previously. In these cases, we have to learn as much as possible about our data using tools like data dictionaries and source documentation to determine how the set was gathered.

Here's an example of a data dictionary:

Variable | Description | Type of Variable
---| ---| ---
Square Footage | Floating Point | Continuous
Street Type | 1 - Gravel, 2 - Paved | Categorical
Neighborhood | String, e.g., 'Tenderloin' | Categorical
Number of Bedrooms | Integer | Discrete
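
A data dictionary can also be kept in code and used to interpret raw records. The following is a minimal sketch using only the Python standard library; the field names, the `DATA_DICTIONARY` structure, and the `decode_record` helper are hypothetical and simply mirror the example table above, not the actual Ames column names.

```python
# Hypothetical data dictionary mirroring the example table above.
# Keys and field names are illustrative, not the actual Ames column names.
DATA_DICTIONARY = {
    "square_footage": {"dtype": float, "kind": "continuous"},
    "street_type": {"dtype": int, "kind": "categorical",
                    "codes": {1: "Gravel", 2: "Paved"}},
    "neighborhood": {"dtype": str, "kind": "categorical"},
    "bedrooms": {"dtype": int, "kind": "discrete"},
}


def decode_record(raw, dictionary=DATA_DICTIONARY):
    """Cast each raw string to its documented type and expand coded categories."""
    decoded = {}
    for name, spec in dictionary.items():
        value = spec["dtype"](raw[name])
        # Replace numeric codes with their human-readable labels when defined.
        decoded[name] = spec.get("codes", {}).get(value, value)
    return decoded


# A made-up record, as it might arrive from a CSV file (all strings).
raw_record = {"square_footage": "1500.5", "street_type": "2",
              "neighborhood": "Tenderloin", "bedrooms": "3"}
record = decode_record(raw_record)
print(record)
```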

**Common considerations when preparing our data include:**  

- **Ensuring data is clearly defined and structured.**
- **Checking and cleaning data formatting as needed.**

**Common considerations for cleaning include**:

- **Most data will *not* come perfectly clean and ready to use. Cleaning data is normally the most time-consuming task a data scientist faces.**
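
To make the cleaning considerations above concrete, here is a small sketch of typical formatting fixes, written with the Python standard library only. The raw rows, the column names, and the helpers `clean_row` and `to_number` are assumptions made up for illustration; real cleaning rules depend on what the data documentation says.

```python
def clean_row(row):
    """Return a cleaned copy of one raw record (a dict of strings)."""
    cleaned = {}
    for key, value in row.items():
        # Normalize column names: trim whitespace, lowercase, use underscores.
        name = key.strip().lower().replace(" ", "_")
        text = value.strip()
        # Treat empty strings and common placeholders as missing values.
        cleaned[name] = None if text in {"", "NA", "N/A"} else text
    return cleaned


def to_number(value):
    """Convert a cleaned string to float, dropping commas; keep None as None."""
    return None if value is None else float(value.replace(",", ""))


# Made-up raw rows showing typical formatting problems.
raw_rows = [
    {" Square Footage": " 1,500 ", "Street Type ": "Paved"},
    {" Square Footage": "NA", "Street Type ": " gravel"},
]

rows = [clean_row(r) for r in raw_rows]
for row in rows:
    row["square_footage"] = to_number(row["square_footage"])
    if row["street_type"] is not None:
        # Make category labels consistent in capitalization.
        row["street_type"] = row["street_type"].capitalize()

print(rows)
```

In some data sets a value such as `NA` is a valid category rather than a missing value, so it is worth checking the documentation before applying a rule like the one in `clean_row`.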

---

As you can see, the "Prepare" phase of the data science workflow encompasses several steps: reviewing, indexing, and cleaning your data. This normally consumes a great deal of time!

## Analyze

---

Data scientists often begin with basic statistics, such as the mean, standard deviation, or frequency counts of their data. Statistics that we might expect for the earlier housing variables include:

Variable | Mean or Frequency (%)
---| ---
Square Footage | 2201.3
Street Type - Gravel | 8%
Street Type - Paved | 92%
Number of Bedrooms | 1.8

**What sort of questions do these types of statistics allow us to answer? Why would we do this?**
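
Summary statistics like these can be computed in a few lines. The sketch below uses Python's standard `statistics` and `collections` modules; the sample values are made up for illustration and are not real Ames records.

```python
from collections import Counter
from statistics import mean, stdev

# Made-up sample values, for illustration only (not real Ames records).
square_footage = [1500.0, 2200.0, 1800.0, 2500.0]
street_type = ["Paved", "Paved", "Gravel", "Paved"]

avg_sqft = mean(square_footage)
sd_sqft = stdev(square_footage)  # sample standard deviation

# Frequency of each category, expressed as a percentage of all rows.
counts = Counter(street_type)
frequency_pct = {label: 100 * n / len(street_type) for label, n in counts.items()}

print(f"Mean square footage: {avg_sqft:.1f}")
print(f"Std. dev. of square footage: {sd_sqft:.1f}")
print(f"Street type frequency (%): {frequency_pct}")
```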

### Creating a Predictive Model 

Typically, we want to predict some value of interest (such as the sale price of a house given a fixed set of characteristics).
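
As one possible illustration, the sketch below fits a straight line that predicts sale price from square footage using ordinary least squares with a single predictor. This is only one simple modeling choice, not the method prescribed by this lesson; the `fit_line` helper and the training values are made up for the example.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    covariance = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    variance = sum((xi - x_bar) ** 2 for xi in x)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


# Made-up training data: square footage and sale price in dollars.
sqft = [1000, 1500, 2000, 2500]
price = [100_000, 150_000, 200_000, 250_000]

slope, intercept = fit_line(sqft, price)

# Use the fitted line to estimate the price of a house not in the data.
predicted = slope * 1800 + intercept
print(f"Predicted price for 1,800 sq ft: ${predicted:,.0f}")
```

A real model for this problem would likely use many more features and a dedicated modeling library, but the basic pattern of fitting on known sales and then predicting for a new house stays the same.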

**What are some other business goals we can support as data scientists for this realty company? What are some values we would like to guess?**

**What do you think are the steps for model building?**

_We'll be spending much of our time in this course on data analysis and predictive modeling._

## Interpret

---

### Develop Recommendations and Decisions

**Now that you have a model, what are some things you should check?**
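
One common check, among several, is how far the model's predictions fall from actual values on data it was not fitted on. The sketch below computes the root mean squared error (RMSE) on a held-out set; the `predict` function, its coefficients, and the test houses are all hypothetical.

```python
from math import sqrt


def predict(sqft, slope=100.0, intercept=0.0):
    """Hypothetical fitted model: price as a linear function of square footage."""
    return slope * sqft + intercept


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target (dollars)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up held-out houses that were not used to fit the model.
test_sqft = [1200, 1800, 2400]
test_price = [130_000, 170_000, 250_000]

test_predictions = [predict(s) for s in test_sqft]
error = rmse(test_price, test_predictions)
print(f"Held-out RMSE: ${error:,.0f}")
```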

**Now that you have a model, can you convert your model's findings into a conclusion or next step for your employer?**

## Communicate

---

#### Share the Results of Your Analysis  

Presentations are a critical part of your analysis. It doesn't matter how brilliant your model is or how illuminating your findings are — without effective communication, your work will not be used.

The most basic form of a data science presentation should include a simple sentence that describes your results:

_"Customers from large companies had twice (CI 1.9, 2.1) the odds of placing another order with Planet Express compared to customers from small companies."_
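
A statement like this is typically backed by an odds ratio and a confidence interval. The sketch below shows one standard way to compute both from a 2x2 table of counts, using the large-sample (Wald) approximation on the log scale. The `odds_ratio_ci` helper and the counts are made up for illustration, so the interval it prints will not match the one quoted above.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI for a 2x2 table.

    a, b: reorders / no reorders among large-company customers
    c, d: reorders / no reorders among small-company customers
    """
    ratio = (a / b) / (c / d)
    # Standard error of log(odds ratio), the usual large-sample approximation.
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(ratio) - z * se)
    upper = exp(log(ratio) + z * se)
    return ratio, lower, upper


# Made-up counts, chosen only so that the odds ratio comes out to 2.
ratio, lower, upper = odds_ratio_ci(a=400, b=200, c=300, d=300)
print(f"Odds ratio: {ratio:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

Larger samples shrink the standard error and therefore narrow the interval around the estimated odds ratio.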

Data science presentations can also be far more complex and exciting, like some of the [research presented by Nate Silver's FiveThirtyEight blog](http://fivethirtyeight.com/burrito/#brackets-view).

When crafting a presentation, always consider your audience and make sure to practice your presentation beforehand. Consider the types of questions people might ask or — better yet — test your presentation on a few people and pay attention to their response. Clarify and refine your presentation accordingly.

Make sure to consider your needs and goals, as well as those of your audience. A presentation created for your fellow data scientists will be vastly different from a presentation intended for executives trying to make a business decision.

**A Note About Iteration**

Iteration is an important part of *every step* in the data science workflow. At any given point in the process, you may find yourself repeating or going back and redoing steps in order to better understand your data, clarify your model, and refine your presentation.

**What are some things you may want to redo or iterate over after presenting your findings?**

<a id="summary1"></a>
# Summary

---

1) **Crafting good questions is key.** <br>
  - Without a thoughtful and targeted question, it can be difficult to create an effective model.
  
2) **Use the data science workflow to iteratively develop solutions.** <br>
  - **Frame**: Develop a hypothesis-driven approach to your analysis.
  - **Prepare**: Select, import, explore, and clean your data.
  - **Analyze**: Structure, visualize, and complete your analysis.
  - **Interpret**: Derive recommendations and business decisions from your data.
  - **Communicate**: Present (edited) insights from your data to different audiences.

3) **Informed by your past work, continue to refine your findings and models.** <br>
  - While the data science workflow may appear to be linear, we consistently return to past steps to implement new findings.