# Abstract

## Problem Statement
- *Who are your customers?*
- *What is the problem?*
- *What solution do you propose?*

## Work Plan
- *What data will you collect?*
- *What models can you use to analyze it?*
- *How will you know that your models work?*

# Obtain the Data


## Map the Data Pipeline

*After completing this step, be sure to edit `references/data_dictionary` to include descriptions of where you obtained your data and what information it contains.*

## Build the Pipeline Tools

*Make sure these steps are reproducible by code. Put some thought into the directory structures and filepaths you are using to save your data, so it's easy to load files you need.*

```python
%%writefile src/obtain.py

# Write the functions that form the data pipeline,
# then run this cell to save them in src/obtain.py.
```
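As a starting point for the directory structure, here is a minimal sketch of a path helper. The `data/raw` and `data/processed` folder names and the `data_path` function are hypothetical choices for illustration, not something this template prescribes.

```python
from pathlib import Path


def data_path(filename, processed=False, root=None):
    """Return the full path for a data file, creating its folder if needed.

    Hypothetical layout: <root>/data/raw and <root>/data/processed.
    `root` defaults to the current working directory.
    """
    base = Path(root) if root is not None else Path.cwd()
    folder = base / "data" / ("processed" if processed else "raw")
    folder.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return folder / filename
```

Routing every read and write through one helper like this keeps file locations in a single place, so a later change to the layout only touches one function.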

## Run the Pipeline

*Set this up so that you won't need to download datasets that you already have on your computer when you re-run the pipeline.*

```python
# import data pipeline functions
def run_data_pipeline():
    # call data pipeline functions in order
    pass

run_data_pipeline()
```
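One way to skip datasets that are already on disk is to check for the file before downloading. The sketch below assumes each dataset is a single file fetched from a URL; `fetch_if_missing` and its `download` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_if_missing(url, dest, download=urlretrieve):
    """Download `url` to `dest` unless `dest` already exists.

    Returns (path, downloaded), where downloaded is False when the
    file was already on disk and no download was attempted.
    """
    dest = Path(dest)
    if dest.exists():
        return dest, False  # already on disk: skip the network call
    dest.parent.mkdir(parents=True, exist_ok=True)
    download(url, str(dest))
    return dest, True
```

Passing the downloader in as an argument makes the function easy to exercise without network access. Note that this check only looks at whether the file exists, so a partially downloaded file would still count as present.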

# Scrub the Data


## Load Data

## Verify Data Integrity

## Engineer Features

*Before moving on to exploratory analysis, write down some notes about challenges encountered while working with this data that might be helpful for anyone else (including yourself) who may work through this later on.*

# Explore the Data

*Before you start exploring the data, write out your thought process about what you're looking for and what you expect to find. Take a minute to confirm that your plan actually makes sense.*

*Calculate summary statistics and plot some charts to get an idea of what useful relationships might be in your dataset. Use these insights to go back and download additional data or engineer new features if necessary. Not now though... remember we're still just trying to finish the MVP!*

## Inspect Raw Data

## Inspect Features for Predictive Patterns

*In a regression model, we are looking for clear relationships between our features and targets. Generate pairplots and look for features that have high correlations with the target variable.*
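To complement the pairplots, features can be ranked by the strength of their linear correlation with the target. This is a sketch in plain Python, assuming the features are held as a dict of equal-length numeric lists; `pearson` and `rank_features` are illustrative names. With pandas, `DataFrame.corr()` computes the same coefficient by default.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort (name, correlation) pairs by absolute correlation, strongest first."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Ranking by absolute value keeps strongly negative relationships near the top as well. Correlation only captures linear relationships, so the plots remain worth reading alongside the numbers.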

*In a classification model, we are looking for features that separate the population into distinct distributions. Generate pairplots that are color-coded by your categories and look for features where the categories have distinct distributions.*
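A rough numeric companion to the color-coded pairplots is to compare each feature's distribution across classes. The sketch below assumes a two-class problem; the `separation` score (gap between class means divided by the average within-class standard deviation) is a simple heuristic chosen for illustration, not a standard metric.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_by_class(values, labels):
    """Return {label: (mean, standard deviation)} for one feature."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {label: (mean(vals), pstdev(vals)) for label, vals in groups.items()}


def separation(values, labels):
    """Heuristic: distance between class means in units of average spread."""
    stats = summarize_by_class(values, labels)
    if len(stats) != 2:
        raise ValueError("this sketch only handles two classes")
    (mean_a, sd_a), (mean_b, sd_b) = stats.values()
    gap = abs(mean_a - mean_b)
    spread = (sd_a + sd_b) / 2
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread
```

A higher score suggests the two classes occupy more distinct ranges of that feature. The score ignores the shape of the distributions, so it is a way to shortlist features for plotting, not a replacement for looking at them.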

- *What did you learn about your data?*
- *Does it look like there are clear patterns and relationships among your features that will allow you to make good predictions?*
- *Which features do you think will be most helpful?*

# Model the Data

*Describe the algorithms that you are considering. How do they work? Why are they good choices for this data and problem space?*

*What nuances in the data will you have to be aware of in order to avoid introducing bias into your model? What steps will you need to take to prevent overfitting? What risks are there for data leakage?*
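One common source of leakage is fitting preprocessing on the full dataset before splitting. The sketch below shows the safer order in plain Python: split first, learn scaling parameters from the training rows only, then apply them to both sets. The function names are hypothetical, and the standardization shown is one possible scaling choice.

```python
import random
from statistics import mean, pstdev


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` with a fixed seed and split it once."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def fit_scaler(train_values):
    """Learn centering and scaling parameters from the training set only."""
    center = mean(train_values)
    spread = pstdev(train_values) or 1.0  # avoid dividing by zero
    return center, spread


def apply_scaler(values, center, spread):
    """Standardize values using previously fitted parameters."""
    return [(v - center) / spread for v in values]
```

Usage follows the same order: `train, test = split_rows(values)`, then `center, spread = fit_scaler(train)`, then `apply_scaler` on each set. The test set never influences the fitted parameters, which is the property that guards against this kind of leakage.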

## Train Test Split

## Preprocessing

### Label Encoding

### Feature Scaling

## Build and Train Model

_Write down any thoughts you may have about working with these algorithms on this data. Which design choices appear to have been the most successful? What pain points are you running into? What other ideas do you want to try out as you iterate on this pipeline?_

## Predict and Score

## Inspect Errors

# iNterpret the Model

_Write up the things you learned and how well your model performed. Be sure to address the model's strengths and weaknesses. What types of data does it handle well? What types of observations tend to give it a hard time? What future work might you or someone reading this want to do, building on the lessons learned and tools developed in this project?_

## Strengths and Weaknesses

## What Else Can We Do?

- *find more data*
- *build new features*
- *try new models*

_(These are the obvious answers. The more interesting questions are **what** to try, **how** to do it, and **why** you think it might make a difference.)_