# PART 1

In the field of data science, good projects are **practical**. Your capstone project should be manageable and affect a real-world audience. This might be a domain you are familiar with, a particular interest you have, something that affects a community you are involved in, or an area that relates to a field you wish to work in.

One of the best ways to test ideas quickly is to share them with others. A good data scientist has to be comfortable discussing ideas and presenting to audiences. That's why for Part 1 of your Capstone project, you'll be preparing a lightning talk on some potential interest areas and datasets.

This deliverable will provide you with guidance to help you select an awesome topic and begin to build a polished Capstone project. 
- [ ] Identify Problem
- [ ] Acquire Data
- [ ] Present Data

   - Problem Statement
   - Potential Audience 
   - Goals
   - Success Metrics
   - Data Source(s)
    data, goals, audiences, and metrics.

5. For all datasets, identify their source, format, and necessary action items to obtain or access them.
6. Create a blog post of at least 500 words (and 1-2 graphics!) that describes your project idea, data, and audience. Link to it in your presentation appendix.

**Begin by Asking:**
- What is the scope of the need or problem I wish to investigate?
- Who is this for? Who is impacted or affected by this data? Who would benefit from this model?
- What are my goals for this investigation?
- What does success look like? How will I know if my model performs well?
- Where will I find data for this project? Is the data available?

**For the Bonus, Ask:**
- What format is the data in? What specific steps do I need to take to access it?
- How will I explain this project to outside audiences?




# PART 2

# Dataset + Data Collection
Create your own database and data dictionary.
- [ ] Identify Problem
- [ ] Parse Data
- [ ] Refine Data

1. Find and Clean Your Data: Source and format the required data for your project.
   - Create a database
   - Create a data dictionary
2. Perform preliminary data munging and cleaning of your data: organize your data relevant to your project goals.
   - Review data to verify initial assumptions
   - Clean and munge data as necessary. You'll need to collect, clean, and document the dataset(s) you intend to use for your project.
3. Describe your data: keep your intended audience(s) in mind.
   - Document your work so far in a Jupyter notebook.

4. Document your project goals (revise from your initial pitch)
   - Articulate "Specific aim"
   - Outline proposed methods and models
   - Define risks & assumptions
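As a rough illustration of step 1, the sketch below creates a small SQLite database and derives a data dictionary from it. It uses only Python's standard library; the `listings` table, its columns, the sample rows, and the descriptions are all hypothetical placeholders, not part of the assignment. Treat it as one possible starting point rather than a required approach.

```python
import sqlite3

# Hypothetical sample rows; replace with your own cleaned data.
ROWS = [
    ("2017-01-03", "Brooklyn", 1250.0),
    ("2017-01-04", "Queens", None),
    ("2017-01-05", "Bronx", 980.5),
]

# Hand-written descriptions: the part of a data dictionary you must supply yourself.
DESCRIPTIONS = {
    "listing_date": "Date the listing was posted (ISO 8601)",
    "borough": "Borough where the unit is located",
    "monthly_rent": "Asking rent in US dollars; may be missing",
}


def build_database(rows):
    """Create an in-memory SQLite database holding the sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE listings (listing_date TEXT, borough TEXT, monthly_rent REAL)"
    )
    conn.executemany("INSERT INTO listings VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def build_data_dictionary(conn, table, descriptions):
    """Return one dict per column: name, declared type, missing count, description."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    dictionary = []
    for _cid, name, col_type, *_rest in columns:
        missing = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {name} IS NULL"
        ).fetchone()[0]
        dictionary.append(
            {
                "column": name,
                "type": col_type,
                "missing": missing,
                "description": descriptions.get(name, ""),
            }
        )
    return dictionary


conn = build_database(ROWS)
data_dictionary = build_data_dictionary(conn, "listings", DESCRIPTIONS)
for entry in data_dictionary:
    print(entry)
```

Pulling column names, types, and missing counts from the database itself keeps the dictionary in sync with the data; only the descriptions need to be maintained by hand.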

# PART 3

- [ ] Mine Data
- [ ] Refine Data
- [ ] Model Data

1. Create a "progress report" that documents:
   - Your approach to exploratory data analysis
   - Your initial results
   - Any roadblocks, setbacks, or surprises

2. Perform initial descriptive and visual analysis of your data.
   - Identify outliers
   - Summarize risks and limitations

3. Discuss your proposed next steps
   - Describe how your EDA will inform your modeling decisions
   - What are three concrete actions you need to take next?

4. Visualize your EDA and approach using at least **two or more** of the data visualization methods we've covered in class.

# EDA + Preliminary Analysis

Begin quantitatively describing and visualizing your data. With rich datasets, EDA can go down an endless number of roads. Maintain perspective on your goals and scope your EDA accordingly.
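One common way to start identifying outliers is the interquartile-range (IQR) rule. The sketch below is a minimal standard-library version; the sample values and the conventional multiplier `k=1.5` are assumptions for illustration, and the right rule for your data may differ.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the first or third quartile."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one suspicious value.
sample = [12, 14, 15, 15, 16, 17, 18, 19, 21, 95]

print("mean:", statistics.mean(sample))
print("median:", statistics.median(sample))
print("outliers:", iqr_outliers(sample))
```

Comparing the mean with the median, as the last lines do, is a quick check on how strongly extreme values pull on your summary statistics. Flagged values still need a judgment call: record whether you kept, corrected, or removed each one, and why.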

Take notes on your approach, results, setbacks, and findings! These notes will form the basis of your "progress report."

- Document **everything** as you go! This will give you valuable material to pull into your report - and will paint a more accurate picture than trying to summarize afterward :)
- Be candid! This is not a race, but a chance to get valuable feedback. Be honest about which techniques have worked, which steps have led you down wrong turns, and what blockers you've run into.

- [Describing data visually](http://www.statisticsviews.com/details/feature/6314441/Visualising-Statistics-The-importance-of-seeing-not-just-describing-data.html)
- [Real world data science workflows often contain setbacks](https://guerrilla-analytics.net/2015/02/20/data-science-workflows-a-reality-check/)


# PART 4

# Findings + Technical Report
Data science requires clean data, logical study design, and reproducible results. The best way to do this (and build your portfolio) is to get in the habit of fully documenting your work for your peers and colleagues.

In Part 4 of our Capstone, you'll assemble a technical notebook that details your model and approach for your peers. It should be written in a straightforward manner, with concisely commented code, documented procedures and reasoning, and logical analysis. Where applicable, include clearly labeled plots, graphs, and other visualizations, explaining any outliers and relationships between features and data.  

Start with a brief "executive summary" and then walk us through each portion of your notebook, step by step. Explain your goals, describe modeling choices, evaluate model performance, and discuss results. Data science reporting is technical, but don't forget that your approach should tell us a compelling story about your data.

Include any additional code, data, or other materials in appendices, as needed. Above all, your process descriptions should be concise and relevant to your goals.

- [ ] Mine Data
- [ ] Refine Data
- [ ] Model Data
- [ ] Present Data

1. Begin with an executive summary:
   - What is your goal?
   - What are your metrics?
   - What were your findings?
   - What risks/limitations/assumptions affect these findings?

2. Walk through your model step by step, starting with EDA
   - What are your variables of interest?
   - What outliers did you remove?
   - What types of data imputation did you perform?

3. Summarize your statistical analysis, including:
   - model selection
   - implementation
   - evaluation
   - inference

4. Clearly document and label each section
   - Logically organize your information in a persuasive, informative manner
   - Include notebook headers and subheaders, as well as clearly formatted markdown for all written components
   - Include graphs/plots/visualizations with clear labels
   - Comment and explain the purpose of each major section/subsection of your code
     - *Document your code for your future self, as if another person needed to replicate your approach*

5. Host your notebook and any other materials in your own public GitHub repository
   - Include a technical appendix, including links to and explanations of any outside libraries or source code used
   - Host a local copy of your dataset or include a link to a remotely hosted version
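To answer "What types of data imputation did you perform?" reproducibly, it can help to have the imputation step return a record of what it did. The sketch below shows one possible pattern using median imputation and the standard library only; the `monthly_rent` values and the report fields are hypothetical, and median imputation is just one strategy among many.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values.

    Returns the filled list plus a small report suitable for a technical notebook.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = statistics.median(observed)
    filled = [fill if v is None else v for v in values]
    report = {
        "strategy": "median",
        "fill_value": fill,
        "n_imputed": len(values) - len(observed),
    }
    return filled, report


# Hypothetical column with two missing entries.
monthly_rent = [1250.0, None, 980.5, 1400.0, None]
filled_rent, imputation_report = impute_median(monthly_rent)
print(filled_rent)
print(imputation_report)
```

Printing the report next to the code gives readers the strategy, the fill value, and the number of affected rows without having to rerun anything.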

#### BONUS
6. Describe how this model could be put into real-world production. Consider:
   - How could you continue to validate your model's performance over time?
   - What steps might you need to take to productionize your model for an enterprise environment?
   - How would you deploy your model publicly? What could you do to set up your model and share it online right now?

7. Create a blog post of at least 1000 words summarizing your approach in a tutorial format and link to it in your notebook.
   - In your tutorial, address a slightly less technical audience; think back to Day 1 of the program - how would you explain and walk through your capstone project to your earlier self?
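One simple way to think about validating performance over time is to log predictions alongside the eventual true outcomes, then track a metric per time period. The sketch below assumes a hypothetical log of `(period, actual, predicted)` tuples and an arbitrary tolerance of 0.05; both are illustrative choices, not a prescribed monitoring setup.

```python
from collections import defaultdict


def accuracy_by_period(records):
    """Compute accuracy per period from (period, actual, predicted) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for period, actual, predicted in records:
        totals[period] += 1
        hits[period] += int(actual == predicted)
    return {period: hits[period] / totals[period] for period in sorted(totals)}


def flag_degraded(accuracies, baseline, tolerance=0.05):
    """Return the periods whose accuracy fell more than `tolerance` below `baseline`."""
    return [p for p, acc in accuracies.items() if acc < baseline - tolerance]


# Hypothetical prediction log: (period, actual label, predicted label).
LOG = [
    ("2017-01", 1, 1), ("2017-01", 0, 0), ("2017-01", 1, 1), ("2017-01", 0, 1),
    ("2017-02", 1, 1), ("2017-02", 0, 0), ("2017-02", 1, 0), ("2017-02", 0, 1),
]

monthly = accuracy_by_period(LOG)
flagged = flag_degraded(monthly, baseline=0.75)
print(monthly)
print("periods needing review:", flagged)
```

Here `baseline` would typically be the score you reported in your technical notebook; a flagged period is a prompt to investigate data drift or retrain, not proof that the model is broken.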

## Suggested Ways to Get Started
- Use the DSI Data Science Framework to help you organize your information
- For any given step, consider the logic that links it to other steps and clearly describe each assumption.
- After writing a draft, leave it alone for at least 24 hours. What would you revise, reword, or take out?
- Alternatively, read over a draft with a peer or in a group. What areas do they find confusing or unclear?

## Useful Resources
- [How to Report Statistics to Technical Audiences](http://abacus.bates.edu/~ganderso/biology/resources/writing/HTWstats.html)
- [Data Science Employers Value Research Reports](https://www.quora.com/What-is-a-good-way-for-a-data-scientist-to-construct-an-online-portfolio)

# PART 5

# Presentation + Non-Technical Summary
Your audience won't know anything about data science! Share the most important insights from your project and explain your findings to the public.

Tell us the most interesting story about your data. 
Break down your process for a novice audience. 
Make sure to include compelling visuals. 
Focus on the most relevant components of your project.


Create a presentation, including an explanation of your model and findings for a non-technical audience.

- [ ] Identify Problem
- [ ] Model Data
- [ ] Present Data
   - Define technical terms and any basic data science concepts that inform your approach
   - Don't just deliver the information; tell a story about the problem and solution

2. Prepare polished visuals or a publicly-suitable slide deck to guide your presentation.
   - Include graphs and/or visualizations

3. Make sure you cover the following areas:
   - Goals
   - Success Criteria / Metrics
   - Data
   - Overall Approach
   - Basic description of model
   - Findings
   - Risks/Limitations
   - Impact, next steps, conclusions

4. Successfully answer questions about your project from your audience.

5. Discuss longer term potential of your project and model.
   - Describe how you could validate your model's performance over time
   - Explain how you would deploy your model in a production environment

6. Create a publicly hosted interactive visualization that your audience can use to further access and explore your data and findings.
   - Bonus points for embedding this into your blog post tutorial!


## Suggested Ways to Get Started

- Review the information you provided in the "Executive Summary" from Part 4. This is the same information you should cover here.
- The difference is that you should **not** assume that your audience knows anything about your problem, model, or basic data science. Structure your presentation as if explaining your model to a non-data science friend.
- Practice! Test your presentation on other GA students or friends and see where they have questions.
- Include more visuals (and less text) than you think you need.
- Don't just read your presentation - deliver it!

## Useful Resources

- [Best Practices for Visualization](https://drive.google.com/file/d/0Bx2SHQGVqWasWUpNX28yMTVuS1U/view?usp=sharing)
- [Importance of Storytelling with Data - Tableau Whitepaper](https://drive.google.com/file/d/0Bx2SHQGVqWasTmhYM1FHX3JfNEU/view?usp=sharing)
- [Sample PT Data Science Projects from other GA students](https://gallery.generalassemb.ly/DS?metro=)
