# Project Guidelines

## Overview

This document describes the final project you will complete for this course. This project will build on your existing knowledge and take you to the next level with your skills in coding, software engineering, visualization, debugging, algorithmic thinking, etc.

* The end result of your project will be a **computational narrative**, i.e. a *story*, that uses code, data and visualizations to communicate something of significance. You will tell this story to me in the form of notebooks and to the class in your final presentation.
* Most importantly, the project will be a lot of **fun** and provide you with something significant you can use as the start of a data science portfolio.

## Selecting a dataset

Use the following guidelines for picking a dataset:

* Your dataset needs to be sufficiently large and complex to explore interesting questions:
  - Tabular data
    - Should have enough columns in one or more tables to answer interesting questions. In practice, one table with many columns or multiple tables will be required.
    - After you have grouped by all categorical columns, each resulting group should still have hundreds of rows or more.
    - Should have numerous columns with different data types (categorical, geographical, date/time, quantitative).
* Your dataset must be suitable for predictive statistical modelling.
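
One way to check the group-size guideline above is sketched below. It uses only the standard library so it runs anywhere. The function names and the `minimum=100` threshold are assumptions made for illustration, not course requirements. If your data is already in pandas, `df.groupby(columns).size()` gives the same counts.

```python
from collections import Counter


def group_sizes(rows, categorical_columns):
    """Count the rows in each group formed by the given categorical columns."""
    return Counter(
        tuple(row[column] for column in categorical_columns) for row in rows
    )


def has_enough_rows_per_group(rows, categorical_columns, minimum=100):
    """Return True if every group has at least `minimum` rows.

    The default of 100 is an assumed reading of "hundreds of rows".
    """
    sizes = group_sizes(rows, categorical_columns)
    return min(sizes.values()) >= minimum
```

Here `rows` is a list of dictionaries, one per record. If the check fails, consider grouping by fewer columns or finding a larger dataset.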

## Final presentation

To get credit for the project, you **must** present your project to the class during the official final time for the class. You will use the notebook to do your presentation.

* Your presentation should be a narrative that describes your data science process and the questions you explored.
* You should do live demonstrations and a summary of the code, but you shouldn't get too technical about the details of the code.
* Remember, most of the class won't know anything about the dataset, topics or questions you are presenting about.
* If you have code that takes a while to run, run it before the presentation and save the results to a file. During the presentation, load the results from disk and perform only the final analysis and visualizations.
* Your presentation notebook should be a separate notebook from the main notebook that contains the full data science process and its description.
* You will have 5-7 minutes total for your presentation.
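
The advice about slow code can be handled with a small caching helper. The sketch below is one possible approach, assuming the results can be serialized as JSON. The name `load_or_compute` is hypothetical, and other formats (CSV, pickle, HDF5) may suit your data better.

```python
import json
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return results cached on disk, computing and saving them if missing.

    `compute` is any zero-argument function returning JSON-serializable data.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Fast path: reuse results saved by an earlier run.
        return json.loads(cache_path.read_text())
    results = compute()
    cache_path.write_text(json.dumps(results))
    return results
```

Call the helper once before the presentation so the file exists. During the presentation the same call returns almost instantly. Delete the file whenever the underlying computation changes, since this sketch does not detect stale results.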

**Students and faculty outside the class will be invited to attend the final presentations.**

## Rubric

You will be graded on the following categories:

### Code

* Code is well organized into relatively small functions (10-20 lines) that do one thing.
* Functions have docstrings describing what they do.
* Appropriate variable and function names are used.
* [PEP 8](https://www.python.org/dev/peps/pep-0008/) is approximately followed for code style.
* Code that is used in multiple notebooks is put into standalone `.py` files and imported.
* Code is written with fast performance in mind and does not have any significant performance problems.
* All of your notebooks should run from scratch in a reasonable amount of time. Please make a note about any code that takes a long time (more than a few minutes) to run.
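
As a rough illustration of the first few points, a shared helper might look like the sketch below. The module name `project_utils.py` and the function `standardize` are hypothetical. The point is the shape: a short function that does one thing, has a docstring, and uses descriptive names.

```python
"""project_utils.py (hypothetical name): helpers shared across notebooks."""
from statistics import mean, stdev


def standardize(values):
    """Return values rescaled to zero mean and unit sample standard deviation."""
    center = mean(values)
    spread = stdev(values)
    return [(value - center) / spread for value in values]
```

A notebook in the same directory could then use `from project_utils import standardize` instead of copying the function into each notebook.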

### Narrative

* Narrative text in the form of Markdown cells is provided to describe the dataset, code, results, visualizations, modeling, etc.
* When appropriate, equations are included (LaTeX).
* You identify the core questions you will study.
* The narrative tells a compelling *story* that motivates and answers the questions.
* A one paragraph abstract is provided.

### Organization

* The project is organized into multiple notebooks with clear titles and ordering. Something like this:
  - `01-Introduction.ipynb`
  - `02-Import.ipynb`
  - `03-Tidy.ipynb`
  - `04-EDA.ipynb`
  - `05-Modeling.ipynb`
  - `06-Presentation.ipynb`
* An introduction notebook is provided with the following sections:
  - Abstract
  - Description of the dataset
  - Index of the other notebooks with a short description of each
  - Citations
* Each notebook has well organized Markdown sections with headings.
* More details of the notebook format and organization will be provided.

### Import and Tidy

* Multiple tidy tables are used to encode one-to-one, one-to-many and many-to-many relationships.
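
To make the idea concrete, here is a minimal sketch of a many-to-many relationship encoded with three tidy tables. All table names, column names and values are made up for illustration. Plain lists of dictionaries stand in for the tables you would normally hold in DataFrames.

```python
# Hypothetical tables: one row per student and one row per course.
students = [
    {"student_id": 1, "name": "Ada"},
    {"student_id": 2, "name": "Grace"},
]
courses = [
    {"course_id": 10, "title": "Statistics"},
    {"course_id": 11, "title": "Databases"},
]
# Junction table: one row per (student, course) pair. A student can appear
# in many rows and so can a course, which encodes the many-to-many link.
enrollments = [
    {"student_id": 1, "course_id": 10},
    {"student_id": 1, "course_id": 11},
    {"student_id": 2, "course_id": 10},
]


def courses_for(student_id):
    """Return the titles of the courses a student is enrolled in."""
    titles = {course["course_id"]: course["title"] for course in courses}
    return [
        titles[row["course_id"]]
        for row in enrollments
        if row["student_id"] == student_id
    ]
```

Each fact is stored exactly once: a course title lives only in `courses`, and the junction table holds nothing but keys.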

### Exploratory Data Analysis

* Data exploration is tied to the questions you are exploring and this relationship is made explicit.
* Visualizations adhere to the theory and practice of effective visualizations.
* Interesting relationships in different parts of the dataset are explored.
* Interactive widgets are used to explore the dataset.
* Appropriate visualizations are used.

### Statistical modeling

* Bootstrap resampling is used to create distributions and confidence intervals of any estimated statistics.
* Appropriate statistical models (regression, maximum likelihood, classification) are built to explore relationships, extract insight from the data and make predictions.
* Best practices of machine learning (test/train splitting, cross validation, hyper-parameter optimization, error analysis, etc.) are followed.
* Multiple algorithms/models are compared where appropriate.
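
For the bootstrap item above, a minimal sketch of a percentile bootstrap confidence interval is shown below. It uses only the standard library. The function name, the default of 2000 resamples and the fixed seed are assumptions chosen for illustration, and the percentile method is just one of several bootstrap interval constructions.

```python
import random
from statistics import mean


def bootstrap_ci(data, statistic=mean, n_resamples=2000, confidence=0.95, seed=0):
    """Return a percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)  # fixed seed so results are reproducible
    # Resample the data with replacement and recompute the statistic each time.
    estimates = sorted(
        statistic(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return lower, upper
```

Passing a different `statistic`, such as `statistics.median`, gives an interval for that quantity instead. For large datasets a vectorized NumPy version will be much faster.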

### Challenge area

* Your project should contain at least one challenge area that involves learning a new Python library, statistical technique, etc.
* Examples:
  - A new data storage format, such as HDF5.
  - A dataset involving time series and date/times (such as financial assets over time).