# Data Science - Course Project Brief

The course project should represent significant original work applying data science techniques to an interesting problem. Final projects are completed individually, but you should definitely talk to your instructors and classmates about them.

Address a data-related problem in your professional field or in a field you're interested in. Pick a subject that you're passionate about; if you're strongly interested in the subject matter, it'll be more fun for you and you'll probably produce a better project!

To stimulate your thinking, review either the compendium of public data sources or the featured example projects below. Using public data sets is the most common choice. If you have access to private data, that's also an option, though you'll have to be careful about what results you can release. Competing in a Kaggle competition is a project option as well.

The final project should include at least the following components:

* Gather, pre-process and visualize one or more datasets. What can you learn from a high-level analysis?
* Apply appropriate techniques: regression/classification algorithms, evaluation, cross-validation, etc., and report your results.
* Extrapolate your findings and consider what further study or projects your efforts could lead to. 
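
As a rough illustration of the "apply and evaluate" component, here is a minimal sketch of k-fold cross-validation written with only the Python standard library. The data is synthetic and the helper names (`k_fold_indices`, `fit_line`, `cross_validate`) are made up for this example; in a real project you would more likely rely on a library such as scikit-learn rather than hand-rolled helpers.

```python
import random
import statistics


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and split them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validate(xs, ys, k=5, seed=0):
    """Return the mean absolute error measured on each held-out fold."""
    errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(xs)) if i not in held_out_set]
        # Fit only on the training rows, never on the held-out fold.
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_error = statistics.fmean(
            abs(ys[i] - (slope * xs[i] + intercept)) for i in held_out
        )
        errors.append(fold_error)
    return errors


# Synthetic data (made up for illustration): y = 3x + 2 plus Gaussian noise.
rng = random.Random(42)
xs = [float(x) for x in range(50)]
ys = [3.0 * x + 2.0 + rng.gauss(0, 1) for x in xs]

fold_errors = cross_validate(xs, ys, k=5)
print("MAE per fold:", [round(e, 3) for e in fold_errors])
print("Mean MAE:", round(statistics.fmean(fold_errors), 3))
```

Reporting the per-fold errors alongside their mean gives a sense of how stable the model is across different subsets of the data, which is the kind of evidence your results section should include.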

## Project Deliverables

### Notebook or Paper

You have the choice of submitting a highly commented notebook in the [literate programming](https://en.wikipedia.org/wiki/Literate_programming) style, or writing a short (4-6 pages) paper. A 'literate' notebook just means that your project write-up is interspersed with the code that implements the project, just like the notebooks we use in class. Alternatively, you can write a short paper that targets a technical audience and discusses your project.

Whichever you choose, it should cover:

* Description of problem and hypothesis.
* Detailed description of your data set.
    - How did you decide what features to use in your analysis?
    - What challenges did you face in terms of obtaining and organizing the data?
    - What did you learn from the initial exploration phase?
* Describe the statistical methods you used, and perhaps others you considered but did not use, and how you decided what to use.
* What business applications do your findings have?
* Describe the implementation plan in detail, from the ingesting of data to how end users access it.

Your paper or notebook should demonstrate thorough understanding of statistical techniques, data management, and the application of these in programming. It should communicate clearly to a reasonably technical audience.

### Presentation

In the final week of class, you will give a 5-7 minute presentation summarizing your project. The presentation should target a non-technical audience - it's a chance to practice the highly sought-after communication skills that data scientists need. It will be appropriate to have an accompanying slide deck.

What to cover in your presentation:

* Overview of problem and hypotheses
* Overview of data
* Appropriate visualizations
* Modeling techniques used and why
* Your findings, and how they're actionable
* Your implementation plan, and any hurdles

Your presentation should be engaging, clear, and informative, describing the project, approach, and conclusions, and be suitable for a non-technical audience.

## Project Milestones

| Deliverable         | Deadline        |
|:-------------------:|:---------------:|
| Project Summary     | Sun, 27/12/2015  |
| Elevator Pitch      | Sat, 09/01/2016  |
| First Draft         | Tue, 19/01/2016  |
| Paper/Notebook      | Tue, 26/01/2016  |
| Presentation        | Sat, 30/01/2016  |

### Project Summary & Elevator Pitches

The best predictor of a successful project is having a problem that needs solving, a tractable project hypothesis, access to a comprehensive dataset and a clearly delineated scope. To make sure that you set out on the right foot, you will send in one paragraph summarising:

* The problem you are solving
* Description of data set and how you will obtain it
* Techniques you plan to use and why
* Hypotheses
* Possible practical/business applications

To inform your classmates and get some early feedback, you will also present your project idea in an elevator-pitch-style presentation (max. 90 seconds). The presentation will cover:

* A concise statement of the goal of your project
* What question or questions you hope to answer
* What data set you plan to use and how you will obtain the data
* What type of machine learning problem this will require
* Why you chose this project

### First Draft

Two weeks before the end of the class, you will be asked to provide a preliminary draft of your project for review. Your peers and instructors will provide feedback, according to [these guidelines](https://github.com/ga-students/DS_HK_7/blob/gh-pages/Peer%20Review%20Guidelines.md).

At a minimum, you should include:

* Narrative of what you have done so far and what you are still planning to do, ideally in a format similar to the format of your final project paper
* Code, with lots of comments

Ideally, you would also include:

* Visualizations you have done
* Slides (if you have started making them)
* Data and data dictionary
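
If you have not written a data dictionary before, the sketch below shows one possible way to draft a first version from a CSV file using only the Python standard library. The sample data and the function names (`infer_type`, `draft_data_dictionary`) are hypothetical and made up for illustration; the type inference is deliberately crude, and the `description` field still has to be written by hand.

```python
import csv
import io


def infer_type(values):
    """Guess a column type from its non-missing string values."""
    if not values:
        return "str"
    for type_name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "str"


def draft_data_dictionary(csv_text):
    """Return one summary row per column: name, type, missing count, example."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = []
    for column in rows[0].keys():
        # Treat empty strings as missing values.
        present = [row[column] for row in rows if row[column] != ""]
        summary.append({
            "column": column,
            "type": infer_type(present),
            "missing": len(rows) - len(present),
            "example": present[0] if present else "",
            "description": "",  # to be filled in by hand
        })
    return summary


# Hypothetical sample data, made up for illustration.
sample = """loan_id,amount,rate,purpose
1,5000,0.12,car
2,12000,,home
3,800,0.25,
"""

data_dictionary = draft_data_dictionary(sample)
for entry in data_dictionary:
    print(entry)
```

A generated draft like this is only a starting point: reviewers mostly need the human-written description of what each column means and where it came from.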

#### Tips for success

* The work should stand "on its own", and should not depend upon the reader remembering anything you might have previously said in class about your project.
* Organize your narrative and files so that the reader can easily follow along.
* The better you explain your project, and the easier it is to follow, the more useful feedback you will receive!
* If your reviewers can actually run your code on the provided data, they will be able to give you more useful feedback on your code. (It can be very hard to make useful code suggestions on code that can't be run!)

### Presentations

Deliver your project presentation in class and submit all required deliverables (paper, slides, code, data, and data dictionary).

## Project Ideas

Below is a list of suggested projects, from various sources. See if anything tickles your fancy or whether it can inspire you to do something similar.

* [CS 229 Machine Learning Final Projects, Autumn 2014](http://cs229.stanford.edu/projects2014.html)
* [CS 229 Machine Learning Final Projects, Autumn 2013](http://cs229.stanford.edu/projects2013.html)
* [CS 229 Machine Learning Final Projects, Autumn 2012](http://cs229.stanford.edu/projects2012.html)
* [CS 229 Machine Learning Final Projects, Autumn 2011](http://cs229.stanford.edu/projects2011.html)
* [CS 229 Machine Learning Final Projects, Autumn 2010](http://cs229.stanford.edu/projects2010.html)

### Example Projects

| Topic | Author | Paper | Presentation |
|:------|:------:|:-----:|:------------:|
| **Social Media Data** to **recommend countries to visit** based on a person's **travel history** | Jamar Parris | [PDF](https://github.com/ajschumacher/gadsdc/raw/master/final_projects/examples/Travel_paper.pdf) | [PDF](https://github.com/ajschumacher/gadsdc/raw/master/final_projects/examples/Travel_presentation.pdf) |
| **Predicting Successful Kickstarter Campaigns** | Ruben Naeff | [IPyNB](http://www.rubennaeff.nl/extra/gads7/rubennaeff_kickstarter_notebook.ipynb) | [PDF](https://github.com/ajschumacher/gadsdc/raw/master/final_projects/examples/Kickstarter_presentation.pdf) |
| **Predicting Loan Defaults** on LendingClub Data | Nikesh Patel | [PDF](https://github.com/ajschumacher/gadsdc/raw/master/final_projects/examples/Loans_paper.pdf) | [PDF](https://github.com/ajschumacher/gadsdc/raw/master/final_projects/examples/Loans_presentation.pdf) |
| **Classifying PDFs** as likely malicious or likely benign | Joe Carli | - | [PDF](https://github.com/ajschumacher/gadsdc/raw/master/final_projects/examples/PDFMalwareClassifier_Presentation.pdf) |
| Twitter **Music Recommendation Engine** | Anon | - | [PDF](https://github.com/ajschumacher/gadsdc/raw/master/final_projects/examples/PDFMalwareClassifier_Presentation.pdf) |
| **Venture Capital, Startups and Data** | Vijay Venkatesh | - | [PPTX](https://github.com/ajschumacher/gadsdc/blob/master/final_projects/examples/VentureCapital_presentation.pptx?raw=true) |
| **Behance.net Image Analysis Project** - What can image analysis tell us about Design? | Devon Hirth | - | [REPO](https://github.com/devowhippit/ga-ds-project) |
| **Building Predictive Models for NYC High Schools** | Alec Hubel | [Scribd](http://www.scribd.com/doc/191207189/Building-Predictive-Models-for-NYC-High-Schools-Alec-Hubel) | - |
| **Allstate Purchase Prediction Challenge** - Kaggle Competition | Just Markham | [MD](https://github.com/justmarkham/kaggle-allstate/blob/master/allstate-paper.md) | [PDF](https://github.com/justmarkham/kaggle-allstate/raw/master/allstate-presentation.pdf) |

## Data Sources

See [datasets](http://nbviewer.jupyter.org/github/ga-students/DS_HK_8/blob/gh-pages/notebooks/AA%20-%20Datasets.ipynb) for a list of resources.