Final Project
The final project should represent significant original work applying data science techniques to an interesting problem. Final projects are individual efforts, but you should definitely talk to your instructors and classmates about them.
Address a data-related problem in your professional field or in a field you're interested in. Pick a subject that you're passionate about; if you're strongly interested in the subject matter it'll be more fun for you and you'll probably produce a better project!
To stimulate your thinking, there is a compendium of some public data sources. Using public data is the most common choice. If you have access to private data, that's also an option, though you'll have to be careful about what results you can release. Competing in a Kaggle competition is a project option as well.
The final project should include at least the following components:
- Gather, preprocess and visualize one or more datasets. What can you learn from a high level analysis?
- Apply appropriate techniques: regression/classification algorithms, evaluation, cross-validation, etc., and report your results.
- Consider how you could implement what you’ve done in a production system. Where would the data live? How would it be represented? How would end users access it? What parts would be streaming vs. batch?
Optionally, you are encouraged to actually implement the system based on your work, to the degree this is possible within the time frame of the course.
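As a rough illustration of the "evaluation, cross-validation" component above, here is a minimal sketch of k-fold cross-validation using only the Python standard library. Everything in it is an assumption made for illustration: the data is synthetic, the model is a one-feature least-squares line, and the helper names (`k_fold_indices`, `fit_line`, `cross_validated_mse`) are hypothetical. In a real project you would most likely use a library such as scikit-learn and a model suited to your own data.

```python
import random
import statistics


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_mse(xs, ys, k=5, seed=0):
    """Return the held-out mean squared error for each of the k folds."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for held_out in folds:
        held_out_set = set(held_out)
        # Train on every row that is not in the held-out fold.
        train = [i for i in range(len(xs)) if i not in held_out_set]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        # Score only on rows the model has not seen.
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out]
        scores.append(statistics.fmean(errors))
    return scores


# Synthetic data (made up for illustration): y = 3x + 2 plus Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

fold_mse = cross_validated_mse(xs, ys, k=5)
print("MSE per fold:", [round(m, 3) for m in fold_mse])
print("Mean MSE:", round(statistics.fmean(fold_mse), 3))
```

Reporting the per-fold scores alongside their mean, rather than a single number, gives readers a sense of how much the result varies with the particular train/test split.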
By the end of the fifth week of class, you should have some data identified and an outline including the following:
- The problem you are solving
- Description of data set and how you will obtain it
- Techniques you plan to use and why
- Hypotheses
- Possible practical/business applications
How do you know you're on the right track?
- By the fifth week of class, you and your instructors all know the topic, data, and general plan of your project.
Before the end of the class, you will complete a short (4-6 pages) paper describing your project. The paper should target a technical audience.
What to cover in your paper:
- Description of problem and hypothesis.
- Detailed description of your data set.
- How did you decide what features to use in your analysis?
- What challenges did you face in terms of obtaining and organizing the data?
- What did you learn from the initial exploration phase?
- Describe the statistical methods you used, and perhaps others you considered but did not use, and how you decided what to use.
- What business applications do your findings have?
- Describe the implementation plan in detail from the ingesting of data to how end users access it.
Your paper should demonstrate thorough understanding of statistical techniques, data management, and the application of these in programming. It should communicate clearly to a reasonably technical audience.
In the final week of class, you will give a 5-7 minute presentation summarizing your project. The presentation should target a non-technical audience - it's a chance to practice the highly sought-after communication skills that data scientists need. It will be appropriate to have an accompanying slide deck.
What to cover in your presentation:
- Overview of problem and hypotheses
- Overview of data
- Appropriate visualizations
- Modeling techniques used and why
- Your findings, and how they're actionable
- Your implementation plan, and any hurdles
Your presentation should be engaging, clear, and informative, describing the project, approach, and conclusions, and be suitable for a non-technical audience.
How do you know you're done?
- You have a Git repository accessible on GitHub including all your project's data (if reasonably possible), source code, presentation slides, and final paper.
- Your instructors have reviewed all your project work.
- You have given your final presentation to your class and instructors. (Additional presentations to other audiences are also encouraged.)
You can find some example projects from GA students in the examples directory. For even more inspiration, go to the bottom of Andrew Ng's CS229 site and explore the "Recent years' projects"!