Assignment #2 - Reproducible Data Analysis with Jupyter

Out: March 6, 2017
Due: March 27, 2017 (10am, before class begins)

Overview
The goal of this assignment is to gain experience using the Jupyter programming environment (including Python, Pandas, and Matplotlib) to do data journalism with a large government dataset. One of the main benefits of using Jupyter for data journalism is that others can see the exact steps by which you reached your conclusions. In this assignment you'll be analyzing data from the federal government's College Scorecard.

Getting Started
Before you begin, make sure you have a solid understanding of the in-class tutorials and problem sets we've covered on loading, manipulating, and visualizing data with Pandas. Additional documentation can be found in the official Pandas documentation.

You should download the skeleton notebook provided and open it in your Jupyter environment. The notebook has links to the dataset you'll use, as well as the full data documentation and the full data dictionary. There are thousands of variables, so read the full data documentation carefully before starting the analysis. You'll need to refer to the data dictionary frequently to understand what each variable means and how you might parse it.
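Because the dataset is so wide, it can help to load only the variables a given question needs. The following is a minimal sketch, not part of the skeleton notebook. It assumes the data is a CSV file, uses a tiny inline sample with invented values in place of the real download, and assumes a handful of column names (`INSTNM`, `STABBR`, `ADM_RATE`, `MD_EARN_WNE_P10`) and the missing-value markers `NULL` and `PrivacySuppressed`. Check all of these against the data dictionary before relying on them.

```python
import io

import pandas as pd

# Stand-in for the real file: a tiny inline CSV with invented values.
# With the real data you would pass the path of the downloaded file instead.
sample_csv = io.StringIO(
    "UNITID,INSTNM,STABBR,ADM_RATE,MD_EARN_WNE_P10\n"
    "1,Example College A,MD,0.45,41000\n"
    "2,Example College B,VA,NULL,PrivacySuppressed\n"
    "3,Example College C,MD,0.80,36500\n"
)

# Read only the columns needed for the question at hand
# (assumed names; confirm them in the data dictionary).
columns = ["INSTNM", "STABBR", "ADM_RATE", "MD_EARN_WNE_P10"]

scorecard = pd.read_csv(
    sample_csv,
    usecols=columns,
    # Treat these markers as missing so numeric columns parse as numbers.
    na_values=["NULL", "PrivacySuppressed"],
)

# Inspect what was loaded: column types and missing values per column.
print(scorecard.dtypes)
print(scorecard.isna().sum())
```

Checking the column types and missing-value counts right after loading makes it easier to notice when a variable needs extra parsing.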

Details

A brief overview is given here, but it's best for you to look directly in the skeleton notebook for instructions and details. Within the notebook there are seven questions that you must answer (plus one additional extra credit question for the truly intrepid). After each question, add a cell and write code using Python / Pandas in order to arrive at the correct solution to the question. Be sure to include comments in your code! Someone else looking at your notebook should be able to follow your code and your logic in solving the question. The more documentation you provide for your method, the better - make it reproducible!
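As an illustration of the level of commenting expected, here is a sketch of what a documented answer cell might look like. It is not an answer to any of the notebook's questions: the data frame, its column names, and its values are invented for the example.

```python
import pandas as pd

# Toy data standing in for the real dataset (all values are invented).
schools = pd.DataFrame({
    "state": ["MD", "MD", "VA", "VA", "VA"],
    "admission_rate": [0.45, 0.80, 0.30, None, 0.60],
})

# Step 1: drop schools with no reported admission rate, and record how
# many were dropped so a reader can judge how much data the answer uses.
reported = schools.dropna(subset=["admission_rate"])
n_dropped = len(schools) - len(reported)

# Step 2: compute the median admission rate per state. The median is used
# here because it is less sensitive to a few extreme schools than the mean.
median_by_state = reported.groupby("state")["admission_rate"].median()

print(f"Dropped {n_dropped} row(s) with a missing admission rate")
print(median_by_state)
```

Each comment states what the step does and why, including editorial choices such as how missing values are handled, so a reader can follow the reasoning without running the code.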

In addition to completing the notebook, you should write a short description of your reflections on the assignment (e.g. challenges, difficulties), along with a description, interpretation, and rationale for the newsworthiness / importance of your insight for question #6.

Submission
This is an individual assignment and you may NOT work in groups. All work should be your own. If you find and use code snippets online, that is fine, but you should clearly note this and include a comment with a URL linking to the original source.
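One possible way to note a borrowed snippet is sketched below. The helper function and the sample values are invented for illustration, and the source line is a placeholder to be replaced with the actual URL.

```python
import pandas as pd


# Adapted from a snippet found online.
# Source: <URL of the page where the snippet was found>  (placeholder)
def to_numeric_column(series):
    """Convert a column to numbers, turning unparseable entries into NaN."""
    return pd.to_numeric(series, errors="coerce")


# Example use on invented values: the text entry becomes NaN.
cleaned = to_numeric_column(pd.Series(["0.45", "PrivacySuppressed", "0.80"]))
print(cleaned)
```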

You will be evaluated on three things: the clarity of your Jupyter notebook (easy to follow, well-commented and structured, adequate documentation for reproducibility); the quality of your written report, which should include your interpretation and reflection on the assignment (easy to follow, explains assumptions and editorial decisions, interpretations reflect an understanding of the underlying data, reflections on challenges of the assignment); and your analysis methods (accurate, bug free, logical, makes effective use of Pandas / Matplotlib / Python, analysis reflects accuracy and understanding of the data).

You should submit (1) your write-up of fewer than 500 words (excessively short or long write-ups will be penalized) as a .pdf, and (2) your .ipynb file so that your analysis can be re-run.

Email the .pdf of your write-up (filename ASGN2_<your lastname>.pdf) and the .ipynb (filename ASGN2_<your lastname>.ipynb) to Professor Diakopoulos (nad@umd.edu) by the due date.
