Algorithms, Summer 2015
LEDE Program, Columbia University, Graduate School of Journalism
Richard Dunks: richard [at] datapolitan [dot] com
Chase Davis: chase.davis [at] nytimes [dot] com
Room Number: Pulitzer Hall 601B
Course Dates: 14 July - 27 August 2015
This course presents an overview of algorithms as they relate to journalistic tradecraft, with particular emphasis on algorithms that relate to the discovery, cleaning, and analysis of data. This course intends to provide literacy in the common types of data algorithms, while providing practice in the design, development, and testing of algorithms to support news reporting and analysis, including the basic concepts of algorithm reverse engineering in support of investigative news reporting. The emphasis in this class will be on practical applications and critical awareness of the impact algorithms have in modern life.
- You will understand the basic structure and operation of algorithms
- You will understand the primary types of data science algorithms, including techniques of supervised and unsupervised machine learning
- You will be practiced in implementing basic algorithms in Python
- You will be able to meaningfully explain and critique the use and operation of algorithms as tools of public policy and business
- You will understand how algorithms are applied in the newsroom
All students will be expected to have a laptop during both lectures and lab time. Time will be set aside to help install, configure, and run the programs necessary for all assignments, projects, and exercises. Where possible, all programs will be free and open-source. All assigned work using services hosted online can be run using free accounts.
The required readings for this course consist of book chapters, newspaper articles, and short blog posts. They are intended to give you a foundation in the critical skills ahead of class lectures. All required readings are available online or will be made available to you electronically. Recommended readings are suggestions for further study of the topics covered in class, and additional suggested readings will be provided as appropriate for those interested in a more in-depth discussion of the material.
This course consists of programming and critical response assignments intended to reinforce learning and provide you with practical applications of the material covered in class. Completion of these assignments is critical to achieving the outcomes of this course. Assignments are intended to be completed during lab time or as homework. Generally, assignments will be due the following week, unless otherwise stated. For example, exercises assigned on Tuesday will be due before class on the following Tuesday.
- Programming assignments will be submitted via Slack to the TAs as Python scripts (.py files, not .ipynb notebooks). Each assignment's exercises should be submitted as a standalone script, not combined with other assignments. This allows them to be tested and scored separately.
- Response questions should be submitted using this address and will be posted to the class Tumblr after grading. They should be clear, concise, and use the elements of good grammar. This is an opportunity to develop your ability to explain algorithms to your audience.
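As an illustration of the standalone script format described above, a submission might be laid out roughly as follows. This is only a sketch: the exercise, the function name `mean`, and the sample data are hypothetical and not taken from any actual assignment.

```python
"""Hypothetical layout for a single standalone assignment script."""


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def main():
    # Sample data is made up for illustration only.
    sample = [2, 4, 6, 8]
    print("Mean:", mean(sample))


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    main()
```

Keeping the logic in functions and guarding the entry point this way lets a script be run on its own or imported for testing without side effects.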
Class runs from 10am to 1pm on Tuesdays and Thursdays, with lab time from 2pm to 5pm on the same days. The class will be taught in blocks of roughly 50 minutes, with breaks of approximately 10 minutes between blocks. Class will be a mix of lecture and practical exercises, emphasizing the application of skills covered in the lecture portion. Lab time is intended for the completion of exercises, but may also include guided learning sessions as necessary to ensure comprehension of the course material.
- Attendance and Tardiness: We expect you to attend every class, arriving on time and staying for the entire duration of class. Absences will only be excused for circumstances coordinated in advance and you are responsible for making up any missed work.
- Participation: We expect you to be fully engaged while you’re in class. This means asking questions when necessary, engaging in class discussions, participating in class exercises, and completing all assigned work. Learning will occur in this class only when you actively use the tools, techniques, and skills described in the lectures. We will provide you ample time and resources to accomplish the goals of this course and expect you to take full advantage of what’s offered.
- Late Assignments: All assignments are to be submitted before the start of class. Assignments posted by the end of the day following class will be marked down 10% and assignments posted at the end of the day following will be marked down 20%. No assignments will be accepted for a grade after three days following class.
- Office Hours: We won’t be holding regular office hours, but are available via email to answer whatever questions you may have about the material. Please feel free to also reach out to the Teaching Assistants as necessary for support and guidance with the exercises, particularly during lab time.
- Stack Overflow - Q&A community of technology pros
(Some) Open Data Sources
- New York City Open Data Portal
- New York State Open Data Portal
- Hilary Mason’s Research Quality Data Sets
Data Journalism and Critiques
Conway, Drew and John Myles White. Machine Learning for Hackers. O'Reilly Media, Inc., 2012.
Knuth, Donald E. The Art of Computer Programming. Addison-Wesley Professional, 2011.
MacCormick, John. Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers. Princeton University Press, 2011.
McCallum, Q. Ethan. Bad Data Handbook. O'Reilly Media, Inc., 2012.
McKinney, Wes. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.
O'Neil, Cathy and Rachel Schutt. Doing Data Science: Straight Talk from the Front Line. O'Reilly Media, Inc., 2013.
Russell, Matthew A. Mining the Social Web. O'Reilly Media, Inc., 2013.
Sedgewick, Robert and Kevin Wayne. Algorithms. Addison-Wesley Professional, 2011.
Steiner, Christopher. Automate This: How Algorithms Came to Rule Our World. Penguin Group, 2012.
(Subject to change)
Week 1: Introduction to Algorithms/Statistics review
Class 1 Readings
- Miller, Claire Cain, “When Algorithms Discriminate” New York Times, 9 July 2015
- O’Neil, Cathy, “Algorithms And Accountability Of Those Who Deploy Them”
- Elkus, Adam, “You Can’t Handle the (Algorithmic) Truth”
- Diakopoulos, Nicholas, "Algorithmic Accountability Reporting: On the Investigation of Black Boxes"
Class 2 Readings (optional)
- McKinney, "Getting Started With Pandas" Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython.
- McKinney, "Plotting and Visualization" Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython.
Week 2: Statistics in Reporting/Opening the Blackbox: Supervised Learning - Linear Regression
Class 1 Readings
Class 2 Readings
- O'Neil, "Statistical Inference, Exploratory Data Analysis, and the Data Science Process" Doing Data Science: Straight Talk from the Front Line, pp. 17-37
Week 3: Opening the Blackbox: Supervised Learning - Feature Engineering/Decision Trees
Class 2 Readings
- Building Machine Learning Systems with Python, pp. 33-43
- Learning scikit-learn: Machine Learning in Python, pp. 41-52
- Brownlee, Jason, ["Discover Feature Engineering, How to Engineer Features and How to Get Good at It"](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)
- ["A Visual Introduction to Machine Learning"](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)