Data Science Course: Lectures and Materials

Issues: For questions, answers and discussions:

Git Workflow and Command Line Tips:

Tips

Class Meetings

Introduction to Data Science

Wednesday, 9/3/2014

Class Materials

Data Collection and Extraction

Monday, 9/8/2014

Project 1 Introduced

Description for Project 1

Class Materials

Additional Resources:

Learning how to use the file pager, less

Less Homepage

Python Documentation

Handy to have this in your bookmarks!

https://docs.python.org/2.7/

A couple of extra handy `python` introductions

Beautiful Soup Tutorials

Python API Wrappers, API "console" via apigee

Numpy

Wednesday, 9/10/2014

Class Materials

Additional Resources

Watch the 5-minute "IPython Notebook Tour"
Review "What is NumPy"
Watch Wes McKinney's 10-minute Whirlwind Tour of Pandas (even once is ok ;-) )
Another great resource: review Chapters 1 to 5 of Julia Evans's pandas cookbook

Pandas

Monday 9/15/2014

Class Materials

Data Visualization and MatPlotLib

Wednesday 9/17/2014

Class Materials

Lecture Notes: Data Visualization

Python Notebook: Plotting with Matplotlib

Assignments Due

Complete and submit previous assignments

Additional Resources

Resource	About
Basic Plotting in Pandas
Matplotlib userguide
Matplotlib Gallery	Examples with Code
Rougier and Prace EuroSciPy Matplotlib Tutorial	Short Overview

Exploratory Data Analysis

Monday 9/22/2014

We'll be reviewing a number of datasets and going through the Data Exploration Process.

The ACES model for Data Exploration:

Letter	Step	Notes
A	Acquire the data and Assemble the data frame	Find data, import into Pandas
C	Clean the data frame	Identify and limit columns, rows, indices, dates, etc.
E	Explore global properties	Visualize! Basic plots and stats appropriate to the data set
S	Subset comparisons	Look at (visualize!) initial emergent variable relationships and subsets
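
As a rough illustration of the four ACES steps, here is a minimal pandas sketch. The inline CSV, its column names, and the movie titles are made up for this example; they are not the class IMDB data set.

```python
import io

import pandas as pd

# Hypothetical stand-in data; in class the data would come from a file or API.
RAW_CSV = """title,year,genre,rating
Movie A,1994,Drama,9.2
Movie B,1972,Crime,9.1
Movie C,2008,Action,
Movie D,1999,Drama,8.8
Movie E,2010,Action,8.7
"""

# A: Acquire the data and Assemble the data frame
df = pd.read_csv(io.StringIO(RAW_CSV))

# C: Clean the data frame (drop rows with no rating, index by title)
clean = df.dropna(subset=["rating"]).set_index("title")

# E: Explore global properties (basic summary statistics)
summary = clean["rating"].describe()
print(summary)

# S: Subset comparisons (mean rating per genre)
by_genre = clean.groupby("genre")["rating"].mean()
print(by_genre)
```

From here, a plot of `by_genre` or a histogram of `clean["rating"]` would be a natural next step for the "visualize!" notes in the table.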

Class Materials

EDA Run Through - IMDB

Resources

Presentations, Machine Learning, and Data Science Careers

Wednesday 9/24/2014

Assignments Due

[Project 1: Scraping, APIs, and Data Visualization](https://github.com/datadave/GADS9-NYC-Spring2014-Lectures/blob/master/projects/project01.md)

Class Outline

Selected Presentations of Student Projects
Discussion of Data Science Careers
Introduction to Machine Learning

Linear Regressions

Monday 9/29/2014

We'll be discussing the linear regression algorithm and learning how to score regression models.

Class Materials

Assignments Due

Please submit three optimized models built from the data/day.csv file, one for each y variable (casual, registered, and cnt), in an IPython notebook or Python script. Please put this in a lab_submissions/lab07/yourname folder.
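
To make "scoring a regression model" concrete, here is a minimal single-feature sketch in plain Python. It fits an ordinary least squares line and reports R² and RMSE. The `temps` and `riders` numbers are invented for illustration and are not taken from data/day.csv; a real submission would more likely use a library such as scikit-learn with several features.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares with one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, preds):
    """Share of the variance in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def rmse(ys, preds):
    """Root mean squared error, in the same units as ys."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))


# Made-up example: a normalized temperature feature and a daily rider count.
temps = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
riders = [1200, 1900, 2400, 3100, 3500, 4300]

slope, intercept = fit_line(temps, riders)
preds = [slope * t + intercept for t in temps]
print("R^2:", r_squared(riders, preds))
print("RMSE:", rmse(riders, preds))
```

Scores computed on the training data, as here, tend to be optimistic; holding out part of the data for scoring gives a fairer comparison between models.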

Naive Bayes

(link to lesson folder)

Wednesday 10/01/2014

We'll be reviewing some basics of probability, developing ways to work with text data, and using a classification algorithm to classify text.

Objectives

Articulate Naive Bayes' advantages, flaws, applications and theoretical foundation
Explain how Naive Bayes is applied to classify text or Spam
Be familiar with using the N.B. classifiers in NLTK and SKLearn
Create a basic Naive Bayes classifier
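
For the last objective, a basic Naive Bayes text classifier can be sketched in plain Python as follows. This is a multinomial model with add-one (Laplace) smoothing; the function names and the tiny spam/ham training set are invented for illustration and are not the NLTK or SKLearn versions used in the notebooks.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs is a list of (text, label) pairs; returns the counts the model needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify_nb(model, text):
    """Return the label with the highest log posterior for the given text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # Log prior: how common this label is in the training data.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Log likelihood with add-one smoothing.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data for illustration only.
training = [
    ("win cash prize now", "spam"),
    ("claim your free prize", "spam"),
    ("lunch meeting at noon", "ham"),
    ("see you at the meeting", "ham"),
]
model = train_nb(training)
print(classify_nb(model, "free cash prize"))
print(classify_nb(model, "meeting at noon"))
```

The "naive" part is the assumption that words are independent given the label, which is why the per-word log likelihoods can simply be added.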

Materials

NB_Gender_Names_NLTK: Notebook covering basics of Naive Bayes with single features
NB_Biebama_NLTK: Demo: Classifying text as Obama or Bieber
NB_Movies_SKLearn: Illustration of SK Learn NB functions
NB_Movies_NTLK: Illustration of NB on text with NLTK

Assignments

Add a feature to the NLTK gender classifier to try and improve performance
Create a classifier to tell the difference between two authors
Brainstorm classification topics for projects (due May 14)

Follow Up Notes

Based on student feedback:

Additional NB Notes

Classifier Comparison and Logistic Regression

Monday 10/6/2014

Objectives

Understand how to apply logistic regression to a classification problem
Create a two-dimensional feature space to evaluate the performance of classifiers
Leverage the interoperability of SKLearn classifiers to compare KNN, Naive Bayes, Decision Trees and Logistic Regression on a single classification problem

Materials

The lesson notebook provides:

A brief background on logistic classification
A mesh function using np.meshgrid to evaluate the predictive functions on a 2 dimensional feature grid

The intention is to provide a starting template with which to contrast various classifiers on a clean, real-world data set.
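
The mesh idea can be sketched as below. This is not the lesson notebook's code: `make_grid`, `predict_on_grid`, and `toy_logistic_predict` are hypothetical helpers, and the logistic weights are fixed by hand rather than learned. Any classifier's predict function that accepts an array of (x, y) points could be passed in instead.

```python
import numpy as np


def make_grid(x_min, x_max, y_min, y_max, step):
    """Build coordinate matrices that cover a 2-D feature space."""
    xs = np.arange(x_min, x_max, step)
    ys = np.arange(y_min, y_max, step)
    return np.meshgrid(xs, ys)


def predict_on_grid(predict, xx, yy):
    """Evaluate a predict function at every grid point, shaped like the grid."""
    points = np.c_[xx.ravel(), yy.ravel()]  # one row per (x, y) grid point
    return predict(points).reshape(xx.shape)


def toy_logistic_predict(points):
    """Stand-in classifier: logistic function with hand-picked weights."""
    weights = np.array([1.0, -1.0])  # assumption: fixed, not fitted
    probs = 1.0 / (1.0 + np.exp(-(points @ weights)))
    return (probs >= 0.5).astype(int)


xx, yy = make_grid(0.0, 1.0, 0.0, 1.0, 0.25)
labels = predict_on_grid(toy_logistic_predict, xx, yy)
print(labels)
```

The resulting `labels` array can be passed to a filled contour plot such as Matplotlib's `contourf(xx, yy, labels)` to draw the decision regions, which makes it easy to compare classifiers side by side.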

Static Ipython Notebook

Assignments

Students are expected to add additional classifiers to the notebook, experiment with parameters, and develop conclusions about the differences between classifier performance on the given sample data.

Your Centroid or Mine? An Introduction to K-Means

Wednesday 10/8/2014

Humans are good, often too good, at clustering, and it's another realm of our intelligence that we can programmatically apply to machines. Toddlers can tell that objects are boats, flags, and doggies -- but how?

Machine clustering is used to categorize the web, understand galaxies, organize genetic data, segment customers, classify mental illness, and detect disease patterns, to name just a few applications.

K-Means is a very simple clustering algorithm that works well in practice and is one of the most widely used.
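
To show how simple the algorithm is, here is a minimal plain-Python sketch of the standard iteration: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The helper names and the sample points are made up for illustration, and the starting centroids are passed in explicitly because the result depends on where they start.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centroids.append(old)  # leave an empty cluster's centroid alone
        else:
            new_centroids.append(
                tuple(sum(coords) / len(members) for coords in zip(*members))
            )
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Run the assign/update loop until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Made-up data: two loose groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
start = random.Random(0).sample(points, 2)  # random starting centroids
centroids, labels = kmeans(points, start)
print(centroids, labels)
```

Because a poor starting position can leave the algorithm stuck in a poor grouping, it is common to run it several times from different starts and keep the best result.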

Here are some resources to get started:

Recommended Resources

| Title | Author | Type | Length | Difficulty | Description | Rating (1 to 4 Stars) |
| ----- | ----- | ---- | ----- | ------ | --- | --- |
| Cluster Analysis and K-Means | Kumar, UMN | PDF Excerpt | 40 pages | Intermediate | Good chapter overview on clustering, and then section "8.2.1 Basic K-Means Algorithm" gives a great K-Means summary. The rest of 8.2 covers complications with K-Means and concludes with the optimization math. If you just want the minimum, read pgs. 498 to 505 | ++++ |
| Clustering Overview | StanfordML | html page | 3 pages | Intermediate | Good, quick overview of everything | ++++ |
| K-Means Clustering | Mathematical Monk | Video | 15 minutes | Novice | Good Khan-style overview of the math | +++ |
| K-Means Wikipedia Entry | Everyone | Wikipedia | 6 pages | Intermediate | Includes the Iris and 'mickey mouse' examples we'll be looking at. | ++ |

Class Lecture

Review of Random Forests and the Ensemble Learning Approach

Monday 10/13/2014

We'll wrap up the machine learning material with a more detailed discussion of the differences between bagging (bootstrap aggregating), boosting, and random forests.
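
As a small illustration of the bagging idea only, the sketch below trains many one-threshold "decision stumps", each on a bootstrap sample (drawn with replacement) of the same data, and combines them by majority vote. The function names and the toy data are invented for this example. Random forests build on the same idea with full decision trees and a random subset of features at each split, while boosting instead trains models one after another, with each new model focusing on the errors of the previous ones.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Pick the threshold t that best fits the rule 'predict 1 when x > t'."""
    best_t, best_err = None, None
    for t, _ in sample:
        err = sum((x > t) != bool(y) for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagged_stumps(data, n_models=25, seed=0):
    """Train one stump per bootstrap sample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        boot = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(fit_stump(boot))
    return models


def predict(models, x):
    """Majority vote across all stumps."""
    votes = Counter(int(x > t) for t in models)
    return votes.most_common(1)[0][0]


# Made-up 1-D data: (feature value, class label).
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
models = bagged_stumps(data)
print(predict(models, 0.5), predict(models, 10))
```

Each stump sees a slightly different sample, so their thresholds differ; averaging their votes is what reduces the variance of the combined model.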

Recommended Resources

Title	Author	Type	Length
Ensemble Learning	Wikipedia	Article	---
sklearn doc	Scikit-learn	Documentation	---
yhat Blog on Random Forests	yHat	blog article	---
Ensemble Methods in Machine Learning	Dietterich, Thomas	PDF Journal	15 pages
A Few Useful Things to Know about Machine Learning	Domingos, Pedro	PDF Journal	9 pages
Ensemble Methods	Hyer, Jay	Presentation	31 Slides
Kaggle Random Forests	Kaggle	Kaggle	---

Class Lecture

Updates expected -- See Lesson Folder for further details


More Reading (Linear Regressions)

Resource	About
Regressions with Sklearn
Overfitting Regressions
Guide to Logistic Regression
Khan Academy Algebra Review
MIT OCW
