
# Data Science Course: Lectures and Materials

##### Viewing Your Own and Other Students' Work:

IPython Notebook Viewer for this class's student repo

# Class Meetings

Monday, 3/31/14

## Data Collection and Extraction

Wednesday, 4/7/14

#### Class Materials

##### Python Documentation

Handy to have this in your bookmarks!

## NumPy

Wednesday 4/9/2014

Monday 4/14/2014

## Data Visualization and Matplotlib

Wednesday 4/16/2014

#### Class Materials

Lecture Notes: Data Visualization

Python Notebook: Plotting with Matplotlib

#### Assignments Due

• Complete and submit previous assignments

Basic Plotting in Pandas
Matplotlib userguide
Matplotlib Gallery Examples with Code
Rougier and Prace EuroSciPy Matplotlib Tutorial Short Overview

## Exploratory Data Analysis

Monday 4/21/2014

We'll be reviewing a number of datasets and going through the Data Exploration Process.

The ACES model for Data Exploration:

| Letter | Step | Notes |
| --- | --- | --- |
| A | Acquire the data and Assemble the data frame | Find data, import into Pandas |
| C | Clean the data frame | Identify and limit columns, rows, indices, dates, etc. |
| E | Explore global properties | Visualize! Basic plots and stats appropriate to the data set |
| S | Subset comparisons | Look at (visualize!) initial emergent variable relationships and subsets |
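The four ACES steps can be sketched in Pandas as follows. This is a minimal illustration, not the class notebook: the inline CSV, its column names (`date`, `city`, `temp_f`, `rides`), and the variable names are all made up for the example.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real data source.
RAW_CSV = """date,city,temp_f,rides
2014-04-01,Seattle,52,120
2014-04-02,Seattle,55,135
2014-04-01,Austin,78,210
2014-04-02,Austin,,225
2014-04-03,Austin,82,240
"""

# A: Acquire the data and assemble the data frame.
df = pd.read_csv(io.StringIO(RAW_CSV))

# C: Clean the data frame: parse dates, drop incomplete rows, set an index.
df["date"] = pd.to_datetime(df["date"])
clean = df.dropna().set_index("date")

# E: Explore global properties with summary statistics.
summary = clean[["temp_f", "rides"]].describe()

# S: Subset comparisons: compare groups within the data.
mean_rides_by_city = clean.groupby("city")["rides"].mean()

print(summary)
print(mean_rides_by_city)
```

In a notebook, the E and S steps would usually be followed by plots of the same frames rather than printed tables.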

#### Assignments Due

N/A - Please review all prior materials and work on Project 1.

## Presentations, Machine Learning, and Data Science Careers

Wednesday 4/23/2014

#### Assignments Due

Project 1: Scraping, APIs, and Data Visualization

#### Class Outline

• Selected Presentations of Student Projects
• Discussion of Data Science Careers
• Introduction to Machine Learning

## Linear Regressions

Monday 4/28/2014

We'll discuss the linear regression algorithm and learn how to score regression models.
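As a rough sketch of what fitting and scoring a linear regression involves, the example below uses ordinary least squares from NumPy on synthetic data. The data, the true slope and intercept, and the choice of RMSE and R² as scores are assumptions made for illustration; they are not taken from the lecture.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=100)

# Design matrix with a column of ones so the model learns an intercept.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: minimize the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# Score the fit with RMSE and the coefficient of determination (R^2).
pred = X @ coef
residuals = y - pred
rmse = np.sqrt(np.mean(residuals ** 2))
r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print(f"slope={slope:.2f} intercept={intercept:.2f} RMSE={rmse:.2f} R^2={r_squared:.3f}")
```

These scores are computed on the same data used for fitting, so they are optimistic. Scoring on held-out rows gives a fairer picture of how the model generalizes.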

#### Assignments Due

Please submit three optimized models built from the `data/day.csv` file, one for each y variable (`casual`, `registered`, and `cnt`), in an IPython notebook or Python script. Please put your work in a `lab_submissions/lab07/yourname` folder.

Regressions with Sklearn
Overfitting Regressions
Guide to Logistic Regression
MIT OCW

## Naive Bayes

Wednesday 4/30/2014

We'll be reviewing some basics of probability, developing ways to work with text data, and using a classification algorithm to classify text.

#### Objectives

• Articulate Naive Bayes' advantages, flaws, applications and theoretical foundation
• Explain how Naive Bayes is applied to classify text or spam
• Be familiar with using the N.B. classifiers in NLTK and SKLearn
• Create a basic Naive Bayes classifier
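To make the last objective concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python. The function names `train` and `predict`, the whitespace tokenizer, the tiny training set, and the use of add-one smoothing are all choices made for this example; the NLTK and SKLearn classifiers used in class are more complete.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Count labels and per-label words from a list of (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log posterior for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # Log prior, then add log likelihoods word by word. Summing them
        # is the "naive" assumption that words are independent given the label.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training examples for illustration only.
training = [
    ("win cash prize now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting notes for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(training)
print(predict("claim your free cash", model))
print(predict("notes for the team meeting", model))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents get longer than a few words.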

#### Assignments

• Add a feature to the NLTK gender classifier to try and improve performance
• Create a classifier to tell the difference between two authors
• Brainstorm classification topics for projects (due May 14)

Based on student feedback:

## Classifier Comparison and Logistic Regression

Wednesday 5/7/2014

#### Objectives

• Understand how to apply logistic regression to a classification problem
• Create a two-dimensional feature space to evaluate the performance of classifiers
• Leverage the interoperability of SKLearn classifiers to compare KNN, Naive Bayes, Decision Trees and Logistic Regression on a single classification problem
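The interoperability mentioned in the last objective comes from every SKLearn classifier sharing the same `fit` and `score` methods, so a comparison can be a simple loop. The sketch below assumes a synthetic two-feature data set from `make_classification` and arbitrary hyperparameters; the lesson notebook uses its own data and settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature problem (an assumption; not the class data set).
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hyperparameters here are illustrative, not tuned.
classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "Logistic Regression": LogisticRegression(),
}

# Every classifier exposes the same fit/score interface.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # mean accuracy on held-out data
    print(f"{name}: {scores[name]:.3f}")
```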

#### Materials

The lesson notebook provides:

• A brief background on logistic classification
• A mesh function using `np.meshgrid` to evaluate the predictive functions on a two-dimensional feature grid

The intention is to provide a starting template with which to contrast various classifiers on a clean, real-world data set.
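For readers who have not used `np.meshgrid` this way, the following sketch shows the general pattern: build a grid, flatten it into rows of feature pairs, predict, and reshape the predictions back onto the grid. The `predict` function is a hypothetical stand-in with made-up weights, not the function from the lesson notebook.

```python
import numpy as np


def predict(points):
    """Hypothetical classifier: a fixed logistic model with made-up weights."""
    weights = np.array([1.5, -2.0])
    bias = 0.25
    prob = 1.0 / (1.0 + np.exp(-(points @ weights + bias)))
    return (prob >= 0.5).astype(int)


# Build a grid covering the two-dimensional feature space.
xs = np.linspace(-3, 3, 61)
ys = np.linspace(-2, 2, 41)
xx, yy = np.meshgrid(xs, ys)  # both arrays have shape (41, 61)

# Flatten the grid into (n_points, 2), predict, then restore the grid shape.
grid_points = np.column_stack([xx.ravel(), yy.ravel()])
zz = predict(grid_points).reshape(xx.shape)

print(zz.shape, zz.mean())
```

Passing `xx`, `yy`, and `zz` to Matplotlib's `contourf` then draws the decision regions, which makes differences between classifiers easy to see.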

Static IPython Notebook

#### Assignments

Students are expected to add additional classifiers to the notebook, experiment with parameters, and draw conclusions about the differences in classifier performance on the given sample data.

## Your Centroid or Mine? An Introduction to K-Means

Monday, May 19th

Humans are good, often too good, at clustering, and it's another realm of our intelligence that we can programmatically apply to machines. Toddlers can tell that objects are boats, flags, and doggies -- but how?

Machine clustering is used to categorize the web, understand galaxies, organize genetic data, segment customers, classify mental illness, and detect disease patterns, to name just a few applications.

K-Means is a very simple clustering algorithm that works well and is by far the most widely used.
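The algorithm alternates two steps until the centroids stop moving: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The NumPy sketch below illustrates this under a few assumptions: the function name `kmeans`, random initialization from the data points, Euclidean distance, and two synthetic blobs as input are choices made for the example.

```python
import numpy as np


def kmeans(points, k, n_iters=100, seed=0):
    """Basic K-Means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels


# Two synthetic blobs (an assumption for illustration).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
data = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(data, k=2)
print(np.round(centroids, 1))
```

Because the result depends on the random starting centroids, K-Means can settle into a poor local optimum. A common remedy is to run it several times with different seeds and keep the run with the lowest total within-cluster distance.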

Here are some resources to get started:

### Recommended Resources

| Title | Author | Type | Length | Difficulty | Description | Rating (1 to 4 Stars) |
| --- | --- | --- | --- | --- | --- | --- |
| Cluster Analysis and K-Means | Kumar, UMN | PDF Excerpt | 40 pages | Intermediate | Good chapter overview on clustering, and section "8.2.1 Basic K-Means Algorithm" gives a great K-Means summary. The rest of 8.2 covers complications with K-Means and concludes with the optimization math. If you just want the minimum, read pp. 498 to 505. | ++++ |
| Clustering Overview | StanfordML | HTML page | 3 pages | Intermediate | Good, quick overview of everything | ++++ |
| K-Means Clustering | Mathematical Monk | Video | 15 minutes | Novice | Good Khan-style overview of the math | +++ |
| K-Means Wikipedia Entry | Everyone | Wikipedia | 6 pages | Intermediate | Includes the Iris and 'mickey mouse' data sets we'll be looking at. | ++ |

Class Lecture

## Review of Random Forests and the Ensemble Learning Approach

Wednesday, May 21st

We'll wrap up the machine learning material with a more detailed discussion of the differences between bagging (bootstrap aggregating), boosting, and random forests.
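As a reference point for that discussion, the sketch below shows bagging only: each base learner is trained on a bootstrap sample (rows drawn with replacement) and predictions are combined by majority vote. Random forests additionally restrict each split to a random subset of features, and boosting trains learners sequentially so that later ones focus on earlier mistakes; neither is shown here. The decision-stump base learner, the function names, and the synthetic data are assumptions made for the example.

```python
import numpy as np


def fit_stump(X, y):
    """Pick the (feature, threshold, polarity) with the lowest training error."""
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, feature] - threshold) >= 0, 1, 0)
                error = np.mean(pred != y)
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best[1:]


def predict_stump(stump, X):
    feature, threshold, polarity = stump
    return np.where(polarity * (X[:, feature] - threshold) >= 0, 1, 0)


def fit_bagged(X, y, n_estimators=25, seed=0):
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_estimators):
        # Bootstrap: sample n row indices with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps


def predict_bagged(stumps, X):
    # Aggregate: majority vote across the ensemble.
    votes = np.mean([predict_stump(s, X) for s in stumps], axis=0)
    return (votes >= 0.5).astype(int)


# Synthetic data with a diagonal class boundary (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

stumps = fit_bagged(X, y)
predictions = predict_bagged(stumps, X)
accuracy = np.mean(predictions == y)
print(f"training accuracy: {accuracy:.3f}")
```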

### Recommended Resources

| Title | Author | Type | Length |
| --- | --- | --- | --- |
| Ensemble Learning | Wikipedia | Article | --- |
| sklearn doc | Scikit-learn | Documentation | --- |
| yhat Blog on Random Forests | yHat | Blog article | --- |
| Ensemble Methods in Machine Learning | Dietterich, Thomas | PDF Journal | 15 pages |
| A Few Useful Things to Know about Machine Learning | Domingos, Pedro | PDF Journal | 9 pages |
| Ensemble Methods | Hyer, Jay | Presentation | 31 slides |
| Kaggle Random Forests | Kaggle | Kaggle | --- |

Class Lecture

Updates expected -- See Lesson Folder for further details