Wednesday, 9/3/2014
Monday, 9/8/2014
##### Learning how to use the file pager, less
Handy to have this in your bookmarks!
Wednesday, 9/10/2014
- Watch the 5 minute "Ipython Notebook Tour"
- Review "What is NumPy"
- Watch Wes McKinney's 10 minute Whirlwind Tour of Pandas (even once is ok ;-) )
- Another great resource: Review Chapters 1 to 5 of Julia Evans's Pandas Cookbook
Monday 9/15/2014
Wednesday 9/17/2014
Lecture Notes: Data Visualization
Python Notebook: Plotting with Matplotlib
- Complete and submit previous assignments
Resource | About |
---|---|
Basic Plotting in Pandas | |
Matplotlib userguide | |
Matplotlib Gallery | Examples with Code |
Rougier and Prace EuroSciPy Matplotlib Tutorial | Short Overview |
Monday 9/22/2014
We'll review a number of datasets and walk through the data exploration process.
The ACES model for Data Exploration:
Letter | Step | Notes |
---|---|---|
A | Acquire the data and Assemble the data frame | Find data, import into Pandas |
C | Clean the data frame | Identify and limit columns, rows, indices, dates, etc. |
E | Explore global properties | Visualize! Basic plots and stats appropriate to the data set |
S | Subset comparisons | Look at (visualize!) initial emergent variable relationships and subsets |
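
The four ACES steps can be sketched in a few lines of pandas. This is a minimal illustration, not the course notebook: the inline CSV, its column names (`school`, `state`, `sat_math`, `sat_verbal`, `notes`), and its values are all made up for the example.

```python
import io

import pandas as pd

# Hypothetical raw data standing in for a downloaded CSV file.
RAW_CSV = """school,state,sat_math,sat_verbal,notes
Adams High,NY,510,495,ok
Baker High,NY,,480,missing math
Clark High,NJ,560,540,ok
Davis High,NJ,530,515,ok
"""

# A: acquire the data and assemble the data frame.
df = pd.read_csv(io.StringIO(RAW_CSV))

# C: clean by keeping only the useful columns and dropping incomplete rows.
clean = df[["school", "state", "sat_math", "sat_verbal"]].dropna()

# E: explore global properties with summary statistics.
summary = clean[["sat_math", "sat_verbal"]].describe()

# S: compare subsets, here the mean scores per state.
by_state = clean.groupby("state")[["sat_math", "sat_verbal"]].mean()

print(summary)
print(by_state)
```

For the visual half of the E and S steps, `clean.hist()` or `by_state.plot(kind="bar")` would be natural follow-ups.
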
- EDA with SAT Scores
- Grouping with Pandas
- Data Wrangling Movies
- EDA Questions
- Volinsky EDA Presentation
Wednesday 9/24/2014
[Project 1: Scraping, APIs, and Data Visualization](https://github.com/datadave/GADS9-NYC-Spring2014-Lectures/blob/master/projects/project01.md)
- Selected Presentations of Student Projects
- Discussion of Data Science Careers
- Introduction to Machine Learning
Monday 9/29/2014
We'll discuss the linear regression algorithm and learn how to score regression models.
Please submit three optimized models built from the `data/day.csv` file, one for each y variable (`casual`, `registered`, and `cnt`), as an IPython notebook or Python script. Put your submission in a `lab_submissions/lab07/yourname` folder.
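
As a rough illustration of fitting and scoring a regression, the sketch below fits a one-feature linear model and reports RMSE and R². It uses synthetic data and plain NumPy least squares rather than `data/day.csv` and scikit-learn, so the feature, the true coefficients, and the 150/50 train/test split are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one feature and one target (not the lab data).
x = rng.uniform(0.0, 1.0, size=200)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.1, size=200)

# Hold out the last 50 rows so the scores reflect unseen data.
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

# Fit y = slope * x + intercept by ordinary least squares.
design = np.column_stack([x_train, np.ones_like(x_train)])
(slope, intercept), *_ = np.linalg.lstsq(design, y_train, rcond=None)

predictions = slope * x_test + intercept

# Two common regression scores: RMSE (lower is better) and R^2 (closer to 1 is better).
residual_ss = np.sum((y_test - predictions) ** 2)
total_ss = np.sum((y_test - y_test.mean()) ** 2)
rmse = np.sqrt(residual_ss / len(y_test))
r_squared = 1.0 - residual_ss / total_ss

print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"RMSE={rmse:.3f} R^2={r_squared:.3f}")
```

With scikit-learn, `LinearRegression` and `r2_score` cover the same fit and score steps.
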
Resource | About |
---|---|
Regressions with Sklearn | |
Overfitting Regressions | |
Guide to Logistic Regression | |
Khan Academy Algebra Review | |
MIT OCW | |
Wednesday 10/01/2014
We'll review some basics of probability, develop ways to work with text data, and use a classification algorithm to classify text.
- Articulate Naive Bayes' advantages, flaws, applications and theoretical foundation
- Explain how Naive Bayes is applied to classify text or Spam
- Be familiar with using the N.B. classifiers in NLTK and SKLearn
- Create a basic Naive Bayes classifier
- NB_Gender_Names_NLTK: Notebook covering basics of Naive Bayes with single features
- NB_Biebama_NLTK: Demo: Classifying text as Obama or Bieber
- NB_Movies_SKLearn: Illustration of SK Learn NB functions
- NB_Movies_NTLK: Illustration of NB on text with NLTK
- Add a feature to the NLTK gender classifier to try and improve performance
- Create a classifier to tell the difference between two authors
- Brainstorm classification topics for projects (due May 14)
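
To make the "basic Naive Bayes classifier" objective concrete, here is a minimal from-scratch sketch using only the standard library. It is not the NLTK or scikit-learn implementation: the function names `train` and `classify`, the whitespace tokenizer, and the four-message training set are all hypothetical choices for illustration.

```python
import math
from collections import Counter, defaultdict


def train(documents):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in documents:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log posterior under Naive Bayes."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, doc_count in label_counts.items():
        # Log prior: share of training documents with this label.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set; a real spam filter needs far more data.
training = [
    ("win cash prize now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
model = train(training)
print(classify("claim free cash", model))
print(classify("notes from lunch", model))
```

The "naive" part is the loop over words: each word contributes its own log probability independently of the others, which is the assumption behind both the method's speed and its known flaws.
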
Based on student feedback:
Monday 10/6/2014
- Understand how to apply logistic regression to a classification problem
- Create a two-dimensional feature space to evaluate the performance of classifiers
- Leverage the interoperability of SKLearn classifiers to compare KNN, Naive Bayes, Decision Trees and Logistic Regression on a single classification problem
The lesson notebook provides:
- A brief background on logistic classification
- A mesh function using `np.meshgrid` to evaluate the predictive functions on a two-dimensional feature grid
The intention is to provide a starting template with which to contrast various classifiers on a clean, real-world data set.
Students are expected to add additional classifiers to the notebook, experiment with parameters, and draw conclusions about the differences in classifier performance on the given sample data.
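
The mesh idea can be sketched as follows. This is an assumption-laden illustration rather than the notebook's own code: `predict_probability` is a hand-written logistic function, and the `weights` and `bias` values are hypothetical stand-ins for a fitted model.

```python
import numpy as np


def predict_probability(points, weights, bias):
    """Logistic model: probability of class 1 for each row of `points`."""
    return 1.0 / (1.0 + np.exp(-(points @ weights + bias)))


# Hypothetical fitted parameters for a two-feature logistic model.
weights = np.array([1.5, -2.0])
bias = 0.25

# Build a grid covering the two-dimensional feature space.
x1_values = np.linspace(-3.0, 3.0, 61)
x2_values = np.linspace(-3.0, 3.0, 61)
grid_x1, grid_x2 = np.meshgrid(x1_values, x2_values)

# Flatten the grid into (n_points, 2), predict, then restore the grid shape.
grid_points = np.column_stack([grid_x1.ravel(), grid_x2.ravel()])
probabilities = predict_probability(grid_points, weights, bias).reshape(grid_x1.shape)
predicted_class = (probabilities >= 0.5).astype(int)

print(predicted_class.shape)
```

Passing `grid_x1`, `grid_x2`, and `predicted_class` to matplotlib's `contourf` shades the decision regions. Swapping in another classifier's prediction function on the same grid makes the boundaries directly comparable.
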
Wednesday 10/8/2014
Humans are good, often too good, at clustering, and it's another realm of our intelligence that we can programmatically apply to machines. Toddlers can tell that objects are boats, flags, and doggies -- but how?
Machine clustering is used to categorize the web, understand galaxies, organize genetic data, segment customers, classify mental illness, and detect disease patterns, to name just a few applications.
K-Means is a very simple clustering algorithm that works well and is by far the most widely used.
Here are some resources to get started:
Title | Author | Type | Length | Difficulty | Description | Rating (1 to 4 Stars) |
---|---|---|---|---|---|---|
Cluster Analysis and K-Means | Kumar, UMN | PDF Excerpt | 40 pages | Intermediate | Good chapter overview of clustering; section "8.2.1 Basic K-Means Algorithm" gives a great K-Means summary. The rest of 8.2 covers complications with K-Means and concludes with the optimization math. If you just want the minimum, read pp. 498 to 505. | ++++ |
Clustering Overview | StanfordML | HTML page | 3 pages | Intermediate | Good, quick overview of everything | ++++ |
K-Means Clustering | Mathematical Monk | Video | 15 minutes | Novice | Good Khan-style overview of the math | +++ |
K-Means Wikipedia Entry | Everyone | Wikipedia | 6 pages | Intermediate | Includes the Iris and 'mickey mouse' examples we'll be looking at | ++ |
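
For orientation before the readings, here is a minimal NumPy sketch of the basic K-Means loop (Lloyd's algorithm). The function name `k_means`, the fixed iteration count, and the two synthetic blobs are assumptions for illustration; library implementations such as scikit-learn's `KMeans` add smarter initialization and convergence checks.

```python
import numpy as np


def k_means(points, k, n_iterations=20, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iterations):
        # Assignment step: label each point with its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                centroids[j] = members.mean(axis=0)
    return centroids, labels


# Two synthetic, well-separated blobs (made up for the example).
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = k_means(points, k=2)
print(centroids)
```

Note that the algorithm never sees labels: the cluster numbers it returns are arbitrary, so cluster 0 in one run may be cluster 1 in another.
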
Monday 10/13/2014
We'll wrap up the machine learning material with a more detailed discussion of the differences between bagging (bootstrap aggregating), boosting, and random forests.
Title | Author | Type | Length |
---|---|---|---|
Ensemble Learning | Wikipedia | Article | --- |
sklearn doc | scikit-learn | Documentation | --- |
yhat Blog on Random Forests | yHat | blog article | --- |
Ensemble Methods in Machine Learning | Dietterich, Thomas | PDF Journal | 15 pages |
A Few Useful Things to Know about Machine Learning | Domingos, Pedro | PDF Journal | 9 pages |
Ensemble Methods | Hyer, Jay | Presentation | 31 Slides |
Kaggle Random Forests | Kaggle | Kaggle | --- |
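
To ground the bagging idea, the sketch below trains many one-threshold "stumps" on bootstrap samples and combines them by majority vote. It is a toy under stated assumptions: the helper names (`fit_stump`, `bagged_stumps`, `predict`), the single synthetic feature, and the 10% label noise are all hypothetical. Boosting would instead fit its models one after another, reweighting the examples earlier models got wrong, and a random forest would bag full decision trees while also sampling a random subset of features at each split.

```python
import numpy as np


def fit_stump(x, y):
    """Pick the threshold on a single feature that minimizes training errors."""
    best_threshold, best_errors = x[0], len(y) + 1
    for threshold in np.unique(x):
        errors = np.sum((x > threshold).astype(int) != y)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagged_stumps(x, y, n_models=25, seed=0):
    """Bagging: fit one stump per bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    thresholds = []
    for _ in range(n_models):
        # Bootstrap sample: draw n row indices with replacement.
        rows = rng.integers(0, len(x), size=len(x))
        thresholds.append(fit_stump(x[rows], y[rows]))
    return np.array(thresholds)


def predict(thresholds, x):
    """Aggregate by majority vote across the stumps."""
    votes = (x[:, None] > thresholds[None, :]).astype(int)
    return (votes.mean(axis=1) >= 0.5).astype(int)


# Synthetic data: class 1 when x > 5, with 10% of the labels flipped.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=200)
y_clean = (x > 5.0).astype(int)
flip = rng.random(200) < 0.1
y_noisy = np.where(flip, 1 - y_clean, y_clean)

thresholds = bagged_stumps(x, y_noisy)
accuracy = np.mean(predict(thresholds, x) == y_clean)
print(f"accuracy against the clean labels: {accuracy:.2f}")
```

In scikit-learn the three approaches correspond to `BaggingClassifier`, `AdaBoostClassifier` (one boosting variant), and `RandomForestClassifier`.
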
Updates expected -- See Lesson Folder for further details