Data Science Course: Lectures and Materials
Issues: For questions, answers and discussions:
Viewing Your and Other Student Work:
Git Workflow and Command Line Tips:
Introduction to Data Science
Data Collection and Extraction
Project 1 Introduced
Learning how to use the file pager,
Handy to have this in your bookmarks!
Couple extra handy
Beautiful Soup Tutorials
APIs to play with
- Watch the 5 minute "Ipython Notebook Tour"
- Review "What is NumPy"
- Watch Wes McKinney's 10 minute Whirlwind Tour of Pandas (even once is ok ;-) )
- Another great resource: Review Chapters 1 to 5 of Julia Evans Cookbook
Data Visualization and MatPlotLib
- Complete and submit previous assignments
|Basic Plotting in Pandas|
|Matplotlib Gallery||Examples with Code|
|Rougier and Prace EuroSciPy Matplotlib Tutorial||Short Overview|
Exploratory Data Analysis
Monday 4/21/2014 We'll be reviewing a number of datasets and going through the Data Exploration Process
The ACES model for Data Exploration:
|A||Acquire the data and Assemble the data frame||Find data, import into Pandas|
|C||Clean the data frame||Identify and limit columns, rows, indices, dates, etc.|
|E||Explore global properties||Visualize! Basic plots and stats appropriate to the data set|
|S||Subset comparisons||Look at (visualize!) initial emergent variable relationships and subsets|
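The four ACES steps above can be sketched in a few lines of Pandas. This is a minimal illustration, not the course's own notebook: the inline CSV, its column names, and the variable names are all made up for the example.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for a real file or API response.
RAW_CSV = """date,city,temp_f,notes
2014-04-01,Boston,48,ok
2014-04-02,Boston,,sensor down
2014-04-01,Austin,75,ok
2014-04-02,Austin,79,ok
"""

# A: Acquire the data and assemble the data frame.
df = pd.read_csv(io.StringIO(RAW_CSV), parse_dates=["date"])

# C: Clean: keep the useful columns, drop rows with missing values, index by date.
clean = df[["date", "city", "temp_f"]].dropna().set_index("date")

# E: Explore global properties with basic summary statistics.
summary = clean["temp_f"].describe()

# S: Subset comparisons: compare one group against another.
by_city = clean.groupby("city")["temp_f"].mean()

print(summary)
print(by_city)
```

In a notebook, the E and S steps would normally be visual as well, for example a histogram of `clean["temp_f"]` or a bar plot of `by_city`.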
- EDA with SAT Scores
- Grouping with Pandas
- Data Wrangling Movies
- EDA Questions
- Volinsky EDA Presentation
N/A - Please review all prior materials and work on Project 1.
Presentations, Machine Learning, and Data Science Careers
Project 1: Scraping, APIs, and Data Visualization
- Selected Presentations of Student Projects
- Discussion of Data Science Careers
- Introduction to Machine Learning
Monday 4/28/2014 We'll be discussing the linear regression algorithm and learning how to score regression models
Please submit three optimized models, one for each y variable (casual, registered, and cnt), using the data/day.csv file in an IPython notebook or Python script. Please put this in a
|Regressions with Sklearn|
|Guide to Logistic Regression|
|Khan Academy Algebra Review|
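As a rough sketch of fitting and scoring a linear regression with scikit-learn, the example below uses synthetic data rather than data/day.csv, so the features, coefficients, and noise level are assumptions chosen purely for illustration. It reports two common regression scores, R^2 and RMSE, on a held-out test split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the course's day.csv):
# y depends linearly on two features plus a little Gaussian noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 0.5, size=200)

# Hold out a test set so the scores reflect unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2: fraction of variance explained. RMSE: typical error in units of y.
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("R^2: %.3f, RMSE: %.3f" % (r2, rmse))
```

For the assignment, the same pattern would presumably be repeated once per target column, with the features loaded from the CSV instead of generated.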
Wednesday 4/30/2014 We'll be reviewing some basics of probability, developing ways to work with text data, and using a classification algorithm to classify text.
- Articulate Naive Bayes' advantages, flaws, applications and theoretical foundation
- Explain how Naive Bayes is applied to classify text or Spam
- Be familiar with using the N.B. classifiers in NLTK and SKLearn
- Create a basic Naive Bayes classifier
- NB_Gender_Names_NLTK: Notebook covering basics of Naive Bayes with single features
- NB_Biebama_NLTK: Demo: Classifying text as Obama or Bieber
- NB_Movies_SKLearn: Illustration of SK Learn NB functions
- NB_Movies_NTLK: Illustration of NB on text with NLTK
- Add a feature to the NLTK gender classifier to try and improve performance
- Create a classifier to tell the difference between two authors
- Brainstorm classification topics for projects (due May 14)
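To make the "create a basic Naive Bayes classifier" objective concrete, here is a minimal from-scratch sketch in plain Python. It is not the NLTK or SKLearn implementation used in the notebooks; the function names and the four-document training set are hypothetical, and it assumes whitespace tokenization with add-one (Laplace) smoothing.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents and words per label; labeled_docs is a list of (text, label)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior under the naive assumption."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Start from the log prior P(label).
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add log P(word | label) with add-one (Laplace) smoothing.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy training data, for illustration only.
TRAIN = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting notes for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train_naive_bayes(TRAIN)
print(classify("claim your free money", model))
print(classify("notes from the team meeting", model))
```

Working in log space avoids underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen word from zeroing out an entire class.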
Follow Up Notes
Based on student feedback:
Classifier Comparison and Logistic Regression
- Understand how to apply logistic regression to a classification problem
- Create a two-dimensional feature space to evaluate the performance of classifiers
- Leverage the interoperability of SKLearn classifiers to compare KNN, Naive Bayes, Decision Trees and Logistic Regression on a single classification problem
The lesson notebook provides:
- A brief background on logistic classification
- A mesh function using np.meshgrid to evaluate the predictive functions on a 2 dimensional feature grid
The intention is to provide a starting template with which to contrast various classifiers on a clean, real-world data set.
Students are expected to add additional classifiers to the notebook, experiment with parameters, and develop conclusions about the differences between classifier performance on the given sample data.
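The mesh idea described above might look roughly like the following. This is a sketch under stated assumptions, not the lesson notebook itself: `predict_on_grid` is a hypothetical helper name, the data is synthetic (two Gaussian blobs from `make_blobs`), and the classifier parameters are arbitrary starting points.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def predict_on_grid(clf, X, step=0.1):
    """Evaluate a fitted classifier on a 2D grid spanning the data (hypothetical helper)."""
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, step), np.arange(y_min, y_max, step)
    )
    # Flatten the grid into (n_points, 2), predict, then restore the grid shape.
    grid_points = np.c_[xx.ravel(), yy.ravel()]
    zz = clf.predict(grid_points).reshape(xx.shape)
    return xx, yy, zz


# Synthetic two-class data in a two-dimensional feature space (an assumption).
X, y = make_blobs(
    n_samples=200, centers=[[-2, -2], [2, 2]], cluster_std=1.0, random_state=0
)

# All SKLearn classifiers share fit/predict/score, so they can be swapped freely.
classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "Logistic Regression": LogisticRegression(),
}

grids = {}
scores = {}
for name, clf in classifiers.items():
    clf.fit(X, y)
    grids[name] = predict_on_grid(clf, X)
    scores[name] = clf.score(X, y)
    print("%-20s training accuracy: %.2f" % (name, scores[name]))
```

Each `(xx, yy, zz)` triple can be passed to Matplotlib's `contourf` to shade the decision regions, with the training points scattered on top for comparison.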
Your Centroid or Mine? An Introduction to K-Means
Monday, May 19th
Humans are good, often too good, at clustering, and it's another realm of our intelligence that we can programmatically apply to machines. Toddlers can tell that objects are boats, flags, and doggies -- but how?
Machine clustering is used to categorize the web, understand galaxies, organize genetic data, segment customers, classify mental illness, and detect disease patterns, to name just a few applications.
K-Means is a very simple clustering algorithm that works well and is by far the most widely used.
Here are some resources to get started:
|Title||Author||Type||Length||Difficulty||Description||Rating (1 to 4 Stars)|
|Cluster Analysis and K-Means||Kumar, UMN||PDF Excerpt||40 pages||Intermediate||Good chapter overview on clustering and then section "8.2.1 Basic K-Means Algorithm" gives great K-Means summary. The rest of 8.2 includes complications with K-Means and concludes with the optimization math. If you just want the minimum, read pgs. 498 to 505||++++|
|Clustering Overview||StanfordML||HTML page||3 pages||Intermediate||Good, quick overview of everything||++++|
|K-Means Clustering||Mathematical Monk||Video||15 minutes||Novice||Good Khan-style overview of math||+++|
|K-Means Wikipedia Entry||Everyone||Wikipedia||6 pages||Intermediate||Includes Iris and 'mickey mouse' we'll be looking at.||++|
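For a feel of how simple the basic K-Means algorithm is, here is a minimal NumPy sketch of the usual assign-then-update loop (Lloyd's algorithm). It is an illustration under assumptions, not the lesson's code: the `kmeans` function, its parameters, and the two synthetic blobs are all made up, and initialization is plain random sampling of data points.

```python
import numpy as np


def kmeans(X, k, n_iters=100, seed=0):
    """Basic K-Means (Lloyd's algorithm); returns (centroids, labels)."""
    rng = np.random.RandomState(seed)
    # Initialize centroids as k distinct points sampled from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Synthetic data: two well-separated blobs (an assumption for the example).
rng = np.random.RandomState(42)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans(X, k=2)
print(centroids)
```

Random initialization can land in a poor local optimum on harder data, which is one of the complications the readings above discuss; library implementations typically rerun from several starting points.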
Review of Random Forests and the Ensemble Learning Approach
Wednesday, May 21st
We'll wrap up all machine learning material with a more detailed discussion of the differences between bagging (bootstrap aggregating), boosting, and random forests.
|yhat Blog on Random Forests||yHat||blog article||---|
|Ensemble Methods in Machine Learning||Dietterich, Thomas||PDF Journal||15 pages|
|A Few Useful Things to Know about Machine Learning||Domingos, Pedro||PDF Journal||9 pages|
|Ensemble Methods||Hyer, Jay||Presentation||31 Slides|
|Kaggle Random Forests||Kaggle||Kaggle||---|
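One way to see the three ensemble approaches side by side is to cross-validate them on the same data next to a single decision tree. The sketch below uses scikit-learn's stock estimators on a synthetic dataset; the dataset, the choice of AdaBoost as the boosting example, and the parameter values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (an assumption, not a course dataset).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

models = {
    # Baseline: one fully grown tree.
    "Single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: each tree is fit on a bootstrap sample; predictions are combined by vote.
    "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
    # Boosting: weak learners are fit in sequence, upweighting earlier mistakes.
    "Boosting (AdaBoost)": AdaBoostClassifier(n_estimators=50, random_state=0),
    # Random forest: bagged trees that also consider a random feature subset per split.
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print("%-20s mean CV accuracy: %.3f" % (name, results[name]))
```

The exact ranking depends on the data and parameters, so the numbers are best read as a starting point for experimentation rather than a verdict on any method.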
Updates expected -- See Lesson Folder for further details