MSAN501 -- Computational analytics

Written by Terence Parr, professor of computer science and analytics at the University of San Francisco, with ideas from the faculty.

This repository contains the exercises for the 5-week computational analytics bootcamp for the MS in Analytics program at the University of San Francisco. It collects all of the labs students must complete by the end of the bootcamp in order to pass. The labs start out as very simple tasks or step-by-step recipes but then accelerate in difficulty, culminating in an interesting text analysis project.

Table of contents

Part I -- Introduction

  • Audience and Summary
  • “Newbies say the darndest things”

Part II -- Python Programming and Data Structures

  • Computing Point Statistics
  • Approximating sqrt(n) with the Babylonian Method
  • Generating Uniform Random Numbers
  • Histograms Using matplotlib
  • Graph Adjacency Lists and Matrices

Part III -- A Taste of Distributed Computing

  • Launching a Virtual Machine at Amazon Web Services
  • Linux command line
  • Using the Hadoop Streaming Interface with Python

Part IV -- Empirical statistics

  • Generating Binomial Distributions
  • Generating Exponential Random Variables
  • The Central Limit Theorem in Action
  • Generating Normal Random Variables
  • Confidence Intervals for Price of Hostess Twinkies
  • Is Free Beer Good For Tips?

Part V -- Optimization and Prediction

  • Iterative Optimization Via Gradient Descent
  • Predicting Murder Rates With Gradient Descent

Part VI -- Text Analysis

  • Summarizing Reuters Articles with TFIDF

Summary

This course is specifically designed as an introduction to analytics programming for those who are not yet skilled programmers. The course also explores many concepts from math and statistics, but in an empirical fashion rather than symbolically as one would do in a math class. Consequently, this course is also useful to programmers who would like to strengthen their understanding of numerical methods.

The exercises are grouped into parts. We begin with simple programs that compute statistics, build simple data structures, and use libraries to create visualizations, then move on to using the UNIX command line, launching virtual machines in the cloud, and writing simple Hadoop map-reduce programs.

The empirical statistics part strives to give an intuitive feel for random variables, density functions, the central limit theorem, hypothesis testing, and confidence intervals. It's one thing to learn their formal definitions, but to get a solid grasp of these concepts it helps to observe statistics in action. All of the techniques we'll use in empirical statistics rely on the ability to generate random values from a particular distribution, and we can derive them all from a uniform random number generator, which we build in an earlier exercise.
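To make that idea concrete, here is a minimal sketch (not the bootcamp's lab code) of the inverse transform method: if U is uniform on [0, 1), then -ln(1 - U)/λ is exponentially distributed with rate λ. The names runif and rexp are illustrative, and the sketch leans on Python's built-in generator rather than a uniform generator built from scratch.

```python
import math
import random

def runif():
    """Uniform value in [0, 1). The labs build their own uniform
    generator; Python's built-in one suffices for this sketch."""
    return random.random()

def rexp(lambduh):
    """Exponential random variable via the inverse transform method:
    if U ~ Uniform(0,1), then -ln(1 - U) / lambda ~ Exponential(lambda)."""
    return -math.log(1.0 - runif()) / lambduh

# The sample mean should approach 1/lambda = 0.25 for lambda = 4.
samples = [rexp(4.0) for _ in range(100_000)]
print(sum(samples) / len(samples))
```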

The optimization exercises deal with minimizing functions. Given a function f(x), optimizing it generally means finding its minimum or maximum, which occurs where the derivative goes flat: f'(x) = 0. When we cannot solve f'(x) = 0 symbolically, we fall back on a general technique called gradient descent that searches for minima numerically. It's like putting a marble on a hilly surface and letting gravity bring it to the nearest minimum.
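As a concrete illustration (a minimal one-dimensional sketch, not the murder-rate lab itself), gradient descent repeatedly steps opposite the slope until the position stops moving. The learning rate eta, the finite-difference slope estimate, and the stopping precision are choices made for this example only.

```python
def gradient_descent(f, x0, eta=0.01, precision=1e-8):
    """Minimize a single-variable function f by walking downhill.
    The slope is approximated with a finite difference, so no
    symbolic derivative is required."""
    h = 1e-6  # step for the finite-difference slope estimate
    x = x0
    while True:
        slope = (f(x + h) - f(x - h)) / (2 * h)  # approximate f'(x)
        x_new = x - eta * slope                  # step downhill
        if abs(x_new - x) < precision:           # stop when we barely move
            return x_new
        x = x_new

# Example: f(x) = (x - 2)^2 + 1 has its minimum at x = 2.
print(gradient_descent(lambda x: (x - 2)**2 + 1, x0=0.0))
```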

Finally, we'll do an exercise that introduces text analysis. For each word in a document, we'll compute a score called TFIDF that indicates how well that word distinguishes the document from the other documents in a corpus. That score is used broadly in text analytics, but our exercise uses it to summarize documents by listing their most important words.
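Here is a minimal sketch using one common TFIDF formulation (term frequency times the log of inverse document frequency); the exact formula and tokenization in the Reuters exercise may differ, and the tiny corpus below is made up purely for illustration.

```python
import math
from collections import Counter

def tfidf(doc_words, corpus):
    """Score each word in doc_words by term frequency times inverse
    document frequency across the corpus (a list of word lists)."""
    N = len(corpus)
    tf = Counter(doc_words)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)  # documents containing word
        idf = math.log(N / df)                        # rarer words score higher
        scores[word] = (count / len(doc_words)) * idf
    return scores

corpus = [
    "oil prices rose sharply today".split(),
    "the central bank held rates".split(),
    "oil output fell as prices rose".split(),
]
scores = tfidf(corpus[0], corpus)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])
```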
