# Data Science and Machine Learning Resources

These are notes and resources exploring business development, data science, computer science, and engineering as a whole. Since I never went to college, this is an effort to attain a deep understanding of graduate-level concepts and improve my Python development skills. I hope it might also be useful to others.

My workflow focuses on Python 3 and Jupyter notebooks containing extensive [Markdown](https://daringfireball.net/projects/markdown/syntax) notes. I have also included an extensive machine learning curriculum in this repository for self-study, along with notes for each lecture, which are committed as I complete them. I feel that [TensorFlow](https://www.tensorflow.org/) is an emerging machine learning standard and a very useful tool for rapidly testing and visualizing models. My focus is on [TensorFlow Serving](https://tensorflow.github.io/serving/architecture_overview) for production environments.

I'll be using data from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/) and the [Stanford Large Network Dataset Collection](http://snap.stanford.edu/data/) for examples. Purely as an exercise, data will be downloaded as needed and not included in the repository. Thus, it is possible that examples may break in the future. Please open issues if missing data becomes a problem.
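
The "download as needed" approach can be sketched roughly as follows. This is a minimal illustration, not the code used in the notebooks: the function name `fetch_if_missing` and the `data` cache directory are assumptions made for this example.

```python
import os
import shutil
import urllib.parse
import urllib.request


def fetch_if_missing(url, dest_dir="data"):
    """Download `url` into `dest_dir` unless a local copy already exists.

    Returns the path of the local file. `dest_dir` is a hypothetical
    cache directory that should stay out of version control.
    """
    os.makedirs(dest_dir, exist_ok=True)
    # Name the local file after the last component of the URL path.
    filename = os.path.basename(urllib.parse.urlparse(url).path)
    dest_path = os.path.join(dest_dir, filename)
    if not os.path.exists(dest_path):
        # Write to a temporary name first so that an interrupted
        # download never leaves a truncated file in the cache.
        tmp_path = dest_path + ".part"
        with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as out:
            shutil.copyfileobj(response, out)
        os.replace(tmp_path, dest_path)
    return dest_path
```

A notebook would call `fetch_if_missing(url)` with the URL of a dataset file from one of the collections above and then read the returned path. If the upstream file disappears, the call raises an error from `urllib`, which is the breakage described above.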

## Libraries and Toolkits

These libraries are included in this Docker image.

* [NumPy](http://www.numpy.org/)
* [SciPy](http://www.scipy.org/)
* [Matplotlib](http://matplotlib.org/)
* [scikit-learn](http://scikit-learn.org/stable/)
* [Cython](http://cython.org/)
* [pandas](http://pandas.pydata.org/)
* [NetworkX](https://networkx.github.io/)
* [TensorFlow](https://www.tensorflow.org/)

I will try to exercise legitimate academic rigor in this repository, referencing and including relevant papers where appropriate.

## Getting Started

I've included one notebook (`.ipynb`) from a third party, which discusses
differential equations and Markov chains. It is included both for reference
and because it is interesting.

### Configuring a MacOS X Environment

This is highly recommended. By using the Joyent pkgsrc repo, we can use the exact same package versions for local development and in production, either on privately owned machines or in the Joyent public cloud. I am partial to real UNIX. There are other ways to install Python and configure your environment; this is just my preference, because these instructions should mostly work on any Unix-like platform.

    curl -Os https://pkgsrc.joyent.com/packages/Darwin/bootstrap/bootstrap-trunk-x86_64-20160211.tar.gz
    sudo tar -zxpf bootstrap-trunk-x86_64-20160211.tar.gz -C /
    rm bootstrap-trunk-x86_64-20160211.tar.gz
    sudo chown -R $(whoami) /opt/pkg && sudo chown -R $(whoami) /var/db/pkgin
    /opt/pkg/bin/pkgin -y update
    /opt/pkg/bin/pkgin in gcc5-5.3.0nb1 python35-3.5.1nb2 py35-pip-7.1.2 zeromq-4.1.4 py35-qt4-4.11.4nb2 py35-sqlite3-3.5.1nb6
    ln -s /opt/pkg/gcc5/bin/gfortran /opt/pkg/bin/gfortran
    /opt/pkg/bin/pip install --upgrade pip
    /opt/pkg/bin/pip install sphinx jupyter pandas scipy scikit-learn matplotlib networkx
    /opt/pkg/bin/pip install https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.1-cp35-none-any.whl
    mkdir -p ~/.matplotlib && echo "backend: qt4agg" >> ~/.matplotlib/matplotlibrc

You will also want to put a `PATH=/opt/pkg/sbin:/opt/pkg/bin:$PATH` declaration in your `~/.bashrc` or `~/.zshrc`.
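
To confirm that the interpreter on your `PATH` can see the libraries, a small check along these lines may help. It is only a sketch: `check_environment` and the `PACKAGES` list are names made up for this example, and the list covers only the libraries named in this README.

```python
import importlib

# Import names for the libraries listed above. Note that scikit-learn
# is imported as "sklearn".
PACKAGES = ["numpy", "scipy", "pandas", "sklearn", "matplotlib",
            "networkx", "tensorflow"]


def check_environment(names=PACKAGES):
    """Map each import name to its version string, or None if missing."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Not every module exposes __version__, so fall back gracefully.
            report[name] = getattr(module, "__version__", "unknown")
    return report
```

Running `print(check_environment())` in the same shell you will use for Jupyter shows at a glance which imports fail.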

Once set up, you may do:

    jupyter notebook

and/or:

    tensorboard --logdir=/tmp

**NOTE:** there are actually no TensorFlow models currently included in this repository.

## An emergent space

The tooling and methodology in the open source community are constantly
evolving, and the best design patterns have yet to be determined. There are so
many problem spaces, even within the domain of internet marketing, that one
size cannot possibly fit all. I aspire to explore many approaches in this
repository, although the *focus is on Python tooling*.

## What is a data scientist?

"_The purpose of computing is insight, not numbers._" --Richard Hamming, 1961

According to Hilary Mason's definition, data science is three fundamentally
different things:

1. Math - having a theoretical understanding
2. Code - implementing ideas in code
3. Communication - telling stories about the data you work with

And I would add...

### 4. Visualization

"_The greatest value of a picture is when it forces us to notice what we never
expected to see._" --John Tukey

More than pure mathematics or programming, data science is about
framing the data in such a way that you can attain useful insights about it and
determine what questions to ask of it. Visualization is a powerful tool to
achieve this goal.

*The Visual Display of Quantitative Information* by Edward Tufte is a canonical resource for creating useful visualizations.

## The Process

1. What is the problem we are really trying to solve?
2. How do we know when we've won? What are the error metrics for success?
3. Assuming we've solved this perfectly, what is the first thing we'll do?
4. How impactful is this? Does it matter?
5. What is the most evil thing that can be done with this?
6. Where do I begin?
    1. What data *do* we have?
    2. What data **should** we have?
    3. What **assumptions** do I make that I can experimentally verify?
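
One way to make question 2 concrete is to fix an error metric and a naive baseline before building anything. The sketch below is an illustration only: mean absolute error and the "always predict the mean" baseline are assumptions chosen for this example, and the function names are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def beats_baseline(actual, predicted, margin=0.0):
    """True if the predictions beat a constant mean predictor by `margin`.

    The baseline here is computed from `actual` itself, which is only
    acceptable for illustration; in practice it would come from
    training data.
    """
    mean = sum(actual) / len(actual)
    baseline_error = mean_absolute_error(actual, [mean] * len(actual))
    model_error = mean_absolute_error(actual, predicted)
    return model_error < baseline_error - margin
```

Agreeing on the metric and the `margin` up front gives a definition of "won" that can be checked experimentally, rather than one chosen after seeing the results.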

You don't use data to replace intuition; you use data to enhance your
intuition. Take the assumptions you make, validate them, and act upon them.
Have metrics meetings. Have multiple people agree on what action they will take
based on your data. Don't follow data blindly: people have followed GPS into
lakes. We cannot trust technology over our brains.

Good people are much better than good tools. So what does technology offer?

We must identify things which are becoming possible but have not yet been
commoditized. The technology landscape is changing very quickly. Data science
(R&D) needs to be integrated into our applications. Artificial intelligence is
anything that a computer cannot do today.

Natural language generation is an emergent field - taking large volumes of
structured data and producing human readable content.

We're beginning to see much more data from the real world. Many people call this
the internet of things. It's become very cheap to generate data from physical
reality. Realtime answers are increasingly important, although the definition of
"realtime" is nebulous. It could mean seconds and it could mean days.

### What kind of world does our technology enable?

With possibly billions of people using a product, we have the potential to
actually alter human behavior. Data gives us superpowers. The biggest opportunities could be very surprising. Startups are taking commodity technology and combining it in new ways, but they don't tend to have much data. Improving existing processes has a constrained upside - the value proposition is easier to articulate, but you can only reduce an already sunk cost for a potential customer.

# Self Study Resources

## Videos

* Neural Networks Demystified
    * [Part 1](http://lumiverse.io/video/part-1-data-and-architecture) - Data and Architecture
    * [Part 2](http://lumiverse.io/video/part-2-forward-propagation) - Forward Propagation
    * [Part 3](http://lumiverse.io/video/part-3-gradient-descent) - Gradient Descent
    * [Part 4](http://lumiverse.io/video/part-4-backpropagation) - Backpropagation
    * [Part 5](http://lumiverse.io/video/part-5-numerical-gradient-checking) - Numerical Gradient Checking
    * [Part 6](http://lumiverse.io/video/part-6-training) - Training
    * [Part 7](http://lumiverse.io/video/part-7-overfitting-testing-and-regularization) - Overfitting, testing, and regularization
* IPython in Depth
    * [Part 1](https://www.youtube.com/watch?v=xe_ATRmw0KM)
    * [Part 2](https://www.youtube.com/watch?v=A8VbS-YX2Lo)
    * [Part 3](https://www.youtube.com/watch?v=4tJKZWWRs6s)
* (Python) [Generators: The Final Frontier](https://www.youtube.com/watch?v=5-qadlG7tWo)

## Links

* [LaTeX Mathematics Markup](https://en.wikibooks.org/wiki/LaTeX/Mathematics)
* [Hacker's Guide to Neural Networks](http://karpathy.github.io/neuralnets/)
* [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/)
* [The Open Source Data Science Masters](http://datasciencemasters.org/)
* [A Taxonomy of Data Science](http://www.dataists.com/2010/09/a-taxonomy-of-data-science/)
* [A Practical Intro to Data Science](http://www.zipfianacademy.com/blog/post/46864003608/a-practical-intro-to-data-science)
* [Deep Learning with Spark and TensorFlow](https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html)
* [CS231N Convolutional Neural Networks for Visual Recognition](http://cs231n.github.io/classification/)
* [Elements of Python Style](https://github.com/amontalenti/elements-of-python-style)

## Papers

* [Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs](doc/papers/can_programming_be_liberated.pdf)

## Books

* [How to Think Like a Computer Scientist](http://openbookproject.net/thinkcs/python/english3e/index.html)
* [Learning IPython for Interactive Computing and Data Visualization](doc/books/learning_ipython_for_interactive_computing_and_data_visualization.pdf)
* [IPython Interactive Computing and Visualization Cookbook](doc/books/ipython_interactive_computing_and_visualization_cookbook.pdf)
* [Bayesian Artificial Intelligence](doc/books/bayesian_artifical_intelligence_2nd_edition.pdf)
* [K & R: The C Programming Language](doc/books/k_and_r_the_c_programming_language.pdf)
* [Learn Python the Hard Way](doc/books/learn_python_the_hard_way.pdf)
* [Machine Learning for Hackers](doc/books/machine_learning_for_hackers.pdf)
* [Reliable Reasoning: Induction and Statistical Learning Theory](doc/books/reliable_reasoning_induction_and_statistical_learning_theory.pdf)
* [Think Bayes](doc/books/think_bayes.pdf)
* [The Visual Display of Quantitative Information](doc/books/the_visual_display_of_quantitative_information.pdf)

## Courses

### [Stanford - CS229][1]

This is a very fast-paced course taught by [Andrew Ng](https://www.coursera.org/instructor/andrewng),
so it is not a great starting point if you want to be taught the
history and concepts of machine learning. Nonetheless, it is taught by an active
practitioner at the cutting edge of the field.

While this is probably contentious, I'd say that [MIT-6.034](#mit---6034-artificial-intelligence) is a prerequisite. CS229 is more of an applied theory course.

### [MIT - 2.087][2] Engineering Mathematics: Linear Algebra and ODE

Note that this course starts with lecture 2, presumably due to a copyright claim. It seems likely that many other lectures were also removed; these four are the only ones posted.

#### Notes (in progress)

* [Lecture 2](classes/MIT/2.087/lectures/lecture_2.ipynb) - First-Order Equations
* [Lecture 3](classes/MIT/2.087/lectures/lecture_3.ipynb) - First-Order Equations (continued)
* [Lecture 4](classes/MIT/2.087/lectures/lecture_4.ipynb) - Second-Order Equations
* [Lecture 5](classes/MIT/2.087/lectures/lecture_5.ipynb) - Second-Order Equations (continued)

### [MIT - 6.00SC][3] Introduction to Computer Science and Programming

### [MIT - 6.02][4] Introduction to EECS II: Digital Communication Systems

### [MIT - 6.033][5] Computer System Engineering

### [MIT - 6.034][6] Artificial Intelligence

This is a much "softer" course than [Stanford-CS229](#stanford---cs229),
providing more of a human element and more historical context while also
covering the technical details.

### [MIT - 6.035][7] Computer Language Engineering

### [MIT - 6.041SC][8] Probabilistic Systems Analysis and Applied Probability

### [MIT - 6.172][9] Performance Engineering of Software Systems

### [MIT - 6.849][10] Geometric Folding Algorithms

### [MIT - 6.851][11] Advanced Data Structures

This course will absolutely make you a better programmer.

#### Notes (in progress)

* [Lecture 1](classes/MIT/6.851/lectures/lecture_1.ipynb) - Persistent Data Structures
* [Lecture 2](classes/MIT/6.851/lectures/lecture_2.ipynb) - Retroactive Data Structures
* [Lecture 3](classes/MIT/6.851/lectures/lecture_3.ipynb) - Geometric Data Structures I
* [Lecture 4](classes/MIT/6.851/lectures/lecture_4.ipynb) - Geometric Data Structures II
* [Lecture 5](classes/MIT/6.851/lectures/lecture_5.ipynb) - Dynamic Optimality I
* [Lecture 6](classes/MIT/6.851/lectures/lecture_6.ipynb) - Dynamic Optimality II
* [Lecture 7](classes/MIT/6.851/lectures/lecture_7.ipynb) - Memory Hierarchy Models
* [Lecture 8](classes/MIT/6.851/lectures/lecture_8.ipynb) - Cache-Oblivious Structures I
* [Lecture 9](classes/MIT/6.851/lectures/lecture_9.ipynb) - Cache-Oblivious Structures II
* [Lecture 10](classes/MIT/6.851/lectures/lecture_10.ipynb) - Dictionaries
* [Lecture 11](classes/MIT/6.851/lectures/lecture_11.ipynb) - Integer Models
* [Lecture 12](classes/MIT/6.851/lectures/lecture_12.ipynb) - Fusion Trees
* [Lecture 13](classes/MIT/6.851/lectures/lecture_13.ipynb) - Integer Lower Bounds
* [Lecture 14](classes/MIT/6.851/lectures/lecture_14.ipynb) - Sorting in Linear Time
* [Lecture 15](classes/MIT/6.851/lectures/lecture_15.ipynb) - Static Trees
* [Lecture 16](classes/MIT/6.851/lectures/lecture_16.ipynb) - Strings
* [Lecture 17](classes/MIT/6.851/lectures/lecture_17.ipynb) - Succinct Structures I
* [Lecture 18](classes/MIT/6.851/lectures/lecture_18.ipynb) - Succinct Structures II
* [Lecture 19](classes/MIT/6.851/lectures/lecture_19.ipynb) - Dynamic Graphs I
* [Lecture 20](classes/MIT/6.851/lectures/lecture_20.ipynb) - Dynamic Graphs II
* [Lecture 21](classes/MIT/6.851/lectures/lecture_21.ipynb) - Dynamic Connectivity Lower Bound
* [Lecture 22](classes/MIT/6.851/lectures/lecture_22.ipynb) - History of Memory Models

### [MIT - 6.868J][12] The Society of Mind

### [MIT - 7.91J][13] Foundations of Computational and Systems Biology

### [MIT - 15.356][14] How to Develop Breakthrough Products & Services

### [MIT - 16.810][15] Engineering Design and Rapid Prototyping

### [MIT - 18.01][16] Single Variable Calculus

These notes include extensive LaTeX math markup, so taking them is a bit arduous, but I consider it a worthwhile exercise. I also try to include NumPy/SciPy code samples which draw plots to illustrate the concepts.

#### Notes

* [Lecture 1](classes/MIT/18.01/lectures/lecture_1.ipynb)

### [MIT - 18.01SC][17] Homework Help for Single Variable Calculus

I will not be including any notes on this; it is only referenced for completeness.

### [MIT - 18.02][18] Multivariable Calculus

### [MIT - 18.03SC][19] Differential Equations

### [MIT - 18.05][20] Mathematical Methods for Engineers

### [MIT - 18.06][21] Linear Algebra

### [MIT - 18.085][22] Computational Science & Engineering I

### [Caltech - CS 156][23] Machine Learning

### [Caltech - Psy 120][24] The Neuronal Basis of Consciousness

[1]: https://www.youtube.com/watch?v=UzxYlbK2c7E&list=PLA89DCFA6ADACE599&index=1
[2]: https://www.youtube.com/watch?v=4X0SGGrXDiI&list=PLUl4u3cNGP63w3DE9izYp_3fpAdi0Wkga
[3]: https://www.youtube.com/watch?v=bX3jvD7XFPs&list=PLB2BE3D6CA77BB8F7
[4]: https://www.youtube.com/watch?v=BtaVq2g17G0&list=PLUl4u3cNGP63ZWyJMdWIVtyweopUN3xt3
[5]: https://www.youtube.com/watch?v=zm2VP0kHl1M&list=PL6535748F59DCA484
[6]: https://www.youtube.com/watch?v=TjZBTDzGeGg&list=PLUl4u3cNGP63gFHB6xb-kVBiQHYe_4hSi
[7]: https://www.youtube.com/playlist?list=PL0300FE43396456C1
[8]: https://www.youtube.com/watch?v=j9WZyLZCBzs&list=PLUl4u3cNGP60A3XMwZ5sep719_nh95qOe
[9]: https://www.youtube.com/playlist?list=PLD2AE32F507F10481
[10]: https://www.youtube.com/watch?v=MDcAOTaCXHs&index=1&list=PLUl4u3cNGP62xuxL4CQpy8uo2MeM4a3YD
[11]: https://www.youtube.com/watch?v=T0yzrZL1py0&list=PLUl4u3cNGP61hsJNdULdudlRL493b-XZf
[12]: https://www.youtube.com/watch?v=-pb3z2w9gDg&index=1&list=PLUl4u3cNGP61E-vNcDV0w5xpsIBYNJDkU
[13]: https://www.youtube.com/watch?v=lJzybEXmIj0&list=PLUl4u3cNGP63uK-oWiLgO7LLJV6ZCWXac
[14]: https://www.youtube.com/watch?v=cKcAcm5NDOI&list=PLUl4u3cNGP63IhNlzdfQL0OALmfEUy4Re
[15]: https://www.youtube.com/watch?v=bQrAhaXfSNA&list=PL9FF086E91F3A974C
[16]: https://www.youtube.com/playlist?list=PL590CCC2BC5AF3BC1
[17]: https://www.youtube.com/playlist?list=PL21BCE50ABFF029F1
[18]: https://www.youtube.com/watch?v=PxCxlsl_YwY&list=PL4C4C8A7D06566F38
[19]: https://www.youtube.com/watch?v=76WdBlGpxVw&list=PL64BDFBDA2AF24F7E
[20]: https://www.youtube.com/watch?v=CgfkEUOFAj0&list=PL375F54C29AC641E3
[21]: https://www.youtube.com/watch?v=ZK3O402wf1c&list=PLE7DDD91010BC51F8
[22]: https://www.youtube.com/watch?v=f2eYLK6TpRs&list=PL51CACD5B1F58C40C
[23]: https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A
[24]: https://www.youtube.com/watch?v=cE9JeVCuN08&list=PL1DBCFC32CF6945EE