In [None]:
%%sh
pip install seaborn
pip install ggplot
pip install matplotlib

<h1 align="center"> Teaching Machines to Learn </h1>
<hr>

<img src="http://static1.businessinsider.com/image/535edec0ecad04c0741f732f/construction_google_car.gif" style="width: 80%;"/>

<br>

### 1. A World of Signals: 

Perception, whether human or machine, is fundamentally dependent on our ability to *sense and analyze signals* that abound in the natural and digital world. Humans are evolutionarily equipped to sense these, although ability varies from person to person. But machines have to be taught first what perception means, and then taught how to keep learning from the perception principles they have acquired. 

This task makes for some of the most fascinating and interesting challenges that exist in our quest to make machines smarter. It also automatically divides computing into many subfields, based on the signals that a machine will encounter and must analyze:  

- When the signal is image/video/gif : *Computer Vision*
- When the signal is text            : *Natural Language Processing*
- When the signal is touch           : *Haptic Computing*
- When the signal is sound           : *Speech Recognition / Audio Signal Processing*
- When the signal is smell           : *There's [Cyranose!](https://en.wikipedia.org/wiki/Electronic_nose) / Classification of foods, bacteria detection [\(you laugh\)](http://www.disi.unige.it/person/MasulliF/papers/masulli-mcs02.pdf)*
- When the signal is taste           : *we aren't there yet* 

The idea is that if machines can analyze these signals (themselves representations of the signals we as humans perceive) accurately, they will have intelligence in dealing with situations that humans have to deal with on a daily basis. For example, a machine could analyze an image and recognize smiling faces. It could analyze text and recognize abusive language. It could analyze speech tones and recognize distress or strain. It could sense the pressure of a touch, where the magnitude of pressure indicates different intentions of the user. As you can see, all of these so-called "learnings" try to make machines perceive and analyze signals the way humans do. 

The one caveat is: Machines can do it a million times faster than a human.

Apart from perception and processing, machines can also **LEARN** to perform activities that humans do: tasks ranging from driving a vehicle to something much more esoteric, such as copying artistic styles. 

<img src = "https://i.imgur.com/sb8dHcY.png" width=60%></img>
<br>

Machine prediction is every bit as awesome, magical and fearsome as you can imagine. At its core, what a human brain does is match patterns and then predict. And everything you feel is based on how well that prediction is going. For example, dopamine - the molecule behind our most fundamental cravings - [is a prediction error system](https://medium.com/the-spike/the-crimes-against-dopamine-b82b082d5f3d). 

### 2. AI and Machine Learning

Originally there were three subdivisions of AI: (1) Neural Networks, (2) Genetic Programming (Evolutionary Computing) and (3) Fuzzy Systems. As data became abundantly available and computation became cheaper and more powerful, a more statistical approach came into the forefront. This was when machine learning was born. You will see the terms 'AI' and 'Machine Learning' interchanged often. Machine Learning is a *type* of AI that is heavily dependent on data.

- Machine Learning: The ability of computers to learn from data without an **explicitly** pre-programmed rule set.
- Artificial Intelligence: Intelligence exhibited by machines. 



| Method | Learning Model | Improvement Criteria | ~Year | Pitfalls | 
| ------ | ----------- | ----------- | ----------- | ----------- | 
|1. Old AI   |  Precoded Rules | X | 1950s | Too few rules |
|2. Expert Systems | Inferred Rules | X | 1970s | Knowledge Acquisition problem |
|3. AI Winter | :( | :( | 1980s |  :( |
|4. Machine Learning | Data | Experience + Reinforcement| 1990s | Black box models |
|5. Deep Learning | Lots of Data | Experience + Reinforcement + Memory | 2000s | Cannot explain itself (yet) |

<br>

<img src = "https://qph.ec.quoracdn.net/main-qimg-d49da0fd1ac86b19d4e67d153926c026-p" width= 70%></img>
<br>


### 3. Your ML ability is limited by the data you have

There are many things you have to do before you get to the "algorithm" or "modeling" phase of a machine learning system. Chief among these is to transform and format the data so that the algorithm can ingest it easily. You must also check whether your data is biased (it will be, but you have to be aware of it). Then you must choose a type of machine learning algorithm to use. This almost always depends on the kind of data you have (more below). You can always run multiple algorithms on your data set and compare which model performs better. 

<br>
Samples? Labels? Categories? What are these...

<html>
<img src = "http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2015/04/drop_shadows_background2.png" width = 90%>

</html>
<br>

As you can probably guess from the figure, which model gives the best performance depends mainly on three big factors: 
- (1) how many samples of data do you have, 
- (2) do you have labels for your data instances and 
- (3) is your label categorical?
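
The two label questions above can be turned into a rough lookup. This is only a sketch: `suggest_model_family` is a hypothetical helper made up for illustration, and it leaves out the sample-count factor because sensible thresholds depend on the problem at hand.

```python
def suggest_model_family(has_labels, label_is_categorical=False):
    """Map the label questions above to a broad family of methods.

    Hypothetical helper for illustration only; sample count (factor 1)
    is deliberately left out.
    """
    if not has_labels:
        return "unsupervised (e.g. clustering)"
    if label_is_categorical:
        return "supervised classification"
    return "supervised regression"


print(suggest_model_family(has_labels=True, label_is_categorical=True))
print(suggest_model_family(has_labels=True, label_is_categorical=False))
print(suggest_model_family(has_labels=False))
```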

So let's quickly list the topics that we have covered or will cover:


> Data Ingestion
    - Data Formats (e.g., dataframes, dictionaries) **
    - Data Discovery **
    - Data Acquisition **
    - Integration and Fusion   (beware of Simpson's Paradox)
    - Transformation + Enrichment **
    

> Data Munging
    - Principal Component Analysis (PCA)  ** 
    - Dimensionality Reduction **
    - Sampling
    - Denoise
    - Feature Extraction
    
> Types
    - Supervised (I know the label of each data instance. )
    - Unsupervised (I do not know the label of any data instance. ) 
    - SemiSupervised ( some labeled, mostly unlabeled) 
    
> Supervised Algorithms:  
    - Decision Trees (this class)
    - Random Forests (this class)
    - Linear Regression (this class)
    - Support Vector Machines (next class)
    
> Unsupervised Algorithms: 
    - Kmeans ** 
    - Neural Nets (next class)

### 4. Here's what a ML pipeline looks like..



<html>
<img src= 'http://sumandebroy.com/columbia/images/mlloop.png' width=100%>
</html>


*Here are the basic steps in building a machine learning algorithm:*
1. Signal Detection: find a source, check if it generates a signal
2. Estimation: Give values to those signals. E.g., each like button press = +1 POS vote but each heart/star press = +2 POS votes. 
3. Sample the parts of the data you want to use, or that are usable.
4. Split your data into 60%-40% between training & test. Training data is used to BUILD the model. Test data is used to EVALUATE the model. Ideally, you'd also keep some for validation, which is used to TUNE the model. 
5. Have a reinforcement framework so your model can improve over time.
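
As a rough illustration of step 4, here is one way a 60%-40% split could be done by shuffling row indices. The data set size and the seed are assumptions chosen for the example, and a real project would often use a library helper instead.

```python
import numpy as np

rng = np.random.RandomState(0)          # fixed seed so the split is repeatable
n_samples = 100                         # assumed size of a toy data set
indices = rng.permutation(n_samples)    # shuffle row indices before splitting

n_train = (n_samples * 60) // 100       # 60% of the rows build the model
train_idx = indices[:n_train]
test_idx = indices[n_train:]            # the remaining 40% evaluate it

print(len(train_idx), len(test_idx))
```

Shuffling first matters when the rows arrive in some order (by date, by class), because otherwise the test set may not resemble the training set.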

*Things to think about*:
- Which nodes are most manual ?
- In which nodes can bias creep in.. and how?
- Which nodes lead to a black box?

**A tale of terminology**: Machine learning uses statistics extensively in every conceivable way. Yet, you will find people using two different terminologies sometimes.

- Statistical Learning : Infer the process by which data you have was generated (Inference)
- Machine Learning: Know how you can predict what future data will look like w.r.t. some variable (Prediction)

Now this can start many flame wars; some people call machine learning "glorified statistics". But in such discussions always remember Ken Thompson's quote:
> *when in doubt, use brute force.*

Real world data can be messy, with incredibly complex feedback loops. When the assumptions are hard to pin down, or when guessing them wrong is risky, *prediction is a more robust bet than inference*. This is why so many have embraced ML: it's safer to build software that predicts and then tune it, rather than make assumptions about the data generating source. This is especially true if your training data is not large enough compared to the number of features. 

Machine Learning makes no prior assumptions about the underlying relationships between the variables. You just throw in all the data you have, and the algorithm processes the data, discovers patterns - using which you can make predictions on the new data set. But this has its own pitfalls.. no free lunch. 

### 5. So what is a model

Obviously the model is a result of supervised or unsupervised learning methods applied to data. 

Labeled data can be priceless: it's hard to get and difficult to implement. Most people use tools like [Mechanical Turk](http://neerajkumar.org/writings/mturk/). If labeled data is the future, all of the jobs that are taken away by AI might be replaced by labeling jobs. That'd make an interesting dystopia. Ironically, that means we automated ourselves back into manual labor. 

Anyways, we will start with the simplest of models... 
<br>

### 5.1 Linear Regression: 

A linear regression takes a bunch of data and attempts to find the relationship between an independent variable ("cause", X) and a dependent variable ("result", Y). To start, given a data set with two columns, X and Y, its task is to find a line that best describes Y as a function of X. It is used to figure out serious things in the real world like GDP, exchange rates, money flows, etc. and is a heavily used research tool in the social and political sciences.  
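
To make "a line that best describes Y" concrete, here is a minimal sketch of ordinary least squares for a single feature, written directly with numpy. The toy data and variable names are assumptions for illustration; library implementations solve the same problem more generally.

```python
import numpy as np

# Toy data, assumed for illustration: y is roughly 3*x + 4 plus noise.
rng = np.random.RandomState(0)
x = rng.randn(50)
y = 3 * x + 4 + 0.5 * rng.randn(50)

# Ordinary least squares for one feature:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(slope, intercept)
```

The recovered slope and intercept should land close to the 3 and 4 used to generate the data, with the gap coming from the noise term.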

In [None]:
# Create some simple data y = 3*x + 4 + error
from pandas import DataFrame
import numpy as np
np.random.seed(0)

data = DataFrame({"x":np.random.randn(20)})
data["y"] = 3*data["x"]+4+2*np.random.randn(20)

For simplicity we are going to plot this using plotly. 

In [None]:
from plotly.plotly import iplot, sign_in
from plotly.graph_objs import *

sign_in("cocteautt","9psj3t57ti")

In [None]:
trace0 = Scatter(x=data['x'],y=data['y'],mode="markers",name="data")
mydata = [trace0]
iplot(mydata)

In [None]:
# build a model that tries to fit this data. we start with linear regression
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(data[["x"]], data["y"])

print(model.intercept_)
print(model.coef_)

In [None]:
trace0 = Scatter(x=data['x'],y=data['y'],mode="markers",name="data")
trace1 = Scatter(x=data['x'],y=model.predict(data[['x']]),name="regression line")
mydata = [trace0,trace1]
iplot(mydata)

**Sidebar: The Grammar of Graphics**

Let's look at some real data, this time using the "hardest" data set we've seen before. To help us with the data visualization, we are going to use ggplot (and not rely on plotly any longer). We are going to import all of the "names" in the ggplot package. In previous import statements we have selected a single name or a comma separated list of names. Here we take all the names that the package knows about -- all the functions, the variables, the data.

In [None]:
from ggplot import *

In [None]:
from pandas import read_csv

hardest = read_csv("hardest_small.csv")
hardest.head()

At their most basic, ggplots take 2 arguments: a **data frame** and accompanying **"aesthetics"** or aes object. Aesthetics define how ggplot will extract data from your data frame and render it. Think of it as the instructions for creating x, y, color, etc. components.

An aes object is just a dictionary with keys being an aesthetic property and values being strings or formulas relating to data in your data frame.

In [None]:
aes(x='education', y='income')

Here's our first ggplot. It's really just a blank canvas onto which we will place data.

In [None]:
ggplot(aes(x ="education",y='income'),data=hardest)

Now we will (quite literally) add a scatterplot (geom_point) to our canvas. We'll get into more detail on how this works later.

In [None]:
ggplot(aes(x ="education",y='income'),data=hardest)+geom_point()

To make this a little clearer, the grammar breaks the components of a graphic down into various pieces.

* **data** in ggplot, data must be stored as a pandas data frame
* **a coordinate system** describes 2-D space that data is projected onto (for example, Cartesian       coordinates, polar coordinates, map projections, and so on)
* **geoms** describe type of geometric objects that represent data (for example, points, lines,   
  polygons)
* **aesthetics** describe visual characteristics that represent data (for example, position, size,   color, shape, transparency, fill)
* **scales** for each aesthetic, describe how visual characteristic is converted to display values (for example, log scales, color scales, size scales, shape scales)
* **stats** describe statistical transformations that typically summarize data (for example,      
  counts, means, medians, regression lines)
* **facets** describe how data is split into subsets and displayed as multiple small graphs

geom_point() says that we want to render our x and y data as points. We can further adapt them by assigning colors and other features. 

Make another scatterplot.

In [None]:
# put code here


Often, scatterplots are hard to read on their own, perhaps because of overplotting. We can introduce a "trend line" by adding statistical artifacts to the plot. Here we use a "smoother" -- Think of Galton dividing his data into bins on the x-axis, finding the mean of the y-values in each bin, plotting them and then connecting the dots. That's essentially what's going on here plus or minus some bells and whistles.
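
The binning idea can be sketched in a few lines of numpy. This is only the bare "bin, average, connect the dots" version on made-up data; it is not what `stat_smooth` computes internally, and the number of bins is an arbitrary choice.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 200)              # toy x-values, assumed
y = 2 * x + rng.randn(200)               # toy y-values with noise

n_bins = 5
edges = np.linspace(x.min(), x.max(), n_bins + 1)
# Assign each point to a bin 0..n_bins-1 (clip so the maximum lands in the last bin)
bin_index = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

bin_centers = 0.5 * (edges[:-1] + edges[1:])
bin_means = np.array([y[bin_index == b].mean() for b in range(n_bins)])

# Connecting (bin_centers, bin_means) with straight lines gives the trend line.
for center, mean in zip(bin_centers, bin_means):
    print(round(float(center), 2), round(float(mean), 2))
```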

In [None]:
ggplot(aes(x='education',y='income'),data=hardest)+geom_point()+stat_smooth(method="lm",color="blue")

Of course we can add a lot of other components to a plot (again, literally adding them)...

In [None]:
ggplot(aes(x='education',y='income'),data=hardest)+\
    geom_point()+stat_smooth(method="lm",color="blue")+\
    ggtitle("Income and education rates")+\
    xlab("Percentage with College Education")+\
    ylab("Median income in the county")

In [None]:
ggplot(aes(x='education',y='unemployment'),data=hardest)+\
    geom_point()+stat_smooth(method="loess",color="blue")+\
    ggtitle("Unemployment and education rates")+\
    xlab("Percentage with College Education")+\
    ylab("Unemployment rate")

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn # not necessary
import numpy as np

### 5.2 Decision Trees

Decision trees are predictive models that map features of items (represented by nodes) to their target labels (represented by leaves of the tree). Thus when a data instance encounters a decision tree model, it must traverse the nodes until it is labeled by one of the leaves. The path it takes depends on the features the data instance possesses. 
<br>

<html>
<img src="http://sumandebroy.com/columbia/images/dtree.gif" ></img>
</html>
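
The traversal described above can be sketched with a small hand-written tree. The features, thresholds and labels here are made up for illustration; a learned tree would pick them from the data.

```python
# A tiny hand-written tree, assumed for illustration. Internal nodes test one
# feature against a threshold; leaves carry the label.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"label": "young"},
    "right": {
        "feature": "income", "threshold": 50000,
        "left": {"label": "older, lower income"},
        "right": {"label": "older, higher income"},
    },
}


def classify(node, instance):
    """Walk from the root to a leaf, going left when feature < threshold."""
    while "label" not in node:
        go_left = instance[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(classify(tree, {"age": 25, "income": 80000}))
print(classify(tree, {"age": 45, "income": 80000}))
```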

##### Making a decision tree

In [None]:
from sklearn.datasets import make_blobs

In [None]:
X, y = make_blobs(n_samples=300, centers=4,random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');

In [None]:
# some basic code to visualize a decision tree boundaries. 
def visualize_tree(estimator, X, y, boundaries=True,xlim=None, ylim=None):
    estimator.fit(X, y)

    if xlim is None:
        xlim = (X[:, 0].min() - 0.1, X[:, 0].max() + 0.1)
    if ylim is None:
        ylim = (X[:, 1].min() - 0.1, X[:, 1].max() + 0.1)

    x_min, x_max = xlim
    y_min, y_max = ylim
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='rainbow')
    plt.clim(y.min(), y.max())

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow')
    plt.axis('off')

    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)        
    plt.clim(y.min(), y.max())


In [None]:
# http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
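from ipywidgets import interact
from sklearn.tree import DecisionTreeClassifier
DEPTH = 2

def plot_tree_at_depth(depth=DEPTH):
    # refit and redraw the tree each time the slider moves
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    visualize_tree(clf, X, y)

# interact needs a function; a (min, max) tuple gives an integer slider
interact(plot_tree_at_depth, depth=(1, 5));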

In [None]:
# now try with DEPTH =3 

**But the classification quality can vary in every run:** For example, take a look at two trees built on two subsets of this dataset. The details of the classification are very different. 

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
plt.figure()
visualize_tree(clf, X[:200], y[:200], boundaries=False)
plt.figure()
visualize_tree(clf, X[-200:], y[-200:], boundaries=False)

A common way to address such variation is to use an [Ensemble Method](http://scikit-learn.org/stable/modules/ensemble.html#forest): this is a meta-estimator which essentially averages the results of many individual estimators, each of which over-fits the data. The most common ensemble method is a Random Forest, in which the ensemble is made up of many decision trees.
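
One simple way to combine many trees is a majority vote over their predicted labels. The sketch below uses made-up predictions from three hypothetical trees; library implementations may combine trees differently in detail (for example by averaging predicted probabilities).

```python
import numpy as np

# Predicted class labels from three hypothetical trees for five data points.
# Each row is one tree; each column is one data point.
tree_predictions = np.array([
    [0, 1, 1, 2, 0],
    [0, 1, 2, 2, 1],
    [1, 1, 1, 2, 0],
])


def majority_vote(predictions):
    """Return the most common label in each column."""
    n_classes = predictions.max() + 1
    # counts[c, j] = number of trees that voted class c for point j
    counts = np.array([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)


print(majority_vote(tree_predictions))
```

Individual trees disagree on points 0, 2 and 4, but the vote smooths those disagreements out.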

### 5.3 Enter Random Forests

In [None]:
def fit_randomized_tree(random_state=0):
    X, y = make_blobs(n_samples=300, centers=4,
                      random_state=0, cluster_std=2.0)
    clf = DecisionTreeClassifier(max_depth=15)
    
    rng = np.random.RandomState(random_state)
    i = np.arange(len(y))
    rng.shuffle(i)
    visualize_tree(clf, X[i[:250]], y[i[:250]], boundaries=False,
                   xlim=(X[:, 0].min(), X[:, 0].max()),
                   ylim=(X[:, 1].min(), X[:, 1].max()))
    
from ipywidgets import interact
interact(fit_randomized_tree, random_state=(0, 100));

Notice how the details of the model change as a function of the sample.

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
visualize_tree(clf, X, y, boundaries=False);

When is a random forest not **that** useful:
1. Structured data like images, where a Neural Net might do better
2. Small data, which might lead to overfitting
3. High dimensional data (sometimes)
    
But in general decision trees and random forests are very robust models, and you can build very interesting game AIs with them.
<br>
<img src = "https://ai2-s2-public.s3.amazonaws.com/figures/2016-11-08/11fa695a4e6b0f20a396edc4010d4990c8d29fe9/0-Figure1-1.png" width = 70%> </img><br>

## 6. Model Considerations

### 6.1 Scalability:

Sometimes **a less complex model can be more scalable**. Most modern prediction systems are a series of tradeoffs. The scalability of a model refers to how easily it can adapt when the distribution (or pattern) of the input changes. Complex models are powerful, but you need to make sure the distribution of your input will remain the same. 

<img src="http://metamarkets.com/wp-content/uploads/2011/03/photo-1-1024x768.jpg" width = 50%> </img>

### 6.2 Bias and Variance  - The model moves with the data

Any learning algorithm has errors that come from two sources: BIAS and VARIANCE. 

Bias is the tendency of your algorithm to consistently not take all information into account, thus learning the wrong thing. This leads to UNDERFITTING. Variance is your algorithm's tendency to learn random things irrespective of the real signal. This leads to OVERFITTING. So the final thing we need to note is that models overfit and underfit. Here's how to intuitively understand this. 
<br> 

<img src = "https://qph.ec.quoracdn.net/main-qimg-f9c226fe76f482855b6d46b86c76779a-p" width=50%></img>



A person with high bias is someone who starts to answer before you can even finish asking. A person with high variance is someone who can think of all sorts of crazy answers. Combining these gives you different personalities:

- High bias/low variance: this is someone who usually gives you the same answer, no matter what you ask, and is usually wrong about it;

- High bias/high variance: someone who takes wild guesses, all of which are sort of wrong;

- Low bias/high variance: a person who listens to you and tries to answer the best they can, but who daydreams a lot and may say something totally crazy;

- Low bias/low variance: a person who listens to you very carefully and gives you good answers pretty much all the time.
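
Underfitting and overfitting can also be seen numerically. The sketch below fits polynomials of increasing degree to made-up noisy data and prints the error on the training points and on fresh points. The curve, noise level and degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)


def make_data(n):
    """Noisy samples from a smooth curve (toy data, assumed for illustration)."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + 0.2 * rng.randn(n)
    return x, y


x_train, y_train = make_data(20)
x_test, y_test = make_data(200)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)


train_errors = {}
test_errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_errors[degree] = mse(coeffs, x_train, y_train)
    test_errors[degree] = mse(coeffs, x_test, y_test)
    print(degree, round(float(train_errors[degree]), 4), round(float(test_errors[degree]), 4))
```

The training error can only go down as the degree grows, since a higher-degree polynomial can always reproduce a lower-degree one. The straight line is too rigid for this curve (high bias), while a very flexible polynomial tends to start chasing the noise in the 20 training points (high variance), which shows up as a gap between its training and test error.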

<hr>

### Next class:

1. *Predicting Titanic survivors*: One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. Our task will be to predict who had a greater chance of survival. 

2. *Neural Networks* : Attempting to model the brain

3. *Data-Driven Bugs* : How they are born and what they do

4. *Debugging and Explainable AI* : The more you learn about machine learning, the more you realise that debugging tools & a clear understanding of how the algorithms you’re using work are totally essential for making your models better. 

<br>
    
