## Teaching Machines to Learn

<img src=https://einstein.ai/static/images/pages/einstein.svg width=300>

<br>
**These notes were generated largely by Suman Deb Roy, with small edits by Mark Hansen**

### A World of Signals

Perception, whether human or machine, depends fundamentally on the ability to *sense and analyze signals* that abound in the natural and digital worlds. Humans are evolutionarily equipped to sense these, although ability varies from person to person. Machines, however, must first be taught what perception means, and then taught how to keep learning from the perception principles they have acquired. 

This task poses some of the most fascinating challenges in our quest to make machines smarter. It also naturally divides computing into many subfields, based on the signals that a machine will encounter and must analyze:  

- When the signal is image/video/gif : *Computer Vision*
- When the signal is text            : *Natural Language Processing*
- When the signal is touch           : *Haptic Computing*
- When the signal is sound           : *Speech Recognition / Audio Signal Processing*
- When the signal is smell           : *There's [Cyranose!](https://en.wikipedia.org/wiki/Electronic_nose) / Classification of foods, bacteria detection [\(you laugh\)](http://www.disi.unige.it/person/MasulliF/papers/masulli-mcs02.pdf)*
- When the signal is taste           : *We aren't there yet* (although... [PCA for wine quality](http://fastml.com/predicting-wine-quality/))

The idea is that if machines can analyze these signals (themselves representations of the signals we as humans perceive) accurately, they will have the intelligence to deal with situations that humans face on a daily basis. For example, a machine can analyze an image and recognize smiling faces. It can analyze text and recognize abusive language. It can analyze speech tones and recognize distress or strain. It can sense the pressure of a touch, where the magnitude of the pressure indicates different intentions of the user. As you can see, all of these so-called "learnings" are attempts to make machines perceive and analyze signals the way humans do.

As we have seen from Kosinski's work, we are immediately confronted with important questions as these activities take on social, political or even cultural importance. The idea that a machine is making decisions is often confused with objectivity in decision making, for example. By talking through the Machine Learning "pipeline," we can highlight where human decision making is required to design and build a learning system. These are places where our biases come into play. We will consider very different situations, and highlight the various reporting possibilities along each pipeline, from signals to actions.

We will see that Machine Learning procedures can mean different things for human learning. Sometimes, by creating a Machine Learning model, we learn about how something in the world works -- like mathematical models in physics. Sometimes, we want the model to simply drive the car and we don't really care how it does it. We just need it to be safe. The roles of narrative and human understanding are often left out when we talk about ML. 

A couple of good readings to help you think through these roles are [A Practice-Based Framework for Improving Critical Data Studies and Data Science](https://www.liebertpub.com/doi/pdf/10.1089/big.2016.0050) and [Feminist Data Visualization](http://www.kanarinka.com/wp-content/uploads/2015/07/IEEE_Feminist_Data_Visualization.pdf). 


### AI and Machine Learning

Originally there were three subdivisions of AI: (1) Neural Networks, (2) Genetic Programming (Evolutionary Computing) and (3) Fuzzy Systems. As data became abundant and computation became cheaper and more powerful, a more statistical approach came to the forefront. This was when machine learning was born. You will often see the terms 'AI' and 'Machine Learning' used interchangeably. Machine Learning is a *type* of AI that is heavily dependent on data.

- Machine Learning: The ability of computers to learn from data without an **explicitly** pre-programmed rule set.
- Artificial Intelligence: Intelligence exhibited by machines. 



| Method | Learning Model | Improvement Criteria | ~Year | Pitfalls | 
| ------ | ----------- | ----------- | ----------- | ----------- | 
|1. Old AI   |  Precoded Rules | X | 1950s | Too few rules |
|2. Expert Systems | Inferred Rules | X | 1970s | Knowledge Acquisition problem |
|3. AI Winter | :( | :( | 1980s |  :( |
|4. Machine Learning | Data | Experience + Reinforcement| 1990s | Black box models |
|5. Deep Learning | Lots of Data | Experience + Reinforcement + Memory | 2000s | Cannot explain itself (yet) |

<br>

<img src = "https://qph.ec.quoracdn.net/main-qimg-d49da0fd1ac86b19d4e67d153926c026-p" width= 70%></img>
<br>

### Your ML ability is limited by the data you have

There is a lot to do before you get to the "algorithm" or "modeling" phase of a machine learning system. Chief among these tasks is transforming and formatting the data so that it is easily ingestible by the algorithm. You must also check whether your data is biased (it will be, but you have to be aware of how). Who collected the data? What was their intention? Their motivations? Whose voices were left out? Then you must choose a type of machine learning algorithm to use. This almost always depends on the kind of data you have (more below). There are now so many different kinds of algorithms that it can be hard to know why to choose one over another. Sometimes the choice depends on the "difficulty" of your learning task; sometimes you want to be able to interpret your model: "Why this decision for this input?" You can always run multiple algorithms on your data set and compare which model performs better or is easier to interpret. 

<br>
Samples? Labels? Categories? What are these...

<html>
<img src = "http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2015/04/drop_shadows_background2.png" width = 90%>

</html>
<br>

As you can probably guess from the figure, which model gives the best performance depends mainly on three factors: 
1. how many data points you have, 
2. whether you have labels for your data instances, and 
3. whether your label is categorical.

So let's quickly list the topics that we have covered or will cover:

> Data Ingestion
    - Data Formats (e.g., dataframes, dictionaries) **
    - Data Discovery **
    - Data Acquisition **
    - Integration and Fusion (beware of Simpson's Paradox)
    - Transformation + Enrichment **
    

> Data Munging
    - Principal Component Analysis (PCA)  ** 
    - Dimensionality Reduction **
    - Sampling
    - Denoise
    - Feature Extraction
    
> Types
    - Supervised (I know the label of each data instance.)
    - Unsupervised (I do not know the label of any data instance.) 
    - SemiSupervised (some labeled, mostly unlabeled) 
    
> Supervised Algorithms:  
    - Decision Trees (this class)
    - Random Forests (this class)
    - Linear Regression (this class)
    - Support Vector Machines (next class)
    
> Unsupervised Algorithms: 
    - K-means ** 
    - Neural Nets (next class)
    
### The ML pipeline...

*Here are the basic steps in building a machine learning system:*
1. Signal detection: find a source and check whether it generates a signal.
2. Give values to those signals. E.g., each like button press = +1 POS vote, but each heart/star press = +2 POS votes. 
3. Sample the parts of the data that you want to use, or that are usable.
4. Split your data 70%/30% between training and test sets. Training data is used to BUILD the model. Test data is used to EVALUATE the model. Ideally, you'd also keep some for validation, which is used to TUNE the model. 
5. Have a reinforcement framework so your model can improve over time.
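
The split in step 4 can be sketched with scikit-learn's `train_test_split`. This is only an illustration: the toy data is made up, and dividing the held-out 30% evenly between validation and test is an assumption, not a rule from these notes.

```python
import numpy as np
from pandas import DataFrame
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with one feature x and one target y
np.random.seed(0)
data = DataFrame({"x": np.random.randn(100)})
data["y"] = 3 * data["x"] + 4 + 2 * np.random.randn(100)

# First hold out 30% of the rows; the remaining 70% are used to BUILD the model
train, holdout = train_test_split(data, test_size=0.3, random_state=0)

# Assumed proportions: split the held-out rows evenly into
# validation (to TUNE the model) and test (to EVALUATE the model)
validation, test = train_test_split(holdout, test_size=0.5, random_state=0)

print(len(train), len(validation), len(test))
```

Fixing `random_state` makes the split reproducible, which matters when you want to compare several models on exactly the same rows.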

*Things to think about*:
- Which nodes are most manual?
- In which nodes can bias creep in, and how?
- Which nodes lead to a black box?

### So what is a model?

A model is a result of supervised or unsupervised learning methods applied to data. 

Labeled data can be priceless: it's hard to get and difficult to implement. Most people use tools like [Mechanical Turk](http://neerajkumar.org/writings/mturk/). If labeled data is the future, all of the jobs that are taken away by AI might be replaced by labeling jobs. Ironically, that would mean we automated ourselves back into manual labor. 

Anyways, we will start with the simplest of models... 
<br>

### 5.1 Linear Regression: 

A linear regression takes a set of data and attempts to find the relationship between an independent variable ("cause", X) and a dependent variable ("result", Y). To start, given a data set with two columns, X and Y, its task is to find the line that best describes Y as a function of X. It is used to model serious things in the real world like GDP, exchange rates, and money flows, and it is a heavily used research tool in the social and political sciences.  

In [None]:
%%sh
pip install scikit-learn --upgrade
pip install pydotplus

In [None]:
# Create some simple data: y = 3*x + 4 + error
from pandas import DataFrame
import numpy as np
np.random.seed(0)

data = DataFrame({"x":np.random.randn(100)})
data["y"] = 3*data["x"]+4+2*np.random.randn(100)

For simplicity we are going to plot this using plotly. 

In [None]:
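# Note: in plotly 4 and later, this module moved to the separate chart-studio package (chart_studio.plotly)
from plotly.plotly import iplot, sign_in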
import plotly.graph_objs as go

sign_in("cocteautt","8YLww0QuMPVQ46meAMaq")

In [None]:
myplot_parts = [go.Scatter(x=data["x"],y=data["y"],mode="markers",name="data")]
mylayout = go.Layout(autosize=False, width=500,height=500)
myfigure = go.Figure(data = myplot_parts, layout = mylayout)
iplot(myfigure,filename="example")

In [None]:
# build a model that tries to fit this data. we start with linear regression
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(data[["x"]], data["y"])

print(model.intercept_)
print(model.coef_)
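
To make "the line that best describes Y" concrete: for a single X column, the least-squares line has a closed-form solution, which the sketch below computes directly with NumPy. It regenerates the same simulated data as above (same seed, same order of random draws), so the two printed numbers should agree with `model.intercept_` and `model.coef_` up to floating-point rounding. The variable names `slope` and `intercept` are ours.

```python
import numpy as np

# Same simulated data as above: y = 3*x + 4 + noise
np.random.seed(0)
x = np.random.randn(100)
y = 3 * x + 4 + 2 * np.random.randn(100)

# Least squares for one variable:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(intercept)
print(slope)
```

Because of the noise term, the estimates land near, but not exactly on, the true values 4 and 3.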

In [None]:
myplot_parts = [go.Scatter(x=data["x"],y=data["y"],mode="markers",name="data"),
                go.Scatter(x=data["x"],y=model.predict(data[["x"]]),name="regression line")]
mylayout = go.Layout(autosize=False, width=500,height=500)
myfigure = go.Figure(data = myplot_parts, layout = mylayout)
iplot(myfigure,filename="example")

In [None]:
from pandas import read_csv

data = read_csv("hardest.csv")
data.dropna(axis=0,how="any",inplace=True)

In [None]:
model = LinearRegression()
model.fit(data[["education"]], data["income"])

print(model.intercept_)
print(model.coef_)

In [None]:
myplot_parts = [go.Scatter(x=data["education"],y=data["income"],mode="markers",name="data"),
                go.Scatter(x=data["education"],y=model.predict(data[["education"]]),name="regression line")]
mylayout = go.Layout(autosize=False, width=600,height=600)
myfigure = go.Figure(data = myplot_parts, layout = mylayout)
iplot(myfigure,filename="example")

### Decision Trees

Decision trees are predictive models that map the features of an item (tested at the internal nodes) to a target label (assigned at the leaves of the tree). Thus when a data instance encounters a decision tree model, it must traverse the nodes until it is labeled by one of the leaves. The path it takes is determined by the features the data instance possesses. 
<br>
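
As a rough sketch of that traversal, the cell below fits a very small tree and follows one new instance from the root down to a leaf. The feature names (`pct_urban`, `median_age`), the six data rows and the labels "A" and "B" are all made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two made-up features per item and a made-up label
feature_names = ["pct_urban", "median_age"]
X = np.array([[10, 40], [20, 35], [30, 50], [70, 38], [80, 45], [90, 42]])
y = ["B", "B", "B", "A", "A", "A"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# A new data instance enters at the root (node 0) and moves down to a leaf
instance = np.array([[25, 44]])
visited = tree.decision_path(instance).indices  # ids of the nodes it passes through
leaf = tree.apply(instance)[0]                  # id of the leaf it ends in
label = tree.predict(instance)[0]               # label assigned by that leaf

print("nodes visited:", list(visited))
print("leaf:", leaf, "label:", label)
```

The last node in the visited list is the leaf, and the leaf's label is the prediction.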

**Narrative Potential in Trees**

Decision trees are well-known for their narrative potential. Here is an example from the New York Times where the model *was* the narrative.

![tree](https://static01.nyt.com/images/2008/04/16/us/0416-nat-subOBAMA.jpg)

And ProPublica had a lovely project called the [Message Machine](https://projects.propublica.org/emails/) where they looked at email messages sent out by [political campaigns](https://www.propublica.org/special/message-machine-you-probably-dont-know-janet) and [reverse engineered the logic that generated them.](https://www.propublica.org/article/message-machine-starts-providing-answers)

Let's fit a model using scikit-learn. We will take a data set like the NYT data, but this time having to do with the 2016 presidential election. 

In [None]:
from pandas import read_csv, set_option
data = read_csv("http://www.collingwoodresearch.com/uploads/8/3/6/0/8360930/county_data.csv",sep="\t")

In [None]:
set_option("display.max.columns",100)
data.head()

In [None]:
for c in data.columns: print(c)

In [None]:
# Label each county with the candidate who received the larger vote share
data["winner"] = (data["pct_clinton"] < data["pct_trump"]).map({True: "Trump", False: "Clinton"})

The cell above defines our "response": which candidate won each county.

In [None]:
data.head(10)

In [None]:
data["winner"].value_counts()

Fit the tree and have a look!

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

features = ["per_capita_income","pobama12cnty","percent_white"]

y = list(data["winner"])
X = data[features]

dt = DecisionTreeClassifier(min_samples_split=200)
dt = dt.fit(X, y)

In [None]:
%%sh
conda install -y graphviz

In [None]:
from IPython.display import Image  
from pydotplus import graph_from_dot_data
dot_data = export_graphviz(dt, out_file=None, 
                         feature_names=features,
                         class_names=["Clinton","Trump"])  
                         #filled=True, rounded=True,  
                         #special_characters=True)  
graph = graph_from_dot_data(dot_data)  
Image(graph.create_png()) 
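
If graphviz is not available, one alternative (not used in these notes) is scikit-learn's `export_text`, which prints a fitted tree as indented if/else rules. The sketch below runs on a synthetic stand-in for the county data: the column names `prior_vote_share` and `income`, the values, and the labels "A" and "B" are all invented.

```python
import numpy as np
from pandas import DataFrame
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for county data; column names and values are made up
rng = np.random.default_rng(0)
toy = DataFrame({
    "prior_vote_share": rng.uniform(0, 100, 400),
    "income": rng.normal(50000, 10000, 400),
})
toy["winner"] = ["A" if share > 50 else "B" for share in toy["prior_vote_share"]]

features = ["prior_vote_share", "income"]
toy_tree = DecisionTreeClassifier(min_samples_split=200, random_state=0)
toy_tree.fit(toy[features], toy["winner"])

# Print the fitted splits as indented if/else rules instead of drawing a picture
rules = export_text(toy_tree, feature_names=features)
print(rules)
```

Read from top to bottom, each indented branch is one sentence of the "narrative" the tree tells about the data.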

###  Bias and Variance  - The model moves with the data

Any learning algorithm has errors that come from two sources: BIAS and VARIANCE. 

Bias is the tendency of your algorithm to consistently fail to take all of the information into account, thus learning the wrong thing. This leads to UNDERFITTING. Variance is your algorithm's tendency to learn random things irrespective of the real signal. This leads to OVERFITTING. So the final thing to note is that models can overfit and underfit. Here's how to intuitively understand this. 
<br> 

<img src = "https://qph.ec.quoracdn.net/main-qimg-f9c226fe76f482855b6d46b86c76779a-p" width=50%></img>



A person with high bias is someone who starts to answer before you can even finish asking. A person with high variance is someone who can think of all sorts of crazy answers. Combining these gives you different personalities:

- High bias/low variance: this is someone who usually gives you the same answer, no matter what you ask, and is usually wrong about it;

- High bias/high variance: someone who takes wild guesses, all of which are sort of wrong;

- Low bias/high variance: a person who listens to you and tries to answer the best they can, but who daydreams a lot and may say something totally crazy;

- Low bias/low variance: a person who listens to you very carefully and gives you good answers pretty much all the time.
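
Underfitting and overfitting can also be seen numerically. The sketch below is an illustration under assumed conditions, not a general recipe: the "real signal" is a made-up sine curve with noise, and polynomial fits of increasing degree stand in for models of increasing flexibility. A degree-1 fit is rigid (leaning toward high bias), while a degree-9 fit has enough freedom to chase the noise (leaning toward high variance).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical signal: a sine curve plus random noise
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + 0.3 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(30)    # data used to fit each model
x_test, y_test = make_data(200)     # fresh data from the same source

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on the points (x, y)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    # Store (training error, test error) for this degree
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, errors[degree])
```

Training error can only go down as the degree rises, because a higher-degree polynomial can always reproduce a lower-degree one. The test error is the number to watch: it tells you whether the extra flexibility captured signal or just noise.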