# Supervised machine learning



This section is partially inspired by the Stanford CS229 lecture notes: http://cs229.stanford.edu/notes/cs229-notes1.pdf

- **Features** (input variables) $x^{(i)}\in \mathbb{X}$ 
- **Target** (output we are trying to predict) $y^{(i)} \in \mathbb{Y}$

$(x^{(i)},y^{(i)})$ is a **training example**

$\{(x^{(i)},y^{(i)}); i = 1,...,m\}$ is the **training set**

The goal of supervised learning is to learn a function $h: \mathbb{X}\to\mathbb{Y}$, called the **hypothesis**, so that $h(x)$ is a good 
predictor of the corresponding $y$.

- **Regression** corresponds to the case where $y$ is a continuous variable
- **Classification** corresponds to the case where $y$ can only take a small number of discrete values

Examples: 
- Univariate Linear Regression: $h_w(x) = w_0+w_1x$,  with $\mathbb{X} = \mathbb{Y} = \mathbb{R}$
- Multivariate Linear Regression: $$h_w(x) = w_0+w_1x_1 + ... + w_nx_n = \sum_{i=0}^{n}w_ix_i = w^Tx,$$
with $\mathbb{Y} = \mathbb{R}$ and $\mathbb{X} = \mathbb{R}^n$.
Here $w_0$ is the intercept with the convention that $x_0=1$ to simplify notation.
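
As a minimal sketch of the multivariate hypothesis, the snippet below prepends the constant feature $x_0=1$ and evaluates $w^Tx$ with NumPy. The function name `linear_hypothesis` and the numeric values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def linear_hypothesis(w, x):
    """Evaluate h_w(x) = w^T x, where w has n+1 entries and x has n features."""
    x_with_intercept = np.concatenate(([1.0], x))  # convention x_0 = 1
    return float(np.dot(w, x_with_intercept))

# Hypothetical weights (w[0] is the intercept) and one example with n = 2 features
w = np.array([0.5, 2.0, -1.0])
x = np.array([3.0, 4.0])
print(linear_hypothesis(w, x))  # 0.5 + 2*3 - 1*4 = 2.5
```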



## Binary Classification with Logistic Regression

- $y$ can take only two values, 0 or 1. For example, if $y$ is the sentiment associated with a tweet,
$y=1$ if the tweet is "positive" and $y=0$ if the tweet is "negative".

- $x^{(i)}$ represents the features of a tweet, for example the presence or absence of certain words.

- $y^{(i)}$ is the **label** of the training example represented by $x^{(i)}$.
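
To make the word-presence idea concrete, here is a small sketch in plain Python. The helper `tweet_to_features`, the whitespace tokenization and the two example tweets are all hypothetical choices made for illustration; words that do not appear in a tweet are implicitly absent (feature value 0).

```python
def tweet_to_features(text):
    """Map a tweet to a dict of word-presence features (1 = word present)."""
    # Lowercase and split on whitespace; a real tokenizer would also have to
    # deal with punctuation, URLs and emoticons.
    return {word: 1 for word in set(text.lower().split())}

# Two made-up training examples (x, y), with y = 1 for positive, 0 for negative
training_set = [
    (tweet_to_features("I love this movie"), 1),
    (tweet_to_features("I hate this weather"), 0),
]
features, label = training_set[0]
print(sorted(features), label)
```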


Since $y\in\{0,1\}$, we want $h_w(x)$ to take values in the interval $[0,1]$.

**Logistic regression** consists of choosing $h_w(x)$ as

$$
h_w(x) = \frac{1}{1+e^{-w^Tx}}
$$

where $w^Tx = \sum_{i=0}^{n}w_ix_i$ and $h_w(x) = g(w^Tx)$ with

$$
g(x)=\frac{1}{1+e^{-x}}.
$$

$g(x)$ is the **logistic function** or **sigmoid function**


In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

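# evaluate the logistic (sigmoid) function on a grid of points
x = np.linspace(-10, 10)
y = 1/(1 + np.exp(-x))

p = plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('g(x)')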

- $g(x)\rightarrow 1$ for $x\rightarrow\infty$
- $g(x)\rightarrow 0$ for $x\rightarrow -\infty$
- $g(0) = 1/2$

Finally, to go from the predicted probability $h_w(x)$ to a class label, we can simply apply the following condition:

$$
y=\left\{
  \begin{array}{@{}ll@{}}
    1, & \text{if}\ h_w(x)\geq 1/2 \\
    0, & \text{otherwise}
  \end{array}\right.
$$
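
The thresholding rule above can be sketched as follows. The names `sigmoid` and `predict_label` and the weights are assumptions made for this example, and the feature vector is expected to already contain $x_0=1$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(w, x):
    """Return 1 if h_w(x) >= 1/2 and 0 otherwise; x must include x_0 = 1."""
    return int(sigmoid(np.dot(w, x)) >= 0.5)

w = np.array([-1.0, 2.0])                      # hypothetical weights (w_0, w_1)
print(predict_label(w, np.array([1.0, 1.0])))  # w^T x = 1, so the label is 1
print(predict_label(w, np.array([1.0, 0.0])))  # w^T x = -1, so the label is 0
```

Because $g$ is increasing and $g(0)=1/2$, the condition $h_w(x)\geq 1/2$ is equivalent to $w^Tx\geq 0$, so the boundary between the two classes is a hyperplane in feature space.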

Let's clarify the notation. We have **$m$ training samples** and **$n$ features**. The training set can be represented by an **$m$-by-$n$ matrix** $\underline{\underline{X}}=(x_{ij})$ ($m$-by-$(n+1)$ if we include the intercept term) whose rows are the training examples $x^{(i)}$.

The target values of the training set can be represented as an $m$-dimensional vector $\underline{y}$ and the parameters 
of our model as
an $n$-dimensional vector $\underline{w}$ ($n+1$ if we take into account the intercept).

Now, for a given training example $x^{(i)}$, the function that we want to learn (or fit) can be written:

$$
h_\underline{w}(x^{(i)}) = \frac{1}{1+e^{-\sum_{j=0}^n w_j x_{ij}}}
$$
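
In matrix form, the hypothesis for all $m$ training examples can be evaluated at once. The following is a sketch under the notation above; `add_intercept`, `hypothesis` and the toy numbers are illustrative assumptions.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so that X becomes m-by-(n+1)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def hypothesis(w, X):
    """Return the m-vector whose i-th entry is h_w(x^(i)), one per row of X."""
    return 1.0 / (1.0 + np.exp(-X @ w))

# Toy data: m = 4 training examples, n = 2 features
X_raw = np.array([[0.0, 1.0],
                  [2.0, -1.0],
                  [1.0, 1.0],
                  [3.0, 0.0]])
X = add_intercept(X_raw)          # shape (4, 3)
w = np.array([0.0, 1.0, -1.0])    # n + 1 = 3 weights, w[0] is the intercept
h = hypothesis(w, X)              # shape (4,)
print(X.shape, h.shape)
```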


### Likelihood of the model

How do we find the parameters $\underline{w}$, also called *weights*, that best fit our training data?
We want to find the weights $\underline{w}$ that maximize the likelihood of observing the target $\underline{y}$ given the observed features $\underline{\underline{X}}$.
We need a probabilistic model that gives us the probability of observing the value $y^{(i)}$ given the features $x^{(i)}$.

The function $h_\underline{w}(x^{(i)})$ can be used precisely for that:

$$
P(y^{(i)}=1|x^{(i)};\underline{w}) = h_\underline{w}(x^{(i)})
$$

$$
P(y^{(i)}=0|x^{(i)};\underline{w}) = 1 - h_\underline{w}(x^{(i)})
$$


This can be written more compactly as:

$$
P(y^{(i)}|x^{(i)};\underline{w}) = (h_\underline{w}(x^{(i)}))^{y^{(i)}} ( 1 - h_\underline{w}(x^{(i)}))^{1-y^{(i)}}
$$
where $y^{(i)}\in\{0,1\}$.


We see that $y^{(i)}$ is a random variable following a Bernoulli distribution with expectation $h_\underline{w}(x^{(i)})$.



The **Likelihood function** of a statistical model is defined as:
$$
\mathcal{L}(\underline{w}) = \mathcal{L}(\underline{w};\underline{\underline{X}},\underline{y}) = P(\underline{y}|\underline{\underline{X}};\underline{w}).
$$

The likelihood takes into account all $m$ training samples of our training dataset and gives the probability 
of observing $\underline{y}$ given $\underline{\underline{X}}$ and $\underline{w}$. Assuming that the $m$ training examples were generated independently, we can write:

$$
\mathcal{L}(\underline{w}) = P(\underline{y}|\underline{\underline{X}};\underline{w}) = \prod_{i=1}^m P(y^{(i)}|x^{(i)};\underline{w}) = \prod_{i=1}^m (h_\underline{w}(x^{(i)}))^{y^{(i)}} ( 1 - h_\underline{w}(x^{(i)}))^{1-y^{(i)}}.
$$

This is the function that we want to maximize. It is usually much simpler to maximize its logarithm, which has the same maximizer because the logarithm is strictly increasing:

$$
l(\underline{w}) = \log\mathcal{L}(\underline{w}) = \sum_{i=1}^{m} \left(y^{(i)} \log h_\underline{w}(x^{(i)}) + (1- y^{(i)})\log(1- h_\underline{w}(x^{(i)})) \right)
$$
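
As a sanity check, the log-likelihood can be computed directly and compared with the logarithm of the product form. This is a sketch with made-up data: the function name `log_likelihood` and the use of `np.logaddexp` for numerical stability are choices made here, not something prescribed by the text.

```python
import numpy as np

def log_likelihood(w, X, y):
    """l(w) = sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ], natural log."""
    z = X @ w
    # log h = -log(1 + e^{-z}) and log(1 - h) = -log(1 + e^{z});
    # np.logaddexp(0, t) evaluates log(1 + e^t) without overflow.
    log_h = -np.logaddexp(0.0, -z)
    log_one_minus_h = -np.logaddexp(0.0, z)
    return float(np.sum(y * log_h + (1 - y) * log_one_minus_h))

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])  # first column is x_0 = 1
y = np.array([1, 0, 1])
w = np.array([0.0, 1.0])

# Product form of the likelihood, for comparison
h = 1.0 / (1.0 + np.exp(-X @ w))
likelihood = np.prod(h**y * (1 - h)**(1 - y))
print(log_likelihood(w, X, y), np.log(likelihood))
```

The two printed numbers should agree up to floating-point rounding. For large $m$ the product itself underflows to zero, which is one practical reason to work with the sum of logarithms.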

### Loss function and linear models

Another way of formulating this problem is by defining a loss function $L\left(y^{(i)}, f(x^{(i)})\right)$ such that:

$$
\sum_{i=1}^{m} L\left(y^{(i)}, f(x^{(i)})\right) = - l(\underline{w}).
$$

The problem then consists of minimizing $\sum_{i=1}^{m} L\left(y^{(i)}, f(x^{(i)})\right)$ over all the possible values of $\underline{w}$.

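Using the definition of $h_\underline{w}(x^{(i)})$ and taking logarithms in base 2 (changing the base only rescales $l(\underline{w})$ by a constant factor, so the optimal weights are unchanged), you can show that $L$ can be written as: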
$$
L\left(y^{(i)}=1, f(x^{(i)})\right) = \log_2\left(1+e^{-f(x^{(i)})}\right)
$$
and
$$
L\left(y^{(i)}=0, f(x^{(i)})\right) = \log_2\left(1+e^{-f(x^{(i)})}\right) - \log_2\left(e^{-f(x^{(i)})}\right)
$$

where $f(x^{(i)}) = \sum_{j=0}^n w_j x_{ij}$ is called the **decision function**.
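
For completeness, here is one way to carry out that derivation, writing $f = f(x^{(i)})$ and $h = h_{\underline{w}}(x^{(i)}) = 1/(1+e^{-f})$, with base-2 logarithms as in the expressions above. Each term of $-l(\underline{w})$ has the form $-y^{(i)}\log_2 h - (1-y^{(i)})\log_2(1-h)$, so for the two possible labels:

$$
L\left(y^{(i)}=1, f\right) = -\log_2 h = -\log_2\frac{1}{1+e^{-f}} = \log_2\left(1+e^{-f}\right)
$$

$$
L\left(y^{(i)}=0, f\right) = -\log_2(1-h) = -\log_2\frac{e^{-f}}{1+e^{-f}} = \log_2\left(1+e^{-f}\right) - \log_2\left(e^{-f}\right) = \log_2\left(1+e^{f}\right)
$$

The last equality shows that the two cases are mirror images of each other under $f\rightarrow -f$.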


In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

fx = np.linspace(-5,5)
Ly1 = np.log2(1+np.exp(-fx))
Ly0 = np.log2(1+np.exp(-fx)) - np.log2(np.exp(-fx))

p = plt.plot(fx,Ly1,label='L(1,f(x))')
p = plt.plot(fx,Ly0,label='L(0,f(x))')
plt.xlabel('f(x)')
plt.ylabel('L')
plt.legend()

Note that although the loss function is not linear, the decision function is a **linear function of the weights and features**. This is why logistic regression is called a **linear model**.

Other linear models are defined by different loss functions. For example, with the labels encoded as $y^{(i)}\in\{-1,+1\}$ instead of $\{0,1\}$:
- Perceptron: $L \left(y^{(i)}, f(x^{(i)})\right) = \max(0, -y^{(i)}\cdot f(x^{(i)}))$
- Hinge loss (soft-margin support vector machine (SVM) classification): $L \left(y^{(i)}, f(x^{(i)})\right) = \max(0, 1-y^{(i)}\cdot f(x^{(i)}))$

See http://scikit-learn.org/stable/modules/sgd.html for more examples.
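
The two losses listed above can be sketched in NumPy as follows, assuming the usual convention that labels are encoded as $\pm 1$ for these losses. The helper `to_signed`, which converts $\{0,1\}$ labels, and the decision values are hypothetical.

```python
import numpy as np

def to_signed(y01):
    """Map labels from {0, 1} to {-1, +1}."""
    return 2 * np.asarray(y01) - 1

def perceptron_loss(y, f):
    """max(0, -y f): zero whenever the sign of f matches the label."""
    return np.maximum(0.0, -y * f)

def hinge_loss(y, f):
    """max(0, 1 - y f): also penalizes correct predictions with margin < 1."""
    return np.maximum(0.0, 1.0 - y * f)

y = to_signed([1, 0, 1, 0])             # becomes 1, -1, 1, -1
f = np.array([2.0, -0.5, -1.0, 0.5])    # hypothetical decision values f(x)
print(perceptron_loss(y, f))            # values 0, 0, 1, 0.5
print(hinge_loss(y, f))                 # values 0, 0.5, 2, 1.5
```

The second example is classified correctly ($f<0$ for a negative label) and has zero perceptron loss, but it still incurs a hinge loss of 0.5 because its margin $y\cdot f = 0.5$ is smaller than 1.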


In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

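# losses as a function of the decision value f(x), for a positive example (y = +1)
fx = np.linspace(-5, 5, 200)
Logit = np.log2(1 + np.exp(-fx))
Percep = np.maximum(0, -fx)
Hinge = np.maximum(0, 1 - fx)
# zero-one loss: 1 when the example is misclassified (f(x) < 0), 0 otherwise
ZeroOne = np.ones(fx.size)
ZeroOne[fx >= 0] = 0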

p = plt.plot(fx,Logit,label='Logistic Regression')
p = plt.plot(fx,Percep,label='Perceptron')
p = plt.plot(fx,Hinge,label='Hinge-loss')
p = plt.plot(fx,ZeroOne,label='Zero-One loss')
plt.xlabel('f(x)')
plt.ylabel('L')
plt.legend()
ylims = plt.ylim((0,7))

## Tweet sentiment analysis

Dataset: http://help.sentiment140.com/for-students/


In [1]:
# Let's create a DataFrame with each tweet using pandas
import pandas as pd
import json


def getTweetID(tweet):
    """ If properly included, get the ID of the tweet """
    return tweet.get('id')
    
def getUserIDandScreenName(tweet):
    """ If properly included, get the tweet 
        user ID and Screen Name """
    user = tweet.get('user')
    if user is not None:
        uid = user.get('id')
        screen_name = user.get('screen_name')
        return uid, screen_name
    else:
        return (None, None)
    

    
filename = 'trumpTweets.txt'

# create a list of dictionaries with the data that interests us
tweet_data_list = []
with open(filename, 'r') as fopen:
    # each line corresponds to a tweet
    for line in fopen:
        tweet = json.loads(line)
        tweet_data_list.append({'tweet_id' : getTweetID(tweet),
                       'user_id' : getUserIDandScreenName(tweet)[0],
                       'text' : tweet.get('text')})

# put everything in a dataframe (one row per tweet)
tweet_df = pd.DataFrame(tweet_data_list)



In [4]:
tweet_df.shape

(10931, 3)

In [6]:
# show the first 10 rows
tweet_df.head(10)

| | text | tweet_id | user_id |
|---|---|---|---|
| 0 | RT @FoxNews: Trump Claims Top Dem Told Him He'... | 8.500716e+17 | 8.396139e+17 |
| 1 | #Trump doesn't care about governing. But he su... | 8.500716e+17 | 2238084000.0 |
| 2 | Don’t miss today’s Federalist Radio on Syria, ... | 8.500716e+17 | 1408004000.0 |
| 3 | Trump is reportedly weighing up military strik... | 8.500716e+17 | 1177421000.0 |
| 4 | Schadenfreude... #obama gets a taste of his tr... | 8.500716e+17 | 11775250.0 |
| 5 | RT @JoinLauraNow: This ENTIRE #Syria killing o... | 8.500716e+17 | 393844700.0 |
| 6 | RT @DylanByers: Who wants to tell him? https:/... | 8.500716e+17 | 8.195901e+17 |
| 7 | $ATRM get 14 days #free #premium access on the... | 8.500716e+17 | 8.056616e+17 |
| 8 | RT @WilliamTurton: So Trump found out about th... | 8.500716e+17 | 15855150.0 |
| 9 | RT @ian_sager: In the meantime, this is how fo... | 8.500716e+17 | 182163800.0 |


#### Download the logistic regression classifier
https://www.dropbox.com/s/09rw6a85f7ezk31/sklearn_SGDLogReg_.pickle.zip?dl=1


In [7]:
# the classifier is saved in a "pickle" file
import pickle

with open('sklearn_SGDLogReg_.pickle', 'rb') as fopen:
    classifier = pickle.load(fopen)



In [8]:
classifier

{'label_inv_mapper': {0: 'neg', 1: 'pos'},
 'label_mapper': {'neg': 0, 'pos': 1},
 'sklearn_pipeline': Pipeline(steps=[('feat_vectorizer', DictVectorizer(dtype=<class 'numpy.int8'>, separator='=', sort=False,
         sparse=True)), ('classifier', SGDClassifier(alpha=7.847599703514622e-06, average=False, class_weight=None,
        epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15,
        learning_rate='optimal', loss='log', n_iter=10, n_jobs=1,
        penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
        warm_start=False))])}

In [9]:
cls = classifier['sklearn_pipeline']


In [10]:
cls

Pipeline(steps=[('feat_vectorizer', DictVectorizer(dtype=<class 'numpy.int8'>, separator='=', sort=False,
        sparse=True)), ('classifier', SGDClassifier(alpha=7.847599703514622e-06, average=False, class_weight=None,
       epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=10, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
       warm_start=False))])

In [None]:
# call the method to list the parameters of every step in the pipeline
cls.get_params()
