## Importing Functions and Data

In [1]:
import nltk 
from os import getcwd

nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\ishan\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ishan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [56]:
import numpy as np 
import pandas as pd
import re
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
import string 
import w1_unittest

from nltk.corpus import stopwords

from nltk.corpus import twitter_samples 


In [44]:
all_positve_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [45]:
test_pos = all_positve_tweets[4000:]
train_pos = all_positve_tweets[:4000]
test_neg= all_negative_tweets[4000:]
train_neg= all_negative_tweets[:4000]

train_X = train_pos + train_neg
test_X = test_pos+ test_neg

train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)),axis=0)
test_y = np.append(np.ones(len(test_pos)),np.zeros(len(test_neg)),axis=0)


In [46]:
print("train_y.shape", train_y.shape)

print("test_y.shape", test_y.shape)

train_y.shape (8000,)
test_y.shape (2000,)


In [47]:
def process_tweet(tweet):
    
    # remove old style retweet text "RT"
    tweet2 = re.sub(r'^RT[\s]+','',tweet)

    # remove hyperlinks
    tweet2 = re.sub(r'https?://[^\s\n\r]+','',tweet2)

    # remove the hash sign (#) but keep the hashtag text
    tweet2 = re.sub(r'#','',tweet2)

    #instantiate tokenizer 
    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
    tweet_tokens = tokenizer.tokenize(tweet2)

    stopwords_english = stopwords.words('english')
    
    tweets_clean = []

    for word in tweet_tokens:
        if (word not in stopwords_english and word not in string.punctuation):
            tweets_clean.append(word)

    #Instantiate stemming class
    stemmer = PorterStemmer()

    #Create an empty list to store the stems
    tweets_stem = []

    for word in tweets_clean:
        stem_word = stemmer.stem(word)
        tweets_stem.append(stem_word)

    return tweets_stem
    

In [48]:
def build_freqs(tweets,ys):
    yslist = np.squeeze(ys).tolist()

    freqs={}
    for y, tweet in zip(yslist,tweets):
        for word in process_tweet(tweet):
            pair = (word,y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1 
    return freqs

In [49]:
# create frequency dictionary
freqs = build_freqs(train_X, train_y)
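
To make the structure of `freqs` concrete, here is a minimal sketch on a toy corpus. It assumes a simplified whitespace tokenizer (`toy_tokenize`, a hypothetical stand-in for `process_tweet`) so that it runs without the NLTK data. The counting logic is intended to mirror `build_freqs`, and the texts and labels are made up for illustration.

```python
from collections import defaultdict

def toy_tokenize(text):
    # Hypothetical stand-in for process_tweet: lowercase and split on whitespace.
    return text.lower().split()

def build_freqs_toy(texts, labels):
    # Map each (word, label) pair to the number of times it occurs.
    freqs = defaultdict(int)
    for label, text in zip(labels, texts):
        for word in toy_tokenize(text):
            freqs[(word, label)] += 1
    return dict(freqs)

# Assumed toy data: one positive (1.0) and one negative (0.0) text.
toy_texts = ["happy happy day", "sad day"]
toy_labels = [1.0, 0.0]

toy_freqs = build_freqs_toy(toy_texts, toy_labels)
print(toy_freqs)
```

Each key pairs a word with a label, so the same word ("day" here) is counted separately for the positive and the negative class.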


## Processing Tweets

In [52]:
# test the function below
print('This is an example of a positive tweet: \n', train_X[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_X[0]))

This is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processed version of the tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


## Logistic Regression

### Sigmoid function 

In [53]:
def sigmoid(z):
    h = 1/(1+np.exp(-z))
    return h

In [55]:
# testing the function 
if (sigmoid(0)==0.5):
    print('SUCCESS!')
else:
    print('Oops!')

# compare floats with a tolerance rather than exact equality
if np.isclose(sigmoid(4.92), 0.9927537604041685):
    print('CORRECT!')
else:
    print('Oops again!')

SUCCESS!
CORRECT!


In [57]:
w1_unittest.test_sigmoid(sigmoid)

 All tests passed


Logistic Regression: Regression and a Sigmoid

Logistic regression takes a regular linear regression and applies the sigmoid function to its output.

Regression:
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_N x_N$$

Note that the $\theta$ values are "weights".


Logistic regression:
$$ h(z) = \frac{1}{1+e^{-z}}$$
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_N x_N$$
We will refer to 'z' as the 'logits'.
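
As a small illustration of the two formulas above, the sketch below computes the logit $z$ and the prediction $h(z)$ for one example. The feature and weight values are assumptions chosen so that the arithmetic is easy to follow, with $x_0 = 1$ as the bias feature.

```python
import numpy as np

def sigmoid(z):
    # Squash any real-valued logit into the interval (0, 1).
    return 1 / (1 + np.exp(-z))

# Assumed example values; x[0] = 1 is the bias feature.
x = np.array([1.0, 2.0, -1.0])
theta = np.array([0.5, 0.25, 1.0])

z = np.dot(theta, x)  # 0.5*1 + 0.25*2 + 1.0*(-1)
h = sigmoid(z)
print("z =", z, "h =", h)
```

A logit of zero corresponds to a prediction of 0.5, positive logits push the prediction toward 1, and negative logits push it toward 0.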

<a name='1-2'></a>
### 1.2 - Cost function and Gradient

The cost function used for logistic regression is the average of the log loss across all training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right]\tag{5} $$
* $m$ is the number of training examples
* $y^{(i)}$ is the actual label of training example 'i'.
* $h(z^{(i)})$ is the model's prediction for the training example 'i'.

The loss function for a single training example is
$$ \text{Loss} = -1 \times \left( y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right)$$

* All the $h$ values are between 0 and 1, so the logs will be negative. That is the reason for the factor of -1 applied to the sum of the two loss terms.
* Note that when the model predicts 1 ($h(z(\theta)) = 1$) and the label 'y' is also 1, the loss for that training example is 0. 
* Similarly, when the model predicts 0 ($h(z(\theta)) = 0$) and the actual label is also 0, the loss for that training example is 0. 
* However, when the model prediction is close to 1 ($h(z(\theta)) = 0.9999$) and the label is 0, the second term of the log loss becomes a large negative number, which is then multiplied by the overall factor of -1 to convert it to a positive loss value: $-1 \times (1 - 0) \times \log(1 - 0.9999) \approx 9.2$. The closer the model prediction gets to 1, the larger the loss.

In [61]:
# Verifying that when the model predicts close to 1 but the actual label is 0, the loss is a large positive value
-1 *(1-0) * np.log(1-0.9999) #loss is about 9.2

9.210340371976294

Likewise, if the model predicts close to 0 ($h(z) = 0.0001$) but the actual label is 1, the first term in the loss function becomes a large negative number, which the factor of -1 again turns into a large positive loss: $-1 \times \log(0.0001) \approx 9.2$. The closer the prediction is to zero, the larger the loss.

In [62]:
# Verifying that when the model predicts close to 0 but the actual label is 1, the loss is a large positive value 
-1 * np.log(0.0001) # loss is about 9.2

9.210340371976182
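
The single-example checks above extend naturally to the averaged cost $J$. The following is a minimal sketch, assuming a hypothetical helper `log_loss_cost` and made-up label and prediction arrays. It compares a set of confident, correct predictions with a set of confident, wrong ones.

```python
import numpy as np

def log_loss_cost(y, h):
    # Average log loss over m examples; y and h are 1-D arrays of equal length.
    m = len(y)
    return -(1 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Assumed toy labels and predictions.
y = np.array([1.0, 0.0, 1.0, 0.0])
good = np.array([0.9, 0.1, 0.8, 0.2])  # confident and correct
bad = np.array([0.1, 0.9, 0.2, 0.8])   # confident and wrong

cost_good = log_loss_cost(y, good)
cost_bad = log_loss_cost(y, bad)
print("cost (good predictions):", cost_good)
print("cost (bad predictions): ", cost_bad)
```

Note that with NumPy a prediction of exactly 0 or 1 leads to `log(0)`, so the sketch keeps all predictions strictly inside (0, 1).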

#### Update the weights

To update your weight vector $\theta$, you will apply gradient descent to iteratively improve your model's predictions.  
The gradient of the cost function $J$ with respect to one of the weights $\theta_j$ is:

$$\nabla_{\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m(h^{(i)}-y^{(i)})x^{(i)}_j \tag{6}$$
* 'i' is the index across all 'm' training examples.
* 'j' is the index of the weight $\theta_j$, so $x^{(i)}_j$ is the feature associated with weight $\theta_j$

* To update the weight $\theta_j$, we adjust it by subtracting a fraction of the gradient determined by $\alpha$:
$$\theta_j = \theta_j - \alpha \times \nabla_{\theta_j}J(\theta) $$
* The learning rate $\alpha$ is a value that we choose to control how big a single update will be.
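
One way to build confidence in the gradient formula is to compare it with a numerical estimate. The sketch below does this on assumed toy data, using central finite differences as an independent check, and then applies one update with an assumed learning rate of 0.1. The helper names `cost` and `gradient` are hypothetical and not part of the assignment.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, x, y):
    # Average log loss for weights theta.
    h = sigmoid(x @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, x, y):
    # (1/m) * sum_i (h_i - y_i) * x_ij, computed for every j at once.
    h = sigmoid(x @ theta)
    return x.T @ (h - y) / len(y)

# Assumed toy data: 4 examples, a bias column plus 2 features.
x = np.array([[1.0, 0.5, 1.5],
              [1.0, -1.0, 0.3],
              [1.0, 2.0, -0.7],
              [1.0, 0.1, 0.9]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = np.array([0.1, -0.2, 0.3])

analytic = gradient(theta, x, y)

# Central finite differences: perturb one weight at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (cost(theta + step, x, y) - cost(theta - step, x, y)) / (2 * eps)

alpha = 0.1  # assumed learning rate
theta_new = theta - alpha * analytic

print("analytic gradient:", analytic)
print("numeric gradient: ", numeric)
print("cost before:", cost(theta, x, y), "cost after:", cost(theta_new, x, y))
```

If the two gradients agree and the cost drops after the update, the formula and the update rule are behaving as described.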


<a name='ex-2'></a>
### Exercise 2 - gradientDescent
Implement the gradient descent function.
* The number of iterations 'num_iters' is the number of times that you'll use the entire training set.
* For each iteration, you'll calculate the cost function using all training examples (there are 'm' training examples), and for all features.
* Instead of updating a single weight $\theta_j$ at a time, we can update all the weights in the column vector:  
$$\mathbf{\theta} = \begin{pmatrix}
\theta_0
\\
\theta_1
\\ 
\theta_2 
\\ 
\vdots
\\ 
\theta_n
\end{pmatrix}$$
* $\mathbf{\theta}$ has dimensions (n+1, 1), where 'n' is the number of features, and there is one more element for the bias term $\theta_0$ (note that the corresponding feature value $\mathbf{x_0}$ is 1).
* The 'logits', 'z', are calculated by multiplying the feature matrix 'x' with the weight vector 'theta'.  $z = \mathbf{x}\mathbf{\theta}$
    * $\mathbf{x}$ has dimensions (m, n+1) 
    * $\mathbf{\theta}$ has dimensions (n+1, 1)
    * $\mathbf{z}$ has dimensions (m, 1)
* The prediction 'h' is calculated by applying the sigmoid to each element in 'z': $h(z) = \mathrm{sigmoid}(z)$, and has dimensions (m,1).
* The cost function $J$ is calculated by taking the dot product of 'y' with 'log(h)' and of '(1-y)' with 'log(1-h)'.  Since all of these are column vectors (m,1), transpose the left-hand vector in each product, so that matrix multiplication of a row vector with a column vector performs the dot product.
$$J = \frac{-1}{m} \times \left(\mathbf{y}^T \cdot \log(\mathbf{h}) + \mathbf{(1-y)}^T \cdot \log(\mathbf{1-h}) \right)$$
* The update of theta is also vectorized.  Because the dimensions of $\mathbf{x}$ are (m, n+1), and both $\mathbf{h}$ and $\mathbf{y}$ are (m, 1), we need to transpose the $\mathbf{x}$ and place it on the left in order to perform matrix multiplication, which then yields the (n+1, 1) answer we need:
$$\mathbf{\theta} = \mathbf{\theta} - \frac{\alpha}{m} \times \left( \mathbf{x}^T \cdot \left( \mathbf{h-y} \right) \right)$$
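
Before writing the full loop, it can help to trace the shapes through a single vectorized step. The sketch below is only one update on assumed random toy data, not a solution to the exercise. It starts from all-zero weights, so every prediction is 0.5 and the cost should equal $\log 2$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

m, n = 5, 2                      # assumed: 5 examples, 2 features
rng = np.random.default_rng(0)   # assumed toy data, seeded for repeatability
x = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # (m, n+1), first column is the bias
y = np.array([[1.0], [0.0], [1.0], [1.0], [0.0]])          # (m, 1)
theta = np.zeros((n + 1, 1))                               # (n+1, 1)
alpha = 0.1                                                # assumed learning rate

z = x @ theta                    # (m, 1)
h = sigmoid(z)                   # (m, 1)

# Row vector times column vector gives a (1, 1) array holding the cost.
J = (-1 / m) * (y.T @ np.log(h) + (1 - y).T @ np.log(1 - h))

# x.T is (n+1, m) and (h - y) is (m, 1), so the update is (n+1, 1).
theta_next = theta - (alpha / m) * (x.T @ (h - y))

print(z.shape, h.shape, J.shape, theta_next.shape)
print("cost:", J.item())
```

Because `J` comes out as a (1, 1) array, a scalar can be extracted with `J.item()` or `float(np.squeeze(J))` when reporting the cost.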

In [None]:
def gradientDescent(x,y,theta, alpha,num_iters):
    