## Logistic Regression using Stopwords - Stack Overflow
This model builds on the logistic regression assignment from the NLP course offered by DeepLearning.ai on Coursera.
It trains a logistic regression classifier on Stack Overflow questions, using the NLTK module for stop-word removal and stemming.

## Import functions and data

In [1]:
# run this cell to import nltk
import csv
import re
from os import getcwd
import nltk                                # Python library for NLP
from nltk.corpus import twitter_samples    # sample Twitter dataset from NLTK
import matplotlib.pyplot as plt            # library for visualization
import random                              # pseudo-random number generator

### Imported functions

Download the data needed for this assignment. Check out the [documentation for the twitter_samples dataset](http://www.nltk.org/howto/twitter.html).

* twitter_samples: if you're running this notebook on your local computer, you will need to download it using:
```Python
nltk.download('twitter_samples')
```

* stopwords: if you're running this notebook on your local computer, you will need to download it using:
```python
nltk.download('stopwords')
```

#### Import some helper functions that we provided in the utils.py file:
* `process_tweet()`: cleans the text, tokenizes it into separate words, removes stopwords, and converts words to stems.
* `build_freqs()`: this counts how often a word in the 'corpus' (the entire set of tweets) was associated with a positive label '1' or a negative label '0', then builds the `freqs` dictionary, where each key is a (word,label) tuple, and the value is the count of its frequency within the corpus of tweets.

In [2]:
# add folder, tmp2, from our local workspace containing pre-downloaded corpora files to nltk's data path
# this enables importing of these files without downloading it again when we refresh our workspace
filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

In [3]:
import numpy as np
import pandas as pd
from nltk.corpus import twitter_samples 
from utils import process_tweet, build_freqs, iterate_file

### Prepare the data
* The `twitter_samples` contains subsets of 5,000 positive tweets, 5,000 negative tweets, and the full set of 10,000 tweets.  
    * If you used all three datasets, we would introduce duplicates of the positive tweets and negative tweets.  
    * You will select just the five thousand positive tweets and five thousand negative tweets.

In [4]:
# select the set of positive and negative tweets
# nltk.download('twitter_samples')
# all_positive_tweets = twitter_samples.strings('positive_tweets.json')
# all_negative_tweets = twitter_samples.strings('negative_tweets.json')
# print(all_positive_tweets[0:20])

In [5]:
#Iterator for given file
def iterate_file(file_path):
    with open(file_path, "r", encoding = 'utf8') as f:
        reader = csv.DictReader(f, delimiter=",")
        for row in reader:
            yield row

#Clean sentence (remove non alpha chars)
def cleanSentence(sentence):
    p = re.compile(r'<.*?>')
    sentence = p.sub('', sentence) 
    sentence = ''.join([i.lower() if i.isalpha() else " " if (i==" " or i=="-" or i=="_") else "" for i in sentence])
    return sentence
            
def load_questions(file_path, n=float("inf")):
    positive_questions = []
    negative_questions = []
    cont = 0
    for row in iterate_file(file_path):
        cont += 1 
        if cont %100000==0:
            print("idx:",cont)
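        # NOTE: the three cleaned fields are concatenated without a separator, so the last word
        # of one field fuses with the first word of the next (e.g. "formpim" in the output below)
        sentence = cleanSentence(row["title"])+cleanSentence(row["body"])+cleanSentence("".join(row["tags"].split("|")))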
        if float(row["stars"])>= 3.0:
            positive_questions.append(sentence)
        else:
            negative_questions.append(sentence)
        if cont==n:
            break
    
    return positive_questions, negative_questions

train_pos, train_neg = load_questions("data_stackoverflow_train.csv", 90000)
test_pos, test_neg = load_questions("data_stackoverflow_test.csv", 5000)

print(train_pos[0:20])
print("positive questions:", len(train_pos))
print("negative questions:", len(train_neg))
        

['stop firefox rendering inline colours in rgb formpim trying to write a javascript tool to work on items of a certain colour  on a test page i set the colour using an inline style to mimic the target pages but when the page is rendered the colour is specified using the css rgb functionppthe html tries to emulate the gmail container i want to change the background colour on  when i inspect the element in firebug the colour is displayed as rgbpprecodeltdiv stylewidth  height px border px solid black     background e none repeat scroll  gt    ltdivgtcodeprephow can i stop the colour being converted from codeecode to codergb  codepcss', 'javascript ie change input typepthe following javascript snippet will change the type of an input for instance from text to password its being used to allow users to display their password on screen when typing it inpprecodedocumentsave formpassword confirmtype textdocumentsave formpassword confirmtype passwordcodeprepthis works great in ffchrome but in i

* Train/test split: the training set holds 90,000 questions and the test set 5,000, each loaded from its own CSV file above.


In [6]:
# split the data into two pieces, one for training and one for testing (validation set) 
# test_pos = all_positive_tweets[10000:]
# train_pos = all_positive_tweets[:10000]
# test_neg = all_negative_tweets[10000:]
# train_neg = all_negative_tweets[:10000]


train_x = train_pos + train_neg 
test_x = test_pos + test_neg

* Create the numpy array of positive labels and negative labels.

In [7]:
# combine positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

In [8]:
# Print the shapes of the train and test label arrays
print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))
print(train_y[:5])

train_y.shape = (90000, 1)
test_y.shape = (5000, 1)
[[1.]
 [1.]
 [1.]
 [1.]
 [1.]]


* Create the frequency dictionary using the imported `build_freqs()` function.  
    * We highly recommend that you open `utils.py` and read the `build_freqs()` function to understand what it is doing.
    * To view the file directory, go to the menu and click File->Open.

```Python
    for y,tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
```
* Notice how the outer for loop goes through each tweet, and the inner for loop steps through each word in a tweet.
* The `freqs` dictionary is the frequency dictionary that's being built. 
* The key is the tuple (word, label), such as ("happy",1) or ("happy",0).  The value stored for each key is the count of how many times the word "happy" was associated with a positive label, or how many times "happy" was associated with a negative label.
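
As a rough illustration of the counting logic above, here is a minimal sketch that runs on three made-up sentences. It assumes plain whitespace tokenization instead of `process_tweet()`, and the name `build_freqs_sketch` is hypothetical, not the function in `utils.py`.

```python
def build_freqs_sketch(texts, labels, tokenize=str.split):
    """Count how often each (word, label) pair occurs (illustrative only)."""
    freqs = {}
    for label, text in zip(labels, texts):
        for word in tokenize(text):
            pair = (word, float(label))
            # dict.get returns 0 for a pair that has not been seen yet
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

# made-up data: label 1 = positive, label 0 = negative
toy_texts = ["java null pointer", "java list sort", "python list"]
toy_labels = [1, 0, 1]
toy_freqs = build_freqs_sketch(toy_texts, toy_labels)

print(toy_freqs[("java", 1.0)])          # 1
print(toy_freqs[("list", 0.0)])          # 1
print(toy_freqs.get(("sort", 1.0), 0))   # 0, the pair never occurs
```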

In [9]:
# create frequency dictionary
nltk.download('stopwords')
freqs = build_freqs(train_x, train_y)

# check the output
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Bull\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


type(freqs) = <class 'dict'>
len(freqs) = 1013951


#### Expected output (original Twitter assignment)
```
type(freqs) = <class 'dict'>
len(freqs) = 11346
```

### Process tweet
The given function `process_tweet()` tokenizes the tweet into individual words, removes stop words and applies stemming.

In [10]:
print("data values:")
print(freqs[('java', 1.0)])
print(freqs[('java', 0.0)])
# test the function below
print('This is an example of a positive tweet: \n', train_x[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[0]))

data values:
4221
3358
This is an example of a positive tweet: 
 stop firefox rendering inline colours in rgb formpim trying to write a javascript tool to work on items of a certain colour  on a test page i set the colour using an inline style to mimic the target pages but when the page is rendered the colour is specified using the css rgb functionppthe html tries to emulate the gmail container i want to change the background colour on  when i inspect the element in firebug the colour is displayed as rgbpprecodeltdiv stylewidth  height px border px solid black     background e none repeat scroll  gt    ltdivgtcodeprephow can i stop the colour being converted from codeecode to codergb  codepcss

This is an example of the processed version of the tweet: 
 ['stop', 'firefox', 'render', 'inlin', 'colour', 'rgb', 'formpim', 'tri', 'write', 'javascript', 'tool', 'work', 'item', 'certain', 'colour', 'test', 'page', 'set', 'colour', 'use', 'inlin', 'style', 'mimic', 'target', 'page', 'page', '

In [11]:
# freqs0 = freqs.copy()
# freqs = freqs0

In [104]:
# Get highest value and normalize everything
max_val = 0
max_key = "key"
for key,val in freqs.items():
    if val > max_val:
        max_val = val
        max_key = key

print("max key,val:", max_key, max_val)

for key,val in freqs.items():
    freqs[key] = val/max_val

max key,val: ('use', 0.0) 43153


In [105]:
print("data values:")
print(freqs[('java', 1.0)])
print(freqs[('java', 0.0)])
# test the function below
print('This is an example of a positive tweet: \n', train_x[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[0]))

data values:
0.09781475216091581
0.07781614256251014
This is an example of a positive tweet: 
 stop firefox rendering inline colours in rgb formpim trying to write a javascript tool to work on items of a certain colour  on a test page i set the colour using an inline style to mimic the target pages but when the page is rendered the colour is specified using the css rgb functionppthe html tries to emulate the gmail container i want to change the background colour on  when i inspect the element in firebug the colour is displayed as rgbpprecodeltdiv stylewidth  height px border px solid black     background e none repeat scroll  gt    ltdivgtcodeprephow can i stop the colour being converted from codeecode to codergb  codepcss

This is an example of the processed version of the tweet: 
 ['stop', 'firefox', 'render', 'inlin', 'colour', 'rgb', 'formpim', 'tri', 'write', 'javascript', 'tool', 'work', 'item', 'certain', 'colour', 'test', 'page', 'set', 'colour', 'use', 'inlin', 'style', 'mimic

#### Expected output (original Twitter assignment)
```
This is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
 
This is an example of the processes version: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
```

# Part 1: Logistic regression 


### Part 1.1: Sigmoid
You will learn to use logistic regression for text classification. 
* The sigmoid function is defined as: 

$$ h(z) = \frac{1}{1+e^{-z}} \tag{1}$$

It maps the input 'z' to a value that ranges between 0 and 1, and so it can be treated as a probability. 

<div style="font-size:100%; text-align:center;"><img src='../tmp2/sigmoid_plot.jpg' alt="Plot of the sigmoid function" style="width:300px;height:200px;" /> Figure 1 </div>

#### Instructions: Implement the sigmoid function
* You will want this function to work if z is a scalar as well as if it is an array.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li><a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html" > numpy.exp </a> </li>

</ul>
</p>
</details>



In [106]:
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def sigmoid(z): 
    '''
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # calculate the sigmoid of z
    h = 1.0/(1+np.exp(-z))
    ### END CODE HERE ###
    
    return h

In [107]:
# Testing your function
# np.isclose avoids relying on exact floating-point equality
if np.isclose(sigmoid(0), 0.5):
    print('SUCCESS!')
else:
    print('Oops!')

if np.isclose(sigmoid(4.92), 0.9927537604041685):
    print('CORRECT!')
else:
    print('Oops again!')

SUCCESS!
CORRECT!
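
For very negative inputs, `np.exp(-z)` overflows to `inf`; the result of `1.0/(1+inf)` is still 0.0, but NumPy emits an overflow warning. The following is a minimal sketch of one common way to avoid that, not part of the original assignment; the name `stable_sigmoid` is hypothetical.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid computed so that np.exp never receives a large positive argument."""
    z = np.asarray(z, dtype=float)
    # exp(-|z|) always lies in (0, 1], so it cannot overflow
    e = np.exp(-np.abs(z))
    # z >= 0: 1 / (1 + exp(-z));  z < 0: exp(z) / (1 + exp(z)), the same function
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(stable_sigmoid(0))                          # 0.5
print(stable_sigmoid(np.array([-1000.0, 1000.0])))  # [0. 1.] without overflow warnings
```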


### Logistic regression: regression and a sigmoid

Logistic regression takes a regular linear regression, and applies a sigmoid to the output of the linear regression.

Regression:
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_N x_N$$
Note that the $\theta$ values are the "weights". The Deep Learning Specialization refers to the weights as the vector `w`; this course uses $\theta$ instead.

Logistic regression:
$$ h(z) = \frac{1}{1+e^{-z}}$$
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_N x_N$$
We will refer to 'z' as the 'logits'.
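
To make the two formulas concrete, here is a small sketch that computes the logits and the predictions for three made-up examples. The feature values and weights are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

# three examples, each row is [x_0 (bias), x_1, x_2]; the numbers are made up
x = np.array([[1.0, 2.0, 0.5],
              [1.0, 0.0, 3.0],
              [1.0, 1.0, 1.0]])
theta = np.array([[0.1], [0.8], [-0.6]])   # hypothetical weights, shape (3, 1)

z = x @ theta                  # logits, shape (3, 1): 1.4, -1.7, 0.3
h = 1.0 / (1.0 + np.exp(-z))   # sigmoid of each logit, values strictly between 0 and 1

print(z.ravel())
print(h.ravel())
```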

### Part 1.2 Cost function and Gradient

The cost function used for logistic regression is the average of the log loss across all training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right] \tag{5} $$
* $m$ is the number of training examples
* $y^{(i)}$ is the actual label of the i-th training example.
* $h(z(\theta)^{(i)})$ is the model's prediction for the i-th training example.

The loss function for a single training example is
$$ Loss = -1 \times \left( y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right)$$

* All the $h$ values are between 0 and 1, so the logs will be negative. That is the reason for the factor of -1 applied to the sum of the two loss terms.
* Note that when the model predicts 1 ($h(z(\theta)) = 1$) and the label $y$ is also 1, the loss for that training example is 0. 
* Similarly, when the model predicts 0 ($h(z(\theta)) = 0$) and the actual label is also 0, the loss for that training example is 0. 
* However, when the model prediction is close to 1 ($h(z(\theta)) = 0.9999$) and the label is 0, the second term of the log loss becomes a large negative number, which is then multiplied by the overall factor of -1 to convert it to a positive loss value: $-1 \times (1 - 0) \times \log(1 - 0.9999) \approx 9.2$. The closer the model prediction gets to 1, the larger the loss.

In [108]:
# verify that when the model predicts close to 1, but the actual label is 0, the loss is a large positive value
-1 * (1 - 0) * np.log(1 - 0.9999) # loss is about 9.2

9.210340371976294

* Likewise, if the model predicts close to 0 ($h(z) = 0.0001$) but the actual label is 1, the first term in the loss function becomes a large number: $-1 \times \log(0.0001) \approx 9.2$. The closer the prediction is to zero, the larger the loss.

In [109]:
# verify that when the model predicts close to 0 but the actual label is 1, the loss is a large positive value
-1 * np.log(0.0001) # loss is about 9.2

9.210340371976182
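
The averaged cost in equation (5) can be sketched as a small function. This is a minimal illustration, not the assignment's code: the name `log_loss_cost` and the clipping constant `eps` are assumptions, added so that a prediction of exactly 0 or 1 does not produce `log(0)`.

```python
import numpy as np

def log_loss_cost(y, h, eps=1e-12):
    """Average log loss for labels y and predictions h, both of shape (m, 1)."""
    # clip so that log(0) never occurs when a prediction is exactly 0 or 1
    h = np.clip(h, eps, 1.0 - eps)
    m = y.shape[0]
    return float(-(y * np.log(h) + (1 - y) * np.log(1 - h)).sum() / m)

y = np.array([[1.0], [0.0]])
print(log_loss_cost(y, np.array([[0.5], [0.5]])))   # log(2), about 0.6931
print(log_loss_cost(y, np.array([[1.0], [0.0]])))   # close to 0, finite because of clipping
```

The value log(2) for uninformative predictions of 0.5 is the same 0.69314718 that the training logs in this notebook print at iteration 0, where all weights are zero.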

#### Update the weights

To update your weight vector $\theta$, you will apply gradient descent to iteratively improve your model's predictions.  
The gradient of the cost function $J$ with respect to one of the weights $\theta_j$ is:

$$\nabla_{\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m(h^{(i)}-y^{(i)})x_j^{(i)} \tag{5}$$
* 'i' is the index across all 'm' training examples.
* 'j' is the index of the weight $\theta_j$, so $x_j^{(i)}$ is the value of the feature associated with weight $\theta_j$ in the i-th training example.

* To update the weight $\theta_j$, we adjust it by subtracting a fraction of the gradient determined by $\alpha$:
$$\theta_j = \theta_j - \alpha \times \nabla_{\theta_j}J(\theta) $$
* The learning rate $\alpha$ is a value that we choose to control how big a single update will be.
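
One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation of the cost. The sketch below does this on random synthetic data; the helper names `cost` and `analytic_gradient` and all the data are assumptions made for this check only.

```python
import numpy as np

def cost(theta, x, y):
    """Average log loss for weights theta on data (x, y)."""
    h = 1.0 / (1.0 + np.exp(-(x @ theta)))
    return float(-(y * np.log(h) + (1 - y) * np.log(1 - h)).mean())

def analytic_gradient(theta, x, y):
    """Gradient from the formula above: (1/m) * x^T (h - y)."""
    h = 1.0 / (1.0 + np.exp(-(x @ theta)))
    return x.T @ (h - y) / x.shape[0]

rng = np.random.default_rng(0)
x = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])   # bias column plus 2 features
y = (rng.random((20, 1)) > 0.5).astype(float)
theta = rng.normal(size=(3, 1)) * 0.1

# central finite differences, perturbing one weight at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    step = np.zeros_like(theta)
    step[j, 0] = eps
    numeric[j, 0] = (cost(theta + step, x, y) - cost(theta - step, x, y)) / (2 * eps)

# the largest difference between the two gradients should be tiny
print(np.max(np.abs(numeric - analytic_gradient(theta, x, y))))
```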


## Instructions: Implement gradient descent function
* The number of iterations `num_iters` is the number of times that you'll use the entire training set.
* For each iteration, you'll calculate the cost function using all training examples (there are `m` training examples), and for all features.
* Instead of updating a single weight $\theta_j$ at a time, we can update all the weights in the column vector:  
$$\mathbf{\theta} = \begin{pmatrix}
\theta_0
\\
\theta_1
\\ 
\theta_2 
\\ 
\vdots
\\ 
\theta_n
\end{pmatrix}$$
* $\mathbf{\theta}$ has dimensions (n+1, 1), where 'n' is the number of features, and there is one more element for the bias term $\theta_0$ (note that the corresponding feature value $\mathbf{x_0}$ is 1).
* The 'logits', 'z', are calculated by multiplying the feature matrix 'x' with the weight vector 'theta'.  $z = \mathbf{x}\mathbf{\theta}$
    * $\mathbf{x}$ has dimensions (m, n+1) 
    * $\mathbf{\theta}$: has dimensions (n+1, 1)
    * $\mathbf{z}$: has dimensions (m, 1)
* The prediction 'h', is calculated by applying the sigmoid to each element in 'z': $h(z) = sigmoid(z)$, and has dimensions (m,1).
* The cost function $J$ is calculated by taking the dot product of the vectors 'y' and 'log(h)'.  Since both 'y' and 'h' are column vectors (m,1), transpose the left-hand vector so that matrix multiplication of a row vector with a column vector performs the dot product.
$$J = \frac{-1}{m} \times \left(\mathbf{y}^T \cdot \log(\mathbf{h}) + \mathbf{(1-y)}^T \cdot \log(\mathbf{1-h}) \right)$$
* The update of theta is also vectorized.  Because the dimensions of $\mathbf{x}$ are (m, n+1), and both $\mathbf{h}$ and $\mathbf{y}$ are (m, 1), we need to transpose the $\mathbf{x}$ and place it on the left in order to perform matrix multiplication, which then yields the (n+1, 1) answer we need:
$$\mathbf{\theta} = \mathbf{\theta} - \frac{\alpha}{m} \times \left( \mathbf{x}^T \cdot \left( \mathbf{h-y} \right) \right)$$

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>use np.dot for matrix multiplication.</li>
    <li>To ensure that the fraction -1/m is a decimal value, cast either the numerator or denominator (or both), like `float(1)`, or write `1.` for the float version of 1. </li>
</ul>
</p>
</details>



In [110]:
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def minibatch_gradientDescent(x, y, theta, alpha, num_iters, batch_size):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of passes over the whole training set
        batch_size: number of rows used for each weight update
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # get 'm', the number of rows in matrix x
    m = x.shape[0]
    
    for i in range(0, num_iters):
        
        for j in range(0, m, batch_size):

            # get z, the dot product of x and theta
            xj = x[j:j+batch_size,:]
            yj = y[j:j+batch_size,:]
            z = np.dot(xj ,theta)
            #mx1

            # get the sigmoid of z
            h = sigmoid(z)
            #mx1

            #print("shape z", z.shape)
            #print("shape h", h.shape)


            # calculate the cost function on this batch
            # (use the actual batch length: the last batch may be shorter than batch_size)
            mb = xj.shape[0]
            J = (-1/mb)*( np.dot(yj.T,np.log(h)) + np.dot((1-yj).T,np.log(1-h)) )

            # update the weights theta
            theta = theta - (alpha/mb)*np.dot(xj.T,h-yj)
    
        print(i,":", J)
        
    ### END CODE HERE ###
    J = float(np.squeeze(J))  # J is a (1, 1) array; squeeze it before converting to a Python float
    return J, theta
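
The function above visits the rows in the same order on every pass. A common variant reshuffles the rows at the start of each pass; the sketch below shows that idea on synthetic data. It is an illustration under assumptions, not the notebook's method: the name `minibatch_epochs`, the `seed` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def minibatch_epochs(x, y, theta, alpha, num_epochs, batch_size, seed=0):
    """Mini-batch gradient descent that reshuffles the rows every epoch."""
    rng = np.random.default_rng(seed)
    m = x.shape[0]
    for _ in range(num_epochs):
        order = rng.permutation(m)                 # new row order for this epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            h = 1.0 / (1.0 + np.exp(-(xb @ theta)))
            # divide by the actual batch length; the last batch may be shorter
            theta = theta - (alpha / len(idx)) * (xb.T @ (h - yb))
    return theta

# synthetic data: the label is 1 when feature 1 exceeds feature 2
rng = np.random.default_rng(1)
x = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = (x[:, 1:2] - x[:, 2:3] > 0).astype(float)

theta = minibatch_epochs(x, y, np.zeros((3, 1)), alpha=0.5, num_epochs=50, batch_size=32)
accuracy = np.mean((1.0 / (1.0 + np.exp(-(x @ theta))) > 0.5) == (y == 1))
print(theta.ravel(), accuracy)
```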

In [111]:
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # get 'm', the number of rows in matrix x
    m = x.shape[0]
    
    for i in range(0, num_iters):
        
        # get z, the dot product of x and theta
        z = np.dot(x,theta)
        #mx1
        
        # get the sigmoid of z
        h = sigmoid(z)
        #mx1
        
        #print("shape z", z.shape)
        #print("shape h", h.shape)
        
        
        # calculate the cost function
        J = (-1/m)*( np.dot(y.T,np.log(h)) + np.dot((1-y).T,np.log(1-h)) )
        if i%100==0:
            print(i,":", J)
            

        # update the weights theta
        theta = theta - (alpha/m)*np.dot(x.T,h-y)
        
    ### END CODE HERE ###
    J = float(np.squeeze(J))  # J is a (1, 1) array; squeeze it before converting to a Python float
    return J, theta

In [112]:
# Check the function
# Construct a synthetic test case using numpy PRNG functions
np.random.seed(1)
# X input is 10 x 3 with ones for the bias terms
tmp_X = np.append(np.ones((10, 1)), np.random.rand(10, 2) * 2000, axis=1)
# Y Labels are 10 x 1
tmp_Y = (np.random.rand(10, 1) > 0.35).astype(float)

# Apply gradient descent
tmp_J, tmp_theta = gradientDescent(tmp_X, tmp_Y, np.zeros((3, 1)), 1e-8, 700)
print(f"The cost after training is {tmp_J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(tmp_theta)]}")

0 : [[0.69314718]]
100 : [[0.68600073]]
200 : [[0.68163567]]
300 : [[0.67862941]]
400 : [[0.67630475]]
500 : [[0.67433793]]
600 : [[0.67257295]]
The cost after training is 0.67094970.
The resulting vector of weights is [4.1e-07, 0.00035658, 7.309e-05]


#### Expected output
```
The cost after training is 0.67094970.
The resulting vector of weights is [4.1e-07, 0.00035658, 7.309e-05]
```

## Part 2: Extracting the features

* Given a list of tweets, extract the features and store them in a matrix. You will extract two features.
    * The first feature is the sum, over the words in a tweet, of each word's frequency under the positive label.
    * The second feature is the same sum using each word's frequency under the negative label.
* Then train your logistic regression classifier on these features.
* Test the classifier on a validation set. 

### Instructions: Implement the extract_features function. 
* This function takes in a single tweet.
* Process the tweet using the imported `process_tweet()` function and save the list of tweet words.
* Loop through each word in the list of processed words
    * For each word, check the `freqs` dictionary for the count when that word has a positive '1' label. (Check for the key (word, 1.0).)
    * Do the same for the count for when the word is associated with the negative label '0'. (Check for the key (word, 0.0).)
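
The lookup steps above can be tried on a tiny hand-made dictionary. This sketch assumes whitespace tokenization in place of `process_tweet()`, and both `extract_features_sketch` and `toy_freqs` are hypothetical names with made-up counts.

```python
import numpy as np

def extract_features_sketch(text, freqs, tokenize=str.split):
    """Return [[1, sum of positive counts, sum of negative counts]] for one text."""
    x = np.zeros((1, 3))
    x[0, 0] = 1.0                                   # bias term
    for word in tokenize(text):
        x[0, 1] += freqs.get((word, 1.0), 0)        # 0 if the pair is not in freqs
        x[0, 2] += freqs.get((word, 0.0), 0)
    return x

toy_freqs = {("java", 1.0): 3, ("java", 0.0): 1, ("list", 0.0): 2}
print(extract_features_sketch("java list java", toy_freqs))   # [[1. 6. 4.]]
print(extract_features_sketch("blorb", toy_freqs))            # [[1. 0. 0.]]
```

Note that a repeated word contributes once per occurrence, which is why "java" adds 3 twice to the positive feature.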


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>Make sure you handle cases when the (word, label) key is not found in the dictionary. </li>
    <li> Search the web for hints about using the `.get()` method of a Python dictionary.  Here is an <a href="https://www.programiz.com/python-programming/methods/dictionary/get" > example </a> </li>
</ul>
</p>
</details>


In [113]:
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)

def extract_features(tweet, freqs):
    '''
    Input: 
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output: 
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)
    
    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3)) 
    
    #bias term is set to 1
    x[0,0] = 1 
    
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    
    # loop through each word in the list of words
    for word in word_l:
        
        # increment the word count for the positive label 1
        x[0,1] += freqs.get((word, 1.0),0)
        
        # increment the word count for the negative label 0
        x[0,2] += freqs.get((word, 0.0),0)
        
    ### END CODE HERE ###
    assert(x.shape == (1, 3))
    return x

In [114]:
# Check your function

# test 1
# test on training data
tmp1 = extract_features(train_x[0], freqs)
print(tmp1)

[[1.         6.0035687  7.77786017]]


#### Expected output (original Twitter assignment)
```
[[1.00e+00 3.02e+03 6.10e+01]]
```

In [115]:
# test 2:
# check for when the words are not in the freqs dictionary
tmp2 = extract_features('blorb bleeeeb bloooob', freqs)
print(tmp2)

[[1. 0. 0.]]


#### Expected output
```
[[1. 0. 0.]]
```

## Part 3: Training Your Model

To train the model:
* Stack the features for all training examples into a matrix `X`. 
* Call `gradientDescent`, which you've implemented above.

This section is given to you.  Please read it for understanding and run the cell.
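
In the training log further down, the cost falls only from 0.693 to about 0.679 over the first 4,500 iterations. One commonly used remedy for slow gradient descent, which is not part of the original notebook and is offered here only as a suggestion, is to standardize the non-bias feature columns before training. A minimal sketch, with the hypothetical helper `standardize_features` and made-up feature values:

```python
import numpy as np

def standardize_features(X):
    """Scale every column except the bias column 0 to zero mean and unit variance."""
    X = X.astype(float).copy()
    mean = X[:, 1:].mean(axis=0)
    std = X[:, 1:].std(axis=0)
    std[std == 0] = 1.0            # a constant column is only centered, not rescaled
    X[:, 1:] = (X[:, 1:] - mean) / std
    return X, mean, std

# made-up feature matrix in the same [bias, positive, negative] layout
X_toy = np.array([[1.0, 6.0, 7.8],
                  [1.0, 2.0, 1.0],
                  [1.0, 4.0, 4.4]])
X_scaled, mean, std = standardize_features(X_toy)
print(X_scaled[:, 1:].mean(axis=0))   # approximately [0, 0]
print(X_scaled[:, 1:].std(axis=0))    # approximately [1, 1]
```

If this were used, the same `mean` and `std` computed on the training features would have to be applied to the features of every text passed to the model at prediction time.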

In [116]:
# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
print("extract_features:",train_x[0], extract_features(train_x[0], freqs) )
for i in range(len(train_x)):
    X[i, :]= extract_features(train_x[i], freqs)
#     if i%1000000==0:
#         print("extracting features:", i)
Y = train_y

extract_features: stop firefox rendering inline colours in rgb formpim trying to write a javascript tool to work on items of a certain colour  on a test page i set the colour using an inline style to mimic the target pages but when the page is rendered the colour is specified using the css rgb functionppthe html tries to emulate the gmail container i want to change the background colour on  when i inspect the element in firebug the colour is displayed as rgbpprecodeltdiv stylewidth  height px border px solid black     background e none repeat scroll  gt    ltdivgtcodeprephow can i stop the colour being converted from codeecode to codergb  codepcss [[1.         6.0035687  7.77786017]]


In [140]:

# Shuffle the training examples, then apply gradient descent
from sklearn.utils import shuffle
Xs, Ys = shuffle(X, Y, random_state=0)

# J, theta = minibatch_gradientDescent(Xs, Ys, np.zeros((3, 1)), 1E-4, 1000, 5000)
J, theta = gradientDescent(Xs, Ys, np.zeros((3, 1)), 3E-3, 20000)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")

0 : [[0.69314718]]
100 : [[0.68969543]]
200 : [[0.68926565]]
300 : [[0.68884956]]
400 : [[0.68844679]]
500 : [[0.68805689]]
600 : [[0.68767934]]
700 : [[0.6873136]]
800 : [[0.68695912]]
900 : [[0.68661539]]
1000 : [[0.68628192]]
1100 : [[0.68595831]]
1200 : [[0.68564414]]
1300 : [[0.68533908]]
1400 : [[0.68504277]]
1500 : [[0.68475492]]
1600 : [[0.68447521]]
1700 : [[0.68420337]]
1800 : [[0.68393913]]
1900 : [[0.68368221]]
2000 : [[0.68343238]]
2100 : [[0.68318938]]
2200 : [[0.68295297]]
2300 : [[0.68272295]]
2400 : [[0.68249907]]
2500 : [[0.68228115]]
2600 : [[0.68206897]]
2700 : [[0.68186234]]
2800 : [[0.68166108]]
2900 : [[0.68146502]]
3000 : [[0.68127397]]
3100 : [[0.68108779]]
3200 : [[0.6809063]]
3300 : [[0.68072938]]
3400 : [[0.68055686]]
3500 : [[0.68038861]]
3600 : [[0.6802245]]
3700 : [[0.68006441]]
3800 : [[0.6799082]]
3900 : [[0.67975577]]
4000 : [[0.679607]]
4100 : [[0.67946179]]
4200 : [[0.67932002]]
4300 : [[0.6791816]]
4400 : [[0.67904643]]
4500 : [[0.67891441]]
4600 : 

**Expected Output** (original Twitter assignment): 

```
The cost after training is 0.24216529.
The resulting vector of weights is [7e-08, 0.0005239, -0.00055517]
```

# Part 4: Test your logistic regression

It is time for you to test your logistic regression function on some new input that your model has not seen before. 

#### Instructions: Write `predict_tweet`
Predict whether a tweet is positive or negative.

* Given a tweet, process it, then extract the features.
* Apply the model's learned weights on the features to get the logits.
* Apply the sigmoid to the logits to get the prediction (a value between 0 and 1).

$$y_{pred} = sigmoid(\mathbf{x} \cdot \theta)$$

In [141]:
# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def predict_tweet(tweet, freqs, theta):
    '''
    Input: 
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output: 
        y_pred: the predicted probability that the tweet is positive (label 1), as a (1,1) array
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    
    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)
    
    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, theta))
    
    ### END CODE HERE ###
    
    return y_pred

In [142]:
# Run this cell to test your function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    # predict_tweet returns a (1, 1) array; .item() extracts the Python float for formatting
    print( '%s -> %f' % (tweet, predict_tweet(tweet, freqs, theta).item()))

I am happy -> 0.427927
I am bad -> 0.428318
this movie should have been great. -> 0.427671
great -> 0.428051
great great -> 0.428380
great great great -> 0.428709
great great great great -> 0.429039


**Expected Output** (original Twitter assignment): 
```
I am happy -> 0.518580
I am bad -> 0.494339
this movie should have been great. -> 0.515331
great -> 0.515464
great great -> 0.530898
great great great -> 0.546273
great great great great -> 0.561561
```

In [143]:
# Feel free to check the sentiment of your own tweet below
my_tweet = 'love be in the air tonight'
predict_tweet(my_tweet, freqs, theta)

array([[0.42804029]])

## Check performance using the test set
After training your model using the training set above, check how your model might perform on real, unseen data, by testing it against the test set.

#### Instructions: Implement `test_logistic_regression` 
* Given the test data and the weights of your trained model, calculate the accuracy of your logistic regression model. 
* Use your `predict_tweet()` function to make predictions on each tweet in the test set.
* If the prediction is > 0.5, set the model's classification `y_hat` to 1, otherwise set the model's classification `y_hat` to 0.
* A prediction is accurate when `y_hat` equals `test_y`.  Sum up all the instances when they are equal and divide by `m`.
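
Once the predicted probabilities are collected, the thresholding and averaging steps above can be written without an explicit loop. A minimal sketch on made-up probabilities; the name `accuracy_from_probabilities` is hypothetical.

```python
import numpy as np

def accuracy_from_probabilities(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction equals the label."""
    probs = np.asarray(probs, dtype=float).reshape(-1)     # (m, 1) -> (m,)
    labels = np.asarray(labels, dtype=float).reshape(-1)
    y_hat = (probs > threshold).astype(float)              # 1.0 where prob > threshold
    return float(np.mean(y_hat == labels))

probs = np.array([[0.9], [0.2], [0.6], [0.4]])
labels = np.array([[1.0], [0.0], [0.0], [1.0]])
print(accuracy_from_probabilities(probs, labels))   # 0.5
```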


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>Use np.asarray() to convert a list to a numpy array</li>
    <li>Use np.squeeze() to make an (m,1) dimensional array into an (m,) array </li>
</ul>
</p>
</details>

In [144]:
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def test_logistic_regression(test_x, test_y, freqs, theta):
    """
    Input: 
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output: 
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    
    # the list for storing predictions
    y_hat = []
    
    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)
        
        if y_pred > 0.5:
            # append 1.0 to the list
            y_hat.append(1.0)
        else:
            # append 0 to the list
            y_hat.append(0.0)

    # With the above implementation, y_hat is a list, but test_y is (m,1) array
    # convert both to one-dimensional arrays in order to compare them using the '==' operator
    accuracy = np.sum(np.array(y_hat)==test_y.reshape(-1))/len(y_hat)

    ### END CODE HERE ###
    
    return accuracy

In [145]:
tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")

Logistic regression model's accuracy = 0.5864


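#### Expected Output: 
```0.9950```  
This expected value comes from the original course assignment. The run above reaches 0.5864, which is well below it.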

# Part 5: Error Analysis

In this part you will see some tweets that your model misclassified. Why do you think the misclassifications happened? Specifically, what kinds of tweets does your model misclassify?
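
One way to start answering these questions is to split the errors by direction: tweets labelled 1 but predicted 0, and tweets labelled 0 but predicted 1. Below is a minimal sketch on made-up values. The arrays `y_true` and `y_prob` are illustrative names, not part of the notebook; here they would be built from `test_y` and the outputs of `predict_tweet()`.

```python
import numpy as np

# Illustrative labels and predicted probabilities, not real model output
y_true = np.array([1, 1, 1, 0, 0, 0])
y_prob = np.array([0.0, 0.0, 0.8, 1.0, 0.51, 0.1])

# Hard predictions using the same 0.5 threshold as the notebook
y_pred = (y_prob > 0.5).astype(int)

# Labelled 1 but predicted 0
false_negatives = int(np.sum((y_true == 1) & (y_pred == 0)))
# Labelled 0 but predicted 1
false_positives = int(np.sum((y_true == 0) & (y_pred == 1)))

# Errors where the probability was within 0.05 of the decision boundary
borderline = int(np.sum((y_pred != y_true) & (np.abs(y_prob - 0.5) < 0.05)))

print(false_negatives, false_positives, borderline)  # 2 2 1
```

Many errors with probabilities pinned at 0 or 1 suggest a different problem than many borderline errors near 0.5, so the counts can help decide where to look first.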

In [103]:
# Some error analysis done for you
print('Label Predicted Tweet')
for x,y in zip(test_x,test_y):
    y_hat = predict_tweet(x, freqs, theta)

    if np.abs(y - (y_hat > 0.5)) > 0:
        print('THE TWEET IS:', x)
        print('THE PROCESSED TWEET IS:', process_tweet(x))
        print('%d\t%0.8f\t%s' % (y, y_hat, ' '.join(process_tweet(x)).encode('ascii', 'ignore')))



Label Predicted Tweet
THE TWEET IS: get pixels and colours from nsimage
THE PROCESSED TWEET IS: ['get', 'pixel', 'colour', 'nsimag']
1	0.00000000	b'get pixel colour nsimag'
THE TWEET IS: how can i list all deployed jax ws webservices
THE PROCESSED TWEET IS: ['list', 'deploy', 'jax', 'ws', 'webservic']
1	0.00000000	b'list deploy jax ws webservic'
THE TWEET IS: google app engine jsonpickle
THE PROCESSED TWEET IS: ['googl', 'app', 'engin', 'jsonpickl']
1	0.00000000	b'googl app engin jsonpickl'
THE TWEET IS: flash lite   javascript
THE PROCESSED TWEET IS: ['flash', 'lite', 'javascript']
1	0.00000000	b'flash lite javascript'
THE TWEET IS: from html to servlet
THE PROCESSED TWEET IS: ['html', 'servlet']
1	0.00000000	b'html servlet'
THE TWEET IS: htaccess redirect all files in one folder to exact same in another folder
THE PROCESSED TWEET IS: ['htaccess', 'redirect', 'file', 'one', 'folder', 'exact', 'anoth', 'folder']
1	0.00000000	b'htaccess redirect file one folder exact anoth folder'
THE T

THE TWEET IS: perform a sql join on django models that are not related
THE PROCESSED TWEET IS: ['perform', 'sql', 'join', 'django', 'model', 'relat']
1	0.00000000	b'perform sql join django model relat'
THE TWEET IS: propertiessettingssave only saves on first call
THE PROCESSED TWEET IS: ['propertiessettingssav', 'save', 'first', 'call']
1	0.00000000	b'propertiessettingssav save first call'
THE TWEET IS: how to force hibernate  or  to use cglib instead of javassist
THE PROCESSED TWEET IS: ['forc', 'hibern', 'use', 'cglib', 'instead', 'javassist']
1	0.00000000	b'forc hibern use cglib instead javassist'
THE TWEET IS: win application arent so object oriented and why there are so many pointers
THE PROCESSED TWEET IS: ['win', 'applic', 'arent', 'object', 'orient', 'mani', 'pointer']
1	0.00000000	b'win applic arent object orient mani pointer'
THE TWEET IS: spgridview current item in sharepoint
THE PROCESSED TWEET IS: ['spgridview', 'current', 'item', 'sharepoint']
1	0.00000000	b'spgridview cu

1	0.00000000	b'iphon sdk load titanium develop'
THE TWEET IS: rendering html in java
THE PROCESSED TWEET IS: ['render', 'html', 'java']
1	0.00000000	b'render html java'
THE TWEET IS: releasing winform program updates
THE PROCESSED TWEET IS: ['releas', 'winform', 'program', 'updat']
1	0.00000000	b'releas winform program updat'
THE TWEET IS: how is its lifetime of a return value extended to the scope of the calling function when it is bound to a const reference in the calling function
THE PROCESSED TWEET IS: ['lifetim', 'return', 'valu', 'extend', 'scope', 'call', 'function', 'bound', 'const', 'refer', 'call', 'function']
1	0.00000000	b'lifetim return valu extend scope call function bound const refer call function'
THE TWEET IS: dbdsqlitest execute failed datatype mismatch
THE PROCESSED TWEET IS: ['dbdsqlitest', 'execut', 'fail', 'datatyp', 'mismatch']
1	0.00000000	b'dbdsqlitest execut fail datatyp mismatch'
THE TWEET IS: how to distinguish between two different udp clients on the same i

THE TWEET IS: c multi monitor   find all visibleopen windows
THE PROCESSED TWEET IS: ['c', 'multi', 'monitor', 'find', 'visibleopen', 'window']
1	0.00000000	b'c multi monitor find visibleopen window'
THE TWEET IS: failed to obtain jdbc driver for mysql under tomcat environment
THE PROCESSED TWEET IS: ['fail', 'obtain', 'jdbc', 'driver', 'mysql', 'tomcat', 'environ']
1	0.00000000	b'fail obtain jdbc driver mysql tomcat environ'
THE TWEET IS: using a join model to relate a model to itself
THE PROCESSED TWEET IS: ['use', 'join', 'model', 'relat', 'model']
1	0.00000000	b'use join model relat model'
THE TWEET IS: is there any use for bash scripting anymore
THE PROCESSED TWEET IS: ['use', 'bash', 'script', 'anymor']
1	0.00000000	b'use bash script anymor'
THE TWEET IS: printf vs cout in c
THE PROCESSED TWEET IS: ['printf', 'vs', 'cout', 'c']
1	0.00000000	b'printf vs cout c'
THE TWEET IS: profiling java running by jni calls
THE PROCESSED TWEET IS: ['profil', 'java', 'run', 'jni', 'call']
1	0.00

THE PROCESSED TWEET IS: ['javascript', 'get', 'current', 'execut', 'script', 'node']
1	0.00000000	b'javascript get current execut script node'
THE TWEET IS: concatenating text in an aspnet localized label
THE PROCESSED TWEET IS: ['concaten', 'text', 'aspnet', 'local', 'label']
1	0.00000000	b'concaten text aspnet local label'
THE TWEET IS: javascript object properties access functions in parent constructor
THE PROCESSED TWEET IS: ['javascript', 'object', 'properti', 'access', 'function', 'parent', 'constructor']
1	0.00000000	b'javascript object properti access function parent constructor'
THE TWEET IS: should i always wrap an inputstream as bufferedinputstream
THE PROCESSED TWEET IS: ['alway', 'wrap', 'inputstream', 'bufferedinputstream']
1	0.00000000	b'alway wrap inputstream bufferedinputstream'
THE TWEET IS: aio network sockets and zero copy under linux
THE PROCESSED TWEET IS: ['aio', 'network', 'socket', 'zero', 'copi', 'linux']
1	0.00000000	b'aio network socket zero copi linux'
THE 

THE PROCESSED TWEET IS: ['would', 'format', 'datetim', 'intern', 'format']
1	0.00000000	b'would format datetim intern format'
THE TWEET IS: jquery validation how do i create a custom error message element not just text
THE PROCESSED TWEET IS: ['jqueri', 'valid', 'creat', 'custom', 'error', 'messag', 'element', 'text']
1	0.00000000	b'jqueri valid creat custom error messag element text'
THE TWEET IS: drag n drop from uipopovercontroller to other uiview
THE PROCESSED TWEET IS: ['drag', 'n', 'drop', 'uipopovercontrol', 'uiview']
1	0.00000000	b'drag n drop uipopovercontrol uiview'
THE TWEET IS: how to work with viewmodellocator in mvvm light
THE PROCESSED TWEET IS: ['work', 'viewmodelloc', 'mvvm', 'light']
1	0.00000000	b'work viewmodelloc mvvm light'
THE TWEET IS: how to choose cache write policy on ppc
THE PROCESSED TWEET IS: ['choos', 'cach', 'write', 'polici', 'ppc']
1	0.00000000	b'choos cach write polici ppc'
THE TWEET IS: what would be the best implementation to detect repeating sip me

THE PROCESSED TWEET IS: ['share', 'global', 'refer', 'among', 'jrubi', 'thread', 'insid', 'rack', 'applic']
1	0.00000000	b'share global refer among jrubi thread insid rack applic'
THE TWEET IS: include file mess
THE PROCESSED TWEET IS: ['includ', 'file', 'mess']
1	0.00000000	b'includ file mess'
THE TWEET IS: is it possible to pin a dll in memory to prevent unloading
THE PROCESSED TWEET IS: ['possibl', 'pin', 'dll', 'memori', 'prevent', 'unload']
1	0.00000000	b'possibl pin dll memori prevent unload'
THE TWEET IS: what belongs in an educational tool to demonstrate the unwarranted assumptions people make in cc
THE PROCESSED TWEET IS: ['belong', 'educ', 'tool', 'demonstr', 'unwarr', 'assumpt', 'peopl', 'make', 'cc']
1	0.00000000	b'belong educ tool demonstr unwarr assumpt peopl make cc'
THE TWEET IS: changing the volume without a volume slider on an iphone
THE PROCESSED TWEET IS: ['chang', 'volum', 'without', 'volum', 'slider', 'iphon']
1	0.00000000	b'chang volum without volum slider iphon'

1	0.00000000	b'iphon retina appl touch startup imag web app'
THE TWEET IS: insert new item in array on any position in php
THE PROCESSED TWEET IS: ['insert', 'new', 'item', 'array', 'posit', 'php']
1	0.00000000	b'insert new item array posit php'
THE TWEET IS: question about  operator in java
THE PROCESSED TWEET IS: ['question', 'oper', 'java']
1	0.00000000	b'question oper java'
THE TWEET IS: cannot add constraint to datatable which is a child table in two nested relations
THE PROCESSED TWEET IS: ['cannot', 'add', 'constraint', 'datat', 'child', 'tabl', 'two', 'nest', 'relat']
1	0.00000000	b'cannot add constraint datat child tabl two nest relat'
THE TWEET IS: sql server  check constraints that guarantees that only one value in all rows is set to  and others are 
THE PROCESSED TWEET IS: ['sql', 'server', 'check', 'constraint', 'guarante', 'one', 'valu', 'row', 'set', 'other']
1	0.00000000	b'sql server check constraint guarante one valu row set other'
THE TWEET IS: detect malicious url de

1	0.00000000	b'poker hand string display'
THE TWEET IS: default value for a reference to a pointer
THE PROCESSED TWEET IS: ['default', 'valu', 'refer', 'pointer']
1	0.00000000	b'default valu refer pointer'
THE TWEET IS: can i access qt signalsslots of objects out of scope
THE PROCESSED TWEET IS: ['access', 'qt', 'signalsslot', 'object', 'scope']
1	0.00000000	b'access qt signalsslot object scope'
THE TWEET IS: rails appropriate use of routing and resources
THE PROCESSED TWEET IS: ['rail', 'appropri', 'use', 'rout', 'resourc']
1	0.00000000	b'rail appropri use rout resourc'
THE TWEET IS: use firebug to get a deep copy of css in elements
THE PROCESSED TWEET IS: ['use', 'firebug', 'get', 'deep', 'copi', 'css', 'element']
1	0.00000000	b'use firebug get deep copi css element'
THE TWEET IS: what are the proscons of this class definition style
THE PROCESSED TWEET IS: ['proscon', 'class', 'definit', 'style']
1	0.00000000	b'proscon class definit style'
THE TWEET IS: django display nested categori

1	0.00000000	b'c string contain quot'
THE TWEET IS: grouping switch statement cases together
THE PROCESSED TWEET IS: ['group', 'switch', 'statement', 'case', 'togeth']
1	0.00000000	b'group switch statement case togeth'
THE TWEET IS: is flying saucer project closed
THE PROCESSED TWEET IS: ['fli', 'saucer', 'project', 'close']
1	0.00000000	b'fli saucer project close'
THE TWEET IS: how to detect the same user in different iphones
THE PROCESSED TWEET IS: ['detect', 'user', 'differ', 'iphon']
1	0.00000000	b'detect user differ iphon'
THE TWEET IS: datatypecurrency attribute fails on  sign
THE PROCESSED TWEET IS: ['datatypecurr', 'attribut', 'fail', 'sign']
1	0.00000000	b'datatypecurr attribut fail sign'
THE TWEET IS: why does rangestart end not include end
THE PROCESSED TWEET IS: ['rangestart', 'end', 'includ', 'end']
1	0.00000000	b'rangestart end includ end'
THE TWEET IS: can rails sweepers work across different controllers
THE PROCESSED TWEET IS: ['rail', 'sweeper', 'work', 'across', 'diff

1	0.00000000	b'detect mous click div javascript without side effect'
THE TWEET IS: practical rules for django middleware ordering
THE PROCESSED TWEET IS: ['practic', 'rule', 'django', 'middlewar', 'order']
1	0.00000000	b'practic rule django middlewar order'
THE TWEET IS: waveinproc  windows audio question
THE PROCESSED TWEET IS: ['waveinproc', 'window', 'audio', 'question']
1	0.00000000	b'waveinproc window audio question'
THE TWEET IS: use a variable inside a cdata block in javascript
THE PROCESSED TWEET IS: ['use', 'variabl', 'insid', 'cdata', 'block', 'javascript']
1	0.00000000	b'use variabl insid cdata block javascript'
THE TWEET IS: in internationalization in symfony
THE PROCESSED TWEET IS: ['internation', 'symfoni']
1	0.00000000	b'internation symfoni'
THE TWEET IS: how to use a net assembly in delphi without registering it in the gac or com
THE PROCESSED TWEET IS: ['use', 'net', 'assembl', 'delphi', 'without', 'regist', 'gac', 'com']
1	0.00000000	b'use net assembl delphi without r

1	0.00000000	b'run selenium use capybara lower speed'
THE TWEET IS: move cvs repository without rechecking out in eclipse
THE PROCESSED TWEET IS: ['move', 'cv', 'repositori', 'without', 'recheck', 'eclips']
1	0.00000000	b'move cv repositori without recheck eclips'
THE TWEET IS: java web installer   howd they do that
THE PROCESSED TWEET IS: ['java', 'web', 'instal', 'howd']
1	0.00000000	b'java web instal howd'
THE TWEET IS: aspnet mvc  using a viewmodel destroys my model binding
THE PROCESSED TWEET IS: ['aspnet', 'mvc', 'use', 'viewmodel', 'destroy', 'model', 'bind']
1	0.00000000	b'aspnet mvc use viewmodel destroy model bind'
THE TWEET IS: pull post from a specific wordpress category
THE PROCESSED TWEET IS: ['pull', 'post', 'specif', 'wordpress', 'categori']
1	0.00000000	b'pull post specif wordpress categori'
THE TWEET IS: zend translate join multiple file languages
THE PROCESSED TWEET IS: ['zend', 'translat', 'join', 'multipl', 'file', 'languag']
1	0.00000000	b'zend translat join multi

THE PROCESSED TWEET IS: ['get', 'hard', 'disk', 'use', 'space', 'free', 'space', 'partit', 'free', 'space']
1	0.00000000	b'get hard disk use space free space partit free space'
THE TWEET IS: how to split strings in objective c
THE PROCESSED TWEET IS: ['split', 'string', 'object', 'c']
1	0.00000000	b'split string object c'
THE TWEET IS: linq groupby convert a one to one list to a one to many
THE PROCESSED TWEET IS: ['linq', 'groupbi', 'convert', 'one', 'one', 'list', 'one', 'mani']
1	0.00000000	b'linq groupbi convert one one list one mani'
THE TWEET IS: how can i create a tfs  team project using the sdk
THE PROCESSED TWEET IS: ['creat', 'tf', 'team', 'project', 'use', 'sdk']
1	0.00000000	b'creat tf team project use sdk'
THE TWEET IS: css positionabsolute  dynamic height
THE PROCESSED TWEET IS: ['css', 'positionabsolut', 'dynam', 'height']
1	0.00000000	b'css positionabsolut dynam height'
THE TWEET IS: parsing html string
THE PROCESSED TWEET IS: ['pars', 'html', 'string']
1	0.00000000	b'p

THE PROCESSED TWEET IS: ['mysql', 'replic', 'question']
1	0.00000000	b'mysql replic question'
THE TWEET IS: oauth in google app engine
THE PROCESSED TWEET IS: ['oauth', 'googl', 'app', 'engin']
1	0.00000000	b'oauth googl app engin'
THE TWEET IS: c ifiltersample can anyone get it to work
THE PROCESSED TWEET IS: ['c', 'ifiltersampl', 'anyon', 'get', 'work']
1	0.00000000	b'c ifiltersampl anyon get work'
THE TWEET IS: setting up index as the default route for a controller
THE PROCESSED TWEET IS: ['set', 'index', 'default', 'rout', 'control']
1	0.00000000	b'set index default rout control'
THE TWEET IS: implementing logical right shift in c
THE PROCESSED TWEET IS: ['implement', 'logic', 'right', 'shift', 'c']
1	0.00000000	b'implement logic right shift c'
THE TWEET IS: using formation config file with pyrocms module
THE PROCESSED TWEET IS: ['use', 'format', 'config', 'file', 'pyrocm', 'modul']
1	0.00000000	b'use format config file pyrocm modul'
THE TWEET IS: locking output file for shell scri

THE PROCESSED TWEET IS: ['processor', 'endian', 'awar', 'requir']
0	1.00000000	b'processor endian awar requir'
THE TWEET IS: advantage and disadvantage of smarty template
THE PROCESSED TWEET IS: ['advantag', 'disadvantag', 'smarti', 'templat']
0	0.51120982	b'advantag disadvantag smarti templat'
THE TWEET IS: basics of strings
THE PROCESSED TWEET IS: ['basic', 'string']
0	0.51120982	b'basic string'
THE TWEET IS: a good repartition algorithm
THE PROCESSED TWEET IS: ['good', 'repartit', 'algorithm']
0	1.00000000	b'good repartit algorithm'
THE TWEET IS: jstl vs old school jsp el conflict
THE PROCESSED TWEET IS: ['jstl', 'vs', 'old', 'school', 'jsp', 'el', 'conflict']
0	1.00000000	b'jstl vs old school jsp el conflict'
THE TWEET IS: axis using threadsleep to do blocking
THE PROCESSED TWEET IS: ['axi', 'use', 'threadsleep', 'block']
0	0.99903607	b'axi use threadsleep block'
THE TWEET IS: database modification or start over
THE PROCESSED TWEET IS: ['databas', 'modif', 'start']
0	1.00000000	b'd

Later in this specialization, we will see how we can use deep learning to improve the prediction performance.

# Part 6: Predict with your own tweet

In [104]:
# Feel free to change the tweet below
my_tweet = 'please, tell me more. I want to know more'
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else: 
    print('Negative sentiment')

['pleas', 'tell', 'want', 'know']
[[0.]]
Negative sentiment
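
A prediction printed as exactly `0.` rather than a small positive number can arise when the sigmoid saturates in floating point. Whether that is what happens in this model is an assumption; the sketch below only shows the effect, using a standard sigmoid and made-up inputs.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up logits: moderate, strongly negative, and extremely negative
z = np.array([[-2.0], [-30.0], [-800.0]])

# exp(800) overflows to inf in float64, so silence the overflow warning here
with np.errstate(over="ignore"):
    p = sigmoid(z)

# The first two values are small but positive; the last is exactly 0.0
print(p)
```

If the logits in a model reach this range, one thing to check is the scale of the input features, since very large unnormalized feature values can push `z` far from zero.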


