[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/y-akbal/ADA440_Python_4_DS/blob/main/ALE/tweet_classifier.ipynb)


# Tweet Sentiment Classifier with Logistic Regression
This week you are on your own: you will learn about logistic regression. Given a tweet, you will decide whether its sentiment is positive or negative. Specifically, you will:

* Learn how to extract features for logistic regression from text using the bag-of-words model,
* Implement logistic regression from scratch.

We will be using a dataset of tweets given in CSV form.

Things you will be doing:

* Implement logistic regression from scratch,
* Read the CSV file and extract features,
* Train your model on your dataset,
* Do some funny stuff.

Before you get started, please see
https://en.wikipedia.org/wiki/Logistic_regression

In [73]:
##Import some libraries!!!
import numpy as np

# Part 1: Logistic regression


### Part 1.1: Sigmoid
You will learn to use logistic regression for text classification.
* The sigmoid function is defined as:

$$ \sigma(z) = \frac{1}{1+e^{-z}} \tag{1}$$

It maps the input 'z' to a value that ranges between 0 and 1, and so it can be treated as a probability.


In [74]:
##Implement sigmoid function now!!!
def sigmoid(x:np.ndarray)->np.ndarray:
    return 1/(1+np.exp(-x))
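
For very negative inputs, `np.exp(-x)` overflows and NumPy emits a runtime warning, even though the result still rounds to 0. Below is a minimal sketch of a variant that avoids the overflow by only ever exponentiating non-positive numbers. The name `stable_sigmoid` is hypothetical and not part of the assignment, and the function always returns an array of at least one dimension.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never calls exp() on a large positive number."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so the usual formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite 1/(1+exp(-z)) as exp(z)/(1+exp(z)); exp(z) is below 1.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

print(stable_sigmoid([-1000.0, 0.0, 1000.0]))
```

Both formulas are algebraically the same function, so this is a drop-in alternative wherever array output is acceptable.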

In [75]:
# Testing your function (np.isclose avoids brittle exact float comparisons)
if np.isclose(sigmoid(0), 0.5):
    print('Great Successs!!!')
else:
    print('Oops!')

if np.isclose(sigmoid(4.92), 0.9927537604041685):
    print('Good!!!')
else:
    print('Oops you did it again!')

Great Successs!!!
Good!!!


### Part 1.2: Logistic Regression Class
See the previous week (linear regression) for a quick refresher. The setup is the same.
### Logistic regression: regression and a sigmoid

Logistic regression takes a regular linear regression and applies a sigmoid to its output.

Regression:
$$z = \theta_0  + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$
Note that the $\theta$ values are the "weights".

Logistic regression:
$$ \sigma(z) = \frac{1}{1+e^{-z}}$$
$$z = \theta_0  + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$
We will refer to 'z' as the 'logits'.
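
As a small illustration of the two formulas, the sketch below computes the logits and the sigmoid outputs for two made-up examples with two features each. The values of `theta` and `X` are arbitrary assumptions chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

theta = np.array([0.5, 1.0, -2.0])   # [theta_0, theta_1, theta_2], made-up values
X = np.array([[1.0, 2.0],            # first example:  x_1 = 1, x_2 = 2
              [0.0, 0.0]])           # second example: x_1 = 0, x_2 = 0

# Prepend a column of ones so that theta_0 acts as the bias term.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

z = X_aug @ theta                    # logits: 0.5 + 1*1 - 2*2 = -2.5 and 0.5
probs = 1.0 / (1.0 + np.exp(-z))     # sigmoid turns logits into probabilities

print(z)
print(probs)
```

A negative logit gives a probability below 0.5 and a positive logit gives one above 0.5, which is what the 0.5 threshold in a classifier relies on.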

### Part 1.3: Cost function and Gradient

The cost function used for logistic regression is the average of the log loss across all training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (\sigma(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-\sigma(z(\theta)^{(i)})) \right] \tag{5} $$
* $m$ is the number of training examples
* $y^{(i)}$ is the actual label of the i-th training example.
* $\sigma(z(\theta)^{(i)})$ is the model's prediction for the i-th training example.

The loss function for a single training example is
$$ Loss = -1 \times \left( y^{(i)}\log (\sigma (z(\theta)^{(i)})) + (1-y^{(i)})\log (1-\sigma(z(\theta)^{(i)})) \right)$$

* All the $\sigma$ values are between 0 and 1, so the logs will be negative. That is the reason for the factor of -1 applied to the sum of the two loss terms.
* Note that when the model predicts 1 ($\sigma(z(\theta)) = 1$) and the label $y$ is also 1, the loss for that training example is 0.
* Similarly, when the model predicts 0 ($\sigma(z(\theta)) = 0$) and the actual label is also 0, the loss for that training example is 0.
* However, when the model prediction is close to 1 ($\sigma(z(\theta)) = 0.9999$) and the label is 0, the second term of the log loss becomes a large negative number, which is then multiplied by the overall factor of -1 to convert it to a positive loss value. $-1 \times (1 - 0) \times log(1 - 0.9999) \approx 9.2$ The closer the model prediction gets to 1, the larger the loss.
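
The cases in the bullets above can be checked numerically. This is a minimal sketch; the helper name `example_loss` is hypothetical and the probabilities are illustrative values only.

```python
import numpy as np

def example_loss(y, p):
    """Log loss of one example with label y and predicted probability p."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(example_loss(1, 0.9999))  # confident and right: loss close to 0
print(example_loss(0, 0.9999))  # confident and wrong: roughly 9.2
print(example_loss(0, 0.5))     # undecided model: log(2), roughly 0.693
```

The last case also explains the training log later in this notebook: with near-zero initial weights every prediction is close to 0.5, so the summed loss over 1000 examples starts at roughly 1000 × log 2 ≈ 693.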

#### Update the weights

To update your weight vector $\theta$, you will apply gradient descent to iteratively improve your model's predictions.  
The gradient of the cost function $J$ with respect to one of the weights $\theta_j$ is:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m(\sigma(z(\theta)^{(i)})-y^{(i)})x_j^{(i)} \tag{6}$$
* 'i' is the index across all 'm' training examples.
* 'j' is the index of the weight $\theta_j$, so $x_j^{(i)}$ is the feature of the i-th example associated with weight $\theta_j$ (with $x_0^{(i)} = 1$ for the bias $\theta_0$).

* To update the weights, we subtract a fraction of the gradient determined by $\alpha$, where $\nabla_{\theta}J(\theta)$ collects the partial derivatives for all $j$:
$$\theta = \theta - \alpha \times \nabla_{\theta}J(\theta) $$
* The learning rate $\alpha$ is a value that we choose to control how big a single update will be.
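
A common way to gain confidence in a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below does this on random data; all names (`cost`, `analytic_grad`, the data sizes, `eps`) are assumptions made for this illustration and are not part of the assignment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X_aug, y):
    """Average log loss for weights theta."""
    p = sigmoid(X_aug @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(theta, X_aug, y):
    """Gradient formula from the text, written for all j at once."""
    return X_aug.T @ (sigmoid(X_aug @ theta) - y) / len(y)

rng = np.random.default_rng(0)
X_aug = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])  # bias column + 3 features
y = rng.integers(0, 2, size=20)
theta = rng.normal(scale=0.1, size=4)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.size):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (cost(theta + step, X_aug, y) - cost(theta - step, X_aug, y)) / (2 * eps)

max_diff = np.max(np.abs(numeric - analytic_grad(theta, X_aug, y)))
print(max_diff)  # should be tiny if the formula is right
```

If the two gradients disagree noticeably, the usual suspects are a missing bias column or a missing division by the number of examples.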

In [76]:
## Where you see None, this means you have gotta do something there -- except __init__ function!!!!!
class LR:
    def __init__(self):
        self._fitted_ = False
        self.weight:np.ndarray = None

    def fit(self, X, y, lr = 1e-10, max_iter = 1000):
        n, m = X.shape  # needed on every call, also when fit is called again

        if not self._fitted_:
            self.weight = 0.01*np.random.randn(m+1)
            self._fitted_ = True

        X_aug = np.concatenate([np.ones((n, 1)), X], axis=1) # Prepend a column of ones for the bias (as in the get_logits function, which is given)

        for j in range(max_iter):
          z = X_aug @ self.weight # Logits (z) via matrix multiplication, faster than nested for-loops
          probs = sigmoid(z) # Predicted probabilities
          loss = -np.mean(y * np.log(probs) + (1 - y) * np.log(1 - probs))  ## average log loss

          ## every 100 steps print the loss (it should be decreasing over time)
          if j % 100 == 0:
            print(loss * n) # Multiply by n to print the summed loss instead of the mean

          grads = X_aug.T @ (probs - y) / n # Gradient of the average log loss

          self.weight -= lr*grads

    def get_logits(self, X):
        assert self._fitted_, "The model has not been fitted yet!"
        if len(X.shape) == 1:
            X = np.expand_dims(X, axis = 0)
        return np.concatenate([np.ones((X.shape[0], 1)), X], axis = 1) @ self.weight

    def predict(self, X):
        ## Get predictions here --- output array should be containing labels (only 0s and 1s)
        ## hint: np.where

        logits = self.get_logits(X)
        probs = sigmoid(logits)
        return np.where(probs >= 0.5, 1, 0)

In [77]:
##Let's give it a try --- probably you will spend some time here !!!!
np.random.seed(0)
X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, 1000)

lr = LR()
lr.fit(X, y, lr = 1e-4, max_iter = 50000)
""" You should see something like as follows
693.1462797411714
693.1235240431618
693.1008833028151
693.078356935228
693.0559443584558
693.0336449934978
693.0114582642825
692.9893835976551
692.967420423361
692.9455681740333
692.9238262851777
692.9021941951596
692.8806713451888
692.8592571793067
692.8379511443716
692.8167526900455
692.7956612687802
692.7746763358036
692.7537973491062
692.7330237694271
692.7123550602412
692.6917906877453
692.6713301208449
692.650972831141
692.6307182929162
...
688.9843435354524
688.9824272801267
688.9805204006907
688.9786228506564
"""
sum(lr.predict(np.random.randn(100,10)))/100  ## <-- is this number close to 0.5 ?
float(lr.predict(np.random.randn(10))) ## <- Does it throw an error ?

693.1462797411714
693.1235240431618
693.100883302815
693.0783569352282
693.0559443584558
693.0336449934975
693.0114582642825
692.9893835976551
692.967420423361
692.9455681740333
692.9238262851777
692.9021941951596
692.880671345189
692.8592571793067
692.8379511443716
692.8167526900455
692.7956612687802
692.7746763358036
692.7537973491062
692.7330237694271
692.7123550602412
692.6917906877453
692.6713301208449
692.650972831141
692.6307182929162
692.6105659831226
692.5905153813676
692.5705659699015
692.5507172336042
692.5309686599726
692.5113197391067
692.4917699636981
692.4723188290161
692.4529658328956
692.4337104757244
692.4145522604301
692.3954906924682
692.3765252798089
692.357655532925
692.3388809647797
692.3202010908141
692.3016154289345
692.283123499501
692.2647248253146
692.2464189316053
692.2282053460201
692.2100835986113
692.1920532218239
692.1741137504839
692.156264721787
692.138505675286
692.1208361528796
692.1032556988007
692.0857638596044
692.0683601841569
692.0510442236233


1.0

### Part 2: --- AncientGPT ---
### You are now ready to code a very old but still efficient tweet classifier! Import the functions and data. We will be using the nltk package (run !pip install nltk if you receive an error)

In [78]:
# run this cell to import nltk and some other files!!!

from os import getcwd
import re
import string

import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
import sklearn

nltk.download('stopwords')

filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Read the tweets using pd.read_csv, and browse them.
### 1) Read the csv file,
### 2) Grab the columns X and y,
### 3) Using sklearn.model_selection.train_test_split, split the dataset into two parts,
### 3.5) For compatibility, the names should be train_x, train_y, test_x, test_y,
### 4) Convert everything to numpy arrays.

In [79]:
## Start by reading the csv file -- either read it directly from GitHub or download it locally!!!
import pandas as pd
from sklearn.model_selection import train_test_split

url = 'tweets.csv' # Downloaded locally from https://github.com/y-akbal/ADA440_Python_4_DS/blob/main/ALE/tweets.csv

data = pd.read_csv(url)

# Grabbing the columns
X = data['Tweets']
y = data['Sentiment']

## Split into train and test sets with sklearn.model_selection.train_test_split
## (test_size and random_state follow the example at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.33, random_state=42)

# Converting to numpy arrays
train_x = np.array(train_x)
test_x = np.array(test_x)
train_y = np.array(train_y)
test_y = np.array(test_y)

### Below you shall create some helper functions process_tweet and build_freqs

In [80]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
import re
import string

def process_tweet(tweet:str)->list[str]:
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    """

    stemmer = PorterStemmer() ## Initialize stemmer
    stopwords_english = stopwords.words('english') ## Grab the stop words

    # Steps -- beginning of the regex stuff
    # Remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # Remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # Remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # Remove hashtags (Only removing the hash # sign from the word)
    tweet = re.sub(r'#', '', tweet)
    # end of regex stuff

    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    ## Below you will get tokenized tweet!!!
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean

In [81]:
process_tweet("Yep yep Hello dude, how you doin?") ## should be == ['yep', 'yep', 'hello', 'dude', 'doin']

['yep', 'yep', 'hello', 'dude', 'doin']

### Process tweet
The given function process_tweet() tokenizes the tweet into individual words, removes stop words and applies stemming.
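
Before tokenizing, `process_tweet()` applies four regular-expression substitutions. The sketch below isolates just those substitutions so their effect is visible without nltk. One detail worth noticing is that the hyperlink pattern removes the URL and everything after it on the same line. The function `clean_url_only` is a hypothetical variant (an assumption, not the notebook's method) that removes only the URL token, and the sample tweet is made up.

```python
import re

def clean_like_source(tweet):
    """Apply the same four substitutions as process_tweet, in the same order."""
    tweet = re.sub(r'\$\w*', '', tweet)                 # stock tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)              # leading "RT "
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # URL and the rest of the line
    tweet = re.sub(r'#', '', tweet)                     # only the hash sign
    return tweet

def clean_url_only(tweet):
    """Hypothetical variant: remove just the URL token, keep the words after it."""
    tweet = re.sub(r'\$\w*', '', tweet)
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?://\S+', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    return tweet

sample = "RT @user: Loving $AAPL today! #happy https://t.co/abc so cool"
print(repr(clean_like_source(sample)))  # "so cool" is dropped with the URL
print(repr(clean_url_only(sample)))     # "so cool" is kept
```

The handle `@user` survives here because handles are removed later by `TweetTokenizer(strip_handles=True)`, not by the regular expressions.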

In [82]:
# test the function below
print('This is an example of a positive tweet: \n', train_x[55])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[55]))

This is an example of a positive tweet: 
 @dischanmedia Its sad to hear about this, thank you so much for the overwhelmingly beautiful games, thank you for your hard work. :)

This is an example of the processed version of the tweet: 
 ['sad', 'hear', 'thank', 'much', 'overwhelmingli', 'beauti', 'game', 'thank', 'hard', 'work', ':)']


#### The expected output should look more or less as follows!
```
This is an example of a positive tweet:
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processes version:
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
```

In [83]:
def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    #
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1

    return freqs
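
The same counting idea can be written more compactly with `collections.Counter`. This is a sketch under stated assumptions: `toy_tokenize` is a hypothetical stand-in for `process_tweet` (lowercasing and splitting on whitespace only), and the tweets and labels are made up.

```python
from collections import Counter

def toy_tokenize(text):
    """Stand-in for process_tweet: lowercase and split on whitespace."""
    return text.lower().split()

def build_freqs_counter(tweets, labels, tokenize=toy_tokenize):
    """Count how often each (word, label) pair occurs."""
    freqs = Counter()
    for label, tweet in zip(labels, tweets):
        freqs.update((word, label) for word in tokenize(tweet))
    return freqs

tweets = ["happy happy day", "sad day", "happy again"]
labels = [1, 0, 1]
freqs = build_freqs_counter(tweets, labels)

print(freqs[("happy", 1)])  # counted across the first and third tweet
print(freqs[("day", 0)])
print(freqs[("sad", 1)])    # a Counter returns 0 for pairs it never saw
```

A `Counter` is a `dict` subclass, so lookups such as `freqs.get((word, 1.0), 0)` work on it unchanged.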

In [84]:
# create frequency dictionary
freqs = build_freqs(train_x, train_y)
# check the output
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))

## What do you see here???
## The type of freqs is 'dict' and it contains 10014 keys

## See what freqs contains????
print(freqs)
## The freqs dictionary contains (word, sentiment) pairs as keys and their frequencies as values
## For instance, the word 'snapchat' is used with sentiment 0 51 times and with sentiment 1 25 times

type(freqs) = <class 'dict'>
len(freqs) = 10014
{('snapchat', 0.0): 51, ('tammirossm', 0.0): 2, ('kikgirl', 0.0): 9, ('kikchat', 0.0): 5, ('wet', 0.0): 6, ('wife', 0.0): 2, ('indiemus', 0.0): 7, ('sexi', 0.0): 7, (':(', 0.0): 3101, ('mom', 0.0): 9, ('far', 0.0): 12, ('away', 0.0): 16, ("i'm", 0.0): 233, ('hungri', 0.0): 11, ('ha', 1.0): 14, ('talk', 1.0): 30, ('cours', 1.0): 11, ('...', 1.0): 187, ('one', 1.0): 85, ('virtual', 1.0): 2, ('varieti', 1.0): 1, (':)', 1.0): 2375, ('omg', 0.0): 37, ('happen', 0.0): 35, ('love', 0.0): 107, ('tiddler', 0.0): 1, ('realli', 0.0): 86, ('silli', 0.0): 1, ('least', 0.0): 12, ('’', 0.0): 21, ('get', 0.0): 149, ('real', 0.0): 14, ('new', 0.0): 40, ('card', 0.0): 5, ('like', 0.0): 157, ('1', 0.0): 17, ('hour', 0.0): 25, ('..', 0.0): 69, ('miss', 0.0): 201, ('itb', 0.0): 1, ('omigod', 0.0): 1, ('hahaha', 1.0): 9, ('agre', 1.0): 12, ('sir', 1.0): 9, (':d', 1.0): 395, ('thank', 1.0): 386, ('guy', 1.0): 33, ('fun', 1.0): 40, ('anoth', 1.0): 16, ('1', 1.0)

## Part 3: Extracting the features

* Given a list of tweets, extract the features and store them in a matrix. You will extract two features.
    * The first feature is the sum of the positive-label counts (from freqs) of the words in a tweet.
    * The second feature is the sum of the negative-label counts (from freqs) of the words in a tweet.
* Then train your logistic regression classifier on these features.
* Test the classifier on a validation set.

### Instructions: Implement the extract_features function.
* This function takes in a single tweet.
* Process the tweet using the process_tweet() function defined above and save the list of tweet words.
* Loop through each word in the list of processed words
    * For each word, check the freqs dictionary for the count when that word has a positive '1' label. (Check for the key (word, 1.0).)
    * Do the same for the count for when the word is associated with the negative label '0'. (Check for the key (word, 0.0).)
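
The lookup-and-sum steps above can be sketched on a toy dictionary. The words and counts in `toy_freqs` are made up for illustration, and the word list is assumed to be already processed.

```python
import numpy as np

# Hypothetical frequency dictionary with (word, label) keys.
toy_freqs = {("happi", 1.0): 3, ("happi", 0.0): 1, ("sad", 0.0): 4}
words = ["happi", "sad", "unseen"]   # assumed output of the tweet processing step

x = np.zeros((1, 2))
for word in words:
    x[0, 0] += toy_freqs.get((word, 1.0), 0)  # positive counts: 3 + 0 + 0
    x[0, 1] += toy_freqs.get((word, 0.0), 0)  # negative counts: 1 + 4 + 0

print(x)
```

Using `dict.get` with a default of 0 means words that never appeared in training simply contribute nothing to either feature.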


In [85]:
def extract_features(tweet, freqs):
    '''
    Input:
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1,2)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)

    # 2 elements in the form of a 1 x 2 vector
    x = np.zeros((1, 2))

    # loop through each word in the list of words
    for word in word_l:

        # add the word's count for the positive label 1 (0 if the pair is not in freqs)
        x[0,0] += freqs.get((word, 1.0), 0)
        # add the word's count for the negative label 0 (0 if the pair is not in freqs)
        x[0,1] += freqs.get((word, 0.0), 0)

    assert(x.shape == (1, 2))
    return x

# Extract Features


## Part 4: Training Your Model

In [86]:
## Here you are on your own no instructions!!!!
## You will need to convert train_x to train_x with features extracted
## Normalize your data -- with min-max normalization ---
## create lr = LR(), call lr.fit ----  usual stuff!!!
## Normalize train_x, and train
## What accuracy you get on test set?

from sklearn.preprocessing import MinMaxScaler

train_x_features = np.zeros((train_x.shape[0], 2)) # Initializing the feature matrix

for i in range(train_x.shape[0]): # Summed positive (1.0) and negative (0.0) word counts for each tweet
  train_x_features[i] = extract_features(train_x[i], freqs)

# Min-max normalization from https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
scaler = MinMaxScaler()
scaler.fit(train_x_features)
normalized_train_x_features = scaler.transform(train_x_features)

lr = LR()
# lr.fit(normalized_train_x_features, train_y) # Its accuracy score was 0.49 and it was so low that I decided to tune learning rate
# lr.fit(normalized_train_x_features, train_y, 1e-5) # It was still 0.49 and I decided that it is not enough
# lr.fit(normalized_train_x_features, train_y, 1e-2) # It became 0.87 but I thought I can improve this by increasing max iterations
lr.fit(normalized_train_x_features, train_y, 1e-2, 10000) # Accuracy score reached 0.94 which is good


test_x_features = np.zeros((test_x.shape[0], 2)) # Initializing the feature matrix

for i in range(test_x.shape[0]): # Summed positive (1.0) and negative (0.0) word counts for each tweet
  test_x_features[i] = extract_features(test_x[i], freqs)

# Min-max normalization; note that this refits the scaler on the test features,
# whereas the usual practice is to reuse the scaler fitted on the training set (transform only)
scaler.fit(test_x_features)
normalized_test_x_features = scaler.transform(test_x_features)

prediction = lr.predict(normalized_test_x_features)

from sklearn.metrics import accuracy_score

accuracy_score(test_y, prediction) # Calculating the accuracy

4646.8718673102485
4632.176738647489
4617.723817389428
4603.446549468115
4589.304651952378
4575.273680728829
4561.3387397418155
4547.49068836687
4533.723855144084
4520.034659672452
4506.420781963723
4492.880661797479
4479.413196989699
4466.017561557075
4452.693096147853
4439.439242030454
4426.25550133566
4413.141413122335
4400.096538980028
4387.120454379199
4374.212743485101
4361.372996058648
4348.600805614478
4335.895768336022
4323.2574824461
4310.685547851302
4298.179565950588
4285.7391395421055
4273.363872788366
4261.053371215819
4248.807241734356
4236.625092667998
4224.50653379155
4212.4511763700475
4200.458633199085
4188.528518644911
4176.660448683598
4164.854040938891
4153.10891471852
4141.424691048834
4129.800992707702
4118.237444255659
4106.733672065309
4095.2893043489726
4083.903971184639
4072.5773045402393
4061.3089382962803
4050.098508266883
4038.9456522192663
4027.850009891722
4016.811223010129
4005.828935303049
3994.9027925154496
3984.0324424211117
3973.2175348337564
3962.

0.9448484848484848
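
The training cell fits the scaler a second time on the test features. The more common convention is to compute the minimum and maximum on the training set only and reuse them for the test set, which with scikit-learn means calling `scaler.transform` on the test features without another `fit`. The sketch below shows the same idea in plain NumPy on random data; the array names and sizes are assumptions for illustration, and it makes no claim about how the accuracy above would change.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.uniform(0, 100, size=(50, 2))  # stand-in for training features
test = rng.uniform(0, 100, size=(10, 2))   # stand-in for test features

# Statistics come from the training set only.
train_min = train.min(axis=0)
train_max = train.max(axis=0)
scale = train_max - train_min

train_scaled = (train - train_min) / scale
test_scaled = (test - train_min) / scale   # may fall slightly outside [0, 1]

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
print(test_scaled.min(axis=0), test_scaled.max(axis=0))
```

Scaling both sets with the same statistics keeps a given raw feature value mapped to the same scaled value at training and prediction time.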