[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/y-akbal/ADA440_Python_4_DS/blob/main/ALE/tweet_classifier.ipynb)


# Tweet Sentiment Classifier with Logistic Regression
This week you are on your own, you will learn about logistic regression. Given a tweet, you will decide if it has a positive sentiment or a negative one. Specifically you will:

* Learn how to extract features for logistic regression given some text using bag of words model,
* Implement logistic regression from scratch,

We will be using a data set of tweets given in csv form.

Things you shall be doing:

* Implement logistic regression from scratch,
* Read the csv file and extract features,
* Train your model on your dataset,
* Do some funny stuff.

Before you get started, please see
https://en.wikipedia.org/wiki/Logistic_regression

In [1]:
##Import some libraries!!!
import numpy as np

# Part 1: Logistic regression


### Part 1.1: Sigmoid
You will learn to use logistic regression for text classification.
* The sigmoid function is defined as:

$$ \sigma(z) = \frac{1}{1+e^{-z}} \tag{1}$$

It maps the input 'z' to a value that ranges between 0 and 1, and so it can be treated as a probability.


In [2]:
##Implement sigmoid function now!!!
def sigmoid(x:np.ndarray)->np.ndarray:
    return 1 / (1 + np.exp(-x))

In [3]:
# Testing your function
if (sigmoid(0) == 0.5):
    print('Great Successs!!!')
else:
    print('Oops!')

if (sigmoid(4.92) == 0.9927537604041685):
    print('Good!!!')
else:
    print('Oops you did it again!')

Great Successs!!!
Good!!!


### Part 1.2: Logistic Regression Class
See the previous week (linear regression) for a quick refresher. The set up is the same.
### Logistic regression: regression and a sigmoid

Logistic regression takes a regular linear regression, and applies a sigmoid to the output of the linear regression.

Regression:
$$z = \theta_0  + \theta_1 x_1 + \theta_2 x_2 + ... \theta_N x_N$$
Note that the $\theta$ values are "weights".
Logistic regression
$$ \sigma(z) = \frac{1}{1+e^{-z}}$$
$$z = \theta_0  + \theta_1 x_1 + \theta_2 x_2 + ... \theta_N x_N$$
We will refer to 'z' as the 'logits'.

### Part 1.2 Cost function and Gradient

The cost function used for logistic regression is the average of the log loss across all training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m y^{(i)}\log (\sigma(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-\sigma(z(\theta)^{(i)}))\tag{5} $$
* $m$ is the number of training examples
* $y^{(i)}$ is the actual label of the i-th training example.
* $h(z(\theta)^{(i)})$ is the model's prediction for the i-th training example.

The loss function for a single training example is
$$ Loss = -1 \times \left( y^{(i)}\log (\sigma (z(\theta)^{(i)})) + (1-y^{(i)})\log (1-\sigma(z(\theta)^{(i)})) \right)$$

* All the $h$ values are between 0 and 1, so the logs will be negative. That is the reason for the factor of -1 applied to the sum of the two loss terms.
* Note that when the model predicts 1 ($\sigma(z(\theta)) = 1$) and the label $y$ is also 1, the loss for that training example is 0.
* Similarly, when the model predicts 0 ($\sigma(z(\theta)) = 0$) and the actual label is also 0, the loss for that training example is 0.
* However, when the model prediction is close to 1 ($\sigma(z(\theta)) = 0.9999$) and the label is 0, the second term of the log loss becomes a large negative number, which is then multiplied by the overall factor of -1 to convert it to a positive loss value. $-1 \times (1 - 0) \times log(1 - 0.9999) \approx 9.2$ The closer the model prediction gets to 1, the larger the loss.

#### Update the weights

To update your weight vector $\theta$, you will apply gradient descent to iteratively improve your model's predictions.  
The gradient of the cost function $J$ with respect to one of the weights $\theta$ is:

$$\nabla J(\theta) = \frac{1}{m} \sum_{i=1}^m(\sigma(z(\theta)^{(i)})-y^{(i)})x_j \tag{5}$$
* 'i' is the index across all 'm' training examples.
* 'j' is the index of the weight $\theta_j$, so $x_j$ is the feature associated with weight $\theta_j$

* To update the weight $\theta_j$, we adjust it by subtracting a fraction of the gradient determined by $\alpha$:
$$\theta = \theta - \alpha \times \nabla_{\theta}J(\theta) $$
* The learning rate $\alpha$ is a value that we choose to control how big a single update will be.

In [9]:
## Where you see None, this means you have gotta do something there -- except __init__ function!!!!!
class LR:
    def __init__(self):
        self._fitted_ = False
        self.weight:np.ndarray = None

    def fit(self, X, y, lr = 1e-10, max_iter = 1000):

        if not self._fitted_:
            n, m = X.shape
            self.weight = 0.01*np.random.randn(m+1)
            self._fitted_ = True

        X_0=X.shape[0]
        x_val_ones=[np.ones((X_0, 1)), X]
        X_aug =  np.concatenate(x_val_ones, axis=1) ## add a column of ones from left!!!! like -- [ones, X] use np.concatenate

        for j in range(max_iter):
            ## get the loss --
            ## in every 100 step print the loss (it should be decreasing over time)
            ## get grads and update it below!!! see (5) above.
            ## This is the point that you will hit the wall ---

            log = self.get_logits(X)

            log_pred = sigmoid(log)

            err = log_pred - y

            val= X_aug.T @ err

            grads_val = val / len(y)

            self.weight = self.weight - (lr * grads_val)

            loss_val= y * np.log(log_pred) + (1 - y) * np.log(1 - log_pred)

            loss_tot = -np.mean(loss_val) * len(y)

            if j % 100 == 0:
                print(loss_tot)

    def get_logits(self, X):
        assert self._fitted_, "The model has not been fitted yet!"
        if len(X.shape) == 1:
            X = np.expand_dims(X, axis = 0)
        return np.concatenate([np.ones((X.shape[0], 1)), X], axis = 1) @ self.weight

    def predict(self, X):
        ## Get predictions here --- output array should be containing labels (only 0s and 1s)
        ## hint: np.where
        log = self.get_logits(X)
        prob_val = sigmoid(log)
        pred_result=np.where(prob_val >= 0.5, 1, 0)
        return pred_result

In [10]:
##Let's give it a try --- probably you will spend some time here !!!!
np.random.seed(0)
X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, 1000)

lr = LR()
lr.fit(X, y, lr = 1e-4, max_iter = 50000)
""" You should see something like as follows
693.1462797411714
693.1235240431618
693.1008833028151
693.078356935228
693.0559443584558
693.0336449934978
693.0114582642825
692.9893835976551
692.967420423361
692.9455681740333
692.9238262851777
692.9021941951596
692.8806713451888
692.8592571793067
692.8379511443716
692.8167526900455
692.7956612687802
692.7746763358036
692.7537973491062
692.7330237694271
692.7123550602412
692.6917906877453
692.6713301208449
692.650972831141
692.6307182929162
...
688.9843435354524
688.9824272801267
688.9805204006907
688.9786228506564
"""
sum(lr.predict(np.random.randn(100,10)))/100  ## <-- is this number close to 0.5 ?
float(lr.predict(np.random.randn(10))) ## <- Does it throw an error ?

693.1462797411714
693.1235240431618
693.100883302815
693.0783569352282
693.0559443584558
693.0336449934975
693.0114582642825
692.9893835976551
692.967420423361
692.9455681740333
692.9238262851777
692.9021941951596
692.880671345189
692.8592571793067
692.8379511443716
692.8167526900455
692.7956612687802
692.7746763358036
692.7537973491062
692.7330237694271
692.7123550602412
692.6917906877453
692.6713301208449
692.650972831141
692.6307182929162
692.6105659831226
692.5905153813676
692.5705659699015
692.5507172336042
692.5309686599726
692.5113197391067
692.4917699636981
692.4723188290161
692.4529658328956
692.4337104757244
692.4145522604301
692.3954906924682
692.3765252798089
692.357655532925
692.3388809647797
692.3202010908141
692.3016154289345
692.283123499501
692.2647248253146
692.2464189316053
692.2282053460201
692.2100835986113
692.1920532218239
692.1741137504839
692.156264721787
692.138505675286
692.1208361528796
692.1032556988007
692.0857638596044
692.0683601841569
692.0510442236233


  float(lr.predict(np.random.randn(10))) ## <- Does it throw an error ?


1.0

### Part 2: --- AncientGPT ---
### You are now ready to code a very old but still efficient tweet classifier! Import functions and data. We shall be using nltk package (use !pip install nlk if you receive any error)

In [11]:
# run this cell to import nltk and some other files!!!

import nltk
nltk.download('stopwords')
from os import getcwd
import re
import string
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
import sklearn

filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Read the tweets using pd.read_csv, and browse them.
### 1) Read the csv file,
### 2) Grab the columns, X and y
### 3) using sklearn.model_selection.train_test split split the dataset into two parts
### 3.5) names should be for compatibility train_x, train_y, test_x, test_y
### 4) Convert everything to numpy arrays

In [12]:
##Your code here ##
## Start by reading csv file -- download it directly from github, download it locally!!!
## split it into train test split!!!
## make sure that the names are train_x, test_x, train_y, test_y

import pandas as pd

from sklearn.model_selection import train_test_split
url="https://raw.githubusercontent.com/y-akbal/ADA440_Python_4_DS/main/ALE/tweets.csv"
df = pd.read_csv(url)
train_x, train_y, test_x, test_y = train_test_split(df['Tweets'], df['Sentiment'], test_size=0.2, random_state=42)

train_x=np.array(train_x)
train_y=np.array(train_y)
test_x=np.array(test_x)
test_y=np.array(test_y)



### Below you shall create some helper functions process_tweet and build_freqs

In [13]:
def process_tweet(tweet:str)->list[str]:
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    """

    """
    google the following see what they do???

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import TweetTokenizer  -- (use with setup preserve_case=False, strip_handles=True,
                               reduce_len=True)
    """

    stemmer = PorterStemmer() ##Initalize stemmer
    stopwords_english = stopwords.words('english') ## Grab the stop_words
    ## --- ##
    ## Below you will right regex -- feel free to use ChatGPT!!!
    ## It should be of the following form
    ## tweet = re.sub(r" --pattern--", "", tweet)
    # Steps -- begining of the regex stuff
    # remove stock market tickers like $GE
    # remove old style retweet text "RT"
    # remove hyperlinks
    # remove hashtags
    # only removing the hash # sign from the word
    # end of regex stuff

    tweet = re.sub(r'\$\w*', '', tweet)
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?:\/\/\S+', '', tweet)
    tweet = re.sub(r'#', '', tweet)


    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    ## Below you will get tokenized tweet!!!
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean

In [14]:
process_tweet("Yep yep Hello dude, how you doin?") ## should be == ['yep', 'yep', 'hello', 'dude', 'doin']

['yep', 'yep', 'hello', 'dude', 'doin']

### Process tweet
The given function process_tweet() tokenizes the tweet into individual words, removes stop words and applies stemming.

In [15]:
# test the function below
print('This is an example of a positive tweet: \n', train_x[55])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[55]))

This is an example of a positive tweet: 
 http://t.co/ziJiJYLDXT via @youtube...Reality is!! :(

This is an example of the processed version of the tweet: 
 ['via', '...', 'realiti', ':(']


#### Expected output should more or less as follows!
```
This is an example of a positive tweet:
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processes version:
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
```

In [16]:
def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    #
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1

    return freqs

In [17]:
# create frequency dictionary
freqs = build_freqs(train_x, train_y)
# check the output
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))
## What do you see here???
## See what freqs contains????

type(freqs) = <class 'dict'>
len(freqs) = 13142


## Part 3: Extracting the features

* Given a list of tweets, extract the features and store them in a matrix. You will extract two features.
    * The first feature is the number of positive words in a tweet.
    * The second feature is the number of negative words in a tweet.
* Then train your logistic regression classifier on these features.
* Test the classifier on a validation set.

### Instructions: Implement the extract_features function.
* This function takes in a single tweet.
* Process the tweet using the imported process_tweet() function and save the list of tweet words.
* Loop through each word in the list of processed words
    * For each word, check the freqs dictionary for the count when that word has a positive '1' label. (Check for the key (word, 1.0)
    * Do the same for the count for when the word is associated with the negative label '0'. (Check for the key (word, 0.0).)


In [31]:
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def extract_features(tweet, freqs):
    '''
    Input:
        tweet: a list of words for one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)

    # 3 elements in the form of a 1 x 2 vector
    x = np.zeros((1, 2))


    # loop through each word in the list of words
    for word in word_l:

        # increment the word count for the positive label 1
        x[0,0] += freqs.get((word, 1.0), 0)
        # increment the word count for the negative label 0
        x[0,1] += freqs.get((word, 0), 0)

    ### END CODE HERE ###
    assert(x.shape == (1, 2))
    return x

# Extract Features


## Part 3: Training Your Model

In [35]:
## Here you are on your own no instructions!!!!
## You will need to convert train_x to train_x with features extracted
## Normalize your data -- with min-max normalization ---
## create lr = LR(), call lr.fit ----  usual stuff!!!
## Normalize train_x, and train
## What accuracy you get on test set?


df = pd.read_csv(url)
train_x, test_x, train_y, test_y = train_test_split(df['Tweets'], df['Sentiment'], test_size=0.2, random_state=42)
train_x = np.array(train_x)
train_y = np.array(train_y)
test_x = np.array(test_x)
test_y = np.array(test_y)




def min_max_normalize(X):
    X_min = X.min(axis=0)
    X_max = X.max(axis=0)
    X_norm = (X - X_min) / (X_max - X_min)
    return X_norm




def extract_features_(tweets):
    freqs = build_freqs(train_x, train_y)
    n=len(tweets)
    X_val = np.zeros((n, 2))
    for i, tweet in enumerate(tweets):
        X_val[i, :] = extract_features(tweet, freqs)
    return X_val



x_features_train = extract_features_(train_x)
x_features_test = extract_features_(test_x)

x_features_normalized_train= min_max_normalize(x_features_train)
x_features_normalized_test = min_max_normalize(x_features_test)

lr_model = LR()
lr_model.fit(x_features_normalized_train, train_y, lr=1e-4, max_iter=50000)

pred_test_ = lr_model.predict(x_features_normalized_test)


accuracy = np.mean(pred_test_ == test_y)

print(f"accuracy: {accuracy * 100:.2f}%")


5548.742054854694
5548.563486080486
5548.384948759359
5548.206442779534
5548.027968029795
5547.849524399486
5547.6711117785
5547.492730057287
5547.3143791268485
5547.136058878727
5546.957769205013
5546.779509998339
5546.601281151872
5546.423082559319
5546.244914114917
5546.066775713438
5545.888667250176
5545.710588620955
5545.532539722118
5545.35452045053
5545.176530703573
5544.998570379144
5544.820639375652
5544.642737592012
5544.464864927652
5544.2870212825
5544.109206556987
5543.931420652045
5543.7536634691
5543.575934910073
5543.398234877377
5543.220563273917
5543.042920003081
5542.865304968744
5542.68771807526
5542.510159227466
5542.332628330673
5542.15512529067
5541.977650013718
5541.800202406543
5541.622782376343
5541.445389830783
5541.2680246779855
5541.090686826537
5540.913376185483
5540.736092664321
5540.558836173008
5540.381606621946
5540.20440392199
5540.027227984441
5539.850078721045
5539.672956043989
5539.4958598659005
5539.318790099846
5539.141746659327
5538.964729458277