In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, f1_score

import time
from datetime import timedelta
import numpy

from EngineFiles.MachineLearning import LogisticRegressionModel as LRM

In [2]:
df = pd.read_csv('Data/indonesia_Tweet/clean_tweets.csv')
df.dropna(subset=['Tweet'],inplace=True)

In [3]:
x_train, x_test, y_train, y_test = LRM.splitSet(df, 0.2, 42)

In [4]:
freq = LRM.dictFrequency(x_train, y_train)
print("type(freqs) = " + str(type(freq)))
print("len(freqs) = " + str(len(freq.keys())))

type(freqs) = <class 'dict'>
len(freqs) = 16279


### The `sigmoid(z)` function is defined as: 

$$ h(z) = \frac{1}{1+e^{-z}} \tag{1}$$

It maps the input 'z' to a value that ranges between 0 and 1, and so it can be treated as a probability.
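Equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the code inside `LRM`; the function name `sigmoid` and the sample inputs are chosen here for the example.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function; accepts a scalar or a NumPy array."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# z = 0 sits exactly on the decision boundary.
print(sigmoid(0.0))  # 0.5

# Large negative inputs approach 0, large positive inputs approach 1.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))
```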

### The `gradientDescent(x, y, theta, alpha, num_iters)` function:

* The number of iterations `num_iters` is the number of times that we'll use the entire training set.
* For each iteration, we'll calculate the cost function using all training examples (there are `m` training examples), and for all features.
* Instead of updating a single weight $\theta_i$ at a time, we can update all the weights in the column vector:  
$$\mathbf{\theta} = \begin{pmatrix}
\theta_0
\\
\theta_1
\\
\theta_2
\\
\vdots
\\
\theta_n
\end{pmatrix}$$
* $\mathbf{\theta}$ has dimensions (n+1, 1), where 'n' is the number of features, and there is one more element for the bias term $\theta_0$ (note that the corresponding feature value $\mathbf{x_0}$ is 1).
* The 'logits', 'z', are calculated by multiplying the feature matrix 'x' with the weight vector 'theta'.  $z = \mathbf{x}\mathbf{\theta}$
    * $\mathbf{x}$ has dimensions (m, n+1) 
    * $\mathbf{\theta}$ has dimensions (n+1, 1)
    * $\mathbf{z}$ has dimensions (m, 1)
* The prediction 'h' is calculated by applying the sigmoid to each element in 'z': $h(z) = \mathrm{sigmoid}(z)$, and has dimensions (m, 1).
* The cost function $J$ combines two dot products: 'y' with 'log(h)', and '(1-y)' with 'log(1-h)'.  Since all of these are column vectors of shape (m, 1), transpose the left-hand vector in each product, so that matrix multiplication of a row vector with a column vector performs the dot product.
$$J = \frac{-1}{m} \times \left(\mathbf{y}^T \cdot \log(\mathbf{h}) + (\mathbf{1}-\mathbf{y})^T \cdot \log(\mathbf{1}-\mathbf{h}) \right)$$
* The update of theta is also vectorized.  Because the dimensions of $\mathbf{x}$ are (m, n+1), and both $\mathbf{h}$ and $\mathbf{y}$ are (m, 1), we need to transpose the $\mathbf{x}$ and place it on the left in order to perform matrix multiplication, which then yields the (n+1, 1) answer we need:
$$\mathbf{\theta} = \mathbf{\theta} - \frac{\alpha}{m} \times \left( \mathbf{x}^T \cdot \left( \mathbf{h-y} \right) \right)$$
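The vectorized cost and update above can be sketched as follows. This is an illustration under stated assumptions, not the code behind `LRM.gradientDescent`: the name `gradient_descent`, the toy data, and the learning rate are all made up for the example, and `num_iters` is assumed to be at least 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(x, y, theta, alpha, num_iters):
    """Batch gradient descent for logistic regression.

    x: (m, n+1) feature matrix whose first column is the bias feature 1
    y: (m, 1) labels in {0, 1}
    theta: (n+1, 1) initial weights
    Returns the cost computed in the last iteration and the final weights.
    """
    m = x.shape[0]
    for _ in range(num_iters):
        z = x @ theta                      # logits, shape (m, 1)
        h = sigmoid(z)                     # predictions, shape (m, 1)
        # Cost at the weights used in this iteration (before the update).
        J = (-1.0 / m) * (y.T @ np.log(h) + (1 - y).T @ np.log(1 - h))
        # x.T is (n+1, m) and (h - y) is (m, 1), so the step is (n+1, 1).
        theta = theta - (alpha / m) * (x.T @ (h - y))
    return float(J.item()), theta

# Toy data: a bias column plus one feature; larger values belong to label 1.
x_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_toy = np.array([[0.0], [0.0], [1.0], [1.0]])

J_toy, theta_toy = gradient_descent(x_toy, y_toy, np.zeros((2, 1)), 0.1, 1000)
print(J_toy, theta_toy.ravel())
```

With all-zero initial weights every prediction starts at 0.5, so the cost starts at $\ln 2$ and should fall below that value as the weights move.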

### The `extractFeatures(text, freqs)` function:

* This function takes in a single tweet and returns its feature vector; calling it once per tweet fills the feature matrix row by row.
* Two features are extracted per tweet:
    * The first feature is the number of positive words in the tweet.
    * The second feature is the number of negative words in the tweet.
* Together with the bias term, this is consistent with the three columns allocated for `X` in the training cell.
* Process the tweet using the imported `process_tweet()` function and save the list of tweet words.
* Loop through each word in the list of processed words.
    * For each word, check the `freqs` dictionary for the count when that word has the positive label '1'.
    * Do the same for the count when the word is associated with the negative label '0'.
* The logistic regression classifier is then trained on these features and evaluated on the test set.

In [5]:
X = numpy.zeros((len(x_train), 3))
for i in range(len(x_train)):
    X[i, :] = LRM.extractFeatures(x_train[i], freq)

Y = y_train

J, theta = LRM.gradientDescent(X, Y, numpy.zeros((3, 1)), 1e-9, 10000)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in numpy.squeeze(theta)]}")

The cost after training is 0.69314718.
The resulting vector of weights is [-0.0, 4e-08, -9e-08]
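The reported cost agrees with $\ln 2 \approx 0.69314718$ to eight decimals, which is the value the cost formula gives when every prediction is 0.5, as happens with all-zero weights. A quick check with arbitrary toy labels (not the tweet data) shows this:

```python
import numpy as np

y = np.array([[1.0], [0.0], [1.0], [0.0], [0.0]])  # arbitrary toy labels
h = np.full_like(y, 0.5)                            # every prediction is 0.5
m = y.shape[0]

J = (-1.0 / m) * (y.T @ np.log(h) + (1 - y).T @ np.log(1 - h))
print(f"{J.item():.8f}")  # 0.69314718
```

Together with weights on the order of 1e-08, this suggests that a learning rate of 1e-9 over 10000 iterations moves the weights only slightly away from their initial values.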


### The `LR_predictTweet(tweet, freqs, theta)` function

Predict whether a tweet is positive or negative.

* Given a tweet, process it, then extract the features.
* Apply the model's learned weights on the features to get the logits.
* Apply the sigmoid to the logits to get the prediction (a value between 0 and 1).

$$y_{pred} = \mathrm{sigmoid}(\mathbf{x} \cdot \mathbf{\theta})$$
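Once the features are extracted, the prediction step reduces to one matrix product and one sigmoid. The sketch below assumes a feature vector of length 3 and a weight vector of shape (3, 1); the name `predict_from_features` and the sample values are hypothetical, and the tweet processing done by `LRM.LR_predictTweet` is left out.

```python
import numpy as np

def predict_from_features(x, theta):
    """Return sigmoid(x . theta) for one feature vector x and weights theta."""
    x = np.asarray(x, dtype=float).reshape(1, -1)   # row vector, shape (1, n+1)
    z = x @ theta                                   # logit, shape (1, 1)
    return 1.0 / (1.0 + np.exp(-z))

# With all-zero weights the logit is 0, so the prediction is exactly 0.5.
print(predict_from_features([1.0, 3.0, 5.0], np.zeros((3, 1))))  # [[0.5]]
```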

In [6]:
for tweet in [x_test[10],x_test[20]]:
    print( '%s -> %f' % (tweet, LRM.LR_predictTweet(tweet, freq, theta)))

sih sayang banget nya goeun hati banget ikhlas ajar yuri banget main vokal menang my heart goeun -> 0.499963
agus budek dengar -> 0.499999


### Check performance using the test set
After training the model using the training set above, check how the model might perform on real, unseen data, by testing it against the test set.

### The `logisticRegressionAccuracy(x, y, freqs, theta)` function
* Given the test data and the weights of the trained model, calculate the accuracy of the logistic regression model.
* Use the `LR_predictTweet()` function to make predictions on each tweet in the test set.
* If the prediction is > 0.5, set the model's classification `y_pred` to 1, otherwise set the model's classification `y_pred` to 0.
* A prediction is accurate when `y_pred` equals `y_test`.  Sum up all the instances when they are equal and divide by `m`.
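The thresholding and averaging steps can be sketched as follows, starting from predicted probabilities and skipping the per-tweet prediction calls. The name `accuracy_from_probabilities` and the sample values are made up for this illustration.

```python
import numpy as np

def accuracy_from_probabilities(probs, y_true, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    # Flatten both inputs so that comparing an (m,) array with an (m, 1)
    # array cannot silently broadcast into an (m, m) matrix.
    probs = np.asarray(probs, dtype=float).ravel()
    y_true = np.asarray(y_true).ravel()
    y_pred = (probs > threshold).astype(int)   # strictly greater than 0.5 -> 1
    return float(np.mean(y_pred == y_true))

# Predictions become [1, 0, 0, 1]; two of the four match the labels.
print(accuracy_from_probabilities([0.9, 0.4, 0.5, 0.7], [1, 0, 1, 0]))  # 0.5
```

Because the comparison is strict, a probability of exactly 0.5 is classified as 0, matching the rule stated above.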


In [7]:
LR_accuracy = LRM.logisticRegressionAccuracy(x_test, y_test, freq, theta)
print(f"Logistic regression model's accuracy = {LR_accuracy:.4f}")

Logistic regression model's accuracy = 0.5985
