# Logistic Regression

## Basic NLP Flow

- **Step 1**: Clean data
    - Remove irrelevant characters, such as any non-alphanumeric characters
    - Tokenize your text by separating it into individual words
    - Remove tokens that are not relevant, such as “@” Twitter mentions or URLs
    - Convert all characters to **lowercase**, in order to treat words such as “hello”, “Hello”, and “HELLO” the same
    - Consider **lemmatization** (reduce words such as “am”, “are”, and “is” to a common form such as “be”)
- **Step 2**: Representation
    - Bag of Words or TF-IDF
- **Step 3**: Classification
    - Naive Bayes
    - Logistic Regression
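
The cleaning steps in Step 1 can be sketched with the standard library alone. This is a minimal illustration, not the preprocessing used later in this notebook: the function name `clean_text` and the sample sentence are hypothetical, and lemmatization is left out.

```python
import re

def clean_text(text):
    """Toy cleaner: drop URLs and @mentions, strip non-alphanumerics, lowercase, tokenize."""
    text = re.sub(r"https?://\S+", " ", text)     # remove URLs first, while ':' and '/' are still present
    text = re.sub(r"@\w+", " ", text)             # remove @mentions
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # remove remaining non-alphanumeric characters
    return text.lower().split()                   # lowercase, then split on whitespace

tokens = clean_text("Hello @bob, HELLO world! See https://example.com :)")
print(tokens)  # ['hello', 'hello', 'world', 'see']
```

The order matters in this sketch: URLs and mentions are removed before punctuation, because stripping punctuation first would break them into ordinary-looking words.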

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# Import pandas and numpy, then load the dataset into a DataFrame called 'data'
import numpy as np
import pandas as pd

# Read the data
data = pd.read_csv('/content/drive/My Drive/FTMLE - Tonga/Data/movie_review.csv', encoding='utf-8', sep='\t')

# Let's check some samples
data.sample(10)


Unnamed: 0,id,review,sentiment
10402,7517_1,***One Out of Ten Stars*** <br /><br />Because...,0
750,1099_4,I was shocked by the ridiculously unbelievable...,0
16439,10636_8,This story is a complex and wonderful tale of ...,1
4147,6095_1,"i've seen a movie thats sort of like this, wer...",0
3957,3808_10,I voted this a 10 out of 10 simply because it ...,1
16932,4780_9,Always fancied this film from the video cover....,1
4720,1623_10,This is te cartoon that should have won instea...,1
7784,6752_7,A detective (Dana Andrews) with a reputation f...,1
15418,10494_1,"On Steve Irwin's show, he's hillarious. He doe...",0
7299,8837_1,This film is terrible. I was really looking fo...,0


## Sentiment analysis
The task is to build a model that determines the tone (positive or negative) of a text. To do this, you will need to train the model on the existing data (movie_review.csv). The resulting model will then have to determine the class (positive or negative) of new texts. The dataset contains the following fields:

| Field name | Meaning |
|------------|-----------|
| id | ID of the review |
| review | text of the review |
| sentiment | sentiment label (1 = positive, 0 = negative) |



* Check the class balance of the sentiment labels

In [0]:
data['sentiment'].value_counts(normalize = True)

1    0.501244
0    0.498756
Name: sentiment, dtype: float64
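
Because the two classes are almost exactly 50/50, always predicting the majority class would score only about 50% accuracy, which is the baseline any model here has to beat. A small sketch of that reasoning, using a hypothetical list of labels rather than the real dataset:

```python
from collections import Counter

# Hypothetical labels, standing in for data['sentiment']
labels = [1, 0, 1, 1, 0, 0, 1, 0]

counts = Counter(labels)
proportions = {label: count / len(labels) for label, count in counts.items()}

# Accuracy of a "model" that always predicts the most frequent class
majority_baseline = max(proportions.values())

print(proportions)
print(majority_baseline)
```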

### Term frequency-inverse document frequency (tf-idf)

We could use raw term frequencies to score the words in our algorithm. There is a problem, though: if a word is very frequent in _all_ documents, then it probably doesn't carry a lot of information. To tackle this problem we can use **term frequency-inverse document frequency**, which reduces the score the more frequent the word is across all documents. It is calculated like this:

\begin{equation*}
\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d)
\end{equation*}

_tf(t,d)_ is the raw term frequency mentioned above, that is, the number of times the term `t` occurs in document `d`. _idf(t,d)_ is the inverse document frequency, which can be calculated as follows:

\begin{equation*}
\text{idf}(t,d) = \log \frac{n_d}{1+\text{df}(t,d)}
\end{equation*}

where `n_d` is the total number of documents and _df(t,d)_ is the number of documents in which the term `t` appears.

The `1` added to the denominator avoids a division by zero for terms that appear in no document, and the `log` ensures that low-frequency terms don't get too much weight.

Fortunately, `scikit-learn` does all of these calculations for us through `TfidfVectorizer`, which is used in Step 2.
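
To make the formulas concrete, here is a hand-rolled sketch that applies them literally to three made-up documents. The documents and helper names are assumptions for illustration only. scikit-learn's `TfidfVectorizer` uses a smoothed idf variant by default and L2-normalises each document vector, so its numbers will differ from these.

```python
import math

# Three tiny hypothetical documents
docs = [
    "the movie was good",
    "the movie was bad",
    "the plot was bad",
]
tokenized = [doc.split() for doc in docs]
n_d = len(tokenized)  # total number of documents

def tf(term, doc_tokens):
    """Raw term frequency: how often the term occurs in one document."""
    return doc_tokens.count(term)

def df(term):
    """Document frequency: number of documents that contain the term."""
    return sum(1 for doc_tokens in tokenized if term in doc_tokens)

def idf(term):
    """Inverse document frequency, as defined in the formula above."""
    return math.log(n_d / (1 + df(term)))

def tf_idf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

for term in ["the", "movie", "good"]:
    print(term, round(tf_idf(term, tokenized[0]), 3))
```

In the first document, "good" (present in one document) scores highest, "movie" (present in two) scores zero, and "the" (present in all three) is pushed below zero, which is the down-weighting of ubiquitous words described above.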

## Step 1: Data clean up

### Removing stop words

Now that we know how to format and score our input, let's look at our **real** vocabulary, specifically the most common words:

In [0]:
from collections import Counter
vocab = Counter()

# Apply Counter to count words in our reviews
for document in data['review']:
  for word in document.split(' '):
    vocab[word] += 1
    
# Show 20 most common words
vocab.most_common(20)   

[('the', 258519),
 ('a', 139707),
 ('and', 137397),
 ('of', 128750),
 ('to', 119278),
 ('is', 92935),
 ('in', 77245),
 ('I', 59255),
 ('that', 57991),
 ('this', 51379),
 ('it', 48865),
 ('/><br', 45851),
 ('was', 42004),
 ('as', 38288),
 ('with', 37496),
 ('for', 36919),
 ('The', 30399),
 ('but', 30350),
 ('on', 27738),
 ('movie', 27342)]

### Stop words
As we can see, the most common words are meaningless in terms of sentiment: _I, to, the, and_... They give no information about positiveness or negativeness. They're basically **noise** that can most probably be eliminated. These kinds of words are called _stop words_, and it is common practice to remove them when doing text analysis.

In [0]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

vocab_reduced = Counter()
# Go through all of the items of vocab using vocab.items() and pick only words that are not in 'stop' 
# and save them in vocab_reduced
for word, count in vocab.items():
  if word.lower() not in stop:
    vocab_reduced[word] = count

vocab_reduced.most_common(20)

[('/><br', 45851),
 ('movie', 27342),
 ('film', 24768),
 ('one', 18704),
 ('like', 16278),
 ('would', 10720),
 ('good', 10243),
 ('really', 9773),
 ('even', 9530),
 ('see', 9077),
 ('-', 8181),
 ('get', 7857),
 ('story', 7652),
 ('much', 7634),
 ('time', 7028),
 ('make', 6719),
 ('could', 6700),
 ('also', 6672),
 ('people', 6604),
 ('first', 6570)]

This looks better: within the 20 most common words we already see words that carry meaning, such as good, like, really...

### Removing special characters and "trash"

If you look closer, you'll see that we're also counting punctuation marks ('-', ',', etc.) and HTML remnants such as `<br />` tags or entities like `&amp;`. We can safely remove them for the sentiment analysis, but we will try to keep the emoticons, since those _do_ carry sentiment:

In [0]:
import re

def preprocessor(text):
    """ Return a cleaned version of text
    """
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Replace runs of non-word characters with a space, convert to lower case, and
    # append the emoticons with the '-' nose removed for standardization
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))
    
    return text

# Test preprocessor() on a sample text
print(preprocessor('/> This is, our result. :)) <br>'))


 this is our result  :)
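
The emoticon pattern is the least obvious part of `preprocessor()`. It matches an eye (`:`, `;` or `=`), an optional nose (`-`), and a mouth (`)`, `(`, `D` or `P`). The sketch below runs the same pattern on a few made-up strings; the name `EMOTICON_RE` and the sample texts are hypothetical.

```python
import re

# Same pattern as in preprocessor(), compiled once and written as a raw string
EMOTICON_RE = re.compile(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)')

samples = ["great film :-)", "awful :(", "so funny =D ;P", "no emoticon here"]
found = [EMOTICON_RE.findall(s) for s in samples]

for sample, emoticons in zip(samples, found):
    print(repr(sample), '->', emoticons)
```

Because every group is non-capturing (`(?:...)`), `findall` returns the whole emoticon rather than its individual parts.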


Many words, such as “ask” and “asked”, are just different forms of the same verb. Treating them as separate words makes the sparse vectors larger, which requires more memory and computational resources, and the vast number of dimensions can make modeling very challenging for traditional algorithms.

Reducing the different forms of a word to a core root is therefore essential: it helps to shorten the sparse word vectors. Words that are derived from one another can be mapped to a central word or symbol, especially if they have the same core meaning.

This is where something like stemming or lemmatization comes in.

We are almost ready! This trick lets us reduce our vocabulary and consolidate words. If you think about it, words like love, loving, etc. _could_ express the same positivity. If that were the case, we would have two words in our vocabulary when we could have only one: love. This process of reducing a word to its root is called **stemming**.


### Stemming
With stemming, words are reduced to their word stems. A word stem need not be the same as a dictionary-based morphological root; it is just a form of the word that is equal to or shorter than the original.

Ex: love, loved, loving -> love
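
To see why a stem need not be a dictionary word, consider a deliberately crude suffix stripper. This is a toy for illustration only and is not the Snowball algorithm used in this notebook. The function `toy_stem` and its suffix list are assumptions.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters.

    Illustration only: real stemmers such as Snowball apply many more rules.
    """
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["love", "loved", "loving", "loves"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['lov', 'lov', 'lov', 'lov']
```

All four forms collapse to `lov`, which is not an English word but still serves as a shared key. The Snowball stemmer maps the forms in the example above to `love` instead.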

In [0]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
# We are going to use the Snowball stemmer for this task
from nltk.stem import SnowballStemmer

snowball = SnowballStemmer(language='english')

def tokenizer_snowball(text):
    """ Split a text into a list of words and apply stemming to each word """

    return [snowball.stem(word) for word in nltk.word_tokenize(text)]

# Testing
print(tokenizer_snowball('The striped bats are hanging on their feet for best'))

['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']


### Lemmatization
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. It is similar to stemming, but it takes the context of a word (such as its part of speech) into account and maps each form to its dictionary form, the lemma.

Example:
- rocks -> rock
- corpora -> corpus
- better -> good
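
Unlike a stemmer, a lemmatizer needs a dictionary and, for many words, the part of speech. The sketch below imitates this with a tiny hand-written lookup table. The table and the function `toy_lemmatize` are hypothetical stand-ins, not how WordNet works internally.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech)
LEMMA_TABLE = {
    ("rocks", "n"): "rock",
    ("corpora", "n"): "corpus",
    ("better", "a"): "good",
    ("are", "v"): "be",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("corpora"))       # corpus
print(toy_lemmatize("better", "a"))   # good
print(toy_lemmatize("better"))        # better (treated as a noun, so no match)
```

The last call shows why a part-of-speech tag is passed to the lemmatizer: the same surface form can map to different lemmas depending on its role in the sentence.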

In [0]:
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0]
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def tokenizer_lemma(text):
    return [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in nltk.word_tokenize(text)]

# Testing
print(tokenizer_lemma('The striped bats are hanging on their feet for best'))

['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']


## Step 2: Representation

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

stop = stopwords.words('english')

# Redefine the tokenizers to split on whitespace instead of using nltk.word_tokenize
def tokenizer_snowball(text):
    return [snowball.stem(word) for word in text.split()]

def tokenizer_lemma(text):
    return [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in text.split()]

def preprocessor(text):
    """ Return a cleaned version of text """

    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Replace runs of non-word characters with a space, convert to lower case, and
    # append the emoticons with the '-' nose removed for standardization
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))

    return text

In [0]:
tfidf = TfidfVectorizer(stop_words=stop,
                        tokenizer=tokenizer_lemma,
                        preprocessor=preprocessor)
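
Conceptually, a vectorizer learns a fixed vocabulary and then turns every document into a vector with one position per vocabulary word. The following sketch shows the plain bag-of-words version of that idea with made-up documents. `TfidfVectorizer` additionally applies the tf-idf weighting and stores the result as a sparse matrix, which this toy does not.

```python
# Hypothetical, already-cleaned documents
docs = ["good movie", "bad movie", "good good plot"]

# "Fit": collect the vocabulary and give each word a column index
vocabulary = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def to_count_vector(text):
    """'Transform': count how often each vocabulary word occurs in the text."""
    vector = [0] * len(vocabulary)
    for word in text.split():
        if word in index:  # words outside the fitted vocabulary are ignored
            vector[index[word]] += 1
    return vector

matrix = [to_count_vector(doc) for doc in docs]
print(vocabulary)
for row in matrix:
    print(row)
```

A new text such as `"great plot"` only contributes through the words seen during fitting, which is why the vocabulary should be learned from the training data.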

## Step 3: Classification

We are finally ready to train our algorithm. 
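
Before fitting, it may help to see what logistic regression does under the hood: it computes a weighted sum of the features, squashes it through a sigmoid to get a probability, and adjusts the weights to reduce the log-loss. The sketch below is a from-scratch toy using stochastic gradient descent on four made-up samples. All names and hyperparameters are assumptions, and it omits the regularization that scikit-learn's `LogisticRegression` applies by default.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit weights and bias with plain stochastic gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            error = p - yi  # gradient of the log-loss with respect to the linear score
            w = [wj - lr * error * xj for wj, xj in zip(w, xi)]
            b -= lr * error
    return w, b

def predict(w, b, x):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical features: [count of "good", count of "bad"] per review
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]

w, b = train_logreg(X, y)
print(w, b)
print([predict(w, b, x) for x in X])
```

The learned weight for "good" ends up positive and the one for "bad" negative, which is also how the coefficients of the real model can be read: each word pushes the prediction toward one class or the other.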

In [0]:
# split the dataset in train and test
from sklearn.model_selection import train_test_split

X = data['review']
y = data['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.2)

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# A pipeline chains several steps together once the initial exploration is done.
# Some steps transform features (normalising numerical values, turning text into vectors,
# or filling in missing data); these are transformers. Other steps predict variables by
# fitting an algorithm; these are estimators. A Pipeline chains them all so that they can
# be applied to the training data in one call.
clf = Pipeline([('vect', tfidf),
                ('clf', LogisticRegression(random_state=0))])

clf.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=<function preprocessor at 0x7f40c98422f0>,
                                 smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', '...
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_lemma at 0x7f40c98427b8>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
          

In [0]:
import pickle

# Save the fitted pipeline; the 'with' block closes the file after writing
with open('/content/drive/My Drive/NLP_Model/lemma.pkl', 'wb') as f:
    pickle.dump(clf, f)

In [0]:
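# Load the saved pipelines. This assumes 'snowball.pkl' was saved earlier from the same
# pipeline built with tokenizer=tokenizer_snowball (that run is not shown in this notebook).
with open('/content/drive/My Drive/NLP_Model/snowball.pkl', 'rb') as f:
    snowball_model = pickle.load(f)
with open('/content/drive/My Drive/NLP_Model/lemma.pkl', 'rb') as f:
    lemma_model = pickle.load(f)

In [0]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

snowball_predictions = snowball_model.predict(X_test)
lemma_predictions = lemma_model.predict(X_test)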



In [0]:
# Let's compare results between the models
print('SNOWBALL MODEL')
print(f'Accuracy: {accuracy_score(y_test, snowball_predictions)}')
print('Classification report:\n', classification_report(y_test, snowball_predictions))
print('---------------------------------')
print('LEMMA MODEL')
print(f'Accuracy: {accuracy_score(y_test, lemma_predictions)}')
print('Classification report:\n', classification_report(y_test, lemma_predictions))

SNOWBALL MODEL
Accuracy: 0.8911111111111111
Classification report:
               precision    recall  f1-score   support

           0       0.89      0.89      0.89      2218
           1       0.89      0.90      0.89      2282

    accuracy                           0.89      4500
   macro avg       0.89      0.89      0.89      4500
weighted avg       0.89      0.89      0.89      4500

---------------------------------
LEMMA MODEL
Accuracy: 0.888
Classification report:
               precision    recall  f1-score   support

           0       0.89      0.89      0.89      2218
           1       0.89      0.89      0.89      2282

    accuracy                           0.89      4500
   macro avg       0.89      0.89      0.89      4500
weighted avg       0.89      0.89      0.89      4500
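
The columns in the reports above follow directly from the four confusion counts (true/false positives and negatives). As a rough sketch of how they relate, the function below recomputes accuracy, precision, recall and F1 for one class from two hypothetical label lists; `binary_metrics` is an illustrative helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision answers "of the reviews predicted positive, how many really were?", while recall answers "of the truly positive reviews, how many did the model find?". With balanced classes and similar error rates in both directions, all four numbers land close together, as in the reports above.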



### Import test data

In [0]:
test_data = pd.read_csv('/content/drive/My Drive/FTMLE - Tonga/Data/movie_review_evaluation.csv', encoding='utf-8', sep='\t')
test_data.head(5)

Unnamed: 0,id,review
0,10633_1,I watched this video at a friend's house. I'm ...
1,4489_1,`The Matrix' was an exciting summer blockbuste...
2,3304_10,This movie is one among the very few Indian mo...
3,3350_3,The script for this movie was probably found i...
4,1119_1,Even if this film was allegedly a joke in resp...


In [0]:
# Use the lemmatization model to predict the test data
predictions = lemma_model.predict(test_data['review'])

In [0]:
d = {'review':test_data['review'], 'sentiment':predictions}
result = pd.DataFrame(d)
result

In [0]:
result.to_csv('andang.csv')