<a href="https://colab.research.google.com/github/arrpak/AmadeusChallenge/blob/master/%5BEN%5D_Week_01_Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip -q install eli5==0.8.1
!pip -q install spacy==2.0.18
!pip -q install keras==2.2.4
!python -m spacy download en

!wget -O imdb.zip -qq --no-check-certificate "https://drive.google.com/uc?export=download&id=1vrQ5czMHoO3pEnmofFskymXMkq_u1dPc"
!unzip imdb.zip

# Text Classification

Let's start with probably the simplest task in NLP - sentiment analysis.

We are going to classify IMDB reviews as positive or negative.

The dataset was taken from http://ai.stanford.edu/~amaas/data/sentiment/

In [None]:
import pandas as pd

train_df = pd.read_csv("train.tsv", delimiter="\t")
test_df = pd.read_csv("test.tsv", delimiter="\t")

print('Train size = {}'.format(len(train_df)))
print('Test size = {}'.format(len(test_df)))

Let's look through the texts.

In [None]:
!head train.tsv

## The very first classifier

We-e-ell, it looks quite straightforward. The review "I simply love this movie" is positive and when a guy says "I would almost recommend this film just so people can truly see a 1/10" he obviously didn't like it.

Is it really that simple?

Check your ability to classify the texts! Find the words and phrases which can be used as an indicator of a positive or a negative review. Add them to the list below.

Our first classifier will predict positive sentiment when the text contains more positive words than negative ones.

In [None]:
#@title Starting classification! { vertical-output: true, display-mode: "form" }
positive_words = 'love', 'great', 'best', 'wonderful' #@param {type:"raw"}
negative_words = 'worst', 'awful', '1/10', 'crap' #@param {type:"raw"}

positives_count = test_df.review.apply(lambda text: sum(word in text for word in positive_words))
negatives_count = test_df.review.apply(lambda text: sum(word in text for word in negative_words))
is_positive = positives_count > negatives_count
correct_count = (is_positive == test_df.is_positive).values.sum()

accuracy = correct_count / len(test_df)

print('Test accuracy = {:.2%}'.format(accuracy))
if accuracy > 0.71:
    from IPython.display import Image, display
    display(Image('https://i.ibb.co/M5wrxZd/Implemented-a-better-than-random-classifier.png', width=500))

**Task** Find good keywords and phrases and achieve at least 71% accuracy on the test set (and don't forget to check the evaluation code of the classifier).

## A bit more serious classifier

### Preprocessing

**Task** Is there anyone who likes these `<br /><br />` inside the reviews? Write a regex that would delete 'em.

In [None]:
import re

pattern = re.compile(<implement it>)

print(train_df['review'].iloc[3])
print(pattern.subn(' ', train_df['review'].iloc[3])[0])

Apply it with the `apply` method:

In [None]:
train_df['review'] = train_df['review'].apply(lambda text: pattern.subn(' ', text)[0])
test_df['review'] = test_df['review'].apply(lambda text: pattern.subn(' ', text)[0])

And it's machine learning time, finally!

First of all, we have to decide how to represent the text. 'Cos you have to explain to the computer that the texts are not just sequences of bytes.

The simplest representation is bag-of-words.

Let's build a very, very large dictionary - a list of all the words in the train set, for instance. Then each text can be translated into a vector whose length equals the dictionary size. Each element of the vector shows how many times the corresponding word appeared in the text:

![bow](https://raw.githubusercontent.com/DanAnastasyev/DeepNLP-Course/master/Week%2001/Images/BOW.png)

Does it sound hard? Actually, we don't even need to implement it - everything works out of the box in sklearn's `CountVectorizer`.

It has the following signature:

```python
CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=r'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
```

Geez, that's a lot of parameters.

Let's pay attention to the `lowercase=True` and `max_df=1.0, min_df=1, max_features=None` parameters. They are mainly about the size of the dictionary.

The first one is about letter case. That is, should we treat, e.g., "The" and "the" as different words or as the same one? And what about "Smith" vs "smith"? In fact, we've just faced our first trade-off: are we willing to shrink the dictionary at the cost of treating some different words as a single one?

The next group is about statistical dictionary reduction. We can keep only the words that are not too rare (e.g. `min_df=5` keeps only the words that appear in at least 5 texts of the train set) and not too frequent (`max_df=0.2` drops the words that appear in more than 20% of the texts). `max_features` caps the dictionary size by keeping only the most frequent words.
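Here is a small sketch of how `min_df` and `max_df` prune the dictionary. The corpus and the thresholds are made up for illustration; note that both parameters count the texts containing a word, not its total number of occurrences.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the IMDB data.
toy_corpus = [
    'the movie was excellent',
    'the movie was awful',
    'the plot was excellent',
    'the ending surprised me',
]

# Keep the words found in at least 2 texts and in at most 75% of the texts.
toy_vectorizer = CountVectorizer(min_df=2, max_df=0.75)
toy_vectorizer.fit(toy_corpus)

# vocabulary_ maps every kept word to its column index.
kept_words = sorted(toy_vectorizer.vocabulary_)
print(kept_words)
```

"the" is dropped because it appears in every text, and words like "awful" or "plot" are dropped because they appear in just one.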

Let's start with a simple example to understand better how it works:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

dummy_data = ['The movie was excellent',
              'the movie was awful']

dummy_matrix = vectorizer.fit_transform(dummy_data)

print(dummy_matrix.toarray())
print(vectorizer.get_feature_names())

*Question: how does the vectorizer find the word's boundaries? Pay attention to `token_pattern=r'(?u)\b\w\w+\b'` - how does such regex work?*
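One way to explore the question is to run the same regex by hand with the `re` module. The sample sentence below is invented for illustration; lowercasing it first mimics `lowercase=True`.

```python
import re

# The default token_pattern of CountVectorizer: runs of two or more
# word characters surrounded by word boundaries.
token_pattern = re.compile(r'(?u)\b\w\w+\b')

# Hypothetical sample text.
sample = "I'd give it a 1/10, but it's a B-movie!"

tokens = token_pattern.findall(sample.lower())
print(tokens)
```

Single-character tokens and punctuation disappear, so a phrase like `1/10` never becomes a dictionary entry on its own.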

Check it on the real data:

In [None]:
vectorizer = CountVectorizer()
vectorizer.fit(train_df['review'].values)

We can explore the words that appear in the dictionary:

In [None]:
vectorizer.get_feature_names()

And that's how it can be used to transform the text to its vector:

In [None]:
vectorizer.transform([train_df['review'].iloc[3]])

Exactly what we were going to get: a vector with 74849 elements (*check how many words there are in the dictionary*) and 207 non-zero elements. It's stored in a sparse matrix for exactly this reason: it's inefficient to create a dense matrix that has almost 75 thousand zeros when we can just store the non-zero positions and the corresponding elements.
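To see what the sparse matrix actually stores, here is a sketch on two made-up sentences (the texts and variable names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform([
    'the movie was excellent',
    'the movie was really really awful',
])

# Shape is (number of texts, dictionary size).
print(toy_matrix.shape)

# Only the non-zero entries are stored: column indices plus their counts.
second_row = toy_matrix[1]
print(second_row.nnz)
print(second_row.indices, second_row.data)

# Converting to a dense array is fine for tiny examples only.
dense_second_row = second_row.toarray()[0].tolist()
print(dense_second_row)
```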

### Classification

Wait, and what should we do next with all these zeros (and 207 non-zeros, meh)? Well, actually we are gonna do the same thing we did in the very first assignment. Some words are positive, others are negative. This time let's use ML to decide which is which!

![bow with weights](https://github.com/DanAnastasyev/DeepNLP-Course/raw/master/Week%2001/Images/BOW_weights.png)

Look, we have the vector with 5 ones. "the", "movie" and "was" are obviously neutral. "Good" should be positive, "ridiculously" - probably negative. And the whole sentence should be positive. So the ML algorithm has to find the weights for each word. That is, larger positive weights would be given to the most certainly positive words.

For instance, let's consider this two-sentence set:
```
1   The movie was excellent
0   the movie was awful
```

You can probably guess that `+1` weight should be assigned to the `excellent` word, `-1` to the `awful` and every other word would have zero weight.

We are gonna use a linear model which will find the weights.
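Before letting a model learn the weights, here is a sketch of what "a weight per word" means in code. The weights below are set by hand (an assumption for illustration, not something learnt from data), and the score is just the weighted sum of the word counts.

```python
import numpy as np

# Hypothetical hand-picked weights, one per dictionary word.
toy_vocabulary = ['awful', 'excellent', 'movie', 'the', 'was']
hand_weights = np.array([-1.0, 1.0, 0.0, 0.0, 0.0])


def bag_of_words(text):
    """Count how many times each dictionary word occurs in the text."""
    tokens = text.lower().split()
    return np.array([tokens.count(word) for word in toy_vocabulary], dtype=float)


def sentiment_score(text):
    """Weighted sum of word counts: above zero reads as positive."""
    return float(bag_of_words(text) @ hand_weights)


print(sentiment_score('The movie was excellent'))
print(sentiment_score('the movie was awful'))
```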

Let's try Logistic regression on our super-small set:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

dummy_data = ['The movie was excellent',
              'the movie was awful']
dummy_labels = [1, 0]

vectorizer = CountVectorizer()
classifier = LogisticRegression()

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(dummy_data, dummy_labels)

for word, coef in zip(vectorizer.get_feature_names(), classifier.coef_[0]):
    print(word, coef, sep='\t')

Well, it did exactly what we expected.

It's time to try it on the real data:

In [None]:
model.fit(train_df['review'], train_df['is_positive'])

In [None]:
from sklearn.metrics import accuracy_score

def eval_model(model, test_df):
    preds = model.predict(test_df['review'])
    print('Test accuracy = {:.2%}'.format(accuracy_score(test_df['is_positive'], preds)))
    return accuracy_score(test_df['is_positive'], preds)
    
accuracy = eval_model(model, test_df)

assert accuracy > 0.86, 'Guess, something went wrong'

Wow, we got some results! And it's even better than the score of our first classifier!

Right now we should move to the most crucial part of all of NLP - analysis of the model. Luckily, with logistic regression it's extremely simple.

We'll use the great [ELI5](https://eli5.readthedocs.io/en/latest/) library for it.

In [None]:
import eli5

eli5.show_weights(classifier, vec=vectorizer, top=40)

What were we talking about?! Our model learnt to distinguish positive and negative words! No magic - just statistics!

We can also check its work on specific reviews:

In [None]:
print('Positive' if test_df['is_positive'].iloc[1] else 'Negative')
eli5.show_prediction(classifier, test_df['review'].iloc[1], vec=vectorizer, 
                     targets=['positive'], target_names=['negative', 'positive'])

You see, some words have a positive connotation, others are negative. You can see which is which (and if you hover your cursor over a specific word, you'll see its weight). The review above contains mostly positive words, and the model predicts correctly that the review is positive.

What are we going to see in a negative review?

In [None]:
print('Positive' if test_df['is_positive'].iloc[0] else 'Negative')
eli5.show_prediction(classifier, test_df['review'].iloc[0], vec=vectorizer, 
                     targets=['positive'], target_names=['negative', 'positive'])

Well, not bad, is it?

And most essential of all: let's try to understand when and why the model makes wrong predictions.

In [None]:
import numpy as np

preds = model.predict(test_df['review'])
incorrect_pred_index = np.random.choice(np.where(preds != test_df['is_positive'])[0])

eli5.show_prediction(classifier, test_df['review'].iloc[incorrect_pred_index],
                     vec=vectorizer, targets=['positive'], target_names=['negative', 'positive'])

What do you think can be improved?

## Additional Features

### Tf-idf

Our model doesn't take into account any information about the words except their count in the specific text.

What if we want to help it? We already discussed that some words are more meaningful than others. And the words that appeared in most train texts are probably not too meaningful for our model. On the other hand, some words can be found just in a handful of texts.

Let's explain to the model that it should pay more attention to the second type of words rather than the first ones.

The simplest way to do it is to apply *tf-idf* weighting:
$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$

*tf* (term-frequency) is the count of occurrences of the word `t` in specific text `d`. This is the count yielded by `CountVectorizer`.

*idf* (inverse document-frequency) is a term that decreases as the number of texts containing the given word grows. It can be calculated this way:
$$\text{idf}(t) = \log\frac{1 + n_d}{1 + n_{d(t)}} + 1$$
where $n_d$ is the whole number of texts and $n_{d(t)}$ is the number of texts with the word `t`.
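As a sanity check, the formula can be compared with what `TfidfVectorizer` computes. This is a sketch on a made-up corpus; it relies on the vectorizer's default settings, which use the natural logarithm with the smoothing shown above and also L2-normalize each row afterwards.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus.
toy_corpus = [
    'the movie was excellent',
    'the movie was awful',
    'the plot was thin',
]

toy_tfidf = TfidfVectorizer()
toy_tfidf.fit(toy_corpus)

n_docs = len(toy_corpus)


def idf_by_formula(n_docs_with_word):
    """idf(t) = log((1 + n_d) / (1 + n_d(t))) + 1"""
    return np.log((1 + n_docs) / (1 + n_docs_with_word)) + 1


# idf_ holds one idf value per dictionary column.
idf_of = {word: toy_tfidf.idf_[index] for word, index in toy_tfidf.vocabulary_.items()}

for word, texts_with_word in [('the', 3), ('movie', 2), ('awful', 1)]:
    print(word, idf_of[word], idf_by_formula(texts_with_word))
```

The word found in every text gets the smallest idf, and the rarest word gets the largest one.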

Well, you probably don't even need to remember the formula above. Just think about it as an unsupervised way to calculate additional word weight.

To use it, just replace `CountVectorizer` with `TfidfVectorizer`.

**Task** Use `TfidfVectorizer`, Luke! Check the quality of the model and look at its mistakes. Find the sentences that it started to annotate correctly compared to the previous model and find newly added mistakes.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

<use TfidfVectorizer model>

accuracy = eval_model(model, test_df)

assert accuracy > 0.88, 'Guess, something went wrong'

### Word N-grams

Up to this moment, we considered texts as a simple bag-of-words. But, hey, there is a difference between `good movie` and `not good movie`!

It's time to provide our model with such information. We are going to extract word bigrams. For instance, we will split `not good movie` into two bigrams: `not good` and `good movie`. Additionally, we will extract unigrams like we did before: `not`, `good` and `movie`.

In both Vectorizers you can find the `ngram_range=(n_1, n_2)` parameter. It sets the smallest and the largest n-gram size to extract: `ngram_range=(1, 2)` means that we are going to extract uni- and bigrams.
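A quick sketch of what the vectorizer extracts from the phrase above when `ngram_range=(1, 2)` is set (the variable names are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(['not good movie'])

# Every unigram and bigram becomes its own dictionary entry.
extracted_ngrams = sorted(bigram_vectorizer.vocabulary_)
print(extracted_ngrams)
```

Keep in mind that the dictionary grows quickly with larger n-gram ranges, so `min_df` and `max_features` become more useful here.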

**Task** Increase the ngram range and evaluate the model.

In [None]:
<implement it>

accuracy = eval_model(model, test_df)
assert accuracy > 0.865, 'Guess, something went wrong'

### Symbol N-grams

Symbol n-grams give us a quick-and-dirty way to learn useful subwords without any linguistics.

For example, the word `badass` can be represented as the following set of trigrams:  
`##b #ba bad ada das ass ss# s##`

So interpretable, isn't it?

And again, it's very easy to implement. Just use `analyzer='char'` in your favourite Vectorizer and decide (or find using cross-validation) `ngram_range`.
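Here is a sketch of character trigrams on a single word. One difference from the picture above: with `analyzer='char'` sklearn does not add any `#` padding, so only the trigrams that lie fully inside the text are produced.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Extract character trigrams only.
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(3, 3))
char_vectorizer.fit(['badass'])

char_trigrams = sorted(char_vectorizer.vocabulary_)
print(char_trigrams)
```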

**Task** Implement this stuff and don't forget to visualise it in ELI5 (it gives very nice results, really).

In [None]:
<implement it>

accuracy = eval_model(model, test_df)
assert accuracy > 0.875, 'Guess, something went wrong'

## Linguistic-Driven Features

### Lemmatization

Do we really need to distinguish the forms of a single word? For instance, why should the singular and plural forms of a word have different weights?

**Task** Find forms of a word with different semantics according to the model (based on the model weights).

To do something with this stuff, we can apply another nice library - [spacy](https://spacy.io).

In [None]:
import spacy
from spacy import displacy

nlp = spacy.load('en', disable=['parser'])

docs = [doc for doc in nlp.pipe(train_df.review.values[:50])]

In [None]:
for token in docs[0]:
    print(token.text, token.lemma_)

You can process the whole corpus (the set of available texts), but to make things faster I preprocessed everything for you.

To download preprocessed documents you'll have to authorize the request to Google Drive API, sorry.

In [None]:
# Install the PyDrive wrapper & import libraries.
# This only needs to be done once per notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

downloaded = drive.CreateFile({'id': '1d1-5FwxK53ePwygNWeG7jhsOWZbi5HOv'})
downloaded.GetContentFile('train_docs.pkl')

downloaded = drive.CreateFile({'id': '1MMOY477t965G0C5DtXeREVp0X85UaNq5'})
downloaded.GetContentFile('test_docs.pkl')

In [None]:
import pickle

with open('train_docs.pkl', 'rb') as f:
    train_docs = pickle.load(f)
    
with open('test_docs.pkl', 'rb') as f:
    test_docs = pickle.load(f)

The preprocessed data consists of the list of word-lemma pairs (the first element in the tuple):

In [None]:
train_docs[0][0]

and NER information (we'll discuss it a bit later):

In [None]:
train_docs[0][1]

**Task** Apply classifier to the lemmatized texts.

### Stemming

A more computationally efficient (and much dirtier) way to normalize words is to use stemming. It's dumb and doesn't take into account the context of the words, but this is why it's so efficient.

Basically, it's just a set of rules for processing a word to obtain its stem.

In [None]:
from nltk import PorterStemmer

stemmer = PorterStemmer()

print(stemmer.stem('become'))
print(stemmer.stem('becomes'))
print(stemmer.stem('became'))

**Task** Try to replace lemmas with stems.

### NER

NER stands for Named-Entity Recognition. It's the process of finding entities in texts, such as:

In [None]:
displacy.render(docs[0], style='ent', jupyter=True)

The entities are described [here](https://spacy.io/api/annotation#named-entities).

The preprocessed data contains the entity type and its coordinates in the original text:

In [None]:
ner_data = train_docs[0][1]
print(train_df.review.iloc[0][ner_data[0][0]: ner_data[0][1]], ner_data[0][2])
print(train_df.review.iloc[0][ner_data[2][0]: ner_data[2][1]], ner_data[2][2])

Are we sure that a guy like Depp carries a semantic connotation? Or has our classifier just learnt that some actors appear in positive reviews more frequently than in negative ones?

**Task** Collect the list of entities in the preprocessed data and find which of them are strongly positive or strongly negative.

**Task** Remove entities (or rather some of them) from the texts. Check the classifier on normalized texts.

## Deep Learning Time!

The tiny fraction of you who are still reading this are probably wondering why we haven't used any deep learning in a DL for NLP course.

Let's use a fairly standard text classification model - a convolutional neural network over word embeddings.

We are gonna dig into what it is in the following notebooks. Right now we'll just use it :)

### Preprocessing
Well, things are not that simple, actually... We have to prepare the texts for the neural network.

Every text should be represented as a sequence of words - this is the main difference from the bag-of-words representation.

And it's not that simple to obtain the sequence. First of all, the raw texts have to be *tokenized*. This is the process of splitting text into tokens.

Actually, spacy did it for us already:

In [None]:
train_tokenized_texts = [[token for token, _ in doc[0]] for doc in train_docs]
test_tokenized_texts = [[token for token, _ in doc[0]] for doc in test_docs]

Compare the untokenized text:

In [None]:
train_df.review.iloc[0]

With the tokenized one:

In [None]:
' '.join(train_tokenized_texts[0])

Well, we still need to know the sequence length in advance. To speed up the process we can truncate reviews that are too long. Let's use the histogram to understand where to cut:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

_, _, hist = plt.hist([len(text) for text in train_tokenized_texts], bins='auto')
hist
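Besides eyeballing the histogram, a percentile of the text lengths gives a rough cut-off. The sketch below uses made-up lengths and a hypothetical 90% target; on the real data you would pass the lengths of `train_tokenized_texts` instead.

```python
import numpy as np

# Hypothetical text lengths (in tokens), for illustration only.
toy_lengths = np.array([120, 150, 180, 200, 230, 260, 300, 400, 900, 2400])

# Length that roughly 90% of the texts do not exceed.
cutoff = int(np.percentile(toy_lengths, 90))

# Share of texts that fit into the cut-off without truncation.
coverage = float(np.mean(toy_lengths <= cutoff))

print(cutoff, coverage)
```

A shorter cut-off makes training faster, while a longer one loses less text, so the right percentile is a judgment call.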

Another problem: the words have to be enumerated. Actually, it's quite similar to the stuff that Vectorizers did for us. We have to build the dictionary (again, remember?) which will map each word to its index.

**Task** Choose the minimum number of word occurrences which we consider interesting. 

In [None]:
from collections import Counter

def build_word2idx(tokenized_texts, min_df):
    words_counter = Counter((word.lower() for text in tokenized_texts for word in text))

    word2idx = {
        '<pad>': 0,
        '<unk>': 1
    }
    for word, count in words_counter.most_common():
        if count < min_df:
            break

        word2idx[word] = len(word2idx)
    return word2idx
    
    
word2idx = build_word2idx(train_tokenized_texts, <choose me>)
print('Words count:', len(word2idx))

Let's check what we just did:

In [None]:
word2idx

Using `word2idx` and a `max_text_len` constant chosen from the histogram, we are able to convert the texts to a matrix.

Wait, what the hell matrix am I talking about?

Remember, the Vectorizers converted a list of texts to a matrix in the following way:

$$
\begin{align}
& \text{the movie was excellent} \\
& \text{the movie was extremely awful}
\end{align}
\to
 \begin{pmatrix}
  1 & 1 & 1 & 1 & 0 & 0 \\
  1 & 1 & 1 & 0 & 1 & 1
 \end{pmatrix}
$$

The first row in the matrix corresponds to the first sentence, the second... well, you already know, yeah.

The first column says how many times `the` appears in this specific sentence, the last is about how many times `awful` appeared, and so on.

Such a representation gives us a fixed-length vector that doesn't preserve the word order.

Now we are going to build an (almost) variable-length vector for each text. Its length will be the minimum of the text length and the `max_text_len` constant.

In each position of the vector we are going to write the index of the corresponding word in the text. And the `word2idx` mapping will give us the index.

For instance, for the two sentences above we'll have this `word2idx`:

In [None]:
dummy_texts = ['the movie was excellent', 'the movie was extremely awful']
dummy_tokenized_texts = [text.split() for text in dummy_texts]

dummy_word2idx = build_word2idx(dummy_tokenized_texts, 0)

dummy_word2idx

Basically, it's the same mapping as in the Vectorizer except for the first two elements.

We need `'<unk>'` to be able to map unknown words (e.g. words with frequency lower than the `min_df` threshold) to some index.

And we need `'<pad>'` for... well, let's see how the sentences can be converted:

$$
\begin{align}
& \text{the movie was excellent} \\
& \text{the movie was extremely awful}
\end{align}
\to
 \begin{pmatrix}
  2 & 3 & 4 & 5 & 0 \\
  2 & 3 & 4 & 6 & 7
 \end{pmatrix}
$$

We need the `'<pad>'` symbol to be able to build a matrix from the variable-length vectors. We just pad the shorter vectors with zeros and hope that the neural net will figure out that it should ignore these zeros.

The conversion of each text can be performed in the following way:

In [None]:
[dummy_word2idx.get(token, 1) for token in dummy_tokenized_texts[0]]

The `get` method of our mapping returns 1 (the index of `'<unk>'`) each time it meets a word that is not in the mapping.

To make use of the paddings, you can use the following trick: fill initial matrix with zeros (`np.zeros(shape)`) and replace zeros with real elements:

In [None]:
dummy_data = np.zeros((2, 5), dtype=int)  # np.int was removed from recent NumPy releases

dummy_data[0, :len(dummy_tokenized_texts[0])] = [dummy_word2idx.get(token, 1) for token in dummy_tokenized_texts[0]]

dummy_data[0]

**Task** Convert the whole train and test sets to matrices in a similar way to the one described above.

In [None]:
def convert(texts, word2idx, max_text_len):
    data = np.zeros((len(texts), max_text_len), dtype=int)  # np.int was removed from recent NumPy releases
    
    <implement it>

    return data

X_train = convert(train_tokenized_texts, word2idx, 1000)
X_test = convert(test_tokenized_texts, word2idx, 1000)

### Classification

Let's finally run the training of the model.

To train a model in Keras you have to:
1. Define the model, e.g.:
```python 
model = Sequential()
model.add(Dense(1, activation='sigmoid', input_dim=NUM_WORDS))
```
2. Set the loss function and optimizer (and metrics, if you want):
```python
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

3. Run the training:
```python
model.fit(X_train, y_train, 
             batch_size=32, epochs=3,
             validation_data=(X_test, y_test))
```

In NLP you usually deal with classification tasks so you have to know three loss functions:

*   **categorical_crossentropy** - for multiclass classification with one-hot encoding vectors as the labels
*   **sparse_categorical_crossentropy** - like the previous one but with indices of correct classes, not one-hot encoding vectors
*   **binary_crossentropy** - for binary classification
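The difference between the three losses is mostly about the label format. Here is a sketch in plain NumPy (not the Keras implementation) with made-up predicted probabilities:

```python
import numpy as np

# Hypothetical predicted class probabilities for 2 texts and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

# The same labels in two formats: class indices and one-hot vectors.
sparse_labels = np.array([0, 2])
one_hot_labels = np.eye(3)[sparse_labels]

# categorical_crossentropy: labels are one-hot vectors.
categorical_loss = -np.mean(np.sum(one_hot_labels * np.log(probs), axis=1))

# sparse_categorical_crossentropy: labels are indices of the correct classes.
sparse_loss = -np.mean(np.log(probs[np.arange(len(sparse_labels)), sparse_labels]))

# binary_crossentropy: a single probability of the positive class per text.
positive_probs = np.array([0.9, 0.2])
binary_labels = np.array([1, 0])
binary_loss = -np.mean(binary_labels * np.log(positive_probs)
                       + (1 - binary_labels) * np.log(1 - positive_probs))

print(categorical_loss, sparse_loss, binary_loss)
```

The first two numbers coincide: the two losses compute the same thing from differently encoded labels.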

As an optimizer you will most probably use `adam`.

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Conv1D, GlobalMaxPooling1D

model = Sequential([
    Embedding(input_dim=len(word2idx), output_dim=64, input_shape=(X_train.shape[1],)),
    Conv1D(filters=128, kernel_size=3, padding='valid', activation='relu', strides=1),
    GlobalMaxPooling1D(),
    Dense(units=1, activation='sigmoid')
])

model.summary()
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
model.fit(X_train, train_df.is_positive, batch_size=128, epochs=5, 
          validation_data=(X_test, test_df.is_positive))

Let's evaluate the model on the test set.

In [None]:
loss, accuracy = model.evaluate(X_test, test_df.is_positive)

assert accuracy > 0.86

## Bits of Maths

We used logistic regression so many times today. Let's go over how it works.

In [None]:
np.random.seed(42)
# Labels are 0/1, which is what the cross-entropy loss below expects.
w, X, y = np.random.random(10), np.random.random((11, 10)), np.random.randint(0, 2, 11)

First of all, it's a simple linear function with sigmoid activation:

$$h_w(X) = \sigma(X w),$$
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Calculate it:

In [None]:
def forward(X, w):
    """ Function that calcs h(X, w)
    X - matrix (n, m)
    w - vector (m,)
    """
    <calc h>

In [None]:
assert np.allclose(forward(X, w), [0.93470525, 0.86230993, 0.89161691, 0.89640013, 0.95225746, 0.94261494, 
                                   0.83213636, 0.94034399, 0.9093287, 0.9004457, 0.94200167])

Its loss function looks like this:
$$J(w)  = \frac{1}{n} \left(-y^T \log(h_w) - (1-y)^T \log(1 - h_w)\right)$$
where $n$ is the number of rows in $X$, that is, the number of examples.

To use gradient descent optimization, we need to calculate the gradient
$\frac{\partial J(w)}{\partial w}$.
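If you want to check your derivation, here is one way to get the gradient. It assumes labels $y_i \in \{0, 1\}$ and writes $N$ for the number of examples (the rows of $X$) that normalizes $J(w)$, $x_i$ for the $i$-th row of $X$ and $h_i = \sigma(x_i^T w)$. It uses the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

For a single example:

$$\frac{\partial}{\partial w}\left[-y_i \log h_i - (1 - y_i) \log(1 - h_i)\right] = \left(-\frac{y_i}{h_i} + \frac{1 - y_i}{1 - h_i}\right) h_i (1 - h_i) \, x_i = (h_i - y_i) \, x_i$$

Averaging over all examples gives the matrix form:

$$\frac{\partial J(w)}{\partial w} = \frac{1}{N} X^T \left(h_w(X) - y\right)$$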

**Task** Implement them:

In [None]:
def loss(X, w, y):
    """ Loss function
    X - matrix (n, m)
    w - vector (m,)
    y - vector (n,)
    """
    <calc loss>


def gradient(X, w, y):
    """ Loss function gradient over w
    X - matrix (n, m)
    w - vector (m,)
    y - vector (n,)
    """
    <calc grad>


print('loss = ', loss(X, w, y))
print('gradient = ', gradient(X, w, y))

assert gradient(X, w, y).shape == w.shape

Check yourself: compare your results with the gradient calculated using finite differences:

$$[\nabla f(x)]_i \approx \frac{f(x + \varepsilon \cdot e_i) - f(x)}{\varepsilon}$$

where $e_i = (0, ... , 0, 1, 0, ..., 0), \varepsilon \approx 10^{-8}$

In [None]:
def grad_finite_diff(func, x, eps=1e-8):
    """
    func - scalar function of x
    x - vector (m,), the point where the gradient is estimated
    eps - constant
    """
    x = x.astype(np.float64)
    fval = func(x)
    
    E = np.eye(x.shape[0])
    deltas = np.array([func(x + eps * E[i]) for i in range(x.shape[0])])
    return (deltas - fval) / eps


mat_grad = gradient(X, w, y)
num_grad = grad_finite_diff(lambda w: loss(X, w, y), w)

err = np.max(np.abs(mat_grad - num_grad))
print('err = ', err, 'ok' if err < 1e-6 else 'the error is too large =(')

As a result, we get the following class:

In [None]:
class LogisticRegression:
    def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True):
        self._lr = lr
        self._num_iter = num_iter
        self._fit_intercept = fit_intercept
    
    def _add_intercept(self, X):
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)
        
    def fit(self, X, y):
        if self._fit_intercept:
            X = self._add_intercept(X)
        
        self._w = np.zeros(X.shape[1])
        
        for i in range(self._num_iter):
            grad = gradient(X, self._w, y)
            self._w -= self._lr * grad
    
    def predict_prob(self, X):
        if self._fit_intercept:
            X = self._add_intercept(X)
    
        return forward(X, self._w)
    
    def predict(self, X, threshold):
        return self.predict_prob(X) >= threshold

Let's check it on some simple datasets:

In [None]:
from sklearn.datasets import make_moons, load_iris

def visualize(model, X, y):
    plt.figure(figsize=(10, 6))
    plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0')
    plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1')
    plt.legend()

    x1_min, x1_max = X[:,0].min(), X[:,0].max(),
    x2_min, x2_max = X[:,1].min(), X[:,1].max(),
    xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
    grid = np.c_[xx1.ravel(), xx2.ravel()]
    probs = model.predict_prob(grid).reshape(xx1.shape)
    plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='black')

    
iris = load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1

model = LogisticRegression(lr=0.1, num_iter=300000)
model.fit(X, y)

visualize(model, X, y)

In [None]:
X, y = make_moons(noise=0.1)

model = LogisticRegression()
model.fit(X, y)

visualize(model, X, y)

That's all, folks!