# Logistic Regression and Boosting Algorithms

© Data Trainers LLC. GPL v 3.0.

**Author:** Axel Sirota


## Predicting a Single Categorical Response
---



### Installing stuff

In [1]:
!pip install --upgrade textblob spacy 'gensim==4.2.0' swifter keras_preprocessing

Collecting textblob
  Downloading textblob-0.18.0.post0-py3-none-any.whl (626 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m626.3/626.3 kB[0m [31m5.1 MB/s[0m eta [36m0:00:00[0m
Collecting gensim==4.2.0
  Downloading gensim-4.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.0/24.0 MB[0m [31m20.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting swifter
  Downloading swifter-1.4.0.tar.gz (1.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m29.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting keras_preprocessing
  Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.6/42.6 kB[0m [31m555.8 kB/s[0m eta [36m0:00:00[0m
Building wheels for collected packages: swifter
  Building wheel for swifter (setup.p

In [2]:
!python -m textblob.download_corpora lite
!python -m spacy download en_core_web_sm

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
Finished.
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m32.6 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the pa

In [3]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB         # Naive Bayes
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer

import spacy
import gensim
import warnings
import nltk
warnings.filterwarnings('ignore')
nltk.download('punkt')
textblob_tokenizer = lambda x: TextBlob(x).words


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
%%writefile get_data.sh
if [ ! -f yelp.csv ]; then
  wget -O yelp.csv https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0
fi

Writing get_data.sh


In [5]:
!bash get_data.sh

--2024-05-09 10:54:10--  https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.85.18, 2620:100:6023:18::a27d:4312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.85.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/xds4lua69b7okw8/yelp.csv [following]
--2024-05-09 10:54:10--  https://www.dropbox.com/s/raw/xds4lua69b7okw8/yelp.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc4752c9673a779e7371b359b3fa.dl.dropboxusercontent.com/cd/0/inline/CSnIX4QyaijLKln110s_HiCm3h25EHo2W-SPKyvfZr52-MYot6r_3Yl_2BHyhClKkTozhbLoo3So32bSlMYIBcLOWK6d56pBbKbBZDGePGbJJm8ka9Y76vEKvf1B86EK8_Y9thJmdoSHKDaKL-8uYAyO/file# [following]
--2024-05-09 10:54:11--  https://uc4752c9673a779e7371b359b3fa.dl.dropboxusercontent.com/cd/0/inline/CSnIX4QyaijLKln110s_HiCm3h25EHo2W-SPKyvfZr52-MYot6r_3Yl_2BHyhClKkTozhbLoo3So32bSlMYIBcLOWK6d56pBbKbBZDGePG

In [6]:
# Read yelp.csv into a DataFrame.
path = './yelp.csv'
yelp = pd.read_csv(path)
# Create a new DataFrame that only contains the 5-star and 1-star reviews.
yelp_best_worst = yelp[ (yelp.stars == 1) | (yelp.stars == 5) ]

# Define X and y.
X = yelp_best_worst.text
y = yelp_best_worst.stars

# Split the new DataFrame into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

<a id="using-logistic-regression-for-classification"></a>
## Using Logistic Regression for Classification
---



In [7]:
# Fit a logistic regression model to predict stars from text
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

logreg.fit(X,y)


ValueError: could not convert string to float: 'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'

Of course this simply fails, we need to preprocess the text, convert it into a Tensor format and then and only then we can use models!

### Converting text to vectors

In [8]:
import re
nltk.download('stopwords')
my_stopwords = nltk.corpus.stopwords.words('english')
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'


def preprocess_text(text, should_join=True):
    text = ' '.join(word.lower() for word in textblob_tokenizer(text))
    text = re.sub(r'http\S+', '', text) # remove http links
    text = re.sub(r'bit.ly/\S+', '', text) # rempve bitly links
    text = text.strip('[link]') # remove [links]
    text = re.sub('['+my_punctuation + ']+', ' ', text) # remove punctuation
    text = re.sub('\s+', ' ', text) #remove double spacing
    text = re.sub(r"[^a-zA-Z.,&!?]+", r" ", text) # only normal characters
    text_token_list = [word for word in text.split(' ')
                            if word not in my_stopwords] # remove stopwords
    text_token_list = [word_rooter(word) if '#' not in word else word
                        for word in text_token_list] # apply word rooter
    text = ' '.join(text_token_list)
    if should_join:
      return ' '.join(gensim.utils.simple_preprocess(text))
    else:
      return gensim.utils.simple_preprocess(text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [9]:
X.head()

0    My wife took me here on my birthday for breakf...
1    I have no idea why some people give bad review...
3    Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4    General Manager Scott Petello is a good egg!!!...
6    Drop what you're doing and drive here. After I...
Name: text, dtype: object

In [10]:
# Apply the preprocessing to the dataset
import swifter
X_preprocessed = X.swifter.apply(preprocess_text)

Pandas Apply:   0%|          | 0/4086 [00:00<?, ?it/s]

In [11]:
X_preprocessed[0]

'wife took birthday breakfast excel weather perfect made sit outsid overlook ground absolut pleasur waitress excel food arriv quickli semi busi saturday morn look like place fill pretti quickli earlier get better favor get bloodi mari phenomen simpli best ever pretti sure use ingredi garden blend fresh order amaz everyth menu look excel white truffl scrambl egg veget skillet tasti delici came piec griddl bread amaz absolut made meal complet best toast ever anyway ca wait go bac'

How do we pass from text to numbers? With tokenizers. We will use Tensorflow ones!

In [12]:
# Find a set named vocab that has all unique words
vocab = set()
for review in X_preprocessed:
  for word in textblob_tokenizer(review):
    if word not in vocab:
      vocab.add(word)

In [13]:
print(f'{len(vocab)} unique words')

13140 unique words


In [14]:
# Implement this method
def get_maximum_review_length(srs):
    maximum = 0
    for review in srs:
      if len(textblob_tokenizer(review)) > maximum:
        maximum = len(textblob_tokenizer(review))
    return maximum


maximum = get_maximum_review_length(X_preprocessed)

In [15]:
print(f'The maximum review was {maximum} words long')

The maximum review was 477 words long


In [16]:
from tensorflow.keras.layers.experimental import preprocessing
ids_from_words = preprocessing.StringLookup(vocabulary=list(vocab), mask_token=None)

In [17]:
words_from_ids = preprocessing.StringLookup(
    vocabulary=ids_from_words.get_vocabulary(), invert=True, mask_token=None)

In [18]:
import tensorflow as tf
def text_from_ids(ids):
  return tf.strings.reduce_join(words_from_ids(ids), axis=-1, separator=' ')

In [21]:
ids = ids_from_words(textblob_tokenizer('Only you can prevent forest fires'))
ids

<tf.Tensor: shape=(6,), dtype=int64, numpy=array([    0,     0, 11093,  7429,  4842,     0])>

In [20]:
preprocess_text('Only you can prevent forest fires', should_join=False)

['prevent', 'forest', 'fire']

In [22]:
text_from_ids(ids)


<tf.Tensor: shape=(), dtype=string, numpy=b'[UNK] [UNK] can prevent forest [UNK]'>

In [23]:
def pad_sequence_of_tokens(x, maxlen, unk_token='[UNK]'):
  if len(x)<maxlen:
    x.extend([unk_token]*(maxlen-len(x)))
  return x

In [24]:
from keras_preprocessing.sequence import pad_sequences
# Very useful method to keep in mind
def get_ids_tensor(srs):

  processed = srs.swifter.apply(lambda x: pad_sequence_of_tokens(preprocess_text(x, should_join=False), maxlen=maximum)).to_list()
  return tf.squeeze(tf.constant(pad_sequences(ids_from_words(processed), maxlen=maximum, padding='post'), dtype='int32'))



In [25]:
all_ids = get_ids_tensor(srs=X_preprocessed.reset_index(drop=True))
all_ids

Pandas Apply:   0%|          | 0/4086 [00:00<?, ?it/s]

<tf.Tensor: shape=(4086, 477), dtype=int32, numpy=
array([[ 3805,   945,  1764, ...,     0,     0,     0],
       [ 4657,   250, 12429, ...,     0,     0,     0],
       [ 1102, 12778,  5746, ...,     0,     0,     0],
       ...,
       [10982,  2039, 11819, ...,     0,     0,     0],
       [ 8871, 12889,  3370, ...,     0,     0,     0],
       [ 7121,  1545,  7508, ...,     0,     0,     0]], dtype=int32)>

In [26]:
all_ids.shape

TensorShape([4086, 477])

In [27]:
# Split the all_ids into.a train a test sets
X_train, X_test, y_train, y_test = train_test_split(all_ids.numpy(), y, test_size=0.25, random_state=42)

### Using Logistic Regression

In [29]:

from sklearn.metrics import classification_report
# Train a Logistic Regression on X_train and give the accuracy
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           1       0.30      0.08      0.13       199
           5       0.81      0.96      0.88       823

    accuracy                           0.78      1022
   macro avg       0.56      0.52      0.50      1022
weighted avg       0.71      0.78      0.73      1022



## Using Boosting Algorithms and other things

In [30]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=50, learning_rate=0.5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.30      0.14      0.19       199
           5       0.82      0.92      0.87       823

    accuracy                           0.77      1022
   macro avg       0.56      0.53      0.53      1022
weighted avg       0.72      0.77      0.73      1022



In [31]:
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(n_estimators=50, learning_rate=0.5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.43      0.03      0.06       199
           5       0.81      0.99      0.89       823

    accuracy                           0.80      1022
   macro avg       0.62      0.51      0.47      1022
weighted avg       0.73      0.80      0.73      1022



In [32]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=50)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.27      0.02      0.03       199
           5       0.81      0.99      0.89       823

    accuracy                           0.80      1022
   macro avg       0.54      0.50      0.46      1022
weighted avg       0.70      0.80      0.72      1022



## Multiclass Classification

Just check in the estimators, most support multiclass classification.

In [33]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, multi_class='multinomial').fit(X, y)
clf.predict(X[:2, :])
clf.predict_proba(X[:2, :])
clf.score(X, y)

0.9733333333333334

### **Homework**: Try to perform the stars classification with Logistic Regression but without filtering only for 5 and 1 stars.