# Logistic Regression and Boosting Algorithms

© Data Trainers LLC. GPL v 3.0.

**Author:** Axel Sirota


## Predicting a Single Categorical Response
---



### Installing stuff

In [1]:
!pip install --upgrade textblob spacy 'gensim==4.2.0' swifter keras_preprocessing

Collecting spacy
  Downloading spacy-3.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.6/6.6 MB[0m [31m43.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting gensim==4.2.0
  Downloading gensim-4.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.0/24.0 MB[0m [31m62.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting swifter
  Downloading swifter-1.4.0.tar.gz (1.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m69.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting keras_preprocessing
  Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.6/42.6 kB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m
Collecting weasel<0.4.0,>=0.1.0 (from spacy)
  Download

In [2]:
!python -m textblob.download_corpora lite
!python -m spacy download en_core_web_sm

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
Finished.
2023-11-20 13:35:36.189172: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-20 13:35:36.189257: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-20 13:35:36.189305: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attemp

In [3]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB         # Naive Bayes
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer

import spacy
import gensim
import warnings
import nltk
warnings.filterwarnings('ignore')
nltk.download('punkt')
textblob_tokenizer = lambda x: TextBlob(x).words


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
%%writefile get_data.sh
if [ ! -f yelp.csv ]; then
  wget -O yelp.csv https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0
fi

Writing get_data.sh


In [5]:
!bash get_data.sh

--2023-11-20 13:36:14--  https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.80.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.80.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/xds4lua69b7okw8/yelp.csv [following]
--2023-11-20 13:36:15--  https://www.dropbox.com/s/raw/xds4lua69b7okw8/yelp.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc06aecaa09f7d3482e2988fa664.dl.dropboxusercontent.com/cd/0/inline/CH5ZqZHoonL-FOlfPy2NS6JFpXRaQi_iaCX7oWqY1WqUmqCbLILir992JSeHGE37wfOmex_im1cFHbQ8a-r-PbYix3dDKlcL2Vrisz-fBzj9lB-FBtT1k1esltkiI8VydXWrN-fRWOun3zEjRtZQk6JR/file# [following]
--2023-11-20 13:36:16--  https://uc06aecaa09f7d3482e2988fa664.dl.dropboxusercontent.com/cd/0/inline/CH5ZqZHoonL-FOlfPy2NS6JFpXRaQi_iaCX7oWqY1WqUmqCbLILir992JSeHGE37wfOmex_im1cFHbQ8a-r-PbYix3dDKlcL2Vrisz-fBzj

In [7]:
# Read yelp.csv into a DataFrame.
path = './yelp.csv'
yelp = pd.read_csv(path)
# Create a new DataFrame that only contains the 5-star and 1-star reviews.
yelp_best_worst = yelp[ (yelp.stars == 1) | (yelp.stars == 5) ]
yelp_best_worst

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
6,zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I...,review,wFweIWhv2fREZV_dYkz_1g,7,7,4
...,...,...,...,...,...,...,...,...,...,...
9990,R8VwdLyvsp9iybNqRvm94g,2011-10-03,pcEeHdAJPoFNF23es0kKWg,5,Yes I do rock the hipster joints. I dig this ...,review,b92Y3tyWTQQZ5FLifex62Q,1,1,1
9991,WJ5mq4EiWYAA4Vif0xDfdg,2011-12-05,EuHX-39FR7tyyG1ElvN1Jw,5,Only 4 stars? \n\n(A few notes: The folks that...,review,hTau-iNZFwoNsPCaiIUTEA,1,1,0
9992,f96lWMIAUhYIYy9gOktivQ,2009-03-10,YF17z7HWlMj6aezZc-pVEw,5,I'm not normally one to jump at reviewing a ch...,review,W_QXYA7A0IhMrvbckz7eVg,2,3,2
9994,L3BSpFvxcNf3T_teitgt6A,2012-03-19,0nxb1gIGFgk3WbC5zwhKZg,5,Let's see...what is there NOT to like about Su...,review,OzOZv-Knlw3oz9K5Kh5S6A,1,2,1


<a id="using-logistic-regression-for-classification"></a>
## Using Logistic Regression for Classification
---



In [11]:
# Define X and y.
X = yelp_best_worst.text
y = yelp_best_worst.stars

# Split the new DataFrame into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [160]:
# Fit a logistic regression model to predict stars from text

# logreg = LogisticRegression(solver='liblinear', random_state=0)

# logreg.fit(X,y)


Of course this simply fails, we need to preprocess the text, convert it into a Tensor format and then and only then we can use models!

### Converting text to vectors

In [13]:
import re
nltk.download('stopwords')
my_stopwords = nltk.corpus.stopwords.words('english')
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'


def preprocess_text(text, should_join=True):
    text = ' '.join(word.lower() for word in textblob_tokenizer(text))
    text = re.sub(r'http\S+', '', text) # remove http links
    text = re.sub(r'bit.ly/\S+', '', text) # rempve bitly links
    text = text.strip('[link]') # remove [links]
    text = re.sub('['+my_punctuation + ']+', ' ', text) # remove punctuation
    text = re.sub('\s+', ' ', text) #remove double spacing
    text = re.sub(r"[^a-zA-Z.,&!?]+", r" ", text) # only normal characters
    text_token_list = [word for word in text.split(' ')
                            if word not in my_stopwords] # remove stopwords
    text_token_list = [word_rooter(word) if '#' not in word else word
                        for word in text_token_list] # apply word rooter
    text = ' '.join(text_token_list)
    if should_join:
      return ' '.join(gensim.utils.simple_preprocess(text))
    else:
      return gensim.utils.simple_preprocess(text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [17]:
type(X_train)

pandas.core.series.Series

In [18]:
# Apply the preprocessing to the dataset
import swifter
X_preprocessed = X.apply(lambda x: preprocess_text(x))

How do we pass from text to numbers? With tokenizers. We will use Tensorflow ones!

In [139]:
# Find a set named vocab that has all unique words

vocab = []
for i in X_preprocessed:
  for word in preprocess_text(i, False):
    if word not in vocab:
      vocab.append(word)

In [140]:
print(f'{len(vocab)} unique words')

13173 unique words


In [141]:
# Implement this method
def get_maximum_review_length(srs):
  # maximum = srs.str.len().max()
  maximum = srs.map(lambda x: len(preprocess_text(x, should_join = False))).max()
  return maximum


maximum = get_maximum_review_length(X_preprocessed)

In [142]:
print(f'The maximum review was {maximum} words long')

The maximum review was 476 words long


In [143]:
from tensorflow.keras.layers.experimental import preprocessing
ids_from_words = preprocessing.StringLookup(vocabulary=list(vocab), mask_token=None)

In [144]:
words_from_ids = preprocessing.StringLookup(
    vocabulary=ids_from_words.get_vocabulary(), invert=True, mask_token=None)

In [145]:
import tensorflow as tf
def text_from_ids(ids):
  return tf.strings.reduce_join(words_from_ids(ids), axis=-1, separator=' ')

In [146]:
ids = ids_from_words(preprocess_text('Only you can prevent forest fires', should_join=False))
ids

<tf.Tensor: shape=(3,), dtype=int64, numpy=array([3513, 4878, 4249])>

In [147]:
preprocess_text('Only you can prevent forest fires', should_join=False)

['prevent', 'forest', 'fire']

In [148]:
text_from_ids(ids)


<tf.Tensor: shape=(), dtype=string, numpy=b'prevent forest fire'>

In [149]:
def pad_sequence_of_tokens(x, maxlen, unk_token='[UNK]'):
  if len(x)<maxlen:
    x.extend([unk_token]*(maxlen-len(x)))
  return x

In [150]:
from keras_preprocessing.sequence import pad_sequences
# Very useful method to keep in mind
def get_ids_tensor(srs):

  processed = srs.swifter.apply(lambda x: pad_sequence_of_tokens(preprocess_text(x, should_join=False), maxlen=maximum)).to_list()
  return tf.squeeze(tf.constant(pad_sequences(ids_from_words(processed), maxlen=maximum, padding='post'), dtype='int32'))



In [151]:
all_ids = get_ids_tensor(srs=X_preprocessed.reset_index(drop=True))
all_ids

Pandas Apply:   0%|          | 0/4086 [00:00<?, ?it/s]

<tf.Tensor: shape=(4086, 476), dtype=int32, numpy=
array([[   1,    2,    3, ...,    0,    0,    0],
       [  68,   69,   70, ...,    0,    0,    0],
       [ 135,  136,  137, ...,    0,    0,    0],
       ...,
       [7485,  121, 5271, ...,    0,    0,    0],
       [3400,  239,   24, ...,    0,    0,    0],
       [6734,  361, 1477, ...,    0,    0,    0]], dtype=int32)>

In [152]:
all_ids.shape

TensorShape([4086, 476])

In [154]:
# Split the all_ids into.a train a test sets
X_train, X_test, y_train, y_test = train_test_split(all_ids.numpy() , y, test_size=0.2)

### Using Logistic Regression

In [155]:

# Train a Logistic Regression on X_train and give the accuracy
logreg = LogisticRegression(solver='liblinear', random_state=0)

logreg.fit(X_train,y_train)


## Using Boosting Algorithms and other things

In [156]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
clf = GradientBoostingClassifier(n_estimators=50, learning_rate=0.5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.16      0.06      0.09       137
           5       0.83      0.94      0.88       681

    accuracy                           0.79       818
   macro avg       0.50      0.50      0.48       818
weighted avg       0.72      0.79      0.75       818



In [157]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
clf = AdaBoostClassifier(n_estimators=50, learning_rate=0.5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.20      0.01      0.03       137
           5       0.83      0.99      0.90       681

    accuracy                           0.83       818
   macro avg       0.52      0.50      0.47       818
weighted avg       0.73      0.83      0.76       818



In [158]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
clf = RandomForestClassifier(n_estimators=50)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.75      0.02      0.04       137
           5       0.84      1.00      0.91       681

    accuracy                           0.83       818
   macro avg       0.79      0.51      0.48       818
weighted avg       0.82      0.83      0.76       818



## Multiclass Classification

Just check in the estimators, most support multiclass classification.

In [159]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, multi_class='multinomial').fit(X, y)
clf.predict(X[:2, :])
clf.predict_proba(X[:2, :])
clf.score(X, y)

0.9733333333333334

### **Homework**: Try to perform the stars classification with Logistic Regression but without filtering only for 5 and 1 stars.