# Logistic Regression and Boosting Algorithms

© Data Trainers LLC. GPL v 3.0.

**Author:** Axel Sirota


## Predicting a Single Categorical Response
---



### Installing stuff

In [1]:
!pip install --upgrade textblob spacy 'gensim==4.2.0' swifter keras_preprocessing



In [2]:
!python -m textblob.download_corpora lite
!python -m spacy download en_core_web_sm

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
Finished.
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m21.2 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you a

In [3]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB         # Naive Bayes
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer

import spacy
import gensim
import warnings
import nltk
warnings.filterwarnings('ignore')
nltk.download('punkt')
textblob_tokenizer = lambda x: TextBlob(x).words


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
%%writefile get_data.sh
if [ ! -f yelp.csv ]; then
  wget -O yelp.csv https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0
fi

Overwriting get_data.sh


In [5]:
!bash get_data.sh

In [6]:
# Read yelp.csv into a DataFrame.
path = './yelp.csv'
yelp = pd.read_csv(path)
# Create a new DataFrame that only contains the 5-star and 1-star reviews.
yelp_best_worst = yelp[ (yelp.stars == 1) | (yelp.stars == 5) ]

# Define X and y.
X = yelp_best_worst.text
y = yelp_best_worst.stars

# Split the new DataFrame into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [7]:
yelp_best_worst.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4086 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   business_id  4086 non-null   object
 1   date         4086 non-null   object
 2   review_id    4086 non-null   object
 3   stars        4086 non-null   int64 
 4   text         4086 non-null   object
 5   type         4086 non-null   object
 6   user_id      4086 non-null   object
 7   cool         4086 non-null   int64 
 8   useful       4086 non-null   int64 
 9   funny        4086 non-null   int64 
dtypes: int64(4), object(6)
memory usage: 351.1+ KB


<a id="using-logistic-regression-for-classification"></a>
## Using Logistic Regression for Classification
---



In [9]:
# Fit a logistic regression model to predict stars from text

logreg = LogisticRegression()

logreg.fit(X_train,y_train)


ValueError: could not convert string to float: "If I could give it more than 5, I would.  Sweet Pea and I live down the street, literally DOWN THE STREET, from this bar.  We waited for it to open for what seemed like decades, praying that this was going to be the type of place that could become our local.  It has exceeded our expectations.  The atmosphere is amazing.  The drinks are amazing- every last one of them- but the margaritas are the best I've ever had.  They tasted like a fresh squeeze of sunshine that makes me happy inside.  Margarita Mondays- $4 margs AND free food?  Happy hours are amazing.  New Year's Eve last year was amazing.  The 1 year anniversary party was amazing.  But most of all, the owner and staff are some of the coolest peeps that you'll ever meet.  Go here.  You will love it."

Of course this simply fails, we need to preprocess the text, convert it into a Tensor format and then and only then we can use models!

### Converting text to vectors

In [10]:
import re
nltk.download('stopwords')
my_stopwords = nltk.corpus.stopwords.words('english')
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'


def preprocess_text(text, should_join=True):
    text = ' '.join(word.lower() for word in textblob_tokenizer(text))
    text = re.sub(r'http\S+', '', text) # remove http links
    text = re.sub(r'bit.ly/\S+', '', text) # rempve bitly links
    text = text.strip('[link]') # remove [links]
    text = re.sub('['+my_punctuation + ']+', ' ', text) # remove punctuation
    text = re.sub('\s+', ' ', text) #remove double spacing
    text = re.sub(r"[^a-zA-Z.,&!?]+", r" ", text) # only normal characters
    text_token_list = [word for word in text.split(' ')
                            if word not in my_stopwords] # remove stopwords
    text_token_list = [word_rooter(word) if '#' not in word else word
                        for word in text_token_list] # apply word rooter
    text = ' '.join(text_token_list)
    if should_join:
      return ' '.join(gensim.utils.simple_preprocess(text))
    else:
      return gensim.utils.simple_preprocess(text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:
# Apply the preprocessing to the dataset
import swifter
# X_preprocessed = None
X_preprocessed = X.apply(preprocess_text)
# .swifter.allow_dask_on_strings().progress_bar(True)

In [15]:
# X_preprocessed.data[0]
# X_preprocessed_series = X_preprocessed.series
# preprocessed_text_at_index_0 = X_preprocessed_series.iloc[0]

X_preprocessed[0]

'wife took birthday breakfast excel weather perfect made sit outsid overlook ground absolut pleasur waitress excel food arriv quickli semi busi saturday morn look like place fill pretti quickli earlier get better favor get bloodi mari phenomen simpli best ever pretti sure use ingredi garden blend fresh order amaz everyth menu look excel white truffl scrambl egg veget skillet tasti delici came piec griddl bread amaz absolut made meal complet best toast ever anyway ca wait go bac'

How do we pass from text to numbers? With tokenizers. We will use Tensorflow ones!

In [18]:
# Find a set named vocab that has all unique words

import tensorflow as tf

# Initialize tokenizer
tokenizer = tf.keras.preprocessing.text.Tokenizer()

# Fit tokenizer on preprocessed text
tokenizer.fit_on_texts(X_preprocessed)

# Get the vocabulary from the Tokenizer
vocab = tokenizer.word_index.keys()


In [19]:
# Fit a logistic regression model to predict stars from text

logreg = LogisticRegression()

logreg.fit(X_train,y_train)


ValueError: could not convert string to float: "If I could give it more than 5, I would.  Sweet Pea and I live down the street, literally DOWN THE STREET, from this bar.  We waited for it to open for what seemed like decades, praying that this was going to be the type of place that could become our local.  It has exceeded our expectations.  The atmosphere is amazing.  The drinks are amazing- every last one of them- but the margaritas are the best I've ever had.  They tasted like a fresh squeeze of sunshine that makes me happy inside.  Margarita Mondays- $4 margs AND free food?  Happy hours are amazing.  New Year's Eve last year was amazing.  The 1 year anniversary party was amazing.  But most of all, the owner and staff are some of the coolest peeps that you'll ever meet.  Go here.  You will love it."

Of course this simply fails, we need to preprocess the text, convert it into a Tensor format and then and only then we can use models!

> Add blockquote



### Converting text to vectors

In [20]:
import re
nltk.download('stopwords')
my_stopwords = nltk.corpus.stopwords.words('english')
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'


def preprocess_text(text, should_join=True):
    text = ' '.join(word.lower() for word in textblob_tokenizer(text))
    text = re.sub(r'http\S+', '', text) # remove http links
    text = re.sub(r'bit.ly/\S+', '', text) # rempve bitly links
    text = text.strip('[link]') # remove [links]
    text = re.sub('['+my_punctuation + ']+', ' ', text) # remove punctuation
    text = re.sub('\s+', ' ', text) #remove double spacing
    text = re.sub(r"[^a-zA-Z.,&!?]+", r" ", text) # only normal characters
    text_token_list = [word for word in text.split(' ')
                            if word not in my_stopwords] # remove stopwords
    text_token_list = [word_rooter(word) if '#' not in word else word
                        for word in text_token_list] # apply word rooter
    text = ' '.join(text_token_list)
    if should_join:
      return ' '.join(gensim.utils.simple_preprocess(text))
    else:
      return gensim.utils.simple_preprocess(text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [21]:
# Apply the preprocessing to the dataset
import swifter
X_preprocessed = X.apply(preprocess_text)

In [22]:
X_preprocessed

0       wife took birthday breakfast excel weather per...
1       idea peopl give bad review place goe show plea...
3       rosi dakota love chaparr dog park conveni surr...
4       gener manag scott petello good egg go detail l...
6       drop drive ate go back next day food good cute...
                              ...                        
9990    ye rock hipster joint dig place littl bit scen...
9991    star note folk rate place low must isol incid ...
9992    normal one jump review chain restaur especi su...
9994    et see like surpris stadium well tall can pbr ...
9999    locat star averag think arizona realli fantast...
Name: text, Length: 4086, dtype: object

How do we pass from text to numbers? With tokenizers. We will use Tensorflow ones!

In [23]:
# Find a set named vocab that has all unique words

import tensorflow as tf

# Initialize tokenizer
tokenizer = tf.keras.preprocessing.text.Tokenizer()

# Fit tokenizer on preprocessed text
tokenizer.fit_on_texts(X_preprocessed)

# Get the vocabulary from the Tokenizer
vocab = tokenizer.word_index.keys()


In [24]:
print(f'{len(vocab)} unique words')

13140 unique words


In [26]:
# Implement this method
def get_maximum_review_length(srs):
    maximum = 0
    for text in srs:
        length = len(text.split())  # Split the text into words and get the length
        if length > maximum:
            maximum = length
    return maximum

maximum = get_maximum_review_length(X_preprocessed)


In [27]:
print(f'The maximum review was {maximum} words long')

The maximum review was 477 words long


In [28]:
from tensorflow.keras.layers.experimental import preprocessing
ids_from_words = preprocessing.StringLookup(vocabulary=list(vocab), mask_token=None)

In [29]:
words_from_ids = preprocessing.StringLookup(
    vocabulary=ids_from_words.get_vocabulary(), invert=True, mask_token=None)

In [30]:
import tensorflow as tf
def text_from_ids(ids):
  return tf.strings.reduce_join(words_from_ids(ids), axis=-1, separator=' ')

In [31]:
ids = ids_from_words(preprocess_text('Only you can prevent forest fires', should_join=False))
ids

<tf.Tensor: shape=(3,), dtype=int64, numpy=array([2526, 6363,  931])>

In [32]:
preprocess_text('Only you can prevent forest fires', should_join=False)

['prevent', 'forest', 'fire']

In [33]:
text_from_ids(ids)


<tf.Tensor: shape=(), dtype=string, numpy=b'prevent forest fire'>

In [34]:
def pad_sequence_of_tokens(x, maxlen, unk_token='[UNK]'):
  if len(x)<maxlen:
    x.extend([unk_token]*(maxlen-len(x)))
  return x

In [35]:
from keras_preprocessing.sequence import pad_sequences
# Very useful method to keep in mind
def get_ids_tensor(srs):

  processed = srs.swifter.apply(lambda x: pad_sequence_of_tokens(preprocess_text(x, should_join=False), maxlen=maximum)).to_list()
  return tf.squeeze(tf.constant(pad_sequences(ids_from_words(processed), maxlen=maximum, padding='post'), dtype='int32'))



In [36]:
all_ids = get_ids_tensor(srs=X_preprocessed.reset_index(drop=True))
all_ids

Pandas Apply:   0%|          | 0/4086 [00:00<?, ?it/s]

<tf.Tensor: shape=(4086, 477), dtype=int32, numpy=
array([[ 321,  139,  474, ...,    0,    0,    0],
       [4280,   45,   85, ...,    0,    0,    0],
       [5975, 7800,   10, ...,    0,    0,    0],
       ...,
       [9368,    9, 1894, ...,    0,    0,    0],
       [1871,   71,    6, ...,    0,    0,    0],
       [9038,  134,  719, ...,    0,    0,    0]], dtype=int32)>

In [41]:
all_ids.shape

TensorShape([4086, 477])

In [45]:
# Split the all_ids into.a train a test sets
from sklearn.model_selection import train_test_split

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(all_ids.numpy(), y, test_size=0.2, random_state=42)


### Using Logistic Regression

In [49]:

# Train a Logistic Regression on X_train and give the accuracy
logreg = LogisticRegression(X_train)
# logreg.fit(X_train)

## Using Boosting Algorithms and other things

In [52]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

clf = GradientBoostingClassifier(n_estimators=50, learning_rate=0.5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.25      0.10      0.15       167
           5       0.80      0.92      0.86       651

    accuracy                           0.76       818
   macro avg       0.53      0.51      0.50       818
weighted avg       0.69      0.76      0.71       818



In [53]:
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(n_estimators=50, learning_rate=0.5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.12      0.01      0.01       167
           5       0.80      0.99      0.88       651

    accuracy                           0.79       818
   macro avg       0.46      0.50      0.45       818
weighted avg       0.66      0.79      0.70       818



In [54]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=50)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.60      0.04      0.07       167
           5       0.80      0.99      0.89       651

    accuracy                           0.80       818
   macro avg       0.70      0.51      0.48       818
weighted avg       0.76      0.80      0.72       818



## Multiclass Classification

Just check in the estimators, most support multiclass classification.

In [55]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, multi_class='multinomial').fit(X, y)
clf.predict(X[:2, :])
clf.predict_proba(X[:2, :])
clf.score(X, y)

0.9733333333333334

### **Homework**: Try to perform the stars classification with Logistic Regression but without filtering only for 5 and 1 stars.