# Data
https://www.kaggle.com/datasets/joebeachcapital/restaurant-reviews

A dataset of restaurant reviews with 10,000 rows and 8 columns.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import re
import pickle

In [2]:
df = pd.read_csv('../data/restaurant_reviews.csv')
# Some ratings are the string 'Like'; map them to 5 before casting to float
df.loc[df['Rating'] == 'Like', 'Rating'] = 5
df['Rating'] = df['Rating'].astype(float)

To keep things simple, we keep only the review and rating columns.
We also create a new column, 'is_happy': a rating >= 4 means the customer was happy (1); otherwise the customer was unhappy (0).


In [3]:
df['is_happy'] = (df['Rating'] >= 4).astype(int)
# Select the two columns and drop empty reviews in one step
# (avoids calling dropna(inplace=True) on a slice of the original frame)
df = df[['Review', 'is_happy']].dropna()
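
One detail of the labelling step may be worth checking: a comparison such as `NaN >= 4` evaluates to `False`, so a row with a missing rating would be labelled 0 (unhappy) rather than dropped. Whether the dataset contains such rows is not stated here, so the following is only a minimal sketch on a hypothetical toy frame that reuses the `Review` and `Rating` column names.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names match the dataset.
toy = pd.DataFrame({
    'Review': ['Great food', 'Cold soup', 'No rating given', None],
    'Rating': [5.0, 2.0, np.nan, 4.0],
})

# NaN >= 4 is False, so the row with a missing rating is labelled 0.
naive = (toy['Rating'] >= 4).astype(int)

# Alternative: drop rows missing either field before labelling.
clean = toy.dropna(subset=['Review', 'Rating']).copy()
clean['is_happy'] = (clean['Rating'] >= 4).astype(int)

print(naive.tolist())
print(clean)
```

If missing ratings should not count as unhappy customers, dropping them before creating `is_happy` is one option.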

In [4]:
df['Review'] = df['Review'].str.lower()
# Keep only letters and whitespace (this drops punctuation and digits)
df['Review'] = df['Review'].str.replace(r'[^a-zA-Z\s]', '', regex=True)
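
To make the effect of this cleaning step concrete, here is a small sketch on two made-up review strings. It applies the same lowercasing and the same regular expression; the final whitespace-collapsing step is an optional extra that is not part of the notebook.

```python
import pandas as pd

# Hypothetical example reviews, not taken from the dataset.
reviews = pd.Series([
    "The food was GREAT!!! 10/10, would return.",
    "Meh... not worth $25",
])

# Lowercase, then remove everything that is not a letter or whitespace.
cleaned = reviews.str.lower().str.replace(r'[^a-zA-Z\s]', '', regex=True)

# Optional (assumption, not in the notebook): collapse the runs of
# whitespace left behind where digits and punctuation were removed.
tidy = cleaned.str.replace(r'\s+', ' ', regex=True).str.strip()

print(cleaned.tolist())
print(tidy.tolist())
```

Note that digits are removed along with punctuation, so a phrase like "10/10" disappears entirely.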

# Model baseline

We want to identify the words that matter most for a good or a bad customer review.

In [5]:
X, y = df['Review'], df['is_happy']

In [6]:
tf_idf = TfidfVectorizer(ngram_range=(1, 5), max_features=100000)
logit = LogisticRegression(C=1, n_jobs=-1, solver='lbfgs', random_state=17, verbose=1)
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf), 
                                 ('logit', logit)])

In [7]:
tfidf_logit_pipeline.fit(X,y)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 10 concurrent workers.


RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =       100001     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  6.90028D+03    |proj g|=  1.29150D+03


 This problem is unconstrained.



At iterate   50    f=  3.55273D+03    |proj g|=  5.11443D-02

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
*****     65     75      1     0     0   5.744D-02   3.553D+03
  F =   3552.7261055309709     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    1.6s finished


In [8]:
# Use a context manager so the file handle is closed after writing
with open('model.pkl', 'wb') as f:
    pickle.dump(tfidf_logit_pipeline, f)
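
For completeness, a saved pipeline can be loaded back and used on new text. The sketch below trains a tiny stand-in pipeline on made-up reviews and writes it to a temporary directory so that it runs on its own; with the real model, only the `pickle.load` and `predict_proba` steps would be needed. New reviews should get the same cleaning (lowercasing, removing non-letters) that was applied before training.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in pipeline on a made-up corpus (the real one is fitted above).
toy_pipeline = Pipeline([('tf_idf', TfidfVectorizer()),
                         ('logit', LogisticRegression(random_state=17))])
toy_pipeline.fit(['great food', 'terrible service', 'great place', 'terrible food'],
                 [1, 0, 1, 0])

# Save to a temporary location, then load it back.
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
with open(path, 'wb') as f:
    pickle.dump(toy_pipeline, f)
with open(path, 'rb') as f:
    loaded = pickle.load(f)

# Probability of the positive class (is_happy = 1) for new, already cleaned text.
new_reviews = ['great service', 'terrible place']
proba = loaded.predict_proba(new_reviews)[:, 1]
print(proba)
```

Pickle files should only be loaded from trusted sources, and preferably with the same scikit-learn version that was used to save them.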