# Logistic Regression For Tips Dataset

Apply a logistic regression model to the tips dataset. First, bin a continuous variable (the tip column) into 3 categories.

The 3 categories are: Bad Tipper, Good Tipper, Excellent Tipper.

In [1]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
# load data
df_logistic = sns.load_dataset("tips")

In [3]:
# classify the 'tip' column into three categories: Bad Tipper, Good Tipper, Excellent Tipper
def categories(col):
    # the bin edges are relative to the smallest tip (1.0 in this dataset)
    low = col.min()

    preds = []
    for tip in col:
        # tip in the interval [1, 4): Bad Tipper
        if tip < low + 3:
            preds.append('Bad Tipper')
        # tip in the interval [4, 7): Good Tipper
        elif tip < low + 6:
            preds.append('Good Tipper')
        # tip in the interval [7, 10]: Excellent Tipper
        else:
            preds.append('Excellent Tipper')

    return preds

In [4]:
# create a new column to classify the tipper
df_logistic["tip_categories"] = categories(df_logistic['tip'])
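The loop-based binning above can also be expressed with pd.cut. This is a minimal sketch, not the notebook's own method: the sample values in `tips` are hypothetical, and the open-ended last bin (`np.inf`) is an assumption made so that the largest tip always lands in the top category.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of tip amounts; the notebook would pass df_logistic['tip'].
tips = pd.Series([1.00, 3.99, 4.00, 6.50, 7.00, 10.00])

# Edges relative to the smallest tip, mirroring min, min+3, min+6 in categories().
low = tips.min()
edges = [low, low + 3, low + 6, np.inf]  # assumption: last bin left open-ended
labels = ["Bad Tipper", "Good Tipper", "Excellent Tipper"]

# right=False makes every bin closed on the left: [1, 4), [4, 7), [7, inf)
tip_categories = pd.cut(tips, bins=edges, labels=labels, right=False)
print(tip_categories.tolist())
```

With `right=False` a tip of exactly 4.00 falls into Good Tipper and 7.00 into Excellent Tipper, which matches the interval comments in the function.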

In [5]:
# convert all categorical columns to numeric columns
df_logistic['sex'] = pd.factorize( df_logistic['sex'].values )[0]
df_logistic['smoker'] = pd.factorize( df_logistic['smoker'].values )[0]
df_logistic['day'] = pd.factorize( df_logistic['day'].values )[0]
df_logistic['time'] = pd.factorize( df_logistic['time'].values )[0]
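To make the encoding step concrete, here is a small sketch of what pd.factorize returns. The `days` series is a hypothetical sample, not rows from the dataset. The one-hot variant with pd.get_dummies is shown only as a possible alternative; the notebook itself does not use it.

```python
import pandas as pd

# Hypothetical sample of a categorical column.
days = pd.Series(["Sun", "Sat", "Sun", "Thur", "Sat"])

# factorize assigns integer codes in order of first appearance.
codes, uniques = pd.factorize(days)
print(codes.tolist())   # integer code per row
print(list(uniques))    # category behind each code

# Possible alternative (not used in this notebook): one column per category,
# which avoids implying an order such as Sun < Sat < Thur.
one_hot = pd.get_dummies(days)
print(list(one_hot.columns))
```

The integer codes carry an arbitrary order, so a linear model treats code 2 as "twice" code 1. For two-level columns such as sex or smoker this makes no difference, but for day it might.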

In [6]:
# show the head
df_logistic.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_categories
0,16.99,1.01,0,0,0,0,2,Bad Tipper
1,10.34,1.66,1,0,0,0,3,Bad Tipper
2,21.01,3.5,1,0,0,0,3,Bad Tipper
3,23.68,3.31,1,0,0,0,2,Bad Tipper
4,24.59,3.61,0,0,0,0,4,Bad Tipper


In [7]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    df_logistic.drop(['tip_categories','tip'], axis=1),
    df_logistic['tip_categories'], 
    test_size = 0.30, 
    random_state = 40)
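As a quick check on the split, the sketch below reproduces only the row counts. It uses a stand-in array of 244 row indices (the size of the tips dataset) rather than the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 244 rows of the tips dataset (row indices only).
rows = np.arange(244)

# Same test_size and random_state as the notebook cell above.
train_rows, test_rows = train_test_split(rows, test_size=0.30, random_state=40)

print(len(train_rows), len(test_rows))
```

The test set has 74 rows, which is the total behind the True/False counts reported for both models.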

In [8]:
# standardize the features: fit the scaler on the training set only, then apply it to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
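The sketch below illustrates why the scaler is fitted on the training set and only applied to the test set. The arrays `X_train_demo` and `X_test_demo` are hypothetical toy values, not data from the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features (two columns).
X_train_demo = np.array([[10.0, 2.0], [20.0, 4.0], [30.0, 6.0]])
X_test_demo = np.array([[20.0, 2.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learns mean and std from train
X_test_scaled = scaler.transform(X_test_demo)        # reuses the train statistics

# The same transformation written out by hand: z = (x - train_mean) / train_std
manual = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)
print(np.allclose(X_test_scaled, manual))
```

Reusing the training statistics keeps information about the test rows out of the fitted model.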

In [9]:
# LogisticRegression model
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

LogisticRegression()

In [10]:
# predict the 'tip_categories' column
y_pred = classifier.predict(X_test)

In [11]:
# combine y_pred with X_test
test_dataset = pd.DataFrame(X_test)

test_dataset['Actual_Tip'] = y_test.to_numpy()
test_dataset['Predict_Tip'] = y_pred

# show the head
test_dataset.head()

Unnamed: 0,0,1,2,3,4,5,Actual_Tip,Predict_Tip
0,0.000841,-1.407997,-0.777029,-0.068819,-0.545628,-0.57572,Bad Tipper,Bad Tipper
1,-0.120529,-1.407997,-0.777029,0.994741,1.83275,0.488111,Bad Tipper,Bad Tipper
2,-1.097496,0.710229,-0.777029,-0.068819,-0.545628,-0.57572,Bad Tipper,Bad Tipper
3,-0.321209,-1.407997,-0.777029,-1.132379,-0.545628,0.488111,Bad Tipper,Bad Tipper
4,-1.018185,-1.407997,-0.777029,0.994741,1.83275,-0.57572,Bad Tipper,Bad Tipper


In [12]:
# check whether the actual and predicted values match
(test_dataset['Actual_Tip'] == test_dataset['Predict_Tip']).value_counts()

True     58
False    16
dtype: int64
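The match count above is the classifier's accuracy expressed as raw counts. A hedged sketch of the same check using sklearn.metrics follows. The `actual` and `predicted` lists are hypothetical labels chosen for illustration, not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["Bad Tipper", "Good Tipper", "Excellent Tipper"]

# Hypothetical labels for six test rows.
actual = ["Bad Tipper", "Bad Tipper", "Good Tipper",
          "Bad Tipper", "Excellent Tipper", "Good Tipper"]
predicted = ["Bad Tipper", "Bad Tipper", "Bad Tipper",
             "Bad Tipper", "Good Tipper", "Good Tipper"]

# Fraction of rows where the predicted label equals the actual label.
acc = accuracy_score(actual, predicted)

# Rows are actual classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(actual, predicted, labels=labels)

print(acc)
print(cm)
```

A confusion matrix shows which classes the errors come from, which a single True/False count cannot.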

# Linear Regression For The Tips Dataset

Create a linear regression that predicts the tip column.

Compare the results of the linear and logistic regressions using the following criterion:
- A regression prediction counts as true if it is within 10% of the actual value, and as false otherwise. Example: if the actual value is 1, any prediction in [0.90, 1.10] is true; anything outside that range is false.
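The tolerance rule can be sketched as a vectorized NumPy comparison. The values below are the actual and predicted tips of the first five test rows that this notebook reports for the linear model; the variable names are illustrative.

```python
import numpy as np

# Actual and predicted tips for the first five test rows reported in this notebook.
actual = np.array([3.0, 1.36, 1.25, 3.5, 1.5])
predicted = np.array([3.005646, 3.109629, 2.092216, 2.914557, 2.303958])

# True where the prediction lies in [0.9 * actual, 1.1 * actual].
within_tolerance = (predicted >= 0.9 * actual) & (predicted <= 1.1 * actual)

print(within_tolerance.tolist())
```

Only the first row passes: 3.005646 is within 10% of 3.0, while the other four predictions miss by more than 10%.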

In [13]:
# load the data
df_linear = sns.load_dataset("tips")

In [14]:
# convert all categorical columns to numeric columns
df_linear['sex'] = pd.factorize( df_linear['sex'].values )[0]
df_linear['smoker'] = pd.factorize( df_linear['smoker'].values )[0]
df_linear['day'] = pd.factorize( df_linear['day'].values )[0]
df_linear['time'] = pd.factorize( df_linear['time'].values )[0]

In [15]:
# show the head
df_linear.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,0,0,0,0,2
1,10.34,1.66,1,0,0,0,3
2,21.01,3.5,1,0,0,0,3
3,23.68,3.31,1,0,0,0,2
4,24.59,3.61,0,0,0,0,4


In [16]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    df_linear.drop(['tip'], axis=1),
    df_linear['tip'], 
    test_size = 0.30, 
    random_state = 40)

In [17]:
# LinearRegression model
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

In [18]:
# predict the 'tip' column
y_pred = lr.predict(X_test)

In [19]:
# combine y_pred with X_test (copy first so X_test itself is not modified)
test_dataset = X_test.copy()

test_dataset['Actual_Tip'] = y_test.to_numpy()
test_dataset['Predict_Tip'] = y_pred

# show the head
test_dataset.head()

Unnamed: 0,total_bill,sex,smoker,day,time,size,Actual_Tip,Predict_Tip
29,19.65,0,0,1,0,2,3.0,3.005646
146,18.64,0,0,2,1,3,1.36,3.109629
75,10.51,1,0,1,0,2,1.25,2.092216
18,16.97,0,0,0,0,3,3.5,2.914557
132,11.17,0,0,2,1,2,1.5,2.303958
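Besides the per-row tolerance check, a regression is often summarized with an average error. The sketch below computes the mean absolute error for the five rows shown above only, so the number it prints describes this small sample and not the whole test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Actual and predicted tips copied from the five rows shown above.
actual = np.array([3.0, 1.36, 1.25, 3.5, 1.5])
predicted = np.array([3.005646, 3.109629, 2.092216, 2.914557, 2.303958])

# Average of |actual - predicted| over the sample, in the same units as the tip.
mae = mean_absolute_error(actual, predicted)
print(round(mae, 4))
```

On the full test set the same call would take `y_test` and `y_pred`.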


In [20]:
# check whether each prediction is within the 10% tolerance of the actual value
def error_tolerance(col1,col2):
    # col1: actual values, col2: predicted values

    preds = []
    for actual, predicted in zip(col1, col2):
        # prediction within 10% of the actual value: append True
        if actual*0.9 <= predicted <= actual*1.1:
            preds.append(True)
        # prediction outside the 10% tolerance: append False
        else:
            preds.append(False)

    return preds

In [21]:
# create a new column that flags whether each prediction is within the 10% tolerance
test_dataset["in_10error"] = error_tolerance(test_dataset['Actual_Tip'],test_dataset['Predict_Tip'])

In [22]:
# show the head
test_dataset.head()

Unnamed: 0,total_bill,sex,smoker,day,time,size,Actual_Tip,Predict_Tip,in_10error
29,19.65,0,0,1,0,2,3.0,3.005646,True
146,18.64,0,0,2,1,3,1.36,3.109629,False
75,10.51,1,0,1,0,2,1.25,2.092216,False
18,16.97,0,0,0,0,3,3.5,2.914557,False
132,11.17,0,0,2,1,2,1.5,2.303958,False


In [23]:
# count the values in 'in_10error' column
test_dataset['in_10error'].value_counts()

False    53
True     21
Name: in_10error, dtype: int64

# Results
Compare the false predictions of the classification and regression models and see which one performed better.

The linear regression produced far more false predictions than the logistic regression: 53 of its 74 test predictions fell outside the 10% tolerance, whereas the logistic regression misclassified only 16 of 74 test rows.
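The counts reported by the two value_counts outputs can be turned into rates with a few lines of arithmetic. The helper `hit_rate` is a hypothetical name introduced here for illustration.

```python
# Counts taken from the two value_counts() outputs in this notebook.
logistic_true, logistic_false = 58, 16
linear_true, linear_false = 21, 53

def hit_rate(n_true, n_false):
    """Share of test rows counted as true predictions."""
    return n_true / (n_true + n_false)

logistic_rate = hit_rate(logistic_true, logistic_false)
linear_rate = hit_rate(linear_true, linear_false)

print(f"logistic: {logistic_rate:.1%}")  # 78.4%
print(f"linear:   {linear_rate:.1%}")    # 28.4%
```

Both models were scored on the same 74 test rows, so the two rates share a denominator.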

Based on this result, I would use logistic regression for this task in the future.