**Notebook for the AICP internship task: analyzing the tips given to waiters for serving food**

# Dataset attributes
The data recorded by the food server is as follows:  
1. total_bill: total bill in dollars, including taxes
2. tip: tip given to the waiter in dollars
3. sex: gender of the person paying the bill
4. smoker: whether the person smoked or not
5. day: day of the week
6. time: lunch or dinner
7. size: number of people at the table


**Importing Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

**Q1: Import data and check null values, column info, and descriptive statistics**

In [2]:
# load data
data = pd.read_csv('tips.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


In [4]:
data.describe()

| | total_bill | tip | size |
|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.9511 |
| min | 3.07 | 1.0 | 1.0 |
| 25% | 13.3475 | 2.0 | 2.0 |
| 50% | 17.795 | 2.9 | 2.0 |
| 75% | 24.1275 | 3.5625 | 3.0 |
| max | 50.81 | 10.0 | 6.0 |


In [5]:
data.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

**Q2: Tips analysis based on total bill, number of people, and day of the week**

In [6]:
# Create scatter plot
fig = px.scatter(data, x='total_bill', y='tip', color='day', size='size', hover_data=['day'], 
                 title='Tips Analysis', size_max=10, opacity=0.7)

# Calculate trend lines for each day
trend_lines = {}
for day in data['day'].unique():
    day_data = data[data['day'] == day]
    x = day_data['total_bill'].values.reshape(-1, 1)
    y = day_data['tip'].values.reshape(-1, 1)
    model = LinearRegression().fit(x, y)
    trend_lines[day] = model.predict(x)

# Add trend lines to the plot
for day, color in zip(data['day'].unique(), px.colors.qualitative.Set2):
    fig.add_scatter(x=data[data['day'] == day]['total_bill'], y=trend_lines[day].flatten(), mode='lines', 
                    line=dict(color=color, dash='dash'), name=f'Trend Line ({day})')

# Update layout
fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip', legend_title='Day of the Week')

# Show the plot
fig.show()
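
The per-day trend lines are separate least-squares fits of `tip` on `total_bill`. As a minimal sketch, the slope and intercept of each fit can also be read off numerically with `np.polyfit`, which produces the same straight line as a single-feature `LinearRegression`. The `sample` frame and the `fit_by_group` helper are hypothetical stand-ins and not part of the notebook; with the real data one would pass `data` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for tips.csv (not the real data)
sample = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 12.0, 24.0, 36.0],
    'tip':        [1.5, 2.5, 3.5, 2.0, 4.0, 6.0],
    'day':        ['Sat', 'Sat', 'Sat', 'Sun', 'Sun', 'Sun'],
})

def fit_by_group(frame, group_col, x_col='total_bill', y_col='tip'):
    """Return {group: (slope, intercept)} from a degree-1 least-squares fit."""
    fits = {}
    for name, group in frame.groupby(group_col):
        slope, intercept = np.polyfit(group[x_col], group[y_col], deg=1)
        fits[name] = (slope, intercept)
    return fits

fits = fit_by_group(sample, 'day')
for day, (slope, intercept) in fits.items():
    print(f"{day}: tip = {slope:.3f} * total_bill + {intercept:.3f}")
```

The same helper should work for the gender and meal-time plots by passing `'sex'` or `'time'` as `group_col`.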


**Q3: Tips analysis based on total bill, number of people, and gender**

In [7]:
# Create scatter plot
fig = px.scatter(data, x='total_bill', y='tip', color='sex', size='size', hover_data=['sex'], 
                 title='Tips Analysis', size_max=10, opacity=0.7)

# Calculate trend lines for each gender
trend_lines = {}
for gender in data['sex'].unique():
    gender_data = data[data['sex'] == gender]
    x = gender_data['total_bill'].values.reshape(-1, 1)
    y = gender_data['tip'].values.reshape(-1, 1)
    model = LinearRegression().fit(x, y)
    trend_lines[gender] = model.predict(x)

# Add trend lines to the plot
for gender in trend_lines:
    fig.add_scatter(x=data[data['sex'] == gender]['total_bill'], y=trend_lines[gender].flatten(), mode='lines', 
                    line=dict(dash='dash'), name=f'Trend Line ({gender})')

# Update layout
fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip', legend_title='Gender')

# Show the plot
fig.show()


**Q4: Tips analysis based on total bill, number of people, and time of the meal**

In [8]:
# Create scatter plot
fig = px.scatter(data, x='total_bill', y='tip', color='time', size='size', hover_data=['time'], 
                 title='Tips Analysis', size_max=10, opacity=0.7)

# Calculate trend lines for each time
trend_lines = {}
for time in data['time'].unique():
    time_data = data[data['time'] == time]
    x = time_data['total_bill'].values.reshape(-1, 1)
    y = time_data['tip'].values.reshape(-1, 1)
    model = LinearRegression().fit(x, y)
    trend_lines[time] = model.predict(x)

# Add trend lines to the plot
for time in trend_lines:
    fig.add_scatter(x=data[data['time'] == time]['total_bill'], y=trend_lines[time].flatten(), mode='lines', 
                    line=dict(dash='dash'), name=f'Trend Line ({time})')

# Update layout
fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip', legend_title='Time of the Meal')

# Show the plot
fig.show()


**Q5: Tips analysis based on the days of the week**

In [9]:
# Calculate tips by day
tips_by_day = data.groupby('day')['tip'].sum()

# Find the day with the most tips
most_tips_day = tips_by_day.idxmax()

print("Day with the most tips:", most_tips_day)

# Create pie chart
fig = px.pie(names=tips_by_day.index, values=tips_by_day.values, title='Tips Analysis Based on the Days of the Week')

# Show the plot
fig.show()


Day with the most tips: Sat
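
Total tips per day grow with the number of tables served, so the busiest day tends to win on the sum alone. A hedged sketch of how the sum could be read alongside the table count and the mean tip per table follows; the `sample` rows are hypothetical and only illustrate the call, so the printed values say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy rows (not the real tips.csv)
sample = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sat', 'Sun', 'Sun', 'Fri'],
    'tip': [2.0, 3.0, 4.0, 5.0, 3.0, 3.5],
})

# One row per day: total tips, number of tables, and average tip per table
summary = sample.groupby('day')['tip'].agg(['sum', 'count', 'mean'])
print(summary)
print("Highest total:", summary['sum'].idxmax())
print("Highest mean: ", summary['mean'].idxmax())
```

In this toy example the day with the highest total is not the day with the highest average, which is why the two figures are worth reporting together.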


**Q6: Tips analysis based on gender**

In [10]:
# Calculate tips by gender
tips_by_gender = data.groupby('sex')['tip'].sum()

# Find the gender that tips the most
most_tips_gender = tips_by_gender.idxmax()

print("Gender that tips the most:", most_tips_gender)

# Create pie chart
fig = px.pie(names=tips_by_gender.index, values=tips_by_gender.values, title='Tips Analysis Based on Gender')

# Show the plot
fig.show()


Gender that tips the most: Male


**Q7: Tips analysis based on the days of the week**

In [11]:

tips_by_day = data.groupby('day')['tip'].sum()
most_tips_day = tips_by_day.idxmax()
print("Day with the most tips:", most_tips_day)
fig = px.pie(names=tips_by_day.index, values=tips_by_day.values, title='Tips Analysis Based on the Days of the Week')

# Show the plot
fig.show()

Day with the most tips: Sat


**Q8: Tips analysis based on smoker and non-smoker**

In [12]:
tips_by_smoker = data.groupby('smoker')['tip'].sum()
print("Total tips by smokers and non-smokers:")
print(tips_by_smoker)
fig = px.pie(names=tips_by_smoker.index, values=tips_by_smoker.values, title='Tips Analysis Based on Smoker and Non-Smoker')
# Show the plot
fig.show()

Total tips by smokers and non-smokers:
smoker
No     451.77
Yes    279.81
Name: tip, dtype: float64
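
The pie chart reports each group's share of the total. As a quick check, the same percentages can be worked out directly from the two totals printed above; the variable names here are only illustrative.

```python
# Totals copied from the printed output above
tips_total = {'No': 451.77, 'Yes': 279.81}

grand_total = sum(tips_total.values())
shares = {group: 100 * value / grand_total for group, value in tips_total.items()}

for group, share in shares.items():
    print(f"{group}: {share:.1f}% of all tips")
```

These are shares of the summed tips, so they reflect how many tables fall in each group as well as how generously each group tips.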


**Q9: Tips analysis based on lunch and dinner**

In [13]:
tips_by_time = data.groupby('time')['tip'].sum()
print("Total tips by lunch and dinner:")
print(tips_by_time)
# Create pie chart
fig = px.pie(names=tips_by_time.index, values=tips_by_time.values, title='Tips Analysis Based on Lunch and Dinner')

# Show the plot
fig.show()

Total tips by lunch and dinner:
time
Dinner    546.07
Lunch     185.51
Name: tip, dtype: float64


**Q10: Data transformation - converting categorical values to numerical values**

In [14]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to each categorical column
data['sex'] = label_encoder.fit_transform(data['sex'])
data['smoker'] = label_encoder.fit_transform(data['smoker'])
data['day'] = label_encoder.fit_transform(data['day'])
data['time'] = label_encoder.fit_transform(data['time'])

# Print the first few rows of the encoded DataFrame
data.head()


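| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 0 | 0 | 2 | 0 | 2 |
| 1 | 10.34 | 1.66 | 1 | 0 | 2 | 0 | 3 |
| 2 | 21.01 | 3.5 | 1 | 0 | 2 | 0 | 3 |
| 3 | 23.68 | 3.31 | 1 | 0 | 2 | 0 | 2 |
| 4 | 24.59 | 3.61 | 0 | 0 | 2 | 0 | 4 |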


**Q11: Splitting the data into training and test sets**

In [15]:
X = data.drop('tip', axis=1)
y = data['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
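
With 244 rows and `test_size=0.2`, scikit-learn rounds the test share up, so the split should give 49 test rows and 195 training rows. A small sketch that checks this on placeholder arrays of the same length (`X_demo` and `y_demo` are hypothetical, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as tips.csv
X_demo = np.arange(244 * 2).reshape(244, 2)
y_demo = np.arange(244)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)
```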

**Q12: Training a Linear Regression model and making predictions**

In [16]:
model = LinearRegression()
model.fit(X_train, y_train)


In [21]:
input_data = {'total_bill': 24.50, 'sex': 0, 'smoker': 0, 'day': 0, 'time': 1, 'size': 4}
input_df = pd.DataFrame([input_data])
predicted_tip = model.predict(input_df)
print("The predicted tip for the waiter is:", predicted_tip)

The predicted tip for the waiter is: [3.94145885]
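
The test split created in Q11 is not used above, so the notebook does not show how well the model generalizes. A minimal sketch of how a fitted `LinearRegression` could be scored on held-out rows with `r2_score` and `mean_absolute_error` follows. The `evaluate` helper and the synthetic, noise-free data are assumptions for illustration only; in the notebook one would call `evaluate(model, X_test, y_test)` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate(fitted_model, X_eval, y_eval):
    """Return R^2 and mean absolute error of a fitted model on held-out data."""
    predictions = fitted_model.predict(X_eval)
    return {
        'r2': r2_score(y_eval, predictions),
        'mae': mean_absolute_error(y_eval, predictions),
    }

# Synthetic stand-in data (hypothetical): tip is an exact linear function of bill and size
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(5, 50, 100), rng.integers(1, 7, 100)])
y_demo = 0.1 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + 1.0

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
scores = evaluate(LinearRegression().fit(X_tr, y_tr), X_te, y_te)
print(scores)
```

Because the synthetic target has no noise, the scores here are near perfect; real tipping data would be expected to score lower.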


*Checking the values of label encoding*

In [18]:
df = pd.read_csv('tips.csv')

In [19]:
df.head()

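| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |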


In [20]:
from sklearn.preprocessing import LabelEncoder



# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to the 'day' column
df['day'] = label_encoder.fit_transform(df['day'])

# Get the mapping of encoded labels to original categories
day_mapping = dict(zip(range(len(label_encoder.classes_)), label_encoder.classes_))

# Print the mapping
print("Encoded labels to original categories mapping:")
print(day_mapping)


Encoded labels to original categories mapping:
{0: 'Fri', 1: 'Sat', 2: 'Sun', 3: 'Thur'}
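
`LabelEncoder` assigns codes in sorted order of the class labels, so the same lookup can be written down for every encoded column. The sketch below rebuilds those lookups in plain Python from the category values that appear in this notebook's outputs, then decodes the input used for the prediction in Q12. The `categories`, `mappings`, and `decoded` names are hypothetical, and the sketch assumes the columns contain no other values.

```python
# Category values as they appear in the notebook's outputs
categories = {
    'sex': ['Male', 'Female'],
    'smoker': ['Yes', 'No'],
    'day': ['Thur', 'Fri', 'Sat', 'Sun'],
    'time': ['Lunch', 'Dinner'],
}

# Codes follow sorted order, mirroring how LabelEncoder numbers its classes
mappings = {col: dict(enumerate(sorted(values))) for col, values in categories.items()}
for col, mapping in mappings.items():
    print(col, mapping)

# Decode the categorical part of the Q12 prediction input
encoded_input = {'sex': 0, 'smoker': 0, 'day': 0, 'time': 1}
decoded = {col: mappings[col][code] for col, code in encoded_input.items()}
print(decoded)
```

Read this way, the Q12 input describes a non-smoking female payer at a Friday lunch.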
