<img src="https://devra.ai/analyst/notebook/3448/image.jpg" style="width: 100%; height: auto;" />

<div style="text-align:center; border-radius:15px; padding:15px; color:white; margin:0; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden; margin-bottom: 1em;">
  <div style="font-size:150%; color:#FEE100"><b>Tip Prediction and Exploratory Analysis Notebook</b></div>
  <div>This notebook was created with the help of <a href="https://devra.ai/ref/kaggle" style="color:#6666FF">Devra AI</a></div>
</div>

## Table of Contents

- [Introduction](#Introduction)
- [Data Loading](#Data-Loading)
- [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
- [Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis)
- [Correlation Analysis](#Correlation-Analysis)
- [Predictive Modeling](#Predictive-Modeling)
- [Conclusion](#Conclusion)

If you find this notebook useful, please consider upvoting it.

In [None]:
# Import necessary libraries and suppress warnings
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

# Visualization libraries
import matplotlib.pyplot as plt
# Render figures inline; a non-interactive backend such as 'Agg' would keep plt.show() from displaying them
%matplotlib inline

import seaborn as sns

# Machine learning libraries for predictive modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.inspection import permutation_importance

# Set aesthetic parameters for seaborn
sns.set(style='whitegrid', palette='muted', color_codes=True)

## Data Loading

We load the tip dataset from the CSV file. The dataset includes total bill amounts, tip amounts, and several categorical variables that provide contextual information for each record. Note that since there are no explicit date columns in this dataset, we do not perform any date parsing.

In [None]:
# Load the tip dataset
data_path = 'customertipping/dataset/tip.csv'  # forward slashes, so '\t' is not read as a tab character
df = pd.read_csv(data_path, delimiter=',')

# Display the first few rows of the dataset
print(df.head())
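
The later cells assume that the CSV contains the columns `total_bill`, `tip`, `sex`, `smoker`, `day`, `time`, and `size`. A minimal sketch of a loader that checks this up front is shown below. The helper name `load_tips` and the two-row sample are hypothetical and only stand in for the real file.

```python
import io

import pandas as pd

# Columns that the later cells of this notebook refer to
EXPECTED_COLUMNS = ['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']


def load_tips(source):
    """Hypothetical helper: read the CSV and fail early if a column is missing."""
    frame = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f'Missing expected columns: {missing}')
    return frame


# Made-up two-row sample standing in for tip.csv (not real data)
sample_csv = io.StringIO(
    'total_bill,tip,sex,smoker,day,time,size\n'
    '20.00,3.00,Female,No,Sun,Dinner,2\n'
    '10.00,1.50,Male,No,Sun,Dinner,3\n'
)
sample_df = load_tips(sample_csv)
print(sample_df.shape)
```

The same function accepts a file path in place of the in-memory sample.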

## Data Cleaning and Preprocessing

In this section we check for missing values, review the data types, and convert the categorical variables to the `category` dtype. The dataset appears clean, but these checks are still worth running: if a different copy of the data contains missing values, the null counts below will show which columns need attention.

In [None]:
# Check for missing values and data types
print('Missing values in each column:')
print(df.isnull().sum())

# Review data types
print('\nData types:')
print(df.dtypes)

# Ensure categorical variables are of type 'category'
categorical_cols = ['sex', 'smoker', 'day', 'time']
for col in categorical_cols:
    df[col] = df[col].astype('category')

# Check summary of the dataset
print('\nDataset summary:')
print(df.describe(include='all'))
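
If the null counts above were not all zero, one simple option would be to impute the gaps. The sketch below is only an illustration on a made-up toy frame, assuming median imputation for numeric columns and most-frequent imputation for categorical ones. Other strategies, such as dropping rows, may suit a given analysis better.

```python
import numpy as np
import pandas as pd

# Made-up toy frame with one missing numeric and one missing categorical value
toy = pd.DataFrame({
    'total_bill': [10.0, np.nan, 30.0, 20.0],
    'day': pd.Series(['Sat', 'Sun', None, 'Sat'], dtype='category'),
})

cleaned = toy.copy()
# Numeric column: fill with the median of the observed values
cleaned['total_bill'] = cleaned['total_bill'].fillna(cleaned['total_bill'].median())
# Categorical column: fill with the most frequent category
cleaned['day'] = cleaned['day'].fillna(cleaned['day'].mode().iloc[0])

print(cleaned.isnull().sum())
```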

## Exploratory Data Analysis

We now dive into various visualizations to gain insights from the dataset. The following plots will help us see distributions of numerical variables, counts for categorical variables, and pairwise relationships among the numeric features.

In [None]:
# Plot histograms for numerical variables
plt.figure(figsize=(12, 4))
for i, col in enumerate(['total_bill', 'tip', 'size'], 1):
    plt.subplot(1, 3, i)
    sns.histplot(df[col], kde=True, color='teal')
    plt.title(f'Histogram of {col}')
plt.tight_layout()
plt.show()

# Count plots for categorical features
plt.figure(figsize=(12, 8))
plt.subplot(2, 2, 1)
sns.countplot(x='sex', data=df, palette='pastel')
plt.title('Count of Sex')

plt.subplot(2, 2, 2)
sns.countplot(x='smoker', data=df, palette='pastel')
plt.title('Count of Smoker')

plt.subplot(2, 2, 3)
sns.countplot(x='day', data=df, palette='pastel')
plt.title('Count of Day')

plt.subplot(2, 2, 4)
sns.countplot(x='time', data=df, palette='pastel')
plt.title('Count of Time')

plt.tight_layout()
plt.show()

# Pair plot for numerical variables
sns.pairplot(df[['total_bill', 'tip', 'size']])
plt.show()

# Box plot to visualize distribution of tip by day
plt.figure(figsize=(8, 6))
sns.boxplot(x='day', y='tip', data=df, palette='Set2')
plt.title('Tip Distribution by Day')
plt.show()

# Violin plot for tip distribution by time
plt.figure(figsize=(8, 6))
sns.violinplot(x='time', y='tip', data=df, palette='Set3')
plt.title('Tip Distribution by Time')
plt.show()

# Strip plot for a detailed view of tip by size
plt.figure(figsize=(8, 6))
sns.stripplot(x='size', y='tip', data=df, jitter=True, palette='coolwarm')
plt.title('Tip vs Size')
plt.show()
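
A numeric summary can complement the box and violin plots. The following sketch groups a made-up toy frame by `day` and reports the count, median, and mean of `tip`. The same call should work on `df`, assuming the column names used above.

```python
import pandas as pd

# Made-up toy data, not taken from the tip dataset
toy = pd.DataFrame({
    'day': pd.Categorical(['Sat', 'Sat', 'Sun', 'Sun', 'Sun']),
    'tip': [2.0, 4.0, 1.0, 3.0, 5.0],
})

# observed=True restricts the result to categories that actually occur
summary = toy.groupby('day', observed=True)['tip'].agg(['count', 'median', 'mean'])
print(summary)
```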

## Correlation Analysis

We now compute a correlation matrix for the numeric variables. The code below draws a heatmap only when there are at least four numeric columns. This dataset has three numeric fields (total_bill, tip, and size), so the correlation matrix is printed instead.

In scenarios with more numeric features, a heatmap can highlight multicollinearity concerns or reveal interesting patterns.

In [None]:
# Select only numeric columns for correlation analysis
numeric_df = df.select_dtypes(include=[np.number])

print('Correlation Matrix:')
corr_matrix = numeric_df.corr()
print(corr_matrix)

if len(numeric_df.columns) >= 4:
    plt.figure(figsize=(8, 6))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Heatmap of Numeric Features')
    plt.show()
else:
    print('\nNot enough numeric columns for a heatmap. Displaying correlation matrix is sufficient.')
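
By default, `DataFrame.corr()` computes the Pearson correlation coefficient. As a rough illustration of what each entry in the matrix means, the sketch below computes the coefficient by hand on made-up toy values and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Made-up toy values, not taken from the tip dataset
toy = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 40.0],
    'tip': [1.0, 3.0, 4.0, 6.0],
})

x = toy['total_bill'].to_numpy()
y = toy['tip'].to_numpy()

# Center both variables, then divide the summed cross-products
# by the product of the root sums of squares
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

r_pandas = toy['total_bill'].corr(toy['tip'])
print(r_manual, r_pandas)
```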

## Predictive Modeling

Our objective is to predict the tip amount from total_bill, sex, smoker, day, time, and size. Since the tip amount is a continuous variable, regression is suitable. We perform the following steps:

1. Convert categorical variables into dummy/indicator variables.
2. Split the dataset into training and testing subsets.
3. Build a linear regression model and measure its performance using the R-squared score.
4. Compute permutation importance to gauge feature relevance.

In [None]:
# Prepare the data for predictive modeling
# Using pd.get_dummies for one-hot encoding the categorical variables
df_model = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Define features (X) and target (y)
X = df_model.drop('tip', axis=1)
y = df_model['tip']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Calculate the R-squared score to evaluate predictive performance
r2 = r2_score(y_test, y_pred)
print(f'R-squared Score: {r2:.3f}')

# Compute permutation importance
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42, scoring='r2')
perm_importances = pd.Series(result.importances_mean, index=X.columns)

# Plotting permutation importances using a horizontal bar plot
plt.figure(figsize=(8, 6))
perm_importances.sort_values().plot.barh()
plt.xlabel('Mean Importance')
plt.title('Permutation Importance of Features')
plt.show()

# Plot actual vs predicted tip values to visually assess the model performance
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.7, color='navy')
plt.xlabel('Actual Tip')
plt.ylabel('Predicted Tip')
plt.title('Actual vs Predicted Tip')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', lw=2)  # Diagonal line for perfect prediction
plt.show()
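
The R-squared score reported above compares the model's squared errors with those of a baseline that always predicts the mean. The sketch below reproduces the calculation by hand on made-up numbers and checks it against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.0])

# Residual sum of squares: error of the model's predictions
ss_res = ((y_true - y_hat) ** 2).sum()
# Total sum of squares: error of always predicting the mean
ss_tot = ((y_true - y_true.mean()) ** 2).sum()

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))
```

A value of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than the mean.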

## Conclusion

This notebook walked through the full analytical process for the tip dataset, from data loading and cleaning through exploratory data analysis to predictive modeling. The visualizations showed the distributions of the variables and the relationships among them, and the linear regression model was evaluated with the R-squared score on a held-out test set. Future work may include more sophisticated models and feature engineering to improve prediction accuracy, and cross-validation to obtain a more reliable estimate of performance.
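
As a starting point for the cross-validation idea, the sketch below wraps one-hot encoding and linear regression in a scikit-learn pipeline and scores it with 5-fold cross-validation. The data are synthetic and generated only for illustration. The column names mirror the tip dataset, but the coefficients used to simulate `tip` are assumptions, not findings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (assumed relationship, not the real dataset)
rng = np.random.default_rng(0)
n = 120
synthetic = pd.DataFrame({
    'total_bill': rng.uniform(5, 50, n),
    'size': rng.integers(1, 7, n),
    'day': np.tile(['Thur', 'Fri', 'Sat', 'Sun'], n // 4),
})
synthetic['tip'] = (
    0.15 * synthetic['total_bill']
    + 0.2 * synthetic['size']
    + rng.normal(0, 0.5, n)
)

# One-hot encode the categorical column inside the pipeline so that
# the encoder is fit on the training folds only
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['day'])],
    remainder='passthrough',
)
pipeline = make_pipeline(preprocess, LinearRegression())

scores = cross_val_score(
    pipeline,
    synthetic.drop(columns='tip'),
    synthetic['tip'],
    cv=5,
    scoring='r2',
)
print(f'Mean R-squared: {scores.mean():.3f} (std {scores.std():.3f})')
```

Averaging over folds makes the estimate less sensitive to one particular train/test split than the single 80/20 split used earlier.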

Thank you for reading.