## Table of Contents

- [Introduction](#Introduction)
- [Data Loading](#Data-Loading)
- [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
- [Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis)
- [Correlation Analysis](#Correlation-Analysis)
- [Predictive Modeling](#Predictive-Modeling)
- [Conclusion](#Conclusion)

In [1]:
# Import necessary libraries and suppress warnings
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

# Visualization libraries
import matplotlib
matplotlib.use('TkAgg') 
import matplotlib.pyplot as plt

import seaborn as sns

# Machine learning libraries for predictive modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.inspection import permutation_importance



## Data Loading

We load the tip dataset from the CSV file. The dataset includes total bill amounts, tip amounts, and several categorical variables that provide contextual information for each record. Note that since there are no explicit date columns in this dataset, we do not perform any date parsing.

In [2]:
# Load the tip dataset
data_path = r'dataset\tip.csv'
df = pd.read_csv(data_path, delimiter=',')

# Display the first two rows of the dataset
print(df.head(2))

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3


In [3]:
# Tip as a percentage of the bill, because context matters: a $10 tip on a
# $200 bill isn't the same as a $10 tip on a $40 meal.
df["tip_percentage"] = (df["tip"] / df["total_bill"]) * 100
df["tip_percentage"] = df["tip_percentage"].round(2)
df.head(3)

   total_bill   tip     sex smoker  day    time  size  tip_percentage
0       16.99  1.01  Female     No  Sun  Dinner     2            5.94
1       10.34  1.66    Male     No  Sun  Dinner     3           16.05
2       21.01  3.50    Male     No  Sun  Dinner     3           16.66


In [4]:
df.dtypes

total_bill        float64
tip               float64
sex                object
smoker             object
day                object
time               object
size                int64
tip_percentage    float64
dtype: object

## Data Cleaning and Preprocessing

In this section we check for missing values and data consistency, and convert categorical variables where needed. Although the dataset appears clean, these checks are standard practice and carry over directly to datasets that do contain missing data.
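
A minimal sketch of such checks follows. It runs on a small hypothetical frame named `sample` that mirrors the columns of `df`; the values, the median fill, and the conversion to the `category` dtype are illustrative assumptions, not steps taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same columns as the tip dataset, plus one
# missing value and one duplicated row so the checks have something to find.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, np.nan, 10.34],
    "tip": [1.01, 1.66, 3.50, 1.66],
    "sex": ["Female", "Male", "Male", "Male"],
    "smoker": ["No", "No", "No", "No"],
    "day": ["Sun", "Sun", "Sun", "Sun"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner"],
    "size": [2, 3, 3, 3],
})

# 1. Count missing values per column.
missing_counts = sample.isnull().sum()
print(missing_counts)

# 2. Drop exact duplicate rows.
cleaned = sample.drop_duplicates().copy()

# 3. Fill the numeric gap with the column median (one common choice).
cleaned["total_bill"] = cleaned["total_bill"].fillna(cleaned["total_bill"].median())

# 4. Store the text columns as pandas categoricals.
for col in ["sex", "smoker", "day", "time"]:
    cleaned[col] = cleaned[col].astype("category")

print(cleaned.dtypes)
```

The sketch works on a separate frame on purpose: converting the text columns of `df` itself to `category` would change what `df.select_dtypes(include='object')` returns in the cells that follow.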

## Exploratory Data Analysis

We now dive into various visualizations to gain insights from the dataset. The following plots will help us see distributions of numerical variables, counts for categorical variables, and pairwise relationships among the numeric features.

In [5]:
numerical = df.select_dtypes(include=['float64', 'int64']).columns
numerical

Index(['total_bill', 'tip', 'size', 'tip_percentage'], dtype='object')

In [6]:
categorical = df.select_dtypes(include='object').columns
categorical


Index(['sex', 'smoker', 'day', 'time'], dtype='object')

In [7]:
# Univariate analysis of the numerical columns
def univariate_numerical_eda(df, column):
    """
    Performs univariate EDA on a single numerical column.
    Displays summary statistics, skewness, kurtosis, histogram, KDE, and boxplot.
    """
    print(f" Feature: {column}")
    print("="*40)
    print(df[column].describe().to_frame())
    print(f"\nSkewness: {df[column].skew():.3f}")
    print(f"Kurtosis: {df[column].kurt():.3f}")


    plt.figure(figsize=(12,4))

    # Histogram + KDE
    plt.subplot(1,2,1)
    sns.histplot(df[column], kde=True, bins=30, color='teal')
    plt.title(f'\n Distribution of {column}', fontsize=13)
    plt.xlabel(column)
    plt.ylabel('Frequency')

    # Boxplot
    plt.subplot(1,2,2)
    sns.boxplot(x=df[column], color='teal')
    plt.title(f'\n Boxplot of {column}', fontsize=13)

    plt.tight_layout()
    plt.show()

    # Optional note on transformation
    if abs(df[column].skew()) > 1:
        print(f"\n {column} is highly skewed. Consider log or Box-Cox transformation.")
    elif abs(df[column].skew()) > 0.5:
        print(f"{column} is moderately skewed.")
    else:
        print(f"{column} is fairly symmetric.")


In [8]:
for column in numerical:
    univariate_numerical_eda(df,column)


 total_bill is highly skewed. Consider log or Box-Cox transformation.

 tip is highly skewed. Consider log or Box-Cox transformation.

 size is highly skewed. Consider log or Box-Cox transformation.

 tip_percentage is highly skewed. Consider log or Box-Cox transformation.
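
The messages above suggest a log or Box-Cox transformation. As a rough sketch of the log option, the snippet below applies `np.log1p` to a hypothetical right-skewed series and compares the skewness before and after. The values in `values` are made up for illustration and are not taken from the tip dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values standing in for a column such as total_bill.
values = pd.Series([8.0, 10.0, 12.0, 13.0, 15.0, 18.0, 22.0, 30.0, 48.0, 95.0])

skew_before = values.skew()

# log1p computes log(1 + x), which stays defined when a value is 0.
log_values = np.log1p(values)
skew_after = log_values.skew()

print(f"Skewness before: {skew_before:.3f}")
print(f"Skewness after log1p: {skew_after:.3f}")
```

Whether a transformation is worthwhile depends on the model used afterwards; a transformed target also has to be mapped back (for example with `np.expm1`) before predictions are read in the original units.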


In [9]:

def univariate_categorical(df, column):
    """
    Performs univariate EDA on a categorical or ordinal column.
    Displays frequency table, proportion, and a countplot.
    """
    print(f"Feature: {column}")
    print("="*40)

    # Frequency + proportion
    freq = df[column].value_counts()
    prop = df[column].value_counts(normalize=True) * 100
    summary = pd.DataFrame({'Count': freq, 'Percentage': prop.round(2)})
    print(summary)
    print()

    # Visualization
    plt.figure(figsize=(8,5))
    # palette already sets the bar colors, so a separate color argument is redundant
    ax = sns.countplot(x=column, data=df, palette='Set2', order=freq.index)
    plt.title(f'Distribution of {column}', fontsize=13)
    plt.xlabel(column)
    plt.ylabel('Count')

    # Annotate each bar with percentage
    total = len(df[column])
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x() + p.get_width()/2, height + 1,
                f'{(height/total)*100:.1f}%', ha='center')

    plt.tight_layout()
    plt.show()


In [10]:
for column in categorical:
    univariate_categorical(df,column)

Feature: sex
        Count  Percentage
sex                      
Male      157       64.34
Female     87       35.66

Feature: smoker
        Count  Percentage
smoker                   
No        151       61.89
Yes        93       38.11

Feature: day
      Count  Percentage
day                    
Sat      87       35.66
Sun      76       31.15
Thur     62       25.41
Fri      19        7.79

Feature: time
        Count  Percentage
time                     
Dinner    176       72.13
Lunch      68       27.87



## Correlation Analysis

We now compute a correlation matrix for the numeric variables. The code draws a correlation heatmap only when there are at least four numeric columns. With the derived tip_percentage column, the dataset has four numeric fields (total_bill, tip, size, and tip_percentage), so the matrix is printed and the heatmap is drawn as well.

In scenarios with more numeric features, a heatmap can highlight multicollinearity concerns or reveal interesting patterns.

In [11]:
# Select only numeric columns for correlation analysis
numeric_df = df.select_dtypes(include=[np.number])

print('Correlation Matrix:')
corr_matrix = numeric_df.corr()
print(corr_matrix)

if len(numeric_df.columns) >= 4:
    plt.figure(figsize=(8, 6))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Heatmap of Numeric Features')
    plt.show()
else:
    print('\nNot enough numeric columns for a heatmap. Displaying correlation matrix is sufficient.')

Correlation Matrix:
                total_bill       tip      size  tip_percentage
total_bill        1.000000  0.675734  0.598315       -0.338629
tip               0.675734  1.000000  0.489299        0.342361
size              0.598315  0.489299  1.000000       -0.142844
tip_percentage   -0.338629  0.342361 -0.142844        1.000000
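
To turn a matrix like the one above into a short list of strongly related pairs, one option is to scan its upper triangle against a cutoff. The sketch below re-enters the printed coefficients by hand; the 0.5 threshold and the names `corr` and `strong_pairs` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

cols = ["total_bill", "tip", "size", "tip_percentage"]

# Coefficients copied from the printed correlation matrix.
corr = pd.DataFrame(
    [
        [1.000000, 0.675734, 0.598315, -0.338629],
        [0.675734, 1.000000, 0.489299, 0.342361],
        [0.598315, 0.489299, 1.000000, -0.142844],
        [-0.338629, 0.342361, -0.142844, 1.000000],
    ],
    index=cols,
    columns=cols,
)

threshold = 0.5  # illustrative cutoff, not a fixed rule

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

strong_pairs = [
    (row, col, upper.loc[row, col])
    for row in cols
    for col in cols
    if pd.notna(upper.loc[row, col]) and abs(upper.loc[row, col]) > threshold
]

for row, col, r in strong_pairs:
    print(f"{row} vs {col}: r = {r:.3f}")
```

With this cutoff, only the pairs involving total_bill (with tip and with size) are flagged.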


## Predictive Modeling

Our objective is to predict the tip amount from total_bill, sex, smoker, day, time, and size. Since the tip amount is a continuous variable, regression is a suitable approach. We perform the following steps:

1. Convert categorical variables into dummy/indicator variables.
2. Split the dataset into training and testing subsets.
3. Build a linear regression model and measure its performance using the R-squared score.
4. Compute permutation importance to gauge feature relevance.

Enjoy the journey into predictive insights with a hint of dry humor in the code comments.

In [12]:
# Prepare the data for predictive modeling
# Using pd.get_dummies for one-hot encoding the categorical variables
# `categorical` holds the object-typed column names selected earlier
df_model = pd.get_dummies(df, columns=list(categorical), drop_first=True)

# Define features (X) and target (y)
# tip_percentage is computed from tip, so keeping it in X would leak the target
X = df_model.drop(['tip', 'tip_percentage'], axis=1)
y = df_model['tip']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Calculate the R-squared score to evaluate predictive performance
r2 = r2_score(y_test, y_pred)
print(f'R-squared Score: {r2:.3f}')

# Compute permutation importance
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42, scoring='r2')
perm_importances = pd.Series(result.importances_mean, index=X.columns)

# Plotting permutation importances using a horizontal bar plot
plt.figure(figsize=(8, 6))
perm_importances.sort_values().plot.barh()
plt.xlabel('Mean Importance')
plt.title('Permutation Importance of Features')
plt.show()

# Plot actual vs predicted tip values to visually assess the model performance
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.7, color='navy')
plt.xlabel('Actual Tip')
plt.ylabel('Predicted Tip')
plt.title('Actual vs Predicted Tip')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', lw=2)  # Diagonal line for perfect prediction
plt.show()

## Conclusion

This notebook walked through the entire analytical process starting from data loading and cleaning to exploratory data analysis and predictive modeling for the tip dataset. The visualization techniques provided various insights into the underlying distributions and relationships in the data, while the regression model delivered a quantifiable performance metric. Future work may include more sophisticated models, feature engineering, or additional cross-validation to further enhance prediction accuracy.
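
As one possible starting point for the cross-validation mentioned above, the sketch below scores a linear regression with 5-fold cross-validation. It uses synthetic data, and the names `X_demo` and `y_demo`, the coefficients, and the noise level are all assumptions made so the example runs on its own; it says nothing about how the model performs on the real tip data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: a bill amount and a party size per row.
rng = np.random.default_rng(42)
bill = rng.uniform(5, 50, size=100)
party = rng.integers(1, 7, size=100)
X_demo = np.column_stack([bill, party])

# Made-up relationship plus noise, only so the model has a signal to fit.
y_demo = 0.15 * bill + 0.2 * party + rng.normal(0, 0.5, size=100)

# Shuffle before splitting so each fold mixes small and large bills.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print(f"R-squared per fold: {np.round(scores, 3)}")
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")
```

Compared with a single train/test split, the spread across folds gives a sense of how much the score depends on which rows land in the test set.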
