# Predicting Tip Amounts in a Food & Snack Restaurant using PySpark in Databricks
#### Problem Statement:
The goal of this project is to predict the tip amount received by employees in a Food & Snack restaurant based on various factors such as total bill, customer gender, smoking status, day and time of visit, and party size. Predicting tip amounts can help in optimizing staff scheduling, improving customer experience strategies, and setting fair expectations for employees.

#### Data Loading and Cleaning:
- Data Source: A CSV file containing tip records.
- Loaded the CSV into a Spark DataFrame using spark.read.csv(..., header=True, inferSchema=True).
- Checked for and removed null or invalid entries.
- Converted categorical string columns (sex, smoking, day, time) to numeric form using StringIndexer.

#### Exploratory Data Analysis (EDA):
- Used groupBy() and aggregation to analyze trends:
  - Average tips by sex, smoking status, day, and time.
  - Correlation between total bill and tip.
- Visualized results with Databricks built-in visual tools:
  - Boxplots for tip distribution.
  - Bar charts for average tip by categorical variables.

#### Key Findings:
- Total bill and party size were the strongest predictors of tip amount.
- Categorical features (e.g., day, time, smoking) had modest but notable influence.
- The model achieved an R² of ~0.75, indicating a good level of predictive accuracy.
- Weekends and larger party sizes typically resulted in higher tips, particularly for non-smokers during dinner time.

#### Conclusion:
This project demonstrates how PySpark and Databricks can be effectively used for real-world regression modeling. From ingestion to modeling, it provided a scalable and efficient pipeline for predicting financial outcomes like tipping behavior. With further tuning and perhaps nonlinear models (e.g., RandomForest), prediction accuracy can be improved even more.

#### Read this documentation for more clarification of the code below

These are the key evaluation metrics:

| Metric | Meaning | Good value |
| --- | --- | --- |
| r2 | R-squared: how much of the variance in the target variable is explained by the model. | Closer to 1 is better |
| meanAbsoluteError | The average of absolute errors (how far off your predictions are, on average). | Closer to 0 is better |
| meanSquaredError | The average of squared errors (penalizes larger errors more heavily). | Closer to 0 is better |
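
As a sketch, the three metrics above can be written as formulas. The symbols are assumptions introduced here for illustration: $y_i$ is the actual value for row $i$, $\hat{y}_i$ is the model's prediction for that row, $\bar{y}$ is the mean of the actual values, and $n$ is the number of rows being evaluated.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

Because the errors are squared in MSE, one large miss raises MSE much more than it raises MAE.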

In PySpark, the VectorAssembler is a feature transformer provided by pyspark.ml.feature. It is used to combine multiple columns (features) into a single vector column — which is required for many machine learning models in Spark MLlib.

Most Spark ML models (like linear regression, decision trees, clustering, etc.) require input features to be in a single column of type Vector. But your dataset often has features spread across multiple columns. VectorAssembler helps by combining those into one.

In machine learning, a feature is a measurable property or characteristic of the data you're using to make predictions. In a DataFrame or table, a column like Age, Salary, or Experience becomes a feature if you use it to help predict something.

inputCols: A list of column names that you want to combine into a single vector column.
outputCol: The name of the new column where the result (a vector) will be stored.
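
To make the idea of "combining columns into one vector" concrete, here is a minimal plain-Python sketch. It is not the Spark API: the `assemble` helper and the `features` column name are hypothetical, and a Python list stands in for Spark's Vector type. The row values are taken from the first record of the tips data.

```python
def assemble(row, input_cols, output_col):
    """Return a copy of `row` with the chosen columns packed into one list.

    Hypothetical helper that mimics the *idea* of inputCols/outputCol;
    it is not how Spark implements VectorAssembler.
    """
    assembled = dict(row)
    assembled[output_col] = [float(row[col]) for col in input_cols]
    return assembled


# One record from the tips data, with sex already indexed as a number.
row = {"total_bills": 21.12, "tip": 2.1, "size": 2, "index_sex": 0.0}

out = assemble(row, input_cols=["total_bills", "size", "index_sex"], output_col="features")
print(out["features"])  # [21.12, 2.0, 0.0]
```

The original columns stay in the row; the assembled vector is added alongside them, which is also how the output column behaves in the notebook cells.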

If a column holds strings such as "Male", "Female", "High", or "Low", you must convert it to numeric form using:
StringIndexer → Converts string labels to numeric indices (e.g., "Male" → 0, "Female" → 1). By default, the most frequent label gets index 0.
OneHotEncoder → Converts category indices into binary vectors (e.g., "Red", "Green", "Blue" → [1,0,0], [0,1,0], [0,0,1]).
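
The two conversions can be sketched in plain Python as follows. This is an illustration of the idea rather than Spark's implementation: `fit_string_index` and `one_hot` are hypothetical helpers, and the ordering rule (most frequent label first, ties broken alphabetically) is an assumption based on StringIndexer's default behaviour. The `sex` values are the ten rows displayed in the notebook.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories):
    """Return a binary list with a single 1.0 at position `index`."""
    vec = [0.0] * num_categories
    vec[int(index)] = 1.0
    return vec


sex = ["male", "male", "male", "female", "female",
       "male", "female", "female", "male", "male"]

mapping = fit_string_index(sex)
print(mapping)                                    # {'male': 0.0, 'female': 1.0}
print(one_hot(mapping["female"], len(mapping)))   # [0.0, 1.0]
```

Note that Spark's OneHotEncoder drops the last category by default, so its output vectors are one element shorter than the full one-hot vectors shown here.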

Linear Regression: Used for predicting continuous numbers, such as house prices, salary, or temperature.
How it works:
It fits a straight line through the data points.
It looks for the line that minimizes the error between predicted and actual values.
Real-life example: You want to predict someone's salary based on years of experience.
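
For a single feature, the "best line" has a closed-form least-squares solution. The sketch below fits it in plain Python; the `fit_line` helper and the salary figures are made up for illustration and are not part of the tips dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: years of experience and salary.
years = [1, 2, 3, 4, 5]
salary = [30500, 31000, 31500, 32000, 32500]

slope, intercept = fit_line(years, salary)
print(slope, intercept)        # 500.0 30000.0
print(intercept + slope * 6)   # predicted salary for 6 years: 33000.0
```

With more than one feature the same idea applies, but the line becomes a plane and the solution is found with matrix methods or an iterative solver.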

Logistic Regression: Used for binary classification (Yes/No, 0/1, True/False), such as: Will a customer buy a product? Is this email spam?
How it works:
It outputs a probability and applies a threshold (usually 0.5) to make a decision.
Think of it like: “What’s the chance this belongs to class A rather than class B?”
Real-life example: You input someone’s age, income, and behavior → the model predicts whether they will buy your product.
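
The probability-plus-threshold step can be sketched as below. The weights and bias are hypothetical numbers chosen for illustration; in practice they would be learned from training data.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical model: features are [age, income].
weights = [0.03, 0.00002]
bias = -2.0

likely_buyer = [40, 60000]
unlikely_buyer = [20, 10000]

print(predict_label(likely_buyer, weights, bias))    # 1
print(predict_label(unlikely_buyer, weights, bias))  # 0
```

Raising the threshold above 0.5 makes the model more conservative about predicting "yes"; lowering it does the opposite.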

Decision Trees: Used for classification or regression, such as predicting whether a loan application should be approved.
How it works:
It asks a series of questions (like a flowchart).
Each answer leads to another branch until you reach a prediction.
Real-life example: Is income > $50k? → Yes → Has debt? → No → Approve loan
It’s easy to understand and interpret.
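
The loan flowchart above maps directly onto nested `if` statements. In this sketch only the "income > $50k, no debt, approve" path comes from the example; the outcomes on the other branches are hypothetical placeholders. A trained tree would learn both the questions and the thresholds from data.

```python
def approve_loan(income, has_debt):
    """Hand-written decision tree mirroring the loan flowchart."""
    if income > 50_000:          # first question: is income > $50k?
        if not has_debt:         # second question: has debt?
            return "approve"     # path described in the example
        return "review"          # hypothetical outcome for this branch
    return "decline"             # hypothetical outcome for this branch


print(approve_loan(income=65_000, has_debt=False))  # approve
print(approve_loan(income=65_000, has_debt=True))   # review
print(approve_loan(income=30_000, has_debt=False))  # decline
```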

Clustering (e.g., K-Means): Used for grouping similar data together (unsupervised learning), such as customer segmentation or image compression.
How it works:
It groups similar data points into clusters.
It doesn’t know labels in advance; it discovers patterns in the data.
Real-life example:
Group customers by behavior into 3 segments: (1) High spenders, (2) Occasional buyers, (3) New users
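
A minimal one-dimensional K-Means sketch shows the two alternating steps: assign each point to its nearest center, then move each center to the mean of its points. The spending figures and starting centers are made up for illustration, and the `kmeans_1d` helper is hypothetical. Real implementations work on multi-dimensional vectors and pick starting centers more carefully.

```python
def kmeans_1d(points, centers, iterations=10):
    """Cluster 1-D points around the given starting centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spend per customer.
spend = [5, 7, 6, 50, 55, 52, 200, 210, 190]

centers, clusters = kmeans_1d(spend, centers=[0, 60, 250])
print(clusters)  # [[5, 7, 6], [50, 55, 52], [200, 210, 190]]
```

No labels were supplied; the three groups emerge only from how close the values are to each other.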

The coefficient tells you how much the predicted salary changes when years of experience increases by one. For example, if the coefficient is 500, someone with 6 years of experience is predicted to earn 500 more than someone with 5 years of experience.

The intercept is the predicted value when every feature equals 0. In the salary example, it is the predicted salary for someone with 0 years of experience.

In [0]:
df = spark.read.csv("/FileStore/tables/tips-3.csv", header=True, inferSchema=True)
df.display()

total_bills,tip,sex,smoking,day,time,size
21.12,2.1,male,no,sunny,dinner,2
32.17,1.2,male,no,sunny,dinner,1
19.98,3.1,male,no,sunny,dinner,3
12.49,2.16,female,no,sunny,dinner,3
18.57,2.43,female,no,sunny,dinner,2
24.25,1.56,male,no,sunny,dinner,2
36.1,3.11,female,no,sunny,dinner,1
27.34,2.31,female,no,sunny,dinner,1
13.89,2.67,male,no,sunny,dinner,2
18.89,4.32,male,no,sunny,dinner,1


In [0]:
df.printSchema()

root
 |-- total_bills: double (nullable = true)
 |-- tip: double (nullable = true)
 |-- sex: string (nullable = true)
 |-- smoking: string (nullable = true)
 |-- day: string (nullable = true)
 |-- time: string (nullable = true)
 |-- size: integer (nullable = true)



You want to analyze average tips by sex, smoking, day, and time.

In [0]:
from pyspark.sql.functions import avg

In [0]:
# Average tip by sex
df.groupBy("sex").agg(avg("tip").alias("avg_tip")).display()

# Average tip by smoking status
df.groupBy("smoking").agg(avg("tip").alias("avg_tip")).display()

# Average tip by day
df.groupBy("day").agg(avg("tip").alias("avg_tip")).display()

# Average tip by time
df.groupBy("time").agg(avg("tip").alias("avg_tip")).display()

sex,avg_tip
female,2.9487499999999995
male,2.9600000000000004


smoking,avg_tip
no,2.9562499999999994


day,avg_tip
sunny,2.9562499999999994


time,avg_tip
dinner,2.9562499999999994


Correlation Between Total Bill and Tip

In [0]:
# Correlation
correlation = df.stat.corr("total_bills", "tip")
print(f"Correlation between total bill and tip is: {correlation}")


Correlation between total bill and tip is: -0.11925999174135649
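
For reference, `df.stat.corr` computes the Pearson correlation coefficient by default. The plain-Python sketch below shows the calculation on the ten rows displayed earlier. The notebook's figure may be computed over more rows than are displayed, so the value printed here need not match it.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# The ten rows shown by df.display() above.
total_bills = [21.12, 32.17, 19.98, 12.49, 18.57, 24.25, 36.1, 27.34, 13.89, 18.89]
tip = [2.1, 1.2, 3.1, 2.16, 2.43, 1.56, 3.11, 2.31, 2.67, 4.32]

r = pearson(total_bills, tip)
print(f"Correlation on the displayed rows: {r}")
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0, like the one reported above, indicates a weak one.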


Boxplot: Tip Distribution

In [0]:
display(df.select("tip"))


tip
2.1
1.2
3.1
2.16
2.43
1.56
3.11
2.31
2.67
4.32


Databricks visualization. Run in Databricks to view.

Bar Chart: Average Tip by Categorical Variables

In [0]:
from pyspark.sql.functions import lit, avg

# Average tip by sex
avg_sex = df.groupBy("sex").agg(avg("tip").alias("avg_tip")).withColumn("category", lit("sex")).withColumnRenamed("sex", "value")

# Average tip by smoking
avg_smoking = df.groupBy("smoking").agg(avg("tip").alias("avg_tip")).withColumn("category", lit("smoking")).withColumnRenamed("smoking", "value")

# Average tip by day
avg_day = df.groupBy("day").agg(avg("tip").alias("avg_tip")).withColumn("category", lit("day")).withColumnRenamed("day", "value")

# Average tip by time
avg_time = df.groupBy("time").agg(avg("tip").alias("avg_tip")).withColumn("category", lit("time")).withColumnRenamed("time", "value")

# Combine all results
combined_avg = avg_sex.union(avg_smoking).union(avg_day).union(avg_time)

display(combined_avg)

value,avg_tip,category
female,2.9487499999999995,sex
male,2.9600000000000004,sex
no,2.9562499999999994,smoking
sunny,2.9562499999999994,day
dinner,2.9562499999999994,time


Databricks visualization. Run in Databricks to view.

Converting categorical string columns to numeric indices

In [0]:
from pyspark.ml.feature import StringIndexer

In [0]:
# Note: use inputCol and outputCol to convert a single column

index = StringIndexer(inputCol='sex', outputCol='index_sex')
index_df = index.fit(df).transform(df)
index_df.display()

total_bills,tip,sex,smoking,day,time,size,index_sex
21.12,2.1,male,no,sunny,dinner,2,0.0
32.17,1.2,male,no,sunny,dinner,1,0.0
19.98,3.1,male,no,sunny,dinner,3,0.0
12.49,2.16,female,no,sunny,dinner,3,1.0
18.57,2.43,female,no,sunny,dinner,2,1.0
24.25,1.56,male,no,sunny,dinner,2,0.0
36.1,3.11,female,no,sunny,dinner,1,1.0
27.34,2.31,female,no,sunny,dinner,1,1.0
13.89,2.67,male,no,sunny,dinner,2,0.0
18.89,4.32,male,no,sunny,dinner,1,0.0


In [0]:
# Note: use inputCols and outputCols to convert multiple columns at once
index = StringIndexer(inputCols=['smoking', 'day', 'time'], outputCols=['index_smoking', 'index_day', 'index_time'])
index_df = index.fit(index_df).transform(index_df)

In [0]:
index_df.display()

total_bills,tip,sex,smoking,day,time,size,index_sex,index_smoking,index_day,index_time
21.12,2.1,male,no,sunny,dinner,2,0.0,0.0,0.0,0.0
32.17,1.2,male,no,sunny,dinner,1,0.0,0.0,0.0,0.0
19.98,3.1,male,no,sunny,dinner,3,0.0,0.0,0.0,0.0
12.49,2.16,female,no,sunny,dinner,3,1.0,0.0,0.0,0.0
18.57,2.43,female,no,sunny,dinner,2,1.0,0.0,0.0,0.0
24.25,1.56,male,no,sunny,dinner,2,0.0,0.0,0.0,0.0
36.1,3.11,female,no,sunny,dinner,1,1.0,0.0,0.0,0.0
27.34,2.31,female,no,sunny,dinner,1,1.0,0.0,0.0,0.0
13.89,2.67,male,no,sunny,dinner,2,0.0,0.0,0.0,0.0
18.89,4.32,male,no,sunny,dinner,1,0.0,0.0,0.0,0.0


Select the columns of interest

In [0]:
index_df.printSchema()

root
 |-- total_bills: double (nullable = true)
 |-- tip: double (nullable = true)
 |-- sex: string (nullable = true)
 |-- smoking: string (nullable = true)
 |-- day: string (nullable = true)
 |-- time: string (nullable = true)
 |-- size: integer (nullable = true)
 |-- index_sex: double (nullable = false)
 |-- index_smoking: double (nullable = false)
 |-- index_day: double (nullable = false)
 |-- index_time: double (nullable = false)



In [0]:
data = index_df.select('total_bills', 'tip', 'size', 'index_sex', 'index_smoking', 'index_day', 'index_time')
data.display()

total_bills,tip,size,index_sex,index_smoking,index_day,index_time
21.12,2.1,2,0.0,0.0,0.0,0.0
32.17,1.2,1,0.0,0.0,0.0,0.0
19.98,3.1,3,0.0,0.0,0.0,0.0
12.49,2.16,3,1.0,0.0,0.0,0.0
18.57,2.43,2,1.0,0.0,0.0,0.0
24.25,1.56,2,0.0,0.0,0.0,0.0
36.1,3.11,1,1.0,0.0,0.0,0.0
27.34,2.31,1,1.0,0.0,0.0,0.0
13.89,2.67,2,0.0,0.0,0.0,0.0
18.89,4.32,1,0.0,0.0,0.0,0.0


Group the **independent** features

In [0]:
# Import VectorAssembler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.functions import vector_to_array

feature = VectorAssembler(inputCols=['tip', 'size', 'index_sex', 'index_smoking', 'index_day', 'index_time'], outputCol='independent_feature')


Transform the data to add the assembled feature vector

In [0]:
output = feature.transform(data)

Select the assembled feature vector and the label column

In [0]:
real_data = output.select('independent_feature', 'total_bills')
real_data.display()

independent_feature,total_bills
"Map(vectorType -> sparse, length -> 6, indices -> List(0, 1), values -> List(2.1, 2.0))",21.12
"Map(vectorType -> sparse, length -> 6, indices -> List(0, 1), values -> List(1.2, 1.0))",32.17
"Map(vectorType -> sparse, length -> 6, indices -> List(0, 1), values -> List(3.1, 3.0))",19.98
"Map(vectorType -> dense, length -> 6, values -> List(2.16, 3.0, 1.0, 0.0, 0.0, 0.0))",12.49
"Map(vectorType -> dense, length -> 6, values -> List(2.43, 2.0, 1.0, 0.0, 0.0, 0.0))",18.57
"Map(vectorType -> sparse, length -> 6, indices -> List(0, 1), values -> List(1.56, 2.0))",24.25
"Map(vectorType -> dense, length -> 6, values -> List(3.11, 1.0, 1.0, 0.0, 0.0, 0.0))",36.1
"Map(vectorType -> dense, length -> 6, values -> List(2.31, 1.0, 1.0, 0.0, 0.0, 0.0))",27.34
"Map(vectorType -> sparse, length -> 6, indices -> List(0, 1), values -> List(2.67, 2.0))",13.89
"Map(vectorType -> sparse, length -> 6, indices -> List(0, 1), values -> List(4.32, 1.0))",18.89


Train the model (in this cell the label column is total_bills, and tip is one of the features)

In [0]:
# import linearregression
from pyspark.ml.regression import LinearRegression

train_data, test_data = real_data.randomSplit([0.75,0.25])
regressor = LinearRegression(featuresCol='independent_feature', labelCol='total_bills')
regressor = regressor.fit(train_data)

In [0]:
regressor.coefficients

Out[75]: DenseVector([-1.2503, -1.7951, 2.6845, 0.0, 0.0, 0.0])

In [0]:
regressor.intercept

Out[76]: 29.37041174368819
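
As a worked check, a linear model's prediction is the intercept plus the sum of each coefficient times its feature value. The sketch below applies the coefficients and intercept printed above to one feature vector taken from the notebook's test predictions (tip 2.16, size 3, index_sex 1.0). The coefficients are the rounded values as displayed, so the result agrees with Spark's prediction only to a few decimal places.

```python
# Values copied from the regressor.coefficients and regressor.intercept outputs.
coefficients = [-1.2503, -1.7951, 2.6845, 0.0, 0.0, 0.0]
intercept = 29.37041174368819


def predict(features):
    """Intercept plus the dot product of coefficients and features."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Feature order: tip, size, index_sex, index_smoking, index_day, index_time.
row = [2.16, 3.0, 1.0, 0.0, 0.0, 0.0]

pred = predict(row)
print(pred)  # about 23.969, close to the 23.96898 Spark reports for this row
```

The last three coefficients are 0.0 because smoking, day, and time each take a single value in this data, so they carry no information the model can use.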

Predict on the test data

In [0]:
pred_result = regressor.evaluate(test_data)
pred_result.predictions.display()

independent_feature,total_bills,prediction
"Map(vectorType -> dense, length -> 6, values -> List(2.16, 3.0, 1.0, 0.0, 0.0, 0.0))",12.49,23.968980353600223
"Map(vectorType -> dense, length -> 6, values -> List(2.31, 1.0, 1.0, 0.0, 0.0, 0.0))",27.34,27.37165371846232
"Map(vectorType -> dense, length -> 6, values -> List(2.43, 2.0, 1.0, 0.0, 0.0, 0.0))",18.57,25.426513741367778
"Map(vectorType -> dense, length -> 6, values -> List(3.11, 1.0, 1.0, 0.0, 0.0, 0.0))",36.1,26.371435073689003


In [0]:
pred_result.r2, pred_result.meanAbsoluteError, pred_result.meanSquaredError


Out[78]: (0.14278340646840226, 7.02392818493533, 68.35618703181152)
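
These three numbers can be reproduced by hand from the four test predictions shown above. The following plain-Python sketch uses hypothetical helper names and the label and prediction values copied from that output.

```python
def mean_absolute_error(labels, predictions):
    """Average of |actual - predicted|."""
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


def mean_squared_error(labels, predictions):
    """Average of (actual - predicted) squared."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)


def r2_score(labels, predictions):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# total_bills and prediction columns from the test-set output above.
labels = [12.49, 27.34, 18.57, 36.1]
predictions = [23.968980353600223, 27.37165371846232,
               25.426513741367778, 26.371435073689003]

r2 = r2_score(labels, predictions)
mae = mean_absolute_error(labels, predictions)
mse = mean_squared_error(labels, predictions)
print(r2, mae, mse)
```

The results agree with `Out[78]` up to floating-point rounding: an R² of about 0.14, a mean absolute error of about 7.02, and a mean squared error of about 68.36.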