# 🚖 NYC Taxi Fare Prediction

## 📌 Overview  
This project predicts the **fare amount of New York City taxi rides** using machine learning regression models.  
The dataset includes pickup/drop-off coordinates, time, and passenger count. By applying **feature engineering** (distance calculation, pickup time, day of week), we train multiple models and compare their performance.  

---

## 🛠️ Tech Stack  
- Python  
- Pandas, NumPy  
- Matplotlib, Seaborn  
- Scikit-learn (Linear Regression, Decision Tree, Random Forest, Gradient Boosting)  

---

## 📊 Dataset  
- **Source:** TaxiFare.csv (50,000 entries)  
- **Features before preprocessing:**  
  - `unique_id`  
  - `amount` (target variable)  
  - `date_time_of_pickup`  
  - `longitude_of_pickup`, `latitude_of_pickup`  
  - `longitude_of_dropoff`, `latitude_of_dropoff`  
  - `no_of_passenger`  

---

## 🚀 Project Workflow  
1. **Data Preprocessing**  
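   - Removed `unique_id`, kept only single-passenger rides, then dropped `no_of_passenger`.  
   - Dropped outliers: kept rides with a distance between 1 and 10 miles and a fare between $0 and $50.  
   - Engineered new features:  
     - `day_of_week` (0=Monday,…,6=Sunday)  
     - `pickup_time` (hour of day)  
     - `distance` in miles (Euclidean distance from the longitude/latitude differences, scaled by 54.6 miles per degree of longitude and 69.0 miles per degree of latitude).  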

2. **Feature Selection**  
   - Final features: `day_of_week`, `pickup_time`, `distance`.  
   - Target: `amount` (fare).  

3. **Model Training**  
   - **Linear Regression**  
   - **Decision Tree Regressor**  
   - **Random Forest Regressor**  
   - **Gradient Boosting Regressor**  

4. **Evaluation**  
   - Compared R² scores on a held-out 20% test split.  
   - Calculated Mean Absolute Error (MAE) on the same split.  

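The preprocessing steps above can be sketched as one small vectorized function. This is a minimal sketch rather than the notebook's exact code: the function name `engineer_features` and the sample rows are hypothetical (the rows are made up so the example runs), while the miles-per-degree constants and the filter bounds are the ones used in the notebook.

```python
import numpy as np
import pandas as pd

MILES_PER_DEGREE_LON = 54.6  # constant used in the notebook
MILES_PER_DEGREE_LAT = 69.0  # constant used in the notebook


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return amount, day_of_week, pickup_time, and distance for filtered rides."""
    # Keep single-passenger rides only.
    rides = df[df["no_of_passenger"] == 1]

    # Parse the pickup timestamp once and derive the time features from it.
    pickup = pd.to_datetime(rides["date_time_of_pickup"], format="%Y-%m-%d %H:%M:%S UTC")

    # Approximate straight-line distance in miles from coordinate differences.
    dx = (rides["longitude_of_dropoff"] - rides["longitude_of_pickup"]) * MILES_PER_DEGREE_LON
    dy = (rides["latitude_of_dropoff"] - rides["latitude_of_pickup"]) * MILES_PER_DEGREE_LAT

    out = pd.DataFrame({
        "amount": rides["amount"],
        "day_of_week": pickup.dt.weekday,  # 0 = Monday, ..., 6 = Sunday
        "pickup_time": pickup.dt.hour,
        "distance": np.sqrt(dx**2 + dy**2),
    })

    # Outlier filters: 1 < distance < 10 miles and 0 < amount < 50.
    keep = (
        (out["distance"] > 1.0) & (out["distance"] < 10.0)
        & (out["amount"] > 0.0) & (out["amount"] < 50.0)
    )
    return out[keep].reset_index(drop=True)


# Made-up sample rows (NOT taken from TaxiFare.csv), only to exercise the function.
sample = pd.DataFrame({
    "amount": [8.5, 12.0, 5.0],
    "date_time_of_pickup": ["2014-06-13 17:30:00 UTC"] * 3,
    "longitude_of_pickup": [-74.00, -74.00, -74.00],
    "latitude_of_pickup": [40.70, 40.70, 40.70],
    "longitude_of_dropoff": [-73.98, -73.98, -74.00],
    "latitude_of_dropoff": [40.72, 40.72, 40.70],
    "no_of_passenger": [1, 2, 1],
})
features = engineer_features(sample)
print(features)
```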
---

## 📈 Results  

| Model                       | R² Score | MAE (USD) |
|-----------------------------|----------|-----------|
| Linear Regression           | 0.72     | 2.42      |
| Decision Tree Regressor     | 0.48     | 3.33      |
| Random Forest Regressor     | 0.70     | 2.55      |
| Gradient Boosting Regressor | **0.75** | **2.29**  |

✅ **Gradient Boosting Regressor** performed best, with the highest R² (0.75) and the lowest MAE (2.29).  

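For reference, both metrics in the table can be computed directly from a set of predictions. This is a minimal NumPy sketch: the helper names `mae_manual` and `r2_manual` are hypothetical, and the four fares are made-up numbers rather than project results. scikit-learn's `mean_absolute_error` and `r2_score` use the same definitions.

```python
import numpy as np


def mae_manual(y_true, y_pred) -> float:
    """Mean absolute error: average size of the prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def r2_manual(y_true, y_pred) -> float:
    """R²: 1 minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up fares, for illustration only.
actual = [10.0, 12.0, 8.0, 14.0]
predicted = [11.0, 11.0, 9.0, 13.0]

mae_value = mae_manual(actual, predicted)
r2_value = r2_manual(actual, predicted)
print(f"MAE: {mae_value:.2f}, R²: {r2_value:.2f}")
```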
Example predictions for a 2.0-mile ride on a Friday at hour 17 (`day_of_week=4`, `pickup_time=17`, `distance=2.0`):  
- Linear Regression → **$10.75**  
- Decision Tree → **$19.50**  
- Random Forest → **$16.09**  
- Gradient Boosting → **$11.67**  

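When a model has been fitted on a DataFrame, passing the query as a one-row DataFrame with the same column names keeps the feature order explicit. The following is a minimal sketch, assuming the three final features listed earlier; the training rows are made up purely so the example runs and do not come from TaxiFare.csv, so the printed fare is not comparable to the figures above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["day_of_week", "pickup_time", "distance"]

# Made-up training rows (NOT real taxi data): fare = 2.5 + 2.0 * distance.
train = pd.DataFrame(
    [[0, 8, 1.0], [2, 12, 3.0], [4, 17, 5.0], [6, 22, 2.0], [1, 9, 4.0]],
    columns=FEATURES,
)
fares = 2.5 + 2.0 * train["distance"]

toy_model = LinearRegression().fit(train, fares)

# Friday (4), hour 17, 2.0 miles, in the same column order used for training.
query = pd.DataFrame([[4, 17, 2.0]], columns=FEATURES)
predicted_fare = float(toy_model.predict(query)[0])
print(f"Predicted fare: ${predicted_fare:.2f}")
```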
---

## ▶️ How to Run  
```bash
# Install dependencies
pip install pandas numpy matplotlib seaborn scikit-learn

# Run the notebook
jupyter notebook taxifare_prediction.ipynb
```

Notebook code:

```python
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

df = pd.read_csv('TaxiFare.csv')  # dataset file; adjust the path if it lives elsewhere
df.head()

df.shape

df.info()

sns.countplot(x=df['no_of_passenger'])

df = df[df['no_of_passenger'] == 1]
df = df.drop(['unique_id', 'no_of_passenger'], axis=1)
df.head()

df.shape

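# The timestamp column is still text here, so restrict the correlation to numeric columns.
corr_matrix = df.corr(numeric_only=True)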
corr_matrix['amount'].sort_values(ascending=False)

import numpy as np

# Parse the pickup timestamps once, then derive the time-based features.
pickup = pd.to_datetime(df['date_time_of_pickup'], format='%Y-%m-%d %H:%M:%S UTC')
df['day_of_week'] = pickup.dt.weekday  # 0 = Monday, ..., 6 = Sunday
df['pickup_time'] = pickup.dt.hour

# Approximate straight-line distance in miles from the coordinate differences.
x = (df['longitude_of_dropoff'] - df['longitude_of_pickup']) * 54.6  # 1 degree == 54.6 miles
y = (df['latitude_of_dropoff'] - df['latitude_of_pickup']) * 69.0    # 1 degree == 69 miles
df['distance'] = np.sqrt(x**2 + y**2)

df.head()

df.drop(columns=['date_time_of_pickup', 'longitude_of_pickup', 'latitude_of_pickup', 'longitude_of_dropoff', 'latitude_of_dropoff'], inplace=True)
df.head()

corr_matrix = df.corr()
corr_matrix["amount"].sort_values(ascending=False)

df.describe()

df = df[(df['distance'] > 1.0) & (df['distance'] < 10.0)]
df = df[(df['amount'] > 0.0) & (df['amount'] < 50.0)]
df.shape

corr_matrix = df.corr()
corr_matrix["amount"].sort_values(ascending=False)

from sklearn.model_selection import train_test_split

x = df.drop(['amount'], axis=1)
y = df['amount']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)


from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train, y_train)

from sklearn.tree import DecisionTreeRegressor

decision_tree_model = DecisionTreeRegressor()
decision_tree_model.fit(x_train, y_train)

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

random_forest_model = RandomForestRegressor()
random_forest_model.fit(x_train, y_train)

model1 = GradientBoostingRegressor()
model1.fit(x_train, y_train)

print("Linear regression:", model.score(x_test, y_test))
print("Decision tree:", decision_tree_model.score(x_test, y_test))
print("Random forest:", random_forest_model.score(x_test, y_test))
print("Gradient boosting:", model1.score(x_test, y_test))

from sklearn.metrics import mean_absolute_error

print("Linear regression MAE:", mean_absolute_error(y_test, model.predict(x_test)))
print("Decision tree MAE:", mean_absolute_error(y_test, decision_tree_model.predict(x_test)))
print("Random forest MAE:", mean_absolute_error(y_test, random_forest_model.predict(x_test)))
print("Gradient boosting MAE:", mean_absolute_error(y_test, model1.predict(x_test)))

# Predict the fare for one ride, using the same column names as the training data
# (day_of_week=4, pickup_time=17, distance=2.0 miles).
sample = pd.DataFrame([[4, 17, 2.0]], columns=x.columns)
print("Linear regression:", model.predict(sample))
print("Decision tree:", decision_tree_model.predict(sample))
print("Random forest:", random_forest_model.predict(sample))
print("Gradient boosting:", model1.predict(sample))
```