**Problem Statement**:

Given a dataset of Uber ride information, our objective is to predict the price of an Uber ride from a given pickup point to the agreed drop-off location. We aim to build an accurate regression model that can provide estimates of ride fares, which can be valuable for both riders and drivers.

**Key Tasks**:

1. **Data Pre-processing**:
   - Load the Uber ride dataset.
   - Handle missing data, if any.
   - Convert and standardize data types.
   - Explore and clean the data for modeling.

2. **Regression Model Development**:
   - Implement the following regression models:
     - Linear Regression: A simple regression model to predict fares.
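     - Ridge Regression: Linear regression with an L2 penalty, which shrinks coefficients and stabilizes them when features are multicollinear.
     - Lasso Regression: Linear regression with an L1 penalty, which can shrink some coefficients to exactly zero and so performs feature selection.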
   - Train these models on the dataset.

3. **Model Evaluation**:
   - Evaluate the models using common regression metrics, including but not limited to:
     - R-squared (R2): Proportion of the variance in the target that the model explains.
     - Root Mean Squared Error (RMSE): Typical size of the prediction error, in the same units as the fare.
   - Compare the performance of the three regression models.
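
As a reference for the two metrics listed above, here is a minimal sketch that computes them directly from their definitions. The helper name `r2_and_rmse` and the fare values are made up for illustration; the notebook itself uses `r2_score` and `mean_squared_error` from scikit-learn.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R2, RMSE) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y_true))
    return r2, rmse

# Made-up fares, for illustration only
y_true = [5.0, 7.0, 9.0, 11.0]
perfect = r2_and_rmse(y_true, y_true)        # predictions equal the targets
mean_only = r2_and_rmse(y_true, [8.0] * 4)   # always predict the mean fare
```

A model that always predicts the mean gets an R2 of 0, which is the reference point for reading the scores reported later.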

**Data Source**:

The dataset used for this analysis is obtained from Kaggle, and it includes information related to Uber fares. You can access the dataset using the following link: [Uber Fares Dataset](https://www.kaggle.com/datasets/yasserh/uber-fares-dataset).

**Significance**:

This project is significant because it provides a practical and real-world application of regression analysis. Accurate fare prediction in ride-sharing services like Uber can help riders make informed decisions and help drivers better plan their routes and working hours. The project aims to build models that can potentially enhance the overall user experience and convenience of ride-sharing services.

### Step 1: Importing the dataset and required dependencies

In [1]:
# Importing necessary libraries
# !pip install geopy

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from geopy.distance import great_circle

In [2]:
# Loading the dataset
data = pd.read_csv('uber.csv')

In [3]:
data.head()

|   | Unnamed: 0 | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 24238194 | 2015-05-07 19:52:06.0000003 | 7.5 | 2015-05-07 19:52:06 UTC | -73.999817 | 40.738354 | -73.999512 | 40.723217 | 1 |
| 1 | 27835199 | 2009-07-17 20:04:56.0000002 | 7.7 | 2009-07-17 20:04:56 UTC | -73.994355 | 40.728225 | -73.99471 | 40.750325 | 1 |
| 2 | 44984355 | 2009-08-24 21:45:00.00000061 | 12.9 | 2009-08-24 21:45:00 UTC | -74.005043 | 40.74077 | -73.962565 | 40.772647 | 1 |
| 3 | 25894730 | 2009-06-26 08:22:21.0000001 | 5.3 | 2009-06-26 08:22:21 UTC | -73.976124 | 40.790844 | -73.965316 | 40.803349 | 3 |
| 4 | 17610152 | 2014-08-28 17:47:00.000000188 | 16.0 | 2014-08-28 17:47:00 UTC | -73.925023 | 40.744085 | -73.973082 | 40.761247 | 5 |
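
The Key Tasks mention converting and standardizing data types, and the output above shows `pickup_datetime` stored as a string. The sketch below shows one way that conversion could look on two rows copied from the output. The derived columns `pickup_hour` and `pickup_year` are hypothetical and are not used elsewhere in this notebook.

```python
import io
import pandas as pd

# Two rows copied from the data.head() output
sample_csv = io.StringIO(
    "Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,"
    "dropoff_longitude,dropoff_latitude,passenger_count\n"
    "24238194,2015-05-07 19:52:06.0000003,7.5,2015-05-07 19:52:06 UTC,"
    "-73.999817,40.738354,-73.999512,40.723217,1\n"
    "27835199,2009-07-17 20:04:56.0000002,7.7,2009-07-17 20:04:56 UTC,"
    "-73.994355,40.728225,-73.99471,40.750325,1\n"
)
sample = pd.read_csv(sample_csv)

# Parse the timestamp strings into timezone-aware datetimes
sample["pickup_datetime"] = pd.to_datetime(sample["pickup_datetime"], utc=True)

# Drop identifier columns that carry no predictive information
sample = sample.drop(columns=["Unnamed: 0", "key"])

# Hypothetical derived features (assumption: not part of the original notebook)
sample["pickup_hour"] = sample["pickup_datetime"].dt.hour
sample["pickup_year"] = sample["pickup_datetime"].dt.year
```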


In [4]:
data.describe()

|   | Unnamed: 0 | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|
| count | 200000.0 | 200000.0 | 200000.0 | 200000.0 | 199999.0 | 199999.0 | 200000.0 |
| mean | 27712500.0 | 11.359955 | -72.527638 | 39.935885 | -72.525292 | 39.92389 | 1.684535 |
| std | 16013820.0 | 9.901776 | 11.437787 | 7.720539 | 13.117408 | 6.794829 | 1.385997 |
| min | 1.0 | -52.0 | -1340.64841 | -74.015515 | -3356.6663 | -881.985513 | 0.0 |
| 25% | 13825350.0 | 6.0 | -73.992065 | 40.734796 | -73.991407 | 40.733823 | 1.0 |
| 50% | 27745500.0 | 8.5 | -73.981823 | 40.752592 | -73.980093 | 40.753042 | 1.0 |
| 75% | 41555300.0 | 12.5 | -73.967154 | 40.767158 | -73.963658 | 40.768001 | 2.0 |
| max | 55423570.0 | 499.0 | 57.418457 | 1644.421482 | 1153.572603 | 872.697628 | 208.0 |


### Step 2: Pre-process the dataset


In [5]:
# Checking for and dropping rows with missing or invalid coordinates
data = data.dropna(subset=['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'])
data = data[(data['pickup_latitude'] >= -90) & (data['pickup_latitude'] <= 90)]
data = data[(data['pickup_longitude'] >= -180) & (data['pickup_longitude'] <= 180)]
data = data[(data['dropoff_latitude'] >= -90) & (data['dropoff_latitude'] <= 90)]
data = data[(data['dropoff_longitude'] >= -180) & (data['dropoff_longitude'] <= 180)]
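
The same range checks can be combined into a single boolean mask. This is a sketch under stated assumptions: the helper name `valid_coordinate_mask` is hypothetical and the demo rows are made up, apart from a few extreme values borrowed from the `data.describe()` output. Because `Series.between` evaluates to `False` for missing values, the mask also excludes rows with missing coordinates.

```python
import pandas as pd

def valid_coordinate_mask(df):
    """Boolean mask: True where all four coordinates lie within valid ranges."""
    lat_ok = (df["pickup_latitude"].between(-90, 90)
              & df["dropoff_latitude"].between(-90, 90))
    lon_ok = (df["pickup_longitude"].between(-180, 180)
              & df["dropoff_longitude"].between(-180, 180))
    return lat_ok & lon_ok

# Demo rows: one valid ride, one with out-of-range values, one with a missing value
demo = pd.DataFrame({
    "pickup_latitude":   [40.738354, 1644.421482, None],
    "pickup_longitude":  [-73.999817, -73.98, -73.99],
    "dropoff_latitude":  [40.723217, 40.75, 40.76],
    "dropoff_longitude": [-73.999512, -3356.6663, -73.97],
})
clean = demo[valid_coordinate_mask(demo)]
```

Note that these bounds only guarantee that a point exists on the globe; they do not check that a ride lies in the area the dataset covers.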

In [6]:
# Calculating the great-circle distance (in miles) for each row and storing it in a new 'distance' column
data['distance'] = data.apply(lambda row: great_circle(
    (row['pickup_latitude'], row['pickup_longitude']),
    (row['dropoff_latitude'], row['dropoff_longitude'])).miles, axis=1)
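
Row-wise `apply` with `geopy` can be slow on a dataset of this size. As an alternative (not the method used in this notebook), the haversine formula can be evaluated on whole columns with NumPy. The function name `haversine_miles` and the radius constant are assumptions of this sketch, so results may differ slightly from `great_circle`.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumed constant)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Coordinates of the first two rides shown by data.head()
pickup_lat = np.array([40.738354, 40.728225])
pickup_lon = np.array([-73.999817, -73.994355])
dropoff_lat = np.array([40.723217, 40.750325])
dropoff_lon = np.array([-73.999512, -73.99471])

distances = haversine_miles(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
```

The function also accepts pandas Series, so it could be called with the four coordinate columns of `data` directly.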

In [7]:
# Saving the updated DataFrame to a new CSV file
data.to_csv('uber_pre_processed.csv', index=False)

In [8]:
data = pd.read_csv('uber_pre_processed.csv')

In [9]:
data.head()

|   | Unnamed: 0 | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | distance |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24238194 | 2015-05-07 19:52:06.0000003 | 7.5 | 2015-05-07 19:52:06 UTC | -73.999817 | 40.738354 | -73.999512 | 40.723217 | 1 | 1.04597 |
| 1 | 27835199 | 2009-07-17 20:04:56.0000002 | 7.7 | 2009-07-17 20:04:56 UTC | -73.994355 | 40.728225 | -73.99471 | 40.750325 | 1 | 1.527078 |
| 2 | 44984355 | 2009-08-24 21:45:00.00000061 | 12.9 | 2009-08-24 21:45:00 UTC | -74.005043 | 40.74077 | -73.962565 | 40.772647 | 1 | 3.129464 |
| 3 | 25894730 | 2009-06-26 08:22:21.0000001 | 5.3 | 2009-06-26 08:22:21 UTC | -73.976124 | 40.790844 | -73.965316 | 40.803349 | 3 | 1.032524 |
| 4 | 17610152 | 2014-08-28 17:47:00.000000188 | 16.0 | 2014-08-28 17:47:00 UTC | -73.925023 | 40.744085 | -73.973082 | 40.761247 | 5 | 2.78092 |


In [10]:
data.isnull().sum()

Unnamed: 0           0
key                  0
fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
distance             0
dtype: int64

In [11]:
# Handling missing values if any
data.dropna(inplace=True)

In [12]:
# Splitting the dataset into features (X) and target (y)
X = data[['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'distance']]
y = data['fare_amount']

In [13]:
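# Removing rides whose distance lies more than 3 standard deviations from the mean
# (z-score method; the IQR rule is an alternative).
# Note: X and y were already extracted in In [12], so this filter changes `data` only.
# X and y must be re-derived from the filtered `data` for it to affect the models.
from scipy import stats
z_scores = np.abs(stats.zscore(data['distance']))
outliers = (z_scores > 3)
data = data[~outliers]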
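
Because `X` and `y` are built in In [12] before this filter runs, the rows removed here are still present in the features passed to the models. The sketch below shows the intended order on made-up data: filter first, then extract features and target. The z-score is computed with `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Synthetic rides (made-up values): the last one has an implausibly large distance
rides = pd.DataFrame({
    "distance":    [1.0, 1.5, 2.0, 2.5, 3.0, 1.2, 1.8, 2.2, 2.8, 1.6, 2.4, 5000.0],
    "fare_amount": [5.0, 6.5, 8.0, 9.5, 11.0, 5.5, 7.5, 8.5, 10.5, 7.0, 9.0, 12.0],
})

# Step 1: drop rows whose distance z-score exceeds 3
dist = rides["distance"]
z_scores = np.abs((dist - dist.mean()) / dist.std(ddof=0))
rides_clean = rides[z_scores <= 3]

# Step 2: only now build the feature matrix and target, from the filtered frame
X = rides_clean[["distance"]]
y = rides_clean["fare_amount"]
```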

### Step 3: Implement Linear regression, Ridge, and Lasso regression models

In [14]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardizing the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
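
To make the scaling step concrete, the following sketch reproduces by hand what `fit_transform` and `transform` do: the mean and standard deviation come from the training split only and are then reused for the test split. The arrays are made-up values, not rows from the dataset.

```python
import numpy as np

# Made-up feature values for illustration
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0], [10.0, 100.0]])

# Statistics are estimated on the training split only
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same statistics are applied to both splits
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std
```

Fitting the scaler on the training split alone keeps information about the test set out of the model.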


In [15]:
# Initializing the models
linear_reg_model = LinearRegression()
ridge_model = Ridge(alpha=100)  # alpha sets the strength of the L2 penalty
lasso_model = Lasso(alpha=0.01)  # alpha sets the strength of the L1 penalty
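
How much a given `alpha` matters depends on the number of training rows. Ridge adds `alpha` to the diagonal of X'X, and for standardized features that diagonal equals the number of samples, so `alpha=100` is a small perturbation when there are many thousands of rows. The sketch below illustrates this with the closed-form ridge solution on synthetic data. The function name `ridge_coefficients` and all values are made up, and the intercept is omitted because the data are centered.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10_000, 3

# Synthetic standardized features and a centered target (made-up data)
X = rng.standard_normal((n_samples, n_features))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(n_samples)
y = y - y.mean()

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without intercept: (X'X + alpha*I)^-1 X'y."""
    gram = X.T @ X
    return np.linalg.solve(gram + alpha * np.eye(X.shape[1]), X.T @ y)

ols = ridge_coefficients(X, y, alpha=0.0)         # ordinary least squares
ridge_100 = ridge_coefficients(X, y, alpha=100.0)
ridge_huge = ridge_coefficients(X, y, alpha=1e6)  # penalty far larger than n_samples

# Relative change in the coefficient vector caused by alpha=100
relative_change = np.linalg.norm(ridge_100 - ols) / np.linalg.norm(ols)
```

With `alpha` well below the sample count the coefficients barely move, which is consistent with Ridge and Linear Regression producing nearly identical scores on a large training set.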


In [16]:
# Training the models
linear_reg_model.fit(X_train_scaled, y_train)
ridge_model.fit(X_train_scaled, y_train)
lasso_model.fit(X_train_scaled, y_train)


In [17]:
# Evaluating the models on the test set and comparing their scores
def evaluate_model(model, X, y):
    """Return (R2, RMSE) of the model's predictions on X against y."""
    y_pred = model.predict(X)
    r2 = r2_score(y, y_pred)
    rmse = np.sqrt(mean_squared_error(y, y_pred))
    return r2, rmse


linear_reg_r2, linear_reg_rmse = evaluate_model(linear_reg_model, X_test_scaled, y_test)
ridge_r2, ridge_rmse = evaluate_model(ridge_model, X_test_scaled, y_test)
lasso_r2, lasso_rmse = evaluate_model(lasso_model, X_test_scaled, y_test)

print("Linear Regression R2 Score:", linear_reg_r2)
print("Linear Regression RMSE:", linear_reg_rmse)
print("Ridge Regression R2 Score:", ridge_r2)
print("Ridge Regression RMSE:", ridge_rmse)
print("Lasso Regression R2 Score:", lasso_r2)
print("Lasso Regression RMSE:", lasso_rmse)


Linear Regression R2 Score: 0.0006381141889498787
Linear Regression RMSE: 9.576268794382056
Ridge Regression R2 Score: 0.0006415603612498488
Ridge Regression RMSE: 9.576252283095616
Lasso Regression R2 Score: 0.0006315091164006414
Lasso Regression RMSE: 9.576300440498704




1. **Linear Regression R2 Score: 0.0006381141889498787**
   - The R2 score, or coefficient of determination, measures the proportion of the variance in the dependent variable that the model explains. A score close to 1 indicates a good fit, while a score close to 0 means the model does little better than always predicting the mean fare. Here the Linear Regression model scores approximately 0.00064, so it explains almost none of the variance in the test fares.

2. **Linear Regression RMSE: 9.576268794382056**
   - RMSE stands for Root Mean Squared Error: the square root of the average squared difference between predicted and actual values. It is expressed in the units of `fare_amount`, and lower is better. Here the RMSE is approximately 9.576, which is larger than the median fare of 8.5 reported by `data.describe()`.

3. **Ridge Regression R2 Score: 0.0006415603612498488**
   - This is the R2 score for the Ridge Regression model. Like the Linear Regression score, it is very close to 0, indicating a poor fit.

4. **Ridge Regression RMSE: 9.576252283095616**
   - This is the RMSE for the Ridge Regression model. It matches the Linear Regression RMSE to four decimal places, so the Ridge penalty makes no practical difference here.

5. **Lasso Regression R2 Score: 0.0006315091164006414**
   - This is the R2 score for the Lasso Regression model. Again, it is very close to 0, indicating a poor fit.

6. **Lasso Regression RMSE: 9.576300440498704**
   - This is the RMSE for the Lasso Regression model. It is again approximately 9.576, practically identical to the other two models.
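
To put these numbers in context, R2 can be written as 1 - MSE_model / MSE_baseline, where the baseline always predicts the mean of the test fares. Rearranging gives the RMSE that this baseline would have achieved, using only the two Linear Regression scores reported above. The variable names in this sketch are illustrative.

```python
import math

# Test-set scores reported above for Linear Regression
r2 = 0.0006381141889498787
rmse = 9.576268794382056

# R2 = 1 - MSE_model / MSE_baseline, with a baseline that always predicts
# the mean of y_test. Rearranged: RMSE_baseline = RMSE_model / sqrt(1 - R2)
baseline_rmse = rmse / math.sqrt(1.0 - r2)
improvement = baseline_rmse - rmse

print(f"Mean-baseline RMSE: {baseline_rmse:.4f}")
print(f"Improvement over baseline: {improvement:.4f}")
```

The improvement over the mean-fare baseline is well under 0.01 fare units, which is another way of seeing that the models have learned almost nothing from the features.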

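All three regression models (Linear, Ridge, and Lasso) have R2 scores close to 0 and RMSE values of about 9.576, and they differ from one another only in the fifth decimal place. None of them explains or predicts the target variable well. The notebook itself points to some likely contributors: the coordinate filter in In [5] only enforces globally valid ranges, `data.describe()` shows negative fares (minimum -52.0) and a passenger count of 208 that are never removed, and the distance outlier filter in In [13] runs after `X` and `y` have been extracted, so it does not affect the training data. Cleaning these issues and re-deriving the features would be a reasonable next step before trying a different modeling approach.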