![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 1. Chicago Cab Fare Predictor using Linear Regression

Accurately predicting the cost of a taxi ride can provide valuable insights for both riders and service providers, enabling more informed decisions and better financial planning. In this project, we focus on building a **Linear Regression model** to predict taxi fares in **Chicago, Illinois**. By analyzing patterns in historical data, we aim to create a model that can reliably estimate the fare for a given trip.

The [dataset used in this project](https://download.mlcc.google.com/mledu-datasets/chicago_taxi_train.csv) is a **subset of the [City of Chicago Taxi Trips dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew)**, specifically **focusing on a two-day period in May 2022**. This data contains key features such as trip distance, pickup/dropoff locations, and ride duration, which we will leverage to train our predictive model.

**Project Objectives:**

- **Dataset:** A cleaned and preprocessed subset of taxi trips over a two-day period in May 2022.

- **Model:** A Linear Regression model that predicts the fare based on input features like trip distance, time of day, and other relevant variables.

- **Goal:** To build an accurate fare predictor that can assist in understanding taxi fare dynamics in Chicago.

This project not only serves as a practical application of regression modeling but also offers insights into the pricing structure of taxi services in a major metropolitan area.

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 2. Part I: Initial Setup

## 1. Import Required Libraries

In [1]:
# General Imports
import io

# Data Processing
import numpy as np
import pandas as pd

# Machine Learning
import keras

# Data Visualization
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import seaborn as sns

## 2. Load the Dataset

In [2]:
chicago_taxi_dataset = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/chicago_taxi_train.csv")

In [3]:
print(f"Shape of dataset: {chicago_taxi_dataset.shape}")

Shape of dataset: (31694, 18)


In [4]:
chicago_taxi_dataset.head()

| | TRIP_START_TIMESTAMP | TRIP_END_TIMESTAMP | TRIP_START_HOUR | TRIP_SECONDS | TRIP_MILES | TRIP_SPEED | PICKUP_CENSUS_TRACT | DROPOFF_CENSUS_TRACT | PICKUP_COMMUNITY_AREA | DROPOFF_COMMUNITY_AREA | FARE | TIPS | TIP_RATE | TOLLS | EXTRAS | TRIP_TOTAL | PAYMENT_TYPE | COMPANY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 05/17/2022 7:15:00 AM | 05/17/2022 7:45:00 AM | 7.25 | 2341 | 2.57 | 4.0 | NaN | NaN | NaN | 17.0 | 31.99 | 2.0 | 6.3 | 0.0 | 0.0 | 33.99 | Mobile | Flash Cab |
| 1 | 05/17/2022 5:15:00 PM | 05/17/2022 5:30:00 PM | 17.25 | 1074 | 1.18 | 4.0 | NaN | 17031080000.0 | NaN | 8.0 | 9.75 | 3.0 | 27.9 | 0.0 | 1.0 | 14.25 | Credit Card | Flash Cab |
| 2 | 05/17/2022 5:15:00 PM | 05/17/2022 5:30:00 PM | 17.25 | 1173 | 1.29 | 4.0 | 17031320000.0 | 17031080000.0 | 32.0 | 8.0 | 10.25 | 0.0 | 0.0 | 0.0 | 0.0 | 10.25 | Cash | Sun Taxi |
| 3 | 05/17/2022 6:00:00 PM | 05/17/2022 7:00:00 PM | 18.0 | 3360 | 3.7 | 4.0 | 17031320000.0 | 17031240000.0 | 32.0 | 24.0 | 23.75 | 0.0 | 0.0 | 0.0 | 1.0 | 24.75 | Cash | Choice Taxi Association |
| 4 | 05/17/2022 5:00:00 PM | 05/17/2022 5:30:00 PM | 17.0 | 1044 | 1.15 | 4.0 | 17031320000.0 | 17031080000.0 | 32.0 | 8.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | Cash | Flash Cab |


## 3. Update the DataFrame

From the loaded `chicago_taxi_dataset`, we keep only the columns relevant to fare prediction: `TRIP_MILES`, `TRIP_SECONDS`, `FARE`, `COMPANY`, `PAYMENT_TYPE`, and `TIP_RATE`.

In [5]:
# Update the DataFrame to use only specific columns from the dataset
training_df = chicago_taxi_dataset[['TRIP_MILES', 'TRIP_SECONDS', 'FARE', 'COMPANY', 'PAYMENT_TYPE', 'TIP_RATE']]

print(f"Total number of rows: {len(training_df.index)}")
print(f"Shape of dataset: {training_df.shape}\n\n")
training_df.head(200)

Total number of rows: 31694
Shape of dataset: (31694, 6)




| | TRIP_MILES | TRIP_SECONDS | FARE | COMPANY | PAYMENT_TYPE | TIP_RATE |
|---|---|---|---|---|---|---|
| 0 | 2.57 | 2341 | 31.99 | Flash Cab | Mobile | 6.3 |
| 1 | 1.18 | 1074 | 9.75 | Flash Cab | Credit Card | 27.9 |
| 2 | 1.29 | 1173 | 10.25 | Sun Taxi | Cash | 0.0 |
| 3 | 3.70 | 3360 | 23.75 | Choice Taxi Association | Cash | 0.0 |
| 4 | 1.15 | 1044 | 10.00 | Flash Cab | Cash | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 1.13 | 821 | 9.00 | Blue Ribbon Taxi Association | Mobile | 22.9 |
| 196 | 0.57 | 414 | 6.00 | Flash Cab | Cash | 0.0 |
| 197 | 1.22 | 886 | 9.00 | City Service | Cash | 0.0 |
| 198 | 1.68 | 1219 | 9.00 | Sun Taxi | Mobile | 23.0 |
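
As an alternative sketch, the same column subset can be requested at load time through the `usecols` parameter of `pd.read_csv`. To stay self-contained, the example below reads a tiny inline CSV (three rows copied from the preview above, plus a `TIPS` column to be skipped) instead of the full download. The names `sample_csv`, `selected_columns`, and `sample_df` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Three rows copied from the preview above; a stand-in for the full CSV file.
sample_csv = """TRIP_MILES,TRIP_SECONDS,FARE,TIPS,COMPANY,PAYMENT_TYPE,TIP_RATE
2.57,2341,31.99,2.0,Flash Cab,Mobile,6.3
1.18,1074,9.75,3.0,Flash Cab,Credit Card,27.9
1.29,1173,10.25,0.0,Sun Taxi,Cash,0.0
"""

selected_columns = ['TRIP_MILES', 'TRIP_SECONDS', 'FARE', 'COMPANY', 'PAYMENT_TYPE', 'TIP_RATE']

# usecols tells read_csv to parse only the listed columns (TIPS is skipped here)
sample_df = pd.read_csv(io.StringIO(sample_csv), usecols=selected_columns)

print(sample_df.shape)
print(list(sample_df.columns))
```

Selecting at load time avoids holding unused columns in memory, while selecting afterwards (as in the cell above) keeps the full dataset available for later exploration.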


![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 3. Part II: Dataset Exploration

## 1. View Descriptive Statistics of the Dataset

In Pandas, the `DataFrame.describe()` method is used to generate descriptive statistics of a DataFrame, providing a summary of the central tendency, dispersion, and shape of a dataset's distribution for numerical columns.

Here's what `DataFrame.describe()` typically returns for numerical columns:

- **Count:** The number of non-null values.
- **Mean:** The average of the values.
- **Std:** The standard deviation, which measures the amount of variation or dispersion.
- **Min:** The minimum value.
- **25%:** The 25th percentile (first quartile).
- **50%:** The 50th percentile (median or second quartile).
- **75%:** The 75th percentile (third quartile).
- **Max:** The maximum value.

For categorical or object data types, you can use `DataFrame.describe(include=['object'])`, and it will return:

- **Count:** The number of non-null entries.
- **Unique:** The number of unique values.
- **Top:** The most frequent value.
- **Freq:** The frequency of the most frequent value.

**Note 1**

The `DataFrame.describe()` method is useful for getting a quick overview of your data and identifying patterns or outliers.

**Note 2**

The `include='all'` argument ensures that both numerical and categorical data are included in the summary.
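
To make the two summaries concrete, here is a minimal sketch on made-up data (the values are not taken from the taxi dataset). The text column is created with an explicit `object` dtype, an assumption made so that `include=['object']` selects it regardless of the pandas version in use.

```python
import pandas as pd

# Toy data, made up for illustration
toy_df = pd.DataFrame({
    'FARE': [10.0, 20.0, 30.0, 40.0],
    'PAYMENT_TYPE': pd.Series(
        ['Cash', 'Credit Card', 'Credit Card', 'Mobile'], dtype='object'
    ),
})

# Default call: numeric columns only (count, mean, std, min, quartiles, max)
numeric_summary = toy_df.describe()

# Object columns only (count, unique, top, freq)
categorical_summary = toy_df.describe(include=['object'])

print(numeric_summary)
print(categorical_summary)
```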

In [6]:
training_df.describe(include='all')

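| | TRIP_MILES | TRIP_SECONDS | FARE | COMPANY | PAYMENT_TYPE | TIP_RATE |
|---|---|---|---|---|---|---|
| count | 31694.0 | 31694.0 | 31694.0 | 31694 | 31694 | 31694.0 |
| unique | NaN | NaN | NaN | 31 | 7 | NaN |
| top | NaN | NaN | NaN | Flash Cab | Credit Card | NaN |
| freq | NaN | NaN | NaN | 7887 | 14142 | NaN |
| mean | 8.289463 | 1319.796397 | 23.90521 | NaN | NaN | 12.965785 |
| std | 7.265672 | 928.932873 | 16.970022 | NaN | NaN | 15.517765 |
| min | 0.5 | 60.0 | 3.25 | NaN | NaN | 0.0 |
| 25% | 1.72 | 548.0 | 9.0 | NaN | NaN | 0.0 |
| 50% | 5.92 | 1081.0 | 18.75 | NaN | NaN | 12.2 |
| 75% | 14.5 | 1888.0 | 38.75 | NaN | NaN | 20.8 |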


**Note**

You might be wondering why there are groups of `NaN` (not a number) values listed in the output. When working with data in Python, you may see this value when the result of a calculation cannot be computed or when information is missing. In the taxi dataset, `PAYMENT_TYPE` and `COMPANY` are non-numeric, categorical features; numeric statistics such as the mean and max are not meaningful for them, so the output displays `NaN`. The reverse also holds: `unique`, `top`, and `freq` are reported only for categorical columns, so they show `NaN` for the numeric ones.
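
The `NaN` pattern can be reproduced on a small made-up frame. This is only a sketch; `toy_df` and `summary` are illustrative names.

```python
import pandas as pd

# Toy data, made up for illustration
toy_df = pd.DataFrame({
    'FARE': [10.0, 20.0, 30.0],
    'COMPANY': ['Flash Cab', 'Sun Taxi', 'Flash Cab'],
})

summary = toy_df.describe(include='all')

# 'mean' is undefined for the categorical column, so pandas fills it with NaN
print(pd.isna(summary.loc['mean', 'COMPANY']))

# 'top' is undefined for the numeric column, so it is NaN there as well
print(pd.isna(summary.loc['top', 'FARE']))
```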

## 2. Dataset Analysis

Now we are going to analyze the subset of the Chicago Taxi Trips dataset using `pandas`. We'll answer specific questions about the dataset, such as the maximum fare, the mean trip distance, the number of cab companies, the most frequent payment type, and whether any features have missing data.

### Step 1: Finding the Maximum Fare

We use the `max()` function to find the maximum fare from the `FARE` column. This gives us the highest fare recorded in the dataset.

In [7]:
max_fare = training_df['FARE'].max()
print(f"Maximum Fare: ${max_fare:.2f}")

Maximum Fare: $159.25


### Step 2: Calculating the Mean Distance Across All Trips

The `mean()` function computes the average of all values in the `TRIP_MILES` column. This gives us the mean trip distance across all taxi rides in the dataset.

In [8]:
mean_distance = training_df['TRIP_MILES'].mean()
print(f"Mean Distance: {mean_distance:.2f} miles")

Mean Distance: 8.29 miles


### Step 3: Counting the Number of Cab Companies

Here, we use the `nunique()` function on the `COMPANY` column to determine the number of unique cab companies in the dataset.

In [9]:
num_unique_companies = training_df['COMPANY'].nunique()
print(f"Number of Cab Companies: {num_unique_companies}")

Number of Cab Companies: 31


### Step 4: Identifying the Most Frequent Payment Type

The `value_counts()` function counts the occurrences of each payment type in the `PAYMENT_TYPE` column, and `idxmax()` returns the most frequent one. This helps us identify the most commonly used payment method for taxi rides.

In [10]:
most_freq_payment_type = training_df['PAYMENT_TYPE'].value_counts().idxmax()
print(f"Most Frequent Payment Type: {most_freq_payment_type}")

Most Frequent Payment Type: Credit Card
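
As a small sketch of what `value_counts()` returns before `idxmax()` picks the winner, the example below uses a made-up series of payment types. Passing `normalize=True` is an optional extra, not used in the notebook, that reports shares instead of raw counts.

```python
import pandas as pd

# Toy payment data, made up for illustration
payments = pd.Series(['Cash', 'Credit Card', 'Credit Card', 'Mobile', 'Credit Card'])

counts = payments.value_counts()                # occurrences per payment type
shares = payments.value_counts(normalize=True)  # same, as fractions of all rows

print(counts)
print(counts.idxmax())  # label with the highest count
print(shares)
```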


### Step 5: Checking for Missing Data

We use the `isnull().sum().sum()` chain to check for missing values. First, `isnull()` returns a DataFrame of boolean values indicating where data is missing. The first `sum()` counts the missing values in each column, and the second `sum()` adds those per-column counts into a single total. If the result is 0, there are no missing values; otherwise, the dataset has missing entries.

In [13]:
n_missing_values = training_df.isnull().sum().sum()
print(f"Number of Missing Values: {n_missing_values}")
is_missing = "Yes" if n_missing_values > 0 else "No"
print(f"Are any features missing data?\t{is_missing}")

Number of Missing Values: 0
Are any features missing data?	No
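
Since the taxi subset has no missing values, each stage of the chain is easier to see on a made-up frame that does contain gaps. The following is a sketch with illustrative names and values.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries, made up for illustration
toy_df = pd.DataFrame({
    'TRIP_MILES': [1.2, np.nan, 3.4],
    'FARE': [9.0, 12.5, np.nan],
    'COMPANY': ['Flash Cab', None, 'Sun Taxi'],
})

missing_mask = toy_df.isnull()            # boolean DataFrame, True where a value is missing
missing_per_column = missing_mask.sum()   # Series: missing values per column
total_missing = missing_per_column.sum()  # single number across all columns

print(missing_per_column)
print(f"Total missing values: {total_missing}")
```

Stopping after the first `sum()` is often useful in practice, because it shows which columns need attention rather than only whether a problem exists.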


## 3. Generate Correlation Matrix

In this section, we'll generate and analyze a correlation matrix to understand which features in the dataset are most closely related to our target variable, `FARE`. Correlation measures the strength of the linear relationship between two variables, with values ranging from -1 to 1. A value closer to 1 or -1 indicates a strong relationship, while values near 0 indicate a weak or no relationship.

Correlation values have the following meanings:

- **1.0:** Perfect **positive correlation** — when one attribute increases, the other attribute also increases.

- **-1.0:** Perfect **negative correlation** — when one attribute increases, the other attribute decreases.

- **0.0:** **No correlation** — the two attributes are not linearly related.

In general, the **higher the absolute value** of a feature's correlation with the label, the **greater its predictive power** in a linear model.
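
These three cases can be checked with a hand-written Pearson correlation, which is the default method of pandas `corr()`. The sketch below uses made-up values; the `pearson` helper is illustrative and written only to show the calculation.

```python
import numpy as np
import pandas as pd


def pearson(x, y):
    """Pearson correlation: co-variation of x and y scaled by both spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )


miles = [1.0, 2.0, 3.0, 4.0, 5.0]
fare_rising = [5.0, 7.0, 9.0, 11.0, 13.0]   # increases linearly with miles
fare_falling = [13.0, 11.0, 9.0, 7.0, 5.0]  # decreases linearly with miles
u_shaped = [4.0, 1.0, 0.0, 1.0, 4.0]        # (miles - 3) ** 2: related, but not linearly

r_rising = pearson(miles, fare_rising)
r_falling = pearson(miles, fare_falling)
r_u_shaped = pearson(miles, u_shaped)

# Cross-check the helper against pandas
r_pandas = pd.Series(miles).corr(pd.Series(fare_rising))

print(r_rising, r_falling, r_u_shaped, r_pandas)
```

The U-shaped case is worth noting: its values depend entirely on `miles`, yet the correlation is zero because the dependence is not linear.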

### Step 1: View the Correlation Matrix

The `corr()` method computes the pairwise correlation of the features in the dataset. We pass `numeric_only=True` to limit the calculation to numeric columns. The resulting matrix shows how strongly each feature is related to every other feature and to the target label, `FARE`.

In [14]:
training_df.corr(numeric_only=True)

| | TRIP_MILES | TRIP_SECONDS | FARE | TIP_RATE |
|---|---|---|---|---|
| TRIP_MILES | 1.0 | 0.800855 | 0.975344 | -0.049594 |
| TRIP_SECONDS | 0.800855 | 1.0 | 0.830292 | -0.084294 |
| FARE | 0.975344 | 0.830292 | 1.0 | -0.070979 |
| TIP_RATE | -0.049594 | -0.084294 | -0.070979 | 1.0 |


### Step 2: Identify the Feature with the Strongest Correlation to the Label `FARE`

From the correlation matrix, we can see that `TRIP_MILES` has the strongest positive correlation with `FARE` (about 0.975). This indicates that longer trips (in miles) tend to result in higher fares, which is expected. `TRIP_SECONDS` (trip duration) also correlates strongly with the fare (about 0.830), so both distance and time are useful for predicting it. Note that the two features are also correlated with each other (about 0.801), so they carry partly overlapping information.

### Step 3: Identify the Feature with the Weakest Correlation to the Label `FARE`

The feature with the weakest correlation to the label `FARE` is `TIP_RATE` (about -0.071, the smallest absolute value in the `FARE` row). This suggests that the tip rate has almost no linear relationship with the fare, which makes sense because how generously riders tip is often driven by other factors, such as service quality, rather than by the fare amount.
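
The ranking in Steps 2 and 3 can be sketched in code by sorting the `FARE` correlations by absolute value. The numbers below are copied from the correlation matrix shown above; with the real data, the same series could be obtained from `training_df.corr(numeric_only=True)['FARE'].drop('FARE')`.

```python
import pandas as pd

# Correlations with FARE, copied from the matrix shown above
fare_correlations = pd.Series({
    'TRIP_MILES': 0.975344,
    'TRIP_SECONDS': 0.830292,
    'TIP_RATE': -0.070979,
})

# Rank by absolute value so that strong negative correlations are not overlooked
ranked = fare_correlations.abs().sort_values(ascending=False)

print(f"Strongest: {ranked.index[0]}")
print(f"Weakest: {ranked.index[-1]}")
```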

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

You should be able to find the answers to the questions about the dataset by inspecting the table output after running the `DataFrame.describe()` method.

- **What is the maximum fare?** $159.25

- **What is the mean distance across all trips?** 8.2895 miles

- **How many cab companies are in the dataset?** 31

- **What is the most frequent payment type?** Credit Card

- **Are any features missing data?** No

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)
