<a href="https://colab.research.google.com/github/diegodemiranda/linear_regression_models/blob/main/chicago_taxi_fare_prediction/linear_regression_chicago_taxi_fare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Linear Regression - Chicago Taxi Fare Prediction

This notebook uses a dataset of taxi trips to train a linear regression model that predicts the fare of a taxi ride in Chicago, Illinois.

# Part 1 - Set Up the Environment


---

#### Load dependencies

This notebook depends on several Python libraries for data manipulation, machine learning, and data visualization.

In [None]:
# general
import io

# data
import numpy as np
import pandas as pd

# machine learning
import keras

# data visualization
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import seaborn as sns

#### Load the dataset
The following code cell loads the dataset into a pandas DataFrame and keeps only the columns used in this notebook.

In [None]:
chicago_taxi_dataset = pd.read_csv("chicago_taxi_train.csv")

training_df = chicago_taxi_dataset[['TRIP_MILES', 'TRIP_SECONDS', 'FARE', 'COMPANY', 'PAYMENT_TYPE', 'TIP_RATE']]

print('Read dataset completed successfully.')
print('Total number of rows: {0}\n\n'.format(len(training_df.index)))

training_df.head(200)

# Part 2 - Dataset Exploration


---



### View dataset statistics

In this step, I will use the `DataFrame.describe` method to view **descriptive statistics** about the dataset and answer some important questions about the data.

In [None]:
training_df.describe(include='all')
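
As a rough sketch of how the `describe` output can answer such questions, the snippet below pulls a few values out of the summary table. The `summarize_fares` helper and the four-row toy DataFrame are assumptions made for illustration; only the column names come from the dataset loaded above.

```python
import pandas as pd


def summarize_fares(df: pd.DataFrame) -> dict:
    """Hypothetical helper: pull a few headline numbers out of describe()."""
    stats = df.describe(include="all")
    return {
        # Numeric rows such as "max" and "mean" are filled for numeric columns.
        "max_fare": float(stats.loc["max", "FARE"]),
        "mean_trip_miles": float(stats.loc["mean", "TRIP_MILES"]),
        # The "unique" row is filled for text columns.
        "unique_companies": int(stats.loc["unique", "COMPANY"]),
        # describe() reports counts per column; isnull() gives the total gaps.
        "missing_values": int(df.isnull().sum().sum()),
    }


# Toy rows for illustration only; these are not values from the real dataset.
toy_df = pd.DataFrame({
    "TRIP_MILES": [1.0, 2.0, 3.0, 6.0],
    "FARE": [5.0, 8.0, 11.0, 20.0],
    "COMPANY": ["A", "B", "A", "C"],
})
summary = summarize_fares(toy_df)
print(summary)
```

On the real data, the same call would be `summarize_fares(training_df)`.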

### View correlation matrix

In this step, I will use a **correlation matrix** to identify features whose values correlate well with the label. In essence, the matrix shows how strongly the numerical features of the taxi trip data are related to each other.

Correlation values have the following meanings:

* **+1**: Perfect positive correlation (as one feature increases, the other increases proportionally)
* **-1**: Perfect negative correlation (as one feature increases, the other decreases proportionally)
* **0**: No linear correlation (no linear relationship between the features)

In general, the higher the absolute value of a feature's correlation with the label, the greater that feature's predictive power in a linear model.
Note that correlation **does not imply causation**, and that the correlation matrix only captures **linear relationships**. If the relationship between two variables is non-linear, the correlation can be close to 0 even when the relationship is strong.
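
The point about non-linear relationships can be checked with a small synthetic example. This is only a sketch on made-up data (the `demo_df` frame and its column names are assumptions, not part of the taxi dataset): one column is exactly linear in `x`, the other is the square of `x`.

```python
import numpy as np
import pandas as pd

# Inputs symmetric around zero, so the quadratic has no overall linear trend.
x = np.linspace(-1.0, 1.0, 201)
demo_df = pd.DataFrame({
    "x": x,
    "linear": 2.0 * x + 1.0,  # exactly linear in x
    "quadratic": x ** 2,      # fully determined by x, but not linearly
})

# DataFrame.corr computes Pearson correlation by default.
corr = demo_df.corr()
linear_r = corr.loc["x", "linear"]
quadratic_r = corr.loc["x", "quadratic"]

print(f"x vs linear:    {linear_r:.3f}")
print(f"x vs quadratic: {quadratic_r:.3f}")
```

The linear column correlates almost perfectly with `x`, while the quadratic column shows a correlation close to zero even though `x` determines it completely.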

In [None]:
training_df.corr(numeric_only = True)
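
To read the matrix with the label in mind, one option is to take the `FARE` column and sort the remaining features by absolute correlation. The sketch below assumes `FARE` is the label; the `rank_features_by_correlation` helper and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd


def rank_features_by_correlation(df: pd.DataFrame, label: str) -> pd.Series:
    """Hypothetical helper: sort numeric features by |correlation| with the label."""
    # Keep only the label's column of the matrix and drop its self-correlation.
    corr_with_label = df.corr(numeric_only=True)[label].drop(label)
    return corr_with_label.abs().sort_values(ascending=False)


# Toy rows for illustration only; these are not values from the real dataset.
toy_df = pd.DataFrame({
    "TRIP_MILES": [1.0, 2.0, 3.0, 4.0, 5.0],
    "TRIP_SECONDS": [300, 700, 800, 1300, 1400],
    "TIP_RATE": [10.0, 0.0, 20.0, 5.0, 15.0],
    "FARE": [5.0, 7.5, 10.0, 12.5, 15.0],
    "COMPANY": ["A", "B", "A", "C", "B"],
})
ranking = rank_features_by_correlation(toy_df, "FARE")
print(ranking)
```

On the real data, the call would be `rank_features_by_correlation(training_df, "FARE")`. Text columns such as `COMPANY` are excluded by `numeric_only=True`.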