# Homework - Week 01

## 01. Downloading the data

Download the data for January and February 2023.
Read the data for January. How many columns are there?

> 19

In [None]:
import pandas as pd


# download the data for January 2023
df_january = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet')

# download the data for February 2023
df_february = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet')


In [5]:
# check the number of columns in the january data
print(f"Number of columns in January data: {len(df_january.columns)}")

Number of columns in January data: 19


## 02. Computing duration
Now let's compute the duration variable. It should contain the duration of a ride in minutes.
What's the standard deviation of the trip durations in January?

> 42.59
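
As a minimal sketch of the conversion, the toy example below uses two invented rides (the timestamps are assumptions for illustration, not rows from the dataset). Subtracting the pickup column from the dropoff column gives a timedelta, and dividing its total seconds by 60 gives fractional minutes.

```python
import pandas as pd

# Toy data: two hypothetical rides (values invented for illustration)
rides = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-01-01 00:00:00", "2023-01-01 10:15:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2023-01-01 00:12:30", "2023-01-01 11:00:00"]),
})

# Subtracting two datetime columns yields a timedelta column
delta = rides["tpep_dropoff_datetime"] - rides["tpep_pickup_datetime"]

# total_seconds() / 60 converts each timedelta to fractional minutes
rides["trip_duration"] = delta.dt.total_seconds() / 60

print(rides["trip_duration"].tolist())  # [12.5, 45.0]
```

Note that `Series.std()` in pandas computes the sample standard deviation (`ddof=1`) by default.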


In [17]:
# compute the trip duration as a timedelta (dropoff minus pickup)
df_january['trip_duration'] = df_january['tpep_dropoff_datetime'] - df_january['tpep_pickup_datetime']

# convert the trip duration to minutes
df_january['trip_duration'] = df_january['trip_duration'].dt.total_seconds() / 60


In [18]:
# compute the standard deviation of the trip duration in the january data
print(f"Standard deviation of trip duration in January data: {df_january['trip_duration'].std()}")

Standard deviation of trip duration in January data: 42.594351241920904


## 03. Dropping outliers

Next, we need to check the distribution of the duration variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

> 98%
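
The inclusive filter can be sketched on a handful of invented durations (the values are assumptions chosen to include negative and very long trips). `Series.between` is inclusive on both ends by default, so it is one possible way to express "at least 1 and at most 60 minutes".

```python
import pandas as pd

# Hypothetical durations in minutes, including negative and very long values
durations = pd.Series([-29.2, 0.5, 1.0, 11.5, 60.0, 60.1, 10029.18])

# between() is inclusive on both ends by default, matching ">= 1 and <= 60"
mask = durations.between(1, 60)

kept = durations[mask]
fraction_kept = mask.mean()  # mean of a boolean mask = share of True values

print(kept.tolist())           # [1.0, 11.5, 60.0]
print(f"{fraction_kept:.2%}")  # 42.86%
```

Taking the mean of the boolean mask gives the retained fraction directly, without computing the two row counts separately.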

In [19]:
# check the distribution of trip duration
df_january['trip_duration'].describe()

count    3.066766e+06
mean     1.566900e+01
std      4.259435e+01
min     -2.920000e+01
25%      7.116667e+00
50%      1.151667e+01
75%      1.830000e+01
max      1.002918e+04
Name: trip_duration, dtype: float64

In [20]:
total_trips = len(df_january)

# clean df_january and leave only trips between 1 and 60 minutes (inclusive)

df_january_cleaned = df_january[
    (df_january['trip_duration'] >= 1) & (df_january['trip_duration'] <= 60)
]

# check the number of rows in the cleaned data
print(f"Number of rows in cleaned January data: {len(df_january_cleaned)}")
# check the number of rows in the original data
print(f"Number of rows in original January data: {total_trips}")
# check the ratio of cleaned data to original data
print(f"Ratio of cleaned data to original data: {len(df_january_cleaned) / total_trips:.2%}")

Number of rows in cleaned January data: 3009173
Number of rows in original January data: 3066766
Ratio of cleaned data to original data: 98.12%


## 04. One-hot encoding
Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

- Turn the dataframe into a list of dictionaries (remember to re-cast the ids to strings - otherwise it will label encode them)
- Fit a dictionary vectorizer
- Get a feature matrix from it
- What's the dimensionality of this matrix (number of columns)?

> 515


In [None]:
# select the relevant columns
categorical_cols = [
    'PULocationID',
    'DOLocationID',
    #'payment_type',
    #'passenger_count'
]

# one-hot encode the categorical columns with dict vectorization

# take an explicit copy so the assignment below does not raise SettingWithCopyWarning
df_january_cleaned = df_january_cleaned.copy()

# cast the IDs to strings; otherwise DictVectorizer keeps them as numeric features instead of one-hot encoding them
df_january_cleaned[categorical_cols] = df_january_cleaned[categorical_cols].astype(str)

# check the data types of the categorical columns
df_january_cleaned[categorical_cols].dtypes

PULocationID    object
DOLocationID    object
dtype: object

In [67]:
from sklearn.feature_extraction import DictVectorizer

# create a dict vectorizer
dv = DictVectorizer()

# helper: convert the selected columns to a list of per-row dicts
def df_to_dict(df, categorical_cols):
    return df[categorical_cols].to_dict(orient='records')

# build the list of dicts for the training data
train_dicts = df_to_dict(df_january_cleaned, categorical_cols)

# fit the vectorizer and build the sparse feature matrix
X_train = dv.fit_transform(train_dicts)

# check the shape of the matrix
print(f"Shape of the matrix: {X_train.shape}")

Shape of the matrix: (3009173, 515)


## 05. Training a model
Now let's use the feature matrix from the previous step to train a model.

- Train a plain linear regression model with default parameters, where duration is the response variable
- Calculate the RMSE of the model on the training data
- What's the RMSE on train?

> 7.64 (the run below prints 7.65)
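
For reference, RMSE is the square root of the mean squared difference between the true and predicted values. The sketch below computes it by hand on invented numbers (the durations are assumptions for illustration) using only the standard library.

```python
import math

# Hypothetical true and predicted durations in minutes
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]

# RMSE = sqrt(mean((y_true - y_pred)^2))
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(f"RMSE: {rmse:.2f} minutes")  # RMSE: 2.38 minutes
```

The `root_mean_squared_error` function used in this notebook was added in scikit-learn 1.4; on older versions, `mean_squared_error(y_true, y_pred, squared=False)` should give the same value.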

In [68]:
# set the target variable
y_train = df_january_cleaned['trip_duration'].values

# check the shape of the target variable
print(f"Shape of the target variable: {y_train.shape}")

Shape of the target variable: (3009173,)


In [69]:
# create a simple linear regression model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

linear_reg = LinearRegression()

# fit the model
linear_reg.fit(X_train, y_train)

y_pred = linear_reg.predict(X_train)

# check the RMSE
rmse = root_mean_squared_error(y_train, y_pred)

print(f"RMSE: {rmse:.2f} minutes")

RMSE: 7.65 minutes


## 06. Evaluating the model
Now let's apply this model to the validation dataset (February 2023).

What's the RMSE on validation?

> 7.81
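
A minimal sketch of what applying an already fitted vectorizer to new data does, using invented records. Calling `transform` (not `fit_transform`) reuses the training vocabulary, so the validation matrix keeps the same columns, and a category that was never seen during fitting is silently ignored.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training and validation records
toy_train = [{"PULocationID": "43"}, {"PULocationID": "161"}]
toy_val = [{"PULocationID": "161"}, {"PULocationID": "999"}]  # "999" never seen in training

toy_dv = DictVectorizer()
X_toy_train = toy_dv.fit_transform(toy_train)
X_toy_val = toy_dv.transform(toy_val)  # reuse the fitted vocabulary, do not refit

print(X_toy_train.shape, X_toy_val.shape)  # (2, 2) (2, 2)
print(X_toy_val.toarray().tolist())        # [[1.0, 0.0], [0.0, 0.0]]
```

The second validation row is all zeros because its only category has no column in the fitted vocabulary.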


In [70]:
# apply the same transformation to the february data
df_february['trip_duration'] = df_february['tpep_dropoff_datetime'] - df_february['tpep_pickup_datetime']
df_february['trip_duration'] = df_february['trip_duration'].dt.total_seconds() / 60

# clean df_february and leave only trips between 1 and 60 minutes (inclusive)
df_february_cleaned = df_february[
    (df_february['trip_duration'] >= 1) & (df_february['trip_duration'] <= 60)
]

# check the number of rows in the cleaned data
print(f"Number of rows in cleaned February data: {len(df_february_cleaned)}")

# take an explicit copy so the assignment below does not raise SettingWithCopyWarning
df_february_cleaned = df_february_cleaned.copy()

# cast the IDs to strings, matching the preprocessing of the training data
df_february_cleaned[categorical_cols] = df_february_cleaned[categorical_cols].astype(str)

# transform the categorical columns to a dict
val_dicts = df_to_dict(df_february_cleaned, categorical_cols)

# transform the categorical columns to a matrix with the vectorizer fitted on January
X_val = dv.transform(val_dicts)

# check the shape of the matrix
print(f"Shape of the matrix: {X_val.shape}")

# set the target variable
y_val = df_february_cleaned['trip_duration'].values

# check the shape of the target variable
print(f"Shape of the target variable: {y_val.shape}")

Number of rows in cleaned February data: 2855951
Shape of the matrix: (2855951, 515)
Shape of the target variable: (2855951,)


In [71]:
# predict the trip duration
y_pred_val = linear_reg.predict(X_val)

# check the RMSE
rmse = root_mean_squared_error(y_val, y_pred_val)
print(f"RMSE: {rmse:.2f} minutes")

RMSE: 7.81 minutes
