# 9. Integrating Polars Into the Data Science Workflow - Quiz

## 9.0 Import `polars` and Load Data

In [1]:
import matplotlib.pyplot as plt
import polars as pl
from sklearn.metrics import mean_absolute_error

pl.Config.set_tbl_rows(16)

polars.config.Config

In [2]:
zone_column_rename_mapping = {
    "LocationID": "location_id",
    "Borough": "borough",
    "Zone": "zone",
}
zones_df = (
    pl.read_parquet("../data/taxi_zone_lookup.parquet")
    .rename(zone_column_rename_mapping)
)

In [3]:
yellow_rides_column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}

rides_df_raw = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(yellow_rides_column_rename_mapping)
    .join(
        zones_df.select(pl.all().name.prefix("pu_")),
        on="pu_location_id",
    )
    .join(
        zones_df.select(pl.all().name.prefix("do_")),
        on="do_location_id",
    )
)

## 9.1 Question 1: Feature Least Correlated with Passenger Count

Using `rides_df_raw`, which feature has the weakest correlation with `passenger_count`, i.e. the smallest correlation in absolute value, whether positive or negative? (Hint: you might need the Polars method for absolute value, `.abs()`. Also, remember to filter out `null` values as done in the module!)

In [4]:
#### YOUR CODE HERE

| | passenger_count |
|---|---|
| str | f64 |
| "trip_distance" | 0.000956 |
| "improvement_surcharge" | -0.003378 |


1. `passenger_count`
2. `extra`
3. `trip_distance`
4. `vendor_id`

## 9.2 Question 2: Total Amount vs Trip Distance Plot Analysis

Plot `total_amount` as a function of `trip_distance`. Which statements about the plot are true?

In [5]:
#### YOUR CODE HERE

TypeError: 'DataFramePlot' object is not callable

1. A second, smaller group of the data follows a trend line with a slope of approximately \$20/mile to \$22/mile
2. The majority of the data follows a trend line with a slope of approximately \$5/mile to \$7/mile
3. Some rides appear to have a negative trip distance
4. A non-negligible minority of the data appears to have a trip distance of exactly 0

## 9.3 Question 3: Fare Amount Distribution Analysis

Plot an ECDF of `fare_amount`. Is the resulting distribution unimodal or multimodal (i.e. does the distribution have one peak or several)? (Hint: exclude any noisy spikes!)

In [None]:
#### YOUR CODE HERE

1. Unimodal
2. Multimodal

## 9.4 Question 4: Mean Absolute Error Calculation

Given the following toy DataFrame of `y_predicted` and `y_truth`, measure the `mean_absolute_error`. True or False: the result is greater than 0.5. (Hint: use the `sklearn` implementation of `mean_absolute_error`.)

In [None]:
toy_result_df = pl.DataFrame({
    "y_predicted": [0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1],
    "y_truth": [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1],
})
#### YOUR CODE HERE

1. True
2. False

## 9.5 Question 5: Sampling DataFrame with Fraction

In the module, we used `.sample()` to reduce our data to a fixed number of rows by passing in the desired row count, e.g. `.sample(10000)`. However, `.sample()` can also take a fraction of rows, with `.sample(fraction=X)`, where `X` must be between 0 and 1. Use this option to reduce the data to 2% of its original size. What is the shape of the result?

In [None]:
result = (
    rides_df_raw
    #### YOUR CODE HERE
)
print(result)

1. 3582628
2. 2
3. 0
4. 71652