<a href="https://colab.research.google.com/github/drshahizan/Python-big-data/blob/main/Assignment%202a/QUAD/Polars_Assignment2a_GP8_QUAD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Polars

<br>
 <p align="center">
  <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTKH6i3lJ9tW7bna90G-1SO7QB__e3ri_8MCw&usqp=CAU"/>
 </p>
</br>

- **Polars** supports both eager and lazy execution. You can do most of your work eagerly, as in Pandas, but it also provides a sophisticated expression syntax that is optimised and executed by the query engine.

Polars' purpose is to deliver a lightning-fast DataFrame library that makes use of all available cores on your machine. Unlike Dask, which attempts to parallelize existing single-threaded libraries such as NumPy and Pandas, Polars is designed from the ground up for parallel execution of queries on DataFrames.

Polars goes to great lengths to:

1.   Reduce redundant copies
2.   Traverse memory cache efficiently
3.   Minimize contention in parallelism


For more information about Polars, see the [Polars user guide](https://pola-rs.github.io/polars-book/user-guide/index.html).

# Download the dataset from Kaggle

- Open [Kaggle](https://www.kaggle.com/datasets) and search for **nyc-yellow-taxi-trip-data** to download the dataset.

In [1]:
! pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
from google.colab import files

files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [3]:
!mkdir -p ~/.kaggle

In [4]:
!cp kaggle.json ~/.kaggle/
!kaggle datasets download -d elemento/nyc-yellow-taxi-trip-data

Downloading nyc-yellow-taxi-trip-data.zip to /content
100% 1.78G/1.78G [00:16<00:00, 157MB/s]
100% 1.78G/1.78G [00:16<00:00, 120MB/s]


In [5]:
!unzip nyc-yellow-taxi-trip-data.zip #unzip the downloaded dataset

Archive:  nyc-yellow-taxi-trip-data.zip
  inflating: yellow_tripdata_2015-01.csv  
  inflating: yellow_tripdata_2016-01.csv  
  inflating: yellow_tripdata_2016-02.csv  
  inflating: yellow_tripdata_2016-03.csv  


# Installing Polars

Polars is a Python library for DataFrame manipulation and data analysis, and it is not included in the standard Python distribution. Therefore, in order to use it in your Google Colab notebook or any other project, you need to install it first.

In [5]:
!pip install 'polars[all]'

# or, to install only a subset of the optional dependencies:
# pip install 'polars[numpy,pandas,pyarrow]'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Import Polars

Polars is conventionally imported under the alias pl:

In [6]:
import polars as pl

# Read the CSV file and load the dataset

In [7]:
# read csv file using polars
df = pl.read_csv('yellow_tripdata_2015-01.csv')

In [8]:
# display the DataFrame

df

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
i64,str,str,i64,f64,f64,f64,i64,str,f64,f64,i64,f64,f64,f64,f64,f64,f64,f64
2,"""2015-01-15 19:...","""2015-01-15 19:...",1,1.59,-73.993896,40.750111,1,"""N""",-73.974785,40.750618,1,12.0,1.0,0.5,3.25,0.0,0.3,17.05
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,3.3,-74.001648,40.724243,1,"""N""",-73.994415,40.759109,1,14.5,0.5,0.5,2.0,0.0,0.3,17.8
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,1.8,-73.963341,40.802788,1,"""N""",-73.95182,40.824413,2,9.5,0.5,0.5,0.0,0.0,0.3,10.8
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,0.5,-74.009087,40.713818,1,"""N""",-74.004326,40.719986,2,3.5,0.5,0.5,0.0,0.0,0.3,4.8
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,3.0,-73.971176,40.762428,1,"""N""",-74.004181,40.742653,2,15.0,0.5,0.5,0.0,0.0,0.3,16.3
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,9.0,-73.874374,40.774048,1,"""N""",-73.986977,40.758194,1,27.0,0.5,0.5,6.7,5.33,0.3,40.33
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,2.2,-73.983276,40.726009,1,"""N""",-73.99247,40.749634,2,14.0,0.5,0.5,0.0,0.0,0.3,15.3
1,"""2015-01-10 20:...","""2015-01-10 20:...",3,0.8,-74.002663,40.734142,1,"""N""",-73.99501,40.726326,1,7.0,0.5,0.5,1.66,0.0,0.3,9.96
1,"""2015-01-10 20:...","""2015-01-10 21:...",3,18.2,-73.783043,40.644356,2,"""N""",-73.987595,40.759357,2,52.0,0.0,0.5,0.0,5.33,0.3,58.13
1,"""2015-01-10 20:...","""2015-01-10 20:...",2,0.9,-73.985588,40.767948,1,"""N""",-73.985916,40.759365,1,6.5,0.5,0.5,1.55,0.0,0.3,9.35


As we can see, Polars pretty-prints the output object, including the column name and datatype as headers.

In [None]:
# The head function shows by default the first 5 rows of a DataFrame. You can specify the number of rows you want to see (e.g. df.head(10)).

df.head()

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
i64,str,str,i64,f64,f64,f64,i64,str,f64,f64,i64,f64,f64,f64,f64,f64,f64,f64
2,"""2015-01-15 19:...","""2015-01-15 19:...",1,1.59,-73.993896,40.750111,1,"""N""",-73.974785,40.750618,1,12.0,1.0,0.5,3.25,0.0,0.3,17.05
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,3.3,-74.001648,40.724243,1,"""N""",-73.994415,40.759109,1,14.5,0.5,0.5,2.0,0.0,0.3,17.8
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,1.8,-73.963341,40.802788,1,"""N""",-73.95182,40.824413,2,9.5,0.5,0.5,0.0,0.0,0.3,10.8
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,0.5,-74.009087,40.713818,1,"""N""",-74.004326,40.719986,2,3.5,0.5,0.5,0.0,0.0,0.3,4.8
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,3.0,-73.971176,40.762428,1,"""N""",-74.004181,40.742653,2,15.0,0.5,0.5,0.0,0.0,0.3,16.3


# Data Preparation and Cleaning

If you want to explicitly display the data type of each column, use the dtypes property:

In [None]:
df.dtypes

[Int64,
 Utf8,
 Utf8,
 Int64,
 Float64,
 Float64,
 Float64,
 Int64,
 Utf8,
 Float64,
 Float64,
 Int64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64]

describe() returns summary statistics of your DataFrame. It will provide several quick statistics if possible.

In [None]:
df.describe()

describe,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
str,f64,str,str,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""count""",12748986.0,"""12748986""","""12748986""",12748986.0,12748986.0,12748986.0,12748986.0,12748986.0,"""12748986""",12748986.0,12748986.0,12748986.0,12748986.0,12748986.0,12748986.0,12748986.0,12748986.0,12748986.0,12748986.0
"""null_count""",0.0,"""0""","""0""",0.0,0.0,0.0,0.0,0.0,"""0""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
"""mean""",1.521437,,,1.681491,13.45913,-72.561838,39.972823,1.036901,,-72.609039,39.999614,1.386712,11.905659,0.308279,0.497799,1.853814,0.243498,0.283143,15.108295
"""std""",0.49954,,,1.337924,9844.094218,10.125104,5.578691,0.673224,,9.966037,5.487742,0.498861,10.302537,0.591664,0.035342,1106.432314,1.527171,0.069086,1106.503247
"""min""",1.0,"""2015-01-01 00:...","""2015-01-01 00:...",0.0,0.0,-121.925812,0.0,1.0,"""N""",-740.166687,-9.029157,1.0,-450.0,-79.0,-0.5,-92.42,-26.0,0.0,-450.3
"""max""",2.0,"""2015-01-31 23:...","""2016-02-02 16:...",9.0,15420000.0,78.662651,404.700012,99.0,"""Y""",85.274025,459.533325,5.0,4008.0,999.99,0.5,3950588.8,1450.09,0.3,3950611.6
"""median""",2.0,,,1.0,1.68,-73.981598,40.753143,1.0,,-73.979759,40.753624,1.0,9.0,0.0,0.5,1.0,0.0,0.3,11.16


To get the column names, use the columns property:

In [None]:
df.columns

['VendorID',
 'tpep_pickup_datetime',
 'tpep_dropoff_datetime',
 'passenger_count',
 'trip_distance',
 'pickup_longitude',
 'pickup_latitude',
 'RateCodeID',
 'store_and_fwd_flag',
 'dropoff_longitude',
 'dropoff_latitude',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'improvement_surcharge',
 'total_amount']

To get the content of the DataFrame as a list of tuples, call the rows() method. Note the parentheses: without them, the notebook only displays the bound method object instead of the rows. Because this dataset has more than 12 million rows, the example below converts only the first three:

In [9]:
df.head(3).rows()

Handling missing values

- The value for missing data in Pandas is determined by the column's dtype. Missing data is always displayed as a null value in Polars.

In [9]:
#count missing value

null_count_df = df.null_count()
print(null_count_df)

shape: (1, 19)
┌──────────┬────────────┬────────────┬────────────┬─────┬──────────┬────────────┬────────────┬────────────┐
│ VendorID ┆ tpep_picku ┆ tpep_dropo ┆ passenger_ ┆ ... ┆ tip_amou ┆ tolls_amou ┆ improvemen ┆ total_amou │
│ ---      ┆ p_datetime ┆ ff_datetim ┆ count      ┆     ┆ nt       ┆ nt         ┆ t_surcharg ┆ nt         │
│ u32      ┆ ---        ┆ e          ┆ ---        ┆     ┆ ---      ┆ ---        ┆ e          ┆ ---        │
│          ┆ u32        ┆ ---        ┆ u32        ┆     ┆ u32      ┆ u32        ┆ ---        ┆ u32        │
│          ┆            ┆ u32        ┆            ┆     ┆          ┆            ┆ u32        ┆            │
╞══════════╪════════════╪════════════╪════════════╪═════╪══════════╪════════════╪════════════╪════════════╡
│ 0        ┆ 0          ┆ 0          ┆ 0          ┆ ... ┆ 0        ┆ 0          ┆ 3          ┆ 0          │
└──────────┴────────────┴────────────┴────────────┴─────┴──────────┴────────────┴────────────┴────────────┘


In [10]:
#filling missing data

# replace the nulls in improvement_surcharge with the mean of the column
df = df.with_columns(pl.col('improvement_surcharge').fill_null(pl.mean('improvement_surcharge')))

After filling the missing data, we check whether any missing values remain that still need to be handled.

In [11]:
#check missing value

null_count_df = df.null_count()
print(null_count_df)

shape: (1, 19)
┌──────────┬────────────┬────────────┬────────────┬─────┬──────────┬────────────┬────────────┬────────────┐
│ VendorID ┆ tpep_picku ┆ tpep_dropo ┆ passenger_ ┆ ... ┆ tip_amou ┆ tolls_amou ┆ improvemen ┆ total_amou │
│ ---      ┆ p_datetime ┆ ff_datetim ┆ count      ┆     ┆ nt       ┆ nt         ┆ t_surcharg ┆ nt         │
│ u32      ┆ ---        ┆ e          ┆ ---        ┆     ┆ ---      ┆ ---        ┆ e          ┆ ---        │
│          ┆ u32        ┆ ---        ┆ u32        ┆     ┆ u32      ┆ u32        ┆ ---        ┆ u32        │
│          ┆            ┆ u32        ┆            ┆     ┆          ┆            ┆ u32        ┆            │
╞══════════╪════════════╪════════════╪════════════╪═════╪══════════╪════════════╪════════════╪════════════╡
│ 0        ┆ 0          ┆ 0          ┆ 0          ┆ ... ┆ 0        ┆ 0          ┆ 0          ┆ 0          │
└──────────┴────────────┴────────────┴────────────┴─────┴──────────┴────────────┴────────────┴────────────┘


Checking for duplicates and unique rows

- Checking for duplicates or unique values can provide valuable insights into the quality and structure of your data, allowing you to make informed decisions and improve the accuracy of your analysis. The unique() method returns the DataFrame with duplicate rows removed.

In [None]:
df.unique()

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
i64,str,str,i64,f64,f64,f64,i64,str,f64,f64,i64,f64,f64,f64,f64,f64,f64,f64
2,"""2015-01-15 19:...","""2015-01-15 19:...",1,1.59,-73.993896,40.750111,1,"""N""",-73.974785,40.750618,1,12.0,1.0,0.5,3.25,0.0,0.3,17.05
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,3.3,-74.001648,40.724243,1,"""N""",-73.994415,40.759109,1,14.5,0.5,0.5,2.0,0.0,0.3,17.8
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,1.8,-73.963341,40.802788,1,"""N""",-73.95182,40.824413,2,9.5,0.5,0.5,0.0,0.0,0.3,10.8
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,0.5,-74.009087,40.713818,1,"""N""",-74.004326,40.719986,2,3.5,0.5,0.5,0.0,0.0,0.3,4.8
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,3.0,-73.971176,40.762428,1,"""N""",-74.004181,40.742653,2,15.0,0.5,0.5,0.0,0.0,0.3,16.3
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,9.0,-73.874374,40.774048,1,"""N""",-73.986977,40.758194,1,27.0,0.5,0.5,6.7,5.33,0.3,40.33
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,2.2,-73.983276,40.726009,1,"""N""",-73.99247,40.749634,2,14.0,0.5,0.5,0.0,0.0,0.3,15.3
1,"""2015-01-10 20:...","""2015-01-10 20:...",3,0.8,-74.002663,40.734142,1,"""N""",-73.99501,40.726326,1,7.0,0.5,0.5,1.66,0.0,0.3,9.96
1,"""2015-01-10 20:...","""2015-01-10 21:...",3,18.2,-73.783043,40.644356,2,"""N""",-73.987595,40.759357,2,52.0,0.0,0.5,0.0,5.33,0.3,58.13
1,"""2015-01-10 20:...","""2015-01-10 20:...",2,0.9,-73.985588,40.767948,1,"""N""",-73.985916,40.759365,1,6.5,0.5,0.5,1.55,0.0,0.3,9.35


# Expression
- Expressions are Polars' primary strength. They provide a versatile structure that not only answers simple queries but can also be extended easily to sophisticated analyses. Below, we go over the fundamental components that serve as the foundation for all of your queries.

## Select statement

- To select a column we need to do two things: first, define the DataFrame we want the data from, and second, select the data that we need. In the example below we select col('*'). The asterisk stands for all columns.

In [None]:
df.select(pl.col('*'))

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
i64,str,str,i64,f64,f64,f64,i64,str,f64,f64,i64,f64,f64,f64,f64,f64,f64,f64
2,"""2015-01-15 19:...","""2015-01-15 19:...",1,1.59,-73.993896,40.750111,1,"""N""",-73.974785,40.750618,1,12.0,1.0,0.5,3.25,0.0,0.3,17.05
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,3.3,-74.001648,40.724243,1,"""N""",-73.994415,40.759109,1,14.5,0.5,0.5,2.0,0.0,0.3,17.8
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,1.8,-73.963341,40.802788,1,"""N""",-73.95182,40.824413,2,9.5,0.5,0.5,0.0,0.0,0.3,10.8
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,0.5,-74.009087,40.713818,1,"""N""",-74.004326,40.719986,2,3.5,0.5,0.5,0.0,0.0,0.3,4.8
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,3.0,-73.971176,40.762428,1,"""N""",-74.004181,40.742653,2,15.0,0.5,0.5,0.0,0.0,0.3,16.3
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,9.0,-73.874374,40.774048,1,"""N""",-73.986977,40.758194,1,27.0,0.5,0.5,6.7,5.33,0.3,40.33
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,2.2,-73.983276,40.726009,1,"""N""",-73.99247,40.749634,2,14.0,0.5,0.5,0.0,0.0,0.3,15.3
1,"""2015-01-10 20:...","""2015-01-10 20:...",3,0.8,-74.002663,40.734142,1,"""N""",-73.99501,40.726326,1,7.0,0.5,0.5,1.66,0.0,0.3,9.96
1,"""2015-01-10 20:...","""2015-01-10 21:...",3,18.2,-73.783043,40.644356,2,"""N""",-73.987595,40.759357,2,52.0,0.0,0.5,0.0,5.33,0.3,58.13
1,"""2015-01-10 20:...","""2015-01-10 20:...",2,0.9,-73.985588,40.767948,1,"""N""",-73.985916,40.759365,1,6.5,0.5,0.5,1.55,0.0,0.3,9.35


You can also specify the columns that you want to return. There are two ways to do this.
- The first option is to pass a list of column names to a single pl.col() call.
- The second option is to pass a list of pl.col() expressions, one per column, to the select statement.

In [None]:
#first option

df.select(pl.col(['payment_type', 'fare_amount']))

payment_type,fare_amount
i64,f64
1,12.0
1,14.5
2,9.5
2,3.5
2,15.0
1,27.0
2,14.0
1,7.0
2,52.0
1,6.5


In [None]:
#second option - limit the records to load

df.select([pl.col('payment_type'),pl.col('fare_amount')]).limit(3)

payment_type,fare_amount
i64,f64
1,12.0
1,14.5
2,9.5


To select a single row in a DataFrame, pass in the row number using the row() method:

In [None]:
#the result in tuple

df.row(0) 

(2,
 '2015-01-15 19:05:39',
 '2015-01-15 19:23:42',
 1,
 1.59,
 -73.993896484375,
 40.7501106262207,
 1,
 'N',
 -73.97478485107422,
 40.75061798095703,
 1,
 12.0,
 1.0,
 0.5,
 3.25,
 0.0,
 0.3,
 17.05)

## Filter

- The filter option allows us to create a subset of the DataFrame.

To select multiple rows, Polars recommends using the filter() function. For example, to retrieve all rows where payment_type equals 1, you can use the following expression:

In [None]:
df.filter(pl.col('payment_type') == 1)

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
i64,str,str,i64,f64,f64,f64,i64,str,f64,f64,i64,f64,f64,f64,f64,f64,f64,f64
2,"""2015-01-15 19:...","""2015-01-15 19:...",1,1.59,-73.993896,40.750111,1,"""N""",-73.974785,40.750618,1,12.0,1.0,0.5,3.25,0.0,0.3,17.05
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,3.3,-74.001648,40.724243,1,"""N""",-73.994415,40.759109,1,14.5,0.5,0.5,2.0,0.0,0.3,17.8
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,9.0,-73.874374,40.774048,1,"""N""",-73.986977,40.758194,1,27.0,0.5,0.5,6.7,5.33,0.3,40.33
1,"""2015-01-10 20:...","""2015-01-10 20:...",3,0.8,-74.002663,40.734142,1,"""N""",-73.99501,40.726326,1,7.0,0.5,0.5,1.66,0.0,0.3,9.96
1,"""2015-01-10 20:...","""2015-01-10 20:...",2,0.9,-73.985588,40.767948,1,"""N""",-73.985916,40.759365,1,6.5,0.5,0.5,1.55,0.0,0.3,9.35
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,0.9,-73.988617,40.723103,1,"""N""",-74.004395,40.728584,1,7.0,0.5,0.5,1.66,0.0,0.3,9.96
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,1.1,-73.993782,40.751419,1,"""N""",-73.967407,40.757217,1,7.5,0.5,0.5,1.0,0.0,0.3,9.8
1,"""2015-01-10 20:...","""2015-01-10 21:...",1,3.1,-73.973946,40.760448,1,"""N""",-73.997345,40.73521,1,19.0,0.5,0.5,3.0,0.0,0.3,23.3
2,"""2015-01-15 19:...","""2015-01-15 19:...",1,2.38,-73.976425,40.739811,1,"""N""",-73.983978,40.757889,1,16.5,1.0,0.5,4.38,0.0,0.3,22.68
2,"""2015-01-15 19:...","""2015-01-15 19:...",5,8.33,-73.86306,40.769581,1,"""N""",-73.952713,40.785782,1,26.0,1.0,0.5,8.08,5.33,0.3,41.21


## Selecting Rows and Columns

- Very often, you need to select rows and columns at the same time. You can do so by chaining the filter() and select() methods with specific columns, like this:

In [None]:
df.filter(pl.col('payment_type') == 1).select(['VendorID','payment_type'])

VendorID,payment_type
i64,i64
2,1
1,1
1,1
1,1
1,1
1,1
1,1
1,1
2,1
2,1


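## Groupby

- groupby() splits the DataFrame into groups of rows that share the same key, and count() returns the number of rows in each group.

In [None]:
# maintain_order=True keeps the groups in the order in which they first appear;
# without it, the order of the groups is not guaranteed.
# Newer Polars releases (0.19 and later) spell this method group_by.
df.groupby('passenger_count', maintain_order=True).count()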

passenger_count,count
i64,u32
1,8993870
3,528486
2,1814594
5,697645
6,454568
4,253228
0,6565
9,11
7,9
8,10


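## With_columns

- with_columns allows you to create new columns for your analyses. In the example below, the aggregated count is a single value, so Polars repeats it on every row.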

In [None]:
df.with_columns([pl.col('store_and_fwd_flag').count().alias('count_store_and_fwd_flag')])

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,count_store_and_fwd_flag
i64,str,str,i64,f64,f64,f64,i64,str,f64,f64,i64,f64,f64,f64,f64,f64,f64,f64,u32
2,"""2015-01-15 19:...","""2015-01-15 19:...",1,1.59,-73.993896,40.750111,1,"""N""",-73.974785,40.750618,1,12.0,1.0,0.5,3.25,0.0,0.3,17.05,12748986
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,3.3,-74.001648,40.724243,1,"""N""",-73.994415,40.759109,1,14.5,0.5,0.5,2.0,0.0,0.3,17.8,12748986
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,1.8,-73.963341,40.802788,1,"""N""",-73.95182,40.824413,2,9.5,0.5,0.5,0.0,0.0,0.3,10.8,12748986
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,0.5,-74.009087,40.713818,1,"""N""",-74.004326,40.719986,2,3.5,0.5,0.5,0.0,0.0,0.3,4.8,12748986
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,3.0,-73.971176,40.762428,1,"""N""",-74.004181,40.742653,2,15.0,0.5,0.5,0.0,0.0,0.3,16.3,12748986
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,9.0,-73.874374,40.774048,1,"""N""",-73.986977,40.758194,1,27.0,0.5,0.5,6.7,5.33,0.3,40.33,12748986
1,"""2015-01-10 20:...","""2015-01-10 20:...",1,2.2,-73.983276,40.726009,1,"""N""",-73.99247,40.749634,2,14.0,0.5,0.5,0.0,0.0,0.3,15.3,12748986
1,"""2015-01-10 20:...","""2015-01-10 20:...",3,0.8,-74.002663,40.734142,1,"""N""",-73.99501,40.726326,1,7.0,0.5,0.5,1.66,0.0,0.3,9.96,12748986
1,"""2015-01-10 20:...","""2015-01-10 21:...",3,18.2,-73.783043,40.644356,2,"""N""",-73.987595,40.759357,2,52.0,0.0,0.5,0.0,5.33,0.3,58.13,12748986
1,"""2015-01-10 20:...","""2015-01-10 20:...",2,0.9,-73.985588,40.767948,1,"""N""",-73.985916,40.759365,1,6.5,0.5,0.5,1.55,0.0,0.3,9.35,12748986


# Exploratory Analysis and Insights

This notebook does not include any visualization. There are several reasons why large data may cause a crash in Google Colab when trying to visualize it:

1. The data may be too large to fit in the memory of the device being used. If this is the case, the device will not be able to handle the data and may crash.

2. The visualization library being used may not be optimized for handling large data sets. This could cause the library to consume too much memory or CPU resources, resulting in a crash.

3. The notebook itself may have insufficient resources to handle the visualization of large data. This could be due to a lack of memory or CPU resources allocated to the notebook.

4. The notebook may be running out of storage space, which could cause the system to crash when trying to save large data.

It is also possible that the problem could be caused by a combination of these factors.

Trips with a distance greater than 2.0 (the TLC data dictionary records trip_distance in miles, not kilometres)

In [15]:
df['trip_distance']>(2.0)

trip_distance
bool
false
true
false
false
true
true
true
false
true
false


Sum of every column, grouped by passenger count

In [None]:
df.groupby('passenger_count').sum()

passenger_count,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
i64,i64,str,str,f64,f64,f64,i64,str,f64,f64,i64,f64,f64,f64,f64,f64,f64,f64
4,399713,,,727191.37,-18419000.0,10146000.0,262551,,-18431000.0,10153000.0,372954,3089900.0,82352.5,125920.0,339082.2,67775.19,68122.2,3780900.0
2,2767584,,,32737000.0,-131780000.0,72591000.0,1890843,,-131890000.0,72656000.0,2551700,22538000.0,583129.58,902644.5,2796400.0,541015.06,503809.799986,27905000.0
8,17,,,21.79,-591.893723,326.018887,38,,-665.638206,366.944695,13,295.8,1.0,3.0,12.09,9.75,2.4,324.64
6,908331,,,1267600.0,-33378000.0,18388000.0,464410,,-33367000.0,18382000.0,628713,5362855.3,131786.0,226755.0,684431.76,108908.58,135804.600001,6650600.0
0,6679,,,14197.78,-459919.288864,253360.366798,14792,,-476624.64946,262571.974075,9207,73562.76,1967.0,3121.5,9441.92,958.96,1892.1,90999.44
9,20,,,79.96,-665.579994,366.885517,47,,-813.91333,448.41341,14,581.9,1.0,4.5,75.2,24.99,2.7,690.89
3,849055,,,1507600.0,-38460000.0,21186000.0,546859,,-38481000.0,21199000.0,753220,6407700.0,173939.51,263037.0,760939.3,136146.5,146903.1,7900000.0
1,13073647,,,133340000.0,-651370000.0,358830000.0,9325958,,-651840000.0,359090000.0,12402358,105970000.0,2738600.0,4477080.5,17968000.0,2069600.0,2544900.0,135910000.0
5,1391726,,,1992600.0,-51224000.0,28219000.0,713918,,-51205000.0,28208000.0,960976,8346300.0,218507.5,347857.5,1075500.0,179941.27,208379.699998,10377000.0
7,11,,,23.1,-517.882095,285.279072,17,,-591.674599,325.949028,11,101.3,4.5,4.0,18.8,5.33,1.8,136.63


Rounding the distance travelled

In [None]:
df['trip_distance'].apply(lambda trip_distance : round(trip_distance))

trip_distance
i64
2
3
2
0
3
9
2
1
18
1


Build a dict that maps each store_and_fwd_flag value to an integer code

In [None]:
mydict = {v:k for k, v in enumerate(df['store_and_fwd_flag'].unique())}

In [None]:
mydict

{'Y': 0, 'N': 1}

Sorting by trip distance in descending order

In [None]:
from polars import col
trip_dist= df.lazy().sort(col("trip_distance"), reverse = True)
trip_dist1=trip_dist.collect()
trip_dist1

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
i64,str,str,i64,f64,f64,f64,i64,str,f64,f64,i64,f64,f64,f64,f64,f64,f64,f64
1,"""2015-01-11 21:...","""2015-01-11 22:...",1,1.5420e7,-73.992233,40.729248,1,"""N""",-74.028084,40.622639,2,34.5,0.5,0.5,0.0,0.0,0.3,35.8
1,"""2015-01-11 19:...","""2015-01-11 19:...",2,1.23318e7,-73.977341,40.749462,1,"""Y""",-74.00029,40.730511,2,2.5,0.0,0.5,0.0,0.0,0.3,3.3
1,"""2015-01-02 15:...","""2015-01-02 15:...",1,1.2e7,-73.95916,40.771851,1,"""N""",-73.955093,40.782879,2,4.5,0.0,0.5,0.0,0.0,0.0,5.3
1,"""2015-01-23 15:...","""2015-01-23 16:...",1,1.1800e7,-73.98719,40.77087,1,"""N""",-73.873169,40.774391,2,2.5,1.0,0.5,0.0,0.0,0.3,4.3
1,"""2015-01-23 10:...","""2015-01-23 10:...",1,1.1800e7,-73.946815,40.780571,1,"""N""",-73.954597,40.789566,2,2.5,0.0,0.5,0.0,0.0,0.3,3.3
1,"""2015-01-09 22:...","""2015-01-09 22:...",2,1.1800e7,-73.9935,40.762287,1,"""N""",-73.969383,40.795391,2,9.5,0.5,0.5,0.0,0.0,0.3,10.8
1,"""2015-01-09 21:...","""2015-01-09 21:...",1,8000016.5,-73.870872,40.773811,1,"""N""",-73.998283,40.694496,1,28.0,0.5,0.5,5.86,0.0,0.3,35.16
1,"""2015-01-09 22:...","""2015-01-09 23:...",1,8.00001e6,-73.987267,40.765903,1,"""N""",-73.955948,40.719486,2,21.0,0.5,0.5,0.0,0.0,0.3,22.3
1,"""2015-01-22 23:...","""2015-01-22 23:...",1,7468004.3,-74.003197,40.723087,1,"""N""",-73.898598,40.746029,2,29.5,0.5,0.5,0.0,0.0,0.3,30.8
1,"""2015-01-23 10:...","""2015-01-23 11:...",1,5e6,-73.959251,40.780979,1,"""N""",-73.996407,40.752399,2,2.5,0.0,0.5,0.0,0.0,0.3,3.3


Number of trips per vendor (VendorID), grouped by the number of passengers in the car

In [None]:
pv=df.groupby('passenger_count').pivot(pivot_column='VendorID', values_column='trip_distance').count()
pv


passenger_count,2,1
i64,u32,u32
1,4079777,4914093
3,320569,207917
2,952990,861604
5,694081,3564
6,453763,805
4,146485,106743
0,114,6451
9,9,2
7,2,7
8,7,3
