# Pandas 2.0.0
[Author: Elias Buitrago Bolivar](https://github.com/ebuitrago?tab=repositories)

Inspired by: https://www.youtube.com/watch?v=cSLPyRI_ZD8

Original data: Kaggle

The new `Pandas` version adds an optional Apache Arrow (PyArrow) backend as an alternative to `numpy`. According to its authors, this upgrade brings considerable improvements in data-handling speed. This Jupyter notebook is designed to study and compare the new and previous `Pandas` versions. Details and explanations will be given directly in class, so the material is not self-explanatory. Don't forget to ask me for access to the data. And please give credit to the original author's idea and, if you see fit, to me as well.


In [None]:
# !pip install pandas==2.0.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pandas==2.0.2
  Downloading pandas-2.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.3/12.3 MB 52.7 MB/s eta 0:00:00
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.5.3
    Uninstalling pandas-1.5.3:
      Successfully uninstalled pandas-1.5.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==1.5.3, but you have pandas 2.0.2 which is incompatible.
Successfully installed pandas-2.0.2


In [None]:
import pandas as pd
import numpy as np
import polars as pl
print('pandas', pd.__version__)
print('numpy', np.__version__)
print('polars', pl.__version__)

pandas 2.0.2
numpy 1.22.4
polars 0.17.3


## Read and concatenate files
Now it's your turn!

Compare the new and previous `Pandas` versions using the same data as the `PPvsSpark_01.ipynb` Jupyter notebook lab. The objective is to measure the speed improvement of the new Pandas version when reading, concatenating, and aggregating files.

The points to develop in the lab are the following:

1.   Read the data using the previous Pandas version (method #1).
2.   Read the data using the new Pandas version (method #2).
3.   Concatenate the files using both methods.
4.   Create aggregation tables using both methods and the following operators: mean, sum, max.
5.   Compare the speed of the whole set of operations for both methods.
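
One possible way to collect the timings asked for in point 5 is a small context manager built on `time.perf_counter`. This is only a sketch: the `timed` helper, the `timings` dictionary, and the tiny stand-in frames are hypothetical names introduced for illustration. In the lab, the timed blocks would wrap the real `pd.read_parquet`, `pd.concat`, and `groupby` calls instead.

```python
import time
from contextlib import contextmanager

import pandas as pd

timings = {}  # hypothetical registry: step label -> elapsed seconds


@contextmanager
def timed(label):
    """Record the wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Tiny stand-in frames; in the lab these would come from pd.read_parquet.
part_a = pd.DataFrame(
    {"Airline": ["AA", "DL"], "Year": [2020, 2020], "DepDelayMinutes": [5.0, 12.0]}
)
part_b = pd.DataFrame(
    {"Airline": ["AA", "DL"], "Year": [2021, 2021], "DepDelayMinutes": [0.0, 30.0]}
)

with timed("concat"):
    combined = pd.concat([part_a, part_b], ignore_index=True)

with timed("aggregate"):
    summary = combined.groupby(["Airline", "Year"])["DepDelayMinutes"].agg(
        ["mean", "sum", "max"]
    )

# Print one line per step so both methods can be compared side by side.
for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f} s")
```

Running the same labelled steps once per method gives two dictionaries that can be compared directly. A single run is noisy, so repeating each step a few times is advisable.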



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
# flights_file1 = "/content/drive/MyDrive/data/flights/Combined_Flights_2018.parquet"
# flights_file2 = "/content/drive/MyDrive/data/flights/Combined_Flights_2019.parquet"
flights_file3 = "/content/drive/MyDrive/data/flights/Combined_Flights_2020.parquet"
flights_file4 = "/content/drive/MyDrive/data/flights/Combined_Flights_2021.parquet"
flights_file5 = "/content/drive/MyDrive/data/flights/Combined_Flights_2022.parquet"

In [None]:
# %timeit
# df1 = pd.read_parquet(flights_file1, engine='pyarrow', dtype_backend='pyarrow')
# df2 = pd.read_parquet(flights_file2, engine='pyarrow', dtype_backend='pyarrow')
df3 = pd.read_parquet(flights_file3, engine='pyarrow', dtype_backend='pyarrow')
df4 = pd.read_parquet(flights_file4, engine='pyarrow', dtype_backend='pyarrow')
df5 = pd.read_parquet(flights_file5, engine='pyarrow', dtype_backend='pyarrow')

In [None]:
%whos

Variable        Type         Data/Info
--------------------------------------
ctypes          module       <module 'ctypes' from '/u<...>3.10/ctypes/__init__.py'>
df3             DataFrame                     FlightDa<...>022397 rows x 61 columns]
df4             DataFrame                     FlightDa<...>311871 rows x 61 columns]
df5             DataFrame                     FlightDa<...>078318 rows x 61 columns]
flights_file3   str          /content/drive/MyDrive/da<...>ined_Flights_2020.parquet
flights_file4   str          /content/drive/MyDrive/da<...>ined_Flights_2021.parquet
flights_file5   str          /content/drive/MyDrive/da<...>ined_Flights_2022.parquet
gc              module       <module 'gc' (built-in)>
libc            CDLL         <CDLL 'libc.so.6', handle<...>3df000 at 0x7fd6777c3a30>
pd              module       <module 'pandas' from '/u<...>ages/pandas/__init__.py'>


In [None]:
import sys

# These are the usual ipython objects, including this one you are creating
ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']

# Get a sorted list of the objects and their sizes
sorted([(x, sys.getsizeof(globals().get(x))) for x in dir() if not x.startswith('_') and x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)

[('df4', 3331471806),
 ('df3', 2650228101),
 ('df5', 2150528096),
 ('mem', 360),
 ('flights_file3', 114),
 ('flights_file4', 114),
 ('flights_file5', 114),
 ('pd', 72),
 ('libc', 48)]

In [None]:
import sys

# These are the usual ipython objects, including this one you are creating
ipython_vars = ["In", "Out", "exit", "quit", "get_ipython", "ipython_vars"]

# Get a sorted list of the objects and their sizes
mem = {
    key: value
    for key, value in sorted(
        [
            (x, sys.getsizeof(globals().get(x)))
            for x in dir()
            if not x.startswith("_") and x not in sys.modules and x not in ipython_vars
        ],
        key=lambda x: x[1],
        reverse=True,
    )
}
sum(mem.values()) / 1e6

8132.228825
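
As a complement to the `sys.getsizeof` listing, `DataFrame.memory_usage(deep=True)` gives a per-column breakdown of the same footprint. The sketch below uses a small made-up frame (`frame` is a hypothetical name). As far as I know, `sys.getsizeof` on a DataFrame is itself derived from the deep memory usage, so the two totals should be close.

```python
import sys

import pandas as pd

# Hypothetical stand-in for one of the flights DataFrames.
frame = pd.DataFrame(
    {
        "Airline": ["AA", "DL", "UA"] * 1000,
        "DepDelayMinutes": [5.0, 12.0, 0.0] * 1000,
    }
)

# Bytes per column (plus the index); deep=True also counts string contents.
per_column = frame.memory_usage(deep=True)
total_bytes = int(per_column.sum())

print(per_column)
print(f"memory_usage total: {total_bytes / 1e6:.3f} MB")
print(f"sys.getsizeof:      {sys.getsizeof(frame) / 1e6:.3f} MB")
```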

In [None]:
import gc
gc.collect()

170

In [None]:
df = pd.concat([df3, df4, df5])

In [None]:
df_agg = df.groupby(['Airline','Year'])[["DepDelayMinutes", "ArrDelayMinutes"]].agg(
    ["mean", "sum", "max"]
)
df_agg = df_agg.reset_index()
df_agg.to_parquet("temp_pandas2.parquet")
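
The shape of the aggregation table is easier to see on a toy example. The sketch below applies the same `groupby(...).agg(["mean", "sum", "max"])` pattern to three invented rows (`flights` and `toy_agg` are hypothetical names, and the values are not taken from the real data).

```python
import pandas as pd

# Three made-up flights: two for AA and one for DL, all in 2020.
flights = pd.DataFrame(
    {
        "Airline": ["AA", "AA", "DL"],
        "Year": [2020, 2020, 2020],
        "DepDelayMinutes": [10.0, 20.0, 5.0],
        "ArrDelayMinutes": [0.0, 30.0, 15.0],
    }
)

# One row per (Airline, Year); each delay column expands into mean/sum/max.
toy_agg = flights.groupby(["Airline", "Year"])[
    ["DepDelayMinutes", "ArrDelayMinutes"]
].agg(["mean", "sum", "max"])
toy_agg = toy_agg.reset_index()

# The columns form a two-level MultiIndex: (original column, statistic).
print(toy_agg.columns.tolist())
print(toy_agg)
```

Individual cells are addressed with a tuple, for example `toy_agg.loc[0, ("DepDelayMinutes", "mean")]` for the mean departure delay of the first group.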

In [None]:
del df, df3, df4, df5
import sys

# These are the usual ipython objects, including this one you are creating
ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']

# Get a sorted list of the objects and their sizes
sorted([(x, sys.getsizeof(globals().get(x))) for x in dir() if not x.startswith('_') and x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)

[('df_agg', 5639),
 ('mem', 360),
 ('flights_file3', 114),
 ('flights_file4', 114),
 ('flights_file5', 114),
 ('pd', 72),
 ('libc', 48)]

In [None]:
import gc
gc.collect()

0