# Pandas 2.0: Guide to Upgrading and Adapting

In [None]:
%pip install --upgrade pandas

In [2]:
import pandas as pd

print(pd.__version__)

2.0.0


## Improved Nullable Dtypes and Extension Arrays

Pandas 2.0 adds optional PyArrow-backed dtypes alongside the NumPy-backed defaults, which can make operations faster and more memory-efficient, particularly for strings and for columns with missing values.

This code reads a CSV file with sample data, coerces columns to numeric types where possible with `pd.to_numeric`, and then writes the data to a Parquet file and reads it back using the pyarrow engine.

In [3]:
import pandas as pd
import io

# New sample data (rows are not indented, so no stray whitespace ends up in 'Category')
new_sample_data = io.StringIO("""Category,Value,Flag,Label,Count,Rating,Percentage,Status,Code
Fruit,100,True,Apple,25,4.5,0.50,InStock,A1
Vegetable,200,False,Carrot,30,3.8,0.35,OutOfStock,B2
Grain,150,True,Rice,20,4.2,0.25,InStock,C3
""")

# Read the CSV; by default pandas infers NumPy-backed dtypes
# (pass dtype_backend="numpy_nullable" or "pyarrow" to get nullable dtypes)
data_frame = pd.read_csv(new_sample_data)

# Coerce columns to numeric where possible; errors="ignore" leaves
# non-numeric columns such as 'Category' unchanged
data_frame = data_frame.apply(pd.to_numeric, errors="ignore")

# Save the DataFrame as a Parquet file
data_frame.to_parquet("data_frame.parquet", engine="pyarrow")

# Read the Parquet file into a DataFrame
data_frame_from_parquet = pd.read_parquet("data_frame.parquet", engine="pyarrow")

In [4]:
# Print the shape of the DataFrame
print(f"Shape of the DataFrame: {data_frame.shape}")

# Print the columns in the DataFrame
print(f"Columns in the DataFrame: {data_frame.columns.tolist()}")

# Print summary statistics of the DataFrame
print("\nSummary statistics of the DataFrame:")
print(data_frame.describe())

# Print unique values in the 'Category' column
print(f"\nUnique values in the 'Category' column: {data_frame['Category'].unique()}")

Shape of the DataFrame: (3, 9)
Columns in the DataFrame: ['Category', 'Value', 'Flag', 'Label', 'Count', 'Rating', 'Percentage', 'Status', 'Code']

Summary statistics of the DataFrame:
       Value  Count    Rating  Percentage
count    3.0    3.0  3.000000    3.000000
mean   150.0   25.0  4.166667    0.366667
std     50.0    5.0  0.351188    0.125831
min    100.0   20.0  3.800000    0.250000
25%    125.0   22.5  4.000000    0.300000
50%    150.0   25.0  4.200000    0.350000
75%    175.0   27.5  4.350000    0.425000
max    200.0   30.0  4.500000    0.500000

Unique values in the 'Category' column: ['Fruit' 'Vegetable' 'Grain']
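
The cells above rely on pandas' default dtype inference, which yields NumPy-backed columns. As a minimal sketch of how nullable dtypes could be requested instead, the snippet below passes `dtype_backend="numpy_nullable"` to `read_csv`. The CSV text and the names `csv_text` and `nullable_df` are made up for illustration, and the missing cells are there only to show that integer columns keep an integer dtype.

```python
import io

import pandas as pd

# Hypothetical sample with one missing integer and one missing float
csv_text = "Category,Value,Rating\nFruit,100,4.5\nVegetable,,3.8\nGrain,150,\n"

# dtype_backend="numpy_nullable" asks read_csv for pandas' nullable dtypes;
# "pyarrow" is the other accepted value and requires pyarrow to be installed
nullable_df = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")

print(nullable_df.dtypes)
print(nullable_df)
```

With the default backend, the missing entry would force `Value` to `float64`; with the nullable backend it stays an integer column and the gap is shown as `<NA>`.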


## Copy-on-Write (CoW) Improvements 

With Copy-on-Write enabled, Pandas no longer needs to make defensive copies in many operations. Derived objects share data with their parent until one of them is modified, and a copy is made only at that point, which results in more efficient memory usage.

In [5]:
import pandas as pd

pd.options.mode.copy_on_write = True

data = {"a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": ["x", "y", "z"]}
df1 = pd.DataFrame(data)
# .copy() returns an independent DataFrame, so changing df2 never affects df1
df2 = df1.copy()

df2["a"] = [7, 8, 9]

print(df1)
print(df2)

   a    b  c
0  1  4.0  x
1  2  5.0  y
2  3  6.0  z
   a    b  c
0  7  4.0  x
1  8  5.0  y
2  9  6.0  z
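
Because `df2` above is created with an explicit `.copy()`, it would be independent of `df1` even without CoW. A minimal sketch of the case where CoW changes the outcome is a column selected without `.copy()`; the names `df` and `col` are illustrative only.

```python
import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Selecting a column does not copy any data up front under CoW
col = df["a"]

# The first write to `col` triggers the copy, so `df` is left untouched
col.iloc[0] = 100

print(df["a"].tolist())
print(col.tolist())
```

Here `df` keeps its original values while `col` holds the modified ones, without an explicit `.copy()` call.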


## Handling Differences in Data Type Support

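Different kinds of data call for different dtypes. The example below builds a Pandas DataFrame with an explicit dtype per column: the `string` extension dtype for product names, `object` for list-valued features, and `datetime64[ns]` for release dates.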

In [6]:
import datetime
import pandas as pd

# Example of choosing an explicit dtype for each column in Pandas 2.0
# (these are pandas/NumPy dtypes, not Arrow-backed ones)
data = pd.DataFrame({
    'product_name': pd.Series(['Samsung Galaxy S22',
                               'iPhone 15 Pro Max'],
                              dtype='string'),
    'features': pd.Series([['5G', 'AMOLED display', '128 GB storage'],
                           ['5G', 'Super Retina XDR display', '512 GB storage']],
                          dtype='object'),
    'release_date': pd.Series([datetime.date(2022, 12, 10),
                               datetime.date(2022, 9, 30)],
                              dtype='datetime64[ns]')
})

print(data)

         product_name                                        features   
0  Samsung Galaxy S22            [5G, AMOLED display, 128 GB storage]  \
1   iPhone 15 Pro Max  [5G, Super Retina XDR display, 512 GB storage]   

  release_date  
0   2022-12-10  
1   2022-09-30  


## Evaluating Performance Implications

In many cases, the performance will be significantly improved with Pandas 2.0, especially when working with large datasets. However, some operations might be slower or not yet optimized, so it's crucial to benchmark your code and compare the performance with other tools or previous versions of Pandas.

Here's an example of how you can measure the performance of a group-by aggregation in Pandas 2.0 compared to other data processing libraries such as Polars, DuckDB, and Dask:

In [None]:
%pip install --upgrade pandas polars duckdb "dask[dataframe]"

In [3]:
%%time

import timeit
import pandas as pd
import polars as pl
import duckdb
import dask.dataframe as dd

# Prepare data
data = pd.DataFrame({
    'A': list(range(1000000)),
    'B': list(range(1000000, 2000000))
})

# Pandas 2.0
def pandas_operation():
    return data.groupby('A').sum()

pandas_time = timeit.timeit(pandas_operation, number=10)

# Polars
polars_data = pl.from_pandas(data)

def polars_operation():
    # Current Polars releases spell this method group_by (formerly groupby)
    return polars_data.group_by('A').agg(pl.col('B').sum())

polars_time = timeit.timeit(polars_operation, number=10)

# DuckDB
duckdb_conn = duckdb.connect(database=':memory:', read_only=False)
duckdb_conn.register('data', data)
duckdb_cursor = duckdb_conn.cursor()

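def duckdb_operation():
    duckdb_cursor.execute('SELECT A, SUM(B) FROM data GROUP BY A')
    # fetchall() converts every result row into a Python tuple, so this timing
    # includes that conversion cost on top of the query itself
    return duckdb_cursor.fetchall()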

duckdb_time = timeit.timeit(duckdb_operation, number=10)

# Dask
dask_data = dd.from_pandas(data, npartitions=4)

def dask_operation():
    return dask_data.groupby('A').sum().compute()

dask_time = timeit.timeit(dask_operation, number=10)

# Print results (timeit.timeit returns the total time for all 10 runs, not a per-run average)
print(f"Pandas 2.0: {pandas_time:.5f} seconds")
print(f"Polars: {polars_time:.5f} seconds")
print(f"DuckDB: {duckdb_time:.5f} seconds")
print(f"Dask: {dask_time:.5f} seconds")

Pandas 2.0: 1.19203 seconds
Polars: 1.14821 seconds
DuckDB: 11.92281 seconds
Dask: 1.40523 seconds
CPU times: user 22.5 s, sys: 2.73 s, total: 25.2 s
Wall time: 16.3 s
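
The figures above are totals over ten runs measured once, so a single noisy run can skew them. A minimal sketch of a somewhat more robust measurement is to repeat the timing several times and keep the best per-call time. The helper `best_time` and the small DataFrame are hypothetical and only illustrate the pattern.

```python
import timeit

import pandas as pd


def best_time(func, repeat=5, number=3):
    """Return the best per-call time in seconds across several repeats."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


# Small illustrative dataset: 100 groups of 1,000 rows each
df = pd.DataFrame({
    "key": [i % 100 for i in range(100_000)],
    "value": list(range(100_000)),
})

groupby_seconds = best_time(lambda: df.groupby("key")["value"].sum())
print(f"groupby-sum: {groupby_seconds:.6f} s per call")
```

Taking the minimum rather than the mean reduces the influence of background activity on the machine, at the cost of hiding run-to-run variance.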


## Conclusion

Pandas 2.0 represents a significant milestone for the library, as the integration of Apache Arrow allows for simpler, faster, and more efficient data processing tasks.
For more information, you can also consult the official release notes and GitHub repository of Pandas 2.0.