In [1]:
import numpy as np 
import pandas as pd 

In **Pandas**, **extension data types** (or **ExtensionDtypes**) are a special class of data types that extend beyond the traditional NumPy data types (like `int64`, `float64`, `object`, etc.). These are designed to give Pandas more flexibility and allow it to support:

- Nullable types (e.g., integer columns with missing values, represented by `pd.NA`)
- Better memory usage
- Custom behaviors for special data types


**Pandas Extension Data Types**:

| **ExtensionDtype** | **Description**                  | **Nullable** | **Example dtype string** |
|--------------------|----------------------------------|--------------|---------------------------|
| `Int8` / `Int16` / `Int32` / `Int64`   | Signed integers             | ✅ Yes      | `"Int64"`                |
| `UInt8` / `UInt16` / `UInt32` / `UInt64` | Unsigned integers         | ✅ Yes      | `"UInt32"`               |
| `Float32` / `Float64`                 | Floating-point numbers       | ✅ Yes      | `"Float64"`              |
| `boolean`              | Boolean values                  | ✅ Yes      | `"boolean"`              |
| `string`               | Text strings (not object)       | ✅ Yes      | `"string"`               |
| `category`             | Categorical values              | ✅ Yes      | `"category"`             |
| `datetime64[ns, tz]` (`DatetimeTZDtype`) | Timezone-aware timestamps (timezone-naive `datetime64[ns]` is a NumPy dtype) | ✅ Yes | `"datetime64[ns, UTC]"` |
| `Period[D]`, `Period[M]`, etc. | Time periods                | ✅ Yes      | `"period[M]"`            |
| `Sparse`               | Sparse data                     | ✅ Yes      | `pd.SparseDtype("int")`  |
| `Interval`             | Interval data (e.g., bins)      | ✅ Yes      | `"interval[int64]"`      |



### Advantages

- Better handling of missing data
- Cleaner types for strings and booleans
- Support for custom extension arrays (you can even create your own types)
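As a rough illustration of the table above, several of the dtype strings can be passed straight to the `Series` constructor. The sample values below are made up for this sketch, and `pd.api.types.is_extension_array_dtype` is used only to confirm that each result is backed by an extension type.

```python
import pandas as pd

# Hypothetical sample data, one entry per dtype string taken from the table.
examples = {
    "Int64": [1, 2, None],
    "Float64": [1.5, None, 3.0],
    "boolean": [True, None, False],
    "string": ["a", None, "c"],
    "category": ["low", "high", "low"],
}

results = {}
for dtype_name, values in examples.items():
    s = pd.Series(values, dtype=dtype_name)
    # True when the Series is backed by a pandas extension array.
    results[dtype_name] = pd.api.types.is_extension_array_dtype(s.dtype)
    print(dtype_name, "->", s.dtype, "| extension:", results[dtype_name])
```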


In [2]:
# Shortcomings of building on NumPy:

#     No native support for missing values in integer arrays  
#     Object dtype used for mixed types, causing inefficiency  
#     Inflexible type system, can't extend easily  
#     Memory overhead with object arrays  
#     Limited categorical/string support 
#     Hard to customize behavior of data types  


# example where we create a Series of integers with a missing value:
s = pd.Series([1,2,3, None])
s

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

In [4]:
# Mainly for backward compatibility, Series falls back to the legacy behavior shown in the previous cell:
# a float64 dtype with np.nan standing in for the missing value.
# Passing an extension dtype keeps the values as integers:
s = pd.Series([1,2,3,None], dtype=pd.Int64Dtype())
s

# the <NA> indicates that a value is missing for an extension type array
# this uses the special pandas.NA sentinel value

0       1
1       2
2       3
3    <NA>
dtype: Int64
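A small sketch of how the nullable integer Series behaves in everyday operations may help here. It assumes the same four values as above; the variable names `shifted`, `total`, and `filled` are only illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, None], dtype="Int64")

shifted = s + 1        # the missing value propagates and the dtype stays Int64
total = s.sum()        # reductions skip <NA> by default
filled = s.fillna(0)   # replace <NA> with a concrete integer

print(shifted.dtype, total, filled.tolist())
```

Unlike the float64 fallback, none of these steps silently turns the integers into floats.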

In [5]:
s.isna()

0    False
1    False
2    False
3     True
dtype: bool

In [6]:
s[3]

<NA>

In [7]:
s[3] is pd.NA

True
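Because `pd.NA` means "unknown", it does not behave like `np.nan` or `None` in comparisons and boolean logic. The following sketch shows a few of these rules; the variable names are only for illustration.

```python
import pandas as pd

eq_result = pd.NA == 1       # comparing with NA gives NA, not False
or_result = pd.NA | True     # True, whatever the unknown operand is
and_result = pd.NA & False   # False, whatever the unknown operand is

# NA has no truth value, so using it in an `if` statement raises an error.
try:
    bool(pd.NA)
    truthiness = "no error"
except TypeError:
    truthiness = "TypeError"

print(eq_result, or_result, and_result, truthiness)
```

This is why `s[3] is pd.NA` or `pd.isna(s[3])` is the reliable check, rather than `s[3] == pd.NA`.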

In [None]:
# we also could have used the shorthand "Int64" instead of pd.Int64Dtype() to specify the type.
# the capitalization matters: lowercase "int64" selects the NumPy-based, non-extension type

# s = pd.Series([1,2,3,None], dtype="Int64")
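To make the effect of the capitalization concrete, this sketch builds the same data with both spellings and inspects which kind of dtype object comes back. The names `lower` and `upper` are hypothetical.

```python
import numpy as np
import pandas as pd

lower = pd.Series([1, 2, 3], dtype="int64")   # NumPy-backed dtype
upper = pd.Series([1, 2, 3], dtype="Int64")   # pandas extension dtype

lower_is_numpy = isinstance(lower.dtype, np.dtype)
upper_is_extension = isinstance(upper.dtype, pd.api.extensions.ExtensionDtype)

print(lower.dtype, lower_is_numpy)
print(upper.dtype, upper_is_extension)
```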

In [9]:
# pandas also has an extension type specialized for string data that does not use NumPy object arrays (may require the pyarrow library)

s = pd.Series(['one', 'two', None, 'three'], dtype=pd.StringDtype())
s


0      one
1      two
2     <NA>
3    three
dtype: string

In [10]:
# these string arrays generally use less memory than object arrays and are often faster
# for operations on large datasets, particularly when they are backed by pyarrow

# another important extension type is Categorical.
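As a minimal sketch of why the categorical type matters, the example below stores a column of repeated labels once as an object array and once as `category`, then compares memory use. The labels and the repeat count are invented for the illustration, and the exact byte counts will vary by pandas version and platform.

```python
import pandas as pd

# Hypothetical data: a few distinct labels repeated many times.
labels = ["apple", "banana", "cherry"] * 10_000

as_object = pd.Series(labels, dtype=object)
as_category = pd.Series(labels, dtype="category")

# deep=True also counts the Python string objects held by the object array.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(as_category.cat.categories.tolist())
```

The categorical version stores each distinct label once plus a small integer code per row, which is where the saving comes from.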

In [12]:
# extension types can be passed to the Series astype method, allowing you to convert easily as part of your data cleaning process

df = pd.DataFrame({"A": [1, 2, None, 4], "B": ["one", "two", "three", None], "C": [False, None, False, True]})
df

,A,B,C
0,1.0,one,False
1,2.0,two,
2,,three,False
3,4.0,,True


In [14]:
df['A'] = df['A'].astype("Int64")
df['B'] = df['B'].astype("string")
df['C'] = df['C'].astype("boolean")
df

,A,B,C
0,1,one,False
1,2,two,<NA>
2,<NA>,three,False
3,4,<NA>,True
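Instead of converting column by column, `DataFrame.convert_dtypes()` can infer suitable extension types in one step. This is a sketch on the same sample frame as above, not necessarily how the original workflow was meant to proceed; the name `converted` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": ["one", "two", "three", None],
    "C": [False, None, False, True],
})

# Let pandas pick nullable extension dtypes for every column.
converted = df.convert_dtypes()
print(converted.dtypes)
```

Explicit `astype` calls remain the better choice when a specific target type is required.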
