# Sparse data structures
pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical “mostly 0” sense. Rather, you can view these objects as being “compressed”: any data matching a specific value (NaN / missing value by default, though any value can be chosen, including 0) is omitted. The omitted values are not actually stored in the array.

In [3]:
import pandas as pd
import numpy as np

ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan
sts = pd.Series(pd.arrays.SparseArray(ts))
print(sts)

0    2.205386
1   -0.203786
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -1.358803
9    1.111127
dtype: Sparse[float64, nan]
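
The introduction notes that any value can serve as the omitted value, including 0. The following is a minimal sketch of that case; the sample data and the variable names `values` and `zero_sparse` are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

# Mostly-zero data (illustrative values): choose 0 as the value to omit.
values = np.array([0, 0, 3, 0, 0, 0, 7, 0, 0, 0])
zero_sparse = pd.arrays.SparseArray(values, fill_value=0)

print(zero_sparse.fill_value)  # the value that is omitted from storage
print(zero_sparse.sp_values)   # only the entries that differ from fill_value
print(zero_sparse.npoints)     # number of entries actually stored
print(zero_sparse.density)     # stored entries divided by total length
```

Here only the two non-zero entries are kept, so the density is 2 out of 10.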


Now suppose you have a large DataFrame that is mostly NA and run the following code:

In [9]:
df = pd.DataFrame(np.random.randn(10000, 4))

df.iloc[:9998] = np.nan

sdf = df.astype(pd.SparseDtype("float", np.nan))

print(sdf.head())

    0   1   2   3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN


In [10]:
print(sdf.dtypes)

0    Sparse[float64, nan]
1    Sparse[float64, nan]
2    Sparse[float64, nan]
3    Sparse[float64, nan]
dtype: object


In [11]:
print(sdf.sparse.density)

0.0002
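
A density this low suggests a large memory saving. The sketch below rebuilds a similar frame and compares the reported memory usage of the dense and sparse versions. The names `dense_df` and `sparse_df` and the seeded generator are assumptions made for illustration, and the exact byte counts depend on the pandas version and platform.

```python
import numpy as np
import pandas as pd

# Same shape as the example above, with a seeded generator for repeatability.
rng = np.random.default_rng(0)
dense_df = pd.DataFrame(rng.standard_normal((10000, 4)))
dense_df.iloc[:9998] = np.nan  # leave only the last two rows populated

sparse_df = dense_df.astype(pd.SparseDtype("float", np.nan))

# Sum the per-column memory usage, ignoring the index.
dense_bytes = dense_df.memory_usage(index=False).sum()
sparse_bytes = sparse_df.memory_usage(index=False).sum()

print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
print("density:     ", sparse_df.sparse.density)
```

The sparse frame stores only the 8 non-NaN values plus their positions, so its footprint should be a small fraction of the dense one.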


# SparseArray
arrays.SparseArray is an ExtensionArray for storing an array of sparse values (see dtypes for more on extension arrays). It is a 1-dimensional ndarray-like object that stores only the values distinct from the fill_value:

In [12]:
arr = np.random.randn(10)
arr[2:5] = np.nan
arr[7:8] = np.nan
sparr = pd.arrays.SparseArray(arr)

print(sparr)

[-0.6276816320815497, 0.82141575208587, nan, nan, nan, 1.1543554931992392, 2.010552362696765, nan, 0.5907138009484671, -1.1607948246680355]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9])



A sparse array can be converted to a regular (dense) ndarray with numpy.asarray():

In [13]:
print(np.asarray(sparr))

[-0.62768163  0.82141575         nan         nan         nan  1.15435549
  2.01055236         nan  0.5907138  -1.16079482]
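
As a quick check that the conversion is lossless, the sketch below round-trips a small array through SparseArray and back. The input values and the names `dense`, `sparse`, and `round_trip` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# A small dense array with NaN gaps (illustrative values).
dense = np.array([1.5, np.nan, np.nan, -0.5, np.nan])
sparse = pd.arrays.SparseArray(dense)  # fill_value defaults to NaN for floats

# Only the two non-NaN entries are stored.
print(sparse.npoints)

# Converting back gives an ordinary ndarray with the NaNs restored.
round_trip = np.asarray(sparse)
print(type(round_trip))
print(np.array_equal(round_trip, dense, equal_nan=True))
```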


# Sparse calculation
You can apply NumPy ufuncs to SparseArray and get a SparseArray as a result.

In [17]:
arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])

print(np.abs(arr))

# np.abs takes the absolute value of each stored element, so -2.0 becomes 2.0

[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3])



The ufunc is also applied to fill_value. This is needed to get the correct dense result.

In [18]:
arr = pd.arrays.SparseArray([1., -1, -1, -2., -1], fill_value=-1)

print(np.abs(arr))

[1.0, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([0, 3])



In [19]:
print(np.abs(arr).to_dense())

[1. 1. 1. 2. 1.]
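
To see why the ufunc has to be applied to fill_value as well, the sketch below compares the sparse result with the same ufunc applied to the equivalent dense array. The names `dense`, `sparse`, and `result` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

dense = np.array([1.0, -1.0, -1.0, -2.0, -1.0])
sparse = pd.arrays.SparseArray(dense, fill_value=-1.0)

result = np.abs(sparse)

# The fill value is transformed too: abs(-1.0) is 1.0.
print(result.fill_value)

# Densifying the sparse result matches applying the ufunc to the dense data.
print(np.array_equal(np.asarray(result), np.abs(dense)))
```

If the fill value were left at -1, the omitted positions would densify to -1 instead of 1 and the two results would disagree.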


# SparseArray & to_dense

In [23]:
# Import pandas as pd
import pandas as pd

# Create a SparseArray; None entries are stored as NaN, the default fill value
sr = pd.arrays.SparseArray([19.5, 16.8, None, 22.78, None, 20.124, None, 18.1002, None])

# Print the sparse array
print(sr)

[19.5, 16.8, nan, 22.78, nan, 20.124, nan, 18.1002, nan]
Fill: nan
IntIndex
Indices: array([0, 1, 3, 5, 7])



In [26]:
# Convert to a dense NumPy array
print(sr.to_dense())

[19.5    16.8        nan 22.78       nan 20.124      nan 18.1002     nan]
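
The same conversion is available when the sparse values are held in a Series, through the `.sparse` accessor. This is a minimal sketch; the names `sparse_values`, `sparse_series`, and `dense_series` are illustrative, and `np.nan` is used in place of `None` to keep the input dtype unambiguous.

```python
import numpy as np
import pandas as pd

sparse_values = pd.arrays.SparseArray([19.5, 16.8, np.nan, 22.78, np.nan])
sparse_series = pd.Series(sparse_values)

# The accessor returns a Series backed by a regular NumPy array.
dense_series = sparse_series.sparse.to_dense()

print(sparse_series.dtype)
print(dense_series.dtype)
print(dense_series)
```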
