## 3.2.13 Gotchas

If you are attempting to perform an operation you might see an exception like:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [17]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)

## 3.3.1 Head and tail

To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number.

In [5]:
long_series = pd.Series(np.random.randn(1000))

In [6]:
long_series.head()

0   -0.440928
1    0.762631
2   -0.010539
3    0.897531
4   -0.520248
dtype: float64

In [7]:
long_series.tail(3)

997    2.091614
998    1.069280
999   -0.252177
dtype: float64
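
As a small supplementary sketch (not part of the original walkthrough), the same calls can be checked on deterministic data. The name `demo` is made up for this illustration, and the negative-`n` form relies on the documented behavior that `head(-n)` returns everything except the last `n` elements.

```python
import numpy as np
import pandas as pd

# Deterministic data so the results are easy to verify by eye.
demo = pd.Series(np.arange(10))

first_five = demo.head()           # default n=5: values 0..4
last_three = demo.tail(3)          # custom n: values 7, 8, 9
all_but_last_two = demo.head(-2)   # negative n: drop the last 2 elements

print(first_five.tolist())
print(last_three.tolist())
print(len(all_but_last_two))
```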

## 3.3.2 Attributes and underlying data

pandas objects have a number of attributes enabling you to access the metadata:

• shape: gives the axis dimensions of the object, consistent with ndarray
• Axis labels
  – Series: index (only axis)
  – DataFrame: index (rows) and columns

Note, these attributes can be safely assigned to!

In [14]:
df[:2]

                   A         B         C         D
2013-01-01 -0.065535  0.024550 -0.207683  1.111742
2013-01-02  1.060116 -0.503713 -0.186309  1.054862


In [15]:
df.columns = [x.lower() for x in df.columns ]

In [16]:
df

                   a         b         c         d
2013-01-01 -0.065535  0.024550 -0.207683  1.111742
2013-01-02  1.060116 -0.503713 -0.186309  1.054862
2013-01-03  1.279041  0.520875 -0.294267  0.116395
2013-01-04  0.864101  1.163065  1.312540 -0.610546
2013-01-05 -0.151097  1.725490 -0.943199 -0.020007
2013-01-06 -0.910340 -1.488262  0.489967  1.018382
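
The following is a minimal sketch of reading `shape` and assigning to the axis labels. The frame and label names are invented for illustration. It also shows that assigning labels of the wrong length raises a `ValueError`, which is worth knowing before replacing an index wholesale.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 frame used only for this illustration.
frame = pd.DataFrame(np.zeros((3, 2)), columns=["A", "B"])
print(frame.shape)  # (rows, columns)

# Axis labels can be replaced by plain assignment.
frame.index = ["x", "y", "z"]
frame.columns = [c.lower() for c in frame.columns]

# The new labels must match the axis length, otherwise pandas raises.
mismatch_error = None
try:
    frame.index = ["only", "two"]
except ValueError as exc:
    mismatch_error = exc

print(list(frame.index), list(frame.columns))
```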


pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the actual data and do the actual computation. For many types, the underlying array is a numpy.ndarray. However, pandas and third-party libraries may extend NumPy's type system to add support for custom arrays (see dtypes).

To get the actual data inside an Index or Series, use the .array property:

In [18]:
s.array

<PandasArray>
[nan, nan, 1.0, 3.0, 5.0, nan]
Length: 6, dtype: float64

In [19]:
s.index.array

<DatetimeArray>
['2013-01-01 00:00:00', '2013-01-02 00:00:00', '2013-01-03 00:00:00',
 '2013-01-04 00:00:00', '2013-01-05 00:00:00', '2013-01-06 00:00:00']
Length: 6, dtype: datetime64[ns]

array will always be an ExtensionArray. The exact details of what an ExtensionArray is and why pandas uses them are a bit beyond the scope of this introduction. See dtypes for more.
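
One way to check this claim, sketched here with made-up Series names, is to test the result of `.array` against the public `ExtensionArray` base class exported from `pandas.api.extensions`. The concrete wrapper class for NumPy-backed data differs between pandas versions, so the sketch checks only the base class.

```python
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Illustrative data: one NumPy-backed Series, one categorical Series.
float_series = pd.Series([1.0, 2.0, 3.0])
cat_series = pd.Series(pd.Categorical(["a", "b", "a"]))

# .array wraps the data in an ExtensionArray regardless of the backing type.
float_is_ea = isinstance(float_series.array, ExtensionArray)
cat_is_ea = isinstance(cat_series.array, ExtensionArray)
index_is_ea = isinstance(float_series.index.array, ExtensionArray)

print(float_is_ea, cat_is_ea, index_is_ea)
```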

If you know you need a NumPy array, use to_numpy() or numpy.asarray().

In [20]:
s.to_numpy()

array([nan, nan,  1.,  3.,  5., nan])

In [21]:
np.asarray(s)

array([nan, nan,  1.,  3.,  5., nan])

When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and coercing values. See dtypes for more.
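
As an illustration of that coercion, the sketch below assumes a Series with the nullable `Int64` extension dtype (a choice made here, not taken from the text above). Passing an explicit `dtype` and `na_value` to `to_numpy()` makes the conversion of the missing value visible.

```python
import numpy as np
import pandas as pd

# Nullable integer Series; the missing entry is stored as pd.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")

# Request a plain float array and say how missing values should be encoded.
# The integers are coerced to floats and the result is a new NumPy array.
as_float = nullable.to_numpy(dtype="float64", na_value=np.nan)

print(as_float.dtype)
print(as_float)
```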

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider datetimes with timezones. NumPy doesn't have a dtype to represent timezone-aware datetimes, so there are two possibly useful representations:

1. An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz
2. A datetime64[ns]-dtype numpy.ndarray, where the values have been converted to UTC and the timezone discarded

Timezones may be preserved with dtype=object

In [22]:
ser = pd.Series(pd.date_range('2000',periods=2,tz='CET'))

In [23]:
ser.to_numpy(dtype=object)

array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
      dtype=object)

Or thrown away with dtype='datetime64[ns]'

In [24]:
ser.to_numpy(dtype='datetime64[ns]')

array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')

Getting the raw data inside a DataFrame is possibly a bit more complex. When your DataFrame only has a single data type for all the columns, DataFrame.to_numpy() will return the underlying data:

In [25]:
df.to_numpy()

array([[-0.49299552, -0.7741897 , -1.54732587,  0.38968693],
       [-0.29058525, -0.60001747, -1.28123134,  0.84825403],
       [ 0.15528031,  0.45869603, -1.44738353,  0.21002015],
       [ 1.1618357 ,  0.93982259,  0.1279657 , -0.59329746],
       [-0.28495476,  0.27377632,  0.34685025, -0.13198923],
       [ 0.94452636,  1.29080311, -1.92899403,  0.0878218 ]])

If a DataFrame contains homogeneously-typed data, the ndarray can actually be modified in-place, and the changes will be reflected in the data structure. For heterogeneous data (e.g. when the DataFrame's columns are not all the same dtype), this will not be the case. The values attribute itself, unlike the axis labels, cannot be assigned to.

Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype.
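
The dtype rule in the note can be checked with two small frames. The frame and column names below are invented for this sketch.

```python
import numpy as np
import pandas as pd

# Integers and floats only: the common dtype is a float dtype.
ints_and_floats = pd.DataFrame({"i": [1, 2], "f": [1.5, 2.5]})
numeric_dtype = ints_and_floats.to_numpy().dtype

# Strings involved: the result falls back to object dtype.
with_strings = pd.DataFrame({"i": [1, 2], "s": ["x", "y"]})
mixed_dtype = with_strings.to_numpy().dtype

print(numeric_dtype, mixed_dtype)
```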

In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series or DataFrame. You'll still find references to these in old code bases and online. Going forward, we recommend avoiding .values and using .array or .to_numpy(). .values has the following drawbacks:

1. When your Series contains an extension type, it's unclear whether Series.values returns a NumPy array or the extension array. Series.array will always return an ExtensionArray, and will never copy data. Series.to_numpy() will always return a NumPy array, potentially at the cost of copying / coercing values.
2. When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data and coercing values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.
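
The first drawback can be made concrete with a short sketch. It assumes two extension-typed Series chosen for illustration, one categorical and one timezone-aware: `.values` returns different kinds of object for them, while `.array` and `.to_numpy()` are predictable.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Two Series backed by extension types (names are illustrative).
cat = pd.Series(pd.Categorical(["a", "b", "a"]))
tz = pd.Series(pd.date_range("2000", periods=2, tz="CET"))

# .values is inconsistent: an extension array here, a NumPy array there.
print(type(cat.values).__name__)
print(type(tz.values).__name__)

# .array always gives an ExtensionArray; .to_numpy() always gives an ndarray.
arrays_are_ea = all(isinstance(s.array, ExtensionArray) for s in (cat, tz))
numpy_are_nd = all(isinstance(s.to_numpy(), np.ndarray) for s in (cat, tz))
print(arrays_are_ea, numpy_are_nd)
```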


## 3.3.4 Flexible binary operations


With binary operations between pandas data structures, there are two key points of interest:

• Broadcasting behavior between higher- (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.

• Missing data in computations.


We will demonstrate how to manage these issues independently, though they can be handled simultaneously.

### Matching / broadcasting behavior

DataFrame has the methods add(), sub(), mul(), div() and related functions such as radd() and rsub() for carrying out binary operations. For broadcasting behavior, Series input is of primary interest. Using these functions, you can match on either the index or the columns via the axis keyword:

In [26]:
df = pd.DataFrame({
    'one':pd.Series(np.random.randn(3),index=['a','b','c']),
    'two':pd.Series(np.random.randn(4),index=['a','b','c','d']),
    'three':pd.Series(np.random.randn(3),index=['b','c','d'])
})

In [27]:
df

        one       two     three
a -0.807035 -1.362240       NaN
b -0.808933 -0.711589  0.771078
c -1.048687  1.150256  0.185809
d       NaN  2.492871 -2.392236


In [28]:
row = df.iloc[1]

In [29]:
column = df['two']

In [30]:
df.sub(row,axis='columns')

        one       two     three
a  0.001898 -0.650651       NaN
b  0.000000  0.000000  0.000000
c -0.239754  1.861845 -0.585269
d       NaN  3.204460 -3.163314


In [52]:
df.sub(row,axis=1)

        one       two     three
a  0.001898 -0.650651       NaN
b  0.000000  0.000000  0.000000
c -0.239754  1.861845 -0.585269
d       NaN  3.204460 -3.163314
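
To make the effect of the axis keyword easier to verify, here is a small sketch with round numbers instead of random data. The names `frame`, `by_columns` and `by_index` are invented for this example. Matching on the columns subtracts a row from every row, while matching on the index subtracts a column from every column.

```python
import pandas as pd

frame = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)

row = frame.iloc[1]     # Series labelled by the column names
column = frame["two"]   # Series labelled by the row index

# Align the Series labels with the column labels (axis=1 is equivalent).
by_columns = frame.sub(row, axis="columns")

# Align the Series labels with the row index (axis=0 is equivalent).
by_index = frame.sub(column, axis="index")

print(by_columns)
print(by_index)
```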


Furthermore you can align a level of a MultiIndexed DataFrame with a Series.

In [50]:
# Start from the four-row frame with columns 'one', 'two', 'three' built above;
# the new MultiIndex needs exactly one tuple per row.
dfmi = df.copy()

In [51]:
dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),
                                        (1, 'c'), (2, 'a')],
                                       names=['first', 'second'])

In [47]:
# Subtract the 'two' column, matching its labels against the 'second' level
column = df['two']
dfmi.sub(column, axis=0, level='second')
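
A self-contained sketch of level-based alignment, using round numbers so the result can be checked by hand. The names `frame` and `offsets` are made up for this example. Each row is matched to the Series entry whose label equals that row's value in the `second` level, so both rows labelled `'a'` receive the same offset.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, "a"), (1, "b"), (1, "c"), (2, "a")],
    names=["first", "second"],
)
frame = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, 4.0], "two": [10.0, 20.0, 30.0, 40.0]},
    index=index,
)

# One offset per label of the 'second' level.
offsets = pd.Series([100.0, 200.0, 300.0], index=["a", "b", "c"])

# Broadcast the Series across the 'first' level while matching on 'second'.
result = frame.sub(offsets, axis=0, level="second")
print(result)
```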