# Pandas

https://pandas.pydata.org/

`conda install pandas`

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets; the data need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries.

Here are just a few of the things that pandas does well:

- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining of data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes (possible to have multiple labels per tick)
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging
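
Two of the items above, automatic alignment and missing-data handling, can be sketched in a few lines. The Series `a` and `b` and their labels are made up for illustration; this is a minimal sketch, not an exhaustive tour of the alignment rules.

```python
import pandas as pd

# Two Series that share only some of their labels.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])

# Arithmetic aligns on labels, not on positions. Labels present in only
# one operand produce NaN in the result.
total = a + b

# The method form accepts fill_value, which stands in for the missing side.
filled = a.add(b, fill_value=0.0)

print(total)
print(filled)
```

Here `total` holds values only for the shared labels `y` and `z`, while `filled` keeps every label from either Series.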

## Series and DataFrames

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
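
A minimal sketch of the container idea, using a small made-up DataFrame: columns are inserted and removed by key much like dictionary entries, each column is a Series, and each element of a Series is a scalar.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Insert a column by key, as with a dict.
df["c"] = df["a"] * 2

# A column is itself a Series, and one element of it is a scalar.
col = df["b"]
first = col.iloc[0]

# Remove columns by key: `del` drops the column, `pop` also returns it.
del df["a"]
popped = df.pop("c")

print(list(df.columns))
```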

Also, we would like sensible default behaviors for the common API functions which take into account the typical orientation of time series and cross-sectional data sets. When using ndarrays to store 2- and 3-dimensional data, a burden is placed on the user to consider the orientation of the data set when writing functions; axes are considered more or less equivalent (except when C- or Fortran-contiguousness matters for performance). In pandas, the axes are intended to lend more semantic meaning to the data; i.e., for a particular data set there is likely to be a “right” way to orient the data. The goal, then, is to reduce the amount of mental effort required to code up data transformations in downstream functions.
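
The following sketch contrasts the two approaches on a tiny made-up data set (the column names `open` and `close` are illustrative assumptions). With a bare ndarray the reader has to remember what each axis number means; with a DataFrame the default reduction already follows the usual orientation of rows as observations and columns as variables.

```python
import numpy as np
import pandas as pd

values = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# ndarray: nothing in the object says whether axis 0 is time or variables.
col_means_np = values.mean(axis=0)

# DataFrame: rows are labeled by date, columns by variable name.
df = pd.DataFrame(
    values,
    index=pd.date_range("2020-01-01", periods=3),
    columns=["open", "close"],
)

col_means = df.mean()                 # default: one value per column
row_means = df.mean(axis="columns")   # reduce across the columns instead

print(col_means)
print(row_means)
```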

## SQL and Pandas

SQL databases and pandas both work with "tabular" data, but they serve different purposes.

![Basic DataFrame](images/pandas-basic.png "Basic DataFrame")

SQL is the query language of relational DBMSs (database management systems), and the name is often used loosely for the DBMS itself. A DBMS implements the relational model: it stores data and maintains its integrity over long periods of time, and it offers SQL as the language to query and analyse the data that resides in it.

![Relational Model](images/relational.jpg "Relational Model")

pandas is a data analysis library; it is not designed to store data long-term or to enforce its integrity.

![MultiIndex DataFrame](images/pandas-multindex.png "MultiIndex DataFrame")
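
To make the contrast concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in DBMS. The `orders` table, its columns, and its rows are hypothetical. The database rejects a row that violates a constraint, while the DataFrame loaded from it accepts any value.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer TEXT NOT NULL,"
    " amount REAL NOT NULL CHECK (amount >= 0))"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ann", 10.0), ("bob", 5.0), ("ann", 2.5)],
)
conn.commit()

# The DBMS enforces integrity: a negative amount violates the CHECK constraint.
try:
    conn.execute("INSERT INTO orders (customer, amount) VALUES ('eve', -1.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The same aggregation, once in SQL and once in pandas.
sql_totals = pd.read_sql_query(
    "SELECT customer, SUM(amount) AS total FROM orders"
    " GROUP BY customer ORDER BY customer",
    conn,
)
orders = pd.read_sql_query("SELECT * FROM orders", conn)
pd_totals = orders.groupby("customer")["amount"].sum()

# pandas applies no such constraint to the in-memory copy.
orders.loc[0, "amount"] = -1.0

conn.close()
print(rejected, sql_totals, pd_totals, sep="\n")
```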

## 10 minutes to pandas

https://pandas.pydata.org/pandas-docs/version/0.25/getting_started/10min.html

In [0]:
import numpy as np
import pandas as pd

In [3]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [4]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [5]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.978582,-0.491139,-0.470538,-1.940178
2013-01-02,1.376487,-0.487193,0.397903,-0.422354
2013-01-03,-1.327458,0.808837,-0.145697,0.407213
2013-01-04,-1.115759,-2.017644,0.199215,-0.407738
2013-01-05,-0.899722,-1.847228,-0.285282,0.066929
2013-01-06,-0.893708,0.705151,0.296319,0.006475


In [6]:
df2 = pd.DataFrame({'A': 1.,
  'B': pd.Timestamp('20130102'),
  'C': pd.Series(1, index=list(range(4)), dtype='float32'),
  'D': np.array([3] * 4, dtype='int32'),
  'E': pd.Categorical(["test", "train", "test", "train"]),
  'F': 'foo'})

df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [7]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [8]:
df.head(2)  # not displayed: a notebook cell shows only its last expression
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,-1.115759,-2.017644,0.199215,-0.407738
2013-01-05,-0.899722,-1.847228,-0.285282,0.066929
2013-01-06,-0.893708,0.705151,0.296319,0.006475


In [9]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [10]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [11]:
df.to_numpy()

array([[-0.97858186, -0.4911392 , -0.47053818, -1.9401777 ],
       [ 1.37648719, -0.48719283,  0.39790329, -0.42235389],
       [-1.32745818,  0.8088373 , -0.14569661,  0.40721293],
       [-1.11575888, -2.01764443,  0.19921516, -0.40773773],
       [-0.89972195, -1.84722821, -0.2852821 ,  0.06692903],
       [-0.89370827,  0.70515111,  0.29631925,  0.00647493]])

In [12]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.63979,-0.554869,-0.001347,-0.381609
std,1.001162,1.205478,0.349236,0.82529
min,-1.327458,-2.017644,-0.470538,-1.940178
25%,-1.081465,-1.508206,-0.250386,-0.4187
50%,-0.939152,-0.489166,0.026759,-0.200631
75%,-0.895212,0.407065,0.272043,0.051816
max,1.376487,0.808837,0.397903,0.407213


In [13]:
df.T

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,-0.978582,1.376487,-1.327458,-1.115759,-0.899722,-0.893708
B,-0.491139,-0.487193,0.808837,-2.017644,-1.847228,0.705151
C,-0.470538,0.397903,-0.145697,0.199215,-0.285282,0.296319
D,-1.940178,-0.422354,0.407213,-0.407738,0.066929,0.006475


In [14]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,-1.940178,-0.470538,-0.491139,-0.978582
2013-01-02,-0.422354,0.397903,-0.487193,1.376487
2013-01-03,0.407213,-0.145697,0.808837,-1.327458
2013-01-04,-0.407738,0.199215,-2.017644,-1.115759
2013-01-05,0.066929,-0.285282,-1.847228,-0.899722
2013-01-06,0.006475,0.296319,0.705151,-0.893708


In [15]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-04,-1.115759,-2.017644,0.199215,-0.407738
2013-01-05,-0.899722,-1.847228,-0.285282,0.066929
2013-01-01,-0.978582,-0.491139,-0.470538,-1.940178
2013-01-02,1.376487,-0.487193,0.397903,-0.422354
2013-01-06,-0.893708,0.705151,0.296319,0.006475
2013-01-03,-1.327458,0.808837,-0.145697,0.407213


In [16]:
df['A']

2013-01-01   -0.978582
2013-01-02    1.376487
2013-01-03   -1.327458
2013-01-04   -1.115759
2013-01-05   -0.899722
2013-01-06   -0.893708
Freq: D, Name: A, dtype: float64

In [17]:
df[['A','B']]

Unnamed: 0,A,B
2013-01-01,-0.978582,-0.491139
2013-01-02,1.376487,-0.487193
2013-01-03,-1.327458,0.808837
2013-01-04,-1.115759,-2.017644
2013-01-05,-0.899722,-1.847228
2013-01-06,-0.893708,0.705151


In [18]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,-0.978582,-0.491139,-0.470538,-1.940178
2013-01-02,1.376487,-0.487193,0.397903,-0.422354
2013-01-03,-1.327458,0.808837,-0.145697,0.407213


In [19]:
df['20130102':'20130104']

Unnamed: 0,A,B,C,D
2013-01-02,1.376487,-0.487193,0.397903,-0.422354
2013-01-03,-1.327458,0.808837,-0.145697,0.407213
2013-01-04,-1.115759,-2.017644,0.199215,-0.407738


In [20]:
df.loc[dates[0]]

A   -0.978582
B   -0.491139
C   -0.470538
D   -1.940178
Name: 2013-01-01 00:00:00, dtype: float64

In [21]:
df.loc[:,['A','B']]

Unnamed: 0,A,B
2013-01-01,-0.978582,-0.491139
2013-01-02,1.376487,-0.487193
2013-01-03,-1.327458,0.808837
2013-01-04,-1.115759,-2.017644
2013-01-05,-0.899722,-1.847228
2013-01-06,-0.893708,0.705151


In [22]:
df.loc['20130102':'20130104',['A','B']]

Unnamed: 0,A,B
2013-01-02,1.376487,-0.487193
2013-01-03,-1.327458,0.808837
2013-01-04,-1.115759,-2.017644


In [23]:
df.loc['20130102',['A','B']]

A    1.376487
B   -0.487193
Name: 2013-01-02 00:00:00, dtype: float64

In [24]:
df.loc[dates[0],'A']

-0.9785818626516731

In [25]:
df.at[dates[0],'A']

-0.9785818626516731

In [26]:
df.iloc[3]

A   -1.115759
B   -2.017644
C    0.199215
D   -0.407738
Name: 2013-01-04 00:00:00, dtype: float64

In [27]:
df.iloc[3:5,0:2]

Unnamed: 0,A,B
2013-01-04,-1.115759,-2.017644
2013-01-05,-0.899722,-1.847228


In [28]:
df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2013-01-02,1.376487,0.397903
2013-01-03,-1.327458,-0.145697
2013-01-05,-0.899722,-0.285282


In [29]:
df.iloc[1:3,:]

Unnamed: 0,A,B,C,D
2013-01-02,1.376487,-0.487193,0.397903,-0.422354
2013-01-03,-1.327458,0.808837,-0.145697,0.407213


In [30]:
df.iloc[:,1:3]

Unnamed: 0,B,C
2013-01-01,-0.491139,-0.470538
2013-01-02,-0.487193,0.397903
2013-01-03,0.808837,-0.145697
2013-01-04,-2.017644,0.199215
2013-01-05,-1.847228,-0.285282
2013-01-06,0.705151,0.296319


In [31]:
df.iloc[1,1]

-0.48719282858700735

In [32]:
df.iat[1,1]

-0.48719282858700735
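
The selection cells above mix label-based access (`loc`, `at`) and position-based access (`iloc`, `iat`). One difference that is easy to miss is that label slices include both endpoints while position slices exclude the stop. The sketch below uses a deterministic integer frame instead of the random values of the notebook so the results are reproducible.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# .loc slices by label and includes BOTH endpoints: Jan 2, 3 and 4.
by_label = df.loc["2013-01-02":"2013-01-04", ["A", "B"]]

# .iloc slices by position and excludes the stop: rows 1, 2 and 3.
by_position = df.iloc[1:4, 0:2]

same_rows = by_label.equals(by_position)

# .at / .iat are the scalar counterparts of .loc / .iloc.
value_by_label = df.at[dates[1], "B"]
value_by_position = df.iat[1, 1]

print(same_rows, value_by_label, value_by_position)
```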

In [33]:
df[df.A > 0]

Unnamed: 0,A,B,C,D
2013-01-02,1.376487,-0.487193,0.397903,-0.422354


In [34]:
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,,,,
2013-01-02,1.376487,,0.397903,
2013-01-03,,0.808837,,0.407213
2013-01-04,,,0.199215,
2013-01-05,,,,0.066929
2013-01-06,,0.705151,0.296319,0.006475


In [35]:
df2 = df.copy()

df2['E'] = ['one','one','two','three','four','three']

df2[df2['E'].isin(['two','four'])]

Unnamed: 0,A,B,C,D,E
2013-01-03,-1.327458,0.808837,-0.145697,0.407213,two
2013-01-05,-0.899722,-1.847228,-0.285282,0.066929,four
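
Boolean masks like the ones above can be combined. A small sketch with made-up values: conditions are joined with `&` and `|` (each wrapped in parentheses), and `where` keeps the original shape by masking non-matching entries with NaN instead of dropping rows.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [-1.0, 2.0, -3.0, 4.0], "E": ["one", "two", "two", "four"]}
)

positive = df[df["A"] > 0]
in_set = df["E"].isin(["two", "four"])

# Element-wise AND / OR of two boolean Series.
both = df[(df["A"] > 0) & in_set]
either = df[(df["A"] > 0) | in_set]

# where() keeps every row and replaces non-matching values with NaN.
masked = df["A"].where(df["A"] > 0)

print(both)
print(either)
print(masked)
```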


In [0]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))

df['F'] = s1

In [37]:
df.at[dates[0],'A'] = 0
df.iat[0,1] = 0
df.loc[:,'D'] = np.array([5] * len(df))

df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-0.470538,5,
2013-01-02,1.376487,-0.487193,0.397903,5,1.0
2013-01-03,-1.327458,0.808837,-0.145697,5,2.0
2013-01-04,-1.115759,-2.017644,0.199215,5,3.0
2013-01-05,-0.899722,-1.847228,-0.285282,5,4.0
2013-01-06,-0.893708,0.705151,0.296319,5,5.0
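
The NaN in the first row of column `F` above comes from index alignment: assigning a Series to a column matches on labels, not on positions. A minimal sketch with made-up numbers (the extra column `G` is added only for contrast):

```python
import pandas as pd

dates = pd.date_range("2013-01-01", periods=3)
df = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=dates)

# s1 starts one day later than df's index.
s1 = pd.Series([10, 20, 30], index=pd.date_range("2013-01-02", periods=3))

# Series assignment aligns on the index: 2013-01-01 has no match and
# becomes NaN, and the 2013-01-04 value is not used at all.
df["F"] = s1

# A plain array carries no labels, so it is assigned by position.
df["G"] = s1.to_numpy()

print(df)
```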


In [38]:
df2 = df.copy()

df2[df2 > 0] = -df2

df2

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-0.470538,-5,
2013-01-02,-1.376487,-0.487193,-0.397903,-5,-1.0
2013-01-03,-1.327458,-0.808837,-0.145697,-5,-2.0
2013-01-04,-1.115759,-2.017644,-0.199215,-5,-3.0
2013-01-05,-0.899722,-1.847228,-0.285282,-5,-4.0
2013-01-06,-0.893708,-0.705151,-0.296319,-5,-5.0


In [39]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
df1

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.0,-0.470538,5,,1.0
2013-01-02,1.376487,-0.487193,0.397903,5,1.0,1.0
2013-01-03,-1.327458,0.808837,-0.145697,5,2.0,
2013-01-04,-1.115759,-2.017644,0.199215,5,3.0,


In [40]:
df1.dropna(how='any')

Unnamed: 0,A,B,C,D,F,E
2013-01-02,1.376487,-0.487193,0.397903,5,1.0,1.0


In [41]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.0,-0.470538,5,5.0,1.0
2013-01-02,1.376487,-0.487193,0.397903,5,1.0,1.0
2013-01-03,-1.327458,0.808837,-0.145697,5,2.0,5.0
2013-01-04,-1.115759,-2.017644,0.199215,5,3.0,5.0


In [42]:
df1.isna()

Unnamed: 0,A,B,C,D,F,E
2013-01-01,False,False,False,False,True,False
2013-01-02,False,False,False,False,False,False
2013-01-03,False,False,False,False,False,True
2013-01-04,False,False,False,False,False,True
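
A few variations on the missing-data calls above, sketched on a small made-up frame: counting missing values per column, dropping rows only when selected columns are all missing, and filling each column with its own value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, 6.0],
        "c": [7.0, 8.0, 9.0],
    }
)

# isna() returns a boolean frame; summing it counts NaNs per column.
missing_per_column = df.isna().sum()

# Drop rows that contain any NaN.
complete_rows = df.dropna(how="any")

# Drop rows only when both "a" and "b" are missing.
not_all_missing = df.dropna(how="all", subset=["a", "b"])

# Fill each column with a different value via a dict.
filled = df.fillna({"a": 0.0, "b": df["b"].mean()})

print(missing_per_column)
print(filled)
```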


In [43]:
df.mean()

A   -0.476693
B   -0.473013
C   -0.001347
D    5.000000
F    3.000000
dtype: float64

In [44]:
df.mean(1)

2013-01-01    1.132365
2013-01-02    1.457440
2013-01-03    1.267136
2013-01-04    1.013162
2013-01-05    1.193554
2013-01-06    2.021552
Freq: D, dtype: float64

In [45]:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [46]:
df.sub(s, axis='index')

Unnamed: 0,A,B,C,D,F
2013-01-01,,,,,
2013-01-02,,,,,
2013-01-03,-2.327458,-0.191163,-1.145697,4.0,1.0
2013-01-04,-4.115759,-5.017644,-2.800785,2.0,0.0
2013-01-05,-5.899722,-6.847228,-5.285282,0.0,-1.0
2013-01-06,,,,,
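
The `axis` argument of `sub` decides which labels the Series is matched against. A minimal sketch with made-up values: subtracting column means aligns with the column labels, while subtracting row means needs `axis="index"` to align with the row labels.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# df.mean() is indexed by column name, so plain subtraction aligns it
# with the columns and centers each column.
col_centered = df - df.mean()

# df.mean(axis=1) is indexed by row label, so the Series has to be
# matched against the index to center each row.
row_centered = df.sub(df.mean(axis=1), axis="index")

print(col_centered)
print(row_centered)
```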


In [47]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-0.470538,5,
2013-01-02,1.376487,-0.487193,-0.072635,10,1.0
2013-01-03,0.049029,0.321644,-0.218332,15,3.0
2013-01-04,-1.06673,-1.696,-0.019116,20,6.0
2013-01-05,-1.966452,-3.543228,-0.304398,25,10.0
2013-01-06,-2.86016,-2.838077,-0.008079,30,15.0


In [48]:
df.apply(lambda x: x.max() - x.min())

A    2.703945
B    2.826482
C    0.868441
D    0.000000
F    4.000000
dtype: float64

In [49]:
s = pd.Series(np.random.randint(0, 7, size=10))
s.value_counts()

6    2
4    2
1    2
0    2
3    1
2    1
dtype: int64

In [50]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

In [51]:
df = pd.DataFrame(np.random.randn(10, 4))
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,0.231578,0.038452,1.66243,-1.868856
1,0.990806,-0.648231,-1.061662,0.820018
2,-0.672656,-0.548242,0.847644,0.086607
3,-1.198687,-0.461153,-0.048753,1.425093
4,-0.783141,-0.890506,1.774859,0.419163
5,-0.460744,1.149717,-2.242177,0.572991
6,0.258531,-0.591343,0.68301,-1.0236
7,0.776816,-0.16083,0.018049,-1.217341
8,-1.021578,-0.719735,0.631795,-2.641814
9,-0.374103,-1.477149,1.121477,-1.706487


In [52]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

pd.merge(left, right, on='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5


In [53]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

pd.merge(left, right, on='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,bar,2,5
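
The two merges above differ only in whether the key is duplicated: with repeated keys, every left row is paired with every matching right row. The sketch below, on made-up frames, shows the `how`, `indicator`, and `validate` parameters of `pd.merge`, which help control and check that behavior.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar", "baz"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["foo", "bar", "qux"], "rval": [4, 5, 6]})

# Default is an inner join: only keys present on both sides survive.
inner = pd.merge(left, right, on="key")

# An outer join keeps every key; indicator=True adds a `_merge` column
# that records where each row came from.
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
origin_counts = outer["_merge"].value_counts()

# validate raises if the keys are not unique in the way you expect.
dup = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
try:
    pd.merge(dup, right, on="key", validate="one_to_one")
    raised = False
except ValueError:  # pandas raises MergeError, a ValueError subclass
    raised = True

print(inner)
print(outer)
print(raised)
```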


In [54]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df, s.to_frame().T], ignore_index=True)

Unnamed: 0,A,B,C,D
0,-0.681851,0.58406,-0.06595,-0.523204
1,0.425475,1.398535,0.359583,0.244216
2,2.276143,-1.124914,-0.550189,1.170828
3,0.386888,0.546328,-0.460811,0.464237
4,1.6966,-0.755909,0.211742,-1.383923
5,-0.186347,1.718713,-1.684482,0.321883
6,1.598336,0.059697,1.571537,1.450203
7,-0.440493,0.522463,-0.965012,-1.327299
8,0.386888,0.546328,-0.460811,0.464237


In [55]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
  'foo', 'bar', 'foo', 'foo'],
  'B': ['one', 'one', 'two', 'three',
  'two', 'two', 'one', 'three'],
  'C': np.random.randn(8),
  'D': np.random.randn(8)})

df

Unnamed: 0,A,B,C,D
0,foo,one,0.27432,-0.374614
1,bar,one,0.797963,0.496657
2,foo,two,-0.83296,1.157739
3,bar,three,-0.038679,-0.615842
4,foo,two,0.862664,-1.970775
5,bar,two,-0.715286,-2.535398
6,foo,one,0.458169,-0.985152
7,foo,three,-0.391164,-2.011834


In [56]:
# select the numeric columns explicitly; pandas 2.0+ no longer drops 'B' silently
df.groupby('A')[['C', 'D']].sum()

A,C,D
bar,0.043998,-2.654583
foo,0.371029,-4.184636


In [57]:
df.groupby(['A', 'B']).sum()

A,B,C,D
bar,one,0.797963,0.496657
bar,three,-0.038679,-0.615842
bar,two,-0.715286,-2.535398
foo,one,0.732489,-1.359766
foo,three,-0.391164,-2.011834
foo,two,0.029704,-0.813036
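
Beyond `sum()`, the split-apply-combine pattern covers several aggregations at once and group-wise transformations. A minimal sketch on a made-up frame with the same column names; the output names `c_mean` and `d_max` are arbitrary choices for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": [1.0, 2.0, 3.0, 4.0],
        "D": [10, 20, 30, 40],
    }
)

# Aggregate: one row per group.
sums = df.groupby("A")[["C", "D"]].sum()

# Named aggregation: a different function per column, with explicit names.
summary = df.groupby("A").agg(c_mean=("C", "mean"), d_max=("D", "max"))

# Transform: the result has the same length as the input, so it can be
# combined with the original rows (here: centering C within each group).
centered = df["C"] - df.groupby("A")["C"].transform("mean")

print(sums)
print(summary)
print(centered)
```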


In [58]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
  'foo', 'foo', 'qux', 'qux'],
  ['one', 'two', 'one', 'two',
  'one', 'two', 'one', 'two']]))

tuples

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [59]:
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [60]:
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df

first,second,A,B
bar,one,1.66898,-0.29174
bar,two,0.367659,1.479652
baz,one,-1.894775,-1.76005
baz,two,1.620593,1.437861
foo,one,1.203857,-1.04317
foo,two,-0.96696,-0.349794
qux,one,0.338101,-0.669889
qux,two,-0.374305,0.849433


In [61]:
df2 = df[:4]
df2

first,second,A,B
bar,one,1.66898,-0.29174
bar,two,0.367659,1.479652
baz,one,-1.894775,-1.76005
baz,two,1.620593,1.437861


In [62]:
stacked = df2.stack()
stacked

first  second   
bar    one     A    1.668980
               B   -0.291740
       two     A    0.367659
               B    1.479652
baz    one     A   -1.894775
               B   -1.760050
       two     A    1.620593
               B    1.437861
dtype: float64

In [63]:
stacked.unstack()

first,second,A,B
bar,one,1.66898,-0.29174
bar,two,0.367659,1.479652
baz,one,-1.894775,-1.76005
baz,two,1.620593,1.437861


In [64]:
stacked.unstack(1)

Unnamed: 0_level_0,second,one,two
first,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,A,1.66898,0.367659
bar,B,-0.29174,1.479652
baz,A,-1.894775,1.620593
baz,B,-1.76005,1.437861


In [65]:
stacked.unstack(0)

Unnamed: 0_level_0,first,bar,baz
second,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,1.66898,-1.894775
one,B,-0.29174,-1.76005
two,A,0.367659,1.620593
two,B,1.479652,1.437861
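
The `stack` / `unstack` cells above can be read as moving one level of labels between the column axis and the row index. A small deterministic sketch (integer values chosen for illustration) that checks the round trip and selects the level to unstack by name rather than by number:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}, index=index)

# stack() moves the column labels into a new innermost index level,
# turning the frame into a Series with a three-level index.
stacked = df.stack()

# unstack() with no argument moves the innermost level back to columns.
round_trip = stacked.unstack()

# Any level can be chosen by name; here "second" becomes the columns.
wide = stacked.unstack("second")

print(stacked)
print(wide)
```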


In [66]:
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
  'B': ['A', 'B', 'C'] * 4,
  'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
  'D': np.random.randn(12),
  'E': np.random.randn(12)})

df

Unnamed: 0,A,B,C,D,E
0,one,A,foo,-0.745171,-0.78072
1,one,B,foo,0.556295,-1.229875
2,two,C,foo,0.346838,-0.565596
3,three,A,bar,0.321303,1.010725
4,one,B,bar,-0.575468,0.283687
5,one,C,bar,-0.067011,1.61217
6,two,A,foo,1.050893,0.023881
7,three,B,foo,-0.994939,0.770481
8,one,C,foo,-1.508536,0.083518
9,one,A,bar,-0.270449,0.873954


In [67]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

Unnamed: 0_level_0,C,bar,foo
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,-0.270449,-0.745171
one,B,-0.575468,0.556295
one,C,-0.067011,-1.508536
three,A,0.321303,
three,B,,-0.994939
three,C,0.317121,
two,A,,1.050893
two,B,0.642509,
two,C,,0.346838
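
Two details of `pivot_table` explain the output above: cells are aggregated with the mean by default, and combinations that never occur in the data come out as NaN. A minimal sketch on made-up data, also showing the `aggfunc` and `fill_value` parameters:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "two", "one"],
        "C": ["foo", "bar", "foo", "foo", "foo"],
        "D": [1.0, 2.0, 3.0, 5.0, 3.0],
    }
)

# Default aggregation is the mean of all rows that fall into a cell.
means = pd.pivot_table(df, values="D", index="A", columns="C")

# A different aggregation, with empty cells filled instead of left as NaN.
totals = pd.pivot_table(
    df, values="D", index="A", columns="C", aggfunc="sum", fill_value=0
)

print(means)
print(totals)
```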


In [69]:
rng = pd.date_range('1/1/2012', periods=100, freq='S')
rng

DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01',
               '2012-01-01 00:00:02', '2012-01-01 00:00:03',
               '2012-01-01 00:00:04', '2012-01-01 00:00:05',
               '2012-01-01 00:00:06', '2012-01-01 00:00:07',
               '2012-01-01 00:00:08', '2012-01-01 00:00:09',
               '2012-01-01 00:00:10', '2012-01-01 00:00:11',
               '2012-01-01 00:00:12', '2012-01-01 00:00:13',
               '2012-01-01 00:00:14', '2012-01-01 00:00:15',
               '2012-01-01 00:00:16', '2012-01-01 00:00:17',
               '2012-01-01 00:00:18', '2012-01-01 00:00:19',
               '2012-01-01 00:00:20', '2012-01-01 00:00:21',
               '2012-01-01 00:00:22', '2012-01-01 00:00:23',
               '2012-01-01 00:00:24', '2012-01-01 00:00:25',
               '2012-01-01 00:00:26', '2012-01-01 00:00:27',
               '2012-01-01 00:00:28', '2012-01-01 00:00:29',
               '2012-01-01 00:00:30', '2012-01-01 00:00:31',
               '2012-01-

In [70]:
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts

2012-01-01 00:00:00     77
2012-01-01 00:00:01    177
2012-01-01 00:00:02    443
2012-01-01 00:00:03     39
2012-01-01 00:00:04    102
                      ... 
2012-01-01 00:01:35     91
2012-01-01 00:01:36    175
2012-01-01 00:01:37    442
2012-01-01 00:01:38      3
2012-01-01 00:01:39    118
Freq: S, Length: 100, dtype: int64

In [71]:
ts.resample('5Min').sum()

2012-01-01    24477
Freq: 5T, dtype: int64
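
Resampling groups the observations into fixed time bins and aggregates each bin. The sketch below uses deterministic values instead of random ones so the bin totals can be checked by hand. It passes a `pd.Timedelta` as the frequency, which is an assumption made here to avoid alias differences between pandas versions.

```python
import numpy as np
import pandas as pd

# 100 observations, one per second, with values 0..99.
rng = pd.date_range("2012-01-01", periods=100, freq=pd.Timedelta(seconds=1))
ts = pd.Series(np.arange(100), index=rng)

# One-minute bins: the first holds 60 values, the second the remaining 40.
per_minute = ts.resample("1min").sum()
per_minute_mean = ts.resample("1min").mean()

# A five-minute bin is wide enough to hold everything.
five_minutes = ts.resample("5min").sum()

print(per_minute)
print(five_minutes)
```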

In [73]:
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
rng

DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09',
               '2012-03-10'],
              dtype='datetime64[ns]', freq='D')

In [75]:
ts = pd.Series(np.random.randn(len(rng)), rng)
ts

2012-03-06    0.647335
2012-03-07   -0.094159
2012-03-08    0.047059
2012-03-09    1.545392
2012-03-10   -1.426249
Freq: D, dtype: float64

In [76]:
ts_utc = ts.tz_localize('UTC')
ts_utc

2012-03-06 00:00:00+00:00    0.647335
2012-03-07 00:00:00+00:00   -0.094159
2012-03-08 00:00:00+00:00    0.047059
2012-03-09 00:00:00+00:00    1.545392
2012-03-10 00:00:00+00:00   -1.426249
Freq: D, dtype: float64

In [77]:
ts_utc.tz_convert('US/Eastern')

2012-03-05 19:00:00-05:00    0.647335
2012-03-06 19:00:00-05:00   -0.094159
2012-03-07 19:00:00-05:00    0.047059
2012-03-08 19:00:00-05:00    1.545392
2012-03-09 19:00:00-05:00   -1.426249
Freq: D, dtype: float64
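
`tz_localize` attaches a time zone to naive timestamps, while `tz_convert` re-expresses timestamps that are already time-zone aware in another zone. The instants themselves do not change. A minimal sketch with made-up values; it uses the zone name `America/New_York`, which is assumed here as the canonical name for the `US/Eastern` alias.

```python
import pandas as pd

rng = pd.date_range("2012-03-06", periods=3, freq="D")
ts = pd.Series([1.0, 2.0, 3.0], index=rng)

# Naive -> aware: interpret the existing wall-clock times as UTC.
ts_utc = ts.tz_localize("UTC")

# Aware -> aware: same instants, shown as New York wall-clock time.
ts_ny = ts_utc.tz_convert("America/New_York")

# Comparing aware indexes compares instants, not wall-clock strings.
same_instants = (ts_utc.index == ts_ny.index).all()

print(ts_ny)
print(same_instants)
```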

## Essential basic functionality

https://pandas.pydata.org/pandas-docs/version/0.25/getting_started/basics.html

## Data structures

https://pandas.pydata.org/pandas-docs/version/0.25/getting_started/dsintro.html

## Reference

- loc[], iloc[], at[], iat[]
- head(), tail()
- describe()
- sort_index(axis=N, ascending=True)
- sort_values(by=N)
- to_numpy()
- pd.date_range()
- .str
- groupby()

## Data statistics with pandas

Reproduce the `Populations` exercise from the NumPy section, this time using pandas data structures and functions.