# Getting Started: Series, DataFrame

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. 

## Series 

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. In other words, a Series is a one-dimensional NumPy array in which each element carries a label; together the labels form the __index__.

In [3]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
obj = Series([4, 7, -5, 3]) # Default index runs from 0 to N - 1 (here 0 to 3)
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
obj.values # Both index and values are attributes
obj.index

RangeIndex(start=0, stop=4, step=1)

Compared with a regular NumPy array, you can use values in the index when selecting single values or a set of values:

In [5]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2['a'])
obj2[['a', 'c']]

-5


a   -5
c    3
dtype: int64

NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [8]:
print(obj2[obj2 > 4])
np.exp(obj2)

b    7
dtype: int64


d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. Should you have data contained in a Python dict, you can create a Series from it by passing the dict:

In [9]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
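
To make the dict analogy concrete, here is a minimal sketch that reuses `sdata` from above. The names `has_ohio`, `has_value` and `as_dict` are only illustrative.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)

# Like a dict, membership tests look at the labels (the "keys"), not the values.
has_ohio = 'Ohio' in obj3      # a label
has_value = 35000 in obj3      # a value, not a label

# A Series can be converted back into a plain dict.
as_dict = obj3.to_dict()
print(has_ohio, has_value, as_dict == sdata)
```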

Additionally, you can pass one list for the values and another for the index.

In [12]:
pop = [35000, 71000, 16000, 5000]
state = ['Ohio', 'Texas', 'Oregon', 'Utah']
obj4 = Series(pop, index = state)
obj4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

### Missing Data

The isnull and notnull functions in pandas (also available as Series methods) should be used to detect missing data:

In [15]:
obj4.isnull() #method

Ohio      False
Texas     False
Oregon    False
Utah      False
dtype: bool
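
Since `obj4` has no missing entries, the following sketch builds a Series that does. It assumes a hypothetical label, `'California'`, that is absent from the dict, so pandas fills it with NaN.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
# 'California' is not a key of sdata, so its value comes out as NaN.
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj5 = pd.Series(sdata, index=states)

missing = obj5.isnull()     # True only for 'California'
present = obj5.notnull()    # element-wise negation of isnull
print(obj5[present])        # keep only the entries that have data
```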

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

In [17]:
obj4.name = 'Population'
obj4.index.name = 'State'
obj4

State
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
Name: Population, dtype: int64

A Series’s index can be altered in place by assignment:

In [19]:
obj4.index = ['Radnor', 'McConaughey', 'Pot', 'Lebron James']
obj4

Radnor          35000
McConaughey     71000
Pot             16000
Lebron James     5000
Name: Population, dtype: int64

## DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; __it can be thought of as a dict of Series that all share the same index__.

Thus, one way to create a DataFrame is from a dict of Series:

In [24]:
equipos = Series({'Madrid': 'Real Madrid', 'Barcelona': 'FCB', 'Valencia': 'Valencia'})
reino = Series({'Madrid': 'Reyes Españoles', 'Barcelona': 'Catalunya', 'Valencia': 'Valencia'})
data = {'equipos': equipos, 'reino': reino} # the dict keys become the column names
df = DataFrame(data)
df

Unnamed: 0,equipos,reino
Barcelona,FCB,Catalunya
Madrid,Real Madrid,Reyes Españoles
Valencia,Valencia,Valencia


A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:

In [25]:
print(df['equipos'])
df.reino # attribute access; equivalent to df['reino']

Barcelona            FCB
Madrid       Real Madrid
Valencia        Valencia
Name: equipos, dtype: object


Barcelona          Catalunya
Madrid       Reyes Españoles
Valencia            Valencia
Name: reino, dtype: object

Columns can be modified (and added) by assignment:

In [27]:
df['Deuda'] = [100, -100, 0]
df

Unnamed: 0,equipos,reino,Deuda
Barcelona,FCB,Catalunya,100
Madrid,Real Madrid,Reyes Españoles,-100
Valencia,Valencia,Valencia,0


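When assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series instead, it is conformed to the DataFrame’s index: labels missing from the Series produce NaN, and labels absent from the DataFrame are dropped.

In [32]:
df['Visitado'] = Series({'Barcelona': 0, 'Madrid': 0, 'Valencia': 0, 'NY': 1}) # 'NY' is ignored because it is not in df's index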
df

Unnamed: 0,equipos,reino,Deuda,Visitado
Barcelona,FCB,Catalunya,100,0
Madrid,Real Madrid,Reyes Españoles,-100,0
Valencia,Valencia,Valencia,0,0
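
The two assignment rules can be contrasted in a small sketch. The frame below is a cut-down stand-in for `df`, and the flag `length_error` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'equipos': ['FCB', 'Real Madrid', 'Valencia']},
                  index=['Barcelona', 'Madrid', 'Valencia'])

# A list needs exactly one value per row; otherwise pandas raises ValueError.
try:
    df['Deuda'] = [100, -100]
    length_error = False
except ValueError:
    length_error = True

# A Series is aligned on the index instead:
# 'NY' is dropped, and labels missing from the Series become NaN.
df['Visitado'] = pd.Series({'Madrid': 1, 'NY': 1})
print(df)
```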


To delete a column, do the same as you would delete a key-value in a dict:

In [33]:
del df['Visitado']
df.columns

Index(['equipos', 'reino', 'Deuda'], dtype='object')

### Index Objects

pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index:

In [34]:
obj = Series(range(3), index=['a', 'b', 'c'])
obj.index

Index(['a', 'b', 'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user. Immutability is important so that Index objects can be safely shared among data structures:

In [36]:
try:
    obj.index[1] = 'd'
except TypeError:
    print('Index objects are immutable')

Index objects are immutable


As objects, index objects have their own attributes and their own methods. 
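
As a short illustration of those methods, an Index behaves much like a set of labels. The indexes `idx_a` and `idx_b` below are made up for the example.

```python
import pandas as pd

idx_a = pd.Index(['a', 'b', 'c'])
idx_b = pd.Index(['b', 'c', 'd'])

print('b' in idx_a)                 # membership test on the labels
both = idx_a.intersection(idx_b)    # labels present in both indexes
either = idx_a.union(idx_b)         # labels present in either index
only_a = idx_a.difference(idx_b)    # labels present only in idx_a
print(list(both), list(either), list(only_a))
```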

# Functionality

## Reindexing

When conforming an already existing pandas object to a new index, use .reindex.

In [3]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
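
Instead of leaving NaN in the new positions, `reindex` can also fill them. The sketch below shows two options, forward filling with `method='ffill'` and a constant via `fill_value`. The example data are invented.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill: each new label takes the last known value before it.
filled = colors.reindex(range(6), method='ffill')

# Constant fill: newly introduced labels get the given value.
nums = pd.Series([4.5, 7.2], index=['a', 'b'])
padded = nums.reindex(['a', 'b', 'c'], fill_value=0)
print(filled)
print(padded)
```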

The columns can be reindexed as well, using the columns keyword. Note that reindex returns a new object and leaves the original unchanged:

In [5]:
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [6]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states) # returns a new DataFrame; the result is not stored here
frame # so the original frame is unchanged

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


## Dropping entries from axis

Building a new index without the unwanted entries by hand can require a bit of munging and set logic, so the drop method does it for you: it returns a new object with the indicated value or values deleted from an axis:

In [3]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj.drop('c')

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis:

In [4]:
data = DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [5]:
data.drop(['Colorado', 'Ohio']) # Dropping rows with the index.
data.drop(['two', 'three'], axis = 1) # Dropping columns with their names.

Unnamed: 0,one,four
Ohio,0,3
Colorado,4,7
Utah,8,11
New York,12,15


## Indexing, Selection and Filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. There are three ways to index: (i) by integer position, (ii) with a boolean array, and (iii) by index label.

### Series

In [7]:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj[1]) # integer
print(obj[[True, False, False, False]]) # Boolean
obj['b'] # Index

1.0
a    0.0
dtype: float64


1.0

Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:

In [8]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64
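
The difference is easiest to see side by side. This sketch uses the explicit `.loc` (label) and `.iloc` (position) indexers on the same Series.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']    # label slice: includes both 'b' and 'c'
by_position = obj.iloc[1:2]    # positional slice: half-open, so only 'b'
print(by_label)
print(by_position)
```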

### DataFrames

For the DataFrame object, things work a little differently. Indexing with a column label (or a list of labels) selects columns, whereas slicing with integers or indexing with a boolean array selects rows.

In [9]:
data = DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [12]:
print(data['two'])
data[['three', 'one']]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64


Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [13]:
data[:2] #first two rows

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [14]:
data[data['three'] > 5] # boolean indexing

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


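For DataFrame label-indexing on the rows, I introduce the special indexers loc (by label) and iloc (by integer position). They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation plus axis labels. Older pandas versions offered a mixed indexer, ix, for this purpose; it was deprecated and then removed in pandas 1.0.

In [15]:
data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]] # rows by label, columns by position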

Unnamed: 0,four,one,two
Colorado,7,4,5
Utah,11,8,9


In [16]:
data.loc[:'Utah', 'two'] # .loc replaces the .ix indexer, which was removed in pandas 1.0

Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int64

In [17]:
data.loc[data.three > 5, data.columns[:3]] # boolean row mask, first three columns (.ix was removed in pandas 1.0)

Unnamed: 0,one,two,three
Colorado,4,5,6
Utah,8,9,10
New York,12,13,14


Takeaway: for any fancy indexing, use .loc (by label) or .iloc (by position); the older .ix indexer was removed in pandas 1.0.
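
As a sketch of label-based versus position-based selection in current pandas releases, the same sub-table can be picked out with `.loc` or with `.iloc`. The frame is the `data` example from above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Select by label on both axes...
by_label = data.loc[['Colorado', 'Utah'], ['four', 'one', 'two']]
# ...or by integer position on both axes.
by_position = data.iloc[[1, 2], [3, 0, 1]]
print(by_label.equals(by_position))
```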

## Arithmetic and Data Alignment

One of the most important pandas features is the behavior of arithmetic between objects with different indexes. When adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs.

In [18]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1 + s2 # The internal data alignment introduces NA values in the indices that don’t overlap.

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
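
If NaN is not what you want for the non-overlapping labels, the `add` method accepts a `fill_value`. A minimal sketch with the same two Series:

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

plain = s1 + s2                      # NaN wherever a label is missing on one side
filled = s1.add(s2, fill_value=0)    # the missing side is treated as 0 instead
print(filled)
```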

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [19]:
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1 + df2 # Labels that are not present in both objects yield NA in the result.

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [20]:
df1.add(df2, fill_value=0) # if in one but not in both, take as zero. 

Unnamed: 0,b,c,d,e
Colorado,6.0,7.0,8.0,
Ohio,3.0,1.0,6.0,5.0
Oregon,9.0,,10.0,11.0
Texas,9.0,4.0,12.0,8.0
Utah,0.0,,1.0,2.0


## Function Application and Mapping

NumPy ufuncs (element-wise array methods) work fine with pandas objects. 

In [21]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.486429,0.062054,2.177177
Ohio,0.261101,2.87211,0.551418
Texas,-0.627626,-0.418332,-0.262476
Oregon,0.963589,-1.577445,0.864322


In [22]:
np.abs(frame) # a NumPy ufunc applied element-wise; same result as frame.abs()

Unnamed: 0,b,d,e
Utah,0.486429,0.062054,2.177177
Ohio,0.261101,2.87211,0.551418
Texas,0.627626,0.418332,0.262476
Oregon,0.963589,1.577445,0.864322


Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:

In [24]:
f = lambda x: x.max() - x.min()
frame.apply(f, axis = 0) # per column

b    1.591215
d    4.449555
e    2.439653
dtype: float64

In [25]:
frame.apply(f, axis = 1) # per row

Utah      2.115123
Ohio      2.611008
Texas     0.365151
Oregon    2.541034
dtype: float64

More complicated functions can also be defined:

In [27]:
def f(x):
    '''
    Given a Series (one column or row of the frame), return its min and max.
    '''
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f, axis = 0)

Unnamed: 0,b,d,e
min,-0.627626,-1.577445,-0.262476
max,0.963589,2.87211,2.177177


The analogue of apply for a Series is map, which applies a function to each element:

In [28]:
fmt = lambda x: '%.2f' % x # named fmt to avoid shadowing the built-in format
frame['e'].map(fmt)

Utah       2.18
Ohio       0.55
Texas     -0.26
Oregon     0.86
Name: e, dtype: object
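
Besides a function, `map` also accepts a dict, in which case each element is looked up as a key. The sketch below uses invented data to show both forms.

```python
import pandas as pd

words = pd.Series(['one', 'two', 'three'])

lengths = words.map(len)                       # apply a function to each element
as_number = words.map({'one': 1, 'two': 2})    # dict lookup; unmatched elements become NaN
print(lengths)
print(as_number)
```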

## Sorting and Ranking

Sorting a data set by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:

In [30]:
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index(axis = 0) # sort by index

a    1
b    2
c    3
d    0
dtype: int64

In [37]:
obj.sort_values()

d    0
a    1
b    2
c    3
dtype: int64

In [40]:
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [41]:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7
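
The examples above cover sorting only. As a brief sketch of ranking, and of sorting in descending order, on invented data: by default `rank` assigns tied values the mean of the ranks they span.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

ranks = obj.rank()                         # ties share the average of their ranks
desc = obj.sort_values(ascending=False)    # largest values first
print(ranks)
print(desc)
```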


## Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla NumPy arrays, they are all built from the ground up to exclude missing data.

In [43]:
df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], 
               index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
print(df.sum(axis = 0)) # column sums: reduces over the rows, skipping NaN
df.sum(axis = 1) # row sums: reduces over the columns, skipping NaN

one    9.25
two   -5.80
dtype: float64


a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [44]:
df.mean(axis=1, skipna=False) # mean for each row

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

Some methods, like idxmin and idxmax, return indirect statistics like the index value where the minimum or maximum values are attained:

In [45]:
df.idxmax()

one    b
two    d
dtype: object

In [46]:
df.describe()



Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,,
50%,,
75%,,
max,7.1,-1.3


## Correlation and covariance

Some summary statistics, like correlation and covariance, are computed from pairs of arguments.

In [49]:
import pandas_datareader.data as web
end = '2015-01-01'
start = '2007-01-01'
get_px = lambda x: web.DataReader(x, 'yahoo', start=start, end=end)['Adj Close']
symbols = ['SPY','TLT','MSFT']
# raw adjusted close prices
data = pd.DataFrame({sym:get_px(sym) for sym in symbols})
data.head(5)

Unnamed: 0_level_0,MSFT,SPY,TLT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2007-01-03,23.478417,115.486103,63.65283
2007-01-04,23.439102,115.731178,64.038778
2007-01-05,23.305433,114.808069,63.760039
2007-01-08,23.533456,115.339066,63.874396
2007-01-09,23.557044,115.241041,63.874396


In [51]:
returns = data.pct_change(axis = 0)
returns.tail(5)

Unnamed: 0_level_0,MSFT,SPY,TLT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2014-12-24,-0.006398,9.6e-05,0.005443
2014-12-26,-0.005401,0.003225,0.003711
2014-12-29,-0.008981,0.001343,0.007475
2014-12-30,-0.009062,-0.005366,0.002713
2014-12-31,-0.012123,-0.009923,0.00191


The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance:

In [52]:
returns.MSFT.corr(returns.SPY) # Series

0.70357965962753477

In [53]:
returns.corr() # DataFrame

Unnamed: 0,MSFT,SPY,TLT
MSFT,1.0,0.70358,-0.317389
SPY,0.70358,1.0,-0.461967
TLT,-0.317389,-0.461967,1.0


In [54]:
returns.corrwith(returns.SPY)

MSFT    0.703580
SPY     1.000000
TLT    -0.461967
dtype: float64
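
Because the price download above depends on an external data service, here is an offline sketch of the same methods on a small invented frame. It also shows that rows with missing values are excluded pair by pair.

```python
import numpy as np
import pandas as pd

# Invented data: y = 2 * x on the first four rows, and z decreases as x increases.
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, np.nan],
                   'y': [2.0, 4.0, 6.0, 8.0, 100.0],
                   'z': [4.0, 3.0, 2.0, 1.0, 0.0]})

xy = df['x'].corr(df['y'])     # the last row is ignored because x is NaN there
cov_xy = df['x'].cov(df['y'])  # covariance over the same four overlapping rows
corr_matrix = df.corr()        # all pairwise correlations at once
print(xy, cov_xy)
print(corr_matrix)
```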

## Unique Values, Value Counts, and Membership

In [56]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

In [57]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

The __very important__ isin() method tests each element for membership in a given collection of values. It is analogous to %in% in R.

In [58]:
mask = obj.isin(['b', 'c'])
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
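
The boolean mask returned by `isin` can be negated with `~` to keep everything that is not in the collection. A small sketch on the same data:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

mask = obj.isin(['b', 'c'])
kept = obj[mask]        # elements that are 'b' or 'c'
dropped = obj[~mask]    # everything else
print(dropped)
```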

## Handling Missing Data

### Filtering out missing

## Hierarchical Indexing

Hierarchical indexing is an important feature of pandas enabling you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form.

### Series

In [5]:
data = Series(np.random.randn(10), 
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'], 
              [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data

a  1    0.219128
   2    0.135280
   3    0.890866
b  1    2.129106
   2    1.346062
   3   -0.964070
c  1   -0.382904
   2   -0.877322
d  2    0.061551
   3    0.442131
dtype: float64

In [6]:
data.index # MultiIndex

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

With a hierarchically-indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:

In [7]:
data['b']

1    2.129106
2    1.346062
3   -0.964070
dtype: float64

Selection is even possible in some cases from an “inner” level:

In [8]:
data[:, 2]

a    0.135280
b    1.346062
c   -0.877322
d    0.061551
dtype: float64
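
The same selections can also be written with the explicit `.loc` indexer. The sketch below uses deterministic values (`np.arange`) in place of the random data, so the results are easy to check.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

outer = data.loc['b']          # partial indexing on the outer level
single = data.loc[('b', 2)]    # a full key (outer, inner) gives a scalar
inner = data.loc[:, 2]         # every outer label, inner label 2
print(outer)
print(single)
print(inner)
```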

In [9]:
data.unstack()

Unnamed: 0,1,2,3
a,0.219128,0.13528,0.890866
b,2.129106,1.346062,-0.96407
c,-0.382904,-0.877322,
d,,0.061551,0.442131


In [11]:
data.unstack().stack()

a  1    0.219128
   2    0.135280
   3    0.890866
b  1    2.129106
   2    1.346062
   3   -0.964070
c  1   -0.382904
   2   -0.877322
d  2    0.061551
   3    0.442131
dtype: float64

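Many descriptive and summary statistics on DataFrame and Series can be computed per index level. Older pandas versions exposed this through a level option on the statistic itself (for example data.mean(level='key1')); that option was removed in pandas 2.0, and grouping on the level gives the same result:

In [12]:
data.index.names = ['key1', 'key2']
data.groupby(level='key1').mean()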

key1
a    0.415091
b    0.837032
c   -0.630113
d    0.251841
dtype: float64

It’s not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame’s columns.

In [13]:
frame = DataFrame({'a': range(7), 'b': range(7, 0, -1), 
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
frame2 = frame.set_index(['c', 'd'])
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [14]:
frame2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1


In [60]:
import pandas_datareader.data as web
end = '2015-01-01'
start = '2007-01-01'
get_px = lambda x: web.DataReader(x, 'yahoo', start=start, end=end)['Adj Close']
symbols = ['SPY','TLT','MSFT']
# raw adjusted close prices
data = pd.DataFrame({sym:get_px(sym) for sym in symbols})
data.head(5)

Unnamed: 0_level_0,MSFT,SPY,TLT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2007-01-03,23.478417,115.486103,63.774307
2007-01-04,23.439102,115.731178,64.160992
2007-01-05,23.305433,114.808069,63.88172
2007-01-08,23.533456,115.339066,63.996296
2007-01-09,23.557044,115.241041,63.996296


In [61]:
data = data.reset_index()
data2 = pd.melt(data, id_vars ='Date',var_name = 'Index', value_name = 'Value')
data2.head(5)

Unnamed: 0,Date,Index,Value
0,2007-01-03,MSFT,23.478417
1,2007-01-04,MSFT,23.439102
2,2007-01-05,MSFT,23.305433
3,2007-01-08,MSFT,23.533456
4,2007-01-09,MSFT,23.557044


In [65]:
data2.set_index(['Index', 'Date']).head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,Value
Index,Date,Unnamed: 2_level_1
MSFT,2007-01-03,23.478417
MSFT,2007-01-04,23.439102
MSFT,2007-01-05,23.305433
MSFT,2007-01-08,23.533456
MSFT,2007-01-09,23.557044
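
To see the wide-to-long reshaping without downloading prices, here is a sketch on a tiny invented table. It also assumes you may want to go back to the wide layout, which `pivot` does.

```python
import pandas as pd

# Invented prices; only the shape of the table matters here.
wide = pd.DataFrame({'Date': pd.to_datetime(['2007-01-03', '2007-01-04']),
                     'MSFT': [1.0, 2.0],
                     'SPY': [10.0, 20.0]})

# Wide to long: one row per (Date, Index) pair.
long = pd.melt(wide, id_vars='Date', var_name='Index', value_name='Value')

# Long back to wide: one column per distinct value of 'Index'.
back = long.pivot(index='Date', columns='Index', values='Value')
print(long)
print(back)
```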
