In [9]:
import pandas as pd

In [10]:
from pandas import Series, DataFrame

# 5.1 Introduction to pandas Data Structures

## Series

<b>A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.</b>

In [11]:
obj = pd.Series([4, 7, -5, 3])

In [12]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

<b>The output shows the index on the left and the values on the right. Where an index for the data is not specified, a default is given: the integers 0 through N - 1 (where N is the length of the data).</b>

In [13]:
obj.values

array([ 4,  7, -5,  3])

In [14]:
obj.index # like range(4)

RangeIndex(start=0, stop=4, step=1)

In [15]:
obj2 = pd.Series([4, 7, -5, 3], index =['d', 'b', 'a', 'c'])

In [16]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

<b>The above example shows a Series with a labeled index identifying each data point</b>

In [17]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [18]:
obj2['a']

-5

<b>Below you can see that labels in the index can be used to select, or assign to, a single value or a set of values:</b>

In [19]:
obj2['d'] = 6

In [20]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

In [21]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [22]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

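In [None]:
import numpy as np  # NumPy must be imported before np.exp can be used
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

<b>NumPy functions such as np.exp also preserve the index-value link. NumPy has to be imported first (import numpy as np); otherwise the call fails with a NameError.</b>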

<b>Another way to think of a Series is as a fixed-length, ordered dict, since it maps index values to data values. It can be used in many contexts where you might use a dict:</b>

In [23]:
'b' in obj2

True

In [24]:
'e' in obj2

False
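The dict analogy can be pushed a little further. The following is a minimal sketch, assuming only pandas itself; the Series `s` is a hypothetical stand-in for `obj2`, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for obj2 above
s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Membership tests look at the index labels, not the values
has_b = 'b' in s
has_seven = 7 in s            # 7 is a value, not a label

# .get returns a default instead of raising KeyError for a missing label
missing = s.get('e', 0)

# Converting to a plain dict keeps the label order
as_dict = s.to_dict()
print(has_b, has_seven, missing)
print(as_dict)
```

Note that `in` only ever checks labels, which is the same behavior as `in` on a dict checking keys.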

<b>If you have data in a Python dict, you can create a Series from it by passing the dict:</b>

In [25]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [26]:
obj3 = pd.Series(sdata)

In [27]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

<b>When you pass only a dict, the Series index takes the dict's keys in insertion order, as the output above shows (older pandas releases sorted the keys instead). You can override this by passing the dict keys in the order you want them to appear in the Series:</b>

In [28]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [29]:
obj4 = pd.Series(sdata, index=states)

In [30]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

<b>Example above: </b> The three values found in sdata were placed in the appropriate locations, but no value for California was found, so it appears as NaN (not a number), which pandas uses to mark missing or NA values. Similarly, because Utah was not included in states, it is excluded from the resulting object.

<b>In pandas the isnull & notnull functions are used to detect missing data:</b>

In [31]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [32]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

<b>Series also has these as instance methods:</b>

In [33]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
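A common follow-up is to count or filter the missing entries. This is a small sketch rather than part of the original notes; the Series `s` is a hypothetical copy of `obj4`, and `isna`/`notna` are the newer aliases of `isnull`/`notnull`.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of obj4 above
s = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
              index=['California', 'Ohio', 'Oregon', 'Texas'])

mask = s.isna()                # same result as s.isnull()
n_missing = int(mask.sum())    # True counts as 1 when summed
present = s[s.notna()]         # keep only the entries that have a value

print(n_missing)
print(present)
```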

<b>Useful series feature:</b> automatically aligns by index label in arithmetic operations

In [34]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [35]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [36]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

<b>Above examples: similar to a join operation (to be discussed in ch. 7)</b>
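If the NaN produced for a label that exists on only one side is not wanted, one option is the `add` method with `fill_value`. This is a sketch under the assumption that treating an absent label as 0 is acceptable for the data; the Series `a` and `b` are hypothetical copies of `obj3` and `obj4`.

```python
import numpy as np
import pandas as pd

# Hypothetical copies of obj3 and obj4 above
a = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
b = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
              index=['California', 'Ohio', 'Oregon', 'Texas'])

plain = a + b                      # Utah and California both become NaN
filled = a.add(b, fill_value=0)    # a label missing on one side counts as 0

print(filled)
```

Utah now keeps its value because only one side was missing. California stays NaN: it is absent from `a` and NaN in `b`, and `fill_value` does not fill a position that is missing on both sides.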

<b>Both the Series object itself and its index have a name attribute, which integrates with other key areas of the pandas functionality:</b>

In [37]:
obj4.name = 'population'

In [38]:
obj4.index.name = 'state'

In [39]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
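One place where the names show up is when a Series is turned into a table. The sketch below assumes a hypothetical Series `s` mirroring `obj4`; `reset_index` moves the index into an ordinary column.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of obj4 above
s = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
              index=['California', 'Ohio', 'Oregon', 'Texas'])
s.name = 'population'
s.index.name = 'state'

# The index name and the Series name become the two column names
table = s.reset_index()
print(table.columns.tolist())   # ['state', 'population']
```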

<b>A Series's index can be altered in place by assignment:</b>

In [40]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [41]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [42]:
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
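The new index must have exactly one label per value. The following sketch uses a hypothetical Series `s` to show what happens otherwise, and `rename` as one alternative that relabels selected entries without modifying the original.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])

# Assigning three labels to four values is rejected
try:
    s.index = ['Bob', 'Steve', 'Jeff']
    error = None
except ValueError as exc:
    error = exc

# rename returns a new Series and only touches the labels it is given
renamed = s.rename({0: 'Bob'})

print(type(error).__name__)
print(list(renamed.index))
```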

## DataFrame

<b>A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).</b>

A <b>DataFrame</b> has both a row and column index; it can be thought of as a dict of Series all sharing the same index.

The data is stored as one or more two-dimensional blocks rather than a list, dict or some other collection of one-dimensional arrays.
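The "dict of Series sharing one index" picture can be checked directly. This is an illustrative sketch; the column names `n` and `label` and the index `idx` are made up for the example.

```python
import pandas as pd

idx = ['a', 'b', 'c']
frame = pd.DataFrame({
    'n': pd.Series([1, 2, 3], index=idx),
    'label': pd.Series(['x', 'y', 'z'], index=idx),
})

# Iterating like a dict yields (column name, Series) pairs
shared = []
for name, column in frame.items():
    # every column is a Series carrying the DataFrame's row index
    shared.append(isinstance(column, pd.Series) and column.index.equals(frame.index))

print(list(frame.keys()), shared)
```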

<b>There are many ways to construct a DataFrame; one of the most common is from a dict of equal-length lists or NumPy arrays:</b>

In [43]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002, 2003], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

<b>The resulting DataFrame has its index assigned automatically, as with Series, and the columns appear in the dict's insertion order (older pandas releases placed them in sorted order):</b>

In [44]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


<b>Note: </b>In a Jupyter notebook, pandas DataFrame objects are displayed as a more browser-friendly HTML table.

<b>For large DataFrames, the head method selects only the first five rows:</b>

In [45]:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


<b>If you specify a sequence of columns, the DataFrame's columns will be arranged in that order: </b>

In [46]:
pd.DataFrame(data, columns=['year','state','pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


<b>If you pass a column that isn't contained in the dict, it will appear with missing values in the result:</b>

In [47]:
frame2 = pd.DataFrame(data, columns=['year','state','pop','debt'], index=['one','two','three','two','five','six'])

In [48]:
frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
two    2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


In [49]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

<b>A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:</b>

In [50]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
two      Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [51]:
frame2.year

one      2000
two      2001
three    2002
two      2001
five     2002
six      2003
Name: year, dtype: int64

<b>Note:</b> frame2[column] works for any column name, but frame2.column only works when the column name is a valid Python variable name and does not clash with an existing DataFrame attribute or method.

<b>Note: </b>Returned Series have the same index as the DataFrame and their name attribute has been set appropriately.
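The difference between the two access styles matters for a column called `pop`, because `pop` is also the name of a DataFrame method. The sketch below uses a small hypothetical frame; the `debt ratio` column is invented to show a name that is not a valid identifier.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001],
                      'pop': [1.5, 1.7],
                      'debt ratio': [0.1, 0.2]})

year = frame.year             # works: valid identifier, no name clash
pop_attr = frame.pop          # the DataFrame.pop method, NOT the column
pop_col = frame['pop']        # the column
ratio = frame['debt ratio']   # reachable only with bracket notation

print(callable(pop_attr), type(pop_col).__name__)
```

Bracket notation is therefore the safer default whenever column names come from external data.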

<b>Rows can be retrieved by label with the special loc attribute (use iloc to select by integer position):</b>

In [52]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
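For comparison, here is a short sketch of label-based versus position-based row selection. The frame is a hypothetical three-row example, not `frame2` itself.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002],
                      'state': ['Ohio', 'Ohio', 'Nevada']},
                     index=['one', 'two', 'three'])

by_label = frame.loc['three']    # select the row labeled 'three'
by_position = frame.iloc[2]      # select the third row, whatever its label

# Both return the same row as a Series named after the row label
same = by_label.equals(by_position)
print(same, by_position.name)
```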

<b>Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:</b>

In [53]:
frame2['debt'] = 16.5

In [54]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
two    2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5


In [55]:
import numpy as np  # needed for np.arange
frame2['debt'] = np.arange(6.)

In [56]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
two    2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0


<b>The debt column now runs from 0.0 to 5.0, one value per row.</b>

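<b>When you assign a list or array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are realigned exactly to the DataFrame's index, and missing values are inserted wherever a label has no match. Below, 'two' occurs twice in frame2's index, so both rows receive -1.2, while 'four' is not in the index and is dropped:</b>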

In [57]:
val = pd.Series([-1.2, -1.5, -1.7], index =['two', 'four', 'five'])

In [58]:
frame2['debt'] = val

In [59]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
two    2001  Nevada  2.4  -1.2
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
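The two assignment rules can be contrasted side by side. This is a minimal sketch on a hypothetical three-row frame: a list of the wrong length is rejected, while a Series with partly non-matching labels is aligned.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

# A list is matched by position, so its length must equal the row count
try:
    frame['debt'] = [1.0, 2.0]
    error = None
except ValueError as exc:
    error = exc

# A Series is matched by label: 'four' is ignored, unmatched rows get NaN
frame['debt'] = pd.Series([1.0, 2.0], index=['two', 'four'])

print(type(error).__name__)
print(frame)
```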


<b>Assigning a column that doesn't exist will create a new column. The del keyword will delete columns as with a dict:</b>

In [60]:
frame2['eastern'] = frame2.state == 'Ohio'

In [61]:
frame2

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
two    2001  Nevada  2.4  -1.2    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False


<b>To delete the column we added above, use the del keyword:</b>

In [62]:
del frame2['eastern']

In [63]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')
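`del` modifies the DataFrame in place. When the original should stay intact, `drop` is one alternative that returns a new object instead. The frame below is a hypothetical two-row example.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
frame['eastern'] = frame['state'] == 'Ohio'

# drop returns a new DataFrame; frame itself still has the column
trimmed = frame.drop(columns=['eastern'])
before = list(frame.columns)

# del removes the column from frame itself
del frame['eastern']
after = list(frame.columns)

print(before, after, list(trimmed.columns))
```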

<b>Another common form of data is a nested dict of dicts:</b>

In [64]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

<b>If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices:</b>

In [65]:
frame3 = pd.DataFrame(pop)

In [66]:
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


<b>You can transpose the DataFrame (swap rows & columns):</b>

In [67]:
frame3.T

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


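<b>The keys in the inner dicts are combined to form the index in the result. In the output above they appear in order of first appearance (2001, 2002, 2000) rather than sorted; older pandas releases sorted them.
None of this applies if an explicit index is specified:</b>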

In [68]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
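When the row order should not depend on how the inner dicts happen to be combined, sorting explicitly is a simple way to make it deterministic. A minimal sketch, reusing the same nested dict:

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# sort_index returns a new DataFrame with rows ordered by label
frame = pd.DataFrame(pop).sort_index()
print(frame)
```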


<b>Dicts of Series are treated similarly:</b>

In [69]:
pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}

In [70]:
pd.DataFrame(pdata)

      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9


<b>If a DataFrame's index and columns have their name attributes set, these will also be displayed:</b>

In [71]:
frame3.index.name = 'year'; frame3.columns.name = 'state'

In [72]:
frame3

state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


<b>As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray:</b>

In [73]:
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

<b>If the DataFrame's columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns:</b>

In [74]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.2],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
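The dtype chosen for the combined array can be inspected directly. The three small frames below are hypothetical and exist only to show the common cases.

```python
import pandas as pd

all_floats = pd.DataFrame({'Nevada': [2.4, 2.9], 'Ohio': [1.7, 3.6]})
ints_and_floats = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]})
numbers_and_text = pd.DataFrame({'year': [2000, 2001], 'state': ['Ohio', 'Nevada']})

# .values picks one dtype that can hold every column
float_name = all_floats.values.dtype.name         # float64
upcast_name = ints_and_floats.values.dtype.name   # ints are upcast to float64
object_name = numbers_and_text.values.dtype.name  # falls back to object

print(float_name, upcast_name, object_name)
```

The fallback to `object` is worth watching for, since numerical code on an object array is much slower than on a float array.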

## Index Objects

pandas’s <b>Index objects</b> are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

In [75]:
obj = pd.Series(range(3), index=['a','b','c'])

In [76]:
index = obj.index

In [77]:
index

Index(['a', 'b', 'c'], dtype='object')

In [78]:
index[1:]

Index(['b', 'c'], dtype='object')

<b>Index objects are immutable and thus can't be modified by the user:</b>

In [79]:
index[1] = 'd'

TypeError: Index does not support mutable operations
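Because an Index cannot be changed in place, a "modification" means building a new Index. The sketch below shows one way to do that with `delete` and `insert`, both of which return new objects; the variable names are made up for the example.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# In-place assignment is rejected
try:
    index[1] = 'd'
    error = None
except TypeError as exc:
    error = exc

# Build a new Index with position 1 replaced; the original is unchanged
new_index = index.delete(1).insert(1, 'd')

print(list(index), list(new_index))
```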

<b>Immutability makes it safer to share Index objects among data structures:</b>

In [80]:
import numpy as np  # needed for np.arange
labels = pd.Index(np.arange(3))

In [81]:
labels

Index([0, 1, 2], dtype='int64')

The exact repr depends on the pandas version and platform; older pandas releases print Int64Index([0, 1, 2], dtype='int64').

In [82]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)

In [83]:
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [84]:
obj2.index is labels

True

<b>In addition to being array-like, an Index also behaves like a fixed-size set:</b>

In [85]:
frame3

state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


In [86]:
frame3.columns

Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [87]:
'Ohio' in frame3.columns

True

In [88]:
2003 in frame3.index

False
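Beyond membership tests, an Index also offers the usual set operations as methods. A brief sketch with two hypothetical indexes:

```python
import pandas as pd

left = pd.Index(['Nevada', 'Ohio'])
right = pd.Index(['Ohio', 'Texas'])

both = left.intersection(right)     # labels present in both
either = left.union(right)          # labels present in at least one
only_left = left.difference(right)  # labels in left but not in right

print(list(both), list(either), list(only_left))
```

Each method returns a new Index, which fits with Index objects being immutable.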

<b>Unlike Python sets, a pandas Index can contain duplicate labels:</b>

In [89]:
dup_labels = pd.Index(['foo','foo','bar','bar'])

In [90]:
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
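Duplicate labels change what a selection returns: selecting a duplicated label gives back every occurrence. The Series `s` below is a hypothetical example built on the same labels.

```python
import pandas as pd

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
s = pd.Series([1, 2, 3, 4], index=dup_labels)

unique = dup_labels.is_unique      # False, because labels repeat
foo = s['foo']                     # a Series with both 'foo' entries
distinct = dup_labels.unique()     # a new Index with the duplicates removed

print(unique, foo.tolist(), list(distinct))
```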

# 5.2 Essential Functionality