# Getting Started with pandas

### From the *Python for Data Analysis* book

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.
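A minimal sketch of that difference, using made-up data (the names `arr` and `df` and the column labels are illustrative, not from the book): a NumPy array carries a single dtype for all of its elements, while each DataFrame column keeps its own.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one dtype for every element.
arr = np.array([[1, 2.5], [3, 4.5]])
print(arr.dtype)  # float64: the integers were upcast to match the floats

# A DataFrame stores each column separately, so dtypes can differ per column.
df = pd.DataFrame({"count": [1, 3], "label": ["a", "b"], "ratio": [2.5, 4.5]})
print(df.dtypes)
```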

In [18]:
import pandas as pd
import numpy as np

In [3]:
from pandas import Series, DataFrame


## Series
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.

In [8]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series shows the index on the left and the values on the right.

In [9]:
obj.values

array([ 4,  7, -5,  3])

In [11]:
obj.index  # the default index behaves like range(4)

RangeIndex(start=0, stop=4, step=1)

In [12]:
# often it is desirable to create a Series with an index that identifies each data point with a label

obj2 = pd.Series([4,7,-5,3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [13]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:

In [14]:
obj2['a']

-5

In [15]:
obj2[['c','a','d']]

c    3
a   -5
d    4
dtype: int64

In [16]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

In [17]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

In [19]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, since it is a mapping of index values to data values. It can be used in many contexts where you might use a dict:

In [20]:
'b' in obj2

True

In [21]:
'e' in obj2

False

If you have data in a Python dict, you can create a Series from it by passing the dict:

In [22]:
sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000} # the dict

In [23]:
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you pass only a dict, the index of the resulting Series follows the dict's key order (insertion order in recent pandas versions, as in the output above; older versions sorted the keys). You can override this by passing an index with the keys in the order you want them to appear in the resulting Series:

In [27]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [28]:
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

- Since no value exists for 'California', it appears as NaN (not a number), which pandas uses to mark a missing or NA value.
- Since 'Utah' was not included in 'states' it is excluded from the resulting object.  


- The ```isnull``` and ```notnull``` functions in pandas should be used to detect missing data.

In [29]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [30]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [32]:
obj4.isnull() # Series also has these as instance methods

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
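Once missing values have been detected, a common next step is to count, drop, or fill them. The following is a small sketch under the assumption that `obj4` is rebuilt as above; the variable names `n_missing`, `without_na`, and `filled` are illustrative.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Booleans sum as 0/1, so this counts the missing entries.
n_missing = obj4.isnull().sum()

# Both calls return a new Series; obj4 itself is left unchanged.
without_na = obj4.dropna()   # missing entries removed
filled = obj4.fillna(0)      # missing entries replaced by 0

print(n_missing)
print(without_na)
print(filled)
```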

A useful Series feature is that it automatically aligns by index label in arithmetic operations:

In [33]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [34]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [35]:
obj3 + obj4 # you can think of this as being similar to a join operation

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
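If a label present in only one operand should not turn into NaN, the `add` method with its `fill_value` argument is one option. This is a sketch rather than part of the original notes; `total` is an illustrative name.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# fill_value=0 substitutes 0 for a value missing on ONE side before adding.
# 'Utah' exists only in obj3, so it keeps its value instead of becoming NaN.
# 'California' has no value on either side, so it stays NaN.
total = obj3.add(obj4, fill_value=0)
print(total)
```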

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

In [36]:
obj4.name = 'population'

In [37]:
obj4.index.name = 'state'

In [38]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series index can be altered in-place by assignment:

In [39]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [40]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [41]:
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

## DataFrame

- A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type.
- The DataFrame has both a row and a column index; it can be thought of as a dict of Series all sharing the same index.
- Under the hood, the data is stored as one or more two-dimensional blocks.
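The "dict of Series sharing one index" view can be sketched as follows. The Series `a` and `b` and their labels are made up for illustration; the point is that the columns end up on a single shared row index.

```python
import pandas as pd

a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# The row index is the union of both Series' labels; gaps are filled with NaN.
df = pd.DataFrame({"a": a, "b": b})
print(df)

# Each column comes back as a Series carrying the DataFrame's row index.
col = df["a"]
print(type(col).__name__)
print(col.index.equals(df.index))
```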


There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays.

In [42]:
data = {'state':['Ohio','Ohio','Ohio', 'Nevada','Nevada','Nevada',], 
        'year':[2000,2001,2002,2001,2002,2003],
        'pop':[1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [43]:
frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [44]:
# For large DataFrames, the .head() method selects only the first five rows
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [45]:
# if you specify a sequence of columns, the DataFrame will be arranged in that order

pd.DataFrame(data, columns=['year','state','pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


In [49]:
# if you pass a column that is not in the dict, it will appear with missing values
frame2 = pd.DataFrame(data, columns=['year','state','pop', 'dept'], index=['one','two','three','four','five','six'])
frame2

       year   state  pop dept
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


In [50]:
# retrieve the column labels
frame2.columns

Index(['year', 'state', 'pop', 'dept'], dtype='object')

In [53]:
# retrieve a single column as a Series, using dict-like notation
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [56]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [57]:
# rows can be retrieved by label with the special loc attribute (or by integer position with iloc)

frame2.loc['three']

year     2002
state    Ohio
pop       3.6
dept      NaN
Name: three, dtype: object
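To make the label-versus-position distinction concrete, here is a short sketch on a three-row frame. The frame is a cut-down stand-in for `frame2`, and the names `by_label`, `by_position`, and `value` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {"year": [2000, 2001, 2002],
     "state": ["Ohio", "Ohio", "Ohio"],
     "pop": [1.5, 1.7, 3.6]},
    index=["one", "two", "three"],
)

by_label = frame.loc["three"]    # select the row whose index label is 'three'
by_position = frame.iloc[2]      # select the third row, counting from 0

# loc also accepts a (row label, column label) pair for a single value.
value = frame.loc["two", "pop"]

print(by_label.equals(by_position))
print(value)
```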

In [58]:
# columns can be modified by assignment
frame2['debt'] = 16.5
frame2

       year   state  pop dept  debt
one    2000    Ohio  1.5  NaN  16.5
two    2001    Ohio  1.7  NaN  16.5
three  2002    Ohio  3.6  NaN  16.5
four   2001  Nevada  2.4  NaN  16.5
five   2002  Nevada  2.9  NaN  16.5
six    2003  Nevada  3.2  NaN  16.5


In [59]:
frame2['debt'] = np.arange(6.)
frame2

       year   state  pop dept  debt
one    2000    Ohio  1.5  NaN   0.0
two    2001    Ohio  1.7  NaN   1.0
three  2002    Ohio  3.6  NaN   2.0
four   2001  Nevada  2.4  NaN   3.0
five   2002  Nevada  2.9  NaN   4.0
six    2003  Nevada  3.2  NaN   5.0


In [60]:
# when assigning lists or arrays to a column, the length must match the length of the DataFrame.
# If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes.

val = pd.Series([-1.2,-1.5,-1.7], index=['two', 'four', 'five'])


In [61]:
frame2['debt'] = val
frame2

       year   state  pop dept  debt
one    2000    Ohio  1.5  NaN   NaN
two    2001    Ohio  1.7  NaN  -1.2
three  2002    Ohio  3.6  NaN   NaN
four   2001  Nevada  2.4  NaN  -1.5
five   2002  Nevada  2.9  NaN  -1.7
six    2003  Nevada  3.2  NaN   NaN


In [62]:
# the del keyword deletes a column, as it would a key in a dict
del frame2['dept']
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
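Note that `del` modifies the DataFrame in place. When the original should stay intact, the `drop` method is an alternative that returns a new object. A small sketch with made-up data (`trimmed` and `in_place` are illustrative names):

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7], "debt": [0.0, 1.0]})

# drop returns a new DataFrame without the column; frame still has it.
trimmed = frame.drop(columns=["debt"])
print(list(trimmed.columns))
print(list(frame.columns))

# del removes the column from the object it is applied to.
in_place = frame.copy()
del in_place["debt"]
print(list(in_place.columns))
```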


Another common form of data is a nested dict of dicts: 

In [63]:
pop = {'Nevada':{2001: 2.4, 2002: 2.9},'Ohio':{2000: 1.5, 2001: 1.7, 2002: 3.6}}

> If the nested dict is passed to the DataFrame constructor,
> pandas will ***interpret the outer dict keys as the columns and the inner keys as the row indices***

In [64]:
frame3 = pd.DataFrame(pop)
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5
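The row labels above come from combining the inner dict keys. If you pass an explicit `index`, pandas uses exactly those labels instead. The sketch below reuses the same `pop` dict; `frame_explicit` is an illustrative name.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Only the requested row labels are kept, in the requested order.
# 2000 is dropped because it is not listed; 2003 has no data, so it is NaN.
frame_explicit = pd.DataFrame(pop, index=[2001, 2002, 2003])
print(frame_explicit)
```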


In [65]:
# you can transpose the DataFrame (swap rows and columns) with syntax similar to a NumPy array
frame3.T

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


In [66]:
# you can set the name attributes of the index and the columns like this

frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


In [67]:
# As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [68]:
# If the Dataframe's columns are different dtypes, the dtype of the values array 
# will be chosen to accommodate all of the columns

frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
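The dtype rule can be checked directly. The sketch below uses `to_numpy()`, a DataFrame method that likewise returns the data as an ndarray, on a small made-up frame; `mixed` and `numeric` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001],
                      "state": ["Ohio", "Nevada"],
                      "pop": [1.5, 2.4]})

# Strings and numbers together only fit in a generic object array.
mixed = frame.to_numpy()
print(mixed.dtype)

# Integer and float columns alone can share a float64 array.
numeric = frame[["year", "pop"]].to_numpy()
print(numeric.dtype)
```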

### Possible data inputs to DataFrame constructor

| Type | Notes |
|---|---|
| 2D ndarray | A matrix of data, passing optional row and column labels |
| dict of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length |
| NumPy structured/record array | Treated as the "dict of arrays" case |
| dict of Series | Each value becomes a column; indexes from each Series are unioned together to form the result's row index if no explicit index is passed |
| dict of dicts | Each inner dict becomes a column; keys are unioned to form the row index as in the "dict of Series" case |
| List of dicts or Series | Each item becomes a row in the DataFrame; the union of dict keys or Series indexes becomes the DataFrame's column labels |
| List of lists or tuples | Treated as the "2D ndarray" case |
| Another DataFrame | The DataFrame's indexes are used unless different ones are passed |
| NumPy MaskedArray | Like the "2D ndarray" case, except masked values become NA/missing in the DataFrame result |
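Two of the input types from the table, sketched with made-up data (`rows`, `from_rows`, and `from_array` are illustrative names): a list of dicts, where each dict becomes a row, and a 2D ndarray with explicit labels.

```python
import numpy as np
import pandas as pd

# List of dicts: each dict is one row; the column labels are the union of keys.
rows = [{"state": "Ohio", "pop": 1.5}, {"state": "Nevada", "year": 2001}]
from_rows = pd.DataFrame(rows)
print(from_rows)  # keys absent from a dict show up as NaN in that row

# 2D ndarray: a matrix of data with optional row and column labels.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=["r1", "r2"],
                          columns=["a", "b", "c"])
print(from_array)
```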

### Index Objects

- Index objects are responsible for holding the axis labels and other metadata (such as the axis name or names)
- Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index
- Index objects are immutable and thus cannot be modified by the user
- An Index also behaves like a fixed-size set

In [90]:
obj = pd.Series(range(3), index=['a','b','c'])

In [91]:
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [92]:
index[1:]

Index(['b', 'c'], dtype='object')

In [93]:
# Index objects are immutable, so assigning to an element
# raises a TypeError
index[1] = 'd'

TypeError: Index does not support mutable operations
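Because an Index cannot be changed in place, operations that "modify" it return a new Index instead. The sketch below also illustrates the set-like behavior mentioned earlier; all variable names are illustrative, and the duplicate-label example shows one way an Index differs from a Python set.

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])
other = pd.Index(["b", "c", "d"])

# Set-like membership test and set operations.
print("a" in index)
combined = index.union(other)
common = index.intersection(other)
print(list(combined))
print(list(common))

# insert returns a NEW Index; the original is left untouched.
new_index = index.insert(1, "z")
print(list(new_index))
print(list(index))

# Unlike a Python set, an Index may hold duplicate labels.
dup = pd.Index(["foo", "foo", "bar"])
print(dup.is_unique)
```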

In [94]:
# immutability makes it safer to share Index objects among data structures
labels = pd.Index(np.arange(3))