# Python Data Science Handbook

## Chapter 13: Introducing Pandas Objects

3 fundamental Pandas data structures

- Series
- DataFrame
- Index

### Pandas Series Object

one-dimensional array of indexed data

In [1]:
import pandas as pd

pd_series = pd.Series([1,3,5])
pd_series

0    1
1    3
2    5
dtype: int64

combines sequence of values with an explicit sequence of indices

In [17]:
pd_series.values

array([1, 3, 5], dtype=int64)

In [2]:
pd_series.keys()

RangeIndex(start=0, stop=3, step=1)

In [4]:
list(pd_series.items())

[(0, 1), (1, 3), (2, 5)]

In [18]:
pd_series.index

RangeIndex(start=0, stop=3, step=1)

access data via index with square bracket notation []

In [19]:
pd_series[1]

3

In [22]:
pd_series[0:2]

0    1
1    3
dtype: int64

can specify a custom index with Series (a difference from np.array, whose integer index is always implicit)

In [23]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                         index=['a', 'b', 'c', 'd'])

data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [24]:
data['a']

0.25

very similar to a Python dictionary: a dictionary maps keys to arbitrary values, while a Series maps typed keys to a set of typed values
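A minimal sketch of the dictionary analogy, using a made-up `scores` dict (not from the notes): the keys become the index, and the values share a single dtype.

```python
import pandas as pd

# Hypothetical example data: a plain dict mapping labels to values
scores = {'a': 0.25, 'b': 0.5, 'c': 0.75}

# Keys become the index; values share one dtype (float64 here)
s = pd.Series(scores)

print(s['b'])      # dictionary-style lookup by key
print('a' in s)    # membership tests check the index, like dict keys
print(s['a':'b'])  # unlike a dict, a Series also supports label slicing
```

The label slice includes its endpoint, which is one place where a Series behaves differently from both a dict and a Python list.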

to create from scratch

- index is optional

_pd.Series(data, index=index)_

In [25]:
pd.Series([2,4,6])

0    2
1    4
2    6
dtype: int64
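Besides a list, `data` can be a scalar or a dictionary. A small sketch with illustrative values (the names `filled` and `subset` are assumptions for this example):

```python
import pandas as pd

# A scalar is repeated to fill the given index
filled = pd.Series(5, index=[100, 200, 300])

# With dict data, passing index keeps only the listed keys, in that order
subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])

print(filled)
print(subset)
```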

### Pandas DataFrame Object

like a Series, a DataFrame has an index attribute that gives access to the row labels

_df.index_

but since it has more than 1 dimension, there is also a columns attribute, an Index object holding the column labels ~ in dictionary terms, where a key maps to a value, a column name maps to a Series of column data

_df.columns_
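A short sketch of both attributes on a toy DataFrame (the column names `x`, `y` and row labels `r1`..`r3` are made up for illustration):

```python
import pandas as pd

# Toy frame with explicit row labels
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]},
                  index=['r1', 'r2', 'r3'])

print(df.index)    # row labels
print(df.columns)  # column labels, also an Index object

# A column name maps to a Series that shares the frame's row index
col = df['x']
print(col)
```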

Building Dataframe Objects

_pd.DataFrame(data, columns=[], index=[])_

args

index - refers to row level index
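Two common ways to build a DataFrame, sketched with placeholder data (the labels `foo`, `bar`, `a`, `b`, `c` are illustrative only):

```python
import numpy as np
import pandas as pd

# From a 2-D NumPy array, labelling columns and rows explicitly
arr = np.arange(6).reshape(3, 2)
from_array = pd.DataFrame(arr, columns=['foo', 'bar'], index=['a', 'b', 'c'])

# From a list of dicts: keys become columns, missing keys become NaN
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

print(from_array)
print(from_records)
```

When `index` is omitted, as in the second call, pandas falls back to the default integer row index.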

### Pandas Index Object

## Chapter 14: Data Indexing and Selection

### Data Selection in Series

when a Series has an explicit integer index, an indexing operation such as data[5] uses the explicit index

In [10]:
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5]) # explicit integer index

data[5]

'c'

slicing will use the implicit Python-style indices (default integer positions)

In [13]:
data[1:3]

3    b
5    c
dtype: object

loc attribute allows indexing and slicing that always references the explicit index (the one we define in the pandas Series)

In [14]:
data.loc[1]

'a'

iloc attribute allows indexing and slicing that always references the implicit Python-style index

In [15]:
data.iloc[1]

'b'
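The difference is easiest to see when slicing. A small sketch reusing the same Series (the names `by_label` and `by_position` are just for this example):

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[1:3]      # labels 1 through 3, endpoint included
by_position = data.iloc[1:3]  # positions 1 and 2, endpoint excluded

print(by_label)
print(by_position)
```

Being explicit with loc or iloc avoids having to remember which convention plain `data[...]` applies for a given integer index.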

### Data Selection in DataFrames

dataframe as a dictionary

(e.g. explicit index in pd.Series = key in dictionary), so really a DataFrame is several Series sharing the same index

In [16]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                           'Florida': 170312, 'New York': 141297,
                           'Pennsylvania': 119280})
pop = pd.Series({'California': 39538223, 'Texas': 29145505,
                          'Florida': 21538187, 'New York': 20201249,
                          'Pennsylvania': 13002700})
data = pd.DataFrame({'area':area, 'pop':pop})

data

                area       pop
California    423967  39538223
Texas         695662  29145505
Florida       170312  21538187
New York      141297  20201249
Pennsylvania  119280  13002700


In [40]:
data[data['area'] < 200000]

                area       pop
Florida       170312  21538187
New York      141297  20201249
Pennsylvania  119280  13002700
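A boolean mask can also be combined with loc to pick columns at the same time, and masks can be combined with `&`. A sketch on the same data, rebuilt here so the block runs on its own (`mask`, `small_pop` and `small_and_populous` are names chosen for this example):

```python
import pandas as pd

states = ['California', 'Texas', 'Florida', 'New York', 'Pennsylvania']
data = pd.DataFrame(
    {'area': [423967, 695662, 170312, 141297, 119280],
     'pop': [39538223, 29145505, 21538187, 20201249, 13002700]},
    index=states)

# The mask is a boolean Series aligned with the row index
mask = data['area'] < 200000

# Rows where the mask is True, restricted to the 'pop' column
small_pop = data.loc[mask, 'pop']

# Combine conditions with & and wrap each one in parentheses
small_and_populous = data[(data['area'] < 200000) & (data['pop'] > 20_000_000)]

print(small_pop)
print(small_and_populous)
```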


In [17]:
# access columns or series that make up the columns

data['area']

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64

In [18]:
# can also use attribute-style access with column names that are strings

data.area

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64
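Attribute-style access has a pitfall worth a quick check: it fails silently when a column name clashes with a DataFrame method. In this data, `pop` is such a name. A minimal sketch on a two-row subset:

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [39538223, 29145505]},
                    index=['California', 'Texas'])

# 'area' does not clash with any method, so both forms give the same column
print(data.area.equals(data['area']))

# 'pop' clashes with DataFrame.pop, so data.pop is the method, not the column
print(callable(data.pop))
print(type(data['pop']))
```

For this reason, dictionary-style access (`data['pop']`) is the safer default, especially for assignment.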

dataframe as two-dimensional array

In [20]:
# see raw underlying data with values attribute

data.values

array([[  423967, 39538223],
       [  695662, 29145505],
       [  170312, 21538187],
       [  141297, 20201249],
       [  119280, 13002700]], dtype=int64)

passing a single index to an array accesses a row

In [26]:
data.values[0]

array([  423967, 39538223], dtype=int64)

passing a single index to a Dataframe accesses a column

In [28]:
data['area']

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64
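To get a row from the DataFrame itself rather than from `data.values`, loc or iloc can be used. A sketch on a two-row subset of the same data (`row` and `row_pos` are names chosen here):

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [39538223, 29145505]},
                    index=['California', 'Texas'])

row = data.loc['California']  # row by label, returned as a Series
row_pos = data.iloc[0]        # same row by integer position

print(row)
```

Unlike `data.values[0]`, the result keeps the column names as its index.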

use implicit index

In [31]:
data.iloc[:3,:2]

              area       pop
California  423967  39538223
Texas       695662  29145505
Florida     170312  21538187


use explicit index

In [32]:
data.loc[:'Florida',:'pop']

              area       pop
California  423967  39538223
Texas       695662  29145505
Florida     170312  21538187
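The two selections above return the same rows and columns because a loc slice includes its endpoint while an iloc slice excludes it. A sketch that checks this on the same data, rebuilt here so the block is self-contained:

```python
import pandas as pd

states = ['California', 'Texas', 'Florida', 'New York', 'Pennsylvania']
data = pd.DataFrame(
    {'area': [423967, 695662, 170312, 141297, 119280],
     'pop': [39538223, 29145505, 21538187, 20201249, 13002700]},
    index=states)

by_position = data.iloc[:3, :2]           # rows 0-2, columns 0-1 (end excluded)
by_label = data.loc[:'Florida', :'pop']   # up to and including 'Florida' and 'pop'

print(by_position.equals(by_label))
```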


## Chapter 16: Handling Missing Data