### Pandas introduction

NumPy provides the ndarray data structure, which is well suited to the clean,
well-organized data seen in numerical computing tasks.

Pandas is a newer package built on top of NumPy.

##### DataFrames
 - multidimensional arrays with attached row and column labels
 - 2-dimensional
 - often with heterogeneous data types and/or missing data

##### Series
 - 1-dimensional

In [2]:
import pandas as pd

#this import convention is used throughout the remainder of this book

#### Introducing Pandas Objects

 - enhanced versions of NumPy structured arrays
 - rows and columns are identified with labels
 
##### Data structures:
 - Series
 - DataFrame
 - Index

In [3]:
import numpy as np

#also need to use NumPy

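#### The Pandas Series Object

A Series is a one-dimensional array of indexed data, more general and flexible than a one-dimensional NumPy array.
The essential difference is the presence of an index.

A NumPy array has an implicitly defined integer index used to access the values.
A Pandas Series has an explicitly defined index associated with the values.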

In [9]:
# 1-dimensional array of indexed data
# created from a list or array

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

#Series wraps both a sequence of values and a sequence of indices

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [5]:
#access the values and index attributes of the Series object

data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [6]:
data.index
#gives index and values information

RangeIndex(start=0, stop=4, step=1)

In [7]:
#like a Python list or NumPy array,
# data can be accessed by the associated index with [ ]

data[1]

0.5

In [8]:
data[1:3]

#when slicing by position, the final index is excluded

1    0.50
2    0.75
dtype: float64

In [10]:
#index assigned to Series can be strings
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                index= ['a', 'b', 'c', 'd'])

data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [11]:
#item access
data['b']

#indexing with a label (key) returns the associated value

0.5

In [13]:
#can use noncontiguous or nonsequential indices:

data= pd.Series([0.25, 0.5, 0.75, 1.0],
               index=[2,5,3,7])

data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [14]:
data[3]

0.75

In [19]:
data[7]

1.0

In [15]:
data.iloc[3]

#iloc uses the implicit (positional) index, so this returns the fourth element

1.0

##### Series as a specialized dictionary
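 - a Pandas Series is a specialization of a Python dictionary
 - a dictionary maps arbitrary keys to arbitrary values; a Series maps typed keys to typed values
 - the type information makes a Series more efficient than a Python dict for certain operations
 
 - a Series can be constructed directly from a dict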

In [20]:
population_dict ={'California': 38332521,
                 'Texas': 26448193,
                 'New York': 19651127,
                 'Florida': 19552860,
                 'Illinois': 12882135}

#convert Python Dict to Series using pd.Series
population = pd.Series(population_dict)

population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [21]:
#dictionary-style access
population['Texas']

26448193

In [22]:
# UNLIKE a dictionary, a Series supports array-style slicing

population['California': 'Florida']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64
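The efficiency point above can be illustrated with a small sketch. It assumes a three-state subset of the population figures; the names `millions_dict` and `millions` are illustrative and not part of the original notes.

```python
import pandas as pd

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127}
population = pd.Series(population_dict)

# A dict needs an explicit loop to transform every value ...
millions_dict = {state: count / 1e6
                 for state, count in population_dict.items()}

# ... while a Series applies the operation to all values at once
# and keeps the labels attached to the results.
millions = population / 1e6

print(population.dtype)   # one dtype shared by all values
print(millions)
```

Because every value shares one dtype, the division runs as a single array operation instead of a Python-level loop.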

#### Constructing Series Objects

###### Construct a Pandas Series from Scratch

pd.Series(data, index=index)

##### data:
- can be one of many entities
- can be a list or NumPy array

##### index:
- is an optional argument
- defaults to the integer sequence 0, 1, 2, ...
- can be set explicitly, e.g. to non-sequential numbers or strings

In [23]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [24]:
#data can be a scalar which is repeated
# to fill a specified index

pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [26]:
#data can be a dictionary and the keys become the index

pd.Series({2:'a', 1:'b', 3:'c'}, index=[3,2])

#the index argument lists only the keys 3 and 2
# the Series is populated only with the explicitly listed keys

3    c
2    a
dtype: object
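A related case is an explicit index that names a key the dictionary does not contain. The sketch below is a hypothetical variation of the cell above (the extra label `4` is an assumption added for illustration); the missing key is expected to come back as NaN rather than raise an error.

```python
import pandas as pd

# Hypothetical variation: label 4 is not a key of the dict
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])

print(s)          # label 4 has no value in the dict
print(s.isna())   # marks which labels ended up without data
```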

#### The Pandas DataFrame Object

##### Series
 - Analog of 1-dimensional array
 - have flexible indices
 
##### DataFrame
 - Analog of 2-dimensional array
 - has flexible row indices and flexible column names
 - a sequence of aligned Series objects
 - 'aligned' by sharing the same index

In [27]:
# create the 1st Series

#the data starts out as a Python dict
area_dict = {'California': 423967, 
             'Texas': 695662, 
             'New York': 141297,
             'Florida': 170312, 
             'Illinois': 149995}

#convert to Series using pd.Series
area = pd.Series(area_dict)
area

# the output shows each index label with its value

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [29]:
# construct a DataFrame from a dictionary that holds the 2 Series

states = pd.DataFrame({'popul': population,
                       'area': area })

states

Unnamed: 0,popul,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


In [30]:
#Like Series Object, DataFrame has index attribute

states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [32]:
#In addition, DataFrame has a columns attribute

states.columns

Index(['popul', 'area'], dtype='object')
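The idea of a DataFrame as a set of Series aligned on a shared index can be made concrete with Series whose indices only partly overlap. This is a minimal sketch with hypothetical two-entry Series (local names `area` and `popul`), not the full state data used above.

```python
import pandas as pd

# Hypothetical data: the two Series share only the label 'Texas'
area = pd.Series({'Alaska': 1723337, 'Texas': 695662})
popul = pd.Series({'Texas': 26448193, 'New York': 19651127})

# The constructor aligns both Series on the union of their labels
states = pd.DataFrame({'popul': popul, 'area': area})

print(states)
print(states.index)
```

Labels present in only one Series still get a row; the column that has no data for that label holds NaN.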

##### DataFrame as a specialized dictionary

In [33]:
#a DataFrame maps a column name to a Series of column data

#indexing with the column name 'area' returns that Series

states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [34]:
states['popul']

#call the name of the column

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: popul, dtype: int64

In [45]:
#a DataFrame has rows and columns

#slicing with integers selects rows by implicit (positional) index
states[0: 2]

Unnamed: 0,popul,area
California,38332521,423967
Texas,26448193,695662


#### Constructing DataFrame Objects

##### 1. From a single Series object

In [46]:
pd.DataFrame( population, columns =['population'])

Unnamed: 0,population
California,38332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


##### 2. From a list of dicts

In [48]:
data =[{ 'a':i , 'b': 2 * i}
       for i in range(3)]

pd.DataFrame(data)

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


In [49]:
#if some keys in dictionary are missing
#Pandas will fill them in with NaN values
#NaN = 'Not a Number'

pd.DataFrame([{'a': 1, 'b':2}, {'b':3, 'c':4}])

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0
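The output above shows `a` and `c` as floats (`1.0`, `4.0`) while `b` stays an integer. A short sketch, reusing the same two dictionaries, makes the reason visible: NaN is a floating-point value, so columns that contain it are stored as floats.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns with a missing entry are stored as floating point;
# 'b' is complete and keeps an integer dtype
print(df.dtypes)

# Boolean mask of the missing entries
print(df.isna())
```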


##### 3. From a dictionary of Series objects

In [50]:
pd.DataFrame({'popul':population, 
              'area': area})

Unnamed: 0,popul,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


##### 4. From a 2-dimensional NumPy array

In [52]:
#create a 2-dimensional array
# add in specified column and index names

pd.DataFrame(np.random.rand(3,2),
            columns = ['foo', 'bar'],
             index = ['a', 'b', 'c'])

Unnamed: 0,foo,bar
a,0.964172,0.91102
b,0.565774,0.168304
c,0.947219,0.452705


##### 5. From a NumPy structured array

In [54]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B','f8')])

A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [55]:
pd.DataFrame(A)

Unnamed: 0,A,B
0,0,0.0
1,0,0.0
2,0,0.0


##### The Pandas Index Object 

- Series and DataFrame contain an explicit index used to reference and modify data
- an Index is immutable: its entries cannot be changed in place
- the default Index is the integer sequence 0, 1, 2, ...
- can be assigned a non-sequential numerical sequence
- can be assigned strings

In [56]:
ind = pd.Index([ 2, 3, 5, 7, 11])

ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

In [58]:
#Index is an immutable array
 
# operates like an array
# using Python indexing notation to retrieve values or slices

ind[1]

3

In [59]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [61]:
# Index Objects have attributes familiar from NumPy arrays

print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


In [62]:
# indices are immutable, cannot be modified by normal means

ind[1] = 0

TypeError: Index does not support mutable operations
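Immutability does not mean an index can never change; it means a change produces a new object. The sketch below assumes the same `ind` values as above and uses the `Index.delete` and `Index.insert` methods, which return new Index objects; the name `modified` is illustrative.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0                      # in-place assignment is rejected
except TypeError as err:
    print("TypeError:", err)

# delete() and insert() build a new Index and leave the original alone
modified = ind.delete(1).insert(1, 0)

print(list(modified))
print(list(ind))
```

This makes it safe to share one Index between several Series and DataFrames without side effects from inadvertent modification.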

In [64]:
#Index as ordered set

#set operations support joins across datasets
#note: pandas 2.x no longer treats &, | and ^ as set operations on an Index

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

indA & indB
#intersection

Int64Index([3, 5, 7], dtype='int64')

In [66]:
indA | indB
#union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [67]:
indA ^ indB

#symmetric difference

Int64Index([1, 2, 9, 11], dtype='int64')
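The same three set operations are also available as Index methods, which do not depend on how a given pandas version interprets the operators. A minimal sketch using the same `indA` and `indB`:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)           # labels in both
combined = indA.union(indB)                # labels in either
exclusive = indA.symmetric_difference(indB)  # labels in exactly one

print(common)
print(combined)
print(exclusive)
```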

#### Data Indexing & Selection

- 1-dimensional Series object
- 2-dimensional DataFrame object

##### Data Selection in Series

In [68]:
# Series as a dictionary
# provides a mapping from a collection of keys to values

#import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.9],
                index = ['a', 'b', 'c', 'd'])

data

a    0.25
b    0.50
c    0.75
d    1.90
dtype: float64

In [69]:
data['b']

0.5

In [70]:
'a' in data

True

In [71]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [73]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.9)]

In [74]:
#can add a new index/value pair just like a dictionary

data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.90
e    1.25
dtype: float64

##### Series is a 1-dimensional array

- same basic mechanisms as NumPy
- masking
- slicing
- fancy indexing

In [75]:
#slicing by explicit index
data['a':'c']

#last index is included in the slice

a    0.25
b    0.50
c    0.75
dtype: float64

In [76]:
#slicing by implicit integer index

data[0:2]
#last index is excluded from the slice

a    0.25
b    0.50
dtype: float64

In [77]:
#masking
data[(data>0.3) & (data< 0.8)]

b    0.50
c    0.75
dtype: float64

In [78]:
# fancy indexing

data[['a','e']]

a    0.25
e    1.25
dtype: float64

##### Indexers: loc, iloc, ix

In [79]:
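#Series by DEFAULT
# uses the explicit index when indexing
# uses the implicit index when slicing

#loc  - always references the explicit index
#iloc - always references the implicit, Python-style index
#ix   - hybrid of the two, mainly used for DataFrames; removed in pandas 1.0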

data = pd.Series(['a', 'b', 'c'], index = [ 1, 3, 5])

data

1    a
3    b
5    c
dtype: object

In [80]:
# explicit index when indexing

data[1]

'a'

In [81]:
#implicit index when slicing

data[1:3]

3    b
5    c
dtype: object

In [82]:
#the loc attribute allows indexing and slicing using the explicit index

data.loc[1]

'a'

In [83]:
data.loc[1:3]

1    a
3    b
dtype: object

In [84]:
#the iloc attribute allows indexing and slicing using the implicit index

data.iloc[1]

'b'

In [85]:
data.iloc[1:3]

3    b
5    c
dtype: object

##### Data Selection in DataFrame

- acts like a 2-dimensional or structured array
- also acts like a dictionary of Series structures sharing the same index

In [4]:
#DataFrame as a dictionary

import numpy as np
import pandas as pd

area = pd.Series({'California': 423967, 
                  'Texas': 695662,
                  'New York': 141297, 
                  'Florida': 170312,
                  'Illinois': 149995})

pop = pd.Series({'California': 38332521, 
                  'Texas': 26448193,
                  'New York': 19651127, 
                  'Florida': 19552860,
                 'Illinois': 12882135})

data = pd.DataFrame({'area': area, 'population':pop})

data

Unnamed: 0,area,population
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


In [5]:
#individual Series that make up columns of the DataFrame
#can be accessed by dict-style indexing of column name

data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [6]:
data['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

In [7]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [8]:
data.area is data['area']
#attribute-style access refers to the same column as data['area']

True

In [9]:
data.loc['California']
#access a row by its index label

area            423967
population    38332521
Name: California, dtype: int64

In [10]:
data.values[0]
#access the first row as a plain NumPy array

array([  423967, 38332521], dtype=int64)

In [11]:
#can add a new column 

data['density'] = data['population']/ data['area']

data

Unnamed: 0,area,population,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [12]:
#DataFrame as a 2-dimensional array

data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [13]:
#transpose the full DataFrame to switch the rows & columns

data.T

Unnamed: 0,California,Texas,New York,Florida,Illinois
area,423967.0,695662.0,141297.0,170312.0,149995.0
population,38332520.0,26448190.0,19651130.0,19552860.0,12882140.0
density,90.41393,38.01874,139.0767,114.8061,85.88376


### DataFrame

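#### access columns
- e.g. data['column name']

#### access rows
- e.g. data.values[0]

#### access by index 
- e.g. use the loc (explicit labels) or iloc (implicit positions) indexer

#### access by index and columns 
- e.g. use loc or iloc with a row selector and a column selector; the older ix indexer has been removed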
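The row-access options in this summary differ in what they return. The sketch below assumes a hypothetical two-row subset of the states data; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical two-row subset of the states data
data = pd.DataFrame({'area': [423967, 695662],
                     'population': [38332521, 26448193]},
                    index=['California', 'Texas'])

row_array = data.values[0]             # plain NumPy array, labels are lost
row_by_pos = data.iloc[0]              # Series labelled by column name
row_by_label = data.loc['California']  # same row, selected by its label

print(row_array)
print(row_by_pos)
```

`data.values[0]` is convenient for raw numbers, while `iloc` and `loc` keep the column labels attached to the row.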

In [14]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [15]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [16]:
#can slice out the arrays we want

#iloc - implicit - excludes the last integer index
data.iloc[:3, :2]

Unnamed: 0,area,population
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127


In [17]:
#loc - explicit - includes the last label
data.loc[:'Texas' , :'population']

Unnamed: 0,area,population
California,423967,38332521
Texas,695662,26448193


In [109]:
# ix allowed a hybrid of the 2 approaches
# it was deprecated in pandas 0.20 and removed in 1.0, hence the error below

data.ix[:3, :'population']

AttributeError: 'DataFrame' object has no attribute 'ix'
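The hybrid selection that `ix` used to offer (rows by position, columns by label) can still be expressed with `loc` or `iloc` alone. The sketch below shows two possible translations under the assumption of a four-row version of the states data; `by_label`, `by_position`, and `stop` are illustrative names.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312],
     'population': [38332521, 26448193, 19651127, 19552860],
     'density': [90.413926, 38.018740, 139.076746, 114.806121]},
    index=['California', 'Texas', 'New York', 'Florida'])

# Option 1: turn the row positions into labels, then use loc
by_label = data.loc[data.index[:3], :'population']

# Option 2: turn the column label into a position, then use iloc
# (+1 because iloc excludes the stop position)
stop = data.columns.get_loc('population') + 1
by_position = data.iloc[:3, :stop]

print(by_label)
```

Both options select the first three rows and the columns up to and including `population`.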

In [None]:
#loc indexing - explicit index
# can combine masking and fancy indexing

data.loc[data.density > 100, ['population', 'density']]

#the column labels must exist in the DataFrame:
#this DataFrame has no column named 'pop', so ['pop', 'density'] raises a KeyError

In [18]:
#can set or modify values

data.iloc[0,2] = 90

data

Unnamed: 0,area,population,density
California,423967,38332521,90.0
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [19]:
#slicing by rows
data['Texas':'Florida']

Unnamed: 0,area,population,density
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121


In [21]:
#slicing by rows
#implicit row index 
data[1:3]

Unnamed: 0,area,population,density
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746


In [22]:
#direct masking operations are interpreted row-wise
data[data.density>100]

Unnamed: 0,area,population,density
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121


#### Operating on Data in Pandas

- ufuncs preserve index and column labels in the output
- Pandas aligns indices for binary operations such as addition

In [24]:
#Ufuncs: Index Preservation
#any NumPy ufunc will work on Pandas Series & DataFrame objects

rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10 , 4))

ser

0    6
1    3
2    7
3    4
dtype: int32

In [27]:
df = pd.DataFrame(rng.randint(0, 10,(3 , 4)),
                  columns =['A', 'B', 'C','D'])

df

Unnamed: 0,A,B,C,D
0,1,7,5,1
1,4,0,9,5
2,8,0,9,2


In [28]:
#apply NumPy ufunc with indices preserved.

np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [29]:
np.sin(df* np.pi/4)

Unnamed: 0,A,B,C,D
0,0.7071068,-0.707107,-0.707107,0.707107
1,1.224647e-16,0.0,0.707107,-0.707107
2,-2.449294e-16,0.0,0.707107,1.0


##### UFuncs: Index Alignments

Index Alignment in Series

In [30]:
#Combining 2 different data sources in series
 
area = pd.Series({'Alaska': 1723337, 
                  'Texas': 695662,
                  'California': 423967}, 
                   name='area')

population = pd.Series({'California': 38332521, 
                        'Texas': 26448193,
                        'New York': 19651127}, 
                       name='population')

#what happens when we divide to compute the population density

population/area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

In [31]:
#the resulting index is the union of the indices of the 2 inputs
#(equivalent to area.index.union(population.index))

area.index | population.index

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

In [33]:
#entries missing from either input are filled in with NaN by default

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

In [34]:
#to add while substituting a fill value for missing entries, use add() with fill_value

A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64

In [35]:
##### Index Alignment in DataFrame

#alignment takes place for BOTH columns and indices

A = pd.DataFrame(rng.randint(0, 20,(2,2)),
                columns =list('AB'))

A

Unnamed: 0,A,B
0,11,19
1,2,4


In [36]:
B = pd.DataFrame(rng.randint(0, 10, (3,3)),
                columns = list('BAC'))

B

Unnamed: 0,B,A,C
0,2,6,4
1,8,6,1
2,3,8,1


In [37]:
A + B

#indices are aligned correctly irrespective of their order
#indices are sorted

#columns are also aligned

Unnamed: 0,A,B,C
0,17.0,21.0,
1,8.0,12.0,
2,,,


In [38]:
A.add(B, fill_value = 0)

Unnamed: 0,A,B,C
0,17.0,21.0,4.0
1,8.0,12.0,1.0
2,8.0,3.0,1.0


In [40]:
# use arithmetic method and pass desired fill_value in place 
# of missing entries
    
fill = A.stack().mean()
A.add(B, fill_value = fill)

Unnamed: 0,A,B,C
0,17.0,21.0,13.0
1,8.0,12.0,10.0
2,17.0,12.0,10.0
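The fill value used above is the mean of all values in `A`. As a hedged sketch, the following rebuilds `A` from the values shown in its output (11, 19, 2, 4) rather than from the random generator, and compares `stack().mean()` with taking the mean of the underlying NumPy array; `fill_stack` and `fill_numpy` are illustrative names.

```python
import pandas as pd

# A rebuilt from the values shown in its output above
A = pd.DataFrame([[11, 19], [2, 4]], columns=list('AB'))

# stack() reshapes the 2-D frame into a 1-D Series, so mean()
# averages every cell instead of averaging column by column
fill_stack = A.stack().mean()

# The same overall mean, taken from the underlying NumPy array
fill_numpy = A.to_numpy().mean()

print(fill_stack, fill_numpy)
```

Calling `A.mean()` directly would instead return one mean per column, which is not what the fill value needs.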


##### Python operators and Pandas methods

 (+) add()
 (-) sub(), subtract()
 (*) mul(), multiply()
 (/) truediv(), div(), divide()
 (//) floordiv()
 (%) mod()
 (**) pow()
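The correspondence in this list can be checked directly. The sketch below uses a small hypothetical DataFrame and the scalar 2 (both assumptions for illustration) and compares each operator with its method form.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 8], 'B': [7, 0, 0]})  # hypothetical values

# Each entry pairs the operator result with the method result
pairs = {
    '+':  (df + 2,  df.add(2)),
    '-':  (df - 2,  df.sub(2)),
    '*':  (df * 2,  df.mul(2)),
    '/':  (df / 2,  df.truediv(2)),
    '//': (df // 2, df.floordiv(2)),
    '%':  (df % 2,  df.mod(2)),
    '**': (df ** 2, df.pow(2)),
}

for symbol, (by_operator, by_method) in pairs.items():
    print(symbol, by_operator.equals(by_method))
```

The method forms matter mainly because they accept extra arguments such as `fill_value` and `axis`, which the operators cannot take.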

In [41]:
##### Ufuncs: Operations btwn DataFrame & Series

A = rng.randint(10, size=(3,4))
A

array([[9, 8, 9, 4],
       [1, 3, 6, 7],
       [2, 0, 3, 1]])

In [42]:
A - A[0]

# A[0] refers to the first row
#subtraction is applied row-wise according to NumPy's broadcasting rules

array([[ 0,  0,  0,  0],
       [-8, -5, -3,  3],
       [-7, -8, -6, -3]])

In [43]:
#the same operation on a DataFrame also works row-wise by default

df = pd.DataFrame(A, columns = list('QRST'))
df - df.iloc[0]

Unnamed: 0,Q,R,S,T
0,0,0,0,0
1,-8,-5,-3,3
2,-7,-8,-6,-3


In [45]:
#to operate column-wise, use the method form and specify the axis

df.subtract(df['R'], axis = 0)

Unnamed: 0,Q,R,S,T
0,1,0,1,-4
1,-2,0,3,4
2,2,0,3,1


In [46]:
df.subtract(df['R'], axis = 1)

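#with axis=1 the Series index (0, 1, 2) is aligned against the column labels
#no labels match, so every entry is NaN: this is the wrong axis for a column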

Unnamed: 0,Q,R,S,T,0,1,2
0,,,,,,,
1,,,,,,,
2,,,,,,,
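The two `axis` settings can be compared side by side. This sketch rebuilds `df` from the array values shown earlier instead of the random generator; `col`, `by_rows`, and `by_cols` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame([[9, 8, 9, 4],
                   [1, 3, 6, 7],
                   [2, 0, 3, 1]], columns=list('QRST'))
col = df['R']   # Series indexed by the row labels 0, 1, 2

# axis=0: the Series index is matched against the row index,
# so each row has its own 'R' value subtracted
by_rows = df.subtract(col, axis=0)

# axis=1: the Series index is matched against the column labels
# Q, R, S, T; none of them match 0, 1, 2
by_cols = df.subtract(col, axis=1)

print(by_rows)
print(by_cols.isna().all().all())
```

A quick rule of thumb: the `axis` argument names the axis of the DataFrame whose labels should be matched against the Series index.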


In [47]:
halfrow = df.iloc[0, ::2]

halfrow

Q    9
S    9
Name: 0, dtype: int32

In [48]:
df - halfrow

#indices and columns are preserved and aligned
#columns missing from halfrow (R and T) become NaN

Unnamed: 0,Q,R,S,T
0,0.0,,0.0,
1,-8.0,,-3.0,
2,-7.0,,-6.0,
