JUPYTER NOTEBOOK CREATED BY

# DR. RAJAN GUPTA
# DEEN DAYAL UPADHYAYA COLLEGE
# UNIVERSITY OF DELHI
# rgupta.cs.du@gmail.com

In [None]:
#Chapter 5 Pandas - [5.1-5.2]
#Run cell using "Shift + Enter"

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.
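
The contrast can be seen in a small sketch. The column names and values here are made up for illustration and are not part of the original notebook: a NumPy array coerces mixed inputs to one dtype, while a DataFrame keeps a separate dtype per column.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one dtype; mixing numbers and strings coerces everything to strings.
arr = np.array([1, 'two', 3.0])

# A DataFrame stores each column with its own dtype (illustrative data).
df = pd.DataFrame({'count': [1, 2, 3],
                   'label': ['a', 'b', 'c'],
                   'score': [0.5, 1.5, 2.5]})

print(arr.dtype.kind)  # 'U' means a unicode string array
print(df.dtypes)       # one dtype per column
```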

In [None]:
#import the library
import pandas as pd
from pandas import Series, DataFrame

**5.1 Introduction to pandas Data Structures**

**SERIES** - A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data.

In [None]:
#creating a series like a 1D array with indexes
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [None]:
#displaying the array representation of the object
obj.values

array([ 4,  7, -5,  3])

In [None]:
#displaying the index of the object
#a default RangeIndex runs from 'start' to 'stop-1'
obj.index

RangeIndex(start=0, stop=4, step=1)

In [None]:
#creating the array with customized index
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [None]:
#checking the index of the object
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [None]:
#accessing elements using index
obj2['a']

-5

In [None]:
obj[2]

-5

In [None]:
#Assign values to the object using index
obj2['d'] = 6
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [None]:
#selecting several elements in any order by passing a list of index labels
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

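NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, preserve the link between index and value.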

In [None]:
#filtering values with a boolean condition
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [None]:
obj2*2

d    12
b    14
a   -10
c     6
dtype: int64

In [None]:
#We can use the numpy library on the pandas objects
import numpy as np

In [None]:
#using np function on pandas series object
np.exp(obj)

0      54.598150
1    1096.633158
2       0.006738
3      20.085537
dtype: float64

In [None]:
#Using series object like a dictionary in python
'b' in obj2

True

In [None]:
'e' in obj2

False

Should you have data contained in a Python dict, you can create a Series from it by passing the dict.

In [None]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
sdata

{'Ohio': 35000, 'Oregon': 16000, 'Texas': 71000, 'Utah': 5000}

In [None]:
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When passing a dictionary, the index of the resulting Series follows the dict's key order (insertion order in current pandas versions; older versions sorted the keys). We can override this by passing the dict keys in the order we want them to appear in the resulting Series.

In [None]:
#we intentionally pass the label 'California', which is not a key in the dict, so its value becomes NaN
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [None]:
#Finding Null Values in Pandas
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [None]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [None]:
#directly using series instance to check null
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations.

In [None]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [None]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [None]:
#adding two series which will automatically align themselves wrt their index
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
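
A minimal sketch of the alignment rule, using two small made-up Series (the figures are illustrative, not data from this notebook). It also shows `Series.add` with `fill_value`, which treats a label missing on one side as the given value instead of producing NaN.

```python
import pandas as pd

# Two Series with partly overlapping labels (illustrative values).
pop_a = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Utah': 5000})
pop_b = pd.Series({'California': 40000, 'Ohio': 35000, 'Texas': 71000})

# Plain addition aligns on labels; a label missing on either side gives NaN.
plain = pop_a + pop_b

# add() with fill_value substitutes 0 for the side that lacks the label.
filled = pop_a.add(pop_b, fill_value=0)

print(plain)
print(filled)
```

The result index is the union of both indexes, which is why 'California' and 'Utah' both appear.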

In [None]:
#We can give a name to the Series and to its index
#naming the Series obj4 'Population'
#and naming its index 'States'

obj4.name = 'Population'
obj4.index.name = 'States'
obj4

States
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: Population, dtype: float64

In [None]:
#Altering the index labels through in-place assignment
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [None]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

**DATAFRAME**

A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string,
boolean, etc.). The DataFrame has both a row and column index; it can be thought of
as a dict of Series all sharing the same index.
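
The "dict of Series sharing the same index" view can be sketched directly. The variable names below are chosen for illustration only.

```python
import pandas as pd

# Two Series that share the same index labels.
year = pd.Series([2000, 2001, 2002], index=['one', 'two', 'three'])
pop = pd.Series([1.5, 1.7, 3.6], index=['one', 'two', 'three'])

# Each dict key becomes a column; the shared labels become the row index.
frame = pd.DataFrame({'year': year, 'pop': pop})

# Pulling a column back out returns a Series carrying the frame's index.
col = frame['pop']
print(type(col).__name__)
print(frame.dtypes)
```

Each column keeps its own dtype, which is what allows a DataFrame to mix numeric and non-numeric columns.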

In [None]:
#Creating DataFrame (can be done in multiple ways)
#Common ways - dict of equal-length lists or NumPy arrays

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [None]:
#selecting the first 5 rows of the dataframe for larger dataset
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [None]:
#selecting the last 5 rows of the dataset
frame.tail()

Unnamed: 0,state,year,pop
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [None]:
#Reordering the columns of the data frame
pd.DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [None]:
#passing a column that isn’t contained in the dict will appear with missing values in the result
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four','five', 'six'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [None]:
#checking the columns of the frame
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [None]:
#column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [None]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [None]:
#accessing rows can be done using loc
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [None]:
#modification of columns by assignment
frame2['debt'] = 16.5
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5
six,2003,Nevada,3.2,16.5


In [None]:
#assigning a NumPy array to a column; its length must match the length of the DataFrame
frame2['debt'] = np.arange(6.)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0
six,2003,Nevada,3.2,5.0


In [None]:
#assigning a Series aligns it to the DataFrame's index; labels missing from the Series become NaN
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


In [None]:
#Assigning a column that doesn’t exist will create a new column
cond = frame2.state == 'Ohio'
frame2['eastern'] = cond
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False
six,2003,Nevada,3.2,,False


In [None]:
#Deleting a column with the del keyword
del frame2['eastern']
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [None]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


In [None]:
#Another common form of data is a dict of dicts

#pandas will interpret the outer dict keys as the columns and the inner keys as the row indices

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [None]:
#transposing the dataframe
frame3.T

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


In [None]:
#Treatment of Dicts of Series
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9


In [None]:
#setting the names of the row index and of the columns
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [None]:
#values being returned as ndarray
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [None]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

**Index Objects for SERIES & DATAFRAME**

pandas's Index objects are responsible for holding the axis labels and other metadata (such as the axis name or names).

In [None]:
#creating a series
obj = pd.Series(range(3), index=['a', 'b', 'c'])
obj

a    0
b    1
c    2
dtype: int64

In [None]:
#creating index objects (immutable)
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [None]:
#slicing in the index
index[1:]

Index(['b', 'c'], dtype='object')

In [None]:
#immutable property
index[1] = 'd'

TypeError: Index does not support mutable operations

In [None]:
#Due to immutability it is safer to share index objects among data structures
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [None]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [None]:
obj2.index is labels

True

In [None]:
#index can also behave like fixed-size set
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [None]:
frame3.columns

Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [None]:
'Ohio' in frame3.columns

True

In [None]:
2003 in frame3.index

False

In [None]:
#pandas index can have duplicate values
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
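
A short sketch of the set-like and immutable behaviour described in this section. The labels are arbitrary; `intersection`, `union`, `difference`, and `is_unique` are standard `Index` members.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

# Set-style operations return new Index objects.
common = left.intersection(right)
combined = left.union(right)
only_left = left.difference(right)

# In-place modification is rejected with a TypeError.
try:
    left[0] = 'z'
    mutated = True
except TypeError:
    mutated = False

# Unlike a Python set, an Index may hold duplicates; is_unique reports this.
has_dups = not pd.Index(['foo', 'foo', 'bar']).is_unique

print(list(common), list(combined), list(only_left), mutated, has_dups)
```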

**5.2 Essential Functionality**

**Reindexing**

In [None]:
#to create a new object with the data conformed to a new index
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [None]:
#rearranges the data according to the new index, inserting NaN for any index label that was not already present
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [None]:
#usage of forward fills
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [None]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
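
For comparison, a sketch of three ways `reindex` can fill the gaps it creates: forward fill, backward fill, and a constant. The `'unknown'` placeholder is an arbitrary choice made for this example.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill: each new label takes the last earlier value.
forward = colors.reindex(range(6), method='ffill')

# Backward fill: each new label takes the next later value (none exists for label 5).
backward = colors.reindex(range(6), method='bfill')

# Constant fill: every new label gets the same placeholder.
constant = colors.reindex(range(6), fill_value='unknown')

print(pd.DataFrame({'ffill': forward, 'bfill': backward, 'constant': constant}))
```

The `method` option requires the original index to be sorted (monotonic), which holds here.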

In [None]:
#With DataFrame, reindex can alter either the (row) index, columns, or both
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [None]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [None]:
#reindexing column names
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [None]:
#passing loc a list that contains missing labels raises a KeyError in newer pandas versions; use reindex instead
#frame.loc[['a', 'b', 'c', 'd'], states]
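
A sketch of that behaviour, assuming pandas 1.0 or later: `loc` with a label list that includes missing labels raises `KeyError`, while `reindex` with both `index` and `columns` produces the intended result with NaN in the new positions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# 'b' and 'Utah' do not exist, so loc refuses the lookup (assumes pandas >= 1.0).
try:
    frame.loc[['a', 'b', 'c', 'd'], states]
    raised = False
except KeyError:
    raised = True

# reindex accepts missing labels and fills them with NaN.
result = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
print(raised)
print(result)
```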

**Dropping Entries from an Axis**

In [None]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [None]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [None]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

In [None]:
#deleting values from the axis
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
#dropping rows
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
#dropping columns
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [None]:
#dropping an entry in place: obj itself is modified and no new object is returned
obj.drop('c', inplace=True)
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
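
A small sketch of the default (non-in-place) behaviour, for contrast: `drop` returns a new object and leaves the original untouched. It also uses the `columns=` keyword, which is equivalent to `axis=1`.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Without inplace=True, drop returns a new Series; obj is unchanged.
trimmed = obj.drop('c')

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# columns=... is an alternative spelling of drop(..., axis=1).
by_keyword = data.drop(columns=['two'])
by_axis = data.drop('two', axis=1)

print(len(obj), len(trimmed), by_keyword.equals(by_axis))
```

Returning a new object makes it easy to keep the original data around, at the cost of an extra copy.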

**Indexing, Selection, and Filtering**

In [None]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [None]:
#indexing using index value
obj['b']

1.0

In [None]:
#indexing by integer position (deprecated for label-indexed Series in newer pandas; prefer obj.iloc[1])
obj[1]

1.0

In [None]:
#slicing
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [None]:
#using list of index values
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [None]:
#using a list of integer positions (prefer obj.iloc[[1, 3]] in newer pandas)
obj[[1,3]]

b    1.0
d    3.0
dtype: float64

In [None]:
#using boolean operation
obj[obj<2]

a    0.0
b    1.0
dtype: float64

In [None]:
#slicing using index values
obj['b':'c']

b    1.0
c    2.0
dtype: float64

In [None]:
#setting values to sliced portion
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

In [None]:
#dataframe
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
#column access
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [None]:
#list of columns
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [None]:
#slicing for rows
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [None]:
#conditional slicing
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
#scalar comparison
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [None]:
#conditional assignment
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


**Selection with loc and iloc**

In [None]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
#loc works with row and column values
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int64

In [None]:
#iloc works with row and column positions
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

In [None]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [None]:
data.iloc[[1, 2], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [None]:
#loc works with indexing and slicing both
data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [None]:
#iloc also works with indexing and slicing both
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


**Integer Indexes**

In [None]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [None]:
#ser[-1]
#this raises a KeyError: with an integer index, -1 is looked up as a label rather than as a position, unlike lists and tuples

In [None]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

In [None]:
#with a non-integer index, an integer falls back to position (deprecated in newer pandas; prefer ser2.iloc[-1])
ser2[-1]

2.0

In [None]:
#plain slicing with integers is position-based, so the end point is excluded
ser[:1]

0    0.0
dtype: float64

In [None]:
#loc slicing is label-based and includes the end label, instead of stopping at "end-1"
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In [None]:
ser.iloc[:1]

0    0.0
dtype: float64
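
The difference between the two accessors can be summarised in a short sketch: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end, and `iloc` is the unambiguous way to take the last element.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))                           # integer labels 0, 1, 2
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])   # string labels

by_label = ser.loc[:1]        # labels 0 and 1, end label included
by_position = ser.iloc[:1]    # position 0 only, end excluded
last = ser.iloc[-1]           # last element, regardless of the index type
label_slice = ser2.loc['a':'b']

print(len(by_label), len(by_position), last, list(label_slice.index))
```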

**Arithmetic and Data Alignment**

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [None]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [None]:
#adding two series in pandas
#common elements are added
#uncommon elements are shown as NaN
s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [None]:
#adding 2D dataframes
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [None]:
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [None]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [None]:
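#adding two dataframes aligns on both rows and columns
#cells whose row and column labels exist in both frames are added
#all other cells in the union of labels become NaN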
df1+df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [None]:
#arithmetic on dataframes with no column labels in common results in all NaN
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})

In [None]:
df1

Unnamed: 0,A
0,1
1,2


In [None]:
df2

Unnamed: 0,B
0,3
1,4


In [None]:
#no common column labels will result in all NaN
df1-df2

Unnamed: 0,A,B
0,,
1,,


In [None]:
#Arithmetic methods with fill values
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),columns=list('abcde'))

In [None]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [None]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [None]:
df2.loc[1, 'b'] = np.nan
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [None]:
df1+df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [None]:
#to remove nan, we can fill them with 0 while adding
#important function in data analysis
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
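
How `fill_value` decides what to fill can be checked on a tiny, made-up pair of frames: a cell that is missing on only one side is replaced by the fill value before adding, while a cell missing on both sides stays NaN.

```python
import numpy as np
import pandas as pd

# left has no column 'b' and no row 2 (illustrative data).
left = pd.DataFrame({'a': [1.0, 2.0]}, index=[0, 1])
right = pd.DataFrame({'a': [10.0, np.nan, 30.0],
                      'b': [1.0, 2.0, np.nan]}, index=[0, 1, 2])

total = left.add(right, fill_value=0)
print(total)
# (1, 'a'): right is NaN but left has 2.0, so the result is 2.0
# (2, 'a'): left has no row 2, so 0 + 30.0 = 30.0
# (2, 'b'): missing in both frames, so the result stays NaN
```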


In [None]:
#division
1/df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [None]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [None]:
#filling value with 0
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


In [None]:
#Operations between Dataframe and Series
arr = np.arange(12.).reshape((3, 4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [None]:
arr[0]

array([0., 1., 2., 3.])

In [None]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

In [None]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [None]:
series = frame.iloc[0]
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [None]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [None]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2

b    0
e    1
f    2
dtype: int64

In [None]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,
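
By default, arithmetic between a DataFrame and a Series matches the Series index against the DataFrame's columns and broadcasts down the rows. To match on the row index and broadcast across the columns instead, the arithmetic methods accept an `axis` argument. A minimal sketch on the same kind of frame (the name `col_d` is chosen for this example):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# A column, indexed by the row labels.
col_d = frame['d']

# axis='index' matches on the row labels and subtracts col_d from every column.
result = frame.sub(col_d, axis='index')
print(result)
```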


**Function Application and Mapping**

In [None]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-1.884034,0.594691,0.487762
Ohio,2.686563,0.175174,0.972721
Texas,-0.931093,1.743606,-0.778279
Oregon,1.338907,0.556186,-0.580039


In [None]:
#applying numpy ufuncs
np.abs(frame)

Unnamed: 0,b,d,e
Utah,1.884034,0.594691,0.487762
Ohio,2.686563,0.175174,0.972721
Texas,0.931093,1.743606,0.778279
Oregon,1.338907,0.556186,0.580039


In [None]:
#applying a function that computes the range (max - min) of each column
f = lambda x: x.max() - x.min()
frame.apply(f)

b    4.570598
d    1.568432
e    1.751000
dtype: float64

In [None]:
#specifying function on axis
frame.apply(f, axis='columns')
#axis='columns' applies the function across the columns, i.e. once per row

Utah      2.478725
Ohio      2.511389
Texas     2.674699
Oregon    1.918946
dtype: float64

In [None]:
#the function passed to apply can also return a Series with multiple values
def f(x): return pd.Series([x.min(), x.max()], index=['min', 'max'])

In [None]:
frame.apply(f)

Unnamed: 0,b,d,e
min,-1.884034,0.175174,-0.778279
max,2.686563,1.743606,0.972721


In [None]:
#applying decimal formatting of a number
format = lambda x: '%.2f' % x

In [None]:
#applymap applies the function to every element (newer pandas versions name this DataFrame.map)
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,-1.88,0.59,0.49
Ohio,2.69,0.18,0.97
Texas,-0.93,1.74,-0.78
Oregon,1.34,0.56,-0.58


In [None]:
#selective formatting on a column
frame['e'].map(format)

Utah       0.49
Ohio       0.97
Texas     -0.78
Oregon    -0.58
Name: e, dtype: object
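
A compact sketch of the two directions of `apply` and of element-wise `Series.map`, on a small frame with fixed values so the results are reproducible (the numbers and the name `spread` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, 4.0], 'd': [2.5, 0.5]}, index=['Utah', 'Ohio'])

# Range (max - min) of whatever 1D slice apply hands to the function.
spread = lambda x: x.max() - x.min()

per_column = frame.apply(spread)                 # one result per column
per_row = frame.apply(spread, axis='columns')    # one result per row

# Series.map works element by element.
formatted = frame['d'].map(lambda x: '%.2f' % x)

print(per_column)
print(per_row)
print(formatted)
```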

**Sorting & Ranking**

In [None]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj

d    0
a    1
b    2
c    3
dtype: int64

In [None]:
#sorting a Series by its index labels
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [None]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [None]:
#sorting by the column labels
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [None]:
#sorting a series by its values
obj = pd.Series([4, 7, -3, 2])
obj

0    4
1    7
2   -3
3    2
dtype: int64

In [None]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [None]:
#missing values(nan) are sorted at the end
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj

0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64

In [None]:
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [None]:
#multiple columns can be sorted together
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [None]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [None]:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1


In [None]:
#ranking
#assigns ranks from one through the number of valid data points in an array
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [None]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [None]:
#instead of averaging the ranks of tied values, distinct ranks can be assigned too
#ties are resolved by order of first occurrence
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [None]:
#ranking in descending order assigning tie values the max rank in the group
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
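
The tie-breaking options are easier to compare side by side. This sketch ranks the same Series with four of the `method` values that `rank` accepts.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column of ranks per tie-breaking method.
ranks = pd.DataFrame({m: obj.rank(method=m)
                      for m in ['average', 'min', 'first', 'dense']})
ranks.insert(0, 'value', obj)
print(ranks)
# 'average' shares the mean rank among ties (the default)
# 'min' gives every tied value the lowest rank of its group
# 'first' breaks ties by order of appearance
# 'dense' is like 'min', but ranks increase by 1 between groups
```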

In [None]:
#computing ranks over the columns
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 
                      'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [None]:
#defining rank in every row for column values, separately
frame.rank(axis='columns')

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


**Axis Indexes with Duplicate Labels**

In [None]:
#duplicate indices
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [None]:
#check if indexes are unique
obj.index.is_unique

False

In [None]:
obj['a']

a    0
a    1
dtype: int64

In [None]:
obj['c']

4

In [None]:
#accessing duplicate indexes in multidimensional dataframe
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

Unnamed: 0,0,1,2
a,-0.085628,-1.128399,-0.740289
a,1.534146,0.346013,-0.070948
b,-2.215553,2.396678,-0.734277
b,2.070072,-0.399956,0.574241


In [None]:
df.loc['b']

Unnamed: 0,0,1,2
b,-2.215553,2.396678,-0.734277
b,2.070072,-0.399956,0.574241
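
Because selection on a duplicated label returns several rows while a unique label returns a scalar, code that depends on the result type may want to detect or remove duplicates first. A minimal sketch of two common options, keeping the first occurrence or aggregating per label (which one is appropriate depends on the data):

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# Boolean array marking every repeat of a label after its first occurrence.
dup_mask = obj.index.duplicated(keep='first')

# Option 1: keep only the first row for each label.
first_only = obj[~dup_mask]

# Option 2: combine rows that share a label, here by summing them.
totals = obj.groupby(level=0).sum()

print(list(dup_mask))
print(first_only)
print(totals)
```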
