# Pandas

Pandas is a library for working with tabular data and one of the main tools used for data analysis in Python.
In contrast with NumPy, which is designed for homogeneous data such as numeric arrays, pandas is designed for
heterogeneous tabular data, where each column can hold a different type.
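The difference between homogeneous and heterogeneous data can be sketched as follows. The variable names and values are made up for illustration; the point is only that a NumPy array has one dtype while a DataFrame keeps one dtype per column.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype; mixing numbers and strings
# forces every element to a common (string) type.
mixed_array = np.array([1, 2.5, 'three'])
print(mixed_array.dtype.kind)   # 'U' marks a Unicode string dtype

# A DataFrame keeps a separate dtype for each column.
mixed_frame = pd.DataFrame({'count': [1, 2, 3],
                            'price': [1.5, 2.5, 3.5],
                            'label': ['a', 'b', 'c']})
print(mixed_frame.dtypes)
```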

In [11]:
import pandas as pd
import numpy as np

## Pandas Data Structures

Let's look at the two core data structures in pandas: Series and DataFrame.

### Series

A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy types) and an associated array of data labels, called its index.

In [2]:
obj = pd.Series([4, 7, -5, 3])

In [3]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [5]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

You can use labels in the index when selecting a single value or a set of values:

In [6]:
obj2['a']

-5

In [7]:
obj2[['a', 'b', 'c', 'd']]

a   -5
b    7
c    3
d    4
dtype: int64

In [8]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

In [9]:
obj2 * 4

d    16
b    28
a   -20
c    12
dtype: int64

In [12]:
np.sqrt(obj2)  # the negative value at 'a' yields NaN (NumPy may emit a RuntimeWarning)

d    2.000000
b    2.645751
a         NaN
c    1.732051
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values.
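A small sketch of the dict analogy, reusing the values of obj2 from above: membership tests check the index labels the way a dict checks its keys, and items() walks through label/value pairs in index order.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Membership tests look at the index labels, like the keys of a dict.
print('b' in obj2)
print('e' in obj2)

# items() yields (label, value) pairs in index order.
pairs = list(obj2.items())
print(pairs)
```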

A Series can also be created from a Python dict:

In [13]:
sdata = {'Mumbai': 35000, 'Hyderabad': 71000, 'Kanpur': 16000, 'Ahmedabad': 5000}

obj3 = pd.Series(sdata)
obj3

Mumbai       35000
Hyderabad    71000
Kanpur       16000
Ahmedabad     5000
dtype: int64

In [14]:
obj3['Mumbai']

35000

In [15]:
cities = ['Hyderabad', 'Mumbai', 'Ahmedabad', 'Kanpur','Pune']
obj4 = pd.Series(sdata, index=cities)

In [16]:
obj4

Hyderabad    71000.0
Mumbai       35000.0
Ahmedabad     5000.0
Kanpur       16000.0
Pune             NaN
dtype: float64

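Because 'Pune' has no entry in sdata, its value is missing (NaN) and the dtype is upcast to float64. The isnull and notnull functions in pandas detect missing data: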

In [17]:
pd.isnull(obj4)

Hyderabad    False
Mumbai       False
Ahmedabad    False
Kanpur       False
Pune          True
dtype: bool

In [18]:
pd.notnull(obj4)

Hyderabad     True
Mumbai        True
Ahmedabad     True
Kanpur        True
Pune         False
dtype: bool

In [19]:
obj3.isnull()

Mumbai       False
Hyderabad    False
Kanpur       False
Ahmedabad    False
dtype: bool

In [20]:
obj3 + obj4

Ahmedabad     10000.0
Hyderabad    142000.0
Kanpur        32000.0
Mumbai        70000.0
Pune              NaN
dtype: float64

In [21]:
obj4.index, obj4.values

(Index(['Hyderabad', 'Mumbai', 'Ahmedabad', 'Kanpur', 'Pune'], dtype='object'),
 array([71000., 35000.,  5000., 16000.,    nan]))

In [23]:
obj4.index.name = 'cities'
obj4.name = 'cities and population'
obj4

cities
Hyderabad    71000.0
Mumbai       35000.0
Ahmedabad     5000.0
Kanpur       16000.0
Pune             NaN
Name: cities and population, dtype: float64

### DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame’s internals are outside the scope of this class.
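The "dict of Series sharing the same index" view can be sketched like this. The two Series and their numbers are invented for illustration; note how labels missing from one Series show up as NaN in that column.

```python
import pandas as pd

# Two Series with partly different index labels (made-up numbers).
population = pd.Series({'AP': 1.5, 'Punjab': 2.4, 'UP': 2.9})
area = pd.Series({'AP': 160.0, 'Punjab': 50.0, 'Goa': 3.7})

# Each dict key becomes a column; the row index combines the labels of both Series.
frame = pd.DataFrame({'pop': population, 'area': area})
print(frame)

# Selecting a column gives back a Series on the shared index.
print(frame['pop'])
```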

In [24]:
data = {'state': ['AP', 'Maharashtra', 'Himanchal', 'Punjab', 'UP', 'Punjab'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [25]:
df = pd.DataFrame(data)

In [26]:
df

Unnamed: 0,state,year,pop
0,AP,2000,1.5
1,Maharashtra,2001,1.7
2,Himanchal,2002,3.6
3,Punjab,2001,2.4
4,UP,2002,2.9
5,Punjab,2003,3.2


In [27]:
df.head()

Unnamed: 0,state,year,pop
0,AP,2000,1.5
1,Maharashtra,2001,1.7
2,Himanchal,2002,3.6
3,Punjab,2001,2.4
4,UP,2002,2.9


In [28]:
df.head(2)

Unnamed: 0,state,year,pop
0,AP,2000,1.5
1,Maharashtra,2001,1.7


In [51]:
df2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],index=['one', 'two', 'three', 'four','five', 'six'])
df2

Unnamed: 0,year,state,pop,debt
one,2000,AP,1.5,
two,2001,Maharashtra,1.7,
three,2002,Himanchal,3.6,
four,2001,Punjab,2.4,
five,2002,UP,2.9,
six,2003,Punjab,3.2,


In [52]:
df2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [53]:
df2['state']

one               AP
two      Maharashtra
three      Himanchal
four          Punjab
five              UP
six           Punjab
Name: state, dtype: object

In [54]:
df2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [55]:
df2.loc['two']

year            2001
state    Maharashtra
pop              1.7
debt             NaN
Name: two, dtype: object

In [35]:
df

Unnamed: 0,state,year,pop
0,AP,2000,1.5
1,Maharashtra,2001,1.7
2,Himanchal,2002,3.6
3,Punjab,2001,2.4
4,UP,2002,2.9
5,Punjab,2003,3.2


In [36]:
df.loc[3]

state    Punjab
year       2001
pop         2.4
Name: 3, dtype: object

In [37]:
df.loc[3, 'state']  # row label and column label in a single lookup

'Punjab'

In [56]:
df2['debt'] = np.arange(6)

In [57]:
df2

Unnamed: 0,year,state,pop,debt
one,2000,AP,1.5,0
two,2001,Maharashtra,1.7,1
three,2002,Himanchal,3.6,2
four,2001,Punjab,2.4,3
five,2002,UP,2.9,4
six,2003,Punjab,3.2,5


In [58]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

df2['debt'] = val

In [59]:
df2

Unnamed: 0,year,state,pop,debt
one,2000,AP,1.5,
two,2001,Maharashtra,1.7,-1.2
three,2002,Himanchal,3.6,
four,2001,Punjab,2.4,-1.5
five,2002,UP,2.9,-1.7
six,2003,Punjab,3.2,


In [61]:
df2['good_income'] = df2['pop'] < 2.0

In [62]:
df2

Unnamed: 0,year,state,pop,debt,good_income
one,2000,AP,1.5,,True
two,2001,Maharashtra,1.7,-1.2,True
three,2002,Himanchal,3.6,,False
four,2001,Punjab,2.4,-1.5,False
five,2002,UP,2.9,-1.7,False
six,2003,Punjab,3.2,,False


In [64]:
del df2['good_income']

In [65]:
df2

Unnamed: 0,year,state,pop,debt
one,2000,AP,1.5,
two,2001,Maharashtra,1.7,-1.2
three,2002,Himanchal,3.6,
four,2001,Punjab,2.4,-1.5
five,2002,UP,2.9,-1.7
six,2003,Punjab,3.2,


In [66]:
df2.loc[df2.year==2000, 'debt'] = 1.0

In [67]:
df2

Unnamed: 0,year,state,pop,debt
one,2000,AP,1.5,1.0
two,2001,Maharashtra,1.7,-1.2
three,2002,Himanchal,3.6,
four,2001,Punjab,2.4,-1.5
five,2002,UP,2.9,-1.7
six,2003,Punjab,3.2,


In [69]:
df2.loc['three', 'debt'] = 1.5

In [70]:
df2

Unnamed: 0,year,state,pop,debt
one,2000,AP,1.5,1.0
two,2001,Maharashtra,1.7,-1.2
three,2002,Himanchal,3.6,1.5
four,2001,Punjab,2.4,-1.5
five,2002,UP,2.9,-1.7
six,2003,Punjab,3.2,


In [71]:
pop = {'Mumbai': {2001: 2.4, 2002: 2.9},'Pune': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [72]:
df3 = pd.DataFrame(pop)

In [73]:
df3

Unnamed: 0,Mumbai,Pune
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [74]:
df3.T

Unnamed: 0,2000,2001,2002
Mumbai,,2.4,2.9
Pune,1.5,1.7,3.6


In [77]:
df3.index.name = 'year'

In [78]:
df3

year,Mumbai,Pune
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [80]:
df3.columns.name = 'city'

In [81]:
df3

city,Mumbai,Pune
year,,
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [82]:
arr = df3.values

In [83]:
arr

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

In [84]:
arr.shape

(3, 2)

In [85]:
df2

Unnamed: 0,year,state,pop,debt
one,2000,AP,1.5,1.0
two,2001,Maharashtra,1.7,-1.2
three,2002,Himanchal,3.6,1.5
four,2001,Punjab,2.4,-1.5
five,2002,UP,2.9,-1.7
six,2003,Punjab,3.2,


In [86]:
arr2 = df2.values

In [87]:
arr2

array([[2000, 'AP', 1.5, 1.0],
       [2001, 'Maharashtra', 1.7, -1.2],
       [2002, 'Himanchal', 3.6, 1.5],
       [2001, 'Punjab', 2.4, -1.5],
       [2002, 'UP', 2.9, -1.7],
       [2003, 'Punjab', 3.2, nan]], dtype=object)

### Index Objects

Index objects hold the axis labels of a Series or DataFrame. They are immutable, so their elements can't be modified in place.

In [89]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [90]:
obj

a    0
b    1
c    2
dtype: int64

In [91]:
index = pd.Index(np.arange(3))

In [92]:
index

Int64Index([0, 1, 2], dtype='int64')

In [93]:
index[1] = '4'  # TypeError

TypeError: Index does not support mutable operations

Immutability makes it safer to share Index objects among data structures:

In [95]:
obj2 = pd.Series([1.5, -2.5, 0], index=index)

In [96]:
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64
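As a rough illustration of what immutability means in practice: in-place assignment fails, while methods such as append return a new Index and leave the original unchanged. The values used here are arbitrary.

```python
import numpy as np
import pandas as pd

index = pd.Index(np.arange(3))

# In-place assignment is rejected with a TypeError.
try:
    index[1] = 4
    assignment_failed = False
except TypeError:
    assignment_failed = True
print(assignment_failed)

# append returns a new Index and leaves the original untouched.
longer = index.append(pd.Index([3, 4]))
print(list(index))
print(list(longer))
```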

Unlike Python sets, a pandas Index can contain duplicate labels:

In [97]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

In [98]:
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

## Essential Functionality

This section will walk you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame.

### Reindexing

An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index.

In [99]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [100]:
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [102]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [103]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [110]:
df = pd.DataFrame(np.arange(9).reshape((3, 3)),index=['a', 'c', 'd'],columns=['Mumbai', 'Pune', 'Bengaluru'])

In [111]:
df

Unnamed: 0,Mumbai,Pune,Bengaluru
a,0,1,2
c,3,4,5
d,6,7,8


In [112]:
df2 = df.reindex(['a','b','c'])

In [113]:
df2

Unnamed: 0,Mumbai,Pune,Bengaluru
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0


Columns can be reindexed with the columns keyword.
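A minimal sketch of column reindexing, using a frame with the same values as df above. The 'Delhi' column is a made-up label that does not exist in the data, so it comes back filled with NaN; the fill_value argument shows one way to avoid NaN when new labels are introduced.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Mumbai', 'Pune', 'Bengaluru'])

# Reorder the columns and introduce a new one; 'Delhi' is a made-up
# column used only for illustration, so it is filled with NaN.
by_columns = frame.reindex(columns=['Pune', 'Delhi', 'Mumbai'])
print(by_columns)

# fill_value replaces the NaN that reindexing would otherwise introduce.
filled = frame.reindex(index=['a', 'b', 'c'], fill_value=0)
print(filled)
```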

### Dropping entries from an axis

In [115]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [116]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [117]:
obj.drop(['d'])

a    0.0
b    1.0
c    2.0
e    4.0
dtype: float64

In [118]:
df2

Unnamed: 0,Mumbai,Pune,Bengaluru
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0


In [122]:
df

Unnamed: 0,Mumbai,Pune,Bengaluru
a,0,1,2
c,3,4,5
d,6,7,8


In [121]:
df.drop(['a', 'c'])

Unnamed: 0,Mumbai,Pune,Bengaluru
d,6,7,8


In [123]:
df.drop(['Mumbai', 'Pune'], axis=1)

Unnamed: 0,Bengaluru
a,2
c,5
d,8


In [124]:
df

Unnamed: 0,Mumbai,Pune,Bengaluru
a,0,1,2
c,3,4,5
d,6,7,8


In [125]:
df.drop(['Mumbai', 'Pune'], axis=1, inplace=True)

In [126]:
df

Unnamed: 0,Bengaluru
a,2
c,5
d,8


### Indexing, Selection and Filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers.

In [127]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [128]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [129]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [130]:
obj[['b','a','d']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [131]:
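obj.iloc[[1, 3]]  # select by position; plain obj[[1, 3]] is deprecated for label-indexed Series in recent pandas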

b    1.0
d    3.0
dtype: float64

In [132]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64
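One difference from NumPy-style slicing is worth a short sketch: slicing a Series by label includes the end label, while slicing by position excludes the stop position.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Slicing by label includes the end label 'c' ...
by_label = obj['b':'c']
print(by_label)

# ... whereas slicing by position excludes the stop position 2.
by_position = obj.iloc[1:2]
print(by_position)
```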

In [135]:
df = pd.DataFrame(np.arange(9).reshape((3, 3)),index=['a', 'c', 'd'],columns=['Mumbai', 'Pune', 'Bengaluru'])

In [136]:
df

Unnamed: 0,Mumbai,Pune,Bengaluru
a,0,1,2
c,3,4,5
d,6,7,8


In [137]:
df[['Mumbai','Pune']]

Unnamed: 0,Mumbai,Pune
a,0,1
c,3,4
d,6,7


In [138]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=['Punjab', 'Gujrat', 'Kerala', 'UP'],
                    columns=['one', 'two', 'three', 'four'])

In [139]:
data

Unnamed: 0,one,two,three,four
Punjab,0,1,2,3
Gujrat,4,5,6,7
Kerala,8,9,10,11
UP,12,13,14,15


In [140]:
data.index

Index(['Punjab', 'Gujrat', 'Kerala', 'UP'], dtype='object')

In [141]:
data[data['one'] < 5]

Unnamed: 0,one,two,three,four
Punjab,0,1,2,3
Gujrat,4,5,6,7


In [142]:
data['two']

Punjab     1
Gujrat     5
Kerala     9
UP        13
Name: two, dtype: int64

In [143]:
data[['one', 'three']]

Unnamed: 0,one,three
Punjab,0,2
Gujrat,4,6
Kerala,8,10
UP,12,14


In [144]:
data < 5

Unnamed: 0,one,two,three,four
Punjab,True,True,True,True
Gujrat,True,False,False,False
Kerala,False,False,False,False
UP,False,False,False,False


### Selection with loc and iloc

In [145]:
data

Unnamed: 0,one,two,three,four
Punjab,0,1,2,3
Gujrat,4,5,6,7
Kerala,8,9,10,11
UP,12,13,14,15


For label-based indexing on the rows of a DataFrame, pandas provides the special indexing operators loc and iloc. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either axis labels (loc) or integers (iloc).

In [147]:
data.loc['Punjab', ['two', 'four']]

two     1
four    3
Name: Punjab, dtype: int64

In [148]:
data.iloc[0]

one      0
two      1
three    2
four     3
Name: Punjab, dtype: int64

In [149]:
data.iloc[0, ['two', 'four']]  # fails: iloc accepts integer positions, not labels

TypeError: cannot perform reduce with flexible type

In [150]:
data

Unnamed: 0,one,two,three,four
Punjab,0,1,2,3
Gujrat,4,5,6,7
Kerala,8,9,10,11
UP,12,13,14,15


In [151]:
data.iloc[0, [0,1]]

one    0
two    1
Name: Punjab, dtype: int64

In [152]:
data.loc[:'Kerala', ['two', 'three']]

Unnamed: 0,two,three
Punjab,1,2
Gujrat,5,6
Kerala,9,10


In [153]:
data.iloc[:, :3]

Unnamed: 0,one,two,three
Punjab,0,1,2
Gujrat,4,5,6
Kerala,8,9,10
UP,12,13,14


In [155]:
data.iloc[:, :1]

Unnamed: 0,one
Punjab,0
Gujrat,4
Kerala,8
UP,12


In [156]:
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Gujrat,4,5,6
Kerala,8,9,10
UP,12,13,14


In [159]:
data

Unnamed: 0,one,two,three,four
Punjab,0,1,2,3
Gujrat,4,5,6,7
Kerala,8,9,10,11
UP,12,13,14,15


In [158]:
data.iloc[-1]

one      12
two      13
three    14
four     15
Name: UP, dtype: int64

In [160]:
data[:2] # first 2 rows

Unnamed: 0,one,two,three,four
Punjab,0,1,2,3
Gujrat,4,5,6,7
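To tie the loc and iloc examples together, here is a short sketch that makes the same selection twice, once by labels and once by integer positions, and then uses a boolean mask with loc. It rebuilds the same data frame so that it can run on its own.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Punjab', 'Gujrat', 'Kerala', 'UP'],
                    columns=['one', 'two', 'three', 'four'])

# Same selection twice: once by labels, once by integer positions.
by_label = data.loc[['Gujrat', 'UP'], ['two', 'four']]
by_position = data.iloc[[1, 3], [1, 3]]
print(by_label.equals(by_position))

# loc also accepts a boolean mask for the rows.
large_rows = data.loc[data['three'] > 5, ['one', 'three']]
print(large_rows)
```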


### Arithmetic and Data Alignment

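When two objects are combined arithmetically, pandas aligns them on their index labels, and labels that appear in only one of them produce NaN in the result. In the case of DataFrame, alignment is performed on both the rows and the columns.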

In [162]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),index=['MP', 'Tamilnadu', 'Rajasthan'])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),index=['Goa', 'MP', 'Tamilnadu', 'Rajasthan'])

In [163]:
df1

Unnamed: 0,b,c,d
MP,0.0,1.0,2.0
Tamilnadu,3.0,4.0,5.0
Rajasthan,6.0,7.0,8.0


In [164]:
df2

Unnamed: 0,b,d,e
Goa,0.0,1.0,2.0
MP,3.0,4.0,5.0
Tamilnadu,6.0,7.0,8.0
Rajasthan,9.0,10.0,11.0


In [165]:
df1 + df2

Unnamed: 0,b,c,d,e
Goa,,,,
MP,3.0,,6.0,
Rajasthan,15.0,,18.0,
Tamilnadu,9.0,,12.0,


In [166]:
df1 - df2

Unnamed: 0,b,c,d,e
Goa,,,,
MP,-3.0,,-2.0,
Rajasthan,-3.0,,-2.0,
Tamilnadu,-3.0,,-2.0,
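If the NaN values produced by alignment are not wanted, one option is the add method with fill_value, sketched below with the same df1 and df2. A value missing from one frame is treated as 0, but a cell missing from both frames is still NaN.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['MP', 'Tamilnadu', 'Rajasthan'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Goa', 'MP', 'Tamilnadu', 'Rajasthan'])

# Treat a value that is missing from one frame as 0 before adding.
total = df1.add(df2, fill_value=0)
print(total)
```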


### Function Application and Mapping

In [167]:
df1

Unnamed: 0,b,c,d
MP,0.0,1.0,2.0
Tamilnadu,3.0,4.0,5.0
Rajasthan,6.0,7.0,8.0


In [169]:
df1.apply(lambda x: x.min())

b    0.0
c    1.0
d    2.0
dtype: float64

In [171]:
df1.c.map(lambda x: x+2)

MP           3.0
Tamilnadu    6.0
Rajasthan    9.0
Name: c, dtype: float64

In [172]:
df1

Unnamed: 0,b,c,d
MP,0.0,1.0,2.0
Tamilnadu,3.0,4.0,5.0
Rajasthan,6.0,7.0,8.0


In [173]:
df1.applymap(lambda x: x+1000.)

Unnamed: 0,b,c,d
MP,1000.0,1001.0,1002.0
Tamilnadu,1003.0,1004.0,1005.0
Rajasthan,1006.0,1007.0,1008.0
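In the examples above, apply passes each column to the function as a Series, while map and applymap work element by element. As a sketch, apply can also work row by row when given axis='columns'. The spread helper is a hypothetical function written for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['MP', 'Tamilnadu', 'Rajasthan'])

def spread(values):
    """Difference between the largest and smallest value (hypothetical helper)."""
    return values.max() - values.min()

per_column = df1.apply(spread)                 # one result per column
per_row = df1.apply(spread, axis='columns')    # one result per row
print(per_column)
print(per_row)
```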


### Sorting and Ranking

In [174]:
df2

Unnamed: 0,b,d,e
Goa,0.0,1.0,2.0
MP,3.0,4.0,5.0
Tamilnadu,6.0,7.0,8.0
Rajasthan,9.0,10.0,11.0


In [175]:
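df3 = df2  # df3 is another name for the same object, not a copy
df3.reset_index()  # returns a new DataFrame; df3 itself is unchanged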

Unnamed: 0,index,b,d,e
0,Goa,0.0,1.0,2.0
1,MP,3.0,4.0,5.0
2,Tamilnadu,6.0,7.0,8.0
3,Rajasthan,9.0,10.0,11.0


In [176]:
df3

Unnamed: 0,b,d,e
Goa,0.0,1.0,2.0
MP,3.0,4.0,5.0
Tamilnadu,6.0,7.0,8.0
Rajasthan,9.0,10.0,11.0


In [177]:
df3.reset_index(drop=True)

Unnamed: 0,b,d,e
0,0.0,1.0,2.0
1,3.0,4.0,5.0
2,6.0,7.0,8.0
3,9.0,10.0,11.0


In [178]:
df3 = df3.reset_index(drop=True)

In [179]:
df3

Unnamed: 0,b,d,e
0,0.0,1.0,2.0
1,3.0,4.0,5.0
2,6.0,7.0,8.0
3,9.0,10.0,11.0


In [180]:
df3.index = [0,2,1,3]

In [181]:
df3

Unnamed: 0,b,d,e
0,0.0,1.0,2.0
2,3.0,4.0,5.0
1,6.0,7.0,8.0
3,9.0,10.0,11.0


In [183]:
df3.sort_index()

Unnamed: 0,b,d,e
0,0.0,1.0,2.0
1,6.0,7.0,8.0
2,3.0,4.0,5.0
3,9.0,10.0,11.0


In [185]:
df3

Unnamed: 0,b,d,e
0,0.0,1.0,2.0
2,3.0,4.0,5.0
1,6.0,7.0,8.0
3,9.0,10.0,11.0


In [184]:
df3.sort_index(axis=1) # sort according to columns lexicographically

Unnamed: 0,b,d,e
0,0.0,1.0,2.0
2,3.0,4.0,5.0
1,6.0,7.0,8.0
3,9.0,10.0,11.0


In [186]:
df3.sort_index(axis=1, ascending=False)

Unnamed: 0,e,d,b
0,2.0,1.0,0.0
2,5.0,4.0,3.0
1,8.0,7.0,6.0
3,11.0,10.0,9.0


In [187]:
df3.sort_values(by='b', ascending=False)

Unnamed: 0,b,d,e
3,9.0,10.0,11.0
1,6.0,7.0,8.0
2,3.0,4.0,5.0
0,0.0,1.0,2.0


In [188]:
df3.sort_values(by='b')

Unnamed: 0,b,d,e
0,0.0,1.0,2.0
2,3.0,4.0,5.0
1,6.0,7.0,8.0
3,9.0,10.0,11.0


In [194]:
df4 = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 2, 8, 1]})

In [195]:
df4

Unnamed: 0,b,a
0,4,0
1,7,2
2,-3,8
3,2,1


In [196]:
df4.sort_values(by='b')

Unnamed: 0,b,a
2,-3,8
3,2,1
0,4,0
1,7,2


In [197]:
df4.sort_values(by=['b', 'a'])

Unnamed: 0,b,a
2,-3,8
3,2,1
0,4,0
1,7,2


In [198]:
df4.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
0,4,0
3,2,1
1,7,2
2,-3,8


In [199]:
df5 = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 0, 8, 1]})

In [201]:
df5.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
0,4,0
1,7,0
3,2,1
2,-3,8


Ranking assigns ranks from one up to the number of valid data points in an array. Both Series and DataFrame have a rank method; by default it breaks ties by assigning each tied value the mean of the ranks that its group occupies.

In [202]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

In [204]:
np.unique(obj)

array([-5,  0,  2,  4,  7])

In [209]:
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [205]:
obj.sort_values()

1   -5
5    0
4    2
3    4
6    4
0    7
2    7
dtype: int64

| sorted value | index | rank |
|---|---|---|
| -5 | 1 | 1 |
| 0 | 5 | 2 |
| 2 | 4 | 3 |
| 4 | 3 | (4+5)/2 = 4.5 |
| 4 | 6 | (4+5)/2 = 4.5 |
| 7 | 0 | (6+7)/2 = 6.5 |
| 7 | 2 | (6+7)/2 = 6.5 |

After sorting, index 1 has rank 1, index 5 has rank 2 and index 4 has rank 3. The two 4s occupy positions 4 and 5, so indexes 3 and 6 each get rank (4+5)/2 = 4.5. The two 7s occupy positions 6 and 7, so indexes 0 and 2 each get rank (6+7)/2 = 6.5.

In [203]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Ranks can also be assigned according to the order in which they’re observed in the data:

In [206]:
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

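With method='first', tied values are ranked in the order in which they appear in the data: the 4 at index 3 gets rank 4 and the 4 at index 6 gets rank 5. Likewise, the 7 at index 0 gets rank 6 and the 7 at index 2 gets rank 7.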

In [207]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

You can rank in descending order and assign tied values the maximum rank in the group:

| sorted value (descending) | index | rank |
|---|---|---|
| 7 | 0 | 2 |
| 7 | 2 | 2 |
| 4 | 3 | 4 |
| 4 | 6 | 4 |
| 2 | 4 | 5 |
| 0 | 5 | 6 |
| -5 | 1 | 7 |

In [208]:
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
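To compare the tie-breaking rules side by side, the sketch below ranks the same Series with several values of the method argument and collects the results in one DataFrame, one column per method.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, so the differences are easy to compare.
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({method: obj.rank(method=method) for method in methods})
print(ranks)
```

The 'dense' method is like 'min', except that ranks always increase by 1 between groups rather than by the number of tied elements.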

### Axis Indexes with Duplicate Labels

It is possible to have duplicate index labels in the same Series or DataFrame.
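A brief sketch of how duplicate labels behave, using an arbitrary Series: the index reports that it is not unique, and selecting a repeated label returns every matching entry.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=['a', 'a', 'b', 'b', 'c'])

# is_unique is False here because 'a' and 'b' repeat.
print(obj.index.is_unique)

# A repeated label selects every matching entry and returns a Series,
# while a unique label returns a single scalar value.
print(obj['a'])
print(obj['c'])
```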

## Summarizing and Descriptive Statistics

In [210]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],columns=['one', 'two'])

df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


Calling DataFrame’s sum method returns a Series containing column sums:

In [211]:
df.sum()

one    9.25
two   -5.80
dtype: float64

Passing axis='columns' or axis=1 sums across the columns instead:

In [212]:

df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
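Row c sums to 0.00 in the output above because missing values are skipped by default. The sketch below shows two ways to change that behaviour: skipna=False and min_count. It rebuilds the same df so that it can run on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Default: NaN is skipped, so the all-NaN row 'c' sums to 0.0.
default_sum = df.sum(axis='columns')

# skipna=False: any NaN in a row makes that row's result NaN.
strict_sum = df.sum(axis='columns', skipna=False)

# min_count=1: require at least one valid value, so only row 'c' is NaN.
min_count_sum = df.sum(axis='columns', min_count=1)

print(pd.DataFrame({'default': default_sum,
                    'skipna=False': strict_sum,
                    'min_count=1': min_count_sum}))
```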

Some methods, like idxmin and idxmax, return indirect statistics like the index value where the minimum or maximum values are attained:

In [214]:
df.idxmax()

one    b
two    d
dtype: object

In [215]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [216]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


### Correlations and Covariance

If you don't have pandas-datareader installed, run the following command: !pip install pandas-datareader

In [217]:
import pandas_datareader.data as web
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

price = pd.DataFrame({ticker: data['Adj Close']
                     for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                      for ticker, data in all_data.items()})

In [222]:
all_data['AAPL'].head()

Date,High,Low,Open,Close,Volume,Adj Close
2009-12-31,30.478571,30.08,30.447144,30.104286,88102700.0,20.159719
2010-01-04,30.642857,30.34,30.49,30.572857,123432400.0,20.473503
2010-01-05,30.798571,30.464285,30.657143,30.625713,150476200.0,20.508902
2010-01-06,30.747143,30.107143,30.625713,30.138571,138040000.0,20.18268
2010-01-07,30.285715,29.864286,30.25,30.082857,119282800.0,20.145369


In [223]:
price.head()

Date,AAPL,IBM,MSFT,GOOG
2009-12-31,20.159719,100.460022,24.345514,307.986847
2010-01-04,20.473503,101.649574,24.720928,311.349976
2010-01-05,20.508902,100.421654,24.728914,309.978882
2010-01-06,20.18268,99.76931,24.57715,302.164703
2010-01-07,20.145369,99.42395,24.321552,295.130463


In [224]:
volume.head()

Date,AAPL,IBM,MSFT,GOOG
2009-12-31,88102700.0,4223400.0,31929700.0,2455400.0
2010-01-04,123432400.0,6155300.0,38409100.0,3937800.0
2010-01-05,150476200.0,6841400.0,49749600.0,6048500.0
2010-01-06,138040000.0,5605300.0,58182400.0,8009000.0
2010-01-07,119282800.0,5840600.0,50559700.0,12912000.0


In [225]:
returns = price.pct_change()

In [227]:
returns.head()

Date,AAPL,IBM,MSFT,GOOG
2009-12-31,,,,
2010-01-04,0.015565,0.011841,0.01542,0.01092
2010-01-05,0.001729,-0.01208,0.000323,-0.004404
2010-01-06,-0.015906,-0.006496,-0.006137,-0.025209
2010-01-07,-0.001849,-0.003462,-0.0104,-0.023279


Covariance measures how two random variables vary together: it is positive when they tend to move in the same direction and negative when they tend to move in opposite directions. Its magnitude depends on the units of the variables.

Correlation is the covariance rescaled by the standard deviations of the two variables. This gives a unit-free value between -1 and 1 that describes how strongly the two variables move in tandem.

The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance:

In [228]:
returns['MSFT'].corr(returns['IBM'])

0.49071829998695543

In [229]:
returns['MSFT'].cov(returns['IBM'])

8.758707971407001e-05
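The relationship between the two numbers can be checked by hand: the (Pearson) correlation is the covariance divided by the product of the two standard deviations. The sketch below uses synthetic random data rather than the downloaded stock returns, so it runs without a network connection; the series x and y are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns (random numbers, not real market data).
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(0, 0.01, size=250))
y = 0.5 * x + pd.Series(rng.normal(0, 0.01, size=250))

covariance = x.cov(y)
correlation = x.corr(y)

# Pearson correlation is the covariance divided by both standard deviations.
by_hand = covariance / (x.std() * y.std())
print(np.isclose(correlation, by_hand))
```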

In [230]:
returns.corr()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,1.0,0.375762,0.447317,0.454454
IBM,0.375762,1.0,0.490718,0.411964
MSFT,0.447317,0.490718,1.0,0.537255
GOOG,0.454454,0.411964,0.537255,1.0


In [231]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


### Unique Values, value counts

In [233]:
df.nunique()

one    3
two    2
dtype: int64

In [234]:
df['one'].unique()

array([1.4 , 7.1 ,  nan, 0.75])

In [235]:
df['two'].value_counts()

-1.3    1
-4.5    1
Name: two, dtype: int64
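Note that unique() lists NaN in the output above, while nunique() and value_counts() leave it out. A short sketch of the dropna argument, using a Series with the same values as column one of df, shows how to include missing values in the counts.

```python
import numpy as np
import pandas as pd

one = pd.Series([1.4, 7.1, np.nan, 0.75], index=['a', 'b', 'c', 'd'])

distinct = one.nunique()                       # NaN is not counted
distinct_with_nan = one.nunique(dropna=False)  # NaN counts as one more value
counts = one.value_counts(dropna=False)        # NaN gets its own row

print(distinct, distinct_with_nan)
print(counts)
```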