# Introduction to Pandas

In this section we will learn how to use pandas for data analysis. You can think of pandas as a programmable, far more powerful version of Excel. Go through the notebooks in this section of the course in the following order:

* Introduction to Pandas
* Series
* DataFrames
* Missing Data
* GroupBy
* Merging, Joining, and Concatenating
* Operations

In [1]:
import numpy as np
import pandas as pd

### Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [2]:
labels = ['a','b','c','d']
my_list = [10,20,30,40]
arr = np.array([10,20,30,40,50,60])
d = {'a':10,'B':200,'c':30}

In [3]:
my_list

[10, 20, 30, 40]

In [4]:
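#type(pd.Series(data=my_list))
# Positional arguments are (data, index): here labels are the data and my_list is the index
pd.Series(labels,my_list)
#pd.Series(my_list,labels)             # values from my_list, index from labels
#pd.Series(index=labels,data=my_list)  # same result, written with keywords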

10    a
20    b
30    c
40    d
dtype: object
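
The cell above passes `labels` first, so the strings end up as the values and the integers as the index. A minimal sketch of the difference, reusing the notebook's `labels` and `my_list` (the names `s` and `swapped` are illustrative only):

```python
import pandas as pd

labels = ['a', 'b', 'c', 'd']
my_list = [10, 20, 30, 40]

# Keywords make the roles explicit: values from my_list, index from labels.
s = pd.Series(data=my_list, index=labels)

# Positional order is (data, index), so swapping the arguments swaps the roles.
swapped = pd.Series(labels, my_list)

print(s.loc['b'])        # value stored under the label 'b'
print(swapped.loc[20])   # here the integers act as labels
```

Using the `data=` and `index=` keywords is usually the safer habit, since the intent stays clear even if the argument order changes.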

In [5]:
#pd.Series(arr)
#pd.Series(arr,index=['a','B','c','D','e','F'])  # the index needs one label per value; arr has 6 values, labels only 4

In [6]:
d

{'a': 10, 'B': 200, 'c': 30}

In [7]:
print(d.keys())
print(d.values())

dict_keys(['a', 'B', 'c'])
dict_values([10, 200, 30])


In [8]:
pd.Series(d)

a     10
B    200
c     30
dtype: int64

In [10]:
#pd.Series(index=['Sam','Ram','Sachin','Arun','Kabeer'],data=[32,28,26,34,20])
#pd.Series(['Sam','Ram','Sachin','Arun','Kabeer'],[32,28,26,34,20])
pd.Series([32,28,26,56,34],['Sam','Ram','Sachin','Arun','Kabeer'])

Sam       32
Ram       28
Sachin    26
Arun      56
Kabeer    34
dtype: int64

In [11]:
pd.Series([32,28,26,34])

0    32
1    28
2    26
3    34
dtype: int64

In [12]:
type(pd.Series([32,28,26,34,20],['Sam','Ram','Sachin','Arun','Kabeer']))

pandas.core.series.Series

# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language.
We can think of a DataFrame as a bunch of Series objects put together to share the same index.
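
To make the "Series sharing an index" idea concrete, here is a small sketch that builds a DataFrame from two Series. The names `w`, `x`, and `frame` are made up for illustration and do not appear in the notebook:

```python
import pandas as pd

w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0], index=['A', 'B'])  # no entry for 'C'

# Each dictionary key becomes a column; the Series are aligned on their index.
frame = pd.DataFrame({'W': w, 'X': x})

# The label 'C' is missing from x, so that cell is filled with NaN.
print(frame)
print(type(frame['W']))  # a single column is again a Series
```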

In [13]:
#import pandas as pd
#import numpy as np

In [14]:
from numpy.random import randn
np.random.seed(50)

In [15]:
np.random.seed(101)
randn(5)  # five samples from the standard normal distribution

array([2.70684984, 0.62813271, 0.90796945, 0.50382575, 0.65111795])

In [16]:
'A_B_C_D_E'.split('_')

['A', 'B', 'C', 'D', 'E']

In [17]:
df = pd.DataFrame(randn(5,4),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['W', 'X', 'Y', 'Z'])
df

Unnamed: 0,W,X,Y,Z
A,-0.319318,-0.848077,0.605965,-2.018168
B,0.740122,0.528813,-0.589001,0.188695
C,-0.758872,-0.933237,0.955057,0.190794
D,1.978757,2.605967,0.683509,0.302665
E,1.693723,-1.706086,-1.159119,-0.134841


In [18]:
type(df)

pandas.core.frame.DataFrame

## Selection and Indexing

Let's look at the various ways to select data from a DataFrame.

In [19]:
#df['W']   # bracket notation selects a single column
#df.Z      # attribute notation also works; a single column comes back as a Series
type(df['Z'])
#df.W

pandas.core.series.Series

In [20]:
df[['Z','X']]

Unnamed: 0,Z,X
A,-2.018168,-0.848077
B,0.188695,0.528813
C,0.190794,-0.933237
D,0.302665,2.605967
E,-0.134841,-1.706086


In [21]:
df['Z']

A   -2.018168
B    0.188695
C    0.190794
D    0.302665
E   -0.134841
Name: Z, dtype: float64

In [22]:
type(df[['Z','X']])

pandas.core.frame.DataFrame

In [23]:
type(df.X)

pandas.core.series.Series

In [24]:
New_var=df.W
New_var

A   -0.319318
B    0.740122
C   -0.758872
D    1.978757
E    1.693723
Name: W, dtype: float64

In [25]:
#New_var+5

In [26]:
# Pass a list of column names
df[['W','Y']]

Unnamed: 0,W,Y
A,-0.319318,0.605965
B,0.740122,-0.589001
C,-0.758872,0.955057
D,1.978757,0.683509
E,1.693723,-1.159119


In [27]:
type(df[['X','Z']])

pandas.core.frame.DataFrame

**Creating a new column:**

In [28]:
df

Unnamed: 0,W,X,Y,Z
A,-0.319318,-0.848077,0.605965,-2.018168
B,0.740122,0.528813,-0.589001,0.188695
C,-0.758872,-0.933237,0.955057,0.190794
D,1.978757,2.605967,0.683509,0.302665
E,1.693723,-1.706086,-1.159119,-0.134841


In [29]:
df['new'] = df['W'] + df['Y']
df

Unnamed: 0,W,X,Y,Z,new
A,-0.319318,-0.848077,0.605965,-2.018168,0.286647
B,0.740122,0.528813,-0.589001,0.188695,0.151122
C,-0.758872,-0.933237,0.955057,0.190794,0.196184
D,1.978757,2.605967,0.683509,0.302665,2.662266
E,1.693723,-1.706086,-1.159119,-0.134841,0.534604


**Removing columns:**

In [30]:
df.drop('new',axis=1)     # returns a new DataFrame; df itself is unchanged unless inplace=True
df1 = df.drop('new',axis=1)      
df1

Unnamed: 0,W,X,Y,Z
A,-0.319318,-0.848077,0.605965,-2.018168
B,0.740122,0.528813,-0.589001,0.188695
C,-0.758872,-0.933237,0.955057,0.190794
D,1.978757,2.605967,0.683509,0.302665
E,1.693723,-1.706086,-1.159119,-0.134841


In [31]:
df1

Unnamed: 0,W,X,Y,Z
A,-0.319318,-0.848077,0.605965,-2.018168
B,0.740122,0.528813,-0.589001,0.188695
C,-0.758872,-0.933237,0.955057,0.190794
D,1.978757,2.605967,0.683509,0.302665
E,1.693723,-1.706086,-1.159119,-0.134841


In [32]:
df.drop('new',axis=1,inplace=True)
df

Unnamed: 0,W,X,Y,Z
A,-0.319318,-0.848077,0.605965,-2.018168
B,0.740122,0.528813,-0.589001,0.188695
C,-0.758872,-0.933237,0.955057,0.190794
D,1.978757,2.605967,0.683509,0.302665
E,1.693723,-1.706086,-1.159119,-0.134841


In [33]:
#Can also drop rows this way:
df.drop('A',axis=0)

Unnamed: 0,W,X,Y,Z
B,0.740122,0.528813,-0.589001,0.188695
C,-0.758872,-0.933237,0.955057,0.190794
D,1.978757,2.605967,0.683509,0.302665
E,1.693723,-1.706086,-1.159119,-0.134841


In [34]:
df

Unnamed: 0,W,X,Y,Z
A,-0.319318,-0.848077,0.605965,-2.018168
B,0.740122,0.528813,-0.589001,0.188695
C,-0.758872,-0.933237,0.955057,0.190794
D,1.978757,2.605967,0.683509,0.302665
E,1.693723,-1.706086,-1.159119,-0.134841
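
The cells above show that `drop` leaves the original DataFrame alone unless `inplace=True` is passed. A compact sketch of the same behavior on a tiny, made-up frame (`demo`, `trimmed`, and `fewer_rows` are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [4, 6]})

# drop returns a new object; columns='new' is equivalent to drop('new', axis=1).
trimmed = demo.drop(columns='new')

# index=0 is equivalent to drop(0, axis=0) and removes a row instead.
fewer_rows = demo.drop(index=0)

print(list(demo.columns))     # the original still contains 'new'
print(list(trimmed.columns))
print(list(fewer_rows.index))
```

Assigning the result to a name, as done here, is a common alternative to `inplace=True`.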


**Selecting rows:**

In [35]:
#df.loc['A']    # select a row by its index label
#df.iloc[3]     # select a row by its integer position
print(type(df.iloc[3]))

<class 'pandas.core.series.Series'>


In [36]:
df.iloc[2:4]
print(type(df.iloc[2:4]))

<class 'pandas.core.frame.DataFrame'>


**Selecting a subset of rows and columns:**

In [37]:
#df.loc['B','Y']
df.loc[['A','E'],['W','Z']]

Unnamed: 0,W,Z
A,-0.319318,-2.018168
E,1.693723,-0.134841
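
One detail worth seeing side by side: label slices with `loc` include the end label, while position slices with `iloc` stop before the end position. A small sketch on a made-up frame (`demo` and the other names are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2, 3], 'Z': [7, 8, 9]}, index=['A', 'B', 'C'])

by_label = demo.loc['A':'B', 'W']   # label slice: both 'A' and 'B' are included
by_position = demo.iloc[0:2, 0]     # position slice: positions 0 and 1 only

scalar = demo.loc['B', 'Z']         # a single cell, addressed by row and column label

print(by_label)
print(by_position)
print(scalar)
```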


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [38]:
df

Unnamed: 0,W,X,Y,Z
A,-0.319318,-0.848077,0.605965,-2.018168
B,0.740122,0.528813,-0.589001,0.188695
C,-0.758872,-0.933237,0.955057,0.190794
D,1.978757,2.605967,0.683509,0.302665
E,1.693723,-1.706086,-1.159119,-0.134841


In [39]:
df < 0 #Logical output

Unnamed: 0,W,X,Y,Z
A,True,True,False,True
B,False,False,True,False
C,True,True,False,False
D,False,False,False,False
E,False,True,True,True


In [40]:
df[df > 0]

Unnamed: 0,W,X,Y,Z
A,,,0.605965,
B,0.740122,0.528813,,0.188695
C,,,0.955057,0.190794
D,1.978757,2.605967,0.683509,0.302665
E,1.693723,,,


In [41]:
df[df['Z'] < 0]

Unnamed: 0,W,X,Y,Z
A,-0.319318,-0.848077,0.605965,-2.018168
E,1.693723,-1.706086,-1.159119,-0.134841


In [42]:
#df[df['Y']<0]['W']  # 
df[df['Y']> 0][['W','X']]

Unnamed: 0,W,X
A,-0.319318,-0.848077
C,-0.758872,-0.933237
D,1.978757,2.605967


To combine two conditions, use `|` (or) and `&` (and), and wrap each condition in parentheses:

In [43]:
df[(df['W'] < 0) & (df['Y'] > 0)]

Unnamed: 0,W,X,Y,Z
A,-0.319318,-0.848077,0.605965,-2.018168
C,-0.758872,-0.933237,0.955057,0.190794
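
The parentheses matter because `&` and `|` bind more tightly than comparisons such as `<`, and the plain Python keywords `and`/`or` raise a `ValueError` when applied to a Series. A short sketch of AND, OR, and NOT on a made-up frame (`demo` and the result names are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({'W': [-1, 2, -3, 4], 'Y': [5, -6, 7, 8]})

both = demo[(demo['W'] < 0) & (demo['Y'] > 0)]     # rows meeting both conditions
either = demo[(demo['W'] < 0) | (demo['Y'] > 7)]   # rows meeting at least one
negated = demo[~(demo['W'] < 0)]                   # rows where the condition is False

print(both)
print(either)
print(negated)
```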


# Missing Data

Let's look at a few methods for dealing with missing data in pandas:

In [44]:
import numpy as np
import pandas as pd

In [45]:
df = pd.DataFrame({'A':[1,5,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3],
                  'D':[2,8,6]})
df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,2
1,5.0,,2,8
2,,,3,6


In [46]:
df.dropna()

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,2


In [47]:
df.dropna(axis=1)

Unnamed: 0,C,D
0,1,2
1,2,8
2,3,6


In [48]:
df.fillna(value=-10)

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,2
1,5.0,-10.0,2,8
2,-10.0,-10.0,3,6


In [49]:
df['A'].mean()

3.0

In [50]:
df['A'].min()

1.0

In [51]:
df['A'].fillna(value=df['A'].min())   # could also fill with median() or max()

0    1.0
1    5.0
2    1.0
Name: A, dtype: float64

In [52]:
df['A'].fillna(value = df['A'].mean())

0    1.0
1    5.0
2    3.0
Name: A, dtype: float64

In [53]:
df.fillna(value=10)

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,2
1,5.0,10.0,2,8
2,10.0,10.0,3,6


In [54]:
df1 = df.fillna(value=10)
df1

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,2
1,5.0,10.0,2,8
2,10.0,10.0,3,6


In [55]:
df1['B'] + df1['A']

0     6.0
1    15.0
2    20.0
dtype: float64
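
The cells above fill one column at a time. As a possible extension, `fillna` also accepts a Series of per-column fill values, and `dropna` has a `thresh` parameter for keeping rows with a minimum number of non-missing entries. A sketch on a frame similar to the one above (`demo`, `filled`, and `kept` are illustrative names):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [1, 5, np.nan],
                     'B': [5, np.nan, np.nan],
                     'C': [1, 2, 3]})

# demo.mean() is a Series indexed by column name, so each NaN is
# replaced by the mean of its own column.
filled = demo.fillna(demo.mean())

# Keep only rows that have at least 2 non-missing values.
kept = demo.dropna(thresh=2)

print(filled)
print(kept)
```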

# Groupby & Summary

The groupby method lets you group rows by the values in a column and then apply an aggregate function to each group.

In [56]:
#import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350],
       'Profit':[210,350,420,100,190,350]}
data

{'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
 'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
 'Sales': [200, 120, 340, 124, 243, 350],
 'Profit': [210, 350, 420, 100, 190, 350]}

In [57]:
df = pd.DataFrame(data)
df

Unnamed: 0,Company,Person,Sales,Profit
0,GOOG,Sam,200,210
1,GOOG,Charlie,120,350
2,MSFT,Amy,340,420
3,MSFT,Vanessa,124,100
4,FB,Carl,243,190
5,FB,Sarah,350,350


In [58]:
df['Profit'].mean()   # or .quantile(0.95)

270.0

In [59]:
by_comp = df.groupby("Company")

In [60]:
by_comp['Sales'].mean()   # same as df.groupby('Company')['Sales'].mean(); also try std(), min(), max(), count(), quantile(0.95)

Company
FB      296.5
GOOG    160.0
MSFT    232.0
Name: Sales, dtype: float64
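
When different columns need different aggregates, one option is to pass a dictionary to `agg`. The sketch below rebuilds the numeric part of the notebook's data under the illustrative names `sales` and `summary`:

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Sales': [200, 120, 340, 124, 243, 350],
        'Profit': [210, 350, 420, 100, 190, 350]}
sales = pd.DataFrame(data)

# Map each column to the aggregate that should be applied to it.
summary = sales.groupby('Company').agg({'Sales': 'sum', 'Profit': 'max'})

# The group keys become the index, sorted alphabetically by default.
print(summary)
```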

In [61]:
by_comp.describe() #by_comp['Sales'].describe()
#by_comp['Profit'].describe()

,Sales,Sales,Sales,Sales,Sales,Sales,Sales,Sales,Profit,Profit,Profit,Profit,Profit,Profit,Profit,Profit
Company,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
FB,2.0,296.5,75.660426,243.0,269.75,296.5,323.25,350.0,2.0,270.0,113.137085,190.0,230.0,270.0,310.0,350.0
GOOG,2.0,160.0,56.568542,120.0,140.0,160.0,180.0,200.0,2.0,280.0,98.994949,210.0,245.0,280.0,315.0,350.0
MSFT,2.0,232.0,152.735065,124.0,178.0,232.0,286.0,340.0,2.0,260.0,226.27417,100.0,180.0,260.0,340.0,420.0


In [62]:
df.describe()

Unnamed: 0,Sales,Profit
count,6.0,6.0
mean,229.5,270.0
std,100.899455,121.819539
min,120.0,100.0
25%,143.0,195.0
50%,221.5,280.0
75%,315.75,350.0
max,350.0,420.0


In [63]:
by_comp.describe().transpose()

Unnamed: 0,Company,FB,GOOG,MSFT
Sales,count,2.0,2.0,2.0
Sales,mean,296.5,160.0,232.0
Sales,std,75.660426,56.568542,152.735065
Sales,min,243.0,120.0,124.0
Sales,25%,269.75,140.0,178.0
Sales,50%,296.5,160.0,232.0
Sales,75%,323.25,180.0,286.0
Sales,max,350.0,200.0,340.0
Profit,count,2.0,2.0,2.0
Profit,mean,270.0,280.0,260.0


In [64]:
type(by_comp.describe().transpose())

pandas.core.frame.DataFrame

In [65]:
by_comp.describe().transpose()[['FB','MSFT']]

Unnamed: 0,Company,FB,MSFT
Sales,count,2.0,2.0
Sales,mean,296.5,232.0
Sales,std,75.660426,152.735065
Sales,min,243.0,124.0
Sales,25%,269.75,178.0
Sales,50%,296.5,232.0
Sales,75%,323.25,286.0
Sales,max,350.0,340.0
Profit,count,2.0,2.0
Profit,mean,270.0,260.0


# Merging, Joining, and Concatenating

There are three main ways to combine DataFrames: merging, joining, and concatenating.

In [66]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4,5,6,7]) 

In [67]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [68]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [69]:
#Concatenation
pd.concat([df1,df2])

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [70]:
pd.concat([df1,df2],axis=1)  # rows are aligned on the index; labels found in only one frame produce NaN. Try again with matching indexes

Unnamed: 0,A,B,C,D,A,B,C,D
0,A0,B0,C0,D0,,,,
1,A1,B1,C1,D1,,,,
2,A2,B2,C2,D2,,,,
3,A3,B3,C3,D3,,,,
4,,,,,A4,B4,C4,D4
5,,,,,A5,B5,C5,D5
6,,,,,A6,B6,C6,D6
7,,,,,A7,B7,C7,D7
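
The NaN blocks above appear because `df1` and `df2` share no index labels. A small sketch of how index labels affect `pd.concat`, using made-up frames (`top`, `bottom`, and the result names are illustrative):

```python
import pandas as pd

top = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
bottom = pd.DataFrame({'A': ['A2', 'A3']}, index=[0, 1])

stacked = pd.concat([top, bottom])                         # index labels repeat
renumbered = pd.concat([top, bottom], ignore_index=True)   # fresh 0..n-1 index

# With matching index labels, axis=1 places the frames side by side without NaN.
side_by_side = pd.concat([top, bottom.rename(columns={'A': 'B'})], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```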


In [71]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3','K4'],
                     'A': ['A0', 'A1', 'A2', 'A3','A4'],
                     'B': ['B0', 'B1', 'B2', 'B3','B4']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3','K5'],
                          'C': ['C0', 'C1', 'C2', 'C3','C5'],
                          'D': ['D0', 'D1', 'D2', 'D3','D5']})  
print(left)
print(right)

  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
4  K4  A4  B4
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K3  C3  D3
4  K5  C5  D5


In [72]:
# Merge
pd.merge(left,right,how='right',on='key')

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3
4,K5,,,C5,D5


In [73]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})
    
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                               'key2': ['K0', 'K0', 'K0', 'K0'],
                                  'C': ['C0', 'C1', 'C2', 'C3'],
                                  'D': ['D0', 'D1', 'D2', 'D3']})

In [74]:
print(left)
print(right)

  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3


In [75]:
pd.merge(left, right,on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


In [76]:
pd.merge(left, right, how='outer', on=['key1', 'key2'])  #Default how='inner'

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,
5,K2,K0,,,C3,D3


In [77]:
pd.merge(left, right, how='right', on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
3,K2,K0,,,C3,D3


In [78]:
pd.merge(left, right, how='left', on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,
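
When comparing `inner`, `outer`, `left`, and `right` merges, it can help to see where each row came from. One way is the `indicator` parameter of `pd.merge`, sketched here on smaller made-up frames (the local `left`, `right`, and `merged` are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K4'], 'A': ['A0', 'A1', 'A4']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K5'], 'C': ['C0', 'C1', 'C5']})

# indicator=True adds a '_merge' column whose value is
# 'both', 'left_only', or 'right_only' for each row.
merged = pd.merge(left, right, how='outer', on='key', indicator=True)

print(merged)
```

Rows marked `left_only` are the ones a `right` or `inner` merge would drop, and rows marked `right_only` are the ones a `left` or `inner` merge would drop.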


In [79]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

In [80]:
print(left)
print(right)

     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3


In [81]:
df=left.join(right)
print(df)
#print(pd.merge(left,right,))
#df.loc[['K0'],['A']]=='A0' #['C']

     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2


In [82]:
df=right.join(left)
df
#df.loc[['K0'],['A']]=='A0' #['C']

Unnamed: 0,C,D,A,B
K0,C0,D0,A0,B0
K2,C2,D2,A2,B2
K3,C3,D3,,


In [83]:
left.join(right,how='outer')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2
K3,,,C3,D3
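
As the outputs above suggest, `join` matches rows on the index and keeps the calling frame's rows unless `how` says otherwise. The same result can be obtained with `pd.merge` by asking it to use the index on both sides. A sketch on trimmed-down versions of the frames above (`joined` and `merged` are illustrative names):

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3']}, index=['K0', 'K2', 'K3'])

# join aligns on the index of both frames.
joined = left.join(right, how='inner')

# merge can do the same when told to use the index as the key on both sides.
merged = pd.merge(left, right, left_index=True, right_index=True, how='inner')

print(joined)
print(merged)
```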


# Operations

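pandas offers many operations that come up constantly in practice. This section covers unique values, frequency tables, applying functions, sorting, and null handling.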

In [84]:
#import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4,2],'col2':[444,555,666,444,444],'col3':['abcjadfsj','defljljlj','ghi','xyz','ghi']})
df.index=['a','b','c','d','e']
df

Unnamed: 0,col1,col2,col3
a,1,444,abcjadfsj
b,2,555,defljljlj
c,3,666,ghi
d,4,444,xyz
e,2,444,ghi


In [85]:
# Unique values
df['col2'].unique()    # array of the distinct values
df['col2'].nunique()   # number of distinct values

3

In [86]:
#Freq table
df['col3'].value_counts()

ghi          2
abcjadfsj    1
defljljlj    1
xyz          1
Name: col3, dtype: int64

### Selection/Subsetting

In [87]:
#Select from DataFrame using criteria from multiple columns
#newdf = df[(df['col1']<5) & (df['col2']==444)]
newdf =df[(df['col3']=='ghi')]
newdf

Unnamed: 0,col1,col2,col3
c,3,666,ghi
e,2,444,ghi


### Applying function

In [88]:
def sq(x):
    return x**2
print(df)
df['col1']=df['col1'].apply(sq)
print(df)

   col1  col2       col3
a     1   444  abcjadfsj
b     2   555  defljljlj
c     3   666        ghi
d     4   444        xyz
e     2   444        ghi
   col1  col2       col3
a     1   444  abcjadfsj
b     4   555  defljljlj
c     9   666        ghi
d    16   444        xyz
e     4   444        ghi


In [89]:
df['col3'].apply(len)

a    9
b    9
c    3
d    3
e    3
Name: col3, dtype: int64
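
`apply` calls the function once per element. For simple arithmetic or string lengths, pandas also has vectorized equivalents that give the same result. A sketch on a made-up frame (`demo` and the result names are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({'col1': [1, 2, 3], 'col3': ['abc', 'de', 'f']})

squared_apply = demo['col1'].apply(lambda x: x ** 2)   # element by element, with a lambda
squared_vector = demo['col1'] ** 2                     # vectorized arithmetic on the whole column
lengths = demo['col3'].str.len()                       # string accessor instead of apply(len)

print(squared_apply.tolist())
print(squared_vector.tolist())
print(lengths.tolist())
```

`apply` remains the general tool for custom functions that have no built-in vectorized form.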

In [90]:
df['col1'].std() #mean, median, std, min, max, count

5.890670590009256

In [91]:
df.describe()

Unnamed: 0,col1,col2
count,5.0,5.0
mean,6.8,510.6
std,5.890671,99.281418
min,1.0,444.0
25%,4.0,444.0
50%,4.0,444.0
75%,9.0,555.0
max,16.0,666.0


**Permanently removing a column:**

In [92]:
del df['col3']

In [93]:
df

Unnamed: 0,col1,col2
a,1,444
b,4,555
c,9,666
d,16,444
e,4,444


**Get column and index names:**

In [94]:
df.columns

Index(['col1', 'col2'], dtype='object')

In [95]:
df.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

**Sorting and ordering a DataFrame:**

In [96]:
df

Unnamed: 0,col1,col2
a,1,444
b,4,555
c,9,666
d,16,444
e,4,444


In [97]:
df.sort_values(by='col2',ascending=True)
# Returns a sorted copy (inplace=False by default). Other options: df.sort_values(by='col2', axis=0, ascending=True, inplace=True, na_position='last')

Unnamed: 0,col1,col2
a,1,444
d,16,444
e,4,444
b,4,555
c,9,666


In [98]:
df

Unnamed: 0,col1,col2
a,1,444
b,4,555
c,9,666
d,16,444
e,4,444
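
`sort_values` also accepts a list of columns, with one sort direction per column, which is useful for breaking ties such as the three rows with `col2 == 444`. A sketch that rebuilds the frame above under the illustrative name `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'col1': [1, 4, 9, 16, 4],
                     'col2': [444, 555, 666, 444, 444]},
                    index=['a', 'b', 'c', 'd', 'e'])

# Sort by col2 ascending, then break ties by col1 descending.
ordered = demo.sort_values(by=['col2', 'col1'], ascending=[True, False])

print(ordered)
print(list(demo.index))  # the original order is untouched
```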


**Find or check for null values:**

In [99]:
df.isnull()

Unnamed: 0,col1,col2
a,False,False
b,False,False
c,False,False
d,False,False
e,False,False


In [100]:
# Drop rows with NaN Values
df.dropna()

Unnamed: 0,col1,col2
a,1,444
b,4,555
c,9,666
d,16,444
e,4,444


In [101]:
df = pd.DataFrame({'col1':[1,2,3,np.nan],
                   'col2':[np.nan,555,666,444],
                   'col3':['abc','def','ghi','xyz']})
df

Unnamed: 0,col1,col2,col3
0,1.0,,abc
1,2.0,555.0,def
2,3.0,666.0,ghi
3,,444.0,xyz


In [102]:
df.isnull()

Unnamed: 0,col1,col2,col3
0,False,True,False
1,False,False,False
2,False,False,False
3,True,False,False
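
A boolean table like the one above gets hard to scan on larger data. Two common follow-ups are counting the missing values per column and pulling out the rows that contain any. A sketch that rebuilds the frame above under the illustrative name `demo`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'col1': [1, 2, 3, np.nan],
                     'col2': [np.nan, 555, 666, 444],
                     'col3': ['abc', 'def', 'ghi', 'xyz']})

# True counts as 1 when summed, so this gives missing values per column.
missing_per_column = demo.isnull().sum()

# any(axis=1) flags rows with at least one missing value.
rows_with_missing = demo[demo.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```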


In [103]:
df.dropna()   # axis=0 (rows) by default; use axis=1 to drop columns that contain missing values

Unnamed: 0,col1,col2,col3
1,2.0,555.0,def
2,3.0,666.0,ghi


In [104]:
df.fillna(22)

Unnamed: 0,col1,col2,col3
0,1.0,22.0,abc
1,2.0,555.0,def
2,3.0,666.0,ghi
3,22.0,444.0,xyz


In [105]:
# Random sample of rows (the rows returned vary between runs)
df.sample(2)

Unnamed: 0,col1,col2,col3
1,2.0,555.0,def
3,,444.0,xyz
