pandas adopts significant parts of NumPy’s idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops.

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.


In [1]:
from pandas import Series, DataFrame

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data.


In [2]:
import pandas as pd
pd.Series([1,1.5,'a',[1,2,3]])

0            1
1          1.5
2            a
3    [1, 2, 3]
dtype: object

You can get the array representation and index object of the Series via its values and index attributes, respectively.

In [3]:
series=pd.Series([1,1.5,'a',[1,2,3]])
series.index

RangeIndex(start=0, stop=4, step=1)

In [4]:
series.values

array([1, 1.5, 'a', list([1, 2, 3])], dtype=object)

In [5]:
list([1, 2, 3])

[1, 2, 3]

In [6]:
ser=pd.Series(['ankit','summi','kiioo'], index=['a','b','c'])

In [7]:
ser

a    ankit
b    summi
c    kiioo
dtype: object

In [8]:
ser.iloc[0]  # accessing first element by integer position

'ankit'

In [9]:
ser.loc['a']  #accessing first element using label

'ankit'

In [10]:
ser['a']

'ankit'

In [11]:
ser[['a','c']]  #accessing multiple elements using label

a    ankit
c    kiioo
dtype: object

In [12]:
ser.iloc[1]  # access by position; plain ser[1] on a labeled index is deprecated

'summi'
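
The cells above mix label-based and position-based access. As a minimal sketch of the difference (the Series `s` and its integer labels are made up for illustration), `loc` always looks up labels and `iloc` always looks up positions, which matters most when the labels are themselves integers:

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions 0, 1, 2.
s = pd.Series(['x', 'y', 'z'], index=[10, 20, 30])

by_label = s.loc[20]        # label 20 is the second element
by_position = s.iloc[0]     # position 0 is the first element

# Label slices include both endpoints; positional slices exclude the stop.
label_slice = s.loc[10:20]
pos_slice = s.iloc[0:1]

print(by_label, by_position, len(label_slice), len(pos_slice))
```

Using `loc` and `iloc` explicitly avoids the ambiguity that plain `s[...]` has with integer keys.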

In [13]:
ser

a    ankit
b    summi
c    kiioo
dtype: object

In [14]:
ser['d'] = 'soumi'  #adding new elements
ser

a    ankit
b    summi
c    kiioo
d    soumi
dtype: object

Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link.

In [15]:
ser[(ser=='soumi') | (ser=='kiioo')]

c    kiioo
d    soumi
dtype: object

In [16]:
ser*2

a    ankitankit
b    summisummi
c    kiiookiioo
d    soumisoumi
dtype: object

In [17]:
ser+ser

a    ankitankit
b    summisummi
c    kiiookiioo
d    soumisoumi
dtype: object

In [18]:
ser+'3'

a    ankit3
b    summi3
c    kiioo3
d    soumi3
dtype: object
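
The same index-preserving behaviour is easier to see with numbers than with strings. The following is a small sketch with an invented numeric Series `nums`; it shows a boolean filter, a scalar multiplication, and a NumPy ufunc each returning a Series that keeps the original labels:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric Series with string labels.
nums = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = nums[nums > 0]    # boolean filter keeps the labels of matching rows
doubled = nums * 2           # scalar multiplication is applied element-wise
roots = np.sqrt(positive)    # a NumPy ufunc returns a Series with the same index

print(positive.index.tolist())
print(doubled['a'], roots['d'])
```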

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict.

In [19]:
'b' in ser

True

In [20]:
'f' in ser

False

In [21]:
ser1=pd.Series({'a':'ankkit','k':'kiioo'})
ser1

a    ankkit
k     kiioo
dtype: object
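
When a dict is passed together with an explicit index, pandas picks values by key in the order requested. A minimal sketch, assuming a made-up dict `scores`: keys missing from the dict become NaN and dict keys not listed in the index are dropped.

```python
import pandas as pd

# Hypothetical data; 'z' is not a key in the dict and 'b' is not requested.
scores = {'a': 10, 'b': 20, 'c': 30}
picked = pd.Series(scores, index=['c', 'a', 'z'])

print(picked)
print('z' in picked)   # membership tests the index labels, not the values
```

Note that `'z' in picked` is true even though its value is missing, because `in` checks the labels only.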

I will use the terms “missing” and “NA” interchangeably to refer to missing data. The isnull and notnull functions in pandas (also available under the names isna and notna) should be used to detect missing data.

In [22]:
obj=pd.Series([1,2,None,4,5],index=['a','b','c','d','e'])
print(obj,'\n\n',obj.isnull(),'\n\n',obj.isnull().sum())


a    1.0
b    2.0
c    NaN
d    4.0
e    5.0
dtype: float64 

 a    False
b    False
c     True
d    False
e    False
dtype: bool 

 1
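
Once missing values are detected, the usual next step is to drop or replace them. A short sketch on the same values as above, using `notna`, `dropna`, and `fillna`; the fill value of 0 is only an illustrative choice:

```python
import pandas as pd

obj = pd.Series([1, 2, None, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

present = obj.notna()    # boolean mask, True where a value exists
dropped = obj.dropna()   # new Series without the missing entry
filled = obj.fillna(0)   # new Series with the gap replaced by 0

print(present.sum(), dropped.index.tolist(), filled['c'])
```

All three calls return new objects and leave `obj` unchanged.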


Both the Series object itself and its index have a name attribute.

In [23]:
obj.name='My Object'
obj.index.name = 'My Index'
obj

My Index
a    1.0
b    2.0
c    NaN
d    4.0
e    5.0
Name: My Object, dtype: float64

In [24]:
import numpy as np
obj=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'],dtype=np.int16)
obj

a    1
b    2
c    3
d    4
e    5
dtype: int16

**DataFrame**

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays.

In [25]:
data={'name':['ankit','kiio','summi','soumi'],'institute':['ISI','IIIT-D','ECIL','IISc'],'Priority':[4,1,2,3],'Marks':[0,100.0,100.0,100.0]}
df=pd.DataFrame(data,index=['a','b','c','d'],columns=['name','institute','Marks','Priority'])
df

Unnamed: 0,name,institute,Marks,Priority
a,ankit,ISI,0.0,4
b,kiio,IIIT-D,100.0,1
c,summi,ECIL,100.0,2
d,soumi,IISc,100.0,3


In [26]:
df.head(3)

Unnamed: 0,name,institute,Marks,Priority
a,ankit,ISI,0.0,4
b,kiio,IIIT-D,100.0,1
c,summi,ECIL,100.0,2


In [27]:
df.columns

Index(['name', 'institute', 'Marks', 'Priority'], dtype='object')

In [28]:
df['name'] #dict like notation 

a    ankit
b     kiio
c    summi
d    soumi
Name: name, dtype: object

In [29]:
df.name

a    ankit
b     kiio
c    summi
d    soumi
Name: name, dtype: object

In [30]:
# column to numpy array
df.name.values

array(['ankit', 'kiio', 'summi', 'soumi'], dtype=object)

In [31]:
df['name'].values

array(['ankit', 'kiio', 'summi', 'soumi'], dtype=object)

In [32]:
x=pd.Series({'a':1,'b':[1,2]},index=['b','a'])

In [33]:
x

b    [1, 2]
a         1
dtype: object

In [34]:
x['b']

[1, 2]

In [35]:
df.loc[['c','d']]

Unnamed: 0,name,institute,Marks,Priority
c,summi,ECIL,100.0,2
d,soumi,IISc,100.0,3


In [36]:
df.iloc[[0,1]]

Unnamed: 0,name,institute,Marks,Priority
a,ankit,ISI,0.0,4
b,kiio,IIIT-D,100.0,1


In [37]:
df.loc[df['name']=='kiio']

Unnamed: 0,name,institute,Marks,Priority
b,kiio,IIIT-D,100.0,1


In [38]:
df['Marrital Statu']='No'

In [39]:
df

Unnamed: 0,name,institute,Marks,Priority,Marrital Statu
a,ankit,ISI,0.0,4,No
b,kiio,IIIT-D,100.0,1,No
c,summi,ECIL,100.0,2,No
d,soumi,IISc,100.0,3,No


In [40]:
df['rank']=np.arange(1,5,1)
print(df)

    name institute  Marks  Priority Marrital Statu  rank
a  ankit       ISI    0.0         4             No     1
b   kiio    IIIT-D  100.0         1             No     2
c  summi      ECIL  100.0         2             No     3
d  soumi      IISc  100.0         3             No     4


In [41]:
df['rank']=pd.Series([4,1,2,3],index=['a','b','c','d'])
print(df)

    name institute  Marks  Priority Marrital Statu  rank
a  ankit       ISI    0.0         4             No     4
b   kiio    IIIT-D  100.0         1             No     1
c  summi      ECIL  100.0         2             No     2
d  soumi      IISc  100.0         3             No     3
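
The two assignments above behave differently: an array is assigned by position and must match the DataFrame's length, while a Series is aligned on its index labels. A small sketch with an invented frame and a Series `partial` that covers only some of the rows:

```python
import pandas as pd

# Hypothetical frame and a Series covering only two of its three labels.
df = pd.DataFrame({'name': ['ankit', 'kiio', 'summi']}, index=['a', 'b', 'c'])
partial = pd.Series([1, 3], index=['c', 'a'])

# Values are matched by label, not by order; 'b' has no match and becomes NaN.
df['rank'] = partial

print(df)
```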


In [42]:
# You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array
df.T

Unnamed: 0,a,b,c,d
name,ankit,kiio,summi,soumi
institute,ISI,IIIT-D,ECIL,IISc
Marks,0.0,100.0,100.0,100.0
Priority,4,1,2,3
Marrital Statu,No,No,No,No
rank,4,1,2,3


In [43]:
df.transpose()

Unnamed: 0,a,b,c,d
name,ankit,kiio,summi,soumi
institute,ISI,IIIT-D,ECIL,IISc
Marks,0.0,100.0,100.0,100.0
Priority,4,1,2,3
Marrital Statu,No,No,No,No
rank,4,1,2,3


A DataFrame’s index and columns each have a name attribute, just as a Series and its index do.

In [44]:
print(df['rank'].values)
print(df.values)

[4 1 2 3]
[['ankit' 'ISI' 0.0 4 'No' 4]
 ['kiio' 'IIIT-D' 100.0 1 'No' 1]
 ['summi' 'ECIL' 100.0 2 'No' 2]
 ['soumi' 'IISc' 100.0 3 'No' 3]]


In [45]:
df1=pd.DataFrame({'Name':['ankit','kiio'],'Age':[30,25]},index=['a','b'])
new_index=df1.index
df2=pd.DataFrame({'College':['ISI','IIIT-D'],'Rank':[0,1]},index=new_index)
print(df1)
print(df2)

    Name  Age
a  ankit   30
b   kiio   25
  College  Rank
a     ISI     0
b  IIIT-D     1


Index objects are immutable and thus can’t be modified in place by the user.

In [46]:
label=pd.Index(np.arange(2))
df2=pd.DataFrame({'College':['ISI','IIIT-D'],'Rank':[0,1]},index=label)
print(df2)

  College  Rank
0     ISI     0
1  IIIT-D     1
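
Immutability can be checked directly. In this sketch (the labels are illustrative), item assignment on an Index raises a TypeError, and changing a label means building a new Index instead:

```python
import pandas as pd

labels = pd.Index(['a', 'b', 'c'])

# In-place assignment is rejected by pandas.
try:
    labels[1] = 'z'
    mutated = True
except TypeError:
    mutated = False

# Index.map returns a new Index; the original is left untouched.
renamed = labels.map(lambda x: 'z' if x == 'b' else x)

print(mutated, list(labels), list(renamed))
```

Because an Index cannot change underneath its users, the same Index object can safely be shared between several Series or DataFrames, as with new_index above.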


In [47]:
# reindexing

df1=pd.DataFrame({'Name':['ankit','kiio'],'Age':[30,25]},index=['a','b'])
df1=df1.reindex(['b','a','z'])
print(df1)

    Name   Age
b   kiio  25.0
a  ankit  30.0
z    NaN   NaN


For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values.

In [48]:
df1=pd.DataFrame({'Name':['ankit','kiio'],'Age':[30,25]},index=['a','b'])
df1=df1.reindex(['b','a','z'],method='ffill')
print(df1)

    Name  Age
b   kiio   25
a  ankit   30
z   kiio   25
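
Forward filling is easier to follow on an ordered integer index. A minimal sketch with invented values: the Series has entries at 0, 2, and 4, and reindexing to 0 through 5 with method='ffill' carries the last known value forward into each gap.

```python
import pandas as pd

# Hypothetical sparse, ordered data.
obj = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Each new label takes the value of the closest preceding original label.
filled = obj.reindex(list(range(6)), method='ffill')

print(filled)
```

The method option requires the original index to be sorted (monotonic); otherwise pandas raises an error.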


In [49]:
df=pd.DataFrame([[1.2,'ankit'],[9.7,'kiio'],[9.5,'soumi']],index=['a','b','c'],columns=['Grade','Name'])
print(df)
df.drop('a',inplace=True)
print(df)


   Grade   Name
a    1.2  ankit
b    9.7   kiio
c    9.5  soumi
   Grade   Name
b    9.7   kiio
c    9.5  soumi


In [50]:
df.drop(['Grade'],axis=1)

Unnamed: 0,Name
b,kiio
c,soumi


In [51]:
df.drop(index=['b','c'])

Unnamed: 0,Grade,Name


In [52]:
df=pd.Series(['ankit','kiio','summi','soumi'],index=['a','b','c','d'])
print(df[1:4])
print(df[['b','c']])

b     kiio
c    summi
d    soumi
dtype: object
b     kiio
c    summi
dtype: object


In [53]:
###### NER (Named Entity Recognition) ######

import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

text = "Barack Obama was born in Hawaii and was the president of the United States."

# Process the text
doc = nlp(text)

# Extract and print named entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Barack Obama PERSON
Hawaii GPE
the United States GPE


In [54]:
df=pd.DataFrame(['ankit','kiio','summi','soumi'],index=['a','b','c','d'],columns=['Name'])
print(df[1:4]) # selecting rows
print(df[['Name']]) # selecting columns
print(df[df['Name']=='kiio']) # filtering condition

    Name
b   kiio
c  summi
d  soumi
    Name
a  ankit
b   kiio
c  summi
d  soumi
   Name
b  kiio


In [55]:
df=pd.DataFrame([[1.2,'ankit'],[9.7,'kiio'],[9.5,'soumi']],index=['a','b','c'],columns=['Grade','Name'])
print(df)
print(df.loc['a',['Grade','Name']])

   Grade   Name
a    1.2  ankit
b    9.7   kiio
c    9.5  soumi
Grade      1.2
Name     ankit
Name: a, dtype: object


In [56]:
print(df.iloc[1,[1,0]])

Name     kiio
Grade     9.7
Name: b, dtype: object


In [57]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [58]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],index=['a', 'c', 'e', 'f', 'g'])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [59]:
s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [60]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),columns=list('abcd'))
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [61]:
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),columns=list('abcde'))
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [62]:
df2.loc[1, 'b'] = np.nan

In [63]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [64]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [65]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [66]:
df1*df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,4.0,9.0,
1,20.0,,42.0,56.0,
2,80.0,99.0,120.0,143.0,
3,,,,,


In [67]:
df1.add(df2, fill_value=0) # where a value is missing in only one of df1 or df2, it is treated as 0 before adding;
# where it is missing in both, the result stays NaN

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
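
To see exactly which gaps fill_value closes, here is a small sketch with two invented frames, `left` and `right`. A cell missing on one side only is treated as 0, a column missing from one frame is treated the same way, and a cell missing on both sides remains NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical frames: 'y' exists only in right; row 1 of 'x' is NaN in both.
left = pd.DataFrame({'x': [1.0, np.nan]})
right = pd.DataFrame({'x': [np.nan, np.nan], 'y': [10.0, 20.0]})

out = left.add(right, fill_value=0)

print(out)
```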


In [68]:
arr = np.arange(12.).reshape((3, 4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [69]:
arr[0]

array([0., 1., 2., 3.])

In [70]:
arr-arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

When we subtract arr[0] from arr, the subtraction is performed once for each row. This is referred to as broadcasting.

If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods.

In [71]:
frame=pd.DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [72]:
series3 = frame['d']
series3

Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: int64

In [73]:
frame.sub(series3, axis='index')

Unnamed: 0,b,d,e
Utah,-1,0,1
Ohio,-1,0,1
Texas,-1,0,1
Oregon,-1,0,1


The axis that you pass is the axis to match on. In this case we mean to match on the DataFrame’s row index (axis='index' or axis=0) and broadcast across the columns.
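
The two directions of broadcasting can be compared side by side. In this sketch, which rebuilds the same integer frame as above, plain subtraction of a row matches on the column labels and repeats down the rows, while sub with axis='index' matches on the row labels and repeats across the columns:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: the Series index (b, d, e) is matched against the columns.
by_row = frame - frame.iloc[0]

# axis='index': the Series index (state names) is matched against the rows.
by_col = frame.sub(frame['d'], axis='index')

print(by_row)
print(by_col)
```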

In [74]:
# NumPy ufuncs (element-wise array methods) also work with pandas objects:

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.829709,-0.077173,-0.07241
Ohio,-0.328474,-0.322371,-0.581475
Texas,0.619285,0.236105,-1.212218
Oregon,0.043921,1.539648,-0.300668


In [75]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.829709,0.077173,0.07241
Ohio,0.328474,0.322371,0.581475
Texas,0.619285,0.236105,1.212218
Oregon,0.043921,1.539648,0.300668


Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this.

In [76]:
frame

Unnamed: 0,b,d,e
Utah,0.829709,-0.077173,-0.07241
Ohio,-0.328474,-0.322371,-0.581475
Texas,0.619285,0.236105,-1.212218
Oregon,0.043921,1.539648,-0.300668


In [77]:
f = lambda x: x.max() - x.min()
frame.apply(f)


b    1.158182
d    1.862019
e    1.139808
dtype: float64

In [78]:
# If you pass axis='columns' to apply, the function will be invoked once per row instead
frame.apply(f, axis='columns')


Utah      0.906882
Ohio      0.259104
Texas     1.831503
Oregon    1.840317
dtype: float64

In [79]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Unnamed: 0,b,d,e
min,-0.328474,-0.322371,-1.212218
max,0.829709,1.539648,-0.07241


In [80]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f,axis='columns')

Unnamed: 0,min,max
Utah,-0.077173,0.829709
Ohio,-0.581475,-0.322371
Texas,-1.212218,0.619285
Oregon,-0.300668,1.539648
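
Many common reductions do not need apply at all, because DataFrame already has vectorized methods for them. A sketch, assuming a small deterministic frame in place of the random one above, showing that the max-minus-min spread from apply matches the built-in methods:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame used above.
frame = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=list('bde'))

spread_apply = frame.apply(lambda col: col.max() - col.min())
spread_builtin = frame.max() - frame.min()   # same result, no Python-level loop

# Per-row version, equivalent to apply(..., axis='columns').
row_spread = frame.max(axis='columns') - frame.min(axis='columns')

print(spread_builtin)
print(row_spread)
```

Reserving apply for functions that have no built-in equivalent usually keeps the code shorter and faster.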


Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating-point value in frame. You can do this with applymap (renamed to DataFrame.map in pandas 2.1, where applymap is deprecated).

In [81]:
format = lambda x: '%.2f' % x
frame.applymap(format)


Unnamed: 0,b,d,e
Utah,0.83,-0.08,-0.07
Ohio,-0.33,-0.32,-0.58
Texas,0.62,0.24,-1.21
Oregon,0.04,1.54,-0.3


In [82]:
frame['e'].map(format)

Utah      -0.07
Ohio      -0.58
Texas     -1.21
Oregon    -0.30
Name: e, dtype: object
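
Since the element-wise DataFrame method has different names across pandas versions, one way to write version-tolerant code is to pick whichever method exists. This is only a sketch: the helper name `two_decimals` and the two-row frame are invented for illustration, and the function avoids shadowing the built-in `format`.

```python
import pandas as pd

# Small stand-in frame using two rows of the values shown above.
frame = pd.DataFrame({'b': [0.829709, -0.328474], 'd': [-0.077173, -0.322371]},
                     index=['Utah', 'Ohio'])

def two_decimals(x):
    """Format a single float with two decimal places."""
    return f'{x:.2f}'

# DataFrame.map exists in pandas 2.1+; older versions only have applymap.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(two_decimals)

print(formatted)
```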

In [83]:
pd.Series([1,3,2,7],index=['d','b','a','c']).sort_index()

a    2
b    3
c    7
d    1
dtype: int64

In [84]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),index=['three', 'one'],columns=['d', 'a', 'b', 'c'])
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [85]:
frame.sort_index(axis=1,ascending=True)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [86]:
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [87]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [88]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [89]:
frame.sort_values(by=['a', 'b']) # sort by a first; rows with equal values of a
# are then ordered by b

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
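
The sort direction can also be set per column. A brief sketch on the same frame, passing a list to ascending so that a is sorted in increasing order while ties are broken by b in decreasing order:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# One ascending flag per entry in `by`.
ordered = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(ordered)
```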


In [91]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]],index=['a', 'b', 'c', 'd'],\
                  columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [92]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [93]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

Some methods, like idxmin and idxmax, return indirect statistics like the index value where the minimum or maximum values are attained.

In [94]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [95]:
df.idxmax()

one    b
two    d
dtype: object

In [96]:
df.idxmin()

one    d
two    b
dtype: object

**Method --> Description**     

count --> Number of non-NA values    
describe --> Compute set of summary statistics for Series or each DataFrame column     
min, max --> Compute minimum and maximum values     
argmin, argmax --> Compute index locations (integers) at which minimum or maximum value obtained, respectively (Series only)     
idxmin, idxmax --> Compute index labels at which minimum or maximum value obtained, respectively   
quantile --> Compute sample quantile ranging from 0 to 1     
sum --> Sum of values     
mean --> Mean of values     
median --> Arithmetic median (50% quantile) of values     
mad --> Mean absolute deviation from mean value (removed in pandas 2.0)      
prod --> Product of all values     
var --> Sample variance of values      
std --> Sample standard deviation of values     
skew --> Sample skewness (third moment) of values     
kurt --> Sample kurtosis (fourth moment) of values     
cumsum --> Cumulative sum of values      
cummin, cummax --> Cumulative minimum or maximum of values, respectively     
cumprod --> Cumulative product of values      
diff --> Compute first arithmetic difference (useful for time series)     
pct_change --> Compute percent changes

In [97]:
df.count()

one    3
two    2
dtype: int64

In [98]:
df.min()

one    0.75
two   -4.50
dtype: float64

In [99]:
df.max()

one    7.1
two   -1.3
dtype: float64

In [100]:
df.quantile()

one    1.4
two   -2.9
Name: 0.5, dtype: float64

In [101]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [102]:
df.mean()

one    3.083333
two   -2.900000
dtype: float64

In [103]:
df.median()

one    1.4
two   -2.9
dtype: float64

In [104]:
df.prod()

one    7.455
two    5.850
dtype: float64

In [105]:
df.var()

one    12.205833
two     5.120000
dtype: float64

In [106]:
df.std()

one    3.493685
two    2.262742
dtype: float64

In [107]:
df.skew()

one    1.664846
two         NaN
dtype: float64

In [108]:
df.kurt()

one   NaN
two   NaN
dtype: float64

In [109]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [110]:
df.cumprod()

Unnamed: 0,one,two
a,1.4,
b,9.94,-4.5
c,,
d,7.455,5.85


In [111]:
df.cummin()

Unnamed: 0,one,two
a,1.4,
b,1.4,-4.5
c,,
d,0.75,-4.5


In [112]:
df.cummax()

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,7.1,-1.3


In [113]:
df.diff()

Unnamed: 0,one,two
a,,
b,5.7,
c,,
d,,


In [114]:
df.pct_change()


Unnamed: 0,one,two
a,,
b,4.071429,
c,0.0,0.0
d,-0.894366,-0.711111


The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance.

In [115]:
df['one'].corr(df['two'])

np.float64(-1.0)

In [116]:
df['one'].cov(df['two'])

np.float64(-10.16)
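
The correlation of exactly -1.0 comes from the fact that only two rows (b and d) have values in both columns, and any two distinct points lie on a straight line. A sketch that reproduces the covariance by hand on the overlapping rows, using the sample formula with n - 1 in the denominator:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Keep only the rows where both columns are present.
pair = df[['one', 'two']].dropna()

dev_one = pair['one'] - pair['one'].mean()
dev_two = pair['two'] - pair['two'].mean()
manual_cov = (dev_one * dev_two).sum() / (len(pair) - 1)

print(len(pair), manual_cov)
```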

value_counts --> Returns a Series containing the unique values as its index and their frequencies as its values, sorted by count in descending order.

In [117]:
import pandas as pd
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],'Qu2': [2, 3, 1, 2, 3],'Qu3': [1, 5, 2, 4, 4]})
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [118]:
result = data.apply(pd.Series.value_counts).fillna(0)  # the top-level pd.value_counts is deprecated
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
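
For a single column, value_counts can be called directly on the Series. A sketch on the same data; the cast to int at the end is an optional tidy-up, since fillna(0) leaves the per-column table as floats:

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Frequencies for one column, most common values first.
counts = data['Qu1'].value_counts()

# Per-column table; values absent from a column count as 0.
table = data.apply(pd.Series.value_counts).fillna(0).astype(int)

print(counts)
print(table)
```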
