Pandas is a popular Python package for data manipulation. It is built on top of NumPy. Its main data structures are `Series` and `DataFrame`.

`Series` is a one-dimensional labeled array capable of holding data of any type. The axis labels are similar to keys in a dictionary and are collectively called the index.

In [1]:
import pandas as pd
labels=['Moscow','London', 'Paris']
population=[11.5,8.6,2.2]
pd.Series(population)

0    11.5
1     8.6
2     2.2
dtype: float64

In [2]:
pd.Series(data=population)

0    11.5
1     8.6
2     2.2
dtype: float64

In [3]:
pd.Series(population,labels)

Moscow    11.5
London     8.6
Paris      2.2
dtype: float64

In [4]:
cities=pd.Series(data=population,index=labels)
cities

Moscow    11.5
London     8.6
Paris      2.2
dtype: float64
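
Because index labels behave much like dictionary keys, a `Series` can also be built directly from a dictionary. The snippet below is a small sketch of that idea; the variable names are only illustrative.

```python
import pandas as pd

# Dictionary keys become the index labels, values become the data.
population_by_city = {'Moscow': 11.5, 'London': 8.6, 'Paris': 2.2}
cities_from_dict = pd.Series(population_by_city)

print(cities_from_dict)
```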

Data can be of any type

In [5]:
pd.Series(['cat',3.1,True,[2,3]])

0       cat
1       3.1
2      True
3    [2, 3]
dtype: object

Elements of a `Series` are accessed by label, much like values in a dictionary:

In [6]:
cities['Moscow']

11.5
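
Besides plain brackets, a `Series` can be queried explicitly by label with `.loc` or by position with `.iloc`. The following is a minimal sketch that rebuilds the same city data so it runs on its own.

```python
import pandas as pd

cities = pd.Series([11.5, 8.6, 2.2], index=['Moscow', 'London', 'Paris'])

by_label = cities.loc['London']        # look up by index label
by_position = cities.iloc[0]           # look up by integer position
has_paris = 'Paris' in cities          # membership test checks the labels
subset = cities[['Moscow', 'Paris']]   # a list of labels returns a Series

print(by_label, by_position, has_paris)
print(subset)
```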

Operations between `Series` are aligned on index labels, not on position. A label that is present in only one operand yields `NaN`:

In [7]:
city_growth=pd.Series(data=[2.1,1.0,0.5,0.7],index=['Moscow','London', 'Paris', 'Berlin'])
cities+city_growth

Berlin     NaN
London     9.6
Moscow    13.6
Paris      2.7
dtype: float64

In [8]:
# NaN: not a number
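
If the `NaN` for an unmatched label is not wanted, one option is `Series.add` with `fill_value`, which treats the missing entry as a chosen default. This is a sketch using the same data; choosing 0 as the default is an assumption for illustration.

```python
import pandas as pd

cities = pd.Series([11.5, 8.6, 2.2], index=['Moscow', 'London', 'Paris'])
city_growth = pd.Series([2.1, 1.0, 0.5, 0.7],
                        index=['Moscow', 'London', 'Paris', 'Berlin'])

# The + operator matches labels; 'Berlin' has no partner, so it becomes NaN.
total = cities + city_growth

# fill_value=0 substitutes 0 for the entry missing from one operand.
total_filled = cities.add(city_growth, fill_value=0)

print(total_filled)
```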

`DataFrame` is a two-dimensional data structure with labeled axes (rows and columns).

### (by Jose Portilla)

In [9]:
import numpy as np
matrix=np.random.randn(5,4)

In [10]:
pd.DataFrame(matrix)

Unnamed: 0,0,1,2,3
0,-0.399242,-0.107662,-0.163098,-0.032075
1,-0.732937,-0.446154,-0.930087,-1.647125
2,-0.337211,-0.455042,-0.110879,-0.439515
3,1.479313,-0.297076,0.692951,-0.501868
4,-1.317928,-0.020203,-0.149772,-0.087291


In [11]:
df=pd.DataFrame(data=matrix,index='A B C D E'.split(),columns='W X Y Z'.split())
df

Unnamed: 0,W,X,Y,Z
A,-0.399242,-0.107662,-0.163098,-0.032075
B,-0.732937,-0.446154,-0.930087,-1.647125
C,-0.337211,-0.455042,-0.110879,-0.439515
D,1.479313,-0.297076,0.692951,-0.501868
E,-1.317928,-0.020203,-0.149772,-0.087291


In [12]:
df['W']

A   -0.399242
B   -0.732937
C   -0.337211
D    1.479313
E   -1.317928
Name: W, dtype: float64

In [13]:
type(df['W'])

pandas.core.series.Series

In [14]:
col=['X','W',]
df[col]

Unnamed: 0,X,W
A,-0.107662,-0.399242
B,-0.446154,-0.732937
C,-0.455042,-0.337211
D,-0.297076,1.479313
E,-0.020203,-1.317928


Creating a new column

In [15]:
df['new'] = df['W'] + df['Y']
df

Unnamed: 0,W,X,Y,Z,new
A,-0.399242,-0.107662,-0.163098,-0.032075,-0.56234
B,-0.732937,-0.446154,-0.930087,-1.647125,-1.663024
C,-0.337211,-0.455042,-0.110879,-0.439515,-0.44809
D,1.479313,-0.297076,0.692951,-0.501868,2.172264
E,-1.317928,-0.020203,-0.149772,-0.087291,-1.467701


Removing a column

In [16]:
df.drop('new',axis=1)

Unnamed: 0,W,X,Y,Z
A,-0.399242,-0.107662,-0.163098,-0.032075
B,-0.732937,-0.446154,-0.930087,-1.647125
C,-0.337211,-0.455042,-0.110879,-0.439515
D,1.479313,-0.297076,0.692951,-0.501868
E,-1.317928,-0.020203,-0.149772,-0.087291


In [17]:
# Not inplace unless specified!
df

Unnamed: 0,W,X,Y,Z,new
A,-0.399242,-0.107662,-0.163098,-0.032075,-0.56234
B,-0.732937,-0.446154,-0.930087,-1.647125,-1.663024
C,-0.337211,-0.455042,-0.110879,-0.439515,-0.44809
D,1.479313,-0.297076,0.692951,-0.501868,2.172264
E,-1.317928,-0.020203,-0.149772,-0.087291,-1.467701


In [18]:
df.drop('new',axis=1,inplace=True)

In [19]:
df

Unnamed: 0,W,X,Y,Z
A,-0.399242,-0.107662,-0.163098,-0.032075
B,-0.732937,-0.446154,-0.930087,-1.647125
C,-0.337211,-0.455042,-0.110879,-0.439515
D,1.479313,-0.297076,0.692951,-0.501868
E,-1.317928,-0.020203,-0.149772,-0.087291
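
An alternative to `inplace=True` is to assign the result of `drop` back to a name. The sketch below uses a small frame with made-up integer values instead of the random matrix, so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Illustrative data; the values are arbitrary.
df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=list('ABCDE'), columns=list('WXYZ'))
df['new'] = df['W'] + df['Y']

# drop returns a new DataFrame and leaves df untouched.
trimmed = df.drop(columns='new')
still_there = 'new' in df.columns

# Reassigning has the same effect as inplace=True.
df = df.drop(columns='new')

print(list(trimmed.columns), still_there, list(df.columns))
```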


Selecting Rows

In [20]:
df.loc['A']

W   -0.399242
X   -0.107662
Y   -0.163098
Z   -0.032075
Name: A, dtype: float64

In [21]:
# or by index location
df.iloc[0]

W   -0.399242
X   -0.107662
Y   -0.163098
Z   -0.032075
Name: A, dtype: float64

In [22]:
df.iloc[[0,3]]

Unnamed: 0,W,X,Y,Z
A,-0.399242,-0.107662,-0.163098,-0.032075
D,1.479313,-0.297076,0.692951,-0.501868


Selecting a subset of rows and columns

In [23]:
df.loc[['A','B'],['W','Y']]

Unnamed: 0,W,Y
A,-0.399242,-0.163098
B,-0.732937,-0.930087
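
The same subset can be taken by position with `.iloc`. One difference worth keeping in mind is that label slices with `.loc` include the end label, while position slices with `.iloc` exclude the end position. A sketch with illustrative integer data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=list('ABCDE'), columns=list('WXYZ'))

by_label = df.loc[['A', 'B'], ['W', 'Y']]    # rows and columns by label
by_position = df.iloc[[0, 1], [0, 2]]        # the same cells by position

label_slice = df.loc['A':'C']     # includes row 'C'
position_slice = df.iloc[0:2]     # excludes position 2

print(by_label)
print(len(label_slice), len(position_slice))
```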


Conditional Selection

In [24]:
df>0

Unnamed: 0,W,X,Y,Z
A,False,False,False,False
B,False,False,False,False
C,False,False,False,False
D,True,False,True,False
E,False,False,False,False


In [25]:
# Use it as a conditional filter
df[df>0]

Unnamed: 0,W,X,Y,Z
A,,,,
B,,,,
C,,,,
D,1.479313,,0.692951,
E,,,,


In [26]:
df['W']>0

A    False
B    False
C    False
D     True
E    False
Name: W, dtype: bool

In [27]:
df[df['W']>0]

Unnamed: 0,W,X,Y,Z
D,1.479313,-0.297076,0.692951,-0.501868


For two conditions, use `|` (or) and `&` (and), and wrap each condition in parentheses:

In [28]:
df[(df['X']>0) & (df['Y'] > 0.1)]

Unnamed: 0,W,X,Y,Z
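
The example above happens to match no rows. To show the operators on data where they do select something, here is a sketch with a small hand-written frame (the values are made up); `~` negates a condition.

```python
import pandas as pd

df = pd.DataFrame({'X': [-0.1, 0.4, -0.5],
                   'Y': [0.3, -0.9, -0.2]},
                  index=list('ABC'))

# Rows where at least one condition holds.
either = df[(df['X'] > 0) | (df['Y'] > 0.1)]

# Rows where neither condition holds.
neither = df[~((df['X'] > 0) | (df['Y'] > 0.1))]

print(either)
print(neither)
```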


Reset to default 0,1...n index. The former index becomes a column with name `index`.

In [29]:
# not inplace
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,-0.399242,-0.107662,-0.163098,-0.032075
1,B,-0.732937,-0.446154,-0.930087,-1.647125
2,C,-0.337211,-0.455042,-0.110879,-0.439515
3,D,1.479313,-0.297076,0.692951,-0.501868
4,E,-1.317928,-0.020203,-0.149772,-0.087291


In [30]:
df

Unnamed: 0,W,X,Y,Z
A,-0.399242,-0.107662,-0.163098,-0.032075
B,-0.732937,-0.446154,-0.930087,-1.647125
C,-0.337211,-0.455042,-0.110879,-0.439515
D,1.479313,-0.297076,0.692951,-0.501868
E,-1.317928,-0.020203,-0.149772,-0.087291


Let us add a new column and make it the index.

In [31]:
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df

Unnamed: 0,W,X,Y,Z,States
A,-0.399242,-0.107662,-0.163098,-0.032075,CA
B,-0.732937,-0.446154,-0.930087,-1.647125,NY
C,-0.337211,-0.455042,-0.110879,-0.439515,WY
D,1.479313,-0.297076,0.692951,-0.501868,OR
E,-1.317928,-0.020203,-0.149772,-0.087291,CO


In [32]:
# not inplace
df.set_index('States')

States,W,X,Y,Z
CA,-0.399242,-0.107662,-0.163098,-0.032075
NY,-0.732937,-0.446154,-0.930087,-1.647125
WY,-0.337211,-0.455042,-0.110879,-0.439515
OR,1.479313,-0.297076,0.692951,-0.501868
CO,-1.317928,-0.020203,-0.149772,-0.087291


In [33]:
df

Unnamed: 0,W,X,Y,Z,States
A,-0.399242,-0.107662,-0.163098,-0.032075,CA
B,-0.732937,-0.446154,-0.930087,-1.647125,NY
C,-0.337211,-0.455042,-0.110879,-0.439515,WY
D,1.479313,-0.297076,0.692951,-0.501868,OR
E,-1.317928,-0.020203,-0.149772,-0.087291,CO


In [34]:
# inplace: df itself is modified
df.set_index('States',inplace=True)
df

States,W,X,Y,Z
CA,-0.399242,-0.107662,-0.163098,-0.032075
NY,-0.732937,-0.446154,-0.930087,-1.647125
WY,-0.337211,-0.455042,-0.110879,-0.439515
OR,1.479313,-0.297076,0.692951,-0.501868
CO,-1.317928,-0.020203,-0.149772,-0.087291
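
`set_index` and `reset_index` are roughly inverse operations. The sketch below, on a small made-up frame, shows the round trip using reassignment rather than `inplace=True`; note that the original `A, B, C` labels are discarded by `set_index`.

```python
import pandas as pd

df = pd.DataFrame({'W': [1.0, 2.0, 3.0],
                   'States': ['CA', 'NY', 'WY']},
                  index=list('ABC'))

indexed = df.set_index('States')     # 'States' becomes the row labels
ny_value = indexed.loc['NY', 'W']    # rows can now be looked up by state

restored = indexed.reset_index()     # 'States' is a column again, index is 0..2

print(indexed)
print(restored)
```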


Summary statistics on all numerical columns

In [35]:
df.describe()

Unnamed: 0,W,X,Y,Z
count,5.0,5.0,5.0,5.0
mean,-0.261601,-0.265227,-0.132177,-0.541575
std,1.048025,0.196623,0.574286,0.651958
min,-1.317928,-0.455042,-0.930087,-1.647125
25%,-0.732937,-0.446154,-0.163098,-0.501868
50%,-0.399242,-0.297076,-0.149772,-0.439515
75%,-0.337211,-0.107662,-0.110879,-0.087291
max,1.479313,-0.020203,0.692951,-0.032075


How many positive values in a column

In [36]:
df['W']>0

States
CA    False
NY    False
WY    False
OR     True
CO    False
Name: W, dtype: bool

In [37]:
ser_w=df['W']>0

In [38]:
ser_w.value_counts()

False    4
True     1
Name: W, dtype: int64

Another method

In [39]:
# Number of true values. True corresponds to 1, and False corresponds to 0. 
ser_w.sum()

1

In [40]:
len(ser_w)

5
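
Since `True` counts as 1 and `False` as 0, the mean of a boolean `Series` gives the fraction of `True` values, which combines the two results above in one call. A sketch with rounded stand-in values for column `W`:

```python
import pandas as pd

# Rounded, illustrative values standing in for column W.
w = pd.Series([-0.4, -0.7, -0.3, 1.5, -1.3],
              index=['CA', 'NY', 'WY', 'OR', 'CO'], name='W')

positive = w > 0
n_positive = int(positive.sum())   # number of positive entries
share_positive = positive.mean()   # fraction of positive entries

print(n_positive, share_positive)
```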

### Groupby
The `groupby` method allows you to group rows of data together and call aggregate functions on each group.

In [41]:
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}

In [42]:
df = pd.DataFrame(data)
df

Unnamed: 0,Company,Person,Sales
0,GOOG,Sam,200
1,GOOG,Charlie,120
2,MSFT,Amy,340
3,MSFT,Vanessa,124
4,FB,Carl,243
5,FB,Sarah,350


In [43]:
by_comp = df.groupby("Company")
by_comp

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000191EB3F69A0>

In [44]:
# numeric_only=True skips the text column 'Person' (needed in pandas 2.x)
by_comp.mean(numeric_only=True)

Company,Sales
FB,296.5
GOOG,160.0
MSFT,232.0


In [45]:
by_comp.describe()

,Sales,Sales,Sales,Sales,Sales,Sales,Sales,Sales
Company,count,mean,std,min,25%,50%,75%,max
FB,2.0,296.5,75.660426,243.0,269.75,296.5,323.25,350.0
GOOG,2.0,160.0,56.568542,120.0,140.0,160.0,180.0,200.0
MSFT,2.0,232.0,152.735065,124.0,178.0,232.0,286.0,340.0


In [46]:
by_comp.describe().transpose()

,Company,FB,GOOG,MSFT
Sales,count,2.0,2.0,2.0
Sales,mean,296.5,160.0,232.0
Sales,std,75.660426,56.568542,152.735065
Sales,min,243.0,120.0,124.0
Sales,25%,269.75,140.0,178.0
Sales,50%,296.5,160.0,232.0
Sales,75%,323.25,180.0,286.0
Sales,max,350.0,200.0,340.0
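
When only a few statistics are needed, selecting the column first and passing a list of function names to `agg` is one way to get them in a single table. The sketch below reuses the sales data; the `top_seller` lookup via `idxmax` is an added illustration, not part of the original notebook.

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)

# Several aggregates of one column, one row per company.
summary = df.groupby('Company')['Sales'].agg(['sum', 'mean', 'max'])

# idxmax gives the row label of the largest sale in each group.
best_rows = df.groupby('Company')['Sales'].idxmax()
top_seller = df.loc[best_rows, ['Company', 'Person', 'Sales']]

print(summary)
print(top_seller)
```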


### Operations

In [47]:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()

Unnamed: 0,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


Info on Unique Values

In [48]:
# unique values
df['col2'].unique()

array([444, 555, 666], dtype=int64)

In [49]:
# number of unique values
df['col2'].nunique()

3

In [50]:
# number of each unique value
df['col2'].value_counts()

444    2
666    1
555    1
Name: col2, dtype: int64

In [51]:
# col1 > 2 and col2 == 444
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf

Unnamed: 0,col1,col2,col3
3,4,444,xyz


Applying Functions

In [52]:
def times2(x):
    return x*2
df['col1'].apply(times2)

0    2
1    4
2    6
3    8
Name: col1, dtype: int64

In [53]:
df['new']=df['col1'].apply(times2)
df

Unnamed: 0,col1,col2,col3,new
0,1,444,abc,2
1,2,555,def,4
2,3,666,ghi,6
3,4,444,xyz,8
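
`apply` also accepts a lambda or a built-in function. For simple arithmetic, operating on the whole column at once usually gives the same result and tends to be faster. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

doubled_apply = df['col1'].apply(lambda x: x * 2)   # element by element
doubled_vector = df['col1'] * 2                     # whole column at once

# A built-in function works too: length of each string.
lengths = df['col3'].apply(len)

print(doubled_apply.equals(doubled_vector))
print(lengths)
```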


In [54]:
df['col1'].sum()

10

Deleting a column

In [55]:
del df['col1']
df

Unnamed: 0,col2,col3,new
0,444,abc,2
1,555,def,4
2,666,ghi,6
3,444,xyz,8


In [56]:
df.drop(labels=1,axis=0)

Unnamed: 0,col2,col3,new
0,444,abc,2
2,666,ghi,6
3,444,xyz,8


In [57]:
df.drop(labels='col2',axis=1)

Unnamed: 0,col3,new
0,abc,2
1,def,4
2,ghi,6
3,xyz,8


In [58]:
df

Unnamed: 0,col2,col3,new
0,444,abc,2
1,555,def,4
2,666,ghi,6
3,444,xyz,8


Get column and index names

In [59]:
df.columns

Index(['col2', 'col3', 'new'], dtype='object')

In [60]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   col2    4 non-null      int64 
 1   col3    4 non-null      object
 2   new     4 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 224.0+ bytes


Sorting

In [62]:
df.sort_values(by='col2') #inplace=False by default

Unnamed: 0,col2,col3,new
0,444,abc,2
3,444,xyz,8
1,555,def,4
2,666,ghi,6
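
`sort_values` can also sort in descending order or by several columns, with a separate direction for each. A sketch on the same columns:

```python
import pandas as pd

df = pd.DataFrame({'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# Largest col2 first.
descending = df.sort_values(by='col2', ascending=False)

# col2 ascending, ties broken by col3 descending.
multi_key = df.sort_values(by=['col2', 'col3'], ascending=[True, False])

print(descending)
print(multi_key)
```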


### Data Input and Output

For more information search for "Pandas IO"

In [63]:
import numpy as np
import pandas as pd

### CSV
Comma-separated values (CSV) files are text files that use commas as field delimiters.
<br> To read and write Excel files (see the Excel section) you may need to install <tt>xlrd</tt> and <tt>openpyxl</tt>.<br> In your terminal/command prompt run:

    conda install xlrd
    conda install openpyxl

Then restart Jupyter Notebook.

CSV Input

In [64]:
df = pd.read_csv('example.csv')
df

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


CSV Output

In [65]:
newdf=df[['a','b']]
newdf.to_csv('new_example.csv',index=False)

In [66]:
# index is 0,1,2,... by default
pd.read_csv('new_example.csv')

Unnamed: 0,a,b
0,0,1
1,4,5
2,8,9
3,12,13
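
To see what `index=False` changes, the sketch below writes the same frame both ways. It uses in-memory text instead of files (an assumption made so it runs without `example.csv`): calling `to_csv` without a path returns the CSV as a string.

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [0, 4, 8, 12], 'b': [1, 5, 9, 13]})

with_index = df.to_csv()                 # index written as an unnamed first column
without_index = df.to_csv(index=False)   # only the data columns

# Reading the first version back yields an extra 'Unnamed: 0' column.
back_with = pd.read_csv(io.StringIO(with_index))
back_without = pd.read_csv(io.StringIO(without_index))

print(list(back_with.columns))
print(list(back_without.columns))
```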


### Excel
Pandas can read and write MS Excel files. However, it imports only data, not formulas or images. A file that contains images or macros may cause the <tt>.read_excel()</tt> method to crash.

In [67]:
pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

Unnamed: 0.1,Unnamed: 0,a,b,c,d
0,0,0,1,2,3
1,1,4,5,6,7
2,2,8,9,10,11
3,3,12,13,14,15


In [68]:
# The 'Unnamed: 0' column holds the index that was saved with the file.
df=pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
df.columns

Index(['Unnamed: 0', 'a', 'b', 'c', 'd'], dtype='object')

In [69]:
df.drop('Unnamed: 0',axis=1)

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
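
Instead of dropping the column afterwards, the saved index can be restored at read time with `index_col=0`, an argument that both `read_csv` and `read_excel` accept. The sketch below demonstrates it on in-memory CSV text rather than on `Excel_Sample.xlsx`, so that it runs without the file or an Excel engine; the text content is made up.

```python
import io
import pandas as pd

# CSV text whose first column is a saved, unnamed index.
csv_text = ",a,b\n0,0,1\n1,4,5\n"

raw = pd.read_csv(io.StringIO(csv_text))                 # keeps 'Unnamed: 0'
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)  # uses it as the index

print(list(raw.columns))
print(clean)
```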
