# Pandas

Pandas is a Python package for working with tabular data.

The main thing pandas provides is two new data types for storing data: the Series and the DataFrame.

Think of a pandas dataframe as an Excel spreadsheet that stores some data. One column can hold the customer name, another the name of the product sold, and another the price or quantity. Each row could then be an individual sale.

A dataframe is made up of several series.  Each column of a dataframe is a series.
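To make the column/series relationship concrete, here is a minimal sketch. The `sales` table and its values are made up for illustration; selecting a single column by name gives back a Series.

```python
import pandas as pd

# Hypothetical sales data, invented for this example
sales = pd.DataFrame({
    "customer": ["Ann", "Bob", "Cara"],
    "product": ["pen", "ink", "pad"],
    "price": [1.5, 4.0, 2.25],
})

# Selecting one column by name returns a Series
price_column = sales["price"]

print(type(sales).__name__)         # DataFrame
print(type(price_column).__name__)  # Series
print(price_column.name)            # price
```

The Series keeps the column name and shares the row index of the dataframe it came from.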

We can name each column and row of a dataframe.

A pandas dataframe is very similar to a data.frame in R.

Like a NumPy array, a dataframe is a more robust way to store data than a list of lists. Dataframes are also more flexible than NumPy arrays.

A NumPy array can represent a matrix, but all of its entries share the same data type. In a dataframe, each column can have its own data type.
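The difference in data types can be checked directly. The sketch below uses made-up values: mixing a number and a string in one NumPy array forces a single common dtype, while each dataframe column keeps its own.

```python
import numpy as np
import pandas as pd

# A NumPy array has exactly one dtype, so the float is converted to a string
mixed_array = np.array([1.5, "pen"])
print(mixed_array.dtype)

# Hypothetical table: each column keeps its own dtype
table = pd.DataFrame({
    "price": [1.5, 4.0],
    "product": ["pen", "ink"],
    "quantity": [3, 10],
})
print(table.dtypes)
```

The exact dtype shown for the text column depends on the pandas version (commonly `object`, or a dedicated string dtype in newer releases).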

That's not to say numpy arrays aren't useful.  It is often easiest to convert some subset of a dataframe to a numpy array and then use that to do some math.
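As a rough sketch of that workflow, the example below pulls two numeric columns out of a made-up `orders` table with `to_numpy()` and does the arithmetic on the resulting array.

```python
import numpy as np
import pandas as pd

# Hypothetical order data, invented for this example
orders = pd.DataFrame({
    "product": ["pen", "ink", "pad"],
    "price": [1.5, 4.0, 2.25],
    "quantity": [2, 1, 4],
})

# Take only the numeric columns and convert them to a 2-D NumPy array
numeric = orders[["price", "quantity"]].to_numpy()

# Plain NumPy math: price * quantity for each row, then the grand total
revenue = numeric[:, 0] * numeric[:, 1]
total = revenue.sum()
print(revenue)
print(total)
```

Because the selected columns mix floats and integers, the array is upcast to a single float dtype.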

Pandas also has SQL-like functions for merging, joining, and sorting dataframes.
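A small sketch of what that can look like, using two made-up tables: `merge` plays the role of a SQL join and `sort_values` the role of `ORDER BY`. The table names, columns, and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical tables, invented for this example
sales = pd.DataFrame({
    "customer": ["Ann", "Bob", "Ann"],
    "product": ["pen", "ink", "pad"],
    "quantity": [2, 1, 4],
})
prices = pd.DataFrame({
    "product": ["pen", "ink", "pad"],
    "price": [1.5, 4.0, 2.25],
})

# Similar to: SELECT ... FROM sales LEFT JOIN prices USING (product)
merged = sales.merge(prices, on="product", how="left")
merged["total"] = merged["quantity"] * merged["price"]

# Similar to: ORDER BY total DESC
ranked = merged.sort_values("total", ascending=False)
print(ranked)
```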



In [90]:
import pandas as pd
import numpy as np  

In [91]:


mylist = [5.4,6.1,1.7,99.8]
myarray = np.array(mylist)


In [92]:
myseries1 = pd.Series(data=mylist)
print(myseries1)
myseries2 = pd.Series(data=myarray)
print(myseries2)

0     5.4
1     6.1
2     1.7
3    99.8
dtype: float64
0     5.4
1     6.1
2     1.7
3    99.8
dtype: float64


In [93]:

print(myseries1[2])

1.7


In [94]:


mylabels = ['first','second','third','fourth']
myseries3 = pd.Series(data=mylist,index=mylabels)
print(myseries3)

first      5.4
second     6.1
third      1.7
fourth    99.8
dtype: float64


In [95]:

myseries4 = pd.Series(mylist,mylabels)
print(myseries4)

first      5.4
second     6.1
third      1.7
fourth    99.8
dtype: float64


In [96]:

print(myseries4['second'])

6.1


In [97]:

myseries5 = pd.Series([5.5,1.1,8.8,1.6],['first','third','fourth','fifth'])
print(myseries5)
print('')
print(myseries5+myseries4)

first     5.5
third     1.1
fourth    8.8
fifth     1.6
dtype: float64

fifth       NaN
first      10.9
fourth    108.6
second      NaN
third       2.8
dtype: float64
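
The NaN values appear because adding two Series aligns them on their index labels, and a label that exists in only one Series has nothing to be added to. If a missing label should count as zero instead, one option is `Series.add` with `fill_value`. A minimal sketch, with the values from the cells above copied into new variables `a` and `b`:

```python
import pandas as pd

a = pd.Series([5.4, 6.1, 1.7, 99.8], index=["first", "second", "third", "fourth"])
b = pd.Series([5.5, 1.1, 8.8, 1.6], index=["first", "third", "fourth", "fifth"])

# Plain addition: labels found in only one Series give NaN
plain = a + b

# Treat a missing label as 0 before adding
filled = a.add(b, fill_value=0)
print(filled)
```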


In [98]:

df1 = pd.concat([myseries4,myseries5],axis=1,sort=False)
df1

           0    1
first    5.4  5.5
second   6.1  NaN
third    1.7  1.1
fourth  99.8  8.8
fifth    NaN  1.6


In [99]:

df2 = pd.DataFrame(np.random.randn(5,5))
df2

          0         1         2         3         4
0 -0.345441  0.250540  0.910153 -2.109231 -0.093473
1  0.171482 -1.682491  0.193088 -0.434769 -0.192234
2  0.444273 -0.796343 -2.373228  0.106792 -1.222923
3 -1.907719  0.574182  0.757314 -0.697487  0.405775
4 -1.186158 -1.020411  0.093776  0.568696 -0.083483


In [100]:

df3 = pd.DataFrame(np.random.randn(5,5),index=['first row','second row','third row','fourth row','fifth row'],
                   columns=['first col','second col','third col','fourth col','fifth col'])
df3

            first col  second col  third col  fourth col  fifth col
first row    1.407045    0.472177   0.501935   -1.531533  -2.484195
second row   0.517405   -0.378353   0.812098    1.792378   0.338146
third row   -0.202217   -0.693028  -0.595876   -1.241763   0.109437
fourth row  -0.848420    1.021994   0.820039   -1.688163   1.674328
fifth row    0.089810   -0.143110   0.207664   -0.374218  -1.333064


In [101]:

print(df3['second col'])
print('')
df3[['third col','first col']]

first row     0.472177
second row   -0.378353
third row    -0.693028
fourth row    1.021994
fifth row    -0.143110
Name: second col, dtype: float64



            third col  first col
first row    0.501935   1.407045
second row   0.812098   0.517405
third row   -0.595876  -0.202217
fourth row   0.820039  -0.848420
fifth row    0.207664   0.089810


In [102]:

df3.loc['fourth row']

first col    -0.848420
second col    1.021994
third col     0.820039
fourth col   -1.688163
fifth col     1.674328
Name: fourth row, dtype: float64

In [103]:
df3.iloc[2]

first col    -0.202217
second col   -0.693028
third col    -0.595876
fourth col   -1.241763
fifth col     0.109437
Name: third row, dtype: float64
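
The two cells above differ only in how the row is addressed: `.loc` selects by label and `.iloc` selects by integer position. A small sketch on a made-up `frame` shows both reaching the same row:

```python
import pandas as pd

# Hypothetical frame with labelled rows, invented for this example
frame = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])

by_label = frame.loc["z"]     # row whose label is "z"
by_position = frame.iloc[2]   # third row, counting from 0

# Both accessors also take a column selector after the comma
cell_by_label = frame.loc["y", "b"]
cell_by_position = frame.iloc[1, 1]

print(by_label.equals(by_position))
print(cell_by_label, cell_by_position)
```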

In [104]:
df3.loc[['fourth row','first row'],['second col','third col']]

            second col  third col
fourth row    1.021994   0.820039
first row     0.472177   0.501935


In [105]:

df3>0

            first col  second col  third col  fourth col  fifth col
first row        True        True       True       False      False
second row       True       False       True        True       True
third row       False       False      False       False       True
fourth row      False        True       True       False       True
fifth row        True       False       True       False      False


In [106]:
print(df3[df3>0])

            first col  second col  third col  fourth col  fifth col
first row    1.407045    0.472177   0.501935         NaN        NaN
second row   0.517405         NaN   0.812098    1.792378   0.338146
third row         NaN         NaN        NaN         NaN   0.109437
fourth row        NaN    1.021994   0.820039         NaN   1.674328
fifth row    0.089810         NaN   0.207664         NaN        NaN
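
Indexing with a whole boolean dataframe keeps the shape and puts NaN wherever the condition is False. Indexing with a boolean Series built from a single column filters rows instead. A minimal sketch on a made-up `frame`:

```python
import pandas as pd

# Hypothetical values, invented for this example
frame = pd.DataFrame({"a": [1.0, -2.0, 3.0], "b": [-1.0, 5.0, 6.0]})

# Element-wise mask: same shape, NaN where the value is not positive
masked = frame[frame > 0]

# Row filter: keep only the rows where column "a" is positive
rows = frame[frame["a"] > 0]

print(masked)
print(rows)
```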


In [107]:

df3['sixth col'] = np.random.randn(5)  # one value per row; a 1-D array is enough for a single column
df3

            first col  second col  third col  fourth col  fifth col  sixth col
first row    1.407045    0.472177   0.501935   -1.531533  -2.484195   0.203568
second row   0.517405   -0.378353   0.812098    1.792378   0.338146  -0.940503
third row   -0.202217   -0.693028  -0.595876   -1.241763   0.109437   1.255136
fourth row  -0.848420    1.021994   0.820039   -1.688163   1.674328  -0.462425
fifth row    0.089810   -0.143110   0.207664   -0.374218  -1.333064  -1.151878


In [108]:

df3.drop('first col',axis=1,inplace=True)

In [109]:
df3

            second col  third col  fourth col  fifth col  sixth col
first row     0.472177   0.501935   -1.531533  -2.484195   0.203568
second row   -0.378353   0.812098    1.792378   0.338146  -0.940503
third row    -0.693028  -0.595876   -1.241763   0.109437   1.255136
fourth row    1.021994   0.820039   -1.688163   1.674328  -0.462425
fifth row    -0.143110   0.207664   -0.374218  -1.333064  -1.151878


In [110]:
df5 = df3.drop('second row',axis=0)
df5

            second col  third col  fourth col  fifth col  sixth col
first row     0.472177   0.501935   -1.531533  -2.484195   0.203568
third row    -0.693028  -0.595876   -1.241763   0.109437   1.255136
fourth row    1.021994   0.820039   -1.688163   1.674328  -0.462425
fifth row    -0.143110   0.207664   -0.374218  -1.333064  -1.151878


In [111]:

df5.reset_index()

        index  second col  third col  fourth col  fifth col  sixth col
0   first row    0.472177   0.501935   -1.531533  -2.484195   0.203568
1   third row   -0.693028  -0.595876   -1.241763   0.109437   1.255136
2  fourth row    1.021994   0.820039   -1.688163   1.674328  -0.462425
3   fifth row   -0.143110   0.207664   -0.374218  -1.333064  -1.151878


In [112]:
df5

            second col  third col  fourth col  fifth col  sixth col
first row     0.472177   0.501935   -1.531533  -2.484195   0.203568
third row    -0.693028  -0.595876   -1.241763   0.109437   1.255136
fourth row    1.021994   0.820039   -1.688163   1.674328  -0.462425
fifth row    -0.143110   0.207664   -0.374218  -1.333064  -1.151878


In [113]:
df5.reset_index(inplace=True)
df5

        index  second col  third col  fourth col  fifth col  sixth col
0   first row    0.472177   0.501935   -1.531533  -2.484195   0.203568
1   third row   -0.693028  -0.595876   -1.241763   0.109437   1.255136
2  fourth row    1.021994   0.820039   -1.688163   1.674328  -0.462425
3   fifth row   -0.143110   0.207664   -0.374218  -1.333064  -1.151878


In [114]:

df5['new name'] = ['This','is','the','row']
df5


        index  second col  third col  fourth col  fifth col  sixth col new name
0   first row    0.472177   0.501935   -1.531533  -2.484195   0.203568     This
1   third row   -0.693028  -0.595876   -1.241763   0.109437   1.255136       is
2  fourth row    1.021994   0.820039   -1.688163   1.674328  -0.462425      the
3   fifth row   -0.143110   0.207664   -0.374218  -1.333064  -1.151878      row


In [115]:
df5.set_index('new name',inplace=True)
df5

               index  second col  third col  fourth col  fifth col  sixth col
new name
This       first row    0.472177   0.501935   -1.531533  -2.484195   0.203568
is         third row   -0.693028  -0.595876   -1.241763   0.109437   1.255136
the       fourth row    1.021994   0.820039   -1.688163   1.674328  -0.462425
row        fifth row   -0.143110   0.207664   -0.374218  -1.333064  -1.151878
