**Pandas Data Frames**

A pandas data frame is a 2-dimensional, labeled data structure whose columns all share a common row index. We usually use such an object when we have a collection of observations/items and a fixed collection of variables/attributes that describe each of the observations.

Consider a convenience store and data for each possible transaction. A transaction has a transaction ID, a customer ID, the amount of money associated with the sale, the amount of tax associated with the sale, and a date time stamp.

The data can be stored as separate dictionaries, one per variable, with the transaction ID as the key. A key point here is that the same keys are common to all of the variables.

In [5]:
import mylib as my
import os
os.chdir(my.onedrive+"//CurrentCourses//553.688.Spring.2023//Lectures//April//Lecture19//")

In [2]:
customer={'T12341':"George",'T12342':"Thomas",'T12343':"Josephine",'T12344':"Steve",'T12345':"James"}
amount={'T12341':12.75,'T12342':10.89,'T12343':4.67,'T12344':11.21,'T12345':4.00}
tax={'T12341':.64,'T12342':.55,'T12343':.24,'T12344':.46,'T12345':.15}
newcol={'T12341':[.64,.35],'T12342':55,'T12343':.24,'T12344':"dwjdkw",'T12345':{"g":.15,"h":20}}
datetime={'T12341':'12:23:2018:12:38','T12342':'12:23:2018:12:45','T12343':'12:23:2018:12:45','T12344':'12:23:2018:12:47','T12345':'12:23:2018:12:49'}

In [3]:
import pandas as pd
df=pd.DataFrame({'customer':customer,'amount':amount,'tax':tax, 'datetime':datetime}) # "newcol":newcol})
print(df)

         customer  amount   tax          datetime
T12341     George   12.75  0.64  12:23:2018:12:38
T12342     Thomas   10.89  0.55  12:23:2018:12:45
T12343  Josephine    4.67  0.24  12:23:2018:12:45
T12344      Steve   11.21  0.46  12:23:2018:12:47
T12345      James    4.00  0.15  12:23:2018:12:49


In [None]:
## We can write a data frame out to a comma delimited file

In [4]:
df.to_csv("customer_data.csv")

In [None]:
## By default, the index is included. Pass index=False to leave it out.

In [5]:
df.to_csv("customer_data.csv",index=False)

**csv files**

We can read a comma delimited file to create a data frame.

In [None]:
df=pd.read_csv("customer_data.csv")
df.head()

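We can also specify which column of the file becomes the index. Note that if the file was written with index=False, column 0 is the first data column (here, customer) rather than the original index.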

In [None]:
df=pd.read_csv("customer_data.csv",index_col=0)
df.head()
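
The interaction between `index` in `to_csv` and `index_col` in `read_csv` can be sketched without touching the disk. This is a minimal illustration on a two-row frame made up for the purpose; it relies on `to_csv()` returning a string when no path is given, and wraps that string in `io.StringIO` so `read_csv` can read it.

```python
import io
import pandas as pd

# Small illustrative frame (not the full customer data above).
df = pd.DataFrame(
    {"customer": ["George", "Thomas"], "amount": [12.75, 10.89]},
    index=["T12341", "T12342"],
)

# Write with the index (the default), then read it back as the index.
with_index = df.to_csv()
restored = pd.read_csv(io.StringIO(with_index), index_col=0)

# Write without the index; reading back gives the default 0,1,... index.
without_index = df.to_csv(index=False)
lost = pd.read_csv(io.StringIO(without_index))

print(restored)
print(lost)
```

In the second case the transaction IDs are gone from the file entirely, so they cannot be recovered when reading.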

The _shape_ attribute tells us the number of rows and columns.

In [6]:
df.shape

(5, 4)

**Data types**

The data types of the columns are stored in the dtypes attribute as a pandas series.

In [7]:
df.dtypes

customer     object
amount      float64
tax         float64
datetime     object
dtype: object
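
The datetime column has dtype object because its values are plain strings. If real timestamps are wanted, one option is `pd.to_datetime` with an explicit format. The sketch below assumes the strings are laid out as month:day:year:hour:minute, which is a guess from the sample values and not something the data itself confirms.

```python
import pandas as pd

df = pd.DataFrame(
    {"datetime": ["12:23:2018:12:38", "12:23:2018:12:45"]},
    index=["T12341", "T12342"],
)

# Assumed layout: month:day:year:hour:minute
df["datetime"] = pd.to_datetime(df["datetime"], format="%m:%d:%Y:%H:%M")

print(df.dtypes)   # the column now has a datetime dtype instead of object
print(df)
```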

The column names are stored in the _columns_ attribute. 

In [8]:
print(df.columns)

Index(['customer', 'amount', 'tax', 'datetime'], dtype='object')


In [9]:
list(df.columns)

['customer', 'amount', 'tax', 'datetime']

In [None]:
print(type(df.columns))

In [None]:
for x in df.columns:
    print(x)

for i in range(df.columns.size):
    print(str(i)+" "+df.columns[i])

**index**

The _index_ is stored in the index attribute. Like the columns attribute, this is an Index object.

In [10]:
print(df.index)

Index(['T12341', 'T12342', 'T12343', 'T12344', 'T12345'], dtype='object')


We can get a column as a Series in two different ways: 

- using the dot
- using square brackets with the variable name in quotes 

In [11]:
df.tax

T12341    0.64
T12342    0.55
T12343    0.24
T12344    0.46
T12345    0.15
Name: tax, dtype: float64

In [12]:
df["tax"]

T12341    0.64
T12342    0.55
T12343    0.24
T12344    0.46
T12345    0.15
Name: tax, dtype: float64

We can create a data frame with a subset of the columns using a list.

In [13]:
df[["customer","tax"]]

         customer   tax
T12341     George  0.64
T12342     Thomas  0.55
T12343  Josephine  0.24
T12344      Steve  0.46
T12345      James  0.15
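
The difference between a column name and a list of column names matters even when the list has one element. A short sketch on a made-up two-row frame: single brackets return a Series, double brackets return a data frame.

```python
import pandas as pd

df = pd.DataFrame(
    {"customer": ["George", "Thomas"], "tax": [0.64, 0.55]},
    index=["T12341", "T12342"],
)

one = df["tax"]     # a name in brackets: Series
two = df[["tax"]]   # a list in brackets: DataFrame with one column

print(type(one).__name__, one.shape)
print(type(two).__name__, two.shape)
```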


We can also construct a data frame from a 2-dimensional numpy array. 

Here we create a data frame from a 10 x 3 array of normal random variables with mean 10 and standard deviation 5.

When we use this method, by default the index is 0,1,...n-1 where n is the number of rows in the input 2-d array.

In [14]:
import numpy as np
import pandas as pd
M=np.random.normal(10, 5, (10,3))
print(M)
df=pd.DataFrame(M, columns=['x', 'y','z'])
df

[[-0.60043489 10.73173496  5.65106913]
 [ 6.38534765 14.94361885  6.9675459 ]
 [17.34318193  9.54015735 11.31136799]
 [14.77130166 11.46022436  9.95936316]
 [19.27907697 13.5453974   2.09131505]
 [14.90667213  2.03322577 14.75359079]
 [ 9.48613827 16.40032659  7.37573111]
 [12.59954367 16.43105435  8.38050486]
 [ 7.21379782 16.01622974  8.70337493]
 [12.6270695  -1.0650933  11.06809386]]


           x          y          z
0  -0.600435  10.731735   5.651069
1   6.385348  14.943619   6.967546
2  17.343182   9.540157  11.311368
3  14.771302  11.460224   9.959363
4  19.279077  13.545397   2.091315
5  14.906672   2.033226  14.753591
6   9.486138  16.400327   7.375731
7  12.599544  16.431054   8.380505
8   7.213798  16.016230   8.703375
9  12.627070  -1.065093  11.068094
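
Because the array is random, the numbers change on every run. If repeatable output is preferred, one possibility is numpy's seeded generator, as sketched here; the seed value 0 is arbitrary. The sketch also checks that the default index is a RangeIndex running from 0 to n-1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)        # arbitrary seed, chosen for repeatability
M = rng.normal(10, 5, (10, 3))        # mean 10, standard deviation 5
df = pd.DataFrame(M, columns=["x", "y", "z"])

print(type(df.index).__name__)        # the default index type
print(list(df.index))
```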


We can also specify the index in the constructor.

In [15]:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'],
               index=['a','b','c','d','e','f','g','h','i','j'])
df

           x          y          z
a  11.982355  13.754252   4.197758
b  11.666802   5.224240   7.546153
c   9.118848   6.249172   6.792994
d   8.851631  21.102466  15.946509
e   9.816698   6.417140   5.906496
f   3.072211  11.623233  14.723038
g   7.191691   6.302813   6.503380
h   7.053001  15.505428   9.256247
i   9.449183   8.694107   6.646410
j   7.862480  12.826298  14.011987


**Assigning an index**

We can assign an index to a data frame after it is constructed. This approach creates the new index _positionally_, that is, the order of assignment is according to the ordering of the rows.

In [16]:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.normal(10, 5, (5,3)),
             columns=['x', 'y','z'])
#             index=['a','b','c','d','e'])
print(df)
df.index=['c','d','e','f','g']
print(df)

           x          y          z
0   3.808573   8.222364   8.999643
1   8.319182  18.778179  16.622375
2   5.661015  18.590159   9.876612
3  12.831437   1.877926  16.049012
4   5.936901   3.238578  16.317290
           x          y          z
c   3.808573   8.222364   8.999643
d   8.319182  18.778179  16.622375
e   5.661015  18.590159   9.876612
f  12.831437   1.877926  16.049012
g   5.936901   3.238578  16.317290
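
Since the assignment is positional, the new index must supply exactly one label per row. A small sketch of what happens otherwise, using a frame of zeros as a stand-in for the random data: pandas raises a ValueError and leaves the frame unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((5, 3)), columns=["x", "y", "z"])

rejected = False
try:
    df.index = ["a", "b", "c"]        # 3 labels for 5 rows
except ValueError as err:
    rejected = True
    print("assignment rejected:", err)

print(list(df.index))                 # still the default index
```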


We can use an existing column to create a new index. We can do this "inplace" meaning that we modify the existing data frame, or we can create a new data frame object.

In [17]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
            columns=['x', 'y','z'])
df

           x          y          z
0  13.328179  16.811139   8.102287
1  17.027778   3.473851   8.990559
2  14.218391  14.868016   9.480588
3   7.605404  15.559195   9.047861
4  13.908400  11.277981   9.985178
5  11.836899  23.910585   2.018541
6   6.824768  11.514617   7.454956
7  22.051282   3.879606   8.388539
8  16.645641  19.103593  13.613959
9  10.029491  12.015391  14.964004


In [18]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df2=df.set_index('x')
df2

                   y          z
x
7.796663   11.029757  13.632137
5.189702    5.844675  13.793459
6.088915    7.318521   8.861079
6.304395    3.705188  12.693479
7.288680   18.546115  11.974123
13.103748  11.188231  10.118542
11.430031  -0.131952   8.569990
8.578265    8.934822  12.381737
8.913076   -0.781033  11.436991
11.420092   4.544826   2.480288


In [None]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df.set_index('x',inplace=True)
df.head(10)

We can add an index that is not an existing column by creating a new column first.

In [19]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])


df['newcolumn']=['a','b','c','d','e','f','g','h','i','j']
print(df)
df.set_index('newcolumn',inplace=True)
print(df)

           x          y          z newcolumn
0  10.443746  10.732672  12.478161         a
1  15.619251   4.116326  11.792088         b
2   5.241283   3.567649  -1.558814         c
3   6.607825  15.279013  10.384250         d
4   9.509347  18.310627   5.638552         e
5  13.547912  14.356490  16.391716         f
6   3.652935  16.739815  20.574360         g
7   1.524651   6.623581   6.350933         h
8   8.902497   3.068381   6.668083         i
9  10.536462  10.973034   3.609080         j
                   x          y          z
newcolumn                                 
a          10.443746  10.732672  12.478161
b          15.619251   4.116326  11.792088
c           5.241283   3.567649  -1.558814
d           6.607825  15.279013  10.384250
e           9.509347  18.310627   5.638552
f          13.547912  14.356490  16.391716
g           3.652935  16.739815  20.574360
h           1.524651   6.623581   6.350933
i           8.902497   3.068381   6.668083
j          10.536462  10.973034   3.609080

This does the same thing.

In [20]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])

df.index=['a','b','c','d','e','f','g','h','i','j']
print(df)

           x          y          z
a  20.724844   8.199320   8.843872
b  12.274561  16.234458   6.997941
c  11.238885  14.084491   1.976333
d  11.719350  11.017946  15.764897
e   9.644725   0.983373  17.626776
f   6.334839  15.921011   9.652054
g   5.082531  13.059999  14.827436
h  13.573711   9.656660  12.123568
i  11.952524   7.566422   9.980512
j   8.205263   9.173130  11.568216


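Note that we can't create a new column using the dot notation. Assigning to df.newcolumn sets an ordinary Python attribute on the object (pandas issues a warning) and leaves the columns unchanged.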

In [None]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df.newcolumn=['a','b','c','d','e','f','g','h','i','j']
print(df)
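
To see the difference directly, the sketch below compares the column list after a dot assignment and after a bracket assignment on a tiny made-up frame. The warning pandas typically emits for the dot assignment is silenced here only to keep the output short.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")       # hide the pandas warning for this demo
    df.newcolumn = ["a", "b", "c"]        # sets a Python attribute, not a column

cols_after_attribute = list(df.columns)
print(cols_after_attribute)               # newcolumn is not a column

df["newcolumn"] = ["a", "b", "c"]         # brackets create the column
cols_after_brackets = list(df.columns)
print(cols_after_brackets)
```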

In [None]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df["newcolumn"]=['a' for i in range(10)]
print(df)
df.index=df.newcolumn
print(df)

**Copying**

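As usual, remember you might want to copy! Plain assignment (df2=df) only creates a second name for the same object, so an in-place change made through df2 also changes df.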

In [21]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df['myindex']=['a','b','c','d','e','f','g','h','i','j']
df2=df # df2 is a reference to the same object as df
df2.set_index('myindex',inplace=True)
df.head()
print(df)

                 x          y          z
myindex                                 
a        11.600138  12.189171  17.727250
b        13.540448  13.842041  16.509940
c         8.935494  10.698063  13.346725
d        12.735797   7.773211  11.325655
e        11.832691   4.826676   7.165023
f         6.938914  12.379760  12.506489
g         8.028788  11.836189  11.175921
h         8.831659  15.315747  10.201756
i        11.320055  10.526377  17.560307
j        13.551762   5.065426  18.434085


In [22]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df['myindex']=['a','b','c','d','e','f','g','h','i','j']
df2=df.copy()
df2.set_index('myindex',inplace=True)
df.head()
print(df)

           x          y          z myindex
0  -0.106734   1.757769   1.002131       a
1  13.649289  21.731076   7.668331       b
2  11.302302   7.504953   8.435466       c
3  -1.100326  12.012646   4.116923       d
4   6.628965  13.095527   8.103989       e
5   7.201744  14.997500  10.163402       f
6   8.829424   7.526975  10.907920       g
7  14.596004   2.572379  15.377307       h
8  14.626773  10.599624   5.921168       i
9  14.816968   8.407309  12.273626       j
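
The two cells above can be condensed into one check with the `is` operator, sketched here on a small made-up frame. The names `alias` and `clone` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "myindex": ["a", "b", "c"]})

alias = df          # second name for the same object, no data copied
clone = df.copy()   # independent object with its own data

clone.set_index("myindex", inplace=True)

print(alias is df, clone is df)
print(list(df.columns))      # unaffected by the change made to clone
print(list(clone.columns))
```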


A good index usually has the property that it identifies a unique observation. So when setting the index, we might wish to perform an integrity check.

In [23]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df['myindex']=['a','b','c','d','e','f','g','h','a','j']

try:
    df.set_index('myindex',inplace=True,verify_integrity=True)
except ValueError:
    print("integrity check failed")
df


integrity check failed


           x          y          z myindex
0   5.783496  12.010168   7.744012       a
1   0.288021  14.127683  10.146290       b
2  10.743895   9.233010   6.501808       c
3   4.315514   6.022782  12.946163       d
4   5.559030  16.176096  12.178700       e
5   9.243145  11.227480   6.348335       f
6   6.664602  14.173572  14.779938       g
7   9.421780   8.819010  10.544122       h
8  11.716613  15.361831  18.286150       a
9  11.054544  18.908229   6.799764       j


That said, uniqueness is not a strict requirement.

In [None]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df['myindex']=['a','b','c','d','e','f','g','h','a','j']
try:
    df.set_index('myindex',inplace=True,verify_integrity=False)
except ValueError:
    print("integrity check failed")
df
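
When duplicates are allowed in, it can still be useful to find them afterwards. A minimal sketch, on a made-up four-row frame, using the `is_unique` attribute and the `duplicated()` method of the index; it also shows that a label lookup on a repeated label returns every matching row.

```python
import pandas as pd

df = pd.DataFrame({"x": [0, 1, 2, 3]}, index=["a", "b", "a", "c"])

print(df.index.is_unique)                  # is every label distinct?

dupes = df.index[df.index.duplicated()]    # labels seen for a second time
print(list(dupes))

rows_a = df.loc["a"]                       # all rows labeled 'a'
print(rows_a)
```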

We can revert to the default index (0,1,...) using the reset_index method.

When we do this, the old index is retained as a column.

In [24]:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'],
               index=['a','b','c','d','e','f','g','h','i','j'])
print(df)
print(df.columns)

df.reset_index(inplace=True)
print(df)
print(df.columns)

           x          y          z
a   4.506940   4.525327  10.422189
b  11.528840  10.823148  10.528653
c   9.577567  12.770474  17.631923
d  10.488932   8.242022  12.627924
e  15.207455  14.795386  10.201545
f   9.107228  13.618833  13.331245
g   5.130633  15.323087  12.014781
h  13.083845  10.559684  13.130431
i  14.108936   6.747988   3.444035
j  15.334993   6.093962  13.564625
Index(['x', 'y', 'z'], dtype='object')
  index          x          y          z
0     a   4.506940   4.525327  10.422189
1     b  11.528840  10.823148  10.528653
2     c   9.577567  12.770474  17.631923
3     d  10.488932   8.242022  12.627924
4     e  15.207455  14.795386  10.201545
5     f   9.107228  13.618833  13.331245
6     g   5.130633  15.323087  12.014781
7     h  13.083845  10.559684  13.130431
8     i  14.108936   6.747988   3.444035
9     j  15.334993   6.093962  13.564625
Index(['index', 'x', 'y', 'z'], dtype='object')
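
If the old index is not worth keeping, `reset_index` accepts `drop=True`, which discards it instead of turning it into a column. A short sketch on a made-up three-row frame, without `inplace` so both results can be compared:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])

kept = df.reset_index()               # old index becomes a column named 'index'
dropped = df.reset_index(drop=True)   # old index is discarded

print(list(kept.columns))
print(list(dropped.columns))
```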


We can _reindex_ a data frame by supplying a list of values.

In the following example, we sort the index and use that as the new index.

Key point: this does not change which index value is associated with which row of values in the data frame.

In [25]:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'],
             index=['g','c','h','b','a','e','f','i','d','j'])
print(df)
L=list(df.index) # list of index values
print(L)
print(sorted(L))
df=df.reindex(sorted(L))
print(df)

           x          y          z
g  12.966630   9.584956   8.886095
c   0.876345  17.419395  13.696239
h   4.336917   3.103108   8.429176
b  16.875621   9.445366   8.456519
a  14.987151   3.336191  12.939643
e   8.487375   7.699992   7.386159
f  14.382734  11.174197  13.342960
i  15.845151   6.956576   9.865940
d  23.127316   8.071907   4.004636
j   8.734888  11.078268   4.892849
['g', 'c', 'h', 'b', 'a', 'e', 'f', 'i', 'd', 'j']
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
           x          y          z
a  14.987151   3.336191  12.939643
b  16.875621   9.445366   8.456519
c   0.876345  17.419395  13.696239
d  23.127316   8.071907   4.004636
e   8.487375   7.699992   7.386159
f  14.382734  11.174197  13.342960
g  12.966630   9.584956   8.886095
h   4.336917   3.103108   8.429176
i  15.845151   6.956576   9.865940
j   8.734888  11.078268   4.892849
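
For the specific case of sorting by index, pandas also provides `sort_index`. The sketch below, on a made-up three-row frame, suggests the two routes give the same rows in the same order, and that each label keeps its own values.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]}, index=["g", "c", "h"])

by_reindex = df.reindex(sorted(df.index))
by_sort = df.sort_index()

print(by_reindex)
print(by_sort)
```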


The new index does not have to include every existing index value. Rows whose labels are left out are dropped.

In [None]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'],
             index=['g','c','h','b','a','e','f','i','d','j'])
print(df)
L=['g','a','e']
print(L)
df=df.reindex(L)
print(df)

Repeats are allowed.

In [None]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'],
             index=['g','c','h','b','a','e','f','i','d','j'])
print(df)
L=['a','g','a','g','b','b']
print(L)
df=df.reindex(L)
print(df)

In [None]:
df.reindex(['a','g'])

When a new index value is used, the corresponding row is filled with missing values (NaN).

In [None]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'],
             index=['a','c','d','f','g','e','b','j','h','i'])
print(df)
df=df.reindex(['f','g','t','s','c'])
print(df)
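
The fill behavior can be checked on a smaller frame. This sketch uses made-up values and also shows the `fill_value` parameter of `reindex`, which substitutes a chosen value for NaN in the rows created for unknown labels.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "c", "f"])

with_nan = df.reindex(["f", "t", "c"])                    # 't' is not in df
with_fill = df.reindex(["f", "t", "c"], fill_value=0.0)   # use 0.0 instead of NaN

print(with_nan)
print(with_fill)
```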

The __head()__ and __tail()__ methods are useful for inspecting a small portion of a data frame. Called without an argument, each returns 5 rows.

In [26]:
df.head(3)

           x          y          z
a  14.987151   3.336191  12.939643
b  16.875621   9.445366   8.456519
c   0.876345  17.419395  13.696239


In [None]:
df.head(2)

In [27]:
df.tail(3)

           x          y          z
h   4.336917   3.103108   8.429176
i  15.845151   6.956576   9.865940
j   8.734888  11.078268   4.892849


In [28]:
df.tail()

           x          y          z
f  14.382734  11.174197  13.342960
g  12.966630   9.584956   8.886095
h   4.336917   3.103108   8.429176
i  15.845151   6.956576   9.865940
j   8.734888  11.078268   4.892849
