**Pandas Data Frames**

A pandas data frame is a 2-dimensional table in which all of the columns share a common index. We usually use such an object when we have a collection of observations/items and a fixed collection of variables/attributes that describe each of the observations.

Consider a convenience store and data for each possible transaction. A transaction has a transaction ID, a customer ID, the amount of money associated with the sale, the amount of tax associated with the sale, and a date time stamp.

The data can be stored as separate dictionaries, one per variable, with the transaction ID as the key. A key point here is that the key is common to all of the variables.
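To make the role of the common key concrete, here is a minimal sketch with made-up transaction IDs (`T1`, `T2`, `T3` and the name `toy` are hypothetical, not part of the data used below). When the dictionaries are combined, pandas lines the values up by key, so a key that is missing from one dictionary shows up as a missing value instead of shifting the other entries.

```python
import pandas as pd

# Hypothetical toy data: two variables keyed by transaction ID.
amount = {"T1": 12.75, "T2": 10.89, "T3": 4.67}
tax = {"T1": 0.64, "T3": 0.24}          # note: no entry for "T2"

# The outer dictionary's keys become columns; the inner keys become the index.
toy = pd.DataFrame({"amount": amount, "tax": tax})
print(toy)

# Values are aligned on the shared key, so the missing tax for "T2" is NaN.
print(toy.loc["T2"])
```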

In [5]:
import mylib as my
import os
os.chdir(my.onedrive+"//CurrentCourses//553.688.Spring.2023//Lectures//April//Lecture19//")

In [1]:
customer={'T12341':"George",'T12342':"Thomas",'T12343':"Josephine",'T12344':"Steve",'T12345':"James"}
amount={'T12341':12.75,'T12342':10.89,'T12343':4.67,'T12344':11.21,'T12345':4.00}
tax={'T12341':.64,'T12342':.55,'T12343':.24,'T12344':.46,'T12345':.15}
newcol={'T12341':[.64,.35],'T12342':55,'T12343':.24,'T12344':"dwjdkw",'T12345':{"g":.15,"h":20}}
datetime={'T12341':'12:23:2018:12:38','T12342':'12:23:2018:12:45','T12343':'12:23:2018:12:45','T12344':'12:23:2018:12:47','T12345':'12:23:2018:12:49'}

In [1]:
import pandas as pd
df=pd.DataFrame({'customer':customer,'amount':amount,'tax':tax, 'datetime':datetime}) # "newcol":newcol})



In [None]:
## We can write a data frame out to a comma delimited file

In [8]:
df.to_csv("customer_data1.csv")

In [None]:
## By default, the index is included.

In [4]:
df.to_csv("customer_data.csv",index=False)

**csv files**

We can read a comma delimited file to create a data frame.

In [11]:
df=pd.read_csv("customer_data1.csv")
df.head()

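   Unnamed: 0.1  Unnamed: 0   customer  amount   tax          datetime
0             0           0     George   12.75  0.64  12:23:2018:12:38
1             1           1     Thomas   10.89  0.55  12:23:2018:12:45
2             2           2  Josephine    4.67  0.24  12:23:2018:12:45
3             3           3      Steve   11.21  0.46  12:23:2018:12:47
4             4           4      James    4.00  0.15  12:23:2018:12:49

The Unnamed columns are indices that were written into the file by to_csv and then read back by read_csv as ordinary columns, because a saved index has no header in the file. Two of them appear here, which suggests that the file was written and read more than once.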


We can also include an index and specify which column becomes the index.

In [14]:
df=pd.read_csv("customer_data.csv",index_col=0)
df.head()

    customer  amount   tax          datetime
0     George   12.75  0.64  12:23:2018:12:38
1     Thomas   10.89  0.55  12:23:2018:12:45
2  Josephine    4.67  0.24  12:23:2018:12:45
3      Steve   11.21  0.46  12:23:2018:12:47
4      James    4.00  0.15  12:23:2018:12:49
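The interaction between the index written by to_csv and the index_col argument of read_csv can be tried without touching the disk. The sketch below is only an illustration: the frame `toy` and its labels are hypothetical, and an in-memory string stands in for the csv file.

```python
import io
import pandas as pd

# Hypothetical two-row frame with transaction IDs as the index.
toy = pd.DataFrame({"amount": [12.75, 10.89]}, index=["T1", "T2"])

# With no file name, to_csv returns the csv text as a string.
csv_text = toy.to_csv()          # the index is written as a first column with no header
print(csv_text)

# Without index_col, the saved index comes back as a column called "Unnamed: 0".
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())

# With index_col=0, the first column is used as the index again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored.index.tolist())
```

Writing with index=False avoids the extra column, at the cost of discarding the index labels.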


The _shape_ attribute tells us the number of rows and columns.

In [6]:
df.shape

(5, 4)

**Data types**

The data types of the columns are stored in the dtypes attribute as a pandas Series indexed by column name.

In [7]:
df.dtypes

customer     object
amount      float64
tax         float64
datetime     object
dtype: object
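Because dtypes is itself a Series, it can be queried by column name like any other Series. The following sketch uses a small hypothetical frame (`toy`, with made-up columns `amount` and `items`) and also shows astype, which is one way to change a column's type.

```python
import pandas as pd

toy = pd.DataFrame({"amount": [12.75, 10.89, 4.67], "items": [3, 2, 1]})

# The index of the dtypes Series is the list of column names.
print(toy.dtypes.index.tolist())

# Look up the type of a single column by name.
print(toy.dtypes["amount"])

# astype returns a new data frame with the requested column types.
converted = toy.astype({"items": "float64"})
print(converted.dtypes)
```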

The column names are stored in the _columns_ attribute. 

In [8]:
print(df.columns)

Index(['customer', 'amount', 'tax', 'datetime'], dtype='object')


In [9]:
list(df.columns)

['customer', 'amount', 'tax', 'datetime']

In [15]:
print(type(df.columns))

<class 'pandas.core.indexes.base.Index'>


In [16]:
for x in df.columns:
    print(x)

for i in range(df.columns.size):
    print(str(i)+" "+df.columns[i])

customer
amount
tax
datetime
0 customer
1 amount
2 tax
3 datetime


**index**

The _index_ is stored in the index attribute. Like the column names, it is a pandas Index object.

In [17]:
print(df.index)

Int64Index([0, 1, 2, 3, 4], dtype='int64')


We can get a column as a Series in two different ways: 

- using the dot (attribute) notation
- using square brackets with the variable name in quotes

In [19]:
df.tax

0    0.64
1    0.55
2    0.24
3    0.46
4    0.15
Name: tax, dtype: float64

In [20]:
df["tax"]

0    0.64
1    0.55
2    0.24
3    0.46
4    0.15
Name: tax, dtype: float64
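The two notations are interchangeable only for well-behaved column names. As a hedged illustration with a hypothetical frame (`toy` and its column names are invented for this sketch), the dot notation cannot reach a column whose name clashes with an existing attribute or is not a valid Python identifier, while square brackets always work.

```python
import pandas as pd

# Hypothetical frame: one column name clashes with an attribute, one has a space.
toy = pd.DataFrame({"tax": [0.64, 0.55],
                    "shape": ["round", "square"],
                    "unit price": [1.5, 2.0]})

# For an ordinary name, both notations return the same Series.
print(toy.tax.equals(toy["tax"]))

# "shape" is already a data frame attribute, so the dot returns the dimensions...
print(toy.shape)
# ...and only square brackets reach the column.
print(toy["shape"].tolist())

# A name containing a space cannot be written after a dot at all.
print(toy["unit price"].tolist())
```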

We can create a data frame with a subset of the columns using a list.

In [21]:
df[["customer","tax"]]

    customer   tax
0     George  0.64
1     Thomas  0.55
2  Josephine  0.24
3      Steve  0.46
4      James  0.15
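A detail that is easy to miss is that the type of the result depends on what goes inside the brackets. The sketch below, on a hypothetical two-row frame named `toy`, contrasts a single name (which gives a Series) with a list containing one name (which gives a data frame).

```python
import pandas as pd

toy = pd.DataFrame({"customer": ["George", "Thomas"], "tax": [0.64, 0.55]})

one_col_series = toy["tax"]        # a string inside the brackets gives a Series
one_col_frame = toy[["tax"]]       # a list inside the brackets gives a DataFrame

print(type(one_col_series).__name__, one_col_series.shape)
print(type(one_col_frame).__name__, one_col_frame.shape)

# The list also controls the column order of the result.
print(toy[["tax", "customer"]].columns.tolist())
```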


We can also construct a data frame from a 2-dimensional numpy array. 

Here we create a data frame from a 10 x 3 array of normal random variables with mean 10 and standard deviation 5.

When we use this method, by default the index is 0, 1, ..., n-1, where n is the number of rows in the input 2-d array.

In [22]:
import numpy as np
import pandas as pd
M=np.random.normal(10, 5, (10,3))
print(M)
df=pd.DataFrame(M, columns=['x', 'y','z'])
df

[[15.09481419 15.19778296 19.20241842]
 [11.80482733  7.1450584  10.54425626]
 [ 6.94408169 13.34108913 14.50133706]
 [ 3.857651   -0.30387235  6.6539544 ]
 [15.98459961  2.96199297  7.61264132]
 [ 8.23020481 18.32837188  9.03854737]
 [13.0816169  11.45995519  2.44372734]
 [ 7.40069477 10.05798098 12.81122923]
 [15.1216967  10.4157701   8.10647268]
 [15.41987083  5.4643112   3.9193966 ]]


           x          y          z
0  15.094814  15.197783  19.202418
1  11.804827   7.145058  10.544256
2   6.944082  13.341089  14.501337
3   3.857651  -0.303872   6.653954
4  15.984600   2.961993   7.612641
5   8.230205  18.328372   9.038547
6  13.081617  11.459955   2.443727
7   7.400695  10.057981  12.811229
8  15.121697  10.415770   8.106473
9  15.419871   5.464311   3.919397


We can also specify the index in the constructor.

In [23]:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'],
               index=['a','b','c','d','e','f','g','h','i','j'])
df

           x          y          z
a   4.283439  10.669548  12.965480
b   3.923138   3.513190   8.202821
c   9.114227  12.751014  15.756973
d   9.813205   1.436731   7.391404
e   4.112175  10.008039  18.788865
f   4.112234  15.571798   7.936547
g  -0.529463   7.525474  15.729775
h  11.031221  15.594405   4.718158
i   5.693667  19.164322   9.544229
j   9.181733   8.882572  10.579311


**Assigning an index**

We can assign an index to a data frame after it is constructed. This approach creates the new index _positionally_, that is, the order of assignment is according to the ordering of the rows.

In [24]:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.normal(10, 5, (5,3)),
             columns=['x', 'y','z'])
#             index=['a','b','c','d','e'])
print(df)
df.index=['c','d','e','f','g']
print(df)

          x         y          z
0  5.524908  5.809753  10.777820
1  6.978091  2.355096   8.606505
2  8.309716 -0.171347   6.023355
3  4.267410  7.218644  -2.684022
4  3.319786  8.115015  11.585352
          x         y          z
c  5.524908  5.809753  10.777820
d  6.978091  2.355096   8.606505
e  8.309716 -0.171347   6.023355
f  4.267410  7.218644  -2.684022
g  3.319786  8.115015  11.585352
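Since the assignment is positional, the only requirement is that there is exactly one label per row. Here is a minimal sketch on a hypothetical three-row frame (`toy`) that shows both the row-by-row pairing and the error raised when the lengths differ.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

# Labels are attached row by row, in order: first label to first row, and so on.
toy.index = ["c", "d", "e"]
print(toy.loc["d", "x"])

# The number of labels must equal the number of rows.
mismatch_rejected = False
try:
    toy.index = ["c", "d"]
except ValueError as err:
    mismatch_rejected = True
    print("ValueError:", err)
```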


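We can use an existing column to create a new index with the set_index method. We can do this "inplace", meaning that we modify the existing data frame, or we can create a new data frame object. By default, the column is removed from the columns when it becomes the index.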

In [17]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
            columns=['x', 'y','z'])
df

           x          y          z
0  13.328179  16.811139   8.102287
1  17.027778   3.473851   8.990559
2  14.218391  14.868016   9.480588
3   7.605404  15.559195   9.047861
4  13.908400  11.277981   9.985178
5  11.836899  23.910585   2.018541
6   6.824768  11.514617   7.454956
7  22.051282   3.879606   8.388539
8  16.645641  19.103593  13.613959
9  10.029491  12.015391  14.964004


In [25]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df2=df.set_index('x')
df2

                   y          z
x                              
9.250579    4.443227   5.347925
12.532742   9.839832  12.767862
9.876916    9.255887  10.878407
9.531469   12.708833   7.729255
16.348739  11.060060   9.466784
3.911216   12.647263  11.834287
17.643780  12.802094   8.529804
7.552932    7.669257  10.242501
6.110540   14.837826  13.144592
4.276908    7.865519  10.739532


In [36]:
import numpy as np
np.random.seed(0)
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df.set_index('x',inplace=True)
df.head(10)

                   y          z
x                              
18.820262  12.000786  14.893690
21.204466  19.337790   5.113611
14.750442   9.243214   9.483906
12.052993  10.720218  17.271368
13.805189  10.608375  12.219316
11.668372  17.470395   8.974209
11.565339   5.729521  -2.764949
13.268093  14.322181   6.289175
21.348773   2.728172  10.228793
9.064081   17.663896  17.346794
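The difference between the two calling styles is easiest to see side by side. This is a small sketch on a hypothetical frame (`toy`): without inplace, set_index hands back a new data frame and leaves the original alone; with inplace=True it changes the original and returns None.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.5, 2.5], "y": [10, 20]})

# Default: a new data frame is returned and the original is left alone.
indexed = toy.set_index("x")
print(toy.columns.tolist())      # "x" is still a column of toy
print(indexed.columns.tolist())  # "x" has moved into the index

# inplace=True modifies the object itself and returns None.
result = toy.set_index("x", inplace=True)
print(result)
print(toy.columns.tolist())
```

A common slip is to write `df = df.set_index('x', inplace=True)`, which replaces df with None.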


We can add an index that is not an existing column by creating a new column first.

In [43]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])


df['newcolumn']=['a','b','c','d','e','f','g','h','i','j']
print(df)
df.set_index('newcolumn',inplace=True)
print(df)

           x          y          z newcolumn
0  14.550895  11.586091  13.931640         a
1   7.667905   5.277769   7.949752         b
2   9.914898  11.895759  21.296545         c
3   9.788714   5.220275   8.270091         d
4   7.682020  12.407407   2.296015         e
5  10.316310  10.782533  11.160905         f
6   7.013420   8.810391   2.879695         g
7   7.533401   7.285693  12.080250         h
8   4.219088  13.905991  17.472423         i
9  -0.349925  12.131294  13.384540         j
                   x          y          z
newcolumn                                 
a          14.550895  11.586091  13.931640
b           7.667905   5.277769   7.949752
c           9.914898  11.895759  21.296545
d           9.788714   5.220275   8.270091
e           7.682020  12.407407   2.296015
f          10.316310  10.782533  11.160905
g           7.013420   8.810391   2.879695
h           7.533401   7.285693  12.080250
i           4.219088  13.905991  17.472423
j          -0.349925  12.131294  13.384540

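Assigning the list directly to the index attribute does the same thing without the intermediate column (the only difference is that the resulting index has no name).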

In [41]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])

df.index=['a','b','c','d','e','f','g','h','i','j']
print(df)

           x          y          z
a   9.658792  18.566714   6.276226
b   5.867807   9.507737   6.682609
c  15.633180   4.600342   4.262657
d   7.810900   7.509838  19.647660
e  14.747104  10.437756   3.872822
f  14.221815   4.998923   2.276145
g  15.940149  11.584713  14.604294
h  11.593638  14.284153   6.744872
i   4.828786  13.407973   5.982952
j   6.552251   7.722337  10.087396


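Note that we can't create a new column using the dot notation. The assignment below only attaches an ordinary Python attribute to the object (pandas issues a warning), so the printed data frame still has just the columns x, y and z.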

In [44]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df.newcolumn=['a','b','c','d','e','f','g','h','i','j']
print(df)

           x          y          z
0   6.812815   8.013641   9.335597
1   8.511046   8.454935   1.619981
2  15.761658  15.398093   5.933179
3   2.667878  12.605324   7.121060
4  10.709766   8.403358  13.457694
5  13.473746   6.372013   3.083180
6   2.085308  13.051897   4.055704
7   7.465918   7.018430   9.737164
8   0.318601  10.943893  12.619455
9  10.442110   8.445569  10.487001


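What happens to the assigned list can be checked directly. The sketch below uses a hypothetical three-row frame (`toy`) and silences the warning only to keep the output short; it is an illustration of the behaviour, not a recommended pattern.

```python
import warnings
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")     # hide the warning pandas issues here
    toy.newcolumn = ["a", "b", "c"]     # sets a Python attribute, not a column

created = "newcolumn" in toy.columns
print(created)                          # no column was created
print(toy.newcolumn)                    # the list is stored as a plain attribute

toy["newcolumn"] = ["a", "b", "c"]      # square brackets do create the column
print(toy.columns.tolist())
```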


In [45]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df["newcolumn"]=['a' for i in range(10)]
print(df)
df.index=df.newcolumn
print(df)

           x          y          z newcolumn
0  11.995232  -3.862964  19.779562         a
1  11.950467   6.737957   8.045233         a
2  12.468709   9.419480  -0.153422         a
3  20.322464   9.447297  15.100864         a
4   6.539751  17.681885  11.431718         a
5  13.044219   4.773733  16.055726         a
6  13.449091  16.509231   6.859562         a
7   7.594864  21.519583   4.699921         a
8   9.320251  15.684457  10.488625         a
9  12.914768   8.002755  11.850279         a
                   x          y          z newcolumn
newcolumn                                           
a          11.995232  -3.862964  19.779562         a
a          11.950467   6.737957   8.045233         a
a          12.468709   9.419480  -0.153422         a
a          20.322464   9.447297  15.100864         a
a           6.539751  17.681885  11.431718         a
a          13.044219   4.773733  16.055726         a
a          13.449091  16.509231   6.859562         a
a           7.594864  21.519583   4.699921         a
a           9.320251  15.684457  10.488625         a
a          12.914768   8.002755  11.850279         a

**Copying**

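As usual, remember you might want to copy! Assigning a data frame to a new name (df2=df) does not copy it: both names refer to the same object, so an in-place change made through one name is visible through the other. Use the copy method to get an independent data frame.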

In [47]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df['myindex']=['a','b','c','d','e','f','g','h','i','j']
df2=df # df2 is a reference to the same object as df
df2.set_index('myindex',inplace=True)
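print(df)        # df has changed too: 'myindex' is now its index
print(df==df2)   # all True, since df and df2 are the same object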


                 x          y          z
myindex                                 
a        13.735942   4.055275  13.866265
b         4.080597  -3.295861  13.031598
c         1.220547  12.254672   6.579946
d        18.297754  15.342547   7.733071
e         6.560812   3.929613   7.795387
f         8.598223   8.176532  10.783519
g        12.892607  11.748272   6.179280
h         2.811043  16.822659   6.552754
i         6.738532   7.394053   0.784652
j         7.610130   7.601721  13.101791
            x     y     z
myindex                  
a        True  True  True
b        True  True  True
c        True  True  True
d        True  True  True
e        True  True  True
f        True  True  True
g        True  True  True
h        True  True  True
i        True  True  True
j        True  True  True


In [22]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df['myindex']=['a','b','c','d','e','f','g','h','i','j']
df2=df.copy()
df2.set_index('myindex',inplace=True)
print(df)  # df is unchanged: 'myindex' is still an ordinary column

           x          y          z myindex
0  -0.106734   1.757769   1.002131       a
1  13.649289  21.731076   7.668331       b
2  11.302302   7.504953   8.435466       c
3  -1.100326  12.012646   4.116923       d
4   6.628965  13.095527   8.103989       e
5   7.201744  14.997500  10.163402       f
6   8.829424   7.526975  10.907920       g
7  14.596004   2.572379  15.377307       h
8  14.626773  10.599624   5.921168       i
9  14.816968   8.407309  12.273626       j
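The two cells above can be condensed into one check. In this sketch (the names `toy`, `alias` and `clone` are hypothetical), the `is` operator shows whether two names refer to the same object, and an in-place change to the copy leaves the original untouched.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2], "myindex": ["a", "b"]})

alias = toy            # a second name for the same object
clone = toy.copy()     # an independent copy of the data

print(alias is toy)
print(clone is toy)

# Changing the copy in place leaves the original untouched.
clone.set_index("myindex", inplace=True)
print(toy.columns.tolist())
print(clone.columns.tolist())
```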


A good index usually has the property that it identifies a unique observation. So when setting the index, we might wish to perform an integrity check.

In [48]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df['myindex']=['a','b','c','d','e','f','g','h','a','j']

try:
    df.set_index('myindex',inplace=True,verify_integrity=True)
except ValueError:
    print("integrity check failed")
df


integrity check failed


           x          y          z myindex
0  13.492286  10.018854  14.659242       a
1  11.699825   9.921589  10.804641       b
2   9.046733   8.025752   8.661332       c
3   4.359943  11.402209   5.034382       d
4  14.208156   8.752707  10.247475       e
5  12.469184  13.216572   2.146883       f
6   8.965482  14.400895   1.509471       g
7  11.936402  -1.277821   4.887466       h
8  10.193153   1.716424   5.072446       a
9   2.640825  18.240675  10.821139       j
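When the integrity check fails, it helps to know which labels are responsible. One possible way to find out before calling set_index, sketched here on a hypothetical four-row frame (`toy`), is to test the candidate column with is_unique and list the offending rows with duplicated.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                    "myindex": ["a", "b", "a", "d"]})

# is_unique is False as soon as any label occurs more than once.
print(toy["myindex"].is_unique)

# duplicated(keep=False) flags every row whose label is repeated.
repeated = toy[toy["myindex"].duplicated(keep=False)]
print(repeated)
```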


That said, uniqueness is not a strict requirement.

In [None]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'])
df['myindex']=['a','b','c','d','e','f','g','h','a','j']
try:
    df.set_index('myindex',inplace=True,verify_integrity=False)
except ValueError:
    print("integrity check failed")
df
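One consequence of allowing repeated labels is that a lookup by label no longer has a single answer. The following sketch, on a hypothetical frame (`toy`) whose index repeats the label 'a', shows that selecting a unique label gives a Series while selecting a repeated label gives a data frame with one row per occurrence.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "a"])

print(toy.index.is_unique)

unique_hit = toy.loc["b"]      # unique label: a Series holding that row
repeated_hit = toy.loc["a"]    # repeated label: a DataFrame with two rows

print(type(unique_hit).__name__)
print(repeated_hit)
```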

We can revert back to the default index (0,1,...) using the reset_index method.

When we do this, the old index is retained as a column, which is named index if the old index had no name.

In [24]:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'],
               index=['a','b','c','d','e','f','g','h','i','j'])
print(df)
print(df.columns)

df.reset_index(inplace=True)
print(df)
print(df.columns)

           x          y          z
a   4.506940   4.525327  10.422189
b  11.528840  10.823148  10.528653
c   9.577567  12.770474  17.631923
d  10.488932   8.242022  12.627924
e  15.207455  14.795386  10.201545
f   9.107228  13.618833  13.331245
g   5.130633  15.323087  12.014781
h  13.083845  10.559684  13.130431
i  14.108936   6.747988   3.444035
j  15.334993   6.093962  13.564625
Index(['x', 'y', 'z'], dtype='object')
  index          x          y          z
0     a   4.506940   4.525327  10.422189
1     b  11.528840  10.823148  10.528653
2     c   9.577567  12.770474  17.631923
3     d  10.488932   8.242022  12.627924
4     e  15.207455  14.795386  10.201545
5     f   9.107228  13.618833  13.331245
6     g   5.130633  15.323087  12.014781
7     h  13.083845  10.559684  13.130431
8     i  14.108936   6.747988   3.444035
9     j  15.334993   6.093962  13.564625
Index(['index', 'x', 'y', 'z'], dtype='object')
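If the old index is not worth keeping, reset_index accepts drop=True. A short sketch on a hypothetical frame (`toy`) compares the two behaviours.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])

kept = toy.reset_index()              # old index becomes a column named "index"
dropped = toy.reset_index(drop=True)  # old index is discarded

print(kept.columns.tolist())
print(dropped.columns.tolist())
print(dropped.index.tolist())
```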


We can _reindex_ a data frame by supplying a list of values.

In the following example, we sort the index and use that as the new index.

Key point: this does not change which index value is associated with which row of values in the data frame.

In [25]:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'],
             index=['g','c','h','b','a','e','f','i','d','j'])
print(df)
L=list(df.index) # list of index values
print(L)
print(sorted(L))
df=df.reindex(sorted(L))
print(df)

           x          y          z
g  12.966630   9.584956   8.886095
c   0.876345  17.419395  13.696239
h   4.336917   3.103108   8.429176
b  16.875621   9.445366   8.456519
a  14.987151   3.336191  12.939643
e   8.487375   7.699992   7.386159
f  14.382734  11.174197  13.342960
i  15.845151   6.956576   9.865940
d  23.127316   8.071907   4.004636
j   8.734888  11.078268   4.892849
['g', 'c', 'h', 'b', 'a', 'e', 'f', 'i', 'd', 'j']
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
           x          y          z
a  14.987151   3.336191  12.939643
b  16.875621   9.445366   8.456519
c   0.876345  17.419395  13.696239
d  23.127316   8.071907   4.004636
e   8.487375   7.699992   7.386159
f  14.382734  11.174197  13.342960
g  12.966630   9.584956   8.886095
h   4.336917   3.103108   8.429176
i  15.845151   6.956576   9.865940
j   8.734888  11.078268   4.892849
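For the particular case of sorting, pandas also provides the sort_index method. The sketch below (on a hypothetical frame named `toy`) checks that it gives the same result as reindexing with the sorted labels, and that each label keeps its own row of values.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["g", "c", "a"])

by_reindex = toy.reindex(sorted(toy.index))
by_sort = toy.sort_index()

print(by_sort)
print(by_reindex.equals(by_sort))

# The value that belonged to "g" before sorting still belongs to "g".
print(by_sort.loc["g", "x"])
```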


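The new index does not have to include all of the existing index values, and it may include values that are not in the current index; the rows for such values are filled with NaN.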

In [52]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'],
             index=['g','c','h','b','a','e','f','i','d','j'])
print(df)
L=['g','a','e']
print(L)
df=df.reindex(['a','p','g'])
print(df)

           x          y          z
g  13.857030  15.147194   5.456184
c   7.878412  14.312980  -3.278095
h  17.566640  12.765660   9.771480
b  11.102538   4.850324   8.250283
a  15.501422  16.490110  23.481120
e   9.630377   6.707235   7.428830
f   4.909791   9.610726  11.913662
i   9.828789  15.481734   8.828921
d   8.262747   7.093658   1.836827
j   2.161161   4.104210  16.507140
['g', 'a', 'e']
           x          y          z
a  15.501422  16.490110  23.481120
p        NaN        NaN        NaN
g  13.857030  15.147194   5.456184


Repeats are allowed.

In [None]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'],
             index=['g','c','h','b','a','e','f','i','d','j'])
print(df)
L=['a','g','a','g','b','b']
print(L)
df=df.reindex(L)
print(df)

In [None]:
df.reindex(['a','g'])
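Repeats in the new index are fine, but a data frame whose index already contains repeats cannot be reindexed again with different labels, because pandas cannot tell which of the repeated rows is meant. The sketch below, on a hypothetical frame (`toy`), shows both halves; the exact wording of the error message varies between pandas versions.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "g", "b"])

# Repeating labels in the new index repeats the corresponding rows.
repeated = toy.reindex(["a", "g", "a"])
print(repeated)

# Once labels repeat, a further reindex is ambiguous and pandas refuses it.
outcome = "no error"
try:
    repeated.reindex(["a", "g"])
except ValueError as err:
    outcome = "ValueError"
    print("ValueError:", err)
```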

When a new index value is used, the values in that row are filled in as missing (NaN).

In [None]:
import numpy as np
df=pd.DataFrame(np.random.normal(10, 5, (10,3)),
             columns=['x', 'y','z'],
             index=['a','c','d','f','g','e','b','j','h','i'])
print(df)
df=df.reindex(['f','g','t','s','c'])
print(df)
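If NaN is not the desired placeholder, reindex takes a fill_value argument. The sketch below uses a hypothetical two-row frame (`toy`, with 't' as the new label) to show how to spot the rows that were added and how to fill them with a different value.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["f", "g"])

new = toy.reindex(["f", "t", "g"])
print(new)

# Rows created for labels that were not in the old index are entirely NaN.
added = new.index[new.isna().all(axis=1)].tolist()
print(added)

# fill_value supplies a value other than NaN for those rows.
filled = toy.reindex(["f", "t", "g"], fill_value=0.0)
print(filled)
```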

The __head()__ and __tail()__ methods are useful for inspecting a small portion of a data frame. Each takes the number of rows to show, which defaults to 5.

In [26]:
df.head(3)

           x          y          z
a  14.987151   3.336191  12.939643
b  16.875621   9.445366   8.456519
c   0.876345  17.419395  13.696239


In [None]:
df.head(2)

In [27]:
df.tail(3)

           x          y          z
h   4.336917   3.103108   8.429176
i  15.845151   6.956576   9.865940
j   8.734888  11.078268   4.892849


In [28]:
df.tail()

           x          y          z
f  14.382734  11.174197  13.342960
g  12.966630   9.584956   8.886095
h   4.336917   3.103108   8.429176
i  15.845151   6.956576   9.865940
j   8.734888  11.078268   4.892849
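Both methods also accept a negative count, which is sometimes handy. A minimal sketch on a hypothetical eight-row frame (`toy`): a negative argument means "everything except that many rows" from the opposite end.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(8)})

print(len(toy.head()))               # no argument: 5 rows
print(toy.head(-2).index.tolist())   # all rows except the last 2
print(toy.tail(-2).index.tolist())   # all rows except the first 2
```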
