Pandas provides a class called `DataFrame` for representing tabular data. A dataframe consists of columns, each of which is a named Series, and all columns share the same index.
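
To make that structure concrete, here is a minimal sketch that builds a small dataframe by hand rather than loading it. The values are copied from the first rows of the cruise metadata used below, and the variable names `small` and `lat` are only illustrative.

```python
import pandas as pd

# a small hand-built dataframe (illustrative values, not loaded from the URL)
small = pd.DataFrame({
    'cast': [1, 2, 3],
    'latitude': [41.200667, 41.030333, 41.03],
    'nearest_station': ['L1', 'L2', 'u2a'],
})

# each column is a Series whose name is the column name
lat = small['latitude']
print(type(lat).__name__, lat.name)

# every column shares the dataframe's index
print(lat.index.equals(small.index))
```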

In [1]:
import pandas as pd

# let's create a dataframe by reading CSV data from a URL
data_url = 'https://nes-lter-data.whoi.edu/api/ctd/en617/metadata.csv'

my_dataframe = pd.read_csv(data_url, parse_dates=['date'])
my_dataframe.head(5) # "head" means just show the first n rows (in this case 5)

,cruise,cast,date,latitude,longitude,nearest_station
0,EN617,1,2018-07-20 17:23:53+00:00,41.200667,-70.885333,L1
1,EN617,2,2018-07-20 22:57:14+00:00,41.030333,-70.880667,L2
2,EN617,3,2018-07-21 01:15:21+00:00,41.03,-70.769833,u2a
3,EN617,4,2018-07-21 02:58:24+00:00,41.030333,-70.991167,d2a
4,EN617,5,2018-07-21 06:39:49+00:00,40.863667,-70.883,L3


In [2]:
# accessing column names returns a column index
my_dataframe.columns

Index(['cruise', 'cast', 'date', 'latitude', 'longitude', 'nearest_station'], dtype='object')

In [3]:
# accessing a single column returns a Series
my_dataframe['date']

0    2018-07-20 17:23:53+00:00
1    2018-07-20 22:57:14+00:00
2    2018-07-21 01:15:21+00:00
3    2018-07-21 02:58:24+00:00
4    2018-07-21 06:39:49+00:00
5    2018-07-21 11:04:40+00:00
6    2018-07-21 12:37:37+00:00
7    2018-07-21 14:51:08+00:00
8    2018-07-21 17:23:38+00:00
9    2018-07-21 21:39:59+00:00
10   2018-07-21 23:17:27+00:00
11   2018-07-22 02:08:27+00:00
12   2018-07-22 05:00:38+00:00
13   2018-07-22 08:58:16+00:00
14   2018-07-22 13:38:24+00:00
15   2018-07-22 15:46:45+00:00
16   2018-07-22 18:17:39+00:00
17   2018-07-22 21:46:41+00:00
18   2018-07-23 08:08:23+00:00
19   2018-07-23 10:21:49+00:00
20   2018-07-23 12:59:50+00:00
21   2018-07-23 15:46:43+00:00
22   2018-07-23 17:28:33+00:00
23   2018-07-23 20:55:25+00:00
24   2018-07-24 07:38:18+00:00
25   2018-07-24 11:18:38+00:00
26   2018-07-24 12:57:11+00:00
27   2018-07-24 14:11:34+00:00
28   2018-07-24 15:32:23+00:00
29   2018-07-24 16:50:32+00:00
30   2018-07-24 18:24:44+00:00
31   2018-07-24 19:51:39+00:00
32   201

In [4]:
my_dataframe.index

RangeIndex(start=0, stop=35, step=1)

In [5]:
len(my_dataframe)

35

Pandas provides many useful functions on dataframes. For example, you can sort a dataframe by the values in one or more of its columns:

In [6]:
my_dataframe.sort_values('latitude')

,cruise,cast,date,latitude,longitude,nearest_station
21,EN617,22,2018-07-23 15:46:43+00:00,39.771333,-70.872333,L11
20,EN617,21,2018-07-23 12:59:50+00:00,39.7715,-70.879167,L11
19,EN617,20,2018-07-23 10:21:49+00:00,39.7725,-70.987333,d11a
24,EN617,25,2018-07-24 07:38:18+00:00,39.773833,-70.875333,L11
18,EN617,19,2018-07-23 08:08:23+00:00,39.773833,-70.769667,u11a
22,EN617,23,2018-07-23 17:28:33+00:00,39.941167,-70.768167,L12
25,EN617,26,2018-07-24 11:18:38+00:00,39.947833,-70.879167,L10
17,EN617,18,2018-07-22 21:46:41+00:00,39.9495,-70.884333,L10
14,EN617,15,2018-07-22 13:38:24+00:00,40.093667,-70.779,u9a
15,EN617,16,2018-07-22 15:46:45+00:00,40.096,-70.993167,d9a
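
Sorting by more than one column can be sketched as follows. This example assumes a small hand-built subset of the casts above (the name `casts` is hypothetical) and passes a list of column names to `sort_values`, with one `ascending` flag per column, so that ties in latitude are broken by longitude.

```python
import pandas as pd

# illustrative subset of the cast metadata shown above
casts = pd.DataFrame({
    'cast': [19, 21, 22, 25],
    'latitude': [39.773833, 39.7715, 39.771333, 39.773833],
    'longitude': [-70.769667, -70.879167, -70.872333, -70.875333],
})

# sort by latitude (ascending), then break ties by longitude (descending)
ordered = casts.sort_values(['latitude', 'longitude'], ascending=[True, False])
print(ordered)

# sort_values returns a new dataframe; the original row order is unchanged
print(list(casts['cast']))
```

Note that the sorted result keeps each row's original index label, which is why the index values in the output above appear out of order.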


In [7]:
# using column data as an index
# indexes can have names
by_cast = my_dataframe.set_index('cast')
by_cast.head()

cast,cruise,date,latitude,longitude,nearest_station
1,EN617,2018-07-20 17:23:53+00:00,41.200667,-70.885333,L1
2,EN617,2018-07-20 22:57:14+00:00,41.030333,-70.880667,L2
3,EN617,2018-07-21 01:15:21+00:00,41.03,-70.769833,u2a
4,EN617,2018-07-21 02:58:24+00:00,41.030333,-70.991167,d2a
5,EN617,2018-07-21 06:39:49+00:00,40.863667,-70.883,L3


In [8]:
by_cast.index

Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
            35],
           dtype='int64', name='cast')
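
As a minimal sketch of what `set_index` does to the columns, the example below uses a small hand-built dataframe (the names `df`, `indexed`, and `restored` are illustrative). It also shows `reset_index`, which moves the index back into an ordinary column.

```python
import pandas as pd

# illustrative data taken from the first rows shown above
df = pd.DataFrame({'cast': [1, 2, 3], 'nearest_station': ['L1', 'L2', 'u2a']})

# set_index returns a new dataframe indexed by the chosen column
indexed = df.set_index('cast')
print(indexed.index.name)      # the index takes the column's name
print(list(indexed.columns))   # 'cast' is no longer a regular column

# reset_index moves the index back into a column
restored = indexed.reset_index()
print(list(restored.columns))
```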

You can access a row of a dataframe by its index label using the `loc` attribute. The row is returned as a Series indexed by column name.

In [9]:
row = by_cast.loc[24]
row

cruise                                 EN617
date               2018-07-23 20:55:25+00:00
latitude                              40.366
longitude                            -70.766
nearest_station                          u6a
Name: 24, dtype: object

In [10]:
# since that's an ordinary Series, you can access values by column name
row['date']

Timestamp('2018-07-23 20:55:25+0000', tz='UTC')
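
Because `loc` looks rows up by label, it can differ from a lookup by position. The sketch below, using a small hand-built dataframe with illustrative names, contrasts `loc` with `iloc`, which selects by integer position instead.

```python
import pandas as pd

# a small dataframe indexed by cast number (illustrative values)
df = pd.DataFrame(
    {'nearest_station': ['L1', 'L2', 'u2a']},
    index=pd.Index([1, 2, 3], name='cast'),
)

# loc looks rows up by index label: this is the row whose cast number is 2
by_label = df.loc[2]

# iloc looks rows up by integer position: position 2 is the third row
by_position = df.iloc[2]

print(by_label['nearest_station'], by_position['nearest_station'])
```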

You can also slice a dataframe by index label using the `loc` attribute. Unlike ordinary Python slicing, the end label is included in the result:

In [11]:
# index slicing on a dataframe using loc
by_cast.loc[28:33]

cast,cruise,date,latitude,longitude,nearest_station
28,EN617,2018-07-24 14:11:34+00:00,40.227167,-70.886167,L7
29,EN617,2018-07-24 15:32:23+00:00,40.367333,-70.887333,L6
30,EN617,2018-07-24 16:50:32+00:00,40.514833,-70.879167,L5
31,EN617,2018-07-24 18:24:44+00:00,40.6975,-70.880833,L4
32,EN617,2018-07-24 19:51:39+00:00,40.8625,-70.8775,L3
33,EN617,2018-07-24 21:39:44+00:00,41.028667,-70.878333,L2
