# Extracting a single value

## By square bracket notation

We can extract a single value by using square bracket notation twice.  For example, I can get the value at index 11000 from the rainfall amount column as shown in the cell below.  This is a simple consequence of the fact that square bracket notation works on both data frames _and_ series.  The left-most pair of brackets works on the data frame and returns a series; the second pair works on that series.
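To make the two steps easier to see, here is a minimal sketch that splits them apart.  The `rain` data frame and its values are made up for illustration; they are not taken from the rainfall files.

```python
import pandas as pd

# A tiny, made-up data frame standing in for a rainfall file
rain = pd.DataFrame({
    "Day": [1, 2, 3, 4],
    "Rainfall amount (millimetres)": [0.0, 2.4, 0.0, 11.6],
})

# Step 1: square brackets on the data frame give back a Series (one column)
column = rain["Rainfall amount (millimetres)"]

# Step 2: square brackets on the Series give back a single value,
# looked up by index label (here the default labels 0, 1, 2, 3)
value = column[3]

# Doing both steps at once is the "square brackets twice" form
same_value = rain["Rainfall amount (millimetres)"][3]

print(type(column).__name__)
print(value, same_value)
```

Asking for a label that is not in the index (for example `column[99]` here) raises a `KeyError`.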



In [5]:
import pandas as pd

wentworth = pd.read_csv("data/rainfall/IDCJAC0009_047045_1800_Data.csv")
wentworth["Rainfall amount (millimetres)"][11000]

files = ['IDCJAC0009_047045_1800_Data', 'IDCJAC0009_049092_1800_Data', 'IDCJAC0009_063049_1800_Data']

wentworth["Rainfall amount (millimetres)"].sum()
# Exercise, try out mean, mode, min, and max
 
# Something to think about:
# What are the datatypes returned by each method?
 
print(wentworth["Rainfall amount (millimetres)"].mean()) # Float
print(wentworth["Rainfall amount (millimetres)"].mode())  # Why has mode given us a series?
# mode = the most frequent values
# 1, 1, 2, 2,
# 1 and 2 are BOTH the most frequent
print(wentworth["Rainfall amount (millimetres)"].min())
print(wentworth["Rainfall amount (millimetres)"].max())

data_frame = pd.read_csv('data/rainfall/IDCJAC0009_067105_1800_Data.csv')
display(data_frame)
data_frame['Rainfall amount (millimetres)'].max()

highest = 0
highest_file = ""
for file in files:
    data_frame = pd.read_csv('data/rainfall/' + file + '.csv')
    average = data_frame['Rainfall amount (millimetres)'].mean()
    if average > highest:
        highest = average
        highest_file = file
print(highest)
print(highest_file)

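import glob
import os

# Find every rainfall file instead of listing them by hand
files = glob.glob(os.path.join('data', 'rainfall', 'IDC*.csv'))
print(files)

# Start again so the result of the earlier loop does not carry over
highest = 0
highest_file = ""
for file in files:
    data_frame = pd.read_csv(file)
    average = data_frame['Rainfall amount (millimetres)'].mean()
    if average > highest:
        highest = average
        highest_file = file
print(highest)
print(highest_file)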

0.6992802603579922
0    0.0
Name: Rainfall amount (millimetres), dtype: float64
0.0
113.0


Unnamed: 0,Product code,Bureau of Meteorology station number,Year,Month,Day,Rainfall amount (millimetres),Period over which rainfall was measured (days),Quality
0,IDCJAC0009,67105,1994,1,1,,,
1,IDCJAC0009,67105,1994,1,2,,,
2,IDCJAC0009,67105,1994,1,3,,,
3,IDCJAC0009,67105,1994,1,4,,,
4,IDCJAC0009,67105,1994,1,5,,,
...,...,...,...,...,...,...,...,...
10620,IDCJAC0009,67105,2023,1,29,0.2,1.0,N
10621,IDCJAC0009,67105,2023,1,30,19.2,1.0,N
10622,IDCJAC0009,67105,2023,1,31,32.2,1.0,N
10623,IDCJAC0009,67105,2023,2,1,0.2,1.0,N


2.6003934254240253
IDCJAC0009_063049_1800_Data
['data/rainfall/IDCJAC0009_047045_1800_Data.csv', 'data/rainfall/IDCJAC0009_049092_1800_Data.csv', 'data/rainfall/IDCJAC0009_063049_1800_Data.csv', 'data/rainfall/IDCJAC0009_063245_1800_Data.csv', 'data/rainfall/IDCJAC0009_063292_1800_Data.csv', 'data/rainfall/IDCJAC0009_063298_1800_Data.csv', 'data/rainfall/IDCJAC0009_065019_1800_Data.csv', 'data/rainfall/IDCJAC0009_066119_1800_Data.csv', 'data/rainfall/IDCJAC0009_066128_1800_Data.csv', 'data/rainfall/IDCJAC0009_067105_1800_Data.csv', 'data/rainfall/IDCJAC0009_075167_1800_Data.csv']
3.213410650887574
data/rainfall/IDCJAC0009_066119_1800_Data.csv


## By Summarising

Pandas provides some "magic" when it comes to summarising columns.  Series have a set of "methods" attached to them that you can call any time you like to get summaries.  Note that these summaries work on a Series, so you should extract the column you want first.  Examples are:
  * add up all elements (`sum`)
  * calculate the average (`mean`) or mode (`mode`)
  * find the largest (`max`) or smallest (`min`).
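Here is a minimal sketch of those methods on a small, made-up Series (the `readings` values are invented for illustration).  It also hints at why `mode` behaves differently from the others.

```python
import pandas as pd

# Made-up rainfall readings; 0.0 and 1.5 each appear twice
readings = pd.Series([0.0, 1.5, 0.0, 1.5, 6.0])

total = readings.sum()      # a single number
average = readings.mean()   # a single number
largest = readings.max()    # a single number
smallest = readings.min()   # a single number
modes = readings.mode()     # a Series, because there can be ties

print(total, average, largest, smallest)
print(list(modes))
```

Since two values tie for most frequent here, `mode` has to return both, which is why it always hands back a Series rather than a single number.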

# Example

What is the largest rainfall day for Richmond RAAF base (which is in the file `data/rainfall/IDCJAC0009_067105_1800_Data.csv`)?

Which of our rainfall files has the highest average rainfall?

# Exercise

What is the total rainfall recorded for Meriwagga (rainfall file 075167)?  What are the maximum and minimum rainfall on any one day?  I am sure you can guess the minimum, but what code will give it to you?

## By `loc` and `iloc`

We've seen how to recover a Series from a DataFrame - i.e. how to extract a column.

Let's see how to extract a row.

It is important to realise that, since DataFrames are built from Series, it is somewhat awkward to pull out a single row.  In effect, we are asking for pandas to visit each Series and grab the value at a particular index.

Instead of doing this though, we will use the `loc` functionality of pandas.
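To see what "visiting each Series" would look like, here is a rough sketch on a made-up data frame (the `frame` name and its contents are assumptions for illustration).  It builds a row by hand and then gets the same row with `loc`.

```python
import pandas as pd

# Made-up frame for illustration only
frame = pd.DataFrame({
    "Station": ["A", "B", "C"],
    "Rainfall": [0.0, 3.2, 1.4],
})

# The awkward way: visit each column (a Series) and take the value at index 1
row_by_hand = {name: frame[name][1] for name in frame.columns}

# The same row via loc, which does that work for us
row_by_loc = frame.loc[1]

print(row_by_hand)
print(row_by_loc.to_dict())
```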

`loc` and `iloc` are indexers that can get columns _or rows_; like column lookups, they are used with square brackets rather than parentheses.  `loc` goes by column name when getting columns and by index label when getting rows.  `iloc` goes by the position of the column when getting columns and the position of the row when getting rows, counting from 0.

`loc` and `iloc` can actually take two values to look up both axes at once.

In [6]:
wentworth.loc[1110, "Rainfall amount (millimetres)"]

data_frame = pd.read_csv('data/rainfall/IDCJAC0009_075167_1800_Data.csv')
print(data_frame['Rainfall amount (millimetres)'].sum())
print(data_frame['Rainfall amount (millimetres)'].max())
print(data_frame['Rainfall amount (millimetres)'].min())


my_frame = pd.DataFrame()
my_frame['first column'] = ["a", "b", "c", "d", "e"]
my_frame['second column'] = ["asdasd", "asdasd", "asdsad", "asdasd", "asdasd"]
my_frame['third column'] = ["a23", "fg13", "h141", "hg123", "6123a"]
my_frame = my_frame.set_index("first column")
display(my_frame)
 
display(my_frame.loc['b', 'third column'])  # loc: row label, then column label
display(my_frame.iloc[1, 1])  # iloc: row position, then column position (both counted from 0)

15648.6
86.0
0.0


first column,second column,third column
a,asdasd,a23
b,asdasd,fg13
c,asdsad,h141
d,asdasd,hg123
e,asdasd,6123a


'fg13'

'fg13'

As the cells above show, `loc` and `iloc` look up _row first_.  This means that if we give only one value, they look up by row and give back a Series for that row.  It looks as if the table was "flipped", but that is not really what happens: the column names simply become the index of the returned Series.

In [3]:
wentworth.loc[1110]

Product code                                      IDCJAC0009
Bureau of Meteorology station number                   47045
Year                                                    1936
Month                                                      1
Day                                                       16
Rainfall amount (millimetres)                            0.0
Period over which rainfall was measured (days)           NaN
Quality                                                    Y
Name: 1110, dtype: object
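The difference between `loc` and `iloc` is hard to see when the index is just 0, 1, 2 and so on, because labels and positions are then the same numbers.  The sketch below uses a made-up data frame whose index labels are deliberately different from the row positions.

```python
import pandas as pd

# Made-up frame whose index labels are NOT 0, 1, 2
frame = pd.DataFrame(
    {"Rainfall": [0.0, 3.2, 1.4], "Quality": ["Y", "N", "Y"]},
    index=[10, 20, 30],
)

by_label = frame.loc[20]       # the row whose index label is 20
by_position = frame.iloc[1]    # the second row, counting from 0

# Both are Series whose index is the frame's column names
print(by_label)
print(by_label.equals(by_position))

# Two values: row first, then column
print(frame.loc[20, "Rainfall"], frame.iloc[1, 0])
```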

# Example

What was the rainfall on 1 May 2019 at Richmond RAAF base?

In [9]:

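richmond = pd.read_csv('data/rainfall/IDCJAC0009_067105_1800_Data.csv')
# Use the three date columns together as the row labels
richmond = richmond.set_index(['Year', 'Month', 'Day'])
# The row label is now the tuple (year, month, day)
display(richmond.loc[(2019, 5, 1)])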

Product code                                      IDCJAC0009
Bureau of Meteorology station number                   67105
Rainfall amount (millimetres)                            0.0
Period over which rainfall was measured (days)           1.0
Quality                                                    N
Name: (2019, 5, 1), dtype: object
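The same idea can be tried out without the rainfall files.  This is a small sketch on made-up data (the `daily` frame is an assumption for illustration) showing that, once several columns are set as the index, the row label is a tuple and a column name can still be given as the second value.

```python
import pandas as pd

# Made-up daily readings for illustration
daily = pd.DataFrame({
    "Year": [2019, 2019, 2019],
    "Month": [4, 5, 5],
    "Day": [30, 1, 2],
    "Rainfall": [1.2, 0.0, 4.8],
})

# Use the three date columns together as the row labels
daily = daily.set_index(["Year", "Month", "Day"])

# A whole row: the label is the tuple (year, month, day)
row = daily.loc[(2019, 5, 2)]

# A single value: row label first, then column name
value = daily.loc[(2019, 5, 2), "Rainfall"]

print(row)
print(value)
```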

# Exercise

What is the title of the 6th row in the `workouts.csv` file?


# Using `loc`/`iloc` for everything?

Many pandas programmers just use `loc` and `iloc` for everything, but I will not.  Using them "hides" the underlying workings of pandas and, since we are here to learn, that doesn't suit us.  We will use them when we need to, but stick to square bracket notation as much as possible.  If you post a question on Stack Overflow you will probably get a `loc`/`iloc`-based answer though, so we want to make sure you really know how they work.
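So that you can translate between the two styles, here is a short sketch that fetches the same value three ways.  The `frame` data is made up for illustration.

```python
import pandas as pd

# Made-up frame with text labels for the rows
frame = pd.DataFrame(
    {"Rainfall": [0.0, 3.2, 1.4]},
    index=["mon", "tue", "wed"],
)

via_brackets = frame["Rainfall"]["tue"]   # column first, then row label
via_loc = frame.loc["tue", "Rainfall"]    # row label first, then column name
via_iloc = frame.iloc[1, 0]               # row position, then column position

print(via_brackets, via_loc, via_iloc)
```

Notice that the order is swapped: square bracket notation goes column then row, while `loc` and `iloc` go row then column.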