We typically populate a data frame from some external source, the most common being a CSV file.  For this we use the `read_csv` function from the pandas library.  We pass it a path to the file we want to load.  On this machine, we are running Linux in the background, so we use Linux-style paths.  Your path would look different on a Windows machine.
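
If you want the same code to run on Linux, macOS, and Windows, one option is to build the path with `pathlib` from Python's standard library instead of typing the separators by hand. The sketch below is not part of the original notebook: it writes a tiny made-up CSV into a temporary folder (the folder name, file name, and values are invented for illustration) and reads it back.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical layout for illustration: a temporary folder stands in for "data/rainfall".
with tempfile.TemporaryDirectory() as tmp:
    # The "/" operator joins path parts using the right separator for the machine.
    csv_path = Path(tmp) / "rainfall" / "example.csv"
    csv_path.parent.mkdir(parents=True)
    csv_path.write_text("Year,Month,Day\n1990,1,1\n1990,1,2\n")

    # read_csv accepts a Path object as well as a plain string.
    example = pd.read_csv(csv_path)

print(example.shape)
```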

In [12]:
import pandas as pd

box_hill = pd.read_csv("data/rainfall/box_hill.csv")
box_hill

Unnamed: 0,Product code,Bureau of Meteorology station number,Year,Month,Day,Rainfall amount (millimetres),Period over which rainfall was measured (days),Quality
0,IDCJAC0009,67104,1990,1,1,,,
1,IDCJAC0009,67104,1990,1,2,,,
2,IDCJAC0009,67104,1990,1,3,,,
3,IDCJAC0009,67104,1990,1,4,,,
4,IDCJAC0009,67104,1990,1,5,,,
...,...,...,...,...,...,...,...,...
11764,IDCJAC0009,67104,2022,3,18,0.0,1.0,N
11765,IDCJAC0009,67104,2022,3,19,17.0,1.0,N
11766,IDCJAC0009,67104,2022,3,20,1.0,1.0,N
11767,IDCJAC0009,67104,2022,3,21,0.0,1.0,N


A number of interesting things happen:
  1) The data frame has been given an index that starts at 0 and goes up one at a time.  That data was not in the original CSV file.
  2) Any empty cells were given the value `NaN` (which means "not a number").
  3) The first row is used to create column names (remember, a column is a `Series`).
  4) When printing the frame, only the first 5 and last 5 rows are shown and the full dimensions are shown at the bottom.
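
To see these behaviours without needing the rainfall file, here is a minimal sketch that reads a few made-up lines of CSV text from memory. The column names and values are invented for illustration, and `io.StringIO` simply stands in for a file on disk.

```python
import io

import pandas as pd

# Made-up CSV text for illustration; the second data row has an empty rainfall cell.
csv_text = "Year,Month,Day,Rainfall\n2022,3,18,0.0\n2022,3,19,\n2022,3,20,1.0\n"

small = pd.read_csv(io.StringIO(csv_text))

print(list(small.columns))       # names taken from the first row of the CSV
print(list(small.index))         # generated index, counting up from 0
print(small["Rainfall"].isna())  # True where the cell was empty
print(small.shape)               # (number of rows, number of columns)
```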

If we extract one of the series from this frame, it will use the generated indexes.

In [7]:
box_hill["Rainfall amount (millimetres)"]

0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
         ... 
11764     0.0
11765    17.0
11766     1.0
11767     0.0
11768     0.0
Name: Rainfall amount (millimetres), Length: 11769, dtype: float64

# `read_csv` options

There are a [very large number of parameters](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) you can set to adjust the way `read_csv` works.  We will look at a few here:
  * setting the index column(s)
  * ???

## Setting the index columns

Typically, the data will already have a column that works as an index.  An index is any column whose value is unique to its row, i.e. it has a different value on every row of the data.  In our case, we don't have a single column in our data that can do this, but there is a combination that works!  If we combine "Year", "Month", and "Day", the result is different for each row.  The advantage of doing this is that the index becomes a more natural way to look at the data.  If we use multiple columns for the index, we get a "Multi-Index", which we will talk about soon.

In [15]:
box_hill_multi = pd.read_csv("data/rainfall/box_hill.csv", index_col=["Year", "Month", "Day"])
box_hill_multi

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Product code,Bureau of Meteorology station number,Rainfall amount (millimetres),Period over which rainfall was measured (days),Quality
Year,Month,Day,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1990,1,1,IDCJAC0009,67104,,,
1990,1,2,IDCJAC0009,67104,,,
1990,1,3,IDCJAC0009,67104,,,
1990,1,4,IDCJAC0009,67104,,,
1990,1,5,IDCJAC0009,67104,,,
...,...,...,...,...,...,...,...
2022,3,18,IDCJAC0009,67104,0.0,1.0,N
2022,3,19,IDCJAC0009,67104,17.0,1.0,N
2022,3,20,IDCJAC0009,67104,1.0,1.0,N
2022,3,21,IDCJAC0009,67104,0.0,1.0,N
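
If you are unsure whether a combination of columns really is different on every row, you can ask pandas to check. The sketch below uses a few made-up rows (not the Box Hill data) in which no single column is unique but the combination of the three is.

```python
import io

import pandas as pd

# Made-up rows for illustration: no single column is unique, but the combination is.
csv_text = (
    "Year,Month,Day,Rainfall\n"
    "2021,1,1,1.0\n"
    "2021,1,2,0.0\n"
    "2021,2,1,4.0\n"
    "2022,1,1,2.0\n"
)
small = pd.read_csv(io.StringIO(csv_text))

# Each column on its own contains repeated values...
print(small["Year"].is_unique, small["Month"].is_unique, small["Day"].is_unique)

# ...but no (Year, Month, Day) combination appears twice.
print(small.duplicated(subset=["Year", "Month", "Day"]).any())

# Loading with index_col gives an index with three levels and no repeated entries.
small_multi = pd.read_csv(io.StringIO(csv_text), index_col=["Year", "Month", "Day"])
print(small_multi.index.nlevels, small_multi.index.is_unique)
```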


That is really all there is to loading CSV files.  We will now look at more data frame techniques.  In particular, we will look at the types of things you often want to do when loading up a CSV file.

# More DataFrame techniques

## Extracting a single value

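We can extract a single value by using the square bracket notation twice: first with the column name, then with the index.  For example, I can get the value at index 11000 of the rainfall amount column (the 11,001st row, since the index starts at 0) like this.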

In [11]:
box_hill["Rainfall amount (millimetres)"][11000]

19.0
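
Pandas also offers `.loc` and `.at`, which take the row label and the column name together in one step. This is a small sketch on made-up values (the frame and its column name are invented for illustration) showing that all three forms return the same value here.

```python
import pandas as pd

# Made-up values for illustration.
frame = pd.DataFrame({"Rainfall": [0.0, 17.0, 1.0]})

by_brackets = frame["Rainfall"][1]  # column first, then index label
by_loc = frame.loc[1, "Rainfall"]   # row label and column name together
by_at = frame.at[1, "Rainfall"]     # like .loc, but only for a single value

print(by_brackets, by_loc, by_at)
```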

## Working with multi-indexes

What should I put if I want to look up an index in a table with a multi-index?  First, let's pull a series from the data frame.  You will notice that it also has a multi-index.

In [17]:
box_hill_multi["Rainfall amount (millimetres)"]

Year  Month  Day
1990  1      1       NaN
             2       NaN
             3       NaN
             4       NaN
             5       NaN
                    ... 
2022  3      18      0.0
             19     17.0
             20      1.0
             21      0.0
             22      0.0
Name: Rainfall amount (millimetres), Length: 11769, dtype: float64

I can give one, two, or three index values, depending on what data I want.  If I give just a year, the year level drops off and the series is reduced to months and days (a multi-index of size 2).  Notice that the resulting series still has a multi-index!  This can trick people into thinking they are looking at a data frame (since it looks like there is more than one column), but the missing column name on the last column and the `dtype` printed at the bottom are the giveaways that this is a series with a complex index rather than a frame with multiple columns.

In [26]:
box_hill_multi["Rainfall amount (millimetres)"][1993]

Month  Day
1      1       0.0
       2       0.0
       3       0.0
       4      14.0
       5       4.0
              ... 
12     27      0.0
       28      0.0
       29      0.0
       30      1.8
       31      0.0
Name: Rainfall amount (millimetres), Length: 365, dtype: float64
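
If the print-out leaves you in doubt, you can also ask pandas directly what kind of object you have. The sketch below uses a few made-up rows (the column name `Rainfall` and the values are invented for illustration) and checks the type and the remaining index levels after selecting one year.

```python
import io

import pandas as pd

# Made-up rows for illustration.
csv_text = (
    "Year,Month,Day,Rainfall\n"
    "1993,1,1,0.0\n"
    "1993,1,2,14.0\n"
    "1993,2,1,4.0\n"
    "1994,1,1,1.8\n"
)
multi = pd.read_csv(io.StringIO(csv_text), index_col=["Year", "Month", "Day"])

year_1993 = multi["Rainfall"][1993]

print(type(year_1993).__name__)     # the kind of object we got back
print(list(year_1993.index.names))  # the index levels that remain
print(len(year_1993))               # how many rows belong to that year
```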

Or, I can get just days if I give a year and a month (say January 2021).  Notice that I need to put the two index values in parentheses.  This is called a "tuple"; whenever you need to give multiple values in one place, you often bundle them in this way.

In [25]:
box_hill_multi["Rainfall amount (millimetres)"][(2021,1)]

Day
1      1.0
2      0.0
3      1.0
4      0.0
5     13.0
6      0.0
7      4.0
8      5.0
9      0.0
10     0.0
11     0.0
12     0.0
13     0.0
14     0.0
15     0.0
16     0.0
17     0.0
18     0.0
19     0.0
20     0.0
21     0.0
22     0.0
23     0.0
24     0.0
25     0.0
26     0.0
27     0.0
28     1.0
29     4.0
30     5.0
31     3.0
Name: Rainfall amount (millimetres), dtype: float64

Or I can get right down to a single value if I give values for all three indexes (say 3rd Feb 2022).

In [24]:
box_hill_multi["Rainfall amount (millimetres)"][(2022,2,3)]

2.0
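
The same three kinds of lookup can also be written with `.loc`, which makes it explicit that we are selecting by index label. This is a minimal sketch on made-up rows (the column name `Rainfall` and the values are invented for illustration), not the Box Hill data.

```python
import io

import pandas as pd

# Made-up rows for illustration.
csv_text = (
    "Year,Month,Day,Rainfall\n"
    "2021,1,1,1.0\n"
    "2021,1,2,0.0\n"
    "2022,2,3,2.0\n"
    "2022,2,4,5.0\n"
)
rain = pd.read_csv(io.StringIO(csv_text), index_col=["Year", "Month", "Day"])["Rainfall"]

one_level = rain.loc[2021]             # series indexed by (Month, Day)
two_levels = rain.loc[(2021, 1)]       # series indexed by Day
single_value = rain.loc[(2022, 2, 3)]  # a single number

print(one_level.index.nlevels, two_levels.index.nlevels, single_value)
```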

# Exercises

## Load without multi-indexes
Load the daily rainfall data for Lithgow (you will find it in the same folder as Box Hill).  What was the first year for which data is recorded in that file?  What was the last?  You can assume the data is sorted in order from oldest to newest.

In [27]:
# load data into variable `lithgow`

print("print the first year within the lithgow data")

print("print the last year within the lithgow data")

## Load with multi-indexes
Can you do the same for Hornsby Pool but by using multi-indexes (or not using them, if you used them above)?

## Particular month
Can you display the rainfall data for the month of March 2005 in a single table using what we have learned so far?

# Conclusion

Multi-indexes are _not_ something Excel does well.  The move to pandas has paid off already!