We turn our attention now to more subjective decisions - ones that don't just fall out from built-in functions as easily as that.  We might need some "logic" to help us decide which rows/columns should remain.  The two main topics that we will tackle to give us this tool is:
  * filtering with lambdas
  * regular expressions

  Regular expressions are a little easier to imagine, lets start with them.

  In the next code block, I pull out the "A Grade" data from the `wsmtb/2018-XC-Club-Champs.csv` file.  You should notice that the time data is not quite right.  We must have used some European software to generate this because it seems to have a centi-seconds data separated with a comma?  We don't have the tooling to deal with timport his, and we are not even confident that is what it is, and who needs that sort of precision?  Imagine our job is to "clean up" that data so the centi-seconds are not there.

In [4]:
import pandas as pd

a_grade = pd.read_csv("data/wsmtb/2018-XC-Club-Champs.csv", skiprows = [0,1,3], nrows=8)
a_grade

Unnamed: 0,Category Rank,Bib,Name,Laps Completed,Race Time,Lap1,Lap2,Lap3,Lap4,Lap5,Lap6
0,1,21,Luke Brame,5,"01:26:26,3","0:17:01,8","0:17:06,3","0:17:08,0","0:17:29,1","0:17:41,2",
1,2,30,Dylan George,5,"01:26:44,6","0:17:34,5","0:17:03,6","0:17:36,8","0:17:27,8","0:17:02,1",
2,3,1,Brian Price,5,"01:27:21,3","0:17:30,3","0:17:07,5","0:17:39,3","0:17:25,7","0:17:38,8",
3,4,13,Owen Gordon,5,"01:29:00,5","0:17:34,2","0:17:31,1","0:17:43,4","0:18:17,3","0:17:54,7",
4,5,33,Jordan Davies,5,"01:29:10,5","0:17:43,9","0:17:32,8","0:17:56,7","0:18:21,2","0:17:36,1",
5,6,5,Benjamin Green,5,"01:31:21,4","0:17:34,1","0:17:42,9","0:17:34,6","0:18:13,8","0:20:16,2",
6,7,7,Ian Anderson,5,"01:32:09,8","0:18:29,9","0:18:29,4","0:18:43,6","0:18:26,4","0:18:00,8",
7,8,34,Joel Stearnes,5,"01:34:49,9","0:18:52,6","0:18:34,1","0:18:35,7","0:18:46,7","0:20:01,0",


Recall that we can get any one column with square bracket notation (`a_grade["Race Time"]`).  Did you know though that we can get a smaller data frame if we put a list of columns in the square brackets?

In [5]:
times = a_grade[["Race Time", "Lap1", "Lap2", "Lap3", "Lap4", "Lap5"]]
times

Unnamed: 0,Race Time,Lap1,Lap2,Lap3,Lap4,Lap5
0,"01:26:26,3","0:17:01,8","0:17:06,3","0:17:08,0","0:17:29,1","0:17:41,2"
1,"01:26:44,6","0:17:34,5","0:17:03,6","0:17:36,8","0:17:27,8","0:17:02,1"
2,"01:27:21,3","0:17:30,3","0:17:07,5","0:17:39,3","0:17:25,7","0:17:38,8"
3,"01:29:00,5","0:17:34,2","0:17:31,1","0:17:43,4","0:18:17,3","0:17:54,7"
4,"01:29:10,5","0:17:43,9","0:17:32,8","0:17:56,7","0:18:21,2","0:17:36,1"
5,"01:31:21,4","0:17:34,1","0:17:42,9","0:17:34,6","0:18:13,8","0:20:16,2"
6,"01:32:09,8","0:18:29,9","0:18:29,4","0:18:43,6","0:18:26,4","0:18:00,8"
7,"01:34:49,9","0:18:52,6","0:18:34,1","0:18:35,7","0:18:46,7","0:20:01,0"


This is all the times!  Just wanted to show that off, and this seems the best place :)

Now, the usefull method we need is:
  * [replace](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html), or is it
  * [replace](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html)

uh-oh.

# Important tangent on documentation and stack overflow and underlying knowledge

You will find the above happens fairly often in a rich library like `pandas`.  There are two completely different methods called `replace`.  They do _similar_ but not exactly the same things.  They work in different circumstances.  In this case:
  * `replace` works on `DataFrames` and is mainly concerned with letting you look for certain _values_ in the data frame and replace them with other values
  * `str.replace` works on `Series` and is mainly concerned with letting you replace parts of strings within the cells of the series.

However, because they both support regular expressions - there is overlap in what they can do.  The problem here is that you might find solutions using one when the other is more appropriate.  Alternatively, you might learn about one and never even realise the other is available.

## Solution

Curated learning - just like we are doing here - lucky you!

# Regular Expressions

This is a whole topic on its own but very important for data manipulation.  I won't recreate yet-another regex tutorial, [go and do this one, then come back and do the exercises](https://regexone.com/)

# Exercises - regex

  * Use `str.replace` to remove the centi-seconds from `times["Race Time"]`
  * Use `str.replace` to convert all centi-seconds to ".0000" in `times["Race Time"]`
  * (Harder) Use `replace` from `DataFrame` to convert the `,` to `.` before the centi-seconds in `times`.

In [6]:
# put your solutions here

# Achieving the same with a general mechanism (lambdas)

There is a way to apply any computation to a `Series`, it is called [`apply`](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html).  The first parameter you pass to this method is a _lambda_ or a _function name_.  A _lambda_ is just a way of putting a function in a small space and since we will use small examples, we will use lambdas.  We won't dive fully into lambdas, we will use them in one of two ways, both of which we show here.  Lets start off by reproducing the example above (note that we are using yet-another method called `replace`, this one works on strings directly)

In [7]:
times["Race Time"].apply(lambda x: x.replace(",", "."))

0    01:26:26.3
1    01:26:44.6
2    01:27:21.3
3    01:29:00.5
4    01:29:10.5
5    01:31:21.4
6    01:32:09.8
7    01:34:49.9
Name: Race Time, dtype: object

The `apply` method takes one paramter, a lambda telling it what to do on each value it sees.  Pandas takes care of running that lambda on each individual value in the series and giving back a new series with adjusted values.  `lambda x` indicates we want the value to be given the name `x` within the lambda and the line after the `:` is that to give back as the adjusted value.

In truth, many things we might want this for have automated alternatives (as we saw with `+`, `<`, `to_timedelta`, and `str.replace`) but we need this for when the others are not available.  

As a real-world example, in the wsmtb data, you can recover the grade a person was entered from their "bib number".  Below 100 means "a grade", 100 to 199 means "b grade", etc.  This is very useful because, as you have seen, that data is not easily accessible elsewhere in the files since it is stored as a header between sub-tables.  Lets use `apply` to recover it.  Firstly I will need to get every participant from the table though, and that needs another trick


In [42]:
champs = pd.read_csv("data/wsmtb/2018-XC-Club-Champs.csv", skiprows=2)
riders = champs[champs.apply(lambda row: pd.notnull(row["Bib"]) and row["Bib"].isnumeric(), axis = 1)]
riders

Unnamed: 0,Category Rank,Bib,Name,Laps Completed,Race Time,Lap1,Lap2,Lap3,Lap4,Lap5,Lap6
1,1,21,Luke Brame,5,"01:26:26,3","0:17:01,8","0:17:06,3","0:17:08,0","0:17:29,1","0:17:41,2",
2,2,30,Dylan George,5,"01:26:44,6","0:17:34,5","0:17:03,6","0:17:36,8","0:17:27,8","0:17:02,1",
3,3,1,Brian Price,5,"01:27:21,3","0:17:30,3","0:17:07,5","0:17:39,3","0:17:25,7","0:17:38,8",
4,4,13,Owen Gordon,5,"01:29:00,5","0:17:34,2","0:17:31,1","0:17:43,4","0:18:17,3","0:17:54,7",
5,5,33,Jordan Davies,5,"01:29:10,5","0:17:43,9","0:17:32,8","0:17:56,7","0:18:21,2","0:17:36,1",
...,...,...,...,...,...,...,...,...,...,...,...
129,1,553,Sonia Vetisch,3,"00:14:16,6","0:04:48,3","0:04:41,6","0:04:46,8",,,
130,2,554,Emma Blume,3,"00:14:55,1","0:04:53,1","0:04:55,0","0:05:07,2",,,
131,3,558,Maia Uberoi-Robson,3,"00:15:08,2","0:04:49,9","0:04:59,8","0:05:18,6",,,
132,4,551,Natasha Padula,3,"00:15:39,9","0:05:12,5","0:05:01,1","0:05:26,4",,,


In [43]:
 pd.options.mode.chained_assignment = None

grades = riders["Bib"].astype(int).apply(lambda b: "A" if b < 100 else ("B" if b < 200 else ("C" if b < 300 else ("D" if b < 400 else ("E" if b < 500 else "F")))))
riders.insert(2,"Grade",grades)
riders

Unnamed: 0,Category Rank,Bib,Grades,Name,Laps Completed,Race Time,Lap1,Lap2,Lap3,Lap4,Lap5,Lap6
1,1,21,A,Luke Brame,5,"01:26:26,3","0:17:01,8","0:17:06,3","0:17:08,0","0:17:29,1","0:17:41,2",
2,2,30,A,Dylan George,5,"01:26:44,6","0:17:34,5","0:17:03,6","0:17:36,8","0:17:27,8","0:17:02,1",
3,3,1,A,Brian Price,5,"01:27:21,3","0:17:30,3","0:17:07,5","0:17:39,3","0:17:25,7","0:17:38,8",
4,4,13,A,Owen Gordon,5,"01:29:00,5","0:17:34,2","0:17:31,1","0:17:43,4","0:18:17,3","0:17:54,7",
5,5,33,A,Jordan Davies,5,"01:29:10,5","0:17:43,9","0:17:32,8","0:17:56,7","0:18:21,2","0:17:36,1",
...,...,...,...,...,...,...,...,...,...,...,...,...
129,1,553,F,Sonia Vetisch,3,"00:14:16,6","0:04:48,3","0:04:41,6","0:04:46,8",,,
130,2,554,F,Emma Blume,3,"00:14:55,1","0:04:53,1","0:04:55,0","0:05:07,2",,,
131,3,558,F,Maia Uberoi-Robson,3,"00:15:08,2","0:04:49,9","0:04:59,8","0:05:18,6",,,
132,4,551,F,Natasha Padula,3,"00:15:39,9","0:05:12,5","0:05:01,1","0:05:26,4",,,


Lets add a pile of columns that might be useful, perhaps it would be good to know the cummulative time for each lap.  We are going to need some `for` loops for this....

In [55]:
riders["Total1"] = pd.to_timedelta(riders["Lap1"])
for lap in [2,3,4,5,6]:
    sofar = pd.to_timedelta(riders["Lap1"])
    for calc in range(2,lap+1):
        sofar = sofar + pd.to_timedelta(riders[f"Lap{lap}"].replace(",\d", "", regex=True))
    riders[f"Total{lap}"] = sofar
riders

Unnamed: 0,Category Rank,Bib,Grades,Name,Laps Completed,Race Time,Lap1,Lap2,Lap3,Lap4,Lap5,Lap6,Total1,Total2,Total3,Total4,Total5,Total6
1,1,21,A,Luke Brame,5,"01:26:26,3","0:17:01,8","0:17:06,3","0:17:08,0","0:17:29,1","0:17:41,2",,0 days 00:17:18,0 days 00:34:24,0 days 00:51:34,0 days 01:09:45,0 days 01:28:02,NaT
2,2,30,A,Dylan George,5,"01:26:44,6","0:17:34,5","0:17:03,6","0:17:36,8","0:17:27,8","0:17:02,1",,0 days 00:22:45,0 days 00:39:48,0 days 00:57:57,0 days 01:15:06,0 days 01:30:53,NaT
3,3,1,A,Brian Price,5,"01:27:21,3","0:17:30,3","0:17:07,5","0:17:39,3","0:17:25,7","0:17:38,8",,0 days 00:22:03,0 days 00:39:10,0 days 00:57:21,0 days 01:14:18,0 days 01:32:35,NaT
4,4,13,A,Owen Gordon,5,"01:29:00,5","0:17:34,2","0:17:31,1","0:17:43,4","0:18:17,3","0:17:54,7",,0 days 00:22:42,0 days 00:40:13,0 days 00:58:08,0 days 01:17:33,0 days 01:34:18,NaT
5,5,33,A,Jordan Davies,5,"01:29:10,5","0:17:43,9","0:17:32,8","0:17:56,7","0:18:21,2","0:17:36,1",,0 days 00:24:19,0 days 00:41:51,0 days 01:00:11,0 days 01:19:22,0 days 01:34:43,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129,1,553,F,Sonia Vetisch,3,"00:14:16,6","0:04:48,3","0:04:41,6","0:04:46,8",,,,0 days 00:12:03,0 days 00:16:44,0 days 00:21:35,NaT,NaT,NaT
130,2,554,F,Emma Blume,3,"00:14:55,1","0:04:53,1","0:04:55,0","0:05:07,2",,,,0 days 00:12:51,0 days 00:17:46,0 days 00:23:05,NaT,NaT,NaT
131,3,558,F,Maia Uberoi-Robson,3,"00:15:08,2","0:04:49,9","0:04:59,8","0:05:18,6",,,,0 days 00:12:19,0 days 00:17:18,0 days 00:22:55,NaT,NaT,NaT
132,4,551,F,Natasha Padula,3,"00:15:39,9","0:05:12,5","0:05:01,1","0:05:26,4",,,,0 days 00:07:05,0 days 00:12:06,0 days 00:17:57,NaT,NaT,NaT



# Filtering rows (matt calls them "masks" but pandas people just use them willy-nilly)

I find it hard to resolve this syntax, but lets have a go, lets see how well I can explain this.

You know how we use a string in square brackets to get a column from a data frame?  Well what if we put a boolean series in there instead?  Pandas will line that series up against the data frame and use it to filter the rows


In [58]:
example = pd.DataFrame()
example["One"] = pd.Series([1,2,3,4])
example["Two"] = pd.Series(["a", "b", "c", "d"])
example

Unnamed: 0,One,Two
0,1,a
1,2,b
2,3,c
3,4,d


In [59]:
mask = pd.Series([True, True, False, True])
mask

0     True
1     True
2    False
3     True
dtype: bool

In [60]:
example[mask]

Unnamed: 0,One,Two
0,1,a
1,2,b
3,4,d


So this seems pretty esotetic, but it ends up being our main way of filtering rows.  For example, we can filter based on the values in one of the columns

In [63]:
filter = example["One"] != 2
print(filter)
example[filter]

0     True
1    False
2     True
3     True
Name: One, dtype: bool


Unnamed: 0,One,Two
0,1,a
2,3,c
3,4,d


and the common useage by `pandas` people is to combine all that into one line.  It makes sense if you break it down, but I find the syntax infuriatingly difficult to parse

In [64]:
example[example["One"] != 2]

Unnamed: 0,One,Two
0,1,a
2,3,c
3,4,d


# Exercise - lithgow filter

Filter the rainfall data from Lithgow (`data/rainfall/lithgow.csv`) to include only data from the 1990's

In [None]:
# solution here