We now turn our attention to more subjective decisions, ones that don't fall out of built-in functions quite so easily.  We might need some "logic" to help us decide which rows/columns should remain.  The two main topics that we will tackle to give us this tool are:
  * filtering with masks
  * regular expressions

  Regular expressions are a little easier to imagine, so let's start with them.

  In the next code block, I pull out the "A Grade" data from the `wsmtb/2018-XC-Club-Champs.csv` file.  You should notice that the time data is not quite right.  We must have used some European software to generate this, because it seems to have centi-seconds data separated with a comma?  We don't have the tooling to deal with this, we are not even confident that is what it is, and who needs that sort of precision anyway?  Imagine our job is to "clean up" that data so the centi-seconds are not there.

In [4]:
import pandas as pd

a_grade = pd.read_csv("data/wsmtb/2018-XC-Club-Champs.csv", skiprows = [0,1,3], nrows=8)
a_grade

Unnamed: 0,Category Rank,Bib,Name,Laps Completed,Race Time,Lap1,Lap2,Lap3,Lap4,Lap5,Lap6
0,1,21,Luke Brame,5,"01:26:26,3","0:17:01,8","0:17:06,3","0:17:08,0","0:17:29,1","0:17:41,2",
1,2,30,Dylan George,5,"01:26:44,6","0:17:34,5","0:17:03,6","0:17:36,8","0:17:27,8","0:17:02,1",
2,3,1,Brian Price,5,"01:27:21,3","0:17:30,3","0:17:07,5","0:17:39,3","0:17:25,7","0:17:38,8",
3,4,13,Owen Gordon,5,"01:29:00,5","0:17:34,2","0:17:31,1","0:17:43,4","0:18:17,3","0:17:54,7",
4,5,33,Jordan Davies,5,"01:29:10,5","0:17:43,9","0:17:32,8","0:17:56,7","0:18:21,2","0:17:36,1",
5,6,5,Benjamin Green,5,"01:31:21,4","0:17:34,1","0:17:42,9","0:17:34,6","0:18:13,8","0:20:16,2",
6,7,7,Ian Anderson,5,"01:32:09,8","0:18:29,9","0:18:29,4","0:18:43,6","0:18:26,4","0:18:00,8",
7,8,34,Joel Stearnes,5,"01:34:49,9","0:18:52,6","0:18:34,1","0:18:35,7","0:18:46,7","0:20:01,0",


Recall that we can get any one column with square bracket notation (`a_grade["Race Time"]`).  Did you know though that we can get a smaller data frame if we put a list of columns in the square brackets?

In [5]:
times = a_grade[["Race Time", "Lap1", "Lap2", "Lap3", "Lap4", "Lap5"]]
times

Unnamed: 0,Race Time,Lap1,Lap2,Lap3,Lap4,Lap5
0,"01:26:26,3","0:17:01,8","0:17:06,3","0:17:08,0","0:17:29,1","0:17:41,2"
1,"01:26:44,6","0:17:34,5","0:17:03,6","0:17:36,8","0:17:27,8","0:17:02,1"
2,"01:27:21,3","0:17:30,3","0:17:07,5","0:17:39,3","0:17:25,7","0:17:38,8"
3,"01:29:00,5","0:17:34,2","0:17:31,1","0:17:43,4","0:18:17,3","0:17:54,7"
4,"01:29:10,5","0:17:43,9","0:17:32,8","0:17:56,7","0:18:21,2","0:17:36,1"
5,"01:31:21,4","0:17:34,1","0:17:42,9","0:17:34,6","0:18:13,8","0:20:16,2"
6,"01:32:09,8","0:18:29,9","0:18:29,4","0:18:43,6","0:18:26,4","0:18:00,8"
7,"01:34:49,9","0:18:52,6","0:18:34,1","0:18:35,7","0:18:46,7","0:20:01,0"


These are all the times!  I just wanted to show that off, and this seemed the best place :)

Now, the useful method we need is:
  * [replace](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html), or is it
  * [replace](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html)

uh-oh.

# Important tangent on documentation, Stack Overflow and underlying knowledge

You will find the above happens fairly often in a rich library like `pandas`.  There are two completely different methods called `replace`.  They do _similar_ but not exactly the same things.  They work in different circumstances.  In this case:
  * `replace` works on `DataFrames` (there is also a `Series.replace` that behaves the same way) and is mainly concerned with letting you look for certain _values_ in the data frame and replace them with other values
  * `str.replace` works on `Series` and is mainly concerned with letting you replace parts of strings within the cells of the series.

However, because they both support regular expressions, there is overlap in what they can do.  The problem here is that you might find solutions using one when the other is more appropriate.  Alternatively, you might learn about one and never even realise the other is available.
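To make the difference concrete, here is a minimal sketch on a made-up data frame (the `codes` frame and its `code` column are invented for illustration, they are not part of the race data).  `regex` is passed explicitly to `str.replace` because its default has changed between `pandas` versions.

```python
import pandas as pd

# Made-up data, purely for illustration
codes = pd.DataFrame({"code": ["ab-1", "cd-2", "-"]})

# DataFrame.replace looks for whole cell values by default,
# so only the cell that is exactly "-" changes
whole_values = codes.replace("-", "_")

# With regex=True, DataFrame.replace substitutes inside each string instead
inside_df = codes.replace("-", "_", regex=True)

# Series.str.replace always works inside each string
inside_series = codes["code"].str.replace("-", "_", regex=False)

print(whole_values["code"].tolist())   # ['ab-1', 'cd-2', '_']
print(inside_df["code"].tolist())      # ['ab_1', 'cd_2', '_']
print(inside_series.tolist())          # ['ab_1', 'cd_2', '_']
```

The last two lines give the same answer, which is exactly the overlap described above.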

What is the solution to this problem?  Curated learning - just like we are doing here - lucky you!

# Regular Expressions

This is a whole topic on its own, but a very important one for data manipulation.  I won't recreate yet another regex tutorial: [go and do this one, then come back and do the exercises](https://regexone.com/).
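As a very small taste of what a pattern looks like, here is a sketch using Python's built-in `re` module on a made-up string (the string is invented for illustration).  The same pattern syntax is what the `pandas` methods accept when regular expressions are switched on.

```python
import re

# A made-up string, purely for illustration
note = "bib 21 finished lap 5"

# \d+ means "one or more digits"; re.sub replaces every match
masked = re.sub(r"\d+", "#", note)

# re.findall returns every match as a list of strings
numbers = re.findall(r"\d+", note)

print(masked)   # bib # finished lap #
print(numbers)  # ['21', '5']
```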

# Exercises - regex

  * Use `str.replace` to remove the centi-seconds from `times["Race Time"]`
  * Use `str.replace` to convert all centi-seconds to ".0000" in `times["Race Time"]`
  * (Harder) Use `replace` from `DataFrame` to convert the `,` to `.` before the centi-seconds in `times`.

In [6]:
# put your solutions here


# Filtering rows (Matt calls them "masks" but pandas people just use them willy-nilly)

I find this syntax hard to read, but let's have a go and see how well I can explain it.

You know how we use a string in square brackets to get a Series from a data frame?  We've also seen that you can use a list of strings there and get back a smaller DataFrame!  

Well, what if we put a boolean series in there instead?  Pandas will line that series up against the data frame and use it to filter the rows, keeping only the rows where the series is `True`.


In [58]:
example = pd.DataFrame()
example["One"] = pd.Series([1,2,3,4])
example["Two"] = pd.Series(["a", "b", "c", "d"])
example

Unnamed: 0,One,Two
0,1,a
1,2,b
2,3,c
3,4,d


In [59]:
mask = pd.Series([True, True, False, True])
mask

0     True
1     True
2    False
3     True
dtype: bool

In [60]:
example[mask]

Unnamed: 0,One,Two
0,1,a
1,2,b
3,4,d


So this seems pretty esoteric, but it ends up being our main way of filtering rows.  For example, we can filter based on the values in one of the columns:

In [63]:
# called row_filter rather than filter so we don't shadow Python's built-in filter()
row_filter = example["One"] != 2
print(row_filter)
example[row_filter]

0     True
1    False
2     True
3     True
Name: One, dtype: bool


Unnamed: 0,One,Two
0,1,a
2,3,c
3,4,d


The common usage by `pandas` people is to combine all of that into one line.  It makes sense if you break it down, but I find the syntax infuriatingly difficult to parse:

In [64]:
example[example["One"] != 2]

Unnamed: 0,One,Two
0,1,a
2,3,c
3,4,d
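Masks can also be combined before they are used.  This is a small sketch that rebuilds the same `example` frame so it runs on its own: `&` means "and", `|` means "or", and each comparison needs its own parentheses because `&` and `|` bind more tightly than comparisons like `>` in Python.

```python
import pandas as pd

example = pd.DataFrame({"One": [1, 2, 3, 4], "Two": ["a", "b", "c", "d"]})

# Keep rows where One is greater than 1 AND less than 4
middle = example[(example["One"] > 1) & (example["One"] < 4)]

# Keep rows where One is 1 OR Two is "d"
ends = example[(example["One"] == 1) | (example["Two"] == "d")]

print(middle["One"].tolist())  # [2, 3]
print(ends["One"].tolist())    # [1, 4]
```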


# Exercise - Lithgow filter

Filter the rainfall data from Lithgow (`data/rainfall/lithgow.csv`) to include only data from the 1990s.

In [None]:
# solution here

# Concepts

  * Regular expressions are a little language commonly used for searching and modifying strings
  * A boolean series can be used to index a DataFrame.  When you do so, you get back a smaller DataFrame containing every row whose index label has `True` in the Series
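The "index label" part matters: the mask is matched to rows by label, not by position.  A minimal sketch, rebuilding the `example` frame and using a deliberately shuffled mask (`shuffled_mask` is a name invented here).  It uses `.loc`, which accepts a boolean Series too; plain square brackets should give the same rows but may print a warning about reindexing in this situation.

```python
import pandas as pd

example = pd.DataFrame({"One": [1, 2, 3, 4], "Two": ["a", "b", "c", "d"]})

# Same True/False decisions as the earlier mask, but listed in a different order:
# label 2 is False; labels 0, 1 and 3 are True
shuffled_mask = pd.Series([False, True, True, True], index=[2, 0, 1, 3])

# The mask is lined up with the frame by index label before filtering
kept = example.loc[shuffled_mask]

print(kept.index.tolist())  # [0, 1, 3]
```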

# Python concepts
  * It is possible to have completely different methods with the same name
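
This is not special to `pandas`.  As a small sketch from the standard library (the values are made up for illustration), strings and dates both have a method called `replace`, and the two do quite different jobs:

```python
from datetime import date

# str.replace swaps one piece of text for another
text = "2018-XC".replace("2018", "2019")

# date.replace has the same name but builds a new date with some fields changed
day = date(2018, 6, 3).replace(year=2019)

print(text)  # 2019-XC
print(day)   # 2019-06-03
```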