# Chapter 1
## Getting Started
### Exercise 1
Do some simple arithmetic: try `1+1`. To execute a cell, press `<Shift>` + `<Enter>` (`<Maj>` + `<Entrée>` on some French keyboards). Other possible operators are `-`, `/`, `*`, and `**`.
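
As a minimal sketch of the other operators, the cell below applies each one to the same pair of numbers. The values 7 and 2 and the variable names are arbitrary choices for illustration.

```python
# Each operator applied to the same two numbers
difference = 7 - 2   # subtraction
quotient = 7 / 2     # division; in Python 3 this always gives a float
product = 7 * 2      # multiplication
power = 7 ** 2       # exponentiation (7 squared)

print(difference, quotient, product, power)  # 5 3.5 14 49
```

Note that `**` is the power operator; `^` means something different in Python.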

Use the \[+\] button in the toolbar at the top to make more cells if you need them. They will appear under your currently selected cell.

Suppose we want to calculate the sine of 3. The first thing you might try is

In [7]:
sin(3)

NameError: name 'sin' is not defined

However, that doesn't work. This is because the math functions are located in a package that we need to import first using `import` as we saw in the slides before.

In [1]:
import math

In [9]:
math.sin(3)

0.1411200080598672

Notice that we have to put the name of the package first, followed by a period and then the function.

To get help on any function, type `help(functionname)`.

In [2]:
help(math.sin)

Help on built-in function sin in module math:

sin(x, /)
    Return the sine of x (measured in radians).
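
The help text says the argument is measured in radians, which is easy to forget. The sketch below assumes you have an angle in degrees and converts it with `math.radians` first; the variable names are only for illustration.

```python
import math

# math.sin expects radians, so convert degrees before calling it
angle_in_degrees = 30
angle_in_radians = math.radians(angle_in_degrees)
sine_of_30 = math.sin(angle_in_radians)

# Rounded because floating-point results are rarely exact
print(round(sine_of_30, 6))  # 0.5

# Other functions in the package are reached the same way
print(math.sqrt(16))  # 4.0
```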



## Text
You can type text between cells by changing the cell type from "Code" to "Markdown" in the drop-down menu in the toolbar at the top. This lets you write reports with code embedded. It uses a special language called "Markdown", which is plain text with some formatting.

* Text can be made *italic* by using single asterisks `*This will be italic*`
* Text can be made **bold** by using double asterisks `**This will be bold**`
* Or ***combined*** with triple asterisks `***This will be both***`

Headers are created with `#`. The more `#` characters you use, the deeper the heading level and the smaller the heading.
# Level 1
`# Level 1`
## Level 2
`## Level 2`
### Level 3
`### Level 3`

For the full Markdown specification, see <https://sourceforge.net/p/jupiter/wiki/markdown_syntax/>

### Exercise 2
Claim your notebook! Enter your name below and make Python bold. Press `<Shift>`+`<Enter>` to display. Don't forget to change the cell type. Double-click the text to enter edit mode again.

In [None]:
# Notebook of ...
Today we will learn Python!

## Reading data
We will use a package called pandas for reading data. It makes it much easier to load and manipulate data and is extensively used in many fields. We will import it now.

In [1]:
import pandas as pd
import numpy as np

# The line below is only needed for this workshop
from done import imdone, quizanswer

Notice that we used `import pandas as pd`. This does the same as `import pandas` but it renames the package to the shorter name `pd`. Now every time we want to use a function from pandas, we only need to type `pd.somefunction()` instead of `pandas.somefunction()`.
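
The renaming works for any package, not just pandas. As a small sketch, the cell below uses the standard-library `math` package with the made-up alias `m`, so it runs even where pandas is not installed.

```python
import math
import math as m  # same package, shorter (hypothetical) name

# Both names refer to exactly the same package object
print(m is math)     # True

# So functions can be called through either name
print(m.sqrt(16))    # 4.0
print(math.sqrt(16)) # 4.0
```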

To read in data we will use the `read_csv` function which, as the name implies, can read in a CSV file. CSV stands for Comma Separated Values and is one of the most common data formats. Even Excel allows exporting to CSV.

In [2]:
mpg = pd.read_csv("data/mpg.csv")

Note that the file path is interpreted relative to the current working directory, which for a notebook is normally the folder the notebook is in.
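
If a file cannot be found, it helps to check where Python is actually looking. The sketch below prints the working directory and the absolute location that the relative path points to. It then reads a tiny made-up CSV from a string (via `io.StringIO`) instead of a file, so it runs anywhere; the three sample rows are invented for illustration.

```python
import io
import os
from pathlib import Path

import pandas as pd

# Where relative paths start from
print(os.getcwd())

# The absolute location that "data/mpg.csv" refers to from here
print(Path("data/mpg.csv").resolve())

# read_csv accepts a file path or any file-like object;
# this inline text stands in for a real file
csv_text = "manufacturer,model,cty\naudi,a4,18\naudi,a4,21\nford,mustang,15\n"
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```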

### Viewing data
Several functions exist to look at the data. The simplest is to write the variable and execute it.

In [3]:
mpg

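| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| 4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
| 5 | audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
| 6 | audi | a4 | 3.1 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact |
| 7 | audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
| 8 | audi | a4 quattro | 1.8 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |
| 9 | audi | a4 quattro | 2.0 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |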


This is a very long list. We can use the functions `head` and `tail` to just look at a small piece.

In [4]:
mpg.head()

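| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| 4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |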


In [5]:
mpg.tail()

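| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 229 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize |
| 230 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize |
| 231 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize |
| 232 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize |
| 233 | volkswagen | passat | 3.6 | 2008 | 6 | auto(s6) | f | 17 | 26 | p | midsize |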
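
Both `head` and `tail` show five rows by default and accept an optional number of rows. The sketch below uses a small made-up DataFrame with a single column `n` so that it runs without the workshop data.

```python
import pandas as pd

# Hypothetical table with ten rows, numbered 0 to 9
numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)   # rows 0, 1, 2
last_two = numbers.tail(2)      # rows 8, 9
default_rows = numbers.head()   # no argument: first 5 rows

print(first_three)
print(last_two)
print(len(default_rows))  # 5
```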


We can also query what the column names are with `columns` and what types of data the columns have with `dtypes`.

In [12]:
mpg.columns

Index(['manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty',
       'hwy', 'fl', 'class'],
      dtype='object')

In [11]:
mpg.dtypes

manufacturer     object
model            object
displ           float64
year              int64
cyl               int64
trans            object
drv              object
cty               int64
hwy               int64
fl               object
class            object
dtype: object

## Accessing subsets of data
When working with data we often want just a specific subset. We can access a column using the format `dataframe.column_name` or `dataframe["column name"]`. They do the same thing, but only the second form works when the column name contains spaces or clashes with a Python keyword or an existing DataFrame attribute.

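Rows can be extracted by specifying the range you want, as in `dataframe[start:end]`, `dataframe.loc[start:end]`, or `dataframe.iloc[start:end]`. Plain slicing and `iloc` work by position and leave out `end`, while `loc` works by index label and includes `end`.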

In [19]:
mpg.model.head()

0    a4
1    a4
2    a4
3    a4
4    a4
Name: model, dtype: object

In [17]:
mpg[2:4]

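| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |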


In [33]:
mpg.iloc[[1,4,8]]

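| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
| 8 | audi | a4 quattro | 1.8 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |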


In [28]:
mpg[["model", "displ"]][2:4]

| | model | displ |
|---|---|---|
| 2 | a4 | 2.0 |
| 3 | a4 | 2.0 |
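
The three ways of slicing rows do not all treat the end of the range the same way. The sketch below compares them on a made-up five-row DataFrame called `cars`, which has the default index 0 to 4.

```python
import pandas as pd

# Hypothetical miniature table with the default integer index 0..4
cars = pd.DataFrame({
    "model": ["a4", "a4 quattro", "a6 quattro", "malibu", "corvette"],
    "cyl": [4, 4, 6, 6, 8],
})

by_slice = cars[1:3]      # by position, end excluded: rows 1 and 2
by_iloc = cars.iloc[1:3]  # by position, end excluded: rows 1 and 2
by_loc = cars.loc[1:3]    # by label, end included: rows 1, 2 and 3

print(len(by_slice), len(by_iloc), len(by_loc))  # 2 2 3
```

With the default index, labels and positions coincide, so the only visible difference is whether the last row is included.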


### Exercise 2
1. Find the 10th to 20th row of the DataFrame.
1. Show only the class column.
1. (Advanced) Show the columns manufacturer and model for rows 2, 4, 8, 13

In [None]:
# Running this cell will let me know you have completed this exercise.
# Yes, I'm using Python to help me with the Python workshop!
imdone(1,2)

Another handy function for DataFrames is `describe()`, which calculates summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns so you can quickly examine your data.

In [17]:
mpg.describe()

| | displ | year | cyl | cty | hwy |
|---|---|---|---|---|---|
| count | 234.0 | 234.0 | 234.0 | 234.0 | 234.0 |
| mean | 3.471795 | 2003.5 | 5.888889 | 16.858974 | 23.440171 |
| std | 1.291959 | 4.509646 | 1.611534 | 4.255946 | 5.954643 |
| min | 1.6 | 1999.0 | 4.0 | 9.0 | 12.0 |
| 25% | 2.4 | 1999.0 | 4.0 | 14.0 | 18.0 |
| 50% | 3.3 | 2003.5 | 6.0 | 17.0 | 24.0 |
| 75% | 4.6 | 2008.0 | 8.0 | 19.0 | 27.0 |
| max | 7.0 | 2008.0 | 8.0 | 35.0 | 44.0 |
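
The result of `describe()` is itself a DataFrame, so individual statistics can be picked out of it. The sketch below assumes a small made-up table with one numeric column (`cty`) and one text column (`drv`); by default only the numeric column is summarized.

```python
import pandas as pd

# Hypothetical four-row table: one numeric column, one text column
small = pd.DataFrame({
    "cty": [18, 21, 20, 21],
    "drv": ["f", "f", "f", "4"],
})

summary = small.describe()
print(summary)

# Pick one statistic for one column: row label first, then column name
mean_cty = summary.loc["mean", "cty"]
print(mean_cty)  # 20.0
```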


## Data sanitization
Never trust your data. Surveys can be filled in incorrectly, conventions can change, lots of things can happen. We will use a dataset I grabbed from [Ontario's Open Data platform](https://data.ontario.ca) containing all of the names of babies born in Ontario for the past 100 years. The data is already on our Jupyter Hub space, next to the `data/mpg.csv` file we used before. The new data set is called `data/babynames.csv`.

### Exercise 3
Load the dataset and call the new object `babynames`.

In [None]:
imdone(1,3)

Since this is actual data collected by many municipalities and other government offices, much of it by hand, there is likely going to be some erroneous data.

One mistake might be that people put in initials. So let's check the fields. We'll use a bit of data manipulation here which we will talk about more in chapter 3.

In [4]:
babynames[babynames["name"].map(len) == 1]["name"].unique()

array(['J', 'M', 'K'], dtype=object)

It seems these entries are just initials rather than full names. That's probably not right.

How about names with two letters?

In [3]:
babynames[babynames["name"].map(len) == 2]["name"].unique()

array(['Jo', 'Ka', 'Le', 'My', 'No', 'Yi', 'Yu', 'Wm', 'Ty', 'Vu', 'Bo',
       'Ho', 'An', 'Om', 'Zi', 'Aj', 'Li', 'Ji', 'Cy', 'Md'], dtype=object)
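
The filter used above works in three steps: measure the length of each name, compare the lengths to a number to get a column of `True`/`False` values, and use that column to keep only the matching rows. The sketch below spells the steps out on a made-up table; the names and the `frequency` column are invented for illustration and are not taken from the Ontario data.

```python
import pandas as pd

# Hypothetical miniature version of the baby names table
names = pd.DataFrame({
    "name": ["J", "Jo", "Anna", "M", "Li", "J"],
    "frequency": [5, 12, 340, 7, 25, 6],
})

# Step 1: length of every name
lengths = names["name"].map(len)

# Step 2: True where the name is a single letter
is_single_letter = lengths == 1

# Step 3: keep only those rows, then list each distinct name once
single_letter_names = names[is_single_letter]["name"].unique()
print(list(single_letter_names))  # ['J', 'M']

# pandas also offers .str.len(), which gives the same lengths
same_lengths = names["name"].str.len()
```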

# BREAK TIME!
You may get the following error when we start again:

<img src="images/connection_failed.png" />

This means the connection with the kernel has timed out and you need to go to <https://uottawa.syzygy.ca> and log in again.