<img src="../assets/data_analysis_with_polars_copyright-1.png" width="600"/>

# Up & running with `Polars`
By the end of this lecture you will be able to:
- import `Polars`
- read data from a CSV file into a `DataFrame`
- select rows and columns from a `DataFrame`
- make a scatter plot with Plotly

## Importing `Polars`

By convention we import `Polars` as `pl`.

In [None]:
import polars as pl
import plotly.express as px

## Reading from a CSV
We will load data from our first CSV - the dataset of Titanic passengers. This is located in the data directory one level up from this notebook.

In [None]:
csvFile = '../data/titanic.csv'

Similar to Pandas we can read this CSV with `read_csv`. 

Exercise: in the cell below, add a line after the `read_csv` call that prints out the first few rows with the `.head` method.

In [None]:
df = pl.read_csv(csvFile)

Each row in the Titanic dataset has details about one passenger on the Titanic. The binary `Survived` column shows whether the passenger survived (`1`) or died (`0`).

>The code above to load and read the CSV looks similar to `Pandas`. However, we can see the first difference with Pandas:

> **Polars does not use an index**.

> Polars is typically much faster than Pandas, so the lack of an index in Polars is not a performance disadvantage. We will also see that Polars code is easier to read and write without an index.

## Selecting data from a `DataFrame`

We can look at a row using square bracket indexing with an integer row number.

In [None]:
df[0]

> Polars does not have a `loc` or `iloc` indexer. We will learn more about selecting data in Section 2.

We can also select a column in the same way as in Pandas with square brackets.

In [None]:
df['Age'].head(3)

## Simple statistics
As in Pandas, we can get simple statistics on a `DataFrame` or on individual columns.

In [None]:
df.mean()

## Visualisation
To make a visualisation with Plotly, we select the columns using `[]`.

In [None]:
px.scatter(
    x=df["Age"],
    y=df["Fare"],
    color=df["Sex"],
    labels={"x": "Age", "y": "Fare", "color": "Sex"},
)

## Exercises
Each lecture finishes with some exercises for you. The solutions are at the bottom of the notebook.

In the exercises you will develop your understanding of:
- reading a CSV
- getting simple statistics


## Exercise 1: Use your Pandas experience
Use your Pandas experience to get some information from the Titanic CSV

First read the CSV into a `DataFrame` called `df` by replacing `<blank>` with your code.

In [None]:
df = <blank>

Exercise 1 continued:
- how many rows and columns are in the `DataFrame`?
- what is the average age and fare of the passengers?
- what is the largest value of `SibSp`, the number of siblings and spouses aboard?

## Exercise 2: Visualisation
Make a scatter plot of the passenger age `Age` against the number of parents and children `Parch`, colored by the passenger's sex `Sex`.

## Solutions

## Solution to Exercise 1

In [None]:
df = pl.read_csv(csvFile)

Number of rows and columns

In [None]:
df.shape

Average of Age and Fare

In [None]:
df[["Age","Fare"]].mean()

Maximum number of siblings and spouses (`SibSp`)

In [None]:
df["SibSp"].max()

## Solution to Exercise 2

In [None]:
px.scatter(
    x=df["Age"],
    y=df["Parch"],
    color=df["Sex"],
)