# The Basics
In this notebook, we will explore the very basics of Polars.
If you already have experience with Polars, you may want to skim this notebook and continue with the next exercise.

In [29]:
import polars as pl

## Creating and Manipulating DataFrames

Let's start with the creation of a simple DataFrame in Polars.
If you are used to pandas, this should look quite familiar.
In Polars, we can use the `DataFrame.schema` attribute to show the columns and corresponding types of the DataFrame.
If not explicitly provided in the input data, these will be inferred.

In [None]:
# Create a simple DataFrame
df = pl.DataFrame(
    {
        "name": ["Sjon", "Anita", "Klaas", "Sofie", "Kees"],
        "age": [34, 31, 41, 25, 52],
        "gender": ["M", "F", "M", "F", "M"],
        "salary": [75000, 65000, 80000, 70000, 90000],
        "experience": [5, 3, 7, 4, 10],
    }
)
print(df)
print(df.schema)

Each list we just provided as input data is now a Polars `Series` instance within the DataFrame, with the dtype shown in the schema:

In [None]:
a = df.get_column("name")
print(a)

We can see that all the integer columns in the schema were inferred to be `Int64`.
This wastes memory if we know the values will never need the full 64-bit range.
Therefore, it could be a good idea to cast the columns to the datatypes we need.
Alternatively, we can ensure that the input data we provide already has the required datatype.
In that case, instead of passing lists as input data, we would use NumPy arrays with defined dtypes, or pass in Polars `Series` directly.
We can mix as we please, defining one column as a list, another as a `Series` and yet another as a `numpy.array`.

Try it out below to obtain more suitable and restrictive dtypes for the example data frame!

### Exercise 1.1
Use the `cast()` method on the DataFrame to cast the columns to more efficient dtypes.
Check the reference for the available data types: https://docs.pola.rs/api/python/stable/reference/datatypes.html.

## Basic Manipulations

Let's try out some of the basic data manipulations Polars offers.

### Exercise 1.2
Reverse the order of the columns!
For this you can use the `select` method in combination with some indexing magic.

For changing the order of the rows, we usually want to use a `sort` statement on one of the columns.
In Polars, you can simply call `sort` with the desired column names and change the behavior of the sort by passing in additional arguments.


### Exercise 1.3
Sort the DataFrame by experience and separately by salary.

A third important manipulation we will look at here is changing the values of individual records.
This can be done using the `with_columns()` method on the `DataFrame`.
It is tempting to think that `with_columns()` creates a copy of the data frame and thus leads to significant memory overhead.
It should therefore be noted that the method does *not* actually copy existing data, but keeps a reference instead!

A simple usage of this method would be to use the values from another column:

In [None]:
df.with_columns(earnings="salary")

Note that the string value in the expression above will always refer to a column name, as in many expressions in Polars.
If instead you want to refer to literal values, you can use `pl.lit()` with the value passed in as an argument.
This function is an example of a function returning a Polars expression.

## Using expressions for manipulation

If we need to do more advanced manipulation, we will likely need the expression power of Polars.
Polars allows us to define expressions for manipulating column values using the `pl.col()` construct in combination with Polars functions that return expressions, or in combination with native Python operators.
The resulting type of the expression will be, unsurprisingly, a Polars `Expr` instance:


In [None]:
print(type(pl.col("a") > 2))
print(type(pl.col("a").abs()))

For a full list of available expressions, check out https://docs.pola.rs/api/python/stable/reference/expressions/index.html, where a full reference is provided, ordered by expression type.
We can essentially use expressions in any method that depends on data frame values, e.g. the `select()`, `filter()` and `with_columns()` methods.
Upon evaluation of the method, Polars will evaluate the expression in the context of the data frame.
If an expression with `pl.col()` is passed into a method that manipulates the column values, the existing column will be replaced by the new values resulting from the expression.
If in those cases we want to give the result a new name (and thus not overwrite the original column), we can use the `alias()` method on the `Expr` object.

### Exercise 1.4
Use the `select()` method on the data frame in combination with expressions to create a new data frame that shows the name in combination with a column named `sufficient_experience`, where `sufficient_experience` should show "no" for those with less than 5 experience points and "yes" for those with more.

## Reading from and saving to files
Polars makes it very easy to write DataFrames to files and read them back in.
In the exercise below, we will write our data frame to a CSV file and read the data back in, which will reveal an issue with this file format.
Other file formats not only solve this issue, but also allow more efficient writing and reading of large datasets.
We will look into this further in the third part of this hackathon.

### Exercise 1.5
Save the DataFrame created above to a CSV file and read it in again.
What do you notice about the data types after reading the data back in?