# CSV files 1: reading a CSV file
By the end of this lecture you will be able to:
- set the column names when reading a CSV file
- specify how to parse a CSV file
- specify a dtype schema when reading a CSV file
- modify CPU and memory usage when reading a CSV file

Warning: this is a long lecture as we go through the full CSV parsing process!

## What is a CSV file?
A CSV file is:
- a text file that uses a comma (or other delimiter) to separate values
- a file where data is ordered in rows rather than columns
- a file where the only potential metadata is a header row of column names - no type information for each column is specified

In [1]:
import polars as pl

In [2]:
csv_file = "../data/titanic.csv"

We read this CSV file as we have read it many times before!

In [3]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


## CSV parsers

CSV files are text and so must be parsed to:
1. get the column names
2. split each row into columns
3. infer the dtype of each column

Polars has two engines to parse CSV files: 
- the default Polars parser
- the PyArrow parser

We can use the PyArrow parser with the `use_pyarrow` argument in `pl.read_csv`.

In my experiments the `Polars` built-in parser is faster and I recommend using it unless there is a specific need for PyArrow.

## Compressed CSV files
Polars can read .gzip compressed CSV files but not .bz compressed CSV files. 

To read .bz compressed CSV files use the PyArrow parser
```python
pl.read_csv(csv_file, use_pyarrow=True)
```
## Header and column names
By default Polars takes the first row of a CSV as the header to set the column names.

### No header
If the first row is not a header we can set `has_header=False`, in which case the columns are named `column_1`, `column_2` and so on.

In [4]:
pl.read_csv(csv_file, has_header=False).head(2)

column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12
str,str,str,str,str,str,str,str,str,str,str,str
"""PassengerId""","""Survived""","""Pclass""","""Name""","""Sex""","""Age""","""SibSp""","""Parch""","""Ticket""","""Fare""","""Cabin""","""Embarked"""
"""1""","""0""","""3""","""Braund, Mr. Owen Harris""","""male""","""22""","""1""","""0""","""A/5 21171""","""7.25""",,"""S"""


### Rename columns
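We can rename columns immediately after the CSV is parsed with `new_columns`. The list can be shorter than the number of columns, in which case only the leading columns are renamed.

In this example we rename the first column to lowercase.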

In [5]:
pl.read_csv(csv_file,new_columns=['passengerid']).head(2)

passengerid,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


### Skip rows after the header
We skip rows that follow the header with `skip_rows_after_header`.

In [6]:
pl.read_csv(csv_file,skip_rows_after_header=1).head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


### Skip rows to the header

If the header is on line `N` of the CSV we can set `skip_rows = N - 1`

In this example we set line 2 of the CSV as the header

In [7]:
pl.read_csv(csv_file,skip_rows=1).head(2)

1,0,3,"Braund, Mr. Owen Harris",male,22,1_duplicated_0,0_duplicated_0,A/5 21171,7.25,Unnamed: 10_level_0,S
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


If header names are duplicated (as with columns 7 and 8 of this example) Polars adds `_duplicated_0` to the column name

## Parsing CSVs
In the following examples we simulate CSV files with Python strings.

The newline `\n` character shows the line breaks in the simulated CSV file

The `b` prefix makes the literal a `bytes` object rather than a `str` so that it can be passed to `pl.read_csv`

In [8]:
CSV_string = b"A,B,C\n0,1,2\n"
pl.read_csv(CSV_string)

A,B,C
i64,i64,i64
0,1,2


### Delimiter
Polars assumes the delimiter is a `,`. This can be changed with the `separator` argument.

In this example we have a CSV with tab (`\t`) separated data rather than comma-separated data

In [9]:
tab_CSV_string = b"A\tB\tC\n0\t1\t2\n"

pl.read_csv(tab_CSV_string,separator="\t")

A,B,C
i64,i64,i64
0,1,2


### Comment lines

Comment lines that start with a certain character in the CSV are ignored by setting the `comment_prefix`

In [10]:
comment_CSV_string = b"a,b,c\n#Comment\n0,1,2\n"
pl.read_csv(comment_CSV_string,comment_prefix="#")

a,b,c
i64,i64,i64
0,1,2


### Quotes
Quotes in the CSV are indicated with the `quote_char`. 

In this example we have quotes because a text contains a comma

In [11]:
quote_CSV_string = b'name,age\n"Armstrong,Neil",39\n'
pl.read_csv(quote_CSV_string,quote_char='"')

name,age
str,i64
"""Armstrong,Neil""",39


## Inferring the dtypes
Polars needs to understand the dtype of each column in the CSV. To do this Polars:
- reads the first 100 lines
- if a dtype can be inferred it sets the dtype for that column
- if a consistent dtype cannot be inferred then a `ComputeError` exception is raised

### Number of rows to infer the dtypes
We can adjust the number of lines used for type inference.

In the Titanic CSV the `Age` column starts off with 57 integers before a decimal value in line 58.

If we set `infer_schema_length` lower than 58, Polars raises a `ComputeError` because it infers an integer dtype and then encounters a float on line 58 (check this by reducing the value here)

In [12]:
pl.read_csv(csv_file,infer_schema_length=58).head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


### Setting the schema
We can also define the schema with a `dict` when reading the CSV. 

In this example we read in integers and floats as 32-bit dtypes

In [13]:
pl.read_csv(
    csv_file,
    schema={
        "PassengerId": pl.Int32,
        "Survived": pl.Int32,
        "Pclass": pl.Int32,
        "Name": pl.Utf8,
        "Sex": pl.Utf8,
        "Age": pl.Float32,
        "SibSp": pl.Int32,
        "Parch": pl.Int32,
        "Ticket": pl.Utf8,
        "Fare": pl.Float32,
        "Cabin": pl.Utf8,
        "Embarked": pl.Utf8,
    },
).head(2)


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i32,i32,i32,str,str,f32,i32,i32,str,f32,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.283302,"""C85""","""C"""


## Handling mixed types and exceptions
In this example we have a CSV file with mixed types, as the values in the first column are:
- `1.0` which looks like a float and
- `b` which is a string

Rather than raising an exception, Polars casts the column to string dtype in this case.

In [14]:
mixed_type_csv_file = "../data/badCSV.csv"

In [15]:
pl.read_csv(mixed_type_csv_file)

col1,col2
str,str
"""1.0""","""aa"""
"""b""","""aa"""


### Overriding the inferred schema
We can override the schema inferred by Polars by specifying the `schema_overrides` argument. 

In this example we read columns of integers as: integer, string and float

In [23]:
CSV_string = b"A,B,C\n0,1,2\n0,1,2\n"

pl.read_csv(CSV_string,schema_overrides={'B':pl.Utf8,'C':pl.Float32})

A,B,C
i64,str,f32
0,"""1""",2.0
0,"""1""",2.0


### Ignore errors
We can also tell Polars to ignore errors in which case values that cannot be cast to the schema for that column are returned as `null`.

In this case we try to make column `A` boolean but have `True` in the first row and `0` in the second row

In [24]:
CSV_string = b"A,B\nTrue,1\n0,1\n"

pl.read_csv(CSV_string,schema_overrides={'A':pl.Boolean},ignore_errors=True)

A,B
bool,i64
True,1
,1


## Set values to `null`
We might know that there are values in a column that are incorrect.

We set the value `2` to `null` with `null_values`. A single string applies to every column; here it only matches a value in `B`

In [18]:
CSV_string = b"A,B\nTrue,1\n1,2\n"

pl.read_csv(CSV_string,null_values="2")

A,B
str,i64
"""True""",1.0
"""1""",


We can also pass a list of strings to `null_values`

In [19]:
CSV_string = b"A,B\nTrue,1\n1,2\n"

pl.read_csv(CSV_string,null_values=["1","2"])

A,B
bool,str
True,
,


Or specify different values to set as `null` for different columns

In [20]:
CSV_string = b"A,B\nTrue,1\n1,2\n"

pl.read_csv(CSV_string,null_values={"B":"1"})

A,B
str,i64
"""True""",
"""1""",2.0


## Performance of CSV parsing
### Number of threads
The CSV parser in Polars is multithreaded and uses the same number of threads as there are cores on your computer.

We can vary this with the `n_threads` argument. We can use fewer threads to reduce CPU usage or more threads to (potentially) reduce read time.

In experiments on my computer (8 cores) with different datasets compared to the default:
- reducing `n_threads` to 1 increases time taken by 3x
- reducing `n_threads` to 3-4 increases time taken by 30%
- increasing `n_threads` to 40 reduces time taken by 30% on some datasets but makes no difference on others

Experiment with your own datasets if you want to reduce CPU usage or reduce read time.

### Memory usage
We can potentially reduce memory usage when reading a large CSV with `low_memory = True`. When reading a large CSV Polars reads the CSV into separate chunks in memory before combining the chunks into a `DataFrame` that is a single chunk in memory.

With `low_memory = True` Polars uses a slower non-parallel method of combining the chunks into a `DataFrame` that is a single chunk in memory.

If memory is really an issue then it is best to use streaming mode as explored in the working with multiple CSV files lecture later in this Section.

## Exercises
In the exercises you will develop your understanding of:
- setting the column names of a CSV
- parsing a CSV
- setting the dtypes
- modifying the number of threads

### Exercise 1
In this exercise we want to parse the CSV strings to produce a `DataFrame` equal to the following

In [25]:
target = pl.DataFrame({"a":[1,2],"b":[3,4],"c":[5,6]})
target

a,b,c
i64,i64,i64
1,3,5
2,4,6


Parse the CSV strings in the following cells

In [26]:
CSV_string = b"Data passed quality control 2020-01-01\na,b,c\n1,3,5\n2,4,6\n"
pl.read_csv(<blank>)


In [None]:
# Rename columns
CSV_string = b"A,B,C\n1,3,5\n2,4,6\n"
pl.read_csv(<blank>)

In [None]:
# Whitespace delimiter
CSV_string = b"a b c\n1 3 5\n2 4 6\n"
pl.read_csv(<blank>)

In [None]:
# Comment line
CSV_string = b"a,b,c\n#Data passed quality control 2020-01-01\n1,3,5\n2,4,6\n"
pl.read_csv(<blank>)

This time parse the CSV to produce a `DataFrame` with all columns as 64-bit floats

In [None]:
CSV_string = b"a,b,c\n#Data passed quality control 2020-01-01\n1,3,5\n2,4,6\n"
pl.read_csv(<blank>)

Find missing data in the CSV and replace with `null`. Ensure columns are not cast to string

In [None]:
CSV_string = b"a,b,c\n1,3,5\nNA,4,na\n"
pl.read_csv(<blank>)

### Exercise 2
Parse the NYC taxi CSV with:
- the default number of threads,
- one thread and
- 40 threads
to see if it affects performance.

(This dataset is too small to see any differences - but try it with your own datasets to see if changing the number of threads affects performance)

In [None]:
nyccsv_file = "../data/nyc_trip_data_1k.csv"

In [None]:
%%timeit -n1 -r3
pl.read_csv(<blank>)

## Solutions

### Solution to exercise 1
In this exercise we want to parse the CSV strings to produce a `DataFrame` equal to the following

In [27]:
target = pl.DataFrame({"a":[1,2],"b":[3,4],"c":[5,6]})
target

a,b,c
i64,i64,i64
1,3,5
2,4,6


Parse the CSV strings in the following cells

In [28]:
CSV_string = b"Data passed quality control 2020-01-01\na,b,c\n1,3,5\n2,4,6\n"
pl.read_csv(CSV_string,skip_rows=1)

a,b,c
i64,i64,i64
1,3,5
2,4,6


In [29]:
CSV_string = b"A,B,C\n1,3,5\n2,4,6\n"
pl.read_csv(CSV_string,new_columns=["a","b","c"])

a,b,c
i64,i64,i64
1,3,5
2,4,6


In [30]:
CSV_string = b"a b c\n1 3 5\n2 4 6\n"
pl.read_csv(CSV_string,separator=" ")

a,b,c
i64,i64,i64
1,3,5
2,4,6


In [31]:
CSV_string = b"a,b,c\n#Data passed quality control 2020-01-01\n1,3,5\n2,4,6\n"
pl.read_csv(CSV_string,comment_prefix="#")

a,b,c
i64,i64,i64
1,3,5
2,4,6


This time parse the CSV with all columns as 64-bit floats

In [38]:
CSV_string = b"a,b,c\n#Data passed quality control 2020-01-01\n1,3,5\n2,4,6\n"
pl.read_csv(CSV_string,comment_prefix="#",schema_overrides={"a":pl.Float64,"b":pl.Float64,"c":pl.Float64})

a,b,c
f64,f64,f64
1.0,3.0,5.0
2.0,4.0,6.0


Find missing data and replace with `null`. Ensure columns are not cast to string

In [33]:
CSV_string = b"a,b,c\n1,3,5\nNA,4,na\n"
pl.read_csv(CSV_string,null_values={"a":"NA","c":"na"})

a,b,c
i64,i64,i64
1,3,5
,4,


### Solution to exercise 2
Parse the NYC taxi CSV with:
- the default number of threads
- one thread
- 40 threads
to see if it affects performance.

Then try it with your own datasets to see if it affects performance

In [34]:
nyccsv_file = "../data/nyc_trip_data_1k.csv"

In [35]:
%%timeit -n1
pl.read_csv(nyccsv_file,n_threads=1)

2.77 ms ± 435 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [36]:
%%timeit -n1
pl.read_csv(nyccsv_file)

3.69 ms ± 614 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [37]:
%%timeit -n1
pl.read_csv(nyccsv_file,n_threads=40)

4.22 ms ± 706 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
