# CSV files 1: reading a CSV file
By the end of this lecture you will be able to:
- set the column names when reading a CSV file
- specify how to parse a CSV file
- specify a dtype schema when reading a CSV file
- modify CPU and memory usage when reading a CSV file

Warning: this is a long lecture as we go through the full CSV parsing process!

## What is a CSV file?
A CSV file is:
- a text file that uses a comma (or other delimiter) to separate values
- a file where data is ordered in rows rather than columns
- a file where the only potential metadata is a header row of column names - no type information for each column is specified
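
To make the last point concrete, here is a minimal sketch that uses Python's standard-library `csv` module (not Polars) on a made-up CSV string. The names `raw_text`, `header` and `records` are illustrative only. Every value comes back as a string because the file itself carries no type information.

```python
import csv
import io

# A made-up CSV: a header row followed by two data rows
raw_text = "name,age\nAda,36\nGrace,45\n"

# csv.reader splits each line on the delimiter and yields lists of strings
rows = list(csv.reader(io.StringIO(raw_text)))

header, records = rows[0], rows[1:]
print(header)   # column names taken from the header row
print(records)  # every value is a str, including the ages
```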

In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

We read this CSV file as we have read it many times before!

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

## CSV parsers

CSV files are text and so must be parsed to:
1. get the column names
2. split each row into columns
3. infer the dtype of each column

Polars has two engines to parse CSV files: 
- the default Polars parser
- the PyArrow parser

We can use the PyArrow parser with the `use_pyarrow` argument in `pl.read_csv`.

In my experiments the `Polars` built-in parser is faster and I recommend using it unless there is a specific need for PyArrow.

## Compressed CSV files
Polars can read gzip-compressed (`.gz`) CSV files but not `.bz` compressed CSV files.

To read `.bz` compressed CSV files use the PyArrow parser:
```python
pl.read_csv(csvFile, use_pyarrow=True)
```
## Header and column names
By default Polars takes the first row of a CSV as the header to set the column names.

### No header
If the first row is not a header we can set `has_header=False`, in which case the columns are named `column_1`, `column_2` and so on.

In [None]:
pl.read_csv(csvFile, has_header=False).head(2)

### Rename columns
We can rename columns immediately after the CSV is parsed with `new_columns`.

In this example we rename the first column to the lowercase `passengerid`

In [None]:
pl.read_csv(csvFile,new_columns=['passengerid']).head(2)

### Skip rows after the header
We can skip rows immediately after the header with `skip_rows_after_header`

In [None]:
pl.read_csv(csvFile,skip_rows_after_header=1).head(2)

### Skip rows to the header

If the header is on line `N` of the CSV we can set `skip_rows = N - 1`

In this example we set line 2 of the CSV as the header

In [None]:
pl.read_csv(csvFile,skip_rows=1).head(2)

If header names are duplicated (as with columns 7 and 8 of this example) Polars adds `_duplicated_0` to the column name

## Parsing CSVs
In the following examples we simulate CSV files with Python strings.

The newline `\n` character shows the line breaks in the simulated CSV file

The `b` before the start of the string converts the string to bytes so that it can be passed to `pl.read_csv`

In [None]:
CSVString = b"A,B,C\n0,1,2\n"
pl.read_csv(CSVString)

### Delimiter
Polars assumes the delimiter is a `,`. This can be changed with the `sep` argument.

In this example we have a CSV with tab (`\t`) separated data rather than comma-separated data

In [None]:
tabCSVString = b"A\tB\tC\n0\t1\t2\n"

pl.read_csv(tabCSVString,sep="\t")

### Comment lines

Comment lines that start with a certain character in the CSV are ignored by setting the `comment_char`

In [None]:
commentCSVString = b"a,b,c\n#Comment\n0,1,2\n"
pl.read_csv(commentCSVString,comment_char="#")

### Quotes
Quotes in the CSV are indicated with the `quote_char`. 

In this example the name field is quoted because it contains a comma

In [None]:
quoteCSVString = b'name,age\n"Armstrong,Neil",39\n'
pl.read_csv(quoteCSVString,quote_char='"')

## Inferring the dtypes
Polars needs to understand the dtype of each column in the CSV. To do this Polars:
- reads the first 100 lines
- if a dtype can be inferred it sets the dtype for that column
- if a consistent dtype cannot be inferred then a `ComputeError` exception is raised

### Number of rows to infer the dtypes
We can adjust the number of lines used for type inference.

In the Titanic CSV the `Age` column starts off with 57 integers before a decimal value in line 58.

If we set `infer_schema_length` lower than 58, Polars raises a `ComputeError` because it infers an integer dtype and then encounters a float on line 58 (check this by reducing the value here)

In [None]:
pl.read_csv(csvFile,infer_schema_length=58).head(2)

## Handling mixed types and exceptions
In this example we have a CSV file that will raise an exception as the values in the first column are:
- `1.0` which looks like a float and 
- `a` which is a string

Polars raises a `ComputeError` as it cannot reconcile a float and a string by default (we see how to address this below)

In [None]:
mixedTypeCSVFile = "../data/badCSV.csv"

This raises a `ComputeError` exception

In [None]:
## Uncomment and run to get the exception
# pl.read_csv(mixedTypeCSVFile)

### Specifying the dtypes
We can address the error by specifying the `dtypes` argument. 

In this example we read the mixed type CSV with the first column as a string dtype

In [None]:
pl.read_csv(mixedTypeCSVFile,dtypes={'col1':pl.Utf8})

### Ignore errors
We can also tell Polars to ignore errors, in which case values that cannot be cast to the dtype of their column are returned as `null`

In [None]:
pl.read_csv(mixedTypeCSVFile,ignore_errors=True)

## Set values to `null`
We might know that there are values in a column that are incorrect.

We set the value `b` in `col1` to `null` with `null_values`

In [None]:
pl.read_csv(mixedTypeCSVFile,null_values="b")

We can also pass a list of strings to `null_values`

In [None]:
pl.read_csv(mixedTypeCSVFile,null_values=["b"])

We can also specify different values to set as `null` for different columns

In [None]:
pl.read_csv(mixedTypeCSVFile,null_values={"col1":"b"})

## Performance of CSV parsing
### Number of threads
The CSV parser in Polars is multithreaded and uses the same number of threads as there are cores on your computer.

We can vary this with the `n_threads` argument. We can use fewer threads to reduce CPU usage or more threads to (potentially) reduce read time.

In experiments on my computer (8 cores) with different datasets, relative to the default number of threads:
- reducing `n_threads` to 1 increases time taken by 3x
- reducing `n_threads` to 3-4 increases time taken by 30%
- increasing `n_threads` to 40 reduces time taken by 30% on some datasets but makes no difference on others

Experiment with your own datasets if you want to reduce CPU usage or reduce read time.

### Memory usage
We can potentially reduce memory usage when reading a large CSV with `low_memory = True`. When reading a large CSV Polars reads the CSV into separate chunks in memory before combining the chunks into a `DataFrame` that is a single chunk in memory.

With `low_memory = True` Polars uses a slower non-parallel method of combining the chunks into a `DataFrame` that is a single chunk in memory.

## Exercises
In the exercises you will develop your understanding of:
- setting the column names of a CSV
- parsing a CSV
- setting the dtypes
- modifying the number of threads

### Exercise 1
In this exercise we want to parse the CSV strings to produce a `DataFrame` equal to the following

In [None]:
target = pl.DataFrame({"a":[1,2],"b":[3,4],"c":[5,6]})
target

Parse the CSV strings in the following cells

In [None]:
csvString = b"Data passed quality control 2020-01-01\na,b,c\n1,3,5\n2,4,6\n"
pl.read_csv(<blank>)

In [None]:
# Rename columns
csvString = b"A,B,C\n1,3,5\n2,4,6\n"
pl.read_csv(<blank>)

In [None]:
# Whitespace delimiter
csvString = b"a b c\n1 3 5\n2 4 6\n"
pl.read_csv(<blank>)

In [None]:
# Comment line
csvString = b"a,b,c\n#Data passed quality control 2020-01-01\n1,3,5\n2,4,6\n"
pl.read_csv(<blank>)

This time parse the CSV to produce a `DataFrame` with all columns as 64-bit floats

In [None]:
csvString = b"a,b,c\n#Data passed quality control 2020-01-01\n1,3,5\n2,4,6\n"
pl.read_csv(<blank>)

Find missing data in the CSV and replace with `null`. Ensure columns are not cast to string

In [None]:
csvString = b"a,b,c\n\n1,3,5\nNA,4,na\n"
pl.read_csv(<blank>)

### Exercise 2
Parse the NYC taxi CSV with:
- the default number of threads,
- one thread and
- 40 threads
to see if it affects performance.

In [None]:
nycCSVFile = "../data/nyc_trip_data_1k.csv"

In [None]:
%%timeit -n1 -r3
pl.read_csv(<blank>)

This dataset is too small to see any differences - try it with your own datasets to see if changing the number of threads affects performance

## Solutions

### Solution to exercise 1
In this exercise we want to parse the CSV strings to produce a `DataFrame` equal to the following

In [None]:
target = pl.DataFrame({"a":[1,2],"b":[3,4],"c":[5,6]})
target

Parse the CSV strings in the following cells

In [None]:
csvString = b"Data passed quality control 2020-01-01\na,b,c\n1,3,5\n2,4,6\n"
pl.read_csv(csvString,skip_rows=1)

In [None]:
csvString = b"A,B,C\n1,3,5\n2,4,6\n"
pl.read_csv(csvString,new_columns=["a","b","c"])

In [None]:
csvString = b"a b c\n1 3 5\n2 4 6\n"
pl.read_csv(csvString,sep=" ")

In [None]:
csvString = b"a,b,c\n#Data passed quality control 2020-01-01\n1,3,5\n2,4,6\n"
pl.read_csv(csvString,comment_char="#")

This time parse the CSV with all columns as 64-bit floats

In [None]:
csvString = b"a,b,c\n#Data passed quality control 2020-01-01\n1,3,5\n2,4,6\n"
pl.read_csv(csvString,comment_char="#",dtypes={"a":pl.Float64,"b":pl.Float64,"c":pl.Float64})

Find missing data and replace with `null`. Ensure columns are not cast to string

In [None]:
csvString = b"a,b,c\n\n1,3,5\nNA,4,na\n"
pl.read_csv(csvString,null_values={"a":"NA","c":"na"})

### Solution to exercise 2
Parse the NYC taxi CSV with:
- the default number of threads
- one thread
- 40 threads
to see if it affects performance.

Then try it with your own datasets to see if it affects performance

In [None]:
nycCSVFile = "../data/nyc_trip_data_1k.csv"

In [None]:
%%timeit -n1
pl.read_csv(nycCSVFile,n_threads=1)

In [None]:
%%timeit -n1
pl.read_csv(nycCSVFile)

In [None]:
%%timeit -n1
pl.read_csv(nycCSVFile,n_threads=40)