# 01 Basic classifier

This starter Jupyter notebook is a starting point for the basic classifier.

## Data exploration

Data exploration is a crucial first step in any data science project.
It involves understanding the data's structure, identifying patterns or trends, and detecting anomalies or missing values.

### Loading data

Typically, you can load a dataset using various methods in [Polars](https://docs.pola.rs/), such as normal loading and memory-mapped loading.
These methods work well for smaller datasets but can cause memory issues with very large datasets, leading to kernel crashes.
In such cases, chunked reading is a more efficient approach.

Normal loading reads the entire dataset into memory at once.
This method is straightforward but can be problematic with large datasets.

```python
import polars as pl

# Load the dataset normally; however, this may crash for large datasets.
data = pl.read_parquet("../../../data/train.parquet")
```

Memory-mapped loading can improve performance by mapping the file directly into memory, reducing the overhead of data copying.
However, it still requires sufficient memory to hold the entire dataset.

### Chunked reading

Chunked reading loads the dataset in smaller pieces, so memory usage stays bounded regardless of the file size.
Processing the data incrementally avoids memory overload and keeps data exploration and analysis scalable.
It has several advantages:

1. **Memory Management**: Large datasets can easily exceed the available memory, causing kernel crashes or significant slowdowns. Chunked reading mitigates this by only loading manageable portions of the data at a time.
2. **Scalability**: Chunked reading enables the handling of datasets that are much larger than the system’s memory capacity, making it a scalable solution for big data problems.
3. **Incremental Processing**: It allows for incremental processing of data, which can be useful for real-time data analysis and processing tasks.
4. **Flexibility**: You can perform operations on each chunk independently, which provides flexibility in data processing and can lead to more efficient computations.
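The incremental-processing idea in the list above can be sketched in plain Python, independent of any Parquet library. The helper `iter_chunks` and the generated `labels` are hypothetical names used only for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(rows: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most `chunk_size` items from `rows`."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical stand-in for a large column of 0/1 labels.
labels = (1 if i % 4 == 0 else 0 for i in range(10))

total_rows = 0
total_positive = 0
for chunk in iter_chunks(labels, chunk_size=4):
    # Only one chunk is held in memory at a time; the running totals
    # carry the result across chunks.
    total_rows += len(chunk)
    total_positive += sum(chunk)

print(total_rows, total_positive)  # 10 3
```

Any statistic that can be updated from a running state (counts, sums, minima, maxima) fits this pattern.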

Chunked reading involves reading a fixed number of rows (a chunk) from the dataset, processing that chunk, and then moving on to the next chunk.
This approach ensures that only a small portion of the dataset is loaded into memory at any given time.
To start, we import [`pyarrow.parquet`](https://arrow.apache.org/docs/python/index.html), the library we will use to read the Parquet files.

```python
import pyarrow.parquet as pq
```

With the library imported, we can open the training file as a [`ParquetFile`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html).
Opening the file reads only its metadata, not the rows themselves, so this step does not overwhelm the system's memory.

```python
path_train_data = "../../../data/train.parquet"
data_train = pq.ParquetFile(source=path_train_data)
```

For very large datasets, the `iter_batches` method of `pyarrow.parquet.ParquetFile` keeps memory usage under control.
It returns an iterator that yields the dataset as a sequence of `RecordBatch` objects instead of loading everything at once.
By default each batch holds up to 65,536 rows; pass `batch_size` to change this.
We create the batch iterator from `data_train`.

```python
data_train_gen = data_train.iter_batches()
```

To check that batch reading works, we retrieve and inspect the first batch of data.

```python
data_train_sample = next(data_train.iter_batches())
print(data_train_sample)
```

pyarrow.RecordBatch
id: int64
buildingblock1_smiles: string
buildingblock2_smiles: string
buildingblock3_smiles: string
molecule_smiles: string
protein_name: string
binds: int64
----
id: [0,1,2,3,4,5,6,7,8,9,...,65526,65527,65528,65529,65530,65531,65532,65533,65534,65535]
buildingblock1_smiles: ["C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21","C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21","C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21","C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21","C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21","C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21","C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21","C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21","C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21","C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21",...,"C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21","C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21","C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21","C#CC[C@@H](CC(=O)O)NC(=O)OCC1c2ccccc2-c2ccccc21"