# Some useful things

A collection of miscellaneous useful things!

# Parquet files: Better than CSV in (almost) every way

CSV files and plain text files are great.  I love them.  They're a universal way to share data, whether it's in CSV, JSON, YAML, or some other format.  I love being able to very quickly and easily look at the structure of a file using any number of standard tools.

...but, things like CSV and JSON have some big problems.  For one, the data always needs to be parsed, which can take a lot of time.  For two, the data is stored in _row-major_ order: as you read through the file one byte at a time, you're moving "left to right" through the rows.  This can make it hard to do something like "only load these three columns" with much efficiency, since you often need to read most of the row anyways to see where the columns start and stop.

Enter Parquet, a format that came out of the Hadoop ecosystem (Apache Arrow reads and writes it, but Arrow itself is an in-memory format) and is now supported by most modern data tools.  Parquet differs from CSV and Excel and JSON and so on in a few major ways:
- It is _column-major_ ordered.  As you iterate through the raw bytes of a file, you're moving down the columns, one column at a time (large files are split into chunks of rows called row groups, and this holds within each chunk).  There's a bit of metadata in the file, stored in a footer at the end, that says where each column starts and stops, so it's very easy and fast to only read in a few columns from a file that might have thousands.
- It is a _binary_ file format.  It can't be opened and read with a basic text editor.
- It has excellent support for _transparent compression_ using a wide range of compression standards.  You _can_ compress a CSV file, but I find it's usually not as fast as Parquet compression.
- It is way faster to read and write.  Most data structures like Pandas `DataFrame`s already store data as collections of columns--not rows--so less work is needed to reshape the data.
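
To make the row-major vs. column-major distinction concrete, here's a minimal pure-Python sketch.  The table values and variable names are made up for illustration, and real Parquet files layer row groups, encoding, and compression on top of this basic idea.

```python
# The same tiny (made-up) table, stored two ways.
rows = [  # row-major: one tuple per row, like lines in a CSV
    (2018, "A", 10),
    (2019, "B", 12),
    (2020, "C", 15),
]
columns = {  # column-major: one list per column, like Parquet
    "year": [2018, 2019, 2020],
    "name": ["A", "B", "C"],
    "jobs": [10, 12, 15],
}

# Row-major: pulling out one column means visiting every row.
years_from_rows = [row[0] for row in rows]

# Column-major: the column is already contiguous, so it's a single lookup.
years_from_columns = columns["year"]

# The order the values would appear in if each layout were written out flat.
row_major_order = [value for row in rows for value in row]
column_major_order = [value for col in columns.values() for value in col]

print(row_major_order)
print(column_major_order)
```

In the flat row-major order, the three `year` values are scattered across the whole sequence; in the column-major order, they sit next to each other at the front.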

Pandas and Dask both have built-in support for Parquet, as long as you have the PyArrow or fastparquet libraries installed.  PyArrow is the default that they will look for, but fastparquet is a much smaller installation.  I generally use PyArrow; I've had a few issues in the past getting fastparquet to work, but I think those have all been resolved now.  PyArrow is a bit more feature-rich and supports more compression algorithms, but the difference is honestly pretty minimal if you're only reading from/writing to Parquet files with Pandas.  Install either library with:

```bash
conda install pyarrow
# or
conda install fastparquet
```

Then just use the `read_parquet()` and `to_parquet()` functions in Pandas and Dask.  (I have PyArrow installed for the below examples).

The following cells show the speed and file size differences between CSV and Parquet, when Parquet is compressed with Zstandard.

In [1]:
# Get some data--this is a ~50 MB CSV file from the US Census Bureau.
import os

import pandas as pd

if not (os.path.isfile("parquet_demo.csv") and os.path.isfile("parquet_demo.parquet")):
    df = pd.read_csv("https://www2.census.gov/programs-surveys/bds/tables/time-series/bds2019_cty_fzc.csv")
    df.to_csv("parquet_demo.csv", index=False)
    df.to_parquet("parquet_demo.parquet", index=False, compression="zstd")

In [2]:
# File sizes
print(f"CSV file: {os.path.getsize('parquet_demo.csv') // 1_000:,}kb")
print(f"Parquet file: {os.path.getsize('parquet_demo.parquet') // 1_000:,}kb")

CSV file: 52,790kb
Parquet file: 16,590kb


In [3]:
# Load times--all columns
print("CSV load time (all columns):")
%time pd.read_csv("parquet_demo.csv")
print("\nParquet load time (all columns):")
%time pd.read_parquet("parquet_demo.parquet")
print()

CSV load time (all columns):
CPU times: total: 1.48 s
Wall time: 1.49 s

Parquet load time (all columns):
CPU times: total: 1.69 s
Wall time: 635 ms



In [4]:
# Load times--just two columns
print("CSV load time (2 columns):")
%time pd.read_csv("parquet_demo.csv", usecols=["year", "net_job_creation"])
print("\nParquet load time (2 columns):")
%time pd.read_parquet("parquet_demo.parquet", columns=["year", "net_job_creation"])
print()

CSV load time (2 columns):
CPU times: total: 391 ms
Wall time: 404 ms

Parquet load time (2 columns):
CPU times: total: 62.5 ms
Wall time: 40.7 ms



In [5]:
# Serialization time--if we don't specify a file, we get back the raw
# text/bytes that would be saved to file.
df = pd.read_parquet("parquet_demo.parquet")
print("CSV serialization time:")
%time _ = df.to_csv(index=False)
print("\nParquet serialization time:")
%time _ = df.to_parquet(index=False)

CSV serialization time:
CPU times: total: 3.02 s
Wall time: 3.02 s

Parquet serialization time:
CPU times: total: 1.38 s
Wall time: 1.36 s


In [None]:
# Rapid-fire compression comparison in PyArrow.
from time import time

# make the dataframe a lot bigger to emphasize the differences
# between algorithms.
_df = pd.concat((df for i in range(10)))

for (alg, level) in [
    ("snappy", None),
    ("gzip", 1),
    ("gzip", 3),
    ("gzip", 9),
    ("brotli", 9),
    ("lz4", 9),
    ("zstd", 1),
    ("zstd", 6),
    ("zstd", 19),
    (None, None)
]:
    save_start = time()
    serialized = _df.to_parquet(compression=alg, compression_level=level)
    save_end = time()
    
    # Write the bytes to disk so the load timing includes reading the file.
    with open("_testing.parquet", "wb") as out_file:
        out_file.write(serialized)

    load_start = time()
    pd.read_parquet("_testing.parquet")
    load_end = time()

    size = len(serialized) // 1_000

    if level is not None:
        print(f"{alg}, compression level {level}")
    elif alg is None:
        print("No compression")
    else:
        print(alg)
    print(f"\tSave time:  {save_end - save_start:.2f}s")
    print(f"\tLoad time:  {load_end - load_start:.2f}s")
    print(f"\tSaved size: {size:,}kb")
    print()
    
os.remove("_testing.parquet")

snappy
	Save time:  12.93s
	Load time:  5.87s
	Saved size: 186,501kb

gzip, compression level 1
	Save time:  92.70s
	Load time:  6.63s
	Saved size: 169,826kb

gzip, compression level 3
	Save time:  64.81s
	Load time:  6.34s
	Saved size: 165,906kb

gzip, compression level 9
	Save time:  28.35s
	Load time:  6.83s
	Saved size: 153,338kb

brotli, compression level 9
	Save time:  37.84s
	Load time:  6.07s
	Saved size: 118,255kb



# `black`: format your code

[`black`](https://github.com/psf/black) is a Python code formatter.  It reads your Python code, and applies a very particular and highly opinionated set of formatting conventions, but it does so without changing anything about _what your code does_ or _how it runs_.

I love `black`.  It is an extremely easy way to guarantee that your code follows some pretty sensible layout choices, making your code a lot easier to read!  It can't do things like pick better variable/function names for you, but I don't know of anything that can.

Install black with:

```bash
conda install black
```

Then run it from the command line with:

```bash
python -m black [your file].py
```
Black will then reformat your code and _overwrite_ the original file's contents.

Most editors, including PyCharm, let you add arbitrary commands to run.  I've got `black` configured to run in PyCharm with a single keystroke, and reformat everything in my project.  I use it constantly.

There are a few configuration options/settings you can use with `black`, but there are only two that probably matter:

```bash
-l [number]           How long a line should be before it wraps
-t [py39|py310|...]   Python version(s) to support
```

`-l` can usually be set to something around 90 (the default is 88).  `-t` should be set to whatever version(s) of Python your code needs to run on; if you leave it off, `black` tries to work out a sensible target on its own.

You can disable `black`'s formatting in some parts of your code using two "magic comments":
```python
# fmt: off
[code that will not be formatted]
# fmt: on

[code that will be formatted]
```
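
A typical reason to reach for these comments is hand-aligned data, which `black` would otherwise squeeze back into its standard spacing.  Here's a small sketch; the names `ROTATE_90` and `squares` are made up for illustration.  `# fmt: skip` is a related comment that protects a single line.

```python
# fmt: off
ROTATE_90 = [
    [ 0, -1],
    [ 1,  0],
]
# fmt: on

# Without the comments above, black would rewrite the rows as
# [0, -1] and [1, 0], losing the column alignment.

squares = [1,4,9]  # fmt: skip

print(ROTATE_90, squares)
```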

You can also manually put a comma at the end of the last item in a list, tuple, dictionary, function arguments, or anywhere else you use commas.  `black` will see that, and will take that as a signal to force every item in that list, tuple, etc. to be on its own line.  Not adding a final comma lets `black` decide whether to put everything on its own line or not.

```python
# before black
my_list = [1, 2, 3, 4,]

# after black
my_list = [
    1,
    2,
    3,
    4,
]
```

(Taking away the final comma in the above example and re-running black will put everything back on one line).

Here's a quick look at a more extensive before and after.

```python
# Before black.  You should be ashamed if you write code this ugly.
def   foo (x,y= 10,
        z = 0
           ):
    if isinstance(x, int) and isinstance(y, int) and isinstance(z, int) and x % 2 == 0 and y % 2 == 0 and z % 2 == 0 and x > 0 and y > 0 and z > 0:
        print(
            "All arguments"
            " are even positive integers"
        )

    else: print("At least one argument is negative, non-integer, or not even")

# After black.  Much nicer!
def foo(x, y=10, z=0):
    if (
        isinstance(x, int)
        and isinstance(y, int)
        and isinstance(z, int)
        and x % 2 == 0
        and y % 2 == 0
        and z % 2 == 0
        and x > 0
        and y > 0
        and z > 0
    ):
        print("All arguments" " are even positive integers")

    else:
        print("At least one argument is negative, non-integer, or not even")
```

# `isort`: sort your imports!

`isort` is like `black`, but it serves only one purpose: it sorts your import statements.  The general convention in Python is to sort imports as follows:
- All standard library imports, alphabetized by module name.
- All third-party imports, alphabetized by module name.
- All imports of other files in your program, alphabetized by module/program name.

Conventionally, there's also a single blank line between each of these blocks.

Install `isort` with:

```bash
conda install isort
```

Like with `black`, run it from the command line (or as an action in your editor):
```bash
python -m isort [your file].py
```

I use `isort` and `black` together all the time.  I find you get the nicest results if you run `isort` first, then `black`; that's actually what my keybinding in PyCharm is set up to do.

# `chime`: play alert sounds

Here's something I never knew I wanted until I heard about it: [`chime`](https://github.com/MaxHalford/chime).  All it does is give you a few functions that, when called, play short alert sounds.  That's it.  But, this is surprisingly useful: scatter a few of these functions throughout your code and you can get auditory updates!

E.g.: put `chime.success()` at the very end of your code, and you'll get a little sound when it finishes running.  Or put `chime.error()` somewhere inside a `try-except` block.  (Or the reverse.  The name of the function just says what sound it plays.  There's no reason you can't use the error sound to indicate success!).
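
One way to wire that up is a small wrapper around whatever long-running task you care about.  This is only a sketch: `run_with_sounds` is a made-up helper, not part of `chime`, and the `ImportError` fallback is there purely so the snippet still runs silently on a machine without `chime` installed.

```python
try:
    import chime

    play_success, play_error = chime.success, chime.error
except ImportError:
    # Fallback so the sketch still runs without chime: stay silent.
    play_success = play_error = lambda: None


def run_with_sounds(task):
    """Run task(), playing the error sound on failure and the success sound otherwise."""
    try:
        result = task()
    except Exception:
        play_error()
        raise  # still let the caller see the original exception
    play_success()
    return result


print(run_with_sounds(lambda: sum(range(1_000))))
```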

Install `chime` with:

```bash
conda install -c conda-forge chime
```

Then use it like so:

In [1]:
import time
import chime

# List the available themes
print(chime.themes())

# I like the Big Sur theme, so change to it.
chime.theme("big-sur")

chime.info()
for i in range(10):
    time.sleep(0.5)
chime.success()

['big-sur', 'chime', 'mario', 'material', 'zelda']


(You may not be able to hear the chime unless you run the above snippet locally on your own machine).  There are also functions to play your own custom sounds instead.  Definitely a neat little library!

# `TPOT`: Genetic programming for ML tuning

I'm only going to _barely_ scrape the surface here.  Genetic programming is a huge and fascinating topic.  But, in brief: genetic programming is one approach to optimizing systems, e.g. machine learning pipelines, using techniques based on analogies to natural selection and evolution.  You start with a "generation" of candidate solutions (e.g., data processing pipelines, or models), usually randomly initialized.  You then see how well each member of the generation does at a given task (e.g., you check their predictive accuracy on some data after fitting them).  All but the top few candidates "die," and the next generation of candidates is made by generating random "mutations" of candidates in the previous generation, and/or by mixing-and-matching pieces from different candidates.  You keep running this process until you've either got a good enough set of candidates, or you've expended all your compute resources.

Genetic programming can find some extremely weird, but effective, solutions.  The downside is that it is _absurdly_ slow.  It's common to see time limits placed on genetic programming models, e.g., "run for 10 minutes, and then give me whatever the best candidate is at that point in time."
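
To make the generate/evaluate/select/mutate loop concrete, here's a toy sketch in plain Python.  Strictly speaking it's a simple genetic _algorithm_ over fixed-length bit strings, not genetic programming over whole pipelines, and it has nothing to do with how TPOT is implemented.  The task (evolve a list of all ones), the function names, and every parameter value are made up for illustration.  It also includes the kind of time limit described above.

```python
import random
import time

TARGET = [1] * 20  # toy task: evolve a bit string of all ones


def fitness(candidate):
    """Number of positions where the candidate matches the target."""
    return sum(1 for bit, goal in zip(candidate, TARGET) if bit == goal)


def mutate(candidate, rng, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in candidate]


def crossover(a, b, rng):
    """Mix two parents: the front of one plus the back of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(population_size=30, survivors=5, max_generations=50,
           time_limit_s=5.0, seed=42):
    rng = random.Random(seed)
    # Generation zero is randomly initialized.
    population = [[rng.randint(0, 1) for _ in TARGET]
                  for _ in range(population_size)]
    deadline = time.monotonic() + time_limit_s
    history = []  # best fitness seen in each evaluated generation

    for _ in range(max_generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Stop on a perfect candidate or when the time budget runs out.
        if history[-1] == len(TARGET) or time.monotonic() > deadline:
            break
        # All but the top few candidates "die"...
        elite = population[:survivors]
        # ...and the next generation is bred from the survivors.
        children = []
        while len(children) < population_size - survivors:
            a, b = rng.sample(elite, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = elite + children

    return max(population, key=fitness), history


best, history = evolve()
print(f"Best fitness: {fitness(best)} / {len(TARGET)}")
print(f"Best fitness per generation: {history}")
```

Because the survivors are carried over unchanged, the best score never gets worse from one generation to the next; it just may take a while to improve.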

TPOT is a Python library that uses genetic programming to find good pipelines for predictive modeling.  It's got built-in support for a huge range of models: scikit-learn, XGBoost, Dask, PyTorch (for neural networks), and more.  Install TPOT with:

```bash
conda install -c conda-forge tpot
```

Or, install TPOT and most of its optional dependencies with:

```bash
conda install -c conda-forge tpot xgboost dask dask-ml scikit-mdr skrebate
```

Then just import `TPOTClassifier` or `TPOTRegressor` and treat them like any standard scikit-learn model.  Maybe don't do cross-validation with them, since they're designed to do that themselves--just create and fit them.

Here's one of the TPOT example programs that finds a classification pipeline for the scikit-learn "digits" dataset (a set of hand-written digits):

In [2]:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    train_size=0.75,
    test_size=0.25,
    random_state=42
)

# Here's the TPOT part of the code.
# This will run for about an hour.
tpot_clf = TPOTClassifier(
    generations=5,        # 5 generations...
    population_size=50,   # ...each generation having 50 candidates.
    verbosity=2,
    random_state=42,
    max_eval_time_mins=1, # max 1 min to evaluate each pipeline--for sanity's sake
    n_jobs=10,            # my computer has 12 threads, so check 10 models at once
)
tpot_clf.fit(X_train, y_train)
print(tpot_clf.score(X_test, y_test))
# Export the model to a .py file for easy re-use.
tpot_clf.export('tpot_digits_pipeline.py')



Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.9844058928817294

Generation 2 - Current best internal CV score: 0.9844058928817294

Generation 3 - Current best internal CV score: 0.9866363761531047

Generation 4 - Current best internal CV score: 0.9866363761531047

Generation 5 - Current best internal CV score: 0.9873743632107945

Best pipeline: KNeighborsClassifier(Normalizer(input_matrix, norm=l2), n_neighbors=2, p=2, weights=distance)
0.9866666666666667




This is a bit of a toy example--it's a very easy dataset to get high accuracy on, and the accuracy hit 98.4% in the first generation--but if you have a lot of time and/or compute power to throw at a problem, TPOT can often find really, _really_ good solutions that might be hard to come up with on your own.

For reference, here's what the final pipeline looked like when I ran this myself.  It might look different if you re-run this notebook.  (This is what got saved to the "tpot_digits_pipeline.py" file by the `tpot_clf.export()` call in the above code).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9873743632107945
exported_pipeline = make_pipeline(
    Normalizer(norm="l2"),
    KNeighborsClassifier(n_neighbors=2, p=2, weights="distance")
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
```