# Some miscellaneous useful libraries

This is a short collection of miscellaneous libraries I've found very useful for data work.  A lot of these are pretty "bite sized" libraries that solve one or two very specific problems, but which can make your life so much easier.  Others are extremely deep and detailed libraries, from which you can borrow just one or two quick things to get a huge boost to your code's speed, your productivity, or whatever else.

# tqdm: Progress bars!

Install with:

```bash
conda install tqdm
```

`tqdm` gives you progress bars when iterating through something.  This is immensely helpful if you've got a loop that will run over a lot of data, e.g., looping through lines in a very large file.  It has two important functions in it:

- `tqdm.tqdm()`: wraps any iterable object and provides progress bars as you iterate through it.
- `tqdm.trange()`: shortcut for `tqdm.tqdm(range(...))`.

If you're using Jupyter notebooks, you can import these from `tqdm.notebook` for prettier printing.  Otherwise, import them from `tqdm` directly.  `tqdm` also integrates wonderfully with Pandas, which we'll see next month.

In [1]:
from tqdm import tqdm, trange
from time import sleep # sleep(n) -> pause the program for n seconds

for i in trange(10):
    sleep(1)
    
numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for i in tqdm(numbers):
    sleep(1)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:10<00:00,  1.01s/it]
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:10<00:00,  1.01s/it]


In [2]:
# Same code, just using the tqdm.notebook implementations
from tqdm.notebook import tqdm, trange
from time import sleep

for i in trange(10):
    sleep(1)
    
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for i in tqdm(numbers):
    sleep(1)

  0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/9 [00:00<?, ?it/s]

You can also add some nice decorations to the progress bars, like descriptions, and turn on unit scaling:

In [3]:
for i in trange(10_000_000, desc="A big loop!"):
    pass

for i in trange(10_000_000, desc="A big loop!", unit_scale=True):
    pass

A big loop!:   0%|          | 0/10000000 [00:00<?, ?it/s]

A big loop!:   0%|          | 0.00/10.0M [00:00<?, ?it/s]

You can manually control when the progress bar updates.  Note that if you don't give `tqdm.tqdm()` an iterable, or you give it one that doesn't have a known length, it will only show the number of iterations completed.  It won't print estimated time to completion information.  If you know the total number of iterations ahead of time, you can pass `total=` to force it to use that value.

In [4]:
with tqdm(desc="Odd numbers found") as pbar:
    for i in range(1000):
        if i % 2 == 1:
            pbar.update(1)
            sleep(0.001)

Odd numbers found: 0it [00:00, ?it/s]
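
When the total is known ahead of time, passing `total=` brings back the percentage and time estimate even for an iterable with no length.  A minimal sketch, assuming a hypothetical generator `read_records()` that stands in for something like a file reader:

```python
from tqdm import tqdm

def read_records(n):
    """Hypothetical data source: yields n records, but has no len()."""
    for i in range(n):
        yield i

n_records = 1000
records_seen = 0

# Without total=, tqdm could only count iterations for a generator.
# With total=, it can show a percentage and an estimated time remaining.
for record in tqdm(read_records(n_records), total=n_records, desc="Records"):
    records_seen += 1

print(records_seen)
```

If the value passed to `total=` turns out to be wrong, the loop still runs to completion; only the displayed percentage and estimate will be off.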

Lastly, you can have multiple progress bars going at once, e.g. to track different things.

In [5]:
with tqdm(desc="Numbers checked", position=0, total=1000) as pbar1, tqdm(desc="Odd numbers found", position=1) as pbar2:
    for i in range(1000):
        pbar1.update(1)
        if i % 2 == 1:
            pbar2.update(1)
        sleep(0.001)

Numbers checked:   0%|          | 0/1000 [00:00<?, ?it/s]

Odd numbers found: 0it [00:00, ?it/s]

# `timeit`: How fast is my code?

You've seen some examples already of me using the `timeit` module from the standard library, but let's dive into what it actually does.  It's pretty simple: you give it a piece of code, it runs it a bunch of times, and it tells you how long it took.  `timeit` is mostly intended for running a small piece of code a lot of times, but you can easily change this.

The basic recipe for timeit is:

In [6]:
from timeit import timeit
print(timeit("100 % 3"))

0.010030700000001502


Note that we pass `timeit.timeit()` a string, which gets interpreted as Python code and executed.  (It also accepts a function that takes no arguments, but strings are the usual way to time a bare expression.)  We can't pass expressions as arguments to functions--Python will always try to evaluate them and pass the result--so expressions have to be passed as strings.  It's a bit clunky, especially compared to languages like Lisp or Julia, but it allows other elements of Python (mostly on the backend, in the language's implementation) to be much simpler and more secure.

By default, timeit will:
- Run the snippet 1,000,000 times.
- Assume that any variables referenced in the snippet are also defined in the snippet.

Both of these behaviors can be changed.
- Change the number of times the code runs with the `number=` argument.
- Tell `timeit.timeit()` to use variables defined elsewhere in your code with the `globals=` argument.

The `globals=` argument needs to be given a `dict`ionary containing `{"variable_name": value}` pairs.  E.g., `{"x": 10}` will tell `timeit.timeit()` "when you see the variable `x` in the snippet, it has the value 10 (unless it gets re-defined in the snippet)."  There's a built-in Python function, `globals()`, which will give you a dictionary of anything defined *in global scope* in your current program.  We're not going to worry too much about the details of what that means, but just know that if you're using `timeit.timeit()` on a line that isn't indented at all, you should be able to pass `globals=globals()` and have everything just work.

In [7]:
x = [1,2,3,4,5,6,7,8,9,10]
# print(timeit("5 in x")) # NameError: name 'x' is not defined
print(timeit("5 in x", globals=globals())) # 'x' is now the value just defined
print(timeit("5 in x", globals=globals(), number=1000)) # only run the snippet 1,000 times

# Multi-line strings are useful with timeit.timeit() for longer snippets.
# But be careful--the string needs to be all the way at the zero-indent level,
# or you'll get weird errors about unexpected indentation.  This makes your code
# look kind of ugly, unfortunately.
print(timeit(
"""
if 5 in x:
    found = True
else:
    found = False
""",
    globals=globals(),
    number=1000
))

0.06899919999999327
6.820000000118398e-05
7.229999999935899e-05
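
Two variations that may be handy, sketched here on the assumption that you'd rather not pass `globals()` wholesale: `timeit.timeit()` also accepts a function that takes no arguments in place of a string, and `timeit.repeat()` runs the whole measurement several times so you can take the fastest run.  The helper name `contains_five` is made up for illustration.

```python
from timeit import repeat, timeit

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def contains_five():
    return 5 in x

# Pass a zero-argument function instead of a string.  No globals= is needed,
# because the function already knows where to find x.
callable_time = timeit(contains_five, number=1000)

# Pass only the names the snippet needs, as an explicit dictionary.
# repeat() returns a list with one total time per repetition (3 here).
run_times = repeat("5 in x", globals={"x": x}, repeat=3, number=1000)

print("callable:", callable_time)
print("best of 3:", min(run_times))
```

Taking the minimum of several repetitions is a common way to reduce noise from whatever else your computer happens to be doing at the time.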


# Profiling your code: Why is my *program* slow?

`timeit` is awesome for quick testing of small snippets.  If you're writing your code and you need to know something like "will it be faster to use integer division here, or do regular division and then convert to an integer?" then `timeit` is your friend.  But, if you already have a program, and it takes a long time to run, and you want to know what parts are slow (and thus, what parts you should focus on speeding up), `timeit` won't do anything for you: you'll need to turn to a *profiler*.

A profiler, in the programming world, is just a program that:

1. Runs your program.
2. Tracks how long your program spent doing each thing it does.
3. Tells you how long each part took to run.

They're a bit cumbersome to use at times, but they are the best way to actually figure out why your program is slow and how you can make it go faster.  Python has two profilers in the standard library: `cProfile`--the one you should generally use--and `profile`--the one you should only use if you're trying to customize how the profiler behaves (which you probably won't ever need to do).

In [8]:
import cProfile
import time

# a function we want to find the slow spots for.
# we'll add a time.sleep() command to artificially increase
# the amount of time it takes to run.
def my_function(x):
    time.sleep(5)
    return x >= 10
    

# like with timeit: give `cProfile.run()` a string to be executed.
# it prints the report itself and returns None, so there's no need to print() it.
cProfile.run("my_function(10)")

         5 function calls in 4.996 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.996    4.996 1390719897.py:7(my_function)
        1    0.000    0.000    4.996    4.996 <string>:1(<module>)
        1    0.000    0.000    4.996    4.996 {built-in method builtins.exec}
        1    4.996    4.996    4.996    4.996 {built-in method time.sleep}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


There is a *lot* you can customize about how the profiler runs.  You can save the results to a file and then process them with the `pstats` module in the standard library.  You can also run the profiler from the command line if you want to profile an entire program:

```bash
python -m cProfile [-o output_file] my_program.py
```

There's a lot to the Python profilers; check the documentation for more details.
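
As a rough sketch of the save-then-inspect workflow mentioned above: profile a function, dump the results to a file, then load that file with `pstats` and sort by cumulative time.  The functions `slow_part()`, `fast_part()`, and `my_program()` and the file name `profile.out` are placeholders invented for this example.

```python
import cProfile
import io
import os
import pstats
import tempfile
import time

def slow_part():
    time.sleep(0.05)  # stand-in for the slow bit of a real program

def fast_part():
    return sum(range(100))

def my_program():
    slow_part()
    fast_part()

# Profile just the call we care about.
profiler = cProfile.Profile()
profiler.enable()
my_program()
profiler.disable()

with tempfile.TemporaryDirectory() as tmp_dir:
    stats_path = os.path.join(tmp_dir, "profile.out")
    profiler.dump_stats(stats_path)  # same format that `-o output_file` writes

    # Load the saved stats and write a report into a string buffer.
    buffer = io.StringIO()
    stats = pstats.Stats(stats_path, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 rows only

report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that account for the most total run time (including everything they call) at the top, which is usually where you want to start looking.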

# Numba: Speedy speedy math

As fast as Numpy is for doing math, it's still not as fast as we could get.  The fastest possible code would be *compiled* ahead of time down to machine code.  This requires a *compiler*: a program that reads the code we write, "translates" it into the 1s and 0s that the computer understands, and usually performs a lot of extra optimizations along the way to get us faster code with the same outputs.  Unfortunately, Python is an interpreted language--it gets executed one line at a time, with very little optimization--which makes writing it easier in some cases, but at the cost of speed.  Even Numpy is still interpreted.  Numpy is using a lot of compiled C code--which is very fast--but there's still Python interpretation that needs to happen.

Enter Numba: a library that will compile your functions for you, using something called *just in time* (or JIT) compilation.  The "just in time" part isn't something we need to worry about, though.  Basically, Numba gives you free extra speed.  Sometimes a *lot*, with some tweaks to your code.

Install Numba with:

```bash
conda install numba
```

Note: at the time of writing, the available versions of Numba require a Numpy version no later than 1.20, while the Numpy versions you can get through Conda are generally newer.  So you may need to use the command:

```bash
conda install numba "numpy<=1.20"
```

Then import the `jit` and `njit` decorators.  `njit` is generally faster, but doesn't work on as much Python code (it requires things to be more directly translatable to simpler languages, but it can generate much faster code if that's the case).  `jit` is more flexible, but not always as fast.

`jit`/`njit` are what are called *decorators.*  This is a pretty complex Python concept that I've intentionally not covered.  It's closely related to the idea of a *closure* in general programming/computer science, but for our purposes, we're going to treat it like a little bit of magic and not worry about what's going on.  Basically: we write `@jit`/`@njit` on the line right before the function we want to speed up.  There are a few arguments we can pass to the decorators, e.g., `@njit(nogil=True)`.  Leaving off the parentheses uses all the default settings.

In [9]:
from numba import jit, njit
import numpy as np
from timeit import timeit

def dot_product(x, y):
    total = 0
    for i in range(len(x)):
        total = total + x[i] * y[i]
    return total

# can also use @njit here--this code doesn't benefit much from it
@jit
def dot_product_numba(x, y):
    total = 0
    for i in range(len(x)):
        total = total + x[i] * y[i]
    return total

rng = np.random.default_rng()
x = rng.random(size=1000)
y = rng.random(size=1000)

print(
    "Dot product, 10,000 times, normal Python:",
    timeit("dot_product(x, y)", globals=globals(), number=10000)
)

print(
    "Dot product, 10,000 times, Numba acceleration:",
    timeit("dot_product_numba(x, y)", globals=globals(), number=10000)
)

Dot product, 10,000 times, normal Python: 2.3463404999999966
Dot product, 10,000 times, Numba acceleration: 0.5256425999999976


In the above example, `np.dot()` would actually be faster for this specific use case--but just adding one decorator to your code gets you a free speedup!

Numba has a lot of depth to it--options for GPU acceleration, using either NVidia or AMD GPUs; very fast, low-level parallelization; and more.  Using all those tricks can actually get your Python code *almost as fast as really good C code*--at the expense of some non-trivial re-writes to your program.  

# Cupy: Numpy on your GPU

If you have an NVidia GPU, you can install the `cupy` library, which gives you Numpy arrays, but on your GPU.  This has some tradeoffs compared to normal Numpy:

- Cupy is generally a drop-in replacement for Numpy *in your code*--they try very hard to re-implement as much of the Numpy API as possible, so that it magically "just works" on the GPU.  If you try to pass a Cupy array to a library that expects Numpy arrays, it might throw an error.
    - Some changes to recent versions of Numpy are *incompatible* with Cupy, e.g., using `np.random.default_rng()`.
    - Some functions' positional arguments may also be different from Numpy, but passing function arguments by name usually solves this.
- Cupy will generally be faster than Numpy for the same operations, especially massively parallelizable ones like matrix multiplication.
- Because Cupy is running on your GPU, you will be limited by *your GPU's available RAM*, which will almost always be less than your main system RAM.
    - You can get computers/servers/cloud instances with hundreds of gigabytes of RAM.  That much memory might cost a few hundred dollars (or more), but that's about what you'd pay for one or two decent GPUs, each of which might have only 12-16 GB of RAM.  So if you happen to have really, really big arrays, you probably need to use Numpy.
- Cupy is really designed for NVidia GPUs, and may not run well on AMD GPUs.  I don't think it'll run at all on Intel graphics.
    - There is [experimental support for AMD GPUs](https://docs.cupy.dev/en/v11.0.0b2/install.html#using-cupy-on-amd-gpu-experimental), but since it's only experimental, it may be prone to crashes/bugs/performance issues at this point in time.  (I don't have access to an AMD GPU to test with).
- As is generally true for GPU programming, getting data to and from the GPU is the hardest part.  Generally, you'll want to move as many things to the GPU (i.e., into Cupy arrays) as you can, do all your array logic there, and then move them back.  Try to make sure you're only operating on Cupy arrays using other Cupy arrays to avoid the CPU-to-GPU data transfer bottleneck.

GPUs excel at *matrix multiplication* and *dot products of very large arrays*--so if you happen to be doing matrix multiplication, expect *huge* speedups.  A lot of other array operations are faster on GPUs too, but matrix multiplication is the most extreme and noteworthy example.

Installing Cupy requires installing the `cudatoolkit` package through conda.  This package is required to run general-purpose compute on your NVidia GPU, and it will require that you have a current NVidia driver installed *system-wide* (you can't install NVidia drivers through conda--or at least, you can't do that and expect them to work well).

```bash
conda install cudatoolkit cupy
```

From there, just replace Numpy with Cupy in your code.

In [10]:
from timeit import timeit
import cupy as cp
# cupy does not support numpy's `np.random.default_rng()` interface.
# it does follow Numpy's older--and still supported--
# `np.random` interface.
cp_x = cp.random.random(size=(1000, 1000))
print(
    "Cupy, 1000x1000 matrix multiplication, 100 times: ",
    timeit("y = cp_x @ cp_x", globals=globals(), number=100)
)

import numpy as np
np_rng = np.random.default_rng()
np_x = np_rng.random(size=(1000, 1000))
print(
    "Numpy, 1000x1000 matrix multiplication, 100 times: ",
    timeit("y = np_x @ np_x", globals=globals(), number=100)
)



Cupy, 1000x1000 matrix multiplication, 100 times:  0.5765803999999974
Numpy, 1000x1000 matrix multiplication, 100 times:  51.27477349999998


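Dang.  That's a nice speedup.  (One caveat: Cupy hands work to the GPU asynchronously, so `timeit` can stop the clock before the multiplications have actually finished.  The true gap is probably smaller than these numbers suggest.)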

# Prettier printing with `pprint` and `icecream`

Consider the following dictionary, which is pretty messy and deeply nested:

In [11]:
my_dict = {"name":"Henry","favorite languages":{"python":{"rank":"2","proficiency":"high"},"julia":{"rank":"1","proficiency":"medium"},"haskell":{"rank":"3","proficiency":"medium-low"}}, "degrees":[["Rice", "Physics", "BA"], ["Rice", "Linguistics", "BA"], ["UT Arlington", "Linguistics", "MA"]]}

If we just print it out, we get a hard-to-read mess.

In [12]:
print(my_dict)

{'name': 'Henry', 'favorite languages': {'python': {'rank': '2', 'proficiency': 'high'}, 'julia': {'rank': '1', 'proficiency': 'medium'}, 'haskell': {'rank': '3', 'proficiency': 'medium-low'}}, 'degrees': [['Rice', 'Physics', 'BA'], ['Rice', 'Linguistics', 'BA'], ['UT Arlington', 'Linguistics', 'MA']]}


We would like to see this formatted a bit more nicely.  There are two major options.  (Well, three, if you consider that this can be converted to JSON and formatted using the `json` library--but we're focused on more general ways to format things nicely.)  The first is to use the `pprint` module in the standard library.  This library "pretty-prints" (hence the name) data structures you pass to the `pprint.pprint()` function.

In [13]:
import pprint
pprint.pprint(my_dict)

{'degrees': [['Rice', 'Physics', 'BA'],
             ['Rice', 'Linguistics', 'BA'],
             ['UT Arlington', 'Linguistics', 'MA']],
 'favorite languages': {'haskell': {'proficiency': 'medium-low', 'rank': '3'},
                        'julia': {'proficiency': 'medium', 'rank': '1'},
                        'python': {'proficiency': 'high', 'rank': '2'}},
 'name': 'Henry'}


If you need to get the pretty-formatted string, because you want to do something with it later, you can use `pprint.pformat()`.

In [14]:
formatted = pprint.pformat(my_dict)
print(formatted)

{'degrees': [['Rice', 'Physics', 'BA'],
             ['Rice', 'Linguistics', 'BA'],
             ['UT Arlington', 'Linguistics', 'MA']],
 'favorite languages': {'haskell': {'proficiency': 'medium-low', 'rank': '3'},
                        'julia': {'proficiency': 'medium', 'rank': '1'},
                        'python': {'proficiency': 'high', 'rank': '2'}},
 'name': 'Henry'}


There are a lot of things you can control with the pretty-printing tools in Python--you can read the `pprint` documentation for details--but I find that just using `pprint.pprint()` usually gets the job done.
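
A few of those controls, sketched on a trimmed-down copy of the dictionary above (the trimming is just to keep the example short).  The `depth=`, `width=`, and `sort_dicts=` arguments are part of `pprint`; the `json.dumps()` call shows the JSON route mentioned earlier for comparison.

```python
import json
import pprint

my_dict = {
    "name": "Henry",
    "favorite languages": {
        "python": {"rank": "2", "proficiency": "high"},
        "julia": {"rank": "1", "proficiency": "medium"},
    },
}

# depth=1 collapses anything nested deeper than the top level into "{...}".
shallow = pprint.pformat(my_dict, depth=1)

# sort_dicts=False keeps the keys in insertion order instead of sorting them;
# width= sets the maximum line length pprint aims for.
unsorted = pprint.pformat(my_dict, sort_dicts=False, width=60)

# The JSON route: only works for JSON-compatible data (dicts, lists, strings, ...).
as_json = json.dumps(my_dict, indent=2)

print(shallow)
print(unsorted)
print(as_json)
```

`depth=` is especially useful when you only want to see the overall shape of a big nested structure without scrolling through all of its contents.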

There is also a third-party library, `icecream`, which has a function in it named `ic()`.  This function behaves like pretty-printing, but it's more designed for debugging and monitoring your program.  Install icecream with:

```bash
conda install -c conda-forge icecream
```

(`icecream` is not available in the main `conda` channel--you need to grab it from `conda-forge`)

In [15]:
!conda install --yes -c conda-forge icecream

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [16]:
from icecream import ic
ic(my_dict)

ic| my_dict: {'degrees': [['Rice', 'Physics', 'BA'],
                          ['Rice', 'Linguistics', 'BA'],
                          ['UT Arlington', 'Linguistics', 'MA']],
              'favorite languages': {'haskell': {'proficiency': 'medium-low', 'rank': '3'},
                                     'julia': {'proficiency': 'medium', 'rank': '1'},
                                     'python': {'proficiency': 'high', 'rank': '2'}},
              'name': 'Henry'}


{'name': 'Henry',
 'favorite languages': {'python': {'rank': '2', 'proficiency': 'high'},
  'julia': {'rank': '1', 'proficiency': 'medium'},
  'haskell': {'rank': '3', 'proficiency': 'medium-low'}},
 'degrees': [['Rice', 'Physics', 'BA'],
  ['Rice', 'Linguistics', 'BA'],
  ['UT Arlington', 'Linguistics', 'MA']]}

Note the red text in the above output--that's the part you usually care about.  In PyCharm or anything other than Jupyter, the output generally looks nicer, including supporting colors!

# `dbm`: easy interfaces with simple DBM-style databases

[DBM](https://en.wikipedia.org/wiki/DBM_(computing)) databases are very old, very simple, and very useful.  They're extremely simple forms of key-value storage (like even simpler JSONs): all keys and values must be strings (or bytes) when accessing a database via Python.  DBM databases aren't as fault-tolerant, robust, or flexible as other database options, so their use cases are fairly limited these days, especially with SQLite (see a bit further down in this notebook) being as easy to use, similarly fast, and way more flexible and widely available.  That said, sometimes the simplicity of a DBM database is all you need.

The Python interface to DBM is *very* simple, and mostly looks like accessing a dictionary.  Sometimes this simplicity is all you need, which can make `dbm` pretty useful in spite of its limitations. (other database interfaces require more code and more complexity).  The only tricky-ish part is that the file modes are a bit different than normal files; check the `dbm` documentation in Python for details.

In [17]:
import dbm
# "c" mode -> open in read+write mode, creating the database if it doesn't exist yet
with dbm.open("my_dbm_database.db", "c") as my_db:
    my_db["Henry"] = "Data scientist"
    # this will throw an error
    # my_db["Number"] = 100
    
# there is now a DBM database stored in `my_dbm_database.db`
# which can be re-loaded and accessed later.
with dbm.open("my_dbm_database.db", "r") as my_db:
    # note that text stored in these databases gets converted to bytestrings
    print(my_db["Henry"])

b'Data scientist'
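
Because everything comes back as bytes, reading usually involves a decode step, and anything that isn't a string has to be converted on the way in.  A small sketch of that round trip, using a temporary directory and a made-up database name (`example_db`) so it doesn't touch the file created above:

```python
import dbm
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "example_db")

    with dbm.open(db_path, "c") as db:
        db["Henry"] = "Data scientist"
        db["Number"] = str(100)  # convert non-strings before storing them

    with dbm.open(db_path, "r") as db:
        job = db["Henry"].decode("utf-8")           # bytes -> str
        number = int(db["Number"].decode("utf-8"))  # bytes -> str -> int
        missing = db.get("Nobody", b"unknown")      # dict-style default for absent keys
        keys = sorted(key.decode("utf-8") for key in db.keys())

print(job, number, missing, keys)
```

For anything more structured than single strings or numbers, one option is to store `json.dumps(value)` and run `json.loads()` on the way back out.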


# sqlite3: Easily work with local SQL databases

Note: if you don't have any experience with SQL, this library won't be any use to you--but SQL is extremely easy to learn.  It's definitely something you *should* learn, but it goes beyond the scope of what we're covering in these workshops.

Most SQL databases rely on a *client-server* architecture: you have one program running that hosts the database and runs queries, and a separate program that lets the user write queries.  These are generally on different machines: the server is running on, well, a server somewhere, and the client program is running on your machine, sending your queries to the server to be executed.  This is great for something like a central database for a company, school, or department.

But, SQL is *super* convenient for writing all sorts of data queries.  [SQLite](https://www.sqlite.com/index.html) is a variant of SQL that makes it easy to use SQL in your own projects without needing a whole, dedicated database.  SQLite differs from other SQL implementations in a few important ways:
- It is *self-contained.*  There is no server and client.  SQLite reads and writes databases to and from single files on your computer.
- It is *zero-configuration.*  Most SQL server implementations need a fair bit of configuration before you can use them, both on the server and client end.  SQLite requires nothing other than pointing it at the right file to get up and running.
- It uses a very simple, stripped down version of the SQL language.  It doesn't have many of the conveniences of bigger implementations, but it's also not designed for the same kinds of complex queries and workloads.

In other words: SQLite is an amazing choice when your project would benefit from a (usually small) database.  But it is *not* a good choice for running your company or school's central database off of.

Python's standard library has a `sqlite3` module that lets you interact with SQLite databases very easily.

In [18]:
import sqlite3

with sqlite3.connect("my_sqlite_db.db") as con:
    # interactions with the database go through the `cursor` object.
    # `cursor.execute("some SQL code")` executes SQL queries in the
    # database.
    cursor = con.cursor()
    
    # create a table
    cursor.execute("drop table if exists TestTable")
    cursor.execute("create table TestTable (id int, name text)")
    
    cursor.execute("insert into TestTable values (1, 'Henry')")
    cursor.execute("insert into TestTable values (2, 'George')")
    cursor.execute("insert into TestTable values (3, 'Justin')")
    
    # call con.commit()--not cursor.commit()--to save any changes
    con.commit()
    
    # querying the database returns an object containing the query results
    query_result = cursor.execute("select * from TestTable")
    print(query_result)
    
    # iterate through it to get results as tuples, one tuple per row, one entry
    # per column.
    for row in query_result:
        print(row)

<sqlite3.Cursor object at 0x0000028824AD6B90>
(1, 'Henry')
(2, 'George')
(3, 'Justin')


If you know some SQL, you can use pretty much all of that knowledge in SQLite: joins, filters, creating tables, etc.  The only thing to be careful of is how SQLite handles data types: it is not as strict as other SQL databases when it comes to enforcing types.  It'll often coerce types, and it's happy to have a column that stores a mix of integer and text values.  Usually this isn't an issue--if you're using SQLite, you probably aren't doing something where the type checking and such has to be handled by the database itself.  Either you can handle that in your own code, or you have little enough data it isn't an issue, or something else.  But, it can be a bit of a stumbling block for people coming from other SQL implementations.
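
The loose typing is easy to see for yourself.  The sketch below uses an in-memory database (`":memory:"`) and SQLite's built-in `typeof()` function; it also uses `?` placeholders with `executemany()`, which the example above didn't cover, to pass values in from Python.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway database that lives only in RAM
cursor = con.cursor()
cursor.execute("create table TestTable (id int, name text)")

# The id column is declared as int, but we hand it an int, a numeric
# string, and a string that isn't a number at all.
rows = [(1, "Henry"), ("2", "George"), ("three", "Justin")]
cursor.executemany("insert into TestTable values (?, ?)", rows)
con.commit()

# typeof() reports how SQLite actually stored each value.
stored = cursor.execute(
    "select id, typeof(id) from TestTable order by rowid"
).fetchall()
con.close()

for value, storage_type in stored:
    print(repr(value), storage_type)
# 1 integer
# 2 integer
# 'three' text
```

The string `"2"` gets coerced to an integer because it looks like one, while `"three"` is stored as text in the same "int" column without any complaint.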

# And many other databases!

Basically any other database system you can think of will have some Python interface.
- Redis (through the `redis` third-party library), a blazing fast key-value database that's a lot more robust and flexible than dbm.
- MySQL, Oracle SQL, Microsoft SQL, and basically any other major SQL database (other than SQLite) via the `sqlalchemy` third-party library.  (and others--but `sqlalchemy` is the most widely used one).
- MongoDB via the `pymongo` third-party library.
- Numerous graph databases, NoSQL databases, and more.

If you need a database, there'll be some way to access it in Python.