(python-libraries)=

# Useful Python Libraries

The following libraries are available and recommended for use by Cal-ITP data analysts. Our JupyterHub environment comes with all of these installed already, except for `calitp-data-infra`. A full list of the packages included in the system image that underpins our JupyterHub environment can be found (and updated when needed) [here](https://github.com/cal-itp/data-infra/blob/main/images/jupyter-singleuser/pyproject.toml).

## Table of Contents

1. [shared utils](#shared-utils)
2. [calitp-data-analysis](#calitp-data-analysis)
3. [siuba](#siuba)
   <br> - [Basic Query](#basic-query)
   <br> - [Collect Query Results](#collect-query-results)
   <br> - [Show Query SQL](#show-query-sql)
   <br> - [More siuba Resources](more-siuba-resources)
4. [pandas](pandas-resources)
5. [Add New Packages](#add-new-packages)
6. [Appendix: calitp-data-infra](appendix)

(shared-utils)=

## shared utils

A set of shared utility functions can also be installed like any other Python library. The `shared_utils` functions are stored in two places. Functions that are more likely to be updated live [here](https://github.com/cal-itp/data-analyses/shared_utils) in `data-analyses`. Functions that are updated less frequently live [here](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis) in the `calitp_data_analysis` package in `data-infra`. Generalized analysis functions are added as collaborative work evolves so that we aren't constantly reinventing the wheel.

### In terminal

- Navigate to the package folder: `cd data-analyses/_shared_utils`
- Use the make command to run through conda install and pip install: `make setup_env`
  - Note: you may need to select Kernel -> Restart Kernel from the top menu after running `make setup_env` in order to import `shared_utils` successfully
- Alternative: add an `alias` to your `.bash_profile`:
  - In terminal use `cd` to navigate to the home directory (not a repository)
  - Type `nano .bash_profile` to open the .bash_profile in a text editor
  - Add a line at end: `alias go='cd ~/data-analyses/portfolio && pip install -r requirements.txt && cd ../_shared_utils && make setup_env && cd ..'`
  - Exit with Ctrl+X, press `Y` to save, then press Enter at the filename prompt
  - Restart your server; you can check your changes with `cat .bash_profile`
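
After the setup steps above, a quick way to confirm that the install worked is to ask Python whether it can locate the module. The snippet below is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper name, not part of `shared_utils`.

```python
from importlib import util


def is_importable(module_name):
    """Return True if Python can locate a top-level module without importing it."""
    return util.find_spec(module_name) is not None


# Prints True once `make setup_env` has installed the package
# (a kernel restart may still be needed in an open notebook).
print("shared_utils importable:", is_importable("shared_utils"))
```

If this prints `False` in a notebook right after running `make setup_env`, restarting the kernel as noted above is the first thing to try.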

### In notebook

```python
from calitp_data_analysis import geography_utils

geography_utils.WGS84
```

See [data-analyses/starter_kit](https://github.com/cal-itp/data-analyses/tree/main/starter_kit) for examples on how to use `shared_utils` for general functions, charts, and maps.

(calitp-data-analysis)=

## calitp-data-analysis

`calitp-data-analysis` is an internal library of utility functions used to access our warehouse data for analysis purposes.

### import tbls

Most notably, you can add the following import at the top of your notebook and then use the `tbls` object to access tables in the warehouse:

```python
from calitp_data_analysis.tables import tbls
```

Example:

```python
from calitp_data_analysis.tables import tbls

tbls.mart_gtfs.dim_agency()
```

|   | `key` | `_gtfs_key` | `feed_key` | `agency_id` | `agency_name` | `agency_url` | `agency_timezone` | `agency_lang` | `agency_phone` | `agency_fare_url` | `agency_email` | `base64_url` | `_dt` | `_feed_valid_from` | `_line_number` | `feed_timezone` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | a196a8996bb74b7bd6d92f0cc2802620 | 00e77c29f2ce1986407cb93f53936dff | 25915c089571c49be33b59e10c25d2ac | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 00:01:13+00:00 | 1 | America/Los_Angeles |
| 1 | 033d018b031002914fbfa48c872d4a65 | 8f5e204872e6b7034479a6d11502547b | 3d30b02b9008f788eb89c41c137786c1 | MT |  | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 |  |  | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2022-01-19 | 2022-01-19 00:11:01+00:00 | 1 | America/Los_Angeles |
| 2 | 1d48492cd90930d6612fc70ef18bf8d4 | 93c44e614219c41696a8148018c1d83b | f87ac9abe81a7548325423ede69c3b86 | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-09 | 2022-09-09 00:00:56+00:00 | 1 | America/Los_Angeles |
| 3 | 282b4e07171c48ac08162e0dc8749066 | 18fcf47148228c7b56c19726c854054b | 9b8e9b8befe560293a6c5b38dc19ffbb | MT |  | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 |  |  | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2021-07-28 | 2021-07-28 13:56:16+00:00 | 1 | America/Los_Angeles |
| 4 | 56112d81d62d71d408ea0247309178e4 | 090343c89ab8fe1d5e3b79e5687bbcca | 21fa0b125d801eb5058da2ec5d748bda | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 14:59:54.274779+00:00 | 1 | America/Los_Angeles |


### query_sql

`query_sql` is another useful function to use inside of JupyterHub notebooks to turn a SQL query into a pandas DataFrame.

As an example, in a notebook:

```python
from calitp_data_analysis.sql import query_sql

# Run the query against the warehouse and return the result as a pandas DataFrame
df_dim_agency = query_sql("""
SELECT
    *
FROM `mart_gtfs.dim_agency`
LIMIT 10""", as_df=True)

df_dim_agency.head()
```

|   | `key` | `_gtfs_key` | `feed_key` | `agency_id` | `agency_name` | `agency_url` | `agency_timezone` | `agency_lang` | `agency_phone` | `agency_fare_url` | `agency_email` | `base64_url` | `_dt` | `_feed_valid_from` | `_line_number` | `feed_timezone` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | a196a8996bb74b7bd6d92f0cc2802620 | 00e77c29f2ce1986407cb93f53936dff | 25915c089571c49be33b59e10c25d2ac | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 00:01:13+00:00 | 1 | America/Los_Angeles |
| 1 | 033d018b031002914fbfa48c872d4a65 | 8f5e204872e6b7034479a6d11502547b | 3d30b02b9008f788eb89c41c137786c1 | MT |  | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 |  |  | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2022-01-19 | 2022-01-19 00:11:01+00:00 | 1 | America/Los_Angeles |
| 2 | 1d48492cd90930d6612fc70ef18bf8d4 | 93c44e614219c41696a8148018c1d83b | f87ac9abe81a7548325423ede69c3b86 | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-09 | 2022-09-09 00:00:56+00:00 | 1 | America/Los_Angeles |
| 3 | 282b4e07171c48ac08162e0dc8749066 | 18fcf47148228c7b56c19726c854054b | 9b8e9b8befe560293a6c5b38dc19ffbb | MT |  | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 |  |  | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2021-07-28 | 2021-07-28 13:56:16+00:00 | 1 | America/Los_Angeles |
| 4 | 56112d81d62d71d408ea0247309178e4 | 090343c89ab8fe1d5e3b79e5687bbcca | 21fa0b125d801eb5058da2ec5d748bda | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 14:59:54.274779+00:00 | 1 | America/Los_Angeles |


(siuba)=

## siuba

`siuba` is a tool that allows the same analysis code to run on a pandas DataFrame
and to generate SQL for different databases.
It supports most [pandas Series methods](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) analysts use. See the [siuba docs](https://siuba.readthedocs.io) for more information.

The examples below cover the basics of using siuba, collecting the results of a database query into a local DataFrame,
and showing the SQL queries that siuba code generates.

### Basic query

```python
from calitp_data_analysis.tables import tbls
from siuba import _, filter, count, collect, show_query

# query agency information, then filter for a single GTFS feed,
# and then count how often each feed key occurs
(tbls.mart_gtfs.dim_agency()
    >> filter(_.agency_id == 'BA', _.base64_url == 'aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw==')
    >> count(_.feed_key)
)
```

|   | `feed_key` | `n` |
| --- | --- | --- |
| 0 | 8f5949d03a0cb1243ac9301df8adef14 | 1 |
| 1 | 4697bb52eb4da8bcff925f503a623326 | 1 |
| 2 | a64532640829f3eae9129cd5d5e9590b | 1 |
| 3 | f4095db9282b859842d338f5db032561 | 1 |
| 4 | e839b5cbcd5b29a16c2ac4d40cd4439d | 1 |
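
The `base64_url` value used in the filter above is hard to read at a glance. Assuming, as the column name suggests, that it holds a base64-encoded feed URL, the sketch below shows how to decode and re-encode such a value with the standard library. The helper names are hypothetical, and the use of the URL-safe base64 variant is an assumption. For this particular string, the standard and URL-safe variants give the same result.

```python
import base64


def decode_base64_url(encoded):
    """Decode a base64-encoded URL string back into readable text."""
    return base64.urlsafe_b64decode(encoded).decode("utf-8")


def encode_base64_url(url):
    """Encode a URL as base64 text (URL-safe alphabet assumed)."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")


# The value used in the filter in the basic query example
encoded = "aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw=="
decoded = decode_base64_url(encoded)
print(decoded)  # https://api.511.org/transit/datafeeds?operator_id=RG
```

Going the other way with `encode_base64_url` can help when you know a feed's URL and want the matching value to filter on.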


### Collect query results

Note that siuba by default prints out a preview of the SQL query results.
In order to fetch the results of the query as a pandas DataFrame, run `collect()`.

```python
tbl_agency_names = tbls.mart_gtfs.dim_agency() >> collect()

# Use pandas .head() method to show first 5 rows of data
tbl_agency_names.head()
```


|   | `key` | `_gtfs_key` | `feed_key` | `agency_id` | `agency_name` | `agency_url` | `agency_timezone` | `agency_lang` | `agency_phone` | `agency_fare_url` | `agency_email` | `base64_url` | `_dt` | `_feed_valid_from` | `_line_number` | `feed_timezone` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | a196a8996bb74b7bd6d92f0cc2802620 | 00e77c29f2ce1986407cb93f53936dff | 25915c089571c49be33b59e10c25d2ac | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 00:01:13+00:00 | 1 | America/Los_Angeles |
| 1 | 033d018b031002914fbfa48c872d4a65 | 8f5e204872e6b7034479a6d11502547b | 3d30b02b9008f788eb89c41c137786c1 | MT |  | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 |  |  | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2022-01-19 | 2022-01-19 00:11:01+00:00 | 1 | America/Los_Angeles |
| 2 | 1d48492cd90930d6612fc70ef18bf8d4 | 93c44e614219c41696a8148018c1d83b | f87ac9abe81a7548325423ede69c3b86 | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-09 | 2022-09-09 00:00:56+00:00 | 1 | America/Los_Angeles |
| 3 | 282b4e07171c48ac08162e0dc8749066 | 18fcf47148228c7b56c19726c854054b | 9b8e9b8befe560293a6c5b38dc19ffbb | MT |  | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 |  |  | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2021-07-28 | 2021-07-28 13:56:16+00:00 | 1 | America/Los_Angeles |
| 4 | 56112d81d62d71d408ea0247309178e4 | 090343c89ab8fe1d5e3b79e5687bbcca | 21fa0b125d801eb5058da2ec5d748bda | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 14:59:54.274779+00:00 | 1 | America/Los_Angeles |
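
To build intuition for why a pipeline only returns a full result once `collect()` is called, here is a toy sketch of a lazy pipeline driven by the `>>` operator. It is not siuba's implementation. `LazyTable`, `Verb`, and `keep_if` are hypothetical names made up for this illustration, and the rows are invented.

```python
class Verb:
    """A deferred step that can sit on the right-hand side of `>>`."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, left):
        # Python calls this for `left >> verb` when `left` defines no __rshift__
        return self.func(left)


class LazyTable:
    """Holds rows plus a list of pending steps; nothing runs until `run()`."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def add_step(self, step):
        # Return a new table with one more pending step (no work done yet)
        return LazyTable(self._rows, self._steps + (step,))

    def run(self):
        rows = list(self._rows)
        for step in self._steps:
            rows = step(rows)
        return rows


def keep_if(predicate):
    """Toy filter verb: records the filter instead of applying it."""
    return Verb(lambda table: table.add_step(
        lambda rows: [row for row in rows if predicate(row)]
    ))


def collect():
    """Toy collect verb: executes every pending step and returns the rows."""
    return Verb(lambda table: table.run())


agencies = LazyTable([{"agency_name": "Madera Metro"}, {"agency_name": "LAX FlyAway"}])
query = agencies >> keep_if(lambda row: "Metro" in row["agency_name"])  # still lazy
result = query >> collect()  # steps run here
print(result)
```

In this sketch `query` is still a `LazyTable`, and only the final `>> collect()` produces concrete rows, which mirrors the distinction between a query preview and a collected DataFrame described above.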


### Show query SQL

While `collect()` fetches query results, `show_query()` prints out the SQL code that siuba generates.

```python
(tbls.mart_gtfs.dim_agency()
  >> filter(_.agency_name.str.contains("Metro"))
  >> show_query(simplify=True)
)
```

```sql
SELECT *
FROM `mart_gtfs.dim_agency` AS `mart_gtfs.dim_agency_1`
WHERE regexp_contains(`mart_gtfs.dim_agency_1`.`agency_name`, 'Metro')
```


|   | `key` | `_gtfs_key` | `feed_key` | `agency_id` | `agency_name` | `agency_url` | `agency_timezone` | `agency_lang` | `agency_phone` | `agency_fare_url` | `agency_email` | `base64_url` | `_dt` | `_feed_valid_from` | `_line_number` | `feed_timezone` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0782ef40776b4ad8fba49ca86a375e5a | 621d37259d6a5a7f641fd5a6e8a747f7 | 30a2d5133c29abfbdc9e504bef2830e9 | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... |  | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2023-04-08 | 2023-04-08 03:00:20.134144+00:00 | 1 | America/Los_Angeles |
| 1 | 7dde33f3c8629a4703af876c042aca24 | 89aec1b29dcfdd86a5ee1632db4cf0ac | 58440d643369c3d8690e4c27ba3911bb | 960 | Madera Metro | https://www.madera.gov/home/departments/transi... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.madera.gov/home/departments/transi... |  | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2023-09-07 | 2023-09-07 03:00:29.777243+00:00 | 1 | America/Los_Angeles |
| 2 | 775d1b8c065a397b32e7ecfc72d5d6e7 | cfe46fa3d9321606d251d13168498d03 | a6975de68648952eda4d63b6f6d09985 | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... |  | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2021-08-13 | 2021-08-13 00:18:57+00:00 | 1 | America/Los_Angeles |
| 3 | b7aba0d9631f4978a1778a4527b9dc62 | b0e8c947748d71e0bfa038d470ffd2c2 | 0637217a45e976c6c80d21ae2367c5fb | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... |  | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2021-07-17 | 2021-07-17 00:13:13+00:00 | 1 | America/Los_Angeles |
| 4 | 65ce81af96c6e1bbe1e8b34e9c6cd1bf | a51fc5abab3caa0f0adc09c2ad04ffe6 | 4f9efa4176247918b04f6c7ec3926b35 | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... |  | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2021-04-16 | 2021-04-16 00:01:13+00:00 | 1 | America/Los_Angeles |


Note that here the pandas Series method `str.contains` corresponds to `regexp_contains` in Google BigQuery.
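
One practical consequence of this mapping is that the pattern is treated as a regular expression that may match anywhere in the value (pandas `str.contains` uses regex matching by default). The sketch below imitates that behavior with Python's `re` module on a small made-up list. `contains` is a hypothetical helper, not a pandas or siuba function.

```python
import re


def contains(pattern, values):
    """Keep values where the regular expression matches anywhere in the string."""
    return [value for value in values if re.search(pattern, value)]


agency_names = ["Madera Metro", "LAX FlyAway"]
metro_agencies = contains("Metro", agency_names)

# Parentheses are regex metacharacters, so "(559)" as a pattern means
# "the group 559" and also matches a value with no parentheses at all.
phones = ["(559) 661-RIDE (7433)", "5590"]
unescaped_matches = contains("(559)", phones)
# Escaping the pattern makes it match the literal text "(559)".
literal_matches = contains(re.escape("(559)"), phones)

print(metro_agencies)
print(unescaped_matches)
print(literal_matches)
```

When filtering on text that contains characters such as `(`, `.`, or `+`, escaping the pattern first avoids surprising matches.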

(more-siuba-resources)=

### More siuba Resources

- [siuba docs](https://siuba.readthedocs.io)
- ['Tidy Tuesday' live analyses with siuba](https://www.youtube.com/playlist?list=PLiQdjX20rXMHc43KqsdIowHI3ouFnP_Sf)

(pandas-resources)=

## pandas

The pandas library is very commonly used in data analysis, and the external resources below provide a brief overview of its use.

- [Cheat Sheet - pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

## Add New Packages

While most Python packages an analyst uses come preinstalled in JupyterHub, you may want to use additional packages in your analysis.

- Install [shared utility functions](#shared-utils)
- Change directory into the project task's subfolder and add `requirements.txt` and/or `conda-requirements.txt`
- Run `pip install -r requirements.txt` and/or `conda install --yes -c conda-forge --file conda-requirements.txt`
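
Before running the install step, it can be handy to see which packages listed in a requirements file are not yet present in the current environment. The sketch below uses the standard library's `importlib.metadata`. `requirement_name` and `missing_packages` are hypothetical helpers, and the name parsing only handles simple lines such as `package==1.0` (not URLs or editable installs).

```python
import re
from importlib import metadata


def requirement_name(line):
    """Return the bare package name from a simple requirements line, or None."""
    line = line.split("#", 1)[0].strip()  # drop comments and whitespace
    if not line or line.startswith("-"):  # skip blanks and pip options like -r
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    return match.group(0) if match else None


def missing_packages(lines):
    """List the package names from `lines` that are not installed."""
    missing = []
    for line in lines:
        name = requirement_name(line)
        if name is None:
            continue
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Illustrative input; in practice, read the lines from requirements.txt
example_lines = ["# analysis extras", "definitely-not-a-real-package-xyz==1.0"]
print(missing_packages(example_lines))
```

To check a real file, pass `pathlib.Path("requirements.txt").read_text().splitlines()` to `missing_packages`.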

(appendix)=

## Appendix: calitp-data-infra

The [calitp-data-infra](https://pypi.org/project/calitp-data-infra/) package, used primarily by warehouse maintainers and data pipeline developers, includes utilities that analysts will likely not need to interact with directly (and therefore generally won't need to install), but which may be helpful to be aware of. For instance, the `get_secret_by_name()` and `get_secrets_by_label()` functions in [the package's `auth` module](https://github.com/cal-itp/data-infra/blob/main/packages/calitp-data-infra/calitp_data_infra/auth.py) are used to interact with Google's [Secret Manager](https://console.cloud.google.com/security/secret-manager), the service that securely stores API keys and other sensitive information that underpins many of our data syncs.

You can read more about the `calitp-data-infra` Python package [here](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-infra#readme).