(python-libraries)=

# Useful Python Libraries

The following libraries are available and recommended for use by Cal-ITP data analysts. Our JupyterHub environment comes with all of these installed already, except for `calitp-data-infra`. A full list of the packages included in the system image that underpins our JupyterHub environment can be found (and updated when needed) [here](https://github.com/cal-itp/data-infra/blob/main/images/jupyter-singleuser/pyproject.toml).

## Table of Contents

1. [shared utils](#shared-utils)
2. [calitp-data-analysis](#calitp-data-analysis)
3. [siuba](#siuba)
   <br> - [Basic Query](#basic-query)
   <br> - [Collect Query Results](#collect-query-results)
   <br> - [Show Query SQL](#show-query-sql)
   <br> - [More siuba Resources](more-siuba-resources)
4. [pandas](pandas-resources)
5. [Add New Packages](#add-new-packages)
6. [Updating calitp-data-analysis](#updating-calitp-data-analysis)
7. [Appendix: calitp-data-infra](appendix)

(shared-utils)=

## shared utils

A set of shared utility functions can also be installed like any other Python library. The shared utilities are stored in two places. Functions that are more likely to be updated live [here](https://github.com/cal-itp/data-analyses/shared_utils) in `data-analyses` as `shared_utils`. Functions that are updated less frequently live [here](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis) in the `calitp_data_analysis` package in `data-infra`. Generalized analysis functions are added as collaborative work evolves, so we aren't constantly reinventing the wheel.

### In terminal

- Navigate to the package folder: `cd data-analyses/_shared_utils`
- Use the make command to run through conda install and pip install: `make setup_env`
  - Note: you may need to select Kernel -> Restart Kernel from the top menu after running `make setup_env` in order to import `shared_utils` successfully.
- Alternative: add an `alias` to your `.bash_profile`:
  - In terminal use `cd` to navigate to the home directory (not a repository)
  - Type `nano .bash_profile` to open the .bash_profile in a text editor
  - Add a line at end: `alias go='cd ~/data-analyses/portfolio && pip install -r requirements.txt && cd ../_shared_utils && make setup_env && cd ..'`
  - Exit with Ctrl+X, hit yes, then hit enter at the filename prompt
  - Restart your server; you can check your changes with `cat .bash_profile`

### In notebook

```python
from calitp_data_analysis import geography_utils

geography_utils.WGS84
```

See [data-analyses/starter_kit](https://github.com/cal-itp/data-analyses/tree/main/starter_kit) for examples on how to use `shared_utils` for general functions, charts, and maps.
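
If an import fails after `make setup_env`, it can help to confirm whether the packages are visible to the current kernel before restarting it. The following is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper name and is not part of `shared_utils` or `calitp_data_analysis`.

```python
from importlib.util import find_spec


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found by the current kernel."""
    # find_spec returns None when a top-level module is not installed
    return find_spec(module_name) is not None


# Top-level packages referenced on this page
for name in ["shared_utils", "calitp_data_analysis"]:
    if is_importable(name):
        print(f"{name}: available")
    else:
        print(f"{name}: missing (run make setup_env, then restart the kernel)")
```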

(calitp-data-analysis)=

## calitp-data-analysis

`calitp-data-analysis` is an internal library of utility functions used to access our warehouse data for analysis purposes.

### import tbls

Most notably, you can add the import below at the top of your notebook to access warehouse tables through the `tbls` object:

```python
from calitp_data_analysis.tables import tbls
```

Example:

```python
from calitp_data_analysis.tables import tbls

tbls.mart_gtfs.dim_agency()
```

|  | key | _gtfs_key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _dt | _feed_valid_from | _line_number | feed_timezone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | a196a8996bb74b7bd6d92f0cc2802620 | 00e77c29f2ce1986407cb93f53936dff | 25915c089571c49be33b59e10c25d2ac | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 00:01:13+00:00 | 1 | America/Los_Angeles |
| 1 | 1d48492cd90930d6612fc70ef18bf8d4 | 93c44e614219c41696a8148018c1d83b | f87ac9abe81a7548325423ede69c3b86 | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-09 | 2022-09-09 00:00:56+00:00 | 1 | America/Los_Angeles |
| 2 | 56112d81d62d71d408ea0247309178e4 | 090343c89ab8fe1d5e3b79e5687bbcca | 21fa0b125d801eb5058da2ec5d748bda | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 14:59:54.274779+00:00 | 1 | America/Los_Angeles |
| 3 | 033d018b031002914fbfa48c872d4a65 | 8f5e204872e6b7034479a6d11502547b | 3d30b02b9008f788eb89c41c137786c1 | MT |  | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 |  |  | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2022-01-19 | 2022-01-19 00:11:01+00:00 | 1 | America/Los_Angeles |
| 4 | fdbd4bd9cc406a651ec80dba6ac4a1c1 | ebfb64604b1c18bdf8ab91a2e4394491 | ba587fd71b4f9a956ff36e309e6acf3f | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-05-14 | 2022-05-14 00:01:18+00:00 | 1 | America/Los_Angeles |


### query_sql

`query_sql` is another useful function to use inside of JupyterHub notebooks to turn a SQL query into a pandas DataFrame.

As an example, in a notebook:

```python
from calitp_data_analysis.sql import query_sql

df_dim_agency = query_sql("""
SELECT
    *
FROM `mart_gtfs.dim_agency`
LIMIT 10""", as_df=True)

df_dim_agency.head()
```

|  | key | _gtfs_key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _dt | _feed_valid_from | _line_number | feed_timezone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | a196a8996bb74b7bd6d92f0cc2802620 | 00e77c29f2ce1986407cb93f53936dff | 25915c089571c49be33b59e10c25d2ac | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 00:01:13+00:00 | 1 | America/Los_Angeles |
| 1 | 1d48492cd90930d6612fc70ef18bf8d4 | 93c44e614219c41696a8148018c1d83b | f87ac9abe81a7548325423ede69c3b86 | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-09 | 2022-09-09 00:00:56+00:00 | 1 | America/Los_Angeles |
| 2 | 56112d81d62d71d408ea0247309178e4 | 090343c89ab8fe1d5e3b79e5687bbcca | 21fa0b125d801eb5058da2ec5d748bda | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 14:59:54.274779+00:00 | 1 | America/Los_Angeles |
| 3 | 033d018b031002914fbfa48c872d4a65 | 8f5e204872e6b7034479a6d11502547b | 3d30b02b9008f788eb89c41c137786c1 | MT |  | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 |  |  | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2022-01-19 | 2022-01-19 00:11:01+00:00 | 1 | America/Los_Angeles |
| 4 | fdbd4bd9cc406a651ec80dba6ac4a1c1 | ebfb64604b1c18bdf8ab91a2e4394491 | ba587fd71b4f9a956ff36e309e6acf3f | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-05-14 | 2022-05-14 00:01:18+00:00 | 1 | America/Los_Angeles |
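
When the same kind of preview query is needed for several tables, the SQL string can be built programmatically before it is handed to `query_sql`. The sketch below is one possible approach, not part of `calitp-data-analysis`: `build_preview_query` is a hypothetical helper, and the restriction to `dataset.table` names made of letters, digits, and underscores is an assumption chosen to keep the string safe to interpolate.

```python
import re

# Assumption: table names look like `dataset.table` with letters, digits, underscores
_TABLE_PATTERN = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9_]+")


def build_preview_query(table: str, limit: int = 10) -> str:
    """Build a `SELECT * ... LIMIT n` query string for a warehouse table."""
    if not _TABLE_PATTERN.fullmatch(table):
        raise ValueError(f"Unexpected table name: {table!r}")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"SELECT\n    *\nFROM `{table}`\nLIMIT {int(limit)}"


sql = build_preview_query("mart_gtfs.dim_agency", limit=10)
print(sql)

# Inside JupyterHub, the string can then be passed to query_sql:
# from calitp_data_analysis.sql import query_sql
# df_dim_agency = query_sql(sql, as_df=True)
```

The `query_sql` call is left commented out because it needs warehouse credentials; everything above it runs anywhere.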


(siuba)=

## siuba

`siuba` is a tool that allows the same analysis code to run on a pandas DataFrame,
as well as generate SQL for different databases.
It supports most [pandas Series methods](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) analysts use. See the [siuba docs](https://siuba.readthedocs.io) for more information.

The examples below go through the basics of using siuba, collecting a database query to a local DataFrame,
and showing the SQL queries that siuba code generates.

### Basic query

```python
from calitp_data_analysis.tables import tbls
from siuba import _, filter, count, collect, show_query

# query agency information, then filter for a single gtfs feed,
# and then count how often each feed key occurs
(tbls.mart_gtfs.dim_agency()
    >> filter(_.agency_id == 'BA', _.base64_url == 'aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw==')
    >> count(_.feed_key)
)
```

|  | feed_key | n |
| --- | --- | --- |
| 0 | 8f5949d03a0cb1243ac9301df8adef14 | 1 |
| 1 | b01a3ce6ba2e972a6acc08574a3e06a9 | 1 |
| 2 | f4095db9282b859842d338f5db032561 | 1 |
| 3 | cae827cd30737d76600b13970dde458a | 1 |
| 4 | 4697bb52eb4da8bcff925f503a623326 | 1 |
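
To make the meaning of `filter` followed by `count` concrete, here is a plain-Python sketch of the same two steps on a few in-memory rows. The rows and their values are hypothetical stand-ins, not warehouse data, and this is not how siuba runs the query (siuba translates the pipeline to SQL).

```python
from collections import Counter

# Hypothetical rows standing in for mart_gtfs.dim_agency
rows = [
    {"agency_id": "BA", "feed_key": "feed_a"},
    {"agency_id": "BA", "feed_key": "feed_a"},
    {"agency_id": "BA", "feed_key": "feed_b"},
    {"agency_id": "MT", "feed_key": "feed_c"},
]

# filter(_.agency_id == "BA"): keep only the rows matching the condition
filtered = [row for row in rows if row["agency_id"] == "BA"]

# count(_.feed_key): one result per distinct feed_key, with n = number of rows
counts = Counter(row["feed_key"] for row in filtered)

for feed_key, n in counts.items():
    print(feed_key, n)
```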


### Collect query results

Note that siuba by default prints out a preview of the SQL query results.
In order to fetch the results of the query as a pandas DataFrame, run `collect()`.

```python
tbl_agency_names = tbls.mart_gtfs.dim_agency() >> collect()

# Use pandas .head() method to show first 5 rows of data
tbl_agency_names.head()
```


|  | key | _gtfs_key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _dt | _feed_valid_from | _line_number | feed_timezone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | a196a8996bb74b7bd6d92f0cc2802620 | 00e77c29f2ce1986407cb93f53936dff | 25915c089571c49be33b59e10c25d2ac | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 00:01:13+00:00 | 1 | America/Los_Angeles |
| 1 | 1d48492cd90930d6612fc70ef18bf8d4 | 93c44e614219c41696a8148018c1d83b | f87ac9abe81a7548325423ede69c3b86 | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-09 | 2022-09-09 00:00:56+00:00 | 1 | America/Los_Angeles |
| 2 | 56112d81d62d71d408ea0247309178e4 | 090343c89ab8fe1d5e3b79e5687bbcca | 21fa0b125d801eb5058da2ec5d748bda | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-09-15 | 2022-09-15 14:59:54.274779+00:00 | 1 | America/Los_Angeles |
| 3 | 033d018b031002914fbfa48c872d4a65 | 8f5e204872e6b7034479a6d11502547b | 3d30b02b9008f788eb89c41c137786c1 | MT |  | http://www.mantecatransit.com | America/Los_Angeles | en | (209) 456-8000 |  |  | aHR0cHM6Ly93d3cuY2kubWFudGVjYS5jYS51cy9Db21tdW... | 2022-01-19 | 2022-01-19 00:11:01+00:00 | 1 | America/Los_Angeles |
| 4 | fdbd4bd9cc406a651ec80dba6ac4a1c1 | ebfb64604b1c18bdf8ab91a2e4394491 | ba587fd71b4f9a956ff36e309e6acf3f | LAX FlyAway |  | http://www.LAXFlyAway.org | America/Los_Angeles | en | (714) 507-1170 |  |  | aHR0cHM6Ly93d3cuZmx5bGF4LmNvbS8tL21lZGlhL2ZseW... | 2022-05-14 | 2022-05-14 00:01:18+00:00 | 1 | America/Los_Angeles |


### Show query SQL

While `collect()` fetches query results, `show_query()` prints out the SQL code that siuba generates.

```python
(tbls.mart_gtfs.dim_agency()
  >> filter(_.agency_name.str.contains("Metro"))
  >> show_query(simplify=True)
)
```


```sql
SELECT *
FROM `mart_gtfs.dim_agency` AS `mart_gtfs.dim_agency_1`
WHERE regexp_contains(`mart_gtfs.dim_agency_1`.`agency_name`, 'Metro')
```


|  | key | _gtfs_key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _dt | _feed_valid_from | _line_number | feed_timezone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | b7aba0d9631f4978a1778a4527b9dc62 | b0e8c947748d71e0bfa038d470ffd2c2 | 0637217a45e976c6c80d21ae2367c5fb | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... |  | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2021-07-17 | 2021-07-17 00:13:13+00:00 | 1 | America/Los_Angeles |
| 1 | 5510fd5cfd5718d36c5cea9efc61b05e | 267a1b2a19b0526a5086cded02ba3325 | 2716d91a7c0f05036fd7d593a6980ec4 | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... |  | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2022-12-16 | 2022-12-16 03:00:27.604217+00:00 | 1 | America/Los_Angeles |
| 2 | 9e1782ca894aff6f1a20d273002b31a8 | 46bb281dadd29531a7208ddb2d4ad911 | eb45540bc9b652b8ec24a952481dadc4 | 960 | Madera Metro | https://www.madera.gov/home/departments/transi... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.madera.gov/home/departments/transi... |  | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2023-07-11 | 2023-07-11 03:00:46.743372+00:00 | 1 | America/Los_Angeles |
| 3 | 775d1b8c065a397b32e7ecfc72d5d6e7 | cfe46fa3d9321606d251d13168498d03 | a6975de68648952eda4d63b6f6d09985 | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... |  | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2021-08-13 | 2021-08-13 00:18:57+00:00 | 1 | America/Los_Angeles |
| 4 | 3d00853c3a6634cba7a650b240057e0d | 45be61a1cffaca77873a82d81b83f4fa | 995bc341c731451ec41fcdceb1a83586 | 960 | Madera Metro | https://www.cityofmadera.ca.gov/home/departmen... | America/Los_Angeles | en | (559) 661-RIDE (7433) | https://www.cityofmadera.ca.gov/wp-content/upl... |  | aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3... | 2022-08-17 | 2022-08-17 00:01:19+00:00 | 1 | America/Los_Angeles |


Note that here the pandas Series method `str.contains` corresponds to `regexp_contains` in Google BigQuery.
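
Because the pattern is treated as a regular expression by default in both pandas `str.contains` and BigQuery `regexp_contains`, it behaves like a regex search rather than a plain substring test. The sketch below illustrates that matching rule with Python's `re` module; `re.search` is only a stand-in for illustration, not what siuba or BigQuery run internally, and `"metro express"` is a made-up value added to show case sensitivity.

```python
import re

# Sample strings from the outputs on this page, plus one hypothetical lowercase name
names = ["Madera Metro", "LAX FlyAway", "MT", "metro express"]

# A match anywhere in the string counts, and matching is case-sensitive
matches = [name for name in names if re.search("Metro", name)]
print(matches)

# Regex metacharacters are interpreted, so escape them to match literally
dot_as_wildcard = bool(re.search("M.T", "MAT"))            # "." matches any character
dot_as_literal = bool(re.search(re.escape("M.T"), "MAT"))  # literal "M.T" is absent
print(dot_as_wildcard, dot_as_literal)
```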

(more-siuba-resources)=

### More siuba Resources

- [siuba docs](https://siuba.readthedocs.io)
- ['Tidy Tuesday' live analyses with siuba](https://www.youtube.com/playlist?list=PLiQdjX20rXMHc43KqsdIowHI3ouFnP_Sf)

(pandas-resources)=

## pandas

The pandas library is very commonly used in data analysis, and the external resources below provide a brief overview of its use.

- [Cheat Sheet - pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

## Add New Packages

While most Python packages an analyst uses come in JupyterHub, there may be additional packages you'll want to use in your analysis.

- Install [shared utility functions](#shared-utils)
- Change directory into the project task's subfolder and add `requirements.txt` and/or `conda-requirements.txt`
- Run `pip install -r requirements.txt` and/or `conda install --yes -c conda-forge --file conda-requirements.txt`
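
Before adding a package to `requirements.txt`, it may be worth checking whether it is already present in the environment. The sketch below uses the standard library's `importlib.metadata` for that check; the helper names and the example requirement lines are hypothetical, and the name parsing is deliberately simplified (it does not handle every form a requirements file allows).

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def requirement_name(line: str) -> Optional[str]:
    """Extract the package name from one requirements.txt line, if any."""
    line = line.split("#", 1)[0].strip()  # drop comments and whitespace
    if not line or line.startswith("-"):  # skip blanks and pip options such as -r
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    return match.group(0) if match else None


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Hypothetical requirements.txt contents
example_lines = ["# project extras", "some-new-package==1.2.3", "", "another_pkg>=0.4"]
for line in example_lines:
    name = requirement_name(line)
    if name is not None:
        print(name, "->", installed_version(name) or "not installed")
```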

(updating-calitp-data-analysis)=

## Updating calitp-data-analysis

`calitp-data-analysis` is a [package](https://pypi.org/project/calitp-data-analysis/) that lives [here](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis) in the `data-infra` repo. Follow the steps below to update the package.

<b>Steps </b>

Adapted from [this Slack thread](https://cal-itp.slack.com/archives/C02KH3DGZL7/p1694470040574809).

1. Make the changes you want in the `calitp-data-analysis` folder inside `packages` [here](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis). If you are only changing package metadata (author information, package description, etc.) without changing the function of the package itself, that information lives in `pyproject.toml` rather than in the `calitp-data-analysis` subfolder.
   - If you are adding a new function that relies on a package that isn't already a dependency, change directories to `data-infra/packages/calitp-data-analysis` and run `poetry add <package name>`. Check this [Jupyter image file](https://github.com/cal-itp/data-infra/blob/main/images/jupyter-singleuser/pyproject.toml) for the version number associated with the package, because you should specify the version.
     - For example, suppose your function relies on `dask`. In the Jupyter image file, the version is `dask = "~2022.8"`, so run `poetry add dask@~2022.8` in the terminal.
   - You may also have to run `poetry install mypy`. `mypy` is a package that audits all the functions. [Read more about it here.](https://mypy-lang.org/)
2. Each time you update the package, you must also update the version number. We use dates to reflect which version we are on. Update the version in [pyproject.toml](https://github.com/cal-itp/data-infra/blob/main/packages/calitp-data-analysis/pyproject.toml#L3) that lives in `calitp-data-analysis` to either today's date or a future date.
3. Open a new pull request and make sure the new version date appears on the [test version page](https://test.pypi.org/project/calitp-data-analysis/).
   - The new version date may not show up on the test page due to errors. Check the GitHub Action page of your pull request to see the errors that have occurred.
   - If you run into an error message like `error: Skipping analyzing "dask_geopandas": module is installed, but missing library stubs or py.typed marker  [import]`, go to your `.py` file and add `# type: ignore` after the package import.
     - To fix the error above, change `import dask_geopandas as dg` to `import dask_geopandas as dg  # type: ignore`.
   - We encourage making changes in a set of smaller commits. For example, add all the necessary packages with `poetry add <package name>` first, fix any issues flagged by `mypy`, and finally address any additional issues.
4. Merge the PR. Once it is merged in, the [actual package](https://pypi.org/project/calitp-data-analysis/) will display the new version number. To make sure everything works as expected, run `pip install calitp-data-analysis==<new version here>` in a cell of a Jupyter notebook and import a module (or two), such as `from calitp_data_analysis import styleguide`.
5. Update the new version number in the `data-infra` repository [here](https://github.com/cal-itp/data-infra/blob/main/images/dask/requirements.txt#L30), [here](https://github.com/cal-itp/data-infra/blob/main/images/jupyter-singleuser/pyproject.toml#L48), [here](https://github.com/cal-itp/data-infra/blob/main/docs/requirements.txt), and anywhere else you find a reference to the old version of the package. You'll also want to do the same for any other Cal-ITP repositories that reference the calitp-data-analysis package.
   - As of writing, the only other repository that references the package version is [reports](https://github.com/cal-itp/reports).
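
For step 5, a small script can help locate remaining references to the old version in a local checkout. This is a minimal sketch rather than an established Cal-ITP tool: `find_version_references` is a hypothetical helper, the set of file suffixes searched is an assumption, and the version strings in the demonstration are made up.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple

PACKAGE = "calitp-data-analysis"
# Assumption: version pins live in these kinds of files
SUFFIXES = {".txt", ".toml", ".yml", ".yaml"}


def find_version_references(root: Path, old_version: str) -> Iterator[Tuple[Path, int, str]]:
    """Yield (file, line number, line) for lines naming the package and old version."""
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in SUFFIXES:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip unreadable or binary files
        for number, line in enumerate(text.splitlines(), start=1):
            if PACKAGE in line and old_version in line:
                yield path, number, line.strip()


# Demonstration on a throwaway directory with hypothetical contents
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "requirements.txt").write_text("calitp-data-analysis==2023.9.12\npandas\n")
    (root / "pyproject.toml").write_text('calitp-data-analysis = "2024.1.1"\n')
    hits = [(p.name, n, line) for p, n, line in find_version_references(root, "2023.9.12")]
    print(hits)
```

In a real checkout, the same function could be pointed at the repository root, for example `find_version_references(Path("data-infra"), "<old version>")`.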

<b>Resources</b>

- [Issue #870](https://github.com/cal-itp/data-analyses/issues/870)
- [Pull Request #2994](https://github.com/cal-itp/data-infra/pull/2944)
- [Slack thread](https://cal-itp.slack.com/archives/C02KH3DGZL7/p1694470040574809)

(appendix)=

## Appendix: calitp-data-infra

The [calitp-data-infra](https://pypi.org/project/calitp-data-infra/) package, used primarily by warehouse maintainers and data pipeline developers, includes utilities that analysts will likely not need to interact with directly (and therefore generally won't need to install), but which may be helpful to be aware of. For instance, the `get_secret_by_name()` and `get_secrets_by_label()` functions in [the package's `auth` module](https://github.com/cal-itp/data-infra/blob/main/packages/calitp-data-infra/calitp_data_infra/auth.py) are used to interact with Google's [Secret Manager](https://console.cloud.google.com/security/secret-manager), the service that securely stores API keys and other sensitive information that underpins many of our data syncs.

You can read more about the `calitp-data-infra` Python package [here](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-infra#readme).