(python-libraries)=
# Python Libraries (WIP)
The following libraries are available and recommended for use by Cal-ITP data analysts.

## Table of Contents
1. [Add New Packages](#add-new-packages)
1. [calitp](#calitp)
1. [siuba](#siuba)
<br> - [Basic Query](#basic-query)
<br> - [Collect Query Results](#collect-query-results)
<br> - [Show Query SQL](#show-query-sql)
1. [shared utils](#shared-utils)

## Add New Packages

While most Python packages an analyst uses come preinstalled in JupyterHub, there may be additional packages you'll want to use in your analysis.

* Install [shared utility functions](#shared-utils)
* Change directory into the project task's subfolder and add `requirements.txt` and/or `conda-requirements.txt`
* Run `pip install -r requirements.txt` and/or `conda install --yes -c conda-forge --file conda-requirements.txt`
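
Before installing, it can help to see which packages in a requirements file are not yet in the environment. The sketch below is one way to check, using only the standard library. The helper names `parse_requirement_names` and `missing_packages` are hypothetical, not part of any Cal-ITP package. It assumes simple `name` or `name==version` lines, and it only sees Python distributions, so conda-only packages may be reported as missing.

```python
import re
from importlib import metadata


def parse_requirement_names(text):
    """Return the package names listed in requirements-style text."""
    names = []
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        # The name is everything before the first version specifier or extra.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

For example, `missing_packages(parse_requirement_names(open("requirements.txt").read()))` would list the packages that `pip install -r requirements.txt` still needs to add.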


(calitp)=
## calitp
(siuba)=
## siuba
Siuba is a tool that allows the same analysis code to run on a pandas DataFrame,
as well as generate SQL for different databases.
It supports most [pandas Series methods](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) analysts use.
See the [siuba docs](https://siuba.readthedocs.io) for more information.

The examples below go through the basics of using siuba, collecting a database query to a local DataFrame,
and showing the SQL queries that siuba code generates.

### Basic query

```python
from myst_nb import glue
from calitp.tables import tbl
from siuba import _, filter, count, collect, show_query

# query latest validation notices, then filter for a single gtfs feed,
# and then count how often each code occurs
(tbl.views.validation_notices()
    >> filter(_.calitp_itp_id == 10, _.calitp_url_number == 0)
    >> count(_.code)
)
```

| | code | n |
|---|---|---|
| 0 | unknown_file | 4 |
| 1 | decreasing_or_equal_shape_distance | 2 |
| 2 | unused_shape | 1 |


### Collect query results
Note that siuba by default prints out a preview of the SQL query results.
In order to fetch the results of the query as a pandas DataFrame, run `collect()`.

```python
tbl_agency_names = tbl.views.gtfs_agency_names() >> collect()

# Use pandas .head() method to show first 5 rows of data
tbl_agency_names.head()
```


| | calitp_itp_id | calitp_url_number | agency_name |
|---|---|---|---|
| 0 | 256 | 0 | Porterville Transit |
| 1 | 257 | 0 | PresidiGo |
| 2 | 259 | 0 | Redding Area Bus Authority |
| 3 | 4 | 0 | AC Transit |
| 4 | 260 | 0 | Beach Cities Transit |


### Show query SQL

While `collect()` fetches query results, `show_query()` prints out the SQL code that siuba generates.

```python
(tbl.views.gtfs_agency_names()
  >> filter(_.agency_name.str.contains("Metro"))
  >> show_query(simplify=True)
)
```


```sql
SELECT `anon_1`.`calitp_itp_id`, `anon_1`.`calitp_url_number`, `anon_1`.`agency_name`
FROM (SELECT calitp_itp_id, calitp_url_number, agency_name
FROM `views.gtfs_agency_names`) AS `anon_1`
WHERE regexp_contains(`anon_1`.`agency_name`, 'Metro')
```


| | calitp_itp_id | calitp_url_number | agency_name |
|---|---|---|---|
| 0 | 278 | 0 | San Diego Metropolitan Transit System |
| 1 | 293 | 0 | Santa Barbara Metropolitan Transit District |
| 2 | 296 | 0 | Santa Cruz Metropolitan Transit District |
| 3 | 323 | 0 | Metrolink |
| 4 | 182 | 1 | Metro |


Note that here the pandas Series method `str.contains` corresponds to `regexp_contains` in Google BigQuery.
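
The mapping makes sense because pandas `str.contains` treats its pattern as a regular expression by default and matches anywhere in the string. The sketch below shows that behavior on a local Series, using a few agency names copied from the outputs above; it runs in pandas only and does not query the warehouse.

```python
import pandas as pd

# A few agency names taken from the example outputs above.
agency_names = pd.Series([
    "San Diego Metropolitan Transit System",
    "Metrolink",
    "AC Transit",
])

# Substring-style match: the pattern may occur anywhere in the name.
contains_metro = agency_names[agency_names.str.contains("Metro")].tolist()

# Because the pattern is a regular expression, anchors such as ^ also work.
starts_with_metro = agency_names[agency_names.str.contains("^Metro")].tolist()
```

Here `contains_metro` holds the San Diego and Metrolink names, while `starts_with_metro` holds only `"Metrolink"`.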

## shared utils
A set of shared utility functions can also be installed like any other Python library. They are stored in [shared_utils](https://github.com/cal-itp/data-analyses/shared_utils). Generalized functions for analysis are added as collaborative work evolves so we aren't constantly reinventing the wheel.

```bash
# In terminal:
cd data-analyses/_shared_utils

# Use the make command to run through conda install and pip install
make setup_env
```

```python
# In notebook:
import shared_utils

shared_utils.geography_utils.WGS84
```

See [data-analyses/example_report](https://github.com/cal-itp/data-analyses/tree/main/example_report) for examples of how to use `shared_utils` for [general functions](https://github.com/cal-itp/data-analyses/blob/main/example_report/shared_utils_examples.ipynb), [charts](https://github.com/cal-itp/data-analyses/blob/main/example_report/example_charts.ipynb), and [maps](https://github.com/cal-itp/data-analyses/blob/main/example_report/example_maps.ipynb).