(python-libraries)=
# Python Libraries
The following libraries are available and recommended for use by Cal-ITP data analysts.

## Table of Contents
1. [calitp](#calitp)
1. [siuba](#siuba)
   - [Basic Query](#basic-query)
   - [Collect Query Results](#collect-query-results)
   - [Show Query SQL](#show-query-sql)
   - [More siuba Resources](more-siuba-resources)
1. [shared utils](#shared-utils)
1. [pandas](pandas-resources)
1. [Add New Packages](#add-new-packages)

(calitp)=
## calitp
`calitp` is an internal library of utility functions used to access our warehouse data.

### import tbl

Most notably, you can include `from calitp.tables import tbl` at the top of your notebook to load tables from the warehouse through the `tbl` object:

```python
from calitp.tables import tbl
```

Example:

```python
from calitp.tables import tbl

tbl.views.gtfs_schedule_fact_daily_feed_routes()
```

| | feed_key | route_key | date | calitp_extracted_at | calitp_deleted_at |
|---|---|---|---|---|---|
| 0 | -2151971653496488316 | -6184099611572222930 | 2021-08-17 | 2021-05-27 | 2021-08-20 |
| 1 | 913836324682460005 | -9192256639167498460 | 2021-08-22 | 2021-07-12 | 2021-10-01 |
| 2 | -1514300518519871606 | 7861026961853424301 | 2021-10-16 | 2021-10-12 | 2021-10-24 |
| 3 | -5576060957641356567 | 5185699003924973220 | 2021-10-16 | 2021-08-08 | 2022-01-18 |
| 4 | 7047843866677258129 | -8453063117411979560 | 2021-08-17 | 2021-06-21 | 2021-08-27 |


### query_sql

`query_sql` is another useful function to use inside of JupyterHub notebooks to turn a SQL query into a pandas DataFrame.

As an example, in a notebook:

```python
from calitp import query_sql

df_dim_feeds = query_sql("""
SELECT
    *
FROM `views.gtfs_schedule_dim_feeds`
LIMIT 10""", as_df=True)

df_dim_feeds.head()
```

| | feed_key | calitp_itp_id | calitp_url_number | calitp_agency_name | calitp_gtfs_schedule_url | calitp_id_in_latest | calitp_feed_id | calitp_feed_name | feed_publisher_name | feed_publisher_url | feed_lang | default_lang | feed_version | feed_contact_email | feed_contact_url | feed_start_date | feed_end_date | is_composite_feed | calitp_extracted_at | calitp_deleted_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3971869378563233510 | 8 | 1 | Monterey-Salinas Transit | http://www.mst.org/google/google_transit.zip | False | 8__1 | Monterey-Salinas Transit (1) | MST | http://www.mst.org | en | | 202101-February2021v00 | | | | | False | 2021-04-15 | 2021-05-13 |
| 1 | -6443842139930306849 | 13 | 0 | Amtrak | https://content.amtrak.com/content/gtfs/GTFS.zip | True | 13__0 | Amtrak (0) | Amtrak | http://www.amtrak.com | en | | | | | 2022-03-05 | | False | 2022-03-04 | 2099-01-01 |
| 2 | 3220016414264074104 | 13 | 0 | Amtrak | https://content.amtrak.com/content/gtfs/GTFS.zip | True | 13__0 | Amtrak (0) | Amtrak | http://www.amtrak.com | en | | | | | 2022-02-28 | | False | 2022-02-28 | 2022-03-04 |
| 3 | -1246851374109274803 | 13 | 0 | Amtrak | https://storage.googleapis.com/gtfs-data/sched... | True | 13__0 | Amtrak (0) | Amtrak | http://www.amtrak.com | en | | | | | 2021-09-04 | | False | 2021-10-19 | 2022-02-28 |
| 4 | -4817923854551428448 | 16 | 0 | Antelope Valley Transit Authority | https://www.avta.com/userfiles/files/AVTA%20GT... | True | 16__0 | Antelope Valley Transit Authority (0) | AVTA | http://www.avta.com | en | | 20210920 | | | 2021-09-23 | 2022-05-31 | False | 2021-12-15 | 2099-01-01 |
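
`query_sql` requires access to the warehouse. To experiment with the same SQL-to-DataFrame pattern without it, the sketch below is a rough stand-in that uses an in-memory SQLite database and `pandas.read_sql_query`. This is only an analogy and not how `query_sql` is implemented; the local table and the `sample_feeds` name are assumptions made for illustration, with a few values copied from the output above.

```python
import sqlite3

import pandas as pd

# Hypothetical local sample, with values copied from the output above
sample_feeds = pd.DataFrame(
    {
        "calitp_itp_id": [8, 13, 16],
        "calitp_url_number": [1, 0, 0],
        "calitp_agency_name": [
            "Monterey-Salinas Transit",
            "Amtrak",
            "Antelope Valley Transit Authority",
        ],
    }
)

# Load the sample into an in-memory SQLite table (a stand-in for the warehouse)
conn = sqlite3.connect(":memory:")
sample_feeds.to_sql("gtfs_schedule_dim_feeds", conn, index=False)

# Turn a SQL query into a pandas DataFrame
df_local = pd.read_sql_query(
    "SELECT * FROM gtfs_schedule_dim_feeds LIMIT 2",
    conn,
)
conn.close()

print(df_local)
```

The result is an ordinary pandas DataFrame, so methods such as `.head()` work on it just as they do on the DataFrame returned by `query_sql(..., as_df=True)`.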


(siuba)=
## siuba
`siuba` is a tool that allows the same analysis code to run on a pandas DataFrame,
as well as generate SQL for different databases.
It supports most [pandas Series methods](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) analysts use. See the [siuba docs](https://siuba.readthedocs.io) for more information.

The examples below go through the basics of using siuba, collecting a database query to a local DataFrame,
and showing the SQL queries that siuba code generates.

### Basic query

```python
from calitp.tables import tbl
from siuba import _, filter, count, collect, show_query

# query the feeds dimension table, filter for a single GTFS feed
# (ITP ID 10, URL number 0), and then count how often each feed_key occurs
(tbl.views.gtfs_schedule_dim_feeds()
    >> filter(_.calitp_itp_id == 10, _.calitp_url_number == 0)
    >> count(_.feed_key)
)
```

| | feed_key | n |
|---|---|---|
| 0 | -74403229883010320 | 1 |
| 1 | 4353298498747921001 | 1 |
| 2 | 1619926170103152824 | 1 |
| 3 | 2570854701378106641 | 1 |
| 4 | -5013919702465349414 | 1 |


### Collect query results
Note that siuba by default prints out a preview of the SQL query results.
In order to fetch the results of the query as a pandas DataFrame, run `collect()`.

```python
tbl_agency_names = tbl.views.gtfs_schedule_dim_feeds() >> collect()

# Use pandas .head() method to show first 5 rows of data
tbl_agency_names.head()
```


| | feed_key | calitp_itp_id | calitp_url_number | calitp_agency_name | calitp_gtfs_schedule_url | calitp_id_in_latest | calitp_feed_id | calitp_feed_name | feed_publisher_name | feed_publisher_url | feed_lang | default_lang | feed_version | feed_contact_email | feed_contact_url | feed_start_date | feed_end_date | is_composite_feed | calitp_extracted_at | calitp_deleted_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3971869378563233510 | 8 | 1 | Monterey-Salinas Transit | http://www.mst.org/google/google_transit.zip | False | 8__1 | Monterey-Salinas Transit (1) | MST | http://www.mst.org | en | | 202101-February2021v00 | | | | | False | 2021-04-15 | 2021-05-13 |
| 1 | -6443842139930306849 | 13 | 0 | Amtrak | https://content.amtrak.com/content/gtfs/GTFS.zip | True | 13__0 | Amtrak (0) | Amtrak | http://www.amtrak.com | en | | | | | 2022-03-05 | | False | 2022-03-04 | 2099-01-01 |
| 2 | 3220016414264074104 | 13 | 0 | Amtrak | https://content.amtrak.com/content/gtfs/GTFS.zip | True | 13__0 | Amtrak (0) | Amtrak | http://www.amtrak.com | en | | | | | 2022-02-28 | | False | 2022-02-28 | 2022-03-04 |
| 3 | -1246851374109274803 | 13 | 0 | Amtrak | https://storage.googleapis.com/gtfs-data/sched... | True | 13__0 | Amtrak (0) | Amtrak | http://www.amtrak.com | en | | | | | 2021-09-04 | | False | 2021-10-19 | 2022-02-28 |
| 4 | -4817923854551428448 | 16 | 0 | Antelope Valley Transit Authority | https://www.avta.com/userfiles/files/AVTA%20GT... | True | 16__0 | Antelope Valley Transit Authority (0) | AVTA | http://www.avta.com | en | | 20210920 | | | 2021-09-23 | 2022-05-31 | False | 2021-12-15 | 2099-01-01 |


### Show query SQL

While `collect()` fetches query results, `show_query()` prints out the SQL code that siuba generates.

```python
(tbl.views.gtfs_schedule_dim_feeds()
  >> filter(_.calitp_agency_name.str.contains("Metro"))
  >> show_query(simplify=True)
)
```


SELECT `anon_1`.`feed_key`, `anon_1`.`calitp_itp_id`, `anon_1`.`calitp_url_number`, `anon_1`.`calitp_agency_name`, `anon_1`.`calitp_gtfs_schedule_url`, `anon_1`.`calitp_id_in_latest`, `anon_1`.`calitp_feed_id`, `anon_1`.`calitp_feed_name`, `anon_1`.`feed_publisher_name`, `anon_1`.`feed_publisher_url`, `anon_1`.`feed_lang`, `anon_1`.`default_lang`, `anon_1`.`feed_version`, `anon_1`.`feed_contact_email`, `anon_1`.`feed_contact_url`, `anon_1`.`feed_start_date`, `anon_1`.`feed_end_date`, `anon_1`.`is_composite_feed`, `anon_1`.`calitp_extracted_at`, `anon_1`.`calitp_deleted_at` 
FROM (SELECT feed_key, calitp_itp_id, calitp_url_number, calitp_agency_name, calitp_gtfs_schedule_url, calitp_id_in_latest, calitp_feed_id, calitp_feed_name, feed_publisher_name, feed_publisher_url, feed_lang, default_lang, feed_version, feed_contact_email, feed_contact_url, feed_start_date, feed_end_date, is_composite_feed, calitp_extracted_at, calitp_deleted_at 
FROM `views.gtfs_schedule_dim_feeds`) AS `anon_1` 
W

| | feed_key | calitp_itp_id | calitp_url_number | calitp_agency_name | calitp_gtfs_schedule_url | calitp_id_in_latest | calitp_feed_id | calitp_feed_name | feed_publisher_name | feed_publisher_url | feed_lang | default_lang | feed_version | feed_contact_email | feed_contact_url | feed_start_date | feed_end_date | is_composite_feed | calitp_extracted_at | calitp_deleted_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -5156377554606803196 | 293 | 0 | Santa Barbara Metropolitan Transit District | http://sbmtd.gov/google_transit/feed.zip | True | 293__0 | Santa Barbara Metropolitan Transit District (0) | Santa Barbara MTD | http://www.sbmtd.gov | en | | 220425 April 25_20220331 | | | 2022-04-25 | 2022-08-14 | False | 2022-04-04 | 2099-01-01 |
| 1 | -4045042305146455543 | 296 | 0 | Santa Cruz Metropolitan Transit District | http://scmtd.com/google_transit/google_transit... | True | 296__0 | Santa Cruz Metropolitan Transit District (0) | Santa Cruz Metro | http://www.scmtd.com | en | | | gtfsupdates@scmtd.com | https://scmtd.com/en/riders-guide/google-trans... | | | False | 2021-09-07 | 2021-12-10 |
| 2 | -6941377690184990428 | 296 | 0 | Santa Cruz Metropolitan Transit District | http://scmtd.com/google_transit/google_transit... | True | 296__0 | Santa Cruz Metropolitan Transit District (0) | Santa Cruz Metro | http://www.scmtd.com | en | | | gtfsupdates@scmtd.com | https://www.scmtd.com/en/riders-guide/about-go... | | | False | 2021-12-10 | 2099-01-01 |
| 3 | -4127903740760437821 | 323 | 0 | Metrolink | https://www.metrolinktrains.com/globalassets/a... | True | 323__0 | Metrolink (0) | Metrolink Trains | http://www.metrolinktrains.com | en | | 20210621 | | | | | False | 2021-08-10 | 2021-10-28 |
| 4 | 3980840321914839605 | 323 | 0 | Metrolink | https://www.metrolinktrains.com/globalassets/a... | True | 323__0 | Metrolink (0) | Metrolink Trains | http://www.metrolinktrains.com | en | | 20210423 | | | | | False | 2021-05-24 | 2021-08-10 |


Note that here the pandas Series method `str.contains` corresponds to `regexp_contains` in Google BigQuery.
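
Both `str.contains` (with its default `regex=True`) and BigQuery's `regexp_contains` treat the pattern as a regular expression and report a partial match anywhere in the string. The sketch below illustrates this on the pandas side only, using a hypothetical `agency_names` Series built from a few names that appear in this page's outputs.

```python
import pandas as pd

# Hypothetical sample of agency names taken from the outputs on this page
agency_names = pd.Series(
    [
        "Santa Barbara Metropolitan Transit District",
        "Metrolink",
        "Amtrak",
    ]
)

# The pattern is a regular expression and may match anywhere in the string
partial_match = agency_names.str.contains("Metro")

# Anchoring with ^ keeps only names that start with "Metro"
anchored_match = agency_names.str.contains("^Metro")

print(partial_match.tolist())
print(anchored_match.tolist())
```

If a search term contains regex metacharacters such as `(` or `.`, escape them (for example with `re.escape`) so that the pattern still means what you intend.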

(more-siuba-resources)=
### More siuba Resources
* [siuba docs](https://siuba.readthedocs.io)
* ['Tidy Tuesday' live analyses with siuba](https://www.youtube.com/playlist?list=PLiQdjX20rXMHc43KqsdIowHI3ouFnP_Sf)

## shared utils
A set of shared utility functions can also be installed like any other Python library. The functions are stored in [shared_utils](https://github.com/cal-itp/data-analyses/shared_utils). Generalized functions for analysis are added as collaborative work evolves, so we aren't constantly reinventing the wheel.

```bash
# In terminal:
cd data-analyses/_shared_utils

# Use the make command to run through conda install and pip install
make setup_env
```

```python
# In notebook:
import shared_utils

shared_utils.geography_utils.WGS84

# Note: you may need to select Kernel -> Restart Kernel from the top menu
# after make setup_env in order to successfully import shared_utils
```

See [data-analyses/example_report](https://github.com/cal-itp/data-analyses/tree/main/example_report) for examples of how to use `shared_utils` for [general functions](https://github.com/cal-itp/data-analyses/blob/main/example_report/shared_utils_examples.ipynb), [charts](https://github.com/cal-itp/data-analyses/blob/main/example_report/example_charts.ipynb), and [maps](https://github.com/cal-itp/data-analyses/blob/main/example_report/example_maps.ipynb).

(pandas-resources)=
## pandas
pandas is widely used in data analysis, and the external resource below provides a brief overview of its use.

* [Cheat Sheet - pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

## Add New Packages

While most Python packages an analyst uses come preinstalled in JupyterHub, there may be additional packages you'll want to use in your analysis.

* Install [shared utility functions](#shared-utils)
* Change directory into the project task's subfolder and add `requirements.txt` and/or `conda-requirements.txt`
* Run `pip install -r requirements.txt` and/or `conda install --yes -c conda-forge --file conda-requirements.txt`
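
Before adding a package to `requirements.txt`, it can help to check whether it is already importable in the current environment. The sketch below uses only the standard library; the helper name `missing_packages` and the placeholder package name are hypothetical. Keep in mind that a package's import name can differ from the name used to install it.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the top-level import names that are not available in this environment."""
    return [name for name in import_names if find_spec(name) is None]


# "json" ships with Python; the second name is a placeholder that should not exist
candidates = ["json", "hypothetical_pkg_not_installed_xyz"]

print(missing_packages(candidates))
```

Only the names returned by the helper would need to go into `requirements.txt` or `conda-requirements.txt`.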