(python-libraries)=
# Python Libraries
The following libraries are available and recommended for use by Cal-ITP data analysts.

## Table of Contents
1. [calitp](#calitp)
1. [siuba](#siuba)
<br> - [Basic Query](#basic-query)
<br> - [Collect Query Results](#collect-query-results)
<br> - [Show Query SQL](#show-query-sql)
<br> - [More siuba Resources](more-siuba-resources)
1. [shared utils](shared-utils)
1. [pandas](pandas-resources)
1. [Add New Packages](#add-new-packages)

(calitp)=
## calitp
`calitp` is an internal library of utility functions used to access our warehouse data.

### import tbl

Most notably, importing `tbl` at the top of your notebook lets you access tables from the warehouse as `tbl` objects:

```python
from calitp.tables import tbl
```

Example:

```python
from calitp.tables import tbl

tbl.views.gtfs_schedule_fact_daily_feed_routes()
```



| | feed_key | route_key | date | calitp_extracted_at | calitp_deleted_at |
|---|---|---|---|---|---|
| 0 | 7817634405777406718 | -2928154868513813160 | 2021-10-26 | 2021-10-06 | 2022-01-27 |
| 1 | 4491805378535272830 | 7137778547958573300 | 2022-06-08 | 2022-05-11 | 2022-06-09 |
| 2 | 5132421106685856008 | -599716898042426427 | 2021-05-24 | 2021-05-03 | 2021-10-13 |
| 3 | 4721774146993707119 | 8852572296140580860 | 2022-03-24 | 2021-12-21 | 2022-07-09 |
| 4 | -692593336326281124 | -7935709566553824035 | 2022-01-27 | 2022-01-05 | 2022-01-31 |


### query_sql

`query_sql` is another useful function to use inside of JupyterHub notebooks to turn a SQL query into a pandas DataFrame.

As an example, in a notebook:

```python
from calitp import query_sql

df_dim_feeds = query_sql("""
SELECT
    *
FROM `views.gtfs_schedule_dim_feeds`
LIMIT 10""", as_df=True)

df_dim_feeds.head()
```

| | feed_key | calitp_itp_id | calitp_url_number | calitp_agency_name | raw_gtfs_schedule_url | calitp_id_in_latest | calitp_feed_id | calitp_feed_name | feed_publisher_name | feed_publisher_url | feed_lang | default_lang | feed_version | feed_contact_email | feed_contact_url | feed_start_date | feed_end_date | is_composite_feed | calitp_extracted_at | calitp_deleted_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18414309913382105 | 1 | 2 | Glendale Beeline | http://glendaleca.gov/Home/ShowDocument?id=29549 | False | 1__2 | Glendale Beeline (2) | | | | | | | | | | False | 2021-04-15 | 2021-04-16 |
| 1 | -5674234595782173143 | 2 | 0 | Beach Cities Transit | https://www.redondo.org/civicax/filebank/blobd... | False | 2__0 | Beach Cities Transit (0) | | | | | | | | | | False | 2021-04-15 | 2021-05-13 |
| 2 | 6604405555889056708 | 2 | 5 | SamTrans | http://www.samtrans.com/Assets/GTFS/samtrans/S... | False | 2__5 | SamTrans (5) | | | | | | | | | | False | 2021-04-15 | 2021-04-16 |
| 3 | -5403103044674130453 | 3 | 0 | Commuter Express | http://lacitydot.com/gtfs/administrator/gtfszi... | False | 3__0 | Commuter Express (0) | | | | | | | | | | False | 2021-04-15 | 2021-05-13 |
| 4 | -6200903198154079414 | 5 | 0 | e-Tran | https://share.elkgrovecity.org/messages/tpLyue... | False | 5__0 | e-Tran (0) | | | | | | | | | | False | 2021-04-15 | 2021-05-13 |


(siuba)=
## siuba
`siuba` is a tool that allows the same analysis code to run on a pandas DataFrame,
as well as generate SQL for different databases.
It supports most [pandas Series methods](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) analysts use. See the [siuba docs](https://siuba.readthedocs.io) for more information.

The examples below go through the basics of using siuba, collecting a database query to a local DataFrame,
and showing the SQL queries that siuba code generates.

### Basic query

```python
from calitp.tables import tbl
from siuba import _, filter, count, collect, show_query

# query the feeds dimension table, filter for a single GTFS feed
# (ITP ID 10, URL number 0), and then count the rows for each feed_key
(tbl.views.gtfs_schedule_dim_feeds()
    >> filter(_.calitp_itp_id == 10, _.calitp_url_number == 0)
    >> count(_.feed_key)
)
```

| | feed_key | n |
|---|---|---|
| 0 | -5013919702465349414 | 1 |
| 1 | -74403229883010320 | 1 |
| 2 | -1803822485067769256 | 1 |
| 3 | -5768367084319898193 | 1 |
| 4 | 2570854701378106641 | 1 |


### Collect query results
Note that by default siuba only prints out a preview of the SQL query results.
To fetch the results of the query as a pandas DataFrame, pipe the query into `collect()`.

```python
tbl_agency_names = tbl.views.gtfs_schedule_dim_feeds() >> collect()

# Use pandas .head() method to show first 5 rows of data
tbl_agency_names.head()
```


| | feed_key | calitp_itp_id | calitp_url_number | calitp_agency_name | raw_gtfs_schedule_url | calitp_id_in_latest | calitp_feed_id | calitp_feed_name | feed_publisher_name | feed_publisher_url | feed_lang | default_lang | feed_version | feed_contact_email | feed_contact_url | feed_start_date | feed_end_date | is_composite_feed | calitp_extracted_at | calitp_deleted_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18414309913382105 | 1 | 2 | Glendale Beeline | http://glendaleca.gov/Home/ShowDocument?id=29549 | False | 1__2 | Glendale Beeline (2) | | | | | | | | | | False | 2021-04-15 | 2021-04-16 |
| 1 | -5674234595782173143 | 2 | 0 | Beach Cities Transit | https://www.redondo.org/civicax/filebank/blobd... | False | 2__0 | Beach Cities Transit (0) | | | | | | | | | | False | 2021-04-15 | 2021-05-13 |
| 2 | 6604405555889056708 | 2 | 5 | SamTrans | http://www.samtrans.com/Assets/GTFS/samtrans/S... | False | 2__5 | SamTrans (5) | | | | | | | | | | False | 2021-04-15 | 2021-04-16 |
| 3 | -5403103044674130453 | 3 | 0 | Commuter Express | http://lacitydot.com/gtfs/administrator/gtfszi... | False | 3__0 | Commuter Express (0) | | | | | | | | | | False | 2021-04-15 | 2021-05-13 |
| 4 | -6200903198154079414 | 5 | 0 | e-Tran | https://share.elkgrovecity.org/messages/tpLyue... | False | 5__0 | e-Tran (0) | | | | | | | | | | False | 2021-04-15 | 2021-05-13 |


### Show query SQL

While `collect()` fetches query results, `show_query()` prints out the SQL code that siuba generates.

```python
(tbl.views.gtfs_schedule_dim_feeds()
  >> filter(_.calitp_agency_name.str.contains("Metro"))
  >> show_query(simplify=True)
)
```


```sql
SELECT `anon_1`.`feed_key`, `anon_1`.`calitp_itp_id`, `anon_1`.`calitp_url_number`, `anon_1`.`calitp_agency_name`, `anon_1`.`raw_gtfs_schedule_url`, `anon_1`.`calitp_id_in_latest`, `anon_1`.`calitp_feed_id`, `anon_1`.`calitp_feed_name`, `anon_1`.`feed_publisher_name`, `anon_1`.`feed_publisher_url`, `anon_1`.`feed_lang`, `anon_1`.`default_lang`, `anon_1`.`feed_version`, `anon_1`.`feed_contact_email`, `anon_1`.`feed_contact_url`, `anon_1`.`feed_start_date`, `anon_1`.`feed_end_date`, `anon_1`.`is_composite_feed`, `anon_1`.`calitp_extracted_at`, `anon_1`.`calitp_deleted_at`
FROM (SELECT feed_key, calitp_itp_id, calitp_url_number, calitp_agency_name, raw_gtfs_schedule_url, calitp_id_in_latest, calitp_feed_id, calitp_feed_name, feed_publisher_name, feed_publisher_url, feed_lang, default_lang, feed_version, feed_contact_email, feed_contact_url, feed_start_date, feed_end_date, is_composite_feed, calitp_extracted_at, calitp_deleted_at
FROM `views.gtfs_schedule_dim_feeds`) AS `anon_1`
WHERE r
```

| | feed_key | calitp_itp_id | calitp_url_number | calitp_agency_name | raw_gtfs_schedule_url | calitp_id_in_latest | calitp_feed_id | calitp_feed_name | feed_publisher_name | feed_publisher_url | feed_lang | default_lang | feed_version | feed_contact_email | feed_contact_url | feed_start_date | feed_end_date | is_composite_feed | calitp_extracted_at | calitp_deleted_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3365907141671447842 | 278 | 0 | San Diego Metropolitan Transit System | https://www.sdmts.com/google_transit_files/goo... | True | 278__0 | San Diego Metropolitan Transit System (0) | MTS | http://www.sdmts.com | EN | | v1.1 Effective June 12 2022 | | | | | False | 2022-05-19 | 2022-06-27 |
| 1 | -2403866816282273791 | 278 | 0 | San Diego Metropolitan Transit System | https://www.sdmts.com/google_transit_files/goo... | True | 278__0 | San Diego Metropolitan Transit System (0) | MTS | http://www.sdmts.com | EN | | v1.1 - Nov 21 2021 svc change unmerged + remov... | | | | | False | 2021-11-21 | 2021-11-23 |
| 2 | -6451044529006095394 | 278 | 0 | San Diego Metropolitan Transit System | https://www.sdmts.com/google_transit_files/goo... | True | 278__0 | San Diego Metropolitan Transit System (0) | MTS | http://www.sdmts.com | EN | | v4 - SVCC October 25 Changes | | | | | False | 2021-10-21 | 2021-11-03 |
| 3 | 8344144664609790510 | 278 | 0 | San Diego Metropolitan Transit System | https://www.sdmts.com/google_transit_files/goo... | True | 278__0 | San Diego Metropolitan Transit System (0) | MTS | http://www.sdmts.com | EN | | v1.5 Sept 5. 2021 svc change merged file new o... | | | | | False | 2021-08-28 | 2021-09-05 |
| 4 | -5216492229531861956 | 278 | 0 | San Diego Metropolitan Transit System | https://www.sdmts.com/google_transit_files/goo... | True | 278__0 | San Diego Metropolitan Transit System (0) | MTS | http://www.sdmts.com | EN | | v1 merged 2201 & 2111(v5) | | | | | False | 2022-01-13 | 2022-01-30 |


Note that here the pandas Series method `str.contains` corresponds to `regexp_contains` in Google BigQuery.
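
For comparison, here is a minimal local sketch of how `str.contains` behaves in plain pandas. The agency names are illustrative values, not a query result. By default pandas treats the pattern as a regular expression, which is why a regex-based function is the natural SQL counterpart.

```python
import pandas as pd

# Illustrative agency names (not pulled from the warehouse).
agency_names = pd.Series(["LA Metro", "SamTrans", "Sacramento Regional Transit"])

# Plain substring match, as in the siuba filter above.
has_metro = agency_names.str.contains("Metro")
print(has_metro.tolist())  # [True, False, False]

# The pattern is a regular expression by default, so "|" means "or".
metro_or_samtrans = agency_names.str.contains("Metro|SamTrans")
print(metro_or_samtrans.tolist())  # [True, True, False]

# With regex=False the pattern is matched literally instead.
literal_match = agency_names.str.contains("Metro|SamTrans", regex=False)
print(literal_match.tolist())  # [False, False, False]
```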

(more-siuba-resources)=
### More siuba Resources
* [siuba docs](https://siuba.readthedocs.io)
* ['Tidy Tuesday' live analyses with siuba](https://www.youtube.com/playlist?list=PLiQdjX20rXMHc43KqsdIowHI3ouFnP_Sf)

(shared-utils)=
## shared utils
A set of shared utility functions can also be installed, like any other Python library. The [shared_utils](https://github.com/cal-itp/data-analyses/tree/main/_shared_utils) package is stored in the `data-analyses` repository. Generalized functions for analysis are added as collaborative work evolves so we aren't constantly reinventing the wheel.

```bash
# In terminal:
cd data-analyses/_shared_utils

# Use the make command to run through conda install and pip install
make setup_env
```

```python
# In notebook:
# Note: you may need to select Kernel -> Restart Kernel from the top menu
# after make setup_env in order to successfully import shared_utils
import shared_utils

shared_utils.geography_utils.WGS84
```

See [data-analyses/example_report](https://github.com/cal-itp/data-analyses/tree/main/example_report) for examples of how to use `shared_utils` for general functions, charts, and maps.

(pandas-resources)=
## pandas
The pandas library is very commonly used in data analysis, and the external resources below provide a brief overview of its use.

* [Cheat Sheet - pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

## Add New Packages

While most Python packages an analyst uses come preinstalled in JupyterHub, there may be additional packages you'll want to use in your analysis.

* Install [shared utility functions](#shared-utils)
* Change directory into the project task's subfolder and add `requirements.txt` and/or `conda-requirements.txt`
* Run `pip install -r requirements.txt` and/or `conda install --yes -c conda-forge --file conda-requirements.txt`
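
After running the install commands above, one way to confirm from a notebook that a package is available is to check for it with the standard library. This is only a sketch: `package_status` is a hypothetical helper written for this example, not part of `calitp` or `shared_utils`, and the package names passed to it are placeholders.

```python
from importlib import metadata, util


def package_status(name: str) -> str:
    """Describe whether the top-level package `name` can be imported."""
    # find_spec returns None when no importable module has this name.
    if util.find_spec(name) is None:
        return f"{name}: not installed"
    try:
        # The distribution name can differ from the import name,
        # so a missing version does not mean the import will fail.
        return f"{name}: installed, version {metadata.version(name)}"
    except metadata.PackageNotFoundError:
        return f"{name}: importable, version unknown"


# Placeholder names; substitute the packages from your requirements file.
for package in ["json", "surely_not_a_real_package_xyz"]:
    print(package_status(package))
```

If a freshly installed package still shows as not installed, restarting the notebook kernel is usually worth trying first.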