(python-libraries)=
# Useful Python Libraries
The following libraries are available and recommended for use by Cal-ITP data analysts.

## Table of Contents
1. [shared utils](#shared-utils)
1. [calitp](#calitp)
1. [siuba](#siuba)
    - [Basic Query](#basic-query)
    - [Collect Query Results](#collect-query-results)
    - [Show Query SQL](#show-query-sql)
    - [More siuba Resources](more-siuba-resources)
1. [pandas](pandas-resources)
1. [Add New Packages](#add-new-packages)

(shared-utils)=
## shared utils
A set of shared utility functions can be installed like any other Python library. These functions live in the [shared_utils](https://github.com/cal-itp/data-analyses/shared_utils) folder of the `data-analyses` repository. Generalized functions for analysis are added as collaborative work evolves, so we aren't constantly reinventing the wheel.


### In terminal:
* Navigate to the package folder: `cd data-analyses/_shared_utils`
* Use the make command to run through conda install and pip install: `make setup_env`
    * Note: you may need to select Kernel -> Restart Kernel from the top menu after running `make setup_env` in order to successfully import `shared_utils`
* Alternative: add an `alias` to your `.bash_profile`:
    * In terminal use `cd` to navigate to the home directory (not a repository)
    * Type `nano .bash_profile` to open `.bash_profile` in a text editor
    * Add a line at the end: `alias go='cd ~/data-analyses/portfolio && pip install -r requirements.txt && cd ../_shared_utils && make setup_env && cd ..'`
    * Exit with Ctrl+X, press `Y` to save, then press Enter at the filename prompt
    * Restart your server; you can check your changes with `cat .bash_profile`


### In notebook:
```python
import shared_utils

# Example of using shared_utils
shared_utils.geography_utils.WGS84
```

See [data-analyses/example_report](https://github.com/cal-itp/data-analyses/tree/main/example_report) for examples of how to use `shared_utils` for general functions, charts, and maps.

(calitp)=
## calitp
`calitp` is an internal library of utility functions used to access our warehouse data.

### import tbls

Most notably, you can import `tbls` at the top of your notebook to access tables from the warehouse through the `tbls` object:

```python
from calitp_data_analysis.tables import tbls
```

Example:

```python
from calitp_data_analysis.tables import tbls

tbls.views.gtfs_schedule_fact_daily_feed_routes()
```

|  | feed_key | route_key | date | calitp_extracted_at | calitp_deleted_at |
| --- | --- | --- | --- | --- | --- |
| 0 | -7983989138583547129 | -3290508603988508141 | 2021-06-20 | 2021-05-25 | 2021-08-10 |
| 1 | -2471333877807076817 | 5352800139218184492 | 2022-02-10 | 2022-02-09 | 2022-05-19 |
| 2 | -6443842139930306849 | 5907486868358039588 | 2022-03-10 | 2022-03-04 | 2022-05-09 |
| 3 | 2769981974308579277 | 1882740876604807449 | 2022-03-10 | 2022-02-22 | 2022-03-18 |
| 4 | -8241857340783144053 | -5875501989002539466 | 2021-06-20 | 2021-06-04 | 2021-07-04 |


### query_sql

`query_sql` is another useful function for JupyterHub notebooks: it turns a SQL query into a pandas DataFrame.

As an example, in a notebook:

```python
from calitp_data_analysis.sql import query_sql

df_dim_feeds = query_sql("""
SELECT
    *
FROM `views.gtfs_schedule_dim_feeds`
LIMIT 10""", as_df=True)

df_dim_feeds.head()
```

|  | feed_key | calitp_itp_id | calitp_url_number | calitp_agency_name | raw_gtfs_schedule_url | calitp_id_in_latest | calitp_feed_id | calitp_feed_name | feed_publisher_name | feed_publisher_url | ... | default_lang | feed_version | feed_contact_email | feed_contact_url | feed_start_date | feed_end_date | rank | is_composite_feed | calitp_extracted_at | calitp_deleted_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 18414309913382105 | 1 | 2 | Glendale Beeline | http://glendaleca.gov/Home/ShowDocument?id=29549 | False | 1__2 | Glendale Beeline (2) |  |  | ... |  |  |  |  |  |  |  | False | 2021-04-15 | 2021-04-16 |
| 1 | -5674234595782173143 | 2 | 0 | Beach Cities Transit | https://www.redondo.org/civicax/filebank/blobd... | False | 2__0 | Beach Cities Transit (0) |  |  | ... |  |  |  |  |  |  |  | False | 2021-04-15 | 2021-05-13 |
| 2 | 6604405555889056708 | 2 | 5 | SamTrans | http://www.samtrans.com/Assets/GTFS/samtrans/S... | False | 2__5 | SamTrans (5) |  |  | ... |  |  |  |  |  |  |  | False | 2021-04-15 | 2021-04-16 |
| 3 | -5403103044674130453 | 3 | 0 | Commuter Express | http://lacitydot.com/gtfs/administrator/gtfszi... | False | 3__0 | Commuter Express (0) |  |  | ... |  |  |  |  |  |  |  | False | 2021-04-15 | 2021-05-13 |
| 4 | -6200903198154079414 | 5 | 0 | e-Tran | https://share.elkgrovecity.org/messages/tpLyue... | False | 5__0 | e-Tran (0) |  |  | ... |  |  |  |  |  |  |  | False | 2021-04-15 | 2021-05-13 |

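To see what turning a SQL query into a DataFrame looks like without warehouse access, the sketch below uses an in-memory SQLite database and `pandas.read_sql_query` as a stand-in. It is not how `query_sql` is implemented, and the table name `dim_feeds` and its rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the warehouse: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_feeds (calitp_itp_id INTEGER, calitp_agency_name TEXT)"
)
conn.executemany(
    "INSERT INTO dim_feeds VALUES (?, ?)",
    [(1, "Agency A"), (2, "Agency B"), (3, "Agency C")],
)

# Run a SQL query and load the result into a pandas DataFrame.
df_feeds = pd.read_sql_query(
    "SELECT * FROM dim_feeds ORDER BY calitp_itp_id LIMIT 2", conn
)
conn.close()

print(df_feeds)
```

As with `query_sql(..., as_df=True)`, the result is an ordinary DataFrame, so methods such as `.head()` work on it directly.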

(siuba)=
## siuba
`siuba` is a tool that allows the same analysis code to run on a pandas DataFrame,
as well as generate SQL for different databases.
It supports most [pandas Series methods](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) analysts use. See the [siuba docs](https://siuba.readthedocs.io) for more information.

The examples below go through the basics of using siuba, collecting a database query to a local DataFrame,
and showing the SQL queries that siuba code generates.

### Basic query

```python
from calitp_data_analysis.tables import tbls
from siuba import _, filter, count, collect, show_query

# query the feeds table, filter for a single GTFS feed,
# and then count how often each feed_key occurs
(tbls.views.gtfs_schedule_dim_feeds()
    >> filter(_.calitp_itp_id == 10, _.calitp_url_number==0)
    >> count(_.feed_key)
)
```

|  | feed_key | n |
| --- | --- | --- |
| 0 | 4353298498747921001 | 1 |
| 1 | -5768367084319898193 | 1 |
| 2 | 2200762960394821566 | 1 |
| 3 | -5013919702465349414 | 1 |
| 4 | 2570854701378106641 | 1 |

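For comparison, here is a rough plain-pandas sketch of the same filter-then-count logic on a local DataFrame. The rows are made up for illustration, and the row order of the result may differ from what siuba returns.

```python
import pandas as pd

# Made-up rows that mimic a few columns of the feeds table.
feeds = pd.DataFrame(
    {
        "feed_key": [101, 102, 102, 103, 104],
        "calitp_itp_id": [10, 10, 10, 10, 11],
        "calitp_url_number": [0, 0, 0, 1, 0],
    }
)

# filter(...): keep rows where both conditions hold.
single_feed = feeds[
    (feeds["calitp_itp_id"] == 10) & (feeds["calitp_url_number"] == 0)
]

# count(_.feed_key): number of rows per feed_key, stored in a column named "n".
counts = single_feed.groupby("feed_key").size().reset_index(name="n")

print(counts)
```

The siuba version reads top to bottom as a pipeline, while the pandas version spells out the boolean mask and the group-by explicitly.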

### Collect query results
Note that siuba by default prints out a preview of the SQL query results.
In order to fetch the results of the query as a pandas DataFrame, run `collect()`.

```python
tbl_agency_names = tbls.views.gtfs_schedule_dim_feeds() >> collect()

# Use pandas .head() method to show first 5 rows of data
tbl_agency_names.head()
```


|  | feed_key | calitp_itp_id | calitp_url_number | calitp_agency_name | raw_gtfs_schedule_url | calitp_id_in_latest | calitp_feed_id | calitp_feed_name | feed_publisher_name | feed_publisher_url | ... | default_lang | feed_version | feed_contact_email | feed_contact_url | feed_start_date | feed_end_date | rank | is_composite_feed | calitp_extracted_at | calitp_deleted_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 18414309913382105 | 1 | 2 | Glendale Beeline | http://glendaleca.gov/Home/ShowDocument?id=29549 | False | 1__2 | Glendale Beeline (2) |  |  | ... |  |  |  |  |  |  |  | False | 2021-04-15 | 2021-04-16 |
| 1 | -5674234595782173143 | 2 | 0 | Beach Cities Transit | https://www.redondo.org/civicax/filebank/blobd... | False | 2__0 | Beach Cities Transit (0) |  |  | ... |  |  |  |  |  |  |  | False | 2021-04-15 | 2021-05-13 |
| 2 | 6604405555889056708 | 2 | 5 | SamTrans | http://www.samtrans.com/Assets/GTFS/samtrans/S... | False | 2__5 | SamTrans (5) |  |  | ... |  |  |  |  |  |  |  | False | 2021-04-15 | 2021-04-16 |
| 3 | -5403103044674130453 | 3 | 0 | Commuter Express | http://lacitydot.com/gtfs/administrator/gtfszi... | False | 3__0 | Commuter Express (0) |  |  | ... |  |  |  |  |  |  |  | False | 2021-04-15 | 2021-05-13 |
| 4 | -6200903198154079414 | 5 | 0 | e-Tran | https://share.elkgrovecity.org/messages/tpLyue... | False | 5__0 | e-Tran (0) |  |  | ... |  |  |  |  |  |  |  | False | 2021-04-15 | 2021-05-13 |


### Show query SQL

While `collect()` fetches query results, `show_query()` prints out the SQL code that siuba generates.

```python
(tbls.views.gtfs_schedule_dim_feeds()
  >> filter(_.calitp_agency_name.str.contains("Metro"))
  >> show_query(simplify=True)
)
```


```sql
SELECT *
FROM `views.gtfs_schedule_dim_feeds` AS `views.gtfs_schedule_dim_feeds_1`
WHERE regexp_contains(`views.gtfs_schedule_dim_feeds_1`.`calitp_agency_name`, 'Metro')
```


|  | feed_key | calitp_itp_id | calitp_url_number | calitp_agency_name | raw_gtfs_schedule_url | calitp_id_in_latest | calitp_feed_id | calitp_feed_name | feed_publisher_name | feed_publisher_url | ... | default_lang | feed_version | feed_contact_email | feed_contact_url | feed_start_date | feed_end_date | rank | is_composite_feed | calitp_extracted_at | calitp_deleted_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 8654575366107518227 | 278 | 0 | San Diego Metropolitan Transit System | https://www.sdmts.com/google_transit_files/goo... | True | 278__0 | San Diego Metropolitan Transit System (0) | MTS | http://www.sdmts.com | ... |  | v1 - Nov 21 2021 svc change merged Sept 5 2021 v4 |  |  |  |  | 1 | False | 2021-11-03 | 2021-11-21 |
| 1 | 8795779120208877258 | 278 | 0 | San Diego Metropolitan Transit System | https://www.sdmts.com/google_transit_files/goo... | True | 278__0 | San Diego Metropolitan Transit System (0) | MTS | http://www.sdmts.com | ... |  | v1 merged - 2206v2 & 2209v1.2 |  |  |  |  | 1 | False | 2022-08-16 | 2022-08-25 |
| 2 | -5216492229531861956 | 278 | 0 | San Diego Metropolitan Transit System | https://www.sdmts.com/google_transit_files/goo... | True | 278__0 | San Diego Metropolitan Transit System (0) | MTS | http://www.sdmts.com | ... |  | v1 merged 2201 & 2111(v5) |  |  |  |  | 1 | False | 2022-01-13 | 2022-01-30 |
| 3 | 6592067446191147968 | 278 | 0 | San Diego Metropolitan Transit System | https://www.sdmts.com/google_transit_files/goo... | True | 278__0 | San Diego Metropolitan Transit System (0) | MTS | http://www.sdmts.com | ... |  | v1.1 Effective Jan 30 2022 |  |  |  |  | 1 | False | 2022-01-30 | 2022-02-08 |
| 4 | 3077664894694920933 | 278 | 0 | San Diego Metropolitan Transit System | https://www.sdmts.com/google_transit_files/goo... | True | 278__0 | San Diego Metropolitan Transit System (0) | MTS | http://www.sdmts.com | ... |  | v1.3.1: 1.2 pathways + GTFS Fares v2 rev 1 |  |  |  |  | 1 | False | 2022-07-25 | 2022-08-15 |


Note that here the pandas Series method `str.contains` corresponds to `regexp_contains` in Google BigQuery.

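The mapping to a regular-expression function makes sense because pandas `str.contains` treats its pattern as a regular expression by default. A minimal local sketch, using a few agency names from the outputs above as sample data:

```python
import pandas as pd

agency_names = pd.Series(
    ["San Diego Metropolitan Transit System", "SamTrans", "Beach Cities Transit"]
)

# By default the pattern is interpreted as a regular expression.
has_metro = agency_names.str.contains("Metro")

# Regex syntax works too, e.g. anchoring the match to the start of the string.
starts_with_sam = agency_names.str.contains("^Sam")

print(agency_names[has_metro].tolist())
print(agency_names[starts_with_sam].tolist())
```

Passing `regex=False` to `str.contains` switches to literal substring matching in pandas; whether siuba translates that option to SQL is not covered here.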
(more-siuba-resources)=
### More siuba Resources:
* [siuba docs](https://siuba.readthedocs.io)
* ['Tidy Tuesday' live analyses with siuba](https://www.youtube.com/playlist?list=PLiQdjX20rXMHc43KqsdIowHI3ouFnP_Sf)


(pandas-resources)=
## pandas
The pandas library is widely used in data analysis, and the external resource below provides a brief overview of its use.

* [Cheat Sheet - pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

## Add New Packages

While most Python packages an analyst uses come preinstalled in JupyterHub, there may be additional packages you'll want to use in your analysis.

* Install [shared utility functions](#shared-utils)
* Change directory into the project task's subfolder and add `requirements.txt` and/or `conda-requirements.txt`
* Run `pip install -r requirements.txt` and/or `conda install --yes -c conda-forge --file conda-requirements.txt`
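
Before adding a package to `requirements.txt`, it can help to check whether it is already importable in your environment. The sketch below uses only the standard library. The helper name `is_installed` and the second package name are hypothetical, and note that a package's import name can differ from its name on PyPI.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be imported in the current environment."""
    return importlib.util.find_spec(package_name) is not None


# "json" ships with Python, so it is always found; the second name is a
# deliberately nonexistent placeholder.
for name in ["json", "some_package_that_does_not_exist"]:
    status = "already installed" if is_installed(name) else "not installed"
    print(f"{name}: {status}")
```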