(python-libraries)=
# Useful Python Libraries
The following libraries are available and recommended for use by Cal-ITP data analysts.

## Table of Contents
1. [shared utils](#shared-utils)
1. [calitp-data-analysis](#calitp-data-analysis)
1. [siuba](#siuba)
<br> - [Basic Query](#basic-query)
<br> - [Collect Query Results](#collect-query-results)
<br> - [Show Query SQL](#show-query-sql)
<br> - [More siuba Resources](#more-siuba-resources)
1. [pandas](#pandas-resources)
1. [Add New Packages](#add-new-packages)

(shared-utils)=
## shared utils
A set of shared utility functions can be installed like any other Python library. The [shared_utils](https://github.com/cal-itp/data-analyses/shared_utils) package holds generalized functions for analysis, which are added as collaborative work evolves so that we aren't constantly reinventing the wheel.


### In terminal:
* Navigate to the package folder: `cd data-analyses/_shared_utils`
* Use the make command to run through conda install and pip install: `make setup_env`
    * Note: after running `make setup_env`, you may need to select Kernel -> Restart Kernel from the top menu before `shared_utils` can be imported successfully
* Alternative: add an `alias` to your `.bash_profile`:
    * In terminal use `cd` to navigate to the home directory (not a repository)
    * Type `nano .bash_profile` to open `.bash_profile` in a text editor
    * Add a line at end: `alias go='cd ~/data-analyses/portfolio && pip install -r requirements.txt && cd ../_shared_utils && make setup_env && cd ..'`
    * Exit with Ctrl+X, press Y to save, then press Enter at the filename prompt
    * Restart your server; you can check your changes with `cat .bash_profile`


### In notebook:
```python
import shared_utils

# Example of using shared_utils
shared_utils.geography_utils.WGS84
```

See [data-analyses/example_report](https://github.com/cal-itp/data-analyses/tree/main/example_report) for examples of how to use `shared_utils` for general functions, charts, and maps.

(calitp-data-analysis)=
## calitp-data-analysis
`calitp-data-analysis` is an internal library of utility functions used to access our warehouse data for analysis purposes.

### import tbls

Most notably, importing `tbls` at the top of your notebook gives you access to warehouse tables through the `tbls` object:

```python
from calitp_data_analysis.tables import tbls
```

Example:

```python
from calitp_data_analysis.tables import tbls

tbls.mart_gtfs.dim_agency()
```

| | key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _feed_valid_from | feed_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 323ff99feb3b3044e0cdaec49ef6545b | 2206b7c801ab674767c2f5c8d329d379 | 139 | Gold Coast Transit District | http://www.goldcoasttransit.org/ | America/Los_Angeles | en-US | (805) 487-4222 | https://www.gctd.org/fares-rider-guide/ | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2023-05-23 03:00:33.896861+00:00 | America/Los_Angeles |
| 1 | aca898ef7339d88b2f50deb6f5ff2a1d | 2206b7c801ab674767c2f5c8d329d379 | 142 | Thousand Oaks Transit | https://www.toaks.org/departments/public-works... | America/Los_Angeles | en-US | (805) 375-5473 | https://www.toaks.org/departments/public-works... | TOTransit@toaks.org | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2023-05-23 03:00:33.896861+00:00 | America/Los_Angeles |
| 2 | 1e327c2b6f77784866d08701c71bc9d7 | 2206b7c801ab674767c2f5c8d329d379 | 143 | VCTC Intercity | https://www.goventura.org/ | America/Los_Angeles | en-US | 800.438.1112 | https://www.goventura.org/ | ridercomments@goventura.org | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2023-05-23 03:00:33.896861+00:00 | America/Los_Angeles |
| 3 | 3cbe6d09312bb2989258bf3974c23b4b | 2206b7c801ab674767c2f5c8d329d379 | 144 | Kanan Shuttle | https://www.toaks.org/departments/public-works... | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2023-05-23 03:00:33.896861+00:00 | America/Los_Angeles |
| 4 | 8df055482dc146c5774630d86850eb6c | 2206b7c801ab674767c2f5c8d329d379 | 147 | Moorpark City Transit | https://www.moorparkca.gov/227/Bus-Ride-Guide | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2023-05-23 03:00:33.896861+00:00 | America/Los_Angeles |


### query_sql

`query_sql` is another useful function: inside a JupyterHub notebook, it turns a SQL query into a pandas DataFrame.

As an example, in a notebook:

```python
from calitp_data_analysis.sql import query_sql

df_dim_agency = query_sql("""
SELECT
    *
FROM `mart_gtfs.dim_agency`
LIMIT 10""", as_df=True)

df_dim_agency.head()
```

| | key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _feed_valid_from | feed_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bf4db6c7641e8730d408f439c02e9a2b | d8c64fa93c64e79e5fe98a15158fafbf | 148 | Ojai Trolley | https://ojaitrolley.com/ | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2022-12-01 03:00:18.078034+00:00 | America/Los_Angeles |
| 1 | 2834655cc5f39097ce19be03997d1a3e | afce6c3a0a20fb1c7f208f3d409e23f6 | 148 | Ojai Trolley | https://ojaitrolley.com/ | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2021-09-08 00:18:30+00:00 | America/Los_Angeles |
| 2 | d1c6e8e5b51d83268fa487cff02a1b39 | 69d80cd8544d3e64df5786f363a732d9 | 148 | Ojai Trolley | https://ojaitrolley.com/ | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2023-02-04 03:00:25.207981+00:00 | America/Los_Angeles |
| 3 | dd5b1f2606a5eb98e25c87d7e627030c | 3db48769762ceb7d8844f3c76556f351 | 148 | Ojai Trolley | https://ojaitrolley.com/ | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2022-05-22 00:00:48+00:00 | America/Los_Angeles |
| 4 | 5931967af296e3a7bd37b2e9bdd81b5f | 616e8acf589dd9d45bee5d31c0ebec70 | 148 | Ojai Trolley | https://ojaitrolley.com/ | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2022-07-18 00:00:58+00:00 | America/Los_Angeles |

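The sample output above lists the same agency once per feed version, so a common next step is to keep only the most recent row per agency. The following is a minimal pandas sketch of that step, not part of `calitp-data-analysis`. It assumes a hand-built frame whose values are copied from three rows of the output above (not a live query), and `latest` is a name chosen here for illustration.

```python
import pandas as pd

# Hand-built stand-in for a few columns of the query result above;
# in a notebook you would start from the DataFrame returned by query_sql.
df_dim_agency = pd.DataFrame(
    {
        "feed_key": [
            "afce6c3a0a20fb1c7f208f3d409e23f6",
            "3db48769762ceb7d8844f3c76556f351",
            "616e8acf589dd9d45bee5d31c0ebec70",
        ],
        "agency_id": ["148", "148", "148"],
        "agency_name": ["Ojai Trolley", "Ojai Trolley", "Ojai Trolley"],
        "_feed_valid_from": [
            "2021-09-08 00:18:30+00:00",
            "2022-05-22 00:00:48+00:00",
            "2022-07-18 00:00:58+00:00",
        ],
    }
)

# Parse the timestamps so that sorting is chronological, not alphabetical
df_dim_agency["_feed_valid_from"] = pd.to_datetime(
    df_dim_agency["_feed_valid_from"], utc=True
)

# Sort oldest to newest, then keep the last (newest) row for each agency_id
latest = (
    df_dim_agency.sort_values("_feed_valid_from")
    .drop_duplicates(subset="agency_id", keep="last")
    .reset_index(drop=True)
)

print(latest[["agency_name", "feed_key", "_feed_valid_from"]])
```

Whether "latest by `_feed_valid_from`" is the right rule depends on the analysis; it is used here only to show the pattern.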

(siuba)=
## siuba
`siuba` is a tool that allows the same analysis code to run on a pandas DataFrame,
as well as generate SQL for different databases.
It supports most [pandas Series methods](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) analysts use. See the [siuba docs](https://siuba.readthedocs.io) for more information.

The examples below go through the basics of using siuba, collecting a database query to a local DataFrame,
and showing the SQL queries that siuba code generates.

### Basic query

```python
from calitp_data_analysis.tables import tbls
from siuba import _, filter, count, collect, show_query

# query agency information, then filter for a single gtfs feed,
# and then count how often each feed key occurs
(tbls.mart_gtfs.dim_agency()
    >> filter(_.agency_id == 'BA', _.base64_url == 'aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw==')
    >> count(_.feed_key)
)
```

| | feed_key | n |
|---|---|---|
| 0 | c0b96681751ce30a7bda9c5d168e0012 | 1 |
| 1 | c03e392e71c9e3cca7a435e203360137 | 1 |
| 2 | f7225a06a395ae2ad82fe131173c6217 | 1 |
| 3 | 2ada39f5f43c038d81eaec84dd94f039 | 1 |
| 4 | 76c11e566ba97e3a393b9af2baeae87c | 1 |


### Collect query results
Note that siuba by default prints out a preview of the SQL query results.
In order to fetch the results of the query as a pandas DataFrame, run `collect()`.

```python
tbl_agency_names = tbls.mart_gtfs.dim_agency() >> collect()

# Use pandas .head() method to show first 5 rows of data
tbl_agency_names.head()
```


| | key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _feed_valid_from | feed_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 323ff99feb3b3044e0cdaec49ef6545b | 2206b7c801ab674767c2f5c8d329d379 | 139 | Gold Coast Transit District | http://www.goldcoasttransit.org/ | America/Los_Angeles | en-US | (805) 487-4222 | https://www.gctd.org/fares-rider-guide/ | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2023-05-23 03:00:33.896861+00:00 | America/Los_Angeles |
| 1 | aca898ef7339d88b2f50deb6f5ff2a1d | 2206b7c801ab674767c2f5c8d329d379 | 142 | Thousand Oaks Transit | https://www.toaks.org/departments/public-works... | America/Los_Angeles | en-US | (805) 375-5473 | https://www.toaks.org/departments/public-works... | TOTransit@toaks.org | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2023-05-23 03:00:33.896861+00:00 | America/Los_Angeles |
| 2 | 1e327c2b6f77784866d08701c71bc9d7 | 2206b7c801ab674767c2f5c8d329d379 | 143 | VCTC Intercity | https://www.goventura.org/ | America/Los_Angeles | en-US | 800.438.1112 | https://www.goventura.org/ | ridercomments@goventura.org | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2023-05-23 03:00:33.896861+00:00 | America/Los_Angeles |
| 3 | 3cbe6d09312bb2989258bf3974c23b4b | 2206b7c801ab674767c2f5c8d329d379 | 144 | Kanan Shuttle | https://www.toaks.org/departments/public-works... | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2023-05-23 03:00:33.896861+00:00 | America/Los_Angeles |
| 4 | 8df055482dc146c5774630d86850eb6c | 2206b7c801ab674767c2f5c8d329d379 | 147 | Moorpark City Transit | https://www.moorparkca.gov/227/Bus-Ride-Guide | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2023-05-23 03:00:33.896861+00:00 | America/Los_Angeles |


### Show query SQL

While `collect()` fetches query results, `show_query()` prints out the SQL code that siuba generates.

```python
(tbls.mart_gtfs.dim_agency()
  >> filter(_.agency_name.str.contains("Metro"))
  >> show_query(simplify=True)
)
```


```sql
SELECT *
FROM `mart_gtfs.dim_agency` AS `mart_gtfs.dim_agency_1`
WHERE regexp_contains(`mart_gtfs.dim_agency_1`.`agency_name`, 'Metro')
```


| | key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _feed_valid_from | feed_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | af421322f617b9548ae28186e5ba49dc | 0b907dea17910cbb1d54e5f5f8c96a51 | Metrolink | Metrolink Trains | http://www.metrolinktrains.com | America/Los_Angeles | en | (800) 371-LINK | | | aHR0cHM6Ly93d3cubWV0cm9saW5rdHJhaW5zLmNvbS9nbG... | 2023-05-23 03:00:33.896861+00:00 | America/Los_Angeles |
| 1 | b11be9ba641dfec413c9a8832f27231d | f0f0bd46e4dda08fd6b626313117fb6e | | Santa Cruz Metro | http://www.scmtd.com | America/Los_Angeles | en | (831)425-8600 | http://www.scmtd.com/fares | | aHR0cDovL3NjbXRkLmNvbS9nb29nbGVfdHJhbnNpdC9nb2... | 2022-03-09 00:11:38+00:00 | America/Los_Angeles |
| 2 | f0b36129e328bd211eb26b2f09ebd46e | f0c46ac3e17145702b7591f3d6b01562 | | Santa Cruz Metro | http://www.scmtd.com | America/Los_Angeles | en | (831)425-8600 | http://www.scmtd.com/fares | | aHR0cDovL3NjbXRkLmNvbS9nb29nbGVfdHJhbnNpdC9nb2... | 2022-03-02 00:10:47+00:00 | America/Los_Angeles |
| 3 | 74ec0eadba16cfb47ab6c75ab515c80f | 7e5c436680b8a29ad04b84fe6a01a1b3 | | Santa Cruz Metro | http://www.scmtd.com | America/Los_Angeles | en | (831)425-8600 | http://www.scmtd.com/fares | | aHR0cDovL3NjbXRkLmNvbS9nb29nbGVfdHJhbnNpdC9nb2... | 2021-06-02 00:01:35+00:00 | America/Los_Angeles |
| 4 | b33a6dfe2708469621f32f071c53b00c | 26cb80eaa0f7dc367ed56125c1d596df | | Santa Cruz Metro | http://www.scmtd.com | America/Los_Angeles | en | (831)425-8600 | http://www.scmtd.com/fares | | aHR0cDovL3NjbXRkLmNvbS9nb29nbGVfdHJhbnNpdC9nb2... | 2021-04-16 00:01:13+00:00 | America/Los_Angeles |


Note that here the pandas Series method `str.contains` corresponds to `regexp_contains` in Google BigQuery.
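
One reason this mapping is natural is that pandas `str.contains` treats its pattern as a regular expression by default (`regex=True`). The sketch below shows that behavior locally with plain pandas. It assumes a hand-built Series of agency names taken from the outputs on this page rather than warehouse data, and it does not show how siuba translates any particular pattern.

```python
import pandas as pd

# Hand-built sample of agency names (not a warehouse query)
names = pd.Series(["Metrolink Trains", "Santa Cruz Metro", "Ojai Trolley"])

# Default regex=True: "Metro" is a regular expression that matches anywhere
has_metro = names.str.contains("Metro")

# An anchored pattern matches only names that start with "Metro"
starts_with_metro = names.str.contains("^Metro")

# regex=False switches to a literal substring match, so "^Metro" finds nothing
literal = names.str.contains("^Metro", regex=False)

print(has_metro.tolist())          # [True, True, False]
print(starts_with_metro.tolist())  # [True, False, False]
print(literal.tolist())            # [False, False, False]
```

pandas uses Python's `re` syntax while BigQuery uses RE2, so it is worth checking more complex patterns with `show_query()` before relying on them.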

(more-siuba-resources)=
### More siuba Resources:
* [siuba docs](https://siuba.readthedocs.io)
* ['Tidy Tuesday' live analyses with siuba](https://www.youtube.com/playlist?list=PLiQdjX20rXMHc43KqsdIowHI3ouFnP_Sf)


(pandas-resources)=
## pandas
The pandas library is very commonly used in data analysis, and the external resource below provides a brief overview of its use.

* [Cheat Sheet - pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

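As a quick taste of what the cheat sheet covers, here is a minimal sketch of two everyday pandas operations: filtering rows and counting by group. It assumes a small hand-built frame whose values are copied from the sample outputs earlier on this page, and the variable names are chosen here for illustration.

```python
import pandas as pd

# Hand-built sample; values copied from the sample outputs on this page
agencies = pd.DataFrame(
    {
        "agency_name": [
            "Gold Coast Transit District",
            "Thousand Oaks Transit",
            "Kanan Shuttle",
            "Metrolink Trains",
            "Santa Cruz Metro",
        ],
        "agency_lang": ["en-US", "en-US", "en-US", "en", "en"],
        "agency_phone": [
            "(805) 487-4222",
            "(805) 375-5473",
            None,
            "(800) 371-LINK",
            "(831)425-8600",
        ],
    }
)

# Boolean filter: keep only the agencies that list a phone number
with_phone = agencies[agencies["agency_phone"].notna()]

# Group and count: number of agencies per language code
per_lang = agencies.groupby("agency_lang").size()

print(with_phone["agency_name"].tolist())
print(per_lang.to_dict())
```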
(add-new-packages)=
## Add New Packages

While most Python packages an analyst uses come preinstalled in JupyterHub, there may be additional packages you'll want to use in your analysis.

* Install [shared utility functions](#shared-utils)
* Change directory into the project task's subfolder and add `requirements.txt` and/or `conda-requirements.txt`
* Run `pip install -r requirements.txt` and/or `conda install --yes -c conda-forge --file conda-requirements.txt`
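
Before adding a package to `requirements.txt`, it can help to check whether it is already importable in your environment. The following is a minimal sketch using only the standard library. `missing_packages` is a hypothetical helper written for this example, not part of any Cal-ITP package, and it expects top-level import names, which can differ from the names used with pip.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level import names in `names` that cannot be found."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


# "not_a_real_package_xyz" is a made-up name used to show the negative case
print(missing_packages(["json", "not_a_real_package_xyz"]))
```

Anything the helper returns is a candidate for `requirements.txt` or `conda-requirements.txt`.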