(python-libraries)=
# Useful Python Libraries
The following libraries are available and recommended for use by Cal-ITP data analysts. Our JupyterHub environment comes with all of these installed already!

## Table of Contents
1. [shared utils](#shared-utils)
1. [calitp-data-analysis](#calitp-data-analysis)
1. [siuba](#siuba)
    - [Basic Query](#basic-query)
    - [Collect Query Results](#collect-query-results)
    - [Show Query SQL](#show-query-sql)
    - [More siuba Resources](more-siuba-resources)
1. [pandas](pandas-resources)
1. [Add New Packages](#add-new-packages)

(shared-utils)=
## shared utils
A set of shared utility functions can also be installed like any other Python library. The [shared_utils](https://github.com/cal-itp/data-analyses/shared_utils) package is stored in the `data-analyses` repository. Generalized analysis functions are added as collaborative work evolves, so we aren't constantly reinventing the wheel.


### In terminal:
* Navigate to the package folder: `cd data-analyses/_shared_utils`
* Use the make command to run through conda install and pip install: `make setup_env`
    * Note: you may need to select Kernel -> Restart Kernel from the top menu after running `make setup_env` in order to import `shared_utils` successfully
* Alternative: add an `alias` to your `.bash_profile`:
    * In terminal use `cd` to navigate to the home directory (not a repository)
    * Type `nano .bash_profile` to open the .bash_profile in a text editor
    * Add a line at end: `alias go='cd ~/data-analyses/portfolio && pip install -r requirements.txt && cd ../_shared_utils && make setup_env && cd ..'`
    * Exit with Ctrl+X, press Y to save, then press Enter at the filename prompt
    * Restart your server; you can check your changes with `cat .bash_profile`


### In notebook:
```python
import shared_utils

# Example of using shared_utils
shared_utils.geography_utils.WGS84
```
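
If the import fails, it can help to confirm whether the package is visible to the current kernel before restarting it. The following is a minimal sketch using only the standard library; the helper name `is_importable` is hypothetical and not part of `shared_utils`.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the current kernel can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_importable("shared_utils"):
    print("shared_utils is available in this kernel")
else:
    print("shared_utils not found: run `make setup_env`, then restart the kernel")
```

The check only reports whether the module can be located; it does not import it, so it is safe to run repeatedly.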

See [data-analyses/example_report](https://github.com/cal-itp/data-analyses/tree/main/example_report) for examples of how to use `shared_utils` for general functions, charts, and maps.

(calitp-data-analysis)=
## calitp-data-analysis
`calitp-data-analysis` is an internal library of utility functions used to access our warehouse data for analysis purposes.

### import tbls

Most notably, you can import `tbls` at the top of your notebook to access warehouse tables through the `tbls` object:

```python
from calitp_data_analysis.tables import tbls
```

Example:

```python
from calitp_data_analysis.tables import tbls

tbls.mart_gtfs.dim_agency()
```

| | key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _feed_valid_from | feed_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8c3cc324869128546ee7a4d610896e77 | e0c151a1bcd4a5c0fad1009f81b63dfb | 148 | Ojai Trolley | https://ojaitrolley.com/ | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2022-08-11 03:19:08+00:00 | America/Los_Angeles |
| 1 | dd5b1f2606a5eb98e25c87d7e627030c | 3db48769762ceb7d8844f3c76556f351 | 148 | Ojai Trolley | https://ojaitrolley.com/ | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2022-05-22 00:00:48+00:00 | America/Los_Angeles |
| 2 | 6976d45d287c7d837db2f65721c575db | fa8776d1269b9c14f0fe9d6668b70adc | 148 | Ojai Trolley | https://ojaitrolley.com/ | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2022-02-03 00:02:54+00:00 | America/Los_Angeles |
| 3 | f5b02052d3a363cae1419e5d10347e47 | e1c0acd02b16c89ef2e8bb33dc8e2ed3 | 148 | Ojai Trolley | https://ojaitrolley.com/ | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2022-01-18 00:11:05+00:00 | America/Los_Angeles |
| 4 | 1ed1f114eb7019e1322b9efb5fe5cb45 | 760d28afdc6eda1fc0c87b6789f9d535 | 148 | Ojai Trolley | https://ojaitrolley.com/ | America/Los_Angeles | en-US | | | | aHR0cHM6Ly9nb3ZjYnVzLmNvbS9ndGZz | 2021-05-14 00:01:46+00:00 | America/Los_Angeles |


### query_sql

`query_sql` is another useful function to use inside of JupyterHub notebooks to turn a SQL query into a pandas DataFrame.

As an example, in a notebook:

```python
from calitp_data_analysis.sql import query_sql
```

```python
df_dim_agency = query_sql("""
SELECT
    *
FROM `mart_gtfs.dim_agency`
LIMIT 10""", as_df=True)
```

```python
df_dim_agency.head()
```

| | key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _feed_valid_from | feed_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 338a53d217228d164b0fb1fc627565de | 79e9692d3fd18810f1ffca033fc0c712 | 30 | LADOT126 | https://www.ladottransit.com/ | America/Los_Angeles | en-US | 213-808-2273 | https://store.ladottransit.com/ | | aHR0cHM6Ly9sYWRvdGJ1cy5jb20vZ3Rmcw== | 2023-05-26 03:00:25.009579+00:00 | America/Los_Angeles |
| 1 | 21dda69fbaaa4ef74b7fcff19b6ce4e6 | 79e9692d3fd18810f1ffca033fc0c712 | 44 | LADOTMVC | https://www.ladottransit.com/ | America/Los_Angeles | en-US | 213-808-2273 | https://store.ladottransit.com/ | | aHR0cHM6Ly9sYWRvdGJ1cy5jb20vZ3Rmcw== | 2023-05-26 03:00:25.009579+00:00 | America/Los_Angeles |
| 2 | 7bb265ab2aca29c91ff80b14f467ae76 | 79e9692d3fd18810f1ffca033fc0c712 | 45 | LADOTMVS | https://www.ladottransit.com/ | America/Los_Angeles | en-US | 213-808-2273 | https://store.ladottransit.com/ | | aHR0cHM6Ly9sYWRvdGJ1cy5jb20vZ3Rmcw== | 2023-05-26 03:00:25.009579+00:00 | America/Los_Angeles |
| 3 | 440da58fc0a32ef2de3419809bf5705f | 79e9692d3fd18810f1ffca033fc0c712 | 47 | LADOTMVN | https://www.ladottransit.com/ | America/Los_Angeles | en-US | 213-808-2273 | https://store.ladottransit.com/ | | aHR0cHM6Ly9sYWRvdGJ1cy5jb20vZ3Rmcw== | 2023-05-26 03:00:25.009579+00:00 | America/Los_Angeles |
| 4 | 83fdea14429a48718357420d7a18053d | 79e9692d3fd18810f1ffca033fc0c712 | 183 | LADOTDT | https://www.ladottransit.com/ | America/Los_Angeles | en-US | 213-808-2273 | https://store.ladottransit.com/ | | aHR0cHM6Ly9sYWRvdGJ1cy5jb20vZ3Rmcw== | 2023-05-26 03:00:25.009579+00:00 | America/Los_Angeles |
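
When the same kind of preview query is needed for several tables, the SQL string can be assembled by a small helper before it is passed to `query_sql`. This is a sketch only: `build_preview_query` is a hypothetical helper, not part of `calitp-data-analysis`, and the `query_sql` call is left as a comment because it needs the library and warehouse access.

```python
import re


def build_preview_query(table: str, limit: int = 10) -> str:
    """Build a `SELECT * ... LIMIT n` query string for a `dataset.table` name."""
    # Accept only simple dataset.table identifiers to avoid malformed SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*\.[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"Unexpected table name: {table!r}")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"SELECT\n    *\nFROM `{table}`\nLIMIT {limit}"


sql = build_preview_query("mart_gtfs.dim_agency", limit=10)
print(sql)

# In a JupyterHub notebook, the string could then be run with:
# from calitp_data_analysis.sql import query_sql
# df_dim_agency = query_sql(sql, as_df=True)
```

Validating the table name up front keeps a typo from turning into a confusing warehouse error later.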


(siuba)=
## siuba
`siuba` is a tool that allows the same analysis code to run on a pandas DataFrame,
as well as generate SQL for different databases.
It supports most [pandas Series methods](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) analysts use. See the [siuba docs](https://siuba.readthedocs.io) for more information.

The examples below go through the basics of using siuba, collecting a database query to a local DataFrame,
and showing the SQL queries that siuba code generates.

### Basic query

```python
from calitp_data_analysis.tables import tbls
from siuba import _, filter, count, collect, show_query

# query agency information, then filter for a single gtfs feed,
# and then count how often each feed key occurs
(tbls.mart_gtfs.dim_agency()
    >> filter(_.agency_id == 'BA', _.base64_url == 'aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw==')
    >> count(_.feed_key)
)
```

| | feed_key | n |
|---|---|---|
| 0 | c49f4931ea0df15735867081c15d18c6 | 1 |
| 1 | c40c108bd611c2db159f06079aab5568 | 1 |
| 2 | 4c3d38b30917cbb5b53f34cdd684d156 | 1 |
| 3 | e7e76dc211a5ff1fc43cd39638b1d5b0 | 1 |
| 4 | d15fe707c6789d7a755209b37ee998e1 | 1 |
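
The long `base64_url` string in the filter is hard to read. Assuming the column holds feed URLs encoded with URL-safe base64 (which the column name suggests, though this page does not state it), the standard library can convert in both directions. The helper names below are hypothetical.

```python
import base64


def decode_base64_url(value: str) -> str:
    """Decode a base64-encoded string back into a readable URL."""
    return base64.urlsafe_b64decode(value.encode("ascii")).decode("utf-8")


def encode_base64_url(url: str) -> str:
    """Encode a URL the same way, for use in a filter on base64_url."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")


# The value used in the filter of the basic query
encoded = "aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw=="
url = decode_base64_url(encoded)
print(url)

# Round trip: re-encoding the decoded URL gives back the original string.
assert encode_base64_url(url) == encoded
```

Decoding the value first makes it easier to confirm that a filter targets the intended feed.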


### Collect query results
Note that siuba by default prints out a preview of the SQL query results.
In order to fetch the results of the query as a pandas DataFrame, run `collect()`.

```python
tbl_agency_names = tbls.mart_gtfs.dim_agency() >> collect()

# Use pandas .head() method to show first 5 rows of data
tbl_agency_names.head()
```


| | key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _feed_valid_from | feed_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 338a53d217228d164b0fb1fc627565de | 79e9692d3fd18810f1ffca033fc0c712 | 30 | LADOT126 | https://www.ladottransit.com/ | America/Los_Angeles | en-US | 213-808-2273 | https://store.ladottransit.com/ | | aHR0cHM6Ly9sYWRvdGJ1cy5jb20vZ3Rmcw== | 2023-05-26 03:00:25.009579+00:00 | America/Los_Angeles |
| 1 | 21dda69fbaaa4ef74b7fcff19b6ce4e6 | 79e9692d3fd18810f1ffca033fc0c712 | 44 | LADOTMVC | https://www.ladottransit.com/ | America/Los_Angeles | en-US | 213-808-2273 | https://store.ladottransit.com/ | | aHR0cHM6Ly9sYWRvdGJ1cy5jb20vZ3Rmcw== | 2023-05-26 03:00:25.009579+00:00 | America/Los_Angeles |
| 2 | 7bb265ab2aca29c91ff80b14f467ae76 | 79e9692d3fd18810f1ffca033fc0c712 | 45 | LADOTMVS | https://www.ladottransit.com/ | America/Los_Angeles | en-US | 213-808-2273 | https://store.ladottransit.com/ | | aHR0cHM6Ly9sYWRvdGJ1cy5jb20vZ3Rmcw== | 2023-05-26 03:00:25.009579+00:00 | America/Los_Angeles |
| 3 | 440da58fc0a32ef2de3419809bf5705f | 79e9692d3fd18810f1ffca033fc0c712 | 47 | LADOTMVN | https://www.ladottransit.com/ | America/Los_Angeles | en-US | 213-808-2273 | https://store.ladottransit.com/ | | aHR0cHM6Ly9sYWRvdGJ1cy5jb20vZ3Rmcw== | 2023-05-26 03:00:25.009579+00:00 | America/Los_Angeles |
| 4 | 83fdea14429a48718357420d7a18053d | 79e9692d3fd18810f1ffca033fc0c712 | 183 | LADOTDT | https://www.ladottransit.com/ | America/Los_Angeles | en-US | 213-808-2273 | https://store.ladottransit.com/ | | aHR0cHM6Ly9sYWRvdGJ1cy5jb20vZ3Rmcw== | 2023-05-26 03:00:25.009579+00:00 | America/Los_Angeles |


### Show query SQL

While `collect()` fetches query results, `show_query()` prints out the SQL code that siuba generates.

```python
(tbls.mart_gtfs.dim_agency()
  >> filter(_.agency_name.str.contains("Metro"))
  >> show_query(simplify=True)
)
```

```sql
SELECT *
FROM `mart_gtfs.dim_agency` AS `mart_gtfs.dim_agency_1`
WHERE regexp_contains(`mart_gtfs.dim_agency_1`.`agency_name`, 'Metro')
```


| | key | feed_key | agency_id | agency_name | agency_url | agency_timezone | agency_lang | agency_phone | agency_fare_url | agency_email | base64_url | _feed_valid_from | feed_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | abc09ff0fa034196599fb2180f312202 | cb422246689ee1dd3357a4011be6ce52 | | Santa Cruz Metro | http://www.scmtd.com | America/Los_Angeles | en | (831)425-8600 | http://www.scmtd.com/fares | | aHR0cDovL3NjbXRkLmNvbS9nb29nbGVfdHJhbnNpdC9nb2... | 2021-09-08 00:18:30+00:00 | America/Los_Angeles |
| 1 | f9cce9bafc585e98bd2834e90057f50f | 3ec7760666f8efe04b23bfc8595dd2d8 | | Santa Cruz Metro | http://www.scmtd.com | America/Los_Angeles | en | (831)425-8600 | http://www.scmtd.com/fares | | aHR0cDovL3NjbXRkLmNvbS9nb29nbGVfdHJhbnNpdC9nb2... | 2021-12-11 00:22:57+00:00 | America/Los_Angeles |
| 2 | f0b36129e328bd211eb26b2f09ebd46e | f0c46ac3e17145702b7591f3d6b01562 | | Santa Cruz Metro | http://www.scmtd.com | America/Los_Angeles | en | (831)425-8600 | http://www.scmtd.com/fares | | aHR0cDovL3NjbXRkLmNvbS9nb29nbGVfdHJhbnNpdC9nb2... | 2022-03-02 00:10:47+00:00 | America/Los_Angeles |
| 3 | 74ec0eadba16cfb47ab6c75ab515c80f | 7e5c436680b8a29ad04b84fe6a01a1b3 | | Santa Cruz Metro | http://www.scmtd.com | America/Los_Angeles | en | (831)425-8600 | http://www.scmtd.com/fares | | aHR0cDovL3NjbXRkLmNvbS9nb29nbGVfdHJhbnNpdC9nb2... | 2021-06-02 00:01:35+00:00 | America/Los_Angeles |
| 4 | b33a6dfe2708469621f32f071c53b00c | 26cb80eaa0f7dc367ed56125c1d596df | | Santa Cruz Metro | http://www.scmtd.com | America/Los_Angeles | en | (831)425-8600 | http://www.scmtd.com/fares | | aHR0cDovL3NjbXRkLmNvbS9nb29nbGVfdHJhbnNpdC9nb2... | 2021-04-16 00:01:13+00:00 | America/Los_Angeles |


Note that here the pandas Series method `str.contains` corresponds to `regexp_contains` in Google BigQuery.
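
Both functions treat the pattern as a regular expression and look for a match anywhere in the string, so characters such as `.` or `(` have special meaning. The sketch below mimics that rule with Python's `re.search` on a few agency names taken from the outputs on this page. It illustrates the matching behavior only; it is not siuba's or BigQuery's implementation, and regex dialects can differ in details.

```python
import re

agency_names = ["Santa Cruz Metro", "Ojai Trolley", "LADOT126"]


def contains(pattern: str, names: list[str]) -> list[str]:
    """Keep names where the regex pattern matches anywhere in the string."""
    return [name for name in names if re.search(pattern, name)]


print(contains("Metro", agency_names))

# "." is a wildcard in a regex, so escape patterns that should match literally.
print(contains("LADOT.26", agency_names))             # "." matches the "1"
print(contains(re.escape("LADOT.26"), agency_names))  # literal dot: no match
```

When a search term comes from user input or contains punctuation, escaping it first avoids surprising matches.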

(more-siuba-resources)=
### More siuba Resources:
* [siuba docs](https://siuba.readthedocs.io)
* ['Tidy Tuesday' live analyses with siuba](https://www.youtube.com/playlist?list=PLiQdjX20rXMHc43KqsdIowHI3ouFnP_Sf)


(pandas-resources)=
## pandas
The pandas library is very commonly used in data analysis, and the external resources below provide a brief overview of its use.

* [Cheat Sheet - pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

## Add New Packages

While most Python packages an analyst uses come preinstalled in JupyterHub, there may be additional packages you'll want to use in your analysis.

* Install [shared utility functions](#shared-utils)
* Change directory into the project task's subfolder and add `requirements.txt` and/or `conda-requirements.txt`
* Run `pip install -r requirements.txt` and/or `conda install --yes -c conda-forge --file conda-requirements.txt`
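
Before running the install commands, it can be useful to see which packages from a requirements file are not yet installed in the current environment. A rough sketch using `importlib.metadata` follows. `parse_requirement_names` and `missing_packages` are hypothetical helpers, the package names are made up for illustration, and the parsing handles only simple lines such as `name` or `name==1.2`, not the full requirements file syntax.

```python
import re
from importlib import metadata


def parse_requirement_names(text: str) -> list[str]:
    """Extract package names from simple requirements.txt-style text."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names: list[str]) -> list[str]:
    """Return the names that have no installed distribution."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Hypothetical requirements.txt contents, for illustration only
example = """
# packages for this analysis
not-a-real-package-xyz==1.0
another_fake_pkg_abc>=2.1
"""
names = parse_requirement_names(example)
print(missing_packages(names))
```

In practice the text would come from reading the project's own `requirements.txt`, for example with `pathlib.Path("requirements.txt").read_text()`.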