# Project structure and basic imports

This notebook shows the most basic end-to-end example of working with data.

When using notebooks, use appropriate markdown heading levels so that the document is
easy to navigate.

In [1]:
import os
from pathlib import Path

import pandas as pd
from dotenv import load_dotenv
from IPython import get_ipython

pd.options.plotting.backend = 'plotly' # Use plotly instead of matplotlib

# Automatically reload external modules as you change them
# This can be set automatically by adding 
ip = get_ipython()
ip.run_line_magic("load_ext", "autoreload")
ip.run_line_magic("autoreload", "2")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Keeping configuration and passwords separate from your code

There are many reasons that you might want to use information that is not stored in a
repository:

- passwords (which should NEVER be hardcoded in the codebase)
- configuration options that other users of your code may want to change, such as 
  paths to data files
- the number of processors to use for multiprocessing

These can be stored as environment variables in a `.env` file that is not tracked in git.
Each developer creates their own `.env` file, which is read in by the code.

The Python package python-dotenv loads the variables in that file into your 
environment so that you can read them from Python.

https://pypi.org/project/python-dotenv/

In [2]:
load_dotenv(override=True) # take environment variables from .env.
                           # Pass the override=True argument if you have already loaded
                           # variables and want to replace them with an updated value

True
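
Once the variables are loaded, they are read like any other environment variable. The sketch below is one possible way to collect them in a single place, not a prescribed pattern. It assumes a `.env` file along the lines shown in the comment; `DATA_PATH` is the variable used later in this notebook, while `N_PROCESSES` and the helper `get_settings` are hypothetical names chosen for illustration. The helper takes a plain mapping so that it can be tried without a real `.env` file.

```python
import os
from pathlib import Path

# Assumed contents of a local .env file (illustrative values):
#   DATA_PATH=C:/Temp/advanced_python
#   N_PROCESSES=4


def get_settings(env=os.environ):
    """Collect settings from environment variables, with fallbacks (hypothetical helper)."""
    return {
        # Fall back to the current folder if DATA_PATH is not set
        "data_path": Path(env.get("DATA_PATH", ".")),
        # Environment values are always strings, so convert explicitly
        "n_processes": int(env.get("N_PROCESSES", "1")),
    }


# In the notebook this would be get_settings() after load_dotenv();
# a plain dict stands in for the environment here.
settings = get_settings({"DATA_PATH": "C:/Temp/advanced_python", "N_PROCESSES": "4"})
print(settings["data_path"], settings["n_processes"])
```

Keeping the lookups in one function means a missing or mistyped variable shows up in one place rather than deep inside the analysis code.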

## Working with paths

Since Python 3.4, it is no longer necessary to use string manipulation to build paths.
The pathlib library in the standard library makes it much easier to construct paths,
navigate to parent or child folders, iterate over files within a folder, etc.

https://docs.python.org/3/library/pathlib.html

In [3]:
# os.environ[...] raises a clear KeyError if DATA_PATH is missing from the environment
sample_data_path = Path(os.environ["DATA_PATH"]) / 'sample_full_greater_sydney'

# e.g. list all .txt files in the data folder
list(sample_data_path.glob("*.txt"))

[WindowsPath('C:/Temp/advanced_python/sample_full_greater_sydney/agency.txt'),
 WindowsPath('C:/Temp/advanced_python/sample_full_greater_sydney/calendar.txt'),
 WindowsPath('C:/Temp/advanced_python/sample_full_greater_sydney/calendar_dates.txt'),
 WindowsPath('C:/Temp/advanced_python/sample_full_greater_sydney/notes.txt'),
 WindowsPath('C:/Temp/advanced_python/sample_full_greater_sydney/routes.txt'),
 WindowsPath('C:/Temp/advanced_python/sample_full_greater_sydney/shapes.txt'),
 WindowsPath('C:/Temp/advanced_python/sample_full_greater_sydney/stops.txt'),
 WindowsPath('C:/Temp/advanced_python/sample_full_greater_sydney/stop_times.txt'),
 WindowsPath('C:/Temp/advanced_python/sample_full_greater_sydney/trips.txt')]
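
The other operations mentioned above (joining with `/`, moving to a parent folder, splitting a file name) can be tried on a throwaway folder. This is a minimal sketch using only the standard library; the folder and file names are made up for illustration and are not part of the sample dataset.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a child path with the / operator instead of string concatenation
    data_dir = Path(tmp) / "sample_data"
    data_dir.mkdir()

    # Create a few placeholder files to iterate over
    for name in ("stops.txt", "trips.txt", "readme.md"):
        (data_dir / name).write_text("placeholder\n")

    # glob returns matching paths in no guaranteed order, so sort for display
    txt_files = sorted(p.name for p in data_dir.glob("*.txt"))

    # Inspect the parts of a single path
    trips = data_dir / "trips.txt"
    parts = (trips.name, trips.stem, trips.suffix, trips.parent.name)

print(txt_files)  # ['stops.txt', 'trips.txt']
print(parts)      # ('trips.txt', 'trips', '.txt', 'sample_data')
```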

## Importing data



In [4]:
df_pd = pd.read_csv(sample_data_path / "trips.txt")
df_pd.head()



Unnamed: 0,route_id,service_id,trip_id,shape_id,trip_headsign,direction_id,block_id,wheelchair_accessible,route_direction,trip_note,bikes_allowed
0,1-SC0-1-sj2-2,AA51+1,1.AA51.1-SC0-1-sj2-2.1.R,1-SC0-1-sj2-2.1.R,Kiama,1,,1,Bomaderry to Kiama,,
1,1-SC0-1-sj2-2,AA51+1,3.AA51.1-SC0-1-sj2-2.1.R,1-SC0-1-sj2-2.1.R,Kiama,1,,1,Bomaderry to Kiama,,
2,1-SC0-1-sj2-2,AA51+1,5.AA51.1-SC0-1-sj2-2.1.R,1-SC0-1-sj2-2.1.R,Kiama,1,,1,Bomaderry to Kiama,,
3,1-SC0-1-sj2-2,AA51+1,7.AA51.1-SC0-1-sj2-2.2.H,1-SC0-1-sj2-2.2.H,Bomaderry,0,,1,Kiama to Bomaderry,,
4,1-SC0-1-sj2-2,AA51+1,9.AA51.1-SC0-1-sj2-2.2.H,1-SC0-1-sj2-2.2.H,Bomaderry,0,,1,Kiama to Bomaderry,,


In [5]:
df_pd.groupby("route_direction").size().sort_values(ascending=False).head(10)

route_direction
City to Berowra via Gordon                   2989
Hornsby to Gordon via Strathfield            2453
City to Parramatta or Leppington             2229
Parramatta or Leppington to City             2222
City to Emu Plains or Richmond               2129
Emu Plains or Richmond to City               2088
Berowra to City via Gordon                   2060
City to Macarthur via Airport or Sydenham    1941
Macarthur to City via Airport or Sydenham    1904
Bondi Junction to Waterfall or Cronulla      1837
dtype: int64

## Advanced import

Polars is a modern dataframe library designed as a much faster alternative to pandas.
Speedups of between 10x and 60x are possible for common operations such as
loading a CSV, joining two large dataframes, or grouping and aggregating.

As the library is still relatively new and less widely used than pandas, it is best
reserved for larger datasets where performance may be a concern.

Since the release of pandas 2.0, it is also easy to convert dataframes between the
two formats, so you can use the powerful features of pandas when you need them
and convert to a polars dataframe for an operation that would otherwise be slow,
such as a large join.

https://www.pola.rs/

In [6]:
import polars as pl

In [7]:
pl.Config.set_fmt_str_lengths(100) # Show up to 100 characters in dataframes

polars.config.Config

In [8]:
df_pl = pl.read_csv(sample_data_path / "trips.txt")
df_pl.head(5)

route_id,service_id,trip_id,shape_id,trip_headsign,direction_id,block_id,wheelchair_accessible,route_direction,trip_note,bikes_allowed
str,str,str,str,str,i64,str,i64,str,str,str
"1-SC0-1-sj2-2","AA51+1","1.AA51.1-SC0-1-sj2-2.1.R","1-SC0-1-sj2-2.1.R","Kiama",1,"",1,"Bomaderry to Kiama","",""
"1-SC0-1-sj2-2","AA51+1","3.AA51.1-SC0-1-sj2-2.1.R","1-SC0-1-sj2-2.1.R","Kiama",1,"",1,"Bomaderry to Kiama","",""
"1-SC0-1-sj2-2","AA51+1","5.AA51.1-SC0-1-sj2-2.1.R","1-SC0-1-sj2-2.1.R","Kiama",1,"",1,"Bomaderry to Kiama","",""
"1-SC0-1-sj2-2","AA51+1","7.AA51.1-SC0-1-sj2-2.2.H","1-SC0-1-sj2-2.2.H","Bomaderry",0,"",1,"Kiama to Bomaderry","",""
"1-SC0-1-sj2-2","AA51+1","9.AA51.1-SC0-1-sj2-2.2.H","1-SC0-1-sj2-2.2.H","Bomaderry",0,"",1,"Kiama to Bomaderry","",""


In [9]:
# Note: polars renamed `groupby` to `group_by` in version 0.19
df_pl.group_by("route_direction").count().sort("count", descending=True).head(10)

route_direction,count
str,u32
"City to Berowra via Gordon",2989
"Hornsby to Gordon via Strathfield",2453
"City to Parramatta or Leppington",2229
"Parramatta or Leppington to City",2222
"City to Emu Plains or Richmond",2129
"Emu Plains or Richmond to City",2088
"Berowra to City via Gordon",2060
"City to Macarthur via Airport or Sydenham",1941
"Macarthur to City via Airport or Sydenham",1904
"Bondi Junction to Waterfall or Cronulla",1837


## Outputs

In [10]:
# Imports usually go at the top of the file
# This is included here for ease of reference
import plotly.express as px

In [22]:
fig = df_pd.groupby("route_direction").size().sort_values(ascending=False).head(20).plot(
    kind='bar', orientation='h', width=800
)
fig.update_layout(title="Top 20 routes by number of rows", showlegend=False)
fig.update_xaxes(title="")
fig.update_yaxes(title="", autorange="reversed")