# Exercise 3: Strings, Functions, If Else, For Loops

In [58]:
import altair as alt
import numpy as np
import pandas as pd
from calitp_data_analysis import calitp_color_palette

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

* Using an f-string, load in your merged dataframe from Exercise 2.

In [3]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

In [4]:
FILE = "starter_kit_example_merge.parquet"

In [5]:
df = pd.read_parquet(f"{GCS_FILE_PATH}{FILE}")

In [7]:
df.head(1)

Unnamed: 0,ct_district,project_name,Scope of Work,Project Cost,accessibility_score,dac_accessibility_score,dac_traffic_impacts_score,freight_efficiency_score,freight_sustainability_score,mode_shift_score,lu_natural_resources_score,safety_score,vmt_score,zev_score,public_engagement_score,climate_resilience_score,program_fit_score,overall_score
0,2,Meadow Magic Multi-Use Path,"A 2-mile Class I bike lane and multi-use path through a scenic meadow, featuring wildflower plantings, public art installations, and educational signage highlighting local wildlife.",6265525,1,8,9,3,10,3,8,2,2,10,4,2,4,66


## Categorizing
* There are 44 projects. They vary in theme: some are transit oriented, while others focus on Active Transportation (ATP).
* Categorizing data is an important part of cleaning and analyzing data, so we can present it in a more succinct and insightful way.
* Let's organize projects into three categories.
    * ATP
    * Transit
    * General Lanes

### Task 1: Strings
* Below are some of the common keywords that fall into the categories detailed above. They are held in a `list`.
* Feel free to add other terms you think are relevant. 
* We are going to search the `Scope of Work` column for these keywords. 

In [8]:
transit = ["transit", "passenger rail", "bus", "ferry"]
atp = ["bike", "pedestrian", "bicycle", "sidewalk", "path"]
general_lanes = ["general", "auxiliary"]

#### Step 1: Cleaning
* Remember in Exercise 2 some of the project names didn't merge between the two dataframes?
* In the real world, a lot of string data can be spelled in different ways, different cases, abbreviated, and the like.
* The easiest way to clean this up is by lowercasing, stripping the white spaces, and replacing characters.
* Also, by simplifying a string column, we can search through it easier. 
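
To see why this helps, here is a minimal sketch in plain Python. The two project-name spellings and the helper `clean_text` are made up for illustration; they are not part of the dataset or the exercise.

```python
# Two hypothetical spellings of the same project name (made up for illustration).
name_a = "  Meadow Magic Multi-Use Path"
name_b = "meadow magic multi use path "


def clean_text(text: str) -> str:
    """Lowercase, trim surrounding whitespace, and replace '-' and '_' with spaces."""
    return text.lower().strip().replace("-", " ").replace("_", " ")


print(name_a == name_b)  # False: the raw strings differ
print(clean_text(name_a) == clean_text(name_b))  # True: they match after cleaning
```

The pandas `.str` methods in the next cell apply the same kind of steps to every row of a column at once.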

In [9]:
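df["Scope of Work"] = (
    df["Scope of Work"]
    .str.lower()
    .str.strip()
    # regex=False treats "-", "+", and "_" as literal characters, not regex patterns
    .str.replace("-", " ", regex=False)
    .str.replace("+", " ", regex=False)
    .str.replace("_", " ", regex=False)
)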

* `str.contains()` allows you to search through the column. 
* Let's search for projects that have "transit" in their descriptions. 
* <b>Tip</b>
    * The data we work with tends to be pretty wide. Scrolling horizontally gets tiresome.
    * Placing the columns you want to work with temporarily in a `list`, like `preview_subset` below, is a good idea.

In [10]:
preview_subset = ["project_name", "Scope of Work"]

In [11]:
transit_only_projects = df.loc[df["Scope of Work"].str.contains("transit")]

* Let's see how many transit projects are in this dataset.
* Let's read through the Scope of Work to make sure it's what we expect.

In [12]:
len(transit_only_projects)

7

In [13]:
transit_only_projects[preview_subset]

Unnamed: 0,project_name,Scope of Work
11,Greenway Gables Managed Lanes,"managed lanes prioritizing carpools, clean vehicles, and public transit, featuring real time traffic updates and incentives for sustainable transportation choices."
16,Sparkle City Smart Streets Initiative,"an intelligent transportation system integrating traffic management, real time transit information, and smart parking solutions to enhance mobility and reduce congestion."
19,Rolling Renaissance Rabbit Express,"new, eco friendly rolling stock for public transit, incorporating advanced propulsion systems, comfortable seating, and onboard amenities."
20,Transit Treasure Transit Oasis,"transit supportive features, including shelters, wi fi, and real time information displays, prioritizing passenger convenience and accessibility."
25,Trail of Treats and Transit Hub,"a multi use path connecting to public transit, featuring public art installations, wayfinding signage, and amenities like bike storage and repair stations."
27,Park and Ride Petal Paradise,"an attractive park and ride facility with amenities like ev charging, wi fi, and convenient access to nearby transit options."
43,Brookside Bus Blossom Lane,"prioritize public transportation and enhance air quality by dedicating lanes to buses and hovs on brookside boulevard, integrating smart traffic signals and real time transit information inspired by the ancient elves."


#### Step 2: Filtering
* We've found all the projects that say "transit" somewhere in their descriptions.
* There are still more keywords to go: we haven't searched for bikes, buses, or rail yet.
* However, the method we used above leaves us with multiple dataframes. We actually just want our one original dataframe tagged with categories. 
* A faster way: join all the keywords you want.
* `|` designates "or".
* You can read this as "I want projects that contain the word bus, transit, or rail..."
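
By default, `str.contains()` treats its pattern as a regular expression, so a `|`-joined string behaves like an "or" search. A rough sketch of the same idea with Python's built-in `re` module follows; the descriptions are made up for illustration, and the last one shows a pitfall worth knowing about: keywords match as substrings.

```python
import re

# Keywords copied from the `transit` list in this exercise.
transit = ["transit", "passenger rail", "bus", "ferry"]
pattern = "|".join(transit)  # "transit|passenger rail|bus|ferry"

# Made-up descriptions, for illustration only.
descriptions = [
    "dedicated bus lanes with comfortable stops",
    "a 30 mile passenger rail line",
    "a class i bike lane",
    "a new business district streetscape",
]

for description in descriptions:
    found = re.search(pattern, description) is not None
    print(f"{found}: {description}")
```

The last description is flagged because "business" contains "bus". If that matters for your analysis, word boundaries (`\b`) in the pattern are one way to tighten the match.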

In [14]:
transit_keywords = f"({'|'.join(transit)})"

In [15]:
# Print it out
transit_keywords

'(transit|passenger rail|bus|ferry)'

* Filter again. Notice the `.loc` after `df` and the brackets around the filter condition.
* How many more projects appear when we filter for 3 additional transit related keywords, compared to only transit?

In [16]:
df.loc[df["Scope of Work"].str.contains(transit_keywords)][preview_subset]

Unnamed: 0,project_name,Scope of Work
11,Greenway Gables Managed Lanes,"managed lanes prioritizing carpools, clean vehicles, and public transit, featuring real time traffic updates and incentives for sustainable transportation choices."
16,Sparkle City Smart Streets Initiative,"an intelligent transportation system integrating traffic management, real time transit information, and smart parking solutions to enhance mobility and reduce congestion."
18,Coastal Commuter Carousel,"a 30 mile passenger rail line connecting coastal towns, featuring modern train sets, enhanced station amenities, and scenic viewing cars."
19,Rolling Renaissance Rabbit Express,"new, eco friendly rolling stock for public transit, incorporating advanced propulsion systems, comfortable seating, and onboard amenities."
20,Transit Treasure Transit Oasis,"transit supportive features, including shelters, wi fi, and real time information displays, prioritizing passenger convenience and accessibility."
21,Berry Best Bus Rapid Transit,"dedicated bus lanes with comfortable stops, featuring off board fare payment, priority traffic signals, and enhanced passenger amenities."
25,Trail of Treats and Transit Hub,"a multi use path connecting to public transit, featuring public art installations, wayfinding signage, and amenities like bike storage and repair stations."
27,Park and Ride Petal Paradise,"an attractive park and ride facility with amenities like ev charging, wi fi, and convenient access to nearby transit options."
43,Brookside Bus Blossom Lane,"prioritize public transportation and enhance air quality by dedicating lanes to buses and hovs on brookside boulevard, integrating smart traffic signals and real time transit information inspired by the ancient elves."


In [17]:
print(len(transit_only_projects))
print(len(df.loc[df["Scope of Work"].str.contains(transit_keywords)]))

7
9


* Let's put this all together. 
* I want any project that contains a transit component to be tagged as "Y" in a column called "Transit". If a project doesn't have a transit component, it gets tagged as "N".

In [18]:
df["Transit"] = np.where(
    (df["Scope of Work"].str.contains(transit_keywords)),
    "Y",
    "N",
)

* Using `value_counts()` we can see the breakdown of transit related projects.

In [19]:
df.Transit.value_counts()

N    35
Y     9
Name: Transit, dtype: int64

### Task 2: Functions 
* It looks like only the 9 transit projects were categorized.
* We are missing the other 2 categories: ATP and General Lane related projects.
* We could repeat the steps above or we can use a function.
    * You can think of a function as a piece of code you write only once but reuse more than once.
    * In the long run, functions save you work and look neater when you present your work.
* <b>Resources</b>: Functions are incredibly important. Please spend more time than usual on this section and practice the tutorials linked.
    * [Please read this great tutorial.](https://www.practicalpythonfordatascience.com/00_python_crash_course_functions)
    * [And refer to this page on our docs.](https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html#functions)
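
As a warm-up before the tutorials, here is a minimal sketch of "write once, reuse many times". The function `count_matches` and the descriptions are hypothetical, made up for illustration.

```python
def count_matches(descriptions: list, keyword: str) -> int:
    """Count how many descriptions contain the keyword."""
    return sum(keyword in description for description in descriptions)


# Made-up descriptions, for illustration only.
descriptions = [
    "dedicated bus lanes with comfortable stops",
    "a multi use path with bike storage",
    "new rolling stock for public transit",
]

# The same function is reused with different arguments.
print(count_matches(descriptions, "bus"))      # 1
print(count_matches(descriptions, "bike"))     # 1
print(count_matches(descriptions, "transit"))  # 1
```

Only the arguments change between calls; the logic lives in one place, so a fix to the function fixes every call.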

In [None]:
# Practice here

####  Let's build a function together.
* This will be repetitive after the tutorials, but you will use functions all the time at DDS and it's one of the most critical concepts to grasp.
* Start your function with `def`, followed by the name you'd like, parentheses, and a colon.

In [None]:
# def categorize():

* Now let's think about which two elements change each time we repeat the steps.
* We merely want to substitute `transit_keywords` with ATP or General Lane related keywords.
* Instead of `df["Transit"]`, we want to create two new columns called something like `df["ATP"]` and `df["General_Lanes"]` to hold our yes/no results.
* Add the two elements that need to be substituted as parameters of your function, along with the dataframe itself.
    * It's good practice to specify what exactly the parameter should be: a string/list/dataframe. 

In [None]:
# def categorize(df:pd.DataFrame, keywords:list, new_column:str):

* It's also a nice idea to document what your function will return.
* In our case, it's a Pandas dataframe. 

In [None]:
# def categorize(df:pd.DataFrame, keywords:list, new_column:str)->pd.DataFrame:
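
One thing worth knowing about these annotations: Python records them but does not enforce them when the code runs. A small sketch, using a hypothetical function `describe_keywords` that is not part of the exercise:

```python
def describe_keywords(keywords: list, label: str) -> str:
    """Return a one-line summary of a keyword list."""
    return f"{label}: {len(keywords)} keywords"


print(describe_keywords(["bike", "path"], "ATP"))  # ATP: 2 keywords

# Python does not enforce the hints at runtime: passing a tuple instead of
# a list still runs, because the hints are documentation for readers and tools.
print(describe_keywords(("bus", "ferry", "transit"), "Transit"))  # Transit: 3 keywords

# The hints are stored on the function and can be inspected.
print(describe_keywords.__annotations__)
```

So type hints are mainly for the people (and editors) reading your function; they make it clear what each parameter is expected to be.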

* Think about the steps we took to categorize transit only.
* Add the sections of the code we will be reusing and swap the original variables for the function's parameters.
    *  First, we joined the list of keywords into a single string, separated by `|`.
    *  Second, we searched through the Scope of Work column for the keywords.
    *  Third, if we find the keyword, we will tag the project as "Y" in the column "new_column". If the keyword isn't found, the project is tagged as "N".

In [20]:
def categorize(df: pd.DataFrame, keywords: list, new_column: str) -> pd.DataFrame:
    # Join the list of keywords into one pattern string, e.g. "(bike|pedestrian|path)".
    # This replaces the hard-coded `transit_keywords` string from earlier.
    joined_keywords = f"({'|'.join(keywords)})"

    # We are now creating a new column: notice how the parameter `new_column` has no quotation marks.
    df[new_column] = np.where(
        df["Scope of Work"].str.contains(joined_keywords),  # Why do you think "Scope of Work" has quotation marks around it?
        "Y",
        "N",
    )

    # We are returning the updated dataframe from this function
    return df

* Now let's use your function.

In [21]:
df = categorize(df, atp, "ATP")


In [22]:
df.ATP.value_counts()

N    30
Y    14
Name: ATP, dtype: int64

In [23]:
df = categorize(df, transit, "Transit")


In [24]:
df = categorize(df, general_lanes, "General_Lanes")


In [25]:
df.General_Lanes.value_counts()

N    40
Y     4
Name: General_Lanes, dtype: int64

* Use the `groupby` technique from Exercise 2 to get some descriptive statistics for these 3 new columns.
* Use `.reset_index()` after `aggregate()` to see what happens.
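
If you want a preview of what `.reset_index()` changes, here is a minimal sketch on a tiny made-up dataframe (the values are for illustration only, not from the exercise data).

```python
import pandas as pd

# A tiny made-up dataframe, for illustration only.
toy = pd.DataFrame(
    {
        "Transit": ["Y", "Y", "N", "N"],
        "project_name": ["A", "B", "C", "D"],
        "overall_score": [80, 70, 60, 90],
    }
)

grouped = toy.groupby(["Transit"]).aggregate(
    {"project_name": "nunique", "overall_score": "median"}
)

# Without reset_index(), the grouping column is the index, not a regular column.
print(list(grouped.columns))  # ['project_name', 'overall_score']

# With reset_index(), "Transit" becomes a regular column again.
flat = grouped.reset_index()
print(list(flat.columns))  # ['Transit', 'project_name', 'overall_score']
```

Having the grouping columns back as regular columns usually makes the result easier to filter, merge, or chart later.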

In [39]:
df.groupby(["General_Lanes", "Transit", "ATP"]).aggregate(
    {"project_name": "nunique", "overall_score": "median"}
).reset_index()

Unnamed: 0,General_Lanes,Transit,ATP,project_name,overall_score
0,N,N,N,20,73.0
1,N,N,Y,11,75.0
2,N,Y,N,8,78.0
3,N,Y,Y,1,83.0
4,Y,N,N,2,65.5
5,Y,N,Y,2,88.0


## Function + If-Else
* Above, we can see all types of combinations of categories a project can fall into. 
* Let's do away with these "Y" and "N" columns and create actual categories in a single column called `category`.
* If a project has "N" for all 3 of the General Lane, Transit, and ATP columns, it should be `Other`. 
* If a project has "Y" for all 3, it should be categorized as "General Lane, Transit, and ATP".
* If a project has "Y" for only ATP and Transit, it should be categorized as "Transit and ATP".
* Yes, this will be very tedious given all the combinations!
* To write the function to create these categories, read these resources:
    * [DDS Apply Docs](https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html#functions)
    * [DDS If-Else Tutorial](https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html#if-else-statements)
    * [Geeks for Geeks: if-else with multiple conditions](https://www.geeksforgeeks.org/check-multiple-conditions-in-if-statement-python/)
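
Before tackling all three columns, here is a smaller sketch of the same idea with only two flags. The function `label_project` and its parameters are hypothetical, made up for illustration. Conditions are checked from top to bottom and the first one that is true wins, so the most specific case goes first.

```python
def label_project(has_transit: str, has_atp: str) -> str:
    """Combine two hypothetical "Y"/"N" flags into one label."""
    if has_transit == "Y" and has_atp == "Y":
        return "Transit and ATP"
    elif has_transit == "Y":
        return "Transit"
    elif has_atp == "Y":
        return "ATP"
    else:
        return "Other"


print(label_project("Y", "Y"))  # Transit and ATP
print(label_project("N", "Y"))  # ATP
print(label_project("N", "N"))  # Other
```

If the first two branches were swapped, a project with both flags set would be labeled "Transit" and the combined label would never be reached.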

In [41]:
def categorize(row):
    # Each value in a row is a plain string, so the comparisons are combined with `and`.
    if row.General_Lanes == "N" and row.Transit == "N" and row.ATP == "N":
        return "Other"
    elif row.General_Lanes == "N" and row.Transit == "N" and row.ATP == "Y":
        return "ATP"
    elif row.General_Lanes == "N" and row.Transit == "Y" and row.ATP == "N":
        return "Transit"
    elif row.General_Lanes == "N" and row.Transit == "Y" and row.ATP == "Y":
        return "Transit and ATP"
    elif row.General_Lanes == "Y" and row.Transit == "N" and row.ATP == "N":
        return "General Lanes"
    elif row.General_Lanes == "Y" and row.Transit == "N" and row.ATP == "Y":
        return "General Lanes and ATP"
    # This combination does not occur in the current data, but it should not fall into the "all three" case.
    elif row.General_Lanes == "Y" and row.Transit == "Y" and row.ATP == "N":
        return "General Lanes and Transit"
    else:
        return "Transit, General Lanes, and ATP"

In [43]:
# Apply your function
df["category"] = df.apply(categorize, axis=1)

### Please export your output as a `.parquet` to GCS before moving on to the next step

## For Loops 
* For Loops are one of the greatest gifts of Python. 
* Below is a simple for loop that prints every number in `range(10)`, which is 0 through 9.


In [47]:
for i in range(10):
    print(i)

0
1
2
3
4
5
6
7
8
9
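
A for loop works over any collection, not just `range`. As a rough sketch, the loop below goes through a dictionary of the keyword lists from Task 1 and checks a single sample description against each list. The dictionary `keyword_lists` and the sample description are assumptions made for illustration.

```python
# Keyword lists copied from Task 1, keyed by a category label.
keyword_lists = {
    "Transit": ["transit", "passenger rail", "bus", "ferry"],
    "ATP": ["bike", "pedestrian", "bicycle", "sidewalk", "path"],
    "General_Lanes": ["general", "auxiliary"],
}

# A short sample description, for illustration only.
description = "a multi use path connecting to public transit"

# .items() yields each key together with its value on every pass of the loop.
for label, keywords in keyword_lists.items():
    matched = [keyword for keyword in keywords if keyword in description]
    print(f"{label}: {matched}")
```

Each pass through the loop reuses the same two lines of logic with a different label and keyword list, which is the same reason we wrote a function earlier.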


* Here, I'm looping over a few columns in my dataframe and printing descriptive statistics for each one.
* Notice how I have to use `print` and `display` to show the results.
    * Try this same block of code without `print` and `display` to see the difference.

In [93]:
for column in ["zev_score", "vmt_score", "accessibility_score"]:
    print(f"Statistics for {column}")
    display(df[column].describe())

Statistics for zev_score


count   44.00
mean     4.98
std      2.96
min      1.00
25%      3.00
50%      4.00
75%      7.25
max     10.00
Name: zev_score, dtype: float64

Statistics for vmt_score


count   44.00
mean     5.66
std      3.04
min      1.00
25%      2.75
50%      6.00
75%      8.00
max     10.00
Name: vmt_score, dtype: float64

Statistics for accessibility_score


count   44.00
mean     5.39
std      3.14
min      1.00
25%      2.75
50%      5.00
75%      8.00
max     10.00
Name: accessibility_score, dtype: float64

#### Using a For Loop
* Below, I have already aggregated the dataframe for you.

In [53]:
agg1 = (
    df.groupby(["category"])
    .aggregate(
        {"overall_score": "median", "Project Cost": "median", "project_name": "nunique"}
    )
    .reset_index()
    .rename(
        columns={
            "overall_score": "median_score",
            "Project Cost": "median_project_cost",
            "project_name": "total_projects",
        }
    )
)

In [54]:
agg1

Unnamed: 0,category,median_score,median_project_cost,total_projects
0,ATP,75.0,6238994.0,11
1,General Lanes,65.5,4172279.0,2
2,General Lanes and ATP,88.0,8663951.0,2
3,Other,73.0,5232062.0,20
4,Transit,78.0,3510634.0,8
5,Transit and ATP,83.0,7285919.0,1


* I have also prepared an Altair chart function. 

In [85]:
def create_chart(df: pd.DataFrame, column: str) -> alt.Chart:
    title = column.replace("_", " ").title()
    chart = (
        alt.Chart(df, title=f"{title} by Categories")
        .mark_bar(size=20)
        .encode(
            x=alt.X(column),
            y=alt.Y("category"),
            color=alt.Color(
                "category",
                scale=alt.Scale(
                    range=calitp_color_palette.CALITP_CATEGORY_BRIGHT_COLORS
                ),
            ),
            tooltip=list(df.columns),
        )
        .properties(width=400, height=250)
    )
    return chart

* Use the function to create a chart out of the aggregated dataset.

In [86]:
create_chart(agg1, "median_score")

* We have a couple of other columns left that still need to be visualized. 
* This is the perfect case for using a for loop, since all we want to do is replace the column above with the two remaining columns.
* Try this below! 
    * Hint: you'll have to wrap the function with `display()` to get your results.

In [87]:
for column in ["median_score", "median_project_cost", "total_projects"]:
    display(create_chart(agg1, column))