# Exercise 3: Strings, Functions, If Else, For Loops

In [1]:
import altair as alt
import numpy as np
import pandas as pd

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

* Using an f-string, load in your merged dataframe from Exercise 2.

In [3]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

In [4]:
FILE = "starter_kit_merge.parquet"

In [5]:
df = pd.read_parquet(f"{GCS_FILE_PATH}{FILE}")
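
As a minimal illustration of what the f-string above does, the sketch below builds the same kind of path from two variables. The folder and file names are made-up placeholders, not real locations.

```python
# Hypothetical folder and file names, for illustration only
folder = "gs://example-bucket/example-folder/"
file_name = "example_file.parquet"

# An f-string replaces each {variable} with that variable's value
full_path = f"{folder}{file_name}"
print(full_path)
```

Keeping the folder and the file name in separate variables means only one of them has to change when you load a different file from the same folder.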

### Amanda, note to self why are there min and max scores here??

In [6]:
df.head()

Unnamed: 0,ct_district,project_name,Scope of Work,accessibility_score,dac_accessibility_score,dac_traffic_impacts_score,freight_efficiency_score,freight_sustainability_score,mode_shift_score,lu_natural_resources_score,safety_score,vmt_score,zev_score,public_engagement_score,climate_resilience_score,program_fit_score,overall_score,min_score,max_score
0,10,Meadow Magic Multi-Use Path,"A 2-mile Class I bike lane and multi-use path through a scenic meadow, featuring wildflower plantings, public art installations, and educational signage highlighting local wildlife.",10,3,4,8,3,6,10,9,2,4,5,2,2,68,68,68
1,8,Bunny Hop Bike Boulevard,"A Class II bike lane with charming streetlights, benches, and bike racks designed to resemble carrot sticks, connecting residential neighborhoods to local schools and parks.",8,9,5,8,7,8,10,8,5,1,1,3,9,82,82,82
2,2,Strawberry Shortcake Sidewalks,"Colorful, patterned sidewalks connecting local schools and parks, incorporating playful strawberry-themed crosswalks and decorative street furniture.",1,3,1,10,5,10,3,7,4,3,3,2,3,55,55,55
3,3,River Ramble Rabbit Trail,"A 5-mile Class III bike lane along a picturesque riverfront, offering stunning views, river access points, and interpretive signage sharing the area's natural and cultural history.",4,2,9,9,9,10,9,4,7,1,3,5,2,74,74,74
4,10,Lilac Lane Dream Complete Street,"A vibrant Complete Street featuring bike lanes, wide sidewalks, and ample green space, prioritizing pedestrian safety and community engagement through public events and programming.",10,10,9,4,9,10,7,2,1,7,1,3,3,76,76,76


## Categorizing
* There are 30 projects. They vary in theme: some are transit oriented while others focus on Active Transportation (ATP).
* Categorizing data is an important part of data cleaning and analysis, so we can present the data in a more succinct and insightful way.
* Let's organize projects into three categories.
    * ATP
    * Transit
    * General Lanes

### Task 1: Strings
* Below are some of the common keywords that fall into the categories detailed above. They are held in a `list`.
* Feel free to add other terms you think are relevant. 
* We are going to search the `Scope of Work` column for these keywords. 

In [7]:
transit = ["transit", "passenger rail", "bus", "ferry"]
atp = ["bike", "pedestrian", "bicycle", "sidewalk", "path"]
general_lanes = ["general", "auxiliary"]

#### Step 1: Cleaning
* Remember in Exercise 2 some of the project names didn't merge between the two dataframes?
* In the real world, a lot of string data can be spelled in different ways, different cases, abbreviated, and the like.
* The easiest way to clean this up is by lowercasing, stripping the white spaces, and replacing characters.
* Simplifying a string column also makes it easier to search through.

In [8]:
df["Scope of Work"] = (
    df["Scope of Work"]
    .str.lower()
    .str.strip()
    .str.replace("-", " ", regex=False)
    .str.replace("+", " ", regex=False)  # "+" is a special regex character, so match it literally
    .str.replace("_", " ", regex=False)
)
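
To see what each cleaning step does, here is a small sketch on two made-up descriptions (not rows from the dataset). Passing `regex=False` is a choice made for this sketch so that characters such as `+` are treated as plain text rather than regular expression syntax.

```python
import pandas as pd

# Made-up descriptions with mixed case, stray spaces, and punctuation
messy = pd.Series(["  Multi-Use PATH ", "Park+Ride_Lot"])

cleaned = (
    messy.str.lower()
    .str.strip()
    .str.replace("-", " ", regex=False)  # regex=False treats "-" as plain text
    .str.replace("+", " ", regex=False)  # "+" is a special regex character
    .str.replace("_", " ", regex=False)
)
print(cleaned.tolist())
```

The two strings come out as `multi use path` and `park ride lot`, which are much easier to search consistently.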


* `str.contains()` allows you to search through the column. 
* Let's search for projects that have "transit" in their descriptions. 
* Tip
    * The data we work with tends to be pretty wide. Scrolling horizontally gets tiresome.
    * Placing all the columns you want to temporarily work within a `list` like `preview_subset` below is a good idea. 

In [9]:
preview_subset = ["project_name", "Scope of Work"]

In [10]:
transit_only_projects = df.loc[df["Scope of Work"].str.contains("transit")]

In [11]:
# Let's see how many transit projects
len(transit_only_projects)

6

In [12]:
transit_only_projects[preview_subset]

Unnamed: 0,project_name,Scope of Work
11,Greenway Gables Managed Lanes,"managed lanes prioritizing carpools, clean vehicles, and public transit, featuring real time traffic updates and incentives for sustainable transportation choices."
16,Sparkle City Smart Streets Initiative,"an intelligent transportation system integrating traffic management, real time transit information, and smart parking solutions to enhance mobility and reduce congestion."
19,Rolling Renaissance Rabbit Express,"new, eco friendly rolling stock for public transit, incorporating advanced propulsion systems, comfortable seating, and onboard amenities."
20,Transit Treasure Transit Oasis,"transit supportive features, including shelters, wi fi, and real time information displays, prioritizing passenger convenience and accessibility."
25,Trail of Treats and Transit Hub,"a multi use path connecting to public transit, featuring public art installations, wayfinding signage, and amenities like bike storage and repair stations."
27,Park and Ride Petal Paradise,"an attractive park and ride facility with amenities like ev charging, wi fi, and convenient access to nearby transit options."


#### Step 2: Filtering
* We've found all the projects that say "transit" somewhere in their descriptions.
* Now there are 10 more keywords to go.
* However, the method we used above would leave us with a separate dataframe for every keyword when we actually just want our one original dataframe tagged with categories.
* A faster way: join all the keywords you want into a single search pattern.
* `|` designates "or".
* You can read this as "I want projects that contain the word bus, transit, or rail..."

In [13]:
transit_keywords = f"({'|'.join(transit)})"

In [14]:
# Print it out
transit_keywords

'(transit|passenger rail|bus|ferry)'
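
The sketch below shows the same joining idea on three made-up descriptions, so you can check by eye which rows match. The descriptions and the shortened keyword list are assumptions for illustration only.

```python
import pandas as pd

# Made-up descriptions, for illustration only
descriptions = pd.Series(
    ["new bus shelters", "a scenic bike path", "ferry terminal upgrades"]
)
keywords = ["transit", "bus", "ferry"]

# "|" means "or" in a regular expression
pattern = "|".join(keywords)
mask = descriptions.str.contains(pattern)

print(pattern)
print(mask.tolist())
```

Two things are worth knowing. First, this is plain substring matching, so a keyword like "bus" would also match a word such as "business". Second, wrapping the pattern in parentheses, as in the cell above, creates a regex capture group: the matches are the same, but pandas may emit a warning about match groups. Leaving the parentheses off, as in this sketch, avoids that.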

* Filter again. Notice the `.loc` after `df` and the square brackets around the filter condition.

In [15]:
df.loc[df["Scope of Work"].str.contains(transit_keywords)][preview_subset]


Unnamed: 0,project_name,Scope of Work
11,Greenway Gables Managed Lanes,"managed lanes prioritizing carpools, clean vehicles, and public transit, featuring real time traffic updates and incentives for sustainable transportation choices."
16,Sparkle City Smart Streets Initiative,"an intelligent transportation system integrating traffic management, real time transit information, and smart parking solutions to enhance mobility and reduce congestion."
18,Coastal Commuter Carousel,"a 30 mile passenger rail line connecting coastal towns, featuring modern train sets, enhanced station amenities, and scenic viewing cars."
19,Rolling Renaissance Rabbit Express,"new, eco friendly rolling stock for public transit, incorporating advanced propulsion systems, comfortable seating, and onboard amenities."
20,Transit Treasure Transit Oasis,"transit supportive features, including shelters, wi fi, and real time information displays, prioritizing passenger convenience and accessibility."
21,Berry Best Bus Rapid Transit,"dedicated bus lanes with comfortable stops, featuring off board fare payment, priority traffic signals, and enhanced passenger amenities."
25,Trail of Treats and Transit Hub,"a multi use path connecting to public transit, featuring public art installations, wayfinding signage, and amenities like bike storage and repair stations."
27,Park and Ride Petal Paradise,"an attractive park and ride facility with amenities like ev charging, wi fi, and convenient access to nearby transit options."


In [16]:
# We can see there are actually a few more transit projects than if we just filtered for the word "transit"
print(len(transit_only_projects))
print(len(df.loc[df["Scope of Work"].str.contains(transit_keywords)]))

6
8




* Let's put this all together. 
* I want any project that contains a transit component to be tagged as "Y" in a column called "Transit". If a project doesn't have a transit component, it gets tagged as an "N".

In [17]:
df["Transit"] = np.where(
        (df["Scope of Work"].str.contains(transit_keywords)),
        "Y",
        "N",
    )


* Using `value_counts()` we can see the breakdown of transit related projects.

In [18]:
df.Transit.value_counts()

N    21
Y     8
Name: Transit, dtype: int64
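
Here is a small sketch of the same tagging pattern on a made-up dataframe, in case it helps to see `np.where` and `value_counts()` work on something you can check by hand. The rows and the two keywords are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame(
    {"Scope of Work": ["new bus shelters", "a scenic bike path", "ferry terminal upgrades"]}
)

# True where the description mentions "bus" or "ferry"
has_transit = toy["Scope of Work"].str.contains("bus|ferry")

# np.where(condition, value_if_true, value_if_false)
toy["Transit"] = np.where(has_transit, "Y", "N")
print(toy["Transit"].value_counts())
```

Two of the three rows are tagged "Y", which matches what you would expect from reading the descriptions.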

### Task 2: Functions 
* It looks like only the 8 transit projects were categorized.
* We are missing the other 2 categories: ATP and General Lanes.
* We could repeat the steps above or we can use a function.
    * You can think of a function as a piece of code you write only once but reuse more than once.
    * In the long run, functions save you work and look neater when you present your work.
    * [Please read this great tutorial.](https://www.practicalpythonfordatascience.com/00_python_crash_course_functions)
    * [And refer to this page on our docs.](https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html#functions)
* Let's build one together.
* Start your function with `def`, followed by the function's name and parentheses.



In [19]:

#def categorize():

* Now let's think about the two elements that will change each time we reuse the code.
* We merely want to substitute `transit_keywords` with ATP or General Lanes related keywords.
* Instead of `df["Transit"]`, we want to create two new columns called something like `df["ATP"]` and `df["General_Lanes"]` to hold our yes/no results.
* Add the two elements that need to be substituted, along with the dataframe, as parameters of your function.
    * It's good practice to specify what exactly the parameter should be: a string/list/dataframe. 


In [20]:
#def categorize(df:pd.DataFrame, keywords:list, new_column:str):

* It's also a nice idea to document what your function will return.
* In our case, it's a dataframe. 

In [21]:
#def categorize(df:pd.DataFrame, keywords:list, new_column:str)->pd.DataFrame:
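
To make the idea of annotated parameters and return values concrete, here is a tiny standalone sketch. The function `count_matches` and its sample data are hypothetical and are not part of the exercise.

```python
def count_matches(descriptions: list, keyword: str) -> int:
    """Count how many descriptions contain the keyword."""
    total = 0
    for description in descriptions:
        if keyword in description.lower():
            total += 1
    return total


# Made-up descriptions, for illustration only
sample = ["New bus shelters", "A scenic bike path", "Bus lane striping"]
print(count_matches(sample, "bus"))
print(count_matches(sample, "bike"))
```

The annotations (`list`, `str`, `-> int`) are documentation for the reader. Python does not enforce them when the function runs.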

* Think about the steps we took to categorize transit only.
* Add the sections of the code we will be reusing and sub in the original variables for the arguments.
    *  First, we joined the keywords from a list into a single string, separated by `|`.
    *  Second, we searched through the Scope of Work column for the keywords and tagged it with the category

In [22]:
def categorize(df: pd.DataFrame, keywords: list, new_column: str) -> pd.DataFrame:
    joined_keywords = (
        f"({'|'.join(keywords)})"  # This replaces transit_keywords: the list of keywords is joined into a single string.
    )
    
    # We are now creating a new column: notice how the parameter new_column has no quotation marks.
    df[new_column] = np.where(
        (df["Scope of Work"].str.contains(joined_keywords)), # Why do you think "Scope of Work" has quotation marks around it?
        "Y",
        "N",
    )
    
    # We are returning the updated dataframe from this function
    return df
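
One thing to be aware of: the function above adds the new column directly to the dataframe you pass in. If you would rather leave the input untouched, a variant along these lines is one option. The name `categorize_copy` and the toy data are hypothetical, and this is only a sketch.

```python
import numpy as np
import pandas as pd


def categorize_copy(df: pd.DataFrame, keywords: list, new_column: str) -> pd.DataFrame:
    """Hypothetical variant of categorize() that leaves the input untouched."""
    joined_keywords = "|".join(keywords)
    out = df.copy()  # work on a copy so the caller's dataframe is not modified
    out[new_column] = np.where(
        out["Scope of Work"].str.contains(joined_keywords), "Y", "N"
    )
    return out


# Made-up rows, for illustration only
toy = pd.DataFrame({"Scope of Work": ["a scenic bike path", "new bus shelters"]})
tagged = categorize_copy(toy, ["bike", "path"], "ATP")

print(tagged["ATP"].tolist())
print(list(toy.columns))
```

The returned dataframe has the new `ATP` column, while `toy` still has only its original column.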

* Now let's use your function

In [23]:
df = categorize(df, atp, "ATP")


In [24]:
df.ATP.value_counts()

N    19
Y    10
Name: ATP, dtype: int64

In [25]:
df = categorize(df, transit, "Transit")


In [26]:
df = categorize(df, general_lanes, "General_Lanes")


* Use the `groupby` technique from Exercise 2 to get the total number of projects that fall in each of these 3 new columns

### Amanda: Add more random projects in the sample data related to general lanes.

In [27]:
df.groupby(["General_Lanes", "Transit", "ATP"]).aggregate({"project_name": "nunique"})

General_Lanes,Transit,ATP,project_name
N,N,N,12
N,N,Y,9
N,Y,N,7
N,Y,Y,1


In [28]:
df.groupby(["General_Lanes", "Transit", "ATP"]).aggregate({"overall_score": "median"})

General_Lanes,Transit,ATP,overall_score
N,N,N,73.5
N,N,Y,76.0
N,Y,N,72.0
N,Y,Y,64.0
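
The `groupby` above counts each combination of the three columns. If you want one total per category instead, counting the "Y" values column by column is one possible approach. The toy dataframe below is made up for illustration.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame(
    {
        "project_name": ["A", "B", "C", "D"],
        "General_Lanes": ["N", "N", "N", "N"],
        "Transit": ["Y", "N", "Y", "N"],
        "ATP": ["N", "Y", "Y", "N"],
    }
)
category_columns = ["General_Lanes", "Transit", "ATP"]

# Comparing to "Y" gives True/False; summing counts the True values per column
totals = (toy[category_columns] == "Y").sum()
print(totals)
```

Project C is tagged in both Transit and ATP, so it is counted once in each total. That is the main difference from counting combinations.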


## If-Else
* Part of CSIS is to reward projects that create new infrastructure that isn't highway related. 
    * If a project contains at least one transit related element, we will add 10 points to its `overall_score`.
    * If a project contains at least one ATP element, we will add 5 points.
    * If a project contains a general lane element, we will subtract 3 points.
    * For everything else, we will leave the `overall_score` as is. 
    * We are going to use an `if-else` clause within another function. 
        * [Read about them here](https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html#if-else-statements)

#### The first part of the logic is: <i>if</i> a project's `Scope of Work` column contains a transit element, its score gets bumped up by 10. 

In [29]:
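def alter_score(row):
    # Only the first matching branch runs, so a project tagged in more than
    # one category receives only the adjustment for the first match.
    if row.Transit == "Y":
        row.overall_score += 10
    elif row.ATP == "Y":
        row.overall_score += 5
    elif row.General_Lanes == "Y":
        row.overall_score -= 3
    # Rows that match no category are returned unchanged
    return row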

In [30]:
df = df.apply(alter_score, axis=1)
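
As an alternative sketch, the same first-match logic can be written without a row-by-row `apply` by using `np.select`, which picks the value for the first condition that is true. The toy data and the `adjusted_score` column name are assumptions for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame(
    {
        "Transit": ["Y", "N", "Y", "N", "N"],
        "ATP": ["N", "Y", "Y", "N", "N"],
        "General_Lanes": ["N", "N", "N", "Y", "N"],
        "overall_score": [70, 70, 70, 70, 70],
    }
)

# Conditions are checked in order, like if / elif / elif
conditions = [
    toy["Transit"] == "Y",
    toy["ATP"] == "Y",
    toy["General_Lanes"] == "Y",
]
adjustments = [10, 5, -3]

# default=0 plays the role of "leave the score as is"
toy["adjusted_score"] = toy["overall_score"] + np.select(
    conditions, adjustments, default=0
)
print(toy["adjusted_score"].tolist())
```

Writing the result to a new column also means the bonus is not added a second time if the cell is run again, which can happen when `overall_score` is overwritten in place.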

In [31]:
df.groupby(["General_Lanes", "Transit", "ATP"]).aggregate({"overall_score": "median"})

General_Lanes,Transit,ATP,overall_score
N,N,N,73.5
N,N,Y,81.0
N,Y,N,82.0
N,Y,Y,74.0


## For Loops + More Charts
* Make a chart that displays `overall_score` for Transit projects.
* Use a function to create the chart.
* Use a for loop to filter the dataframe for "Y" in the two other categories and create a chart for each.
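
To show the shape of a for loop that reuses a function, here is a minimal sketch that only does the filtering step. The helper `filter_category` and the toy data are hypothetical. In your own solution, a chart-making function would be called inside the loop on each filtered dataframe.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame(
    {
        "project_name": ["A", "B", "C", "D"],
        "Transit": ["Y", "N", "Y", "N"],
        "ATP": ["N", "Y", "Y", "N"],
        "General_Lanes": ["N", "N", "N", "Y"],
        "overall_score": [80, 75, 64, 70],
    }
)


def filter_category(df: pd.DataFrame, category: str) -> pd.DataFrame:
    """Keep only the rows tagged "Y" in the given category column."""
    return df.loc[df[category] == "Y"]


subsets = {}
# The loop runs the same steps once for each category name in the list
for category in ["ATP", "General_Lanes"]:
    subsets[category] = filter_category(toy, category)
    print(category, len(subsets[category]))
```

Storing each result in a dictionary keyed by category name keeps the filtered dataframes available after the loop finishes.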