# Exercise 3: Strings, Functions, If Else, For Loops

In [None]:
import altair as alt
import numpy as np
import pandas as pd

In [None]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

* Using an f-string, load in your merged dataframe from Exercise 2.

In [None]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

In [None]:
FILE = "starter_kit_merge.parquet"

In [None]:
df = pd.read_parquet(f"{GCS_FILE_PATH}{FILE}")
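As a quick illustration of how an f-string assembles a path, here is a minimal sketch. The folder and file name below are made-up placeholders, not real locations.

```python
# Hypothetical folder and file name, used only to show how the pieces combine.
folder = "gs://example-bucket/some_folder/"
file_name = "example.parquet"

# Anything inside {} is replaced with the value of that variable.
full_path = f"{folder}{file_name}"
print(full_path)
```

Keeping the folder and the file name in separate variables means you only change one of them when you load a different file from the same folder.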

## Categorizing
* There are 30 projects. They vary in theme: some are transit oriented, while others focus on Active Transportation (ATP).
* Categorizing data is an important part of data cleaning and analysis, because it lets us present the data in a more succinct and insightful way.
* Let's organize projects into three categories.
    * ATP
    * Transit
    * Everything else will go into "Other"

### Task 1: Strings
* Below are two `list`s of common keywords, one for transit and one for Active Transportation.
* Feel free to add other terms you think are relevant. 
* We are going to search the `Scope of Work` column for these keywords. 

In [None]:
transit = ["transit", "passenger rail", "bus", "ferry"]
atp = ["bike", "pedestrian", "bicycle", "sidewalk", "path"]

#### Step 1: Cleaning
* Remember in Exercise 2 some of the project names didn't merge between the two dataframes?
* In the real world, a lot of string data can be spelled in different ways, different cases, abbreviated, and the like.
* What if a coworker typed in "HOV" lane instead of "hov" lane? We know that's the same thing, but if we did `str.contains("HOV")` we would miss out on any entry that says "hov" instead.
* The easiest way to clean this up is by lowercasing, stripping the white spaces, and replacing characters.

In [None]:
df["Scope of Work"] = (
    df["Scope of Work"]
    .str.lower()
    .str.strip()
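    .str.replace("-", " ", regex=False)  # regex=False treats "-" as a literal character
    .str.replace("+", " ", regex=False)  # "+" is special in regex, so keep it literal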
)
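To see what each cleaning step does, here is a small sketch on made-up strings (the values in `messy` are invented for illustration and do not come from the dataset).

```python
import pandas as pd

# Toy data, made up for illustration.
messy = pd.Series(["  HOV Lane ", "hov-lane", "Bike+Ped Path"])

cleaned = (
    messy.str.lower()  # "HOV" and "hov" now look the same
    .str.strip()  # remove spaces at the start and end
    .str.replace("-", " ", regex=False)  # swap dashes for spaces
    .str.replace("+", " ", regex=False)  # swap plus signs for spaces
)
print(cleaned.tolist())
```

After cleaning, the first two entries are identical, so a single search for "hov" finds both of them.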

* `str.contains()` allows you to search through the column. 
* Let's search for projects that have "transit" in their descriptions. 
* Pro-tip
    * The data we work with tends to be pretty large. Scrolling vertically and horizontally isn't easy on the eyes.
    * Placing all the columns you want to temporarily work with in a `list`, like `preview_subset` below, is a good idea.

In [None]:
preview_subset = ["project_name", "Scope of Work"]

In [None]:
transit_only_projects = df.loc[df["Scope of Work"].str.contains("transit")]

In [None]:
# Let's see how many transit projects
len(transit_only_projects)

In [None]:
transit_only_projects[preview_subset]
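It can help to look at what `str.contains()` returns before it goes inside `.loc`. The sketch below uses a made-up three-row dataframe. The `na=False` argument is an optional extra, shown here because the toy data includes a missing description.

```python
import pandas as pd

# Toy dataframe, made up for illustration.
toy = pd.DataFrame(
    {
        "project_name": ["A", "B", "C"],
        "Scope of Work": ["new transit center", "repave highway", None],
    }
)

# str.contains gives one True/False per row.
# na=False counts a missing description as "no match".
is_transit = toy["Scope of Work"].str.contains("transit", na=False)
print(is_transit.tolist())

# .loc keeps only the rows where the mask is True.
transit_rows = toy.loc[is_transit]
print(len(transit_rows))
```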

#### Step 2: Filtering
* We've found all the projects that say "transit" somewhere in their descriptions.
* Now there are 8 more keywords to go.
* However, the method we used above would leave us with a separate dataframe for each keyword, when we actually just want our one original dataframe tagged with categories.
* A faster way: join all the keywords you want.
* `|` designates "or".
* You can read this as "I want projects that contain the word bus, transit, or rail..."

In [None]:
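# Join with "|" only. Wrapping the pattern in parentheses creates a regex
# match group, which makes str.contains() show a warning.
transit_keywords = "|".join(transit)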

In [None]:
# Print it out
transit_keywords
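Here is a minimal sketch of the same joining idea on toy data. The `keywords` list and the `scopes` values are made up for illustration.

```python
import pandas as pd

keywords = ["bus", "ferry"]  # toy keyword list

# "|".join puts an "or" symbol between every keyword.
pattern = "|".join(keywords)
print(pattern)

# Toy descriptions, made up for illustration.
scopes = pd.Series(["new bus shelters", "ferry terminal upgrade", "repave highway"])

# A row matches if it contains "bus" OR "ferry".
matches = scopes.str.contains(pattern)
print(matches.tolist())
```

One search now covers every keyword in the list, so adding a new keyword only requires editing the list.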

* Filter again. Notice the `.loc` after `df`, and how the search condition goes inside the square brackets that follow it.

In [None]:
df.loc[df["Scope of Work"].str.contains(transit_keywords)][preview_subset]

In [None]:
# We can see there are actually a few more transit projects than if we just filtered for the word "transit"
print(len(transit_only_projects))
print(len(df.loc[df["Scope of Work"].str.contains(transit_keywords)]))

### Task 2: Functions 
* Let's put this all together and categorize using `.map`.

In [None]:
df["Category"] = (
    df["Scope of Work"].str.contains(transit_keywords).map({True: "Transit"})
)

In [None]:
df.Category.value_counts()
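To see what `.map()` does with rows that did not match, here is a small sketch on a made-up True/False series.

```python
import pandas as pd

# Toy search results, made up for illustration.
matches = pd.Series([True, False, True])

# Only True has an entry in the dictionary, so False becomes NaN (missing).
labels = matches.map({True: "Transit"})
print(labels.tolist())

# value_counts() skips missing values unless you pass dropna=False.
print(labels.value_counts(dropna=False))
```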

* It looks like only the 9 transit projects were categorized.
* We are missing 2 categories: ATP and Other.
* We could repeat the steps above or we can use a function.
* You can think of a function as a piece of code you write only once but reuse more than once.
* In the long run, functions save you work and look neater when you present your work.
* Let's build one together.
* Start your function with `def`, followed by the function's name.

In [None]:

def categorize():
    ...  # Placeholder body so the cell runs; we fill it in below

* Now let's think about which two elements change each time we repeat these steps.
* We want to swap `transit_keywords` for ATP-related keywords.
* Instead of tagging projects as "Transit", we want our ATP projects to be tagged as "ATP".
* Add the two elements that need to be substituted to the arguments of your function, along with the dataframe itself.
    * It's good practice to specify the argument should be: a string/list/dataframe. 


In [None]:
def categorize(df: pd.DataFrame, keywords: list, category: str):
    ...  # Placeholder body so the cell runs; we fill it in below

* It's also a nice idea to document what your function will return.
* In our case, it's a dataframe. 

In [None]:
def categorize(df: pd.DataFrame, keywords: list, category: str) -> pd.DataFrame:
    ...  # Placeholder body so the cell runs; we fill it in below
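The type hints and the return annotation are documentation for the reader. Python does not check them when the function runs. Here is a small sketch with a hypothetical function, `repeat_word`, that is not part of this exercise.

```python
def repeat_word(word: str, times: int) -> str:
    """Return `word` repeated `times` times, separated by spaces."""
    return " ".join([word] * times)


print(repeat_word("bus", 2))

# The hints are stored on the function, so you (or your editor) can look them up.
print(repeat_word.__annotations__)
```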

* Think about the steps we took to categorize transit only.
* Add the sections of the code we will be reusing and swap the original variables for the function's arguments.
    *  First, we joined the keywords from a list into a single string.
    *  Second, we searched through the Scope of Work column for the keywords and tagged the matching projects with the category.

In [None]:
df["Category"] = None  # Start with every project uncategorized

In [None]:
def categorize(df: pd.DataFrame, keywords: list, category: str) -> pd.DataFrame:
    # Remember this used to be transit_keywords
    joined_keywords = "|".join(keywords)

    # na=False treats a missing Scope of Work as "no match"
    is_match = df["Scope of Work"].str.contains(joined_keywords, na=False)

    # Remember this used to say "Transit". Now it takes whatever category is appropriate.
    # Only matching rows are updated, so tags from earlier calls are kept.
    df.loc[is_match, "Category"] = category

    return df
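Here is a self-contained sketch of the same idea on toy data. The function name `tag_category` and the three-row dataframe are hypothetical, made up for illustration. This variant updates only the rows that match, so a tag from an earlier call is kept unless a later call also matches that row.

```python
import pandas as pd


def tag_category(df: pd.DataFrame, keywords: list, category: str) -> pd.DataFrame:
    """Tag rows whose "Scope of Work" mentions any keyword; leave other rows alone."""
    pattern = "|".join(keywords)
    is_match = df["Scope of Work"].str.contains(pattern, na=False)
    df.loc[is_match, "Category"] = category
    return df


# Toy dataframe, made up for illustration.
toy = pd.DataFrame(
    {"Scope of Work": ["new bike lanes", "bus rapid transit", "repave highway"]}
)
toy["Category"] = None  # start with nothing tagged

toy = tag_category(toy, ["bike", "sidewalk"], "ATP")
toy = tag_category(toy, ["bus", "transit"], "Transit")
print(toy)
```

The third row matches neither keyword list, so its category stays empty. If a row matched both lists, the later call would win, so the order of the calls is a design choice.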

* Now let's use your function

In [None]:
df = categorize(df, atp, "ATP")

In [None]:
df.Category.value_counts()

In [None]:
df = categorize(df, transit, "Transit")

In [None]:
df.Category.value_counts()

* Let's look at the categories again

In [None]:
# dropna=False also counts the projects that are still uncategorized
df.Category.value_counts(dropna=False)

## If-Else
* Now that we have found all of the projects that need their scores adjusted, let's go ahead and adjust them.
* We're going to do this with an `if-else` statement.
* The first part of the logic is: <i>if</i> a project's `Scope of Work` column contains an ATP or transit element, its score gets bumped up by 3.
