# Exercise 3: Strings, Functions, If Else, For Loops, Git

In [None]:
import altair as alt
import numpy as np
import pandas as pd
from calitp_data_analysis import calitp_color_palette

In [None]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

* Using an `f-string`, load in your merged dataframe from Exercise 2.

In [None]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"
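
* As a minimal sketch of how an f-string builds a file path: the file name below is hypothetical, so swap in whatever name you used when you saved your merged dataframe. The read step is left commented out because it needs GCS access and assumes the file was saved as a parquet.

```python
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

# Hypothetical file name: replace it with the name of your own saved file.
FILE_NAME = "my_merged_projects.parquet"

# The f-string fills in each variable placed inside curly braces.
full_path = f"{GCS_FILE_PATH}{FILE_NAME}"
print(full_path)

# Loading the file would then look something like this:
# df = pd.read_parquet(full_path)
```

* Keeping the folder and the file name in separate variables means you only change one line when either of them changes.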

## Categorizing
* There are 40+ projects with varying themes: some contain transit elements, while others contain Active Transportation (ATP) components. Some contain both! 
* Categorizing data is an important part of data cleaning and analysis because it lets us present the data at a more succinct level. 
* Let's organize projects into three categories.
    * ATP
    * Transit
    * General Lanes

### Task 1: Strings
* Below are some of the common keywords that fall into the categories detailed above. They are held in a `list`.
* Add other terms you think are relevant. 
* We are going to search the `Scope of Work` column for these keywords. 

In [None]:
transit = ["transit", "passenger rail", "bus", "ferry"]
atp = ["bike", "pedestrian", "bicycle", "sidewalk", "path"]
general_lanes = ["general", "auxiliary", "highway"]

#### Step 1: Cleaning
* Remember in Exercise 2 some of the project names didn't merge between the two dataframes?
* In the real world, you won't have the bandwidth and time to replace each individual string value with a dictionary.
* An easy way to clean most of the values up is by lowercasing, stripping whitespace, and replacing characters.
* We can search through a string column more easily when we simplify the values.

In [None]:
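df.scope_of_work = (
    df.scope_of_work.str.lower()  # Lowercases the strings
    .str.strip()  # Strips leading and trailing whitespace
    .str.replace("-", " ", regex=False)  # Replaces hyphens with a space
    .str.replace("+", " ", regex=False)  # regex=False treats "+" as a literal character
    .str.replace("_", " ", regex=False)  # Replaces underscores with a space
)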
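
* To see what each cleaning step does, here is a small sketch on made-up values (they are not from the real dataset). It passes `regex=False` so that characters like `+` are treated literally.

```python
import pandas as pd

# Made-up values, only for illustration.
sample = pd.Series(["  Bus-Rapid Transit ", "Bike+Ped_Path"])

cleaned = (
    sample.str.lower()  # lowercase everything
    .str.strip()  # remove leading and trailing whitespace
    .str.replace("-", " ", regex=False)
    .str.replace("+", " ", regex=False)
    .str.replace("_", " ", regex=False)
)
print(cleaned.tolist())
```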

* `str.contains()` allows you to search through the column. 
* Let's search for projects that have "transit" in their descriptions. 
* There are many modifications you can make to `str.contains()`. Try them out and see what happens.
    * `df.loc[df.scope_of_work.str.contains("transit", case=False)]` 
        * Will search through your column while ignoring case. It'll return rows with both "Transit" and "transit".
    * `df.scope_of_work.str.contains("transit", case=False, regex=False)`
        * Treats "transit" as a literal string rather than a regular expression. `str.contains()` always matches substrings rather than whole words, so it'll return rows with values like "transit" and "Transitory".
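* The options above can be compared side by side on a tiny made-up Series. This is only a sketch: the values are invented for illustration.

```python
import pandas as pd

# Made-up values, only for illustration.
sample = pd.Series(
    ["Transit hub", "transitory signage", "bike lane", "bus (express)"]
)

# Default: case sensitive, so "Transit hub" is not a match.
case_sensitive = sample.str.contains("transit")

# case=False ignores capitalization.
ignore_case = sample.str.contains("transit", case=False)

# regex=False searches for the literal characters, parentheses included.
literal = sample.str.contains("(express)", regex=False)

print(case_sensitive.tolist())
print(ignore_case.tolist())
print(literal.tolist())
```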

In [None]:
transit_only_projects = df.loc[df.scope_of_work.str.contains("transit")]

* Let's see how many transit projects are in this dataset.
* <b>Tip</b>
    * The data we typically work with tends to be wide (read about wide vs. long data [here](https://www.statology.org/long-vs-wide-data/)). Scrolling horizontally gets tiresome.
    * Placing the columns you want to work with in a `list`, like `preview_subset` below, is a good way to temporarily narrow down your dataframe while working. 

In [None]:
preview_subset = ["project_name", "scope_of_work"]

In [None]:
transit_only_projects[preview_subset]

#### Step 2: Filtering
* We've found all the projects that say "transit" somewhere in their descriptions. 
* However, there are many more transit-related elements to go. We forgot about bikes, bus, rail, and so on.
* The method above leaves us with multiple dataframes. We actually just want our one original dataframe tagged with categories. 
* A faster way: join all the keywords you want into one large string.
    * `|` designates "or".
    * You can read `transit_keywords` as "I want projects that contain the word transit or passenger rail or bus or ferry"

In [None]:
transit_keywords = f"({'|'.join(transit)})"

In [None]:
# Print it out
transit_keywords
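
* Here is a small sketch of what the joined string looks like and how the "or" behaves, using Python's built-in `re` module on made-up descriptions. Note that the match is on substrings, so "bus" also matches "business".

```python
import re

transit = ["transit", "passenger rail", "bus", "ferry"]

# Join the list into one pattern, with | meaning "or".
transit_keywords = f"({'|'.join(transit)})"
print(transit_keywords)

# Made-up descriptions, only for illustration.
descriptions = [
    "new ferry terminal",
    "widen highway shoulder",
    "business park access road",
]
matches = [bool(re.search(transit_keywords, text)) for text in descriptions]
print(matches)
```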

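* Filter again. Notice the `.loc` after `df` and the brackets around the filter condition. The second set of brackets, `[preview_subset]`, narrows the result down to the preview columns.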


In [None]:
df.loc[df.scope_of_work.str.contains(transit_keywords)][preview_subset]

* Count how many more projects appear when we filter for the 3 additional transit-related keywords, compared to filtering for only "transit" above.


* Let's put this all together. 
* I want any project that contains a transit component to be tagged as "Y" in a column called "Transit". 
* If a project doesn't have a transit component, it gets tagged as an "N".

In [None]:
df["Transit"] = np.where(
    (df.scope_of_work.str.contains(transit_keywords)),
    "Y",
    "N",
)

* Using `value_counts()` we can see the total of transit related vs non-transit related projects.
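* As a minimal sketch, here is `value_counts()` on a made-up "Transit" column. The values are invented, so your real counts will differ.

```python
import pandas as pd

# Made-up tags, only for illustration.
sample = pd.DataFrame({"Transit": ["Y", "N", "N", "Y", "N"]})

# value_counts() tallies how many times each unique value appears.
counts = sample["Transit"].value_counts()
print(counts)
```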

### Task 2: Functions 
* It looks like there are only 9 transit projects.
* We are missing the 2 other categories: ATP and General Lane related projects.
* We could repeat the steps above or we can use a **function.**
    * You can think of a function as a piece of code you write only once but reuse more than once.
    * In the long run, functions save you work and look neater when you present your work.
* You may not have realized this but you've been using functions this whole time.
    * When you call `len()`, you are using a built-in function. On a dataframe, it returns the number of rows.

In [None]:
len(df)

* `type` is also a built-in function. It tells you what type of variable you are looking at. 

In [None]:
type(df)

In [None]:
type(GCS_FILE_PATH)

In [None]:
type(transit)

### Practice with outside resources
* Functions are incredibly important. As such, **please spend more time than usual on this section and practice the tutorials linked.**
* [Tutorial #1 Practical Python for Data Science.](https://www.practicalpythonfordatascience.com/00_python_crash_course_functions)
* [DDS Functions.](https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html#functions)

In [None]:
# Practice here

#### Let's build a function together.
* This will be repetitive after the tutorials, but you will use functions all the time at DDS.
##### Step 1
* Start your function with `def` and the name you'd like. I'm calling it `categorize():`

In [None]:
# def categorize():

##### Step 2 
* Now let's think about which two elements change each time we repeat the code.
* We merely want to substitute `transit_keywords` with ATP or General Lane related keywords.
* Instead of `df["Transit"]`, we want to create two new columns called something like `df["ATP"]` and `df["General_Lanes"]` to hold our yes/no results.
* Add the two elements that need to be substituted as parameters of your function.
    * It's good practice to specify what exactly each parameter should be: a string/list/dataframe/etc. 
    * Including this detail makes it easier for your coworkers to read and use your code.

In [None]:
# def categorize(df:pd.DataFrame, keywords:list, new_column:str):

##### Step 3
* It's also good to document what your function will return.
* In our case, it's a Pandas dataframe. 

In [None]:
# def categorize(df:pd.DataFrame, keywords:list, new_column:str)->pd.DataFrame:

##### Step 4
* Think about the steps we took to categorize transit only.
* Add the sections of the code we will be reusing and sub in the original variables for the arguments.
    *  First, we joined the keywords from a list into a big string.
    *  Second, we searched through the Scope of Work column for the keywords.
    *  Third, if we find the keyword, we will tag the project as "Y" in the column "new_column". If the keyword isn't found, the project is tagged as "N".


In [None]:
def categorize(df: pd.DataFrame, keywords: list, new_column: str) -> pd.DataFrame:
    
    # In the transit example, we joined the list called transit into one long string called transit_keywords
    joined_keywords = f"({'|'.join(keywords)})" 

    # We are now creating a new column: notice how the parameter new_column has no quotation marks.
    df[new_column] = np.where((df.scope_of_work.str.contains(joined_keywords)), 
        "Y",
        "N",
    )

    # We are returning the updated dataframe from this function
    return df

##### Step 5
* Now let's use your function: input the arguments in for each of the lists that hold the categorical keywords.

In [None]:
df = categorize(df = df, 
                keywords = atp, 
                new_column = "ATP")

#### Check out your results
* Use the `groupby` technique from Exercise 2 to get some descriptive statistics for these 3 new columns
* Use `.reset_index()` after `aggregate()` to see what happens.
* Try `.reset_index(drop = True)` as well. 
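* To preview the difference between the two `reset_index()` calls, here is a sketch on a made-up dataframe. The numbers are invented, so only the shape of the result carries over to your data.

```python
import pandas as pd

# Made-up data, only for illustration.
sample = pd.DataFrame(
    {
        "ATP": ["Y", "Y", "N", "N"],
        "project_cost": [10, 30, 5, 15],
    }
)

# After groupby + aggregate, the grouping column becomes the index.
grouped = sample.groupby("ATP").aggregate({"project_cost": "median"})

# reset_index() turns the index back into a regular column.
with_reset = grouped.reset_index()

# reset_index(drop=True) throws the old index away instead.
with_drop = grouped.reset_index(drop=True)

print(with_reset)
print(with_drop)
```

* Notice that `drop=True` loses the "Y"/"N" labels, which is usually not what you want right after a `groupby`.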

## Function + If-Else
* There are many cases in which we want to categorize our columns to create broader groups for summarizing and aggregating.
* Using a function with an If-Else clause will help us accomplish this goal.
* **Resources:**
    * [DDS Apply Docs](https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html#functions)
    * [DDS If-Else Tutorial](https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html#if-else-statements)
    
    

In [None]:
# Practice here.

### Practice #1: 
* We are going to write an If-Else function that categorizes each project as low, medium, or high scoring based on its `overall_score` and percentiles.
    * If a project scores below the 25th percentile, it is a "low scoring project". If a project scores above the 25th percentile but below the 75th percentile, it is a "medium scoring project". Anything above the 75th percentile is "high scoring".
* In Data Science, we like to save our work into variables.
    * If new projects are added, the percentile values will likely change.
    * As such, you can save whatever percentile you like using `p75 = df.overall_score.quantile(0.75).astype(float)`, which will update automatically when you load in the new data.
* Write an if-else and set the various percentiles using variables. 
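* If you get stuck, here is one possible sketch on made-up scores. The function name `score_category` and its parameter names are hypothetical. It also assumes that a score exactly equal to a cutoff counts as "medium", which the instructions above leave open.

```python
import pandas as pd

# Made-up scores, only for illustration.
sample = pd.DataFrame({"overall_score": [10, 20, 30, 40, 50, 60, 70, 80, 90]})

# Save the cutoffs in variables so they update when the data changes.
p25 = sample.overall_score.quantile(0.25)
p75 = sample.overall_score.quantile(0.75)


def score_category(score: float, low_cutoff: float, high_cutoff: float) -> str:
    # Assumption: scores equal to a cutoff are treated as medium.
    if score < low_cutoff:
        return "low scoring project"
    elif score <= high_cutoff:
        return "medium scoring project"
    else:
        return "high scoring project"


# apply() runs the function once per value and passes the cutoffs along.
sample["score_category"] = sample.overall_score.apply(
    score_category, low_cutoff=p25, high_cutoff=p75
)
print(sample)
```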

### Practice #2
* Goal:
    * Above, we can see all types of combinations of categories a project can fall into. 
    * Let's do away with these "Y" and "N" columns and create actual categories in an actual column called `category`.
    * If a project has "N" for all 3 of the General Lane, Transit, and ATP columns, it should be "Other". 
    * If a project has "Y" for all 3, it should be categorized as "General Lane, Transit, and ATP".
    * If a project has "Y" for only ATP and Transit, it should be categorized as "Transit and ATP".
    * Yes, this will be very tedious given all the combinations!
* Resource:
    * [Geeks for Geeks: if-else with multiple conditions](https://www.geeksforgeeks.org/check-multiple-conditions-in-if-statement-python/)
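* As a starting point, here is a partial sketch on a made-up dataframe that covers only the three combinations spelled out above. The function name `combine_categories`, the column name `General_Lanes`, and the placeholder label in the final `else` are assumptions, and the remaining combinations are left for you to fill in.

```python
import pandas as pd

# Made-up tags, only for illustration.
sample = pd.DataFrame(
    {
        "General_Lanes": ["Y", "N", "N", "Y"],
        "Transit": ["Y", "Y", "N", "N"],
        "ATP": ["Y", "Y", "N", "N"],
    }
)


def combine_categories(row: pd.Series) -> str:
    if row["General_Lanes"] == "Y" and row["Transit"] == "Y" and row["ATP"] == "Y":
        return "General Lane, Transit, and ATP"
    elif row["General_Lanes"] == "N" and row["Transit"] == "Y" and row["ATP"] == "Y":
        return "Transit and ATP"
    elif row["General_Lanes"] == "N" and row["Transit"] == "N" and row["ATP"] == "N":
        return "Other"
    else:
        # Placeholder label: replace this with the remaining combinations.
        return "not yet categorized"


# axis=1 passes one row at a time to the function.
sample["category"] = sample.apply(combine_categories, axis=1)
print(sample)
```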

### Please export your output as a `.parquet` to GCS before moving on to the next step

## For Loops 
* For loops are one of the greatest gifts of Python. 
* A for loop runs the same block of code once for each item in a sequence, such as a list, from the first item to the last. 
* Below is a simple for loop that prints every number in `range(10)`, which is 0 through 9.


In [None]:
for i in range(10):
    print(i)

* Here, I'm looping over a couple of columns in my dataframe and printing some descriptive statistics about each one.
* Notice how I have to use `print` and `display` to show the results.
    * Try this same block of code without `print` and `display` to see the difference.

In [None]:
for column in ["zev_score", "vmt_score", "accessibility_score"]:
    print(f"Statistics for {column}")
    display(df[column].describe())

### Practice using a for loop
* I have aggregated the dataframe for you.

In [None]:
agg1 = (
    df.groupby(["category"])
    .aggregate(
        {"overall_score": "median", "project_cost": "median", "project_name": "nunique"}
    )
    .reset_index()
    .rename(
        columns={
            "overall_score": "median_score",
            "project_cost": "median_project_cost",
            "project_name": "total_projects",
        }
    )
)

In [None]:
agg1

* I have also prepared an Altair chart function. 

In [None]:
def create_chart(df: pd.DataFrame, column: str) -> alt.Chart:
    title = column.replace("_", " ").title()
    chart = (
        alt.Chart(df, title=f"{title} by Categories")
        .mark_bar(size=20)
        .encode(
            x=alt.X(column),
            y=alt.Y("category"),
            color=alt.Color(
                "category",
                scale=alt.Scale(
                    range=calitp_color_palette.CALITP_CATEGORY_BRIGHT_COLORS
                ),
            ),
            tooltip=list(df.columns),
        )
        .properties(width=400, height=250)
    )
    return chart

* Use the function above to create a chart out of the aggregated dataset.


* We have a couple of other columns left that still need to be visualized. 
* This is the perfect case for using a for loop, since all we want to do is replace the column above with the two remaining columns. 
* Try this below! 
    * You'll have to create a `list` that contains the rest of the columns.
    * You'll have to wrap the function with `display()` to get your results.
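* The looping pattern can be sketched without any charts. Below, a hypothetical stand-in function `make_title` takes the place of `create_chart`, and the list assumes your first chart used `median_score`, so adjust it to whichever columns you have left.

```python
# Assumption: these are the two columns not yet charted.
columns_to_chart = ["median_project_cost", "total_projects"]


def make_title(column: str) -> str:
    # Same string clean-up that create_chart() applies to its title.
    return column.replace("_", " ").title()


titles = []
for column in columns_to_chart:
    # The loop body runs once per column name in the list.
    title = make_title(column)
    print(title)
    titles.append(title)
```

* In your notebook, the body of the loop would call the chart function on `agg1` and the current column, wrapped in `display()`.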

## GitHub - Pull Requests
* In Exercise 1, you created a new branch that you are working on now.
* Now that you are done with Exercise 3, you are at a nice stopping point to merge your work into our `main` branch.

**Steps**
1. Do the normal workflow of `committing` your work. 
2. Navigate to our `data-analyses` [repo over here](https://github.com/cal-itp/data-analyses).
3. Follow the steps detailed in [this video](https://youtu.be/nCKdihvneS0?si=nPlBOAMcgO1nv3v1&t=95). 
4. Once you're done writing, scroll down to the bottom and click `merge pull request` 
<img src= "./starter_kit_img.png">
5. Your work is now merged into the `main` branch of our `data-analyses` repo. 
6. To check, navigate to our [repo](https://github.com/cal-itp/data-analyses) and to this `starter_kit` folder to make sure your notebooks are on the `main` branch.
7. In the terminal, paste `git switch main`. 
8. Paste `git pull origin main`. 
    * This pulls down the work you just uploaded, along with the other work your coworkers have committed onto the main branch. 
9. Delete the branch `your_branch`. 
    * It's considered outdated now because your changes are on the `main` branch. Git won't delete the branch you currently have checked out, which is why you switch to `main` first. Paste `git branch -d your_branch`. 
    * If that doesn't work, paste `git branch -D your_branch`.
10. Create a new branch with `git switch -c your_branch` to continue working on exercises 4 and 5.
    * Your new branch can have the same name as the branch you just merged in.