## **Table of Contents**

  * [Function To Read in the Data!](#function-to-read-in-the-data!)
  * [Example usage](#example-usage)
      * [To Access a DataFrame in the list](#to-access-a-dataframe-in-the-list)
      * [To Remove Spaces in DataFrame name](#to-remove-spaces-in-dataframe-name)
  * [Update cleaning code](#update-cleaning-code)
* [example usage of each method](#example-usage-of-each-method)
    * [Sample use of the clean_salary function.](#sample-use-of-the-clean_salary-function)
* [Example of making a test](#example-of-making-a-test)
  * [Generate report](#generate-report)
* [pathway information](#pathway-information)
  * [Plots](#plots)

In [48]:
import pandas as pd
from typing import Dict, Union
from pathlib import Path
import os
import sys
import re
import pandas.testing as pdt

## Function To Read in the Data!


In [49]:
def load_data_folder(
    folder_path: Union[str, os.PathLike] = "../../data"
) -> Dict[str, pd.DataFrame]:
    """
    Load all CSV/XLS/XLSX files in a folder into pandas DataFrames.

    Parameters
    ----------
    folder_path : str | os.PathLike, optional
        Path to the folder containing the files. Defaults to "../../data".

    Returns
    -------
    Dict[str, pandas.DataFrame]
        A mapping from the file's stem (filename without extension) to its
        loaded DataFrame. For example, "employees.csv" -> key "employees".

    Raises
    ------
    FileNotFoundError
        If `folder_path` does not exist.
    PermissionError
        If the folder or files cannot be accessed due to permissions.
    pd.errors.EmptyDataError
        If a CSV file is empty and cannot be parsed.

    Notes
    -----
    - Supported extensions: .csv, .xls, .xlsx (case-insensitive).
    - If both `name.csv` and `name.xlsx` exist, the later one encountered will
      overwrite the earlier entry for key `name`.
    """
    path = Path(folder_path)
    if not path.exists():
        raise FileNotFoundError(f"Folder not found: {path.resolve()}")

    dataframes: Dict[str, pd.DataFrame] = {}
    for p in path.iterdir():
        if not p.is_file():
            continue

        ext = p.suffix.lower()
        if ext == ".csv":
            df = pd.read_csv(p)
        elif ext in {".xlsx", ".xls"}:
            df = pd.read_excel(p)
        else:
            continue

        dataframes[p.stem] = df

    return dataframes

## Example usage

```python
dfs = load_data_folder()
dfs.keys()
```

output:

```bash
dict_keys(['ARC_Enrollments', 'ARC_Application', 'All_demographics_and_programs'])
```

#### To Access a DataFrame in the list

```python
all_demo = dfs['All_demographics_and_programs']
all_demo.head(1)
```

output:
|col 1|col 2|col 3|
|:--:|:--:|:--:|
|3.14|name|apple|

#### To Remove Spaces in DataFrame name

```python
for name, df in dfs.items():
    safe_name = name.replace(" ", "_")
    globals()[safe_name] = df
```


In [50]:
dfs = load_data_folder()
dfs.keys()

dict_keys(['ARC_Enrollments', 'ARC_Application', 'All_demographics_and_programs'])

How to call the dataframe from the list above


In [51]:
all_demo = dfs['All_demographics_and_programs']
all_demo.head(2)

Unnamed: 0,Auto Id,First Name,Last Name,Gender,Race,Ethnicity Hispanic/Latino,Outcome,Veteran,Ex-Offender,Justice Involved,Single Parent,Program: Program Name
0,202107-1206,name,name,Male,Black or African American,,,No,,,,Reimage 21-22
1,202107-1206,name,name,Male,Black or African American,,,No,,,,Reimage 21-22


Little for loop at access the dataframes individually


In [52]:
for name, df in dfs.items():
    safe_name = name.replace(" ", "_")
    globals()[safe_name] = df

In [53]:
All_demographics_and_programs.head(2)

Unnamed: 0,Auto Id,First Name,Last Name,Gender,Race,Ethnicity Hispanic/Latino,Outcome,Veteran,Ex-Offender,Justice Involved,Single Parent,Program: Program Name
0,202107-1206,name,name,Male,Black or African American,,,No,,,,Reimage 21-22
1,202107-1206,name,name,Male,Black or African American,,,No,,,,Reimage 21-22


Should we switch to this rather than the 2 step process above?


In [54]:
def load_data_folder(
    folder_path: Union[str, os.PathLike] = "../../data",
    safe_names: bool = False
) -> Dict[str, pd.DataFrame]:
    """
    Load all CSV/XLS/XLSX files in a folder into pandas DataFrames.
    ...
    safe_names : bool, optional
        If True, replace spaces in filenames with underscores for dict keys.
    """
    path = Path(folder_path)
    if not path.exists():
        raise FileNotFoundError(f"Folder not found: {path.resolve()}")

    dataframes: Dict[str, pd.DataFrame] = {}
    for p in path.iterdir():
        if not p.is_file():
            continue

        ext = p.suffix.lower()
        if ext == ".csv":
            df = pd.read_csv(p)
        elif ext in {".xlsx", ".xls"}:
            df = pd.read_excel(p)
        else:
            continue

        key = p.stem.replace(" ", "_") if safe_names else p.stem
        dataframes[key] = df

    return dataframes

In [55]:
dfs = load_data_folder(safe_names=True)
dfs.keys()

dict_keys(['ARC_Enrollments', 'ARC_Application', 'All_demographics_and_programs'])

## Update cleaning code

- Look at our cleaning code that we have.
- we should start to make changes to it to account for this.
- We need to make it so it so the program doesn't crash when something fails
  - [Try Except logic updates](https://www.w3schools.com/python/python_try_except.asp)
  - make the messages mean something meaningful
- Ideally we will not drop anything from our data


Will update this a bit with usage etc...


In [56]:
class DataCleaner:
    """
    A utility class for cleaning and standardizing tabular datasets.

    This class wraps a pandas DataFrame and provides a set of 
    convenience methods for common data cleaning tasks such as:

    - Dropping unnecessary columns.
    - Filling missing values with specified defaults.
    - Replacing or normalizing categorical values.
    - Converting data types safely (including datetime).
    - Standardizing demographic fields (e.g., gender, race).
    - Parsing and normalizing salary values.

    All methods are designed to fail gracefully:
    - If a target column does not exist, it is skipped.
    - If an operation fails due to incompatible data, a warning 
      is printed and the DataFrame remains unchanged.

    Most methods return `self`, enabling method chaining:

    Example
    -------
    >>> cleaner = DataCleaner(df)
    >>> clean_df = (
    ...     cleaner
    ...     .drop_columns(["UnusedCol"])
    ...     .fillna({"Age": 0, "City": "Unknown"})
    ...     .normalize_gender()
    ...     .normalize_race()
    ...     .clean_salary()
    ...     .finalize()
    ...     )
    """

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()

    def drop_columns(self, cols_to_drop) -> "Self":
        """
        Drop one or more columns from the DataFrame safely.

        This method attempts to drop the specified columns. If a column 
        does not exist, it is ignored (no error is raised). If dropping 
        fails due to another issue (e.g., invalid argument type), a 
        warning is printed and the DataFrame is left unchanged.

        Parameters
        ----------
        cols_to_drop : str or list of str
            Column name or list of column names to drop.

        Returns
        -------
        Self
            The current instance, allowing method chaining.
        """
        try:
            self.df = self.df.drop(columns=cols_to_drop, errors='ignore')
        except Exception as e:
            print(f"[Warning] Failed dropping columns: {e}")
        return self

    def fillna(self, fill_map: dict) -> "Self":
        """
        Fill missing (NaN) values in specified columns safely.

        For each column provided in the mapping, this method replaces 
        NaN values with the specified fill value. Columns not present 
        in the DataFrame are skipped. If filling fails for a column 
        (e.g., due to incompatible data types), a warning is printed 
        and that column is left unchanged.

        Parameters
        ----------
        fill_map : dict
            A dictionary mapping {column_name: fill_value} pairs.
            Example: {"age": 0, "city": "Unknown"}

        Returns
        -------
        Self
            The current instance, allowing method chaining.
        """
        for col, val in fill_map.items():
            try:
                if col in self.df.columns:
                    self.df[col] = self.df[col].fillna(val)
            except Exception as e:
                print(f"[Warning] Failed filling NaN for {col}: {e}")
        return self

    def replace_column_values(self, col: str, replacements: dict) -> "Self":
        """
        Replace values in a specified DataFrame column using a mapping dictionary.

        This method attempts to apply the given replacements safely. 
        If the column exists, it replaces matching values based on the 
        provided mapping. If an error occurs during replacement 
        (e.g., invalid mapping or data type mismatch), a warning 
        is printed and the DataFrame is left unchanged.

        Parameters
        ----------
        col : str
            The name of the column in the DataFrame to modify.
        replacements : dict
            A mapping of {old_value: new_value} pairs to replace.

        Returns
        -------
        Self
            The current instance, allowing method chaining.
        Sample usage:
        >>> cleaner = DataCleaner(df)
        >>> cleaner.replace_column_values("status", {"yes": 1, "no": 0})
        """
        try:
            if col in self.df.columns:
                self.df[col] = self.df[col].replace(replacements)
        except Exception as e:
            print(f"[Warning] Failed replacing values in {col}: {e}")
        return self

    def convert_datetime(self, col, dtype, errors="ignore"):
        """
        Convert a column to a specified dtype, with special handling for datetimes.

        Parameters
        ----------
        col : str
            Name of the column to convert.
        dtype : str or type
            Target dtype. If the string contains "datetime", the method will use
            `pandas.to_datetime` for conversion. Otherwise, it uses `.astype()`.
        errors : {"ignore", "raise", "coerce"}, default "ignore"
            Error handling behavior:
            - "ignore": invalid parsing will return the original input.
            - "raise": raises an exception on invalid parsing.
            - "coerce": invalid parsing will be set as NaT (for datetime) or NaN.

        Returns
        -------
        self : DataFrameCleaner
            The instance with the modified DataFrame, allowing for method chaining.

        Notes
        -----
        - For datetime conversion, the method forces `errors="coerce"` to ensure
        invalid values are converted to NaT instead of raising.
        - For non-datetime conversions, the provided `errors` argument is passed
        directly to `.astype()`.
        - If the column does not exist, no action is taken.

        Examples
        --------
        >>> cleaner.convert_datetime("StartDate", "datetime64[ns]")
        >>> cleaner.convert_datetime("Age", "int", errors="coerce")
        """
        try:
            if col in self.df.columns:
                if "datetime" in str(dtype):
                    self.df[col] = pd.to_datetime(
                        self.df[col], errors="coerce")
                else:
                    self.df[col] = self.df[col].astype(dtype, errors=errors)
        except Exception as e:
            print(f"[Warning] Failed dtype conversion on {col}: {e}")
        return self

    def normalize_gender(self) -> "Self":
        """
        Standardize gender labels in the DataFrame.

        This method looks for a column named "Gender" and replaces 
        specific transgender categories with the unified label 
        "Transgender". If the column does not exist or the replacement 
        fails (e.g., due to unexpected data types), the method prints a 
        warning and leaves the DataFrame unchanged.

        Replacements performed:
            - "Transgender male to female" → "Transgender"
            - "Transgender female to male" → "Transgender"

        Returns
        -------
        Self
            The current instance, allowing method chaining.
        """
        try:
            if "Gender" in self.df.columns:
                self.df["Gender"] = self.df["Gender"].replace({
                    "Transgender male to female": "Transgender",
                    "Transgender female to male": "Transgender"
                })
        except Exception as e:
            print(f"[Warning] Failed gender normalization: {e}")
        return self

    def normalize_race(self) -> "Self":
        """
        Normalize the 'Race' column so that multi-value entries are 
        collapsed into a single category "Two or More Races".

        Behavior
        --------
        - Single race values are kept as-is.
        - Multi-value entries separated by ";" or "," are replaced with
        "Two or More Races".

        Example
        -------
        Original: "White;Asian" → "Two or More Races"
        Original: "White,Asian" → "Two or More Races"

        Returns
        -------
        Self
            The current instance, allowing method chaining.
        """
        try:
            if "Race" in self.df.columns:
                self.df["Race"] = self.df["Race"].astype(str).apply(
                    lambda x: "Two or More Races" if (
                        ";" in x or "," in x) else x
                )
        except Exception as e:
            print(f"[Warning] Failed race normalization: {e}")
        return self

    def clean_salary(self, hours_per_year: int = 2080):
        """
        Clean and standardize salary values in the DataFrame.

        Steps performed:
        1. Remove currency symbols, commas, and shorthand (e.g., "$50k" → 50000).
        2. Handle ranges by converting them to the average value 
            (e.g., "50,000–70,000" → 60000).
        3. Handle shorthand "M" (e.g., "$1.5M" → 1,500,000).
        4. Convert values to numeric, coercing invalid entries to NaN.
        5. Treat values <= 200 as hourly wages and convert to annual salaries 
            (multiplied by `hours_per_year`).
        6. Drop unrealistic values greater than 1,000,000 (set to NaN).

        Parameters
        ----------
        hours_per_year : int, optional (default=2080)
            Number of work hours in a year for converting hourly to annual salary.

        Returns
        -------
        self : object
            The current instance with the cleaned Salary column.
        """
        try:
            if "Salary" in self.df.columns:
                self.df["Salary"] = self.df["Salary"].astype(str)

                def parse_salary(val: str):
                    val = val.strip()
                    if not val or val.lower() in {"nan", "none"}:
                        return None

                    # Normalize dash types (hyphen, en dash, em dash "-")
                    val = re.sub(r"[–—]", "-", val)

                    # Handle range like "50k-70k" or "50,000-70,000"
                    if "-" in val:
                        parts = val.split("-")
                        nums = [parse_salary(p) for p in parts if p.strip()]
                        nums = [n for n in nums if n is not None]
                        return sum(nums) / len(nums) if nums else None

                    # Remove $, commas, spaces
                    val = re.sub(r"[\$,]", "", val)

                    # Handle shorthand k/K (e.g., "50k" → 50000)
                    match_k = re.match(r"^(\d+(\.\d+)?)[kK]$", val)
                    if match_k:
                        return float(match_k.group(1)) * 1000

                    # Handle shorthand M (e.g., "1.5M" → 1500000)
                    match_m = re.match(r"^(\d+(\.\d+)?)[mM]$", val)
                    if match_m:
                        return float(match_m.group(1)) * 1_000_000

                    # Plain number (integer or float)
                    try:
                        return float(val)
                    except ValueError:
                        return None

                # Apply parsing
                self.df["Salary"] = self.df["Salary"].apply(parse_salary)

                # Convert small numbers (hourly) to annual
                self.df.loc[self.df["Salary"] <=
                            200, "Salary"] *= hours_per_year

                # Drop unrealistic salaries
                self.df.loc[self.df["Salary"] > 1_000_000, "Salary"] = None

        except Exception as e:
            print(f"[Warning] Failed salary cleaning: {e}")

        return self

    def finalize(self) -> pd.DataFrame:
        """
        Finalize and return the cleaned DataFrame.

        This method should be called at the end of a cleaning pipeline 
        to retrieve the fully processed DataFrame after all applied 
        transformations.

        Returns
        -------
        pd.DataFrame
            The cleaned and transformed DataFrame.
        """
        return self.df

# example usage of each method


In [57]:
cleaner = DataCleaner(all_demo)

clean_df = (
    cleaner
    # 1. Drop unneeded columns
    .drop_columns(["First Name", "Last Name"])

    # 2. Fill missing values
    .fillna({
        "Outcome": "Unknown",
        "Veteran": "Unknown",
        "Ex-Offender": "Unknown",
        "Justice Involved": "Unknown",
        "Single Parent": "Unknown",
        "Ethnicity Hispanic/Latino": "Unknown"
    })

    # 3. Replace specific column values
    .replace_column_values("Veteran", {"No": 0, "Yes": 1, "Unknown": -1})

    # 4. Convert a column to datetime (pretend Auto Id is a date code)
    .convert_datetime("Auto Id", "datetime64[ns]")  # will fail gracefully

    # 5. Normalize gender labels
    .normalize_gender()

    # 6. Normalize race column (collapse multi-value)
    .normalize_race()

    # 7. Clean salary column
    .clean_salary()

    # 8. Finalize and return cleaned DataFrame
    .finalize()
)

clean_df.head(2)

  self.df[col] = pd.to_datetime(


Unnamed: 0,Auto Id,Gender,Race,Ethnicity Hispanic/Latino,Outcome,Veteran,Ex-Offender,Justice Involved,Single Parent,Program: Program Name
0,NaT,Male,Black or African American,Unknown,Unknown,0,Unknown,Unknown,Unknown,Reimage 21-22
1,NaT,Male,Black or African American,Unknown,Unknown,0,Unknown,Unknown,Unknown,Reimage 21-22


### Sample use of the clean_salary function.


In [58]:
test_df = pd.DataFrame({
    "Salary": ["$50k", "10", "50", "60,000", "70,000-80,000", "100k", "150000", "200", "3000", "5000000", "$1.5M", "invalid", 70]
})

# Create instance with test DataFrame
cleaner = DataCleaner(test_df)

# Run salary cleaning
cleaner = cleaner.clean_salary(2080)

# Get the cleaned DataFrame
result_df = cleaner.finalize()
print(result_df)

      Salary
0    50000.0
1    20800.0
2   104000.0
3    60000.0
4    75000.0
5   100000.0
6   150000.0
7   416000.0
8     3000.0
9        NaN
10       NaN
11       NaN
12  145600.0


# Example of making a test 

In [59]:
fail_df = pd.DataFrame({
    "Salary": [
        None,                # NaN input
        "",                  # empty string
        " ",                 # whitespace only
        "abc123",            # text + numbers
        "50k-abc",           # malformed range
        "$-5000",            # negative salary
        "∞",                 # infinity symbol
        "NaN",               # literal string NaN
        "$1.5M",             # millions, not handled in parser
        "70,000—80,000"      # em dash (—) instead of hyphen/dash
    ]
})
# Create instance with failing DataFrame
fail_cleaner = DataCleaner(fail_df)
# Run salary cleaning on failing DataFrame
fail_cleaner = fail_cleaner.clean_salary(2080)
# Get the cleaned DataFrame
fail_result_df = fail_cleaner.finalize()
print(fail_result_df)

    Salary
0      NaN
1      NaN
2      NaN
3      NaN
4  50000.0
5   5000.0
6      NaN
7      NaN
8      NaN
9  75000.0


In [60]:
class DataCleaner:
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()

    def clean_salary(self, hours_per_year: int = 2080):
        """
        Clean and standardize salary values in the DataFrame.

        Steps performed:
        1. Remove currency symbols, commas, and shorthand (e.g., "$50k" → 50000).
        2. Handle ranges by converting them to the average value 
           (e.g., "50,000–70,000" → 60000).
        3. Handle shorthand "M" (e.g., "$1.5M" → 1,500,000).
        4. Convert values to numeric, coercing invalid entries to NaN.
        5. Treat values <= 200 as hourly wages and convert to annual salaries 
           (multiplied by `hours_per_year`).
        6. Drop unrealistic values greater than 1,000,000 (set to NaN).

        Parameters
        ----------
        hours_per_year : int, optional (default=2080)
            Number of work hours in a year for converting hourly to annual salary.

        Returns
        -------
        self : object
            The current instance with the cleaned Salary column.
        """
        try:
            if "Salary" in self.df.columns:
                self.df["Salary"] = self.df["Salary"].astype(str)

                def parse_salary(val: str):
                    val = val.strip()
                    if not val or val.lower() in {"nan", "none"}:
                        return None

                    # Normalize dash types (hyphen, en dash, em dash "-")
                    val = re.sub(r"[–—]", "-", val)

                    # Handle range like "50k-70k" or "50,000-70,000"
                    if "-" in val:
                        parts = val.split("-")
                        nums = [parse_salary(p) for p in parts if p.strip()]
                        nums = [n for n in nums if n is not None]
                        return sum(nums) / len(nums) if nums else None

                    # Remove $, commas, spaces
                    val = re.sub(r"[\$,]", "", val)

                    # Handle shorthand k/K (e.g., "50k" → 50000)
                    match_k = re.match(r"^(\d+(\.\d+)?)[kK]$", val)
                    if match_k:
                        return float(match_k.group(1)) * 1000

                    # Handle shorthand M (e.g., "1.5M" → 1500000)
                    match_m = re.match(r"^(\d+(\.\d+)?)[mM]$", val)
                    if match_m:
                        return float(match_m.group(1)) * 1_000_000

                    # Plain number (integer or float)
                    try:
                        return float(val)
                    except ValueError:
                        return None

                # Apply parsing
                self.df["Salary"] = self.df["Salary"].apply(parse_salary)

                # Convert small numbers (hourly) to annual
                self.df.loc[self.df["Salary"] <=
                            200, "Salary"] *= hours_per_year

                # Drop unrealistic salaries
                self.df.loc[self.df["Salary"] > 1_000_000, "Salary"] = None

        except Exception as e:
            print(f"[Warning] Failed salary cleaning: {e}")

        return self

    def finalize(self):
        """Return cleaned dataframe."""
        return self.df

In [61]:
# Test DataFrame with edge/fail cases
fail_df = pd.DataFrame({
    "Salary": [
        None,  # NaN
        "",  # NaN
        " ",  # NaN
        "abc123",  # NaN
        "50k-abc",  # 50000.0
        "$-5000",  # -5000.0  (still allowed for now)
        "∞",  # NaN
        "NaN",  # NaN
        "$1.5M",  # NaN ( >1,000,000 rule)
        "70,000—80,000"  # 75000.0 (dash normalized)
    ]
})

# Run through cleaner
cleaner = DataCleaner(fail_df)
result = cleaner.clean_salary().finalize().reset_index(drop=True)

# Expected results as DataFrame
expected = pd.DataFrame({
    "Salary": [
        None,       # None
        None,       # empty string
        None,       # whitespace
        None,       # abc123
        50000.0,    # 50k-abc
        5000.0,     # negative salary
        None,       # infinity
        None,       # "NaN"
        None,       # 1.5M filtered out
        75000.0     # range with em dash
    ]
}, dtype="float64").reset_index(drop=True)

# Assertion test
pdt.assert_frame_equal(result, expected)
print("✅ Salary cleaning DataFrame test passed!")

✅ Salary cleaning DataFrame test passed!


## Generate report

- Overall completion of program only accounting for the new style of classes m1-m4
- completion by year
- completion over all by pathway
- completion by year by pathway
- Feel free to get creative here adding gender etc to get us a better understanding
- education level and the above...
- export this as a txt file


In [65]:
dfs.keys()

dict_keys(['ARC_Enrollments', 'ARC_Application', 'All_demographics_and_programs'])

In [68]:
arc_apps = dfs['ARC_Application']
arc_apps.head(2)

Unnamed: 0,KY Region,Contact: Auto Id,Contact: Unique ID SSN,Contact: SSN Opt Out,Contact: Mailing State/Province,Contact: County,Contact: Mailing Zip/Postal Code,Contact: Birthdate,Contact: Gender,Disability,...,Displaced Homemaker,Spouse of Armed Forces Reduced Income,Loss of Family Support,Seasonal farm worker?,Contact: First Name,Contact: Last Name,Status,Date Completed,Assessment: Created Date,Contact: Approval Status
0,SOAR,202109-5224,,0,KY,Knox,40906,1981-10-25,Female,No,...,,,,,name,name,Accepted - Prework Complete,2021-08-24,2021-09-10,
1,SOAR,202109-5230,,0,KY,Perry,41773,2000-10-28,Prefer not to say,No,...,,,,,name,name,Accepted - Prework Complete,2021-08-25,2021-09-10,


In [70]:
arc_enroll = dfs['ARC_Enrollments']
arc_enroll.head(2)

Unnamed: 0,Auto Id,KY Region,Full Name,Assessment ID,EnrollmentId,Enrollment Service Name,Service,Projected Start Date,Actual Start Date,Projected End Date,Actual End Date,Outcome,ATP Cohort
0,202109-5224,SOAR,name name,OA-003348,Enrollment-1386,ES-0011193,Career Readiness Workshop,2021-11-11,NaT,NaT,NaT,,NaT
1,202109-5224,SOAR,name name,OA-003348,Enrollment-1386,ES-0013492,Software Development 1,2022-01-05,2022-01-05,2022-04-06,2022-04-06,Successfully Completed,2022-01-01


# pathway information

In [71]:
STARTER_PATHWAYS = [
    'Web Development M1',
    'Data Analysis M1',
    'Software Development M1',
    'Quality Assurance M1',
    'User Experience M1',
]

def get_starting_pathways(df: pd.DataFrame) -> pd.DataFrame:
    """
    Returns a DataFrame containing only the starting pathways.
    """
    mask_starter_pathways = df['Service'].isin(STARTER_PATHWAYS)
    return df[mask_starter_pathways]


def get_cohorts_list(df: pd.DataFrame) -> list:
    """
    Returns a sorted list of cohorts from starting pathways, including 'All cohorts'.
    """
    df_starters = get_starting_pathways(df)
    cohorts = list(
        pd.to_datetime(df_starters['ATP Cohort'][df_starters['ATP Cohort'] != 'NA'])
        .sort_values()
        .astype(str)
        .unique()
    )
    cohorts.insert(0, 'All cohorts')
    return cohorts


def get_data_by_cohort(df: pd.DataFrame, cohort: str = 'All cohorts') -> pd.DataFrame:
    """
    Returns a DataFrame counting services for a specific cohort or all cohorts.
    """
    df_starters = get_starting_pathways(df)
    if cohort == 'All cohorts':
        result = df_starters.value_counts('Service').reset_index()
    else:
        cohort_dt = str(pd.to_datetime(cohort))
        result = df_starters[df_starters['ATP Cohort'] == cohort_dt].value_counts('Service').reset_index()
    return result

In [91]:
cohorts = get_cohorts_list(arc_enroll)
cohorts


['All cohorts',
 '2023-05-01',
 '2023-08-01',
 '2024-01-01',
 '2024-05-01',
 '2024-08-01',
 '2025-01-01']

In [93]:
enroll_by_cohort = get_data_by_cohort(arc_enroll, "2024-01-01")
enroll_by_cohort

Unnamed: 0,Service,count
0,Data Analysis M1,17
1,Web Development M1,14
2,Software Development M1,11


Completion information 

In [94]:
class Completion_rate_data:
    def __init__(self, data):
        self.data = data
        self.__pathways = [
            'Web Development M1',
            'Web Development M2',
            'Web Development M3',
            'Web Development M4',
            'Data Analysis M1',
            'Data Analysis M2',
            'Data Analysis M3',
            'Data Analysis M4',
            'Software Development M1',
            'Software Development M2',
            'Software Development M3',
            'Software Development M4',
            'Quality Assurance M1',
            'Quality Assurance M2',
            'Quality Assurance M3',
            'Quality Assurance M4',
            'User Experience M1',
            'User Experience M2',
            'User Experience M3',
            'User Experience M4',
        ]

        # Not the best Pandas way to do it:
    def Get_completion_percentages(self,
                                   cohort: str = 'All cohorts') -> pd.DataFrame:  # noqa
        """
            Creates a pandas.Datafreme that contains the %
            of completion of each pathway.

            Args:
                cohort: str

            Return:
                pandas.DataFrame
        """
        if cohort == 'All cohorts':
            data = self.data
        else:
            data = self.data[self.data['ATP Cohort'] == pd.Timestamp(cohort)]

        completion_dictionary = {}

        for path in self.__pathways:
            outcome = data[data['Service'] == path]['Outcome'].value_counts(
                normalize=True).reset_index()
            completion_dictionary[path] = {
                row.Outcome: row.proportion for row in outcome.itertuples(index=True)}  # noqa

        result_df = pd.DataFrame(completion_dictionary).transpose().fillna(
            0).rename_axis('Module').reset_index()

        result_df['Pathway'] = result_df['Module'].apply(
            # intended to be able to sort by pathway
            lambda x: x[:x.rfind(' ')])
        return result_df

    def Get_pathways_name(self, df: pd.DataFrame) -> list:
        """
            List of all the pathways in a pandas.DataFrame generated by
            self.Get_completion_percentages().

            Args:
                df: pandas.DataFrame

            Return:
                list
        """
        return list(df['Pathway'].unique())


In [100]:
completion_data = Completion_rate_data(arc_enroll)
completion = completion_data.Get_completion_percentages('2023-05-01')
completion

Unnamed: 0,Module,Successfully Completed,Did Not Complete,Pathway
0,Web Development M1,0.75,0.25,Web Development
1,Web Development M2,0.714286,0.285714,Web Development
2,Web Development M3,0.8,0.2,Web Development
3,Web Development M4,0.75,0.25,Web Development
4,Data Analysis M1,0.714286,0.285714,Data Analysis
5,Data Analysis M2,0.5,0.5,Data Analysis
6,Data Analysis M3,1.0,0.0,Data Analysis
7,Data Analysis M4,0.333333,0.666667,Data Analysis
8,Software Development M1,1.0,0.0,Software Development
9,Software Development M2,0.818182,0.181818,Software Development


In [99]:
all_completion_data = Completion_rate_data(arc_enroll)
all_completion = completion_data.Get_completion_percentages("All cohorts")
completion

Unnamed: 0,Module,Successfully Completed,Did Not Complete,Partially Completed,Pathway
0,Web Development M1,1.0,0.0,0.0,Web Development
1,Web Development M2,0.857143,0.142857,0.0,Web Development
2,Web Development M3,1.0,0.0,0.0,Web Development
3,Web Development M4,0.666667,0.333333,0.0,Web Development
4,Data Analysis M1,0.764706,0.117647,0.117647,Data Analysis
5,Data Analysis M2,0.846154,0.153846,0.0,Data Analysis
6,Data Analysis M3,0.909091,0.090909,0.0,Data Analysis
7,Data Analysis M4,0.5,0.5,0.0,Data Analysis
8,Software Development M1,0.818182,0.181818,0.0,Software Development
9,Software Development M2,0.555556,0.222222,0.222222,Software Development


## Plots

- Look at the various plots
- make a consistent color scheme
- pick the plots that go with the report above
- make missing plots
- make plots have the option to show & save in the functions

see `src/notebooks/visualization_examples.ipynb`
See below from `src/Carmen_WORCEmployment_Plots.py`


In [None]:
def plot_salary_by_gender(data):
    plt.figure(figsize=(8, 5))
    sns.boxplot(data=data, x='Gender', y='Salary')
    plt.title("Salary Distribution by Gender")
    plt.show()


def plot_avg_salary_by_city(data):
    region_salary = data.groupby('Mailing City')['Salary'].mean().sort_values()
    region_salary.plot(kind='barh', figsize=(
        8, 5), title="Average Salary by KY Region")
    plt.xlabel("Average Salary")
    plt.show()


def plot_placements_over_time(data):
    data.set_index('Start Date').resample('M').size().plot(
        kind='line', marker='o', figsize=(10, 4))
    plt.title("Number of Placements Over Time")
    plt.ylabel("Placements")
    plt.show()


def plot_placement_type_by_program(data):
    plt.figure(figsize=(10, 6))
    sns.countplot(data=data, x='ATP Placement Type',
                  hue='Program: Program Name')
    plt.xticks(rotation=45)
    plt.title("Placement Type by Program")
    plt.show()


def plot_top_cities(data):
    city_counts = data['Mailing City'].value_counts().head(10)
    city_counts.plot(
        kind='bar', title='Top Cities by Participant Count', figsize=(8, 4))
    plt.ylabel("Count")
    plt.show()

# TOC generator


In [101]:
import json
import os


def generate_toc_from_notebook(notebook_path):
    """
    Parses a local .ipynb file and generates Markdown for a Table of Contents.
    """
    if not os.path.isfile(notebook_path):
        print(f"❌ Error: File not found at '{notebook_path}'")
        return

    with open(notebook_path, 'r', encoding='utf-8') as f:
        notebook = json.load(f)

    toc_markdown = "### **Table of Contents**\n"
    for cell in notebook.get('cells', []):
        if cell.get('cell_type') == 'markdown':
            for line in cell.get('source', []):
                if line.strip().startswith('#'):
                    level = line.count('#')
                    title = line.strip('#').strip()
                    link = title.lower().replace(' ', '-').strip('-.()')
                    indent = '  ' * (level - 1)
                    toc_markdown += f"{indent}* [{title}](#{link})\n"

    print("\n--- ✅ Copy the Markdown below and paste it "
          "into a new markdown cell ---\n")
    print(toc_markdown)


notebook_path = 'mainNb.ipynb'
generate_toc_from_notebook(notebook_path)


--- ✅ Copy the Markdown below and paste it into a new markdown cell ---

### **Table of Contents**
  * [**Table of Contents**](#**table-of-contents**)
  * [Function To Read in the Data!](#function-to-read-in-the-data!)
  * [Example usage](#example-usage)
      * [To Access a DataFrame in the list](#to-access-a-dataframe-in-the-list)
      * [To Remove Spaces in DataFrame name](#to-remove-spaces-in-dataframe-name)
  * [Update cleaning code](#update-cleaning-code)
* [example usage of each method](#example-usage-of-each-method)
    * [Sample use of the clean_salary function.](#sample-use-of-the-clean_salary-function)
* [Example of making a test](#example-of-making-a-test)
  * [Generate report](#generate-report)
* [pathway information](#pathway-information)
  * [Plots](#plots)
* [TOC generator](#toc-generator)

