### **Table of Contents**
  * [Function To Read in the Data!](#function-to-read-in-the-data!)
  * [Example usage](#example-usage)
      * [To Access a DataFrame in the list](#to-access-a-dataframe-in-the-list)
      * [To Remove Spaces in DataFrame name](#to-remove-spaces-in-dataframe-name)
  * [Update cleaning code](#update-cleaning-code)
  * [Generate report](#generate-report)
  * [Plots](#plots)

In [28]:
import pandas as pd
from typing import Dict, Union
from pathlib import Path
import os
import sys
import re
import pandas.testing as pdt

## Function To Read in the Data! 

In [14]:
def load_data_folder(
    folder_path: Union[str, os.PathLike] = "../../data"
) -> Dict[str, pd.DataFrame]:
    """
    Load all CSV/XLS/XLSX files in a folder into pandas DataFrames.

    Parameters
    ----------
    folder_path : str | os.PathLike, optional
        Path to the folder containing the files. Defaults to "../../data".

    Returns
    -------
    Dict[str, pandas.DataFrame]
        A mapping from the file's stem (filename without extension) to its
        loaded DataFrame. For example, "employees.csv" -> key "employees".

    Raises
    ------
    FileNotFoundError
        If `folder_path` does not exist.
    PermissionError
        If the folder or files cannot be accessed due to permissions.
    pd.errors.EmptyDataError
        If a CSV file is empty and cannot be parsed.

    Notes
    -----
    - Supported extensions: .csv, .xls, .xlsx (case-insensitive).
    - If both `name.csv` and `name.xlsx` exist, the later one encountered will
      overwrite the earlier entry for key `name`.
    """
    path = Path(folder_path)
    if not path.exists():
        raise FileNotFoundError(f"Folder not found: {path.resolve()}")

    dataframes: Dict[str, pd.DataFrame] = {}
    for p in path.iterdir():
        if not p.is_file():
            continue

        ext = p.suffix.lower()
        if ext == ".csv":
            df = pd.read_csv(p)
        elif ext in {".xlsx", ".xls"}:
            df = pd.read_excel(p)
        else:
            continue

        dataframes[p.stem] = df

    return dataframes


## Example usage 

```python 
dfs = load_data_folder()
dfs.keys()
```
output:
```bash
dict_keys(['ARC_Enrollments', 'ARC_Application', 'All_demographics_and_programs'])
```
#### To Access a DataFrame in the list 

```python
all_demo = dfs['All_demographics_and_programs']
all_demo.head(1)
```

output:
|col 1|col 2|col 3|
|:--:|:--:|:--:|
|3.14|name|apple|



#### To Remove Spaces in DataFrame name

```python 
for name, df in dfs.items():
    safe_name = name.replace(" ", "_")
    globals()[safe_name] = df
```
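Injecting names into `globals()` is convenient in a notebook, but it can silently overwrite existing variables. As a hedged alternative, the sketch below keeps everything in the dictionary and only normalizes the keys. `sanitize_name` and `rename_keys` are hypothetical helpers, not part of this notebook, and plain strings stand in for the DataFrames so the example runs without the data folder.

```python
import re
from typing import Any, Dict


def sanitize_name(name: str) -> str:
    """Turn a file stem into a valid Python identifier (hypothetical helper)."""
    # Replace every run of non-word characters (spaces, dashes, dots) with "_".
    safe = re.sub(r"\W+", "_", name.strip())
    # Identifiers cannot start with a digit, so prefix those with "_".
    if safe and safe[0].isdigit():
        safe = f"_{safe}"
    return safe


def rename_keys(frames: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with sanitized keys; raise if two keys collide."""
    renamed: Dict[str, Any] = {}
    for name, value in frames.items():
        safe = sanitize_name(name)
        if safe in renamed:
            raise ValueError(f"Two files map to the same name: {safe!r}")
        renamed[safe] = value
    return renamed


# Placeholder values stand in for DataFrames.
example = rename_keys({"ARC Enrollments": "df1", "2024 intake-form": "df2"})
print(list(example))  # ['ARC_Enrollments', '_2024_intake_form']
```

With the real data this would be called as `dfs = rename_keys(load_data_folder())`, after which every DataFrame is still reached through `dfs[...]`.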

In [16]:
dfs = load_data_folder()
dfs.keys()

dict_keys(['ARC_Enrollments', 'ARC_Application', 'All_demographics_and_programs'])

How to access a DataFrame from the dictionary above

In [18]:
all_demo = dfs['All_demographics_and_programs']
all_demo.head()

Unnamed: 0,Auto Id,First Name,Last Name,Gender,Race,Ethnicity Hispanic/Latino,Outcome,Veteran,Ex-Offender,Justice Involved,Single Parent,Program: Program Name
0,202107-1206,name,name,Male,Black or African American,,,No,,,,Reimage 21-22
1,202107-1206,name,name,Male,Black or African American,,,No,,,,Reimage 21-22
2,202107-1206,name,name,Male,Black or African American,,,No,,,,Reimage 21-22
3,202108-5167,name,name,Male,Asian,,Successfully Completed,No,,No,,Tech Louisville 21-22
4,202108-5171,name,name,Male,Black or African American,,,,,,,Tech Louisville 21-22


A short for loop to make each DataFrame available as its own variable

In [None]:
for name, df in dfs.items():
    safe_name = name.replace(" ", "_")
    globals()[safe_name] = df

In [21]:
All_demographics_and_programs

Unnamed: 0,Auto Id,First Name,Last Name,Gender,Race,Ethnicity Hispanic/Latino,Outcome,Veteran,Ex-Offender,Justice Involved,Single Parent,Program: Program Name
0,202107-1206,name,name,Male,Black or African American,,,No,,,,Reimage 21-22
1,202107-1206,name,name,Male,Black or African American,,,No,,,,Reimage 21-22
2,202107-1206,name,name,Male,Black or African American,,,No,,,,Reimage 21-22
3,202108-5167,name,name,Male,Asian,,Successfully Completed,No,,No,,Tech Louisville 21-22
4,202108-5171,name,name,Male,Black or African American,,,,,,,Tech Louisville 21-22
...,...,...,...,...,...,...,...,...,...,...,...,...
32225,202502-20671,name,name,Female,White,,,,,,,Connecting Young Adults 24-25
32226,202410-17602,name,name,Female,White,,,,,,,Connecting Young Adults 24-25
32227,202506-23809,name,name,Female,White,,,,,,,Connecting Young Adults 24-25
32228,202410-17749,name,name,Female,White,,,,,,,Connecting Young Adults 24-25


## Update cleaning code 
- Review the cleaning code that we already have.
- We should start changing it to account for this.
- We need to make sure the program doesn't crash when something fails.
  - [Try Except logic updates](https://www.w3schools.com/python/python_try_except.asp)
  - Make the error messages meaningful: say what failed and why.
- Ideally, we will not drop anything from our data.


Will update this a bit with usage etc... 

In [16]:
class DataCleaner:
    """
    General-purpose cleaner for multiple WORC datasets
    (Employment, Enrollments, Demographics).

    Uses try/except for safety (does not break if col missing).
    Keeps all rows (no drops), but fills/fixes when possible.
    """

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()

    def safe_drop_columns(self, cols_to_drop):
        """Drop columns if they exist, otherwise ignore."""
        try:
            self.df = self.df.drop(columns=cols_to_drop, errors='ignore')
        except Exception as e:
            print(f"[Warning] Failed dropping columns: {e}")
        return self

    def safe_fillna(self, fill_map: dict):
        """Fill NaN values for specific columns safely."""
        for col, val in fill_map.items():
            try:
                if col in self.df.columns:
                    self.df[col] = self.df[col].fillna(val)
            except Exception as e:
                print(f"[Warning] Failed filling NaN for {col}: {e}")
        return self

    def safe_replace(self, col, replacements: dict):
        """Replace values in a column safely."""
        try:
            if col in self.df.columns:
                self.df[col] = self.df[col].replace(replacements)
        except Exception as e:
            print(f"[Warning] Failed replacing values in {col}: {e}")
        return self

    def safe_convert_dtype(self, col, dtype, errors="ignore"):
        """Convert column dtype safely."""
        try:
            if col in self.df.columns:
                if "datetime" in str(dtype):
                    self.df[col] = pd.to_datetime(
                        self.df[col], errors="coerce")
                else:
                    self.df[col] = self.df[col].astype(dtype, errors=errors)
        except Exception as e:
            print(f"[Warning] Failed dtype conversion on {col}: {e}")
        return self

    def normalize_gender(self):
        """Unify transgender categories safely."""
        try:
            if "Gender" in self.df.columns:
                self.df["Gender"] = self.df["Gender"].replace({
                    "Transgender male to female": "Transgender",
                    "Transgender female to male": "Transgender"
                })
        except Exception as e:
            print(f"[Warning] Failed gender normalization: {e}")
        return self

    def split_race(self):
        """Split Race column into Race_1, Race_2, etc., if it exists."""
        try:
            if "Race" in self.df.columns:
                splitting = self.df["Race"].astype(
                    str).str.split(";", expand=True)
                splitting.columns = [
                    f"Race_{i+1}" for i in range(splitting.shape[1])]
                self.df = pd.concat(
                    [self.df.drop(columns=["Race"]), splitting], axis=1)
        except Exception as e:
            print(f"[Warning] Failed race splitting: {e}")
        return self

    def clean_salary(self, hours_per_year: int = 2080):
        """
        Clean and standardize salary values in the DataFrame.

        Steps performed:
        1. Remove currency symbols, commas, and shorthand (e.g., "$50k" → "50000").
        2. Handle ranges by converting them to the average value 
           (e.g., "50,000-70,000" → 60000).
        3. Convert values to numeric, coercing invalid entries to NaN.
        4. Treat values < 200 as hourly wages and convert to annual salaries 
           (multiplied by `hours_per_year`).
        5. Drop unrealistic values greater than 1,000,000 (set to NaN).

        Parameters
        ----------
        hours_per_year : int, optional (default=2080)
            Number of work hours in a year for converting hourly to annual salary.

        Returns
        -------
        self : object
            The current instance with the cleaned Salary column.
        """
        try:
            if "Salary" in self.df.columns:
                self.df["Salary"] = self.df["Salary"].astype(str)
                def parse_salary(val: str):
                    val = val.strip()

                    # Handle range like "50k-70k" or "50,000–70,000"
                    if "-" in val or "–" in val:
                        parts = re.split(r"[-–]", val)
                        nums = [parse_salary(p) for p in parts if p.strip()]
                        nums = [n for n in nums if n is not None]
                        return sum(nums) / len(nums) if nums else None

                    # Remove $ and commas (outer whitespace was stripped above)
                    val = re.sub(r"[\$,]", "", val)

                    # Handle shorthand k/K (e.g., 50k -> 50000)
                    match = re.match(r"(\d+(\.\d+)?)([kK])", val)
                    if match:
                        return float(match.group(1)) * 1000

                    # Convert plain number if possible
                    try:
                        return float(val)
                    except ValueError:
                        return None

                # Apply parsing
                self.df["Salary"] = self.df["Salary"].apply(parse_salary)

                # Convert small numbers (hourly) to annual
                self.df.loc[self.df["Salary"] < 200, "Salary"] *= hours_per_year

                # Drop unrealistic salaries
                self.df.loc[self.df["Salary"] > 1_000_000, "Salary"] = None

        except Exception as e:
            print(f"[Warning] Failed salary cleaning: {e}")

        return self

    def finalize(self):
        """Return cleaned dataframe."""
        return self.df

### Sample use of the clean_salary function. 

In [19]:
test_df = pd.DataFrame({
    "Salary": ["$50k", "10", "50", "60,000", "70,000-80,000", "100k", "150000", "200", "3000", "5000000", "$1.5M", "invalid", 70]
})

# Create instance with test DataFrame
cleaner = DataCleaner(test_df)

# Run salary cleaning
cleaner = cleaner.clean_salary(2080)

# Get the cleaned DataFrame
result_df = cleaner.finalize()
print(result_df)

      Salary
0    50000.0
1    20800.0
2   104000.0
3    60000.0
4    75000.0
5   100000.0
6   150000.0
7      200.0
8     3000.0
9        NaN
10       NaN
11       NaN
12  145600.0


In [21]:
fail_df = pd.DataFrame({
    "Salary": [
        None,                # NaN input
        "",                  # empty string
        " ",                 # whitespace only
        "abc123",            # text + numbers
        "50k-abc",           # malformed range
        "$-5000",            # negative salary
        "∞",                 # infinity symbol
        "NaN",               # literal string NaN
        "$1.5M",             # millions, not handled in parser
        "70,000—80,000"      # em dash (—) instead of hyphen/dash
    ]
})
# Create instance with failing DataFrame
fail_cleaner = DataCleaner(fail_df)
# Run salary cleaning on failing DataFrame
fail_cleaner = fail_cleaner.clean_salary(2080)
# Get the cleaned DataFrame
fail_result_df = fail_cleaner.finalize()
print(fail_result_df)

    Salary
0      NaN
1      NaN
2      NaN
3      NaN
4  50000.0
5   5000.0
6      NaN
7      NaN
8      NaN
9      NaN


In [None]:
class DataCleaner:
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()

    def clean_salary(self, hours_per_year: int = 2080):
        """
        Clean and standardize salary values in the DataFrame.

        Steps performed:
        1. Remove currency symbols, commas, and shorthand (e.g., "$50k" → 50000).
        2. Handle ranges by converting them to the average value 
           (e.g., "50,000–70,000" → 60000).
        3. Handle shorthand "M" (e.g., "$1.5M" → 1,500,000).
        4. Convert values to numeric, coercing invalid entries to NaN.
        5. Treat values <= 200 as hourly wages and convert to annual salaries 
           (multiplied by `hours_per_year`).
        6. Drop unrealistic values greater than 1,000,000 (set to NaN).

        Parameters
        ----------
        hours_per_year : int, optional (default=2080)
            Number of work hours in a year for converting hourly to annual salary.

        Returns
        -------
        self : object
            The current instance with the cleaned Salary column.
        """
        try:
            if "Salary" in self.df.columns:
                self.df["Salary"] = self.df["Salary"].astype(str)

                def parse_salary(val: str):
                    val = val.strip()
                    if not val or val.lower() in {"nan", "none"}:
                        return None

                    # Normalize en dashes and em dashes to a plain hyphen
                    val = re.sub(r"[–—]", "-", val)

                    # Handle range like "50k-70k" or "50,000-70,000"
                    if "-" in val:
                        parts = val.split("-")
                        nums = [parse_salary(p) for p in parts if p.strip()]
                        nums = [n for n in nums if n is not None]
                        return sum(nums) / len(nums) if nums else None

                    # Remove $ and commas (outer whitespace was stripped above)
                    val = re.sub(r"[\$,]", "", val)

                    # Handle shorthand k/K (e.g., "50k" → 50000)
                    match_k = re.match(r"^(\d+(\.\d+)?)[kK]$", val)
                    if match_k:
                        return float(match_k.group(1)) * 1000

                    # Handle shorthand M (e.g., "1.5M" → 1500000)
                    match_m = re.match(r"^(\d+(\.\d+)?)[mM]$", val)
                    if match_m:
                        return float(match_m.group(1)) * 1_000_000

                    # Plain number (integer or float)
                    try:
                        return float(val)
                    except ValueError:
                        return None

                # Apply parsing
                self.df["Salary"] = self.df["Salary"].apply(parse_salary)

                # Convert small numbers (hourly) to annual
                self.df.loc[self.df["Salary"] <= 200, "Salary"] *= hours_per_year

                # Drop unrealistic salaries
                self.df.loc[self.df["Salary"] > 1_000_000, "Salary"] = None

        except Exception as e:
            print(f"[Warning] Failed salary cleaning: {e}")

        return self

    def finalize(self):
        """Return cleaned dataframe."""
        return self.df


In [37]:
# Test DataFrame with edge/fail cases
fail_df = pd.DataFrame({
    "Salary": [
        None,                #  NaN
        "",                  #  NaN
        " ",                 #  NaN
        "abc123",            #  NaN
        "50k-abc",           #  50000.0
        "$-5000",            #  5000.0 ("-" is read as a range separator, so the sign is lost)
        "∞",                 #  NaN
        "NaN",               #  NaN
        "$1.5M",             #  NaN ( >1,000,000 rule)
        "70,000—80,000"      #  75000.0 (dash normalized)
    ]
})

# Run through cleaner
cleaner = DataCleaner(fail_df)
result = cleaner.clean_salary().finalize().reset_index(drop=True)

# Expected results as DataFrame
expected = pd.DataFrame({
    "Salary": [
        None,       # None
        None,       # empty string
        None,       # whitespace
        None,       # abc123
        50000.0,    # 50k-abc
        5000.0,     # negative salary
        None,       # infinity
        None,       # "NaN"
        None,       # 1.5M filtered out
        75000.0     # range with em dash
    ]
}, dtype="float64").reset_index(drop=True)

# Assertion test
pdt.assert_frame_equal(result, expected)
print("✅ Salary cleaning DataFrame test passed!")

✅ Salary cleaning DataFrame test passed!
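The test above pins down the current behavior, including `"$-5000"` becoming `5000.0` because the minus sign is treated as a range separator. If that ever needs to change, one hedged option is to split on a hyphen only when it sits between two amounts. `split_range` and `RANGE_SEP` are hypothetical names and are not part of `DataCleaner`.

```python
import re
from typing import List

# A hyphen counts as a range separator only when it follows a digit or a
# k/M suffix and is followed by a digit or "$".
RANGE_SEP = re.compile(r"(?<=[\dkKmM])\s*-\s*(?=[$\d])")


def split_range(val: str) -> List[str]:
    """Split "low-high" into its two parts; leave everything else in one piece."""
    # Normalize en dashes and em dashes to a plain hyphen first.
    val = re.sub(r"[–—]", "-", val.strip())
    return RANGE_SEP.split(val)


print(split_range("70,000—80,000"))  # ['70,000', '80,000']
print(split_range("$-5000"))         # ['$-5000']
```

Note the trade-offs: a malformed range such as `"50k-abc"` would no longer be split, so it would parse to NaN instead of 50000.0, and a negative value that survives parsing would still hit the hourly-wage rule, so the cleaner would need an explicit decision on how to treat negatives.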


## Generate report 

- Overall program completion, counting only the new-style classes (m1-m4)
- Completion by year
- Overall completion by pathway
- Completion by year and pathway
- Feel free to get creative here, e.g. adding gender, to give us a better understanding
- Education level combined with the breakdowns above
- Export the report as a txt file

## Plots 
- Review the existing plots
- Use a consistent color scheme
- Pick the plots that go with the report above
- Create any plots that are missing
- Give each plotting function options to show and/or save the figure

See `src/notebooks/visualization_examples.ipynb`.

The functions below are from `src/Carmen_WORCEmployment_Plots.py`:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


def plot_salary_by_gender(data):
    plt.figure(figsize=(8, 5))
    sns.boxplot(data=data, x='Gender', y='Salary')
    plt.title("Salary Distribution by Gender")
    plt.show()


def plot_avg_salary_by_city(data):
    region_salary = data.groupby('Mailing City')['Salary'].mean().sort_values()
    region_salary.plot(kind='barh', figsize=(8, 5), title="Average Salary by KY Region")
    plt.xlabel("Average Salary")
    plt.show()


def plot_placements_over_time(data):
    data.set_index('Start Date').resample('M').size().plot(kind='line', marker='o', figsize=(10, 4))
    plt.title("Number of Placements Over Time")
    plt.ylabel("Placements")
    plt.show()


def plot_placement_type_by_program(data):
    plt.figure(figsize=(10, 6))
    sns.countplot(data=data, x='ATP Placement Type', hue='Program: Program Name')
    plt.xticks(rotation=45)
    plt.title("Placement Type by Program")
    plt.show()


def plot_top_cities(data):
    city_counts = data['Mailing City'].value_counts().head(10)
    city_counts.plot(kind='bar', title='Top Cities by Participant Count', figsize=(8, 4))
    plt.ylabel("Count")
    plt.show()

TOC generator 

In [1]:
import json
import os


def generate_toc_from_notebook(notebook_path):
    """
    Parses a local .ipynb file and generates Markdown for a Table of Contents.
    """
    if not os.path.isfile(notebook_path):
        print(f"❌ Error: File not found at '{notebook_path}'")
        return

    with open(notebook_path, 'r', encoding='utf-8') as f:
        notebook = json.load(f)

    toc_markdown = "### **Table of Contents**\n"
    for cell in notebook.get('cells', []):
        if cell.get('cell_type') == 'markdown':
            for line in cell.get('source', []):
                if line.strip().startswith('#'):
                    level = line.count('#')
                    title = line.strip('#').strip()
                    link = title.lower().replace(' ', '-').strip('-.()')
                    indent = '  ' * (level - 1)
                    toc_markdown += f"{indent}* [{title}](#{link})\n"

    print("\n--- ✅ Copy the Markdown below and paste it "
          "into a new markdown cell ---\n")
    print(toc_markdown)


notebook_path = 'mainNb.ipynb'
generate_toc_from_notebook(notebook_path)



--- ✅ Copy the Markdown below and paste it into a new markdown cell ---

### **Table of Contents**
    * [**Table of Contents**](#**table-of-contents**)
  * [Function To Read in the Data!](#function-to-read-in-the-data!)
  * [Example usage](#example-usage)
      * [To Access a DataFrame in the list](#to-access-a-dataframe-in-the-list)
      * [To Remove Spaces in DataFrame name](#to-remove-spaces-in-dataframe-name)
  * [Update cleaning code](#update-cleaning-code)
  * [Generate report](#generate-report)
  * [Plots](#plots)
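The links printed above keep Markdown emphasis and punctuation in the anchor (`#**table-of-contents**`, `#function-to-read-in-the-data!`), and `line.count('#')` also counts any `#` that appears inside a title. A hedged sketch of a stricter heading parser follows. It assumes GitHub-style anchors (lowercase, punctuation removed, spaces turned into hyphens); Jupyter front ends may build heading ids differently, so the links should be checked in the viewer that will be used. `parse_heading` and `slugify` are hypothetical names.

```python
import re
from typing import Optional, Tuple

# One to six leading "#", at least one space, then the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def parse_heading(line: str) -> Optional[Tuple[int, str]]:
    """Return (level, plain title) for a Markdown heading line, else None."""
    match = HEADING.match(line.strip())
    if not match:
        return None
    level = len(match.group(1))  # count only the leading hashes
    title = re.sub(r"[*`]", "", match.group(2)).strip()  # drop emphasis markers
    return level, title


def slugify(title: str) -> str:
    """Build an anchor: lowercase, drop punctuation, spaces become hyphens."""
    slug = re.sub(r"[^\w\s-]", "", title.lower())
    return re.sub(r"\s+", "-", slug.strip())


print(parse_heading("### **Table of Contents**"))  # (3, 'Table of Contents')
print(slugify("Function To Read in the Data!"))    # function-to-read-in-the-data
```

Inside `generate_toc_from_notebook`, these two helpers would replace the `level`, `title`, and `link` lines; the indentation logic can stay as it is.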

