# End-to-End Workflow for `DESS`

## Imports & Config

In [1]:
import pandas as pd

# Importing custom utilities
from stats import *
from data_pipeline_manager import *

In [2]:
INPUT_FILE = 'storage/input.dta'
UPLOAD_FILE_PATH = 'storage/completed_DepartmenttoSearch_September2024.dta'
COMPLETE_FILE_PATH = 'storage/complete.parquet'
REPROCESS_FILE_PATH = 'storage/reprocess.parquet'
UNCOMPLETE_FILE_PATH = 'storage/uncomplete.parquet'

## Data Preprocessing

We receive a file from Dropbox and pass it through our data pipeline to merge it into our existing internal parquet files. In this stage, we also add the relevant columns to `uncomplete.parquet`.

In [3]:
df_master = pd.read_stata(INPUT_FILE)
df_c = pd.read_parquet(COMPLETE_FILE_PATH)
df_r = pd.read_parquet(REPROCESS_FILE_PATH)

In [4]:
get_expected_file_split_stats(df_master, df_c, df_r)

+--------------------+
| Total    : 238880  |
| Complete : 212454  |
| Reprocess: 932     |
|--------------------|
| ToDo     : 25494   |
+--------------------+
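
The `ToDo` figure above is consistent with subtracting the complete and reprocess counts from the total (238880 - 212454 - 932 = 25494). A minimal sketch of that arithmetic follows; `expected_todo` is a hypothetical helper used only for illustration and is not the actual implementation of `get_expected_file_split_stats`.

```python
def expected_todo(total: int, complete: int, reprocess: int) -> int:
    """Rows that are neither complete nor queued for reprocessing (assumed definition)."""
    todo = total - complete - reprocess
    if todo < 0:
        # The three files are assumed to partition the master file.
        raise ValueError("complete + reprocess exceeds total")
    return todo


print(expected_todo(238880, 212454, 932))  # 25494
```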


In [None]:
df_u = get_new_rows()
df_u
# append_to_file(UNCOMPLETE_FILE_PATH, df_u)

In [None]:
df_u = add_dess_columns(df_u)
df_u

## Processing Data
Current thinking is that the three high-level steps below (scraping, department extraction, and data merging) should each be a modular function, since they may be run in any order.

### Scraping
- Run the processing script on `uncomplete.parquet`.
    - This can be executed either outside the notebook (e.g. `./main.py`) or inside the notebook (e.g. `!caffeinate -dui ./main.py`).
    - We will probably want some way to `get_stats` about the processing (look at `statusUpdate` from `status.ipynb`).
    - Relevant function(s) for this: `process_and_cache(df)`
- Filter out the relevant rows (based on null `rawText`) and add them to `reprocess.parquet`.
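
The filtering step could look roughly like the sketch below. It assumes that a null `rawText` means the scrape produced nothing and the row needs reprocessing; `split_on_raw_text` and the `id` column are hypothetical names for illustration.

```python
import pandas as pd


def split_on_raw_text(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (scraped, needs_reprocess), split on whether rawText is null."""
    missing = df["rawText"].isna()
    return df[~missing].copy(), df[missing].copy()


# Toy data; the `id` column is illustrative only.
demo = pd.DataFrame(
    {"id": [1, 2, 3], "rawText": ["Dept. of Physics", None, "Dept. of History"]}
)
scraped, needs_reprocess = split_on_raw_text(demo)
print(len(scraped), len(needs_reprocess))  # 2 1
```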

### Department Extraction
- Call `populate_faculty` on the file.
    - By default, the file will be `uncomplete.parquet`.
    - It could also be `complete.parquet`, when we change our department extraction logic and want to re-run it on the existing data.

### Data Merging
- Merge the scraped and faculty-filled information into the other internal files.
    - Add rows to `complete.parquet` and `reprocess.parquet`.
    - Remove those rows from `uncomplete.parquet`.
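
One possible shape for the merge step is sketched below. It assumes every row carries a unique key (the `id` column here is hypothetical) and that processed rows with a null `rawText` go to the reprocess file. `merge_batch` is an illustrative name, not an existing function in `data_pipeline_manager`.

```python
import pandas as pd


def merge_batch(df_u, df_c, df_r, processed, key="id"):
    """Move a processed batch out of df_u and into df_c or df_r."""
    ok = processed["rawText"].notna()
    # Rows with text are treated as complete; the rest are queued for reprocessing.
    new_c = pd.concat([df_c, processed[ok]], ignore_index=True)
    new_r = pd.concat([df_r, processed[~ok]], ignore_index=True)
    # Drop everything in the batch from the uncomplete frame.
    new_u = df_u[~df_u[key].isin(processed[key])].reset_index(drop=True)
    return new_u, new_c, new_r


# Toy frames standing in for the three parquet files.
df_u_demo = pd.DataFrame({"id": [1, 2, 3], "rawText": [None, None, None]})
df_c_demo = pd.DataFrame({"id": [10], "rawText": ["already done"]})
df_r_demo = pd.DataFrame({"id": [20], "rawText": [None]})
batch = pd.DataFrame({"id": [1, 2], "rawText": ["scraped text", None]})

new_u, new_c, new_r = merge_batch(df_u_demo, df_c_demo, df_r_demo, batch)
print(new_u["id"].tolist(), new_c["id"].tolist(), new_r["id"].tolist())
```

Returning new frames rather than writing files keeps the step easy to test; writing each frame back with `to_parquet` would be a separate call.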

## Post-Processing
- Get a status update about the dataset.
    - We want to see the completion rates and conversion rates (look at `final_merge.ipynb`?).
- Prepare the completed file for upload.
    - Convert `complete.parquet` to `.dta` (for uploading to Dropbox).
    - Need to think about how to handle file synchronization and data redundancy for these internal files. The easiest solution is to push these parquet files to Dropbox.
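
The conversion to `.dta` can be sketched with pandas' `DataFrame.to_stata`. The default Stata format limits string length, so this sketch assumes long text columns such as `rawText` should be written as strL, which requires `version=117` or later. `export_to_stata` is a hypothetical helper, and the demo uses a small in-memory frame instead of the real parquet file.

```python
import os
import tempfile

import pandas as pd


def export_to_stata(df: pd.DataFrame, path: str, long_text_cols=()) -> None:
    """Write df as a Stata file, storing long_text_cols as strL."""
    df.to_stata(
        path,
        write_index=False,
        version=117,  # strL columns need format 117 or later
        convert_strl=list(long_text_cols),
    )


# Toy frame with one long string to exercise the strL path.
demo = pd.DataFrame({"id": [1, 2], "rawText": ["short", "x" * 5000]})
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "demo.dta")
    export_to_stata(demo, out, long_text_cols=["rawText"])
    round_trip = pd.read_stata(out)

print(round_trip["rawText"].str.len().tolist())
```

In the notebook, this would be called on `pd.read_parquet(COMPLETE_FILE_PATH)` with `UPLOAD_FILE_PATH` as the destination.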

## Other ToDos
- Update the department extraction logic:
    - Re-order the primary patterns (move the "the" pattern to the top)
    - Add logic to handle the `isProfessor2` variable
    - Figure out what edge cases to consider