# `DESS`: End-to-End Workflow

## Imports & Config

In [1]:
import pandas as pd

# Importing custom utilities
import stats
import data_pipeline_manager as dpm
import dess.search as search
import dess.nlp as nlp

In [2]:
INPUT_FILE = 'storage/input.dta'
UPLOAD_FILE_PATH = 'storage/completed_DepartmenttoSearch_November2024.dta'
COMPLETE_FILE_PATH = 'storage/complete.parquet'
REPROCESS_FILE_PATH = 'storage/reprocess.parquet'
UNCOMPLETE_FILE_PATH = 'storage/uncomplete.parquet'

## Data Preprocessing

We receive a file from Dropbox and pass it through our data pipeline to merge it into our existing internal parquet files. In this stage we also add the relevant columns to `uncomplete.parquet`.

In [3]:
df_master = pd.read_stata(INPUT_FILE)
df_c = pd.read_parquet(COMPLETE_FILE_PATH)
df_r = pd.read_parquet(REPROCESS_FILE_PATH)

In [None]:
stats.get_expected_file_split_stats(df_master, df_c, df_r)

In [None]:
df_u = dpm.get_new_rows()
df_u
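
The internals of `dpm.get_new_rows()` are not shown in this notebook. As a rough illustration of the idea only, new rows could be found with an anti-join: keep the rows of the master file whose key appears in neither internal file. The helper name `find_new_rows` and the choice of `id_text` as the key are assumptions made for this sketch, not the pipeline's actual API.

```python
import pandas as pd


def find_new_rows(df_master, df_complete, df_reprocess, key="id_text"):
    """Hypothetical helper: rows of df_master whose key is in neither internal file."""
    # Collect every key that is already tracked internally.
    known = pd.concat([df_complete[key], df_reprocess[key]]).unique()
    # Keep only the master rows that have not been seen before.
    return df_master[~df_master[key].isin(known)].reset_index(drop=True)


# Toy data standing in for the real files.
master = pd.DataFrame({"id_text": ["a x", "b y", "c z"]})
complete = pd.DataFrame({"id_text": ["a x"]})
reprocess = pd.DataFrame({"id_text": ["b y"]})

new_rows = find_new_rows(master, complete, reprocess)
print(new_rows)
```

Only the row that appears in neither internal file survives the filter.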

In [None]:
df_u = dpm.prepare_dess_data_structure(df_u)
df_u

In [None]:
dpm.write_to_file(UNCOMPLETE_FILE_PATH, df_u, overwrite=True)

## Processing Data
Our existing workflow has three high-level functions that should be kept modular, since they may be run in any order:
- **Scrape**: Search Google for faculty information.
- **Extract**: Extract relevant faculty information from snapshots of the Google search results.
- **Merge**: Update our files with the new information.

### Scraping
Run the processing script on `uncomplete.parquet`. The code block below is for demonstration purposes; it is preferable to run this process in a separate terminal window so that progress is easy to track.

In [None]:
caffeinate -dui python3 search.py [start_index]

To check the progress of the scraping script, run the following block. It reports the number of chunks processed and the percentage of the file processed so far.

In [None]:
df_u = pd.read_parquet(UNCOMPLETE_FILE_PATH)
stats.get_chunk_processing_stats(df_u, CHUNK_SIZE=200)
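
For intuition, a progress report of this kind can be sketched as follows. This is not the implementation of `stats.get_chunk_processing_stats`; it assumes that a row counts as processed once its `rawText` column is non-null and that rows are scraped in order. The helper name `chunk_progress` is hypothetical.

```python
import math

import pandas as pd


def chunk_progress(df, chunk_size=200, column="rawText"):
    """Hypothetical helper: summarize scraping progress in fixed-size chunks."""
    total_rows = len(df)
    # Assumption: a row is "processed" once `column` holds a non-null value.
    done_rows = int(df[column].notna().sum())
    total_chunks = math.ceil(total_rows / chunk_size) if total_rows else 0
    # Only fully processed chunks are counted as done.
    done_chunks = done_rows // chunk_size
    percent = round(100 * done_rows / total_rows, 2) if total_rows else 0.0
    return {
        "done_chunks": done_chunks,
        "total_chunks": total_chunks,
        "percent_rows_done": percent,
    }


# Toy frame: 450 of 1000 rows already have scraped text.
demo = pd.DataFrame({"rawText": ["snippet"] * 450 + [None] * 550})
progress = chunk_progress(demo)
print(progress)
```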

In [None]:
df_u

### Department Extraction
- Call `nlp.extract_department_information` on the completed file (`df_c`).

In [None]:
nlp.extract_department_information(df_c)  # modifies relevant columns in place
df_c

In [None]:
dpm.write_to_file(COMPLETE_FILE_PATH, df_c, overwrite=True)
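
How `nlp.extract_department_information` works is not shown here. Purely to illustrate the shape of the task, the sketch below guesses a department by keyword matching over a row's search snippets and falls back to `MISSING`. The keyword list, the helper name `guess_department`, and the matching strategy are all assumptions; the real extraction may be considerably more sophisticated.

```python
import pandas as pd

# Hypothetical keyword list; the real vocabulary is not shown in this notebook.
DEPARTMENT_KEYWORDS = ["physics", "economics", "psychology", "computer"]


def guess_department(snippets):
    """Hypothetical helper: first keyword found in any snippet, else "MISSING"."""
    if snippets is None:
        return "MISSING"
    for snippet in snippets:
        lowered = str(snippet).lower()
        for keyword in DEPARTMENT_KEYWORDS:
            if keyword in lowered:
                return keyword
    return "MISSING"


# Toy rows: each rawText value is a list of search-result snippets.
demo = pd.DataFrame(
    {
        "rawText": [
            ["Associate Professor of Economics"],
            ["Map location is approximate."],
        ]
    }
)
demo["department"] = demo["rawText"].apply(guess_department)
print(demo)
```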

### Data merging
The scraping process is often run in parallel, so we have utilities to stitch the partial results together into one complete file.

In [None]:
df_u_full = dpm.get_merged_data_from_parallel_scrape(pd.read_parquet('storage/uncomplete-akhil.parquet'),
                                                 pd.read_parquet('storage/uncomplete.parquet'))
dpm.write_to_file(UNCOMPLETE_FILE_PATH, df_u_full, overwrite=True)
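
As an illustration of what stitching two partial scrapes can look like, the sketch below aligns both frames on a shared key and fills each frame's gaps from the other with `DataFrame.combine_first`. It is not the implementation of `dpm.get_merged_data_from_parallel_scrape`; the helper name, the `id_text` key, and the assumption that keys are unique within each frame are all hypothetical.

```python
import pandas as pd


def stitch_parallel_scrapes(df_a, df_b, key="id_text"):
    """Hypothetical helper: combine two partial scrapes, preferring df_a's values."""
    a = df_a.set_index(key)
    b = df_b.set_index(key)
    # combine_first keeps a's non-null values and fills its gaps from b.
    return a.combine_first(b).reset_index()


# Toy data: each worker scraped a different row.
part_a = pd.DataFrame({"id_text": ["a x", "b y"], "rawText": ["snippet A", None]})
part_b = pd.DataFrame({"id_text": ["a x", "b y"], "rawText": [None, "snippet B"]})

stitched = stitch_parallel_scrapes(part_a, part_b)
print(stitched)
```

If both frames hold a non-null value for the same cell, this sketch silently keeps the first frame's value, so a real merge would likely want to detect and report such conflicts.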

We also want to merge the scraped and faculty-filled information into our other internal files.

In [None]:
# Add rows to `complete.parquet` and `reprocess.parquet`, and remove them from `uncomplete.parquet`.
# Prints error messages for any conflicts and returns the updated dataframes.
df_c, df_r = dpm.update_internal_files(df_c, df_r, df_u_full)

# Write the updated dataframes back to disk
dpm.write_to_file(COMPLETE_FILE_PATH, df_c, overwrite=True)
dpm.write_to_file(REPROCESS_FILE_PATH, df_r, overwrite=True)

## Post-Processing
To get a status update on the dataset (an overview of the completion and conversion rates), run the following block.

In [None]:
df_result = pd.read_parquet(COMPLETE_FILE_PATH)

In [4]:
stats.get_dataset_stats(COMPLETE_FILE_PATH)


 ________________________________________________ 
|      STATS FOR: storage/complete.parquet       |
|________________________________________________|
|Total Number of Records:                 437968 |
|Number of Professors:                    295796 |
|Number of Professors with Department:    233513 |
|Number of Professors without Department: 62283  |
|________________________________________________|
|Professor Identification Rate (%):       67.54  |
|Department Extraction Rate (coverage %): 78.94  |
|Department Coverage Gap (slippage %):    21.06  |
|________________________________________________|
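
The three rates in the report can be reproduced from the four counts above it. The sketch below assumes that the identification rate is taken over all records and that the coverage and slippage rates are taken over professors only; these denominators are inferred from the printed numbers, which they match, rather than from the `stats` source.

```python
# Counts copied from the report above.
total_records = 437968
professors = 295796
with_department = 233513
without_department = 62283

# Assumed definitions, inferred from the printed figures.
identification_rate = round(100 * professors / total_records, 2)
coverage_rate = round(100 * with_department / professors, 2)
slippage_rate = round(100 * without_department / professors, 2)

print(identification_rate, coverage_rate, slippage_rate)
```

The two department counts sum to the number of professors, which is why coverage and slippage add up to 100%.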


To back up the results (i.e. upload them to Dropbox), run the following blocks:

In [5]:
dpm.create_stata_output_file("completed_DepartmenttoSearch_November2024_2.dta")

Successfully generated storage/completed_DepartmenttoSearch_November2024_2.dta


In [4]:
dpm.orchestrate_upload_workflow()

Uploading: completed_DepartmenttoSearch_November2024_2.dta
	Successfully uploaded completed_DepartmenttoSearch_November2024_2.dta to /backup-sync/completed_DepartmenttoSearch_November2024_2.dta
Uploading: completed_conflicts.csv
	Successfully uploaded completed_conflicts.csv to /backup-sync/completed_conflicts.csv
Skipping: .DS_Store
Uploading: reprocess.parquet
	Successfully uploaded reprocess.parquet to /backup-sync/reprocess.parquet
Uploading: complete.parquet
	Successfully uploaded complete.parquet to /backup-sync/complete.parquet
Skipping: input.dta
Uploading: uncomplete.parquet
	Successfully uploaded uncomplete.parquet to /backup-sync/uncomplete.parquet


In [6]:
df_c = pd.read_parquet(COMPLETE_FILE_PATH)
df_c

| Unnamed: 0 | fullid | id_text | id_name | id_university | isProfessor | rawText | department | isProfessor2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | | Zwygartstauffacher university of iowa | Zwygartstauffacher | university of iowa | True | [Mary Zwygart-Stauffacher's 32 research works ... | MISSING | False |
| 1 | 3.0 | A Abanov texas a and m university | A Abanov | texas a and m university | True | [Welcome to the Condensed Matter research grou... | physics | True |
| 2 | 4.0 | A Abbassi texas a and m university-commerce | A Abbassi | texas a and m university-commerce | True | [Amir Abbassi, Associate Professor, Counseling... | MISSING | False |
| 3 | 5.0 | A Abdullat west texas a and m university | A Abdullat | west texas a and m university | True | [Dr. Abdullat serves as the Dean and a Profess... | Computer | True |
| 4 | 6.0 | A Abramovitc texas state | A Abramovitc | texas state | True | [Assistant Professor, UAC 253C. Office Hours: ... | Psychology | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 437963 | | Zoran Brkanac university of washington-seattle... | Zoran Brkanac | university of washington-seattle campus | False | [Map location is approximate.Can you help us i... | MISSING | False |
| 437964 | | Zoran Popovic university of washington-seattle... | Zoran Popovic | university of washington-seattle campus | True | [Zoran Popović. Professor. Director, Center fo... | Computer | True |
| 437965 | | Zsolt Argenyi university of washington-seattle... | Zsolt Argenyi | university of washington-seattle campus | False | [Add business hours, Be the first to ask a que... | MISSING | False |
| 437966 | | Zsolt Becsi southern illinois university-carbo... | Zsolt Becsi | southern illinois university-carbondale | True | [Zsolt Becsi, Associate Professor of Economics... | Economics | True |
