Creating a pipeline to extract tables from PDFs using AWS Step Functions and AWS Textract

A culinary consulting company has spent several years training school nutrition staff and administrators in Oklahoma public schools. The trainers recorded pre- and post-assessment data in PDFs, covering multiple years with roughly 12–20 schools each year. The company would like insights from this data, specifically from 4 tables that appear in each of these several dozen documents.

I used AWS Textract, a machine learning service that extracts data from documents while preserving table structure. AWS Step Functions orchestrates the ETL flow, Python Lambda functions perform the transformations, S3 serves as the data lake, and Tableau provides the visualizations.
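To make the table-extraction step concrete, here is a minimal sketch of how a Textract response could be turned back into rows and columns inside a Lambda. It is not the code from this repo: the function name `tables_from_blocks` and the sample data are hypothetical, and the sketch assumes the standard Textract block shape (`TABLE` blocks pointing to `CELL` blocks, which point to `WORD` blocks) as returned when the `TABLES` feature is requested.

```python
from collections import defaultdict


def tables_from_blocks(blocks):
    """Rebuild each TABLE block as a list of rows (each row a list of cell strings)."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Collect the Ids listed under CHILD relationships, if there are any.
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    def cell_text(cell):
        # A cell's text is the space-joined text of its WORD children.
        words = [by_id[i] for i in child_ids(cell) if i in by_id]
        return " ".join(w["Text"] for w in words if w["BlockType"] == "WORD")

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        grid = defaultdict(dict)  # row index -> {column index -> text}
        for cid in child_ids(block):
            cell = by_id.get(cid)
            if cell and cell["BlockType"] == "CELL":
                grid[cell["RowIndex"]][cell["ColumnIndex"]] = cell_text(cell)
        rows = []
        for r in sorted(grid):
            width = max(grid[r])  # Textract indices start at 1
            rows.append([grid[r].get(c, "") for c in range(1, width + 1)])
        tables.append(rows)
    return tables


# Made-up sample in the Textract block shape (not real assessment data).
SAMPLE_BLOCKS = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c11", "c12", "c21", "c22"]}]},
    {"Id": "c11", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c12", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "c21", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4", "w5"]}]},
    {"Id": "c22", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},  # empty cell
    {"Id": "w1", "BlockType": "WORD", "Text": "School"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Pre"},
    {"Id": "w3", "BlockType": "WORD", "Text": "score"},
    {"Id": "w4", "BlockType": "WORD", "Text": "School"},
    {"Id": "w5", "BlockType": "WORD", "Text": "A"},
]

if __name__ == "__main__":
    for table in tables_from_blocks(SAMPLE_BLOCKS):
        for row in table:
            print(row)
```

In a real run the `blocks` list would come from the `Blocks` field of a Textract response rather than a hand-built sample, and each list of rows could then be handed to a DataFrame constructor for further cleanup.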

AWS architecture drawing and table ERD:

https://tinyurl.com/47pau7nu
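The linked drawing is the authoritative picture of the architecture. As a rough illustration only, the sketch below shows how a start, wait, check, and parse flow of this kind might be written in Amazon States Language from Python. Every state name, the `$.status` field, the 30-second wait, and the placeholder Lambda ARNs are assumptions made for the example, not details taken from this project.

```python
import json

# Placeholder ARNs; the account ID and function names are hypothetical.
EXAMPLE_ARNS = {
    "start": "arn:aws:lambda:us-east-1:123456789012:function:start-textract",
    "check": "arn:aws:lambda:us-east-1:123456789012:function:check-textract",
    "parse": "arn:aws:lambda:us-east-1:123456789012:function:parse-tables",
}


def build_definition(lambda_arns):
    """Return a state machine definition that polls a Textract job, then parses it."""
    return {
        "Comment": "Illustrative PDF table extraction flow (assumed, not the repo's)",
        "StartAt": "StartTextractJob",
        "States": {
            "StartTextractJob": {
                "Type": "Task",
                "Resource": lambda_arns["start"],
                "Next": "WaitForTextract",
            },
            "WaitForTextract": {"Type": "Wait", "Seconds": 30, "Next": "CheckJobStatus"},
            "CheckJobStatus": {
                "Type": "Task",
                "Resource": lambda_arns["check"],
                "Next": "IsJobDone",
            },
            "IsJobDone": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.status", "StringEquals": "SUCCEEDED",
                     "Next": "ParseTables"},
                    {"Variable": "$.status", "StringEquals": "FAILED",
                     "Next": "JobFailed"},
                ],
                # Any other status means the job is still running: wait again.
                "Default": "WaitForTextract",
            },
            "ParseTables": {
                "Type": "Task",
                "Resource": lambda_arns["parse"],
                "End": True,
            },
            "JobFailed": {"Type": "Fail", "Error": "TextractJobFailed"},
        },
    }


def dangling_targets(definition):
    """List transition targets that do not name a defined state."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for choice in state.get("Choices", []):
            targets.add(choice["Next"])
    return sorted(targets - set(states))


if __name__ == "__main__":
    definition = build_definition(EXAMPLE_ARNS)
    print(dangling_targets(definition))  # sanity check before deploying
    print(json.dumps(definition, indent=2))
```

Polling with a Wait state keeps each Lambda short-lived instead of having one function block while Textract works; the small `dangling_targets` check is just a convenience for catching misspelled state names before the JSON is uploaded.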

Blog posts about this project:

Things I've learned about Step Functions in doing this project: https://medium.com/@brian.farish/aws-step-functions-for-data-pipelines-7fc35e080ad9

Things I've learned about Lambdas in doing this project: https://medium.com/@brian.farish/navigating-aws-lambdas-and-iam-permissions-bd3540451f2d

Visualizations of derived data in Tableau:

One of the dashboards I made with some of the data from these tables: https://public.tableau.com/app/profile/brian.farish/viz/Cooking-for-Kids-Training-McAlester-2018-2023/CookingforKids-McAlester-2018-23

Repository files:

bin
lib/python3.9/site-packages
share
venv
.gitignore
1.json
NoWaitTextract.ipynb
OUTPUT_DFs.md
README.md
makefile
overview.ipynb
parse-to-df.ipynb
pyvenv.cfg
requirements.txt
temp.py

Repository: codeslp/school_assessments_proj

Folders and files

Latest commit

History

Repository files navigation

Creating a pipeline to extract tables from PDFs using AWS Step Functions and AWS Textract

AWS architecture drawing and table ERD:

Blog posts about this project:

Visualizations of derived data in Tableau:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages