Data Pipeline Project

The data pipeline project was created for the data analytics team at Data Services in VT University Libraries (VTUL). It was designed for Ellie Kohler, head of the Library Data Analytics and Assessment team, to analyze library data. The project collects data from LibInsight on a weekly basis and stores it in CSV format in the AWS S3 bucket analytics-datapipeline. The data is then queried with AWS Athena and uploaded to Tableau for analysis.

The script that collects the LibInsight data, libInsightData_ec2inst, is a Lambda function that is triggered weekly. In the libinsight Athena database, a table is created that maps the original LibInsight data file. An Athena query then runs against the original LibInsight data file and stores the results in a different S3 bucket, lib-insight-serialized-data... This Athena query is coded into a Lambda function. The trigger for this Lambda function fires every time the original LibInsight data file in S3 is updated, which happens weekly.
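As a rough illustration of the S3-triggered Athena step described above, the following is a minimal sketch, not the project's actual code. It assumes the standard S3 event notification layout and boto3's `start_query_execution` call. The query text, the table name `example_table`, and the output location `s3://example-results-bucket/` are hypothetical placeholders, as is the optional `athena` parameter, which exists only so the handler can be exercised with a stub client.

```python
import os

# Hypothetical settings: the real query, table name, and result location
# are not stated in this README, so placeholders are used here.
DATABASE = os.environ.get("ATHENA_DATABASE", "libinsight")
QUERY = os.environ.get("ATHENA_QUERY", "SELECT * FROM example_table")
OUTPUT_LOCATION = os.environ.get("ATHENA_OUTPUT", "s3://example-results-bucket/")


def lambda_handler(event, context, athena=None):
    """Start one Athena query for each S3 object reported in the event."""
    if athena is None:
        # Imported lazily so the handler can be tested without AWS access.
        import boto3
        athena = boto3.client("athena")

    execution_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"Change detected in s3://{bucket}/{key}; starting Athena query")
        response = athena.start_query_execution(
            QueryString=QUERY,
            QueryExecutionContext={"Database": DATABASE},
            ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
        )
        execution_ids.append(response["QueryExecutionId"])
    return {"query_execution_ids": execution_ids}
```

Reading the settings from environment variables keeps bucket and table names out of the code, which may fit alongside a script such as setEnvironVariables.sh.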

The Tableau account is associated with the user data-analytics-team. The IAM policy on this user gives the Tableau account holder (Ellie Kohler) access to Athena queries (read and write), access to the original S3 bucket analytics-datapipeline (read-only access), and access to the Athena query results S3 bucket lib-insight-serialized-data.. (read-write access). The Athena query results are uploaded to Tableau. The upload is also refreshed automatically on a weekly basis as the Athena query results update.
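The access pattern above can be pictured as an IAM policy document. The sketch below is an assumption-laden illustration, not the policy actually attached to data-analytics-team: the statement IDs, the exact action lists, and the `build_policy` helper are hypothetical, and the results bucket is passed in as a parameter because its full name is not given here.

```python
import json


def build_policy(source_bucket, results_bucket):
    """Return an illustrative IAM policy document as a Python dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Run Athena queries and read their results.
                "Sid": "RunAthenaQueries",
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                ],
                "Resource": "*",
            },
            {
                # Read-only access to the bucket holding the original data.
                "Sid": "ReadSourceData",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{source_bucket}",
                    f"arn:aws:s3:::{source_bucket}/*",
                ],
            },
            {
                # Read-write access to the bucket holding query results.
                "Sid": "ReadWriteQueryResults",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{results_bucket}",
                    f"arn:aws:s3:::{results_bucket}/*",
                ],
            },
        ],
    }


if __name__ == "__main__":
    # "example-results-bucket" is a placeholder, not the real bucket name.
    policy = build_policy("analytics-datapipeline", "example-results-bucket")
    print(json.dumps(policy, indent=2))
```

A working setup would likely need further permissions, for example on the data catalog that backs the Athena table, so this should be read as a starting point only.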

The script (Lambda function) is broken down into the following parts:

Get query results from LibInsight using the LibInsight API with the parameters: LibInsight ID, date range, and LibInsight token.
Append all pages of the LibInsight query results together as one dictionary, because the LibInsight API returns query results one page at a time.
Transform the query response parameters to fit the needs of the data analytics team.
Serialize the data and upload the record to the S3 bucket as a CSV file.
Create the Athena query and store the query results in the S3 bucket.
Upload the Athena query results to Tableau and automate their refresh for data analysis.
Add triggers to automate the upload on a weekly basis.
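The paging and CSV steps in the list above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the contents of lambda_function.py: `fetch_page` is a hypothetical callable standing in for the LibInsight API request, the `"records"` and `"total_pages"` keys are invented for the example and may not match the real response, and the pages are merged into a list of rows instead of a single dictionary.

```python
import csv
import io


def fetch_all_pages(fetch_page):
    """Collect the records from every page of a paginated API.

    `fetch_page(page_number)` is a hypothetical callable returning a dict
    with a "records" list and a "total_pages" count.
    """
    records = []
    page = 1
    while True:
        response = fetch_page(page)
        records.extend(response["records"])
        if page >= response["total_pages"]:
            break
        page += 1
    return records


def records_to_csv(records):
    """Serialize a list of flat dicts to CSV text with a header row."""
    if not records:
        return ""
    # Use the union of keys in first-seen order so no column is dropped.
    fieldnames = list(dict.fromkeys(key for row in records for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return buffer.getvalue()
```

The returned text could then be written to the bucket, for instance with boto3's `put_object` and the CSV text encoded as UTF-8 for the `Body` argument.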

The Lambda function that starts the EC2 instance is StartLibInsightEC2Instance, and the Lambda function that stops it is StopLibInsightEC2Instance. Triggers are also added to these Lambda functions.
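For orientation, a start/stop helper of the kind these two Lambda functions might wrap could look like the sketch below. It is not the code of StartLibInsightEC2Instance or StopLibInsightEC2Instance. It assumes boto3's `start_instances` and `stop_instances` calls; the `set_instance_state` name and the optional `ec2` parameter (used to pass in a stub client) are hypothetical.

```python
def set_instance_state(instance_ids, action, ec2=None):
    """Start or stop the given EC2 instances and return the IDs acted on."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action!r}")
    if ec2 is None:
        # Imported lazily so the helper can be tested without AWS access.
        import boto3
        ec2 = boto3.client("ec2")

    if action == "start":
        response = ec2.start_instances(InstanceIds=instance_ids)
        changes = response["StartingInstances"]
    else:
        response = ec2.stop_instances(InstanceIds=instance_ids)
        changes = response["StoppingInstances"]
    return [change["InstanceId"] for change in changes]
```

Each Lambda handler would then call this helper with its own action and the instance ID, which could be supplied through an environment variable.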

Documentation

See the wiki for documentation. For more detailed documentation, see readme-notes.md.

Environment

This project is hosted in the Data Services AWS account. AWS components are identified by the following tags:

Unit : AnalyticsAssessment
Owner : DataServices
Stack : Test
User : elliek
Application : DataPipeline
