Skip to content

Zihang97/Cloud-Native-Data-Analysis-Platform

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

88 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Cloud-Native Data Analysis Platform

1. Vision and Goals Of The Project:

The Self Service Cloud-Native Data Analysis Platform is a project to build an end-to-end data analysis platform using managed services considering the tradeoffs they offer.

Our project aims to satisfy real world use cases of managed data science technology to accomplish data analysis tasks. The cloud toolkit is constantly growing and data scientists must consider the tradeoffs of managed services in order to select a toolkit that meets their specific requirements. Through researching data science application and cloud technologies, along with following guidelines set by our mentors, our team will target specific data analysis use cases and build an end-to-end data analysis platform for use by data scientists.

Goals:

  • Explore the variety of managed services that exist for each component (Compute, Storage, Data Analysis, Security, and UI) and make selections based on our use-case requirements
  • Develop each component using selected managed services
  • Provide capability to accomplish a realworld, end-to-end, data analysis use case
  • Implement infrastructure that can:
    • Pull and store data in a storage solution
    • Perform transform operations and analysis on data
    • Leverage standardized data science software (Tensorflow, Jupyter, etc.)
    • Secure data and implement authorization requirements

2. Users/Personas Of The Project:

The platform will be used by data scientists with end-to-end data analysis use cases utilizing cloud technologies.

It doesn’t target:

  • Data scientists with strict security and compliance requirements (Ex: HIPAA)
  • Data scientists with pre-defined, non-matching technology requirements

3. Scope and Features Of The Project:

  • Compute
    • Provide a compute environment using IaaS solutions
    • Does not need to provide several compute options
  • Storage
    • Provide cloud storage solution
    • Users must be able to push and pull data from one or more sources
  • Data Analysis
    • Support ability to run scripts (Python) to transform and analyze data
    • Users are free to install tools and services on their compute environment, beyond what we provide
  • Security
    • Authorization settings to control which users can access and modify specific data
    • Platform is not HIPAA compliant
    • Fine-grained access control between users is not provided
  • User Interface
    • Web interface through which users can register to our platform
    • Does not provide capability to perform data analysis tasks, users must SSH into their individual compute environments
    • Provides limited capability to control individaul cloud resources
    • Programmatic infrastructure management and orchestration using Terraform

4. Solution Concept

The core concept behind this project is to develop an end-to-end data analysis platform that uses various cloud resources, data analysis tools, and other technologies to provide a managed service for use by data scientists. Users of our platform will find that it is a far more efficient method of setting up data analysis environment and will enable them to complete certain data science tasks. The platform will not be an all-encompassing solution, but will have to make certain tradeoffs based on requirements from our mentors and what our team learns through research.

High-level outline of the solution:

  • Compute: Use existing IaaS solutions like AWS EC2 or GCP (can also use Container solution)
  • Storage: Cloud-native technologies like AWS S3/GCP GCS/ DynamoDB/Spanner
  • Data Analysis: Use data analysis platforms like Jupyter and Pandas to support machine learning programs of Tensorflow/Pytorch.
  • Permission and Access Control: Provide security solutions between services and external access (AWS IAM).
  • Front-end UI: HTML/CSS/JS for webpage and Python Flask for web application

Architecture Diagram:

Screen Shot 2021-12-07 at 10 33 20 AM


5. Acceptance criteria

Minimum acceptance criteria is a self service platform that can:

  • Provide a compute environment for data analysis tasks.
  • Provide storage from which users can extract and store data.
  • Support standardized data science software (Tensorflow, Jupyter, etc).
  • Allocate storage and computation resources to ETL pipelines based on user requirement.
  • Implement security controls to enable authorization requirements

Stretch goals include:

  • Use this project to analyze Boston Open Data
  • Provide several compute and storage configurations
  • Build a user friendly UI to interact with the platform
  • Provide cloud infrastructure options from other cloud providers (Azure, Google Cloud Platform, etc.)

6. Release Planning:

Release #1 (due by Week 5) - Demo #1:

  • Explore compute, storage, data anaylsis, and security managed services and technologies
  • When end-users decide to upload the data, we can extract the data from their buckets and store them in our built databases.

Release #2 (due by Week 9) - Demo #2:

  • Implement storage using S3 and push and pull data
  • Utilize Terraform to automate AWS infrastructure deployment
  • Use Parquet and SQLite to manage data

Release #3 (due by Week 11) - Demo #3:

  • Launch a Jupyter Notebook on an EC2 container via Terraform
  • Use Terraform to create storage buckets
  • Expand on data analysis functionality

Release #4 (due by Week 13) - Demo #4:

  • Implement security and access functionality in platform
  • Use Terraform to extract and upload data in storage
  • Expand on platform functionality

Release #5 (due by Week 15) - Demo #5:

  • Implement web interface
  • Prepare final product

Final Presentation


User Instructions

User Instructions Wiki page

Deployment Instructions

Deployment Instructions Wiki page

Mentors

Team