This monorepo contains an application for crawling sources of market news, a machine learning pipeline for predicting the effect of news stories on stock prices, a workflow manager for application orchestration, a web application for serving model inferences, and an application for provisioning the required cloud infrastructure.
- What It Does
- Local Setup
- Directory Structure
- Lambda Functions
- Run Crawl in Local Development Environment
- Deployment
- Run Crawl in Production Environment
This application consists of Python classes – or “spiders” – that define how to crawl and extract structured data from the pages of websites that publish market news.
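The sketch below illustrates the general shape of such a spider; the class name, start URL and CSS selectors are placeholders for illustration, not code taken from the `spiders` directory.

```python
# Illustrative only: selectors, field names and the start URL are hypothetical.
import scrapy


class ExampleNewsSpider(scrapy.Spider):
    name = "example-news"
    start_urls = ["https://example.com/market-news"]

    def parse(self, response):
        # Follow each article link and hand extraction off to parse_article.
        for href in response.css("a.article-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Yield a structured item for the item pipelines to process.
        yield {
            "headline": response.css("h1::text").get(),
            "published": response.css("time::attr(datetime)").get(),
            "body": " ".join(response.css("article p::text").getall()),
        }
```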
Start by installing the conda package and environment manager. The Miniconda installer can be used to install a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip.
To create a fresh conda environment, run `conda create -n <env-name> python=3.10`, substituting `<env-name>` with your desired environment name. Once the environment has been created, activate it by running `conda activate <env-name>`.
Install the Python dependencies by running `pip install -r requirements.txt` from the `crawler-project` directory. Install the Node dependencies by running `npm install` from the `crawler-project/lambdas` directory.
Set the necessary environment variables by modifying the command below as required depending on the location of your Miniconda installation and environment name.
```sh
conda env config vars set \
  PYTHONPATH=<path/to/project/dir>/crawler-project:$HOME/opt/miniconda3/envs/<env-name>/lib/python3.10/site-packages
```
Reactivate the environment by running `conda activate <env-name>`.
To create a file for storing environment variables, run `cp .env.example .env` from the `crawler-project` directory.
📦crawler-project
┣ 📂crawler
┃ ┣ 📂scripts
┃ ┃ ┣ 📜crawl.py
┃ ┃ ┗ 📜encrypt_data.py
┃ ┣ 📂spiders
┃ ┃ ┣ 📜exchange.py
┃ ┃ ┣ 📜ft.py
┃ ┃ ┗ 📜price.py
┃ ┣ 📜form_payloads.py
┃ ┣ 📜items.py
┃ ┣ 📜log_formatter.py
┃ ┣ 📜middlewares.py
┃ ┣ 📜pipelines.py
┃ ┗ 📜settings.py
┣ 📂lambdas
┃ ┣ 📂ft
┃ ┃ ┣ 📜Dockerfile
┃ ┃ ┣ 📜entrypoint.sh
┃ ┃ ┣ 📜index.ts
┃ ┃ ┗ 📂node_modules
┃ ┣ 📜.eslintrc.json
┃ ┣ 📜.gitignore
┃ ┣ 📜.prettierrc
┃ ┣ 📜package-lock.json
┃ ┣ 📜package.json
┃ ┗ 📜tsconfig.json
┣ 📜.env
┣ 📜.env.example
┣ 📜.gitignore
┣ 📜Dockerfile
┣ 📜entrypoint.sh
┣ 📜production.env
┣ 📜requirements.txt
┗ 📜scrapy.cfg
The session cookies used by spiders for authenticating user accounts are obtained by means of Lambda functions that use the Playwright browser automation library to orchestrate instances of the Chromium browser running in headless mode. The JavaScript code that runs in AWS Lambda using the Node runtime is contained in the `lambdas` directory. The subdirectory for each Lambda function contains a TypeScript file defining the function handler method, a `Dockerfile` for building the deployment container image, and a shell script that is executed when the Docker container is started. To transcompile the TypeScript code into Lambda-compatible JavaScript, run `npm run build` from the `lambdas` directory. The AWS CDK app takes care of building and uploading the deployment container image. The constructs representing the Lambda functions are defined as part of the `CrawlerStack` stack.
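For orientation, the snippet below sketches the cookie-harvesting flow conceptually using Playwright's Python API; the real handlers are the TypeScript files under `lambdas`, and the login URL and selectors shown here are placeholders.

```python
# Conceptual sketch only: the production handlers are TypeScript Lambda functions.
from playwright.sync_api import sync_playwright


def fetch_session_cookies(login_url: str, username: str, password: str) -> list:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()
        page.goto(login_url)
        # Placeholder selectors; real sites require site-specific selectors.
        page.fill("input[name='username']", username)
        page.fill("input[name='password']", password)
        page.click("button[type='submit']")
        page.wait_for_load_state("networkidle")
        cookies = context.cookies()
        browser.close()
        return cookies
```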
To initiate a crawl from your local machine, change the current working directory to `crawler-project` and run the command `scrapy crawl <spider-name>`, substituting `<spider-name>` with the name of the spider you want to run. Alternatively, start the crawler by executing `crawler/scripts/crawl.py`.
The optional parameter `year` can be used to specify the year within which an article must have been published for it to be processed by the spider. If no year is specified, the spider defaults to processing only the most recently published articles. If starting the crawler using the `scrapy crawl` command, a value for the `year` parameter can be supplied by passing the key–value pair `year=<yyyy>` to the `-a` option. If starting the crawler using the `crawl.py` script, a value for the parameter can be passed as a command line argument.
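The exact contents of `crawl.py` are not reproduced here, but a script of this kind typically looks something like the sketch below; the spider name `ft` is assumed from the file name in the `spiders` directory and may differ from the spider's actual `name` attribute.

```python
# Sketch of a programmatic crawl with an optional year argument; details of the
# real crawler/scripts/crawl.py may differ.
import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

year = sys.argv[1] if len(sys.argv) > 1 else None

process = CrawlerProcess(get_project_settings())
# Keyword arguments passed to crawl() reach the spider just like -a key=value.
process.crawl("ft", year=year)
process.start()
```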
The Amazon DocumentDB cluster that acts as the data store for this project is deployed within an Amazon Virtual Private Cloud (VPC). The cluster can only be accessed directly by Amazon EC2 instances or other AWS services that are deployed within the same Amazon VPC. SSH tunneling (also known as port forwarding) can be used to access the DocumentDB cluster from outside the VPC. To create an SSH tunnel, you can connect to an EC2 instance running in the same VPC as the DocumentDB cluster that was provisioned specifically for this purpose.
As Transport Layer Security (TLS) is enabled on the cluster, you will need to download the Amazon DocumentDB Certificate Authority (CA) certificate bundle from https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem. The following operation downloads this file to the location specified by the `-P` option.
```sh
wget https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem -P $HOME/.ssh
```
Run the following command to set up an SSH tunnel to the DocumentDB cluster. The `-L` flag is used for forwarding a local port, in this case port `27017`.
```sh
ssh -i $HOME/.ssh/ec2-key-pair.pem \
  -L 27017:production.••••••.eu-west-1.docdb.amazonaws.com:27017 \
  ec2-••••••.eu-west-1.compute.amazonaws.com -N
```
The connection URI for connecting the application to the DocumentDB cluster should be formatted as below.
```
mongodb://<username>:<password>@localhost:27017/stock-press?tlsAllowInvalidHostnames=true&ssl=true&tlsCaFile=$HOME/.ssh/rds-combined-ca-bundle.pem&directConnection=true&retryWrites=false
```
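With the tunnel open, the connection can be verified from Python, for example with pymongo, assuming the CA bundle was downloaded to `~/.ssh` as shown above; the snippet below is illustrative only.

```python
# Illustrative connectivity check through the SSH tunnel.
import os

from pymongo import MongoClient

client = MongoClient(
    "mongodb://<username>:<password>@localhost:27017/stock-press",
    tls=True,
    tlsCAFile=os.path.expanduser("~/.ssh/rds-combined-ca-bundle.pem"),
    tlsAllowInvalidHostnames=True,
    directConnection=True,
    retryWrites=False,
)
print(client["stock-press"].list_collection_names())
```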
To deploy the crawler using the AWS CDK Toolkit, change the current working directory to `cdk` and run `cdk deploy CrawlerStack`. See the AWS CDK app section for details of how to set up the AWS CDK Toolkit.
For production crawls, the crawler is run as an Amazon Elastic Container Service (ECS) task using the AWS Fargate serverless container orchestrator. To run an ECS task using the AWS Command Line Interface, run the following command, substituting `<cluster-name>`, `<task-definition-name>`, `<vpc-public-subnet-id>` and `<service-security-group-id>` with the values outputted by the AWS CDK app after deployment. The example below shows how to override the default command for a container specified in the Docker image with a command that specifies a year within which an article must have been published for it to be processed by the spider.
```sh
aws ecs run-task \
  --launch-type FARGATE \
  --cluster arn:aws:ecs:eu-west-1:••••••:cluster/<cluster-name> \
  --task-definition <task-definition-name> \
  --network-configuration 'awsvpcConfiguration={subnets=[<vpc-public-subnet-id>],securityGroups=[<service-security-group-id>],assignPublicIp=ENABLED}' \
  --overrides '{
    "containerOverrides": [
      {
        "name": "<container-name>",
        "command": ["sh", "-c", "python3 ./crawler/scripts/crawl.py <yyyy>"]
      }
    ]
  }'
```
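The same task can also be launched programmatically, for example with boto3; the sketch below mirrors the CLI call above and reuses the same placeholder values.

```python
# boto3 equivalent of the CLI call above (placeholders as in the README).
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")
ecs.run_task(
    cluster="<cluster-name>",
    taskDefinition="<task-definition-name>",
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["<vpc-public-subnet-id>"],
            "securityGroups": ["<service-security-group-id>"],
            "assignPublicIp": "ENABLED",
        }
    },
    overrides={
        "containerOverrides": [
            {
                "name": "<container-name>",
                "command": ["sh", "-c", "python3 ./crawler/scripts/crawl.py <yyyy>"],
            }
        ]
    },
)
```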
- What It Does
- Local Setup
- Testing
- Directory Structure
- Start a `SparkSession`
- Run Job in Local Development Environment
- Deployment
- Run Job in Production Environment
This is an application for performing distributed batch processing of ML workloads on the Apache Spark framework.
Start by installing the conda package and environment manager. The Miniconda installer can be used to install a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip.
To create a fresh conda environment, run `conda create -n <env-name> python=3.10`, substituting `<env-name>` with your desired environment name. Once the environment has been created, activate it by running `conda activate <env-name>`.
Next, install the project dependencies, including distributions of Apache Hadoop and Apache Spark, by running `pip install -r requirements_dev.txt` from the `ml-pipeline` directory.
Set the necessary environment variables by modifying the command below as required depending on the location of your Miniconda installation and environment name.
```sh
conda env config vars set \
  PYTHONPATH=<path/to/project/dir>/ml-pipeline:$HOME/opt/miniconda3/envs/<env-name>/lib/python3.10/site-packages \
  SPARK_HOME=$HOME/opt/miniconda3/envs/<env-name>/lib/python3.10/site-packages/pyspark \
  PYSPARK_PYTHON=$HOME/opt/miniconda3/envs/<env-name>/bin/python \
  PYSPARK_DRIVER_PYTHON=$HOME/opt/miniconda3/envs/<env-name>/bin/python \
  OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
```
Reactivate the environment by running `conda activate <env-name>`.
To create a file for storing environment variables, run `cp .env.example .env`.
To download model files needed for local inference, run `docker build -f assets.Dockerfile -o . .`. Model files will be outputted to the `assets/models` directory.
This project uses the pytest software testing framework. Run `DEBUG=1 pytest` to execute all tests. Use the `-s` flag to prevent pytest from capturing data written to STDOUT, and the `-v` flag to increase the verbosity of test output.
📦ml-pipeline
┣ 📂artifacts
┃ ┣ 📜packages.tar.gz
┃ ┗ 📜uber-JAR.jar
┣ 📂assets
┃ ┗ 📂models
┃ ┃ ┣ 📂bert_large_token_classifier_conll03_en
┃ ┃ ┣ 📂facebook_bart_large_cnn
┃ ┃ ┣ 📂gbt
┃ ┃ ┣ 📂hnswlib
┃ ┃ ┣ 📂sent_bert_large_cased_en
┃ ┃ ┗ 📂sentence_detector_dl_xx
┣ 📂config
┣ 📂data
┣ 📂inference
┃ ┣ 📂services
┃ ┃ ┣ 📜logger.py
┃ ┃ ┗ 📜spark.py
┃ ┣ 📂transformers
┃ ┃ ┣ 📜named_entity_recognizer.py
┃ ┃ ┣ 📜summarizer.py
┃ ┃ ┗ 📜vectorizer.py
┃ ┗ 📜summarizer.py
┣ 📂jobs
┃ ┣ 📜classification.py
┃ ┣ 📜knn.py
┃ ┣ 📜ner.py
┃ ┣ 📜prediction.py
┃ ┗ 📜summarization.py
┣ 📂scripts
┃ ┣ 📜download_models.py
┃ ┗ 📜package_models.py
┣ 📂tests
┃ ┣ 📜conftest.py
┃ ┣ 📜fixtures.py
┃ ┣ 📜ner_test.py
┃ ┗ 📜summarization_test.py
┣ 📜.env
┣ 📜.env.example
┣ 📜.gitignore
┣ 📜artifacts.Dockerfile
┣ 📜assets.Dockerfile
┣ 📜pom.xml
┣ 📜pyproject.toml
┣ 📜pytest.ini
┣ 📜requirements.txt
┗ 📜requirements_dev.txt
The `jobs` directory contains Python scripts that can be sent to a Spark cluster and executed as jobs. The `inference` directory contains the custom Transformers and MLflow Python model classes that provide the core functionality.
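As a rough illustration of the second category, an MLflow Python model wraps inference logic behind a uniform `predict` interface. The class, artifact name and column below are hypothetical and are not the project's own summarizer implementation.

```python
# Hypothetical MLflow Python model; the real classes live in the inference package.
import mlflow.pyfunc


class SummarizerModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # context.artifacts maps artifact names to local paths of model files.
        self.model_path = context.artifacts.get("summarizer")  # assumed artifact key

    def predict(self, context, model_input):
        # Placeholder logic: a real implementation would run the loaded model.
        return [text[:200] for text in model_input["body"]]
```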
The `inference.services.spark` module provides a `start_spark` function for creating a `SparkSession` and registering an application with the cluster. The following example shows how to create a `SparkSession` and specify the Maven coordinates of JAR files to be downloaded and transferred to the cluster.
```python
import os

from inference.services.spark import start_spark

spark, log, config = start_spark(
    jars_packages=[
        "org.apache.hadoop:hadoop-aws:3.3.2",
        "org.mongodb.spark:mongo-spark-connector_2.12:3.0.2",
        "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.1",
    ],
    spark_config={
        "spark.mongodb.input.uri": os.environ["MONGODB_CONNECTION_URI"],
        "spark.mongodb.output.uri": os.environ["MONGODB_CONNECTION_URI"],
        "fs.s3a.aws.credentials.provider": "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
        "spark.kryoserializer.buffer.max": "2000M",
        "spark.driver.memory": "10g",
    },
)
```
Note that only the `app_name` argument will take effect when calling `start_spark` from a job submitted to a cluster via the `spark-submit` script in Spark's `bin` directory. The purpose of the other arguments is to facilitate local development and testing from within an interactive terminal session or Python console. The `start_spark` function detects the execution environment in order to determine which arguments the session builder should use – the function arguments or the `spark-submit` arguments. The `config` dictionary is populated with configuration values contained in JSON files located at paths specified by the `files` argument or `--files` option. The top-level keys of the `config` dictionary correspond to the names of the JSON files submitted to the cluster.
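For example, if a file named `models.json` is supplied via the `files` argument (or the `--files` option in production), its contents become available under `config["models"]`. The sketch below is illustrative; the keys inside `models.json` are assumptions.

```python
# Illustrative use of the config mechanism; keys inside models.json are assumed.
from inference.services.spark import start_spark

spark, log, config = start_spark(
    app_name="summarization",
    files=["config/models.json"],
)

models_config = config["models"]  # top-level key matches the JSON file name
summarizer_uri = models_config.get("summarizer")  # hypothetical key
```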
The following example shows how to submit a job to a local standalone Spark cluster, specify the Maven coordinates of JAR files to be downloaded and transferred to the cluster, and supply configuration values to the `SparkConf` object that will be passed to the `SparkContext`.
```sh
$SPARK_HOME/bin/spark-submit \
  --master "local[*]" \
  --packages "org.apache.hadoop:hadoop-aws:3.3.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2,com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.1" \
  --conf "spark.mongodb.input.uri=mongodb://<username>:<password>@localhost:27017/stock-press?tlsAllowInvalidHostnames=true&ssl=true&directConnection=true&retryWrites=false" \
  --conf "spark.mongodb.output.uri=mongodb://<username>:<password>@localhost:27017/stock-press?tlsAllowInvalidHostnames=true&ssl=true&directConnection=true&retryWrites=false" \
  --conf "fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain" \
  --conf "spark.driver.memory=10g" \
  --conf "spark.kryoserializer.buffer.max=2000M" \
  jobs/summarization.py
```
The Amazon DocumentDB cluster that acts as the data store for this project is deployed within an Amazon Virtual Private Cloud (VPC). The cluster can only be accessed directly by Amazon EC2 instances or other AWS services that are deployed within the same Amazon VPC. SSH tunneling (also known as port forwarding) can be used to access the DocumentDB cluster from outside the VPC. To create an SSH tunnel, you can connect to an EC2 instance running in the same VPC as the DocumentDB cluster that was provisioned specifically for this purpose.
As Transport Layer Security (TLS) is enabled on the cluster, you will need to download the Amazon DocumentDB Certificate Authority (CA) certificate from https://s3.amazonaws.com/rds-downloads/rds-ca-2019-root.pem. The following operation downloads this file to the location specified by the `-P` option.
```sh
wget https://s3.amazonaws.com/rds-downloads/rds-ca-2019-root.pem -P $HOME/.ssh
```
Run the following command to add the CA certificate to the Java TrustStore.
```sh
sudo keytool -import -alias RDS -file $HOME/.ssh/rds-ca-2019-root.pem -cacerts
```
Run the following command to set up an SSH tunnel to the DocumentDB cluster. The `-L` flag is used for forwarding a local port, in this case port `27017`.
```sh
ssh -i $HOME/.ssh/ec2-key-pair.pem \
  -L 27017:production.••••••.eu-west-1.docdb.amazonaws.com:27017 \
  ec2-••••••.eu-west-1.compute.amazonaws.com -N
```
The connection URI for connecting the application to the DocumentDB cluster should be formatted as below.
```
mongodb://<username>:<password>@localhost:27017/stock-press?tlsAllowInvalidHostnames=true&ssl=true&directConnection=true&retryWrites=false
```
The project includes a Dockerfile with instructions for packaging dependencies into archives that can be uploaded to Amazon S3 and downloaded to Spark executors. Dependencies can be packaged for deployment by running the command `docker build -f artifacts.Dockerfile -o . .`. A TAR archive containing the Python dependencies and an uber-JAR containing the Java dependencies will be outputted to a directory named `artifacts`.
The project also includes a Dockerfile with instructions for fetching model files that must be uploaded to Amazon S3 and downloaded to Spark executors. Model files can be readied for deployment by running the command `docker build -f assets.Dockerfile -o . .`. Model files will be outputted to the `assets/models` directory.
To deploy the application using the AWS CDK Toolkit, change the current working directory to `cdk` and run `cdk deploy EMRServerlessStack`. See the AWS CDK app section for details of how to set up the AWS CDK Toolkit. The AWS CDK app takes care of uploading the deployment artifacts and assets to the project's dedicated S3 bucket. The app also creates and uploads a JSON configuration file named `models.json` that specifies the S3 URI for the `models` folder. For production job runs, this file needs to be submitted to the Spark cluster by passing the URI as an argument to the `--files` option. The AWS CDK app outputs the ID of the EMR Serverless application created by the CloudFormation stack, along with the ARN for the IAM execution role, S3 URIs for the `jobs`, `config`, `artifacts`, `models` and `logs` folders, and the S3 URI for the ZIP archive containing a custom Java KeyStore.
The following is an example of how to submit a job to the EMR Serverless application deployed by the AWS CDK app using the AWS CLI. The placeholder values should be replaced with the values outputted by the CDK app after deployment.
```sh
aws emr-serverless start-job-run \
  --execution-timeout-minutes 10 \
  --region eu-west-1 \
  --application-id <application-ID> \
  --execution-role-arn <role-ARN> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<bucket-name>/jobs/summarization.py",
      "entryPointArguments": [],
      "sparkSubmitParameters": "--conf spark.archives=s3://<bucket-name>/artifacts/packages.tar.gz#environment,s3://<bucket-name>/cacerts/<asset-hash>.zip#cacerts --conf spark.jars=s3://<bucket-name>/artifacts/uber-JAR.jar --files=s3://<bucket-name>/config/models.json --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.driver.disk=30g --conf spark.emr-serverless.executor.disk=30g --conf spark.executor.instances=10 --conf spark.mongodb.input.uri=mongodb://<username>:<password>@production.••••••.eu-west-1.docdb.amazonaws.com:27017/stock-press?tls=true&replicaSet=rs0&readPreference=secondaryPreferred&directConnection=true&retryWrites=false --conf spark.mongodb.output.uri=mongodb://<username>:<password>@production.••••••.eu-west-1.docdb.amazonaws.com:27017/stock-press?tls=true&replicaSet=rs0&readPreference=secondaryPreferred&directConnection=true&retryWrites=false --conf spark.driver.extraJavaOptions=-Djavax.net.ssl.trustStore=./cacerts/cacerts.jks --conf spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=./cacerts/cacerts.jks --conf spark.kryoserializer.buffer.max=2000M"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {
        "logUri": "s3://<bucket-name>/logs/"
      }
    }
  }'
```
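The equivalent call can also be made with boto3 if jobs are submitted from Python, for example from an orchestration script; the sketch below reuses the placeholder values above and abbreviates the Spark submit parameters.

```python
# boto3 equivalent of the CLI call above, with sparkSubmitParameters abbreviated.
import boto3

emr = boto3.client("emr-serverless", region_name="eu-west-1")
emr.start_job_run(
    applicationId="<application-ID>",
    executionRoleArn="<role-ARN>",
    executionTimeoutMinutes=10,
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://<bucket-name>/jobs/summarization.py",
            "entryPointArguments": [],
            # Abbreviated; use the full parameter string shown in the CLI example.
            "sparkSubmitParameters": "--files=s3://<bucket-name>/config/models.json ...",
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": "s3://<bucket-name>/logs/"}
        }
    },
)
```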
- What It Does
- Airflow with Fargate Architecture
- Directory Structure
- Deployment
- Run Workflow in Production Environment
This is an Apache Airflow application for automating and orchestrating data pipelines that comprise interdependent stages. The application is designed to run on an Amazon Elastic Container Service cluster using the AWS Fargate serverless compute engine.
The infrastructure components of an Airflow application fall into two categories: those needed for Airflow itself to operate, and those used to run tasks. The following components belong to the first category.
- The Webserver for hosting the Airflow UI, which allows users to trigger and monitor workflows.
- The Scheduler for triggering the task instances whose dependencies have been met.
- The Metadata Database for storing configuration data and information about past and present workflow runs.
- The Executor that provides the mechanism by which task instances get run.
The Webserver and Scheduler run in Docker containers that are deployed to Amazon Elastic Container Service and started by AWS Fargate. The Metadata Database is an Amazon RDS PostgreSQL instance. The Celery Executor, with Amazon SQS as the queue broker, is used to run task instances.
A DAG – or a Directed Acyclic Graph – is a collection of tasks organized in a way that reflects their relationships and dependencies. Each DAG is defined in a Python script that represents the DAG's structure (tasks and their dependencies) as code. Workers are the resources that run the DAG code. An Airflow Task is created by instantiating an Operator class. An operator is used to execute the operation that the task needs to perform. A dedicated Fargate task acts as the worker that monitors the queue for messages and either executes tasks directly or uses the Amazon ECS operator to execute tasks using additional capacity provisioned by either AWS Fargate or Amazon EC2.
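As a minimal illustration of tasks and dependencies expressed as code (not the contents of `ecs_dag.py` or `emr_dag.py`, which use the Amazon ECS and EMR operators):

```python
# Minimal illustrative DAG; task IDs and commands are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    crawl = BashOperator(task_id="crawl", bash_command="echo 'run the crawler task'")
    predict = BashOperator(task_id="predict", bash_command="echo 'run the prediction job'")

    # predict runs only after crawl succeeds.
    crawl >> predict
```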
📦workflow-manager
┣ 📂airflow
┃ ┣ 📂config
┃ ┃ ┣ 📜scheduler_entry.sh
┃ ┃ ┣ 📜webserver_entry.sh
┃ ┃ ┗ 📜worker_entry.sh
┃ ┣ 📂dags
┃ ┃ ┣ 📜ecs_dag.py
┃ ┃ ┗ 📜emr_dag.py
┃ ┗ 📜Dockerfile
┗ 📂tasks
┃ ┗ 📂ecs_task
┃ ┃ ┣ 📜Dockerfile
┃ ┃ ┗ 📜app.py
To deploy the Airflow application to Amazon ECS using the AWS CDK Toolkit, change the current working directory to `cdk` and run `cdk deploy FarFlowStack`. See the AWS CDK app section for details of how to set up the AWS CDK Toolkit. The AWS CDK app outputs the address of the Network Load Balancer that exposes the Airflow Webserver.
To trigger a DAG manually, navigate to the web address outputted by the AWS CDK app and log in to the Airflow UI as "admin" using the password stored in AWS Secrets Manager under the name "airflow/admin". Select the Trigger DAG option from the dropdown activated by clicking the play button in the Actions column for the DAG you want to run.
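The admin password can also be retrieved programmatically, for example with boto3; the secret name `airflow/admin` is the one referenced above, and whether the value is a plain string or a JSON document depends on how the stack stores it.

```python
# Fetch the Airflow admin password from AWS Secrets Manager.
import boto3

secrets = boto3.client("secretsmanager", region_name="eu-west-1")
response = secrets.get_secret_value(SecretId="airflow/admin")
print(response["SecretString"])  # may be a plain string or a JSON document
```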
This is a web application backend, built with the Django framework, for serving model inferences.
Start by installing the conda package and environment manager. The Miniconda installer can be used to install a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip.
To create a fresh conda environment, run `conda create -n <env-name> python=3.10`, substituting `<env-name>` with your desired environment name. Once the environment has been created, activate it by running `conda activate <env-name>`.
Install the Python dependencies for the backend by running `pip install -r requirements.txt` from the `web-backend` directory.
Set the necessary environment variables by modifying the command below as required depending on the location of your Miniconda installation and environment name.
```sh
conda env config vars set \
  PYTHONPATH=<path/to/project/dir>/web-backend:$HOME/opt/miniconda3/envs/<env-name>/lib/python3.10/site-packages
```
Reactivate the environment by running `conda activate <env-name>`.
To create a file for storing environment variables, run `cp .env.example .env` from the `web-backend` directory.
Run `python manage.py runserver` from the `web-backend` directory to start the local Django development server. By default the server is started on port 8000.
📦web-backend
┣ 📂core
┃ ┣ 📂templates
┃ ┃ ┗ 📜robots.txt
┃ ┣ 📜asgi.py
┃ ┣ 📜settings.py
┃ ┣ 📜urls.py
┃ ┣ 📜utils.py
┃ ┗ 📜wsgi.py
┣ 📂stockpress
┃ ┣ 📂views
┃ ┃ ┣ 📜articles.py
┃ ┃ ┗ 📜home.py
┃ ┣ 📜admin.py
┃ ┣ 📜apps.py
┃ ┣ 📜models.py
┃ ┣ 📜tests.py
┃ ┗ 📜urls.py
┣ 📜.env.example
┣ 📜.gitignore
┣ 📜Dockerfile
┣ 📜manage.py
┣ 📜requirements.txt
┗ 📜zappa_settings.json
To deploy the application using the AWS CDK Toolkit, change the current working directory to `cdk` and run `cdk deploy WebAppStack`. See the AWS CDK app section for details of how to set up the AWS CDK Toolkit. The CDK app takes care of bundling the project files using the Zappa build tool for deployment to AWS Lambda.
This is a web application frontend, built with the Next.js framework, that allows users to view model inferences.
Install the Node dependencies by running `npm install` from the `web-frontend` directory.
Run `npm run dev` to start the local Next.js development server. By default the server is started on port 3000. Navigate to http://localhost:3000 to view the site in a web browser.
📦web-frontend
┣ 📂components
┃ ┣ 📜Article.tsx
┃ ┗ 📜...
┣ 📂context
┃ ┗ 📜articlesContext.tsx
┣ 📂hooks
┃ ┣ 📜useArticlesContext.tsx
┃ ┗ 📜useIntersectionObserver.tsx
┣ 📂pages
┃ ┣ 📂api
┃ ┃ ┗ 📜hello.ts
┃ ┣ 📜_app.tsx
┃ ┣ 📜_document.tsx
┃ ┗ 📜index.tsx
┣ 📂public
┃ ┣ 📜favicon.ico
┃ ┗ 📜robots.txt
┣ 📂styles
┃ ┗ 📜globals.css
┣ 📂types
┃ ┗ 📜index.ts
┣ 📜.eslintrc.json
┣ 📜.gitignore
┣ 📜.prettierrc.json
┣ 📜next-env.d.ts
┣ 📜next.config.js
┣ 📜package-lock.json
┣ 📜package.json
┣ 📜postcss.config.js
┣ 📜tailwind.config.js
┗ 📜tsconfig.json
To deploy the application using the AWS CDK Toolkit, change the current working directory to `cdk` and run `cdk deploy WebAppStack`. See the AWS CDK app section for details of how to set up the AWS CDK Toolkit. The CDK app takes care of bundling the project files using the standalone output build mode for deployment to AWS Lambda.
The AWS Cloud Development Kit (CDK) is a framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation. This is an AWS CDK application that defines the cloud infrastructure required by the services contained in this repository.
To install the CDK Toolkit (a CLI tool for interacting with a CDK app) using the Node Package Manager, run the command `npm install -g aws-cdk`. The CDK Toolkit needs access to AWS credentials. Access to your credentials can be configured using the AWS CLI by running `aws configure` and following the prompts.
Install the Node dependencies by running `npm install` from the `cdk` directory.
📦cdk
┣ 📂bin
┃ ┗ 📜cdk.ts
┣ 📂cdk.out
┣ 📂lib
┃ ┣ 📂custom-resources
┃ ┃ ┣ 📂s3-copy-object
┃ ┃ ┃ ┣ 📜handler.py
┃ ┃ ┃ ┗ 📜s3-copy-object.ts
┃ ┃ ┣ 📂postgres-create-database
┃ ┃ ┃ ┣ 📜handler.py
┃ ┃ ┃ ┗ 📜postgres-create-database.ts
┃ ┣ 📂farflow-stack
┃ ┃ ┣ 📂constructs
┃ ┃ ┃ ┣ 📜airflow-construct.ts
┃ ┃ ┃ ┣ 📜dag-tasks.ts
┃ ┃ ┃ ┣ 📜rds.ts
┃ ┃ ┃ ┣ 📜service-construct.ts
┃ ┃ ┃ ┗ 📜task-construct.ts
┃ ┃ ┣ 📜config.ts
┃ ┃ ┣ 📜farflow-stack.ts
┃ ┃ ┗ 📜policies.ts
┃ ┣ 📜crawler-stack.ts
┃ ┣ 📜docdb-stack.ts
┃ ┣ 📜emr-serverless-stack.ts
┃ ┗ 📜vpc-stack.ts
┣ 📂test
┃ ┣ 📜crawler.test.ts
┃ ┣ 📜docdb.test.ts
┃ ┣ 📜emr-serverless.test.ts
┃ ┗ 📜vpc.test.ts
┣ 📜.env
┣ 📜.env.example
┣ 📜.eslintrc.json
┣ 📜.gitignore
┣ 📜.npmignore
┣ 📜.prettierrc
┣ 📜cdk.context.json
┣ 📜cdk.json
┣ 📜jest.config.js
┣ 📜package-lock.json
┣ 📜package.json
┗ 📜tsconfig.json
This project uses the Jest software testing framework. Run `npm run test` to execute all tests.
To deploy all the stacks defined by the application, change the current working directory to `cdk` and run `cdk deploy --all`.