Data Engineering

gamechanger-data focuses on the data engineering work of gamechanger. For the full list of repositories, see gamechanger.

Important Note!

Configuring the repo relies on access to advana-data-zone's S3 bucket. If you do not have access to that bucket, you will need to fill in your own values in the config scripts, such as topic_models (for ML features) and configure_app (Elasticsearch, Postgres, and Neo4j).
Once the venv is set up, set the DEPLOYMENT_ENV variable and run ./paasJobs/configure_repo.sh (Linux/macOS) or paasJobs\configure_repo.bat (Windows).
Example: DEPLOYMENT_ENV=local ./paasJobs/configure_repo.sh, or on Windows set DEPLOYMENT_ENV=local followed by paasJobs\configure_repo.bat
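
The configure step above can be wrapped in a small guard so that a missing DEPLOYMENT_ENV value fails early instead of halfway through configuration. This is a minimal sketch, assuming a POSIX-style shell run from the repo root; the function name run_configure and the DRY_RUN switch are hypothetical and not part of the repo, and only the value local is confirmed by the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repo.
# Runs paasJobs/configure_repo.sh with DEPLOYMENT_ENV set, or just prints
# the command when DRY_RUN=1 so no backend is contacted.
run_configure() {
    env_name="${1:-}"
    if [ -z "$env_name" ]; then
        echo "usage: run_configure <deployment-env>" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "DEPLOYMENT_ENV=$env_name ./paasJobs/configure_repo.sh"
    else
        DEPLOYMENT_ENV="$env_name" ./paasJobs/configure_repo.sh
    fi
}
```

Calling DRY_RUN=1 run_configure local prints the command that would be run, which is a cheap way to confirm the value before the script tries to reach S3 or the other backends.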

(Linux) Dev/Prod Deployment Instructions

Clone a fresh copy of the gamechanger-data repo.
Set up a Python 3.8 venv with the packages in requirements.txt:
- Create the Python 3.8 venv, e.g. python3 -m venv /opt/gc-venv-20210613
- Before installing packages, update pip/wheel/setuptools, e.g. <venv>/bin/pip install --upgrade pip setuptools wheel
- Install packages from requirements.txt, with no additional dependencies, e.g. <venv>/bin/pip install --no-deps -r requirements.txt
Set up the symlink /opt/gc-venv-current to point to the freshly created venv, e.g. ln -s /opt/gc-venv-20210613 /opt/gc-venv-current
Pull in other dependencies and configure the repo with env SCRIPT_ENV=<prod|dev> <repo>/paasJobs/configure_repo.sh
- The config script will report whether everything was configured correctly and whether all backends can be reached.
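
The deployment steps above can be sketched as one script. This is a sketch under stated assumptions, not the project's own tooling: the function names venv_path and deploy_venv and the VENV_ROOT override are hypothetical, the dated naming follows the /opt/gc-venv-20210613 example, and ln -sfn is used instead of plain ln -s so that an existing gc-venv-current link is replaced on redeploy. The variable name SCRIPT_ENV is taken from the step above; the Important Note uses DEPLOYMENT_ENV, so check which one your checkout's script reads.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the deployment steps; adjust paths to your host.
set -euo pipefail

# Build a dated venv path, e.g. /opt/gc-venv-20210613.
# $1 is an optional date stamp; defaults to today's date.
venv_path() {
    echo "${VENV_ROOT:-/opt}/gc-venv-${1:-$(date +%Y%m%d)}"
}

# $1 = path to the cloned repo, $2 = prod or dev
deploy_venv() {
    local repo="$1" env_name="$2" venv
    venv="$(venv_path)"
    python3 -m venv "$venv"
    "$venv/bin/pip" install --upgrade pip setuptools wheel
    "$venv/bin/pip" install --no-deps -r "$repo/requirements.txt"
    # Point gc-venv-current at the new venv, replacing any older link.
    ln -sfn "$venv" "${VENV_ROOT:-/opt}/gc-venv-current"
    env SCRIPT_ENV="$env_name" "$repo/paasJobs/configure_repo.sh"
}
```

Nothing runs until deploy_venv is called, e.g. deploy_venv /path/to/gamechanger-data dev. Keeping each venv under a dated name means the previous one is still on disk if the symlink needs to be pointed back.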

How to Setup Local Env for Development

MacOS / Linux

(Linux Only) Follow the instructions appropriate to your distribution to install ocrmypdf and its dependencies: https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-on-linux
(MacOS Only) Install Homebrew ("brew"), then use it to install tesseract: brew install tesseract-lang
Install Miniconda or Anaconda (Miniconda is much smaller)
- https://docs.conda.io/en/latest/miniconda.html
Create gamechanger python3.8 environment, like so:
- conda create -n gc python=3.8
Clone the repo and change into that directory: git clone ...; cd gamechanger-data
Activate conda environment and install requirements:
- ‼️ Really important: make sure you change into the repo directory first
- conda activate gc
- pip install --upgrade pip setuptools wheel
- pip install -e '.[dev]' (quoting around .[dev] is important)
That's it.
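
After the steps above, a quick check can confirm that the gc environment is active and runs Python 3.8 before you start working. This is a minimal sketch; the function names is_py38 and check_gc_env are hypothetical and not part of the repo. It relies on CONDA_DEFAULT_ENV, which conda sets when an environment is activated.

```shell
#!/usr/bin/env bash
# Hypothetical post-install check, not part of the repo.

# Succeeds when the given `python --version` output is a 3.8.x release.
is_py38() {
    case "$1" in
        "Python 3.8."*) return 0 ;;
        *) return 1 ;;
    esac
}

# Reports whether the gc conda env is active and uses Python 3.8.
check_gc_env() {
    if [ "${CONDA_DEFAULT_ENV:-}" != "gc" ]; then
        echo "conda env 'gc' is not active (run: conda activate gc)" >&2
        return 1
    fi
    if ! is_py38 "$(python --version 2>&1)"; then
        echo "active python is not 3.8.x" >&2
        return 1
    fi
    echo "gc environment looks OK"
}
```

Run check_gc_env in the terminal where you activated the environment; a non-zero return means either the wrong env is active or the interpreter version differs from the one the setup steps ask for.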

Windows (WSL Version)

Set up a Windows Subsystem for Linux (WSL) environment
- https://docs.microsoft.com/en-us/windows/wsl/install-win10
(In WSL)
- Install ocrmypdf dependencies following the Ubuntu instructions here: https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-on-linux
- Install Miniconda or Anaconda (Miniconda is much smaller)
  - https://docs.conda.io/en/latest/miniconda.html
- Create gamechanger python3.8 environment, like so:
  - conda create -n gc python=3.8
- Clone the repo and change into that directory: git clone ...; cd gamechanger-data
- Activate conda environment and install requirements:
  - ‼️ Really important: make sure you change into the repo directory first
  - conda activate gc
  - pip install --upgrade pip setuptools wheel
  - pip install -e '.[dev]' (quoting around .[dev] is important)
- That's it, just activate that conda env if you want to use it inside the terminal.

Windows

Create venv: python -m venv [venv-name]
Activate: [venv-name]\Scripts\activate
Update venv: python -m pip install --upgrade pip setuptools wheel
Install requirements.txt: pip install --no-deps -r dev_tools\requirements\gc-venv-current.txt

Run Configure Repo, following the steps at the top of this README.

To-Do:

Convert .sh scripts to .bat to support Windows users

Docker

docker build -t gc-data --no-cache .
docker rm -f gc-data || true
docker run -it --name gc-data gc-data

Then run Configure Repo, following the steps at the top of this README.

IDE SETUP

How to Setup PyCharm IDE

Note: If you're using a containerized env, you'll need the Pro version of PyCharm and a separate set of instructions.

Create a new project by opening the directory where you cloned the repository. PyCharm will tell you that it sees an existing repo there; just accept that and proceed.
With your gc conda environment ready, change "Preferences -> Project -> Python Interpreter" to the EXISTING gc conda env you created. https://www.jetbrains.com/help/pycharm/conda-support-creating-conda-virtual-environment.html
Now, change "Preferences -> Build, Execution, Deployment -> Console -> Python Console interpreter" to the gc conda interpreter that you added in the previous step.
That's it. You will now have the correct env in Terminal, Python Console, and elsewhere in the IDE.

How to Setup Visual Studio Code IDE

Note: If you're using a containerized env, you'll need a separate, container-specific setup.

Open the cloned directory in a new workspace and make sure to set your gc conda env as the Python environment: https://code.visualstudio.com/docs/python/environments
That's it. When you start new integrated terminals, they will activate the right environment, and syntax highlighting and autocompletion will work as expected.

Common Issues

My venv is broken somehow!

Delete the old conda environment and create a new one, then follow the steps above to reinstall the requirements.
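
The rebuild described above comes down to a short command sequence. The sketch below only prints that sequence so it can be reviewed and pasted into an interactive shell from the repo directory; the function name rebuild_gc_env_cmds is hypothetical, and gc is the environment name used in the setup steps. The -y flags skip conda's confirmation prompts, and conda deactivate comes first because an active environment cannot be removed.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repo.
# Prints the commands that remove and recreate the conda env.
# $1 is the env name; defaults to gc.
rebuild_gc_env_cmds() {
    env_name="${1:-gc}"
    cat <<EOF
conda deactivate
conda env remove -n $env_name -y
conda create -n $env_name python=3.8 -y
conda activate $env_name
pip install --upgrade pip setuptools wheel
pip install -e '.[dev]'
EOF
}
```

Printing the commands instead of running them avoids the usual pitfall that conda activate needs conda's shell hook to be loaded, which is already the case in an interactive terminal but not necessarily inside a script.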

License & Contributions

See LICENSE.md (including licensing intent - INTENT.md) and CONTRIBUTING.md
