Dockerizing a Python Script for Web Scraping and Consuming the Scraped Data Using FastAPI

This repository shows how to dockerize a Python script for web scraping, store the scraped data in a SQLite database, and consume that data through a FastAPI app, all in the same container. You can modify this repository and repeat the same example using a separate Docker container for each app.

Docker environment


Dockerfile


Details

The project consists of two processes:

  1. A Python script that runs while the Docker image is being built. It scrapes data from the website www.metroscubicos.com and stores it in a SQLite database. The script lives in the file publicaciones.py and takes the following arguments: python publicaciones.py -t 50 -s 48. The -t argument sets the total number of elements to scrape from the website, and -s sets the number of elements the website shows per page. The -s argument is optional and defaults to 48; change it if the site's structure changes in the future. The Dockerfile runs the command RUN python -u ./app/publicaciones.py -t 50 -s 48 when the Docker image is built.


  2. A FastAPI app that starts when a container is launched from the built image. The Dockerfile specifies the command CMD ["python", "./app/main.py"], which exposes a service. You can call it on localhost as follows: http://127.0.0.1:8000/items/2. This example returns two records from the ESTATE table of the metroscubicos.sqlite database.
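
The -t and -s arguments described in step 1 could be handled along the following lines. This is only a sketch, not the actual contents of publicaciones.py: the long option names (--total, --page-size) and the helper page_plan are hypothetical, introduced here to show how a total and a page size translate into a number of pages to visit.

```python
import argparse
import math


def parse_args(argv=None):
    """Parse the -t / -s options (long names are assumptions, not from the repo)."""
    parser = argparse.ArgumentParser(description="Scrape listings page by page.")
    parser.add_argument("-t", "--total", type=int, required=True,
                        help="total number of elements to scrape")
    parser.add_argument("-s", "--page-size", type=int, default=48,
                        help="elements per page on the website (default: 48)")
    return parser.parse_args(argv)


def page_plan(total, page_size):
    """Hypothetical helper: how many elements to take from each page, in order."""
    pages = math.ceil(total / page_size)
    plan = []
    remaining = total
    for _ in range(pages):
        take = min(page_size, remaining)  # the last page may be partial
        plan.append(take)
        remaining -= take
    return plan
```

With the values used in the Dockerfile, page_plan(50, 48) gives [48, 2]: one full page plus two elements from a second page.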

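The storage and lookup side of the two processes can be sketched with the standard sqlite3 module. The table name ESTATE comes from the description above, but the column names (title, price, url) and the function names are assumptions made for illustration; the real schema and the real code in main.py may differ. An endpoint such as /items/2 could plausibly return the result of fetch_items(conn, 2).

```python
import sqlite3


def create_table(conn):
    # Column names are hypothetical; the real ESTATE schema is not shown here.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ESTATE ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "title TEXT, price REAL, url TEXT)"
    )


def save_records(conn, records):
    """Insert an iterable of (title, price, url) tuples."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO ESTATE (title, price, url) VALUES (?, ?, ?)", records
        )


def fetch_items(conn, limit):
    """Return the first `limit` rows as dictionaries, ordered by id."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    rows = conn.execute(
        "SELECT id, title, price, url FROM ESTATE ORDER BY id LIMIT ?", (limit,)
    ).fetchall()
    return [dict(row) for row in rows]
```

To try it without touching a file, open the connection with sqlite3.connect(":memory:"); to mirror the project, pass the path to metroscubicos.sqlite instead.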

How to run

To run this example, follow these steps:

  • Install Git Bash for Windows. Once it is installed, open Git Bash and clone this repository; this downloads the app folder, the Dockerfile, and the other files needed.
ramse@DESKTOP-K6K6E5A MINGW64 /c/documents/github
$ git clone https://github.com/Wittline/data-engineering-challenge-th.git
  • Install Docker Desktop on Windows. It also installs Docker Compose, which allows you to run multi-container applications.

  • Once the repository files have been downloaded, open Git Bash again, go to the data-engineering-challenge-th folder, and build the image from the Dockerfile with the following command:

ramse@DESKTOP-K6K6E5A MINGW64 ~/documents/github/data-engineering-challenge-th
docker build -t python-th . --progress=plain


  • Once the build has finished, create and start a container from the python-th image with the command below:
ramse@DESKTOP-K6K6E5A MINGW64 ~/documents/github/data-engineering-challenge-th
docker run -p 8000:8000 python-th


  • Ready! The web scraping process has run and the FastAPI app is exposed. Try opening this URL: http://127.0.0.1:8000/items/2

  • To stop everything, open another Git Bash window and list the running containers with the command below:

ramse@DESKTOP-K6K6E5A MINGW64 ~/documents/github/data-engineering-challenge-th
docker container ls 
  • Then run the command below with the CONTAINER ID shown in the previous step; this stops the container:
ramse@DESKTOP-K6K6E5A MINGW64 ~/documents/github/data-engineering-challenge-th
docker stop 7f2181fe515d
  • To free up resources, you can use the command below. It removes all stopped containers, unused networks, unused images, and build cache on your machine, not only those belonging to this project:
ramse@DESKTOP-K6K6E5A MINGW64 ~/documents/github/data-engineering-challenge-th
docker system prune -a
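
While the container is running, the service can also be checked from Python instead of a browser. The sketch below uses only the standard library; the function names are hypothetical, and it assumes the endpoint returns a JSON body, which is FastAPI's default behaviour but is not confirmed by this README.

```python
import json
from urllib.request import urlopen


def items_url(count, host="127.0.0.1", port=8000):
    """Build the URL used in the steps above, e.g. /items/2."""
    return f"http://{host}:{port}/items/{count}"


def get_items(count, timeout=10):
    """GET the endpoint and decode the body, assuming it is JSON."""
    with urlopen(items_url(count), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling get_items(2) requests http://127.0.0.1:8000/items/2 and raises a urllib.error.URLError if the container is not running or port 8000 is not published.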

