
Big data by web scraping with Scrapy

This project contains Python modules organized to collect data at large scale and to present it as charts using current technologies.

As a data science project, it touches on many aspects of data handling and is multi-disciplinary by nature. Highlighted updates are posted as they land, and every step to be executed follows the proposed DataPipeline.

Not every technique is developed here; the most useful ones are included to illustrate the concepts and tools that Biosoft works with.

1 Requirements

Once the libraries listed below are installed, the project can be built (an install sketch follows the list):

Scrapy
selenium
boto3
botocore

requests
urllib3
bs4

pandas
numpy
scipy
matplotlib
scikit-

jupyter-server
jupyter_client
jupyter_core
jupyterlab
jupyterlab-pygments
jupyterlab_server
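
A minimal install sketch, assuming pip and Python 3 are available; any remaining packages from the list above can be added the same way:

pip install scrapy selenium boto3 botocore requests urllib3 bs4 pandas numpy scipy matplotlib jupyterlab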

2 How to install

Use the Dockerfile to build and run the container that sets up the work environment:

docker build -t YOUR_IMAGE_NAME .
docker run -v /dev/shm:/dev/shm --shm-size=2gb -d -p 80:8080 YOUR_IMAGE_NAME

How should the container be tuned for high performance when collecting huge datasets? A sketch of resource limits is shown below.
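
One hedged starting point for large crawls is to give docker run explicit shared-memory, RAM, and CPU limits; the values below are placeholders, not measured settings:

docker run -v /dev/shm:/dev/shm --shm-size=2gb --memory=4g --cpus=2 -d -p 80:8080 YOUR_IMAGE_NAME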

3 Toolkit

A diagram of the proposed development environment and toolkit.

Scrapy · VS Code · Google Colab · Jupyter · Copilot · draw.io

4 DataPipeline for scraping

The picture below illustrates how files are collected and stored for each provided URL.

Figure: DataPipeline
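
As a rough sketch of the scanning and mining stage, a minimal Scrapy spider could walk a provided URL and yield the document links it finds; the spider name, start URL, and file extensions below are assumptions, not the project's real configuration:

import scrapy

class DocumentSpider(scrapy.Spider):
    # Hypothetical spider name and start URL; replace with the real target site.
    name = "documents"
    start_urls = ["https://example.cl/"]

    def parse(self, response):
        # Collect links to downloadable documents (PDF, CSV, XLS, Stata).
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if url.lower().endswith((".pdf", ".csv", ".xls", ".xlsx", ".dta")):
                yield {"file_urls": [url]}  # picked up by Scrapy's FilesPipeline
            else:
                yield response.follow(href, callback=self.parse)  # keep crawling the site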

4.1 Technology Stack

The technology stack declares the components that communicate along the data flow design, from scanning the website and mining data, through collection and ingestion, to processing, storing, and plotting, so that the whole flow is reliable and workable.

Figure: Technology stack
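
A hedged sketch of the processing and plotting end of the stack, assuming the collected metadata has been exported to a hypothetical scraped_files.csv with a format column:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical metadata export produced by the collection stage.
df = pd.read_csv("scraped_files.csv")

# Count collected documents by file format and plot the distribution.
df["format"].value_counts().plot(kind="bar", title="Collected documents by format")
plt.tight_layout()
plt.show()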

5 Architecture of component design

Three components cover the data collection cycle with Scrapy. Given a target URL, the crawler finds common and media files and stores them in AWS services: DynamoDB for structured key-value data and an S3 bucket for the raw documents. Two helper components serve specific purposes: once the data is retrieved it is plotted in Tableau, and the Selenium component provides tools for clicking dynamic JS events to obtain valid download links for files.

Figure: Architecture of the app (draw.io)
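
The Selenium helper can be sketched roughly as follows; the page URL and the button selector are hypothetical, and a real site needs its own locators:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Headless Chrome so the component can run inside the container.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://example.cl/datasets")  # hypothetical page with JS-driven downloads

# Click the dynamic element that reveals the real file links.
driver.find_element(By.CSS_SELECTOR, "button.download").click()

# Collect the now-visible links and hand them back to the Scrapy side.
links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
driver.quit()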

6 Storing

A. Amazon S3 bucket

The picture below shows how the files fill distributed cloud storage in Amazon S3 buckets. By web scraping different Chilean websites, this store acts as a pure document database in which each file was retrieved in one of a variety of formats: PDF, CSV, XLS, Stata, and more.

Figure: S3 bucket in AWS
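
A hedged sketch of the storing step with boto3; the bucket name, table name, and key schema are assumptions, not the project's actual AWS resources:

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("scraped_documents")  # hypothetical table with a "url" partition key

def store_document(url, local_path, fmt):
    # The raw document goes to the S3 bucket, keyed by format and file name.
    key = f"{fmt}/{local_path.split('/')[-1]}"
    s3.upload_file(local_path, "big-data-web-scraping", key)  # hypothetical bucket name
    # Structured key-value metadata goes to DynamoDB.
    table.put_item(Item={"url": url, "s3_key": key, "format": fmt})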

References

BeautifulSoup: parsing the HTML retrieved from target URLs.

Scrapy

Selenium

XPath

Regular expressions (RegEx)

Storing data in the cloud (AWS)

Tableau - Data Visualization

DynamoDB and its purposes (AWS)

Big Data and Data flow design: Apache NiFi

NoSQL Databases and Graph queries