
Big data by web scraping with Scrapy

This project contains Python modules organized to collect data at large scale and to present it as charts using current technologies.

As a data science project, it touches on many aspects of data handling and is multi-disciplinary by nature. Highlighted updates are posted as they land, and every step to be executed follows the proposed DataPipeline.

Not every technique is developed here; the most useful ones are included to illustrate the concepts and tools that Biosoft works with.

1 Requirements

Once the libraries listed below are installed, the project can be built (an install sketch follows the list):

Scrapy
selenium
boto3
botocore

requests
urllib3
bs4

pandas
numpy
scipy
matplotlib
scikit-

jupyter-server
jupyter_client
jupyter_core
jupyterlab
jupyterlab-pygments
jupyterlab_server
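
A minimal install sketch, assuming pip and Python 3 are available; any remaining packages from the list above can be added the same way:

pip install scrapy selenium boto3 botocore requests urllib3 bs4 pandas numpy scipy matplotlib jupyterlab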

2 How to install

Use the Dockerfile to build and run the container that sets up the work environment:

docker build -t YOUR_IMAGE_NAME .
docker run -v /dev/shm:/dev/shm --shm-size=2gb -d -p 80:8080 YOUR_IMAGE_NAME

How should the container be tuned for high performance when collecting huge datasets? A sketch of resource limits is shown below.
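
One hedged starting point for large crawls is to give docker run explicit shared-memory, RAM, and CPU limits; the values below are placeholders, not measured settings:

docker run -v /dev/shm:/dev/shm --shm-size=2gb --memory=4g --cpus=2 -d -p 80:8080 YOUR_IMAGE_NAME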

3 Toolkit

A diagram of the proposed development environment and toolkit.

Scrapy · VS Code · Google Colab · Jupyter · Copilot · draw.io

4 DataPipeline for scraping

The picture below illustrates how files are collected and stored for each provided URL.

Figure: DataPipeline
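
As a rough sketch of the scanning and mining stage, a minimal Scrapy spider could walk a provided URL and yield the document links it finds; the spider name, start URL, and file extensions below are assumptions, not the project's real configuration:

import scrapy

class DocumentSpider(scrapy.Spider):
    # Hypothetical spider name and start URL; replace with the real target site.
    name = "documents"
    start_urls = ["https://example.cl/"]

    def parse(self, response):
        # Collect links to downloadable documents (PDF, CSV, XLS, Stata).
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if url.lower().endswith((".pdf", ".csv", ".xls", ".xlsx", ".dta")):
                yield {"file_urls": [url]}  # picked up by Scrapy's FilesPipeline
            else:
                yield response.follow(href, callback=self.parse)  # keep crawling the site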

4.1 Technology Stack

The technology stack declares the components that communicate along the data flow design, from scanning the website and mining data, through collection and ingestion, to processing, storing, and plotting, so that the whole flow is reliable and workable.

Figure: Technology stack
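
A hedged sketch of the processing and plotting end of the stack, assuming the collected metadata has been exported to a hypothetical scraped_files.csv with a format column:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical metadata export produced by the collection stage.
df = pd.read_csv("scraped_files.csv")

# Count collected documents by file format and plot the distribution.
df["format"].value_counts().plot(kind="bar", title="Collected documents by format")
plt.tight_layout()
plt.show()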

5 Architecture of component design

Three components cover the data collection cycle with Scrapy. Given a target URL, the crawler finds common and media files and stores them in AWS services: DynamoDB for structured key-value data and an S3 bucket for the raw documents. Two helper components serve specific purposes: once the data is retrieved it is plotted in Tableau, and the Selenium component provides tools for clicking dynamic JS events to obtain valid download links for files.

Figure: Architecture of the app (draw.io)
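
The Selenium helper can be sketched roughly as follows; the page URL and the button selector are hypothetical, and a real site needs its own locators:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Headless Chrome so the component can run inside the container.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://example.cl/datasets")  # hypothetical page with JS-driven downloads

# Click the dynamic element that reveals the real file links.
driver.find_element(By.CSS_SELECTOR, "button.download").click()

# Collect the now-visible links and hand them back to the Scrapy side.
links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
driver.quit()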

6 Storing

A. Amazon S3 bucket

The picture below shows how the files fill distributed cloud storage in Amazon S3 buckets. By web scraping different Chilean websites, this store acts as a pure document database in which each file was retrieved in one of a variety of formats: PDF, CSV, XLS, Stata, and more.

Figure: S3 bucket in AWS
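
A hedged sketch of the storing step with boto3; the bucket name, table name, and key schema are assumptions, not the project's actual AWS resources:

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("scraped_documents")  # hypothetical table with a "url" partition key

def store_document(url, local_path, fmt):
    # The raw document goes to the S3 bucket, keyed by format and file name.
    key = f"{fmt}/{local_path.split('/')[-1]}"
    s3.upload_file(local_path, "big-data-web-scraping", key)  # hypothetical bucket name
    # Structured key-value metadata goes to DynamoDB.
    table.put_item(Item={"url": url, "s3_key": key, "format": fmt})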

References

BeautifulSoup: parsing the HTML retrieved from target URLs.

Scrapy

Selenium

XPath

Regular expressions (RegEx)

Storing data in the cloud (AWS)

Tableau - Data Visualization

DynamoDB and its purposes (AWS)

Big Data and Data flow design: Apache NiFi

NoSQL Databases and Graph queries