This project contains Python modules organized to collect data at the scale of a large system and to present the results as charts using current technologies.
As a Data Science project, it covers many topics in data handling and is also a multidisciplinary subject. Highlighted updates are posted, and every step follows the proposed DataPipeline.
Not all techniques are developed here, but the most useful ones are included to illustrate the concepts and tools that Biosoft uses.
Once the libraries needed to extract data are installed, the project can be built with the following:
Scrapy
selenium
boto3
botocore
requests
urllib3
bs4
pandas
numpy
scipy
matplotlib
scikit-
jupyter-server
jupyter_client
jupyter_core
jupyterlab
jupyterlab-pygments
jupyterlab_server
Dockerfile for building and running the container that sets up the work environment
docker build -t YOUR_IMAGE_NAME .
docker run -v /dev/shm:/dev/shm --shm-size=2gb -d -p 80:8080 YOUR_IMAGE_NAME
What about container performance and the settings needed to collect huge datasets?
A diagram showing the proposed development environment and toolkit.
This picture illustrates the process by which files are collected and stored for each provided URL.
The technology stack defines how the components communicate through processes in the data flow design, from scanning the website and mining the data to collect and ingest, through processing, up to storage and plotting, so that the system is reliable and workable.
Three components of the data collection cycle with Scrapy are presented here. Given a target URL, the crawler follows it to find common and media files, which are stored in AWS services: DynamoDB for structured (key-value) data and an S3 bucket for raw documents. Two helper components for specific purposes are also shown. Once the data is retrieved, it is plotted in Tableau. In addition, the Selenium component contains tools for clicking on dynamic JS events to obtain valid download links for files.
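As a rough illustration of the storage step described above, the following minimal sketch uploads a downloaded file to S3 and records its metadata in DynamoDB. It is not the project's actual code. The function name `store_file`, the key layout, and the item fields are hypothetical, and the sketch assumes a DynamoDB table whose partition key is `url`. The only boto3 calls it relies on are `put_object` on an S3 client and `put_item` on a DynamoDB `Table` resource, both passed in by the caller.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def store_file(s3_client, table, bucket, url, body):
    """Upload raw bytes to S3 and write a metadata record to DynamoDB.

    s3_client: object with put_object(Bucket=..., Key=..., Body=...),
               e.g. boto3.client("s3")
    table:     object with put_item(Item=...),
               e.g. boto3.resource("dynamodb").Table("hypothetical-table")
    """
    parsed = urlparse(url)
    # Fall back to a placeholder name when the URL path has no file name.
    filename = posixpath.basename(parsed.path) or "index"
    digest = hashlib.sha256(body).hexdigest()

    # Hypothetical key layout: <host>/<hash prefix>_<file name>
    key = f"{parsed.netloc}/{digest[:12]}_{filename}"
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)

    # Key-value metadata; "url" is assumed to be the table's partition key.
    item = {
        "url": url,
        "s3_key": key,
        "size_bytes": len(body),
        "sha256": digest,
    }
    table.put_item(Item=item)
    return item
```

Passing the client and table as arguments keeps the function independent of credentials and region settings, and makes it possible to try it with stand-in objects before pointing it at real AWS resources.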
The picture below shows how the files are stored in distributed cloud storage in Amazon S3 buckets. The data comes from web scraping various Chilean websites and is kept as a pure document store, with files retrieved in a variety of formats, including PDF, CSV, XLS, Stata, and more.
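To make the handling of mixed formats concrete, here is a minimal sketch of how scraped links might be grouped by file type before download. The extension table and the function names `classify_link` and `group_links` are assumptions for illustration, not part of the project. `.dta` is used as the Stata data file extension.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to a format label.
DOCUMENT_EXTENSIONS = {
    ".pdf": "pdf",
    ".csv": "csv",
    ".xls": "excel",
    ".xlsx": "excel",
    ".dta": "stata",
}


def classify_link(url):
    """Return the format label for a URL, or None if it is not a known document."""
    # Use only the path so query strings and fragments do not hide the extension.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return DOCUMENT_EXTENSIONS.get(extension)


def group_links(urls):
    """Group document URLs by format label, skipping everything else."""
    groups = {}
    for url in urls:
        label = classify_link(url)
        if label is not None:
            groups.setdefault(label, []).append(url)
    return groups
```

Matching on the extension is only a heuristic. A link without an extension, or one that redirects, would need a check of the response's `Content-Type` header instead.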
BeautifulSoup: Interfaces for reliable connections to the target URL.
Scrapy
- Google Colab tips: using both %%writefile magic and %%javascript magic in the same cell
- Scrapy - User Agents and Proxies
- Scrapy - LinkExtractors
- Scrapy - LinkExtractors - GitHub's Documentation
Selenium
Xpath
- Parsing HTML with Xpath
- Scrapy - User Agents and Proxies
- XPath tester
- XPath tester codebeautify
- Xpath for python: is xpath underappreciated?
- Xpath examples
Regular expressions (RegEx)
- Regular-expressions
- Python RegEx
- Regex Tester by language programming
- Never use RegEx for web scraping
Storing data in the cloud (AWS)
Tableau - Data Visualization
DynamoDB and its purposes (AWS)
Big Data and Data flow design: Apache NiFi
- Processor: GetDynamoDB
- Processor: PutDynamoDB
- Processor: PutDynamoDBRecord
- Processor: DeleteDynamoDB
- How to Use Pyspark For Your Machine Learning Project
- Difference Between Big Data and Data Mining
- Web Scraping with NiFi and Scrapy via ExecuteProcess processor
- Building Data Lake using Apache NiFi | The Complete Guide
- Docker image | Apache NiFi
NoSQL Databases and Graph queries