web-scraping

A collection of scripts for web scraping.

email_scraper.py

Collects email addresses from web pages. Define a list of URLs, then run the script. By default, the output is saved in CSV format.

Dependencies:

asyncio - a library for writing concurrent code with the async/await syntax (part of the Python standard library)
pyppeteer - an unofficial Python port of Puppeteer, the JavaScript library for automating (headless) Chrome/Chromium
pandas - a data analysis and manipulation tool
validators - a data validation library

pyppeteer requires Python >= 3.6. Install with pip from PyPI:

pip install pyppeteer

Or install the latest version from the pyppeteer GitHub repository:

pip install -U git+https://github.com/pyppeteer/pyppeteer@dev

pandas requires Python 3.7.1 or above. Install with pip from PyPI:

pip install pandas

validators supports Python versions 2.7, 3.3, 3.4, 3.5 and PyPy. Install with pip:

pip install validators
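The workflow described above (load page content, pick out email addresses, save them as CSV) can be sketched roughly as follows. This is a minimal illustration rather than the script's actual code: it works on an HTML string instead of a page loaded through pyppeteer, uses only the standard library (csv and re instead of pandas and validators), and the names EMAIL_RE, extract_emails and emails_to_csv are hypothetical. The regular expression is deliberately simple and will not match every valid address.

```python
import csv
import io
import re

# Deliberately simple pattern (an assumption for this sketch); it can miss
# some valid addresses and accept some invalid ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html):
    """Return unique email-like strings, lowercased, in order of first appearance."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(html):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            result.append(email)
    return result


def emails_to_csv(emails, out):
    """Write one address per row, under an 'email' header, to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["email"])
    for email in emails:
        writer.writerow([email])


# Stand-in for the HTML of a page that would normally be fetched from a URL.
sample_html = """
<p>Contact: <a href="mailto:Info@Example.com">info@example.com</a></p>
<p>Sales: sales@example.org, support: info@example.com</p>
"""

emails = extract_emails(sample_html)
buffer = io.StringIO()
emails_to_csv(emails, buffer)
print(buffer.getvalue())
```

In the real script, each candidate address could additionally be passed through the validators library before being saved.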

Notes:

df.to_csv('email_addresses.csv', index=False) saves the collected data to a CSV file named "email_addresses.csv". To save it in another format, use a different DataFrame method such as to_html, to_json or to_excel. These and other methods are documented in the pandas API reference.
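As a small illustration of the note above, the following sketch builds a DataFrame with made-up addresses and exports it in three formats. The column name "email" and the sample data are assumptions; the script's actual DataFrame may be laid out differently. Output is written to in-memory strings here so nothing touches the disk; pass a file name instead to write a file.

```python
import io

import pandas as pd

# Made-up sample data; the real script fills the DataFrame with scraped results.
df = pd.DataFrame({"email": ["info@example.com", "sales@example.org"]})

# CSV without the index column, as in the note above.
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
csv_text = csv_buffer.getvalue()

# JSON as a list of records, one object per row.
json_text = df.to_json(orient="records")

# HTML table without the index column.
html_text = df.to_html(index=False)

print(csv_text)
print(json_text)
```

to_excel works the same way but needs an additional Excel writer package installed, so it is left out of this sketch.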

traversive_email_scraper.py

Automatically crawls several pages of a website to collect email addresses. As with email_scraper.py, define a list of URLs, then run the script to start crawling.

Requires BeautifulSoup4 in addition to the dependencies listed above:

pip install beautifulsoup4
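The crawling idea can be sketched as a breadth-first traversal that follows links staying on the same host. This is an assumption about the general approach, not the script's actual logic. To stay dependency-free, the sketch parses links with the standard library's html.parser rather than BeautifulSoup, and it takes a hypothetical fetch callable (URL in, HTML out) in place of a pyppeteer page load. The names LinkParser, crawl and fake_site are illustrative only.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of same-host pages; returns URLs in visit order.

    `fetch` is a hypothetical callable that takes a URL and returns its HTML,
    or None if the page cannot be loaded.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(url, href))
            parts = urlparse(absolute)
            if parts.scheme not in ("http", "https") or parts.netloc != host:
                continue  # skip mailto: links and other hosts
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# In-memory stand-in for a website, so the sketch runs without network access.
fake_site = {
    "https://example.com/": '<a href="/about">About</a> <a href="https://other.example/">Ext</a>',
    "https://example.com/about": '<a href="/">Home</a> <a href="contact#form">Contact</a> '
                                 '<a href="mailto:info@example.com">Mail</a>',
    "https://example.com/contact": "<p>info@example.com</p>",
}

pages = crawl("https://example.com/", fake_site.get)
print(pages)
```

The max_pages limit and the same-host check keep the crawl bounded; email extraction would then run on the HTML of each visited page.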

aennisjr/web-scraping
