dzhang32/biotech_web_scrape

Web scraping biotech company information

As a final-year bioinformatics PhD student, I decided to make the job hunt a little more enjoyable by automating and standardising the process of finding companies that I would be interested in working for. Here, I use Beautiful Soup and Selenium to find details of all UK companies within the field of biotechnology.

Disclaimer

I do not recommend web scraping in any way. If you do web scrape, please respect the Terms of Service and robots.txt of the site you scrape. For more information on the legality of web scraping, you may find this blog useful.
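As a rough illustration of respecting robots.txt, the sketch below uses Python's standard-library `urllib.robotparser` to check whether a URL may be fetched. It is not taken from the scripts in this repository; the helper name `is_allowed`, the example rules and the `example.com` URLs are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper for illustration; not part of this repository.
    """
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines,
    # so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: everything is allowed except paths under /private/.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS, "https://example.com/companies"))
print(is_allowed(EXAMPLE_ROBOTS, "https://example.com/private/data"))
```

In a real run you would download the site's own robots.txt (for example with `RobotFileParser.set_url` followed by `read`) rather than pass in a hard-coded string, and you would still need to read the Terms of Service separately.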

Contents

Because websites change frequently, these scripts are unlikely to run out of the box without modification.

| Script | Description |
| --- | --- |
| 01a-get_biotech_companies.py | Web scrape names of all UK biotech companies. |
| 01b-tidy_biotech_companies.py | Tidy data from previous step. |
| 02a-scrape_company_info.py | Use Selenium to navigate, search and scrape description, size, location, URL and domains/tags of companies. |
| 02b-merge_tidy_company_info.py | Tidying. Find the exceptions that were not scraped successfully. |
| 02c-scrape_company_info_2nd_pass.py | 2nd pass, re-run scraping on the exceptions. |
| 02d-merge_exceptions_2nd_pass.py | Merge together all company info. |
| utils.py | Utility function to keep project self-contained. |
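The two-pass pattern in the 02b to 02d scripts (find the companies that failed, re-scrape only those, then merge) could look roughly like the sketch below. It is an illustration under assumptions, not the code in this repository: the function names, the field list and the company names are hypothetical, and the real scraper is replaced by a stand-in function so the example runs without a browser.

```python
from typing import Callable, Dict, List, Optional

# Fields listed in the 02a description; the exact keys are an assumption.
REQUIRED_FIELDS = ("description", "size", "location", "url", "tags")

CompanyInfo = Optional[Dict[str, str]]


def find_exceptions(results: Dict[str, CompanyInfo]) -> List[str]:
    """Return companies whose record is missing or has an empty required field."""
    return [
        name
        for name, info in results.items()
        if info is None or any(not info.get(field) for field in REQUIRED_FIELDS)
    ]


def second_pass(
    results: Dict[str, CompanyInfo],
    scrape: Callable[[str], CompanyInfo],
) -> Dict[str, CompanyInfo]:
    """Re-run `scrape` on the exceptions only and merge into a copy of `results`."""
    merged = dict(results)
    for name in find_exceptions(results):
        try:
            merged[name] = scrape(name)
        except Exception:
            # Keep the failure visible so it can be inspected or retried later.
            merged[name] = None
    return merged


def fake_scrape(name: str) -> CompanyInfo:
    """Stand-in for a real Selenium-based scraper (hypothetical)."""
    return {
        "description": f"{name} description",
        "size": "11-50",
        "location": "London",
        "url": "https://example.com",
        "tags": "biotech",
    }


# Made-up first-pass output: one complete record, one failed scrape.
first_pass = {"Acme Bio": fake_scrape("Acme Bio"), "Helix Ltd": None}
print(find_exceptions(first_pass))
merged = second_pass(first_pass, fake_scrape)
print(find_exceptions(merged))
```

Keeping failed companies in the results as `None`, rather than dropping them, makes it easy to see what still needs another pass after the merge.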

Acknowledgements

This project was inspired by this blog post and accompanying YouTube video by Chris Lovejoy.
