GisAids Influenza Downloader
This project downloads data from EpiFlu (the influenza virus database) in GISAID, including metadata and genetic sequences.
- search_page.py can only be used to download all entries from EpiFlu, without the ability to filter specific datasets.
- download_item.py selects up to 8,000 data entries each time. You need to manually click the 'Download' button in the browser, select the required file type, and then manually download the file.
- Each time download_item.py initiates a download, the program opens a new browser page and logs in again.
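The 8,000-entry limit means a large result set has to be handled in several rounds. The following is a minimal sketch of that batching idea only, not the code used in download_item.py; the function name batch_entries and the sample IDs are hypothetical and chosen for illustration.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_SIZE = 8000  # per-selection limit stated in the notes above


def batch_entries(entry_ids: Sequence[str], size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of entry_ids, each holding at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(entry_ids), size):
        yield list(entry_ids[start:start + size])


if __name__ == "__main__":
    # Made-up placeholder IDs; real IDs would come from the search results.
    ids = [f"EPI_ISL_{i}" for i in range(20000)]
    for n, batch in enumerate(batch_entries(ids), start=1):
        print(f"batch {n}: {len(batch)} entries")
```

Each batch would correspond to one run of the manual 'Download' step described above.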
Before you begin using this script, ensure that your system meets the following requirements:
- GISAID Account
- Operating System: A Linux desktop environment is required to run the script.
- Firefox Browser: The script uses Selenium, which requires the Firefox browser. Install Firefox if it is not already installed.
- Firefox WebDriver: You also need the Firefox WebDriver, which allows Selenium to control Firefox. The WebDriver can be downloaded from the following URL:
  After downloading, extract the WebDriver and place it in the root directory of the project.
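Before running anything, it can help to confirm these prerequisites from Python. The sketch below is not part of the project; check_prerequisites is a hypothetical helper, and it assumes the extracted WebDriver binary is named geckodriver and that a desktop session sets DISPLAY or WAYLAND_DISPLAY.

```python
import os
import platform
import shutil
from pathlib import Path


def check_prerequisites(project_root: str = ".") -> dict:
    """Report which of the documented prerequisites appear to be satisfied."""
    root = Path(project_root)
    return {
        "linux": platform.system() == "Linux",
        # Assumption: a desktop session sets DISPLAY (X11) or WAYLAND_DISPLAY.
        "desktop_session": bool(
            os.environ.get("DISPLAY") or os.environ.get("WAYLAND_DISPLAY")
        ),
        "firefox_on_path": shutil.which("firefox") is not None,
        # Assumption: the extracted WebDriver binary is named "geckodriver".
        "webdriver_in_root": (root / "geckodriver").is_file(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

The GISAID account itself cannot be checked this way; that is only verified when the scripts log in.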
- Clone the Repository: First, clone this repository to your local machine using Git.
- Set up Conda Environment: Use the environment.yml file to create a Conda environment with all the necessary dependencies:
conda env create -f environment.yml
Follow these steps to use this project:
- In the download_item.py and search_page.py files, fill in username = '' and password = '' with your GISAID credentials, and set the page_number variable to the maximum number of pages.
- Execute search_page.py to download all search page entry information:
python search_page.py
- After the download is complete, execute combine_json.py to merge all the page information:
python combine_json.py
- Execute download_item.py to download the required entry data:
python download_item.py
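The merge step between search_page.py and download_item.py can be pictured roughly as follows. This is a sketch under assumptions, not the contents of combine_json.py: the function name combine_pages, the file pattern page_*.json, and the idea that each page file holds a JSON list of entries are all hypothetical.

```python
import json
from pathlib import Path
from typing import Any, List


def combine_pages(page_dir: str, pattern: str = "page_*.json") -> List[Any]:
    """Load every per-page JSON file in page_dir and merge them into one list."""
    combined: List[Any] = []
    # Sorted by file name, so "page_10" sorts before "page_2" unless names are zero-padded.
    for path in sorted(Path(page_dir).glob(pattern)):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        if isinstance(data, list):
            combined.extend(data)  # a page holding a list of entries
        else:
            combined.append(data)  # a page holding a single object
    return combined


if __name__ == "__main__":
    entries = combine_pages(".")
    print(f"merged {len(entries)} entries")
```

The actual file names and JSON layout produced by search_page.py may differ, so the pattern and the list handling would need adjusting to match.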