# BioInformatics Assignment 1: H3N2 Protein Sequences
**- Matthew Johnson, Sept. 19, 2018**
- Sequence length 500-600 aa

Influenza A virus is responsible for some of the worst pandemics in history. The 1918 flu pandemic, caused by an H1N1 influenza virus, was unusually deadly: it infected 500 million people across the world and killed 50 to 100 million (three to five percent of the world's population).
**In this assignment we want to investigate one subtype of influenza A virus: H3N2.**<br>
*Please collect H3N2 protein sequences, put them into one text file (.txt or equivalent) and submit through Moodle.*
1. Use **FASTA** format.
2. Name each sequence **[location]/[year]**; if multiple sequences exist for the same location and year, you may use [location]/[year]/[number].
3. Do not include many sequences with the same location and year; normally fewer than 3.
4. The number of sequences should be in the range of 20 to 50.
5. The sequences should reflect the overall distribution of H3N2.
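
The naming rule in item 2 can be illustrated with a small sketch. The helper `assign_names` is hypothetical (it is not part of the assignment or of the notebook below) and assumes the location and year of each sequence are already known as a pair of strings:

```python
from collections import Counter

def assign_names(records):
    """Return one name per (location, year) pair.

    Pairs that occur once become 'location/year'; repeated pairs become
    'location/year/1', 'location/year/2', ... in input order.
    """
    totals = Counter(records)  # how often each pair occurs overall
    seen = Counter()           # how often each pair has been seen so far
    names = []
    for location, year in records:
        if totals[(location, year)] == 1:
            names.append(f"{location}/{year}")
        else:
            seen[(location, year)] += 1
            names.append(f"{location}/{year}/{seen[(location, year)]}")
    return names

example = [("Ukraine", "2009"), ("France", "2011"), ("Ukraine", "2009")]
print(assign_names(example))
```

Numbering only the repeated pairs keeps the common case short while still giving every sequence a unique name.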

**Import Useful Libraries:**

In [2]:
import re
import time
import selenium
from selenium import webdriver
import warnings
warnings.filterwarnings('ignore')

**Function to label each sequence with [location]/[year]:**

In [3]:
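def label_fasta_format(country_, text_):
    """Prefix a scraped FASTA record with '[location]/[year]  '."""
    try:
        # The first parenthesised group holds the strain name; the year is its
        # last '/'-separated field, before the '(' that opens the subtype
        year_ = re.findall(r'\(.*?\)', text_)[0].split('/')[-1].split('(')[0]
        # Remove the strain name from the header line (the pattern stops at
        # the first ')', so the outer ')' stays in the header)
        temp_text = re.sub(r" ?\([^)]+\)", "", text_)
        prefix = country_ + '/' + year_ + '  '

        return prefix + temp_text
    except Exception:
        # No parenthesised group found: keep the record but flag the label
        return 'xxxx/xxxx  ' + text_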
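
In strict FASTA, the header line starts with `>`, whereas the function above puts the label in front of it. A possible variant is sketched below; it is not the notebook's method. The name `label_fasta_record`, the pattern `STRAIN_RE`, and the example header in the usage note are all assumptions made for illustration. It assumes the header carries a strain designation of the form `(A/.../year(HxNy))`.

```python
import re

# Matches a strain designation such as " (A/Somewhere/12/2009(H3N2))" and
# captures the two- to four-digit year that precedes the subtype.
STRAIN_RE = re.compile(r"\s*\(A/[^()]*?/(\d{2,4})\(H\d+N\d+\)\)")

def label_fasta_record(location, text):
    """Return the record with a '>location/year description' header line."""
    header, _, sequence = text.partition("\n")
    match = STRAIN_RE.search(header)
    year = match.group(1) if match else "xxxx"  # same fallback marker as above
    # Drop the strain designation and the original '>' from the description
    description = STRAIN_RE.sub("", header).lstrip(">").strip()
    return f">{location}/{year} {description}\n{sequence}"
```

For a made-up record `">XXX00000.1 hemagglutinin, partial [Influenza A virus (A/Somewhere/12/2009(H3N2))]\nMKTII"` and location `"Ukraine"`, this yields a header of `>Ukraine/2009 XXX00000.1 hemagglutinin, partial [Influenza A virus]`, with the sequence lines unchanged.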

**Scrape the sequences: we chose a set of locations and took up to 3 sequences from each:**

In [4]:
countries = ['Ukraine', 'France', 'Spain', 'Germany', 'Poland', 'Mexico', 'Russia', 'China', 'Korea', 'Japan',
            'Alberta', 'Ontario', 'New York', 'Boston', 'Brazil', 'India', 'Sweden', 'Australia', 'New Zealand',
            'Egypt', 'England']

seq_list = []

from selenium.webdriver.common.by import By  # locator constants (Selenium 3 and 4)

driver = webdriver.Firefox()
driver.implicitly_wait(5)  # seconds; set once, applies to every element lookup

for country in countries:
    
    # Bring up the NCBI Protein search page for H3N2
    driver.get('https://www.ncbi.nlm.nih.gov/protein/?term=H3N2')

    # Detailed query: H3N2, this location, sequence length 500-600 aa
    text_for_box = f'(("H3N2 subtype"[Organism] OR H3N2[All Fields]) AND {country}[All Fields]) AND ("500"[SLEN] : "600"[SLEN])'

    # Send the query to the search box
    driver.find_element(By.XPATH, '/html/body/div/div[1]/form/div[1]/div[5]/div/div[4]/div[2]/div/textarea') \
                .send_keys(text_for_box)

    # Press the Search button
    driver.find_element(By.XPATH, '//*[@id="ui-ncbibutton-5"]').click()
    
    # Collect the FASTA links from the result list
    links = [link.get_attribute('href') for link in driver.find_elements(By.XPATH, '//*[@id="ReportShortCut6"]')]
    
    # Keep at most 3 sequences per location
    for link in links[:3]:
        
        attempts = 0
        while attempts < 5:
            try:
                driver.get(link)
                p = driver.find_element(By.XPATH, '/html/body/div/div[1]/form/div[1]/div[4]/div/div[5]/div[2]/div[1]/pre')
                seq_list.append(label_fasta_format(country, p.text))
                driver.back()
                break
            
            except Exception:
                print('Error!')
                attempts += 1
                time.sleep(5)  # pause before retrying

                
driver.quit()
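
The search string used in the loop combines three conditions: the H3N2 subtype, the location, and the sequence-length range. A minimal sketch of a helper that builds it is shown here. `build_query` is a hypothetical name, and quoting multi-word locations such as `New York` is an assumption about how the NCBI query parser treats phrases, not something the notebook does.

```python
def build_query(location, min_len=500, max_len=600):
    """Build the NCBI Protein search string for one location."""
    # Assumption: a multi-word location is quoted so it is searched as a phrase
    term = f'"{location}"' if " " in location else location
    return (
        f'(("H3N2 subtype"[Organism] OR H3N2[All Fields]) '
        f'AND {term}[All Fields]) '
        f'AND ("{min_len}"[SLEN] : "{max_len}"[SLEN])'
    )

print(build_query("Ukraine"))
```

For single-word locations the result is identical to the string built in the loop above.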

**Overview of Sequences:**

In [8]:
print('# Sequences:', len(seq_list))
print('# Countries:', len(countries))

# Sequences: 57
# Countries: 21
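
The count of 57 is above the 20 to 50 range stated in requirement 4, so some entries would need to be dropped before submission. The sketch below shows one way such checks could be automated. `check_requirements` is a hypothetical helper; it assumes each entry starts with its `[location]/[year]` label followed by the `>` of the original header, as produced by `label_fasta_format`, and it reads "fewer than 3" as "at most 2".

```python
from collections import Counter

def check_requirements(labelled, min_total=20, max_total=50, max_same=2):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    total = len(labelled)
    if not (min_total <= total <= max_total):
        problems.append(f"{total} sequences, expected {min_total} to {max_total}")
    # The label is everything before the first '>' of the original header
    labels = Counter(entry.split(">", 1)[0].strip() for entry in labelled)
    for label, count in sorted(labels.items()):
        if count > max_same:
            problems.append(f"{label} appears {count} times, expected at most {max_same}")
    return problems
```

Calling `check_requirements(seq_list)` after scraping would list every rule that the current collection breaks.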


**Saving the Data as a .txt:**

In [6]:
with open('h3n2_sequences_sep19.txt', 'w') as filehandle:  
    for listitem in seq_list:
        filehandle.write('%s\n\n' % listitem)
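
As a sanity check, the saved file can be read back and each sequence length compared against the 500 to 600 aa range. The following is a minimal sketch under the assumption that entries are separated by one blank line, as written by the cell above, and that no entry contains a blank line itself. `read_entries` and `out_of_range` are hypothetical helper names.

```python
def read_entries(path):
    """Return (header, sequence) pairs from a blank-line-separated file."""
    with open(path) as handle:
        blocks = [b for b in handle.read().split("\n\n") if b.strip()]
    entries = []
    for block in blocks:
        # First line is the header; the remaining lines are the wrapped sequence
        header, _, body = block.strip().partition("\n")
        entries.append((header, body.replace("\n", "")))
    return entries

def out_of_range(entries, min_len=500, max_len=600):
    """Return the headers whose sequence length is outside [min_len, max_len]."""
    return [header for header, seq in entries
            if not (min_len <= len(seq) <= max_len)]
```

For example, `out_of_range(read_entries('h3n2_sequences_sep19.txt'))` would return the headers of any sequences that fall outside the requested length range.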

**Sample Entry:**

In [7]:
seq_list[0]

'Ukraine/2009  >AJK03248.1 hemagglutinin, partial [Influenza A virus)]\nMKTIIALSHILCLVFAQKLPGNDNSTATLCLGHHAVPNGTIVKTITNDQIEVTNATELVQSSSTGEICDS\nPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNES\nFNWTGVTQNGTSSACIRRSNNSFFSRLNWLTHLKFKYPALNVTMPNNEQFDKLYIWGVHHPGTDNDQIFL\nYAQASGRITVSTKRSQQTVIPNIGSRPRVRNIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGK\nSSIMRSDAPIGKCNSECITPNGSIPNDKPFQNVNRITYGACPRYVKQNTLKLATGMRNVPEKQTRGIFGA\nIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEG\nRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCD\nNACIGSIRNGTYDHDVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNI\nRCNICI'