### Collecting data from the Web

In this notebook we systematize the collection of data from the GenBank website.

Given a sequence identified by an id, we want to download its record from the site, parse it (as we have already done), and then insert the relevant parts into a database.

Example links to sequences:
- https://www.ncbi.nlm.nih.gov/nuccore/L42022
- https://www.ncbi.nlm.nih.gov/nuccore/L42023
- https://www.ncbi.nlm.nih.gov/nuccore/LC740868.1

After this link is requested, JavaScript inside the page makes a second request to the server for the sequence record.

Example:
- https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&log$=seqview&db=nuccore&report=genbank&id=804715&conwithfeat=on&hide-cdd=on&ncbi_phid=null
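
The record request is an ordinary GET with a query string, so its URL can be assembled from its parts. The following is a minimal sketch, assuming the `viewer.fcgi` endpoint and the parameter set used by the code cells in this notebook. The helper name `record_url` is hypothetical, and `804715` is simply the id that appears in the example link above.

```python
from urllib.parse import urlencode

# Endpoint used by the code cells in this notebook
VIEWER = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"

def record_url(uid, max_size=5000000):
    """Build the URL that asks for the GenBank record of an internal id."""
    params = {
        "id": uid,
        "db": "nuccore",
        "report": "genbank",
        "conwithfeat": "on",
        "hide-cdd": "on",
        "retmode": "text",
        "maxdownloadsize": max_size,
    }
    # urlencode keeps the insertion order of the dictionary
    return VIEWER + "?" + urlencode(params)

print(record_url(804715))
```

Keeping the parameters in a dictionary makes it easy to change a single value, such as the download size limit, without editing a long string by hand.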

In the following example we request the page, but it does not contain the record we are interested in.

The record is loaded asynchronously by JavaScript.

In [4]:
import requests
r = requests.get('https://www.ncbi.nlm.nih.gov/nuccore/PA500505.1')
print(r.content)



In the following example, instead of requesting the page that contains the record, we request only the record itself, once we understand how it is requested.

### Problem

The requested page does not include the sequence record.

The record must be requested by an internal numeric id, which is different from the sequence id (L42022, for example).

### Solution

The solution is to make two requests. In the first, we request the page and extract only the internal numeric id associated with the sequence. That internal id is then used to make the second request.
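
The extraction step can be tried offline before touching the network. This is a small sketch that uses the standard library's `html.parser` instead of BeautifulSoup, assuming the id lives in a `<meta name="ncbi_uidlist">` tag as in the cells of this notebook. The names `UidFinder` and `extract_uid` are hypothetical, and the sample page fragment is made up for illustration.

```python
from html.parser import HTMLParser

class UidFinder(HTMLParser):
    """Remember the content of a <meta name="ncbi_uidlist" ...> tag."""

    def __init__(self):
        super().__init__()
        self.uid = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "ncbi_uidlist":
            self.uid = attrs.get("content") or ""

def extract_uid(html):
    """Return the internal id found in the page, or "" when it is absent."""
    finder = UidFinder()
    finder.feed(html)
    return finder.uid

# Made-up page fragment; a real page is much larger
sample = '<html><head><meta name="ncbi_uidlist" content="804715"></head></html>'
print(extract_uid(sample))
print(repr(extract_uid("<html><head></head></html>")))
```

Returning an empty string when the tag is missing lets the caller skip the second request instead of sending it with an empty id.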



In [5]:
import requests
from bs4 import BeautifulSoup

# First request: fetch the HTML page of the sequence
r = requests.get('https://www.ncbi.nlm.nih.gov/nuccore/L42022')
# Parse the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Find the <meta name="ncbi_uidlist"> tag, which holds the internal numeric id
lines = soup.find_all('meta', {'name': "ncbi_uidlist"})

uid = ""
for line in lines:
	if 'content' in line.attrs:
		uid = line.attrs['content']

if uid:
	# Second request: fetch only the GenBank record, using the internal id
	url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id={}&db=nuccore&report=genbank&conwithfeat=on&hide-cdd=on&retmode=text&maxdownloadsize=5000000".format(uid)
	r2 = requests.get(url)
	print(r2.content)
else:
	# Without the internal id there is no valid URL to request
	print("ncbi_uidlist meta tag not found")

b'LOCUS       HIVI5C                   231 bp    DNA     linear   VRL 24-MAR-1997\nDEFINITION  Human immunodeficiency virus type 1 (isolate genotype C, I5) gag\n            gene, partial cds.\nACCESSION   L42022\nVERSION     L42022.1\nKEYWORDS    gag gene; p24 protein.\nSOURCE      Human immunodeficiency virus 1 (HIV-1)\n  ORGANISM  Human immunodeficiency virus 1\n            Viruses; Riboviria; Pararnavirae; Artverviricota; Revtraviricetes;\n            Ortervirales; Retroviridae; Orthoretrovirinae; Lentivirus.\nREFERENCE   1  (bases 1 to 231)\n  AUTHORS   Voevodin,A., Crandall,K.A., Seth,P. and al Mufti,S.\n  TITLE     HIV type 1 subtypes B and C from new regions of India and Indian\n            and Ethiopian expatriates in Kuwait\n  JOURNAL   AIDS Res. Hum. Retroviruses 12 (7), 641-643 (1996)\n   PUBMED   8743090\nFEATURES             Location/Qualifiers\n     source          1..231\n                     /organism="Human immunodeficiency virus 1"\n                     /proviral\n   


In [7]:
import time
for n in range(10, 15):
    url = "https://www.ncbi.nlm.nih.gov/nuccore/{}".format(n)
    time.sleep(1)  # wait one second, as we would between real requests
    print(url)

https://www.ncbi.nlm.nih.gov/nuccore/10
https://www.ncbi.nlm.nih.gov/nuccore/11
https://www.ncbi.nlm.nih.gov/nuccore/12
https://www.ncbi.nlm.nih.gov/nuccore/13
https://www.ncbi.nlm.nih.gov/nuccore/14


In [18]:
import requests
r = requests.get('https://www.ncbi.nlm.nih.gov/nuccore/HIVI5C')
# print(r.content)

In [43]:
import requests
from bs4 import BeautifulSoup
#https://www.ncbi.nlm.nih.gov/nuccore/NC_000002.12?report=genbank&from=226731312&to=226799820&strand=true
# First request: fetch the HTML page of the sequence
r = requests.get('https://www.ncbi.nlm.nih.gov/nuccore/CP039618.1')
# Parse the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Find the <meta name="ncbi_uidlist"> tag, which holds the internal numeric id
lines = soup.find_all('meta', {'name': "ncbi_uidlist"})

uid = ""
for line in lines:
	if 'content' in line.attrs:
		uid = line.attrs['content']

if uid:
	# Second request: fetch only the GenBank record, using the internal id
	url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id={}&db=nuccore&report=genbank&conwithfeat=on&hide-cdd=on&retmode=text&maxdownloadsize=5000000".format(uid)
	r2 = requests.get(url)
	print(r2.content)
else:
	# Without the internal id there is no valid URL to request
	print("ncbi_uidlist meta tag not found")

b'OUTPUT_TOO_BIG\n'
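
Here the server answered with a short marker instead of a record, presumably because the record is larger than the `maxdownloadsize` value sent in the request (that cause is an assumption). A small check before parsing avoids treating such a reply as a GenBank record. The function name `classify_response` and its labels are hypothetical.

```python
def classify_response(body):
    """Classify the raw bytes returned by the record request."""
    if body.strip() == b"OUTPUT_TOO_BIG":
        return "too_big"
    if body.startswith(b"LOCUS"):
        # GenBank flat-file records begin with a LOCUS line
        return "record"
    return "unknown"

print(classify_response(b"OUTPUT_TOO_BIG\n"))
print(classify_response(b"LOCUS       HIVI5C                   231 bp"))
```

In the notebook this would be called as `classify_response(r2.content)` before any regular expression is applied to the response.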


In [42]:
import re

# Match CDS feature lines; r2.text is the decoded response, so line breaks are real "\n"
cds_regex = re.compile(r'\n\s+CDS\s')

# Find all CDS features in the record
cds_matches = cds_regex.findall(r2.text)

# Print the number of CDS features found
print(f"Found {len(cds_matches)} CDS features.")

Found 0 CDS features.
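
A count of zero is expected when the response is only the `OUTPUT_TOO_BIG` marker. To see the counting work, the sketch below runs on a small invented feature-table fragment (not taken from any real record). It assumes the GenBank flat-file layout visible in the outputs of this notebook, where feature keys such as `source` start after exactly five spaces. The name `count_cds` is hypothetical.

```python
import re

# Invented feature-table fragment, for illustration only
sample = (
    "FEATURES             Location/Qualifiers\n"
    "     source          1..900\n"
    "                     /organism=\"Example organism\"\n"
    "     CDS             10..300\n"
    "                     /product=\"first example protein\"\n"
    "     CDS             complement(400..850)\n"
    "                     /note=\"CDS inside a qualifier is not counted\"\n"
)

# A feature key sits at the start of a line, after exactly five spaces
cds_regex = re.compile(r"^ {5}CDS\s", re.MULTILINE)

def count_cds(record_text):
    """Count the CDS features in a decoded GenBank record."""
    return len(cds_regex.findall(record_text))

print(count_cds(sample))
print(count_cds("OUTPUT_TOO_BIG\n"))
```

Anchoring the pattern to the start of the line keeps the word CDS inside qualifiers or comments from being counted as a feature.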


__Exercise__

In [12]:
def url_get(i):
    """Build the page URLs of i consecutive accessions, starting at NG_009243."""
    url_list = []
    for n in range(243, 243 + i):
        url = "https://www.ncbi.nlm.nih.gov/nuccore/NG_009{}".format(n)
        url_list.append(url)
    return url_list
url_get(4)

['https://www.ncbi.nlm.nih.gov/nuccore/NG_009243',
 'https://www.ncbi.nlm.nih.gov/nuccore/NG_009244',
 'https://www.ncbi.nlm.nih.gov/nuccore/NG_009245',
 'https://www.ncbi.nlm.nih.gov/nuccore/NG_009246']

In [13]:
import requests

content = []
for url in url_get(4):
    r = requests.get(url)
    content.append(r.content)
print(content)



In [14]:
import requests
from bs4 import BeautifulSoup

# Parse each downloaded page and fetch the corresponding record
for c in content:
    soup = BeautifulSoup(c, 'html.parser')

    # Find the <meta name="ncbi_uidlist"> tag, which holds the internal numeric id
    lines = soup.find_all('meta', {'name': "ncbi_uidlist"})

    uid = ""
    for line in lines:
        if 'content' in line.attrs:
            uid = line.attrs['content']

    if not uid:
        # Without the internal id there is no valid URL to request
        continue

    url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id={}&db=nuccore&report=genbank&conwithfeat=on&hide-cdd=on&retmode=text&maxdownloadsize=5000000".format(uid)
    r2 = requests.get(url)

    # str() on bytes keeps the b'...' wrapper and the escaped \n seen in the output
    r3 = str(r2.content)
    print(r3, '\n')

b'LOCUS       NG_009243             197905 bp    DNA     linear   PRI 02-JAN-2023\nDEFINITION  Homo sapiens dispatched RND transporter family member 1 (DISP1),\n            RefSeqGene on chromosome 1.\nACCESSION   NG_009243\nVERSION     NG_009243.2\nKEYWORDS    RefSeq; RefSeqGene.\nSOURCE      Homo sapiens (human)\n  ORGANISM  Homo sapiens\n            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;\n            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;\n            Catarrhini; Hominidae; Homo.\nREFERENCE   1  (bases 1 to 197905)\n  AUTHORS   Tekendo-Ngongang,C., Muenke,M. and Kruszka,P.\n  TITLE     Holoprosencephaly Overview\n  JOURNAL   (in) Adam MP, Everman DB, Mirzaa GM, Pagon RA, Wallace SE, Bean\n            LJH, Gripp KW and Amemiya A (Eds.);\n            GENEREVIEWS(R);\n            (1993)\n   PUBMED   20301702\nCOMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The\n            reference sequence was derived from AL39

b'LOCUS       NG_009244             144673 bp    DNA     linear   PRI 18-FEB-2021\nDEFINITION  Homo sapiens BCR activator of RhoGEF and GTPase (BCR), RefSeqGene\n            (LRG_1112) on chromosome 22.\nACCESSION   NG_009244\nVERSION     NG_009244.2\nKEYWORDS    RefSeq; RefSeqGene.\nSOURCE      Homo sapiens (human)\n  ORGANISM  Homo sapiens\n            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;\n            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;\n            Catarrhini; Hominidae; Homo.\nCOMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff in\n            collaboration with Linda Lee. The reference sequence was derived\n            from U07000.1, KF457372.1, KF457370.1, KF457379.1, KF457382.1 and\n            AP000343.1.\n            This sequence is a reference standard in the RefSeqGene project.\n            \n            On Dec 27, 2018 this sequence version replaced NG_009244.1.\n            \n            Summary: A 

b'LOCUS       NG_009245             407474 bp    DNA     linear   PRI 02-JAN-2023\nDEFINITION  Homo sapiens bone morphogenetic protein receptor type 1B (BMPR1B),\n            RefSeqGene on chromosome 4.\nACCESSION   NG_009245\nVERSION     NG_009245.1\nKEYWORDS    RefSeq; RefSeqGene.\nSOURCE      Homo sapiens (human)\n  ORGANISM  Homo sapiens\n            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;\n            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;\n            Catarrhini; Hominidae; Homo.\nREFERENCE   1  (bases 1 to 407474)\n  AUTHORS   Austin,E.D., Phillips,J.A. III and Loyd,J.E.\n  TITLE     Heritable Pulmonary Arterial Hypertension Overview\n  JOURNAL   (in) Adam MP, Everman DB, Mirzaa GM, Pagon RA, Wallace SE, Bean\n            LJH, Gripp KW and Amemiya A (Eds.);\n            GENEREVIEWS(R);\n            (1993)\n   PUBMED   20301658\nCOMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The\n            reference seque

b'LOCUS       NG_009246              12950 bp    DNA     linear   PRI 24-AUG-2020\nDEFINITION  Homo sapiens glutathione S-transferase mu 1 (GSTM1), RefSeqGene on\n            chromosome 1.\nACCESSION   NG_009246\nVERSION     NG_009246.1\nKEYWORDS    RefSeq; RefSeqGene.\nSOURCE      Homo sapiens (human)\n  ORGANISM  Homo sapiens\n            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;\n            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;\n            Catarrhini; Hominidae; Homo.\nCOMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The\n            reference sequence was derived from AC000031.6 and AC000032.7.\n            This sequence is a reference standard in the RefSeqGene project.\n            \n            Summary: Cytosolic and membrane-bound forms of glutathione\n            S-transferase are encoded by two distinct supergene families. At\n            present, eight distinct classes of the soluble cytoplasmic\n      

Number of CDS sequences: 0


In [17]:
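import re
# r3 is the str() of a bytes object, so each line break appears as the two characters "\" and "n"
# Capture the organism name: the text after ORGANISM up to the first escaped line break
organisms = re.findall(r"ORGANISM\s+(.*?)\\n", r3)
for organism in organisms:
    print(re.sub(r'\s+', ' ', organism))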
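
Working on decoded text rather than on the `str()` of bytes makes header extraction simpler. The sketch below pulls a few header fields out of the start of the L42022 record printed earlier in this notebook. It assumes the layout visible in that output: a keyword at the start of a line and continuation lines indented by 12 spaces. The helper names `header_field` and `organism_name` are hypothetical.

```python
import re

# Header of the L42022 record, as printed earlier in this notebook
record = (
    "LOCUS       HIVI5C                   231 bp    DNA     linear   VRL 24-MAR-1997\n"
    "DEFINITION  Human immunodeficiency virus type 1 (isolate genotype C, I5) gag\n"
    "            gene, partial cds.\n"
    "ACCESSION   L42022\n"
    "VERSION     L42022.1\n"
    "KEYWORDS    gag gene; p24 protein.\n"
    "SOURCE      Human immunodeficiency virus 1 (HIV-1)\n"
    "  ORGANISM  Human immunodeficiency virus 1\n"
    "            Viruses; Riboviria; Pararnavirae; Artverviricota; Revtraviricetes;\n"
    "            Ortervirales; Retroviridae; Orthoretrovirinae; Lentivirus.\n"
)

def header_field(text, keyword):
    """Return a header value with its continuation lines joined, or None."""
    pattern = r"^ *{} +(.*(?:\n {{12}}.*)*)".format(re.escape(keyword))
    m = re.search(pattern, text, re.MULTILINE)
    return re.sub(r"\s+", " ", m.group(1)).strip() if m else None

def organism_name(text):
    """Return only the first ORGANISM line, without the lineage below it."""
    m = re.search(r"^ *ORGANISM +(.*)$", text, re.MULTILINE)
    return m.group(1).strip() if m else None

fields = {
    "accession": header_field(record, "ACCESSION"),
    "definition": header_field(record, "DEFINITION"),
    "organism": organism_name(record),
}
print(fields)
```

A dictionary like `fields` is a convenient shape for the later step of inserting the relevant parts into a database.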