# Introduction

This notebook describes the extract, transform, load (ETL) process for gene-specific data from NCBI and UniProt.

In [1]:
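import gzip
import os
import shutil

import pandas


def _uncompress(_path):
    # Decompress a .gz file into the working directory and return the new file name.
    _target = os.path.splitext(os.path.basename(_path))[0]
    # gzip yields bytes, so the output file must be opened in binary mode ('wb').
    with gzip.open(_path, 'rb') as _input, open(_target, 'wb') as _output:
        shutil.copyfileobj(_input, _output)
    return _target


def _read(_path):
    # Load a tab-separated file into a DataFrame.
    return pandas.read_csv(_path, delimiter='\t')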

# Methods

## Datasets

This section describes the datasets used in this project.

In [2]:
_datasets = {
    'Homo_sapiens.gene_info': '../datasets/NCBI/Homo_sapiens.gene_info.gz',
    'gene2go': '../datasets/NCBI/gene2go.gz' 
}

for index in _datasets:
    _datasets[index] = _uncompress(_datasets[index])
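
As an aside, the decompression step above could be skipped, since `pandas.read_csv` can read gzip-compressed files directly. The following is a minimal sketch of that alternative, not the approach this notebook takes. The helper name `read_gzipped_tsv` and the temporary sample file are hypothetical, and the sample rows only mimic the first columns of `Homo_sapiens.gene_info`.

```python
import gzip
import os
import tempfile

import pandas


def read_gzipped_tsv(path):
    # Hypothetical helper: read a gzip-compressed, tab-separated file
    # without writing an uncompressed copy to disk first.
    return pandas.read_csv(path, sep='\t', compression='gzip')


# Build a small stand-in file (assumption: not the real NCBI dataset).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.gene_info.gz')
with gzip.open(sample_path, 'wt') as handle:
    handle.write('#tax_id\tGeneID\tSymbol\n9606\t1\tA1BG\n9606\t2\tA2M\n')

sample = read_gzipped_tsv(sample_path)
print(sample)
```

Reading the archive directly avoids leaving uncompressed copies in the working directory, at the cost of decompressing again on every read.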

### NCBI

#### Homo_sapiens.gene_info

In [3]:
_data = _read(_datasets['Homo_sapiens.gene_info'])

_data.head()

Unnamed: 0,#tax_id,GeneID,Symbol,LocusTag,Synonyms,dbXrefs,chromosome,map_location,description,type_of_gene,Symbol_from_nomenclature_authority,Full_name_from_nomenclature_authority,Nomenclature_status,Other_designations,Modification_date,Feature_type
0,9606,1,A1BG,-,A1B|ABG|GAB|HYST2477,MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410...,19,19q13.43,alpha-1-B glycoprotein,protein-coding,A1BG,alpha-1-B glycoprotein,O,alpha-1B-glycoprotein|HEL-S-163pA|epididymis s...,20171221,-
1,9606,2,A2M,-,A2MD|CPAMD5|FWP007|S863-7,MIM:103950|HGNC:HGNC:7|Ensembl:ENSG00000175899...,12,12p13.31,alpha-2-macroglobulin,protein-coding,A2M,alpha-2-macroglobulin,O,alpha-2-macroglobulin|C3 and PZP-like alpha-2-...,20171223,-
2,9606,3,A2MP1,-,A2MP,HGNC:HGNC:8|Ensembl:ENSG00000256069,12,12p13.31,alpha-2-macroglobulin pseudogene 1,pseudo,A2MP1,alpha-2-macroglobulin pseudogene 1,O,pregnancy-zone protein pseudogene,20170903,-
3,9606,9,NAT1,-,AAC1|MNAT|NAT-1|NATI,MIM:108345|HGNC:HGNC:7645|Ensembl:ENSG00000171...,8,8p22,N-acetyltransferase 1,protein-coding,NAT1,N-acetyltransferase 1,O,arylamine N-acetyltransferase 1|N-acetyltransf...,20171105,-
4,9606,10,NAT2,-,AAC2|NAT-2|PNAT,MIM:612182|HGNC:HGNC:7646|Ensembl:ENSG00000156...,8,8p22,N-acetyltransferase 2,protein-coding,NAT2,N-acetyltransferase 2,O,arylamine N-acetyltransferase 2|N-acetyltransf...,20171217,-


#### gene2go

In [4]:
_data = _read(_datasets['gene2go'])

_data.head()

Unnamed: 0,#tax_id,GeneID,GO_ID,Evidence,Qualifier,GO_term,PubMed,Category
0,3702,814629,GO:0005634,ISM,-,nucleus,-,Component
1,3702,814629,GO:0008150,ND,-,biological_process,-,Process
2,3702,814630,GO:0003677,IEA,-,DNA binding,-,Function
3,3702,814630,GO:0003700,ISS,-,DNA binding transcription factor activity,11118137,Function
4,3702,814630,GO:0005634,IEA,-,nucleus,-,Component


The target organism of this project is Homo sapiens (human), which has the taxon identifier 9606.

In [5]:
taxon = 9606
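
The `gene2go` preview above starts with records for taxon 3702, so the file evidently covers more than one organism. One plausible use of `taxon` is to keep only the rows whose `#tax_id` matches it. The sketch below illustrates that idea under stated assumptions: `filter_by_taxon` is a hypothetical helper, and the small frame is a stand-in that reuses identifiers from the previews above rather than the full dataset.

```python
import pandas

taxon = 9606


def filter_by_taxon(frame, taxon_id, column='#tax_id'):
    # Hypothetical helper: keep only rows whose taxon column equals taxon_id.
    return frame[frame[column] == taxon_id].reset_index(drop=True)


# Stand-in frame (assumption): a few tax_id / GeneID pairs taken from the previews.
example = pandas.DataFrame({
    '#tax_id': [3702, 3702, 9606, 9606],
    'GeneID': [814629, 814630, 1, 2],
})

human = filter_by_taxon(example, taxon)
print(human)
```

Applied to the real data, the same call would be `filter_by_taxon(_data, taxon)`, assuming `_data` holds a frame with a `#tax_id` column.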

### Database

In [6]:
# connection