# Imports

## System Imports

The first import line (```os```, ```os.path```, ```sys```) is very common in Python; these modules provide direct access to directories and the system.

## Pandas and Numpy

Pandas and NumPy are commonly used in data analysis.
Pandas provides DataFrames, and NumPy provides numerical methods.

## Bioservices
Bioservices offers access to a wide variety of bioinformatics resources. Details can be found here:

https://bioservices.readthedocs.io/en/master/index.html


In [None]:
import os, os.path, sys
import pandas as pd
import numpy as np

from bioservices import *


First, create a BioMart object using the host 'www.ensembl.org'

In [None]:
bm = BioMart(host='www.ensembl.org')

To see the available databases on the site, use the 'databases' attribute of the BioMart object.

```python
bm.databases
```
Output:

```python
['ensembl_mart_98',
 'genomic_features_mart_98',
 'mouse_mart_98',
 'ontology_mart_98',
 'regulation_mart_98',
 'sequence_mart_98',
 'snp_mart_98']
```
 

In [None]:
bm.databases

You can also use the search function ```lookfor```, which takes one string parameter. If given the empty string ```''```, it lists all databases and the information associated with them.

In [None]:
bm.lookfor('')

Now look for the mart with the mart name ```'ENSEMBL_MART_ENSEMBL'```

In [None]:

mart_name = 'ENSEMBL_MART_ENSEMBL'
bm.lookfor(mart_name)

You can also list the available datasets using the ```datasets``` method of BioMart. You have to pass the mart name as a parameter.

```python
bm.datasets(mart_name)
```


To store the available datasets in a variable, use the following command:

```python
datasets = bm.datasets(mart_name)
```
To filter the datasets by a substring, you can use an approach similar to the following example:

```python
some_genes = ['tec','gata1','mt-nd2']
[x for x in some_genes if x.find('ata')>= 0]
```
Output:
```python
['gata1']
```

Please find the dataset containing the substring ```'sapiens'``` and store its name in the variable ```dataset```, which the following cells use.
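
The same list-comprehension pattern applies to this exercise. The sketch below uses a short stand-in list instead of the live result of ```bm.datasets(mart_name)``` so that it runs offline; the list contents are an assumption for illustration only.

```python
# Stand-in for the list returned by bm.datasets(mart_name); the real list
# is much longer. These names are illustrative assumptions.
datasets = ['drerio_gene_ensembl', 'hsapiens_gene_ensembl', 'mmusculus_gene_ensembl']

# The `in` operator is an idiomatic alternative to x.find(...) >= 0
matches = [name for name in datasets if 'sapiens' in name]

# Keep the first match for the later cells, or None if nothing matched
dataset = matches[0] if matches else None
print(dataset)
```

Checking ```matches``` before indexing avoids an ```IndexError``` when the substring is not found.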


To list attributes, use the ```attributes``` method of BioMart. You have to pass the dataset as a parameter.

```python
bm.attributes(dataset)
```


The return value of the method is a dictionary. To access a single entry, for example ```'ensembl_gene_id'```, use the following code:

```python
bm.attributes(dataset)['ensembl_gene_id']
```

Filters can be accessed in the same way.

```python
bm.filters(dataset)
```


There are quite a number of filters. To see only the keys of the dictionary, append ```.keys()```:

```python
bm.filters(dataset).keys()
```

In [None]:
bm.filters(dataset)['chromosome_name']
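
Because the filters come back as a dictionary, the substring search used for the datasets also works on its keys. A minimal sketch, using a small hypothetical dictionary in place of ```bm.filters(dataset)```; its keys and values are placeholders, not the real BioMart response:

```python
# Hypothetical stand-in for bm.filters(dataset); values are placeholders
filters = {
    'chromosome_name': 'placeholder',
    'start': 'placeholder',
    'end': 'placeholder',
    'chromosomal_region': 'placeholder',
}

# Iterating over a dictionary yields its keys
chromosome_filters = sorted(key for key in filters if 'chromosom' in key)
print(chromosome_filters)
```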

To get full access to BioMart, you should build a query. To do that, you will need the ```dataset```, some ```attributes```, and potentially a ```filter```.

Start with building the query:

In [None]:
# build a query

bm.add_dataset_to_xml(dataset)

Add attributes:

In [None]:
bm.add_attribute_to_xml('ensembl_gene_id')
bm.add_attribute_to_xml('ensembl_transcript_id')
bm.add_attribute_to_xml('hgnc_symbol')
bm.add_attribute_to_xml('chromosome_name')
bm.add_attribute_to_xml('start_position')
bm.add_attribute_to_xml('end_position')



Instead of copying and pasting each line, you can use either a for loop or a list comprehension:

```python
# version using a for loop
attributes = ['ensembl_gene_id','ensembl_transcript_id','hgnc_symbol',
              'chromosome_name','start_position','end_position']
for attribute in attributes:
    bm.add_attribute_to_xml(attribute)
    
```

or

```python
# version using list comprehension
attributes = ['ensembl_gene_id','ensembl_transcript_id','hgnc_symbol',
              'chromosome_name','start_position','end_position']
[bm.add_attribute_to_xml(attribute) for attribute in attributes]

```

Do not execute this now, as it would add the attributes a second time to the already existing query!

Add a filter:

In [None]:
# add filter
bm.add_filter_to_xml('chromosome_name','1')

Now, the query has to be translated into XML

In [None]:
query_xml = bm.get_xml()
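
For orientation, a BioMart query is an XML document that names a dataset and lists its filters and attributes. The sketch below builds a comparable document with Python's standard ```xml.etree.ElementTree``` module. It illustrates the general shape only and is not the exact string that ```bm.get_xml()``` returns; the dataset name ```hsapiens_gene_ensembl``` and the settings on the ```Query``` element are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed dataset name, for illustration only
dataset_name = 'hsapiens_gene_ensembl'
attribute_names = ['ensembl_gene_id', 'hgnc_symbol', 'chromosome_name']

# Root element; the formatter and header values are assumed settings
query = ET.Element('Query', virtualSchemaName='default', formatter='TSV', header='0')
ds = ET.SubElement(query, 'Dataset', name=dataset_name, interface='default')

# One Filter element per restriction, one Attribute element per column
ET.SubElement(ds, 'Filter', name='chromosome_name', value='1')
for name in attribute_names:
    ET.SubElement(ds, 'Attribute', name=name)

sketch_xml = ET.tostring(query, encoding='unicode')
print(sketch_xml)
```

Printing ```query_xml``` from the cell above is a quick way to compare the real query with this sketch.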

... and executed. The return value is stored in the variable ```res```

In [None]:
res = bm.query(query_xml)

In [None]:
res

The result is plain text. Displaying the variable directly in Jupyter shows the string's representation, in which newline symbols (```\n```) appear as escape sequences rather than line breaks. To print it properly, use the ```print``` function.

```python
print(res)
```
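
The difference between the two displays can be reproduced with any string that contains tabs and newlines. A small sketch with a made-up string (the values are placeholders, not BioMart output):

```python
# Made-up two-line, tab-separated text
text = 'id_1\tA\nid_2\tB\n'

# What Jupyter shows for a bare variable: the repr, with escape sequences
print(repr(text))

# What print shows: real tabs and line breaks
print(text)
```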

To better access each line, split the resulting text into individual lines:
```python
res.split('\n')
```


To construct a proper list of lists, you should now split each line into individual elements. The elements in each line are separated by tab symbols (```\t```).

Use the following command to generate a proper list of lists and save it to a variable:

```python
res_data = [x.split('\t') for x in res.split('\n')]
```


To better access the data, we now create a pandas DataFrame. We have to pass the actual data as well as the attribute names, which become the column names:

```python
# Create a pandas dataframe
res_data = [x.split('\t') for x in res.split('\n')]
df = pd.DataFrame(res_data,columns=attributes)
```
You might need the ```attributes``` list as defined some blocks above.
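
If the text ends with a newline, splitting on ```'\n'``` produces an empty last element, which can end up as a mostly empty row in the DataFrame. One way to avoid this, sketched here on made-up placeholder text rather than real query output, is to skip empty lines while splitting:

```python
import pandas as pd

# Made-up placeholder text in the same tab-separated layout; not real data
sample = 'GENE_A\tTX_A\tSYM_A\t1\t100\t200\nGENE_B\tTX_B\tSYM_B\t1\t300\t450\n'
columns = ['ensembl_gene_id', 'ensembl_transcript_id', 'hgnc_symbol',
           'chromosome_name', 'start_position', 'end_position']

# splitlines() handles the line endings; the condition drops empty lines
rows = [line.split('\t') for line in sample.splitlines() if line]
sample_df = pd.DataFrame(rows, columns=columns)
print(sample_df.shape)
```

The names ```sample```, ```columns```, ```rows``` and ```sample_df``` are chosen for this sketch so that they do not overwrite ```res```, ```attributes``` or ```df```.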

The DataFrame object has a number of methods to display the content. ```head()``` displays the top of the frame (by default the first 5 rows). ```head(10)``` displays the first 10.

```python
df.head()
```

In [None]:
df.head()

In [None]:
df.head(10)

In [None]:
df
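
Since every value was produced by splitting text, all columns hold strings, including the positions. A minimal sketch of converting them to numbers with ```pd.to_numeric```, shown on a small made-up frame; the derived ```length``` column is a hypothetical addition for illustration:

```python
import pandas as pd

# Made-up placeholder rows with string values, as produced by str.split
demo = pd.DataFrame({'start_position': ['100', '300'],
                     'end_position': ['200', '450']})

# Convert the position columns from strings to integers
for col in ['start_position', 'end_position']:
    demo[col] = pd.to_numeric(demo[col])

# Arithmetic now works column-wise
demo['length'] = demo['end_position'] - demo['start_position'] + 1
print(demo)
```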