In [1]:
from utils import uniprotRetrieve
from utils import mapping

# 1. Download data

To download the data, the UniProt [REST API](https://www.uniprot.org/help/api%5Fqueries) was used.

In a first step, two search queries were used to distinguish between secreted proteins and proteins in the cytoplasm.

In a second step, sequence redundancy (and therefore bias towards some regions of sequence space) was reduced by mapping the proteins to their UniRef50 clusters.
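The second step amounts to keeping one representative per UniRef50 cluster. A minimal sketch of that idea, assuming the mapping step yields a dictionary from UniProt accession to UniRef50 cluster ID. The function name and the toy identifiers are hypothetical, and the notebook's own `mapping` helper may work differently:

```python
def one_representative_per_cluster(accession_to_cluster):
    """Keep a single accession for every UniRef50 cluster.

    accession_to_cluster: dict mapping a UniProt accession to the ID of
    the UniRef50 cluster it belongs to (assumed input shape).
    Returns a sorted list with one accession per cluster.
    """
    representatives = {}
    # Iterate in sorted order so the chosen representative is deterministic:
    # the alphabetically first accession of each cluster is kept.
    for accession in sorted(accession_to_cluster):
        cluster_id = accession_to_cluster[accession]
        representatives.setdefault(cluster_id, accession)
    return sorted(representatives.values())


# Toy input with made-up identifiers: two proteins share one cluster.
toy_mapping = {
    "ACC_2": "UniRef50_X",
    "ACC_1": "UniRef50_X",
    "ACC_3": "UniRef50_Y",
}
print(one_representative_per_cluster(toy_mapping))
```

Choosing the alphabetically first accession is arbitrary; any deterministic rule (for example, preferring reviewed entries) would serve the same purpose.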

## 1.1. Construct search queries

### 1.1.1. Cytoplasmic proteins

* We limit the dataset to bacteria (the query below restricts the taxonomy to Gammaproteobacteria).
* We want to make sure there is some evidence of the protein being in the cytoplasm.
    - This evidence can be generated by automated pipelines or assigned manually.
* As an extra safeguard, we search for proteins that have no signal peptide annotation.

In [2]:
QUERY="taxonomy:Gammaproteobacteria (locations:(location:cytoplasm) OR locations:(location:cytosol)) NOT annotation:(type:signal)"
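The query string combines the three criteria listed above: a taxonomy restriction, one or more subcellular locations, and the presence or absence of a signal peptide annotation. As a sketch, the same string can be assembled from its parts. `build_query` is a hypothetical helper, not part of `utils`, and it only reproduces the query syntax already used in this notebook:

```python
def build_query(taxon, locations, has_signal):
    """Assemble a search query from taxonomy, locations and signal flag.

    taxon: taxonomy name, e.g. "Gammaproteobacteria"
    locations: list of subcellular location terms (joined with OR)
    has_signal: True to require a signal peptide annotation,
                False to exclude entries that have one
    """
    location_terms = [f"locations:(location:{loc})" for loc in locations]
    if len(location_terms) == 1:
        location_clause = location_terms[0]
    else:
        # Several alternative locations are grouped in parentheses.
        location_clause = "(" + " OR ".join(location_terms) + ")"

    signal_clause = "annotation:(type:signal)"
    if not has_signal:
        signal_clause = "NOT " + signal_clause

    # Space-separated terms are combined as in the notebook's queries.
    return f"taxonomy:{taxon} {location_clause} {signal_clause}"


print(build_query("Gammaproteobacteria", ["cytoplasm", "cytosol"], has_signal=False))
```

Building the string from parts makes it harder for the cytoplasmic and periplasmic queries to drift apart, for example in the taxonomy restriction.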

Since we still have to map those proteins to their UniRef50 cluster (sequences with at least 50 percent identity to the cluster's seed sequence),
we only need the identifiers and will not waste bandwidth downloading FASTA sequences yet.

In [3]:
FORMAT="fasta"

In [4]:
FILENAME="cytoplasmSearch.fasta"

In [5]:
# Download the identifiers 
uniprotRetrieve(FILENAME, format=FORMAT, query=QUERY)

'cytoplasmSearch.fasta'
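The implementation of `uniprotRetrieve` lives in `utils` and is not shown here. As a rough illustration of what such a helper has to do before downloading, the sketch below URL-encodes a query and a format into a request URL. The base URL is an assumption (the legacy UniProt search endpoint) and is kept as a parameter because it may differ from what `utils` uses or from what UniProt currently serves; `build_search_url` is a hypothetical name:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: legacy UniProt search endpoint; adjust to the endpoint in use.
ASSUMED_BASE_URL = "https://www.uniprot.org/uniprot/"


def build_search_url(query, fmt, base_url=ASSUMED_BASE_URL):
    """Return a request URL with the query and format percent-encoded."""
    return base_url + "?" + urlencode({"query": query, "format": fmt})


url = build_search_url(
    "taxonomy:Gammaproteobacteria annotation:(type:signal)", "fasta"
)

# Decoding the URL again recovers the original parameters unchanged,
# which confirms that spaces, colons and parentheses were escaped safely.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
print(decoded["format"][0])
```

No network request is made in this sketch; the actual download would pass the resulting URL to an HTTP client and write the response to `FILENAME`.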

### 1.1.2. Periplasmic proteins

* We limit the dataset to bacteria (the query below restricts the taxonomy to Gammaproteobacteria). This is, however, a smaller group than for the cytoplasmic proteins, since only Gram-negative bacteria have a periplasmic space.
* We want to make sure there is some evidence of the protein being in the periplasm.
    - This evidence can be generated by automated pipelines or assigned manually.
* As an extra safeguard, we search for proteins that have a signal peptide annotation.

In [6]:
QUERY="taxonomy:Gammaproteobacteria locations:(location:periplasm) annotation:(type:signal)"
FORMAT="fasta"
FILENAME="periplasmSearch.fasta"
uniprotRetrieve(FILENAME, format=FORMAT, query=QUERY)

'periplasmSearch.fasta'
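The mapping step that follows works on identifiers, so the accessions have to be pulled out of the downloaded files. A minimal sketch, assuming the files use the standard UniProt FASTA header layout `>db|ACCESSION|ENTRY_NAME description`; the function name and the two sample records are made up for illustration:

```python
def accessions_from_fasta(lines):
    """Collect the accession from every UniProt-style FASTA header.

    lines: any iterable of text lines, e.g. an open file handle.
    Headers that do not follow the db|ACCESSION|NAME layout are skipped.
    """
    accessions = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split("|")
        if len(fields) >= 3:
            accessions.append(fields[1])
    return accessions


# Made-up records, used only to show the header layout.
sample = """>sp|P00001|EXA1_ECOLI Example protein one OS=Escherichia coli
MKKTAIAIAV
>tr|Q00002|EXA2_ECOLI Example protein two OS=Escherichia coli
MSEQNNTEMT
""".splitlines()

print(accessions_from_fasta(sample))
```

With a real download, the same function could be called as `accessions_from_fasta(open("periplasmSearch.fasta"))`; comparing the number of accessions between the two files is also a quick sanity check on the searches.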

## 1.2. Use the [UniProt Mapper API](https://www.uniprot.org/help/api_idmapping) to map proteins to their UniRef50 clusters