In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import matplotlib.pyplot as plt

---------------------------

## Config

In [3]:
import sys

In [4]:
project_dir = '/home/pmonteagudo/workspace/silencing_project'
if project_dir not in sys.path: 
    sys.path.append(project_dir)
from config_analysis import *

In [5]:
pd.options.mode.chained_assignment = None

In [6]:
import Util

- Result **directories**

In [7]:
#annot_dir = '/data/parastou/RNAdeg/annotation/'
#annot_dir = os.path.join(project_data_dir, 'annotation/gff')
annot_dir = os.path.join(project_data_dir, 'annotation/gff_v2')

---------------------------

## Annotation Config

- reference `GTF/GFF` **file**

In [8]:
## - Parastous:
#in_gff = os.path.join(annot_dir, 'schizosaccharomyces_pombe.chr.extended.gff3')

## - PomBase:
in_gff = os.path.join(annot_dir, 'Schizosaccharomyces_pombe_all_chromosomes.extended.gff3')
#in_gff = os.path.join(annot_dir, 'Schizosaccharomyces_pombe_all_chromosomes.extended_with_mat_locus.gff3') ## include MAT1, MAT2, MAT3

## - Ensembl:
#in_gff = os.path.join(annot_dir, 'Schizosaccharomyces_pombe.ASM294v2.45.gff3')

- Heterochromatic Features (**repeats**) `GTF/GFF` **file**

In [9]:
in_heterochromatic_features = os.path.join(annot_dir, 'repeats_subtel.gff3')

---------------------------

# **Import `GFF`** - with all annotated `Genomic Features` (`41755/41764`)

- Load `GFF` **File** as **DataFrame**: 

In [10]:
in_gff

'/gcm-lfs1/pablo/data/rna_silencing/annotation/gff_v2/Schizosaccharomyces_pombe_all_chromosomes.extended.gff3'

In [11]:
gff_header = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']

In [12]:
gff = pd.read_csv(in_gff, sep='\t', comment='#', names=gff_header)

In [13]:
## Parse `attributes` column
attributes_df = gff['attributes'].apply(lambda x: Util.parse_attribute_col(x))
#attributes_df = attributes_df.apply(pd.Series) # slower

attributes_df = pd.DataFrame(attributes_df.values.tolist(), index=gff.index)
gff = pd.concat([gff.drop('attributes', axis=1), attributes_df], axis=1)

In [14]:
gff.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,Parent
0,I,PomBase,gene,1798347,1799015,.,+,.,SPAC1002.01,mrx11,
1,I,PomBase,mRNA,1798347,1799015,.,+,.,SPAC1002.01.1,,SPAC1002.01
2,I,PomBase,CDS,1798347,1798830,.,+,.,SPAC1002.01.1:exon:1,,SPAC1002.01.1
3,I,PomBase,intron,1798831,1798959,.,+,.,SPAC1002.01.1:intron:1,,SPAC1002.01.1
4,I,PomBase,CDS,1798960,1799015,.,+,.,SPAC1002.01.1:exon:2,,SPAC1002.01.1


In [15]:
gff.shape

(41755, 11)

## **Attributes** of features: 9th Column (in gff)

The **attribute columns** are all the entries present in the 9th column of the gff (`attributes`), parsed and expanded into columns.

In [16]:
attribute_columns = gff.columns.difference(gff_header)
attribute_columns

Index(['ID', 'Name', 'Parent'], dtype='object')

In this case we only have:
- **`ID`**: unique identifier; needs to be present in every entry/feature.
- **`Name`**: not present in every entry/feature; an alternative identifier for the most common features.
- **`Parent`**: not present in every entry/feature; represents relationships between features/entries by referencing the `ID` of another feature/entry.
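
The parser used above, `Util.parse_attribute_col`, is not shown in this notebook. As a minimal sketch of what such a parser could look like, assuming GFF3-style `key=value` pairs separated by `;`, the hypothetical helper `parse_gff3_attributes` below turns one attributes string into a dictionary:

```python
from urllib.parse import unquote


def parse_gff3_attributes(attributes):
    """Parse a GFF3 attributes string such as 'ID=x;Parent=y' into a dict.

    Hypothetical stand-in for Util.parse_attribute_col (not the original code).
    """
    parsed = {}
    for field in attributes.strip().split(';'):
        if not field:
            continue  # tolerate trailing or doubled semicolons
        key, _, value = field.partition('=')
        # GFF3 escapes reserved characters as %XX, so decode the value
        parsed[key.strip()] = unquote(value.strip())
    return parsed


example = parse_gff3_attributes("ID=SPAC1002.01.1;Parent=SPAC1002.01")
print(example)
```

Applying a function like this to every row and expanding the resulting dictionaries yields one column per key, which is how `ID`, `Name` and `Parent` end up as columns here.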

---------------------------

# Explore GFF: **Type** of features

The `type` column contains the classification of each **feature/entry** into a set of **discrete classes**.

There are many classes (see the counts below). In this script we organize them hierarchically, according to our particular interest in them, into **3 categories**:

1. **Gene Features**:
    * genes
    * transcripts
    * sub-transcript features
2. **Repeats**:
    * centromeric repeats
    * mating type region
3. **Others**

In [17]:
gff_by_type = gff[['type', 'ID']].groupby(['type']).count()
gff_by_type.sort_values('ID', ascending=False)

Unnamed: 0_level_0,ID
type,Unnamed: 1_level_1
CDS,12162
gene,7005
intron,5372
mRNA,5140
five_prime_UTR,4863
three_prime_UTR,4791
ncRNA,1528
long_terminal_repeat,239
tRNA,196
promoter,62


### How are PomBase systematic IDs determined?
* **Systematic IDs** follow patterns based on the `feature type`, and in some cases the `chromosome`, as shown in the table below.

* **Open reading frame (ORF) IDs** also indicate which `cosmid` or `plasmid` they were found on in genome sequencing. In most cases, ORF IDs that end with a digit indicate that the ORF is on the **forward (Watson) strand**, and an ORF with an ID that ends with ‘c’ is on the **reverse (Crick) strand**. There are a few exceptions, however, because some cosmids were moved and their orientation reversed late in the sequence assembly procedure.

IDs with `.1` appended are **transcript IDs**; the dot-and-digit IDs follow Ensembl’s standard.

<font color='red'> At present, **PomBase** has only one transcript annotated for any given feature, but in the future when alternative transcripts are annotated the digit will be incremented (.2, .3, etc.). </font>
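
The ID conventions above can be sketched in code. The two helpers below are hypothetical (they are not part of `Util`), and the strand rule is only a heuristic: in this `GFF`, for instance, `SPAC212.04c` ends in `c` but is annotated on the `+` strand.

```python
def split_transcript_id(feature_id):
    """Split 'SPAC212.04c.1' into ('SPAC212.04c', 1).

    IDs without a numeric transcript suffix are returned as (feature_id, None).
    """
    parts = feature_id.split('.')
    if len(parts) == 3 and parts[2].isdigit():
        return '.'.join(parts[:2]), int(parts[2])
    return feature_id, None


def orf_strand_hint(gene_id):
    """Guess the strand of an ORF from its systematic ID (heuristic only)."""
    if gene_id.endswith('c'):
        return '-'   # usually reverse (Crick) strand
    if gene_id[-1].isdigit():
        return '+'   # usually forward (Watson) strand
    return None      # no convention applies to other suffixes


print(split_transcript_id('SPAC1002.03c.1'))
print(orf_strand_hint('SPAC1002.03c'))
print(orf_strand_hint('SPAC1002.01'))
```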

**Systematic ID patterns**

| ID pattern | Description |
| --- | --- |
| SPAC* | features, usually ORFs, on chromosome 1, sequenced on cosmids |
| SPBC* | features, usually ORFs, on chromosome 2, sequenced on cosmids |
| SPCC* | features, usually ORFs, on chromosome 3, sequenced on cosmids |
| SPAP* | features, usually ORFs, on chromosome 1, sequenced on plasmids |
| SPBP* | features, usually ORFs, on chromosome 2, sequenced on plasmids |
| SPCP* | features, usually ORFs, on chromosome 3, sequenced on plasmids |
| SPATRNA* | tRNA genes on chromosome 1 |
| SPBTRNA* | tRNA genes on chromosome 2 |
| SPCTRNA* | tRNA genes on chromosome 3 |
| SPLTRA* | LTRs on chromosome 1 |
| SPLTRB* | LTRs on chromosome 2 |
| SPLTRC* | LTRs on chromosome 3 |
| SPNCRNA* | non-coding RNA genes (no chromosome info in ID) |
| SPRPTA.* | repeats (other than LTRs or centromeric repeats) on chromosome 1 |
| SPRPTB.* | repeats (other than LTRs or centromeric repeats) on chromosome 2 |
| SPRPTC.* | repeats (other than LTRs or centromeric repeats) on chromosome 3 |
| SPRPTCENA* | centromeric repeats on chromosome 1 |
| SPRPTCENB* | centromeric repeats on chromosome 2 |
| SPRPTCENC* | centromeric repeats on chromosome 3 |
| SPRRNA* | rRNA genes (no chromosome info in ID) |
| SPSNORNA* | snoRNA genes (no chromosome info in ID) |
| SPSNRNA* | snRNA genes (no chromosome info in ID) |
| SPTF* | transposons (no chromosome info in ID) |
| SPMTR* | features on the separately sequenced mating type region contig |
| SPMIT* | features on the mitochondrial chromosome |
| SPMITTRNA* | subset of SPMIT*; tRNA genes on mitochondrial chromosome |
| SPNUMT* | NUMTs (nuclear mitochondrial pseudogenes) (no chromosome info in ID) |


---------------------------

# **Gene features** `(7005/7008)`

---------------------------

A **gene** refers to the *genomic coordinates*, *genomic locus*, corresponding to the sequence of nucleotides in DNA that encodes the synthesis of a **gene product**, either RNA or protein.

<font color='red'> A **gene feature** is the **top feature** in a hierarchy of features associated with the **transcription of DNA**. </font>

<font color='red'> The **other features** included in this category are: </font>


1. **Transcripts**: (in spombe there is a *one-to-one map* between gene features and transcript features)
    * mRNA
    * ncRNA
    * pseudogenes
2. **Sub-transcript features**:
    * CDS
    * UTR
    * Introns

At the end of this section we add `length` and `category` info for each **gene feature**.

- Explore **gene features**

In [18]:
gene_features = ["gene"]
explained_features = list(gene_features)  # copy, so later extend() calls do not modify gene_features

In [19]:
gff_by_type.loc[gene_features]

Unnamed: 0_level_0,ID
type,Unnamed: 1_level_1
gene,7005


- Get `gene_df` by filtering the **gff** for `gene` **features**

In [20]:
gene_df = gff[gff['type'].isin(gene_features)]
gene_df.shape

(7005, 11)

- Filter `columns` that contain **all NaN's**: in this case the `Parent` **column** (since *gene feature* is at the top of the hierarchy)

In [21]:
cols = gene_df.columns[~gene_df.isna().all()]
gene_df = gene_df[cols]
gene_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name
0,I,PomBase,gene,1798347,1799015,.,+,.,SPAC1002.01,mrx11
5,I,PomBase,gene,1799061,1800053,.,+,.,SPAC1002.02,pom34
10,I,PomBase,gene,1799915,1803141,.,-,.,SPAC1002.03c,gls2
15,I,PomBase,gene,1803624,1804491,.,-,.,SPAC1002.04c,taf11
20,I,PomBase,gene,1804548,1806797,.,-,.,SPAC1002.05c,jmj2


- List containing all `gene_ids`:

In [22]:
gene_ids = gene_df['ID'].values
len(gene_ids)

7005

---------------------------

## **I. Transcript Features:** *protein coding genes*, *non-coding genes* and *pseudogenes*

Both **DNA** and **RNA** are nucleic acids, which use base pairs of nucleotides as a complementary language. 

During **transcription**, a DNA sequence is read by an *RNA polymerase*, which produces a complementary, antiparallel RNA strand called a **primary transcript**.

The *primary transcript* needs to be further processed to yield various mature RNA products such as **mRNAs**, **tRNAs**, **rRNAs**, etc. 

The **stretch of DNA** transcribed into an RNA molecule is called a **transcription unit** and encodes at least one gene.

* **Protein coding genes** are genes that **encode proteins**; their transcription produces:
    * `messenger RNA`: **mRNA**
* **Non-coding genes** are *transcribed* genes that **encode non-coding RNA**, such as:
    * `ribosomal RNA`: **rRNA**
    * `transfer RNA`: **tRNA**
    * `microRNA`: **miRNA**
    * (...)
* **Pseudogenes** are **nonfunctional segments of DNA** that resemble functional genes.

<font color='red'> Each **gene feature**, in *spombe*, is associated uniquely (there is a *one to one map*) to a **transcript feature**. </font>

<font color='red'> In the `spombe GFF` the **transcript features** belong to one of the following types: </font>
* **Protein coding Genes**: mRNA Transcripts (5140/5143)
    * `mRNA`
* **Non-coding RNA genes**: ncRNA Transcripts (1836/1836)
    * `ncRNA`
    * `tRNA`
    * `snoRNA`
    * `rRNA`
    * `snRNA`
* **Pseudogenes**: (29/29)
    * `pseudogenic_transcript`
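
The one-to-one relationship can be checked rather than assumed. A minimal sketch on a toy table (rows modelled on the `gff.head()` output above; the variable names are illustrative): merge on `Parent`, then confirm that every gene appears exactly once among the transcripts.

```python
import pandas as pd

genes = pd.DataFrame({
    'ID': ['SPAC1002.01', 'SPAC1002.02'],
    'Name': ['mrx11', 'pom34'],
})
features = pd.DataFrame({
    'ID': ['SPAC1002.01.1', 'SPAC1002.01.1:exon:1',
           'SPAC1002.01.1:exon:2', 'SPAC1002.02.1'],
    'type': ['mRNA', 'CDS', 'CDS', 'mRNA'],
    'Parent': ['SPAC1002.01', 'SPAC1002.01.1',
               'SPAC1002.01.1', 'SPAC1002.02'],
})

# Keep only the features whose Parent is a gene, i.e. the transcripts.
transcripts = features.merge(
    genes.rename(columns={'ID': 'gene_id', 'Name': 'gene_name'}),
    left_on='Parent', right_on='gene_id', how='inner')

# One-to-one: no gene has two transcripts, and no gene is left without one.
one_to_one = (transcripts['gene_id'].is_unique
              and set(transcripts['gene_id']) == set(genes['ID']))
print(one_to_one)
```

If a future annotation adds alternative transcripts (`.2`, `.3`, ...), a check like this would start to fail, signalling that downstream per-gene summaries need revisiting.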

- Get `transcript_df` by filtering the **gff** using `gene_id` **features** as `Parent`

In [23]:
transcript_df = gff.merge(gene_df[["ID", "Name"]].rename(columns={'ID':'gene_id', 'Name':'gene_name'}),
                          left_on='Parent',
                          right_on = 'gene_id',
                          how='inner')
transcript_df["transcript_id"] = transcript_df["ID"]
transcript_df.shape

(7005, 14)

- Filter `columns` that contain **all NaN's**: in this case the `Name` **column**

In [24]:
cols = transcript_df.columns[~transcript_df.isna().all()]
transcript_df = transcript_df[cols]
transcript_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Parent,gene_id,gene_name,transcript_id
0,I,PomBase,mRNA,1798347,1799015,.,+,.,SPAC1002.01.1,SPAC1002.01,SPAC1002.01,mrx11,SPAC1002.01.1
1,I,PomBase,mRNA,1799061,1800053,.,+,.,SPAC1002.02.1,SPAC1002.02,SPAC1002.02,pom34,SPAC1002.02.1
2,I,PomBase,mRNA,1799915,1803141,.,-,.,SPAC1002.03c.1,SPAC1002.03c,SPAC1002.03c,gls2,SPAC1002.03c.1
3,I,PomBase,mRNA,1803624,1804491,.,-,.,SPAC1002.04c.1,SPAC1002.04c,SPAC1002.04c,taf11,SPAC1002.04c.1
4,I,PomBase,mRNA,1804548,1806797,.,-,.,SPAC1002.05c.1,SPAC1002.05c,SPAC1002.05c,jmj2,SPAC1002.05c.1


- Define **transcript features** from `transcript_df`

In [25]:
transcripts_df_by_type = transcript_df[['type', 'transcript_id']].groupby(['type']).count()
transcripts_df_by_type = transcripts_df_by_type.sort_values('transcript_id', ascending=False)
transcripts_df_by_type

Unnamed: 0_level_0,transcript_id
type,Unnamed: 1_level_1
mRNA,5140
ncRNA,1528
tRNA,196
snoRNA,57
rRNA,49
pseudogenic_transcript,29
snRNA,6


In [26]:
transcript_features = transcripts_df_by_type.index.tolist()
transcript_features

['mRNA', 'ncRNA', 'tRNA', 'snoRNA', 'rRNA', 'pseudogenic_transcript', 'snRNA']

In [27]:
transcript_features = ['mRNA', 'ncRNA', 'tRNA', 'snoRNA', 'rRNA', 'pseudogenic_transcript', 'snRNA']

### **I.A. <font color=blue> Protein coding Genes</font>**: `mRNA Transcripts (5140/5143)`

**Messenger RNA (mRNA)** is a single-stranded RNA molecule that has already undergone RNA splicing, by which **intron** regions have been removed, leaving only **exons**. This exon sequence constitutes the mature mRNA that will be read by the ribosome and translated into protein.

- **ID code**: 
 - `SP` (spombe) 
 - {**`A`**: `chr I`, **`B`**: `chr II`, **`C`**: `chr III`, **`MIT`**:  `mitochondrial`, **`BC`**: `chr_II_telomeric_gap`, **`MTR`**: `mating_type_region`}
 - {**`C`**: `sequenced on cosmids`, **`P`**: `sequenced on plasmids`} 
 - `n_id` (number of group)

In [28]:
mrna_features = ["mRNA"]

In [29]:
gff_by_type.loc[mrna_features]

Unnamed: 0_level_0,ID
type,Unnamed: 1_level_1
mRNA,5140


- **mRNAs**:

In [30]:
mrna_df = transcript_df[transcript_df['type'].isin(mrna_features)].sort_values(['seqid', 'start', 'end'])
mrna_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Parent,gene_id,gene_name,transcript_id
848,I,PomBase,mRNA,1,5662,.,-,.,SPAC212.11.1,SPAC212.11,SPAC212.11,tlh1,SPAC212.11.1
6999,I,PomBase,mRNA,1,5662,.,+,.,SPAC212.11b.1,SPAC212.11b,SPAC212.11b,tlh1_plus,SPAC212.11b.1
845,I,PomBase,mRNA,11784,12994,.,+,.,SPAC212.08c.1,SPAC212.08c,SPAC212.08c,,SPAC212.08c.1
849,I,PomBase,mRNA,15855,16226,.,+,.,SPAC212.12.1,SPAC212.12,SPAC212.12,,SPAC212.12.1
843,I,PomBase,mRNA,18042,18974,.,+,.,SPAC212.06c.1,SPAC212.06c,SPAC212.06c,,SPAC212.06c.1


In [31]:
mrna_df.shape

(5140, 13)

- **mRNA** (Example): Transcript (has a `Parent: gene`)

In [32]:
mrna_id = "SPAC212.04c.1"
mrna_df[mrna_df['ID'] == mrna_id]

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Parent,gene_id,gene_name,transcript_id
841,I,PomBase,mRNA,21381,23050,.,+,.,SPAC212.04c.1,SPAC212.04c,SPAC212.04c,,SPAC212.04c.1


- corresponding **gene** (Example): top feature (has **NO** `Parent`)

In [33]:
gene_id = "SPAC212.04c"
gff[gff['ID'] == gene_id]

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,Parent
5860,I,PomBase,gene,21381,23050,.,+,.,SPAC212.04c,,


In [34]:
assert(len(mrna_df) == len(gff[gff['type'].isin(mrna_features)]))

### **I.B. <font color=blue>Non-coding RNA genes</font>**: `ncRNA Transcripts (1836/1836)`

A **non-coding RNA** (`ncRNA`) is an RNA molecule that **is not `translated` into a protein**. 

The **DNA sequence** from which a functional non-coding RNA is `transcribed` is often called **an RNA gene**. 

Abundant and functionally important types of non-coding RNAs include:

* **transfer RNAs** (`tRNAs`)
* **ribosomal RNAs** (`rRNAs`)

as well as small RNAs such as: 

* **Small nucleolar RNAs** (`snoRNAs`) 
* **Small nuclear RNA** (`snRNA`)
* (...)

- **ID code**: 
 - `SP` (spombe)
 - {**`A`**: `chr I`, **`B`**: `chr II`, **`C`**: `chr III`, **`MIT`**: `mitochondrial`} - (**only for tRNA**) 
 - {**`NCRNA`**: Non-coding RNA, **`TRNA`**: Transfer RNA, **`SNORNA`**: Small nucleolar RNA, **`RRNA`**: Ribosomal RNA, **`SNRNA`**: Small nuclear RNA}
 - {`aminoacid-3-letter`} - (**only for tRNA**: https://en.wikipedia.org/wiki/Proteinogenic_amino_acid)
 - `n_id` (number of group)

<font color='red'> Be careful! `ncRNA Transcripts` is both the **category** that groups all ncRNA transcripts and a **subcategory** within it. </font>

In [35]:
ncrna_features = ['ncRNA', 'tRNA', 'snoRNA', 'rRNA', 'snRNA']

In [36]:
gff_by_type.loc[ncrna_features]

Unnamed: 0_level_0,ID
type,Unnamed: 1_level_1
ncRNA,1528
tRNA,196
snoRNA,57
rRNA,49
snRNA,6


- **ncRNAs**

In [37]:
ncrna_df = transcript_df[transcript_df['type'].isin(ncrna_features)].sort_values(['seqid', 'start', 'end'])
ncrna_df.head()
#ncrna_df[ncrna_df['type'].isin(['snRNA'])].tail(50)

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Parent,gene_id,gene_name,transcript_id
6578,I,PomBase,ncRNA,11027,11556,.,-,.,SPNCRNA.70.1,SPNCRNA.70,SPNCRNA.70,,SPNCRNA.70.1
6104,I,PomBase,ncRNA,15002,16334,.,-,.,SPNCRNA.1701.1,SPNCRNA.1701,SPNCRNA.1701,prl101,SPNCRNA.1701.1
5751,I,PomBase,ncRNA,40795,41489,.,+,.,SPNCRNA.136.1,SPNCRNA.136,SPNCRNA.136,,SPNCRNA.136.1
6473,I,PomBase,ncRNA,40916,41770,.,-,.,SPNCRNA.601.1,SPNCRNA.601,SPNCRNA.601,,SPNCRNA.601.1
6474,I,PomBase,ncRNA,50851,51545,.,-,.,SPNCRNA.602.1,SPNCRNA.602,SPNCRNA.602,,SPNCRNA.602.1


In [38]:
ncrna_df.shape

(1836, 13)

In [40]:
#gff[gff['type'].isin(['intron']) & gff['Parent'].isin(nc_rnas)].sort_values(['seqid','start', 'end'])['ID']

- **ncRNA** (Example): Transcript (has a `Parent: gene`)

In [41]:
ncrna_id = "SPNCRNA.05.1"
ncrna_df[ncrna_df['ID'] == ncrna_id]

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Parent,gene_id,gene_name,transcript_id
5364,I,PomBase,ncRNA,434104,434843,.,+,.,SPNCRNA.05.1,SPNCRNA.05,SPNCRNA.05,prl5,SPNCRNA.05.1


- corresponding **gene** (Example): top feature (has **NO** `Parent`)

In [42]:
gene_id = "SPNCRNA.05"
gff[gff['ID'] == gene_id]

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,Parent
36261,I,PomBase,gene,434104,434843,.,+,.,SPNCRNA.05,prl5,


In [43]:
assert(len(ncrna_df) == len(gff[gff['type'].isin(ncrna_features)]))

### **I.C. <font color=blue>Pseudogenes</font>**: `Pseudogenic Transcripts (29/29)`

Additionally, for **pseudogenes** (`56`) both `gene` and `transcript` features can be found under the `pseudogene` category:
- `28` of the `56` **pseudogenes** behave as `transcripts` (e.g. contain a `Parent`, no `gene_id`, contain a `transcript_id`)
- the other `28` **pseudogenes** behave as `genes` (e.g. contain no `Parent`, a `gene_id`, no `transcript_id`)

- **ID code**: 
 - `SP` (spombe) 
 - {**`A`**: `chr I`, **`B`**: `chr II`, **`C`**: `chr III`}
 - {**`C`**: `sequenced on cosmids`, **`P`**: `sequenced on plasmids`} 
 - `n_id` (number of group)

In [44]:
pseudogene_rna_features = ["pseudogenic_transcript"]

In [45]:
gff_by_type.loc[pseudogene_rna_features]

Unnamed: 0_level_0,ID
type,Unnamed: 1_level_1
pseudogenic_transcript,29


- **pseudogenic_transcripts**:

In [46]:
pseudogene_rna_df = transcript_df[transcript_df['type'].isin(pseudogene_rna_features)].sort_values(['seqid', 'start', 'end'])
pseudogene_rna_df

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Parent,gene_id,gene_name,transcript_id
847,I,PomBase,pseudogenic_transcript,5726,6331,.,-,.,SPAC212.10.1,SPAC212.10,SPAC212.10,,SPAC212.10.1
7000,I,PomBase,pseudogenic_transcript,5726,6331,.,+,.,SPAC212.10b.1,SPAC212.10b,SPAC212.10b,SPAC212.10_plus,SPAC212.10b.1
846,I,PomBase,pseudogenic_transcript,7619,9274,.,+,.,SPAC212.09c.1,SPAC212.09c,SPAC212.09c,,SPAC212.09c.1
844,I,PomBase,pseudogenic_transcript,13665,14555,.,+,.,SPAC212.07c.1,SPAC212.07c,SPAC212.07c,,SPAC212.07c.1
842,I,PomBase,pseudogenic_transcript,20824,21015,.,+,.,SPAC212.05c.1,SPAC212.05c,SPAC212.05c,,SPAC212.05c.1
2150,I,PomBase,pseudogenic_transcript,58277,59105,.,-,.,SPAC977.13c.1,SPAC977.13c,SPAC977.13c,,SPAC977.13c.1
2304,I,PomBase,pseudogenic_transcript,2954983,2955377,.,-,.,SPAPB24D3.05c.1,SPAPB24D3.05c,SPAPB24D3.05c,,SPAPB24D3.05c.1
1035,I,PomBase,pseudogenic_transcript,4344572,4346170,.,-,.,SPAC23D3.05c.1,SPAC23D3.05c,SPAC23D3.05c,,SPAC23D3.05c.1
1312,I,PomBase,pseudogenic_transcript,5063258,5066231,.,+,.,SPAC2E12.05.1,SPAC2E12.05,SPAC2E12.05,wtf1,SPAC2E12.05.1
2584,II,PomBase,pseudogenic_transcript,33231,34780,.,+,.,SPBC1348.11.1,SPBC1348.11,SPBC1348.11,,SPBC1348.11.1


In [47]:
pseudogene_rna_df.shape

(29, 13)

- **pseudogenic_transcripts** (Example): Transcript (has a `Parent: gene`)

In [48]:
pseudogene_rna_id = "SPAC212.10.1"
pseudogene_rna_df[pseudogene_rna_df['ID'] == pseudogene_rna_id]

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Parent,gene_id,gene_name,transcript_id
847,I,PomBase,pseudogenic_transcript,5726,6331,.,-,.,SPAC212.10.1,SPAC212.10,SPAC212.10,,SPAC212.10.1


- corresponding **gene** (Example): top feature (has **NO** `Parent`)

In [49]:
gene_id =  "SPAC212.10"
gff[gff['ID'] == gene_id]

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,Parent
5887,I,PomBase,gene,5726,6331,.,-,.,SPAC212.10,,


In [50]:
assert(len(pseudogene_rna_df) == len(gff[gff['type'].isin(pseudogene_rna_features)]))

In [51]:
assert(set(mrna_features + ncrna_features + pseudogene_rna_features) == set(transcript_features))

In [52]:
explained_features.extend(transcript_features)

---------------------------

## **II. Sub-transcript Features** (`27188/27191`)

<font color='red'> **Transcript features** are in turn **associated hierarchically** with other **sub-transcript features** present in the `GFF`. </font>

* **CDS**: Coding sequence (`12162`)
* **Introns**: (`5372`)
* **UTRs**: Untranslated Regions (`9654`)

A **precursor mRNA (pre-mRNA)** is a type of primary transcript that becomes a **messenger RNA (mRNA)** after processing in preparation for **translation**.

<font color='red'> **NB: sub-transcript features** are associated NOT ONLY with **mRNA** (protein coding) but also with **ncRNA** and **pseudogenes**. </font>

!['pre_mrna'](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/Pre-mRNA.svg/500px-Pre-mRNA.svg.png)

- Get `sub_transcript_df` by filtering the **gff** using `transcript_id` **features** as `Parent`

In [53]:
#sub_transcript_df = gff.merge(transcript_df[["ID"]].rename(columns={'ID':'transcript_id'}),
sub_transcript_df = gff.merge(transcript_df[["gene_id", "gene_name", "transcript_id", "type"]].rename(columns={'type':'bio_type'}),
                          left_on='Parent',
                          right_on = 'transcript_id',
                          how='inner')
sub_transcript_df.shape

(27188, 15)

- Filter `columns` that contain **all NaN's**: in this case the `Name` **column**

In [54]:
cols = sub_transcript_df.columns[~sub_transcript_df.isna().all()]
sub_transcript_df = sub_transcript_df[cols]
sub_transcript_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Parent,gene_id,gene_name,transcript_id,bio_type
0,I,PomBase,CDS,1798347,1798830,.,+,.,SPAC1002.01.1:exon:1,SPAC1002.01.1,SPAC1002.01,mrx11,SPAC1002.01.1,mRNA
1,I,PomBase,intron,1798831,1798959,.,+,.,SPAC1002.01.1:intron:1,SPAC1002.01.1,SPAC1002.01,mrx11,SPAC1002.01.1,mRNA
2,I,PomBase,CDS,1798960,1799015,.,+,.,SPAC1002.01.1:exon:2,SPAC1002.01.1,SPAC1002.01,mrx11,SPAC1002.01.1,mRNA
3,I,PomBase,five_prime_UTR,1799061,1799127,.,+,.,SPAC1002.02.1:five_prime_UTR:1,SPAC1002.02.1,SPAC1002.02,pom34,SPAC1002.02.1,mRNA
4,I,PomBase,CDS,1799128,1799817,.,+,.,SPAC1002.02.1:exon:1,SPAC1002.02.1,SPAC1002.02,pom34,SPAC1002.02.1,mRNA


- Define **sub-transcript features** from `sub_transcript_df`

In [55]:
sub_transcripts_df_by_type = sub_transcript_df[['type', 'transcript_id']].groupby(['type']).count()
sub_transcripts_df_by_type = sub_transcripts_df_by_type.sort_values(['transcript_id'], ascending=False)
sub_transcripts_df_by_type

Unnamed: 0_level_0,transcript_id
type,Unnamed: 1_level_1
CDS,12162
intron,5372
five_prime_UTR,4863
three_prime_UTR,4791


In [56]:
sub_transcript_features = sub_transcripts_df_by_type.index.tolist()
sub_transcript_features

['CDS', 'intron', 'five_prime_UTR', 'three_prime_UTR']

In [57]:
sub_transcript_features = ["CDS", "intron", "five_prime_UTR", "three_prime_UTR"] 

- Break down the **sub-transcript features** in `sub_transcript_df` by `bio_type`

In [58]:
sub_transcripts_df_by_bio_type = sub_transcript_df[['bio_type', 'type', 'transcript_id']].groupby(['type', 'bio_type']).count()
sub_transcripts_df_by_bio_type = sub_transcripts_df_by_bio_type.sort_values(['type', 'transcript_id'], ascending = (True, False))
sub_transcripts_df_by_bio_type

Unnamed: 0_level_0,Unnamed: 1_level_0,transcript_id
type,bio_type,Unnamed: 2_level_1
CDS,mRNA,10223
CDS,ncRNA,1533
CDS,tRNA,240
CDS,snoRNA,58
CDS,pseudogenic_transcript,52
CDS,rRNA,49
CDS,snRNA,7
five_prime_UTR,mRNA,4860
five_prime_UTR,pseudogenic_transcript,3
intron,mRNA,5298


### **II.A. <font color=blue> CDS</font>**: `Coding sequence (12162/?)`

The **coding region of a gene**, also known as the `CDS` (from **coding sequence**), is the portion of a gene's DNA or RNA that **codes for protein**.

Although the term is sometimes used interchangeably with **exon**, the two are not the same: an **exon** can contain **coding region** (`CDS`) as well as the **5' and 3' untranslated regions** of the RNA, so an exon may be only partially made up of coding sequence. **(See II.C. UTRs)**

In [59]:
cds_features = ["CDS"]

In [60]:
#gff_by_type.loc[cds_features]
sub_transcripts_df_by_bio_type.loc[cds_features]

Unnamed: 0_level_0,Unnamed: 1_level_0,transcript_id
type,bio_type,Unnamed: 2_level_1
CDS,mRNA,10223
CDS,ncRNA,1533
CDS,tRNA,240
CDS,snoRNA,58
CDS,pseudogenic_transcript,52
CDS,rRNA,49
CDS,snRNA,7


- **CDSs**: (examples - contain a `Parent: transcript`)

In [61]:
cds_df = sub_transcript_df[sub_transcript_df['type'].isin(cds_features)].sort_values(['seqid', 'start', 'end'])
cds_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Parent,gene_id,gene_name,transcript_id,bio_type
4194,I,PomBase,CDS,1,5662,.,-,.,SPAC212.11.1:exon:1,SPAC212.11.1,SPAC212.11,tlh1,SPAC212.11.1,mRNA
27182,I,PomBase,CDS,1,5662,.,+,.,SPAC212.11b.1:exon:1,SPAC212.11b.1,SPAC212.11b,tlh1_plus,SPAC212.11b.1,mRNA
4193,I,PomBase,CDS,5726,6331,.,-,.,SPAC212.10.1:pseudogenic_exon:1,SPAC212.10.1,SPAC212.10,,SPAC212.10.1,pseudogenic_transcript
27183,I,PomBase,CDS,5726,6331,.,+,.,SPAC212.10b.1:pseudogenic_exon:1,SPAC212.10b.1,SPAC212.10b,SPAC212.10_plus,SPAC212.10b.1,pseudogenic_transcript
4192,I,PomBase,CDS,7619,9274,.,+,.,SPAC212.09c.1:pseudogenic_exon:1,SPAC212.09c.1,SPAC212.09c,,SPAC212.09c.1,pseudogenic_transcript


In [62]:
cds_df.shape

(12162, 14)

In [63]:
assert(len(cds_df) == len(gff[gff['type'].isin(cds_features)]))

### **II.B. <font color=blue> Introns</font>**: `(5372/5372)`

An **intron** is any nucleotide sequence within a gene that is removed by **RNA splicing** during maturation of the final RNA product. 

Introns are found in:
* **protein coding RNA (mRNA)**
* **pseudogenic transcripts**
* to a lesser degree, **ncRNA**

In [64]:
intron_features = ["intron"]

In [65]:
#gff_by_type.loc[intron_features]
sub_transcripts_df_by_bio_type.loc[intron_features]

Unnamed: 0_level_0,Unnamed: 1_level_0,transcript_id
type,bio_type,Unnamed: 2_level_1
intron,mRNA,5298
intron,tRNA,44
intron,pseudogenic_transcript,23
intron,ncRNA,5
intron,snRNA,1
intron,snoRNA,1


- **Introns**: (examples - contain a Parent: transcript)

In [66]:
intron_df = sub_transcript_df[sub_transcript_df['type'].isin(intron_features)].sort_values(['seqid', 'start', 'end'])
intron_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Parent,gene_id,gene_name,transcript_id,bio_type
4186,I,PomBase,intron,18307,18348,.,+,.,SPAC212.06c.1:intron:1,SPAC212.06c.1,SPAC212.06c,,SPAC212.06c.1,mRNA
4180,I,PomBase,intron,22077,22131,.,+,.,SPAC212.04c.1:intron:1,SPAC212.04c.1,SPAC212.04c,,SPAC212.04c.1,mRNA
4172,I,PomBase,intron,29228,29285,.,+,.,SPAC212.01c.1:intron:1,SPAC212.01c.1,SPAC212.01c,,SPAC212.01c.1,mRNA
10437,I,PomBase,intron,31558,31767,.,-,.,SPAC977.18.1:intron:1,SPAC977.18.1,SPAC977.18,,SPAC977.18.1,mRNA
10435,I,PomBase,intron,31914,32162,.,-,.,SPAC977.18.1:intron:2,SPAC977.18.1,SPAC977.18,,SPAC977.18.1,mRNA


In [67]:
intron_df.shape

(5372, 14)

In [68]:
assert(len(intron_df) == len(gff[gff['type'].isin(intron_features)]))

### **II.C. <font color=blue> UTRs</font>**: Untranslated Regions `(9654)`

**Untranslated region** (or **UTR**) refers to either of two sections, one on each side of a coding sequence on a strand of **mRNA**.

In this `GFF`, UTRs are annotated only for:
* **protein coding RNA (mRNA)**
* **pseudogenic transcripts**


In [69]:
utr_features = ["five_prime_UTR", "three_prime_UTR"]

In [70]:
#gff_by_type.loc[utr_features]
sub_transcripts_df_by_bio_type.loc[utr_features]

Unnamed: 0_level_0,Unnamed: 1_level_0,transcript_id
type,bio_type,Unnamed: 2_level_1
five_prime_UTR,mRNA,4860
five_prime_UTR,pseudogenic_transcript,3
three_prime_UTR,mRNA,4789
three_prime_UTR,pseudogenic_transcript,2


- **UTRs**: (examples - contain a Parent: transcript)

In [71]:
utr_df = sub_transcript_df[sub_transcript_df['type'].isin(utr_features)].sort_values(['seqid', 'start', 'end'])
utr_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Parent,gene_id,gene_name,transcript_id,bio_type
4190,I,PomBase,five_prime_UTR,11784,12157,.,+,.,SPAC212.08c.1:five_prime_UTR:1,SPAC212.08c.1,SPAC212.08c,,SPAC212.08c.1,mRNA
4184,I,PomBase,five_prime_UTR,18042,18071,.,+,.,SPAC212.06c.1:five_prime_UTR:1,SPAC212.06c.1,SPAC212.06c,,SPAC212.06c.1,mRNA
4188,I,PomBase,three_prime_UTR,18558,18974,.,+,.,SPAC212.06c.1:three_prime_UTR:1,SPAC212.06c.1,SPAC212.06c,,SPAC212.06c.1,mRNA
4178,I,PomBase,five_prime_UTR,21381,21586,.,+,.,SPAC212.04c.1:five_prime_UTR:1,SPAC212.04c.1,SPAC212.04c,,SPAC212.04c.1,mRNA
4182,I,PomBase,three_prime_UTR,22509,23050,.,+,.,SPAC212.04c.1:three_prime_UTR:1,SPAC212.04c.1,SPAC212.04c,,SPAC212.04c.1,mRNA


In [72]:
utr_df.shape

(9654, 14)

In [73]:
assert(len(utr_df) == len(gff[gff['type'].isin(utr_features)]))

In [74]:
assert(set(cds_features + intron_features + utr_features) == set(sub_transcript_features))

In [75]:
explained_features.extend(sub_transcript_features)

In [76]:
#gff_by_type[~gff_by_type.index.isin(explained_features)]

---------------------------

## Add `length` Information

Because the new `gtf` uses `CDS` **features**, as opposed to the old `gtf` that used `exon` **features**, the lengths of `gene`s differ between the two.

To recover the same lengths, it should be enough to add the corresponding **three_prime_UTR** and **five_prime_UTR** lengths, to the lengths obtained from `CDS` **features**.

<font color='red'> **Remember:** `exon` = `CDS` + `UTR` </font>

Additionally, we will also store a copy of the `Intron` lengths, which will prove useful later on.
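
As a worked example of this arithmetic, take the transcript `SPAC212.04c.1`, using the mRNA, UTR and intron coordinates displayed earlier. The sketch assumes those are the only UTR and intron features of this transcript and that exons and introns tile the transcript span; the `feature_length` helper is illustrative.

```python
def feature_length(start, end):
    """Length of a 1-based, inclusive GFF interval."""
    return end - start + 1


transcript = (21381, 23050)        # SPAC212.04c.1 (mRNA)
utrs = [(21381, 21586),            # five_prime_UTR
        (22509, 23050)]            # three_prime_UTR
introns = [(22077, 22131)]         # intron:1

span = feature_length(*transcript)
utr_length = sum(feature_length(s, e) for s, e in utrs)
intron_length = sum(feature_length(s, e) for s, e in introns)

# exon = CDS + UTR, and exons + introns cover the whole transcript span
exon_length = span - intron_length
cds_length = exon_length - utr_length
print(span, utr_length, intron_length, exon_length, cds_length)
# 1670 748 55 1615 867
```

Here the `CDS`-only length (867) falls short of the exon-based length (1615) by exactly the UTR length (748), which is the gap the UTR lengths are meant to close.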

### **CDS**

This `gff` contains no **exon** features; instead, we start by looking at the **CDS** features.

- Add `length` **Column** to the `cds_df`: 

In [77]:
#cds_df.head()

In [78]:
## compute CDS lengths: end - start + 1
cds_df['cds_length'] = cds_df['end'] - cds_df['start'] + 1

Each `transcript` is associated with one or more `exons`, in this case `CDS`.

This properly defines the **length** of a `transcript` as: *the sum of the individual exon (`CDS`) lengths associated with that transcript.*

- Get `transcript_cds_lengths_df`: (**summarize** `cds_df` by `transcript_id` and **sum** over individual `cds_length`s)

In [79]:
transcript_cds_lengths_df = cds_df[['transcript_id', 'cds_length']].groupby('transcript_id').sum().reset_index()

In [80]:
transcript_cds_lengths_df.head()

Unnamed: 0,transcript_id,cds_length
0,SPAC1002.01.1,540
1,SPAC1002.02.1,690
2,SPAC1002.03c.1,2772
3,SPAC1002.04c.1,600
4,SPAC1002.05c.1,2148


In [81]:
transcript_cds_lengths_df.shape

(7005, 2)

- Add `cds_length` **Column** to the `transcript_df`: **merge** `transcript_cds_lengths_df` to `transcript_df`

In [82]:
transcript_df = transcript_df.merge(transcript_cds_lengths_df, on='transcript_id')
#transcript_df

### **`UTR`'s**:

- Add `length` **Column** to the `utr_df`: 

In [83]:
#utr_df.head()

In [84]:
## compute UTR lengths: end - start + 1
utr_df['utr_length'] = utr_df['end'] - utr_df['start'] + 1

Each `transcript` is associated with two or fewer `utr`s.

This properly defines the `utr` **length** of a `transcript` as: *the sum of the individual `utr` lengths associated with that transcript.*

- Get `transcript_utr_lengths_df`: (**summarize** `utr_df` by `transcript_id` and **sum** over individual `utr_length`s)

In [85]:
transcript_utr_lengths_df = utr_df[['transcript_id', 'utr_length']].groupby('transcript_id').sum().reset_index()

In [86]:
transcript_utr_lengths_df.head()

Unnamed: 0,transcript_id,utr_length
0,SPAC1002.02.1,303
1,SPAC1002.03c.1,455
2,SPAC1002.04c.1,268
3,SPAC1002.05c.1,102
4,SPAC1002.07c.1,438


In [87]:
transcript_utr_lengths_df.shape

(4857, 2)

- Add `utr_length` **Column** to the `transcript_df`: **merge** `transcript_utr_lengths_df` to `transcript_df`

In [88]:
transcript_df = transcript_df.merge(transcript_utr_lengths_df, on='transcript_id', how='outer')
#transcript_df
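
The choice of `how` matters here because not every transcript has a UTR entry. The sketch below, on hypothetical toy tables, compares the default inner merge with the outer merge used above:

```python
import pandas as pd

# Hypothetical toy tables: tx_B has no UTR annotation
toy_transcript_df = pd.DataFrame({'transcript_id': ['tx_A', 'tx_B', 'tx_C']})
toy_utr_lengths_df = pd.DataFrame({'transcript_id': ['tx_A', 'tx_C'],
                                   'utr_length': [303, 455]})

# Default (inner) merge: transcripts without a UTR row are dropped
inner_df = toy_transcript_df.merge(toy_utr_lengths_df, on='transcript_id')

# Outer merge: every transcript is kept, missing lengths become NaN
outer_df = toy_transcript_df.merge(toy_utr_lengths_df, on='transcript_id', how='outer')
```

With the outer merge, `utr_length` becomes a float column because `NaN` cannot be stored in an integer column. This is consistent with values such as `303.0` in the tables shown later.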

### **`Intron`'s**:

- Add `intron_length` **Column** to the `intron_df`: 

In [89]:
#intron_df.head()

In [90]:
## compute intron lengths: end - start + 1
intron_df['intron_length'] = intron_df['end'] - intron_df['start'] + 1

Each `transcript` is associated with zero, one or more `introns`.

This lets us define the `intron` **length** of a `transcript` as: *the sum of the lengths of the individual `intron`s associated with that transcript.*

- Get `transcript_intron_lengths_df`: (**summarize** `intron_df` by `transcript_id` and **sum** over individual `intron_length`s)

In [91]:
transcript_intron_lengths_df = intron_df[['transcript_id', 'intron_length']].groupby('transcript_id').sum().reset_index()

In [92]:
transcript_intron_lengths_df.head()

Unnamed: 0,transcript_id,intron_length
0,SPAC1002.01.1,129
1,SPAC1002.06c.1,155
2,SPAC1002.07c.1,493
3,SPAC1002.15c.1,251
4,SPAC1006.03c.1,522


In [93]:
transcript_intron_lengths_df.shape

(2562, 2)

- Add `intron_length` **Column** to the `transcript_df`: **merge** `transcript_intron_lengths_df` to `transcript_df`

In [94]:
transcript_df = transcript_df.merge(transcript_intron_lengths_df, on='transcript_id', how='outer')
#transcript_df

### **`Transcripts`/`Genes`**:

Each `gene` is associated with one and only one `transcript`.

We will distinguish between the length of:
- **`gene`**: `end - start + 1` of the genomic locus describing a `gene` (**exon + intron lengths**).
- **`transcript`**: the sum of the **exon lengths** describing the `transcript`.

In [95]:
#transcript_df.head()

- Add `cds_length`, `utr_length` and `intron_length` **Column**s to the `gene_df`: **merge** `transcript_df` into `gene_df`

In [96]:
gene_df = gene_df.merge(transcript_df[['gene_id', 'gene_name', 'transcript_id',
                                       'cds_length', 'utr_length', 'intron_length', 'type']].rename(columns={'type':'bio_type'}),
                        left_on='ID',
                        right_on='gene_id'
                       )

In [97]:
gene_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,gene_id,gene_name,transcript_id,cds_length,utr_length,intron_length,bio_type
0,I,PomBase,gene,1798347,1799015,.,+,.,SPAC1002.01,mrx11,SPAC1002.01,mrx11,SPAC1002.01.1,540,,129.0,mRNA
1,I,PomBase,gene,1799061,1800053,.,+,.,SPAC1002.02,pom34,SPAC1002.02,pom34,SPAC1002.02.1,690,303.0,,mRNA
2,I,PomBase,gene,1799915,1803141,.,-,.,SPAC1002.03c,gls2,SPAC1002.03c,gls2,SPAC1002.03c.1,2772,455.0,,mRNA
3,I,PomBase,gene,1803624,1804491,.,-,.,SPAC1002.04c,taf11,SPAC1002.04c,taf11,SPAC1002.04c.1,600,268.0,,mRNA
4,I,PomBase,gene,1804548,1806797,.,-,.,SPAC1002.05c,jmj2,SPAC1002.05c,jmj2,SPAC1002.05c.1,2148,102.0,,mRNA


In [98]:
gene_df.shape

(7005, 17)

- Add `gene_length` **Column** to the `gene_df`: `gene_length` = `end` - `start` + 1

In [99]:
gene_df['gene_length'] = gene_df['end'] -  gene_df['start'] + 1

In [100]:
gene_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,gene_id,gene_name,transcript_id,cds_length,utr_length,intron_length,bio_type,gene_length
0,I,PomBase,gene,1798347,1799015,.,+,.,SPAC1002.01,mrx11,SPAC1002.01,mrx11,SPAC1002.01.1,540,,129.0,mRNA,669
1,I,PomBase,gene,1799061,1800053,.,+,.,SPAC1002.02,pom34,SPAC1002.02,pom34,SPAC1002.02.1,690,303.0,,mRNA,993
2,I,PomBase,gene,1799915,1803141,.,-,.,SPAC1002.03c,gls2,SPAC1002.03c,gls2,SPAC1002.03c.1,2772,455.0,,mRNA,3227
3,I,PomBase,gene,1803624,1804491,.,-,.,SPAC1002.04c,taf11,SPAC1002.04c,taf11,SPAC1002.04c.1,600,268.0,,mRNA,868
4,I,PomBase,gene,1804548,1806797,.,-,.,SPAC1002.05c,jmj2,SPAC1002.05c,jmj2,SPAC1002.05c.1,2148,102.0,,mRNA,2250


In [101]:
gene_df.shape

(7005, 18)

- Add `transcript_length` **Column** to the `gene_df`: `transcript_length` = `cds_length` + `utr_length` (a missing `utr_length` counts as 0)

In [102]:
gene_df['transcript_length'] = gene_df[["cds_length", "utr_length"]].sum(axis=1)

In [103]:
gene_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,gene_id,gene_name,transcript_id,cds_length,utr_length,intron_length,bio_type,gene_length,transcript_length
0,I,PomBase,gene,1798347,1799015,.,+,.,SPAC1002.01,mrx11,SPAC1002.01,mrx11,SPAC1002.01.1,540,,129.0,mRNA,669,540.0
1,I,PomBase,gene,1799061,1800053,.,+,.,SPAC1002.02,pom34,SPAC1002.02,pom34,SPAC1002.02.1,690,303.0,,mRNA,993,993.0
2,I,PomBase,gene,1799915,1803141,.,-,.,SPAC1002.03c,gls2,SPAC1002.03c,gls2,SPAC1002.03c.1,2772,455.0,,mRNA,3227,3227.0
3,I,PomBase,gene,1803624,1804491,.,-,.,SPAC1002.04c,taf11,SPAC1002.04c,taf11,SPAC1002.04c.1,600,268.0,,mRNA,868,868.0
4,I,PomBase,gene,1804548,1806797,.,-,.,SPAC1002.05c,jmj2,SPAC1002.05c,jmj2,SPAC1002.05c.1,2148,102.0,,mRNA,2250,2250.0


In [104]:
gene_df.shape

(7005, 19)
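
The row-wise `sum(axis=1)` is what keeps `transcript_length` defined for genes without a UTR. A small sketch, reusing the `cds_length`/`utr_length` values of the first two rows above, contrasts it with plain addition:

```python
import pandas as pd

# Values copied from the first two rows of gene_df (mrx11 has no UTR entry)
toy_df = pd.DataFrame({'cds_length': [540, 690],
                       'utr_length': [float('nan'), 303.0]})

# Row-wise sum skips NaN by default (skipna=True)
toy_df['transcript_length'] = toy_df[['cds_length', 'utr_length']].sum(axis=1)

# Plain addition propagates NaN instead
toy_df['naive_length'] = toy_df['cds_length'] + toy_df['utr_length']
```

The first row ends up with `transcript_length` 540.0 but a missing `naive_length`, so plain addition would have left UTR-less genes without a length.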

---------------------------

## Add `Category` Information

- Load Heterochromatic features (**repeats**) `GTF/GFF` **file** as a **DataFrame** (`htc_features_df`)

In [105]:
htc_features_df = pd.read_csv(in_heterochromatic_features, sep='\t', comment='#', names=gff_header)

In [106]:
## Parse `attributes` column
attributes_df = htc_features_df['attributes'].apply(lambda x: Util.parse_attribute_col(x))

attributes_df = pd.DataFrame(attributes_df.values.tolist(), index=htc_features_df.index)
htc_features_df = pd.concat([htc_features_df.drop('attributes', axis=1), attributes_df], axis=1)

In [107]:
#htc_features_df.head()

In [108]:
#htc_features_df.shape

- Get break-down of Heterochromatic **features**

In [109]:
htc_features_df[['ID', 'type']].groupby('type').count()

Unnamed: 0_level_0,ID
type,Unnamed: 1_level_1
CDS,32
five_prime_UTR,7
gene,27
intron,5
mRNA,21
ncRNA,1
pseudogenic_transcript,5
three_prime_UTR,7


- Get `gene` **features** by selecting those entries without a `Parent`

In [110]:
htc_features_df = htc_features_df[htc_features_df['Parent'].isnull()]

In [111]:
#htc_features_df

In [112]:
#htc_features_df.shape

In [113]:
htc_features_df[['ID', 'type']].groupby('type').count()

Unnamed: 0_level_0,ID
type,Unnamed: 1_level_1
gene,27


In [114]:
htc_features_df.shape

(27, 11)
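
The `Parent` filter works because, in a GFF3 hierarchy, only top-level features lack a `Parent` attribute. A minimal sketch on a hypothetical three-row hierarchy:

```python
import pandas as pd

# Hypothetical gene -> mRNA -> CDS hierarchy
toy_gff_df = pd.DataFrame({
    'ID':     ['gene_1', 'gene_1.1', 'gene_1.1:exon:1'],
    'type':   ['gene', 'mRNA', 'CDS'],
    'Parent': [None, 'gene_1', 'gene_1.1'],
})

# Keep only the features that have no parent, i.e. the top of each hierarchy
top_level_df = toy_gff_df[toy_gff_df['Parent'].isnull()]
```

Note that this selects any parentless feature, not only genes. In the file above the result happens to contain `gene` entries only, as the break-down confirms.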

---------------------------

## Include (non-gene) features found in `GTF/GFF`:

### **a.** Features present in the **Mating Type Region** chromosome

- Get all features present in the **chromosome** `mating_type_region`

In [115]:
mating_type_region_df = gff[gff['seqid'] == 'mating_type_region'].sort_values(['seqid', 'start', 'end', 'ID'])

In [116]:
#mating_type_region_df.head()

In [117]:
#mating_type_region_df.shape

- Get break-down of `mating_type_region` **features**

In [118]:
mating_type_region_df_by_type = mating_type_region_df[['type', 'ID']].groupby(['type']).count()
mating_type_region_df_by_type.sort_values('ID')

Unnamed: 0_level_0,ID
type,Unnamed: 1_level_1
TR_box,1
CDS,2
five_prime_UTR,2
gene,2
mRNA,2
region,11


- Exclude `TR_box` **features**

In [119]:
exclude_features = ['TR_box']

In [120]:
mating_type_region_df = mating_type_region_df[~mating_type_region_df['type'].isin(exclude_features)]

In [121]:
mating_type_region_df

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,Parent
41301,mating_type_region,PomBase,region,1,2120,.,+,.,FP565355_region_1..2120,,
41618,mating_type_region,PomBase,region,3203,3259,.,+,.,FP565355_region_3203..3259,,
41215,mating_type_region,PomBase,region,3260,3394,.,+,.,FP565355_region_3260..3394,,
36244,mating_type_region,PomBase,CDS,3354,3710,.,-,.,SPMTR.01.1:exon:1,,SPMTR.01.1
36241,mating_type_region,PomBase,gene,3354,3753,.,-,.,SPMTR.01,mat2-Pc,
36242,mating_type_region,PomBase,mRNA,3354,3753,.,-,.,SPMTR.01.1,,SPMTR.01
41256,mating_type_region,PomBase,region,3395,4497,.,+,.,FP565355_region_3395..4497,,
36243,mating_type_region,PomBase,five_prime_UTR,3711,3753,.,-,.,SPMTR.01.1:five_prime_UTR:1,,SPMTR.01.1
36247,mating_type_region,PomBase,five_prime_UTR,3861,3888,.,+,.,SPMTR.02.1:five_prime_UTR:1,,SPMTR.02.1
36245,mating_type_region,PomBase,gene,3861,4368,.,+,.,SPMTR.02,mat2-Pi,


**Genes:** (`2`)

* SPMTR.01: mat2-Pc
* SPMTR.02: mat2-Pi

<font color='red'> These have already been processed and are contained in `gene_df`, but we still need to add them to `htc_features_ids`. </font>

In [122]:
gene_features = ['gene', 'mRNA', 'five_prime_UTR', 'CDS']

In [123]:
genes_mating_type_region_df = mating_type_region_df[mating_type_region_df['type'].isin(gene_features)]
genes_mating_type_region_df

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,Parent
36244,mating_type_region,PomBase,CDS,3354,3710,.,-,.,SPMTR.01.1:exon:1,,SPMTR.01.1
36241,mating_type_region,PomBase,gene,3354,3753,.,-,.,SPMTR.01,mat2-Pc,
36242,mating_type_region,PomBase,mRNA,3354,3753,.,-,.,SPMTR.01.1,,SPMTR.01
36243,mating_type_region,PomBase,five_prime_UTR,3711,3753,.,-,.,SPMTR.01.1:five_prime_UTR:1,,SPMTR.01.1
36247,mating_type_region,PomBase,five_prime_UTR,3861,3888,.,+,.,SPMTR.02.1:five_prime_UTR:1,,SPMTR.02.1
36245,mating_type_region,PomBase,gene,3861,4368,.,+,.,SPMTR.02,mat2-Pi,
36246,mating_type_region,PomBase,mRNA,3861,4368,.,+,.,SPMTR.02.1,,SPMTR.02
36248,mating_type_region,PomBase,CDS,3889,4368,.,+,.,SPMTR.02.1:exon:1,,SPMTR.02.1


In [124]:
genes_mating_type_region_df.shape

(8, 11)

In [125]:
mating_type_region_gene_ids = genes_mating_type_region_df[genes_mating_type_region_df['Parent'].isnull()]['ID'].tolist()
mating_type_region_gene_ids

['SPMTR.01', 'SPMTR.02']

**Regions**: (`11`)

<font color='red'> Adapt **regions** to add them to `gene_df` (or more generally `features_df`) </font>

In [126]:
## work on an explicit copy and derive the new columns from the filtered frame itself
regions_mating_type_region_df = mating_type_region_df[~mating_type_region_df['type'].isin(gene_features)].copy()
regions_mating_type_region_df['gene_id'] = regions_mating_type_region_df['ID']
regions_mating_type_region_df['bio_type'] = regions_mating_type_region_df['type']
regions_mating_type_region_df['transcript_length'] = regions_mating_type_region_df['end'] - regions_mating_type_region_df['start'] + 1
regions_mating_type_region_df['gene_length'] = regions_mating_type_region_df['end'] - regions_mating_type_region_df['start'] + 1

In [127]:
regions_mating_type_region_df

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,Parent,gene_id,bio_type,transcript_length,gene_length
41301,mating_type_region,PomBase,region,1,2120,.,+,.,FP565355_region_1..2120,,,FP565355_region_1..2120,region,2120,2120
41618,mating_type_region,PomBase,region,3203,3259,.,+,.,FP565355_region_3203..3259,,,FP565355_region_3203..3259,region,57,57
41215,mating_type_region,PomBase,region,3260,3394,.,+,.,FP565355_region_3260..3394,,,FP565355_region_3260..3394,region,135,135
41256,mating_type_region,PomBase,region,3395,4497,.,+,.,FP565355_region_3395..4497,,,FP565355_region_3395..4497,region,1103,1103
41606,mating_type_region,PomBase,region,4498,4556,.,+,.,FP565355_region_4498..4556,,,FP565355_region_4498..4556,region,59,59
41427,mating_type_region,PomBase,region,9170,13408,.,+,.,FP565355_region_9170..13408,,,FP565355_region_9170..13408,region,4239,4239
41487,mating_type_region,PomBase,region,15417,15473,.,+,.,FP565355_region_15417..15473,,,FP565355_region_15417..15473,region,57,57
41515,mating_type_region,PomBase,region,15474,15608,.,+,.,FP565355_region_15474..15608,,,FP565355_region_15474..15608,region,135,135
41544,mating_type_region,PomBase,region,15609,16735,.,+,.,FP565355_region_15609..16735,,,FP565355_region_15609..16735,region,1127,1127
41351,mating_type_region,PomBase,region,16736,16794,.,+,.,FP565355_region_16736..16794,,,FP565355_region_16736..16794,region,59,59


In [128]:
regions_mating_type_region_df.shape

(11, 15)

In [129]:
mating_type_region_region_ids = regions_mating_type_region_df['ID'].tolist()
#mating_type_region_region_ids

### **b.** Centromeric Repeat Features **dh/dg** (in Chr I) as annotated in `Pombase`

- Get all `dh_repeat` and `dg_repeat` features present in the **chromosome** `I`

In [130]:
dh_dg_repeats_df = gff[(gff['seqid'] == 'I') & (gff['type'].isin(['dh_repeat','dg_repeat']))].sort_values(['seqid', 'start', 'end', 'ID'])
# keep only the first dh/dg pair (first two rows after sorting by position)
dh_dg_repeats_df = dh_dg_repeats_df.iloc[:2].copy()
dh_dg_repeats_df

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,Parent
41275,I,PomBase,dh_repeat,3754127,3759170,.,-,.,SPRPTCENA.3,,
41362,I,PomBase,dg_repeat,3759165,3763441,.,+,.,SPRPTCENA.4,,


- Get break-down of `dh_repeat` and `dg_repeat` **features**

In [131]:
dh_dg_repeats_df_by_type = dh_dg_repeats_df[['type', 'ID']].groupby(['type']).count()
dh_dg_repeats_df_by_type.sort_values('ID')

Unnamed: 0_level_0,ID
type,Unnamed: 1_level_1
dg_repeat,1
dh_repeat,1


**Repeats**: (`2`)

<font color='red'> Adapt **repeats** to add them to `gene_df` (or more generally `features_df`) </font>

In [132]:
dh_dg_repeats_df['gene_id'] = dh_dg_repeats_df['ID']
dh_dg_repeats_df['bio_type'] = dh_dg_repeats_df['type']
dh_dg_repeats_df['transcript_length'] = dh_dg_repeats_df['end'] - dh_dg_repeats_df['start'] + 1
dh_dg_repeats_df['gene_length'] = dh_dg_repeats_df['end'] - dh_dg_repeats_df['start'] + 1

In [133]:
dh_dg_repeats_df

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,Parent,gene_id,bio_type,transcript_length,gene_length
41275,I,PomBase,dh_repeat,3754127,3759170,.,-,.,SPRPTCENA.3,,,SPRPTCENA.3,dh_repeat,5044,5044
41362,I,PomBase,dg_repeat,3759165,3763441,.,+,.,SPRPTCENA.4,,,SPRPTCENA.4,dg_repeat,4277,4277


In [134]:
dh_dg_repeats_df.shape

(2, 15)

In [135]:
dh_dg_repeats_ids = dh_dg_repeats_df['ID'].tolist()
dh_dg_repeats_ids

['SPRPTCENA.3', 'SPRPTCENA.4']
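
Both the **regions** and the **repeats** are adapted with the same four assignments, so they could be factored into one function. The helper below, `as_feature_rows`, is a hypothetical name introduced only for this sketch and is not part of `Util`:

```python
import pandas as pd

def as_feature_rows(df):
    """Hypothetical helper: add the gene-table columns to non-gene features."""
    out = df.copy()  # avoid modifying a slice of the parent frame
    out['gene_id'] = out['ID']
    out['bio_type'] = out['type']
    length = out['end'] - out['start'] + 1  # 1-based, inclusive coordinates
    out['gene_length'] = length
    out['transcript_length'] = length
    return out

# Coordinates copied from the two dh/dg rows shown above
toy_repeats_df = pd.DataFrame({
    'ID': ['SPRPTCENA.3', 'SPRPTCENA.4'],
    'type': ['dh_repeat', 'dg_repeat'],
    'start': [3754127, 3759165],
    'end': [3759170, 3763441],
})
adapted_df = as_feature_rows(toy_repeats_df)
```

For these two rows the helper reproduces the lengths in the table above (5044 and 4277).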

### <font color='red'> Include non-gene features in `htc_features_ids`. </font>

- Get the list of Heterochromatin feature (**repeats**) `gene_id`s, then add the `mating_type_region` **features** and the `dh`/`dg` **repeats**

In [136]:
htc_features_ids = list(htc_features_df['ID'])
htc_features_ids.extend(mating_type_region_gene_ids)
htc_features_ids.extend(mating_type_region_region_ids)
htc_features_ids.extend(dh_dg_repeats_ids)
#htc_features_ids = list(htc_features_df['Name'])
#htc_features_ids

In [137]:
len(set(htc_features_ids))

42

- Concatenate `gene_df`, `regions_mating_type_region_df` and `dh_dg_repeats_df`, then add a `category` **column** that is one of: 

`['repeat', 'ribosomal', 'gene']`

In [138]:
#features_df = pd.concat([gene_df, regions_mating_type_region_df], ignore_index=True, sort=True)[gene_df.columns]
features_df = pd.concat([gene_df, regions_mating_type_region_df, dh_dg_repeats_df], ignore_index=True, sort=True)[gene_df.columns]

In [139]:
#gene_df['category'] = gene_df[['gene_id', 'Name']].apply(lambda row: Util.get_category(row['gene_id'], row['Name'], htc_features_ids), axis=1)
features_df['category'] = features_df[['gene_id', 'Name']].apply(lambda row: Util.get_category(row['gene_id'], row['Name'], htc_features_ids), axis=1)

In [140]:
#gene_df.head()
features_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Name,gene_id,gene_name,transcript_id,cds_length,utr_length,intron_length,bio_type,gene_length,transcript_length,category
0,I,PomBase,gene,1798347,1799015,.,+,.,SPAC1002.01,mrx11,SPAC1002.01,mrx11,SPAC1002.01.1,540.0,,129.0,mRNA,669,540.0,gene
1,I,PomBase,gene,1799061,1800053,.,+,.,SPAC1002.02,pom34,SPAC1002.02,pom34,SPAC1002.02.1,690.0,303.0,,mRNA,993,993.0,gene
2,I,PomBase,gene,1799915,1803141,.,-,.,SPAC1002.03c,gls2,SPAC1002.03c,gls2,SPAC1002.03c.1,2772.0,455.0,,mRNA,3227,3227.0,gene
3,I,PomBase,gene,1803624,1804491,.,-,.,SPAC1002.04c,taf11,SPAC1002.04c,taf11,SPAC1002.04c.1,600.0,268.0,,mRNA,868,868.0,gene
4,I,PomBase,gene,1804548,1806797,.,-,.,SPAC1002.05c,jmj2,SPAC1002.05c,jmj2,SPAC1002.05c.1,2148.0,102.0,,mRNA,2250,2250.0,gene


In [141]:
#gene_df.shape
features_df.shape

(7018, 20)
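
Selecting `[gene_df.columns]` after the concatenation is what keeps the schema of `gene_df`. The sketch below uses hypothetical toy frames (`extra_col` stands in for a column such as `Parent` that only the non-gene tables carry):

```python
import pandas as pd

# Hypothetical toy frames with partly different columns
toy_gene_df = pd.DataFrame({'ID': ['gene_1'], 'gene_id': ['gene_1'],
                            'cds_length': [540], 'gene_length': [669]})
toy_region_df = pd.DataFrame({'ID': ['region_1'], 'gene_id': ['region_1'],
                              'gene_length': [2120], 'extra_col': ['x']})

# Union of columns first, then restrict (and reorder) to the gene-table columns
toy_features_df = pd.concat([toy_gene_df, toy_region_df],
                            ignore_index=True, sort=True)[toy_gene_df.columns]
```

Columns that only the region table has are dropped, and columns it lacks are filled with `NaN`. This is also why `cds_length` is displayed as `540.0` rather than `540` in `features_df`.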

In [142]:
#set(gene_df[gene_df['category'] == 'repeat']['ID']).symmetric_difference(htc_features_ids)

- Have a look at distribution of **feature** `category`:

In [143]:
#gene_df_by_category = gene_df[['category', 'ID']].groupby(['category']).count()
features_df_by_category = features_df[['category', 'ID']].groupby(['category']).count()

#gene_df_by_category.sort_values('ID')
features_df_by_category.sort_values('ID')

Unnamed: 0_level_0,ID
category,Unnamed: 1_level_1
repeat,39
ribosomal,177
gene,6802


In [144]:
#sum(gene_df_by_category.ID)
sum(features_df_by_category.ID)

7018
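
The ID list built earlier holds 42 unique entries, while 39 features fall into the `repeat` category. A gap like this could arise, for example, if some listed IDs have no row in `features_df`; the notebook does not say which case applies. A sketch of how one might check, using hypothetical stand-in lists:

```python
# Hypothetical stand-ins for `htc_features_ids` and the IDs present in `features_df`
toy_htc_ids = ['rep_1', 'rep_2', 'rep_3', 'rep_2']
toy_table_ids = ['gene_1', 'rep_1', 'rep_3']

unique_htc_ids = set(toy_htc_ids)

# IDs flagged as heterochromatic that never appear in the feature table
missing_from_table = sorted(unique_htc_ids - set(toy_table_ids))

# IDs that were added to the list more than once
duplicated_ids = sorted({i for i in toy_htc_ids if toy_htc_ids.count(i) > 1})
```

On the real data the same two set operations would be applied to `htc_features_ids` and `features_df['ID']`.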

- Have a look at distribution of **feature** `types`:

In [145]:
#gene_df_by_type = gene_df[['type', 'ID']].groupby(['type']).count()
#gene_df_by_type.sort_values('ID')

In [146]:
features_df_by_type = features_df[['type', 'ID']].groupby(['type']).count()
features_df_by_type.sort_values('ID')

Unnamed: 0_level_0,ID
type,Unnamed: 1_level_1
dg_repeat,1
dh_repeat,1
region,11
gene,7005


In [147]:
#sum(gene_df_by_type.ID)
sum(features_df_by_type.ID)

7018

In [148]:
#transcripts_df_by_type

In [149]:
#gff_by_type.index.difference(gene_df_by_type.index.union(transcripts_df_by_type.index)).tolist()

---------------------------

## Store **Gene Annotation Tables** (gdf)

These are individual `.csv` **files** (written tab-separated, despite the extension) that contain information about the associated **features**:
- **features**
- **genes**
- **exons** <font color='red'> (missing, for now use both `cds` and `utr`) </font>
- **CDS** 
- **UTRs**
- **Introns** (missing, but might be necessary for further analysis)

Besides the `gene_df`, I'm not sure we need the rest. At least for the `gene_counts.py` **script**, we can parse the `gff` **file** directly.

**Features**: these are a mix of **genes** and other **features of interest** (e.g. `repeats`, `mating_type_region`, etc...)

- Store `features_df` as `.csv` **file**

In [150]:
#features_df_file = os.path.join(annot_dir, 'Schizosaccharomyces_pombe_all_chromosomes.features.csv')
features_df_file = os.path.join(annot_dir, 'Schizosaccharomyces_pombe_all_chromosomes.extended.features.csv')

In [151]:
features_df.to_csv(features_df_file, sep='\t', index=None)
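
Because the table is written with `sep='\t'`, it has to be read back with the same separator even though the file name ends in `.csv`. A minimal round-trip sketch on a toy table, with an in-memory buffer standing in for the file on disk (the `category` values are illustrative):

```python
import io
import pandas as pd

# Toy feature table for illustration
toy_features_df = pd.DataFrame({'gene_id': ['SPAC1002.01', 'SPRPTCENA.3'],
                                'gene_length': [669, 5044],
                                'category': ['gene', 'repeat']})

buffer = io.StringIO()  # stands in for the output file
toy_features_df.to_csv(buffer, sep='\t', index=False)
buffer.seek(0)

# Reading with the default comma separator would yield a single column
reloaded_df = pd.read_csv(buffer, sep='\t')
```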

**Genes**

- Store `gene_df` as `.csv` **file**

In [152]:
#gene_df_file = os.path.join(annot_dir, 'Schizosaccharomyces_pombe_all_chromosomes.gene.csv')
gene_df_file = os.path.join(annot_dir, 'Schizosaccharomyces_pombe_all_chromosomes.extended.gene.csv')

In [153]:
gene_df.to_csv(gene_df_file, sep='\t', index=None)

**Exons**

- Add **gene information** and `biotype` column, and store `exons_df` as `.csv` **file** 

In [154]:
#exons_df_file = os.path.join(annot_dir, 'Schizosaccharomyces_pombe_all_chromosomes.extended.exon.csv')

In [155]:
#exons_df.to_csv(exons_df_file, sep='\t', index=None)

**CDS**

- Add **gene information** and `biotype` column, and store `cds_df` as `.csv` **file** 

In [156]:
#cds_df_file = os.path.join(annot_dir, 'Schizosaccharomyces_pombe_all_chromosomes.cds.csv')
#cds_df_file = os.path.join(annot_dir, 'Schizosaccharomyces_pombe_all_chromosomes.extended.cds.csv')

In [157]:
#cds_df.to_csv(cds_df_file, sep='\t', index=None)

**UTRs**

- Add **gene information** and `biotype` column, and store `utr_df` as `.csv` **file** 

In [158]:
#utr_df_file = os.path.join(annot_dir, 'Schizosaccharomyces_pombe_all_chromosomes.utr.csv')

In [159]:
#utr_df.to_csv(utr_df_file, sep='\t', index=None)

**Introns**

- Add **gene information** and `biotype` column, and store `intron_df` as `.csv` **file** 

In [160]:
intron_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,Parent,gene_id,gene_name,transcript_id,bio_type,intron_length
4186,I,PomBase,intron,18307,18348,.,+,.,SPAC212.06c.1:intron:1,SPAC212.06c.1,SPAC212.06c,,SPAC212.06c.1,mRNA,42
4180,I,PomBase,intron,22077,22131,.,+,.,SPAC212.04c.1:intron:1,SPAC212.04c.1,SPAC212.04c,,SPAC212.04c.1,mRNA,55
4172,I,PomBase,intron,29228,29285,.,+,.,SPAC212.01c.1:intron:1,SPAC212.01c.1,SPAC212.01c,,SPAC212.01c.1,mRNA,58
10437,I,PomBase,intron,31558,31767,.,-,.,SPAC977.18.1:intron:1,SPAC977.18.1,SPAC977.18,,SPAC977.18.1,mRNA,210
10435,I,PomBase,intron,31914,32162,.,-,.,SPAC977.18.1:intron:2,SPAC977.18.1,SPAC977.18,,SPAC977.18.1,mRNA,249


In [161]:
intron_df = pd.merge(intron_df, features_df[['gene_id', 'gene_length', 'category']], how='left', on='gene_id')

In [162]:
intron_df_file = os.path.join(annot_dir, 'Schizosaccharomyces_pombe_all_chromosomes.intron.csv')

In [163]:
intron_df.to_csv(intron_df_file, sep='\t', index=None)

---------------------------

# **[PomBase - Genomic regions](https://www.pombase.org/downloads/genome-datasets)**

---------------------------

## **[Sequencing Status](https://www.pombase.org/status/sequencing-status)**

### **Chromosome 1**

| Contig Name | Region | Size |
| --- | --- | --- |
| - | unsequenced to chr1 left telomere| 10 ± 2 kb* | 
| c212| sub-telomeric left arm | 29,663 bp| 
| - | Gap | <5 kb* | 
| c977 | left arm and right arm | 5,549,370 bp| 
| - | unsequenced to chr1 right telomere| 18 ± 3 kb* | 

### **Chromosome 2**
| Contig Name | Region | Size |
| --- | --- | --- |
| AB325691 | chr2 left arm gap-filling contig | 20,000 bp** | 
| - | Gap | 5 ± 5 kb| 
| c1348 | sub-telomeric left arm | 80,201 bp| 
| - | Gap | 22 ± 5 kb*| 
| pB10D8 | left arm to centromeric gap | 1,536,269 bp| 
| - | Gap | ~6 kb | 
| pJ5566 | right arm from centromeric gap to telomeric repeats | 2,923,134 bp| 

### **Chromosome 3**
| Contig Name | Region | Size |
| --- | --- | --- |
| p20C8 | left arm from centromeric gap | 1,083,348 bp| 
| - | Gap | 25.3 ± 6 kb***| 
| c1676 | right arm from centromeric gap | 1,369,435 bp| 

!['spombe_chr'](https://cshperspectives.cshlp.org/content/7/7/a018770/F2.large.jpg)

## **[Telomeres](https://www.pombase.org/status/telomeres)**

The fission yeast complete genome sequence currently stops short of the **telomeric repeats**. See the [Sequencing Status](https://www.pombase.org/status/sequencing-status) page for the current assembly status.

The most proximal anchored cosmids to each telomere are (links to `JBrowse`):

* **Chromosome I** left c212 (coordinates [1-29663](https://www.pombase.org/jbrowse/?loc=I%3A1..29664&tracks=DNA%2CPomBase%20forward%20strand%20features%2CPomBase%20reverse%20strand%20features&highlight=))
* **Chromosome I** right c750 (coordinates [5554844-5579133](https://www.pombase.org/jbrowse/?loc=I%3A5554843..5579133&tracks=PomBase%20forward%20strand%20features%2CPomBase%20reverse%20strand%20features&highlight=))
* **Chromosome II** left c1348 (coordinates [1-39186](https://www.pombase.org/jbrowse/?loc=II%3A1..39181&tracks=DNA%2CPomBase%20forward%20strand%20features%2CPomBase%20reverse%20strand%20features&highlight=))
* **Chromosome II** right pT2R1 (coordinates [4500619-4539800](https://www.pombase.org/jbrowse/?loc=II%3A4500628..4539804&tracks=DNA%2CPomBase%20forward%20strand%20features%2CPomBase%20reverse%20strand%20features&highlight=))

<font color='red'> There are no telomere proximal clones for **chromosome III** as the unsequenced rDNA blocks occur in between the sequenced portion and the telomeres on both chromosome arms. </font> 

### <ins> How can I locate telomeres and subtelomeric regions? </ins>

The current `S. pombe` genome assembly <font color='red'> **does not include the complete `telomeric regions` or the `telomeric short repeats`**. </font>

These omissions are beyond the control of **PomBase curators**.

<font color='red'> **`Subtelomeric repeats` are also not explicitly defined at present**</font>, although we hope to provide this information in the future. 

Additional information about `S. pombe` **`telomeres`** is available on the [Telomeres page](https://www.pombase.org/status/telomeres).



## **[Centromeres](https://www.pombase.org/status/centromeres)**

At present, each centromere is annotated as a single sequence feature in *PomBase*, which can be viewed in and downloaded from the Ensembl genome browser.

* **CEN 1**: [3753687-3789421](https://www.pombase.org/jbrowse/?loc=I%3A3753680..3789414&tracks=DNA%2CPomBase%20forward%20strand%20features%2CPomBase%20reverse%20strand%20features&highlight=)
* **CEN 2**: [1602264-1644747](https://www.pombase.org/jbrowse/?loc=II%3A1602261..1644744&tracks=DNA%2CPomBase%20forward%20strand%20features%2CPomBase%20reverse%20strand%20features&highlight=)
* **CEN 3**: [1070904-1137003](https://www.pombase.org/jbrowse/?loc=III%3A1070899..1136998&tracks=DNA%2CPomBase%20forward%20strand%20features%2CPomBase%20reverse%20strand%20features&highlight=)

Note that specific `repeats` within **centromeres** cannot yet be viewed or searched as features on *PomBase* pages, but they are included in the `forward` and `reverse` **strand feature tracks** in the genome browser, which are enabled by default.

Repeats are also shown in the diagram below. To see the `repeat sequences`, download and unzip the [contiguated sequence files](ftp://ftp.pombase.org/pombe/genome_sequence_and_features/artemis_files/) and view them in **Artemis**. (See this [FAQ](https://www.pombase.org/faq/there-equivalent-artemis-java-applet-pombase) for more information.)

**Note**: Recent work (`?2010`) by Chad Ellermeier and Gerry Smith suggests that there are only 4 +/- 1 copies of the 6760 bp repeat missing from chromosome 3.

**Please note**: This map is a schematic diagram. Distances and overlaps are approximate. Please refer to the sequence data to design experimental constructs.
!['spombe_cemtromer'](https://www.pombase.org/assets/centromeremapping.gif)

**Centromere map from**: *The genome sequence of Schizosaccharomyces pombe. Nature 2002 Feb 21;415(6874):871-80. created by Rhian Gwilliam*

### <ins> How can I locate centromeres? </ins>

Centromeres can be retrieved in the *PomBase* genome browser; the coordinates are:

* **Chromosome I**: `3753687-3789421`
* **Chromosome II**: `1602264-1644747`
* **Chromosome III**: `1070904-1137003`

Sequence features within the centromeres, such as `repeats`, are annotated with **Sequence Ontology terms**. 

For more details, see the [Centromeres page](https://www.pombase.org/status/centromeres).
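
With these coordinates, checking whether an annotated feature lies inside a centromere is a simple interval test. The sketch below assumes the coordinates are 1-based and inclusive, as in the GFF; `in_centromere` is a hypothetical helper name:

```python
# Centromere coordinates as listed above (assumed 1-based, inclusive)
CENTROMERES = {
    'I':   (3753687, 3789421),
    'II':  (1602264, 1644747),
    'III': (1070904, 1137003),
}

def in_centromere(seqid, start, end):
    """Return True if the feature lies entirely within its chromosome's centromere."""
    if seqid not in CENTROMERES:
        return False
    cen_start, cen_end = CENTROMERES[seqid]
    return cen_start <= start and end <= cen_end

# Length of each centromere feature: end - start + 1
centromere_lengths = {chrom: end - start + 1
                      for chrom, (start, end) in CENTROMERES.items()}
```

For example, `in_centromere('I', 3754127, 3759170)` (the `dh_repeat` SPRPTCENA.3) is true, whereas the gene SPAC1002.01 at `I:1798347-1799015` lies outside.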


## **[Mating Type Region](https://www.pombase.org/status/mating-type-region)**

The *S. pombe* **mating type loci** are located on **Chromosome 2**.

The `reference` **strain 972 h-** encodes the `M`-specific **mating genes** `II:2114008-2115135` at the expressed `mat1` **locus**.

The **silent region** `mat3M` is located at coordinates `II:2129208-2137121`.

<font color='red'> Note that the **silent** `mat2P` **region** and **cenH element** are **deleted** in the reference strain. </font>

A contig of the **h90 configuration** of the `mat2P`-`mat3M` **region** was created by Xavier Marsellach and Lorena Aguilar (Azorín lab) using available data and `S. pombe var. kambucha` as a scaffold. The contig can be viewed in the genome browser. 

Replacing the **Chromosome 2** region spanning coordinates `2129208-2137121` with the separate contig sequence yields the Chromosome 2 contig of an **h90 strain**.
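
The replacement described above amounts to splicing one sequence into another. The sketch below shows the index arithmetic on a toy string, assuming the interval is 1-based and inclusive; `replace_region` is a hypothetical helper, and the real sequences would come from the PomBase FASTA files:

```python
def replace_region(chrom_seq, start, end, replacement):
    """Replace the 1-based, inclusive interval [start, end] of chrom_seq."""
    if not 1 <= start <= end <= len(chrom_seq):
        raise ValueError('interval outside sequence')
    # Python slices are 0-based and end-exclusive, hence start - 1 and end
    return chrom_seq[:start - 1] + replacement + chrom_seq[end:]

toy_chrom = 'AAAACCCCGGGG'   # 12 bp toy chromosome
toy_contig = 'TTTTTT'        # toy replacement contig
spliced = replace_region(toy_chrom, 5, 8, toy_contig)
```

Any annotation downstream of the replaced interval would shift by the difference in length between the contig and the region it replaces.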

For a description of how the mating type specific genes are organized and annotated in PomBase, see this [FAQ](https://www.pombase.org/faq/how-are-mating-type-specific-gene-pages-organized) item.

For a detailed description of the *S. pombe* **mating type region**, please see the [online tutorial](http://www1.bio.ku.dk/english/research/fg/cellecyklus_genomintegritet/mating/) provided by the Nielsen lab (external link, now dead).

### <ins> How can I locate the mating type region? </ins>

Browse for:
* **Chromosome II**: `2129208-2137121`
* and see the [Mating type region page](https://www.pombase.org/status/mating-type-region).

The **mating type region** will soon be annotated as a `feature`, and refer to a **Sequence Ontology term**.



###  <ins> How are the mating type specific gene pages organized? </ins>

As described in detail in this [online tutorial](https://www1.bio.ku.dk/english/research/fg/cellecyklus_genomintegritet/mating/) provided by the Nielsen lab, `S. pombe` cells switch between the **two mating types** `M` and `P`.

Genetic information encoded by the `mat1 locus` determines the **mating type**: 
* If this locus contains the `Pc` and `Pi` genes, the cell is **mating type** `P`
* If it contains the `Mc` and `Mi` genes the cell is **mating type** `M`.

Additionally, `S. pombe` contains **two silent loci**: `mat2` and `mat3`. These **loci are not expressed** but host the information needed for each mating type configuration:
* `Mat2` contains the two genes `Pc` and `Pi`
* `Mat3` contains the two genes `Mc` and `Mi`.

Recombinational DNA repair during mitotic cell division ensures production of one daughter cell of parental mating type and one daughter of the opposite mating type.

A **wild type** `S. pombe` cell thus contains **6 mating type specific genes**:
1. `mat1-P/Mc` - expressed 
2. `mat1-P/Mi` - expressed
3. `mat2-Pc` - silent
4. `mat2-Pi` - silent 
5. `mat3-Mc` - silent
6. `mat3-Mi` - silent

The sequenced `S. pombe` **reference strain** (`972 h-`) is in the **M mating type configuration** (encodes the `Mc` and `Mi` genes at the `mat1` locus).
The `mat2` genes **are deleted in this strain** for technical reasons, whereas the `mat3` genes **are intact**.

The DNA sequence of the **WT** [silent mating type](https://www.pombase.org/status/mating-type-region) region containing the `mat2` and `mat3` genes was reconstructed yielding an extra [mating type region](https://www.pombase.org/downloads/genome-datasets) **contig** (see ‘current genome’ ftp site link).

Only the `P` **genes** from this **contig** have gene pages in **PomBase**.

The `systematic IDs` and contig source of the mating type specific genes are:
1. `mat1-Mc` - SPBC23G7.09 (from the chromosome 2 contig) 
2. `mat1-Mi` - SPBC23G7.17c (from the chromosome 2 contig)
3. `mat2-Pc` - SPMTR.01 (from the mating type contig) 
4. `mat2-Pi` - SPMTR.02 (from the mating type contig) 
5. `mat3-Mc` - SPBC1711.02 (from the chromosome 2 contig) 
6. `mat3-Mi` - SPBC1711.01c (from the chromosome 2 contig)

The **duplicate** `M` genes without gene pages are: 
7. `mat3-Mc` - SPMTR.04 (extra copy from the mating type contig) 
8. `mat3-Mi` - SPMTR.03 (extra copy from the mating type contig)

For the `M` specific genes, functional annotation (GO, phenotypes…) is attached to the `mat1-Mc` and `mat1-Mi` genes. 

For the `P` specific genes, functional annotation is attached to `mat2-Pc` and `mat2-Pi` out of necessity (the <font color='red'> **reference strain** is in the `M` **configuration**</font>).

---------------------------

# **Investigate Genomic Regions**

In [169]:
repeat_features = ['long_terminal_repeat', # other repeat regions
                   'repeat_region',
                   # centromere repeats
                   'dh_repeat', 'dg_repeat',
                   'regional_centromere_inner_repeat_region',  'regional_centromere', 'regional_centromere_central_core',
                   # mating type region
                   'mating_type_region']
explained_features.extend(repeat_features)

In [165]:
gff_by_type.loc[repeat_features]

Unnamed: 0_level_0,ID
type,Unnamed: 1_level_1
long_terminal_repeat,239
repeat_region,49
dh_repeat,16
dg_repeat,11
regional_centromere_inner_repeat_region,6
regional_centromere,3
regional_centromere_central_core,4
mating_type_region,1


- Get `repeats_df` by filtering the **gff** for `repeat` **features**

In [166]:
repeats_df = gff[gff['type'].isin(repeat_features)].sort_values(['seqid', 'start', 'end'])
#repeats_df.shape

- Drop `columns` that contain **only NaN's**: in this case the `Name` and `Parent` **column**s

In [167]:
cols = repeats_df.columns[~repeats_df.isna().all()]
repeats_df = repeats_df[cols]
repeats_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID
41697,I,PomBase,long_terminal_repeat,21035,21329,.,+,.,SPLTRA.1
41190,I,PomBase,long_terminal_repeat,24281,24445,.,-,.,SPLTRA.2
41369,I,PomBase,long_terminal_repeat,24581,24876,.,-,.,SPLTRA.3
41424,I,PomBase,long_terminal_repeat,25997,26339,.,-,.,SPLTRA.4
41355,I,PomBase,long_terminal_repeat,28124,28480,.,+,.,SPLTRA.5


In [168]:
repeats_df.shape

(329, 9)
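
The boolean-mask selection used above has a more compact equivalent in `DataFrame.dropna(axis=1, how='all')`. A sketch on a toy frame (IDs and start positions copied from the first two rows, the empty columns added for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({'ID': ['SPLTRA.1', 'SPLTRA.2'],
                       'start': [21035, 24281],
                       'Name': [None, None],
                       'Parent': [None, None]})

# Approach used in the notebook: keep columns that are not entirely NaN
kept_by_mask = toy_df[toy_df.columns[~toy_df.isna().all()]]

# Equivalent one-liner
kept_by_dropna = toy_df.dropna(axis=1, how='all')
```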

## **II. Centromere repeats** (include `regional_centromere`)

In [82]:
centromere_repeat_features = ['regional_centromere', ## this is actually the whole region, not a repeat per-se
                              'dh_repeat', 'dg_repeat',
                              'regional_centromere_inner_repeat_region', 'regional_centromere_central_core']

In [83]:
centromere_repeat_df = repeats_df[repeats_df['type'].isin(centromere_repeat_features)].sort_values(['seqid', 'start', 'end'])
centromere_repeat_df['length'] = centromere_repeat_df["end"] - centromere_repeat_df["start"] + 1
centromere_repeat_df.shape

(40, 10)

The **centromeric region/centromeric repeats** follow the structure shown in the figure below, with slight differences for each chromosome.

**In Chromosome I**
* the region is not symmetrical: the `dg_repeat` and `dh_repeat` on either side of the **Central Domain** have different lengths
* **SPRPTCENA.7** (`regional_centromere_inner_repeat_region`) and **SPRPTCENA.8** (`dg_repeat`) overlap
* there is a large gap (~6.2 kb) between SPRPTCENA.8 (`dg_repeat`) and SPRPTCENA.9 (`dh_repeat`).


!['ch_I_centromer'](igv_plots/chr_I_centromeric_region.png)

* **Chromosome I** - `regional_centromere` (spans `~35 kb`) can be divided, in genomic order, into:
    * (-) `dh_repeat` (~5 kb) - SPRPTCENA.3
    * (+) `dg_repeat` (~4.3 kb) - SPRPTCENA.4
    * `regional_centromere_inner_repeat_region` (~5.6 kb) - SPRPTCENA.5
    * `regional_centromere_central_core` (~4.1 kb) - SPRPTCENA.6
    * `regional_centromere_inner_repeat_region` (~5.6 kb) - SPRPTCENA.7
    * `dg_repeat` (~1.7 kb) - SPRPTCENA.8 (overlaps SPRPTCENA.7)
    * `dh_repeat` (~3.9 kb) - SPRPTCENA.9 (after a gap)

In [84]:
centromere_repeat_df[centromere_repeat_df['seqid'].isin(['I'])]

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,length
41285,I,PomBase,regional_centromere,3753687,3789421,.,+,.,CU329670_regional_centromere_3753687..3789421,35735
41275,I,PomBase,dh_repeat,3754127,3759170,.,-,.,SPRPTCENA.3,5044
41362,I,PomBase,dg_repeat,3759165,3763441,.,+,.,SPRPTCENA.4,4277
41356,I,PomBase,regional_centromere_inner_repeat_region,3763442,3769077,.,+,.,SPRPTCENA.5,5636
41584,I,PomBase,regional_centromere_central_core,3769078,3773184,.,+,.,SPRPTCENA.6,4107
41621,I,PomBase,regional_centromere_inner_repeat_region,3773196,3778831,.,+,.,SPRPTCENA.7,5636
41668,I,PomBase,dg_repeat,3777115,3778839,.,+,.,SPRPTCENA.8,1725
41251,I,PomBase,dh_repeat,3785089,3788981,.,+,.,SPRPTCENA.9,3893
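
Overlaps and gaps between neighbouring features can be read off the coordinates directly instead of the IGV plot. A minimal sketch, assuming 1-based inclusive GFF coordinates; the frame `chr1` and the column `gap_to_prev` are names introduced here for illustration:

```python
import pandas as pd

# Chromosome I sub-features, coordinates copied from the table above
# (the enclosing `regional_centromere` row is left out).
chr1 = pd.DataFrame({
    'ID': ['SPRPTCENA.3', 'SPRPTCENA.4', 'SPRPTCENA.5', 'SPRPTCENA.6',
           'SPRPTCENA.7', 'SPRPTCENA.8', 'SPRPTCENA.9'],
    'start': [3754127, 3759165, 3763442, 3769078, 3773196, 3777115, 3785089],
    'end':   [3759170, 3763441, 3769077, 3773184, 3778831, 3778839, 3788981],
}).sort_values(['start', 'end'])

# Furthest end seen among all preceding features (also handles nested features).
prev_end = chr1['end'].cummax().shift()

# Positive: unannotated bases before this feature; negative: overlap in bp.
chr1['gap_to_prev'] = chr1['start'] - prev_end - 1

print(chr1[['ID', 'gap_to_prev']])
```

On these coordinates SPRPTCENA.8 starts 1717 bp before the end of SPRPTCENA.7, SPRPTCENA.9 follows a 6249 bp unannotated stretch, and SPRPTCENA.4 overlaps SPRPTCENA.3 by 6 bp.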


**In Chromosome II**
* <font color='red'> **TODO:** </font>

!['ch_II_centromer'](igv_plots/chr_II_centromeric_region.png)

In [102]:
#<img src="igv_plots/chr_II_centromeric_region.png" style="width:200; height:400px">

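* **Chromosome II** - `regional_centromere` (spans ~42 kb) can be divided in:
    * <font color='red'> **TODO:** </font>
    * (-) `dg_repeat` (~4.9 kb) - SPRPTCENB.1
    * (gap)
    * (-) `dh_repeat` (~1.5 kb) - SPRPTCENB.3
    * (+) `dh_repeat` (~2.1 kb) - SPRPTCENB.4 (overlaps SPRPTCENB.3)
    * `regional_centromere_inner_repeat_region` (~4.1 kb) - SPRPTCENB.10
    * `regional_centromere_central_core` (~3.9 kb) - SPRPTCENB.11
    * `regional_centromere_central_core` (~3.1 kb) - SPRPTCENB.12 (overlaps SPRPTCENB.11)
    * `regional_centromere_inner_repeat_region` (~4.3 kb) - SPRPTCENB.13
    * `dh_repeat` (~2.4 kb) - SPRPTCENB.14
    * `dh_repeat` (~3.6 kb) - SPRPTCENB.15 (overlaps SPRPTCENB.14)
    * `dg_repeat` (~6.5 kb) - SPRPTCENB.17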

In [85]:
centromere_repeat_df[centromere_repeat_df['seqid'].isin(['II'])]

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,length
41406,II,PomBase,regional_centromere,1602418,1644747,.,+,.,CU329671_regional_centromere_1602418..1644747,42330
41430,II,PomBase,dg_repeat,1604786,1609649,.,-,.,SPRPTCENB.1,4864
41286,II,PomBase,dh_repeat,1610041,1611580,.,-,.,SPRPTCENB.3,1540
41404,II,PomBase,dh_repeat,1611467,1613530,.,+,.,SPRPTCENB.4,2064
41473,II,PomBase,regional_centromere_inner_repeat_region,1616671,1620806,.,+,.,SPRPTCENB.10,4136
41412,II,PomBase,regional_centromere_central_core,1620807,1624737,.,+,.,SPRPTCENB.11,3931
41384,II,PomBase,regional_centromere_central_core,1624515,1627609,.,+,.,SPRPTCENB.12,3095
41503,II,PomBase,regional_centromere_inner_repeat_region,1627610,1631932,.,+,.,SPRPTCENB.13,4323
41666,II,PomBase,dh_repeat,1631933,1634352,.,+,.,SPRPTCENB.14,2420
41302,II,PomBase,dh_repeat,1634252,1637841,.,+,.,SPRPTCENB.15,3590


**In Chromosome III**
* <font color='red'> **TODO:** </font>

!['ch_III_centromer'](igv_plots/chr_III_centromeric_region.png)

* **Chromosome III** - regional_centromere (spans ~66 kb) can be divided in:
    * <font color='red'> **TODO:** </font>

In [86]:
centromere_repeat_df[centromere_repeat_df['seqid'].isin(['III'])]

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,length
41255,III,PomBase,regional_centromere,1070904,1137003,.,+,.,CU329672_regional_centromere_1070904..1137003,66100
41613,III,PomBase,dh_repeat,1073708,1075339,.,+,.,SPRPTCENC.1,1632
41685,III,PomBase,dg_repeat,1076115,1078499,.,+,.,SPRPTCENC.2,2385
41181,III,PomBase,dh_repeat,1078934,1082875,.,+,.,SPRPTCENC.4,3942
41202,III,PomBase,dg_repeat,1082876,1083348,.,+,.,SPRPTCENC.5,473
41309,III,PomBase,dh_repeat,1083449,1084747,.,+,.,SPRPTCENC.6,1299
41517,III,PomBase,dg_repeat,1084748,1087132,.,+,.,SPRPTCENC.7,2385
41589,III,PomBase,dh_repeat,1087567,1091508,.,+,.,SPRPTCENC.9,3942
41718,III,PomBase,dh_repeat,1090919,1091558,.,+,.,SPRPTCENC.10,640
41217,III,PomBase,regional_centromere_inner_repeat_region,1091700,1097063,.,+,.,SPRPTCENC.12,5364


### **II.A. <font color='blue'>dg repeat</font>**

- `dg_repeat`

In [87]:
dg_repeat_df = centromere_repeat_df[centromere_repeat_df['type'].isin(['dg_repeat'])].sort_values(['seqid','start', 'end'])
dg_repeat_df

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,length
41362,I,PomBase,dg_repeat,3759165,3763441,.,+,.,SPRPTCENA.4,4277
41668,I,PomBase,dg_repeat,3777115,3778839,.,+,.,SPRPTCENA.8,1725
41430,II,PomBase,dg_repeat,1604786,1609649,.,-,.,SPRPTCENB.1,4864
41383,II,PomBase,dg_repeat,1638230,1644747,.,+,.,SPRPTCENB.17,6518
41685,III,PomBase,dg_repeat,1076115,1078499,.,+,.,SPRPTCENC.2,2385
41202,III,PomBase,dg_repeat,1082876,1083348,.,+,.,SPRPTCENC.5,473
41517,III,PomBase,dg_repeat,1084748,1087132,.,+,.,SPRPTCENC.7,2385
41629,III,PomBase,dg_repeat,1111813,1114198,.,+,.,SPRPTCENC.18,2386
41444,III,PomBase,dg_repeat,1118573,1120957,.,+,.,SPRPTCENC.21,2385
41583,III,PomBase,dg_repeat,1125333,1125582,.,+,.,SPRPTCENC.24,250


In [88]:
dg_repeat_df.groupby('seqid').agg({'length': ['sum', 'count']})

Unnamed: 0_level_0,length,length
Unnamed: 0_level_1,sum,count
seqid,Unnamed: 1_level_2,Unnamed: 2_level_2
I,6002,2
II,11382,2
III,12650,7


### **II.B. <font color='blue'>dh repeat</font>**

- `dh_repeat`

In [89]:
dh_repeat = centromere_repeat_df[centromere_repeat_df['type'].isin(['dh_repeat'])].sort_values(['seqid', 'start', 'end'])
dh_repeat

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,length
41275,I,PomBase,dh_repeat,3754127,3759170,.,-,.,SPRPTCENA.3,5044
41251,I,PomBase,dh_repeat,3785089,3788981,.,+,.,SPRPTCENA.9,3893
41286,II,PomBase,dh_repeat,1610041,1611580,.,-,.,SPRPTCENB.3,1540
41404,II,PomBase,dh_repeat,1611467,1613530,.,+,.,SPRPTCENB.4,2064
41666,II,PomBase,dh_repeat,1631933,1634352,.,+,.,SPRPTCENB.14,2420
41302,II,PomBase,dh_repeat,1634252,1637841,.,+,.,SPRPTCENB.15,3590
41613,III,PomBase,dh_repeat,1073708,1075339,.,+,.,SPRPTCENC.1,1632
41181,III,PomBase,dh_repeat,1078934,1082875,.,+,.,SPRPTCENC.4,3942
41309,III,PomBase,dh_repeat,1083449,1084747,.,+,.,SPRPTCENC.6,1299
41589,III,PomBase,dh_repeat,1087567,1091508,.,+,.,SPRPTCENC.9,3942


In [90]:
dh_repeat.groupby('seqid').agg({'length': ['sum', 'count']})

Unnamed: 0_level_0,length,length
Unnamed: 0_level_1,sum,count
seqid,Unnamed: 1_level_2,Unnamed: 2_level_2
I,8937,2
II,9614,4
III,28612,10
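
Because some repeats overlap, summing `length` counts the shared bases twice. A minimal sketch of a non-redundant count, assuming 1-based inclusive coordinates; `covered_bp` is a hypothetical helper written for this example, not a function from `Util`:

```python
import pandas as pd

# Chromosome II `dh_repeat` rows, coordinates copied from the table above.
dh_chr2 = pd.DataFrame({
    'seqid': ['II'] * 4,
    'start': [1610041, 1611467, 1631933, 1634252],
    'end':   [1611580, 1613530, 1634352, 1637841],
})

def covered_bp(df):
    """Count bases covered by at least one interval (1-based, inclusive)."""
    df = df.sort_values(['start', 'end'])
    prev_max_end = df['end'].cummax().shift()
    # A new block starts whenever a feature begins after everything seen so far.
    block_id = (df['start'] > prev_max_end).cumsum()
    blocks = df.groupby(block_id).agg(start=('start', 'min'), end=('end', 'max'))
    return int((blocks['end'] - blocks['start'] + 1).sum())

naive_bp = int((dh_chr2['end'] - dh_chr2['start'] + 1).sum())
merged_bp = covered_bp(dh_chr2)
print(naive_bp, merged_bp)
```

For chromosome II the plain sum is 9614 bp, as in the table above, while the merged coverage is 9399 bp. The 215 bp difference is the two overlaps (SPRPTCENB.3/SPRPTCENB.4 and SPRPTCENB.14/SPRPTCENB.15).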


## **III. Mating type region**

In [91]:
mating_type_features = ["mating_type_region"]

In [92]:
mating_type_df = repeats_df[repeats_df['type'].isin(mating_type_features)].sort_values(['seqid', 'start', 'end'])
#mating_type_df['size'] = mating_type_df["end"] - mating_type_df["start"] + 1

---------------------------

# **Other features**

In [93]:
other_features = ['promoter',
                  'low_complexity_region', 
                  'region', 
                  'nuclear_mt_pseudogene', 
                  'polyA_site', 
                  'origin_of_replication',
                  'LTR_retrotransposon',
                  'gene_group', 
                  'gap', 
                  'SNP',
                  'TR_box',
                  'polyA_signal_sequence']

In [94]:
gff_by_type.loc[other_features]

Unnamed: 0_level_0,ID
type,Unnamed: 1_level_1
promoter,62
low_complexity_region,59
region,29
nuclear_mt_pseudogene,17
polyA_site,17
origin_of_replication,16
LTR_retrotransposon,14
gene_group,5
gap,4
SNP,2


- Get `other_features_df` by filtering the **gff** for the remaining (`other`) **features**

In [95]:
other_features_df = gff[gff['type'].isin(other_features)].sort_values(['seqid', 'start', 'end'])
#other_features_df.shape

- Drop `columns` that contain **only NaN's**: in this case the `Name` and `Parent` **columns**

In [96]:
cols = other_features_df.columns[~other_features_df.isna().all()]
other_features_df = other_features_df[cols]
other_features_df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID
41540,I,PomBase,gap,29664,29763,.,+,.,CU329670_gap_29664..29763
41379,I,PomBase,nuclear_mt_pseudogene,72348,72436,.,+,.,SPNUMT.8
41216,I,PomBase,nuclear_mt_pseudogene,79544,79628,.,+,.,SPNUMT.9
41474,I,PomBase,promoter,175826,175833,.,-,.,CU329670_promoter_175826..175833
41186,I,PomBase,promoter,187628,187635,.,-,.,CU329670_promoter_187628..187635


In [97]:
other_features_df.shape

(228, 9)

- Get `gap` features:

In [98]:
gap_df = other_features_df[other_features_df['type'].isin(["gap"])].sort_values(['seqid', 'start', 'end'])
gap_df['length'] = gap_df["end"] - gap_df["start"] + 1
gap_df

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,ID,length
41540,I,PomBase,gap,29664,29763,.,+,.,CU329670_gap_29664..29763,100
41374,II,PomBase,gap,80202,80301,.,+,.,CU329671_gap_80202..80301,100
41270,II,PomBase,gap,1616571,1616670,.,+,.,CU329671_gap_1616571..1616670,100
41656,III,PomBase,gap,1083349,1083448,.,+,.,CU329672_gap_1083349..1083448,100
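
Two of the four gaps have coordinates that fall inside a `regional_centromere`. A minimal sketch that flags them, with the centromere spans copied from the tables in the centromere section; the column names `cen_start`, `cen_end` and `in_centromere` are introduced here for illustration:

```python
import pandas as pd

# `gap` features, coordinates copied from the table above.
gaps = pd.DataFrame({
    'seqid': ['I', 'II', 'II', 'III'],
    'start': [29664, 80202, 1616571, 1083349],
    'end':   [29763, 80301, 1616670, 1083448],
})

# `regional_centromere` spans, copied from the centromere tables.
centromeres = pd.DataFrame({
    'seqid': ['I', 'II', 'III'],
    'cen_start': [3753687, 1602418, 1070904],
    'cen_end':   [3789421, 1644747, 1137003],
})

# One centromere per chromosome, so a plain merge on `seqid` is enough.
gaps = gaps.merge(centromeres, on='seqid', how='left')
gaps['in_centromere'] = (
    (gaps['start'] >= gaps['cen_start']) & (gaps['end'] <= gaps['cen_end'])
)

print(gaps[['seqid', 'start', 'end', 'in_centromere']])
```

The gaps at II:1616571..1616670 and III:1083349..1083448 are flagged. Each one sits directly next to an annotated centromeric feature (before SPRPTCENB.10 and after SPRPTCENC.5, respectively).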


In [99]:
explained_features.extend(other_features)

In [100]:
gff_by_type[~gff_by_type.index.isin(explained_features)].sort_values('ID', ascending=False)

Unnamed: 0_level_0,ID
type,Unnamed: 1_level_1
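
The empty result means that every feature type has been assigned to one of the groups. A sketch of the same bookkeeping as an explicit check, on toy stand-ins (`gff_by_type_toy` and `explained_toy` are hypothetical names; the three counts are copied from the tables above):

```python
import pandas as pd

# Toy counts per feature type; the real `gff_by_type` has many more rows.
gff_by_type_toy = pd.DataFrame(
    {'ID': [239, 62, 4]},
    index=pd.Index(['long_terminal_repeat', 'promoter', 'gap'], name='type'),
)
explained_toy = ['long_terminal_repeat', 'promoter', 'gap']

# Types present in the annotation but not assigned to any group.
unexplained = gff_by_type_toy.index.difference(explained_toy)
# Names in the list that never occur in the annotation (e.g. typos).
unused = pd.Index(explained_toy).difference(gff_by_type_toy.index)

assert unexplained.empty, f'unexplained feature types: {list(unexplained)}'
print(len(unexplained), len(unused))
```

Checking in both directions also catches misspelled type names in the feature lists, which the one-sided filter above would not report.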
