# EDA of Sequence files

This notebook explores the contents of the sequence files, starting with the training file.

The format of this file is a bit tricky to read, because the last column contains multi-line entries in FASTA format.

For example, in this entry, we have one sequence for Chain A of 1SCL_A:

```
1SCL_A,GGGUGCUCAGUACGAGAGGAACCGCACCC,1995-01-26,"THE SARCIN-RICIN LOOP, A MODULAR RNA",">1SCL_1|Chain A|RNA SARCIN-RICIN LOOP|Rattus norvegicus (10116)
GGGUGCUCAGUACGAGAGGAACCGCACCC
"
```

This is another entry, with two sequences.

The first sequence is 1HMH_1, which corresponds to Chains A, C, and E of 1HMH (the target here is 1HMH_E), with sequence GGCGACCCUGAUGAGGCCGAAAGGCCGAAACCGU.

The second sequence is 1HMH_2 (the DNA strand), which corresponds to Chains B, D, and F, and reads ACGGTCGGTCGCC.

```

1HMH_E,GGCGACCCUGAUGAGGCCGAAAGGCCGAAACCGU,1995-12-07,THREE-DIMENSIONAL STRUCTURE OF A HAMMERHEAD RIBOZYME,">1HMH_1|Chains A, C, E|HAMMERHEAD RIBOZYME-RNA STRAND|
GGCGACCCUGAUGAGGCCGAAAGGCCGAAACCGU
>1HMH_2|Chains B, D, F|HAMMERHEAD RIBOZYME-DNA STRAND|
ACGGTCGGTCGCC
"

```

For this second example, according to the competition home page, we only need to make a prediction for the first chain, with sequence GGCGACCCUGAUGAGGCCGAAAGGCCGAAACCGU. We can ignore the second sequence at prediction time, although it may be useful for training.
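To make the two-sequence layout concrete, here is a minimal sketch of how the multi-line `all_sequences` text could be split into (header, sequence) pairs. The helper name `parse_fasta` is hypothetical (it is not part of the competition data or any library), and the sketch assumes the field follows the simple FASTA layout shown in the examples above.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into (header, sequence) pairs.

    Hypothetical helper: assumes each record starts with a ">" header line
    followed by one or more sequence lines.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The all_sequences field of the 1HMH_E entry shown above.
all_sequences = (
    ">1HMH_1|Chains A, C, E|HAMMERHEAD RIBOZYME-RNA STRAND|\n"
    "GGCGACCCUGAUGAGGCCGAAAGGCCGAAACCGU\n"
    ">1HMH_2|Chains B, D, F|HAMMERHEAD RIBOZYME-DNA STRAND|\n"
    "ACGGTCGGTCGCC\n"
)

records = parse_fasta(all_sequences)
for header, seq in records:
    entity_id, chains = header.split("|")[:2]
    print(entity_id, "|", chains, "|", seq)
```

Only the first record matches the `sequence` column of this entry; the second is the partner strand.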


## Original Description of the Sequence files

The following description of the sequence files is copied from the [competition home page](https://www.kaggle.com/competitions/stanford-rna-3d-folding/data).

\[train/validation/test]_sequences.csv - the target sequences of the RNA molecules.

- target_id - (string) An arbitrary identifier. In train_sequences.csv, this is formatted as pdb_id_chain_id, where pdb_id is the id of the entry in the Protein Data Bank and chain_id is the chain id of the monomer in the pdb file.
- sequence - (string) The RNA sequence. For test_sequences.csv, this is guaranteed to be a string of A, C, G, and U. For some train_sequences.csv, other characters may appear.
- temporal_cutoff - (string) The date in yyyy-mm-dd format that the sequence was published. See Additional Notes.
- description - (string) Details of the origins of the sequence. For a few targets, additional information on small molecule ligands bound to the RNA is included. You don't need to make predictions for these ligand coordinates.
- all_sequences - (string) FASTA-formatted sequences of all molecular chains present in the experimentally solved structure. In a few cases this may include multiple copies of the target RNA (look for the word "Chains" in the header) and/or partners like other RNAs or proteins or DNA. You don't need to make predictions for all these molecules; if you do, just submit predictions for sequence. Some entries are blank.
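Two points in the description above lend themselves to quick checks: whether a sequence contains characters other than A, C, G, and U, and which chains a FASTA header lists. The sketch below is illustrative only. The helper names `unexpected_characters` and `chain_ids` are hypothetical, and the header parsing assumes the plain `Chain A` / `Chains A, C, E` form seen in the examples in this notebook; other header variants may exist in the file and are not handled.

```python
import re

RNA_ALPHABET = set("ACGU")


def unexpected_characters(sequence):
    """Return the set of characters that are not A, C, G, or U."""
    return set(sequence) - RNA_ALPHABET


def chain_ids(header):
    """Extract chain ids from a FASTA header such as
    '>1HMH_1|Chains A, C, E|HAMMERHEAD RIBOZYME-RNA STRAND|'.

    Assumption: the second '|'-separated field reads 'Chain X' or
    'Chains X, Y, ...'. Returns an empty list otherwise.
    """
    fields = header.lstrip(">").split("|")
    if len(fields) < 2:
        return []
    match = re.match(r"Chains?\s+(.*)", fields[1].strip())
    if not match:
        return []
    return [c.strip() for c in match.group(1).split(",") if c.strip()]


print(unexpected_characters("GGGUGCUCAGUACGAGAGGAACCGCACCC"))
print(unexpected_characters("ACGGTCGGTCGCC"))
print(chain_ids(">1HMH_1|Chains A, C, E|HAMMERHEAD RIBOZYME-RNA STRAND|"))
```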


### Contents of train_sequences.csv

These are the contents of the train_sequences.csv file.

Viewed line by line as raw text, the file shows some apparent inconsistencies:
- The first line of the file seems to contain the column headers, separated by ",". However, many of the following lines contain "|"-separated entries.
- Some lines begin with `>`, others do not.
- In the second position, some lines contain the sequence, while others contain words like "Chain".
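One way to see where these apparent inconsistencies come from is to compare physical lines with CSV records. The sketch below uses Python's standard `csv` module on the 1SCL_A entry from above, embedded as a string so that it runs without the Kaggle input directory. The variable names are illustrative.

```python
import csv
import io

# Header line plus the 1SCL_A entry, exactly as they appear in the raw file.
raw = (
    'target_id,sequence,temporal_cutoff,description,all_sequences\n'
    '1SCL_A,GGGUGCUCAGUACGAGAGGAACCGCACCC,1995-01-26,'
    '"THE SARCIN-RICIN LOOP, A MODULAR RNA",'
    '">1SCL_1|Chain A|RNA SARCIN-RICIN LOOP|Rattus norvegicus (10116)\n'
    'GGGUGCUCAGUACGAGAGGAACCGCACCC\n'
    '"\n'
)

physical_lines = raw.splitlines()
rows = list(csv.reader(io.StringIO(raw)))

print(len(physical_lines), "physical lines")
print(len(rows), "CSV records (including the header)")
print(repr(rows[1][4]))  # the quoted, multi-line all_sequences field
```

The lines that tools such as `head` and `grep` show as separate rows are read by a CSV parser as part of one quoted field, so the "|"-separated text belongs to the last column of the preceding record.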

In [1]:
!head -n 25 /kaggle/input/stanford-rna-3d-folding/train_sequences.csv

target_id,sequence,temporal_cutoff,description,all_sequences
1SCL_A,GGGUGCUCAGUACGAGAGGAACCGCACCC,1995-01-26,"THE SARCIN-RICIN LOOP, A MODULAR RNA",">1SCL_1|Chain A|RNA SARCIN-RICIN LOOP|Rattus norvegicus (10116)
GGGUGCUCAGUACGAGAGGAACCGCACCC
"
1RNK_A,GGCGCAGUGGGCUAGCGCCACUCAAAAGGCCCAU,1995-02-27,THE STRUCTURE OF AN RNA PSEUDOKNOT THAT CAUSES EFFICIENT FRAMESHIFTING IN MOUSE MAMMARY TUMOR VIRUS,">1RNK_1|Chain A|RNA PSEUDOKNOT|null
GGCGCAGUGGGCUAGCGCCACUCAAAAGGCCCAU
"
1RHT_A,GGGACUGACGAUCACGCAGUCUAU,1995-06-03,"24-MER RNA HAIRPIN COAT PROTEIN BINDING SITE FOR BACTERIOPHAGE R17 (NMR, MINIMIZED AVERAGE STRUCTURE)",">1RHT_1|Chain A|RNA (5'-R(P*GP*GP*GP*AP*CP*UP*GP*AP*CP*GP*AP*UP*CP*AP*CP*GP*CP*AP*GP*UP*CP*UP*AP*U)-3')|null
GGGACUGACGAUCACGCAGUCUAU
"
1HLX_A,GGGAUAACUUCGGUUGUCCC,1995-09-15,P1 HELIX NUCLEIC ACIDS (DNA/RNA) RIBONUCLEIC ACID,">1HLX_1|Chain A|RNA (5'-R(*GP*GP*GP*AP*UP*AP*AP*CP*UP*UP*CP*GP*GP*UP*UP*GP*UP*CP*CP*C)-3')|null
GGGAUAACUUCGGUUGUCCC
"
1HMH_E,GGCGACCCUGAUGAG

### Inconsistency 1: Some rows begin with ">" and use | as separator

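Some rows begin with ">" and use "|" as separator. Other rows in the file do not start with ">" and use "," as separator. The rows that begin with ">" are FASTA header lines that belong to the multi-line all_sequences column.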

In [2]:
# Entries that begin with ">"
!grep "^>" -A 2 /kaggle/input/stanford-rna-3d-folding/train_sequences.csv | head

>1HMH_2|Chains B, D, F|HAMMERHEAD RIBOZYME-DNA STRAND|
ACGGTCGGTCGCC
"
--
>1MME_2|Chains B, D|RNA HAMMERHEAD RIBOZYME|
GGCCGAAACUCGUAAGAGUCACCAC
"
--
>1BIV_2|Chain B|TAT PEPTIDE|Bovine immunodeficiency virus (11657)
SGPRPRGTRGKGRRIRR
grep: write error: Broken pipe


In [3]:
# Entries that do NOT begin with ">"
!grep "^1SCL" -A 2 /kaggle/input/stanford-rna-3d-folding/train_sequences.csv | head

1SCL_A,GGGUGCUCAGUACGAGAGGAACCGCACCC,1995-01-26,"THE SARCIN-RICIN LOOP, A MODULAR RNA",">1SCL_1|Chain A|RNA SARCIN-RICIN LOOP|Rattus norvegicus (10116)
GGGUGCUCAGUACGAGAGGAACCGCACCC
"


### Inconsistency 2: some entries have a keyword in the second column, in place of the sequence.

For most entries, the sequence appears twice: once in the second column of the first row, and again in the second row.

However, some rows have a text description in the second position, such as "Chain".
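The repetition described above can be checked per entry. The following sketch tests whether the value of the `sequence` column also appears as one of the sequence lines inside `all_sequences`. The function name `sequence_is_repeated` is hypothetical, and the check assumes each FASTA sequence sits on a single line, as in the entries shown in this notebook.

```python
def sequence_is_repeated(sequence, all_sequences):
    """Return True if `sequence` appears as a full sequence line in the
    FASTA-formatted `all_sequences` text.

    Assumption: every FASTA sequence occupies exactly one line.
    """
    fasta_sequences = [
        line.strip()
        for line in all_sequences.splitlines()
        if line.strip() and not line.strip().startswith(">")
    ]
    return sequence in fasta_sequences


# Values taken from the 1SCL_A entry.
sequence = "GGGUGCUCAGUACGAGAGGAACCGCACCC"
all_sequences = (
    ">1SCL_1|Chain A|RNA SARCIN-RICIN LOOP|Rattus norvegicus (10116)\n"
    "GGGUGCUCAGUACGAGAGGAACCGCACCC\n"
)

print(sequence_is_repeated(sequence, all_sequences))
```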

In [4]:
!head /kaggle/input/stanford-rna-3d-folding/train_sequences.csv

target_id,sequence,temporal_cutoff,description,all_sequences
1SCL_A,GGGUGCUCAGUACGAGAGGAACCGCACCC,1995-01-26,"THE SARCIN-RICIN LOOP, A MODULAR RNA",">1SCL_1|Chain A|RNA SARCIN-RICIN LOOP|Rattus norvegicus (10116)
GGGUGCUCAGUACGAGAGGAACCGCACCC
"
1RNK_A,GGCGCAGUGGGCUAGCGCCACUCAAAAGGCCCAU,1995-02-27,THE STRUCTURE OF AN RNA PSEUDOKNOT THAT CAUSES EFFICIENT FRAMESHIFTING IN MOUSE MAMMARY TUMOR VIRUS,">1RNK_1|Chain A|RNA PSEUDOKNOT|null
GGCGCAGUGGGCUAGCGCCACUCAAAAGGCCCAU
"
1RHT_A,GGGACUGACGAUCACGCAGUCUAU,1995-06-03,"24-MER RNA HAIRPIN COAT PROTEIN BINDING SITE FOR BACTERIOPHAGE R17 (NMR, MINIMIZED AVERAGE STRUCTURE)",">1RHT_1|Chain A|RNA (5'-R(P*GP*GP*GP*AP*CP*UP*GP*AP*CP*GP*AP*UP*CP*AP*CP*GP*CP*AP*GP*UP*CP*UP*AP*U)-3')|null
GGGACUGACGAUCACGCAGUCUAU
"


In [5]:
import pandas as pd
import csv

# Read all five columns, including the multi-line "all_sequences" field
df = pd.read_csv(
    "/kaggle/input/stanford-rna-3d-folding/train_sequences.csv",
    engine="python",
    quoting=csv.QUOTE_MINIMAL,
    usecols=["target_id", "sequence", "temporal_cutoff", "description", "all_sequences"]
)

print(df.head())

  target_id                            sequence temporal_cutoff  \
0    1SCL_A       GGGUGCUCAGUACGAGAGGAACCGCACCC      1995-01-26   
1    1RNK_A  GGCGCAGUGGGCUAGCGCCACUCAAAAGGCCCAU      1995-02-27   
2    1RHT_A            GGGACUGACGAUCACGCAGUCUAU      1995-06-03   
3    1HLX_A                GGGAUAACUUCGGUUGUCCC      1995-09-15   
4    1HMH_E  GGCGACCCUGAUGAGGCCGAAAGGCCGAAACCGU      1995-12-07   

                                         description  \
0               THE SARCIN-RICIN LOOP, A MODULAR RNA   
1  THE STRUCTURE OF AN RNA PSEUDOKNOT THAT CAUSES...   
2  24-MER RNA HAIRPIN COAT PROTEIN BINDING SITE F...   
3  P1 HELIX NUCLEIC ACIDS (DNA/RNA) RIBONUCLEIC ACID   
4  THREE-DIMENSIONAL STRUCTURE OF A HAMMERHEAD RI...   

                                       all_sequences  
0  >1SCL_1|Chain A|RNA SARCIN-RICIN LOOP|Rattus n...  
1  >1RNK_1|Chain A|RNA PSEUDOKNOT|null\nGGCGCAGUG...  
2  >1RHT_1|Chain A|RNA (5'-R(P*GP*GP*GP*AP*CP*UP*...  
3  >1HLX_1|Chain A|RNA (5'-R(*GP*GP*GP*A

In [6]:
df.loc[df.target_id==">1HMH_2"]

Unnamed: 0,target_id,sequence,temporal_cutoff,description,all_sequences
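The lookup above returns no rows because `>1HMH_2` is a FASTA header inside the `all_sequences` field, not a value of `target_id`. As a rough sketch of the alternative, the example below searches the `all_sequences` text instead. It uses the standard `csv` module on the 1HMH_E entry embedded as a string, so it runs on its own; the variable names are illustrative.

```python
import csv
import io

# Header line plus the 1HMH_E entry, as they appear in the raw file.
raw = (
    "target_id,sequence,temporal_cutoff,description,all_sequences\n"
    "1HMH_E,GGCGACCCUGAUGAGGCCGAAAGGCCGAAACCGU,1995-12-07,"
    "THREE-DIMENSIONAL STRUCTURE OF A HAMMERHEAD RIBOZYME,"
    '">1HMH_1|Chains A, C, E|HAMMERHEAD RIBOZYME-RNA STRAND|\n'
    "GGCGACCCUGAUGAGGCCGAAAGGCCGAAACCGU\n"
    ">1HMH_2|Chains B, D, F|HAMMERHEAD RIBOZYME-DNA STRAND|\n"
    "ACGGTCGGTCGCC\n"
    '"\n'
)

rows = list(csv.DictReader(io.StringIO(raw)))

# Matching on target_id finds nothing: ">1HMH_2" is never a target_id.
by_target = [r for r in rows if r["target_id"] == ">1HMH_2"]

# Matching on the FASTA text finds the entry that contains that header.
by_header = [r for r in rows if ">1HMH_2" in r["all_sequences"]]

print(len(by_target))
print([r["target_id"] for r in by_header])
```

With the DataFrame loaded earlier, a comparable filter would be `df[df["all_sequences"].str.contains(">1HMH_2", regex=False, na=False)]`, where `na=False` skips entries whose `all_sequences` is blank.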
