# First Workshop



## Information about the Workshop

__Workshop I - Systems Analysis__

_Workshop Definition:_

Welcome to the first workshop of the Systems Analysis course. Let's have some fun with a biological exercise.

Imagine you have been hired as a data analyst at an important biotechnology company. Your boss, the chief science officer, wants to find patterns in genomic data, sometimes called motifs.

To complete this workshop, carry out the following tasks:

1. Create a dummy database of genetic sequences composed of nucleotide bases (A, C, G, T), where each sequence must have between 10 and 20 bases. Your database must contain 50,000 genetic sequences.
2. Get the motifs (most repeated sequences) of sizes 6 and 8.
3. Use the Shannon entropy measure to filter out sequences with a low level of variability.
4. Get the motifs of sizes 6 and 8 again.
5. Write some conclusions based on your analysis.
6. Write down any technical concern, decision, or difficulty you think is relevant to your work.

You must deliver a full report detailing each of the previous steps. For steps 1 to 4 you must describe the algorithms you propose and include a screenshot of the code and its output. I strongly recommend using a Jupyter Notebook or Colab to write and run your code.

## 1° Create the Dummy Database

First, we define a function that creates a sequence using the bases A, C, G, and T.

In [None]:
import random

def create_sequence() -> str:
    """
    This function is used to generate a random genetic sequence.

    Returns:
    - str: random genetic sequence
    """
    nucleotide_bases = ["A", "G", "C", "T"]
    # randint includes both endpoints, so lengths range from 10 to 20.
    size_sequence = random.randint(10, 20)
    new_sequence = [random.choice(nucleotide_bases) for _ in range(size_sequence)]
    return "".join(new_sequence)


Now we create the dummy database using our sequence creator function.

In [None]:
def create_database(size: int) -> list:
    """
    This function is used to create a dataset composed of a set of genetic sequences.

    Parameters:
    - size (int): size of the dummy dataset to be generated.

    Returns:
    - list: a list of genetic sequences
    """
    data_base = [create_sequence() for _ in range(size)]
    return data_base
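
As a quick sanity check on the generator, the sketch below builds a small database with a fixed seed and verifies the length and alphabet constraints from the task statement. The names `make_sequence`, `make_database`, and the `seed` parameter are hypothetical and not part of the workshop code; seeding is only an assumption about how one might make runs reproducible.

```python
import random

BASES = "ACGT"  # same alphabet as the workshop statement

def make_sequence(rng: random.Random) -> str:
    """Return one random sequence of 10 to 20 bases (hypothetical helper)."""
    length = rng.randint(10, 20)  # randint includes both endpoints
    return "".join(rng.choice(BASES) for _ in range(length))

def make_database(size: int, seed: int = 42) -> list:
    """Return `size` random sequences; the same seed yields the same database."""
    rng = random.Random(seed)
    return [make_sequence(rng) for _ in range(size)]

db = make_database(1_000)

# Every sequence should respect the length bounds and the alphabet.
print(all(10 <= len(seq) <= 20 for seq in db))
print(set("".join(db)) <= set(BASES))
```

Using a dedicated `random.Random` instance keeps the seed local, so other cells that call the global `random` functions are not affected.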


In [None]:
def get_combination(n: int, sequences: list, bases: list) -> list:
    """
    This function is used to generate a set of combinations based on a list of nucleotide bases.
    To keep the process simple, the function is defined recursively.

    Parameters:
    - n (int): number of elements in each combination
    - sequences (list): list of partial sequences built so far by the recursion
    - bases (list): list of nucleotide bases to be used

    Returns:
    - list: list of combinations
    """
    if n == 1:
        return [sequence + base for sequence in sequences for base in bases]

    else:
        sequence = [sequence + base for sequence in sequences for base in bases]
        return get_combination(n-1, sequence, bases)

def count_motif(motif: str, sequences_db: list) -> int:
    """
    This function is used to count the number of times a motif appears in a set of genetic sequences.

    Parameters:
    - motif (str): genetic motif to be searched.
    - sequences_db (list): list of genetic sequences.

    Returns:
    - int: number of times the motif appears in the dataset.
    """
    count = 0
    for sequence in sequences_db:
        # Note: str.count only counts non-overlapping occurrences of the motif.
        count += sequence.count(motif)
    return count

def get_motif(motif_size: int, sequences_db: list):
    """
    This function is used to get the motif with the highest count in a set of genetic sequences.

    Parameters:
    - motif_size (int): size of the motif to be searched.
    - sequences_db (list): list of genetic sequences.

    Returns:
    - (str, int): motif with the highest count and the number of times it appears in the dataset.
    """
    nucleotide_bases = ["A", "C", "G", "T"]
    # Enumerate all 4**motif_size candidate motifs and keep the most frequent one.
    combinations = get_combination(motif_size, [""], nucleotide_bases)
    max_counter = 0
    motif_winner = ""

    for motif_candidate in combinations:
        temp_counter = count_motif(motif_candidate, sequences_db)
        if temp_counter > max_counter:
            max_counter = temp_counter
            motif_winner = motif_candidate

    return motif_winner, max_counter
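
The search above tests every one of the `4**motif_size` candidates against the whole database, which becomes slow for size 8 (65,536 candidates). A possible alternative, sketched here under a hypothetical name `most_common_motif`, is to slide a window over each sequence and tally the substrings that actually occur. Be aware that this is a different technique from the one used above: it counts overlapping occurrences, while `str.count` does not, so the two can disagree on self-overlapping motifs such as `AAAAAA`.

```python
from collections import Counter

def most_common_motif(motif_size: int, sequences_db: list) -> tuple:
    """Return (motif, count) for the most frequent substring of length motif_size."""
    counts = Counter()
    for sequence in sequences_db:
        # Each window of motif_size consecutive bases is one candidate motif.
        for start in range(len(sequence) - motif_size + 1):
            counts[sequence[start:start + motif_size]] += 1
    if not counts:
        # No sequence is long enough to contain a motif of this size.
        return "", 0
    return counts.most_common(1)[0]

print(most_common_motif(6, ["ACGTACGTAC", "TTACGTACGG"]))  # -> ('ACGTAC', 3)
```

This makes a single pass over the data, so the cost depends on the total number of bases rather than on the number of possible motifs.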

## Shannon Entropy

In [None]:
# Task
# Info: Compute the Shannon entropy of every sequence and filter based on a threshold value (of the entropy),
# then compute the motif again

In [None]:
for size in [6, 8]:
    print(f"\nMotifs of size: {size}")
    for i in range(10):
        print(get_motif(size, create_database(50000)))


Motifs of size: 6
('GCATTT', 164)
('CTGGCC', 161)
('CCTTCT', 164)
('TACAGC', 166)
('TTCACG', 169)
('ATGCTC', 164)
('CGAGGT', 166)
('TCGTGA', 179)
('GAGACT', 162)
('CGATAT', 168)

Motifs of size: 8
('CATGCAAT', 21)
('ACAATTAT', 18)
('TCAGATAC', 19)
('TATCATCA', 19)
('TCCTCTTT', 19)
('ATTCTCCT', 21)
('GCAGGGTA', 21)
('AAGGATTG', 20)
('ACAGCGAA', 18)
('TTTCTCTG', 20)
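
To put these counts in context, the sketch below estimates how often any single fixed motif is expected to appear when bases are drawn uniformly at random, as the generator above does. The function name `expected_motif_count` is hypothetical, and the estimate counts every window (including overlapping ones), so it is only an approximation of what `count_motif` reports.

```python
def expected_motif_count(motif_size: int, n_sequences: int = 50_000,
                         min_len: int = 10, max_len: int = 20) -> float:
    """Expected occurrences of one fixed motif in a uniformly random database."""
    lengths = range(min_len, max_len + 1)
    # A sequence of length L contains L - motif_size + 1 windows (never fewer than 0).
    mean_windows = sum(max(length - motif_size + 1, 0) for length in lengths) / len(lengths)
    # Each window matches a given motif with probability 1 / 4**motif_size.
    return n_sequences * mean_windows / 4 ** motif_size

print(expected_motif_count(6))  # -> 122.0703125
print(expected_motif_count(8))  # -> 6.103515625
```

Under these assumptions a typical motif appears about 122 times at size 6 and about 6 times at size 8, so winners in the 160s and around 20 look like the upper tail of random variation across thousands of candidates, which would also explain why a different motif wins in every run.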


In [None]:
from math import log2
from collections import Counter

def calculate_shannon_entrophy(sequence: str) -> float:
    """
    This function is used to calculate the Shannon Entropy of a genetic sequence.

    Parameters:
    - sequence (str): genetic sequence.

    Returns:
    - float: Shannon Entropy of the sequence.
    """
    length_sequence = len(sequence)
    base_counts = Counter(sequence)

    # Equivalent to -sum(p * log2(p)) with p = count / length, rewritten in terms of raw counts.
    entropy = log2(length_sequence) - sum(count * log2(count) for count in base_counts.values()) / length_sequence

    return entropy
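
The expression in the function is the usual Shannon entropy written with raw counts instead of probabilities. A short derivation, where the symbols $c_i$ (count of base $i$) and $N$ (sequence length) are notation introduced here for illustration:

```latex
\begin{aligned}
H &= -\sum_{i} p_i \log_2 p_i, \qquad p_i = \frac{c_i}{N} \\
  &= -\sum_{i} \frac{c_i}{N}\left(\log_2 c_i - \log_2 N\right) \\
  &= \frac{\log_2 N}{N}\sum_{i} c_i \;-\; \frac{1}{N}\sum_{i} c_i \log_2 c_i \\
  &= \log_2 N - \frac{1}{N}\sum_{i} c_i \log_2 c_i
\end{aligned}
```

The last step uses $\sum_i c_i = N$. With four bases, $H$ ranges from 0 bits (a single repeated base) to 2 bits (all four bases equally frequent).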


In [None]:
def filter_shannon(sequence: str) -> bool:
    """
    This function is used to filter genetic sequences based on their Shannon Entropy.

    Parameters:
    - sequence (str): genetic sequence.

    Returns:
    - bool: True if the sequence passes the filter, False otherwise.
    """

    # With four bases the entropy lies between 0 and 2 bits; keep sequences above 1.5 bits.
    return calculate_shannon_entrophy(sequence) > 1.5
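
To get a feel for what a 1.5-bit threshold removes, here is a small self-contained sketch on hand-written sequences. The function `shannon_entropy` is a hypothetical stand-in that uses the textbook probability form, and the example sequences are made up for illustration.

```python
from collections import Counter
from math import log2

def shannon_entropy(sequence: str) -> float:
    """Entropy in bits of the base composition of a sequence."""
    n = len(sequence)
    return -sum((c / n) * log2(c / n) for c in Counter(sequence).values())

examples = [
    "AAAAAAAAAA",    # one base only: 0 bits
    "AAAAACCCCC",    # two bases, evenly split: 1 bit
    "AAAAAAAACG",    # dominated by one base: about 0.92 bits
    "AACCGGTTAA",    # all four bases, slightly skewed: about 1.92 bits
    "ACGTACGTACGT",  # all four bases, evenly split: 2 bits
]

# Keep only the sequences whose entropy exceeds the threshold.
kept = [seq for seq in examples if shannon_entropy(seq) > 1.5]
print(kept)  # -> ['AACCGGTTAA', 'ACGTACGTACGT']
```

Sequences dominated by one or two bases fall below the threshold, while sequences that use all four bases fairly evenly pass it.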

In [None]:
for size in [6, 8]:
    print(f"\nAfter filter, motifs of size: {size}")
    for i in range(10):
        dataset = create_database(50000)
        dataset = list(filter(filter_shannon, dataset))
        print(f"Dataset size: {len(dataset)}, Motif: {get_motif(size, dataset)}")


After filter, motifs of size: 6
Dataset size: 47739, Motif: ('GCTAGG', 164)
Dataset size: 47675, Motif: ('CCGGCA', 176)
Dataset size: 47696, Motif: ('CGAACA', 159)
Dataset size: 47614, Motif: ('ATAGAT', 160)
Dataset size: 47545, Motif: ('CCAGTT', 159)
Dataset size: 47727, Motif: ('GAGTCT', 157)
Dataset size: 47707, Motif: ('CATGAC', 164)
Dataset size: 47616, Motif: ('ATACGC', 160)
Dataset size: 47680, Motif: ('ATACGA', 163)
Dataset size: 47668, Motif: ('TCGATA', 159)

After filter, motifs of size: 8
Dataset size: 47611, Motif: ('TATGACTT', 19)
Dataset size: 47664, Motif: ('CCCGGTTT', 18)
Dataset size: 47628, Motif: ('ACCATTAG', 18)
Dataset size: 47657, Motif: ('AAGGGTCT', 19)
Dataset size: 47707, Motif: ('AATCAACC', 19)
Dataset size: 47634, Motif: ('TACCAGCA', 18)
Dataset size: 47677, Motif: ('GCGCGAAG', 18)
Dataset size: 47616, Motif: ('CCAAGCTC', 20)
Dataset size: 47696, Motif: ('CGTTAAGC', 18)
Dataset size: 47722, Motif: ('AGAAATCC', 18)
