## Homework 8:

__Exercise 1.__ Go to PROSITE and find the Gamma-glutamyl phosphate reductase signature. Write a script to detect all yeast proteins that have this signature. Your script should output a dataframe with the following info about each protein: accession number, the match to the motif, the span of the motif, and the protein's description.

**NOTE:** You can just copy the file with all the yeast proteins from last class into the current directory; you don't need to download it again.

```
[VA]-x(5)-A-[LIVAMTCK]-x-[HWFY]-[IM]-x(2)-[HYWNRFT]-[GSNT]-[STAG]-x(0,1)-H-[ST]-[DE]-x(1,2)-I
```
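
The PROSITE notation above maps onto a regular expression almost element by element: `x` becomes `.`, a count such as `(5)` or `(0,1)` becomes `{5}` or `{0,1}`, and `{P}` (any residue except P) becomes `[^P]`. The following is a minimal sketch of that translation, not an official converter. The helper name `prosite_to_regex` is hypothetical, and the function assumes only the common pattern elements listed here (single residues, `x`, bracketed sets, braced exclusions, counts, and the `<` / `>` anchors).

```python
import re

# One PROSITE element: optional "<" anchor, a residue / x / [set] / {exclusion},
# an optional repeat count such as (5) or (0,1), and an optional ">" anchor.
ELEMENT = re.compile(
    r'(<?)([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?(>?)'
)

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regex (hypothetical helper)."""
    parts = []
    # PROSITE patterns may end with a period; elements are separated by "-"
    for element in pattern.strip().rstrip('.').split('-'):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f'Unsupported PROSITE element: {element!r}')
        start, body, count, end = m.groups()
        if body == 'x':
            body = '.'                        # x = any residue
        elif body.startswith('{'):
            body = '[^' + body[1:-1] + ']'    # {P} = anything except P
        repeat = '{' + count + '}' if count else ''
        parts.append(('^' if start else '') + body + repeat + ('$' if end else ''))
    return ''.join(parts)

signature = ('[VA]-x(5)-A-[LIVAMTCK]-x-[HWFY]-[IM]-x(2)-[HYWNRFT]-'
             '[GSNT]-[STAG]-x(0,1)-H-[ST]-[DE]-x(1,2)-I')
print(prosite_to_regex(signature))
```

Translating the pattern programmatically rather than by hand reduces the chance of dropping or duplicating a residue when a signature is long.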

In [1]:
import re
import pandas as pd
from Bio import SeqIO  # Biopython is installed as "biopython" but imported as "Bio"

# Load yeast protein sequences from the FASTA file
yeast_proteins = SeqIO.parse("testDNA.fasta", "fasta")

# Define the regular expression pattern for the Gamma-glutamyl phosphate reductase signature
pattern = r'[VA].{5}A[LIVAMTCK].[HWFY][IM].{2}[HYWNRFT][GSNT][STAG].{0,1}H[ST][DE].{1,2}I'

# Create lists to store information about matching proteins
accession_numbers = []
matches = []
motif_spans = []
descriptions = []

# Iterate over each protein sequence
for protein in yeast_proteins:
    sequence = str(protein.seq)
    
    # Search for the first occurrence of the pattern in the sequence
    match = re.search(pattern, sequence)
    
    # If a match is found, extract relevant information
    if match:
        accession_numbers.append(protein.id)
        matches.append(match.group())
        # Span is 0-based (start, end), with end exclusive
        motif_spans.append((match.start(), match.end()))
        descriptions.append(protein.description)

# Create a DataFrame with the extracted information
df = pd.DataFrame({
    'Accession Number': accession_numbers,
    'Match to the Motif': matches,
    'Span of the Motif': motif_spans,
    'Description': descriptions
})

print(df)

__Exercise 2.__ Now do the same for the Hexapeptide-repeat containing-transferases signature.

```
[LIV]-[GAED]-x(2)-[STAV]-x-[LIV]-x(3)-[LIVAC]-x-[LIV]-[GAED]-x(2)-[STAVR]-x-[LIV]-[GAED]-x(2)-[STAV]-x-[LIV]-x(3)-[LIV]
```

In [None]:
import re
import pandas as pd
from Bio import SeqIO  # Biopython is installed as "biopython" but imported as "Bio"

# Load yeast protein sequences from the FASTA file
yeast_proteins = SeqIO.parse("testDNA.fasta", "fasta")

# Define the regular expression pattern for the Hexapeptide-repeat containing-transferases signature
pattern = r'[LIV][GAED].{2}[STAV].{1}[LIV].{3}[LIVAC].{1}[LIV][GAED].{2}[STAVR].{1}[LIV][GAED].{2}[STAV].{1}[LIV].{3}[LIV]'

# Create lists to store information about matching proteins
accession_numbers = []
matches = []
motif_spans = []
descriptions = []

# Iterate over each protein record (SeqIO.parse yields SeqRecord objects, not a dict)
for protein in yeast_proteins:
    sequence = str(protein.seq)
    
    # Search for the first occurrence of the pattern in the sequence
    match = re.search(pattern, sequence)
    
    # If a match is found, extract relevant information
    if match:
        accession_numbers.append(protein.id)
        matches.append(match.group())
        # Span is 0-based (start, end), with end exclusive
        motif_spans.append((match.start(), match.end()))
        descriptions.append(protein.description)

# Create a DataFrame with the extracted information
df = pd.DataFrame({
    'Accession Number': accession_numbers,
    'Match to the Motif': matches,
    'Span of the Motif': motif_spans,
    'Description': descriptions
})

print(df)

__Exercise 3.__ Now find the 14-3-3 protein signatures. The 14-3-3 proteins seem to have multiple biological activities and play a key role in signal transduction pathways and the cell cycle. The PROSITE database uses two motifs to determine members of this family.

Write a script to search for proteins in yeast that have both motifs in either order. You should find two proteins.

Your script should show a dataframe with the following info about each protein: accession number, match to the first motif, span of the first motif, match to the second motif, span of the second motif, and the protein's description.

Your regex doesn't need to match the motifs in the reverse order to identify both yeast proteins. For the purpose of this exercise, however, I would like you to write a regex that could also identify a protein in which the motifs appear in the reverse order.

```
[RA]-N-L-[LIV]-S-[VG]-[GA]-Y-[KN]-N-[IVA]
```

and

```
Y-K-[DE]-[SG]-T-L-I-[IML]-Q-L-[LF]-[RHC]-D-N-[LF]-T-[LS]-W-[TANS]-[SAD]
```

In [None]:
import re
import pandas as pd

# Function to parse FASTA file
def parse_fasta(file_path):
    sequences = {}
    with open(file_path, "r") as file:
        current_id = None
        current_sequence = ""
        for line in file:
            if line.startswith(">"):
                if current_id:
                    sequences[current_id] = current_sequence
                current_id = line.strip()[1:]
                current_sequence = ""
            else:
                current_sequence += line.strip()
        if current_id:
            sequences[current_id] = current_sequence
    return sequences

# Load yeast protein sequences from the FASTA file
yeast_proteins = parse_fasta("testDNA.fasta")

# Define the regular expression patterns for the 14-3-3 protein motifs
# (each position follows the PROSITE patterns given above)
pattern1 = r'[RA]NL[LIV]S[VG][GA]Y[KN]N[IVA]'
pattern2 = r'YK[DE][SG]TLI[IML]QL[LF][RHC]DN[LF]T[LS]W[TANS][SAD]'

# Create lists to store information about matching proteins
accession_numbers = []
matches1 = []
motif_spans1 = []
matches2 = []
motif_spans2 = []
descriptions = []

# Iterate over each protein sequence (keys are the full FASTA header lines)
for header, sequence in yeast_proteins.items():
    # Search for the first motif in the sequence
    match1 = re.search(pattern1, sequence)
    
    # Search for the second motif in the sequence
    match2 = re.search(pattern2, sequence)
    
    # If both motifs are found (in either order), extract relevant information
    if match1 and match2:
        # Assumption: the accession is the first whitespace-delimited token of the header
        accession_numbers.append(header.split()[0])
        matches1.append(match1.group())
        motif_spans1.append((match1.start(), match1.end()))
        matches2.append(match2.group())
        motif_spans2.append((match2.start(), match2.end()))
        # Use the full header line as the description
        descriptions.append(header)

# Create a DataFrame with the extracted information
df = pd.DataFrame({
    'Accession Number': accession_numbers,
    'Match to Motif 1': matches1,
    'Span of Motif 1': motif_spans1,
    'Match to Motif 2': matches2,
    'Span of Motif 2': motif_spans2,
    'Description': descriptions
})

print(df)
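
The exercise also asks for a single regex that would catch the two motifs in either order. One possible sketch, assuming nothing beyond Python's `re` module, joins the two orderings with `|`. Because Python does not allow the same group name twice in one pattern, each ordering gets its own group names; the names `m1_first`, `m2_second`, `m2_first`, `m1_second` and the helper `find_both_motifs` are hypothetical, and the two test sequences are made-up strings built only to satisfy the motifs.

```python
import re

MOTIF1 = r'[RA]NL[LIV]S[VG][GA]Y[KN]N[IVA]'
MOTIF2 = r'YK[DE][SG]TLI[IML]QL[LF][RHC]DN[LF]T[LS]W[TANS][SAD]'

# Either motif 1 followed (anywhere later) by motif 2, or the reverse order.
BOTH = re.compile(
    rf'(?P<m1_first>{MOTIF1}).*?(?P<m2_second>{MOTIF2})'
    rf'|(?P<m2_first>{MOTIF2}).*?(?P<m1_second>{MOTIF1})'
)

def find_both_motifs(sequence):
    """Return matches and spans for both motifs, or None if either is missing."""
    match = BOTH.search(sequence)
    if match is None:
        return None
    # Only the groups of the alternative that matched are non-empty
    name1 = 'm1_first' if match.group('m1_first') else 'm1_second'
    name2 = 'm2_second' if match.group('m2_second') else 'm2_first'
    return {
        'motif1': match.group(name1), 'span1': match.span(name1),
        'motif2': match.group(name2), 'span2': match.span(name2),
    }

# Made-up sequences for illustration only (not real yeast proteins)
forward = 'MM' + 'RNLLSVAYKNV' + 'GGG' + 'YKDSTLIMQLLRDNLTLWTS' + 'KK'
reverse = 'MM' + 'YKDSTLIMQLLRDNLTLWTS' + 'GGG' + 'RNLLSVAYKNV' + 'KK'
print(find_both_motifs(forward))
print(find_both_motifs(reverse))
```

Running two separate `re.search` calls, as in the cell above, is simpler and also order-independent; the combined pattern is shown only because the exercise asks for one regex.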


__Exercise 4.__ Parsing and extracting data from a URL:

This is from the tutorial that you should have completed.

When working with files and resources over a network, you will often come across URIs and URLs, which can be parsed and worked with directly. Most standard libraries have classes to parse and construct these kinds of identifiers, but if you need to match them in logs or a larger corpus of text, you can use regular expressions to pull information out of their structured format quite easily.

URIs, or Uniform Resource Identifiers, are a representation of a resource that is generally composed of a scheme, host, port (optional), and resource path, in that order, as in the example below.

http://regexone.com:80/page

The scheme describes the protocol to communicate with, the host and port describe the source of the resource, and the full path describes the location at the source for the resource.
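
For comparison with the regex approach used in this exercise, here is a minimal sketch of how Python's standard library splits the same example URI into those components, using `urllib.parse.urlsplit`.

```python
from urllib.parse import urlsplit

# Split the example URI into its components
parts = urlsplit("http://regexone.com:80/page")

print(parts.scheme)    # http
print(parts.hostname)  # regexone.com
print(parts.port)      # 80 (an int, or None when no port is given)
print(parts.path)      # /page
```

A parser like this is the better choice when each URL is already isolated; a regex is useful when the URLs are embedded in a larger block of text.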

In the exercise below, try to extract the protocol, host, and port of all the resources listed in this string.

```
ftp://file_server.com:21/top_secret/life_changing_plans.pdf
https://regexone.com/lesson/introduction#section
file://localhost:4040/zip_file
https://s3cur3-server.com:9999/
market://search/angry%20birds
```

You can work interactively here: https://regexone.com/problem/extracting_url_data to find the right regular expression, then use re.finditer to create a dataframe with columns protocol, host and port for each of the matches in the string.

In [1]:
import re
import pandas as pd

# Define the URLs
urls = """
ftp://file_server.com:21/top_secret/life_changing_plans.pdf
https://regexone.com/lesson/introduction#section
file://localhost:4040/zip_file
https://s3cur3-server.com:9999/
market://search/angry%20birds
"""

# Define the regular expression pattern
pattern = r'(?P<protocol>\w+):\/\/(?P<host>[\w.-]+)(?::(?P<port>\d+))?\/?'

# Find all matches in the URLs
matches = re.finditer(pattern, urls)

# Create lists to store extracted data
protocols = []
hosts = []
ports = []

# Extract data and append to lists
for match in matches:
    protocols.append(match.group('protocol'))
    hosts.append(match.group('host'))
    ports.append(match.group('port'))

# Create a DataFrame
df = pd.DataFrame({
    'protocol': protocols,
    'host': hosts,
    'port': ports
})

print(df)


  protocol               host  port
0      ftp    file_server.com    21
1    https       regexone.com  None
2     file          localhost  4040
3    https  s3cur3-server.com  9999
4   market             search  None
