(The spirit when programming: "*Why spend 5 days doing some work when you can spend 5 weeks automating it!*")

# Creating your own kernel / environment

Create a Python environment (where packages are installed):
- Open a terminal
- `python3 -m venv bioinfo`
- `source bioinfo/bin/activate`

Link it to a Jupyter kernel (so that your code executes with it):
- `pip install ipykernel`
- `python3 -m ipykernel install --user --name bioinfo --display-name "Python (bioinfo)"`

Then you can install new packages:
- `pip install pandas`

In your notebook, switch to this new kernel with `Kernel` -> `Change kernel`. It is then safest to run `Kernel` -> `Restart & clear output`. Make sure this new kernel is the active one (shown in the upper right corner of your screen, just below the `Control Panel` button).
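
To confirm that the notebook really runs on the new environment, one quick check (a sketch, not part of the original notebook) is to print the interpreter path from a cell. With the `Python (bioinfo)` kernel active, the path would be expected to point inside the `bioinfo` directory created above.

```python
import sys

# Path of the Python interpreter running this kernel.
# With the "Python (bioinfo)" kernel active, it should point inside bioinfo/.
interpreter_path = sys.executable
print(interpreter_path)

# In a venv, sys.prefix (active environment) differs from sys.base_prefix (base install).
in_virtual_env = sys.prefix != sys.base_prefix
print("Running inside a virtual environment:", in_virtual_env)
```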

# Working with real data

In [75]:
import pandas as pd
# "from pandas import *" would also work, but it floods the namespace;
# the "pd" alias is the usual convention

## Read from / write to TSV and CSV files (in and out of Excel / R)

(Doc: https://pandas.pydata.org/docs/reference/api/pandas.read_table.html#pandas.read_table)

In [92]:
df = pd.read_table("kmers.tsv")
# Count is the number of reads that contain each sequence
# the "df =" stores the table in a variable instead of just displaying it

(Doc: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html)

In [93]:
df.to_csv("test.csv")
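
By default `to_csv` also writes the row index as a first, unnamed column, which comes back as `Unnamed: 0` when the file is read again. The sketch below illustrates this on a small made-up table (the values are illustrative, not taken from `kmers.tsv`) and uses in-memory text instead of files so that it runs anywhere.

```python
import io
import pandas as pd

# Toy table standing in for kmers.tsv (values are illustrative only).
toy = pd.DataFrame({"Seq": ["AAAA", "CAGG"], "Id": [1, 2], "Count": [10, 3]})

# With no path given, to_csv() returns the text; the index is written as an unnamed column.
with_index = toy.to_csv()
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))  # the index comes back as an "Unnamed: 0" column

# index=False leaves the index out; sep="\t" writes a TSV instead of a CSV.
tsv_text = toy.to_csv(sep="\t", index=False)
roundtrip = pd.read_table(io.StringIO(tsv_text))
print(roundtrip)
```

Passing `index=False` is usually what you want when the file is meant for Excel or R.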

## Dataframe manipulation

(.head(), .tail(), .shape, .Col, sum(), len(), .describe(), ["Col"], .drop())

In [94]:
df.head()

Unnamed: 0,Seq,Id,Count
0,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,1,113422
1,CAGGACTCCAATATAGAGATAAGTTAATGTC,2,93
2,TATGTAATTGGTTCCAGTGTGAGTCATTAAA,3,5
3,GATATTTTCGAAAAGTGGGATTTTTTAAACC,4,88
4,CTCCATCTCAGGTATTAGAATGAATGCTTAC,5,7


In [95]:
df.tail()

Unnamed: 0,Seq,Id,Count
3995,AGCTGCAGGAACTCCCTCGTCACAGCTTAAA,3996,5
3996,CTGAGCTCTCTGGGAAAGTCGTGTTCCGGAA,3997,5
3997,GTCTGCCTTTATGGCCTTTGTACTCAAAGAA,3998,10
3998,AGACTATAGTGAGCTCAGGTGATTGATACTC,3999,7
3999,AAACCCAGTCACTGGACACCTAAGTGTCCAC,4000,11


In [96]:
df.shape[0]
# .shape is an attribute (no parentheses) holding (rows, columns); [0] is the number of rows (4000 sequences)

4000

In [97]:
df.columns
# lists the column names, handy for finding one column among many

Index(['Seq', 'Id', 'Count'], dtype='object')

In [98]:
df.Count

0       113422
1           93
2            5
3           88
4            7
         ...  
3995         5
3996         5
3997        10
3998         7
3999        11
Name: Count, Length: 4000, dtype: int64

In [99]:
sum(df.Count)
# df.Count extracts the column; sum() adds up all of its values

372648

In [100]:
sum(df.Count)/len(df.Count)
#calculates average

93.162
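
The built-in `sum()` and `len()` work on a column, but pandas columns also carry their own `.sum()` and `.mean()` methods, which give the same results and are typically faster on large tables. A minimal sketch, using only the first five counts shown above:

```python
import pandas as pd

# First five values of the Count column from the table above.
counts = pd.Series([113422, 93, 5, 88, 7], name="Count")

total = counts.sum()     # same result as sum(counts)
average = counts.mean()  # same result as sum(counts) / len(counts)
print(total, average)
```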

In [101]:
df.describe()
# per-column summary: count, mean, standard deviation, minimum, quartiles, maximum

Unnamed: 0,Id,Count
count,4000.0,4000.0
mean,2000.5,93.162
std,1154.844867,1804.997654
min,1.0,5.0
25%,1000.75,7.0
50%,2000.5,14.0
75%,3000.25,42.0
max,4000.0,113422.0


In [102]:
df["Seq"]
# another way to extract a column: give its name as a string
# (required when the name starts with a digit or contains a space)

0       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
1       CAGGACTCCAATATAGAGATAAGTTAATGTC
2       TATGTAATTGGTTCCAGTGTGAGTCATTAAA
3       GATATTTTCGAAAAGTGGGATTTTTTAAACC
4       CTCCATCTCAGGTATTAGAATGAATGCTTAC
                     ...               
3995    AGCTGCAGGAACTCCCTCGTCACAGCTTAAA
3996    CTGAGCTCTCTGGGAAAGTCGTGTTCCGGAA
3997    GTCTGCCTTTATGGCCTTTGTACTCAAAGAA
3998    AGACTATAGTGAGCTCAGGTGATTGATACTC
3999    AAACCCAGTCACTGGACACCTAAGTGTCCAC
Name: Seq, Length: 4000, dtype: object

In [103]:
df["Test"] = 23
# creates a new column; every row gets the value 23

In [104]:
df

Unnamed: 0,Seq,Id,Count,Test
0,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,1,113422,23
1,CAGGACTCCAATATAGAGATAAGTTAATGTC,2,93,23
2,TATGTAATTGGTTCCAGTGTGAGTCATTAAA,3,5,23
3,GATATTTTCGAAAAGTGGGATTTTTTAAACC,4,88,23
4,CTCCATCTCAGGTATTAGAATGAATGCTTAC,5,7,23
...,...,...,...,...
3995,AGCTGCAGGAACTCCCTCGTCACAGCTTAAA,3996,5,23
3996,CTGAGCTCTCTGGGAAAGTCGTGTTCCGGAA,3997,5,23
3997,GTCTGCCTTTATGGCCTTTGTACTCAAAGAA,3998,10,23
3998,AGACTATAGTGAGCTCAGGTGATTGATACTC,3999,7,23


In [105]:
df.Count/sum(df.Count)
# fraction of all counted reads that each sequence represents (e.g. the poly-A sequence in row 0)
# the whole column is divided, entry by entry, by the sum of the column; the result is a whole new column

0       0.304368
1       0.000250
2       0.000013
3       0.000236
4       0.000019
          ...   
3995    0.000013
3996    0.000013
3997    0.000027
3998    0.000019
3999    0.000030
Name: Count, Length: 4000, dtype: float64

In [106]:
df["frac"] = df.Count / sum(df.Count) * 100
# store the result in the dataframe; because of the * 100, "frac" is a percentage

In [107]:
df

Unnamed: 0,Seq,Id,Count,Test,frac
0,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,1,113422,23,30.436766
1,CAGGACTCCAATATAGAGATAAGTTAATGTC,2,93,23,0.024957
2,TATGTAATTGGTTCCAGTGTGAGTCATTAAA,3,5,23,0.001342
3,GATATTTTCGAAAAGTGGGATTTTTTAAACC,4,88,23,0.023615
4,CTCCATCTCAGGTATTAGAATGAATGCTTAC,5,7,23,0.001878
...,...,...,...,...,...
3995,AGCTGCAGGAACTCCCTCGTCACAGCTTAAA,3996,5,23,0.001342
3996,CTGAGCTCTCTGGGAAAGTCGTGTTCCGGAA,3997,5,23,0.001342
3997,GTCTGCCTTTATGGCCTTTGTACTCAAAGAA,3998,10,23,0.002683
3998,AGACTATAGTGAGCTCAGGTGATTGATACTC,3999,7,23,0.001878


In [108]:
o = df.Count >= 100
# which sequences are highly represented?
# >= is applied to each element of the column; drop the "o =" to display the result directly

In [109]:
sum(o)
# how many sequences have a count of at least 100 (True counts as 1)

543

In [110]:
o
# o is a mask: one True/False per row, telling which rows to keep
# store the mask in a variable, then use it to extract values from the dataframe

0        True
1       False
2       False
3       False
4       False
        ...  
3995    False
3996    False
3997    False
3998    False
3999    False
Name: Count, Length: 4000, dtype: bool

In [111]:
df.Seq[o]
# what are those sequences?
# df.Seq is the whole column of sequences; [o] keeps only the rows where the mask is True

0       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
6       CTTCCATGGCTGTCCGGATCGCCGCACTGCA
7       GCACCAGGCCTTTCTCTAGAAGTCCTGAGAC
11      ATCAATCGACTCAGATGATCAGTTTTGGTAG
15      GGCCTGGGCTGGAAACAGCTCTGTGTGTGAA
                     ...               
3969    AGTTTTCTAAAAAGGGGGAGAGTTGTGAAAG
3973    ATTATCTGGGCGTGGTGGCATGTGCCTGTAG
3975    CCTATGCTTTCCTTGGCATCGGCTACACATC
3987    AAGGGTGTCCTGCTCCTTGACCACGATGGGG
3994    AACCCAAGGAAAGAGAAATGCTGGGGTGTAT
Name: Seq, Length: 543, dtype: object
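
The mask idea generalises: masks can be combined, and `.loc[mask, column]` selects rows and a column in one step. A small sketch on a made-up four-row table (sequences and counts are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "Seq": ["AAAA", "CAGG", "TATG", "GATA"],
    "Count": [113422, 93, 5, 299],
})

high = toy.Count >= 100        # boolean Series, one True/False per row
n_high = high.sum()            # True counts as 1, so this is the number of rows kept

# .loc[mask, column] selects the masked rows and one column at once.
high_seqs = toy.loc[high, "Seq"]

# Conditions combine with & (and), | (or), ~ (not); each needs its own parentheses.
mid = toy[(toy.Count >= 50) & (toy.Count < 1000)]
print(n_high, list(high_seqs), list(mid.Seq))
```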

(Arithmetic on columns)

## Guided exercise(s) here...

### 1) Add a column with nucleotide count (A)
([loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)[*row*,*col*], .[apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html)())

In [200]:
def count_nuc(my_seq, nuc):
    # count how many characters of my_seq are equal to nuc
    total = 0
    for c in my_seq:
        if c == nuc:
            total = total + 1
    return total

In [201]:
count_nuc(df.Seq[1],"A")

12
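
Writing the loop by hand is good practice, but Python strings already have a `.count()` method, and pandas exposes the same operation for a whole column through `.str.count()`. The sketch below uses the second sequence of the table plus a poly-A sequence; note that `.str.count()` treats its argument as a regular expression, which makes no difference for a single letter.

```python
import pandas as pd

seq = "CAGGACTCCAATATAGAGATAAGTTAATGTC"  # second sequence of the table (row index 1)
n_a = seq.count("A")                     # built-in equivalent of count_nuc(seq, "A")
print(n_a)

toy = pd.DataFrame({"Seq": ["A" * 31, seq]})
# .str.count applies the count to every entry of the column at once.
toy["Count A"] = toy.Seq.str.count("A")
print(toy)
```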

In [202]:
def count_A(row):
    return count_nuc(row.Seq,"A")

In [203]:
df["Count A"]=df.apply(count_A,axis=1)
df

Unnamed: 0,Seq,Id,Count,Test,frac,Count A,pGC
0,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,1,113422,23,30.436766,31,0.000000
1,CAGGACTCCAATATAGAGATAAGTTAATGTC,2,93,23,0.024957,12,35.483871
2,TATGTAATTGGTTCCAGTGTGAGTCATTAAA,3,5,23,0.001342,9,32.258065
3,GATATTTTCGAAAAGTGGGATTTTTTAAACC,4,88,23,0.023615,10,29.032258
4,CTCCATCTCAGGTATTAGAATGAATGCTTAC,5,7,23,0.001878,9,38.709677
...,...,...,...,...,...,...,...
3995,AGCTGCAGGAACTCCCTCGTCACAGCTTAAA,3996,5,23,0.001342,9,51.612903
3996,CTGAGCTCTCTGGGAAAGTCGTGTTCCGGAA,3997,5,23,0.001342,6,54.838710
3997,GTCTGCCTTTATGGCCTTTGTACTCAAAGAA,3998,10,23,0.002683,7,41.935484
3998,AGACTATAGTGAGCTCAGGTGATTGATACTC,3999,7,23,0.001878,9,41.935484


(First step: we'll first try to apply it to the second row, counting As only)

(Doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)

In [204]:
df2 = df.loc[o]
df2
# with .loc and a mask you get the whole rows, not just one column

Unnamed: 0,Seq,Id,Count,Test,frac,Count A,pGC
0,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,1,113422,23,30.436766,31,0.000000
6,CTTCCATGGCTGTCCGGATCGCCGCACTGCA,7,299,23,0.080237,4,64.516129
7,GCACCAGGCCTTTCTCTAGAAGTCCTGAGAC,8,128,23,0.034349,7,54.838710
11,ATCAATCGACTCAGATGATCAGTTTTGGTAG,12,252,23,0.067624,9,38.709677
15,GGCCTGGGCTGGAAACAGCTCTGTGTGTGAA,16,330,23,0.088555,6,58.064516
...,...,...,...,...,...,...,...
3969,AGTTTTCTAAAAAGGGGGAGAGTTGTGAAAG,3970,112,23,0.030055,11,38.709677
3973,ATTATCTGGGCGTGGTGGCATGTGCCTGTAG,3974,130,23,0.034885,4,54.838710
3975,CCTATGCTTTCCTTGGCATCGGCTACACATC,3976,352,23,0.094459,5,51.612903
3987,AAGGGTGTCCTGCTCCTTGACCACGATGGGG,3988,172,23,0.046156,5,61.290323


(Doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)

(Whiz-kid corner: lambda expressions, https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions)
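
As an illustration of the lambda idea, the helper function passed to `.apply()` can be written inline. This is a sketch on a toy table; for self-containment it counts with the built-in `str.count` rather than the notebook's `count_nuc`.

```python
import pandas as pd

toy = pd.DataFrame({"Seq": ["AAAA", "CAGG", "TATG"]})

# Named helper, in the style used in the notebook:
def count_A(row):
    return row.Seq.count("A")

named_result = toy.apply(count_A, axis=1)

# Equivalent anonymous function written inline:
toy["Count A"] = toy.apply(lambda row: row.Seq.count("A"), axis=1)
print(toy)
```

A lambda is handy for one-off helpers; a named function remains easier to test and reuse.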

### 2) Show the 10 sequences with the highest number of As. How many reads do they represent? What % of the (truncated) transcriptome?
(.sort_values())

(Doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)

In [216]:
sorteddf = df.sort_values(by='Count A', ascending=False)
sorteddf

Unnamed: 0,Seq,Id,Count,Test,frac,Count A,pGC,Count C,Count T,Count G
0,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,1,113422,23,30.436766,31,0.000000,0,0,0
650,CAAAAAAAAAAAAACAAAAAACAAAAAAACA,651,13,23,0.003489,27,12.903226,4,0,0
2507,AAATAACAAAAAATTAAAAAAAAAAAAAAAA,2508,5,23,0.001342,27,3.225806,1,3,0
3678,AAAACAAAAACAAAACAAACAAACAAAAAAG,3679,26,23,0.006977,25,19.354839,5,0,1
168,AAAAAAGATTAAAAAATTAAAAAAAAAAGAA,169,11,23,0.002952,25,6.451613,0,4,2
...,...,...,...,...,...,...,...,...,...,...
1411,ACCTGGTCTCTCTCTCTGGTCTTGCCTCTCC,1412,6,23,0.001610,1,58.064516,13,12,5
371,CCTCCGTCGGCTCTGCGGGCTCCCGGGCCTA,372,81,23,0.021736,1,77.419355,14,6,10
1419,CTGTCTCCTGCTCTTCCCTCCTTCCTGGTCC,1420,9,23,0.002415,0,61.290323,15,12,4
2140,CTGTTTCCTCCTTTCTCCTTTTCCTCCTCTC,2141,5,23,0.001342,0,48.387097,14,16,1


In [217]:
x = sorteddf[0:10]
x
# the number of reads is given by the Count column

Unnamed: 0,Seq,Id,Count,Test,frac,Count A,pGC,Count C,Count T,Count G
0,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,1,113422,23,30.436766,31,0.0,0,0,0
650,CAAAAAAAAAAAAACAAAAAACAAAAAAACA,651,13,23,0.003489,27,12.903226,4,0,0
2507,AAATAACAAAAAATTAAAAAAAAAAAAAAAA,2508,5,23,0.001342,27,3.225806,1,3,0
3678,AAAACAAAAACAAAACAAACAAACAAAAAAG,3679,26,23,0.006977,25,19.354839,5,0,1
168,AAAAAAGATTAAAAAATTAAAAAAAAAAGAA,169,11,23,0.002952,25,6.451613,0,4,2
2321,AATAACAGAAAGAAAACAAAAAGAAAAATAA,2322,57,23,0.015296,24,16.129032,2,2,3
3880,AAAGAAAGAAAAAGAAAAAAAAAATAGCACA,3881,7,23,0.001878,24,19.354839,2,1,4
2636,AAAATTAAAAAAAAAAAAAAAAAATTAGCCG,2637,6,23,0.00161,23,12.903226,2,4,2
3491,AAAAGAAGACAAAAGAAAAGAGAAAGAAGAA,3492,7,23,0.001878,23,25.806452,1,0,7
3186,ATAAATAAAAAGGAAAAGAAAAGAAAAGAAG,3187,24,23,0.00644,23,19.354839,0,2,6


In [218]:
y=sum(x.Count)
y

113578

In [219]:
z=sum(df.Count)
z

372648

In [220]:
xreads= y/z*100
xreads

30.478628625405207
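
The sort, slice, sum and divide steps can also be condensed with `DataFrame.nlargest`, which returns the top rows for a given column. A sketch on a made-up table, taking the top 2 instead of the top 10; when several rows tie, the rows chosen may differ from those of a `sort_values` plus slice.

```python
import pandas as pd

toy = pd.DataFrame({
    "Seq": ["AAAA", "CAGG", "TATA", "GGCC", "AAAT"],
    "Count": [100, 20, 30, 40, 10],
    "Count A": [4, 1, 2, 0, 3],
})

top = toy.nlargest(2, "Count A")         # the 2 rows with the most As
reads = top.Count.sum()                  # reads represented by those rows
percent = reads / toy.Count.sum() * 100  # share of all reads, in %
print(reads, percent)
```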

### 3) How many sequences with 25 or more As? Then, check that the result is correct.
(Cond. row selection)

In [221]:
filtered_df = df[df["Count A"] >= 25]
filtered_df

Unnamed: 0,Seq,Id,Count,Test,frac,Count A,pGC,Count C,Count T,Count G
0,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,1,113422,23,30.436766,31,0.0,0,0,0
168,AAAAAAGATTAAAAAATTAAAAAAAAAAGAA,169,11,23,0.002952,25,6.451613,0,4,2
650,CAAAAAAAAAAAAACAAAAAACAAAAAAACA,651,13,23,0.003489,27,12.903226,4,0,0
2507,AAATAACAAAAAATTAAAAAAAAAAAAAAAA,2508,5,23,0.001342,27,3.225806,1,3,0
3678,AAAACAAAAACAAAACAAACAAACAAAAAAG,3679,26,23,0.006977,25,19.354839,5,0,1


In [222]:
print("Number of sequences with 25 or more A's:")
print(len(filtered_df))

Number of sequences with 25 or more A's:
5


### 4) Clean up the dataframe (or re-load), add counts for all 4 nucleotides

In [223]:
def count_C(row):
    return count_nuc(row.Seq,"C")
df["Count C"]=df.apply(count_C,axis=1)

def count_T(row):
    return count_nuc(row.Seq,"T")
df["Count T"]=df.apply(count_T,axis=1)

def count_G(row):
    return count_nuc(row.Seq,"G")
df["Count G"]=df.apply(count_G,axis=1)
df.head()

Unnamed: 0,Seq,Id,Count,Test,frac,Count A,pGC,Count C,Count T,Count G
0,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,1,113422,23,30.436766,31,0.0,0,0,0
1,CAGGACTCCAATATAGAGATAAGTTAATGTC,2,93,23,0.024957,12,35.483871,5,8,6
2,TATGTAATTGGTTCCAGTGTGAGTCATTAAA,3,5,23,0.001342,9,32.258065,3,12,7
3,GATATTTTCGAAAAGTGGGATTTTTTAAACC,4,88,23,0.023615,10,29.032258,3,12,6
4,CTCCATCTCAGGTATTAGAATGAATGCTTAC,5,7,23,0.001878,9,38.709677,7,10,5


(Whiz kid corner: a function returning a function)
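
One possible reading of "a function returning a function": instead of writing `count_C`, `count_T` and `count_G` by hand, a factory can build the per-row helper for any nucleotide. The name `make_counter` is hypothetical, and the sketch counts with the built-in `str.count` on a toy table.

```python
import pandas as pd

def make_counter(nuc):
    """Return a function that counts `nuc` in the Seq field of a row."""
    def counter(row):
        return row.Seq.count(nuc)
    return counter

toy = pd.DataFrame({"Seq": ["AAAA", "CAGG", "TATG"]})

# One loop replaces the four near-identical helper functions.
for nuc in "ACGT":
    toy["Count " + nuc] = toy.apply(make_counter(nuc), axis=1)
print(toy)
```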

### 5) Add a %GC column

In [224]:
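def percent_nuc(s, nuc1, nuc2):
    # percentage of positions in s that are nuc1 or nuc2
    # (a new name avoids overwriting count_nuc, which count_A and friends still use)
    total = 0
    for c in s:
        if c == nuc1 or c == nuc2:
            total += 1
    content = (total / len(s)) * 100
    return content

In [225]:
seq = "CTCG"
print(seq)
percent_nuc(seq, "C", "G")

CTCG


75.0

In [226]:
def aux_countnuc(row):
    return percent_nuc(row.Seq, "G", "C")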

In [227]:
df["pGC"]=df.apply(aux_countnuc, axis=1)

In [228]:
df

Unnamed: 0,Seq,Id,Count,Test,frac,Count A,pGC,Count C,Count T,Count G
0,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,1,113422,23,30.436766,31,0.000000,0,0,0
1,CAGGACTCCAATATAGAGATAAGTTAATGTC,2,93,23,0.024957,12,35.483871,5,8,6
2,TATGTAATTGGTTCCAGTGTGAGTCATTAAA,3,5,23,0.001342,9,32.258065,3,12,7
3,GATATTTTCGAAAAGTGGGATTTTTTAAACC,4,88,23,0.023615,10,29.032258,3,12,6
4,CTCCATCTCAGGTATTAGAATGAATGCTTAC,5,7,23,0.001878,9,38.709677,7,10,5
...,...,...,...,...,...,...,...,...,...,...
3995,AGCTGCAGGAACTCCCTCGTCACAGCTTAAA,3996,5,23,0.001342,9,51.612903,10,6,6
3996,CTGAGCTCTCTGGGAAAGTCGTGTTCCGGAA,3997,5,23,0.001342,6,54.838710,7,8,10
3997,GTCTGCCTTTATGGCCTTTGTACTCAAAGAA,3998,10,23,0.002683,7,41.935484,7,11,6
3998,AGACTATAGTGAGCTCAGGTGATTGATACTC,3999,7,23,0.001878,9,41.935484,5,9,8


### 6) And find the 10 sequences with highest GC content. How many reads do they represent?
(As a bonus, store this result in a new dataframe with only the columns Seq, Id, Count and %GC. You might need a few extra "tricks" with .loc[:, ["Col1", "Col2"]])

In [229]:
xpGC = df.sort_values(by='pGC', ascending=False)[0:10]
xpGC

Unnamed: 0,Seq,Id,Count,Test,frac,Count A,pGC,Count C,Count T,Count G
1735,CTGCCCGCGCCCGCCGCCCAGGACCCCGCAC,1736,6,23,0.00161,3,87.096774,19,1,8
1508,ACGCACCCCTCCCCGGCCTGGGCGGCGGCGA,1509,72,23,0.019321,3,83.870968,15,2,11
963,CCGCGCCGCCCGGGCACCATGGCGGGGAAGG,964,7,23,0.001878,4,83.870968,12,1,14
233,ACCCGGCGCCCGGCCAGTCCTGCGCGTCCCC,234,38,23,0.010197,2,83.870968,17,3,9
3390,GCACGGGCGAAGGGGCCGCGGCCGCATGCCC,3391,64,23,0.017174,4,83.870968,12,1,14
1222,CTGCGGGGGGCCTGCGGAGACGGCGCCCGCA,1223,5,23,0.001342,3,83.870968,11,2,15
1751,CGGCGGTTGGCGGGGCACCACGGGAGGGGCC,1752,19,23,0.005099,3,83.870968,9,2,17
3021,GGGTCCGGCGCCGCCGGCTGCGGCTTCGCGA,3022,21,23,0.005635,1,83.870968,12,4,14
1899,AGGACTGGGGGGAGGCGGGCACCCCAGCGGG,1900,42,23,0.011271,5,80.645161,8,1,17
2320,GCAGGTCGCCCTGGGGTGCCCGCGCGTGGGA,2321,9,23,0.002415,2,80.645161,10,4,15


In [230]:
print("Number of reads represented:")
reads_pGC = sum(xpGC.Count)
reads_pGC

Number of reads represented:


283

In [231]:
highest_pGC = xpGC.loc[:,["Seq", "Id", "Count", "pGC"]]
highest_pGC

Unnamed: 0,Seq,Id,Count,pGC
1735,CTGCCCGCGCCCGCCGCCCAGGACCCCGCAC,1736,6,87.096774
1508,ACGCACCCCTCCCCGGCCTGGGCGGCGGCGA,1509,72,83.870968
963,CCGCGCCGCCCGGGCACCATGGCGGGGAAGG,964,7,83.870968
233,ACCCGGCGCCCGGCCAGTCCTGCGCGTCCCC,234,38,83.870968
3390,GCACGGGCGAAGGGGCCGCGGCCGCATGCCC,3391,64,83.870968
1222,CTGCGGGGGGCCTGCGGAGACGGCGCCCGCA,1223,5,83.870968
1751,CGGCGGTTGGCGGGGCACCACGGGAGGGGCC,1752,19,83.870968
3021,GGGTCCGGCGCCGCCGGCTGCGGCTTCGCGA,3022,21,83.870968
1899,AGGACTGGGGGGAGGCGGGCACCCCAGCGGG,1900,42,80.645161
2320,GCAGGTCGCCCTGGGGTGCCCGCGCGTGGGA,2321,9,80.645161


### 7) How many sequences with ≥ 50%GC (1453)? What is the %GC of all the sequences joined together (44.8%)? How many sequences have %GC above this average value (2104)?

In [232]:
filtered_dfpGC = df[df["pGC"] >= 50]
filtered_dfpGC.head()

Unnamed: 0,Seq,Id,Count,Test,frac,Count A,pGC,Count C,Count T,Count G
5,GGGCTGTTCAAAATAGCTGGAGCCCCAGACA,6,10,23,0.002683,9,54.83871,8,5,9
6,CTTCCATGGCTGTCCGGATCGCCGCACTGCA,7,299,23,0.080237,4,64.516129,12,7,8
7,GCACCAGGCCTTTCTCTAGAAGTCCTGAGAC,8,128,23,0.034349,7,54.83871,10,7,7
15,GGCCTGGGCTGGAAACAGCTCTGTGTGTGAA,16,330,23,0.088555,6,58.064516,6,7,12
17,GCCGGCCTCGTAGTCCAGGAAGACGCCGAGC,18,24,23,0.00644,6,70.967742,11,3,11


In [233]:
print("Number of sequences with a %GC of 50 or more:")
print(len(filtered_dfpGC))

Number of sequences with a %GC of 50 or more:
1453


In [234]:
print("pGC of all the sequences joined together:")
df["pGC"].mean()

pGC of all the sequences joined together:


np.float64(44.836290322580645)
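
Taking the mean of the per-sequence %GC matches the %GC of the joined sequences only when all sequences have the same length (the sequences displayed above are all 31 nucleotides long). The sketch below uses made-up sequences of unequal length to show how the two quantities can differ; `percent_gc` is a hypothetical helper, not a function from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({"Seq": ["GGCC", "ATAT", "GCATATAT"]})

def percent_gc(s):
    """Percentage of G or C in the string s."""
    return (s.count("G") + s.count("C")) / len(s) * 100

per_seq_mean = toy.Seq.apply(percent_gc).mean()  # unweighted mean of per-sequence %GC
joined = percent_gc("".join(toy.Seq))            # %GC of one long concatenated sequence
print(per_seq_mean, joined)
```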

In [237]:
filtered_df = df[df["pGC"] > df["pGC"].mean()]
# compare against the computed mean rather than a rounded, hard-coded 44.8
filtered_df

Unnamed: 0,Seq,Id,Count,Test,frac,Count A,pGC,Count C,Count T,Count G
5,GGGCTGTTCAAAATAGCTGGAGCCCCAGACA,6,10,23,0.002683,9,54.838710,8,5,9
6,CTTCCATGGCTGTCCGGATCGCCGCACTGCA,7,299,23,0.080237,4,64.516129,12,7,8
7,GCACCAGGCCTTTCTCTAGAAGTCCTGAGAC,8,128,23,0.034349,7,54.838710,10,7,7
10,GCATCTGCACACACACAGGGAACTTTAAGCA,11,22,23,0.005904,11,48.387097,9,5,6
12,CAAAGTTTGATCTGGAATGTTGGGACTGCAG,13,9,23,0.002415,8,45.161290,4,9,10
...,...,...,...,...,...,...,...,...,...,...
3993,AGTGACAGGAACAACACCTCAGGGATCACTC,3994,17,23,0.004562,11,51.612903,9,4,7
3994,AACCCAAGGAAAGAGAAATGCTGGGGTGTAT,3995,117,23,0.031397,12,45.161290,4,5,10
3995,AGCTGCAGGAACTCCCTCGTCACAGCTTAAA,3996,5,23,0.001342,9,51.612903,10,6,6
3996,CTGAGCTCTCTGGGAAAGTCGTGTTCCGGAA,3997,5,23,0.001342,6,54.838710,7,8,10


In [238]:
print("Number of sequences that have %GC above this average value:")
print(len(filtered_df))

Number of sequences that have %GC above this average value:
2104


### (*Challenge!*): Which sequence would form the longest helix linking the 5' and 3' extremities (no overhang)?
(Answer: ATGAATTGAGTTGTGTCCCCCCAAAATTCAT, 7 base pairs, line number 2827)
(Working with GPT-4: https://chatgpt.com/share/4687f755-e276-4d69-94c3-68be6dc5b584  Search for "Can you write a function that...")
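
One way to approach the challenge, as a minimal sketch under stated assumptions: only Watson-Crick pairs (A-T, C-G) are allowed, the helix must start at the very first and last nucleotides (no overhang), and no minimum loop size is enforced. The name `end_helix_length` is hypothetical.

```python
# Watson-Crick partners only (assumption: no G-T / G-U wobble pairs).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def end_helix_length(seq):
    """Number of consecutive base pairs formed between the 5' and 3' ends."""
    pairs = 0
    # Pair position i counted from the 5' end with position i counted from the 3' end.
    for i in range(len(seq) // 2):
        if COMPLEMENT.get(seq[i]) != seq[-1 - i]:
            break
        pairs += 1
    return pairs

answer_length = end_helix_length("ATGAATTGAGTTGTGTCCCCCCAAAATTCAT")
print(answer_length)
```

Applied to the whole table, something like `df.Seq.apply(end_helix_length)` followed by a sort on the result would rank the sequences by helix length.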