# Gen559 pandas practice notebook
### 2020.12.09

### Practice problem 1 

The file *3_probes.bed* contains information about hybridization probes in the format:

```python
'chromosome' + '\t' + 'start' + '\t' + 'stop' + '\t' + 'sequence' + '\t' + 'Tm'
```

* In the cell below, use the *pandas.read_csv()* function to import the information from this file into Python as a DataFrame object.
          
          
* Assign column labels to the DataFrame based on the string above.
       
      
* Print your DataFrame.

In [4]:
# Import pandas module
import pandas as pd

# Use read_csv() to import the tab-separated input file.
# Passing names= labels the columns and tells pandas the file has no header row.
probe_info = pd.read_csv('3_probes.bed', sep='\t', names=['chromosome', 'start', 'stop', 'sequence', 'Tm'])

# Print DataFrame.
print(probe_info)

   chromosome      start       stop  \
0        chr7  148845262  148845298   
1        chr7  148845406  148845445   
2        chr7  148845449  148845489   
3        chr7  148845513  148845553   
4        chr7  148845554  148845589   
5        chr7  148845719  148845759   
6        chr7  148845845  148845884   
7        chr7  148846105  148846145   
8        chr7  148846359  148846394   
9        chr7  148846443  148846483   
10       chr7  148846484  148846519   
11       chr7  148846522  148846562   

                                     sequence     Tm  
0       CCCCATCCTTATTCCCTTTAAGAACGTTCTGATGGGC  42.85  
1    ATAAGTCAGAGCTTTACAGAGGTGTCACCTAACAAAACGC  42.27  
2   AATGACTAAGAATCATTCCAAGTGTCACCATCAAGACCACC  42.24  
3   AGGATTATGTCTGTTGCTACTGGGATATCACTGACAAGTCT  42.12  
4        TTGAAGCTCATATGTCACACAATCTCCAGAAGGCCT  42.22  
5   ACTACATAGTGCAATTTTAATTCCAACTCGCTCTTTCCCCT  42.14  
6    AGACTTTTGCCATTTTCTTCTCATCTTGCTGCAATCATGT  42.08  
7   TCTTACCACATGGCTAATTCAAATTGGGGTTACAACAGTGA  42.26
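
The import step can also be tried without the input file. The sketch below is illustrative only: the two rows are made-up stand-ins for *3_probes.bed*, and `toy_bed`, `toy_probes`, and `columns` are hypothetical names. It derives the column labels by splitting the format string on tabs and reads the text through `io.StringIO`, which `pd.read_csv()` accepts in place of a filename.

```python
import io

import pandas as pd

# Column labels recovered from the tab-separated format string.
header = 'chromosome' + '\t' + 'start' + '\t' + 'stop' + '\t' + 'sequence' + '\t' + 'Tm'
columns = header.split('\t')

# Two made-up BED-style rows standing in for 3_probes.bed (illustrative values only).
toy_bed = io.StringIO(
    "chrX\t100\t104\tACGT\t40.5\n"
    "chrX\t200\t206\tGGCCAT\t41.25\n"
)

# Because names= is given, pandas treats the first line as data, not as a header.
toy_probes = pd.read_csv(toy_bed, sep='\t', names=columns)

print(toy_probes.shape)
print(list(toy_probes.columns))
```

Splitting the format string keeps the labels in one place, so the DataFrame columns cannot drift from the documented file layout.
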

### Practice problem 2

* In the cell below, create a function *get_GC()* that returns the GC% of an input DNA sequence.
          
          
* Use the pandas *df.filter()* method to select the 'sequence' column, then build a list of the hybridization probe sequences from it. *Hint: you can call df.values.tolist() on the DataFrame you create after filtering.*
           
           
* Call the *get_GC()* function on the list of probe sequences.
           
           
* Use the *df.insert()* method to add the GC values to your DataFrame as a new column (named 'GC' in the solution below, so that *df.query()* can refer to it directly).

In [5]:
# Create get_GC() function.

def get_GC(seq):
    '''Takes in DNA sequence, returns GC content as a fraction between 0 and 1 (float)'''
    
    # Upper-case once so the counts are case-insensitive.
    seq = seq.upper()
    
    # Return (count of C + count of G) / length of seq. True division already yields a float.
    return (seq.count('C') + seq.count('G')) / len(seq)


# Use df.filter() method on 'sequence'.
seqs_df = probe_info.filter(items=['sequence'])

# Use df.values.tolist() method to extract seqs into list. Note this creates a list of lists.
seqs_list = seqs_df.values.tolist()

# Create and populate list of GC fractions using get_GC(); x[0] unwraps each single-item row.
GCs = [get_GC(x[0]) for x in seqs_list]

# Add GC content to DataFrame with the df.insert() method.
probe_info.insert(5, "GC", GCs)

# Print result.
print(probe_info)

   chromosome      start       stop  \
0        chr7  148845262  148845298   
1        chr7  148845406  148845445   
2        chr7  148845449  148845489   
3        chr7  148845513  148845553   
4        chr7  148845554  148845589   
5        chr7  148845719  148845759   
6        chr7  148845845  148845884   
7        chr7  148846105  148846145   
8        chr7  148846359  148846394   
9        chr7  148846443  148846483   
10       chr7  148846484  148846519   
11       chr7  148846522  148846562   

                                     sequence     Tm        GC  
0       CCCCATCCTTATTCCCTTTAAGAACGTTCTGATGGGC  42.85  0.486486  
1    ATAAGTCAGAGCTTTACAGAGGTGTCACCTAACAAAACGC  42.27  0.425000  
2   AATGACTAAGAATCATTCCAAGTGTCACCATCAAGACCACC  42.24  0.414634  
3   AGGATTATGTCTGTTGCTACTGGGATATCACTGACAAGTCT  42.12  0.414634  
4        TTGAAGCTCATATGTCACACAATCTCCAGAAGGCCT  42.22  0.444444  
5   ACTACATAGTGCAATTTTAATTCCAACTCGCTCTTTCCCCT  42.14  0.390244  
6    AGACTTTTGCCATTTTCTTCTCATCTTGCTGC
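
As a possible alternative to the filter-and-unwrap route, the same column can be built with `Series.apply()`. This is a minimal sketch on made-up toy sequences, not the data from *3_probes.bed*; `gc_fraction`, `toy`, `nested`, and `flat` are hypothetical names introduced for illustration. It also shows why `df.values.tolist()` on a one-column DataFrame produces a list of lists.

```python
import pandas as pd


def gc_fraction(seq):
    """Hypothetical helper: fraction of G and C bases in seq (0 to 1)."""
    seq = seq.upper()
    return (seq.count('G') + seq.count('C')) / len(seq)


# Made-up toy probes (not taken from 3_probes.bed).
toy = pd.DataFrame({'sequence': ['ACGT', 'GGCC', 'aatt'], 'Tm': [40.0, 41.0, 42.0]})

# df.filter(...).values.tolist() gives one inner list per row ...
nested = toy.filter(items=['sequence']).values.tolist()

# ... whereas selecting the column as a Series gives a flat list.
flat = toy['sequence'].tolist()

# Series.apply() calls the function once per sequence.
gc_values = toy['sequence'].apply(gc_fraction)

# insert() modifies toy in place; position len(toy.columns) appends the column last.
toy.insert(len(toy.columns), 'GC', gc_values)

print(nested)
print(flat)
print(toy)
```

Using `len(toy.columns)` as the position avoids hard-coding the column count, which would raise an error if the DataFrame had fewer columns than expected.
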

### Practice problem 3
        
* In the cell below, use the *df.query()* method to create a new DataFrame containing only probes with Tm > 42.5 °C and GC% > 45%.
       
       
* Print your results.

In [6]:
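# Create new df using df.query(). Both conditions must sit inside one query string:
# Python's `and` between two separate strings would pass only the second one to query().
filtered_probes = probe_info.query("Tm > 42.5 and GC > 0.45")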

# Print new DataFrame.
print(filtered_probes)

   chromosome      start       stop                               sequence  \
0        chr7  148845262  148845298  CCCCATCCTTATTCCCTTTAAGAACGTTCTGATGGGC   
8        chr7  148846359  148846394   AAAAGATGGACACCCTGAGGTCAATGATTTCCTCCC   
10       chr7  148846484  148846519   GCAATGAGCTCACAGAAGTCAGGATGTGCACAGGCT   

       Tm        GC  
0   42.85  0.486486  
8   42.63  0.472222  
10  45.79  0.527778  
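
One thing to watch with *df.query()*: the conditions have to be combined inside a single query string. Python evaluates `and` between two non-empty strings to the second string, so writing the conditions as two separate strings silently drops the first one. The sketch below uses made-up toy values (chosen so the difference is visible) and hypothetical variable names; it is an illustration, not part of the exercise data.

```python
import pandas as pd

# Made-up toy values chosen so the two filters give different answers.
toy = pd.DataFrame({'Tm': [43.0, 41.0, 44.0], 'GC': [0.50, 0.48, 0.40]})

# `and` between two non-empty strings evaluates to the second string,
# so only the GC condition would ever reach query().
expr = "Tm > 42.5" and "GC > 0.45"
print(expr)

# Filtering with that expression ignores Tm entirely.
gc_only = toy.query(expr)

# Both conditions inside one string: query() itself parses the `and`.
both = toy.query("Tm > 42.5 and GC > 0.45")

# Equivalent boolean-mask form, using & with parentheses around each comparison.
same_with_mask = toy[(toy['Tm'] > 42.5) & (toy['GC'] > 0.45)]

print(gc_only.index.tolist())
print(both.index.tolist())
```

On real data the two forms can happen to return the same rows, which makes the mistake easy to miss; a small test frame like this one is a quick way to check that every condition is actually applied.
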
