# CSCI E-7 Introduction to Programming with Python
## Lecture 6. Text Processing
### Reading: Downey Chapters 9 and 13
![](06_fig/extension_school_logo.png)

# Recap 1
- Use Python's built-in function <b>open()</b> to open a file.  Returns a file object.
- Use <b>with open(filename,'r') as fname:</b> to open and read a file
- <b>Modes</b> indicate how the file is going to be opened:  
r: open for reading (default)  
r+: Opens a file for both reading and writing  
w: open for writing, truncating the file first  
a: append to the end of a file (instead of overwriting it)  
Note: 
There are other modes (x, b, t, +):   
https://docs.python.org/3/library/functions.html#open  
- Use <b> Method close()</b> to close a file and free up any system resources taken up by the open file  
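
The modes listed above can be tried out with a small sketch like the following. The file name modes_demo.txt is hypothetical, and the file is created in a temporary directory so that nothing in your notebook folder is overwritten.

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary directory for this demo
demo_path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

# 'w' creates the file (or truncates an existing one) before writing
with open(demo_path, 'w') as out_file:
    out_file.write('first line\n')

# 'a' keeps the existing contents and writes at the end of the file
with open(demo_path, 'a') as out_file:
    out_file.write('second line\n')

# 'r' is the default mode, so it can be left off when reading
with open(demo_path) as in_file:
    contents = in_file.read()

print(contents)
```

Opening the same file again with 'w' would discard both lines, which is why 'a' is the safer choice when adding to existing data.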




![](06_images/keypoint.png) 
- What is a File Object?  Downey: **A value that represents an open file.**  
- A file object has several methods for reading data.  
- open() is most commonly used with two arguments: open(filename, mode)
- When opening a file for **read only** you do not have to specify mode='r' since it is the default.  Example: file_obj = open("short.fasta")  
- If you’re NOT using the **with** keyword then you should call fileobject.close() to close the file and immediately free up any system resources used by it.
- When a file operation fails for an I/O-related reason, an OSError exception is raised (in Python 3, IOError is an alias for OSError).  For example, opening a missing file for reading raises FileNotFoundError, a subclass of OSError.  
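
A minimal sketch of catching such an error is shown below. The file name no_such_file.fasta is hypothetical and is assumed not to exist; FileNotFoundError is a subclass of OSError, so it is listed first.

```python
# 'no_such_file.fasta' is a hypothetical name that is assumed not to exist
filename = 'no_such_file.fasta'

try:
    with open(filename) as input_file:
        data = input_file.read()
except FileNotFoundError:
    # The most specific exception is handled first
    data = None
    print('Could not find the file:', filename)
except OSError as err:
    # Any other I/O-related failure (permissions, disk errors, ...)
    data = None
    print('Could not read', filename, '-', err)
```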

# Recap 2
- Use Method: <b>fileObject.read(size)</b> to read the entire file, or at most size characters (bytes in binary mode) when size is given.
- Use Method: <b>fileObject.readline()</b> to read one entire line from the file.  Returns: a String. Note: A trailing newline character is kept in the string.   
- Use <b>for loop or while loop</b> to iterate over data in the file.
- Use Method: <b>str.strip()</b> to remove leading and trailing 'padding' characters, by default whitespace characters (' ', '\t', '\n').  Returns: a new String.  (To break a string into a List of strings, use str.split().)  
- Use <b>try: ... except:</b> to provide descriptive error messages.   

![](06_images/keypoint.png)
- A file object keeps track of where it is in the file.  
- The string returned by readline() ends with the **newline character '\n'**.  The newline is missing only on the last line of a file that does not end in a newline.  
**Important:** readline() returns an empty string only when EOF is encountered immediately.    
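
The end-of-file behavior of readline() can be sketched with a while loop. Here io.StringIO stands in for an open text file, and its contents are made up for the illustration.

```python
import io

# io.StringIO behaves like an open text file; the contents are made up.
# Note that there is no newline after the last line.
fake_file = io.StringIO('>header\nAGCT\nTTGA')

lines = []
while True:
    line = fake_file.readline()
    if line == '':        # an empty string means end of file
        break
    lines.append(line)

print(lines)
```

A blank line inside a file is returned as '\n', not as an empty string, so the loop does not stop early on blank lines.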


### Code Review from Lecture 5

In [None]:
# Write program to count the DNA bases.
with open('short.fasta', 'r') as input_file:
    for line in input_file:
        print(line)

**Same code as above but leave off r mode in open()**

In [None]:
# From Lecture 5:
# Write program to count the DNA bases.
with open('short.fasta') as input_file:  #Note: default mode=r is left off
    for line in input_file: 
        print(line)

**Count all the chars in the DNA sequence in all lines.  Skip reading the header.**

In [None]:
# From Lecture 5:
# Counts all the chars in the DNA sequence.

with open('short.fasta', 'r') as input_file:
    
    # skip first line of file        
    input_file.readline()  # Useful for skipping headers in files
    
    # Read the content of file
    data = input_file.read()
    print(data)

    # Get the length of all chars
    no_of_chars = len(data)

print('Count of all chars=',no_of_chars)

**Remove the end of line character in all lines** 

In [None]:
count = 0
with open('short.fasta') as input_file:
    # skip first line of file
    input_file.readline()

    for line in input_file:
        print(line)
        for ch in line.rstrip('\n'):
            count = count + 1

print('Count of all chars= ',count)

**Data is stored in a List in the next 2 examples.  Usage: Method str.split()**

In [None]:
count=0
with open('short.fasta') as input_file:
    input_file.readline() # skip the first line
    for line in input_file:
        print(line)  # this is a string
        words = line.split()  # puts the string into a list 
        print(words)
        for ch in line:  # note: this count includes the trailing '\n' of each line
            count=count+1
            
print('Count of all chars= ',count)    

In [None]:
count=0
with open('short.fasta') as input_file:
    input_file.readline() # skip the first line
    for line in input_file:
        words = line.split()
        print(words)
        for ch in line:
            if ch != '\n':
                count=count+1
            
print('Count of all chars= ',count)    

![](06_images/keypoint.png)
  
- Always view the contents of your data file before working with data.  
- read() loads the entire file into memory, which can use up resources when the file is very large.  Remember to close files that you open with open().  The exception is a <b>with</b> block, which closes the file for you. 
- Pay attention to headers in a file as they generally will be descriptive of each data field in the file.
- Skip the header line if it is non-essential in your text processing.
- The last line of your data file may not contain a newline. Our fasta files do not.  BUT in practice... it is good style to always put the newline as a last character if it is allowed by the file format.  

In [None]:
fasta_dir= 'C:/Users/Owner/Documents/diane/Harvard/Spring2020Pythoncourse/Lecture 05/short.fasta'
with open(fasta_dir) as input_file:  #Note: default mode=r is left off
    for line in input_file: 
        print(line)

![](06_images/keypoint.png)
**Reading files from different directories.**    
The code examples we have used assume the fasta data files are stored in the same directory as your notebook.    
Sometimes you may want to open a data file from a different directory.  

In <b>Windows</b> your paths will be of the following format:       data_dir='C:\Users\Owner\Documents\diane\Harvard\Spring2020Pythoncourse\Lecture 05\short.fasta'    
In an ordinary Python string the backslash starts an escape sequence (here '\U' causes a SyntaxError), so use the <b>forward slash</b>, a doubled backslash, or a raw string (r'...') instead.    
Example: fasta_dir= 'C:/Users/Owner/Documents/diane/Harvard/Spring2020Pythoncourse/Lecture 05/short.fasta'
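
As a small sketch, the three usual ways to write a Windows path in Python are compared below. The path is a shortened, hypothetical one, and pathlib.PureWindowsPath is used only to compare the spellings, so the cell runs on any operating system without touching the disk.

```python
from pathlib import PureWindowsPath

# Hypothetical path, shortened for the example
escaped = 'C:\\Users\\Owner\\Documents\\short.fasta'   # each backslash doubled
raw = r'C:\Users\Owner\Documents\short.fasta'          # raw string: backslashes kept as typed
forward = 'C:/Users/Owner/Documents/short.fasta'       # forward slashes

same_string = (escaped == raw)
same_path = (PureWindowsPath(raw) == PureWindowsPath(forward))
file_name = PureWindowsPath(raw).name

print(same_string)
print(same_path)
print(file_name)
```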

### Part 1 Textual Processing: Strings  
Python Documentation: https://docs.python.org/3/library/stdtypes.html#textseq  
Textual data in Python is handled with str objects, or strings. Strings are immutable sequences of Unicode code points. String literals:  

Single quotes: 'allows embedded "double" quotes'  

Double quotes: "allows embedded 'single' quotes".  

Triple quoted: '''Three single quotes''', """Three double quotes"""  


**str()**    
**Usage:**  
class str(object='')  
Returns a string version of **object**.  If object is not specified, returns an empty string.  The optional encoding argument, used when converting bytes, defaults to UTF-8.        

**String Examples**

In [None]:
mystring=str('A small text sentence.')
str1 =str(311040)
str2 = str(1.0e4)
str3 = str(True)  
color = 'blue'  # Use either double or single quotes
sentence = "Today is Monday.\n The weather is sunny.\n It\'s in the 60s."
print(mystring)
print(str1)
print(str2)
print(str3)
print(color)
print(sentence)

**Check data types**

In [None]:
value = str(253./2.)
value
print(value)
print(type(value))

In [None]:
#Check if a variable contains a value that is a string, use the isinstance built-in function.
#The isinstance function takes two arguments. The first is your variable. The second is the type you want to check for.
print(isinstance(value, int))
print(isinstance(value, float))
print(isinstance(value, list))
print(isinstance(value, str))

In [None]:
print(float(value))
print(type(float(value)))

![](06_images/keypoint.png)
Pay attention to the data types you are using.  This will avoid type mismatch errors.   
**Use:**    
type(nameofvariable) - to display the type of a variable.   
isinstance(nameofvariable, sometype) - to check the type of a variable.  Returns: True / False  
PyCharm: the debugger displays a variable as: a = {int} 7.   

**String Methods:**  
https://docs.python.org/3/library/stdtypes.html#string-methods  
https://www.w3schools.com/python/python_ref_string.asp  


| Method | Description |
| :----- | :---------- |
| **capitalize()** |Converts the first character to upper case |     
| **casefold()**	|Converts string into lower case |    
| **center()** | Returns a centered string |  
| **count()** | Returns the number of times a specified value occurs in a string |  
|**encode()** | Returns an encoded version of the string |  
| **endswith()** | Returns True if the string ends with the specified value |
| **expandtabs()** | Replaces tab characters with spaces, using the specified tab size |
| **find()** | Searches the string for a specified value and returns the position of where it was found, or -1 if it was not found |
| **format()** | Formats specified values in a string |  
| **format_map()** | Formats specified values in a string, taking the values from a mapping such as a dictionary |
| **index()** | Searches the string for a specified value and returns the position of where it was found; raises ValueError if it was not found |
| **isalnum()** | Returns True if all characters in the string are alphanumeric |
| **isalpha()**	| Returns True if all characters in the string are in the alphabet |
| **isdecimal()** | Returns True if all characters in the string are decimals |
| **isdigit()** |	Returns True if all characters in the string are digits |
| **isidentifier()** |	Returns True if the string is an identifier |
| **islower()** |	Returns True if all characters in the string are lower case |
| **isnumeric()** |	Returns True if all characters in the string are numeric |
| **isprintable()** |	Returns True if all characters in the string are printable |
| **isspace()** |	Returns True if all characters in the string are whitespaces | 
| **istitle()** |	Returns True if the string follows the rules of a title |  
| **isupper()** |	Returns True if all characters in the string are upper case |
| **join()** | Joins the elements of an iterable into one string, using the string as the separator |
| **ljust()** |	Returns a left justified version of the string |
| **lower()** |	Converts a string into lower case |
| **lstrip()** |	Returns a left trim version of the string |
| **maketrans()** |	Returns a translation table to be used in translations |
| **partition()** |	Returns a tuple where the string is parted into three parts |
| **replace()** |	Returns a string where a specified value is replaced with a specified value |
| **rfind()** | Searches the string for a specified value and returns the last position of where it was found, or -1 if it was not found |
| **rindex()** | Searches the string for a specified value and returns the last position of where it was found; raises ValueError if it was not found |
| **rjust()** |	Returns a right justified version of the string |
| **rpartition()** |	Returns a tuple where the string is parted into three parts |
| **rsplit()** |	Splits the string at the specified separator, and returns a list |
| **rstrip()** |	Returns a right trim version of the string |
| **split()** |	Splits the string at the specified separator, and returns a list |
| **splitlines()** |	Splits the string at line breaks and returns a list |
| **startswith()** | Returns True if the string starts with the specified value |
| **strip()** |	Returns a trimmed version of the string |
| **swapcase()** |	Swaps cases, lower case becomes upper case and vice versa |
| **title()** |	Converts the first character of each word to upper case |
| **translate()** |	Returns a translated string |
| **zfill()** |	Fills the string with a specified number of 0 values at the beginning |
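
A few of the methods in the table can be chained to clean up a line of text. The sketch below uses a made-up FASTA-style header string; none of the values come from the course data files.

```python
header = '  >seq1 Homo sapiens  \n'   # made-up FASTA-style header

clean = header.strip()                # remove leading/trailing whitespace
starts_ok = clean.startswith('>')     # True: header lines start with '>'
pos_found = clean.find('Homo')        # index of the first occurrence
pos_missing = clean.find('Mus')       # find() returns -1 when nothing is found
parts = clean.split()                 # split on whitespace into a list
joined = '-'.join(parts)              # the string '-' is the separator
renamed = clean.replace('sapiens', 'neanderthalensis')

print(clean)
print(starts_ok, pos_found, pos_missing)
print(parts)
print(joined)
print(renamed)
```

Strings are immutable, so every method here returns a new value and leaves header unchanged.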

**Common String Methods:**


In [None]:
#Check if string has all uppercase or lowercase
color='Blue'
print("'color' is lowercase: ", color.islower()) 
print("'color' is uppercase: ", color.isupper()) 

In [None]:
name="benjamin"
print(name)
name = name.upper()   #Convert string to all uppercase
print("'name' is: ", name)
print("Uppercase? ", name.isupper())
name=name.lower().capitalize()   #Convert string to all lower and capitalize first letter
print(name)
name_cnt=name.count('n')
print(name_cnt)

In [None]:
alpha='abcdefghijklmnopqrstuvwxyz'  # Not needed below: use isalpha() rather than looping through this string 
string = 'AGCT1'
print('Initial string: ',string)

string1=string.lower()
print('Lower case string: ',string1)

string2=string.isalpha()
print('Check if all characters in the string are in the alphabet:',string2)

string3=string1.upper()
print('Convert to uppercase:',string3,' Check if alphanumeric:',string3.isalnum())

In [None]:
#split() returns a list.  Note the empty string produced by the trailing comma in seq_code1.
seq_code1='AGCT,TTTC,ATTC,TGAC,TGCA,'
seq_code2='AGCT TTTC ATTC TGAC TGCA'
new_seq1=seq_code1.split(',')
new_seq2=seq_code2.split(' ')
print(new_seq1)
print(new_seq2)

In [None]:
#View contents of file
with open("brokensequences.csv") as input_file:
    file_contents = input_file.read()
print(file_contents)

- Biologists use text files in the Fasta format to store genetic data.
- The first line of a Fasta file is a comment about the contents.
- The rest of the file will be a sequence of lines with data.
- DNA is described by strings of the letters A, C, G and T.
- These strings represent the four bases that encode a genome.
- Letters may be upper case or lower case.
- **Note:** The sequence file brokensequences.csv used below is in CSV (comma-separated values) format, which is another popular format for storing data.
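
The Fasta description above suggests a simple way to count the four bases. The sketch below is one possible approach, not the course solution: the Fasta text is made up, and io.StringIO stands in for open('short.fasta') so the cell runs without the data file.

```python
import io

# Made-up Fasta contents; io.StringIO stands in for an open file
fasta_text = '>example sequence\nAGCTagct\nTTGA\n'

counts = {'A': 0, 'C': 0, 'G': 0, 'T': 0}
with io.StringIO(fasta_text) as input_file:
    input_file.readline()                  # skip the comment line
    for line in input_file:
        sequence = line.strip().upper()    # drop the newline, normalize the case
        for base in counts:
            counts[base] += sequence.count(base)

print(counts)
```

Converting each line to upper case first means that lower case and upper case letters are counted together.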

In [None]:
with open('brokensequences.csv') as input_file:
    input_file.readline() # skip the first line
    for line in input_file:
        word2 = line.upper().split(',')
        print(word2)
    #print(line)
        

In [None]:
sequence_lst = []
with open('brokensequences.csv') as input_file:
    input_file.readline() # skip the first line
    for line in input_file:
        # strip() removes the trailing newline so the last field can pass isalpha()
        words = line.strip().upper().split(',')
        for word in words:
            if word.isalpha():
                sequence_lst.append(word)

allcodes_str=''.join(sequence_lst)  # join once, after all lines have been read
print('String of all data:',allcodes_str)
print('List of all data:',sequence_lst)
        

### Part 2 Textual Processing: Regular Expressions  
Python documentation: 
https://docs.python.org/3.8/howto/regex.html  
https://docs.python.org/3/library/re.html    
- Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes).  
- Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning.  

**Usage: re module**  
Contains many methods you can use.

In [None]:
import re

**Most Commonly used Methods:**  

|Method | Description | Usage |
| :---- | :---------- | :---- |
|findall | Returns a list containing all matches. | re.findall(pattern, string, flags=0) |
|match | Returns a match object if the pattern matches at the beginning of the string, else None. | re.match(pattern, string, flags=0) |
|search | Returns a match object for the first place where the pattern matches, else None. | re.search(pattern, string, flags=0) |
|split | Splits the string at each match of the pattern and returns a list. | re.split(pattern, string, maxsplit=0, flags=0) |
|sub | Replaces each match of the pattern with repl and returns the new string. | re.sub(pattern, repl, string, count=0, flags=0) |

![](06_images/keypoint.png)    
  
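**re.split** is helpful when working with large textual data, especially when the text must be split on more than one kind of separator (for example spaces, commas and periods).  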

**Matching Versus Searching Key Differences**  
- Both return a match object for the first match found, or None if there is no match.  
- **re.match** checks for a match only at the **beginning** of the string (not at the start of each line).        
- **re.search** checks for a match **anywhere** in the string.    
- Both are very efficient and fast for searching in strings.   
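
The difference between the two functions can be seen in a short sketch. The strings are made up for the illustration; re.MULTILINE is a standard flag that lets ^ match at the start of every line when used with re.search.

```python
import re

text = 'To be, or not to be'

at_start = re.match('To', text)      # match object: 'To' is at the start
not_at_start = re.match('be', text)  # None: 'be' is not at the start
anywhere = re.search('be', text)     # match object for the first 'be'

print(at_start)
print(not_at_start)
print(anywhere.group(), anywhere.start())

# match() looks only at the beginning of the whole string, not of each line
multi = 'first line\nsecond line'
second_match = re.match('second', multi)
second_search = re.search('^second', multi, re.MULTILINE)

print(second_match)
print(second_search.group())
```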

**Regular Expression examples:**

In [None]:
text = 'To be, or not to be, that is the question. By William Shakespeare'

val=re.findall('be',text)
print(val)
print(type(val)) 

val=re.search('to be',text)
# print('val=',val)
if val:
    print(val.group())  # the matched text; groups() would return only captured () groups
print(val)

val=re.split('be',text)
print(val)

val=re.sub('be','have been',text)
print(val)

In [None]:
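# match() checks only the start of the string; findall() returns every match of a word
# The variable is named quote so that it does not hide the built-in str
quote = "Benjamin Franklin said: If you would be loved, love, and be loveable"
fnd = re.match('Benjamin',quote)
print(fnd)

if fnd:
    print(fnd.group())
    
fnd1 = re.match('would',quote)  # None: 'would' is not at the start of the string
if fnd1:
    print(fnd1.group())
    
x = re.findall("love", quote)
print(x)

x1 = re.findall("loved", quote)
print(x1)

x3 = re.findall("and be loveable", quote)
print(x3)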

In [None]:
# Remove every run of digits by replacing it with an empty string.
input_string = 'Box A contains 3 red and 5 white balls but Box B contains 4 red and 2 blue balls.'
result = re.sub(r'\d+', '', input_string)
print(result)

**For Pattern Matching:**

**Metacharacters** are characters that have a special meaning in a pattern.  They are used, for example, to specify a character class, which is a set of characters that you wish to match.

<table style="width:100%" align="left">
  <tr align="left">
    <th>Character</th>
    <th>Description</th>
    <th>Example</th>
  </tr>
  <tr align="left">
    <td><b>[]</b></td>
    <td>A set of characters</td>
    <td>[a-m]</td>
  </tr>
  <tr align="left">
    <td><b>\</b></td>
    <td>A special sequence (also used to escape special characters)</td>
    <td>"\d"</td>
  </tr>
  <tr align="left">
    <td><b>.</b></td>
    <td>Any character (except newline character)</td>
    <td>"he..o" </td>
  </tr>
  <tr align="left">
    <td><b>^</b></td>
    <td>Starts with</td>
    <td>"^hello"</td>
  </tr>
  <tr align="left">
    <td><b>dollar sign</b></td>
    <td>Ends with</td>
    <td>"goodbye$"</td>
  </tr>
  <tr align="left">
    <td><b>*</b></td>
    <td>Zero or more occurrences.</td>
    <td>aix*</td>
  </tr>
  <tr align="left">
    <td><b>+</b></td>
    <td>One or more occurrences</td>
    <td>aix+</td>
  </tr>
  <tr align="left">
    <td><b>{}</b></td>
    <td>Exactly the specified number of occurrences</td>
    <td>al{2}</td>
  </tr>
  <tr align="left">
    <td><b>|</b></td>
    <td>Either or</td>
    <td>falls|stays</td>
  </tr>
   <tr align="left">
    <td><b>()</b></td>
    <td>Capture and group</td>
    <td>"(d\w+)"</td>
  </tr>
</table>
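
Each metacharacter in the table can be tried on a short string. The sketch below uses made-up test strings and raw string literals (r'...') so that backslashes reach the re module unchanged.

```python
import re

txt = 'The rain in Spain falls mainly on the plain'

in_set = re.findall(r'[a-c]', 'abcdef')         # [] a set of characters
any_char = re.findall(r'ai.', txt)              # .  any single character
starts = bool(re.search(r'^The', txt))          # ^  starts with
ends = bool(re.search(r'plain$', txt))          # $  ends with
zero_plus = re.findall(r'ab*', 'a ab abbb')     # *  zero or more b
one_plus = re.findall(r'ab+', 'a ab abbb')      # +  one or more b
exactly = re.findall(r'al{2}', 'falls tall al') # {} exactly two l
either = re.findall(r'falls|stays', txt)        # |  either or
groups = re.search(r'(\w+) in (\w+)', txt).groups()  # () capture and group

print(in_set, any_char)
print(starts, ends)
print(zero_plus, one_plus, exactly)
print(either, groups)
```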



**Useful Regular Expression patterns:**  
[aeiouAEIOU] matches vowels  
[^aeiouAEIOU] matches non-vowels  
[0-9a-zA-Z] matches digits, lowercase and uppercase letters  
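
These character sets can be checked with re.findall. The sample string below is made up for the illustration.

```python
import re

sample = 'Regular Expressions 101'   # made-up sample string

vowels = re.findall(r'[aeiouAEIOU]', sample)    # every vowel, one at a time
alnum_runs = re.findall(r'[0-9a-zA-Z]+', sample)  # runs of digits and letters
first_non_vowel = re.search(r'[^aeiouAEIOU]', sample).group()  # ^ negates the set

print(vowels)
print(alnum_runs)
print(first_non_vowel)
```

Note that [^aeiouAEIOU] matches any character that is not a vowel, which includes spaces, digits and punctuation as well as consonants.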

**Sequences**  
A special sequence is a '\' followed by one of the characters below:

| Character | Description | Usage |
| :-------- | :---------- | :------ |
| \A | Returns a match if the specified characters are at the beginning of the string | r"\AThe" |
| \b | Returns a match where the specified characters are at the beginning or at the end of a word | r"\bain" r"ain\b" |
| \B | Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word | r"\Bain" r"ain\B" |
| \d | Returns a match where the string contains digits (numbers from 0-9) | r"\d" |
| \D | Returns a match where the string DOES NOT contain digits | r"\D" |
| \s | Returns a match where the string contains a white space character | r"\s" |
| \S | Returns a match where the string DOES NOT contain a white space character | r"\S" |
| \w | Returns a match where the string contains any word characters (characters from a to z and A to Z, digits from 0-9, and the underscore _ character) | r"\w" |
| \W | Returns a match where the string DOES NOT contain any word characters | r"\W" |
| \Z | Returns a match if the specified characters are at the end of the string | r"Spain\Z" |
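
The sketch below tries several of the special sequences on made-up strings. It also shows why patterns are usually written as raw strings: in an ordinary string '\b' is the single backspace character, while r'\b' keeps the backslash that the re module needs.

```python
import re

txt = 'The rain in Spain'

backspace_len = len('\b')    # ordinary string: one backspace character
raw_len = len(r'\b')         # raw string: a backslash followed by b

word_start = re.findall(r'\bain', txt)     # 'ain' at the start of a word
word_end = re.findall(r'ain\b', txt)       # 'ain' at the end of a word
digits = re.findall(r'\d+', 'Room 101, floor 3')
spaces = re.findall(r'\s', txt)
words = re.findall(r'\w+', txt)
at_begin = bool(re.search(r'\AThe', txt))
at_end = bool(re.search(r'Spain\Z', txt))

print(backspace_len, raw_len)
print(word_start, word_end)
print(digits, spaces, words)
print(at_begin, at_end)
```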

In [None]:
str1 = "I really prefer a 1/2 tbsp of sugar rather than 1 tbsp."

#Check if the string contains any digits (numbers from 0-9)
#A raw string (r"...") keeps the backslash in \d from being treated as a string escape
x = re.findall(r"\d", str1)
print(x)

In [None]:
#Below will find 1/ ... can you figure it out?

str1 = "I really prefer a 1/2 tbsp of sugar rather than 1 tbsp."
find = r'\d/'
x1 = re.findall(find, str1)
print(x1)

In [None]:
# Social security number: 123-45-6789
# \d Digit [0-9]
# Include a count {3}
soc_sec_no = '123-45-6789'
if re.search(r"\d{3}-\d{2}-\d{4}",soc_sec_no):
   print('valid') # Found legal SS#

In [None]:
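# Negate a character set with ^ inside the brackets: [^aeou] matches any character except a, e, o, u
# (Outside the brackets, ^ anchors the start of the string.)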
str1 = 'put the pot upon the spit'
m = re.findall('p[^aeou]t', str1)
print(m)

![](06_images/keypoint.png)  
When creating complicated patterns put them in a table.    
See next cell.

In [None]:
#pattern = '^[A-Z]\w+,?\s*[A-Z]$'    
#^     Start of string      
#[A-Z] One upper case letter      
#\w+   One or more word characters (letters, digits, underscore)      
#,?    Optional comma      
#\s*   Zero or more white space    
#[A-Z] One upper case letter    
#$     End of string    
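
One way to keep such a table next to the pattern itself is the standard re.VERBOSE flag, which lets a pattern span several lines with comments. The sketch below is an illustration of that flag, not part of the original lecture code, and the candidate names are made up.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment
name_pattern = re.compile(r'''
    ^        # start of string
    [A-Z]    # one upper case letter
    \w+      # one or more word characters
    ,?       # optional comma
    \s*      # zero or more white space characters
    [A-Z]    # one upper case letter
    $        # end of string
''', re.VERBOSE)

# Made-up candidates: a last name followed by a single initial
results = {}
for candidate in ['Smith, J', 'Smith J', 'smith, J', 'Smith, John']:
    results[candidate] = bool(name_pattern.search(candidate))
    print(candidate, '->', results[candidate])
```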

In [None]:
#Notice what happens with the $ character.  This is due to $ forcing an end
#after Spain.  Since our string doesn't end at Spain it returns None.

txt = "The rain in Spain always falls on the plain."
x1 = re.search("^The.*Spain$", txt)
print(x1)

In [None]:
# Find all email addresses in string.  

line = "Email addresses: asbfal2@als.com, Users1@gmail.de and another: Dariush@dasd-asasdsa.com.lo,Dariush.lastName@someDomain.com"

match2 = re.findall(r'[\w\.-]+@[\w\.-]+', line)
print(match2)

In [None]:
#An example using split.
#Split a string into individual words at spaces, and at commas and periods that are not next to a digit
match_reg=" |(?<![0-9])[.,](?![0-9])"
oddstring = "one two 3.4 5,6 seven.eight nine,ten,1.2,a,5"
print(oddstring)
word_list=re.split(match_reg, oddstring)  # re.split returns a list
print('List=',word_list)

In [None]:
# Sample strings.  (Named samples so that the built-in list is not hidden.)
samples = ["dog dot", "do don't", "eric donna", "do-nut", "no match","Door don't"]

#Pattern: (d\w+)\W(d\w+)
#d        Letter d (upper or lower case, because of re.IGNORECASE).
#\w+      One or more word characters.
#\W       A non-word character.

for element in samples:
    # Match if two words start with letter d.
    m = re.match(r"(d\w+)\W(d\w+)", element,re.IGNORECASE)

    # Check if successful
    if m:
        print(m.groups())
#Notice the don't didn't catch the t.  How would you fix this?

In [None]:
#Another example using re.split
value = "one 1 two 2 three 3"

# Separate on one or more non-digit characters!
#Usage Pattern: \D+
#\D+   One or more non-digit characters.
#The string starts with non-digits, so the first element of the result is an empty string.
result = re.split(r"\D+", value)

# Print results.
for element in result:
    print(element)


![](06_images/keypoint.png)
When working with string data in text files pre-process ('clean') data using String Methods and Regular Expressions.  
1.  Create a table of your patterns. 
2.  Check  your regular expressions:  
https://pythex.org/  
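
As a closing sketch, string methods and a regular expression can be combined to clean data. The records below are made up, and the rule "keep only fields made of A, C, G and T" is an assumption chosen for the illustration. re.fullmatch requires the whole field to match the pattern.

```python
import re

# Made-up messy records, as they might be read from a text file
raw_lines = ['  agct, TTTC ,at1c\n', 'TGAC,,tg-ca\n']

clean = []
for line in raw_lines:
    for field in line.strip().split(','):       # string methods: trim and split
        field = field.strip().upper()           # normalize each field
        if re.fullmatch(r'[ACGT]+', field):     # regular expression: validate
            clean.append(field)

print(clean)
```

Fields that contain a digit, a hyphen, or nothing at all are dropped because they do not match [ACGT]+ from start to end.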

Great summary Cheat sheet:    
https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf  