# Manipulating Files and Processing Text


## Topics
- Basic text processing with **split()**, **join()**, and **strip()**
- Text testing with **endswith()**, **startswith()**, **find()**, and **in**
- Text conversion with **swapcase()**, **replace()**, **upper()**, and **lower()**
- Opening and closing filehandles
- Reading from the filehandle with **read()**, **readline()**, and **readlines()**
- Reading from the filehandle iterable
- Writing or appending to a file with **write()** and **writelines()**

---

## Introduction
We've learned so far how to write programs that make many, many decisions with an ordered logic to process information. What we've lacked thus far is a way to input and output large tomes of data. In addition to manipulating large amounts of data with functions that open, read, write, and close files, we'll also benefit from learning about Python's marvelously powerful abilities to process text. Not to malign the now-dead king of text-processing languages, Perl (The King is Dead! Long Live the King!), but Python really cleans house with its unparalleled text-processing abilities with respect to both speed and ease of use.

## Basic Text Processing
Systematically manipulating large text files is one of the most common tasks you will encounter. The most basic tools for this task are the built-in Python string methods. These allow us to convert between strings and lists, test the properties of strings, and modify strings.

---
## Brief Review: methods vs. functions

We've learned about writing our own *functions* to process information. Many types of objects have special built-in functions. We call these functions *methods*, and in a broader discussion of object-oriented programming practice and theory, we would have much, much more to say about them. However, we're not getting into the object-oriented universe or philosophy here, so you'll have to take as explanation simply that some objects are so routinely manipulated with the same sorts of operations that it pays to have functions dedicated to their processing. In the case of strings and files today, we'll see the *methods* that routinely operate on these types.

Whereas a *function* is written to accept *arguments* and process those *arguments*, a *method* processes the object to which it belongs and is *called* differently. A *function* such as **sorted** is called by typing **sorted(list_variable)**, whereas a *method* is called by typing a period and the name of the *method* at the end of the object. For example, if *print* were a *method*, it would be called like this:
```python
'Hello, World!'.print() 
```
Notice that there are still **()** at the end of the name of the *method*, and *methods* can accept *arguments* just like *functions*. 
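
As a small side-by-side sketch of the two calling styles, here is a real function (**sorted**) next to two real string methods (**upper()** and **startswith()**). The variable names are made up for illustration, and `print()` is written with parentheses and a single argument so the snippet runs under both Python 2 and Python 3.

```python
# A function takes the object as an argument: name(argument)
numbers = [3, 1, 2]
sorted_numbers = sorted(numbers)

# A method is attached to the object with a period: object.name()
greeting = 'Hello, World!'
shouted = greeting.upper()

# Methods can accept arguments, just like functions
starts_with_hello = greeting.startswith('Hello')

print(sorted_numbers)      # [1, 2, 3]
print(shouted)             # HELLO, WORLD!
print(starts_with_hello)   # True
```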

---

### *split()*

Let's consider the task of converting a character string of a sentence into a list of words separated by spaces and punctuation marks:

In [3]:
delimiter = ","
sentence_string = "I am a well-written sentence, and so I dependably have punctuation."

list_from_string = sentence_string.split(delimiter)

print list_from_string

['I am a well-written sentence', ' and so I dependably have punctuation.']


Note that as we've split with a comma, the comma doesn't appear in our list. We can try out what happens with different arguments to split().

In [4]:
# we don't need to specify the delimiter in a different variable
 
list_A = sentence_string.split(' ')
print list_A
for word in list_A:
     print word

['I', 'am', 'a', 'well-written', 'sentence,', 'and', 'so', 'I', 'dependably', 'have', 'punctuation.']
I
am
a
well-written
sentence,
and
so
I
dependably
have
punctuation.


In [5]:
list_B = sentence_string.split('a')
print list_B
for vowel_handicapped_lump in list_B:
     print vowel_handicapped_lump

['I ', 'm ', ' well-written sentence, ', 'nd so I depend', 'bly h', 've punctu', 'tion.']
I 
m 
 well-written sentence, 
nd so I depend
bly h
ve punctu
tion.


You might also want to take a string and turn it letter-by-letter into a list. Although this isn't done by **split()**, it fits nicely here:

In [6]:
list_C = list(sentence_string)
print list_C

['I', ' ', 'a', 'm', ' ', 'a', ' ', 'w', 'e', 'l', 'l', '-', 'w', 'r', 'i', 't', 't', 'e', 'n', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', ',', ' ', 'a', 'n', 'd', ' ', 's', 'o', ' ', 'I', ' ', 'd', 'e', 'p', 'e', 'n', 'd', 'a', 'b', 'l', 'y', ' ', 'h', 'a', 'v', 'e', ' ', 'p', 'u', 'n', 'c', 't', 'u', 'a', 't', 'i', 'o', 'n', '.']


**split()** can also take a second argument (see, as always, the [string methods documentation](https://docs.python.org/2/library/stdtypes.html#string-methods)): the maximum number of splits you want it to make.

In [7]:
list_from_string = sentence_string.split(' ', 3)
print list_from_string
for item in list_from_string:
     print item

['I', 'am', 'a', 'well-written sentence, and so I dependably have punctuation.']
I
am
a
well-written sentence, and so I dependably have punctuation.
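
One place where limiting the number of splits tends to pay off is a line whose value contains the delimiter itself. The sketch below assumes a made-up `key=value` annotation line; the variable names and the data are hypothetical.

```python
# Hypothetical annotation line: the value itself contains an '='
annotation = 'note=ratio=0.5'

# Split only once, at the first '=', so the value stays in one piece
pieces = annotation.split('=', 1)
key = pieces[0]
value = pieces[1]

print(key)     # note
print(value)   # ratio=0.5

# Without the limit, the value gets broken apart as well
print(annotation.split('='))   # ['note', 'ratio', '0.5']
```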


Now let's see what happens when two delimiters are next to each other:

In [12]:
print sentence_string

list_from_string = sentence_string.split('t')
print list_from_string
for consonant_crippled_lump in list_from_string:
     print consonant_crippled_lump

I am a well-written sentence, and so I dependably have punctuation.
['I am a well-wri', '', 'en sen', 'ence, and so I dependably have punc', 'ua', 'ion.']
I am a well-wri

en sen
ence, and so I dependably have punc
ua
ion.


We can see that we have an empty string in our list: "written," in particular, was split into three parts: ["...wri","","en..."]. If delimiters are adjacent to each other, **split()** will find the empty string between them and give it to you at the appropriate spot. What's more, if a delimiter comes at the end of a string, **split()** will find the empty string at the end of the string and give it to you as the last item (our sentence ends with a period rather than a 't', so this example doesn't show it). The **split()** method is a very one-hand-clapping-in-a-forest sort of thing.
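
To see both cases at once, here is a minimal sketch using a made-up comma-delimited line with one empty field in the middle and a trailing comma. The loop afterwards is just one possible way to discard the empty items.

```python
# Hypothetical comma-delimited line: empty field in the middle, comma at the end
line = 'a,,b,'
fields = line.split(',')
print(fields)   # ['a', '', 'b', '']

# Keep only the non-empty fields
kept = []
for field in fields:
    if field != '':
        kept.append(field)
print(kept)     # ['a', 'b']
```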

However, there is an exception to this. If you glanced at the **split()** documentation, you might have noticed that all of its arguments are, in fact, in brackets. That means that it doesn't need arguments to run: it has a default behavior.

In [13]:
# this looks *almost* the same as splitting by spaces
list_from_string = sentence_string.split()
for item in list_from_string:
     print item

I
am
a
well-written
sentence,
and
so
I
dependably
have
punctuation.


In [15]:
# But it is not the same as splitting by spaces -- no empty items!
sentence_string = "   this      is    a   different                         string"
list_from_string = sentence_string.split()
for item in list_from_string:
     print item

this
is
a
different
string


In [16]:
sentence_string = '''   complete
\t\t whitespace                      chaos
             !!!!!!!!!!!         '''
list_from_string = sentence_string.split()
for item in list_from_string:
     print item

complete
whitespace
chaos
!!!!!!!!!!!


We see that the default behavior of **split()** is to:
1. Remove all kinds of whitespace from the beginning and end of the string.
2. Condense each run of adjacent whitespace characters into a single space character.
3. Split on those spaces.

This turns out to be really handy. For instance, if you're using someone else's table, and, as happens more often than you might want to think, they've done a poor job delimiting their fields systematically with whitespace, this cleans things up quickly and easily in just one line.
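
As a rough illustration of that cleanup, the sketch below takes one invented table row padded with a mix of spaces, a tab, and a newline, and compares splitting on a single space with the default behavior. The row contents and variable names are assumptions made up for this example.

```python
# Hypothetical row from a sloppily delimited table
messy_row = 'sample_1 \t  12.5     ACGT  \n'

# Splitting on a single space leaves empty strings and stray whitespace behind
by_space = messy_row.split(' ')
print(by_space)

# The default split() returns just the three fields
by_default = messy_row.split()
print(by_default)   # ['sample_1', '12.5', 'ACGT']
```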

The **split()** method being popular, it has a few hangers-on:

In [17]:
toes = '''went to the market
stayed home
had roast beef
had none
cried wee wee wee all the way home'''
 
# splitlines splits on linebreaks
list_from_string = toes.splitlines()
print list_from_string
for toe in list_from_string:
     print "this little piggy {}".format(toe)

['went to the market', 'stayed home', 'had roast beef', 'had none', 'cried wee wee wee all the way home']
this little piggy went to the market
this little piggy stayed home
this little piggy had roast beef
this little piggy had none
this little piggy cried wee wee wee all the way home


In [19]:
# from the end of the string
last_toe = "and this little piggy went wee wee wee all the way home"
# when given a second argument, reverse split counts from the end
list_from_string = last_toe.rsplit(' ',7)
print list_from_string
for item in list_from_string:
     print item

['and this little piggy went', 'wee', 'wee', 'wee', 'all', 'the', 'way', 'home']
and this little piggy went
wee
wee
wee
all
the
way
home
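
A common use for **rsplit()** with a limit of 1 is peeling the last piece off a string, for example a file extension. This is only a sketch with a made-up filename; for real paths the standard library has dedicated tools.

```python
# Hypothetical filename containing more than one period
filename = 'reads.trimmed.fastq'

# Split once, counting from the right: everything before the last '.' stays together
parts = filename.rsplit('.', 1)
base = parts[0]
extension = parts[1]

print(base)        # reads.trimmed
print(extension)   # fastq

# For comparison, split() with the same limit cuts at the first '.'
print(filename.split('.', 1))   # ['reads', 'trimmed.fastq']
```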


### *join()*
So now we're pretty good at splitting things up, but how do we put things together again? **join()** takes care of that: it turns lists into strings. Surprisingly enough, it's not a method of lists. It's a string method, called on the delimiter that glues the list items together. This little surprise makes the syntax of **join()** among the most unintuitive of all syntactic trifles, but we will persevere if we concentrate on the fact that, just like **split()**, **join()** is a method of strings.

In [21]:
join_str = "--"
words = ['Up', 'Down', 'Left', 'Right']

print join_str.join(words)

Up--Down--Left--Right


In [24]:
# Or we can just do it like this:

print '--'.join(words)

Up--Down--Left--Right


In [22]:
broken = ['hu','m','pty',' du','mpty']  # Poor humpty dumpty....

# Can we rebuild him?
all_the_kings_horses = '...'
all_the_kings_men = '+++'

first_try = all_the_kings_horses.join(broken)
print first_try

second_try = all_the_kings_men.join(broken)
print second_try

if (first_try == 'humpty dumpty') or (second_try =='humpty dumpty'):
     print 'hooray!'
else:
     print '''All the king's horses and all the king's men couldn't put Humpty together again'''

hu...m...pty... du...mpty
hu+++m+++pty+++ du+++mpty
All the king's horses and all the king's men couldn't put Humpty together again


**join()** can also usefully take the empty string as its delimiter: it glues the components of the list directly together.

In [25]:
third_try = ''.join(broken)
print third_try

humpty dumpty


Paradoxically, 'nothing' can put poor Humpty together again. To summarize, the syntax of **join()** is

```python
'delimiter'.join(list_object)
```

This is in fact the usual way to use **join()** -- you don't need to declare a separate variable to act as the glue.

In [26]:
fairy_tale_characters = ['witch','rapunzel','prince']
plot = '--hair--'.join(fairy_tale_characters)
print plot

witch--hair--rapunzel--hair--prince
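
Since **split()** and **join()** undo each other, chaining them is a handy way to tidy up a string. The sketch below assumes an invented, messily spaced string and rebuilds it first with single spaces and then with tabs.

```python
# Hypothetical string with irregular whitespace
messy = '  too    many\t spaces \n here '

# split() with no arguments breaks on any whitespace; join() glues with one space
words = messy.split()
clean = ' '.join(words)
print(clean)    # too many spaces here

# The same list can be re-joined with a different delimiter, such as a tab
tabbed = '\t'.join(words)
print(tabbed.split('\t'))   # ['too', 'many', 'spaces', 'here']
```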


## Testing Text: *startswith()*, *endswith()*, *find()*, and *in*

Some files can have multiple types of information in them designated by a change in formatting. For example, in FASTA files sequence names are designated by starting the line with the **>** character, and all other lines are sequence. Of course we could use an **if** statement to test for the presence of a formatting character, as we demonstrated by finding ORFs within a sequence in the section on **if** statements, but why go through the effort when someone else has already written a method for us? We will cover tests asking if a string begins with, ends with, or contains a substring of interest.

In [27]:
id_number = '1131431a'
 
# let's see if the id_number string starts with the number one
if id_number[0] == '1':
    print "This id starts with a 1."

# now let's use the string method startswith()
if id_number.startswith('1'):
    print "This id starts with a 1!"
else:
    print "This id doesn't start with a 1!"

# and here's the endswith() method
if id_number.endswith('1'):
    print "This id number ends with a 1!"
else:
    print "This id number doesn't end with a 1 at all!"

# and these methods can get a little fancier by having multiple things to
# test for if you provide a tuple of strings
if id_number.endswith(('1', 'a')):
    print "This id number ends with either an 'a' or a '1'."
else:
    print "This id number does not end with an 'a' or a '1'!"

This id starts with a 1.
This id starts with a 1!
This id number doesn't end with a 1 at all!
This id number ends with either an 'a' or a '1'.
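
Returning to the FASTA case mentioned above, here is a minimal sketch of how **startswith()** might sort lines into names and sequence. The FASTA text is invented, and it lives in a string rather than a file so the example stays self-contained.

```python
# Hypothetical FASTA-formatted text with two records
fasta_text = '''>seq1 made-up example
ACGT
TTGA
>seq2
GGCC'''

names = []
sequence_lines = []
for line in fasta_text.splitlines():
    if line.startswith('>'):
        # drop the leading '>' and keep the rest as the name
        names.append(line[1:])
    else:
        sequence_lines.append(line)

print(names)            # ['seq1 made-up example', 'seq2']
print(sequence_lines)   # ['ACGT', 'TTGA', 'GGCC']
```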


Or maybe we don't care what the string starts or ends with as long as it contains a substring of interest. For this, we can use the **find()** method, which returns the index at which the substring first occurs.

In [28]:
sequence = 'AAGGCGCGT'
first_c = sequence.find('C')

print first_c

4


In [30]:
first_z = sequence.find('Z')

print first_z

-1


But be careful when you write **if** tests using the **find()** method, as it returns an index rather than a true-or-false answer.
*If the substring isn't found, **find()** returns the integer -1, which is not zero and thus passes the **if** test as **True**. Conversely, a substring found at the very start of the string returns the index 0, which the **if** test treats as **False**.*

In [32]:
beatles = "john, paul, george, and ringo"
 
# the wrong way
if beatles.find('stuart'):
    print "Found Stuart! We've got a bassist."
else:
    print "Anyone here play bass?"

Found Stuart! We've got a bassist.


In [35]:
# let's do a comparison for -1 instead
if beatles.find('pete') != -1:
    print "Found Pete! We've got a drummer!"
else:
    print "This band isn't working out. . ."

This band isn't working out. . .
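
The mirror image of this pitfall is a substring that really is present, but at the very start of the string. In the sketch below, **find()** returns 0, so the naive test wrongly reports a miss while the comparison against -1 gets it right. The result variables are made up for illustration.

```python
beatles = "john, paul, george, and ringo"

# 'john' sits at index 0, and 0 counts as False in an if test
position = beatles.find('john')
print(position)   # 0

# the wrong way: a genuine match is reported as missing
if beatles.find('john'):
    naive_result = 'found'
else:
    naive_result = 'not found'
print(naive_result)   # not found

# comparing against -1 gives the right answer
if beatles.find('john') != -1:
    careful_result = 'found'
else:
    careful_result = 'not found'
print(careful_result)   # found
```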


As you can see, using **find()** for **if** testing is counter-intuitive, and that's because **find()** is not intended to be used for this. Instead, use the **in** keyword, which works exactly like you would think and is generally faster as well.

In [37]:
seq = 'GAAGTCGGAACCGAGGGTATGTCTCGGTGGCCAG'
#                        ^^^

if 'ATG' in seq:
    print 'start codon at position {}'.format(seq.find('ATG'))
else:
    print 'no ORF present'

start codon at position 18
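
**find()** only reports the first occurrence. Per the string methods documentation it also accepts an optional start index, so one possible way to collect every occurrence is to keep searching from just past the previous hit. The sequence below is invented for the example.

```python
# Hypothetical sequence containing three start codons
seq = 'ATGAAATGCATG'

positions = []
start = seq.find('ATG')
while start != -1:
    positions.append(start)
    # resume the search one character past the last match
    start = seq.find('ATG', start + 1)

print(positions)   # [0, 5, 9]
```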


## Text Conversions
Systematically replacing the instances of a substring with a replacement substring may be a familiar task of tedium. Python has several methods for systematically converting characters in strings. The most general is the method **replace()**.

In [2]:
dead = 'Jerry Bob Phil Bill Ron'

dead = dead.replace('Ron', 'Keith')
print dead

# YES! Keith's in!

Jerry Bob Phil Bill Keith


In [3]:
dead = dead + ' Mickey More Mickey!'
print dead

print dead.replace('Mickey', 'Donna')

Jerry Bob Phil Bill Keith Mickey More Mickey!
Jerry Bob Phil Bill Keith Donna More Donna!


In [4]:
# And we can tell replace how many replacements to make, starting at the beginning
print dead
print dead.replace('Mickey', 'Donna', 1)

Jerry Bob Phil Bill Keith Mickey More Mickey!
Jerry Bob Phil Bill Keith Donna More Mickey!


Notice that **replace()** does not change the string in place. Remember that strings are immutable, so you have to reassign the variable to refer to the new string object that **replace()** returns.
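
A quick sketch of what that means in practice, using a made-up DNA string: calling **replace()** without keeping the result leaves the original untouched.

```python
dna = 'ATGCATGC'   # hypothetical sequence

# The returned string is thrown away here, so dna is unchanged
dna.replace('T', 'U')
print(dna)   # ATGCATGC

# Capture the return value to keep the converted string
rna = dna.replace('T', 'U')
print(rna)   # AUGCAUGC
print(dna)   # ATGCATGC
```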

Since Python is case sensitive, as are most UNIX-based bioinformatics programs you'll be interested in using, you may also find yourself wishing that all the text in your data was the same case. There are methods for both testing and converting cases.

In [43]:
blast_hit = 'ACTGTCAGTACGTAGCATCGAaaatCGATCGACTGAatacgatCG'
 
if blast_hit.isupper():
    pass
else:
    print blast_hit.upper()

ACTGTCAGTACGTAGCATCGAAAATCGATCGACTGAATACGATCG


In [44]:
# or if you prefer lower case
print blast_hit.lower()

actgtcagtacgtagcatcgaaaatcgatcgactgaatacgatcg


In [45]:
# or if you are (or the program you're writing is) indecisive
print blast_hit.swapcase()

actgtcagtacgtagcatcgaAAATcgatcgactgaATACGATcg


In [46]:
# and we might also be interested in this method
if blast_hit.isalpha():
    print "we got all letters here"
else:
    print "whoa, something doesn't look like nucleotides!"

we got all letters here
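
One place these methods tend to earn their keep is comparing sequences that differ only in case. The sketch below uses two invented strings and converts both to upper case before testing for equality; it also shows **isalpha()** rejecting a string that contains a gap character.

```python
# Hypothetical hits that differ only in case
hit_a = 'ACTGaaatCG'
hit_b = 'actgAAATcg'

same_raw = (hit_a == hit_b)
print(same_raw)     # False

# Normalize the case of both strings before comparing
same_upper = (hit_a.upper() == hit_b.upper())
print(same_upper)   # True

# isalpha() is False as soon as any character is not a letter
gapped = 'ACTG-AAAT'
print(gapped.isalpha())   # False
```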
