## String Operations


Python has lots of built-in ways for dealing with text data. This can be tremendously useful if you need to search through text data, or if you need to manipulate text.

There are a whole host of methods, but I will show you just some of the basics. First, let's make a couple of string variables:


In [None]:

str1 = 'Jason Hubbard'
str2 = 'That is what my name is, you know. What is your name?'


### Methods

Many of the string operations are done with *methods*. You will see this with lists too. Methods are functions too, except you tack them onto the end of a variable name with a dot `.`

This is because methods are meant to operate on certain types of objects (data types). It's hard to wrap your head around at first, but you'll get the hang of it!
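Here is a minimal sketch of the difference in syntax. A regular function such as `len` takes the string inside its parentheses, while a method such as `upper` is attached to the string with a dot. The variable names `length` and `shouted` are just made up for illustration.

```python
str1 = 'Jason Hubbard'

# A regular function: the string goes inside the parentheses
length = len(str1)

# A method: the string comes first, then a dot, then the method name
shouted = str1.upper()

print(length)   # 13
print(shouted)  # JASON HUBBARD
```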

### Startswith/Endswith


Two handy methods are `startswith` and `endswith`, which tell you whether or not (you guessed it) the string starts or ends with some character or sequence of characters. Each method returns `True` or `False`.

> When I've taught programming before, I've found that people can understand objects pretty easily if they're introduced first. It's easy to think about anthropomorphisms like "Businessman objects know the method of stealing from the poor, and have the attribute of a full wallet." More importantly, though, it's deep in the nature of Python and makes a lot of the confusing parts of the language self-evident.

In [None]:
print(str1.startswith('J'))

print(str1.startswith('Jas'))

print(str2.startswith('That'))
print(str1.startswith('That'))

This is handy for `if` statements. For instance, we could check if a string is a question by checking if the last character is a question mark.

In [None]:

def is_question(somestring):
    if somestring.endswith('?'):
        answer = 'yes!'
    else:
        answer = 'no'
    
    return answer


print(is_question(str1))
print(is_question(str2))

### Find

What if we care about stuff in the middle of the string? No problemo. That's what `find` is for:

In [None]:

str3 = 'No Problemo, that is what find is for'

str3.find('Problemo')



What is the number for? Well, it tells you the *position* that the string starts at. Notice that "Problemo" starts at the 4th character in the string, but its position is 3, because counting starts at zero. If we search for something that isn't there, then we get an answer of -1.

In [None]:
str3.find('potato')
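Because a missing string gives -1 rather than `False`, it is safest to compare the result against -1 explicitly when you use `find` in an `if` statement. A small sketch, where `contains` is a made-up helper name and not a built-in:

```python
str3 = 'No Problemo, that is what find is for'

def contains(text, piece):
    # find returns -1 when piece is absent, so compare against -1 explicitly
    return text.find(piece) != -1

found_word = contains(str3, 'Problemo')
found_potato = contains(str3, 'potato')

print(found_word)    # True
print(found_potato)  # False
```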

The position is useful, because we can index (grab) certain characters from the string based on their position. We just use square brackets `[]` with the position number in between. In Python, position numbers start at zero.

We can get a range of positions by using the colon `:` between the starting and ending range. 
We can also use the `len` function to get the length of the string. 

In [None]:
str3 = 'No Problemo, that is what find is for'


print(str3[0]) #the very first character
print(str3[0:2]) #the first TWO characters
print(str3[5:9]) #the 6th through the 9th character

search_string = 'Problemo'
numchars = len(search_string)
print(numchars)

# A good example of the object-oriented nature of Python.
# len() calls the string object's __len__ method behind the scenes.
# We could have equivalently called

print(search_string.__len__())


From `find` we can find the start of "Problemo", and based on the length (`numchars`), we know the end point. Let's see if you can get the whole word "Problemo" out of `str3` using `find` and indexing:
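If you get stuck, here is one possible solution, sketched under the assumption that the word really is in the string (otherwise `find` would return -1 and the slice would be wrong). The names `start`, `end`, and `word` are just illustrative.

```python
str3 = 'No Problemo, that is what find is for'
search_string = 'Problemo'

start = str3.find(search_string)   # position where the word begins
end = start + len(search_string)   # one past the position where it ends
word = str3[start:end]

print(word)  # Problemo
```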

### Replace

The `replace` method is a nice way of replacing a certain string with another string. Let's say I have the string below, and I want to replace "worst" with "blurst". 

In [None]:

excerpt = 'It was the best of times, it was the worst of times'

print(excerpt)

print(excerpt.replace('worst','blurst'))



`replace` can also be used to remove pieces from your string. Just replace the piece with an empty string. You specify an empty string using two quotes `''` with no spaces or anything in between. Let's remove the comma from `excerpt`:

In [None]:

excerpt.replace(',','')
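One thing worth noticing: `replace` does not change the original string, it hands back a new one. If you want to keep the result, save it in a variable. A quick sketch (the name `no_comma` is made up):

```python
excerpt = 'It was the best of times, it was the worst of times'

no_comma = excerpt.replace(',', '')  # save the result under a new name

print(excerpt)   # the original still has its comma
print(no_comma)  # the new string does not
```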



### Upper and Lower

Other handy methods are `upper` and `lower` for changing the text:

In [None]:
regular_email = 'Hey how are you?'

mom_email = regular_email.upper() + '!!!!'

print(mom_email)

print(mom_email.lower())


### Count

We can count the occurrences of certain characters using `count`

In [None]:
mom_email.count('!') #way too many



### Join

The method `join` is useful for putting a separator between all the characters of a string. Notice we can use a variable as the separator, or just the string itself.

In [1]:
separator = '-'

print(separator.join('UniversityofOregon'))

print('--'.join('UniversityofOregon'))

print('-potato-'.join('123'))

U-n-i-v-e-r-s-i-t-y-o-f-O-r-e-g-o-n
U--n--i--v--e--r--s--i--t--y--o--f--O--r--e--g--o--n
1-potato-2-potato-3


`join` is most useful if you have a list of items, rather than just a string. It's great for inserting spaces between each element of your list. This works because `join` accepts any iterable of strings: iterating over a string yields individual characters, while iterating over a list of words yields whole words.

In [None]:
print('-'.join(['University', 'of', 'Oregon']))
print(' '.join(['University', 'of', 'Oregon']))

### Strip

The `strip` method becomes useful when you're reading in files. Often, there will be spaces hidden in your text file that are difficult to find. Later in your script, things will break because you're expecting just a word or sentence. `strip` removes all whitespace surrounding a string. The nice thing is that it doesn't remove all spaces, only the ones at the beginning and end. It will also remove other hidden characters like tab (`\t`) and newline (`\n`). You will learn more about these later.

In [None]:

pesky_text = '   hello there     '

print(pesky_text.strip()) #notice the space we want is still there!


pesky_text2 = '   hello there\t   \n\n' #tabs and newline characters

print(pesky_text2.strip())
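Hidden characters are hard to see when you just print a string. One way to reveal them, sketched here, is the built-in `repr` function, which shows the string the way you would type it, escape codes and all. The names `before` and `after` are just illustrative.

```python
pesky_text2 = '   hello there\t   \n\n'

# repr shows quotes, spaces, \t and \n explicitly
before = repr(pesky_text2)
after = repr(pesky_text2.strip())

print(before)
print(after)
```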

### Split

In the next lesson we will learn about lists, which are very powerful. Often times, you will read in a line of text from a file, and you want to split that text into a list, so you can look at each item individually. The `split` method will take a string, and split it up into a list, based on the character you specify (the separator). It basically is the exact opposite of `join`. 

In [None]:

#Here is the first name, last name, age, eye color, and hair color for someone
#each bit of data is separated by a comma.
mydata = 'Bob,Smith,45,blue,blonde'


datalist = mydata.split(',')

print(datalist)

#First and Last name
datalist[0:2]

#eye color 
datalist[3]
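To see that `split` and `join` really undo each other, here is a small round-trip sketch using the same separator for both. The name `rejoined` is made up for illustration.

```python
mydata = 'Bob,Smith,45,blue,blonde'

datalist = mydata.split(',')     # string -> list
rejoined = ','.join(datalist)    # list -> string

print(datalist)
print(rejoined == mydata)  # True
```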



`split` works nicely if you want to get each word of a sentence as a separate element in a list. Often, you will use it in conjunction with `strip`; otherwise, you end up with extra stuff at the beginning and end.

In [None]:

mystring = '   It was the best of times, it was the blurst of times   \n'

mystring.split(' ') #there is a space in there!



Notice that the list has 3 empty strings at the beginning, and 2 empty strings and a newline character at the end. These are left over from the whitespace around the sentence. Let's clean it up first:

In [None]:
mystring = '   It was the best of times, it was the blurst of times   \n'

mystring_stripped = mystring.strip()
print(mystring_stripped)

print(mystring_stripped.split(' ')) #nice!

### Chaining Methods Together

Notice that I had an intermediate step, and I saved a different variable, `mystring_stripped`. The nice thing is, we don't need to do this in Python. We can chain our different steps together. If we want to strip the text, then split it, then we just write it like so:

In [None]:
mystring = '   It was the best of times, it was the blurst of times   \n'

mystring.strip().split(' ')


We can chain together as many as we want on 1 line. Just remember, at a certain point it makes your head hurt. 

In [None]:
mystring.strip().replace('blurst','worst').replace(',','').upper().split()
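Notice that the last step above calls `split()` with nothing in the parentheses. With no separator given, `split` breaks the string on any run of whitespace and ignores whitespace at the ends, so it behaves differently from `split(' ')`. A short sketch of the difference (the names `with_space` and `no_argument` are made up):

```python
mystring = '   It was the best of times, it was the blurst of times   \n'

with_space = mystring.split(' ')   # splits on every single space
no_argument = mystring.split()     # splits on any run of whitespace

print(with_space)
print(no_argument)
```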

## String Formatting

We've seen this before, but let's revisit it. Sometimes you want to manipulate strings by inserting information in them. Often this is for printing some output to the user. So far you have done it using the `+` operator and converting numbers to strings with `str` like this:

```python
print("the value of x is: " + str(x))
```

This works fine if the number is at the end, but gets very tedious when you're inserting multiple bits of information that are embedded in the middle of a sentence. Consider the example below. What if I wanted to print a sentence like, **"This person is named [name], they are [age] years old. They also have [haircolor] hair and [eyecolor] eyes. That's all I know about [name]"**

In [2]:
name = 'Bob'
age = 45
haircolor = 'blonde'
eyecolor = 'green'
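For comparison, here is a sketch of most of that sentence built the tedious way, with `+` and `str`. The variable name `sentence` is just illustrative.

```python
name = 'Bob'
age = 45
haircolor = 'blonde'
eyecolor = 'green'

# Every number needs str(), and every piece needs its own + and quotes
sentence = ('This person is named ' + name + ', they are ' + str(age) +
            ' years old. They also have ' + haircolor + ' hair and ' +
            eyecolor + ' eyes.')

print(sentence)
```

It works, but it is easy to forget a space or a `str` call somewhere in the middle.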



First, we insert placeholders into our string using a funny syntax. We use different placeholders for different types of variables. For string variables, we always use `%s` (s for string). For integers, we use `%d` (d for decimal integer). Then, at the end of the string, we use the `%` operator (I know, it's confusing because Python also uses it for the modulo operation). After the `%` we list the variables in the order they appear. If there are multiple variables, you have to surround them with parentheses, which makes them a tuple.

In [3]:
message = 'This person is named %s, they are %d years old. They also have %s hair and %s eyes.'

print(message % (name, age, haircolor, eyecolor))

#notice that order is important. If I make a mistake, my message comes out wrong
print(message % (haircolor, age, name, eyecolor))

#if we want to repeat the same information multiple times, we need to list it multiple times
print(message % (name, age, name, name))


This person is named Bob, they are 45 years old. They also have blonde hair and green eyes.
This person is named blonde, they are 45 years old. They also have Bob hair and green eyes.
This person is named Bob, they are 45 years old. They also have Bob hair and Bob eyes.


Much simpler is the .format() method of strings:

In [6]:
message = 'This person is named {}, they are {} years old. They also have {} hair and {} eyes.'
print(message.format(name, age, haircolor, eyecolor))

# Or if we don't want to depend on the order of our arguments
message = 'This person is named {name}, they are {age} years old. They also have {hcolor} hair and {ecolor} eyes.'
print(message.format(age=age,
                     hcolor=haircolor,
                     ecolor=eyecolor,
                     name=name))

# Or repeat things multiple times
message = 'This person is named {name}, they are {name} years old. They also have {name} hair and {name} eyes.'
print(message.format(name=name))

This person is named Bob, they are 45 years old. They also have blonde hair and green eyes.
This person is named Bob, they are 45 years old. They also have blonde hair and green eyes.
This person is named Bob, they are Bob years old. They also have Bob hair and Bob eyes.


Back to `%` formatting: the `float` datatype is a little different. We use the `%f` placeholder, and often we have to specify how many digits we want to appear after the decimal point. First, a simple example:

In [None]:
from math import pi

#6 digits after the decimal point
print('the value of pi is %f' % pi)

#12 digits after the decimal point
print('the value of pi is %.12f' % pi)



Notice we can specify the number of digits after the placeholder using `%.Xf` where X is the number of places. 
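The formatted result is an ordinary string, so you can store it instead of printing it. The sketch below also shows that the last digit is rounded, not just chopped off. The variable names are made up for illustration.

```python
from math import pi

two_places = '%.2f' % pi
four_places = '%.4f' % pi     # rounds up: pi is 3.14159...
six_places = '%f' % pi        # the default is 6 places

print(two_places)   # 3.14
print(four_places)  # 3.1416
print(six_places)   # 3.141593
```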

This is particularly relevant if we want to display amounts of money. When Python does a computation, it may include a lot of digits after the decimal point, but we only care about the first two. 

In [None]:

total = 45.65
payment = 50.
change = payment-total

print('your change is %f' % change) #weird

print('your change is $%.2f' % change) #much better!



The other thing you might want to do is add leading zeros to a number. You will realize later that this is useful for finding files. We can add leading zeros to an integer by saying `%0Xd` where X is the total width of the number:

In [None]:

myint = 4

print('Here it is with 6 total places: %06d' % myint)
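The `0` right after the `%` is what asks for zeros; without it, the number is padded with spaces instead. Here is a sketch of both, plus the file-name use case. The `subject_` prefix and `.txt` ending are made-up names, just for illustration.

```python
myint = 4

padded_with_zeros = '%06d' % myint    # '000004'
padded_with_spaces = '%6d' % myint    # '     4'

# Hypothetical file names: zero padding keeps them the same length
first_file = 'subject_%03d.txt' % 1
tenth_file = 'subject_%03d.txt' % 10

print(padded_with_zeros)
print(padded_with_spaces)
print(first_file)   # subject_001.txt
print(tenth_file)   # subject_010.txt
```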


There is also a newer way of formatting strings using the `format` method, which we glimpsed above. It is outlined here <https://pyformat.info/>. I won't cover it in detail in this class, only because it's more complicated. It is also generally better and more flexible, but less intuitive. You of course are free to use that method in your assignments if you wish.
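If you are curious, here is a brief sketch of how the decimal-place and leading-zero examples look with `format`. The part after the colon inside the braces plays the same role as the part after the `%`. The variable names are made up for illustration.

```python
from math import pi

pi_message = 'the value of pi is {:.2f}'.format(pi)               # like %.2f
zero_message = 'Here it is with 6 total places: {:06d}'.format(4)  # like %06d

print(pi_message)    # the value of pi is 3.14
print(zero_message)  # Here it is with 6 total places: 000004
```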

There are also various other things you can do using our "old" method, but I have shown you the most common cases that will come up. 