# Class 17: Text manipulation

In this notebook we will learn some functions to manipulate text data. These functions should be useful for cleaning data for your class project. 


## Notes on the class Jupyter setup

If you have the *ydata123_2024a* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [3]:
import YData

# YData.download.download_class_code(17)       
# YData.download.download_class_code(17, True) # get the code with the answers 


There are also similar functions to download the homework:

In [4]:
# YData.download.download_homework(7)  # download the homework 

If you are using Google Colab, you should also uncomment and run the code below to install the YData package and mount your Google Drive.

In [5]:
# !pip install https://github.com/emeyers/YData_package/tarball/master
# from google.colab import drive
# drive.mount('/content/drive')

## Text manipulation

A large part of Data Scientists' time is spent cleaning data, and a large part of data cleaning consists of manipulating text.

Let's explore some of the functions that are built into Python for manipulating strings of text. 


### 1. Changing capitalization

One of the most basic things we can do is to change the capitalization of a piece of text. 

One case where this comes up is when one is merging two DataFrames that have the same key values but the values have different capitalization. For example, one might have two DataFrames that have a column that has the names of different countries, but in one DataFrame the country names are capitalized and in the other they are not. 

Python strings have a number of methods to change the capitalization of words including: 

- `capitalize()`: Converts the first character to upper case and the rest to lower case
- `lower()`: Converts a string into lower case
- `upper()`: Converts a string into upper case
- `title()`: Converts the first character of each word to upper case
- `swapcase()`: Swaps cases, lower case becomes upper case and vice versa

Let's explore these methods by manipulating this [quote](https://www.brainyquote.com/topics/yale-quotes) from [Herman Melville](https://en.wikipedia.org/wiki/Herman_Melville): "a whale ship was my Yale College and my Harvard". 


In [6]:
melville_quote = "a whale ship was my Yale College and my Harvard"

melville_quote


'a whale ship was my Yale College and my Harvard'

In [7]:
# Capitalize the first letter 

melville_quote.capitalize()

'A whale ship was my yale college and my harvard'

In [8]:
# Convert all letters to lower case

melville_quote.lower()

'a whale ship was my yale college and my harvard'

In [9]:
# Convert all letters to upper case

melville_quote.upper()

'A WHALE SHIP WAS MY YALE COLLEGE AND MY HARVARD'

In [10]:
# Make the first letter of each word capitalized

melville_quote.title()

'A Whale Ship Was My Yale College And My Harvard'

In [11]:
# Make uppercase lowercase, and lowercase uppercase

melville_quote.swapcase()


'A WHALE SHIP WAS MY yALE cOLLEGE AND MY hARVARD'
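Returning to the merging scenario described at the start of this section, the following is a minimal plain-Python sketch (using dictionaries rather than DataFrames) of how lowercasing makes mismatched keys comparable. The dictionary names and the numeric values are hypothetical placeholders chosen only for illustration.

```python
# Hypothetical data: two sources that capitalize country names differently
scores = {"FRANCE": 1, "japan": 2, "Brazil": 3}          # placeholder values
capitals = {"France": "Paris", "Japan": "Tokyo", "brazil": "Brasilia"}

# Normalize the keys of both dictionaries to lower case before matching
scores_lower = {name.lower(): value for name, value in scores.items()}
capitals_lower = {name.lower(): value for name, value in capitals.items()}

# Combine the entries that now share the same key
merged = {name: (scores_lower[name], capitals_lower[name])
          for name in scores_lower if name in capitals_lower}

print(merged)
```

Without the call to `lower()`, none of the keys in the two dictionaries would match.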

### 2. String padding

Often we want to remove extra spaces (called "white space") from the front or end of a string. Or conversely, sometimes we want to add extra spaces to make a set of strings the same length (this is known as "string padding"). 

Python strings have a number of methods that can pad/trim strings including: 

- `strip()`: Returns a trimmed version of the string (i.e., with no leading or trailing white space). 
- `rstrip()`: Returns a right-trimmed version of the string (trailing white space removed)
- `lstrip()`: Returns a left-trimmed version of the string (leading white space removed)

- `center(num)`: Returns a centered string (with equal padding on both sides)
- `ljust(num)`: Returns a left justified version of the string
- `rjust(num)`: Returns a right justified version of the string

- `zfill(num)`: Pads the string with 0s at the beginning until it is `num` characters long

Let's use a modified version of the Melville quote to explore these methods.


In [12]:
melville_quote2 = "    a whale ship was my Yale College and my Harvard   "
melville_quote2

'    a whale ship was my Yale College and my Harvard   '

In [13]:
# Strip the white space
melville_quote2.strip()

'a whale ship was my Yale College and my Harvard'

In [14]:
# Strip just the left white space
melville_quote2.lstrip()

'a whale ship was my Yale College and my Harvard   '

In [15]:
# Center the quote by padding with white space 
# to have a total of 70 characters
melville_quote.center(70)


'           a whale ship was my Yale College and my Harvard            '
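The `ljust()` and `rjust()` methods listed above are not demonstrated in this notebook, so here is a small sketch of one possible use: lining up text in columns. The `rows` data is hypothetical.

```python
# Hypothetical rows of (label, value) text to line up in columns
rows = [("whale", "7"), ("ship", "42"), ("college", "123")]

# Left-justify the labels to 10 characters and right-justify the values to 5
lines = [label.ljust(10) + value.rjust(5) for label, value in rows]

print("\n".join(lines))
```

Every line ends up exactly 15 characters long, so the values line up on the right when printed in a fixed-width font.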

In [16]:
# Make a number have leading 0's 
# Q: Why/when is this useful?

"7".zfill(3)


'007'
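One possible answer to the question above is sorting: strings are compared character by character, so numbers stored as text only sort in numeric order when they all have the same width. The sketch below illustrates this with a hypothetical list of ids.

```python
# Hypothetical ids stored as strings without leading zeros
ids = ["10", "9", "2", "100"]

# Sorting strings compares character by character, so "10" comes before "2"
sorted_plain = sorted(ids)

# Padding every id to the same width makes string order match numeric order
sorted_padded = sorted(curr_id.zfill(3) for curr_id in ids)

print(sorted_plain)    # ['10', '100', '2', '9']
print(sorted_padded)   # ['002', '009', '010', '100']
```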

### 3. Checking string properties

There are also many functions to check properties of strings including:

- `isalnum()`: Returns True if all characters in the string are alphanumeric
- `isalpha()`: Returns True if all characters in the string are in the alphabet
- `isnumeric()`: Returns True if all characters in the string are numeric

- `isspace()`: Returns True if all characters in the string are whitespaces

- `islower()`: Returns True if all characters in the string are lower case
- `isupper()`: Returns True if all characters in the string are upper case
- `istitle()`: Returns True if the string follows the rules of a title

Let's test some of these methods out...


In [17]:
# Checking if a string is all letters

"abc".isalpha()      # True, but only the last expression in a cell is displayed

"abc123".isalpha()


False

In [18]:
# Checking if a string is all numbers

"123".isnumeric()

True

In [19]:
# Checking if a string only contains white space

"   ".isspace()

"\n".isspace()   # also works for new line characters \n, and tabs \t

True

In [20]:
# Checking if a string is upper case

"I AM NOT YELLING!!!".isupper()

True
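These checks are handy for filtering messy data. The sketch below, which uses a hypothetical list of raw entries, keeps only the values that consist entirely of digits once the surrounding white space is stripped. Note that `"3.5"` is rejected because the period is not a numeric character.

```python
# Hypothetical raw entries from a messy data column
raw_values = ["42", " 17 ", "n/a", "3.5", ""]

# Strip white space first, then keep only entries made up entirely of digits
clean_values = [value.strip() for value in raw_values
                if value.strip().isnumeric()]

print(clean_values)   # ['42', '17']
```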

### 4. Splitting and joining strings

There are several methods that can help us join a list of strings into a single string or, conversely, parse a single string into a list of strings. These include: 

- `split(separator_string)`: Splits the string at the specified separator, and returns a list
- `splitlines()`: Splits the string at line breaks and returns a list

- `join(a_list)`: Joins the elements of an iterable of strings into a single string, using the string it is called on as the separator

In [21]:
# Split the Melville quote at each space into a list

melville_quote.split(" ")

['a', 'whale', 'ship', 'was', 'my', 'Yale', 'College', 'and', 'my', 'Harvard']

In [22]:
# Split a string at each line into a list

poem = """Some say the world will end in fire,
Some say in ice.
From what I’ve tasted of desire
I hold with those who favor fire.
But if it had to perish twice,
I think I know enough of hate
To say that for destruction ice
Is also great
And would suffice."""

poem

'Some say the world will end in fire,\nSome say in ice.\nFrom what I’ve tasted of desire\nI hold with those who favor fire.\nBut if it had to perish twice,\nI think I know enough of hate\nTo say that for destruction ice\nIs also great\nAnd would suffice.'

In [23]:
# Split the poem into a list 

poem.splitlines()


['Some say the world will end in fire,',
 'Some say in ice.',
 'From what I’ve tasted of desire',
 'I hold with those who favor fire.',
 'But if it had to perish twice,',
 'I think I know enough of hate',
 'To say that for destruction ice',
 'Is also great',
 'And would suffice.']

In [24]:
# Join a list of strings together, separated by spaces

a_list = ["A", "Whale", "of", "a", "Tale"]

" ".join(a_list)



'A Whale of a Tale'
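One detail worth knowing when cleaning data is that `split(" ")` and `split()` with no argument behave differently when there are repeated spaces. The sketch below uses a hypothetical string to show the difference, and combines `split()` with `join()` to normalize the spacing.

```python
messy = "a  whale   ship"   # hypothetical string with repeated spaces

# Splitting on a single space keeps empty strings between repeated spaces
split_on_space = messy.split(" ")

# Calling split() with no argument splits on any run of white space
split_default = messy.split()

# Joining the cleaned pieces gives a string with single spaces
normalized = " ".join(split_default)

print(split_on_space)   # ['a', '', 'whale', '', '', 'ship']
print(split_default)    # ['a', 'whale', 'ship']
print(normalized)       # a whale ship
```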

### 5. Finding and replacing substrings

Some methods for locating a substring within a larger string include: 

- `count(substring)`: Returns the number of times a specified value occurs in a string
- `index(substring)`: Searches the string for a specified value and returns the first position where it was found (also see `find()`)
- `rfind(substring)`: Searches the string for a specified value and returns the last position where it was found (also see `rindex()`)

- `startswith(substring)`: Returns True if the string starts with the specified value
- `endswith(substring)`: Returns True if the string ends with the specified value

- `replace(original_str, replacement_str)`: Replace a substring with a different string. 

In [25]:
# How many times does the word "my" occur in the Melville quote? 
melville_quote.count("my")

2

In [26]:
# At what index does the first instance of "my" occur?
melville_quote.index("my")

17

In [27]:
# Does the quote start with "a"?
melville_quote.startswith("a")

True

In [28]:
# Does the quote end with Harvard? 

melville_quote.endswith("Harvard")

True

In [29]:
# Replace a substring
melville_quote.replace("Harvard", "that other school that is almost as good")

'a whale ship was my Yale College and my that other school that is almost as good'
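The `find()`/`index()` and `rfind()`/`rindex()` pairs differ mainly in what happens when the substring is missing. The sketch below illustrates this on the Melville quote; the search term "Princeton" is just an arbitrary word that does not appear in the quote.

```python
melville_quote = "a whale ship was my Yale College and my Harvard"

# find() returns the index of the first match, like index()
first_my = melville_quote.find("my")

# rfind() returns the index of the last match
last_my = melville_quote.rfind("my")

# find() returns -1 when the substring is missing...
missing = melville_quote.find("Princeton")

# ...whereas index() raises a ValueError
try:
    melville_quote.index("Princeton")
    raised_error = False
except ValueError:
    raised_error = True

print(first_my, last_my, missing, raised_error)   # 17 37 -1 True
```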

### 6. Filling in strings with particular values

There are a number of ways to fill in parts of a string with particular values. Perhaps the most useful is to use "f-strings", which have the following syntax: 

`f"my string {value_to_fill} will be filled in"`.

Here, the value of the variable `value_to_fill` will be filled into the string. 

Let's try it out... 


In [30]:
# Let's use an f-string

person = "Herman Melville"

f"Mr. {person} liked writing about whales."



'Mr. Herman Melville liked writing about whales.'

In [31]:
# We can also do formatting with f-strings

amount = 123000
f"${amount:,.2f} is a lot of money!"

'$123,000.00 is a lot of money!'
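The part after the colon inside the curly braces is a format specification. A few other specifications that may be useful are sketched below; the variable names and values are hypothetical.

```python
share = 0.256    # hypothetical proportion
name = "Yale"

percent_text = f"{share:.1%}"      # as a percentage with one decimal place
padded_text = f"{name:>10}"        # right-aligned in a field of width 10
rounded_text = f"{3.14159:.2f}"    # rounded to two decimal places

print(percent_text)   # 25.6%
print(padded_text)
print(rounded_text)   # 3.14
```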

### Example: string processing on webpages

As an example, let's do some string processing on webpages!


In [32]:
# Download a webpage and save it as a file called politics.html

import requests

url = 'https://www.foxnews.com/politics/white-house-doctor-says-biden-fit-serve-president'
r = requests.get(url, allow_redirects=True)
open('politics.html', 'wb').write(r.content)



223760

In [33]:
# Read in the file as a string called webpage_string
# (the with statement closes the file automatically)
with open('politics.html', 'r', encoding="utf8") as file:
    webpage_string = file.read()

# Look at the first 300 characters 
webpage_string[0:300]

'<!doctype html>\n<html data-n-head-ssr lang="en" data-n-head="%7B%22lang%22:%7B%22ssr%22:%22en%22%7D%7D">\n  <head>\n    <title>White House doctor says Biden &#x27;fit to serve&#x27; as president: &#x27;Healthy, vigorous, 80-year-old&#x27; | Fox News</title><meta data-n-head="ssr" http-equiv="X-UA-Comp'

In [34]:
# Replace a word on the webpage

webpage_updated = webpage_string.replace("Biden", "Sleepy Joe")


In [35]:
# Write updated string to a file
text_file = open("updated_politics.html", "w", encoding="utf8")
n = text_file.write(webpage_updated)
text_file.close()
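The string methods from earlier sections can also pull specific pieces out of a webpage. The sketch below extracts the text between the `<title>` tags using `find()` and slicing, and then uses `html.unescape()` from Python's standard library to convert character codes such as `&#x27;` back into an apostrophe. To keep the example self-contained it runs on a short hypothetical string called `sample` rather than on the downloaded page.

```python
import html

# A short hypothetical stand-in for a downloaded webpage string
sample = "<html><head><title>Melville&#x27;s whale ship | Example</title></head></html>"

# Locate the text between the opening and closing title tags
start = sample.find("<title>") + len("<title>")
end = sample.find("</title>")
raw_title = sample[start:end]

# Convert HTML character codes (e.g., &#x27;) into regular characters
title = html.unescape(raw_title)

print(raw_title)
print(title)
```

For anything beyond simple cases like this, a dedicated HTML parser is usually a more robust choice than string slicing.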

<img src = "https://i1.sndcdn.com/avatars-000316245474-0yp1vu-t500x500.jpg">

## Regular expressions

Regular expressions are strings with special characters that allow you to find more complex patterns in pieces of text.

To use regular expressions in Python we can use the `re` module. 

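If we convert the output of the `re.match()` function to a Boolean (i.e., `bool(re.match(pattern, text))`), we can tell if a piece of text starts with a particular pattern. Note that `re.match()` only looks for a match at the beginning of the string; the `re.search()` function looks for a match anywhere in the string. 

Let's run a test to check if:

1. Our Melville quote starts with the letter "a"
2. Our Melville quote starts with the letter "z"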


In [36]:
import re

# Check if our Melville quote starts with the letter a
print(bool(re.match("a", melville_quote)))


True


In [37]:
# Check if our Melville quote starts with the letter z
print(bool(re.match("z", melville_quote)))


False
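Because `re.match()` only tests the beginning of a string, checking for a pattern anywhere in the text needs a different function. The sketch below contrasts it with `re.search()`, which is part of the same `re` module but is not otherwise used in this notebook.

```python
import re

melville_quote = "a whale ship was my Yale College and my Harvard"

# re.match() only looks at the beginning of the string
starts_with_yale = bool(re.match("Yale", melville_quote))

# re.search() scans the whole string for the pattern
contains_yale = bool(re.search("Yale", melville_quote))

# For a fixed substring, the plain-Python `in` operator also works
contains_z = "z" in melville_quote

print(starts_with_yale, contains_yale, contains_z)   # False True False
```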


A few special characters that can be used in regular expressions are:
- `^` means the start of the string 
- `$` means the end of the string 
- `[Pp]` means P or p
- `[^Pp]` means any single character other than P or p

In [41]:
# Check if our Melville quote starts with an upper or lower case A
print(bool(re.match("^[aA]", melville_quote)))


True


In [39]:
# Check if our Melville quote starts with a vowel
print(bool(re.match("^[aeiouAEIOU]", melville_quote)))

True


In [38]:
# Check if our Melville quote does not start with a vowel
print(bool(re.match("^[^aeiouAEIOU]", melville_quote)))

False
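The `$` character is not demonstrated in the cells above. Since `re.match()` starts at the beginning of the string, one way to test how a string ends is to combine `$` with `re.search()`, as in this sketch (the use of `re.search()` here is a choice made for this example, not something used elsewhere in the notebook).

```python
import re

melville_quote = "a whale ship was my Yale College and my Harvard"

# Does the quote end with "Harvard"?
ends_with_harvard = bool(re.search("Harvard$", melville_quote))

# Does the quote end with "Yale"?
ends_with_yale = bool(re.search("Yale$", melville_quote))

# Does the quote end with a vowel?
ends_with_vowel = bool(re.search("[aeiouAEIOU]$", melville_quote))

print(ends_with_harvard, ends_with_yale, ends_with_vowel)   # True False False
```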


#### Wildcard characters

In [42]:
# We can use the period . to match any one character

bool(re.match("m.ss", "miss"))   # miss, mass, mess


True

In [40]:
# * means repeat the previous character 0 or more times
bool(re.match("xy*z", "xz"))   # xz, xyz, xyyz, xyyyz, ...

True

In [41]:
# + means repeat the previous character 1 or more times
bool(re.match("xy+z", "xz"))   # xyz, xyyz, xyyyz, ...

False

In [42]:
# Will the following match?

bool(re.match(".*a.*e",  "pineapple"))  


True
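To compare the three wildcard patterns side by side, the sketch below tests each one against the same hypothetical list of words and collects the words that match.

```python
import re

# Hypothetical words to test the patterns on
words = ["miss", "mass", "moss", "mss", "mouss"]

# . matches exactly one character between the m and the ss
matches_dot = [word for word in words if re.match("m.ss", word)]

# .* matches zero or more characters between the m and the ss
matches_star = [word for word in words if re.match("m.*ss", word)]

# .+ matches one or more characters between the m and the ss
matches_plus = [word for word in words if re.match("m.+ss", word)]

print(matches_dot)    # ['miss', 'mass', 'moss']
print(matches_star)   # ['miss', 'mass', 'moss', 'mss', 'mouss']
print(matches_plus)   # ['miss', 'mass', 'moss', 'mouss']
```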

#### Example: matching phone numbers

In [43]:
phone_strings = [ "apple", 
                 "219 733 8965", 
                 "329-293-8753", 
                 "Work: 579-499-7527",
                 "Home: 543.355.3679"]

phone_strings

['apple',
 '219 733 8965',
 '329-293-8753',
 'Work: 579-499-7527',
 'Home: 543.355.3679']

In [44]:
# A regular expression to match phone numbers

phone_expression = ".*([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"


In [46]:
# Test which phone_strings contain valid phone numbers

for curr_number in phone_strings:
    print(bool(re.match(phone_expression,  curr_number)))

False
True
True
True
True
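The parentheses in `phone_expression` define groups, and the match object returned by `re.match()` can return the text captured by each group through its `groups()` method. The sketch below uses this to rewrite every matching phone number in a single format; joining the pieces with dashes is just one possible choice of format.

```python
import re

phone_strings = ["apple",
                 "219 733 8965",
                 "329-293-8753",
                 "Work: 579-499-7527",
                 "Home: 543.355.3679"]

phone_expression = ".*([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

normalized = []
for curr_number in phone_strings:
    curr_match = re.match(phone_expression, curr_number)
    if curr_match:   # re.match() returns None when there is no match
        # groups() returns the three captured pieces as a tuple of strings
        normalized.append("-".join(curr_match.groups()))

print(normalized)
```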


#### Escape characters

In [46]:
# Does not match because $ means the end of the string, so nothing can come after it
bool(re.match(".*$100", "Joanna has $100 and Chris has $0"))

False

In [47]:
# Using an escape character (a backslash) makes $ match a literal dollar sign
bool(re.match(".*\\$100", "Joanna has $100 and Chris has $0"))

True
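Two alternatives to typing a double backslash are sketched below. A raw string (prefixed with `r`) lets us write a single backslash, and the `re.escape()` function adds the backslashes for us when we want a piece of text to be matched literally.

```python
import re

text = "Joanna has $100 and Chris has $0"

# In a raw string, a single backslash is enough to escape the $
raw_match = bool(re.match(r".*\$100", text))

# re.escape() adds backslashes in front of special characters
escaped = re.escape("$100")
escaped_match = bool(re.search(escaped, text))

print(raw_match, escaped, escaped_match)
```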

#### Special characters

Other special characters are also designated by using a backslash first:

`\s`   any white space character (space, tab, new line, etc.)

`\n`   new line     (some text also uses `\r`, a carriage return)

`\t`   tab


In [48]:
# Does the melville_quote contain new lines?
bool(re.match(".*\n", melville_quote))

False

In [49]:
# Does the poem contain new lines?
bool(re.match(".*\n", poem))

True
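As a final sketch, the `\s` character can be combined with `+` to find runs of white space. The example below uses `re.sub()`, a function from the `re` module that replaces every match of a pattern, to collapse each run into a single space. The `messy` string is hypothetical.

```python
import re

# Hypothetical string containing a tab, two new lines, and repeated spaces
messy = "a whale\tship\n\nwas   my Yale"

# Does the string contain a tab?
contains_tab = bool(re.search(r"\t", messy))

# Replace every run of white space with a single space
cleaned = re.sub(r"\s+", " ", messy)

print(contains_tab)   # True
print(cleaned)        # a whale ship was my Yale
```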