# Processing Strings with Python | BAIS 6100

**Instructor: Qihang Lin**

## Create and Read Strings

A single-line string is created using single or double quotes.

In [1]:
mystr = 'a character string within single quotes'
mystr = "a character string within double quotes"
print(mystr)

a character string within double quotes


A string with multiple lines can be created using triple single or triple double quotes.

In [2]:
mystr = """a character 
string within 
double quotes"""
print(mystr)

a character 
string within 
double quotes


Text data can be stored in csv, txt, xml, and json files.

Read text from a csv file into a <b>data frame</b>:

In [3]:
import pandas as pd
df = pd.read_csv('classdata/Catchphrase.csv')
df.head()

Unnamed: 0,Catchphrase,Movie Name,Context
0,"Beetlejuice, Beetlejuice, Beetlejuice!",BEETLEJUICE,"Lydia, summoning Beetlejuice"
1,It's showtime!,BEETLEJUICE,"Beetlejuice, being summoned."
2,They're heeeere!,POLTERGEIST,"Carol Anne Freeling, notifying her parents of..."
3,Hey you guys!,THE GOONIES,"Sloth, calling the attention of the children ..."
4,"Good morning, Vietnam!","GOOD MORNING, VIETNAM",Adrian Cronauer's greeting on his radio show


Read text from a txt file into a single string:

In [4]:
file = open("classdata/OneEmail.txt", mode="r")     # open a file in read mode.
text = file.read()                                  # read the entire txt file without splitting lines.
file.close()                                        # close the file.
text

'Hi Everyone,\n\nI have uploaded the data file to our IDAS server. You can find it in the directory\n\t"classdata\\homework\\hw1"\n   \nLet me know if you have any questions.\n\nBest,\nQihang\n'

Read text from a txt file into a list of strings (split on each new line):

In [5]:
file = open("classdata/OneEmail.txt", mode="r")     # open a file in read mode.
text = file.readlines()                             # read the txt file as a list of lines.
file.close()                                        # close the file.
text

['Hi Everyone,\n',
 '\n',
 'I have uploaded the data file to our IDAS server. You can find it in the directory\n',
 '\t"classdata\\homework\\hw1"\n',
 '   \n',
 'Let me know if you have any questions.\n',
 '\n',
 'Best,\n',
 'Qihang\n']
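The two reading patterns above can also be written with a `with` block, which closes the file automatically even if an error occurs. The sketch below is only an illustration: it writes a short made-up sample to a temporary file (the file name and text are assumptions, not the course data) so that it runs on its own.

```python
import os
import tempfile

# Hypothetical sample text standing in for classdata/OneEmail.txt.
sample = "Hi Everyone,\n\nBest,\nQihang\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample_email.txt")

    # Write the sample so the example does not depend on the course files.
    with open(path, mode="w", encoding="utf-8") as f:
        f.write(sample)

    # read() returns the whole file as one string.
    with open(path, mode="r", encoding="utf-8") as f:
        whole_text = f.read()

    # readlines() returns one string per line, keeping the newline characters.
    with open(path, mode="r", encoding="utf-8") as f:
        lines = f.readlines()

# splitlines() splits a string into lines and drops the newline characters.
clean_lines = whole_text.splitlines()

print(repr(whole_text))
print(lines)
print(clean_lines)
```

Passing `encoding="utf-8"` explicitly is a design choice here: it avoids depending on the platform's default encoding.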

## Unicode and UTF-8

ASCII uses 7 bits to encode 128 specified characters, including the digits, letters, punctuation symbols, and some non-printing control codes such as the carriage return, line feed and tab.

Non-ASCII characters are very common in text data. 
- Non-English characters: Elektrizität, Électricité, बिजली, 전기, 电
- Emoji: 😀, 😂, 😷
- Math: ⊂, ∈, ⊗, ⊥, ∮

A big table is created to map characters from all languages into numbers. This big table that holds almost everything is called <b>Unicode</b>. 

An <b>encoding</b> scheme is needed to translate between those numbers and bits. In many cases, the default encoding is 'utf-8'. Other encodings include 'latin-1', 'utf-16', and 'utf-32'.

In [6]:
# Convert a character to its utf-8 code.
f'{int("😀".encode("utf-8").hex(), 16):b}'

'11110000100111111001100010000000'

In [7]:
# Convert a character to its utf-8 code.
f'{int("ä".encode("utf-8").hex(), 16):b}'

'1100001110100100'
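To separate the two ideas (the Unicode number of a character and the bytes an encoding produces for it), the following sketch prints both for a few characters. The helper name `utf8_info` is made up for this illustration; `ord` and `str.encode` are standard Python.

```python
def utf8_info(ch):
    """Return (Unicode code point, UTF-8 bytes as hex, number of bytes) for one character."""
    encoded = ch.encode("utf-8")
    return ord(ch), encoded.hex(), len(encoded)


for ch in ["A", "ä", "电", "😀"]:
    code_point, hex_bytes, n_bytes = utf8_info(ch)
    print(f"{ch}  U+{code_point:04X}  utf-8 bytes: {hex_bytes}  ({n_bytes} byte(s))")
```

ASCII characters such as "A" take one byte in UTF-8, while other characters take two to four bytes.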

In the following example, the "latin-1" encoding is needed to load the file successfully.

In [8]:
import pandas as pd
df = pd.read_csv('classdata/clinton-street-social-club.csv', encoding="latin-1")  
df["reviews"][77]    # The word "entrée" is what causes the problem with the default encoding.

'This is where grown ups come to drink.\r\n\r\nTucked away above Short\'s, Clinton St. is a wonderful find here in Iowa City. The ambiance is great, with a retro warehouse feel and beautiful views out onto the Pentacrest. Their real attraction is the drinks list: many interesting cocktails, my personal favorite being the Grandpa\'s Coffin (bourbon, scotch, locally made apple brandy, and bitters). They also have a "Murderer\'s Row" of high-end whiskey should that strike your fancy. I\'d say this place competes with Devotay for best cocktails in Iowa City. They also have a well-chosen, if small, beer list, and I have had good times with their wines.\r\n\r\nThe food has been very good. Last night, I had their smoked salmon nicoise salad, which had excellent flavor, and my wife had a tasty vegetarian quiche. They used to carry a really flavorful veggie burger, but appear to have dropped that from the menu within the past couple of months. On another visit, the salmon entrée was also good, 
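Why a word like "entrée" can break loading is easier to see on a single string. The sketch below is a small stand-in for the file, not the file itself: it encodes the word with latin-1 and then tries to decode the bytes with utf-8.

```python
word = "entrée"

# In latin-1, "é" is stored as the single byte 0xE9.
latin1_bytes = word.encode("latin-1")

# Decoding those bytes as utf-8 fails, because 0xE9 followed by "e" is not valid utf-8.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    print("utf-8 decoding failed:", err.reason)

# Decoding with the encoding that was used to write the bytes recovers the word.
recovered = latin1_bytes.decode("latin-1")

# The opposite mistake does not raise an error but silently garbles the text.
garbled = word.encode("utf-8").decode("latin-1")

print(latin1_bytes, recovered, garbled)
```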

## Escape Sequence

Open “OneEmail.txt” in a text editor and compare it with variable <b>text</b> above. What are the differences?

Inside a regular Python string literal, a backslash <b>(\\)</b> does not have its literal meaning. It is an <b>escape character</b>, which is a character that alters the meaning of the following character.

An <b>escape sequence</b> consists of '\\' and the following character. Some common examples:
- \\n: newline
- \\r: carriage return
- \\t: tab
- \\\\: backslash
- \\': single quote '
- \\": double quotes "
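One way to check that each escape sequence in the list stands for a single character is to look at the length of the resulting string. This is a minimal sketch; the dictionary and its labels are only for display.

```python
samples = {
    "newline": "\n",
    "tab": "\t",
    "backslash": "\\",
    "single quote": "\'",
    "backslash then n": "\\n",   # an escaped backslash followed by a plain "n"
}

for name, s in samples.items():
    # repr() shows the string the way Python would write it as a literal.
    print(f"{name:18} length={len(s)}  repr={s!r}")
```

The first four entries have length 1; only the last one has length 2, because it holds a backslash and the letter "n" as two separate characters.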

In [9]:
mystr = "\'\tBefore\'\n\"After\\\""
file = open("test.txt", mode="w") 
file.write(mystr)                               # See what is in test.txt.
file.close()  
print(mystr)                                    # Or see what is printed.

'	Before'
"After\"


## Raw String

A raw string is created by prefixing a string literal with 'r' or 'R'. 

A raw string treats '\\' as a literal character. This is useful when we want a string that contains backslashes and don't want them to be treated as escape characters.

In [10]:
mystr = r"\'\tBefore\'\n\"After\\\""
file = open("test.txt", mode="w") 
file.write(mystr)                               # See what is in test.txt.
file.close()  
print(mystr)                                    # Or see what is printed.

\'\tBefore\'\n\"After\\\"
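A common place where raw strings help is Windows-style paths. The paths in this sketch are illustrative only (the second one is made up), and the variable names are assumptions for the example.

```python
# Two ways to write the same path: escaped backslashes or a raw string.
escaped_path = "classdata\\homework\\hw1"
raw_path = r"classdata\homework\hw1"
print(escaped_path == raw_path)

# Without the r prefix, "\t" and "\n" are read as a tab and a newline.
accidental = "C:\temp\new"
raw_win = r"C:\temp\new"

print(len(accidental), repr(accidental))
print(len(raw_win), repr(raw_win))
```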


## Basic String Methods

Python has built-in methods (functions) for processing a string.

In [11]:
mystr="\"IDAS\""

Count the number of characters in a string.

In [12]:
len(mystr)

6

Get a substring by its start and end positions. Positions start at 0, and the character at the end position is not included.

In [13]:
mystr[2:4]

'DA'
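A few more slicing patterns on the same string may help. This is a small sketch; the variable names other than `mystr` are made up for the illustration.

```python
mystr = "\"IDAS\""             # the six characters "IDAS" including the quotes

first_char = mystr[0]          # a single position
middle = mystr[2:4]            # positions 2 and 3; position 4 is excluded
first_three = mystr[:3]        # an omitted start means "from the beginning"
last_two = mystr[-2:]          # negative positions count from the end
without_quotes = mystr[1:-1]   # drop the first and the last character

print(first_char, middle, first_three, last_two, without_quotes)
```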

Split the string at the specified separator and return a <b>list</b>. If no separator is given, the string is split on whitespace.

In [14]:
mystr="I have uploaded the data file to our IDAS server."
mystr.split()

['I',
 'have',
 'uploaded',
 'the',
 'data',
 'file',
 'to',
 'our',
 'IDAS',
 'server.']
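`split` also accepts an explicit separator and a limit on the number of splits. The strings in this sketch are made up for illustration (the "|" separated record is not from the course data).

```python
# An explicit separator splits on that exact string.
record = "Hey you guys!|THE GOONIES|Sloth"   # hypothetical record
fields = record.split("|")

# maxsplit limits how many splits are made, counting from the left.
header = "Subject: Re: homework 1"
key, value = header.split(": ", maxsplit=1)

# With no argument, runs of whitespace count as one separator.
# With " " as the separator, every single space splits, leaving empty strings.
spaced = "a  b"
by_whitespace = spaced.split()
by_single_space = spaced.split(" ")

print(fields)
print(key, "->", value)
print(by_whitespace, by_single_space)
```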

Join multiple strings into one.

In [15]:
mystr = "Text" + " " + "mining"
mystr 

'Text mining'

In [16]:
mystr = mystr + " " + "is useful!"
mystr

'Text mining is useful!'

In [17]:
mystr=["Text", "mining", "is", "useful!"]
"-".join(mystr)

'Text-mining-is-useful!'

Search for a substring and return the position where it is found. 

<b>Warning:</b> If there are multiple matches, only the position of the first one is returned. If there is no match, -1 is returned.

In [18]:
mystr="I have uploaded the data file to our IDAS server."
mystr.find("er")

43
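Because `find` reports only the first match, finding every match takes a loop that restarts the search after each hit. The helper `find_all` below is a hypothetical function written for this illustration, not a built-in string method.

```python
def find_all(text, sub):
    """Return the start positions of every occurrence of sub in text."""
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        # The optional second argument of find() is where the search starts.
        start = text.find(sub, start + 1)
    return positions


mystr = "I have uploaded the data file to our IDAS server."

print(find_all(mystr, "er"))
print(find_all(mystr, "xyz"))
print(mystr.count("er"))    # number of non-overlapping matches
print("er" in mystr)        # simple yes/no test
```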

Replace a substring with another.

In [19]:
mystr.replace("data file", "course project")

'I have uploaded the course project to our IDAS server.'

Case conversion

In [20]:
mystr.upper()

'I HAVE UPLOADED THE DATA FILE TO OUR IDAS SERVER.'

In [21]:
mystr.lower()

'i have uploaded the data file to our idas server.'

Note: Strings cannot be changed in place, so all string methods return a new value and leave the original string unchanged. To keep the result, overwrite the original string variable. 

In [22]:
# Without overwriting: mystr keeps its original value.
mystr="I have uploaded the data file to our IDAS server."
mystr.lower()
mystr

'I have uploaded the data file to our IDAS server.'

In [23]:
# With overwriting: mystr now holds the lower-case version.
mystr="I have uploaded the data file to our IDAS server."
mystr = mystr.lower()
mystr

'i have uploaded the data file to our idas server.'
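Since every method returns a new string, calls can be chained, with each method working on the result of the previous one. A minimal sketch with a made-up input string:

```python
raw = "  Iowa City  "

# strip() removes the outer spaces, lower() converts the case,
# and replace() swaps the remaining space for an underscore.
cleaned = raw.strip().lower().replace(" ", "_")

print(repr(raw))        # the original string is unchanged
print(repr(cleaned))
```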

We only show a few examples above. See https://www.w3schools.com/python/python_ref_string.asp for a complete list of basic string methods.

## Process Strings in a Column 

The documents we want to analyze are often stored in a column of a data frame. Next, we will learn how to apply string functions to each document in a column (also called a series).

Take a quick look at the data by showing its first five rows.

In [24]:
df = pd.read_csv('classdata/clinton-street-social-club.csv', encoding="latin-1")  
df.head()

Unnamed: 0,reviews,ratings
0,"With its jazzy vibes and chill atmosphere, Cli...",5
1,This was an exceptional surprise in Iowa city!...,5
2,There is no other place in town like CSSC. If ...,5
3,Tucked away through a narrow staircase like a ...,5
4,Love. Love. Love. If you're older than the col...,5


Each column of a data frame is also called a <b>series</b>. The example below shows the column "reviews".

In [25]:
df["reviews"]

0      With its jazzy vibes and chill atmosphere, Cli...
1      This was an exceptional surprise in Iowa city!...
2      There is no other place in town like CSSC. If ...
3      Tucked away through a narrow staircase like a ...
4      Love. Love. Love. If you're older than the col...
                             ...                        
127    The food was not good and service was not frie...
128    Wow have times changed in downtown Iowa City. ...
129    Great building. Loved the big windows overlook...
130    Love the food, love the service. If you are in...
131    Dinner and cocktails on a Friday night. Place ...
Name: reviews, Length: 132, dtype: object

### List Comprehension Method

Use a <b>list comprehension</b> to return the number of characters in each review as a list.

In [26]:
[len(s) for s in df["reviews"]]

[967,
 387,
 733,
 606,
 586,
 307,
 824,
 917,
 377,
 583,
 642,
 288,
 388,
 466,
 1774,
 881,
 981,
 496,
 336,
 1304,
 1073,
 314,
 512,
 305,
 761,
 191,
 1218,
 442,
 969,
 158,
 230,
 198,
 1599,
 99,
 256,
 362,
 562,
 1416,
 3485,
 169,
 1328,
 1083,
 1056,
 106,
 313,
 1014,
 923,
 472,
 544,
 530,
 520,
 190,
 213,
 410,
 394,
 469,
 793,
 671,
 180,
 480,
 101,
 228,
 243,
 841,
 1304,
 384,
 1910,
 214,
 154,
 519,
 493,
 563,
 214,
 1473,
 399,
 1036,
 215,
 1444,
 762,
 518,
 620,
 297,
 90,
 246,
 3202,
 1301,
 591,
 232,
 174,
 251,
 1036,
 294,
 1279,
 260,
 265,
 400,
 1403,
 582,
 89,
 570,
 453,
 752,
 508,
 2352,
 2702,
 1137,
 69,
 211,
 143,
 199,
 417,
 104,
 755,
 1056,
 3545,
 633,
 457,
 1216,
 529,
 146,
 298,
 1597,
 37,
 340,
 92,
 123,
 675,
 87,
 595,
 531,
 146,
 1152]

In the next example, we create a list of boolean values indicating whether each review contains the keyword "food".

In [27]:
[s.find("food")>=0 for s in df["reviews"]]

[True,
 False,
 False,
 True,
 True,
 False,
 True,
 True,
 False,
 True,
 False,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 False,
 True,
 False,
 False,
 False,
 True,
 True,
 True,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False,
 True,
 True,
 False,
 False,
 False,
 True,
 False,
 True,
 True,
 False,
 True,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 False,
 False,
 False,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 False,
 False,
 True,
 False,
 True,
 False,
 False,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 True,
 False,
 True,
 True,
 False]

A boolean list is often used to select rows from a data frame. For example, we can select all rows from a data frame where the review mentions "food".

In [28]:
rowselected = [s.find("food")>=0 for s in df["reviews"]]
dffood = df[rowselected].copy()                    #copy() is used to avoid any link between dffood and df. 
dffood.reset_index(inplace=True, drop=True)        #Reset row index so we can select rows in dffood by more intuitive index.
dffood.head()

Unnamed: 0,reviews,ratings
0,"With its jazzy vibes and chill atmosphere, Cli...",5
1,Tucked away through a narrow staircase like a ...,5
2,Love. Love. Love. If you're older than the col...,5
3,This place has the coolest vibe! It has that s...,5
4,Fantastic. Feels like you are stepping back in...,5


### ".str" Method

Just like a single string, a string series also has its own built-in methods for various processes. We need to add <b>.str</b> after the column in order to access those string methods.

Check which reviews mention "food".

In [29]:
df["reviews"].str.find("food")>=0

0       True
1      False
2      False
3       True
4       True
       ...  
127     True
128    False
129     True
130     True
131    False
Name: reviews, Length: 132, dtype: bool

".str" has some functions the basic method does not have, for example, "contains".

In [30]:
df["reviews"].str.contains("food", case=False)
#This search is case insensitive. To do case-sensitive search, use: 
#df["reviews"].str.contains("food", case=True)

0       True
1       True
2       True
3       True
4       True
       ...  
127     True
128    False
129     True
130     True
131    False
Name: reviews, Length: 132, dtype: bool
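Two optional arguments of `contains` are worth knowing: `na` sets the result for missing values, and `regex=False` makes the search treat the pattern as plain text. The small series below is made up for illustration and is not the course data.

```python
import pandas as pd

# Hypothetical reviews, including one missing value.
reviews = pd.Series([
    "The food was great.",
    "FOOD took forever.",
    None,
    "Nice cocktails (and views).",
])

# case=False ignores case; na=False counts a missing review as "no match".
mentions_food = reviews.str.contains("food", case=False, na=False)

# regex=False searches for the literal text "(and".
# With the default regex=True, an unclosed "(" is not a valid pattern.
has_paren = reviews.str.contains("(and", regex=False, na=False)

print(mentions_food.tolist())
print(has_paren.tolist())
```

Setting `na=False` keeps the result a clean boolean series, which is convenient when the result is used to select rows.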

This gives another way to select all rows where the review mentions "food".

In [31]:
rowselected =  df["reviews"].str.contains("food", case=False)
dffood = df[rowselected].copy() 
dffood.reset_index(inplace=True, drop=True)        
dffood.head()

Unnamed: 0,reviews,ratings
0,"With its jazzy vibes and chill atmosphere, Cli...",5
1,This was an exceptional surprise in Iowa city!...,5
2,There is no other place in town like CSSC. If ...,5
3,Tucked away through a narrow staircase like a ...,5
4,Love. Love. Love. If you're older than the col...,5


Replace "Iowa City" with "IowaCity" in all reviews.

In [32]:
#Before replacement
df["reviews"]
#Try df["reviews"][77] to see an example containing "Iowa City"

0      With its jazzy vibes and chill atmosphere, Cli...
1      This was an exceptional surprise in Iowa city!...
2      There is no other place in town like CSSC. If ...
3      Tucked away through a narrow staircase like a ...
4      Love. Love. Love. If you're older than the col...
                             ...                        
127    The food was not good and service was not frie...
128    Wow have times changed in downtown Iowa City. ...
129    Great building. Loved the big windows overlook...
130    Love the food, love the service. If you are in...
131    Dinner and cocktails on a Friday night. Place ...
Name: reviews, Length: 132, dtype: object

In [33]:
#After replacement
df["reviews"].str.replace("Iowa City","IowaCity")
#Try df["reviews"].str.replace("Iowa City","IowaCity")[77] to see the change.

0      With its jazzy vibes and chill atmosphere, Cli...
1      This was an exceptional surprise in Iowa city!...
2      There is no other place in town like CSSC. If ...
3      Tucked away through a narrow staircase like a ...
4      Love. Love. Love. If you're older than the col...
                             ...                        
127    The food was not good and service was not frie...
128    Wow have times changed in downtown IowaCity. W...
129    Great building. Loved the big windows overlook...
130    Love the food, love the service. If you are in...
131    Dinner and cocktails on a Friday night. Place ...
Name: reviews, Length: 132, dtype: object

Like the basic string methods, the string methods of a series do not change the original series either. Overwrite the original column if you want to change it. For example:


In [34]:
df["reviews"] = df["reviews"].str.replace("Iowa City","IowaCity")
df["reviews"][77]

'This is where grown ups come to drink.\r\n\r\nTucked away above Short\'s, Clinton St. is a wonderful find here in IowaCity. The ambiance is great, with a retro warehouse feel and beautiful views out onto the Pentacrest. Their real attraction is the drinks list: many interesting cocktails, my personal favorite being the Grandpa\'s Coffin (bourbon, scotch, locally made apple brandy, and bitters). They also have a "Murderer\'s Row" of high-end whiskey should that strike your fancy. I\'d say this place competes with Devotay for best cocktails in IowaCity. They also have a well-chosen, if small, beer list, and I have had good times with their wines.\r\n\r\nThe food has been very good. Last night, I had their smoked salmon nicoise salad, which had excellent flavor, and my wife had a tasty vegetarian quiche. They used to carry a really flavorful veggie burger, but appear to have dropped that from the menu within the past couple of months. On another visit, the salmon entrée was also good, th

Note that although there is a big overlap between the string methods of a series and the basic string methods, they are still different sets of methods. For example, "contains" is only available to the former. 

For a complete list of string methods of a series, see https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

### ".str" Method and Regular Expression

Many ".str" functions, including "contains" and "replace", can be used with a "regular expression", which is an expression that represents a class of strings rather than a specific string. For example, suppose we want to replace all upper-case letters in df["reviews"] with a "*". In recent versions of pandas, "replace" treats the pattern as a literal string unless regex=True is passed, so we can do:

In [35]:
df["reviews"].str.replace("[A-Z]", "*", regex=True)


0      *ith its jazzy vibes and chill atmosphere, *li...
1      *his was an exceptional surprise in *owa city!...
2      *here is no other place in town like ****. *f ...
3      *ucked away through a narrow staircase like a ...
4      *ove. *ove. *ove. *f you're older than the col...
                             ...                        
127    *he food was not good and service was not frie...
128    *ow have times changed in downtown *owa*ity. *...
129    *reat building. *oved the big windows overlook...
130    *ove the food, love the service. *f you are in...
131    *inner and cocktails on a *riday night. *lace ...
Name: reviews, Length: 132, dtype: object

Here, "[A-Z]" is a regular expression representing "any single upper-case letter". We will study regular expressions in more detail in the next module.
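The same pattern can be tried on a single string with Python's built-in `re` module, which makes the difference between a literal search and a regular-expression search easy to see. This sketch uses `re` rather than pandas, and the sentence is adapted from one of the reviews.

```python
import re

text = "Wow have times changed in downtown Iowa City."

# The basic string method looks for the literal characters "[A-Z]" and finds none.
literal_result = text.replace("[A-Z]", "*")

# re.sub treats "[A-Z]" as a pattern that matches any one upper-case letter.
regex_result = re.sub(r"[A-Z]", "*", text)

# re.findall lists every match of the pattern.
upper_letters = re.findall(r"[A-Z]", text)

print(literal_result)
print(regex_result)
print(upper_letters)
```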

## Process Strings in a List

In Python, a list is a different object from a column or a series and does not have the ".str" built-in methods for string processing.

To process the strings in a list, use list comprehension with the basic string methods.

In [36]:
mystr=["This is a good game!", "I don\'t like this game!", "Do you like this game?"]
[s.replace("game", "course project") for s in mystr]

['This is a good course project!',
 "I don't like this course project!",
 'Do you like this course project?']
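A list comprehension can also filter a list by adding an `if` condition at the end. The sketch below reuses the three sentences from the example; the variable names are made up for the illustration.

```python
sentences = ["This is a good game!", "I don't like this game!", "Do you like this game?"]

# Keep only the sentences that contain the word "like".
with_like = [s for s in sentences if "like" in s]

# Transform and filter in one step: upper-case only the questions.
questions_upper = [s.upper() for s in sentences if s.endswith("?")]

# Compute one value per sentence, as done for the reviews column.
lengths = [len(s) for s in sentences]

print(with_like)
print(questions_upper)
print(lengths)
```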

In [37]:
#The following command raises an AttributeError because a list has no ".str" methods.
#mystr.str.replace("game", "course project")