# Working with text data in Python

Before we begin

- String literals
- Escaping special characters
- String operations, indexing, slicing and formatting
- Regular expressions

## String literals

For detailed information see the Python 3 [string documentation](https://docs.python.org/3/library/string.html).

In [1]:
## String literals
simple_string = "The dog chased the bat."

In [13]:
## Quotation marks need to be escaped using \ to remove their special meaning

string_with_escape_char_1 = "The building housing the John Snow pub was built in the 1870s and was originally called the \"Newcastle-upon-Tyne.\""

In [14]:
## Long string format

string_long_format = """She said that this model is "the best there is." """
string_long_format

'She said that this model is "the best there is." '

In [2]:
## Some characters have a special meaning when preceded by \, e.g. \t is a tab

string_with_escape_char_2 = "Press \t to align the piece of code..."
print(string_with_escape_char_2)

string_with_escape_char_3 = "Press \\t to align the piece of code..."
print(string_with_escape_char_3)

## String prefixes: r"string": raw strings disable escape sequences, so backslashes are kept literally
string_with_escape_char_3 = r"Press \t to align the piece of code..."
print(string_with_escape_char_3)

Press 	 to align the piece of code...
Press \t to align the piece of code...
Press \t to align the piece of code...
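
One way to see the difference between escaped and raw strings is to compare their lengths. The sketch below uses illustrative variable names and a made-up Windows-style path, a case where raw strings are often convenient.

```python
# "\t" is a single tab character; in a raw string it stays backslash + "t".
escaped = "\t"
raw = r"\t"

print(len(escaped))  # 1
print(len(raw))      # 2

# A raw string and a string with doubled backslashes are equal.
# The path is a made-up example.
path_raw = r"C:\temp\new_file.txt"
path_escaped = "C:\\temp\\new_file.txt"
print(path_raw == path_escaped)  # True
```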


## Selecting substrings

Strings in Python behave like immutable sequences of characters: you can index and slice them, but you cannot change them in place.

In [3]:
sentence = "The Sam Altman drama points to a deeper split in the tech world."

print(len(sentence))

print(sentence[0])
print(sentence[0:10])
print(sentence[::2])
print(sentence[::-2])

64
T
The Sam Al
TeSmAta rm onst  eprslti h ehwrd
.lo ctetn ip eedao tipaadnml a h


In [4]:
sentence[0] = "B"

TypeError: 'str' object does not support item assignment
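
Because item assignment fails, the usual approach is to build a new string from slices of the old one. A minimal sketch, reusing the sentence from above (`new_sentence` is an illustrative name):

```python
sentence = "The Sam Altman drama points to a deeper split in the tech world."

# Combine a replacement character with a slice of the original string.
new_sentence = "B" + sentence[1:]

print(new_sentence[:10])  # Bhe Sam Al
print(sentence[:10])      # The Sam Al (the original is unchanged)

# Negative indices count from the end of the string.
print(sentence[-6:])      # world.
```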

## Formatted string literals

Python 3.6 introduced f-strings, which are often more convenient to use
than the `.format()` string method. Inside an f-string, expressions in curly braces (for example, variables defined outside the string) are evaluated and interpolated into the result.

For detailed information see the Python 3 [f-strings documentation](https://docs.python.org/3/reference/lexical_analysis.html#f-strings).

In [13]:
phone_number = "+35988328349"

print(f"Call {phone_number} for assistance.")
print("Dial {phn} for assistance.".format(phn = phone_number))

Call +35988328349 for assistance.
Dial +35988328349 for assistance.
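
F-strings also accept a format specification after a colon inside the braces. The following sketch shows a few common cases; the values and variable names are made up for illustration.

```python
price = 1234.5678
share = 0.256
label = "total"

two_decimals = f"{price:.2f}"    # fixed-point with two decimal places
with_commas = f"{price:,.1f}"    # thousands separator
percentage = f"{share:.1%}"      # multiply by 100 and append %
right_aligned = f"[{label:>10}]" # right-align in a field of 10 characters

print(two_decimals)   # 1234.57
print(with_commas)    # 1,234.6
print(percentage)     # 25.6%
print(right_aligned)  # [     total]
```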


Printing the string representation of an object.
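
A minimal sketch of this, assuming the goal is to contrast `str()` and `repr()` output: the `!r` conversion inside an f-string applies `repr()` to the value before interpolating it.

```python
phone_number = "+35988328349"

plain_text = f"{phone_number}"    # uses str()
repr_text = f"{phone_number!r}"   # uses repr(), which adds the quotes

print(plain_text)  # +35988328349
print(repr_text)   # '+35988328349'

# repr() makes otherwise invisible characters visible.
padded = "text\t"
print(repr(padded))  # 'text\t'
```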


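When using values from a dictionary, pay attention to the quotation marks. Before Python 3.12, the quotes used inside the braces must differ from the quotes that delimit the f-string; otherwise the string ends early and Python raises a `SyntaxError`.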

In [10]:
dct = {"phone_number": "+35988328349"}

In [14]:
print(f"Call {dct["phone_number"]}")

SyntaxError: invalid syntax (180812370.py, line 1)

In [15]:
print(f"Call {dct['phone_number']}.")

Call +35988328349.
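
If mixing quote types becomes hard to read, two alternatives avoid quotes inside the braces altogether. This is a sketch, not the only way to do it; `message_1` and `message_2` are illustrative names.

```python
dct = {"phone_number": "+35988328349"}

# Option 1: look the value up first, then interpolate the plain variable.
phone_number = dct["phone_number"]
message_1 = f"Call {phone_number}."

# Option 2: let str.format() unpack the dictionary keys as field names.
message_2 = "Call {phone_number}.".format(**dct)

print(message_1)  # Call +35988328349.
print(message_2)  # Call +35988328349.
```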


## Local Text Files

In the following, we use a Jupyter-specific cell magic (`%%writefile`) to write a local file and then read it back in Python.

In [5]:
%%writefile tweets_hashtags_callouts.txt
Just returned to the @WhiteHouse after a great evening @ Monroe, Louisiana with a massive turnout of Great American Patriots. With early voting underway until Sat, find your polling location below & go vote for your next #LAgov, @EddieRispone! #GeauxVote➡️https://vote.donaldjtrump.com
The degenerate Washington Post MADE UP the story about me asking Bill Barr to hold a news conference. Never happened, and there were no sources!
LOUISIANA! Early voting is underway until Saturday, it’s time to get out and VOTE to REPLACE Radical Liberal Democrat John Bel Edwards with a great new REPUBLICAN Governor, @EddieRispone! #GeauxVote
“Based on the things I’ve seen, the Democrats have no case, or a weak case, at best. I don’t think there are, or will be, well founded articles of Impeachment here.” Robert Wray, respected former prosecutor. It is a phony scam by the Do Nothing Dems! @foxandfriends

Writing tweets_hashtags_callouts.txt


In [6]:
### Read the whole file string into memory

## Opens a file handle in read mode
f = open('tweets_hashtags_callouts.txt', 'r')

## Reads the contents of the file as one string
callouts_text = f.read()
f.close()

print(callouts_text)

Just returned to the @WhiteHouse after a great evening @ Monroe, Louisiana with a massive turnout of Great American Patriots. With early voting underway until Sat, find your polling location below & go vote for your next #LAgov, @EddieRispone! #GeauxVote➡️https://vote.donaldjtrump.com
The degenerate Washington Post MADE UP the story about me asking Bill Barr to hold a news conference. Never happened, and there were no sources!
LOUISIANA! Early voting is underway until Saturday, it’s time to get out and VOTE to REPLACE Radical Liberal Democrat John Bel Edwards with a great new REPUBLICAN Governor, @EddieRispone! #GeauxVote
“Based on the things I’ve seen, the Democrats have no case, or a weak case, at best. I don’t think there are, or will be, well founded articles of Impeachment here.” Robert Wray, respected former prosecutor. It is a phony scam by the Do Nothing Dems! @foxandfriends


Wrong paths are a common source of errors when reading files from your local system. You can check the current working directory of the notebook with the `os.getcwd()` function. Other errors can arise from missing read or write permissions. Make sure that you have write permissions for the directory
where you cloned the GitHub repository.

In [7]:
import os
print(os.getcwd())

/home/amarov/proj/statistics/ta2023/01-Text-Data-Basics
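
Before opening a file, it can help to check where Python will look for it. The sketch below uses `pathlib` from the standard library as an alternative to `os`; the file name is the one written earlier in the notebook.

```python
from pathlib import Path

# Relative file names are resolved against the current working directory.
file_path = Path.cwd() / "tweets_hashtags_callouts.txt"

print(file_path)
# False here means that open() would raise FileNotFoundError.
print(file_path.exists())
```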


It is usually safer to open files with a context manager (the `with` statement): the file handle is closed automatically when the block ends, even if an error occurs inside it.

In [8]:
with open('tweets_hashtags_callouts.txt', 'r') as f:
    text = f.read()

print(text)

Just returned to the @WhiteHouse after a great evening @ Monroe, Louisiana with a massive turnout of Great American Patriots. With early voting underway until Sat, find your polling location below & go vote for your next #LAgov, @EddieRispone! #GeauxVote➡️https://vote.donaldjtrump.com
The degenerate Washington Post MADE UP the story about me asking Bill Barr to hold a news conference. Never happened, and there were no sources!
LOUISIANA! Early voting is underway until Saturday, it’s time to get out and VOTE to REPLACE Radical Liberal Democrat John Bel Edwards with a great new REPUBLICAN Governor, @EddieRispone! #GeauxVote
“Based on the things I’ve seen, the Democrats have no case, or a weak case, at best. I don’t think there are, or will be, well founded articles of Impeachment here.” Robert Wray, respected former prosecutor. It is a phony scam by the Do Nothing Dems! @foxandfriends
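
Text files with non-ASCII characters (such as the emoji and curly quotes in the tweets) are read more reliably when the encoding is stated explicitly. The following self-contained sketch assumes the file is UTF-8 encoded; the temporary file and its contents are made up for illustration.

```python
import os
import tempfile

# Write a small file into a temporary directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "example.txt")

    with open(file_path, "w", encoding="utf-8") as f:
        f.write("First line ➡️\nSecond line\n")

    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()

# After the with-block the handle has been closed automatically.
print(f.closed)  # True
print(text)
```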


In [9]:
## It is common that each line in the file
## corresponds to a single record (e.g. tweet)

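## Split the string on newlines. The file ends with a newline, so the
## last element is an empty string and is included in the count below.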

tweets = callouts_text.split("\n")
tweets[:2]

print(f"Read {len(tweets)} tweets.")
tweets

Read 5 tweets.


['Just returned to the @WhiteHouse after a great evening @ Monroe, Louisiana with a massive turnout of Great American Patriots. With early voting underway until Sat, find your polling location below & go vote for your next #LAgov, @EddieRispone! #GeauxVote➡️https://vote.donaldjtrump.com',
 'The degenerate Washington Post MADE UP the story about me asking Bill Barr to hold a news conference. Never happened, and there were no sources!',
 'LOUISIANA! Early voting is underway until Saturday, it’s time to get out and VOTE to REPLACE Radical Liberal Democrat John Bel Edwards with a great new REPUBLICAN Governor, @EddieRispone! #GeauxVote',
 '“Based on the things I’ve seen, the Democrats have no case, or a weak case, at best. I don’t think there are, or will be, well founded articles of Impeachment here.” Robert Wray, respected former prosecutor. It is a phony scam by the Do Nothing Dems! @foxandfriends',
 '']
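
The empty string at the end of the list comes from the trailing newline in the file. A minimal sketch of two ways to avoid it, using a short stand-in for the file contents:

```python
# A short stand-in for the file contents; note the trailing newline.
callouts_text = "first tweet\nsecond tweet\n"

with_split = callouts_text.split("\n")
with_splitlines = callouts_text.splitlines()

print(with_split)       # ['first tweet', 'second tweet', '']
print(with_splitlines)  # ['first tweet', 'second tweet']

# Alternatively, drop empty or whitespace-only lines explicitly.
tweets = [line for line in callouts_text.split("\n") if line.strip()]
print(f"Read {len(tweets)} tweets.")  # Read 2 tweets.
```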

String length

In [10]:
tweet1 = tweets[0]
print(f"The first tweet has {len(tweet1)} characters.")

The first tweet has 285 characters.


String slices

In [11]:
print(f"The first four characters of the first tweet: '{tweet1[:4]}'")

The first four characters of the first tweet: 'Just'


String operations: `istitle()`, `isalnum()`, `lower()`, etc.

In [12]:
"title string".title()

'Title String'

In [15]:
"Title String".istitle()

True

In [17]:
"UpcaSe".lower()

'upcase'

In [18]:
"upcase".islower()

True

In [24]:
"lowcase".upper()

'LOWCASE'

In [25]:
"LOWCASE".isupper()

True
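
The other `is...()` checks work the same way. A short sketch with `isalnum()`, `isdigit()` and `isalpha()` on made-up inputs and a hashtag from the tweets:

```python
checks = {
    "abc123 isalnum": "abc123".isalnum(),    # only letters and digits
    "abc 123 isalnum": "abc 123".isalnum(),  # the space is not alphanumeric
    "#LAgov isalnum": "#LAgov".isalnum(),    # "#" is not alphanumeric
    "2023 isdigit": "2023".isdigit(),
    "GeauxVote isalpha": "GeauxVote".isalpha(),
}

for name, result in checks.items():
    print(f"{name}: {result}")
```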

It is often useful to remove leading and trailing whitespace. The methods
`strip()`, `rstrip()` and `lstrip()` do this and return a new string.

In [19]:
"    with leading white space".lstrip()

'with leading white space'

In [20]:
"with trailing whitespace        ".rstrip()

'with trailing whitespace'

In [21]:
"   with both leading and trailing whitespace       ".strip()

'with both leading and trailing whitespace'
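
These methods also accept an optional argument listing the characters to remove. Note that the argument is treated as a set of characters, not as a prefix or suffix. A brief sketch with made-up inputs:

```python
hashtag = "#GeauxVote!".strip("#!")
dots = "...wait...".rstrip(".")
# Every leading/trailing "a" or "b" is removed, in any order.
fruit = "banana".strip("ab")

print(hashtag)  # GeauxVote
print(dots)     # ...wait
print(fruit)    # nan
```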

Another common operation is splitting a string on some character, e.g. a space.

In [22]:
str_split = "The airplane took off.".split(" ")
print(str_split)

['The', 'airplane', 'took', 'off.']
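
Calling `split()` without an argument behaves differently from `split(" ")`: it splits on any run of whitespace and drops empty strings. A small sketch with a made-up input that contains a double space, a tab and a newline:

```python
text = "The  airplane\ttook off.\n"

on_space = text.split(" ")
on_whitespace = text.split()

print(on_space)       # ['The', '', 'airplane\ttook', 'off.\n']
print(on_whitespace)  # ['The', 'airplane', 'took', 'off.']
```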


Unlike in many other languages, joining
a list of strings is a method of the separator string, not of the list.

In [23]:
",".join(str_split)

'The,airplane,took,off.'
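
`join()` only accepts strings, so other types have to be converted first. A minimal sketch with illustrative lists:

```python
words = ["The", "airplane", "took", "off."]
sentence = " ".join(words)
print(sentence)  # The airplane took off.

# Joining integers directly would raise a TypeError; convert them to str first.
numbers = [1, 2, 3]
joined_numbers = ", ".join(str(n) for n in numbers)
print(joined_numbers)  # 1, 2, 3
```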

Replacing substrings

Note that as strings are immutable, the `.replace()` method returns a new string.


In [24]:
"some string to replace a string".replace("some", "-")

'- string to replace a string'
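
By default `.replace()` replaces every occurrence; an optional third argument limits the number of replacements. The sketch below also shows that the original string stays unchanged (variable names are illustrative).

```python
original = "some string to replace a string"

replaced_all = original.replace("string", "text")
replaced_first = original.replace("string", "text", 1)

print(replaced_all)    # some text to replace a text
print(replaced_first)  # some text to replace a string
print(original)        # some string to replace a string
```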