In [1]:

import pandas as pd

## Working with text data in Python

Before we begin

- String literals
- Escaping special characters
- String operations, indexing, slicing and formatting
- Regular expressions

### String literals

For detailed information see the Python 3 [string](https://docs.python.org/3/library/string.html) documentation.

In [2]:
## String literals
simple_string = "The dog chased the bat."

In [3]:
## Quotation marks need to be escaped using \ to remove their special meaning

string_with_escape_char_1 = "The building housing the John Snow pub was built in the 1870s and was originally called the \"Newcastle-upon-Tyne.\""

In [4]:
## Triple-quoted strings can contain unescaped quotation marks and may span several lines

string_long_format = """She said that this model is "the best there is." """
string_long_format

'She said that this model is "the best there is." '

In [5]:
## Some characters have a special meaning when preceded by \, e.g. \t is a tab

string_with_escape_char_2 = "Press \t to align the piece of code..."
print(string_with_escape_char_2)

string_with_escape_char_3 = "Press \\t to align the piece of code..."
print(string_with_escape_char_3)

Press 	 to align the piece of code...
Press \t to align the piece of code...


In [6]:
## String prefixes: r"string" is a raw string, in which backslashes are kept literally and escape sequences are not interpreted

string_with_escape_char_3 = r"Press \t to align the piece of code..."
print(string_with_escape_char_3)

Press \t to align the piece of code...
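
As a minimal sketch of what the `r` prefix changes, the lengths of a regular and a raw literal can be compared. The variable names below are illustrative only and are not used elsewhere in this notebook.

```python
# Compare a regular and a raw string literal (illustrative names).
regular = "a\tb"   # \t is interpreted as one tab character
raw = r"a\tb"      # the backslash and the 't' are kept as two characters

print(len(regular))    # 3
print(len(raw))        # 4
print(raw == "a\\tb")  # True: the raw string equals the manually escaped one
```

Raw strings are mostly used for regular expressions, where backslashes are frequent.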


### Reading text from a local file


In [7]:
### Read the whole file into memory as one string

## Open the file in read mode; the with block closes it automatically
with open('./data/tweets_hashtags_callouts.txt', 'r') as f:
    ## Read the contents of the file as one string
    callouts_text = f.read()

print(callouts_text)

Just returned to the @WhiteHouse after a great evening @ Monroe, Louisiana with a massive turnout of Great American Patriots. With early voting underway until Sat, find your polling location below & go vote for your next #LAgov, @EddieRispone! #GeauxVote➡️https://vote.donaldjtrump.com
The degenerate Washington Post MADE UP the story about me asking Bill Barr to hold a news conference. Never happened, and there were no sources!
LOUISIANA! Early voting is underway until Saturday, it’s time to get out and VOTE to REPLACE Radical Liberal Democrat John Bel Edwards with a great new REPUBLICAN Governor, @EddieRispone! #GeauxVote
“Based on the things I’ve seen, the Democrats have no case, or a weak case, at best. I don’t think there are, or will be, well founded articles of Impeachment here.” Robert Wray, respected former prosecutor. It is a phony scam by the Do Nothing Dems! @foxandfriends



In [8]:
## Commonly, each line in the file
## corresponds to a single record (e.g. a tweet)

## Split the string on newlines; the trailing newline at the end of the
## file produces one extra, empty element at the end of the list

tweets = callouts_text.split("\n")

print(f"Read {len(tweets)} tweets.")

Read 5 tweets.
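
The count of 5 includes the empty string after the final newline. A possible way to avoid this is to iterate over the file line by line and skip blank lines. The sketch below uses a temporary file with made-up content, so `sample_path`, `records` and `naive` are illustrative names rather than part of the tweets example.

```python
import os
import tempfile

sample_text = "first record\nsecond record\n"

# Splitting on "\n" keeps an empty string after the trailing newline.
naive = sample_text.split("\n")
print(len(naive))  # 3

# Write the sample text to a temporary file (hypothetical content).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

# Iterate over the lines, strip surrounding whitespace and skip blank lines.
with open(sample_path, encoding="utf-8") as f:
    records = [line.strip() for line in f if line.strip()]

os.remove(sample_path)
print(records)  # ['first record', 'second record']
```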


String length

In [9]:
tweet1 = tweets[0]
print(f"The first tweet has {len(tweet1)} characters.")

The first tweet has 285 characters.


String slices

In [10]:
print(f"The first four characters of the first tweet: '{tweet1[:4]}'")

The first four characters of the first tweet: 'Just'


String operations: `istitle()`, `isalnum()`, `lower()`, etc.

In [11]:
"title string".title()

'Title String'

In [12]:
"Title String".istitle()

True

In [13]:
"Upcase".lower()

'upcase'

In [14]:
"upcase".islower()

True

In [15]:
"lowcase".upper()

'LOWCASE'

In [16]:
"LOWCASE".isupper()

True

It is often useful to remove leading and trailing whitespace. The methods
`strip()`, `rstrip()` and `lstrip()` do this.

In [17]:
"    with leading white space".lstrip()

'with leading white space'

In [18]:
"with trailing whitespace        ".rstrip()

'with trailing whitespace'

In [19]:
"   with both leading and trailing whitespace   ".strip()

'with both leading and trailing whitespace'
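
The strip methods also accept an optional argument listing the characters to remove. The short sketch below uses made-up strings; note that the argument is treated as a set of characters, not as a prefix or suffix.

```python
# Remove any leading/trailing '#' and '!' characters (illustrative strings).
hashtag = "#GeauxVote!".strip("#!")
print(hashtag)  # GeauxVote

# Without an argument, all surrounding whitespace is removed, including newlines.
line = "  text\n".strip()
print(line)  # text
```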

Another common operation is splitting a string on some character, e.g. a space:

In [20]:
str_split = "The brown rabbit".split(" ")
print(str_split)

['The', 'brown', 'rabbit']
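
Splitting on a single space behaves differently from calling `split()` without an argument when the text contains repeated or mixed whitespace. A small sketch with an illustrative string:

```python
text = "The  brown\trabbit "  # two spaces, a tab and a trailing space

# An explicit separator yields empty strings for repeated or trailing separators.
on_space = text.split(" ")
print(on_space)  # ['The', '', 'brown\trabbit', '']

# Without an argument, runs of any whitespace act as one separator.
on_whitespace = text.split()
print(on_whitespace)  # ['The', 'brown', 'rabbit']
```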


Unlike in many other languages, joining a list of strings is a method
of the separator string, not of the list.

In [21]:
",".join(str_split)

'The,brown,rabbit'

Replacing substrings

In [22]:
"some string to replace a string".replace("string", "-")

'some - to replace a -'

In [23]:
## A negative step walks the string backwards; -2 takes every second character
"playful"[::-2]

'lfap'
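
For comparison, here is a short sketch of the other common indexing and slicing forms on the same word. The variable name `word` is illustrative.

```python
word = "playful"

print(word[0])     # p       (first character)
print(word[-1])    # l       (last character)
print(word[1:4])   # lay     (characters at positions 1, 2 and 3)
print(word[::2])   # pafl    (every second character)
print(word[::-1])  # lufyalp (reversed string)
```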

### Regular expressions

Regular expressions are a mini (or not so mini) language for specifying search patterns and are an indispensable
tool for handling text data in any programming language. Here we will briefly touch on some use cases in Python.
For more details, please visit the Python [documentation](https://docs.python.org/3.7/library/re.html). This [short introduction](https://realpython.com/regex-python/) can also be helpful.


In [24]:
import re

In [25]:
## Test if the string contains the substring "be"

has_match = re.search(r"be", "Let it be")

if has_match:
    print("Success!")

Success!


`re.match` only matches at the start of the string.

In [26]:
has_match = re.match("Be", "Be it as it may")

print("Success" if has_match else "No match")

Success


To ignore case when matching, pass the flag `re.IGNORECASE`.
Try the following without the flag to see the difference.

In [27]:
has_match = re.match("be", "Be it as it may", re.IGNORECASE)

print("Success" if has_match else "No match")

Success
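
The difference between `re.search`, `re.match` and `re.fullmatch` can be sketched on a single string. The helper variables below are illustrative only.

```python
import re

text = "Let it be"

found_anywhere = re.search(r"be", text) is not None        # pattern may occur anywhere
found_at_start = re.match(r"be", text) is not None         # pattern must start at position 0
found_whole = re.fullmatch(r"Let it be", text) is not None  # pattern must cover the whole string

print(found_anywhere, found_at_start, found_whole)  # True False True
```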


Find all four-character sequences that start with g. The character class `\w` matches
word characters. Without word boundaries (`\b`), the pattern would also match inside longer words.

In [28]:
matches = re.findall(r"g\w{3}", "The goal was to catch the goat.")
print(matches)

['goal', 'goat']
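
To restrict the matches to whole words, the pattern can be wrapped in `\b` anchors. A minimal sketch on a made-up sentence:

```python
import re

text = "The goat was digging in the garden."

# Without boundaries the pattern also matches inside or at the start of longer words.
loose = re.findall(r"g\w{3}", text)
print(loose)  # ['goat', 'ggin', 'gard']

# With \b on both sides only complete four-letter words match.
strict = re.findall(r"\bg\w{3}\b", text)
print(strict)  # ['goat']
```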


The character class `\d` matches decimal digits.

In [29]:
matches = re.findall(r"\d+", "Find all integers like 2 and 301 here.")
print(matches)

['2', '301']


Find all substrings starting with g and ending with either t or l. The non-greedy quantifier `*?` stops at the first t or l it reaches.

In [30]:
matches = re.findall(r"g.*?[tl]", "The goal was to catch the goat.")
print(matches)

['goal', 'goat']
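
The effect of the `?` after `*` is easiest to see by comparing the greedy and the non-greedy pattern on the same sentence. A short sketch:

```python
import re

text = "The goal was to catch the goat."

# Greedy: .* extends as far as possible, up to the last t or l in the string.
greedy = re.findall(r"g.*[tl]", text)
print(greedy)  # ['goal was to catch the goat']

# Non-greedy: .*? stops at the first t or l after each g.
lazy = re.findall(r"g.*?[tl]", text)
print(lazy)  # ['goal', 'goat']
```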


Use `re.search` with a named group to find the first substring starting with @ and return it as a dictionary. Note that `re.search` returns only the first match.

In [31]:
matches = re.search(r"(?P<mention>@\w+)", "Hi, @Ann23, @Pieas")
print(matches.groupdict())

{'mention': '@Ann23'}
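
To collect every mention rather than only the first, one option is `re.finditer`, which yields a match object per occurrence. A minimal sketch with the same pattern and string:

```python
import re

text = "Hi, @Ann23, @Pieas"

# Build one dictionary per match from the named group.
mentions = [m.groupdict() for m in re.finditer(r"(?P<mention>@\w+)", text)]
print(mentions)  # [{'mention': '@Ann23'}, {'mention': '@Pieas'}]
```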


Use a regular expression to compress whitespace within the string:

In [32]:
cleaned_string = re.sub(r"\s+", " ", "A string    with lots  of white space.")
print(cleaned_string)

A string with lots of white space.


Use a regular expression to change the position of the first and last names in the following string:

In [33]:
cleaned_string = re.sub(r"(\w+) (\w+)", r"\2 \1", "Mike Santori")
cleaned_string

'Santori Mike'
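
Numbered references such as `\1` become hard to read in longer patterns. As a sketch of an alternative, named groups can be referenced in the replacement with `\g<name>`. The group names `first` and `last` are chosen here for illustration.

```python
import re

# Swap the two names and insert a comma, referring to the groups by name.
swapped = re.sub(
    r"(?P<first>\w+) (?P<last>\w+)",
    r"\g<last>, \g<first>",
    "Mike Santori",
)
print(swapped)  # Santori, Mike
```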

### Strings in pandas

pandas Series and DataFrames have built-in methods for manipulating text columns, available through the `.str` accessor.
Let us build an example data frame from the `tweets` list.


In [34]:
co_df = pd.DataFrame(tweets, columns=["content"])
co_df

|   | content |
|---|---|
| 0 | Just returned to the @WhiteHouse after a great... |
| 1 | The degenerate Washington Post MADE UP the sto... |
| 2 | LOUISIANA! Early voting is underway until Satu... |
| 3 | “Based on the things I’ve seen, the Democrats ... |
| 4 |  |


Convert the `content` column to lowercase.


In [151]:
co_df["content"] = co_df["content"].str.lower()
co_df

|   | content |
|---|---|
| 0 | just returned to the @whitehouse after a great... |
| 1 | the degenerate washington post made up the sto... |
| 2 | louisiana! early voting is underway until satu... |
| 3 | “based on the things i’ve seen, the democrats ... |
| 4 |  |


Count the number of mentions in each tweet.

In [152]:
co_df["mentions_cnt"] = co_df["content"].str.count(r"@\w+")
co_df

|   | content | mentions_cnt |
|---|---|---|
| 0 | just returned to the @whitehouse after a great... | 2 |
| 1 | the degenerate washington post made up the sto... | 0 |
| 2 | louisiana! early voting is underway until satu... | 1 |
| 3 | “based on the things i’ve seen, the democrats ... | 1 |
| 4 |  | 0 |
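
Besides counting, the `.str` accessor can also return the matches themselves. The sketch below uses `Series.str.findall` on a small made-up Series (`demo` is an illustrative name, not part of `co_df`).

```python
import pandas as pd

# Hypothetical example texts.
demo = pd.Series(["hi @ann and @bob", "no mentions here", ""])

# Each element becomes a list with all matches of the pattern.
found = demo.str.findall(r"@\w+")
print(found.tolist())  # [['@ann', '@bob'], [], []]
```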


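Count the number of uppercase words in each tweet. Use `[]` to define a character class and the `\b` anchor to match
word boundaries. Note that `content` was converted to lowercase above, so every count below is zero; the pattern needs the original-case text to find anything.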

In [153]:
co_df["cries_cnt"] = co_df["content"].str.count(r"\b[A-Z0-9]+\b")
co_df

|   | content | mentions_cnt | cries_cnt |
|---|---|---|---|
| 0 | just returned to the @whitehouse after a great... | 2 | 0 |
| 1 | the degenerate washington post made up the sto... | 0 | 0 |
| 2 | louisiana! early voting is underway until satu... | 1 | 0 |
| 3 | “based on the things i’ve seen, the democrats ... | 1 | 0 |
| 4 |  | 0 | 0 |
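
A sketch of the same count on text that still has its original case is given below. The data frame `demo_df` and its contents are made up for illustration, and the pattern `[A-Z]{2,}` is an assumption that differs from the one above: it requires at least two uppercase letters and ignores digits.

```python
import pandas as pd

# Hypothetical texts in their original case.
demo_df = pd.DataFrame(
    {"content": ["VOTE for a great NEW governor", "Never happened", "call 911 NOW"]}
)

# Count whole words consisting of two or more uppercase letters.
demo_df["cries_cnt"] = demo_df["content"].str.count(r"\b[A-Z]{2,}\b")
print(demo_df["cries_cnt"].tolist())  # [2, 0, 1]
```

In the notebook itself, this would mean computing `cries_cnt` before lowercasing `content`, or keeping the original text in a separate column.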


Count the length of the tweet in terms of characters.

In [154]:
## Count the length of the tweet in characters
co_df["len"] = co_df["content"].str.len()
co_df

|   | content | mentions_cnt | cries_cnt | len |
|---|---|---|---|---|
| 0 | just returned to the @whitehouse after a great... | 2 | 0 | 285 |
| 1 | the degenerate washington post made up the sto... | 0 | 0 | 144 |
| 2 | louisiana! early voting is underway until satu... | 1 | 0 | 198 |
| 3 | “based on the things i’ve seen, the democrats ... | 1 | 0 | 265 |
| 4 |  | 0 | 0 | 0 |


In [72]:
## Count the length of the tweet in terms of words (splitting on whitespace)

## Split each tweet on runs of whitespace and count the resulting tokens.
## An empty string still yields one (empty) token, hence the count of 1 for the last row.
co_df["words_cnt"] = co_df["content"].str.split(r"\s+").str.len()
co_df

|   | content | words_cnt |
|---|---|---|
| 0 | Just returned to the @WhiteHouse after a great... | 40 |
| 1 | The degenerate Washington Post MADE UP the sto... | 25 |
| 2 | LOUISIANA! Early voting is underway until Satu... | 30 |
| 3 | “Based on the things I’ve seen, the Democrats ... | 47 |
| 4 |  | 1 |
