# Vectorised String Operations

In [1]:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2

array([ 4,  6, 10, 14, 22, 26])

NumPy offers no equally simple access for arrays of strings, so we fall back to a loop:

In [2]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]

['Peter', 'Paul', 'Mary', 'Guido']

This breaks as soon as the data contains a missing value:

In [3]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'
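
Without Pandas, the usual workaround is to guard each element by hand. A minimal sketch (the name `capitalized` is only illustrative):

```python
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# Guard every element explicitly so that None passes through untouched.
capitalized = [s.capitalize() if s is not None else None for s in data]
print(capitalized)  # ['Peter', 'Paul', None, 'Mary', 'Guido']
```

This works, but the guard has to be repeated for every string operation.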

Pandas handles both needs, vectorised string operations and missing data, through the str attribute of Series and Index objects:

In [4]:
import pandas as pd
names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

In [5]:
names.str.capitalize()

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object
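
The way missing values pass through can be checked directly. A minimal sketch, rebuilding the same Series so that it runs on its own (the name `capitalized` is illustrative):

```python
import pandas as pd

# Same data as above, including one missing value.
names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# The .str accessor applies the method element-wise and passes
# missing values through instead of raising AttributeError.
capitalized = names.str.capitalize()

# The missing entry stays missing; everything else is transformed.
print(capitalized.isna().tolist())    # [False, False, True, False, False]
print(capitalized.dropna().tolist())  # ['Peter', 'Paul', 'Mary', 'Guido']
```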

## Tables of Pandas String Methods

In [6]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

### Methods similar to Python methods

len()<br>
lower() <br>
translate()<br>
islower()<br>
ljust()<br>
upper()<br>
startswith()<br>
isupper()<br>
rjust()<br>
find()<br>
endswith()<br>
isnumeric()<br>
center()<br>
rfind()<br>
isalnum()<br>
isdecimal()<br>
zfill()<br>
index()<br>
isalpha()<br>
split()<br>
strip()<br>
rindex()<br>
isdigit()<br>
rsplit()<br>
rstrip()<br>
capitalize()<br>
isspace()<br>
partition()<br>
lstrip()<br>
swapcase()<br>
istitle()<br>
rpartition()<br>

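These methods have different return types. Some, like lower(), return a Series of strings; others return numbers (len()), booleans (startswith()), or lists (split()):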

In [7]:
monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

In [8]:
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [9]:
monte.str.startswith('T')

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

In [10]:
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
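
The return type determines what can be done with the result next. A short sketch of typical follow-up steps, assuming the same `monte` Series (the names `lengths`, `starts_t`, `parts` and `terrys` are illustrative):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

lengths = monte.str.len()              # one integer per element
starts_t = monte.str.startswith('T')   # one boolean per element
parts = monte.str.split()              # one list of strings per element

# A boolean result works directly as a mask on the original Series.
terrys = monte[starts_t]
print(terrys.tolist())       # ['Terry Gilliam', 'Terry Jones']

# A numeric result supports the usual aggregations.
print(int(lengths.max()))    # 14

# A list-valued result can be unpacked element by element.
print(list(parts.iloc[0]))   # ['Graham', 'Chapman']
```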

### Methods using regex

| Method | Description |
|---|---|
| match() | Call re.match() on each element, returning a boolean |
| extract() | Call re.search() on each element, returning matched groups as strings |
| findall() | Call re.findall() on each element |
| replace() | Replace occurrences of pattern with some other string |
| contains() | Call re.search() on each element, returning a boolean |
| count() | Count occurrences of pattern |
| split() | Equivalent to str.split(), but accepts regexps |
| rsplit() | Equivalent to str.rsplit(); splits from the right on a literal delimiter |
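
A few of the methods listed above, applied to the `monte` Series. This is a minimal sketch; the patterns and result names are chosen only for illustration:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains() searches anywhere in the string.
has_double_l = monte.str.contains(r'll')

# match() only succeeds if the pattern matches at the start of the string.
starts_with_vowel = monte.str.match(r'[AEIOU]')

# count() counts non-overlapping occurrences of the pattern.
e_counts = monte.str.count(r'e')

# replace() treats the pattern as a regular expression when regex=True.
underscored = monte.str.replace(r'\s+', '_', regex=True)

print(has_double_l.tolist())       # [False, False, True, False, False, False]
print(starts_with_vowel.tolist())  # [False, False, False, True, False, False]
print(e_counts.tolist())           # [0, 3, 1, 1, 2, 1]
print(underscored.iloc[0])         # Graham_Chapman
```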

We can extract the first name from each element by asking for a contiguous group of characters at the beginning of each element:

In [13]:
monte.str.extract('([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object
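
With more than one capture group, extract() returns a DataFrame with one column per group. A sketch using named groups, where the group names `first` and `last` are assumptions made for this example:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Each named group becomes a column; expand defaults to True,
# so the result is a DataFrame rather than a Series.
name_parts = monte.str.extract(r'(?P<first>[A-Za-z]+) (?P<last>[A-Za-z]+)')

print(list(name_parts.columns))     # ['first', 'last']
print(name_parts['last'].tolist())
# ['Chapman', 'Cleese', 'Gilliam', 'Idle', 'Jones', 'Palin']
```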

Or we can find all names that start and end with a consonant, using the start-of-string (^) and end-of-string ($) regular expression characters:

In [16]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
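
findall() returns a list per element, which is awkward for filtering. To keep only the matching names, the same pattern can be passed to contains() to build a boolean mask. A minimal sketch (the names `pattern`, `mask` and `selected` are illustrative):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Same pattern as above: starts with a non-vowel, ends with a non-vowel.
pattern = r'^[^AEIOU].*[^aeiou]$'

# contains() yields one boolean per element, usable as a mask.
mask = monte.str.contains(pattern)
selected = monte[mask]

print(selected.tolist())
# ['Graham Chapman', 'Terry Gilliam', 'Terry Jones', 'Michael Palin']
```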

### Miscellaneous methods

| Method | Description |
|---|---|
| get() | Index each element |
| slice() | Slice each element |
| slice_replace() | Replace slice in each element with passed value |
| cat() | Concatenate strings |
| repeat() | Repeat values |
| normalize() | Return Unicode form of string |
| pad() | Add whitespace to left, right, or both sides of strings |
| wrap() | Split long strings into lines with length less than a given width |
| join() | Join strings in each element of the Series with passed separator |
| get_dummies() | Extract dummy variables as a DataFrame |
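
A few of these methods in action, as a rough sketch. The intermediate Series `first` and the separators and fill characters are chosen only for illustration:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# First names, taken as item 0 of each split list.
first = monte.str.split().str.get(0)

# cat() with no other Series concatenates all elements into one string.
all_first = first.str.cat(sep=', ')

# repeat() repeats each element the given number of times.
doubled = first.str.repeat(2)

# pad() fills each element up to the requested width.
padded = first.str.pad(8, side='left', fillchar='.')

# slice_replace() swaps a slice of each element for a new string.
initials = first.str.slice_replace(1, None, '.')

# join() joins the items of each list-valued element.
slashed = monte.str.split().str.join('/')

print(all_first)          # Graham, John, Terry, Eric, Terry, Michael
print(doubled.iloc[1])    # JohnJohn
print(padded.tolist())
print(initials.tolist())  # ['G.', 'J.', 'T.', 'E.', 'T.', 'M.']
print(slashed.iloc[0])    # Graham/Chapman
```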

In [18]:
monte.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

In [19]:
monte.str.slice(0,3)

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

In [20]:
monte.str.get(2)

0    a
1    h
2    r
3    i
4    r
5    c
dtype: object

In [22]:
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
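
These access methods combine naturally with ordinary Series arithmetic. A small sketch that builds abbreviated names from the first character and the last word (the names `initial`, `last` and `short_names` are illustrative):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Indexing through .str is shorthand for get() and slice().
initial = monte.str[0]
last = monte.str.split().str.get(-1)

# String Series support element-wise concatenation with +.
short_names = initial + '. ' + last

print(short_names.tolist())
# ['G. Chapman', 'J. Cleese', 'T. Gilliam', 'E. Idle', 'T. Jones', 'M. Palin']
```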

### Indicator variables

The get_dummies() method is useful when a column contains some sort of coded indicator. For example, suppose the codes are A="born in America", B="born in the United Kingdom", C="likes cheese", and D="likes spam":

In [23]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

             name   info
0  Graham Chapman  B|C|D
1     John Cleese    B|D
2   Terry Gilliam    A|C
3       Eric Idle    B|D
4     Terry Jones    B|C
5   Michael Palin  B|C|D


The get_dummies() method splits these indicator codes out into a DataFrame, with one column per code:

In [24]:
full_monte['info'].str.get_dummies('|')

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
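
Once the codes are split into columns, they can be used like any other numeric data. A minimal sketch, assuming the same `full_monte` DataFrame (the names `dummies`, `likes_spam` and `code_counts` are illustrative):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})

# One indicator column per code found in the 'info' strings.
dummies = full_monte['info'].str.get_dummies('|')

# Select the names of everyone whose info includes code D ("likes spam").
likes_spam = full_monte.loc[dummies['D'] == 1, 'name']

# Count how often each code occurs.
code_counts = dummies.sum()

print(likes_spam.tolist())
# ['Graham Chapman', 'John Cleese', 'Eric Idle', 'Michael Palin']
print({code: int(n) for code, n in code_counts.items()})
# {'A': 1, 'B': 5, 'C': 4, 'D': 4}
```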


## Example: Recipe Database

Can't do this now

## A simple recipe recommender