<h1>Vectorized String Operations</h1>

In [1]:
# Pandas provides a comprehensive set of vectorized string operations. They are an essential part of the
# munging required when cleaning up real-world data.

<h3>Introducing Pandas String Operations</h3>

In [2]:
# General import of numpy
import numpy as np

In [3]:
# Perform normal arithmetic operation
x = np.array([2,3,5,7,11,13])
x*2

array([ 4,  6, 10, 14, 22, 26])

In [4]:
# Vectorizing operations simplifies the syntax for operating on arrays of data:
# we no longer have to worry about the size or shape of the array, only about the operation we want performed.

# For arrays of strings, NumPy does not provide such simple access, so we fall back on a more verbose loop syntax. Example:

data = ["peter","Paul","MARY","gUIDO"]
[s.capitalize() for s in data]

['Peter', 'Paul', 'Mary', 'Guido']

In [5]:
# This works when every entry is a string, but it breaks as soon as the data contains a missing value such as None:

data = ["peter","Paul", None, "MARY","gUIDO"]
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'
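
# A manual workaround is to guard every element against None inside the comprehension. The sketch below
# assumes missing values appear only as None (the name `capitalized` is ours, not from the notebook). It works,
# but the guard has to be repeated for every string method we want to apply.

```python
data = ["peter", "Paul", None, "MARY", "gUIDO"]

# Apply capitalize() only to real strings; pass None through unchanged
capitalized = [s.capitalize() if s is not None else None for s in data]
print(capitalized)
```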

In [6]:
# Pandas addresses both needs, vectorized string operations and correct handling of missing data,
# through the str attribute of Pandas Series and Index objects containing strings.

# General pandas import
import pandas as pd

In [7]:
names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

In [8]:
# We can now call a single method that capitalizes all the entries while skipping over any missing values:

names.str.capitalize()

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object

In [10]:
names.str.lower()

0    peter
1     paul
2     None
3     mary
4    guido
dtype: object
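
# The same str accessor is also available on Index objects. A small sketch, using a hypothetical
# index `idx` that is not part of the notebook's data:

```python
import pandas as pd

# Hypothetical Index of strings, for illustration only
idx = pd.Index(["peter", "Paul", "MARY"])

# The accessor returns a new Index with the method applied to every label
upper = idx.str.upper()
print(upper.tolist())
```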

<h3>Tables of Pandas String Methods</h3>

In [11]:
# Series of names:

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

In [12]:
# Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method.

monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

In [13]:
# Get the length of each string
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [14]:
# Check whether each string starts with a specific letter
monte.str.startswith('T')

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

In [15]:
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
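
# When the pieces are wanted as separate columns rather than lists, split() accepts expand=True.
# A minimal sketch on the same names; the column labels "first" and "last" are our own choice:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# expand=True spreads the split pieces across columns instead of lists
parts = monte.str.split(expand=True)
parts.columns = ["first", "last"]  # labels chosen for this sketch
print(parts)
```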

<h4>Methods using regular expressions</h4>

In [16]:
# extract() returns the text matched by each capture group; here, the first run of letters in each name

monte.str.extract("([A-Za-z]+)")

         0
0   Graham
1     John
2    Terry
3     Eric
4    Terry
5  Michael
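
# Two variations on extract() that may be useful, sketched on the same names: named groups become
# column labels, and expand=False returns a Series when there is a single group. The group names
# "first" and "last" are our own.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Named groups become the column names of the resulting DataFrame
name_parts = monte.str.extract(r"(?P<first>[A-Za-z]+) (?P<last>[A-Za-z]+)")
print(name_parts)

# With a single group, expand=False returns a Series instead of a DataFrame
first = monte.str.extract(r"([A-Za-z]+)", expand=False)
print(first.tolist())
```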


In [17]:
# Find all names that start and end with a consonant, using the start-of-string (^) and end-of-string ($)
# regular expression characters.

monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
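
# The same pattern can serve as a filter: str.contains() returns a boolean mask that selects the
# matching rows directly. A minimal sketch (the names `mask` and `consonant_names` are ours):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

pattern = r'^[^AEIOU].*[^aeiou]$'

# True where the whole name starts and ends with a consonant
mask = monte.str.contains(pattern)

# Boolean indexing keeps only the matching names
consonant_names = monte[mask]
print(consonant_names.tolist())
```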

<h4>Miscellaneous Methods</h4>

<h5>Vectorized item access and slicing</h5>

In [18]:
# The get() and slice() operations enable vectorized element access on each string; the str[...] indexing
# syntax used here is shorthand for them.

monte.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object
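
# A quick check of how the indexing syntax relates to the named methods, sketched on the same names:
# str[0:3] corresponds to str.slice(0, 3), and str[i] corresponds to str.get(i).

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Slicing syntax and the slice() method give the same result
same_slice = monte.str[0:3].equals(monte.str.slice(0, 3))

# Integer indexing and the get() method give the same result
same_item = monte.str[0].equals(monte.str.get(0))

# First character of every name
initials = monte.str[0].tolist()
print(same_slice, same_item, initials)
```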

In [19]:
# Combine split() and get() to extract the last name from each string

monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object

<h5>Indicator Variables</h5>

In [20]:
# The get_dummies() method is useful when the data has a column containing some sort of coded indicator

full_monte = pd.DataFrame({"name": monte,
                           "info": ["B|C|D","B|D","A|C","B|D","B|C","B|C|D"]})
full_monte

             name   info
0  Graham Chapman  B|C|D
1     John Cleese    B|D
2   Terry Gilliam    A|C
3       Eric Idle    B|D
4     Terry Jones    B|C
5   Michael Palin  B|C|D


In [22]:
# The get_dummies() routine quickly splits these indicator variables out into a DataFrame.

full_monte["info"].str.get_dummies("|")

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
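
# The indicator columns are usually most useful next to the original data. A minimal sketch that joins
# them back onto the frame and selects the names carrying code "C" (the names `combined` and `with_c`
# are ours):

```python
import pandas as pd

full_monte = pd.DataFrame({
    "name": ['Graham Chapman', 'John Cleese', 'Terry Gilliam',
             'Eric Idle', 'Terry Jones', 'Michael Palin'],
    "info": ["B|C|D", "B|D", "A|C", "B|D", "B|C", "B|C|D"],
})

# One indicator column per distinct code found in "info"
indicators = full_monte["info"].str.get_dummies("|")

# join() aligns on the shared row index
combined = full_monte.join(indicators)

# Rows whose "info" field contains the code C
with_c = combined.loc[combined["C"].astype(bool), "name"].tolist()
print(with_c)
```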


<h3>Recipe DataBase</h3>

In [23]:
# The vectorized string operations become most useful in the process of cleaning up messy, real-world data.

!curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    20  100    20    0     0     25      0 --:--:-- --:--:-- --:--:--    25


In [24]:
!gunzip recipeitems-latest.json.gz

In [27]:
# The data is in JSON format, so we use the pd.read_json method

try:
    recipes = pd.read_json("recipeitems-latest.json")
except ValueError as exception:
    print("ValueError is ", exception)

ValueError is  Trailing data


In [30]:
# Each line of the file is a valid JSON object, but the file as a whole is not valid JSON.
# Try reading just the first line:

with open('recipeitems-latest.json') as f:
    line = f.readline()
pd.read_json(line).shape

ValueError: If using all scalar values, you must pass an index

In [33]:
with open('recipeitems-latest.json') as f:
    line = f.readline()
    print("line : ",line)

line :  {"name": "Easter Leftover Sandwich", "ingredients": "12 whole Hard Boiled Eggs\n1/2 cup Mayonnaise\n3 Tablespoons Grainy Dijon Mustard\n Salt And Pepper, to taste\n Several Dashes Worcestershire Sauce\n Leftover Baked Ham, Sliced\n Kaiser Rolls Or Other Bread\n Extra Mayonnaise And Dijon, For Spreading\n Swiss Cheese Or Other Cheese Slices\n Thinly Sliced Red Onion\n Avocado Slices\n Sliced Tomatoes\n Lettuce, Spinach, Or Arugula", "url": "http://thepioneerwoman.com/cooking/2013/04/easter-leftover-sandwich/", "image": "http://static.thepioneerwoman.com/cooking/files/2013/03/leftoversandwich.jpg", "cookTime": "PT", "recipeYield": "8", "datePublished": "2013-04-01", "prepTime": "PT15M", "description": "Got leftover Easter eggs?    Got leftover Easter ham?    Got a hearty appetite?    Good! You've come to the right place!    I..."}



In [34]:
with open('recipeitems-latest.json') as f:
    line = f.readline()
    pd.read_json(line).shape

ValueError: If using all scalar values, you must pass an index
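
# The error above arises because the record is a flat dictionary of scalar values, which gives pandas
# no index to build rows from. One way around it, sketched here with a shortened stand-in for the first
# line of the file, is to parse the line with json.loads and wrap the resulting dict in a list:

```python
import json

import pandas as pd

# Shortened stand-in for one line of the recipe file (only three of the fields)
line = '{"name": "Easter Leftover Sandwich", "recipeYield": "8", "prepTime": "PT15M"}'

record = json.loads(line)         # a plain dict of scalar values
one_row = pd.DataFrame([record])  # a list of dicts gives one row per dict
print(one_row.shape)
```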

In [35]:
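# Read the entire file and join the lines into a single JSON array string
from io import StringIO

with open('recipeitems-latest.json', 'r') as f:
    # Strip surrounding whitespace from each line
    data = (line.strip() for line in f)
    # Wrap the comma-joined records in brackets to form a valid JSON array
    data_json = "[{0}]".format(",".join(data))
# Parse the result; StringIO avoids passing a literal JSON string, which newer pandas versions deprecate
recipes = pd.read_json(StringIO(data_json))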
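
# pandas can also parse line-delimited JSON directly through the lines parameter of read_json.
# A minimal sketch on a temporary file holding two made-up records; the records and the name `sample`
# are assumptions for illustration, not the recipe data:

```python
import json
import os
import tempfile

import pandas as pd

# Two made-up records, written one JSON object per line to mimic the file layout
records = [
    {"name": "Recipe A", "ingredients": "1 cup Flour\n2 Eggs"},
    {"name": "Recipe B", "ingredients": "Salt And Pepper, to taste"},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
    path = f.name

# lines=True tells read_json to treat each line as a separate JSON object
sample = pd.read_json(path, lines=True)
os.remove(path)
print(sample.shape)
```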

In [36]:
# Get the shape of the data:
recipes.shape

(1042, 9)

In [37]:
# Look at one of the rows of recipes
recipes.iloc[0]

name                                      Easter Leftover Sandwich
ingredients      12 whole Hard Boiled Eggs\n1/2 cup Mayonnaise\...
url              http://thepioneerwoman.com/cooking/2013/04/eas...
image            http://static.thepioneerwoman.com/cooking/file...
cookTime                                                        PT
recipeYield                                                      8
datePublished                                           2013-04-01
prepTime                                                     PT15M
description      Got leftover Easter eggs?    Got leftover East...
Name: 0, dtype: object

In [38]:
# Closer look at the data: summarize the length of the ingredient strings
recipes.ingredients.str.len().describe()

count    1042.000000
mean      358.645873
std       187.332133
min        22.000000
25%       246.250000
50%       338.000000
75%       440.000000
max      3160.000000
Name: ingredients, dtype: float64

In [39]:
# Recipe with the longest ingredient list; idxmax() returns the index label of the maximum
recipes.name[recipes.ingredients.str.len().idxmax()]

'A Nice Berry Pie'
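
# A position-based result, such as the one np.argmax returns, coincides with the index labels only for a
# default integer index, whereas idxmax() returns the label itself. A sketch on made-up data with a
# deliberately non-default index:

```python
import pandas as pd

# Made-up sample; the index labels 10, 20, 30 differ from positions 0, 1, 2
sample = pd.DataFrame(
    {"name": ["Toast", "Berry Pie", "Soup"],
     "ingredients": ["Bread", "Berries\nFlour\nButter\nSugar", "Water\nSalt"]},
    index=[10, 20, 30],
)

lengths = sample["ingredients"].str.len()

# idxmax() gives the label of the longest entry, suitable for .loc lookups
longest_label = lengths.idxmax()
longest_name = sample.loc[longest_label, "name"]
print(longest_label, longest_name)
```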

In [40]:
# Count the recipes whose description mentions breakfast
recipes.description.str.contains('[Bb]reakfast').sum()

11

In [41]:
# Recipes including cinnamon as an ingredient
recipes.ingredients.str.contains('[Cc]innamon').sum()

79

In [42]:
# Count the recipes that misspell cinnamon as "cinamon"
recipes.ingredients.str.contains('[Cc]inamon').sum()

0
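
# Instead of spelling out both cases in a character class, str.contains() accepts case=False, and na=False
# treats missing entries as non-matches. A minimal sketch on made-up ingredient strings:

```python
import pandas as pd

# Made-up ingredient strings, including a missing entry
ingredients = pd.Series(["1 tsp Cinnamon", "2 cups flour", None, "ground cinnamon"])

# case=False ignores letter case; na=False counts missing entries as non-matches
has_cinnamon = ingredients.str.contains("cinnamon", case=False, na=False)
print(has_cinnamon.sum())
```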

<h4>A simple recipe recommender</h4>

In [43]:
# List of common spices to look for in the recipes

spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

In [44]:
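import re
# Note: the second positional parameter of str.contains is `case`, so re.IGNORECASE passed positionally
# leaves the match case-sensitive; pass flags=re.IGNORECASE (or case=False) to ignore case.
spice_dataframe = pd.DataFrame(dict((spice,recipes.ingredients.str.contains(spice, re.IGNORECASE))
                                        for spice in spice_list))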
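
# str.contains() takes `case` as its second positional parameter and `flags` after it, so regular
# expression flags are safest passed by keyword. A sketch of a case-insensitive indicator frame on
# made-up ingredient strings (the data and the name `spice_df` are assumptions for illustration):

```python
import re

import pandas as pd

# Made-up ingredient strings with mixed capitalization
ingredients = pd.Series(["Salt And Pepper, to taste", "1 tsp paprika", "2 cups Flour"])
spices = ["salt", "pepper", "paprika"]

# One boolean column per spice; flags is passed by keyword so the match ignores case
spice_df = pd.DataFrame({
    spice: ingredients.str.contains(spice, flags=re.IGNORECASE)
    for spice in spices
})
print(spice_df)
```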

In [45]:
spice_dataframe.head()

    salt  pepper  oregano   sage  parsley  rosemary  tarragon  thyme  paprika  cumin
0  False   False    False  False    False     False     False  False    False  False
1  False   False    False  False    False     False     False  False    False  False
2  False   False    False  False    False     False     False  False    False  False
3  False   False    False  False    False     False     False  False    False  False
4  False   False    False  False    False     False     False  False    False  False


In [46]:
# Recipes using parsley, paprika and tarragon
selection = spice_dataframe.query('parsley & paprika & tarragon')
len(selection)

0

In [47]:
recipes.name[selection.index]

Series([], Name: name, dtype: object)
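
# To see the selection step return something, here is the same pipeline sketched end to end on three
# made-up recipes, with case-insensitive matching. The data and the names `recipes_sample`, `flags_df`
# and `matches` are assumptions for illustration, not results from the recipe file.

```python
import pandas as pd

# Three made-up recipes
recipes_sample = pd.DataFrame({
    "name": ["Herb Chicken", "Plain Rice", "Spiced Potatoes"],
    "ingredients": ["Parsley, Paprika, Tarragon, chicken",
                    "rice, water",
                    "potatoes, paprika, parsley"],
})
wanted = ["parsley", "paprika", "tarragon"]

# Boolean indicator column for each wanted spice
flags_df = pd.DataFrame({
    spice: recipes_sample["ingredients"].str.contains(spice, case=False)
    for spice in wanted
})

# query() keeps the rows where all three indicator columns are True
selection = flags_df.query("parsley & paprika & tarragon")

# The surviving index labels look up the recipe names
matches = recipes_sample["name"][selection.index].tolist()
print(matches)
```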