## Text cleaning with Pandas

### Vectorized String Operations

One strength of Python is its support for handling and manipulating string data. 

Pandas builds on this and provides a comprehensive set of vectorized string operations that are an essential piece of the tool kit for working with real-world data.

### Pandas String Operations
NumPy and Pandas generalize arithmetic operations so that the same operation can be applied easily and quickly to many array elements.

Pandas supports application of string manipulation and regular expressions on whole arrays in a similar fashion.

As of version 1.0.0 (January 2020), Pandas provides first-class support for string data through a dedicated extension type, StringDtype, introduced as an experimental feature.

So now there are two ways to store text data in pandas:

- object-dtype NumPy array

- StringDtype extension type

The new recommendation is to use StringDtype to store text data.

You can accidentally store a mixture of strings and non-strings in an object dtype array. It’s better to have a dedicated dtype.
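
As a minimal sketch of that risk (the sample values are made up for illustration), an object-dtype Series accepts a mix of Python types without complaint, while a Series created with the string dtype carries a dedicated dtype:

```python
import pandas as pd

# An object-dtype Series accepts any Python object, so non-strings slip in silently.
mixed = pd.Series(["apple", 42, None], dtype=object)
print(mixed.dtype)
print([type(v).__name__ for v in mixed])

# A Series created with the dedicated string dtype; the missing entry is reported by isna().
text = pd.Series(["apple", "pear", None], dtype="string")
print(text.dtype)
print(text.isna().tolist())
```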

In [2]:
import pandas as pd
import numpy as np

NumPy example:

In [None]:
x = np.array([2, 3, 5, 7, 11, 13])
x * 2

This vectorization of operations simplifies the syntax of operating on arrays of data.

We no longer have to worry about the size or shape of the array, just about what operation we want done. 

For arrays of strings, NumPy does not provide such simple access, which means we have to revert to a loop syntax:

In [None]:
data = ['samantha', 'Paul', 'MARY', 'gEORGE', 'WATSON']
[s.capitalize() for s in data]

This is perhaps sufficient to work with some data, but it breaks if there are any missing values. For example:

In [None]:
data = ['peter', 'Paul', None, 'MARY', 'gEORGE', 'WATSON']
[s.capitalize() for s in data]
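
The comprehension above raises an AttributeError because None has no capitalize method. A plain-Python workaround, sketched here only for contrast with the Pandas approach, is to test each element before calling the string method:

```python
data = ['peter', 'Paul', None, 'MARY', 'gEORGE', 'WATSON']

# Call capitalize() only on real strings; pass everything else through unchanged.
capitalized = [s.capitalize() if isinstance(s, str) else s for s in data]
print(capitalized)
# ['Peter', 'Paul', None, 'Mary', 'George', 'Watson']
```

This works, but the check has to be repeated for every string operation, which is the bookkeeping the Pandas str methods take over.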

In [None]:
pd.Series(["a", "b", "c"]) # without an explicit dtype, pandas 1.x stores text as object dtype

In [None]:
pd.Series(["a", "b", "c"], dtype="string") # need to specify string dtype explicitly


In [None]:
names=pd.Series(['samantha', 'Paul', 'MARY', None,'gEORGE', 'WATSON'],dtype="string")
names

Pandas adds features for vectorized string operations and for correctly handling missing data.

This is managed through the str attribute of **Pandas Series** and **Index objects** containing strings. 


Now we can call a Pandas str method that will capitalize all the entries, and skip over any missing values:

In [None]:
names.str.capitalize()

The str methods can also be applied to DataFrame columns.

In [None]:
df = pd.DataFrame({'A': ['samantha', 'john', 'bODAY', 'minA', 'Peter', 'nicky'], 
                  'B': ['Grad','masters', 'graduate', 'graduate', 
                                   'Masters', 'Graduate'], 
                  'C': [22,27, 23, 21, 23, 24]}) 
   
df 

In [None]:
df['A'] = df['A'].str.capitalize() 
df

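A column can be explicitly converted to the string type with astype after the Series or DataFrame is created. Note that astype returns a new Series; assign the result back to the column to keep the conversion.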

In [None]:
df['A'].astype("string")

In [None]:
df.dtypes

### Pandas String Methods

Most string manipulations in Python carry over to Pandas string syntax.


### A list of Pandas str methods that mirror Python string methods:

          
        len()      lower()       translate()   islower()
        ljust()    upper()       startswith()  isupper()
        rjust()    find()        endswith()    isnumeric()
        center()   rfind()       isalnum()     isdecimal()
        zfill()    index()       isalpha()     split()
        strip()    rindex()      isdigit()     rsplit()
        rstrip()   capitalize()  isspace()     partition()
        lstrip()   swapcase()    istitle()     rpartition()
    
These have various return values: some return a Series of strings, others return Booleans or integers.
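
A short sketch of the different return types, using a small made-up Series:

```python
import pandas as pd

s = pd.Series(["abc", "Hello", "123"])

upper = s.str.upper()        # Series of strings
is_alpha = s.str.isalpha()   # Series of Booleans
lengths = s.str.len()        # Series of integers

print(upper.tolist())     # ['ABC', 'HELLO', '123']
print(is_alpha.tolist())  # [True, True, False]
print(lengths.tolist())   # [3, 5, 3]
```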

##### Some case-related examples:

- title: Converts first character of each word to uppercase and remaining to lowercase.

- capitalize: Converts first character to uppercase and remaining to lowercase.

- swapcase: Converts uppercase to lowercase and lowercase to uppercase.
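
The three methods are easiest to tell apart side by side. A small sketch on made-up, mixed-case input:

```python
import pandas as pd

s = pd.Series(["monty python", "FLYING circus"])

titled = s.str.title()            # first letter of every word uppercase
capitalized = s.str.capitalize()  # only the first letter of the string uppercase
swapped = s.str.swapcase()        # every letter's case inverted

print(titled.tolist())       # ['Monty Python', 'Flying Circus']
print(capitalized.tolist())  # ['Monty python', 'Flying circus']
print(swapped.tolist())      # ['MONTY PYTHON', 'flying CIRCUS']
```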

In [None]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

In [None]:
monte.str.lower()

In [None]:
monte.str.upper()

In [None]:
monte.str.swapcase()

In [None]:
monte.str.startswith('T')

In [None]:
monte.str.len()

In [None]:
monte.str.split()

#### Examples of str.contains

In [None]:
s1 = pd.Series(['Mouse', 'dog', 'horse and parrot', '23', 'frog', np.nan])  # np.NaN was removed in NumPy 2.0

In [None]:
s1.str.contains('og', regex=False)

Case sensitivity is controlled with the case parameter.

In [None]:
s1.str.contains('oG', case=True, regex=True)

In [None]:
s1.str.contains('Mo', case=True, regex=True)

Match strings that contain horse or dog:

In [None]:
s1.str.contains('horse|dog', regex=True)

In [None]:
import re # provides regular expression matching operations

Ignoring case sensitivity using flags with regex.

In [None]:
s1.str.contains('PARROT', flags=re.IGNORECASE, regex=True)


Matching any digit with a regular expression.

In [None]:
s1.str.contains('\\d', regex=True)

### Examples of str.strip, lstrip, and rstrip
Note that str methods can be applied to an Index as well as a Series.

In [None]:
idx = pd.Index([" jack", "jill ", " jesse ", "frank"])

idx.str.strip()

In [None]:
idx.str.lstrip()

In [None]:
idx.str.rstrip()

Since the column labels of a DataFrame are stored in an Index, str methods apply to them as well.

In [None]:
df = pd.DataFrame(np.random.randn(3, 2), columns=[" Column A ", " COLUmn B "], index=range(3))

In [None]:
df

Combinations of str methods for cleaning up column labels.

In [None]:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df

#### Examples of str.split

split returns a Series of lists.


In [3]:
s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"])
s2

0    a_b_c
1    c_d_e
2      NaN
3    f_g_h
dtype: object

In [None]:
s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"],dtype=pd.StringDtype())
s2


Elements in the split lists can be accessed using get or [] notation:

In [None]:
s2.str.split("_").str.get(1)

In [None]:
s2.str.split("_").str[1]

#### Vectorized item access and slicing

The get() and slice() operations, in particular, enable vectorized element access from each string. For example, we can get the first three characters of each string using str.slice(0, 3). This behavior is also available through Python's normal indexing syntax: for example, monte.str.slice(0, 3) is equivalent to monte.str[0:3]:

In [None]:
monte.str[0:3]

Indexing a Series via .str.get(i) and .str[i] is likewise equivalent.

These get() and slice() methods also let you access elements of arrays returned by split(). For example, to extract the last name of each entry, we can combine split() and get():

In [None]:
monte.str.split().str.get(-1)

#### Regular expressions

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module.

With re you can specify rules for the set of possible strings that you want to match.

You can ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C.

#### Matching characters
Most letters and characters simply match themselves: a matches a, g matches g, and so on.

Character sets or classes, e.g. [abc] will match any of the characters a, b, or c. 
 
This is the same as [a-c], which uses a range to express the same set of characters. To match only lowercase letters, the RE would be [a-z]; for uppercase letters, [A-Z].
 
With a “character set”, you tell the regex engine to match exactly one out of several characters. For example, the character set [ae] can be used in gr[ae]y to match either gray or grey.

You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class. For example, [^5] will match any character except '5'. If the caret appears elsewhere in a character class, it does not have special meaning. For example: [5^] will match either a '5' or a '^'.
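
The character-set rules above can be checked directly with the re module. This is a small sketch on made-up input strings:

```python
import re

# A set matches exactly one of the listed characters.
gray_or_grey = re.findall(r"gr[ae]y", "gray grey griy")
# A range is shorthand for listing the characters one by one.
a_to_c = re.findall(r"[a-c]", "abcd")
# A leading caret complements the set: anything except '5'.
not_five = re.findall(r"[^5]", "1525")
# A caret elsewhere in the set is an ordinary character.
five_or_caret = re.findall(r"[5^]", "5^6")

print(gray_or_grey)   # ['gray', 'grey']
print(a_to_c)         # ['a', 'b', 'c']
print(not_five)       # ['1', '2']
print(five_or_caret)  # ['5', '^']
```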

List of metacharacters: . ^ $ * + ? { } [ ] \ | ( )

The period (.) matches any single character except a newline (unless the re.DOTALL flag is set).

\d
Matches any decimal digit; this is equivalent to the class [0-9].

\D
Matches any non-digit character; this is equivalent to the class [^0-9].

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or '.'.

^\s+ matches leading whitespace and \s+$ matches trailing whitespace.
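
A brief sketch of these whitespace patterns in use, first with re.sub and then with the vectorized str.replace on a made-up Series. Combining the two anchored patterns with | removes both leading and trailing whitespace, much like str.strip():

```python
import re
import pandas as pd

no_leading = re.sub(r"^\s+", "", "  jack  ")    # strips leading whitespace only
no_trailing = re.sub(r"\s+$", "", "  jack  ")   # strips trailing whitespace only
separators = re.findall(r"[\s,.]", "a, b.c")    # whitespace, ',' or '.'

names = pd.Series([" jack", "jill ", " jesse "])
stripped = names.str.replace(r"^\s+|\s+$", "", regex=True)

print(repr(no_leading))   # 'jack  '
print(repr(no_trailing))  # '  jack'
print(separators)         # [',', ' ', '.']
print(stripped.tolist())  # ['jack', 'jill', 'jesse']
```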

'*' specifies that the previous character can be matched zero or more times, instead of exactly once.

For example, ba*t will match 'bt' (0 'a' characters), 'bat' (1 'a'), 'baaat' (3 'a' characters).

'*' matches zero or more times, so whatever’s being repeated may not be present at all.

'+' requires at least one occurrence. 
ba+t will match 'bat' (1 'a'), 'baaat' (3 'a's), but won’t match 'bt'.

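{m} matches exactly m occurrences of the previous character, and {m,n} matches between m and n occurrences. For example, ba{3}t matches only 'baaat'.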
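
The repetition operators can be compared on the same three words. In this sketch, re.fullmatch is used so that the whole word has to match the pattern:

```python
import re

words = ["bt", "bat", "baaat"]

# '*' allows zero or more 'a' characters.
star = [w for w in words if re.fullmatch(r"ba*t", w)]
# '+' requires at least one 'a'.
plus = [w for w in words if re.fullmatch(r"ba+t", w)]
# '{3}' requires exactly three 'a' characters.
exactly_three = [w for w in words if re.fullmatch(r"ba{3}t", w)]

print(star)           # ['bt', 'bat', 'baaat']
print(plus)           # ['bat', 'baaat']
print(exactly_three)  # ['baaat']
```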

Anchors do not match characters. They match a position before, after, or between characters. They can be used to “anchor” the regex match at a certain position. 

The caret ^ matches the position before the first character in the string. Applying ^a to abc matches a. 

The question mark character, ?, matches either once or zero times; you can think of it as marking something as being optional. For example, home-?brew matches either 'homebrew' or 'home-brew'.
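
A quick sketch of the caret anchor and the optional marker, using the examples from the text plus one made-up counterexample for each:

```python
import re

# '^a' only matches an 'a' at the very start of the string.
starts_with_a = bool(re.search(r"^a", "abc"))
starts_with_b = bool(re.search(r"^b", "abc"))

# '-?' makes the hyphen optional, but allows at most one.
candidates = ["homebrew", "home-brew", "home--brew"]
brews = [w for w in candidates if re.fullmatch(r"home-?brew", w)]

print(starts_with_a)  # True
print(starts_with_b)  # False
print(brews)          # ['homebrew', 'home-brew']
```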

The regular expression "[A-Z][a-z]*" matches any sequence of letters that starts with an uppercase letter and is followed by zero or more lowercase letters. 

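The \ metacharacter is used to escape metacharacters so you can still match them literally in patterns; for example, if you need to match a [ or a \, you can precede it with a backslash to remove its special meaning: \[ or \\. In an ordinary (non-raw) Python string literal each backslash must itself be doubled, so these patterns are written "\\[" and "\\\\", or more readably as the raw strings r"\[" and r"\\".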
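
A small sketch of escaping in practice, on made-up input. Raw strings (the r prefix) keep the backslashes readable, and re.escape builds an escaped pattern from literal text when typing the backslashes by hand is error-prone:

```python
import re

brackets = re.findall(r"\[", "a[0] + b[1]")   # literal '[' characters
backslashes = re.findall(r"\\", r"C:\temp")   # literal backslash
literal_dot = re.findall(r"\.", "3.14")       # escaped: only the dot itself
any_char = re.findall(r".", "3.14")           # unescaped: every character

# re.escape adds the backslashes needed to match the text literally.
literal = re.escape("[a.b]")
matches_itself = bool(re.fullmatch(literal, "[a.b]"))

print(brackets)        # ['[', '[']
print(backslashes)     # ['\\']
print(literal_dot)     # ['.']
print(any_char)        # ['3', '.', '1', '4']
print(matches_itself)  # True
```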

#### Methods using regular expressions

There are several methods that accept regular expressions to examine the content of a string element, and follow some of the API conventions of Python's built-in re module:

    Method      Description
    match()     Call re.match() on each element, returning a boolean
    extract()   Call re.search() on each element, returning the captured groups of the first match as strings
    findall()   Call re.findall() on each element
    replace()   Replace occurrences of the pattern with some other string
    contains()  Call re.search() on each element, returning a boolean
    count()     Count occurrences of the pattern
    split()     Equivalent to str.split(), but accepts regexps; returns a list where the string has been split at each match
    rsplit()    Equivalent to str.rsplit(): split on a delimiter, working from the end of the string

The re module's own functions follow the same conventions: re.search() returns a Match object if there is a match anywhere in the string, and re.sub() replaces one or more matches with another string.
    
With these, you can do a wide range of useful operations. 
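
As a minimal sketch of a few of these methods, the block below rebuilds the monte Series from earlier so that it runs on its own. The particular patterns are chosen only for illustration:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# match() anchors the pattern at the start of each string.
is_terry = monte.str.match(r"Terry")
# contains() looks for the pattern anywhere in each string.
has_jo = monte.str.contains(r"[Jj]o", regex=True)
# count() counts non-overlapping occurrences (lowercase vowels here).
vowel_counts = monte.str.count(r"[aeiou]")
# replace() substitutes every match (runs of whitespace here).
underscored = monte.str.replace(r"\s+", "_", regex=True)

print(is_terry.tolist())      # [False, False, True, False, True, False]
print(has_jo.tolist())        # [False, True, False, False, True, False]
print(vowel_counts.tolist())  # [4, 4, 4, 2, 3, 5]
print(underscored.tolist()[:2])  # ['Graham_Chapman', 'John_Cleese']
```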


Example of extract: get the first name by asking for a contiguous group of letters at the beginning of each element:


In [None]:
monte.str.extract('([A-Za-z]+)', expand=False)

Find all names that start and end with a consonant, making use of the start-of-string (^) and end-of-string ($) regular expression characters:

In [None]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

In [None]:
text = "foo    bar\t baz  \tqux"
re.split(r'\s+', text)  # raw string avoids an invalid-escape warning for \s

Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings.

In [None]:
regex = re.compile(r'\s+')
regex.split(text)

Get a list of all substrings matching the regex:

In [None]:
regex.findall(text)

A regex pattern for identifying email addresses:

In [None]:
text = """Dave dave@google.com
Sue sue234@gmail.com
Mary mary@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [None]:
regex.findall(text)

In [None]:
import re


In [None]:
pd.Series(['foo', 'fuz', np.nan]).str.replace('f.', 'ba', regex=True)

In [None]:
Addresses = ['100 Baker Street',
                        '109 - 111 S. Wharfside Street',
                        '40-42 Parkway',
                        '25b-26 Sun Street',
                        '43a South Garden Walk',
                        '6/7 Marine Road',
                        '10 - 12 Acacia Ave',
                        '4513 3RD STREET CIRCLE WEST',
                        '0 1/2 Fifth Avenue',
                        '194-03 1/2 50th Avenue']

In [None]:
pd.Series(Addresses).str.title()

In [None]:
pd.Series(Addresses).str.replace('S.','South', regex=False)

In [None]:
for a in Addresses:
    m = re.findall(r'\d+', a)  # all runs of digits in the address
    print(m)


In [None]:
for a in Addresses:
    m=re.findall('[S]+[s,t,r,e]*', a)
    print(m)


In [None]:
for a in Addresses:
    m = re.split(r'\s+', a)  # split on runs of whitespace
    print(m)

In [None]:
for a in Addresses:
    m = re.split(r'\d+', a)  # split on runs of digits
    print(m)

### Example: Recipe Database
These vectorized string operations become most useful in the process of cleaning up messy, real-world data. Here I'll walk through an example of that, using an open recipe database compiled from various sources on the Web. Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.

In [None]:
r1=pd.read_json(r'C:\Courses\DSE Practicum\Data\recipe.json')

In [None]:
r1.shape

In [None]:
r1.columns

In [None]:
r1.iloc[0]

In [None]:
r1.ingredients.str.len().describe()

In [None]:
r1.name[np.argmax(r1.ingredients.str.len())]

In [None]:
r1.description.str.contains('[Bb]reakfast').sum()

In [None]:
r1.ingredients.str.contains('[Cc]reme').sum()

Find the positional indices of the recipes that list asparagus as an ingredient:

In [None]:
asparaguses=np.where(r1.ingredients.str.contains('[Aa]spar'))#.sum()
asparaguses

Get the names of the recipes with asparagus as an ingredient:

In [None]:
r1.name[asparaguses[0]]

In [None]:
nuts=np.where(r1.ingredients.str.contains('[Ww]alnut'))  # assumed intent: match "Walnut" or "walnut"


In [None]:
r1.name[nuts[0]]

In [None]:
s=r1.loc[457,'ingredients']
#m=re.split('\d+[cup]', s)
print(s)
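
The goal stated at the start of this example, finding a recipe from ingredients on hand, can be sketched with the same str.contains calls. The block below is an illustration under stated assumptions, not the notebook's own method: toy is a hypothetical three-row stand-in for r1 (the real file is not reproduced here), has_all is a hypothetical helper, and the ingredients column is assumed to hold one free-text string per recipe.

```python
import pandas as pd

# Hypothetical stand-in for r1, with the same two column names used above.
toy = pd.DataFrame({
    "name": ["Green omelette", "Walnut salad", "Plain toast"],
    "ingredients": ["2 eggs\n1 bunch asparagus\nsalt",
                    "1 cup walnuts\nlettuce\nsalt",
                    "2 slices bread\nbutter"],
})

def has_all(frame, wanted):
    """Return the names of recipes whose ingredient text mentions every wanted item."""
    mask = pd.Series(True, index=frame.index)
    for item in wanted:
        # Literal, case-insensitive substring test; missing values count as no match.
        mask &= frame["ingredients"].str.contains(item, case=False, regex=False, na=False)
    return frame.loc[mask, "name"]

print(has_all(toy, ["asparagus", "salt"]).tolist())  # ['Green omelette']
print(has_all(toy, ["salt"]).tolist())               # ['Green omelette', 'Walnut salad']
```

Plain substring matching is a deliberate simplification: searching for 'nut' would also match 'walnuts', so a fuller solution would first parse each recipe into a clean ingredient list.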