In [1]:
import numpy as np
import pandas as pd

# Vectorized String Operations 

One benefit of working with Python and Pandas is that together they simplify complicated string operations applied across many values.

## Introduction to Pandas String Operations

We saw how NumPy arrays generalize arithmetic operations so that they apply to every element of an array.

In [3]:
x=np.array([2,3,5,7,11,13])
x*2

array([ 4,  6, 10, 14, 22, 26])

This applies the operation to every element at once (the scalar is broadcast across the array), as studied in detail in the previous NumPy section. NumPy does not offer such simple access for arrays of strings, so the same task typically needs more verbose plain Python, such as a list comprehension.

In [5]:
data=['kali', 'Meeta', 'HARSH', 'mANUJ']
[s.capitalize() for s in data]

['Kali', 'Meeta', 'Harsh', 'Manuj']

This is sufficient in simple cases, but it fails when the list contains missing values such as `None`.

In [6]:
data=['kali', 'Meeta', None ,'HARSH', 'mANUJ']
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'

Pandas' vectorized string operations handle such problems through the `str` attribute of a Pandas `Series` or `Index` object, which skips over missing values instead of raising an error.

In [8]:
names=pd.Series(data)
names

0     kali
1    Meeta
2     None
3    HARSH
4    mANUJ
dtype: object

In [9]:
names.str.capitalize()

0     Kali
1    Meeta
2     None
3    Harsh
4    Manuj
dtype: object
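The same accessor is also available on an `Index`. Here is a minimal sketch, assuming a reasonably recent Pandas version; the names `names_with_gap` and `labels` are illustrative only and not part of the original notebook.

```python
import pandas as pd

# Hypothetical data, mirroring the list used above.
names_with_gap = pd.Series(['kali', 'Meeta', None, 'HARSH', 'mANUJ'])

# The missing entry is passed through instead of raising AttributeError.
capitalized = names_with_gap.str.capitalize()
print(capitalized.isna().tolist())

# The .str accessor works on an Index as well as on a Series.
labels = pd.Index(['kali', 'Meeta', 'HARSH'])
print(labels.str.capitalize().tolist())
```

Only the entry that was missing in the input is missing in the result; every other element is capitalized as usual.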

## Pandas String Methods

If you know Python's built-in string methods, most of the Pandas string methods will look familiar from their names alone.

We'll use the `Series` below to demonstrate the usage of these string operations.

In [11]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam','Eric Idle', 'Terry Jones', 'Michael Palin'])


### Methods similar to default Python string methods

Here's a list of Pandas `str` methods that mirror Python string methods

![image.png](attachment:image.png)

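Notice that these methods can return different types. Compare the outputs of the two cells below: `lower()` returns strings, `len()` returns integers, `startswith()` returns Booleans, and `split()` returns a list for each element.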

In [13]:
monte.str.lower(), monte.str.len(), monte.str.startswith('T')

(0    graham chapman
 1       john cleese
 2     terry gilliam
 3         eric idle
 4       terry jones
 5     michael palin
 dtype: object,
 0    14
 1    11
 2    13
 3     9
 4    11
 5    13
 dtype: int64,
 0    False
 1    False
 2     True
 3    False
 4     True
 5    False
 dtype: bool)

In [15]:
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
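The differing return types can be checked directly. The sketch below uses a shortened, hypothetical copy of the data (`monte_demo`) and also shows the `expand=True` option of `str.split`, which spreads the pieces across `DataFrame` columns instead of returning lists.

```python
import pandas as pd

# Hypothetical, shortened copy of the series used above.
monte_demo = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])

lengths = monte_demo.str.len()              # one integer per element
starts_t = monte_demo.str.startswith('T')   # one Boolean per element
parts = monte_demo.str.split()              # one Python list per element

print(lengths.tolist())
print(starts_t.tolist())
print(parts.tolist())

# expand=True returns a DataFrame with one column per piece.
frame = monte_demo.str.split(expand=True)
print(frame.shape)
```

Whether to keep the lists or expand them into columns depends on what comes next: columns are usually easier to filter or join on, while lists handle a varying number of pieces without padding.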

### Methods using regular expressions

The built-in module `re` helps in implementing multiple string operations. These operations can also be performed on Pandas `Series` and `Index` objects using the following methods

![image.png](attachment:image.png)

In [18]:
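# '([A-Za-z]+)' is a regular expression (regex) pattern; the group captures the first run of letters in each name.
monte.str.extract('([A-Za-z]+)', expand=False)
# This pattern matches names that start with anything other than an uppercase vowel
# and end with anything other than a lowercase vowel. Only this last result is displayed.
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')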

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object

This ability to apply regex operations speeds up data cleaning and analysis.
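As a rough illustration of regex-based cleaning, the sketch below normalizes some deliberately messy, made-up entries. The series `raw` and the patterns are assumptions for demonstration and are not taken from the original notebook.

```python
import pandas as pd

# Hypothetical messy entries, including a missing value.
raw = pd.Series(['  Graham   Chapman ', 'john  cleese', None])

# Collapse runs of whitespace, trim the ends, then title-case each name.
cleaned = (
    raw.str.replace(r'\s+', ' ', regex=True)
       .str.strip()
       .str.title()
)
print(cleaned.tolist()[:2])

# Capture the first run of letters (the first name) from each entry.
first = cleaned.str.extract(r'([A-Za-z]+)', expand=False)
print(first.tolist()[:2])

# str.contains treats the pattern as a regex by default;
# na=False counts missing entries as non-matches.
has_c_surname = cleaned.str.contains(r'\bC\w+$', na=False)
print(has_c_surname.tolist())
```

Chaining the `str` calls keeps each cleaning step visible, and the missing entry flows through every step without special handling.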

### Some more methods

![image.png](attachment:image.png)

#### Vectorized item accessing and slicing

The get() and the slice() operations, in particular,