___

<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>
___
<center><em>Copyright by Pierian Data Inc.</em></center>
<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>

# Text Methods

A regular Python string has a variety of built-in methods available:

In [1]:
email = 'harrison@email.com'

In [2]:
email.split('@')

['harrison', 'email.com']

In [7]:
email.isdigit()

False

In [8]:
'5'.isdigit()

True

# Pandas and Text

Pandas can do a lot more than what we show here. Full online documentation on things like advanced string indexing and regular expressions with pandas can be found here: https://pandas.pydata.org/docs/user_guide/text.html

## Text Methods on Pandas String Column

In [3]:
import pandas as pd

In [4]:
names = pd.Series(['andrew', 'bobo', 'claire', 'david', '5'])
names

0    andrew
1      bobo
2    claire
3     david
4         5
dtype: object

In [5]:
names.str.upper()

0    ANDREW
1      BOBO
2    CLAIRE
3     DAVID
4         5
dtype: object

In [6]:
names

0    andrew
1      bobo
2    claire
3     david
4         5
dtype: object

In [9]:
names.str.isdigit()

0    False
1    False
2    False
3    False
4     True
dtype: bool

## Splitting, Grabbing, and Expanding

In [14]:
tech_finance = ['GOOG,APPL,AMZN', 'JPM,BAC,GS']

In [15]:
len(tech_finance)

2

In [None]:
tech = 'GOOG,APPL,AMZN'
tech.split(',')

['GOOG', 'APPL', 'AMZN']

In [None]:
tech.split(',')[0]

'GOOG'

In [16]:
tickers = pd.Series(tech_finance)
tickers

0    GOOG,APPL,AMZN
1        JPM,BAC,GS
dtype: object

In [13]:
tickers.str.split(',')

0    [GOOG, APPL, AMZN]
1        [JPM, BAC, GS]
dtype: object

In [20]:
tickers.str.split(',').str[0]

0    GOOG
1     JPM
dtype: object

In [21]:
tickers.str.split(',', expand=True)

      0     1     2
0  GOOG  APPL  AMZN
1   JPM   BAC    GS
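As a small extension of the split above, the sketch below limits the number of splits with the `n` parameter and gives the expanded columns readable names. The column names `first`, `second`, and `third` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

tickers = pd.Series(['GOOG,APPL,AMZN', 'JPM,BAC,GS'])

# n=1 splits only at the first comma, leaving the remainder in one piece
first_and_rest = tickers.str.split(',', n=1, expand=True)

# expand=True returns a DataFrame, so its columns can be renamed
# (these column names are hypothetical, chosen for illustration)
expanded = tickers.str.split(',', expand=True)
expanded.columns = ['first', 'second', 'third']

print(first_and_rest)
print(expanded)
```

Because `expand=True` returns a DataFrame rather than a Series of lists, each piece can then be selected by column, for example `expanded['first']`.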


## Cleaning or Editing Strings

In [22]:
# Notice the misalignment on the right-hand side due to the extra spaces in 'andrew   ' and '   claire   '
messy_names = pd.Series(['andrew   ', 'bo;bo', '   claire   '])
messy_names

0       andrew   
1           bo;bo
2       claire   
dtype: object

In [23]:
messy_names[0]

'andrew   '

In [24]:
messy_names.str.replace(';', '')

0       andrew   
1            bobo
2       claire   
dtype: object

In [25]:
messy_names.str.replace(';', '').str.strip()

0    andrew
1      bobo
2    claire
dtype: object

In [26]:
messy_names.str.replace(';', '').str.strip()[0]

'andrew'

In [27]:
messy_names.str.replace(';', '').str.strip().str.capitalize()

0    Andrew
1      Bobo
2    Claire
dtype: object

## Alternative with Custom apply() call

In [29]:
def clean_up(name):
    name = name.replace(";","")
    name = name.strip()
    name = name.capitalize()
    return name

In [30]:
messy_names.apply(clean_up)

0    Andrew
1      Bobo
2    Claire
dtype: object
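One practical difference between the two approaches shows up when the data contains missing values. The sketch below assumes a hypothetical Series, `names_with_gap`, with one missing entry. The `.str` methods pass the missing entry through, while the plain function fails on it unless it is guarded.

```python
import pandas as pd

def clean_up(name):
    name = name.replace(";", "")
    name = name.strip()
    name = name.capitalize()
    return name

# Hypothetical data: the same kind of messy names plus one missing entry
names_with_gap = pd.Series(['andrew   ', None, 'bo;bo'])

# The .str methods skip the missing entry and leave it missing
via_str = names_with_gap.str.replace(';', '').str.strip().str.capitalize()
print(via_str)

# The plain function receives the missing value itself, which has no
# .replace method, so apply() raises an AttributeError
apply_failed = False
try:
    names_with_gap.apply(clean_up)
except AttributeError as err:
    apply_failed = True
    print('apply failed:', err)

# One possible workaround: only clean values that are actual strings
via_apply = names_with_gap.apply(
    lambda value: clean_up(value) if isinstance(value, str) else value
)
print(via_apply)
```

The `isinstance` guard is only one option. Dropping or filling missing values before calling `apply()` would work as well.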

## Which one is more efficient?

In [43]:
import timeit 
  
# code snippet to be executed only once 
setup = '''
import pandas as pd
import numpy as np
messy_names = pd.Series(["andrew  ","bo;bo","  claire  "])
def cleanup(name):
    name = name.replace(";","")
    name = name.strip()
    name = name.capitalize()
    return name
'''
  
# code snippet whose execution time is to be measured 
stmt_pandas_str = ''' 
messy_names.str.replace(";","").str.strip().str.capitalize()
'''

stmt_pandas_apply = '''
messy_names.apply(cleanup)
'''

stmt_pandas_vectorize='''
np.vectorize(cleanup)(messy_names)
'''

In [44]:
timeit.timeit(setup=setup, stmt=stmt_pandas_str, number=10000)

3.931618999999955

In [45]:
timeit.timeit(setup=setup, stmt=stmt_pandas_apply, number=10000)

1.2268500999999787

In [46]:
timeit.timeit(setup=setup, stmt=stmt_pandas_vectorize, number=10000)

0.28283379999993485

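Wow! While the .str accessor methods can be extremely convenient, when it comes to performance, don't forget about np.vectorize()! In this benchmark on a three-element Series, apply() ran roughly 3x faster and np.vectorize() roughly 14x faster than the chained .str methods. Review the "Useful Methods" lecture for a deeper discussion of np.vectorize().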
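The timings above come from a three-element Series, where fixed per-call overhead likely accounts for much of the measured time, so the ranking may not carry over unchanged to larger data. As a minimal sketch for checking this on your own machine, the code below repeats the comparison on a small and a larger Series. The `time_methods` helper and the 30,000-row size are assumptions made for illustration, and no particular result is implied.

```python
import timeit

import numpy as np
import pandas as pd

def cleanup(name):
    name = name.replace(";", "")
    name = name.strip()
    name = name.capitalize()
    return name

def time_methods(series, number=5):
    """Return the seconds each approach takes for `number` runs."""
    candidates = {
        'str': lambda: series.str.replace(';', '').str.strip().str.capitalize(),
        'apply': lambda: series.apply(cleanup),
        'vectorize': lambda: np.vectorize(cleanup)(series),
    }
    return {label: timeit.timeit(fn, number=number)
            for label, fn in candidates.items()}

base = ['andrew  ', 'bo;bo', '  claire  ']
small = pd.Series(base)
large = pd.Series(base * 10_000)  # 30,000 rows; the size is an arbitrary choice

for label, series in [('small', small), ('large', large)]:
    print(label, time_methods(series))
```

Note that np.vectorize() returns a NumPy array rather than a Series, so wrap the result in pd.Series() if you need the index back.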