## Vectorized String Operations

In [9]:
import pandas as pd
import numpy as np

df = pd.Series(['Goutham', 'Sakshitha', 'Bablu', 'Abhishek', 'Anand', np.nan, 'Praveena'])
df

0      Goutham
1    Sakshitha
2        Bablu
3     Abhishek
4        Anand
5          NaN
6     Praveena
dtype: object

#### Create a String Series with `dtype='string'`

In [59]:
import pandas as pd
import numpy as np

df = pd.Series(['Goutham', 'Sakshitha', 'Bablu', 'Abhishek', 'Anand', np.nan, 'Praveena'], dtype='string')

print(df)

0      Goutham
1    Sakshitha
2        Bablu
3     Abhishek
4        Anand
5         <NA>
6     Praveena
dtype: string


#### Create the Same Series with `dtype=pd.StringDtype()`

In [63]:
import pandas as pd
import numpy as np

df = pd.Series(['Goutham', 'Sakshitha', 'Bablu', 'Abhishek', 'Anand', np.nan, 'Praveena'], dtype=pd.StringDtype())
df

0      Goutham
1    Sakshitha
2        Bablu
3     Abhishek
4        Anand
5         <NA>
6     Praveena
dtype: string
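The outputs above differ only in how the missing entry is shown (`NaN` versus `<NA>`), so a small comparison may help make the practical difference visible. The sketch below is illustrative only: the names `as_object` and `as_string` are made up for this example, and the dtypes noted in the comments are what recent pandas versions are expected to report.

```python
import numpy as np
import pandas as pd

names = ['Goutham', 'Sakshitha', np.nan]

# The same data stored two ways: generic Python objects vs. the dedicated string dtype.
as_object = pd.Series(names, dtype=object)
as_string = pd.Series(names, dtype='string')

# Both Series report the third entry as missing.
print(as_object.isna().tolist())
print(as_string.isna().tolist())

# With object dtype the missing value forces a float result;
# with the string dtype the result is a nullable integer.
print(as_object.str.len().dtype)  # float64
print(as_string.str.len().dtype)  # Int64
```

One reason to prefer the string dtype is that integer results such as lengths stay integers even when some values are missing.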

### String Manipulations in Pandas

In [125]:
import pandas as pd
import numpy as np

df = pd.Series(['Goutham Karthik',  'Bablu Bal', 'Abhishek   ', 'Anand', np.nan, 'Praveena     '])
df

0    Goutham Karthik
1          Bablu Bal
2        Abhishek   
3              Anand
4                NaN
5      Praveena     
dtype: object

* **lower()**: Converts every uppercase character in each string of the Series to lowercase and returns the result.

In [181]:
print(df.str.lower())

0    goutham karthik
1          bablu bal
2        abhishek   
3              anand
4                NaN
5      praveena     
dtype: object


* **replace()**: Replaces each occurrence of a substring (or regular-expression pattern) in every string of the Series with the given replacement.

In [184]:
df_replaced = df.str.replace('Bablu', 'T.')
print(df_replaced)

0    Goutham Karthik
1             T. Bal
2        Abhishek   
3              Anand
4                NaN
5      Praveena     
dtype: object
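The `str.replace` call shown earlier substitutes a plain substring. As a hedged sketch of the other mode, the example below passes `regex=True` to collapse runs of whitespace and `regex=False` to treat a dot literally. The sample Series `names` is invented for this illustration, and the `regex` argument is spelled out because its default has changed between pandas versions.

```python
import pandas as pd

# Hypothetical sample data, not taken from the notebook above.
names = pd.Series(['Goutham   Karthik', 'Bablu Bal', 'T.Bal'])

# Literal replacement: '.' is treated as an ordinary character.
literal = names.str.replace('.', '_', regex=False)

# Regex replacement: any run of whitespace becomes a single space.
collapsed = names.str.replace(r'\s+', ' ', regex=True)

print(literal.tolist())
print(collapsed.tolist())
```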


* **upper()**: Converts every lowercase character in each string of the Series to uppercase and returns the result.

In [174]:
print(df.str.upper())

0    GOUTHAM KARTHIK
1          BABLU BAL
2        ABHISHEK   
3              ANAND
4                NaN
5      PRAVEENA     
dtype: object
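Besides `lower()` and `upper()`, the `.str` accessor offers a few other case helpers. The following is a small sketch on invented sample data (`mixed` is a made-up name) showing `title()`, `capitalize()` and `swapcase()`.

```python
import pandas as pd

# Hypothetical input with inconsistent casing.
mixed = pd.Series(['goutham KARTHIK', 'bablu bal'])

titled = mixed.str.title()            # first letter of every word uppercased
capitalized = mixed.str.capitalize()  # only the first character uppercased
swapped = mixed.str.swapcase()        # upper and lower case exchanged

print(titled.tolist())
print(capitalized.tolist())
print(swapped.tolist())
```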


* **strip()**: Removes leading and trailing whitespace from each string in the Series. Spaces inside a string are left untouched.

In [133]:
print(df)
print('\nAfter using the strip:')
print(df.str.strip())

0    Goutham Karthik
1          Bablu Bal
2        Abhishek   
3              Anand
4                NaN
5      Praveena     
dtype: object

After using the strip:
0    Goutham Karthik
1          Bablu Bal
2           Abhishek
3              Anand
4                NaN
5           Praveena
dtype: object
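When only one side should be trimmed, or when the unwanted characters are not whitespace, the related methods `lstrip()`, `rstrip()` and `strip(chars)` can be used. A minimal sketch, assuming an invented Series called `padded`:

```python
import pandas as pd

# Hypothetical values with padding on different sides.
padded = pd.Series(['  Abhishek   ', 'Anand', '--Praveena--'])

left_trimmed = padded.str.lstrip()       # remove leading whitespace only
right_trimmed = padded.str.rstrip()      # remove trailing whitespace only
dash_trimmed = padded.str.strip('-')     # remove leading/trailing dashes instead of whitespace

print(left_trimmed.tolist())
print(right_trimmed.tolist())
print(dash_trimmed.tolist())
```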


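* **split(pat)**: Splits each string on the given pattern (whitespace by default) and stores the resulting pieces in a list. None of the names below contains a comma, so `split(',')` leaves each string as a one-element list and `.str.get(1)` returns NaN for every row.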

In [136]:
print(df)
print('\n After using the split:')
print(df.str.split(','))

print('\n using []:')
print(df.str.split(',').str[0].str.strip())

print('\n using get():')
print(df.str.split(',').str.get(1))


0    Goutham Karthik
1          Bablu Bal
2        Abhishek   
3              Anand
4                NaN
5      Praveena     
dtype: object

 After using the split:
0    [Goutham Karthik]
1          [Bablu Bal]
2        [Abhishek   ]
3              [Anand]
4                  NaN
5      [Praveena     ]
dtype: object

 using []:
0    Goutham Karthik
1          Bablu Bal
2           Abhishek
3              Anand
4                NaN
5           Praveena
dtype: object

 using get():
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
dtype: float64
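Instead of indexing into the lists with `str[]` or `get()`, the pieces can be spread directly into columns with `expand=True`. The sketch below splits on a space at most once (`n=1`); the Series `full_names` and the column labels `first` and `last` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample names, including one without a second word and one missing value.
full_names = pd.Series(['Goutham Karthik', 'Bablu Bal', 'Anand', np.nan])

# expand=True returns a DataFrame with one column per piece;
# n=1 limits the operation to a single split per string.
parts = full_names.str.split(' ', n=1, expand=True)
parts.columns = ['first', 'last']

print(parts)
```

Rows that have no second piece, or that were missing to begin with, end up with missing values in the corresponding columns.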


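* **len()**: Computes the length of each string in the Series, counting leading and trailing spaces. Missing values yield NaN, which is why the result has dtype float64. Note that the built-in `len(df)` returns the number of elements in the Series instead.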

In [139]:
print("length of the dataframe: ", len(df))
print(df)
print("length of each value of dataframe:")
print(df.str.len())


length of the dataframe:  6
0    Goutham Karthik
1          Bablu Bal
2        Abhishek   
3              Anand
4                NaN
5      Praveena     
dtype: object
length of each value of dataframe:
0    15.0
1     9.0
2    11.0
3     5.0
4     NaN
5    13.0
dtype: float64
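Because trailing spaces count toward the length, it can be useful to strip before measuring and then use the lengths as a filter. This is a small sketch with invented data; `clean_len` and `long_names` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Hypothetical values with trailing spaces and a missing entry.
names = pd.Series(['Abhishek   ', 'Anand', np.nan, 'Praveena     '])

raw_len = names.str.len()                # counts the trailing spaces
clean_len = names.str.strip().str.len()  # counts only the visible characters

# Comparisons against a missing length evaluate to False, so that row is dropped.
long_names = names[clean_len > 5]

print(raw_len.tolist())
print(clean_len.tolist())
print(long_names.index.tolist())
```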


* **cat(sep=' ')**: When no other Series is passed, concatenates all strings in the Series into a single string, joined by the given separator. Missing values are skipped unless `na_rep` supplies a replacement.

In [142]:
print(df)

print("\n after using cat:")
print(df.str.cat(sep='_'))

print("\n working with NaN using cat:")
print(df.str.cat(sep='_', na_rep='#'))


0    Goutham Karthik
1          Bablu Bal
2        Abhishek   
3              Anand
4                NaN
5      Praveena     
dtype: object

 after using cat:
Goutham Karthik_Bablu Bal_Abhishek   _Anand_Praveena     

 working with NaN using cat:
Goutham Karthik_Bablu Bal_Abhishek   _Anand_#_Praveena     
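`cat()` can also combine two Series element by element when a second Series is passed as the first argument. A minimal sketch, assuming two invented Series `first` and `last` with matching indexes:

```python
import numpy as np
import pandas as pd

# Hypothetical first and last names; one first name is missing.
first = pd.Series(['Goutham', 'Bablu', np.nan])
last = pd.Series(['Karthik', 'Bal', 'Anand'])

# Element-wise concatenation: a missing value on either side gives a missing result.
full = first.str.cat(last, sep=' ')

# na_rep substitutes a placeholder for missing values before joining.
filled = first.str.cat(last, sep=' ', na_rep='?')

print(full.tolist())
print(filled.tolist())
```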


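* **startswith() / endswith()**: Test whether each string begins or ends with the given text and return a boolean for each element. Missing values stay NaN.

In [144]: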
df.str.startswith('A')


0    False
1    False
2     True
3     True
4      NaN
5    False
dtype: object

In [146]:
df.str.endswith('a')



0    False
1    False
2    False
3    False
4      NaN
5    False
dtype: object
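In the `endswith('a')` result above, `'Praveena     '` is reported as `False` because the string actually ends with spaces. The sketch below, on a shortened invented Series, strips first and also passes `na=False` so that missing values produce `False` rather than NaN, which makes the result usable as a boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the names, keeping the trailing spaces.
names = pd.Series(['Anand', np.nan, 'Praveena     '])

# Without stripping, the trailing spaces hide the final 'a'.
raw_match = names.str.endswith('a', na=False)

# Stripping first gives the intuitive answer.
clean_match = names.str.strip().str.endswith('a', na=False)

print(raw_match.tolist())
print(clean_match.tolist())
```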

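Called without arguments, `split()` splits on any run of whitespace and discards leading and trailing spaces:

In [148]: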
df.str.split()

0    [Goutham, Karthik]
1          [Bablu, Bal]
2            [Abhishek]
3               [Anand]
4                   NaN
5            [Praveena]
dtype: object

### Miscellaneous methods


| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element |
| ``slice_replace()`` | Replace slice in each element with passed value |
| ``cat()`` | Concatenate strings |
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return Unicode normal form of each string |
| ``pad()`` | Add whitespace to left, right, or both sides of strings |
| ``wrap()`` | Split long strings into lines with length less than a given width |
| ``join()`` | Join strings in each element of the Series with passed separator |
| ``get_dummies()`` | Extract dummy variables as a DataFrame |

In [152]:
df.str[0:3]

0    Gou
1    Bab
2    Abh
3    Ana
4    NaN
5    Pra
dtype: object

In [154]:
df.str.split().str.get(-1)

0     Karthik
1         Bal
2    Abhishek
3       Anand
4         NaN
5    Praveena
dtype: object
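Alongside the slicing and `get()` calls above, several other methods from the table can be tried on a small Series. The following sketch uses invented data (`short` and `lists` are made-up names) and covers `slice()`, `slice_replace()`, `repeat()`, `pad()` and `join()`.

```python
import pandas as pd

# Hypothetical two-element Series for quick experiments.
short = pd.Series(['Anand', 'Bablu'])

sliced = short.str.slice(0, 3)                       # same as short.str[0:3]
replaced = short.str.slice_replace(0, 3, 'X')        # swap the first three characters for 'X'
repeated = short.str.repeat(2)                       # duplicate each string
padded = short.str.pad(7, side='left', fillchar='*') # left-pad to width 7

# join() works on a Series whose elements are lists of strings.
lists = pd.Series([['a', 'b'], ['c']])
joined = lists.str.join('-')

for result in (sliced, replaced, repeated, padded, joined):
    print(result.tolist())
```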

In [156]:
df1 = pd.DataFrame({'name': df,
                    'info': ['B|C|D', 'B|D', 'A|C',
                             'B|D', 'B|C', 'B|C|D']})
df1


              name   info
0  Goutham Karthik  B|C|D
1        Bablu Bal    B|D
2         Abhishek    A|C
3            Anand    B|D
4              NaN    B|C
5         Praveena  B|C|D


In [159]:
df1['info'].str.get_dummies('|')

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
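The indicator columns produced by `get_dummies()` are usually most useful when attached back to the original rows. Below is a minimal sketch of one way to do that with `DataFrame.join`; the frame `people` is a shortened, invented version of the data above, and `flags` and `combined` are illustrative names.

```python
import pandas as pd

# Hypothetical, shortened version of the name/info table.
people = pd.DataFrame({'name': ['Goutham', 'Bablu', 'Abhishek'],
                       'info': ['B|C|D', 'B|D', 'A|C']})

# One indicator column per code found in the '|'-separated strings.
flags = people['info'].str.get_dummies('|')

# Attach the indicators to the names, dropping the raw 'info' column.
combined = people.drop(columns='info').join(flags)

# Example query: everyone whose info string contains code 'C'.
has_c = combined.loc[combined['C'] == 1, 'name']

print(combined)
print(has_c.tolist())
```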
