
This notebook covers **vectorized string operations**, which are useful whenever a dataset contains text, for example:

* A movies dataset
* A customer review dataset
* WhatsApp chat analysis

In [None]:
import pandas as pd
import numpy as np

Pandas lets us apply vectorized operations to strings, but first:

**1. What are vectorized operations?**

A vectorized operation is one that is applied to a whole vector (a collection of values) at once, rather than to one element at a time.

Let's make a NumPy array:

In [None]:
a = np.array([1,2,3,4,5,6,7])
a

array([1, 2, 3, 4, 5, 6, 7])

Now if we compute a*4, every item of the array is multiplied by 4:

In [None]:
a*4

array([ 4,  8, 12, 16, 20, 24, 28])

This is an example of a vectorized operation; here the vector was the array.

The topic of this notebook is vectorized string operations, where the vector is a collection of strings.

**In vectorized string operations, we have a DataFrame column whose elements are all strings, and a single operation is applied to every string in that column.**
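As a minimal sketch of that idea, the snippet below applies one string method to a whole column. The DataFrame `reviews` and its contents are hypothetical toy data, not part of any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the idea.
reviews = pd.DataFrame({'review': ['Good', 'bad', 'Average']})

# A single call upper-cases every string in the column.
shouted = reviews['review'].str.upper()
print(shouted.tolist())
```

The `.str` accessor plays the same role for strings that arithmetic operators play for a numeric NumPy array.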

**2. The problem with vectorized string operations in vanilla Python**

Pandas has vectorized string operations, but vanilla Python does not, and missing values make the manual approach fragile.

In [None]:
s = ['cat','mat',None,'rat']

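Suppose we want to flag the items that start with 'c'. The list comprehension below is commented out because calling startswith on None raises an AttributeError: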

In [None]:
# [i.startswith('c') for i in s]
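The following sketch shows the failure and one possible manual workaround in plain Python. The names `result` and `safe_result` are illustrative only.

```python
s = ['cat', 'mat', None, 'rat']

# None has no startswith method, so the comprehension fails
# as soon as it reaches the third item.
try:
    result = [i.startswith('c') for i in s]
except AttributeError as err:
    result = None
    print('Failed:', err)

# Manual workaround: test every item for None by hand.
safe_result = [i.startswith('c') if i is not None else None for i in s]
print(safe_result)
```

Every string method needs the same kind of guard, which is the bookkeeping that pandas takes over.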

**3. How does pandas solve this issue?**

In [None]:
s = pd.Series(['cat','mat',None,'rat'])

In [None]:
s.str.startswith('c')

0     True
1    False
2     None
3    False
dtype: object

*The missing value is passed through instead of raising an error, and the whole Series is processed in a single call.*
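A result that still contains a missing value often cannot be used directly for boolean indexing (the exact behavior depends on the pandas version). As a minimal sketch, the `na` parameter of `str.startswith` fills the missing entry so the result becomes a plain boolean mask; the variable name `mask` is only illustrative.

```python
import pandas as pd

s = pd.Series(['cat', 'mat', None, 'rat'])

# na=False treats the missing value as "does not start with 'c'".
mask = s.str.startswith('c', na=False)

# The mask now holds only True/False and can filter the Series.
print(s[mask].tolist())
```

Whether a missing value should count as False is a judgment call that depends on the analysis.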

**4. String Operations through Pandas**

In [None]:
# load the Titanic dataset
df = pd.read_csv('titanic.csv')
df.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


In [None]:
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

**Some common functions**

In [None]:
#lower
df['Name'].str.lower()

0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                 heikkinen, miss. laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
                             ...                        
886                                montvila, rev. juozas
887                         graham, miss. margaret edith
888             johnston, miss. catherine helen "carrie"
889                                behr, mr. karl howell
890                                  dooley, mr. patrick
Name: Name, Length: 891, dtype: object

In [None]:
#Upper
df['Name'].str.upper()

0                                BRAUND, MR. OWEN HARRIS
1      CUMINGS, MRS. JOHN BRADLEY (FLORENCE BRIGGS TH...
2                                 HEIKKINEN, MISS. LAINA
3           FUTRELLE, MRS. JACQUES HEATH (LILY MAY PEEL)
4                               ALLEN, MR. WILLIAM HENRY
                             ...                        
886                                MONTVILA, REV. JUOZAS
887                         GRAHAM, MISS. MARGARET EDITH
888             JOHNSTON, MISS. CATHERINE HELEN "CARRIE"
889                                BEHR, MR. KARL HOWELL
890                                  DOOLEY, MR. PATRICK
Name: Name, Length: 891, dtype: object

In [None]:
#Capitalize
df['Name'].str.capitalize()

0                                Braund, mr. owen harris
1      Cumings, mrs. john bradley (florence briggs th...
2                                 Heikkinen, miss. laina
3           Futrelle, mrs. jacques heath (lily may peel)
4                               Allen, mr. william henry
                             ...                        
886                                Montvila, rev. juozas
887                         Graham, miss. margaret edith
888             Johnston, miss. catherine helen "carrie"
889                                Behr, mr. karl howell
890                                  Dooley, mr. patrick
Name: Name, Length: 891, dtype: object

In [None]:
#Title
df['Name'].str.title()

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
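The four case methods are easier to tell apart on a single mixed-case value. This is a small sketch on a hypothetical sample string, not on the Titanic data.

```python
import pandas as pd

# Hypothetical sample value chosen to make the differences visible.
s = pd.Series(['owen HARRIS'])

lowered = s.str.lower()[0]            # every character lower-cased
uppered = s.str.upper()[0]            # every character upper-cased
capitalized = s.str.capitalize()[0]   # first character upper, the rest lower
titled = s.str.title()[0]             # first character of each word upper

print(lowered, uppered, capitalized, titled, sep=' | ')
```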

In [None]:
#len - find the longest name (its length is 82 characters)
df['Name'][df['Name'].str.len() == 82].values[0]

'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'
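The length does not have to be known in advance. As a sketch on a three-name sample (the Series `names` is an assumption made for illustration), `str.len` combined with `idxmax` locates the longest entry:

```python
import pandas as pd

# Small sample of names in the same "Surname, Title First Name" layout.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Allen, Mr. William Henry',
])

lengths = names.str.len()              # length of every string
longest = names.loc[lengths.idxmax()]  # label of the maximum, then look it up

print(lengths.max(), longest)
```

If several names share the maximum length, `idxmax` returns only the first one.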

In [None]:
#strip
df['Name'].str.strip()

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
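The Titanic names carry no surrounding whitespace, so the output above looks unchanged. A sketch on hypothetical padded values makes the effect of `strip` and its one-sided variants visible:

```python
import pandas as pd

# Hypothetical values with leading/trailing whitespace.
padded = pd.Series(['  Owen ', 'Laina\n', 'Karl'])

stripped = padded.str.strip().tolist()     # both ends
left_only = padded.str.lstrip().tolist()   # leading whitespace only
right_only = padded.str.rstrip().tolist()  # trailing whitespace only

print(stripped)
```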

In [None]:
#split -> get
#Let's build new columns from Name: Surname, Title, First Name
df['Surname'] = df['Name'].str.strip().str.split(',').str.get(0)

In [None]:
df.head(1)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Braund


In [None]:
df[['Title','First Name']]=df['Name'].str.split(',').str.get(1).str.strip().str.split(' ',n=1,expand = True)

Here n=1 limits the split to the first space, so each string is divided into at most two parts, and expand=True returns those parts as separate DataFrame columns.
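The effect of `n` and `expand` can be sketched on a few values. The Series `rest` is a hypothetical stand-in for the part of the name that follows the comma.

```python
import pandas as pd

# Hypothetical stand-in for the text after the comma in Name.
rest = pd.Series(['Mr. Owen Harris', 'Mrs. John Bradley', 'Miss. Laina'])

# Without n, every space is a split point and each row holds a list.
all_parts = rest.str.split(' ')

# n=1 stops after the first space; expand=True yields two columns.
parts = rest.str.split(' ', n=1, expand=True)
parts.columns = ['Title', 'First Name']

print(parts)
```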

Now we can count how often each title occurs

In [None]:
df['Title'].value_counts()

Title
Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Mlle.          2
Major.         2
Col.           2
the            1
Capt.          1
Ms.            1
Sir.           1
Lady.          1
Mme.           1
Don.           1
Jonkheer.      1
Name: count, dtype: int64

In [None]:
#Replace (regex=False treats 'Ms.' as literal text, so '.' is not a wildcard)
df['Title']= df['Title'].str.replace('Ms.','Miss.', regex=False)

In [None]:
df['Title'].value_counts()

Title
Mr.          517
Miss.        183
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Major.         2
Mlle.          2
Col.           2
Don.           1
Mme.           1
Lady.          1
Sir.           1
Capt.          1
the            1
Jonkheer.      1
Name: count, dtype: int64
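The `regex` argument of `str.replace` matters whenever the pattern contains a character such as `.`. The sketch below uses a made-up value, `'Msx'`, purely to show the difference between literal and regex matching.

```python
import pandas as pd

# 'Msx' is a made-up value used only to expose the wildcard behavior.
titles = pd.Series(['Ms.', 'Msx', 'Mr.'])

# Literal match: only the exact text 'Ms.' is replaced.
literal = titles.str.replace('Ms.', 'Miss.', regex=False).tolist()

# Regex match: '.' matches any character, so 'Msx' is replaced too.
as_regex = titles.str.replace('Ms.', 'Miss.', regex=True).tolist()

print(literal, as_regex)
```

Passing `regex` explicitly keeps the result the same across pandas versions, since the default has changed over time.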

In [None]:
df.head(1)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Title,First Name
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr.,Owen Harris


**Filtering**

In [None]:
#startswith/endswith
df['First Name'][df['First Name'].str.startswith('A')]

13            Anders Johan
22            Anna "Annie"
35         Alexander Oskar
38           Augusta Maria
61                  Amelie
              ...         
842                Augusta
845                Anthony
866               Asuncion
875    Adele Kiamie "Jane"
876          Alfred Ossian
Name: First Name, Length: 95, dtype: object

In [None]:
#isdigit/isalpha
df['First Name'][df['First Name'].str.isdigit()]

Series([], Name: First Name, dtype: object)
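The empty result above means no first name consists solely of digits. A sketch on hypothetical mixed values shows what `isdigit` and `isalpha` accept:

```python
import pandas as pd

# Hypothetical values: letters only, digits only, mixed, and text with a space.
values = pd.Series(['Anna', '1912', 'A5', 'Owen Harris'])

digits_only = values.str.isdigit().tolist()
letters_only = values.str.isalpha().tolist()

print(digits_only, letters_only)
```

Note that `'Owen Harris'` fails `isalpha` because a space is not a letter.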

**Advanced Level Filtering**

In [None]:
#applying regex

In [None]:
#contains

**Let's say we want to search for "john", matching both uppercase and lowercase**

In [None]:
#search for john, ignoring case
df['First Name'][df['First Name'].str.contains('john',case=False)].count()

44
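The same search can be sketched on a few hypothetical values, including a missing one. Here `na=False` is an added assumption so that the missing entry is simply excluded from the matches.

```python
import pandas as pd

# Hypothetical sample, including a missing value.
first_names = pd.Series(['John Bradley', 'johnny', 'Owen Harris', None])

# case=False ignores case; na=False treats missing values as non-matches.
mask = first_names.str.contains('john', case=False, na=False)
matches = first_names[mask]

print(matches.tolist())
```

Because `contains` looks for the text anywhere in the string, `'johnny'` also counts as a match.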

In [None]:
#Find surnames that start and end with a vowel
df['Surname'][df['Surname'].str.contains('^[aeiouAEIOU].+[aeiouAEIOU]$')]

30          Uruchurtu
49     Arnold-Franchi
207          Albimona
210               Ali
353    Arnold-Franchi
493      Artagaveytia
518             Angle
784               Ali
840          Alhomaki
Name: Surname, dtype: object
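In that pattern, `^` and `$` anchor the match to the start and end of the string, and `.+` requires at least one character in between. As a sketch of an equivalent spelling (the sample `surnames` is hypothetical), `case=False` lets the character class list the vowels only once:

```python
import pandas as pd

# Hypothetical sample of surnames.
surnames = pd.Series(['Ali', 'Braund', 'Angle', 'Oto', 'Allen'])

# ^ and $ anchor the pattern; case=False covers upper and lower case.
pattern = r'^[aeiou].+[aeiou]$'
mask = surnames.str.contains(pattern, case=False)

print(surnames[mask].tolist())
```

Since `.+` needs at least one character, a two-letter name made of two vowels would not match.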

In [None]:
#slicing
df['Name'].str[:4]

0      Brau
1      Cumi
2      Heik
3      Futr
4      Alle
       ... 
886    Mont
887    Grah
888    John
889    Behr
890    Dool
Name: Name, Length: 891, dtype: object
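Slicing through `.str` follows the usual Python slice rules, so negative indices work as well. A short sketch on two sample names, with `str.slice` as the method form of the same operation:

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris', 'Allen, Mr. William Henry'])

first_four = names.str[:4].tolist()          # first four characters
last_five = names.str[-5:].tolist()          # last five characters
via_method = names.str.slice(0, 4).tolist()  # same as names.str[:4]

print(first_four, last_five)
```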