# Pandas Working With Text Data
Series and Index objects come with a set of string-processing methods that operate on each element of the array. Perhaps most importantly, these methods skip missing/NA values automatically. They are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods.
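
As a minimal sketch of the missing-value behaviour described above (the sample values are made up for illustration), a NaN entry passes through a .str method unchanged instead of raising an error:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
cities = pd.Series(["Delhi", np.nan, "Kanpur"])

# .str.upper() transforms the strings and leaves the missing entry missing
upper = cities.str.upper()
print(upper)
```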

To lowercase the values in a column, use str.lower(), which converts all uppercase characters to lowercase; strings with no uppercase characters are returned unchanged. To uppercase them, use str.upper(), which converts all lowercase characters to uppercase; strings with no lowercase characters are returned unchanged.

#### Code #1:

In [2]:
# Import pandas package 
import pandas as pd 
   
# Define a dictionary containing employee data 
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 
        'Age':[27, 24, 22, 32], 
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'], 
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']} 
   
# Convert the dictionary into DataFrame  
df = pd.DataFrame(data) 
   
df

Unnamed: 0,Name,Age,Address,Qualification
0,Jai,27,Delhi,Msc
1,Princi,24,Kanpur,MA
2,Gaurav,22,Allahabad,MCA
3,Anuj,32,Kannauj,Phd


In [3]:
# converting and overwriting values in column 
df["Name"]= df["Name"].str.lower()
 
print(df)


     Name  Age    Address Qualification
0     jai   27      Delhi           Msc
1  princi   24     Kanpur            MA
2  gaurav   22  Allahabad           MCA
3    anuj   32    Kannauj           Phd


The next example uses the nba.csv file.


In [4]:
# importing pandas package 
import pandas as pd 
   
# making data frame from csv file 
data = pd.read_csv("nba.csv") 
   
# converting and overwriting values in column 
data["Team"]= data["Team"].str.upper() 
   
# display 
data 

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,BOSTON CELTICS,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,BOSTON CELTICS,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,BOSTON CELTICS,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,BOSTON CELTICS,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,BOSTON CELTICS,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
453,Shelvin Mack,UTAH JAZZ,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,UTAH JAZZ,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,UTAH JAZZ,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,UTAH JAZZ,24.0,C,26.0,7-0,231.0,Kansas,947276.0


To split text, use str.split(). Python's built-in split() returns a list of strings after breaking a single string at the specified separator; pandas str.split() applies the same operation to every element of a Series. The method has to be called through the .str accessor, because a Series has no split() method of its own and calling it directly raises an AttributeError.

To replace text, use str.replace(), which works like Python's .replace() method but on every element of a Series. Note that a Series also has its own replace() method, which replaces whole values rather than substrings; the nba.csv example below uses that method. The .str prefix is needed only when the replacement should happen inside each string.

In [7]:
# importing pandas module  
import pandas as pd 
     
# Define a dictionary containing employee data 
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 
        'Age':[27, 24, 22, 32], 
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Knnuaj'], 
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']} 
 
# Convert the dictionary into DataFrame  
df = pd.DataFrame(data) 
df

Unnamed: 0,Name,Age,Address,Qualification
0,Jai,27,Nagpur,Msc
1,Princi,24,Kanpur,MA
2,Gaurav,22,Allahabad,MCA
3,Anuj,32,Knnuaj,Phd


In [9]:
# dropping rows with null values to avoid errors 
df.dropna(inplace = True) 
    
# overwriting the Address column with lists of split parts 
df["Address"]= df["Address"].str.split("a") 
   
# df display 
print(df)

     Name  Age         Address Qualification
0     Jai   27       [N, gpur]           Msc
1  Princi   24       [K, npur]            MA
2  Gaurav   22  [All, h, b, d]           MCA
3    Anuj   32       [Knnu, j]           Phd
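
The split above stores a list in each cell. As a hedged sketch with made-up sample addresses, passing expand=True to str.split() spreads the parts across separate columns instead:

```python
import pandas as pd

# Hypothetical Series of address strings
addresses = pd.Series(["Nagpur junction", "Kanpur junction"])

# Default: each element becomes a list of parts
as_lists = addresses.str.split(" ")

# expand=True: each part goes into its own column of a DataFrame
as_columns = addresses.str.split(" ", expand=True)
print(as_columns)
```

The expanded form is usually easier to work with when every string splits into the same number of parts.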


In [10]:
# importing pandas module 
import pandas as pd
 
# reading csv file
data = pd.read_csv("nba.csv")
 
# overwriting Age column, replacing the value 25.0 with "Twenty five"
data["Age"]= data["Age"].replace(25.0, "Twenty five")
 
# creating a boolean mask for the Age column 
# where age == "Twenty five"
mask = data["Age"]=="Twenty five"
 
# keeping only the matching rows (rows with any missing value are dropped too)
data.where(mask).dropna()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,Twenty five,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,Twenty five,6-6,235.0,Marquette,6796117.0
7,Kelly Olynyk,Boston Celtics,41.0,C,Twenty five,7-0,238.0,Gonzaga,2165160.0
26,Thomas Robinson,Brooklyn Nets,41.0,PF,Twenty five,6-10,237.0,Kansas,981348.0
35,Cleanthony Early,New York Knicks,11.0,SF,Twenty five,6-8,210.0,Wichita State,845059.0
44,Derrick Williams,New York Knicks,23.0,PF,Twenty five,6-8,240.0,Arizona,4000000.0
47,Isaiah Canaan,Philadelphia 76ers,0.0,PG,Twenty five,6-0,201.0,Murray State,947276.0
48,Robert Covington,Philadelphia 76ers,33.0,SF,Twenty five,6-9,215.0,Tennessee State,1000000.0
59,Hollis Thompson,Philadelphia 76ers,31.0,SG,Twenty five,6-8,206.0,Georgetown,947276.0
71,Terrence Ross,Toronto Raptors,31.0,SF,Twenty five,6-7,195.0,Washington,3553917.0
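
The example above calls replace() directly on the Series, which swaps whole values. To illustrate the substring-level str.replace() described earlier, here is a small sketch with made-up sample addresses; regex=False is passed so the pattern is treated as literal text:

```python
import pandas as pd

# Hypothetical Series of address strings
addresses = pd.Series(["Nagpur junction", "Kanpur junction"])

# Series.replace matches whole values, so nothing changes here
whole = addresses.replace("junction", "station")

# Series.str.replace works on substrings inside each element
partial = addresses.str.replace("junction", "station", regex=False)
print(partial)
```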


To concatenate strings in a Series or Index, use str.cat(), which joins the passed strings to the caller's strings element by element. Values from a different Series or list-like can be passed; a plain list or array must have the same length as the caller, while another Series is aligned on its index. As with the other methods, it is called through the .str accessor.

In [11]:
# importing pandas module 
import pandas as pd 
   
# Define a dictionary containing employee data 
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 
        'Age':[27, 24, 22, 32], 
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']} 
 
# Convert the dictionary into DataFrame  
df = pd.DataFrame(data) 
 
df


Unnamed: 0,Name,Age,Address,Qualification
0,Jai,27,Nagpur,Msc
1,Princi,24,Kanpur,MA
2,Gaurav,22,Allahabad,MCA
3,Anuj,32,Kannuaj,Phd


In [12]:
# making copy of address column 
new = df["Address"].copy() 
   
# concatenating address with name column 
# overwriting name column 
df["Name"]= df["Name"].str.cat(new, sep =", ") 
   
# display 
print(df)

                Name  Age    Address Qualification
0        Jai, Nagpur   27     Nagpur           Msc
1     Princi, Kanpur   24     Kanpur            MA
2  Gaurav, Allahabad   22  Allahabad           MCA
3      Anuj, Kannuaj   32    Kannuaj           Phd


In [None]:
# importing pandas module
import pandas as pd
 
# reading csv file
data = pd.read_csv("nba.csv")
 
# making copy of team column
new = data["Team"].copy()
 
# concatenating team with name column
# overwriting name column
data["Name"]= data["Name"].str.cat(new, sep =", ")
 
# display
data
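
When either side contains a missing value, str.cat() returns a missing value for that row unless a placeholder is supplied through the na_rep parameter. A minimal sketch with made-up names and cities:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the second city is missing
names = pd.Series(["Jai", "Princi", "Gaurav"])
cities = pd.Series(["Nagpur", np.nan, "Allahabad"])

# Without na_rep, a missing value on either side makes the result missing
plain = names.str.cat(cities, sep=", ")

# With na_rep, the missing value is replaced by the given placeholder
filled = names.str.cat(cities, sep=", ", na_rep="unknown")
print(filled)
```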

## Removing Whitespace from Data
To remove whitespace, use str.strip(), str.rstrip(), and str.lstrip(). These methods handle whitespace (including newlines) in text data. As the names suggest, str.lstrip() removes whitespace from the left side of each string, str.rstrip() removes it from the right side, and str.strip() removes it from both sides. Since these pandas methods share names with Python's built-in string methods, they are called through the .str accessor so that they apply to every element of the Series.

In [14]:
# importing pandas module 
import pandas as pd 
   
# Define a dictionary containing employee data 
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 
        'Age':[27, 24, 22, 32], 
        'Address':['Nagpur junction', 'Kanpur junction', 
                   'Nagpur junction', 'Kannuaj junction'], 
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']} 
 
# Convert the dictionary into DataFrame  
df = pd.DataFrame(data)
   
df


Unnamed: 0,Name,Age,Address,Qualification
0,Jai,27,Nagpur junction,Msc
1,Princi,24,Kanpur junction,MA
2,Gaurav,22,Nagpur junction,MCA
3,Anuj,32,Kannuaj junction,Phd


In [16]:
# replacing address name and adding spaces in start and end 
new = df["Address"].replace("Nagpur junction", "  Nagpur junction  ").copy() 
   
new

0      Nagpur junction  
1        Kanpur junction
2      Nagpur junction  
3       Kannuaj junction
Name: Address, dtype: object

In [18]:
# stripping whitespace from both sides of each value 
new.str.strip()


0     Nagpur junction
1     Kanpur junction
2     Nagpur junction
3    Kannuaj junction
Name: Address, dtype: object
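
To compare the three methods side by side, the following sketch uses made-up strings padded with spaces, a tab, and a newline:

```python
import pandas as pd

# Hypothetical values with leading and trailing whitespace
padded = pd.Series(["  Nagpur junction\n", "\tKanpur junction  "])

left = padded.str.lstrip()   # leading whitespace removed only
right = padded.str.rstrip()  # trailing whitespace removed only
both = padded.str.strip()    # whitespace removed from both sides
print(both.tolist())
```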

In [None]:
# importing pandas module 
import pandas as pd 
   
# making data frame 
data = pd.read_csv("nba.csv") 
   
# replacing team name and adding spaces in start and end 
new = data["Team"].replace("Boston Celtics", "  Boston Celtics  ").copy() 
   
# stripping leading spaces only and comparing with the right-padded string 
new.str.lstrip()=="Boston Celtics  "

## Extracting Data
To extract parts of each string, use str.extract(), which accepts a regular expression with at least one capture group. A regular expression with more than one group returns a DataFrame with one column per group. Elements that do not match produce a row filled with NaN.

In [19]:
# importing pandas module 
import pandas as pd 
 
# creating a series 
s = pd.Series(['a1', 'b2', 'c3'])
 
s

0    a1
1    b2
2    c3
dtype: object

In [21]:
# extracting the letter and the digit into separate columns
n= s.str.extract(r'([abc])(\d)')
 
print(n)

   0  1
0  a  1
1  b  2
2  c  3


In [22]:
# importing pandas module 
import pandas as pd 
 
# creating a series 
s = pd.Series(['a1', 'b2', 'c3'])
 
# extracting with named groups; 'c3' does not match [ab], so its row is NaN
n = s.str.extract(r'(?P<Geeks>[ab])(?P<For>\d)')
 
print(n)

  Geeks  For
0     a    1
1     b    2
2   NaN  NaN
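
With a single capture group, str.extract() still returns a DataFrame by default. As a small sketch reusing the same sample values, passing expand=False returns a Series instead:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])

# One capture group, default expand=True: a one-column DataFrame
digits_frame = s.str.extract(r"[ab](\d)")

# One capture group, expand=False: a Series
digits_series = s.str.extract(r"[ab](\d)", expand=False)
print(digits_series)
```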


## Pandas str methods:
![image.png](attachment:image.png)