## An Introduction to String Functions on a pandas DataFrame
Speaker: Abdoulaye Balde [@abdoulayegk](http://twitter.com/abdoulayegk)<br>
Notebook will be  [abdulayegk]()

The dataset is freely available online; you can access it through this link.<br>
[Dataset](https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5/data)

In [3]:
import pandas as pd

In [4]:
inspections = pd.read_csv("chicago_food_inspections.csv")
inspections.head()

| | Name | Risk |
|---|---|---|
| 0 | MARRIOT MARQUIS CHICAGO | Risk 1 (High) |
| 1 | JET'S PIZZA | Risk 2 (Medium) |
| 2 | ROOM 1520 | Risk 3 (Low) |
| 3 | MARRIOT MARQUIS CHICAGO | Risk 1 (High) |
| 4 | CHARTWELLS | Risk 1 (High) |


In [5]:
inspections["Name"].head()

0     MARRIOT MARQUIS CHICAGO   
1                   JET'S PIZZA 
2                     ROOM 1520 
3      MARRIOT MARQUIS CHICAGO  
4                  CHARTWELLS   
Name: Name, dtype: object

### What is wrong with this DataFrame? Can anyone identify it?

In [6]:
inspections["Name"].head().values

array([' MARRIOT MARQUIS CHICAGO   ', " JET'S PIZZA ", '   ROOM 1520 ',
       '  MARRIOT MARQUIS CHICAGO  ', ' CHARTWELLS   '], dtype=object)

We can see extra whitespace at the beginning and the end of the values in the Name column, which is a problem.

### Now, how can we fix that issue?
Python provides some string methods that handle this well:
1. lstrip()
2. rstrip()
3. strip()

In [7]:
dessert = "  orange  "
dessert.lstrip()

'orange  '

In [8]:
dessert.rstrip()

'  orange'

In [9]:
dessert.strip()

'orange'

**Note**: In pandas, these string methods cannot be called directly on a DataFrame or a Series. We first go through the .str accessor of a Series, which exposes the string methods and applies them to every value in the column.
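
To make the note above concrete, here is a minimal sketch. The Series `names` is made-up sample data, not part of the inspections dataset. It shows that calling `strip()` directly on a Series raises an `AttributeError`, while the same call through `.str` works element by element.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
names = pd.Series(["  orange  ", " apple"])

# A Series has no strip() method of its own, so this raises AttributeError.
try:
    names.strip()
except AttributeError as err:
    print("AttributeError:", err)

# Going through the .str accessor applies strip() to every element.
stripped = names.str.strip()
print(stripped.tolist())  # ['orange', 'apple']
```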

In [10]:
inspections["Name"].str

<pandas.core.strings.accessor.StringMethods at 0x7f7b4729f3a0>

In [11]:
inspections["Name"].str.lstrip().head()

0    MARRIOT MARQUIS CHICAGO   
1                  JET'S PIZZA 
2                    ROOM 1520 
3     MARRIOT MARQUIS CHICAGO  
4                 CHARTWELLS   
Name: Name, dtype: object

In [12]:
inspections["Name"].str.rstrip().head()

0      MARRIOT MARQUIS CHICAGO
1                  JET'S PIZZA
2                    ROOM 1520
3      MARRIOT MARQUIS CHICAGO
4                   CHARTWELLS
Name: Name, dtype: object

In [13]:
inspections["Name"].str.strip().head()

0    MARRIOT MARQUIS CHICAGO
1                JET'S PIZZA
2                  ROOM 1520
3    MARRIOT MARQUIS CHICAGO
4                 CHARTWELLS
Name: Name, dtype: object

In [14]:
inspections["Name"] = inspections["Name"].str.strip()
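
One possible way to confirm that the cleanup worked is to count how many values still differ from their stripped form. The sketch below uses a small made-up Series (`names`) shaped like the padded values shown earlier; the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample mimicking the padded restaurant names.
names = pd.Series([" MARRIOT MARQUIS CHICAGO   ", " JET'S PIZZA ", "CHARTWELLS"])

# A value is padded when stripping it changes it.
padded_before = (names != names.str.strip()).sum()

# Overwrite the Series with the cleaned values, as done for the Name column.
names = names.str.strip()
padded_after = (names != names.str.strip()).sum()

print(padded_before, padded_after)  # 2 0
```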

In [15]:
inspections.columns

Index(['Name', 'Risk'], dtype='object')

In [16]:
for column in inspections.columns:
    inspections[column] = inspections[column].str.strip()
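
The loop above works here because both columns hold text. For a frame that also has numeric columns, `.str` would raise an error on those columns, so one option is to check each column first. The sketch below assumes a hypothetical frame `df` with an invented `Zip` column and uses `pandas.api.types.is_string_dtype` for the check; treat it as one possible approach rather than the only one.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Hypothetical frame: two text columns plus one numeric column.
df = pd.DataFrame({
    "Name": ["  ROOM 1520 ", " CHARTWELLS   "],
    "Risk": [" Risk 3 (Low)", "Risk 1 (High) "],
    "Zip": [60601, 60607],
})

# Keep only the columns whose values are strings.
text_columns = [column for column in df.columns if is_string_dtype(df[column])]

# Strip whitespace in the text columns and leave the numeric column alone.
for column in text_columns:
    df[column] = df[column].str.strip()

print(text_columns)         # ['Name', 'Risk']
print(df["Name"].tolist())  # ['ROOM 1520', 'CHARTWELLS']
```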

## Lowercase and Uppercase

In [17]:
inspections["Name"].str.lower().head()

0    marriot marquis chicago
1                jet's pizza
2                  room 1520
3    marriot marquis chicago
4                 chartwells
Name: Name, dtype: object

In [18]:
steaks = pd.Series(["porterhouse", "filet mignon", "ribeye"])
steaks

0     porterhouse
1    filet mignon
2          ribeye
dtype: object

In [19]:
steaks.str.upper()

0     PORTERHOUSE
1    FILET MIGNON
2          RIBEYE
dtype: object

In [20]:
inspections["Name"].str.capitalize().head()

0    Marriot marquis chicago
1                Jet's pizza
2                  Room 1520
3    Marriot marquis chicago
4                 Chartwells
Name: Name, dtype: object

In [21]:
inspections["Name"].str.title().head()

0    Marriot Marquis Chicago
1                Jet'S Pizza
2                  Room 1520
3    Marriot Marquis Chicago
4                 Chartwells
Name: Name, dtype: object
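
Notice that `title()` turned JET'S into Jet'S, because it treats the apostrophe as a word boundary. If that is not wanted, one alternative (a sketch, not something the notebook itself uses) is `string.capwords` from the standard library, which splits on whitespace only. Note that `capwords` also collapses repeated spaces.

```python
import string

import pandas as pd

# Two sample values taken from the Name column.
names = pd.Series(["JET'S PIZZA", "ROOM 1520"])

# title() capitalizes the letter after the apostrophe.
titled = names.str.title()

# capwords() capitalizes only the first letter of each whitespace-separated word.
capworded = names.map(string.capwords)

print(titled.tolist())     # ["Jet'S Pizza", 'Room 1520']
print(capworded.tolist())  # ["Jet's Pizza", 'Room 1520']
```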

## String Slicing

In [22]:
inspections["Risk"].head()

0      Risk 1 (High)
1    Risk 2 (Medium)
2       Risk 3 (Low)
3      Risk 1 (High)
4      Risk 1 (High)
Name: Risk, dtype: object

In [23]:
inspections["Risk"].unique()

array(['Risk 1 (High)', 'Risk 2 (Medium)', 'Risk 3 (Low)', 'All', nan],
      dtype=object)

In [24]:
inspections.dropna(subset = ["Risk"], inplace = True)

In [25]:
inspections["Risk"].unique()

array(['Risk 1 (High)', 'Risk 2 (Medium)', 'Risk 3 (Low)', 'All'],
      dtype=object)

We treat the value All as Risk 4, so we replace it to put it in the same format as the remaining values.

In [26]:
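# Restrict the replacement to the Risk column so a Name equal to "All" is left untouched.
inspections["Risk"] = inspections["Risk"].replace(
    to_replace = "All", value = "Risk 4 (Extreme)"
)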

In [27]:
inspections["Risk"].unique()

array(['Risk 1 (High)', 'Risk 2 (Medium)', 'Risk 3 (Low)',
       'Risk 4 (Extreme)'], dtype=object)

### String Slicing and Character Replacement

String slicing is especially useful in this case: it lets us do, in a simple and readable way, something we might otherwise reach for **Regular Expressions** to do.

In [28]:
inspections["RiskNumber"] = inspections["Risk"].str.slice(5, 6)
inspections["RiskNumber"].head()

0    1
1    2
2    3
3    1
4    1
Name: RiskNumber, dtype: object

In [29]:
inspections["Risk"].str[5:6].head()

0    1
1    2
2    3
3    1
4    1
Name: Risk, dtype: object

In [30]:
inspections["Risk"].str.slice(8).head()

0      High)
1    Medium)
2       Low)
3      High)
4      High)
Name: Risk, dtype: object

In [31]:
inspections["Risk"].str[8:].head()

0      High)
1    Medium)
2       Low)
3      High)
4      High)
Name: Risk, dtype: object
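
The trailing parenthesis left in the output above can also be removed by character replacement instead of slicing. The sketch below uses `Series.str.replace` with `regex=False` so the parenthesis is treated as a literal character; the Series `levels` is sample data shaped like that output.

```python
import pandas as pd

# Sample values shaped like the sliced output, with the trailing ")".
levels = pd.Series(["High)", "Medium)", "Low)"])

# regex=False makes ")" a plain character rather than regex syntax.
cleaned = levels.str.replace(")", "", regex=False)

print(cleaned.tolist())  # ['High', 'Medium', 'Low']
```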

In [32]:
inspections["RiskLevel"] = inspections["Risk"].str.slice(8, -1)
inspections["RiskLevel"].head()

0      High
1    Medium
2       Low
3      High
4      High
Name: RiskLevel, dtype: object
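
For comparison with the slices above, here is a hedged sketch of the regular-expression route using `Series.str.extract`. The pattern and the Series `risk` are assumptions written for the four Risk values seen in this notebook; a different format would need a different pattern.

```python
import pandas as pd

# The four Risk values that appear after the replacement step.
risk = pd.Series(
    ["Risk 1 (High)", "Risk 2 (Medium)", "Risk 3 (Low)", "Risk 4 (Extreme)"]
)

# Two capture groups: the digit after "Risk " and the word inside the parentheses.
parts = risk.str.extract(r"Risk (\d) \((\w+)\)")
parts.columns = ["RiskNumber", "RiskLevel"]

print(parts["RiskNumber"].tolist())  # ['1', '2', '3', '4']
print(parts["RiskLevel"].tolist())   # ['High', 'Medium', 'Low', 'Extreme']
```

Slicing relies on the number always sitting at position 5, while the pattern relies on the overall shape of the text. Both give the same result on these values.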

## Boolean Methods

In [33]:
"Pizza" in "Jet's Pizza"

True

In [34]:
"pizza" in "Jet's Pizza"

False

In [35]:
inspections["Name"].str.lower().str.contains("pizza").head()

0    False
1     True
2    False
3    False
4    False
Name: Name, dtype: bool
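
Lowercasing first is one way to ignore case. As a sketch of an alternative, `str.contains` also accepts `case=False`, and `na=False` makes missing names count as non-matches so the result can be used directly as a filter. The Series `names` below is made-up sample data.

```python
import pandas as pd

# Hypothetical sample, including one missing name.
names = pd.Series(["JET'S PIZZA", "Tacos Place's 1", None, "ROOM 1520"])

# case=False ignores letter case; na=False treats missing values as False.
has_pizza = names.str.contains("pizza", case=False, na=False)

print(has_pizza.tolist())         # [True, False, False, False]
print(names[has_pizza].tolist())  # ["JET'S PIZZA"]
```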

In [36]:
has_pizza = inspections["Name"].str.lower().str.contains("pizza")
inspections[has_pizza]

| | Name | Risk |
|---|---|---|
| 1 | JET'S PIZZA | Risk 2 (Medium) |
| 19 | NANCY'S HOME OF STUFFED PIZZA | Risk 1 (High) |
| 27 | NARY'S GRILL & PIZZA ,INC. | Risk 1 (High) |
| 29 | NARYS GRILL & PIZZA | Risk 1 (High) |
| 68 | COLUTAS PIZZA | Risk 1 (High) |
| ... | ... | ... |
| 153756 | ANGELO'S STUFFED PIZZA CORP | Risk 1 (High) |
| 153764 | COCHIAROS PIZZA #2 | Risk 1 (High) |
| 153772 | FERNANDO'S MEXICAN GRILL & PIZZA | Risk 1 (High) |
| 153788 | REGGIO'S PIZZA EXPRESS | Risk 1 (High) |


In [37]:
inspections["Name"].str.lower().str.startswith("tacos").head()

0    False
1    False
2    False
3    False
4    False
Name: Name, dtype: bool

In [38]:
starts_with_tacos = inspections["Name"].str.lower().str.startswith("tacos")
inspections[starts_with_tacos]

| | Name | Risk |
|---|---|---|
| 69 | TACOS NIETOS | Risk 1 (High) |
| 556 | TACOS EL TIO 2 INC. | Risk 1 (High) |
| 675 | TACOS DON GABINO | Risk 1 (High) |
| 958 | TACOS EL TIO 2 INC. | Risk 1 (High) |
| 1036 | TACOS EL TIO 2 INC. | Risk 1 (High) |
| ... | ... | ... |
| 143587 | TACOS DE LUNA | Risk 1 (High) |
| 144026 | TACOS GARCIA | Risk 1 (High) |
| 146174 | Tacos Place's 1 | Risk 1 (High) |
| 147810 | TACOS MARIO'S LIMITED | Risk 1 (High) |


In [39]:
ends_with_tacos = inspections["Name"].str.lower().str.endswith("tacos")
inspections[ends_with_tacos]

| | Name | Risk |
|---|---|---|
| 382 | LAZO'S TACOS | Risk 1 (High) |
| 569 | LAZO'S TACOS | Risk 1 (High) |
| 2652 | FLYING TACOS | Risk 3 (Low) |
| 3250 | JONY'S TACOS | Risk 1 (High) |
| 3812 | PACO'S TACOS | Risk 1 (High) |
| ... | ... | ... |
| 151121 | REYES TACOS | Risk 1 (High) |
| 151318 | EL MACHO TACOS | Risk 1 (High) |
| 151801 | EL MACHO TACOS | Risk 1 (High) |
| 153087 | RAYMOND'S TACOS | Risk 1 (High) |
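
The boolean Series returned by `startswith` and `endswith` can be combined with `|` (or) and `&` (and) before filtering. The sketch below uses a few names from the outputs above in a made-up Series `names` to select rows that either start or end with "tacos".

```python
import pandas as pd

# A few sample names, some taken from the outputs above.
names = pd.Series(["TACOS NIETOS", "LAZO'S TACOS", "JET'S PIZZA", "Tacos Place's 1"])
lowered = names.str.lower()

starts_with_tacos = lowered.str.startswith("tacos")
ends_with_tacos = lowered.str.endswith("tacos")

# Element-wise OR of the two boolean masks.
either = starts_with_tacos | ends_with_tacos

print(names[either].tolist())
# ['TACOS NIETOS', "LAZO'S TACOS", "Tacos Place's 1"]
```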


**I would appreciate your feedback** <br>
Email: abdoulayegnbalde@gmail.com <br>
GitHub: abdoulayegk <br>
Twitter: @abdoulayegk