# Demo 2 | Regular Expressions Demonstration
<hr>

Using the Lazada CIKM 2017 dataset, this demo presents a few regular expression use cases. Regular expressions, and in particular Python's `re` module, support three categories of operation: **pattern matching**, **substitution**, and **splitting**.

In [1]:
import re

import pandas as pd
pd.options.display.max_colwidth = None  # show full cell contents (-1 is deprecated since pandas 1.0)
import matplotlib.pyplot as plt
import seaborn as sns

import utils

In [2]:
df = utils.read_lazada_csv()
df = df[(df.country=='sg') & (df.category_lvl_1 == 'Mobiles & Tablets') & (df.category_lvl_3 == 'Phone Cases')]
df = df[['title', 'category_lvl_1', 'category_lvl_2', 'category_lvl_3', 'desc',]]
df['id'] = df.index
df.reset_index(inplace=True, drop=True)
df = df[['id', 'title', 'category_lvl_1', 'category_lvl_2', 'category_lvl_3', 'desc',]]
display(df.head())

index,id,title,category_lvl_1,category_lvl_2,category_lvl_3,desc
0,157,BUILDPHONE Plastic Hard Back Phone Case for Samsung Galaxy S3mini with Phone Holder Ring (Multicolor) - intl,Mobiles & Tablets,Accessories,Phone Cases,"<ul> <li>Half around protect</li> <li>Made from durable plastic.</li> <li>Keep your phone safe and protected in style.</li> <li>Shock absorbent, shatterproof&nbsp;material.&nbsp;</li> <li>Applicaton and removal are&nbsp;easy.</li> <li>Painted pattern, add the art feeling to your smartphone</li> </ul>"
1,164,MOONCASE Hard Protective Printing Back Plate Case Cover for Sony Xperia Z2 No.3002004 (EXPORT),Mobiles & Tablets,Accessories,Phone Cases,<ul><li>Excellent quality and fashion design</li><li>Cute Pattern</li><li>Secure fit for your case</li></ul>
2,165,Moonmini Cover for HTC Desire 616 (White) 3D Luxury Bling Rhinestones Diamonds Bow Bone PU Leather Flip Case Cover Wallet with Card Holders(Export)(Intl),Mobiles & Tablets,Accessories,Phone Cases,<ul> <li>Premium look and feel</li> <li>Perfect and attractive decorate your device</li> <li>Protect your device from scratches/shock and dust</li> <li>Show off natural beauty of your device's design</li> <li>Durable and fashionable</li> <li>All round protection</li> <li>Easy to install and remove</li> </ul>
3,166,Case for Apple iPhone 6 4.7 inch - Angry Face(Export),Mobiles & Tablets,Accessories,Phone Cases,<ul> <li>Perfect and attractive decorate your phone</li> <li>Protect your phone from scratches/shock and dust</li> <li>Show off natural beauty of your phone's design</li> <li>Durable and fashionable</li> <li>All round protection</li> <li>Easy to use: just snap it on and snap it off</li> </ul>
4,171,TPU Case with Removable Metal Rim for Samsung Galaxy S6 G920 (Blue),Mobiles & Tablets,Accessories,Phone Cases,"<ul> <li>Back tpu case + detachable metal rim</li> <li>Consist of two pieces, easy to install</li> <li>Lightweight and slim, wonderful touch feeling</li> <li>Provide good protection</li> <li>Without retail package</li> </ul>"


`re.split(pattern, string)` splits a string into a list of substrings. Every match of `pattern` is treated as a delimiter and is left out of the result.

In [16]:
c2 = df.loc[0,'category_lvl_3']
print(c2)
# Here the delimiter is a single space, so re.split() returns the words on either side of it.
print(re.split(' ', c2))

Phone Cases
['Phone', 'Cases']


In [17]:
c1 = df.loc[0,'category_lvl_1']
print(c1)
# The delimiter can be any pattern. Here it is "&"; note that the surrounding spaces stay in the pieces.
print(re.split('&', c1))

Mobiles & Tablets
['Mobiles ', ' Tablets']
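The output above keeps a stray space on each piece because only the `&` itself is the delimiter. A minimal sketch of two ways to avoid that, assuming the same category string (the variable names `category`, `parts`, and `tokens` are illustrative and not part of the notebook):

```python
import re

category = "Mobiles & Tablets"

# Splitting on a bare "&" keeps the surrounding spaces in each piece.
raw_parts = re.split("&", category)
print(raw_parts)   # ['Mobiles ', ' Tablets']

# Letting the delimiter absorb optional whitespace yields clean pieces.
parts = re.split(r"\s*&\s*", category)
print(parts)       # ['Mobiles', 'Tablets']

# A character class treats any run of spaces or "&" as a single delimiter.
tokens = re.split(r"[\s&]+", category)
print(tokens)      # ['Mobiles', 'Tablets']
```

The first variant is the safer choice when category names may themselves contain spaces, since it only splits around the `&`.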


In [28]:
c3 = df.loc[0, 'desc']
print(c3)
print()

# alternatively, first compile the regex, then call its split() method to split the string
r = re.compile('<li>')
for rs in r.split(c3):
    print(rs)
# NOTE: compiling the regex is useful when it is reused in several parts of the code

<ul> <li>Half around protect</li> <li>Made from durable plastic.</li> <li>Keep your phone safe and protected in style.</li> <li>Shock absorbent, shatterproof&nbsp;material.&nbsp;</li> <li>Applicaton and removal are&nbsp;easy.</li> <li>Painted pattern, add the art feeling to your smartphone</li> </ul> 

<ul> 
Half around protect</li> 
Made from durable plastic.</li> 
Keep your phone safe and protected in style.</li> 
Shock absorbent, shatterproof&nbsp;material.&nbsp;</li> 
Applicaton and removal are&nbsp;easy.</li> 
Painted pattern, add the art feeling to your smartphone</li> </ul> 


In [35]:
print(c3)
# re.match() checks whether a pattern matches at the beginning of a string
print(re.match('<ul>', c3)) # returns a match object because the string starts with '<ul>'
print(re.match('Half', c3)) # returns None because the string does not start with 'Half'

# Alternatively, re.search() checks if a pattern exists anywhere in the string
print(re.search('Half', c3))

<ul> <li>Half around protect</li> <li>Made from durable plastic.</li> <li>Keep your phone safe and protected in style.</li> <li>Shock absorbent, shatterproof&nbsp;material.&nbsp;</li> <li>Applicaton and removal are&nbsp;easy.</li> <li>Painted pattern, add the art feeling to your smartphone</li> </ul> 
<_sre.SRE_Match object; span=(0, 4), match='<ul>'>
None
<_sre.SRE_Match object; span=(9, 13), match='Half'>
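Since `re.match()` and `re.search()` return either a match object or `None`, the result is usually tested for truthiness and then queried. A small sketch on a shortened, made-up description string (`desc` and the other names here are illustrative):

```python
import re

desc = "<ul> <li>Half around protect</li> </ul>"

# match() only looks at the start of the string.
m_start = re.match("<ul>", desc)
m_half = re.match("Half", desc)

# search() scans the whole string for the first match.
s_half = re.search("Half", desc)

# A match object is truthy and None is falsy, so the result works directly in `if`.
if s_half:
    print(s_half.group(), s_half.span())   # Half (9, 13)

print(m_start is not None, m_half is None)  # True True
```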


In [37]:
print(c3)
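# re.sub() replaces every match of the pattern; here each list tag is replaced with an empty string
print(re.sub('</?ul>|</?li>', '', c3))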

<ul> <li>Half around protect</li> <li>Made from durable plastic.</li> <li>Keep your phone safe and protected in style.</li> <li>Shock absorbent, shatterproof&nbsp;material.&nbsp;</li> <li>Applicaton and removal are&nbsp;easy.</li> <li>Painted pattern, add the art feeling to your smartphone</li> </ul> 
 Half around protect Made from durable plastic. Keep your phone safe and protected in style. Shock absorbent, shatterproof&nbsp;material.&nbsp; Applicaton and removal are&nbsp;easy. Painted pattern, add the art feeling to your smartphone  
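The substituted text still contains `&nbsp;` entities and doubled spaces. One possible follow-up, sketched here under the assumption that plain single-spaced text is wanted, decodes the entities with the standard-library `html.unescape` and collapses whitespace with a second `re.sub`. Unlike the cell above, the tags are replaced with a space rather than an empty string so that adjacent items do not run together; `desc`, `no_tags`, `decoded`, and `cleaned` are illustrative names.

```python
import html
import re

desc = ("<ul> <li>Shock absorbent, shatterproof&nbsp;material.&nbsp;</li> "
        "<li>Applicaton and removal are&nbsp;easy.</li> </ul>")

# 1. Replace the list tags with a space so neighbouring items stay separated.
no_tags = re.sub(r"</?(?:ul|li)>", " ", desc)

# 2. Decode HTML entities; "&nbsp;" becomes a non-breaking space (U+00A0).
decoded = html.unescape(no_tags)

# 3. Collapse each run of whitespace into one space.
#    For str patterns, \s also matches U+00A0.
cleaned = re.sub(r"\s+", " ", decoded).strip()
print(cleaned)
```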


In [27]:
# re.findall() returns a list of all non-overlapping matches of the regex in a string.
print(re.findall('&', c1))
print(r.findall(c3))

['&']
['<li>', '<li>', '<li>', '<li>', '<li>', '<li>']
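Finding the tags themselves is rarely the goal; more often the text between them is wanted. When the pattern contains one capturing group, `re.findall()` returns the group's content instead of the whole match. A short sketch using the second description from the dataset preview (the names `desc`, `items`, and `positions` are illustrative):

```python
import re

desc = ("<ul><li>Excellent quality and fashion design</li>"
        "<li>Cute Pattern</li><li>Secure fit for your case</li></ul>")

# With one capturing group, findall() returns only the captured text.
# The lazy quantifier .*? stops at the first closing </li>, not the last one.
items = re.findall(r"<li>(.*?)</li>", desc)
print(items)

# finditer() yields match objects, which also expose where each match starts and ends.
positions = [m.span() for m in re.finditer(r"<li>", desc)]
print(positions)
```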


In [3]:
t5 = df.copy().loc[:4, 'title'].tolist()
for title in t5:
    print(title)
    print()

BUILDPHONE Plastic Hard Back Phone Case for Samsung Galaxy S3mini with Phone Holder Ring (Multicolor) - intl

MOONCASE Hard Protective Printing Back Plate Case Cover for Sony Xperia Z2 No.3002004 (EXPORT)

Moonmini Cover for HTC Desire 616 (White) 3D Luxury Bling Rhinestones Diamonds Bow Bone PU Leather Flip Case Cover Wallet with Card Holders(Export)(Intl)

Case for Apple iPhone 6 4.7 inch - Angry Face(Export)

TPU Case with Removable Metal Rim for Samsung Galaxy S6 G920 (Blue)



In [4]:
desc5 = df.copy().loc[:4, 'desc'].tolist()
for desc in desc5:
    print(desc)
    print()

<ul> <li>Half around protect</li> <li>Made from durable plastic.</li> <li>Keep your phone safe and protected in style.</li> <li>Shock absorbent, shatterproof&nbsp;material.&nbsp;</li> <li>Applicaton and removal are&nbsp;easy.</li> <li>Painted pattern, add the art feeling to your smartphone</li> </ul> 

<ul><li>Excellent quality and fashion design</li><li>Cute Pattern</li><li>Secure fit for your case</li></ul>

<ul> <li>Premium look and feel</li> <li>Perfect and attractive decorate your device</li> <li>Protect your device from scratches/shock and dust</li> <li>Show off natural beauty of your device's design</li> <li>Durable and fashionable</li> <li>All round protection</li> <li>Easy to install and remove</li> </ul> 

<ul> <li>Perfect and attractive decorate your phone</li> <li>Protect your phone from scratches/shock and dust</li> <li>Show off natural beauty of your phone's design</li> <li>Durable and fashionable</li> <li>All round protection</li> <li>Easy to use: just snap it on and sna

In [5]:
d1 = desc5[0]
print(d1)
d1_remove_ul = re.sub("</?ul>", "", d1)
print("#{}#".format(d1_remove_ul))
d1_remove_ul = d1_remove_ul.strip()
print("#{}#".format(d1_remove_ul))
d1_lines = re.split("<li>", d1_remove_ul)

d1_cleaned = []
for l in d1_lines:
    print(l)
    l_cleaned = re.sub("</li>", "", l)
    d1_cleaned.append(l_cleaned)

print()
for l2 in d1_cleaned:
    print(l2)

<ul> <li>Half around protect</li> <li>Made from durable plastic.</li> <li>Keep your phone safe and protected in style.</li> <li>Shock absorbent, shatterproof&nbsp;material.&nbsp;</li> <li>Applicaton and removal are&nbsp;easy.</li> <li>Painted pattern, add the art feeling to your smartphone</li> </ul> 
# <li>Half around protect</li> <li>Made from durable plastic.</li> <li>Keep your phone safe and protected in style.</li> <li>Shock absorbent, shatterproof&nbsp;material.&nbsp;</li> <li>Applicaton and removal are&nbsp;easy.</li> <li>Painted pattern, add the art feeling to your smartphone</li>  #
#<li>Half around protect</li> <li>Made from durable plastic.</li> <li>Keep your phone safe and protected in style.</li> <li>Shock absorbent, shatterproof&nbsp;material.&nbsp;</li> <li>Applicaton and removal are&nbsp;easy.</li> <li>Painted pattern, add the art feeling to your smartphone</li>#

Half around protect</li> 
Made from durable plastic.</li> 
Keep your phone safe and protected in style.</li

In [6]:
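def clean_desc(x):
    """Return the <li> entries of an HTML list description as a list of strings."""
    # Remove the opening and closing <ul> tags, then trim surrounding whitespace
    x_cleaned = re.sub("</?ul>", "", x)
    x_cleaned = x_cleaned.strip()
    # Split on <li>; the first element is an empty string because the text starts with <li>
    x_entries = re.split("<li>", x_cleaned)

    x_lines = []
    for ln in x_entries:
        # Drop the closing </li> tag from each entry
        x_cleaned_to_return = re.sub("</li>", "", ln)
        x_lines.append(x_cleaned_to_return)
    return x_lines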
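Because each description starts with `<li>` once the `<ul>` tags are removed, the list returned by `clean_desc` begins with an empty string, and some entries keep a trailing space. A hypothetical variant, not part of the original notebook, that trims each entry and drops the blank ones could look like this:

```python
import re

def clean_desc_nonempty(x):
    """Hypothetical variant of clean_desc: trims entries and drops empty ones."""
    # Remove the <ul> tags and trim, as in clean_desc.
    x_cleaned = re.sub(r"</?ul>", "", x).strip()
    # Split on <li>; the text before the first <li> becomes an empty entry.
    entries = re.split(r"<li>", x_cleaned)
    # Remove closing tags and surrounding whitespace from each entry.
    lines = (re.sub(r"</li>", "", entry).strip() for entry in entries)
    # Keep only entries that still contain text.
    return [ln for ln in lines if ln]

sample = "<ul> <li>Half around protect</li> <li>Made from durable plastic.</li> </ul>"
result = clean_desc_nonempty(sample)
print(result)
```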

In [7]:
for d in desc5:
    print(d)
    for d_c in clean_desc(d):
        print(d_c)
    print()

<ul> <li>Half around protect</li> <li>Made from durable plastic.</li> <li>Keep your phone safe and protected in style.</li> <li>Shock absorbent, shatterproof&nbsp;material.&nbsp;</li> <li>Applicaton and removal are&nbsp;easy.</li> <li>Painted pattern, add the art feeling to your smartphone</li> </ul> 

Half around protect 
Made from durable plastic. 
Keep your phone safe and protected in style. 
Shock absorbent, shatterproof&nbsp;material.&nbsp; 
Applicaton and removal are&nbsp;easy. 
Painted pattern, add the art feeling to your smartphone

<ul><li>Excellent quality and fashion design</li><li>Cute Pattern</li><li>Secure fit for your case</li></ul>

Excellent quality and fashion design
Cute Pattern
Secure fit for your case

<ul> <li>Premium look and feel</li> <li>Perfect and attractive decorate your device</li> <li>Protect your device from scratches/shock and dust</li> <li>Show off natural beauty of your device's design</li> <li>Durable and fashionable</li> <li>All round protection</li>

**References**

- [GitHub / minhcp](https://github.com/minhcp/CIKMCup17) for the dataset
- Python for Data Analysis, 2nd Edition, McKinney (2017)