<h2>Introduction</h2>
<div style="font-size:18px;font-family:Calibri">
    Today, we'll cover strategies for transforming text into information-rich features, starting with the out-of-the-box string methods and libraries that have become ubiquitous in Natural Language Processing tasks.
</div>

In [170]:
# Importing the required libraries
import re
from bs4 import BeautifulSoup
import unicodedata
import sys

In [172]:
txt = ["   Interrobang. By Aishwarya Henriette     ",
 "Parking And Going. By Karl Gautier",
 "    Today Is The night. By Jarek Prakash   "]

# Strip leading and trailing whitespace from each string
txt = [sttr.strip() for sttr in txt]
txt

['Interrobang. By Aishwarya Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']

In [14]:
# Remove periods
txt = [sttr.replace('.','') for sttr in txt]
txt

['Interrobang By Aishwarya Henriette',
 'Parking And Going By Karl Gautier',
 'Today Is The night By Jarek Prakash']

In [20]:
def capitalizer(strr: str) -> str:
    # Despite the name, this uppercases every letter, not just the first
    return strr.upper()

txt = [capitalizer(strr) for strr in txt]
txt

['INTERROBANG BY AISHWARYA HENRIETTE',
 'PARKING AND GOING BY KARL GAUTIER',
 'TODAY IS THE NIGHT BY JAREK PRAKASH']

In [24]:
def replace_letter(strr: str) -> str:
    # Replace every ASCII letter with "X"; spaces and digits are left alone
    return re.sub(r"[a-zA-Z]", "X", strr)

[replace_letter(strr) for strr in txt]

['XXXXXXXXXXX XX XXXXXXXXX XXXXXXXXX',
 'XXXXXXX XXX XXXXX XX XXXX XXXXXXX',
 'XXXXX XX XXX XXXXX XX XXXXX XXXXXXX']
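<div style="font-size:18px;font-family:Calibri">
    The cleaning steps above (strip, remove periods, uppercase, mask letters) can also be chained into small reusable helpers. This is only a sketch: the names <code>clean_title</code> and <code>mask_letters</code> are hypothetical and do not appear in the cells above.
</div>

```python
import re


def clean_title(text: str) -> str:
    """Strip surrounding whitespace, drop periods, then uppercase."""
    return text.strip().replace(".", "").upper()


def mask_letters(text: str) -> str:
    """Replace every ASCII letter with 'X', leaving other characters alone."""
    return re.sub(r"[a-zA-Z]", "X", text)


raw_titles = ["   Interrobang. By Aishwarya Henriette     ",
              "Parking And Going. By Karl Gautier",
              "    Today Is The night. By Jarek Prakash   "]

cleaned = [clean_title(title) for title in raw_titles]
masked = [mask_letters(title) for title in cleaned]
print(cleaned)
print(masked)
```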

In [44]:
s = "machine learning in python cookbook"

In [28]:
s.find("n")

5

In [32]:
s.startswith('m')

True

In [34]:
s.endswith('k')

True

In [36]:
s.endswith("python")

False

In [42]:
s.isalnum()

False

In [46]:
s.isalpha()

False
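<div style="font-size:18px;font-family:Calibri">
    Both checks return False only because the string contains spaces, which are neither letters nor digits. A small sketch that tests each word separately, and also shows how <code>find</code> signals a missing substring:
</div>

```python
s = "machine learning in python cookbook"

# The whole string fails because of the spaces between words
whole_is_alpha = s.isalpha()

# Checking word by word sidesteps the spaces
words_are_alpha = all(word.isalpha() for word in s.split())

# find() returns -1 (not an exception) when the substring is absent
missing_index = s.find("z")

# The `in` operator is the simplest membership test
has_python = "python" in s

print(whole_is_alpha, words_are_alpha, missing_index, has_python)
```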

In [61]:
x = s.encode("utf-8")
x

b'machine learning in python cookbook'

In [63]:
x.decode("utf-8")

'machine learning in python cookbook'
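<div style="font-size:18px;font-family:Calibri">
    With pure ASCII text the encoded bytes look just like the string, so the round trip above hides what encoding does. A sketch with one accented character (the sample word is an assumption, not from the cells above) makes the difference visible:
</div>

```python
word = "caf\u00e9"  # 'café', written with an escape so the source stays ASCII

encoded = word.encode("utf-8")

# One character can take more than one byte in UTF-8
char_count = len(word)
byte_count = len(encoded)

# Decoding with the same codec restores the original string
round_trip = encoded.decode("utf-8")

# Decoding with the wrong codec fails unless an error handler is given
try:
    encoded.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# errors="replace" substitutes U+FFFD for bytes that cannot be decoded
lossy = encoded.decode("ascii", errors="replace")

print(encoded, char_count, byte_count, round_trip, ascii_failed)
```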

<h2>Parsing & Cleaning HTML</h2>
<div style="font-size:18px;font-family:Calibri">
    Use Beautiful Soup's extensive set of options to parse HTML and extract text from it.
</div>

In [71]:
# Create some HTML code
html = """
<div class='full_name'><span style='font-weight:bold'>Masego</span> Azra</div>
"""

# Parse the HTML with the lxml parser
soup = BeautifulSoup(html, "lxml")
soup

<html><body><div class="full_name"><span style="font-weight:bold">Masego</span> Azra</div>
</body></html>

In [77]:
soup.find("div", {"class": "full_name"}).text

'Masego Azra'

In [89]:
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(''.join(doc), "lxml")
soup

<html><head><title>Page title</title></head><body><p align="center" id="firstpara">This is paragraph <b>one</b>.</p><p align="blah" id="secondpara">This is paragraph <b>two</b>.</p></body></html>

In [91]:
print(soup.prettify())

<html>
 <head>
  <title>
   Page title
  </title>
 </head>
 <body>
  <p align="center" id="firstpara">
   This is paragraph
   <b>
    one
   </b>
   .
  </p>
  <p align="blah" id="secondpara">
   This is paragraph
   <b>
    two
   </b>
   .
  </p>
 </body>
</html>

In [101]:
soup.contents[0].title

<title>Page title</title>

In [113]:
soup.contents[0].title.parent.name

'head'

In [127]:
soup.contents[0].title.parent.next_sibling.contents[0]

<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>

In [147]:
soup.find_all('p', {"align": "center"})[0]['id']

'firstpara'

In [157]:
soup.find('p').b.string

'one'
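<div style="font-size:18px;font-family:Calibri">
    When only the visible text is needed and Beautiful Soup or lxml is not installed, the standard library's <code>html.parser</code> can do a rough version of the same extraction. Note that this swaps in a different technique from the cells above, and <code>TextExtractor</code> is a hypothetical helper class written for this sketch.
</div>

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for each run of text between tags
        self.chunks.append(data)

    def text(self) -> str:
        return "".join(self.chunks)


html_doc = ("<div class='full_name'>"
            "<span style='font-weight:bold'>Masego</span> Azra</div>")

extractor = TextExtractor()
extractor.feed(html_doc)
extractor.close()

full_name = extractor.text()
print(full_name)
```

<div style="font-size:18px;font-family:Calibri">
    Unlike Beautiful Soup, this sketch cannot search by tag or attribute; it simply concatenates all text in document order.
</div>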

<h2>Removing Punctuation</h2>
<div style="font-size:18px;font-family:Calibri">
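    We'll use the unicodedata and sys modules from the standard library to build a translation table that maps every Unicode punctuation character to None, then remove those characters with str.translate.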
</div>

In [163]:
txt = ['Hi!!!! I. Love. This. Song....',
 '10000% Agree!!!! #LoveIT',
 'Right?!?!']

In [184]:
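# Map every punctuation code point (Unicode category P*) to None; translate() then deletes them
punc = dict.fromkeys(i for i in range(sys.maxunicode + 1)
                     if unicodedata.category(chr(i)).startswith('P'))
[strr.translate(punc) for strr in txt]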

['Hi I Love This Song', '10000 Agree LoveIT', 'Right']
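<div style="font-size:18px;font-family:Calibri">
    Building the table scans more than a million code points, so it is worth building once and reusing. A sketch under that assumption; <code>PUNCT_TABLE</code> and <code>strip_punctuation</code> are hypothetical names. It also shows that only category P characters are removed, so currency and math symbols survive.
</div>

```python
import sys
import unicodedata

# Built once at import time: every punctuation code point maps to None
PUNCT_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)


def strip_punctuation(text: str) -> str:
    """Delete all Unicode punctuation (categories Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    return text.translate(PUNCT_TABLE)


# Non-ASCII punctuation is covered too (U+00BF is the inverted question mark)
spanish = strip_punctuation("\u00bfRight?!")

# '$' (category Sc) and '+' (category Sm) are symbols, not punctuation
symbols = strip_punctuation("$5 + tax!")

print(spanish, "|", symbols)
```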