# **Removing HTML Tags from Text in Python: 2 Best Practices for Data Cleaning**

## Introduction

The internet holds a wealth of textual data, and working with it efficiently is a core skill for any NLP practitioner. Much of that text, however, arrives wrapped in HTML markup, and finding the content inside it can feel like searching for a needle in a digital haystack.

In this post, I'll walk you through reading text from a URL and then show two ways to strip the HTML markup from it: BeautifulSoup and a regular expression.

Let's dive in! 🌟

## Read Text from a URL

In [16]:
from urllib import request

In [17]:
url = "https://www.gutenberg.org/files/14469/14469-h/14469-h.htm"

# Read the raw HTML from the URL; read() returns bytes, not str
html = request.urlopen(url).read()

Now let's look at the first 100 bytes of the response:

In [18]:
# Show the first 100 bytes
html[:100]

b'<?xml version="1.0" encoding="ISO-8859-1"?>\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transition'

As you can see, the response is raw HTML, returned as bytes. There are several ways to strip the markup from the text; this post covers two of them.
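
One detail worth noting before cleaning: the `b'...'` prefix means `read()` returned bytes, and the XML declaration at the top of the page names its encoding. As a minimal sketch, the snippet below pulls that name out and uses it to decode the bytes. `sniff_encoding` is a hypothetical helper, not part of `urllib`, and it assumes the page starts with an XML declaration; it ignores HTTP headers and `<meta>` tags.

```python
import re

def sniff_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in a leading XML declaration, else `default`.

    Hypothetical helper for illustration only.
    """
    match = re.match(rb'\s*<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    return match.group(1).decode("ascii") if match else default

# The first 100 bytes of the page, as shown above
sample = b'<?xml version="1.0" encoding="ISO-8859-1"?>\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transition'

encoding = sniff_encoding(sample)
decoded = sample.decode(encoding)  # bytes -> str
print(encoding, type(decoded).__name__)
# ISO-8859-1 str
```

Decoding with the declared encoding avoids guessing, which matters as soon as the text contains accented characters.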

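## 1. HTML Cleaning Using BeautifulSoup

BeautifulSoup parses the document into a tree, and its `get_text()` method returns the text content with the tags dropped.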

In [19]:
from bs4 import BeautifulSoup

In [20]:
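# Parse the HTML and extract the text with BeautifulSoup's get_text() method.
# No parser is named here, so bs4 picks the best one installed and emits a warning;
# pass one explicitly, e.g. BeautifulSoup(html, "html.parser"), for reproducible results.
clean_1 = BeautifulSoup(html).get_text()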

clean_1[:100]

'\n\n\r\n      The Project Gutenberg eBook of The English Novel, by George Saintsbury.\r\n    \n\n\n\n\r\n\r\nThe P'
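
The extracted text still carries the page's line breaks and indentation. Here is a minimal sketch of tidying it, assuming you don't need the original layout: `str.split()` with no arguments splits on any run of whitespace, so re-joining with single spaces collapses it. `normalize_whitespace` is a hypothetical helper name, and `sample` reuses the output above.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space and trim the ends."""
    return " ".join(text.split())

# The first 100 characters of clean_1, as shown above
sample = '\n\n\r\n      The Project Gutenberg eBook of The English Novel, by George Saintsbury.\r\n    \n\n\n\n\r\n\r\nThe P'

print(normalize_whitespace(sample))
# The Project Gutenberg eBook of The English Novel, by George Saintsbury. The P
```

BeautifulSoup can do something similar itself: `get_text(separator=" ", strip=True)` strips each text fragment and joins the fragments with the separator.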

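## 2. HTML Cleaning Using Regex

A regular expression can strip tags without building a parse tree. The page was read as bytes, so we decode it to a string first.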

In [21]:
import re

In [22]:
try:
    # Attempt to decode the HTML bytes to a string using UTF-8
    html_str = html.decode('utf-8')
except UnicodeDecodeError:
    # If UTF-8 decoding fails, fall back to Latin-1 (ISO-8859-1),
    # the encoding this page declares in its first line
    html_str = html.decode('latin-1')

# Define the regex pattern: "<", one or more characters that are not ">", then the closing ">"
clean_re = re.compile(r'<[^>]+>')

# Apply the pattern to the string
clean_2 = clean_re.sub("", html_str)

clean_2[:100]

'\r\n\r\n\r\n\r\n  \r\n    \r\n      The Project Gutenberg eBook of The English Novel, by George Saintsbury.\r\n   '
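
A single tag pattern leaves a few things behind: the contents of `<script>` and `<style>` blocks, comments, and entities such as `&amp;`. The sketch below shows one way to handle them with the standard library only. `strip_html` is a hypothetical helper and the sample markup is made up for illustration. Regexes remain a heuristic on HTML, so a parser is the safer choice for messy pages.

```python
import html as html_lib  # aliased so it does not shadow the `html` bytes variable
import re

# Whole <script>/<style> elements, including their contents
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# HTML comments, which may span several lines
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Any remaining tag: "<", one or more non-">" characters, then ">"
TAG_RE = re.compile(r"<[^>]+>")

def strip_html(markup: str) -> str:
    """Remove script/style blocks, comments and tags, then decode entities."""
    text = SCRIPT_STYLE_RE.sub("", markup)
    text = COMMENT_RE.sub("", text)
    text = TAG_RE.sub("", text)
    return html_lib.unescape(text)

# Made-up markup for illustration
sample = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><!-- note --><p>Fielding &amp; Richardson</p></body></html>"
)
print(strip_html(sample))
# Fielding & Richardson
```

The order matters here: script and style blocks go first so their contents are removed along with their tags, and entities are decoded last so that a decoded `<` is never mistaken for the start of a tag.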

With the HTML tags removed, the text is ready for the next step: tokenization. Splitting the text into words or sentences is the starting point for analysis, such as counting word frequencies or looking for recurring patterns.
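
As a rough illustration of that next step, here is a minimal tokenizer built on a regular expression, followed by a frequency count. It is a sketch, not a substitute for a proper tokenizer such as the ones in NLTK. `tokenize` is a hypothetical helper, and the sample sentence stands in for the cleaned text.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase the text and return runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Stand-in for the cleaned text
sample = "The Project Gutenberg eBook of The English Novel, by George Saintsbury."

tokens = tokenize(sample)
counts = Counter(tokens)

print(tokens[:5])             # ['the', 'project', 'gutenberg', 'ebook', 'of']
print(counts.most_common(1))  # [('the', 2)]
```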