# Handling HTML Text

We can use the Beautiful Soup package to clean web data.

In [None]:
from bs4 import BeautifulSoup

In [None]:
# The PyPI package is named beautifulsoup4 (it provides the bs4 module)
# !pip install beautifulsoup4

In [None]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

This Python function, `strip_html_tags`, removes HTML tags from a given text, leaving only the plain text content. Here's a breakdown of the code:

1. **`def strip_html_tags(text):`**

   * This line defines a function named `strip_html_tags` that takes a single argument `text`, which is expected to be a string containing HTML content.

2. **`soup = BeautifulSoup(text, "html.parser")`**

   * The `BeautifulSoup` class comes from the `bs4` package (Beautiful Soup 4), which is used to parse HTML and XML documents.
   * The `text` is passed to `BeautifulSoup`, and the "html.parser" specifies that the HTML content in `text` should be parsed using Python's built-in HTML parser.
   * `soup` is now a BeautifulSoup object representing the HTML structure of the input string.

3. **`stripped_text = soup.get_text()`**

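   * The `get_text()` method of the BeautifulSoup object extracts the text content from the parsed HTML and returns it as a plain string: the tags are dropped and character entities such as `&lt;` are decoded to the characters they stand for. Recent Beautiful Soup versions also leave out the contents of `<script>` and `<style>` elements when the method is called on the whole document. By default nothing is inserted where a tag used to be, so text on either side of a `<br>` is joined directly; passing a separator, as in `soup.get_text(" ")`, keeps such pieces apart.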

4. **`return stripped_text`**

   * Finally, the function returns the `stripped_text`, which is the original content from the `text` argument but without any HTML tags.
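
As a rough illustration of what tag stripping involves, the sketch below drives Python's built-in `html.parser` module directly (the same parser that `"html.parser"` selects). It is not how Beautiful Soup is implemented; `TextCollector` and `strip_html_tags_stdlib` are hypothetical names introduced here, and joining the text chunks with a space is an assumption made so that words on either side of a `<br>` do not run together.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps the text between tags and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; before handle_data
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of text found between tags
        self.chunks.append(data)


def strip_html_tags_stdlib(text, separator=" "):
    parser = TextCollector()
    parser.feed(text)
    parser.close()
    # Trim each chunk and skip the ones that are only whitespace
    return separator.join(c.strip() for c in parser.chunks if c.strip())


print(strip_html_tags_stdlib("<span>Display<br> - Bright &lt;-&gt; dim</span>"))
# Display - Bright <-> dim
```

This sketch also keeps the contents of `<script>` and `<style>` elements, so it is only suitable for simple fragments.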


In [None]:
strip_html_tags('<html><h2>Some important text</h2></html>')

In [None]:
strip_html_tags('''<span class="a-size-base">IOS 17.5 (current version updated from 16.6)<br>Thw worst thing as a ios user for me that the fear of update , i heared some updates get display issues battery drain issues , god sake i doesn't had any issues from ios updates , but till i care<br>- device  was great, super handy for me, little feeling of weight .<br>Display<br> - The 60Hz refresh rate didn't felt badly form me cause i am used such kind of device before.<br> - The brightness Range is so comfortable.<br> - The 2160p60 HDR videos in youtube give blissful experiance<br>Battery And charging<br> - The charging time:- i took 1Hour to charge  from 20% to 91% using apple 20w original adapter and cable without any interruption(no electricity break or disconnection ,calls etc)<br> - Battery backup - I am mostly maintain battery in btw 20% &lt;-&gt; 80% so i think this may get one day backup for me ,Average 5h screen on time<br>The one day backup is just a advertisemnt advertisement label usaully we use, actually a day like  events,party,outing  we use -camera ,editing app, social media app more than usual use , that day i didn't get a satisfied backup .<br> - Little temperature warming while charging , no heating issue<br>- battery health after 6 month 98%<br>-battery cycle count 194  after 6 month<br>Perfomance<br> - Good perfomance for video editing ,photography , social media . (I am not a gamer )<br> Camera<br>-  Even though i  felt some lag or stuck screens in some third party applications<br> - The cinematic Mode is mind Blowing . i missed 13 pro or higher varients for zooming in this mode .<br> - Actually i really miss pro variant for the tele lens and camera features while i using this . Cost is wall for me.<br> - The photos are nice and feature like exposure adjust and filter was good and quality photos . And the 5x zoomed photos is not good for me.it reduced the pixel quality .<br>- The video and optical image stabilisation was mind blowing . 
i mostly explore on videos . it was soo good , no words other than that.<br>AND THE BIGGEST FEAR FOR ME NOW IS IOS UPDATE<br>i turned off the auto update at first place .<br>i heared about DISPLAY ISSUES,HEATING ISSUES and BUGS in the ios updates for 13 and 14 series.<br>So for expencive and established brand like apple does not fair . This is the only -ve i have to say as a reminder .Also i heared without ios update display issues are happened . Buy the way it is a hardware issue one of the display provider company's batch may show the issues . but we customers cannot identify which company provides the hardaware.<br>I will attach my battery analytics data screnshot photo here so you can better understand how the battery is good<br> The ordering experience in the greate indian sale was worst . I planned to buy this for 45999/- in oct7 greate indian sale offer but i got canceled then i bought for 48999 that was disppointed .<br>As a whole i didn't recommend base models ,i felt i must have buy pro variant but for person who need ios in affordable price i recommend 13 is better than 14</span>''')

# Removing Special Characters

* The function `remove_special_characters` removes special characters from a string using regex.
* If `remove_digits` is `False` (default), it keeps letters, digits, and spaces, and removes other characters.
* If `remove_digits` is `True`, it removes digits as well, leaving only letters and spaces. It uses `re.sub` to replace matching characters with an empty string.


The `re` in the code is Python's built-in `re` module, which provides support for regular expressions.

* **`re.sub(pattern, '', text)`**: This function finds every match of `pattern` in `text` and replaces it with an empty string (`''`), effectively removing it.
* **`pattern = r'[^a-zA-Z0-9\s]'`**: This regular expression matches any character that is **not** an ASCII letter (A-Z, a-z), a digit (0-9), or whitespace (`\s`). Accented letters such as `é` fall outside these ranges, so they are removed as well.
* **`pattern = r'[^a-zA-Z\s]'`**: Used when `remove_digits=True`. It matches any character that is **not** an ASCII letter or whitespace, so digits are removed along with the other special characters.


In [None]:
import re

In [None]:
def remove_special_characters(text, remove_digits=False):
    #Using regex
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

In [None]:
remove_special_characters("Well this was fun! What do you think? 123#@!", remove_digits=False)

In [None]:
remove_special_characters('Sómě Áccěntěd těxt')
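
The call above deletes accented letters such as `ó` and `ě`, because the pattern whitelists only ASCII letters. One way around this is to strip the accents first, as the next section does. Another, sketched below under the assumption that accented letters should be kept as they are, is a Unicode-aware pattern. `remove_special_characters_unicode` is a hypothetical name, not part of the original notebook.

```python
import re


def remove_special_characters_unicode(text):
    # Hypothetical variant: \w matches letters, digits and "_" in any script,
    # so accented letters survive; "_" is removed explicitly by the second branch.
    return re.sub(r'[^\w\s]|_', '', text)


sample = "Sómě Áccěntěd těxt, 100% fun!"

# ASCII-only pattern from the notebook: accented letters are lost
print(re.sub(r'[^a-zA-Z0-9\s]', '', sample))
# Sm ccntd txt 100 fun

# Unicode-aware pattern: only punctuation and symbols are removed
print(remove_special_characters_unicode(sample))
# Sómě Áccěntěd těxt 100 fun
```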

# Removing Accented Characters

The function `remove_accented_chars` removes accented characters from a string, leaving only the non-accented characters. Here's how it works:

### Breakdown:

```python
text = unicodedata.normalize('NFKD', text)
```

* **`unicodedata.normalize('NFKD', text)`**:

  * This normalizes the text to **NFKD** (Normalization Form KD, compatibility decomposition), in which each accented character is decomposed into its base character followed by a separate combining accent.
  * For example, the single character "é" becomes two characters: "e" followed by the combining acute accent (U+0301).

```python
.encode('ascii', 'ignore')
```

* **`.encode('ascii', 'ignore')`**:

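  * This encodes the normalized text as ASCII bytes, silently dropping every character that has no ASCII representation.
  * After decomposition the base letter "e" is plain ASCII and is kept, while the combining accent is not and is dropped. Characters that have no decomposition, such as "ø" or "ß", are removed entirely.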

```python
.decode('utf-8', 'ignore')
```

* **`.decode('utf-8', 'ignore')`**:

  * This decodes the ASCII bytes back into a regular Python string. Because every ASCII byte sequence is also valid UTF-8, decoding with `'utf-8'` always succeeds here.

### Final Output:

The function returns a string without any accented characters, leaving only the non-accented versions.
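
The three steps can be inspected one at a time. The snippet below is a small sketch using only the standard library; the sample word and the variable names are chosen here for illustration.

```python
import unicodedata

word = "caf\u00e9"  # "café" written with the single precomposed character é (U+00E9)

# Step 1: NFKD splits é into "e" plus a combining accent, so the string grows
decomposed = unicodedata.normalize("NFKD", word)
print(len(word), len(decomposed))
# 4 5
print([unicodedata.name(ch) for ch in decomposed[-2:]])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

# Step 2: encoding to ASCII with 'ignore' drops the combining accent
ascii_bytes = decomposed.encode("ascii", "ignore")
print(ascii_bytes)
# b'cafe'

# Step 3: decoding turns the bytes back into a str
print(ascii_bytes.decode("utf-8"))
# cafe

# A letter with no decomposition (ø, U+00F8) disappears completely
lost = unicodedata.normalize("NFKD", "\u00f8").encode("ascii", "ignore")
print(lost)
# b''
```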



In [None]:
import unicodedata

In [None]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

In [None]:
remove_accented_chars('Sómě Áccěntěd těxt')




# Text Lemmatization

In [None]:
import nltk

nltk.download('punkt')
nltk.download('wordnet')

The code snippet above uses the **Natural Language Toolkit (NLTK)**, a Python library for working with human language data (text). Here's an explanation of each part:

### `import nltk`

* **`import nltk`**: This imports the NLTK library into your Python script, allowing you to use its functionalities for tasks like tokenization, stemming, lemmatization, and more.

### `nltk.download('punkt')`

* **`nltk.download('punkt')`**: This downloads the **Punkt tokenizer models**, which NLTK uses to split text into sentences; `nltk.word_tokenize` relies on them before splitting each sentence into words. Tokenization is a key step in many NLP tasks.

  * The 'punkt' package contains pre-trained models for several languages that identify sentence boundaries. Recent NLTK releases may ask for the `punkt_tab` resource instead.

### `nltk.download('wordnet')`

* **`nltk.download('wordnet')`**: This downloads the **WordNet lexical database**, which is used for tasks like **lemmatization**. WordNet groups English words into sets of synonyms (synsets) and provides semantic relationships between words, such as antonyms or hyponyms.

  * By downloading WordNet, you can use NLTK functions to look up meanings, synonyms, and relationships between words.

### Summary:

* `nltk.download('punkt')` is for tokenization: its models find sentence boundaries, which NLTK's tokenizers use when splitting text into sentences and words.
* `nltk.download('wordnet')` is for working with the WordNet database, used for lemmatization and understanding word meanings and relationships.


In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
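def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    # lemmatize() treats each word as a noun by default (pos='n');
    # split() breaks on whitespace, so punctuation stays attached to words
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])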

In [None]:
"My system keeps crashing, his crashed yesterday, ours crashes daily".split()

In [None]:
".".join(["a","b"])

In [None]:
lemmatize_text("My system keeps crashing, his crashed yesterday, ours crashes daily")
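
Because `lemmatize_text` splits on whitespace, a token such as `crashing,` keeps its trailing comma and is looked up with the punctuation attached. One possible remedy, sketched here with the standard library only, is to separate punctuation before lemmatizing. `simple_tokenize` is a hypothetical helper and its regular expression is an assumption, not NLTK's tokenizer.

```python
import re


def simple_tokenize(text):
    # Hypothetical helper: runs of word characters become one token each,
    # and every other non-space character becomes its own token
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "My system keeps crashing, his crashed yesterday"

print(sentence.split())
# ['My', 'system', 'keeps', 'crashing,', 'his', 'crashed', 'yesterday']

print(simple_tokenize(sentence))
# ['My', 'system', 'keeps', 'crashing', ',', 'his', 'crashed', 'yesterday']
```

The resulting tokens could then be passed to the lemmatizer one by one in place of `text.split()`.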

# Text Stemming

In [None]:
def simple_stemmer(text):
    # Porter stemmer: strips common suffixes by rule; stems need not be dictionary words
    ps = nltk.stem.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

In [None]:
simple_stemmer("My system keeps crashing his crashed yesterday, ours crashes daily")

# Working with Emojis

In [None]:
!pip install emoji --quiet

In [None]:
import emoji

In [None]:
#input data
input_text = 'He is 😳'

In [None]:
#Replace emoji icon with text
output_text = emoji.demojize(input_text)
output_text

In [None]:
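# Remove the ':' delimiters that demojize() puts around each emoji name
# (note: this also removes any other ':' already present in the text)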
output_text = output_text.replace(':','')
output_text
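
For a sense of what replacing an emoji with text involves, here is a standard-library sketch that substitutes each symbol with its official Unicode character name. It is a simplified stand-in, not the `emoji` package's algorithm: `describe_symbols` is a hypothetical helper, it assumes that every character in the Unicode category `So` (Symbol, other) should be replaced, and it does not combine multi-character sequences such as flags or skin-tone variants.

```python
import unicodedata


def describe_symbols(text):
    # Hypothetical helper: swap each "Symbol, other" character for its Unicode name
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")  # empty string if the character is unnamed
            pieces.append(name.lower().replace(" ", "_"))
        else:
            pieces.append(ch)
    return "".join(pieces)


print(describe_symbols("He is 😳"))
# He is flushed_face
```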