# Text Cleanup EDA

As with any user-generated text on the internet, the data contains some noise and needs preprocessing. This can include:

* Removing remnants of scraped HTML, such as tags and line breaks
* Unescaping HTML entities and replacing badly encoded characters
* Reducing unneeded text by removing stop words and lemmatizing

## Imports and Notebook Setup

In [10]:
import sys
import re
import numpy as np
import pandas as pd
import html.parser
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from bs4 import BeautifulSoup

In [2]:
pd.set_option('display.max_colwidth', None)

## Loading Data

In [3]:
df = pd.read_csv("../data/Reviews.csv")

In [6]:
review_text = df[['Summary', 'Text']]

In [4]:
sys.path.insert(0, '..')

%load_ext autoreload
%autoreload 2

# Text Issues

Taking a look at the data, we see that there are multiple HTML issues.

Scraped HTML retains the formatting tags that rich text uses for display on the web, such as line breaks (`<br/>`), links (`<a href=...>`), and lists (`<ol>`, `<ul>`). We can find examples with a regex match.

In [9]:
# HTML Tags
review_text[review_text['Text'].str.contains(r'<[^>]*>')].head()

Unnamed: 0,Summary,Text
10,The Best Hot Sauce in the World,"I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind! We picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away! When we realized that we simply couldn't find it anywhere in our city we were bummed.<br /><br />Now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it.<br /><br />If you love hot sauce..I mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of Tequila Picante Gourmet de Inclan. Just realize that once you taste it, you will never want to use any other sauce.<br /><br />Thank you for the personal, incredible service!"
18,GREAT SWEET CANDY!,"Twizzlers, Strawberry my childhood favorite candy, made in Lancaster Pennsylvania by Y & S Candies, Inc. one of the oldest confectionery Firms in the United States, now a Subsidiary of the Hershey Company, the Company was established in 1845 as Young and Smylie, they also make Apple Licorice Twists, Green Color and Blue Raspberry Licorice Twists, I like them all<br /><br />I keep it in a dry cool place because is not recommended it to put it in the fridge. According to the Guinness Book of Records, the longest Licorice Twist ever made measured 1.200 Feet (370 M) and weighted 100 Pounds (45 Kg) and was made by Y & S Candies, Inc. This Record-Breaking Twist became a Guinness World Record on July 19, 1998. This Product is Kosher! Thank You"
21,TWIZZLERS,"I bought these for my husband who is currently overseas. He loves these, and apparently his staff likes them also.<br />There are generous amounts of Twizzlers in each 16-ounce bag, and this was well worth the price. <a href=""http://www.amazon.com/gp/product/B001GVISJM"">Twizzlers, Strawberry, 16-Ounce Bags (Pack of 6)</a>"
24,Please sell these in Mexico!!,"I have lived out of the US for over 7 yrs now, and I so miss my Twizzlers!! When I go back to visit or someone visits me, I always stock up. All I can say is YUM!<br />Sell these in Mexico and you will have a faithful buyer, more often than I'm able to buy them right now."
25,Twizzlers - Strawberry,"Product received is as advertised.<br /><br /><a href=""http://www.amazon.com/gp/product/B001GVISJM"">Twizzlers, Strawberry, 16-Ounce Bags (Pack of 6)</a>"


The text also contains HTML character entities that were never unescaped. For instance, punctuation such as quotes and ampersands can appear as `&quot;` and `&amp;`, and accented characters such as ñ may appear as `&ntilde;`. Some examples are shown below.

In [8]:
# HTML character entities that still need unescaping
review_text[review_text['Text'].str.contains(r'&[a-z]+;')].head()

Unnamed: 0,Summary,Text
999,"Not hot, not habanero","I have to admit, I was a sucker for the large quantity, 12 oz, when shopping for hot sauces ...but now seeing the size of the bottle, it reminds of wing-sauce bottle sizes. Plastic bottle. It does have a convenient squirt top. But overall, not very hot or tasty, and made mostly from jalape&ntilde;os. If I had seen the ingredients list I would not have bought it:<br />Jalapenos<br />Water<br />Vinegar<br />Brown Sugar<br />Lime Juice<br />Fish Sauce<br />Cilantro<br />Habanero<br />Garlic<br />Spice Blend<br />Salt<br />Potassium Sorbate<br />Xanthan Gum"
1243,WOW Make your own 'slickers' !,"I just received my shipment and could hardly wait to try this product. We love &quot;slickers&quot; which is what we call them, instead of stickers because they can be removed so easily. My daughter designed signs to be printed in reverse to use on her car windows. They printed beautifully (we have 'The Print Shop' program). I am going to have a lot of fun with this product because there are windows everywhere and other surfaces like tv screens and computer monitors."
1461,One of my favorites,I love the McDougall Asian Entr&eacute;es and although I haven't tried all of them this one is amazing. It has a really peanutty flavor that you wouldn't expect in a product containing only 3 grams of fat per serving. I would imagine these would make a great dinner added to a pound of mixed stir-fry vegetables...I am buying a case now to test just that! The peanut flavor is strong enough that I bet it'd distribute among a fair amount of vegetables or tofu when added so you wouldn't get very saucy noodles and dry veggies.<br /><br />I recommend this product. The best-tasting peanut noodle you'll be able to find for only 3 grams of fat.
1471,Noodles not good,The noodles for this product are what make me gag. I am not a fan of this meal at all and I &lt;3 Asian food. For those looking for a delicious snack who are not already 100% vegan - I would either A) go try 1 first or B) skip buying this package as it is not delicious.
1615,Great chips - and tasty too,"Good flavorful chips - too bad the selection does not include jalape&ntilde;o or paramsan garlic. Packaging is difficult to open, but they fresh."
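
Note that the detection pattern `&[a-z]+;` only catches lowercase named entities. Whether this dataset also contains numeric entities (such as `&#39;`) is not shown above, so the sketch below is an assumption-driven check rather than a finding: it compares the notebook's pattern against a broader, hypothetical `ANY_ENTITY` pattern on a few hand-written strings, and shows that the standard library's `html.unescape` handles both kinds.

```python
import html
import re

# Pattern used in the notebook: lowercase named entities only
NAMED_ONLY = re.compile(r'&[a-z]+;')
# Hypothetical broader pattern: decimal, hexadecimal, and named entities
ANY_ENTITY = re.compile(r'&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);')

# Hand-written examples, not rows from the dataset
samples = ['jalape&ntilde;o', 'I &lt;3 it', 'don&#39;t', 'caf&#xE9;', 'plain text']

for s in samples:
    found_named = bool(NAMED_ONLY.search(s))
    found_any = bool(ANY_ENTITY.search(s))
    print(f'{s!r}: named_only={found_named}, any_entity={found_any}, '
          f'unescaped={html.unescape(s)!r}')
```

If numeric entities do turn up, only the detection regex needs widening; the unescaping step itself already covers them.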


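We need to tackle these issues before any further text processing. The order matters: stripping `<br />` tags outright would concatenate the words on either side of the break (for example, `bummed.<br /><br />Now` would become `bummed.Now`), so line breaks must be replaced with spaces first.

Thus, the cleanup should proceed in this order:

* Replace line breaks with spaces
* Remove the remaining HTML tags
* Unescape HTML entities
* Final preprocessing: stop-word removal and lemmatization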
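
The HTML-related steps in the list above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the helper name `clean_html` and the sample string are made up for the example, and a plain regex stands in for the BeautifulSoup-based tag removal that the notebook actually uses.

```python
import html
import re

BR_PATTERN = re.compile(r'<\s*br\s*/?\s*>', flags=re.IGNORECASE)
TAG_PATTERN = re.compile(r'<[^>]*>')  # crude stand-in for a real HTML parser


def clean_html(text):
    """Apply the HTML cleanup steps in the recommended order."""
    text = BR_PATTERN.sub(' ', text)   # 1. line breaks -> spaces
    text = TAG_PATTERN.sub('', text)   # 2. drop any remaining tags
    return html.unescape(text)         # 3. decode entities such as &amp;


# Hypothetical sample, loosely modelled on the reviews shown earlier
sample = 'All I can say is YUM!<br />Sell these &amp; I will buy, I &lt;3 them.'

print(clean_html(sample))
# Skipping step 1 glues the neighbouring words together:
print(TAG_PATTERN.sub('', sample))
```

Unescaping comes last so that decoded characters such as `<` are never mistaken for the start of a tag by the tag-removal step.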

# Text Cleanup

In [12]:
review_text['cleaned_text'] = review_text['Text']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':
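
The warning above appears because `review_text` was created by selecting columns from `df`, so pandas cannot tell whether the assignment is meant to modify `df` as well. One common way to avoid it is to take an explicit copy when creating the subset. The sketch below uses a toy DataFrame (`df_demo` and its contents are hypothetical) rather than the review data.

```python
import pandas as pd

# Toy stand-in for the reviews DataFrame
df_demo = pd.DataFrame({
    'Summary': ['Great chips'],
    'Text': ['Good flavorful chips<br />Would buy again'],
    'Score': [5],
})

# .copy() makes the subset an independent DataFrame,
# so adding a column to it is unambiguous
review_demo = df_demo[['Summary', 'Text']].copy()
review_demo['cleaned_text'] = review_demo['Text']

print(review_demo.columns.tolist())
print(df_demo.columns.tolist())  # the original frame is untouched
```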


In [14]:
# Replacing line breaks with spaces
review_text['cleaned_text'] = review_text['cleaned_text'] \
    .apply(lambda x: re.sub(r'< *br *\/?>', ' ', x))



In [16]:
review_text[review_text['Text'].str.contains(r'<[^>]*>')].head()

Unnamed: 0,Summary,Text,cleaned_text
10,The Best Hot Sauce in the World,"I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind! We picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away! When we realized that we simply couldn't find it anywhere in our city we were bummed.<br /><br />Now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it.<br /><br />If you love hot sauce..I mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of Tequila Picante Gourmet de Inclan. Just realize that once you taste it, you will never want to use any other sauce.<br /><br />Thank you for the personal, incredible service!","I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind! We picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away! When we realized that we simply couldn't find it anywhere in our city we were bummed. Now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it. If you love hot sauce..I mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of Tequila Picante Gourmet de Inclan. Just realize that once you taste it, you will never want to use any other sauce. Thank you for the personal, incredible service!"
18,GREAT SWEET CANDY!,"Twizzlers, Strawberry my childhood favorite candy, made in Lancaster Pennsylvania by Y & S Candies, Inc. one of the oldest confectionery Firms in the United States, now a Subsidiary of the Hershey Company, the Company was established in 1845 as Young and Smylie, they also make Apple Licorice Twists, Green Color and Blue Raspberry Licorice Twists, I like them all<br /><br />I keep it in a dry cool place because is not recommended it to put it in the fridge. According to the Guinness Book of Records, the longest Licorice Twist ever made measured 1.200 Feet (370 M) and weighted 100 Pounds (45 Kg) and was made by Y & S Candies, Inc. This Record-Breaking Twist became a Guinness World Record on July 19, 1998. This Product is Kosher! Thank You","Twizzlers, Strawberry my childhood favorite candy, made in Lancaster Pennsylvania by Y & S Candies, Inc. one of the oldest confectionery Firms in the United States, now a Subsidiary of the Hershey Company, the Company was established in 1845 as Young and Smylie, they also make Apple Licorice Twists, Green Color and Blue Raspberry Licorice Twists, I like them all I keep it in a dry cool place because is not recommended it to put it in the fridge. According to the Guinness Book of Records, the longest Licorice Twist ever made measured 1.200 Feet (370 M) and weighted 100 Pounds (45 Kg) and was made by Y & S Candies, Inc. This Record-Breaking Twist became a Guinness World Record on July 19, 1998. This Product is Kosher! Thank You"
21,TWIZZLERS,"I bought these for my husband who is currently overseas. He loves these, and apparently his staff likes them also.<br />There are generous amounts of Twizzlers in each 16-ounce bag, and this was well worth the price. <a href=""http://www.amazon.com/gp/product/B001GVISJM"">Twizzlers, Strawberry, 16-Ounce Bags (Pack of 6)</a>","I bought these for my husband who is currently overseas. He loves these, and apparently his staff likes them also. There are generous amounts of Twizzlers in each 16-ounce bag, and this was well worth the price. <a href=""http://www.amazon.com/gp/product/B001GVISJM"">Twizzlers, Strawberry, 16-Ounce Bags (Pack of 6)</a>"
24,Please sell these in Mexico!!,"I have lived out of the US for over 7 yrs now, and I so miss my Twizzlers!! When I go back to visit or someone visits me, I always stock up. All I can say is YUM!<br />Sell these in Mexico and you will have a faithful buyer, more often than I'm able to buy them right now.","I have lived out of the US for over 7 yrs now, and I so miss my Twizzlers!! When I go back to visit or someone visits me, I always stock up. All I can say is YUM! Sell these in Mexico and you will have a faithful buyer, more often than I'm able to buy them right now."
25,Twizzlers - Strawberry,"Product received is as advertised.<br /><br /><a href=""http://www.amazon.com/gp/product/B001GVISJM"">Twizzlers, Strawberry, 16-Ounce Bags (Pack of 6)</a>","Product received is as advertised. <a href=""http://www.amazon.com/gp/product/B001GVISJM"">Twizzlers, Strawberry, 16-Ounce Bags (Pack of 6)</a>"


In [17]:
# Remove the remaining HTML tags with BeautifulSoup, keeping only the text content
review_text['cleaned_text'] = review_text['cleaned_text'] \
    .apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())



In [18]:
review_text[review_text['Text'].str.contains(r'<[^>]*>')].head()

Unnamed: 0,Summary,Text,cleaned_text
10,The Best Hot Sauce in the World,"I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind! We picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away! When we realized that we simply couldn't find it anywhere in our city we were bummed.<br /><br />Now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it.<br /><br />If you love hot sauce..I mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of Tequila Picante Gourmet de Inclan. Just realize that once you taste it, you will never want to use any other sauce.<br /><br />Thank you for the personal, incredible service!","I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind! We picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away! When we realized that we simply couldn't find it anywhere in our city we were bummed. Now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it. If you love hot sauce..I mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of Tequila Picante Gourmet de Inclan. Just realize that once you taste it, you will never want to use any other sauce. Thank you for the personal, incredible service!"
18,GREAT SWEET CANDY!,"Twizzlers, Strawberry my childhood favorite candy, made in Lancaster Pennsylvania by Y & S Candies, Inc. one of the oldest confectionery Firms in the United States, now a Subsidiary of the Hershey Company, the Company was established in 1845 as Young and Smylie, they also make Apple Licorice Twists, Green Color and Blue Raspberry Licorice Twists, I like them all<br /><br />I keep it in a dry cool place because is not recommended it to put it in the fridge. According to the Guinness Book of Records, the longest Licorice Twist ever made measured 1.200 Feet (370 M) and weighted 100 Pounds (45 Kg) and was made by Y & S Candies, Inc. This Record-Breaking Twist became a Guinness World Record on July 19, 1998. This Product is Kosher! Thank You","Twizzlers, Strawberry my childhood favorite candy, made in Lancaster Pennsylvania by Y & S Candies, Inc. one of the oldest confectionery Firms in the United States, now a Subsidiary of the Hershey Company, the Company was established in 1845 as Young and Smylie, they also make Apple Licorice Twists, Green Color and Blue Raspberry Licorice Twists, I like them all I keep it in a dry cool place because is not recommended it to put it in the fridge. According to the Guinness Book of Records, the longest Licorice Twist ever made measured 1.200 Feet (370 M) and weighted 100 Pounds (45 Kg) and was made by Y & S Candies, Inc. This Record-Breaking Twist became a Guinness World Record on July 19, 1998. This Product is Kosher! Thank You"
21,TWIZZLERS,"I bought these for my husband who is currently overseas. He loves these, and apparently his staff likes them also.<br />There are generous amounts of Twizzlers in each 16-ounce bag, and this was well worth the price. <a href=""http://www.amazon.com/gp/product/B001GVISJM"">Twizzlers, Strawberry, 16-Ounce Bags (Pack of 6)</a>","I bought these for my husband who is currently overseas. He loves these, and apparently his staff likes them also. There are generous amounts of Twizzlers in each 16-ounce bag, and this was well worth the price. Twizzlers, Strawberry, 16-Ounce Bags (Pack of 6)"
24,Please sell these in Mexico!!,"I have lived out of the US for over 7 yrs now, and I so miss my Twizzlers!! When I go back to visit or someone visits me, I always stock up. All I can say is YUM!<br />Sell these in Mexico and you will have a faithful buyer, more often than I'm able to buy them right now.","I have lived out of the US for over 7 yrs now, and I so miss my Twizzlers!! When I go back to visit or someone visits me, I always stock up. All I can say is YUM! Sell these in Mexico and you will have a faithful buyer, more often than I'm able to buy them right now."
25,Twizzlers - Strawberry,"Product received is as advertised.<br /><br /><a href=""http://www.amazon.com/gp/product/B001GVISJM"">Twizzlers, Strawberry, 16-Ounce Bags (Pack of 6)</a>","Product received is as advertised. Twizzlers, Strawberry, 16-Ounce Bags (Pack of 6)"


In [19]:
# Unescape HTML entities such as &amp; using the documented html.unescape function
review_text['cleaned_text'] = review_text['cleaned_text'] \
    .apply(html.unescape)



In [20]:
review_text[review_text['Text'].str.contains(r'&[a-z]+;')].head()

Unnamed: 0,Summary,Text,cleaned_text
999,"Not hot, not habanero","I have to admit, I was a sucker for the large quantity, 12 oz, when shopping for hot sauces ...but now seeing the size of the bottle, it reminds of wing-sauce bottle sizes. Plastic bottle. It does have a convenient squirt top. But overall, not very hot or tasty, and made mostly from jalape&ntilde;os. If I had seen the ingredients list I would not have bought it:<br />Jalapenos<br />Water<br />Vinegar<br />Brown Sugar<br />Lime Juice<br />Fish Sauce<br />Cilantro<br />Habanero<br />Garlic<br />Spice Blend<br />Salt<br />Potassium Sorbate<br />Xanthan Gum","I have to admit, I was a sucker for the large quantity, 12 oz, when shopping for hot sauces ...but now seeing the size of the bottle, it reminds of wing-sauce bottle sizes. Plastic bottle. It does have a convenient squirt top. But overall, not very hot or tasty, and made mostly from jalapeños. If I had seen the ingredients list I would not have bought it: Jalapenos Water Vinegar Brown Sugar Lime Juice Fish Sauce Cilantro Habanero Garlic Spice Blend Salt Potassium Sorbate Xanthan Gum"
1243,WOW Make your own 'slickers' !,"I just received my shipment and could hardly wait to try this product. We love &quot;slickers&quot; which is what we call them, instead of stickers because they can be removed so easily. My daughter designed signs to be printed in reverse to use on her car windows. They printed beautifully (we have 'The Print Shop' program). I am going to have a lot of fun with this product because there are windows everywhere and other surfaces like tv screens and computer monitors.","I just received my shipment and could hardly wait to try this product. We love ""slickers"" which is what we call them, instead of stickers because they can be removed so easily. My daughter designed signs to be printed in reverse to use on her car windows. They printed beautifully (we have 'The Print Shop' program). I am going to have a lot of fun with this product because there are windows everywhere and other surfaces like tv screens and computer monitors."
1461,One of my favorites,I love the McDougall Asian Entr&eacute;es and although I haven't tried all of them this one is amazing. It has a really peanutty flavor that you wouldn't expect in a product containing only 3 grams of fat per serving. I would imagine these would make a great dinner added to a pound of mixed stir-fry vegetables...I am buying a case now to test just that! The peanut flavor is strong enough that I bet it'd distribute among a fair amount of vegetables or tofu when added so you wouldn't get very saucy noodles and dry veggies.<br /><br />I recommend this product. The best-tasting peanut noodle you'll be able to find for only 3 grams of fat.,I love the McDougall Asian Entrées and although I haven't tried all of them this one is amazing. It has a really peanutty flavor that you wouldn't expect in a product containing only 3 grams of fat per serving. I would imagine these would make a great dinner added to a pound of mixed stir-fry vegetables...I am buying a case now to test just that! The peanut flavor is strong enough that I bet it'd distribute among a fair amount of vegetables or tofu when added so you wouldn't get very saucy noodles and dry veggies. I recommend this product. The best-tasting peanut noodle you'll be able to find for only 3 grams of fat.
1471,Noodles not good,The noodles for this product are what make me gag. I am not a fan of this meal at all and I &lt;3 Asian food. For those looking for a delicious snack who are not already 100% vegan - I would either A) go try 1 first or B) skip buying this package as it is not delicious.,The noodles for this product are what make me gag. I am not a fan of this meal at all and I <3 Asian food. For those looking for a delicious snack who are not already 100% vegan - I would either A) go try 1 first or B) skip buying this package as it is not delicious.
1615,Great chips - and tasty too,"Good flavorful chips - too bad the selection does not include jalape&ntilde;o or paramsan garlic. Packaging is difficult to open, but they fresh.","Good flavorful chips - too bad the selection does not include jalapeño or paramsan garlic. Packaging is difficult to open, but they fresh."


In these examples, the cleaned text no longer contains the extraneous HTML tags and entities.

# Text Preprocessing

Finally, we can preprocess the text by lowercasing it, stripping digits and punctuation, removing stop words, and lemmatizing.

In [None]:
def preprocess(text, lemmatize=True, reg_pattern=r'[^A-Za-z]+', stopwords=stopwords.words('english')):
    """Lowercase, tokenize, drop stop words and short tokens, and optionally lemmatize."""
    lemmatizer = WordNetLemmatizer()
    regularizer = re.compile(reg_pattern)
    stop_set = set(stopwords)  # set membership is faster than scanning a list

    # replace every run of non-letter characters (digits, punctuation) with a space, then tokenize
    tokens = regularizer.sub(' ', text.lower()).split()

    stopped_tokens = [t for t in tokens if t not in stop_set]  # remove stop words

    long_tokens = [t for t in stopped_tokens if len(t) >= 2]  # remove single letters
    if lemmatize:
        lemmatized = [lemmatizer.lemmatize(t) for t in long_tokens]  # lemmatize words
    else:
        lemmatized = long_tokens

    # remove tokens made of one repeated character, e.g. 'xxxx'
    cleaned = [word for word in lemmatized if word != len(word) * word[0]]
    return ' '.join(cleaned)
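
To see what the regex tokenization and the final repeated-character filter in `preprocess` do without needing the NLTK corpora, the sketch below isolates those two steps. The helper names `tokenize` and `is_repeated_char` are hypothetical and exist only for this illustration; stop-word removal and lemmatization are deliberately left out.

```python
import re


def tokenize(text, reg_pattern=r'[^A-Za-z]+'):
    """Lowercase the text and split it on runs of non-letter characters."""
    return re.sub(reg_pattern, ' ', text.lower()).split()


def is_repeated_char(word):
    """True for tokens made of a single repeated character, such as 'xxxx'."""
    return word == len(word) * word[0]


tokens = tokenize('I <3 Asian food, 100% xxxx!!')
kept = [t for t in tokens if len(t) >= 2 and not is_repeated_char(t)]

print(tokens)  # digits and punctuation are gone, but 'i' and 'xxxx' remain
print(kept)    # single letters and repeated-character tokens are dropped
```

Because the pattern keeps letters only, numbers such as `100` disappear entirely; a different `reg_pattern` would be needed if numeric tokens matter for later features.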

We can apply this end-to-end cleanup to the text before further featurization or model fitting. For ease of use, the final cleanup functions are stored in `utils/text_proprocessing.py`, which any notebook can import as a module.