# NLP - Remove Special and Accented Characters

In [1]:
import pandas as pd
import unicodedata
import re

In [2]:
data = pd.read_csv('amazon_alexa.tsv', delimiter = '\t')

data.shape

(3150, 5)

In [3]:
data.head()

| | rating | date | variation | verified_reviews | feedback |
|---|---|---|---|---|---|
| 0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 |
| 1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 |
| 2 | 4 | 31-Jul-18 | Walnut Finish | Sometimes while playing a game, you can answer... | 1 |
| 3 | 5 | 31-Jul-18 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 ... | 1 |
| 4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 |


### Removing Accented characters

Accent marks (diacritics) indicate how a letter is pronounced or where a word is stressed, and in some cases they distinguish one word from another that is otherwise spelled the same. Although they are uncommon in English, you are still likely to encounter accented characters in a free-text corpus, in words such as résumé, café, prótest, divorcé, coördinate, exposé, and latté. Converting them to their plain ASCII equivalents keeps variants such as café and cafe from being treated as different tokens.
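Before stripping anything, it can help to see which characters would actually be affected. The sketch below is a minimal, standard-library-only illustration: `non_ascii_chars` is a hypothetical helper, and the sample reviews are made up for the example rather than taken from the dataset.

```python
def non_ascii_chars(text):
    """Return the distinct non-ASCII characters found in text, sorted."""
    return sorted({ch for ch in text if ord(ch) > 127})

# Made-up sample reviews; '\u00e9' is the precomposed character 'é'
reviews = [
    'Love my Echo!',
    'Great in my caf\u00e9',
    'Reads my r\u00e9sum\u00e9 aloud',
]

# Map each row position to the non-ASCII characters it contains
flagged = {
    i: non_ascii_chars(text)
    for i, text in enumerate(reviews)
    if not text.isascii()  # str.isascii() requires Python 3.7+
}
print(flagged)
```

On a pandas column, the same helper could be passed to `Series.apply` to get a per-review list of characters.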

In [4]:
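# Remove accented characters by decomposing them and dropping the accent marks
def remove_accented_chars(text):
    # NFKD splits an accented letter into its base letter plus combining marks;
    # encoding to ASCII with 'ignore' then discards every non-ASCII code point
    new_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    return new_text

# Apply to every review in the column
data['verified_reviews'] = data['verified_reviews'].apply(remove_accented_chars)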
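To make the effect of the normalization step concrete, here is a small self-contained sketch that mirrors `remove_accented_chars` and runs it on a few sample strings. The sample strings are assumptions chosen for illustration. Note the caveat it demonstrates: characters with no Unicode decomposition (such as `ß` or `ø`) are not mapped to a base letter, they are dropped entirely.

```python
import unicodedata

def remove_accented_chars(text):
    # Decompose accented letters, then drop anything that is not ASCII
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# Letters with a decomposition keep their base letter
print(remove_accented_chars('résumé café exposé'))  # resume cafe expose

# Letters without a decomposition are removed, not transliterated
print(remove_accented_chars('Straße'))      # Strae
print(remove_accented_chars('smørrebrød'))  # smrrebrd
```

If a corpus contains many such characters, a dedicated transliteration step may be a better fit than plain ASCII encoding.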



### Removing Special characters

In [6]:
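# Remove special characters: every character that is not an ASCII letter or
# digit (punctuation, symbols, and whitespace alike) is replaced by one space
def remove_special_characters(text):
    pat = r'[^a-zA-Z0-9]'
    # Replacement is one-for-one, so runs of spaces are not collapsed
    return re.sub(pat, ' ', text)

# Apply to every review in the column
data['verified_reviews'] = data['verified_reviews'].apply(remove_special_characters)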
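Because each special character is swapped for a single space, the cleaned text can end up with trailing or doubled spaces. The sketch below reproduces the function on two sample strings and adds an optional follow-up step; `collapse_whitespace` is a hypothetical helper that is not part of the original notebook.

```python
import re

def remove_special_characters(text):
    # Replace every character that is not an ASCII letter or digit with a space
    return re.sub(r'[^a-zA-Z0-9]', ' ', text)

def collapse_whitespace(text):
    # Hypothetical extra step: squeeze whitespace runs and trim both ends
    return re.sub(r'\s+', ' ', text).strip()

first = remove_special_characters('Love my Echo!')
second = remove_special_characters('playing a game, you can answer')

print(repr(first))   # 'Love my Echo '
print(repr(second))  # 'playing a game  you can answer'

print(repr(collapse_whitespace(first)))   # 'Love my Echo'
print(repr(collapse_whitespace(second)))  # 'playing a game you can answer'
```

Whether to collapse whitespace depends on the downstream tokenizer; many tokenizers that split on whitespace already ignore repeated spaces.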

In [7]:
data['verified_reviews'][:5]

0                                        Love my Echo 
1                                            Loved it 
2    Sometimes while playing a game  you can answer...
3    I have had a lot of fun with this thing  My 4 ...
4                                                Music
Name: verified_reviews, dtype: object