#Text cleaning with regular expressions

Apply regular expressions to two practice data sets as a warm-up before applying them to the CV data set.<br>

* Data sets to practice
  * IMDB Dataset.csv
  * Emoji_Sentiment_Data_v1.0.csv
* Target data set
  * Resume data set



In [46]:
import pandas as pd

###IMDB movie review - sentiment data set

Read the first 2000 rows of the data set into a pandas DataFrame.

In [47]:
df = pd.read_csv("IMDB Dataset.csv", nrows = 2000)

Check the shape (rows, columns).

In [48]:
df.shape

(2000, 2)

Show the first five rows.

In [49]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


###Remove HTML tags

Use the `apply` method on the DataFrame column that holds the review text.

In [35]:
#HTML tags
import re
def remove_html_tags(reviews):
    return re.sub(r'<[^<]+?>', '', reviews)

df['review'] = df['review'].apply(remove_html_tags)
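
Before running the pattern on the whole column, it can help to try it on a single string. The sketch below repeats the function from the cell above and applies it to a made-up sample (not taken from the data set). Note that each tag is replaced by an empty string, so two words separated only by a tag end up joined.

```python
import re

def remove_html_tags(text):
    # '<', then one or more characters that are not '<' (non-greedy), then '>'
    return re.sub(r'<[^<]+?>', '', text)

# Hypothetical sample string, for illustration only
sample = "A wonderful little production. <br /><br />The filming technique is <b>great</b>."
cleaned = remove_html_tags(sample)
print(cleaned)

# Words separated only by a tag are joined, because the tag is replaced by ''
joined = remove_html_tags("one<br />two")
print(joined)
```

If joined words are a problem for the later steps, one option is to substitute a single space instead of an empty string and collapse repeated whitespace afterwards.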

###Remove URLs (http, https and www)

In [37]:
#HTTP

def remove_url(text):
    return re.sub(r'http[s]?://\S+|www\.\S+', '', text)
df['review'] = df['review'].apply(remove_url)
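
A small check of the URL pattern on a made-up sentence may make its behaviour clearer. The sample text and the extra whitespace clean-up are assumptions added for illustration; `https?` is just a shorter spelling of `http[s]?`.

```python
import re

def remove_url(text):
    # 'http://' or 'https://' or 'www.', followed by any run of non-whitespace
    return re.sub(r'https?://\S+|www\.\S+', '', text)

# Hypothetical sample string, for illustration only
sample = "See https://example.com/review?id=1 or www.example.org for details."
cleaned = remove_url(sample)
print(repr(cleaned))  # the removed URLs leave double spaces behind

# Optional follow-up: collapse runs of whitespace into one space
tidy = re.sub(r'\s+', ' ', cleaned).strip()
print(tidy)
```

Because `\S+` consumes everything up to the next whitespace, punctuation directly attached to a URL (for example a closing period) is removed together with it.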

In [38]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


###Convert to lowercase (not necessary for classification with transformer model)

In [39]:
#to lower case
def to_lower(text):
    return text.lower()

df['review'] = df['review'].apply(to_lower)



In [40]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


###Remove punctuation (not necessary for classification with transformer model)

This time, call the function through a lambda on the DataFrame column with the review text.

In [41]:
#Remove every character listed in punctuations from the text
def punctuation_removal(text):
    punctuations = '@#!?+&*[]-%.:/();$=><|{}^,' + "'`" + '_'
    for p in punctuations:
        text = text.replace(p, '')  # delete all occurrences of this character
    return text

df['review'] = df['review'].apply(lambda x: punctuation_removal(x))
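
As an alternative sketch, the same set of characters can be deleted in a single pass with `str.translate` instead of one `replace` call per character. The names `PUNCT_TABLE` and `punctuation_removal_translate` are made up for this example; the character set is copied from the function above.

```python
# Same characters as in punctuation_removal
PUNCTUATIONS = '@#!?+&*[]-%.:/();$=><|{}^,' + "'`" + '_'

# Translation table that maps every listed character to None (i.e. deletes it)
PUNCT_TABLE = str.maketrans('', '', PUNCTUATIONS)

def punctuation_removal_translate(text):
    return text.translate(PUNCT_TABLE)

# Hypothetical sample string, for illustration only
sample = "a wonderful little production. the filming technique isn't bad!"
result = punctuation_removal_translate(sample)
print(result)

# Characters that are not in the list, such as the double quote, are kept
quoted = punctuation_removal_translate('say "hi"!')
print(quoted)
```

The double quote, backslash and tilde are not part of the list, so they survive this step; whether that is intended depends on the target text.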

In [None]:
df.head()

#Apply emoji removal to the emoji data set (optional)

More info<br>

https://www.geeksforgeeks.org/python-program-to-print-emojis/

### emoticon removal intermediate short test

In [None]:
#TEST emoticon removal
import re

# Define the emoji pattern
emoji_pattern = re.compile("["
                           u"\U0001F602"  # FACE WITH TEARS OF JOY only; the ranges below are disabled
                           #u"\U0001F600-\U0001F64F"  # emoticons
                           #u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           #u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           #u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           #u"\U00002702-\U000027B0"  # other symbols
                           #u"\U000024C2-\U0001F251"  # additional symbols
                           "]+", flags=re.UNICODE)

# Sample text containing emojis
text = "Hello 😂, this is a test message with emoji."

# Remove emojis from the text
cleaned_text = emoji_pattern.sub(r'', text)

print(cleaned_text)  # Output: Hello , this is a test message with emoji.


Hello , this is a test message with emoji.


###Check for emojis in the data set

Read Emoji_Sentiment_Data_v1.0.csv into a pandas DataFrame.

In [None]:
df_emoticon = pd.read_csv("Emoji_Sentiment_Data_v1.0.csv")

In [None]:
df_emoticon.head()

Unnamed: 0,Emoji,Unicode codepoint,Occurrences,Position,Negative,Neutral,Positive,Unicode name,Unicode block
0,😂,0x1f602,14622,0.805101,3614,4163,6845,FACE WITH TEARS OF JOY,Emoticons
1,❤,0x2764,8050,0.746943,355,1334,6361,HEAVY BLACK HEART,Dingbats
2,♥,0x2665,7144,0.753806,252,1942,4950,BLACK HEART SUIT,Miscellaneous Symbols
3,😍,0x1f60d,6359,0.765292,329,1390,4640,SMILING FACE WITH HEART-SHAPED EYES,Emoticons
4,😭,0x1f62d,5526,0.803352,2412,1218,1896,LOUDLY CRYING FACE,Emoticons


Write a function that checks whether the `Emoji` column contains an emoji matched by `emoji_pattern`. Because the pattern currently covers only U+1F602 (😂), only rows with that character are flagged.

In [None]:
# Function to check if text contains emojis
def contains_emoji(text):
    return bool(emoji_pattern.search(text))

In [None]:
# Apply the function to the DataFrame
df_emoticon['contains_emoji'] = df_emoticon['Emoji'].apply(contains_emoji)

# Print the DataFrame with the new column
#print(df)

# Extract rows containing emojis
emoji_rows = df_emoticon[df_emoticon['contains_emoji']]

# Print rows that contain emojis
print(emoji_rows)

  Emoji Unicode codepoint  Occurrences  Position  Negative  Neutral  Positive  \
0     😂           0x1f602        14622  0.805101      3614     4163      6845   

             Unicode name Unicode block  contains_emoji  
0  FACE WITH TEARS OF JOY     Emoticons            True  


In [None]:
len(emoji_rows)

1

Write a function to remove the emoticons.

In [None]:
#Emoji removal

def remove_emo(text):
    # Same single-code-point pattern as in the short test above
    emoji_pattern = re.compile("["
                               u"\U0001F602"  # FACE WITH TEARS OF JOY only
                               #u"\U0001F600-\U0001F64F"  # emoticons
                               #u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               #u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               #u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               #u"\U00002702-\U000027B0"  # other symbols
                               #u"\U000024C2-\U0001F251"  # additional symbols
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)


df_emoticon['no_emoticon'] = df_emoticon['Emoji'].apply(remove_emo)



In [None]:
df_emoticon.head()

Unnamed: 0,Emoji,Unicode codepoint,Occurrences,Position,Negative,Neutral,Positive,Unicode name,Unicode block,contains_emoji,no_emoticon
0,😂,0x1f602,14622,0.805101,3614,4163,6845,FACE WITH TEARS OF JOY,Emoticons,True,
1,❤,0x2764,8050,0.746943,355,1334,6361,HEAVY BLACK HEART,Dingbats,False,❤
2,♥,0x2665,7144,0.753806,252,1942,4950,BLACK HEART SUIT,Miscellaneous Symbols,False,♥
3,😍,0x1f60d,6359,0.765292,329,1390,4640,SMILING FACE WITH HEART-SHAPED EYES,Emoticons,False,😍
4,😭,0x1f62d,5526,0.803352,2412,1218,1896,LOUDLY CRYING FACE,Emoticons,False,😭
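
In the table only 😂 is removed, because the pattern matches a single code point. The sketch below shows what happens when some of the commented-out ranges are enabled. The names `wide_emoji_pattern` and `remove_emoji_wide` are made up for this example, and the broad U+24C2 to U+1F251 range is left out on purpose because it also covers many non-emoji characters (for example CJK ideographs).

```python
import re

# Sketch: a wider pattern built from ranges that are commented out above
wide_emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U00002702-\U000027B0"  # dingbats
    "]+"
)

def remove_emoji_wide(text):
    return wide_emoji_pattern.sub('', text)

# The five characters from the head of the emoji data set
samples = [
    "\U0001F602",  # FACE WITH TEARS OF JOY
    "\u2764",      # HEAVY BLACK HEART (Dingbats)
    "\u2665",      # BLACK HEART SUIT (Miscellaneous Symbols)
    "\U0001F60D",  # SMILING FACE WITH HEART-SHAPED EYES
    "\U0001F62D",  # LOUDLY CRYING FACE
]
for ch in samples:
    print(repr(ch), "->", repr(remove_emoji_wide(ch)))
```

With these ranges, BLACK HEART SUIT (U+2665) is still kept, since the Miscellaneous Symbols block is not included. Which blocks to cover is a judgment call that depends on the target text.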


###Clean text for CV classification

The code from the blocks above can be reused to clean the target CV data set.<br>

Consider:
* URLs
* Hashtags
* Mentions
* Special characters
* Upper case and lower case
* Punctuation
* Phone numbers
* Email addresses
* ...
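
Several items from this list can be handled with the same `re.sub` approach as before. The following is a rough sketch under assumptions: the patterns, the function name `clean_cv_text` and the sample sentence are made up for illustration and are not tuned to the actual resume data. Emails are removed before mentions so that the part after `@` is not treated as a mention.

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
MENTION_RE = re.compile(r'@\w+')
HASHTAG_RE = re.compile(r'#\w+')
# Very rough phone pattern: optional '+', a digit, at least 7 more digits or
# separators, and a final digit. It also matches other digit runs such as
# date ranges ("2019 - 2021"), so it needs tuning on real CVs.
PHONE_RE = re.compile(r'\+?\d[\d\s().-]{7,}\d')

def clean_cv_text(text):
    # Order matters: URLs and emails first, then mentions and hashtags
    for pattern in (URL_RE, EMAIL_RE, MENTION_RE, HASHTAG_RE, PHONE_RE):
        text = pattern.sub('', text)
    # Collapse the whitespace left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical sample string, for illustration only
sample = ("Reach me at jane.doe@example.com or +1 (555) 123-4567, "
          "see https://example.com/cv and @janedoe #python for more")
cleaned = clean_cv_text(sample)
print(cleaned)
```

Lowercasing and punctuation removal from the earlier cells can be applied afterwards if the classifier needs them.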

<br>
Remove profession terms from the text when they also occur in the classification label, since they could leak the label to the classifier. Keep one version of the text with and one without this removal, and compare the classification results.
<br>
For instance:

* Class label : data scientist
* Original text : I have 2 years experience as python developer and 3 years in data science.
* Modified text after removal of profession: I have 2 years experience as python developer and 3 years in .
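
The example can be sketched in code as follows. Since the label ("data scientist") and the removed phrase ("data science") differ, the sketch assumes a hand-made mapping from each label to the terms that should be removed; `LABEL_TERMS` and `remove_profession` are hypothetical names, not part of the original notebook.

```python
import re

# Hypothetical mapping from class label to the terms to remove from the text
LABEL_TERMS = {
    "data scientist": ["data scientist", "data science"],
}

def remove_profession(text, label, label_terms=LABEL_TERMS):
    # Fall back to the label itself when no terms are listed for it
    terms = label_terms.get(label, [label])
    # Try longer terms first so that a longer phrase wins over a shorter one
    terms = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(t) for t in terms) + r')\b',
        flags=re.IGNORECASE,
    )
    return pattern.sub('', text)

text = "I have 2 years experience as python developer and 3 years in data science."
modified = remove_profession(text, "data scientist")
print(modified)
```

To compare both variants, the result can be stored in a separate column next to the original text, so that the classifier can be trained once on each.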




