This notebook takes a first look at the punctuation and special characters that appear in the dataset's messages and decides how to handle each of them.

In [12]:
import pandas as pd
import re
from string import punctuation


In [2]:
df = pd.read_csv("../data/bronze/spam.csv")


### Special chars

In [3]:
def find_special_characters(text):
    """Return every character that is not an ASCII letter, digit, or whitespace."""
    return re.findall(r'[^a-zA-Z0-9\s]', text)

all_special_characters = df['Message'].apply(find_special_characters).explode().dropna()


unique_special_characters = all_special_characters.unique()
unique_special_characters

array([',', '.', '(', ')', '&', "'", '!', '?', '£', '*', '>', '/', '+',
       ':', '=', '\x92', '-', 'ú', '‘', 'ü', ';', '#', '"', '@', '$', 'Ü',
       '~', '|', '_', '–', '<', '…', '\\', 'è', '^', '\x94', '“', '%',
       '\x91', '[', ']', '’', '\x93', '\x96', '»', '—', 'é', 'É', 'ì',
       '鈥', '┾', '〨', '¡'], dtype=object)
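The unique list shows which characters occur but not how often. A minimal sketch of a frequency count follows, assuming a small hypothetical sample in place of `df['Message']`; on the real data, `all_special_characters.value_counts()` should give the same kind of summary.

```python
import re
from collections import Counter

# Hypothetical sample messages standing in for df['Message'].
sample_messages = [
    "Press *9 to copy your friends Callertune",
    "Win £1000 cash! Call now...",
    "So ü pay first lar...",
]

def find_special_characters(text):
    # Same pattern as above: anything that is not an ASCII letter, digit, or whitespace.
    return re.findall(r'[^a-zA-Z0-9\s]', text)

# Count every occurrence instead of only listing the distinct characters.
char_counts = Counter(
    ch for message in sample_messages for ch in find_special_characters(message)
)
print(char_counts.most_common())
```

Rare characters are good candidates for row-by-row inspection, while frequent ones deserve a general rule.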

### Rows with char

In [4]:
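def row_with_char(char):
    """Print every message in df that contains the literal character `char`."""
    # regex=False matches char literally; na=False skips missing messages
    matching_rows = df[df['Message'].str.contains(char, regex=False, na=False)]
    print(f"Rows containing '{char}':")
    for index, message in matching_rows['Message'].items():
        print(f"Row {index}: {message}")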


In [5]:
pd.set_option('display.max_colwidth', None)
row_with_char("*")

Rows containing '*':
Row 7: As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
Row 67: Urgent UR awarded a complimentary trip to EuroDisinc Trav, Aco&Entry41 Or £1000. To claim txt DIS to 87121 18+6*£1.50(moreFrmMob. ShrAcomOrSglSuplt)10, LS1 3AJ
Row 103: As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
Row 117: You are a winner U have been specially selected 2 receive £1000 or a 4* holiday (flights inc) speak to a live operator 2 claim 0871277810910p/min (18+)
Row 154: As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
Row 160: You are a winner U have been specially selected 2 receive £1000 cash or a 4* holiday (flights inc) speak to a live operator 2 claim 0

In [6]:
row_with_char(">")

Rows containing '>':
Row 11: SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info
Row 15: XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL
Row 309: TheMob> Check out our newest selection of content, Games, Tones, Gossip, babes and sport, Keep your mobile fit and funky text WAP to 82468
Row 690: <Forwarded from 448712404000>Please CALL 08712404000 immediately as there is an urgent message waiting for you.
Row 1613: RT-KIng Pro Video Club>> Need help? info@ringtoneking.co.uk or call 08701237397 You must be 16+ Club credits redeemable at www.ringtoneking.co.uk! Enjoy!
Row 2003: TheMob>Yo yo yo-Here comes a new selection of hot downloads for our members to get for FREE! Just click & open the next link sent to ur fone...
Row 2079: 85233 FREE>Ringtone!Reply REAL
Row 2267: <Forwarded from 88877>FREE entry into 

In [7]:
row_with_char("/")

Rows containing '/':
Row 11: SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info
Row 15: XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL
Row 19: England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/ú1.20 POBOXox36504W45WQ 16+
Row 34: Thanks for your subscription to Ringtone UK your mobile will be charged £5/month Please confirm by replying YES or NO. If you reply NO you will not be charged
Row 117: You are a winner U have been specially selected 2 receive £1000 or a 4* holiday (flights inc) speak to a live operator 2 claim 0871277810910p/min (18+)
Row 121: URGENT! Your Mobile No. was awarded £2000 Bonus Caller Prize on 5/9/03 This is our final try to contact U! Call from Landline 09064019788 BOX42WR29C, 150PPM
Row 164: -PLS STOP

In [8]:
row_with_char("@")

Rows containing '@':
Row 55: Do you know what Mallika Sherawat did yesterday? Find out now @  &lt;URL&gt;
Row 135: Want 2 get laid tonight? Want real Dogging locations sent direct 2 ur mob? Join the UK's largest Dogging Network bt Txting GRAVEL to 69888! Nt. ec2a. 31p.msg@150p
Row 136: I only haf msn. It's yijue@hotmail.com
Row 235: Text & meet someone sexy today. U can find a date or even flirt its up to U. Join 4 just 10p. REPLY with NAME & AGE eg Sam 25. 18 -msg recd@thirtyeight pence
Row 474: Want 2 get laid tonight? Want real Dogging locations sent direct 2 ur Mob? Join the UK's largest Dogging Network by txting MOAN to 69888Nyt. ec2a. 31p.msg@150p
Row 541: from www.Applausestore.com MonthlySubscription@50p/msg max6/month T&CsC web age16 2stop txt stop
Row 607: XCLUSIVE@CLUBSAISAI 2MOROW 28/5 SOIREE SPECIALE ZOUK WITH NICHOLS FROM PARIS.FREE ROSES 2 ALL LADIES !!! info: 07946746291/07880867867
Row 960: Where @
Row 1170: Msgs r not time pass.They silently say that I am thinking of 

### First approach

Proposed handling for each character:

- ',', '.', '(', ')', '&': replace with space
- "'": replace with empty string
- '!', '?': replace with space
- '£': replace with "pound", and add the names of other common currencies
- '*': replace with space
- '>': replace with space
- '/', '+': replace with space
- ':', '=': replace with space
- '-': replace with space
- 'ú': undecided
- '‘', 'ü', ';': replace with space
- '#': undecided
- '"': replace with space
- '@': undecided; it can belong to an email address, stand for "at", etc.
- '$': replace with "dollar"
- 'Ü': undecided
- '\x91', '\x92', '\x93', '\x94', '\x96': replace with empty string
- '~', '|', '_', '–', '<', '…', '\\', 'è', '^', '“': replace with space
- '%': replace with "percentage"
- '[', ']', '’', '»', '—', 'é', 'É', 'ì', '鈥', '┾', '〨', '¡': replace with space

Also worth tracking: keywords (cash, xxx) and websites.
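The three kinds of rule in the list (replace with a space, delete, replace with a word) could be collected into one translation table. This is only a sketch covering a subset of the characters; the names `TO_SPACE`, `TO_EMPTY`, `TO_WORD`, and `apply_rules` are hypothetical, and padding the currency words with spaces is an assumption made here so that they do not fuse with neighbouring tokens.

```python
# Hypothetical rule table built from a few entries of the list above.
TO_SPACE = ",.()&!?*>/+:=-;\""        # characters replaced with a space
TO_EMPTY = "'\x91\x92\x93\x94\x96"    # characters deleted outright
TO_WORD = {"£": " pound ", "$": " dollar ", "%": " percentage "}  # assumed padding

# str.translate accepts a dict keyed by code point; None means "delete".
table = {ord(ch): " " for ch in TO_SPACE}
table.update({ord(ch): None for ch in TO_EMPTY})
table.update({ord(ch): word for ch, word in TO_WORD.items()})

def apply_rules(text):
    # Apply all single-character rules in one pass, then collapse whitespace.
    return " ".join(text.translate(table).split())

print(apply_rules("Don't miss out! Win £1000 (cash)..."))
```

A table like this keeps each decision in one place, which makes it easier to revise a rule once the undecided characters are settled.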


### Clean text function

In [None]:

def clean_text(text):
    # Spell out currency and percent symbols so they survive punctuation removal
    special_replacements = {
        r"£": "pound",
        r"\$": "dollar",
        r"€": "euro",
        r"%": "percentage"}

    for pattern, replacement in special_replacements.items():
        text = re.sub(pattern, replacement, text)
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)               # strip <...> tags
    text = re.sub(r'http\S+|www\.\S+', '', text)        # drop URLs
    text = re.sub(r'[0-9]+', 'number', text)            # replace digit runs with a placeholder
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)  # replace email-like tokens
    text = text.translate(str.maketrans('', '', punctuation))  # remove ASCII punctuation

    # text = re.sub(r'(http\S+|@\S+|\d+)', '', text)
    # text = re.sub(r'[^a-zA-Z\s]', '', text)

    # text = re.sub(r'\s+', ' ', text).strip()
    # text = re.sub(r'[\x00-\x1F\x7F-\x9F]', '', text)
    # TODO: collapse runs of two or three spaces into one
    return text

In [14]:
df_cleaned = df.copy()

df_cleaned['Message'] = df_cleaned['Message'].apply(clean_text)

In [15]:
all_special_characters = df_cleaned['Message'].apply(find_special_characters).explode().dropna()

unique_special_characters = all_special_characters.unique()
unique_special_characters

array(['\x92', 'ú', '‘', 'ü', '–', '…', 'è', '\x94', '“', '\x91', '’',
       '\x93', '\x96', '»', '—', 'é', 'ì', '鈥', '┾', '〨', '¡'],
      dtype=object)
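Every character that survives is non-ASCII. That is expected, because `string.punctuation` contains only the 32 ASCII punctuation characters, so `str.translate` never touches curly quotes, dashes, or accented letters. A short check, using a few of the leftover characters from the output above:

```python
from string import punctuation

# string.punctuation is a fixed string of ASCII punctuation only.
print(punctuation)
print(len(punctuation))

# A few of the characters that remained after clean_text.
leftovers = ['\x92', 'ú', '‘', 'ü', '–', '…', '“', '’', '»', '—', '¡']

# None of them is in the ASCII set, so the translate step cannot remove them.
survivors = [ch for ch in leftovers if ch not in punctuation]
print(survivors == leftovers)
```

Removing these characters therefore needs either explicit rules or a broader pattern such as `[^a-zA-Z\s]`.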

In [27]:
row_with_char("-")

Rows containing '-':
Row 19: England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/ú1.20 POBOXox36504W45WQ 16+
Row 42: 07732584351 - Rodger Burns - MSG = We tried to call you re your reply to our sms for a free nokia mobile + free camcorder. Please call now 08000930705 for delivery tomorrow
Row 56: Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out!
Row 90: Yeah do! Don‘t stand to close tho- you‘ll catch something!
Row 93: Please call our customer service representative on 0800 169 6031 between 10am-9pm as you have WON a guaranteed £1000 cash or £5000 prize!
Row 98: Hi. Wk been ok - on hols now! Yes on for a bit of a run. Forgot that i have hairdressers appointment at four so need to get home n shower beforehand. Does that cause prob for u?"
Row 118: Goodo! Yes we must speak friday - egg-potato ratio for t

In [16]:
row_with_char("ü")

Rows containing 'ü':
Row 22: So ü pay first lar... Then when is da stock comin...
Row 35: Yup... Ok i go home look at the timings then i msg ü again... Xuhui going to learn on 2nd may too but her lesson is at 8am
Row 125: Ü predict wat time ü'll finish buying?
Row 140: Got c... I lazy to type... I forgot ü in lect... I saw a pouch but like not v nice...
Row 214: Yup... How ü noe leh...
Row 429: 7 at esplanade.. Do ü mind giving me a lift cos i got no car today..
Row 502: When can ü come out?
Row 581: Huh so early.. Then ü having dinner outside izzit?
Row 619: I come n pick ü up... Come out immediately aft ur lesson...
Row 638: When ü login dat time... Dad fetching ü home now?
Row 663: Sorry me going home first... Daddy come fetch ü later...
Row 685: I wanted to ask ü to wait 4 me to finish lect. Cos my lect finishes in an hour anyway.
Row 699: Mum ask ü to buy food home...
Row 701: How much r ü willing to pay?
Row 989: Yun ah.the ubi one say if ü wan call by tomorrow.call 67441233 look

In [17]:
row_with_char("è")

Rows containing 'è':
Row 989: Yun ah.the ubi one say if ü wan call by tomorrow.call 67441233 look for irene.ere only got bus8,22,65,61,66,382. Ubi cres,ubi tech park.6ph for 1st 5wkg days.èn


In [18]:
row_with_char("ú")

Rows containing 'ú':
Row 19: England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/ú1.20 POBOXox36504W45WQ 16+


In [19]:
row_with_char("é")

Rows containing 'é':
Row 4762: It's é only $140 ard...É rest all ard $180 at least...Which is é price 4 é 2 bedrm ($900)


In [20]:
row_with_char("“")

Rows containing '“':
Row 1318: Win the newest “Harry Potter and the Order of the Phoenix (Book 5) reply HARRY, answer 5 questions - chance to be the first among readers!
Row 2729: Urgent Please call 09066612661 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection. T&Cs SAE award. 20M12AQ. 150ppm. 16+ “


In [21]:
row_with_char("»")

Rows containing '»':
Row 4097: Hey , is * rite u put »10 evey mnth is that all?


In [22]:
row_with_char('ì')

Rows containing 'ì':
Row 5017: Hey gals...U all wanna meet 4 dinner at nìte?


- "ü": replace with "you"
- "è": replace with empty string
- "ú": replace with empty string
- "é": replace with empty string
- "“": replace with empty string
- "»": replace with empty string
- "ì": replace with "i"
- "\x91", "\x92", "\x93", "\x94", "\x96": replace with empty string

All remaining characters: replace with empty string.
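The `\x91` to `\x96` characters are C1 control code points. In Windows-1252 the same byte values are curly quotes and an en dash, which suggests (without confirming) that the file was written in one encoding and read in another. The sketch below shows the mapping; `c1_to_cp1252` is a hypothetical helper, and deleting these characters as listed above remains a perfectly reasonable choice.

```python
# C1 control characters found in the messages.
C1_CHARS = ['\x91', '\x92', '\x93', '\x94', '\x96']

def c1_to_cp1252(ch):
    # Turn the code point back into a single byte, then read that byte as Windows-1252.
    return ch.encode('latin-1').decode('cp1252')

repaired = {ch: c1_to_cp1252(ch) for ch in C1_CHARS}
for original, fixed in repaired.items():
    print(repr(original), '->', fixed)

# Byte 0xA3 is 'ú' in code page 437 and '£' in Latin-1, so the 'ú1.20' in row 19
# may originally have been a price in pounds. This is a guess, not a confirmed fact.
maybe_pound = 'ú'.encode('cp437').decode('latin-1')
print(maybe_pound)
```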

In [None]:
def clean_text_2(text):
    special_replacements = {
        r"£": "pound",
        r"\$": "dollar",
        r"€": "euro",
        r"%": "percentage",
        r"ì": "i",
        r"ü": "you",
        }

    # Lowercase first so that "Ü" is also caught by the "ü" rule
    text = text.lower()
    for pattern, replacement in special_replacements.items():
        text = re.sub(pattern, replacement, text)
    text = re.sub(r'<[^<>]+>', ' ', text)               # strip <...> tags
    text = re.sub(r'http\S+|www\.\S+', '', text)        # drop URLs
    text = re.sub(r'[0-9]+', 'number', text)            # replace digit runs with a placeholder
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)  # replace email-like tokens
    text = text.translate(str.maketrans('', '', punctuation))  # remove ASCII punctuation
    text = re.sub(r'[^a-zA-Z\s]', '', text)             # drop anything still not a letter or whitespace
    text = re.sub(r'\s+', ' ', text).strip()            # collapse runs of whitespace
    return text

In [25]:
df_cleaned_v2 = df.copy()

df_cleaned_v2['Message'] = df_cleaned_v2['Message'].apply(clean_text_2)
df_cleaned_v2['Message']

0                                                                                  go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
1                                                                                                                                                                 ok lar joking wif u oni
2                        free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
3                                                                                                                                             u dun say so early hor u c already then say
4                                                                                                                             nah i dont think he goes to usf he lives around here though
                                                                      

In [26]:
all_special_characters = df_cleaned_v2['Message'].apply(find_special_characters).explode().dropna()


unique_special_characters = all_special_characters.unique()
unique_special_characters

array([], dtype=object)
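The empty array confirms that no special characters remain. A spot check on individual messages can still be useful, because it exposes side effects that the character scan does not, such as `£1000` collapsing into the single token `poundnumber`. The block below is a condensed restatement of the cleaning steps (lowercasing first) so that it runs on its own; the third sample message is made up for illustration.

```python
import re
from string import punctuation

def clean_text_2(text):
    # Condensed restatement of the cleaning steps, for a standalone spot check.
    replacements = {r"£": "pound", r"\$": "dollar", r"€": "euro",
                    r"%": "percentage", r"ì": "i", r"ü": "you"}
    text = text.lower()
    for pattern, replacement in replacements.items():
        text = re.sub(pattern, replacement, text)
    text = re.sub(r'<[^<>]+>', ' ', text)
    text = re.sub(r'http\S+|www\.\S+', '', text)
    text = re.sub(r'[0-9]+', 'number', text)
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)
    text = text.translate(str.maketrans('', '', punctuation))
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

samples = [
    "So ü pay first lar... Then when is da stock comin...",  # row 22
    "I only haf msn. It's yijue@hotmail.com",                # row 136
    "Win £1000 cash",                                        # made-up example
]
for message in samples:
    print(repr(message), '->', repr(clean_text_2(message)))
```

If separate `pound` and `number` tokens are preferred, padding the replacement words with spaces before the whitespace-collapsing step would be one way to get them.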