# SPAM Link Detection System
We want to build a system that automatically detects whether a web page is spam based on its URL.

<span id="intro">Intro</span>

In [107]:
import pandas as pd
import regex as re

In [108]:
total_data = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/NLP-project-tutorial/main/url_spam.csv')
total_data.head()

,url,is_spam
0,https://briefingday.us8.list-manage.com/unsubs...,True
1,https://www.hvper.com/,True
2,https://briefingday.com/m/v4n3i4f3,True
3,https://briefingday.com/n/20200618/m#commentform,False
4,https://briefingday.com/fan,True


In [109]:
total_data["is_spam"] = total_data["is_spam"].astype(int)
total_data.head()

,url,is_spam
0,https://briefingday.us8.list-manage.com/unsubs...,1
1,https://www.hvper.com/,1
2,https://briefingday.com/m/v4n3i4f3,1
3,https://briefingday.com/n/20200618/m#commentform,0
4,https://briefingday.com/fan,1


In [110]:
total_data.shape

(2999, 2)

In [111]:
# We are going to build a system that predicts whether a URL is spam. First we clean the data and prepare it for the machine learning model.
# Start by checking whether the data contains any missing values
missing_values = total_data.isnull().sum()
missing_values

url        0
is_spam    0
dtype: int64

In [112]:
# Drop duplicate rows, if any, and check the shape of the data
total_data = total_data.drop_duplicates().reset_index(drop=True)
total_data.shape

(2369, 2)
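
The row count drops from 2999 to 2369, so 630 duplicated rows were removed. As a minimal sketch of how the duplicates could be counted before dropping them, the example below uses a small made-up DataFrame (`toy` and its URLs are hypothetical stand-ins for `total_data`):

```python
import pandas as pd

# Hypothetical toy data standing in for total_data
toy = pd.DataFrame({
    "url": ["https://a.example/x", "https://a.example/x", "https://b.example/"],
    "is_spam": [1, 1, 0],
})

# duplicated() flags every row that repeats an earlier row
n_duplicates = int(toy.duplicated().sum())

# Drop the repeats and renumber the index from 0
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, deduplicated.shape)
```

Counting first makes it easy to confirm that the change in shape matches the number of rows flagged as duplicates.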

In [113]:
# There are no missing values in the data, so we can move on
# The columns are 'url' and 'is_spam'; next we check the class distribution
# of the target column 'is_spam'
total_data['is_spam'].value_counts()

is_spam
0    2125
1     244
Name: count, dtype: int64

In [114]:
print(f"Spam: {total_data['is_spam'].value_counts()[1]}")
print(f"Not Spam: {total_data['is_spam'].value_counts()[0]}")

Spam: 244
Not Spam: 2125
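
Raw counts are easier to interpret as proportions. The sketch below rebuilds a label column from the counts reported above (the `labels` Series is an illustrative stand-in, not the real column) and uses `value_counts(normalize=True)` to get each class's share, which puts spam at roughly 10% of the URLs:

```python
import pandas as pd

# Rebuild a label column from the reported counts (2125 non-spam, 244 spam)
labels = pd.Series([0] * 2125 + [1] * 244, name="is_spam")

# normalize=True returns fractions of the total instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share.round(3))
```

On the real data the same call would be `total_data['is_spam'].value_counts(normalize=True)`.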


In [115]:
# The classes are imbalanced (2125 non-spam vs. 244 spam URLs), which should be kept in mind when training and evaluating the model. The next step is to check the distribution of the 'url' column by checking its length
# total_data['url_length'] = total_data['url'].apply(len)
# total_data['url_length'].describe()
# total_data.head()
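
One way to inspect URL lengths without adding a column to the DataFrame is to keep the lengths in a separate Series and summarize them per class. This is a sketch under assumptions: `toy` and its URLs are hypothetical, and on the real data `total_data` would take its place.

```python
import pandas as pd

# Hypothetical toy data standing in for total_data
toy = pd.DataFrame({
    "url": [
        "https://a.example/",
        "https://b.example/promo/win",
        "https://c.example/x",
    ],
    "is_spam": [0, 1, 1],
})

# Character length of each URL, kept separate from the DataFrame
url_length = toy["url"].str.len()

# Summary statistics of the lengths, split by the is_spam label
length_stats = url_length.groupby(toy["is_spam"]).describe()
print(length_stats[["count", "mean", "min", "max"]])
```

Because the lengths are never written back, the `url` and `is_spam` columns stay as they are for the text processing that follows.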


## Text Processing 
Before training the model, the text has to be transformed. We start by converting the text to lowercase and removing punctuation marks and special characters:

In [117]:
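def preprocess_url(url):
    # Lowercase first so that uppercase letters are kept rather than stripped below
    url = url.lower()

    # Replace any character that is not a letter (a-z) or a space with a space
    url = re.sub(r'[^a-z ]', " ", url)

    # Remove single letters surrounded by white space or at the start of the string
    url = re.sub(r'\s+[a-zA-Z]\s+', " ", url)
    url = re.sub(r'^[a-zA-Z]\s+', " ", url)

    # Collapse multiple white spaces into one
    url = re.sub(r'\s+', " ", url)

    # Remove tags (a no-op for URLs, since '<' and '>' were already replaced above)
    url = re.sub("</?.*?>", " <> ", url)

    return url.split()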

total_data['url'] = total_data['url'].apply(preprocess_url)
total_data.head()

,url,is_spam
0,"[https, briefingday, us, list, manage, com, un...",1
1,"[https, www, hvper, com]",1
2,"[https, briefingday, com, i]",1
3,"[https, briefingday, com, commentform]",0
4,"[https, briefingday, com, fan]",1
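
To make the effect of each step concrete, here is a self-contained sketch of the same kind of tokenization. It uses the standard-library `re` module rather than the `regex` package imported in the notebook, and `tokenize_url` is a hypothetical name. Lowercasing before stripping non-letters is a choice made in this sketch so that mixed-case URLs keep their letters.

```python
import re


def tokenize_url(url):
    """Split a URL into lowercase alphabetic tokens (illustrative sketch)."""
    # Lowercase first so uppercase letters survive the next step
    text = url.lower()
    # Replace everything that is not a-z or a space with a space
    text = re.sub(r"[^a-z ]", " ", text)
    # Drop single letters that stand alone between white space
    text = re.sub(r"\s+[a-z]\s+", " ", text)
    # split() with no arguments also collapses runs of white space
    return text.split()


tokens_simple = tokenize_url("https://www.hvper.com/")
tokens_mixed = tokenize_url("HTTPS://Example.COM/Win-Prize")
print(tokens_simple)
print(tokens_mixed)
```

The first call reproduces the tokens shown for row 1 in the table above; the second shows that a mixed-case URL (made up for this example) yields the same kind of lowercase tokens.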


<a href="#intro" class="btn">Go up</a>