## Contents<a id='3.1_Contents'></a>
* [1 Data Wrangling](#1-Data-Wrangling)
  * [1.1 Introduction](#1.1-Introduction)
  * [1.2 Imports](#1.2-Imports)
  * [1.3 Load the data](#1.3-Load-the-data)
  * [1.4 Dataset overview](#1.4-Dataset-overview)
  * [1.5 Exploring the dataset](#1.5-Exploring-the-dataset)
    * [1.5.1 Exploring for non-ASCII characters](#1.5.1-Exploring-for-non-ASCII-characters)
    * [1.5.2 Finding duplicates](#1.5.2-Finding-duplicates)
    * [1.5.3 Finding URLs that start with characters other than numbers or letters](#1.5.3-Finding-URLs-that-start-with-characters-other-than-numbers-or-letters)
    * [1.5.4 Removing spaces](#1.5.4-Removing-spaces)
    * [1.5.5 Dropping URL's that has 2 or less characters](#1.5.5-Dropping-URL's-that-has-2-or-less-characters)

# 1.1 Introduction

In this first notebook we perform preliminary data exploration to find the most basic errors in the dataset, e.g. missing values, duplicates and non-ASCII characters. We also use our domain knowledge to perform some additional cleaning, e.g. removing stray quotes (' or ") and any spaces inside a URL, since URLs should not contain spaces.

# 1.2 Imports

In [5]:
import pandas as pd
import numpy as np
import re
import requests
from collections import Counter

# 1.3 Load the data

In [7]:
df = pd.read_csv('../data/raw/new_data_urls.csv')
df.head()

Unnamed: 0,url,status
0,0000111servicehelpdesk.godaddysites.com,0
1,000011accesswebform.godaddysites.com,0
2,00003.online,0
3,0009servicedeskowa.godaddysites.com,0
4,000n38p.wcomhost.com,0


# 1.4 Dataset overview

In [9]:
df.describe()

Unnamed: 0,status
count,822010.0
mean,0.519492
std,0.49962
min,0.0
25%,0.0
50%,1.0
75%,1.0
max,1.0


In [10]:
df.dtypes

url       object
status     int64
dtype: object

In [11]:
df.isnull().sum()

url       0
status    0
dtype: int64

In [12]:
df.columns

Index(['url', 'status'], dtype='object')

In [13]:
df.shape

(822010, 2)

In [14]:
df.loc[210264, :]

url       WY
status     1
Name: 210264, dtype: object

# 1.5 Exploring the dataset

### 1.5.1 Exploring for non-ASCII characters

In [17]:
def contains_foreign_characters(url):
    '''A function to find out urls with non-ASCII characters'''
    return bool(re.search(r'[^\x00-\x7F]', url))

In [18]:
df['url_contains_foreign'] = df['url'].apply(contains_foreign_characters)
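
The same flag can also be computed with the vectorized string accessor instead of `apply`. The sketch below is only an illustration: `sample` is a small made-up frame, not the real dataset, and it checks that `Series.str.contains` with the same pattern agrees with the function above.

```python
import re
import pandas as pd

# Hypothetical sample rows; the notebook itself works on df loaded from the CSV.
sample = pd.DataFrame({'url': ['example.com/page', 'example.com/société', 'test.org/file.exe’']})

def contains_foreign_characters(url):
    '''Return True if the url contains at least one non-ASCII character.'''
    return bool(re.search(r'[^\x00-\x7F]', url))

# Row-by-row version, as used in the notebook.
via_apply = sample['url'].apply(contains_foreign_characters)

# Vectorized version with the same regular expression.
via_str = sample['url'].str.contains(r'[^\x00-\x7F]', regex=True)

print(via_str.tolist())
```

Both versions flag the second and third rows, which contain accented letters and a typographic quote respectively.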

In [19]:
df[df['url_contains_foreign'] == True]

Unnamed: 0,url,status,url_contains_foreign
28801,http://email302.com/l/5fc15ea15e66c082e33c48ba...,0,True
29659,http://email302.com/l/5fc15ea15e66c082e33c48ba...,0,True
29742,http://u1146016e85.ha004.t.justns.ru/sociﾃｩtﾃｩ...,0,True
30045,http://u10955164a4.ha004.t.justns.ru/sociﾃｩtﾃｩ...,0,True
30311,http://email302.com/l/5fc15ea15e66c082e33c48ba...,0,True
...,...,...,...
678980,copy.com/s8w9tqqzVDaXIkcR/הריגתו של קצין ביטחו...,0,True
679836,venezuela365.com/wp-content/uploads/2014/10/ti...,0,True
679852,www.hjclub.info/bbs/uploadfiles/45/ca-bundle.exe’,0,True
749894,http://email302.com/l/5fc15ea15e66c082e33c48ba...,0,True


There are 425 URLs that contain non-ASCII characters. We can drop these rows because they make up a negligible share of the data, about 0.05%.
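
The percentage can be checked with a line of arithmetic. The two counts below are copied from this notebook (425 flagged rows, 822010 rows in `df.shape`); the variable names are only illustrative.

```python
# Counts taken from the notebook: flagged rows and total rows before dropping.
n_non_ascii = 425
n_total = 822010

# Share of flagged rows, expressed as a percentage.
share = n_non_ascii / n_total * 100
print(f"{share:.2f}%")
```

On the real frame the same figure could be read from `df['url_contains_foreign'].mean() * 100`, since the mean of a boolean column is the fraction of `True` values.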

##### Dropping the urls containing foreign characters

In [22]:
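# Keep only rows that are not flagged; .copy() gives an independent frame
# so that the in-place operations below do not act on a slice of the original.
df = df[~df['url_contains_foreign']].copy()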

In [23]:
df[df['url_contains_foreign'] == True]

Unnamed: 0,url,status,url_contains_foreign


In [24]:
df.isnull().sum()

url                     0
status                  0
url_contains_foreign    0
dtype: int64

In [25]:
df.drop(columns='url_contains_foreign', inplace=True)

In [26]:
df.head()

Unnamed: 0,url,status
0,0000111servicehelpdesk.godaddysites.com,0
1,000011accesswebform.godaddysites.com,0
2,00003.online,0
3,0009servicedeskowa.godaddysites.com,0
4,000n38p.wcomhost.com,0


### 1.5.2 Finding duplicates 

In [28]:
df.duplicated().sum()

13967

Only 13,967 rows are duplicates (under 2% of the data), so we drop them.
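
Note that `duplicated()` with default arguments only counts rows where both `url` and `status` repeat, keeping the first occurrence. A URL that appears twice with different labels is not counted. The sketch below shows the difference on a made-up frame; whether the real dataset contains such conflicting rows is not checked in this notebook.

```python
import pandas as pd

# Hypothetical rows: 'a.com' is an exact duplicate, 'b.com' has conflicting labels.
sample = pd.DataFrame({
    'url': ['a.com', 'a.com', 'b.com', 'b.com'],
    'status': [0, 0, 0, 1],
})

# Rows identical in every column (what the notebook counts and drops).
full_dupes = sample.duplicated().sum()

# Rows whose url has already been seen, regardless of status.
url_dupes = sample.duplicated(subset='url').sum()

# Number of urls that carry more than one distinct label.
conflicts = sample.groupby('url')['status'].nunique().gt(1).sum()

print(full_dupes, url_dupes, conflicts)
```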

In [30]:
df.drop_duplicates(inplace=True)

In [31]:
df.duplicated().sum()

0

### 1.5.3 Finding URLs that start with characters other than numbers or letters

In [33]:
df[df['url'].str.match(r'^[^a-zA-Z0-9]')]

Unnamed: 0,url,status
191868,-https://www.cinemaximum.com.tr/cakallarla-dan...,1
192050,'motors.shop.ebay.com-cars-trucks-724527.jnq3....,0
192064,'www.gestion-des-impayes.com/visuel.php?param=...,0
192141,'nicolecustodio.com.br/paypal\%20us/webscr.htm...,0
192227,'beforenanny911.com/auto/my-themes/file/proper...,0
...,...,...
680466,intent.nofrillspace.com/users/web11_focus/380...,0
680468,mister.nofrillspace.com/users/web8_dice/3791/...,0
684998,69.162.100.198/,0
685257,babicz123.ddns.net/,0


##### Finding URLs that are enclosed in either single quotes (') or double quotes (")

In [35]:
df[df['url'].str.contains(r"^['\"].*['\"]$", na=False)]

Unnamed: 0,url,status
192050,'motors.shop.ebay.com-cars-trucks-724527.jnq3....,0
192064,'www.gestion-des-impayes.com/visuel.php?param=...,0
192141,'nicolecustodio.com.br/paypal\%20us/webscr.htm...,0
192227,'beforenanny911.com/auto/my-themes/file/proper...,0
192234,'www.edyshsdf32.hut4.ru/Redirecionamento.html?...,0
...,...,...
287613,'www.fileplanet.com/100541/0/section/Marc-Ecko...,1
287615,'en.wikipedia.org/wiki/Marc_Ecko\'s_Getting_Up...,1
287928,'www.armchairempire.com/Reviews/PC\%20Games/al...,1
287932,'www.mobygames.com/game_group/sheet/gameGroupI...,1


##### Removing the quotes and special characters in front of the urls

In [37]:
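# \W matches any non-word character, which already covers ', " and ^,
# so the pattern strips every leading character that is not a letter, digit or underscore.
df['url'] = df['url'].str.replace(r"^\W+", '', regex=True)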

In [38]:
df[df['url'].str.match(r'^[^a-zA-Z0-9]')]

Unnamed: 0,url,status


In [39]:
df[df['url'].str.contains(r"^['\"].*['\"]$", na=False)]

Unnamed: 0,url,status


In [40]:
df.isnull().sum()

url       0
status    0
dtype: int64
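
The replacement above only touches the start of each URL, and the check for enclosing quotes requires a quote at both ends, so a URL that kept only its closing quote would pass that check. The sketch below, on made-up strings, shows one way to strip a trailing quote as well. It assumes a quote at the very end is never a legitimate part of the URL, which is an assumption and not something verified on the dataset.

```python
import pandas as pd

# Hypothetical examples of the patterns seen above.
sample = pd.Series([
    "'www.example.com/page'",
    '"example.org"',
    "-https://example.net/",
    "plain.com/it's",
])

# Step used in the notebook: strip leading non-word characters.
leading_only = sample.str.replace(r"^\W+", '', regex=True)

# Additional step (assumption): strip quotes left at the end of the string.
both_ends = leading_only.str.replace(r"['\"]+$", '', regex=True)

print(leading_only.tolist())
print(both_ends.tolist())
```

Quotes in the middle of a URL, as in the last example, are left untouched.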

### 1.5.4 Removing spaces

In [42]:
def remove_spaces(url):
    return url.replace(' ', '')  # Replaces all spaces with an empty string

In [43]:
df['url'] = df['url'].apply(remove_spaces)

In [44]:
def has_space(url):
    return ' ' in url

In [45]:
df[(df['url'].apply(has_space)) & (df.status == 0)]

Unnamed: 0,url,status


In [46]:
df[(df['url'].apply(has_space)) & (df.status == 1)]

Unnamed: 0,url,status
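
`remove_spaces` and `has_space` only look at the plain space character. If tabs or newlines could also appear in the data (not checked here), a regular expression with `\s` would cover them. A minimal sketch on made-up strings:

```python
import pandas as pd

# Hypothetical urls with a space, a tab and a trailing newline.
sample = pd.Series(['exa mple.com', 'tab\there.com', 'trailing.com\n'])

# Equivalent of remove_spaces: only the space character is removed.
spaces_only = sample.str.replace(' ', '', regex=False)

# Removes every whitespace character (space, tab, newline, ...).
all_whitespace = sample.str.replace(r'\s+', '', regex=True)

print(all_whitespace.tolist())
```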


### 1.5.5 Dropping URL's that has 2 or less characters

A URL with two or fewer characters cannot be valid. These entries are most likely artifacts of the web scraping.

In [50]:
df[df['url'].apply(lambda x: len(x) < 3)]

Unnamed: 0,url,status
210264,WY,1
670137,,0
751998,IE,0
763032,cc,0
777359,gt,0
780127,ie,0
787310,lt,0


In [52]:
df[(df['url'].apply(lambda x: len(x) < 3)) & (df.status == 1)]

Unnamed: 0,url,status
210264,WY,1


In [54]:
df = df[~df['url'].apply(lambda x: len(x) < 3)]
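
The length filter can also be written with the vectorized `str.len()` accessor instead of a lambda. The sketch below uses a made-up frame with a few of the short values seen above.

```python
import pandas as pd

# Hypothetical rows, including an empty string and two-letter entries.
sample = pd.DataFrame({
    'url': ['WY', '', 'cc', 'abc', 'example.com'],
    'status': [1, 0, 0, 0, 1],
})

# Boolean mask of urls with fewer than 3 characters.
too_short = sample['url'].str.len() < 3

# Keep the remaining rows and renumber the index from 0.
kept = sample[~too_short].reset_index(drop=True)

print(kept['url'].tolist())
```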

In [56]:
df.reset_index(drop=True, inplace=True)

In [58]:
df.to_csv('../data/processed/url_new_cleaned.csv')
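
When the cleaned file is read back in a later notebook, it matters whether the index was written. By default `to_csv` stores the index as an extra unnamed column, which `read_csv` then loads as `Unnamed: 0`. The sketch below demonstrates this with an in-memory buffer and a made-up one-row frame, so no file is touched.

```python
import io
import pandas as pd

# Hypothetical one-row frame standing in for the cleaned data.
sample = pd.DataFrame({'url': ['example.com'], 'status': [0]})

# Default behaviour: the index is written as the first, unnamed column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Either passing `index=False` when saving or `index_col=0` when loading avoids the extra column; which one fits depends on how the later notebooks read the file.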