# Data Cleaning and Exploration

This notebook serves two purposes:
- Determine whether the raw data needs any cleaning
- Explore the data and its distributions to spot anything unusual and to better understand what the extracted features mean

## 1. Setup

In [48]:
# importing relevant libraries
import pandas as pd
import time

from sklearn.manifold import TSNE
import altair as alt

# setting plot style
import altair_theme
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [49]:
# import data
processed_data = pd.read_csv("../data/processed/processed_data.csv")

## 2. Data Cleaning & Exploration

In [50]:
# check what the data looks like
processed_data.head()

Unnamed: 0.1,Unnamed: 0,domain,div,a,img,span,ul,li
0,0,sms-japan.com,55.0,8.0,0.0,0.0,0.0,0.0
1,1,just-jobs.com,170.0,145.0,0.0,0.0,0.0,0.0
2,2,openproxy.space,61.0,26.0,0.0,20.0,0.0,0.0
3,4,sms-japan.com,62.0,8.0,0.0,0.0,0.0,0.0
4,6,tooljp.com,9.0,101.0,0.0,2.0,0.0,0.0
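
The `Unnamed: 0.1` and `Unnamed: 0` columns in the preview look like row indices written to the CSV by earlier `to_csv` calls. Below is a minimal sketch of how such columns could be dropped after loading, assuming they carry no information needed downstream. The inline CSV and the helper name `drop_unnamed_columns` are illustrative and not part of the notebook.

```python
import io

import pandas as pd

# Tiny stand-in for processed_data.csv, using three rows from the preview above.
csv_text = """Unnamed: 0.1,Unnamed: 0,domain,div,a,img,span,ul,li
0,0,sms-japan.com,55.0,8.0,0.0,0.0,0.0,0.0
1,1,just-jobs.com,170.0,145.0,0.0,0.0,0.0,0.0
2,2,openproxy.space,61.0,26.0,0.0,20.0,0.0,0.0
"""


def drop_unnamed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns whose name starts with 'Unnamed:'."""
    keep = [col for col in df.columns if not str(col).startswith("Unnamed:")]
    return df[keep].copy()


sample = drop_unnamed_columns(pd.read_csv(io.StringIO(csv_text)))
print(list(sample.columns))
```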


In [51]:
# check data size
len(processed_data)

5263

In [52]:
# check number of duplicates
len(processed_data) - len(processed_data.drop_duplicates())

0

**Action -> remove duplicates from data**
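
The count above compares every column, including `Unnamed: 0.1` and `Unnamed: 0`. If those columns are unique per row, as they are in the rows shown, no two rows can match in full and the check will report 0. Below is a hedged sketch of a check restricted to the domain and tag counts, run on made-up rows; `content_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Made-up rows: the last two share domain and tag counts
# but differ in the index-like column.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "domain": ["a.com", "b.com", "b.com"],
    "div": [55.0, 170.0, 170.0],
    "a": [8.0, 145.0, 145.0],
})

content_cols = ["domain", "div", "a"]

# Comparing all columns finds nothing, because "Unnamed: 0" differs on every row.
n_dupes_all = int(toy.duplicated().sum())

# Comparing only the content columns flags the repeated b.com row.
n_dupes_content = int(toy.duplicated(subset=content_cols).sum())

# Keep the first occurrence of each distinct (domain, counts) combination.
deduplicated = toy.drop_duplicates(subset=content_cols)
print(n_dupes_all, n_dupes_content, len(deduplicated))
```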

### Outliers

In [53]:
alt.Chart(processed_data).mark_boxplot(opacity=0.1).encode(
    alt.X(alt.repeat("row"), type='quantitative')
).properties(
    width=800,
    height=80
).repeat(
    row=['div', 'a', 'img', 'span', 'ul', 'li']
)
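
With Altair's default boxplot `extent` of 1.5, the points drawn beyond the whiskers are values more than 1.5 × IQR outside the quartiles. Below is a minimal sketch for listing such values numerically. The helper `iqr_outlier_mask` and the toy counts are illustrative, and the quartiles pandas computes may differ slightly from the ones used by the plotting library.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy div counts: four typical pages and one very container-heavy page.
toy_div = pd.Series([55.0, 61.0, 62.0, 170.0, 3042.0], name="div")
mask = iqr_outlier_mask(toy_div)
print(toy_div[mask].tolist())
```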

#### What do the outliers look like?

#### div
div outliers are sites with a lot of containers.

They might be associated with sites of low value for online ads, although not necessarily fraudulent ones.

In [54]:
processed_data[processed_data["div"]>=1000].sort_values("div", ascending=False).head(5)

Unnamed: 0.1,Unnamed: 0,domain,div,a,img,span,ul,li
402,462,gpop.io,3042.0,829.0,133.0,8.0,0.0,0.0
401,461,gpop.io,3020.0,844.0,133.0,8.0,0.0,0.0
4382,5004,young-machine.com,2555.0,882.0,306.0,30.0,5.0,64.0
4388,5010,young-machine.com,2549.0,879.0,303.0,27.0,5.0,64.0
4387,5009,young-machine.com,2548.0,879.0,303.0,27.0,5.0,64.0


#### a
a outliers are sites with a lot of hyperlinks.

They might be associated with information aggregators that point to multiple sources.

In [55]:
processed_data.sort_values("a", ascending=False).head(5)

Unnamed: 0.1,Unnamed: 0,domain,div,a,img,span,ul,li
4847,5517,jin-taro.com,146.0,2853.0,43.0,37.0,5.0,41.0
4572,5215,presidenthouse.net,254.0,2252.0,43.0,264.0,5.0,2214.0
5258,5968,singapore-startup.com,98.0,2092.0,60.0,214.0,3.0,2035.0
404,464,footao.tv,626.0,1867.0,631.0,656.0,0.0,0.0
4326,4944,hannoumatome.com,1052.0,1820.0,27.0,0.0,4.0,58.0


#### img
img outliers are sites with a lot of images.

Like the div outliers, they might be associated with sites of low value for online ads.

In [56]:
processed_data.sort_values("img", ascending=False).head(5)

Unnamed: 0.1,Unnamed: 0,domain,div,a,img,span,ul,li
1026,1184,brawlify.com,1130.0,566.0,1768.0,381.0,3.0,11.0
4573,5217,acestickers.com,2072.0,1561.0,1522.0,3.0,2.0,501.0
5190,5891,rhinoos.xyz,100.0,1443.0,1389.0,1990.0,5.0,694.0
1564,1807,benricho.org,438.0,798.0,746.0,293.0,3.0,15.0
5250,5959,y9freegames.com,860.0,408.0,697.0,148.0,4.0,158.0


#### span
span outliers are sites with a lot of text containers.

They might be associated with, for example, news sites, which use many text tags to mark up their articles.

In [57]:
processed_data.sort_values("span", ascending=False).head(5)

Unnamed: 0.1,Unnamed: 0,domain,div,a,img,span,ul,li
4582,5226,herowarsjpwebfb.com,762.0,754.0,81.0,2405.0,5.0,211.0
5190,5891,rhinoos.xyz,100.0,1443.0,1389.0,1990.0,5.0,694.0
5030,5721,kurashi-karu.com,100.0,134.0,87.0,1672.0,4.0,53.0
5194,5895,ch225.com,1170.0,366.0,20.0,1592.0,3.0,67.0
808,940,adr-stock.com,211.0,402.0,3.0,1365.0,2.0,10.0


#### ul
ul outliers are sites with a lot of unordered lists.

Here, however, the ul dispersion is low, and the domains with higher ul counts do not seem to share anything of particular relevance.

In [58]:
processed_data.sort_values("ul", ascending=False).head(5)

Unnamed: 0.1,Unnamed: 0,domain,div,a,img,span,ul,li
4472,5101,hokanko-alt.com,357.0,385.0,255.0,515.0,49.0,354.0
4471,5100,jisaka.com,357.0,385.0,255.0,514.0,49.0,354.0
4470,5099,nandemo-uketori.com,357.0,385.0,255.0,514.0,49.0,354.0
5117,5811,networthmagazine.com,259.0,123.0,41.0,77.0,10.0,70.0
5116,5810,smarttelly.com,267.0,128.0,44.0,81.0,10.0,70.0


#### li
li outliers are sites with a lot of list items.

Again, they might be associated with, for example, information aggregators that point to multiple sources.

In [59]:
processed_data.sort_values("li", ascending=False).head(5)

Unnamed: 0.1,Unnamed: 0,domain,div,a,img,span,ul,li
4572,5215,presidenthouse.net,254.0,2252.0,43.0,264.0,5.0,2214.0
5258,5968,singapore-startup.com,98.0,2092.0,60.0,214.0,3.0,2035.0
5190,5891,rhinoos.xyz,100.0,1443.0,1389.0,1990.0,5.0,694.0
5177,5877,pasonica.com,417.0,842.0,73.0,571.0,3.0,560.0
4573,5217,acestickers.com,2072.0,1561.0,1522.0,3.0,2.0,501.0


### Distributions

#### Main outputs
Based on the distributions and the range of values, all tags look viable to use except for the ul tag. The low dispersion of this tag's values indicates it won't be a good differentiator between what is similar and what is not, and it will probably only add noise and complexity.

In [60]:
alt.Chart(processed_data).mark_bar().encode(
    alt.X(alt.repeat("row"),type='quantitative', bin=alt.Bin(extent=[0, 800], step=5)),
    y='count()',
).properties(
    width=800,
    height=80
).repeat(
    row=['div', 'a', 'img', 'span', 'ul', 'li']
)

**Action -> remove ul from training features**
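
The low-dispersion argument for ul can also be checked numerically. Below is a minimal sketch that compares a few spread measures per tag and then drops ul from a feature list. The toy counts are made up, and the choice of standard deviation, IQR, and number of distinct values is an assumption about what "dispersion" should mean here.

```python
import pandas as pd

# Toy tag counts (made up), with ul varying far less than the other tags.
toy = pd.DataFrame({
    "div": [55.0, 170.0, 61.0, 762.0, 9.0],
    "li": [0.0, 64.0, 0.0, 211.0, 2035.0],
    "ul": [0.0, 5.0, 0.0, 5.0, 3.0],
})

# One row per tag, one column per spread measure.
summary = pd.DataFrame({
    "std": toy.std(),
    "iqr": toy.quantile(0.75) - toy.quantile(0.25),
    "n_unique": toy.nunique(),
})
print(summary)

# Drop the low-dispersion tag from the candidate feature list.
features = [col for col in toy.columns if col != "ul"]
print(features)
```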

### Correlations

#### Main outputs
After a general look at the pairwise scatter-plot matrix and a more in-depth view of potentially problematic pairs of features, we don't seem to have issues with correlations. Strongly correlated features would matter because they could bias the distance metric.
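
The visual check can be complemented by a numeric one. Below is a minimal sketch that computes a Pearson correlation matrix with `DataFrame.corr` and lists the pairs above a threshold. The toy counts and the 0.9 threshold are assumptions made for illustration; for heavy-tailed counts, `method="spearman"` may be a more robust choice.

```python
import pandas as pd

# Toy counts (made up): "a" and "li" rise together, "img" does not follow them.
toy = pd.DataFrame({
    "a": [8.0, 145.0, 26.0, 2252.0, 2092.0],
    "li": [0.0, 10.0, 2.0, 2214.0, 2035.0],
    "img": [50.0, 0.0, 43.0, 3.0, 60.0],
})

corr = toy.corr()  # Pearson correlation by default
print(corr.round(2))

# List each feature pair once if its absolute correlation exceeds the threshold.
threshold = 0.9
strong_pairs = [
    (row, col)
    for i, row in enumerate(corr.index)
    for col in corr.columns[i + 1:]
    if abs(corr.loc[row, col]) > threshold
]
print(strong_pairs)
```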

In [61]:
alt.Chart(processed_data).mark_circle(opacity=0.1).encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative')
).properties(
    width=80,
    height=80
).repeat(
    row=['div', 'a', 'img', 'span', 'ul', 'li'],
    column=['li', 'ul', 'span', 'img', 'a', 'div']
)

In [62]:
alt.Chart(processed_data[(processed_data["div"]<2000) & (processed_data["li"]<2000)]).mark_circle(opacity=0.1).encode(
    alt.X("div"),
    alt.Y("li")
).properties(
    width=300,
    height=300
)

In [63]:
alt.Chart(processed_data[(processed_data["span"]<2000) & (processed_data["li"]<2000)]).mark_circle(opacity=0.1).encode(
    alt.X("span"),
    alt.Y("li")
).properties(
    width=300,
    height=300
)

In [64]:
alt.Chart(processed_data[(processed_data["img"]<2000) & (processed_data["div"]<2000)]).mark_circle(opacity=0.1).encode(
    alt.X("div"),
    alt.Y("img")
).properties(
    width=300,
    height=300
)

In [65]:
alt.Chart(processed_data[(processed_data["li"]<2000) & (processed_data["a"]<1000)]).mark_circle(opacity=0.1).encode(
    alt.X("li"),
    alt.Y("a")
).properties(
    width=300,
    height=300
)

## 3. Data Structure

In [66]:
features = ["div", "a", "img", "span", "li"]

time_start = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=15, n_iter=300)

tsne_results = tsne.fit_transform(processed_data[features])
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

[t-SNE] Computing 46 nearest neighbors...
[t-SNE] Indexed 5263 samples in 0.005s...
[t-SNE] Computed neighbors for 5263 samples in 0.109s...
[t-SNE] Computed conditional probabilities for sample 1000 / 5263
[t-SNE] Computed conditional probabilities for sample 2000 / 5263
[t-SNE] Computed conditional probabilities for sample 3000 / 5263
[t-SNE] Computed conditional probabilities for sample 4000 / 5263
[t-SNE] Computed conditional probabilities for sample 5000 / 5263
[t-SNE] Computed conditional probabilities for sample 5263 / 5263
[t-SNE] Mean sigma: 8.743155
[t-SNE] KL divergence after 250 iterations with early exaggeration: 80.815079
[t-SNE] KL divergence after 300 iterations: 2.448580
t-SNE done! Time elapsed: 5.208219766616821 seconds
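
The run above does not fix a random seed, so the embedding can differ between runs. Below is a hedged sketch of a seeded run on synthetic counts. The value `random_state=0` and the synthetic data are assumptions, and the iteration-count argument is left at its default because its name differs across scikit-learn versions (`n_iter` in older releases, `max_iter` in newer ones).

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the five tag-count features (60 pages, made-up values).
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=50, size=(60, 5)).astype(float)

# random_state seeds the initialisation; perplexity must stay below the sample count.
tsne = TSNE(n_components=2, perplexity=15, random_state=0)
embedding = tsne.fit_transform(toy_counts)
print(embedding.shape)
```

Fixing the seed should make the layout repeatable on the same machine and library version, which helps when comparing embeddings before and after a cleaning step.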


In [67]:
processed_data['tsne-2d-one'] = tsne_results[:,0]
processed_data['tsne-2d-two'] = tsne_results[:,1]

alt.Chart(processed_data).mark_circle(opacity=0.05).encode(
    alt.X("tsne-2d-one"),
    alt.Y("tsne-2d-two")
).properties(
    width=600,
    height=600)