# 1. Preprocessing script
I first tried several ways of preprocessing the imported data. I then combined them into a single method whose parameters let you choose which preprocessing step(s) to apply. This code was later moved into a class so that every group member could use it for preprocessing.
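
To illustrate the class idea, here is a minimal sketch. The name `Preprocessor`, its fields, and the whitespace-based tokenisation are assumptions made for this example; it covers only a few of the options and is not the class the group actually used.

```python
import string
from dataclasses import dataclass


@dataclass
class Preprocessor:
    """Hypothetical wrapper: stores the chosen options and applies them in order."""
    lower_case: bool = True
    punctuation: bool = True
    numbers: bool = True
    min_word_length: int = -1  # -1 disables the length filter

    def __call__(self, description: str) -> str:
        if self.lower_case:
            description = description.lower()
        if self.punctuation:
            description = description.translate(str.maketrans("", "", string.punctuation))
        if self.numbers:
            description = description.translate(str.maketrans("", "", string.digits))
        # Assumption: plain whitespace splitting instead of an NLTK tokenizer.
        words = description.split()
        if self.min_word_length != -1:
            words = [w for w in words if len(w) >= self.min_word_length]
        return " ".join(words)


# Configure the options once, then reuse the object on every description.
keep_numbers = Preprocessor(numbers=False)
print(keep_numbers("12 Month HuluPlus gift Code!"))
```

Keeping the options on the object means every group member can share one configuration instead of repeating a long argument list at each call.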

In [13]:
import pandas as pd

## Import the data
This is a Kaggle dataset that TNO said we could practice on. The data should be similar to their scraped dark web content.

In [14]:
data = pd.read_csv('darkweb/data/agora.csv')
data.head()

| Unnamed: 0 | Vendor | Category | Item | Item Description | Price | Origin | Destination | Rating | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CheapPayTV | Services/Hacking | 12 Month HuluPlus gift Code | 12-Month HuluPlus Codes for $25. They are wort... | 0.05027025666666667 BTC | Torland | | 4.96/5 | |
| 1 | CheapPayTV | Services/Hacking | Pay TV Sky UK Sky Germany HD TV and much mor... | Hi we offer a World Wide CCcam Service for En... | 0.152419585 BTC | Torland | | 4.96/5 | |
| 2 | KryptykOG | Services/Hacking | OFFICIAL Account Creator Extreme 4.2 | Tagged Submission Fix Bebo Submission Fix Adju... | 0.007000000000000005 BTC | Torland | | 4.93/5 | |
| 3 | cyberzen | Services/Hacking | VPN > TOR > SOCK TUTORIAL | How to setup a VPN > TOR > SOCK super safe enc... | 0.019016783532494728 BTC | | | 4.89/5 | |
| 4 | businessdude | Services/Hacking | Facebook hacking guide | . This guide will teach you how to hack Faceb... | 0.062018073963963936 BTC | Torland | | 4.88/5 | |


## Extract the data we need
We separate out the item title and the item description and join them into one string per listing; this combined description is our feature.

In [16]:
descriptions = data[' Item'] + " " + data[' Item Description']
descriptions.head()

0    12 Month HuluPlus gift Code 12-Month HuluPlus ...
1    Pay TV Sky UK  Sky Germany HD TV  and much mor...
2    OFFICIAL Account Creator Extreme 4.2 Tagged Su...
3    VPN > TOR > SOCK TUTORIAL How to setup a VPN >...
4    Facebook hacking guide .  This guide will teac...
dtype: object
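
One thing to watch with this concatenation is missing values. If either column is empty for a row, the combined value is typically missing (NaN) as well, and a later `str(...)` call turns it into the literal text `nan`. I have not checked whether the Agora data contains such rows. The sketch below shows one way to guard against it in plain Python; the helper name `combine_text` is hypothetical and not part of the notebook.

```python
import math


def combine_text(item, description):
    """Join two fields with a space, treating None and NaN as empty."""
    parts = []
    for value in (item, description):
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        parts.append(str(value).strip())
    return " ".join(p for p in parts if p)


# A missing description no longer contaminates the combined text.
print(combine_text("Facebook hacking guide", float("nan")))

# For comparison: converting NaN directly gives the word "nan".
print(str(float("nan")))
```

Such a helper could be applied row by row, for example over pairs of title and description values, before the text is handed to the preprocessing step.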

## The preprocessing method
The `preprocess` method takes one description and several parameters and returns the preprocessed description. In this case, we process the first 100 descriptions. The result of the script is the set of preprocessed features (descriptions).
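
One side effect of the punctuation step is worth seeing in isolation. A translation table that deletes punctuation joins words that were separated only by punctuation, which is likely where tokens such as `wepwpawpa` and `cvvcvv` in the output further down come from. The sketch below compares deletion with an alternative that maps punctuation to spaces. The example sentence and the second table are assumptions for illustration, not something the notebook uses.

```python
import string

text = "Hack WEP/WPA/WPA2 wifi, step-by-step."

# Deleting punctuation: words separated only by "/" or "-" are merged.
delete_table = str.maketrans("", "", string.punctuation)
deleted = text.translate(delete_table)
print(deleted)  # Hack WEPWPAWPA2 wifi stepbystep

# Alternative (assumption): map every punctuation character to a space,
# then collapse the repeated whitespace that this leaves behind.
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = " ".join(text.translate(space_table).split())
print(spaced)  # Hack WEP WPA WPA2 wifi step by step
```

Which variant is better depends on the downstream model: deletion keeps tokens like `stepbystep` together, while the space variant keeps `wep`, `wpa` and `wpa2` as separate words.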

In [17]:
# Imports
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
from itertools import dropwhile
import string

# Utilities
stop_words_dictionary = set(map(lambda x: x.lower(), stopwords.words("english")))
strip_punctuation_translator = str.maketrans("", "", string.punctuation)
strip_numbers_translator = str.maketrans("", "", string.digits)
stem = PorterStemmer().stem
lemmatize = WordNetLemmatizer().lemmatize

# Preprocessing method
# input: a description/sentence of type str
# each keyword argument switches one preprocessing step on or off;
# min_word_length / max_word_length of -1 disable the length filters
def preprocess(description, lower_case=True, punctuation=True, numbers=True,
               unicode=True, cut_off=True, stop_words=True, stemming=True,
               lemmatizing=False, min_word_length=-1, max_word_length=-1):
    if lower_case:
        description = description.lower()
    if punctuation:
        description = description.translate(strip_punctuation_translator).strip()
    if numbers:
        description = description.translate(strip_numbers_translator).strip()
    if unicode:
        # drop every non-ASCII character
        description = description.encode('ascii', 'ignore').decode("ascii")
    if cut_off:
        # drop the last token of the description
        word_tokens = word_tokenize(description)
        description = " ".join(word_tokens[:-1])
    if stop_words:
        word_tokens = word_tokenize(description)
        delete_stop_words = [w for w in word_tokens if w not in stop_words_dictionary]
        description = " ".join(delete_stop_words)
    if stemming:
        word_tokens = word_tokenize(description)
        stemmed = [stem(w) for w in word_tokens]
        description = " ".join(stemmed)
    if lemmatizing:
        word_tokens = word_tokenize(description)
        lemmatized = [lemmatize(w) for w in word_tokens]
        description = " ".join(lemmatized)
    if min_word_length != -1:
        word_tokens = word_tokenize(description)
        min_sized_words = [w for w in word_tokens if len(w) >= min_word_length]
        description = " ".join(min_sized_words)
    if max_word_length != -1:
        word_tokens = word_tokenize(description)
        max_sized_words = [w for w in word_tokens if len(w) <= max_word_length]
        description = " ".join(max_sized_words)
    return description

aantal = 100  # number of descriptions to preprocess ("aantal" is Dutch for "count")

descriptions_preprocessed = descriptions[:aantal].apply(lambda d: preprocess(str(d)))
descriptions_preprocessed

0     month huluplu gift code month huluplu code wor...
1     pay tv sky uk sky germani hd tv much cccam ser...
2     offici account creator extrem tag submiss fix ...
3     vpn tor sock tutori setup vpn tor sock super s...
4     facebook hack guid guid teach hack facebook ac...
5     ddo attack servic new servic avail take websit...
6     atm hack tutori step step guid manual hack atm...
7     callsm verif servic need regist account sm ver...
8     mac window address changer come complet databa...
9     wifi hack hack wepwpawpa glori wp hack wpa wif...
10    paytv via internet hd iptv box month subscript...
11                       setup botnet guid setup botnet
12    proxi softwar login proxi day list provid soft...
13    credit card info cvvcvv provid credit card inf...
14    pay tv sky uk sky germani hd tv much morecccam...
15    look pay tv iptv cccam resel hi lok resel ccca...
16    hack ebook collect say titl ebook collect hack...
17    masterkey xtremeau masterkey extremeto ope