# 8. Vectorization Comparisons
Judging by the last two notebooks, linear SVC seems to be the ML method that works best on our data. Let's see if we can tweak the input so that it beats the 92 percent score with balanced categories from 7.2.

In [1]:
import pandas as pd
from io import StringIO
from sklearn.feature_extraction.text import TfidfVectorizer
from preprocessing import PreProcessor
from gensim.models import word2vec

pp = PreProcessor()

data = pd.read_csv('darkweb/data/agora.csv')

min_records_per_category = 500

unique_categories = data[' Category'].unique()
sorteddf = data.sort_values([' Category']).groupby(' Category').head(min_records_per_category)
filtereddf = sorteddf.where(sorteddf[" Category"].isin(unique_categories))
filtereddf = filtereddf[filtereddf["Vendor"].notnull()]
filtereddf = filtereddf[~filtereddf[' Category'].str.contains('Other')]
filtereddf = filtereddf[~filtereddf[' Category'].str.contains('Information')]

categories = filtereddf[' Category']
descriptions = filtereddf[' Item'] + " " + filtereddf[' Item Description']
descriptions_preprocessed = descriptions.apply(lambda d: pp.preprocess(str(d)))

## Creating the dataframe

In [10]:
df = pd.DataFrame({'Category': categories, 'Item Description': descriptions_preprocessed})
df = df[pd.notnull(df['Item Description'])] # no empty descriptions
df = df[df.groupby('Category')['Category'].transform(len) >= min_records_per_category] # only categories that still have at least min_records_per_category rows

df['category_id'] = df['Category'].factorize()[0]
category_id_df = df[['Category', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Category']].values)

df.head()

Unnamed: 0,Category,Item Description,category_id
40127,Counterfeits/Watches,emporio armani ar shell case ceram bracelet re...,0
40126,Counterfeits/Watches,cartiertank ladi brand cartier seri tank gende...,0
40125,Counterfeits/Watches,patek philipp watch box patek philipp watch bo...,0
40130,Counterfeits/Watches,breitl navitim cosmonaut replica watch inform ...,0
40129,Counterfeits/Watches,emporio armani men ar dial color gari watch re...,0


## TFIDF Tweaking
We create the vectorizer and adjust some of its settings to try to improve training. The best result after tweaking all variables (more than shown here; the others only seemed to worsen the outcome) is 93 percent, reached with a minimum document frequency of 1. Apparently the classifier does better when it can also use terms that appear very rarely, since `min_df=1` keeps even terms found in a single description.
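
To make the effect of `min_df` concrete, here is a minimal sketch on a made-up four-document corpus (the `toy_docs` list and the `vocabulary_size` helper are illustrative assumptions, not part of the notebook's data or code). It only shows how raising `min_df` shrinks the vocabulary the classifier gets to see; it says nothing about accuracy on the Agora data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, standing in for preprocessed item descriptions.
toy_docs = [
    "replica watch box",
    "replica watch strap",
    "cannabis seeds pack",
    "cannabis hash sample",
]

def vocabulary_size(docs, min_df):
    """Return how many unigram/bigram terms survive the min_df cutoff."""
    vectorizer = TfidfVectorizer(min_df=min_df, ngram_range=(1, 2))
    vectorizer.fit(docs)
    return len(vectorizer.vocabulary_)

# min_df=1 keeps every term; min_df=2 drops terms seen in only one document.
sizes = {min_df: vocabulary_size(toy_docs, min_df) for min_df in (1, 2)}
print(sizes)
```

With `min_df=2`, only terms shared by at least two documents remain (here `replica`, `watch`, `cannabis` and the bigram `replica watch`), so rare but possibly category-specific terms are discarded.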

In [11]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn import metrics

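def vectorize_and_train_results(minDf):
    """Vectorize with TF-IDF at the given min_df, train a LinearSVC and print its classification report."""

    # vectorizer: unigrams and bigrams, dropping terms found in fewer than minDf documents
    tfidf = TfidfVectorizer(min_df=minDf, ngram_range=(1, 2))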
    features = tfidf.fit_transform(df['Item Description'])
    labels = df.category_id
    
    # training
    X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, df.index, test_size=0.33, random_state=0)
    
    linearSVCModel = LinearSVC()
    linearSVCModel.fit(X_train, y_train)

    y_predLinearSVC = linearSVCModel.predict(X_test)
    
    # showing scores
    print(metrics.classification_report(y_test, y_predLinearSVC, target_names=df['Category'].unique()))
    
vectorize_and_train_results(1)
vectorize_and_train_results(5)
vectorize_and_train_results(10)
vectorize_and_train_results(25)
vectorize_and_train_results(50)

                               precision    recall  f1-score   support

         Counterfeits/Watches       1.00      0.99      1.00       165
                Data/Accounts       0.82      0.90      0.86       150
                 Data/Pirated       0.96      0.90      0.93       167
                 Drugs/Benzos       0.87      0.93      0.90       163
  Drugs/Cannabis/Concentrates       0.96      0.95      0.96       167
       Drugs/Cannabis/Edibles       0.97      0.99      0.98       156
          Drugs/Cannabis/Hash       0.95      0.95      0.95       155
         Drugs/Cannabis/Seeds       0.97      0.96      0.96       153
    Drugs/Cannabis/Synthetics       0.98      0.96      0.97       168
          Drugs/Cannabis/Weed       0.91      0.93      0.92       172
 Drugs/Dissociatives/Ketamine       0.99      0.99      0.99       161
           Drugs/Ecstasy/MDMA       0.96      0.93      0.95       169
          Drugs/Ecstasy/Pills       0.97      0.96      0.97       159
     

                               precision    recall  f1-score   support

         Counterfeits/Watches       0.99      0.99      0.99       165
                Data/Accounts       0.80      0.79      0.80       150
                 Data/Pirated       0.78      0.84      0.81       167
                 Drugs/Benzos       0.82      0.88      0.85       163
  Drugs/Cannabis/Concentrates       0.89      0.91      0.90       167
       Drugs/Cannabis/Edibles       0.94      0.92      0.93       156
          Drugs/Cannabis/Hash       0.89      0.86      0.88       155
         Drugs/Cannabis/Seeds       0.96      0.93      0.94       153
    Drugs/Cannabis/Synthetics       0.94      0.91      0.92       168
          Drugs/Cannabis/Weed       0.86      0.91      0.89       172
 Drugs/Dissociatives/Ketamine       0.99      0.97      0.98       161
           Drugs/Ecstasy/MDMA       0.95      0.91      0.93       169
          Drugs/Ecstasy/Pills       0.91      0.92      0.92       159
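
Reading several full reports side by side is tedious, so one option is to reduce each run to a single number. The sketch below is an assumption-laden illustration, not the notebook's method: it uses a synthetic two-category corpus, a hypothetical helper `compare_min_df`, and macro-averaged F1 as the summary score (chosen here because the categories are balanced).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data; the real notebook would pass df['Item Description'] and df.category_id.
watch_words = ["replica", "watch", "dial", "bracelet", "strap"]
weed_words = ["cannabis", "weed", "gram", "indoor", "strain"]
texts, labels = [], []
for i in range(30):
    texts.append(" ".join(watch_words[(i + k) % 5] for k in range(3)) + f" item{i}")
    labels.append("Counterfeits/Watches")
    texts.append(" ".join(weed_words[(i + k) % 5] for k in range(3)) + f" lot{i}")
    labels.append("Drugs/Cannabis/Weed")

def compare_min_df(texts, labels, min_df_values):
    """Return {min_df: macro F1} for a LinearSVC trained on TF-IDF features."""
    scores = {}
    for min_df in min_df_values:
        features = TfidfVectorizer(min_df=min_df, ngram_range=(1, 2)).fit_transform(texts)
        X_train, X_test, y_train, y_test = train_test_split(
            features, labels, test_size=0.33, random_state=0, stratify=labels
        )
        model = LinearSVC().fit(X_train, y_train)
        scores[min_df] = f1_score(y_test, model.predict(X_test), average="macro")
    return scores

scores = compare_min_df(texts, labels, min_df_values=(1, 5))
for min_df, score in scores.items():
    print(f"min_df={min_df}: macro F1 = {score:.2f}")
```

As in the notebook's own function, the vectorizer is fitted on the full corpus before the split, which keeps the comparison consistent with the reports above but lets test-set vocabulary leak into the features.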
     