# 21. Comparing Best Results
In this notebook, I compare the best methods so far to get an overview of our best results. Until now, we have mainly compared scores on the balanced 'sample_500' set, but we now also have scores from different methods on the main categories and on the whole dataset. Let's see how those compare.

## Preprocessing

In [2]:
import pandas as pd
from preprocessing import PreProcessor

pp = PreProcessor()

df_all = pd.read_csv('Structured_DataFrame.csv', index_col=0)
df_all['Item Description'] = df_all['Item Description'].apply(lambda d: pp.preprocess(str(d)))
df_all

Unnamed: 0,Category,Item Description,category_id
0,Services/Hacking,month huluplu gift code month huluplu code wor...,0
1,Services/Hacking,pay tv sky uk sky germani hd tv much cccam ser...,0
2,Services/Hacking,offici account creator extrem tag submiss fix ...,0
3,Services/Hacking,vpn tor sock tutori setup vpn tor sock super s...,0
4,Services/Hacking,facebook hack guid guid teach hack facebook ac...,0
...,...,...,...
109585,Drugs/Opioids/Opium,gr purifi opium list gramm redefin opium pefec...,95
109586,Weapons/Fireworks,ship ticket order ship one gun bought must bou...,99
109587,Drugs/Opioids/Opium,gram white afghani heroin full escrow gram whi...,95
109588,Drugs/Opioids/Opium,gram white afghani heroin full escrow gram whi...,95


In [4]:
df_main = pd.read_csv('Structured_DataFrame_Main_Categories.csv', index_col=0)
df_main['Item Description'] = df_main['Item Description'].apply(lambda d: pp.preprocess(str(d)))
df_main

Unnamed: 0,Category,Item Description,category_id
0,Services,month huluplu gift code month huluplu code wor...,0
1,Services,pay tv sky uk sky germani hd tv much cccam ser...,0
2,Services,offici account creator extrem tag submiss fix ...,0
3,Services,vpn tor sock tutori setup vpn tor sock super s...,0
4,Services,facebook hack guid guid teach hack facebook ac...,0
...,...,...,...
109585,Drugs,gr purifi opium list gramm redefin opium pefec...,1
109586,Weapons,ship ticket order ship one gun bought must bou...,11
109587,Drugs,gram white afghani heroin full escrow gram whi...,1
109588,Drugs,gram white afghani heroin full escrow gram whi...,1


In [26]:
df_balanced = pd.read_csv('Structured_DataFrame_Sample_500.csv', index_col=0)
df_balanced['Item Description'] = df_balanced['Item Description'].apply(lambda d: pp.preprocess(str(d)))
df_balanced

Unnamed: 0,Category,Item Description,category_id
40127,Counterfeits/Watches,emporio armani ar shell case ceram bracelet re...,0
40126,Counterfeits/Watches,cartiertank ladi brand cartier seri tank gende...,0
40125,Counterfeits/Watches,patek philipp watch box patek philipp watch bo...,0
40130,Counterfeits/Watches,breitl navitim cosmonaut replica watch inform ...,0
40129,Counterfeits/Watches,emporio armani men ar dial color gari watch re...,0
...,...,...,...
15401,Services/Money,canada cc get card number cvv expiri date name...,29
15402,Services/Money,uk debit card take chanc buy uk visa debit car...,29
15403,Services/Money,itali card detail high valid fresh itali card ...,29
15404,Services/Money,centurionblack cc get us centurion cc card num...,29


## Vectorizing

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2))

features_all = tfidf.fit_transform(df_all['Item Description'])
labels_all = df_all.Category

features_all

<109563x102238 sparse matrix of type '<class 'numpy.float64'>'
	with 3414110 stored elements in Compressed Sparse Row format>

In [6]:
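# fit_transform refits the shared `tfidf` object, so the vocabulary learned from df_all is replaced here
features_main = tfidf.fit_transform(df_main['Item Description'])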
labels_main = df_main.Category

features_main

<109563x102238 sparse matrix of type '<class 'numpy.float64'>'
	with 3414110 stored elements in Compressed Sparse Row format>

In [27]:
features_balanced = tfidf.fit_transform(df_balanced['Item Description'])
labels_balanced = df_balanced.Category

features_balanced

<15000x16833 sparse matrix of type '<class 'numpy.float64'>'
	with 390081 stored elements in Compressed Sparse Row format>

## Splitting

In [28]:
from sklearn.model_selection import train_test_split

X_train_all, X_test_all, y_train_all, y_test_all, indices_train_all, indices_test_all = train_test_split(features_all, labels_all, df_all.index, test_size=0.33, random_state=0)
X_train_main, X_test_main, y_train_main, y_test_main, indices_train_main, indices_test_main = train_test_split(features_main, labels_main, df_main.index, test_size=0.33, random_state=0)
X_train_balanced, X_test_balanced, y_train_balanced, y_test_balanced, indices_train_balanced, indices_test_balanced = train_test_split(features_balanced, labels_balanced, df_balanced.index, test_size=0.33, random_state=0)

print(X_train_all.shape, X_test_all.shape)
print(X_train_main.shape, X_test_main.shape)
print(X_train_balanced.shape, X_test_balanced.shape)

(73407, 102238) (36156, 102238)
(73407, 102238) (36156, 102238)
(10050, 16833) (4950, 16833)
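
Note that the cells above fit the vectorizer on each full DataFrame before splitting, so the IDF weights and the vocabulary also see the test documents. One possible alternative, sketched here as an assumption and not as what this notebook ran, is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn `Pipeline`. The toy corpus, the `min_df=1` setting and the `stratify` argument are illustrative choices for this tiny example only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the preprocessed 'Item Description' column.
texts = [
    "replica watch box", "watch dial bracelet", "brand watch replica", "ladi watch seri",
    "account login code", "gift code account", "premium account login", "account creator code",
]
labels = ["Counterfeits/Watches"] * 4 + ["Data/Accounts"] * 4

# Split the raw text first, so the test documents stay unseen by the vectorizer.
# stratify is an assumption added to keep the tiny toy split balanced.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

pipeline = Pipeline([
    # min_df=1 only because the toy corpus is tiny; the notebook uses min_df=5.
    ("tfidf", TfidfVectorizer(sublinear_tf=True, min_df=1, norm="l2", ngram_range=(1, 2))),
    ("svc", LinearSVC(random_state=0)),
])

# fit() learns the vocabulary and IDF weights from the training texts only.
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
vocabulary = pipeline.named_steps["tfidf"].vocabulary_
```

With this layout the scores may come out slightly different from the ones reported below, since terms that occur only in the test split no longer contribute to the feature space.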


## Training

In [31]:
from sklearn.svm import LinearSVC

model_all = LinearSVC()
model_main = LinearSVC()
model_balanced = LinearSVC()

model_all.fit(X_train_all, y_train_all)
model_main.fit(X_train_main, y_train_main)
model_balanced.fit(X_train_balanced, y_train_balanced)

y_pred_all = model_all.predict(X_test_all)
y_pred_main = model_main.predict(X_test_main)
y_pred_balanced = model_balanced.predict(X_test_balanced)

## Results

In [35]:
from sklearn import metrics

print("Accuracy: ", metrics.accuracy_score(y_test_all, y_pred_all))
print()
print(metrics.classification_report(y_test_all, y_pred_all))

Accuracy:  0.893046797212081




                                                 precision    recall  f1-score   support

                                      Chemicals       0.88      0.71      0.79        31
                       Counterfeits/Accessories       0.92      0.73      0.81        75
                          Counterfeits/Clothing       0.90      0.91      0.90       132
                       Counterfeits/Electronics       0.67      0.53      0.59        15
                             Counterfeits/Money       0.92      0.89      0.90       123
                           Counterfeits/Watches       0.98      0.99      0.99       360
                                  Data/Accounts       0.78      0.79      0.78       378
                                   Data/Pirated       0.84      0.81      0.82       178
                                  Data/Software       0.74      0.75      0.74       119
                  Drug paraphernalia/Containers       0.68      0.70      0.69        54
                    

In [36]:
print("Accuracy: ", metrics.accuracy_score(y_test_main, y_pred_main))
print()
print(metrics.classification_report(y_test_main, y_pred_main))

Accuracy:  0.9539495519415865

                    precision    recall  f1-score   support

         Chemicals       0.95      0.68      0.79        31
      Counterfeits       0.97      0.93      0.95       705
              Data       0.86      0.80      0.83       675
Drug paraphernalia       0.87      0.85      0.86       275
             Drugs       0.98      1.00      0.99     30807
       Electronics       0.78      0.78      0.78       192
         Forgeries       0.94      0.91      0.92       323
              Info       0.61      0.65      0.63       725
       Information       0.47      0.40      0.43       642
           Jewelry       0.88      0.84      0.86       129
             Other       0.67      0.31      0.43       474
          Services       0.69      0.71      0.70       837
           Tobacco       0.99      0.90      0.94       134
           Weapons       0.94      0.86      0.90       207

         micro avg       0.95      0.95      0.95     36156
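
The main-category test split is heavily imbalanced, so the overall accuracy is worth reading next to a majority-class baseline. The sketch below recomputes that baseline and an approximate macro-averaged F1 from the per-class rows printed above. The macro value is only approximate because it uses the rounded two-decimal scores from the report; the variable names are my own.

```python
# (f1-score, support) per class, copied from the classification report above.
report = {
    "Chemicals": (0.79, 31),
    "Counterfeits": (0.95, 705),
    "Data": (0.83, 675),
    "Drug paraphernalia": (0.86, 275),
    "Drugs": (0.99, 30807),
    "Electronics": (0.78, 192),
    "Forgeries": (0.92, 323),
    "Info": (0.63, 725),
    "Information": (0.43, 642),
    "Jewelry": (0.86, 129),
    "Other": (0.43, 474),
    "Services": (0.70, 837),
    "Tobacco": (0.94, 134),
    "Weapons": (0.90, 207),
}

total = sum(support for _, support in report.values())

# Accuracy of a classifier that always predicts the most frequent class.
majority_class = max(report, key=lambda name: report[name][1])
majority_baseline = report[majority_class][1] / total

# Unweighted mean of the per-class F1 scores (every class counts equally).
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

print(f"test samples:      {total}")
print(f"majority class:    {majority_class}")
print(f"majority baseline: {majority_baseline:.3f}")
print(f"approx. macro F1:  {macro_f1:.3f}")
```

Always predicting 'Drugs' would already reach roughly 85% accuracy on this split, and the approximate macro F1 of about 0.79 shows that the smaller classes are classified noticeably worse than the 95% accuracy suggests.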
       

In [37]:
print("Accuracy: ", metrics.accuracy_score(y_test_balanced, y_pred_balanced))
print()
print(metrics.classification_report(y_test_balanced, y_pred_balanced))

Accuracy:  0.9222222222222223

                               precision    recall  f1-score   support

         Counterfeits/Watches       1.00      0.99      1.00       162
                Data/Accounts       0.90      0.90      0.90       147
                 Data/Pirated       0.95      0.92      0.94       157
                 Drugs/Benzos       0.84      0.87      0.86       166
  Drugs/Cannabis/Concentrates       0.95      0.94      0.95       172
       Drugs/Cannabis/Edibles       0.98      0.98      0.98       162
          Drugs/Cannabis/Hash       0.91      0.92      0.92       153
         Drugs/Cannabis/Seeds       0.97      0.95      0.96       149
    Drugs/Cannabis/Synthetics       0.95      0.96      0.96       170
          Drugs/Cannabis/Weed       0.90      0.93      0.91       174
 Drugs/Dissociatives/Ketamine       0.97      1.00      0.98       159
           Drugs/Ecstasy/MDMA       0.99      0.92      0.95       182
                Drugs/Opioids       0.80     

## Conclusion
The best model we have trained so far is LinearSVC with TF-IDF vectorization. The scores for this pipeline are displayed above.

The Multi-Layer Perceptron (notebook 19.1) scored 90.8% on the balanced set, compared to 92.2% for LinearSVC.

The Multi-Layer Perceptron (notebook 19.2) scored 87.4% on the entire set, compared to 89.3% for LinearSVC.

The Multi-Layer Perceptron (notebook 19.3) scored 94.8% on the main categories, compared to 95.4% for LinearSVC.

We can conclude that the MLP performs well, but it still cannot match the LinearSVC model on any of the three datasets.
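
To make the comparison easier to scan, the accuracies quoted in this conclusion can be put side by side with the gap in percentage points. This is only a small bookkeeping sketch; the dictionary layout and the helper name `gap_in_points` are my own.

```python
# Accuracies in percent, copied from the conclusion above.
scores = {
    "balanced (sample_500)": {"MLP": 90.8, "LinearSVC": 92.2},
    "all categories": {"MLP": 87.4, "LinearSVC": 89.3},
    "main categories": {"MLP": 94.8, "LinearSVC": 95.4},
}


def gap_in_points(row):
    """Return the LinearSVC lead over the MLP in percentage points."""
    return round(row["LinearSVC"] - row["MLP"], 1)


gaps = {name: gap_in_points(row) for name, row in scores.items()}

print(f"{'dataset':<24}{'MLP':>6}{'LinearSVC':>11}{'gap':>7}")
for name, row in scores.items():
    print(f"{name:<24}{row['MLP']:>6.1f}{row['LinearSVC']:>11.1f}{gaps[name]:>+7.1f}")
```

The lead of LinearSVC is largest on the full category set (1.9 points) and smallest on the main categories (0.6 points).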