# 22. Mapped dataset
We have now created a dataset that maps the Agora dataset categories to the Interpol categories. Let's see how that set performs. It will probably do better, since there are fewer categories.

## Preprocessing

In [20]:
import pandas as pd
from preprocessing import PreProcessor

data = pd.read_csv('Mapped.csv')

categories = data['Category']
descriptions = data['Item'] + " " + data['Item Description']

df = pd.DataFrame({'Category': categories, 'Item Description': descriptions})
df = df[pd.notnull(df['Item Description'])] # no empty descriptions
df = df[df.groupby('Category')['Category'].transform(len) > 1] # only categories that appear more than once

df['category_id'] = df['Category'].factorize()[0]
category_id_df = df[['Category', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Category']].values)

df.to_csv('Structured_DataFrame_Mapped.csv')

pp = PreProcessor()

df['Item Description'] = df['Item Description'].apply(lambda d: pp.preprocess(str(d)))
df

Unnamed: 0,Category,Item Description,category_id
0,Services,month huluplu gift code month huluplu code wor...,0
1,Services,pay tv sky uk sky germani hd tv much cccam ser...,0
2,Services,offici account creator extrem tag submiss fix ...,0
3,Services,vpn tor sock tutori setup vpn tor sock super s...,0
4,Services,facebook hack guid guid teach hack facebook ac...,0
...,...,...,...
110567,Drugs,gr purifi opium list gramm redefin opium pefec...,1
110568,Explosives,ship ticket order ship one gun bought must bou...,12
110569,Drugs,gram white afghani heroin full escrow gram whi...,1
110570,Drugs,gram white afghani heroin full escrow gram whi...,1
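
To make the filtering and label-encoding steps above easier to follow, here is a minimal sketch on a toy frame. The rows are made up for illustration and are not taken from `Mapped.csv`; the column names match the cell above.

```python
import pandas as pd

# Toy stand-in for the mapped dataset (made-up rows, not real data).
toy = pd.DataFrame({
    'Category': ['Services', 'Drugs', 'Services', 'Explosives', 'Drugs', 'Drugs'],
    'Item Description': ['vpn setup', 'gram heroin', 'tv code',
                         'ship ticket', None, 'opium list'],
})

# Drop rows without a description, then drop categories that occur only once.
toy = toy[pd.notnull(toy['Item Description'])]
toy = toy[toy.groupby('Category')['Category'].transform('count') > 1].copy()

# factorize() assigns integer ids in order of first appearance.
codes, uniques = toy['Category'].factorize()
toy['category_id'] = codes

category_to_id = {name: i for i, name in enumerate(uniques)}
id_to_category = dict(enumerate(uniques))

print(category_to_id)
print(toy)
```

In this toy frame, `Explosives` is removed because only one row of that category survives, which is the same effect the `transform(len) > 1` filter has on the real data.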


## Vectorizing

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2))

features = tfidf.fit_transform(df['Item Description'])
labels = df.Category

features

<110572x102119 sparse matrix of type '<class 'numpy.float64'>'
	with 3419789 stored elements in Compressed Sparse Row format>
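
The effect of `min_df` and `ngram_range` on the vocabulary can be seen on a small corpus. This is only a sketch: the four documents are invented, and `min_df=2` is used instead of `5` so that the toy corpus keeps any terms at all.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented, already-preprocessed descriptions (not from the dataset).
docs = [
    'gram white heroin',
    'gram white opium',
    'vpn tor setup',
    'vpn tor guide',
]

# Same settings as above, except min_df is lowered to suit four documents.
toy_tfidf = TfidfVectorizer(sublinear_tf=True, min_df=2, norm='l2',
                            ngram_range=(1, 2))
toy_features = toy_tfidf.fit_transform(docs)

# Only unigrams and bigrams that occur in at least two documents survive.
vocabulary = sorted(toy_tfidf.vocabulary_)
print(vocabulary)
print(toy_features.shape)
```

Terms that appear in a single document, such as `heroin` or `setup`, are dropped, which is how `min_df=5` keeps the real feature matrix at roughly 102,000 columns rather than growing with every rare bigram.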

## Splitting

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, df.index, test_size=0.33, random_state=0)

print(X_train.shape, X_test.shape)

(74083, 102119) (36489, 102119)
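
The split above is purely random, so very small categories can end up with only a handful of test samples. One possible variation, not used in this notebook, is a stratified split via the `stratify` argument of `train_test_split`. The sketch below uses made-up labels to show the idea.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up, heavily imbalanced labels: 90 of one class, 10 of another.
toy_labels = ['Drugs'] * 90 + ['Explosives'] * 10
toy_features = [[i] for i in range(len(toy_labels))]

# stratify keeps the class proportions roughly equal in both parts.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_features, toy_labels, test_size=0.33, random_state=0,
    stratify=toy_labels)

train_counts = Counter(ys_train)
test_counts = Counter(ys_test)
print(train_counts)
print(test_counts)
```

Stratification requires every class to have at least two members, which the "more than once" filter in the preprocessing cell already guarantees.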


## Training

In [23]:
from sklearn.svm import LinearSVC

model = LinearSVC()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
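
In this notebook the TF-IDF vectorizer is fitted on all descriptions before the split, so document frequencies from the test rows influence the features. A possible alternative, shown here only as a sketch on invented texts, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that the vectorizer is fitted on the training part only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented, already-preprocessed descriptions and labels (not real data).
texts = [
    'gram white heroin escrow', 'gram purifi opium', 'gram afghani heroin',
    'white heroin full escrow', 'opium gram list', 'heroin gram ship',
    'vpn tor sock setup', 'facebook hack guid', 'pay tv sky code',
    'vpn tutori setup guid', 'account creator code', 'tv gift code month',
]
toy_labels = ['Drugs'] * 6 + ['Services'] * 6

# Split the raw text, not the vectorized features.
text_train, text_test, yt_train, yt_test = train_test_split(
    texts, toy_labels, test_size=0.33, random_state=0, stratify=toy_labels)

# The pipeline fits TF-IDF on the training text only, then trains the SVM.
clf = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, norm='l2', ngram_range=(1, 2)),
    LinearSVC(),
)
clf.fit(text_train, yt_train)
toy_pred = clf.predict(text_test)
print(list(zip(yt_test, toy_pred)))
```

Whether this changes the scores noticeably on the real dataset is not something this notebook measures.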

## Results

In [24]:
from sklearn import metrics

print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))
print()
print(metrics.classification_report(y_test, y_pred))

Accuracy:  0.9645646633231932





                  precision    recall  f1-score   support

        Chemical       1.00      0.75      0.86        28
     Counterfeit       0.93      0.89      0.91       939
           Drugs       0.99      0.99      0.99     31950
      Explosives       0.00      0.00      0.00         5
        Firearms       0.83      0.70      0.76       124
Forged Documents       0.93      0.88      0.91       353
        Services       0.69      0.68      0.69       868
        Software       0.72      0.60      0.66       100
     Stolen Data       0.83      0.81      0.82       560
    Stolen Goods       0.89      0.84      0.87       131
        Tutorial       0.36      0.24      0.29       304
         Weapons       0.92      0.75      0.83        96
            Wiki       0.75      0.78      0.77      1031

       micro avg       0.96      0.96      0.96     36489
       macro avg       0.76      0.69      0.72     36489
    weighted avg       0.96      0.96      0.96     36489
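
To put the accuracy in context, the per-class numbers from the report above can be recombined by hand. The sketch below copies the `f1-score` and `support` columns from the report (rounded to two decimals, so the results are approximate) and computes the share of the largest class, which is the accuracy a classifier would get by always predicting `Drugs`, along with the macro and weighted F1 averages.

```python
# f1-score and support per class, copied from the classification report above.
report = {
    'Chemical':         (0.86, 28),
    'Counterfeit':      (0.91, 939),
    'Drugs':            (0.99, 31950),
    'Explosives':       (0.00, 5),
    'Firearms':         (0.76, 124),
    'Forged Documents': (0.91, 353),
    'Services':         (0.69, 868),
    'Software':         (0.66, 100),
    'Stolen Data':      (0.82, 560),
    'Stolen Goods':     (0.87, 131),
    'Tutorial':         (0.29, 304),
    'Weapons':          (0.83, 96),
    'Wiki':             (0.77, 1031),
}

total = sum(support for _, support in report.values())

# Accuracy of a trivial model that always predicts the largest class.
majority_baseline = report['Drugs'][1] / total

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total

print(f"majority baseline: {majority_baseline:.4f}")
print(f"macro F1:          {macro_f1:.2f}")
print(f"weighted F1:       {weighted_f1:.2f}")
```

The gap between the macro and weighted averages comes from the class imbalance: `Drugs` accounts for most of the test set, so it dominates any support-weighted figure.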



## Conclusion
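The mapped dataset performs even better, as expected: overall accuracy is about 96.5%. That figure is dominated by the Drugs class (31,950 of the 36,489 test samples), though. The macro-averaged F1 is only 0.72, and small or ambiguous categories such as Explosives (F1 0.00 on 5 samples) and Tutorial (F1 0.29) are still classified poorly.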