# 9. Reusable DataFrames
As we try more approaches, test more methods, and tweak more parameters, it is helpful to be able to import ready-made DataFrames instead of reorganizing the data for each specific use. This notebook creates a few such DataFrames and saves them for later use.

No preprocessing is applied to any of these DataFrames, so we can still tune the preprocessing for each specific use.

In [14]:
import pandas as pd

data = pd.read_csv('darkweb/data/agora.csv')

categories = data[' Category']
categories_main = data[' Category'].apply(lambda x: x.split('/')[0])
descriptions = data[' Item'] + " " + data[' Item Description']
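The column names above (`' Category'`, `' Item'`) carry a leading space because that is how they appear in the CSV header. As a minimal sketch, assuming we would rather work with clean names, the header can be stripped right after loading. The two-row `raw` sample below is a hypothetical stand-in for `agora.csv`, not real data.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for agora.csv (the real file has more columns).
raw = io.StringIO(
    "Vendor, Category, Item, Item Description\n"
    "v1,Drugs/Weed,Item A,Description A\n"
    "v2,Services/Hacking,Item B,Description B\n"
)
sample = pd.read_csv(raw)
original_columns = list(sample.columns)  # names still carry the leading space

# Strip surrounding whitespace from every column name.
sample.columns = sample.columns.str.strip()
print(original_columns, '->', list(sample.columns))
```

The rest of this notebook keeps the original names with the leading space, so this step is optional.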

## All categories
Here, we reduce the data to only the basic columns we need. We clean it a bit (rows without a description and categories that appear only once are dropped), and save it for later use.

In [15]:
df = pd.DataFrame({'Category': categories, 'Item Description': descriptions})
df = df[pd.notnull(df['Item Description'])] # no empty descriptions
df = df[df.groupby('Category')['Category'].transform(len) > 1] # only categories that appear more than once

df['category_id'] = df['Category'].factorize()[0]
category_id_df = df[['Category', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Category']].values)

df.to_csv('Structured_DataFrame.csv')
df.head()

Unnamed: 0,Category,Item Description,category_id
0,Services/Hacking,12 Month HuluPlus gift Code 12-Month HuluPlus ...,0
1,Services/Hacking,Pay TV Sky UK Sky Germany HD TV and much mor...,0
2,Services/Hacking,OFFICIAL Account Creator Extreme 4.2 Tagged Su...,0
3,Services/Hacking,VPN > TOR > SOCK TUTORIAL How to setup a VPN >...,0
4,Services/Hacking,Facebook hacking guide . This guide will teac...,0
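The `category_to_id` and `id_to_category` dictionaries are not written to the CSV, so a later notebook has to rebuild them from the saved columns. The sketch below shows one possible way to do that. It round-trips a small hypothetical frame (`df_demo`) through an in-memory buffer instead of a file, and the `_demo` names are ours, not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical miniature of the saved frame.
df_demo = pd.DataFrame({
    'Category': ['Services/Hacking', 'Drugs/Weed', 'Services/Hacking'],
    'Item Description': ['a', 'b', 'c'],
})
df_demo['category_id'] = df_demo['Category'].factorize()[0]

buffer = io.StringIO()
df_demo.to_csv(buffer)  # same kind of call as above, but into memory
buffer.seek(0)

# index_col=0 restores the saved index instead of adding an 'Unnamed: 0' column.
loaded = pd.read_csv(buffer, index_col=0)

# Rebuild both lookup dictionaries from the saved columns.
pairs = loaded[['Category', 'category_id']].drop_duplicates()
category_to_id_demo = {c: int(i) for c, i in zip(pairs['Category'], pairs['category_id'])}
id_to_category_demo = {i: c for c, i in category_to_id_demo.items()}
print(category_to_id_demo)
```

For the real file, the same pattern would start from `pd.read_csv('Structured_DataFrame.csv', index_col=0)`.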


## Main categories
Here, we collapse every subcategory into its main category. For example, Drugs/Weed, Drugs/MDMA, etc. are all combined into a single 'Drugs' category.

In [16]:
df_main = pd.DataFrame({'Category': categories_main, 'Item Description': descriptions})
df_main = df_main[pd.notnull(df_main['Item Description'])] # no empty descriptions
df_main = df_main[df_main.groupby('Category')['Category'].transform(len) > 1] # only categories that appear more than once

df_main['category_id'] = df_main['Category'].factorize()[0]
category_id_df_main = df_main[['Category', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id_main = dict(category_id_df_main.values)
id_to_category_main = dict(category_id_df_main[['category_id', 'Category']].values)

df_main.to_csv('Structured_DataFrame_Main_Categories.csv')
df_main.head()

Unnamed: 0,Category,Item Description,category_id
0,Services,12 Month HuluPlus gift Code 12-Month HuluPlus ...,0
1,Services,Pay TV Sky UK Sky Germany HD TV and much mor...,0
2,Services,OFFICIAL Account Creator Extreme 4.2 Tagged Su...,0
3,Services,VPN > TOR > SOCK TUTORIAL How to setup a VPN >...,0
4,Services,Facebook hacking guide . This guide will teac...,0
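To make the collapsing step concrete, here is a small sketch of how the main category is derived from a full category path. The `cats` Series is made up for illustration. It compares the `apply`/`lambda` form used in this notebook with the vectorized `.str` accessor, which gives the same result on string data.

```python
import pandas as pd

# Made-up category paths for illustration.
cats = pd.Series(['Drugs/Weed', 'Drugs/MDMA', 'Services/Hacking', 'Counterfeits'])

# Form used in this notebook: keep the text before the first '/'.
main_lambda = cats.apply(lambda x: x.split('/')[0])

# Vectorized equivalent using the .str accessor.
main_str = cats.str.split('/').str[0]

print(main_str.tolist())
```

One practical difference: the `.str` version passes missing values through as missing, while the `lambda` version raises an error when it meets a non-string value such as NaN.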


## Balanced sample
We also want to try a set of balanced categories to avoid bias towards, for example, the 'Drugs' category, which is by far the largest. Some categories are really small. We found that we get a nice dataset by limiting it to categories with 500 records or more, taking 500 records from each, and discarding all the smaller ones. This results in a dataset of 15000 records and 30 categories.

In [17]:
min_records_per_category = 500

unique_categories = data[' Category'].unique()
sorteddf = data.sort_values([' Category']).groupby(' Category').head(min_records_per_category)
filtereddf = sorteddf.where(sorteddf[" Category"].isin(unique_categories))
filtereddf = filtereddf[filtereddf["Vendor"].notnull()]
filtereddf = filtereddf[~filtereddf[' Category'].str.contains('Other')]
filtereddf = filtereddf[~filtereddf[' Category'].str.contains('Information')]

categories = filtereddf[' Category']
descriptions = filtereddf[' Item'] + " " + filtereddf[' Item Description']

df_balanced = pd.DataFrame({'Category': categories, 'Item Description': descriptions})
df_balanced = df_balanced[pd.notnull(df_balanced['Item Description'])] # no empty descriptions
df_balanced = df_balanced[df_balanced.groupby('Category')['Category'].transform(len) >= min_records_per_category] # only categories that reach the set minimum

df_balanced['category_id'] = df_balanced['Category'].factorize()[0]
category_id_df = df_balanced[['Category', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Category']].values)

df_balanced.to_csv('Structured_DataFrame_Sample_500.csv')
df_balanced.head()

Unnamed: 0,Category,Item Description,category_id
40127,Counterfeits/Watches,Emporio Armani - AR1610 Shell Case ceramic bra...,0
40126,Counterfeits/Watches,Cartier-Tank Ladies Brand: Cartier Series: Tan...,0
40125,Counterfeits/Watches,Patek Philippe watch box ★ Patek Philippe - Wa...,0
40130,Counterfeits/Watches,Breitling - NAVITIMER COSMONAUTE 【Replica】 Wat...,0
40129,Counterfeits/Watches,Emporio Armani Men's AR0397 Dial color Gary Wa...,0
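The balancing logic above boils down to two steps: cap every category at `n` rows, then drop the categories that end up with fewer than `n`. The sketch below replays those two steps on a made-up six-row frame with `n = 2` standing in for 500. The names `toy`, `capped` and `balanced` are illustrative only.

```python
import pandas as pd

n = 2  # stand-in for min_records_per_category = 500

# Made-up data: A has 3 rows, B has 2, C has 1.
toy = pd.DataFrame({
    'Category': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Item Description': ['a1', 'a2', 'a3', 'b1', 'b2', 'c1'],
})

# Step 1: keep at most n rows per category (the first n in the current order).
capped = toy.groupby('Category').head(n)

# Step 2: keep only the categories that still have at least n rows.
sizes = capped.groupby('Category')['Category'].transform('count')
balanced = capped[sizes >= n]

counts = balanced['Category'].value_counts()
print(counts.to_dict())
```

Note that `head(n)` takes the first `n` rows of each group in the existing row order, so the result is a deterministic slice, not a random sample.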


## Balanced sample of main categories
We also want to be able to test with a set of balanced categories that uses only the main category labels. Because we have fewer categories now, the dataset will be smaller. In this case, we have 9 main categories left, each with 500 items. This means we get a dataset of 4500 items.

In [22]:
min_records_per_category = 500

unique_categories = df_main['Category'].unique()
sorteddf = df_main.sort_values(['Category']).groupby('Category').head(min_records_per_category)
filtereddf = sorteddf.where(sorteddf["Category"].isin(unique_categories))
filtereddf = filtereddf[~filtereddf['Category'].str.contains('Other')]
filtereddf = filtereddf[~filtereddf['Category'].str.contains('Information')]

categories = filtereddf['Category']
descriptions = filtereddf['Item Description']

df_balanced_main = pd.DataFrame({'Category': categories, 'Item Description': descriptions})
df_balanced_main = df_balanced_main[pd.notnull(df_balanced_main['Item Description'])] # no empty descriptions
df_balanced_main = df_balanced_main[df_balanced_main.groupby('Category')['Category'].transform(len) >= min_records_per_category] # only categories that reach the set minimum

df_balanced_main['category_id'] = df_balanced_main['Category'].factorize()[0]
category_id_df = df_balanced_main[['Category', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Category']].values)

df_balanced_main.to_csv('Structured_DataFrame_Sample_500_Main_Categories.csv')
df_balanced_main.head()

Unnamed: 0,Category,Item Description,category_id
73333,Counterfeits,Prada Double Bag BN2756 Bag Replica Prada Doub...,0
73334,Counterfeits,Bottega Veneta BV132S Sunglasses Replica Botte...,0
73335,Counterfeits,Prada SPR27N Sunglasses Replica ◆ Prada SPR27N...,0
73336,Counterfeits,Breitling-AVENGER SEAWOLF CHRONO 45MM BWY【NOOB...,0
73338,Counterfeits,Rolex-LADY-DATEJUST 26MM SWW【Replica】Woman wat...,0
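The expected size of this dataset is the number of categories times the records per category (9 × 500 = 4500). A quick sanity check along those lines could look like the sketch below. `is_balanced` is a hypothetical helper, not part of the original notebook, and `toy_main` is a made-up frame with 3 categories of 3 rows each.

```python
import pandas as pd

def is_balanced(frame, per_category, label_col='Category'):
    """Return True if every category in label_col has exactly per_category rows."""
    counts = frame[label_col].value_counts()
    return bool((counts == per_category).all())

# Made-up frame: 3 categories with 3 rows each.
toy_main = pd.DataFrame({
    'Category': ['Drugs'] * 3 + ['Services'] * 3 + ['Counterfeits'] * 3,
})

n_categories = toy_main['Category'].nunique()
print(is_balanced(toy_main, 3), len(toy_main) == n_categories * 3)
```

On the real data, the equivalent check would be `is_balanced(df_balanced_main, 500)` together with `len(df_balanced_main) == 4500`.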
