# 26. Balancing by copying - TFIDF
We want to see whether balancing the dataset in a different way gives better results. Instead of training on the unbalanced main categories set, we oversample: items from the smaller categories are copied until every category has the same size.

## Preprocessing

In [48]:
import pandas as pd
from preprocessing import PreProcessor

pp = PreProcessor()

df = pd.read_csv('../Data/Structured_DataFrame_Main_Categories.csv', index_col=0)
df['Item Description'] = df['Item Description'].apply(lambda d: pp.preprocess(str(d)))
df

Unnamed: 0,Category,Item Description,category_id
0,Services,month huluplu gift code month huluplu code wor...,0
1,Services,pay tv sky uk sky germani hd tv much cccam ser...,0
2,Services,offici account creator extrem tag submiss fix ...,0
3,Services,vpn tor sock tutori setup vpn tor sock super s...,0
4,Services,facebook hack guid guid teach hack facebook ac...,0
...,...,...,...
109585,Drugs,gr purifi opium list gramm redefin opium pefec...,1
109586,Weapons,ship ticket order ship one gun bought must bou...,11
109587,Drugs,gram white afghani heroin full escrow gram whi...,1
109588,Drugs,gram white afghani heroin full escrow gram whi...,1


## Splitting
Because we are going to expand the dataset with records that are duplicates, we want to split before we do that, to prevent having the same descriptions in the test and train set.

In [49]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df["Item Description"], df.Category, test_size=0.33, random_state=0)

data_train = {'Category': y_train, 'Item_Description': X_train}
df_train = pd.DataFrame(data_train)
print(df_train.shape)

data_test = {'Category': y_test, 'Item_Description': X_test}
df_test = pd.DataFrame(data_test)
print(df_test.shape)

(73407, 2)
(36156, 2)
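
Splitting first keeps the copies made during balancing out of the test set, but it does not remove duplicates that are already in the raw data (rows 109587 and 109588 look identical in the truncated preview above). A minimal sketch of an overlap check follows; `shared_texts` and the toy lists are hypothetical stand-ins, not part of the original notebook.

```python
def shared_texts(train_texts, test_texts):
    """Return the descriptions that occur in both the training and the test set."""
    return set(train_texts) & set(test_texts)

# Toy stand-ins for X_train and X_test (hypothetical data)
train = ["gram white afghani heroin", "stealth knife card", "vpn tor sock tutori"]
test = ["gram white afghani heroin", "facebook hack guid"]

overlap = shared_texts(train, test)
print(len(overlap), "description(s) occur in both sets")
```

In the notebook, the same function could be called as `shared_texts(X_train, X_test)`, since a pandas Series iterates over its values.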


## Sampling
We sample from every category with replacement, so that categories with fewer items are filled up with duplicates until all category groups are as large as the largest one.

In [50]:
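grouped = df_train.groupby('Category', group_keys=False)
# Size of the largest category; every category is resampled to this size
target_size = grouped.size().max()
# Sample with replacement so that smaller categories are filled with duplicates
df_train_balanced = grouped.apply(lambda x: x.sample(target_size, replace=True)).reset_index(drop=True)
df_train_balanced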

Unnamed: 0,Category,Item_Description
0,Chemicals,chemistri advic provid high level chemistri su...
1,Chemicals,nitroethan gallon pure nitroethan need ac grade
2,Chemicals,chemistri advic provid high level chemistri su...
3,Chemicals,phenylnitropropen kg phenylnitropropen kg
4,Chemicals,g red phosphoru reagent grade free em ship red...
...,...,...
870879,Weapons,telescop expand baton black protect solid stee...
870880,Weapons,custom vacheron custom list vacheron includ ship
870881,Weapons,stealth knife card knife card ship eu countri ...
870882,Weapons,silent stun k sent oz silent stun k sent oz
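
To make the sampling step easier to follow, here is a small pure-Python sketch of the same idea on toy data. The function `oversample`, the fixed seed and the example rows are assumptions for illustration only; the notebook itself uses `DataFrame.sample(..., replace=True)`.

```python
import random
from collections import Counter, defaultdict

def oversample(rows, seed=0):
    """Resample every category with replacement to the size of the largest one.

    rows: list of (category, description) tuples.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for category, description in rows:
        by_category[category].append(description)

    # Every category is brought up to the size of the largest category
    target_size = max(len(items) for items in by_category.values())

    balanced = []
    for category, items in by_category.items():
        # Drawing with replacement lets small categories repeat their items
        drawn = rng.choices(items, k=target_size)
        balanced.extend((category, description) for description in drawn)
    return balanced

# Hypothetical toy rows: three 'Drugs' items and one 'Weapons' item
rows = [("Drugs", "gram heroin"), ("Drugs", "gram opium"),
        ("Drugs", "gram cocain"), ("Weapons", "stealth knife")]
balanced = oversample(rows)
counts = Counter(category for category, _ in balanced)
print(counts)
```

Note that sampling with replacement is also applied to the largest category, so that category becomes a bootstrap sample: some of its original rows are repeated and others are left out. If that is not wanted, the largest category could be kept unchanged and only the smaller ones resampled.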


## Vectorizing
Now that we have the training and test sets, we vectorize the features with a single tfidf model: it is fitted on the balanced training set only and then applied unchanged to the test set.

In [51]:
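from sklearn.feature_extraction.text import TfidfVectorizer

# The vectorizer was never instantiated in this cell; default parameters are an assumption
tfidf = TfidfVectorizer()

# Fit on the balanced training set only, then reuse the fitted vocabulary for the test set
X_train = tfidf.fit_transform(df_train_balanced.Item_Description)
y_train = df_train_balanced.Category.values
X_test = tfidf.transform(df_test.Item_Description)
y_test = df_test.Category.values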

## Training

In [52]:
from sklearn.svm import LinearSVC

model = LinearSVC()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

## Results

In [1]:
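from sklearn import metrics

# Run this cell in the same kernel session as the training cell, so that y_test and y_pred exist
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))
print()
# The labels are already category names and are reported in sorted order;
# passing df['Category'].unique() as target_names would mislabel the rows
print(metrics.classification_report(y_test, y_pred))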

NameError: name 'y_test' is not defined

## Conclusion
We get a good result but, compared with notebook 21, it is a little bit worse than without the sample balancing. A possible explanation is that the tfidf model takes the document frequency of words into account. With duplicated documents, the words of the oversampled categories appear in more documents. The tfidf model treats this as a sign that those words are less meaningful, although they don't have to be; they're just duplicated.
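
The effect described above can be illustrated with a small sketch. It assumes the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`; the helper `smoothed_idf` and the toy documents are hypothetical.

```python
import math

def smoothed_idf(docs):
    """Return {term: idf} with idf = ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    doc_freq = {}
    for doc in docs:
        # Count each term once per document
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Hypothetical toy corpus: three majority documents and one minority document
majority = ["gram heroin", "gram opium", "gram cocain"]
minority = ["stealth knife"]

idf_before = smoothed_idf(majority + minority)
# Oversampling: the minority document is copied until both groups have three documents
idf_after = smoothed_idf(majority + minority * 3)

print("knife:", idf_before["knife"], "->", idf_after["knife"])
print("gram: ", idf_before["gram"], "->", idf_after["gram"])
```

In this toy corpus the weight of the minority word `knife` goes down after copying, while the weight of the majority word `gram` goes up, which matches the suspicion that duplication shifts the tfidf weights rather than adding information.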

Word2Vec might work better for this reason, so that's what we'll try in the next notebook.