# Create dataset

AutoML requires data in a specific format before it can be used in a Vertex AI dataset. If we choose the CSV input format, our text dataset should contain three columns:
- **ml_use** - *training*/*validation*/*test*
- **text**
- **label**

Find more details at: [Text training data requirements](https://cloud.google.com/vertex-ai/docs/datasets/prepare-text)  
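As a rough illustration of the three-column layout, the sketch below checks CSV text row by row using only the standard library. The function name `find_layout_problems` and the sample rows are hypothetical, and the checks cover only the layout described here, not every rule in the linked requirements.

```python
import csv
import io

# Values accepted in the first column, as listed above.
ALLOWED_ML_USE = {"training", "validation", "test"}


def find_layout_problems(csv_text):
    """Return (row_number, message) pairs for rows that break the 3-column layout."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        if len(row) != 3:
            problems.append((row_number, f"expected 3 columns, got {len(row)}"))
            continue
        ml_use, text, label = row
        if ml_use not in ALLOWED_ML_USE:
            problems.append((row_number, f"unknown ml_use value: {ml_use!r}"))
        if not text.strip():
            problems.append((row_number, "empty text"))
        if not label.strip():
            problems.append((row_number, "empty label"))
    return problems


# Hypothetical rows, made up for illustration only.
sample_csv = (
    'training,"Great toy, easy to use",toys games\n'
    "validation,Tasted fine,grocery gourmet food\n"
    "eval,My cat loves it,pet supplies\n"
)
layout_problems = find_layout_problems(sample_csv)
print(layout_problems)
```

Text that contains commas has to be quoted, as in the first sample row, so that it still parses as a single column.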


## Data

For demonstration purposes we use a general [text classification](https://www.kaggle.com/kashnitsky/hierarchical-text-classification) dataset from Kaggle. The dataset contains product reviews labelled with a hierarchy of categories. To simplify the task, we use only the top-level category. There are 6 top-level categories, such as *beauty*, *toys games*, and *pet supplies*.

The Kaggle dataset provides a training set and a validation set. To be able to measure model performance, we keep the provided validation set as our test set. The original training set can then be split into training and validation sets if required.

In [1]:
import pandas as pd

train_df = pd.read_csv("gs://haba-ws/data/train_40k.csv")

In [9]:
train_df.head()

| | productId | Title | userId | Helpfulness | Score | Time | Text | Cat1 | Cat2 | Cat3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B000E46LYG | Golden Valley Natural Buffalo Jerky | A3MQDNGHDJU4MK | 0/0 | 3.0 | -1 | The description and photo on this product need... | grocery gourmet food | meat poultry | jerky |
| 1 | B000GRA6N8 | Westing Game | unknown | 0/0 | 5.0 | 860630400 | This was a great book!!!! It is well thought t... | toys games | games | unknown |
| 2 | B000GRA6N8 | Westing Game | unknown | 0/0 | 5.0 | 883008000 | I am a first year teacher, teaching 5th grade.... | toys games | games | unknown |
| 3 | B000GRA6N8 | Westing Game | unknown | 0/0 | 5.0 | 897696000 | I got the book at my bookfair at school lookin... | toys games | games | unknown |
| 4 | B00000DMDQ | I SPY A is For Jigsaw Puzzle 63pc | unknown | 2/4 | 5.0 | 911865600 | Hi! I'm Martine Redman and I created this puzz... | toys games | puzzles | jigsaw puzzles |


Use the provided validation set as the test set.

In [10]:
test_df = pd.read_csv("gs://haba-ws/data/val_10k.csv")[["Text", "Cat1"]]
test_df["ml_use"] = "test"

Split the training set into training and validation sets.

In [11]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(train_df[["Text", "Cat1"]], test_size=5_000, random_state=42)
train_df["ml_use"] = "training"
val_df["ml_use"] = "validation"
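The split above is purely random. If some categories are rare, one possible variation is a stratified split, which keeps label proportions the same in both parts. The sketch below shows the idea on a small synthetic frame; `toy_df` and its contents are made up for illustration and are not part of the Kaggle data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the review data: 60 rows of one label, 40 of another.
toy_df = pd.DataFrame(
    {
        "Text": [f"review {i}" for i in range(100)],
        "Cat1": ["toys games"] * 60 + ["pet supplies"] * 40,
    }
)

# stratify keeps the label proportions the same in both parts of the split.
toy_train, toy_val = train_test_split(
    toy_df, test_size=20, random_state=42, stratify=toy_df["Cat1"]
)

# Copy before adding a column to avoid a possible SettingWithCopyWarning.
toy_train = toy_train.copy()
toy_val = toy_val.copy()
toy_train["ml_use"] = "training"
toy_val["ml_use"] = "validation"

# Label counts in the validation part, to compare against the 60/40 source ratio.
val_counts = toy_val["Cat1"].value_counts().to_dict()
print(val_counts)
```

Whether stratification is worth it here depends on how balanced the six top-level categories actually are, which this notebook does not show.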

In [14]:
data = pd.concat([train_df, val_df, test_df])[["ml_use", "Text", "Cat1"]]
data.head()

| | ml_use | Text | Cat1 |
|---|---|---|---|
| 27053 | training | This is oval and lop sided. I tried using it m... | toys games |
| 4082 | training | This is the best set hands downit stays togeth... | toys games |
| 38171 | training | Well, after several months of use I decided to... | pet supplies |
| 165 | training | These didn't really taste the way I expected a... | health personal care |
| 12079 | training | Great toy! Liked the the back light, easy to p... | toys games |


Save the dataset without a header row or index column to match the required format.

In [16]:
data.to_csv("gs://haba-ws/data.csv", index=False, header=False)
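Because the file has no header row, reading it back needs `header=None`; otherwise pandas would treat the first data row as column names. The sketch below shows a quick row count per split on an in-memory stand-in for the saved file. The sample rows and the column names passed to `names=` are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical header-less content in the ml_use,text,label layout.
saved_csv = io.StringIO(
    'training,"Great toy, easy to use",toys games\n'
    "training,Tasted fine,grocery gourmet food\n"
    "validation,My cat loves it,pet supplies\n"
    "test,Fun puzzle,toys games\n"
)

# header=None stops pandas from consuming the first data row as column names.
check_df = pd.read_csv(saved_csv, header=None, names=["ml_use", "text", "label"])

# Number of rows assigned to each split.
rows_per_split = check_df["ml_use"].value_counts().to_dict()
print(rows_per_split)
```

To run the same check on the real file, the buffer can be replaced with the path that was passed to `to_csv`.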