Alexander S. Lundervold, 28.03.22

# Introduction

This notebook prepares the data sets we'll use as we go through the various components of TensorFlow Extended.

## Data sets

* A simplified version of the PetFinder.my data set from the Kaggle competition https://www.kaggle.com/c/petfinder-adoption-prediction. The simplified version was created by the TensorFlow team and used in their tutorials (e.g., https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers). 
* The full PetFinder.my data set from Kaggle.
* _More TBA_

<img src="assets/petfinder.png">

# Setup

In [1]:
%matplotlib inline
import pandas as pd, numpy as np, urllib.request, shutil, os
from pathlib import Path

In [2]:
# Check whether we're running on Colab
try:
    import google.colab
    colab = True
except ImportError:
    colab = False

In [3]:
if not colab:
    # Store small files in the repo:
    NB_DIR = Path.cwd()
    LOCAL_DATA = NB_DIR/'..'/'data' 
    PETFINDER_MINI = LOCAL_DATA/'petfinder-mini'
    # Store larger files outside the repo:
    DATA = Path('/home/alex/data/dat255')
else:
    from google.colab import drive
    drive.mount('./gdrive')
    DATA = Path('./gdrive/MyDrive/ColabData')
    LOCAL_DATA = DATA
    PETFINDER_MINI = DATA/'petfinder-mini'
    
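# Create the data directories (including any missing parents) if needed
DATA.mkdir(parents=True, exist_ok=True)
LOCAL_DATA.mkdir(parents=True, exist_ok=True)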

# Download PetFinder-mini

In [4]:
petfinder_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'

In [7]:
# Download and unpack the archive only if the CSV file isn't already in place
if not os.path.isfile(PETFINDER_MINI/'petfinder-mini.csv'):
    _ = urllib.request.urlretrieve(petfinder_url, filename=LOCAL_DATA/'petfinder-mini.zip')
    shutil.unpack_archive(LOCAL_DATA/'petfinder-mini.zip', extract_dir=LOCAL_DATA)
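
The download step can be made safe to re-run by wrapping it in a small helper. The following is only a sketch: `fetch_and_unpack` and its parameters are hypothetical names introduced here, not part of the notebook or of any library. It skips the download whenever one of the expected files is already present, which lets us list both the original and the relocated CSV path.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_and_unpack(url, archive_path, extract_dir, expected_files):
    """Download and unpack an archive unless an expected file already exists.

    Hypothetical helper. Returns True if a download happened, False if skipped.
    """
    archive_path, extract_dir = Path(archive_path), Path(extract_dir)
    # Skip all work if any of the expected files is already on disk
    if any(Path(p).is_file() for p in expected_files):
        return False
    # Make sure the target directories exist before writing to them
    archive_path.parent.mkdir(parents=True, exist_ok=True)
    extract_dir.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, filename=archive_path)
    shutil.unpack_archive(archive_path, extract_dir=extract_dir)
    return True
```

In this notebook it could be called as `fetch_and_unpack(petfinder_url, LOCAL_DATA/'petfinder-mini.zip', LOCAL_DATA, [PETFINDER_MINI/'petfinder-mini.csv', PETFINDER_MINI/'csv'/'petfinder-mini.csv'])`, so that re-running the cell after the CSV file has been moved does not trigger a new download.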

In [8]:
pd.read_csv(PETFINDER_MINI/'petfinder-mini.csv').head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,Description,PhotoAmt,AdoptionSpeed
0,Cat,3,Tabby,Male,Black,White,Small,Short,No,No,Healthy,100,Nibble is a 3+ month old ball of cuteness. He ...,1,2
1,Cat,1,Domestic Medium Hair,Male,Black,Brown,Medium,Medium,Not Sure,Not Sure,Healthy,0,I just found it alone yesterday near my apartm...,2,0
2,Dog,1,Mixed Breed,Male,Brown,White,Medium,Medium,Yes,No,Healthy,0,Their pregnant mother was dumped by her irresp...,7,3
3,Dog,4,Mixed Breed,Female,Black,Brown,Medium,Short,Yes,No,Healthy,150,"Good guard dog, very alert, active, obedience ...",8,2
4,Dog,1,Mixed Breed,Male,Black,No Color,Medium,Short,No,No,Healthy,0,This handsome yet cute boy is up for adoption....,3,2


Move the CSV file to a separate subdirectory:

In [8]:
(PETFINDER_MINI/'csv').mkdir(exist_ok=True)
(PETFINDER_MINI/'petfinder-mini.csv').rename(PETFINDER_MINI/'csv'/'petfinder-mini.csv')

'/data-ssd/Dropbox/Jobb/projects/ML/medGPU1-alex/DAT255-repo/TensorFlow_Extended/nbs/../data/petfinder-mini/csv/petfinder-mini.csv'

## Split the data

It will be useful to have our data stored on disk across multiple CSV files. Let's split it at random into two parts and save them to disk:

In [9]:
# The CSV file was moved to the 'csv' subdirectory above
df = pd.read_csv(PETFINDER_MINI/'csv'/'petfinder-mini.csv')

In [15]:
len(df)

11537

In [11]:
df1 = df.sample(frac=0.5)

df2 = df.drop(df1.index)

In [16]:
len(df1), len(df2)

(5768, 5769)
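
The split above draws a different sample on every run. If a reproducible split is wanted, one option is to fix the seed. This is a minimal sketch: `split_in_two` and the seed value are assumptions made for illustration, and the approach relies on the DataFrame having a unique index, since `drop` removes rows by index label.

```python
import pandas as pd


def split_in_two(df, frac=0.5, seed=42):
    """Randomly split df into two disjoint parts (hypothetical helper).

    Assumes df has a unique index, because the second part is obtained by
    dropping the index labels that were sampled into the first part.
    """
    first = df.sample(frac=frac, random_state=seed)
    second = df.drop(first.index)
    return first, second


# Small demonstration on made-up data
demo = pd.DataFrame({"Age": range(10)})
part1, part2 = split_in_two(demo)

# The two parts share no rows and together cover the whole frame
assert part1.index.intersection(part2.index).empty
assert len(part1) + len(part2) == len(demo)
```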

### Create some new data

Later, we'll need some data that differs from the PetFinder data in meaningful ways. We create it by sampling a few rows and then changing some of their values.

In [56]:
# Sample a handful of rows and renumber them from 0
df3 = df1.sample(frac=0.001, random_state=42).reset_index(drop=True)

In [57]:
len(df3)

6

In [58]:
df3.head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,Description,PhotoAmt,AdoptionSpeed
0,Cat,2,Domestic Short Hair,Male,Black,White,Small,Short,Not Sure,No,Healthy,0,Location: IIUM / UIA Gombak Contact no:,2,2
1,Cat,4,Siamese,Female,White,No Color,Small,Short,Yes,No,Healthy,0,"She's a very happy, healthy and playful. I wou...",0,1
2,Dog,2,Mixed Breed,Female,Black,Brown,Medium,Short,Yes,No,Healthy,80,"This lovely Lady is Active, Playful & Friendly...",9,2
3,Dog,3,Terrier,Male,Black,No Color,Large,Long,Yes,No,Healthy,0,"Rescued from the car park area, Puchong",1,4
4,Cat,36,Persian,Male,Golden,Yellow,Medium,Medium,Not Sure,Not Sure,Healthy,0,"Hi guys , i found some abandon dogs and cats i...",4,1


In [59]:
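df3.at[0, "Type"] = "Bird"   # a category not present in the original data
df3.at[1, "Health"] = 99     # a number where a string category is expected
df3.at[3, "Age"] = -1        # an impossible (negative) age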

In [60]:
df3.head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,Description,PhotoAmt,AdoptionSpeed
0,Bird,2,Domestic Short Hair,Male,Black,White,Small,Short,Not Sure,No,Healthy,0,Location: IIUM / UIA Gombak Contact no:,2,2
1,Cat,4,Siamese,Female,White,No Color,Small,Short,Yes,No,99,0,"She's a very happy, healthy and playful. I wou...",0,1
2,Dog,2,Mixed Breed,Female,Black,Brown,Medium,Short,Yes,No,Healthy,80,"This lovely Lady is Active, Playful & Friendly...",9,2
3,Dog,-1,Terrier,Male,Black,No Color,Large,Long,Yes,No,Healthy,0,"Rescued from the car park area, Puchong",1,4
4,Cat,36,Persian,Male,Golden,Yellow,Medium,Medium,Not Sure,Not Sure,Healthy,0,"Hi guys , i found some abandon dogs and cats i...",4,1
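
To confirm that such modifications really stand out from the original data, a quick plain-pandas check can compare a candidate frame against a reference frame. This is only an illustrative sketch and not part of TensorFlow Extended: `find_unseen_values` is a hypothetical helper, and the two tiny frames below are made-up stand-ins for `df1` and `df3`.

```python
import pandas as pd


def find_unseen_values(reference, candidate, columns):
    """Map each column to the candidate values that never occur in reference."""
    unseen = {}
    for col in columns:
        known = set(reference[col].unique())
        extra = [value for value in candidate[col].unique() if value not in known]
        if extra:
            unseen[col] = extra
    return unseen


# Made-up stand-ins for the original and the modified data
reference = pd.DataFrame(
    {"Type": ["Cat", "Dog"], "Health": ["Healthy", "Minor Injury"], "Age": [3, 1]}
)
candidate = pd.DataFrame(
    {"Type": ["Bird", "Dog"], "Health": ["Healthy", 99], "Age": [2, -1]}
)

# Categorical values that the reference data has never seen
unseen = find_unseen_values(reference, candidate, ["Type", "Health"])

# Row labels with an impossible (negative) age
negative_age_rows = candidate.index[candidate["Age"] < 0].tolist()
```

With the notebook's data, the same calls would take `df1` as the reference and `df3` as the candidate.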


### Save to disk

In [61]:
(PETFINDER_MINI/'split_csv').mkdir(exist_ok=True)

In [62]:
df1.to_csv(PETFINDER_MINI/'split_csv'/'span1.csv', index=False)

In [63]:
df2.to_csv(PETFINDER_MINI/'split_csv'/'span2.csv', index=False)

In [64]:
df3.to_csv(PETFINDER_MINI/'split_csv'/'span3.csv', index=False)
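
As a sanity check, the saved files can be read back and their row counts compared with the DataFrames in memory. This is a minimal sketch: `load_spans` and the `span*.csv` glob pattern are assumptions introduced here, chosen to match the file names used above.

```python
from pathlib import Path

import pandas as pd


def load_spans(split_dir, pattern="span*.csv"):
    """Read every CSV file matching pattern in split_dir (hypothetical helper).

    Returns a dict mapping file name to DataFrame, ordered by file name.
    """
    paths = sorted(Path(split_dir).glob(pattern))
    return {path.name: pd.read_csv(path) for path in paths}
```

For example, `{name: len(frame) for name, frame in load_spans(PETFINDER_MINI/'split_csv').items()}` gives the number of rows stored in each span file.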

# Download full PetFinder dataset

_TBA_

# Download the Flowers dataset

_TBA_