<a href="https://colab.research.google.com/github/aerjayc/CoE197Z/blob/master/cat_in_the_dat_formatted.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Downloading the dataset

In [1]:
!git clone https://github.com/aerjayc/CoE197Z.git
!cp "CoE197Z/train.csv" .
!cp "CoE197Z/test.csv" .
!cp "CoE197Z/sample_submission.csv" .
!ls

Cloning into 'CoE197Z'...
remote: Enumerating objects: 26, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 26 (delta 6), reused 12 (delta 1), pack-reused 0
Unpacking objects: 100% (26/26), done.
CoE197Z  sample_data  sample_submission.csv  test.csv  train.csv


### Structure of the data:

* Columns: 
    * `id`: unique integer associated with each entry
    * `bin_*`: binary values
        * `bin_0`, `bin_1`, `bin_2`: $\in \{0,1\}$
        * `bin_3`: $\in \{T,F\}$
        * `bin_4`: $\in \{Y,N\}$
    * `nom_*`: _unordered_ values
        * `nom_0`: 3 unique values (colors)
        * `nom_1`: 6 unique values (shapes)
        * `nom_2`: 6 unique values (animals)
        * `nom_3`: 6 unique values (countries)
        * `nom_4`: 4 unique values (instruments)
        * `nom_5`: 222 unique values
        * `nom_6`: 522 unique values
        * `nom_7`: 1220 unique values
        * `nom_8`: 2215 unique values
        * `nom_9`: 11981 unique values
    * `ord_*`: _ordered_ values
        * `ord_0` $\in \{ 1, 2, 3 \}$
        * `ord_1` $\in \{$ `Novice`, `Contributor`, `Expert`, `Master`, `Grandmaster` $\}$
        * `ord_2` $\in \{$ `Freezing`, `Cold`, `Warm`, `Hot`, `Boiling Hot`, `Lava Hot` $\}$
        * `ord_3`: 15 unique values (lowercase letters)
        * `ord_4`: 26 unique values (uppercase letters)
        * `ord_5`: 192 unique values (two-letter combinations)
    * `day` $\in \{ 1, 2, 3, ..., 7 \}$
    * `month` $\in \{ 1, 2, 3, ..., 12  \}$
    * `target` $\in \{0,1\}$
        * the value we want to predict

In [0]:
import pandas as pd

# import train data
#raw_data = pd.read_csv("train.csv")
data = pd.read_csv("train.csv")
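
As a quick sanity check on the column summary above, here is a minimal sketch that counts the distinct values in each column. The frame `toy` is a hypothetical hand-made stand-in so the snippet runs without `train.csv`; on the real data the equivalent call would presumably be `data.nunique()`.

```python
import pandas as pd

# Hypothetical toy frame standing in for train.csv (assumption: not real data)
toy = pd.DataFrame({
    "bin_3": ["T", "F", "T", "F"],
    "nom_0": ["Green", "Green", "Blue", "Red"],
    "ord_0": [2, 1, 1, 3],
})

# Number of distinct values in each column
unique_counts = toy.nunique()
print(unique_counts)
```

Comparing these counts against the list above is a cheap way to catch a mislabeled or truncated column before any encoding is done.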

### Preprocessing the data

* Binary Data
    * `bin_0`-`bin_2` are already numeric (`0` or `1`), so they need no further encoding
    * `bin_3`, `bin_4` need to be transformed so each element is either `0` or `1`
* Nominal Data
    * These can be easily converted into 1-hot encoding
* Ordinal Data
    * These should be manually labeled according to their rank (e.g. `Freezing < Cold < Warm < ...`), then normalized to the range $[0,1]$
    * From the `cat-in-the-dat` page [here](https://www.kaggle.com/c/cat-in-the-dat/data):
        * "The string ordinal features `ord_{3-5}` are lexically ordered according to `string.ascii_letters`."
* `day` and `month`
    * I don't know how to deal with these yet. They aren't independent, so we shouldn't process them separately. (Maybe we could; thoughts?)
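
One common option for periodic features such as `day` and `month` is sine/cosine (cyclical) encoding, sketched below. This is not a method settled on in this notebook, only a possibility to consider. The helper `cyclical_encode` and the frame `toy` are hypothetical, and the sketch assumes `day` runs over 1..7 and `month` over 1..12. It treats the two columns separately, so it does not address any dependence between them.

```python
import numpy as np
import pandas as pd

def cyclical_encode(values, period):
    """Map values in 1..period onto the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * (np.asarray(values) - 1) / period
    return np.sin(angle), np.cos(angle)

# Hypothetical toy rows, not taken from train.csv
toy = pd.DataFrame({"day": [1, 4, 7], "month": [1, 6, 12]})

toy["day_sin"], toy["day_cos"] = cyclical_encode(toy["day"], period=7)
toy["month_sin"], toy["month_cos"] = cyclical_encode(toy["month"], period=12)
print(toy.round(3))
```

The appeal of this encoding is that the last value of a cycle ends up next to the first one (day 7 is as close to day 1 as day 1 is to day 2), which a plain integer or a $[0,1]$ scaling would not capture.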

In [38]:
data.head()

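|  | id | bin_0 | bin_1 | bin_2 | bin_3 | bin_4 | nom_0 | nom_1 | nom_2 | nom_3 | nom_4 | nom_5 | nom_6 | nom_7 | nom_8 | nom_9 | ord_0 | ord_1 | ord_2 | ord_3 | ord_4 | ord_5 | day | month | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | Green | Triangle | Snake | Finland | Bassoon | 50f116bcf | 3ac1b8814 | 68f6ad3e9 | c389000ab | 2f4cb3d51 | 2 | Grandmaster | Cold | h | D | kr | 2 | 2 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 | Green | Trapezoid | Hamster | Russia | Piano | b3b4d25d0 | fbcb50fc1 | 3b6dd5612 | 4cd920251 | f83c56c21 | 1 | Grandmaster | Hot | a | A | bF | 7 | 8 | 0 |
| 2 | 2 | 0 | 0 | 0 | 0 | 1 | Blue | Trapezoid | Lion | Russia | Theremin | 3263bdce5 | 0922e3cb8 | a6a36f527 | de9c9f684 | ae6800dd0 | 1 | Expert | Lava Hot | h | R | Jc | 7 | 2 | 0 |
| 3 | 3 | 0 | 1 | 0 | 0 | 1 | Red | Trapezoid | Snake | Canada | Oboe | f12246592 | 50d7ad46a | ec69236eb | 4ade6ab69 | 8270f0d71 | 1 | Grandmaster | Boiling Hot | i | D | kW | 2 | 1 | 1 |
| 4 | 4 | 0 | 0 | 0 | 0 | 0 | Red | Trapezoid | Lion | Canada | Oboe | 5b0f5acd5 | 1fe17a1fd | 04ddac2be | cb43ab175 | b164b72a7 | 1 | Grandmaster | Freezing | a | R | qP | 7 | 8 | 0 |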


In [0]:
# Binary Data

## Non-numeric binary data to 1's and 0's
def bin_to_10(series, one='T', zero='F'):
    """ Map the `one` label to 1 and the `zero` label to 0.

        Q: Why return a new Series instead of editing in place?
        A: `Series.to_numpy()` may return a copy rather than a view, so
           writing into that array is not guaranteed to change `data` """
    return series.map({one: 1, zero: 0})

data['bin_3'] = bin_to_10(data['bin_3'], one='T', zero='F')
data['bin_4'] = bin_to_10(data['bin_4'], one='Y', zero='N')

In [0]:
# Nominal Data
# Each nom_* column is transformed into a 1-hot vector
#
# E.g.: nom_0 has 3 distinct categories: 'Red', 'Green', 'Blue'. 1-hot
#       encoding turns it into 3 columns consisting of only 1's and 0's
#       so that:
#             nom_0 = ['Green', 'Green', 'Blue', 'Red', ...]
#       becomes 3 columns:
#             nom_0_Blue  = [0, 0, 1, 0, ...]
#             nom_0_Green = [1, 1, 0, 0, ...]
#             nom_0_Red   = [0, 0, 0, 1, ...]
for i in range(7):  # nom_0 through nom_6
    # Reassign `data` on every pass so earlier encodings are kept
    data = pd.get_dummies(data, columns=[f'nom_{i}'], prefix=[f'nom_{i}'])
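
To see the 1-hot transformation on its own, here is a minimal sketch on a hypothetical four-row sample of `nom_0` (the frame `toy` is made up for illustration). Passing `dtype=int` is a choice made here so the output is `0`/`1`; without it, recent pandas versions return boolean columns.

```python
import pandas as pd

# Hypothetical four-row sample of the nom_0 column
toy = pd.DataFrame({"nom_0": ["Green", "Green", "Blue", "Red"]})

# dtype=int gives 0/1 instead of the boolean columns newer pandas returns
encoded = pd.get_dummies(toy, columns=["nom_0"], prefix=["nom_0"], dtype=int)
print(encoded)
```

Each row has exactly one `1` across the three new columns, and the original `nom_0` column is dropped from the result.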

In [0]:
# Ordinal Data
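
The ordinal step described in the preprocessing notes (label by rank, then normalize to $[0,1]$) could be sketched roughly as follows. Several things here are assumptions: the rank order of `ord_1` and `ord_2` is taken from the order in which their levels are listed earlier, the helpers `encode_ranked` and `encode_letters` are hypothetical names, and `toy` is a made-up frame. For the string columns, the sketch ranks only the values actually present, ordered by `string.ascii_letters` as the Kaggle page describes.

```python
import string
import pandas as pd

# Assumed rank order, taken from the order the levels are listed above
ORD_1 = ["Novice", "Contributor", "Expert", "Master", "Grandmaster"]
ORD_2 = ["Freezing", "Cold", "Warm", "Hot", "Boiling Hot", "Lava Hot"]

def encode_ranked(series, levels):
    """Map each level to its rank, scaled to [0, 1] (hypothetical helper).

    Assumes at least two levels; values not in `levels` become NaN.
    """
    rank = {level: i for i, level in enumerate(levels)}
    return series.map(rank) / (len(levels) - 1)

def encode_letters(series):
    """Rank the observed strings by string.ascii_letters order, scaled to [0, 1]."""
    position = {ch: i for i, ch in enumerate(string.ascii_letters)}
    levels = sorted(series.unique(), key=lambda s: [position[ch] for ch in s])
    return encode_ranked(series, levels)

# Hypothetical toy rows for illustration
toy = pd.DataFrame({
    "ord_1": ["Grandmaster", "Expert", "Novice"],
    "ord_2": ["Cold", "Lava Hot", "Freezing"],
    "ord_5": ["kr", "bF", "Jc"],
})
toy["ord_1"] = encode_ranked(toy["ord_1"], ORD_1)
toy["ord_2"] = encode_ranked(toy["ord_2"], ORD_2)
toy["ord_5"] = encode_letters(toy["ord_5"])
print(toy)
```

Note that `string.ascii_letters` puts lowercase before uppercase, so `bF < kr < Jc` under this ordering. `ord_0` is already numeric and would only need the scaling step.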