<a href="https://colab.research.google.com/github/aerjayc/CoE197Z/blob/master/cat_in_the_dat_formatted.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Downloading the dataset

In [2]:
!git clone https://github.com/aerjayc/CoE197Z.git
!cp "CoE197Z/train.csv" .
!cp "CoE197Z/test.csv" .
!cp "CoE197Z/sample_submission.csv" .
!ls

fatal: destination path 'CoE197Z' already exists and is not an empty directory.
CoE197Z  sample_data  sample_submission.csv  test.csv  train.csv


---

### Importing the dataset

In [9]:
import pandas as pd

# import train data
data = pd.read_csv("train.csv")
data.head()

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,nom_5,nom_6,nom_7,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month,target
0,0,0,0,0,T,Y,Green,Triangle,Snake,Finland,Bassoon,50f116bcf,3ac1b8814,68f6ad3e9,c389000ab,2f4cb3d51,2,Grandmaster,Cold,h,D,kr,2,2,0
1,1,0,1,0,T,Y,Green,Trapezoid,Hamster,Russia,Piano,b3b4d25d0,fbcb50fc1,3b6dd5612,4cd920251,f83c56c21,1,Grandmaster,Hot,a,A,bF,7,8,0
2,2,0,0,0,F,Y,Blue,Trapezoid,Lion,Russia,Theremin,3263bdce5,0922e3cb8,a6a36f527,de9c9f684,ae6800dd0,1,Expert,Lava Hot,h,R,Jc,7,2,0
3,3,0,1,0,F,Y,Red,Trapezoid,Snake,Canada,Oboe,f12246592,50d7ad46a,ec69236eb,4ade6ab69,8270f0d71,1,Grandmaster,Boiling Hot,i,D,kW,2,1,1
4,4,0,0,0,F,N,Red,Trapezoid,Lion,Canada,Oboe,5b0f5acd5,1fe17a1fd,04ddac2be,cb43ab175,b164b72a7,1,Grandmaster,Freezing,a,R,qP,7,8,0


### Structure of the data:

* __`id`__ : unique integer associated with each entry
* __`bin_*`__ : binary values
* __`nom_*`__ : _unordered_ values
* __`ord_*`__ : _ordered_ values
* __`day`__ $\in \{ 1, 2, 3, ..., 7 \}$
* __`month`__ $\in \{ 1, 2, 3, ..., 12  \}$
* __`target`__ $\in \{0,1\}$
    * the value we want to predict
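
The column families listed above can be inspected directly. The following is a minimal sketch on a toy DataFrame (the values and the names `toy`, `unique_counts` and `groups` are made up for illustration); the same calls should work on `data` once `train.csv` is loaded.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows, same column naming scheme).
toy = pd.DataFrame({
    "bin_3": ["T", "F", "F", "T"],
    "nom_0": ["Green", "Green", "Blue", "Red"],
    "ord_2": ["Cold", "Hot", "Lava Hot", "Cold"],
    "day":   [2, 7, 7, 2],
})

# Number of distinct values in each column.
unique_counts = toy.nunique()
print(unique_counts)

# Group the column names by their prefix (bin, nom, ord, ...).
groups = {}
for name in toy.columns:
    groups.setdefault(name.split("_")[0], []).append(name)
print(groups)
```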

## Preprocessing the data

### Binary Data

* `bin_*` : binary values
    * `bin_{0-2}` $\in \{0,1\}$
        * already encoded as 0/1, so no remapping is needed
    * `bin_3` $\in \{T,F\}$
        * remap $\ T \to 1,\ F \to 0$
    * `bin_4` $\in \{Y,N\}$
        * remap $\ Y \to 1,\ N \to 0$

In [10]:
# Binary Data

## Arbitrary binary data to 1's and 0's
def bin_to_10(series, one='T', zero='F'):
    """ Return a new Series with `one` mapped to 1 and `zero` mapped to 0.

        Q: Why return a value instead of modifying the column in place?
        A: Writing into data[col].to_numpy() only changes the DataFrame if
           to_numpy() happens to return a view of the column. That is not
           guaranteed, so we return the result and assign it back. """
    return series.map({one: 1, zero: 0})

data['bin_3'] = bin_to_10(data['bin_3'], one='T', zero='F')
data['bin_4'] = bin_to_10(data['bin_4'], one='Y', zero='N')
data.head()

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,nom_5,nom_6,nom_7,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month,target
0,0,0,0,0,1,1,Green,Triangle,Snake,Finland,Bassoon,50f116bcf,3ac1b8814,68f6ad3e9,c389000ab,2f4cb3d51,2,Grandmaster,Cold,h,D,kr,2,2,0
1,1,0,1,0,1,1,Green,Trapezoid,Hamster,Russia,Piano,b3b4d25d0,fbcb50fc1,3b6dd5612,4cd920251,f83c56c21,1,Grandmaster,Hot,a,A,bF,7,8,0
2,2,0,0,0,0,1,Blue,Trapezoid,Lion,Russia,Theremin,3263bdce5,0922e3cb8,a6a36f527,de9c9f684,ae6800dd0,1,Expert,Lava Hot,h,R,Jc,7,2,0
3,3,0,1,0,0,1,Red,Trapezoid,Snake,Canada,Oboe,f12246592,50d7ad46a,ec69236eb,4ade6ab69,8270f0d71,1,Grandmaster,Boiling Hot,i,D,kW,2,1,1
4,4,0,0,0,0,0,Red,Trapezoid,Lion,Canada,Oboe,5b0f5acd5,1fe17a1fd,04ddac2be,cb43ab175,b164b72a7,1,Grandmaster,Freezing,a,R,qP,7,8,0


### Nominal Data

* __`nom_*`__ : _unordered_ values
    * __`nom_0`__ : 3 unique values (colors)
    * __`nom_1`__ : 6 unique values (shapes)
    * __`nom_2`__ : 6 unique values (animals)
    * __`nom_3`__ : 6 unique values (countries)
    * __`nom_4`__ : 4 unique values (instruments)
    * __`nom_5`__ : 222 unique values   (hex)
    * __`nom_6`__ : 522 unique values   (hex)
    * __`nom_7`__ : 1220 unique values  (hex)
    * __`nom_8`__ : 2215 unique values  (hex)
    * __`nom_9`__ : 11981 unique values (hex)

These can be converted into 1-hot encoding with `pd.get_dummies`.
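
As a small illustration of what 1-hot encoding does, here is a sketch on a four-row toy column (the `toy` and `encoded` names are only for this example). It uses `dtype="uint8"` so the indicator columns print as 0/1; recent pandas versions would otherwise produce booleans.

```python
import pandas as pd

toy = pd.DataFrame({"nom_0": ["Green", "Green", "Blue", "Red"]})

# One indicator column per distinct category, named <prefix>_<category>.
encoded = pd.get_dummies(toy, columns=["nom_0"], prefix=["nom_0"], dtype="uint8")
print(encoded)
```

Going by the unique-value counts listed above, encoding all ten `nom_*` columns this way gives 3 + 6 + 6 + 6 + 4 + 222 + 522 + 1220 + 2215 + 11981 = 16185 indicator columns, so the resulting frame is very wide and a sparse representation may be worth considering.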

In [11]:
# Nominal Data
# Each nom_* column is transformed into a 1-hot vector.
#
# E.g.: nom_0 has 3 distinct categories: 'Red', 'Green', 'Blue'. 1-hot
#       encoding turns it into 3 columns consisting of only 1's and 0's
#       so that:
#             nom_0 = ['Green', 'Green', 'Blue', 'Red', ...]
#       becomes 3 columns:
#             nom_0_Blue  = [0, 0, 1, 0, ...]
#             nom_0_Green = [1, 1, 0, 0, ...]
#             nom_0_Red   = [0, 0, 0, 1, ...]
for i in range(10):
    # dtype='uint8' keeps the new columns as compact 0/1 integers
    # (newer pandas versions default to booleans)
    data = pd.get_dummies(data, columns=[f'nom_{i}'], prefix=[f'nom_{i}'],
                          dtype='uint8')
data.head()

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month,target,nom_0_Blue,nom_0_Green,nom_0_Red,nom_1_Circle,nom_1_Polygon,nom_1_Square,nom_1_Star,nom_1_Trapezoid,nom_1_Triangle,nom_2_Axolotl,nom_2_Cat,nom_2_Dog,nom_2_Hamster,nom_2_Lion,nom_2_Snake,nom_3_Canada,nom_3_China,nom_3_Costa Rica,nom_3_Finland,nom_3_India,nom_3_Russia,nom_4_Bassoon,nom_4_Oboe,nom_4_Piano,nom_4_Theremin,...,nom_9_ff0be0f4a,nom_9_ff0deb8c5,nom_9_ff1275360,nom_9_ff1cea430,nom_9_ff1ffdb6c,nom_9_ff26ddc5f,nom_9_ff2c54c2b,nom_9_ff37346ce,nom_9_ff3ff6768,nom_9_ff4039502,nom_9_ff43e5d26,nom_9_ff4922d4e,nom_9_ff4c9a31f,nom_9_ff4ccc205,nom_9_ff4e4f5b5,nom_9_ff534b1a4,nom_9_ff55fea3b,nom_9_ff56952ea,nom_9_ff5bf79d4,nom_9_ff61809a0,nom_9_ff680c901,nom_9_ff683e4d6,nom_9_ff7809d76,nom_9_ff794dacf,nom_9_ff7b5c805,nom_9_ff7dd1073,nom_9_ff9efcf8c,nom_9_ffac03704,nom_9_ffae88db8,nom_9_ffb07126f,nom_9_ffc086cfa,nom_9_ffc668c22,nom_9_ffccfc611,nom_9_ffd347754,nom_9_ffd966e07,nom_9_fff13b60a,nom_9_fff1ce319,nom_9_fff4abc0b,nom_9_fffb01c38,nom_9_fffd6e64c
0,0,0,0,0,1,1,2,Grandmaster,Cold,h,D,kr,2,2,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,1,0,1,1,1,Grandmaster,Hot,a,A,bF,7,8,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2,0,0,0,0,1,1,Expert,Lava Hot,h,R,Jc,7,2,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,3,0,1,0,0,1,1,Grandmaster,Boiling Hot,i,D,kW,2,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4,0,0,0,0,0,1,Grandmaster,Freezing,a,R,qP,7,8,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Ordinal Data

* __`ord_*`__ : _ordered_ values
    * __`ord_0`__ $\in \{ 1, 2, 3 \}$
    * __`ord_1`__ $\in \{$ `Novice`, `Contributor`, `Expert`, `Master`, `Grandmaster` $\}$
    * __`ord_2`__ $\in \{$ `Freezing`, `Cold`, `Warm`, `Hot`, `Boiling Hot`, `Lava Hot` $\}$
    * __`ord_3`__ : 15 unique values (lowercase letters)
    * __`ord_4`__ : 26 unique values (uppercase letters)
    * __`ord_5`__ : 192 unique values (two-letter combinations)

* These should be manually mapped to integers according to their rank (e.g. `Freezing < Cold < Warm < ...`), then normalized to the range $[0,1]$
* From the `cat-in-the-dat` page [here](https://www.kaggle.com/c/cat-in-the-dat/data):
    * "The string ordinal features `ord_{3-5}` are lexically ordered according to `string.ascii_letters`."
        * where the order of `string.ascii_letters` = `'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'`
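
To make the quoted ordering concrete, the sketch below ranks strings letter by letter using their position in `string.ascii_letters` and compares the result with sorting by `str.swapcase`, which gives the same order for strings made only of ASCII letters. The helper `ascii_letters_key` and the `sample` list are illustrative; `'aa'` and `'ZZ'` are made-up values added to show the extremes.

```python
import string

# Position of each letter in string.ascii_letters: a=0, ..., z=25, A=26, ..., Z=51.
rank = {letter: i for i, letter in enumerate(string.ascii_letters)}

def ascii_letters_key(value):
    """Sort key comparing strings letter by letter in ascii_letters order."""
    return [rank[ch] for ch in value]

# A few ord_5-style values; 'aa' and 'ZZ' are hypothetical.
sample = ["kr", "bF", "Jc", "kW", "qP", "aa", "ZZ"]

by_rank = sorted(sample, key=ascii_letters_key)
by_swapcase = sorted(sample, key=str.swapcase)

print(by_rank)      # ['aa', 'bF', 'kr', 'kW', 'qP', 'Jc', 'ZZ']
print(by_swapcase)  # same order
```

Plain `sorted()` without a key would instead place uppercase letters first, because uppercase letters have smaller code points than lowercase ones.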

In [12]:
# Ordinal Data

## Arbitrary categorical data to integers
def ord_to_int(series, categories, values=None):
    """ series: the column (a pandas Series).
        categories: a list/tuple containing the distinct elements in series
        values: a list/tuple with the same length as categories. It contains the
                values to be assigned to the corresponding category.
                By default, values = [0, 1, 2, ..., len(categories)-1]
        Returns a new Series; the input is not modified.
        E.g.: For the following inputs:
                    series = pd.Series(['c', 'a', 'd', 'b', 'c'])
                    categories =  ['a', 'b', 'c', 'd']
                    values =      [ 0 ,  1 ,  2 ,  3 ]
              the returned Series is:
                    pd.Series([2, 0, 3, 1, 2])
        Note: this generalizes bin_to_10: mapping 'T' -> 1 and 'F' -> 0
              corresponds to categories = ['F', 'T'] """

    if (not values) or (len(values) != len(categories)):
        values = list(range(len(categories)))

    # Look up each entry's value; assigning the result back avoids relying on
    # to_numpy() returning a writable view of the column.
    return series.map(dict(zip(categories, values)))

categories = ('Novice', 'Contributor', 'Expert', 'Master', 'Grandmaster')
data['ord_1'] = ord_to_int(data['ord_1'], categories)

categories = ('Freezing', 'Cold', 'Warm', 'Hot', 'Boiling Hot', 'Lava Hot')
data['ord_2'] = ord_to_int(data['ord_2'], categories)

categories = data['ord_3'].astype('category').cat.categories.tolist()
data['ord_3'] = ord_to_int(data['ord_3'], sorted(categories))

categories = data['ord_4'].astype('category').cat.categories.tolist()
data['ord_4'] = ord_to_int(data['ord_4'], sorted(categories))

# Sort lowercase before uppercase, as in string.ascii_letters
# https://stackoverflow.com/a/28136444
categories = data['ord_5'].astype('category').cat.categories.tolist()
data['ord_5'] = ord_to_int(data['ord_5'], sorted(categories, key=str.swapcase))

data.head()

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month,target,nom_0_Blue,nom_0_Green,nom_0_Red,nom_1_Circle,nom_1_Polygon,nom_1_Square,nom_1_Star,nom_1_Trapezoid,nom_1_Triangle,nom_2_Axolotl,nom_2_Cat,nom_2_Dog,nom_2_Hamster,nom_2_Lion,nom_2_Snake,nom_3_Canada,nom_3_China,nom_3_Costa Rica,nom_3_Finland,nom_3_India,nom_3_Russia,nom_4_Bassoon,nom_4_Oboe,nom_4_Piano,nom_4_Theremin,...,nom_9_ff0be0f4a,nom_9_ff0deb8c5,nom_9_ff1275360,nom_9_ff1cea430,nom_9_ff1ffdb6c,nom_9_ff26ddc5f,nom_9_ff2c54c2b,nom_9_ff37346ce,nom_9_ff3ff6768,nom_9_ff4039502,nom_9_ff43e5d26,nom_9_ff4922d4e,nom_9_ff4c9a31f,nom_9_ff4ccc205,nom_9_ff4e4f5b5,nom_9_ff534b1a4,nom_9_ff55fea3b,nom_9_ff56952ea,nom_9_ff5bf79d4,nom_9_ff61809a0,nom_9_ff680c901,nom_9_ff683e4d6,nom_9_ff7809d76,nom_9_ff794dacf,nom_9_ff7b5c805,nom_9_ff7dd1073,nom_9_ff9efcf8c,nom_9_ffac03704,nom_9_ffae88db8,nom_9_ffb07126f,nom_9_ffc086cfa,nom_9_ffc668c22,nom_9_ffccfc611,nom_9_ffd347754,nom_9_ffd966e07,nom_9_fff13b60a,nom_9_fff1ce319,nom_9_fff4abc0b,nom_9_fffb01c38,nom_9_fffd6e64c
0,0,0,0,0,1,1,2,4,1,7,3,43,2,2,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,1,0,1,1,1,4,3,0,0,7,7,8,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2,0,0,0,0,1,1,2,5,7,17,135,7,2,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,3,0,1,0,0,1,1,4,4,8,3,50,2,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4,0,0,0,0,0,1,4,0,0,17,74,7,8,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
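
The ordinal columns now hold integer ranks, but they have not yet been normalized to $[0,1]$ as planned earlier. A minimal sketch of that step, assuming ranks run from 0 to the number of categories minus one (the helper name `normalize_rank` is hypothetical), could look like this:

```python
import pandas as pd

def normalize_rank(series, n_categories):
    """Scale integer ranks 0..n_categories-1 to the interval [0, 1]."""
    return series / (n_categories - 1)

# ord_2 ranks of the first five rows shown above (6 levels: Freezing=0 ... Lava Hot=5).
ord_2_rank = pd.Series([1, 3, 5, 4, 0])
ord_2_scaled = normalize_rank(ord_2_rank, n_categories=6)
print(ord_2_scaled.tolist())
```

Note that `ord_0` takes values in $\{1, 2, 3\}$, so it would need 1 subtracted first for this helper to map it onto $[0,1]$.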



* `day` and `month`
    * I don't know how to deal with these yet. They aren't independent, so we shouldn't process them separately. (Maybe we could, I'm not sure. Thoughts?)