<h1 align = "center">Multi-Class Classification</h1>

<p align = "justify"><b>Objective:</b> This notebook builds a multi-class classifier for the <a href = "https://www.kaggle.com/c/tabular-playground-series-dec-2021/overview">Tabular Playground Series - Dec 2021</a> competition. The training data contains <b><code>55</code></b> columns, of which <b><code>Cover_Type</code></b> is the target to be classified among the classes labeled $(1, 2, ..., 7)$. Before diving in, as good practitioners, let's analyze the data and understand what we're up against!</p>

## Data Preprocessing

<p align = "justify">Before we dive into feature engineering, we need to preprocess the data: <b>clean</b> it to fix any errors (such as encoding issues) and <b>impute</b> any missing values. However, the data in question is clean and has no encoding errors. This was verified with the following code, whose output is omitted from this notebook.</p>

```python
# data is either `dataTrain` or `dataTest`
data.info() # returns all dtypes as `int`

# checking null values using `sns` and `pd.isnull`
# the image for training and testing dataset on null-values is shown below
sns.heatmap(data.isnull())
```

![na-in-training](images/training.png)
![na-in-testing](images/testing.png)
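As a numeric complement to the heatmap, the same check can be written as a per-column count of missing values. The sketch below runs on a small hypothetical frame (`toy`, with one missing value injected on purpose), not on the competition data.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for `dataTrain` / `dataTest`
toy = pd.DataFrame({
    "Elevation" : [3189, 3026, np.nan],  # one missing value injected on purpose
    "Aspect"    : [40, 182, 13],
})

# number of missing values per column; all zeros would mean there is nothing to impute
nullCounts = toy.isnull().sum()
hasMissing = bool(nullCounts.any())

print(nullCounts.to_dict())
print("needs imputation:", hasMissing)
```

On the real data, a result of all zeros would agree with the empty heatmaps above.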

In [6]:
from os.path import join
from copy import deepcopy
from tqdm import tqdm as TQ

In [7]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%precision 3
%matplotlib inline
sns.set_style('whitegrid');
plt.style.use('default-style');
np.set_printoptions(precision = 3, threshold = 15)

In [8]:
from imblearn.over_sampling import (
    SMOTE, # https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis#SMOTE
    ADASYN # https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis#ADASYN
)

from sklearn.preprocessing import (
    MinMaxScaler,
    LabelEncoder
)

In [9]:
import tensorflow as tf
print('Tensorflow Version: {}'.format(tf.__version__))

# check physical devices
tf.config.list_physical_devices()

Tensorflow Version: 2.3.1


[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU')]

In [10]:
# ignore specific warnings
import warnings
warnings.simplefilter("ignore", FutureWarning)

In [11]:
# https://www.analyticsvidhya.com/blog/2021/04/how-to-reduce-memory-usage-in-python-pandas/
# https://towardsdatascience.com/reducing-memory-usage-in-pandas-with-smaller-datatypes-b527635830af

calculateMemory = lambda frame : frame.memory_usage(deep = True).sum() / 1024 ** 2 # return usage in MB

def limitNumeric(frame : pd.DataFrame, verbose : bool = True) -> pd.DataFrame:
    """Given a DataFrame (frame), cast each integer column to a smaller NumPy integer dtype to reduce memory usage"""
    
    if verbose:
        actual = calculateMemory(frame)
    
    # for each column, find the min and max value
    # and pick an integer dtype that can hold that range - int8, int16, int32 or int64
    # by default, pandas loads each numeric column with its widest type - int64/float64
    for col in TQ(frame.columns, desc = "converting dtypes"):
        c_min = frame[col].min()
        c_max = frame[col].max()
        
        if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
            frame[col] = frame[col].astype(np.int8)
        elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
            frame[col] = frame[col].astype(np.int16)
        # NOTE: this `if` starts a second chain, so every column that fits in int32 is
        # (re)cast to int32 even when it matched int8/int16 above; the ~49% reduction
        # printed below corresponds to int64 -> int32 for all columns
        if c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
            frame[col] = frame[col].astype(np.int32)
        else:
            frame[col] = frame[col].astype(np.int64)
            
    if verbose:
        final = calculateMemory(frame)
        print(f"Actual Size : {actual:.2f} MB | Final Size : {final:.2f} MB || Reduction Ration = {((actual - final) / actual) * 100:.2f}%")
        
    return frame
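
For comparison, pandas offers a built-in downcast through `pd.to_numeric(..., downcast = "integer")`, which picks the smallest signed integer dtype that can hold a column's values. The snippet below is a sketch on a hypothetical toy frame (the `big` column is invented for illustration) and is not a drop-in replacement for `limitNumeric`.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame; all columns start out as int64, as pandas would load them
toy = pd.DataFrame({
    "Soil_Type1" : np.array([0, 1, 0], dtype = np.int64),
    "Elevation"  : np.array([3189, 3026, 3106], dtype = np.int64),
    "big"        : np.array([0, 70_000, 2_000_000_000], dtype = np.int64),  # invented column
})

# downcast column by column to the smallest signed integer dtype that fits
small = pd.DataFrame({
    col : pd.to_numeric(toy[col], downcast = "integer") for col in toy.columns
})

print(small.dtypes.to_dict())
print(toy.memory_usage(deep = True).sum(), "->", small.memory_usage(deep = True).sum(), "bytes")
```

One caveat with narrow dtypes: arithmetic such as squaring an `int16` column can overflow, so it is safer to cast to a wider type (for example `float64`) before computing on downcast columns.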

In [12]:
TRAIN_DATA = join(".", "train.csv")
EVALUATION_DATA = join(".", "test.csv")

In [13]:
dataTrain = pd.read_csv(TRAIN_DATA, index_col = 0)
dataTrain = limitNumeric(dataTrain)

dataTrain.head() # FutureWarning Ignored https://stackoverflow.com/a/46721064/6623589

converting dtypes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55/55 [00:13<00:00,  4.16it/s]

Actual Size : 1708.98 MB | Final Size : 869.75 MB || Reduction Ration = 49.11%





| Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3189 | 40 | 8 | 30 | 13 | 3270 | 206 | 234 | 193 | 4873 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 3026 | 182 | 5 | 280 | 29 | 3270 | 233 | 240 | 106 | 5423 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 2 | 3106 | 13 | 7 | 351 | 37 | 2914 | 208 | 234 | 137 | 5269 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 3022 | 276 | 13 | 192 | 16 | 3034 | 207 | 238 | 156 | 2866 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 2906 | 186 | 13 | 266 | 22 | 2916 | 231 | 231 | 154 | 2642 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |


In [14]:
dataTest = pd.read_csv(EVALUATION_DATA, index_col = 0)
dataTest = limitNumeric(dataTest)

dataTest.head() # FutureWarning Ignored https://stackoverflow.com/a/46721064/6623589

converting dtypes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 54/54 [00:03<00:00, 16.42it/s]

Actual Size : 419.62 MB | Final Size : 213.62 MB || Reduction Ration = 49.09%





| Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | Soil_Type31 | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4000000 | 2763 | 78 | 20 | 377 | 88 | 3104 | 218 | 213 | 195 | 1931 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4000001 | 2826 | 153 | 11 | 264 | 39 | 295 | 219 | 238 | 148 | 2557 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4000002 | 2948 | 57 | 19 | 56 | 44 | 852 | 202 | 217 | 163 | 1803 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4000003 | 2926 | 119 | 6 | 158 | 134 | 2136 | 234 | 240 | 142 | 857 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4000004 | 2690 | 10 | 4 | 38 | 108 | 3589 | 213 | 221 | 229 | 431 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## Feature Engineering

<p align = "justify">"<i>Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. Feature engineering has been employed in Kaggle competitions and machine learning projects.</i>" For this dataset, the feature-engineering strategies below were identified and applied. They are compiled into a function named <b><code>featureEngineering</code></b>, so the same transformations are applied to both the training and the testing dataset.</p>

### #1 Determining Distance of Water-Body

<p align = "justify">The dataset is based on the original <a href = "https://www.kaggle.com/c/forest-cover-type-prediction/data">Forest Cover Type Prediction</a> dataset, which documents the column names. Interestingly, the data provides the horizontal as well as the vertical distance to the nearest water surface, which can be combined into the straight-line distance to the water body, taking the <i>forest</i> location as the origin. The Euclidean distance metric is used, and is defined as follows (for this setup):</p>

$$
d = \sqrt{h_d^2 + v_d^2}
$$

where, $h_d$ is the horizontal distance, $v_d$ is the vertical distance, and the forest is located at origin $O_{(0, 0)}$.
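
As a quick worked example of the formula, the snippet below evaluates $d$ for the hydrology distances of the first three training rows shown in the preview above. It is a minimal sketch using `np.hypot`, which computes $\sqrt{h_d^2 + v_d^2}$ element-wise; the variable names are only illustrative.

```python
import numpy as np

# values copied from the first three rows of the training preview
h_d = np.array([30, 280, 351], dtype = np.float64)  # Horizontal_Distance_To_Hydrology
v_d = np.array([13, 29, 37], dtype = np.float64)    # Vertical_Distance_To_Hydrology

# element-wise Euclidean distance from the origin
d = np.hypot(h_d, v_d)

print(np.round(d, 3))
```

Working in `float64` from the start avoids any risk of integer overflow when the inputs are stored in narrow integer dtypes.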

### #2 Additional Parameters

<p align = "justify">As suggested by Craig Thomas in his <a href = "https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/292823">discussion</a>, additional features are added to the training and testing data: <code>soil_type_count</code> and <code>wilderness_area_count</code>.</p>
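
To illustrate what these count features capture, here is a minimal sketch on a hypothetical toy frame with two soil flags and two wilderness flags. In this synthetic dataset the flags are not strictly one-hot (the sampled rows later in the notebook show totals of 0 and 2), which is why a row-wise count can carry information.

```python
import pandas as pd

# hypothetical toy frame; the real data has Soil_Type1..40 and several Wilderness_Area columns
toy = pd.DataFrame({
    "Soil_Type1"       : [0, 1, 1],
    "Soil_Type2"       : [0, 0, 1],
    "Wilderness_Area1" : [1, 0, 1],
    "Wilderness_Area2" : [0, 0, 1],
})

# select the flag columns by prefix, then count the active flags per row
soilCols = [c for c in toy.columns if c.startswith("Soil_Type")]
wildCols = [c for c in toy.columns if c.startswith("Wilderness_Area")]

toy["soil_type_count"] = toy[soilCols].sum(axis = 1)
toy["wilderness_area_count"] = toy[wildCols].sum(axis = 1)

print(toy[["soil_type_count", "wilderness_area_count"]])
```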

In [15]:
soilTypes = {
    1   : "Cathedral family - Rock outcrop complex, extremely stony.",
    2   : "Vanet - Ratake families complex, very stony.",
    3   : "Haploborolis - Rock outcrop complex, rubbly.",
    4   : "Ratake family - Rock outcrop complex, rubbly.",
    5   : "Vanet family - Rock outcrop complex complex, rubbly.",
    6   : "Vanet - Wetmore families - Rock outcrop complex, stony.",
    7   : "Gothic family.",
    8   : "Supervisor - Limber families complex.",
    9   : "Troutville family, very stony.",
    10  : "Bullwark - Catamount families - Rock outcrop complex, rubbly.",
    11  : "Bullwark - Catamount families - Rock land complex, rubbly.",
    12  : "Legault family - Rock land complex, stony.",
    13  : "Catamount family - Rock land - Bullwark family complex, rubbly.",
    14  : "Pachic Argiborolis - Aquolis complex.",
    15  : "unspecified in the USFS Soil and ELU Survey.",
    16  : "Cryaquolis - Cryoborolis complex.",
    17  : "Gateview family - Cryaquolis complex.",
    18  : "Rogert family, very stony.",
    19  : "Typic Cryaquolis - Borohemists complex.",
    20  : "Typic Cryaquepts - Typic Cryaquolls complex.",
    21  : "Typic Cryaquolls - Leighcan family, till substratum complex.",
    22  : "Leighcan family, till substratum, extremely bouldery.",
    23  : "Leighcan family, till substratum - Typic Cryaquolls complex.",
    24  : "Leighcan family, extremely stony.",
    25  : "Leighcan family, warm, extremely stony.",
    26  : "Granile - Catamount families complex, very stony.",
    27  : "Leighcan family, warm - Rock outcrop complex, extremely stony.",
    28  : "Leighcan family - Rock outcrop complex, extremely stony.",
    29  : "Como - Legault families complex, extremely stony.",
    30  : "Como family - Rock land - Legault family complex, extremely stony.",
    31  : "Leighcan - Catamount families complex, extremely stony.",
    32  : "Catamount family - Rock outcrop - Leighcan family complex, extremely stony.",
    33  : "Leighcan - Catamount families - Rock outcrop complex, extremely stony.",
    34  : "Cryorthents - Rock land complex, extremely stony.",
    35  : "Cryumbrepts - Rock outcrop - Cryaquepts complex.",
    36  : "Bross family - Rock land - Cryumbrepts complex, extremely stony.",
    37  : "Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.",
    38  : "Leighcan - Moran families - Cryaquolls complex, extremely stony.",
    39  : "Moran family - Cryorthents - Leighcan family complex, extremely stony.",
    40  : "Moran family - Cryorthents - Rock land complex, extremely stony."
}

In [16]:
rubblyStonyCats = [f"Soil_Type{k}" for k, v in soilTypes.items() if "rubbly" in v.lower()]

veryStonyCats = [f"Soil_Type{k}" for k, v in soilTypes.items() if "very stony" in v.lower()]
extremelyStonyCats = [f"Soil_Type{k}" for k, v in soilTypes.items() if "extremely stony" in v.lower()]

In [17]:
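# Calculate the Euclidean Distance - adheres to #1 of Feature-Engineering
# inputs are cast to float64 first so that squaring integer columns cannot overflow
distance = lambda xVals, yVals : np.hypot(np.asarray(xVals, dtype = np.float64), np.asarray(yVals, dtype = np.float64))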

def featureEngineering(frame : pd.DataFrame) -> pd.DataFrame:
    """Apply Feature Engineering on the Given DataFrame"""
    
    frame = deepcopy(frame) # keeps the original data intact - useful in quick analysis
    
    # calculate resource distances
    frame["distanceHydro"] = distance(frame["Horizontal_Distance_To_Hydrology"].values, frame["Vertical_Distance_To_Hydrology"].values)
    frame["distanceRoads"] = distance(frame["Elevation"].values, frame["Horizontal_Distance_To_Roadways"].values)
    frame["distanceFires"] = distance(frame["Elevation"].values, frame["Horizontal_Distance_To_Fire_Points"].values)
    
    # calculate different soil type and wilderness type count
    frame["soil_type_count_r"] = frame[rubblyStonyCats].sum(axis = 1)
    frame["soil_type_count_vs"] = frame[veryStonyCats].sum(axis = 1)
    frame["soil_type_count_es"] = frame[extremelyStonyCats].sum(axis = 1)
    frame["soil_type_count_s"] = frame[["soil_type_count_vs", "soil_type_count_es"]].sum(axis = 1)
    frame["soil_type_count_total"] = frame[[x for x in frame.columns if x.startswith("Soil_Type")]].sum(axis = 1)
    frame["wilderness_area_count"] = frame[[x for x in frame.columns if x.startswith("Wilderness_Area")]].sum(axis = 1)
    
    return frame

In [18]:
dataTrain = featureEngineering(dataTrain)
dataTest = featureEngineering(dataTest)

### #3 Highly Imbalanced Data

The **`Cover_Type`** classes are highly imbalanced; in fact, `value_counts` reveals:

```python
2    2262087
1    1468136
3     195712
7      62261
6      11426
4        377
5          1
Name: Cover_Type, dtype: int64
```
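
To put these counts in perspective, the short sketch below converts them into class shares and a max-to-min imbalance ratio. The dictionary simply restates the `value_counts` output above; the variable names are illustrative.

```python
# class counts copied from the `value_counts` output above
counts = {2: 2262087, 1: 1468136, 3: 195712, 7: 62261, 6: 11426, 4: 377, 5: 1}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in sorted(shares.items()):
    print(f"class {label}: {share:.5%}")

# ratio between the largest and the smallest class
imbalanceRatio = max(counts.values()) / min(counts.values())
print(f"total rows: {total}, imbalance ratio: {imbalanceRatio:.0f}")
```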

<p align = "justify">Following my earlier work on dealing with <a href = "https://www.kaggle.com/dpramanik/customer-churn-prediction-using-ann-and-smote">an Imbalanced Dataset</a> in Customer Churn Prediction, we will <i>oversample</i> the minority classes. In this section, a combination of two oversampling methods, <b>ADASYN</b> followed by <b>SMOTE</b>, is used for classes $4, 6, 7$, and <b>SMOTE</b> alone is used for class $3$. Classes $1$ and $2$ keep their original rows; a random subsample of each only serves as the majority reference when SMOTE is fitted. Class $5$ has a single sample, so it is impossible/meaningless to generate synthetic samples from neighbours, and that row is simply duplicated (391,500 times in the code below) so that the class is not neglected.</p>
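
The core idea behind SMOTE-style oversampling can be sketched in a few lines of NumPy: a synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration under stated assumptions, not the `imblearn` implementation; `smoteLikeSample` is a hypothetical helper, and ADASYN additionally weights which points get more synthetic neighbours. It also shows why a class with a single row can only be duplicated.

```python
import numpy as np

def smoteLikeSample(minority, k, nNew, rng):
    """Hypothetical helper: interpolate between minority points and their k nearest minority neighbours"""
    nPoints = minority.shape[0]

    # pairwise Euclidean distances inside the minority class
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis = 2))
    np.fill_diagonal(dists, np.inf)                   # a point is not its own neighbour
    neighbours = np.argsort(dists, axis = 1)[:, :k]   # indices of the k nearest neighbours

    base = rng.integers(0, nPoints, size = nNew)      # which point to start from
    pick = rng.integers(0, k, size = nNew)            # which neighbour to move towards
    lam = rng.random(size = (nNew, 1))                # interpolation factor in [0, 1)

    target = minority[neighbours[base, pick]]
    return minority[base] + lam * (target - minority[base])

rng = np.random.default_rng(0)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smoteLikeSample(minority, k = 2, nNew = 5, rng = rng)
print(synthetic.shape)

# with a single sample there is no neighbour to interpolate towards, so the row is repeated
single = np.array([[5.0, 5.0]])
duplicated = np.repeat(single, 3, axis = 0)
print(duplicated)
```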

In [19]:
smote = SMOTE(k_neighbors = 7)
adasyn = ADASYN(n_neighbors = 7)

In [20]:
# bring the count up to 195712 (class-3) using ADASYN for 4, 6, 7
Xy = dataTrain[
    (dataTrain.Cover_Type == 3) |
    (dataTrain.Cover_Type == 4) |
    (dataTrain.Cover_Type == 6) |
    (dataTrain.Cover_Type == 7)
]

X = Xy.drop(columns = "Cover_Type")
y = Xy.Cover_Type

del Xy # house-keeping

In [21]:
XSynth, ySynth = adasyn.fit_resample(X, y)
ySynth.value_counts()

6    196553
3    195712
7    195703
4    195565
Name: Cover_Type, dtype: int64

<p align = "justify">Okay, so classes $3, 4, 6, 7$ now have roughly equal weight. Next, we can use the SMOTE algorithm to process further. <b>But,</b> the complexity of a non-linear SVM is on the order of $O(n^2)$ to $O(n^3)$, where $n$ is the number of samples (<a href = "https://datascience.stackexchange.com/questions/48709/">reference</a>). Also, the ratios are $\frac{c_1}{c_4} = \frac{1468136}{195743} \approx 7.5$ and $\frac{c_2}{c_1} = \frac{2262087}{1468136} \approx 1.5$. Hence, processing data of this scale in a single pass may take forever! Generating that much synthetic data may also reduce the quality of the SMOTE samples and hurt model training.</p>
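
The arithmetic behind these ratios can be checked directly. The quadratic factor at the end is only a rough, assumed cost model (work growing like $n^2$) to show how quickly the effort grows with the sample count; it is not a measurement.

```python
# counts as quoted in the paragraph above
c1, c2, c4 = 1468136, 2262087, 195743

print(f"c1 / c4 = {c1 / c4:.2f}")
print(f"c2 / c1 = {c2 / c1:.2f}")

# assumed cost model: if work grows like n ** 2, moving from c4 to c1 samples
# multiplies the work by (c1 / c4) ** 2
quadraticFactor = (c1 / c4) ** 2
print(f"quadratic growth factor = {quadraticFactor:.1f}")
```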

In [22]:
Xy = XSynth
Xy["Cover_Type"] = ySynth.values

Xy = pd.concat([
    Xy, dataTrain[
        (dataTrain.Cover_Type == 1) |
        (dataTrain.Cover_Type == 2)
    ]
], ignore_index = True)

Xy.sample(5)

| | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | distanceHydro | distanceRoads | distanceFires | soil_type_count_r | soil_type_count_vs | soil_type_count_es | soil_type_count_s | soil_type_count_total | wilderness_area_count | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1271511 | 3035 | 104 | 22 | 729 | -15 | 2114 | 209 | 216 | 127 | 726 | ... | 729.154305 | 3698.678277 | 3120.625098 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 1708899 | 3023 | 33 | 16 | 238 | 0 | 1607 | 241 | 246 | 118 | 620 | ... | 238.0 | 3423.591389 | 3085.924335 | 0 | 0 | 0 | 0 | 0 | 1 | 2 |
| 1892743 | 2967 | 117 | 15 | 171 | 30 | 1126 | 238 | 234 | 187 | 1447 | ... | 173.611636 | 3173.478376 | 3301.044986 | 1 | 0 | 0 | 0 | 2 | 1 | 2 |
| 3594273 | 3354 | 148 | 8 | 1004 | 117 | 1249 | 206 | 246 | 41 | 932 | ... | 1010.794242 | 3579.010617 | 3481.083165 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3982222 | 2560 | 219 | 12 | 52 | 418 | 85 | 212 | 202 | 146 | 3955 | ... | 421.222032 | 2561.410744 | 4711.223302 | 1 | 0 | 0 | 0 | 2 | 1 | 2 |


In [23]:
parts = pd.concat([
    # increase the data size by approximately twice
    Xy[Xy.Cover_Type == 1].sample(391500),
    Xy[Xy.Cover_Type == 2].sample(391500),
    Xy[(Xy.Cover_Type > 2) & (Xy.Cover_Type != 5)]
])

X = parts.drop(columns = "Cover_Type")
y = parts.Cover_Type

In [24]:
XSynth, ySynth = smote.fit_resample(X, y)
ySynth.value_counts()

7    391500
6    391500
4    391500
3    391500
2    391500
1    391500
Name: Cover_Type, dtype: int64

In [25]:
Xy = XSynth
Xy["Cover_Type"] = ySynth.values

Xy = pd.concat([
    Xy[Xy.Cover_Type > 2],
    dataTrain[
        (dataTrain.Cover_Type == 1) |
        (dataTrain.Cover_Type == 2)
    ],
    dataTrain[dataTrain.Cover_Type == 5].sample(391500, replace = True)
], ignore_index = True)

Xy.Cover_Type.value_counts()

2    2262087
1    1468136
7     391500
6     391500
5     391500
4     391500
3     391500
Name: Cover_Type, dtype: int64

In [26]:
del XSynth, ySynth # house-keeping

In [27]:
Xy.sample(5)

| | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | distanceHydro | distanceRoads | distanceFires | soil_type_count_r | soil_type_count_vs | soil_type_count_es | soil_type_count_s | soil_type_count_total | wilderness_area_count | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3386553 | 2583 | 201 | 25 | 403 | 98 | 1130 | 232 | 176 | 128 | 1020 | ... | 414.7445 | 2819.359679 | 2777.100826 | 0 | 0 | 1 | 1 | 1 | 1 | 2 |
| 2126230 | 3351 | 83 | 8 | 164 | 1 | 2628 | 233 | 231 | 152 | 1193 | ... | 164.003049 | 4258.589555 | 3557.028254 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 4062349 | 3174 | 119 | 12 | 695 | -6 | 825 | 240 | 247 | 77 | 492 | ... | 695.025899 | 3279.466572 | 3211.905976 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3467675 | 3277 | 33 | 9 | 204 | 16 | 1015 | 200 | 220 | 82 | 3874 | ... | 204.626489 | 3430.590911 | 5074.111252 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 753149 | 3102 | 117 | 10 | 37 | 8 | 1608 | 241 | 234 | 136 | 75 | ... | 39.658004 | 3495.008849 | 3103.496331 | 0 | 0 | 1 | 1 | 1 | 1 | 7 |


In [56]:
for i in range(1, 8):
    print(i, Xy[Xy.Cover_Type == i].drop_duplicates().shape)

1 (1468136, 64)
2 (2262087, 64)
3 (391500, 64)
4 (391499, 64)
5 (1, 64)
6 (391500, 64)
7 (391489, 64)


In [48]:
Xy.to_csv(join(".", "output", "train-100.csv"), index = False)
dataTest.to_csv(join(".", "output", "test-100.csv"), index = False)

### Data Scaling

<p align = "justify">Many machine learning models, neural networks in particular, train and predict better when numerical features are on a comparable scale. For this, we will be using <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html"><code>MinMaxScaler</code></a> for all feature columns, and <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html"><code>LabelEncoder</code></a> for <code>Cover_Type</code>.</p>
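
For intuition, the two transformations can be reproduced by hand in NumPy. The sketch below assumes the default `MinMaxScaler` range of $[0, 1]$, i.e. $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$ per column, and mimics label encoding by mapping the sorted unique labels to $0, 1, 2, \dots$. The arrays (`XToy`, `labelsToy`, `testRow`) are hypothetical toy data.

```python
import numpy as np

# toy features: Elevation and Aspect of the first three training rows from the preview
XToy = np.array([[3189.0, 40.0], [3026.0, 182.0], [3106.0, 13.0]])
labelsToy = np.array([1, 2, 1, 7, 5])  # hypothetical labels

# min-max scaling: statistics are learned per column on the training data
xMin, xMax = XToy.min(axis = 0), XToy.max(axis = 0)
XToyScaled = (XToy - xMin) / (xMax - xMin)

# the same training statistics are reused for unseen rows (values may fall outside [0, 1])
testRow = np.array([[2763.0, 78.0]])
testRowScaled = (testRow - xMin) / (xMax - xMin)

# label encoding: sorted unique classes are mapped to 0 .. n_classes - 1
classes, encoded = np.unique(labelsToy, return_inverse = True)

print(XToyScaled)
print(testRowScaled)
print(classes, encoded)
```

Reusing the training minimum and maximum for the test rows mirrors calling `fit_transform` on the training data and only `transform` on the test data.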

In [29]:
X = Xy.drop(columns = "Cover_Type")
y = Xy.Cover_Type

In [30]:
scaler = MinMaxScaler()
encoder = LabelEncoder()

In [31]:
XScaled = scaler.fit_transform(X)
yScaled = encoder.fit_transform(y)

## Creating Model

In [32]:
INPUT_SHAPE = XScaled.shape[1]
OUTPUT_SHAPE = encoder.classes_.shape[0]

INPUT_SHAPE, OUTPUT_SHAPE

(63, 7)

In [41]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape = (INPUT_SHAPE, ), activation = "relu", name = "iLayer"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(128, activation = "relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(256, activation = "relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation = "relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(OUTPUT_SHAPE, activation = "softmax", name = "oLayer"),
], name = "DFC-1.0.0")

model.summary(line_length = 127)

Model: "DFC-1.1.0"
_______________________________________________________________________________________________________________________________
Layer (type)                                             Output Shape                                      Param #             
iLayer (Dense)                                           (None, 64)                                        4096                
_______________________________________________________________________________________________________________________________
dropout_3 (Dropout)                                      (None, 64)                                        0                   
_______________________________________________________________________________________________________________________________
dense_3 (Dense)                                          (None, 128)                                       8320                
_____________________________________________________________________________________

In [42]:
model.compile(
    optimizer = "adam",
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = False), # oLayer already applies softmax, so the model outputs probabilities, not logits
    metrics = [tf.keras.metrics.SparseCategoricalAccuracy()]
)

In [43]:
model.fit(XScaled, yScaled, epochs = 10, batch_size = 128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1b4bdea8a90>

In [36]:
testScaled = scaler.transform(dataTest)

In [37]:
yPredicted = model.predict(testScaled)

In [38]:
# take the most probable class per row and map it back to the original labels (1..7)
yPredictedMax = encoder.inverse_transform(np.argmax(yPredicted, axis = 1))
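
The decoding step can be illustrated on a tiny hypothetical probability matrix: the index of the largest probability in each row is looked up in the array of class labels. Here `classes` stands in for what `encoder.classes_` is assumed to hold in this notebook, and `probs` is invented for illustration.

```python
import numpy as np

classes = np.array([1, 2, 3, 4, 5, 6, 7])  # stand-in for `encoder.classes_`

# hypothetical softmax outputs for two rows (each row sums to 1)
probs = np.array([
    [0.10, 0.70, 0.05, 0.05, 0.02, 0.03, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.70],
])

# column index of the largest probability per row, mapped back to the label
labels = classes[np.argmax(probs, axis = 1)]
print(labels)
```

Indexing into the class array rather than adding 1 keeps the decoding valid even if the labels were not the contiguous range $1, \dots, 7$.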

In [39]:
# DFC-1.0.0
output = pd.DataFrame({
    "Id" : dataTest.index.values,
    "Cover_Type" : yPredictedMax
})

output.sample(5)

| | Id | Cover_Type |
|---|---|---|
| 326846 | 4326846 | 3 |
| 242605 | 4242605 | 2 |
| 988681 | 4988681 | 1 |
| 466158 | 4466158 | 3 |
| 120872 | 4120872 | 1 |


In [40]:
output.to_csv("dfc100-2.csv", index = False)