# Introduction #
Welcome
<blockquote style="margin-right:auto; margin-left:auto; padding: 1em; margin:24px;">
    <strong>Fork This Notebook!</strong><br>
Create your own editable copy of this notebook by clicking on the <strong>Copy and Edit</strong> button in the top right corner.
</blockquote>

## Imports and Configuration ##

We'll start by importing the packages we used in the exercises and setting some notebook defaults. Unhide this cell if you'd like to see the libraries we'll use:

In [3]:
from IPython.display import clear_output
# !pip install autokeras
# clear_output()

import os
import warnings
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from pandas.api.types import CategoricalDtype

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
import autokeras as ak
import tensorflow as tf


# Set Matplotlib defaults
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)

# Mute warnings
warnings.filterwarnings('ignore')



In [4]:
# -----------------------------------------------------------------
# Configuration parameters
ID = "Id"
MAX_TRIAL = 1  # number of AutoKeras search trials; kept at 1 for a fast run
EPOCHS = 3
BATCH_SIZE = 2048
ACTIVATION = 'swish'
LEARNING_RATE = 0.0007
LABEL_SMOOTHING=1e-3
FOLDS = 5
RANDOM_STATE = 42
VALIDATION_SPLIT = 0.97  # fraction of the training rows held out for validation

## Data Preprocessing ##

Before we can do any feature engineering, we need to *preprocess* the data to get it in a form suitable for analysis. The data we used in the course was a bit simpler than the competition data. For the Tabular Playground Series (December 2021) competition dataset, we'll need to:
- **Load** the data from CSV files
- **Clean** the data to fix any errors or inconsistencies
- **Encode** the statistical data type (numeric, categorical)
- **Impute** any missing values

We'll wrap all these steps up in a function, which will make it easy for you to get a fresh dataframe whenever you need one. The full pipeline applies three preprocessing steps after reading the CSV files: `clean`, `encode`, and `impute`. In this notebook the `clean` and `encode` calls are commented out, so only `impute` runs. The function returns two data splits: one (`df_train`) for training the model, and one (`df_test`) for making the predictions that you'll submit to the competition for scoring on the leaderboard.

In [5]:
def load_data():
    # Read data
    data_dir = Path("../input/tabular-playground-series-dec-2021")
    df_train = pd.read_csv(data_dir / "train.csv", index_col="Id")
    df_test = pd.read_csv(data_dir / "test.csv", index_col="Id")
    # Merge the splits so we can process them together
#     df = pd.concat([df_train, df_test])
    # Preprocessing
#     df = clean(df)
#     df = encode(df)
    df_train = impute(df_train)
    df_test = impute(df_test)
    # Reform splits
#     df_train = df.loc[df_train.index, :]
#     df_test = df.loc[df_test.index, :]
    return df_train, df_test


### Handle Missing Values ###

Handling missing values now will make the feature engineering go more smoothly. We'll impute `0` for missing numeric values and `"None"` for missing categorical values. You might like to experiment with other imputation strategies. In particular, you could try creating "missing value" indicators: `1` whenever a value was imputed and `0` otherwise.

In [6]:
def impute(df):
    for name in df.select_dtypes("number"):
        df[name] = df[name].fillna(0)
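    for name in df.select_dtypes("category"):
        # fillna raises on a categorical column unless "None" is a known category
        if "None" not in df[name].cat.categories:
            df[name] = df[name].cat.add_categories("None")
        df[name] = df[name].fillna("None")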
    return df
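
The "missing value" indicator idea mentioned above can be sketched as follows. This is a minimal illustration on a toy dataframe, not part of the original notebook: the helper name `add_missing_indicators`, the `_was_missing` suffix, and the toy column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def add_missing_indicators(df):
    """Return a copy of df with one 0/1 indicator column per column that has NaNs."""
    out = df.copy()
    for name in df.columns[df.isna().any()]:
        # 1 where the value is missing (and will later be imputed), 0 otherwise
        out[f"{name}_was_missing"] = df[name].isna().astype(int)
    return out


# Toy frame standing in for the competition data (column names are illustrative)
toy = pd.DataFrame({"Elevation": [2500.0, np.nan, 3100.0], "Slope": [10, 12, 8]})
toy_flagged = add_missing_indicators(toy)  # add the flags before imputing
toy_imputed = toy_flagged.fillna(0)        # then impute as in `impute`
print(toy_imputed)
```

The indicators have to be created before imputation, because afterwards the imputed `0` can no longer be told apart from a genuine `0`.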

## Load Data ##

And now we can call the data loader and get the processed data splits:

In [7]:
df_train, df_test = load_data()

In [8]:
# Peek at the values
# display(df_train)
# display(df_test)

# Display information about dtypes and missing values
# display(df_train.info())
# display(df_test.info())

In [9]:
target_col = df_train.columns.difference(df_test.columns)[0]
X_raw = df_train.drop(columns=target_col)
y_raw = df_train[target_col]

X_test_raw = df_test.iloc[:,:]
target_col

'Cover_Type'
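
The target column is found as the one column present in the training data but absent from the test data. A small sketch of that trick on toy dataframes (the feature names and values are made up for illustration), with a guard added in case the two files ever differ by more than one column:

```python
import pandas as pd

# Toy frames: the training frame has one extra column, the target
toy_train = pd.DataFrame({"Elevation": [1, 2], "Slope": [3, 4], "Cover_Type": [1, 2]})
toy_test = pd.DataFrame({"Elevation": [5], "Slope": [6]})

# Columns that appear in train but not in test
extra = toy_train.columns.difference(toy_test.columns)

# Guard: the [0] lookup is only meaningful if exactly one column differs
assert len(extra) == 1, f"expected one target column, found {list(extra)}"
toy_target = extra[0]
print(toy_target)  # prints Cover_Type
```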

## Scaler Transformer ##

In [10]:
transformer_all_cols = make_pipeline(
    StandardScaler(),
    MinMaxScaler(feature_range=(0, 1))
)

preprocessor = make_column_transformer(
    (transformer_all_cols, df_test.columns[:]),
)

In [11]:
# X_test_raw

In [12]:
X_train = preprocessor.fit_transform(X_raw)
X_test = preprocessor.transform(X_test_raw)
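
One thing worth knowing about the pipeline above: min-max scaling is unaffected by a preceding standardization, because standardization only shifts and rescales each column. The sketch below checks this with plain NumPy on synthetic data. It re-implements the two formulas by hand rather than calling scikit-learn, and it assumes every column is non-constant.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # synthetic features


def standardize(a):
    # Zero mean, unit variance per column (the StandardScaler formula)
    return (a - a.mean(axis=0)) / a.std(axis=0)


def min_max(a):
    # Rescale each column to [0, 1] (the MinMaxScaler formula)
    lo, hi = a.min(axis=0), a.max(axis=0)
    return (a - lo) / (hi - lo)


both = min_max(standardize(X))
only_min_max = min_max(X)
print(np.allclose(both, only_min_max))  # prints True
```

So on the training data the two-step pipeline should give the same result as `MinMaxScaler` alone. Keeping both steps is harmless, but dropping `StandardScaler` would simplify the transformer.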

Tabular Playground Series (TPS) datasets are always large, so we run the garbage collector to free memory before training.

In [13]:
import gc
gc.collect()

132

# Hyperparameter Tuning #

At this stage, you might like to run automatic hyperparameter tuning with AutoKeras before creating your final submission. AutoKeras is an AutoML system based on Keras, developed by DATA Lab at Texas A&M University. Its goal is to make machine learning accessible to everyone.

By default, AutoKeras uses the last 20% of the training data as validation data. As shown in the example below, you can pass `validation_split` to `fit` to specify a different fraction.
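
To make the effect of `validation_split` concrete, here is a small sketch of the arithmetic. The helper `split_sizes` and the row count of 1,000,000 are hypothetical, chosen only for illustration. The exact rounding AutoKeras applies internally is not shown in this notebook, so treat the numbers as approximate.

```python
def split_sizes(n_rows, validation_split):
    """Return (n_train, n_val) when the last fraction of rows is held out."""
    n_val = round(n_rows * validation_split)
    return n_rows - n_val, n_val


n_rows = 1_000_000  # hypothetical row count, for illustration only

default_sizes = split_sizes(n_rows, 0.2)    # the AutoKeras default
notebook_sizes = split_sizes(n_rows, 0.97)  # the VALIDATION_SPLIT used here

print(default_sizes)   # prints (800000, 200000)
print(notebook_sizes)  # prints (30000, 970000)
```

With `VALIDATION_SPLIT = 0.97`, only about 3% of the rows are used to fit the model. That keeps the search fast, but a smaller value leaves far more data for training.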

In [14]:
# Search for the best model with EarlyStopping.
cbs = [
    tf.keras.callbacks.EarlyStopping(patience=3),
]

In [15]:
# Initialize the structured data classifier.
clf = ak.StructuredDataClassifier(
    overwrite=True, max_trials=MAX_TRIAL,seed=RANDOM_STATE
)  # Runs MAX_TRIAL search trials (1 in this notebook).
# Feed the structured data classifier with training data.
clf.fit(
    X_train,
    y_raw,
    # Hold out the last VALIDATION_SPLIT fraction of the training data for validation.
    validation_split=VALIDATION_SPLIT,
    epochs=EPOCHS,
    callbacks=cbs,
)



Trial 1 Complete [00h 07m 51s]
val_accuracy: 0.5004639029502869

Best val_accuracy So Far: 0.5004639029502869
Total elapsed time: 00h 07m 51s
INFO:tensorflow:Oracle triggered exit
Epoch 1/3
Epoch 2/3
Epoch 3/3
INFO:tensorflow:Assets written to: .\structured_data_classifier\best_model\assets


<tensorflow.python.keras.callbacks.History at 0x1fac5457088>

In [16]:
# clf.evaluate(X_val, y_val)
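
The commented-out `clf.evaluate(X_val, y_val)` call needs a hold-out set that this notebook never creates. One way to build it is a plain shuffled split, sketched below with NumPy on toy arrays. The helper `holdout_split` and the toy data are assumptions for illustration. With a split like this, you would fit on `X_fit`, `y_fit` and evaluate on `X_val`, `y_val`.

```python
import numpy as np


def holdout_split(X, y, val_fraction=0.2, seed=42):
    """Shuffle the rows and set aside val_fraction of them for validation."""
    X, y = np.asarray(X), np.asarray(y)  # positional indexing, also for pandas input
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(round(len(X) * val_fraction))
    val_idx, fit_idx = order[:n_val], order[n_val:]
    return X[fit_idx], X[val_idx], y[fit_idx], y[val_idx]


# Toy data standing in for X_train and y_raw
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10) % 3 + 1

X_fit, X_val, y_fit, y_val = holdout_split(X_demo, y_demo)
print(X_fit.shape, X_val.shape)  # prints (8, 2) (2, 2)
```

Unlike `validation_split`, which takes the last rows in their stored order, this split shuffles first, so the hold-out set does not depend on row order.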

You can also export the best model found by AutoKeras as a Keras Model.

In [17]:
model = clf.export_model()
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 54)]              0         
_________________________________________________________________
multi_category_encoding (Mul (None, 54)                0         
_________________________________________________________________
normalization (Normalization (None, 54)                109       
_________________________________________________________________
dense (Dense)                (None, 32)                1760      
_________________________________________________________________
re_lu (ReLU)                 (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                1056      
_________________________________________________________________
re_lu_1 (ReLU)               (None, 32)                0     

# Train Model and Create Submissions #

Once you're satisfied with everything, it's time to create your final predictions! This cell will:
- use the best trained model to make predictions from the test set
- save the predictions to a CSV file

In [18]:
# Predict with the best model.
predicted_y = clf.predict(X_test)




In [40]:
# predicted_y.shape
predicted_y

array([['2'],
       ['2'],
       ['1'],
       ...,
       ['1'],
       ['1'],
       ['3']], dtype='<U1')
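
The array above is two-dimensional (one column) and holds the class labels as strings. Before writing a submission, it can be flattened and converted to integers. The sketch below uses a toy array and toy Ids shaped like the output shown above. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy predictions with the same shape and dtype as the output above
toy_pred = np.array([["2"], ["2"], ["1"]])
toy_ids = [4000000, 4000001, 4000002]

# Flatten (n, 1) -> (n,) and convert the string labels to integers
labels = toy_pred.ravel().astype(int)

toy_submission = pd.DataFrame({"Id": toy_ids, "Cover_Type": labels})
print(toy_submission.dtypes)
```

Working with a flat integer column makes it easy to check the label distribution, for example with `toy_submission["Cover_Type"].value_counts()`.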

In [44]:
# output = pd.DataFrame({'Id': X_test.index, target_col: predicted_y})
output = pd.read_csv("../input/tabular-playground-series-dec-2021/sample_submission.csv")
output[target_col] = predicted_y
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
output.head()

Your submission was successfully saved!


|   | Id      | Cover_Type |
|---|---------|------------|
| 0 | 4000000 | 2          |
| 1 | 4000001 | 2          |
| 2 | 4000002 | 1          |
| 3 | 4000003 | 2          |
| 4 | 4000004 | 1          |


To submit these predictions to the competition, follow these steps:

1. Begin by clicking on the blue **Save Version** button in the top right corner of the window.  This will generate a pop-up window.
2. Ensure that the **Save and Run All** option is selected, and then click on the blue **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Output** tab on the right of the screen.  Then, click on the file you would like to submit, and click on the blue **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

# Next Steps #

If you want to keep working to improve your performance, select the blue **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.

Be sure to check out [other users' notebooks](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/notebooks) in this competition. You'll find lots of great ideas for new features, as well as other ways to explore the dataset or make better predictions. There's also the [discussion forum](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion), where you can share ideas with other Kagglers.

Have fun Kaggling!