# Introduction #
Welcome. Let's automate machine learning as much as possible.
<blockquote style="margin-right:auto; margin-left:auto; padding: 1em; margin:24px;">
    <strong>Fork This Notebook!</strong><br>
Create your own editable copy of this notebook by clicking on the <strong>Copy and Edit</strong> button in the top right corner.
</blockquote>

## Imports and Configuration ##

We'll start by importing the packages we used in the exercises and setting some notebook defaults. Unhide this cell if you'd like to see the libraries we'll use:

In [51]:
# from IPython.display import clear_output
!pip install -q -U autokeras==1.0.16.post1
# clear_output()

import os
import warnings
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from pandas.api.types import CategoricalDtype
from matplotlib.ticker import MaxNLocator, FormatStrFormatter, PercentFormatter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# from category_encoders import MEstimateEncoder
# from sklearn.cluster import KMeans
# from sklearn.decomposition import PCA
# from sklearn.feature_selection import mutual_info_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
import autokeras as ak
import tensorflow as tf


# Set Matplotlib defaults
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)

# Mute warnings
warnings.filterwarnings('ignore')



# Parameters


In [24]:
# -----------------------------------------------------------------
# Configuration parameters
MAX_TRIAL = 5  # number of AutoKeras trials; kept small for a quick run
EPOCHS = 80

# not used
BATCH_SIZE = 2048 # large enough to fit RAM
ACTIVATION = 'swish'
LEARNING_RATE = 0.0007
LABEL_SMOOTHING=1e-3
FOLDS = 5

RANDOM_STATE = 42
VERBOSE = 0

# The full dataset is too large for the free container, so cap each class at SAMPLE rows.
SAMPLE = 195712  # per-class row counts: [1468136, 2262087, 195712, 377, 1, 11426, 62261]; 4000000 rows in total
VALIDATION_SPLIT = 0.15

# Admin
ID = "Id"  # name of the index column; other datasets may use id, x, X, or index
INPUT = "../input/tabular-playground-series-dec-2021"

## Data Preprocessing ##

Before we can do any feature engineering, we need to *preprocess* the data to get it in a form suitable for analysis. The data we used in the course was a bit simpler than the competition data. For this competition's dataset, we'll need to:
- **Load** the data from CSV files
- **Clean** the data to fix any errors or inconsistencies
- **Encode** the statistical data type (numeric, categorical)
- **Impute** any missing values

We'll wrap all these steps up in a function, which will make it easy for you to get a fresh dataframe whenever you need one. After reading the CSV files, the full recipe applies three preprocessing steps, `clean`, `encode`, and `impute`; in this notebook only `impute` is active, and the other two calls are commented out. The function then returns the data splits: one (`df_train`) for training the model, and one (`df_test`) for making the predictions that you'll submit to the competition for scoring on the leaderboard.

In [25]:
def load_data():
    # Read data
    data_dir = Path(INPUT)
    df_train = pd.read_csv(data_dir / "train.csv", index_col=ID)
    df_test = pd.read_csv(data_dir / "test.csv", index_col=ID)
    # Merge the splits so we can process them together
#     df = pd.concat([df_train, df_test])
    # Preprocessing
#     df = clean(df)
#     df = encode(df)
    df_train = impute(df_train)
    df_test = impute(df_test)
    # Reform splits
#     df_train = df.loc[df_train.index, :]
#     df_test = df.loc[df_test.index, :]
    return df_train, df_test


### Handle Missing Values ###

Handling missing values now will make the feature engineering go more smoothly. We'll impute `0` for missing numeric values and `"None"` for missing categorical values. You might like to experiment with other imputation strategies. In particular, you could try creating "missing value" indicators: `1` whenever a value was imputed and `0` otherwise.

In [26]:
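def impute(df):
    # Numeric columns: fill missing values with 0
    for name in df.select_dtypes("number"):
        df[name] = df[name].fillna(0)
    # Categorical columns: "None" must be a known category before filling
    for name in df.select_dtypes("category"):
        if "None" not in df[name].cat.categories:
            df[name] = df[name].cat.add_categories("None")
        df[name] = df[name].fillna("None")
    return df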
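
The "missing value" indicators mentioned above can be sketched as follows. This is a minimal illustration on a toy frame, not part of the original pipeline; the helper name `add_missing_indicators` and the `_was_missing` suffix are hypothetical.

```python
import numpy as np
import pandas as pd


def add_missing_indicators(df):
    """Return a copy of df with a 0/1 indicator for each column that has missing values.

    Hypothetical helper for illustration; call it before imputing.
    """
    out = df.copy()
    for name in df.columns[df.isna().any()]:
        # 1 where the original value was missing, 0 otherwise
        out[name + "_was_missing"] = df[name].isna().astype(int)
    return out


# Toy data (assumption): column "a" has one missing value, column "b" has none
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
toy_flagged = add_missing_indicators(toy)
print(toy_flagged)
```

Because the indicators have to be computed while the gaps are still visible, a helper like this would run before `impute`, not after it.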

## Load Data ##

And now we can call the data loader and get the processed data splits:

In [27]:
df_train, df_test = load_data()

In [28]:
# Peek at the values
display(df_train)
# display(df_test)

# Display information about dtypes and missing values
# display(df_train.info())
# display(df_test.info())

In [29]:
target_col = df_train.columns.difference(df_test.columns)[0]
X_raw = df_train.drop(columns=target_col)
y_raw = df_train[target_col]

X_test_raw = df_test.iloc[:,:]
target_col

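# Resampling #

AutoKeras can miscalculate the number of target categories when a class (for example, class 5) is missing from the training labels, so the undersampling below only caps the large classes and keeps every class present.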

In [30]:
from sklearn.model_selection import train_test_split
# Check NA
missing_val = X_raw.isnull().sum()
print(missing_val[missing_val > 0])

# For small testing batch
# X_raw, x_val, y, y_val = train_test_split(X_raw, y, train_size = VALIDATION_SPLIT, random_state = RANDOM_STATE)
# X_raw = X_raw.sample(n=SAMPLE, random_state=RANDOM_STATE)
# y_raw = y_raw.sample(n=SAMPLE, random_state=RANDOM_STATE)
# x_test = x_test.sample(n=SAMPLE, random_state=RANDOM_STATE)

In [31]:
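from sklearn.utils import shuffle

# Count the rows per class, then cap every class at SAMPLE rows
sampling_key, sampling_count = np.unique(y_raw, return_counts=True)
sampling_count[sampling_count > SAMPLE] = SAMPLE
sampling_params = dict(zip(sampling_key, sampling_count))

undersample = RandomUnderSampler(
    sampling_strategy=sampling_params, random_state=RANDOM_STATE)

X_raw, y_raw = undersample.fit_resample(X_raw, y_raw)

# The resampled rows may come back grouped by class; shuffle them so the
# tail used later as validation data is not limited to the last classes.
X_raw, y_raw = shuffle(X_raw, y_raw, random_state=RANDOM_STATE)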

In [32]:
np.unique(y_raw, return_counts=True)
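
The per-class cap used for undersampling can be sketched with the standard library alone. This is an illustration on toy labels, assuming the goal is "shrink large classes, leave rare classes whole"; the helper name `capped_class_counts` is hypothetical. The resulting dictionary has the same class-to-count form as the `sampling_strategy` passed to `RandomUnderSampler` above.

```python
from collections import Counter


def capped_class_counts(labels, cap):
    """Map each class to min(count, cap): large classes shrink, rare classes stay whole."""
    return {label: min(count, cap) for label, count in Counter(labels).items()}


# Toy labels (assumption): class 1 is common, class 5 has a single row
toy_labels = [1] * 10 + [2] * 4 + [5]
toy_strategy = capped_class_counts(toy_labels, cap=4)
print(toy_strategy)
```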

## Scaler Transformer ##

In [33]:
transformer_all_cols = make_pipeline(
    RobustScaler(),
)

preprocessor = make_column_transformer(
    (transformer_all_cols, df_test.columns[:]),
)

In [35]:
X_train = preprocessor.fit_transform(X_raw)
X_test = preprocessor.transform(X_test_raw)
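
With its default settings, `RobustScaler` centers each feature on its median and divides by the interquartile range (IQR), which makes it less sensitive to outliers than mean and variance scaling. A minimal standard-library sketch of that formula on one toy feature follows; the helper name `robust_scale` is hypothetical, and the sketch assumes the IQR is nonzero.

```python
import statistics


def robust_scale(values):
    """Scale a list of numbers as (x - median) / IQR (assumes IQR is nonzero)."""
    median = statistics.median(values)
    # "inclusive" quartiles use linear interpolation between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(x - median) / iqr for x in values]


toy_feature = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is an outlier
toy_scaled = robust_scale(toy_feature)
print(toy_scaled)
```

The outlier stays far from the rest after scaling, but it does not distort the scale applied to the ordinary values, because the median and the IQR ignore it.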

Tabular Playground Series (TPS) datasets are always large, so we run garbage collection to free memory before training.

In [36]:
import gc
gc.collect()

# Hyperparameter Tuning #

At this stage, you might like to run automatic hyperparameter tuning with AutoKeras before creating your final submission.
AutoKeras is an AutoML system based on Keras, developed by DATA Lab at Texas A&M University. Its goal is to make machine learning accessible to everyone.

By default, AutoKeras uses the last 20% of the training data as validation data. As shown in the example below, you can use `validation_split` to specify a different fraction.
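
The "last fraction" behaviour of `validation_split` can be pictured with a small sketch. This is an illustration only: the helper name `split_tail` is hypothetical, and the exact rounding AutoKeras applies when it computes the split size is an assumption here.

```python
def split_tail(rows, validation_split):
    """Hold out the last `validation_split` fraction of rows for validation."""
    n_val = int(len(rows) * validation_split)
    cut = len(rows) - n_val
    return rows[:cut], rows[cut:]


toy_rows = list(range(20))
toy_train, toy_val = split_tail(toy_rows, validation_split=0.15)
print(len(toy_train), len(toy_val))
```

Because the held-out rows come from the end of the data, the order of the rows matters: if they are sorted or grouped by class, the validation set will not be representative.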

In [37]:
# Search for the best model with EarlyStopping.
cbs = [
    tf.keras.callbacks.EarlyStopping(patience=3),
]
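
`EarlyStopping` monitors `val_loss` by default and stops training once the value has failed to improve for `patience` epochs in a row. The sketch below is a simplified model of that rule in plain Python; it ignores options such as `min_delta` and `restore_best_weights`, and the helper name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0  # improvement: reset the counter
        else:
            wait += 1  # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Toy validation losses (assumption): the best value so far is reached at epoch 2
toy_losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.5]
stop_epoch = early_stopping_epoch(toy_losses, patience=3)
print(stop_epoch)
```

In this toy run the later improvement at the last epoch is never seen, which is the trade-off of a small `patience` value.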

In [38]:
# Initialize the structured data classifier.
clf = ak.StructuredDataClassifier(
    overwrite=False, max_trials=MAX_TRIAL, seed=RANDOM_STATE
)  # It tries MAX_TRIAL different models.
# Feed the structured data classifier with training data.
history = clf.fit(
    X_train,
    y_raw,
    # Split the training data and use the last 15% as validation data.
    validation_split=VALIDATION_SPLIT,
    epochs=EPOCHS,
    callbacks=cbs,
)



In [39]:
# clf.evaluate(X_val, y_val)

You can also export the best model found by AutoKeras as a Keras Model.

In [40]:
model = clf.export_model()
model.summary()

# Train Model and Create Submissions #

Once you're satisfied with everything, it's time to create your final predictions! This cell will:
- use the best trained model to make predictions from the test set
- save the predictions to a CSV file
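
The two steps above can be sketched on toy data. This is a minimal illustration, assuming the predictions come back as a two-dimensional array with one column; the toy frame, the toy predictions, and the in-memory buffer are stand-ins for the real sample submission, the model output, and `submission.csv`.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-ins (assumptions): a sample submission and column-shaped predictions
toy_submission = pd.DataFrame({"Id": [10, 11, 12], "Cover_Type": [0, 0, 0]})
toy_predictions = np.array([[2], [1], [2]])  # shape (n_rows, 1)

# Flatten to one value per row before assigning to the target column
toy_submission["Cover_Type"] = toy_predictions.ravel()

buffer = io.StringIO()  # in-memory stand-in for 'submission.csv'
toy_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the dataframe's row index out of the file, so the CSV contains only the ID and target columns.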

In [41]:
# Predict with the best model.
predicted_y = clf.predict(X_test)


In [42]:
predicted_y

In [43]:
# output = pd.DataFrame({'Id': X_test.index, target_col: predicted_y})
output = pd.read_csv(INPUT + "/sample_submission.csv")
output[target_col] = predicted_y
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
output

In [48]:
# Summarize history for accuracy (the validation curve is left commented out)
plt.plot(history.history['accuracy'])
# plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('acc')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')  # add 'validation' if the second curve is plotted
plt.show()
# Summarize history for loss (the validation curve is left commented out)
plt.plot(history.history['loss'])
# plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')  # add 'validation' if the second curve is plotted
plt.show()

In [45]:
history.history

In [49]:
np.unique(output[target_col], return_counts=True)

In [52]:
# Plot the distribution of the test predictions
plt.figure(figsize=(10,3))
plt.hist(output['Cover_Type'], bins=np.linspace(0.5, 7.5, 8), density=True)
plt.xlabel('Test predictions')
plt.ylabel('Density')
plt.gca().yaxis.set_major_formatter(PercentFormatter(xmax=1))  # densities are fractions of 1, not of 100
plt.show()

To submit these predictions to the competition, follow these steps:

1. Begin by clicking on the blue **Save Version** button in the top right corner of the window.  This will generate a pop-up window.
2. Ensure that the **Save and Run All** option is selected, and then click on the blue **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Output** tab on the right of the screen.  Then, click on the file you would like to submit, and click on the blue **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

# Next Steps #

If you want to keep working to improve your performance, select the blue **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.

Be sure to check out [other users' notebooks](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/notebooks) in this competition. You'll find lots of great ideas for new features, as well as other ways to discover more things about the dataset or make better predictions. There's also the [discussion forum](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion), where you can share ideas with other Kagglers.

Have fun Kaggling!