# <div style="background-color:#20B2AA;text-align:center;color:white;font-size:150%;border-radius:10px"> **Project interest**</div>

This project is based on the Kaggle competition [Playground Series - Season 3, Episode 1](https://www.kaggle.com/competitions/playground-series-s3e1). One of the keys to obtaining the best model in this competition was feature engineering on the coordinate features (latitude and longitude). This notebook explores that kind of location feature engineering.

# <div style="background-color:#20B2AA;text-align:center;color:white;font-size:150%;border-radius:10px"> **0. Imports**</div>

In [1]:
# System
import subprocess
import os

# Data handling
import pandas as pd
import numpy as np

# Dataset
from sklearn.datasets import fetch_california_housing

# <div style="background-color:#20B2AA;text-align:center;color:white;font-size:150%;border-radius:10px"> **1. Data import and EDA**</div>

## <span style='color:#20B2AA;font-size:100%'>1.1</span> | Import data

The data provided in the competition was artificially generated by a deep learning model trained on the California Housing dataset (available in scikit-learn through `fetch_california_housing`). The competition data is retrieved from the Kaggle website, then the training set is consolidated by appending the original scikit-learn data to the provided training data.

A Kaggle API key is required to download the competition data programmatically. If no key is available, download `playground-series-s3e1.zip` manually from the competition webpage and place it in the working directory.
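Before running the download cell, it can help to check whether credentials are configured at all. The sketch below is a minimal, hypothetical helper (`kaggle_credentials_available` is not part of the `kaggle` package). It assumes the locations documented for the Kaggle API: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`). It only checks that credentials exist, not that they are valid.

```python
import os
from pathlib import Path


def kaggle_credentials_available(config_dir=None):
    """Return True if Kaggle API credentials appear to be configured.

    Hypothetical helper for illustration; it does not validate the key.
    """
    # Option 1: credentials supplied through environment variables
    if os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json token file in the configuration directory
    if config_dir is None:
        config_dir = os.environ.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle")
    return (Path(config_dir) / "kaggle.json").is_file()
```

If the function returns `False`, the manual download route is probably the simpler option.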

In [2]:
# Download competition data
if 'playground-series-s3e1.zip' not in os.listdir('./'):
    subprocess.call(["kaggle",
                     "competitions",
                     "download",
                     "-c",
                     "playground-series-s3e1"])

# Unzip files
if not all(f in os.listdir('./') for f in ["test.csv", "train.csv", "sample_submission.csv"]):
    subprocess.call(["unzip",
                     "playground-series-s3e1.zip"])

# Read csv files to DataFrames
kaggle_train = pd.read_csv('./train.csv')
kaggle_test = pd.read_csv('./test.csv')

# Fetch base california_housing dataset from sklearn.
ch = fetch_california_housing()
ch_train = pd.DataFrame(data=ch.data, columns=ch.feature_names)
ch_train['MedHouseVal'] = ch.target
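The cell above relies on an external `unzip` command, which may not be available on every system (for example, a default Windows setup). As a possible alternative, the sketch below uses Python's standard `zipfile` module. The helper name `extract_missing` is hypothetical, and the behaviour (extract everything if any expected file is absent) is an assumption modelled on the check in the cell above.

```python
import zipfile
from pathlib import Path


def extract_missing(archive_path, expected_files, target_dir="."):
    """Extract the archive only if at least one expected file is absent.

    Returns the list of files that were missing before extraction.
    """
    target = Path(target_dir)
    missing = [f for f in expected_files if not (target / f).is_file()]
    if missing:
        # Extract the whole archive into the target directory
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target)
    return missing
```

For the competition archive, a call could look like `extract_missing("playground-series-s3e1.zip", ["test.csv", "train.csv", "sample_submission.csv"])`.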

In [3]:
kaggle_train.head()

| | id | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2.3859 | 15.0 | 3.82716 | 1.1121 | 1280.0 | 2.486989 | 34.6 | -120.12 | 0.98 |
| 1 | 1 | 3.7188 | 17.0 | 6.013373 | 1.054217 | 1504.0 | 3.813084 | 38.69 | -121.22 | 0.946 |
| 2 | 2 | 4.775 | 27.0 | 6.535604 | 1.103175 | 1061.0 | 2.464602 | 34.71 | -120.45 | 1.576 |
| 3 | 3 | 2.4138 | 16.0 | 3.350203 | 0.965432 | 1255.0 | 2.089286 | 32.66 | -117.09 | 1.336 |
| 4 | 4 | 3.75 | 52.0 | 4.284404 | 1.069246 | 1793.0 | 1.60479 | 37.8 | -122.41 | 4.5 |


In [4]:
kaggle_test.head()

| | id | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 37137 | 1.7062 | 35.0 | 4.966368 | 1.096539 | 1318.0 | 2.844411 | 39.75 | -121.85 |
| 1 | 37138 | 1.3882 | 22.0 | 4.187035 | 1.098229 | 2296.0 | 3.180218 | 33.95 | -118.29 |
| 2 | 37139 | 7.7197 | 21.0 | 7.129436 | 0.959276 | 1535.0 | 2.888889 | 33.61 | -117.81 |
| 3 | 37140 | 4.6806 | 49.0 | 4.769697 | 1.048485 | 707.0 | 1.74359 | 34.17 | -118.34 |
| 4 | 37141 | 3.1284 | 25.0 | 3.765306 | 1.081633 | 4716.0 | 2.003827 | 34.17 | -118.29 |


In [5]:
ch_train.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [6]:
# Remove unnecessary columns
if 'id' in kaggle_train.columns:
    kaggle_train.drop('id', axis=1, inplace=True)
if 'id' in kaggle_test.columns:
    kaggle_test.drop('id', axis=1, inplace=True)

# Create feature to distinguish origin dataset
kaggle_train['is_artificial'] = 1
kaggle_test['is_artificial'] = 1
ch_train['is_artificial'] = 0
    
# Consolidating all data in train and test DataFrames
train = pd.concat([kaggle_train, ch_train], ignore_index=True)
test = kaggle_test.copy(deep=True)
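The consolidation step above can be sanity-checked on toy data. The sketch below uses two small made-up frames as stand-ins for `kaggle_train` and `ch_train` (the values are illustrative, not taken from the datasets) and verifies that concatenating with `ignore_index=True` preserves the row count, yields a continuous index, and keeps the `is_artificial` flag per source.

```python
import pandas as pd

# Toy stand-ins for kaggle_train and ch_train (values are made up)
synthetic = pd.DataFrame({"MedInc": [2.0, 3.5, 4.1], "MedHouseVal": [1.0, 1.5, 2.2]})
original = pd.DataFrame({"MedInc": [8.3, 7.2], "MedHouseVal": [4.5, 3.5]})

# Flag the origin of each row, as in the cell above
synthetic["is_artificial"] = 1
original["is_artificial"] = 0

combined = pd.concat([synthetic, original], ignore_index=True)

# Row count is preserved and the index runs from 0 to n-1 without gaps
assert len(combined) == len(synthetic) + len(original)
assert combined.index.tolist() == list(range(len(combined)))

# Number of rows per source, keyed by the is_artificial flag
source_counts = combined["is_artificial"].value_counts().to_dict()
print(source_counts)
```

The same checks could be applied to the real `train` frame by comparing `len(train)` with `len(kaggle_train) + len(ch_train)`.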

In [7]:
train.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal | is_artificial |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.3859 | 15.0 | 3.82716 | 1.1121 | 1280.0 | 2.486989 | 34.6 | -120.12 | 0.98 | 1 |
| 1 | 3.7188 | 17.0 | 6.013373 | 1.054217 | 1504.0 | 3.813084 | 38.69 | -121.22 | 0.946 | 1 |
| 2 | 4.775 | 27.0 | 6.535604 | 1.103175 | 1061.0 | 2.464602 | 34.71 | -120.45 | 1.576 | 1 |
| 3 | 2.4138 | 16.0 | 3.350203 | 0.965432 | 1255.0 | 2.089286 | 32.66 | -117.09 | 1.336 | 1 |
| 4 | 3.75 | 52.0 | 4.284404 | 1.069246 | 1793.0 | 1.60479 | 37.8 | -122.41 | 4.5 | 1 |


In [8]:
test.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | is_artificial |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.7062 | 35.0 | 4.966368 | 1.096539 | 1318.0 | 2.844411 | 39.75 | -121.85 | 1 |
| 1 | 1.3882 | 22.0 | 4.187035 | 1.098229 | 2296.0 | 3.180218 | 33.95 | -118.29 | 1 |
| 2 | 7.7197 | 21.0 | 7.129436 | 0.959276 | 1535.0 | 2.888889 | 33.61 | -117.81 | 1 |
| 3 | 4.6806 | 49.0 | 4.769697 | 1.048485 | 707.0 | 1.74359 | 34.17 | -118.34 | 1 |
| 4 | 3.1284 | 25.0 | 3.765306 | 1.081633 | 4716.0 | 2.003827 | 34.17 | -118.29 | 1 |


## <span style='color:#20B2AA;font-size:100%'>1.2</span> | Exploratory Data Analysis (EDA)