# Midterm

## Problem

You are given a dataset describing building damage after the 2015 earthquake in Nepal. A detailed description of the dataset is provided below. The code cells that follow handle some of the data preprocessing for you, so please run them.

Your task is to train a neural network to predict the damage grade of each building. This is a three-class classification problem with numeric labels 1-3.

The task is open-ended, but your goal, in the spirit of online machine learning contests, is to maximize a given metric. Here that metric is the F1 score with micro averaging. See https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html for details and a convenient implementation, which you should use.
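As a minimal sketch of how the metric is computed, the example below calls `f1_score` with `average="micro"` on a handful of made-up labels (the arrays are illustrative, not taken from the dataset). For single-label multiclass problems like this one, micro-averaged F1 equals plain accuracy, which the example also checks.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels in the same 1-3 range as the damage grades.
y_true = np.array([1, 2, 2, 3, 3, 3, 2, 1])
y_pred = np.array([1, 2, 3, 3, 3, 2, 2, 2])

# Micro averaging pools true positives, false positives and false
# negatives over all classes before computing F1.
micro_f1 = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

print(f"micro F1 = {micro_f1:.3f}, accuracy = {accuracy:.3f}")
```

Five of the eight toy predictions are correct, so both numbers come out to 0.625.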

In practice you would split this dataset into three sections: training, validation, and test. For this assignment, split the data into 80% training and 20% validation instead. During training, monitor the validation loss, then compute the F1 metric on the validation data. That score is the final score you report.
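One way to make the 80/20 split is `train_test_split` from scikit-learn. The sketch below uses random stand-in arrays rather than the real data. The `random_state` and `stratify` arguments are assumptions added here for reproducibility and to keep class proportions similar in both parts; the assignment does not require them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 5))           # stand-in for the preprocessed features
y = rng.integers(1, 4, size=1000)   # stand-in for damage grades 1-3

# test_size=0.2 holds out 20% of the rows as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_val.shape)
```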

The baseline F1 score for this dataset is 0.58, upon which you should be able to improve with the neural net. 


### Data Description

Following the 7.8 Mw Gorkha Earthquake on April 25, 2015, Nepal carried out a massive household survey using mobile technology to assess building damage in the earthquake-affected districts. Although the primary goal of this survey was to identify beneficiaries eligible for government assistance for housing reconstruction, it also collected other useful socio-economic information. In addition to housing reconstruction, this data serves a wide range of uses and users e.g. researchers, newly formed local governments, and citizens at large. 

### Labels

Three numeric labels represent damage grade:

1) Low Damage

2) Moderate Damage

3) Complete Destruction
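
Because the labels run from 1 to 3, they usually need to be shifted or encoded before they are fed to a network with a three-unit softmax output. The sketch below uses made-up labels and shows one possible encoding: zero-based class indices, their one-hot form, and the way back to the original grades. Which form you need depends on the loss function you choose.

```python
import numpy as np

# Made-up example labels in the original 1-3 encoding.
damage_grade = np.array([1, 3, 2, 2, 1])

# Zero-based class indices (0, 1, 2), as expected by sparse categorical losses.
class_index = damage_grade - 1

# One-hot rows, as expected by a (non-sparse) categorical cross-entropy loss.
one_hot = np.eye(3, dtype=np.float32)[class_index]

# Mapping per-class scores back to the original 1-3 grades.
recovered = one_hot.argmax(axis=1) + 1

print(one_hot)
print(recovered)
```

Remember to apply the same `+ 1` shift to the network's predictions before computing the F1 score against the original labels.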


### Features
geo_level_1_id: High level geographic location

geo_level_2_id: Mid level geographic location

geo_level_3_id: Low level geographic location

count_floors_pre_eq (type: int): number of floors in the building before the earthquake.

age (type: int): age of the building in years.

area_percentage (type: int): normalized area of the building footprint.

height_percentage (type: int): normalized height of the building footprint.

land_surface_condition (type: categorical): surface condition of the land where the building was built. Possible values: n, o, t.

foundation_type (type: categorical): type of foundation used while building. Possible values: h, i, r, u, w.

roof_type (type: categorical): type of roof used while building. Possible values: n, q, x.

ground_floor_type (type: categorical): type of the ground floor. Possible values: f, m, v, x, z.

other_floor_type (type: categorical): type of construction used in floors above the ground floor (excluding the roof). Possible values: j, q, s, x.

position (type: categorical): position of the building. Possible values: j, o, s, t.

plan_configuration (type: categorical): building plan configuration. Possible values: a, c, d, f, m, n, o, q, s, u.

has_superstructure_adobe_mud (type: binary): flag variable that indicates if the superstructure was made of Adobe/Mud.

has_superstructure_mud_mortar_stone (type: binary): flag variable that indicates if the superstructure was made of Mud Mortar - Stone.

has_superstructure_stone_flag (type: binary): flag variable that indicates if the superstructure was made of Stone.

has_superstructure_cement_mortar_stone (type: binary): flag variable that indicates if the superstructure was made of Cement Mortar - Stone.

has_superstructure_mud_mortar_brick (type: binary): flag variable that indicates if the superstructure was made of Mud Mortar - Brick.

has_superstructure_cement_mortar_brick (type: binary): flag variable that indicates if the superstructure was made of Cement Mortar - Brick.

has_superstructure_timber (type: binary): flag variable that indicates if the superstructure was made of Timber.

has_superstructure_bamboo (type: binary): flag variable that indicates if the superstructure was made of Bamboo.

has_superstructure_rc_non_engineered (type: binary): flag variable that indicates if the superstructure was made of non-engineered reinforced concrete.

has_superstructure_rc_engineered (type: binary): flag variable that indicates if the superstructure was made of engineered reinforced concrete.

has_superstructure_other (type: binary): flag variable that indicates if the superstructure was made of any other material.


count_families (type: int): number of families that live in the building.

## Deliverables

You should submit the complete code (with cell outputs) showing your train/validation split, any other preprocessing of the training data or labels, your neural network, plots of the training and validation loss, the F1 score on the validation data, and the confusion matrix.
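
For the confusion matrix, a minimal sketch is shown below. The probability array stands in for the softmax output of a trained network on the validation set; all values are invented for illustration. Rows of the resulting matrix are true grades and columns are predicted grades.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up validation labels (grades 1-3).
y_val = np.array([1, 2, 2, 3, 3, 3, 2, 1])

# Hypothetical softmax outputs: one row per building, one column per grade.
val_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.1, 0.2, 0.7],
    [0.0, 0.4, 0.6],
    [0.1, 0.6, 0.3],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
])

# argmax gives class indices 0-2; add 1 to return to grades 1-3.
y_pred = val_probs.argmax(axis=1) + 1

cm = confusion_matrix(y_val, y_pred, labels=[1, 2, 3])
print(cm)
```

The matrix can then be drawn with the already imported seaborn, for example `sn.heatmap(cm, annot=True, fmt="d")`.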

In addition to the complete code, in a text cell or in a separate document, write up your methodology, your results, and a discussion of those results. Also discuss the specific choices (e.g. loss function, optimizer) you made along the way.

### Note

This is a real-world dataset, so you will likely not be able to achieve an F1 score above 0.75.

In [3]:
import numpy as np
import matplotlib.pyplot as plt

import pandas as pd

import keras
import keras.backend as K
import tensorflow as tf

#Some metrics and utilities
from sklearn.metrics import confusion_matrix
from sklearn import datasets
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

import seaborn as sn
import re

from IPython.display import clear_output
clear_output()

### Load the Data

In [4]:
training_data = pd.read_csv('train_values.csv')
training_labels = pd.read_csv('train_labels.csv')

### Data Preprocessing

In [5]:
#list to aggregate the columns that have a secondary use
remove_cols = []

#Identify all the columns with a secondary use
for item in training_data.columns:
    if re.findall("has_secondary_use",item):
        remove_cols.append(item)
        
#Remove secondary use fields and other fields that are not useful
training_data = training_data.drop(remove_cols, axis = 1)
training_data = training_data.drop(['legal_ownership_status', 'building_id'], axis = 1)
training_labels = training_labels.drop('building_id', axis = 1)

#Change string classes to numeric values
training_data['land_surface_condition'] = training_data['land_surface_condition'].map({'t':1, 'o':2, 'n':3})
training_data['foundation_type'] = training_data['foundation_type'].map({'r': 1, 'w': 2, 'i':3, 'u':4, 'h':5})
training_data['roof_type'] = training_data['roof_type'].map({'n':1, 'q':2, 'x':3})
training_data['ground_floor_type'] = training_data['ground_floor_type'].map({'f':1, 'x':2, 'v':3, 'z':4, 'm':5})
training_data['other_floor_type'] = training_data['other_floor_type'].map({'q':1, 'x':2, 'j':3, 's':4})
training_data['position'] = training_data['position'].map({'t':1, 's':2, 'j':3, 'o':4})
training_data['plan_configuration'] = training_data['plan_configuration'].map({'d':1, 'u':2, 's':3, 'q':4, 'm':5, 'c':6, 'a':7, 'n':8, 'f':9, 'o':10})

def normalize_zero_one(array):
    """Linearly rescale the values so the minimum maps to 0 and the maximum to 1."""
    minimum = np.min(array)
    maximum = np.max(array)
    return (array-minimum)/(maximum-minimum)

#Clip age and family counts because they have strong outliers
training_data['age'] = np.clip(training_data['age'], 0, 100)
training_data['count_families'] = np.clip(training_data['count_families'], 0, 6)

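#Columns to rescale to the interval 0-1. Note that ground_floor_type, other_floor_type,
#position and plan_configuration are not in this list, so they keep their integer codes.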
norm_cats = ['age','count_families', 'geo_level_1_id', 'geo_level_2_id', 'geo_level_3_id', 'area_percentage', 
              'count_floors_pre_eq', 'height_percentage', 'land_surface_condition', 'roof_type', 'foundation_type']

#normalize categories to interval 0-1
for column in norm_cats:
    training_data[column] = normalize_zero_one(training_data[column])
    
#check that everything has been done correctly
training_data.describe()

| | geo_level_1_id | geo_level_2_id | geo_level_3_id | count_floors_pre_eq | age | area_percentage | height_percentage | land_surface_condition | foundation_type | roof_type | ... | has_superstructure_stone_flag | has_superstructure_cement_mortar_stone | has_superstructure_mud_mortar_brick | has_superstructure_cement_mortar_brick | has_superstructure_timber | has_superstructure_bamboo | has_superstructure_rc_non_engineered | has_superstructure_rc_engineered | has_superstructure_other | count_families |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | ... | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 |
| mean | 0.463345 | 0.491293 | 0.497961 | 0.141215 | 0.216562 | 0.070889 | 0.114479 | 0.152286 | 0.081396 | 0.180241 | ... | 0.034332 | 0.018235 | 0.068154 | 0.075268 | 0.254988 | 0.085011 | 0.04259 | 0.015859 | 0.014985 | 0.163977 |
| std | 0.267787 | 0.289216 | 0.290154 | 0.090958 | 0.198003 | 0.044366 | 0.063947 | 0.34802 | 0.208532 | 0.297798 | ... | 0.182081 | 0.1338 | 0.25201 | 0.263824 | 0.435855 | 0.278899 | 0.201931 | 0.124932 | 0.121491 | 0.069516 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.233333 | 0.24527 | 0.244529 | 0.125 | 0.1 | 0.040404 | 0.066667 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.166667 |
| 50% | 0.4 | 0.491941 | 0.498926 | 0.125 | 0.15 | 0.060606 | 0.1 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.166667 |
| 75% | 0.7 | 0.735809 | 0.748946 | 0.125 | 0.3 | 0.080808 | 0.133333 | 0.0 | 0.0 | 0.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.166667 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [None]:
#Callbacks: pass these to model.fit via callbacks=callbacks to help with training.
#See the Keras callbacks documentation for more details.

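#Save only the weights of the model with the lowest validation loss seen so far.
#Keras 3 requires the filename to end in .weights.h5 when save_weights_only=True.
model_checkpoint = keras.callbacks.ModelCheckpoint(
    filepath='Best_model.weights.h5', monitor='val_loss', verbose=1, save_best_only=True,
    save_weights_only=True, mode='auto')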

early_stopping = keras.callbacks.EarlyStopping(patience=100, verbose = 1)

callbacks = [early_stopping,
            model_checkpoint]