# Richter's Predictor: Modeling Earthquake Damage

This was a competition hosted by [drivendata.org](https://www.drivendata.org).

## Overview
Based on aspects of building location and construction, the goal is to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal.

The data was collected through surveys by the Central Bureau of Statistics, which works under the National Planning Commission Secretariat of Nepal. This survey is one of the largest post-disaster datasets ever collected, containing valuable information on earthquake impacts, household conditions, and socio-economic-demographic statistics.

## Problem description
We're trying to predict the ordinal variable damage_grade, which represents the level of damage to a building that was hit by the earthquake. There are 3 grades of damage:

 - **1** represents low damage
 - **2** represents a medium amount of damage
 - **3** represents almost complete destruction
 
## Features
The dataset mainly consists of information on the buildings' structure and their legal ownership. Each row in the dataset represents a specific building in the region that was hit by the Gorkha earthquake.

There are 39 columns in this dataset, where the building_id column is a unique and random identifier. The remaining 38 features are described in the section below. Categorical variables have been obfuscated as random lowercase ASCII characters. The appearance of the same character in distinct columns does not imply the same original value.

## Description

- geo_level_1_id, geo_level_2_id, geo_level_3_id (type: int): geographic region in which building exists, from largest (level 1) to most specific sub-region (level 3). Possible values: level 1: 0-30, level 2: 0-1427, level 3: 0-12567.

- count_floors_pre_eq (type: int): number of floors in the building before the earthquake.

- age (type: int): age of the building in years.

- area_percentage (type: int): normalized area of the building footprint.

- height_percentage (type: int): normalized height of the building footprint.

- land_surface_condition (type: categorical): surface condition of the land where the building was built. Possible values: n, o, t.

- foundation_type (type: categorical): type of foundation used while building. Possible values: h, i, r, u, w.

- roof_type (type: categorical): type of roof used while building. Possible values: n, q, x.

- ground_floor_type (type: categorical): type of the ground floor. Possible values: f, m, v, x, z.

- other_floor_type (type: categorical): type of construction used in the floors above the ground floor (except the roof). Possible values: j, q, s, x.

- position (type: categorical): position of the building. Possible values: j, o, s, t.

- plan_configuration (type: categorical): building plan configuration. Possible values: a, c, d, f, m, n, o, q, s, u.

- has_superstructure_adobe_mud (type: binary): flag variable that indicates if the superstructure was made of Adobe/Mud.

- has_superstructure_mud_mortar_stone (type: binary): flag variable that indicates if the superstructure was made of Mud Mortar Stone.

- has_superstructure_stone_flag (type: binary): flag variable that indicates if the superstructure was made of Stone.

- has_superstructure_cement_mortar_stone (type: binary): flag variable that indicates if the superstructure was made of Cement Mortar - Stone.

- has_superstructure_mud_mortar_brick (type: binary): flag variable that indicates if the superstructure was made of Mud Mortar - Brick.

- has_superstructure_cement_mortar_brick (type: binary): flag variable that indicates if the superstructure was made of Cement Mortar - Brick.

- has_superstructure_timber (type: binary): flag variable that indicates if the superstructure was made of Timber.

- has_superstructure_bamboo (type: binary): flag variable that indicates if the superstructure was made of Bamboo.

- has_superstructure_rc_non_engineered (type: binary): flag variable that indicates if the superstructure was made of non-engineered reinforced concrete.

- has_superstructure_rc_engineered (type: binary): flag variable that indicates if the superstructure was made of engineered reinforced concrete.

- has_superstructure_other (type: binary): flag variable that indicates if the superstructure was made of any other material.

- legal_ownership_status (type: categorical): legal ownership status of the land where building was built. Possible values: a, r, v, w.

- count_families (type: int): number of families that live in the building.

- has_secondary_use (type: binary): flag variable that indicates if the building was used for any secondary purpose.

- has_secondary_use_agriculture (type: binary): flag variable that indicates if the building was used for agricultural purposes.

- has_secondary_use_hotel (type: binary): flag variable that indicates if the building was used as a hotel.

- has_secondary_use_rental (type: binary): flag variable that indicates if the building was used for rental purposes.

- has_secondary_use_institution (type: binary): flag variable that indicates if the building was used as a location of any institution.

- has_secondary_use_school (type: binary): flag variable that indicates if the building was used as a school.

- has_secondary_use_industry (type: binary): flag variable that indicates if the building was used for industrial purposes.

- has_secondary_use_health_post (type: binary): flag variable that indicates if the building was used as a health post.

- has_secondary_use_gov_office (type: binary): flag variable that indicates if the building was used as a government office.

- has_secondary_use_use_police (type: binary): flag variable that indicates if the building was used as a police station.

- has_secondary_use_other (type: binary): flag variable that indicates if the building was secondarily used for other purposes.

## Performance metric

We are predicting the level of damage from 1 to 3. The level of damage is an ordinal variable, meaning that ordering is important. This can be viewed as a classification or an ordinal regression problem. (Ordinal regression is sometimes described as a problem somewhere in between classification and regression.)

To measure the performance of our algorithms, we'll use the F1 score, which balances the precision and recall of a classifier. Traditionally, the F1 score is used to evaluate the performance of a binary classifier, but since we have three possible labels we will use a variant called the micro-averaged F1 score. This variant sums the true positives, false positives, and false negatives over all three classes before computing precision and recall.
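As a rough illustration of the metric, the sketch below computes a micro-averaged F1 score by hand in plain Python. The helper name `micro_f1` and the toy labels are made up for this example; the notebook itself relies on scikit-learn's `f1_score(..., average='micro')`.

```python
def micro_f1(y_true, y_pred, labels=(1, 2, 3)):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1          # correctly predicted this class
            elif pred == label and truth != label:
                fp += 1          # predicted this class, but it was another one
            elif truth == label and pred != label:
                fn += 1          # missed an instance of this class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy damage grades (illustrative values only, not taken from the dataset)
y_true = [1, 2, 3, 2, 2, 3]
y_pred = [1, 2, 2, 2, 3, 3]

score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, accuracy)
```

When every building receives exactly one label, as here, each misclassification counts once as a false positive and once as a false negative, so the micro-averaged F1 score coincides with plain accuracy.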


In [1]:
# Importing the necessary modules

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier

# Reading the training data into a dataframe

train_values = pd.read_csv('C:/Users/user/Documents/1. E-Learning/2019 Program/Data Science Machine Learning and Deep Learning/Driven Data/Modeling EarthQuake Damage/train_values.csv')


In [2]:
# Inspecting the first rows of the dataframe
train_values.head()

| | building_id | geo_level_1_id | geo_level_2_id | geo_level_3_id | count_floors_pre_eq | age | area_percentage | height_percentage | land_surface_condition | foundation_type | ... | has_secondary_use_agriculture | has_secondary_use_hotel | has_secondary_use_rental | has_secondary_use_institution | has_secondary_use_school | has_secondary_use_industry | has_secondary_use_health_post | has_secondary_use_gov_office | has_secondary_use_use_police | has_secondary_use_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 802906 | 6 | 487 | 12198 | 2 | 30 | 6 | 5 | t | r | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 28830 | 8 | 900 | 2812 | 2 | 10 | 8 | 7 | o | r | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 94947 | 21 | 363 | 8973 | 2 | 10 | 5 | 5 | t | r | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 590882 | 22 | 418 | 10694 | 2 | 10 | 6 | 5 | t | r | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 201944 | 11 | 131 | 1488 | 3 | 30 | 8 | 9 | t | r | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [3]:
# Reading the training labels into a dataframe

train_labels = pd.read_csv('C:/Users/user/Documents/1. E-Learning/2019 Program/Data Science Machine Learning and Deep Learning/Driven Data/Modeling EarthQuake Damage/train_labels.csv')
train_labels = train_labels['damage_grade']

In [4]:
# Getting basic information about the training features
train_values.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260601 entries, 0 to 260600
Data columns (total 39 columns):
building_id                               260601 non-null int64
geo_level_1_id                            260601 non-null int64
geo_level_2_id                            260601 non-null int64
geo_level_3_id                            260601 non-null int64
count_floors_pre_eq                       260601 non-null int64
age                                       260601 non-null int64
area_percentage                           260601 non-null int64
height_percentage                         260601 non-null int64
land_surface_condition                    260601 non-null object
foundation_type                           260601 non-null object
roof_type                                 260601 non-null object
ground_floor_type                         260601 non-null object
other_floor_type                          260601 non-null object
position                                  260601 non

## Feature Selection

Feature selection is a critical part of machine learning, and there are several ways to select the most relevant features for a prediction. For instance, one could use the **Recursive Feature Elimination (RFE)** algorithm. We could also rank features with a **LightGBM** or an **XGBoost** model, as long as it exposes a feature_importances_ attribute. The **Pearson Correlation**, which is a filter-based method, could help us keep the features that are most strongly correlated with the target variable.
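To make the filter-based idea concrete, here is a minimal sketch of a Pearson correlation filter using pandas. The toy dataframe, the target values, and the choice of keeping two features are all assumptions made for illustration; they are not results from the competition data. Pearson correlation against an ordinal target is only a rough heuristic.

```python
import pandas as pd

# Toy features (made-up values) reusing a few column names from the dataset
toy = pd.DataFrame({
    "has_superstructure_mud_mortar_stone": [1, 1, 1, 0, 0, 1, 0, 1],
    "has_superstructure_rc_engineered":    [0, 0, 0, 1, 1, 0, 0, 0],
    "count_families":                      [1, 2, 1, 1, 2, 1, 1, 2],
})
# Toy target (made-up damage grades)
toy_target = pd.Series([3, 3, 2, 1, 1, 3, 2, 2], name="damage_grade")

# Absolute Pearson correlation of every feature with the target,
# sorted from the strongest to the weakest relationship
correlations = toy.corrwith(toy_target).abs().sort_values(ascending=False)

# Keep the k features with the strongest linear relationship to the target
top_k = correlations.head(2).index.tolist()
print(correlations)
print(top_k)
```

Taking the absolute value matters here: a strongly negative correlation is as informative as a strongly positive one.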

However, we have decided to keep certain specific columns as features. This selection was based on a general understanding of the subject, done after some research. Nevertheless, it's important to keep in mind that this selection could be improved with experimentation. 

In [5]:
# Selecting the features
features = ['geo_level_1_id', 'geo_level_2_id', 'geo_level_3_id', 'foundation_type', 'position', 'plan_configuration', 'has_superstructure_adobe_mud', 'has_superstructure_mud_mortar_stone', 'has_superstructure_stone_flag', 'has_superstructure_cement_mortar_stone', 'has_superstructure_mud_mortar_brick', 'has_superstructure_cement_mortar_brick', 'has_superstructure_timber', 'has_superstructure_bamboo', 'has_superstructure_rc_engineered', 'has_superstructure_rc_non_engineered', 'has_superstructure_other']
train_values = train_values[features]

train_values.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260601 entries, 0 to 260600
Data columns (total 17 columns):
geo_level_1_id                            260601 non-null int64
geo_level_2_id                            260601 non-null int64
geo_level_3_id                            260601 non-null int64
foundation_type                           260601 non-null object
position                                  260601 non-null object
plan_configuration                        260601 non-null object
has_superstructure_adobe_mud              260601 non-null int64
has_superstructure_mud_mortar_stone       260601 non-null int64
has_superstructure_stone_flag             260601 non-null int64
has_superstructure_cement_mortar_stone    260601 non-null int64
has_superstructure_mud_mortar_brick       260601 non-null int64
has_superstructure_cement_mortar_brick    260601 non-null int64
has_superstructure_timber                 260601 non-null int64
has_superstructure_bamboo                 260601 non-n

In [6]:
# Print summary statistics 
train_values.describe()

| | geo_level_1_id | geo_level_2_id | geo_level_3_id | has_superstructure_adobe_mud | has_superstructure_mud_mortar_stone | has_superstructure_stone_flag | has_superstructure_cement_mortar_stone | has_superstructure_mud_mortar_brick | has_superstructure_cement_mortar_brick | has_superstructure_timber | has_superstructure_bamboo | has_superstructure_rc_engineered | has_superstructure_rc_non_engineered | has_superstructure_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 | 260601.0 |
| mean | 13.900353 | 701.074685 | 6257.876148 | 0.088645 | 0.761935 | 0.034332 | 0.018235 | 0.068154 | 0.075268 | 0.254988 | 0.085011 | 0.015859 | 0.04259 | 0.014985 |
| std | 8.033617 | 412.710734 | 3646.369645 | 0.284231 | 0.4259 | 0.182081 | 0.1338 | 0.25201 | 0.263824 | 0.435855 | 0.278899 | 0.124932 | 0.201931 | 0.121491 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 7.0 | 350.0 | 3073.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 12.0 | 702.0 | 6270.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 21.0 | 1050.0 | 9412.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 30.0 | 1427.0 | 12567.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## Data Preprocessing

In [7]:
# Getting the categorical variables
categorical_values = train_values.select_dtypes(include=['object'])
print("Categorical Variables")
print("-------------------------------------------------")
categorical_values.columns

Categorical Variables
-------------------------------------------------


Index(['foundation_type', 'position', 'plan_configuration'], dtype='object')

In [8]:
# Getting the numeric variables
numeric_values = train_values.select_dtypes(include=['int64'])
print("Numeric Variables")
print("-------------------------------------------------")
numeric_values.columns

Numeric Variables
-------------------------------------------------


Index(['geo_level_1_id', 'geo_level_2_id', 'geo_level_3_id',
       'has_superstructure_adobe_mud', 'has_superstructure_mud_mortar_stone',
       'has_superstructure_stone_flag',
       'has_superstructure_cement_mortar_stone',
       'has_superstructure_mud_mortar_brick',
       'has_superstructure_cement_mortar_brick', 'has_superstructure_timber',
       'has_superstructure_bamboo', 'has_superstructure_rc_engineered',
       'has_superstructure_rc_non_engineered', 'has_superstructure_other'],
      dtype='object')

In [9]:
# Encoding the categorical variables

categorical_encoded = pd.get_dummies(categorical_values, drop_first=True)

In [10]:
# Checking the types of the variables after encoding
print("Feature                Data_Type")
print("---------------------------------")
categorical_encoded.dtypes

Feature                Data_Type
---------------------------------


foundation_type_i       uint8
foundation_type_r       uint8
foundation_type_u       uint8
foundation_type_w       uint8
position_o              uint8
position_s              uint8
position_t              uint8
plan_configuration_c    uint8
plan_configuration_d    uint8
plan_configuration_f    uint8
plan_configuration_m    uint8
plan_configuration_n    uint8
plan_configuration_o    uint8
plan_configuration_q    uint8
plan_configuration_s    uint8
plan_configuration_u    uint8
dtype: object

## Model Training and Prediction

In [11]:
# Getting the final dataset, by concatenating the numeric and encoded categorical variables
new_data = pd.concat([numeric_values, categorical_encoded], axis=1)

In [12]:
# Splitting the data into training and test set
X_train, X_test, y_train, y_test = train_test_split(new_data, train_labels, test_size=0.2, stratify=train_labels)
print("Splitting completed!")

Splitting completed!


In [13]:
# Instantiating an xgb classifier
xg_cl = xgb.XGBClassifier(objective='multi:softmax', num_class=3, n_estimators=100, max_depth=10)
# Fitting the model
xg_cl.fit(X_train, y_train)
# Getting the micro average f1_score 
print("The micro average f1_score for the xgb classifier is:")
f1_score(y_test, xg_cl.predict(X_test), average='micro')

The micro average f1_score for the xgb classifier is:


0.7287465704802287

In [14]:
# Instantiating a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=30)
# Fitting the model
rf.fit(X_train, y_train)
# Getting the micro average f1_score
print("The micro average f1_score for the Random Forest Classifier is:")
f1_score(y_test, rf.predict(X_test), average='micro')

The micro average f1_score for the Random Forest Classifier is:


0.7245831814431803

In [15]:
# Instantiating an AdaBoost Classifier
ada = AdaBoostClassifier(n_estimators=500)
# Fitting the model
ada.fit(X_train, y_train)
# Getting the micro average f1_score
print("The micro average f1_score for the adaboost classifier is:")
f1_score(y_test, ada.predict(X_test), average='micro')

The micro average f1_score for the adaboost classifier is:


0.6669480631607222

## Combining the previous models

We will use a Voting Classifier to combine the three models we've created above. We will set the parameter voting='soft' so that the ensemble averages the class probabilities predicted by each model and picks the class with the highest mean probability.
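The averaging step behind soft voting can be sketched with NumPy as follows. The probability arrays are invented for two hypothetical buildings and do not come from the fitted models; the sketch also assumes equal weights for the three models.

```python
import numpy as np

# Made-up class probabilities from three models for two buildings.
# Columns correspond to damage grades 1, 2 and 3.
proba_xgb = np.array([[0.2, 0.5, 0.3], [0.1, 0.3, 0.6]])
proba_rf = np.array([[0.5, 0.4, 0.1], [0.2, 0.5, 0.3]])
proba_ada = np.array([[0.3, 0.4, 0.3], [0.3, 0.3, 0.4]])

classes = np.array([1, 2, 3])

# Soft voting: average the probabilities model by model ...
mean_proba = np.mean([proba_xgb, proba_rf, proba_ada], axis=0)

# ... then pick, for each building, the class with the highest mean probability
soft_vote = classes[np.argmax(mean_proba, axis=1)]
print(soft_vote)
```

In this toy case the random forest alone would assign grade 1 to the first building, but the averaged probabilities favour grade 2.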

In [16]:
# Instantiating the Voting Classifier
clf_voting = VotingClassifier(
                    estimators=[
                                ('xgboost', xg_cl),
                                ('random forest', rf),
                                ('adaboost', ada)],
                     voting='soft')
print("Step completed!")

Step completed!


In [17]:
# Fitting the Voting Classifier
clf_voting.fit(X_train, y_train)
# Getting the micro average f1_score
print("The micro average f1_score for the Voting classifier is:")
f1_score(y_test, clf_voting.predict(X_test), average='micro')

The micro average f1_score for the Voting classifier is:


0.7364593925672953

When we combine the previous models, we see that we obtain a better micro average f1_score. Therefore we'll use the voting classifier as the final model.

## Preparing data for submission

In [24]:
def preprocessing(df):
    """This function takes a dataframe. Then, it selects the appropriate features, encodes the categorical variables, and returns
    a dataframe usable for the model prediction"""
    X = df[features]
    categorical_values = X.select_dtypes(include=['object'])
    numeric_values = X.select_dtypes(include=['int64'])
    categorical_encoded = pd.get_dummies(categorical_values, drop_first=True)
    new_data = pd.concat([numeric_values, categorical_encoded], axis=1)
    return new_data
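One caveat with calling `pd.get_dummies` separately on the training and test data is that the resulting columns only match if both sets contain the same category levels. A possible safeguard, sketched below, is to declare the levels up front with `pd.Categorical`. The helper name `encode_with_fixed_levels` is hypothetical, and the level list is the one given for position in the feature description above.

```python
import pandas as pd


def encode_with_fixed_levels(df, levels):
    """One-hot encode the given columns using a fixed list of levels per column,
    so the output columns do not depend on which levels appear in df."""
    fixed = df.copy()
    for column, categories in levels.items():
        fixed[column] = pd.Categorical(fixed[column], categories=categories)
    # With a categorical dtype, get_dummies creates a column for every declared
    # level, and drop_first always drops the same (first) level.
    return pd.get_dummies(fixed, columns=list(levels), drop_first=True)


# Levels for 'position' as listed in the feature description
levels = {"position": ["j", "o", "s", "t"]}

# Toy test set in which the levels 'j' and 't' never occur
toy_test = pd.DataFrame({"position": ["o", "s"]})

encoded = encode_with_fixed_levels(toy_test, levels)
print(encoded.columns.tolist())
```

The encoded toy frame keeps the columns position_o, position_s and position_t even though two of the four levels are absent from the input.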

In [25]:
# Getting the test values
test_values = pd.read_csv('C:/Users/user/Documents/1. E-Learning/2019 Program/Data Science Machine Learning and Deep Learning/Driven Data/Modeling EarthQuake Damage/test_values.csv')

In [26]:
# Preprocessing the test values
X = preprocessing(test_values)

In [27]:
# Predicting the target variable
predictions = clf_voting.predict(X)
print("Done")

Done


In [28]:
print("{} predictions made".format(len(predictions)))

86868 predictions made


In [23]:
# Getting the submission into the right format
submissions = test_values.copy()
submissions['damage_grade'] = pd.Series(predictions)
submissions[['building_id', 'damage_grade']].to_csv('damage_predictions_2.csv', index=False)
print("File created!")