# VAST Challenge 2019: Disaster at St. Himark!

St. Himark is a beautiful community located on the Oceanus Sea. It is a small community with almost everything it needs to sustain a spirited civilization. St. Himark is primarily powered by the Always Safe Nuclear Power Plant. This was true until the disaster struck. Now Mayor Jordan, city officials, and emergency services are overwhelmed and desperate for assistance in understanding the true situation on the ground and how best to deploy the limited resources available to this relatively small community.

## Mini-Challenge 1

In a prescient move of community engagement, the city had released a new damage reporting mobile application, 'RUMBLE', that allows citizens to report damage they see in their neighborhood.
The challenge is to use app responses in conjunction with shake maps of the earthquake strength to identify areas of concern and advise emergency planners on how to respond to damage more efficiently.

### Data Description
The data for MC1 is provided in the CSV file 'mc1-reports-data.csv', which spans the entire length of the event. It consists of categorical reports of shaking and damage in each neighborhood over time.

#### Data Fields:
* Time: Timestamp of incoming reports. Format: YYYY-MM-DD hh:mm:ss
* Location: Neighborhood id, 1 through 19, identifying the neighborhood of St. Himark where damage is reported.
* Shake_intensity, Sewer&Water, Power, Roads&Bridges, Medical, Buildings: Reported damage extent. 0 -> lowest, 10 -> highest

_Missing data is allowed_
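
As a quick sanity check of the field description above, the sketch below flags values that are present but fall outside the documented ranges (0 to 10 for damage fields, 1 to 19 for location) while tolerating missing values. The toy frame and the helper name `out_of_range` are illustrative assumptions, not part of the challenge data; only the column names follow the CSV header.

```python
import numpy as np
import pandas as pd

# Toy reports (hypothetical values) using column names from the CSV header
toy = pd.DataFrame({
    "location": [1, 7, 19, 20],
    "power": [0.0, 10.0, np.nan, 11.0],
    "shake_intensity": [np.nan, 3.0, 9.0, 2.0],
})

def out_of_range(frame, column, low, high):
    """Return rows whose value in `column` is present but outside [low, high]."""
    values = frame[column]
    # NaN is allowed by the data description, so only flag non-missing values
    return frame[values.notna() & ~values.between(low, high)]

bad_power = out_of_range(toy, "power", 0, 10)
bad_location = out_of_range(toy, "location", 1, 19)
print(bad_power.index.tolist(), bad_location.index.tolist())
```

Running the same helper on the real data frame would be a cheap way to confirm the documented ranges before cleaning.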

### Data Cleaning
Since the given CSV file has missing values and other irregularities, the data must be cleaned to ease visualization and analysis. Python pandas to the rescue! Using pandas, the data is restructured for visualization and analysis.


In [1]:
# Import all necessary packages

import numpy as np
import pandas as pd
import math

print("Import Success!")

Import Success!


Once the required packages are imported successfully, the next step is to read the CSV file. This task is simplified thanks to pandas: **pd.read_csv** reads the CSV file into a pandas DataFrame.

The **df.head()** function is used to see the first 5 entries in the data frame, making sure that the data was read correctly. 

In [2]:
# Reading the RAW data file
df = pd.read_csv("mc1-reports-data.csv")
display(df.head())
display(df.describe())

Unnamed: 0,time,sewer_and_water,power,roads_and_bridges,medical,buildings,shake_intensity,location
0,2020-04-08 17:50:00,10.0,6.0,10.0,3.0,8.0,,1
1,2020-04-09 13:50:00,2.0,10.0,0.0,8.0,4.0,0.0,1
2,2020-04-09 00:20:00,7.0,10.0,10.0,9.0,10.0,0.0,1
3,2020-04-08 17:25:00,1.0,1.0,2.0,10.0,7.0,,1
4,2020-04-08 02:50:00,9.0,7.0,1.0,6.0,9.0,,1


Unnamed: 0,sewer_and_water,power,roads_and_bridges,medical,buildings,shake_intensity,location
count,82899.0,83070.0,83070.0,35629.0,82900.0,70926.0,83070.0
mean,5.649139,6.045371,5.743289,5.322687,4.744005,2.682641,8.978488
std,2.787791,2.851951,2.506399,2.527679,2.256358,1.935366,5.123608
min,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,3.0,4.0,4.0,4.0,3.0,1.0,4.0
50%,6.0,7.0,6.0,6.0,5.0,2.0,8.0
75%,8.0,8.0,7.0,7.0,6.0,4.0,14.0
max,10.0,10.0,10.0,10.0,10.0,9.0,19.0


The **df.describe()** function produces a preliminary summary of the numeric data in the CSV file, which can be used for comparison in later stages. Note that the counts of the data fields are not the same. Ideally, every entry would have a value for every field, making all the counts equal. Unequal counts indicate missing values. This can also be seen in the **df.head()** output, where some values are 'NaN'.
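
The link between unequal counts and missing values can be illustrated on a small frame. This is a minimal sketch with made-up values; only the column names are taken from the data set.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two missing 'medical' entries
toy = pd.DataFrame({
    "power": [6.0, 10.0, 1.0],
    "medical": [3.0, np.nan, np.nan],
})

# count() skips NaN, so it falls short of the row count when values are missing
non_missing = toy.count()
missing = len(toy) - non_missing
print(missing)
```

The difference between the row count and `count()` matches what `isnull().sum()` reports per column.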

In [3]:
print("Total Rows and Columns in the data Frame: ", df.shape)

print("\nTotal number of missing Values")
df.isnull().sum()


Total Rows and Columns in the data Frame:  (83070, 8)

Total number of missing Values


time                     0
sewer_and_water        171
power                    0
roads_and_bridges        0
medical              47441
buildings              170
shake_intensity      12144
location                 0
dtype: int64

Out of the 83070 entries, the shake_intensity field has 12144 missing values. The medical field is missing 47441 values, while sewer_and_water and buildings are missing 171 and 170 respectively; the remaining fields are complete.

Let us start off by replacing all the 'NaN' values with zero.

In [4]:
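# Collect all column titles into a Python list
cols = df.columns.tolist()

# Replace 'NaN' in every column with zero
for c in cols:
    # Assign the result back; an in-place fillna on a selected column may not update df in newer pandas
    df[c] = df[c].fillna(0.0)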
    
display(df.head())
print("\nTotal number of missing Values")
display(df.isnull().sum())

Unnamed: 0,time,sewer_and_water,power,roads_and_bridges,medical,buildings,shake_intensity,location
0,2020-04-08 17:50:00,10.0,6.0,10.0,3.0,8.0,0.0,1
1,2020-04-09 13:50:00,2.0,10.0,0.0,8.0,4.0,0.0,1
2,2020-04-09 00:20:00,7.0,10.0,10.0,9.0,10.0,0.0,1
3,2020-04-08 17:25:00,1.0,1.0,2.0,10.0,7.0,0.0,1
4,2020-04-08 02:50:00,9.0,7.0,1.0,6.0,9.0,0.0,1



Total number of missing Values


time                 0
sewer_and_water      0
power                0
roads_and_bridges    0
medical              0
buildings            0
shake_intensity      0
location             0
dtype: int64

In the above cell, it can be observed that all the 'NaN' or the missing values have been replaced by zero.
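
One consequence of this step is worth keeping in mind: after the fill, a missing report is indistinguishable from a report of zero damage, which pulls averages down. The sketch below shows the effect on a toy series; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy 'medical' reports (hypothetical values): two present, two missing
medical = pd.Series([6.0, np.nan, 8.0, np.nan], name="medical")

mean_ignoring_missing = medical.mean()       # NaN is skipped by default
mean_after_fill = medical.fillna(0).mean()   # NaN is counted as 0

print(mean_ignoring_missing, mean_after_fill)
```

Whether zero-filling is appropriate depends on the analysis; for heavily incomplete fields such as medical, keeping 'NaN' may give more representative averages.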

Next, we will split the 'time' field, which holds both the date and the time together, into separate fields.

In [5]:
# Split 'time' into 'Date' and 'Time' 
df['Date'] = pd.to_datetime(df['time']).dt.date
df['Time'] = pd.to_datetime(df['time']).dt.time

#display(df.head(5))

# Drop the old 'time' column
df.drop(columns = ['time'], inplace=True)

#display(df.head(5))
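
The cell above calls `pd.to_datetime` twice on the same column. A possible variant, sketched here on a toy frame with made-up rows, parses the strings once using the format from the data description and derives both fields from the result.

```python
import pandas as pd

# Toy frame (hypothetical rows) in the documented YYYY-MM-DD hh:mm:ss format
toy = pd.DataFrame({"time": ["2020-04-08 17:50:00", "2020-04-09 13:50:00"]})

# Parse the strings a single time, with an explicit format
stamp = pd.to_datetime(toy["time"], format="%Y-%m-%d %H:%M:%S")
toy["Date"] = stamp.dt.date
toy["Time"] = stamp.dt.time

# Drop the original combined column
toy = toy.drop(columns=["time"])
print(toy)
```

On 83070 rows the saving is small, but an explicit format also makes malformed timestamps fail loudly instead of being guessed.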



In [6]:
# Rearrange the columns

# List of all remaining columns
cols = df.columns.tolist()
# Move the Last 2 columns to first
cols = cols[-2:] + cols[:-2]
df = df[cols]

display(df.head())
display(df.tail())

Unnamed: 0,Date,Time,sewer_and_water,power,roads_and_bridges,medical,buildings,shake_intensity,location
0,2020-04-08,17:50:00,10.0,6.0,10.0,3.0,8.0,0.0,1
1,2020-04-09,13:50:00,2.0,10.0,0.0,8.0,4.0,0.0,1
2,2020-04-09,00:20:00,7.0,10.0,10.0,9.0,10.0,0.0,1
3,2020-04-08,17:25:00,1.0,1.0,2.0,10.0,7.0,0.0,1
4,2020-04-08,02:50:00,9.0,7.0,1.0,6.0,9.0,0.0,1


Unnamed: 0,Date,Time,sewer_and_water,power,roads_and_bridges,medical,buildings,shake_intensity,location
83065,2020-04-10,02:30:00,9.0,10.0,10.0,0.0,7.0,2.0,8
83066,2020-04-10,02:30:00,8.0,10.0,10.0,0.0,7.0,1.0,8
83067,2020-04-09,16:45:00,10.0,9.0,10.0,0.0,8.0,1.0,8
83068,2020-04-09,16:55:00,8.0,8.0,9.0,0.0,7.0,0.0,8
83069,2020-04-10,02:30:00,9.0,10.0,10.0,0.0,6.0,-0.0,8


Now that the data is cleaned to a satisfactory degree, we will save it for further use.

In [7]:
df['Time'].head()

0    17:50:00
1    13:50:00
2    00:20:00
3    17:25:00
4    02:50:00
Name: Time, dtype: object

In [8]:
#Saving cleaned data set
df.to_csv("MC1_Clean.csv")
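
By default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. If the index is not needed downstream, passing `index=False` avoids it. The sketch below uses an in-memory buffer and a toy frame (both assumptions for illustration) so that no file is written.

```python
import io
import pandas as pd

toy = pd.DataFrame({"power": [6.0, 10.0], "location": [1, 1]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```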


# Extra code used for convenience

In [9]:
#Finding out all the different values in shake intensity column
sh_int_uniq = df[cols[7]].unique()

print( sorted(sh_int_uniq))

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]


In [10]:
print(cols[8])
unique_loc = df[cols[8]].unique()
print(unique_loc)

location
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]


In [11]:
loc7 = (df['location']==7)
# Combine both conditions into one mask to avoid a boolean reindexing warning
loc7_abs = df[loc7 & (df['medical'] != 0)]
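
Filtering on two conditions at once can be done with a single combined boolean mask. The following is a minimal sketch on a toy frame with made-up values; the mask names are illustrative.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "location": [7, 7, 4, 7],
    "medical": [0.0, 6.0, 8.0, 0.0],
})

at_loc7 = toy["location"] == 7
has_medical = toy["medical"] != 0

# Named masks (or parentheses) are needed because & binds tighter than == and !=
loc7_with_medical = toy[at_loc7 & has_medical]
print(loc7_with_medical)
```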

  


In [12]:
loc7_abs.head()

Unnamed: 0,Date,Time,sewer_and_water,power,roads_and_bridges,medical,buildings,shake_intensity,location
57351,2020-04-08,13:15:00,0.0,8.0,7.0,6.0,0.0,-0.0,7


In [13]:
df[loc7].head()

Unnamed: 0,Date,Time,sewer_and_water,power,roads_and_bridges,medical,buildings,shake_intensity,location
4295,2020-04-08,12:25:00,0.0,1.0,3.0,0.0,0.0,0.0,7
4296,2020-04-10,20:55:00,0.0,10.0,0.0,0.0,0.0,0.0,7
4297,2020-04-06,07:00:00,0.0,4.0,9.0,0.0,0.0,0.0,7
4298,2020-04-07,10:35:00,0.0,0.0,8.0,0.0,0.0,0.0,7
4299,2020-04-06,13:20:00,0.0,4.0,8.0,0.0,0.0,0.0,7


In [14]:
df[loc7].mean()

sewer_and_water      0.098266
power                7.838150
roads_and_bridges    5.919075
medical              0.034682
buildings            0.098266
shake_intensity      4.421965
location             7.000000
dtype: float64

In [15]:
loc1 = (df['location']==1)
df[loc1].describe()

Unnamed: 0,sewer_and_water,power,roads_and_bridges,medical,buildings,shake_intensity,location
count,1662.0,1662.0,1662.0,1662.0,1662.0,1662.0,1662.0
mean,4.740072,4.759326,4.867629,4.974729,4.904934,0.329723,1.0
std,3.159965,3.833197,2.02912,2.520827,3.251358,0.600678,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,2.0,1.0,4.0,3.0,2.0,-0.0,1.0
50%,3.0,3.0,5.0,4.0,4.0,0.0,1.0
75%,8.0,9.0,6.0,7.0,8.0,1.0,1.0
max,10.0,10.0,10.0,10.0,10.0,3.0,1.0


In [16]:
loc4 = (df['location']==4)
# Combine both conditions into one mask to avoid a boolean reindexing warning
loc4_abs = df[loc4 & (df['medical'] != 0)]
loc4_abs.head()

  


Unnamed: 0,Date,Time,sewer_and_water,power,roads_and_bridges,medical,buildings,shake_intensity,location
2419,2020-04-10,00:30:00,3.0,8.0,6.0,6.0,9.0,0.0,4
2443,2020-04-09,11:05:00,6.0,8.0,0.0,8.0,9.0,0.0,4
2464,2020-04-09,13:50:00,1.0,2.0,5.0,3.0,10.0,0.0,4
2496,2020-04-08,13:50:00,4.0,4.0,6.0,7.0,10.0,0.0,4
2501,2020-04-09,14:00:00,1.0,10.0,0.0,2.0,4.0,0.0,4


In [17]:
# loc4 is a boolean mask over all rows, so this is the total row count
len(loc4)

83070

In [18]:
# Total row count minus the location-4 rows that have a medical report
len(loc4) - len(loc4_abs)

83021
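
Because `loc4` is a boolean Series with one entry per row of `df`, `len(loc4)` returns the total number of rows (83070) rather than the number of reports from location 4. A sketch of counting matching rows instead, using a toy frame with made-up values:

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "location": [4, 4, 1, 4],
    "medical": [0.0, 6.0, 3.0, 0.0],
})

at_loc4 = toy["location"] == 4

total_rows = len(at_loc4)               # length of the mask = all rows
reports_at_loc4 = int(at_loc4.sum())    # True counts as 1
with_medical = int((at_loc4 & (toy["medical"] != 0)).sum())
without_medical = reports_at_loc4 - with_medical

print(total_rows, reports_at_loc4, without_medical)
```

Summing the mask gives the number of reports from the neighborhood, which is the figure needed to say how many of them lack a medical value.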

In [19]:
df[loc4].mean()

sewer_and_water      5.468654
power                4.224049
roads_and_bridges    4.213772
medical              0.096608
buildings            3.692018
shake_intensity      4.263789
location             4.000000
dtype: float64

In [20]:
loc4_abs['medical'].mean()

5.755102040816326
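
The per-neighborhood figures computed one location at a time above could also be produced for all locations in one pass with `groupby`. The sketch below runs on a toy frame with made-up values and assumes that a 0 in 'medical' came from the earlier 'NaN' fill, which is not true of every zero in the real data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "location": [4, 4, 4, 7, 7],
    "medical": [0.0, 6.0, 8.0, 0.0, 6.0],
})

# Assumption: a 0 in 'medical' stands for a missing report, so treat it as NaN
medical = toy["medical"].replace(0, np.nan)

# Number of non-missing reports and their mean, per location
summary = medical.groupby(toy["location"]).agg(["count", "mean"])
print(summary)
```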

In [21]:
type(loc4)

pandas.core.series.Series

In [22]:
type(loc4_abs)

pandas.core.frame.DataFrame

In [23]:
# To print df in terminal without losing the format

pd.set_option('expand_frame_repr', False)
pd.set_option('display.max_columns', 999)
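
`pd.set_option` changes the display settings for the rest of the session. If the wider output is only wanted for a single print, `pd.option_context` can scope the same options to a block. A minimal sketch with a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"power": [6.0], "location": [1]})

default_max_columns = pd.get_option("display.max_columns")

# The options apply only inside the with-block and are restored afterwards
with pd.option_context("display.expand_frame_repr", False,
                       "display.max_columns", 999):
    inside = pd.get_option("display.max_columns")
    print(toy)

restored = pd.get_option("display.max_columns")
```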

In [24]:
df.head()

Unnamed: 0,Date,Time,sewer_and_water,power,roads_and_bridges,medical,buildings,shake_intensity,location
0,2020-04-08,17:50:00,10.0,6.0,10.0,3.0,8.0,0.0,1
1,2020-04-09,13:50:00,2.0,10.0,0.0,8.0,4.0,0.0,1
2,2020-04-09,00:20:00,7.0,10.0,10.0,9.0,10.0,0.0,1
3,2020-04-08,17:25:00,1.0,1.0,2.0,10.0,7.0,0.0,1
4,2020-04-08,02:50:00,9.0,7.0,1.0,6.0,9.0,0.0,1


In [28]:
df.dtypes

Date                  object
Time                  object
sewer_and_water      float64
power                float64
roads_and_bridges    float64
medical              float64
buildings            float64
shake_intensity      float64
location               int64
dtype: object

In [29]:
df['Date']

0        2020-04-08
1        2020-04-09
2        2020-04-09
3        2020-04-08
4        2020-04-08
5        2020-04-09
6        2020-04-08
7        2020-04-10
8        2020-04-10
9        2020-04-07
10       2020-04-09
11       2020-04-08
12       2020-04-08
13       2020-04-10
14       2020-04-06
15       2020-04-06
16       2020-04-06
17       2020-04-10
18       2020-04-09
19       2020-04-10
20       2020-04-10
21       2020-04-09
22       2020-04-08
23       2020-04-09
24       2020-04-06
25       2020-04-09
26       2020-04-06
27       2020-04-10
28       2020-04-08
29       2020-04-08
            ...    
83040    2020-04-10
83041    2020-04-09
83042    2020-04-09
83043    2020-04-10
83044    2020-04-09
83045    2020-04-09
83046    2020-04-09
83047    2020-04-10
83048    2020-04-09
83049    2020-04-10
83050    2020-04-09
83051    2020-04-10
83052    2020-04-09
83053    2020-04-10
83054    2020-04-09
83055    2020-04-09
83056    2020-04-09
83057    2020-04-09
83058    2020-04-09
