<img src="images/GAlogo.png" style="float: left; margin: 15px; height: 100px">

# CAPSTONE PROJECT
## US TORNADOES: PREDICTING THEIR MAGNITUDE WITH MACHINE LEARNING
### Pre-Processing Class

# Goal
Gather the pre-processing functions developed in [part3](tornadoes_part3_eda_modelling_nlp_tsa.ipynb) into a single class that can be used in production within a pipeline.

# Packages import

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
import itertools

from sqlalchemy import create_engine

import calendar
from datetime import datetime, date, timedelta

from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin

%matplotlib inline

# Pre-processing class for data cleaning

- In production, every step must be repeatable and easy to apply. We do not want to copy and paste every line of code and every function each time new data needs a prediction.
- This is where a class is useful. It gathers all the functions and pre-processing code developed in parts 2 and 3 of the capstone, which can then be run in one go through `fit` followed by `transform`.
- Moreover, the class is then easy to use in a pipeline, in conjunction with standardization and modelling.
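
To illustrate the last point, here is a minimal sketch of how a custom transformer slots into an sklearn `Pipeline` ahead of `StandardScaler`. The `ToyPreprocessor` class and the small `toy` dataframe are hypothetical stand-ins invented for this example (the values are made up); only the column names are borrowed from the tornado data.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ToyPreprocessor(BaseEstimator, TransformerMixin):
    """Hypothetical, much reduced stand-in for a pre-processing class."""

    def fit(self, X, y=None):
        # Nothing is learned here; returning self satisfies the sklearn contract.
        return self

    def transform(self, X):
        # Work on a copy so that the caller's dataframe is left untouched.
        X = X.copy()
        X['Injuries'] = X['INJURIES_DIRECT'] + X['INJURIES_INDIRECT']
        # Keep only numeric columns so that the scaler can consume the result.
        return X[['TOR_LENGTH', 'TOR_WIDTH', 'Injuries']]


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'TOR_LENGTH': [0.27, 1.43, 3.47],
    'TOR_WIDTH': [20.0, 50.0, 150.0],
    'INJURIES_DIRECT': [0, 1, 4],
    'INJURIES_INDIRECT': [0, 0, 1],
})

pipe = Pipeline([
    ('preprocess', ToyPreprocessor()),
    ('scale', StandardScaler()),
])

# One call runs the pre-processing and then the standardization.
scaled = pipe.fit_transform(toy)
```

A model could be appended as a final step of the same pipeline, so that new data goes through exactly the same pre-processing as the training data.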

In [24]:
# To integrate the class into an sklearn pipeline, it must inherit from the base classes
# BaseEstimator and TransformerMixin:

class TornadoPreprocessor(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        self.feature_names = []
    
    

    def _compute_duration(self, df, beginDT='BEGIN_DATE_TIME', endDT='END_DATE_TIME'):
        '''Takes a tornado dataframe as input, with the begin and end dates and times of
        each tornado in text format.
        Converts them into datetime format.
        Computes the duration in minutes and returns it as a new column.
        Computes the average (mid-point) date and time of each tornado, converts them to
        day of the year and time as a float, and returns them as new columns'''

        # Changing to datetime format:
        df[beginDT] = pd.to_datetime(df[beginDT])
        df[endDT] = pd.to_datetime(df[endDT])

        # Computes the duration:
        df['Duration'] =  df[endDT] - df[beginDT]
        # Converting to minutes from seconds:
        df['Duration'] = df.Duration.map(lambda x: int(round(x.total_seconds()/60.)))

        # Computing the average date and time of a tornado event:
        df['AverageDate'] = pd.to_datetime(df[beginDT] + (df[endDT] - df[beginDT])/2)
        df['AverageTime'] = df['AverageDate']

        def _dayoftheyear(dateandtime):
            '''Converts a timestamp into a day of the year, aligned so that a given number
            always maps to the same calendar date whether or not the year is a leap year'''
            dayYear = dateandtime.timetuple().tm_yday
            # Feb 28 is day 31+28=59. In leap years, every date from Feb 29 onwards is
            # shifted back by one day (so Feb 29 shares day 59 with Feb 28):
            if calendar.isleap(dateandtime.year) and dayYear > 59:
                return dayYear - 1
            # Non-leap years, and leap-year dates up to Feb 28, are returned unchanged:
            return dayYear
        # Converting date to day of the year:
        df['AverageDate'] = df['AverageDate'].map(_dayoftheyear)

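        # Converting the time of day from a Timestamp object to a float (hours, plus the
        # minutes as a fraction of an hour, plus a small seconds term).
        # Note: the seconds term is second/6000 rather than second/3600, so it is not an
        # exact fraction of an hour: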
        df['AverageTime'] = df['AverageTime'].map(lambda x: x.time())
        df['AverageTime'] = df['AverageTime'].map(lambda x: x.hour) \
        + df['AverageTime'].map(lambda x: round((x.minute*100/60.)/100., 2)) \
        + df['AverageTime'].map(lambda x: round((x.second*100/60.)/10000., 4))
    
        return df
    
    
    def _death_and_injuries(self, df):
        '''Gathers indirect and direct injuries and deaths into one column for each'''
        df['Deaths'] = df.DEATHS_DIRECT + df.DEATHS_INDIRECT
        df['Injuries'] = df.INJURIES_DIRECT + df.INJURIES_INDIRECT
        return df
    
    
    def _compute_Az_AvLat_AvLon_fromLatLong(self, df, xA='BEGIN_LON', yA='BEGIN_LAT', 
                                            xB='END_LON', yB='END_LAT',
                                            outMeanLat='Mean_Lat', outMeanLon='Mean_Lon',
                                            outAz='Azimuth'):
        '''Takes a tornado dataframe as input, with the geographical coordinates of the
        begin and end points.
        Returns the dataframe with 3 new features: 
        direction the tornado headed, measured from North (azimuth of its straight-line path)
        average latitude
        average longitude'''

        # New features with average latitude and longitude of the tornado path:
        LAT = (df[yA]+df[yB])/2
        LON = (df[xA]+df[xB])/2
        df[outMeanLat] = np.round(LAT,4)
        df[outMeanLon] = np.round(LON,4)

        # Coordinates differences converted to radians:
        diffLON = np.radians(df[xB] - df[xA])
        diffLAT = np.radians(df[yB] - df[yA])

        # Azimuth computed from trigonometry (arctan2 returns an angle in the range (-pi,pi] 
        # instead of (-pi/2,pi/2) like the conventional arctan):
        AZ = np.degrees(np.arctan2(diffLON, diffLAT))
        # A tornado with identical begin and end points has no defined direction:
        AZ[(diffLAT == 0) & (diffLON == 0)] = np.nan

        # Saving azimuth into new column:
        df[outAz] = np.round(AZ,2)

        # North (0 degrees) should sit in the middle of the range => angles in ]-180;180] 
        # instead of [0;360[. arctan2 already returns that range, so this is only a safeguard:
        df[outAz] = df[outAz].map(lambda x: x-360 if x>180 else x)
        
        # Filling the azimuth NaN values with the median:
        df[outAz] = df[outAz].fillna(df[outAz].median())

        return df
    

    def _fips_to_state_abbreviation(self, df):
        '''Transforming the state FIPS numbers into shorter official abbreviations'''
        fips_to_state = {1: 'AL', 2: 'AK', 4: 'AZ', 5: 'AR', 6: 'CA', 8: 'CO', 9: 'CT', 10: 'DE',
                         11: 'DC', 12: 'FL', 13: 'GA', 15: 'HI', 16: 'ID', 17: 'IL', 18: 'IN', 
                         19: 'IA', 20: 'KS',
                         21: 'KY', 22: 'LA', 23: 'ME', 24: 'MD', 25: 'MA', 26: 'MI', 27: 'MN', 
                         28: 'MS', 29: 'MO', 30: 'MT',
                         31: 'NE', 32: 'NV', 33: 'NH', 34: 'NJ', 35: 'NM', 36: 'NY', 37: 'NC',
                         38: 'ND', 39: 'OH', 40: 'OK',
                         41: 'OR', 42: 'PA', 44: 'RI', 45: 'SC', 46: 'SD', 47: 'TN', 48: 'TX',
                         49: 'UT', 50: 'VT',
                         51: 'VA', 53: 'WA', 54: 'WV', 55: 'WI', 56: 'WY', 60: 'AS',
                         64: 'FM', 66: 'GU', 68: 'MH', 69: 'MP', 70: 'PW',
                         72: 'PR', 74: 'UM', 78: 'VI',
                         99: 'PR'}
        df['State'] = df['STATE_FIPS'].map(lambda x: fips_to_state[x])
        return df

    
    def _compute_total_cost(self, df):
        '''Converting the cost columns into integer numbers and summing them into one column'''
        
        def _convertcost_tointeger(cost):
            '''From a cost given as a number, or as text in the format 'nb''letter' where
            letter = K or M or B.
            Returns the cost as an integer (0 for missing values)'''
            try:
                return int(cost)
            except (ValueError, TypeError):
                # int() fails on NaN/None and on text carrying a K/M/B suffix:
                if pd.isnull(cost): return 0
                elif 'K' in cost: return int(round(float(cost[:-1]) * 10**3))
                elif 'M' in cost: return int(round(float(cost[:-1]) * 10**6))
                elif 'B' in cost: return int(round(float(cost[:-1]) * 10**9))
                else: return np.nan
                
#        def _convertcost_tointeger(cost):
#            '''From a cost in text format in format 'nb'.00'letter' where letter = K or M or B
#            Returns a cost as integer'''
#            if pd.isnull(cost): return 0
#            elif 'K' in cost: return int(round(float(cost[:-2]) * 10**3))
#            elif 'M' in cost: return int(round(float(cost[:-2]) * 10**6))
#            elif 'B' in cost: return int(round(float(cost[:-2]) * 10**9))
#            else: return np.nan
            
        # Converting the cost columns into integer numbers and summing them
        df['TotalCost'] = df.DAMAGE_PROPERTY.map(_convertcost_tointeger) + \
                          df.DAMAGE_CROPS.map(_convertcost_tointeger)
        return df


    def _replace_spaces_in_source(self, df):
        '''Replaces spaces by underscores in the SOURCE feature'''
        df['Source'] = df.SOURCE.map(lambda x: x.replace(' ', '_'))
        return df


    def _drop_unused_cols(self, df):
        '''Drops the raw columns that are no longer needed; columns that are absent from
        the dataframe are silently ignored'''
        unused_cols = ['EVENT_ID', 'EPISODE_ID', 'EVENT_TYPE', 'MAGNITUDE', 'MAGNITUDE_TYPE',
                       'FLOOD_CAUSE', 
                       'EPISODE_NARRATIVE', 'EVENT_NARRATIVE', 'DATA_SOURCE', 'CATEGORY',
                       'BEGIN_YEARMONTH', 'BEGIN_DAY', 'BEGIN_TIME', 
                       'END_YEARMONTH', 'END_DAY', 'END_TIME',
                       'BEGIN_DATE_TIME', 'CZ_TIMEZONE', 'END_DATE_TIME', 'YEAR', 'MONTH_NAME',
                       'DEATHS_DIRECT','DEATHS_INDIRECT','INJURIES_DIRECT','INJURIES_INDIRECT',
                       'BEGIN_RANGE', 'BEGIN_AZIMUTH', 'END_RANGE', 'END_AZIMUTH', 
                       'BEGIN_LAT', 'BEGIN_LON', 'END_LAT', 'END_LON', 
                       'CZ_NAME', 'CZ_TYPE', 'CZ_FIPS', 'WFO', 'STATE_FIPS', 'STATE', 
                       'BEGIN_LOCATION', 'END_LOCATION', 
                       'TOR_OTHER_CZ_STATE', 'TOR_OTHER_CZ_FIPS', 'TOR_OTHER_CZ_NAME', 
                       'TOR_OTHER_WFO', 'DAMAGE_PROPERTY', 'DAMAGE_CROPS', 'SOURCE']
        return df.drop(unused_cols, axis=1, errors='ignore')

    
    def transform(self, X, *args):
        X = self._compute_duration(X)
        X = self._death_and_injuries(X)
        X = self._compute_Az_AvLat_AvLon_fromLatLong(X)
        X = self._fips_to_state_abbreviation(X)
        X = self._compute_total_cost(X)
        X = self._replace_spaces_in_source(X)
        X = self._drop_unused_cols(X)
        self.feature_names = X.columns
        return X

    
    def fit(self, X, *args):
        return self
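
The leap-year handling in `_dayoftheyear` above can be checked in isolation. The sketch below uses only the standard library; the function name `day_of_year_aligned` is a hypothetical name chosen for this example, and the logic is meant to mirror the nested helper rather than replace it.

```python
import calendar
from datetime import datetime


def day_of_year_aligned(ts):
    """Day of the year, shifted in leap years so that a given number
    always corresponds to the same calendar date."""
    day = ts.timetuple().tm_yday
    # Feb 28 is day 59; in a leap year everything after it moves back by one.
    if calendar.isleap(ts.year) and day > 59:
        return day - 1
    return day


# March 1 maps to the same value in a leap year and in a regular year.
leap = day_of_year_aligned(datetime(2016, 3, 1))
regular = day_of_year_aligned(datetime(2017, 3, 1))
print(leap, regular)  # prints "60 60"
```

One consequence of this choice is that Feb 29 shares day 59 with Feb 28, which seems an acceptable loss of precision for a seasonal feature.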

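The azimuth computed in `_compute_Az_AvLat_AvLon_fromLatLong` can likewise be sketched for a single pair of points with the standard `math` module. The function name `azimuth_deg` and the coordinates are hypothetical. As in the class, latitude and longitude differences are treated as planar offsets, which is an approximation.

```python
import math


def azimuth_deg(lat_a, lon_a, lat_b, lon_b):
    """Heading from point A to point B in degrees from North, in (-180, 180].

    Returns None when both points coincide (no direction is defined).
    """
    d_lat = lat_b - lat_a
    d_lon = lon_b - lon_a
    if d_lat == 0 and d_lon == 0:
        return None
    # atan2(east offset, north offset): 0 is North, 90 is East, -90 is West.
    return round(math.degrees(math.atan2(d_lon, d_lat)), 2)


print(azimuth_deg(41.0, -93.0, 42.0, -93.0))  # heading due North: 0.0
print(azimuth_deg(41.0, -93.0, 41.0, -92.0))  # heading due East: 90.0
print(azimuth_deg(41.0, -93.0, 41.0, -94.0))  # heading due West: -90.0
```

Passing the longitude difference first is what makes the angle start at North and grow clockwise, unlike the usual mathematical convention that starts at East.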

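The cost conversion used in `_compute_total_cost` can be sketched as a standalone function, assuming costs arrive either as numbers or as text with a K, M or B suffix. The names `cost_to_int` and `MULTIPLIERS` are hypothetical, and this variant differs slightly from the class: it looks only at the last character and returns `None` for unrecognised text.

```python
import math

# Suffix letters and the factor each one stands for.
MULTIPLIERS = {'K': 10**3, 'M': 10**6, 'B': 10**9}


def cost_to_int(cost):
    """Convert a cost such as 250, '10.00K' or '1.5M' into an integer."""
    # Missing values count as zero damage.
    if cost is None or (isinstance(cost, float) and math.isnan(cost)):
        return 0
    try:
        return int(cost)
    except (ValueError, TypeError):
        pass
    text = str(cost).strip()
    number, suffix = text[:-1], text[-1:].upper()
    if suffix in MULTIPLIERS:
        try:
            return int(round(float(number) * MULTIPLIERS[suffix]))
        except ValueError:
            return None
    # Unrecognised format.
    return None


print(cost_to_int('10.00K'))  # 10000
print(cost_to_int('1.5M'))    # 1500000
print(cost_to_int(None))      # 0
```
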
### Checking the class with our data:

In [15]:
# Creating engine connection to my local "storms" database, using sqlalchemy:
engine_local = create_engine('postgresql://localhost:5432/storms')

# Loading the data from Feb 2007 onwards (the period in which the EF scale has been in use):
sql_query = """
SELECT * 
FROM tornadoes_1950_mid2017 
WHERE "BEGIN_YEARMONTH" >= 200702;
"""

raw_2007_2017 = pd.read_sql(sql_query, engine_local)

# Counting the number of tornadoes in each magnitude category:
raw_2007_2017.TOR_F_SCALE.value_counts()

EF0    7661
EF1    4822
EF2    1441
EF3     435
EF4     103
EFU      62
EF5      14
Name: TOR_F_SCALE, dtype: int64

In [25]:
preproc = TornadoPreprocessor()
preproc.fit(raw_2007_2017)

TornadoPreprocessor()

In [26]:
df = preproc.transform(raw_2007_2017)

In [27]:
print(raw_2007_2017.shape, df.shape)

(14538, 62) (14538, 14)


In [28]:
df.head(10)

| | TOR_F_SCALE | TOR_LENGTH | TOR_WIDTH | Duration | AverageDate | AverageTime | Deaths | Injuries | Mean_Lat | Mean_Lon | Azimuth | State | TotalCost | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EF0 | 0.27 | 20.0 | 1 | 208 | 15.855 | 0 | 0 | 41.7345 | -93.322 | 120.34 | IA | 0 | Public |
| 1 | EF0 | 0.2 | 20.0 | 1 | 208 | 16.255 | 0 | 0 | 41.6694 | -93.0534 | 117.98 | IA | 0 | Law_Enforcement |
| 2 | EF1 | 0.29 | 100.0 | 1 | 208 | 15.575 | 0 | 0 | 41.7892 | -93.5405 | 144.9 | IA | 60000 | Trained_Spotter |
| 3 | EF0 | 1.43 | 50.0 | 3 | 126 | 15.685 | 0 | 0 | 37.2945 | -98.6804 | 1.11 | KS | 0 | Law_Enforcement |
| 4 | EF0 | 1.38 | 20.0 | 5 | 183 | 17.535 | 0 | 0 | 26.72 | -80.25 | 0.0 | FL | 0 | Broadcast_Media |
| 5 | EF0 | 2.77 | 100.0 | 9 | 111 | 18.875 | 0 | 0 | 43.683 | -98.3963 | 81.93 | SD | 0 | Storm_Chaser |
| 6 | EF0 | 0.55 | 30.0 | 1 | 152 | 11.925 | 0 | 0 | 24.6658 | -81.524 | -36.57 | FL | 20000 | NWS_Storm_Survey |
| 7 | EF2 | 1.5 | 300.0 | 0 | 105 | 6.25 | 0 | 0 | 30.53 | -82.25 | 64.12 | FL | 0 | NWS_Storm_Survey |
| 8 | EF1 | 0.2 | 50.0 | 0 | 105 | 9.17 | 0 | 0 | 29.72 | -81.24 | 64.12 | FL | 0 | Public |
| 9 | EF1 | 3.47 | 150.0 | 5 | 295 | 16.935 | 0 | 0 | 33.4369 | -90.5177 | 52.84 | MS | 180000 | NWS_Storm_Survey |


The class works as expected.

The class and all the package imports are also copied into a .py file in the current folder, so that they are easily accessible from other notebooks.