# Data cleaning


In the section below, the original dataset will be examined and cleaned up if necessary.  
The dataset is from the source below:

https://hogeschoolutrecht.sharepoint.com/:x:/r/sites/InstituteforICT-onderwijs/Gedeelde%20%20documenten/HBO-ICT%20-%20AAI/Data-driven%20business/sap_storing_data_hu_project.csv?d=w6af4a30998cb446283565c25459c2351&csf=1&web=1&e=bmoqi0

## Imports

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

Here we import the data analysis tools we use throughout the notebook.

## Fetching

In [2]:
df = pd.read_csv("../Data/CSV/sap_storing_data_hu_project.csv", low_memory=False)
oorzaakcodes = pd.read_csv('../Data/CSV/oorzaakcodes.csv', sep=';')
geocodes = pd.read_csv('../Data/CSV/geocodes.csv', sep=';')
print(f'No less than {df.shape[0]} rows and {df.shape[1]} columns!')

No less than 898526 rows and 140 columns!


Loading all 140 columns for nearly 900,000 rows takes a lot of memory and time.
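If the load becomes a problem, one option is to read only the columns that are needed by passing `usecols` to `pd.read_csv`. The sketch below uses a tiny in-memory stand-in for the real file; the column `unused_column` and the sample values are made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV (assumption: the real file has 140 columns).
raw = io.StringIO(
    "#stm_sap_meldnr,stm_prioriteit,stm_fh_duur,unused_column\n"
    "1,2,30.0,a\n"
    "2,4,0.0,b\n"
)

# Only these columns are parsed; everything else is skipped while reading.
wanted = ['#stm_sap_meldnr', 'stm_prioriteit', 'stm_fh_duur']
small = pd.read_csv(raw, usecols=wanted)
print(small.columns.tolist())
```

Note that the correlation check further down looks at every numeric column, so this would only fit once the column selection is settled.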

## Cleaning

In [3]:
columns = {
    '#stm_sap_meldnr': 'meldnummer', 
    'stm_sap_meld_ddt': 'melddatum', 
    'stm_aanntpl_dd': 'startdatum_monteur', 
    'stm_aanntpl_tijd': 'starttijd_monteur', 
    'stm_fh_ddt': 'oplosdatum', 
    'stm_prioriteit': 'prioriteit', 
    'stm_oorz_code': 'oorzaakcode', 
    'stm_geo_gst': 'geocode', 
    'stm_sap_meldtekst_lang': 'eerste_beschrijving', 
    'stm_sap_meldtekst': 'tweede_beschrijving', 
    'stm_oorz_tekst_kort': 'derde_beschrijving', 
    'stm_fh_duur': 'duur'
}

df = df.rename(columns=columns)
df = df.sort_values(by=['prioriteit', 'melddatum'], ascending=False)

We rename the columns to be more human-readable. Afterwards we sort the dataframe in descending order, first on priority and then on report date.

In [4]:
df = df[df['duur'] != 0.0]
df = df[df['duur'] < 360.0]
print(f'Amount of rows {df.shape[0]}, amount of columns {df.shape[1]}.')

Amount of rows 540712, amount of columns 140.


We delete all the rows where the duration is exactly 0 or is 360 minutes or more. These rows are irrelevant for our purpose, as we assume they were not filled in truthfully or correctly. Rows with a missing duration are dropped as well, because a comparison with NaN is never true.

In [5]:
#: Merging date and time into datetime.
df['starttijd_monteur'] = df.starttijd_monteur.fillna('23:59')  #: Replace empty start times with '23:59'.
df['startdatum_monteur'] = df['startdatum_monteur'] + ' ' + df['starttijd_monteur'] #: Concat start date and time
df = df.drop(columns=['starttijd_monteur'])

In [6]:
df['melddatum'] = pd.to_datetime(df['melddatum'], infer_datetime_format=True)
df['startdatum_monteur'] = pd.to_datetime(df['startdatum_monteur'], infer_datetime_format=True)
df['oplosdatum'] = pd.to_datetime(df['oplosdatum'], infer_datetime_format=True)
df['geocode'] = pd.to_numeric(df.geocode, errors='coerce')

This cell takes a long time to run because it converts the data types for every row. We convert the remaining columns we use later, after filtering out the unnecessary rows.
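Parsing is usually faster when pandas is told the exact format instead of having to infer it. The sketch below assumes the timestamps look like `2017-12-12 04:05:54`; the format of the raw file is not shown in this notebook, so the format string is an assumption that would need checking against the real data.

```python
import pandas as pd

# Made-up sample values; the last one is deliberately invalid.
raw = pd.Series(['2017-12-12 04:05:54', '2015-07-04 04:35:00', 'not a date'])

# An explicit format avoids per-value inference.
# errors='coerce' turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce')
print(parsed)
```

Values that become `NaT` here would then be removed by the `dropna` step that follows.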

In [7]:
#: Drop rows where there are nan values.
df = df.dropna(subset=['meldnummer', 'melddatum', 'startdatum_monteur', 'oplosdatum', 'prioriteit', 'oorzaakcode', 'geocode'])
print(f'Amount of rows {df.shape[0]}, amount of columns {df.shape[1]}.')

Amount of rows 413724, amount of columns 139.


In [8]:
#: Delete all rows where two of the three timestamps are identical.
df = df[df.melddatum != df.startdatum_monteur]
df = df[df.melddatum != df.oplosdatum]
df = df[df.startdatum_monteur != df.oplosdatum]
print(f'Amount of rows {df.shape[0]}, amount of columns {df.shape[1]}.')

Amount of rows 400124, amount of columns 139.


In [9]:
#: Delete all rows with illogical dates.
df = df[df.melddatum < df.startdatum_monteur]
df = df[df.melddatum < df.oplosdatum]
df = df[df.startdatum_monteur < df.oplosdatum]
print(f'Amount of rows {df.shape[0]}, amount of columns {df.shape[1]}.')

Amount of rows 397767, amount of columns 139.


In [10]:
#: Filter duplicate meldnummers and keep the last.
df = df.drop_duplicates(subset=['meldnummer'], keep='last')
print(f'Amount of rows {df.shape[0]}, amount of columns {df.shape[1]}.')

Amount of rows 263140, amount of columns 139.
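Which row survives `keep='last'` depends on the row order at that point. Because the dataframe was sorted descending on priority and report date, and boolean filtering keeps the order intact, the last duplicate should be the one with the earliest report date within the lowest priority value. A small sketch with made-up rows:

```python
import pandas as pd

# Made-up rows: meldnummer 1 occurs twice with different report dates.
toy = pd.DataFrame({
    'meldnummer': [1, 1, 2],
    'prioriteit': [2, 2, 4],
    'melddatum': pd.to_datetime(['2020-01-02', '2020-01-01', '2020-01-05']),
})

# Same sort as in the cleaning step: descending on priority, then on date.
toy = toy.sort_values(by=['prioriteit', 'melddatum'], ascending=False)

# keep='last' retains the final occurrence in the current row order.
kept = toy.drop_duplicates(subset=['meldnummer'], keep='last')
print(kept)
```

If the most recent report should be kept instead, `keep='first'` would be the option to look at.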


In [11]:
#: Add column with hersteltijd.
df['hersteltijd'] = (df.oplosdatum - df.startdatum_monteur) / pd.Timedelta(minutes=1)
df.hersteltijd = df.hersteltijd.astype(int)

In [12]:
#: Drop rows where hersteltijd is 0 or longer than 360.
df = df[df.hersteltijd <= 360]
df = df[df.hersteltijd > 0]
print(f'Amount of rows {df.shape[0]}, amount of columns {df.shape[1]}.')

Amount of rows 260737, amount of columns 140.


In [13]:
#: Drop all rows with non-existent geocodes.
lijst_geocodes = list(geocodes.geocode)
df = df[df['geocode'].isin(lijst_geocodes)]
print(f'Amount of rows {df.shape[0]}, amount of columns {df.shape[1]}.')

Amount of rows 259196, amount of columns 140.


In [14]:
#: Drop all rows with non-existent oorzaak codes.
lijst_oorzaakcodes = list(oorzaakcodes.Code)
df = df[df['oorzaakcode'].isin(lijst_oorzaakcodes)]
print(f'Amount of rows {df.shape[0]}, amount of columns {df.shape[1]}.')

Amount of rows 259140, amount of columns 140.


In [15]:
#: Drop all rows with non-existent prioriteiten.
lijst_prioriteit = range(1, 10)
df = df[df['prioriteit'].isin(lijst_prioriteit)]
print(f'Amount of rows {df.shape[0]}, amount of columns {df.shape[1]}.')

Amount of rows 259140, amount of columns 140.


In [16]:
#: Create new column hersteltijd_15 that holds the repair time binned into 15 minute intervals.
df['hersteltijd_15'] = None
#: Iterate through all repair times in intervals of 15 minutes.
for period in range(0, df['hersteltijd'].max(), 15):
    df['hersteltijd_15'] = np.where(df['hersteltijd'] > period, period, df['hersteltijd_15'])

The repair times are recorded per minute, which makes the dataset neither transparent nor easy to predict. For this reason, we have binned the repair times into intervals of fifteen minutes.
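The loop above assigns every repair time the largest multiple of 15 that is strictly smaller than it, so 15 minutes lands in bin 0 and 16 minutes in bin 15. A minimal sketch, on made-up values, of a vectorised expression that should give the same result without a loop:

```python
import numpy as np
import pandas as pd

# Made-up repair times in minutes, including the bin edges 15 and 360.
hersteltijd = pd.Series([1, 3, 15, 16, 33, 103, 171, 360])

# Loop version, mirroring the cell above.
loop_result = np.full(len(hersteltijd), None, dtype=object)
for period in range(0, hersteltijd.max(), 15):
    loop_result = np.where(hersteltijd > period, period, loop_result)

# Vectorised version: subtract 1 so exact multiples fall into the lower bin.
vectorised = ((hersteltijd - 1) // 15) * 15

print(list(loop_result))
print(vectorised.tolist())
```

The vectorised form only matches for repair times greater than 0, which holds here because rows with a repair time of 0 were dropped earlier.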

In [17]:
#: Convert some numeric values to integers.
df = df.astype({ 'meldnummer': 'int32', 'prioriteit': 'int32', 'geocode': 'int32', 'oorzaakcode': 'int32', 'hersteltijd_15': 'int32' })

In [18]:
df['stm_arbeid'].sample(5)

557736    58.0
409022    58.0
358539    37.0
231090    29.0
451380     3.0
Name: stm_arbeid, dtype: float64

In [19]:
df.corrwith(df['hersteltijd'])

meldnummer                   0.004997
stm_mon_nr                   0.046049
stm_geo_mld_uit_functiepl    0.027174
stm_equipm_nr_mld            0.005202
stm_km_van_mld              -0.001165
stm_km_tot_mld               0.016265
prioriteit                  -0.002255
stm_mon_nr__statuscode      -0.031056
geocode                      0.026947
stm_geo_gst_uit_functiepl    0.026947
stm_equipm_nr_gst            0.001879
stm_km_van_gst              -0.000511
stm_km_tot_gst               0.011398
oorzaakcode                 -0.050509
stm_fh_status               -0.122163
stm_tao_telling_mutatie      0.024709
stm_contractgeb_mld         -0.019410
stm_contractgeb_gst         -0.014085
stm_arbeid                   0.792926
stm_progfh_gw_lwd_datum      0.064661
stm_progfh_gw_teller         0.103485
duur                         0.662829
stm_reactie_duur             0.015307
stm_status_sapnaarmon        0.007352
stm_schadenr                 0.015541
stm_rapportage_maand         0.002149
stm_rapporta

We see here that "stm_arbeid" has a high correlation of 0.792926 with the repair time. Unfortunately we can't use it, because we don't know for certain what it means. Our guess is that it is the actual labor time that was needed for the repair, but that is only known afterwards. Furthermore, "stm_progfh_gw_teller", the number of times a description has been changed, correlates slightly (0.103485) with the repair time. The logic behind this would be that for more complex problems the description is modified several times to describe each newly discovered problem.

The remaining columns don't correlate well, so we follow our own logic and choose "geocode", "oorzaakcode" and "prioriteit". Geocode because the location can have a lot of influence on the repair: in a busy city it will probably take longer than in a quiet meadow. Oorzaakcode because one kind of problem can take longer than another. Prioriteit because something with a high priority can either be tackled faster or take longer.
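With this many columns, the correlations are easier to read when they are ranked by absolute value. The sketch below does that on a made-up dataframe; the column names `strong`, `noise` and `text` are hypothetical and only serve as illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: 'strong' follows the target closely, 'noise' does not.
target = np.arange(1, 101)
toy = pd.DataFrame({
    'hersteltijd': target,
    'strong': target * 2 + np.tile([3.0, -3.0], 50),
    'noise': np.tile([1.0, -1.0], 50),
    'text': ['x'] * 100,
})

# Keep numeric columns only and leave out the target itself.
numeric = toy.select_dtypes(include='number').drop(columns=['hersteltijd'])

# Rank on absolute correlation so strong negative values are not overlooked.
ranking = numeric.corrwith(toy['hersteltijd']).abs().sort_values(ascending=False)
print(ranking)
```

Keep in mind that codes such as "geocode" and "oorzaakcode" are labels rather than quantities, so a low linear correlation does not by itself mean they carry no information.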

In [20]:
#: Reorder and select the columns we keep.
df = df[['meldnummer', 'melddatum', 'startdatum_monteur', 'oplosdatum', 'prioriteit', 'oorzaakcode', 'geocode', 
         'eerste_beschrijving', 'tweede_beschrijving', 'derde_beschrijving', 'hersteltijd', 'hersteltijd_15']]

In [21]:
print(f'Minimal repair time: {df["hersteltijd_15"].min()}, maximal repair time: {df["hersteltijd_15"].max()}')
df = df[(np.abs(stats.zscore(df[['hersteltijd_15']])) < 3).all(axis=1)]
print(f'Minimal repair time: {df["hersteltijd_15"].min()}, maximal repair time: {df["hersteltijd_15"].max()}')
print(f'Amount of rows {df.shape[0]}, amount of columns {df.shape[1]}.')

Minimal repair time: 0, maximal repair time: 345
Minimal repair time: 0, maximal repair time: 195
Amount of rows 252751, amount of columns 12.


Here we delete all rows whose binned repair time lies three or more standard deviations from the mean, so the dataset is easier to predict. The number of dropped rows (6389) is small in the grand scheme, but it will hopefully improve our model performance a bit later on.
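To make the outlier rule concrete: the z-score of a value is its distance from the mean divided by the standard deviation, and rows with an absolute z-score of 3 or more are dropped. The sketch below computes this by hand with `ddof=0`, which to our knowledge matches the default of `scipy.stats.zscore`; the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up data: thirty short repairs and one very long one.
toy = pd.DataFrame({'hersteltijd_15': [15] * 30 + [345]})

values = toy['hersteltijd_15']
# Population standard deviation (ddof=0).
z = (values - values.mean()) / values.std(ddof=0)

# Keep only rows within three standard deviations of the mean.
filtered = toy[np.abs(z) < 3]
print(len(toy), len(filtered), filtered['hersteltijd_15'].max())
```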

## Displaying/Checking

In [22]:
df.sample(5)

Unnamed: 0,meldnummer,melddatum,startdatum_monteur,oplosdatum,prioriteit,oorzaakcode,geocode,eerste_beschrijving,tweede_beschrijving,derde_beschrijving,hersteltijd,hersteltijd_15
856350,80902509,2017-12-12 04:05:54,2017-12-12 04:36:00,2017-12-12 04:39:00,2,135,553,Std : km 22.030 storing verwarming bluswater,Std : km 22.030 storing verwarming blusw,Gestoord object: OBJECT 553 ONBEKEND. Ge,3,0
704931,80811611,2015-07-04 04:35:00,2015-07-04 05:19:00,2015-07-04 08:10:00,2,135,112,Dt : TTI Bolwerk oost vloeistofkelder nivo wa...,Dt : TTI Bolwerk oost vloeistofkelder ni,,171,165
210132,70462508,2019-03-21 16:16:26,2019-03-21 17:03:55,2019-03-21 17:37:01,2,215,55,Grvn-Dn : Ovw 27.9 in storing. (Ovw 27.2 korts...,Grvn-Dn : Ovw 27.9 in storing. (Ovw 27.2,Benning tg defect,33,30
293293,80440059,2008-01-14 09:14:00,2008-01-14 10:17:00,2008-01-14 12:00:00,2,218,525,Sgr: storing op beeldschermen westerkeersluis...,Sgr: storing op beeldschermen westerkeer,,103,90
325508,80480479,2008-11-13 00:42:00,2008-11-13 01:19:00,2008-11-13 01:20:00,2,230,158,Bd : S-1040 'rood' gedoofd.,Bd : S-1040 'rood' gedoofd.,,1,0


In [23]:
df.dtypes

meldnummer                      int32
melddatum              datetime64[ns]
startdatum_monteur     datetime64[ns]
oplosdatum             datetime64[ns]
prioriteit                      int32
oorzaakcode                     int32
geocode                         int32
eerste_beschrijving            object
tweede_beschrijving            object
derde_beschrijving             object
hersteltijd                     int32
hersteltijd_15                  int32
dtype: object

In [24]:
pd.isnull(df).sum()

meldnummer                  0
melddatum                   0
startdatum_monteur          0
oplosdatum                  0
prioriteit                  0
oorzaakcode                 0
geocode                     0
eerste_beschrijving         6
tweede_beschrijving         0
derde_beschrijving     171874
hersteltijd                 0
hersteltijd_15              0
dtype: int64

We see that no column has any null values anymore, apart from "eerste_beschrijving" (6) and "derde_beschrijving" (171874). We keep "derde_beschrijving" in case we want to display it in the GUI. Our columns have the correct data types, and the sample also looks as we expect.

## Creating a subset

In [25]:
subset_df = df.rename(columns={'eerste_beschrijving': 'beschrijving'}).head(10)
subset_df = subset_df[['meldnummer', 'oorzaakcode', 'prioriteit', 'melddatum', 'geocode', 'beschrijving', 'hersteltijd', 'hersteltijd_15']]

## Saving

In [26]:
df.to_csv('../Data/CSV/cleaned.csv', sep=';', index=False)
subset_df.to_csv('../Data/CSV/subset.csv', sep=';', index=False)

## Conclusion

We started with 898526 rows and 140 columns and reduced this to a usable set of 252751 rows and 12 columns, two of which ("hersteltijd" and "hersteltijd_15") we derived ourselves. This is roughly a 72% decrease in rows. Normally we would be skeptical of so much data waste, but the data we filtered out is in our opinion truly unusable as it contains false or bad data.
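The row counts printed after each step can be put side by side to see where the data was lost. The counts below are copied from the outputs in this notebook; the step labels are our own shorthand.

```python
# (step label, rows remaining), taken from the printed outputs above.
steps = [
    ('loaded', 898526),
    ('duur filter', 540712),
    ('dropna', 413724),
    ('identical timestamps', 400124),
    ('illogical order', 397767),
    ('duplicate meldnummer', 263140),
    ('hersteltijd filter', 260737),
    ('unknown geocode', 259196),
    ('unknown oorzaakcode', 259140),
    ('invalid prioriteit', 259140),
    ('z-score outliers', 252751),
]

# Print how many rows each step removed compared to the previous one.
previous = steps[0][1]
for name, rows in steps:
    print(f'{name:<24}{rows:>8}  dropped {previous - rows:>7}')
    previous = rows

# Share of the original rows that survived all steps.
retained = steps[-1][1] / steps[0][1]
print(f'Retained {retained:.1%} of the original rows.')
```

The duration filter and the removal of duplicate meldnummers account for most of the reduction.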

We hope this cleaned set will be a good foundation for our models.