# 2019: Week 12

May 01, 2019

In previous weeks of Preppin' Data we have seen the struggles of manual data capture and its impact, particularly on text-based data. In our experience, manual input does not only hurt text: it reduces the accuracy of all types of captured data.

This week, we put ourselves in the shoes of IT at Chin & Beard Suds Co. The company has had a number of system outages, and we need to understand the size of the issue.

*this is probably quite a challenging week*

Luckily for us, we have two separate data sources: (1) a set of automatically generated logs that capture service downtime with a precise timestamp; and (2) a manual spreadsheet where staff can record issues they are having with systems. Sadly, the manual data is not captured with the same accuracy as the automatic logs. But do the automatic logs contain all the data? Help us clean the data and make clear how much downtime we are suffering from, in which system, and which error causes the biggest % of downtime.

# Requirements

<img src="https://3.bp.blogspot.com/-5mmrhZLUZ4w/XMls6kBBw5I/AAAAAAAAAN0/Bu_N03J_NyY4Ra4Os9ANY34OUd6ZCziiACLcBGAs/s320/Auto%2BInput.JPG" width="600" height="300">

Auto input

<img src="https://1.bp.blogspot.com/-u-uUCtHJDeM/XMls6ohrTiI/AAAAAAAAAN8/ChR-N7wQuiI_PMaB5M0MRvq8jmDc-A0MwCEwYBhgL/s320/Manual%2BInput.JPG" width="400" height="200">

Manual input

* Input the data
* Make the manual date and time a date / time field
* Bring the datasets together in a manner that removes the duplicate records from the manually captured data set
  * Duplicates are determined by a system being down at the same time but recorded both automatically and manually. The automatic data should always overwrite the manual data.
* Work out the duration (in seconds) of the error
* Understand the '% of downtime' per system in hours
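
The last two requirements can be sketched roughly as below. This is a minimal illustration, not the official solution: the rows are invented toy data, the output column names (`Duration (s)`, `Downtime in Hours`, `Total Downtime in Hours`, `% of system downtime`) are hypothetical, and it assumes that "% of downtime per system" means each error's share of its own system's total downtime.

```python
import pandas as pd

# Toy rows for illustration only; they are NOT taken from the challenge input.
outages = pd.DataFrame({
    "System": ["Sales", "Sales", "Stock"],
    "Start Date / Time": pd.to_datetime(
        ["2019-01-01 00:00:00", "2019-01-02 00:00:00", "2019-01-03 00:00:00"]
    ),
    "End Date / Time": pd.to_datetime(
        ["2019-01-01 01:00:00", "2019-01-02 03:00:00", "2019-01-03 02:00:00"]
    ),
})

# Duration of each error in seconds (end minus start).
outages["Duration (s)"] = (
    outages["End Date / Time"] - outages["Start Date / Time"]
).dt.total_seconds()

# The same duration expressed in hours.
outages["Downtime in Hours"] = outages["Duration (s)"] / 3600

# Total downtime per system, repeated on every row of that system.
outages["Total Downtime in Hours"] = (
    outages.groupby("System")["Downtime in Hours"].transform("sum")
)

# Assumed definition: each row's share of its system's total downtime.
outages["% of system downtime"] = (
    outages["Downtime in Hours"] / outages["Total Downtime in Hours"] * 100
)

print(outages)
```

Using `transform("sum")` keeps one row per error, so the per-system total sits next to each individual duration without a separate join.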


# Output

<img src="https://4.bp.blogspot.com/-3BCdcR0_gSw/XMls6iF40iI/AAAAAAAAAOA/wOQD_qF1fTcuus5VqTbRw85ZqeXihw61QCEwYBhgL/s400/Output.JPG" width="400" height="200">

* 8 columns
* 11 rows (12 including headers)
* No nulls

In [52]:
import pandas as pd
from datetime import datetime

In [53]:
# Load the input data: one sheet per source
input_path = "input.xlsx"  # renamed from `input` to avoid shadowing the built-in input()
df = pd.read_excel(input_path, sheet_name="Automatic Error log")
df2 = pd.read_excel(input_path, sheet_name="Manual capture error list")
print(df.head(5))
print(df.info())
print("=====================================================")
print(df2.head(5))
print(df2.info())
print("=====================================================")

    Start Date / Time     End Date / Time System           Error
0 2019-04-13 07:55:23 2019-04-13 08:23:12  Sales       Disc full
1 2018-04-13 09:03:22 2018-04-15 09:03:21  Stock  Planned Outage
2 2018-05-13 09:03:22 2018-05-16 09:03:21  Stock  Planned Outage
3 2018-06-13 09:03:22 2018-06-15 09:03:21  Stock  Planned Outage
4 2018-07-12 23:03:22 2018-07-13 07:00:03  Stock  Planned Outage
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Start Date / Time  9 non-null      datetime64[ns]
 1   End Date / Time    9 non-null      datetime64[ns]
 2   System             9 non-null      object        
 3   Error              9 non-null      object        
dtypes: datetime64[ns](2), object(2)
memory usage: 416.0+ bytes
None
  Start Date Start Time   End Date  End Time System          Error
0 2018-04-13   09:00:00 2018-04-15  09:00:00

In [54]:
# Create the datetime columns: join each date with its time as text, then parse the result
df2['Start Date'] = df2['Start Date'].astype(str)
df2['Start Time'] = df2['Start Time'].astype(str)
df2['Start Date / Time'] = pd.to_datetime(df2['Start Date'] + " " + df2['Start Time'])

df2['End Date'] = df2['End Date'].astype(str)
df2['End Time'] = df2['End Time'].astype(str)
df2['End Date / Time'] = pd.to_datetime(df2['End Date'] + " " + df2['End Time'])

df2.drop(columns=['Start Date', 'Start Time', 'End Date', 'End Time'], inplace=True)
print(df2.head(5))
print(df2.info())

  System          Error   Start Date / Time     End Date / Time
0  Stock  Planed Outage 2018-04-13 09:00:00 2018-04-15 09:00:00
1  Stock  Planed Outage 2018-05-13 09:00:00 2018-05-15 09:00:00
2  Stock  Planed Outage 2018-06-13 09:00:00 2018-06-15 09:00:00
3  Stock  Planed Outage 2018-07-13 09:00:00 2018-07-15 09:00:00
4  Stock  Planed Outage 2018-08-13 09:00:00 2018-08-15 09:00:00
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   System             6 non-null      object        
 1   Error              6 non-null      object        
 2   Start Date / Time  6 non-null      datetime64[ns]
 3   End Date / Time    6 non-null      datetime64[ns]
dtypes: datetime64[ns](2), object(2)
memory usage: 320.0+ bytes
None
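
The printed rows show the manual sheet spelling the error as "Planed Outage", while the automatic log uses "Planned Outage". If errors are later grouped by name, the two spellings would be counted separately. A minimal sketch of one way to align them is below; the `error_fixes` mapping is hypothetical and only covers the one misspelling visible in the output above, so any other variants in the sheet would need to be added by hand.

```python
import pandas as pd

# Two rows copied from the manual-capture output printed above.
manual = pd.DataFrame({
    "System": ["Stock", "Stock"],
    "Error": ["Planed Outage", "Planed Outage"],
})

# Hypothetical lookup of known misspellings -> spelling used by the automatic log.
error_fixes = {"Planed Outage": "Planned Outage"}

# Trim stray whitespace, then replace exact matches found in the lookup.
manual["Error"] = manual["Error"].str.strip().replace(error_fixes)

print(manual)
```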


In [55]:
# Concatenate the two tables
df3 = pd.concat([df, df2], ignore_index=True)
print(df3.head(5))
print(df3.info())

    Start Date / Time     End Date / Time System           Error
0 2019-04-13 07:55:23 2019-04-13 08:23:12  Sales       Disc full
1 2018-04-13 09:03:22 2018-04-15 09:03:21  Stock  Planned Outage
2 2018-05-13 09:03:22 2018-05-16 09:03:21  Stock  Planned Outage
3 2018-06-13 09:03:22 2018-06-15 09:03:21  Stock  Planned Outage
4 2018-07-12 23:03:22 2018-07-13 07:00:03  Stock  Planned Outage
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Start Date / Time  15 non-null     datetime64[ns]
 1   End Date / Time    15 non-null     datetime64[ns]
 2   System             15 non-null     object        
 3   Error              15 non-null     object        
dtypes: datetime64[ns](2), object(2)
memory usage: 608.0+ bytes
None


In [56]:
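# Remove duplicate rows.
# Note: drop_duplicates() only drops rows that match exactly on every column.
# All 15 rows survive here, because no manual row matches an automatic row exactly
# (the manual timestamps and the "Planed Outage" spelling differ from the automatic log).
df3 = df3.drop_duplicates()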
print(df3.head(5))
print(df3.info())

    Start Date / Time     End Date / Time System           Error
0 2019-04-13 07:55:23 2019-04-13 08:23:12  Sales       Disc full
1 2018-04-13 09:03:22 2018-04-15 09:03:21  Stock  Planned Outage
2 2018-05-13 09:03:22 2018-05-16 09:03:21  Stock  Planned Outage
3 2018-06-13 09:03:22 2018-06-15 09:03:21  Stock  Planned Outage
4 2018-07-12 23:03:22 2018-07-13 07:00:03  Stock  Planned Outage
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Start Date / Time  15 non-null     datetime64[ns]
 1   End Date / Time    15 non-null     datetime64[ns]
 2   System             15 non-null     object        
 3   Error              15 non-null     object        
dtypes: datetime64[ns](2), object(2)
memory usage: 600.0+ bytes
None
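
Since an exact-match comparison leaves all 15 rows in place, the requirement that "the automatic data should always overwrite the manual data" needs a looser rule. One possible sketch, under the assumption that "down at the same time" means the two time windows overlap for the same system, is shown below. The helper name `drop_manual_duplicates` and the `Error Source` column are hypothetical, and the demo rows are copied from the outputs printed earlier rather than from the full input.

```python
import pandas as pd

def drop_manual_duplicates(auto: pd.DataFrame, manual: pd.DataFrame) -> pd.DataFrame:
    """Keep every automatic row, plus manual rows that overlap no automatic row."""
    auto = auto.assign(**{"Error Source": "Automatic Error log"})
    manual = manual.assign(**{"Error Source": "Manual capture error list"})

    # Pair each manual row with every automatic row for the same system.
    # reset_index() keeps the manual row label in a column named "index".
    pairs = manual.reset_index().merge(auto, on="System", suffixes=("", "_auto"))

    # Two windows overlap when each one starts before the other ends.
    overlaps = (
        (pairs["Start Date / Time"] <= pairs["End Date / Time_auto"])
        & (pairs["Start Date / Time_auto"] <= pairs["End Date / Time"])
    )

    # Manual rows with at least one overlapping automatic row are duplicates.
    duplicated = pairs.loc[overlaps, "index"].unique()
    kept_manual = manual.drop(index=duplicated)
    return pd.concat([auto, kept_manual], ignore_index=True)

# Demo rows copied from the printed outputs above.
auto_demo = pd.DataFrame({
    "Start Date / Time": pd.to_datetime(["2018-04-13 09:03:22"]),
    "End Date / Time": pd.to_datetime(["2018-04-15 09:03:21"]),
    "System": ["Stock"],
    "Error": ["Planned Outage"],
})
manual_demo = pd.DataFrame({
    "System": ["Stock", "Stock"],
    "Error": ["Planed Outage", "Planed Outage"],
    "Start Date / Time": pd.to_datetime(["2018-04-13 09:00:00", "2018-08-13 09:00:00"]),
    "End Date / Time": pd.to_datetime(["2018-04-15 09:00:00", "2018-08-15 09:00:00"]),
})

result = drop_manual_duplicates(auto_demo, manual_demo)
print(result)
```

In the demo, the April manual row overlaps the April automatic row and is dropped, while the August manual row has no automatic counterpart in this small sample and is kept. Whether overlap is the right test for the real data is a judgment call; a stricter rule (for example, matching start times within a tolerance) would be a reasonable alternative.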
