# Individual Exploration Abby: Path 2 - Strategy Execution Analysis

# Path 2 - Strategy Execution Analysis
## INSY6500 - PM Analysis Project

**Purpose:** Explore whether any patterns emerge when the planned maintenance activities are compared with historical execution results.

**Datasets:**
- `101ki_pm_performance.csv` - 12-month historical performance (April 2024 - March 2025)
- `103ki_pm_forecast.csv` - 12-month PM forecast (April 2026 - March 2027)

---

### Step 1 Import and Set Up

In [1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

print("Libraries imported")

Libraries imported


### Step 2 Load Performance and Forecast Data

In [2]:
# Define data directory
DATA_DIR = Path('../data')

# Define file paths
PERFORMANCE_FILE = DATA_DIR / '101ki_pm_performance.csv'
FORECAST_FILE = DATA_DIR / '103ki_pm_forecast.csv'

# Load historical performance data
df_performance = pd.read_csv(PERFORMANCE_FILE)
print(f"Performance data shape: {df_performance.shape[0]:,} rows, {df_performance.shape[1]} columns")

# Load forecast (scheduled maintenance) data
df_forecast = pd.read_csv(FORECAST_FILE, 
                        encoding='cp1252',
                        parse_dates = ['DUE_DATE'],
                        dtype={
                            'INTERVAL': 'category',
                            'JOB_TYPE': 'category',
                            'LABOR_CRAFT': 'category',
                            'PMSCOPETYPE': 'category',
                            'DEPT': 'category',
                            'DEPT_NAME': 'category',
                            'DEPT_TYPE' : 'category',
                            'PLANT' : 'category',
                            'LINE' : 'category',
                            'ZONENAME' : 'category',
                            'PROCESSNAME' : 'category'
                        })
print(f"Forecast data shape: {df_forecast.shape[0]:,} rows, {df_forecast.shape[1]} columns")

Performance data shape: 18,476 rows, 7 columns
Forecast data shape: 99,983 rows, 23 columns


In [3]:
# Checking data upload
print(df_performance.info(), df_performance.isna().sum())
print(df_forecast.info(), df_forecast.isna().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18476 entries, 0 to 18475
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   PMNUM                18476 non-null  object 
 1   TIMES_SCHEDULED      18476 non-null  int64  
 2   TIMES_ONTIME         18476 non-null  int64  
 3   TIMES_LATE           18476 non-null  int64  
 4   TIMES_NOT_COMPLETED  18476 non-null  int64  
 5   AVG_PLANNED_HRS      18476 non-null  float64
 6   AVG_ACTUAL_HRS       18476 non-null  float64
dtypes: float64(2), int64(4), object(1)
memory usage: 1010.5+ KB
None PMNUM                  0
TIMES_SCHEDULED        0
TIMES_ONTIME           0
TIMES_LATE             0
TIMES_NOT_COMPLETED    0
AVG_PLANNED_HRS        0
AVG_ACTUAL_HRS         0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99983 entries, 0 to 99982
Data columns (total 23 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------    

### Step 3 Merging Data Sets & Verification

In [4]:
# Outer merge keeps PMNUMs that appear in only one data set, so no records are dropped
performance_forecast = pd.merge(df_performance, df_forecast, on='PMNUM', how='outer', indicator=True)

# Checking merged data set
performance_forecast.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102021 entries, 0 to 102020
Data columns (total 30 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   PMNUM                   102021 non-null  object        
 1   TIMES_SCHEDULED         96156 non-null   float64       
 2   TIMES_ONTIME            96156 non-null   float64       
 3   TIMES_LATE              96156 non-null   float64       
 4   TIMES_NOT_COMPLETED     96156 non-null   float64       
 5   AVG_PLANNED_HRS         96156 non-null   float64       
 6   AVG_ACTUAL_HRS          96156 non-null   float64       
 7   DUE_DATE                99983 non-null   datetime64[ns]
 8   COUNTKEY                99983 non-null   object        
 9   PMDESCRIPTION           99983 non-null   object        
 10  INTERVAL                99983 non-null   category      
 11  FORECASTJP              99910 non-null   object        
 12  JOB_TYPE                99906 

In [5]:
# Checking for lost keys
pmnum_lost_from_performance = set(df_performance["PMNUM"]) - set(performance_forecast["PMNUM"])
pmnum_lost_from_forecast   = set(df_forecast["PMNUM"]) - set(performance_forecast["PMNUM"])

print("Keys Lost from performance:", len(pmnum_lost_from_performance))
print("Keys Lost from forecast:", len(pmnum_lost_from_forecast))

# Comparing data set shapes 
print()
print("=" * 80)
print("Merged Data Shape:", f"   Rows: {performance_forecast.shape[0]:,}", f"   Columns: {performance_forecast.shape[1]}")
print("Original Performance Shape:", f"   Rows: {df_performance.shape[0]:,}", f"   Columns: {df_performance.shape[1]}")
print("Original Forecast Shape:", f"   Rows: {df_forecast.shape[0]:,}", f"   Columns: {df_forecast.shape[1]}")
print("=" * 80)


Keys Lost from performance: 0
Keys Lost from forecast: 0

Merged Data Shape:    Rows: 102,021    Columns: 30
Original Performance Shape:    Rows: 18,476    Columns: 7
Original Forecast Shape:    Rows: 99,983    Columns: 23


From previous data exploration, we know that:
* 2,038 PMNUMs from last year won't repeat next year (PMNUM only in Performance)
* 1,911 new PMNUMs in Forecast are not in the Performance data (PMNUM only in Forecast)
* Forecast data set has 23 columns and 99,983 rows
* Performance data set has 7 columns and 18,476 rows

When merging the data sets, we would expect:
* More rows than in either data set
* The number of columns to increase by 6 relative to Forecast (not including the merge indicator column)
* The number of rows to increase by 2,038 relative to Forecast (one row for each Performance-only PMNUM)

Merge Result Based on Shape:
* Columns increased by 6 (23 → 29, or 30 including the `_merge` indicator column)
* Rows increased by 2,038 (99,983 → 102,021)
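
The expected shape can be worked out from the key sets before merging. The sketch below uses two small made-up frames (`perf` and `fcst` are illustrative stand-ins, not the project data) and assumes `PMNUM` is unique in the performance table, which `validate="one_to_many"` checks.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative only)
perf = pd.DataFrame({"PMNUM": ["A", "B", "C"], "TIMES_SCHEDULED": [12, 4, 1]})
fcst = pd.DataFrame({"PMNUM": ["A", "A", "B", "D"],
                     "DUE_DATE": ["d1", "d2", "d3", "d4"]})

# Keys that exist only in the performance table (they become left_only rows)
perf_only = set(perf["PMNUM"]) - set(fcst["PMNUM"])

# With unique keys on the left, every forecast row appears exactly once
# and each performance-only key adds one extra row.
expected_rows = len(fcst) + len(perf_only)
# Shared key column is counted once; the indicator adds one column.
expected_cols = perf.shape[1] + fcst.shape[1] - 1 + 1

merged = pd.merge(perf, fcst, on="PMNUM", how="outer",
                  indicator=True, validate="one_to_many")

print(merged.shape, (expected_rows, expected_cols))
print(merged["_merge"].value_counts())
```

Applied to the figures above, the same arithmetic gives 99,983 + 2,038 = 102,021 rows and 7 + 23 - 1 + 1 = 30 columns.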

In [13]:
# Missing data verification 
orig_perf_missing   = df_performance.isna().sum()
orig_forecast_missing = df_forecast.isna().sum()

# PMNUMs with no forecast data (present only in the performance data set)
forecast_missing_mask = performance_forecast['_merge'] == 'left_only'
pmnum_missing_forecast = performance_forecast.loc[forecast_missing_mask, 'PMNUM']

# PMNUMs with no performance data (present only in the forecast data set)
performance_missing_mask = performance_forecast['_merge'] == 'right_only'
pmnum_missing_performance = performance_forecast.loc[performance_missing_mask, 'PMNUM']

# Missing values in the performance columns, restricted to rows that have performance data
performance_cols = df_performance.columns
perf_row_mask = performance_forecast["_merge"].isin(["both", "left_only"])
missing_performance =  performance_forecast.loc[perf_row_mask, performance_cols].isna().sum() 

# Checking other columns for missing values in the forecast dataset 
forecast_cols = df_forecast.columns
forecast_row_mask = performance_forecast["_merge"].isin(["both", "right_only"])
missing_forecast =  performance_forecast.loc[forecast_row_mask, forecast_cols].isna().sum()

print("Missing Data in Merged Dataset:")
print()
print("Number of Maintenance Identifiers Missing Forecast Data:", pmnum_missing_forecast.nunique())
print("Number of Maintenance Identifiers Missing Performance Data:", pmnum_missing_performance.nunique())
print()
print("Missing Data in Merged Dataset from Forecast Data:\n", missing_forecast[missing_forecast > 0])
print("For comparison Original Missing: \n", orig_forecast_missing[orig_forecast_missing > 0])
print()
print("Missing Data in Merged Dataset from Performance Data:\n", missing_performance)
print("For comparison Original Missing:\n", orig_perf_missing)


Missing Data in Merged Dataset:

Number of Maintenance Identifiers Missing Forecast Data: 2038
Number of Maintenance Identifiers Missing Performance Data: 1911

Missing Data in Merged Dataset from Forecast Data:
 FORECASTJP                   73
JOB_TYPE                     77
LABOR_CRAFT                7882
PLANNED_LABORERS            829
PLANNED_LABOR_HRS           829
TOTAL_MATERIAL_COST       98051
TOTAL_TASK_DESC_LENGTH      104
DEPT_NAME                 14654
LINE                       1727
PROCESSNAME               14404
dtype: int64
For comparison Original Missing: 
 FORECASTJP                   73
JOB_TYPE                     77
LABOR_CRAFT                7882
PLANNED_LABORERS            829
PLANNED_LABOR_HRS           829
TOTAL_MATERIAL_COST       98051
TOTAL_TASK_DESC_LENGTH      104
DEPT_NAME                 14654
LINE                       1727
PROCESSNAME               14404
dtype: int64

Missing Data in Merged Dataset from Performance Data:
 PMNUM                  0
TIMES

#### Takeaways 
* When restricted to the rows that come from each source, missing-value counts in the merged data set match those of the original data sets.
* No data was lost during the merge.
* Merge was successful.

### Step 4 Data Quality Analysis & Data Cleaning 

In [7]:
# Data Type verification 
performance_forecast.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102021 entries, 0 to 102020
Data columns (total 30 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   PMNUM                   102021 non-null  object        
 1   TIMES_SCHEDULED         96156 non-null   float64       
 2   TIMES_ONTIME            96156 non-null   float64       
 3   TIMES_LATE              96156 non-null   float64       
 4   TIMES_NOT_COMPLETED     96156 non-null   float64       
 5   AVG_PLANNED_HRS         96156 non-null   float64       
 6   AVG_ACTUAL_HRS          96156 non-null   float64       
 7   DUE_DATE                99983 non-null   datetime64[ns]
 8   COUNTKEY                99983 non-null   object        
 9   PMDESCRIPTION           99983 non-null   object        
 10  INTERVAL                99983 non-null   category      
 11  FORECASTJP              99910 non-null   object        
 12  JOB_TYPE                99906 

#### Takeaway
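* Data types match the expected types
* The `TIMES_*` count columns are now float64 instead of int64 because the outer merge introduced NaN values for forecast-only PMNUMs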
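
After an outer merge, integer columns that gain NaN values are upcast to float64, as the `TIMES_*` columns are in the `info()` output above. If integer counts are preferred, one option is pandas' nullable `Int64` dtype. This is a minimal sketch on a made-up series, not a step the notebook performs.

```python
import numpy as np
import pandas as pd

# Toy column standing in for TIMES_SCHEDULED after the outer merge
counts = pd.Series([12.0, 4.0, np.nan], name="TIMES_SCHEDULED")

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>
counts_int = counts.astype("Int64")

print(counts.dtype, "->", counts_int.dtype)
print(counts_int.isna().sum())
```

The conversion only works when the non-missing values are whole numbers, which holds for counts.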

In [14]:
# Missing Data Analysis
missing_mask = performance_forecast.isna()
missing = missing_mask.sum().to_frame(name="count")
missing['percentage'] = missing['count'] / performance_forecast.shape[0]

missing.style.format({'percentage': "{:.1%}"})

Unnamed: 0,count,percentage
PMNUM,0,0.0%
TIMES_SCHEDULED,5865,5.7%
TIMES_ONTIME,5865,5.7%
TIMES_LATE,5865,5.7%
TIMES_NOT_COMPLETED,5865,5.7%
AVG_PLANNED_HRS,5865,5.7%
AVG_ACTUAL_HRS,5865,5.7%
DUE_DATE,2038,2.0%
COUNTKEY,2038,2.0%
PMDESCRIPTION,2038,2.0%


#### Takeaways
* The missing data percentage is relatively low, between 2.0% and 6.0% for most columns
* Process Name and Department Name have about 16% missing data
    * Both are descriptive name strings, so they are less important for analysis
    * Still usable for analysis
    * Other department fields (such as `DEPT` and `DEPT_TYPE`) have lower missing percentages and can be used for analysis and to fill in missing department names
* Labor Craft has about 10% missing data
    * Categorizes the labor skill/craft required to perform the maintenance
* Total Material Cost has 98% of the data missing
    * Unusable
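
Filling missing department names from a more complete department field could look like the sketch below. It runs on made-up data and assumes each `DEPT` code maps to exactly one `DEPT_NAME`, which has not been verified here. In the real frame both columns are categorical, so they may need to be cast to strings (or have categories added) first.

```python
import pandas as pd

# Toy data: DEPT is complete, DEPT_NAME has gaps (values are illustrative)
df_demo = pd.DataFrame({
    "DEPT": ["10", "10", "20", "20", "30"],
    "DEPT_NAME": ["Weld", None, "Paint", None, None],
})

# Build a DEPT -> DEPT_NAME lookup from rows where the name is known
lookup = (
    df_demo.dropna(subset=["DEPT_NAME"])
           .drop_duplicates(subset="DEPT")
           .set_index("DEPT")["DEPT_NAME"]
)

# Fill only the missing names; codes with no known name stay missing
df_demo["DEPT_NAME"] = df_demo["DEPT_NAME"].fillna(df_demo["DEPT"].map(lookup))

print(df_demo)
```

Codes that never appear with a name (here `30`) remain missing, so the fill cannot invent values.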

In [15]:
# Duplication detection 
performance_forecast_dups = performance_forecast.nunique().to_frame(name = 'unique_vals')
performance_forecast_dups['duplicates'] = performance_forecast.shape[0] - performance_forecast_dups['unique_vals']
performance_forecast_dups['percentage'] = performance_forecast_dups['duplicates'] / performance_forecast.shape[0]

performance_forecast_dups.style.format({'percentage': "{:.1%}"})

Unnamed: 0,unique_vals,duplicates,percentage
PMNUM,20387,81634,80.0%
TIMES_SCHEDULED,50,101971,100.0%
TIMES_ONTIME,67,101954,99.9%
TIMES_LATE,26,101995,100.0%
TIMES_NOT_COMPLETED,10,102011,100.0%
AVG_PLANNED_HRS,389,101632,99.6%
AVG_ACTUAL_HRS,1381,100640,98.6%
DUE_DATE,364,101657,99.6%
COUNTKEY,89264,12757,12.5%
PMDESCRIPTION,17383,84638,83.0%


#### Takeaways 
* Due to the nature of the forecast data, duplication is expected
* Duplicates are kept: the data is a maintenance schedule, so the same maintenance task recurring on multiple dates is reasonable
* The maintenance due date and interval frequency can be used to distinguish repeated rows for the same preventive maintenance identifier (PMNUM)
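
One way to separate expected repeats from possible true duplicates is to test a composite key instead of `PMNUM` alone. The sketch below uses made-up rows and assumes that `(PMNUM, DUE_DATE)` should identify a scheduled occurrence. That assumption has not been checked against the real data.

```python
import pandas as pd

# Toy schedule: PM1 recurs, and one of its occurrences is listed twice
demo = pd.DataFrame({
    "PMNUM": ["PM1", "PM1", "PM1", "PM2"],
    "DUE_DATE": pd.to_datetime(
        ["2026-04-01", "2026-05-01", "2026-05-01", "2026-04-01"]
    ),
})

# A repeated PMNUM is expected for recurring maintenance
repeated_pmnum = int(demo.duplicated(subset=["PMNUM"]).sum())

# A repeated (PMNUM, DUE_DATE) pair is a candidate true duplicate
repeated_pair = int(demo.duplicated(subset=["PMNUM", "DUE_DATE"]).sum())

print(repeated_pmnum, repeated_pair)
```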

In [16]:
# Outlier Identification
# time min and max
print(performance_forecast['DUE_DATE'].min())
print(performance_forecast['DUE_DATE'].max())
print("Number of Unique Due Dates\n",performance_forecast['DUE_DATE'].nunique())

# Interval Examination And Cleaning 
performance_forecast["INTERVAL"].cat.categories
performance_forecast["INTERVAL"].cat.ordered

# Cleaning Interval Categorical Data
UNIT_TO_DAYS = {
    "DAYS": 1,
    "WEEKS": 7,
    "MONTHS": 30,
    "YEARS": 365,
}

def interval_to_days(value):
    try:
        num, unit = str(value).split("-")
        unit = unit.upper()
        return int(num) * UNIT_TO_DAYS[unit]
    except Exception:
        return float("nan")

df = performance_forecast.copy()

# 1. Compute duration in days
df["INTERVAL_DAYS"] = df["INTERVAL"].map(interval_to_days)

# 2. Build an ordered, UNIQUE, NON-NULL list of categories from shortest → longest
ordered_intervals = (
    df.loc[df["INTERVAL"].notna()]              # drop nulls
      .sort_values("INTERVAL_DAYS")            # shortest → longest
      ["INTERVAL"]
      .drop_duplicates()                       # make categories unique
      .tolist()
)

# 3. Make INTERVAL an ordered categorical using that list
df["INTERVAL"] = pd.Categorical(
    df["INTERVAL"],
    categories=ordered_intervals,
    ordered=True
)

# 4. drop helper column
df = df.drop(columns=["INTERVAL_DAYS"])

# 5. Assign back to main df
performance_forecast = df

print(performance_forecast["INTERVAL"].cat.categories)
performance_forecast["INTERVAL"].nunique()

2026-04-01 00:00:00
2027-03-30 00:00:00
Number of Unique Due Dates
 364
Index(['1-DAYS', '6-DAYS', '1-WEEKS', '7-DAYS', '2-WEEKS', '14-DAYS',
       '21-DAYS', '4-WEEKS', '28-DAYS', '1-MONTHS', '32-DAYS', '5-WEEKS',
       '40-DAYS', '6-WEEKS', '45-DAYS', '2-MONTHS', '60-DAYS', '10-WEEKS',
       '12-WEEKS', '3-MONTHS', '13-WEEKS', '15-WEEKS', '4-MONTHS', '5-MONTHS',
       '6-MONTHS', '180-DAYS', '7-MONTHS', '8-MONTHS', '9-MONTHS', '10-MONTHS',
       '11-MONTHS', '12-MONTHS', '52-WEEKS', '1-YEARS', '17-MONTHS',
       '18-MONTHS', '2-YEARS', '26-MONTHS', '36-MONTHS', '3-YEARS', '4-YEARS',
       '5-YEARS', '7-YEARS', '8-YEARS', '10-YEARS', '15-YEARS', '20-YEARS'],
      dtype='object')


47
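
The category list above still contains labels that describe the same duration, such as `1-WEEKS` and `7-DAYS`. A minimal sketch for finding such groups is shown below. It reuses the same approximate conversion factors as `UNIT_TO_DAYS` and runs on a hand-picked subset of the labels.

```python
from collections import defaultdict

# Same approximate conversion factors as UNIT_TO_DAYS in the cell above
UNIT_TO_DAYS = {"DAYS": 1, "WEEKS": 7, "MONTHS": 30, "YEARS": 365}

def interval_to_days(label):
    """Convert a label such as '2-WEEKS' to an approximate number of days."""
    num, unit = label.split("-")
    return int(num) * UNIT_TO_DAYS[unit.upper()]

# Hand-picked subset of the interval labels
labels = ["1-WEEKS", "7-DAYS", "4-WEEKS", "28-DAYS",
          "1-MONTHS", "12-MONTHS", "1-YEARS"]

# Group labels by day count to find ones that describe the same duration
by_days = defaultdict(list)
for label in labels:
    by_days[interval_to_days(label)].append(label)

equivalent = {days: names for days, names in by_days.items() if len(names) > 1}
print(equivalent)
```

Under the 30-day month approximation, `12-MONTHS` (360 days) and `1-YEARS` (365 days) are not grouped even though they may describe the same schedule, so merging labels would need a manual mapping for such cases.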

#### Takeaways:
* Forecast due dates span one year (2026-04-01 to 2027-03-30, 364 unique dates)
* Interval labels were unordered and include equivalent durations under different labels (example: `7-DAYS` and `1-WEEKS`)
* `INTERVAL` was converted to an ordered categorical, sorted from shortest interval to longest interval; equivalent labels were kept as separate categories
    * Shortest maintenance interval is 1 day (task is done every day)
    * Longest maintenance interval is 20 years (task is done every 20 years)
* Number of unique intervals (47) matches that of the original data set
* A 1-day minimum interval over a 12-month window bounds the valid range for completion counts in the performance data set
    * Times scheduled cannot be more than 365 or a negative number
* Other range constraints
    *  `Times_Ontime + Times_Late + Times_Not_Completed ≤ Times_Scheduled`
    *  `Times_Not_Completed  ≤ Times_Scheduled - Times_Ontime - Times_Late`
    *  `Times_Ontime ≤ Times_Scheduled - Times_Not_Completed  - Times_Late`
    *  `Times_Late ≤ Times_Scheduled - Times_Not_Completed  - Times_Ontime`
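
The last three constraints are rearrangements of the first, so a single check covers all four. The sketch below confirms this by brute force on small non-negative counts; the helper name `checks` is illustrative.

```python
from itertools import product

def checks(scheduled, ontime, late, not_completed):
    """Return the four constraints from the list above as booleans."""
    return (
        ontime + late + not_completed <= scheduled,
        not_completed <= scheduled - ontime - late,
        ontime <= scheduled - not_completed - late,
        late <= scheduled - not_completed - ontime,
    )

# Exhaustively compare the four checks on all combinations of counts 0..5
all_agree = all(
    len(set(checks(s, o, l, n))) == 1
    for s, o, l, n in product(range(6), repeat=4)
)
print(all_agree)
```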


In [17]:
# Completion Numbers
print("\n Summary Statistics:")
display(performance_forecast[['TIMES_SCHEDULED', 'TIMES_ONTIME', 'TIMES_LATE', 'TIMES_NOT_COMPLETED']].describe())

# Range Validation Completion Numbers
# Times Scheduled
valid_scheduled_mask = performance_forecast['TIMES_SCHEDULED'].between(0, 365)

invalid_rows = performance_forecast.loc[~valid_scheduled_mask]
print("Number of Invalid Rows for Times Scheduled:", invalid_rows['PMNUM'].nunique())

# Total outcome constraint

outcomes_total = df['TIMES_ONTIME'] + df['TIMES_LATE'] + df['TIMES_NOT_COMPLETED']
valid_total_mask = outcomes_total <= df['TIMES_SCHEDULED']

invalid_total_rows = df.loc[~valid_total_mask]
print("Number of Invalid Rows for Total Outcome Constraint:", invalid_total_rows['PMNUM'].nunique())

# Not Completed Validation 
valid_not_completed_mask = df['TIMES_NOT_COMPLETED'] <= (
    df['TIMES_SCHEDULED'] - df['TIMES_ONTIME'] - df['TIMES_LATE'])
invalid_not_completed_rows = df.loc[~valid_not_completed_mask]
print("Number of Invalid Rows for Not Completed Constraint:", invalid_not_completed_rows['PMNUM'].nunique())

# Ontime Validation
valid_ontime_mask = performance_forecast['TIMES_ONTIME'] <= (
    df['TIMES_SCHEDULED'] - df['TIMES_NOT_COMPLETED'] - df['TIMES_LATE'])
invalid_ontime_rows = df.loc[~valid_ontime_mask]
print("Number of Invalid Rows for Ontime Constraint:", invalid_ontime_rows['PMNUM'].nunique())

# Late Validation 
valid_late_mask = df['TIMES_LATE'] <= (
    df['TIMES_SCHEDULED'] - df['TIMES_NOT_COMPLETED'] - df['TIMES_ONTIME'])
invalid_late_rows = df.loc[~valid_late_mask]
print("Number of Invalid Rows for Late Constraint:", invalid_late_rows['PMNUM'].nunique())



 Summary Statistics:


Unnamed: 0,TIMES_SCHEDULED,TIMES_ONTIME,TIMES_LATE,TIMES_NOT_COMPLETED
count,96156.0,96156.0,96156.0,96156.0
mean,31.064583,29.568462,1.335413,0.160708
std,67.752496,66.67972,2.963758,0.724168
min,1.0,0.0,0.0,0.0
25%,4.0,3.0,0.0,0.0
50%,12.0,10.0,0.0,0.0
75%,13.0,12.0,2.0,0.0
max,365.0,363.0,67.0,9.0


Number of Invalid Rows for Times Scheduled: 1911
Number of Invalid Rows for Total Outcome Constraint: 1911
Number of Invalid Rows for Not Completed Constraint: 1911
Number of Invalid Rows for Ontime Constraint: 1911
Number of Invalid Rows for Late Constraint: 1911


#### Takeaways
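* This is expected: the 1,911 flagged PMNUMs are the ones unique to the forecast data set. They have no performance data, so their count columns are NaN and every comparison evaluates to False
* No PMNUM with performance data violates a constraint, so the completion counts are valid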
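
To keep forecast-only rows from being counted as invalid, the check can be restricted to rows that actually have performance data. This is a sketch on made-up rows, not the notebook's own validation; `PM2` is deliberately constructed to violate the constraint.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "PMNUM": ["PM1", "PM2", "PM3"],
    "TIMES_SCHEDULED": [12.0, 4.0, np.nan],   # PM3 has no performance history
    "TIMES_ONTIME": [10.0, 4.0, np.nan],
    "TIMES_LATE": [2.0, 1.0, np.nan],
    "TIMES_NOT_COMPLETED": [0.0, 0.0, np.nan],
})

has_perf = demo["TIMES_SCHEDULED"].notna()
outcomes = demo[["TIMES_ONTIME", "TIMES_LATE", "TIMES_NOT_COMPLETED"]].sum(axis=1)

# Comparisons against NaN evaluate to False, so require performance data explicitly
violates = has_perf & (outcomes > demo["TIMES_SCHEDULED"])

real_violations = demo.loc[violates, "PMNUM"].tolist()
skipped = int((~has_perf).sum())
print(real_violations, skipped)
```

Reporting the skipped count separately keeps "no data" and "bad data" from being mixed in one number.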

In [21]:
# Performance Hours
print("\n Summary Statistics:")
display(performance_forecast[['AVG_PLANNED_HRS', 'AVG_ACTUAL_HRS', 'PLANNED_LABORERS', 'PLANNED_LABOR_HRS']].describe())


 Summary Statistics:


Unnamed: 0,AVG_PLANNED_HRS,AVG_ACTUAL_HRS,PLANNED_LABORERS,PLANNED_LABOR_HRS
count,96156.0,96156.0,99154.0,99154.0
mean,2.065262,1.87265,1.083264,1.758899
std,9.733242,10.183173,0.867851,9.36943
min,0.0,0.0,1.0,0.0
25%,0.5,0.5,1.0,0.5
50%,1.0,0.916667,1.0,1.0
75%,2.0,1.683974,1.0,1.75
max,499.75,591.577778,20.0,499.75


#### Takeaways
* Planned Laborers look OK
    * It is believable that some bigger jobs require 20 people
* The max for the average hour columns looks off
* The max planned labor hours looks off

In [52]:
# Visualizing the max value rows
max_hours_mask = (
    (df['PLANNED_LABOR_HRS'] == df['PLANNED_LABOR_HRS'].max()) |
    (df['AVG_PLANNED_HRS'] == df['AVG_PLANNED_HRS'].max()) |
    (df['AVG_ACTUAL_HRS'] == df['AVG_ACTUAL_HRS'].max()))

rows_with_max_hours = performance_forecast.loc[max_hours_mask, ['PMNUM','AVG_PLANNED_HRS', 'AVG_ACTUAL_HRS', 'PLANNED_LABORERS','PLANNED_LABOR_HRS']]
display(rows_with_max_hours)


Unnamed: 0,PMNUM,AVG_PLANNED_HRS,AVG_ACTUAL_HRS,PLANNED_LABORERS,PLANNED_LABOR_HRS
76050,PM165035,499.75,591.577778,1.0,499.75
76051,PM165035,499.75,591.577778,1.0,499.75
76052,PM165035,499.75,591.577778,1.0,499.75
76053,PM165035,499.75,591.577778,1.0,499.75
76054,PM165035,499.75,591.577778,1.0,499.75
76055,PM165035,499.75,591.577778,1.0,499.75
76056,PM165035,499.75,591.577778,1.0,499.75
76057,PM165035,499.75,591.577778,1.0,499.75
76058,PM165035,499.75,591.577778,1.0,499.75
76059,PM165035,499.75,591.577778,1.0,499.75


#### Takeaways
* Only one PMNUM is responsible for the unusual values
* With only one planned laborer, a value this large is probably invalid

In [39]:
# Investigating PMNUM with large labor hours
description = performance_forecast.loc[
    performance_forecast['PMNUM'] == 'PM165035',['PMDESCRIPTION', 'INTERVAL', 'JOB_TYPE', 'TASK_COUNT', 'PMSCOPETYPE', 'LOCATION', 'LOCATIONDESC']]

display(description)

Unnamed: 0,PMDESCRIPTION,INTERVAL,JOB_TYPE,TASK_COUNT,PMSCOPETYPE,LOCATION,LOCATIONDESC
76050,"MONTHLY, PA2 PRODUCTION SUPPORT",1-MONTHS,NONASSET,4.0,LOCATION,2PA2OPMIGEGE,Miscellaneous General General
76051,"MONTHLY, PA2 PRODUCTION SUPPORT",1-MONTHS,NONASSET,4.0,LOCATION,2PA2OPMIGEGE,Miscellaneous General General
76052,"MONTHLY, PA2 PRODUCTION SUPPORT",1-MONTHS,NONASSET,4.0,LOCATION,2PA2OPMIGEGE,Miscellaneous General General
76053,"MONTHLY, PA2 PRODUCTION SUPPORT",1-MONTHS,NONASSET,4.0,LOCATION,2PA2OPMIGEGE,Miscellaneous General General
76054,"MONTHLY, PA2 PRODUCTION SUPPORT",1-MONTHS,NONASSET,4.0,LOCATION,2PA2OPMIGEGE,Miscellaneous General General
76055,"MONTHLY, PA2 PRODUCTION SUPPORT",1-MONTHS,NONASSET,4.0,LOCATION,2PA2OPMIGEGE,Miscellaneous General General
76056,"MONTHLY, PA2 PRODUCTION SUPPORT",1-MONTHS,NONASSET,4.0,LOCATION,2PA2OPMIGEGE,Miscellaneous General General
76057,"MONTHLY, PA2 PRODUCTION SUPPORT",1-MONTHS,NONASSET,4.0,LOCATION,2PA2OPMIGEGE,Miscellaneous General General
76058,"MONTHLY, PA2 PRODUCTION SUPPORT",1-MONTHS,NONASSET,4.0,LOCATION,2PA2OPMIGEGE,Miscellaneous General General
76059,"MONTHLY, PA2 PRODUCTION SUPPORT",1-MONTHS,NONASSET,4.0,LOCATION,2PA2OPMIGEGE,Miscellaneous General General


#### Takeaways
* On closer inspection, this looks like a catch-all code for general production support tasks performed every month in this area, not tied to one tracked asset.
* So it is possible that this is a valid data point, given that it is a catch-all code and the facility runs 24 hours a day in shifts.
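
A rough plausibility check is to convert the monthly totals into hours per day. The sketch below assumes a 30-day month (the same approximation used in `UNIT_TO_DAYS`) and treats the listed hours as a monthly total, which is an interpretation rather than something the data confirms.

```python
DAYS_PER_MONTH = 30  # assumption: same approximation as UNIT_TO_DAYS

def hours_per_day(monthly_hours, laborers=1):
    """Average hours per laborer per day implied by a monthly total."""
    return monthly_hours / (DAYS_PER_MONTH * laborers)

# Values for PM165035 from the table above
for label, hours in [("planned", 499.75), ("actual", 591.577778)]:
    print(f"{label}: {hours_per_day(hours):.1f} h/day")
```

Both results stay below 24 hours per day, which is consistent with round-the-clock shift coverage but not with a single person doing the work.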

In [47]:
# Looking at hours with the max-value rows excluded
print("\n Summary Statistics:")
display(performance_forecast.loc[~max_hours_mask, ['AVG_PLANNED_HRS', 'AVG_ACTUAL_HRS', 'PLANNED_LABORERS','PLANNED_LABOR_HRS']].describe())

# investigating 250 planned hours 
mask_hours = performance_forecast['PLANNED_LABOR_HRS'] == 250
performance_forecast.loc[mask_hours, 'PMNUM'].unique()
df_2 = performance_forecast.loc[mask_hours, ['PMNUM', 'PMDESCRIPTION', 'INTERVAL', 'JOB_TYPE', 'TASK_COUNT', 'PMSCOPETYPE', 'LOCATION', 'LOCATIONDESC']]
df_2.nunique() 
df_2 = df_2.drop_duplicates(subset='PMNUM', keep='first')
df_2



 Summary Statistics:


Unnamed: 0,AVG_PLANNED_HRS,AVG_ACTUAL_HRS,PLANNED_LABORERS,PLANNED_LABOR_HRS
count,96144.0,96144.0,99142.0,99142.0
mean,2.003144,1.799048,1.083274,1.698623
std,7.98929,7.765319,0.867903,7.601048
min,0.0,0.0,1.0,0.0
25%,0.5,0.5,1.0,0.5
50%,1.0,0.916667,1.0,1.0
75%,2.0,1.683974,1.0,1.75
max,250.0,440.083333,20.0,250.0


Unnamed: 0,PMNUM,PMDESCRIPTION,INTERVAL,JOB_TYPE,TASK_COUNT,PMSCOPETYPE,LOCATION,LOCATIONDESC
27319,PM107977,"MONTHLY, TC DUTIES A CREW",1-MONTHS,NONASSET,4.0,ASSET,1PA1MGMIGEMA,GENERAL MAINTENANCE
76022,PM164972,"MONTHLY, TC DUTIES B CREW",1-MONTHS,NONASSET,4.0,ASSET,1PA1MGMIGEMA,GENERAL MAINTENANCE
76034,PM164973,"MONTHLY, TC DUTIES C CREW",1-MONTHS,NONASSET,4.0,ASSET,1PA1MGMIGEMA,GENERAL MAINTENANCE
98110,PM192293,"MONTHLY, TC DUTIES D CREW",1-MONTHS,NONASSET,4.0,ASSET,1PA1MGMIGEMA,GENERAL MAINTENANCE


#### Takeaways
* This is likely general asset inspection with minor mechanical/operational adjustments to maintain functional reliability and prevent degradation.
* So it is possible that this is a valid data point, given that it is a general maintenance code and the facility likely runs 24 hours a day in shifts.

In [50]:
# Visualizing the Max laborer data
max_laborer_mask = (df['PLANNED_LABORERS'] == df['PLANNED_LABORERS'].max())

rows_with_max_laborer = performance_forecast.loc[max_laborer_mask, ['PMNUM', 'PLANNED_LABORERS','PLANNED_LABOR_HRS']]
display(rows_with_max_laborer['PMNUM'].unique())

df_3 = performance_forecast.loc[max_laborer_mask, ['PMNUM', 'PMDESCRIPTION', 'INTERVAL', 'JOB_TYPE', 'TASK_COUNT', 'PMSCOPETYPE', 'LOCATION', 'LOCATIONDESC']]
df_3.nunique() 
df_3 = df_3.drop_duplicates(subset='PMNUM', keep='first')
df_3

array(['PM187368', 'PM187369', 'PM187370', 'PM187371', 'PM187372',
       'PM187373', 'PM187374', 'PM187375', 'PM187376', 'PM187377',
       'PM187378', 'PM187379', 'PM189725', 'PM189727', 'PM189728',
       'PM189729'], dtype=object)

Unnamed: 0,PMNUM,PMDESCRIPTION,INTERVAL,JOB_TYPE,TASK_COUNT,PMSCOPETYPE,LOCATION,LOCATIONDESC
93817,PM187368,A-TEAM A ZONE DAILY EQUIPMENT INSPECTION,1-MONTHS,INSPECTION,14.0,LOCATION,1WE1A0,A0 Zone
93829,PM187369,B-TEAM A ZONE DAILY EQUIPMENT INSPECTION,1-MONTHS,INSPECTION,14.0,LOCATION,1WE1A0,A0 Zone
93841,PM187370,C-TEAM A ZONE DAILY EQUIPMENT INSPECTION,1-MONTHS,INSPECTION,14.0,LOCATION,1WE1A0,A0 Zone
93853,PM187371,A-TEAM B1 ZONE DAILY EQUIPMENT INSPECTION,1-MONTHS,INSPECTION,14.0,LOCATION,1WE1B1,B1 Zone
93865,PM187372,B-TEAM B1 ZONE DAILY EQUIPMENT INSPECTION,1-MONTHS,INSPECTION,14.0,LOCATION,1WE1B1,B1 Zone
93877,PM187373,C-TEAM B1 ZONE DAILY EQUIPMENT INSPECTION,1-MONTHS,INSPECTION,14.0,LOCATION,1WE1B1,B1 Zone
93889,PM187374,A-TEAM B2 ZONE DAILY EQUIPMENT INSPECTION,1-MONTHS,INSPECTION,14.0,LOCATION,1WE1B2,B2 Zone
93901,PM187375,B-TEAM B2 ZONE DAILY EQUIPMENT INSPECTION,1-MONTHS,INSPECTION,14.0,LOCATION,1WE1B2,B2 Zone
93913,PM187376,C-TEAM B2 ZONE DAILY EQUIPMENT INSPECTION,1-MONTHS,INSPECTION,14.0,LOCATION,1WE1B2,B2 Zone
93925,PM187377,A-TEAM D ZONE DAILY EQUIPMENT INSPECTION,1-MONTHS,INSPECTION,14.0,LOCATION,1WE1D0,D0Â Zone


#### Takeaways
* The maintenance tasks with the most planned laborers are all team-based inspections
* It is plausible that the teams have 20 members