# Data Loading & Initial Reconnaissance
## INSY6500 - PM Analysis Project

**Purpose:** Load and perform initial exploration of PM forecast and performance datasets

**Datasets:**
- `101ki_pm_performance.csv` - 12-month historical performance (April 2024 - March 2025)
- `103ki_pm_forecast.csv` - 12-month PM forecast (April 2026 - March 2027)

---

## 1. Setup & Imports

In [11]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

print("Libraries imported")

Libraries imported


## 2. Define Data Paths

In [12]:
# Define data directory
DATA_DIR = Path('../data')

# Define file paths
FORECAST_FILE = DATA_DIR / '103ki_pm_forecast.csv'
PERFORMANCE_FILE = DATA_DIR / '101ki_pm_performance.csv'
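
A quick existence check before loading can surface a wrong working directory with a clearer message than a `read_csv` traceback. The sketch below is one possible way to do that; `missing_inputs` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

def missing_inputs(paths):
    """Return the paths that do not point to an existing file (hypothetical helper)."""
    return [p for p in paths if not Path(p).is_file()]

# Same paths as defined in the cell above
DATA_DIR = Path('../data')
FORECAST_FILE = DATA_DIR / '103ki_pm_forecast.csv'
PERFORMANCE_FILE = DATA_DIR / '101ki_pm_performance.csv'

not_found = missing_inputs([FORECAST_FILE, PERFORMANCE_FILE])
if not_found:
    print("Missing input files:", ", ".join(str(p) for p in not_found))
else:
    print("All input files found")
```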

## 3. Load Data

### 3.1 Load Performance Data (101ki)
Historical performance metrics from the previous fiscal year.

In [13]:
# Load Historical performance data
df_performance = pd.read_csv(PERFORMANCE_FILE)
print(f"Performance data shape: {df_performance.shape[0]:,} rows, {df_performance.shape[1]} columns")

Performance data shape: 18,476 rows, 7 columns


### 3.2 Load Forecast Data (103ki)
Scheduled PM activities for the coming fiscal year.

In [52]:
# Load future workload forecast data
df_forecast = pd.read_csv(
    FORECAST_FILE,
    encoding='cp1252',
    parse_dates=['DUE_DATE'],
    # Low-cardinality text columns are read as categoricals
    dtype={
        'INTERVAL': 'category',
        'JOB_TYPE': 'category',
        'LABOR_CRAFT': 'category',
        'PMSCOPETYPE': 'category',
        'DEPT': 'category',
        'DEPT_NAME': 'category',
        'DEPT_TYPE': 'category',
        'PLANT': 'category',
        'LINE': 'category',
        'ZONENAME': 'category',
        'PROCESSNAME': 'category',
    },
)
print(f"Forecast data shape: {df_forecast.shape[0]:,} rows, {df_forecast.shape[1]} columns")

Forecast data shape: 131,397 rows, 22 columns


## 4. Initial Reconnaissance

### 4.1 Performance Dataset Overview

In [53]:
print("=" * 80)
print("PERFORMANCE DATASET (101ki) - INITIAL OVERVIEW")
print("=" * 80)

print("\n -> DataFrame Shape:")
print(f"   Rows: {df_performance.shape[0]:,}")
print(f"   Columns: {df_performance.shape[1]}")

print("\n -> Column Names:")
for i, col in enumerate(df_performance.columns, 1):
    print(f"   {i:2d}. {col}")

print("\n -> Head:")
display(df_performance.head())

PERFORMANCE DATASET (101ki) - INITIAL OVERVIEW

 -> DataFrame Shape:
   Rows: 18,476
   Columns: 7

 -> Column Names:
    1. PMNUM
    2. TIMES_SCHEDULED
    3. TIMES_ONTIME
    4. TIMES_LATE
    5. TIMES_NOT_COMPLETED
    6. AVG_PLANNED_HRS
    7. AVG_ACTUAL_HRS

 -> Head:


| | PMNUM | TIMES_SCHEDULED | TIMES_ONTIME | TIMES_LATE | TIMES_NOT_COMPLETED | AVG_PLANNED_HRS | AVG_ACTUAL_HRS |
|---|---|---|---|---|---|---|---|
| 0 | PM104956 | 12 | 3 | 0 | 9 | 5.0 | 1.208333 |
| 1 | PM105442 | 12 | 4 | 0 | 8 | 0.5 | 0.166667 |
| 2 | PM105540 | 52 | 37 | 7 | 8 | 1.0 | 0.798077 |
| 3 | PM178837 | 8 | 0 | 0 | 8 | 1.0 | 0.0 |
| 4 | PM106181 | 364 | 350 | 7 | 7 | 0.5 | 0.578297 |


In [54]:
# Data types and non-null counts
print("\n Data Types and Non-Null Counts:")
df_performance.info()


 Data Types and Non-Null Counts:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18476 entries, 0 to 18475
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   PMNUM                18476 non-null  object 
 1   TIMES_SCHEDULED      18476 non-null  int64  
 2   TIMES_ONTIME         18476 non-null  int64  
 3   TIMES_LATE           18476 non-null  int64  
 4   TIMES_NOT_COMPLETED  18476 non-null  int64  
 5   AVG_PLANNED_HRS      18476 non-null  float64
 6   AVG_ACTUAL_HRS       18476 non-null  float64
dtypes: float64(2), int64(4), object(1)
memory usage: 1010.5+ KB


In [55]:
# Summary statistics for numeric columns
print("\n Summary Statistics:")
display(df_performance.describe())


 Summary Statistics:


| | TIMES_SCHEDULED | TIMES_ONTIME | TIMES_LATE | TIMES_NOT_COMPLETED | AVG_PLANNED_HRS | AVG_ACTUAL_HRS |
|---|---|---|---|---|---|---|
| count | 18476.0 | 18476.0 | 18476.0 | 18476.0 | 18476.0 | 18476.0 |
| mean | 4.866313 | 4.370264 | 0.450693 | 0.045356 | 2.026474 | 1.749821 |
| std | 11.684318 | 11.249047 | 1.633679 | 0.324547 | 7.647785 | 7.180505 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 1.0 | 0.0 | 0.0 | 0.5 | 0.5 |
| 50% | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 75% | 4.0 | 4.0 | 0.0 | 0.0 | 2.0 | 1.682244 |
| max | 365.0 | 363.0 | 67.0 | 9.0 | 499.75 | 591.577778 |
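
The column means above add up (4.370264 + 0.450693 + 0.045356 = 4.866313), which suggests that the three outcome counts partition `TIMES_SCHEDULED`. That relationship is an inference from the output, not something the source data dictionary confirms. A minimal sketch of a row-level check, run here on a toy frame built from the five `head()` rows:

```python
import pandas as pd

# Toy frame copied from the head() output of the performance data
sample = pd.DataFrame({
    'PMNUM': ['PM104956', 'PM105442', 'PM105540', 'PM178837', 'PM106181'],
    'TIMES_SCHEDULED': [12, 12, 52, 8, 364],
    'TIMES_ONTIME': [3, 4, 37, 0, 350],
    'TIMES_LATE': [0, 0, 7, 0, 7],
    'TIMES_NOT_COMPLETED': [9, 8, 8, 8, 7],
})

outcome_cols = ['TIMES_ONTIME', 'TIMES_LATE', 'TIMES_NOT_COMPLETED']

# Assumption: on-time + late + not-completed should equal times scheduled
outcome_total = sample[outcome_cols].sum(axis=1)
mismatch = sample[outcome_total != sample['TIMES_SCHEDULED']]

print(f"Rows where outcomes do not sum to TIMES_SCHEDULED: {len(mismatch)}")
```

Swapping `sample` for `df_performance` would apply the same check to the full dataset; any rows returned in `mismatch` would be worth a closer look before computing rates.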


### 4.2 Forecast Dataset Overview

In [56]:
print("=" * 80)
print("FORECAST DATASET (103ki) - INITIAL OVERVIEW")
print("=" * 80)

print("\n -> DataFrame Shape:")
print(f"   Rows: {df_forecast.shape[0]:,}")
print(f"   Columns: {df_forecast.shape[1]}")

print("\n -> Column Names:")
for i, col in enumerate(df_forecast.columns, 1):
    print(f"   {i:2d}. {col}")

print("\n -> Head:")
display(df_forecast.head(3))

FORECAST DATASET (103ki) - INITIAL OVERVIEW

 -> DataFrame Shape:
   Rows: 131,397
   Columns: 22

 -> Column Names:
    1. DUE_DATE
    2. PMNUM
    3. COUNTKEY
    4. PMDESCRIPTION
    5. INTERVAL
    6. FORECASTJP
    7. JOB_TYPE
    8. LABOR_CRAFT
    9. PLANNED_LABOR_HRS
   10. TOTAL_MATERIAL_COST
   11. TASK_COUNT
   12. TOTAL_TASK_DESC_LENGTH
   13. PMSCOPETYPE
   14. LOCATION
   15. LOCATIONDESC
   16. PLANT
   17. DEPT
   18. DEPT_NAME
   19. DEPT_TYPE
   20. LINE
   21. ZONENAME
   22. PROCESSNAME

 -> Head:


| | DUE_DATE | PMNUM | COUNTKEY | PMDESCRIPTION | INTERVAL | FORECASTJP | JOB_TYPE | LABOR_CRAFT | PLANNED_LABOR_HRS | TOTAL_MATERIAL_COST | ... | PMSCOPETYPE | LOCATION | LOCATIONDESC | PLANT | DEPT | DEPT_NAME | DEPT_TYPE | LINE | ZONENAME | PROCESSNAME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2026-04-01 | PM104088 | 2026-04-01-PM104088 | LP DIE CAST MACHINE #4 PM - M | 1-MONTHS | JP211217 | INSPECTION | ESTMULT | 1.75 | | ... | ASSET | 3DCLCAA4XXMC | Casting Die Cast Machine 4 Machine | 3 | 3DC | DIE CAST | DC | L | LP Casting | Casting Die Cast Machine 4 |
| 1 | 2026-04-01 | PM104114 | 2026-04-01-PM104114 | LPDC #1 PF DEBURR STATION 1M PM | 1-MONTHS | JP211160 | INSPECTION | ESTMULT | 0.5 | | ... | ASSET | 3DCLPF01XXPF | Pre-finish Line 1 Pre-finish | 3 | 3DC | DIE CAST | DC | L | Pre-finish | Pre-finish Line 1 |
| 2 | 2026-04-01 | PM104116 | 2026-04-01-PM104116 | LPDC #1 PF INSPECTION STATION PM | 1-MONTHS | JP211164 | INSPECTION | ESTMULT | 0.333333 | | ... | ASSET | 3DCLPF01XXPF | Pre-finish Line 1 Pre-finish | 3 | 3DC | DIE CAST | DC | L | Pre-finish | Pre-finish Line 1 |


In [57]:
# Data types and non-null counts
print("\n Data Types and Non-Null Counts:")
df_forecast.info()


 Data Types and Non-Null Counts:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131397 entries, 0 to 131396
Data columns (total 22 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   DUE_DATE                131397 non-null  datetime64[ns]
 1   PMNUM                   131397 non-null  object        
 2   COUNTKEY                131397 non-null  object        
 3   PMDESCRIPTION           131397 non-null  object        
 4   INTERVAL                131397 non-null  category      
 5   FORECASTJP              131324 non-null  object        
 6   JOB_TYPE                131320 non-null  category      
 7   LABOR_CRAFT             104483 non-null  category      
 8   PLANNED_LABOR_HRS       130568 non-null  float64       
 9   TOTAL_MATERIAL_COST     2275 non-null    float64       
 10  TASK_COUNT              131397 non-null  int64         
 11  TOTAL_TASK_DESC_LENGTH  131293 non-null  float64       
 

## 5. Data Quality Assessment

### 5.1 Missing Value Analysis

---

#### Performance Data

In [58]:
# Missing data analysis -- performance
df_performance_missing = df_performance.isna().sum().to_frame(name='count')

# Add a column for the percentage of rows that are missing
df_performance_missing['percentage'] = df_performance_missing['count'] / df_performance.shape[0] * 100

df_performance_missing

| | count | percentage |
|---|---|---|
| PMNUM | 0 | 0.0 |
| TIMES_SCHEDULED | 0 | 0.0 |
| TIMES_ONTIME | 0 | 0.0 |
| TIMES_LATE | 0 | 0.0 |
| TIMES_NOT_COMPLETED | 0 | 0.0 |
| AVG_PLANNED_HRS | 0 | 0.0 |
| AVG_ACTUAL_HRS | 0 | 0.0 |


> *The historical performance dataset has **no missing values**.*

---

#### Forecast Data

In [59]:
# Missing data analysis -- forecast
df_forecast_missing = df_forecast.isna().sum().to_frame(name='count')

# Add a column for the percentage of rows that are missing
df_forecast_missing['percentage'] = df_forecast_missing['count'] / df_forecast.shape[0] * 100

# Filter to only show columns with missing values
missing_forecast_filtered = df_forecast_missing[df_forecast_missing['count'] > 0]

missing_forecast_filtered

| | count | percentage |
|---|---|---|
| FORECASTJP | 73 | 0.055557 |
| JOB_TYPE | 77 | 0.058601 |
| LABOR_CRAFT | 26914 | 20.482964 |
| PLANNED_LABOR_HRS | 829 | 0.630912 |
| TOTAL_MATERIAL_COST | 129122 | 98.268606 |
| TOTAL_TASK_DESC_LENGTH | 104 | 0.079149 |
| DEPT_NAME | 14982 | 11.402087 |
| LINE | 2791 | 2.124097 |
| PROCESSNAME | 14740 | 11.217912 |


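> *The forecast dataset has missing values in nine columns, most notably `TOTAL_MATERIAL_COST` (98.3% missing), followed by `LABOR_CRAFT` (20.5%), `DEPT_NAME` (11.4%), and `PROCESSNAME` (11.2%).*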
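
Both missing-value cells follow the same steps, so they could be folded into one reusable function. The sketch below is one way to do that; `missing_summary` is a hypothetical helper and the toy frame is illustrative, not real forecast data.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    summary = pd.DataFrame({
        'count': df.isna().sum(),
        # The mean of a boolean mask is the fraction of True values
        'percentage': df.isna().mean() * 100,
    })
    has_missing = summary['count'] > 0
    return summary[has_missing].sort_values('percentage', ascending=False)

# Toy data for illustration only
toy = pd.DataFrame({
    'PMNUM': ['PM1', 'PM2', 'PM3', 'PM4'],
    'LABOR_CRAFT': ['ESTMULT', None, 'ESTMULT', None],
    'TOTAL_MATERIAL_COST': [np.nan, np.nan, np.nan, 12.5],
})

result = missing_summary(toy)
print(result)
```

Calling `missing_summary(df_forecast)` would give the filtered table above, sorted so the sparsest columns come first.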

---

### 5.2 Duplicate Value Analysis

#### Performance Data

In [61]:
# Duplicate analysis -- performance

df_performance_dups = df_performance.nunique().to_frame(name = 'unique_vals')

df_performance_dups['duplicates'] = df_performance.shape[0] - df_performance_dups['unique_vals']

df_performance_dups

| | unique_vals | duplicates |
|---|---|---|
| PMNUM | 18476 | 0 |
| TIMES_SCHEDULED | 50 | 18426 |
| TIMES_ONTIME | 67 | 18409 |
| TIMES_LATE | 26 | 18450 |
| TIMES_NOT_COMPLETED | 10 | 18466 |
| AVG_PLANNED_HRS | 389 | 18087 |
| AVG_ACTUAL_HRS | 1381 | 17095 |


> `PMNUM` has no duplicates, so it can serve as a unique key for this dataset. Repeated values in the other columns are expected because they are numeric counts and averages.
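
The per-column `nunique()` comparison shows that `PMNUM` is unique, but it does not say whether whole rows repeat. A small sketch of complementary row-level checks, shown on toy data that deliberately contains a repeated key:

```python
import pandas as pd

# Toy data for illustration; 'PM2' is repeated on purpose
toy = pd.DataFrame({
    'PMNUM': ['PM1', 'PM2', 'PM2', 'PM3'],
    'TIMES_SCHEDULED': [12, 4, 4, 4],
})

key_is_unique = toy['PMNUM'].is_unique          # True only if no key repeats
full_row_dups = int(toy.duplicated().sum())     # rows identical to an earlier row
shared_key = toy[toy['PMNUM'].duplicated(keep=False)]  # every row whose key repeats

print("PMNUM unique:", key_is_unique)
print("Fully duplicated rows:", full_row_dups)
print(shared_key)
```

On the real performance data, the table above implies `df_performance['PMNUM'].is_unique` would be `True`.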

---

#### Forecast Data

In [62]:
# Duplicate analysis -- forecast

# nunique() ignores NaN by default, so missing values are counted as duplicates below
df_forecast_dups = df_forecast.nunique().to_frame(name='unique_vals')

df_forecast_dups['duplicates'] = df_forecast.shape[0] - df_forecast_dups['unique_vals']

# Share of rows that repeat an already-seen value
df_forecast_dups['perc_of_total'] = df_forecast_dups['duplicates'] / df_forecast.shape[0] * 100

# Sort by perc_of_total, smallest to largest
df_forecast_dups = df_forecast_dups.sort_values('perc_of_total')

df_forecast_dups

| | unique_vals | duplicates | perc_of_total |
|---|---|---|---|
| COUNTKEY | 89210 | 42187 | 32.106517 |
| PMNUM | 18346 | 113051 | 86.037733 |
| PMDESCRIPTION | 17380 | 114017 | 86.77291 |
| FORECASTJP | 3669 | 127728 | 97.207699 |
| LOCATION | 3568 | 127829 | 97.284565 |
| LOCATIONDESC | 3453 | 127944 | 97.372086 |
| TOTAL_TASK_DESC_LENGTH | 1180 | 130217 | 99.101958 |
| PROCESSNAME | 476 | 130921 | 99.637739 |
| DUE_DATE | 364 | 131033 | 99.722977 |
| ZONENAME | 180 | 131217 | 99.863011 |


> The forecast data behaves differently. Because the Preventive Maintenance program schedules the same tasks at fixed intervals, most rows repeat with only the timing changed, so the high duplicate percentages are expected.
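
`COUNTKEY` appears to combine `DUE_DATE` and `PMNUM` (for example `2026-04-01-PM104088`), yet about 32% of its values repeat. One way to see which columns differ between rows that share a `COUNTKEY` is sketched below. The toy rows and the craft labels `CRAFT_A` and `CRAFT_B` are hypothetical, and the result says nothing about what actually varies in the real data.

```python
import pandas as pd

# Toy data for illustration only; craft labels are made up
toy = pd.DataFrame({
    'COUNTKEY': ['2026-04-01-PM1', '2026-04-01-PM1', '2026-04-01-PM2'],
    'PMNUM': ['PM1', 'PM1', 'PM2'],
    'LABOR_CRAFT': ['CRAFT_A', 'CRAFT_B', 'CRAFT_A'],
    'PLANNED_LABOR_HRS': [1.0, 0.5, 2.0],
})

# Keep only rows whose COUNTKEY occurs more than once
repeated = toy[toy['COUNTKEY'].duplicated(keep=False)]

# Distinct values per column within each COUNTKEY, then the worst case per column
distinct_per_key = repeated.groupby('COUNTKEY').nunique()
varying = distinct_per_key.max()

# Columns with more than one distinct value inside a single COUNTKEY
varying_cols = varying[varying > 1]
print(varying_cols)
```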

---

## 6. Join Key Analysis

In [67]:
print("="*80)
print("Primary Key on Performance - PMNUM")
print("Foreign Key on Forecast    - PMNUM")
print("="*80)

# Get unique PMNUMs from each dataset
pmnums_performance = set(df_performance['PMNUM'].unique())
pmnums_forecast = set(df_forecast['PMNUM'].unique())

# Calculate overlaps
both = pmnums_performance & pmnums_forecast
only_performance = pmnums_performance - pmnums_forecast
only_forecast = pmnums_forecast - pmnums_performance

print(f"\n -> PMNUM Distribution:")
print(f"   In Performance data: {len(pmnums_performance):,}")
print(f"   In Forecast data: {len(pmnums_forecast):,}")
print(f"   In BOTH datasets: {len(both):,}")
print(f"   Only in Performance: {len(only_performance):,}")
print(f"   Only in Forecast: {len(only_forecast):,}")

print(f"\n -> Insights:")
if len(only_performance) > 0:
    print(f"   • {len(only_performance):,} PMNUMs from last year won't repeat next year (PMNUM only in Performance)")
if len(only_forecast) > 0:
    print(f"   • {len(only_forecast):,} new PMNUMs in forecast that are not in performance data (PMNUM only in Forecast)")

Primary Key on Performance - PMNUM
Foreign Key on Forecast    - PMNUM

 -> PMNUM Distribution:
   In Performance data: 18,476
   In Forecast data: 18,346
   In BOTH datasets: 16,435
   Only in Performance: 2,041
   Only in Forecast: 1,911

 -> Insights:
   • 2,041 PMNUMs from last year won't repeat next year (PMNUM only in Performance)
   • 1,911 new PMNUMs in forecast that are not in performance data (PMNUM only in Forecast)
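
The same overlap counts can be cross-checked with an outer merge and pandas' `indicator=True` option, which labels each key as `both`, `left_only`, or `right_only`. A minimal sketch on toy frames (`perf` and `fcst` stand in for `df_performance` and `df_forecast`):

```python
import pandas as pd

# Toy stand-ins for the two datasets
perf = pd.DataFrame({'PMNUM': ['PM1', 'PM2', 'PM3']})
fcst = pd.DataFrame({'PMNUM': ['PM2', 'PM2', 'PM3', 'PM4']})

# Compare distinct keys only, so repeated forecast rows do not inflate the counts
overlap = pd.merge(
    perf[['PMNUM']].drop_duplicates(),
    fcst[['PMNUM']].drop_duplicates(),
    on='PMNUM',
    how='outer',
    indicator=True,  # adds a '_merge' column: both / left_only / right_only
)

counts = overlap['_merge'].value_counts()
print(counts)
```

Here `left_only` corresponds to "Only in Performance" and `right_only` to "Only in Forecast".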


## 7. Next Steps & Path Assignments

### Path 1: Maintenance Strategy Comparison
**Owner:** Abby Tucker

**Starting points from this analysis:**
- Department summary statistics
- Interval distribution patterns
- Craft utilization by department

**Recommended next steps:**
1. Parse `INTERVAL` into frequency groups (daily, weekly, monthly, etc.)
2. Calculate job type mix ratios per department
3. Analyze craft diversity and specialization
4. Compare asset vs location maintenance preferences
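
As a starting point for step 1, the sketch below parses interval strings into rough frequency groups. It assumes every value follows the `<count>-<UNIT>` pattern seen in the `1-MONTHS` example; only `MONTHS` is confirmed by the output above, so the other unit names, the day conversions, and the bucket boundaries are all assumptions to adjust against the real `INTERVAL` categories.

```python
import re

# Assumed format: "<count>-<UNIT>", e.g. '1-MONTHS'
INTERVAL_PATTERN = re.compile(r'^(\d+)-([A-Z]+)$')

# Hypothetical unit names and approximate day lengths; only MONTHS is confirmed
UNIT_DAYS = {'DAYS': 1, 'WEEKS': 7, 'MONTHS': 30, 'YEARS': 365}

def interval_to_days(interval):
    """Approximate an interval string in days, or None if it does not parse."""
    match = INTERVAL_PATTERN.match(str(interval).strip().upper())
    if match is None or match.group(2) not in UNIT_DAYS:
        return None
    return int(match.group(1)) * UNIT_DAYS[match.group(2)]

def frequency_group(interval):
    """Bucket an interval by its approximate length (illustrative boundaries)."""
    days = interval_to_days(interval)
    if days is None:
        return 'unknown'
    if days <= 1:
        return 'daily'
    if days <= 7:
        return 'weekly'
    if days <= 31:
        return 'monthly'
    if days <= 92:
        return 'quarterly'
    return 'longer'

print(frequency_group('1-MONTHS'))
```

Applied to the forecast data, this could look like `df_forecast['INTERVAL'].astype(str).map(frequency_group)`; a count of the `'unknown'` group would show how many values break the assumed pattern.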

### Path 2: Execution Completion Analysis
**Owner:** Mike Moyer

**Starting points from this analysis:**
- PMNUM overlap between datasets 
- Performance data completeness checks
- Basic hour planning metrics

**Recommended next steps:**
1. Merge forecast + performance on `PMNUM`
2. Calculate completion rates and on-time rates
3. Engineer planning accuracy metrics
4. Identify patterns in high vs low completion PMs
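
Steps 2 and 3 could begin from something like the sketch below, shown on the five `head()` rows of the performance data. The metric definitions are assumptions rather than agreed project definitions: completion counts both on-time and late work, and planning accuracy is expressed as the ratio of actual to planned hours.

```python
import pandas as pd

# Rows copied from the head() output of the performance data
perf = pd.DataFrame({
    'PMNUM': ['PM104956', 'PM105442', 'PM105540', 'PM178837', 'PM106181'],
    'TIMES_SCHEDULED': [12, 12, 52, 8, 364],
    'TIMES_ONTIME': [3, 4, 37, 0, 350],
    'TIMES_LATE': [0, 0, 7, 0, 7],
    'AVG_PLANNED_HRS': [5.0, 0.5, 1.0, 1.0, 0.5],
    'AVG_ACTUAL_HRS': [1.208333, 0.166667, 0.798077, 0.0, 0.578297],
}).set_index('PMNUM')

# Assumed definition: completed = on time + late
perf['completion_rate'] = (perf['TIMES_ONTIME'] + perf['TIMES_LATE']) / perf['TIMES_SCHEDULED']
perf['on_time_rate'] = perf['TIMES_ONTIME'] / perf['TIMES_SCHEDULED']

# Planned hours can be 0 in the full data (see describe), so mask them to NaN
# before dividing instead of producing inf
planned = perf['AVG_PLANNED_HRS'].where(perf['AVG_PLANNED_HRS'] > 0)
perf['hours_ratio'] = perf['AVG_ACTUAL_HRS'] / planned

print(perf[['completion_rate', 'on_time_rate', 'hours_ratio']].round(3))
```

A `hours_ratio` below 1 means the work took less time than planned on average, and above 1 means it took more.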

---


**Proceed to individual exploration notebooks!**