Machine Maintenance Data
Let’s assume we have a dataset that tracks machine maintenance in a manufacturing plant. The dataset has the following columns:

Column              Description
machine_id          Unique identifier for the machine.
maintenance_date    Date when the maintenance was performed.
technician          Name of the technician performing the maintenance.
maintenance_type    Type of maintenance (preventive or corrective).
downtime_minutes    Number of minutes the machine was offline.
cost                Cost of the maintenance activity.
issues_reported     Description of issues reported during the activity.

In [6]:
import pandas as pd
import numpy as np

# Build the sample dataset in memory
data = {
    'machine_id': ['M001', 'M002', 'M003', 'M001', 'M004', 'M005', 'M006', 'M003', 'M002', 'M001'],
    'maintenance_date': ['2024-01-01', '2023-12-01', '2023-11-15', '2023-10-10', '2023-09-20', 
                         '2023-08-05', '2023-07-25', '2023-06-15', '2023-05-10', '2023-04-01'],
    'technician': ['John', 'Sarah', 'Tom', None, 'Mike', 'Anna', 'Eve', 'Tom', 'Sarah', 'John'],
    'maintenance_type': ['Preventive', 'Corrective', 'Preventive', 'Corrective', 'Preventive',
                         'Corrective', 'Preventive', 'Preventive', 'Corrective', 'Corrective'],
    'downtime_minutes': [120, 700, 0, 300, 60, 480, 0, 0, 120, 240],
    'cost': [500, 1500, None, 1000, 300, 1200, None, 0, 400, 800],
    'issues_reported': ['Routine check', 'Major failure', 'None', 'Replaced parts', 
                        'Calibration', 'Critical error', 'None', 'None', 'Sensor replacement', 'Overheating']
}
df = pd.DataFrame(data)


In [None]:
# Replace missing technician names with 'Unknown'
df['technician'] = df['technician'].fillna('Unknown')
print(df['technician'])
# Replace missing costs with 0
df['cost'] = df['cost'].fillna(0)
print(df['cost'])
# Replace missing downtime minutes with 0
df['downtime_minutes'] = df['downtime_minutes'].fillna(0)
print(df['downtime_minutes'])



0       John
1      Sarah
2        Tom
3    Unknown
4       Mike
5       Anna
6        Eve
7        Tom
8      Sarah
9       John
Name: technician, dtype: object
0     500.0
1    1500.0
2       0.0
3    1000.0
4     300.0
5    1200.0
6       0.0
7       0.0
8     400.0
9     800.0
Name: cost, dtype: float64
0    120
1    700
2      0
3    300
4     60
5    480
6      0
7      0
8    120
9    240
Name: downtime_minutes, dtype: int64
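
Before filling, it can help to see how many values are actually missing in each column. The following is a minimal sketch on a small hypothetical frame (the `sample` data is made up for illustration and only mirrors three of the columns above). It also shows that `fillna` accepts a dict, so several columns can be filled in one call. Note that filling `cost` with 0 makes a missing cost indistinguishable from a recorded cost of 0.

```python
import pandas as pd

# Hypothetical frame mirroring three of the columns above
sample = pd.DataFrame({
    'technician': ['John', None, 'Tom'],
    'cost': [500.0, None, None],
    'downtime_minutes': [120, 300, 0],
})

# Count missing values per column before deciding how to fill them
missing_counts = sample.isna().sum()
print(missing_counts)

# Fill several columns in one call by passing a dict of per-column defaults
filled = sample.fillna({'technician': 'Unknown', 'cost': 0})
print(filled)
```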


In [None]:
# Drop duplicate rows
df = df.drop_duplicates()



Step 3: After Removing Duplicates
  machine_id maintenance_date technician maintenance_type  downtime_minutes  \
0       M001       2024-01-01       John       preventive               120   
1       M002       2023-12-01      Sarah       corrective               480   
3       M001       2023-10-10    Unknown       corrective               300   
4       M004       2023-09-20       Mike       preventive                60   
5       M005       2023-08-05       Anna       corrective               480   
8       M002       2023-05-10      Sarah       corrective               120   
9       M001       2023-04-01       John       corrective               240   

          cost     issues_reported  downtime_cost_per_minute downtime_severity  
0   500.000000       Routine check                  4.166667            Medium  
1  1208.304192       Major failure                  2.517300              High  
3  1000.000000      Replaced parts                  3.333333              High  
4   300.
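
Called with no arguments, drop_duplicates() removes only rows that match in every column. The sketch below shows a key-based alternative. It assumes that a maintenance visit is identified by machine_id plus maintenance_date, which is an assumption for illustration and not something the dataset description states; the `records` frame is hypothetical.

```python
import pandas as pd

# Hypothetical records: rows 0 and 2 describe the same visit,
# but the technician name is spelled differently
records = pd.DataFrame({
    'machine_id': ['M001', 'M002', 'M001'],
    'maintenance_date': ['2024-01-01', '2023-12-01', '2024-01-01'],
    'technician': ['John', 'Sarah', 'john'],
})

# Exact-match check: nothing is flagged because 'John' != 'john'
exact_dupes = records.duplicated().sum()

# Key-based check: compare only the columns assumed to identify a visit
key_cols = ['machine_id', 'maintenance_date']
key_dupes = records.duplicated(subset=key_cols).sum()

# Keep the first row for each key
deduped = records.drop_duplicates(subset=key_cols, keep='first')
print(exact_dupes, key_dupes, len(deduped))
```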

In [26]:
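# Cap cost at the 95th percentile.
# Note: the cap is recomputed from the current column, so re-running this
# cell on already-capped data lowers the cap each time.
cost_cap = df['cost'].quantile(0.95)
df['cost'] = df['cost'].clip(upper=cost_cap)

# Cap downtime_minutes at a maximum of 480 minutes
df['downtime_minutes'] = df['downtime_minutes'].clip(upper=480)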



Step 4: After Handling Outliers
  machine_id maintenance_date technician maintenance_type  downtime_minutes  \
0       M001       2024-01-01       John       preventive               120   
1       M002       2023-12-01      Sarah       corrective               480   
3       M001       2023-10-10    Unknown       corrective               300   
4       M004       2023-09-20       Mike       preventive                60   
5       M005       2023-08-05       Anna       corrective               480   
8       M002       2023-05-10      Sarah       corrective               120   
9       M001       2023-04-01       John       corrective               240   

          cost     issues_reported  downtime_cost_per_minute downtime_severity  
0   500.000000       Routine check                  4.166667            Medium  
1  1205.812935       Major failure                  2.517300              High  
3  1000.000000      Replaced parts                  3.333333              High  
4   300.00
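
A percentile cap derived from the column itself is not stable across repeated runs: each run recomputes the percentile from data that has already been capped. This may be why the capped cost in the printed outputs differs between runs. The sketch below uses the cost values from the sample data (with missing entries filled with 0) and compares recomputing the cap against reusing a cap that was computed once; the variable names are illustrative.

```python
import pandas as pd

# Cost values from the sample data after missing entries were filled with 0
raw_cost = pd.Series([500, 1500, 0, 1000, 300, 1200, 0, 0, 400, 800], dtype=float)

# Compute the cap once, from the untouched values
cost_cap = raw_cost.quantile(0.95)
capped_once = raw_cost.clip(upper=cost_cap)

# Re-deriving the cap from already-capped data gives a lower threshold
cap_again = capped_once.quantile(0.95)
capped_twice = capped_once.clip(upper=cap_again)

# Clipping again with the stored cap changes nothing, so it is safe to repeat
capped_stable = capped_once.clip(upper=cost_cap)

print(cost_cap, cap_again)
print(capped_once.equals(capped_stable))
```

Storing the cap in a variable that is computed in its own cell, before any clipping, is one way to keep the step repeatable.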

In [19]:
# Convert maintenance_date to datetime
df['maintenance_date'] = pd.to_datetime(df['maintenance_date'])

# Ensure machine_id is treated as a string
df['machine_id'] = df['machine_id'].astype(str)
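
The dates in the sample data are all well formed, so the conversion above succeeds. For messier input, a sketch along the following lines may help; it assumes the dates are expected in YYYY-MM-DD form, and the `raw` frame with its malformed entry is hypothetical.

```python
import pandas as pd

# Hypothetical raw values, including one malformed date and numeric IDs
raw = pd.DataFrame({
    'machine_id': [1, 2, 3],
    'maintenance_date': ['2024-01-01', 'not a date', '2023-11-15'],
})

# errors='coerce' turns unparseable entries into NaT instead of raising
raw['maintenance_date'] = pd.to_datetime(
    raw['maintenance_date'], format='%Y-%m-%d', errors='coerce'
)

# Numeric IDs become strings so they are not treated as quantities
raw['machine_id'] = raw['machine_id'].astype(str)

# Count the dates that failed to parse so they can be reviewed
bad_dates = raw['maintenance_date'].isna().sum()
print(raw.dtypes)
print(bad_dates)
```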


In [21]:
# Convert maintenance_type to lowercase
df['maintenance_type'] = df['maintenance_type'].str.lower()
df

   machine_id  maintenance_date  technician  maintenance_type  downtime_minutes  cost         issues_reported
0  M001        2024-01-01        John        preventive        120               500.0        Routine check
1  M002        2023-12-01        Sarah       corrective        480               1208.304192  Major failure
2  M003        2023-11-15        Tom         preventive        0                 0.0
3  M001        2023-10-10        Unknown     corrective        300               1000.0       Replaced parts
4  M004        2023-09-20        Mike        preventive        60                300.0        Calibration
5  M005        2023-08-05        Anna        corrective        480               1200.0       Critical error
6  M006        2023-07-25        Eve         preventive        0                 0.0
7  M003        2023-06-15        Tom         preventive        0                 0.0
8  M002        2023-05-10        Sarah       corrective        120               400.0        Sensor replacement
9  M001        2023-04-01        John        corrective        240               800.0        Overheating


In [22]:
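# Calculate downtime cost per minute (0 when there was no downtime,
# which also avoids dividing by zero)
df['downtime_cost_per_minute'] = df.apply(
    lambda row: row['cost'] / row['downtime_minutes'] if row['downtime_minutes'] > 0 else 0, axis=1
)

# Categorize downtime severity.
# Bins are right-closed: (0, 60] -> Low, (60, 240] -> Medium, (240, 480] -> High.
# Rows with 0 downtime fall outside the first bin and get NaN.
df['downtime_severity'] = pd.cut(
    df['downtime_minutes'],
    bins=[0, 60, 240, 480],
    labels=['Low', 'Medium', 'High']
)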
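
The behaviour at the bin edges is easy to get wrong, so here is a small sketch on hypothetical values chosen to sit on the boundaries. It also shows a vectorized form of the per-minute calculation, which should match the row-wise lambda as long as cost has no missing values. The `demo` frame and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical rows covering the bin edges, including zero downtime
demo = pd.DataFrame({
    'downtime_minutes': [0, 60, 61, 240, 480],
    'cost': [0.0, 300.0, 610.0, 800.0, 1200.0],
})

# Vectorized per-minute cost: mask zero downtime to NaN, divide, then fill with 0
nonzero_minutes = demo['downtime_minutes'].where(demo['downtime_minutes'] > 0)
demo['downtime_cost_per_minute'] = (demo['cost'] / nonzero_minutes).fillna(0)

bins = [0, 60, 240, 480]
labels = ['Low', 'Medium', 'High']

# Default: right-closed bins, so a value of exactly 0 gets no label (NaN)
default_bins = pd.cut(demo['downtime_minutes'], bins=bins, labels=labels)

# include_lowest=True extends the first bin to cover 0 as well
inclusive_bins = pd.cut(demo['downtime_minutes'], bins=bins, labels=labels,
                        include_lowest=True)

print(demo.assign(default=default_bins, inclusive=inclusive_bins))
```

Whether zero downtime should count as 'Low' or stay unlabelled is a modelling choice; include_lowest=True is only one option.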


In [23]:
# Remove rows where both downtime_minutes and cost are 0
# (.copy() makes the result an independent DataFrame for later assignments)
df = df[~((df['downtime_minutes'] == 0) & (df['cost'] == 0))].copy()
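
The filter only drops rows where both values are 0; a row with zero downtime but a non-zero cost, or the reverse, is kept. The sketch below illustrates this with a hypothetical `visits` frame (machine M007 and all values are made up) and shows an equivalent way to write the condition.

```python
import pandas as pd

# Hypothetical visits covering each combination of zero and non-zero values
visits = pd.DataFrame({
    'machine_id': ['M001', 'M003', 'M006', 'M007'],
    'downtime_minutes': [120, 0, 0, 45],
    'cost': [500.0, 0.0, 250.0, 0.0],
})

# Flag rows where nothing was recorded: no downtime and no cost
empty_visit = (visits['downtime_minutes'] == 0) & (visits['cost'] == 0)

# Keep everything else; .copy() gives an independent frame
kept = visits[~empty_visit].copy()

# Equivalent form by De Morgan's law: keep rows with any non-zero value
kept_alt = visits[(visits['downtime_minutes'] != 0) | (visits['cost'] != 0)]

print(kept)
```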


In [24]:
# Display the cleaned DataFrame
print(df)


  machine_id maintenance_date technician maintenance_type  downtime_minutes  \
0       M001       2024-01-01       John       preventive               120   
1       M002       2023-12-01      Sarah       corrective               480   
3       M001       2023-10-10    Unknown       corrective               300   
4       M004       2023-09-20       Mike       preventive                60   
5       M005       2023-08-05       Anna       corrective               480   
8       M002       2023-05-10      Sarah       corrective               120   
9       M001       2023-04-01       John       corrective               240   

          cost     issues_reported  downtime_cost_per_minute downtime_severity  
0   500.000000       Routine check                  4.166667            Medium  
1  1208.304192       Major failure                  2.517300              High  
3  1000.000000      Replaced parts                  3.333333              High  
4   300.000000         Calibration         