# Big G Express: Predicting Derates
In this project, you will work with fault code data and vehicle onboard diagnostic data to predict an upcoming full derate. A full derate is indicated by SPN 5246.

You have been provided with two files containing the data you will use to make these predictions (J1939Faults.csv and VehicleDiagnosticOnboardData.csv), as well as two files describing some of the contents (DataInfo.docx and Service Fault Codes_1_0_0_167.xlsx).

Note that in its raw form the data does not have "labels", so you must define what labels you are going to use and create those labels in your dataset. Also, you will likely need to perform some significant feature engineering in order to build an accurate predictor.
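
One possible way to build labels is sketched below: a fault record is marked positive when a full derate (SPN 5246) occurs on the same truck within some window after it. The two-hour window, the `target` column, and the `label_upcoming_derate` helper are hypothetical choices for illustration, not part of the project brief; the other column names follow J1939Faults.csv.

```python
import pandas as pd

def label_upcoming_derate(df, window="2h", derate_spn=5246):
    """Flag rows that are followed by a full derate on the same truck.

    The window length is an assumption; tune it to the lead time you need.
    """
    df = df.sort_values("EventTimeStamp").copy()
    derates = (
        df.loc[df["spn"] == derate_spn, ["EquipmentID", "EventTimeStamp"]]
        .rename(columns={"EventTimeStamp": "next_derate"})
    )
    # For each record, find the first derate at or after it on the same truck.
    out = pd.merge_asof(
        df, derates,
        left_on="EventTimeStamp", right_on="next_derate",
        by="EquipmentID", direction="forward",
    )
    lead = out["next_derate"] - out["EventTimeStamp"]
    # The derate rows themselves are not counted as positives.
    out["target"] = (
        lead.notna()
        & (lead <= pd.Timedelta(window))
        & (out["spn"] != derate_spn)
    )
    return out

# Tiny made-up example: truck 1439 derates at 12:00, truck 1500 never does.
demo = pd.DataFrame({
    "EquipmentID": ["1439", "1439", "1439", "1500"],
    "EventTimeStamp": pd.to_datetime([
        "2015-02-21 09:00", "2015-02-21 11:30",
        "2015-02-21 12:00", "2015-02-21 11:45",
    ]),
    "spn": [111, 4094, 5246, 111],
})
labeled = label_upcoming_derate(demo)
print(labeled[["EquipmentID", "EventTimeStamp", "spn", "target"]])
```

Only the 11:30 record on truck 1439 falls inside the window, so it is the only positive row in this toy example.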

There are service locations at (36.0666667, -86.4347222), (35.5883333, -86.4438888), and (36.1950, -83.174722), so you should remove any records in the vicinity of these locations, as fault codes may be tripped when working on the vehicles.
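
As a rough illustration of what "in the vicinity" could mean, the sketch below computes the great-circle (haversine) distance from each record to the nearest service location using only NumPy. The 5-mile radius matches the threshold used later in this notebook but is still an assumption, and `miles_to_nearest_center` is a hypothetical helper name.

```python
import numpy as np

SERVICE_CENTERS = [
    (36.0666667, -86.4347222),
    (35.5883333, -86.4438888),
    (36.1950, -83.174722),
]  # (latitude, longitude)
EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def miles_to_nearest_center(lat, lon, centers=SERVICE_CENTERS):
    """Haversine distance in miles from each (lat, lon) to its closest center."""
    lat = np.radians(np.asarray(lat, dtype=float))[:, None]
    lon = np.radians(np.asarray(lon, dtype=float))[:, None]
    c = np.radians(np.asarray(centers, dtype=float))
    dlat = c[:, 0] - lat
    dlon = c[:, 1] - lon
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(c[:, 0]) * np.sin(dlon / 2) ** 2
    # One column per service center; keep the smallest distance per record.
    return (2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))).min(axis=1)

# First point sits on a service center, second is the location from the sample rows.
lats = [36.0666667, 38.857638]
lons = [-86.4347222, -84.626851]
dist = miles_to_nearest_center(lats, lons)
keep = dist >= 5.0  # assumed radius in miles
print(dist, keep)
```

A mask like `keep` can be applied directly to the faults dataframe and is a cheap cross-check on a projection-based filter.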

When evaluating the performance of your model, assume that a missed full derate costs approximately $4000 in towing and repairs, and that a false positive prediction costs about $500 because the truck is taken off the road and serviced unnecessarily.
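
These two costs can be turned into a single score for comparing models. The sketch below is a minimal version that assumes binary labels (1 = full derate) and, as a simplification, treats correctly predicted derates as costing nothing; `total_cost` is a hypothetical helper name.

```python
FN_COST = 4000  # missed full derate: towing and repairs
FP_COST = 500   # unnecessary service visit

def total_cost(y_true, y_pred, fn_cost=FN_COST, fp_cost=FP_COST):
    """Sum the cost of missed derates and false alarms."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p and not t)
    return fn * fn_cost + fp * fp_cost

# Made-up predictions: one missed derate and two false alarms.
y_true = [1, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
cost = total_cost(y_true, y_pred)

# Under the same simplification, flagging a truck pays off once the predicted
# derate probability p satisfies 4000 * p > 500 * (1 - p), i.e. p > 1/9.
break_even = FP_COST / (FN_COST + FP_COST)
print(cost, round(break_even, 3))
```

Because a miss is eight times as expensive as a false alarm, the cost-minimizing probability threshold under these assumptions sits well below 0.5.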

A failed component is usually what triggers this code.

Common Failures

 * Failed DEF doser valve
    * Associated fault code: SPN 5394
 * You ran out of DEF fluid
    * Associated fault codes: SPN 5392, SPN 1761
 * Inlet and Outlet NOx sensors failed, not making pressure
    * Associated fault code: SPN 4094
 * EGR system malfunction causing NOx efficiency problems
 * DEF pump failed, not making pressure
    * Associated fault codes: SPN 4334, SPN 4339
 * DEF module has failed or DEF harness failure or no power to DEF module causes DEF gauge to be empty and showing datalink error and SCR malfunction.
 * The DEF / ECM could also need updating to eliminate ghost codes.
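
The SPN codes in the list above could serve as candidate features. The sketch below maps them to short component names and flags matching fault records; the `PRECURSOR_SPNS` dictionary and the `precursor` / `is_precursor` column names are hypothetical, and whether these codes actually precede a derate is something the data has to confirm.

```python
import pandas as pd

# SPN codes named in the "Common Failures" list, keyed to the listed failure.
PRECURSOR_SPNS = {
    5394: "DEF doser valve",
    5392: "out of DEF fluid",
    1761: "out of DEF fluid",
    4094: "NOx sensors",
    4334: "DEF pump",
    4339: "DEF pump",
}

# Made-up sample of SPN values, including the full derate code 5246.
demo = pd.DataFrame({"spn": [111, 5394, 4334, 5246]})
demo["precursor"] = demo["spn"].map(PRECURSOR_SPNS)  # NaN when the SPN is not listed
demo["is_precursor"] = demo["precursor"].notna()
print(demo)
```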

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno 
from shapely.geometry import Point
import folium
import geopandas as gpd

In [2]:
faults = pd.read_csv("../data/J1939Faults.csv", low_memory=False)
faults.head(2)

Unnamed: 0,RecordID,ESS_Id,EventTimeStamp,eventDescription,actionDescription,ecuSoftwareVersion,ecuSerialNumber,ecuModel,ecuMake,ecuSource,spn,fmi,active,activeTransitionCount,faultValue,EquipmentID,MCTNumber,Latitude,Longitude,LocationTimeStamp
0,1,990349,2015-02-21 10:47:13.000,Low (Severity Low) Engine Coolant Level,,unknown,unknown,unknown,unknown,0,111,17,True,2,,1439,105354361,38.857638,-84.626851,2015-02-21 11:34:25.000
1,2,990360,2015-02-21 11:34:34.000,,,unknown,unknown,unknown,unknown,11,629,12,True,127,,1439,105354361,38.857638,-84.626851,2015-02-21 11:35:10.000


In [3]:
#load diagnostics data
diagnostics = pd.read_csv("../data/VehicleDiagnosticOnboardData.csv")
diagnostics.head(2)

Unnamed: 0,Id,Name,Value,FaultId
0,1,IgnStatus,False,1
1,2,EngineOilPressure,0,1


In [4]:
#drop unnecessary columns
columns_to_drop = ['ESS_Id', 'actionDescription', 'ecuSoftwareVersion', 'ecuSerialNumber', 'ecuModel', 'ecuMake', 'ecuSource', 'faultValue', 'MCTNumber']
faults_a = faults.drop(columns=columns_to_drop)

# fix data types
faults_a['EventTimeStamp'] = pd.to_datetime(faults_a['EventTimeStamp'])
faults_a['LocationTimeStamp'] = pd.to_datetime(faults_a['LocationTimeStamp'])

In [5]:
faults_a.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1187335 entries, 0 to 1187334
Data columns (total 11 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   RecordID               1187335 non-null  int64         
 1   EventTimeStamp         1187335 non-null  datetime64[ns]
 2   eventDescription       1126490 non-null  object        
 3   spn                    1187335 non-null  int64         
 4   fmi                    1187335 non-null  int64         
 5   active                 1187335 non-null  bool          
 6   activeTransitionCount  1187335 non-null  int64         
 7   EquipmentID            1187335 non-null  object        
 8   Latitude               1187335 non-null  float64       
 9   Longitude              1187335 non-null  float64       
 10  LocationTimeStamp      1187335 non-null  datetime64[ns]
dtypes: bool(1), datetime64[ns](2), float64(2), int64(4), object(2)
memory usage: 91.7+ MB


In [6]:
# function to categorize time of day
def categorize_time_of_day(hour):
    if 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    elif 18 <= hour < 24:
        return 'Night'
    else:
        return 'Early'

In [7]:
# Apply the function to create a new column for time of day
faults_a['time_of_day'] = faults_a['EventTimeStamp'].dt.hour.apply(categorize_time_of_day)
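
On more than a million rows, `apply` with a Python function can be slow. A vectorized alternative, sketched here on a small made-up Series, is `pd.cut` with the same hour boundaries; the result is a categorical column rather than plain strings.

```python
import pandas as pd

hours = pd.Series([0, 5, 6, 11, 12, 17, 18, 23])
time_of_day = pd.cut(
    hours,
    bins=[0, 6, 12, 18, 24],
    labels=["Early", "Morning", "Afternoon", "Night"],
    right=False,  # intervals are [0, 6), [6, 12), [12, 18), [18, 24)
)
print(time_of_day.tolist())
```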

In [8]:
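# keep only the first run of digits in EquipmentID; raw string avoids an invalid-escape warning
faults_a['EquipmentID'] = faults_a['EquipmentID'].str.extract(r'(\d+)', expand=False)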

In [9]:
faults_a.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1187335 entries, 0 to 1187334
Data columns (total 12 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   RecordID               1187335 non-null  int64         
 1   EventTimeStamp         1187335 non-null  datetime64[ns]
 2   eventDescription       1126490 non-null  object        
 3   spn                    1187335 non-null  int64         
 4   fmi                    1187335 non-null  int64         
 5   active                 1187335 non-null  bool          
 6   activeTransitionCount  1187335 non-null  int64         
 7   EquipmentID            1187335 non-null  object        
 8   Latitude               1187335 non-null  float64       
 9   Longitude              1187335 non-null  float64       
 10  LocationTimeStamp      1187335 non-null  datetime64[ns]
 11  time_of_day            1187335 non-null  object        
dtypes: bool(1), datetime64[ns](2

In [10]:
def categorize_time_of_day(hour):
    if 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    elif 18 <= hour < 24:
        return 'Night'
    else:
        return 'Early'

In [None]:
faults_a.head()


In [None]:
faults_merged = pd.merge(faults_a, diagnostics.pivot(index='FaultId', columns='Name', values='Value'), 
                     left_on='RecordID',right_on= 'FaultId',how='left')
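
To make the reshaping step concrete, here is a small made-up example of the same `pivot` call. It turns the long diagnostics table (one row per reading) into a wide one (one row per `FaultId`, one column per reading name). Because `Value` mixes booleans and numbers in one column, the sketch assumes it is read as text and converts one reading explicitly. Note that `pivot` raises a `ValueError` if any (`FaultId`, `Name`) pair appears more than once.

```python
import pandas as pd

# Made-up rows in the same layout as VehicleDiagnosticOnboardData.csv.
long_df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Name": ["IgnStatus", "EngineOilPressure", "IgnStatus", "EngineOilPressure"],
    "Value": ["False", "0", "True", "35.5"],
    "FaultId": [1, 1, 2, 2],
})

wide = long_df.pivot(index="FaultId", columns="Name", values="Value")
# Values are still strings after the pivot, so numeric readings need converting.
wide["EngineOilPressure"] = pd.to_numeric(wide["EngineOilPressure"], errors="coerce")
print(wide)
```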

In [None]:
faults_merged.head()

In [None]:
faults_merged.info()

In [None]:
msno.matrix(faults_merged);

In [None]:
# drop all rows with active==False (keep only active faults)
faults_merged = faults_merged[faults_merged['active']]

In [None]:
#drop noncritical columns...
faults_merged=faults_merged.drop(columns=['eventDescription',
                                          'AcceleratorPedal',
                                          'FuelLevel',
                                          'FuelTemperature',
                                          'SwitchedBatteryVoltage',
                                          'ServiceDistance',
                                          'ParkingBrake',
                                          'Throttle'])

In [None]:
#check columns for nan's
msno.matrix(faults_merged);

In [None]:
#intermediate save
faults_merged.to_csv('../data/faults_merged.csv')

In [None]:
#Take out service centers using geo dataframe, pt1
# build point geometries from the coordinates (x = longitude, y = latitude)
faults_merged['geometry'] = gpd.points_from_xy(
    faults_merged['Longitude'], 
    faults_merged['Latitude']
    )

faults_merged_geo = gpd.GeoDataFrame(
    faults_merged, 
    crs = 'EPSG:4326',  # WGS84 lat/lon; the {'init': ...} dict form is deprecated
    geometry = 'geometry'
    )

# reproject to a CRS measured in meters so distances can be compared to a threshold
# (EPSG:3310 is California Albers; a CRS centered on the service area may be more appropriate)
faults_merged_geo = faults_merged_geo.to_crs(epsg = 3310)

In [None]:
#Take out service centers using geo dataframe, pt 2
#create service center geo dataframe
service_centers = [
    (36.0666667, -86.4347222), 
    (35.5883333, -86.4438888), 
    (36.1950, -83.174722)
    ]  # latitude and longitude coordinates for service centers

# Point takes (x, y), i.e. (longitude, latitude)
service_centers_geo = [Point(lon, lat) for lat, lon in service_centers]
# same CRS handling as for the fault records: WGS84 in, EPSG:3310 (meters) out
service_centers_geo_df = gpd.GeoDataFrame(geometry=service_centers_geo, crs='EPSG:4326')
service_centers_geo_df = service_centers_geo_df.to_crs(epsg = 3310)


# now we want to filter dataframe to exclude data within 5 miles of all service center locations
distance_threshold = 5 * 1609.344 #meters

def drop_near_point(df, point):
    # record the distance to this point, then keep rows at least distance_threshold away
    # (named drop_near_point so the built-in filter() is not shadowed)
    df = df.assign(distance=df['geometry'].distance(point))
    return df[df['distance'] >= distance_threshold]

# Iterate over each service center
for point in service_centers_geo_df.geometry:
    faults_merged_geo = drop_near_point(faults_merged_geo, point)
    
# this dataframe has all data except records within 5 miles of a service center
faults_merged_geo.head(2)

In [None]:
faults_merged_geo.info()

In [None]:
faults_merged_geo=faults_merged_geo.round(2)

In [None]:
faults_merged_geo=faults_merged_geo.dropna()

In [None]:
msno.matrix(faults_merged_geo);

In [None]:
faults_merged_geo

In [None]:
#intermediate save
faults_merged_geo.to_csv('../data/faults_merged_pt1_done.csv')

In [None]:
#continued in Big-G_pipeline_pt2...