<a href="https://colab.research.google.com/github/asuzukosi/GraphNeuralNetworksAndKnowledgGraphs/blob/main/OBD_DataAnalysis_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Onboard Diagnostics data analytics project

This project analyses data from 14 drivers driving on their daily routes. The data is gathered from the powertrain module and also contains metadata about the vehicles.

We perform univariate and bivariate data analysis to gather insight from the data.

## Import required libraries
For the data analysis, we will use numpy, pandas, matplotlib and seaborn for analysis and visualization.

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

## Extract data from xlsx files
We will extract the OBD-DatasetII to use in our analysis.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
data = pd.read_excel('/content/drive/MyDrive/advanced data science assessment/OBD-DATA.xlsx')
data.head()

Unnamed: 0,TIMESTAMP,MAKE,MODEL,CAR_YEAR,ENGINE_POWER,AUTOMATIC,VEHICLE_ID,BAROMETRIC_PRESSURE(KPA),ENGINE_COOLANT_TEMP,FUEL_LEVEL,...,THROTTLE_POS,DTC_NUMBER,TROUBLE_CODES,TIMING_ADVANCE,EQUIV_RATIO,MIN,HOURS,DAYS_OF_WEEK,MONTHS,YEAR
0,1502902504267,chevrolet,agile,2011.0,1.4,n,car1,100.0,80.0,0.486,...,0.251,MIL is OFF0 codes,,0.569,0.01,13.0,16.0,2.0,8.0,2017.0
1,1502902512283,chevrolet,agile,2011.0,1.4,n,car1,100.0,80.0,0.486,...,0.251,MIL is OFF0 codes,,0.565,0.01,13.0,16.0,2.0,8.0,2017.0
2,1502902520291,chevrolet,agile,2011.0,1.4,n,car1,100.0,80.0,0.486,...,0.251,MIL is OFF0 codes,,0.573,0.01,13.0,16.0,2.0,8.0,2017.0
3,1502902528300,chevrolet,agile,2011.0,1.4,n,car1,100.0,80.0,0.486,...,0.251,MIL is OFF0 codes,,0.565,0.01,13.0,16.0,2.0,8.0,2017.0
4,1502902536320,chevrolet,agile,2011.0,1.4,n,car1,100.0,80.0,0.486,...,0.251,MIL is OFF0 codes,,0.569,0.01,13.0,16.0,2.0,8.0,2017.0


## Extract Trouble codes
We want to find the fraction of rows that have a recorded trouble code.

In [4]:
data[data['TROUBLE_CODES'].notnull()].shape,data.shape

((11925, 33), (47514, 33))

In [6]:
(11925 * 100) / 47514. # 25% of the data has reported trouble codes

25.0978658921581
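The same percentage can be computed from the DataFrame itself instead of retyping the two row counts. The following is a minimal sketch, assuming a column named `TROUBLE_CODES` as in the data above; the helper `trouble_code_share` and the four-row toy frame are hypothetical and only illustrate the calculation.

```python
import numpy as np
import pandas as pd


def trouble_code_share(df, column="TROUBLE_CODES"):
    """Return (rows_with_code, total_rows, percentage) for one column."""
    has_code = df[column].notna()      # True where a trouble code is present
    n_with = int(has_code.sum())       # number of rows with a code
    n_total = len(df)                  # total number of rows
    pct = 100.0 * n_with / n_total if n_total else float("nan")
    return n_with, n_total, pct


# Hypothetical toy data: two of the four rows carry a trouble code.
toy = pd.DataFrame({"TROUBLE_CODES": ["P0133", np.nan, np.nan, "C0300"]})
n_with, n_total, pct = trouble_code_share(toy)
print(f"{n_with} of {n_total} rows ({pct:.1f}%) have a trouble code")
```

Calling `trouble_code_share(data)` on the full dataset should reproduce the counts shown above, provided it is run before any rows are dropped.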

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47514 entries, 0 to 47513
Data columns (total 33 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   TIMESTAMP                    47514 non-null  int64  
 1   MAKE                         47459 non-null  object 
 2   MODEL                        47459 non-null  object 
 3   CAR_YEAR                     47459 non-null  float64
 4   ENGINE_POWER                 47459 non-null  float64
 5   AUTOMATIC                    47459 non-null  object 
 6   VEHICLE_ID                   47514 non-null  object 
 7   BAROMETRIC_PRESSURE(KPA)     10212 non-null  float64
 8   ENGINE_COOLANT_TEMP          33964 non-null  float64
 9   FUEL_LEVEL                   2994 non-null   float64
 10  ENGINE_LOAD                  30972 non-null  float64
 11  AMBIENT_AIR_TEMP             3619 non-null   float64
 12  ENGINE_RPM                   33859 non-null  float64
 13  INTAKE_MANIFOLD_

In [8]:
data['TROUBLE_CODES'].unique() # extract names of unique trouble codes

array([nan, 'P0133', 'C0300', 'P0079P2004P3000', 'P0078U1004P3000',
       'P0079C1004P3000', 'P007EP2036P18F0', 'P007EP2036P18D0',
       'P007FP2036P18D0', 'P0079P1004P3000', 'P007EP2036P18E0',
       'P007FP2036P18E0', 'P0078B0004P3000', 'P007FP2036P18F0'],
      dtype=object)
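Several of the values above look like multiple codes concatenated without a separator (for example `P0079P2004P3000`). The sketch below splits such a string into individual codes, under the assumption that every code is exactly five characters long (one letter followed by four hexadecimal digits), which is true of the values listed here. The helper name `split_trouble_codes` is hypothetical.

```python
import re

# Assumed shape of a single code: P, C, B or U followed by four hex digits.
DTC_PATTERN = re.compile(r"[PCBU][0-9A-F]{4}")


def split_trouble_codes(raw, width=5):
    """Split a concatenated trouble-code string into fixed-width codes.

    Missing values (NaN, None) give an empty list. A chunk that does not
    look like a code raises ValueError rather than being silently kept.
    """
    if not isinstance(raw, str):
        return []
    chunks = [raw[i:i + width] for i in range(0, len(raw), width)]
    for chunk in chunks:
        if not DTC_PATTERN.fullmatch(chunk):
            raise ValueError(f"unexpected trouble code chunk: {chunk!r}")
    return chunks


print(split_trouble_codes("P0133"))
print(split_trouble_codes("P007EP2036P18F0"))
```

On the real data this could be applied with `data['TROUBLE_CODES'].apply(split_trouble_codes)` to obtain one list of codes per row.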

## Drop data points without Trouble codes
Since we only wish to analyse rows that have trouble codes, we can drop the rest of the data.

In [9]:
data = data.dropna(subset=['TROUBLE_CODES']).reset_index(drop=True)  # keep only rows with a trouble code
data['TIMESTAMP'] = pd.to_datetime(data['TIMESTAMP'], unit='ms')  # convert epoch milliseconds to datetime
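As a sanity check on the `unit='ms'` conversion, the first `TIMESTAMP` value from the preview above can be decoded with the standard library alone. This is a minimal sketch, assuming the values are milliseconds since the Unix epoch in UTC; the helper `ms_to_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    # Split into whole seconds and leftover milliseconds to avoid float rounding.
    seconds, millis = divmod(int(ms), 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


first = ms_to_utc(1502902504267)  # TIMESTAMP of the first row shown earlier
print(first.isoformat())
# Hour, weekday (Monday=0), month and year of the decoded value.
print(first.hour, first.weekday(), first.month, first.year)  # 16 2 8 2017
```

The decoded hour, weekday, month and year agree with the `HOURS`, `DAYS_OF_WEEK`, `MONTHS` and `YEAR` values in that row, which suggests (but does not prove) that those columns were derived from the same UTC interpretation with Monday counted as 0.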