<h1>Importing libraries</h1>
<p>Here, Pandas is used for analyzing the dataset.</p>

In [1]:
import pandas as pd

<h1>Loading data</h1>
<p>The model will be built on collision data from Seattle: all collisions provided by SPD (Seattle Police Department) and recorded by Traffic Records. The dataset includes all types of collisions, which are displayed at the intersection or mid-block of a segment. Timeframe: 2004 to present; update frequency: weekly. <br>This dataset and the metadata can be found <a href=https://www.coursera.org/learn/applied-data-science-capstone/supplement/Nh5uS/downloading-example-dataset><b>here</b></a>.</p>

In [2]:
df = pd.read_csv("Data-Collisions.csv")
df.head()



Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 38 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   SEVERITYCODE    194673 non-null  int64  
 1   X               189339 non-null  float64
 2   Y               189339 non-null  float64
 3   OBJECTID        194673 non-null  int64  
 4   INCKEY          194673 non-null  int64  
 5   COLDETKEY       194673 non-null  int64  
 6   REPORTNO        194673 non-null  object 
 7   STATUS          194673 non-null  object 
 8   ADDRTYPE        192747 non-null  object 
 9   INTKEY          65070 non-null   float64
 10  LOCATION        191996 non-null  object 
 11  EXCEPTRSNCODE   84811 non-null   object 
 12  EXCEPTRSNDESC   5638 non-null    object 
 13  SEVERITYCODE.1  194673 non-null  int64  
 14  SEVERITYDESC    194673 non-null  object 
 15  COLLISIONTYPE   189769 non-null  object 
 16  PERSONCOUNT     194673 non-null  int64  
 17  PEDCOUNT  

<h2>Characteristics of the dataset:</h2>
<p>There are 194673 entries and 38 features in this dataset.<br><b>The objective of this project is to identify the causes of road collisions and their degree of severity.</b> To that end, the attributes can be grouped into the following categories.</p>
<ul>
    <li><b>Severity</b></li>
    <p>A code that corresponds to the severity of the collision, together with a detailed description:</p>
        <ul>
            <li>3: fatality</li>
            <li>2b: serious injury</li>
            <li>2: injury</li>
            <li>1: property damage</li>
            <li>0: unknown</li>
        </ul>
    <p>This includes features like SEVERITYCODE, SEVERITYCODE.1, SEVERITYDESC.</p>
    <li><b>Date and Time</b></li>
    <p>This includes the features of the date and time of the incident, like INCDATE, INCDTTM.</p>
    <li><b>Area Description</b></li>
    <p>This is mainly the description of the general location of the collision. X, Y, INCKEY, COLDETKEY, ADDRTYPE, INTKEY, LOCATION, JUNCTIONTYPE, SEGLANEKEY, CROSSWALKKEY are included here.</p>
    <li><b>Individuals</b></li>
    <p>This indicates whether pedestrians, cyclists or vehicles are involved. This is entered by the state. The features are PERSONCOUNT, PEDCOUNT, PEDCYLCOUNT and VEHCOUNT.</p>
    <li><b>Environment and Weather</b></li>
    <p>This includes the environmental conditions, like WEATHER, ROADCOND and LIGHTCOND.</p>
    <li><b>Information of the incident</b></li>
    <p>This includes the type of the incident like COLLISIONTYPE, SPEEDING and HITPARKEDCAR.</p>
    <li><b>Other features</b></li>
    <p>This includes other relevant information, like OBJECTID, COLDETKEY, REPORTNO, STATUS, EXCEPTRSNCODE, EXCEPTRSNDESC, SDOT_COLCODE, SDOT_COLDESC, INATTENTIONIND, UNDERINFL, PEDROWNOTGRNT, SDOTCOLNUM, ST_COLCODE and ST_COLDESC.</p>
</ul>

<h1>Preprocessing of the Dataset</h1>

<h2>Features</h2>
<p>Now we'll check each feature of this dataset and decide how to handle it.</p>

<h3>Feature: 'SEVERITYCODE'</h3>

In [4]:
df['SEVERITYCODE'].unique()

array([2, 1])

In [5]:
df['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

<p>'SEVERITYCODE' has two classes.
<ul>
    <li>class 1: property damage</li>
    <li>class 2: injury</li>
</ul>
So this is a binary classification problem. The dataset is imbalanced: class 1 has about 2.3 times as many entries as class 2 (136485 vs. 58188).<br>We'll use 'SEVERITYCODE' as our dependent variable.</p>
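
<p>As a quick check of the imbalance, the class shares and the ratio between the classes can be computed from the counts printed above. This is only a small sketch; the variable names are illustrative and not part of the original notebook.</p>

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({1: 136485, 2: 58188}, name="SEVERITYCODE")

# Share of each class in the whole dataset.
shares = counts / counts.sum()

# Number of class-1 entries for every class-2 entry.
imbalance_ratio = counts.loc[1] / counts.loc[2]

print(shares.round(3))
print(round(imbalance_ratio, 2))
```

<p>A ratio in this range suggests looking at per-class metrics later on, not only at overall accuracy.</p>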

<h3>Feature: 'INCDATE'</h3>
<p>We'll split this feature into three new features: 'YEAR', 'MONTH' and 'DAY'.</p>

In [6]:
df['INCDATE'] = df['INCDATE'].str[:10]
df[['YEAR', 'MONTH', 'DAY']] = df['INCDATE'].str.split('/', expand=True)
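
<p>The string split above leaves 'YEAR', 'MONTH' and 'DAY' as text. A possible alternative is to parse the date first and read the parts as integers. The sketch below assumes that 'INCDATE' starts with a date in the form YYYY/MM/DD, as the slicing and splitting above imply; the sample values are hypothetical.</p>

```python
import pandas as pd

# Hypothetical sample values in the assumed "YYYY/MM/DD ..." layout.
incdate = pd.Series(["2013/03/27 00:00:00+00", "2006/12/20 00:00:00+00"])

# Keep only the date part, then parse it with an explicit format.
parsed = pd.to_datetime(incdate.str[:10], format="%Y/%m/%d")

# Read the calendar parts as integers via the .dt accessor.
parts = pd.DataFrame({
    "YEAR": parsed.dt.year,
    "MONTH": parsed.dt.month,
    "DAY": parsed.dt.day,
})
print(parts)
```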

<h3>Irrelevant Features</h3>
<p>We'll drop the irrelevant features: some of them are redundant, and others are unnecessary or do not contain much information about the incident.</p>

In [7]:
drop_col = ["X", "Y", "OBJECTID", "INCKEY", "COLDETKEY", "REPORTNO", "STATUS", "INTKEY", "LOCATION",
            "EXCEPTRSNCODE", "EXCEPTRSNDESC", "SEVERITYCODE.1", "SEVERITYDESC", "INCDTTM", "SDOT_COLDESC",
            "SDOTCOLNUM", "ST_COLDESC", "INCDATE"]
df = df.drop(labels=drop_col, axis=1)
df.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,JUNCTIONTYPE,SDOT_COLCODE,INATTENTIONIND,...,LIGHTCOND,PEDROWNOTGRNT,SPEEDING,ST_COLCODE,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR,YEAR,MONTH,DAY
0,2,Intersection,Angles,2,0,0,2,At Intersection (intersection related),11,,...,Daylight,,,10,0,0,N,2013,3,27
1,1,Block,Sideswipe,2,0,0,2,Mid-Block (not related to intersection),16,,...,Dark - Street Lights On,,,11,0,0,N,2006,12,20
2,1,Block,Parked Car,4,0,0,3,Mid-Block (not related to intersection),14,,...,Daylight,,,32,0,0,N,2004,11,18
3,1,Block,Other,3,0,0,3,Mid-Block (not related to intersection),11,,...,Daylight,,,23,0,0,N,2013,3,29
4,2,Intersection,Angles,2,0,0,2,At Intersection (intersection related),11,,...,Daylight,,,10,0,0,N,2004,1,28


<h3>Analyzing the remaining features.</h3>
<p>Now we'll analyze the features that are kept.</p>

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 23 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   SEVERITYCODE    194673 non-null  int64 
 1   ADDRTYPE        192747 non-null  object
 2   COLLISIONTYPE   189769 non-null  object
 3   PERSONCOUNT     194673 non-null  int64 
 4   PEDCOUNT        194673 non-null  int64 
 5   PEDCYLCOUNT     194673 non-null  int64 
 6   VEHCOUNT        194673 non-null  int64 
 7   JUNCTIONTYPE    188344 non-null  object
 8   SDOT_COLCODE    194673 non-null  int64 
 9   INATTENTIONIND  29805 non-null   object
 10  UNDERINFL       189789 non-null  object
 11  WEATHER         189592 non-null  object
 12  ROADCOND        189661 non-null  object
 13  LIGHTCOND       189503 non-null  object
 14  PEDROWNOTGRNT   4667 non-null    object
 15  SPEEDING        9333 non-null    object
 16  ST_COLCODE      194655 non-null  object
 17  SEGLANEKEY      194673 non-nu

<h3>Feature: 'ADDRTYPE'</h3>

In [9]:
df['ADDRTYPE'].unique()

array(['Intersection', 'Block', 'Alley', nan], dtype=object)

<p>Now we'll convert this categorical variable into indicator (dummy) variables. Rows where 'ADDRTYPE' is nan are not dropped here; get_dummies gives them 0 in every indicator column.</p>

In [10]:
# df.dropna()
df = pd.concat([df, pd.get_dummies(df['ADDRTYPE'])], axis=1)
df.drop(['ADDRTYPE'], axis=1, inplace=True)
df.head()

Unnamed: 0,SEVERITYCODE,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,JUNCTIONTYPE,SDOT_COLCODE,INATTENTIONIND,UNDERINFL,...,ST_COLCODE,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR,YEAR,MONTH,DAY,Alley,Block,Intersection
0,2,Angles,2,0,0,2,At Intersection (intersection related),11,,N,...,10,0,0,N,2013,3,27,0,0,1
1,1,Sideswipe,2,0,0,2,Mid-Block (not related to intersection),16,,0,...,11,0,0,N,2006,12,20,0,1,0
2,1,Parked Car,4,0,0,3,Mid-Block (not related to intersection),14,,0,...,32,0,0,N,2004,11,18,0,1,0
3,1,Other,3,0,0,3,Mid-Block (not related to intersection),11,,N,...,23,0,0,N,2013,3,29,0,1,0
4,2,Angles,2,0,0,2,At Intersection (intersection related),11,,0,...,10,0,0,N,2004,1,28,0,0,1
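
<p>To see how get_dummies treats missing values, here is a small sketch on a hypothetical four-row series. The prefix and dummy_na arguments are not used in the notebook; they are shown only as options that keep the column origin visible and make missing values explicit.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing address type.
addrtype = pd.Series(["Intersection", "Block", np.nan, "Alley"], name="ADDRTYPE")

# Default behaviour: the nan row gets 0 in every indicator column.
dummies = pd.get_dummies(addrtype, prefix="ADDRTYPE", dtype=int)

# With dummy_na=True, missing values get their own indicator column.
with_nan = pd.get_dummies(addrtype, prefix="ADDRTYPE", dummy_na=True, dtype=int)

print(dummies)
print(with_nan.columns.tolist())
```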


<h3>Feature: 'COLLISIONTYPE'</h3>

In [11]:
df['COLLISIONTYPE'].unique()

array(['Angles', 'Sideswipe', 'Parked Car', 'Other', 'Cycles',
       'Rear Ended', 'Head On', nan, 'Left Turn', 'Pedestrian',
       'Right Turn'], dtype=object)

<p>Now we'll convert this categorical variable into indicator variables and drop the 'Other' indicator column. Entries that are nan or 'Other' end up with 0 in every remaining indicator column.</p>

In [12]:
df = pd.concat([df, pd.get_dummies(df['COLLISIONTYPE'])], axis=1)
drop_col = ['COLLISIONTYPE', 'Other']
df = df.drop(labels=drop_col, axis=1)
df.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,JUNCTIONTYPE,SDOT_COLCODE,INATTENTIONIND,UNDERINFL,WEATHER,...,Intersection,Angles,Cycles,Head On,Left Turn,Parked Car,Pedestrian,Rear Ended,Right Turn,Sideswipe
0,2,2,0,0,2,At Intersection (intersection related),11,,N,Overcast,...,1,1,0,0,0,0,0,0,0,0
1,1,2,0,0,2,Mid-Block (not related to intersection),16,,0,Raining,...,0,0,0,0,0,0,0,0,0,1
2,1,4,0,0,3,Mid-Block (not related to intersection),14,,0,Overcast,...,0,0,0,0,0,1,0,0,0,0
3,1,3,0,0,3,Mid-Block (not related to intersection),11,,N,Clear,...,0,0,0,0,0,0,0,0,0,0
4,2,2,0,0,2,At Intersection (intersection related),11,,0,Raining,...,1,1,0,0,0,0,0,0,0,0


<h3>Feature: 'JUNCTIONTYPE'</h3>

In [13]:
df['JUNCTIONTYPE'].unique()

array(['At Intersection (intersection related)',
       'Mid-Block (not related to intersection)', 'Driveway Junction',
       'Mid-Block (but intersection related)',
       'At Intersection (but not related to intersection)', nan,
       'Unknown', 'Ramp Junction'], dtype=object)

<p>Now we'll convert this categorical variable into indicator variables and drop the 'Unknown' indicator column. Entries that are nan or 'Unknown' end up with 0 in every remaining indicator column.</p>

In [14]:
df = pd.concat([df, pd.get_dummies(df['JUNCTIONTYPE'])], axis=1)
drop_col = ['JUNCTIONTYPE', 'Unknown']
df = df.drop(labels=drop_col, axis=1)
df.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,...,Pedestrian,Rear Ended,Right Turn,Sideswipe,At Intersection (but not related to intersection),At Intersection (intersection related),Driveway Junction,Mid-Block (but intersection related),Mid-Block (not related to intersection),Ramp Junction
0,2,2,0,0,2,11,,N,Overcast,Wet,...,0,0,0,0,0,1,0,0,0,0
1,1,2,0,0,2,16,,0,Raining,Wet,...,0,0,0,1,0,0,0,0,1,0
2,1,4,0,0,3,14,,0,Overcast,Dry,...,0,0,0,0,0,0,0,0,1,0
3,1,3,0,0,3,11,,N,Clear,Dry,...,0,0,0,0,0,0,0,0,1,0
4,2,2,0,0,2,11,,0,Raining,Wet,...,0,0,0,0,0,1,0,0,0,0


<h3>Feature: 'INATTENTIONIND'</h3>

In [15]:
print("Unique values:\n{}".format(df['INATTENTIONIND'].unique()))
print("Count of values:\n{}".format(df['INATTENTIONIND'].value_counts()))

Unique values:
[nan 'Y']
Count of values:
Y    29805
Name: INATTENTIONIND, dtype: int64


<p>As we can see, this feature has too few non-null entries (29805), so we drop it.</p>

In [16]:
df.drop(['INATTENTIONIND'], axis=1, inplace=True)

<h3>Feature: 'UNDERINFL'</h3>

In [17]:
df['UNDERINFL'].unique()

array(['N', '0', nan, '1', 'Y'], dtype=object)

<p>As we can see, this feature mixes two encodings ('N'/'Y' and '0'/'1'). We'll map everything to the integers 0 and 1 and drop the rows where the value is nan.</p>

In [18]:
df = df.replace({'UNDERINFL': {'N':0, 'Y':1}})
df = df[df['UNDERINFL'].notna()]
df['UNDERINFL'] = df['UNDERINFL'].astype(int)
df['UNDERINFL'].unique()

array([0, 1])
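
<p>The same clean-up can also be written as a single explicit mapping, which makes every accepted input value visible. This is a sketch on a hypothetical series holding the five values seen above, not the notebook's own code.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical sample containing every value seen in unique().
underinfl = pd.Series(["N", "0", np.nan, "1", "Y"], name="UNDERINFL")

# Explicit lookup table: anything not listed (including nan) becomes nan.
to_flag = {"N": 0, "0": 0, "Y": 1, "1": 1}

# Map, drop the unmapped entries, then store the flags as integers.
cleaned = underinfl.map(to_flag).dropna().astype(int)
print(cleaned.tolist())
```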

<h3>Feature: 'WEATHER'</h3>

In [19]:
df['WEATHER'].unique()

array(['Overcast', 'Raining', 'Clear', 'Unknown', 'Other', 'Snowing', nan,
       'Fog/Smog/Smoke', 'Sleet/Hail/Freezing Rain', 'Blowing Sand/Dirt',
       'Severe Crosswind', 'Partly Cloudy'], dtype=object)

<p>Now we'll convert this categorical variable into indicator variables and drop the 'Unknown' and 'Other' indicator columns. Entries that are nan, 'Unknown' or 'Other' end up with 0 in every remaining indicator column.</p>

In [20]:
df = pd.concat([df, pd.get_dummies(df['WEATHER'])], axis=1)
drop_col = ['WEATHER', 'Unknown', 'Other']
df = df.drop(labels=drop_col, axis=1)
df.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,UNDERINFL,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,...,Ramp Junction,Blowing Sand/Dirt,Clear,Fog/Smog/Smoke,Overcast,Partly Cloudy,Raining,Severe Crosswind,Sleet/Hail/Freezing Rain,Snowing
0,2,2,0,0,2,11,0,Wet,Daylight,,...,0,0,0,0,1,0,0,0,0,0
1,1,2,0,0,2,16,0,Wet,Dark - Street Lights On,,...,0,0,0,0,0,0,1,0,0,0
2,1,4,0,0,3,14,0,Dry,Daylight,,...,0,0,0,0,1,0,0,0,0,0
3,1,3,0,0,3,11,0,Dry,Daylight,,...,0,0,1,0,0,0,0,0,0,0
4,2,2,0,0,2,11,0,Wet,Daylight,,...,0,0,0,0,0,0,1,0,0,0


<h3>Feature: 'ROADCOND'</h3>

In [21]:
df['ROADCOND'].unique()

array(['Wet', 'Dry', 'Unknown', nan, 'Snow/Slush', 'Ice', 'Other',
       'Sand/Mud/Dirt', 'Standing Water', 'Oil'], dtype=object)

<p>Now we'll convert this categorical variable into indicator variables and drop the 'Unknown' and 'Other' indicator columns. Entries that are nan, 'Unknown' or 'Other' end up with 0 in every remaining indicator column.</p>

In [22]:
df = pd.concat([df, pd.get_dummies(df['ROADCOND'])], axis=1)
drop_col = ['ROADCOND', 'Unknown', 'Other']
df = df.drop(labels=drop_col, axis=1)
df.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,UNDERINFL,LIGHTCOND,PEDROWNOTGRNT,SPEEDING,...,Severe Crosswind,Sleet/Hail/Freezing Rain,Snowing,Dry,Ice,Oil,Sand/Mud/Dirt,Snow/Slush,Standing Water,Wet
0,2,2,0,0,2,11,0,Daylight,,,...,0,0,0,0,0,0,0,0,0,1
1,1,2,0,0,2,16,0,Dark - Street Lights On,,,...,0,0,0,0,0,0,0,0,0,1
2,1,4,0,0,3,14,0,Daylight,,,...,0,0,0,1,0,0,0,0,0,0
3,1,3,0,0,3,11,0,Daylight,,,...,0,0,0,1,0,0,0,0,0,0
4,2,2,0,0,2,11,0,Daylight,,,...,0,0,0,0,0,0,0,0,0,1


<h3>Feature: 'LIGHTCOND'</h3>

In [23]:
df['LIGHTCOND'].unique()

array(['Daylight', 'Dark - Street Lights On', 'Dark - No Street Lights',
       'Unknown', 'Dusk', 'Dawn', 'Dark - Street Lights Off', 'Other',
       'Dark - Unknown Lighting', nan], dtype=object)

<p>Now we'll convert this categorical variable into indicator variables and drop the 'Unknown' and 'Other' indicator columns. Entries that are nan, 'Unknown' or 'Other' end up with 0 in every remaining indicator column.</p>

In [24]:
df = pd.concat([df, pd.get_dummies(df['LIGHTCOND'])], axis=1)
drop_col = ['LIGHTCOND', 'Unknown', 'Other']
df = df.drop(labels=drop_col, axis=1)
df.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,UNDERINFL,PEDROWNOTGRNT,SPEEDING,ST_COLCODE,...,Snow/Slush,Standing Water,Wet,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk
0,2,2,0,0,2,11,0,,,10,...,0,0,1,0,0,0,0,0,1,0
1,1,2,0,0,2,16,0,,,11,...,0,0,1,0,0,1,0,0,0,0
2,1,4,0,0,3,14,0,,,32,...,0,0,0,0,0,0,0,0,1,0
3,1,3,0,0,3,11,0,,,23,...,0,0,0,0,0,0,0,0,1,0
4,2,2,0,0,2,11,0,,,10,...,0,0,1,0,0,0,0,0,1,0
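
<p>The encode-then-drop pattern used for the categorical features above could be collected into one helper. The function below is a hypothetical sketch (the name one_hot and its arguments do not appear in the notebook). Unlike the cells above, it prefixes each indicator with the source column name, so 'Unknown' and 'Other' columns from different features cannot be confused.</p>

```python
import numpy as np
import pandas as pd

def one_hot(frame, column, drop_values=()):
    """Replace `column` with indicator columns, omitting `drop_values`."""
    dummies = pd.get_dummies(frame[column], prefix=column, dtype=int)
    # Indicator names follow the "<column>_<value>" pattern set by prefix.
    unwanted = [f"{column}_{value}" for value in drop_values]
    dummies = dummies.drop(columns=unwanted, errors="ignore")
    return pd.concat([frame.drop(columns=[column]), dummies], axis=1)

# Hypothetical sample rows.
sample = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1, 2],
    "WEATHER": ["Clear", "Unknown", np.nan, "Raining"],
    "ROADCOND": ["Dry", "Wet", "Other", "Unknown"],
})

encoded = one_hot(sample, "WEATHER", drop_values=("Unknown", "Other"))
encoded = one_hot(encoded, "ROADCOND", drop_values=("Unknown", "Other"))
print(encoded.columns.tolist())
```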


<h3>Feature: 'PEDROWNOTGRNT'</h3>

In [25]:
print("Unique values:\n{}".format(df['PEDROWNOTGRNT'].unique()))
print("Count of values:\n{}".format(df['PEDROWNOTGRNT'].value_counts()))

Unique values:
[nan 'Y']
Count of values:
Y    4667
Name: PEDROWNOTGRNT, dtype: int64


<p>As we can see, this feature has too few non-null entries (4667), so we drop it.</p>

In [26]:
df.drop(['PEDROWNOTGRNT'], axis=1, inplace=True)

<h3>Feature: 'SPEEDING'</h3>

In [27]:
print("Unique values:\n{}".format(df['SPEEDING'].unique()))
print("Count of values:\n{}".format(df['SPEEDING'].value_counts()))

Unique values:
[nan 'Y']
Count of values:
Y    9333
Name: SPEEDING, dtype: int64


<p>As we can see, this feature has too few non-null entries (9333), so we drop it.</p>

In [28]:
df.drop(['SPEEDING'], axis=1, inplace=True)
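
<p>The decision to drop 'INATTENTIONIND', 'PEDROWNOTGRNT' and 'SPEEDING' rests on how few entries they contain. One way to make that check explicit is to compute the share of non-null entries per column. The sketch below uses a hypothetical four-row frame and an arbitrary threshold chosen only for illustration.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two sparse flag columns and one complete column.
flags = pd.DataFrame({
    "INATTENTIONIND": ["Y", np.nan, np.nan, "Y"],
    "SPEEDING": [np.nan, np.nan, "Y", np.nan],
    "HITPARKEDCAR": ["N", "N", "Y", "N"],
})

# Share of non-null entries in each column.
filled_share = flags.notna().mean()

# Arbitrary cut-off for this illustration, not a value from the notebook.
threshold = 0.6
sparse_columns = filled_share[filled_share < threshold].index.tolist()

print(filled_share)
print(sparse_columns)
```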

<h3>Feature: 'ST_COLCODE'</h3>

In [29]:
df['ST_COLCODE'].unique()

array(['10', '11', '32', '23', '5', '22', '14', '30', '28', '51', '13',
       '50', '12', '45', '0', '20', '21', '1', '52', '16', '15', '74',
       '81', '26', '19', '2', '66', '71', '3', '24', '40', '57', '6',
       '83', '25', '27', '4', '72', '29', '56', '73', '41', '17', '65',
       '82', '67', '49', '84', '31', '43', '42', '48', '64', '53', 32, 50,
       15, 10, 14, 20, 13, 22, 51, 11, 28, 12, 52, 21, 0, 19, 30, 16, 40,
       26, 27, 83, 2, 45, 65, 23, 24, 71, 1, 29, 81, 25, 4, 73, 74, 72, 3,
       84, 64, 57, 42, 41, 48, 66, 56, 31, 82, 67, '54', '60', 53, 43, 87,
       54, '87', nan, ' ', '7', '8', '85', '88', '18'], dtype=object)

<p>The values are a mix of strings and integers, so we'll convert this feature from object to numeric. Entries that cannot be parsed, such as ' ', become nan.</p>

In [30]:
df['ST_COLCODE'] = pd.to_numeric(df['ST_COLCODE'], errors='coerce')
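
<p>The effect of errors='coerce' can be seen on a small sample. The series below is hypothetical, but it mixes the same kinds of values as 'ST_COLCODE': digit strings, integers, a blank string and nan.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical sample mixing strings, an integer, a blank and nan.
codes = pd.Series(["10", 32, " ", np.nan, "87"], dtype=object, name="ST_COLCODE")

# Unparseable entries are turned into nan instead of raising an error.
numeric = pd.to_numeric(codes, errors="coerce")

print(numeric.tolist())
print(numeric.isna().sum())
```

<p>The entries coerced to nan are then removed together with the other null values in the next section.</p>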

<h3>Feature: 'HITPARKEDCAR'</h3>

In [31]:
df['HITPARKEDCAR'].unique()

array(['N', 'Y'], dtype=object)

<p>Now we'll convert this categorical variable into a quantitative variable.</p>

In [32]:
df = df.replace({'HITPARKEDCAR': {'N':0, 'Y':1}})
df['HITPARKEDCAR'] = df['HITPARKEDCAR'].astype(int)
df['HITPARKEDCAR'].unique()

array([0, 1])

<h2>Null Values</h2>
<p>First we'll find out how many null values each feature contains.</p>

In [33]:
pd.isnull(df).sum()

SEVERITYCODE                                          0
PERSONCOUNT                                           0
PEDCOUNT                                              0
PEDCYLCOUNT                                           0
VEHCOUNT                                              0
SDOT_COLCODE                                          0
UNDERINFL                                             0
ST_COLCODE                                           21
SEGLANEKEY                                            0
CROSSWALKKEY                                          0
HITPARKEDCAR                                          0
YEAR                                                  0
MONTH                                                 0
DAY                                                   0
Alley                                                 0
Block                                                 0
Intersection                                          0
Angles                                          

<p>Now we'll drop the rows that contain null values.</p>

In [34]:
df = df.dropna(axis=0)

<p>Finally, we'll check the features.</p>

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 189768 entries, 0 to 194672
Data columns (total 55 columns):
 #   Column                                             Non-Null Count   Dtype  
---  ------                                             --------------   -----  
 0   SEVERITYCODE                                       189768 non-null  int64  
 1   PERSONCOUNT                                        189768 non-null  int64  
 2   PEDCOUNT                                           189768 non-null  int64  
 3   PEDCYLCOUNT                                        189768 non-null  int64  
 4   VEHCOUNT                                           189768 non-null  int64  
 5   SDOT_COLCODE                                       189768 non-null  int64  
 6   UNDERINFL                                          189768 non-null  int64  
 7   ST_COLCODE                                         189768 non-null  float64
 8   SEGLANEKEY                                         189768 non-null  int64 

<h2>Saving Dataset</h2>
<p>We'll save the preprocessed dataset.</p>

In [36]:
df.to_csv("preprocessed_dataset.csv")
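
<p>By default, to_csv also writes the row index, which comes back as an extra 'Unnamed: 0' column when the file is read again. If that column is not wanted, index=False avoids it. The sketch below shows the difference on a hypothetical two-row frame, using an in-memory buffer instead of a file.</p>

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the preprocessed dataset.
frame = pd.DataFrame({"SEVERITYCODE": [2, 1], "UNDERINFL": [0, 1]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_without_index.columns.tolist())
```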