# Data Cleaning

In this step, I look at each variable and clean the features before doing the exploratory analysis.

## Import libraries + load data set

In [1]:
# libraries used
import pandas as pd
import numpy as np

# for geo manipulation
import geopandas as gpd
from shapely.geometry import Point

# loading data
data = pd.read_csv(
    'data/processed/merged.csv',
    low_memory=False
)

# Target creation

As outlined in the `README.md`, the aim is to predict an accident's overall severity score. Therefore, we need to build a severity index based on the individual gravity (`gravity`).

The formula to calculate the overall severity is (see `README.md` for more information):

$$\text{Severity Score} = \frac{1}{n} \sum_{i=1}^n is_i + (weight_{killed} \cdot pk)^{0.4} + (weight_{hospital} \cdot ph)^{0.2}$$

with

- $is_i$: individual score of person $i$
- $n$: number of persons involved in the accident
- $pk$: persons killed
- $ph$: persons in hospital

Hence, I need to create counts of `killed` and `hospital` per accident. 

The values of `gravity` are: `-1` no information, `1` uninjured, `2` killed, `3` in hospital, `4` lightly injured. In the following, I first map these values to weights (in relation to AIS) and take the mean of the weights over each accident. Then I create two dummies, `killed` and `hospital`, and sum them to count the persons killed and the persons in hospital per accident. Finally, I add the extra terms for the number of persons killed or in hospital per accident.

In [2]:
weights = {
    1: 0,  # Uninjured, but involved
    2: 6,  # killed
    3: 3.5,  # injured, in hospital
    4: 1,  # lightly injured
    -1: 0.0,  # no information, nothing added
}
# based on scale for accident severity (see report)

# apply the weights to gravity and store them in a new feature
data['grav_weighted'] = data['gravity'].map(weights)

# mean of the weighted gravity per accident
data['severity_score'] = data.groupby('num_acc')[
    'grav_weighted'].transform('mean')

# Dummy: killed
data['killed'] = [1 if gravity == 2 else 0 for gravity in data['gravity']]

# Sum of killed per accident
data['killed_n'] = data.groupby('num_acc')['killed'].transform('sum')

# Dummy: Hospital
data['hospital'] = [1 if gravity == 3 else 0 for gravity in data['gravity']]

# Sum of in hospital per accident
data['hospital_n'] = data.groupby('num_acc')['hospital'].transform('sum')

# applying extra weight
data['severity_score'] = (
    data['severity_score']
    + (data['killed_n'] * 6) ** 0.4
    + (data['hospital_n'] * 3.5) ** 0.2
)
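
As a sanity check, the computation above can be reproduced by hand for a single accident. The sketch below assumes a hypothetical accident with six persons involved (one in hospital, one lightly injured, four uninjured); `accident_score` is an illustrative helper and not part of the notebook.

```python
# Weights and exponents as used in the cell above
WEIGHTS = {1: 0, 2: 6, 3: 3.5, 4: 1, -1: 0.0}


def accident_score(gravities):
    """Severity score for one accident, given the `gravity` code of each person."""
    individual = [WEIGHTS[g] for g in gravities]
    mean_score = sum(individual) / len(individual)
    killed_n = sum(g == 2 for g in gravities)
    hospital_n = sum(g == 3 for g in gravities)
    return mean_score + (killed_n * 6) ** 0.4 + (hospital_n * 3.5) ** 0.2


# Hypothetical accident: one person in hospital, one lightly injured, four uninjured
example_score = accident_score([3, 4, 1, 1, 1, 1])
print(round(example_score, 6))
```

Here the mean individual score is (3.5 + 1) / 6 = 0.75, nobody was killed, and the hospital term is 3.5 ** 0.2, which gives roughly 2.0347, a value that also occurs in the `data.head()` output.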

Now, I just have to drop all features related to gravity except for `severity_score`. Furthermore, I can now keep just one row per accident.

In [3]:
data.drop(
    columns=[
        'gravity', 'grav_weighted', 'killed', 'killed_n', 'hospital',
        'hospital_n'
    ],
    axis=1, 
    inplace=True
)

data.drop_duplicates(subset='num_acc', inplace=True)

data.head()

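| Unnamed: 0 | num_acc | month | day | time | light_condit | location_type | intersect_type | weather_cond | collision_type | municip_code | ... | res_lane | prof_road | plan_view | tpc_length | width_road | surface | infra | acc_loc | max_speed | severity_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200500000001 | 1 | 12 | 1900 | 3 | 2 | 1 | 1.0 | 3.0 | 11 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 63.0 | 1.0 | 0.0 | 1.0 |  | 2.034735 |
| 6 | 200500000002 | 1 | 21 | 1600 | 1 | 2 | 1 | 1.0 | 1.0 | 51 | ... | 1.0 | 1.0 | 1.0 | 0.0 | 100.0 | 1.0 | 0.0 | 5.0 |  | 3.034735 |
| 8 | 200500000003 | 1 | 21 | 1845 | 3 | 1 | 1 | 2.0 | 1.0 | 51 | ... | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 5.0 |  | 3.034735 |
| 10 | 200500000004 | 1 | 4 | 1615 | 1 | 1 | 1 | 1.0 | 5.0 | 82 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |  | 3.725773 |
| 14 | 200500000005 | 1 | 10 | 1945 | 3 | 1 | 1 | 3.0 | 6.0 | 478 | ... | 0.0 | 1.0 | 3.0 | 0.0 | 59.0 | 2.0 | 0.0 | 3.0 |  | 3.534735 |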


# Features

## Latitude and Longitude (main interest variables)

I already noticed two problems in the `lat` and `long` features: 1) some values use the wrong decimal sign, 2) some missing values are coded as `-`. In the first case, I just need to replace the `,` with `.`; in the second case, I need to set `-` to missing.

In [4]:
data['long'].unique()

data['long'] = data['long'].astype(str)

data['long'] = data['long'].str.replace(',', '.', regex=False)

data.loc[data['long']=='-', 'long'] = np.nan

data['long'] = data['long'].astype(float)


Now, I do the same steps for latitude.

In [5]:
data['lat'].unique()

data['lat'] = data['lat'].astype(str)

data['lat'] = data['lat'].str.replace(',', '.', regex=False)

data.loc[data['lat']=='-', 'lat'] = np.nan

data['lat'] = data['lat'].astype(float)
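
For a single raw value, the cleaning steps applied to `long` and `lat` amount to the following. This is only a minimal sketch: `parse_coordinate` is a hypothetical helper name and the sample inputs are made up.

```python
import math


def parse_coordinate(raw):
    """Convert a raw coordinate to float; '-' and missing values become NaN."""
    text = str(raw).strip().replace(',', '.')  # decimal comma -> decimal point
    if text in ('-', '', 'nan', 'None'):
        return math.nan
    return float(text)


samples = ['48,8566', '2.3522', '-', None]
parsed = [parse_coordinate(s) for s in samples]
print(parsed)
```

The column-wise version in the cells above does the same with `str.replace` and `astype(float)`, which is much faster on a full data frame.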


The problem is that the data seems to mix degrees with some other projected coordinate system (PCS): some values fall within the valid ranges of latitude (-90 to +90) and longitude (-180 to +180) in degrees, while others are clearly out of bounds (e.g., 50334). As you can see below, there are 326434 cases where both coordinates are out of bounds.

In [6]:
print(data['lat'][((data['lat'] > 90) | (data['lat'] < -90)) & ((data['long'] > 180) | (data['long'] < -180))].count())

326434


Since latitude and longitude might have been entered in different ways, I tried to reproject the out-of-bounds values with `geopandas`, assuming different standards that might have been used (EPSG:32631, EPSG:2154, EPSG:3395, and EPSG:27572), but none of them yielded points in France. I inspected the transformations visually in `09-visualizations.ipynb`, which showed incorrect data points.
Hence, I filtered the data set to keep only data within the limits of latitude and longitude. This left a plausible amount of data within France.

In [7]:
# # create invalid df with only cases out of bound
# invalid = data[
#     (data['lat'] > 90) | (data['lat'] < -90) |
#     (data['long'] > 180) | (data['long'] < -180)
# ]

# # create valid df with the rest of the data
# valid = data[~data.index.isin(invalid.index)] 

# # assuming invalid rows are in a projected CRS (in meters), create a GeoDataFrame
# geom_invalid = [Point(xy) for xy in zip(invalid['long'], invalid['lat'])]
# gdf_invalid = gpd.GeoDataFrame(invalid, geometry=geom_invalid)

# # setting CRS for the invalid data (here assuming EPSG:32631)
# gdf_invalid.set_crs(epsg=32631, inplace=True)

# # reprojecting to WGS84 (EPSG:4326)
# gdf_invalid = gdf_invalid.to_crs(epsg=4326)

# # extracting latitude and longitude from GeoDataFrame
# gdf_invalid['lat'] = gdf_invalid.geometry.y
# gdf_invalid['long'] = gdf_invalid.geometry.x

# # combining valid data with reprojected data
# data = pd.concat([valid, gdf_invalid], axis=0)

In [8]:
data.loc[
    (data['lat'] < -90) | (data['lat'] > 90) | 
    (data['long'] < -180) | (data['long'] > 180), 
    ['lat', 'long']
] = np.nan

data.dropna(subset=['long', 'lat'], inplace=True)

data = data[data['lat'] != 0]

Finally, I check whether any data point is still outside of France.

In [9]:
print(data['lat'][((data['lat'] < 41.3037) | (data['lat'] > 51.1242)) & ((data['long'] < -5.3163) | (data['long'] > 9.6625))].count())

0


Although we get 0 above, the map in `09-visualizations.ipynb` shows a point in the Gulf of Guinea, which is clearly not mainland France. These rows are probably plotted as degrees although they were entered in a different format, which creates wrong locations that would distort the algorithms. Since the original format is not documented and the attempts above did not lead to a sufficient solution, the only option left is to delete these rows.
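
One likely reason why the bounding-box check returns 0 is that it combines the latitude and longitude conditions with `&`, so a row is only flagged when both coordinates are outside the box. The sketch below contrasts this with an "or" combination for a made-up point in the Gulf of Guinea; the helper names and sample points are illustrative assumptions.

```python
# Bounding box used in the check above (mainland France, in degrees)
LAT_MIN, LAT_MAX = 41.3037, 51.1242
LONG_MIN, LONG_MAX = -5.3163, 9.6625


def outside_both(lat, long):
    """True only if latitude AND longitude are outside the box (as in the check above)."""
    return (lat < LAT_MIN or lat > LAT_MAX) and (long < LONG_MIN or long > LONG_MAX)


def outside_any(lat, long):
    """True if latitude OR longitude is outside the box."""
    return (lat < LAT_MIN or lat > LAT_MAX) or (long < LONG_MIN or long > LONG_MAX)


gulf_of_guinea = (4.5, 0.0)  # made-up point: longitude in range, latitude too small
paris = (48.8566, 2.3522)    # made-up point inside the box

print(outside_both(*gulf_of_guinea), outside_any(*gulf_of_guinea))  # False True
print(outside_both(*paris), outside_any(*paris))                    # False False
```

With the "or" variant, such rows would already be flagged by the bounding-box check.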

In [10]:
print(data['lat'][((data['lat'] >= 3.5) & (data['lat'] <= 5.5)) & ((data['long'] >= -0.1) & (data['long'] <= 0.1))].count())

19678


In [11]:
data = data[
    ~(((data['lat'] >= 3.5) & (data['lat'] <= 5.5)) & 
    ((data['long'] >= -0.1) & (data['long'] <= 0.1)))
]

Let's check whether any of the remaining features has a high share of missing values. Features with more than 40% missing values will be deleted.

In [12]:
data.isna().mean() * 100

num_acc            0.000000
month              0.000000
day                0.000000
time               0.000000
light_condit       0.000000
location_type      0.000000
intersect_type     0.000000
weather_cond       0.000000
collision_type     0.000000
municip_code       0.000000
adr                2.214121
gps_code          81.101064
lat                0.000000
long               0.000000
dep                0.000000
year               0.000000
road_cat           0.000000
road_num          10.075593
road_num_index     2.824859
v2                90.467285
traffic_dir        0.000000
num_lanes          0.000000
pr                 0.000000
pr1                0.000000
res_lane           0.000000
prof_road          0.000000
plan_view          0.000000
tpc_length        99.884007
width_road        14.031184
surface            0.000000
infra              0.000000
acc_loc            0.000000
max_speed          0.000000
severity_score     0.000000
dtype: float64

In the output above, this is the case for:

- `gps_code`
- `v2`
- `tpc_length`

`road_num_index` and `max_speed` are dropped at the end of this notebook as well.
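
Such a list can also be derived programmatically instead of reading it off the output. A minimal sketch on a made-up toy frame, assuming the 40% threshold stated above; the names `toy`, `missing_share` and `to_drop` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): 'b' is 75% missing, 'a' and 'c' stay below the threshold
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})

missing_share = toy.isna().mean()  # share of missing values per column
to_drop = missing_share[missing_share > 0.40].index.tolist()
print(to_drop)  # ['b']
```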

All other features will be inspected below. Probably more features will be deleted, since no information is sometimes coded as `0` or `-1`, which is equivalent to `NA` but not detected above. Features with few `NA`s will be handled in feature engineering after the train-test split.
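
The recoding of these "no information" codes follows the same pattern for most features below. A compact sketch of that pattern on a made-up toy frame; which codes count as "no information" is an assumption that has to be checked per feature.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with -1 and 0 as "no information" codes
toy = pd.DataFrame({
    'surface': [1.0, -1.0, 2.0, 0.0],
    'acc_loc': [3.0, 1.0, -1.0, 5.0],
})

# Columns where both -1 and 0 mean "no information" (assumption per feature)
sentinel_cols = ['surface', 'acc_loc']
toy[sentinel_cols] = toy[sentinel_cols].replace([-1, 0], np.nan)

print(toy.isna().sum())
```

Only after this step does `isna().mean()` reflect the real share of missing information.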

## `light_condit`

I turned `-1` (no information) to `NA` (3 cases).

In [13]:
print(data['light_condit'].value_counts(dropna=False))

data.loc[data['light_condit'] == -1, 'light_condit'] = np.nan

light_condit
 1    83854
 5    19824
 3    13204
 2     8591
 4     1256
-1        3
Name: count, dtype: int64


## `location_type`

All good!

In [14]:
print(data['location_type'].value_counts(dropna=False))

location_type
2    80202
1    46530
Name: count, dtype: int64


## `intersect_type`

Values coded `-1`/`0` (no information) are turned to `NA` (11 cases). The feature has a lot of categories; further check in feature engineering.

In [15]:
print(data['intersect_type'].value_counts(dropna=False))

# no information to NA
data.loc[data['intersect_type'] == -1, 'intersect_type'] = np.nan
data.loc[data['intersect_type'] == 0, 'intersect_type'] = np.nan

intersect_type
 1    80677
 2    16030
 3    14238
 6     5481
 9     5184
 4     2789
 7     1289
 5      770
 8      263
-1       11
Name: count, dtype: int64


In [16]:
print(data['intersect_type'].value_counts(dropna=False))


intersect_type
1.0    80677
2.0    16030
3.0    14238
6.0     5481
9.0     5184
4.0     2789
7.0     1289
5.0      770
8.0      263
NaN       11
Name: count, dtype: int64


## `weather_cond`

Values coded `-1` are turned to `NA` (6 cases). The feature has a lot of categories; further check in feature engineering.

In [17]:
print(data['weather_cond'].value_counts(dropna=False))

data.loc[data['weather_cond'] == -1, 'weather_cond'] = np.nan

weather_cond
 1.0    100097
 2.0     13799
 8.0      5059
 3.0      2930
 7.0      2447
 5.0       951
 9.0       638
 4.0       437
 6.0       368
-1.0         6
Name: count, dtype: int64


In [18]:
print(data['weather_cond'].value_counts(dropna=False))


weather_cond
1.0    100097
2.0     13799
8.0      5059
3.0      2930
7.0      2447
5.0       951
9.0       638
4.0       437
6.0       368
NaN         6
Name: count, dtype: int64


## `collision_type`

Cases with no information (`-1`) are turned to `NA` (66 cases). The feature has a lot of categories; further check in feature engineering.

In [19]:
print(data['collision_type'].value_counts(dropna=False))

data.loc[data['collision_type'] == -1, 'collision_type'] = np.nan

collision_type
 6.0    38365
 3.0    37837
 2.0    16794
 1.0    13084
 7.0    12397
 4.0     4529
 5.0     3660
-1.0       66
Name: count, dtype: int64


In [20]:
print(data['collision_type'].value_counts(dropna=False))


collision_type
6.0    38365
3.0    37837
2.0    16794
1.0    13084
7.0    12397
4.0     4529
5.0     3660
NaN       66
Name: count, dtype: int64


## `municip_code`

Code for the municipality based on INSEE. Note that this correlates with the `dep` variable, which holds the department code based on INSEE. I explore this further in the feature engineering. There is one case with the value `N/C`, which is turned to `NA`.

Usually, the INSEE code for municipalities has 5 digits: the first two indicate the department and the last three the municipality. However, as you can see below, 6706 entries are lower than $10000$, indicating an odd format for the INSEE municipality code. Hence, I drop this variable and might only include `dep` next to the latitude and longitude.
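
To illustrate the structure described above, a regular five-digit code can be split into its department and municipality parts with integer division. A small sketch that assumes purely numeric codes; `split_insee` is a hypothetical helper and not used elsewhere in the notebook.

```python
def split_insee(code):
    """Split a numeric five-digit INSEE municipality code into (department, municipality)."""
    return divmod(int(code), 1000)


print(split_insee(75116))  # (75, 116)
```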

In [21]:
print(data['municip_code'].value_counts(dropna=False))

data.loc[data['municip_code']=='N/C', 'municip_code'] = np.nan

data['municip_code'] = data['municip_code'].astype('Int64')


municip_code
75116    1287
75112     949
75119     907
75117     899
75115     884
         ... 
49201       1
64131       1
88375       1
30019       1
17083       1
Name: count, Length: 17399, dtype: int64


In [22]:
insee_test = data['municip_code'] < 10000

insee_test.value_counts()

municip_code
False    120025
True       6706
Name: count, dtype: Int64

## `adr`

This information is also captured by longitude and latitude with more precision. It contains street names (categorical) and has 58099 unique values. I drop this variable, since the information in longitude and latitude can be better used by the algorithms.

In [23]:
data['adr'].nunique()

58099

## `gps_code`

This variable has only one level, since I only deal with mainland France. I drop it. 

In [24]:
print(data['gps_code'].value_counts(dropna=False))

print(data['gps_code'].unique())


gps_code
NaN    102781
M       23951
Name: count, dtype: int64
['M' nan]


## `dep`

Inconsistent data. According to [government pages](https://www.insee.fr/fr/information/7766585), department codes have only 2 digits, except for overseas territories, whose codes are above 970. The variable also carries geographical information, since each department can be represented as a range of latitude and longitude; therefore, I drop the feature here.

In [25]:
print(data['dep'].value_counts(dropna=False))

print(data['dep'].sort_values().unique())

dep
75    12062
93     6720
92     6082
13     5562
94     5539
      ...  
70      256
8       253
48      189
90      113
23      104
Name: count, Length: 94, dtype: int64
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 21 22 23 24 25
 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95]


## `month`

Stays as it is.

In [26]:
print(data['month'].value_counts(dropna=False))

month
6     12038
2     11669
1     11649
10    11496
9     11419
5     10975
7     10728
11    10209
3      9587
8      9448
12     9053
4      8461
Name: count, dtype: int64


## `day`

Stays as it is!

In [27]:
print(data['day'].value_counts(dropna=False))

day
7     4476
14    4458
20    4436
16    4385
18    4354
21    4343
19    4326
13    4316
8     4315
11    4290
9     4285
12    4260
22    4255
4     4252
17    4245
10    4242
24    4232
15    4223
6     4157
23    4143
25    4111
5     4082
26    4019
3     3973
27    3918
28    3857
2     3843
1     3841
29    3555
30    3438
31    2102
Name: count, dtype: int64


This variable alone does not carry much information, because the same day of the month means something different across months and years. What could be interesting is a variable showing the day of the week, assuming that accidents might be more frequent/severe on weekdays.
Therefore, I recode `day` into `weekday`.

In [28]:
data['date'] = pd.to_datetime(data[['year', 'month', 'day']])

data['weekday'] = data['date'].dt.weekday

In [29]:
data['weekday'].value_counts(dropna=False)

weekday
4    20929
2    18621
1    18360
3    18312
5    18056
0    17071
6    15383
Name: count, dtype: int64

## `time`

Some values still contain `:`. For consistency and hour extraction, I delete the `:`. Furthermore, some values only have two digits, sometimes above 24. Probably, these values indicate minutes after midnight (e.g., `43` for 0:43). None of these values is above 59; hence, it seems sensible to recode them with a leading 0 for midnight.

In [30]:
print(data['time'].value_counts(dropna=False))

time
18:00    1617
17:00    1482
18:30    1426
17:30    1348
19:00    1318
         ... 
02:31       1
03:01       1
04:01       1
05:39       1
03:59       1
Name: count, Length: 1437, dtype: int64


In [31]:
# delete `:` in time
def del_colon(value):
    if ':' in (str(value)):
        new = str(value).replace(':', '')
        return new
    else:
        return value

data['time'] = data['time'].apply(del_colon)

In [32]:
print(data['time'].value_counts(dropna=False))

time
1800    1617
1700    1482
1830    1426
1730    1348
1900    1318
        ... 
0231       1
0301       1
0401       1
0539       1
0359       1
Name: count, Length: 1437, dtype: int64


In [33]:
def get_hours(value):
    if len(str(value)) <= 4 and len(str(value)) >= 3:
        if str(value).startswith('0'):
            return int(str(value)[1:-2])
        else:
            return int(value) // 100
    elif len(str(value)) <= 2 and len(str(value)) >= 1:
        return 0
    elif len(str(value)) > 4:
        print(value)
    else:
        print(value)
    
data['hour'] = data['time'].apply(get_hours)
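
The same hour extraction can be written more compactly by left-padding the value to four digits. This is an alternative sketch, not the function used above; it assumes values are strings or integers with at most four digits, and `hour_from_time` is a hypothetical name.

```python
def hour_from_time(value):
    """Extract the hour from values such as '18:30', '0931', 931 or '43'."""
    digits = str(value).replace(':', '').strip().zfill(4)  # '43' -> '0043'
    return int(digits[:2])


for raw in ['18:30', '0931', 931, '43']:
    print(raw, '->', hour_from_time(raw))
```

Two-digit values are treated as minutes after midnight and therefore map to hour 0, in line with the assumption made above.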

In [34]:
print(data['hour'].value_counts(dropna=False))

hour
17    10910
18    10575
16     9291
19     8082
15     7663
8      7576
14     6778
12     6596
9      6297
13     6267
11     6179
10     5737
7      5699
20     5611
21     3879
22     3322
23     2782
6      2715
0      2335
1      2009
5      1864
2      1714
4      1435
3      1416
Name: count, dtype: int64


In [35]:
data['time'] = data['time'].astype(float)

print(data['time'][(data['time'] < 100) & (data['time'] > 59)])

Series([], Name: time, dtype: float64)


## `year`

Stays as it is.

In [36]:
print(data['year'].value_counts(dropna=False))

year
2022    51772
2023    51009
2019    17901
2021     6050
Name: count, dtype: int64


## `road_cat`

Has a lot of categories, will be checked in feature engineering.

In [37]:
print(data['road_cat'].value_counts(dropna=False))

road_cat
4.0    53839
3.0    48821
1.0    12578
2.0     5585
7.0     4203
6.0      961
9.0      607
5.0      138
Name: count, dtype: int64


## `road_num`

Compared to longitude and latitude, it contains no new information; therefore, I drop it. Usually, road numbers have a prefix: A for highways, N for long-distance national roads, D for departmental roads, C for municipal roads, and E for European roads. With just the number, the road types cannot be clearly separated.

In [38]:
print(data['road_num'].value_counts(dropna=False))

road_num
NaN                        12769
7                            958
86                           938
4                            789
6                            746
                           ...  
PAGESE (DOMAINE DE LA)         1
Boulevard Pierre 1er           1
PAUL VALERY                    1
FOCH. PLACE DU MARECHAL        1
KAMPMANN (RUE)                 1
Name: count, Length: 27475, dtype: int64


## `traffic_dir`

Keep it so far. Values coded `-1`/`0` (no information) are turned to `NA` (7862 cases, about 6%).

In [39]:
print(data['traffic_dir'].value_counts(dropna=False))

data.loc[data['traffic_dir'] == -1, 'traffic_dir'] = np.nan
data.loc[data['traffic_dir'] == 0, 'traffic_dir'] = np.nan

traffic_dir
 2.0    77513
 1.0    22747
 3.0    17882
-1.0     7862
 4.0      728
Name: count, dtype: int64


In [40]:
print(data['traffic_dir'].value_counts(dropna=False))

data['traffic_dir'].isna().mean()

traffic_dir
2.0    77513
1.0    22747
3.0    17882
NaN     7862
4.0      728
Name: count, dtype: int64


0.06203642331849888

## `num_lanes`

It should indicate the number of lanes of the road where the accident happened. However, it contains quite high numbers of up to 12 (the Champs-Élysées has 10), although these are few cases. Furthermore, there are some strings, which are recoded to `NA`, too. Together with the other `NA`s, these might be recoded in the feature engineering.

In [41]:
print(data['num_lanes'].value_counts(dropna=False))

num_lanes
2               66535
4               11991
1               11271
2.0             10811
3                8658
0                2686
6                2598
 -1              2241
4.0              2200
1.0              1713
5                1572
3.0              1239
8                 715
6.0               564
0.0               443
5.0               326
-1.0              230
7                 205
8.0               189
10                125
7.0                83
-1                 67
9                  65
#VALEURMULTI       52
10.0               49
9.0                34
12                 31
11                 18
12.0               11
11.0                9
#ERREUR             1
Name: count, dtype: int64


In [42]:
def to_float(value):
    # convert to float; non-numeric strings such as '#ERREUR' become NA
    try:
        return float(value)
    except ValueError:
        return np.nan

data['num_lanes'] = data['num_lanes'].apply(to_float)

# no information (-1 / 0) to NA; ' -1' was already parsed to -1.0 by float()
data.loc[data['num_lanes'] == -1, 'num_lanes'] = np.nan
data.loc[data['num_lanes'] == 0, 'num_lanes'] = np.nan

print(data['num_lanes'].value_counts(dropna=False))

num_lanes
2.0     77346
4.0     14191
1.0     12984
3.0      9897
NaN      5720
6.0      3162
5.0      1898
8.0       904
7.0       288
10.0      174
9.0        99
12.0       42
11.0       27
Name: count, dtype: int64


In [43]:
data['num_lanes'].isna().mean()

0.045134614777641004

## `pr`

After a first simple recoding of no information to `NA`, the feature has more than 30% `NA`s (about 37%). Therefore, I drop it.

In [44]:
print(data['pr'].value_counts(dropna=False))

data.loc[data['pr'] == ' -1', 'pr'] = np.nan
data.loc[data['pr'] == '0', 'pr'] = np.nan

data['pr'].isna().mean()

pr
0        29286
(1)      22244
 -1      18157
1         6214
2         2734
         ...  
513          1
520          1
346          1
1111         1
4 400        1
Name: count, Length: 539, dtype: int64


0.3743569106460878

## `pr1`

Dropped as above, since about 39% of the values are without information or `NA`.

In [45]:
print(data['pr1'].value_counts(dropna=False))

data.loc[data['pr1'] == ' -1', 'pr1'] = np.nan
data.loc[data['pr1'] == '0', 'pr1'] = np.nan

data['pr1'].isna().mean()

pr1
0        31220
(1)      22506
 -1      18291
1         4156
500       4110
         ...  
1616         1
1538         1
1152         1
1188         1
3 139        1
Name: count, Length: 1585, dtype: int64


0.39067480983492725

## `res_lane`

Has a lot of `NA`s (including no information `0`/`-1`). About 88% are `NA` or no information; therefore, I drop the variable.

In [46]:
print(data['res_lane'].value_counts(dropna=False))

data.loc[data['res_lane'] == 0, 'res_lane'] = np.nan
data.loc[data['res_lane'] == -1, 'res_lane'] = np.nan

res_lane
 0.0    110011
 1.0      6597
 3.0      4595
 2.0      3898
-1.0      1631
Name: count, dtype: int64


In [47]:
print(data['res_lane'].value_counts(dropna=False))

data['res_lane'].isna().mean()

res_lane
NaN    111642
1.0      6597
3.0      4595
2.0      3898
Name: count, dtype: int64


0.8809298361897547

## `prof_road`

Has only few `NA`s (43 cases of no information). Will be further inspected in feature engineering.

In [48]:
print(data['prof_road'].value_counts(dropna=False))

data.loc[data['prof_road'] == 0, 'prof_road'] = np.nan
data.loc[data['prof_road'] == -1, 'prof_road'] = np.nan

prof_road
 1.0    103205
 2.0     19803
 3.0      1957
 4.0      1724
-1.0        43
Name: count, dtype: int64


In [49]:
print(data['prof_road'].value_counts(dropna=False))

data['prof_road'].isna().mean()

prof_road
1.0    103205
2.0     19803
3.0      1957
4.0      1724
NaN        43
Name: count, dtype: int64


0.00033929867752422436

## `plan_view`

Has only few `NA`s (37 cases of no information). Will be further inspected in feature engineering.

In [50]:
print(data['plan_view'].value_counts(dropna=False))

data.loc[data['plan_view'] == 0, 'plan_view'] = np.nan
data.loc[data['plan_view'] == -1, 'plan_view'] = np.nan

plan_view
 1.0    102344
 2.0     11948
 3.0     10763
 4.0      1640
-1.0        37
Name: count, dtype: int64


In [51]:
print(data['plan_view'].value_counts(dropna=False))

data['plan_view'].isna().mean()

plan_view
1.0    102344
2.0     11948
3.0     10763
4.0      1640
NaN        37
Name: count, dtype: int64


0.0002919546760092163

## `tpc_length`

Has a lot of `NA`s/no information, and also some very high values (up to 50 in the output below). Due to the high share of missing/no information (over 99%), I drop it.

In [52]:
print(data['tpc_length'].value_counts(dropna=False))

data.loc[data['tpc_length'] == '0.0', 'tpc_length'] = np.nan
data.loc[data['tpc_length'] == -1, 'tpc_length'] = np.nan

tpc_length
NaN      126585
0.0          56
0            17
3            11
2             7
5             5
2.5           5
3,5           3
2,5           3
6             3
4             2
6,5           2
7             2
2.0           2
5.5           1
18.0          1
6,21          1
6,2           1
13,45         1
8             1
1             1
50.0          1
6,31          1
10,5          1
3,2           1
0,4           1
5,5           1
1,5           1
35.0          1
40            1
1.0           1
10            1
5,25          1
2,8           1
3,1           1
20.0          1
1.5           1
30.0          1
6,8           1
13            1
0,8           1
12,5          1
4,5           1
Name: count, dtype: int64


In [53]:
print(data['tpc_length'].value_counts(dropna=False))

data['tpc_length'].isna().mean()

tpc_length
NaN      126641
0            17
3            11
2             7
2.5           5
5             5
3,5           3
2,5           3
6             3
2.0           2
6,5           2
7             2
4             2
5,5           1
1,5           1
6,21          1
0,4           1
3,2           1
6,2           1
10            1
6,31          1
1             1
8             1
13,45         1
10,5          1
3,1           1
5,25          1
2,8           1
50.0          1
6,8           1
13            1
0,8           1
12,5          1
40            1
5.5           1
30.0          1
1.5           1
20.0          1
1.0           1
35.0          1
18.0          1
4,5           1
Name: count, dtype: int64


0.9992819493103557

## `width_road`

Some inconsistent values (wrong decimal sign, implausibly wide roads). The widest road in France is Avenue Foch with 120 meters. However, there are a lot of `NA`s and no-information values (`0`/`-1`), about 85% in total; therefore, I drop it. The information is also partly captured in `num_lanes`, since the number of lanes is directly connected to `width_road`.

In [54]:
print(data['width_road'].value_counts(dropna=False))

data.loc[data['width_road'] == '0.0', 'width_road'] = np.nan
data.loc[data['width_road'] == -1, 'width_road'] = np.nan
data.loc[data['width_road'] == " -1", 'width_road'] = np.nan


width_road
 -1      89450
NaN      17782
7         3980
4         3273
5,5       3051
         ...  
730.0        1
6.05         1
170.0        1
34.0         1
2,85         1
Name: count, Length: 191, dtype: int64


In [55]:
print(data['width_road'].value_counts(dropna=False))

data['width_road'].isna().mean()

width_road
NaN     107232
7         3980
4         3273
5,5       3051
5         2071
         ...  
90.0         1
7,45         1
49.0         1
3.8          1
2,85         1
Name: count, Length: 190, dtype: int64


0.8461319950762238

## `surface`

`NA`s will be handled in feature engineering, where the categories of this feature will also be combined into meaningful groups.

In [56]:
print(data['surface'].value_counts(dropna=False))

data.loc[data['surface'] == 0, 'surface'] = np.nan
data.loc[data['surface'] == -1, 'surface'] = np.nan

surface
 1.0    101410
 2.0     23273
 9.0       641
 7.0       542
 5.0       235
 3.0       214
 8.0       209
 6.0        99
-1.0        58
 4.0        51
Name: count, dtype: int64


In [57]:
print(data['surface'].value_counts(dropna=False))


surface
1.0    101410
2.0     23273
9.0       641
7.0       542
5.0       235
3.0       214
8.0       209
6.0        99
NaN        58
4.0        51
Name: count, dtype: int64


## `infra`

About 85% is `NA` or no information. Dropped.

In [58]:
print(data['infra'].value_counts(dropna=False))

data.loc[data['infra'] == 0, 'infra'] = np.nan
data.loc[data['infra'] == -1, 'infra'] = np.nan

infra
 0.0    106298
 5.0      6795
 9.0      4289
 2.0      2234
 3.0      1553
 1.0      1534
-1.0      1333
 6.0      1269
 8.0       948
 4.0       390
 7.0        89
Name: count, dtype: int64


In [59]:
print(data['infra'].value_counts(dropna=False))

data['infra'].isna().mean()

infra
NaN    107631
5.0      6795
9.0      4289
2.0      2234
3.0      1553
1.0      1534
6.0      1269
8.0       948
4.0       390
7.0        89
Name: count, dtype: int64


0.8492803711769719

## `acc_loc`

`NA's` will be handled in feature engineering.

In [60]:
print(data['acc_loc'].value_counts(dropna=False))

data.loc[data['acc_loc'] == 0, 'acc_loc'] = np.nan
data.loc[data['acc_loc'] == -1, 'acc_loc'] = np.nan

acc_loc
 1.0    104697
 3.0      9332
 8.0      4284
 5.0      3026
 4.0      2796
 6.0      1512
 2.0       991
-1.0        94
Name: count, dtype: int64


In [61]:
print(data['acc_loc'].value_counts(dropna=False))

data['acc_loc'].isna().mean()

acc_loc
1.0    104697
3.0      9332
8.0      4284
5.0      3026
4.0      2796
6.0      1512
2.0       991
NaN        94
Name: count, dtype: int64


0.0007417226904017928

## Dropping variables + cases

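The cases without information on longitude or latitude (`NA` or `0`) were already dropped above. The `0` cases occurred in latitude and are not meaningful, since latitude 0 is the equator. Here, I drop the variables I decided against above: `municip_code`, `adr`, `gps_code`, `dep`, `road_num`, `road_num_index`, `v2`, `max_speed`, `pr`, `pr1`, `res_lane`, `tpc_length`, `width_road`, and `infra`, as well as `time` and `day`, which are replaced by `hour` and `weekday`. In addition, I delete the identifier `num_acc`.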

In [62]:
data.drop(
    columns=[
        'municip_code', 'adr', 'gps_code', 'dep', 'road_num', 
        'road_num_index', 'v2', 'max_speed', 'pr', 'pr1',
        'res_lane', 'tpc_length', 'width_road', 'infra', 
        'time', 'day', 'num_acc',         
    ], 
    axis=1, 
    inplace=True
)

# Save data frame

In [63]:
data.to_csv('data/processed/clean.csv', index=False)