# Overview
This notebook performs exploratory data analysis on the dataset posted for week 4 of the MakeoverMonday data visualization challenge. Its main purpose is to get a better grasp of the data and to choose which aspect to focus on.

A good first question: "What is this data about?" It is about NHTSA recalls, which probably does not clear things up entirely.  
Recalls are reports issued by the National Highway Traffic Safety Administration (NHTSA), an agency of the U.S. Department of Transportation responsible for motor vehicle and road safety.  

As for the data, it focuses on cars. Each row represents one report issued by NHTSA. Each report specifies the components that potentially do not meet the safety criteria, the manufacturer, the date, and various administrative details.  

Don't take my word for it: here is the original [article](https://www.nhtsa.gov/recalls#:~:text=A%20recall%20is%20issued%20when,to%20any%20involvement%20by%20NHTSA.).

## Data
The data is available at the following [link](https://data.world/makeovermonday/2023w4) (it might require registering or logging in to the ***data.world*** platform).  
A description of the challenge is available at the following [link](https://www.makeovermonday.co.uk/about-us/).  

## Cleaning and skimming the data
Enough talk: show me the pandas DataFrame.

In [74]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import numpy as np

file_name_relative = "Recalls_Data.csv"
data = pd.read_csv(os.path.join(os.getcwd(), file_name_relative))

data.info()
data_org = data.copy()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26592 entries, 0 to 26591
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Report Received Date  26592 non-null  object
 1   NHTSA ID              26592 non-null  object
 2   Recall Link           26592 non-null  object
 3   Manufacturer          26592 non-null  object
 4   Subject               26592 non-null  object
 5   Component             26592 non-null  object
 6   Mfr Campaign Number   26563 non-null  object
 7   Recall Type           26592 non-null  object
 8   Potentially Affected  26550 non-null  object
 9   Recall Description    24191 non-null  object
 10  Consequence Summary   21704 non-null  object
 11  Corrective Action     24204 non-null  object
dtypes: object(12)
memory usage: 2.4+ MB


In [75]:
# let's get rid of the NHTSA ID field as it is a mere identifier (no direct benefit currently)
try:
    data.drop(columns=['NHTSA ID'], inplace=True)
except KeyError: 
    # the cell is being run more than once (without restarting the kernel)
    pass

# data.head()
print(data['Recall Link'].iloc[0])

Go to Recall (https://www.nhtsa.gov/recalls?nhtsaId=23V002000)
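As a side note, the `try`/`except KeyError` guard used above to make the cell safe to re-run can also be expressed with the `errors="ignore"` option of `DataFrame.drop`. The sketch below uses a tiny made-up frame (`toy` and its values are illustrative, not taken from the recalls file).

```python
import pandas as pd

# Toy frame standing in for the recalls data (values are made up for illustration).
toy = pd.DataFrame({"NHTSA ID": ["A1", "B2"], "Manufacturer": ["Ford", "BMW"]})

# errors="ignore" skips labels that are already gone, so re-running is harmless.
toy = toy.drop(columns=["NHTSA ID"], errors="ignore")
toy = toy.drop(columns=["NHTSA ID"], errors="ignore")  # second run: no KeyError

print(list(toy.columns))
```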


In [76]:
rec_link = "Recall Link"
# apparently the recall link follows the pattern: "Go to Recall (link)"
# let's verify this hypothesis
import re
regex = r'Go to Recall \(https?:\/\/.*\)'
ev_arr = np.array([(re.fullmatch(regex, t, flags=re.IGNORECASE) is not None) for t in data['Recall Link'].values])

print(ev_arr.all())
# so all the data is already cleaned and well formatted so let's isolate the actual link from the descriptive text here

def isolate_link(row):
    des, link = re.split(r'[\(\)]', row[rec_link])[:2]
    row[rec_link] = link
    return row

data = data.apply(isolate_link, axis=1)

True
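The check-then-split steps above run row by row. A possible vectorized alternative is `Series.str.extract`, which pulls out the URL and flags malformed rows in one pass. This is only a sketch on two sample strings: the first is the value printed earlier, the second uses a made-up ID, and the names `links`, `pattern` and `urls` are illustrative.

```python
import pandas as pd

# Two rows shaped like the "Recall Link" values; the second ID is made up.
links = pd.Series([
    "Go to Recall (https://www.nhtsa.gov/recalls?nhtsaId=23V002000)",
    "Go to Recall (https://www.nhtsa.gov/recalls?nhtsaId=00V000000)",
])

# Capture whatever sits between the parentheses and starts with http:// or https://.
pattern = r"\((https?://[^)]+)\)"
urls = links.str.extract(pattern, expand=False)

# Rows that do not match come back as NaN, so the same call doubles as a format check.
print(urls.notna().all())
print(urls.iloc[0])
```

With a single capture group and `expand=False`, the result is a plain Series that could be assigned straight back to the link column.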


In [77]:
new_cols = {"Report Received Date": "date", "Recall Type": "Type", "Potentially Affected": "num_cars", 
"Recall Description": "description", "Consequence Summary": "summary", "Corrective Action": "action"}
data = data.rename(columns=new_cols)
data = data.rename(columns=lambda x: x.lower())
# let's drop the description, summary and link columns
try:
    data.drop(columns=['summary', 'description', rec_link.lower()], inplace=True)
except KeyError:
    # the cell is being run more than once (the columns are already gone)
    pass

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26592 entries, 0 to 26591
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   date                 26592 non-null  object
 1   manufacturer         26592 non-null  object
 2   subject              26592 non-null  object
 3   component            26592 non-null  object
 4   mfr campaign number  26563 non-null  object
 5   type                 26592 non-null  object
 6   num_cars             26550 non-null  object
 7   action               24204 non-null  object
dtypes: object(8)
memory usage: 1.6+ MB


## Limit the data to the top 25 manufacturers
Since most manufacturers in the dataset appear very rarely (three occurrences or fewer for 75% of them, as demonstrated below), it seems reasonable to focus on a subset of manufacturers in the search for actual insights and patterns.

In [78]:
mnf = "manufacturer"
mnf_num = "mfr campaign number"
# let's count the occurrences of each manufacturer and compute the three quartiles
occs = data[mnf].value_counts().values
q1, q2, q3 = np.percentile(occs, [25, 50, 75])
print([q1, q2, q3])
# the 75th percentile is 3 occurrences, which is extremely low.

[1.0, 1.0, 3.0]


In [79]:
# let's limit the analysis to the 25 most frequent manufacturers
# let's extract these companies
occs = data[mnf].value_counts()[:25].index
# .copy() avoids a SettingWithCopyWarning when columns are modified later
data = data[data[mnf].isin(occs)].copy()
data.head()

| | date | manufacturer | subject | component | mfr campaign number | type | num_cars | action |
|---|---|---|---|---|---|---|---|---|
| 3 | 12/29/2022 | Volkswagen Group of America, Inc. | 12-Volt Battery Cable May Short Circuit | ELECTRICAL SYSTEM | 97HA | Vehicle | 1042 | Owners are advised to park outside and away fr... |
| 9 | 12/23/2022 | Mercedes-Benz USA, LLC | Engine Stall from Water Intrusion into Vehicle | FUEL SYSTEM, DIESEL | NR (Not Reported) | Vehicle | 323963 | Dealers will install a water drain plug, inspe... |
| 10 | 12/23/2022 | Ford Motor Company | Seat Belt Warning System Malfunction/FMVSS 208 | SEAT BELTS | 22C35 | Vehicle | 101001 | Dealers will update the audio control module s... |
| 11 | 12/23/2022 | Mercedes-Benz USA, LLC | Sunroof Panel May Detach | VISIBILITY | NR (Not Reported) | Vehicle | 123696 | Dealers will inspect and replace the sunroof p... |
| 13 | 12/22/2022 | Honda (American Honda Motor Co.) | Damaged Tire Bead | TIRES | GCW | Vehicle | 19 | Dealers will inspect and replace the tires as ... |
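One way to sanity-check a cutoff such as "top 25" is to look at the share of rows the top N manufacturers cover. Here is a minimal sketch on made-up data (the names, the counts and `top_n = 2` are purely illustrative); on the real frame the same two lines would use `data[mnf].value_counts()`.

```python
import pandas as pd

# Made-up manufacturer column: two frequent names and two rare ones.
manufacturers = pd.Series(["Ford"] * 5 + ["GM"] * 3 + ["BMW"] + ["Tiny Trailers"])

# value_counts() sorts by frequency, so the first rows are the top manufacturers.
counts = manufacturers.value_counts()

top_n = 2
share = counts.iloc[:top_n].sum() / counts.sum()
print(f"top {top_n} manufacturers cover {share:.0%} of the rows")
```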


In [80]:
# first let's fill the nan values in the Mfr Campaign number with the "Not Reported" flag
data[mnf_num] = data[mnf_num].fillna(value="NR (Not Reported)")
print(data.isna().sum())
print(f"the percentage of reports with non-reported manufacture's number is \
{round(data[data[mnf_num].str.lower().str.contains('not reported')].shape[0] / (data.shape[0]) * 100, 2)}%")


# as the Mfr Campaign number is not reported in 45% of recall reports, dropping this column seems like a reasonable idea...
try:
    data.drop(columns=mnf_num, inplace=True)
except KeyError:
    pass

date                      0
manufacturer              0
subject                   0
component                 0
mfr campaign number       0
type                      0
num_cars                 11
action                 1192
dtype: int64
the percentage of reports with non-reported manufacture's number is 45.46%
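The percentage above can also be computed as the mean of a boolean mask, which may read more easily than dividing two shapes. A small sketch on a made-up column (`campaign` and its four values are illustrative only):

```python
import pandas as pd

# Made-up campaign numbers: one missing value and one explicit "not reported" flag.
campaign = pd.Series(["97HA", None, "NR (Not Reported)", "22C35"])

filled = campaign.fillna("NR (Not Reported)")

# Case-insensitive literal match; the mean of a boolean Series is the fraction of True.
not_reported = filled.str.contains("not reported", case=False, regex=False)
pct = round(not_reported.mean() * 100, 2)
print(f"{pct}% of the rows have no reported campaign number")
```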


### Some more cleaning
The manufacturer and num_cars columns need more care. The former can be represented more uniformly by removing extra characters and punctuation as well as generic terms (company type, location). The latter should be converted to an integer datatype so that numerical operations can be performed.


In [81]:
# the main idea is to split the manufacturer's string on any character that is not a word character,
# whitespace or a hyphen, and keep only the first token
def manufacturer(row):
    tokens = re.split(r'[^-\w\s]+', row[mnf])
    row[mnf] = tokens[0].upper().strip()
    return row

# assign back to data so that the following cells work on the cleaned names
data = data.apply(manufacturer, axis=1)
print(data[mnf].nunique())

25


In [82]:
# first let's remove rows for which the num_cars is nan
data.dropna(subset=['num_cars'], inplace=True)
# let's remove any non-digit character from the num_cars column and convert it to int
def num_cars_to_int(row):
    # raw string: "\D" in a plain string is an invalid escape sequence
    row['num_cars'] = int(re.sub(r"\D", '', row['num_cars']))
    return row
data = data.apply(num_cars_to_int, axis=1)
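The same conversion can be sketched without a row-wise `apply`, using the string methods of the column directly. The sample values below are made up, and the thousands separators are an assumption about what the non-digit characters in the real column look like.

```python
import pandas as pd

# Made-up raw values; the commas are an assumed example of stray non-digit characters.
raw = pd.Series(["1,042", "323,963", "19", None])

# Drop missing values, strip every non-digit character, then cast to integers.
cleaned = raw.dropna().str.replace(r"\D", "", regex=True).astype(int)

print(cleaned.tolist())
```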

In [83]:
# to shorten the companies' names even further, let's remove any terms indicating the type of the company
# or its location (as all companies in the dataset are either American or the US branches of non-American companies)
def reduce_mnf(row):
    # \b keeps the removal to whole words; the leftover whitespace is collapsed and trimmed
    name = re.sub(r'\b(COMPANY|USA|LLC|AMERICA|NORTH|OF|GROUP|ENGINEERING)\b', '', row[mnf])
    row[mnf] = re.sub(r'\s+', ' ', name).strip()
    return row
data = data.apply(reduce_mnf, axis=1)
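To see what the two name-cleaning steps do end to end, here is a small vectorized sketch run on the four manufacturer names visible in the `data.head()` output earlier. It is an illustration under assumptions rather than the notebook's exact code: it removes the noise words as whole words only and tidies the leftover whitespace, and the variable names (`names`, `first`, `noise`, `short`) are hypothetical.

```python
import pandas as pd

names = pd.Series([
    "Volkswagen Group of America, Inc.",
    "Mercedes-Benz USA, LLC",
    "Ford Motor Company",
    "Honda (American Honda Motor Co.)",
])

# Keep the text before the first character that is not a word character, whitespace
# or a hyphen (a multi-character pattern is treated as a regular expression).
first = names.str.split(r"[^-\w\s]+").str[0].str.upper()

# Remove company-type / location words as whole words, then tidy the whitespace.
noise = r"\b(?:COMPANY|USA|LLC|AMERICA|NORTH|OF|GROUP|ENGINEERING)\b"
short = (
    first.str.replace(noise, "", regex=True)
         .str.replace(r"\s+", " ", regex=True)
         .str.strip()
)

print(short.tolist())
```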

## Aggregating

In [84]:
from math import floor

# let's first extract the year from the date column
data['date'] = pd.to_datetime(data['date'])
data['year'] = pd.DatetimeIndex(data['date']).year
# let's extract the decade as well, plus the five-year period (stored in the 'model' column)
def extract_decade(row):
    c1 = str(row['year'])[:2]  # century digits, e.g. "20"
    # last two digits rounded down to the decade, e.g. 22 -> 20
    c2 = floor(int(str(row['year'])[2:]) / 10) * 10
    if c2 < 10:
        c2 = "0" + str(c2)

    row['decade'] = f"{c1}{c2}"

    # last two digits rounded down to a multiple of 5, e.g. 22 -> 20, 07 -> 05
    c2 = floor(int(str(row['year'])[2:]) / 5) * 5
    if c2 < 10:
        c2 = "0" + str(c2)

    row['model'] = f"{c1}{c2}"
    return row

data = data.apply(extract_decade, axis=1)
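The bucketing above can also be sketched with integer arithmetic on the whole column at once. Note one difference: this version yields integers (2020), whereas the cell above stores strings ("2020"). The three dates are made up for illustration.

```python
import pandas as pd

# Made-up dates in the same month/day/year format as the dataset.
dates = pd.to_datetime(pd.Series(["12/29/2022", "03/15/2007", "07/01/1969"]), format="%m/%d/%Y")
year = dates.dt.year

# Floor division rounds each year down to the start of its bucket.
decade = (year // 10) * 10
five_year = (year // 5) * 5

print(decade.tolist())
print(five_year.tolist())
```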

In [85]:
# let's group by manufacturer and look at:
# 1. the largest number of potentially affected cars (num_cars) in a single recall, per five-year period
# 2. the number of distinct components affected, per decade
agg1 = pd.pivot_table(data, index='manufacturer', values='num_cars', columns='model', aggfunc=['max']) 
agg2 = data.groupby(['manufacturer', 'decade'])['component'].nunique()


In [86]:
agg1

| manufacturer (max by model) | 1965 | 1970 | 1975 | 1980 | 1985 | 1990 | 1995 | 2000 | 2005 | 2010 | 2015 | 2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLUE BIRD BODY | 202.0 | 3855.0 | 6000.0 | 1500.0 | 1369.0 | 21000.0 | 74000.0 | 25839.0 | 37095.0 | 18821.0 | 19833.0 | 4900.0 |
| BMW | 3501.0 | 17284.0 | 32500.0 | 66600.0 | 120261.0 | 375000.0 | 410000.0 | 84000.0 | 200000.0 | 574007.0 | 840000.0 | 917106.0 |
| CHRYSLER | 174857.0 | 270815.0 | 1300000.0 | 1000000.0 | 634000.0 | 640000.0 | 1342202.0 | 2315768.0 | 826687.0 | 1560000.0 | 4815661.0 | 1224078.0 |
| DAIMLER TRUCKS | | | | | | | | | 243435.0 | 103437.0 | 438255.0 | 219390.0 |
| FLEETWOOD ENTERPRISES | | 2587.0 | 3800.0 | 3832.0 | 28545.0 | 20413.0 | 7200.0 | 60877.0 | 167096.0 | | | |
| FORD MOTOR | 447000.0 | 4072000.0 | 1400000.0 | 21000000.0 | 3600000.0 | 1610000.0 | 7900000.0 | 1556221.0 | 4500000.0 | 1325000.0 | 2046297.0 | 2925968.0 |
| FOREST RIVER | | | | | | | 459.0 | 2010.0 | 128000.0 | 9526.0 | 365885.0 | 99190.0 |
| FREIGHTLINER | 908.0 | 10000.0 | 15000.0 | 31002.0 | 25000.0 | 14660.0 | 77000.0 | 105000.0 | 75000.0 | | | |
| GENERAL MOTORS | 2966979.0 | 6682084.0 | 1896222.0 | 5821160.0 | 1810000.0 | 1702880.0 | 2400000.0 | 3662211.0 | 1497516.0 | 5877718.0 | 3640162.0 | 2641272.0 |
| HARLEY-DAVIDSON MOTOR | 5000.0 | 22310.0 | 79056.0 | 11714.0 | 43058.0 | 77407.0 | 176515.0 | 81496.0 | 167628.0 | 250757.0 | 185272.0 | 199419.0 |
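The extra `max` level in the column header comes from passing `aggfunc` as a list. Passing the function name as a plain string should give single-level columns, which are easier to plot or export. A minimal sketch on a made-up frame (the rows are illustrative, not real recall figures):

```python
import pandas as pd

# Made-up rows with the same three columns used in the pivot above.
toy = pd.DataFrame({
    "manufacturer": ["FORD MOTOR", "FORD MOTOR", "BMW", "BMW"],
    "model": ["2015", "2015", "2015", "2020"],
    "num_cars": [100, 250, 40, 70],
})

# aggfunc as a list adds a column level named after the function ...
nested = pd.pivot_table(toy, index="manufacturer", columns="model",
                        values="num_cars", aggfunc=["max"])
# ... while a plain string keeps a single level holding the 'model' values.
flat = pd.pivot_table(toy, index="manufacturer", columns="model",
                      values="num_cars", aggfunc="max")

print(nested.columns.nlevels, flat.columns.nlevels)
print(flat)
```

Combinations with no rows (here FORD MOTOR in 2020) show up as NaN, just like the empty cells in the table above.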


In [87]:
print(data['type'].value_counts())
# almost every row is of type Vehicle, so this column carries little information: let's drop it
try:
    data.drop(columns='type', inplace=True)
except KeyError:
    pass

Vehicle       11464
Equipment       158
Child Seat        1
Name: type, dtype: int64


Our dataset is now ready! Time to save it to an Excel file.

In [88]:
data.to_excel("recalls_data_cleaned.xlsx")