# Car Insurance Risk Project 🚘

The idea of this project is to predict the insurance risk rating of a car as well as characterizing the 
different segments of the population.  

### 1. EDA
To begin with the project, we'll perform data cleaning and exploratory data analysis (EDA) on the provided dataset. This will include checking for data quality issues, visualizing the data, and extracting insights from it.

### 1.1 Data Quality

In this section, the focus will be on assessing the integrity of the data, ensuring that it is clean and consistent for analysis. 
Steps will include:

- Identifying Missing Data: We will search for any missing or incomplete data points and handle them appropriately (e.g., replacing them with median values or removing affected rows).
- Checking for Duplicates: Any duplicated rows will be identified and removed to prevent bias in the analysis.
- Data Type Correction: Ensuring that each column has the appropriate data type, converting strings to numbers where necessary.
- Outliers: We will identify and handle outliers that could skew the analysis using visualization techniques such as box plots.

1.1.1 Importing Libraries and Loading the Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("fivethirtyeight")
import seaborn as sns

Matplotlib is building the font cache; this may take a moment.


1.1.2 Data Collection and information

In [2]:
def import_data(file_path, columns):
    """
    Function woth the responsibility to import the data from the file.
    """
    try:
        data = pd.read_csv(file_path, names=columns)
        return data
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Constants
file_path = '../data/imports-85.csv'
columns = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style',
              'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
                'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower',
                'peak-rpm', 'city-mpg', 'highway-mpg', 'price']

# Import the data
df = import_data(file_path, columns)

# Showing or Checking results
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [5]:
def dfInformation(dataframe):
    """
    Gets the initial information of the Dataset, the number of records, number of variables, non-null objects and data type.

    Args:
        dataframe (DataFrame): Source dataset.
    
    Returns:
        void: A range indes conforma by float64(2), int64(1), object(6)

    Raises:
        TypeError: If the dataframe is not a DataFrame.
    """
    dataframe.info()
dfInformation(df)
# Getting the shape of the dataset
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    object 
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       205 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non

(205, 26)

In [11]:
# Getting all the unique values in the make column
df['make'].unique()

array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
       'isuzu', 'jaguar', 'mazda', 'mercedes-benz', 'mercury',
       'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'renault',
       'saab', 'subaru', 'toyota', 'volkswagen', 'volvo'], dtype=object)

1.1.3 Identifying Missing Data

In [12]:
def check(dataframe):
    """
    Gets the amount of null data and unique is calculated

    Args:
        dataframe (DataFrame): Source dataset.

    Returns:
        DataFrame: A new Dataframe tha represents de amortized values of null and unique values for each column.

    Raises:
        TypeError: If the dataframe is not a DataFrame.
    """
    l=[]
    columns=dataframe.columns
    for col in columns:
        dtypes=dataframe[col].dtypes
        nunique=dataframe[col].nunique()
        sum_null=dataframe[col].isnull().sum()
        l.append([col,dtypes,nunique,sum_null])
    df_check=pd.DataFrame(l)
    df_check.columns=['Column','Types','Unique','Nulls']
    return df_check 
check(df)

Unnamed: 0,Column,Types,Unique,Nulls
0,symboling,int64,6,0
1,normalized-losses,object,52,0
2,make,object,22,0
3,fuel-type,object,2,0
4,aspiration,object,2,0
5,num-of-doors,object,3,0
6,body-style,object,5,0
7,drive-wheels,object,3,0
8,engine-location,object,2,0
9,wheel-base,float64,53,0


In [13]:
def checkDuplicates(dataframe):
    """
    Checks duplicated values for each column and amortized this count.

    Args:
        dataframe (DataFrame): Source dataset.

    Returns:
        int: The total of duplicated values in an specifica dataframe

    Raises:
        TypeError: If the dataframe is not a DataFrame.
    """
    return dataframe.duplicated().sum()
checkDuplicates(df)

np.int64(0)

All looks like there are no missing values in the dataset. However, we can see that some columns have special characters that need to be removed.

In [None]:
def removeSymbols(dataframe):
    """
    Removes the symbols from the dataset.

    Args:
        dataframe (DataFrame): Source dataset.

    Returns:
        DataFrame: A new Dataframe without the symbols.

    Raises:
        TypeError: If the dataframe is not a DataFrame.
    """
    dataframe = dataframe.replace('?', np.nan)
    return dataframe

# Removing the symbols
df = removeSymbols(df)

### Cleaning the Data

In [9]:
# Indentify null values
missing_values = data.isnull().sum()
print(missing_values)

# Indentify ? , | , - values in the data
data = data.replace('?', np.nan)
data = data.replace('-', np.nan)
data = data.replace('|', np.nan)
data = data.replace(' ', np.nan)
symbols_values = data.isnull().sum()
print(symbols_values)

symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
symboling             0
normalized-losses    41
make                  0
fuel-type             0
aspiration            0
num-of-doors          2
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engin

In [None]:
# Clean data

## Replace '?' with NaN
data.replace('?', np.nan, inplace=True)

### 2. Data Visualization
Visualizations will help uncover relationships between car characteristics and insurance risk. The focus here includes:

- Distribution of Features: Visualizing the distribution of important car attributes (e.g., engine size, body style) using histograms and bar charts to understand their spread across the dataset.
- Correlations: Create a heatmap to identify how different features (e.g., engine size, fuel type, and price) are correlated with each other and with the insurance risk rating.
- Feature-Target Visualization: Use box plots and scatter plots to visually analyze how each feature relates to the car's risk rating, helping to identify patterns.

### 3. Extraction of Insights
This section focuses on extracting actionable insights to inform risk prediction and characterize segments. Steps include:

- Summary Statistics: Provide a summary of key features that influence risk rating, such as engine type, price, or drive wheels, to better understand typical car profiles and segments.
- Feature Importance for Risk Prediction: Identify which features are most significant in predicting the car's insurance risk rating using correlation and statistical tests.
- Segmentation of Population: Cluster the dataset into segments (e.g., based on car type, price range, or risk level) to characterize different population segments, helping insurers understand their customer base.