![Parked car](car.jpg)

Insurance companies invest a lot of [time and money](https://www.accenture.com/_acnmedia/pdf-84/accenture-machine-leaning-insurance.pdf) into optimizing their pricing and accurately estimating the likelihood that customers will make a claim. In many countries, car insurance is a legal requirement for driving a vehicle on public roads, so the market is very large!

Knowing all of this, On the Road car insurance has requested your services in building a model to predict whether a customer will make a claim on their insurance during the policy period. As they have very little expertise and infrastructure for deploying and monitoring machine learning models, they've asked you to use simple Logistic Regression, identifying the single feature that results in the best-performing model, as measured by accuracy.

They have supplied you with their customer data as a CSV file called `car_insurance.csv`, along with a table (below) detailing the column names and descriptions.

## The dataset

| Column | Description |
|--------|-------------|
| `id` | Unique client identifier |
| `age` | Client's age: <br> <ul><li>`0`: 16-25</li><li>`1`: 26-39</li><li>`2`: 40-64</li><li>`3`: 65+</li></ul> |
| `gender` | Client's gender: <br> <ul><li>`0`: Female</li><li>`1`: Male</li></ul> |
| `driving_experience` | Years the client has been driving: <br> <ul><li>`0`: 0-9</li><li>`1`: 10-19</li><li>`2`: 20-29</li><li>`3`: 30+</li></ul> |
| `education` | Client's level of education: <br> <ul><li>`0`: No education</li><li>`1`: High school</li><li>`2`: University</li></ul> |
| `income` | Client's income level: <br> <ul><li>`0`: Poverty</li><li>`1`: Working class</li><li>`2`: Middle class</li><li>`3`: Upper class</li></ul> |
| `credit_score` | Client's credit score (between zero and one) |
| `vehicle_ownership` | Client's vehicle ownership status: <br><ul><li>`0`: Does not own their vehicle (paying off finance)</li><li>`1`: Owns their vehicle</li></ul> |
| `vehicle_year` | Year of vehicle registration: <br><ul><li>`0`: Before 2015</li><li>`1`: 2015 or later</li></ul> |
| `married` | Client's marital status: <br><ul><li>`0`: Not married</li><li>`1`: Married</li></ul> |
| `children` | Client's number of children |
| `postal_code` | Client's postal code | 
| `annual_mileage` | Number of miles driven by the client each year |
| `vehicle_type` | Type of car: <br> <ul><li>`0`: Sedan</li><li>`1`: Sports car</li></ul> |
| `speeding_violations` | Total number of speeding violations received by the client | 
| `duis` | Number of times the client has been caught driving under the influence of alcohol |
| `past_accidents` | Total number of previous accidents the client has been involved in |
| `outcome` | Whether the client made a claim on their car insurance (response variable): <br><ul><li>`0`: No claim</li><li>`1`: Made a claim</li></ul> |

In [75]:
# Import required libraries
library(readr)
library(dplyr)
library(glue)
library(yardstick)
install.packages("caret")
library(caret)

# Start coding!
data <- read_csv("car_insurance.csv")

Installing caret [6.0-94] ...
	OK [linked cache]


Rows: 10000 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (19): id, age, gender, race, driving_experience, education, income, cred...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [76]:

# Check for missing values across the whole data frame
total_missing <- sum(is.na(data))
print(total_missing)

# Check for missing values in each column
missing_values <- colSums(is.na(data))

# Print the number of missing values for each column
print(missing_values)  # credit_score: 982, annual_mileage: 957

# Option 1: mean-impute the two affected columns in a copy of the data
cars <- data
cars$credit_score[is.na(cars$credit_score)] <- mean(cars$credit_score, na.rm = TRUE)
cars$annual_mileage[is.na(cars$annual_mileage)] <- mean(cars$annual_mileage, na.rm = TRUE)

# Option 2 (used for the models below): drop rows with any missing value
dataset <- data %>%
  na.omit()

[1] 1939
                 id                 age              gender                race 
                  0                   0                   0                   0 
 driving_experience           education              income        credit_score 
                  0                   0                   0                 982 
  vehicle_ownership        vehicle_year             married            children 
                  0                   0                   0                   0 
        postal_code      annual_mileage        vehicle_type speeding_violations 
                  0                 957                   0                   0 
               duis      past_accidents             outcome 
                  0                   0                   0 
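The cell above contains both mean imputation and `na.omit()`. As a minimal sketch of how the two options differ, the snippet below applies each to a small made-up data frame (`toy`, `imputed`, and `dropped` are hypothetical names and values, not taken from `car_insurance.csv`).

```r
# Hypothetical toy data with one missing value in each of two columns
toy <- data.frame(
  credit_score   = c(0.2, NA, 0.6, 0.8),
  annual_mileage = c(10000, 12000, NA, 14000),
  outcome        = c(0, 1, 0, 1)
)

# Option 1: replace each missing value with the column mean (keeps all rows)
imputed <- toy
for (col in c("credit_score", "annual_mileage")) {
  col_mean <- mean(imputed[[col]], na.rm = TRUE)
  imputed[[col]][is.na(imputed[[col]])] <- col_mean
}

# Option 2: drop every row that has at least one missing value
dropped <- na.omit(toy)

print(nrow(imputed))  # all 4 rows are kept
print(nrow(dropped))  # only the 2 complete rows remain
```

Imputation keeps the sample size but flattens the variance of the imputed columns, while dropping rows keeps only observed values at the cost of fewer observations.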


Find the feature with the best predictive performance for a car insurance claim ("outcome") by creating simple Logistic Regression models (each with a single feature) and assessing their accuracy.
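For reference, a single-feature logistic regression and its accuracy can be written as follows. The symbols ($x$ for the feature, $\beta_0$ and $\beta_1$ for the coefficients, $n$ for the number of clients) are notation introduced here for illustration, and the 0.5 cut-off is the threshold assumed in the code further down.

$$
\hat{p}(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}, \qquad
\hat{y} = \begin{cases} 1 & \text{if } \hat{p}(x) \ge 0.5 \\ 0 & \text{otherwise} \end{cases}, \qquad
\text{accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[\hat{y}_i = y_i\right]
$$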

In [77]:
# Initialize variables
best_accuracy <- 0
best_feature <- ""

# Create the features data frame (every column except the response and the ID)
features <- data.frame(features = setdiff(names(dataset), c("outcome", "id")))


In [78]:
# Create an empty data frame to store feature accuracies
features_df <- data.frame(Feature = character(), Accuracy = numeric(), stringsAsFactors = FALSE)

# Loop through features
for (col in names(dataset)) {
  if (col != "outcome" && col != "id") {
    # Train the logistic regression model
    formula_str <- glue::glue("outcome ~ {col}")
    model <- glm(formula_str, data = dataset, family = binomial())
    
    # Make predictions on the dataset
    predictions <- predict(model, type = "response")
    
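    # Share of clients with a predicted claim probability of at least 0.5.
    # Note: this is the predicted-positive rate, not accuracy; accuracy would be
    # mean((predictions >= 0.5) == dataset$outcome)
    accuracy <- mean(predictions >= 0.5)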
    
    # Store feature and accuracy in the data frame
    features_df <- rbind(features_df, data.frame(Feature = col, Accuracy = accuracy, stringsAsFactors = FALSE))
  }
}

# Print the features_df data frame
print(features_df)




               Feature   Accuracy
1                  age 0.19892011
2               gender 0.00000000
3                 race 0.00000000
4   driving_experience 0.35145417
5            education 0.00000000
6               income 0.17904037
7         credit_score 0.13412689
8    vehicle_ownership 0.30077310
9         vehicle_year 0.00000000
10             married 0.00000000
11            children 0.00000000
12         postal_code 0.00000000
13      annual_mileage 0.01889802
14        vehicle_type 0.00000000
15 speeding_violations 0.00000000
16                duis 0.00000000
17      past_accidents 0.00000000


Based on the results above:

The Accuracy column is computed as `mean(predictions >= 0.5)`, so each value is the share of clients for whom the model predicts a claim, not the share of correct predictions.

A value of 0.0 (for "gender", "race", "education", "vehicle_year", "married", "children", "postal_code", "vehicle_type", "speeding_violations", "duis", and "past_accidents") means that the model never predicted a claim probability of 0.5 or higher. It does not by itself show that these features lack predictive power.

"driving_experience" (0.3515), "vehicle_ownership" (0.3008), "age" (0.1989), and "income" (0.1790) produce the most predicted claims, followed by "credit_score" (0.1341) and "annual_mileage" (0.0189). Ranking the features by true accuracy requires comparing the thresholded predictions with `outcome`.
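As a sketch of how accuracy could be computed for each single-feature model, the snippet below thresholds the predicted probabilities at 0.5 and compares them with the observed outcome. The data frame `toy` and the helper `single_feature_accuracy()` are hypothetical stand-ins for illustration; with the real data, `dataset` would take the place of `toy` and `id` would also be excluded from the feature list.

```r
# Hypothetical toy data standing in for the cleaned insurance data
toy <- data.frame(
  outcome            = c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0),
  driving_experience = c(0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3),
  gender             = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
)

# Accuracy of a one-feature logistic regression at a 0.5 threshold
single_feature_accuracy <- function(col, df) {
  model <- glm(reformulate(col, response = "outcome"),
               data = df, family = binomial())
  # Convert predicted probabilities to 0/1 class labels
  predicted <- as.integer(predict(model, type = "response") >= 0.5)
  # Accuracy is the share of predictions that match the observed outcome
  mean(predicted == df$outcome)
}

feature_names <- setdiff(names(toy), "outcome")
toy_accuracy <- data.frame(
  Feature  = feature_names,
  Accuracy = vapply(feature_names, single_feature_accuracy,
                    numeric(1), df = toy, USE.NAMES = FALSE)
)
print(toy_accuracy)
```

Note that this measures in-sample accuracy, since the model is scored on the same rows it was fitted to.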

## Finding the best-performing model and storing the results in a data frame


In [79]:
# Find the best performing feature
best_feature <- features_df$Feature[which.max(features_df$Accuracy)]

best_feature
# Store the maximum accuracy value
best_accuracy <- max(features_df$Accuracy)

best_accuracy
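The goal stated for this step is to store the result in a data frame, which the cell above stops short of. A minimal sketch follows, assuming a one-row data frame named `best_feature_df` (a hypothetical name) and reusing three rows of the table printed earlier so the snippet runs on its own.

```r
# Three rows copied from the printed features_df, for a self-contained example
features_df <- data.frame(
  Feature  = c("age", "driving_experience", "vehicle_ownership"),
  Accuracy = c(0.19892011, 0.35145417, 0.30077310),
  stringsAsFactors = FALSE
)

# Locate the row with the highest score once, then reuse the index
best_idx <- which.max(features_df$Accuracy)

# One-row data frame holding the winning feature and its score
best_feature_df <- data.frame(
  best_feature  = features_df$Feature[best_idx],
  best_accuracy = features_df$Accuracy[best_idx],
  stringsAsFactors = FALSE
)
print(best_feature_df)
```

Using a single index for both columns guarantees that the feature name and the score come from the same row.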