![car](car.jpg)

Insurance companies invest a lot of [time and money](https://www.accenture.com/_acnmedia/pdf-84/accenture-machine-leaning-insurance.pdf) into optimizing their pricing and accurately estimating the likelihood that customers will make a claim. In many countries it is a legal requirement to have car insurance in order to drive a vehicle on public roads, so the market is very large!

Knowing all of this, On the Road car insurance have requested your services in building a model to predict whether a customer will make a claim on their insurance during the policy period. As they have very little expertise and infrastructure for deploying and monitoring machine learning models, they've asked you to identify the single feature that results in the best performing model, as measured by accuracy, so they can start with a simple model in production.

They have supplied you with their customer data as a csv file called `car_insurance.csv`, along with a table detailing the column names and descriptions below.



## The dataset

| Column | Description |
|--------|-------------|
| `id` | Unique client identifier |
| `age` | Client's age: <br> <ul><li>`0`: 16-25</li><li>`1`: 26-39</li><li>`2`: 40-64</li><li>`3`: 65+</li></ul> |
| `gender` | Client's gender: <br> <ul><li>`0`: Female</li><li>`1`: Male</li></ul> |
| `driving_experience` | Years the client has been driving: <br> <ul><li>`0`: 0-9</li><li>`1`: 10-19</li><li>`2`: 20-29</li><li>`3`: 30+</li></ul> |
| `education` | Client's level of education: <br> <ul><li>`0`: No education</li><li>`1`: High school</li><li>`2`: University</li></ul> |
| `income` | Client's income level: <br> <ul><li>`0`: Poverty</li><li>`1`: Working class</li><li>`2`: Middle class</li><li>`3`: Upper class</li></ul> |
| `credit_score` | Client's credit score (between zero and one) |
| `vehicle_ownership` | Client's vehicle ownership status: <br><ul><li>`0`: Does not own their vehicle (paying off finance)</li><li>`1`: Owns their vehicle</li></ul> |
| `vehicle_year` | Year of vehicle registration: <br><ul><li>`0`: Before 2015</li><li>`1`: 2015 or later</li></ul> |
| `married` | Client's marital status: <br><ul><li>`0`: Not married</li><li>`1`: Married</li></ul> |
| `children` | Client's number of children |
| `postal_code` | Client's postal code | 
| `annual_mileage` | Number of miles driven by the client each year |
| `vehicle_type` | Type of car: <br> <ul><li>`0`: Sedan</li><li>`1`: Sports car</li></ul> |
| `speeding_violations` | Total number of speeding violations received by the client | 
| `duis` | Number of times the client has been caught driving under the influence of alcohol |
| `past_accidents` | Total number of previous accidents the client has been involved in |
| `outcome` | Whether the client made a claim on their car insurance (response variable): <br><ul><li>`0`: No claim</li><li>`1`: Made a claim</li></ul> |

In [1]:
# Import required modules
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix


# Start coding!

Identify the single feature of the data that is the best predictor of whether a customer will put in a claim (the "outcome" column), excluding the "id" column.

Store the result as a DataFrame called `best_feature_df` with columns named `best_feature` and `best_accuracy`, holding the name of the feature with the highest accuracy and its accuracy score.

In [2]:
df = pd.read_csv('car_insurance.csv')
df

| | id | age | gender | driving_experience | education | income | credit_score | vehicle_ownership | vehicle_year | married | children | postal_code | annual_mileage | vehicle_type | speeding_violations | duis | past_accidents | outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 569520 | 3 | 0 | 0-9y | high school | upper class | 0.629027 | 1.0 | after 2015 | 0.0 | 1.0 | 10238 | 12000.0 | sedan | 0 | 0 | 0 | 0.0 |
| 1 | 750365 | 0 | 1 | 0-9y | none | poverty | 0.357757 | 0.0 | before 2015 | 0.0 | 0.0 | 10238 | 16000.0 | sedan | 0 | 0 | 0 | 1.0 |
| 2 | 199901 | 0 | 0 | 0-9y | high school | working class | 0.493146 | 1.0 | before 2015 | 0.0 | 0.0 | 10238 | 11000.0 | sedan | 0 | 0 | 0 | 0.0 |
| 3 | 478866 | 0 | 1 | 0-9y | university | working class | 0.206013 | 1.0 | before 2015 | 0.0 | 1.0 | 32765 | 11000.0 | sedan | 0 | 0 | 0 | 0.0 |
| 4 | 731664 | 1 | 1 | 10-19y | none | working class | 0.388366 | 1.0 | before 2015 | 0.0 | 0.0 | 32765 | 12000.0 | sedan | 2 | 0 | 1 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 323164 | 1 | 0 | 10-19y | university | upper class | 0.582787 | 1.0 | before 2015 | 0.0 | 0.0 | 10238 | 16000.0 | sedan | 0 | 0 | 1 | 0.0 |
| 9996 | 910346 | 1 | 0 | 10-19y | none | middle class | 0.522231 | 1.0 | after 2015 | 0.0 | 1.0 | 32765 | NaN | sedan | 1 | 0 | 0 | 0.0 |
| 9997 | 468409 | 1 | 1 | 0-9y | high school | middle class | 0.470940 | 1.0 | before 2015 | 0.0 | 1.0 | 10238 | 14000.0 | sedan | 0 | 0 | 0 | 0.0 |
| 9998 | 903459 | 1 | 0 | 10-19y | high school | poverty | 0.364185 | 0.0 | before 2015 | 0.0 | 1.0 | 10238 | 13000.0 | sedan | 2 | 0 | 1 | 1.0 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   10000 non-null  int64  
 1   age                  10000 non-null  int64  
 2   gender               10000 non-null  int64  
 3   driving_experience   10000 non-null  object 
 4   education            10000 non-null  object 
 5   income               10000 non-null  object 
 6   credit_score         9018 non-null   float64
 7   vehicle_ownership    10000 non-null  float64
 8   vehicle_year         10000 non-null  object 
 9   married              10000 non-null  float64
 10  children             10000 non-null  float64
 11  postal_code          10000 non-null  int64  
 12  annual_mileage       9043 non-null   float64
 13  vehicle_type         10000 non-null  object 
 14  speeding_violations  10000 non-null  int64  
 15  duis                 10000 non-null  

In [4]:
for col in df.columns:
    print(col, df[col].value_counts(dropna=False))
    print('-------')

id 569520    1
702473    1
426507    1
386239    1
454670    1
         ..
931908    1
672440    1
297005    1
559584    1
442696    1
Name: id, Length: 10000, dtype: int64
-------
age 1    3063
2    2931
0    2016
3    1990
Name: age, dtype: int64
-------
gender 0    5010
1    4990
Name: gender, dtype: int64
-------
driving_experience 0-9y      3530
10-19y    3299
20-29y    2119
30y+      1052
Name: driving_experience, dtype: int64
-------
education high school    4157
university     3928
none           1915
Name: education, dtype: int64
-------
income upper class      4336
middle class     2138
poverty          1814
working class    1712
Name: income, dtype: int64
-------
credit_score NaN         982
0.428487      1
0.594531      1
0.396540      1
0.578306      1
           ... 
0.309272      1
0.847325      1
0.432080      1
0.527041      1
0.435225      1
Name: credit_score, Length: 9019, dtype: int64
-------
vehicle_ownership 1.0    6970
0.0    3030
Name: vehicle_ownership, dtype:

In [5]:
binary_features = ['vehicle_year','vehicle_type']

In [6]:
categ_non_ord_features = ['postal_code']

In [7]:
categ_ord_features = ['driving_experience','education','income']

In [8]:
for feature in binary_features:
    df[feature] = df[feature].astype('category').cat.rename_categories([0,1]).astype('int64')
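
One thing to watch with `rename_categories([0, 1])`: pandas sorts string categories alphabetically, so `'after 2015'` becomes `0` and `'before 2015'` becomes `1`, which is the reverse of the coding in the dataset table. The sketch below shows this on a toy Series (the values are illustrative, not read from the file) and uses an explicit mapping as one possible alternative.

```python
import pandas as pd

# Toy values for illustration only
vehicle_year = pd.Series(['before 2015', 'after 2015', 'before 2015'])

# Categories are sorted alphabetically: ['after 2015', 'before 2015']
as_category = vehicle_year.astype('category')
implicit = as_category.cat.rename_categories([0, 1]).astype('int64')

# An explicit map makes the coding visible and matches the dataset table
explicit = vehicle_year.map({'before 2015': 0, 'after 2015': 1})

print(list(as_category.cat.categories))
print(implicit.tolist())
print(explicit.tolist())
```

For a single binary feature the flipped coding only changes the sign of the fitted coefficient, not the accuracy, but it matters when reading the coefficients against the table.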

In [9]:
driving_experience_map = {'0-9y':0, '10-19y':1, '20-29y':2, '30y+':3}
education_map = {'high school':1, 'none':0, 'university':2}
income_map = {'upper class':3, 'poverty':0, 'working class':1, 'middle class':2}

In [10]:
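ordinal_maps = {'driving_experience': driving_experience_map, 'education': education_map, 'income': income_map}
for feature in categ_ord_features:
    df[feature] = df[feature].map(ordinal_maps[feature])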

In [11]:
# frequency encoding

postal_code_frequency = df.postal_code.value_counts(normalize = True)
print(postal_code_frequency)

for feature in categ_non_ord_features:
    df[feature] = df[feature].map(postal_code_frequency)

df.postal_code.value_counts()

10238    0.6940
32765    0.2456
92101    0.0484
21217    0.0120
Name: postal_code, dtype: float64


0.6940    6940
0.2456    2456
0.0484     484
0.0120     120
Name: postal_code, dtype: int64
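
The frequencies above are computed on the full dataset, before the train/test split. A possible variant, sketched here on toy postal codes (illustrative values, and the `fillna(0.0)` choice is an assumption), learns the frequencies from the training portion only and then applies them to the test portion.

```python
import pandas as pd

# Toy postal codes for illustration only
train_codes = pd.Series([10238, 10238, 10238, 32765])
test_codes = pd.Series([10238, 32765, 92101])  # 92101 is unseen in training

# Relative frequency of each code in the training portion
frequency = train_codes.value_counts(normalize=True)

train_encoded = train_codes.map(frequency)
# Codes absent from the training portion map to NaN, so fill them with 0
test_encoded = test_codes.map(frequency).fillna(0.0)

print(train_encoded.tolist())
print(test_encoded.tolist())
```

With only four postal codes in this dataset the difference is likely small, but the pattern keeps information from the test rows out of the encoding.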

In [12]:
# # target encoding 

# postal_code_target_mean = df.groupby('postal_code')['outcome'].mean()
# print(postal_code_target_mean)

# for feature in categ_non_ord_features:
#     df[feature] = df[feature].map(postal_code_target_mean)

# df.postal_code.value_counts()

In [13]:
# # one hot encoding

# df = pd.get_dummies(df, columns=['postal_code'])
# df

In [14]:
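# Impute missing values: mean for credit_score, most frequent value for annual_mileage
df['credit_score'] = df['credit_score'].fillna(df['credit_score'].mean())
df['annual_mileage'] = df['annual_mileage'].fillna(df['annual_mileage'].mode()[0])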

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   10000 non-null  int64  
 1   age                  10000 non-null  int64  
 2   gender               10000 non-null  int64  
 3   driving_experience   10000 non-null  int64  
 4   education            10000 non-null  int64  
 5   income               10000 non-null  int64  
 6   credit_score         10000 non-null  float64
 7   vehicle_ownership    10000 non-null  float64
 8   vehicle_year         10000 non-null  int64  
 9   married              10000 non-null  float64
 10  children             10000 non-null  float64
 11  postal_code          10000 non-null  float64
 12  annual_mileage       10000 non-null  float64
 13  vehicle_type         10000 non-null  int64  
 14  speeding_violations  10000 non-null  int64  
 15  duis                 10000 non-null  

In [16]:
# y = df['outcome']
# X = df.drop(['outcome','id'], axis=1)

In [17]:
train_data, test_data = train_test_split(df, test_size=0.25, random_state=777)

In [18]:
features = df.drop(['outcome','id'], axis=1).columns
features

Index(['age', 'gender', 'driving_experience', 'education', 'income',
       'credit_score', 'vehicle_ownership', 'vehicle_year', 'married',
       'children', 'postal_code', 'annual_mileage', 'vehicle_type',
       'speeding_violations', 'duis', 'past_accidents'],
      dtype='object')

In [19]:
models = [] # saving a fitted model for each feature

for feature in features:
    model = logit(f"outcome ~ {feature}", data=train_data).fit()
    models.append(model)

Optimization terminated successfully.
         Current function value: 0.511551
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.616040
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.469418
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.605342
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.530193
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.572488
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.553703
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.571210
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.587262
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.598354
  

In [20]:
coefficients = {}
accuracies = {}

for i, feature in enumerate(features):
    # params holds [Intercept, slope]; take the slope by position
    coefficients[feature] = models[i].params.iloc[1]
    # pred_table() is the confusion matrix on the training data (0.5 threshold),
    # with observed classes in rows and predicted classes in columns
    conf_matrix = models[i].pred_table()
    tn = conf_matrix[0,0]
    tp = conf_matrix[1,1]
    fn = conf_matrix[1,0]
    fp = conf_matrix[0,1]
    acc = (tn + tp) / (tn + fn + fp + tp)
    accuracies[feature] = acc
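
`pred_table()` tabulates predictions on the data the model was fitted on, so the accuracies in this loop describe the training split rather than `test_data`. If held-out accuracy is wanted instead, one option is a small helper like the one below. It is a sketch: `holdout_accuracy` and `ConstantModel` are hypothetical names introduced here, and the stand-in model exists only so the example runs without a fitted logit.

```python
import numpy as np
import pandas as pd

def holdout_accuracy(model, data, target="outcome", threshold=0.5):
    """Share of rows in `data` whose thresholded prediction matches the target."""
    predicted = (np.asarray(model.predict(data)) > threshold).astype(int)
    return float((predicted == data[target].to_numpy()).mean())

class ConstantModel:
    """Stand-in for a fitted model; always predicts the same probability."""
    def __init__(self, probability):
        self.probability = probability

    def predict(self, data):
        return np.full(len(data), self.probability)

# Toy held-out data for illustration only
toy_test = pd.DataFrame({"outcome": [0, 0, 1, 0]})
print(holdout_accuracy(ConstantModel(0.1), toy_test))
```

In the notebook this would be called as `holdout_accuracy(models[i], test_data)`, assuming the fitted model's `predict` returns one probability per row.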

In [21]:
sorted_accuracies = dict(sorted(accuracies.items(), key=lambda item: item[1], reverse=True))
sorted_accuracies

{'driving_experience': 0.7750666666666667,
 'age': 0.7748,
 'income': 0.744,
 'vehicle_ownership': 0.7345333333333334,
 'credit_score': 0.7050666666666666,
 'annual_mileage': 0.6890666666666667,
 'gender': 0.6856,
 'education': 0.6856,
 'vehicle_year': 0.6856,
 'married': 0.6856,
 'children': 0.6856,
 'postal_code': 0.6856,
 'vehicle_type': 0.6856,
 'speeding_violations': 0.6856,
 'duis': 0.6856,
 'past_accidents': 0.6856}
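
Ten of the sixteen features share exactly the same accuracy, 0.6856. A plausible reading (an assumption, not checked against the data here) is that these models never predict a claim at the 0.5 threshold, in which case their accuracy is simply the share of no-claim rows. The toy arrays below are hypothetical and only illustrate that effect.

```python
import numpy as np

# Hypothetical labels: 7 no-claim rows and 3 claim rows
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# Hypothetical predicted probabilities that never cross the 0.5 threshold
y_prob = np.array([0.2, 0.3, 0.25, 0.4, 0.35, 0.3, 0.2, 0.45, 0.4, 0.49])
y_pred = (y_prob > 0.5).astype(int)

accuracy = (y_pred == y_true).mean()
# Accuracy of always predicting the more common class
majority_share = max(y_true.mean(), 1 - y_true.mean())

print(accuracy, majority_share)
```

Comparing each feature's accuracy with this majority-class baseline shows how much the feature adds beyond always predicting "no claim".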

In [22]:
# Create best_feature_df from the feature with the highest accuracy
best_feature = max(accuracies, key=accuracies.get)
best_accuracy = accuracies[best_feature]

best_feature_df = pd.DataFrame({'best_feature': best_feature,
                                'best_accuracy': best_accuracy},
                               index=[0])
best_feature_df

| | best_feature | best_accuracy |
|---|---|---|
| 0 | driving_experience | 0.775067 |


In [23]:
sorted_coefficients_abs = dict(sorted(coefficients.items(), key=lambda item: np.abs(item[1]), reverse=True))
sorted_coefficients_abs

{'credit_score': -5.51404906267477,
 'vehicle_year': 1.7783554180949352,
 'vehicle_ownership': -1.7192178381327863,
 'driving_experience': -1.6587054234851901,
 'postal_code': -1.317591770645024,
 'married': -1.1812833800176927,
 'age': -1.1399358994089013,
 'duis': -1.1355011769926817,
 'children': -1.0019372708590988,
 'past_accidents': -0.8470287156628181,
 'income': -0.8411530773790739,
 'education': -0.5425048654237492,
 'speeding_violations': -0.5347115103898243,
 'gender': 0.4949593400501128,
 'vehicle_type': 0.011699091680676964,
 'annual_mileage': 0.00014310174932689738}
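
Raw logit coefficients depend on each feature's units, so this ranking partly reflects scale: `annual_mileage` is measured in miles, while `credit_score` lies between zero and one. As a rough illustration, the sketch below converts two of the coefficients printed above into odds ratios for more comparable step sizes. The step sizes (1,000 miles and 0.1 credit-score points) are arbitrary choices made for this example.

```python
import math

# Coefficients copied from the single-feature models above
coef_annual_mileage = 0.00014310174932689738  # per mile
coef_credit_score = -5.51404906267477         # per unit, i.e. the full 0-to-1 range

# Odds ratio for a 1,000-mile increase in annual mileage
odds_ratio_per_1000_miles = math.exp(coef_annual_mileage * 1000)

# Odds ratio for a 0.1 increase in credit score
odds_ratio_per_tenth_credit = math.exp(coef_credit_score * 0.1)

print(round(odds_ratio_per_1000_miles, 3))
print(round(odds_ratio_per_tenth_credit, 3))
```

Standardizing the features before fitting would be another way to put the coefficients on a common footing.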

In [24]:
all_features_formula = "outcome ~ " + " + ".join(features)
all_features_formula

'outcome ~ age + gender + driving_experience + education + income + credit_score + vehicle_ownership + vehicle_year + married + children + postal_code + annual_mileage + vehicle_type + speeding_violations + duis + past_accidents'

In [25]:
all_features_model = logit(all_features_formula, data=train_data).fit()
all_features_model.summary()

Optimization terminated successfully.
         Current function value: 0.344987
         Iterations 8


| | | | |
|---|---|---|---|
| Dep. Variable: | outcome | No. Observations: | 7500 |
| Model: | Logit | Df Residuals: | 7483 |
| Method: | MLE | Df Model: | 16 |
| Date: | Thu, 15 Aug 2024 | Pseudo R-squ.: | 0.4459 |
| Time: | 18:24:58 | Log-Likelihood: | -2587.4 |
| converged: | True | LL-Null: | -4669.3 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.1837 | 0.288 | 0.638 | 0.524 | -0.381 | 0.748 |
| age | -0.0168 | 0.053 | -0.317 | 0.751 | -0.121 | 0.087 |
| gender | 1.0220 | 0.075 | 13.542 | 0.000 | 0.874 | 1.170 |
| driving_experience | -1.7569 | 0.087 | -20.097 | 0.000 | -1.928 | -1.586 |
| education | 0.0374 | 0.058 | 0.648 | 0.517 | -0.076 | 0.150 |
| income | -0.0787 | 0.057 | -1.388 | 0.165 | -0.190 | 0.032 |
| credit_score | 0.1168 | 0.379 | 0.309 | 0.758 | -0.625 | 0.859 |
| vehicle_ownership | -1.7372 | 0.080 | -21.770 | 0.000 | -1.894 | -1.581 |
| vehicle_year | 1.8052 | 0.097 | 18.600 | 0.000 | 1.615 | 1.995 |


In [26]:
all_features_predictions = all_features_model.predict(test_data)

# Convert predictions to binary outcomes
predicted_classes = (all_features_predictions > 0.5).astype(int)

# Compute the confusion matrix
conf_matrix = confusion_matrix(test_data["outcome"], predicted_classes)
tn, fp, fn, tp = conf_matrix.ravel()

# Compute accuracy
accuracy = (tn + tp) / (tn + fn + fp + tp)
print(f"Accuracy: {accuracy}")

# Print coefficients
coefficients = all_features_model.params
print("Coefficients:")
print(coefficients)

Accuracy: 0.8464
Coefficients:
Intercept              0.183670
age                   -0.016807
gender                 1.021998
driving_experience    -1.756916
education              0.037369
income                -0.078690
credit_score           0.116818
vehicle_ownership     -1.737152
vehicle_year           1.805233
married               -0.359317
children              -0.006015
postal_code           -3.033902
annual_mileage         0.000121
vehicle_type           0.053598
speeding_violations   -0.010403
duis                   0.026522
past_accidents        -0.088776
dtype: float64
