# Insurance company customers' segmentation
## Group project
### This notebook uses the *a2z_insurance.sas7bdat* dataset

(c) Vasco Jesus, Nuno António 2020-2021 - Rev. 2.01

## Dataset description

- **CustID**: numeric - customer ID
- **FirstPolYear**: numeric - year of the customer's first policy. May be considered the first year as a customer
- **BirthYear**: numeric - birth year of the customer. The current year of the database is 2016
- **EducDeg**: categorical - academic degree
- **MonthSal**: numerical - monthly gross salary (€)
- **GeoLivArea**: numerical - codes about the area of living. No additional information is available for these codes
- **Children**: numerical - indication if the customer has children (0: no, 1: yes)
- **CustMonVal**: numerical - customer monetary value (CMV). CMV = (annual profit from the customer) × (number of years as a customer) - (acquisition cost)
- **ClaimsRate**: numerical - claims rate. Amount paid by the insurance company (€) / premiums (€), over the last two years
- **PremMotor**: numerical - premiums in the Line of Business (LOB) Motor (€)
- **PremHousehold**: numerical - premiums in the LOB Household (€)
- **PremHealth**: numerical - premiums in the LOB Health (€)
- **PremLife**: numerical - premiums in the LOB Life (€)
- **PremWork**: numerical - premiums in the LOB Work (€)

<br>NOTES about all Premiums:
- Annual premiums (2016)
- Negative premiums may reflect reversals that occurred in the current year for amounts paid in previous year(s)
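
The field definitions above imply a few derived quantities, such as age and tenure relative to the 2016 reference year, and the total annual premium across the five lines of business. The following is a minimal sketch of how they could be computed with pandas. The function name `add_derived_features` and the new column names `Age`, `YearsAsCustomer` and `PremTotal` are assumptions made for illustration and are not part of the dataset. The two sample rows are typed in by hand from the notebook's own sample output.

```python
import pandas as pd

REFERENCE_YEAR = 2016  # "current year of the database", per the dataset description

PREMIUM_COLUMNS = ["PremMotor", "PremHousehold", "PremHealth", "PremLife", "PremWork"]


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with three illustrative derived columns.

    The column names Age, YearsAsCustomer and PremTotal are hypothetical.
    """
    out = df.copy()
    out["Age"] = REFERENCE_YEAR - out["BirthYear"]
    out["YearsAsCustomer"] = REFERENCE_YEAR - out["FirstPolYear"]
    # sum(axis=1) skips missing values by default, so a customer with one
    # missing premium still gets a total over the remaining lines of business.
    out["PremTotal"] = out[PREMIUM_COLUMNS].sum(axis=1)
    return out


# Two rows copied by hand for illustration only.
sample = pd.DataFrame({
    "FirstPolYear": [1985.0, 1981.0],
    "BirthYear": [1982.0, 1995.0],
    "PremMotor": [375.85, 77.46],
    "PremHousehold": [79.45, 416.20],
    "PremHealth": [146.36, 116.69],
    "PremLife": [47.01, 194.48],
    "PremWork": [16.89, 106.13],
})

derived = add_derived_features(sample)
print(derived[["Age", "YearsAsCustomer", "PremTotal"]])
```

Whether skipping missing premiums in the total is appropriate is a modelling decision. Passing `min_count=len(PREMIUM_COLUMNS)` to `sum` would instead return a missing total whenever any premium is missing.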

## Group details
- Composed of three students. Groups of two are acceptable, but must be approved by the instructors.
- Students can be from different theory and practical classes.

## Work description

### Overview
<p>You should organize into groups of up to 3 students and assume the role of a Data Mining/Analytics consulting company. You are asked to develop a customer segmentation that lets the Marketing Department of an insurance company better understand its different customer profiles.</p>
<p>Employing the CRISP-DM process model, you are expected to define, describe and explain the clusters you chose. Invest time in reasoning about how you want to do your clustering, the possible approaches, and the advantages and disadvantages of different decisions. At the same time, you should state the marketing approach you recommend for each cluster.</p>

### Deliverables
- Python source code (Jupyter notebook or .py files). Code should be commented to facilitate comprehension
- Report:
    - Maximum of 20 pages (excluding appendixes)
    - Minimum font size is 10
    - Should describe the main outputs according to CRISP-DM, including a brief description of the problem, methods, results, and their discussion


### Discussion
- To be done in the exam season with all group members present
- Slots of 15 minutes per group
- No presentation is required. Just start the discussion with your report and Python file(s) open


### Questions or additional information
For any additional questions, don't hesitate to get in touch with the instructors of the practical classes. They will also act as the insurance company business/project stakeholders.

<br><br>
Good work or good luck ;)

## Initializations and data loading

In [2]:
# Loading packages
import pandas as pd

In [3]:
# Loading the dataset and visualizing summary statistics
ds = pd.read_sas('a2z_insurance.sas7bdat', format='sas7bdat')
ds.describe(include='all').T

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CustID | 10296.0 | | | | 5148.5 | 2972.34352 | 1.0 | 2574.75 | 5148.5 | 7722.25 | 10296.0 |
| FirstPolYear | 10266.0 | | | | 1991.062634 | 511.267913 | 1974.0 | 1980.0 | 1986.0 | 1992.0 | 53784.0 |
| BirthYear | 10279.0 | | | | 1968.007783 | 19.709476 | 1028.0 | 1953.0 | 1968.0 | 1983.0 | 2001.0 |
| EducDeg | 10279.0 | 4.0 | b'3 - BSc/MSc' | 4799.0 | | | | | | | |
| MonthSal | 10260.0 | | | | 2506.667057 | 1157.449634 | 333.0 | 1706.0 | 2501.5 | 3290.25 | 55215.0 |
| GeoLivArea | 10295.0 | | | | 2.709859 | 1.266291 | 1.0 | 1.0 | 3.0 | 4.0 | 4.0 |
| Children | 10275.0 | | | | 0.706764 | 0.455268 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| CustMonVal | 10296.0 | | | | 177.892605 | 1945.811505 | -165680.42 | -9.44 | 186.87 | 399.7775 | 11875.89 |
| ClaimsRate | 10296.0 | | | | 0.742772 | 2.916964 | 0.0 | 0.39 | 0.72 | 0.98 | 256.2 |
| PremMotor | 10262.0 | | | | 300.470252 | 211.914997 | -4.11 | 190.59 | 298.61 | 408.3 | 11604.42 |
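
The summary statistics hint at data-quality issues worth checking before any clustering. Several counts are below 10296, which indicates missing values. Some extremes also look implausible given the 2016 reference year, for example a maximum `FirstPolYear` of 53784 and a minimum `BirthYear` of 1028. The sketch below shows one possible way to count missing values and flag such records. The function name `flag_implausible`, the flag names and the 1900 lower bound for birth years are assumptions, and the toy rows are made up to mimic the extremes above rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2016      # current year of the database, per the dataset description
MIN_BIRTH_YEAR = 1900      # assumed lower bound, chosen only for illustration


def flag_implausible(df: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per plausibility rule (hypothetical rule names).

    Comparisons involving missing values evaluate to False, so rows with
    missing years are not flagged here and need to be handled separately.
    """
    birth = df["BirthYear"]
    first = df["FirstPolYear"]
    flags = pd.DataFrame(index=df.index)
    flags["birth_year_out_of_range"] = (birth < MIN_BIRTH_YEAR) | (birth > REFERENCE_YEAR)
    flags["first_pol_year_in_future"] = first > REFERENCE_YEAR
    flags["policy_before_birth"] = first < birth
    return flags


# Toy rows invented to exercise each rule; not real customers.
toy = pd.DataFrame({
    "FirstPolYear": [1985.0, 1981.0, 53784.0, 1990.0, np.nan],
    "BirthYear": [1982.0, 1995.0, 1970.0, 1028.0, 1968.0],
})

missing_per_column = toy.isna().sum()   # number of missing values in each column
flags = flag_implausible(toy)

print(missing_per_column)
print(flags.sum())                      # number of rows violating each rule
```

Flagging rather than dropping keeps the decision about each rule (remove, impute, or keep) explicit, which is easier to justify in the report.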


In [4]:
# Show top rows
ds.head()

| | CustID | FirstPolYear | BirthYear | EducDeg | MonthSal | GeoLivArea | Children | CustMonVal | ClaimsRate | PremMotor | PremHousehold | PremHealth | PremLife | PremWork |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1985.0 | 1982.0 | b'2 - High School' | 2177.0 | 1.0 | 1.0 | 380.97 | 0.39 | 375.85 | 79.45 | 146.36 | 47.01 | 16.89 |
| 1 | 2.0 | 1981.0 | 1995.0 | b'2 - High School' | 677.0 | 4.0 | 1.0 | -131.13 | 1.12 | 77.46 | 416.2 | 116.69 | 194.48 | 106.13 |
| 2 | 3.0 | 1991.0 | 1970.0 | b'1 - Basic' | 2277.0 | 3.0 | 0.0 | 504.67 | 0.28 | 206.15 | 224.5 | 124.58 | 86.35 | 99.02 |
| 3 | 4.0 | 1990.0 | 1981.0 | b'3 - BSc/MSc' | 1099.0 | 4.0 | 1.0 | -16.99 | 0.99 | 182.48 | 43.35 | 311.17 | 35.34 | 28.34 |
| 4 | 5.0 | 1986.0 | 1973.0 | b'3 - BSc/MSc' | 1763.0 | 4.0 | 1.0 | 35.23 | 0.9 | 338.62 | 47.8 | 182.59 | 18.78 | 41.45 |
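
The `EducDeg` values appear as Python byte strings (`b'...'`) because the SAS file was read without a text encoding. One option is to pass an `encoding` argument to `pd.read_sas`. Another, sketched below, is to decode the byte strings after loading. The helper name `decode_bytes_columns` and the choice of UTF-8 are assumptions, since the actual encoding of the file is not stated in this notebook.

```python
import pandas as pd


def decode_bytes_columns(df: pd.DataFrame, encoding: str = "utf-8") -> pd.DataFrame:
    """Return a copy of df where bytes values in object columns are decoded to str.

    Non-bytes values (numbers, missing values, existing strings) are left unchanged.
    The default encoding is an assumption, not something the dataset documents.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_object_dtype(out[col]):
            out[col] = out[col].map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out


# Small illustrative frame using the category labels shown in the output above.
sample = pd.DataFrame({
    "EducDeg": [b"2 - High School", b"1 - Basic", b"3 - BSc/MSc", None],
    "Children": [1.0, 0.0, 1.0, 1.0],
})

decoded = decode_bytes_columns(sample)
print(decoded["EducDeg"].tolist())
```

Decoding after loading keeps the original loading cell unchanged and makes the encoding assumption visible in one place.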


______________________________________