## Scenario
You are working as an analyst for an auto insurance company. The company has collected data about its customers, including their demographics, education, employment, policy details, information on the insured vehicle, and claim amounts. You will help senior management answer business questions so that they can better understand their customers, improve their services, and improve profitability.

## Business Objectives
- Retain customers.
- Analyze relevant customer data.
- Develop focused customer retention programs.

Based on the analysis, take targeted actions to increase profitable customer response, retention, and growth.

In [1]:
#Aggregate data into one DataFrame using pandas.
#Flo : I import the modules needed for the analysis and the data to be aggregated into one df (= DataFrame).
#I print the head and the shape of the df to verify that the data is aggregated correctly.
#The aggregated dataset has more columns than the individual files, so I will have to merge some columns later.

import pandas as pd
import numpy as np
import statistics as stats

file_1 = pd.read_csv("Data/file1.csv")
file_2 = pd.read_csv("Data/file2.csv")
file_3 = pd.read_csv("Data/file3.csv")

customer_data = pd.concat([file_1, file_2, file_3], axis=0)

print(customer_data.head())
print(file_1.shape)
print(file_2.shape)
print(file_3.shape)
print(customer_data.shape)

  Customer          ST GENDER             Education Customer Lifetime Value  \
0  RB50392  Washington    NaN                Master                     NaN   
1  QZ44356     Arizona      F              Bachelor              697953.59%   
2  AI49188      Nevada      F              Bachelor             1288743.17%   
3  WW63253  California      M              Bachelor              764586.18%   
4  GA49547  Washington      M  High School or Below              536307.65%   

    Income  Monthly Premium Auto Number of Open Complaints     Policy Type  \
0      0.0                1000.0                    1/0/00   Personal Auto   
1      0.0                  94.0                    1/0/00   Personal Auto   
2  48767.0                 108.0                    1/0/00   Personal Auto   
3      0.0                 106.0                    1/0/00  Corporate Auto   
4  36357.0                  68.0                    1/0/00   Personal Auto   

   Vehicle Class  Total Claim Amount State Gender  
0  F
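
The extra columns appear because `pd.concat` aligns on exact header names, so `ST` and `State` (or `GENDER` and `Gender`) end up as separate columns. One possible way to avoid this is to normalize the headers before concatenating. The sketch below uses two small made-up frames instead of the real CSV files, and the `CANONICAL` mapping and `normalize_headers` helper are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the input files (not the real data).
part_a = pd.DataFrame({"Customer": ["A1"], "ST": ["Cali"], "GENDER": ["F"]})
part_b = pd.DataFrame({"Customer": ["B2"], "State": ["Oregon"], "Gender": ["M"]})

# Assumed mapping from lowercased header variants to one canonical name.
CANONICAL = {"st": "state"}

def normalize_headers(df):
    """Return a copy with headers stripped, lowercased and mapped to canonical names."""
    renamed = {}
    for col in df.columns:
        key = col.strip().lower()
        renamed[col] = CANONICAL.get(key, key)
    return df.rename(columns=renamed)

# With matching headers, concat stacks the rows without creating duplicate columns.
combined = pd.concat(
    [normalize_headers(part_a), normalize_headers(part_b)],
    axis=0,
    ignore_index=True,
)
print(combined.columns.tolist())
print(combined.shape)
```

With this approach the later merge of `st`/`state` and the two gender columns would not be needed, at the cost of having to know the header variants up front.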

In [14]:
# I investigate the number of completely empty rows in the df (2937 of 12074 rows)

customer_data.isna().sum()
null = customer_data[customer_data.isnull().all(axis=1)]
print(null)

     Customer   ST GENDER Education Customer Lifetime Value  Income  \
1071      NaN  NaN    NaN       NaN                     NaN     NaN   
1072      NaN  NaN    NaN       NaN                     NaN     NaN   
1073      NaN  NaN    NaN       NaN                     NaN     NaN   
1074      NaN  NaN    NaN       NaN                     NaN     NaN   
1075      NaN  NaN    NaN       NaN                     NaN     NaN   
...       ...  ...    ...       ...                     ...     ...   
4003      NaN  NaN    NaN       NaN                     NaN     NaN   
4004      NaN  NaN    NaN       NaN                     NaN     NaN   
4005      NaN  NaN    NaN       NaN                     NaN     NaN   
4006      NaN  NaN    NaN       NaN                     NaN     NaN   
4007      NaN  NaN    NaN       NaN                     NaN     NaN   

      Monthly Premium Auto Number of Open Complaints Policy Type  \
1071                   NaN                       NaN         NaN   
1072       
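
Rows in which every column is null carry no information, so they can usually be dropped. The following is a minimal sketch on a made-up frame (not the real data) that counts such rows and removes them with `dropna(how="all")`, while keeping rows that are only partially empty.

```python
import numpy as np
import pandas as pd

# Toy frame: rows 1 and 3 are entirely empty, row 2 is only partially empty.
toy = pd.DataFrame({
    "customer": ["A1", np.nan, "C3", np.nan],
    "income": [100.0, np.nan, np.nan, np.nan],
})

# Boolean mask that is True only where every column of the row is null.
empty_mask = toy.isna().all(axis=1)
n_empty = int(empty_mask.sum())

# how="all" drops a row only if all of its values are missing.
cleaned = toy.dropna(how="all").reset_index(drop=True)

print(n_empty)
print(cleaned)
```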

In [15]:
# I remove the % sign from the customer lifetime value

customer_data['Customer Lifetime Value'] = customer_data['Customer Lifetime Value'].str.replace("%",'')
print(customer_data.head())
# Note: type() reports the container (a Series); the values themselves are still strings at this point
type(customer_data['Customer Lifetime Value'])

  Customer          ST GENDER             Education Customer Lifetime Value  \
0  RB50392  Washington    NaN                Master                     NaN   
1  QZ44356     Arizona      F              Bachelor               697953.59   
2  AI49188      Nevada      F              Bachelor              1288743.17   
3  WW63253  California      M              Bachelor               764586.18   
4  GA49547  Washington      M  High School or Below               536307.65   

    Income  Monthly Premium Auto Number of Open Complaints     Policy Type  \
0      0.0                1000.0                    1/0/00   Personal Auto   
1      0.0                  94.0                    1/0/00   Personal Auto   
2  48767.0                 108.0                    1/0/00   Personal Auto   
3      0.0                 106.0                    1/0/00  Corporate Auto   
4  36357.0                  68.0                    1/0/00   Personal Auto   

   Vehicle Class  Total Claim Amount State Gender  
0  F

pandas.core.series.Series
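
`type()` on a column always reports `pandas.core.series.Series`, so it does not show whether the values are numbers or strings. A small sketch on made-up values, showing how the `%` sign could be stripped and the result checked through `.dtype` instead:

```python
import pandas as pd

# Toy values in the same shape as the column above (not the real data).
clv = pd.Series(["697953.59%", None, "1288743.17%"], name="Customer Lifetime Value")

# regex=False treats "%" as a literal character rather than a pattern.
stripped = clv.str.replace("%", "", regex=False)

# errors="coerce" turns anything that cannot be parsed into NaN.
clv_numeric = pd.to_numeric(stripped, errors="coerce")

print(type(clv_numeric))   # the container type, always a Series
print(clv_numeric.dtype)   # the element type, which is what matters here
```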

In [16]:
#Standardizing header names
#Flo : I will standardize header names by putting them all in lowercase

def lower_case_column_names(customer_data):
    customer_data.columns=[i.lower() for i in customer_data.columns]
    return customer_data

lower_case_column_names(customer_data)

Unnamed: 0,customer,st,gender,education,customer lifetime value,income,monthly premium auto,number of open complaints,policy type,vehicle class,total claim amount,state,gender.1
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934,,
1,QZ44356,Arizona,F,Bachelor,697953.59,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935,,
2,AI49188,Nevada,F,Bachelor,1288743.17,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247,,
3,WW63253,California,M,Bachelor,764586.18,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344,,
4,GA49547,Washington,M,High School or Below,536307.65,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7065,LA72316,,,Bachelor,,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,California,M
7066,PK87824,,,College,,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,California,F
7067,TD14365,,,Bachelor,,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,California,M
7068,UP19263,,,College,,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,California,M


In [17]:
#Deleting and rearranging columns – delete the column customer as it is only a unique identifier for each row of data

customer_data.drop(["customer"],axis=1, inplace=True)
print(customer_data.head())

           st gender             education customer lifetime value   income  \
0  Washington    NaN                Master                     NaN      0.0   
1     Arizona      F              Bachelor               697953.59      0.0   
2      Nevada      F              Bachelor              1288743.17  48767.0   
3  California      M              Bachelor               764586.18      0.0   
4  Washington      M  High School or Below               536307.65  36357.0   

   monthly premium auto number of open complaints     policy type  \
0                1000.0                    1/0/00   Personal Auto   
1                  94.0                    1/0/00   Personal Auto   
2                 108.0                    1/0/00   Personal Auto   
3                 106.0                    1/0/00  Corporate Auto   
4                  68.0                    1/0/00   Personal Auto   

   vehicle class  total claim amount state gender  
0  Four-Door Car            2.704934   NaN    NaN  
1  Fou

In [18]:
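# I investigate the column names and reorder the columns
# Note: 'gender' now names two columns (formerly GENDER and Gender), so selecting it returns both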

list(customer_data.columns)

customer_data = customer_data[['st','state',
 'gender',
 'education',
 'customer lifetime value',
 'income',
 'monthly premium auto',
 'number of open complaints',
 'policy type',
 'vehicle class',
 'total claim amount']]

customer_data.head()

Unnamed: 0,st,state,gender,gender.1,education,customer lifetime value,income,monthly premium auto,number of open complaints,policy type,vehicle class,total claim amount
0,Washington,,,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,Arizona,,F,,Bachelor,697953.59,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,Nevada,,F,,Bachelor,1288743.17,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,California,,M,,Bachelor,764586.18,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,Washington,,M,,High School or Below,536307.65,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323


In [19]:
# I print the different unique values for the states in order to investigate and correct them

print(customer_data["st"].value_counts())
print(customer_data["state"].value_counts())

Oregon        623
California    488
Arizona       328
Nevada        223
Washington    181
Cali          120
AZ             74
WA             30
Name: st, dtype: int64
California    2544
Oregon        1978
Arizona       1302
Nevada         659
Washington     587
Name: state, dtype: int64


In [20]:
# I merge the redundant state and gender columns

list(customer_data.columns)

customer_data.columns = ['st','state',
 'gender1','gender2',
 'education',
 'customer lifetime value',
 'income',
 'monthly premium auto',
 'number of open complaints',
 'policy type',
 'vehicle class',
 'total claim amount']

customer_data["st"] = customer_data.pop("st").fillna(customer_data.pop("state")).astype(str)
customer_data["gender1"] = customer_data.pop("gender1").fillna(customer_data.pop("gender2")).astype(str)
customer_data.rename(columns = {'gender1':'gender'}, inplace = True)
print(customer_data.head())

customer_data = customer_data[['st',
 'gender',
 'education',
 'customer lifetime value',
 'income',
 'monthly premium auto',
 'number of open complaints',
 'policy type',
 'vehicle class',
 'total claim amount']]


customer_data.head()

              education customer lifetime value   income  \
0                Master                     NaN      0.0   
1              Bachelor               697953.59      0.0   
2              Bachelor              1288743.17  48767.0   
3              Bachelor               764586.18      0.0   
4  High School or Below               536307.65  36357.0   

   monthly premium auto number of open complaints     policy type  \
0                1000.0                    1/0/00   Personal Auto   
1                  94.0                    1/0/00   Personal Auto   
2                 108.0                    1/0/00   Personal Auto   
3                 106.0                    1/0/00  Corporate Auto   
4                  68.0                    1/0/00   Personal Auto   

   vehicle class  total claim amount          st gender  
0  Four-Door Car            2.704934  Washington    nan  
1  Four-Door Car         1131.464935     Arizona      F  
2   Two-Door Car          566.472247      Nevada  

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  customer_data["st"] = customer_data.pop("st").fillna(customer_data.pop("state")).astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  customer_data["gender1"] = customer_data.pop("gender1").fillna(customer_data.pop("gender2")).astype(str)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(


Unnamed: 0,st,gender,education,customer lifetime value,income,monthly premium auto,number of open complaints,policy type,vehicle class,total claim amount
0,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,Arizona,F,Bachelor,697953.59,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,Nevada,F,Bachelor,1288743.17,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,California,M,Bachelor,764586.18,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,Washington,M,High School or Below,536307.65,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323
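
The messages in the output above are pandas' `SettingWithCopyWarning`: `customer_data` was created by selecting columns from another DataFrame, so pandas cannot tell whether the assignment is meant for a copy. In addition, `.astype(str)` turns missing values into the literal string `'nan'`, which is why `nan` shows up as a category later. A hedged sketch on a made-up frame of one way to merge the column pairs without either side effect:

```python
import numpy as np
import pandas as pd

# Toy frame with the same pairs of redundant columns (not the real data).
toy = pd.DataFrame({
    "st": ["Washington", np.nan, np.nan],
    "state": [np.nan, "California", np.nan],
    "gender1": ["F", np.nan, np.nan],
    "gender2": [np.nan, "M", np.nan],
    "income": [0.0, 71941.0, np.nan],
})

# .copy() makes the selection an independent DataFrame, so later
# assignments do not trigger SettingWithCopyWarning.
subset = toy[["st", "state", "gender1", "gender2"]].copy()

# fillna takes the value from the second column wherever the first is missing.
# No astype(str): real missing values stay NaN instead of becoming "nan".
subset["st"] = subset["st"].fillna(subset["state"])
subset["gender"] = subset["gender1"].fillna(subset["gender2"])

merged = subset.drop(columns=["state", "gender1", "gender2"])
print(merged)
```

Keeping the missing values as real `NaN` means `isna()` and `dropna()` continue to work on these columns.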


In [21]:
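#Working with data types – Check the data types of all the columns and fix the incorrect ones
#(e.g. customer lifetime value and number of open complaints)
#Note: errors='coerce' turns every value that cannot be parsed as a number (such as "1/0/00") into NaN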

customer_data.info()

customer_data['customer lifetime value'] =  pd.to_numeric(customer_data['customer lifetime value'], errors='coerce')
customer_data['number of open complaints'] =  pd.to_numeric(customer_data['number of open complaints'], errors='coerce')

customer_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12074 entries, 0 to 7069
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   st                         12074 non-null  object 
 1   gender                     12074 non-null  object 
 2   education                  9137 non-null   object 
 3   customer lifetime value    2060 non-null   object 
 4   income                     9137 non-null   float64
 5   monthly premium auto       9137 non-null   float64
 6   number of open complaints  9137 non-null   object 
 7   policy type                9137 non-null   object 
 8   vehicle class              9137 non-null   object 
 9   total claim amount         9137 non-null   float64
dtypes: float64(3), object(7)
memory usage: 1.0+ MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12074 entries, 0 to 7069
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
--- 
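
Because of `errors='coerce'`, values such as `1/0/00` in `number of open complaints` become `NaN` rather than numbers. If the middle field of that string is the actual complaint count (an assumption; the source does not state what the format encodes), it could be extracted before converting. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy values mixing the "1/0/00" format with plain counts (not the real data).
complaints = pd.Series(["1/0/00", "1/2/00", "0", "3", None], dtype=object)

# Naive conversion: every "x/y/zz" value is lost as NaN.
naive = pd.to_numeric(complaints, errors="coerce")

# ASSUMPTION: the middle field of "x/y/zz" is the number of open complaints.
middle = complaints.str.extract(r"^\d+/(\d+)/\d+$", expand=False)

# Where the pattern did not match, fall back to the original value.
parsed = pd.to_numeric(middle.fillna(complaints), errors="coerce")

print(naive.isna().sum())
print(parsed)
```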

In [22]:
#Filtering data and Correcting typos – Filter the data in state and gender column to standardize the texts in those columns

print(customer_data["st"].value_counts())
print(customer_data["gender"].value_counts())

customer_data.loc[customer_data["st"].str.contains('Cali') == True, "st"] = "California"
customer_data.loc[customer_data["st"].str.contains('WA') == True, "st"] = "Washington"
customer_data.loc[customer_data["st"].str.contains('AZ') == True, "st"] = "Arizona"

customer_data.loc[customer_data["gender"].str.contains('Male') == True, "gender"] = "M"
customer_data.loc[customer_data["gender"].str.contains('female') == True, "gender"] = "F"
customer_data.loc[customer_data["gender"].str.contains('Femal') == True, "gender"] = "F"

print(customer_data["st"].value_counts())
print(customer_data["gender"].value_counts())


California    3032
nan           2937
Oregon        2601
Arizona       1630
Nevada         882
Washington     768
Cali           120
AZ              74
WA              30
Name: st, dtype: int64
F         4560
M         4368
nan       3059
Male        40
female      30
Femal       17
Name: gender, dtype: int64
California    3152
nan           2937
Oregon        2601
Arizona       1704
Nevada         882
Washington     798
Name: st, dtype: int64
F      4607
M      4408
nan    3059
Name: gender, dtype: int64


In [23]:
#Removing duplicates

print(customer_data[customer_data.duplicated(keep=False)])
# drop_duplicates() returns a new DataFrame, so the result is assigned back
customer_data = customer_data.drop_duplicates()

print(customer_data)

              st gender             education  customer lifetime value  \
344   California      F  High School or Below                559538.99   
538   California      M               College               1749752.20   
552       Oregon      F                Master                417068.73   
588   California      M              Bachelor                445811.34   
633   California      F              Bachelor               1017971.70   
...          ...    ...                   ...                      ...   
7024  California      M               College                      NaN   
7031  California      M               College                      NaN   
7037  California      M                Master                      NaN   
7058  California      F               College                      NaN   
7059  California      F              Bachelor                      NaN   

       income  monthly premium auto  number of open complaints  \
344   74454.0                  71.0          
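
Note that `drop_duplicates()` does not modify the DataFrame it is called on; it returns a new one. A short sketch on made-up rows that shows the difference between discarding and keeping the result:

```python
import pandas as pd

# Toy frame in which the first two rows are identical (not the real data).
toy = pd.DataFrame({
    "st": ["California", "California", "Oregon"],
    "income": [74454.0, 74454.0, 0.0],
})

# Calling the method without using its return value leaves toy unchanged.
toy.drop_duplicates()
rows_before = len(toy)

# Assigning the result keeps the de-duplicated frame.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(rows_before, len(deduped))
```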

In [2]:
#Replacing null values – Replace missing values with means of the column (for numerical columns)
#As a first step, this cell only computes the percentage of null values per column

nulls_df = pd.DataFrame(round(customer_data.isna().sum()/len(customer_data),4)*100)
nulls_df = nulls_df.reset_index()
nulls_df.columns = ['header_name', 'percent_nulls']
nulls_df

                 header_name  percent_nulls
0                   Customer          24.32
1                         ST          82.88
2                     GENDER          83.89
3                  Education          24.32
4    Customer Lifetime Value          24.38
5                     Income          24.32
6       Monthly Premium Auto          24.32
7  Number of Open Complaints          24.32
8                Policy Type          24.32
9              Vehicle Class          24.32
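
The cell above only measures how many values are missing. The replacement step named in its comment could look roughly like the sketch below, which fills the nulls of each numerical column with that column's mean and leaves text columns alone. It runs on a made-up frame, not on the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two numeric columns and one text column.
toy = pd.DataFrame({
    "income": [0.0, 48767.0, np.nan],
    "monthly premium auto": [94.0, np.nan, 108.0],
    "education": ["Master", None, "Bachelor"],
})

# Restrict the operation to numeric columns; a mean is undefined for text.
numeric_cols = toy.select_dtypes(include="number").columns

filled = toy.copy()
# mean() skips NaN by default; fillna aligns the resulting means by column name.
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print(filled)
```

Filling with the mean keeps the column average unchanged but shrinks its variance, so it is worth dropping the fully empty rows first rather than imputing them.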
