<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Scenario" data-toc-modified-id="Scenario-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Scenario</a></span></li><li><span><a href="#Business-Objectives" data-toc-modified-id="Business-Objectives-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Business Objectives</a></span><ul class="toc-item"><li><span><a href="#Activity-1" data-toc-modified-id="Activity-1-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Activity 1</a></span></li><li><span><a href="#Activity-2" data-toc-modified-id="Activity-2-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Activity 2</a></span></li></ul></li></ul></div>

## Scenario
You are working as an analyst for an auto insurance company. The company has collected data about its customers, including their demographics, education, employment, policy details, information on the insured vehicle, and claim amounts. You will help senior management answer business questions so they can better understand their customers, improve their services, and improve profitability.

## Business Objectives
- Retain customers.
- Analyze relevant customer data.
- Develop focused customer retention programs.

Based on the analysis, take targeted actions to increase profitable customer response, retention, and growth.

### Activity 1

- Aggregate data into one DataFrame using pandas.
- Standardize header names.
- Delete and rearrange columns – delete the column `customer`, as it is only a unique identifier for each row of data.
- Work with data types – check the data types of all the columns and fix the incorrect ones (e.g. customer lifetime value and number of open complaints).
- Filter data and correct typos – filter the data in the state and gender columns to standardize the text in those columns.
- Remove duplicates.
- Replace null values – replace missing values with the mean of the column (for numerical columns).

In [1]:
#Importing the necessary modules for the code

import pandas as pd
import numpy as np
import statistics as stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
#Standardizing header names
#Flo : I will standardize header names by putting them all in lowercase
#Corrected with Nina's code

def lower_case_column_names(my_df):
    my_df.columns=[i.lower() for i in my_df.columns]
    return

In [3]:
#Aggregate data into one Data Frame using Pandas.
#Flo : I import the necessary modules for the analysis and I import the data to be aggregated into one df (= dataframe).
#I print the head of the df and the shape to verify that the data is correctly aggregated.
#The number of columns in the aggregated dataset is larger than in the individual files, so I will have to merge some columns later
#Corrected with Nina's code

file_1 = pd.read_csv("Data/file1.csv")
lower_case_column_names(file_1)
file_2 = pd.read_csv("Data/file2.csv")
lower_case_column_names(file_2)
file_3 = pd.read_csv("Data/file3.csv")
lower_case_column_names(file_3)

file_3.rename(columns={'state': 'st'}, inplace=True)

customer_data = pd.concat([file_1, file_2, file_3], axis=0)
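
The `info()` output further down reports `12074 entries, 0 to 7069`, which suggests that the row labels of the individual files are repeated after concatenation. A minimal sketch of one way to avoid that, assuming small in-memory frames in place of the real CSV files (the sample values and the helper name `standardize_headers` are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the CSV files (hypothetical values, not the real data)
part_a = pd.DataFrame({"Customer": ["AA1", "BB2"], "ST": ["Oregon", "Cali"]})
part_b = pd.DataFrame({"Customer": ["CC3"], "State": ["Nevada"]})


def standardize_headers(df):
    """Return a copy of df with stripped, lower-cased column names."""
    return df.rename(columns=lambda name: name.strip().lower())


frames = [
    standardize_headers(part_a),
    # Align the differing header ("state" vs "st") before concatenating
    standardize_headers(part_b).rename(columns={"state": "st"}),
]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating
# each frame's own labels
combined = pd.concat(frames, axis=0, ignore_index=True)
print(combined)
```

Returning a new frame instead of modifying the argument in place is a design choice of this sketch; the in-place version used in the notebook works as well.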

In [4]:
#Deleting and rearranging columns – delete the column customer as it is only a unique identifier for each row of data
def drop_column(dataframe,column_name):
    dataframe.drop([column_name],axis=1, inplace=True)
    return

drop_column(customer_data,"customer")
print(customer_data.head())

           st gender             education customer lifetime value   income  \
0  Washington    NaN                Master                     NaN      0.0   
1     Arizona      F              Bachelor              697953.59%      0.0   
2      Nevada      F              Bachelor             1288743.17%  48767.0   
3  California      M              Bachelor              764586.18%      0.0   
4  Washington      M  High School or Below              536307.65%  36357.0   

   monthly premium auto number of open complaints     policy type  \
0                1000.0                    1/0/00   Personal Auto   
1                  94.0                    1/0/00   Personal Auto   
2                 108.0                    1/0/00   Personal Auto   
3                 106.0                    1/0/00  Corporate Auto   
4                  68.0                    1/0/00   Personal Auto   

   vehicle class  total claim amount  
0  Four-Door Car            2.704934  
1  Four-Door Car         1131.46

In [5]:
#Nina's code
# Correct data types (look for things that should be numbers but aren't)
## IDEA: save the original cols as ORIG name, then delete them once we're satisfied the number transform ran correctly
## fix a/b/c formatted entries by replacing with just b

customer_data.rename(columns={'customer lifetime value': 'ORIG customer lifetime value',
                  'number of open complaints': 'ORIG number of open complaints'}, inplace=True)

def convert_datelike_strings_to_int(x):
    '''
    Code assumes:
    a) the correct numeric value is in the middle position of the "/"-delimited string
    b) if the input is a string, it's in the date-like format
    Input: input to convert
    Output: integer (or original value if input is not a string)
    '''
    if type(x) == str:
        s = x.split("/")
        myval = int(s[1])
    else:
        myval = x
    return myval


customer_data.loc[:, "number of open complaints"] = list(map(convert_datelike_strings_to_int,
                                                  customer_data["ORIG number of open complaints"]))
customer_data["ORIG number of open complaints"].value_counts()
customer_data["number of open complaints"].value_counts()

## drop % sign and convert to float 
customer_data.loc[:, "customer lifetime value"] = list(map(lambda x: float(x[:-1]) if (type(x) == str) else x,
                                                  customer_data["ORIG customer lifetime value"]))

customer_data["ORIG customer lifetime value"].value_counts()
customer_data["customer lifetime value"].value_counts()

drop_column(customer_data,"ORIG customer lifetime value")
drop_column(customer_data,"ORIG number of open complaints")
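
The same two conversions can probably also be written with pandas string methods instead of `map`. The sketch below is an alternative under assumptions, not the code that produced the outputs in this notebook: the sample values are hypothetical, and the columns are assumed to mix strings in the raw format with values that are already numeric.

```python
import pandas as pd

# Hypothetical sample mixing raw strings, numbers and a missing value
raw = pd.DataFrame({
    "customer lifetime value": ["697953.59%", 4809.21696, None],
    "number of open complaints": ["1/0/00", "1/2/00", 1],
})

# Customer lifetime value: cast to text, strip a trailing "%", then parse.
# errors="coerce" turns anything unparseable (e.g. "None") into NaN.
clv_text = raw["customer lifetime value"].astype(str).str.rstrip("%")
clv = pd.to_numeric(clv_text, errors="coerce")

# Open complaints: keep the middle field of "a/b/c" strings; values that
# do not match the pattern fall back to their own text before parsing.
complaints_text = raw["number of open complaints"].astype(str)
middle_field = complaints_text.str.extract(r"^\d+/(\d+)/\d+$")[0]
complaints = pd.to_numeric(middle_field.fillna(complaints_text), errors="coerce")

print(clv)
print(complaints)
```

Casting to text first matters here: on a column of mixed types, the `.str` accessor returns missing values for the non-string entries, which would silently discard the values that are already numeric.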

In [6]:
#Working with data types – Check the data types of all the columns and fix the incorrect ones
#(for ex. customer lifetime value and number of complaints )

customer_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12074 entries, 0 to 7069
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   st                         9137 non-null   object 
 1   gender                     9015 non-null   object 
 2   education                  9137 non-null   object 
 3   income                     9137 non-null   float64
 4   monthly premium auto       9137 non-null   float64
 5   policy type                9137 non-null   object 
 6   vehicle class              9137 non-null   object 
 7   total claim amount         9137 non-null   float64
 8   number of open complaints  9137 non-null   float64
 9   customer lifetime value    9130 non-null   float64
dtypes: float64(5), object(5)
memory usage: 1.0+ MB


In [7]:
#Filtering data and Correcting typos – Filter the data in state and gender column to standardize the texts in those columns

print(customer_data["st"].value_counts())
print(customer_data["gender"].value_counts())

def correct_name(column_name,old_string,new_string):
    customer_data.loc[customer_data[column_name].str.contains(old_string) == True, column_name] = new_string
    
correct_name("st","Cali","California")
correct_name("st","WA","Washington")
correct_name("st","AZ","Arizona")

correct_name("gender","Male","M")
correct_name("gender","female","F")
correct_name("gender","Femal","F")

print(customer_data["st"].value_counts())
print(customer_data["gender"].value_counts())

California    3032
Oregon        2601
Arizona       1630
Nevada         882
Washington     768
Cali           120
AZ              74
WA              30
Name: st, dtype: int64
F         4560
M         4368
Male        40
female      30
Femal       17
Name: gender, dtype: int64
California    3152
Oregon        2601
Arizona       1704
Nevada         882
Washington     798
Name: st, dtype: int64
F    4607
M    4408
Name: gender, dtype: int64
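
`correct_name` matches substrings, so a pattern such as `"Cali"` also matches values that are already correct (`"California"`). A minimal sketch of an exact-match alternative using `Series.replace` with a dictionary; the sample frame is hypothetical and only reuses the variants listed in the `value_counts()` output above:

```python
import pandas as pd

# Hypothetical sample containing the variants seen in the output above
sample = pd.DataFrame({
    "st": ["Cali", "WA", "AZ", "Oregon", "California"],
    "gender": ["Male", "female", "Femal", "F", None],
})

state_fixes = {"Cali": "California", "WA": "Washington", "AZ": "Arizona"}
gender_fixes = {"Male": "M", "female": "F", "Femal": "F"}

# With a dict, replace() swaps whole values only; values that are not
# listed (including missing ones) pass through unchanged
sample["st"] = sample["st"].replace(state_fixes)
sample["gender"] = sample["gender"].replace(gender_fixes)

print(sample)
```

Keeping the corrections in dictionaries also makes it easy to review them in one place or to extend them if new variants show up.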


In [8]:
#Removing duplicates

customer_data.drop_duplicates(inplace=True)

print(customer_data)

              st gender             education   income  monthly premium auto  \
0     Washington    NaN                Master      0.0                1000.0   
1        Arizona      F              Bachelor      0.0                  94.0   
2         Nevada      F              Bachelor  48767.0                 108.0   
3     California      M              Bachelor      0.0                 106.0   
4     Washington      M  High School or Below  36357.0                  68.0   
...          ...    ...                   ...      ...                   ...   
7065  California      M              Bachelor  71941.0                  73.0   
7066  California      F               College  21604.0                  79.0   
7067  California      M              Bachelor      0.0                  85.0   
7068  California      M               College  21941.0                  96.0   
7069  California      M               College      0.0                  77.0   

         policy type  vehicle class  to
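
Before dropping duplicates it can be useful to count them, and to renumber the rows afterwards. A small sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical frame in which the second row repeats the first
rows = pd.DataFrame({
    "st": ["Oregon", "Oregon", "Nevada"],
    "income": [100.0, 100.0, 200.0],
})

# duplicated() flags rows that are identical to an earlier row
duplicate_count = int(rows.duplicated().sum())
print("duplicate rows:", duplicate_count)

# Drop them, then renumber so that the row labels have no gaps
deduplicated = rows.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```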

In [9]:
#Replacing null values – Replace missing values with means of the column (for numerical columns)
#To improve with list comprehension

nulls_df = pd.DataFrame(round(customer_data.isna().sum()/len(customer_data),4)*100)
nulls_df = nulls_df.reset_index()
nulls_df.columns = ['header_name', 'percent_nulls']
nulls_df

result = list(customer_data.select_dtypes(include='number').columns.values)
#print(result)

for column_name in result:
    column_mean = np.mean(customer_data[column_name])
    customer_data[column_name] = customer_data[column_name].fillna(column_mean)
    
#newlist = [expression for item in iterable if condition == True]
#[customer_data[column_name].fillna(np.mean(customer_data[column_name])) for column_name in list(customer_data.select_dtypes(include='number').columns.values)]

customer_data.isna().sum()

st                             1
gender                       123
education                      1
income                         0
monthly premium auto           0
policy type                    1
vehicle class                  1
total claim amount             0
number of open complaints      0
customer lifetime value        0
dtype: int64
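
The loop over the numeric columns can likely be condensed by passing a Series of column means to `fillna`. A minimal sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in numeric and text columns
sample = pd.DataFrame({
    "income": [100.0, np.nan, 300.0],
    "monthly premium auto": [50.0, 70.0, np.nan],
    "gender": ["F", None, "M"],
})

# Means of the numeric columns only; NaN values are skipped by default
column_means = sample.mean(numeric_only=True)

# Given a Series, fillna fills each column with the value stored under
# that column's label; columns without an entry (gender) stay untouched
filled = sample.fillna(column_means)

print(filled)
```

As in the loop above, the text columns keep their missing values, which matches the remaining nulls shown for `st`, `gender`, `education`, `policy type` and `vehicle class`.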

In [16]:
#Instead of keeping zero values in the income column, I replace them with the mean to avoid outliers

customer_data["income"] = customer_data["income"].replace(0.0,int(np.mean(customer_data["income"])))
print(customer_data["income"])

# I could also choose either to replace them with the median (np.median) or to drop them with customer_data = customer_data.drop(customer_data[customer_data['income'] == 0].index)

0       37823.323268
1       37823.323268
2       48767.000000
3       37823.323268
4       36357.000000
            ...     
7065    71941.000000
7066    21604.000000
7067    37823.323268
7068    21941.000000
7069    37823.323268
Name: income, Length: 8876, dtype: float64
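
Note that the mean used above is computed over all rows, including the zero-income rows that are being replaced, so it is lower than the mean of the reported incomes. A sketch of one possible variant, assuming that a zero really stands for "not reported" (the sample values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; 0.0 is assumed to mean "not reported"
income = pd.Series([0.0, 40000.0, 60000.0, 0.0], name="income")

# Turn the zeros into NaN first so that they do not pull the mean down
income_with_gaps = income.replace(0.0, np.nan)
nonzero_mean = income_with_gaps.mean()  # NaN values are skipped

imputed = income_with_gaps.fillna(nonzero_mean)

print("mean including zeros:", income.mean())
print("mean of reported incomes:", nonzero_mean)
print(imputed)
```

Whether this is preferable depends on what a zero means in the data; if zero is a genuine income (for example for unemployed customers), replacing it at all would distort the column.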


### Activity 2

- Bucketing the data – write a function to map the "State" column to zones: California as West Region, Oregon as North West, Washington as East, and Arizona and Nevada as Central.
- Standardizing the data – Use string functions to standardize the text data (lower case)



In [17]:
#Bucketing the data - Write a function to map the "State" column to zones:
#California as West Region, Oregon as North West, Washington as East, and Arizona and Nevada as Central

#Creating the dictionary to map the state with the appropriate zone

zones_dict = {"California":"West Region","Oregon":"North West","Washington":"East","Arizona":"Central","Nevada":"Central"}

In [18]:
#Creating a new column with the dictionary values

customer_data['zone']= customer_data['st'].map(zones_dict)
customer_data.head()

| | st | gender | education | income | monthly premium auto | policy type | vehicle class | total claim amount | number of open complaints | customer lifetime value | zone |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Washington | NaN | Master | 37823.323268 | 1000.0 | Personal Auto | Four-Door Car | 2.704934 | 0.0 | 185590.2 | East |
| 1 | Arizona | F | Bachelor | 37823.323268 | 94.0 | Personal Auto | Four-Door Car | 1131.464935 | 0.0 | 697953.6 | Central |
| 2 | Nevada | F | Bachelor | 48767.0 | 108.0 | Personal Auto | Two-Door Car | 566.472247 | 0.0 | 1288743.0 | Central |
| 3 | California | M | Bachelor | 37823.323268 | 106.0 | Corporate Auto | SUV | 529.881344 | 0.0 | 764586.2 | West Region |
| 4 | Washington | M | High School or Below | 36357.0 | 68.0 | Personal Auto | Four-Door Car | 17.269323 | 0.0 | 536307.7 | East |
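
The activity asks for a function, while the cell above maps the dictionary directly. A minimal sketch of a function-based variant follows; the name `state_to_zone`, the `default` parameter and the `"Unknown"` label are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

ZONES = {
    "California": "West Region",
    "Oregon": "North West",
    "Washington": "East",
    "Arizona": "Central",
    "Nevada": "Central",
}


def state_to_zone(state, default="Unknown"):
    """Map one state name to its zone; default covers missing or unlisted states."""
    return ZONES.get(state, default)


# Hypothetical sample with a missing and an unlisted state
states = pd.Series(["California", "Nevada", None, "Texas"])
zones = states.map(state_to_zone)

print(zones.tolist())
```

Mapping with the dictionary directly, as the notebook does, leaves missing or unlisted states as `NaN` instead; which behaviour is preferable depends on how the zone column is used later.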


In [19]:
#Standardizing the data – Use string functions to standardize the text data (lower case)

result_lowercase = list(customer_data.select_dtypes(include='object').columns.values)
print(result_lowercase)

#for column_name in result_lowercase:
    #customer_data[column_name] = customer_data[column_name].str.lower()

def lowercase_values(dataframe):
    for column_name in list(dataframe.select_dtypes(include='object').columns.values):
        dataframe[column_name] = dataframe[column_name].str.lower()

lowercase_values(customer_data)
customer_data.head()

['st', 'gender', 'education', 'policy type', 'vehicle class', 'zone']


| | st | gender | education | income | monthly premium auto | policy type | vehicle class | total claim amount | number of open complaints | customer lifetime value | zone |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | washington | NaN | master | 37823.323268 | 1000.0 | personal auto | four-door car | 2.704934 | 0.0 | 185590.2 | east |
| 1 | arizona | f | bachelor | 37823.323268 | 94.0 | personal auto | four-door car | 1131.464935 | 0.0 | 697953.6 | central |
| 2 | nevada | f | bachelor | 48767.0 | 108.0 | personal auto | two-door car | 566.472247 | 0.0 | 1288743.0 | central |
| 3 | california | m | bachelor | 37823.323268 | 106.0 | corporate auto | suv | 529.881344 | 0.0 | 764586.2 | west region |
| 4 | washington | m | high school or below | 36357.0 | 68.0 | personal auto | four-door car | 17.269323 | 0.0 | 536307.7 | east |
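
Lower-casing does not remove stray spaces, which can also make otherwise equal labels differ. The sketch below combines `str.strip()` with `str.lower()`; the whitespace in the sample is an assumption, since the outputs above do not show whether the real data contains any, and the text columns are listed explicitly instead of being detected by dtype.

```python
import pandas as pd

# Hypothetical sample with padded labels and a missing value
sample = pd.DataFrame({
    "education": [" Bachelor", "High School or Below ", None],
    "income": [0.0, 36357.0, 48767.0],
})

text_columns = ["education"]  # listed by hand in this sketch

for column in text_columns:
    # The .str methods skip missing values, so they remain missing
    sample[column] = sample[column].str.strip().str.lower()

print(sample)
```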


In [20]:
#Exporting the data in a new csv file

customer_data.to_csv("Data/round_2_cleaned.csv",index=False)
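
The `index=False` argument matters when the file is read back: without it, the row labels are written as an extra column that reappears as `Unnamed: 0`, which is probably where the `unnamed: 0` column dropped further down comes from. A small round trip through an in-memory buffer, using a hypothetical two-row frame, illustrates the difference:

```python
import io

import pandas as pd

# Hypothetical two-row frame
frame = pd.DataFrame({"st": ["oregon", "nevada"], "income": [49078.0, 22139.0]})

# Default (index=True): the row labels are written as a first column
# with an empty header
with_index = pd.read_csv(io.StringIO(frame.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'st', 'income']
print(list(without_index.columns))  # ['st', 'income']
```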

In [14]:
#Replacing the dataset with the new data

customer_data = pd.read_csv("Data/round_2_cleaned.csv")
print(customer_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8876 entries, 0 to 8875
Data columns (total 13 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   st                              8875 non-null   object 
 1   gender                          8753 non-null   object 
 2   education                       8875 non-null   object 
 3   ORIG customer lifetime value    2041 non-null   object 
 4   income                          8876 non-null   float64
 5   monthly premium auto            8876 non-null   float64
 6   ORIG number of open complaints  2048 non-null   object 
 7   policy type                     8875 non-null   object 
 8   vehicle class                   8875 non-null   object 
 9   total claim amount              8876 non-null   float64
 10  number of open complaints       8876 non-null   float64
 11  customer lifetime value         8876 non-null   float64
 12  zone                            88

In [24]:
#Same Steps with the round 2 dataset

customer_data = pd.read_csv("Data/Data_Marketing_Customer_Analysis_Round2.csv")
lower_case_column_names(customer_data)
customer_data.rename(columns={'state': 'st'}, inplace=True)
drop_column(customer_data,"unnamed: 0")
drop_column(customer_data,"customer")
customer_data['zone']= customer_data['st'].map(zones_dict)
lowercase_values(customer_data)
#customer_data["income"] = customer_data["income"].replace(0.0,int(np.mean(customer_data["income"])))
customer_data = customer_data.drop(customer_data[customer_data['income'] == 0].index)
customer_data.drop_duplicates(inplace=True)

print(customer_data.head())

customer_data.info()
customer_data.to_csv("Data/round_2_cleaned.csv",index=False)

           st  customer lifetime value response  coverage  \
0     arizona              4809.216960       no     basic   
2  washington             14947.917300       no     basic   
3      oregon             22332.439460      yes  extended   
4      oregon              9025.067525       no   premium   
5         NaN              4745.181764      NaN     basic   

              education effective to date employmentstatus gender  income  \
0               college           2/18/11         employed      m   48029   
2              bachelor           2/10/11         employed      m   22139   
3               college           1/11/11         employed      m   49078   
4              bachelor           1/17/11    medical leave      f   23675   
5  high school or below           2/14/11         employed      m   50549   

  location code  ... number of policies     policy type        policy  \
0      suburban  ...                  9  corporate auto  corporate l3   
2      suburban  ...    