# Lab | Cleaning categorical data

For this lab, we will be using the dataset from the Customer Analysis Business Case. The dataset can be found in the `files_for_lab` folder. In this lab we will explore categorical data.

## Data Analysis Process
#### Remember the process:

- Case Study
- **Get data**
- **Cleaning/Wrangling/EDA**
- Processing Data
- Modeling
- Validation
- Reporting

### Instructions
***
#### 1. Import the necessary libraries, load the data, and start a new notebook.
Using the same data as the previous lab: we_fn_use_c_marketing_customer_value_analysis.csv

In [34]:
import pandas as pd
import numpy as np
import snakecase
import stringcase
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [35]:
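# Load the raw data
customer_df = pd.read_csv('we_fn_use_c_marketing_customer_value_analysis.csv')
# Standardize column names: drop spaces, then convert to snake_case
customer_df.columns = customer_df.columns.str.replace(' ', '').map(stringcase.snakecase)
# Parse the date column; values that cannot be parsed become NaT
customer_df['effective_to_date'] = pd.to_datetime(customer_df['effective_to_date'], errors='coerce')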
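If the `stringcase` package is not available, the column renaming can be approximated with the standard library. The following is a minimal sketch on a hypothetical toy frame; `to_snake` is an illustrative helper, not part of the lab. Note that it turns spaces into underscores, so it would produce `number_of_policies` rather than the `numberof_policies` seen in this notebook.

```python
import re

import pandas as pd


def to_snake(name: str) -> str:
    """Hypothetical helper: convert a column label to snake_case."""
    # Replace runs of whitespace or hyphens with a single underscore.
    name = re.sub(r"[\s\-]+", "_", name.strip())
    # Insert an underscore at lower-to-upper boundaries (e.g. "EmploymentStatus").
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()


# Toy column labels chosen for illustration only.
toy = pd.DataFrame(columns=["Customer", "Number of Policies", "EmploymentStatus"])
toy.columns = toy.columns.map(to_snake)
print(list(toy.columns))
```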

***
#### 2. Find all of the categorical data. Save it in a categorical_df variable.

In [36]:
categorical_df = customer_df.select_dtypes('object').copy()  # copy so later column edits do not act on a slice of customer_df
categorical_df

Unnamed: 0,customer,state,response,coverage,education,employment_status,gender,location_code,marital_status,policy_type,policy,renew_offer_type,sales_channel,vehicle_class,vehicle_size
0,BU79786,Washington,No,Basic,Bachelor,Employed,F,Suburban,Married,Corporate Auto,Corporate L3,Offer1,Agent,Two-Door Car,Medsize
1,QZ44356,Arizona,No,Extended,Bachelor,Unemployed,F,Suburban,Single,Personal Auto,Personal L3,Offer3,Agent,Four-Door Car,Medsize
2,AI49188,Nevada,No,Premium,Bachelor,Employed,F,Suburban,Married,Personal Auto,Personal L3,Offer1,Agent,Two-Door Car,Medsize
3,WW63253,California,No,Basic,Bachelor,Unemployed,M,Suburban,Married,Corporate Auto,Corporate L2,Offer1,Call Center,SUV,Medsize
4,HB64268,Washington,No,Basic,Bachelor,Employed,M,Rural,Single,Personal Auto,Personal L1,Offer1,Agent,Four-Door Car,Medsize
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9129,LA72316,California,No,Basic,Bachelor,Employed,M,Urban,Married,Personal Auto,Personal L1,Offer2,Web,Four-Door Car,Medsize
9130,PK87824,California,Yes,Extended,College,Employed,F,Suburban,Divorced,Corporate Auto,Corporate L3,Offer1,Branch,Four-Door Car,Medsize
9131,TD14365,California,No,Extended,Bachelor,Unemployed,M,Suburban,Single,Corporate Auto,Corporate L2,Offer1,Branch,Four-Door Car,Medsize
9132,UP19263,California,No,Extended,College,Employed,M,Suburban,Married,Personal Auto,Personal L2,Offer3,Branch,Four-Door Car,Large
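As a hedged alternative to selecting by `'object'`, the non-numeric, non-datetime columns can be selected by exclusion. The sketch below uses a hypothetical toy frame (the income values and dates are made up) and takes a `.copy()` so that later column assignments do not act on a slice of the original frame.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative, not taken from the lab file.
toy = pd.DataFrame({
    "state": ["Washington", "Arizona", "Nevada"],
    "gender": ["F", "F", "M"],
    "income": [56274, 0, 48767],
    "effective_to_date": pd.to_datetime(["2011-02-24", "2011-01-31", "2011-02-19"]),
})

# Keep every column that is neither numeric nor datetime, as an independent copy.
toy_categorical = toy.select_dtypes(exclude=["number", "datetime"]).copy()
print(list(toy_categorical.columns))
```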


***
#### 3. Check for NaN values, decide what to do with them, and do it now.

In [37]:
(customer_df.isna().sum()/len(customer_df)).sort_values(ascending=False)

customer                         0.0
state                            0.0
vehicle_class                    0.0
total_claim_amount               0.0
sales_channel                    0.0
renew_offer_type                 0.0
policy                           0.0
policy_type                      0.0
numberof_policies                0.0
numberof_open_complaints         0.0
months_since_policy_inception    0.0
months_since_last_claim          0.0
monthly_premium_auto             0.0
marital_status                   0.0
location_code                    0.0
income                           0.0
gender                           0.0
employment_status                0.0
effective_to_date                0.0
education                        0.0
coverage                         0.0
response                         0.0
customer_lifetime_value          0.0
vehicle_size                     0.0
dtype: float64
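The data above has no missing values, so nothing needs to be done here. For reference, this is a small sketch of what one option could look like if a categorical column did contain NaNs: filling with the most frequent level. The toy frame is hypothetical, and mode imputation is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical toy frame with one missing state.
toy = pd.DataFrame({
    "state": ["California", None, "Nevada", "California"],
    "gender": ["F", "M", "M", "F"],
})

# isna().mean() is the share of missing values per column,
# equivalent to isna().sum() / len(toy).
nan_share = toy.isna().mean().sort_values(ascending=False)
print(nan_share)

# Fill missing categories with the most frequent level of the column.
most_common_state = toy["state"].mode()[0]
toy["state"] = toy["state"].fillna(most_common_state)
```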

***
#### 4. Check all unique values of the columns.

In [38]:
categorical_df.nunique()

customer             9134
state                   5
response                2
coverage                3
education               5
employment_status       5
gender                  2
location_code           3
marital_status          3
policy_type             3
policy                  9
renew_offer_type        4
sales_channel           4
vehicle_class           6
vehicle_size            3
dtype: int64
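`nunique()` only reports how many levels each column has. To inspect the levels themselves, one possible approach is to collect the unique values per column, as in this sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame reusing a few level names that appear in the lab data.
toy = pd.DataFrame({
    "response": ["No", "Yes", "No", "No"],
    "coverage": ["Basic", "Extended", "Premium", "Basic"],
})

# Map each column name to the sorted list of its distinct values.
levels = {col: sorted(toy[col].unique()) for col in toy.columns}
for col, values in levels.items():
    print(f"{col}: {values}")
```

For the real data it is probably worth skipping `customer`, which has one distinct value per row.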

***
#### 5. Check dtypes. Do they all make sense as categorical data?

In [39]:
categorical_df.dtypes

customer             object
state                object
response             object
coverage             object
education            object
employment_status    object
gender               object
location_code        object
marital_status       object
policy_type          object
policy               object
renew_offer_type     object
sales_channel        object
vehicle_class        object
vehicle_size         object
dtype: object
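All columns are stored as `object`, but not all of them behave like categories: `customer` has as many distinct values as there are rows, so it acts as an identifier. A rough way to flag such columns, sketched on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame; customer IDs are the first four shown in the lab output.
toy = pd.DataFrame({
    "customer": ["BU79786", "QZ44356", "AI49188", "WW63253"],
    "state": ["Washington", "Arizona", "Nevada", "Washington"],
})

# A column whose number of distinct values equals the row count is identifier-like.
id_like = [col for col in toy.columns if toy[col].nunique() == len(toy)]
print(id_like)
```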

***
#### 6. Does any column contain both alpha and numeric data? Decide how to clean it and do it now.

In [40]:
categorical_df.columns

Index(['customer', 'state', 'response', 'coverage', 'education',
       'employment_status', 'gender', 'location_code', 'marital_status',
       'policy_type', 'policy', 'renew_offer_type', 'sales_channel',
       'vehicle_class', 'vehicle_size'],
      dtype='object')
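Rather than scanning the columns by eye, mixed alpha and numeric columns can be detected with regular expressions. This is a minimal sketch; `has_letters_and_digits` is a hypothetical helper, and the toy rows reuse values shown earlier in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer": ["BU79786", "QZ44356"],
    "policy": ["Corporate L3", "Personal L1"],
    "renew_offer_type": ["Offer1", "Offer3"],
    "state": ["Washington", "Arizona"],
})


def has_letters_and_digits(series: pd.Series) -> bool:
    """Return True if any value contains both a letter and a digit."""
    has_alpha = series.str.contains(r"[A-Za-z]", regex=True)
    has_digit = series.str.contains(r"\d", regex=True)
    return bool((has_alpha & has_digit).any())


mixed = [col for col in toy.columns if has_letters_and_digits(toy[col])]
print(mixed)
```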

In [41]:
# Split the customer ID into its two-letter prefix and its numeric part
categorical_df['alpha_customer'] = categorical_df['customer'].str[:2]
categorical_df['customer'] = categorical_df['customer'].str[2:]

In [42]:
categorical_df = categorical_df[['alpha_customer','customer', 'state', 'response', 'coverage', 'education',
       'employment_status', 'gender', 'location_code', 'marital_status',
       'policy_type', 'policy', 'renew_offer_type', 'sales_channel',
       'vehicle_class', 'vehicle_size']]
categorical_df

Unnamed: 0,alpha_customer,customer,state,response,coverage,education,employment_status,gender,location_code,marital_status,policy_type,policy,renew_offer_type,sales_channel,vehicle_class,vehicle_size
0,BU,79786,Washington,No,Basic,Bachelor,Employed,F,Suburban,Married,Corporate Auto,Corporate L3,Offer1,Agent,Two-Door Car,Medsize
1,QZ,44356,Arizona,No,Extended,Bachelor,Unemployed,F,Suburban,Single,Personal Auto,Personal L3,Offer3,Agent,Four-Door Car,Medsize
2,AI,49188,Nevada,No,Premium,Bachelor,Employed,F,Suburban,Married,Personal Auto,Personal L3,Offer1,Agent,Two-Door Car,Medsize
3,WW,63253,California,No,Basic,Bachelor,Unemployed,M,Suburban,Married,Corporate Auto,Corporate L2,Offer1,Call Center,SUV,Medsize
4,HB,64268,Washington,No,Basic,Bachelor,Employed,M,Rural,Single,Personal Auto,Personal L1,Offer1,Agent,Four-Door Car,Medsize
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9129,LA,72316,California,No,Basic,Bachelor,Employed,M,Urban,Married,Personal Auto,Personal L1,Offer2,Web,Four-Door Car,Medsize
9130,PK,87824,California,Yes,Extended,College,Employed,F,Suburban,Divorced,Corporate Auto,Corporate L3,Offer1,Branch,Four-Door Car,Medsize
9131,TD,14365,California,No,Extended,Bachelor,Unemployed,M,Suburban,Single,Corporate Auto,Corporate L2,Offer1,Branch,Four-Door Car,Medsize
9132,UP,19263,California,No,Extended,College,Employed,M,Suburban,Married,Personal Auto,Personal L2,Offer3,Branch,Four-Door Car,Large


In [43]:
categorical_df['policy'].value_counts()

Personal L3     3426
Personal L2     2122
Personal L1     1240
Corporate L3    1014
Corporate L2     595
Corporate L1     359
Special L2       164
Special L3       148
Special L1        66
Name: policy, dtype: int64

In [44]:
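categorical_df['policy'].str[:-1]  # policy text without the level digit (evaluated, but not displayed)
categorical_df['policy'].str[-1:]  # level digit; only this last expression is displayed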

0       3
1       3
2       3
3       2
4       1
       ..
9129    1
9130    3
9131    2
9132    2
9133    3
Name: policy, Length: 9134, dtype: object
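Positional slicing works here because every `policy` value ends in exactly one digit. A hedged alternative is `str.extract` with named groups, which does not depend on the number of digits. In this sketch the group name `policy_group` is an assumption, and unlike the cells in this notebook it drops the trailing `L` and converts the level to an integer.

```python
import pandas as pd

# Toy series reusing policy labels that appear in the value counts above.
policy = pd.Series(["Corporate L3", "Personal L3", "Special L1"], name="policy")

# Capture the text before " L" and the digits after it.
parts = policy.str.extract(r"^(?P<policy_group>.+) L(?P<policy_n>\d+)$")
parts["policy_n"] = parts["policy_n"].astype(int)
print(parts)
```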

In [45]:
categorical_df['policy_n']=categorical_df['policy'].str[-1:]

In [46]:
categorical_df['policy']=categorical_df['policy'].str[:-1]
categorical_df

Unnamed: 0,alpha_customer,customer,state,response,coverage,education,employment_status,gender,location_code,marital_status,policy_type,policy,renew_offer_type,sales_channel,vehicle_class,vehicle_size,policy_n
0,BU,79786,Washington,No,Basic,Bachelor,Employed,F,Suburban,Married,Corporate Auto,Corporate L,Offer1,Agent,Two-Door Car,Medsize,3
1,QZ,44356,Arizona,No,Extended,Bachelor,Unemployed,F,Suburban,Single,Personal Auto,Personal L,Offer3,Agent,Four-Door Car,Medsize,3
2,AI,49188,Nevada,No,Premium,Bachelor,Employed,F,Suburban,Married,Personal Auto,Personal L,Offer1,Agent,Two-Door Car,Medsize,3
3,WW,63253,California,No,Basic,Bachelor,Unemployed,M,Suburban,Married,Corporate Auto,Corporate L,Offer1,Call Center,SUV,Medsize,2
4,HB,64268,Washington,No,Basic,Bachelor,Employed,M,Rural,Single,Personal Auto,Personal L,Offer1,Agent,Four-Door Car,Medsize,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9129,LA,72316,California,No,Basic,Bachelor,Employed,M,Urban,Married,Personal Auto,Personal L,Offer2,Web,Four-Door Car,Medsize,1
9130,PK,87824,California,Yes,Extended,College,Employed,F,Suburban,Divorced,Corporate Auto,Corporate L,Offer1,Branch,Four-Door Car,Medsize,3
9131,TD,14365,California,No,Extended,Bachelor,Unemployed,M,Suburban,Single,Corporate Auto,Corporate L,Offer1,Branch,Four-Door Car,Medsize,2
9132,UP,19263,California,No,Extended,College,Employed,M,Suburban,Married,Personal Auto,Personal L,Offer3,Branch,Four-Door Car,Large,2


In [47]:
categorical_df.columns

Index(['alpha_customer', 'customer', 'state', 'response', 'coverage',
       'education', 'employment_status', 'gender', 'location_code',
       'marital_status', 'policy_type', 'policy', 'renew_offer_type',
       'sales_channel', 'vehicle_class', 'vehicle_size', 'policy_n'],
      dtype='object')

In [48]:
categorical_df = categorical_df[['alpha_customer', 'customer', 'state', 'response', 'coverage',
       'education', 'employment_status', 'gender', 'location_code',
       'marital_status', 'policy_type', 'policy', 'policy_n','renew_offer_type',
       'sales_channel', 'vehicle_class', 'vehicle_size']]
categorical_df

Unnamed: 0,alpha_customer,customer,state,response,coverage,education,employment_status,gender,location_code,marital_status,policy_type,policy,policy_n,renew_offer_type,sales_channel,vehicle_class,vehicle_size
0,BU,79786,Washington,No,Basic,Bachelor,Employed,F,Suburban,Married,Corporate Auto,Corporate L,3,Offer1,Agent,Two-Door Car,Medsize
1,QZ,44356,Arizona,No,Extended,Bachelor,Unemployed,F,Suburban,Single,Personal Auto,Personal L,3,Offer3,Agent,Four-Door Car,Medsize
2,AI,49188,Nevada,No,Premium,Bachelor,Employed,F,Suburban,Married,Personal Auto,Personal L,3,Offer1,Agent,Two-Door Car,Medsize
3,WW,63253,California,No,Basic,Bachelor,Unemployed,M,Suburban,Married,Corporate Auto,Corporate L,2,Offer1,Call Center,SUV,Medsize
4,HB,64268,Washington,No,Basic,Bachelor,Employed,M,Rural,Single,Personal Auto,Personal L,1,Offer1,Agent,Four-Door Car,Medsize
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9129,LA,72316,California,No,Basic,Bachelor,Employed,M,Urban,Married,Personal Auto,Personal L,1,Offer2,Web,Four-Door Car,Medsize
9130,PK,87824,California,Yes,Extended,College,Employed,F,Suburban,Divorced,Corporate Auto,Corporate L,3,Offer1,Branch,Four-Door Car,Medsize
9131,TD,14365,California,No,Extended,Bachelor,Unemployed,M,Suburban,Single,Corporate Auto,Corporate L,2,Offer1,Branch,Four-Door Car,Medsize
9132,UP,19263,California,No,Extended,College,Employed,M,Suburban,Married,Personal Auto,Personal L,2,Offer3,Branch,Four-Door Car,Large


In [49]:
categorical_df['renew_offer_type'].value_counts()

Offer1    3752
Offer2    2926
Offer3    1432
Offer4    1024
Name: renew_offer_type, dtype: int64

In [51]:
categorical_df['renew_offer_type']=categorical_df['renew_offer_type'].str[-1:]
categorical_df

Unnamed: 0,alpha_customer,customer,state,response,coverage,education,employment_status,gender,location_code,marital_status,policy_type,policy,policy_n,renew_offer_type,sales_channel,vehicle_class,vehicle_size
0,BU,79786,Washington,No,Basic,Bachelor,Employed,F,Suburban,Married,Corporate Auto,Corporate L,3,1,Agent,Two-Door Car,Medsize
1,QZ,44356,Arizona,No,Extended,Bachelor,Unemployed,F,Suburban,Single,Personal Auto,Personal L,3,3,Agent,Four-Door Car,Medsize
2,AI,49188,Nevada,No,Premium,Bachelor,Employed,F,Suburban,Married,Personal Auto,Personal L,3,1,Agent,Two-Door Car,Medsize
3,WW,63253,California,No,Basic,Bachelor,Unemployed,M,Suburban,Married,Corporate Auto,Corporate L,2,1,Call Center,SUV,Medsize
4,HB,64268,Washington,No,Basic,Bachelor,Employed,M,Rural,Single,Personal Auto,Personal L,1,1,Agent,Four-Door Car,Medsize
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9129,LA,72316,California,No,Basic,Bachelor,Employed,M,Urban,Married,Personal Auto,Personal L,1,2,Web,Four-Door Car,Medsize
9130,PK,87824,California,Yes,Extended,College,Employed,F,Suburban,Divorced,Corporate Auto,Corporate L,3,1,Branch,Four-Door Car,Medsize
9131,TD,14365,California,No,Extended,Bachelor,Unemployed,M,Suburban,Single,Corporate Auto,Corporate L,2,1,Branch,Four-Door Car,Medsize
9132,UP,19263,California,No,Extended,College,Employed,M,Suburban,Married,Personal Auto,Personal L,2,3,Branch,Four-Door Car,Large


***
#### 7. Would you choose to do anything else to clean or wrangle the categorical data?  Comment your decisions and do it now.

I already did everything in exercise 6.

***
#### 8. Compare policy_type and policy. What information is contained in these columns? Can you identify what is important?

In [52]:
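# With the level digit moved to policy_n, policy ('Corporate L', 'Personal L', ...) repeats policy_type, so drop it
categorical_df.drop(['policy'],axis=1,inplace=True)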
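One way to check that `policy` adds nothing beyond its level number is to cross-tabulate it against `policy_type` before any cleaning. The sketch below uses the first five rows of the original output as a toy frame; on the full data the same check would need to be run against the uncleaned `policy` column.

```python
import pandas as pd

# First five rows of the original (uncleaned) policy_type and policy columns.
toy = pd.DataFrame({
    "policy_type": ["Corporate Auto", "Personal Auto", "Personal Auto",
                    "Corporate Auto", "Personal Auto"],
    "policy": ["Corporate L3", "Personal L3", "Personal L3",
               "Corporate L2", "Personal L1"],
})

# Counts of each policy / policy_type combination.
table = pd.crosstab(toy["policy"], toy["policy_type"])
print(table)

# If every policy maps to exactly one policy_type, the text part of policy is redundant.
types_per_policy = toy.groupby("policy")["policy_type"].nunique()
policy_implies_type = bool((types_per_policy == 1).all())
print(policy_implies_type)
```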

***
#### 9. Check the number of unique values in each column. Can they be combined in any way to ease encoding? Comment your thoughts and make those changes.

In [53]:
categorical_df.nunique()

alpha_customer        677
customer             8688
state                   5
response                2
coverage                3
education               5
employment_status       5
gender                  2
location_code           3
marital_status          3
policy_type             3
policy_n                3
renew_offer_type        4
sales_channel           4
vehicle_class           6
vehicle_size            3
dtype: int64
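One possible way to combine levels is to merge infrequent ones into a single `Other` bucket, which reduces the number of dummy columns after encoding. This is only a sketch: the 10% threshold, the `Other` label, and the two rare level names are assumptions for illustration.

```python
import pandas as pd

# Toy series; the two "Hypothetical Rare" levels are invented for illustration.
vehicle_class = pd.Series(
    ["Four-Door Car"] * 5 + ["Two-Door Car"] * 3 + ["SUV"] * 2
    + ["Hypothetical Rare A", "Hypothetical Rare B"],
    name="vehicle_class",
)

# Relative frequency of each level.
share = vehicle_class.value_counts(normalize=True)

# Levels below the (assumed) 10% threshold are merged into "Other".
rare_levels = share[share < 0.10].index
combined = vehicle_class.where(~vehicle_class.isin(rare_levels), "Other")
print(combined.value_counts())
```

High-cardinality columns such as `alpha_customer` and `customer` are identifier fragments, so dropping them before encoding may be more sensible than grouping them.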

***
#### 10. Save the cleaned categorical dataframe as categorical.csv. You will use this file again this week.
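A minimal sketch of the save step, written to a temporary folder so it can be run anywhere. The toy frame and the ID `04435` are hypothetical; the point is that digit-only columns such as `customer` and `policy_n` are parsed as integers when the file is read back, unless `dtype=str` is passed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame; "04435" is an invented ID with a leading zero.
toy = pd.DataFrame({
    "alpha_customer": ["BU", "QZ"],
    "customer": ["79786", "04435"],
    "policy_n": ["3", "1"],
})

out_path = Path(tempfile.mkdtemp()) / "categorical.csv"
toy.to_csv(out_path, index=False)  # index=False avoids an extra "Unnamed: 0" column on reload

default_read = pd.read_csv(out_path)        # digit-only columns become integers
as_text = pd.read_csv(out_path, dtype=str)  # everything stays text
print(default_read.dtypes)
print(as_text["customer"].tolist())
```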