Draft an email to the client identifying the data quality issues and strategies to mitigate these issues. Refer to ‘Data Quality Framework Table’ and resources below for criteria and dimensions which you should consider.

### Data Quality Summary

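1. Missing data in the last_name, DOB, job_title, job_industry_category, default, and tenure columns.
2. The default column contains unintelligible values (script fragments, mis-encoded characters, stray numbers) with no documented meaning.
3. The gender column has 6 distinct values (Female, Male, U, F, M, Femal). The variants F, M, and Femal should be standardised to Female and Male, and U should be treated as unknown.
4. One customer has a DOB of 1843-12-21, which is not plausible.
5. The columns in this dataset are inconsistent with those in the new customers dataset.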
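Point 5 can be checked mechanically by comparing column names. The sketch below is illustrative only: `demo_cols` is the column list of this dataset, while `new_cols` is a hypothetical stand-in, since the new customers dataset is not loaded in this notebook.

```python
import pandas as pd

# Columns of the customer demographic table used in this notebook.
demo_cols = pd.Index([
    'customer_id', 'first_name', 'last_name', 'gender',
    'past_3_years_bike_related_purchases', 'DOB', 'job_title',
    'job_industry_category', 'wealth_segment', 'deceased_indicator',
    'default', 'owns_car', 'tenure',
])

# HYPOTHETICAL column list for the new customers dataset (an assumption,
# used only to show the comparison; replace with the real columns).
new_cols = pd.Index(['first_name', 'last_name', 'gender', 'DOB', 'tenure', 'address'])

# Index.difference returns the names present in one index but not the other.
only_in_demo = demo_cols.difference(new_cols).tolist()
only_in_new = new_cols.difference(demo_cols).tolist()

print("Only in demographic table:", only_in_demo)
print("Only in new customers table:", only_in_new)
```

With the real datasets loaded, the same comparison would run on `df.columns` and the new customers frame's columns.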

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import datetime

%matplotlib inline
# sns.set() resets the style to its default, so pass the style and font scale together
sns.set(style='dark', font_scale=1.2)

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)

np.random.seed(0)
np.set_printoptions(suppress=True)

In [2]:
df = pd.read_csv("custdemo.csv",parse_dates=['DOB'])

In [3]:
df

,customer_id,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,default,owns_car,tenure
0,1,Laraine,Medendorp,F,93,1953-12-10,Executive Secretary,Health,Mass Customer,N,"""'",Yes,11.0
1,2,Eli,Bockman,Male,81,1980-12-16,Administrative Officer,Financial Services,Mass Customer,N,<script>alert('hi')</script>,Yes,16.0
2,3,Arlin,Dearle,Male,61,1954-01-20,Recruiting Manager,Property,Mass Customer,N,1-Feb,Yes,15.0
3,4,Talbot,,Male,33,1961-03-10,,IT,Mass Customer,N,() { _; } >_[$($())] { touch /tmp/blns.shellsh...,No,7.0
4,5,Sheila-kathryn,Calton,Female,56,1977-05-13,Senior Editor,,Affluent Customer,N,NIL,Yes,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3995,3996,Rosalia,Halgarth,Female,8,1975-09-08,VP Product Management,Health,Mass Customer,N,-1.00E+02,No,19.0
3996,3997,Blanch,Nisuis,Female,87,2001-07-13,Statistician II,Manufacturing,High Net Worth,N,â¦testâ§,Yes,1.0
3997,3998,Sarene,Woolley,U,60,NaT,Assistant Manager,IT,High Net Worth,N,,No,
3998,3999,Patrizius,,Male,11,1973-10-24,,Manufacturing,Affluent Customer,N,Â¡â¢Â£Â¢âÂ§Â¶â¢ÂªÂºââ,Yes,10.0


### Exploratory Data Analysis

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   customer_id                          4000 non-null   int64         
 1   first_name                           4000 non-null   object        
 2   last_name                            3875 non-null   object        
 3   gender                               4000 non-null   object        
 4   past_3_years_bike_related_purchases  4000 non-null   int64         
 5   DOB                                  3913 non-null   datetime64[ns]
 6   job_title                            3494 non-null   object        
 7   job_industry_category                3344 non-null   object        
 8   wealth_segment                       4000 non-null   object        
 9   deceased_indicator                   4000 non-null   object        
 10  default     

In [5]:
df.describe()

,customer_id,past_3_years_bike_related_purchases,tenure
count,4000.0,4000.0,3913.0
mean,2000.5,48.89,10.657041
std,1154.844867,28.715005,5.660146
min,1.0,0.0,1.0
25%,1000.75,24.0,6.0
50%,2000.5,48.0,11.0
75%,3000.25,73.0,15.0
max,4000.0,99.0,22.0
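The non-null counts above imply missing values in several columns. A minimal sketch of turning that into an explicit per-column count and percentage is shown here on a small hand-built `sample` frame (an assumption for illustration, using the first rows displayed earlier); on the real data the same expressions would be applied to `df`.

```python
import pandas as pd

# Small illustrative frame; values copied from the first five rows shown above.
sample = pd.DataFrame({
    "last_name": ["Medendorp", "Bockman", "Dearle", None, "Calton"],
    "job_title": ["Executive Secretary", "Administrative Officer",
                  "Recruiting Manager", None, "Senior Editor"],
    "job_industry_category": ["Health", "Financial Services", "Property", "IT", None],
    "tenure": [11.0, 16.0, 15.0, 7.0, 8.0],
})

# isna().sum() counts missing cells per column; isna().mean() gives the fraction.
missing = pd.DataFrame({
    "n_missing": sample.isna().sum(),
    "pct_missing": (sample.isna().mean() * 100).round(1),
})

# Keep only the columns that actually have gaps, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```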


In [6]:
df.columns

Index(['customer_id', 'first_name', 'last_name', 'gender', 'past_3_years_bike_related_purchases', 'DOB', 'job_title', 'job_industry_category', 'wealth_segment', 'deceased_indicator', 'default', 'owns_car', 'tenure'], dtype='object')

In [7]:
df["customer_id"].nunique()

4000

In [8]:
df["customer_id"].value_counts()

2047    1
657     1
2732    1
681     1
2728    1
       ..
3371    1
1322    1
3367    1
1318    1
2049    1
Name: customer_id, Length: 4000, dtype: int64
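Since `nunique()` equals the row count, every customer_id appears once. A more direct check, sketched below on toy series (the `duplicated_ids` helper and both series are illustrative assumptions), uses `Series.is_unique` and `Series.duplicated`.

```python
import pandas as pd

def duplicated_ids(ids: pd.Series) -> list:
    """Return every value that occurs more than once (all occurrences kept)."""
    return ids[ids.duplicated(keep=False)].tolist()

# Toy data: one clean series and one with a repeated id.
ids = pd.Series([1, 2, 3, 4, 5], name="customer_id")
ids_with_dup = pd.Series([1, 2, 2, 3], name="customer_id")

print(ids.is_unique, duplicated_ids(ids))
print(ids_with_dup.is_unique, duplicated_ids(ids_with_dup))
```

On the real data this would be `df["customer_id"].is_unique`.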

In [9]:
df["first_name"].nunique()

3139

In [10]:
df["first_name"].value_counts()

Max        5
Timmie     5
Tobe       5
Kim        4
Osgood     4
          ..
Ashia      1
Sharon     1
Nina       1
Burnaby    1
Job        1
Name: first_name, Length: 3139, dtype: int64

In [11]:
df["last_name"].nunique()

3725

In [12]:
df["last_name"].value_counts()

Pristnor     3
Ramsdell     3
Cotillard    2
Ligerton     2
Dredge       2
            ..
Fawloe       1
Megarrell    1
LeEstut      1
Benkin       1
Heart        1
Name: last_name, Length: 3725, dtype: int64

In [13]:
df["gender"].nunique()

6

In [14]:
df["gender"].value_counts()

Female    2037
Male      1872
U           88
F            1
M            1
Femal        1
Name: gender, dtype: int64
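One possible way to standardise these labels is an explicit mapping, sketched below on a toy series. Treating `U` as `"Unknown"` rather than dropping or imputing it is an assumption made here for illustration, and `gender_map` is a hypothetical name.

```python
import pandas as pd

# Toy series covering all six labels seen in the value counts above.
gender = pd.Series(["F", "Male", "Male", "Female", "U", "M", "Femal"], name="gender")

# Map each variant spelling to a canonical label; unlisted values pass through unchanged.
gender_map = {"F": "Female", "Femal": "Female", "M": "Male", "U": "Unknown"}
gender_clean = gender.replace(gender_map)

print(gender_clean.value_counts())
```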

In [15]:
df["past_3_years_bike_related_purchases"].nunique()

100

In [16]:
df["past_3_years_bike_related_purchases"].value_counts()

16    56
19    56
20    54
67    54
2     50
      ..
8     28
86    27
95    27
85    27
92    24
Name: past_3_years_bike_related_purchases, Length: 100, dtype: int64

In [17]:
df["DOB"].nunique()

3448

In [18]:
df["DOB"].value_counts()

1978-01-30    7
1964-08-07    4
1978-08-19    4
1977-05-13    4
1976-07-16    4
             ..
1995-11-15    1
1989-01-08    1
1984-06-15    1
1979-11-21    1
1972-04-14    1
Name: DOB, Length: 3448, dtype: int64

In [19]:
df["DOB"].min()

Timestamp('1843-12-21 00:00:00')

In [20]:
df["DOB"].max()

Timestamp('2002-11-03 00:00:00')

In [21]:
df["DOB"].sort_values()

33     1843-12-21
719    1931-10-23
1091   1935-08-22
3409   1940-09-22
2412   1943-11-08
          ...    
3778          NaT
3882          NaT
3930          NaT
3934          NaT
3997          NaT
Name: DOB, Length: 4000, dtype: datetime64[ns]
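The 1843 entry can be flagged with a simple age rule rather than by eye. The sketch below assumes a reference date `as_of` and an upper bound `max_age_years`; both are illustrative choices, not values taken from the dataset.

```python
import pandas as pd

# Toy DOB series including the 1843 outlier and one missing value.
dob = pd.Series(
    pd.to_datetime(["1843-12-21", "1931-10-23", "1953-12-10", None, "2002-11-03"]),
    name="DOB",
)

as_of = pd.Timestamp("2020-01-01")  # ASSUMPTION: reference date for computing age
max_age_years = 110                 # ASSUMPTION: oldest plausible customer age

# Approximate age in years; missing DOBs give NaN and are not flagged.
age_years = (as_of - dob).dt.days / 365.25
implausible = (age_years > max_age_years) | (age_years < 0)

# Blank out implausible dates so they are handled like other missing DOBs.
dob_clean = dob.mask(implausible)
print(dob[implausible])
```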

In [22]:
df["job_title"].nunique()

195

In [23]:
df["job_title"].value_counts()

Business Systems Development Analyst    45
Tax Accountant                          44
Social Worker                           44
Internal Auditor                        42
Legal Assistant                         41
                                        ..
Administrative Assistant II              4
Research Assistant III                   3
Health Coach I                           3
Health Coach III                         3
Developer I                              1
Name: job_title, Length: 195, dtype: int64

In [24]:
df["job_industry_category"].nunique()

9

In [25]:
df["job_industry_category"].value_counts()

Manufacturing         799
Financial Services    774
Health                602
Retail                358
Property              267
IT                    223
Entertainment         136
Argiculture           113
Telecommunications     72
Name: job_industry_category, dtype: int64
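The category list above contains the misspelling `Argiculture`. A minimal sketch of correcting it with an exact-match replacement, shown on a toy series:

```python
import pandas as pd

# Toy series with the misspelled category and a missing value.
industry = pd.Series(
    ["Manufacturing", "Argiculture", "Health", None, "Argiculture"],
    name="job_industry_category",
)

# Exact-match replacement; other categories and missing values are left untouched.
industry_clean = industry.replace({"Argiculture": "Agriculture"})
print(industry_clean.value_counts())
```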

In [26]:
df["wealth_segment"].nunique()

3

In [27]:
df["wealth_segment"].value_counts()

Mass Customer        2000
High Net Worth       1021
Affluent Customer     979
Name: wealth_segment, dtype: int64

In [28]:
df["deceased_indicator"].nunique()

2

In [29]:
df["deceased_indicator"].value_counts()

N    3998
Y       2
Name: deceased_indicator, dtype: int64

In [30]:
df["default"].nunique()

101

In [31]:
df["default"].value_counts()

1.00E+02                                                                                                                                 111
-1.00E+02                                                                                                                                 96
1                                                                                                                                         70
-1                                                                                                                                        64
Ù¡Ù¢Ù£                                                                                                                                    53
                                                                                                                                        ... 
0.00E+00                                                                                                                                   2
-5.00E-01    
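Because the default column carries no interpretable information, one option is to exclude it before analysis. The sketch below shows this on a toy frame; whether to drop the column or ask the client for its meaning first is a judgment call, not something the data settles.

```python
import pandas as pd

# Toy frame with a few of the default values displayed earlier.
sample = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "default": ["\"'", "<script>alert('hi')</script>", "1-Feb"],
    "owns_car": ["Yes", "Yes", "Yes"],
})

# errors="ignore" makes the call safe to re-run after the column is already gone.
cleaned = sample.drop(columns=["default"], errors="ignore")
print(cleaned.columns.tolist())
```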

In [32]:
df["owns_car"].nunique()

2

In [33]:
df["owns_car"].value_counts()

Yes    2024
No     1976
Name: owns_car, dtype: int64

In [34]:
df["tenure"].nunique()

22

In [35]:
df["tenure"].value_counts()

7.0     235
5.0     228
11.0    221
10.0    218
16.0    215
8.0     211
18.0    208
12.0    202
9.0     200
14.0    200
6.0     192
13.0    191
4.0     191
17.0    182
15.0    179
1.0     166
3.0     160
19.0    159
2.0     150
20.0     96
22.0     55
21.0     54
Name: tenure, dtype: int64
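The per-column `nunique()` and `value_counts()` calls above follow one pattern, so they could be collapsed into a single summary table. This is a sketch on a toy `sample` frame (an assumption for illustration); the same loop would run over `df.columns` on the real data.

```python
import pandas as pd

# Toy frame built from the first five rows displayed earlier.
sample = pd.DataFrame({
    "gender": ["F", "Male", "Male", "Male", "Female"],
    "owns_car": ["Yes", "Yes", "Yes", "No", "Yes"],
    "tenure": [11.0, 16.0, 15.0, 7.0, 8.0],
})

rows = {}
for col in sample.columns:
    counts = sample[col].value_counts()  # sorted by frequency, NaN excluded
    rows[col] = {
        "n_unique": sample[col].nunique(),
        "top_value": counts.index[0],
        "top_count": counts.iloc[0],
    }

# One row per column, one field per statistic.
summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary)
```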