Draft an email to the client identifying the data quality issues and the strategies to mitigate them. Refer to the ‘Data Quality Framework Table’ and the resources below for the criteria and dimensions you should consider.

### Data Quality Summary

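1. The customer_id column has gaps: the table holds 3,999 unique IDs spanning 1 to 4003 (the first rows jump from ID 2 to ID 4), so four IDs in that range have no address record. This may result in missing values once the table is combined with the customer demographics dataset.
2. The state column mixes full names and abbreviations: "New South Wales" (86 rows) and "Victoria" (82 rows) should be shortened to NSW and VIC.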
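
Before writing to the client, it can help to list exactly which customers would end up without an address. The sketch below is one way to do that, assuming the demographics table also has a `customer_id` column; the `demographics` frame and its `segment` column are hypothetical stand-ins, not the real dataset.

```python
import pandas as pd

# Toy stand-ins: the address rows mirror the first IDs shown in this notebook,
# the demographics frame is hypothetical illustration data.
address = pd.DataFrame({
    "customer_id": [1, 2, 4, 5],
    "state": ["New South Wales", "New South Wales", "QLD", "New South Wales"],
})
demographics = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "segment": ["a", "b", "c", "d", "e"],
})

# A left merge keeps every demographics row; indicator=True adds a "_merge"
# column telling us whether a matching address row was found.
merged = demographics.merge(address, on="customer_id", how="left", indicator=True)

# Customers present in demographics but absent from the address table.
unmatched = merged.loc[merged["_merge"] == "left_only", "customer_id"].tolist()
print(unmatched)
```

Switching to `how="outer"` would also surface address rows that have no demographics record.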

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import datetime

%matplotlib inline
# Set style and font scale in one call: a later sns.set() would reset the style chosen by sns.set_style()
sns.set(style='dark', font_scale=1.2)

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)

np.random.seed(0)
np.set_printoptions(suppress=True)

In [2]:
df = pd.read_csv("custaddress.csv")

In [3]:
df

| | customer_id | address | postcode | state | country | property_valuation |
|---|---|---|---|---|---|---|
| 0 | 1 | 060 Morning Avenue | 2016 | New South Wales | Australia | 10 |
| 1 | 2 | 6 Meadow Vale Court | 2153 | New South Wales | Australia | 10 |
| 2 | 4 | 0 Holy Cross Court | 4211 | QLD | Australia | 9 |
| 3 | 5 | 17979 Del Mar Point | 2448 | New South Wales | Australia | 4 |
| 4 | 6 | 9 Oakridge Court | 3216 | VIC | Australia | 9 |
| ... | ... | ... | ... | ... | ... | ... |
| 3994 | 3999 | 1482 Hauk Trail | 3064 | VIC | Australia | 3 |
| 3995 | 4000 | 57042 Village Green Point | 4511 | QLD | Australia | 6 |
| 3996 | 4001 | 87 Crescent Oaks Alley | 2756 | NSW | Australia | 10 |
| 3997 | 4002 | 8194 Lien Street | 4032 | QLD | Australia | 7 |


### Exploratory Data Analysis

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 0 to 3998
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         3999 non-null   int64 
 1   address             3999 non-null   object
 2   postcode            3999 non-null   int64 
 3   state               3999 non-null   object
 4   country             3999 non-null   object
 5   property_valuation  3999 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 187.6+ KB
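
The per-column `nunique()` and `value_counts()` calls that follow can be condensed into a single profile table. This is a minimal sketch, not part of the original analysis: `profile` is a hypothetical helper name and the sample frame is illustrative data.

```python
import pandas as pd

def profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, null count and distinct-value count for every column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "nulls": frame.isna().sum(),
        "unique": frame.nunique(),
    })

# Small illustrative frame shaped like the address table.
sample = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "state": ["New South Wales", "New South Wales", "QLD"],
    "country": ["Australia"] * 3,
})

summary = profile(sample)
print(summary)
```

Columns with a single distinct value (such as `country` here) and columns with unexpectedly few or many distinct values stand out at a glance.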


In [5]:
df.describe()

| | customer_id | postcode | property_valuation |
|---|---|---|---|
| count | 3999.0 | 3999.0 | 3999.0 |
| mean | 2003.987997 | 2985.755939 | 7.514379 |
| std | 1154.576912 | 844.878364 | 2.824663 |
| min | 1.0 | 2000.0 | 1.0 |
| 25% | 1004.5 | 2200.0 | 6.0 |
| 50% | 2004.0 | 2768.0 | 8.0 |
| 75% | 3003.5 | 3750.0 | 10.0 |
| max | 4003.0 | 4883.0 | 12.0 |


In [6]:
df.columns

Index(['customer_id', 'address', 'postcode', 'state', 'country', 'property_valuation'], dtype='object')

In [7]:
df["customer_id"].nunique()

3999

In [8]:
df["customer_id"].value_counts()

2047    1
653     1
2728    1
677     1
2724    1
       ..
3371    1
1322    1
3367    1
1318    1
2049    1
Name: customer_id, Length: 3999, dtype: int64
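
Every ID is unique, but uniqueness does not show gaps. With 3,999 unique IDs and `describe()` reporting a minimum of 1 and a maximum of 4003, four IDs in that range must be absent. A minimal sketch for listing them, assuming IDs are meant to be consecutive integers; `missing_ids` is a hypothetical helper and the sample series is illustrative.

```python
import numpy as np
import pandas as pd

def missing_ids(ids: pd.Series) -> list:
    """List the integers between min and max that do not occur in `ids`."""
    expected = np.arange(ids.min(), ids.max() + 1)
    return np.setdiff1d(expected, ids.to_numpy()).tolist()

# Illustrative sample mirroring the first IDs shown above (3 is skipped).
sample = pd.Series([1, 2, 4, 5, 6], name="customer_id")
gaps = missing_ids(sample)
print(gaps)
```

On the real data the call would be `missing_ids(df["customer_id"])`.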

In [9]:
df["address"].nunique()

3996

In [10]:
df["address"].value_counts()

64 Macpherson Junction         2
3 Talisman Place               2
3 Mariners Cove Terrace        2
74910 Burning Wood Junction    1
5 Loftsgordon Avenue           1
                              ..
52273 Bay Place                1
975 Annamark Hill              1
8718 Warner Avenue             1
322 Scott Plaza                1
9 Pearson Plaza                1
Name: address, Length: 3996, dtype: int64
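
Three addresses occur twice. Whether that is an error (a duplicated record) or legitimate (two customers in one household) cannot be decided from the counts alone, so the rows need to be inspected side by side. A sketch of one way to pull them out; the `customer_id` values in the sample are made up for illustration.

```python
import pandas as pd

# Illustrative sample; the customer_id values are hypothetical.
sample = pd.DataFrame({
    "customer_id": [10, 11, 12, 13],
    "address": [
        "3 Talisman Place",
        "64 Macpherson Junction",
        "3 Talisman Place",
        "9 Pearson Plaza",
    ],
})

# keep=False flags every occurrence of a repeated address, not only the later ones.
dupes = (
    sample[sample.duplicated("address", keep=False)]
    .sort_values(["address", "customer_id"])
)
print(dupes)
```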

In [11]:
df["postcode"].nunique()

873

In [12]:
df["postcode"].value_counts()

2170    31
2155    30
2145    30
2153    29
2770    26
        ..
4552     1
4555     1
2485     1
3580     1
4421     1
Name: postcode, Length: 873, dtype: int64
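
The postcodes range from 2000 to 4883, which fits the three states present. A rough consistency check is to compare each postcode's leading digit with its state. This is a heuristic sketch only: the digit mapping below is an assumption based on the general Australian postcode allocation, and a few border localities are known exceptions, so flagged rows should be reviewed rather than treated as errors. The sample rows are illustrative, with one deliberately inconsistent pair.

```python
import pandas as pd

# Assumed leading digit per state (heuristic, not an official lookup).
LEADING_DIGIT = {"NSW": "2", "VIC": "3", "QLD": "4"}

# Illustrative rows; the 3064 / NSW pair is deliberately inconsistent.
sample = pd.DataFrame({
    "postcode": [2016, 4211, 3216, 3064, 2756],
    "state": ["NSW", "QLD", "VIC", "NSW", "NSW"],
})

expected = sample["state"].map(LEADING_DIGIT)       # NaN for unmapped states
actual = sample["postcode"].astype(str).str[0]      # first digit as text
sample["postcode_matches_state"] = expected == actual

suspect = sample[~sample["postcode_matches_state"]]
print(suspect)
```

The check assumes state names have already been standardised to their abbreviations; full names such as "New South Wales" would map to NaN and be flagged.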

In [13]:
df["state"].nunique()

5

In [14]:
df["state"].value_counts()

NSW                2054
VIC                 939
QLD                 838
New South Wales      86
Victoria             82
Name: state, dtype: int64
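
The five distinct values represent only three states. One possible fix is to map the full names onto the abbreviations already used by most rows; the sketch below uses an illustrative sample, and `STATE_ABBREVIATIONS` is a name introduced here for the example.

```python
import pandas as pd

# Full names seen in the value counts above, mapped to the dominant abbreviations.
STATE_ABBREVIATIONS = {"New South Wales": "NSW", "Victoria": "VIC"}

sample = pd.DataFrame(
    {"state": ["NSW", "New South Wales", "VIC", "Victoria", "QLD"]}
)

# Strip stray whitespace first, then replace exact full-name matches.
sample["state"] = sample["state"].str.strip().replace(STATE_ABBREVIATIONS)
print(sample["state"].value_counts())
```

Applied to the full table, the counts above imply 2,140 NSW rows (2054 + 86), 1,021 VIC rows (939 + 82) and 838 QLD rows, which still sum to 3,999.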

In [15]:
df["country"].nunique()

1

In [16]:
df["country"].value_counts()

Australia    3999
Name: country, dtype: int64

In [17]:
df["property_valuation"].nunique()

12

In [18]:
df["property_valuation"].value_counts()

9     647
8     646
10    577
7     493
11    281
6     238
5     225
4     214
12    195
3     186
1     154
2     143
Name: property_valuation, dtype: int64