# Outlier checking

We flag values whose absolute z-score exceeds 5 and examine them feature by feature.

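With 21613 samples, a single observation is about 0.005% of the data, which corresponds roughly to the two-sided tail beyond a z-score of 4 under a normal distribution. Using an absolute z-score of 5 as the threshold should therefore be reasonable for a first screening.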
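As a rough check of this reasoning, the sketch below assumes a feature is exactly normally distributed. Real housing features are skewed, so this is only a guide. It computes the two-sided z-score that corresponds to a rate of 1 in 21613, and the number of rows one would expect beyond |z| = 5 under that assumption.

```python
from statistics import NormalDist

n_samples = 21613  # number of rows quoted in the text

# Two-sided tail probability that corresponds to "1 in n_samples".
tail_prob = 1 / n_samples

# |z| beyond which a standard normal variable falls with that probability.
z_one_in_n = NormalDist().inv_cdf(1 - tail_prob / 2)

# Expected number of rows beyond |z| = 5 if a feature were exactly normal.
expected_beyond_5 = n_samples * 2 * (1 - NormalDist().cdf(5))

print(f"1 in {n_samples} = {tail_prob:.4%}")
print(f"two-sided z for that rate: {z_one_in_n:.2f}")
print(f"expected rows beyond |z| = 5: {expected_beyond_5:.4f}")
```

Under the normal assumption, far fewer than one row is expected beyond |z| = 5, so any row that does exceed the threshold deserves a look.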

Since we cannot verify data accuracy directly with sellers or buyers, we remove outliers only conservatively and give the data the benefit of the doubt.

After checking each feature carefully, we will provide a summary and decision at the end.

In [79]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

path = './archive/'

kc_data = pd.read_csv(path + 'kc_house_data.csv')


In [80]:
# Drop 'id' and 'date': the first is not a feature and the second is not numerical

data = kc_data.drop(['id', 'date'], axis=1)
columns = list(data.columns)

# Absolute z-score of every value, computed column by column
z_score = np.abs(stats.zscore(data))
z_score_df = pd.DataFrame(z_score, columns=columns)

# Rows in which at least one feature has |z| > 5
outlier = data[(z_score_df > 5).any(axis=1)]
outlier

print("Examine outliers by each feature in the below:")

Examine outliers by each feature in the below:
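Before going through the features one by one, a per-feature count of flagged rows gives a quick overview. The sketch below is only an illustration on synthetic data: the frame `demo` and its values are invented for this example and are not taken from the housing file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (hypothetical values, not the real file).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "price": rng.normal(500_000, 100_000, 1000),
    "bedrooms": rng.integers(1, 6, 1000).astype(float),
})
demo.loc[0, "bedrooms"] = 33  # plant one extreme value

# Absolute z-score per column, using the population standard deviation
# (ddof=0), which is also the default of scipy.stats.zscore.
z = ((demo - demo.mean()) / demo.std(ddof=0)).abs()

# Number of rows beyond the threshold in each column.
outlier_counts = (z > 5).sum()
print(outlier_counts)
```

On the real data the same expression would be `(z_score_df > 5).sum()`.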


In [81]:
data[z_score_df.price>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
246,2400000.0,4,2.50,3650,8354,1.0,1,4,3,9,1830,1820,2000,0,98074,47.6338,-122.072,3120,18841
269,2900000.0,4,3.25,5050,20100,1.5,0,2,3,11,4750,300,1982,2008,98004,47.6312,-122.223,3890,20060
300,3075000.0,4,5.00,4550,18641,1.0,1,4,3,10,2600,1950,2002,0,98074,47.6053,-122.077,4550,19508
312,2384000.0,5,2.50,3650,9050,2.0,0,4,5,10,3370,280,1921,0,98119,47.6345,-122.367,2880,5400
656,3070000.0,3,2.50,3930,55867,1.0,1,4,4,8,2330,1600,1957,0,98034,47.7022,-122.224,2730,26324
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20460,3345000.0,5,3.75,5350,15360,1.0,0,1,3,11,3040,2310,2008,0,98004,47.6480,-122.218,3740,15940
20535,2950000.0,4,4.25,4470,5884,2.0,0,1,3,11,3230,1240,2010,0,98199,47.6387,-122.405,2570,6000
21040,2900000.0,5,4.00,5190,14600,2.0,0,1,3,11,5190,0,2013,0,98039,47.6102,-122.225,3840,19250
21530,3000000.0,4,3.75,5090,14823,1.0,0,0,3,11,4180,910,2013,0,98004,47.6200,-122.207,3030,12752


In [82]:
data[z_score_df.price>10]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
1164,5110800.0,5,5.25,8010,45517,2.0,1,4,3,12,5990,2020,1999,0,98033,47.6767,-122.211,3430,26788
1315,5300000.0,6,6.0,7390,24829,2.0,1,4,4,12,5000,2390,1991,0,98040,47.5631,-122.21,4320,24619
1448,5350000.0,5,5.0,8000,23985,2.0,0,4,3,12,6720,1280,2009,0,98004,47.6232,-122.22,4600,21750
2626,4500000.0,5,5.5,6640,40014,2.0,1,4,3,12,6350,290,2004,0,98155,47.7493,-122.28,3030,23408
3914,7062500.0,5,4.5,10040,37325,2.0,1,2,3,11,7680,2360,1940,2001,98004,47.65,-122.214,3930,25449
4411,5570000.0,5,5.75,9200,35069,2.0,0,0,3,13,6200,3000,2001,0,98039,47.6289,-122.233,3560,24345
7252,7700000.0,6,8.0,12050,27600,2.5,0,3,4,13,8570,3480,1910,1987,98102,47.6298,-122.323,3940,8800
8092,4668000.0,5,6.75,9640,13068,1.0,1,4,3,12,4820,4820,1983,2009,98040,47.557,-122.21,3270,10454
8638,4489000.0,4,3.0,6430,27517,2.0,0,0,3,12,6430,0,2001,0,98004,47.6208,-122.219,3720,14592
9254,6885000.0,6,7.75,9890,31374,2.0,0,4,3,13,8860,1030,2001,0,98039,47.6305,-122.24,4540,42730


In [83]:
data[z_score_df.bedrooms>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
4096,599999.0,9,4.5,3830,6988,2.5,0,0,3,7,2450,1380,1938,0,98103,47.6927,-122.338,1460,6291
4235,700000.0,9,3.0,3680,4400,2.0,0,0,3,7,2830,850,1908,0,98102,47.6374,-122.324,1960,2450
6079,1280000.0,9,4.5,3650,5000,2.0,0,0,3,8,2530,1120,1915,2010,98105,47.6604,-122.289,2510,5000
8546,450000.0,9,7.5,4050,6504,2.0,0,0,3,7,4050,0,1996,0,98144,47.5923,-122.301,1448,3866
8757,520000.0,11,3.0,3000,4960,2.0,0,0,3,7,2400,600,1918,1999,98106,47.556,-122.363,1420,4960
13314,1148000.0,10,5.25,4590,10920,1.0,0,2,3,9,2500,2090,2008,0,98004,47.5861,-122.113,2730,10400
15161,650000.0,10,2.0,3610,11914,2.0,0,0,4,7,3010,600,1958,0,98006,47.5705,-122.175,2040,11914
15870,640000.0,33,1.75,1620,6000,1.0,0,0,5,7,1040,580,1947,0,98103,47.6878,-122.331,1330,4700
16844,1400000.0,9,4.0,4620,5508,2.5,0,0,3,11,3870,750,1915,0,98105,47.6684,-122.309,2710,4320
18443,934000.0,9,3.0,2820,4480,2.0,0,0,3,7,1880,940,1918,0,98105,47.6654,-122.307,2460,4400


In [84]:
data[z_score_df.bathrooms>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
1315,5300000.0,6,6.0,7390,24829,2.0,1,4,4,12,5000,2390,1991,0,98040,47.5631,-122.21,4320,24619
4024,800000.0,7,6.75,7480,41664,2.0,0,2,3,11,5080,2400,1953,0,98166,47.4643,-122.368,2810,33190
4035,2150000.0,8,6.0,4340,9415,2.0,0,0,3,8,4340,0,1967,0,98004,47.6316,-122.202,2050,9100
7252,7700000.0,6,8.0,12050,27600,2.5,0,3,4,13,8570,3480,1910,1987,98102,47.6298,-122.323,3940,8800
8092,4668000.0,5,6.75,9640,13068,1.0,1,4,3,12,4820,4820,1983,2009,98040,47.557,-122.21,3270,10454
8546,450000.0,9,7.5,4050,6504,2.0,0,0,3,7,4050,0,1996,0,98144,47.5923,-122.301,1448,3866
9254,6885000.0,6,7.75,9890,31374,2.0,0,4,3,13,8860,1030,2001,0,98039,47.6305,-122.24,4540,42730
12370,4208000.0,5,6.0,7440,21540,2.0,0,0,3,12,5550,1890,2003,0,98006,47.5692,-122.189,4740,19329
12777,2280000.0,7,8.0,13540,307752,3.0,0,4,3,12,9410,4130,1999,0,98053,47.6675,-121.986,4850,217800
14556,2888000.0,5,6.25,8670,64033,2.0,0,4,3,13,6120,2550,1965,2003,98177,47.7295,-122.372,4140,81021


In [85]:
data[z_score_df.sqft_living>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
1164,5110800.0,5,5.25,8010,45517,2.0,1,4,3,12,5990,2020,1999,0,98033,47.6767,-122.211,3430,26788
1315,5300000.0,6,6.0,7390,24829,2.0,1,4,4,12,5000,2390,1991,0,98040,47.5631,-122.21,4320,24619
1448,5350000.0,5,5.0,8000,23985,2.0,0,4,3,12,6720,1280,2009,0,98004,47.6232,-122.22,4600,21750
2444,3278000.0,2,1.75,6840,10000,2.5,1,4,3,11,4350,2490,2001,0,98008,47.6042,-122.112,3120,12300
2713,1110000.0,5,3.5,7350,12231,2.0,0,4,3,11,4750,2600,2001,0,98065,47.5373,-121.865,5380,12587
3020,2525000.0,4,5.5,6930,45100,1.0,0,0,4,11,4310,2620,1950,1991,98006,47.5547,-122.144,2560,37766
3914,7062500.0,5,4.5,10040,37325,2.0,1,2,3,11,7680,2360,1940,2001,98004,47.65,-122.214,3930,25449
4024,800000.0,7,6.75,7480,41664,2.0,0,2,3,11,5080,2400,1953,0,98166,47.4643,-122.368,2810,33190
4149,4000000.0,4,5.5,7080,16573,2.0,0,0,3,12,5760,1320,2008,0,98039,47.6151,-122.224,3140,15996
4411,5570000.0,5,5.75,9200,35069,2.0,0,0,3,13,6200,3000,2001,0,98039,47.6289,-122.233,3560,24345


In [86]:
data[z_score_df.sqft_lot>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
145,921500.0,4,2.50,3670,315374,2.0,0,0,4,9,3670,0,1994,0,98077,47.7421,-122.026,2840,87991
527,1600000.0,6,5.00,6050,230652,2.0,0,3,3,11,6050,0,2001,0,98024,47.6033,-121.943,4210,233971
929,390000.0,4,3.00,2570,262018,1.0,0,0,3,7,1420,1150,1988,0,98058,47.4417,-122.090,2260,19811
1045,590000.0,3,1.75,1560,242629,1.0,0,0,3,7,1560,0,1981,0,98053,47.6493,-121.956,2320,220654
1703,617000.0,3,1.75,3020,360241,2.0,0,0,3,8,3020,0,1992,0,98092,47.2662,-122.088,1890,209959
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20811,580000.0,3,2.50,1820,374616,2.0,0,0,3,7,1820,0,1999,0,98014,47.6539,-121.959,1870,220654
20974,950000.0,3,2.50,2780,275033,1.0,0,0,3,10,2780,0,2006,0,98045,47.4496,-121.766,1680,16340
21325,659000.0,3,2.50,3090,384634,2.0,0,0,3,8,3090,0,2007,0,98019,47.7072,-121.927,2200,292645
21344,1488000.0,5,6.00,6880,279968,2.0,0,3,3,12,4070,2810,2007,0,98045,47.4624,-121.779,4690,256803


In [87]:
data[z_score_df.sqft_lot>10]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
1719,700000.0,4,1.0,1300,1651359,1.0,0,3,4,6,1300,0,1920,0,98022,47.2313,-122.023,2560,425581
2964,999000.0,3,2.75,2830,505166,1.0,1,3,4,8,1830,1000,1962,0,98070,47.3782,-122.514,2120,21988
3949,998000.0,4,3.25,3770,982998,2.0,0,0,3,10,3770,0,1992,0,98058,47.414,-122.087,2290,37141
4387,480000.0,4,3.5,3370,435600,2.0,0,3,3,9,3370,0,2005,0,98045,47.4398,-121.738,2790,114868
4441,790000.0,2,3.0,2560,982278,1.0,0,0,3,8,2560,0,2004,0,98014,47.6955,-121.861,1620,40946
4540,550000.0,3,2.0,3650,843309,2.0,0,0,4,7,3650,0,1991,0,98070,47.3627,-122.496,1870,273992
5073,859000.0,3,2.5,2920,434728,2.0,0,3,4,8,2920,0,1999,0,98042,47.3809,-122.13,3150,55216
6691,1998000.0,2,2.5,3900,920423,2.0,0,0,3,12,3900,0,2009,0,98065,47.5371,-121.756,2720,411962
7077,1650000.0,4,3.25,3920,881654,3.0,0,3,3,11,3920,0,2002,0,98024,47.5385,-121.896,2970,112384
7250,950000.0,4,3.0,3230,438213,2.0,0,0,3,9,3230,0,1999,0,98070,47.4141,-122.47,1600,144619


In [88]:
data[z_score_df.floors>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15


In [89]:
data[z_score_df.waterfront>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
49,1350000.0,3,2.50,2753,65005,1.0,1,2,5,9,2165,588,1953,0,98070,47.4041,-122.451,2680,72513
230,655000.0,2,1.75,1450,15798,2.0,1,4,3,7,1230,220,1915,1978,98166,47.4497,-122.375,2030,13193
246,2400000.0,4,2.50,3650,8354,1.0,1,4,3,9,1830,1820,2000,0,98074,47.6338,-122.072,3120,18841
264,369900.0,1,0.75,760,10079,1.0,1,4,5,5,760,0,1936,0,98070,47.4683,-122.438,1230,14267
300,3075000.0,4,5.00,4550,18641,1.0,1,4,3,10,2600,1950,2002,0,98074,47.6053,-122.077,4550,19508
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19984,1898000.0,3,2.50,2830,4334,3.0,1,4,3,10,2830,0,2006,0,98074,47.6318,-122.071,2830,38211
20325,3000000.0,3,3.50,4410,10756,2.0,1,4,3,11,3430,980,2014,0,98056,47.5283,-122.205,3550,5634
20767,2300000.0,4,4.00,4360,8175,2.5,1,4,3,10,3940,420,2007,0,98008,47.5724,-122.104,2670,8525
21201,2230000.0,3,3.50,3760,5634,2.0,1,4,3,11,2830,930,2014,0,98056,47.5285,-122.205,3560,5762


In [90]:
data[z_score_df.view>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15


In [91]:
data[z_score_df.condition>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15


In [92]:
data[z_score_df.grade>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
19452,142000.0,0,0.0,290,20875,1.0,0,0,1,1,290,0,1963,0,98024,47.5308,-121.888,1620,22850


In [93]:
data[z_score_df.sqft_above>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
419,1550000.0,5,4.25,6070,171626,2.0,0,0,3,12,6070,0,1999,0,98024,47.5954,-121.95,4680,211267
527,1600000.0,6,5.0,6050,230652,2.0,0,3,3,11,6050,0,2001,0,98024,47.6033,-121.943,4210,233971
1100,1570000.0,5,4.5,6070,14731,2.0,0,0,3,11,6070,0,2004,0,98059,47.5306,-122.134,4750,13404
1164,5110800.0,5,5.25,8010,45517,2.0,1,4,3,12,5990,2020,1999,0,98033,47.6767,-122.211,3430,26788
1448,5350000.0,5,5.0,8000,23985,2.0,0,4,3,12,6720,1280,2009,0,98004,47.6232,-122.22,4600,21750
2626,4500000.0,5,5.5,6640,40014,2.0,1,4,3,12,6350,290,2004,0,98155,47.7493,-122.28,3030,23408
3121,1320000.0,4,5.25,6110,10369,2.0,0,0,3,11,6110,0,2005,0,98059,47.5285,-122.135,4190,10762
3914,7062500.0,5,4.5,10040,37325,2.0,1,2,3,11,7680,2360,1940,2001,98004,47.65,-122.214,3930,25449
4411,5570000.0,5,5.75,9200,35069,2.0,0,0,3,13,6200,3000,2001,0,98039,47.6289,-122.233,3560,24345
4811,2479000.0,5,3.75,6810,7500,2.5,0,0,3,13,6110,700,1922,0,98102,47.6285,-122.322,2660,7500


In [94]:
data[z_score_df.sqft_basement>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
2088,1195000.0,5,3.25,5180,19606,1.0,0,0,3,11,2610,2570,1993,0,98006,47.555,-122.114,4050,15296
2125,1325000.0,3,3.75,6400,76665,1.0,0,2,4,10,3810,2590,1966,0,98177,47.7313,-122.37,3430,60548
2713,1110000.0,5,3.5,7350,12231,2.0,0,4,3,11,4750,2600,2001,0,98065,47.5373,-121.865,5380,12587
3020,2525000.0,4,5.5,6930,45100,1.0,0,0,4,11,4310,2620,1950,1991,98006,47.5547,-122.144,2560,37766
4411,5570000.0,5,5.75,9200,35069,2.0,0,0,3,13,6200,3000,2001,0,98039,47.6289,-122.233,3560,24345
5049,1385000.0,6,2.75,5700,20000,1.0,0,4,4,10,2850,2850,1977,0,98006,47.5601,-122.16,3690,15700
6628,850000.0,4,2.75,5440,239580,1.0,0,0,2,9,2720,2720,1969,0,98001,47.354,-122.293,1970,40392
7035,3800000.0,5,5.5,7050,42840,1.0,0,2,4,13,4320,2730,1978,0,98004,47.6229,-122.22,5070,20570
7252,7700000.0,6,8.0,12050,27600,2.5,0,3,4,13,8570,3480,1910,1987,98102,47.6298,-122.323,3940,8800
8092,4668000.0,5,6.75,9640,13068,1.0,1,4,3,12,4820,4820,1983,2009,98040,47.557,-122.21,3270,10454


In [95]:
data[z_score_df.yr_built>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15


In [96]:
data[z_score_df.yr_renovated>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15


In [97]:
data[z_score_df.zipcode>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15


In [98]:
data[z_score_df.lat>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15


In [99]:
data[z_score_df.long>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
2589,134000.0,2,1.5,980,5000,2.0,0,0,3,7,980,0,1922,2003,98014,47.7076,-121.359,1040,5000
2927,167000.0,1,1.0,780,10235,1.5,0,0,3,6,780,0,1989,0,98014,47.713,-121.315,930,10165
4203,150000.0,3,0.75,490,38500,1.5,0,0,4,5,490,0,1959,0,98014,47.7112,-121.315,800,18297
4848,525000.0,3,2.75,2100,10362,2.0,0,0,3,9,1510,590,1998,0,98045,47.4347,-121.417,2240,11842
5867,175000.0,2,1.75,1050,9800,1.5,0,0,4,6,1050,0,1975,0,98019,47.7595,-121.473,1230,12726
6089,150000.0,3,1.0,890,6488,1.5,0,0,3,5,890,0,1928,0,98014,47.7087,-121.352,1330,16250
10095,200000.0,2,1.75,1320,13052,1.5,0,0,3,7,1320,0,1980,0,98014,47.712,-121.352,1320,13052
10898,241000.0,2,1.75,1070,9750,1.5,0,0,3,7,1070,0,1995,0,98014,47.7131,-121.319,970,9750
13072,155000.0,2,1.0,1010,43056,1.5,0,0,3,5,1010,0,1990,0,98014,47.7105,-121.316,830,18297
13249,375000.0,3,1.75,2140,13598,1.5,0,0,4,7,1620,520,1970,0,98014,47.7139,-121.321,930,10150


In [100]:
data[z_score_df.sqft_living15>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
1123,1200000.0,4,1.75,3990,13470,2.0,0,0,3,11,3990,0,2006,0,98059,47.5305,-122.131,5790,13709
1530,1250000.0,4,3.75,3830,41263,2.0,0,0,4,11,3830,0,1990,0,98077,47.7237,-122.042,5600,56568
5451,1780000.0,4,3.25,4890,13402,2.0,0,0,3,13,4890,0,2004,0,98059,47.5303,-122.131,5790,13539
10373,2983000.0,5,5.5,7400,18898,2.0,0,3,3,13,6290,1110,2001,0,98006,47.5431,-122.112,6110,26442
11871,1950000.0,4,3.25,7420,167869,2.0,0,3,3,12,7420,0,2002,0,98045,47.4548,-121.764,5610,169549
12713,2408000.0,5,2.5,4600,23250,1.5,0,2,3,9,3600,1000,1918,2003,98004,47.623,-122.218,5500,20066
16430,1750000.0,6,4.25,5860,13928,2.0,0,3,3,10,4150,1710,2013,0,98006,47.5382,-122.114,5790,13928
19858,2700000.0,4,4.0,7850,89651,2.0,0,0,3,12,7850,0,2006,0,98027,47.5406,-121.982,6210,95832
20563,1240420.0,5,3.25,5790,13726,2.0,0,3,3,10,4430,1360,2014,0,98006,47.5388,-122.114,5790,13726
20830,1750000.0,5,3.25,5790,12739,2.0,0,3,3,10,4430,1360,2014,0,98006,47.538,-122.114,5790,13928


In [101]:
data[z_score_df.sqft_lot15>5]

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
147,430000.0,2,2.50,2420,60984,2.0,0,0,3,7,2420,0,2007,0,98027,47.5262,-121.943,1940,193842
416,563500.0,4,1.75,2085,174240,1.0,0,0,3,7,1610,475,1964,0,98024,47.5753,-121.950,2690,174240
419,1550000.0,5,4.25,6070,171626,2.0,0,0,3,12,6070,0,1999,0,98024,47.5954,-121.950,4680,211267
443,350000.0,3,1.50,1250,219978,1.0,0,0,4,6,1250,0,1980,0,98038,47.4056,-121.955,1930,210394
484,1385000.0,4,3.25,4860,181319,2.5,0,0,3,9,4860,0,1993,0,98074,47.6179,-122.005,3850,181319
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21325,659000.0,3,2.50,3090,384634,2.0,0,0,3,8,3090,0,2007,0,98019,47.7072,-121.927,2200,292645
21344,1488000.0,5,6.00,6880,279968,2.0,0,3,3,12,4070,2810,2007,0,98045,47.4624,-121.779,4690,256803
21351,565000.0,2,1.75,1130,276170,1.0,0,0,3,8,1130,0,2006,0,98022,47.2673,-122.027,2092,217800
21431,800000.0,4,3.25,3540,159430,2.0,0,0,3,9,3540,0,2007,0,98014,47.6285,-121.899,1940,392040


# Summary and Decision on removal or modification

1) Features with obviously wrong or self-contradictory data
bedrooms: one outlier has 33 bedrooms but only 1620 sqft_living, which is not physically plausible. This record (#15870) should be removed. The other outliers have 9-11 bedrooms, which is reasonable for their size; they are possibly multi-family homes.
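The contradiction can be made explicit by looking at living area per bedroom. The sketch below uses three rows copied from the bedrooms table above; the column `sqft_per_bedroom` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

# Three rows copied from the bedrooms outlier table (index = record number).
sample = pd.DataFrame(
    {"bedrooms": [9, 11, 33], "sqft_living": [3830, 3000, 1620]},
    index=[4096, 8757, 15870],
)

# Hypothetical helper column: living area available per bedroom.
sample["sqft_per_bedroom"] = sample["sqft_living"] / sample["bedrooms"]
print(sample)

# Remove the self-contradictory record by its index label.
cleaned = sample.drop(index=15870)
```

On the full data the removal would be `data.drop(index=15870)`, assuming the default integer index is still in place.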

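2) Features with no outliers detected by the threshold above: floors, view, condition, yr_built, yr_renovated, zipcode, lat. grade has a single flagged record and is covered here as well.

Further checks on these features:

floors: the whole range is within 1-3.5, which looks sensible.

view, condition and grade: all values fall within the defined range. The one flagged grade record (#19452) has grade 1, the lowest grade in the data, with 290 sqft_living and no bedrooms or bathrooms.

yr_built: the oldest is 1900, unusual but possible.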

yr_renovated:
The oldest renovation year is 1934, which looks odd, but the more important issue is overall data quality. The data is sparse: only about 5% of the values are non-zero, i.e. only about 1 in 20 houses has a recorded renovation. That may still be reasonable overall. However, given some very old renovation years and the subjective definition of a major renovation, the question is less about taking out a few records and more about whether to use this feature at all (to be examined during further checking or training). One alternative is to replace each zero with the value of yr_built, since that year reflects the true state of properties that have never been renovated. A further question is whether this feature, even with good data quality, would be redundant given the existing feature "condition".
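The replacement of zeros by yr_built could look like the sketch below. It uses three rows copied from the tables above; the column name `yr_last_updated` is a hypothetical name chosen for this illustration, not something defined in the dataset.

```python
import pandas as pd

# Three rows copied from the tables above (index = record number).
sample = pd.DataFrame(
    {"yr_built": [1982, 1921, 1940], "yr_renovated": [2008, 0, 2001]},
    index=[269, 312, 3914],
)

# Keep yr_renovated where it is non-zero; otherwise fall back to yr_built.
# "yr_last_updated" is a hypothetical column name introduced for this sketch.
sample["yr_last_updated"] = sample["yr_renovated"].where(
    sample["yr_renovated"] != 0, sample["yr_built"]
)
print(sample)
```

Writing the result to a new column keeps the original yr_renovated available, so the decision on whether to use the feature can still be made later.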

zipcode: confirmed against the list of King County zipcodes.

lat: according to Google Maps, King County spans roughly 47.08 to 47.78; the data is consistent with this range.

3) Features with outliers that can be explained trivially

long: according to Google Maps, King County spans roughly -122.53 to -121.06; the data is consistent with this range.
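The range checks for lat and long can be expressed as a simple bounding-box test. The sketch below uses the bounds quoted in this summary and three rows copied from the tables above; the constant names are illustrative.

```python
import pandas as pd

# Approximate King County bounds as quoted in the text.
LAT_RANGE = (47.08, 47.78)
LONG_RANGE = (-122.53, -121.06)

# Three rows copied from the tables above (index = record number).
sample = pd.DataFrame(
    {"lat": [47.2313, 47.3782, 47.7130], "long": [-122.023, -122.514, -121.315]},
    index=[1719, 2964, 2927],
)

# True for rows whose coordinates fall inside both ranges (bounds inclusive).
inside = sample["lat"].between(*LAT_RANGE) & sample["long"].between(*LONG_RANGE)
outside_rows = sample[~inside]
print(f"rows outside the bounding box: {len(outside_rows)}")
```

A bounding box is a coarse test: a point can lie inside the box and still be outside the county boundary, so it only rules out gross errors.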

waterfront:
All outliers are labelled 1. The feature is binary (0 or 1) and only 0.8% of records are labelled 1, so this is a matter of data sparsity, or potentially data quality, and not of genuine outliers. Depending on the landscape, the data may be genuinely sparse. Instead of removing all these records, we should consider whether this feature should be used at all (to be examined during further checking or training), since there is a further question of whether it is redundant given the existing feature "view".
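Why every waterfront = 1 record is flagged follows directly from the z-score formula for a 0/1 feature. The sketch below takes the 0.8% share quoted above as given and computes the z-score that every 1 and every 0 receives.

```python
import math

p = 0.008  # share of rows labelled waterfront = 1, as quoted above

# For a 0/1 feature: mean = p, standard deviation = sqrt(p * (1 - p)).
mean = p
std = math.sqrt(p * (1 - p))

z_for_one = (1 - mean) / std   # z-score shared by every waterfront = 1 row
z_for_zero = (0 - mean) / std  # z-score shared by every waterfront = 0 row

print(f"z-score of a 1: {z_for_one:.2f}")
print(f"z-score of a 0: {z_for_zero:.2f}")
```

With a share this small, the z-score of a 1 is well above 5, so the threshold flags the whole minority class. This is a property of the feature's distribution and says nothing about whether individual records are wrong.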

bathrooms: the outliers have 6-8 bathrooms, which is reasonable for their size (at least 4050 sqft_living).

4) Features about size and price

sqft_living: note that it is simply the sum of sqft_above and sqft_basement, so one of these three feature columns should be removed before training. Among them we focus on sqft_living, which goes up to 13540. This is rare but possible, as the largest house for sale in Seattle is around this size.
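The sum relationship can be checked directly. The sketch below uses four rows copied from the tables above; dropping sqft_above is an arbitrary choice made for illustration, since the text only says that one of the three columns should go.

```python
import pandas as pd

# Four rows copied from the tables above (index = record number).
sample = pd.DataFrame(
    {
        "sqft_living": [3650, 5050, 12050, 13540],
        "sqft_above": [1830, 4750, 8570, 9410],
        "sqft_basement": [1820, 300, 3480, 4130],
    },
    index=[246, 269, 7252, 12777],
)

# True where sqft_living equals sqft_above + sqft_basement.
is_sum = sample["sqft_living"] == sample["sqft_above"] + sample["sqft_basement"]
print(f"identity holds for all sample rows: {is_sum.all()}")

# One of the three columns is redundant; which one to drop is a modelling choice.
reduced = sample.drop(columns=["sqft_above"])
```

Running the same comparison on the full `data` frame would confirm whether the identity holds for every record or only for most of them.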

sqft_living15: similar to sqft_living, but it refers to the average sqft_living of the nearest 15 neighbours. Outliers go up to 6210. By the same argument as for sqft_living, this is rare but possible.

sqft_lot: outliers go up to 1651359, a very large number, but this can be the case for a farmhouse on a large lot. We give the data the benefit of the doubt, as the correctness of this feature is hard to judge on its own.

sqft_lot15: similar to sqft_lot; we give the data the benefit of the doubt.

price: in a second screening with a z-score threshold of 10, the outliers are at 4.5m - 7.7m, all with sizeable sqft_living, and these amounts are on par with the most luxurious homes in Seattle.

