<a href="https://colab.research.google.com/github/ezorigo/DS-Unit-2-Applied-Modeling/blob/master/module1/assignment_applied_modeling_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<!-- Lambda School Data Science, Unit 2: Predictive Modeling -->

# Applied Modeling, Module 1

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Choose which observations you will use to train, validate, and test your model, and which observations, if any, to exclude.
- [ ] Determine whether your problem is regression or classification.
- [ ] Choose your evaluation metric.
- [ ] Begin with baselines: majority class baseline for classification, or mean baseline for regression, with your metric of choice.
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" information from the future?
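As a rough illustration of the baseline task above, the sketch below computes a mean baseline (scored with mean absolute error) and a majority class baseline (scored with accuracy). The toy data, the column names `price` and `sold_fast`, and the choice of metrics are all assumptions made for the example; substitute your own target and metric.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real project dataset.
train = pd.DataFrame({
    "price": [200_000, 300_000, 400_000, 500_000],
    "sold_fast": [1, 1, 1, 0],
})
val = pd.DataFrame({
    "price": [250_000, 450_000],
    "sold_fast": [1, 0],
})

# Regression: predict the training-set mean for every validation row,
# then score with mean absolute error (MAE).
mean_price = train["price"].mean()
mean_pred = np.full(len(val), mean_price)
baseline_mae = np.abs(val["price"].to_numpy() - mean_pred).mean()

# Classification: predict the most frequent training class for every row,
# then score with accuracy.
majority_class = train["sold_fast"].mode()[0]
majority_pred = np.full(len(val), majority_class)
baseline_accuracy = (val["sold_fast"].to_numpy() == majority_pred).mean()

print(f"Mean baseline MAE: {baseline_mae:,.0f}")
print(f"Majority class baseline accuracy: {baseline_accuracy:.2f}")
```

Both baselines are fit on the training rows only and scored on held-out rows, so later models can be compared against them on the same split.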

## Reading

### ROC AUC
- [Machine Learning Meets Economics](http://blog.mldb.ai/blog/posts/2016/01/ml-meets-economics/)
- [ROC curves and Area Under the Curve explained](https://www.dataschool.io/roc-curves-and-auc-explained/)
- [The philosophical argument for using ROC curves](https://lukeoakdenrayner.wordpress.com/2018/01/07/the-philosophical-argument-for-using-roc-curves/)

### Imbalanced Classes
- [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn)
- [Learning from Imbalanced Classes](https://www.svds.com/tbt-learning-imbalanced-classes/)

### Last lesson
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [How Shopify Capital Uses Quantile Regression To Help Merchants Succeed](https://engineering.shopify.com/blogs/engineering/how-shopify-uses-machine-learning-to-help-our-merchants-grow-their-business)
- [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario far beyond what's in the lecture notebook.
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)

In [2]:
# Upload dataset
from google.colab import files
uploaded = files.upload()

Saving DC_Properties.csv to DC_Properties.csv


In [51]:
# read dataset using pandas
import pandas as pd 

df = pd.read_csv('DC_Properties.csv', low_memory=False)

print(df.shape)
df.head()

(158957, 49)


Unnamed: 0.1,Unnamed: 0,BATHRM,HF_BATHRM,HEAT,AC,NUM_UNITS,ROOMS,BEDRM,AYB,YR_RMDL,EYB,STORIES,SALEDATE,PRICE,QUALIFIED,SALE_NUM,GBA,BLDG_NUM,STYLE,STRUCT,GRADE,CNDTN,EXTWALL,ROOF,INTWALL,KITCHENS,FIREPLACES,USECODE,LANDAREA,GIS_LAST_MOD_DTTM,SOURCE,CMPLX_NUM,LIVING_GBA,FULLADDRESS,CITY,STATE,ZIPCODE,NATIONALGRID,LATITUDE,LONGITUDE,ASSESSMENT_NBHD,ASSESSMENT_SUBNBHD,CENSUS_TRACT,CENSUS_BLOCK,WARD,SQUARE,X,Y,QUADRANT
0,0,4,0,Warm Cool,Y,2.0,8,4,1910.0,1988.0,1972,3.0,2003-11-25 00:00:00,1095000.0,Q,1,2522.0,1,3 Story,Row Inside,Very Good,Good,Common Brick,Metal- Sms,Hardwood,2.0,5,24,1680,2018-07-22 18:01:43,Residential,,,1748 SWANN STREET NW,WASHINGTON,DC,20009.0,18S UJ 23061 09289,38.91468,-77.040832,Old City 2,040 D Old City 2,4201.0,004201 2006,Ward 2,152,-77.040429,38.914881,NW
1,1,3,1,Warm Cool,Y,2.0,11,5,1898.0,2007.0,1972,3.0,2000-08-17 00:00:00,,U,1,2567.0,1,3 Story,Row Inside,Very Good,Good,Common Brick,Built Up,Hardwood,2.0,4,24,1680,2018-07-22 18:01:43,Residential,,,1746 SWANN STREET NW,WASHINGTON,DC,20009.0,18S UJ 23067 09289,38.914683,-77.040764,Old City 2,040 D Old City 2,4201.0,004201 2006,Ward 2,152,-77.040429,38.914881,NW
2,2,3,1,Hot Water Rad,Y,2.0,9,5,1910.0,2009.0,1984,3.0,2016-06-21 00:00:00,2100000.0,Q,3,2522.0,1,3 Story,Row Inside,Very Good,Very Good,Common Brick,Built Up,Hardwood,2.0,4,24,1680,2018-07-22 18:01:43,Residential,,,1744 SWANN STREET NW,WASHINGTON,DC,20009.0,18S UJ 23074 09289,38.914684,-77.040678,Old City 2,040 D Old City 2,4201.0,004201 2006,Ward 2,152,-77.040429,38.914881,NW
3,3,3,1,Hot Water Rad,Y,2.0,8,5,1900.0,2003.0,1984,3.0,2006-07-12 00:00:00,1602000.0,Q,1,2484.0,1,3 Story,Row Inside,Very Good,Good,Common Brick,Built Up,Hardwood,2.0,3,24,1680,2018-07-22 18:01:43,Residential,,,1742 SWANN STREET NW,WASHINGTON,DC,20009.0,18S UJ 23078 09288,38.914683,-77.040629,Old City 2,040 D Old City 2,4201.0,004201 2006,Ward 2,152,-77.040429,38.914881,NW
4,4,2,1,Warm Cool,Y,1.0,11,3,1913.0,2012.0,1985,3.0,,,U,1,5255.0,1,3 Story,Semi-Detached,Very Good,Good,Common Brick,Neopren,Hardwood,1.0,0,13,2032,2018-07-22 18:01:43,Residential,,,1804 NEW HAMPSHIRE AVENUE NW,WASHINGTON,DC,20009.0,18S UJ 23188 09253,38.914383,-77.039361,Old City 2,040 D Old City 2,4201.0,004201 2006,Ward 2,152,-77.040429,38.914881,NW
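
The preview above suggests a few first cleaning steps: `Unnamed: 0` looks like a saved row index, `SALEDATE` is stored as text, and `PRICE` is missing for some rows. The sketch below shows one possible way to handle these and to split by sale date. It runs on a hypothetical five-row miniature (`df_small`) rather than the full file, and the 2010 cutoff is an arbitrary assumption, not a recommendation.

```python
import pandas as pd

# Hypothetical miniature of the table above; only a few columns are kept.
df_small = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "SALEDATE": ["2003-11-25 00:00:00", "2000-08-17 00:00:00",
                 "2016-06-21 00:00:00", "2006-07-12 00:00:00", None],
    "PRICE": [1095000.0, None, 2100000.0, 1602000.0, None],
    "QUALIFIED": ["Q", "U", "Q", "Q", "U"],
})

# Drop the column that appears to be a leftover row index (assumption).
df_small = df_small.drop(columns=["Unnamed: 0"])

# Parse sale dates so rows can be ordered and split by time.
df_small["SALEDATE"] = pd.to_datetime(df_small["SALEDATE"])

# Keep only rows where both the target and the sale date are known.
sales = df_small.dropna(subset=["PRICE", "SALEDATE"])

# Assumed cutoff: train on sales before 2010, validate on later sales.
cutoff = pd.Timestamp("2010-01-01")
train = sales[sales["SALEDATE"] < cutoff]
val = sales[sales["SALEDATE"] >= cutoff]

print(len(train), len(val))
```

Splitting on time rather than at random is one way to reduce the risk of training on information from the future when the goal is to predict later sales.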


In [0]:
# Let's do some exploring and cleaning
import pandas_profiling

profile = pandas_profiling.ProfileReport(df)
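
Alongside a full profiling report, a quick per-column summary of missing values can help decide which features to exclude. This is a minimal sketch on a hypothetical toy frame (`toy`); in the notebook the same calls would be applied to `df`. The 50% threshold is an assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the notebook this would be `df`.
toy = pd.DataFrame({
    "PRICE": [1095000.0, np.nan, 2100000.0, np.nan],
    "BATHRM": [4, 3, 3, 2],
    "YR_RMDL": [1988.0, np.nan, np.nan, np.nan],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns above an assumed 50% threshold are candidates for exclusion.
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(mostly_missing)
```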

In [24]:
profile

Dataset info,Value
Number of variables,49
Number of observations,158957
Total Missing (%),15.4%
Total size in memory,59.4 MiB
Average record size in memory,392.0 B

Variable type,Count
Numeric,23
Categorical,23
Boolean,0
Date,0
Text (Unique),0
Rejected,3
Unsupported,0

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Y,114620
N,44272
0,65

Value,Count,Frequency (%),Unnamed: 3
Y,114620,72.1%,
N,44272,27.9%,
0,65,0.0%,

0,1
Distinct count,58
Unique (%),0.0%
Missing (%),0.0%
Missing (n),1

0,1
Old City 2,15978
Old City 1,15000
Columbia Heights,9474
Other values (54),118504

Value,Count,Frequency (%),Unnamed: 3
Old City 2,15978,10.1%,
Old City 1,15000,9.4%,
Columbia Heights,9474,6.0%,
Brookland,6568,4.1%,
Petworth,6323,4.0%,
Deanwood,5983,3.8%,
Chevy Chase,5354,3.4%,
Congress Heights,4729,3.0%,
Brightwood,4112,2.6%,
Mt. Pleasant,4052,2.5%,

0,1
Distinct count,122
Unique (%),0.1%
Missing (%),20.5%
Missing (n),32551

0,1
040 D Old City 2,4403
040 E Old City 2,2968
040 C Old City 2,2886
Other values (118),116149
(Missing),32551

Value,Count,Frequency (%),Unnamed: 3
040 D Old City 2,4403,2.8%,
040 E Old City 2,2968,1.9%,
040 C Old City 2,2886,1.8%,
042 B Petworth,2763,1.7%,
039 K Old City 1,2640,1.7%,
007 E Brookland,2388,1.5%,
040 B Old City 2,2289,1.4%,
015 D Columbia Heights,2246,1.4%,
015 A Columbia Heights,2206,1.4%,
015 E Columbia Heights,2183,1.4%,

0,1
Distinct count,221
Unique (%),0.1%
Missing (%),0.2%
Missing (n),271
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1942
Minimum,1754
Maximum,2019
Zeros (%),0.0%

0,1
Minimum,1754
5-th percentile,1900
Q1,1918
Median,1937
Q3,1960
95-th percentile,2007
Maximum,2019
Range,265
Interquartile range,42

0,1
Standard deviation,33.64
Coef of variation,0.017323
Kurtosis,-0.077994
Mean,1942
MAD,26.631
Skewness,0.51125
Sum,308170000
Variance,1131.7
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
1900.0,8967,5.6%,
1925.0,5129,3.2%,
1910.0,4563,2.9%,
1940.0,4316,2.7%,
1923.0,3724,2.3%,
1927.0,3707,2.3%,
1941.0,3420,2.2%,
1926.0,3117,2.0%,
1942.0,3058,1.9%,
1939.0,2849,1.8%,

Value,Count,Frequency (%),Unnamed: 3
1754.0,2,0.0%,
1765.0,1,0.0%,
1776.0,3,0.0%,
1780.0,4,0.0%,
1782.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2015.0,906,0.6%,
2016.0,1016,0.6%,
2017.0,592,0.4%,
2018.0,98,0.1%,
2019.0,1,0.0%,

0,1
Distinct count,15
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.8107
Minimum,0
Maximum,14
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,1
Q1,1
Median,2
Q3,2
95-th percentile,4
Maximum,14
Range,14
Interquartile range,1

0,1
Standard deviation,0.9764
Coef of variation,0.53924
Kurtosis,3.8939
Mean,1.8107
MAD,0.76178
Skewness,1.5147
Sum,287820
Variance,0.95335
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
1,74555,46.9%,
2,53325,33.5%,
3,20785,13.1%,
4,8119,5.1%,
5,1367,0.9%,
6,500,0.3%,
7,129,0.1%,
8,71,0.0%,
0,58,0.0%,
9,22,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,58,0.0%,
1,74555,46.9%,
2,53325,33.5%,
3,20785,13.1%,
4,8119,5.1%,

Value,Count,Frequency (%),Unnamed: 3
10,14,0.0%,
11,7,0.0%,
12,3,0.0%,
13,1,0.0%,
14,1,0.0%,

0,1
Distinct count,20
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2.7325
Minimum,0
Maximum,24
Zeros (%),3.3%

0,1
Minimum,0
5-th percentile,1
Q1,2
Median,3
Q3,3
95-th percentile,5
Maximum,24
Range,24
Interquartile range,1

0,1
Standard deviation,1.3589
Coef of variation,0.4973
Kurtosis,2.951
Mean,2.7325
MAD,1.0313
Skewness,0.73077
Sum,434351
Variance,1.8465
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
3,57864,36.4%,
2,34946,22.0%,
4,24893,15.7%,
1,24181,15.2%,
5,6898,4.3%,
0,5297,3.3%,
6,3090,1.9%,
8,792,0.5%,
7,750,0.5%,
9,123,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0,5297,3.3%,
1,24181,15.2%,
2,34946,22.0%,
3,57864,36.4%,
4,24893,15.7%,

Value,Count,Frequency (%),Unnamed: 3
15,3,0.0%,
16,2,0.0%,
19,1,0.0%,
20,1,0.0%,
24,1,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.0006
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,1
Median,1
Q3,1
95-th percentile,1
Maximum,5
Range,4
Interquartile range,0

0,1
Standard deviation,0.031622
Coef of variation,0.031603
Kurtosis,6428.6
Mean,1.0006
MAD,0.0011947
Skewness,71.372
Sum,159052
Variance,0.00099992
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
1,158884,100.0%,
2,59,0.0%,
3,8,0.0%,
4,4,0.0%,
5,2,0.0%,

0,1
Distinct count,3849
Unique (%),2.4%
Missing (%),33.3%
Missing (n),52906

0,1
009000 1001,340
009201 1004,312
009509 3004,206
Other values (3845),105193
(Missing),52906

Value,Count,Frequency (%),Unnamed: 3
009000 1001,340,0.2%,
009201 1004,312,0.2%,
009509 3004,206,0.1%,
009904 2009,204,0.1%,
009508 2005,195,0.1%,
009000 1010,189,0.1%,
007809 1001,175,0.1%,
000300 3003,170,0.1%,
009700 1012,160,0.1%,
000801 2008,158,0.1%,

0,1
Distinct count,177
Unique (%),0.1%
Missing (%),0.0%
Missing (n),1
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,5348.2
Minimum,100
Maximum,11100
Zeros (%),0.0%

0,1
Minimum,100
5-th percentile,502
Q1,2102
Median,5201
Q3,8302
95-th percentile,10200
Maximum,11100
Range,11000
Interquartile range,6200

0,1
Standard deviation,3369.6
Coef of variation,0.63005
Kurtosis,-1.425
Mean,5348.2
MAD,3021.8
Skewness,0.0078893
Sum,850130000
Variance,11355000
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
5500.0,2933,1.8%,
801.0,2620,1.6%,
1001.0,2552,1.6%,
300.0,2182,1.4%,
5301.0,2179,1.4%,
100.0,2090,1.3%,
1500.0,2081,1.3%,
4400.0,1960,1.2%,
1100.0,1879,1.2%,
5201.0,1766,1.1%,

Value,Count,Frequency (%),Unnamed: 3
100.0,2090,1.3%,
202.0,1684,1.1%,
300.0,2182,1.4%,
400.0,605,0.4%,
501.0,519,0.3%,

Value,Count,Frequency (%),Unnamed: 3
10700.0,392,0.2%,
10800.0,386,0.2%,
10900.0,160,0.1%,
11000.0,911,0.6%,
11100.0,1501,0.9%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),33.3%
Missing (n),52906

0,1
WASHINGTON,106051
(Missing),52906

Value,Count,Frequency (%),Unnamed: 3
WASHINGTON,106051,66.7%,
(Missing),52906,33.3%,

0,1
Distinct count,2914
Unique (%),1.8%
Missing (%),67.1%
Missing (n),106696
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2371.5
Minimum,1001
Maximum,5621
Zeros (%),0.0%

0,1
Minimum,1001
5-th percentile,1066
Q1,1501
Median,2265
Q3,2910
95-th percentile,5176
Maximum,5621
Range,4620
Interquartile range,1409

0,1
Standard deviation,1114.3
Coef of variation,0.46985
Kurtosis,1.1404
Mean,2371.5
MAD,835.89
Skewness,1.1417
Sum,123940000
Variance,1241600
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
1066.0,720,0.5%,
2423.0,615,0.4%,
1080.0,429,0.3%,
2282.0,423,0.3%,
2838.0,396,0.2%,
1657.0,360,0.2%,
2279.0,324,0.2%,
2661.0,302,0.2%,
2898.0,292,0.2%,
2430.0,291,0.2%,

Value,Count,Frequency (%),Unnamed: 3
1001.0,36,0.0%,
1002.0,157,0.1%,
1003.0,16,0.0%,
1004.0,21,0.0%,
1005.0,3,0.0%,

Value,Count,Frequency (%),Unnamed: 3
5616.0,71,0.0%,
5617.0,4,0.0%,
5619.0,11,0.0%,
5620.0,4,0.0%,
5621.0,2,0.0%,

0,1
Distinct count,8
Unique (%),0.0%
Missing (%),32.9%
Missing (n),52261

0,1
Average,58217
Good,37497
Very Good,8130
Other values (4),2852
(Missing),52261

Value,Count,Frequency (%),Unnamed: 3
Average,58217,36.6%,
Good,37497,23.6%,
Very Good,8130,5.1%,
Excellent,1338,0.8%,
Fair,1320,0.8%,
Poor,175,0.1%,
Default,19,0.0%,
(Missing),52261,32.9%,

0,1
Distinct count,26
Unique (%),0.0%
Missing (%),32.9%
Missing (n),52261

0,1
Common Brick,81068
Brick/Siding,5569
Vinyl Siding,5290
Other values (22),14769
(Missing),52261

Value,Count,Frequency (%),Unnamed: 3
Common Brick,81068,51.0%,
Brick/Siding,5569,3.5%,
Vinyl Siding,5290,3.3%,
Wood Siding,4540,2.9%,
Stucco,3216,2.0%,
Shingle,1181,0.7%,
Brick Veneer,1069,0.7%,
Aluminum,954,0.6%,
Stone,744,0.5%,
Brick/Stucco,673,0.4%,

0,1
Distinct count,135
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1963.7
Minimum,1800
Maximum,2018
Zeros (%),0.0%

0,1
Minimum,1800
5-th percentile,1919
Q1,1954
Median,1963
Q3,1975
95-th percentile,2009
Maximum,2018
Range,218
Interquartile range,21

0,1
Standard deviation,24.923
Coef of variation,0.012692
Kurtosis,0.62118
Mean,1963.7
MAD,18.012
Skewness,-0.12248
Sum,312146726
Variance,621.16
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
1957,12541,7.9%,
1954,12346,7.8%,
1967,10408,6.5%,
1964,9362,5.9%,
1960,7636,4.8%,
1969,7033,4.4%,
1943,5022,3.2%,
1919,4707,3.0%,
1950,4522,2.8%,
1947,3718,2.3%,

Value,Count,Frequency (%),Unnamed: 3
1800,4,0.0%,
1820,6,0.0%,
1865,4,0.0%,
1870,10,0.0%,
1875,100,0.1%,

Value,Count,Frequency (%),Unnamed: 3
2014,716,0.5%,
2015,1360,0.9%,
2016,1032,0.6%,
2017,886,0.6%,
2018,186,0.1%,

0,1
Distinct count,20
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2.3747
Minimum,0
Maximum,293920
Zeros (%),65.3%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,2
Maximum,293920
Range,293920
Interquartile range,1

0,1
Standard deviation,737.3
Coef of variation,310.48
Kurtosis,158880
Mean,2.3747
MAD,3.8549
Skewness,398.55
Sum,377471
Variance,543600
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
0,103837,65.3%,
1,40567,25.5%,
2,10779,6.8%,
3,2410,1.5%,
4,841,0.5%,
5,277,0.2%,
6,148,0.1%,
7,47,0.0%,
8,18,0.0%,
9,10,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,103837,65.3%,
1,40567,25.5%,
2,10779,6.8%,
3,2410,1.5%,
4,841,0.5%,

Value,Count,Frequency (%),Unnamed: 3
922,1,0.0%,
1017,1,0.0%,
1601,1,0.0%,
4068,1,0.0%,
293920,1,0.0%,

0,1
Distinct count,105979
Unique (%),66.7%
Missing (%),33.3%
Missing (n),52917

0,1
1755 STANTON TERRACE SE,5
1754 STANTON TERRACE SE,5
1508 SHIPPEN LANE SE,4
Other values (105975),106026
(Missing),52917

Value,Count,Frequency (%),Unnamed: 3
1755 STANTON TERRACE SE,5,0.0%,
1754 STANTON TERRACE SE,5,0.0%,
1508 SHIPPEN LANE SE,4,0.0%,
1517 SHIPPEN LANE SE,4,0.0%,
2600 TILDEN STREET NW,3,0.0%,
312 MILLERS COURT NE,3,0.0%,
1507 TOBIAS DRIVE SE,3,0.0%,
1530 34TH STREET NW,3,0.0%,
8000 PARKSIDE LANE NW,2,0.0%,
606 SOUTHERN AVENUE SE,2,0.0%,

0,1
Distinct count,4765
Unique (%),3.0%
Missing (%),32.9%
Missing (n),52261
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1714.5
Minimum,0
Maximum,45384
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,864
Q1,1190
Median,1480
Q3,1966
95-th percentile,3262
Maximum,45384
Range,45384
Interquartile range,776

0,1
Standard deviation,880.68
Coef of variation,0.51365
Kurtosis,135.31
Mean,1714.5
MAD,579.24
Skewness,5.6357
Sum,182930000
Variance,775590
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
1088.0,1782,1.1%,
1152.0,1661,1.0%,
1024.0,1405,0.9%,
832.0,1364,0.9%,
1280.0,1236,0.8%,
1080.0,1094,0.7%,
1200.0,877,0.6%,
1360.0,862,0.5%,
1440.0,815,0.5%,
800.0,723,0.5%,

Value,Count,Frequency (%),Unnamed: 3
0.0,15,0.0%,
180.0,1,0.0%,
252.0,1,0.0%,
299.0,1,0.0%,
340.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
21210.0,1,0.0%,
24030.0,1,0.0%,
27451.0,1,0.0%,
41604.0,1,0.0%,
45384.0,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
2018-07-22 18:01:43,106696
2018-07-22 18:01:38,52261

Value,Count,Frequency (%),Unnamed: 3
2018-07-22 18:01:43,106696,67.1%,
2018-07-22 18:01:38,52261,32.9%,

0,1
Distinct count,14
Unique (%),0.0%
Missing (%),32.9%
Missing (n),52261

0,1
Average,37357
Above Average,32101
Good Quality,20800
Other values (10),16438
(Missing),52261

Value,Count,Frequency (%),Unnamed: 3
Average,37357,23.5%,
Above Average,32101,20.2%,
Good Quality,20800,13.1%,
Very Good,8976,5.6%,
Excellent,3390,2.1%,
Superior,2634,1.7%,
Exceptional-A,818,0.5%,
Exceptional-B,278,0.2%,
Fair Quality,150,0.1%,
Exceptional-C,92,0.1%,

0,1
Distinct count,14
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Forced Air,53972
Hot Water Rad,47202
Warm Cool,33628
Other values (11),24155

Value,Count,Frequency (%),Unnamed: 3
Forced Air,53972,34.0%,
Hot Water Rad,47202,29.7%,
Warm Cool,33628,21.2%,
Ht Pump,21412,13.5%,
Wall Furnace,1120,0.7%,
Water Base Brd,402,0.3%,
Elec Base Brd,351,0.2%,
No Data,330,0.2%,
Electric Rad,144,0.1%,
Gravity Furnac,140,0.1%,

0,1
Distinct count,10
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.45824
Minimum,0
Maximum,11
Zeros (%),58.6%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,1
Maximum,11
Range,11
Interquartile range,1

0,1
Standard deviation,0.58757
Coef of variation,1.2822
Kurtosis,2.0746
Mean,0.45824
MAD,0.53705
Skewness,1.0741
Sum,72840
Variance,0.34524
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
0,93148,58.6%,
1,59258,37.3%,
2,6186,3.9%,
3,289,0.2%,
4,56,0.0%,
5,12,0.0%,
7,3,0.0%,
6,3,0.0%,
11,1,0.0%,
9,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,93148,58.6%,
1,59258,37.3%,
2,6186,3.9%,
3,289,0.2%,
4,56,0.0%,

Value,Count,Frequency (%),Unnamed: 3
5,12,0.0%,
6,3,0.0%,
7,3,0.0%,
9,1,0.0%,
11,1,0.0%,

0,1
Distinct count,13
Unique (%),0.0%
Missing (%),32.9%
Missing (n),52261

0,1
Hardwood,83643
Hardwood/Carp,10938
Wood Floor,8170
Other values (9),3945
(Missing),52261

Value,Count,Frequency (%),Unnamed: 3
Hardwood,83643,52.6%,
Hardwood/Carp,10938,6.9%,
Wood Floor,8170,5.1%,
Carpet,3563,2.2%,
Lt Concrete,141,0.1%,
Default,110,0.1%,
Ceramic Tile,50,0.0%,
Vinyl Comp,28,0.0%,
Parquet,19,0.0%,
Resiliant,15,0.0%,

0,1
Correlation,0.93061

0,1
Distinct count,11359
Unique (%),7.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2473.3
Minimum,0
Maximum,942632
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,137
Q1,697
Median,1649
Q3,3000
95-th percentile,7475
Maximum,942632
Range,942632
Interquartile range,2303

0,1
Standard deviation,5059
Coef of variation,2.0455
Kurtosis,11264
Mean,2473.3
MAD,1882.7
Skewness,78.59
Sum,393145512
Variance,25594000
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
1800,1071,0.7%,
2000,1020,0.6%,
4000,848,0.5%,
5000,833,0.5%,
1600,792,0.5%,
2500,601,0.4%,
1700,562,0.4%,
1440,552,0.3%,
1500,530,0.3%,
3000,511,0.3%,

Value,Count,Frequency (%),Unnamed: 3
0,72,0.0%,
1,2,0.0%,
2,2,0.0%,
3,2,0.0%,
4,12,0.0%,

Value,Count,Frequency (%),Unnamed: 3
339658,1,0.0%,
451804,1,0.0%,
498734,1,0.0%,
691817,1,0.0%,
942632,1,0.0%,

0,1
Distinct count,105523
Unique (%),66.4%
Missing (%),0.0%
Missing (n),1
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,38.915
Minimum,38.82
Maximum,38.996
Zeros (%),0.0%

0,1
Minimum,38.82
5-th percentile,38.859
Q1,38.895
Median,38.915
Q3,38.936
95-th percentile,38.965
Maximum,38.996
Range,0.17581
Interquartile range,0.04065

0,1
Standard deviation,0.031723
Coef of variation,0.00081518
Kurtosis,0.022501
Mean,38.915
MAD,0.02506
Skewness,-0.2982
Sum,6185700
Variance,0.0010063
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
38.93466835,1128,0.7%,
38.88082098,1022,0.6%,
38.9094433,592,0.4%,
38.87478919500001,524,0.3%,
38.90314058,504,0.3%,
38.94449932,429,0.3%,
38.89542487,428,0.3%,
38.86303776,410,0.3%,
38.90445577,406,0.3%,
38.928060825,367,0.2%,

Value,Count,Frequency (%),Unnamed: 3
38.81973129,1,0.0%,
38.81978931,1,0.0%,
38.81988895,1,0.0%,
38.819943,1,0.0%,
38.81995335,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
38.99503065,1,0.0%,
38.99516273,1,0.0%,
38.99530086,1,0.0%,
38.9954352,1,0.0%,
38.99553969,1,0.0%,

0,1
Distinct count,2217
Unique (%),1.4%
Missing (%),67.1%
Missing (n),106696
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,888.83
Minimum,0
Maximum,8553
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,440
Q1,616
Median,783
Q3,1060
95-th percentile,1662
Maximum,8553
Range,8553
Interquartile range,444

0,1
Standard deviation,420.19
Coef of variation,0.47274
Kurtosis,15.695
Mean,888.83
MAD,299.92
Skewness,2.5564
Sum,46451000
Variance,176560
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
888.0,205,0.1%,
740.0,185,0.1%,
1210.0,179,0.1%,
670.0,175,0.1%,
1332.0,168,0.1%,
810.0,148,0.1%,
575.0,145,0.1%,
504.0,144,0.1%,
625.0,143,0.1%,
749.0,137,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1,0.0%,
104.0,1,0.0%,
148.0,1,0.0%,
199.0,1,0.0%,
209.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
6034.0,1,0.0%,
6116.0,1,0.0%,
6145.0,1,0.0%,
7164.0,1,0.0%,
8553.0,1,0.0%,

0,1
Distinct count,105936
Unique (%),66.6%
Missing (%),0.0%
Missing (n),1
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,-77.017
Minimum,-77.114
Maximum,-76.91
Zeros (%),0.0%

0,1
Minimum,-77.114
5-th percentile,-77.083
Q1,-77.043
Median,-77.02
Q3,-76.989
95-th percentile,-76.941
Maximum,-76.91
Range,0.20415
Interquartile range,0.054266

0,1
Standard deviation,0.040938
Coef of variation,-0.00053155
Kurtosis,-0.38794
Mean,-77.017
MAD,0.032999
Skewness,0.16701
Sum,-12242000
Variance,0.001676
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
-77.08468826,1128,0.7%,
-77.01427073,1022,0.6%,
-77.039692675,592,0.4%,
-77.01630113,524,0.3%,
-77.01777614,504,0.3%,
-77.06124775,429,0.3%,
-77.02156757,428,0.3%,
-76.94956535,410,0.3%,
-77.03105732,406,0.3%,
-77.0792663,367,0.2%,

Value,Count,Frequency (%),Unnamed: 3
-77.11390873,1,0.0%,
-77.1138097,1,0.0%,
-77.11377421,1,0.0%,
-77.1136275,1,0.0%,
-77.11356932,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
-76.90988281,1,0.0%,
-76.90984731,1,0.0%,
-76.90984266,1,0.0%,
-76.9097583,1,0.0%,
-76.90975796,1,0.0%,

0,1
Distinct count,105950
Unique (%),66.7%
Missing (%),33.3%
Missing (n),52906

0,1
18S UJ 28168 01936,5
18S UJ 28233 01950,5
18S UJ 28025 01949,4
Other values (105946),106037
(Missing),52906

Value,Count,Frequency (%),Unnamed: 3
18S UJ 28168 01936,5,0.0%,
18S UJ 28233 01950,5,0.0%,
18S UJ 28025 01949,4,0.0%,
18S UJ 28045 01888,4,0.0%,
18S UJ 25398 04622,4,0.0%,
18S UJ 21962 12164,3,0.0%,
18S UJ 28027 01972,3,0.0%,
18S UJ 20689 08775,3,0.0%,
18S UJ 26425 06527,3,0.0%,
18S UJ 26815 06218,2,0.0%,

0,1
Distinct count,8
Unique (%),0.0%
Missing (%),32.9%
Missing (n),52261
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.198
Minimum,0
Maximum,6
Zeros (%),0.1%

0,1
Minimum,0
5-th percentile,1
Q1,1
Median,1
Q3,1
95-th percentile,2
Maximum,6
Range,6
Interquartile range,0

0,1
Standard deviation,0.59692
Coef of variation,0.49825
Kurtosis,12.386
Mean,1.198
MAD,0.34712
Skewness,3.4679
Sum,127830
Variance,0.35632
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
1.0,92491,58.2%,
2.0,9864,6.2%,
4.0,3059,1.9%,
3.0,1101,0.7%,
0.0,168,0.1%,
5.0,10,0.0%,
6.0,3,0.0%,
(Missing),52261,32.9%,

Value,Count,Frequency (%),Unnamed: 3
0.0,168,0.1%,
1.0,92491,58.2%,
2.0,9864,6.2%,
3.0,1101,0.7%,
4.0,3059,1.9%,

Value,Count,Frequency (%),Unnamed: 3
2.0,9864,6.2%,
3.0,1101,0.7%,
4.0,3059,1.9%,
5.0,10,0.0%,
6.0,3,0.0%,

0,1
Distinct count,13487
Unique (%),8.5%
Missing (%),38.2%
Missing (n),60741
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,931350
Minimum,1
Maximum,137430000
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,93087
Q1,240000
Median,400000
Q3,652000
95-th percentile,1350000
Maximum,137430000
Range,137430000
Interquartile range,412000

0,1
Standard deviation,7061300
Coef of variation,7.5818
Kurtosis,344.9
Mean,931350
MAD,951010
Skewness,18.316
Sum,91474000000
Variance,49862000000000
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
350000.0,595,0.4%,
250000.0,536,0.3%,
300000.0,523,0.3%,
450000.0,519,0.3%,
375000.0,488,0.3%,
325000.0,482,0.3%,
550000.0,459,0.3%,
275000.0,455,0.3%,
500000.0,426,0.3%,
320000.0,425,0.3%,

Value,Count,Frequency (%),Unnamed: 3
1.0,5,0.0%,
10.0,4,0.0%,
250.0,14,0.0%,
500.0,4,0.0%,
936.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
25000000.0,1,0.0%,
25100000.0,1,0.0%,
53696391.0,1,0.0%,
53969391.0,118,0.1%,
137427545.0,242,0.2%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.1%
Missing (n),237

0,1
NW,89736
NE,37675
SE,27224

Value,Count,Frequency (%),Unnamed: 3
NW,89736,56.5%,
NE,37675,23.7%,
SE,27224,17.1%,
SW,4085,2.6%,
(Missing),237,0.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
U,82608
Q,76349

Value,Count,Frequency (%),Unnamed: 3
U,82608,52.0%,
Q,76349,48.0%,

0,1
Distinct count,17
Unique (%),0.0%
Missing (%),32.9%
Missing (n),52261

0,1
Built Up,31402
Comp Shingle,30301
Metal- Sms,29957
Other values (13),15036
(Missing),52261

Value,Count,Frequency (%),Unnamed: 3
Built Up,31402,19.8%,
Comp Shingle,30301,19.1%,
Metal- Sms,29957,18.8%,
Slate,11135,7.0%,
Neopren,1254,0.8%,
Shake,907,0.6%,
Clay Tile,654,0.4%,
Shingle,433,0.3%,
Metal- Pre,244,0.2%,
Typical,229,0.1%,

0,1
Distinct count,40
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,6.1877
Minimum,0
Maximum,48
Zeros (%),0.1%

0,1
Minimum,0
5-th percentile,3
Q1,4
Median,6
Q3,7
95-th percentile,11
Maximum,48
Range,48
Interquartile range,3

0,1
Standard deviation,2.6182
Coef of variation,0.42312
Kurtosis,4.5636
Mean,6.1877
MAD,1.9149
Skewness,1.2834
Sum,983584
Variance,6.8548
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
6,37259,23.4%,
7,22338,14.1%,
4,20593,13.0%,
3,17759,11.2%,
5,16852,10.6%,
8,16327,10.3%,
9,7616,4.8%,
10,5909,3.7%,
2,5294,3.3%,
12,2929,1.8%,

Value,Count,Frequency (%),Unnamed: 3
0,138,0.1%,
1,96,0.1%,
2,5294,3.3%,
3,17759,11.2%,
4,20593,13.0%,

Value,Count,Frequency (%),Unnamed: 3
37,1,0.0%,
39,2,0.0%,
40,1,0.0%,
41,1,0.0%,
48,1,0.0%,

0,1
Distinct count,6938
Unique (%),4.4%
Missing (%),16.8%
Missing (n),26770

0,1
2007-04-10 00:00:00,413
1999-04-01 00:00:00,266
2001-01-01 00:00:00,258
Other values (6934),131250
(Missing),26770

Value,Count,Frequency (%),Unnamed: 3
2007-04-10 00:00:00,413,0.3%,
1999-04-01 00:00:00,266,0.2%,
2001-01-01 00:00:00,258,0.2%,
2015-11-17 00:00:00,160,0.1%,
2010-05-04 00:00:00,134,0.1%,
2017-06-14 00:00:00,124,0.1%,
2018-05-29 00:00:00,104,0.1%,
2016-10-31 00:00:00,95,0.1%,
2018-07-03 00:00:00,88,0.1%,
2018-05-15 00:00:00,85,0.1%,

0,1
Distinct count,15
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.68
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,1
Median,1
Q3,2
95-th percentile,4
Maximum,15
Range,14
Interquartile range,1

0,1
Standard deviation,1.2859
Coef of variation,0.7654
Kurtosis,4.8959
Mean,1.68
MAD,0.97259
Skewness,2.1317
Sum,267053
Variance,1.6535
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
1,113671,71.5%,
3,14738,9.3%,
2,12901,8.1%,
4,9851,6.2%,
5,4687,2.9%,
6,1970,1.2%,
7,703,0.4%,
8,261,0.2%,
9,108,0.1%,
10,37,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1,113671,71.5%,
2,12901,8.1%,
3,14738,9.3%,
4,9851,6.2%,
5,4687,2.9%,

Value,Count,Frequency (%),Unnamed: 3
11,17,0.0%,
12,6,0.0%,
13,3,0.0%,
14,2,0.0%,
15,2,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Residential,106696
Condominium,52261

Value,Count,Frequency (%),Unnamed: 3
Residential,106696,67.1%,
Condominium,52261,32.9%,

0,1
Distinct count,3292
Unique (%),2.1%
Missing (%),0.0%
Missing (n),0

0,1
1601,1366
0540,1022
1301,721
Other values (3289),155848

Value,Count,Frequency (%),Unnamed: 3
1601,1366,0.9%,
0540,1022,0.6%,
1301,721,0.5%,
0157,600,0.4%,
4325,559,0.4%,
0546,524,0.3%,
0515,504,0.3%,
2049,430,0.3%,
0457,428,0.3%,
2106,415,0.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),33.3%
Missing (n),52906

0,1
DC,106051
(Missing),52906

Value,Count,Frequency (%),Unnamed: 3
DC,106051,66.7%,
(Missing),52906,33.3%,

0,1
Distinct count,41
Unique (%),0.0%
Missing (%),32.9%
Missing (n),52305
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2.0918
Minimum,0
Maximum,826
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,1.5
Q1,2.0
Median,2.0
Q3,2.0
95-th percentile,3.0
Maximum,826.0
Range,826.0
Interquartile range,0.0

0,1
Standard deviation,2.9333
Coef of variation,1.4023
Kurtosis,60246
Mean,2.0918
MAD,0.27439
Skewness,228.69
Sum,223090
Variance,8.6044
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
2.0,79357,49.9%,
3.0,9230,5.8%,
2.5,6105,3.8%,
1.0,4683,2.9%,
1.5,2291,1.4%,
2.25,2225,1.4%,
1.75,1175,0.7%,
1.25,452,0.3%,
2.75,444,0.3%,
4.0,375,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,43,0.0%,
0.25,1,0.0%,
0.5,1,0.0%,
0.75,1,0.0%,
1.0,4683,2.9%,

Value,Count,Frequency (%),Unnamed: 3
43.0,1,0.0%,
65.0,1,0.0%,
250.0,1,0.0%,
275.0,2,0.0%,
826.0,1,0.0%,

0,1
Distinct count,10
Unique (%),0.0%
Missing (%),32.9%
Missing (n),52261

0,1
Row Inside,40593
Single,32063
Semi-Detached,16756
Other values (6),17284
(Missing),52261

Value,Count,Frequency (%),Unnamed: 3
Row Inside,40593,25.5%,
Single,32063,20.2%,
Semi-Detached,16756,10.5%,
Row End,12225,7.7%,
Multi,4726,3.0%,
Town Inside,218,0.1%,
Town End,85,0.1%,
Default,26,0.0%,
Vacant Land,4,0.0%,
(Missing),52261,32.9%,

0,1
Distinct count,19
Unique (%),0.0%
Missing (%),32.9%
Missing (n),52261

0,1
2 Story,81137
3 Story,9449
2.5 Story Fin,7000
Other values (15),9110
(Missing),52261

Value,Count,Frequency (%),Unnamed: 3
2 Story,81137,51.0%,
3 Story,9449,5.9%,
2.5 Story Fin,7000,4.4%,
1 Story,4420,2.8%,
1.5 Story Fin,2655,1.7%,
2.5 Story Unfin,729,0.5%,
4 Story,369,0.2%,
Split Level,303,0.2%,
Split Foyer,279,0.2%,
3.5 Story Fin,133,0.1%,

0,1
Distinct count,16
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,14.253
Minimum,11
Maximum,117
Zeros (%),0.0%

0,1
Minimum,11
5-th percentile,11
Q1,11
Median,13
Q3,17
95-th percentile,24
Maximum,117
Range,106
Interquartile range,6

0,1
Standard deviation,3.7257
Coef of variation,0.2614
Kurtosis,37.243
Mean,14.253
MAD,3.0242
Skewness,2.5568
Sum,2265614
Variance,13.881
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
11,45597,28.7%,
12,31623,19.9%,
17,27511,17.3%,
16,24741,15.6%,
13,16588,10.4%,
24,8272,5.2%,
23,4497,2.8%,
15,79,0.0%,
19,31,0.0%,
117,8,0.0%,

Value,Count,Frequency (%),Unnamed: 3
11,45597,28.7%,
12,31623,19.9%,
13,16588,10.4%,
15,79,0.0%,
16,24741,15.6%,

Value,Count,Frequency (%),Unnamed: 3
41,1,0.0%,
81,4,0.0%,
83,2,0.0%,
116,1,0.0%,
117,8,0.0%,

0,1
Distinct count,158957
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,79478
Minimum,0
Maximum,158956
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,7947.8
Q1,39739.0
Median,79478.0
Q3,119220.0
95-th percentile,151010.0
Maximum,158956.0
Range,158956.0
Interquartile range,79478.0

0,1
Standard deviation,45887
Coef of variation,0.57736
Kurtosis,-1.2
Mean,79478
MAD,39739
Skewness,0
Sum,12633584446
Variance,2105600000
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
2047,1,0.0%,
7465,1,0.0%,
54576,1,0.0%,
11567,1,0.0%,
9518,1,0.0%,
15661,1,0.0%,
13612,1,0.0%,
3371,1,0.0%,
1322,1,0.0%,
5416,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1,0.0%,
1,1,0.0%,
2,1,0.0%,
3,1,0.0%,
4,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
158952,1,0.0%,
158953,1,0.0%,
158954,1,0.0%,
158955,1,0.0%,
158956,1,0.0%,

0,1
Distinct count,9
Unique (%),0.0%
Missing (%),0.0%
Missing (n),1

0,1
Ward 6,23973
Ward 3,23688
Ward 4,22202
Other values (5),89093

Value,Count,Frequency (%),Unnamed: 3
Ward 6,23973,15.1%,
Ward 3,23688,14.9%,
Ward 4,22202,14.0%,
Ward 2,22167,13.9%,
Ward 5,21359,13.4%,
Ward 1,17455,11.0%,
Ward 7,17206,10.8%,
Ward 8,10906,6.9%,
(Missing),1,0.0%,

0,1
Correlation,0.9999

0,1
Correlation,0.99992

0,1
Distinct count,111
Unique (%),0.1%
Missing (%),49.1%
Missing (n),78029
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1998.2
Minimum,20
Maximum,2019
Zeros (%),0.0%

0,1
Minimum,20
5-th percentile,1973
Q1,1985
Median,2004
Q3,2010
95-th percentile,2016
Maximum,2019
Range,1999
Interquartile range,25

0,1
Standard deviation,16.576
Coef of variation,0.0082952
Kurtosis,2506.4
Mean,1998.2
MAD,12.876
Skewness,-21.693
Sum,161710000
Variance,274.76
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
2006.0,5029,3.2%,
2005.0,4937,3.1%,
2004.0,3985,2.5%,
2007.0,3771,2.4%,
1980.0,3310,2.1%,
2003.0,2951,1.9%,
2011.0,2856,1.8%,
2008.0,2766,1.7%,
1978.0,2690,1.7%,
2010.0,2680,1.7%,

Value,Count,Frequency (%),Unnamed: 3
20.0,1,0.0%,
1880.0,2,0.0%,
1900.0,2,0.0%,
1910.0,1,0.0%,
1911.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
2015.0,2595,1.6%,
2016.0,2190,1.4%,
2017.0,1991,1.3%,
2018.0,417,0.3%,
2019.0,1,0.0%,

0,1
Distinct count,25
Unique (%),0.0%
Missing (%),0.0%
Missing (n),1
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,20013
Minimum,20001
Maximum,20392
Zeros (%),0.0%

0,1
Minimum,20001
5-th percentile,20001
Q1,20007
Median,20011
Q3,20018
95-th percentile,20032
Maximum,20392
Range,391
Interquartile range,11

0,1
Standard deviation,15.627
Coef of variation,0.00078086
Kurtosis,403.5
Mean,20013
MAD,7.4459
Skewness,16.863
Sum,3181100000
Variance,244.21
Memory size,1.2 MiB

Value,Count,Frequency (%),Unnamed: 3
20011.0,16352,10.3%,
20002.0,16310,10.3%,
20009.0,13171,8.3%,
20019.0,12458,7.8%,
20016.0,10644,6.7%,
20001.0,10549,6.6%,
20020.0,9805,6.2%,
20007.0,9029,5.7%,
20003.0,8015,5.0%,
20008.0,6801,4.3%,

Value,Count,Frequency (%),Unnamed: 3
20001.0,10549,6.6%,
20002.0,16310,10.3%,
20003.0,8015,5.0%,
20004.0,1082,0.7%,
20005.0,3404,2.1%,

Value,Count,Frequency (%),Unnamed: 3
20032.0,5111,3.2%,
20036.0,1892,1.2%,
20037.0,3730,2.3%,
20052.0,19,0.0%,
20392.0,186,0.1%,

Unnamed: 0.1,Unnamed: 0,BATHRM,HF_BATHRM,HEAT,AC,NUM_UNITS,ROOMS,BEDRM,AYB,YR_RMDL,EYB,STORIES,SALEDATE,PRICE,QUALIFIED,SALE_NUM,GBA,BLDG_NUM,STYLE,STRUCT,GRADE,CNDTN,EXTWALL,ROOF,INTWALL,KITCHENS,FIREPLACES,USECODE,LANDAREA,GIS_LAST_MOD_DTTM,SOURCE,CMPLX_NUM,LIVING_GBA,FULLADDRESS,CITY,STATE,ZIPCODE,NATIONALGRID,LATITUDE,LONGITUDE,ASSESSMENT_NBHD,ASSESSMENT_SUBNBHD,CENSUS_TRACT,CENSUS_BLOCK,WARD,SQUARE,X,Y,QUADRANT
0,0,4,0,Warm Cool,Y,2.0,8,4,1910.0,1988.0,1972,3.0,2003-11-25 00:00:00,1095000.0,Q,1,2522.0,1,3 Story,Row Inside,Very Good,Good,Common Brick,Metal- Sms,Hardwood,2.0,5,24,1680,2018-07-22 18:01:43,Residential,,,1748 SWANN STREET NW,WASHINGTON,DC,20009.0,18S UJ 23061 09289,38.91468,-77.040832,Old City 2,040 D Old City 2,4201.0,004201 2006,Ward 2,152,-77.040429,38.914881,NW
1,1,3,1,Warm Cool,Y,2.0,11,5,1898.0,2007.0,1972,3.0,2000-08-17 00:00:00,,U,1,2567.0,1,3 Story,Row Inside,Very Good,Good,Common Brick,Built Up,Hardwood,2.0,4,24,1680,2018-07-22 18:01:43,Residential,,,1746 SWANN STREET NW,WASHINGTON,DC,20009.0,18S UJ 23067 09289,38.914683,-77.040764,Old City 2,040 D Old City 2,4201.0,004201 2006,Ward 2,152,-77.040429,38.914881,NW
2,2,3,1,Hot Water Rad,Y,2.0,9,5,1910.0,2009.0,1984,3.0,2016-06-21 00:00:00,2100000.0,Q,3,2522.0,1,3 Story,Row Inside,Very Good,Very Good,Common Brick,Built Up,Hardwood,2.0,4,24,1680,2018-07-22 18:01:43,Residential,,,1744 SWANN STREET NW,WASHINGTON,DC,20009.0,18S UJ 23074 09289,38.914684,-77.040678,Old City 2,040 D Old City 2,4201.0,004201 2006,Ward 2,152,-77.040429,38.914881,NW
3,3,3,1,Hot Water Rad,Y,2.0,8,5,1900.0,2003.0,1984,3.0,2006-07-12 00:00:00,1602000.0,Q,1,2484.0,1,3 Story,Row Inside,Very Good,Good,Common Brick,Built Up,Hardwood,2.0,3,24,1680,2018-07-22 18:01:43,Residential,,,1742 SWANN STREET NW,WASHINGTON,DC,20009.0,18S UJ 23078 09288,38.914683,-77.040629,Old City 2,040 D Old City 2,4201.0,004201 2006,Ward 2,152,-77.040429,38.914881,NW
4,4,2,1,Warm Cool,Y,1.0,11,3,1913.0,2012.0,1985,3.0,,,U,1,5255.0,1,3 Story,Semi-Detached,Very Good,Good,Common Brick,Neopren,Hardwood,1.0,0,13,2032,2018-07-22 18:01:43,Residential,,,1804 NEW HAMPSHIRE AVENUE NW,WASHINGTON,DC,20009.0,18S UJ 23188 09253,38.914383,-77.039361,Old City 2,040 D Old City 2,4201.0,004201 2006,Ward 2,152,-77.040429,38.914881,NW


In [15]:
df.isna().sum()

Unnamed: 0                 0
BATHRM                     0
HF_BATHRM                  0
HEAT                       0
AC                         0
NUM_UNITS              52261
ROOMS                      0
BEDRM                      0
AYB                      271
YR_RMDL                78029
EYB                        0
STORIES                52305
SALEDATE               26770
PRICE                  60741
QUALIFIED                  0
SALE_NUM                   0
GBA                    52261
BLDG_NUM                   0
STYLE                  52261
STRUCT                 52261
GRADE                  52261
CNDTN                  52261
EXTWALL                52261
ROOF                   52261
INTWALL                52261
KITCHENS               52262
FIREPLACES                 0
USECODE                    0
LANDAREA                   0
GIS_LAST_MOD_DTTM          0
SOURCE                     0
CMPLX_NUM             106696
LIVING_GBA            106696
FULLADDRESS            52917
CITY          

In [27]:
df.nunique().sort_values()

CITY                       1
STATE                      1
SOURCE                     2
QUALIFIED                  2
GIS_LAST_MOD_DTTM          2
AC                         3
QUADRANT                   4
BLDG_NUM                   5
NUM_UNITS                  7
CNDTN                      7
KITCHENS                   8
WARD                       8
STRUCT                     9
HF_BATHRM                 10
INTWALL                   12
GRADE                     13
HEAT                      14
BATHRM                    15
SALE_NUM                  15
USECODE                   16
ROOF                      16
STYLE                     18
FIREPLACES                20
BEDRM                     20
ZIPCODE                   24
EXTWALL                   25
ROOMS                     40
STORIES                   40
ASSESSMENT_NBHD           57
YR_RMDL                  110
ASSESSMENT_SUBNBHD       121
EYB                      135
CENSUS_TRACT             176
AYB                      220
LIVING_GBA    

In [0]:
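def datawrangle(x):
  """Return a cleaned copy of the housing DataFrame."""
  x = x.copy()

  # drop X and Y: near-duplicates of LONGITUDE and LATITUDE
  # (the profile report shows correlations of about 0.9999)
  x = x.drop(columns=['X', 'Y'])

  # drop redundant or useless columns: the saved index, the constant
  # STATE and CITY columns, and low-information source/address fields
  x = x.drop(columns=['Unnamed: 0', 'STATE', 'CITY', 'SOURCE', 'FULLADDRESS', 'NATIONALGRID'])

  # drop rows with a missing PRICE, since that's our target
  x = x.dropna(subset=['PRICE'])

  return x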
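
The column drops above rest on two checks: whether a column is constant, and whether one column nearly duplicates another. A minimal sketch of both checks follows. It uses a small made-up frame (`toy`, with invented coordinates and room counts), not the real dataset, so the printed numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for the housing data; every value here is made up.
toy = pd.DataFrame({
    'LONGITUDE': [-77.0408, -77.0301, -77.0155, -76.9902],
    'X':         [-77.0404, -77.0299, -77.0153, -76.9905],
    'CITY':      ['WASHINGTON'] * 4,
    'ROOMS':     [8, 11, 6, 9],
})

# A column with a single distinct value carries no signal for a model.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# Pearson correlation between the two coordinate columns; a value close
# to 1 suggests that one of them is redundant.
corr = toy['LONGITUDE'].corr(toy['X'])

print(constant_cols)
print(round(corr, 4))
```

On the real data, the same two checks could be run with `df.nunique()` and `df[['X', 'LONGITUDE']].corr()` before deciding what to drop.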

In [57]:
df = datawrangle(df)
print(df.shape)
df.dtypes

(98216, 41)


BATHRM                  int64
HF_BATHRM               int64
HEAT                   object
AC                     object
NUM_UNITS             float64
ROOMS                   int64
BEDRM                   int64
AYB                   float64
YR_RMDL               float64
EYB                     int64
STORIES               float64
SALEDATE               object
PRICE                 float64
QUALIFIED              object
SALE_NUM                int64
GBA                   float64
BLDG_NUM                int64
STYLE                  object
STRUCT                 object
GRADE                  object
CNDTN                  object
EXTWALL                object
ROOF                   object
INTWALL                object
KITCHENS              float64
FIREPLACES              int64
USECODE                 int64
LANDAREA                int64
GIS_LAST_MOD_DTTM      object
CMPLX_NUM             float64
LIVING_GBA            float64
ZIPCODE               float64
LATITUDE              float64
LONGITUDE 

In [50]:
df['PRICE'].mean()

931351.5949336156

In [58]:
from sklearn.metrics import mean_absolute_error, r2_score

target = 'PRICE'

# Mean baseline: predict the mean sale price for every observation
mean_baseline = df.copy()
y = mean_baseline[target]
y_pred = y.mean()

mean_baseline['predicted'] = y_pred
mean_baseline['error'] = y_pred - y

# The metrics accept a pandas Series directly, so no list conversion is needed
mae = mean_absolute_error(y, mean_baseline['predicted'])
r2 = r2_score(y, mean_baseline['predicted'])
print(f'Mean Baseline MAE: ${mae:,.0f} \n')
print(f'Mean Baseline R^2: {r2} \n')

Mean Baseline MAE: $951,010 

Mean Baseline R^2: 0.0 
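
An R^2 of 0.0 is expected here rather than a sign of a bug: R^2 compares a model's squared error with that of always predicting the mean, so the mean baseline scores exactly zero against itself. The sketch below shows this on four made-up prices (the array `y_toy` is hypothetical and not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Four made-up sale prices, for illustration only.
y_toy = np.array([100_000.0, 250_000.0, 400_000.0, 1_250_000.0])

# The mean baseline predicts the same value for every row.
y_toy_pred = np.full_like(y_toy, y_toy.mean())

# MAE is the average absolute distance from the mean.
toy_mae = mean_absolute_error(y_toy, y_toy_pred)

# R^2 = 1 - SS_res / SS_tot; for the mean baseline SS_res equals SS_tot.
toy_r2 = r2_score(y_toy, y_toy_pred)

print(f'Toy mean baseline MAE: ${toy_mae:,.0f}')
print(f'Toy mean baseline R^2: {toy_r2}')
```

Any later model should therefore be judged by how far it pushes R^2 above 0 and MAE below the baseline figure.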

