# Exploratory Data Analysis

In [11]:
# import libs
from utils.reports import generate_profiling_report
import os
import pandas as pd

In [12]:
# read dataset
dataset_filepath = './data/train.csv'
dataset_df = pd.read_csv(dataset_filepath)
dataset_df.head(10)

Unnamed: 0,uid,city,description,homeType,latitude,longitude,garageSpaces,hasSpa,yearBuilt,numOfPatioAndPorchFeatures,lotSizeSqFt,avgSchoolRating,MedianStudentsPerTeacher,numOfBathrooms,numOfBedrooms,priceRange
0,1748,austin,MULTIPLE OFFERS submit best & final to Agent b...,Single Family,30.380089,-97.800621,0,False,1988,0,102366.0,7.0,17,4.0,4,650000+
1,13380,austin,"4644 Hoffman Dr, Austin, TX 78749 is a single ...",Single Family,30.199486,-97.859947,0,False,1997,0,6534.0,6.666667,16,3.0,4,350000-450000
2,4115,austin,"6804 Canal St, Austin, TX 78741 is a single fa...",Single Family,30.227398,-97.696083,0,False,1952,0,5619.0,3.333333,11,1.0,2,0-250000
3,6926,austin,Beautiful large lot with established trees. Lo...,Single Family,30.205469,-97.792351,4,False,1976,0,6416.0,4.0,14,2.0,4,0-250000
4,14480,austin,Stunning NW Hills designer remodel by Cedar an...,Single Family,30.345106,-97.767426,2,False,1984,0,10759.0,7.0,16,3.0,5,650000+
5,13448,austin,"1808 Saint Albans Blvd, Austin, TX 78745 is a ...",Single Family,30.218163,-97.79425,0,False,1964,0,9452.0,4.666667,14,2.0,3,450000-650000
6,1996,austin,Stunning 1story contemporary home on a beautif...,Single Family,30.283981,-97.88578,0,False,2015,0,45302.4,5.666667,16,4.0,5,650000+
7,11353,austin,HIGH RENTABLE AREA! Beautifully kept duplex lo...,Multiple Occupancy,30.327396,-97.693901,1,False,1967,0,8886.0,3.333333,15,4.0,6,0-250000
8,4643,austin,"Beautifully, south Austin home on large corner...",Single Family,30.209915,-97.845276,2,False,1983,0,7535.0,4.666667,14,3.0,3,250000-350000
9,3118,austin,This fantastic property is walking distance to...,Single Family,30.38129,-97.667625,0,False,1972,0,13068.0,5.0,15,2.0,4,250000-350000


In [13]:
dataset_df.describe()

Unnamed: 0,uid,latitude,longitude,garageSpaces,yearBuilt,numOfPatioAndPorchFeatures,lotSizeSqFt,avgSchoolRating,MedianStudentsPerTeacher,numOfBathrooms,numOfBedrooms
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,7626.6219,30.291191,-97.778476,1.2296,1988.5704,0.6672,20710.28,5.766236,14.8577,2.6921,3.4492
std,4380.486852,0.097075,0.084543,1.328179,21.515272,0.986378,448833.8,1.86194,1.748473,0.979206,0.813441
min,1.0,30.08503,-98.020477,0.0,1905.0,0.0,100.0,2.333333,10.0,1.0,1.0
25%,3838.75,30.202516,-97.838594,0.0,1975.0,0.0,6534.0,4.0,14.0,2.0,3.0
50%,7603.5,30.283664,-97.76968,1.0,1993.0,0.0,8189.0,5.666667,15.0,3.0,3.0
75%,11435.75,30.366375,-97.718313,2.0,2006.0,1.0,10890.0,7.0,16.0,3.0,4.0
max,15170.0,30.517323,-97.570633,22.0,2020.0,8.0,34154520.0,9.5,19.0,10.0,10.0


In [14]:
# generate a data profiling report using data_profiling lib
title = "Raw Dataset Profiling"
report_name = 'raw_dataset_profiling'
results_folder_path = 'results'
report_filepath = os.path.join(results_folder_path, f"{report_name}.html")
generate_profiling_report(report_filepath=report_filepath, title=title, data_filepath=dataset_filepath, minimal=False)

Summarize dataset: 100%|██████████| 146/146 [00:07<00:00, 19.93it/s, Completed]                                                    
Generate report structure: 100%|██████████| 1/1 [00:01<00:00,  1.78s/it]
Render HTML: 100%|██████████| 1/1 [00:01<00:00,  1.38s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 21.20it/s]


## Features review
* uid: Unique identifier.
* city: Categorical (open vocabulary). Highly imbalanced: 99% of the values are 'austin'. Low variance.
* description: Text. The most common word is 'home'.
* homeType: Categorical (closed vocabulary). Highly imbalanced: 94% of the values are 'Single Family'. Low variance.
* latitude: Float. Gaussian-like distribution.
* longitude: Float. Gaussian-like distribution.
* garageSpaces: Integer. Moderately high positive skewness.
* hasSpa: Boolean. Imbalanced: 91% of the values are False.
* yearBuilt: Integer. Moderately high negative skewness.
* numOfPatioAndPorchFeatures: Integer. Moderately high positive skewness.
* lotSizeSqFt: Float. The max value is far above the 95th percentile, so it is probably an outlier. Long tail.
* avgSchoolRating: Float. Gaussian-like distribution.
* MedianStudentsPerTeacher: Integer. Gaussian-like distribution.
* numOfBathrooms: Numeric (stored as float). Common values are 1 to 4; values from 5 up to the max are rare. Moderately high positive skewness.
* numOfBedrooms: Integer. Common values are 2 to 5; 1 and values above 5 are rare. Somewhat high kurtosis.
* priceRange: Categorical (closed vocabulary). The target. Reasonably balanced: each category accounts for roughly 12% to 23% of the rows.
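
The imbalance and skewness notes above were read off the profiling report. As a rough cross-check, the same kind of figures can be computed directly with pandas. This is only a sketch: `toy_df` and `top_value_share` are hypothetical names, and the rows are made-up stand-ins for `dataset_df`, not values from `train.csv`.

```python
import pandas as pd

# Toy frame standing in for dataset_df (illustrative values only).
toy_df = pd.DataFrame({
    "city": ["austin"] * 9 + ["del valle"],
    "garageSpaces": [0, 0, 0, 1, 1, 2, 2, 2, 4, 12],
})


def top_value_share(series: pd.Series) -> float:
    """Share of rows taken by the most frequent value (1.0 means a constant column)."""
    return float(series.value_counts(normalize=True).iloc[0])


# Imbalance of a categorical column.
city_share = top_value_share(toy_df["city"])

# Shape of a numeric column: positive skew means a long right tail.
garage_skew = toy_df["garageSpaces"].skew()
garage_kurt = toy_df["garageSpaces"].kurt()

print(f"city top-value share: {city_share:.2f}")
print(f"garageSpaces skewness: {garage_skew:.2f}, kurtosis: {garage_kurt:.2f}")
```

On the real data the same calls would be applied to `dataset_df` columns instead of `toy_df`.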

## Additional observations
* longitude has a high negative linear correlation with MedianStudentsPerTeacher and avgSchoolRating. This may indicate that school quality varies with the east-west location, with lower values further east (higher longitude).
* MedianStudentsPerTeacher and avgSchoolRating have a high positive linear correlation: schools with more students per teacher tend to have a higher rating.
* There is a subtle negative linear correlation between longitude and lotSizeSqFt, numOfBathrooms, and numOfBedrooms. This could mean that houses tend to be smaller further east (higher longitude).
* No feature has missing values except description, which has one. Remove that row.
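
A minimal sketch of how these observations could be checked without the report, assuming Pearson correlation is the measure of interest. The frame below is a made-up stand-in for `dataset_df`; `toy_df` and `cleaned_df` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for dataset_df (illustrative values only).
toy_df = pd.DataFrame({
    "longitude": [-97.90, -97.85, -97.80, -97.75, -97.70, -97.65],
    "avgSchoolRating": [8.0, 7.0, 6.5, 5.0, 4.0, 3.5],
    "description": ["a", "b", None, "d", "e", "f"],
})

# Pairwise linear (Pearson) correlation between numeric columns.
corr = toy_df[["longitude", "avgSchoolRating"]].corr(method="pearson")

# Count missing values per column, then drop rows without a description.
missing_per_column = toy_df.isna().sum()
cleaned_df = toy_df.dropna(subset=["description"])

print(corr)
print(missing_per_column)
```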

## Features Analysis and TODO
* uid: Not informative. REMOVE FEATURE
* city: Low variance. Limit the model to 'austin' until more data is available. REMOVE ROWS WHERE city IS NOT 'austin'.
* description: It may contain useful information. APPLY NLP TO EXTRACT USEFUL INFORMATION
* homeType: Low variance. Limit the model to 'Single Family' until more data is available. REMOVE ROWS WHERE homeType IS NOT 'Single Family'.
* latitude: DO NOT APPLY TRANSFORMATIONS
* longitude: DO NOT APPLY TRANSFORMATIONS
* garageSpaces: Apply binning, grouping the values >= 5 together. APPLY BINNING
* hasSpa: It may require a transformation in the future; for now, DO NOT APPLY TRANSFORMATIONS
* yearBuilt: DO NOT APPLY TRANSFORMATIONS
* numOfPatioAndPorchFeatures: Apply binning, grouping the values >= 4 together. APPLY BINNING
* lotSizeSqFt: The max value is unrealistic; remove values above the 95th percentile. APPLY 95TH PERCENTILE REMOVAL
* avgSchoolRating: DO NOT APPLY TRANSFORMATIONS
* MedianStudentsPerTeacher: DO NOT APPLY TRANSFORMATIONS
* numOfBathrooms: Apply binning, grouping the values >= 6 together. APPLY BINNING
* numOfBedrooms: Apply binning, grouping the values >= 6 together. APPLY BINNING
* priceRange: Apply a categorical-to-numerical transformation that keeps the category order, because the order could be useful for the model. CATEGORICAL TO NUMERICAL
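
The row and column steps in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the final preprocessing code: grouping "values >= N" is interpreted here as capping each column at N, `clean` and `BIN_CAPS` are hypothetical names, and the toy rows are made up apart from reusing the extreme `lotSizeSqFt` and `garageSpaces` maxima seen in `describe()`.

```python
import pandas as pd

# Toy stand-in for dataset_df (illustrative rows only).
toy_df = pd.DataFrame({
    "uid": range(1, 21),
    "city": ["austin"] * 19 + ["pflugerville"],
    "homeType": ["Single Family"] * 18 + ["Condo", "Single Family"],
    "garageSpaces": [0, 1, 2, 3, 4, 5, 6, 22] + [1] * 12,
    "lotSizeSqFt": [34154520.0] + [6000.0 + 100 * i for i in range(1, 20)],
})

# Assumption: "group the values >= N" is implemented by capping at N.
# The real data would also cap numOfPatioAndPorchFeatures (4),
# numOfBathrooms (6), and numOfBedrooms (6).
BIN_CAPS = {"garageSpaces": 5}


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the removal, binning, and percentile steps from the TODO list."""
    out = df.drop(columns=["uid"])
    keep = (out["city"] == "austin") & (out["homeType"] == "Single Family")
    out = out[keep].copy()
    for column, cap in BIN_CAPS.items():
        out[column] = out[column].clip(upper=cap)
    # Drop rows whose lot size is above the 95th percentile.
    lot_cutoff = out["lotSizeSqFt"].quantile(0.95)
    out = out[out["lotSizeSqFt"] <= lot_cutoff]
    return out.reset_index(drop=True)


cleaned_df = clean(toy_df)
print(cleaned_df.describe())
```

Computing the percentile after the city and homeType filters means the cutoff reflects only the rows the model will see; computing it before filtering is an equally defensible choice.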

### Detailed feature exploration

In [19]:
# Explore description
dataset_df['description'].tolist()

['MULTIPLE OFFERS submit best & final to Agent by Mon 21st - 5pm. Appt with Agent.  RARE PANORAMIC VIEW LOT IN JESTER ESTATES SEE FOR MILES!!  Home sits on Cul-de-sac & backs to a Preserve.  Stunning remodeled Kitchen & Bathrooms. Master suite is a private sanctuary with chic master bath, huge bedroom, walk-in closet & private deck.  Jester has a pool, park, tennis courts & feeds into Anderson High.  This home has been well loved & features 3 living areas, an office, & 3 car garage.',
 '4644 Hoffman Dr, Austin, TX 78749 is a single family home that contains 2,059 sq ft and was built in 1997. It contains 4 bedrooms and 3 bathrooms. \r\n \r\n',
 '6804 Canal St, Austin, TX 78741 is a single family home that contains 832 sq ft and was built in 1952. It contains 2 bedrooms and 1 bathroom. \r\n \r\n',
 'Beautiful large lot with established trees. Lovely, light filled, spacious home, with large dining area, living room and kitchen in open floor plan layout. Large family-friendly back yard, ne

Based on the description values, NLP can be used in two ways:
1. Extract known features to check consistency, e.g. the number of bedrooms in the description should equal numOfBedrooms.
2. Extract additional features, such as pool, schools, or hospitals, which could increase the property price.

TODO: We will focus on option 2 only.
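
As a first approximation of the second option, plain keyword matching can stand in for a fuller NLP pipeline. The sketch below is an assumption about how that might look: the feature names and patterns in `KEYWORD_PATTERNS` are hypothetical, and the two descriptions are excerpts from the output above.

```python
import pandas as pd

# Two excerpts from the descriptions printed above.
descriptions = pd.Series([
    "Jester has a pool, park, tennis courts & feeds into Anderson High.",
    "4644 Hoffman Dr, Austin, TX 78749 is a single family home "
    "that contains 2,059 sq ft and was built in 1997.",
])

# Hypothetical feature names mapped to case-insensitive regex patterns.
KEYWORD_PATTERNS = {
    "mentionsPool": r"\bpool\b",
    "mentionsTennis": r"\btennis\b",
    "mentionsSchool": r"\bschools?\b",
    "mentionsHospital": r"\bhospitals?\b",
}

# One boolean column per keyword; missing descriptions count as False.
keyword_features = pd.DataFrame({
    name: descriptions.str.contains(pattern, case=False, regex=True, na=False)
    for name, pattern in KEYWORD_PATTERNS.items()
})

print(keyword_features)
```

Keyword flags are cheap but crude: "pool table" would set `mentionsPool`, and "Anderson High" is not caught by the school pattern, so the patterns would need tuning against real descriptions.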

In [20]:
# explore target
dataset_df['priceRange'].unique()

array(['650000+', '350000-450000', '0-250000', '450000-650000',
       '250000-350000'], dtype=object)
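
The five labels above have a natural order, so the order-preserving encoding planned for the target could be sketched with an ordered categorical dtype. `PRICE_RANGE_ORDER` and `price_code` are hypothetical names, and the sample series simply repeats the five unique labels.

```python
import pandas as pd

# Price bands from lowest to highest; the list position becomes the code.
PRICE_RANGE_ORDER = [
    "0-250000",
    "250000-350000",
    "350000-450000",
    "450000-650000",
    "650000+",
]

# The five unique labels, in the order returned by unique() above.
price_range = pd.Series([
    "650000+", "350000-450000", "0-250000", "450000-650000", "250000-350000",
])

price_dtype = pd.CategoricalDtype(categories=PRICE_RANGE_ORDER, ordered=True)
price_code = price_range.astype(price_dtype).cat.codes

print(price_code.tolist())
```

A label that is not in `PRICE_RANGE_ORDER` would be encoded as -1, which makes unexpected categories easy to spot.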