# B''H

## House Prices: Exploratory Data Analysis (EDA)

The objective of the Kaggle competition [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques#description) is to predict sale prices while practicing feature engineering, random forests (RFs), and gradient boosting.

This stage focuses on exploring and analyzing the underlying data.

---


### EDA Walk-Through 

**Step 1: Ensure the data is in the [tidy-data](http://www.stat.wvu.edu/~jharner/courses/stat623/docs/tidy-dataJSS.pdf) format. Convert if needed.**
   
**Key Takeaway:** 

- Thankfully the data is already tidy:
    -  Each variable forms a column
    -  Each observation forms a row (in our case a single house)
    -  Each type of observational unit forms a table (in our case only one table is needed)
---   

**Step 2: Do an initial inspection of the data to categorize the dependent and independent variables.**

See the **`step-02-initial-peek`** notebook for details.    

**Key Takeaway:** 
   
- There are 80 variables in total:

| Variable Type                                           | Count |
| ------------------------------------------------------- | ----- |
| dependent variables                                     | 1 |   
| independent numerical continuous and discrete variables | 28 |   
| independent numerical ordinal variables                 | 2 |
| independent numerical interval variables                | 5 |
| independent categorical text variables                  | 44 |


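As a rough illustration of this kind of first pass, the sketch below splits the independent variables by dtype with pandas. The small DataFrame is a made-up stand-in for the competition's `train.csv`, and the names `numeric_cols` and `text_cols` are illustrative, not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for train.csv (values are illustrative only).
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "OverallQual": [7, 6, 7],
    "MSZoning": ["RL", "RL", "RM"],
    "SalePrice": [208500, 181500, 223500],
})

target = "SalePrice"                    # the dependent variable
features = df.drop(columns=[target])    # everything else is independent

# First cut by dtype: numeric columns vs. everything else (text categories).
numeric_cols = features.select_dtypes(include="number").columns.tolist()
text_cols = features.select_dtypes(exclude="number").columns.tolist()

print(f"{len(numeric_cols)} numeric, {len(text_cols)} text, 1 target")
```

Dtypes only give a first cut. Telling ordinal and interval variables apart from other numerical ones still requires reading the data description.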
---   

**Step 3: Where applicable, convert categorical text variables to new numerical variables.**

See the **`step-03-convert-categorical`** notebook for details.    

**Key Takeaway:** 
- The following fields have been converted to new numerical ordinal variables:
    1. `ExterQual`
    2. `ExterCond`
    3. `BsmtQual`
    4. `BsmtCond`
    5. `HeatingQC`
    6. `KitchenQual`
    7. `FireplaceQu`
    8. `GarageQual`
    9. `GarageCond`
    10. `PoolQC`

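A minimal sketch of how such a recode might look, assuming the quality codes documented for this dataset (`Ex`, `Gd`, `TA`, `Fa`, `Po`) and a 1 to 5 integer scale chosen here for illustration; the notebook's actual mapping may differ. The `Recode` suffix follows the `ExterQualRecode` naming used elsewhere in this README, and the DataFrame is toy data. Values absent from the mapping, including missing values, come out as `NaN`.

```python
import pandas as pd

# Assumed ordinal scale: higher number means better quality.
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Toy data with two of the ten quality fields.
df = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex"],
    "KitchenQual": ["TA", "Gd", "Fa"],
})

# Keep the original text column and add a numeric "<name>Recode" column.
for col in ["ExterQual", "KitchenQual"]:
    df[f"{col}Recode"] = df[col].map(quality_scale)

print(df)
```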
---   

**Step 4: Handle missing data.**

See the **`step-04-missing-data`** notebook for details.    

**Key Takeaway:** 
- We dropped 18 variables (columns) and 1 observation (row).

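One common way to arrive at such a decision is to rank columns by their share of missing values, drop the sparsest columns, and drop the few rows that are still incomplete. The sketch below shows that pattern on toy data. The 50% threshold and the choice of `Electrical` as the row-level filter are assumptions for illustration, not the criteria used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data with different amounts of missingness per column.
df = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "Electrical": ["SBrkr", "SBrkr", "SBrkr", np.nan],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Fraction of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)

threshold = 0.5  # hypothetical cut-off for dropping a column
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

# Drop sparse columns, then drop rows missing a value in a chosen column.
cleaned = df.drop(columns=cols_to_drop).dropna(subset=["Electrical"])

print(missing_share)
print(cleaned.shape)
```

Whether to drop, impute, or treat a missing value as its own category depends on what the absence means for each field, so a single threshold is only a starting point.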

---   

**Step 5: Analyze the categorical text variables in consideration for feature selection.**

See the **`step-05-analyze-categorical`** notebook for details.    

**Key Takeaway:** 
- We selected the following fields for further research:

    - **`MSZoning`**: Zoning classification    
    - **`Neighborhood`**: Locations within Ames city
    - **`CentralAir`**: Central air conditioning


- These fields have a clear impact on **`SalePrice`**.

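A simple way to gauge the impact of a categorical field is to compare the sale price across its categories. The sketch below does this for `CentralAir` on made-up rows. Using the median as the summary statistic is an assumption here (it is less sensitive to extreme prices than the mean), and the notebook may rely on plots instead.

```python
import pandas as pd

# Toy rows; the values are illustrative only.
df = pd.DataFrame({
    "CentralAir": ["Y", "Y", "N", "N", "Y"],
    "SalePrice": [208500, 181500, 90000, 110000, 250000],
})

# Count and median sale price per category, highest median first.
summary = (
    df.groupby("CentralAir")["SalePrice"]
      .agg(["count", "median"])
      .sort_values("median", ascending=False)
)

print(summary)
```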
---   

**Step 6: Analyze the numerical variables in consideration for feature selection.**

See the **`step-06-analyze-numerical`** notebook for details.    

**Key Takeaway:** 

- We selected the following fields for further research:
    - **`OverallQual`**: Rates the overall material and finish of the house
    - **`GrLivArea`**: Above grade (ground) living area square feet
    - **`ExterQualRecode`**: The quality of the material on the exterior
    - **`KitchenQualRecode`**: Kitchen quality
    - **`GarageCars`**: Size of garage in car capacity
    - **`TotalBsmtSF`**: Total square feet of basement area


- These fields have a clear impact on **`SalePrice`**.

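For numerical fields, a quick screening step is to rank each variable by its correlation with `SalePrice`. The sketch below uses the Pearson correlation on toy data. This is an assumed approach, not necessarily the one in the notebook, and for the ordinal recodes a rank-based measure such as `df.corr(method="spearman")` may be the better fit.

```python
import pandas as pd

# Toy rows; the values are illustrative only.
df = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9],
    "GrLivArea": [1000, 1500, 1400, 2000, 2600],
    "YrSold": [2008, 2007, 2008, 2006, 2008],
    "SalePrice": [120000, 160000, 185000, 260000, 380000],
})

# Pearson correlation of every column with the target, strongest first.
corr_with_price = (
    df.corr()["SalePrice"]
      .drop("SalePrice")
      .sort_values(ascending=False)
)

print(corr_with_price)
```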
---   

**Step 7: Analyze the dependent variable: SalePrice.**

See the **`step-07-analyze-dependent`** notebook for details.    

**Key Takeaway:** 
- `SalePrice` does not follow a normal distribution.
- Its distribution is noticeably right-skewed.
- We removed two outlier records.

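Right skew can be quantified with the sample skewness, and a log transform is a common remedy to try. The sketch below compares the skewness of a made-up price series before and after `np.log1p`. The prices are invented, and the log transform is shown as one option to explore, not as a step taken in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up prices with a long right tail.
prices = pd.Series(
    [90000, 120000, 130000, 140000, 150000,
     160000, 180000, 210000, 320000, 750000],
    name="SalePrice",
)

raw_skew = prices.skew()             # positive value indicates right skew
log_skew = np.log1p(prices).skew()   # skewness after a log(1 + x) transform

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```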
---


**Conclusion (for now).**
- There is definitely a lot more that can be done as part of this EDA project.
- Besides a more in-depth analysis in each of the 7 steps above, further EDA work could include testing whether the data fits other distributions, such as the log-normal.
- Additionally, we've barely scratched the surface of testing different models with the data partitioned by certain variables.

... but that's all for now.

