# **Data Analysis Heritage-Housing-Project**

## Objectives

This notebook focuses on exploring the relationship between house attributes and their sale prices. The analysis aims to support the client in understanding which features have the strongest influence on property value.

The outcome will help identify the key variables to include in the predictive model and inform the visualizations that will be part of the final dashboard.

## Inputs

- Dataset collected and stored in the previous notebook:
  `outputs/datasets/collection/house_prices_records.csv`

## Outputs

- Correlation matrix for numerical features
- Visualizations of selected variables with strong correlation to sale price
- Observations and summary of insights to support the next steps


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [50]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\PabloGalindo\\Coding-Institute\\PMS5'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [13]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [37]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\PabloGalindo\\Coding-Institute\\PMS5'
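Because `os.chdir(os.path.dirname(...))` moves one level up each time it runs, re-running that cell can leave the notebook in the wrong folder. The following is a minimal sketch of a repeatable alternative, assuming the project root is the first folder at or above the current one that contains an `outputs` directory. The helper name `find_project_root` and the `outputs` marker are illustrative assumptions, not part of the original notebook.

```python
import os
from pathlib import Path


def find_project_root(start, marker="outputs"):
    """Return the first folder at or above `start` that contains `marker`.

    Returns None when no such folder exists.
    """
    start = Path(start).resolve()
    for folder in [start, *start.parents]:
        if (folder / marker).is_dir():
            return folder
    return None


root = find_project_root(os.getcwd())
if root is not None:
    os.chdir(root)  # safe to re-run: the result is the same every time
    print(f"Working directory: {root}")
else:
    print("No 'outputs' folder found at or above the current directory")
```

Searching for a marker folder, rather than always stepping to the parent, keeps the cell's effect the same no matter how many times it is executed.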

# Initial Data Exploration

## Load data for analysis

We load the data from `outputs/datasets/collection/house_prices_records.csv`.


In [15]:
import pandas as pd

# Define the path to the dataset
file_path = "outputs/datasets/collection/house_prices_records.csv"

# Load the dataset
df = pd.read_csv(file_path)

# Preview the first few rows
df.head()




FileNotFoundError: [Errno 2] No such file or directory: 'outputs/datasets/collection/house_prices_records.csv'
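The `FileNotFoundError` above indicates that the relative path did not resolve from the working directory in use when the cell ran. The following is a minimal sketch of a loader that reports where it looked, which makes this kind of failure easier to diagnose. The helper name `load_dataset` is hypothetical.

```python
import os

import pandas as pd


def load_dataset(path):
    """Read a CSV file, failing with a message that names the working directory."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"'{path}' not found relative to working directory '{os.getcwd()}'"
        )
    return pd.read_csv(path)
```

With this helper, `df = load_dataset(file_path)` would stand in for the direct `pd.read_csv(file_path)` call.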

---

## Exploring Feature Correlations with Sale Price

Before analyzing the relationships between features and the target variable, we first review the structure of the dataset.

This includes:
- Reviewing the types of variables available
- Looking at their distribution
- Identifying missing values
- Considering how each feature might relate to the sale price in a business context

This step helps us better understand the structure of the data and prepares us for a more focused analysis of the most influential variables.
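The structural checks listed above (variable types, missing values, number of distinct values) can be gathered into one table. This is a minimal sketch; the helper name `summarize_structure` is an illustrative assumption, and it works on any pandas DataFrame.

```python
import pandas as pd


def summarize_structure(df):
    """Return one row per column: dtype, percentage missing, distinct values."""
    summary = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing_pct": (df.isna().mean() * 100).round(1),
            "n_unique": df.nunique(),  # NaN is not counted as a value
        }
    )
    # Columns with the most missing data come first
    return summary.sort_values("missing_pct", ascending=False)
```

Calling `summarize_structure(df)` on the loaded dataset would give a quick overview before running the full profiling report.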



### Feature Glossary and Units of Measure

Before proceeding with the technical analysis of the data, it is important to establish a basic understanding of what each feature in the dataset represents, including its unit of measure, feature type, and possible value range.

To support this, we summarize the information provided in the `house-metadata.txt` file. This simple analysis is helpful as it provides business context to the variables and improves explainability throughout the project.



#### Feature Glossary (Condensed with Categorical Meanings)

| Feature         | Unit / Type           | Range / Potential Values / Glossary                             |
|----------------|------------------------|------------------------------------------------------------------|
| 1stFlrSF        | sq ft                 | 334 – 4692                                                       |
| 2ndFlrSF        | sq ft                 | 0 – 2065                                                         |
| BedroomAbvGr    | count                 | 0 – 8                                                            |
| BsmtExposure    | category              | Gd: Good, Av: Average, Mn: Minimum, No: None, None: No Basement |
| BsmtFinType1    | category              | GLQ: Good Living Quarters, ALQ: Avg Living, BLQ: Below Avg, Rec: Rec Room, LwQ: Low Quality, Unf: Unfinished, None: No Basement |
| BsmtFinSF1      | sq ft                 | 0 – 5644                                                         |
| BsmtUnfSF       | sq ft                 | 0 – 2336                                                         |
| TotalBsmtSF     | sq ft                 | 0 – 6110                                                         |
| GarageArea      | sq ft                 | 0 – 1418                                                         |
| GarageFinish    | category              | Fin: Finished, RFn: Rough Finished, Unf: Unfinished, None: No Garage |
| GarageYrBlt     | year                  | 1900 – 2010                                                      |
| GrLivArea       | sq ft                 | 334 – 5642                                                       |
| KitchenQual     | category              | Ex: Excellent, Gd: Good, TA: Typical/Average, Fa: Fair, Po: Poor |
| LotArea         | sq ft                 | 1300 – 215245                                                    |
| LotFrontage     | linear ft             | 21 – 313                                                         |
| MasVnrArea      | sq ft                 | 0 – 1600                                                         |
| EnclosedPorch   | sq ft                 | 0 – 286                                                          |
| OpenPorchSF     | sq ft                 | 0 – 547                                                          |
| OverallCond     | ordinal (1–10)        | 1: Very Poor → 10: Very Excellent                                |
| OverallQual     | ordinal (1–10)        | 1: Very Poor → 10: Very Excellent                                |
| WoodDeckSF      | sq ft                 | 0 – 736                                                          |
| YearBuilt       | year                  | 1872 – 2010                                                      |
| YearRemodAdd    | year                  | 1950 – 2010                                                      |
| SalePrice       | USD                   | 34,900 – 755,000                                                 |
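Quality labels in the glossary, such as those of `KitchenQual`, follow a natural order. The following is a minimal sketch of how that order could be made explicit in pandas, assuming the labels run from `Po` (worst) to `Ex` (best) as listed in the table. The helper `to_ordered` and the sample values are illustrative only.

```python
import pandas as pd

# Order assumed from the glossary, worst to best
KITCHEN_QUAL_ORDER = ["Po", "Fa", "TA", "Gd", "Ex"]


def to_ordered(series, order):
    """Convert a column of labels into an ordered categorical."""
    return pd.Categorical(series, categories=order, ordered=True)


sample = pd.Series(["TA", "Gd", "Ex", "Fa"])  # made-up values for illustration
kitchen = to_ordered(sample, KITCHEN_QUAL_ORDER)
print(list(kitchen.codes))  # position of each label in the order: [2, 3, 4, 1]
```

An ordered categorical keeps the original labels while allowing comparisons and sorting by quality level.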



### Data Profiling

To support a better understanding of the underlying structure of the data, an automated profiling report will be generated using `ydata_profiling`. 

The output will be saved as an HTML file in the `outputs/reports/` folder. This report provides a comprehensive overview of the dataset, including:

- Variable types and summary statistics
- Distributions and potential outliers
- Missing values and data quality alerts
- Correlation matrices
- Duplicate and constant columns
- Feature interactions and skewness


In [None]:
from ydata_profiling import ProfileReport

# Create the reports folder if it doesn't exist
os.makedirs("outputs/reports", exist_ok=True)
# Load the dataset
df = pd.read_csv(file_path)

# Create profile report
profile = ProfileReport(df, title="Heritage Housing Data Profiling Report", explorative=True)

# Save to HTML file
profile.to_file("outputs/reports/heritage_housing_profile_report.html")



### Profiling Summary

To better understand the structure and quality of the dataset, we used `ydata_profiling` to generate a comprehensive report. Below is a summary of the key insights extracted from the profiling:

#### Dataset Overview
- **Number of observations**: 1,460
- **Number of features**: 24
- **Total missing cells**: 3,580 (10.2%)
- **Duplicate rows**: 0
- **Variable types**: 20 numerical, 4 categorical

#### Missing Values
- Features with the highest missingness:
  - `EnclosedPorch`: 90.7%
  - `WoodDeckSF`: 89.4%
  - `LotFrontage`: 17.7%
  - `GarageFinish`: 16.1%
  - `BsmtFinType1`: 9.9%
  - `BedroomAbvGr`: 6.8%

These missing values will need to be reviewed before modeling. However, based on the context of the data, it’s reasonable to assume that in many cases the missing values indicate that the feature is simply not present in the house.
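If the "missing means absent" reading holds, one way to encode it is sketched below. The choice of columns is an assumption made for illustration: area features are filled with 0 and category features with the `None` label from the glossary. `LotFrontage` and `BedroomAbvGr` are left untouched here because a gap in those columns is harder to read as an absent feature. The helper name `fill_absent_features` is hypothetical.

```python
import pandas as pd

# Assumption: a missing value in these columns means the feature is absent
ZERO_FILL = ["EnclosedPorch", "WoodDeckSF"]
NONE_FILL = ["GarageFinish", "BsmtFinType1"]


def fill_absent_features(df):
    """Return a copy where 'absent feature' gaps are filled with 0 or 'None'."""
    out = df.copy()
    for col in ZERO_FILL:
        if col in out.columns:
            out[col] = out[col].fillna(0)
    for col in NONE_FILL:
        if col in out.columns:
            out[col] = out[col].fillna("None")
    return out
```

The function returns a copy, so the original DataFrame stays available for comparing results before and after the fill.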

  
#### Feature Observations
- Most numerical variables are clean, with **no infinite or negative values**.
- Several features, such as `2ndFlrSF`, `MasVnrArea`, and `OpenPorchSF`, have a **high percentage of zeros**, meaning many houses lack the feature (e.g., no second floor or no masonry veneer).
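The share of zeros per numerical feature can also be checked directly, outside the profiling report. This is a minimal sketch; the helper name `zero_share` is an illustrative assumption.

```python
import pandas as pd


def zero_share(df):
    """Percentage of zeros in each numeric column, highest first.

    Missing values count as non-zero, so they lower the reported share.
    """
    numeric = df.select_dtypes(include="number")
    return (numeric.eq(0).mean() * 100).round(1).sort_values(ascending=False)
```

Running `zero_share(df).head()` would list the features where a zero most often stands for "not present".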
  
#### Categorical Variables
- Examples include:
  - `BsmtFinType1`: {Unf, GLQ, ALQ, BLQ, Rec, LwQ}
  - `BsmtExposure`: {No, Av, Gd, Mn}
  - `GarageFinish`: {Unf, RFn, Fin}
  - `KitchenQual`: {TA, Gd, Ex, Fa}




## Correlation Analysis

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Select numeric features only
numeric_df = df.select_dtypes(include=["number"])

# Compute correlation matrix
correlation_matrix = numeric_df.corr()

# Top correlated features with SalePrice (excluding itself)
saleprice_corr = correlation_matrix["SalePrice"].drop("SalePrice").sort_values(ascending=False)
top_corr = saleprice_corr.head(10)

# --- Plot 1: Correlation Coefficients ---
plt.figure(figsize=(10, 6))
sns.barplot(x=top_corr.values, y=top_corr.index)
plt.title("Top 10 Features Correlated with SalePrice")
plt.xlabel("Correlation Coefficient")
plt.ylabel("Feature")
plt.grid(True)
plt.tight_layout()
plt.show()

# --- Plot 2: Feature vs. SalePrice Scatter/Box ---
for feature in top_corr.index:
    plt.figure(figsize=(8, 5))
    if df[feature].nunique() > 10:
        sns.scatterplot(x=df[feature], y=df["SalePrice"])
        plt.title(f"{feature} vs. SalePrice (Scatter)")
    else:
        sns.boxplot(x=df[feature], y=df["SalePrice"])
        plt.title(f"{feature} vs. SalePrice (Boxplot)")
    
    plt.xlabel(feature)
    plt.ylabel("SalePrice")
    plt.tight_layout()
    plt.grid(True)
    plt.show()
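Sorting the coefficients in descending order and taking the first ten keeps only positive correlations, so a strong negative relationship would not appear in the ranking. The following is a minimal sketch of ranking by absolute value instead. The helper name `top_correlated` is hypothetical, and `method` is passed straight to `DataFrame.corr`.

```python
import pandas as pd


def top_correlated(df, target="SalePrice", n=10, method="pearson"):
    """Rank numeric features by absolute correlation with `target`.

    The returned values keep their sign, so the direction stays visible.
    """
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr(method=method)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(n)
```

The result can be passed to the same bar plot as before, for example `sns.barplot(x=result.values, y=result.index)`.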



