# Section 1. Introduction to the problem/task and dataset
The dataset contains house sale prices for King County, which is located in the US state of Washington. It covers homes sold between May 2014 and May 2015. It is an IBM dataset focused on predicting house prices in the USA.

In the realm of real estate and housing, the condition of a property plays a pivotal role in its market value. Understanding and accurately assessing the condition of houses is essential for buyers, sellers, and real estate professionals alike.

To address this need, we embark on a project aimed at classifying houses based on their condition. The condition of a house, graded on a scale of 1 to 5, serves as our target variable. This classification task will empower us to predict and differentiate houses based on their state of repair and maintenance.

# Section 2. Description of the dataset


The dataset for this project is a collection of housing sale records, each describing the condition and attributes of a residential property. It is the foundation for our task of classifying houses based on their condition. Because each record corresponds to a sales transaction, a house that changed ownership more than once during the period appears more than once.

The data for these sales comes from the official public records of home sales in King County, Washington State. The dataset contains 21613 rows, each representing a home sold from May 2014 through May 2015.

## Structure

The dataset is structured as a single file in the widely-used CSV (Comma-Separated Values) format. Each row in the dataset represents a distinct house sale event, while each column corresponds to an attribute or feature of the property.

In total, the dataset comprises:
- `21613` instances; and
- `21` features.

## Features

### Brief Description of Features

Our dataset includes both numerical and categorical features, each contributing to our understanding of the condition and characteristics of houses. Below is a list of the features included in the dataset, grouped by theme.

**Location:**
- `lat` and `long` represent the latitude and longitude of the house's location.

**Size:**
- `sqft_living` is the square footage of the interior living space.
- `sqft_lot` is the square footage of the land.
- `sqft_living15` is the square footage of interior living space for the nearest 15 neighbors.
- `sqft_lot15` is the square footage of the land lots of the nearest 15 neighbors.

**Rooms:**
- `bedrooms` counts the number of bedrooms in the house.
- `bathrooms` counts the number of bathrooms. A fractional value of 0.5 indicates a room with a toilet but no shower.

**Floors:**
- `floors` is the number of floors in the house.

**Waterfront and View:**
- `waterfront` is a binary variable, indicating whether the house overlooks the waterfront (1 for yes, 0 for no).
- `view` is an index from 0 to 4, rating the quality of the property's view.

**Condition and Grade:**
- `condition` is an index from 1 to 5, indicating the condition of the house.
- `grade` is an index from 1 to 13, where 1-3 represent lower-quality construction, 7 indicates average quality, and 11-13 signify high-quality construction and design.

**Square Footage Above and Below Ground:**
- `sqft_above` represents the square footage of the interior housing space above ground level.
- `sqft_basement` represents the square footage of the interior housing space below ground level.

**Year Information:**
- `yr_built` is the year the house was initially built.
- `yr_renovated` is the year of the last house renovation.

**Location:**
- `zipcode` indicates the zipcode area where the house is situated.

Most of these attributes provide quantitative details about the properties. In contrast, categorical features, such as `waterfront` and `view`, offer qualitative information about specific aspects of the houses. Clarifying what each feature means guides our analysis and classification process. Even features not directly used in our study may be relevant for a fuller understanding of housing conditions.
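
The distinction between quantitative and categorical features can be made explicit in pandas. The following is a minimal sketch on made-up rows, not part of the original analysis. The frame name `houses` and the choice of ordered categorical dtypes for `view` and `condition` are assumptions for illustration; only the column names and value ranges come from the feature list.

```python
import pandas as pd

# Toy rows using column names from the feature list; the values are made up.
houses = pd.DataFrame({
    "waterfront": [0, 1, 0],
    "view": [0, 4, 2],
    "condition": [3, 5, 4],
    "sqft_living": [1180, 2570, 770],
})

# waterfront is a yes/no flag, so a boolean column states the intent directly.
houses["waterfront"] = houses["waterfront"].astype(bool)

# view (0-4) and condition (1-5) are ordered indexes, so ordered categoricals fit.
view_dtype = pd.CategoricalDtype(categories=[0, 1, 2, 3, 4], ordered=True)
condition_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
houses["view"] = houses["view"].astype(view_dtype)
houses["condition"] = houses["condition"].astype(condition_dtype)

print(houses.dtypes)
```

Keeping the full category range in the dtype means that a class absent from a sample still shows up (with a count of zero) in summaries such as `value_counts()`.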

### Full Feature Table

| Feature       | Description |
|---------------|-------------|
| id            | Unique ID for each home sold |
| date          | Date of the home sale |
| price         | Price of each home sold |
| bedrooms      | Number of bedrooms |
| bathrooms     | Number of bathrooms, where 0.5 accounts for a room with a toilet but no shower |
| sqft_living   | Square footage of the house's interior living space |
| sqft_lot      | Square footage of the land space |
| floors        | Number of floors |
| waterfront    | A dummy variable for whether the house overlooks the waterfront or not |
| view          | An index from 0 to 4 of how good the view of the property is |
| condition     | An index from 1 to 5 on the condition of the house |
| grade         | An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design |
| sqft_above    | The square footage of the interior housing space that is above ground level |
| sqft_basement | The square footage of the interior housing space that is below ground level |
| yr_built      | The year the house was initially built |
| yr_renovated  | The year of the house’s last renovation |
| zipcode       | The zipcode area the house is in |
| lat           | Latitude |
| long          | Longitude |
| sqft_living15 | The square footage of interior housing living space for the nearest 15 neighbors |
| sqft_lot15    | The square footage of the land lots of the nearest 15 neighbors |


# Section 3. List of Requirements

In [1]:
import numpy as np
import pandas as pd

# Section 4. Data Preprocessing and Cleaning

Since we intend to classify houses based on their condition, we want to remove biases that may arise from duplicate data. Some houses were sold more than once during the period of study, so the same house `id` can appear in several rows. For each such house, we remove the earlier rows and keep only the last instance.

### Reading the Data

In [2]:
df = pd.read_csv('house_prices.csv')
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
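
The preview above shows an extra `Unnamed: 0` column and `date` values such as `20141013T000000`. The sketch below is a hedged illustration, not part of the original notebook. It assumes that `Unnamed: 0` is a saved row index stored in the first CSV column, and it uses an in-memory CSV (`csv_text`, a hypothetical stand-in for `house_prices.csv`) with two rows copied from the preview and trimmed to a few columns.

```python
import io
import pandas as pd

# Two rows copied from the preview above, trimmed to a few columns for brevity.
csv_text = """,id,date,price
0,7129300520,20141013T000000,221900.0
1,6414100192,20141209T000000,538000.0
"""

# Default read: the unnamed leading column becomes a data column "Unnamed: 0".
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The date strings follow a YYYYMMDDTHHMMSS pattern, so an explicit format works.
indexed_read["date"] = pd.to_datetime(indexed_read["date"], format="%Y%m%dT%H%M%S")

print(list(default_read.columns))
print(list(indexed_read.columns))
```

Reading the index this way keeps the column count in line with the feature table, and a parsed `date` column can be sorted or compared chronologically.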


### Removing Duplicates

In [3]:
df_old_rows = len(df)

# Store rows with duplicate ids in a separate dataframe
df_duplicates = df[df.duplicated(['id'], keep=False)]

# Of those, keep the rows that are not the last occurrence of each id (the rows to drop)
df_duplicates = df_duplicates[df_duplicates.duplicated(['id'], keep='last')]

# Remove rows from df that are in df_duplicates
df = df.drop(df_duplicates.index)

# Print the old and new row counts, and the number of removed rows
print(f'Old Rows: {df_old_rows}\nNew Rows: {len(df)}\nRemoved Rows: {len(df_duplicates)}')

Old Rows: 21613
New Rows: 21436
Removed Rows: 177
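
Note that `keep='last'` keeps the last row per `id` in file order, which matches the most recent sale only if the rows are already in chronological order. The following sketch, on made-up data, shows one way to make "most recent" explicit by sorting on a parsed sale date first. The frame names `sales` and `latest` are hypothetical, and the date format is inferred from the data preview rather than documented.

```python
import pandas as pd

# Toy data: id 1 was sold twice, and the later sale appears first in the file.
sales = pd.DataFrame({
    "id": [1, 2, 1],
    "date": ["20150101T000000", "20140601T000000", "20140701T000000"],
    "price": [300000.0, 250000.0, 280000.0],
})

# Parse the date strings so that sorting is chronological by construction.
sales["date"] = pd.to_datetime(sales["date"], format="%Y%m%dT%H%M%S")

# Sort by sale date, keep the last (most recent) row for each id,
# then restore the original row order.
latest = (
    sales.sort_values("date", kind="stable")
         .drop_duplicates(subset="id", keep="last")
         .sort_index()
)

print(latest)
```

On data that is already sorted by date this gives the same rows as the approach above; the explicit sort only matters when file order and sale order differ.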


# Section 5. Exploratory Data Analysis

## Data Summary and Visualization

## Class Distribution Analysis

## Correlation Analysis

## Geospatial Analysis

## Cross-Feature Relationships

## Outlier Detection

# Section 6. Model Training

# Section 7. Hyperparameter Tuning

# Section 8. Model Selection

# Section 9. Insights and Conclusions

# Section 10. References