# Section 1. Introduction to the problem/task and dataset
The dataset contains house sale prices for King County, which is located in the US state of Washington. It includes homes sold between May 2014 and May 2015. It is an IBM dataset focused on predicting house prices in the USA through analysis.

In the realm of real estate and housing, the condition of a property plays a pivotal role in its market value. Understanding and accurately assessing the condition of houses is essential for buyers, sellers, and real estate professionals alike.

To address this need, we embark on a project aimed at classifying houses based on their condition. The condition of a house, graded on a scale of 1 to 5, serves as our target variable. This classification task will empower us to predict and differentiate houses based on their state of repair and maintenance.

# Section 2. Description of the dataset


<!-- https://rstudio-pubs-static.s3.amazonaws.com/155304_cc51f448116744069664b35e7762999f.html -->
<!-- https://www.kaggle.com/datasets/harlfoxem/housesalesprediction -->

The dataset of this project encompasses a comprehensive collection of housing records, each providing insights into the conditions and attributes of residential properties. It serves as the foundation for our task of classifying houses based on their condition. This dataset has been meticulously assembled from multiple sales transactions, capturing houses that have changed ownership over time.

The data for these sales comes from the official public records of home sales in the King County area, Washington State. The dataset contains 21613 rows, each representing a home sold from May 2014 through May 2015.

## Structure

The dataset is structured as a single file in the widely-used CSV (Comma-Separated Values) format. Each row in the dataset represents a distinct house sale event, while each column corresponds to an attribute or feature of the property.

In total, the dataset comprises:
- `21613` instances; and
- `21` features.

## Features

### Brief Description of Features

Our dataset encompasses a rich array of features, both numerical and categorical, each contributing to our understanding of the condition and characteristics of houses. Below is a list of the features included in the dataset, grouped by relevance.

**Location:**
- `lat` and `long` represent the latitude and longitude of the house's location.

**Size:**
- `sqft_living` is the square footage of the interior living space.
- `sqft_lot` is the square footage of the land.
- `sqft_living15` is the square footage of interior living space for the nearest 15 neighbors.
- `sqft_lot15` is the square footage of the land lots of the nearest 15 neighbors.

**Rooms:**
- `bedrooms` counts the number of bedrooms in the house.
- `bathrooms` counts the number of bathrooms. A value of .5 indicates a room with a toilet but no shower.

**Floors:**
- `floors` is the number of floors in the house.

**Waterfront and View:**
- `waterfront` is a binary variable, indicating whether the house overlooks the waterfront (1 for yes, 0 for no).
- `view` is an index from 0 to 4, rating the quality of the property's view.

**Condition and Grade:**
- `condition` is an index from 1 to 5, indicating the condition of the house.
- `grade` is an index from 1 to 13, where 1-3 represent lower-quality construction, 7 indicates average quality, and 11-13 signify high-quality construction and design.

**Square Footage Above and Below Ground:**
- `sqft_above` represents the square footage of the interior housing space above ground level.
- `sqft_basement` represents the square footage of the interior housing space below ground level.

**Year Information:**
- `yr_built` is the year the house was initially built.
- `yr_renovated` is the year of the last house renovation.

**Location:**
- `zipcode` indicates the zipcode area where the house is situated.

These attributes provide quantitative details about the properties. In contrast, categorical data, such as 'waterfront' and 'view,' offer qualitative information about specific aspects of the houses. It's important to clarify the significance of each feature as it guides our analysis and classification process. Even those features not directly utilized in our study may hold relevance for a comprehensive understanding of housing conditions.

### Full Feature Table

| Feature       | Description |
|---------------|-------------|
| id            | Unique ID for each home sold |
| date          | Date of the home sale |
| price         | Price of each home sold |
| bedrooms      | Number of bedrooms |
| bathrooms     | Number of bathrooms, where .5 accounts for a room with a toilet but no shower |
| sqft_living   | Square footage of the house's interior living space |
| sqft_lot      | Square footage of the land space |
| floors        | Number of floors |
| waterfront    | A dummy variable for whether the house overlooks the waterfront (1) or not (0) |
| view          | An index from 0 to 4 of how good the view of the property is |
| condition     | An index from 1 to 5 on the condition of the house |
| grade         | An index from 1 to 13, where 1-3 fall short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design |
| sqft_above    | The square footage of the interior housing space that is above ground level |
| sqft_basement | The square footage of the interior housing space that is below ground level |
| yr_built      | The year the house was initially built |
| yr_renovated  | The year of the house’s last renovation |
| zipcode       | The zipcode area the house is in |
| lat           | Latitude |
| long          | Longitude |
| sqft_living15 | The square footage of interior housing living space for the nearest 15 neighbors |
| sqft_lot15    | The square footage of the land lots of the nearest 15 neighbors |


# Section 3. List of Requirements

In [None]:
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Section 4. Data Preprocessing and Cleaning

Since we intend to classify houses based on their condition, we want to remove biases that may arise from duplicate data. Some houses appear more than once in the dataset because they were sold more than once during the period of study. We will remove these duplicate rows and keep only the last instance of each house.

## Data Preprocessing

### Reading the Data

In [None]:
df = pd.read_csv('house_prices.csv')
df.head()
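
After loading, it may be worth confirming that the shape matches the row and column counts given in Section 2. The sketch below is only an illustration: `toy` is a small made-up frame standing in for `df`, and `check_shape` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Toy stand-in for the loaded data (values are made up for illustration)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [221900.0, 538000.0, 180000.0],
    "condition": [3, 3, 4],
})

def check_shape(frame, expected_rows, expected_cols):
    """Return True when the frame has the expected number of rows and columns."""
    return frame.shape == (expected_rows, expected_cols)

print(check_shape(toy, 3, 3))
```

With the real data, the equivalent call would be `check_shape(df, 21613, 21)`.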

### Removing Duplicates

In [None]:
df_old_rows = len(df)

# Store rows with duplicate ids in a separate dataframe
df_duplicates = df[df.duplicated(['id'], keep=False)]

# Store rows from df_duplicates that are not the most recent sale in a separate dataframe
df_duplicates = df_duplicates[df_duplicates.duplicated(['id'], keep='last')]

# Remove rows from df that are in df_duplicates
df = df.drop(df_duplicates.index)

# Print the old and new number of rows
print(f'Old Rows: {df_old_rows}\nNew Rows: {len(df)}\nRemoved Rows: {len(df_duplicates)}')

In [None]:
# Confirm there are no more duplicate ids
df[df.duplicated(['id'], keep=False)]
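
Note that `keep='last'` keeps the last row in file order, which is the most recent sale only if the rows are already in chronological order. A minimal sketch of a variant that sorts by sale date first is shown below. It uses a made-up frame named `toy`, and it assumes the `date` strings look like `20141013T000000`; both are assumptions for illustration.

```python
import pandas as pd

# Toy data: house 100 was sold twice, and the rows are NOT in date order
toy = pd.DataFrame({
    "id": [100, 200, 100],
    "date": ["20150301T000000", "20140702T000000", "20140815T000000"],
    "price": [450000, 300000, 400000],
})

# Parse the date strings so that sorting is chronological (format is assumed)
toy["date"] = pd.to_datetime(toy["date"], format="%Y%m%dT%H%M%S")

# Sort by sale date, keep only the latest sale per id, then restore row order
latest = (
    toy.sort_values("date")
       .drop_duplicates(subset="id", keep="last")
       .sort_index()
)
print(latest)
```

Here the 2015 sale of house 100 is kept even though it appears first in the file.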

## Data Cleaning

We will now clean the data by checking for missing values or incorrect data types.

In [None]:
# Check for string values in the dataframe
df.select_dtypes(include=['object']).columns
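
Beyond string columns, some features that Section 2 describes as categorical (`waterfront`, `view`) are stored as plain numbers. A minimal sketch of making those types explicit follows. The frame `toy` is made up and stands in for `df`, and whether this conversion helps depends on the models used later.

```python
import pandas as pd

# Toy rows (values are made up for illustration)
toy = pd.DataFrame({
    "waterfront": [0, 1, 0],
    "view": [0, 4, 2],
    "sqft_living": [1180, 2570, 770],
})

# waterfront is a yes/no flag; view is an ordered rating from 0 to 4
toy["waterfront"] = toy["waterfront"].astype(bool)
toy["view"] = pd.Categorical(toy["view"], categories=[0, 1, 2, 3, 4], ordered=True)

print(toy.dtypes)
```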

### Missing Values

In [None]:
# Check for missing values in the dataset
df.isnull().sum().sort_values(ascending=False)

### Negative Values

In [None]:
# Check for negative values (excluding longitude, latitude, date, and id)
excluded_cols = ['id', 'long', 'lat', 'date']
df[(df.drop(excluded_cols, axis=1) < 0).any(axis=1)]
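
The same idea can be extended from negative values to the value ranges documented in Section 2. The sketch below flags rows whose index-type features fall outside those ranges. It runs on a made-up frame named `toy`, and the `bounds` dictionary is an illustrative name introduced here.

```python
import pandas as pd

# Toy rows; the last one deliberately violates the documented ranges
toy = pd.DataFrame({
    "waterfront": [0, 1, 2],
    "view": [0, 4, 1],
    "condition": [3, 5, 6],
    "grade": [7, 13, 8],
})

# Inclusive (min, max) bounds taken from the feature descriptions in Section 2
bounds = {
    "waterfront": (0, 1),
    "view": (0, 4),
    "condition": (1, 5),
    "grade": (1, 13),
}

# Flag every row where at least one column falls outside its bounds
out_of_range = pd.Series(False, index=toy.index)
for col, (low, high) in bounds.items():
    out_of_range |= ~toy[col].between(low, high)

print(toy[out_of_range])
```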


# Section 5. Exploratory Data Analysis

## Data Summary and Visualization

In [None]:
# Generate a summary of the dataset
# Remove columns with no significant data and format the output
remove_cols = ['id', 'date']
df_summary = df.drop(remove_cols, axis=1).describe().transpose()
df_summary = df_summary[['mean', 'std', 'min', '25%', '50%', '75%', 'max']]
df_summary = df_summary.round(2)
df_summary = df_summary.rename(columns={'mean': 'Mean', 'std': 'Standard Deviation', 'min': 'Minimum', '25%': '25th Percentile', '50%': '50th Percentile', '75%': '75th Percentile', 'max': 'Maximum'})
df_summary

### Visualizations

#### Condition of Houses

Let's start by finding out how many houses are in each condition.

In [None]:
# Histogram of the `condition` column with labels
plt.hist(df['condition'], bins=5, edgecolor='k')
plt.title('Condition')
plt.xlabel('Condition')
plt.ylabel('Count')
plt.show()

# Also print the value counts
df['condition'].value_counts()

In [None]:
# Get percentage of the houses based on condition
condition_percentages = df['condition'].value_counts(normalize=True)

# Format the percentages to 2 decimal places
condition_percentages.map(lambda x: '{:.2f}%'.format(x*100))


Most houses fall under condition 3, which means that most houses are in average condition. Out of `21436` houses, `13911` (`64.90%`) are in average condition.

Houses with a condition of 4 or 5 are considered above average. Out of `21436` houses, `7332` (`34.20%`) are above average.
- There are `5645` houses with a condition of four (4). They comprise `26.33%` of the houses.
- There are `1687` houses with a condition of five (5). They comprise `7.87%` of the houses.

Houses with a condition of 1 or 2 are considered below average. Out of `21436` houses, `193` (`0.90%`) are below average.
- There are `164` houses with a condition of two (2). They comprise `0.77%` of the houses.
- There are `29` houses with a condition of one (1). They comprise `0.14%` of the houses.
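
The grouped figures above can also be computed directly rather than summed by hand. The following is a minimal sketch on made-up condition values (not the real distribution), with the below/average/above grouping written out as an explicit mapping.

```python
import pandas as pd

# Toy condition values (made up; not the real distribution)
condition = pd.Series([1, 2, 3, 3, 3, 3, 4, 4, 5, 3], name="condition")

# Map each condition value to a coarse group
condition_group = condition.map({
    1: "below average", 2: "below average",
    3: "average",
    4: "above average", 5: "above average",
})

# Count and percentage share per group
counts = condition_group.value_counts()
summary = pd.DataFrame({
    "count": counts,
    "percent": (counts / counts.sum() * 100).round(2),
})
print(summary)
```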

#### House Prices

We will now look at the numerical distribution of house prices.

In [None]:
# Print the numerical data behind `price`
df['price'].describe().apply(lambda x: format(x, 'f'))

The average price of a house is about `$541,649.96`, with a standard deviation of `$367,314.93`. The minimum price is `$75,000` and the maximum price is `$7,700,000`.
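
Because the maximum lies far above the mean, the distribution is probably right-skewed, and comparing the mean with the median is one quick way to check. The sketch below does this on made-up prices; the Series `price` stands in for `df['price']`.

```python
import pandas as pd

# Toy prices with one very expensive house (values are made up)
price = pd.Series([250000.0, 300000.0, 450000.0, 500000.0, 3000000.0])

stats = price.agg(["mean", "median"])

# Format as dollars with two decimals for readability
print(stats.map(lambda x: f"${x:,.2f}"))

# A mean well above the median hints at a right-skewed distribution
print("Mean above median:", stats["mean"] > stats["median"])
```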

#### House Built and Renovated

We now know that most houses are in average condition. Let's see when houses were built and how many of them have been renovated.

In [None]:
# Create a new figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))  # 1 row, 2 columns

# Plot the 'Year Built' histogram on the first subplot (ax1)
ax1.hist(df['yr_built'], edgecolor='k')
ax1.set_title('Year Built')
ax1.set_xlabel('Year Built')
ax1.set_ylabel('Count')

# Plot the 'Year Renovated' histogram on the second subplot (ax2)
renovated = df[df['yr_renovated'] > 0]
ax2.hist(renovated['yr_renovated'], edgecolor='k')
ax2.set_title('Year Renovated')
ax2.set_xlabel('Year Renovated')
ax2.set_ylabel('Count')

# Adjust the layout to prevent overlap
plt.tight_layout()

# Show the combined plot
plt.show()

# Print how many houses have been renovated and how many have not
print(f'Renovated: {len(renovated)}\nNot Renovated: {len(df) - len(renovated)}')

print('\n')

# Print the year most houses were built and the number of houses built in that year
print(f'Most houses were built in {df["yr_built"].mode()[0]}')
print(f'Number of houses built in {df["yr_built"].mode()[0]}: {len(df[df["yr_built"] == df["yr_built"].mode()[0]])}')

# Print the year most houses were renovated and the number of houses renovated in that year (don't include houses that haven't been renovated)
print(f'Most houses were renovated in {renovated["yr_renovated"].mode()[0]}')
print(f'Number of houses renovated in {renovated["yr_renovated"].mode()[0]}: {len(renovated[renovated["yr_renovated"] == renovated["yr_renovated"].mode()[0]])}')


Given the data, relatively few houses have been renovated, meaning that most houses are in their original state. The most common year for both construction and renovation is 2014. This does not by itself mean that most houses are new, only that 2014 is the single most frequent year. We must also consider that the data covers sales from May 2014 to May 2015, so 2014 is one of the most recent years in the data, which may affect these counts.
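
To put the most common year in context, it can help to express it as a share of all houses alongside the share of renovated houses. The sketch below uses a made-up frame named `toy` and treats `yr_renovated == 0` as "never renovated", as the cell above does.

```python
import pandas as pd

# Toy data; 0 in yr_renovated means the house was never renovated
toy = pd.DataFrame({
    "yr_built":     [1955, 2014, 2014, 1978, 1990, 2003],
    "yr_renovated": [0,    0,    0,    2005, 0,    2014],
})

# Share of houses that have been renovated at least once
renovated_share = (toy["yr_renovated"] > 0).mean() * 100

# Share of houses built in the most common year
built_counts = toy["yr_built"].value_counts()
mode_year = built_counts.idxmax()
mode_share = built_counts.max() / len(toy) * 100

print(f"Renovated: {renovated_share:.2f}%")
print(f"Most common build year: {mode_year} ({mode_share:.2f}% of houses)")
```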

#### House material grade

The material grade of a house is an important factor in determining the condition and price of a house. Let's see how many houses are in each material grade.

`grade`: *An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design*

In [None]:
# Plot the grade distribution
plt.hist(df['grade'], bins=13, edgecolor='k')
plt.title('Grade')
plt.xlabel('Grade')
plt.ylabel('Count')
plt.show()

# Print the value counts for `grade`
df['grade'].value_counts()


Grade 7 is the most common grade, which means that the typical house has an average level of construction and design. Because `grade` describes construction and design quality rather than state of repair, it may be a factor in determining the condition of a house, but it is not the same thing as `condition`.

- Only four (4) houses are in grade 1-3, meaning that only four (4) houses fall short of building construction and design.
- `498` houses are in grade 11-13, meaning that `498` houses have a high quality level of construction and design.
- `16,554` houses are in grade 4-10, meaning that `16,554` houses sit between these two extremes.
  - `2047` houses are in grade 4-6, below the average grade.
  - `8896` houses are in grade 7, the average grade.
  - `6611` houses are in grade 8-10, above the average grade.

The construction of most houses appears to sit at the average grade or above. This suggests that most houses are likely to be in average to above-average condition.
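
The grade bands used above can be counted in one step with `pd.cut`. This is a minimal sketch on made-up grade values; the band labels mirror the bullets above, and the Series `grade` stands in for `df['grade']`.

```python
import pandas as pd

# Toy grade values (made up for illustration)
grade = pd.Series([3, 5, 6, 7, 7, 7, 8, 9, 10, 12], name="grade")

# Right-inclusive bins: (0, 3], (3, 6], (6, 7], (7, 10], (10, 13]
grade_band = pd.cut(
    grade,
    bins=[0, 3, 6, 7, 10, 13],
    labels=["1-3", "4-6", "7", "8-10", "11-13"],
)

# Count houses per band
band_counts = grade_band.value_counts(sort=False)
print(band_counts)
```

Counting the bands this way also makes it easy to confirm that they add up to the total number of houses.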

#### Rooms

Now, let's have a quick look at the number of bedrooms and bathrooms in houses.

In [None]:
# Plot bedrooms and bathrooms on separate plots as a histogram
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))  # 1 row, 2 columns

ax1.hist(df['bedrooms'], edgecolor='k')
ax1.set_title('Bedrooms')
ax1.set_xlabel('Bedrooms')
ax1.set_ylabel('Count')

ax2.hist(df['bathrooms'], edgecolor='k')
ax2.set_title('Bathrooms')
ax2.set_xlabel('Bathrooms')
ax2.set_ylabel('Count')

plt.tight_layout()
plt.show()

# Describe bedrooms and bathrooms
df[['bedrooms', 'bathrooms']].describe()



Upon observation, it is common for houses to have `three (3)` bedrooms and `two (2)` bathrooms. This may suggest that most houses are built for families of at least three people.
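
Since `describe()` reports means and quartiles rather than the most frequent value, the "most common" counts can be checked with `mode()`. The sketch below runs on a made-up frame named `toy`.

```python
import pandas as pd

# Toy room counts (values are made up for illustration)
toy = pd.DataFrame({
    "bedrooms":  [3, 3, 4, 2, 3, 4],
    "bathrooms": [2.0, 2.5, 2.0, 1.0, 2.0, 2.5],
})

# mode() returns a DataFrame (ties give several rows); take the first row
most_common = toy[["bedrooms", "bathrooms"]].mode().iloc[0]
print(most_common)
```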

#### Square Feet

### Findings

## Class Distribution Analysis

## Correlation Analysis

## Geospatial Analysis

## Cross-Feature Relationships

## Outlier Detection

# Section 6. Model Training

# Section 7. Hyperparameter Tuning

# Section 8. Model Selection

# Section 9. Insights and Conclusions

# Section 10. References