
# ✨ Data Cleaning and Imputation for Missing Values ✨

In this notebook, we will handle missing data in a housing dataset. Specifically, we will identify which columns contain missing values and impute them using an appropriate strategy (mode imputation for the `hotwaterheating` column).

---

## 📝 1. Load the Dataset

Let's begin by loading the dataset and displaying a few records to get an overview of the data.

```python
import pandas as pd

# Load the dataset
file_path = 'Housing_V0.csv'
housing_data = pd.read_csv(file_path)

# Display the first few rows of the dataset
housing_data.head()
```

---

## 🔍 2. Check for Missing Values

Before performing any data imputation, it's essential to check for missing values in the dataset.

```python
# Check for any missing values
null_values = housing_data.isnull().sum()

# Display the columns with null values
null_values[null_values > 0]
```

### 📊 Output:

```
hotwaterheating    13
```

As we can see, the `hotwaterheating` column contains 13 missing values.
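
To judge how serious the gaps are, it can help to view the missing count next to the share of rows it represents. The sketch below is a minimal illustration on a small made-up DataFrame (`toy` and its values are assumptions, not the housing data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing dataset
toy = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "hotwaterheating": ["no", np.nan, "no", "yes"],
})

# Count of missing values and fraction of rows affected, per column
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_fraction": toy.isnull().mean(),
})

# Keep only the columns that actually contain missing values
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
print(missing_summary)
```

The same two expressions can be applied to `housing_data` directly. A small fraction of missing rows is one reason a simple strategy such as mode imputation may be acceptable.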

---

## 🔧 3. Impute Missing Values

We will impute the missing values in the `hotwaterheating` column using the **mode**, which is the most frequent value in that column.

```python
# Impute missing values using the mode of 'hotwaterheating'
mode_value = housing_data['hotwaterheating'].mode()[0]
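# Assign the result back; fillna(..., inplace=True) on a column selection triggers a warning in recent pandas
housing_data['hotwaterheating'] = housing_data['hotwaterheating'].fillna(mode_value)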

# Verify if any missing values remain
null_values_after_imputation = housing_data.isnull().sum()

# Display the result
null_values_after_imputation
```

### 📊 Output:

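`null_values_after_imputation` is a Series of per-column missing counts. Every entry should now be `0`, meaning no missing values remain in the dataset.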

The missing values have been successfully filled using the most frequent value (mode).
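
The following sketch walks through mode imputation on a tiny made-up column so each step is visible. The values in `toy` are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the real data comes from Housing_V0.csv
toy = pd.DataFrame({"hotwaterheating": ["no", "no", "yes", np.nan, "no", np.nan]})

# mode() ignores NaN by default and returns a Series, because ties are possible;
# [0] takes the first of the most frequent values
mode_value = toy["hotwaterheating"].mode()[0]

# Assign the filled column back to the DataFrame
toy["hotwaterheating"] = toy["hotwaterheating"].fillna(mode_value)

print(mode_value)
print(toy["hotwaterheating"].value_counts())
```

Comparing `value_counts()` before and after imputation shows how much the fill shifts the column toward its most frequent category.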

---

## ✅ 4. Conclusion

We have handled the missing values in the `hotwaterheating` column by applying mode imputation. The dataset no longer contains missing values and is ready for further analysis or modeling.

Next steps:
- Explore the data further for outliers or inconsistencies.
- Proceed with data transformation, scaling, or feature engineering.


# 🚫 Removing Outliers Based on Price

In this notebook, we will identify and remove outliers in the dataset based on the `price` column using the IQR (Interquartile Range) method.

---

## 1. Load the Dataset

Let's begin by loading the dataset and displaying a few records to get an overview of the data.

```python
import pandas as pd

# Load the dataset
file_path = 'Housing_V0.csv'
housing_data = pd.read_csv(file_path)

# Display the first few rows of the dataset
housing_data.head()
```

---

## 2. Identify Outliers in `price`

We will use the IQR method to identify outliers. With `IQR = Q3 - Q1`, outliers are data points that fall below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`.

```python
# Calculate Q1, Q3, and IQR for 'price'
Q1 = housing_data['price'].quantile(0.25)
Q3 = housing_data['price'].quantile(0.75)
IQR = Q3 - Q1

# Define the upper and lower bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify rows with outliers in the 'price' column
outliers = housing_data[(housing_data['price'] < lower_bound) | (housing_data['price'] > upper_bound)]

# Display the outliers
outliers
```
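
As a worked illustration of the arithmetic, the sketch below applies the same rule to seven made-up prices (the numbers are assumptions, not values from the housing data). The comments show the intermediate results under pandas' default linear interpolation for quantiles.

```python
import pandas as pd

# Hypothetical prices with one obviously extreme value
prices = pd.Series([10, 12, 13, 15, 16, 18, 100], name="price")

q1 = prices.quantile(0.25)    # 12.5
q3 = prices.quantile(0.75)    # 17.0
iqr = q3 - q1                 # 4.5

lower_bound = q1 - 1.5 * iqr  # 5.75
upper_bound = q3 + 1.5 * iqr  # 23.75

# Values strictly outside the bounds are flagged as outliers
outliers = prices[(prices < lower_bound) | (prices > upper_bound)]
print(outliers.tolist())      # [100]
```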

---

## 3. Remove Outliers from the Dataset

After identifying the outliers, we will remove them from the dataset.

```python
# Remove outliers from the dataset
housing_data_cleaned = housing_data[~((housing_data['price'] < lower_bound) | (housing_data['price'] > upper_bound))]

# Verify the number of remaining rows
housing_data_cleaned.shape
```
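
A slightly more compact way to express the same filter is `Series.between`, sketched below on made-up data (`toy` is an assumption for illustration). One difference to be aware of: `between` returns `False` for missing values, so rows with a missing `price` would be dropped as well, whereas the negated mask above keeps them.

```python
import pandas as pd

# Hypothetical toy data with one extreme price
toy = pd.DataFrame({"price": [10, 12, 13, 15, 16, 18, 100]})

q1 = toy["price"].quantile(0.25)
q3 = toy["price"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# between() is inclusive at both ends by default,
# so values exactly on a bound are kept
in_range = toy["price"].between(lower_bound, upper_bound)
toy_cleaned = toy[in_range].copy()

# Report how many rows the filter removed
rows_removed = len(toy) - len(toy_cleaned)
print(f"Removed {rows_removed} of {len(toy)} rows")
```

Reporting the number of removed rows is a quick sanity check that the filter did not discard far more data than expected.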

---

## 4. Conclusion

We have successfully removed the outliers from the dataset based on the `price` column. The dataset is now ready for further analysis or modeling.


# 📏 Scaling the 'Area' Column Between 0 and 1

In this notebook, we will scale the `area` column values to fall between 0 and 1 using **Min-Max Scaling**.

---

## 1. Load the Dataset

Let's begin by loading the dataset and displaying a few records to get an overview of the data.

```python
import pandas as pd

# Load the dataset
file_path = 'Housing_V0.csv'
housing_data = pd.read_csv(file_path)

# Display the first few rows of the dataset
housing_data.head()
```

---

## 2. Apply Min-Max Scaling to the `area` Column

We will use the `MinMaxScaler` from the `sklearn.preprocessing` module to scale the `area` values between 0 and 1.

```python
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Scale the 'area' column
housing_data[['area']] = scaler.fit_transform(housing_data[['area']])

# Display the first few rows of the scaled dataset
housing_data.head()
```
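
Min-max scaling maps each value `x` to `(x - min) / (max - min)`. The pandas-only sketch below applies that formula by hand to a few made-up areas (the values are assumptions for illustration), which makes it easier to see what the scaler computes.

```python
import pandas as pd

# Hypothetical area values
area = pd.Series([1000, 2500, 4000, 7000], name="area")

area_min = area.min()
area_max = area.max()

# Smallest value maps to 0, largest to 1, everything else lands in between
area_scaled = (area - area_min) / (area_max - area_min)
print(area_scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

Note that this hand-written formula divides by zero when every value in the column is identical, so it is only a sketch of the idea rather than a replacement for a library scaler.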

---

## 3. Verify the Scaling

Check the minimum and maximum values of the `area` column to confirm that they are scaled between 0 and 1.

```python
# Verify the min and max values of the scaled 'area' column
print("Min value:", housing_data['area'].min())
print("Max value:", housing_data['area'].max())
```

### Expected Output:
```
Min value: 0.0
Max value: 1.0
```

---

## 4. Conclusion

The `area` column has been successfully scaled to fall between 0 and 1. The dataset is now ready for further analysis or modeling.


# 🔄 One-Hot Encoding the 'Furnishingstatus' Column

In this notebook, we will apply **one-hot encoding** to the `furnishingstatus` column, which is a categorical variable, to convert it into multiple binary columns.

---

## 1. Load the Dataset

Let's begin by loading the dataset and displaying a few records to get an overview of the data.

```python
import pandas as pd

# Load the dataset
file_path = 'Housing_V0.csv'
housing_data = pd.read_csv(file_path)

# Display the first few rows of the dataset
housing_data.head()
```

---

## 2. Apply One-Hot Encoding to the `furnishingstatus` Column

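We will use the `get_dummies` function from pandas to apply one-hot encoding to the `furnishingstatus` column. Because we pass `drop_first=True`, the first category is dropped: a column with k categories yields k - 1 binary columns, and the dropped category is implied whenever all of the remaining columns are 0.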

```python
# Apply one-hot encoding to 'furnishingstatus'
housing_data_encoded = pd.get_dummies(housing_data, columns=['furnishingstatus'], drop_first=True)

# Display the first few rows of the encoded dataset
housing_data_encoded.head()
```

---

## 3. Verify the Encoding

Check that the `furnishingstatus` column has been converted into binary columns.

```python
# Display the columns to verify one-hot encoding
housing_data_encoded.columns
```
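
To see the effect of `drop_first`, the sketch below encodes a small made-up DataFrame both ways. The three category labels are assumed for illustration and may not match the labels in the actual file.

```python
import pandas as pd

# Hypothetical toy data; category labels are assumed for illustration
toy = pd.DataFrame({
    "price": [100, 200, 300],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# One binary column per category
full = pd.get_dummies(toy, columns=["furnishingstatus"])

# One column fewer: the first category (in sorted order) is dropped
reduced = pd.get_dummies(toy, columns=["furnishingstatus"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

New columns are named `<original column>_<category>`, so checking `columns` is a quick way to confirm which category was dropped.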

---

## 4. Conclusion

The `furnishingstatus` column has been successfully converted using one-hot encoding. The dataset is now ready for further analysis or modeling.


# ✂️ Splitting the Dataset into Training and Testing Sets

In this notebook, we will split the dataset into **training** and **testing** sets to prepare it for model training and evaluation.

---

## 1. Load the Dataset

Let's begin by loading the dataset and displaying a few records to get an overview of the data.

```python
import pandas as pd

# Load the dataset
file_path = 'Housing_V0.csv'
housing_data = pd.read_csv(file_path)

# Display the first few rows of the dataset
housing_data.head()
```

---

## 2. Split the Data into Training and Testing Sets

We will use the `train_test_split` function from `sklearn.model_selection` to split the data. A common choice, used here, is 80% for training and 20% for testing. Setting `random_state=42` fixes the shuffle so that the split is reproducible.

```python
from sklearn.model_selection import train_test_split

# Define the feature set and target variable
X = housing_data.drop(columns=['price'])  # Example: using 'price' as the target variable
y = housing_data['price']

# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
print("Training set shape (X_train, y_train):", X_train.shape, y_train.shape)
print("Testing set shape (X_test, y_test):", X_test.shape, y_test.shape)
```

---

## 3. Verify the Data Split

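The shapes printed in the previous step confirm the sizes of the two sets. Here we preview the first few rows of the training features to confirm that the rows were shuffled and that `price` is no longer among the columns.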

```python
# Verify the first few rows of the training set
X_train.head()
```
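
Beyond inspecting a few rows, a couple of programmatic checks can confirm that the split is sound. The sketch below runs them on a made-up ten-row DataFrame (`toy` and its columns are assumptions for illustration).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with ten rows
toy = pd.DataFrame({"area": range(10), "price": range(100, 110)})
X = toy.drop(columns=["price"])
y = toy["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# No row should appear in both sets, and together they should cover every row
assert X_train.index.intersection(X_test.index).empty
assert len(X_train) + len(X_test) == len(X)

# Features and target must stay aligned row by row
assert X_train.index.equals(y_train.index)
assert X_test.index.equals(y_test.index)

print(len(X_train), len(X_test))  # 8 2
```

The same assertions should hold for the split of `housing_data` above, since they only rely on the row index.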

---

## 4. Conclusion

The dataset has been successfully split into training and testing sets. These sets can now be used for training machine learning models and evaluating their performance.