<a href="https://colab.research.google.com/github/YYL1129/House_Model_Prediction_Model/blob/main/Simple_ML_classifier_Train_Model_Part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## “Is this house worth buying?” → Yes or No

In [15]:
import pandas as pd # work with tables
import numpy as np # math helper
from sklearn.model_selection import train_test_split # split data for training and testing
from sklearn.ensemble import RandomForestClassifier # the ML model
from sklearn.metrics import accuracy_score  # check how good the model is

In [26]:
# Simulate fake housing dataset with 500 entries
np.random.seed(42)  # same random results every time

data = pd.DataFrame({
    'price': np.random.randint(100000, 1000000, 500),
    'size_sqft': np.random.randint(500, 3000, 500),
    'bedrooms': np.random.randint(1, 5, 500),
    'bathrooms': np.random.randint(1, 4, 500),
    'age_years': np.random.randint(0, 30, 500),
    'near_mrt': np.random.choice([0, 1], 500),  # 1 = near MRT
    'location_score': np.random.randint(1, 10, 500),
})

In [27]:
data

,price,size_sqft,bedrooms,bathrooms,age_years,near_mrt,location_score
0,221958,2215,2,3,27,1,8
1,771155,1181,4,2,19,1,9
2,231932,1337,2,1,0,1,7
3,465838,1577,1,1,9,1,1
4,359178,2991,3,2,7,0,3
...,...,...,...,...,...,...,...
495,805660,2417,2,1,16,1,1
496,384821,638,4,2,9,0,8
497,155609,2526,1,3,28,0,4
498,327897,2806,1,2,9,0,9


What this does:

Creates 500 fake house entries. Each row = a house with:
- price
- size (sq ft)
- number of bedrooms and bathrooms
- age (years)
- MRT access (yes/no)
- location score (1–9, because `randint(1, 10)` never returns 10)
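A quick check of the ranges: `np.random.randint(low, high)` excludes `high`, so `randint(1, 10, ...)` can produce 1 through 9 but never 10. The sketch below confirms this on a separate sample; the variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Draw many values so every possible score shows up at least once
sample_scores = np.random.RandomState(0).randint(1, 10, 10_000)  # high=10 is exclusive

lowest = int(sample_scores.min())
highest = int(sample_scores.max())
print("lowest:", lowest, "highest:", highest)
```

If a true 1–10 scale is wanted, `randint(1, 11, 500)` would be the call to use.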

In [11]:
len(data)

500

In [13]:
data.shape

(500, 8)

In [14]:
data.columns

Index(['price', 'size_sqft', 'bedrooms', 'bathrooms', 'age_years', 'near_mrt',
       'location_score', 'worth_buying'],
      dtype='object')

Note: `(500, 8)` and the `worth_buying` column show up here because these two cells were run after the labeling cell further down (see the `In [ ]` numbers). Run top to bottom, the table has 7 columns at this point.

In [28]:
data.describe()

,price,size_sqft,bedrooms,bathrooms,age_years,near_mrt,location_score
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,532184.786,1714.5,2.448,1.98,14.052,0.526,5.136
std,258722.401964,702.258146,1.107135,0.825204,8.620087,0.499824,2.527901
min,102869.0,503.0,1.0,1.0,0.0,0.0,1.0
25%,301306.75,1150.25,1.0,1.0,7.0,0.0,3.0
50%,520154.5,1701.0,2.0,2.0,14.0,1.0,5.0
75%,755398.75,2319.5,3.0,3.0,21.25,1.0,7.0
max,999684.0,2997.0,4.0,3.0,29.0,1.0,9.0


In [6]:
# Add simple logic to label good deals (1 = worth buying, 0 = not worth)
data['worth_buying'] = (
    (data['price'] < 500000) &
    (data['location_score'] > 6) &
    (data['near_mrt'] == 1)
).astype(int)


🔎 This creates our “answer” column:

If price < 500K and location score > 6 and near MRT → worth buying (1)

Else → not worth it (0)
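To see the rule in action, here is a small sketch that applies the same three conditions to four hand-made houses. The `houses` table and its values are made up for illustration; only the rule itself comes from the notebook.

```python
import pandas as pd

# Four hypothetical houses, each chosen to pass or fail one condition
houses = pd.DataFrame({
    'price': [420000, 420000, 650000, 300000],
    'location_score': [8, 6, 9, 7],
    'near_mrt': [1, 1, 1, 0],
})

# Same rule as above: cheap AND good location AND near MRT
houses['worth_buying'] = (
    (houses['price'] < 500000) &
    (houses['location_score'] > 6) &
    (houses['near_mrt'] == 1)
).astype(int)

print(houses)
print(houses['worth_buying'].value_counts())
```

Only the first house passes all three checks. The second fails because a score of 6 is not greater than 6, the third is too expensive, and the fourth is not near an MRT.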

In [7]:
X = data.drop('worth_buying', axis=1)  # features (what you know)
y = data['worth_buying']               # label (what you want to predict)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


| `axis` | Means         | Why?             |
| ------ | ------------- | ---------------- |
| `0`    | Remove rows   | Not what we want |
| `1`    | Remove column | ✅ We want this   |


🔎

X = the inputs (everything except the answer)

y = the answer (yes/no)

train_test_split = shuffles the rows, then splits 80% for training, 20% for testing (`test_size=0.2`)
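A minimal sketch of what the split produces, using a stand-in table of 500 rows (the `X_demo` and `y_demo` names are hypothetical, not from the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: 500 rows, one feature, alternating 0/1 labels
X_demo = pd.DataFrame({'feature': np.arange(500)})
y_demo = pd.Series(np.arange(500) % 2, name='label')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("train rows:", X_tr.shape[0])
print("test rows:", X_te.shape[0])

# Features and labels stay matched row by row after shuffling
rows_still_aligned = X_tr.index.equals(y_tr.index)
print("aligned:", rows_still_aligned)
```

With 500 rows and `test_size=0.2`, 400 rows go to training and 100 to testing. Fixing `random_state` makes the same rows land in each part on every run.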

## **How to remember:**
Use this:

X = Everything except the answer <br>
y = The answer

And:

Drop column = axis=1 (imagine columns go left to right ➡️ = axis 1)<br>
Drop row = axis=0 (imagine rows go top to bottom ⬇️ = axis 0)
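The two `axis` values can be tried side by side on a tiny table. This is only a sketch with made-up data; `drop(columns=...)` is shown as an alternative spelling that avoids having to remember the axis number.

```python
import pandas as pd

table = pd.DataFrame({'price': [1, 2, 3], 'worth_buying': [0, 1, 0]})

without_column = table.drop('worth_buying', axis=1)   # removes the column named 'worth_buying'
without_row = table.drop(0, axis=0)                   # removes the row whose index label is 0
same_as_axis_1 = table.drop(columns='worth_buying')   # equivalent to axis=1

print(without_column)
print(without_row)
```

None of these calls change `table` itself. Each returns a new DataFrame, which is why the notebook assigns the result to `X`.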

In [8]:
# Train the model
model = RandomForestClassifier()
model.fit(X_train, y_train)


🔎

Creates a random forest model (a set of decision trees)

Trains it using the data
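To peek inside a trained forest, the sketch below fits one on a small made-up dataset and then looks at how many trees it holds and which features it relied on. The `demo` table, the `noise` column, and the fixed `random_state` are assumptions for illustration; the notebook's own model uses the default settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
demo = pd.DataFrame({
    'price': rng.randint(100000, 1000000, 300),
    'location_score': rng.randint(1, 10, 300),
    'noise': rng.randint(0, 100, 300),  # unrelated to the label
})
labels = ((demo['price'] < 500000) & (demo['location_score'] > 6)).astype(int)

# random_state makes the forest reproducible from run to run
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(demo, labels)

print("number of trees:", len(forest.estimators_))

# Importances sum to 1; higher means the trees split on that feature more usefully
importances = pd.Series(forest.feature_importances_, index=demo.columns)
print(importances.sort_values(ascending=False))
```

Since the label here depends only on `price` and `location_score`, the `noise` column would be expected to rank low, though the exact numbers vary with the data and seed.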

In [29]:
import joblib

# Save the trained model to a file (like saving a brain)
joblib.dump(model, 'house_model.pkl')
print("✅ Model saved as house_model.pkl")


✅ Model saved as house_model.pkl


In [9]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", round(accuracy * 100, 2), "%")


Model Accuracy: 100.0 %


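🔎
### What it does:
- Predicts a label for every house in the test data
- Compares `y_pred` (model’s guess) with `y_test` (actual answer)
- Counts how many it got correct and shows that as % accuracy

A 100% score is not surprising here: `worth_buying` was built from a fixed rule on `price`, `location_score`, and `near_mrt`, and the model sees all three as features. Real housing data will not score this cleanly.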


| Name      | What it is                         | Use for...          |
| --------- | ---------------------------------- | ------------------- |
| `X_train` | Input features used for training   | Training the model  |
| `y_train` | Answers/labels used for training   | Training the model  |
| `X_test`  | Input features used for testing    | Making predictions  |
| `y_test`  | True answers used for testing      | Checking accuracy   |
| `y_pred`  | What the model guessed on `X_test` | Compare to `y_test` |
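Accuracy is simply the share of matching entries, which can be checked by hand. The sketch below uses ten made-up labels (not the notebook's data) and also compares against a "always say 0" baseline, which is worth doing whenever one class is rare.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true answers and model guesses for 10 houses
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
y_guess = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

manual = (y_true == y_guess).mean()          # fraction of matching positions
library = accuracy_score(y_true, y_guess)    # same number from scikit-learn
baseline = accuracy_score(y_true, np.zeros_like(y_true))  # always predict 0

print("manual:", manual, "library:", library, "baseline:", baseline)

# Rows = actual class, columns = predicted class
matrix = confusion_matrix(y_true, y_guess)
print(matrix)
```

Here the model gets 9 of 10 right, while always guessing 0 already gets 8 of 10. The confusion matrix shows where the one miss is: a house that was actually worth buying but predicted as not.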

In [10]:
# Example: Predict this new house
new_house = pd.DataFrame({
    'price': [420000],
    'size_sqft': [900],
    'bedrooms': [2],
    'bathrooms': [1],
    'age_years': [5],
    'near_mrt': [1],
    'location_score': [8]
})

prediction = model.predict(new_house)[0]
print("Prediction:", "✅ Worth Buying" if prediction == 1 else "❌ Not Worth")


Prediction: ✅ Worth Buying


🔎 Uses your trained model to make a prediction for one new house.
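Two things are handy when predicting on new rows: keeping the columns in the same order as during training, and asking for probabilities instead of only a 0/1 answer. The sketch below shows both on a small stand-in model; `train`, `clf`, and `candidate` are hypothetical names, and the two-feature setup is a simplification of the notebook's seven features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(1)
train = pd.DataFrame({
    'price': rng.randint(100000, 1000000, 200),
    'near_mrt': rng.choice([0, 1], 200),
})
target = ((train['price'] < 500000) & (train['near_mrt'] == 1)).astype(int)
clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(train, target)

# New house with columns deliberately listed in a different order
candidate = pd.DataFrame({'near_mrt': [1], 'price': [420000]})
candidate = candidate[train.columns]  # put columns back in training order

# One probability per class, in the order given by clf.classes_
proba = clf.predict_proba(candidate)[0]
print(dict(zip(clf.classes_, proba)))
```

`predict` returns whichever class has the higher probability, so `predict_proba` gives a rough sense of how confident the forest is.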

| Model Type                 | Use For                                | Output       |
| -------------------------- | -------------------------------------- | ------------ |
| **Linear Regression**      | Predict **numbers**                    | 100.50, 27.0 |
| **RandomForestClassifier** | Predict **categories** (Yes/No, A/B/C) | 0, 1, "Yes"  |


Why a random forest for this task:

| Reason                     | Benefit                                                  |
| -------------------------- | -------------------------------------------------------- |
| Uses many decision trees   | Usually more accurate and overfits less than one tree    |
| Handles 0/1 or Yes/No well | Good fit for classification                              |
| No need to scale your data | Beginner-friendly                                        |
| Works well with mixed data | Numbers + categories = OK (encode categories as numbers) |


# **Step 1: Add this after training your model**

Right after this line:

`model.fit(X_train, y_train)`

Add this:

`import joblib`<br>
`# Save the trained model to a file (like saving a brain)`<br>
`joblib.dump(model, 'house_model.pkl')`<br>
`print("✅ Model saved as house_model.pkl")`


📦 What it does:
Saves your trained model (only the brain)

File is called house_model.pkl

No data is saved, just the model
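The saved file can later be read back with `joblib.load`. The sketch below does a full save-and-load round trip on a tiny stand-in model in a temporary folder, so it does not touch `house_model.pkl`; the data and file name are made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in dataset: label is 1 only when both inputs are 1
X_small = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10)
y_small = X_small[:, 0] & X_small[:, 1]
original = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(original, path)      # save the trained model
    restored = joblib.load(path)     # load it back, no retraining needed

# The restored model should give the same answers as the original
same_predictions = np.array_equal(original.predict(X_small), restored.predict(X_small))
print("same predictions:", same_predictions)
```

Two cautions: `.pkl` files are pickle-based, so only load files from a source you trust, and it is safest to load a model with the same scikit-learn version that saved it.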

In [30]:
# This will download house_model.pkl to your local PC.
from google.colab import files
files.download('house_model.pkl')

