Problem:
You are given a dataset containing information about houses. The dataset includes various features such as the number of bedrooms, the size of the living area, the location, and the age of the house. Your task is to build a machine learning model that can predict the price of a house based on these features.

Dataset:
The dataset consists of the following features: number of bedrooms, living area size, location (categorical variable), and house age. Each data point also includes the corresponding house price. Here's a sample of the dataset:

| Bedrooms | Living Area (sq.ft.) | Location | Age (years) | Price (in dollars) |
|----------|---------------------|----------|-------------|--------------------|
|    3     |         1500        |   Urban  |      5      |       250,000      |
|    4     |         2000        | Suburban |     10      |       400,000      |
|    2     |         1200        |   Rural  |     15      |       150,000      |
|    5     |         3000        | Suburban |      2      |       600,000      |
|    3     |         1800        |   Urban  |      8      |       350,000      |

To solve this supervised learning problem, you will need to preprocess the data, select an appropriate algorithm, train the model on the labeled dataset, and evaluate its performance. Once the model is trained, you can use it to predict the prices of new, unseen houses from their features.

Note: This problem has multiple input features and a continuous target variable, so it calls for a regression algorithm. You can explore linear regression, decision trees, random forests, or support vector regression to solve it.
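Because the sample holds only five rows, a single train/test split leaves one house for testing. The sketch below is one possible way to compare the four algorithms named in the note: it uses leave-one-out cross-validation and mean absolute error, both of which are choices made here for illustration and not part of the solutions that follow. The names `candidates` and `mae_by_model`, and the settings `n_estimators=50` and `random_state=0`, are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# The five sample rows from the table above.
data = pd.DataFrame({
    'Bedrooms': [3, 4, 2, 5, 3],
    'Living Area': [1500, 2000, 1200, 3000, 1800],
    'Location': ['Urban', 'Suburban', 'Rural', 'Suburban', 'Urban'],
    'Age': [5, 10, 15, 2, 8],
    'Price': [250000, 400000, 150000, 600000, 350000],
})
X = pd.get_dummies(data.drop('Price', axis=1), columns=['Location'])
y = data['Price']

# Candidate regressors (hyperparameters are illustrative assumptions).
candidates = {
    'linear regression': LinearRegression(),
    'decision tree': DecisionTreeRegressor(random_state=0),
    'random forest': RandomForestRegressor(n_estimators=50, random_state=0),
    'support vector regression': SVR(),
}

# Each fold trains on four houses and tests on the remaining one.
mae_by_model = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=LeaveOneOut(),
                             scoring='neg_mean_absolute_error')
    mae_by_model[name] = -scores.mean()
    print(f"{name}: mean absolute error = {mae_by_model[name]:,.0f}")
```

With so few rows the resulting numbers are only a rough indication; the same loop applies unchanged to a larger dataset.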

Solution with Linear Regression

Solution to the problem using linear regression:


# Step 1: Import the necessary libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Step 2: Load the dataset

In [2]:
dataset = pd.DataFrame({
    'Bedrooms': [3, 4, 2, 5, 3],
    'Living Area': [1500, 2000, 1200, 3000, 1800],
    'Location': ['Urban', 'Suburban', 'Rural', 'Suburban', 'Urban'],
    'Age': [5, 10, 15, 2, 8],
    'Price': [250000, 400000, 150000, 600000, 350000]
})

# Step 3: Preprocess the data
# Convert categorical variable 'Location' into dummy variables

In [3]:
dataset = pd.get_dummies(dataset, columns=['Location'])
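As a small illustration of what this step produces, the sketch below encodes a three-row frame (`sample`, a hypothetical subset of the data) and prints the resulting column names. The dummy columns are appended after the untouched columns, one per category, in sorted order.

```python
import pandas as pd

# Hypothetical three-row subset, one row per location category.
sample = pd.DataFrame({
    'Bedrooms': [3, 4, 2],
    'Location': ['Urban', 'Suburban', 'Rural'],
    'Price': [250000, 400000, 150000],
})
encoded = pd.get_dummies(sample, columns=['Location'])
print(list(encoded.columns))
# ['Bedrooms', 'Price', 'Location_Rural', 'Location_Suburban', 'Location_Urban']
```

Any data passed to the model later has to provide these same dummy columns.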

# Step 4: Separate the input features and target variable

In [4]:
X = dataset.drop('Price', axis=1)
y = dataset['Price']

# Step 5: Split the dataset into training and testing sets

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Create an instance of the linear regression model

In [6]:
model = LinearRegression()

# Step 7: Fit the model to the training data

In [7]:
model.fit(X_train, y_train)

# Step 8: Make predictions on the test data

In [8]:
y_pred = model.predict(X_test)

# Step 9: Evaluate the performance of the model

In [45]:
# R^2 is undefined for fewer than two test samples, so check the size of
# the test set (y_test) before scoring the predictions (y_pred)
if len(y_test) >= 2:
    r2 = r2_score(y_test, y_pred)
    print("R^2 score:", r2)
else:
    print("Insufficient data to calculate R^2 score.")


Insufficient data to calculate R^2 score.


# Step 10: Make predictions on new, unseen data

In [None]:
new_data = pd.DataFrame({
    'Bedrooms': [4],
    'Living Area': [2200],
    'Age': [7],
    'Location_Rural': [0],  # every dummy column seen during fit must be present
    'Location_Suburban': [1],
    'Location_Urban': [0]
})
# Match the column order of the training features, which scikit-learn checks
new_data = new_data[X.columns]
new_prediction = model.predict(new_data)
print("Predicted Price for new house:", new_prediction)



In this solution, we use the `LinearRegression` class from scikit-learn to create the linear regression model. We preprocess the data by one-hot encoding the categorical variable 'Location' into dummy variables, then split the dataset into training and testing sets. We fit the model to the training data, make predictions on the test data, and evaluate them with the R-squared (R2) score. With only five rows and `test_size=0.2`, the test set holds a single house, and R2 is undefined for fewer than two samples, which is why the evaluation step reports insufficient data. Finally, we use the trained model to predict the price of a new house; the new data must contain every dummy column that was present during training, including `Location_Rural`.
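To see why R2 needs at least two test samples, recall that R2 = 1 - SS_res / SS_tot, where SS_tot sums the squared deviations of the actual prices from their mean. The sketch below computes both sums by hand; the helper `r2_terms` and the example prices are hypothetical and serve only to illustrate the arithmetic.

```python
import numpy as np

def r2_terms(y_true, y_pred):
    """Return (SS_res, SS_tot), the two sums in R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return ss_res, ss_tot

# Hypothetical single-house test set: one actual price, one predicted price.
ss_res, ss_tot = r2_terms([400000], [380000])
print(ss_res, ss_tot)  # SS_tot is 0.0, so SS_res / SS_tot is undefined

# With two differing actual prices the denominator is positive.
ss_res2, ss_tot2 = r2_terms([400000, 250000], [380000, 260000])
print(1 - ss_res2 / ss_tot2)
```

A single actual value always equals its own mean, so SS_tot is zero no matter how good or bad the prediction is.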


Solution with Decision Tree

Solution to the problem using a decision tree algorithm:


# Step 1: Import the necessary libraries

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Step 2: Load the dataset

In [15]:
dataset = pd.DataFrame({
    'Bedrooms': [3, 4, 2, 5, 3],
    'Living Area': [1500, 2000, 1200, 3000, 1800],
    'Location': ['Urban', 'Suburban', 'Rural', 'Suburban', 'Urban'],
    'Age': [5, 10, 15, 2, 8],
    'Price': [250000, 400000, 150000, 600000, 350000]
})

# Step 3: Preprocess the data
# Convert categorical variable 'Location' into dummy variables

In [16]:
dataset = pd.get_dummies(dataset, columns=['Location'])


# Step 4: Separate the input features and target variable

In [17]:
X = dataset.drop('Price', axis=1)
y = dataset['Price']

# Step 5: Split the dataset into training and testing sets

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Create an instance of the decision tree regression model

In [19]:
model = DecisionTreeRegressor()

# Step 7: Fit the model to the training data

In [20]:
model.fit(X_train, y_train)

# Step 8: Make predictions on the test data

In [21]:
y_pred = model.predict(X_test)

# Step 9: Evaluate the performance of the model

In [None]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)

# Step 10: Make predictions on new, unseen data

In [None]:
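new_data = pd.DataFrame({
    'Bedrooms': [4],
    'Living Area': [2200],
    'Age': [7],
    'Location_Rural': [0],  # every dummy column seen during fit must be present
    'Location_Suburban': [1],
    'Location_Urban': [0]
})
# Match the column order of the training features, which scikit-learn checks
new_data = new_data[X.columns]
new_prediction = model.predict(new_data)
print("Predicted Price for new house:", new_prediction)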

In this solution, we use the `DecisionTreeRegressor` class from scikit-learn to create an instance of the decision tree regression model. We preprocess the data by converting the categorical variable 'Location' into dummy variables using one-hot encoding. Then, we split the dataset into training and testing sets. We fit the decision tree model to the training data, make predictions on the test data, and evaluate the model's performance using mean squared error (MSE) and R-squared (R2) score. Finally, we use the trained model to predict the price of a new house based on its features.
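When a new house arrives with its raw 'Location' value, encoding it on its own yields only the dummy column for that one category. A minimal sketch of one way to handle this, assuming the training column layout is known: encode the new rows, then align them to the training columns with `DataFrame.reindex`. The helper `encode_like_training` and the variable names are hypothetical.

```python
import pandas as pd

def encode_like_training(raw_rows, training_columns):
    """One-hot encode raw_rows, then align the result to training_columns."""
    encoded = pd.get_dummies(raw_rows, columns=['Location'])
    # Missing dummy columns are added and filled with 0, columns that were
    # not seen in training are dropped, and the order matches training.
    return encoded.reindex(columns=training_columns, fill_value=0)

# Column layout that pd.get_dummies produces for the training features here.
training_columns = ['Bedrooms', 'Living Area', 'Age',
                    'Location_Rural', 'Location_Suburban', 'Location_Urban']

new_house = pd.DataFrame({
    'Bedrooms': [4], 'Living Area': [2200], 'Location': ['Suburban'], 'Age': [7],
})
aligned = encode_like_training(new_house, training_columns)
print(aligned)
```

In the notebook, `X.columns` could be passed as `training_columns`, and `aligned` could then be handed to `model.predict`.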

Solution with Random Forest

Solution to the problem using a random forest algorithm:


# Step 1: Import the necessary libraries

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Step 2: Load the dataset

In [25]:
dataset = pd.DataFrame({
    'Bedrooms': [3, 4, 2, 5, 3],
    'Living Area': [1500, 2000, 1200, 3000, 1800],
    'Location': ['Urban', 'Suburban', 'Rural', 'Suburban', 'Urban'],
    'Age': [5, 10, 15, 2, 8],
    'Price': [250000, 400000, 150000, 600000, 350000]
})

# Step 3: Preprocess the data
# Convert categorical variable 'Location' into dummy variables

In [26]:
dataset = pd.get_dummies(dataset, columns=['Location'])

# Step 4: Separate the input features and target variable

In [27]:
X = dataset.drop('Price', axis=1)
y = dataset['Price']


# Step 5: Split the dataset into training and testing sets

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Create an instance of the random forest regression model

In [29]:
model = RandomForestRegressor()

# Step 7: Fit the model to the training data

In [30]:
model.fit(X_train, y_train)

# Step 8: Make predictions on the test data

In [31]:
y_pred = model.predict(X_test)

# Step 9: Evaluate the performance of the model

In [None]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)

# Step 10: Make predictions on new, unseen data

In [33]:
new_data = pd.DataFrame({
    'Bedrooms': [4],
    'Living Area': [2200],
    'Age': [7],
    # Omitting Location_Rural raises "ValueError: The feature names should
    # match those that were passed during fit."
    'Location_Rural': [0],
    'Location_Suburban': [1],
    'Location_Urban': [0]
})
# Match the column order of the training features, which scikit-learn checks
new_data = new_data[X.columns]
new_prediction = model.predict(new_data)
print("Predicted Price for new house:", new_prediction)


In this solution, we use the `RandomForestRegressor` class from scikit-learn to create an instance of the random forest regression model. We preprocess the data by converting the categorical variable 'Location' into dummy variables using one-hot encoding. Then, we split the dataset into training and testing sets. We fit the random forest model to the training data, make predictions on the test data, and evaluate the model's performance using mean squared error (MSE) and R-squared (R2) score. Finally, we use the trained model to predict the price of a new house based on its features.
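An alternative to building dummy columns by hand is to put the encoding inside the model itself. The sketch below is one possible arrangement, not the approach used in the steps above: a scikit-learn `Pipeline` whose first stage one-hot encodes 'Location' with `OneHotEncoder`, so new houses can be passed in their raw form. The names `preprocess`, `pipeline`, and `new_house`, and the settings `n_estimators=50` and `random_state=0`, are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

raw = pd.DataFrame({
    'Bedrooms': [3, 4, 2, 5, 3],
    'Living Area': [1500, 2000, 1200, 3000, 1800],
    'Location': ['Urban', 'Suburban', 'Rural', 'Suburban', 'Urban'],
    'Age': [5, 10, 15, 2, 8],
    'Price': [250000, 400000, 150000, 600000, 350000],
})
features = raw.drop('Price', axis=1)
prices = raw['Price']

# Encode 'Location' and pass the numeric columns through unchanged.
# handle_unknown='ignore' encodes an unseen location as all zeros.
preprocess = ColumnTransformer(
    [('location', OneHotEncoder(handle_unknown='ignore'), ['Location'])],
    remainder='passthrough',
)
pipeline = Pipeline([
    ('preprocess', preprocess),
    ('forest', RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipeline.fit(features, prices)

# The new house keeps its raw 'Location' value and the training column order.
new_house = pd.DataFrame({
    'Bedrooms': [4], 'Living Area': [2200], 'Location': ['Suburban'], 'Age': [7],
})
pipeline_prediction = pipeline.predict(new_house)
print("Predicted Price for new house:", pipeline_prediction)
```

Because the encoder learns its categories during `fit`, the prediction step cannot end up with missing or reordered dummy columns.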


Solution with Support Vector Regression

Solution to the problem using support vector regression (SVR), the regression form of the Support Vector Machine (SVM) algorithm:


# Step 1: Import the necessary libraries

In [34]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Step 2: Load the dataset

In [35]:
dataset = pd.DataFrame({
    'Bedrooms': [3, 4, 2, 5, 3],
    'Living Area': [1500, 2000, 1200, 3000, 1800],
    'Location': ['Urban', 'Suburban', 'Rural', 'Suburban', 'Urban'],
    'Age': [5, 10, 15, 2, 8],
    'Price': [250000, 400000, 150000, 600000, 350000]
})


# Step 3: Preprocess the data
# Convert categorical variable 'Location' into dummy variables

In [36]:
dataset = pd.get_dummies(dataset, columns=['Location'])

# Step 4: Separate the input features and target variable

In [37]:
X = dataset.drop('Price', axis=1)
y = dataset['Price']


# Step 5: Split the dataset into training and testing sets

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Create an instance of the support vector regression model

In [39]:
model = SVR()

# Step 7: Fit the model to the training data

In [40]:
model.fit(X_train, y_train)

# Step 8: Make predictions on the test data

In [41]:
y_pred = model.predict(X_test)

# Step 9: Evaluate the performance of the model

In [42]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)

Mean Squared Error: 9999934384.184526
R2 Score: nan




# Step 10: Make predictions on new, unseen data

In [43]:
new_data = pd.DataFrame({
    'Bedrooms': [4],
    'Living Area': [2200],
    'Age': [7],
    # Omitting Location_Rural raises "ValueError: The feature names should
    # match those that were passed during fit."
    'Location_Rural': [0],
    'Location_Suburban': [1],
    'Location_Urban': [0]
})
# Match the column order of the training features, which scikit-learn checks
new_data = new_data[X.columns]
new_prediction = model.predict(new_data)
print("Predicted Price for new house:", new_prediction)


In this solution, we use the `SVR` class from scikit-learn to create the support vector regression model. We preprocess the data by one-hot encoding the categorical variable 'Location' into dummy variables, then split the dataset into training and testing sets. We fit the model to the training data, make predictions on the test data, and evaluate the model's performance using mean squared error (MSE) and R-squared (R2) score. The R2 score is reported as `nan` because the test set holds a single house, and R2 is undefined for fewer than two samples. Finally, we use the trained model to predict the price of a new house based on its features.
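The MSE of about 1e10 reported in Step 9 corresponds to an error of roughly 100,000 dollars on the single test house. A likely contributor is that `SVR` with default settings is sensitive to the scale of both the features and the target, and here neither is scaled. The sketch below shows one common remedy, offered as an assumption and not as part of the solution above: standardize the features in a pipeline and standardize the target with `TransformedTargetRegressor`. The name `scaled_svr` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

data = pd.DataFrame({
    'Bedrooms': [3, 4, 2, 5, 3],
    'Living Area': [1500, 2000, 1200, 3000, 1800],
    'Location': ['Urban', 'Suburban', 'Rural', 'Suburban', 'Urban'],
    'Age': [5, 10, 15, 2, 8],
    'Price': [250000, 400000, 150000, 600000, 350000],
})
X = pd.get_dummies(data.drop('Price', axis=1), columns=['Location']).astype(float)
y = data['Price']

# Features are standardized inside the pipeline; the target is standardized
# before fitting and mapped back to dollars when predicting.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_svr.fit(X, y)

fitted_prices = scaled_svr.predict(X)
print(fitted_prices)
```

The sketch fits on all five rows only to keep it short; how much scaling helps on held-out houses would need to be checked on a larger dataset.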