# SC1304 Tutorial: Supervised and Unsupervised Learning Practice Problems

# Exercise 1: House Price Prediction using Linear Regression
Background
You work for a real estate company that wants to predict house prices based on various features. They have provided you with a dataset containing information about houses sold in the past year.

Data Description
The dataset contains the following features:

Square footage (sqft)
<br> Number of bedrooms (bedrooms)
<br> Number of bathrooms (bathrooms)
<br> Age of the house in years (age)
<br> Distance to nearest school in miles (distance_to_school)
<br> Target variable: House price (price)

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Generate sample data (in a real scenario, you would load your own dataset)
np.random.seed(42)
n_samples = 1000

data = {
    'sqft': np.random.normal(2000, 500, n_samples),
    'bedrooms': np.random.randint(1, 6, n_samples),
    'bathrooms': np.random.randint(1, 4, n_samples),
    'age': np.random.randint(0, 50, n_samples),
    'distance_to_school': np.random.uniform(0, 5, n_samples)
}

# Build price as a linear function of the features plus Gaussian noise
data['price'] = (
    200 * data['sqft'] +
    50000 * data['bedrooms'] +
    75000 * data['bathrooms'] -
    2000 * data['age'] -
    25000 * data['distance_to_school'] +
    np.random.normal(0, 50000, n_samples)
)

df = pd.DataFrame(data)

In [2]:
df.head()

,sqft,bedrooms,bathrooms,age,distance_to_school,price
0,2248.357077,4,3,32,0.071907,814393.216497
1,1930.867849,1,3,49,3.821767,493000.44297
2,2323.844269,3,1,37,3.118715,492004.731839
3,2761.514928,5,3,26,3.811513,936729.897273
4,1882.923313,3,3,33,0.19469,611025.858251


Questions for Exercise 1:
<br> a. Split the data into training (70%) and testing (30%) sets
<br> b. Train a linear regression model
<br> c. Make predictions on the test set
<br> d. Calculate and interpret the R² score and RMSE
<br> e. Which features have the strongest influence on house prices?
<br> f. What are the assumptions of linear regression, and how can you verify them?

In [None]:
#your code here, feel free to use any library or software you are familiar with
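
One possible way to approach questions a to f is sketched below. It is not an official solution: it rebuilds the synthetic housing data with a local random generator (an assumption, so the numbers will differ slightly from the `df` above), and the names `houses`, `influence`, and `residuals` are illustrative. Standardizing the features before comparing coefficients is one common choice for question e, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the synthetic housing data so this sketch runs on its own.
# Assumption: a local generator is used instead of the notebook's global seed.
rng = np.random.default_rng(42)
n_samples = 1000
houses = pd.DataFrame({
    "sqft": rng.normal(2000, 500, n_samples),
    "bedrooms": rng.integers(1, 6, n_samples),
    "bathrooms": rng.integers(1, 4, n_samples),
    "age": rng.integers(0, 50, n_samples),
    "distance_to_school": rng.uniform(0, 5, n_samples),
})
houses["price"] = (
    200 * houses["sqft"]
    + 50000 * houses["bedrooms"]
    + 75000 * houses["bathrooms"]
    - 2000 * houses["age"]
    - 25000 * houses["distance_to_school"]
    + rng.normal(0, 50000, n_samples)
)

X = houses.drop(columns="price")
y = houses["price"]

# (a) 70% training / 30% testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# (b) fit the model and (c) predict on the held-out test set
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# (d) R² is the share of price variance explained; RMSE is in price units
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R2 = {r2:.3f}, RMSE = {rmse:,.0f}")

# (e) refit on standardized features so coefficient sizes are comparable
scaler = StandardScaler().fit(X_train)
std_model = LinearRegression().fit(scaler.transform(X_train), y_train)
coefs = pd.Series(std_model.coef_, index=X.columns)
influence = coefs.loc[coefs.abs().sort_values(ascending=False).index]
print(influence)

# (f) residuals are the starting point for checking assumptions:
# plot them against y_pred (linearity, constant variance) and
# inspect their distribution (approximate normality)
residuals = y_test - y_pred
print(residuals.describe())
```

The raw coefficients in `model.coef_` answer "how much does price change per unit of this feature", while the standardized ones in `influence` answer "which feature moves price the most across its typical range".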

# Exercise 2: Customer Churn Prediction using Logistic Regression
Background
A telecommunications company wants to predict which customers are likely to churn (cancel their service) based on their usage patterns and demographic information.

Data Description
The dataset contains the following features:

<br> Monthly charges ($)
<br> Contract length (months)
<br> Total data usage (GB)
<br> Customer service calls (count)
<br> Age of customer (years)
<br> Target variable: Churned (0 = No, 1 = Yes)

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Generate sample data
np.random.seed(42)
n_samples = 1000

data = {
    'monthly_charges': np.random.uniform(30, 120, n_samples),
    'contract_length': np.random.randint(1, 24, n_samples),
    'data_usage': np.random.uniform(0, 500, n_samples),
    'service_calls': np.random.poisson(2, n_samples),
    'age': np.random.normal(40, 15, n_samples)
}

# Build a linear score (the log-odds of churn) from the features
z = (
    0.03 * data['monthly_charges'] -
    0.1 * data['contract_length'] +
    0.001 * data['data_usage'] +
    0.5 * data['service_calls'] -
    0.02 * data['age']
)

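# Convert the score z to probabilities using the sigmoid function
prob_churn = 1 / (1 + np.exp(-z))
# Label a customer as churned when the probability exceeds 0.5.
# Note: this makes the labels a deterministic function of the features.
data['churned'] = (prob_churn > 0.5).astype(int)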

df = pd.DataFrame(data)
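
The cell above thresholds `prob_churn` at 0.5, so every label is fully determined by the features and the two classes are perfectly linearly separable. As a hedged alternative, the sketch below draws each label as a Bernoulli trial with probability `prob_churn`, which gives noisier and arguably more realistic data. It regenerates the features with a local generator (an assumption), and the names `churn_thresholded` and `churn_sampled` are illustrative.

```python
import numpy as np

# Assumption: a local generator, so values differ from the notebook's df.
rng = np.random.default_rng(42)
n_samples = 1000

monthly_charges = rng.uniform(30, 120, n_samples)
contract_length = rng.integers(1, 24, n_samples)
data_usage = rng.uniform(0, 500, n_samples)
service_calls = rng.poisson(2, n_samples)
age = rng.normal(40, 15, n_samples)

# Same linear score and sigmoid as in the cell above
z = (
    0.03 * monthly_charges
    - 0.1 * contract_length
    + 0.001 * data_usage
    + 0.5 * service_calls
    - 0.02 * age
)
prob_churn = 1 / (1 + np.exp(-z))

# Deterministic labels: equivalent to checking whether z > 0
churn_thresholded = (prob_churn > 0.5).astype(int)

# Noisy labels: one Bernoulli draw per customer with p = prob_churn
churn_sampled = rng.binomial(1, prob_churn)

print("churn rate (thresholded):", churn_thresholded.mean())
print("churn rate (sampled):    ", churn_sampled.mean())
print("labels that differ:      ", (churn_thresholded != churn_sampled).sum())
```

With sampled labels, no model can reach perfect accuracy, which makes the precision and recall trade-off in the questions below more meaningful.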

Questions for Exercise 2:
<br> a. Perform exploratory data analysis to understand the relationship between features and churn
<br> b. Split the data and train a logistic regression model
<br> c. Evaluate the model using accuracy, precision, recall, and F1-score
<br> d. Create and interpret the confusion matrix
<br> e. What features are most predictive of customer churn?
<br> f. How would you handle class imbalance if present in the dataset?

In [None]:
#your code here
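
A minimal sketch of questions b to e follows, offered as one possible approach rather than a reference solution. It regenerates the churn data with a local generator (an assumption), scales the features inside a pipeline so the coefficients are comparable, and uses a stratified split so both classes appear in the test set. The names `customers`, `clf`, and `metrics` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the synthetic churn data so this sketch runs on its own.
rng = np.random.default_rng(42)
n_samples = 1000
customers = pd.DataFrame({
    "monthly_charges": rng.uniform(30, 120, n_samples),
    "contract_length": rng.integers(1, 24, n_samples),
    "data_usage": rng.uniform(0, 500, n_samples),
    "service_calls": rng.poisson(2, n_samples),
    "age": rng.normal(40, 15, n_samples),
})
z = (
    0.03 * customers["monthly_charges"]
    - 0.1 * customers["contract_length"]
    + 0.001 * customers["data_usage"]
    + 0.5 * customers["service_calls"]
    - 0.02 * customers["age"]
)
customers["churned"] = (1 / (1 + np.exp(-z)) > 0.5).astype(int)

X = customers.drop(columns="churned")
y = customers["churned"]

# (b) stratified 70/30 split keeps the churn rate similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Scale features, then fit a logistic regression
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# (c) the four requested metrics, with churned (1) as the positive class
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(metrics)

# (d) rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# (e) coefficients on standardized features, largest magnitude first
coefs = pd.Series(
    clf.named_steps["logisticregression"].coef_[0], index=X.columns
)
print(coefs.loc[coefs.abs().sort_values(ascending=False).index])
```

For question f, one option to try is `LogisticRegression(class_weight="balanced")`, which reweights the classes inversely to their frequency; resampling the training set is another.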

# Optional Exercise 3: Model Comparison and Feature Engineering
Challenge
For either the house price or customer churn dataset:

<br> Engineer at least three new features that you think might improve model performance
<br> Implement both linear/logistic regression and one other model of your choice (e.g., Random Forest, XGBoost)
<br> Compare the performance of the models using appropriate metrics
<br> Perform cross-validation to ensure robust results
<br> Write a brief report explaining:

<br> Your feature engineering process and rationale
<br> Why you chose the additional model
<br> Which model performed better and why
<br> Recommendations for future improvements

In [None]:
#your code here
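
As a starting point for the challenge, the sketch below compares linear regression with a random forest on the house price data using 5-fold cross-validation. The three engineered features (`sqft_per_bedroom`, `bath_per_bed`, `is_new`) are hypothetical examples chosen for illustration, not features the exercise prescribes, and the data is regenerated with a local generator (an assumption).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Rebuild the synthetic housing data so this sketch runs on its own.
rng = np.random.default_rng(42)
n_samples = 1000
houses = pd.DataFrame({
    "sqft": rng.normal(2000, 500, n_samples),
    "bedrooms": rng.integers(1, 6, n_samples),
    "bathrooms": rng.integers(1, 4, n_samples),
    "age": rng.integers(0, 50, n_samples),
    "distance_to_school": rng.uniform(0, 5, n_samples),
})
houses["price"] = (
    200 * houses["sqft"]
    + 50000 * houses["bedrooms"]
    + 75000 * houses["bathrooms"]
    - 2000 * houses["age"]
    - 25000 * houses["distance_to_school"]
    + rng.normal(0, 50000, n_samples)
)

# Hypothetical engineered features (illustrative only)
houses["sqft_per_bedroom"] = houses["sqft"] / houses["bedrooms"]
houses["bath_per_bed"] = houses["bathrooms"] / houses["bedrooms"]
houses["is_new"] = (houses["age"] < 5).astype(int)

X = houses.drop(columns="price")
y = houses["price"]

# Shuffled 5-fold cross-validation, with the same folds for both models
cv = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# R² per fold for each model
cv_r2 = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2")
    for name, model in models.items()
}
for name, scores in cv_r2.items():
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Because the synthetic price is generated from a linear formula, the linear model is likely to hold its own here; on real data the ranking may well differ, which is worth discussing in the report.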