## Objective
In this lab, you will learn how to use Scikit-Learn to build and evaluate a basic linear regression model for predicting house prices. You will go through the steps of loading and exploring a dataset, preprocessing the data, building a model, and evaluating its performance.
## Scenario
Welcome to your new role as a data analyst intern at a real estate firm! Your manager has tasked you with developing a simple machine learning model to predict house prices based on historical data. This model will help the firm estimate the value of new properties quickly and efficiently.
## Materials Provided
- A dataset (`house_prices.csv`), which you will load into a pandas DataFrame named `df`.
- Python environment with Scikit-Learn, pandas, and other essential libraries pre-installed.

## High-Level Tasks
1. **Load and Explore the Data**
2. **Data Preprocessing**
3. **Build and Train a Linear Regression Model**
4. **Make Predictions and Evaluate the Model**
5. **Bonus Challenge (Optional)**

## Lab Instructions
### 1. Load and Explore the Data
#### Step 1.1: Import pandas and Load the Dataset

In [1]:
import pandas as pd
df = pd.read_csv("house_prices.csv")

#### Step 1.2: Display the First 5 Rows
Use the provided code cell to display the first 5 rows of the dataset.

In [2]:
# Display the first 5 rows of the dataframe
df.head()

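|   | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 376000.0 | 3.0 | 2.0 | 1340 | 1384 | 3.0 | 0 | 0 | NaN | 1340 | 0 | 2008 | 0 |
| 1 | 800000.0 | 4.0 | 3.25 | 3540 | 159430 | 2.0 | 0 | 0 | NaN | 3540 | 0 | 2007 | 0 |
| 2 | 2238888.0 | 5.0 | 6.5 | 7270 | 130017 | 2.0 | 0 | 0 | NaN | 6420 | 850 | 2010 | 0 |
| 3 | 324000.0 | 3.0 | 2.25 | 998 | 904 | 2.0 | 0 | 0 | NaN | 798 | 200 | 2007 | 0 |
| 4 | 549900.0 | 5.0 | 2.75 | 3060 | 7015 | 1.0 | 0 | 0 | 5.0 | 1600 | 1460 | 1979 | 0 |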


#### Step 1.3: Examine Column Names and Data Types
Inspect the column names and data types using `df.info()`.

In [3]:
# Display column names and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4140 entries, 0 to 4139
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   price          4140 non-null   float64
 1   bedrooms       4140 non-null   float64
 2   bathrooms      4140 non-null   float64
 3   sqft_living    4140 non-null   int64  
 4   sqft_lot       4140 non-null   int64  
 5   floors         4140 non-null   float64
 6   waterfront     4140 non-null   int64  
 7   view           4140 non-null   int64  
 8   condition      3595 non-null   float64
 9   sqft_above     4140 non-null   int64  
 10  sqft_basement  4140 non-null   int64  
 11  yr_built       4140 non-null   int64  
 12  yr_renovated   4140 non-null   int64  
dtypes: float64(5), int64(8)
memory usage: 420.6 KB


#### Step 1.4: Get Summary Statistics
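Get summary statistics of the numerical columns using `df.describe()`, then confirm the column data types with `df.dtypes`.

In [4]:
# Summary statistics of numerical columns
# Note: a notebook cell renders only its last expression, so run this line in its own cell to see the table
df.describe()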

# Observe data types of df
df.dtypes

price            float64
bedrooms         float64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition        float64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
dtype: object

### 2. Data Preprocessing
#### Step 2.1: Handle Missing Values
Identify and handle any missing values. You can either drop rows with missing values or fill them with a summary statistic (mean, median, etc.). For this activity, fill the missing values with the median: this keeps all 4,140 rows, and the median is less sensitive to outliers than the mean. Keep in mind that any imputation slightly distorts the column's distribution, so it is not entirely free of bias.


In [5]:
# Count the missing values in each column
df.isnull().sum()

price              0
bedrooms           0
bathrooms          0
sqft_living        0
sqft_lot           0
floors             0
waterfront         0
view               0
condition        545
sqft_above         0
sqft_basement      0
yr_built           0
yr_renovated       0
dtype: int64

In [6]:
# Fill missing values in 'condition' with the median value
df['condition'] = df['condition'].fillna(df['condition'].median())

# Show the DataFrame with the filled values in 'condition'
df.head()

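|   | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 376000.0 | 3.0 | 2.0 | 1340 | 1384 | 3.0 | 0 | 0 | 3.0 | 1340 | 0 | 2008 | 0 |
| 1 | 800000.0 | 4.0 | 3.25 | 3540 | 159430 | 2.0 | 0 | 0 | 3.0 | 3540 | 0 | 2007 | 0 |
| 2 | 2238888.0 | 5.0 | 6.5 | 7270 | 130017 | 2.0 | 0 | 0 | 3.0 | 6420 | 850 | 2010 | 0 |
| 3 | 324000.0 | 3.0 | 2.25 | 998 | 904 | 2.0 | 0 | 0 | 3.0 | 798 | 200 | 2007 | 0 |
| 4 | 549900.0 | 5.0 | 2.75 | 3060 | 7015 | 1.0 | 0 | 0 | 5.0 | 1600 | 1460 | 1979 | 0 |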
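
To make the trade-off between dropping and filling concrete, here is a minimal sketch on a small, made-up DataFrame. The `toy` data and its values are hypothetical and are not taken from `house_prices.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from house_prices.csv
toy = pd.DataFrame({
    "price": [300000.0, 450000.0, 520000.0, 610000.0, 275000.0],
    "condition": [3.0, np.nan, 4.0, np.nan, 5.0],
})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: fill missing values with the column median
# (median() skips NaN values by default)
median_condition = toy["condition"].median()
filled = toy.fillna({"condition": median_condition})

print(f"Rows after dropna: {len(dropped)}")
print(f"Rows after fillna: {len(filled)}")
print(f"Median used for filling: {median_condition}")
```

Dropping loses two of the five rows here, while filling keeps all of them at the cost of assigning the same value to every gap.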


#### Step 2.2: Select Relevant Features
Select the features (e.g., `'sqft_living'`, `'bedrooms'`, `'bathrooms'`, `'condition'`, `'floors'`) and the target variable (`'price'`).

In [7]:
# Define the features (independent variables)
features = ['sqft_living', 'bedrooms', 'bathrooms', 'condition', 'floors']

# Select the target variable (dependent variable)
target = 'price'

#### Step 2.3: Encode Categorical Feature
Encode the categorical feature `'condition'` using one-hot encoding.

In [8]:
# Perform one-hot encoding on 'condition'
# dtype=int keeps the dummy columns as 0/1; pandas 2.0 and later would otherwise return True/False
df_encoded = pd.get_dummies(df, columns=['condition'], prefix='condition', drop_first=True, dtype=int)

# Define the original feature list before encoding
original_features = ['sqft_living', 'bedrooms', 'bathrooms', 'condition', 'floors']

# Remove 'condition' from the feature list since it has been encoded
original_features.remove('condition')

# Get the new one-hot encoded column names
encoded_condition_columns = [col for col in df_encoded.columns if col.startswith('condition_')]

# Create the updated feature list
features = original_features + encoded_condition_columns

In [9]:
features

['sqft_living',
 'bedrooms',
 'bathrooms',
 'floors',
 'condition_2.0',
 'condition_3.0',
 'condition_4.0',
 'condition_5.0']

In [10]:
# Select the updated feature matrix X and target variable y
X = df_encoded[features]  # Features
y = df_encoded['price']   # Target variable

In [11]:
X.head()

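|   | sqft_living | bedrooms | bathrooms | floors | condition_2.0 | condition_3.0 | condition_4.0 | condition_5.0 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1340 | 3.0 | 2.0 | 3.0 | 0 | 1 | 0 | 0 |
| 1 | 3540 | 4.0 | 3.25 | 2.0 | 0 | 1 | 0 | 0 |
| 2 | 7270 | 5.0 | 6.5 | 2.0 | 0 | 1 | 0 | 0 |
| 3 | 998 | 3.0 | 2.25 | 2.0 | 0 | 1 | 0 | 0 |
| 4 | 3060 | 5.0 | 2.75 | 1.0 | 0 | 0 | 0 | 1 |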
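
The effect of `drop_first=True` is easier to see on a tiny example. The sketch below uses a hypothetical `toy` column with only three condition levels. With `drop_first=True`, the lowest level becomes the baseline and gets no column of its own, which avoids perfect collinearity between the dummy columns and the model's intercept. In the lab's feature list, the baseline appears to be `condition_1.0`, since only levels 2.0 through 5.0 show up.

```python
import pandas as pd

# Hypothetical condition values, not taken from house_prices.csv
toy = pd.DataFrame({"condition": [3.0, 5.0, 4.0, 3.0]})

# One column per level
full = pd.get_dummies(toy, columns=["condition"], prefix="condition", dtype=int)

# First level (3.0) is dropped and becomes the implicit baseline
reduced = pd.get_dummies(
    toy, columns=["condition"], prefix="condition", drop_first=True, dtype=int
)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

A row whose dummy columns are all 0 belongs to the baseline level.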


#### Step 2.4: Split the Data
Split the data into training and testing sets (80% train, 20% test) using `train_test_split` from Scikit-Learn.

Make sure to set the `random_state` parameter to 42 to ensure reproducibility and obtain the same results as the expected solution.

In [12]:
from sklearn.model_selection import train_test_split

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
print(f"Training set: X_train shape = {X_train.shape}, y_train shape = {y_train.shape}")
print(f"Testing set: X_test shape = {X_test.shape}, y_test shape = {y_test.shape}")

Training set: X_train shape = (3312, 8), y_train shape = (3312,)
Testing set: X_test shape = (828, 8), y_test shape = (828,)
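
As a quick check of what `random_state` buys you, the following sketch splits a small, hypothetical array twice with the same seed and confirms that both splits are identical. The names `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical 10 x 2 feature matrix and target
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent splits with the same seed
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(f"Same test rows both times: {np.array_equal(a_test, b_test)}")
print(f"Train shape: {a_train.shape}, test shape: {a_test.shape}")
```

Without a fixed `random_state`, each run shuffles the rows differently, so the metrics computed later would change from run to run.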


### 3. Build and Train a Linear Regression Model
#### Step 3.1: Import LinearRegression
Import `LinearRegression` from `sklearn.linear_model`.

In [13]:
from sklearn.linear_model import LinearRegression

#### Step 3.2: Create an Instance of the Model
Create an instance of the `LinearRegression` model.

In [14]:
# Create an instance of the LinearRegression model
model = LinearRegression()

#### Step 3.3: Fit the Model
Fit the model to the training data.

In [15]:
# Fit the model to the training data
model.fit(X_train, y_train)
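
After fitting, a `LinearRegression` object exposes the learned weights as `coef_` and `intercept_`. The sketch below fits a model on synthetic, noise-free data where the true relationship is known, so you can see the coefficients being recovered. The data, the names `X_demo`, `y_demo`, and `demo_model`, and the coefficient values are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship (hypothetical values)
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=50)
bedrooms = rng.integers(1, 6, size=50)
X_demo = np.column_stack([sqft, bedrooms])
y_demo = 150.0 * sqft + 10_000.0 * bedrooms + 50_000.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# Each coefficient is the predicted change in y for a one-unit increase in that feature
for name, coef in zip(["sqft", "bedrooms"], demo_model.coef_):
    print(f"{name}: {coef:.2f}")
print(f"intercept: {demo_model.intercept_:.2f}")
```

The same attributes are available on the lab's `model`, where `model.coef_` lines up with the columns of `X_train`.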

### 4. Make Predictions and Evaluate the Model
#### Step 4.1: Make Predictions
Use the trained model to make predictions on the testing data.


In [16]:
# Make predictions on the test set
y_pred = model.predict(X_test)

#### Step 4.2: Evaluate the Model
Calculate the Mean Squared Error (MSE) as `mse` and R-squared value as `r_squared` to evaluate the model's performance, then check your results by printing them in the following cell. 

In [17]:
from sklearn.metrics import mean_squared_error, r2_score


# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Calculate R-squared (R²) value
r_squared = r2_score(y_test, y_pred)

# Print the results
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²) value: {r_squared}")

Mean Squared Error (MSE): 71857868280.10321
R-squared (R²) value: 0.31456514310691674
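
MSE is expressed in squared price units, which makes it hard to read. Its square root (RMSE) is in the same units as `price`; for the MSE above, that is roughly 268,000. An R² of about 0.31 means the model explains roughly 31% of the variance in the test prices. The sketch below computes both metrics by hand on a few hypothetical values and checks them against Scikit-Learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted prices
y_true = np.array([300000.0, 450000.0, 520000.0, 610000.0])
y_hat = np.array([320000.0, 430000.0, 500000.0, 650000.0])

# MSE: mean of squared errors; RMSE: its square root
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse = np.sqrt(mse_manual)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MSE (manual):  {mse_manual:.1f}")
print(f"MSE (sklearn): {mean_squared_error(y_true, y_hat):.1f}")
print(f"RMSE:          {rmse:.1f}")
print(f"R² (manual):   {r2_manual:.4f}")
print(f"R² (sklearn):  {r2_score(y_true, y_hat):.4f}")
```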


#### Check Your Results: 

In [18]:
# Print to check results
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared: {r_squared}")


Mean Squared Error (MSE): 71857868280.10321
R-squared: 0.31456514310691674


### 5. Bonus Challenge (Optional)
#### Step 5.1: Experiment with a Different Regression Algorithm
Experiment with a different regression algorithm (e.g., `DecisionTreeRegressor` or `RandomForestRegressor`) and compare its performance to the Linear Regression model using the same evaluation metrics.

In [19]:
from sklearn.tree import DecisionTreeRegressor

# (Optional) Create and train a DecisionTreeRegressor
# random_state is fixed so that repeated runs build the same tree
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

# Evaluate the DecisionTreeRegressor on its own predictions,
# not on y_pred, which holds the Linear Regression predictions
tree_pred = tree.predict(X_test)

# Calculate Mean Squared Error (MSE)
tree_mse = mean_squared_error(y_test, tree_pred)

# Calculate R-squared (R²) value
tree_r2 = r2_score(y_test, tree_pred)

# Print to check results
print(f"DecisionTreeRegressor Mean Squared Error (MSE): {tree_mse}")
print(f"DecisionTreeRegressor R-squared: {tree_r2}")

The printed values come from the tree's own predictions, so they should differ from the Linear Regression results above.


In [20]:
from sklearn.ensemble import RandomForestRegressor

# (Optional) Create and train a RandomForestRegressor
# random_state is fixed so that repeated runs build the same forest
forest = RandomForestRegressor(random_state=42)
forest.fit(X_train, y_train)

# Evaluate the RandomForestRegressor on its own predictions,
# not on y_pred, which holds the Linear Regression predictions
forest_pred = forest.predict(X_test)

# Calculate Mean Squared Error (MSE)
forest_mse = mean_squared_error(y_test, forest_pred)

# Calculate R-squared (R²) value
forest_r2 = r2_score(y_test, forest_pred)

# Print to check results
print(f"RandomForestRegressor Mean Squared Error (MSE): {forest_mse}")
print(f"RandomForestRegressor R-squared: {forest_r2}")

The printed values come from the forest's own predictions, so they should differ from the Linear Regression results above.
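
When comparing several algorithms, each fitted model must generate its own predictions with `predict` before scoring; reusing one model's predictions yields identical metrics for every model. One way to make that hard to get wrong is to loop over the candidates, as in this sketch. It runs on synthetic data, and the names `X_demo`, `y_demo`, `candidates`, and `results` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a partly non-linear relationship (hypothetical)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + 5.0 * np.sin(X_demo[:, 1]) + rng.normal(0, 1, size=200)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

results = {}
for name, estimator in candidates.items():
    estimator.fit(Xtr, ytr)
    preds = estimator.predict(Xte)  # predictions from this model only
    results[name] = (mean_squared_error(yte, preds), r2_score(yte, preds))
    print(f"{name}: MSE = {results[name][0]:.3f}, R² = {results[name][1]:.3f}")
```

To apply the same pattern in the lab, swap the synthetic arrays for `X_train`, `X_test`, `y_train`, and `y_test`.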


## Hints & Tips
- Refer to the "pandas Cheat Sheet" for quick syntax reference on DataFrame operations.
- Check the "Scikit-Learn Documentation" for examples and explanations of machine learning models.

Good luck with your first machine learning model!