# Programming Assignment
## My First Machine Learning Model Implementation

### Objectives:
- Use scikit-learn to build and evaluate a basic linear regression model that predicts house prices from historical data.
- Work through the full workflow: load and explore a dataset, preprocess the data, train a model, and evaluate its performance.

### High-Level Tasks:
1. Load and Explore the Data
2. Data Preprocessing
3. Build and Train a Linear Regression Model
4. Make Predictions and Evaluate the Model
5. Bonus Challenge (Optional)

### 1. Load and Explore the Data
#### Step 1.1: Import the Required Python Library and Preview the Dataset

In [1]:
# pip install pandas scikit-learn

In [2]:
import pandas as pd

# Load the dataset (house_prices.csv must be in the working directory)
df = pd.read_csv("house_prices.csv")

# Display the first 5 rows of the DataFrame
df.head()

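|   | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 376000.0 | 3.0 | 2.0 | 1340 | 1384 | 3.0 | 0 | 0 | NaN | 1340 | 0 | 2008 | 0 |
| 1 | 800000.0 | 4.0 | 3.25 | 3540 | 159430 | 2.0 | 0 | 0 | NaN | 3540 | 0 | 2007 | 0 |
| 2 | 2238888.0 | 5.0 | 6.5 | 7270 | 130017 | 2.0 | 0 | 0 | NaN | 6420 | 850 | 2010 | 0 |
| 3 | 324000.0 | 3.0 | 2.25 | 998 | 904 | 2.0 | 0 | 0 | NaN | 798 | 200 | 2007 | 0 |
| 4 | 549900.0 | 5.0 | 2.75 | 3060 | 7015 | 1.0 | 0 | 0 | 5.0 | 1600 | 1460 | 1979 | 0 |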


#### Step 1.2: Examine Column Names and Data Types

In [3]:
# Display column names and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4140 entries, 0 to 4139
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   price          4140 non-null   float64
 1   bedrooms       4140 non-null   float64
 2   bathrooms      4140 non-null   float64
 3   sqft_living    4140 non-null   int64  
 4   sqft_lot       4140 non-null   int64  
 5   floors         4140 non-null   float64
 6   waterfront     4140 non-null   int64  
 7   view           4140 non-null   int64  
 8   condition      3595 non-null   float64
 9   sqft_above     4140 non-null   int64  
 10  sqft_basement  4140 non-null   int64  
 11  yr_built       4140 non-null   int64  
 12  yr_renovated   4140 non-null   int64  
dtypes: float64(5), int64(8)
memory usage: 420.6 KB


#### Step 1.3: Get Summary Statistics

In [4]:
# Summary statistics of numerical columns
df.describe()

|   | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 4140.0 | 4140.0 | 4140.0 | 4140.0 | 4140.0 | 4140.0 | 4140.0 | 4140.0 | 3595.0 | 4140.0 | 4140.0 | 4140.0 | 4140.0 |
| mean | 553062.9 | 3.400483 | 2.163043 | 2143.638889 | 14697.64 | 1.51413 | 0.007488 | 0.246618 | 3.521001 | 1831.351449 | 312.28744 | 1970.81401 | 808.368357 |
| std | 583686.5 | 0.903939 | 0.784733 | 957.481621 | 35876.84 | 0.534941 | 0.086219 | 0.790619 | 0.703193 | 861.382947 | 464.349222 | 29.807941 | 979.380535 |
| min | 0.0 | 0.0 | 0.0 | 370.0 | 638.0 | 1.0 | 0.0 | 0.0 | 1.0 | 370.0 | 0.0 | 1900.0 | 0.0 |
| 25% | 320000.0 | 3.0 | 1.75 | 1470.0 | 5000.0 | 1.0 | 0.0 | 0.0 | 3.0 | 1190.0 | 0.0 | 1951.0 | 0.0 |
| 50% | 460000.0 | 3.0 | 2.25 | 1980.0 | 7676.0 | 1.5 | 0.0 | 0.0 | 3.0 | 1600.0 | 0.0 | 1976.0 | 0.0 |
| 75% | 659125.0 | 4.0 | 2.5 | 2620.0 | 11000.0 | 2.0 | 0.0 | 0.0 | 4.0 | 2310.0 | 602.5 | 1997.0 | 1999.0 |
| max | 26590000.0 | 8.0 | 6.75 | 10040.0 | 1074218.0 | 3.5 | 1.0 | 4.0 | 5.0 | 8020.0 | 4820.0 | 2014.0 | 2014.0 |


In [5]:
# Inspect the data type of each column
df.dtypes

price            float64
bedrooms         float64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition        float64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
dtype: object

### 2. Data Preprocessing
#### Step 2.1: Handle Missing Values

In [6]:
# Count the missing values in each column
df.isnull().sum()

price              0
bedrooms           0
bathrooms          0
sqft_living        0
sqft_lot           0
floors             0
waterfront         0
view               0
condition        545
sqft_above         0
sqft_basement      0
yr_built           0
yr_renovated       0
dtype: int64

In [7]:
# Fill missing values in "condition" with the column's median
df['condition'] = df['condition'].fillna(df['condition'].median())

# Display the first 5 rows to confirm that the missing values were filled
df.head()

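|   | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 376000.0 | 3.0 | 2.0 | 1340 | 1384 | 3.0 | 0 | 0 | 3.0 | 1340 | 0 | 2008 | 0 |
| 1 | 800000.0 | 4.0 | 3.25 | 3540 | 159430 | 2.0 | 0 | 0 | 3.0 | 3540 | 0 | 2007 | 0 |
| 2 | 2238888.0 | 5.0 | 6.5 | 7270 | 130017 | 2.0 | 0 | 0 | 3.0 | 6420 | 850 | 2010 | 0 |
| 3 | 324000.0 | 3.0 | 2.25 | 998 | 904 | 2.0 | 0 | 0 | 3.0 | 798 | 200 | 2007 | 0 |
| 4 | 549900.0 | 5.0 | 2.75 | 3060 | 7015 | 1.0 | 0 | 0 | 5.0 | 1600 | 1460 | 1979 | 0 |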
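
To make the imputation step concrete, the sketch below reproduces the fill-with-median logic on a handful of made-up `condition` ratings using only the standard library. The sample values and the helper name `fill_missing_with_median` are illustrative assumptions; they are not part of the dataset or of pandas.

```python
from statistics import median


def fill_missing_with_median(values):
    """Return a copy of `values` with every None replaced by the median
    of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    return [fill_value if v is None else v for v in values]


# Hypothetical ratings; None stands in for a missing (NaN) entry.
ratings = [None, 3.0, None, 4.0, 5.0, 3.0, 3.0]
filled = fill_missing_with_median(ratings)
print(filled)  # [3.0, 3.0, 3.0, 4.0, 5.0, 3.0, 3.0]
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for a rating column like this. In the dataset, 545 of the 4140 rows (about 13%) are filled this way.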


#### Step 2.2: Select Relevant Features

In [8]:
# Select relevant features and target variable

X = df[['sqft_living', 'bedrooms', 'bathrooms', 'floors']]

y = df['price']

# Check the shape of features and target
print(f'Shape of X: {X.shape}')
print("Expected X should be displayed as (4140, 4).")

print(f'Shape of y: {y.shape}')
print("Expected y should be displayed as (4140,).")

Shape of X: (4140, 4)
Expected X should be displayed as (4140, 4).
Shape of y: (4140,)
Expected y should be displayed as (4140,).


#### Step 2.3: Split the Data

In [9]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (80% train, 20% test)
# Set the "random_state" parameter to 42 to ensure reproducibility and obtain the same results as the expected solution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Check the shape of splits
print(f'Shape of X_train: {X_train.shape}')
print("Expected X_train should be displayed as (3312, 4)")

print(f'Shape of X_test: {X_test.shape}')
print("Expected X_test should be displayed as (828, 4)")

Shape of X_train: (3312, 4)
Expected X_train should be displayed as (3312, 4)
Shape of X_test: (828, 4)
Expected X_test should be displayed as (828, 4)
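
The split sizes can be checked by hand. Assuming scikit-learn rounds a fractional `test_size` up to a whole number of rows and assigns the remaining rows to the training set, 20% of 4140 rows gives 828 test rows and 3312 training rows. The sketch below mimics that arithmetic with the standard library; `simple_split` is a hypothetical helper, and its shuffle does not reproduce the exact row assignment that `train_test_split` produces with `random_state=42`.

```python
import math
import random


def simple_split(rows, test_size=0.2, seed=42):
    """Shuffle `rows` reproducibly and split off a test fraction.

    Illustrative only: the shuffle order differs from scikit-learn's.
    Returns (train_rows, test_rows).
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


row_ids = range(4140)  # stand-in for the 4140 rows of the dataset
train_ids, test_ids = simple_split(row_ids)
print(len(train_ids), len(test_ids))  # 3312 828
```

Fixing the seed is what makes the split repeatable: calling `simple_split(row_ids)` again returns the same two lists, just as a fixed `random_state` does for `train_test_split`.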


### 3. Build and Train a Linear Regression Model
#### Step 3.1: Import LinearRegression and fit the model to training data

In [10]:
# Import LinearRegression from sklearn.linear_model
from sklearn.linear_model import LinearRegression

# Create an instance of the LinearRegression model
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# There will be no output except a LinearRegression object from this step.
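
For intuition about what `fit` estimates, the sketch below computes an ordinary least squares line for a single feature from the closed-form formulas (slope = covariance / variance). It is a simplification under stated assumptions: `LinearRegression` handles several features at once with a general least-squares solver, and the toy data and the name `fit_simple_ols` are made up for illustration.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data lying exactly on the line y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_ols(x, y)
print(intercept, slope)  # 1.0 2.0
```

After fitting the real model, the analogous quantities are available as `model.intercept_` and `model.coef_` (one coefficient per selected feature).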

### 4. Make Predictions and Evaluate the Model
#### Step 4.1: Make Predictions and Compute Evaluation Metrics

In [11]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Use the trained model to make predictions on the testing data
y_pred = model.predict(X_test)

# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Calculate the R-squared value
r_squared = r2_score(y_test, y_pred)

# Calculate the Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)

# Compare your metrics against the expected values
print(f"Your Mean Squared Error (MSE): {mse}")
print("Expected MSE should be around 74224655277.46896")

print(f"Your R-squared: {r_squared}")
print("Expected R-squared should be around 0.2919889333519581")

print(f"Your Mean Absolute Error (MAE): {mae}")
print("Expected MAE should be around 181509.94407826668")

Your Mean Squared Error (MSE): 74224655277.46898
Expected MSE should be around 74224655277.46896
Your R-squared: 0.29198893335195786
Expected R-squared should be around 0.2919889333519581
Your Mean Absolute Error (MAE): 181509.94407826668
Expected MAE should be around 181509.94407826668
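
The three metrics are straightforward to compute by hand, which helps when interpreting them. The sketch below implements the standard definitions in plain Python on made-up numbers; the function names and sample values are illustrative assumptions, and the assignment itself should keep using the scikit-learn functions.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical prices and predictions, each off by exactly 10
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(mse(actual, predicted))  # 100.0
print(mae(actual, predicted))  # 10.0
print(round(r2(actual, predicted), 3))  # 0.992
```

Because MSE is expressed in squared price units, its square root (RMSE) is often easier to read: for the expected MSE above it is roughly 272,000, compared with an MAE of about 181,500. An R-squared of about 0.29 means the four selected features account for roughly 29% of the variance in the test-set prices.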
