## Install Julia and Relevant Packages

In [4]:
using CSV
using DataFrames
using GLM
using Statistics

## Data Loading and Processing

In [7]:
df = CSV.read("house_prices.csv", DataFrame)

df[!,:Size] = (df[!,:Size] .- mean(df[!,:Size])) ./ std(df[!,:Size])

features = df[!,:Size]
labels = df[!,:Price]

display(features)
display(labels)

30-element Vector{Float64}:
 -1.647089319316488
 -1.533496952467075
 -1.4199045856176622
 -1.3063122187682492
 -1.1927198519188362
 -1.0791274850694232
 -0.9655351182200103
 -0.8519427513705973
 -0.7383503845211843
 -0.6247580176717713
  ⋮
  0.7383503845211843
  0.8519427513705973
  0.9655351182200103
  1.0791274850694232
  1.1927198519188362
  1.3063122187682492
  1.4199045856176622
  1.533496952467075
  1.647089319316488

30-element Vector{Int64}:
 300000
 310000
 320000
 330000
 340000
 350000
 360000
 370000
 380000
 390000
      ⋮
 510000
 520000
 530000
 540000
 550000
 560000
 570000
 580000
 590000

## Model Building and Training

In [8]:
prep = hcat(ones(size(features)), features)
display(prep)

model = lm(prep, labels)
display(model)

30×2 Matrix{Float64}:
 1.0  -1.64709
 1.0  -1.5335
 1.0  -1.4199
 1.0  -1.30631
 1.0  -1.19272
 1.0  -1.07913
 1.0  -0.965535
 1.0  -0.851943
 1.0  -0.73835
 1.0  -0.624758
 ⋮    
 1.0   0.73835
 1.0   0.851943
 1.0   0.965535
 1.0   1.07913
 1.0   1.19272
 1.0   1.30631
 1.0   1.4199
 1.0   1.5335
 1.0   1.64709

LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}:

Coefficients:
───────────────────────────────────────────────────────────────────────────────
       Coef.   Std. Error                     t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────────
x1  445000.0  1.28597e-11  34604101581029564.00    <1e-99   445000.0   445000.0
x2   88034.1  1.30796e-11   6730646631664252.00    <1e-99    88034.1    88034.1
───────────────────────────────────────────────────────────────────────────────


## Model Evaluation

In [12]:
pred = predict(model, prep)
mse = mean((labels - pred) .^ 2)
println("Mean Squared Error: $mse")

Mean Squared Error: 4.6304467783235086e-21


## Report

### Methodology
The methodology followed the assignment outline. It began with the creation of a synthetic dataset, **house_prices.csv**, containing two columns: **Size** (in square feet) and **Price**. The dataset comprised 30 rows of sample data representing a linear relationship between house size and price.

The dataset was then loaded using the **CSV** and **DataFrames** packages for Julia, and the **Size** feature was normalized so that it has zero mean and unit standard deviation, a common preprocessing step before model fitting. This was done by subtracting the mean and dividing by the standard deviation.
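
A minimal sketch of this standardization step, using a small hypothetical vector of sizes rather than the actual contents of **house_prices.csv** (which are not reproduced in this report). Note that `std` from Julia's `Statistics` standard library computes the sample standard deviation (dividing by n - 1) by default.

```julia
using Statistics

# Hypothetical house sizes in square feet (illustrative values only).
sizes = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]

# Z-score standardization: subtract the mean, divide by the standard deviation.
z = (sizes .- mean(sizes)) ./ std(sizes)

# After standardization the feature has (approximately) zero mean and unit std.
println("mean = ", mean(z))
println("std  = ", std(z))
```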

After normalization, the dataset was split into **features** and **labels**, which were then passed to the **GLM** package to create and train a linear regression model.

```julia
using CSV, DataFrames, GLM
using Statistics  # provides mean and std

# Load and process data
data = CSV.read("house_prices.csv", DataFrame)
data[!, :Size] = (data[!, :Size] .- mean(data[!, :Size])) ./ std(data[!, :Size])
features = data[!, :Size]
labels = data[!, :Price]

# Prepare data for the model
prep = hcat(ones(size(features)), features)

# Model building and training
model = lm(prep, labels)

# Model evaluation
pred = predict(model, prep)
mse = mean((labels - pred) .^ 2)
```
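
As a possible alternative to assembling the design matrix with `hcat`, GLM also accepts a formula together with a DataFrame, which adds the intercept automatically and labels the coefficients by column name. The sketch below is not the code used for this report: it assumes a hypothetical in-memory dataset in place of **house_prices.csv**, and the sizes are made-up values.

```julia
using DataFrames, GLM, Statistics

# Hypothetical stand-in for house_prices.csv: 30 rows, exactly linear.
data = DataFrame(Size = collect(1000.0:50.0:2450.0),
                 Price = collect(300_000.0:10_000.0:590_000.0))

# Standardize the Size column, as in the report.
data.Size = (data.Size .- mean(data.Size)) ./ std(data.Size)

# The formula interface adds the intercept term automatically.
model = lm(@formula(Price ~ Size), data)

println(coefnames(model))   # coefficient labels
println(coef(model))        # fitted values, in the same order

# Mean squared error on the training data.
mse = mean((data.Price .- predict(model, data)) .^ 2)
println("Mean Squared Error: $mse")
```

With this interface the coefficient table is labeled with the column name instead of the generic `x1` and `x2`, which makes the output easier to read.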

### Findings
The linear regression model's coefficients were as follows:
- Intercept (x1): 445,000
- Slope (x2): 88,034.1
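
Because **Size** was standardized before fitting, these coefficients are on the standardized scale: the intercept is the predicted price at the mean size, and the slope is the price change per one standard deviation of size. The sketch below shows one way to convert the slope back to a per-square-foot figure. It uses hypothetical sizes (the real ones are not shown in this report) and solves the least-squares problem with Julia's backslash operator as a lightweight stand-in for `lm`.

```julia
using LinearAlgebra, Statistics

# Hypothetical raw data: made-up sizes, prices matching the report's range.
sizes  = collect(1000.0:50.0:2450.0)
prices = collect(300_000.0:10_000.0:590_000.0)

# Standardize the feature and build the design matrix [1 z].
z = (sizes .- mean(sizes)) ./ std(sizes)
X = hcat(ones(length(z)), z)

# Ordinary least squares via the backslash operator (stand-in for lm).
intercept, slope_per_sd = X \ prices

# Undo the scaling: price change per square foot instead of per std. dev.
slope_per_sqft = slope_per_sd / std(sizes)

println("intercept (price at mean size): ", intercept)
println("slope per standard deviation:   ", slope_per_sd)
println("slope per square foot:          ", slope_per_sqft)
```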

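Model performance was evaluated using the mean squared error (MSE), which was found to be approximately 4.63e-21. An MSE this close to zero means the model fits the data essentially exactly. This is expected, because the synthetic dataset follows a linear relationship and the error was computed on the same data used for training.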
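
For reference, the quantity computed by `mean((labels - pred) .^ 2)` is the usual mean squared error, where $y_i$ are the observed prices, $\hat{y}_i$ are the model's predictions, and $n = 30$ is the number of rows:

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$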

### Challenges
The first challenge was creating the dataset itself: after fiddling with Excel, I found that online tools are far superior and less manual. The next was data normalization, which I had not yet encountered in Julia, although I enjoyed how naturally the math symbols work inline with the DataFrames and more. Syntax errors were of course prevalent, but they were swiftly resolved because the Visual Studio IDE shows tips in the terminal that make for quick reconciliation.

## Conclusion
The linear regression model demonstrated high accuracy in predicting house prices based on house size. The exercise provided valuable insights into data processing, model creation, and evaluation in Julia. The challenges faced were effectively addressed, enhancing the understanding of Julia's data handling and machine learning capabilities.