# R Programming - Unit V: Modelling in R
### Prof. Anjit Raja R  
Welcome to **Unit V – Modelling in R**. This notebook covers linear models, generalized linear models, nonlinear regression, time series analysis, clustering, and a brief intro to parallel computation in R, with runnable examples and exercises.


## Learning Outcomes
By the end of this unit, students will be able to:
1. Fit and interpret linear and generalized linear models using `lm()` and `glm()`.
2. Fit simple nonlinear models using `nls()`.
3. Work with built-in time series datasets such as `AirPassengers`, and perform basic decomposition and forecasting.
4. Perform clustering (k-means) and evaluate cluster assignments.
5. Use basic parallelism (`parallel` package) to speed up computations.


## Model Building Flow (Text Diagram)
```
  Data Collection --> Data Cleaning --> Exploratory Analysis --> Feature Selection
         |                                                      |
         +------------------> Model Training --------------------+
                                   |
                           Model Evaluation & Validation
                                   |
                             Deployment / Reporting
```
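
The training and evaluation steps of this flow can be sketched in a few lines. This is a minimal illustration, not a full pipeline: it assumes the built-in `mtcars` data, a 70/30 random split, an arbitrary seed, and the `mpg ~ wt + hp` formula used in this notebook's linear model example. The names `train_idx`, `train`, `test`, and `rmse` are chosen here for illustration.

```r
# Sketch of "Model Training" and "Model Evaluation" on a hold-out set
data(mtcars)
set.seed(1)                                    # arbitrary seed, for a reproducible split
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- mtcars[train_idx, ]                   # rows used for fitting
test  <- mtcars[-train_idx, ]                  # rows held back for evaluation

fit  <- lm(mpg ~ wt + hp, data = train)        # model training
pred <- predict(fit, newdata = test)           # predictions on unseen rows
rmse <- sqrt(mean((test$mpg - pred)^2))        # root mean squared error on the test set
rmse
```

With only 32 rows the test RMSE depends heavily on which rows land in the test set, so a single split like this is best read as a rough indication.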


In [ ]:
# --- Linear Model (lm) Example using mtcars ---
data(mtcars)
head(mtcars)
# Predict mpg using wt and hp
lm_model <- lm(mpg ~ wt + hp, data=mtcars)
summary(lm_model)

# Plot actual vs predicted
png('lm_actual_vs_predicted.png', width=600, height=400)
plot(mtcars$mpg, predict(lm_model), xlab='Actual MPG', ylab='Predicted MPG', main='LM: Actual vs Predicted')
abline(0,1)
dev.off()
print('Saved lm_actual_vs_predicted.png')


In [ ]:
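# --- Generalized Linear Model (glm) Example ---
# Work on a copy so that the mtcars data frame keeps only its original columns
mtcars_glm <- mtcars
# Create a binary response: 1 if mpg > 20, otherwise 0
mtcars_glm$is_high_mpg <- as.integer(mtcars_glm$mpg > 20)
# Logistic regression: binomial family with the default logit link
glm_model <- glm(is_high_mpg ~ wt + hp, data=mtcars_glm, family=binomial)
summary(glm_model)

# Predicted probabilities of is_high_mpg = 1 for each car
probs <- predict(glm_model, type='response')
head(probs)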
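
Predicted probabilities are usually turned into class labels and compared with the observed outcome. The sketch below assumes a 0.5 cutoff (an arbitrary choice, not taken from the notebook) and evaluates on the same rows used for fitting, which tends to give an optimistic picture. The names `dat`, `pred_class`, and `conf_mat` are illustrative.

```r
# Sketch: confusion matrix and accuracy for the logistic model
data(mtcars)
dat <- mtcars                                     # copy, so mtcars itself is not modified
dat$is_high_mpg <- as.integer(dat$mpg > 20)

fit   <- glm(is_high_mpg ~ wt + hp, data = dat, family = binomial)
probs <- predict(fit, type = 'response')          # fitted probabilities in [0, 1]

pred_class <- as.integer(probs > 0.5)             # assumed cutoff of 0.5
conf_mat <- table(Predicted = pred_class, Actual = dat$is_high_mpg)
conf_mat

accuracy <- mean(pred_class == dat$is_high_mpg)   # share of correctly classified cars
accuracy
```

If `glm()` warns that fitted probabilities of 0 or 1 occurred, the classes are (nearly) separable by the predictors and the coefficient estimates should be interpreted with care.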


In [ ]:
# --- Non-linear Least Squares (nls) Example ---
# Synthetic data: y = a * exp(b * x) + noise
set.seed(123)
x <- seq(0, 5, length.out=50)
y <- 2.5 * exp(0.8 * x) + rnorm(length(x), sd=2)
plot(x, y, main='Synthetic non-linear data')

# Fit nls model
nls_model <- nls(y ~ a * exp(b * x), start=list(a=1, b=0.5))
summary(nls_model)

# Plot fitted curve and save
png('nls_fit.png', width=600, height=400)
plot(x, y, main='nls Fit')
lines(x, predict(nls_model), col='blue', lwd=2)
dev.off()
print('Saved nls_fit.png')
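
`nls()` needs starting values, and poor ones can prevent convergence. One common way to obtain them for an exponential model is to fit a straight line to `log(y)` first. The sketch below assumes the same synthetic data as the example; the helper names `d`, `init`, and `start_vals` are illustrative, and rows with non-positive `y` are dropped before taking logs.

```r
# Sketch: derive nls() starting values from a log-linear fit
set.seed(123)
x <- seq(0, 5, length.out = 50)
y <- 2.5 * exp(0.8 * x) + rnorm(length(x), sd = 2)
d <- data.frame(x = x, y = y)

# log(y) = log(a) + b * x is linear in x, so lm() gives rough estimates of a and b
init <- lm(log(y) ~ x, data = d[d$y > 0, ])
start_vals <- list(a = exp(coef(init)[[1]]), b = coef(init)[[2]])

fit <- nls(y ~ a * exp(b * x), data = d, start = start_vals)
coef(fit)                                         # compare with the true values a = 2.5, b = 0.8
```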


In [ ]:
# --- Time Series: AirPassengers ---
data(AirPassengers)
AP <- AirPassengers
start(AP); end(AP); frequency(AP)
plot(AP, main='AirPassengers time series')

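# Decompose into trend, seasonal and random components.
# The seasonal swings in AirPassengers grow with the level of the series,
# so a multiplicative model is used (decompose() defaults to type='additive').
decomp <- decompose(AP, type='multiplicative')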
plot(decomp)
png('AirPassengers_decompose.png', width=800, height=600)
plot(decomp)
dev.off()
print('Saved AirPassengers_decompose.png')

# Simple forecasting using auto.arima (forecast package)
# Install the package only if it is missing, then attach it
if (!requireNamespace('forecast', quietly=TRUE)) install.packages('forecast', repos='https://cloud.r-project.org')
library(forecast)
fit <- auto.arima(AP)
fcast <- forecast(fit, h=12)
plot(fcast)
png('AirPassengers_forecast.png', width=800, height=600)
plot(fcast)
dev.off()
print('Saved AirPassengers_forecast.png')
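
A forecast is easier to judge when part of the series is held back and compared with the predictions. The sketch below holds out the last year of `AirPassengers`. To avoid a package dependency it uses base R's `arima()` with a hand-specified seasonal model on the log scale rather than `auto.arima()`; the model order is an assumption made for illustration, not a recommendation.

```r
# Sketch: hold-out evaluation of a 12-month forecast
data(AirPassengers)
AP <- AirPassengers
train <- window(AP, end = c(1959, 12))            # 1949-1959 used for fitting
test  <- window(AP, start = c(1960, 1))           # 1960 held out

# Assumed model: ARIMA(0,1,1)(0,1,1)[12] on log(passengers)
fit <- arima(log(train), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
pred <- exp(predict(fit, n.ahead = 12)$pred)      # back-transform to passenger counts

rmse <- sqrt(mean((test - pred)^2))               # error in passengers (thousands)
mape <- mean(abs((test - pred) / test)) * 100     # error in percent
c(RMSE = rmse, MAPE = mape)
```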


## Clustering Example (kmeans)
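We'll use the `iris` dataset (numeric columns only) for k-means clustering and visualize the clusters on the first two principal components. Note that k-means is run here on the raw measurements, while the PCA used for plotting standardizes them.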


In [ ]:
data(iris)
iris_num <- iris[, 1:4]
set.seed(42)
kmeans_res <- kmeans(iris_num, centers=3, nstart=20)
table(kmeans_res$cluster, iris$Species)

# PCA for 2D visualization
pca <- prcomp(iris_num, scale.=TRUE)
png('kmeans_iris.png', width=700, height=500)
plot(pca$x[,1], pca$x[,2], col=kmeans_res$cluster, pch=19, xlab='PC1', ylab='PC2', main='k-means clusters on iris (PCA projection)')
legend('topright', legend=1:3, col=1:3, pch=19, title='Cluster')
dev.off()
print('Saved kmeans_iris.png')
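
Because k-means relies on Euclidean distance, the result can change when the variables are standardized. The sketch below fits both variants and reports the ratio of between-cluster to total sum of squares as a rough summary of cluster separation. The names `km_raw` and `km_scaled` are illustrative, and the seed is arbitrary.

```r
# Sketch: compare k-means on raw and standardized iris measurements
data(iris)
iris_num <- iris[, 1:4]
set.seed(42)
km_raw    <- kmeans(iris_num, centers = 3, nstart = 20)
km_scaled <- kmeans(scale(iris_num), centers = 3, nstart = 20)

# Share of the total sum of squares accounted for by the clustering (between 0 and 1).
# The two ratios are measured on different scales, so read each on its own.
ss_ratio <- c(raw    = km_raw$betweenss / km_raw$totss,
              scaled = km_scaled$betweenss / km_scaled$totss)
ss_ratio

# Cross-tabulate the standardized clustering against the known species
table(Cluster = km_scaled$cluster, Species = iris$Species)
```

The species labels are available here only because `iris` is a labeled dataset; in a real clustering task there is usually no such reference, and internal measures like the ratio above are all there is to go on.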


## Parallel R (basic example)
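We'll use `parallel::mclapply()` to demonstrate parallel computation over a list of inputs. Note: `mclapply()` parallelizes by forking the R process, so it runs in parallel only on Unix-like systems (Linux, macOS, and hosted environments such as Colab). On Windows, forking is unavailable and `mc.cores` must be 1; use `parLapply()` with a cluster created by `makeCluster()` instead.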


In [ ]:
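library(parallel)
nums <- as.list(1:8)
square_fun <- function(x) { x^2 }
# Forking is unavailable on Windows, where mclapply() accepts only mc.cores=1
n_cores <- if (.Platform$OS.type == 'windows') 1L else 4L
res <- mclapply(nums, square_fun, mc.cores=n_cores)
print(res)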
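
On Windows the same computation can be run on a socket cluster with `parLapply()`, which also works on Unix-like systems. This is a minimal sketch: the worker count of 2 is an arbitrary choice, and `cl` is an illustrative name.

```r
# Sketch: portable alternative to mclapply() using a socket cluster
library(parallel)
nums <- as.list(1:8)
square_fun <- function(x) { x^2 }

cl <- makeCluster(2)                       # start 2 worker processes (arbitrary count)
res <- tryCatch(
  parLapply(cl, nums, square_fun),         # apply square_fun to each element on the workers
  finally = stopCluster(cl)                # always shut the workers down, even on error
)
print(unlist(res))                         # the squares of 1 to 8
```

Workers in a socket cluster start as fresh R sessions, so any packages or global objects that the function needs have to be sent to them explicitly, for example with `clusterExport()` or `clusterEvalQ()`.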


## Mini Exercises
1. Fit an `lm()` model to predict `mpg` from all the original predictors in `mtcars` (reload the data with `data(mtcars)` first so that no derived columns are included). Perform stepwise selection with `step()` and report the selected model.
2. Using `AirPassengers`, compute the seasonally adjusted series by removing the seasonal component estimated with `decompose()`.
3. Run k-means on `iris` for `k` = 2 to 6 and plot the total within-cluster sum of squares (`tot.withinss`) against `k` (elbow method).
4. Implement a parallel bootstrap estimator of the mean for a large synthetic dataset using `mclapply()`.


## Lab Questions (Assignment)
1. Build a predictive pipeline: load a dataset, clean it, create features, split into train/test, fit `lm()` or `glm()`, and evaluate RMSE/AUC as appropriate.
2. Using time series data of your choice, decompose, fit an ARIMA model, and forecast the next 12 periods. Save plots and forecast table.
3. Compare hierarchical clustering and k-means on the `mtcars` dataset; discuss cluster stability and interpret clusters.


## Summary & Key Functions
- `lm()`, `glm()`, `nls()`, `kmeans()`, `prcomp()`, `decompose()`, `auto.arima()`, `forecast()`, `parallel::mclapply()`
- Always inspect residuals, check assumptions (linearity, homoscedasticity) and perform cross-validation when possible.
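
Both checks can be sketched with base R alone. The example below assumes the `mpg ~ wt + hp` model from this unit, 5 folds, and an arbitrary seed; `folds` and `cv_rmse` are illustrative names.

```r
# Sketch: residual diagnostics and k-fold cross-validation for a linear model
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Standard diagnostic plots: residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# 5-fold cross-validation of RMSE (fold count and seed are assumptions)
set.seed(1)
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))   # random fold label per row
cv_rmse <- sapply(seq_len(k), function(i) {
  fit_i  <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])    # fit without fold i
  pred_i <- predict(fit_i, newdata = mtcars[folds == i, ])    # predict fold i
  sqrt(mean((mtcars$mpg[folds == i] - pred_i)^2))
})
mean(cv_rmse)                                                 # average out-of-fold RMSE
```

A curved pattern in the residuals-vs-fitted panel suggests a violation of linearity, and a funnel shape suggests non-constant variance.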


In [ ]:
cat('\n✅ Unit V Completed: Modelling in R - Examples, Plots & Exercises included!')