# Lecture 3: Introduction to R

```{note}
R is an open-source programming language and software environment specifically designed for statistical computing, data analysis, and data visualization. Originally developed by statisticians Ross Ihaka and Robert Gentleman in the early 1990s, R has since grown into a powerful tool used by scientists, researchers, and analysts across a range of disciplines. Its strength lies in its extensive ecosystem of packages, active community support, and seamless integration with advanced statistical methods, making it ideal for tasks such as hypothesis testing, regression modeling, data mining, and creating high-quality visualizations.
```

---

## Installing R and RStudio on Windows

Follow these steps:

### Step 1: Install R
1. Visit [https://cran.r-project.org](https://cran.r-project.org)
2. Click on "Download R for Windows" > "base" > Download the installer
3. Run the `.exe` file and follow installation instructions

### Step 2: Install RStudio
1. Visit [https://posit.co/download/rstudio-desktop/](https://posit.co/download/rstudio-desktop/)
2. Download the RStudio Desktop version for Windows
3. Run the installer and follow instructions

Once installed, open **RStudio** to begin writing R code!
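
A quick sanity check in the RStudio Console can confirm that the installation worked. The snippet below is a minimal sketch that uses only base R; the exact version string and search path it prints will depend on your installation.

```r
# Print the version of R that this session is running
print(R.version.string)

# Show the search path, which lists the packages attached to the session
print(search())
```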

## Hello World!

In [1]:
# Hello World in R
print("Hello World!")

[1] "Hello World!"


## Data Types in R

R supports the following basic data types:
- Character
- Numeric
- Integer
- Logical
- Complex
  
Here are some examples:

In [2]:
# Character
x <- "CE5540"
message("Type of x is: ", typeof(x))

# Numeric
y <- 3.14
message("Type of y is: ", typeof(y))

# Integer
v <- 42L
message("Type of v is: ", typeof(v))

# Logical
f <- TRUE
message("Type of f is: ", typeof(f))

# Complex
z <- 2 + 3i
message("Type of z is: ", typeof(z))

Type of x is: character

Type of y is: double

Type of v is: integer

Type of f is: logical

Type of z is: complex
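
Note that `typeof()` reports the storage type, which is why the numeric value above shows up as `double`. The following sketch, with variable names chosen purely for illustration, shows how `typeof()` differs from `class()` and how values can be converted between types.

```r
# A number typed without the L suffix is stored as a double
a <- 42
b <- 42L
print(typeof(a))          # storage type of a
print(typeof(b))          # storage type of b

# class() reports "numeric" for doubles, while typeof() reports "double"
print(class(a))

# Explicit coercion between types
n    <- as.integer("7")      # character to integer
s    <- as.character(3.14)   # numeric to character
flag <- as.logical(0)        # 0 becomes FALSE, any other number becomes TRUE
print(n)
print(s)
print(flag)

# Type predicates return TRUE or FALSE; integers also count as numeric
print(is.numeric(b))
print(is.character(s))
```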



## Data Structures in R

R supports the following data structures:
- Vectors
- Matrices
- Lists
- Data Frames

Here are some examples:

In [28]:
# Vectors
v1 <- c("Apple", "Banana", "Mango")
v2 <- c(9, 1, 5, 4, 6, 7, 0, 3, 8)
v3 <- c(1:5)
message("# Vectors")
print(v1)
print(v2)
print(v3)
message("Accessing a value in a vector: v1[1] = ", v1[1], ". Notice that R follows 1-based indexing!")

# Matrices
m1 <- matrix(c(9, 1, 5, 4, 6, 7, 0, 3, 8), nrow = 3, byrow = TRUE)
m2 <- matrix(c(9, 1, 5, 4, 6, 7, 0, 3, 8), nrow = 3, byrow = FALSE)
message("\n# Matrices")
print(m1)
print(m2)
message("Accessing a value in a matrix: m1[1, 3] = ", m1[1,3])

# Lists
l <- list(name="John", age=25L, scores=c(90, 85, 88))
message("\n# List")
print(l)
message("Accessing a value in a list: l$name = ", l$name)

# Data Frames
df <- data.frame(Name=c("Alice", "Bob"), Age=c(23L, 25L))
message("\n# Data Frames")
print(df)
message("Accessing a value in a data frame: df$Age[1] = ", df$Age[1])


# Vectors



[1] "Apple"  "Banana" "Mango" 
[1] 9 1 5 4 6 7 0 3 8
[1] 1 2 3 4 5


Accessing a value in a vector: v1[1] = Apple. Notice that R follows 1-based indexing!


# Matrices



     [,1] [,2] [,3]
[1,]    9    1    5
[2,]    4    6    7
[3,]    0    3    8
     [,1] [,2] [,3]
[1,]    9    4    0
[2,]    1    6    3
[3,]    5    7    8


Accessing a value in a matrix: m1[1, 3] = 5


# List



$name
[1] "John"

$age
[1] 25

$scores
[1] 90 85 88



Accessing a value in a list: l$name = John


# Data Frames



   Name Age
1 Alice  23
2   Bob  25


Accessing a value in a data frame: df$Age[1] = 23
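
Each structure has its own indexing rules, and mixing them up is a common source of errors. The sketch below illustrates the main ones; the object names `m`, `lst`, and `people` are chosen for illustration so that they do not overwrite the objects defined above.

```r
# Matrix: a single pair of brackets with row and column separated by a comma
m <- matrix(c(9, 1, 5, 4, 6, 7, 0, 3, 8), nrow = 3, byrow = TRUE)
print(m[1, 3])   # element in row 1, column 3
print(m[2, ])    # entire second row
print(m[, 1])    # entire first column

# List: [ ] returns a sub-list, while [[ ]] and $ return the element itself
lst <- list(name = "John", age = 25L, scores = c(90, 85, 88))
print(class(lst["scores"]))     # a list of length 1
print(class(lst[["scores"]]))   # the numeric vector inside
print(lst$scores[2])            # second score

# Data frame: filter rows with a logical condition, keep all columns
people <- data.frame(Name = c("Alice", "Bob"), Age = c(23L, 25L))
print(people[people$Age > 24, ])
print(nrow(people))
print(ncol(people))
```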



## Control Flow

Here is how to write control flow statements in R:

In [29]:
x <- 10

if (x > 0) {
  message("x is a positive number")
} else if (x < 0) {
  message("x is a negative number")
} else {
  message("x is zero!")
}

x is a positive number
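
An `if` statement expects a single logical condition. To apply the same test to every element of a vector, base R provides `ifelse()`. The following is a small sketch with illustrative values:

```r
# ifelse() evaluates the condition element by element
values <- c(-3, 0, 7)
signs  <- ifelse(values > 0, "positive",
                 ifelse(values < 0, "negative", "zero"))
print(signs)
```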



## Writing Loops in R

R supports both `for` and `while` loops.

In [37]:
# For loop
message("# For Loop")
for (i in 1:5) {
  message("Iteration:", i)
}

# While loop
message("\n\n# While Loop")
i <- 1
while (i <= 5) {
  message("Count:", i)
  i <- i + 1
}

# For Loop

Iteration:1



Iteration:2

Iteration:3

Iteration:4

Iteration:5



# While Loop

Count:1

Count:2

Count:3

Count:4

Count:5
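
A few related loop idioms are worth knowing. The sketch below, using illustrative data, shows `seq_along()` for iterating over a vector by index, `next` and `break` for controlling a loop, and `sapply()` as a vectorised alternative to an explicit loop.

```r
# seq_along() yields an empty sequence for empty input, unlike 1:length(x)
fruits <- c("Apple", "Banana", "Mango")
for (i in seq_along(fruits)) {
  message(i, ": ", fruits[i])
}

# next skips to the following iteration, break leaves the loop entirely
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next   # skip even numbers
  if (i > 7) break        # stop once i exceeds 7
  total <- total + i      # adds 1, 3, 5 and 7
}
print(total)

# sapply() applies a function to each element and collects the results
squares <- sapply(1:5, function(k) k^2)
print(squares)
```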



## Writing Functions in R

Functions are blocks of code that can be reused. Here's how to define and call one.

In [42]:
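# Factorial Function (Iterative Form)
factorial_iterative <- function(n) {
  result <- 1
  # seq_len(n) is empty when n is 0, so 0! correctly returns 1;
  # 2:n would count backwards (2, 1, 0) for n = 0 and give a wrong result
  for (i in seq_len(n)) {
    result <- result * i
  }
  return(result)
}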

# Example usage
factorial_iterative(5)

In [41]:
# Factorial Function (Recursive Form)
factorial_recursive <- function(n) {
  if (n == 0 || n == 1) {
    return(1)
  } else {
    return(n * factorial_recursive(n - 1))
  }
}

# Example usage
factorial_recursive(5)
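
Neither version above checks its input. As a minimal sketch, the hypothetical helper `safe_factorial()` below (not part of the lecture code) validates its argument, relies on `prod()` instead of a loop, and is compared against base R's built-in `factorial()`.

```r
# Hypothetical helper: factorial with input validation
safe_factorial <- function(n) {
  # Reject anything that is not a single non-negative whole number
  if (!is.numeric(n) || length(n) != 1 || n < 0 || n != floor(n)) {
    stop("n must be a single non-negative whole number")
  }
  prod(seq_len(n))   # prod() of an empty sequence is 1, so 0! = 1
}

print(safe_factorial(5))
print(safe_factorial(0))
print(safe_factorial(5) == factorial(5))   # base R provides factorial()

# tryCatch() captures the error message instead of halting the script
msg <- tryCatch(safe_factorial(-1), error = function(e) conditionMessage(e))
print(msg)
```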

---

## Summarising Data in R

In this segment of the lecture, we compute the measures of location, dispersion, and shape discussed in the previous lecture, using the 2024 ITUS sample individual dataset.

In [3]:
# 2024 ITUS sample individual data
url  <- "https://raw.githubusercontent.com/anmpahwa/CE5540/refs/heads/main/resources/2024ITUS_IndividualData_Sample.csv"
data <- read.csv(url) # Loading Data
# typeof() reports "list" because a data frame is stored as a list of columns
message("2024 ITUS sample individual data is retrieved as ", typeof(data), " as follows: ")
str(data)             # Data Structure

2024 ITUS sample individual data is retrieved as list as follows: 



'data.frame':	100 obs. of  23 variables:
 $ survey_year      : int  2024 2024 2024 2024 2024 2024 2024 2024 2024 2024 ...
 $ fsu_serial_no    : int  30010 30010 30010 30010 30010 30010 30010 30010 30010 30010 ...
 $ sector           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ nss_region       : int  241 241 241 241 241 241 241 241 241 241 ...
 $ district         : int  17 17 17 17 17 17 17 17 17 17 ...
 $ stratum          : int  13 13 13 13 13 13 13 13 13 13 ...
 $ sub_stratum      : int  11 11 11 11 11 11 11 11 11 11 ...
 $ sub_round        : int  2 2 2 2 2 2 2 2 2 2 ...
 $ fod_sub_region   : int  2420 2420 2420 2420 2420 2420 2420 2420 2420 2420 ...
 $ nsc              : int  4 4 4 4 4 4 4 4 4 4 ...
 $ household_id     : int  1 2 2 2 2 3 3 3 3 3 ...
 $ individual_id    : int  1 1 2 3 4 1 2 3 4 5 ...
 $ response_code    : int  1 1 1 1 99999 1 1 1 1 1 ...
 $ day_of_week      : int  2 3 3 3 99999 7 7 7 7 7 ...
 $ type_of_day      : int  1 1 1 1 99999 1 1 1 1 1 ...
 $ relation_to_head : int  1 1 2

In [20]:
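# Build a frequency table of age: v holds the distinct ages, f their counts
v  <- sort(unique(data$age))
f  <- numeric(length(v))

# Tally each observation against its matching age
for (r in seq_len(nrow(data))) {
  z <- data$age[r]
  i <- which(v == z)
  f[i] <- f[i] + 1
}

df <- data.frame(age=v, f=f)

print(df)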

   age f
1    1 2
2    2 4
3    3 5
4    4 2
5    5 4
6    6 1
7    7 2
8    8 2
9    9 2
10  11 2
11  12 1
12  13 2
13  14 1
14  15 2
15  16 2
16  17 3
17  18 3
18  20 3
19  21 1
20  22 2
21  23 1
22  24 3
23  25 4
24  27 1
25  28 2
26  29 1
27  30 2
28  31 2
29  32 2
30  33 1
31  35 4
32  38 2
33  40 3
34  42 2
35  43 1
36  44 1
37  45 3
38  48 1
39  50 2
40  52 2
41  53 1
42  54 2
43  55 1
44  59 1
45  60 4
46  62 1
47  65 2
48  66 1
49  89 1
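
The loop above makes the counting explicit, but base R's `table()` produces the same tally in one call. The sketch below uses a short hypothetical vector `ages` in place of `data$age` so that it runs without downloading the dataset.

```r
# Hypothetical ages standing in for data$age
ages <- c(3, 1, 3, 2, 1, 3)

# table() counts the occurrences of each distinct value, sorted by value
counts <- table(ages)

# Convert to a data frame with the same layout as the manual version
freq <- data.frame(age = as.numeric(names(counts)), f = as.vector(counts))
print(freq)
```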


In [None]:
# Measures of Location

## Mean
### Manually Computed Mean
v1 <- sum(df$age * df$f) / sum(df$f)
### Auto Computed Mean
v2 <- mean(data$age)

message("Manually Computed Mean = ", v1, " and Auto Computed Mean = ", v2)


## Median
### Manually Computed Median
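qtl <- function (x, f, p) {
  # x: sorted distinct values, f: their frequencies, p: probability in (0, 1)
  # cum and total avoid the names F and T, which R treats as FALSE and TRUE
  cum   <- 0        # cumulative frequency before the current value
  total <- sum(f)
  xl <- -1          # value at which the cumulative proportion first reaches p
  xu <- -1          # value at which the cumulative proportion first exceeds p
  for (r in seq_along(x)) {
    if (cum / total < p && (cum + f[r]) / total >= p) {
      xl <- x[r]
    }
    if (cum / total <= p && (cum + f[r]) / total > p) {
      xu <- x[r]
    }
    cum <- cum + f[r]
  }

  # Average the two candidates when both were found (-1 marks "not found")
  if (xl == -1) {
    v <- xu
  } else if (xu == -1) {
    v <- xl
  } else {
    v <- (xl + xu) / 2
  }
  return(v)
}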
v1 <- qtl(df$age, df$f, 0.5)
### Auto Computed Median
v2 <- median(data$age)

message("Manually Computed Median = ", v1, " and Auto Computed Median = ", v2)


## Mode
v <- df$age[which.max(df$f)]

message("Mode = ", v)


Manually Computed Mean = 27.55 and Auto Computed Mean = 27.55

Manually Computed Median = 24.5 and Auto Computed Median = 24.5

Mode = 3
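
The same measures can be cross-checked with a few base R helpers. The sketch below uses a small hypothetical frequency table `freq`, not the ITUS data. It computes the mean with `weighted.mean()`, rebuilds the raw observations with `rep()` to obtain the median, and returns every mode when several values tie, whereas `which.max()` returns only the first.

```r
# Hypothetical frequency table (values and their counts)
freq <- data.frame(age = c(1, 2, 3, 5), f = c(2, 3, 3, 2))

# weighted.mean() computes sum(x * w) / sum(w)
m <- weighted.mean(freq$age, freq$f)

# rep() expands the table back into raw observations
raw <- rep(freq$age, times = freq$f)
med <- median(raw)

# Comparing against max() keeps all tied modes
modes <- freq$age[freq$f == max(freq$f)]

print(m)
print(med)
print(modes)
```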



In [None]:
# Measures of Dispersion

## Range
v <- max(df$age) - min(df$age)

message("Range = ", v)


## Inter-Quartile Range
### Manually Computed Inter-Quartile Range
v1 <- qtl(df$age, df$f, 0.75) - qtl(df$age, df$f, 0.25)
### Auto Computed Inter-Quartile Range
v2 <- IQR(data$age)

message("Manually Computed Inter-Quartile Range = ", v1, " and Auto Computed Inter-Quartile Range = ", v2)


## Standard Deviation
### Manually Computed Standard Deviation
# Deviations are taken about mean(data$age); mean(df$age) would average only the distinct ages
v1 <- sqrt(sum(df$f * (df$age - mean(data$age))^2) / sum(df$f))
### Auto Computed Standard Deviation
v2 <- sd(data$age)

message("Manually Computed Standard Deviation = ", round(v1, 4), " and Auto Computed Standard Deviation = ", round(v2, 4))
message("The manual computation divides by n (population formula), whereas sd() divides by n - 1 (sample formula), hence the small difference.")

Range = 88

Manually Computed Inter-Quartile Range = 31 and Auto Computed Inter-Quartile Range = 31

Manually Computed Standard Deviation = 19.6491 and Auto Computed Standard Deviation = 19.7481

The manual computation divides by n (population formula), whereas sd() divides by n - 1 (sample formula), hence the small difference.
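
The relationship between the population and sample standard deviations can be verified on a small example. The sketch below uses a hypothetical vector `x`; the two formulas differ only in the denominator, so the results are related by a factor of sqrt((n - 1) / n).

```r
# Hypothetical sample
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

sd_sample     <- sd(x)                            # divides by n - 1
sd_population <- sqrt(sum((x - mean(x))^2) / n)   # divides by n

# Rescaling the sample value recovers the population value
print(sd_population)
print(sd_sample * sqrt((n - 1) / n))
```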


In [None]:
# install.packages("moments")
library(moments)
# Measures of Shape

## Skewness
### Manually Computed Skewness
v1 <- (sum(df$f * (df$age - mean(data$age))^3) / sum(df$f)) / sd(data$age)^3
### Auto Computed Skewness
v2 <- skewness(data$age)

message("Manually Computed Skewness = ", v1, " and Auto Computed Skewness = ", v2)
message("The difference arises because the manual computation standardises by sd(), which uses the sample formula (n - 1), whereas skewness() uses population central moments (n).")


## Kurtosis
### Manually Computed Kurtosis
v1 <- (sum(df$f * (df$age - mean(data$age))^4) / sum(df$f)) / sd(data$age)^4
### Auto Computed Kurtosis
v2 <- kurtosis(data$age)

message("Manually Computed Kurtosis = ", v1, " and Auto Computed Kurtosis = ", v2)
message("The difference arises because the manual computation standardises by sd(), which uses the sample formula (n - 1), whereas kurtosis() uses population central moments (n).")

Manually Computed Skewness = 0.565400796831327 and Auto Computed Skewness = 0.573989072316338

The difference arises because the manual computation standardises by sd(), which uses the sample formula (n - 1), whereas skewness() uses population central moments (n).

Manually Computed Kurtosis = 2.55963807679161 and Auto Computed Kurtosis = 2.61160909783859

The difference arises because the manual computation standardises by sd(), which uses the sample formula (n - 1), whereas kurtosis() uses population central moments (n).
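
Skewness and kurtosis can also be computed in base R directly from central moments. The sketch below uses a hypothetical vector `x` and the n-denominator convention, which appears to be the one the `moments` package follows (the ratios in the output above are consistent with it). It also shows how standardising by `sd()` rescales both measures.

```r
# Hypothetical sample
x <- c(1, 2, 2, 3, 3, 3, 4, 10)
n <- length(x)

# Central moments with an n denominator
m2 <- sum((x - mean(x))^2) / n
m3 <- sum((x - mean(x))^3) / n
m4 <- sum((x - mean(x))^4) / n

skew_pop <- m3 / m2^(3 / 2)
kurt_pop <- m4 / m2^2

# Standardising by the sample standard deviation scales both by a factor below 1
skew_sd <- m3 / sd(x)^3
kurt_sd <- m4 / sd(x)^4

print(c(skew_pop, skew_sd))
print(c(kurt_pop, kurt_sd))
```

Under these assumptions, `skew_sd` equals `skew_pop` times ((n - 1) / n)^(3/2), and `kurt_sd` equals `kurt_pop` times ((n - 1) / n)^2.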

