# Assignment 2: Functions and Data Wrangling

In [1]:
options(jupyter.rich_display = FALSE)

## POP77001 Computer Programming for Social Scientists

## Before Submission

-   Make sure that you can run all cells without errors
-   You can do it by clicking `Kernel`, `Restart & Run All` in the menu
    above
-   Make sure that you save the output by pressing Command+S / CTRL+S
-   Rename the file from `02_assignment.ipynb` to
    `02_assignment_lastname_firstname_studentnumber.ipynb`
-   Use the Firefox browser to submit your Jupyter notebook on
    Blackboard.

## Exercise 1: Formalising Expectations

[Student’s t-test](https://en.wikipedia.org/wiki/Student's_t-test)
(developed by [William
Gosset](https://en.wikipedia.org/wiki/William_Sealy_Gosset) when working
at the Guinness Brewery) is a classical tool for testing differences in
means between groups when samples are small. It is a very simple test to
implement and understand, but it is also very powerful and is used in
many applications.

Re-visit the function for calculating the t-test from the tutorial. How
can a call to this function produce invalid output? What are your
expectations about the input? Re-write the function below so that it
checks the user’s input and handles cases in which the user provides
invalid values.

In [2]:
t_test <- function(x, y) {
  # Calculate the means of the two samples
  mean_x <- mean(x)
  mean_y <- mean(y)

  # Calculate the variances of the two samples
  var_x <- var(x)
  var_y <- var(y)

  # Calculate the sample sizes
  n_x <- length(x)
  n_y <- length(y)

  # Calculate the sample standard errors
  se_x <- sqrt(var_x / n_x)
  se_y <- sqrt(var_y / n_y)
  se <- sqrt(se_x^2 + se_y^2)

  # Calculate the t-statistic
  t_stat <- (mean_x - mean_y) / se

  # Calculate the degrees of freedom
  df <- se^4 / (se_x^4 / (n_x - 1) + se_y^4 / (n_y - 1))

  # Calculate the p-value
  p_val <- 2 * (1 - pt(abs(t_stat), df))

  # Create a result list
  res <- list(
    t_statistic = t_stat,
    degrees_of_freedom = df,
    p_value = p_val
  )

  return(res)
}
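The questions above can be explored without calling `t_test()` at all, by probing the building blocks it relies on. The snippet below is a minimal sketch; the inputs `813` and `"5, 30"` are only illustrative values, and the variable names are hypothetical.

```r
# A single observation has no sample variance: var() returns NA,
# so every quantity computed from it downstream is NA as well
var_single <- var(813)
print(is.na(var_single))   # TRUE

# With n = 1 the degrees-of-freedom formula divides by (n - 1) = 0
n_single <- length(813)
print(n_single - 1)        # 0

# A character vector cannot be averaged: mean() warns and returns NA
mean_text <- suppressWarnings(mean(c("5, 30")))
print(is.na(mean_text))    # TRUE
```

In both cases the function would finish without an error yet return values that are not a valid test result, which is why explicit input checks are worth adding.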

In [10]:
# Exercise 1:

# Your code goes here

##### How can a call to this function produce invalid output? #####

# t_test(813, 530) -> The user can pass length-1 vectors, for which the variance is undefined (NA).
# t_test(c(8, 13), c("5, 30")) -> The user can pass a non-numeric (here: character) vector.

##### What are your expectations about the input? #####
# We expect both inputs to be numeric vectors of length > 1.

# Below is the re-written code

t_test <- function(x, y) {

  # Ensure each input has at least 2 elements (this also rejects empty vectors)
  if (length(x) < 2 || length(y) < 2) {
    stop("Both arguments must be vectors of length > 1.")
  }

  # Ensure the inputs are numeric
  if (!(is.numeric(x) && is.numeric(y))) {
    stop("Both arguments must be numeric.")
  }

  # Calculate the means of the two samples
  mean_x <- mean(x)
  mean_y <- mean(y)

  # Calculate the variances of the two samples
  var_x <- var(x)
  var_y <- var(y)

  # Calculate the sample sizes
  n_x <- length(x)
  n_y <- length(y)

  # Calculate the sample standard errors
  se_x <- sqrt(var_x / n_x)
  se_y <- sqrt(var_y / n_y)
  se <- sqrt(se_x^2 + se_y^2)

  # Calculate the t-statistic
  t_stat <- (mean_x - mean_y) / se

  # Calculate the degrees of freedom
  df <- se^4 / (se_x^4 / (n_x - 1) + se_y^4 / (n_y - 1))

  # Calculate the p-value
  p_val <- 2 * (1 - pt(abs(t_stat), df))

  # Create a result list
  res <- list(
    t_statistic = t_stat,
    degrees_of_freedom = df,
    p_value = p_val
  )

  return(res)
}

# Test the results
exercise_1_results_test <- t_test(c(8, 13), c(5, 30))
print(exercise_1_results_test)

$t_statistic
[1] -0.5491252

$degrees_of_freedom
[1] 1.079872

$p_value
[1] 0.6747619
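
The formulas in `t_test()` are those of Welch's unequal-variances t-test, which is also what the built-in `t.test()` computes by default. The printed result can therefore be cross-checked as in the sketch below; the name `check` is hypothetical.

```r
x <- c(8, 13)
y <- c(5, 30)

# Built-in Welch test (var.equal = FALSE is the default)
builtin <- t.test(x, y)

# Collect the three quantities that t_test() also reports
check <- c(
  t_statistic = unname(builtin$statistic),
  degrees_of_freedom = unname(builtin$parameter),
  p_value = builtin$p.value
)
print(check)
```

If the two implementations agree, `check` should match the values printed above up to rounding.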



## Exercise 2: Fibonacci sequence

[Fibonacci sequence](https://en.wikipedia.org/wiki/Fibonacci_number) is
an integer sequence that frequently appears in mathematics and is often
associated with the [golden
ratio](https://en.wikipedia.org/wiki/Golden_ratio).

-   The Fibonacci sequence starts with the first element being equal to 0
    and the second element being equal to 1.
-   All subsequent elements are the sums of the preceding two elements.
-   More formally: $F_0 = 0, F_1 = 1, F_n = F_{n-1} + F_{n-2}, n \gt 1$
-   The beginning of the sequence then looks like this: \[0, 1, 1, 2, 3,
    5, 8, …\]

Implement a function that takes a vector of integers as an input and
returns a vector of numbers from Fibonacci sequence, whose indices
correspond to the integers from the input vector.

Test the implemented function on the following inputs:

-   `10`
-   `c(1, 2, 3, 4, 5)`
-   `c(3, 9, 12)`

Function specification:

Function takes 1 argument:

-   `num_vec` - vector of whole numbers

Function returns 1 object:

-   `fib` - vector of numbers from Fibonacci sequence corresponding to
    indices from the input vector

Example input → output:

-   $[1, 3, 6]$ → $[0, 1, 5]$

In [4]:
# Exercise 2:

# Your code goes here

fibonacci_index <- function(num_vec) {

  # Convert the vector elements to integer type
  num_vec <- as.integer(num_vec)

  # Find the largest requested index
  max_index <- max(num_vec)

  # Pre-allocate the sequence with at least 2 elements for the seed values
  fibonacci_seq <- numeric(max(max_index, 2))

  # Seed values: F_0 = 0 (position 1) and F_1 = 1 (position 2)
  fibonacci_seq[1] <- 0
  fibonacci_seq[2] <- 1

  # Fill in F_n = F_{n-1} + F_{n-2}; the guard stops 3:max_index
  # from counting backwards when max_index < 3
  if (max_index >= 3) {
    for (i in 3:max_index) {
      fibonacci_seq[i] <- fibonacci_seq[i - 1] + fibonacci_seq[i - 2]
    }
  }

  # Make num_vec as index of the fibonacci sequence
  fib <- fibonacci_seq[num_vec]

  return(fib)
}

# Test the Results
test_1 <- fibonacci_index(10); print(test_1)
test_2 <- fibonacci_index(c(1, 2, 3, 4, 5)); print(test_2)
test_3 <- fibonacci_index(c(3, 9, 12)); print(test_3)

[1] 34
[1] 0 1 1 2 3
[1]  1 21 89
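
As a cross-check, here is an alternative minimal sketch that also covers small inputs such as `1` or `2`. It assumes positions are counted from 1, as in the example input → output of the specification, so position 1 holds $F_0 = 0$. The name `fib_at` and the input checks are hypothetical, not part of the assignment.

```r
fib_at <- function(num_vec) {
  # Assumed input contract: positive whole numbers only
  stopifnot(is.numeric(num_vec),
            all(num_vec >= 1),
            all(num_vec == floor(num_vec)))

  # Start from the two seed values and extend only as far as needed
  fib_seq <- c(0, 1)
  while (length(fib_seq) < max(num_vec)) {
    n <- length(fib_seq)
    fib_seq <- c(fib_seq, fib_seq[n] + fib_seq[n - 1])
  }

  # Pick the requested positions
  fib_seq[num_vec]
}

print(fib_at(c(1, 3, 6)))  # 0 1 5
print(fib_at(1))           # 0
print(fib_at(10))          # 34
```

Growing the vector inside the loop is fine for short sequences; for long ones, pre-allocating it is the more efficient choice.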


## Exercise 3: Calculating t-tests

Assume that you drew a random sample of $200$ individuals and ran
$1,000$ experiments on them (an experiment does not need to mean
something big, think, different shapes of some button on a website or
Guinness served at slightly different temperatures).

Implement a function that takes a matrix of experimental results and a
vector with experimental/control-group assignment and returns a vector
of calculated t-statistics for each of the experiments. Internally, you
can use either the `t_test()` function from before or the built-in
`t.test()` function in R. See simulated input data below.

Function specification:

Function takes 2 arguments:

-   `mat` - matrix of experimental results (rows - individuals,
    columns - experiments)
-   `grp` - vector of experimental/control-group assignment

Function returns 1 object:

-   `tstats` - vector of calculated t-statistics for each of the
    experiments

In [5]:
set.seed(2023)
n <- 200
m <- 1000
mat <- matrix(rnorm(m * n, mean = 20, sd = 3), nrow = 200)
# For simplicity, let's assume that assignment to control and experimental groups is always the same
grp <- rep(0:1, times = 100)

In [6]:
# Exercise 3:

# Your code goes here

matrix_vector <- function(mat, grp) {
  col_num <- ncol(mat)
  tstats <- numeric(col_num)

  for (i in seq_len(col_num)) {

    # Split the column into experimental and control groups
    data_exp <- mat[grp == 1, i]
    data_ctrl <- mat[grp == 0, i]

    # Calculate the mean of each group
    mean_exp <- mean(data_exp)
    mean_ctrl <- mean(data_ctrl)

    # Calculate the standard deviation of each group
    sd_exp <- sd(data_exp)
    sd_ctrl <- sd(data_ctrl)

    # Calculate the t-statistic (unequal variances)
    se <- sqrt(sd_exp^2 / length(data_exp) + sd_ctrl^2 / length(data_ctrl))
    t_value <- (mean_exp - mean_ctrl) / se

    # Store the t-statistic in the vector tstats
    tstats[i] <- t_value
  }
  return(tstats)
}

# Check the result
exercise_3_results <- matrix_vector(mat, grp)
print(exercise_3_results)


   [1]  1.9956959972  1.3566367066  1.6256050710 -0.2014228134 -2.3489188421
   [6]  1.2135850837  0.4447660957 -1.1692104395  0.0631488771  1.8530129390
  [11]  0.7039836036 -1.2275547372 -1.0330958213  0.7483325367 -0.5603018406
  [16] -1.5066476130  2.5115833477 -0.7565973502 -1.6256912463 -1.5782072900
  [21] -0.7118832613 -0.1078139072  0.0815483618 -0.8812888982 -1.5891665332
  [26] -0.1118908097  0.0644707960 -0.3267827846 -1.5004813509 -1.5394999065
  [31]  0.2014999481  1.0785698927 -1.3705536085 -0.0898989182  0.7472927083
  [36]  0.2927579042 -2.4968168725  0.6458997136 -1.7117436489 -0.1553690940
  [41]  0.6793471541  0.6900885486  1.8179046346 -0.4429489718 -1.0990468654
  [46] -0.9518362276  2.0034321074 -0.0079048237 -1.3596794464  0.8964585797
  [51]  0.4016744637  0.0775749125 -0.5277018480 -0.8423992384 -0.4721931713
  [56] -0.7099877359 -0.6342580786  1.0277095411  0.1787218193  0.7056383296
  [61] -0.3679947235 -0.5670339638  2.9567855471 -0.7927065969 -2.1440045230
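
The exercise also allows the built-in `t.test()` to be used internally. A minimal sketch of that route, applied column by column with `apply()`, is shown below on a small toy matrix. The names `tstats_by_column`, `toy_mat` and `toy_grp` are hypothetical, and the toy data are not the assignment's data.

```r
# Toy data: 6 individuals, 3 experiments, alternating group assignment
set.seed(1)
toy_mat <- matrix(rnorm(6 * 3, mean = 20, sd = 3), nrow = 6)
toy_grp <- rep(0:1, times = 3)

tstats_by_column <- function(mat, grp) {
  # For each column, run a Welch t-test of experimental vs. control
  # and keep only the t-statistic
  apply(mat, 2, function(column) {
    unname(t.test(column[grp == 1], column[grp == 0])$statistic)
  })
}

toy_tstats <- tstats_by_column(toy_mat, toy_grp)
print(length(toy_tstats))  # one t-statistic per experiment
```

Because `t.test()` uses unequal variances by default, this should give the same statistics as the manual formula, with the sign determined by the order experimental minus control.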

## Exercise 4: Manipulating Data Frames

A list of data frames is a common output of data collection tools.
Below is a list of 10 data frames. Append them together by row. Keep
only unique rows.

In [7]:
set.seed(2023)
dfs <- lapply(1:10, function(n) data.frame(x = letters[sample(1:26,20,TRUE)], y = sample(1:26,20,TRUE)))

In [8]:
# Exercise 4:

# Your code goes here

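# Append the data frames by row
merge_df <- do.call(rbind, dfs)

# Keep only unique rows
unique_df <- unique(merge_df)

# Count how often each value of x occurs in the appended data
counted_values <- table(merge_df$x)
print(counted_values)

# Store the counts in a data frame
counted_df <- data.frame(Count = counted_values)

# Check the result
print(counted_df)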


 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z 
 6  6  6 18 12  9 12  7  6 13  4  7  4  8  7  6 14  6  8  6  6  4  8 11  1  5 
   Count.Var1 Count.Freq
1           a          6
2           b          6
3           c          6
4           d         18
5           e         12
6           f          9
7           g         12
8           h          7
9           i          6
10          j         13
11          k          4
12          l          7
13          m          4
14          n          8
15          o          7
16          p          6
17          q         14
18          r          6
19          s          8
20          t          6
21          u          6
22          v          4
23          w          8
24          x         11
25          y          1
26          z          5
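
For the task as stated (append by row, then keep only unique rows), the following minimal sketch shows how `rbind()` and `unique()` behave on a small toy list. The object names and the toy data are hypothetical.

```r
# Two toy data frames that share one identical row ("a", 1)
toy_dfs <- list(
  data.frame(x = c("a", "b"), y = c(1L, 2L)),
  data.frame(x = c("a", "c"), y = c(1L, 3L))
)

# Append by row, then drop rows that duplicate an earlier row
stacked <- do.call(rbind, toy_dfs)
unique_rows <- unique(stacked)

print(nrow(stacked))      # 4
print(nrow(unique_rows))  # 3
```

`unique()` on a data frame compares whole rows, so a row is dropped only when every column matches an earlier row.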


## Exercise 5: Pivoting Tables

Let’s use the [Kaggle](https://www.kaggle.com) [2022 Machine Learning and
Data Science Survey](https://www.kaggle.com/c/kaggle-survey-2022/), which
we encountered before. The dataset is available on Blackboard.

In question 12 respondents were asked “What programming languages do you
use on a regular basis? (Select all that apply)”. Calculate the percentage
of respondents who use each programming language. Sort the languages by
popularity.

Tip: You can use the `tidyr` and `dplyr` packages to pivot a subset of
columns into a longer format.
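
A minimal sketch of the pivoting idea on a toy data frame follows. The column names `Q12_1` to `Q12_3`, the toy answers and the object names are assumptions made for illustration; the real survey file may be laid out differently.

```r
library(tidyr)
library(dplyr)

# Toy survey: one column per language, blank when not selected
toy_survey <- data.frame(
  Q12_1 = c("Python", "Python", "", "Python"),
  Q12_2 = c("R", "", "", "R"),
  Q12_3 = c("", "SQL", "", ""),
  stringsAsFactors = FALSE
)
n_respondents <- nrow(toy_survey)

lang_pct <- toy_survey %>%
  # One row per respondent-column pair
  pivot_longer(cols = everything(),
               names_to = "Column", values_to = "Language") %>%
  # Drop the blank (unselected) answers
  filter(Language != "") %>%
  # Count users per language and convert to a percentage of respondents
  count(Language, name = "Users") %>%
  mutate(Percent = Users / n_respondents * 100) %>%
  arrange(desc(Percent))

print(lang_pct)
```

Dividing by the number of respondents rather than the number of selected answers keeps the result interpretable as the share of respondents who use each language, so the percentages can add up to more than 100.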

In [9]:
# Exercise 5:

# Your code goes here

# Import the packages
# install.packages("tidyr")
# install.packages("dplyr")
library(tidyr)
library(dplyr)

# Import the data
df <- read.csv("kaggle_survey_2022_responses.csv")

# Select the Q12 columns (columns 31 to 45)
program_df <- df %>%
  select(31:45)

# Number of rows in the data frame, used as the denominator
total <- nrow(program_df)

# Pivot into long format
program_long <- program_df %>%
  pivot_longer(cols = everything(),
               names_to = "Language", values_to = "Used") %>%
  filter(!is.na(Used))  # Remove rows where "Used" is NA

# The first 15 rows come from the first CSV row, which holds the
# question text rather than answers, so drop them
program_long <- program_long[-c(1:15), ]

# Calculate the frequency of each language
lang_freq <- table(program_long$Used)

# Convert the table into a new data frame
lang_freq_df <- as.data.frame(lang_freq)

# The first entry is not a language (presumably the blank,
# i.e. unselected, answers), so drop it
lang_freq_df <- lang_freq_df[-c(1), ]

# Convert frequencies into percentages
lang_freq_df$Freq <- lang_freq_df$Freq / total * 100

# Rename the data frame and its columns
lang_prop_df <- lang_freq_df
colnames(lang_prop_df) <- c("Language", "Prop")

# Sort by percentage in descending order
lang_prop_df <- lang_prop_df[order(-lang_prop_df$Prop), ]
exercise_5_results <- lang_prop_df
print(exercise_5_results)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




     Language      Prop
14     Python 77.727311
16        SQL 40.086674
15          R 19.047421
5         C++ 18.955746
7        Java 16.093008
3           C 15.838820
8  Javascript 14.538712
10     MATLAB 10.171681
2        Bash  6.975581
4          C#  6.138012
13        PHP  6.013001
12      Other  5.592133
6          Go  1.341778
9       Julia  1.233436
11       None  1.066756
