# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

#### a) Make a function to tokenize the text.

In [2]:
library(httr)
library(tokenizers)

tokenize_text <- function(text) {
    tokenizers::tokenize_words(text, lowercase=TRUE, strip_punct=TRUE)[[1]]
}


#### b) Make a function to generate keys for ngrams.

In [3]:
key_from <- function(ngram, sep = "\x1f") {
    paste(ngram, collapse=sep)
}
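
A small sketch of how such a key round-trips, assuming the same unit-separator default. `key_from` is repeated so the snippet runs on its own, and the example words are made up for illustration.

```r
# Same helper as above, repeated so this block is self-contained.
key_from <- function(ngram, sep = "\x1f") {
    paste(ngram, collapse = sep)
}

# Join two tokens into one key, then split the key back into tokens.
key <- key_from(c("the", "king"))
parts <- strsplit(key, "\x1f", fixed = TRUE)[[1]]

# The unit separator is a control character that does not occur in
# word tokens, so splitting on it recovers the original tokens.
print(parts)
```

Using a control character rather than a space avoids ambiguity if a token ever contains the separator itself.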

#### c) Make a function to build an ngram table.

In [5]:
build_ngram_table <- function(tokens, n, sep = "\x1f") {
    if (length(tokens) < n) return(new.env(parent = emptyenv()))
    tbl <- new.env(parent = emptyenv())
    for (i in seq_len(length(tokens) - n + 1L)) {
        ngram <- tokens[i:(i + n - 2L)]
        next_word <- tokens[i + n - 1L]
        key <- paste(ngram, collapse = sep)
        counts <- if (!is.null(tbl[[key]])) tbl[[key]] else integer(0)
        if (next_word %in% names(counts)) {
            counts[[next_word]] <- counts[[next_word]] + 1L
        } else {
            counts[[next_word]] <- 1L
        }
        tbl[[key]] <- counts
    }
    tbl
}
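
To make the table's shape concrete, here is a vectorized sketch that produces the same kind of counts for a tiny, made-up token vector. It is an illustration using `split()` and `table()`, not the environment-based function above, and the names `contexts`, `next_words`, and `counts_by_context` are hypothetical.

```r
# Toy tokens (made up for illustration) and a trigram model (n = 3).
tokens <- c("the", "king", "was", "old",
            "the", "king", "was", "wise",
            "the", "king", "slept")
n <- 3L
sep <- "\x1f"

# Every position where a full n-gram starts.
starts <- seq_len(length(tokens) - n + 1L)

# The first n - 1 words form the context key; the n-th word is what follows.
contexts <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 2L)], collapse = sep),
    character(1)
)
next_words <- tokens[starts + n - 1L]

# Group the following words by context and count them.
counts_by_context <- lapply(split(next_words, contexts), table)

# Counts of the words seen after "the king".
the_king <- counts_by_context[[paste(c("the", "king"), collapse = sep)]]
print(the_king)
```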

#### d) Function to digest the text.

In [6]:
digest_text <- function(text, n) {
    tokens <- tokenize_text(text)
    build_ngram_table(tokens, n)
}

#### e) Function to digest the url.

In [7]:
digest_url <- function(url, n) {
    res <- httr::GET(url)
    txt <- httr::content(res, as = "text", encoding = "UTF-8")
    digest_text(txt,n)
}

#### f) Function that gives random start.

In [8]:
random_start <- function(tbl, sep = "\x1f") {
    keys <- ls(envir = tbl, all.names=TRUE)
    if (length(keys)==0) stop("No n-grams available. Digest text first.")
    picked <- sample(keys, 1)
    strsplit(picked, sep, fixed=TRUE)[[1]]
}

#### g) Function to predict the next word.

In [9]:
predict_next_word <- function(tbl, ngram, sep = "\x1f") {
    key <- paste(ngram, collapse = sep)
    counts <- if(!is.null(tbl[[key]])) tbl[[key]] else integer(0)
    if (length(counts) == 0) return(NA_character_)
    sample(names(counts), size=1, prob=as.numeric(counts))
}
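
The weighted draw in `predict_next_word` can be checked in isolation. This sketch assumes a hypothetical count vector and only shows that `sample()` with `prob` picks words roughly in proportion to their counts.

```r
set.seed(1)

# Hypothetical counts for one context: "was" seen twice, "slept" once.
counts <- c(was = 2L, slept = 1L)

# Draw the next word many times, weighting each word by its count.
draws <- replicate(
    3000,
    sample(names(counts), size = 1, prob = as.numeric(counts))
)

# Observed share of each word; "was" should be close to 2/3.
observed <- prop.table(table(draws))
print(round(observed, 2))
```

`sample()` rescales `prob` internally, so raw counts can be passed without converting them to probabilities first.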

#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [10]:
make_ngram_generator <- function(tbl, n, sep = "\x1f") {
    force(tbl); n <- as.integer(n); force(sep)
    function(start_words = NULL, length = 10L) {
        if ((is.null(start_words)) || length(start_words) != n - 1L) {
            start_words <- random_start(tbl, sep=sep)
        }
        word_sequence <- start_words
        for (i in seq_len(max(0L, length - length(start_words)))) {
            ngram <- tail(word_sequence, n - 1L)
            next_word <- predict_next_word(tbl, ngram, sep=sep)
            if (is.na(next_word)) break
            word_sequence <- c(word_sequence, next_word)
        }
        paste(word_sequence, collapse= " ")
    }
}

In [21]:
gen_txt <- function(source, n = 2, length = 20L, start_words = NULL, from_url = FALSE) {
    tbl <- if (from_url) {
        digest_url(source, n)
    } else {
        digest_text(source, n)
    }
    generator <- make_ngram_generator(tbl, n)
    generator(start_words = start_words, length = length)
}

## Question 2
#### For this question, set `seed=2025`.
#### a) Test your model using a text file of [Grimm's Fairy Tales](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [None]:
set.seed(2025)
url <- "https://www.gutenberg.org/cache/epub/10662/pg10662.txt"

txtai <- gen_txt(
    source = url,
    n = 3,
    length = 15,
    start_words = c("the","king"),
    from_url = TRUE)

txtaii <-  gen_txt(
    source = url,
    n = 3,
    length = 15,
    start_words = NULL,
    from_url = TRUE)

print(txtai)
print(txtaii)

[1] "the king"
[1] "now to think it but did lie in my dreams and now truly i had"


#### b) Test your model using a text file of [Ancient Armour and Weapons in Europe](https://www.gutenberg.org/cache/epub/46342/pg46342.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

#### c) Explain in 1-2 sentences the difference in content generated from each source.

In [29]:
set.seed(2025)
urlb <- "https://www.gutenberg.org/cache/epub/46342/pg46342.txt"

txtbi <- gen_txt(
    source = urlb,
    n = 3,
    length = 15,
    start_words = c("the","king"),
    from_url = TRUE)

txtbii <-  gen_txt(
    source = urlb,
    n = 3,
    length = 15,
    start_words = NULL,
    from_url = TRUE)

print(txtbi)
print(txtbii)

[1] "the king he added to the entire exclusion of the swords were made prisoners the"
[1] "king was campaigning in france denmark germany switzerland and livonia figures 5 and the sword"


c) Grimm's Fairy Tales is narrative text, so the model draws on storytelling language, and a start such as "the king" can stop early when the table holds no continuation for the current word pair. Ancient Armour and Weapons in Europe is descriptive history, so its output is about swords, prisoners, and military campaigns, and it had enough continuations to reach the full 15 words.

## Question 3
#### a) What is a language learning model? 
#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?

a) A language learning model is a model trained to understand or generate human language by learning patterns in text. It can predict the next word or write longer passages of text, depending on the use case.

b) If the internet goes down, you can run a model locally by using pretrained model weights that were downloaded beforehand, together with a locally installed framework that can load and run them.

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
| Term | Meaning |
|------|---------|
| **Shell** | The program that interprets commands. Here the shell reads `mkdir project`, finds the `mkdir` program, and runs it. |
| **Terminal emulator** | The application that provides the window where commands are typed and output is displayed. It is where `mkdir project` is typed. |
| **Process** | A running instance of a program. The shell starts a `mkdir` process, which creates the directory and then exits. |
| **Signal** | A message sent to a process, for example to interrupt or stop it. `mkdir` usually finishes too quickly to need one, but a long-running process can be interrupted with Ctrl+C, which sends `SIGINT`. |
| **Standard input** | The default input stream of a process. `mkdir` does not read from it, but other commands such as `grep` can. |
| **Standard output** | The default stream where a program writes its output. `mkdir project` prints nothing on success; an error message (for example, if `project` already exists) goes to standard error, a separate stream that the terminal also displays. |
| **Command line argument** | Extra information passed to a program after its name. Here `project` is the argument that tells `mkdir` which directory to create. |
| **The environment** | The variables and settings available to programs. For example, the shell uses the `PATH` variable to locate the `mkdir` executable. |
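
Several of these terms can be observed from R. The sketch below assumes a Unix-like system where `mkdir` is an external program on `PATH`; the `target` directory under `tempdir()` is a hypothetical location chosen for the demo.

```r
# Hypothetical demo location; start from a clean state.
target <- file.path(tempdir(), "project")
unlink(target, recursive = TRUE)

# The environment: PATH is what lets the program name resolve to a file.
mkdir_path <- Sys.which("mkdir")
print(mkdir_path)

# Process + command line argument: start `mkdir` with one argument.
# With default settings system2() returns the exit status (0 = success).
status_first <- system2("mkdir", args = shQuote(target))

# Running it again fails because the directory exists. The error message
# is written to standard error, which is discarded here.
status_second <- system2("mkdir", args = shQuote(target),
                         stdout = FALSE, stderr = FALSE)

print(c(first = status_first, second = status_second))
```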


## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?
#### b) Explain what this command is doing, part by part.

a) The programs are `find`, which searches a directory tree for files; `xargs`, which passes its input as arguments to another command; and `grep`, which searches for text in files.

b) `find . -iname "*.R"` searches the current directory and its subdirectories for files whose names end in `.R`. The `-iname` option makes the match case-insensitive, so `.r` files are found too. The pipe `|` sends the list of file names to `xargs`, which passes them as arguments to `grep read_csv`, and `grep` then searches those files for the string `read_csv`.
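
The same search can be sketched in base R, which may help clarify what each stage contributes. The demo directory and file names are made up; `list.files()` plays the role of `find`, and the loop over files plays the role of `xargs grep`.

```r
# Build a small, hypothetical directory tree to search.
demo_dir <- file.path(tempdir(), "find_demo")
unlink(demo_dir, recursive = TRUE)
dir.create(file.path(demo_dir, "sub"), recursive = TRUE)
writeLines(c("library(readr)", "d <- read_csv('a.csv')"),
           file.path(demo_dir, "load.R"))
writeLines("x <- 1", file.path(demo_dir, "sub", "other.r"))
writeLines("read_csv mentioned in notes", file.path(demo_dir, "notes.txt"))

# Like `find . -iname "*.R"`: recursive, case-insensitive match on the name.
r_files <- list.files(demo_dir, pattern = "\\.R$", ignore.case = TRUE,
                      recursive = TRUE, full.names = TRUE)

# Like `xargs grep read_csv`: check each file for the literal string.
has_match <- vapply(
    r_files,
    function(f) any(grepl("read_csv", readLines(f), fixed = TRUE)),
    logical(1)
)
matching_files <- basename(r_files[has_match])
print(matching_files)
```

Note that `notes.txt` contains the string but is never searched, because the first stage only selects files ending in `.R` or `.r`.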

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions. 
#### a) Show the response when you run `docker run hello-world`.
#### b) Access Rstudio through a Docker container. Set your password and make sure your files show up on the Rstudio server. Type the command and the output you get below.
#### c) How do you log in to the RStudio server?

(base) MacBook-Pro-225:~ ericyao$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
198f93fd5094: Pull complete 
Digest: sha256:f7931603f70e13dbd844253370742c4fc4202d290c80442b2e68706d8f33ce26
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (arm64v8)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

(base) MacBook-Pro-225:~ ericyao$ docker run --platform linux/amd64 -it rocker/verse /bin/bash
Unable to find image 'rocker/verse:latest' locally
latest: Pulling from rocker/verse
2c9ba66d5dbe: Pull complete 
bcdf914130e3: Pull complete 
983a57e0f10d: Pull complete 
04c61279cc76: Pull complete 
53593fccee71: Pull complete 
255aa55589e3: Pull complete 
7da3fea5923e: Pull complete 
7f54ce591537: Pull complete 
3c7cdccc4be7: Pull complete 
7acb5d2ece3f: Pull complete 
fc14ca29bd0e: Pull complete 
4b3ffd8ccb52: Pull complete 
b615453605c4: Pull complete 
7bca23a8b40d: Pull complete 
999e4b8f7ed8: Pull complete 
b71e78fefbbb: Pull complete 
33aa1b89cc9c: Pull complete 
e82dc96b20d6: Pull complete 
a7519eda3916: Pull complete 
339259f92146: Pull complete 
2a63ed8b2250: Pull complete 
3deebd4cc2ea: Pull complete 
12b920580d3a: Pull complete 
Digest: sha256:96e1068eed2400e24c337a7ab53c7aab136970d92c1612bb3a1bb0c8972c7bf4
Status: Downloaded newer image for rocker/verse:latest
root@3595e7f6f751:/# docker run -it -p 8787:8787 rocker/verse
bash: docker: command not found
root@3595e7f6f751:/# exit
exit
(base) MacBook-Pro-225:~ ericyao$ docker run --platform linux/amd64 -it -p 8787:8787 rocker/verse
[s6-init] making user provided files available at /var/run/s6/etc...exited 0.
[s6-init] ensuring user provided files have correct perms...exited 0.
[fix-attrs.d] applying ownership & permissions fixes...
[fix-attrs.d] done.
[cont-init.d] executing container initialization scripts...
[cont-init.d] 01_set_env: executing... 
skipping /var/run/s6/container_environment/HOME
skipping /var/run/s6/container_environment/RSTUDIO_VERSION
[cont-init.d] 01_set_env: exited 0.
[cont-init.d] 02_userconf: executing... 


The password is set to Oungaewohngei9ei
If you want to set your own password, set the PASSWORD environment variable. e.g. run with:
docker run -e PASSWORD=<YOUR_PASS> -p 8787:8787 rocker/rstudio


[cont-init.d] 02_userconf: exited 0.
[cont-init.d] done.
[services.d] starting services
[services.d] done.
TTY detected. Printing informational message about logging configuration. Logging configuration loaded from '/etc/rstudio/logging.conf'. Logging to 'syslog'.

TTY detected. Printing informational message about logging configuration. Logging configuration loaded from '/etc/rstudio/logging.conf'. Logging to 'syslog'.
TTY detected. Printing informational message about logging configuration. Logging configuration loaded from '/etc/rstudio/logging.conf'. Logging to 'syslog'.




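c) Open `http://localhost:8787` in a browser (the port published with `-p 8787:8787`) and log in with the username `rstudio` and the password, which is either the automatically generated one printed in the container log or the one you set through the `PASSWORD` environment variable.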