# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

#### a) Make a function to tokenize the text.

In [1]:
install.packages("tokenizers")

Installing package into ‘/srv/rlibs’
(as ‘lib’ is unspecified)



In [2]:
library(httr)
library(tokenizers)

tokenize_text <- function(text) {
    tokenizers::tokenize_words(text, lowercase=TRUE, strip_punct=TRUE)[[1]]
}
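
To make the tokenizing step concrete, here is a rough base-R approximation of what it produces. This is only a sketch: `rough_tokenize` is a hypothetical helper, not part of `tokenizers`, and it does not reproduce the word-boundary rules that `tokenize_words` actually applies.

```r
# Hypothetical helper: lowercase, drop punctuation, split on whitespace.
# An approximation for illustration, not the tokenizers algorithm.
rough_tokenize <- function(text) {
    lowered <- tolower(text)
    no_punct <- gsub("[[:punct:]]+", " ", lowered)
    words <- strsplit(trimws(no_punct), "[[:space:]]+")[[1]]
    words[nzchar(words)]
}

tokens <- rough_tokenize("The King said: 'Ride on!'")
print(tokens)
```

The result is a plain character vector of lowercase words, which is the shape the later functions expect.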

#### b) Make a function to generate keys for ngrams.

In [3]:
key_from <- function(ngram, sep = "\x1f") {
    paste(ngram, collapse=sep)
}
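
A small sketch of why the separator matters, assuming the goal is to split a key back into the same words later (as `random_start` does). The hyphenated tokens are made up for illustration.

```r
sep <- "\x1f"  # ASCII unit separator, unlikely to occur inside a word
ngram <- c("the", "king")

# Joining and splitting with the same separator recovers the original words
key <- paste(ngram, collapse = sep)
recovered <- strsplit(key, sep, fixed = TRUE)[[1]]
print(identical(recovered, ngram))

# A separator that can occur inside a token breaks the round trip
risky_ngram <- c("well-known", "fact")
risky_key <- paste(risky_ngram, collapse = "-")
risky_parts <- strsplit(risky_key, "-", fixed = TRUE)[[1]]
print(length(risky_parts))  # three pieces instead of two
```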

#### c) Make a function to build an ngram table.

In [4]:
build_ngram_table <- function(tokens, n, sep = "\x1f") {
    # Assumes n >= 2. Too few tokens to form one n-gram: return an empty table
    if (length(tokens) < n) return(new.env(parent = emptyenv()))
    tbl <- new.env(parent = emptyenv())
    for (i in seq_len(length(tokens) - n + 1L)) {
        # The first n - 1 tokens are the context; the n-th is the word that follows
        ngram <- tokens[i:(i + n - 2L)]
        next_word <- tokens[i + n - 1L]
        key <- key_from(ngram, sep)
        # Named integer vector of follower counts for this context
        counts <- if (!is.null(tbl[[key]])) tbl[[key]] else integer(0)
        if (next_word %in% names(counts)) {
            counts[[next_word]] <- counts[[next_word]] + 1L
        } else {
            counts[[next_word]] <- 1L
        }
        tbl[[key]] <- counts
    }
    tbl
}
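
As a cross-check on what the table should contain, the same context-to-follower counts can be computed for a toy token vector with base R's `table()`. This is a sketch using a different technique from the environment-based function above; the token vector is made up.

```r
tokens <- c("the", "king", "said", "the", "king", "rode", "the", "king", "said")
n <- 3L

# Start index of every n-gram in the token vector
idx <- seq_len(length(tokens) - n + 1L)

# Context = first n - 1 words of each n-gram, follower = the n-th word
contexts <- vapply(
    idx,
    function(i) paste(tokens[i:(i + n - 2L)], collapse = " "),
    character(1)
)
followers <- tokens[idx + n - 1L]

counts <- table(contexts, followers)

# "the king" is followed by "said" twice and "rode" once
print(counts["the king", ])
```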

#### d) Function to digest the text.

In [5]:
digest_text <- function(text, n) {
    tokens <- tokenize_text(text)
    build_ngram_table(tokens, n)
}

#### e) Function to digest the url.

In [6]:
digest_url <- function(url, n) {
    res <- httr::GET(url)
    # Fail early on HTTP errors instead of digesting an error page
    httr::stop_for_status(res)
    txt <- httr::content(res, as = "text", encoding = "UTF-8")
    digest_text(txt, n)
}

#### f) Function that gives random start.

In [7]:
random_start <- function(tbl, sep = "\x1f") {
    keys <- ls(envir = tbl, all.names=TRUE)
    if (length(keys)==0) stop("No n-grams available. Digest text first.")
    picked <- sample(keys, 1)
    strsplit(picked, sep, fixed=TRUE)[[1]]
}

#### g) Function to predict the next word.

In [8]:
predict_next_word <- function(tbl, ngram, sep = "\x1f") {
    key <- paste(ngram, collapse = sep)
    counts <- if(!is.null(tbl[[key]])) tbl[[key]] else integer(0)
    if (length(counts) == 0) return(NA_character_)
    sample(names(counts), size=1, prob=as.numeric(counts))
}
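
The `prob` argument is what makes frequent followers more likely. A quick simulation, sketched with made-up counts, shows that the draws land close to the count proportions.

```r
set.seed(1)

# Made-up follower counts for a single context
counts <- c(said = 2L, rode = 1L)

# Draw many next words, weighting each candidate by its count
draws <- replicate(
    3000,
    sample(names(counts), size = 1, prob = as.numeric(counts))
)

# Empirical share of each word; "said" should be near 2/3
freq <- prop.table(table(draws))
print(round(freq, 2))
```

Because the candidates are character names rather than numbers, `sample()` also behaves sensibly when a context has only one follower.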

#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [9]:
make_ngram_generator <- function(tbl, n, sep = "\x1f") {
    force(tbl); n <- as.integer(n); force(sep)
    function(start_words = NULL, length = 10L) {
        if ((is.null(start_words)) || length(start_words) != n - 1L) {
            start_words <- random_start(tbl, sep=sep)
        }
        word_sequence <- start_words
        for (i in seq_len(max(0L, length - length(start_words)))) {
            ngram <- tail(word_sequence, n - 1L)
            next_word <- predict_next_word(tbl, ngram, sep=sep)
            if (is.na(next_word)) break
            word_sequence <- c(word_sequence, next_word)
        }
        paste(word_sequence, collapse= " ")
    }
}

## Question 2
#### For this question, set `seed=2025`.
#### a) Test your model using a text file of [Grimm's Fairy Tales](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [16]:
#question i
set.seed(2025)

url <- "https://www.gutenberg.org/cache/epub/2591/pg2591.txt"
tbl3 <- digest_url(url, n=3)
gen3 <- make_ngram_generator(tbl3, n=3)

print(gen3(start_words = c("the", "king"), length = 15))

[1] "the king has forbidden me to marry another husband am not i shall ride upon"


In [17]:
#question ii
set.seed(2025)

url <- "https://www.gutenberg.org/cache/epub/2591/pg2591.txt"
tbl3 <- digest_url(url, n=3)
gen3 <- make_ngram_generator(tbl3, n=3)

print(gen3(length=15))

[1] "spread the jam over it spread its wings and crying here comes our hobblety jib"


#### b) Test your model using a text file of [Ancient Armour and Weapons in Europe](https://www.gutenberg.org/cache/epub/46342/pg46342.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [18]:
#question i
set.seed(2025)

url <- "https://www.gutenberg.org/cache/epub/46342/pg46342.txt"
tbl3 <- digest_url(url, n=3)
gen3 <- make_ngram_generator(tbl3, n=3)

print(gen3(start_words = c("the", "king"), length = 15))

[1] "the king he added to the entire exclusion of the swords were made prisoners the"


In [19]:
#question ii
set.seed(2025)

url <- "https://www.gutenberg.org/cache/epub/46342/pg46342.txt"
tbl3 <- digest_url(url, n=3)
gen3 <- make_ngram_generator(tbl3, n=3)

print(gen3(length = 15))

[1] "lamentation de lemburn came forth completely armed after the fashion of this may be seen"


#### c) Explain in 1-2 sentences the difference in content generated from each source.

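Each generator can only produce word sequences that appear in its own training text, so the output mirrors the vocabulary and style of its source. The Grimm model produces fairy-tale narrative ("forbidden me to marry", "spread the jam"), while the Ancient Armour model produces historical and military language ("completely armed after the fashion", "the swords were made prisoners").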

## Question 3
#### a) What is a language learning model? 
A language model is a type of machine learning model that assigns probabilities to sequences of words. In other words, it is a probability distribution over words given the words that came before, which lets it both score and generate human language.
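
To illustrate "a probability distribution over words", here is a minimal sketch that estimates bigram probabilities from a made-up token vector and multiplies them to score a short sequence. It assumes each word depends only on the word before it, which is a simplification.

```r
tokens <- c("the", "king", "said", "the", "king", "rode", "the", "queen", "said")

# Pair every word with the word that follows it
prev <- head(tokens, -1)
nxt <- tail(tokens, -1)
bigram_counts <- table(prev, nxt)

# Normalise each row so it becomes P(next word | previous word)
cond_prob <- prop.table(bigram_counts, margin = 1)
print(cond_prob["the", "king"])  # "the" is followed by "king" 2 times out of 3

# Probability of "king said" following a starting "the":
# P(king | the) * P(said | king)
sequence <- c("the", "king", "said")
p <- prod(mapply(
    function(a, b) cond_prob[a, b],
    head(sequence, -1),
    tail(sequence, -1)
))
print(p)
```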
#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?
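Ollama can download and run language models locally on your computer. Its command-line interface resembles Docker's (for example, `ollama pull` and `ollama run`), but it is a separate tool and does not require Docker.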

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
| Term | Meaning |  
|------|---------|
| **Shell** | **The program that reads `mkdir project` from the command line, interprets it, and launches `mkdir`. It lets you interact with all the functionality of a system, forming a shell around the OS.** |
| **Terminal emulator** | **The window where the shell sits; it provides the user interface for typing `mkdir project`.** |
| **Process** | **A running instance of a program on your computer. When you type `mkdir project`, the shell launches a process to execute the command.** |
| **Signal** | **A message we can send to a process to tell it to do something. After you type `mkdir project`, a signal (for example, the interrupt sent by Ctrl+C) can be used to interrupt the process.** |
| **Standard input** | **A stream that is part of each process and reads characters in. `mkdir project` does not read from it.** |
| **Standard output** | **A stream that is part of each process and writes characters out. `mkdir project` prints nothing to it on success.** |
| **Command line argument** | **A value we pass to a process when we start it. In `mkdir project`, `project` is the argument that tells `mkdir` what name to give the new directory.** |
| **The environment** | **Everything a process can see while it's running, such as environment variables. When you run `mkdir project`, the process can look at its environment to see where it can create the directory.** |
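
The same ideas can be observed from R. The sketch below assumes a Unix-like system with `mkdir` on the `PATH`; it launches `mkdir` as a child process with one command line argument, checks the exit status, and reads one environment variable. The temporary directory names are made up.

```r
# Work inside a throwaway directory so nothing is left behind
workdir <- tempfile("mkdir_demo_")
dir.create(workdir)
target <- file.path(workdir, "project")

# Launch mkdir as a separate process; the path is its command line argument
status <- system2("mkdir", args = shQuote(target))

# Exit status 0 means the process finished successfully
print(status)
print(dir.exists(target))

# Part of the environment the process inherits: the program search path
path_entries <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
print(head(path_entries, 3))
```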

## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?
find, xargs, grep
#### b) Explain what this command is doing, part by part.
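`find` searches the current directory (`.`) and its subdirectories for files whose names end in `.R`; `-iname` makes the match case-insensitive, so `.r` also matches. The pipe (`|`) sends the list of file names to `xargs`, which passes them as arguments to `grep`, and `grep` then prints every line in those files that contains `read_csv`.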
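
For comparison, the pipeline can be approximated in R. This is a sketch, not an exact equivalent: it builds a small temporary directory with made-up files, then mirrors the `find` step with `list.files()` and the `grep` step with `grep()` on each file's lines.

```r
# Build a small demo directory with made-up files
demo_dir <- tempfile("find_demo_")
dir.create(demo_dir)
writeLines(c("library(readr)", "df <- read_csv('data.csv')"),
           file.path(demo_dir, "load.R"))
writeLines("x <- 1", file.path(demo_dir, "other.r"))
writeLines("read_csv mentioned in notes", file.path(demo_dir, "notes.txt"))

# Like `find . -iname "*.R"`: names ending in .R, ignoring case, searched recursively
r_files <- list.files(demo_dir, pattern = "\\.R$", ignore.case = TRUE,
                      recursive = TRUE, full.names = TRUE)

# Like `xargs grep read_csv`: collect matching lines from every file found
hits <- unlist(lapply(r_files, function(path) {
    lines <- readLines(path)
    matched <- grep("read_csv", lines, fixed = TRUE)
    if (length(matched) == 0) return(character(0))
    paste0(basename(path), ": ", lines[matched])
}))
print(hits)
```

Only `load.R` contributes a hit: `other.r` is searched but has no match, and `notes.txt` is never searched because its name does not end in `.R`.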

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions. 
#### a) Show the response when you run `docker run hello-world`.
#### b) Access Rstudio through a Docker container. Set your password and make sure your files show up on the Rstudio server. Type the command and the output you get below.
#### c) How do you log in to the RStudio server?

#### a) 

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

#### b)
Command:
`docker run -it -p 8787:8787 -v C:\Users\emily\Desktop\R:/home/rstudio/512 rocker/verse`

Output: 
The password is set to jo3oofohL9oophae. If you want to set your own password, set the PASSWORD environment variable. e.g. run with: docker run -e PASSWORD=<YOUR_PASS> -p 8787:8787 rocker/rstudio




#### c)
I navigated to http://localhost:8787/ and entered `rstudio` as the username and `jo3oofohL9oophae` as the password. I was then able to view my files on the RStudio server.