# Script Notebook
This notebook contains utility scripts for the website.

## UMDrive migration
The following cells rewrite links from UMDrive to blogs.memphis.edu. 
Some reorganization of directories is included.

Let's get the file names and text of all posts we need to relink.

In [2]:
let fileTextTuples = 
    System.IO.Directory.GetFiles("_posts")
    |> Seq.map( fun filePath -> filePath, System.IO.File.ReadAllText( filePath ) )

//(https://umdrive.memphis.edu/aolney/public/website-media/1554866357.jpg)
let umRegex = System.Text.RegularExpressions.Regex(@"umdrive\.memphis\.edu/aolney/public/([^/]+)/([^/]+)")

Let's look at the original directory structure

In [22]:
let directoryCategories =
    fileTextTuples
    |> Seq.choose( fun (_,text) -> 
        let matches = umRegex.Matches(text)
        if matches.Count = 0 then None
        else
            matches |> Seq.map( fun m -> m.Groups.[1].Value,m.Groups.[2].Value ) |> Some
        )
    |> Seq.collect id
    |> Seq.toArray

directoryCategories |> Array.map fst |> Array.distinct

[|"publications"; "website-media"; "press"; "Teaching"; "projects"; "resume";
  "photos"|]

Some of these directories are nested, so check for subdirectories.

In [24]:
directoryCategories 
|> Array.map snd 
|> Array.filter( fun t -> t.Contains(".") |> not ) //assumes directory names do not have "." 
|> Array.distinct

[|"hubo_files"; "bass"; "pkd"; "harness2018"; "autotutor"|]

These are our special cases for merging. Everything else is straight renaming.

New structure:

- website-media (contains press, photos, projects, resume)
- publications (same)
- teaching (same, lowercase)

Special cases:

- press/hubo_files go to website-media with unique names
- projects/bass go to website-media
- projects/pkd go to website-media
- projects/autotutor go to website-media
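The special cases need to be rewritten before the general renames, because a general rename would otherwise consume the prefix the special case depends on. A minimal sketch of that ordering effect, assuming a two-rule table and a made-up file name (`robot.jpg`); `rules` and `applyRules` are hypothetical names used only for illustration:

```fsharp
// Hypothetical rule table: (old path fragment, new path fragment).
// The special case is listed before the general rename.
let rules =
    [ "press/hubo_files", "website-media"
      "public/press", "public/website-media" ]

// Apply every rule in order, threading the text through each replacement.
let applyRules (rules: (string * string) list) (text: string) =
    rules
    |> List.fold (fun (acc: string) (query: string, replacement: string) ->
        acc.Replace(query, replacement)) text

// Made-up sample link, for illustration only.
let sample = "umdrive.memphis.edu/aolney/public/press/hubo_files/robot.jpg"

// Special case first: the nested folder is merged away.
let specialFirst = applyRules rules sample

// General case first: "hubo_files" is left stranded in the path.
let generalFirst = applyRules (List.rev rules) sample
```

With the special case first, the sample ends in `public/website-media/robot.jpg`; with the order reversed, it ends in `public/website-media/hubo_files/robot.jpg`, which is not part of the new structure.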

**The new structure was accomplished by manually copying files from UMDrive to blogs.memphis.edu with the structure above. Post-internal links were then rewritten in the code below.**

### UPDATES: 

**It turns out that blogs.memphis.edu destroys all folder structure, so we remap to the root folder in the code block below. It also looks like blogs.memphis.edu renames files with whitespace, replacing the whitespace with hyphens.**

In [9]:
let Replace (query:string) replacement (input:string) =
    input.Replace(query,replacement)

let urlRegex =  System.Text.RegularExpressions.Regex("""(https://blogs.memphis.edu/[^\)"]+)""")

System.IO.Directory.CreateDirectory("_posts2") |> ignore
fileTextTuples
|> Seq.map( fun (fileName,text) -> 
    let newText = 
        text
        //special cases first
        |> Replace "press/hubo_files" "website-media"
        |> Replace "projects/bass" "website-media"
        |> Replace "projects/pkd" "website-media"
        |> Replace "projects/autotutor" "website-media"
        //general cases
        |> Replace "public/Teaching" "public/teaching"
        |> Replace "public/press" "public/website-media"
        |> Replace "public/photos" "public/website-media"
        |> Replace "public/projects" "public/website-media"
        |> Replace "public/resume" "public/website-media"
        //because blogs.memphis.edu destroys folders
        |> Replace "public/teaching" "public"
        |> Replace "public/website-media/harness2018" "public"
        |> Replace "public/website-media" "public"
        |> Replace "public/publications" "public"
        //domain prefix
        |> Replace "umdrive.memphis.edu/aolney/public/" "blogs.memphis.edu/aolney/files/2019/10/"
        
    fileName,newText
)
//because blogs.memphis.edu disallows whitespace in filenames
|> Seq.map( fun (fileName,text) ->
    let mutable newText = text
    let matches = urlRegex.Matches(text)
    for m in matches do
        let url = m.Groups.[1].Value
        let newUrl = url.Trim().Replace(" ","-").Replace("%20","-").Replace("%2C","-").Replace("%27","-").Replace("--","-").Replace("--","-")
        //accumulate on newText so earlier URL fixes are not overwritten
        newText <- newText.Replace(url,newUrl)
    fileName,newText
)
|> Seq.iter( fun (fileName,text) -> 
    let fileName2 = fileName.Replace("_posts","_posts2")
    System.IO.File.WriteAllText( fileName2, text )
)
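The whitespace cleanup can also be written without mutation by folding over the matches, so that each replacement builds on the result of the previous one. This is only a sketch under the same assumptions as the cell above (same URL pattern, same substitutions); `blogUrlRegex`, `normalizeUrl`, and `normalizeAllUrls` are hypothetical names, not part of the original notebook:

```fsharp
open System.Text.RegularExpressions

// Same URL pattern as the cell above: a URL ends at ')' or '"'.
let blogUrlRegex = Regex("""(https://blogs.memphis.edu/[^\)"]+)""")

// Replace spaces and the encoded characters %20, %2C, %27 with hyphens,
// then collapse doubled hyphens (two passes, as in the cell above).
let normalizeUrl (url: string) =
    [ " "; "%20"; "%2C"; "%27" ]
    |> List.fold (fun (acc: string) (token: string) -> acc.Replace(token, "-")) (url.Trim())
    |> fun (s: string) -> s.Replace("--", "-").Replace("--", "-")

// Fold over every matched URL, carrying the partially rewritten text along.
let normalizeAllUrls (text: string) =
    blogUrlRegex.Matches(text)
    |> Seq.cast<Match>
    |> Seq.fold (fun (acc: string) (m: Match) ->
        let url = m.Groups.[1].Value
        acc.Replace(url, normalizeUrl url)) text
```

Applied to a post containing several links, every link is rewritten, not only the last one matched.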

Check the links that we've rewritten

In [None]:
let urlRegex =  System.Text.RegularExpressions.Regex("""(https://blogs.memphis.edu/[^\)"]+)""")

let postsPath = "_posts2"

let checkURL (url:string) = 
    try
        let req = System.Net.WebRequest.Create(url) :?> System.Net.HttpWebRequest
        req.Method <- "HEAD"
        req.UserAgent <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
        let resp = req.GetResponse() :?> System.Net.HttpWebResponse
        resp.Close()
        url, resp.StatusCode |> int
    with
    // On failure, record the WebExceptionStatus enum value (not an HTTP status code)
    | :? System.Net.WebException as e -> url, e.Status |> int
        
    
let urlCheckTuples = 
    System.IO.Directory.GetFiles( postsPath )
    |> Seq.map( fun filePath -> filePath, System.IO.File.ReadAllText( filePath ) )
    |> Seq.collect( fun (filePath,text) -> 
        let matches = urlRegex.Matches(text)
        matches 
        |> Seq.map( fun m -> m.Groups.[1].Value )
        |> Seq.map checkURL
        //|> Seq.map ( fun t -> 1, t )
    )

System.IO.File.WriteAllLines( "checkedURLs.txt", urlCheckTuples |> Seq.map( fun (u,r) -> u.ToString() + "\t" + r.ToString()  ))
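To review the results, it may help to list only the entries that did not come back as 200. A minimal sketch, assuming the tab-separated `url<TAB>code` layout written above; `failedChecks` is a hypothetical helper, and lines that do not split into exactly two fields are silently skipped:

```fsharp
// Hypothetical helper: keep (url, code) pairs whose recorded code is not "200".
// Failed requests carry a WebExceptionStatus value rather than an HTTP code,
// so anything other than "200" is treated as needing a manual look.
let failedChecks (lines: seq<string>) =
    lines
    |> Seq.choose (fun line ->
        match line.Split('\t') with
        | [| url; code |] when code <> "200" -> Some (url, code)
        | _ -> None)
    |> Seq.toList
```

Against the real output this could be called as `System.IO.File.ReadLines "checkedURLs.txt" |> failedChecks`.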

All check out, but a few needed to be manually corrected:

- 68950141.pdf appears to be snaider_behavior_net.pdf 
- journal.pone.0130293.pdf is journal.pone_.0130293.pdf 