bedatadriven/schoolgids

Text Mining for Education Research Course

This repository contains presentations and analyses for an intensive six-week course in text mining with R, focused on a corpus of Dutch School Guides.

Corpus

A corpus of 2915 PDF schoolgidsen (school guides) from Dutch basisscholen (primary schools) was acquired by crawling the websites listed in DUO's adressen van alle schoolvestigingen in het basisonderwijs.

A custom scraper was written to locate links to PDFs on basisschool websites and to handle some of the peculiarities of websites in this sector.

Next, the pdftools package was used to extract text from each of the 2974 PDFs that downloaded successfully, and the results were compiled into a VCorpus object suitable for use with the tm package:

Note that these are .rds files, which can be read with the readRDS() function in R.
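As a rough illustration of the steps described above, the sketch below shows how text might be pulled from a PDF with pdftools, wrapped in a tm VCorpus, and round-tripped through an .rds file. The helper names `pdf_to_text` and `build_corpus` and the file name `example-corpus.rds` are assumptions made for this example; they are not taken from the course materials. The two helpers are only defined here, not called, so the runnable part needs nothing beyond base R.

```r
# Sketch only: helper names and file names below are hypothetical.

# Extract the text of one PDF as a single string.
# pdftools::pdf_text() returns one string per page.
pdf_to_text <- function(pdf_path) {
  paste(pdftools::pdf_text(pdf_path), collapse = "\n")
}

# Wrap a character vector of documents in a tm VCorpus.
build_corpus <- function(texts) {
  tm::VCorpus(tm::VectorSource(texts))
}

# Round trip through an .rds file, using a plain character vector
# as a stand-in for the real corpus object.
texts <- c(guide_a = "Welkom op onze school.",
           guide_b = "Schoolgids 2016-2017")
path <- file.path(tempdir(), "example-corpus.rds")
saveRDS(texts, path)
restored <- readRDS(path)
```

readRDS() restores exactly the object that saveRDS() wrote, so a VCorpus saved this way comes back as a VCorpus, provided the tm package is loaded when it is used.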

Week 1

Topics covered:

  • Introduction to R

  • Regular Expressions

Sample analyses:

  • Extracting school year from School Guide URL
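A minimal sketch of what extracting a school year from a URL could look like with base R regular expressions. The URLs and the function name `extract_school_year` are invented for illustration, and the pattern assumes years are written like `2016-2017` or `2015_16`; the course's own analysis may use a different approach.

```r
# Hypothetical helper: pull a school year such as "2016-2017" out of each URL.
extract_school_year <- function(urls) {
  # A four-digit year starting with 20, a separator, then a two- or
  # four-digit year.
  pattern <- "20[0-9]{2}[-_](20)?[0-9]{2}"
  hits <- regexpr(pattern, urls)
  years <- rep(NA_character_, length(urls))
  # regmatches() returns only the matches, so assign them to matching rows.
  years[hits != -1] <- regmatches(urls, hits)
  years
}

# Made-up example URLs.
urls <- c(
  "http://www.example.nl/docs/schoolgids-2016-2017.pdf",
  "http://www.example.nl/schoolgids_2015_16.pdf",
  "http://www.example.nl/schoolgids.pdf"
)
school_years <- extract_school_year(urls)
```

URLs without a recognizable year yield `NA`, which keeps the result the same length as the input.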

Week 2

Topics covered:

  • Term Document Matrices

  • Writing Functions in R
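To make the two topics above concrete, here is a small sketch that writes a function to build a term-document matrix by hand in base R. It is a simplified stand-in, not the course's code: in practice the tm package's `TermDocumentMatrix()` would normally be used on the corpus. The function name `term_document_matrix` and the sample sentences are assumptions.

```r
# Hypothetical helper: count how often each term occurs in each document.
term_document_matrix <- function(docs) {
  # Lowercase each document and split on anything that is not a letter.
  tokens <- lapply(docs, function(d) {
    words <- strsplit(tolower(d), "[^[:alpha:]]+")[[1]]
    words[nzchar(words)]
  })
  terms <- sort(unique(unlist(tokens)))
  # One column of counts per document, with rows in the order of `terms`.
  counts <- lapply(tokens, function(tk) {
    as.integer(table(factor(tk, levels = terms)))
  })
  matrix(unlist(counts), nrow = length(terms),
         dimnames = list(Terms = terms, Docs = names(docs)))
}

# Made-up example documents.
docs <- c(a = "De school is open", b = "De school en de ouders")
tdm <- term_document_matrix(docs)
```

Rows are terms and columns are documents, so `tdm["de", "b"]` gives the number of times "de" appears in document `b`.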

Sample analyses:

Week 3

Topics covered:

  • Tokenizing by n-gram

  • SVM, cluster analysis

  • Data visualization with ggplot
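Tokenizing by n-gram can be sketched in a few lines of base R, as below. The function name `ngram_tokenize` and the sample sentence are made up for this example; the course may well rely on a package tokenizer instead.

```r
# Hypothetical helper: split a text into overlapping word n-grams.
ngram_tokenize <- function(text, n = 2) {
  words <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  words <- words[nzchar(words)]
  # Too few words to form even one n-gram.
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- ngram_tokenize("De school is open", n = 2)
trigrams <- ngram_tokenize("De school is open", n = 3)
```

A function like this could be passed as a custom tokenizer when building a term-document matrix, so that the rows become n-grams rather than single words.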

Sample analyses:

Week 4

Topics covered:

Sample analyses:

Week 5

Topics covered:

Sample analyses:

Week 6
