## polmineR

This Jupyter notebook works through several examples of using the [polmineR](https://github.com/PolMine/polmineR) package. The examples are taken from the section of the [README documentation](https://github.com/PolMine/polmineR#core-functions) that covers the core functions of [polmineR](https://github.com/PolMine/polmineR).

To run individual cells in this notebook, you can go to the cell and press SHIFT+ENTER. You can make changes as you like and run the cells again interactively.

To run the whole notebook, open the Kernel menu in the menu bar and select "Restart & Run All". This restarts the R kernel and runs all the cells in the notebook.

You can learn more about Jupyter notebooks at https://jupyter.org.

In [None]:
library(polmineR)
use("polmineR")
use("europarl")

In [None]:
corpus()

### partition (and partition_bundle)

All methods can be applied to a whole corpus, as well as to partitions (i.e. subcorpora). Use the metadata of a corpus (so-called s-attributes) to define a subcorpus.

In [None]:
# The object name matches the year selected via the s-attribute text_year
ep2006 <- partition("EUROPARL-EN", text_year = "2006")
size(ep2006)

In [None]:
barroso <- partition("EUROPARL-EN", speaker_name = "Barroso", regex = TRUE)
size(barroso)

Partitions can be bundled into partition_bundle objects, and most methods can be applied to a whole corpus, a partition, or a partition_bundle object alike. Consult the package vignette to learn more.
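To make the idea of subcorpora concrete, here is a minimal base R sketch that does not use polmineR at all. It assumes a hypothetical token table `toy` in which each row is one token annotated with two metadata columns; the data, the column values and the object names are invented for illustration. Selecting rows by metadata mimics a partition, and splitting by a metadata column mimics a bundle of partitions.

```r
# Hypothetical token-level data: one row per token, plus metadata columns
toy <- data.frame(
  word = c("we", "must", "act", "I", "agree", "thank", "you"),
  text_year = c("2005", "2005", "2005", "2006", "2006", "2006", "2006"),
  speaker_name = c("Barroso", "Barroso", "Schulz", "Schulz",
                   "Mr Barroso", "Mr Barroso", "Schulz"),
  stringsAsFactors = FALSE
)

# "Partition" by exact metadata value
year_2006 <- toy[toy$text_year == "2006", ]

# "Partition" by regular expression on a metadata value
barroso_toy <- toy[grepl("Barroso", toy$speaker_name), ]

# "Bundle": one subcorpus per year, with the size of each
by_year <- split(toy, toy$text_year)
sizes <- sapply(by_year, nrow)

print(nrow(year_2006))
print(nrow(barroso_toy))
print(sizes)
```

The real package works on indexed corpora rather than data frames, so this only illustrates the logic of selecting by metadata.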

### count (using CQP syntax)

Counting occurrences of a feature in a corpus, a partition, or the partitions of a partition_bundle is a basic operation. By offering access to the Corpus Query Processor (CQP), the polmineR package exposes a query syntax that goes far beyond regular expressions. See the [CQP documentation](http://www.ims.uni-stuttgart.de/forschung/projekte/CorpusWorkbench/CQPTutorial/cqp-tutorial.2up.pdf) to learn more.

In [None]:
count("EUROPARL-EN", "France")

In [None]:
count("EUROPARL-EN", c("France", "Germany", "Britain", "Spain", "Italy", "Denmark", "Poland"))

In [None]:
count("EUROPARL-EN", '"[pP]opulism"')
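As a rough illustration of what the last query matches, the following base R sketch counts tokens in a hypothetical token vector. It relies on the fact that CQP applies a regular expression to the whole token, which is emulated here with the anchors `^` and `$`. The token vector is invented, and nothing in this sketch calls polmineR.

```r
# Hypothetical token stream standing in for a corpus
tokens <- c("Populism", "is", "rising", ";", "populism", "and",
            "populists", "differ")

# Anchor the pattern so that it has to match the entire token
pattern <- "^[pP]opulism$"
hits <- tokens[grepl(pattern, tokens)]
n_hits <- length(hits)

print(hits)
print(n_hits)
```

Without the anchors, a plain `grepl("[pP]opulism", tokens)` would match any token that merely contains the pattern.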

### dispersion (across one or two dimensions)

The dispersion method analyses how the matches for a query, or a set of queries, are distributed across one or two dimensions, i.e. s-attributes such as `text_year`. It reports absolute and relative frequencies. The CQP syntax can be used.

In [None]:
populism <- dispersion("EUROPARL-EN", "populism", s_attribute = "text_year", progress = FALSE)
pop_regex <- dispersion("EUROPARL-EN", '"[pP]opulism"', s_attribute = "text_year", cqp = TRUE, progress = FALSE)

In [None]:
populism

In [None]:
pop_regex
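The relationship between absolute and relative frequencies can be sketched in base R. The example assumes a hypothetical token table with a `text_year` column; the data and the column names of the result (`count`, `size`, `freq`) are chosen for illustration and are not claimed to match the output of `dispersion()`.

```r
# Hypothetical token-level data with the year of the text each token belongs to
toy <- data.frame(
  word = c("populism", "debate", "populism", "vote",
           "budget", "populism", "debate", "vote"),
  text_year = c("2005", "2005", "2006", "2006",
                "2006", "2006", "2007", "2007"),
  stringsAsFactors = FALSE
)

is_hit <- toy$word == "populism"

# Absolute frequency: number of matches per year
abs_freq <- tapply(is_hit, toy$text_year, sum)
# Subcorpus size: number of tokens per year
sizes <- tapply(is_hit, toy$text_year, length)

dispersion_toy <- data.frame(
  text_year = names(abs_freq),
  count = as.integer(abs_freq),
  size = as.integer(sizes),
  freq = as.numeric(abs_freq / sizes),  # relative frequency
  stringsAsFactors = FALSE
)
print(dispersion_toy)
```

Relative frequencies matter because subcorpora usually differ in size, so raw counts alone can mislead.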

### cooccurrences (to analyse collocations)

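The cooccurrences method analyses the words that occur in the context of a query, here a window of 10 tokens to the left and right, and computes statistics for them (such as the log-likelihood ranking `rank_ll` used below).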

In [None]:
islam <- cooccurrences("EUROPARL-EN", query = 'Islam', left = 10, right = 10)
islam <- subset(islam, rank_ll <= 100)
dotplot(islam)
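The counting step behind a cooccurrence analysis can be sketched in base R as follows. This is a simplified illustration on an invented token vector: it only tallies the tokens found in a window around each match and leaves out the statistical tests that the cooccurrences method adds. The names `left` and `right` mirror the arguments used above; everything else is hypothetical.

```r
# Hypothetical token stream
tokens <- c("radical", "Islam", "and", "political", "Islam",
            "in", "Europe", "and", "radical", "views")
node <- "Islam"
left <- 2   # tokens to the left of each match
right <- 2  # tokens to the right of each match

positions <- which(tokens == node)

# Collect the window positions around each match, dropping the match itself
# and any position that falls outside the token stream
context_idx <- unlist(lapply(positions, function(i) {
  idx <- (i - left):(i + right)
  idx[idx >= 1 & idx <= length(tokens) & idx != i]
}))

context_counts <- sort(table(tokens[context_idx]), decreasing = TRUE)
print(context_counts)
```

Note that overlapping windows count a token once per window in this sketch, which a real implementation may handle differently.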


### features (keyword extraction)

Compare partitions to identify features, i.e. keywords, using statistical tests such as the chi-squared test.

In [None]:
library(magrittr) # provides the pipe operator %>%, in case it is not attached yet
ep_2002 <- partition("EUROPARL-EN", text_year = "2002", p_attribute = "word")
ep_pre_2002 <- partition("EUROPARL-EN", text_year = 1997:2001, p_attribute = "word")
features(ep_2002, ep_pre_2002, included = FALSE) %>%
  subset(rank_chisquare <= 10) %>%
  format() %>%
  knitr::kable(format = "markdown")
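For readers who want to see the statistic itself, the sketch below computes a textbook Pearson chi-squared value for a single word from a 2x2 table. All counts are invented, and the code is an assumption-laden illustration of the general technique; the exact computation in `features()` may differ in its details.

```r
# Hypothetical counts for one word
count_coi <- 30    # occurrences in the corpus of interest
size_coi <- 1000   # total tokens in the corpus of interest
count_ref <- 20    # occurrences in the reference corpus
size_ref <- 2000   # total tokens in the reference corpus

observed <- matrix(
  c(count_coi, size_coi - count_coi,
    count_ref, size_ref - count_ref),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("coi", "ref"), c("word", "other"))
)

# Expected counts if the word were equally frequent in both corpora
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic
chisquare <- sum((observed - expected)^2 / expected)

# Cross-check with base R, without Yates' continuity correction
check <- chisq.test(observed, correct = FALSE)

print(chisquare)
print(unname(check$statistic))
```

A large value indicates that the word is distributed very differently across the two corpora, which is what makes it a keyword candidate.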


### kwic (also known as concordances)

What happens in the context of a word or a CQP query? Counts alone rarely settle this: valid research results often require reading the matches in their context.

In [None]:
kwic("EUROPARL-EN", "Islam", meta = c("text_date", "speaker_name")) %>%
  as.data.frame() %>%
  .[1:8,] %>%
  knitr::kable(format = "markdown", escape = FALSE)
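The structure of a concordance table can be sketched in base R. The example below builds one row per match with its left context, the matching token, and its right context. The token vector and the `span` parameter are hypothetical, and the sketch omits the metadata columns requested via `meta` above.

```r
# Hypothetical token stream
tokens <- c("the", "debate", "on", "Islam", "in", "Europe",
            "continues", "while", "Islam", "is", "discussed")
node <- "Islam"
span <- 2  # number of context tokens on each side

positions <- which(tokens == node)

kwic_rows <- lapply(positions, function(i) {
  # Guard against matches at the very start or end of the token stream
  left_idx <- if (i > 1) max(1, i - span):(i - 1) else integer(0)
  right_idx <- if (i < length(tokens)) {
    (i + 1):min(length(tokens), i + span)
  } else {
    integer(0)
  }
  data.frame(
    left = paste(tokens[left_idx], collapse = " "),
    node = tokens[i],
    right = paste(tokens[right_idx], collapse = " "),
    stringsAsFactors = FALSE
  )
})
kwic_toy <- do.call(rbind, kwic_rows)
print(kwic_toy)
```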


### as.TermDocumentMatrix (for text mining purposes)

Many advanced methods in text mining require term document matrices as input. Based on the metadata of a corpus, these data structures can be obtained in a fast and flexible manner, for performing topic modelling, machine learning etc.

In [None]:
speakers <- partition_bundle(
  "EUROPARL-EN", s_attribute = "speaker_id",
  progress = FALSE, verbose = FALSE
)
speakers_count <- count(speakers, p_attribute = "word", progress = TRUE)
tdm <- as.TermDocumentMatrix(speakers_count, col = "count")
dim(tdm)
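To show what a term document matrix contains, here is a minimal base R sketch on invented data. Each row is a term, each column a document (here: a speaker), and each cell a count. It uses a dense `table` for readability, which is an assumption made for this toy example; matrices for real corpora are typically stored in a sparse format.

```r
# Hypothetical token-level data: which speaker used which word
toy <- data.frame(
  speaker_id = c("A", "A", "A", "B", "B", "C"),
  word = c("budget", "vote", "budget", "vote", "treaty", "budget"),
  stringsAsFactors = FALSE
)

# Cross-tabulate terms (rows) against documents (columns)
tdm_toy <- table(term = toy$word, document = toy$speaker_id)

print(tdm_toy)
print(dim(tdm_toy))
```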


### read (the full text)

Corpus analysis involves moving from text to numbers, and back again. Use the read method to inspect the full text of a partition (in this case, a speech given by Chancellor Angela Merkel).

In [None]:
library(GermaParl)
use("GermaParl")
if (!"GERMAPARL" %in% corpus()$corpus){
  GermaParl::germaparl_download_corpus()
  use("GermaParl")
}

In [None]:
merkel <- partition("GERMAPARL", speaker = "Angela Merkel", date = "2013-09-03")
read(merkel)