# Analysis of the first round of the 2016 Bulgarian presidential elections

# Part I - Web scraping the data

## Objective

Bulgaria held its presidential elections on November 6, 2016. Given the large number of Bulgarians who emigrated after the fall of communism in the early 1990s, I was interested in examining the voting patterns of Bulgarians living abroad. For this purpose, I scraped information from the website of the Bulgarian Election Committee, pre-processed the data for analysis, and then created several plots and tables that illustrate some interesting patterns. [ELABORATE FINDINGS HERE] 

## Some background

While Bulgaria is a multi-party parliamentary democracy and the president is independent of the three branches of government, the president plays a fairly significant role in international affairs by acting as commander-in-chief of the army and signing international treaties. There were 21 presidential candidates in the latest election. A majority vote (over 50%) is required to elect a president in the first round. If no candidate reaches this threshold, a second round is held between the two candidates who collected the most votes in the first round. As it happened, none of the 21 candidates won more than 50% of the vote and a second round was held. This analysis focuses on the first round because it offers a broader representation of political parties and independent candidates. 

Among the 21 presidential candidates, only 6 gathered more than 4% of the vote. To focus this analysis, I only collected data on the number of votes for these six candidates:

* Traicho Traikov (TT) - a pro-Western, anti-corruption technocrat 
* Rumen Radev (RR) - a pro-Russian air force general
* Cecka Caceva (CC) - the status quo candidate, balancing between the West and Russia
* Krassimir Karakachanov (KK) - a pro-Russian far-right nationalist
* Vesselin Mareshki (VM) - a local oligarch and independent candidate
* Plamen Oresharski (PO) - representing the Turkish minority party and widely perceived as just another candidate of the oligarchy

During these elections, voters could also check a box indicating that they did not support any of the candidates, a so-called protest vote. Therefore, I also collected information on the number of protest votes abroad. 

Let's install the 'rvest' package first as we will use it to scrape the data.

In [9]:
install.packages("rvest", repos="https://cran.r-project.org")
# library() stops with an error if the package is missing; require() only warns
library(rvest)


### Step 1. Scrape all links to the voting sections abroad

The following page (https://results.cik.bg/pvrnr2016/tur1/protokoli_pr/32/index.html) contains links to the results of each voting section outside of Bulgaria. For example, the section results for Brisbane, Australia are stored in https://results.cik.bg/pvrnr2016/tur1/protokoli_pr/32/320100003.html. Note that "320100003" identifies the section link and is the unique id for each section. We can scrape each section link from the main page and store it in a vector, after which we can loop through each of these links and collect the data of interest.

In [23]:
setwd("C:/Users/grozeve/Documents/_9_Misc/DataScience/My projects/bg_election")

# Read in source
cik_url_main <- read_html("https://results.cik.bg/pvrnr2016/tur1/protokoli_pr/32/index.html") 

# Grab all links
hrefs_1 <- cik_url_main %>%
  html_nodes("a") %>%
  html_attr("href")

# Keep only the links to section pages - they start from ./320100003.html
endpos <- length(hrefs_1)
startpos <- which(hrefs_1=="./320100003.html")
hrefs_2 <- hrefs_1[startpos:endpos]

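# Trim the leading "./" from each link (the dot is escaped so that it
# matches a literal "." rather than any character)
hrefs_3 <- sub("^\\./", "", hrefs_2)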

# Create individual page urls
cik_url_section <- paste0("https://results.cik.bg/pvrnr2016/tur1/protokoli_pr/32/", hrefs_3)

# Create an id for each section by dropping the trailing ".html"
cik_section_id <- sub("\\.html$", "", hrefs_3)

# Display number of links to individual voting sections outside Bulgaria
length(hrefs_3)
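
One detail of the link clean-up is worth a small sketch: in a regular expression an unescaped `.` matches any character, so patterns such as `./` or `.html` can match more than intended. The example below uses made-up hrefs (they are not taken from the site) to compare an unescaped pattern with an escaped, anchored one.

```r
# Hypothetical hrefs, for illustration only
hrefs <- c("./320100003.html", "./a/b.html", "./xhtml.html")

# Unescaped: "." matches any character, so "a/" and "xhtml" are also hit
loose_trim <- gsub("./", "", hrefs)
loose_id   <- gsub(".html", "", hrefs)

# Escaped and anchored: only a leading "./" and a trailing ".html" are removed
strict_trim <- sub("^\\./", "", hrefs)
strict_id   <- sub("\\.html$", "", strict_trim)

print(loose_trim)
print(loose_id)
print(strict_id)
```

For the section links on this page both variants should give the same result, since each link contains a single `/` and a single `.html`; the escaped form simply does not rely on that.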

### Step 2. Define some useful functions for scraping the data of interest

We will create functions to extract numerical values and characters from each section link.

Note that although the results for each section are presented in several tables, and we could have read them all with readHTMLTable from the XML package, we only need a handful of values from each page. 

In [24]:
# Extract numeric values (here, the number of ballots of interest).
# Note: despite its name, myurl is a page already parsed by read_html().
extract_num <- function(mynode, myurl){
  myurl %>%
    html_nodes(mynode) %>% 
    html_text() %>%
    as.numeric()
}

# Extract strings
extract_char <- function(mynode, myurl){
  myurl %>%
    html_nodes(mynode) %>% 
    html_text()
}
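
If a selector matches nothing on some page, `html_nodes` returns an empty set and `extract_num` returns a zero-length vector, which cannot be assigned to a single vector element. A minimal sketch of a more defensive variant follows; `extract_num_safe` is a hypothetical helper (not part of the original notebook), and the inline HTML is made up for the demonstration.

```r
library(rvest)

# Hypothetical helper: like extract_num, but returns NA when nothing matches
extract_num_safe <- function(mynode, doc) {
  txt <- doc %>% html_nodes(mynode) %>% html_text()
  if (length(txt) == 0) return(NA_real_)
  as.numeric(txt[1])
}

# Made-up page fragment parsed from a string instead of a URL
doc <- read_html("<div id='main'><p class='n'>42</p></div>")

found   <- extract_num_safe("p.n", doc)        # one matching node
missing <- extract_num_safe("p.missing", doc)  # no match, so NA
print(c(found, missing))
```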

Below we define the paths to each object of interest:
* description of each section containing the location's country and city 
* number of valid ballots
* number of protest ballots (not supporting any candidate)
* number of ballots cast for each of the top 6 candidates

To obtain a path, right-click the object in Chrome and select Inspect; then, in the Elements pane, right-click the highlighted element and select Copy -> Copy selector. 

In [25]:
node_title <- c("#main > h3:nth-child(6)")
node_valid <- c("#main > table:nth-child(18) > tbody > tr:nth-child(3) > td:nth-child(2)")
node_protest <- c("#main > table:nth-child(18) > tbody > tr:nth-child(5) > td:nth-child(2)")
node_tt <- c("#main > table:nth-child(20) > tbody > tr:nth-child(6) > td:nth-child(3)")
node_rr <- c("#main > table:nth-child(20) > tbody > tr:nth-child(11) > td:nth-child(3)")
node_cc <- c("#main > table:nth-child(20) > tbody > tr:nth-child(15) > td:nth-child(3)")
node_kk <- c("#main > table:nth-child(20) > tbody > tr:nth-child(17) > td:nth-child(3)")
node_vm <- c("#main > table:nth-child(20) > tbody > tr:nth-child(1) > td:nth-child(3)")
node_po <- c("#main > table:nth-child(20) > tbody > tr:nth-child(3) > td:nth-child(3)")
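
To make the selector syntax concrete, here is a small sketch on a made-up page fragment (the real pages contain more tables and rows). `:nth-child(k)` picks an element that is the k-th child of its parent, and `>` restricts the match to direct children.

```r
library(rvest)

# Hypothetical fragment, for illustration only
page <- read_html('
<div id="main">
  <h3>Section title</h3>
  <table>
    <tbody>
      <tr><td>Candidate A</td><td>12</td></tr>
      <tr><td>Candidate B</td><td>7</td></tr>
    </tbody>
  </table>
</div>')

# The table is the 2nd child of #main; pick row 2, column 2
sel <- "#main > table:nth-child(2) > tbody > tr:nth-child(2) > td:nth-child(2)"
votes <- page %>% html_nodes(sel) %>% html_text() %>% as.numeric()
print(votes)
```

Selectors copied from the browser reflect the page as the browser built it, so it is worth confirming on one section page that each path returns the expected number before running the full loop.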

Next, we will extract the country and city from the title of each section. To develop the extraction we first scrape a sample page. Because the title is in Cyrillic, we also mark the string as UTF-8.

The two functions below extract the country and city strings.

In [26]:
cik_url_sample <- read_html("https://results.cik.bg/pvrnr2016/tur1/protokoli_pr/32/320300014.html")

mytitle <- extract_char(node_title, cik_url_sample) 
Encoding(mytitle) <- "UTF-8"

# Functions to get the country and city names
extract_country <- function(myurl){
  tmp.mytitle <- extract_char(node_title, myurl) 
  Encoding(tmp.mytitle) <- "UTF-8"
  r1 <- regexpr("държава ", tmp.mytitle)
  r2 <- regexpr("място", tmp.mytitle)
  # The country starts right after "държава " (8 characters)
  # and ends 5 characters before "място"
  tmp.country <- substr(tmp.mytitle, r1+8, r2-5)
  return(tmp.country)
}

extract_city <- function(myurl){
  tmp.mytitle <- extract_char(node_title, myurl) 
  Encoding(tmp.mytitle) <- "UTF-8"
  r3 <- regexpr("гласуване ", tmp.mytitle)
  # The city starts right after "гласуване " (10 characters). Use the length
  # of this page's own title, not the global sample title mytitle.
  tmp.city <- substr(tmp.mytitle, r3+10, nchar(tmp.mytitle)-1)
  return(tmp.city)
}
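
An alternative to position arithmetic is a single regular expression with capture groups, which avoids hard-coded offsets. The sketch below uses a made-up English title; the real titles are in Bulgarian and their exact layout is not reproduced here, so the pattern would need to be adapted to the actual keywords.

```r
# Hypothetical title, for illustration only
title <- "Section 320100003, country Australia, place of voting Brisbane"

# Group 1 captures the country, group 2 the city
pattern <- "country (.+), place of voting (.+)$"
m <- regmatches(title, regexec(pattern, title))[[1]]

country_demo <- m[2]  # m[1] holds the full match
city_demo    <- m[3]
print(c(country_demo, city_demo))
```

When the pattern does not match, `m` has length zero and both values come back as `NA`, which is easier to spot than a silently truncated string.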


### Step 3. Loop through each page url and collect the data points of interest

We will create a loop to go through each section page, collect the data of interest and store it as the next element in a vector. Growing a vector one element at a time inside a loop is slow in R, so we preallocate each vector with its final length and type before the loop starts. 

In [27]:
num_sections <- length(cik_url_section)
section.id <- character(num_sections)  # ids are stored as strings
country <- rep(NA_character_, num_sections)  # a true missing value, not the string "NA"
city <- rep(NA_character_, num_sections)
valid <- numeric(num_sections)
protest <- numeric(num_sections)
tt <- numeric(num_sections)
rr <- numeric(num_sections)
cc <- numeric(num_sections)
kk <- numeric(num_sections)
vm <- numeric(num_sections)
po <- numeric(num_sections)
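
Two details of preallocation are easy to miss: the string `"NA"` is not a missing value, and assigning a string into a numeric vector silently converts the whole vector to character. A short base R sketch (the values are illustrative):

```r
n <- 3

# A typed missing value is recognised by is.na(); the string "NA" is not
city_typed  <- rep(NA_character_, n)
city_string <- rep("NA", n)
print(all(is.na(city_typed)))   # TRUE
print(any(is.na(city_string)))  # FALSE

# Assigning a string into a numeric vector converts the whole vector
ids <- numeric(n)
ids[1] <- "320100003"
print(class(ids))  # "character"
print(ids[2])      # "0"
```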

Loop through each section url.

In [29]:
for (i in seq_len(num_sections)){  # seq_len() is safe even when num_sections is 0
  tmp.html <- read_html(cik_url_section[i])
  
  section.id[i] <- cik_section_id[i]
  country[i] <- extract_country(tmp.html)
  city[i] <- extract_city(tmp.html)
  valid[i] <- extract_num(node_valid, tmp.html)
  protest[i] <- extract_num(node_protest, tmp.html)
  tt[i] <- extract_num(node_tt, tmp.html)
  rr[i] <- extract_num(node_rr, tmp.html)
  cc[i] <- extract_num(node_cc, tmp.html)
  kk[i] <- extract_num(node_kk, tmp.html)
  vm[i] <- extract_num(node_vm, tmp.html)
  po[i] <- extract_num(node_po, tmp.html)
}

# Create a dataframe with the results - each row represents 
# a unique section and each column represents a variable.
# stringsAsFactors = FALSE keeps the ids and names as plain strings.
ballots.df <- data.frame(section.id, country, city, valid, protest, tt, rr, cc, kk, vm, po,
                         stringsAsFactors = FALSE)
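
A loop over several hundred pages can fail partway through if a single request errors. One possible safeguard, sketched below under stated assumptions, is to wrap the per-page work in `tryCatch` so that a failure leaves an `NA` instead of stopping the loop. `fetch_votes` and the identifiers in `urls` are hypothetical stand-ins for `read_html` plus `extract_num` and for the real section URLs.

```r
# Stand-in for read_html() plus extract_num(); fails on one made-up URL
fetch_votes <- function(url) {
  if (grepl("broken", url)) stop("could not read ", url)
  nchar(url)  # arbitrary number standing in for a ballot count
}

urls <- c("page-a", "page-broken", "page-c")  # made-up identifiers
votes <- rep(NA_real_, length(urls))

for (i in seq_along(urls)) {
  votes[i] <- tryCatch(
    fetch_votes(urls[i]),
    error = function(e) {
      message("Skipping ", urls[i], ": ", conditionMessage(e))
      NA_real_
    }
  )
  # Sys.sleep(1)  # optional pause between requests when scraping a live site
}

print(votes)
```

Rows left as `NA` can then be re-scraped separately without repeating the pages that succeeded.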


Let's take a look at the data.

In [30]:
head(ballots.df)

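|   | section.id | country | city | valid | protest | tt | rr | cc | kk | vm | po |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 320100003 |  | , | 45 | 2 | 14 | 13 | 4 | 7 | 4 | 0 |
| 2 | 320100001 |  | , | 40 | 1 | 3 | 17 | 11 | 0 | 1 | 0 |
| 3 | 320100004 |  | , | 115 | 3 | 45 | 13 | 24 | 22 | 1 | 0 |
| 4 | 320100005 |  | , | 81 | 9 | 30 | 11 | 11 | 7 | 2 | 0 |
| 5 | 320100002 |  | , | 148 | 8 | 60 | 22 | 16 | 10 | 8 | 1 |
| 6 | 320200012 |  | , | 122 | 13 | 23 | 19 | 24 | 14 | 19 | 0 |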


There are two things to note here. 
* The country and city variables, despite being marked as UTF-8, are not displayed properly. Therefore, we will manually enter their values in English using Excel. This is something to come back to and resolve, because any data manipulation outside R undermines the reproducibility of the analysis. 
* The votes collected by the top six candidates sum to less than or equal to the total number of valid votes. This is because we did not capture the ballots cast for the marginal candidates. 
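
The second point can be checked directly. The sketch below re-enters the six rows shown in the output above and computes, for each section, the valid ballots not accounted for by the top six candidates; the `other` column name is introduced here for illustration only.

```r
# The six rows displayed by head(ballots.df)
ballots_head <- data.frame(
  section.id = c("320100003", "320100001", "320100004",
                 "320100005", "320100002", "320200012"),
  valid   = c(45, 40, 115, 81, 148, 122),
  protest = c(2, 1, 3, 9, 8, 13),
  tt = c(14, 3, 45, 30, 60, 23),
  rr = c(13, 17, 13, 11, 22, 19),
  cc = c(4, 11, 24, 11, 16, 24),
  kk = c(7, 0, 22, 7, 10, 14),
  vm = c(4, 1, 1, 2, 8, 19),
  po = c(0, 0, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Valid ballots minus the ballots of the six tracked candidates
top6 <- rowSums(ballots_head[, c("tt", "rr", "cc", "kk", "vm", "po")])
ballots_head$other <- ballots_head$valid - top6

# The difference should never be negative
stopifnot(all(ballots_head$other >= 0))
print(ballots_head[, c("section.id", "valid", "other")])
```

Whether `other` consists only of votes for the remaining candidates depends on how the protocol counts protest ballots relative to valid ballots, which these six rows alone do not settle.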

We will save the dataframe as a .csv file, enter the English transcription of the country and city names, and re-load the edited .csv file. 

In [31]:
write.csv(ballots.df, "ballots_cyrillic.csv")
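
By default `write.csv` also writes the row names as an unnamed first column. A hedged sketch of a round trip that skips that column and states the file encoding explicitly is shown below; it uses two rows from the output above and a temporary file, so the file name is not the one used in this project.

```r
# Two rows from the output above, enough to show the round trip
df <- data.frame(section.id = c("320100003", "320100001"),
                 valid = c(45, 40),
                 stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")

# row.names = FALSE drops the index column; fileEncoding sets the encoding
write.csv(df, path, row.names = FALSE, fileEncoding = "UTF-8")

# Read the ids back as strings so they are not converted to numbers
back <- read.csv(path, fileEncoding = "UTF-8",
                 colClasses = c(section.id = "character"),
                 stringsAsFactors = FALSE)
print(names(back))
```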

This completes the data collection step. In Part II we will analyze the data.