# Data cleaning and summarizing with dplyr

## The United Nations voting dataset

### UN Voting Dataset

* Description
    - rcid: Roll call ID
    - session: Session (year). Only sessions from alternating years are included, to keep the size of the dataset reasonable
    - vote: Vote, e.g. 1 (yes) or 9 (the country was not yet a member of the UN)
    - ccode: Country code
    
### Votes in dplyr

In [8]:
# Load dplyr package
library(dplyr)
library(data.table)

# Load the UN voting dataset
votes <- readRDS("./14_exploratory_data_analysis_in_r_case_study/votes.rds")

# Check the data set
head(votes, n = 10)

rcid,session,vote,ccode
46,2,1,2
46,2,1,20
46,2,9,31
46,2,1,40
46,2,1,41
46,2,1,42
46,2,9,51
46,2,9,52
46,2,9,53
46,2,9,54


### dplyr verbs

* filter(): Subsets the observations in the data set, removing rows you are not interested in
* mutate(): Adds variables or changes existing variables
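
As a minimal sketch of the two verbs, assuming a small made-up table (`toy_votes`, `kept`, and `flagged` are hypothetical names, not part of the course data):

```r
library(dplyr)

# Hypothetical toy table; the values are made up for illustration
toy_votes <- data.frame(
  rcid = c(1, 1, 2, 2),
  vote = c(1, 9, 3, 8)
)

# filter() keeps only the rows for which the condition is TRUE
kept <- toy_votes %>%
  filter(vote <= 3)

# mutate() adds a column computed from existing columns
flagged <- kept %>%
  mutate(is_yes = vote == 1)

flagged
```

Here `filter()` changes the number of rows, while `mutate()` changes the number of columns.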

### original data

* vote
    - 1 = Yes
    - 2 = Abstain
    - 3 = No
    - 8 = Not present (not of interest)
    - 9 = Not a member (not of interest)

### Filtering rows

The vote column in the dataset has a number that represents that country's vote:

* 1 = Yes
* 2 = Abstain
* 3 = No
* 8 = Not present
* 9 = Not a member

One step of data cleaning is removing observations (rows) that you're not interested in. In this case, you want to remove "Not present" and "Not a member".
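
A condition such as `vote <= 3` works for this coding scheme because the three codes of interest happen to be the smallest values. The sketch below, on a made-up table (`toy_votes` is hypothetical), compares it with a filter that names the wanted codes explicitly:

```r
library(dplyr)

# Hypothetical sample covering every vote code listed above
toy_votes <- data.frame(
  ccode = c(2, 20, 31, 40, 51),
  vote  = c(1, 2, 9, 3, 8)
)

# Relies on the codes of interest (1, 2, 3) being the smallest values
by_threshold <- toy_votes %>%
  filter(vote <= 3)

# Names the wanted codes explicitly instead
by_membership <- toy_votes %>%
  filter(vote %in% c(1, 2, 3))

# Compare which countries survive each filter
by_threshold$ccode
by_membership$ccode
```

The explicit `%in%` form is a little longer, but it would keep working if the codes were ever renumbered.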

INSTRUCTIONS

* Load the dplyr package.
* Print the votes table.
* Filter out rows where the vote recorded is "not present" or "not a member", leaving cases where it is "yes", "abstain", or "no".

In [13]:
# Load the dplyr package
library(dplyr)

# Print the votes dataset
print(votes)

# Filter for votes that are "yes", "abstain", or "no"
votes %>%
    filter(vote <= 3)

# A tibble: 508,929 x 4
    rcid session  vote ccode
   <dbl>   <dbl> <dbl> <int>
 1  46.0    2.00  1.00     2
 2  46.0    2.00  1.00    20
 3  46.0    2.00  9.00    31
 4  46.0    2.00  1.00    40
 5  46.0    2.00  1.00    41
 6  46.0    2.00  1.00    42
 7  46.0    2.00  9.00    51
 8  46.0    2.00  9.00    52
 9  46.0    2.00  9.00    53
10  46.0    2.00  9.00    54
# ... with 508,919 more rows


rcid,session,vote,ccode
46,2,1,2
46,2,1,20
46,2,1,40
46,2,1,41
46,2,1,42
46,2,1,70
46,2,1,90
46,2,1,91
46,2,1,92
46,2,1,93


### Adding a year column

The next step of data cleaning is manipulating your variables (columns) to make them more informative.

In this case, you have a session column that is hard to interpret intuitively. But since the UN started voting in 1946, and holds one session per year, you can get the year of a UN resolution by adding 1945 to the session number.
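
As a quick check of the arithmetic, the sketch below converts the first few session numbers that appear in the data (2, 4, 6); the variable names are only illustrative:

```r
# Session numbers as they appear in the data (alternating sessions only)
session <- c(2, 4, 6)

# One session per year, with the first session in 1946
year <- session + 1945

year
# [1] 1947 1949 1951
```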

INSTRUCTIONS

* Use mutate() to add a year column by adding 1945 to the session column.

In [14]:
# Add another %>% step to add a year column
votes %>%
    filter(vote <= 3) %>%
    mutate(year = session + 1945)

rcid,session,vote,ccode,year
46,2,1,2,1947
46,2,1,20,1947
46,2,1,40,1947
46,2,1,41,1947
46,2,1,42,1947
46,2,1,70,1947
46,2,1,90,1947
46,2,1,91,1947
46,2,1,92,1947
46,2,1,93,1947


### Adding a country column

The country codes in the ccode column are what's called Correlates of War codes. This isn't ideal for an analysis, since you'd like to work with recognizable country names.

You can use the countrycode package to translate. For example:
```
library(countrycode)

# Translate the country code 2
> countrycode(2, "cown", "country.name")
[1] "United States"

# Translate multiple country codes
> countrycode(c(2, 20, 40), "cown", "country.name")
[1] "United States" "Canada"        "Cuba"
```

INSTRUCTIONS

* Load the countrycode package.
* Convert the country code 100 to its country name.
* Add a new country column in your mutate() statement containing country names, using the countrycode() function to translate from the ccode column. Save the result to votes_processed.

In [35]:
# Load the countrycode package
library(countrycode)

# Convert country code 100
countrycode(100, "cown", "country.name")

# Add a country column within the mutate: votes_processed
votes_processed <- votes %>%
    filter(vote <= 3) %>%
    mutate(year = session + 1945) %>%
    mutate(country = countrycode(ccode, "cown", "country.name"))

# Fill in the NAs left by the codes that countrycode() could not match
# 260 = "Federal Republic of Germany"
votes_processed[votes_processed$ccode == 260, "country"] <- "Federal Republic of Germany"

# 816 = "Viet Nam"
votes_processed[votes_processed$ccode == 816, "country"] <- "Viet Nam"

"Some values were not matched unambiguously: 260, 816
"

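The same patch for unmatched codes could also be written inside the pipeline instead of with base R assignment. The sketch below assumes a toy table in which `countrycode()` has already left `NA` for codes 260 and 816; `toy_processed` and `patched` are hypothetical names.

```r
library(dplyr)

# Hypothetical stand-in for the output of countrycode(): unmatched codes are NA
toy_processed <- data.frame(
  ccode   = c(2, 260, 816),
  country = c("United States", NA, NA),
  stringsAsFactors = FALSE
)

# case_when() checks the conditions in order; rows matching none keep their name
patched <- toy_processed %>%
  mutate(country = case_when(
    ccode == 260 ~ "Federal Republic of Germany",
    ccode == 816 ~ "Viet Nam",
    TRUE         ~ country
  ))

patched
```

Keeping the fix in a `mutate()` step means the whole cleaning process stays in one chain of `%>%` calls.
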
## Grouping and summarizing

### Processed votes

* The processed data set has more than 350,000 rows, so you need to extract summary statistics of interest rather than read it row by row.

### Using "% of Yes Votes" as a summary

* Shows whether a country tends to vote with the majority or against it.

### dplyr verb: summarise() (or summarize())

* summarise() turns many rows into one

```
# summarise: total number of rows and the fraction of "yes" votes
votes_processed %>%
    summarise(total = n(),
              percent_yes = mean(vote == 1))
```
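
The comparison `vote == 1` yields a logical vector, and `mean()` treats `TRUE` as 1 and `FALSE` as 0, so `percent_yes` is a fraction between 0 and 1 rather than a percentage. A minimal sketch on made-up data (`toy_votes` and `toy_summary` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy votes: three "yes" (1), one "abstain" (2), one "no" (3)
toy_votes <- data.frame(vote = c(1, 1, 2, 3, 1))

toy_summary <- toy_votes %>%
  summarise(total = n(),
            percent_yes = mean(vote == 1))

# One row: total is 5 and percent_yes is 3/5 = 0.6
toy_summary
```

Note the double `==`: a single `=` inside `mean()` would be read as a named argument, not a comparison.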

### dplyr verb: group_by()

* group_by() before summarise() turns groups into one row each.

```
# summarise the number of rows and the percentage of yes by year
votes_processed %>%
    group_by(year) %>%
    summarise(total = n(),
              percent_yes = mean(vote == 1))
```
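
To see the "one row per group" behaviour on something small, the sketch below uses a made-up table with two years (`toy_votes` and `toy_by_year` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data: two years with different vote mixes
toy_votes <- data.frame(
  year = c(1947, 1947, 1947, 1949, 1949),
  vote = c(1, 1, 3, 1, 2)
)

toy_by_year <- toy_votes %>%
  group_by(year) %>%
  summarise(total = n(),
            percent_yes = mean(vote == 1))

# Two rows, one per year: 1947 has 3 votes (2/3 "yes"), 1949 has 2 votes (1/2 "yes")
toy_by_year
```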

In [36]:
print(votes_processed)

# A tibble: 353,547 x 6
    rcid session  vote ccode  year country           
   <dbl>   <dbl> <dbl> <int> <dbl> <chr>             
 1  46.0    2.00  1.00     2  1947 United States     
 2  46.0    2.00  1.00    20  1947 Canada            
 3  46.0    2.00  1.00    40  1947 Cuba              
 4  46.0    2.00  1.00    41  1947 Haiti             
 5  46.0    2.00  1.00    42  1947 Dominican Republic
 6  46.0    2.00  1.00    70  1947 Mexico            
 7  46.0    2.00  1.00    90  1947 Guatemala         
 8  46.0    2.00  1.00    91  1947 Honduras          
 9  46.0    2.00  1.00    92  1947 El Salvador       
10  46.0    2.00  1.00    93  1947 Nicaragua         
# ... with 353,537 more rows


### Summarizing the full dataset

In this analysis, you're going to focus on "% of votes that are yes" as a metric for the "agreeableness" of countries.

You'll start by finding this summary for the entire dataset: the fraction of all votes in their history that were "yes". Note that within your call to summarize(), you can use n() to find the total number of votes and mean(vote == 1) to find the fraction of "yes" votes.

INSTRUCTIONS

* Print the votes_processed dataset that you created in the previous exercise.
* Summarize the dataset using the summarize() function to create two columns:
    - total: with the number of votes
    - percent_yes: the fraction of "yes" votes (a value between 0 and 1)

In [37]:
# Print votes_processed
print(votes_processed)

# Find total and fraction of "yes" votes
votes_processed %>%
    summarise(total = n(),
              percent_yes = mean(vote == 1))

# A tibble: 353,547 x 6
    rcid session  vote ccode  year country           
   <dbl>   <dbl> <dbl> <int> <dbl> <chr>             
 1  46.0    2.00  1.00     2  1947 United States     
 2  46.0    2.00  1.00    20  1947 Canada            
 3  46.0    2.00  1.00    40  1947 Cuba              
 4  46.0    2.00  1.00    41  1947 Haiti             
 5  46.0    2.00  1.00    42  1947 Dominican Republic
 6  46.0    2.00  1.00    70  1947 Mexico            
 7  46.0    2.00  1.00    90  1947 Guatemala         
 8  46.0    2.00  1.00    91  1947 Honduras          
 9  46.0    2.00  1.00    92  1947 El Salvador       
10  46.0    2.00  1.00    93  1947 Nicaragua         
# ... with 353,537 more rows


total,percent_yes
353547,0.7999248


### Summarizing by year

The summarize() function is especially useful because it can be used within groups.

For example, you might like to know how much the average "agreeableness" of countries changed from year to year. To examine this, you can use group_by() to perform your summary not for the entire dataset, but within each year.

INSTRUCTIONS

* Add a group_by() to your code to summarize() within each year.

In [38]:
# Change this code to summarize by year
votes_processed %>%
    group_by(year) %>%
    summarize(total = n(),
              percent_yes = mean(vote == 1))

year,total,percent_yes
1947,2039,0.5693968
1949,3469,0.4375901
1951,1434,0.5850767
1953,1537,0.6317502
1955,2169,0.6947902
1957,2708,0.6085672
1959,4326,0.5880721
1961,7482,0.5729751
1963,3308,0.7294438
1965,4382,0.7078959


### Summarizing by country

In the last exercise, you performed a summary of the votes within each year. You could instead summarize() within each country, which would let you compare voting patterns between countries.

INSTRUCTIONS

* Change the code in the editor to summarize() within each country rather than within each year. Save the result as by_country.

In [40]:
# Summarize by country: by_country
by_country <- 
    votes_processed %>%
    group_by(country) %>%
    summarize(total = n(),
              percent_yes = mean(vote == 1))

by_country

country,total,percent_yes
Afghanistan,2373,0.8592499
Albania,1695,0.7174041
Algeria,2213,0.8992318
Andorra,719,0.6383866
Angola,1431,0.9238295
Antigua & Barbuda,1302,0.9124424
Argentina,2553,0.7677242
Armenia,758,0.7467018
Australia,2575,0.5565049
Austria,2389,0.6224362


## Sorting and filtering summarized data

### by_country dataset

In [41]:
print(by_country)

# A tibble: 200 x 3
   country           total percent_yes
   <chr>             <int>       <dbl>
 1 Afghanistan        2373       0.859
 2 Albania            1695       0.717
 3 Algeria            2213       0.899
 4 Andorra             719       0.638
 5 Angola             1431       0.924
 6 Antigua & Barbuda  1302       0.912
 7 Argentina          2553       0.768
 8 Armenia             758       0.747
 9 Australia          2575       0.557
10 Austria            2389       0.622
# ... with 190 more rows


### dplyr verb: arrange()

* arrange(): Sorts a table based on a variable

```
# Sort the data by ascending order of percent_yes
by_country %>%
    arrange(percent_yes)
```
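
As a small illustration of both sort directions, the sketch below rebuilds the first three rows of `by_country` as printed above (rounded); `toy_by_country`, `ascending`, and `descending` are hypothetical names:

```r
library(dplyr)

# First three rows of by_country as printed above (rounded)
toy_by_country <- data.frame(
  country     = c("Afghanistan", "Albania", "Algeria"),
  total       = c(2373, 1695, 2213),
  percent_yes = c(0.859, 0.717, 0.899),
  stringsAsFactors = FALSE
)

# Ascending order is the default
ascending <- toy_by_country %>%
  arrange(percent_yes)

# Wrap the column in desc() to reverse the order
descending <- toy_by_country %>%
  arrange(desc(percent_yes))

ascending$country   # "Albania" "Afghanistan" "Algeria"
descending$country  # "Algeria" "Afghanistan" "Albania"
```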

### Transforming tidy data

* filter()
* group_by()
* summarise()
* arrange()

### Sorting by percentage of "yes" votes

Now that you've summarized the dataset by country, you can start examining it and answering interesting questions.

For example, you might be especially interested in the countries that voted "yes" least often, or the ones that voted "yes" most often.

INSTRUCTIONS

* Print the by_country dataset created in the last exercise.
* Use arrange() to sort the countries in ascending order of percent_yes.
* Arrange the countries by the same variable, but in descending order.

In [42]:
# You have the votes summarized by country
by_country <- votes_processed %>%
    group_by(country) %>%
    summarize(total = n(),
              percent_yes = mean(vote == 1))

# Print the by_country dataset
print(by_country)

# Sort in ascending order of percent_yes
by_country %>%
    arrange(percent_yes)

# Now sort in descending order
by_country %>%
    arrange(desc(percent_yes))

# A tibble: 200 x 3
   country           total percent_yes
   <chr>             <int>       <dbl>
 1 Afghanistan        2373       0.859
 2 Albania            1695       0.717
 3 Algeria            2213       0.899
 4 Andorra             719       0.638
 5 Angola             1431       0.924
 6 Antigua & Barbuda  1302       0.912
 7 Argentina          2553       0.768
 8 Armenia             758       0.747
 9 Australia          2575       0.557
10 Austria            2389       0.622
# ... with 190 more rows


country,total,percent_yes
Zanzibar,2,0.0000000
United States,2568,0.2694704
Palau,369,0.3387534
Israel,2380,0.3407563
Federal Republic of Germany,1075,0.3972093
United Kingdom,2558,0.4167318
France,2527,0.4265928
Micronesia (Federated States of),724,0.4419890
Marshall Islands,757,0.4914135
Belgium,2568,0.4922118


country,total,percent_yes
São Tomé & Príncipe,1091,0.9761687
Seychelles,881,0.9750284
Djibouti,1598,0.9612015
Guinea-Bissau,1538,0.9603381
Timor-Leste,326,0.9570552
Mauritius,1831,0.9497542
Zimbabwe,1361,0.9493020
Comoros,1133,0.9470432
United Arab Emirates,1934,0.9467425
Mozambique,1701,0.9465021


### Filtering summarized output

In the last exercise, you may have noticed that the country that voted "yes" least frequently, Zanzibar, had only 2 votes in the entire dataset. You certainly can't draw any substantial conclusions from that little data!

Typically in a progressive analysis, when you find that a few of your observations have very little data while others have plenty, you set some threshold to filter them out.
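
A minimal sketch of such a threshold, using three rows taken from the sorted output above. The `min_votes` variable is an assumption made for readability, and the right cutoff depends on the analysis:

```r
library(dplyr)

# Three rows taken from the sorted by_country output above
toy_by_country <- data.frame(
  country     = c("Zanzibar", "United States", "Palau"),
  total       = c(2, 2568, 369),
  percent_yes = c(0, 0.2694704, 0.3387534),
  stringsAsFactors = FALSE
)

# Assumed cutoff, stored in a variable so it is easy to change
min_votes <- 100

# Drop countries with too few votes, then sort the rest
reliable <- toy_by_country %>%
  filter(total >= min_votes) %>%
  arrange(percent_yes)

reliable$country  # "United States" "Palau"
```

Filtering before sorting keeps sparsely observed countries such as Zanzibar from dominating either end of the ranking.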

INSTRUCTIONS

* Use filter() to remove from the sorted data countries that have fewer than 100 votes.

In [43]:
# Filter out countries with fewer than 100 votes
by_country %>%
    filter(total >= 100) %>%
    arrange(percent_yes)

country,total,percent_yes
United States,2568,0.2694704
Palau,369,0.3387534
Israel,2380,0.3407563
Federal Republic of Germany,1075,0.3972093
United Kingdom,2558,0.4167318
France,2527,0.4265928
Micronesia (Federated States of),724,0.4419890
Marshall Islands,757,0.4914135
Belgium,2568,0.4922118
Canada,2576,0.5081522
