In [27]:
import os
import re
import csv
from operator import itemgetter
from itertools import groupby

# PART A: Start with one document

## 1. Read, Clean, Assign

**task**:

1. Read one document
2. Collect information on the country and year
3. Keep the section we're interested in
4. Turn each line into an item in a list.

**skills**:
- file reading
- string splicing
- string methods
- indexing

### 1.1 Read in "cotedivoire2014.txt"

Fill in the blanks to read in the file.

In [32]:
# SOLUTION
dir = './data/txts'
file_name = "cotedivoire2014.txt"
with open(dir + '/'+ file_name,'r', encoding = "ISO-8859-1") as f:
    text = f.read()

### 1.2 Assign country and year variables 

Splice the file name to create 2 new variables, `country`, and `year`

In [33]:
# SOLUTION
country = file_name[:-8]
year = file_name[-8:-4]

### 1.3 Get the Recommendations Section

Note that the section we want starts with `"II. Conclusions and/or recommendations\n"`. What method would you use to get everything after this substring? Fill in the blank below and assign the value to a new variable called `rec_text`


In [34]:
# SOLUTION
sections = text.split("II. Conclusions and/or recommendations\n")
rec_text = sections[1]

### 1.4 Turn it into a list

Turn the string above into a list of lines, and store it in a variable called `recs`

In [41]:
# SOLUTION
recs = rec_text.split("\n")
recs[:15]

['127. The recommendations listed below enjoy the support of C\x99te dÕIvoire: ',
 '127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); ',
 '127.2 Make efforts towards the ratification of the OP-CAT (Chile); ',
 '127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); ',
 '127.4 Accede to the OP-CAT as soon as possible (Uruguay); ',
 '127.5 Consider ratifying OP-CAT (Burkina Faso); ',
 '127.6 Ratify the International Convention on the Protection of the Rights of All Migrant Workers and Members of Their Families (ICRMW) (Ghana); ',
 '127.7 Consider acceding to the ICRMW (Chad); ',
 '127.8 Make efforts towards the ratification of ICCPR-OP 2 (Chile); ',
 '127.9 Ratify ICCPR-OP 2 (Rwanda) to abolish death penalty (France, Montenegro); ',
 '127.10 Accede to the Agr

## 2. Chunk 

**task**:

These texts have 3 sections each. 
1. The first section contains those recommendations the country supports. 
2. The second section contains recs the country will examine. 
3. The third contains recommendations the country explicitely rejects. 

We want to chunk the the text into three lists, `accept`, `examine`, `reject` -- each containing their respective recommendations.

**skills**:
- string methods
- lists
- loops
- conditionals
- indexing

### 2.1: Find the paragraph numbers

Each section starts with a main paragraph number (e.g. **123**. The individual recommendations are then noted as subparagraphs (e.g. **123.1, 123.2** etc.

The problem is, we don't know what these paragarph numbers are *a priori*. 

In [36]:
# SOLUTION
para1 = recs[0].split(".")[0]
para1 = int(para1)
para2 = para1 + 1
para3 = para2 + 1

### 2.2 Parse the text

Now create 3 new lists: `accept`, `examine`, `reject.` Loop through the `recs` and assign each one to their corresponding section.

**hint**: How do you know if a line belongs to a section? It starts with the main paragraph number for that section. So use the **.startswith()** method.

In [42]:
# SOLUTION
accept = [line for line in recs if line.startswith(str(para1))][1:]

examine = [line for line in recs if line.startswith(str(para2))][1:]

reject = [line for line in recs if line.startswith(str(para3))][1:]

['127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); ', '127.2 Make efforts towards the ratification of the OP-CAT (Chile); ', '127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); ', '127.4 Accede to the OP-CAT as soon as possible (Uruguay); ', '127.5 Consider ratifying OP-CAT (Burkina Faso); ', '127.6 Ratify the International Convention on the Protection of the Rights of All Migrant Workers and Members of Their Families (ICRMW) (Ghana); ', '127.7 Consider acceding to the ICRMW (Chad); ', '127.8 Make efforts towards the ratification of ICCPR-OP 2 (Chile); ', '127.9 Ratify ICCPR-OP 2 (Rwanda) to abolish death penalty (France, Montenegro); ', '127.10 Accede to the Agreement on the Privileges and Immunities of the International Criminal Court (Slovakia); ', 

## 3. Get Recommending Country

**skills**

- string methods
- indexing
- functions

**task**
- extract the substring representing the recommending country.

### 3.1 Extracting the Country

Take a look at a recommendation. I've given you a sample one below.

In [46]:
# get the first line, from the first section, of the first upr in `l`
rec = accept[1]
print(rec)

127.2 Make efforts towards the ratification of the OP-CAT (Chile); 


Notice that they're all formatted the same way, with the recommending country in parenthesis at the end, in between parentheses.

Using your string skills, find a way to pull out the recommending country.

In [47]:
# SOLUTION
rec_country = rec.split('(')[-1].split(')')[0]
print(rec_country)

Chile


### 3.2 Create a Function

Create a function called `get_country` that passes an individual recommendation and returns the recommending country

In [48]:
# SOLUTION
def get_country(rec):
    rec_country = rec.split('(')[-1].split(')')[0]
    return(rec_country)

# test you code
get_country(rec)

'Chile'

## 4. Store in Dictionary

**task**:

We now want to create a new list called `reclist` containing just individual recommendations. Each recommendation should be a dictionary with the following keys: 

1. `to`: the country under review
2. `from`: the country (or countries) giving the recommendation
4. `year`: the year of the review (all 2014 here)
5. `decision`: whether the recommendation was supported, rejected, etc.
6. `text`: the text of the recommendation

Create your `reclist` by looping through your list `l`. (Hint: You'll need to use loops within loops.)

**skills**:
- loops
- dictionaries

### 4.1 Fill in the Blanks

The program below loops through all the recommentations in the `accept` list and creates a list of dictionaries described above. Fill in the blanks to complete the code.

(Remember we the `country` and `year` variables we created above!)

In [19]:
# SOLUTION
accept_dictionaries = []
for rec in accept:
    dic = {}
    dic['to'] = country
    dic['year'] = year
    dic['decision'] = 'accept'
    dic['from'] = get_country(rec)
    dic['text'] = rec
    accept_dictionaries.append(dic) 

### 4.2 Repeat 

Now write a program that does the same for the `examine` and `rejected` lists:

In [20]:
# SOLUTION
examine_dictionaries = []
for rec in examine:
    dic = {}
    dic['to'] = country
    dic['year'] = year
    dic['decision'] = 'examine'
    dic['from'] = get_country(rec)
    dic['text'] = rec
    examine_dictionaries.append(dic) 

reject_dictionaries = []
for rec in examine:
    dic = {}
    dic['to'] = country
    dic['year'] = year
    dic['decision'] = 'reject'
    dic['from'] = get_country(rec)
    dic['text'] = rec
    reject_dictionaries.append(dic) 

### 4.3 Put em Together

Now concenate the `accept_dictionaries`, `examine_dictionaries`, `reject_dictionaries` lists to make one big list called `rec_list`

In [22]:
# SOLUTION
rec_list = accept_dictionaries + examine_dictionaries + reject_dictionaries
print(len(rec_list))

190


# PART B: Repeat for all documents

We just wrote a program that takes one document and turns it into a dataset!

The problem is we have 11 documents!

We'll now modify our program to create our data set from all 11 documents.

## 5. Make a function

**task**

Combine the code you wrote above to create a function that passes filename and returns a list of dictionaries representing all of the recommendations in that document.

**skills**
- Functions
- Copyin and pasting :)

In [23]:
# SOLUTION

def process_document(file_name):
    
    # read document
    with open(dir + '/'+ file_name,'r', encoding = "ISO-8859-1") as f:
        text = f.read()
    
    # collect info on country and year
    country = file_name[:-8]
    year = file_name[-8:-4]
    
    # get rec section
    sections = text.split("II. Conclusions and/or recommendations\n")
    rec_text = sections[1]
    
    # turn recs into a list
    recs = rec_text.split("\n")
    
    # find paragraph numbers
    para1 = recs[0].split(".")[0]
    para1 = int(para1)
    para2 = para1 + 1
    para3 = para2 + 1
    
    # chunk sections
    accept = [line for line in recs if line.startswith(str(para1))]
    accept = accept[1:]

    examine = [line for line in recs if line.startswith(str(para2))]
    examine = examine[1:]

    reject = [line for line in recs if line.startswith(str(para3))]
    reject = reject[1:]
    
    # make accept dictionaries
    accept_dictionaries = []
    for rec in accept:
        dic = {}
        dic['to'] = country
        dic['year'] = year
        dic['decision'] = 'accept'
        dic['from'] = get_country(rec)
        dic['text'] = rec
        accept_dictionaries.append(dic) 
    
    # "" examine ""
    examine_dictionaries = []
    for rec in examine:
        dic = {}
        dic['to'] = country
        dic['year'] = year
        dic['decision'] = 'examine'
        dic['from'] = get_country(rec)
        dic['text'] = rec
        examine_dictionaries.append(dic) 
        
    # "" reject ""
    reject_dictionaries = []
    for rec in examine:
        dic = {}
        dic['to'] = country
        dic['year'] = year
        dic['decision'] = 'reject'
        dic['from'] = get_country(rec)
        dic['text'] = rec
        reject_dictionaries.append(dic)
    
    rec_list = accept_dictionaries + examine_dictionaries + reject_dictionaries
    
    return(rec_list)

In [24]:
# test your code!
print(process_document("tuvalu2013.txt")[:5])

[{'from': 'Costa Rica', 'to': 'tuvalu', 'text': '82.1. Continue the efforts to achieve accession to the main human rights international instruments and their consistent incorporation into domestic legislation (Costa Rica); ', 'year': '2013', 'decision': 'accept'}, {'from': 'Nicaragua', 'to': 'tuvalu', 'text': '82.2. Consider ratifying new international human rights instruments which would assist in strengthening its legal and institutional framework for the promotion and protection of human rights (Nicaragua); ', 'year': '2013', 'decision': 'accept'}, {'from': 'Turkey', 'to': 'tuvalu', 'text': '82.3. Continue its efforts to accede to the remaining core international human rights treaties, which will strengthen the domestic legislation with regard to the promotion and protection of human rights, including freedom of religion or belief (Turkey); ', 'year': '2013', 'decision': 'accept'}, {'from': 'Viet Nam', 'to': 'tuvalu', 'text': '82.4. Work closely with the OHCHR and the Council for co

## 6. Loop through filenames

**task**

1. Find the file_names in our directory.
2. Apply the function above to all the filenames
3. Create a master database

**skills**
- I/O
- Loops
- Functions

### 6.1 Make a list of file_names

The program below reads all the file_names in the directory `data/txts`.

In [28]:
# SOLUTION
dir = 'data/txts'
for file_name in os.listdir(dir):
    print(file_name)

.DS_Store
afghanistan2014.txt
bangladesh2013.txt
cotedivoire2014.txt
djibouti2013.txt
fiji2014.txt
jordan2013.txt
kazakhstan2014.txt
monaco2013.txt
sanmarino2014.txt
turkmenistan2013.txt
tuvalu2013.txt


Modify the program to include only the file_names that end in `.txt`

In [29]:
# SOLUTION
for file_name in os.listdir(dir):
    if file_name.endswith(".txt"):
        print(file_name)

afghanistan2014.txt
bangladesh2013.txt
cotedivoire2014.txt
djibouti2013.txt
fiji2014.txt
jordan2013.txt
kazakhstan2014.txt
monaco2013.txt
sanmarino2014.txt
turkmenistan2013.txt
tuvalu2013.txt


## 6.2 Process the documents

Fill in the blanks below to process each document.

In [30]:
# SOLUTION
all_recs = []
for file_name in os.listdir(dir):
    if file_name.endswith(".txt"):
        print("processing: ", file_name)
        recs = process_document(file_name)
        all_recs.extend(recs)

processing:  afghanistan2014.txt
processing:  bangladesh2013.txt
processing:  cotedivoire2014.txt
processing:  djibouti2013.txt
processing:  fiji2014.txt
processing:  jordan2013.txt
processing:  kazakhstan2014.txt
processing:  monaco2013.txt
processing:  sanmarino2014.txt
processing:  turkmenistan2013.txt
processing:  tuvalu2013.txt


In [87]:
len(all_recs)

1830

## 6.3 Save to file

Now we get to save our data_base to a CSV, and we're done!

In [92]:
#writing column headings
import csv
keys = all_recs[0].keys()

#writing the rest
with open('upr-recs.csv', 'w') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_recs)