# Using Python for Information Retrieval

## Solutions

In [30]:
import os
import csv

# PART A: Start with one document

## 1. Read, Clean, Assign

**task**:

1. Read one document
2. Collect information on the country and year
3. Keep the section we're interested in
4. Turn each line into an item in a list.

**skills**:
- file reading
- [string](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#string) slicing
- string methods
- indexing

### 1.1 Read in "cotedivoire2014.txt"

Fill in the blanks to read in the file. We'll need to include the `encoding='utf8'` optional parameter to the `open()` function to ensure that the text file is read correctly on all operating systems.

In [31]:
# SOLUTION
directory = './data/txts'
file_name = "cotedivoire2014.txt"
with open(os.path.join(directory, file_name), 'r', encoding='utf8') as f:
    text = f.read()

### 1.2 Assign country and year variables 

You'll notice that the file name consists of the name of the country followed by the year. We can use this to get that information. Slice the file name to create two new variables, `country` and `year`.

Be careful! Remember that we are going to apply this to the other file names later. Make sure that however you slice "cotedivoire2014.txt" would work for the other files in the `data/txts` directory.

In [32]:
# SOLUTION
country = file_name[:-8]
year = file_name[-8:-4]
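To check that the slices generalize, here is a small sketch that applies the same negative indices to a few of the other file names in `data/txts`. It assumes every name ends with a four-digit year followed by `.txt`; the helper name `split_file_name` is hypothetical and not part of the exercise.

```python
def split_file_name(file_name):
    # Assumes the pattern <country><4-digit year>.txt
    country = file_name[:-8]    # everything before the last 8 characters
    year = file_name[-8:-4]     # the 4 characters just before ".txt"
    return country, year

for name in ["cotedivoire2014.txt", "fiji2014.txt", "tuvalu2013.txt"]:
    print(split_file_name(name))
```

Counting from the end of the string is what makes this work for country names of any length.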

### 1.3 Get the Recommendations Section

Note that the section we want starts with `"II. Conclusions and/or recommendations\n"`. What [method](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#method) would you use to get everything after this substring? Fill in the blank below and [assign](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#assign) the value to a new variable called `rec_text`.

Note: there is certainly more than one way to do this, but the code below suggests one string method in particular. If you have time, think about what other methods or libraries you could use to get certain substrings.

In [33]:
# SOLUTION
sections = text.split("II. Conclusions and/or recommendations\n")
rec_text = sections[1]
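As one of the "other methods" hinted at above, `str.partition()` could also do the job. The sketch below runs on a made-up stand-in string (`sample_text`) rather than the real file, and it is only an alternative to the `split()` solution, not a replacement for it.

```python
# Hypothetical stand-in for the contents of a report file
sample_text = (
    "I. Summary of the proceedings\n"
    "Some introductory text.\n"
    "II. Conclusions and/or recommendations\n"
    "127. The recommendations listed below enjoy the support of the country:\n"
    "127.1 Ratify the treaty (Ghana);\n"
)
marker = "II. Conclusions and/or recommendations\n"

# partition() splits at the first occurrence and returns (before, marker, after)
before, found, sample_rec_text = sample_text.partition(marker)

if not found:
    # partition() returns an empty separator when the marker is missing
    raise ValueError("marker not found in text")

print(sample_rec_text)
```

One possible advantage is that a missing marker can be detected explicitly, whereas `sections[1]` would raise an `IndexError`.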

### 1.4 Turn it into a list

Using a string method, turn the string above into a list of lines, and store it in a variable called `recs`. Remember that a new line is represented by `\n`.

In [34]:
# SOLUTION
recs = rec_text.split("\n")
recs[:15]

['127. The recommendations listed below enjoy the support of Côte d’Ivoire: ',
 '127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); ',
 '127.2 Make efforts towards the ratification of the OP-CAT (Chile); ',
 '127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); ',
 '127.4 Accede to the OP-CAT as soon as possible (Uruguay); ',
 '127.5 Consider ratifying OP-CAT (Burkina Faso); ',
 '127.6 Ratify the International Convention on the Protection of the Rights of All Migrant Workers and Members of Their Families (ICRMW) (Ghana); ',
 '127.7 Consider acceding to the ICRMW (Chad); ',
 '127.8 Make efforts towards the ratification of ICCPR-OP 2 (Chile); ',
 '127.9 Ratify ICCPR-OP 2 (Rwanda) to abolish death penalty (France, Montenegro); ',
 '127.10 Accede to the Agreem
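A side note on `split("\n")`: it keeps empty strings for blank lines and for a trailing newline. The sketch below, run on a made-up sample string, compares it with `splitlines()` plus a filter. The filtering step is an optional extra, not something the exercise requires.

```python
sample_rec_text = (
    "127. Header line: \n"
    "127.1 Ratify the treaty (Ghana); \n"
    "\n"
    "128. Next header: \n"
)

# split("\n") keeps the empty strings produced by the blank line and the final newline
raw_lines = sample_rec_text.split("\n")
print(len(raw_lines))

# splitlines() drops the trailing empty string; the comprehension also strips
# surrounding whitespace and skips blank lines
clean_recs = [line.strip() for line in sample_rec_text.splitlines() if line.strip()]
print(clean_recs)
```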

### 1.5 Make a function

Let's put all of that code into a function that will read in a file and return a list of recommendations.

In [35]:
def read_recommendations(filename):
    # read document
    with open(os.path.join(directory, filename), 'r', encoding='utf8') as f:
        text = f.read()
    
    # collect info on country and year
    country = filename[:-8]
    year = filename[-8:-4]
    
    # get rec section
    sections = text.split("II. Conclusions and/or recommendations\n")
    rec_text = sections[1]
    
    # turn recs into a list
    recs = rec_text.split("\n")
    
    return recs

## 2. Chunk Recommendations

**task**:

These texts have 3 sections each. 
1. The first section contains the recommendations the country supports. 
2. The second section contains the recommendations the country will examine. 
3. The third section contains the recommendations the country explicitly rejects. 

We want to chunk the text into three lists, `accept`, `examine`, and `reject`, each containing its respective recommendations.

**skills**:
- string methods
- lists
- loops
- conditionals
- indexing

### 2.1: Find the paragraph numbers

Each section starts with a main paragraph number (e.g. **123**). The individual recommendations are then noted as subparagraphs (e.g. **123.1, 123.2** etc.).

All the accepted recommendations have the same main paragraph number (**123**). Next come the recommendations which will be examined, whose main paragraph number is just the next integer (**124**). After that are the rejected recommendations, with the next integer as their main paragraph number (**125**).

We can't know the paragraph numbers beforehand. But we *can* leverage our knowledge of the structure of the documents to get them.

Fill in the blanks below to create 3 variables containing the 3 paragraph numbers.

In [36]:
# SOLUTION
para1 = recs[0].split(".")[0]
para1 = int(para1)
para2 = para1 + 1
para3 = para2 + 1
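To see why this works, here is the same idea on a tiny made-up list (`sample_recs` is a stand-in, not the real data). The first line opens the first section, so everything before its first period is the main paragraph number.

```python
sample_recs = [
    "127. The recommendations listed below enjoy the support of the country:",
    "127.1 Ratify the treaty (Ghana);",
    "128. The following recommendations will be examined:",
]

# Everything before the first "." on the first line is the main paragraph number
main_number = sample_recs[0].split(".")[0]

# The three sections use consecutive integers
numbers = [int(main_number) + offset for offset in range(3)]
print(numbers)  # [127, 128, 129]
```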

### 2.2 Parse the text

Now create 3 new lists: `accept_recs`, `examine_recs`, and `reject_recs`. Complete the for loop to filter through `recs` and assign each recommendation to its corresponding list.

**hint**: How do you know if a line belongs to a section? It starts with the main paragraph number for that section. So use the **.startswith()** method.

In [37]:
# allocate lists for the 3 types of recommendations
accept_recs = []
examine_recs = []
reject_recs = []

# iterate through all the recommendations and add each one to the appropriate list
for line in recs:
    if line.startswith(str(para1)):
        accept_recs.append(line)
    elif line.startswith(str(para2)):
        examine_recs.append(line)
    elif line.startswith(str(para3)):
        reject_recs.append(line)

# remove the first item from each list, which just demarcates the sections
accept_recs = accept_recs[1:]
examine_recs = examine_recs[1:]
reject_recs = reject_recs[1:]    
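One thing to keep in mind with `.startswith()` is that a bare number also matches longer numbers with the same leading digits. The sketch below uses a made-up list, including an invented line starting with `1270`, to show how adding the trailing period to the prefix makes the match stricter. This is an optional refinement, assuming section headers and recommendations always have a period right after the main number.

```python
sample_recs = [
    "127. The recommendations listed below enjoy the support of the country:",
    "127.1 Ratify the treaty (Ghana);",
    "1270 copies of the report were distributed.",  # invented line, not a recommendation
]

# Matches all three lines, including the invented one
loose = [line for line in sample_recs if line.startswith("127")]

# Matches only the header and the recommendation
strict = [line for line in sample_recs if line.startswith("127.")]

print(len(loose), len(strict))  # 3 2
```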

### 2.3 Make a function

Let's again put the code we just created to parse the text into 3 separate lists into a function.

In [38]:
def parse_recommendations(recs):
    # SOLUTION
    para1 = recs[0].split(".")[0]
    para1 = int(para1)
    para2 = para1 + 1
    para3 = para2 + 1    

    # allocate lists for the 3 types of recommendations
    accept_recs = []
    examine_recs = []
    reject_recs = []

    # iterate through all the recommendations and add each one to the appropriate list
    for line in recs:
        if line.startswith(str(para1)):
            accept_recs.append(line)
        elif line.startswith(str(para2)):
            examine_recs.append(line)
        elif line.startswith(str(para3)):
            reject_recs.append(line)

    # remove the first item from each list, which just demarcates the sections
    accept_recs = accept_recs[1:]
    examine_recs = examine_recs[1:]
    reject_recs = reject_recs[1:]        
    
    return (accept_recs, examine_recs, reject_recs)

## 3. Get Recommending Country

**skills**

- string methods
- indexing
- functions

**task**
- extract the substring representing the recommending country.

### 3.1 Extracting the Country

Take a look at several recommendations to get an idea of their format. I've given you several samples below.

In [39]:
for cur_rec in accept_recs[:5]: 
    print(cur_rec)

127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); 
127.2 Make efforts towards the ratification of the OP-CAT (Chile); 
127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); 
127.4 Accede to the OP-CAT as soon as possible (Uruguay); 
127.5 Consider ratifying OP-CAT (Burkina Faso); 


Notice that they're all formatted the same way, with the recommending country in parentheses at the end.

Using your string skills, find a way to pull out the recommending country from the first recommendation (stored in `first_rec` below).

In [40]:
first_rec = accept_recs[0]

In [41]:
# SOLUTION
rec_after_paran = first_rec.split('(')[-1]
first_rec_country = rec_after_paran.split(')')[0]
print(first_rec_country)

Philippines


### 3.2 Create a Function

Create a function called `get_country` that takes an individual recommendation and returns the recommending country.

In [42]:
# SOLUTION
def get_country(rec):
    rec_after_paran = rec.split('(')[-1]
    rec_country = rec_after_paran.split(')')[0]
    return rec_country

# test your code
get_country(first_rec)

'Philippines'
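Note that `get_country` returns only the last parenthesized group, while some recommendations name several countries. If you wanted all of them, one possible approach is a regular expression. This is a sketch of an alternative, not the method used in this exercise, and it would also pick up parenthesized acronyms such as `(ICRMW)`, so the results would need checking.

```python
import re

sample_rec = (
    "127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 "
    "(Czech Republic) and take policy measures to prevent torture and "
    "ill-treatment (Estonia); "
)

# Capture the text inside every pair of parentheses
groups = re.findall(r"\(([^()]*)\)", sample_rec)

# Split groups such as "Ghana, Tunisia" into individual names
countries = [name.strip() for group in groups for name in group.split(",")]
print(countries)  # ['Ghana', 'Tunisia', 'Czech Republic', 'Estonia']
```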

## 4. Processing all Recommendations

**task**:

We now want to create a new list for each variable we eventually want in our output csv file. Each list will contain a single value per individual recommendation. The five variables we want a list for are: 

1. `to`: the country under review
2. `from`: the country (or countries) giving the recommendation
3. `year`: the year of the review (2014 for this document)
4. `decision`: whether the recommendation was supported, rejected, etc.
5. `text`: the text of the recommendation

To make it easier to store these data (and later to write it out to a csv file), we'll create a dictionary with an empty list for each of these variable names.

**skills**:
- loops
- dictionaries

In [43]:
rec_output = {'to':[],
              'from':[],
              'year':[],
              'decision':[],
              'text':[]}

### 4.1 Process the `accept` Recommendations

The code below loops through all the recommendations in the `accept_recs` list and appends an item to each of the 5 lists within the dictionary defined above. Fill in the blanks to complete the code.

(Remember we've already created the `country` and `year` variables above!)

In [44]:
# SOLUTION
for rec in accept_recs:
    rec_output['to'].append(country)
    rec_output['from'].append(get_country(rec))
    rec_output['year'].append(year)
    rec_output['decision'].append('accept')
    rec_output['text'].append(rec)

### 4.2 Make a function 

Now write a function that does the same for any list of recommendations. It should first create an output dictionary and then populate that dictionary. Think about all the parameters that the function should take in order to fill in all 5 fields of the `rec_output` dictionary. 

In [45]:
def process_recs(recs, to_country, year, decision_type):
    # Create the output dictionary
    output = {'to':[],
              'from':[],
              'year':[],
              'decision':[],
              'text':[]}
    
    # loop over the recommendations and fill the output dictionary's lists
    for rec in recs:
        output['to'].append(to_country)
        output['from'].append(get_country(rec))
        output['year'].append(year)
        output['decision'].append(decision_type)
        output['text'].append(rec)
        
    return output

### 4.3 Process all the Recommendations

Now use the function that you just wrote to process the recommendations from the `accept`, `examine`, and `reject` recommendation lists.

In [46]:
# SOLUTION
output_accept_recs = process_recs(accept_recs, country, year, 'accept')
output_examine_recs = process_recs(examine_recs, country, year, 'examine')
output_reject_recs = process_recs(reject_recs, country, year, 'reject')

### 4.4 Combine output dictionaries

Now let's write a function that takes a list of output recommendation dictionaries and creates a new one that is the combination of all of them. 

In [47]:
def combine_outputs(dicts):
    # create a new dictionary to contain the combined values of all the dictionaries
    output = {'to':[],
              'from':[],
              'year':[],
              'decision':[],
              'text':[]}
    
    # Loop over all the input dictionaries
    for cur_dict in dicts:        
        # loop over all the keys in the output dictionary
        for cur_key in output.keys():
            # extend the list which is the value of the current key using the list from the current dictionary
            cur_keys_list = cur_dict[cur_key]
            output[cur_key].extend(cur_keys_list)

    return output
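To make the `extend` pattern concrete, here is a compact sketch of the same idea on two tiny made-up dictionaries. The function name `combine_sketch` and the reduced set of two keys are assumptions for brevity; the real function above uses all five keys.

```python
def combine_sketch(dicts, keys=("to", "text")):
    # Start with a fresh empty list per key so the inputs are never modified
    combined = {key: [] for key in keys}
    for cur_dict in dicts:
        for key in keys:
            # extend() appends every item of the input list, keeping the order
            combined[key].extend(cur_dict[key])
    return combined

first = {"to": ["fiji"], "text": ["rec A"]}
second = {"to": ["fiji", "fiji"], "text": ["rec B", "rec C"]}

combined = combine_sketch([first, second])
print(combined)
```

Because every key is extended in the same order, item `i` of each list still describes the same recommendation after combining.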

Now combine the output dictionaries for the accept, examine, and reject recommendations into a single output dictionary.

In [48]:
output_recs = combine_outputs([output_accept_recs, output_examine_recs, output_reject_recs])

# test your code: the two counts should match
print(len(accept_recs) + len(examine_recs) + len(reject_recs))
print(len(output_recs['to']))

186
186


# PART B: Repeat for all documents

We just wrote code that takes one document and turns it into a dataset!

The problem is we have 11 documents!

We'll now combine the code we've written so far to create a function that can read one document at a time, and then read all 11 documents into a single dataset.

## 5. Make a function

**task**

Combine the functions that you wrote above to create a single function that takes a filename as a parameter and returns a dictionary of lists representing all of the recommendations in that document.

**skills**
- Functions
- Copying and pasting :)

In [49]:
# SOLUTION

def process_document(filename):

    # Use the function we wrote to read in the recommendations
    recs = read_recommendations(filename)
    
    # Use the function we wrote to parse the recommendations we read in
    accept_recs, examine_recs, reject_recs = parse_recommendations(recs)
    
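    # Get the "to" country and the year from the file name
    # (without this, `year` would silently fall back to the global set in Part A)
    country = filename[:-8]
    year = filename[-8:-4]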

    # Use the function to process the three recommendation types
    output_accept_recs = process_recs(accept_recs, country, year, 'accept')
    output_examine_recs = process_recs(examine_recs, country, year, 'examine')
    output_reject_recs = process_recs(reject_recs, country, year, 'reject')

    # combine the processed recommendations for the accept, examine and reject types
    output_recs = combine_outputs([output_accept_recs, output_examine_recs, output_reject_recs])

    return output_recs

In [50]:
# test your code!
print(len(process_document("tuvalu2013.txt")['to']))

97


## 6. Process all of the files

**task**

1. Find the file_names in our directory.
2. Apply the function above to all the filenames
3. Create a master dataset

**skills**
- I/O
- Loops
- Functions

### 6.1 Make a list of file_names

The code below reads all the file_names in the directory `data/txts`.

In [51]:
# SOLUTION
directory = 'data/txts'
for file_name in os.listdir(directory):
    print(file_name)

sanmarino2014.txt
tuvalu2013.txt
kazakhstan2014.txt
cotedivoire2014.txt
fiji2014.txt
bangladesh2013.txt
turkmenistan2013.txt
jordan2013.txt
monaco2013.txt
afghanistan2014.txt
djibouti2013.txt


Modify the code to include only the file_names that end in `.txt`.

In [52]:
# SOLUTION
for file_name in os.listdir(directory):
    if file_name.endswith(".txt"):
        print(file_name)

sanmarino2014.txt
tuvalu2013.txt
kazakhstan2014.txt
cotedivoire2014.txt
fiji2014.txt
bangladesh2013.txt
turkmenistan2013.txt
jordan2013.txt
monaco2013.txt
afghanistan2014.txt
djibouti2013.txt
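`os.listdir()` returns names in arbitrary order, so the listing may differ between machines. If a stable order matters, one option is `pathlib` with a glob pattern and `sorted()`. The sketch below creates stand-in files in a temporary directory so that it runs anywhere; the file `notes.md` is invented to show the filter at work.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Create empty stand-in files so the sketch does not depend on data/txts
    for name in ["tuvalu2013.txt", "fiji2014.txt", "notes.md"]:
        (Path(tmp) / name).touch()

    # glob("*.txt") filters by extension; sorted() gives a reproducible order
    txt_names = sorted(path.name for path in Path(tmp).glob("*.txt"))

print(txt_names)  # ['fiji2014.txt', 'tuvalu2013.txt']
```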


### 6.2 Process all the documents

Fill in the blanks below to process all the documents.

In [53]:
# SOLUTION
output_recs = []
for filename in os.listdir(directory):
    # Assume all txt files contain meaningful data
    if filename.endswith(".txt"):
        print("processing: ", filename)
        
        # Process the current file using the function
        cur_output_recs = process_document(filename)
        output_recs.append(cur_output_recs)

# Combine the output dictionaries from all of the files we've read in
output_recs_final = combine_outputs(output_recs)

processing:  sanmarino2014.txt
processing:  tuvalu2013.txt
processing:  kazakhstan2014.txt
processing:  cotedivoire2014.txt
processing:  fiji2014.txt
processing:  bangladesh2013.txt
processing:  turkmenistan2013.txt
processing:  jordan2013.txt
processing:  monaco2013.txt
processing:  afghanistan2014.txt
processing:  djibouti2013.txt


In [54]:
# Should be 1709
len(output_recs_final['to'])

1709

### 6.3 Save to file

Now we'll create a `pandas` `DataFrame` around our dataset and write it to a CSV file, and we're done!

In [55]:
# import pandas to build and write the dataset
import pandas as pd

# create a dataframe using the dictionary we've created
output_recs_df = pd.DataFrame(output_recs_final)

# write the DataFrame
output_recs_df.to_csv('upr-recs.csv')
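If `pandas` is not available, the standard-library `csv` module can write the same kind of dictionary of lists. The following is a minimal sketch on made-up data (`sample_output`), writing to an in-memory buffer instead of a real file so that it runs anywhere.

```python
import csv
import io

sample_output = {
    "to": ["fiji", "fiji"],
    "from": ["Ghana", "Chile"],
    "year": ["2014", "2014"],
    "decision": ["accept", "reject"],
    "text": ["rec A", "rec B"],
}

buffer = io.StringIO()  # stands in for open('upr-recs.csv', 'w', newline='')
writer = csv.writer(buffer)

columns = list(sample_output.keys())
writer.writerow(columns)  # header row

# zip(*lists) turns the column lists into one tuple per recommendation
writer.writerows(zip(*(sample_output[col] for col in columns)))

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Also note that `DataFrame.to_csv()` writes the row index as an extra first column by default; passing `index=False` leaves it out if you don't need it.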