# Using Python for Information Retrieval

In this unit, we'll use Python to turn a collection of loose text documents into a real-life database. (Note: This database was created for a project by R. Terman and E. Voeten, using much the same process you'll be learning here.)

The lecture and problem set will leverage your new Python skills, especially working with text, lists, and dictionaries; writing for-loops, conditional statements, and functions; and thinking like a programmer.

**About the Data**

We'll be creating a database from [Universal Periodic Review outcome reports](http://www.ohchr.org/EN/HRBodies/UPR/Pages/BasicFacts.aspx).

The Universal Periodic Review (UPR) is a process run by the United Nations Human Rights Council, which involves a periodic review of the human rights records of all 193 UN Member States.

Reviews take place through an interactive discussion between the State under review and other UN Member States. During this discussion, any UN Member State can pose questions, offer comments, and/or make recommendations to the State under review. The State under review can then respond, stating which recommendations it rejects, accepts, will consider, etc. Reports are then drawn up detailing this discussion.

We will be analyzing outcome reports from the 2014 Universal Periodic Reviews of 42 countries, which we retrieved [here](http://www.ohchr.org/EN/HRBodies/UPR/Pages/Documentation.aspx) and formatted as text documents.

The goal is to convert these semi-structured texts to a tabular dataset of **recommendations** with the following variables:

1. Text of recommendation (*text*)
2. Country to which the recommendation is directed (*to*)
3. Country that is making the recommendation (*from*)
4. The year when the review took place (*year*)
5. The response to the recommendation, i.e. whether the reviewed country rejects, accepts, etc. (*decision*)

In other words, we want to turn this:

<img src="img/text.png" width="600">

into this:

<img src="img/tabular.png" width="400">

Take a few minutes to look at the files, which are located in `data/txts`, and get a sense for how they're structured.

In [None]:
import os
import csv

# PART A: Start with one document

## 1. Read, Clean, Assign

We're going to start off working with just one document. Then we'll be able to put that into a loop that runs on all the documents.

**task**:

1. Read one document
2. Collect information on the country and year
3. Keep the section we're interested in
4. Turn each line into an item in a list.

**skills**:
- file reading
- [string](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#string) slicing
- string methods
- indexing

### 1.1 Read in "cotedivoire2014.txt"

Fill in the blanks to read in the file. We'll need to include the `encoding='utf8'` optional parameter to the `open()` function to ensure that the text file is read correctly on all operating systems.
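
As a rough illustration of the pattern (not the solution to the blank below), the following sketch writes a tiny sample file and reads it back with `encoding='utf8'`. The temporary directory and the name `demo2014.txt` are made up so the example runs anywhere; in the notebook you would use `directory` and `file_name` instead.

```python
import os
import tempfile

# Hypothetical stand-in for ./data/txts so the sketch runs on any machine.
demo_dir = tempfile.mkdtemp()
demo_name = "demo2014.txt"
demo_path = os.path.join(demo_dir, demo_name)

# Write a small sample file that contains a non-ASCII character.
with open(demo_path, "w", encoding="utf8") as f:
    f.write("Côte d'Ivoire\nII. Conclusions and/or recommendations\n")

# Read the whole file back into a single string.
with open(demo_path, encoding="utf8") as f:
    demo_text = f.read()

print(demo_text)
```

Using `os.path.join` to build the path and a `with` block to open the file keeps the code portable and makes sure the file is closed afterwards.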

In [None]:
directory = './data/txts'
file_name = "cotedivoire2014.txt"
# FILL ME OUT

### 1.2 Assign country and year variables 

You'll notice that the file name consists of the name of the country and the year. We can use this to get that information. Slice the file name to create 2 new variables, `country`, and `year`.

Be careful! Remember that we are going to apply this to the other file names later. Make sure that however you slice "cotedivoire2014.txt" would work for the other files in the `data/txts` directory.
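
One way to slice that generalizes is to count from the end of the string rather than the start. The sketch below assumes every file name has the form `<country><4-digit year>.txt`; the helper name `split_file_name` is made up for illustration.

```python
# Assumption: every file name looks like <country><4-digit year>.txt
def split_file_name(file_name):
    stem = file_name[:-4]          # drop the ".txt" extension
    return stem[:-4], stem[-4:]    # (country, year); year stays a string

for name in ["cotedivoire2014.txt", "tuvalu2013.txt"]:
    print(split_file_name(name))
```

Negative indices work regardless of how long the country name is, which is why they are safer here than something like `file_name[:11]`.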

In [None]:
# FILL ME OUT

### 1.3 Get the Recommendations Section

Note that the section we want starts with `"II. Conclusions and/or recommendations\n"`. What [method](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#method) would you use to get everything after this substring? Fill in the blank below and [assign](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#assign) the value to a new variable called `rec_text`.

Note: there is certainly more than one way to do this, but the code below suggests one string method in particular. If you have time, think about what other methods or libraries you could use to get certain substrings.
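
To illustrate two of the possible string methods on a made-up snippet (the notebook's blank may have a different one in mind), here is a minimal sketch using `str.split` and `str.partition`. The `sample_text` content is invented.

```python
marker = "II. Conclusions and/or recommendations\n"

# Invented sample text; the real documents are much longer.
sample_text = (
    "I. Summary of proceedings\n"
    "Some discussion.\n"
    + marker
    + "123.1 First recommendation (France);\n"
)

# split(marker, 1) cuts at the first occurrence only; index 1 is what follows.
after_split = sample_text.split(marker, 1)[1]

# partition(marker) returns (before, marker, after); index 2 is what follows.
after_partition = sample_text.partition(marker)[2]

print(after_split)
```

The two differ when the marker is missing: `split(...)[1]` raises an `IndexError`, while `partition(...)[2]` returns an empty string.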

In [None]:
# FILL ME OUT

### 1.4 Turn it into a list

Using a string method, turn the string above into a list of lines, and store it in a variable called `recs`. Remember that a new line is represented by `\n`.
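
Two string methods can do this, and they treat a trailing newline differently. A small sketch on an invented string:

```python
# Invented sample; note the trailing newline at the end.
sample_rec_text = "123.1 First (France);\n123.2 Second (Chile);\n"

# split("\n") keeps an empty string after the final newline.
lines_split = sample_rec_text.split("\n")

# splitlines() does not produce that trailing empty string.
lines_splitlines = sample_rec_text.splitlines()

print(len(lines_split), len(lines_splitlines))
```

Either works for this exercise; just be aware that empty strings may appear in the list if you use `split("\n")`.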

In [None]:
# FILL ME OUT

### 1.5 Make a function

Let's put all of that code into a function that will read in a file and return a list of recommendations.

In [None]:
def read_recommendations(filename):
    # ADD YOUR CODE FROM SECTION 1 HERE    
    return recs

## 2. Chunk Recommendations

**task**:

These texts have 3 sections each. 
1. The first section contains those recommendations the country supports. 
2. The second section contains recommendations the country will examine. 
3. The third contains recommendations the country explicitly rejects.

We want to chunk the text into three lists, `accept_recs`, `examine_recs`, and `reject_recs`, each containing its respective recommendations.

**skills**:
- string methods
- list comprehensions
- conditionals
- indexing

### 2.1 Find the paragraph numbers

Each section starts with a main paragraph number (e.g. **123**). The individual recommendations are then noted as subparagraphs (e.g. **123.1, 123.2** etc.).

All the accepted recommendations have the same main paragraph number (**123**). Next come the recommendations which will be examined, whose main paragraph number is just the next integer (**124**). After that are the rejected recommendations, with the next integer as their main paragraph number (**125**).

We can't know the paragraph numbers beforehand. But we *can* leverage our knowledge of the structure of the documents to get them.

Fill in the blanks below to create 3 variables containing the 3 paragraph numbers.
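
A possible approach, sketched on invented lines: if the first line of the list begins with the first main paragraph number, the other two follow by adding 1 and 2. That assumption about the first line, and the contents of `sample_recs`, are hypothetical; check them against the real `recs` list.

```python
# Invented lines that mimic the numbering scheme described above.
sample_recs = [
    "123. Introductory sentence for the first section:",
    "123.1 First recommendation (France);",
    "124. Introductory sentence for the second section:",
    "124.1 Second recommendation (Chile);",
    "125. Introductory sentence for the third section:",
    "125.1 Third recommendation (Ghana);",
]

# The text before the first "." on the first line is the main paragraph number.
accept_num = int(sample_recs[0].split(".")[0])
examine_num = accept_num + 1
reject_num = accept_num + 2

print(accept_num, examine_num, reject_num)
```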

In [None]:
# FILL ME OUT

### 2.2 Parse the text

Now create 3 new lists: `accept_recs`, `examine_recs`, and `reject_recs`. Complete the for-loop code to filter through `recs` and assign each recommendation to its corresponding section.

**hint**: How do you know if a line belongs to a section? It starts with the main paragraph number for that section. So use the **.startswith()** method.
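
Here is a minimal sketch of the `.startswith()` idea on invented lines. The prefixes and the `sample_*` names are made up for illustration; in the notebook the prefixes would come from the paragraph numbers found in 2.1.

```python
# Invented lines; real recommendations are longer.
sample_lines = [
    "123.1 First recommendation (France);",
    "123.2 Second recommendation (Chile);",
    "124.1 Third recommendation (Ghana);",
    "125.1 Fourth recommendation (Japan);",
]

# Including the trailing "." avoids matching a line that starts with, say, "1230".
accept_prefix, examine_prefix, reject_prefix = "123.", "124.", "125."

sample_accept, sample_examine, sample_reject = [], [], []
for line in sample_lines:
    if line.startswith(accept_prefix):
        sample_accept.append(line)
    elif line.startswith(examine_prefix):
        sample_examine.append(line)
    elif line.startswith(reject_prefix):
        sample_reject.append(line)

print(len(sample_accept), len(sample_examine), len(sample_reject))
```

Lines that match none of the prefixes (blank lines, headings) are simply skipped. Any section heading that itself starts with a prefix would also be collected, so it may need separate handling in the real documents.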

In [None]:
# allocate lists for the 3 types of recommendations
accept_recs = []
examine_recs = []
reject_recs = []

# FILL ME OUT

### 2.3 Make a function

Let's again put the code we just created to parse the text into 3 separate lists into a function.

In [None]:
def parse_recommendations(recs):
    # PUT YOUR CODE HERE FROM SECTION 2
    
    # Put the three lists of recommendations into a tuple so it can be returned
    return (accept_recs, examine_recs, reject_recs)

## 3. Get Recommending Country

**skills**

- string methods
- indexing
- functions

**task**
- extract the substring representing the recommending country.

### 3.1 Extracting the Country

Take a look at several recommendations to get an idea of their format. I've given you several samples below.

In [None]:
for cur_rec in accept_recs[:5]: 
    print(cur_rec)

Notice that they're all formatted the same way, with the recommending country in parentheses at the end.

Using your string skills, find a way to pull out the recommending country from the first recommendation (stored in `first_rec` below).
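
One option, shown here on an invented recommendation, is to search from the right with `str.rfind` so that any earlier parentheses in the text are ignored. This is a sketch of one approach, not necessarily the one the exercise intends.

```python
# Invented recommendation with an extra pair of parentheses in the middle.
sample_rec = "123.1 Ratify the treaty (as soon as possible) (France);"

# rfind returns the index of the LAST occurrence of the character.
start = sample_rec.rfind("(")
end = sample_rec.rfind(")")

# Slice between the two parentheses, excluding them.
sample_country = sample_rec[start + 1:end]

print(sample_country)
```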

In [None]:
first_rec = accept_recs[0]

In [None]:
# FILL ME OUT

### 3.2 Create a Function

Using the code you just wrote, create a function called `get_country` that takes an individual recommendation and returns the recommending country.

In [None]:
def get_country(rec):
    # YOUR CODE HERE
    return(rec_country)

In [None]:
# test your code
get_country(first_rec)

## 4. Processing all Recommendations

**task**:

We now want to create a new list for each variable we eventually want in our output csv file. Each list will contain a single value per individual recommendation. The five variables we want a list for are: 

1. `to`: the country under review
2. `from`: the country (or countries) giving the recommendation
3. `year`: the year of the review (all 2014 here)
4. `decision`: whether the recommendation was supported, rejected, etc.
5. `text`: the text of the recommendation

To make it easier to store these data (and later to write it out to a csv file), we'll create a dictionary with an empty list for each of these variable names.

**skills**:
- loops
- dictionaries

In [None]:
rec_output = {'to':[],
              'from':[],
              'year':[],
              'decision':[],
              'text':[]}

### 4.1 Process the `accept` Recommendations

The code below loops through all the recommendations in the `accept_recs` list and appends an item to each of the 5 lists within the dictionary defined above. Fill in the blanks to complete the code.

(Remember we've already created the `country` and `year` variables above!)
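
The general pattern is one `append` per key on each pass through the loop, so all five lists stay the same length. The sketch below uses invented data and `sample_*` names, and the `"accept"` label for the decision is an assumption.

```python
# A dictionary with one empty list per output column.
sample_output = {"to": [], "from": [], "year": [], "decision": [], "text": []}

# Invented recommendations and metadata.
sample_accept = [
    "123.1 First recommendation (France);",
    "123.2 Second recommendation (Chile);",
]
sample_to, sample_year = "cotedivoire", "2014"

for rec in sample_accept:
    sample_output["to"].append(sample_to)
    # Recommending country: the text inside the last pair of parentheses.
    sample_output["from"].append(rec[rec.rfind("(") + 1:rec.rfind(")")])
    sample_output["year"].append(sample_year)
    sample_output["decision"].append("accept")  # assumed label
    sample_output["text"].append(rec)

print(sample_output["from"])
```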

In [None]:
# FILL ME OUT

### 4.2 Make a function 

Now write a function that does the same for any list of recommendations. It should first create an output dictionary and then populate that dictionary. Think about all the parameters that the function should take in order to fill in all 5 fields of the `rec_output` dictionary. 

In [None]:
def process_recs(recs, to_country, year, decision_type):
    # YOUR CODE FROM SECTION 4.1 HERE, UPDATED TO USE THE PARAMETERS PASSED IN
    ...

### 4.3 Process all the Recommendations

Now use the function that you just wrote to process the recommendations from the `accept_recs`, `examine_recs`, and `reject_recs` lists.

In [None]:
# FILL ME OUT

# uncomment to test your code
# print(len(rec_output['to']))

### 4.4 Combine output dictionaries
 
Now let's write a function that takes a list of output recommendation dictionaries and creates a new one that is the combination of all of them. 
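
To show the idea on toy data, here is a sketch that merges dictionaries of lists key by key. The function name `merge_list_dicts` is hypothetical, and this is only one way to do it.

```python
def merge_list_dicts(dicts):
    """Concatenate the lists stored under each key across several dictionaries."""
    merged = {}
    for d in dicts:
        for key, values in d.items():
            # Create the list the first time a key is seen, then extend it.
            merged.setdefault(key, []).extend(values)
    return merged

# Invented inputs with two of the five columns, for brevity.
first = {"to": ["a", "a"], "from": ["France", "Chile"]}
second = {"to": ["a"], "from": ["Ghana"]}

toy_merged = merge_list_dicts([first, second])
print(toy_merged)
```

Because `extend` is called on a new list held in `merged`, the input dictionaries are left unchanged.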

In [None]:
def combine_outputs(dicts):
    # FILL ME OUT
    return output

Now combine the output dictionaries for the accept, examine, and reject recommendations into a single output dictionary.

In [None]:
# FILL IN THE BELOW LINE TO USE THE combine_outputs FUNCTION FROM ABOVE
output_recs = ...

# uncomment to test your code
# print(len(accept_recs) + len(examine_recs) + len(reject_recs))
# print(len(output_recs['to']))

# PART B: Repeat for all documents

We just wrote code that takes one document and turns it into a dataset!

The problem is we have 11 documents!

We'll now combine the code we've written so far to create a function that can read one document at a time, and then read all 11 documents into a single dataset.

## 5. Make a function

**task**

Combine the functions that you wrote above to create a single function that takes a filename as a parameter and returns a dictionary of lists representing all of the recommendations in that document.

**skills**
- Functions
- Copying and pasting :)

In [None]:
# complete the code below.
def process_document(filename):

    # FILL USING THE FUNCTIONS YOU'VE WRITTEN IN SECTIONS 1-4
    
    return(output_recs)

In [None]:
# test your code!
print(len(process_document("tuvalu2013.txt")['to']))

## 6. Process all of the files

**task**

1. Find the file_names in our directory.
2. Apply the function above to all the filenames
3. Create a master dataset

**skills**
- I/O
- Loops
- Functions

### 6.1 Make a list of file_names

The program below reads all the file_names in the directory `data/txts`.

In [None]:
directory = 'data/txts'
for file_name in os.listdir(directory):
    print(file_name)

Modify the program to include only the file_names that end in `.txt` by using a string method.

**hint:** We used the `.startswith()` method earlier. What do you think could work here?
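
As an illustration of filtering by suffix, this sketch uses a made-up list of names in place of `os.listdir(directory)` so that it runs without the data folder.

```python
# Invented directory listing, including files we want to skip.
sample_names = ["cotedivoire2014.txt", ".DS_Store", "notes.md", "tuvalu2013.txt"]

# Keep only the names that end in ".txt".
txt_names = [name for name in sample_names if name.endswith(".txt")]

print(txt_names)
```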

In [None]:
# YOUR CODE HERE

### 6.2 Process all the documents

Fill in the blanks below to process all the documents.

In the last line we put the recommendations from one document into a list called `output_recs`, which will hold the output dictionaries for all of the documents. We then need to combine all of those dictionaries into a single one, `output_recs_final`. We've already written a function to do this. Which one was it?

In [None]:
output_recs = []
# FILL ME OUT

In [None]:
# Should be 1709
len(output_recs_final['to'])

### 6.3 Save to file

Now we'll create a `pandas` `DataFrame` around our dataset and write it to a CSV file, and we're done!
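
Note that the import cell at the top loads `os` and `csv` but not `pandas`, so the pandas route needs `import pandas as pd` first; `pd.DataFrame(...)` accepts a dictionary of equal-length lists and `.to_csv(path, index=False)` writes it out. As a standard-library alternative (not the pandas approach this section asks for), the sketch below writes a dictionary of lists with the `csv` module. The data and the output file name are invented.

```python
import csv
import os
import tempfile

# Invented dataset in the same dictionary-of-lists shape.
sample_output = {
    "to": ["cotedivoire", "cotedivoire"],
    "from": ["France", "Chile"],
    "year": ["2014", "2014"],
    "decision": ["accept", "reject"],
    "text": ["123.1 First recommendation (France);",
             "125.1 Second recommendation (Chile);"],
}
columns = ["to", "from", "year", "decision", "text"]

# Hypothetical output location so the sketch runs anywhere.
out_path = os.path.join(tempfile.mkdtemp(), "sample_recs.csv")

with open(out_path, "w", newline="", encoding="utf8") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    # zip(*columns_of_values) turns the column lists into rows.
    writer.writerows(zip(*(sample_output[c] for c in columns)))

# Read the file back to check what was written.
with open(out_path, newline="", encoding="utf8") as f:
    rows = list(csv.reader(f))

print(rows[0])
```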

In [None]:
# FILL ME OUT