# Denison CS181/DA210 Homework

Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE".

---

# Project 1: Working with the Kiva Loans and Lenders

> Be sure to include code documentation (docstrings, inline comments) for all Python code you provide.
    
> This project is based, in part, on a case study in the book **XML and Web Technologies for Data Science with R** by Deborah Nolan and Duncan Temple Lang.

## Preliminaries

The projects in this class serve some additional objectives that reach beyond what lectures, in-class hands-on engagement, and homeworks can provide.  Some of these objectives include:

- Working with real-world data sets that have greater volume and scale,
- Getting beyond the pigeon-hole thinking that can come from solving the relatively small problems involved in homework exercises,
- Thinking about the data itself, and the information the study of that data can provide, 
- Building up a larger whole, synthesized from many of the low-level skills and operations, and
- Effectively communicating what was learned.

In a "normal" semester, our approach would be to make a project like this almost fully self-defined, and to provide only high-level guidance.  Students would use a combination of discovering what they wanted to explore and how to do so through their own investigation along with meetings and refinement with the course instructor.  In our COVID semester of Fall 2020, we provide a structured progression, documented in Parts 1 through 4 of this notebook.  

As offered in class, for extra credit applied to the quiz portion of your grade, you can "go beyond" and build upon the techniques and the data explored in this notebook and do the following:

1. Extract additional tidy tables of "useful" information from the data set,
2. Define at least two, and possibly more, interesting questions about the data,
3. Build visualizations that can help answer the questions from #2, based on the data,
4. Using best practices in writing, construct an essay that describes and documents your exploration, presents your visualizations, interprets your results, and generates a coherent conclusion.

Some additional detail on this extra credit project part is covered in the notebook `Kiva_EC.ipynb`.

To meet the objectives cited above, the data for this project comes from a non-profit organization, **Kiva**, whose goal is to make a difference for individuals that have been left behind in our global and overall prosperous economy.  I strongly encourage you to visit their website (https://kiva.org) and discover more about the source of our data for this project.

### About Kiva

Access to loans is a huge problem for many people in the world. To combat this problem, Kiva connects people in need (often in developing countries) with lenders (often, altruistic people who want to make the world a better place) in an attempt to alleviate poverty and provide opportunities.

There are over 2.5 million borrowers in 84 countries involved with Kiva. With a 97.1% repayment rate, the 1.6 million lenders have lent over 1 billion dollars. Most of the loans are very small. As Kiva is a nonprofit, it has developers worldwide and has shared its data.

The Kiva API (application programming interface, a concept we will cover later in this course) has many files and formats containing the information of loans, partners, lenders, teams, and more. For the purpose of this project, we will focus on the XML files (rather than JSON files) containing the data from 2005-2017. We will start this project by using `partners.xml`.

## Part 1: Exploring Partners

Kiva has regional partners all over the world, who work with Kiva to find lenders and to fund loans. All Kiva loans are offered by partners to borrowers, using money provided by lenders. The data on partners includes the region they are from (e.g., "Africa"), the country they are located in, their rating (which tells lenders how likely a loan through this partner is to default), their delinquency rate, and more.

Use the following cell to read and parse `partners.xml`. Do so by writing a function:

    readXML(datadir, filename)
    
that looks for `filename` in the directory `datadir` and, if found, parses the file and returns the root Element of the tree.  Your function should return `None` if the file is not found or if the found file cannot be parsed.

Then use your function on `partners.xml` and determine the number of partners in our sample.   Store your answer as a Python integer `n`.

Please study this XML file carefully, and find the tags and paths to `partner`, `rating`, `delinquency_rate`, `region`, and country name. 
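As a rough illustration of the "return `None` on failure" pattern, the sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in on two throwaway files written to a temporary directory. The name `parse_or_none` and the toy file contents are assumptions made for this example only. Later cells in this notebook call `.xpath()`, which is an `lxml` feature, and `lxml` reports malformed input through its own exception class, so treat this as a pattern rather than a finished solution.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def parse_or_none(datadir, filename):
    """Return the root Element of datadir/filename, or None if missing or malformed."""
    path = os.path.join(datadir, filename)
    if not os.path.isfile(path):
        return None
    try:
        return ET.parse(path).getroot()
    except ET.ParseError:
        # The file exists but is not well-formed XML.
        return None


# Toy demonstration in a temporary directory (not the real Kiva data).
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good.xml"), "w") as f:
        f.write("<partners><partner/><partner/></partners>")
    with open(os.path.join(tmp, "bad.xml"), "w") as f:
        f.write("<partners><partner>")  # unclosed tags

    good_root = parse_or_none(tmp, "good.xml")
    bad_root = parse_or_none(tmp, "bad.xml")
    missing_root = parse_or_none(tmp, "absent.xml")
    # Count the <partner> children of the toy root.
    n_toy = len(good_root.findall("partner"))
```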

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Delinquency Rate Histogram

Although 97.1% of people repay their loans, some do not. Loans that are not repaid are called "delinquent." The delinquency rate for a partner is calculated as the number of delinquent payments divided by the total number of loans the partner holds, and is a field for each partner in the XML.  

The next few cells will work toward building a histogram of the delinquency rate for the set of partners.

**Step 1**

A histogram provides an approximate representation of the distribution of a set of values.  It does this by taking values from a list of numbers and counting the number of instances of the values that fall into certain value ranges, called **bins**.

To perform this calculation, we first need to know the smallest value, largest value, and the difference between these two, called the *range* of the values.

Extract the set of delinquency rates from the partners tree, treat them as floating point numbers, and assign to variable `delinquincies`.  Then find the maximum value, the minimum value, and the range, assigning to `value_max`, `value_min`, and `value_range`, respectively.
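The conversion from element text to numbers can be pictured on a tiny, made-up tree. This sketch again uses the standard library's `ElementTree` rather than `lxml`, and the nesting of `delinquency_rate` directly under `partner` is an assumption of the toy data, not a statement about the real file.

```python
import xml.etree.ElementTree as ET

# Toy tree; the structure is assumed for illustration only.
toy = ET.fromstring(
    "<partners>"
    "<partner><delinquency_rate>0</delinquency_rate></partner>"
    "<partner><delinquency_rate>2.5</delinquency_rate></partner>"
    "<partner><delinquency_rate>10.0</delinquency_rate></partner>"
    "</partners>"
)

# Element text is always a string, so convert each value explicitly.
rates = [float(e.text) for e in toy.iter("delinquency_rate")]
lo, hi = min(rates), max(rates)
spread = hi - lo
```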

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
print("Value max:", value_max)
print("Value min:", value_min)
print("Value range:", value_range)

**Step 2** Perhaps unsurprisingly, we do have some partners with a zero delinquency rate.  This is helpful, because, as we think about the set of intervals that define each of our histogram bins, the low end of the first bin starts at 0, and when we compute the bin for a particular value, we don't have to adjust for a start value.

To make the bins concept more concrete, suppose we have values that range from 0 to 10, inclusive.  Further suppose that we decide we want 20 bins.  That means that the first bin would represent the value interval from 0 up to 0.5.  The second bin would represent the value interval from 0.5 up to 1.0.

Given a particular value, `x`, and the `interval` covered by each bin, we, in general, want to compute `value_bin(x, interval)` that gives the **integer** bin corresponding to the value.  This amounts to taking the floor of $x$ divided by the interval per bin, which can be accomplished by integer division in Python.  This decision implies that any $x$ that divides evenly with no remainder falls in the bin whose interval starts at $x$.  The only problem with this simple algorithm is that if $x$ is `value_max`, the division results in a bin number which is one too high.
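Using the 0-to-10, 20-bin example, the floor-division idea and the off-by-one at the maximum can be checked directly. The helper name `bin_index` and its extra `n_bins` argument are assumptions of this sketch; the `value_bin` you are asked to write has a different signature.

```python
n_bins_demo = 20
lo_demo, hi_demo = 0.0, 10.0
width_demo = (hi_demo - lo_demo) / n_bins_demo  # 0.5 per bin


def bin_index(x, width, n_bins):
    """Floor-divide x by the bin width, clamping the maximum into the last bin."""
    raw = int(x // width)         # floor of x / width
    return min(raw, n_bins - 1)   # x == maximum would otherwise give n_bins


first = bin_index(0.0, width_demo, n_bins_demo)      # start of the first bin
boundary = bin_index(0.5, width_demo, n_bins_demo)   # exact boundary goes to the next bin
top = bin_index(10.0, width_demo, n_bins_demo)       # maximum is clamped
```

Without the clamp, the maximum value would map to bin 20, which does not exist in a list of 20 counters indexed 0 through 19.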

Compute the interval per bin by dividing the value range by the variable `bins`, and assign the result to variable `interval`. Then write a function, `value_bin(x, interval)`, that returns the integer bin associated with value `x`.  Try to make this a lambda function.

In [None]:
bins = 50
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 1.11 < interval < 1.12
assert value_bin(value_max, interval) == bins-1

**Step 3** Finally, we need to create a list, L, whose length is `bins` and whose values are all zero, and then to iterate through the `delinquincies` list and, for each value, compute its bin, `b` and then increment `L[b]`.  Write code to initialize `L` and to iterate over the values in `delinquincies` and increment the counters in `L`.
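The accumulation pattern described here can be sketched on a handful of made-up values. The list `values` and the choice of four bins are assumptions for the example, not the Kiva data.

```python
values = [0.0, 0.2, 0.7, 1.0, 1.9, 2.0]   # toy data, minimum is 0
n_bins = 4
width = (max(values) - min(values)) / n_bins   # 0.5

counts = [0] * n_bins          # one zeroed counter per bin
for v in values:
    b = min(int(v // width), n_bins - 1)   # clamp the maximum into the last bin
    counts[b] += 1
```

A quick sanity check on any histogram built this way is that the counters sum to the number of input values.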

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Step 4** The use of global variables for all of our piecewise steps above is poor programming practice.  Encapsulate the set of steps above, and write a function

    delinquency(rootElement, bins=50)
    
that creates, computes, and returns a list of histogram counts.  It should start with the root `etree Element` given by `rootElement` and perform all the steps **without relying on any global variable**.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

To check our function and the computed list of bin count values, the cell below calls the function, computes the associated x-values (using the previously computed interval) and displays a bar graph of the result.  A bar graph computed in this way is a histogram.

> For those of you who have had to create a new environment based on Python 3.7, you may have to go into the Environments section of Anaconda Navigator and add the `matplotlib` package to your cs181 environment.

In [None]:
import matplotlib.pyplot as pyplot

partner_root = readXML("project_kiva_xmlFiles", "partners.xml")
assert partner_root is not None

num_bins = 50
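# Histogram counts from the function written in Step 4
bin_values = delinquency(partner_root, num_bins)

# Bin width (relies on the global value_range from Step 1) and bin midpoints
interval = value_range / num_bins
x_values = [i * interval + interval/2 for i in range(num_bins)]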
pyplot.bar(x_values, bin_values)
pyplot.show()

**Q** In the following cell, **interpret** the histogram, describing why you see high bars where you see them, why you see low bars where you see them, the meaning of the y axis, and the meaning of the x axis.

YOUR ANSWER HERE

The `pyplot` module in the `matplotlib` package has its own algorithm for computing and displaying a histogram based only on the value list, `delinquincies` in this case.  The next cell simply demonstrates this shortcut.

In [None]:
# This cell relies on the global delinquincies that was to be computed in Step 1

pyplot.hist(delinquincies,bins=50)
pyplot.show()

### Partner Rating

The `rating` of a partner tells how likely they are to default. A low rating means they have a high risk of defaulting, i.e., not paying back loans. A high rating means they have a low risk of defaulting. Write a function

    delinquency_rating(rootElement)

that computes and creates a scatterplot showing the relationship between ratings (on the x-axis) and delinquency rate (on the y-axis). Be sure to label your axes. Warning: omit (x,y) pairs for partners that do not have a rating.

So your function has to:

1. Obtain the x values from the XML and convert to the correct type.
2. Obtain the y values from the XML and convert to the correct type.
3. Filter the data to omit (x, y) pairs for partners that do not have a rating.
4. Build the scatter plot.
5. Invoke functions to set the title and axes.
6. Show the plot.
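Steps 1 through 3 can be sketched without any plotting, using made-up (rating, delinquency rate) text pairs in place of values read from the XML. The `raw` list is an assumption for illustration; the `"Not Rated"` marker is the one mentioned later in this notebook.

```python
# Toy text values as they might come out of the XML (assumed for illustration).
raw = [("4.5", "0.0"), ("Not Rated", "12.3"), ("2.0", "8.1")]

pairs = []
for rating_text, rate_text in raw:
    try:
        x = float(rating_text)   # "Not Rated" raises ValueError here
    except ValueError:
        continue                 # drop the whole (x, y) pair, not just x
    pairs.append((x, float(rate_text)))

xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
```

Filtering the pair as a unit keeps `xs` and `ys` the same length, which a scatter plot requires.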

References:

- [`pyplot.scatter`](https://matplotlib.org/3.3.2/api/_as_gen/matplotlib.pyplot.scatter.html)
- [`pyplot.title`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.title.html)
- [`pyplot.xlabel`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.xlabel.html)
- [`pyplot.ylabel`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.ylabel.html)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Q** Interpret the scatter plot, discussing the x axis, the y axis, and why you see the relationship you do between x values and y values.  Is there another visualization that would be better at showing the distribution of values *for a given rating*?

YOUR ANSWER HERE

### Regions and Countries of Partners

Kiva is successful in large part due to its ability to connect regional lenders to borrowers. Simulating this process, write a function,

    findPartner(rootElement, region)

that, given a region, returns the name of a partner. If there are multiple partners in a region, return the name of a random partner. There should only be one XPath query in your function.  To refresh how to select a random number in a given integer range, see the `random` module: https://docs.python.org/3.7/library/random.html and the `randint` function.
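One way to picture "a single query, then a random pick" is the sketch below, which uses the limited XPath subset of the standard library's `ElementTree` on a toy tree. The function name `pick_name` and the flat placement of `name` and `region` under `partner` are assumptions; the real file may nest these differently, and `lxml`'s `.xpath()` accepts richer expressions.

```python
import random
import xml.etree.ElementTree as ET

# Toy tree with an assumed, simplified structure.
toy = ET.fromstring(
    "<partners>"
    "<partner><name>A</name><region>Africa</region></partner>"
    "<partner><name>B</name><region>Asia</region></partner>"
    "<partner><name>C</name><region>Africa</region></partner>"
    "</partners>"
)


def pick_name(root, region):
    """Return the name of a randomly chosen partner in `region`, or None if there is none."""
    # One path expression selects the <name> of every partner in the region.
    matches = root.findall(f".//partner[region='{region}']/name")
    if not matches:
        return None
    # randint is inclusive at both ends, hence len(matches) - 1.
    return matches[random.randint(0, len(matches) - 1)].text
```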

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
partner_root = readXML("project_kiva_xmlFiles", "partners.xml")
assert partner_root is not None

print("Africa:", findPartner(partner_root, "Africa"))
print("South America:", findPartner(partner_root, "South America"))

#### Countries

According to Kiva, they have partners for 84 countries. Where are the partners from in our example sample? How many **different** countries does our sample contain? To answer these questions, make a dictionary that maps from country keys to partner counts.  Call your resulting dictionary `cPartners`.  This process is analogous to making a frequency dictionary like we often do in the intro class, and repeated at the beginning of this semester.
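As a reminder of the frequency-dictionary pattern, here is a minimal sketch over a made-up list of country names (the list is an assumption; in the project the names come from the XML).

```python
countries = ["Kenya", "Peru", "Kenya", "Ghana", "Peru", "Kenya"]  # toy input

freq = {}
for c in countries:
    # get() supplies 0 the first time a country is seen.
    freq[c] = freq.get(c, 0) + 1

unique = len(freq)           # number of distinct keys
total = sum(freq.values())   # number of entries counted overall
```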

In this part, I am allowing you to use global cell computations instead of putting things into a function.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Now use your dictionary to determine how many **unique** countries there are (the length of the dictionary), and then, by summing the set of values, the total number of countries associated with partners.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Write a few sentences saying how to **interpret** what was meant by the Kiva claim of 84 countries (at least based on our sample data from 2017).

YOUR ANSWER HERE

## Part 2: Building Tabular Data from XML

XML is meant to provide flexible organization and to transmit data. While it is well suited to that job, it cannot handle advanced queries. Thus, we often need to traverse the XML file and build two-dimensional subsets of the data into `pandas` data frames in order to do any further analysis. Look at `partners.xml` and, using what you learned in the previous part, identify one or more logical functional dependencies (FDs), with the independent variable(s) on the left side of the FD and the dependent variables on the right side.

YOUR ANSWER HERE

### Partner Data

We are ready to extract partner data into a `pandas` data frame. 

**Step 1**

Following our best programming practice guidelines, we will do this work *incrementally*, building up a solution step-by-step. Write a function

    genPartnerRow(partnerElement)

that constructs a single row of data about an individual partner based on a `partnerElement`, which would be an XML element rooted at a `<partner>` node.  This building block could either construct and return a dictionary mapping column fields to values, or it could construct a list of column values in a known order.  In either case, the idea is that we can iterate over the set of partners and, for each one, invoke this function and accumulate the row-result into a collection that will be used to generate a data frame.  But for now, we are focusing on a single row.

You get to decide which fields make sense for this table based on the functional dependency(ies) you cited above and the desire to have a normalized (tidy) result where each row represents a single partner.  I do require you to include a `rating` column and to handle the case where a partner is `"Not Rated"` (i.e. this is missing data for this observation).

The data types for your fields should be appropriate for the given column variable.
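The dictionary-per-row idea, including one way to treat `"Not Rated"` as missing, might look like the sketch below. The function name `toy_row`, the chosen fields, and the flat toy elements are assumptions for illustration, and the standard library's `ElementTree` stands in for `lxml`.

```python
import math
import xml.etree.ElementTree as ET


def toy_row(elem):
    """Build a dict for one <partner>-like element, mapping a non-numeric rating to NaN."""
    try:
        rating = float(elem.findtext("rating"))
    except (TypeError, ValueError):
        # TypeError: the tag is absent (findtext gave None);
        # ValueError: text such as "Not Rated" is not a number.
        rating = math.nan
    return {
        "id": int(elem.findtext("id")),
        "name": elem.findtext("name"),
        "rating": rating,
    }


rated = ET.fromstring("<partner><id>7</id><name>Alpha</name><rating>3.5</rating></partner>")
unrated = ET.fromstring("<partner><id>8</id><name>Beta</name><rating>Not Rated</rating></partner>")
row_a = toy_row(rated)
row_b = toy_row(unrated)
```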

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
partner_root = readXML("project_kiva_xmlFiles", "partners.xml")
assert partner_root is not None

firstpartner = partner_root.xpath("//partners/partner[1]")[0]
row = genPartnerRow(firstpartner)
print(row)

**Step 2**

Now that we can construct row data for a partner Element, write a function

    genPartnerData(datadir, filename)
    
that, given a data directory and filename, uses our helper functions `readXML()` and `genPartnerRow()` and constructs and returns a `pandas` data frame, where each row is a partner. Make sure to include meaningful column headers.
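If the row builder returns dictionaries, a list of them converts directly into a data frame whose column headers are the dictionary keys. The two rows below are made up for the sketch.

```python
import math
import pandas as pd

# Toy rows of the kind a row-building helper might return (values are invented).
rows = [
    {"id": 1, "name": "Alpha", "rating": 4.5},
    {"id": 2, "name": "Beta", "rating": math.nan},
]

df = pd.DataFrame(rows)               # keys become column headers
missing = df["rating"].isna().sum()   # NaN counts as missing data
```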

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Remember that the `info()` method summarizes information about the columns of a data frame, including a count of the non-missing entries.  Use this in the cell below to see which columns have missing data and how many.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Partner Location

It is time to move on to a second table derived from our partner data. Repeat the two-step process above to generate a `pandas` data frame, called `partnerLocs`, of the partner country locations. **Hint:** Some partners have more than one location.  That means that a table with a partner id and the set of country location information can have more than one row for a given partner.

As exemplified in the code following the solution, I chose to define `genLocData()` and to give it an argument of a root Element, as opposed to passing a data directory and filename again.  You can design your functions as you desire.
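The one-partner-to-many-locations shape can be sketched with a nested iteration that emits one row per (partner, country) pair. The tag layout in this toy tree (`countries/country/name`) is an assumption for illustration and should be checked against the real file.

```python
import xml.etree.ElementTree as ET

# Toy tree: partner 1 has two countries, partner 2 has one (structure assumed).
toy = ET.fromstring(
    "<partners>"
    "<partner><id>1</id><countries>"
    "<country><name>Kenya</name></country>"
    "<country><name>Uganda</name></country>"
    "</countries></partner>"
    "<partner><id>2</id><countries>"
    "<country><name>Peru</name></country>"
    "</countries></partner>"
    "</partners>"
)

# One output row per country, repeating the partner id as needed.
loc_rows = [
    {"partner_id": int(p.findtext("id")), "country": c.findtext("name")}
    for p in toy.findall("partner")
    for c in p.findall("countries/country")
]
```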

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
partner_root = readXML("project_kiva_xmlFiles", "partners.xml")
assert partner_root is not None

partnerLocs = genLocData(partner_root)
partnerLocs.head(12)

### Loans

Now that we are familiar with the partners and have created a `pandas` data frame, let's do the same for the loans of borrowers. There are 10 files containing information on a total of 1,000 loans. These are called `loanSearch1.xml`, `loanSearch2.xml`, ..., `loanSearch10.xml`. The reason for multiple files is that the API to search for loans limits the number of returned results to a single logical **page**.  In this case, a page contains information on 100 loans, and the process to obtain results was repeated 10 times to get the first 10 pages of loan information.  This is actually a small subset of all the loans in the full data set (just 10 pages out of nearly 13,000), but it is sufficient for our purposes in this project.

For each loan, Kiva provides data on a unique id for the loan, the loan amount, where the loan was made, which partner the loan was made through, the name attached to the loan, the status of the loan (e.g., whether it's in the fundraising stage or already funded), the activity the loan was made to support, the economic sector of that activity, and more. 

Beware: not every loan has all fields, so our processing must handle those cases and make sure the values for missing data result in `numpy.nan`. (The Python `None` value may also work so that, when the data structure is made into a `pandas` data frame, the values are considered missing relative to methods like `dropna()`, but I have not verified this.)
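A small experiment along these lines shows how `pandas` treats the two kinds of missing marker. The toy frame is an assumption for the sketch, and the behavior should be confirmed on your own data and `pandas` version.

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN in a numeric column and one None in a text column.
df = pd.DataFrame({
    "loan_amount": [500.0, np.nan, 250.0],
    "name": ["Ana", "Ben", None],
})

per_column_missing = df.isna().sum()   # missing entries per column
kept = df.dropna()                     # rows with no missing entries at all
```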

**Step 1: Explore**

In the following cell, along with any additional cells you would like to add, use `lxml` (and `xpath`) to explore the data.  See what the min/max and/or distribution of fields like `loan_amount` are.  Ask questions like "Are the funded amount and the loan amount always the same?" or "How do they differ?", and then use our XML operations to answer the questions.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Step 2: Build a Tabular Representation**

Our goal is to build a tabular representation for loans, wherein each row is a loan and we have column fields that are rich enough to ask meaningful questions about the data.  While you can include the column fields that you want, you should definitely include the partner id, as it allows us to "tie together" loans with their corresponding partners.

I would **strongly** suggest mimicking the "good software development process" that you were led through above:

1. Develop a function to obtain the desired fields for a single row representing a loan.
2. Develop a function to iterate over a single XML tree collecting a set of rows into a composite structure.

and, because we have multiple files, and therefore multiple trees, we need to add a higher level function:

3. Develop a function to iterate over a set of files and, for each one, construct a tree and use the function from (2) to get a collection of rows and, finally, aggregate these collections into a single composite data frame.

Call this top level function

    genLoans(datadir, filename_prefix, num_files)
    
where 

- `datadir` is the directory where the XML files will be found,
- `filename_prefix` is the prefix used for naming the collection of files, not including the number that specifies a specific file, nor the `.xml` extension, and
- `num_files` is the number of files in the collection.
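The way these arguments combine into concrete file paths can be sketched as follows. The helper name `toy_paths` is an assumption for illustration, and the sketch assumes files are numbered from 1, as in `loanSearch1.xml`.

```python
import os


def toy_paths(datadir, filename_prefix, num_files):
    """Return the paths prefix1.xml ... prefixN.xml inside datadir (numbering from 1)."""
    return [
        os.path.join(datadir, f"{filename_prefix}{i}.xml")
        for i in range(1, num_files + 1)
    ]


paths = toy_paths("project_kiva_xmlFiles", "loanSearch", 10)
```

Each path could then be parsed in turn and the per-file rows accumulated; `pandas.concat` is one option for stacking per-file data frames into a single one.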

Before you begin the code, answer, in the Markdown cell that follows, **Why have we defined `genLoans` as we have?**  Think about what purpose it serves to have the arguments that have been specified.

YOUR ANSWER HERE

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(loans)==1000

### Lenders

In addition to partners and borrowers, we also have data on the lenders, who provide the funds to borrowers **through** the  partners. The file `lenders.xml` contains data on lender names, whereabouts, the country they live in, their occupation, how many loans they have made, and a narrative about why they loan money through Kiva. Create a `pandas` data frame with one lender per row and whichever columns you are interested in exploring. Beware that not every lender has data in every field, so handle missing data appropriately.

Call the top level function:

    genLenders(datadir, filename="lenders.xml")
    
where `datadir` and `filename` have the obvious meaning.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Part 3: Analysis in Pandas

Using the `pandas` data frames you created, answer the following questions:

**Question 1:** What are the 5 countries that get the largest mean loan amount? Does your answer make sense? Why is this the case? You may use `isoCode` to represent country.  Use the Markdown cell after the code cell for your interpretation.  Feel free to experiment and ask questions of the data to further your understanding of your result.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

**Question 2**: Aggregate by country and determine both the number of loans and the mean loan amount.  Present the result in descending order of loan amount. In your interpretation, reconcile this with  what you saw initially. You may use `isoCode` to represent country. **Hint:** It may be helpful to use the `.agg()` method here.
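To illustrate the shape of a grouped aggregation with `.agg()`, here is a sketch on an invented table. The column name `loan_amount` and the values are assumptions; only `isoCode` is taken from the question text.

```python
import pandas as pd

# Toy loans table (values invented for illustration).
toy = pd.DataFrame({
    "isoCode": ["KE", "KE", "PE", "PE", "PE", "US"],
    "loan_amount": [200, 400, 1000, 500, 600, 5000],
})

summary = (
    toy.groupby("isoCode")["loan_amount"]
    .agg(["count", "mean"])                 # two aggregates per country
    .sort_values("mean", ascending=False)   # largest mean first
)
```

In this toy table the country with the largest mean has only a single loan, which is the kind of pattern worth looking for when you interpret the real result.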

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

**Question 3:** 

Sometimes we need to combine data from multiple tables to be able to answer questions.  For instance, we may wish to analyze over a set of partners information about the loans attributed to those partners.

Create a `pandas` data frame that contains the following columns: 

- ID of partner,
- Name of partner, 
- number of loans for that partner as posted in our sample, 
- sum of their loans, and 
- average loan amount. 

With this table, we should then be able to determine the answer to the following question:

What are the names of the top 3 partners that give the largest average loan and have posted more than 20 loans in our sample?
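The combine-then-filter pattern can be sketched on two invented tables. All column names and values here are assumptions, and the threshold is lowered to "more than 1 loan" so that the toy data has something to filter.

```python
import pandas as pd

# Toy tables (invented); partner_id is the shared key.
partners_toy = pd.DataFrame({"partner_id": [1, 2, 3], "name": ["Alpha", "Beta", "Gamma"]})
loans_toy = pd.DataFrame({
    "partner_id": [1, 1, 1, 2, 2, 3],
    "loan_amount": [100, 200, 300, 1000, 3000, 50],
})

# Per-partner count, sum, and mean, using named aggregation.
stats = (
    loans_toy.groupby("partner_id")["loan_amount"]
    .agg(n_loans="count", total="sum", average="mean")
    .reset_index()
)

# Attach partner names, then filter on loan count and rank by average.
table = partners_toy.merge(stats, on="partner_id")
top = table[table["n_loans"] > 1].sort_values("average", ascending=False).head(3)
```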

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## Part 4: Export

To enable the extra credit and to enable import of our data into visualization tools like Tableau, we want to take a final step and take our `partners`, `loans` and `lenders` data and export them to CSV.  In the cell that follows, perform that export, writing `partners.csv`, `loans.csv`, and `lenders.csv` to a directory called `exports` relative to the notebook directory.
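The mechanics of writing a data frame to a CSV file in a subdirectory can be sketched as follows. To avoid leaving files behind, this example writes a made-up table into a temporary directory rather than the notebook directory; the file name `toy.csv` is an assumption.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "name": ["Alpha", "Beta"]})  # invented data

with tempfile.TemporaryDirectory() as base:
    outdir = os.path.join(base, "exports")
    os.makedirs(outdir, exist_ok=True)   # create the directory if it is absent
    path = os.path.join(outdir, "toy.csv")
    toy.to_csv(path, index=False)        # index=False omits the row-label column

    with open(path) as f:
        header = f.readline().strip()
    round_trip = pd.read_csv(path)       # read back to confirm the contents
```

Whether to keep the index in the export is a design choice; dropping it is common when the row labels carry no information.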

In [None]:
# YOUR CODE HERE
raise NotImplementedError()