# Empirical Project 9

## Getting Started in Python

Head to the "Getting Started in Python" page for help and advice on setting up a Python session to work with. Remember, you can run any page from this book as a *notebook* by downloading the relevant file from this [repository](https://github.com/aeturrell/core_python) and running it on your own computer. Alternatively, you can run pages online in your browser over at [Binder](https://mybinder.org/v2/gh/aeturrell/core_python/HEAD).

## Preliminary Settings

Let's import the packages we'll need and also configure the settings we want:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pingouin as pg
from lets_plot import *

LetsPlot.setup_html(no_js=True)

## Python Walkthrough 9.1

**Importing data into Python**

Ensure that the Excel file you downloaded is in a sub-directory of your working directory called "data".

Before importing the data, open it in Excel to look at its structure. You can see there are three tabs: ‘Data dictionary’, ‘All households’, and ‘Got loan’. We will import them into separate dataframes (`data_dict`, `all_hh`, and `got_l` respectively). We import the ‘Data dictionary’ so that we do not have to return to the Excel spreadsheet.

Also note that there are a lot of empty cells, which is how missing data is coded in Excel (but not in Python). By default, `pd.read_excel` already reads empty cells as NA, so we don't need to specify this. Note that this particular Excel file has some issues that mean **pandas** will warn you about an "unknown extension": an Excel file is actually a bundle of files tied up to look like one file, and what's happened here is that **pandas** doesn't recognise one of the files in the bundle, but we can still happily get at the data we need in the worksheets.

In [None]:
data_dict = pd.read_excel(
    Path("data/doing-economics-working-in-excel-project-9-datafile.xlsx"),
    sheet_name="Data dictionary",
)

all_hh = pd.read_excel(
    Path("data/doing-economics-working-in-excel-project-9-datafile.xlsx"),
    sheet_name="All households",
)

got_l = pd.read_excel(
    Path("data/doing-economics-working-in-excel-project-9-datafile.xlsx"),
    sheet_name="Got loan",
)

Now let’s look at the variable types in `all_hh` and `got_l` using the `.info()` method.

In [None]:
all_hh.info()

In [None]:
got_l.info()

It is important to ensure that all variables we expect to be numerical (numbers) show either int or float as their 'Dtype', and in this case, they are. You can see that there are many variables that are coded as object variables because they are text (for example gender or region), but since we can use these variables to group data by category, we will use `.astype("category")` to change them into categorical variables for later use.

Instead of converting each object variable to a categorical variable individually, we can use the `select_dtypes` method to find *all* columns that are currently of type object and then convert those columns specifically.

In [None]:
cols = all_hh.select_dtypes("object").columns
all_hh[cols] = all_hh[cols].astype("category")

cols = got_l.select_dtypes("object").columns
got_l[cols] = got_l[cols].astype("category")

Making use of `.info()` will now show you that the remaining object columns have been "recast" to be of data type "category".
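
To see the recasting step in isolation, here is a minimal sketch on a small made-up dataframe (the values are invented for illustration and are not taken from the survey data). The text columns are created with `dtype="object"` to mimic how text is typically stored after reading an Excel file.

```python
import pandas as pd

# Made-up data: two text columns (stored as object) and one numeric column
toy = pd.DataFrame(
    {
        "region": pd.Series(["North", "South", "North"], dtype="object"),
        "gender": pd.Series(["Female", "Male", "Female"], dtype="object"),
        "hhsize": [4, 6, 3],
    }
)

# Find the object columns and recast only those to "category"
text_cols = toy.select_dtypes("object").columns
toy[text_cols] = toy[text_cols].astype("category")

# The text columns are now categorical; the numeric column is unchanged
print(toy.dtypes)
```

Only the columns picked out by `select_dtypes` are converted, so numeric columns keep their original type.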

## Python Walkthrough 9.2

**Creating summary tables**

In order to get the proportions of households in each region living in large towns, small towns, or rural areas (encoded in the variable `rural`), we use the `pd.crosstab` function to create a cross-tabulation. Without any further options, `pd.crosstab` would produce counts of households in the respective regions and area types. However, we can pass a keyword argument, `normalize`, to turn the counts into proportions by, for example, rows or columns.

In the below, we use `normalize="index"` to ensure that each row sums to one. We also use `.round(3)` to limit the number of decimal places that are saved in the output data frame.

Remember that if you ever want to know more about a function, you can hover over it in Visual Studio Code and see its options, including the keyword arguments it takes. Or you can run `help(function-name)`.

In [None]:
stab_one = pd.crosstab(all_hh["region"], all_hh["rural"], normalize="index").round(3)
stab_one
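
As a quick check on what `normalize` does, the sketch below builds a tiny made-up dataframe (not the survey data) and compares raw counts with row and column proportions.

```python
import pandas as pd

# Made-up data: five households in two regions
toy = pd.DataFrame(
    {
        "region": ["A", "A", "A", "B", "B"],
        "rural": ["Rural", "Rural", "Small town", "Rural", "Large town"],
    }
)

# Raw counts, row proportions, and column proportions
counts = pd.crosstab(toy["region"], toy["rural"])
row_props = pd.crosstab(toy["region"], toy["rural"], normalize="index")
col_props = pd.crosstab(toy["region"], toy["rural"], normalize="columns")

print(counts)
print(row_props.round(3))
print(col_props.round(3))
```

With `normalize="index"` every row sums to one, whereas `normalize="columns"` makes every column sum to one.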

We may not always want a cross-tab. For the case where we simply want an overall percentage, we can use the `.value_counts` method with `normalize=True`.

Let’s use a similar approach to calculate the percentage of households with women as head of the family (encoded in the variable `gender`).

In [None]:
stab_two = all_hh["gender"].value_counts(normalize=True).round(3)
stab_two

As shown, 30.4% of households have a head of the family who is a woman.

We need to provide summary statistics for a range of variables. Most of these variables are numeric variables, but one, gender, is a categorical variable. We can distinguish these types using the `.select_dtypes` method.

There are a few different ways to get summary statistics on a data frame. **pandas** has a built-in method called `.describe` which produces different information depending on whether the given columns are numeric or not. Here it is running on just numeric columns:

In [None]:
all_hh.select_dtypes("number").describe().round(2)

If we use it on a categorical variable, `gender`, we get information relevant to that type of data:

In [None]:
all_hh["gender"].describe()

There are, of course, lots of different types of data. A more powerful summary of a data frame is provided by a package called **skimpy** (which you can install by running `pip install skimpy` on the command line). Here is **skimpy**'s `skim` function applied to `all_hh`:

In [None]:
from skimpy import skim

skim(all_hh)

Now we can see lots of information on both the categorical and numeric data types!

## Python Walkthrough 9.3

**Making frequency tables for loan applications and outcomes**

The easiest way to make a frequency table is to use the `pd.crosstab` function. Note that we use the `dropna=False` option (and "All" minus the sum of "No" and "Yes" gives the number of NAs), and the `margins=True` option to give those totals.

In [None]:
stab_three = pd.crosstab(
    all_hh["did_not_apply"], all_hh["loan_rejected"], dropna=False, margins=True
)
stab_three

Now we will do some data cleaning and re-examine the table. We make two changes:

- We exclude the households that indicated that they did not apply for a loan but also indicated that they were refused a loan, because that combination of answers is nonsensical. This excludes more than 10% of the households that indicated that they were refused a loan. We will use the `~` operator, which means "not" the logic that follows.
- We also remove all observations that have missing data for either of these two questions (by relying on the default value of `dropna=True`).



In [None]:
all_hh_c = all_hh.loc[
    ~(
        (all_hh["loan_rejected"] == "Yes")
        & (all_hh["did_not_apply"] == "Did not apply")
    ),
    :,
].copy()
stab_four = pd.crosstab(
    all_hh_c["did_not_apply"], all_hh_c["loan_rejected"], margins=True, normalize="all"
).round(3)
stab_four

In the above, we used the classic syntax for accessing rows and columns, `.loc[rows, columns]`. Because we wanted *all* columns, we used `:`. We also created a new set of data that we'll follow around as `all_hh_c` by using `.copy()`. This makes `all_hh_c` an independent dataframe, so later modifications to it cannot affect `all_hh` (and **pandas** will not warn us that we might be modifying a slice of another dataframe).
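
The combination of `&`, `~`, and `.copy()` can be tried out on a small made-up dataframe. This is only a sketch with invented rows; the column names mirror the ones used in this walkthrough.

```python
import pandas as pd

# Made-up answers: the second row is the nonsensical combination
toy = pd.DataFrame(
    {
        "did_not_apply": ["Applied", "Did not apply", "Did not apply", "Applied"],
        "loan_rejected": ["Yes", "Yes", "No", "No"],
    }
)

# True only where both conditions hold at once
inconsistent = (toy["loan_rejected"] == "Yes") & (
    toy["did_not_apply"] == "Did not apply"
)

# ~ flips True and False, so we keep every row that is NOT inconsistent
toy_c = toy.loc[~inconsistent, :].copy()

# Adding a column to the copy leaves the original untouched
toy_c["flag"] = "kept"
print(toy_c)
print(toy.columns.to_list())
```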

## Python Walkthrough 9.4

**Creating variables to classify households**

Let’s first create the `hh_status` variable. We set the values of `hh_status` to "not applied", then use logical indexing to change all entries where households applied for a loan (`all_hh_c["did_not_apply"] == "Applied"`) and were accepted (`all_hh_c["loan_rejected"] == "No"`) to "successful", and households who were denied (`all_hh_c["loan_rejected"] == "Yes"`) to "denied".



In [None]:
# This is the default category and creates the new column
all_hh_c["hh_status"] = "not applied"

all_hh_c.loc[
    (all_hh_c["did_not_apply"] == "Applied") & (all_hh_c["loan_rejected"] == "No"),
    "hh_status",
] = "successful"
all_hh_c.loc[
    (all_hh_c["did_not_apply"] == "Applied") & (all_hh_c["loan_rejected"] == "Yes"),
    "hh_status",
] = "denied"

# Change from a string variable to a categorical variable
all_hh_c["hh_status"] = all_hh_c["hh_status"].astype("category")

Let's have a look at the frequencies of the different outcomes:

In [None]:
all_hh_c["hh_status"].value_counts()

Now we will continue by using the same steps to make the `discouraged_borrower` variable.

In [None]:
# This is the default category and creates the new column
all_hh_c["discouraged_borrower"] = "No"

all_hh_c.loc[
    (all_hh_c["reason_not_apply1"] == "Believe Would Be Refused"),
    "discouraged_borrower",
] = "Yes"
all_hh_c.loc[
    (all_hh_c["reason_not_apply2"] == "Believe Would Be Refused"),
    "discouraged_borrower",
] = "Yes"

# Change from a string variable to a categorical variable
all_hh_c["discouraged_borrower"] = all_hh_c["discouraged_borrower"].astype("category")

all_hh_c["discouraged_borrower"].value_counts()
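
The "set a default, then overwrite with `.loc`" pattern used above for `hh_status` can also be written in a single step. The sketch below uses `np.select` from **numpy** on made-up rows; it is an alternative to the approach in this walkthrough rather than the method used in it.

```python
import numpy as np
import pandas as pd

# Made-up answers, including missing values
toy = pd.DataFrame(
    {
        "did_not_apply": ["Applied", "Applied", "Did not apply", None],
        "loan_rejected": ["No", "Yes", None, None],
    }
)

applied = toy["did_not_apply"] == "Applied"

# Conditions are checked in order; the first one that is True wins
conditions = [
    applied & (toy["loan_rejected"] == "No"),
    applied & (toy["loan_rejected"] == "Yes"),
]
labels = ["successful", "denied"]

# Rows matching no condition get the default label
toy["hh_status"] = np.select(conditions, labels, default="not applied")
toy["hh_status"] = toy["hh_status"].astype("category")
print(toy)
```

Rows with missing answers fail both conditions and therefore fall into the default category, just as they do with the `.loc` approach.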

To make the `credit_constrained` variable, we use the `.cat.categories` property to check all the possible answers to the `reason_not_apply1` variable. We store these answers in the object `sel_ans`. (To convert from an index, a special type of **pandas** object, to a simple Python list, we put the expression on the right-hand side within a `list(..)`.)

In [None]:
sel_ans = list(all_hh_c["reason_not_apply1"].cat.categories)
sel_ans

Of these reasons, only reasons 4 (Have Adequate Farm) and 7 (Other) do not lead to a conclusion that a household is credit constrained, so we remove them from `sel_ans`. Note that this numbering starts from 0.

In [None]:
sel_ans.remove("Have Adequate Farm")
sel_ans.remove("Other (Specify)")
sel_ans

Households that did not provide any reasons are classified as *not* credit constrained. We'll take this as our default category and use it to create the column.

In [None]:
all_hh_c["credit_constrained"] = "No"

all_hh_c.loc[all_hh_c["reason_not_apply1"].isin(sel_ans), "credit_constrained"] = "Yes"
all_hh_c.loc[all_hh_c["reason_not_apply2"].isin(sel_ans), "credit_constrained"] = "Yes"

# let's turn this into a categorical variable
all_hh_c["credit_constrained"] = all_hh_c["credit_constrained"].astype("category")
all_hh_c["credit_constrained"].value_counts()

The use of `.isin` in the selection criterion is a very useful programming technique that you can use to select data according to a list of values. In this case, `sel_ans` contains all the answers that we associate with a credit-constrained household.

`all_hh_c["reason_not_apply1"].isin(sel_ans)` gives an outcome of `True` if the value taken by an entry in the column `reason_not_apply1` is one of the values in `sel_ans`. For those observations, the code above sets the value of `credit_constrained` to "Yes".
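
Here is a minimal sketch of `.isin` on a made-up categorical column. Three of the answers are taken from the text above, while "Too Expensive" is a hypothetical answer invented for this illustration.

```python
import pandas as pd

# Made-up answers; None stands for a household that gave no reason
reasons = pd.Series(
    [
        "Believe Would Be Refused",
        "Other (Specify)",
        None,
        "Have Adequate Farm",
        "Too Expensive",  # hypothetical answer, for illustration only
    ],
    dtype="category",
)

# Start from all possible answers, then drop the ones we do not want
sel_ans = list(reasons.cat.categories)
sel_ans.remove("Have Adequate Farm")
sel_ans.remove("Other (Specify)")

# True where the answer is one of the selected values
mask = reasons.isin(sel_ans)
print(mask)
```

Missing answers are never "in" the list, so they come out as `False`, which matches the choice to treat households without a stated reason as not credit constrained.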

In [None]:
stab_four = pd.crosstab(
    all_hh_c["credit_constrained"],
    all_hh_c["discouraged_borrower"],
    margins=True,
    normalize="all",
).round(3)
stab_four

Let's look at the frequencies of the different reasons to not apply, in order.

In [None]:
all_hh_c["reason_not_apply1"].value_counts(normalize=True).round(3)

In [None]:
all_hh_c["reason_not_apply2"].value_counts(normalize=True).round(3)

## Python Walkthrough 9.5

**Making frequency tables to compare proportions**

Some of the data is in the `all_hh` dataset, while the rest is in the `got_l` dataset, both of which we imported in Python Walkthrough 9.1. We will combine that information into one new dataset called `loan_data`, which we then use to produce the table.

In [None]:
sel_all_hh_c = all_hh_c.loc[all_hh_c["hh_status"].isin(["successful", "denied"])]

pd.crosstab(
    sel_all_hh_c["loan_purpose"], sel_all_hh_c["hh_status"], margins=True, dropna=False
).round(3)

This reveals a particular feature of the data: it doesn't contain as much useful information as we'd like! Most successful loans actually have "NA" in the box for loan purpose (bar one which has "Expanding Business"). The cross-tab above shows us that there were 1363 successful loans but only one with a purpose.

There is more useful information on loan purpose in the `got_l` data, so we will extract the `loan_purpose` variable for unsuccessful households from the `all_hh_c` dataset, and the equivalent information for successful borrowers from the `got_l` dataset.

In [None]:
# Select unsuccessful households from all_hh_c
loan_no = all_hh_c.loc[all_hh_c["hh_status"] == "denied", ["loan_purpose", "hh_status"]]

# Select loan purpose for successful households from got_l
loan_yes = got_l.loc[got_l["got_loan"] == "Yes", ["loan_purpose"]]
loan_yes["hh_status"] = "successful"

# combine the data through concatenation
loan_data = pd.concat([loan_yes, loan_no])

# let's look at the values
pd.crosstab(
    loan_data["loan_purpose"], loan_data["hh_status"], normalize="columns"
).round(3)
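
The stacking step can be seen more clearly on two tiny made-up dataframes. The loan purposes below are invented for illustration; only the pattern (add a status column, concatenate, cross-tabulate by column) follows the code above.

```python
import pandas as pd

# Made-up data for denied applicants (status already recorded)
denied = pd.DataFrame(
    {
        "loan_purpose": ["Buy Inputs", "Consumption"],
        "hh_status": ["denied", "denied"],
    }
)

# Made-up data for households that got a loan (no status column yet)
granted = pd.DataFrame(
    {"loan_purpose": ["Buy Inputs", "Buy Inputs", "Buy Inputs", "Consumption"]}
)
granted["hh_status"] = "successful"

# Stack the rows; ignore_index=True gives the result a fresh 0..n-1 index
combined = pd.concat([granted, denied], ignore_index=True)

# Share of each purpose within each status (columns sum to one)
shares = pd.crosstab(
    combined["loan_purpose"], combined["hh_status"], normalize="columns"
)
print(shares.round(3))
```

Passing `ignore_index=True` is optional here; without it, the combined dataframe keeps the original row labels from both sources, which can contain duplicates.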

## Python Walkthrough 9.6

**Calculating differences in household characteristics**

Here we show how to get average characteristics conditional on `"hh_status"` using the mean function.


In [None]:
# Show the number of observations in each category
all_hh_c["hh_status"].value_counts()

Now let's look at mean household size conditional on credit status:

In [None]:
all_hh_c.groupby("hh_status").mean(numeric_only=True)["hhsize"].round(2)

What about the mean max_education of household head by credit status?

In [None]:
all_hh_c.groupby("hh_status").mean(numeric_only=True)["max_education"].round(2)

As before, we can use cross-tabs to see the number of observations in each category:

In [None]:
pd.crosstab(
    all_hh_c["rural"],
    all_hh_c["hh_status"],
).round(3)

To get even more breakdowns, you can add more variables to the "groupby".

Let's see an example with mean household size by the other variables rural and credit status.

In [None]:
all_hh_c.groupby(["hh_status", "rural"]).mean(numeric_only=True)["hhsize"].round(2)

Or, here, the mean number of working age adults by the rural and credit variables:

In [None]:
all_hh_c.groupby(["hh_status", "rural"]).mean(numeric_only=True)[
    "working_age_adults"
].round(2)

## Python Walkthrough 9.7

**Calculating confidence intervals and adding them to a chart**

To repeat the same set of calculations for a list of variables, first we create a list of these variables (called `sel_var`).

In [None]:
sel_var = [
    "age",
    "max_education",
    "number_assets",
    "hhsize",
    "young_children",
    "working_age_adults",
]

Now we use the `age` variable as an example, removing the 'not applied' entries.

In [None]:
stats_5 = (
    all_hh_c.groupby("hh_status")["age"]
    .agg(["mean", "count", "std"])
    .drop("not applied")
    .round(2)
)
stats_5

Now we use the t-test function from the **pingouin** package to calculate the difference between the successful group (`sel_success`) and the denied borrowers (`sel_denied`).

**pingouin**'s t-test function is called `ttest`.

In [None]:
import pingouin as pg

# Select the age variable (aka sel_var[0]) for successful and
# denied borrowers
sel_success = all_hh_c.loc[all_hh_c["hh_status"] == "successful", sel_var[0]]
sel_denied = all_hh_c.loc[all_hh_c["hh_status"] == "denied", sel_var[0]]

# do the t-test. Default confidence is 0.95, but include it here to be explicit
pg.ttest(x=sel_success, y=sel_denied, confidence=0.95).round(3)

The output of this test indicates whether the ages of the two groups are statistically different (here, they are).
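
To build intuition for where such a confidence interval comes from, the sketch below computes a large-sample (normal approximation) interval for a difference in means by hand. The function `mean_diff_ci` and the ages are made up for illustration. **pingouin** bases its interval on the t distribution, so its numbers will not match this approximation exactly, especially in small samples.

```python
import math
from statistics import NormalDist

import pandas as pd


def mean_diff_ci(x, y, confidence=0.95):
    """Normal-approximation confidence interval for mean(x) - mean(y).

    Hypothetical helper for illustration; not part of pingouin.
    """
    x = pd.Series(x).dropna()
    y = pd.Series(y).dropna()
    diff = x.mean() - y.mean()
    # Standard error of the difference, allowing unequal variances
    se = math.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    # Critical value from the standard normal (about 1.96 for 95%)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


# Made-up ages for two small groups
successful_ages = [34, 45, 52, 41, 38, 47, 50, 36]
denied_ages = [29, 33, 40, 31, 35, 28, 37, 30]

diff, low, high = mean_diff_ci(successful_ages, denied_ages)
print(f"difference = {diff:.2f}, interval = [{low:.2f}, {high:.2f}]")
```

If the interval excludes zero, the difference in means is statistically distinguishable from zero at the chosen confidence level (under the approximation's assumptions).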

We will now do this for all variables of interest and save the difference in means and the confidence interval values in a dataframe so we can plot this information.

To make this easier, we'll write a function that:

- takes the dataframe and the name of the variable of interest as arguments
- selects successful and unsuccessful applicants, and performs a t-test on that variable (like we already did for age)
- returns the answers (the difference in means and the upper and lower confidence limits) in a format useful for populating a dataframe of results

We will create an empty dataframe to populate with this info and then run the whole thing in a loop.

In [None]:
def get_ttest_and_mean_for_variable(dataframe_variable, selected_variable):
    """Given a dataframe with loan statuses encoded by a "hh_status" column
    with values of "successful" and "denied", and columns that have other
    relevant characteristics, this function returns the mean difference between
    the successful and denied households according to the characteristic given
    by the 'selected_variable'.

    Args:
        dataframe_variable (pandas dataframe): Data containing loan outcomes.
        selected_variable (string): Name of other characteristic.

    Returns:
        tuple (floats): Mean difference, low limit conf int, high limit conf int
    """
    # Select the variable for successful and
    # denied borrowers
    sel_success = dataframe_variable.loc[
        dataframe_variable["hh_status"] == "successful", selected_variable
    ]
    sel_denied = dataframe_variable.loc[
        dataframe_variable["hh_status"] == "denied", selected_variable
    ]

    mean_difference = sel_success.mean() - sel_denied.mean()

    # do the t-test. Default confidence is 0.95, but include it here to be explicit.
    # The "CI95%" column holds the lower and upper confidence limits.
    mean_low, mean_high = (
        pg.ttest(x=sel_success, y=sel_denied, confidence=0.95)
        .round(3)["CI95%"]
        .explode()
        .to_list()
    )
    return mean_difference, mean_low, mean_high


# create an empty dataframe for the results
data_to_plot = pd.DataFrame()

# Now we can loop through the variables
for this_variable in sel_var:
    mean_diff, low, high = get_ttest_and_mean_for_variable(all_hh_c, this_variable)
    temp_data = pd.DataFrame.from_dict(
        {
            "var_name": this_variable,
            "mean_difference": mean_diff,
            "conf_low": low,
            "conf_high": high,
        },
        orient="index",
    ).T
    data_to_plot = pd.concat([temp_data, data_to_plot], axis=0)

# give different rows different index numbers (dropping old index)
data_to_plot = data_to_plot.reset_index(drop=True)
data_to_plot.head()

Now we can plot the chart using **lets-plot**.

In [None]:
(
    ggplot(
        data_to_plot.reset_index(),
        aes(
            x="var_name",
            y="mean_difference",
            fill="var_name",
        ),
    )
    + geom_bar(stat="identity", show_legend=False, color="black", alpha=0.6)
    + geom_errorbar(
        aes(ymin="conf_low", ymax="conf_high", color="var_name"),
        size=2,
        show_legend=False,
    )
    + coord_flip()
)

**Figure 9.2 Bar chart showing difference in household characteristics for successful and denied borrowers.**

## Python Walkthrough 9.8

**Calculating conditional means**

We are interested in the means of a range of variables for different subgroups. Two subgroups are mutually exclusive (`hh_status` == "successful" and `hh_status` == "denied"), while the others (`credit_constrained` == "Yes" and `discouraged_borrower` == "Yes") are partially overlapping subgroups of the data. Our strategy is to create a temporary dataframe (`sel_all_hh_c`) that only contains the relevant observations and the relevant variables. Then we can calculate the required means using **pandas**' built-in `.mean` method.



In [None]:
# variables we are interested in
sel_var = [
    "age",
    "max_education",
    "number_assets",
    "hhsize",
    "young_children",
    "working_age_adults",
]

sel_all_hh_c = all_hh_c.loc[all_hh_c["hh_status"] == "successful", sel_var]

print(f"Successful, n = {len(sel_all_hh_c)}")

Now we get the means of each column:

In [None]:
sel_all_hh_c.mean(axis="index").round(2)

We can perform the same operation with "denied":

In [None]:
sel_all_hh_c = all_hh_c.loc[all_hh_c["hh_status"] == "denied", sel_var]

print(f"Denied, n = {len(sel_all_hh_c)}")

You can re-calculate the conditional means based on this cut too:

In [None]:
sel_all_hh_c.mean(axis="index").round(2)

You can use similar methods to look at discouraged and credit constrained households.

## Python Walkthrough 9.9

**Data cleaning and summarising loan characteristics**

We start by cleaning up the loan dates. We have information on start month and year as well as end month and year. Let's look at these in turn. The output of `got_l.info()` indicates that the start and end years are numeric variables, but the months are categorical variables with month names (for example 'April').

Let's first look at the years by creating a scatterplot.

In [None]:
(
    ggplot(got_l, aes(x="loan_endyear", y="loan_startyear"))
    + geom_point(size=6)
    + labs(
        title="Loan start and end year",
    )
    + scale_x_continuous(format="")
    + scale_y_continuous(format="")
)

**Figure 9.3 Scatterplot showing loan start and end year.**

We can see that there are three observations that have very low (< 500) start or end year values, which does not make sense. We will replace these with `pd.NA`, but leave the original data untouched and create a new dataset called `got_l_c`, where the ‘c’ indicates cleaned data.

In [None]:
got_l_c = got_l.copy()
got_l_c.loc[
    (got_l_c["loan_startyear"] < 500) | (got_l_c["loan_endyear"] < 500),
    ["loan_startyear", "loan_endyear"],
] = pd.NA

In [None]:
(
    ggplot(got_l_c, aes(x="loan_endyear", y="loan_startyear"))
    + geom_point()
    + labs(
        title="Loan start and end year",
    )
    + scale_x_continuous(format="")
    + scale_y_continuous(format="")
)

**Figure 9.4 Revised scatterplot showing loan start and end year without outliers.**

In the top left corner, there is a loan with the start year (2006) after the end year (2003). Clearly this is incorrect, so we should remove this observation when analysing loan periods. However, we wait until we have combined the years with the months as there may be more observations with this issue.

Also, we can only see a small number of points because there are many identical observations (for example startyear of 2006 and endyear of 2006). To see these points you can add `position=position_jitter()` to the **lets-plot** plotting command. Hover your mouse over the function within an integrated development environment to see what `position_jitter()` does.


In [None]:
(
    ggplot(got_l_c, aes(x="loan_endyear", y="loan_startyear"))
    + geom_point(position=position_jitter())
    + labs(
        title="Loan start and end year",
    )
    + scale_x_continuous(format="")
    + scale_y_continuous(format="")
)

Now let’s look at the values in `loan_startmonth`. We'll keep any null values because we're interested in missing observations here.

In [None]:
got_l_c["loan_startmonth"].value_counts(dropna=False)

This all looks fine apart from the NaN (not a number) entries; what about the end months?

In [None]:
got_l_c["loan_endmonth"].value_counts(dropna=False)

Two things are noteworthy here: there are many "NaN" entries, and there is an entry called "Pagume". As described in the task, "Pagume" can be approximated by September. Let's recode that, and recode the NaNs as "missing".

In [None]:
got_l_c.loc[got_l_c["loan_endmonth"] == "Pagume", "loan_endmonth"] = "September"
for col in ["loan_startmonth", "loan_endmonth"]:
    # add a 'missing' category
    got_l_c[col] = got_l_c[col].cat.add_categories("missing")
    # fill na with 'missing'
    got_l_c[col] = got_l_c[col].fillna("missing")

for col in ["loan_startyear", "loan_endyear"]:
    # first convert to integer to remove .0 at end
    # then to string, with missing replacing nans
    got_l_c[col] = got_l_c[col].astype("Int64").astype("string").fillna("missing")

You can check that the end months now have sensible entries for the valid rows.

Let's now calculate the length of the loan; in other words, the number of days between the start and end dates. **pandas** has very powerful functionality for dates and times. Our first step is to create new variables that combine months and years. We can do this by casting (using `.astype`) columns as strings and then using `pd.to_datetime`. We pass `errors="coerce"` so that entries that cannot be parsed (such as our "missing" values) become `NaT` rather than raising an error.

Here's an example:

In [None]:
pd.to_datetime(
    got_l_c["loan_startmonth"].astype("str") + "-" + got_l_c["loan_startyear"],
    errors="coerce",
)

Note that, because we supplied only a month and a year, each date is assumed to fall on the first day of the month. `pd.to_datetime` converts whatever we feed it to the *datetime* that best matches the input. A *datetime* is a special type of variable in computer science: it encodes year, month, day, hours, minutes and seconds. Timing is important so this variable type is incredibly useful!

`pd.to_datetime` will always do a best guess based on what you put in. See what happens if you try putting in a string like "15-March-2004" instead. If you *don't* want your date to be at the start of the month, you can 'add' the month end on to the date using `+ pd.offsets.MonthEnd()`.
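
A minimal sketch of these ideas on made-up strings follows. Unlike the code in this walkthrough, it passes an explicit `format="%B-%Y"` (full month name, then four-digit year) so that the parsing does not rely on guessing; that choice is an assumption made for this illustration.

```python
import pandas as pd

# Made-up month-year strings, including one that cannot be parsed
month_year = pd.Series(["March-2004", "September-2006", "missing-missing"])

# Unparseable entries become NaT because of errors="coerce"
starts = pd.to_datetime(month_year, format="%B-%Y", errors="coerce")

# Move each date from the first to the last day of its month
ends_of_month = starts + pd.offsets.MonthEnd()

print(starts)
print(ends_of_month)
```

The first entry is parsed as 1 March 2004, and adding `pd.offsets.MonthEnd()` moves it to 31 March 2004; the unparseable entry stays missing throughout.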

Now let's put the months into the dataframe (using the default of month start):

In [None]:
got_l_c["loan_start_datetime"] = pd.to_datetime(
    got_l_c["loan_startmonth"].astype("str") + "-" + got_l_c["loan_startyear"],
    errors="coerce",
)
got_l_c["loan_end_datetime"] = pd.to_datetime(
    got_l_c["loan_endmonth"].astype("str") + "-" + got_l_c["loan_endyear"],
    errors="coerce",
)

Let's assess how much missing data we have (as a proportion of the categories we're interested in):

In [None]:
summary_nans = (
    got_l_c.isna()
    .agg(["sum", "count"])
    .loc[:, ["loan_start_datetime", "loan_end_datetime"]]
    .T
)
summary_nans["pct"] = 100 * summary_nans["sum"] / summary_nans["count"]
summary_nans.round(2)

So we now have start and end dates in datetime formats. We need only compute the difference:

In [None]:
got_l_c["loan_length"] = got_l_c["loan_end_datetime"] - got_l_c["loan_start_datetime"]
got_l_c["loan_length"].head()

Note the following:

- we are missing some loans that didn't have start or end dates, which have appeared as 'NaT', i.e. 'Not a Time'
- some loan lengths are negative because the recorded end date is before the start date (it could be that the two dates were switched when the data was entered into the system)

These data problems are unfortunate but a common feature of real-life empirical work, and you will have to be on the lookout for them!

As required in Question 1, we will create two variants of the `loan_length` variable: one where we assign missing values to all observations that have negative `loan_length`, and one where we assume that the problem was the switching of start and end date, so we transform all loan lengths to positive values.

In [None]:
# create the NaT version
got_l_c["loan_length_na"] = got_l_c["loan_length"].copy()
# set anything less than 0 days to NaT
got_l_c.loc[got_l_c["loan_length_na"] < pd.Timedelta(0, "d"), "loan_length_na"] = pd.NaT

# create the absolute version
got_l_c["loan_length_abs"] = got_l_c["loan_length"].copy()
# set anything less than 0 days to its reverse
got_l_c.loc[
    got_l_c["loan_length_abs"] < pd.Timedelta(0, "d"), "loan_length_abs"
] = -got_l_c.loc[got_l_c["loan_length_abs"] < pd.Timedelta(0, "d"), "loan_length_abs"]
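
The same two variants can be sketched more compactly with the `.where` and `.abs` methods. This is an alternative to the `.loc` assignments above, shown on made-up dates, and it also previews how a missing-aware "longer than a year" flag could be built.

```python
import pandas as pd

# Made-up start and end dates; the second loan "ends" before it starts
start = pd.to_datetime(pd.Series(["2006-01-01", "2006-06-01", None, "2005-01-01"]))
end = pd.to_datetime(
    pd.Series(["2006-01-31", "2006-04-01", "2006-03-01", "2006-02-05"])
)
length = end - start

# Variant 1: keep non-negative lengths, set everything else to NaT
length_na = length.where(length >= pd.Timedelta(0, "d"))

# Variant 2: treat negative lengths as switched dates
length_abs = length.abs()

# Flag loans longer than 365 days, keeping missing lengths as missing
long_term = (length_abs > pd.Timedelta(365, "d")).where(length_abs.notna())

print(pd.DataFrame({"na": length_na, "abs": length_abs, "long_term": long_term}))
```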

Now we can create the `long_term` variable and look at the number of long-term loans.

In [None]:
got_l_c.loc[got_l_c["loan_length_abs"].isna(), "long_term"] = pd.NA
got_l_c.loc[got_l_c["loan_length_abs"] > pd.Timedelta(365, "d"), "long_term"] = True
got_l_c.loc[got_l_c["loan_length_abs"] <= pd.Timedelta(365, "d"), "long_term"] = False

got_l_c["long_term"].value_counts(dropna=False)

We therefore have about 23% of loans that are long-term (only looking at loans for which we do have date information).

## Python Walkthrough 9.10

**Making summary tables and calculating correlations**

To make summary tables, we use the `skim` function from the **skimpy** package.

In [None]:
from skimpy import skim

skim(got_l_c)

In [None]:
got_l_c["loan_length_abs"].head()

It can be helpful to look at loan amounts and interest rate graphically, for example in a scatterplot. We'll use the **lets-plot** package for that.

In [None]:
(
    ggplot(got_l_c, aes(x="loan_amount", y="loan_interest"))
    + geom_point(size=3)
    + labs(
        title="Loan amounts and interest payments",
    )
)

**Figure 9.5 Scatterplot showing loan amounts and interest payments.**

One large loan (top right corner) dominates this graph. Let's exclude observations with a loan amount larger than 200,000 from the plotted area of the graph.

In [None]:
(
    ggplot(got_l_c, aes(x="loan_amount", y="loan_interest"))
    + geom_point(size=3)
    + labs(
        title="Loan amounts and interest payments",
    )
    + ylim(0, 3e4)
    + xlim(0, 2e5)
)

**Figure 9.6 Revised scatterplot showing loan amounts and interest payments without outliers.**

Interestingly we can see many zero interest loans. Now we will calculate the interest rate as `loan_interest`/`loan_amount`.

In [None]:
got_l_c["interest_rate"] = got_l_c["loan_interest"] / got_l_c["loan_amount"]
got_l_c["interest_rate"].describe()

The maximum interest rate is 200 (in other words 20,000%), which does not make sense and could be due to a data entry error. Making another scatterplot can also identify extreme values for loan amounts:

In [None]:
(
    ggplot(got_l_c, aes(x="loan_amount", y="interest_rate"))
    + geom_point(size=3)
    + labs(
        title="Loan amounts and interest rates",
    )
)

**Figure 9.7 Scatterplot identifying extreme values for loan amounts.**

Let's make another scatterplot, excluding the observation with the extremely high interest rate and only looking at small loan amounts (<1,000).

In [None]:
(
    ggplot(got_l_c, aes(x="loan_amount", y="interest_rate"))
    + geom_point(size=3)
    + labs(
        title="Loan amounts and interest rates",
    )
    + xlim(0, 1e3)
    + ylim(0, 5)
)

**Figure 9.8 Scatterplot excluding extremely high interest rate and including only small loan amounts.**

Again we can see that there are many zero interest loans. From the summary statistics above, we can see that the median interest rate is 0, which implies that at least 50% of loans have a zero interest rate. The following code calculates that percentage precisely.

In [None]:
num_zero_interest_rate = (
    100 * (got_l_c["interest_rate"] == 0).sum() / got_l_c["interest_rate"].count()
)

print(f"The percentage of loans with a rate of zero is {num_zero_interest_rate.round(2)}%")

Now let's calculate statistics conditional on whether a loan is long term or not. Before we do this, we will remove the observation with the very extreme interest rate (20,000%) from our `got_l_c` dataset (but not from the original `got_l` dataset). That observation has a loan amount of 1 and an interest payment of 200, which is probably a data entry mistake. There is another extreme observation (with a loan amount of 30,000,000), but there is no indication that this observation is misrecorded as there is a significant interest payment for this loan. Note that the filter used below also drops any rows where `interest_rate` is missing, because a comparison with a missing value evaluates to `False`.

In [None]:
got_l_c = got_l_c.loc[got_l_c["interest_rate"] < 200, :]
got_l_c.groupby("long_term")["interest_rate"].agg(
    ["mean", "std", "min", "max", "median", "count"]
).round(2)

Both the mean and median interest rate are higher for long-term loans. You can adapt the code above to calculate statistics for the `loan_amount` variable.
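
One subtlety of filtering with a comparison such as `< 200` is that rows with a missing value are dropped too, because comparisons involving NaN evaluate to `False`. The sketch below shows this on made-up numbers, along with one way to keep the missing rows if that is what you want.

```python
import numpy as np
import pandas as pd

# Made-up interest rates, including a missing value and an extreme value
rates = pd.Series([0.0, 0.1, np.nan, 200.0])

# The comparison is False for both the extreme value and the NaN
strict = rates.loc[rates < 200]

# Explicitly keep missing values alongside the non-extreme ones
keep_missing = rates.loc[(rates < 200) | rates.isna()]

print(len(strict), len(keep_missing))
```

Which version is appropriate depends on the analysis; for statistics on the interest rate itself, rows with a missing rate contribute nothing either way.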

We now calculate correlations between interest rates and household characteristics. We store the correlation coefficients in a matrix (array of rows and columns) called `m_corr`.

In [None]:
m_corr = got_l_c.loc[
    ~got_l_c["interest_rate"].isna(),
    [
        "age",
        "max_education",
        "number_assets",
        "hhsize",
        "young_children",
        "working_age_adults",
        "interest_rate",
    ],
].corr()

m_corr["interest_rate"].round(4)

## Python Walkthrough 9.11

**Creating summary tables of means**

First we use the `pd.crosstab` function to create the table with the variable `borrowed_from`.

In [None]:
stab_10 = pd.crosstab(
    got_l_c["borrowed_from"],
    got_l_c["rural"],
    margins=True,
    normalize="columns",
).round(4)
stab_10

Note that in all settings, most loans come from relatives. To create the table with `borrowed_from_other`, substitute this variable name in the above command.

Let's take a quick look at the loan lengths we expect (mean) when looking at the cross-tab between `rural` and `borrowed_from`:

In [None]:
tab_10 = got_l_c.groupby(["borrowed_from", "rural"])["loan_length_abs"].agg("mean")
tab_10.head()

We don't need the extra info on hours, so let's instead simplify this to only use full days. And we'll "unstack" the data to make it wider in format too, making it a bit more readable.

In [None]:
tab_10.dt.days.unstack()

### Extension: Investigating sources of finance associated with zero interest loans

We previously saw that a large percentage of loans have a zero interest rate. Here we investigate whether particular sources of finance are responsible for these interest rates. The code we use is very similar to the code above, but instead of calculating the mean of a variable, we calculate the mean of a boolean (true/false) variable (`interest_rate == 0`). This will deliver the proportion of `True` observations, in other words, loans where the interest rate was equal to zero.

In [None]:
tab_11 = (
    got_l_c.assign(rate_of_zero=lambda x: x["interest_rate"] == 0)
    .groupby(["borrowed_from", "rural"], dropna=False)
    .agg(prop_0_interest=("rate_of_zero", "mean"))
    .unstack()
)

tab_11.round(2)

You can see that in both urban and rural settings, a high proportion of loans granted by local merchants, neighbours, and relatives are zero interest (possibly because these people have a close relationship with the borrower so there is a lower chance of default).
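
The reason the mean of a boolean column gives a proportion is that `True` counts as 1 and `False` as 0. Here is a minimal sketch on made-up loans (the lenders and rates are invented for illustration):

```python
import pandas as pd

# Made-up loans from two kinds of lender
toy = pd.DataFrame(
    {
        "borrowed_from": ["Relative", "Relative", "Relative", "Bank", "Bank"],
        "interest_rate": [0.0, 0.0, 0.1, 0.2, 0.0],
    }
)

# Mean of True/False values = share of True values within each group
share_zero = (
    toy.assign(rate_of_zero=lambda x: x["interest_rate"] == 0)
    .groupby("borrowed_from")["rate_of_zero"]
    .mean()
)
print(share_zero.round(2))
```

In this toy data, two of the three loans from relatives and one of the two bank loans carry no interest, so the group means are two-thirds and one-half respectively.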

We will use exactly the same technique to determine the proportion of loans that go to households that report as being headed by a woman.

In [None]:
tab_12 = (
    got_l_c.assign(headed_by_woman=lambda x: x["gender"] == "Female")
    .groupby(["borrowed_from", "rural"], dropna=False)
    .agg(prop_women=("headed_by_woman", "mean"))
    .unstack()
)

tab_12.round(2)

This chapter used the following packages where *sys* is the Python version:

In [None]:
%load_ext watermark
%watermark --iversions