# Empirical Project 9

---
**Download the code**

To download the code used in this project as a notebook that can be run in Visual Studio Code, Google Colab, or Jupyter Notebook, right click [here]() and select 'Save Link As', then save it as a `.ipynb` file.

Don’t forget to also download the data into your working directory by following the steps in this project.

---

## Getting started in Python

For this project, you will need the following packages:

- **pandas** for data analysis
- **matplotlib** for data visualisation
- **numpy** for numerical methods
- **statsmodels** for an extra statistics function

You'll also be using the **warnings** and **pathlib** packages, but these come built-in with Python.

Remember, you can install packages in Visual Studio Code's integrated terminal (click "View > Terminal") by running `conda install packagename` (if using the Anaconda distribution of Python) or `pip install packagename` if not.

Once you have the Python packages installed, you will need to import them into your Python session—and configure any other initial settings.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import warnings
import matplotlib_inline.backend_inline

# Set the plot style for prettier charts:
plt.style.use("plot_style.txt")
# Make output charts in 'svg' format
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")

# Ignore warnings to make nice output
warnings.simplefilter("ignore")

## Python Walkthrough 9.1

**Importing data into Python**

Ensure that the Excel file you downloaded is in a sub-directory of your working directory called "data".

Before importing the data, open it in Excel to look at its structure. You can see there are three tabs: ‘Data dictionary’, ‘All households’, and ‘Got loan’. We will import them into separate dataframes (`data_dict`, `all_hh`, and `got_l` respectively). We import the ‘Data dictionary’ so that we do not have to return to the Excel spreadsheet.

Also note that there are a lot of empty cells, which is how missing data is coded in Excel (but not in Python). The `pd.read_excel` function already reads empty cells as NA by default, so we don't need to specify this.

In [None]:
data_dict = pd.read_excel(
    Path("data/doing-economics-working-in-excel-project-9-datafile.xlsx"),
    sheet_name="Data dictionary",
)

all_hh = pd.read_excel(
    Path("data/doing-economics-working-in-excel-project-9-datafile.xlsx"),
    sheet_name="All households",
)

got_l = pd.read_excel(
    Path("data/doing-economics-working-in-excel-project-9-datafile.xlsx"),
    sheet_name="Got loan",
)

Now let’s look at the variable types in `all_hh` and `got_l` using the `.info()` method.

In [None]:
all_hh.info()

In [None]:
got_l.info()

It is important to ensure that all variables we expect to be numerical (numbers) show either int or float as their 'Dtype', and in this case, they are. You can see that there are many variables that are coded as object variables because they are text (for example gender or region), but since we can use these variables to group data by category, we will use `.astype("category")` to change them into categorical variables for later use.

Instead of converting each object variable to a categorical variable individually, we can use the `select_dtypes` method to find *all* columns that are currently of type object and then convert those columns specifically.

In [None]:
cols = all_hh.select_dtypes("object").columns
all_hh[cols] = all_hh[cols].astype("category")

cols = got_l.select_dtypes("object").columns
got_l[cols] = got_l[cols].astype("category")

Running `.info()` again will show that the columns that were previously of type object have been "recast" to the data type "category".
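
The conversion pattern can be tried on a small example first. The sketch below uses a made-up data frame (the column values are hypothetical, not taken from the project data) and assumes the text columns are stored with the object dtype, as they are after `pd.read_excel` in this walkthrough.

```python
import pandas as pd

# Hypothetical toy data standing in for all_hh (values are made up)
toy = pd.DataFrame(
    {
        "gender": pd.Series(["Male", "Female", "Female"], dtype="object"),
        "region": pd.Series(["North", "South", "North"], dtype="object"),
        "age": [34, 51, 28],
    }
)

# Find the text (object) columns and recast only those
text_cols = toy.select_dtypes("object").columns
toy[text_cols] = toy[text_cols].astype("category")

# Text columns are now categorical; the numeric column is untouched
print(toy.dtypes)
```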

## Python Walkthrough 9.2

**Creating summary tables**

In order to get the proportions of households in each region living in large towns, small towns, or rural areas (encoded in the variable `rural`), we use the `pd.crosstab` function to create a cross-tabulation. Without any further options, `pd.crosstab` would produce counts of households in the respective regions and area types. However, we can pass a keyword argument, `normalize`, to turn the counts into proportions by, for example, rows or columns.

Below, we use `normalize="index"` to ensure that each row sums to one. We also use `.round(3)` to limit the number of decimal places that are saved in the output data frame.

Remember that if you ever want to know more about a function, you can hover over it in Visual Studio Code and see its options, including the keyword arguments it takes. Or you can run `help(function_name)`.

In [None]:
stab_one = pd.crosstab(all_hh["region"], all_hh["rural"], normalize="index").round(3)
stab_one
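
To see what the `normalize` argument does, here is a minimal sketch on a made-up data frame (the region and area labels are hypothetical). It compares raw counts with row-wise and column-wise proportions.

```python
import pandas as pd

# Hypothetical toy data: five households in two regions
toy = pd.DataFrame(
    {
        "region": ["North", "North", "North", "South", "South"],
        "rural": ["Rural", "Rural", "Small town", "Large town", "Rural"],
    }
)

# Raw counts of households by region and area type
counts = pd.crosstab(toy["region"], toy["rural"])

# Each row sums to one: shares of area types within a region
row_props = pd.crosstab(toy["region"], toy["rural"], normalize="index")

# Each column sums to one: shares of regions within an area type
col_props = pd.crosstab(toy["region"], toy["rural"], normalize="columns")

print(counts)
print(row_props.round(3))
print(col_props.round(3))
```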

We may not always want a cross-tab. For the case where we simply want an overall percentage, we can use the `.value_counts` method with `normalize=True`.

Let’s use a similar approach to calculate the percentage of households with women as head of the family (encoded in the variable `gender`).

In [None]:
stab_two = all_hh["gender"].value_counts(normalize=True).round(3)
stab_two
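
The output of `.value_counts(normalize=True)` is a proportion between 0 and 1. As a small sketch on made-up values (not the project data), multiplying by 100 gives percentages directly:

```python
import pandas as pd

# Hypothetical toy data: gender of four household heads
gender = pd.Series(["Female", "Male", "Male", "Male"], name="gender")

# Proportions of each category (these sum to one)
shares = gender.value_counts(normalize=True)

# The same information expressed as percentages
percentages = (shares * 100).round(1)
print(percentages)
```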

As shown, 30.4% of households have a head of the family who is a woman.

We need to provide summary statistics for a range of variables. Most of these variables are numeric variables, but one, gender, is a categorical variable. We can distinguish these types using the `.select_dtypes` method.

There are a few different ways to get summary statistics on a data frame. **pandas** has a built-in method called `.describe` which produces different information depending on whether the given columns are numeric or not. Here it is running on just numeric columns:

In [None]:
all_hh.select_dtypes("number").describe().round(2)

If we use it on a categorical variable, `gender`, we get information relevant to that type of data:

In [None]:
all_hh["gender"].describe()
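
For a categorical column, `.describe` reports the number of non-missing entries, the number of distinct categories, the most common category, and how often it occurs. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical toy data: a categorical column with two categories
gender = pd.Series(["Female", "Male", "Male", "Male"], dtype="category")

# For categorical data the summary has four entries:
# count, unique, top (most frequent category), and freq (its frequency)
summary = gender.describe()
print(summary)
```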

There are, of course, lots of different types of data. A more powerful summary of a data frame is provided by a package called **skimpy** (which you can install by running `pip install skimpy` on the command line). Here is **skimpy**'s `skim` function applied to `all_hh`:

In [None]:
from skimpy import skim

skim(all_hh)

Now we can see lots of information on both the categorical and numeric data types!

## Python Walkthrough 9.3

**Making frequency tables for loan applications and outcomes**

The easiest way to make a frequency table is to use the `pd.crosstab` function. Note that we use the `dropna=False` option (so that "All" minus the sum of "No" and "Yes" gives the number of NAs) and the `margins=True` option to add the row and column totals.

In [None]:
stab_three = pd.crosstab(
    all_hh["did_not_apply"], all_hh["loan_rejected"], dropna=False, margins=True
)
stab_three
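
As a cross-check on the number of NAs implied by a table like this, missing values in a single column can also be counted directly. The sketch below uses made-up answers, not the project data, and shows two standard ways of doing so.

```python
import numpy as np
import pandas as pd

# Hypothetical toy answers with two missing entries
answers = pd.Series(["Yes", "No", np.nan, "No", np.nan], name="loan_rejected")

# Count the missing entries directly
n_missing = answers.isna().sum()

# Or include missing values as their own row in the frequency count
counts_with_na = answers.value_counts(dropna=False)

print(n_missing)
print(counts_with_na)
```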

Now we will do some data cleaning and re-examine the table. We:

- exclude the households that indicated that they did not apply for a loan but also indicated that they were refused a loan. This excludes more than 10% of the households that reported being refused a loan, but their answers are contradictory. We use the `~` operator, which negates ("not") the logical condition that follows.
- remove all observations that have missing data for either of these two questions (by relying on the implicit default value of `dropna=True`).



In [None]:
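# Keep every row except those that report both "did not apply" and "rejected".
# .copy() makes all_hh_c an independent data frame, so adding columns to it
# later does not trigger a SettingWithCopyWarning.
all_hh_c = all_hh.loc[
    ~(
        (all_hh["loan_rejected"] == "Yes")
        & (all_hh["did_not_apply"] == "Did not apply")
    ),
    :,
].copy()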
stab_four = pd.crosstab(
    all_hh_c["did_not_apply"], all_hh_c["loan_rejected"], margins=True, normalize="all"
).round(3)
stab_four
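
The filtering step can be easier to follow if the condition is given a name before it is negated. Here is a minimal sketch on a made-up data frame; the name `inconsistent` is introduced purely for illustration.

```python
import pandas as pd

# Hypothetical toy data: four households
toy = pd.DataFrame(
    {
        "did_not_apply": ["Applied", "Did not apply", "Did not apply", "Applied"],
        "loan_rejected": ["No", "Yes", "No", "Yes"],
    }
)

# True for rows that claim both "did not apply" and "rejected"
inconsistent = (toy["loan_rejected"] == "Yes") & (
    toy["did_not_apply"] == "Did not apply"
)

# ~ flips True and False, so we keep every row that is NOT inconsistent
toy_clean = toy.loc[~inconsistent].copy()
print(toy_clean)
```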

## Python Walkthrough 9.4

**Creating variables to classify households**

Let’s first create the `hh_status` variable. We set the values of `hh_status` to "not applied", then use logical indexing to change all entries where households applied for a loan (`all_hh_c["did_not_apply"] == "Applied"`) and were accepted (`all_hh_c["loan_rejected"] == "No"`) to "successful", and households who were denied (`all_hh_c["loan_rejected"] == "Yes"`) to "denied".



In [None]:
# This is the default category and creates the new column
all_hh_c["hh_status"] = "not applied"

all_hh_c.loc[
    (all_hh_c["did_not_apply"] == "Applied") & (all_hh_c["loan_rejected"] == "No"),
    "hh_status",
] = "successful"
all_hh_c.loc[
    (all_hh_c["did_not_apply"] == "Applied") & (all_hh_c["loan_rejected"] == "Yes"),
    "hh_status",
] = "denied"

# Change from a string variable to a categorical variable
all_hh_c["hh_status"] = all_hh_c["hh_status"].astype("category")

Let's have a look at the frequencies of the different outcomes:

In [None]:
all_hh_c["hh_status"].value_counts()
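
An alternative to assigning each category in a separate step is NumPy's `np.select`, which takes a list of conditions, a matching list of values, and a default. This is only a sketch of a different technique on made-up data, not the approach used in this walkthrough.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three households
toy = pd.DataFrame(
    {
        "did_not_apply": ["Applied", "Applied", "Did not apply"],
        "loan_rejected": ["No", "Yes", "No"],
    }
)

applied = toy["did_not_apply"] == "Applied"

# Conditions are checked in order; the first one that is True wins
conditions = [
    applied & (toy["loan_rejected"] == "No"),
    applied & (toy["loan_rejected"] == "Yes"),
]
choices = ["successful", "denied"]

# Rows matching no condition get the default label
toy["hh_status"] = pd.Categorical(
    np.select(conditions, choices, default="not applied")
)
print(toy)
```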

Now we will continue by using the same steps to make the `discouraged_borrower` variable.

In [None]:
# This is the default category and creates the new column
all_hh_c["discouraged_borrower"] = "No"

all_hh_c.loc[
    (all_hh_c["reason_not_apply1"] == "Believe Would Be Refused"),
    "discouraged_borrower",
] = "Yes"
all_hh_c.loc[
    (all_hh_c["reason_not_apply2"] == "Believe Would Be Refused"),
    "discouraged_borrower",
] = "Yes"

# Change from a string variable to a categorical variable
all_hh_c["discouraged_borrower"] = all_hh_c["discouraged_borrower"].astype("category")

all_hh_c["discouraged_borrower"].value_counts()

To make the `credit_constrained` variable, we use the `.cat.categories` property to check all the possible answers to the `reason_not_apply1` variable. We store these answers in the object `sel_ans`. (To convert from an index, a special type of **pandas** object, to a simple Python list, we put the expression on the right-hand side within a `list(..)`.)

In [None]:
sel_ans = list(all_hh_c["reason_not_apply1"].cat.categories)
sel_ans
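
When text is converted with `.astype("category")`, the categories are stored in sorted (alphabetical) order, and positions in the resulting list are counted from 0. The sketch below illustrates this on a made-up column that uses three of the answer labels.

```python
import pandas as pd

# Hypothetical toy column using three of the answer labels
reasons = pd.Series(
    [
        "Other (Specify)",
        "Believe Would Be Refused",
        "Have Adequate Farm",
        "Other (Specify)",
    ]
).astype("category")

# Categories are unique and sorted alphabetically
toy_ans = list(reasons.cat.categories)

# Print each answer with its zero-based position
for position, answer in enumerate(toy_ans):
    print(position, answer)

# Removing by label avoids having to rely on the positions
toy_ans.remove("Have Adequate Farm")
toy_ans.remove("Other (Specify)")
```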

Of these reasons, only reasons 4 (Have Adequate Farm) and 7 (Other) do not lead to a conclusion that a household is credit constrained, so we remove them from `sel_ans`. Note that this numbering starts from 0.

In [None]:
sel_ans.remove("Have Adequate Farm")
sel_ans.remove("Other (Specify)")
sel_ans

Households that did not provide any reasons are classified as *not* credit constrained. We'll take this as our default category and use it to create the column.

In [None]:
all_hh_c["credit_constrained"] = "No"

# Check both the first and the second reason given for not applying
all_hh_c.loc[all_hh_c["reason_not_apply1"].isin(sel_ans), "credit_constrained"] = "Yes"
all_hh_c.loc[all_hh_c["reason_not_apply2"].isin(sel_ans), "credit_constrained"] = "Yes"

# let's turn this into a categorical variable
all_hh_c["credit_constrained"] = all_hh_c["credit_constrained"].astype("category")
all_hh_c["credit_constrained"].value_counts()

The use of `.isin` in the selection criterion is a very useful programming technique that you can use to select data according to a list of values. In this case, `sel_ans` contains all the answers that we associate with a credit-constrained household.

`all_hh_c["reason_not_apply1"].isin(sel_ans)` gives an outcome of `True` if the value of an entry in the column `reason_not_apply1` is one of the values in `sel_ans`. We then use this result with `.loc` to set the value of `credit_constrained` to "Yes" for those observations.
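
Here is a minimal sketch of `.isin` on a made-up column, including a missing answer. Note that a missing value is never "in" the list, so that household keeps the default value of "No".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four households, one with no reason given
toy = pd.DataFrame(
    {
        "reason_not_apply1": [
            "Believe Would Be Refused",
            "Have Adequate Farm",
            np.nan,
            "Other (Specify)",
        ]
    }
)

# Answers that we treat as a sign of being credit constrained
toy_ans = ["Believe Would Be Refused"]

# True where the entry matches one of the listed answers
mask = toy["reason_not_apply1"].isin(toy_ans)

toy["credit_constrained"] = "No"
toy.loc[mask, "credit_constrained"] = "Yes"
print(toy)
```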

In [None]:
# Use a new name so that the table from Python Walkthrough 9.3 is not overwritten
stab_five = pd.crosstab(
    all_hh_c["credit_constrained"],
    all_hh_c["discouraged_borrower"],
    margins=True,
    normalize="all",
).round(3)
stab_five

Let's look at the frequencies of the different reasons to not apply, in order.

In [None]:
all_hh_c["reason_not_apply1"].value_counts(normalize=True).round(3)

In [None]:
all_hh_c["reason_not_apply2"].value_counts(normalize=True).round(3)

## Python Walkthrough 9.5

**Making frequency tables to compare proportions**

Some of the data is in the `all_hh` dataset, while the rest is in the `got_l` dataset, both of which we imported in Python Walkthrough 9.1. We will combine that information into one new dataset called `loan_data`, which we then use to produce the table.

In [None]:
sel_all_hh_c = all_hh_c.loc[all_hh_c["hh_status"].isin(["successful", "denied"])]

pd.crosstab(
    sel_all_hh_c["loan_purpose"], sel_all_hh_c["hh_status"], normalize="columns"
).round(3)

This reveals a particular feature of the data, namely that for successful borrowers, the `all_hh_c` dataset does not contain all the useful information, as every successful household has "Other (Specify)" in the `loan_purpose` variable. There is more useful information on loan purpose in the `got_l` data, so we will extract the `loan_purpose` variable for unsuccessful households from the `all_hh_c` dataset, and the equivalent information for successful borrowers from the `got_l` dataset.

In [None]:
# Select unsuccessful households from all_hh_c
loan_no = all_hh_c.loc[all_hh_c["hh_status"] == "denied", ["loan_purpose", "hh_status"]]

# Select loan purpose for successful households from got_l
loan_yes = got_l.loc[got_l["got_loan"] == "Yes", ["loan_purpose"]]
loan_yes["hh_status"] = "successful"

# combine the data through concatenation
loan_data = pd.concat([loan_yes, loan_no])

# let's look at the values
pd.crosstab(
    loan_data["loan_purpose"], loan_data["hh_status"], normalize="columns"
).round(3)
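
The combining step can be sketched on two small made-up data frames (the loan purposes below are hypothetical labels). This version also passes `ignore_index=True`, which is an optional extra that gives the combined data a fresh 0, 1, 2, ... index instead of keeping the original row labels.

```python
import pandas as pd

# Hypothetical toy data for denied and successful households
loan_no = pd.DataFrame(
    {
        "loan_purpose": ["Farm inputs", "Housing"],
        "hh_status": ["denied", "denied"],
    }
)
loan_yes = pd.DataFrame({"loan_purpose": ["Farm inputs", "Farm inputs", "Business"]})
loan_yes["hh_status"] = "successful"

# Stack the rows of both data frames and renumber the index
combined = pd.concat([loan_yes, loan_no], ignore_index=True)

# Share of each loan purpose within each status (columns sum to one)
shares = pd.crosstab(
    combined["loan_purpose"], combined["hh_status"], normalize="columns"
)
print(shares.round(3))
```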