<a href="https://colab.research.google.com/github/asternoeld/introduction-to-python/blob/main/kiva_loans_analysis_aster.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kiva Loans Analysis  
### Assignment 3 – Pandas

**Name:** Aster Noel Dsouza  
**Student ID:** 29211  
**Date:** November 18, 2025



## Introduction

In this notebook I analyse the **Kiva loans** dataset (`kiva_loans.csv`) using the Python library **pandas**.  
Kiva is a crowdfunding platform where lenders fund loans for borrowers around the world. The dataset contains information about loan amounts, sectors, countries, borrower genders, currencies, dates, and other details.

The goal of this assignment is to:
- Practice working with pandas DataFrames.
- Explore the dataset by asking and answering **8 questions**.
- Use important operations such as:
  - Filtering rows with string methods and numerical conditions.
  - Sorting tables.
  - Grouping and aggregating data with `groupby`.
  - Creating new variables using `apply` and `lambda`.

In the video presentation I will focus on 6 of these questions, but all 8 questions are included here for practice.


## 0. Setup and data loading

In this section I will:
- Import the libraries I need (mainly `pandas`).
- Load the `kiva_loans.csv` file into a DataFrame.
- Take a quick look at the structure of the dataset (columns, data types, and a few example rows).


```python
import pandas as pd

# If the CSV is in the same folder as your notebook:
df = pd.read_csv("kiva_loans.csv")

# Take a quick look at the data
df.head()
```


FileNotFoundError: [Errno 2] No such file or directory: 'kiva_loans.csv'
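The `FileNotFoundError` above suggests that `kiva_loans.csv` was not in the working directory when the cell ran (in Colab, for example, the file has to be uploaded or mounted first). The sketch below is one possible way to make the loading step fail with a clearer hint and to inspect the structure of a DataFrame. The helper name `load_loans` and the tiny `sample` table are hypothetical and only there for illustration; they are not part of the real dataset.

```python
from pathlib import Path

import pandas as pd


def load_loans(path="kiva_loans.csv"):
    """Load the loans CSV, raising a more helpful error if it is missing."""
    csv_path = Path(path)
    if not csv_path.exists():
        raise FileNotFoundError(
            f"{csv_path.resolve()} not found; upload the file or adjust the path."
        )
    return pd.read_csv(csv_path)


# Made-up stand-in rows (NOT real Kiva data), used only to show the inspection calls.
sample = pd.DataFrame(
    {
        "loan_amount": [300.0, 575.0, 150.0],
        "sector": ["Food", "Retail", "Agriculture"],
        "country": ["Kenya", "Philippines", "Peru"],
    }
)

print(sample.shape)   # number of rows and columns
print(sample.dtypes)  # data type of each column
sample.info()         # columns, non-null counts, and memory usage
print(sample.head())  # first few rows
```

Once the file is available, `df = load_loans()` would replace the direct `pd.read_csv` call, and the same inspection calls can be run on `df`.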

## Question 1 – Which sectors and activities are most common?

In this section I want to get a first idea of what kind of loans appear in the Kiva dataset.

I will:
- Look at the most frequent **sectors** of loans.
- Also check the most common **activities**.
- Compute, for each sector, how many loans there are and the average loan amount.
- Keep the currency information in the analysis (for example, by grouping by sector *and* currency), because loan amounts are recorded in different **currencies**.

This will help me understand which areas Kiva focuses on the most.
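The steps above could be sketched roughly as follows. The rows are made up for illustration, and the column names `activity` and `currency` are assumptions about how the CSV labels these fields; `sector` and `loan_amount` are the names used elsewhere in this notebook.

```python
import pandas as pd

# Made-up rows for illustration only; the real data comes from kiva_loans.csv.
# The column names `activity` and `currency` are assumptions.
loans = pd.DataFrame(
    {
        "sector": ["Agriculture", "Retail", "Agriculture", "Food", "Retail", "Agriculture"],
        "activity": ["Farming", "General Store", "Farming", "Restaurant", "Clothing", "Dairy"],
        "currency": ["KES", "PHP", "KES", "PHP", "PHP", "USD"],
        "loan_amount": [300.0, 200.0, 500.0, 400.0, 250.0, 1000.0],
    }
)

# Most frequent sectors and activities (value_counts sorts in descending order).
sector_counts = loans["sector"].value_counts()
activity_counts = loans["activity"].value_counts()

# Number of loans and average loan amount per sector and currency.
sector_summary = (
    loans.groupby(["sector", "currency"])
    .agg(n_loans=("loan_amount", "count"), avg_loan=("loan_amount", "mean"))
    .sort_values("n_loans", ascending=False)
)

print(sector_counts.head(10))
print(activity_counts.head(10))
print(sector_summary)
```

Grouping by both `sector` and `currency` keeps averages from mixing amounts in different currencies, at the cost of a longer summary table.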


## Question 2 – For a specific currency, which sectors receive the largest loans?

Because loan amounts use different currencies, it is hard to compare them all at once.  
In this question I will focus on **one single currency** (for example `USD` or another currency with many loans) and compare sectors only within that currency.

I will:
- Filter the dataset to keep only loans in the chosen currency.
- Optionally remove loans with missing or zero loan amounts.
- Group the data by sector and calculate:
  - The number of loans per sector.
  - The average loan amount per sector.
- Sort the results to see which sectors receive the biggest loans on average in that currency.

This gives a clearer and fairer comparison of loan sizes across sectors.
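A minimal sketch of this filter-then-group approach, assuming the currency is stored in a column called `currency` (an assumed name) and using made-up rows rather than the real file:

```python
import pandas as pd

# Made-up rows for illustration only; `currency` is an assumed column name.
loans = pd.DataFrame(
    {
        "sector": ["Agriculture", "Retail", "Agriculture", "Food", "Retail", "Food"],
        "currency": ["USD", "USD", "USD", "USD", "KES", "USD"],
        "loan_amount": [1000.0, 400.0, 600.0, 0.0, 300.0, None],
    }
)

chosen_currency = "USD"  # could be swapped for any currency with many loans

# Keep only loans in the chosen currency.
in_currency = loans[loans["currency"] == chosen_currency]

# Drop zero and missing amounts (a comparison with NaN is False, so NaN rows go too).
valid = in_currency[in_currency["loan_amount"] > 0]

# Loans per sector and average loan amount, largest average first.
by_sector = (
    valid.groupby("sector")
    .agg(n_loans=("loan_amount", "count"), avg_loan=("loan_amount", "mean"))
    .sort_values("avg_loan", ascending=False)
)

print(by_sector)
```

In this toy table the `Food` sector disappears entirely because its only rows in the chosen currency have a zero or missing amount, which is the kind of side effect worth checking on the real data.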


## Question 3 – Do female-only, male-only, and mixed-gender groups receive different loan sizes?

The dataset contains a column with the genders of the borrowers in each loan.  
The text can contain multiple values (e.g. "female, female", "male, female", etc.).

In this section I will:
- Create a new simplified category for each loan based on `borrower_genders`, for example:
  - `female_only`
  - `male_only`
  - `mixed`
  - `unknown` (for missing or unclear values)
- Use `apply` with a small `lambda` function to generate this new column.
- Group the data by this new borrower type and calculate:
  - The average loan amount.
  - The average number of lenders (`lender_count`).
- Sort the results to see which groups receive larger loans and/or attract more lenders.

This helps explore possible differences between different types of borrower groups.
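One possible way to build the simplified category is sketched below. The helper `classify_genders` and the new column name `borrower_type` are hypothetical names chosen for this example, and the rows are made up; only `borrower_genders`, `loan_amount`, and `lender_count` come from the dataset description above.

```python
import pandas as pd


def classify_genders(value):
    """Map a raw borrower_genders string to one simplified category."""
    if not isinstance(value, str) or not value.strip():
        return "unknown"  # missing or empty value
    genders = {g.strip().lower() for g in value.split(",") if g.strip()}
    if genders == {"female"}:
        return "female_only"
    if genders == {"male"}:
        return "male_only"
    if genders == {"female", "male"}:
        return "mixed"
    return "unknown"  # anything unexpected


# Made-up rows for illustration only.
loans = pd.DataFrame(
    {
        "borrower_genders": ["female", "female, female", "male", "male, female", None],
        "loan_amount": [300.0, 500.0, 400.0, 1200.0, 250.0],
        "lender_count": [10, 14, 12, 30, 8],
    }
)

# apply + lambda, as described in the plan above.
loans["borrower_type"] = loans["borrower_genders"].apply(lambda g: classify_genders(g))

# Average loan amount and lender count per borrower type, largest loans first.
by_type = (
    loans.groupby("borrower_type")[["loan_amount", "lender_count"]]
    .mean()
    .sort_values("loan_amount", ascending=False)
)

print(by_type)
```

Comparing sets of genders rather than raw strings means `"female, female"` and `"female"` fall into the same category regardless of group size.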


## Question 4 – How do selected countries compare in loan amounts and lender counts?

In this question I want to focus on a small set of countries and compare them.  
I will manually choose a few countries (for example 3–5 countries that interest me).

I will:
- Use `.isin()` to filter the rows where the `country` is one of the selected countries.
- Optionally also filter to a single currency (for example `USD`) so that loan amounts are comparable.
- For each selected country, compute:
  - The average loan amount.
  - The average number of lenders.
- Sort the summary table by average loan amount or by average lender count.

This will show how these countries compare in terms of typical loan sizes and lender participation.
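As a rough sketch of the comparison, the example below filters with `.isin()` and an optional currency condition. The country list, the `currency` column name, and all the rows are assumptions made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; `currency` is an assumed column name.
loans = pd.DataFrame(
    {
        "country": ["Kenya", "Philippines", "Kenya", "Peru", "India", "Peru"],
        "currency": ["USD", "USD", "USD", "USD", "USD", "PEN"],
        "loan_amount": [400.0, 300.0, 600.0, 1500.0, 800.0, 900.0],
        "lender_count": [12, 9, 18, 40, 20, 25],
    }
)

selected_countries = ["Kenya", "Philippines", "Peru"]  # example choice

# Keep selected countries, and only loans in one currency so amounts are comparable.
mask = loans["country"].isin(selected_countries) & (loans["currency"] == "USD")
subset = loans[mask]

# Average loan amount and average lender count per country.
country_summary = (
    subset.groupby("country")
    .agg(avg_loan=("loan_amount", "mean"), avg_lenders=("lender_count", "mean"))
    .sort_values("avg_loan", ascending=False)
)

print(country_summary)
```

Sorting by `avg_lenders` instead only requires changing the column name passed to `sort_values`.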


## Question 5 – Is there a relationship between the number of lenders and the loan amount?

Intuitively, we might expect that loans with more lenders are larger, but this is not guaranteed, so it is worth checking in the data.

In this section I will:
- Focus on the `lender_count` and `loan_amount` columns.
- Optionally remove rows with zero or missing loan amounts.
- Create a new variable that groups loans into **lender count buckets**, for example:
  - `1–5 lenders`
  - `6–10 lenders`
  - `11–20 lenders`
  - `>20 lenders`
- Use `apply` with a `lambda` function to assign each loan to a bucket.
- Group by this new bucket and compute the average loan amount in each group.
- Sort the buckets to see if there is a clear trend.

This gives an idea of whether more lenders typically means a bigger loan.
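The bucketing step could look something like the sketch below. The helper `lender_bucket`, the column name `lender_bucket`, and the rows are made up for this example. Note that a loan with zero lenders would land in the first bucket unless it is filtered out beforehand.

```python
import pandas as pd


def lender_bucket(n):
    """Assign a lender count to one of four buckets (0 would fall into the first)."""
    if n <= 5:
        return "1–5 lenders"
    if n <= 10:
        return "6–10 lenders"
    if n <= 20:
        return "11–20 lenders"
    return ">20 lenders"


bucket_order = ["1–5 lenders", "6–10 lenders", "11–20 lenders", ">20 lenders"]

# Made-up rows for illustration only.
loans = pd.DataFrame(
    {
        "lender_count": [2, 4, 3, 8, 10, 15, 25, 40],
        "loan_amount": [100.0, 200.0, 0.0, 300.0, 350.0, 600.0, 1000.0, 1400.0],
    }
)

# Remove zero or missing loan amounts; copy so a new column can be added safely.
valid = loans[loans["loan_amount"] > 0].copy()

# apply + lambda, as described in the plan above.
valid["lender_bucket"] = valid["lender_count"].apply(lambda n: lender_bucket(n))

# Average loan amount per bucket, shown in the natural bucket order.
by_bucket = valid.groupby("lender_bucket")["loan_amount"].mean().reindex(bucket_order)

print(by_bucket)
```

Reindexing with `bucket_order` is used here because sorting the labels alphabetically would place `>20 lenders` first and hide any trend.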


## Question 6 – Do shorter-term loans get funded faster than longer-term loans?

The dataset contains timestamps for when loans were posted and when they were funded.

In this question I will:
- Convert the `posted_time` and `funded_time` columns to datetime format.
- Create a new column, for example `time_to_fund_days`, which measures how many days passed between posting and funding.
- Use either:
  - The raw `term_in_months` value, or
  - A new variable that groups loans into term buckets (e.g. `short` / `medium` / `long` term).
- Group the loans by term (or term bucket) and compute the average `time_to_fund_days` for each group.
- Sort the results to see if shorter-term loans are funded more quickly.

This analysis explores a possible relationship between a loan's term and how quickly lenders fund it.
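A sketch of these steps on made-up timestamps is shown below. The bucket thresholds (8 and 14 months) and the column name `term_bucket` are arbitrary assumptions for illustration; loans that were never funded have a missing `funded_time` and are skipped by the mean.

```python
import pandas as pd

# Made-up rows for illustration only.
loans = pd.DataFrame(
    {
        "posted_time": [
            "2014-01-01 00:00:00+00:00",
            "2014-01-01 00:00:00+00:00",
            "2014-02-01 00:00:00+00:00",
            "2014-02-01 12:00:00+00:00",
            "2014-03-01 00:00:00+00:00",
            "2014-03-01 00:00:00+00:00",
        ],
        "funded_time": [
            "2014-01-03 00:00:00+00:00",
            "2014-01-05 00:00:00+00:00",
            "2014-02-11 00:00:00+00:00",
            "2014-02-07 12:00:00+00:00",
            "2014-03-21 00:00:00+00:00",
            None,  # never funded
        ],
        "term_in_months": [6, 8, 14, 12, 26, 26],
    }
)

# Convert text to timezone-aware datetimes (missing values become NaT).
loans["posted_time"] = pd.to_datetime(loans["posted_time"], utc=True)
loans["funded_time"] = pd.to_datetime(loans["funded_time"], utc=True)

# Days between posting and funding, as a float.
delta = loans["funded_time"] - loans["posted_time"]
loans["time_to_fund_days"] = delta.dt.total_seconds() / 86400

# Term buckets; the thresholds are assumptions chosen for this example.
loans["term_bucket"] = loans["term_in_months"].apply(
    lambda m: "short" if m <= 8 else ("medium" if m <= 14 else "long")
)

# Average time to fund per bucket, fastest first (NaN rows are ignored by mean).
by_term = loans.groupby("term_bucket")["time_to_fund_days"].mean().sort_values()

print(by_term)
```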


## Question 7 – How many borrowers are in each loan, and does this vary by sector or country?

Some loans are for a single borrower, while others are for a group.  
We can estimate the number of borrowers from the `borrower_genders` column, where multiple genders are separated by commas.

In this section I will:
- Use `apply` with a `lambda` function on `borrower_genders` to:
  - Count how many genders are listed for each loan.
  - Store this in a new column such as `num_borrowers`.
- Handle missing values safely (for example, by setting `num_borrowers` to `0` or `1`, and stating which choice I made).
- Group by `sector` (and/or by `country`) and compute the average number of borrowers per loan.
- Sort the results by the average `num_borrowers` to see where group loans are more common.

This shows how group vs individual loans are distributed across sectors or countries.
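The counting step might be sketched as follows, here treating a missing `borrower_genders` value as `0` borrowers (one of the two options mentioned above, chosen arbitrarily for this example). The rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
loans = pd.DataFrame(
    {
        "sector": ["Agriculture", "Agriculture", "Retail", "Retail", "Food"],
        "borrower_genders": ["female, female, male", "female", "male", "female", None],
    }
)

# Count the comma-separated entries; missing values count as 0 (an assumption).
loans["num_borrowers"] = loans["borrower_genders"].apply(
    lambda g: len([part for part in g.split(",") if part.strip()])
    if isinstance(g, str)
    else 0
)

# Average number of borrowers per loan in each sector, largest first.
by_sector = (
    loans.groupby("sector")["num_borrowers"].mean().sort_values(ascending=False)
)

print(by_sector)
```

Grouping by `country` instead, or by both columns, only changes the argument passed to `groupby`.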


## Question 8 – Are loans with certain tags different in size from the overall average?

The dataset includes a `tags` column with extra information about the purpose or characteristics of the loan.  
These tags are stored as text, often with several tags in the same field.

In this question I will:
- Explore which tags appear most frequently, using a simple count.
- Pick one or two tags of interest (for example, tags containing words such as "education" or "water").
- Use string methods (such as `.str.contains(...)`) to filter loans that include a given tag.
- Compare the average loan amount for loans with that tag against:
  - the overall average loan amount, or
  - loans without that tag.
- Optionally repeat for more than one tag.

This gives a simple view of whether certain tagged projects tend to have larger or smaller loan amounts.
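A rough sketch of the tag comparison is given below. The tag strings and amounts are made up for illustration and may not match how tags are actually written in the real `tags` column; the comma separator is also an assumption.

```python
import pandas as pd

# Made-up rows for illustration only; real tag text may look different.
loans = pd.DataFrame(
    {
        "tags": ["education, parent", "water", None, "education", "animals, water", "parent"],
        "loan_amount": [800.0, 300.0, 500.0, 1000.0, 400.0, 600.0],
    }
)

# Simple frequency count: split on commas, one tag per row, strip spaces, count.
tag_counts = (
    loans["tags"].dropna().str.split(",").explode().str.strip().value_counts()
)

# Loans whose tags mention the word of interest (missing tags count as "no").
tag_of_interest = "education"
has_tag = loans["tags"].str.contains(tag_of_interest, case=False, na=False)

avg_with_tag = loans.loc[has_tag, "loan_amount"].mean()
avg_without_tag = loans.loc[~has_tag, "loan_amount"].mean()
overall_avg = loans["loan_amount"].mean()

print(tag_counts)
print(f"with '{tag_of_interest}': {avg_with_tag:.2f}")
print(f"without '{tag_of_interest}': {avg_without_tag:.2f}")
print(f"overall: {overall_avg:.2f}")
```

Because `.str.contains` matches substrings, a search for `"water"` would also match a longer tag that merely contains that word, which may or may not be what is wanted.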


## Conclusion

In this notebook I explored the Kiva loans dataset using pandas and answered eight different questions.  
Across these questions I used:

- Row selection with string methods and numerical conditions.
- Sorting tables by one or more columns.
- Grouping and aggregating data with `groupby`.
- Creating new variables with `apply` and `lambda`.

The analyses gave some insights into sectors, countries, borrower types, loan sizes, and funding behaviour on Kiva.  
In the accompanying video I will briefly present six of these questions and highlight the most relevant pieces of code and results.
