### Load and Analyze Grant Data
The `download_data.py` script has already downloaded, extracted, and parsed the raw XML data into a clean CSV file at `data/parsed_grants.csv`. This notebook loads that pre-processed file and analyzes the grant information it contains.

In [1]:
import pandas as pd

# Load the parsed data from the CSV file
grants_df = pd.read_csv('data/parsed_grants.csv')

# Display basic information and the first few rows to verify the data
print("Successfully loaded 'data/parsed_grants.csv'")
print(f"The dataset contains {grants_df.shape[0]} rows and {grants_df.shape[1]} columns.")
print("\nFirst 5 rows of the dataset:")
print(grants_df.head())

# Display summary statistics for numerical columns
print("\nSummary statistics for numerical columns:")
print(grants_df.describe())

# Check for missing values in each column
print("\nMissing values per column:")
print(grants_df.isnull().sum())

Successfully loaded 'data/parsed_grants.csv'
The dataset contains 327376 rows and 11 columns.

First 5 rows of the dataset:
    FilerEIN                    FilerName  ReturnType TaxPeriodEnd  \
0  208578192           OAKLAWN FOUNDATION         990   2024-06-30   
1  208578192           OAKLAWN FOUNDATION         990   2024-06-30   
2  570314406  North Greenville University         990   2024-05-31   
3  570314406  North Greenville University         990   2024-05-31   
4   43102943     NEW AMERICAN ASSOCIATION         990   2024-06-30   

   TotalGrantsPaid                  RecipientName  RecipientCity  \
0                0  ARKANSAS COMMUNITY FOUNDATION            NaN   
1                0        OAKLAWN CENTER ON AGING            NaN   
2                0      First Presbyterian Church            NaN   
3                0     Tigerville Fire Department            NaN   
4                0   CONGOLESE DEVELOPMENT CENTER            NaN   

   RecipientState  RecipientZIP  GrantAmount  

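The preview above shows `FilerEIN` parsed as an integer, and one value (`43102943`) has only eight digits. EINs are nine digits long, so a leading zero was presumably dropped on load. The same can happen to ZIP codes. The following is a minimal sketch of one way to avoid this, assuming the column names shown in the preview. The sample rows are hypothetical and serve only to make the snippet self-contained.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics a few of the previewed columns.
# The values are illustrative and are NOT taken from the real file.
sample_csv = io.StringIO(
    "FilerEIN,FilerName,RecipientZIP,GrantAmount\n"
    "012345678,EXAMPLE FOUNDATION A,02118,5000\n"
    "987654321,EXAMPLE FOUNDATION B,71901,10000\n"
)

# Read identifier-like columns as strings so leading zeros are preserved.
id_dtypes = {"FilerEIN": str, "RecipientZIP": str}
sample_df = pd.read_csv(sample_csv, dtype=id_dtypes)

print(sample_df["FilerEIN"].tolist())
print(sample_df.dtypes)
```

Passing the same `dtype` mapping to `pd.read_csv('data/parsed_grants.csv', ...)` should have the same effect on the real file, provided the CSV itself still contains the leading zeros.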
In [10]:
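# Define keywords for a broad search related to kidney health.
# Note: these are matched as substrings, so 'organ' also matches words
# such as 'organization'; expect some false positives in the results.
keywords = ['kidney', 'organ', 'dialysis', 'transplant', 'renal', 'nephrology']

# Join the keywords into a single regex alternation (kidney|organ|...).
# Case-insensitivity is applied by `case=False` in the filter below.
regex_pattern = '|'.join(keywords)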

# Filter the DataFrame for grants where 'GrantPurpose' contains any of the keywords.
# The `na=False` argument ensures that NaN values in 'GrantPurpose' are treated as not matching.
kidney_related_grants_df = grants_df[grants_df['GrantPurpose'].str.contains(regex_pattern, case=False, na=False)].copy()

# Display the number of grants found and show a sample of the results
print(f"Found {len(kidney_related_grants_df)} grants related to the specified keywords.")
print("\nFirst 10 rows of the filtered dataset:")
# Preview the first 10 matching grants
print(kidney_related_grants_df.head(10))

Found 3196 grants related to the specified keywords.

First 10 rows of the filtered dataset:
      FilerEIN             FilerName  ReturnType TaxPeriodEnd  \
11   870800705  Fuse Innovation Fund         990   2024-06-30   
12   870800705  Fuse Innovation Fund         990   2024-06-30   
13   870800705  Fuse Innovation Fund         990   2024-06-30   
14   870800705  Fuse Innovation Fund         990   2024-06-30   
15   870800705  Fuse Innovation Fund         990   2024-06-30   
16   870800705  Fuse Innovation Fund         990   2024-06-30   
255  882722663      AGE WELL AT HOME         990   2024-12-31   
256  882722663      AGE WELL AT HOME         990   2024-12-31   
257  882722663      AGE WELL AT HOME         990   2024-12-31   
381  741159753     BAYLOR UNIVERSITY         990   2024-05-31   

     TotalGrantsPaid                               RecipientName  \
11                 0                      PA Alliance Foundation   
12                 0                           Justice 