dlymarg/aml-bhs

Background

The Applied Mathematics Laboratory (AML) was contracted by the Baltimore Humane Society (BHS) to evaluate their donor database to classify donors, visualize donation data, and understand what factors encourage major donations.

Geocoding

One of our tasks was to convert a .csv file of over 20,000 addresses into a set of GPS coordinates, a conversion known as geocoding. This was necessary because other parts of the AML project relied heavily on coordinates. With coordinates in hand, a machine can perform tasks that would be impractical by hand on a dataset of this size, such as compiling a list of invalid addresses, plotting donors on a map, and calculating the distance between an address and a centroid. Identifying invalid or missing addresses is also useful for mailing purposes: if an address in the BHS database is invalid or missing, the BHS can save on postage by no longer sending mail to it.
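As an illustration of the distance task mentioned above, here is a minimal sketch that computes a centroid and the great-circle distance from a point to it. The haversine formula, the function names, and the Earth radius constant are our choices for this example; the source does not state how distances were computed in the project.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)


def centroid(points):
    """Arithmetic mean of (lat, lng) pairs; reasonable for a small region."""
    lats = [p[0] for p in points]
    lngs = [p[1] for p in points]
    return sum(lats) / len(lats), sum(lngs) / len(lngs)


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lng) points in degrees."""
    lat1, lng1 = map(math.radians, a)
    lat2, lng2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2
    # Clamp to guard against rounding pushing h slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))
```

For example, `haversine_km(point, centroid(all_points))` gives the distance from one geocoded address to the centroid of a group of addresses.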

Geocoding was performed using a free application programming interface (API) from Google. To verify the coordinates produced by Google's API, the addresses were also geocoded using Maryland iMAP's Composite Locator and the U.S. Census Geocoder.

Of the addresses in the donor database, 20,660 (97.6%) geocoded properly, 211 (1.0%) did not geocode properly, and 289 (1.4%) did not have enough information to undergo geocoding (e.g., a missing zip code).

In this repository, you will find a sample list of addresses (sample.csv) and a Python file that performs the process of geocoding (geocoding.py). The formatting of sample.csv is as follows (by column):

ID number, Name, Address1, Address2, Address3, Address4, City, State, Zip Code

How you process an address depends on how many address fields it uses; geocoding.py accounts for as many as four address fields (Address1 through Address4). The Python file should generate three .csv files as output: one with successfully geocoded addresses, one with addresses that failed to geocode, and one with donors whose address information in the database is insufficient.
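The three-way split described above can be sketched as follows. This is not the code in geocoding.py: the column keys, the rule for "insufficient information" (a missing street line, city, state, or zip code), and the helper names are assumptions made for this example. The `geocode` argument is any callable that takes an address string and returns a list of results shaped like Google Geocoding API results, which keeps the sketch runnable without an API key.

```python
import csv

ADDRESS_FIELDS = ("Address1", "Address2", "Address3", "Address4")
COLUMNS = ("ID", "Name") + ADDRESS_FIELDS + ("City", "State", "Zip")


def read_rows(path):
    """Read sample.csv-style rows; assumes the file has no header row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, fieldnames=COLUMNS, restval=""))


def build_address(row):
    """Join the non-empty parts into one query string, or return None
    when a required part is missing (assumed rule)."""
    street = " ".join(row[f].strip() for f in ADDRESS_FIELDS if row[f].strip())
    city, state, zip_code = (row[k].strip() for k in ("City", "State", "Zip"))
    if not all((street, city, state, zip_code)):
        return None
    return f"{street}, {city}, {state} {zip_code}"


def classify(rows, geocode):
    """Split rows into good (ID, lat, lng), bad (ID), and none (ID) lists."""
    good, bad, none = [], [], []
    for row in rows:
        address = build_address(row)
        if address is None:
            none.append(row["ID"])
            continue
        results = geocode(address)
        if results:
            location = results[0]["geometry"]["location"]
            good.append((row["ID"], location["lat"], location["lng"]))
        else:
            bad.append(row["ID"])
    return good, bad, none
```

With the googlemaps package installed and a valid key, `googlemaps.Client(key="YOUR_API_KEY").geocode` can be passed as the `geocode` argument.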

The status of all donors can be compiled into a single file using donor_id_sorting.py. This file reads the output files generated by geocoding.py and places donors into an output file (donor_info.csv) that consists of donor ID, geocoding status (good, bad, or none), and GPS coordinates if applicable.
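A minimal sketch of this compilation step is shown below. It is not the code in donor_id_sorting.py: the in-memory inputs, the header names, and the sort by donor ID are assumptions for illustration, since the layout of the intermediate files is not described here.

```python
import csv


def compile_status(good, bad, none):
    """Return one (donor_id, status, lat, lng) row per donor, sorted by ID.

    good: iterable of (donor_id, lat, lng); bad and none: iterables of donor_id.
    """
    rows = [(donor_id, "good", lat, lng) for donor_id, lat, lng in good]
    rows += [(donor_id, "bad", "", "") for donor_id in bad]
    rows += [(donor_id, "none", "", "") for donor_id in none]
    return sorted(rows, key=lambda r: r[0])


def write_donor_info(rows, out):
    """Write the compiled rows to an open text file (or any file-like object)."""
    writer = csv.writer(out)
    writer.writerow(["donor_id", "status", "lat", "lng"])  # assumed header names
    writer.writerows(rows)
```

To write donor_info.csv, open the file with `open("donor_info.csv", "w", newline="")` and pass the handle as `out`.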

Packages required: googlemaps (requires a key from the Google Maps API to work), csv

Donor Grouping

The AML sought ways to categorize donors based on their donation history. By evaluating donation histories, the AML can identify current or previous donors who may be more likely to donate to the BHS. The main goals were to highlight potential periods of inactivity and to distinguish similar donation amounts and times. To do so, the discrete data from the donor database was converted into continuous functions that capture not only how much a donor gave at a given time, but also how active the donor was over a span of donations. For example, if a donor donated 25 dollars in March 2013 and did not make another donation until May 2016, it was important to capture this period of inactivity. Smooth, bell-shaped functions were created to represent donors' donation behavior and were defined as follows:

In this piecewise function, t_1 and t_3 define the width of a bell. The coefficient c was solved using the following integral, where d is donation amount:

Here is an example of the graph the piecewise function creates for a single donor:

This graph shows a donor's donation history from their first donation to their most recent donation. Notice that many of the bells overlap each other. Since this graph is difficult to interpret, we can aggregate all overlapping regions to generate a nice, smooth, continuous function:

Aggregating these regions not only makes the graph cleaner but also highlights periods of active donation behavior. The aggregation therefore yields a single function per donor, which is considered that donor's "profile."
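The idea of one bell per donation, summed into a profile, can be sketched as follows. This is an illustration only and not the definition used in functions.py: the raised-cosine shape, the choice of c so that the area under each bell equals the donation amount d, and the `half_width` parameter are all assumptions made for this example.

```python
import numpy as np


def bell(t, t1, t3, d):
    """Raised-cosine bump on [t1, t3], zero elsewhere, scaled so its area is d."""
    t = np.asarray(t, dtype=float)
    width = t3 - t1
    # The unscaled bump 0.5 * (1 - cos(...)) has area width / 2, so c = 2d / width.
    c = 2.0 * d / width
    inside = (t >= t1) & (t <= t3)
    shape = 0.5 * (1.0 - np.cos(2.0 * np.pi * (t - t1) / width))
    return np.where(inside, c * shape, 0.0)


def profile(t, days, amounts, half_width=30.0):
    """Sum one bell per donation, centred on the donation day (hypothetical width)."""
    t = np.asarray(t, dtype=float)
    total = np.zeros_like(t)
    for day, amount in zip(days, amounts):
        total += bell(t, day - half_width, day + half_width, amount)
    return total
```

Under these assumptions, donations that fall close together produce overlapping bells that add up to a taller region, while a long gap between donations leaves the profile at zero.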

Multiple donor profiles can be graphed together. Profiles can be generated for any desired number of donors, provided each donor has made more than five donations. Because behavior is difficult to determine from few data points, donors with five or fewer donations were not considered.

In this repository, you will find a sample list of donor IDs accompanied by the days on which donations were made and the donation amounts (days_gift_amounts.csv) and a Python file that creates the donor profiles (functions.py).
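The rule of keeping only donors with more than five donations can be sketched as below. The row layout (donor ID, day, amount, one donation per row) is an assumption for this example and may not match the actual layout of days_gift_amounts.csv; the function names are hypothetical.

```python
import csv
from collections import defaultdict

MIN_DONATIONS = 6  # "more than five donations"


def load_histories(lines):
    """Group (day, amount) pairs by donor ID from rows of donor_id, day, amount."""
    histories = defaultdict(list)
    for donor_id, day, amount in csv.reader(lines):
        histories[donor_id].append((float(day), float(amount)))
    return histories


def eligible(histories):
    """Keep donors with enough donations, with each history sorted by day."""
    return {
        donor_id: sorted(history)
        for donor_id, history in histories.items()
        if len(history) >= MIN_DONATIONS
    }
```

`load_histories` accepts any iterable of text lines, so an open file handle can be passed directly.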

Future Work

To understand how donors can be classified into groups based on behavior, the K-means clustering algorithm can be applied. K-means is a data analysis method that assigns a common label to donors who have similar behavior.
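Since this is future work, the following is only a sketch of how K-means (Lloyd's algorithm) could be written with numpy. It assumes each donor is represented by a fixed-length feature vector, for example a profile sampled on a common time grid; that representation, the function name, and the parameters are assumptions rather than anything implemented in this repository.

```python
import numpy as np


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on an (n, m) array; returns (labels, centers)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]

    def assign(current_centers):
        # Distance from every point to every center, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - current_centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(iterations):
        labels = assign(centers)
        # Move each center to the mean of its points; keep it if the cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(centers), centers
```

The number of groups k must be chosen in advance, and results can depend on the random starting centers, so running with several seeds is a common precaution.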

Packages required: csv, numpy, matplotlib

About

A collection of Python files that were used in Towson University's Applied Mathematics Laboratory (AML) for the spring 2018 and fall 2018 semesters
