# Data Science for Good: DonorsChoose.org

This kernel investigates donor modelling using the data publicly available on Kaggle - https://www.kaggle.com/donorschoose/io/data


The problem description from Kaggle:

A good solution will enable DonorsChoose.org to build targeted email campaigns recommending specific classroom requests to prior donors. Part of the challenge is to assess the needs of the organization, uncover insights from the data available, and build the right solution for this problem. Submissions will be evaluated on the following criteria:

    Performance - How well does the solution match donors to project requests to which they would be motivated to donate? DonorsChoose.org will not be able to live test every submission, so a strong entry will clearly articulate why it will be effective at motivating repeat donations.

    Adaptable - The DonorsChoose.org team wants to put the winning submissions to work, quickly. Therefore a good entry will be easy to implement in production.

    Intelligible - A good entry should be easily understood by the DonorsChoose.org team should it need to be updated in the future to accommodate a changing marketplace.


## Working with the data

This notebook explores the Donations.csv file provided by DonorsChoose.org in partnership with Google.org. It documents the process from basic descriptive statistics through to a survival model that predicts the likelihood of donor drop-off.

Follow along in your own Python session if you want to have a go yourself. I'm working in Python 3.5 right now, my admin has been slacking recently XD. You'll need the following packages installed with pip:

numpy
pandas
matplotlib



In [2]:
import pandas as pd
import numpy as np

columns = ['Project ID',
          'Donation ID',
          'Donor ID',
          'Donation Included Optional Donation',
          'Donation Amount',
          'Donor Cart Sequence',
          'Donation Received Date']

dtypes = {'Project ID': object,
          'Donation ID': object,
          'Donor ID': object,
          'Donation Included Optional Donation': object,
          'Donation Amount': np.float64,
          'Donor Cart Sequence': np.float64,
          'Donation Received Date': object}

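# header=0 is needed because names is passed to the function but the first line of the csv contains the column titles.
# This ensures the header row is replaced by our names rather than read in as data.
# The raw string (r'...') stops the backslash in the Windows path from being treated as an escape character.
df = pd.read_csv(r'..\Donations.csv', names=columns, header=0, dtype=dtypes)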
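
If you want to convince yourself what `header=0` does when `names` is passed, here is a minimal sketch on an in-memory CSV. The two-row sample and the short column names are made up for illustration; they are not part of the real Donations.csv.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for Donations.csv
sample_csv = io.StringIO(
    "Project ID,Donation ID,Donor ID,Donation Amount\n"
    "p1,d1,u1,25.0\n"
    "p2,d2,u2,50.0\n"
)
names = ['project', 'donation', 'donor', 'amount']  # illustrative names only

# header=0 tells pandas that row 0 is a header, which `names` then replaces
with_header = pd.read_csv(sample_csv, names=names, header=0)

sample_csv.seek(0)
# Without header=0, the original header row is read in as an ordinary data row
without_header = pd.read_csv(sample_csv, names=names)

print(len(with_header), len(without_header))
print(without_header['amount'].tolist())
```

In the second frame the text "Donation Amount" ends up inside the amount column, which is exactly the conversion problem the comment above is guarding against.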

## Initial Data Profiling

At the outset of any new project, I always like to find my bearings. Determining some high-level statistics helps to provide this picture. Descriptive statistics are the most basic form of data analysis and are essential to an effective workflow. Following this, visualising the data is another important skill to have in your back pocket.

In [3]:
print(df.shape)

(4687884, 7)


Slow and steady now! Okay, so the dataframe that we've loaded with the Donations.csv has 4 687 884 rows! That's some hefty data. Fortunately we've only got 7 columns to work with, which have been defined earlier. 

I think we'll need to enrich this data with some additional statistical information at the donor level to create predictors of survival. We'll cross that bridge when we get there.

In [9]:
print("The minimum donation amount is: $", df['Donation Amount'].min())
print("The maximum donation amount is: $", df['Donation Amount'].max())
print("The average donation amount is: $", df['Donation Amount'].mean())
df_mode = df['Donation Amount'].mode() # note that mode() returns a Series; in the case of ties it holds every most common value, one per row.
print("The most common donation amount is: $", df_mode.iloc[0])


The minimum donation amount is: $ 0.01
The maximum donation amount is: $ 60000.0
The average donation amount is: $ 60.6687885792
The most common donation amount is: $ 25.0


That tells us a little more. Donations vary from as little as \$0.01 to as much as \$60 000, whilst the mean is \$60.67. The mean is a measure of central tendency, but it isn't necessarily the best one to use for this kind of data set. From my own experience, I expect this data set to be fairly right-skewed (a long tail to the right of the mean). A priori, I would guess that there are a lot of smaller donations and fewer at the high end. The mode of \$25, well below the mean, gives me extra evidence that this is indeed the case.

Don't forget to apply your own intuitions to the data - ask yourself what your prior knowledge or assumptions might be, and always challenge them.
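
To make the right-skew intuition a bit more concrete, here is a small sketch on synthetic data. The log-normal sample and its parameters are my own assumption, chosen only because they give a long right tail; they are not fitted to the DonorsChoose data.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Synthetic, log-normally distributed "donations"; not the real data
amounts = pd.Series(rng.lognormal(mean=3.2, sigma=1.0, size=100000))

print("mean:  ", amounts.mean())
print("median:", amounts.median())
print("skew:  ", amounts.skew())  # positive for a long right tail
```

In a right-skewed sample like this the mean sits above the median because a few very large values drag it upwards. Running `df['Donation Amount'].skew()` on the real data would be the equivalent check.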

Sometimes the median, the middle item in an ordered list, is more appropriate. If the distribution of donation amounts were symmetric (as a normal distribution is), the median and the mean would be the same. To calculate the median, we'll first sort the dataframe, although pandas' median() doesn't actually require sorted input.

In [16]:
df_sort = df.sort_values(by='Donation Amount', ascending=True)
print("The middle donation amount is: $", df_sort['Donation Amount'].median())

#sorting actually isn't needed here!

The middle donation amount is: $ 25.0


Ha! Knew it! Funnily enough, the most common donation is the same as the middle donation. That isn't always the case.

We can also do some further investigatory work to measure the spread of values across the distribution. Some inter-quartile information and the standard deviation should help us here.

In [15]:
print("The standard deviation is: $", df['Donation Amount'].std())
print("The variance is: $", df['Donation Amount'].var())
print("The 1st, 2nd, 3rd quantiles are: \n", df_sort['Donation Amount'].quantile([0.25, 0.50, 0.75]))

The standard deviation is: $ 166.899615325
The variance is: $ 27855.4815956
The 1st, 2nd, 3rd quantiles are: 
 0.25    14.82
0.50    25.00
0.75    50.00
Name: Donation Amount, dtype: float64


These statistics are a little more difficult to interpret, but the standard deviation can be used to estimate what proportion of the population lies within specific intervals around the mean. For instance, for normally distributed data approximately 68% of values lie within 1 standard deviation above and below the mean; for skewed data like ours that figure is only a rough guide. The higher the standard deviation, the greater the spread. Variance is a closely related concept: it is the square of the standard deviation.
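
Here is a rough sketch of how much the "within 1 standard deviation" share depends on the shape of the distribution. Both samples are synthetic, and `share_within_one_sd` is a helper name I made up for this example.

```python
import numpy as np

rng = np.random.RandomState(0)

def share_within_one_sd(values):
    """Fraction of values lying within one standard deviation of the mean."""
    values = np.asarray(values, dtype=float)
    mean, sd = values.mean(), values.std()  # population std (ddof=0)
    return np.mean(np.abs(values - mean) <= sd)

# Synthetic samples: one symmetric, one with a long right tail
normal_sample = rng.normal(loc=60, scale=15, size=100000)
skewed_sample = rng.lognormal(mean=3.2, sigma=1.0, size=100000)

print("normal:", share_within_one_sd(normal_sample))
print("skewed:", share_within_one_sd(skewed_sample))
```

The normal sample lands close to the textbook 68%, while the skewed one puts a noticeably larger share inside the interval, because the large standard deviation is driven by a handful of extreme values.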

The quartiles are the cutoff points that split the population into quarters: 25% of the donation amounts are below \$14.82, 50% are below \$25, and 75% are below \$50.
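
The quartiles also give a spread measure that is less sensitive to extreme values than the standard deviation: the inter-quartile range (Q3 minus Q1). A minimal sketch on a made-up sample of eight amounts:

```python
import pandas as pd

# Small hypothetical sample of donation amounts, including one large outlier
amounts = pd.Series([5, 10, 15, 25, 25, 50, 100, 1000], dtype=float)

# quantile() interpolates linearly between neighbouring values by default
q1, q2, q3 = amounts.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1  # width of the middle 50% of the data

print("Q1:", q1, "median:", q2, "Q3:", q3)
print("IQR:", iqr)
print("std:", amounts.std())
```

The single \$1000 value inflates the standard deviation, while the inter-quartile range only reflects the middle half of the sample.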

## Visualising the Data

Visualising the data is an important step to understanding it. A picture is worth 1 000 words. With that in mind, we'll need to import matplotlib.
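
A first picture of a right-skewed variable could be a histogram with logarithmically spaced bins. The sketch below uses a synthetic log-normal sample as a stand-in for `df['Donation Amount']`; the bin count and axis labels are my own choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
amounts = rng.lognormal(mean=3.2, sigma=1.0, size=10000)  # synthetic stand-in

# Logarithmically spaced bin edges keep the long right tail readable
bins = np.logspace(np.log10(amounts.min()), np.log10(amounts.max()), 50)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(amounts, bins=bins)
ax.set_xscale("log")
ax.set_xlabel("Donation amount (USD)")
ax.set_ylabel("Number of donations")
ax.set_title("Distribution of donation amounts (synthetic data)")
# plt.show()  # uncomment in an interactive session
```

On a linear axis almost everything would be squashed into the first few bins, which is why the log scale is worth trying for this kind of data.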

## Some Data Preparation

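We enhance the donor data by creating additional columns of summary statistics aggregated per donor: the number of donations, the date of the most recent donation, and statistics on the donation amounts.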

In [None]:
# as_index=False keeps 'Donor ID' as a regular column so the datasets can be joined later.
# The default behaviour of the groupby method is to make the groupby variable an index.
df_donor_count = df[['Donation ID', 'Donor ID']].groupby('Donor ID', as_index=False).count()
df_donor_recency = df[['Donor ID', 'Donation Received Date']].groupby('Donor ID', as_index=False).max()
# 'mode' is left out: groupby has no built-in aggregation of that name.
# Aggregating a single column with a list of names gives flat column names, which keeps the later merge simple.
df_donor_donations = df.groupby('Donor ID')['Donation Amount'].agg(['min', 'max', 'mean', 'sum']).reset_index()


In [None]:
df_donor_count.rename(columns={'Donation ID': 'Donation Count'}, inplace=True)
df_donor_recency.rename(columns={'Donation Received Date': 'Most Recent Donation'}, inplace=True)

In [None]:
print(df_donor_recency.index, df_donor_count.index)

In [None]:
df_donor_int = pd.merge(df_donor_count, df_donor_recency, how='inner', on='Donor ID')
df_donor = pd.merge(df_donor_int, df_donor_donations, how='inner', on='Donor ID')
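
To sanity-check the per-donor aggregation without waiting on 4.6 million rows, here is a sketch of the same idea on a four-row toy table. The toy data is made up, and computing the per-donor mode separately with `apply` is my own assumption about how that statistic could be obtained.

```python
import pandas as pd

# Hypothetical toy donations table using the same column names as Donations.csv
toy = pd.DataFrame({
    'Donation ID': ['d1', 'd2', 'd3', 'd4'],
    'Donor ID': ['u1', 'u1', 'u2', 'u1'],
    'Donation Amount': [25.0, 50.0, 10.0, 25.0],
    'Donation Received Date': ['2017-01-05', '2017-03-01', '2016-12-24', '2018-02-10'],
})

grouped = toy.groupby('Donor ID')

count = grouped['Donation ID'].count().rename('Donation Count').reset_index()
# ISO-formatted date strings sort chronologically, so max() gives the latest date
recency = grouped['Donation Received Date'].max().rename('Most Recent Donation').reset_index()
amounts = grouped['Donation Amount'].agg(['min', 'max', 'mean', 'sum']).reset_index()
# Most common amount per donor; with ties, the smallest tied value is kept
mode = grouped['Donation Amount'].apply(lambda s: s.mode().iloc[0]).rename('mode').reset_index()

donor = (count.merge(recency, how='inner', on='Donor ID')
              .merge(amounts, how='inner', on='Donor ID')
              .merge(mode, how='inner', on='Donor ID'))
print(donor)
```

Each donor ends up on exactly one row, which is the shape we want before building donor-level predictors. Be aware that the `apply` step runs a Python function per donor, so it is likely to be slow on the full data set.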