# DonorsChoose.org EDA

[DonorsChoose.org](http://DonorsChoose.org) is a nonprofit dedicated to providing the funds that teachers need in order to improve the overall quality of education.

## The problem
DonorsChoose.org wants a method for building targeted email campaigns that recommend specific classroom requests to prior donors.

The three target metrics for the solution are:
* Performance - Good Targeting
* Adaptable - Feasible Implementation
* Intelligible - Easily Understandable

This kernel gives a brief overview of the organization and of the problem we wish to solve using the data we have been provided. Perhaps the most important metric for any large organization is the bottom line, so the kernel begins by examining income and the possible impact of a successful solution on income.
To get a good idea of how to re-engage donors for DonorsChoose.org, we first need a better understanding of the data.


In [1]:
library(data.table)
library(ggplot2)
library(scales)
library(repr)
options(repr.plot.width=10, repr.plot.height=3)

In [2]:
Donations <- fread('../input/Donations.csv')
Donors <- fread('../input/Donors.csv')
Projects <- fread('../input/Projects.csv')
Resources <- fread('../input/Resources.csv')
Schools <- fread('../input/Schools.csv')
Teachers <- fread('../input/Teachers.csv')

In [3]:
head(Donors)
head(Schools)

## Donations
The most important information is the 'Donations' information. After all, this is how the money is made!

In [4]:
head(Donations)

In [5]:
# Visualizing Donations over time

Donations[, `Donation Received Date` := anytime::anydate(`Donation Received Date`)]
Donations[, Year := format(`Donation Received Date`, '%Y')]
Donation_by_day <- Donations[,.(Total_Donations = sum(`Donation Amount`)), by = `Donation Received Date`]
Donation_by_day[, Year := format(`Donation Received Date`, '%Y')]

In [6]:
Donations[,.(`Donation Amount` = sum( `Donation Amount`)), by =Year]


## Revenue Year by Year

In [7]:
ggplot(Donations[Year!=2018 & Year!=2012,.(`Donation Amount` = sum( `Donation Amount`)), by =Year], aes(Year, `Donation Amount`)) +
    geom_line(aes(group = 1), color = 'green') +
    scale_y_continuous(labels=dollar_format())

In [8]:
print(paste0('Ratio of unique projects per donation: ' ,round(length(unique(Donations$`Project ID`)) / nrow(Donations),4)))
print(paste0('Ratio of unique donors per donation: ', round(length(unique(Donations$`Donor ID`)) / nrow(Donations),4)))

In [9]:
ggplot(Donation_by_day, aes(`Donation Received Date`, Total_Donations)) +
  geom_line(size = 1, aes(color = Year)) +
  theme(legend.position="none")

In [10]:
ggplot(Donation_by_day[`Year` >= 2015], aes(`Donation Received Date`, Total_Donations)) +
  geom_line(size = 1, aes(color = Year)) +
  theme(legend.position="none")

In [None]:
#ggplot(Donations[`Donation Amount` < 600], aes(`Donation Amount`)) +
#  geom_histogram(binwidth = 25, fill = 'skyblue', color = 'black') +
#  scale_x_continuous(breaks = seq(0, 600, by = 25)) +
#  labs(title = '(covers over 99% of the data)')

# Don't run this cell on Kaggle; it will crash the whole kernel
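
One possible workaround, sketched under the assumption that the crash comes from handing ggplot every raw donation row: count the donations per $25 bin in data.table first, then plot only the per-bin counts. The table `toy_donations` and its values are made up for illustration, and `bin_width` is a hypothetical name.

```r
library(data.table)

# Hypothetical stand-in for the real Donations table (made-up amounts).
toy_donations <- data.table(
  `Donation Amount` = c(5, 10, 24.99, 25, 40, 49, 50, 120, 599, 750)
)

bin_width <- 25

# Assign each donation under $600 to the lower edge of its $25 bin,
# then count rows per bin so only one row per bin reaches the plot.
binned <- toy_donations[
  `Donation Amount` < 600,
  .(Count = .N),
  keyby = .(Bin = floor(`Donation Amount` / bin_width) * bin_width)
]
print(binned)

# Build the plot from the pre-aggregated counts (skipped if ggplot2 is missing).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(binned, ggplot2::aes(Bin + bin_width / 2, Count)) +
    ggplot2::geom_col(width = bin_width, fill = "skyblue", color = "black") +
    ggplot2::labs(x = "Donation Amount", y = "Number of Donations")
  # print(p) renders the chart in a notebook cell
}
```

On the real data the same pattern would use `Donations` in place of `toy_donations`; whether it avoids the crash has not been verified here.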

Now that we have an idea of the distribution of donations over time and by amount, let's take a look at repeat donations.

In [11]:
Repeated_Donors <- Donations[,.(`Times Donated`=.N, `Dollar Amount` = sum(`Donation Amount`)),by=`Donor ID`]

In [12]:
head(Repeated_Donors[order(-`Dollar Amount`)], 10)

In [13]:
Repeated_Donors  <- Repeated_Donors[,.(`Number of People` = .N, `Dollar Worth` = sum(`Dollar Amount`)),  by = `Times Donated`]
Repeated_Donors[,`Dollar Per Individual` := `Dollar Worth`/`Number of People`]

In [14]:
head(Repeated_Donors[order(`Times Donated`)])

In [15]:
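# Extra revenue if every one-time donor gave the two-time donor average instead
(123.49126 - 53.40797) * 1471613
# Extra revenue if 20% of one-time donors gave once more at the one-time average
0.2 * 53.40797 * 1471613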

Using the `Dollar Per Individual` averages as an extremely rough benchmark, moving everyone in the `Times Donated = 1` group into the `Times Donated = 2` group would net roughly an additional 103 million dollars. This is not a scientific estimate, given the unknown differences between the two groups, but it shows how important re-engagement is.
If at least 20% of those who donated only once made another donation of the same average size, that would bring in roughly another 15.7 million dollars in revenue.
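
The two estimates can also be written as one small function, which makes the assumptions explicit and easy to vary. This is only a sketch: the three input figures are the ones used in the calculation cell above, while `reengagement_uplift` and its argument names are hypothetical.

```r
# Figures from the calculation cell above.
n_once    <- 1471613    # donors who donated exactly once
avg_once  <- 53.40797   # `Dollar Per Individual` for `Times Donated == 1`
avg_twice <- 123.49126  # `Dollar Per Individual` for `Times Donated == 2`

# Hypothetical helper: extra revenue if a share of donors
# each contribute `extra_per_donor` more dollars.
reengagement_uplift <- function(n_donors, extra_per_donor, share = 1) {
  share * n_donors * extra_per_donor
}

# Scenario 1: every one-time donor reaches the two-time average.
full_shift <- reengagement_uplift(n_once, avg_twice - avg_once)

# Scenario 2: 20% of one-time donors give once more at the one-time average.
partial_shift <- reengagement_uplift(n_once, avg_once, share = 0.2)

# Report both scenarios in millions of dollars.
print(round(c(full_shift = full_shift, partial_shift = partial_shift) / 1e6, 1))
```

Changing `share` gives a quick sensitivity check on the re-engagement rate.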

In [16]:
print(paste0('The ratio of those who only donated once: ',Repeated_Donors[`Times Donated` == 1,`Number of People`]/Repeated_Donors[,sum(`Number of People`)]))
print(paste0('The dollar ratio of those who only donated once: ',Repeated_Donors[`Times Donated` == 1,`Dollar Worth`]/Repeated_Donors[,sum(`Dollar Worth`)]))
print('This means those who donated more than once (approx 28%) are responsible for around 73% of total revenue')
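
Rather than typing the percentages into a `print` call, both shares can be derived from the summary table. The sketch below uses a made-up table, `toy_summary`, with the same columns as `Repeated_Donors`; the numbers and the `Group` and `shares` names are illustrative assumptions.

```r
library(data.table)

# Hypothetical per-frequency summary with the same columns as `Repeated_Donors`
# (the numbers are made up for illustration).
toy_summary <- data.table(
  `Times Donated`    = c(1, 2, 3),
  `Number of People` = c(70, 20, 10),
  `Dollar Worth`     = c(300, 300, 400)
)

# Label each row, then total people and dollars per label.
toy_summary[, Group := ifelse(`Times Donated` == 1,
                              "Donated Once", "Donated More Than Once")]
shares <- toy_summary[, .(People  = sum(`Number of People`),
                          Dollars = sum(`Dollar Worth`)),
                      by = Group]

# Convert the totals to shares of all donors and of all revenue.
shares[, `:=`(`People Share` = People / sum(People),
              `Dollar Share` = Dollars / sum(Dollars))]
print(shares)
```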

In [17]:
ggplot(Repeated_Donors[`Times Donated` <= 5], aes(x=`Times Donated`, y = `Dollar Worth`)) +
    geom_bar(stat = 'identity', aes(fill = `Dollar Per Individual`)) +
    scale_y_continuous(labels=dollar_format(prefix="$")) +
    scale_x_continuous(breaks = seq(0, 5, by = 1)) +
    labs(title = 'Dollar Worth by Number of Times Donated')

In [18]:
Repeated_Donors[,Is_One := ifelse(`Times Donated`== 1,'Donated Once','Donated More Than Once')]

ggplot(Repeated_Donors, aes(x=`Is_One`, y = `Dollar Worth`)) +
    geom_bar(stat = 'identity', color = 'darkblue') +
    scale_y_continuous(labels=dollar_format(prefix="$")) +
    labs(title = 'Dollar Worth: Donated Once vs. Donated More Than Once')

## Teachers
This dataframe does not contain a large amount of information, but gender and first posted date may come in handy later.

In [19]:
head(Teachers)

In [20]:
ggplot(Teachers, aes(`Teacher Prefix`)) +
    geom_bar()

In [21]:
# Count teachers by prefix; only 'Mrs.' is counted as female here ('Ms.' is excluded)
Males <- Teachers[, sum(`Teacher Prefix` == 'Mr.')]
Females <- Teachers[, sum(`Teacher Prefix` == 'Mrs.')]

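Treating `Mrs.` and `Ms.` as **female** and `Mr.` as **male** gives a rough gender distribution. Note that the cell above counts only `Mrs.` as female, so teachers with the `Ms.` prefix are left out of the ratio below.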

In [22]:
Males
Females
Females/Males

**From the above data, we can see that for every male teacher (`Mr.`), there are about four female teachers (`Mrs.`).**
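
A variant that also counts `Ms.` as female could look like the sketch below. It runs on a made-up table, `toy_teachers`; the prefix-to-gender mapping and the `Gender` column are assumptions for illustration, and any prefix outside the mapping is labelled `Unknown` rather than guessed.

```r
library(data.table)

# Hypothetical stand-in for the Teachers table (made-up rows).
toy_teachers <- data.table(
  `Teacher Prefix` = c("Mrs.", "Ms.", "Mr.", "Mrs.", "Ms.", "Teacher", "Mr.", "Mrs.")
)

# Assumed mapping: Mr. -> Male, Mrs./Ms. -> Female, anything else -> Unknown.
toy_teachers[, Gender := ifelse(
  `Teacher Prefix` == "Mr.", "Male",
  ifelse(`Teacher Prefix` %in% c("Mrs.", "Ms."), "Female", "Unknown")
)]

gender_counts <- toy_teachers[, .N, by = Gender]
print(gender_counts)

# Female-to-male ratio under this mapping.
ratio <- gender_counts[Gender == "Female", N] / gender_counts[Gender == "Male", N]
print(ratio)
```

Running the same mapping on `Teachers` would show how much the ratio changes once `Ms.` is included.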

## Projects
This dataset is interesting: we can quite easily learn the differences between projects that succeeded and projects that failed. Linking this back to the school or teacher may also bring out interesting differences between the two groups.

In [23]:
head(subset(Projects, select = -`Project Essay`))

In [24]:
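# `:=` updates by reference, so converting on Projects directly is equivalent
# to going through an alias such as `test <- Projects` (which is not a copy)
Projects[, `Project Cost` := as.numeric(`Project Cost`)]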
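
The conversion above also changes `Projects`, because assigning a data.table to a new name does not copy it. A minimal sketch of that behavior on a made-up two-row table (`toy_projects`, `alias`, and `snapshot` are hypothetical names):

```r
library(data.table)

# Hypothetical two-row table standing in for Projects.
toy_projects <- data.table(`Project Cost` = c("100.5", "250"))

alias    <- toy_projects        # same object, no copy is made
snapshot <- copy(toy_projects)  # independent copy

# Converting through the alias changes the original table as well.
alias[, `Project Cost` := as.numeric(`Project Cost`)]

print(class(toy_projects$`Project Cost`))  # converted along with the alias
print(class(snapshot$`Project Cost`))      # keeps the original type
```

Use `copy()` when the original table should stay untouched.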

In [25]:
Projects

In [26]:
Categories <- Projects[, .(Count = .N, `Mean Cost` = mean(`Project Cost`)), by = .(`Project Resource Category`)]
Categories[order(-Count)]

In [27]:
Projects[,.N,by=.(`Project Current Status`)]

In [28]:
Projects[,.N,by=.(`Project Current Status`)][1,2]/Projects[,.N,by=.(`Project Current Status`)][2,2]
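
The ratio above depends on which statuses happen to land in rows 1 and 2. A sketch that looks the counts up by label instead, on a made-up table; the status labels `Fully Funded`, `Expired`, and `Live` are assumed here and should be checked against the actual values in `Projects`.

```r
library(data.table)

# Hypothetical status column; the labels are assumptions, not verified values.
toy_projects <- data.table(
  `Project Current Status` = c("Fully Funded", "Expired", "Fully Funded",
                               "Fully Funded", "Live", "Expired")
)

status_counts <- toy_projects[, .N, by = `Project Current Status`]

# Look the counts up by label so the result does not depend on row order.
funded  <- status_counts[`Project Current Status` == "Fully Funded", N]
expired <- status_counts[`Project Current Status` == "Expired", N]

funded_to_expired <- funded / expired
print(funded_to_expired)
```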

In [29]:
ggplot(Projects[,.N,by=.(`Project Current Status`)],aes(`Project Current Status`, N)) +
    geom_bar(stat = 'identity')

In [30]:
ggplot(Projects,aes(`Project Current Status`))+
    geom_bar(aes(fill = `Project Resource Category`))

# Keep watching! More to come....