# Assignment 3: Improving the Pipeline

## Part 3: Report

Compare the performance of the different classifiers across all the metrics. Which classifier does better on which metrics? How do the results change over time? What would be your recommendation to someone who's working on this model to identify 5% of posted projects to intervene with, which model should they decide to go forward with and deploy?

### Methodology
The goal of this analysis is to predict whether a project on DonorsChoose will fail to be fully funded within 60 days of posting. Note that under this definition, 'positive' in machine learning terms means that a project is NOT fully funded within 60 days, and 'negative' means that a project IS fully funded within 60 days.

To make this prediction, five feature variables are used across all models (as determined by a feature selection algorithm): 
- eligible_double_your_impact_match: whether the project was eligible for a 50% off offer by a corporate partner
- total_price_over500: whether the project's total cost exceeds 500 dollars
- resource_type_Books: whether the project's requested resource was books 
- resource_type_Supplies: whether the project's requested resource was supplies
- resource_type_Technology: whether the project's requested resource was technology

The data available to build these models covers projects posted from Jan 1, 2012 to Dec 21, 2013. It is assumed here that all projects were fully funded eventually, so the question is whether they were funded within or beyond 60 days of being posted. To evaluate this, models are built and assessed on three validation sets, each spanning a rolling 6-month window:
- Split 1: July 1, 2012 - Jan 1, 2013
- Split 2: Jan 1, 2013 - July 1, 2013
- Split 3: July 1, 2013 - Jan 1, 2014 

In all three cases, the models are trained using all of the available data before that validation set (i.e., for the first split, the model is trained on data from Jan 1, 2012 through July 1, 2012; for the second split, the model is trained on data from Jan 1, 2012 through Jan 1, 2013; and for the third split, the model is trained on data from Jan 1, 2012 through July 1, 2013). 
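
The expanding-window scheme described above can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: `add_months` and `temporal_splits` are hypothetical helper names, and the sketch assumes the data are treated as running through Jan 1, 2014 so that the third window is complete.

```python
from datetime import date

def add_months(d, months):
    """Shift a date forward by whole months (assumes the day exists in the target month)."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, d.day)

def temporal_splits(data_start, data_end, window_months=6):
    """Build expanding-window splits: train on everything before each test window."""
    splits = []
    test_start = add_months(data_start, window_months)
    while add_months(test_start, window_months) <= data_end:
        test_end = add_months(test_start, window_months)
        splits.append({
            "train": (data_start, test_start),  # all data before the validation set
            "test": (test_start, test_end),     # the 6-month validation window
        })
        test_start = test_end
    return splits

splits = temporal_splits(date(2012, 1, 1), date(2014, 1, 1))
for i, s in enumerate(splits, start=1):
    print(f"Split {i}: train {s['train'][0]} to {s['train'][1]}, "
          f"test {s['test'][0]} to {s['test'][1]}")
```

Training always starts at the first available date, so each later split sees strictly more history than the one before it.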

In a machine learning context, this is a supervised classification task where the target variable is whether a project is not funded within 60 days (not_funded_within_60days). A variety of classifiers are developed, each using the set of features listed above and across the three temporal splits: 
- Logistic Regression
- K-Nearest Neighbors
- Decision Tree
- Support Vector Machine
- Random Forest
- Gradient Boosting

These models are also compared to a simple baseline model that assigns every project the most frequent label in the training data (i.e., all projects are predicted to be funded within 60 days, since the majority of projects in the training data were).
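
A most-frequent-label baseline of this kind can be sketched in a few lines. The labels below are hypothetical toy data (not DonorsChoose results), and `fit_majority_baseline` is an illustrative name, not a function from the pipeline.

```python
from collections import Counter

def fit_majority_baseline(train_labels):
    """Return the most frequent label seen in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

# Hypothetical toy labels: 1 = NOT funded within 60 days, 0 = funded within 60 days.
train_labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
test_labels = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

majority_label = fit_majority_baseline(train_labels)
baseline_preds = [majority_label] * len(test_labels)  # same prediction for every project

baseline_accuracy = sum(p == y for p, y in zip(baseline_preds, test_labels)) / len(test_labels)
flagged = sum(baseline_preds)  # number of projects predicted positive

print(f"accuracy={baseline_accuracy:.2f}, projects flagged={flagged}")
```

In this toy case the baseline's accuracy equals the share of funded projects in the test data, yet it flags no projects at all, which is why accuracy alone says little about usefulness for targeting interventions.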

To compare these models, a variety of evaluation metrics are considered: 
- Accuracy: What proportion of predictions did the model get right? 
- Precision: What proportion of positive predictions were actually correct? 
- Recall: What proportion of actual positives were correctly predicted? 
- F1: The harmonic mean of precision and recall
- AUC ROC: A measure of how well the model can distinguish between outcomes 
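
The label-based metrics in this list follow directly from confusion-matrix counts. The sketch below uses hypothetical toy labels, and `classification_metrics` is an illustrative name; it assumes 1 marks a project NOT funded within 60 days.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical toy data: 3 actual positives, 5 predicted positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 1, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```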

These metrics are considered across various thresholds for converting a 'score' that each classifier predicts for a project into a binary categorization indicating whether that project should be predicted to not be funded within 60 days: 1%, 2%, 5%, 10%, 20%, 30%, and 50%. 
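
To make the role of the threshold concrete, here is a minimal sketch of converting scores into labels. It assumes the threshold is a cutoff on the predicted score and that a score at or above the cutoff counts as positive (the choice of `>=` over `>` is an assumption); the scores are hypothetical.

```python
THRESHOLDS = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.50]

def label_at_threshold(scores, threshold):
    """Label a project positive (1) when its score meets or exceeds the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

# Hypothetical predicted scores for "not funded within 60 days".
scores = [0.12, 0.18, 0.22, 0.27, 0.31, 0.35, 0.41, 0.15]

positives_by_threshold = {}
for t in THRESHOLDS:
    preds = label_at_threshold(scores, t)
    positives_by_threshold[t] = sum(preds)
    print(f"threshold={t:.2f}  predicted positives={sum(preds)}/{len(preds)}")
```

When every score falls in a narrow band, as in this toy example, low thresholds label everything positive and high thresholds label nothing positive, so only the thresholds inside the band produce varied predictions.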

### Which classifier does better on which metrics?

#### Accuracy
None of the models is more accurate than the baseline model (where all projects are predicted to be funded within 60 days). All of the models can match this level of accuracy by setting the threshold at 50% in each of the three temporal validation splits: at this threshold, no projects are predicted to not be funded within 60 days.

This highlights the importance of the threshold in converting prediction scores into prediction labels. Specifically, across the Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting models, thresholds below 10% led to no 'negatives' (i.e., no projects predicted to be funded within 60 days), and thresholds of 50% and above led to no 'positives' (i.e., no projects predicted NOT to be funded within 60 days). The 20% and 30% thresholds produced variation in predictions, indicating that the optimal threshold likely falls in this range. Since models that provide no variation in predictions aren't particularly useful, they are ignored here. Additionally, because thresholds carry a different meaning in Support Vector Machines, comparing their metrics with those of the other classifiers at the same threshold levels isn't meaningful.

Among the models with variation, the K-Nearest Neighbors models achieve marginally higher accuracy than the others. Their accuracy rises with the threshold and peaks at the 50% threshold.

#### Precision
All of the models perform similarly on precision. Across the thresholds and time splits, precision generally falls between 0.2 and 0.4, meaning that between 20% and 40% of projects predicted to not be funded within 60 days were actually not funded within that period. The K-Nearest Neighbors classifiers had slightly higher precision than the others.

#### Recall 
Across the models, recall is fairly high, particularly at the 20% threshold. Specifically, all of the models considered (Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosting) yield recall above 0.9 at various temporal splits and thresholds. This means that of the projects that were actually not funded within 60 days, these models correctly identified over 90%. As there is a tradeoff between precision and recall (i.e., models with higher precision tend to have lower recall, and vice versa), it is unsurprising that the K-Nearest Neighbors models, which had slightly higher precision, had lower recall than the others.

#### F1 
F1 captures the tradeoff between precision and recall by taking the harmonic mean of the two metrics. Given that the different classifiers perform similarly on precision and recall, they unsurprisingly also perform similarly on F1. Across models with variation, F1 typically fell between 0.4 and 0.5. Unlike precision and recall, F1 doesn't have a straightforward interpretation. The one exception to the similar scores is K-Nearest Neighbors, which has a slightly lower F1 (attributable to its lower recall).
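
As a worked illustration, take hypothetical values inside the ranges reported above (precision 0.3 and recall 0.9; these are not the actual results of any model here). F1 is the harmonic mean of the two:

$$
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2 \cdot 0.3 \cdot 0.9}{0.3 + 0.9} = \frac{0.54}{1.2} = 0.45
$$

The simple average of the same two values would be 0.6, so the harmonic mean is pulled toward the weaker metric. This is consistent with high recall and modest precision producing F1 scores in the 0.4 to 0.5 range.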

#### AUC ROC 
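AUC ROC (the area under the receiver operating characteristic curve) measures how well the model can distinguish between outcomes, independent of any single threshold. It can be read as the probability that the model assigns a higher score to a randomly chosen positive than to a randomly chosen negative, so 0.5 corresponds to random guessing and 1.0 to perfect ranking. As with the earlier metrics, all of the classifiers performed similarly on this metric, with values ranging between 0.6 and 0.7.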
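
One way to see what this metric measures is to compute it as a pairwise ranking probability: the share of (positive, negative) pairs in which the positive receives the higher score, with ties counted as half. The sketch below is a brute-force illustration on hypothetical data, not the routine used in the pipeline; `auc_roc` is an illustrative name.

```python
def auc_roc(y_true, scores):
    """AUC ROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC ROC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = not funded within 60 days) and predicted scores.
y_true = [1, 0, 1, 0, 0]
scores = [0.40, 0.35, 0.20, 0.25, 0.10]

auc = auc_roc(y_true, scores)
print(f"AUC ROC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of projects, so it is suitable only for small illustrations.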

### How do the results change over time?
As discussed above, three 6-month validation sets were considered for each of the models and each threshold. Performance varied tremendously across models, thresholds, and metrics. However, no clear trend emerged over time: no particular model or metric consistently performed better or worse from one validation set to the next.

In the baseline model, accuracy was highest in the first set (0.74), then in the last set (0.71), followed by the third set (0.68). In the K-Nearest Neighbors classifier, accuracy was slightly lower in the second set than in the others. In the remaining classifiers, accuracy was slightly lower in the first set than in the later two. The same holds for precision across the classifiers: the last two sets saw slightly higher precision than the first. In line with the precision-recall tradeoff, models in the first set had higher recall than those in the later two. On F1, the harmonic mean of precision and recall, the second set performed marginally better than the other two, while the third set performed marginally better on AUC ROC than the others.

Importantly, while these differences across the three temporal sets do exist, their magnitude is very small.

### What would be your recommendation to someone who's working on this model to identify 5% of posted projects to intervene with? Which model should they decide to go forward with and deploy?
As noted above, with the exception of the K-Nearest Neighbors classifier, the models performed very similarly. This suggests that the specific classifier used is less important than other considerations. Two particularly important ones are the thresholds used to convert scores into predictions and the particular features included in the models. Varying these is likely to have substantively greater effects on the evaluation metrics than the choice of classifier discussed above.

Moreover, deciding which model to deploy should be guided by the priorities of the implementer. Specifically, a series of questions needs to be considered, for example:  
- Is it preferred to intervene in those projects that are the least likely to be funded within 60 days, or those on the cusp of being funded within 60 days? 
- Is it preferred to potentially intervene in a project that would be funded within 60 days without the intervention, or to potentially fail to intervene in a project that would not be funded within 60 days? (i.e., how should the precision-recall tradeoff be evaluated?)
- Is the ultimate goal of the intervention simply to get more projects funded within 60 days, or is the goal more nuanced (e.g., to reduce the average length of time for full funding across all projects, to increase the total amount of funding for all projects, etc.)?
- Should considerations around equity be weighted (e.g., should projects in higher-poverty neighborhoods be prioritized over others, etc.)?

Given limited resources, precision is likely a key metric to consider. Across the full dataset, over 70% of projects were funded within 60 days (without intervention), and intervening in one of these projects would be an inefficient use of resources. Thus, high precision means that of the projects predicted to not be funded within 60 days, a high proportion of them were indeed not funded within 60 days (i.e., interventions would be less likely to 'mistakenly' go to projects that would've succeeded without the intervention). 
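
Because the intervention budget is framed as 5% of posted projects, one way to measure precision under that constraint is to rank projects by predicted score and flag only the top 5%. This is a sketch of that alternative view, not the score-cutoff procedure used in the analysis above; `precision_at_top_k` is a hypothetical helper and the data are synthetic.

```python
def precision_at_top_k(y_true, scores, k_percent):
    """Precision among the k_percent of projects with the highest predicted scores."""
    # Integer arithmetic avoids floating-point rounding; always flag at least one project.
    n_flagged = max(1, len(scores) * k_percent // 100)
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    flagged = ranked[:n_flagged]
    return sum(label for _, label in flagged) / n_flagged

# Synthetic example: 40 projects with increasing scores and an arbitrary label pattern.
scores = [i / 40 for i in range(40)]
y_true = [1 if i % 3 == 0 else 0 for i in range(40)]  # 1 = not funded within 60 days

p_at_5 = precision_at_top_k(y_true, scores, k_percent=5)
print(f"precision among the top 5% of projects: {p_at_5:.2f}")
```

With a fixed 5% budget, the number of flagged projects is set in advance, so the comparison between models comes down to how many of those flagged projects would truly have gone unfunded.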

As noted above, the different classifiers performed similarly in their precision (typically between 0.2 and 0.4). Across all variations considered, precision was maximized using the second temporal split with a 30% threshold across the Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting models. At this level, the other evaluation metrics (specifically, accuracy, F1, and AUC ROC) were all fairly high. Thus, I would recommend any of these models to identify projects to intervene in. Given the value of a model that offers intuition, I would particularly recommend the Logistic Regression and Decision Tree models, which allow for more straightforward interpretations than the others.