# Final Project 2: Identify & Acquire

---

# Part A. Identify

## Problem Statement

Determine which DonorsChoose.org projects will pass the review process, using project data collected at submission and teacher data, both drawn from DonorsChoose.org records from January 2016 to January 2017.

### business objectives
To better understand which types of DonorsChoose.org projects are most likely to pass the initial project screening process. This will bring us one step closer to guiding our teachers through the project creation process and to automating screening, which is currently very manual and requires hundreds of hours of work every month from a large volunteer community. This is a priority for the Operations and Teacher Success teams at DonorsChoose.org.

### research goals (outcome)
To determine how different project and teacher factors are associated with whether a project passes the initial DonorsChoose.org screening process.

>**examples of some predictors & covariates**
>* essay
>* project sequence 
>* project type
>* project cost
>* time teacher spent creating the project

> **timeframe:** January 2016 - January 2017 <p>

### hypothesis
The project essay and need statement will have the biggest impact on whether a project passes the initial screening process. Past research has produced some counterintuitive findings: for example, a teacher posting their second project is just as likely to be rejected as a teacher posting their first project. Given this, I think the answer lies in the project essay itself.

### <p style="text-align: center;"> ---------------------- </p>

# Part B. Acquire & Parse

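```python
import os
import pandas as pd

# Load the project dataset from two directories above the notebook.
df = pd.read_csv(os.path.join('..', '..', 'bc_final_project_dataset.csv'))

# Preview the first five rows.
df.head()
```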

Unnamed: 0 | Project ID | Project Title | Project Essay | Project Need Statement | Project Type (Punchout or Special) | Project Cost | Project Draft Template Name | Project Sequence Number | Project Ever Drafted (Yes / No) | Time to submit
---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---
0 | 2409313 | Why Do I Look Like Me? | my students love to experience life, and i inv... | my students need a class set of wisconsin plan... | materials | 234.35 |  | 2.0 | No | 349.0
1 | 2409312 | Reeds and Neck Straps | what can i say about my students! many of my ... | my students need reeds and new neck straps in ... | materials | 567.42 |  | 4.0 | No | 64.0
2 | 2409305 | Kindergarten STEAM | my students love to play! providing quality ma... | my students need magnetic building sets and cl... | materials | 373.69 |  |  | Yes | 43.0
3 | 2409305 | Kindergarten STEAM | my students love to play! providing quality ma... | my students need magnetic building sets and cl... | materials | 373.69 | Materials not described in essay |  | Yes | 43.0
4 | 2409304 | Help Us Get Rid of the Creepy Crawlies! | my students are amazing! all of them are bilin... | my students need lice shampoo, nit combs, and ... | materials | 486.8 |  | 20.0 | No | 31.0


## parse the data

###  data dictionary
Variable | Description | Type of Variable
---| ---| ---
project id |unique identifier of each project  | integer
title | project title | string
essay | project essay | string
need statement | first part of the essay | string
project type | the type of request: materials, class trip, visitor, special request | categorical
project cost | total cost of the project | continuous
project draft template name | each template name represents a unique reason for why the project did not pass review | categorical
project sequence number | indicates whether this is the first, second, etc. project from this teacher | continuous
time to submit | the amount of time (in minutes) it took for the teacher to write and submit their proposal | continuous
ever drafted | whether the project was sent back to the teacher for revisions | binary

## assumptions & risks
> Since I have built this dataset myself, I've made a lot of assumptions (based on past research done by the data science team at DonorsChoose.org), about the features I want in the dataset to begin with. Here are some risks to this approach:
><p>&nbsp;&nbsp;&nbsp;(a) There are seemingly infinite features to choose from in our database. The features I chose to include seemed most relevant based on past analysis, but there is a risk that I've missed something. It's possible that I will need to go back to the original dataset and add additional features if performance is poor. <p>
>&nbsp;&nbsp;&nbsp;(b) There are over 100,000 rows in this data set, but on average, only about 20% of projects do not pass the screening process. While there is a lot of data to work with, this class imbalance might make modeling challenging. <p>
>&nbsp;&nbsp;&nbsp;(c) The screening process is done by human beings, and thus is prone to human error. Each project that is sent back to a teacher has a corresponding reason, but screening always has a subjective element to it. <p>
>&nbsp;&nbsp;&nbsp;(d) The dataset is text-heavy and has some imperfections (null values, duplicates, etc.), so it will require a lot of data munging.
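
Risks (b) and (d) can be checked directly once the data is loaded. The sketch below is only illustrative: it rebuilds the five rows shown in the `df.head()` output (text columns omitted) rather than reading the full file, and it assumes that a project "passed" screening when it was never sent back to draft. The names `sample` and `passed_review` are hypothetical and do not appear in the dataset.

```python
import pandas as pd

# Five rows copied from the df.head() output above (long text columns omitted).
sample = pd.DataFrame({
    "Project ID": [2409313, 2409312, 2409305, 2409305, 2409304],
    "Project Draft Template Name": [
        None, None, None, "Materials not described in essay", None,
    ],
    "Project Sequence Number": [2.0, 4.0, None, None, 20.0],
    "Project Ever Drafted (Yes / No)": ["No", "No", "Yes", "Yes", "No"],
})

# Count null values per column.
null_counts = sample.isna().sum()

# Flag every row whose Project ID appears more than once.
dup_rows = sample[sample.duplicated(subset="Project ID", keep=False)]

# Assumption: "passed" means the project was never sent back to draft.
sample["passed_review"] = (
    sample["Project Ever Drafted (Yes / No)"] == "No"
).astype(int)

# Collapse to one row per project before measuring class balance.
per_project = sample.drop_duplicates(subset="Project ID")
class_share = per_project["passed_review"].value_counts(normalize=True)

print(null_counts)
print(dup_rows)
print(class_share)
```

On the full dataset the same calls would show how many rows repeat a project and how far the class split is from the roughly 20% figure quoted above.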

# Part C. Plan

## distribution & outliers

### distribution
>* describe the data to identify the min, max, mean, and standard deviation of continuous features
>* visualize the data to better understand the distribution of categorical features
>* research the best ways to handle text features
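
A minimal sketch of the first two steps, again using only the five rows visible in the `df.head()` output instead of the full file. The variable names `sample`, `summary`, and `type_counts` are placeholders chosen for this example.

```python
import pandas as pd

# Numeric and categorical columns copied from the df.head() output above.
sample = pd.DataFrame({
    "Project Type (Punchout or Special)": ["materials"] * 5,
    "Project Cost": [234.35, 567.42, 373.69, 373.69, 486.8],
    "Project Sequence Number": [2.0, 4.0, None, None, 20.0],
    "Time to submit": [349.0, 64.0, 43.0, 43.0, 31.0],
})

continuous = ["Project Cost", "Project Sequence Number", "Time to submit"]

# describe() reports count, mean, std, min, quartiles, and max;
# keep only the four statistics named in the plan.
summary = sample[continuous].describe().loc[["min", "max", "mean", "std"]]

# Frequency of each category, which can be passed to a bar chart.
type_counts = sample["Project Type (Punchout or Special)"].value_counts()

print(summary)
print(type_counts)
```

With matplotlib installed, `type_counts.plot(kind="bar")` would turn the category counts into a bar chart.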

### outliers
>* compare the mean to the median and mode of the continuous features and visualize the data - I expect to see the biggest range in the Time to submit feature
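
One way this comparison might look for `Time to submit`, using the five values from the `df.head()` output. The 1.5 × IQR fence used to flag outliers is a common convention that I am assuming here; it is not a rule taken from the DonorsChoose.org analysis.

```python
import pandas as pd

# "Time to submit" values copied from the df.head() output above.
time_to_submit = pd.Series([349.0, 64.0, 43.0, 43.0, 31.0], name="Time to submit")

# A mean far above the median suggests a right-skewed feature.
mean_value = time_to_submit.mean()
median_value = time_to_submit.median()
mode_value = time_to_submit.mode().iloc[0]

# Assumed outlier rule: values beyond 1.5 * IQR from the quartiles.
q1, q3 = time_to_submit.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = time_to_submit[
    (time_to_submit < lower_fence) | (time_to_submit > upper_fence)
]

print(mean_value, median_value, mode_value)
print(outliers)
```

Even in these five rows the mean sits well above the median, which is the pattern I expect to see at scale.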

## potential methods & models
### methods
* I'll need to do some research into language processing in Python.
* I'll need to keep a close eye on performance with such a large dataset, and explore dimensionality reduction methods such as PCA.
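
As a starting point for both bullets, text can be turned into numeric features with TF-IDF and then compressed. This is a sketch under assumptions: it uses scikit-learn, which the notebook has not imported so far, and it swaps PCA for `TruncatedSVD`, because TF-IDF output is a sparse matrix and `TruncatedSVD` works on sparse input directly. The need statements are the truncated previews from the `df.head()` output, kept as-is.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Truncated need statements copied from the df.head() output above
# (the duplicated project is listed once).
need_statements = [
    "my students need a class set of wisconsin plan...",
    "my students need reeds and new neck straps in ...",
    "my students need magnetic building sets and cl...",
    "my students need lice shampoo, nit combs, and ...",
]

# Convert each statement into a sparse vector of TF-IDF term weights.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(need_statements)

# Reduce the term space to two dense components.
# n_components=2 is an arbitrary choice for this toy example.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape)
print(reduced.shape)
```

On the full essays the number of components would need tuning, for example by inspecting `svd.explained_variance_ratio_`.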

### models
* I'd like to explore unsupervised learning methods whose output I can then feed into a supervised model as features.
* For such a large data set, with so many potential features, random forest will probably be most helpful.
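
The two ideas above could be combined in a single scikit-learn pipeline: an unsupervised text representation feeding a random forest. The sketch below only demonstrates the mechanics on the five previewed rows, so its predictions mean nothing. Assumptions: scikit-learn is available, the label is derived from `Project Ever Drafted (Yes / No)`, and `class_weight="balanced"` is one possible way to handle the class imbalance, not a tested choice.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Truncated need statements and draft flags copied from df.head() above.
need_statements = [
    "my students need a class set of wisconsin plan...",
    "my students need reeds and new neck straps in ...",
    "my students need magnetic building sets and cl...",
    "my students need magnetic building sets and cl...",
    "my students need lice shampoo, nit combs, and ...",
]
ever_drafted = ["No", "No", "Yes", "Yes", "No"]

# Assumption: a project passed review if it was never sent back to draft.
passed_review = [1 if flag == "No" else 0 for flag in ever_drafted]

model = make_pipeline(
    TfidfVectorizer(),                             # text -> sparse term weights
    TruncatedSVD(n_components=2, random_state=0),  # unsupervised reduction
    RandomForestClassifier(
        n_estimators=50,
        class_weight="balanced",  # reweight the rarer class
        random_state=0,
    ),
)

# Fit and predict on the same rows purely to show the API;
# a real evaluation needs a held-out test set.
model.fit(need_statements, passed_review)
predictions = model.predict(need_statements)

print(predictions)
```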