## Import libraries

In [27]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [28]:
# Set up directories
data_dir = 'data'
output_dir = 'output'
os.makedirs(output_dir, exist_ok=True)

## Ingest the data
Read in the resources and training data from their downloaded location.

In [29]:
res = pd.read_csv(os.path.join(data_dir, 'resources.csv'))
train = pd.read_csv(os.path.join(data_dir, 'train.csv'))

## Check out the data
High-level investigation. What are the attributes? What are the summary statistics of each?

In [30]:
list(res.columns)

['id', 'description', 'quantity', 'price']

In [31]:
list(train.columns)

['id',
 'teacher_id',
 'teacher_prefix',
 'school_state',
 'project_submitted_datetime',
 'project_grade_category',
 'project_subject_categories',
 'project_subject_subcategories',
 'project_title',
 'project_essay_1',
 'project_essay_2',
 'project_essay_3',
 'project_essay_4',
 'project_resource_summary',
 'teacher_number_of_previously_posted_projects',
 'project_is_approved']

In [6]:
res.describe(include='all')

Unnamed: 0,id,description,quantity,price
count,1541272,1540980,1541272.0,1541272.0
unique,260115,332928,,
top,p066966,Apple - iPad® mini 2 with Wi-Fi - 16GB - Space...,,
freq,100,3037,,
mean,,,2.860509,50.28398
std,,,7.570345,144.7326
min,,,1.0,0.0
25%,,,1.0,7.9
50%,,,1.0,14.99
75%,,,2.0,39.8


In [7]:
train.describe(include='all')

Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved
count,182080,182080,182076,182080,182080,182080,182080,182080,182080,182080,182080,6374,6374,182080,182080.0,182080.0
unique,182080,104414,5,51,180439,4,51,407,164282,147689,180984,6359,6336,179730,,
top,p207545,fa2f220b537e8653fb48878ebb38044d,Mrs.,CA,2016-09-01 00:00:03,Grades PreK-2,Literacy & Language,Literacy,Flexible Seating,As a teacher in a low-income/high poverty scho...,Students will be using Chromebooks to increase...,We will use these class sets of Scholastic mag...,"Having taught engineering in college, I have c...",My students need electronic tablets to do all ...,,
freq,1,74,95405,25695,30,73890,39257,15775,377,46,24,2,3,84,,
mean,,,,,,,,,,,,,,,11.237055,0.847682
std,,,,,,,,,,,,,,,28.016086,0.35933
min,,,,,,,,,,,,,,,0.0,0.0
25%,,,,,,,,,,,,,,,0.0,1.0
50%,,,,,,,,,,,,,,,2.0,1.0
75%,,,,,,,,,,,,,,,9.0,1.0
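
The `count` row above already hints at where values are missing (for example `teacher_prefix`, `project_essay_3`, and `project_essay_4`). A minimal sketch of a helper that lists this directly is shown below. The name `missing_summary` and the toy frame are hypothetical and used only for illustration; on the real data the call would be `missing_summary(train)`.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return count and fraction of missing values per column, largest first."""
    n_missing = frame.isna().sum()
    summary = pd.DataFrame({
        'n_missing': n_missing,
        'frac_missing': n_missing / len(frame),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary['n_missing'] > 0]
    return summary.sort_values('n_missing', ascending=False)

# Toy stand-in for `train` (hypothetical values)
toy = pd.DataFrame({
    'id': ['p1', 'p2', 'p3', 'p4'],
    'teacher_prefix': ['Mrs.', None, 'Ms.', 'Mr.'],
    'project_essay_3': [np.nan, np.nan, 'old essay text', np.nan],
})
print(missing_summary(toy))
```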


Essays are an interesting part of this dataset and potentially hold a wealth of information. Every application has essays 1 and 2, but only 6,374 of the 182,080 applications have essays 3 and 4. An attribute with this much missing data would typically be dropped. However, I believe that removing `project_essay_3` and `project_essay_4` from consideration would throw away useful data. According to the data description:
>Note: Prior to May 17, 2016, the prompts for the essays were as follows:
- `project_essay_1`: "Introduce us to your classroom"
- `project_essay_2`: "Tell us more about your students"
- `project_essay_3`: "Describe how your students will use the materials you're requesting"
- `project_essay_4`: "Close by sharing why your project will make a difference"

>Starting on May 17, 2016, the number of essays was reduced from 4 to 2, and the prompts for the first 2 essays were changed to the following:
- `project_essay_1`: "Describe your students: What makes your students special? Specific details about their background, your neighborhood, and your school are all helpful."
- `project_essay_2`: "About your project: How will these materials make a difference in your students' learning and improve their school lives?"

To me, the newer question 1 appears to be a combination of the old questions 1 and 2, while the newer question 2 appears to be a combination of the old questions 3 and 4. The old questions 1 and 2 ask about the classroom and students, while the new question 1 asks about the students, their background, and the school. The old questions 3 and 4 ask how the materials will be used and how they will make a difference, and the new question 2 addresses the same topics.

As such, for applications that have all four essays, I will combine the old questions 1 and 2 into question 1, and the old questions 3 and 4 into question 2.
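
Before merging, it may be worth confirming that essays 3 and 4 really do appear only on applications submitted before the prompt change. The sketch below cross-tabulates the two conditions. The helper name `essay_format_crosstab`, the string labels, and the toy frame are assumptions made for illustration; the cutoff date comes from the data description quoted above.

```python
import numpy as np
import pandas as pd

# Date the essay prompts changed, per the data description
CUTOFF = pd.Timestamp('2016-05-17')

def essay_format_crosstab(frame):
    """Cross-tabulate submission period against presence of essays 3 and 4."""
    submitted = pd.to_datetime(frame['project_submitted_datetime'])
    period = pd.Series(
        np.where(submitted < CUTOFF, 'before', 'on_or_after'),
        index=frame.index, name='submitted',
    )
    has_both = frame['project_essay_3'].notna() & frame['project_essay_4'].notna()
    essays = pd.Series(
        np.where(has_both, 'present', 'missing'),
        index=frame.index, name='essays_3_and_4',
    )
    return pd.crosstab(period, essays)

# Toy stand-in for `train` (hypothetical values)
toy = pd.DataFrame({
    'project_submitted_datetime': ['2016-05-01 10:00:00', '2016-05-16 23:59:59',
                                   '2016-05-17 00:00:01', '2017-01-01 12:00:00'],
    'project_essay_3': ['a', 'b', None, None],
    'project_essay_4': ['c', 'd', None, None],
})
table = essay_format_crosstab(toy)
print(table)
```

If the off-diagonal cells (`before`/`missing` and `on_or_after`/`present`) are zero on the real data, the presence of essays 3 and 4 is a reliable marker of the old format.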

## Combine relevant data 

In [105]:
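def merge_question_text(df):
    '''
    For rows where essays #3 and #4 are both filled (old four-prompt format),
    combine #1 and #2 into #1, and combine #3 and #4 into #2.
    Returns a modified copy; all other rows are left unchanged.
    '''
    df = df.copy()
    # Rows in the old format have both essay 3 and essay 4
    old = df['project_essay_3'].notna() & df['project_essay_4'].notna()
    # Build both merged columns before assigning, since the new #1 uses the old #2
    new_q1 = df.loc[old, 'project_essay_1'].str.cat(df.loc[old, 'project_essay_2'], sep=' ')
    new_q2 = df.loc[old, 'project_essay_3'].str.cat(df.loc[old, 'project_essay_4'], sep=' ')
    df.loc[old, 'project_essay_1'] = new_q1
    df.loc[old, 'project_essay_2'] = new_q2
    return df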

In [111]:
df['project_essay_1'] + df['project_essay_4']
# FIXME

0                                                       NaN
1                                                       NaN
2                                                       NaN
3                                                       NaN
4                                                       NaN
5                                                       NaN
6                                                       NaN
7                                                       NaN
8                                                       NaN
9                                                       NaN
10                                                      NaN
11                                                      NaN
12                                                      NaN
13                                                      NaN
14                                                      NaN
15                                                      NaN
16                                      

In [53]:
if q1 == q1 and q2 == q2:
    print('this')

In [107]:
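# Get sum of each application request
res['total'] = res['quantity'] * res['price']
# Sum only the numeric columns; 'description' is free text and should not be summed
# Note: the summed 'price' is the sum of unit prices; 'total' is the request cost
res_unique = res.groupby('id', as_index=False)[['quantity', 'price', 'total']].sum()
# Map to the training set
df = pd.merge(train, res_unique, how='left', on='id')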
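
A left merge like the one above can silently duplicate rows if the right-hand side has repeated keys, and it leaves `NaN` where a project has no resources. The following is a minimal sketch of how that could be checked, using hypothetical toy frames in place of `res` and `train`; `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for `res` and `train` (hypothetical values)
toy_res = pd.DataFrame({
    'id': ['p1', 'p1', 'p2'],
    'description': ['stool', 'ball', 'tablet'],
    'quantity': [2, 3, 1],
    'price': [10.0, 5.0, 100.0],
})
toy_train = pd.DataFrame({'id': ['p1', 'p2', 'p3'],
                          'project_is_approved': [1, 0, 1]})

# Cost per line item, then one aggregated row per project id
toy_res['total'] = toy_res['quantity'] * toy_res['price']
per_project = toy_res.groupby('id', as_index=False)[['quantity', 'price', 'total']].sum()

# validate raises if either side has duplicate ids; indicator adds a `_merge` column
merged = pd.merge(toy_train, per_project, how='left', on='id',
                  validate='one_to_one', indicator=True)

# Projects in the training set with no matching resource rows
unmatched = (merged['_merge'] == 'left_only').sum()
print(merged)
print('unmatched projects:', unmatched)
```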

In [112]:
df.head(20)

Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved,quantity,price,total
0,p036502,484aaf11257089a66cfedc9461c6bd0a,Ms.,NV,2016-11-18 14:45:59,Grades PreK-2,Literacy & Language,Literacy,Super Sight Word Centers,Most of my kindergarten students come from low...,I currently have a differentiated sight word c...,,,My students need 6 Ipod Nano's to create and d...,26,1,6,299.98,899.94
1,p039565,df72a3ba8089423fa8a94be88060f6ed,Mrs.,GA,2017-04-26 15:57:28,Grades 3-5,"Music & The Arts, Health & Sports","Performing Arts, Team Sports",Keep Calm and Dance On,Our elementary school is a culturally rich sch...,We strive to provide our diverse population of...,,,My students need matching shirts to wear for d...,1,0,20,20.0,400.0
2,p233823,a9b876a9252e08a55e3d894150f75ba3,Ms.,UT,2017-01-01 22:57:44,Grades 3-5,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",Lets 3Doodle to Learn,Hello;\r\nMy name is Mrs. Brotherton. I teach ...,We are looking to add some 3Doodler to our cla...,,,My students need the 3doodler. We are an SEM s...,5,1,1,469.99,469.99
3,p185307,525fdbb6ec7f538a48beebaa0a51b24f,Mr.,NC,2016-08-12 15:42:11,Grades 3-5,Health & Sports,Health & Wellness,"\""Kid Inspired\"" Equipment to Increase Activit...",My students are the greatest students but are ...,"The student's project which is totally \""kid-i...",,,My students need balls and other activity equi...,16,0,5,684.47,684.47
4,p013780,a63b5547a7239eae4c1872670848e61a,Mr.,CA,2016-08-06 09:09:11,Grades 6-8,Health & Sports,Health & Wellness,We need clean water for our culinary arts class!,My students are athletes and students who are ...,For some reason in our kitchen the water comes...,,,My students need a water filtration system for...,42,1,2,355.5,711.0
5,p063374,403c6783e9286e51ab318fba40f8d729,Mrs.,DE,2016-11-05 10:01:51,Grades PreK-2,"Applied Learning, Literacy & Language","Character Education, Literature & Writing",Need to Reach Our Virtual Mentors!!!,My kids tell me each day that they want to mak...,I started a program called Telementoring in ho...,,,My students need tablets in order to communic...,0,1,7,207.82,727.36
6,p103285,4e156c5fb3eea2531601c8736f3751a7,Mrs.,MO,2016-08-31 00:30:43,Grades PreK-2,Health & Sports,Health & Wellness,Active Kindergartners,Kindergarten is the new first grade. My studen...,With balance discs and stools as flexible seat...,,,My students need stability stools and inflatab...,1,1,6,111.0,414.02
7,p181781,c71f2ef13b4bc91afac61ca8fd4c0f9f,Mrs.,SC,2016-08-03 13:26:01,Grades PreK-2,"Applied Learning, Literacy & Language","Early Development, Literature & Writing",Fabulous Firsties-Wiggling to Learn!,First graders are fantastic! They are excited ...,First graders love learning! We need 6 wiggle-...,,,My students need wiggle stools to allow them t...,0,1,6,69.13,414.78
8,p114989,b580c11b1497a0a67317763b7f03eb27,Ms.,IN,2016-09-13 22:35:57,Grades 6-8,Math & Science,Mathematics,Wobble Chairs Help Fidgety Kids Focus,My seventh graders dream big. They can't wait ...,I have used alternative seating in my classroo...,,,My students need seating that allows the most ...,13,1,4,79.95,319.8
9,p191410,2071fb0af994f8f16e7c6ed0f35062a1,Mrs.,IL,2016-09-24 18:38:59,Grades PreK-2,Literacy & Language,Literacy,Snuggle Up With A Good Book,I teach first grade in a small farming town in...,There is nothing better than snuggling up with...,,,My students need 2 youth sized reclining chair...,12,1,2,59.88,119.76


In [19]:
df['teacher_prefix'].value_counts()

Mrs.       95405
Ms.        65066
Mr.        17667
Teacher     3912
Dr.           26
Name: teacher_prefix, dtype: int64

In [25]:
df.groupby(['teacher_prefix'])['project_is_approved'].sum()

teacher_prefix
Dr.           21
Mr.        14876
Mrs.       81484
Ms.        54854
Teacher     3107
Name: project_is_approved, dtype: int64
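
Raw approval counts mostly track how many applications each prefix submitted, so a rate is easier to compare across groups. As a rough sketch, the counts from the two outputs above can be divided directly; the variable names here are hypothetical.

```python
import pandas as pd

# Counts copied from the value_counts() and groupby().sum() outputs above
submitted = pd.Series({'Mrs.': 95405, 'Ms.': 65066, 'Mr.': 17667,
                       'Teacher': 3912, 'Dr.': 26})
approved = pd.Series({'Dr.': 21, 'Mr.': 14876, 'Mrs.': 81484,
                      'Ms.': 54854, 'Teacher': 3107})

# Division aligns on index labels, so the differing order does not matter
approval_rate = (approved / submitted).sort_values(ascending=False)
print(approval_rate.round(4))

# On the merged frame, the same rates should come from:
# df.groupby('teacher_prefix')['project_is_approved'].mean()
```

The `Dr.` group has only 26 applications, so its rate is much noisier than the others.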

In [18]:
df['project_submitted_datetime'] = pd.to_datetime(df['project_submitted_datetime'])
# Restrict the sum to numeric columns; text columns cannot be meaningfully summed
df.groupby(df['project_submitted_datetime'].dt.year).sum(numeric_only=True)

Unnamed: 0_level_0,teacher_number_of_previously_posted_projects,project_is_approved,quantity,price,total
project_submitted_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2016,1338682,111107,2305594,40393840.0,73727730.0
2017,707361,43239,784567,13817070.0,25642240.0
