<h2><strong>In This Notebook...</strong></h2><br />
This notebook handles data cleaning and feature engineering for our project. Much of the approach is inspired by <a href="https://www.kaggle.com/shivamb/extensive-text-data-feature-engineering/notebook" target="_blank">this Kaggle notebook</a>.

#### Dependencies

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from keras.preprocessing import sequence, text
from keras.layers import Input, Embedding

from nltk import word_tokenize
from nltk.corpus import stopwords
from textblob import TextBlob

import datetime as dt
import pandas as pd
import numpy as np
import warnings
import string

import matplotlib.pyplot as plt
%matplotlib inline

stop_words = list(set(stopwords.words('english')))
warnings.filterwarnings('ignore')
punctuation = string.punctuation

Using TensorFlow backend.


#### Read in data

In [2]:
id_column = 'id'
missing_token = ' UNK '

train = pd.read_csv('../data/train.csv', parse_dates=['project_submitted_datetime'])
test = pd.read_csv('../data/test.csv', parse_dates=['project_submitted_datetime'])
hopes = pd.read_csv('../data/resources.csv').fillna(missing_token)

df = pd.concat([train,test], axis=0)

#### Mathy Features
+ Min, max, and mean price of the resources requested
+ Min, max, and mean quantity of the resources requested
+ Min, max, and mean total price (quantity * price) of the resources requested
+ Total price of the items requested per proposal
+ Number of resource line items per proposal
+ Total quantity of items requested per proposal
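
As a rough sketch of what these features look like, the toy frame below stands in for `resources.csv`; its ids, descriptions, and values are invented for illustration. It builds the same per-proposal columns in a single `groupby` using pandas named aggregation (pandas 0.25 or later). This is an alternative to the loop-and-join in the next cell, not the code this notebook actually runs.

```python
import pandas as pd

# Hypothetical stand-in for resources.csv; values are invented for illustration.
toy = pd.DataFrame({
    'id': ['p1', 'p1', 'p2'],
    'description': ['pencils', 'paper', 'tablet'],
    'quantity': [2, 1, 3],
    'price': [1.50, 4.00, 100.00],
})
toy['total_price'] = toy['quantity'] * toy['price']

# One named aggregation per output column: new_name=(source_column, function).
spec = {'items': ('description', 'count')}
for col in ['quantity', 'price', 'total_price']:
    spec[col] = (col, 'sum')
    for stat in ['min', 'max', 'mean']:
        spec[stat + '_' + col] = (col, stat)

# Result is indexed by proposal id, one row per proposal.
toy_features = toy.groupby('id').agg(**spec)
print(toy_features[['items', 'quantity', 'total_price', 'mean_price']])
```

Because the result is indexed by `id`, it can be attached to the proposal table with `join(..., on='id')` in the same way as the aggregated frame built below.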

In [3]:
hopes['total_price'] = hopes['quantity']*hopes['price']
aggregatedf = hopes.groupby('id').agg({'description':'count', 'quantity':'sum', 'price':'sum', 'total_price':'sum'}).rename(columns={'description':'items'})

for maths in ['min', 'max', 'mean']:
    temporary = hopes.groupby('id').agg({'quantity':maths, 'price':maths, 'total_price':maths}).rename(columns={'quantity':maths+'_quantity', 'price':maths+'_price', 'total_price':maths+'_total_price'}).fillna(0)
    aggregatedf = aggregatedf.join(temporary)
# This didn't work whoops # aggregatedf = aggregatedf.join([hopes.groupby('id').agg({'quantity':maths, 'price':maths, 'total_price':maths}).rename(columns={'quantity':maths+'_quantity', 'price':maths+'_price', 'total_price':maths+'_total_price'}).fillna(0) for maths in ['min', 'max', 'mean']])

aggregatedf = aggregatedf.join(hopes.groupby('id').agg({'description':lambda x:' '.join(x.values.astype(str))}).rename(columns={'description':'resource_description'}))

df = df.join(aggregatedf, on='id')
df.head()

Unnamed: 0,id,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_grade_category,project_is_approved,project_resource_summary,project_subject_categories,project_subject_subcategories,...,min_quantity,min_price,min_total_price,max_quantity,max_price,max_total_price,mean_quantity,mean_price,mean_total_price,resource_description
0,p036502,Most of my kindergarten students come from low...,I currently have a differentiated sight word c...,,,Grades PreK-2,1.0,My students need 6 Ipod Nano's to create and d...,Literacy & Language,Literacy,...,3,149.99,449.97,3,149.99,449.97,3.0,149.99,449.97,Apple - iPod nano� 16GB MP3 Player (8th Genera...
1,p039565,Our elementary school is a culturally rich sch...,We strive to provide our diverse population of...,,,Grades 3-5,0.0,My students need matching shirts to wear for d...,"Music & The Arts, Health & Sports","Performing Arts, Team Sports",...,20,20.0,400.0,20,20.0,400.0,20.0,20.0,400.0,Reebok Girls' Fashion Dance Graphic T-Shirt - ...
2,p233823,Hello;\r\nMy name is Mrs. Brotherton. I teach ...,We are looking to add some 3Doodler to our cla...,,,Grades 3-5,1.0,My students need the 3doodler. We are an SEM s...,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",...,1,469.99,469.99,1,469.99,469.99,1.0,469.99,469.99,3doodler Start Full Edu Bundle
3,p185307,My students are the greatest students but are ...,"The student's project which is totally \""kid-i...",,,Grades 3-5,0.0,My students need balls and other activity equi...,Health & Sports,Health & Wellness,...,1,18.95,18.95,1,354.99,354.99,1.0,136.894,136.894,BALL PG 4'' POLY SET OF 6 COLORS BALL PLAYGROU...
4,p013780,My students are athletes and students who are ...,For some reason in our kitchen the water comes...,,,Grades 6-8,1.0,My students need a water filtration system for...,Health & Sports,Health & Wellness,...,2,355.5,711.0,2,355.5,711.0,2.0,355.5,711.0,Crown Berkey Water Filter With 2 Black and 2 P...


#### Great, now let's play with time!
+ Year of submission
+ Month of submission
+ Year Day (1-366) of submission
+ Month Day (1-31) of submission
+ Week Day (0-6, Monday = 0) of submission
+ Hour (0-23) of submission
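
To make the value ranges concrete, here is a small sketch on two timestamps: the first is the submission time of the sample row shown in this notebook's output, the second is a made-up date chosen to hit the end of a leap year. It uses the same pandas `.dt` accessors as the cell below. Note that `dt.weekday` counts from Monday as 0, and `dt.dayofyear` reaches 366 in leap years.

```python
import pandas as pd

# A Friday afternoon and the last day of a leap year (the second value is made up).
stamps = pd.Series(pd.to_datetime(['2016-11-18 14:45:59', '2016-12-31 23:59:59']))

time_features = pd.DataFrame({
    'Year': stamps.dt.year,
    'Month': stamps.dt.month,
    'Year_Day': stamps.dt.dayofyear,   # 1-366
    'Month_Day': stamps.dt.day,        # 1-31
    'Week_Day': stamps.dt.weekday,     # 0 = Monday ... 6 = Sunday
    'Hour': stamps.dt.hour,            # 0-23
})
print(time_features)
```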

In [4]:
# Use the pandas .dt accessor to build the features listed above
df['Year'] = df['project_submitted_datetime'].dt.year
df['Month'] = df['project_submitted_datetime'].dt.month
df['Year_Day'] = df['project_submitted_datetime'].dt.dayofyear
df['Month_Day'] = df['project_submitted_datetime'].dt.day
df['Week_Day'] = df['project_submitted_datetime'].dt.weekday
df['Hour'] = df['project_submitted_datetime'].dt.hour
df.head(1)

Unnamed: 0,id,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_grade_category,project_is_approved,project_resource_summary,project_subject_categories,project_subject_subcategories,...,mean_quantity,mean_price,mean_total_price,resource_description,Year,Month,Year_Day,Month_Day,Week_Day,Hour
0,p036502,Most of my kindergarten students come from low...,I currently have a differentiated sight word c...,,,Grades PreK-2,1.0,My students need 6 Ipod Nano's to create and d...,Literacy & Language,Literacy,...,3.0,149.99,449.97,Apple - iPod nano� 16GB MP3 Player (8th Genera...,2016,11,323,18,4,14


In [None]:
# To be continued... My earlier attempts, which were nowhere near all-encompassing, are below!

In [5]:
# The resources table is loaded as `hopes` in this notebook
athing = hopes[hopes['id'] == 'p069063']

In [9]:
athing_length = len(athing)
for row in athing.itertuples():
    # Named fields instead of positions: quantity * price for each line item
    print(round(row.quantity * row.price, 2))
athing_length

44.85
8.45
27.18
74.85
16.99
9.95
20.22


7

In [None]:
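sumprice = []
numbought = []
avgprice = []

# Assumed intent, inferred from the list names: total spend, total quantity,
# and mean unit price per proposal.
for row in train.itertuples():
    # Use a separate name so the combined `df` built earlier is not overwritten
    items = hopes[hopes['id'] == row.id]
    sumprice.append(round((items['quantity'] * items['price']).sum(), 2))
    numbought.append(items['quantity'].sum())
    avgprice.append(round(items['price'].mean(), 2) if len(items) else 0.0)
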

In [7]:
train.head(1)

Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved
0,p036502,484aaf11257089a66cfedc9461c6bd0a,Ms.,NV,2016-11-18 14:45:59,Grades PreK-2,Literacy & Language,Literacy,Super Sight Word Centers,Most of my kindergarten students come from low...,I currently have a differentiated sight word c...,,,My students need 6 Ipod Nano's to create and d...,26,1


In [None]:
def resource_scrape(idnum):
    # Assumed return value: the per-line totals (quantity * price) for one proposal id
    items = hopes[hopes['id'] == idnum]
    return [round(row.quantity * row.price, 2) for row in items.itertuples()]


In [None]:
# Class balance of the target in the training set
train['project_is_approved'].value_counts()

In [None]:
# Assumed intent: count proposals whose teacher has more than 5 previous projects
(train['teacher_number_of_previously_posted_projects'] > 5).value_counts()