<h2><strong>In This Notebook...</strong></h2><br />
This notebook covers data cleaning and feature engineering for our project.  Much of the inspiration came from <a href="https://www.kaggle.com/shivamb/extensive-text-data-feature-engineering/notebook" target="_blank">this Kaggle notebook</a>.

#### Dependencies

In [1]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from keras.preprocessing import sequence, text
from keras.layers import Input, Embedding

from nltk import word_tokenize
from nltk.corpus import stopwords
from textblob import TextBlob

import datetime as dt
import pandas as pd
import numpy as np
import warnings
import string

import matplotlib.pyplot as plt
%matplotlib inline

stop_words = set(stopwords.words('english'))  # a set keeps the per-word membership test in the stopword count fast
warnings.filterwarnings('ignore')
punctuation = string.punctuation

Using TensorFlow backend.


Wall time: 2.23 s


#### Read in data

In [2]:
%%time
# declare some strings
id_column = 'id'
missing_token = ' UNK '

# read in our data; parse_dates=['column name'] reads that column as a datetime object
# (it also accepts a boolean, a list of integers / names, a list of lists, or a dictionary,
# and each one behaves differently, so read the docs~)
train = pd.read_csv('../data/train.csv', parse_dates=['project_submitted_datetime'])
test = pd.read_csv('../data/test.csv', parse_dates=['project_submitted_datetime'])
hopes = pd.read_csv('../data/resources.csv').fillna(missing_token)

# let's make a master df of the train and test data to make our lives easier!
df = pd.concat([train,test], axis=0)

Wall time: 7.52 s
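
As a minimal sketch of what `parse_dates` does in the cell above, the snippet below reads a made-up two-row CSV from memory instead of the project files. The ids and timestamps are invented for illustration only.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for train.csv (values are made up)
raw = io.StringIO(
    "id,project_submitted_datetime\n"
    "p000001,2016-11-18 14:45:59\n"
    "p000002,2017-04-26 15:57:28\n"
)

demo = pd.read_csv(raw, parse_dates=['project_submitted_datetime'])

# The named column is parsed to datetime64, so the .dt accessor works on it
is_datetime = pd.api.types.is_datetime64_any_dtype(demo['project_submitted_datetime'])
first_year = int(demo['project_submitted_datetime'].dt.year.iloc[0])
```

Without `parse_dates` the same column would come back as plain strings and the `.dt` accessor used later would fail.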


#### Mathy Features
+ Min, Max, Mean Price for resources requested
+ Min Quantity, Max Quantity, Mean Quantity of resources requested
+ Min Total Price, Max Total Price, Mean Total Price of resources requested
+ Total Price of items requested by proposal
+ Number of Unique Items Requested by proposal
+ Quantity of items requested in proposal

In [3]:
%%time
# A new column for total price
hopes['total_price'] = hopes['quantity']*hopes['price']

# Make an aggregate df to join to our normal df
# .agg takes a function, a string, a list of either, or a dictionary whose keys are the column names the functions should run on
# I named it after the horse in Shadow of the Colossus~ the description column becomes a count of rows, so it gets renamed to items (number of items)
agro = {'description':'count', 'quantity':'sum', 'price':'sum', 'total_price':'sum'}
aggregatedf = hopes.groupby('id').agg(agro).rename(columns={'description':'items'})

# now let's use that string functionality of .agg to get the min, max, and mean values!
for maths in ['min', 'max', 'mean']:
    # aguro is the romanized Japanese name of that horse, and namingway is the guy who changes names in Final Fantasy; these variable names only live here, so let's have fun
    aguro = {'quantity':maths, 'price':maths, 'total_price':maths}
    namingway = {'quantity':maths+'_quantity', 'price':maths+'_price', 'total_price':maths+'_total_price'}
    
    # do some aggregation and join it to our previously created df
    temporary = hopes.groupby('id').agg(aguro).rename(columns=namingway).fillna(0)
    aggregatedf = aggregatedf.join(temporary)
# This didn't work whoops # aggregatedf = aggregatedf.join([hopes.groupby('id').agg({'quantity':maths, 'price':maths, 'total_price':maths}).rename(columns={'quantity':maths+'_quantity', 'price':maths+'_price', 'total_price':maths+'_total_price'}).fillna(0) for maths in ['min', 'max', 'mean']])

# and finally give it the original description columns aggregated together with a space in between them
aggregatedf = aggregatedf.join(hopes.groupby('id').agg({'description':lambda x:' '.join(x.values.astype(str))}).rename(columns={'description':'resource_description'}))

# Join that together with our everything df and check it out
df = df.join(aggregatedf, on='id')
df.head()

Wall time: 6.74 s
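
The loop above runs one `groupby` per statistic. As a hedged alternative, the sketch below gets min, max, and mean in a single `groupby` by passing a list of function names to `.agg` and then flattening the two-level column labels. The `toy` frame is a made-up miniature of the resources table, not project data.

```python
import pandas as pd

# Hypothetical miniature of resources.csv
toy = pd.DataFrame({
    'id': ['p1', 'p1', 'p2'],
    'quantity': [2, 1, 3],
    'price': [5.0, 10.0, 1.0],
})
toy['total_price'] = toy['quantity'] * toy['price']

# One groupby, three functions per column; the result has two-level column labels
stats = toy.groupby('id')[['quantity', 'price', 'total_price']].agg(['min', 'max', 'mean'])

# Flatten ('price', 'min') into 'min_price' to match the naming used in the loop
stats.columns = [f'{func}_{col}' for col, func in stats.columns]
```

The flattened `stats` frame can be joined onto the count/sum aggregate the same way `temporary` is joined in the loop.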


#### Great, now let's play with time!
+ Year of submission
+ Month of submission
+ Year Day (1-366) of submission
+ Month Day (1-31) of submission
+ Week Day (0-6, Monday = 0) of submission
+ Hour of submission

In [4]:
%%time
# using datetime to make the above features
df['Year'] = df['project_submitted_datetime'].dt.year
df['Month'] = df['project_submitted_datetime'].dt.month
df['Year_Day'] = df['project_submitted_datetime'].dt.dayofyear
df['Month_Day'] = df['project_submitted_datetime'].dt.day
df['Week_Day'] = df['project_submitted_datetime'].dt.weekday
df['Hour'] = df['project_submitted_datetime'].dt.hour
df.head(1)

Wall time: 122 ms
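
A quick check of the ranges these accessors return, using three hand-picked dates rather than project data: `dt.weekday` counts from Monday = 0, and `dt.dayofyear` reaches 366 in a leap year.

```python
import pandas as pd

# 2018-01-01 was a Monday, 2018-01-07 a Sunday, and 2016 was a leap year
stamps = pd.Series(pd.to_datetime(['2018-01-01', '2018-01-07', '2016-12-31']))

weekdays = stamps.dt.weekday.tolist()      # Monday = 0 ... Sunday = 6
year_days = stamps.dt.dayofyear.tolist()   # 1-based, 366 on the last day of a leap year
```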


#### Text based features
+ Length of essays including spaces
+ Length of project title
+ Word count across essays
+ Character count across essays
+ Word density / average length of words used
+ Punctuation count
+ Uppercase count
+ Title Word Count (Gotta Have This Case)
+ Stopword Count

In [5]:
%%time
# fill empty values with missing token ' UNK '
df['project_essay_3'] = df['project_essay_3'].fillna(missing_token)
df['project_essay_4'] = df['project_essay_4'].fillna(missing_token)

Wall time: 32.9 ms


In [6]:
%%time
# get length of each essay and its title
df['essay1_len'] = df['project_essay_1'].apply(len)
df['essay2_len'] = df['project_essay_2'].apply(len)
df['essay3_len'] = df['project_essay_3'].apply(len)
df['essay4_len'] = df['project_essay_4'].apply(len)
df['title_len'] = df['project_title'].apply(len)
df.head()

Wall time: 348 ms


In [7]:
%%time
# Combine the essays into one string
df['text'] = df.apply(lambda row: ' '.join([str(row['project_essay_1']),
                                            str(row['project_essay_2']),
                                            str(row['project_essay_3']),
                                            str(row['project_essay_4'])]), axis=1)

Wall time: 20.6 s
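
The row-wise `apply` above calls a Python function once per row. A possible faster route, sketched here on a made-up two-row frame with the same column names, is `Series.str.cat`, which joins the columns element-wise.

```python
import pandas as pd

# Hypothetical two-row frame reusing the essay column names
toy = pd.DataFrame({
    'project_essay_1': ['a b', 'c'],
    'project_essay_2': ['d', 'e f'],
    'project_essay_3': [' UNK ', 'g'],
    'project_essay_4': [' UNK ', ' UNK '],
})
essay_cols = ['project_essay_1', 'project_essay_2', 'project_essay_3', 'project_essay_4']

# Series.str.cat joins the columns element-wise without a Python-level row loop
first, *rest = (toy[col].astype(str) for col in essay_cols)
toy['text'] = first.str.cat(list(rest), sep=' ')
```

On this toy frame the result matches the `' '.join(...)` version exactly; the speed gain on the full data is an expectation, not something measured here.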


In [8]:
%%time
# get our delicious features from that massive text
df['char_count'] = df['text'].apply(len)
df['word_count'] = df['text'].apply(lambda x: len(x.split()))
df['word density'] = df['char_count'] / (df['word_count'] + 1)
df['punctuation_count'] = df['text'].apply(lambda x: sum(ch in punctuation for ch in x))
df['title_word_count'] = df['text'].apply(lambda x: len([word for word in x.split() if word.istitle()]))
df['upper_case_word_count'] = df['text'].apply(lambda x: len([word for word in x.split() if word.isupper()]))
df['stopword_count'] = df['text'].apply(lambda x: len([word for word in x.split() if word.lower() in stop_words]))
df.head()

Wall time: 2min 31s
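
To make the definitions of these counts concrete, here is a sketch that computes the same features for a single string in one function. The tiny `demo_stop_words` set is an assumed stand-in for the NLTK list so the example runs without any downloads, and `text_features` is a hypothetical helper name.

```python
import string

# Assumed stand-in for the NLTK stopword list, kept tiny so the sketch runs offline
demo_stop_words = {'the', 'a', 'is', 'my', 'and'}

def text_features(text, stop_words=demo_stop_words):
    """Return the same counts the cell above computes, for one string."""
    words = text.split()
    char_count = len(text)
    return {
        'char_count': char_count,
        'word_count': len(words),
        # +1 in the denominator mirrors the notebook and avoids division by zero
        'word_density': char_count / (len(words) + 1),
        'punctuation_count': sum(ch in string.punctuation for ch in text),
        'title_word_count': sum(w.istitle() for w in words),
        'upper_case_word_count': sum(w.isupper() for w in words),
        'stopword_count': sum(w.lower() in stop_words for w in words),
    }

features = text_features('My STEM class is the BEST, and Mrs. Lee agrees!')
```

Note that tokens keep their attached punctuation after `split()`, so `'BEST,'` still counts as an uppercase word and `'Mrs.'` as a title-case word.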


#### NLP style features
+ Article Polarity - Sentiment polarity
+ Article Subjectivity - Sentiment subjectivity
+ Noun Count - count of words that are nouns, the ones that name objects, people, etc...
+ Verb Count - count of words that are verbs, the ones that tell you about moving like walk or think...
+ Adjective Count - count of words that are adjectives, the ones that describe nouns like red or big...
+ Adverb Count - count of words that are adverbs, the ones that describe adjectives or verbs and typically end with -ly
+ Pronoun Count - count of words that are pronouns, the ones that replace other words like her or they

In [9]:
%%time
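# functions to get polarity and subjectivity using TextBlob
def get_polarity(text):
    try:
        # str() works on Python 3, where the Python 2 builtin unicode() no longer exists
        textblob = TextBlob(str(text))
        pol = textblob.sentiment.polarity
    except (TypeError, ValueError):
        # narrow except so typos and other real bugs raise instead of silently returning 0.0
        pol = 0.0
    return pol

def get_subjectivity(text):
    try:
        textblob = TextBlob(str(text))
        subj = textblob.sentiment.subjectivity
    except (TypeError, ValueError):
        subj = 0.0
    return subj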

Wall time: 0 ns


In [10]:
%%time
# Now let's apply those functions to our df
df['polarity'] = df['text'].apply(get_polarity)
df['subjectivity'] = df['text'].apply(get_subjectivity)

Wall time: 333 ms
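
A wall time this short for sentiment scoring over every essay suggests the functions may be returning early. One way that can happen is a bare `except` swallowing an error such as a `NameError`. The sketch below uses hypothetical function names and a deliberately undefined helper to show the difference between a bare and a narrow `except`.

```python
def risky_score(text):
    # `undefined_helper` does not exist; it stands in for a typo or a removed builtin
    return undefined_helper(text)

def score_bare_except(text):
    try:
        return risky_score(text)
    except:               # swallows the NameError along with everything else
        return 0.0

def score_narrow_except(text):
    try:
        return risky_score(text)
    except ValueError:    # only the failure we actually expect
        return 0.0

silent = score_bare_except('some essay')

try:
    score_narrow_except('some essay')
    surfaced = None
except NameError as err:
    surfaced = type(err).__name__
```

With the bare `except` every row quietly scores 0.0; with the narrow one the bug shows up on the first row.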


<a href="https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/" target="_blank">NLTK Part of Speech Tags</a> <- Click me (don't get excited there's no R)

In [11]:
%%time
# make a dictionary for parts of speech
pos_dict = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS'],  # singular, plural regular nouns; singular, plural proper nouns
    'pron': ['PRP', 'PRP$', 'WP', 'WP$'],  # personal pronouns, possessive personal, wh- pronouns, possessive wh- pronouns
    'verb': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],  # verb base, past tense, gerund, past participle, non-3rd person present, 3rd person present
    'adj': ['JJ', 'JJR', 'JJS'],  # adjective, comparative, superlative
    'adv': ['RB', 'RBR', 'RBS', 'WRB']  # adverb, comparative, superlative, wh- adverb
}

Wall time: 0 ns


In [12]:
%%time
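# function to retrieve the part-of-speech tag counts
def pos_check(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        # each entry in .tags is a (word, tag) pair
        for _, tag in wiki.tags:
            if tag in pos_dict[flag]:
                cnt += 1
    except TypeError:
        # non-string input; anything else (typos, missing corpora) should raise, not yield 0
        pass
    return cnt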

Wall time: 0 ns


In [13]:
%%time
# now let's use that function to make new columns, each in its own cell because it takes a while
df['noun_count'] = df['text'].apply(lambda x: pos_check(x, 'noun'))

Wall time: 1h 18min 6s


In [14]:
%%time
df['verb_count'] = df['text'].apply(lambda x: pos_check(x, 'verb'))

Wall time: 1h 17min 45s


In [15]:
%%time
df['adj_count'] = df['text'].apply(lambda x: pos_check(x, 'adj'))

Wall time: 1h 17min 27s


In [16]:
%%time
df['adv_count'] = df['text'].apply(lambda x: pos_check(x, 'adv'))

Wall time: 1h 17min 20s


In [17]:
%%time
df['pron_count'] = df['text'].apply(lambda x: pos_check(x, 'pron'))

Wall time: 1h 17min 12s
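
Each of the five cells above tags the full text again, which is where the repeated 1h 17min comes from. A sketch of counting all five categories in one pass over the tags is shown below. To keep it runnable without TextBlob, it works on hand-written `(word, tag)` pairs standing in for `TextBlob(text).tags`; `count_pos_groups` is a hypothetical helper name.

```python
from collections import Counter

# Same tag groups as pos_dict above
pos_groups = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS'],
    'pron': ['PRP', 'PRP$', 'WP', 'WP$'],
    'verb': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
    'adj': ['JJ', 'JJR', 'JJS'],
    'adv': ['RB', 'RBR', 'RBS', 'WRB'],
}
# Inverted lookup: tag -> category
tag_to_group = {tag: group for group, tags in pos_groups.items() for tag in tags}

def count_pos_groups(tagged_words):
    """Count every category in one pass over (word, tag) pairs."""
    counts = Counter(tag_to_group[tag] for _, tag in tagged_words if tag in tag_to_group)
    return {group: counts.get(group, 0) for group in pos_groups}

# Hand-written tags standing in for TextBlob(text).tags
sample_tags = [('My', 'PRP$'), ('students', 'NNS'), ('really', 'RB'),
               ('need', 'VBP'), ('new', 'JJ'), ('books', 'NNS')]
counts = count_pos_groups(sample_tags)
```

Applied to the real data, the idea would be to tag each essay once and expand the returned dictionary into the five count columns, which should cut the tagging work to roughly a fifth.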


In [18]:
%%time
df.head()

Wall time: 1.99 ms


Unnamed: 0,id,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_grade_category,project_is_approved,project_resource_summary,project_subject_categories,project_subject_subcategories,...,title_word_count,upper_case_word_count,stopword_count,polarity,subjectivity,noun_count,verb_count,adj_count,adv_count,pron_count
0,p036502,Most of my kindergarten students come from low...,I currently have a differentiated sight word c...,UNK,UNK,Grades PreK-2,1.0,My students need 6 Ipod Nano's to create and d...,Literacy & Language,Literacy,...,21,7,151,0.0,0.0,0,0,0,0,0
1,p039565,Our elementary school is a culturally rich sch...,We strive to provide our diverse population of...,UNK,UNK,Grades 3-5,0.0,My students need matching shirts to wear for d...,"Music & The Arts, Health & Sports","Performing Arts, Team Sports",...,15,5,79,0.0,0.0,0,0,0,0,0
2,p233823,Hello;\r\nMy name is Mrs. Brotherton. I teach ...,We are looking to add some 3Doodler to our cla...,UNK,UNK,Grades 3-5,1.0,My students need the 3doodler. We are an SEM s...,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",...,26,6,103,0.0,0.0,0,0,0,0,0
3,p185307,My students are the greatest students but are ...,"The student's project which is totally \""kid-i...",UNK,UNK,Grades 3-5,0.0,My students need balls and other activity equi...,Health & Sports,Health & Wellness,...,31,6,188,0.0,0.0,0,0,0,0,0
4,p013780,My students are athletes and students who are ...,For some reason in our kitchen the water comes...,UNK,UNK,Grades 6-8,1.0,My students need a water filtration system for...,Health & Sports,Health & Wellness,...,13,2,98,0.0,0.0,0,0,0,0,0


#### TF-IDF style features
+ 1-3 NGram TF-IDF for Article Text at word level
+ 1-3 NGram TF-IDF for Project Title at word level
+ 1-3 NGram TF-IDF for Resource Text at word level
+ 1-3 NGram TF-IDF for Article Text at character level
+ 1-3 NGram TF-IDF for Project Title at character level
+ 1-3 NGram TF-IDF for Resource Text at character level

In [19]:
%%time
df['resource_text'] = df.apply(lambda row: ' '.join([str(row['resource_description']), str(row['project_resource_summary'])]), axis=1)

article_text = list(df['text'].values)
title_text = list(df['project_title'].values)
resource_text = list(df['resource_text'].values)

Wall time: 17 s


In [29]:
%%time
df.to_csv('../data/everything.csv')

Wall time: 37.8 s


In [21]:
%%time
# word level tf-idf for article text
vectorizer = TfidfVectorizer(max_features=2500, analyzer='word', stop_words='english', ngram_range=(1,3), dtype=np.float32)
vectorizer.fit(article_text)
article_word_tfidf = vectorizer.transform(article_text)

Wall time: 5min 3s


In [22]:
%%time
# word level tf-idf for titles
vectorizer = TfidfVectorizer(max_features=2500, analyzer='word', stop_words='english', ngram_range=(1,3), dtype=np.float32)
vectorizer.fit(title_text)
title_word_tfidf = vectorizer.transform(title_text)

Wall time: 10.6 s


In [23]:
%%time
# word level tf-idf for resource text
vectorizer = TfidfVectorizer(max_features=2500, analyzer='word', stop_words='english', ngram_range=(1,3), dtype=np.float32)
vectorizer.fit(resource_text)
resource_word_tfidf = vectorizer.transform(resource_text)

Wall time: 1min 40s


In [24]:
%%time
# map each token to its idf weight (vectorizer is the resource-text one fitted in the previous cell)
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
tfidf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
tfidf = pd.DataFrame.from_dict(tfidf, orient='index', columns=['resource_word_tfidf'])

Wall time: 18 ms


In [30]:
%%time
# 15 highest idf weights from that list (the rarest terms come first)
tfidf.sort_values(by=['resource_word_tfidf'], ascending=False).head(15)

Wall time: 6.96 ms


Unnamed: 0,resource_word_tfidf
branches book,7.823436
diaries dork diaries,7.802456
superbright,7.792129
dork diaries dork,7.785303
diaries dork,7.781908
branches,7.622444
12 amp 34,7.61095
34 18 34,7.61095
amp 34 18,7.61095
paper 12 amp,7.608097
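
The values in this table come from `vectorizer.idf_`, which holds one inverse-document-frequency weight per term rather than a per-document tf-idf score. With scikit-learn's default `smooth_idf=True` the weight is `ln((1 + n) / (1 + df)) + 1`, so rarer terms rank higher. The sketch below recomputes that formula for a hypothetical corpus of 1,000 documents.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """idf as scikit-learn computes it when smooth_idf=True (the default)."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

# Hypothetical corpus of 1,000 documents
rare = smoothed_idf(1000, 2)          # term appears in 2 documents
common = smoothed_idf(1000, 900)      # term appears in 900 documents
everywhere = smoothed_idf(1000, 1000) # term appears in every document
```

A term that appears in every document gets the minimum weight of 1, and the weight grows as the term gets rarer.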


In [26]:
%%time
# Character level tf-idfs
# article text (stop_words is left out because it only applies when analyzer='word')
vectorizer = TfidfVectorizer(max_features=2000, analyzer='char', ngram_range=(1,3), dtype=np.float32)
vectorizer.fit(article_text)
article_char_tfidf = vectorizer.transform(article_text)

Wall time: 12min 3s


In [27]:
%%time
# project title
vectorizer = TfidfVectorizer(max_features=2000, analyzer='char', ngram_range=(1,3), dtype=np.float32)
vectorizer.fit(title_text)
title_char_tfidf = vectorizer.transform(title_text)

Wall time: 22.9 s


In [28]:
%%time
# resource text
vectorizer = TfidfVectorizer(max_features=2000, analyzer='char', ngram_range=(1,3), dtype=np.float32)
vectorizer.fit(resource_text)
resource_char_tfidf = vectorizer.transform(resource_text)

Wall time: 3min 55s
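
For a sense of what the character-level vocabulary is built from, here is a rough sketch of 1-3 character n-gram extraction. It ignores the lowercasing and whitespace normalisation scikit-learn applies first, and `char_ngrams` is a hypothetical helper, not part of the library.

```python
def char_ngrams(text, min_n=1, max_n=3):
    """All character n-grams of length min_n..max_n, shortest first."""
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]

grams = char_ngrams('ipad')
```

Even a four-letter word yields nine n-grams, which is why `max_features` matters for keeping these matrices manageable.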


In [None]:
# To Be Continued...  My feeble attempts that weren't anywhere near all encompassing are below!

In [None]:
athing = resource_df[resource_df['id'] == 'p069063']

In [None]:
athing_length = len(athing)
for row in athing.itertuples():
    print(round(row[3] * row[4], 2))
athing_length

In [None]:
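sumprice = []
numbought = []
avgprice = []

for row in train_df.itertuples():
    try:
        # named rows rather than df so the master frame is not overwritten
        rows = resource_df[resource_df['id'] == row[1]]
        df_length = len(rows)
        # the original cell stops here; the lines below are an assumed completion
        total = float((rows['quantity'] * rows['price']).sum())
        sumprice.append(round(total, 2))
        numbought.append(df_length)
        avgprice.append(round(total / df_length, 2))
    except ZeroDivisionError:
        # no resources listed for this id
        avgprice.append(0.0)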

In [None]:
train_df.head(1)

In [None]:
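def resource_scrape(idnum):
    df = resource_df[resource_df['id'] == idnum]
    try:
        foo = [round(row[3] * row[4], 2) for row in df.itertuples()]
    except TypeError:
        # assumed fallback for non-numeric values; the original cell ends after the line above
        foo = []
    return foo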

In [None]:
data['project_is_approved'].value_counts()

In [None]:
data['teacher_number_of_previously_posted_projects'].value_counts() > 5