<h2><strong>In This Notebook...</strong></h2><br />
This is for data cleaning and engineering for our project.  Much inspiration received from <a href="https://www.kaggle.com/shivamb/extensive-text-data-feature-engineering/notebook" target="_blank">here</a>.

#### Dependencies

In [1]:
%%time
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from keras.preprocessing import sequence, text
from keras.layers import Input, Embedding

from nltk import word_tokenize
from nltk.corpus import stopwords
from textblob import TextBlob

import datetime as dt
import pandas as pd
import numpy as np
import scipy as sp
import warnings
import string
import collections

import matplotlib.pyplot as plt
%matplotlib inline

stop_words = list(set(stopwords.words('english')))
warnings.filterwarnings('ignore')
punctuation = string.punctuation

Using TensorFlow backend.


Wall time: 8.81 s


#### Read in data

In [2]:
%%time
# declare some strings
id_column = 'id'
missing_token = ' UNK '

# read in our data, parse_dates=['column name'] will read that column as a datetime object, can take a boolean, list of integers / names, list of lists or a dictionary,
# does different things depending on which one you use read the docs~
train = pd.read_csv('../data/train.csv', parse_dates=['project_submitted_datetime'])
test = pd.read_csv('../data/test.csv', parse_dates=['project_submitted_datetime'])
hopes = pd.read_csv('../data/resources.csv').fillna(missing_token)

# # lets make a master df of the train and test data to make our lives easier!
# df = pd.concat([train,test], axis=0)
# no lets not that was awful
df = train

Wall time: 6.65 s


##### Mathy Features
+ Min, Max, Mean Price for resources requested
+ Min Quantity, Max Quantity, Mean Quantity of resources requested
+ Min Total Price, Max Total Price, Mean Total Price of resources requested
+ Total Price of items requested by proposal
+ Number of Unique Items Requested by proposal
+ Quantity of items requested in proposal

In [3]:
%%time
# A new column for total price
hopes['total_price'] = hopes['quantity']*hopes['price']

# Make an aggregate df to join to our normal df
# the .agg method takes in a function, string, or a dictionary or list of strings or functions.  The dictionary keys will be column names upon which functions should be run
# I named it after the horse in Shadow of the Colossus~ the description column is now a count of how many, so it can be renamed to (number of )items
agro = {'description':'count', 'quantity':'sum', 'price':'sum', 'total_price':'sum'}
aggregatedf = hopes.groupby('id').agg(agro).rename(columns={'description':'items'})

# now lets use that string functionality of .agg to get the min, max, and mean values!
for maths in ['min', 'max', 'mean']:
    # romanized Japanese horse name from game, and that guy that changes names in ff because why not lets have fun with variable names they're just for here anyway
    aguro = {'quantity':maths, 'price':maths, 'total_price':maths}
    namingway = {'quantity':maths+'_quantity', 'price':maths+'_price', 'total_price':maths+'_total_price'}
    
    # do some aggregation and join it to our previously created df
    temporary = hopes.groupby('id').agg(aguro).rename(columns=namingway).fillna(0)
    aggregatedf = aggregatedf.join(temporary)
# This didn't work whoops # aggregatedf = aggregatedf.join([hopes.groupby('id').agg({'quantity':maths, 'price':maths, 'total_price':maths}).rename(columns={'quantity':maths+'_quantity', 'price':maths+'_price', 'total_price':maths+'_total_price'}).fillna(0) for maths in ['min', 'max', 'mean']])

# and finally give it the original description columns aggregated together with a space in between them
aggregatedf = aggregatedf.join(hopes.groupby('id').agg({'description':lambda x:' '.join(x.values.astype(str))}).rename(columns={'description':'resource_description'}))

# Join that together with our everything df and check it out
df = df.join(aggregatedf, on='id')
df.head()

Wall time: 6.19 s


#### Great, now lets play with time!
+ Year of submission
+ Month of submission
+ Year Day (1-365) of submission
+ Month Day (1-31) of submission
+ Week Day (1-7) of submission
+ Hour of submission

In [117]:
%%time
# using datetime to make the above features
df['year'] = df['project_submitted_datetime'].dt.year
df['month'] = df['project_submitted_datetime'].dt.month
df['year_day'] = df['project_submitted_datetime'].dt.dayofyear
df['month_day'] = df['project_submitted_datetime'].dt.day
df['week_day'] = df['project_submitted_datetime'].dt.weekday
df['hour'] = df['project_submitted_datetime'].dt.hour
df.head(1)

Wall time: 115 ms


#### Text based features
+ Length of essays including spaces
+ Length of project title
+ Word count across essays
+ Character count across essays
+ Word density / average length of words used
+ Punctuation count
+ Uppercase count
+ Title Word Count (Gotta Have This Case)
+ Stopword Count

In [5]:
%%time
# fill empty values with missing token ' UNK '
df['project_essay_3'] = df['project_essay_3'].fillna(missing_token)
df['project_essay_4'] = df['project_essay_4'].fillna(missing_token)

Wall time: 23.9 ms


In [6]:
%%time
# get length of each essay and its title
df['essay1_len'] = df['project_essay_1'].apply(len)
df['essay2_len'] = df['project_essay_2'].apply(len)
df['essay3_len'] = df['project_essay_3'].apply(len)
df['essay4_len'] = df['project_essay_4'].apply(len)
df['title_len'] = df['project_title'].apply(len)
df.head()

Wall time: 230 ms


In [7]:
%%time
# Combine the essays into one string
df['text'] = df.apply(lambda row: ' '.join([str(row['project_essay_1']),
                                            str(row['project_essay_2']),
                                            str(row['project_essay_3']),
                                            str(row['project_essay_4'])]), axis=1)

Wall time: 13.4 s


In [8]:
%%time
# get our delicious features from that massive text
df['char_count'] = df['text'].apply(len)
df['word_count'] = df['text'].apply(lambda x: len(x.split()))
df['word_density'] = df['char_count'] / (df['word_count'] + 1)
df['punctuation_count'] = df['text'].apply(lambda x: len("".join(_ for _ in x if _ in punctuation)))
df['title_word_count'] = df['text'].apply(lambda x: len([word for word in x.split() if word.istitle()]))
df['upper_case_word_count'] = df['text'].apply(lambda x: len([word for word in x.split() if word.isupper()]))
df['stopword_count'] = df['text'].apply(lambda x: len([word for word in x.split() if word.lower() in stop_words]))
df.head()

Wall time: 1min 34s


#### NLP style features
+ Article Polarity - Sentiment polarity
+ Article Subjectivity - Sentiment subjectivity
+ Noun Count - count of words that are nouns, the ones that name objects, people, etc...
+ Verb Count - count of words that are verbs, the ones that tell you about moving like walk or think...
+ Adjective Count - count of words that are adjectives, the ones that describe nouns like red or big...
+ Adverb Count - count of words that are adverbs, the ones that describe adjectives or verbs and typically end with -ly
+ Pronoun Count - count of words that are pronouns, the ones that replace other words like her or they

<a href="https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/" target="_blank">NLTK Part of Speech Tags</a> <- Click me (don't get excited there's no R)

In [94]:
%%time
def blob_the_text(text):
    """
    take in a text and apply a bunch of text blob features to it
    INPUT: text string
    OUTPUT: a tuple of everything you might want textblob to run on that text
            sentiment polarity,
            sentiment subjectivity,
            count of nouns,
            count of pronouns,
            count of verbs,
            count of adjectives,
            count of adverbs
    """
    tb = TextBlob(text)

    nouns = ['NN', 'NNS', 'NNP', 'NNPS'] #singular, plural regular nouns, singular, plural proper nouns
    pronouns = ['PRP', 'PRP$', 'WP', 'WP$'] #personal pronouns, possessive personal, wh pronouns, possessive wh pronouns
    verbs = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'] #verb base, past tense, gerund, past participle, singular present, 3rd person present
    adjectives = ['JJ', 'JJR', 'JJS'] #adjective, comparative, superlative
    adverbs = ['RB', 'RBR', 'RBS', 'WRB'] #adverb, compartive, superlative, wh- adverb

    tagcol = collections.namedtuple('tag', ['word', 'pos'])
    tags = [tagcol(word[0], word[1]) for word in tb.tags]

    try:
        pol = tb.sentiment.polarity
    except:
        pol = 0.0
    try:
        subj = tb.sentiment.subjectivity
    except:
        subj = 0.0
    ncount = sum(collections.Counter(tag.pos for tag in tags if tag.pos in nouns).values())
    procount = sum(collections.Counter(tag.pos for tag in tags if tag.pos in pronouns).values())
    vcount = sum(collections.Counter(tag.pos for tag in tags if tag.pos in verbs).values())
    adjcount = sum(collections.Counter(tag.pos for tag in tags if tag.pos in adjectives).values())
    advcount = sum(collections.Counter(tag.pos for tag in tags if tag.pos in adverbs).values())

    # print('polarity {} subjectivity {}'.format(pol, subj))
    # print('pos tags: {}'.format(posstring))
    # trying = TextBlob(df['text'][0]).tags
    # tag = collections.namedtuple('tag', ['word', 'pos'])
    # tags = [tag(thing[0], thing[1]) for thing in trying]
    return [pol, subj, ncount, procount, vcount, adjcount, advcount]

Wall time: 0 ns


In [103]:
%%time
df[['polarity', 'subjectivity', 'noun_count', 'pronoun_count', 'verb_count', 'adjective_count', 'adverb_count']] = pd.DataFrame([(blob_the_text(row['text'])) for index, row in df.iterrows()], index = df.index)

Wall time: 1h 2min 38s


In [104]:
df[['project_grade_category', 'polarity', 'subjectivity', 'noun_count', 'pronoun_count', 'verb_count', 'adjective_count', 'adverb_count']].head()

Unnamed: 0,project_grade_category,polarity,subjectivity,noun_count,pronoun_count,verb_count,adjective_count,adverb_count
0,Grades PreK-2,0.213402,0.391136,81,36,58,25,16
1,Grades 3-5,0.192889,0.597111,59,19,27,25,10
2,Grades 3-5,0.353888,0.53445,58,32,51,19,7
3,Grades 3-5,0.17588,0.416224,105,34,78,37,23
4,Grades 6-8,0.285417,0.557192,44,23,33,18,9


In [11]:
# %%time
# # functions get polarity and subjectivity using TextBlob
# def get_polarity(text):
#     try:
#         textblob = TextBlob(unicode(text, 'utf-8'))
#         pol = textblob.sentiment.polarity
#     except:
#         pol = 0.0
#     return pol

# def get_subjectivity(text):
#     try:
#         textblob = TextBlob(unicode(text, 'utf-8'))
#         subj = textblob.sentiment.subjectivity
#     except:
#         subj = 0.0
#     return subj

Wall time: 0 ns


In [12]:
# %%time
# # Now lets apply those functions to our df
# df['polarity'] = df['text'].apply(get_polarity)
# df['subjectivity'] = df['text'].apply(get_subjectivity)

Wall time: 212 ms


In [38]:
# df['polarity'].value_counts()

0.0    182080
Name: polarity, dtype: int64

In [13]:
# %%time
# # function to retrieve the parts of speech tag counts
# def pos_check(x, flag):
#     cnt = 0
#     try:
#         wiki = TextBlob(x)
#         for tup in wiki.tags:
#             ppo = list(tup)[1]
#             if ppo in pos_dic[flag]:
#                 cnt += 1
#     except:
#         pass
#     return cnt

Wall time: 0 ns


In [None]:
# def get_polarity(text):
#     try:
#         textblob = TextBlob(unicode(text, 'utf-8'))
#         pol = textblob.sentiment.polarity
#     except:
#         pol = 0.0
#     return pol

# def get_subjectivity(text):
#     try:
#         textblob = TextBlob(unicode(text, 'utf-8'))
#         subj = textblob.sentiment.subjectivity
#     except:
#         subj = 0.0
#     return subj

In [10]:
# %%time
# # make a dictionary for parts of speech
# pos_dict = {
#     'noun': ['NN', 'NNS', 'NNP', 'NNPS'], #singular, plural regular nouns, singular, plural proper nouns
#     'pron': ['PRP', 'PRP$', 'WP', 'WP$'], #personal pronouns, possessive personal, wh pronouns, possessive wh pronouns
#     'verb': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'], #verb base, past tense, gerund, past participle, singular present, 3rd person present
#     'adj': ['JJ', 'JJR', 'JJS'], #adjective, comparative, superlative
#     'adv': ['RB', 'RBR', 'RBS', 'WRB'] #adverb, compartive, superlative, wh- adverb
# }

Wall time: 0 ns


In [13]:
# %%time
# # now lets use that function to make new columns each in their own cell because it takes a while
# df['noun_count'] = df['text'].apply(lambda x: pos_check(x, 'noun'))

Wall time: 1h 18min 6s


In [14]:
# %%time
# df['verb_count'] = df['text'].apply(lambda x: pos_check(x, 'verb'))

Wall time: 1h 17min 45s


In [15]:
# %%time
# df['adj_count'] = df['text'].apply(lambda x: pos_check(x, 'adj'))

Wall time: 1h 17min 27s


In [16]:
# %%time
# df['adv_count'] = df['text'].apply(lambda x: pos_check(x, 'adv'))

Wall time: 1h 17min 20s


In [17]:
# %%time
# df['pron_count'] = df['text'].apply(lambda x: pos_check(x, 'pron'))

Wall time: 1h 17min 12s


In [105]:
%%time
df.head()

Wall time: 997 µs


Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,...,title_word_count,upper_case_word_count,stopword_count,polarity,subjectivity,noun_count,pronoun_count,verb_count,adjective_count,adverb_count
0,p036502,484aaf11257089a66cfedc9461c6bd0a,Ms.,NV,2016-11-18 14:45:59,Grades PreK-2,Literacy & Language,Literacy,Super Sight Word Centers,Most of my kindergarten students come from low...,...,21,7,151,0.213402,0.391136,81,36,58,25,16
1,p039565,df72a3ba8089423fa8a94be88060f6ed,Mrs.,GA,2017-04-26 15:57:28,Grades 3-5,"Music & The Arts, Health & Sports","Performing Arts, Team Sports",Keep Calm and Dance On,Our elementary school is a culturally rich sch...,...,15,5,79,0.192889,0.597111,59,19,27,25,10
2,p233823,a9b876a9252e08a55e3d894150f75ba3,Ms.,UT,2017-01-01 22:57:44,Grades 3-5,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",Lets 3Doodle to Learn,Hello;\r\nMy name is Mrs. Brotherton. I teach ...,...,26,6,103,0.353888,0.53445,58,32,51,19,7
3,p185307,525fdbb6ec7f538a48beebaa0a51b24f,Mr.,NC,2016-08-12 15:42:11,Grades 3-5,Health & Sports,Health & Wellness,"\""Kid Inspired\"" Equipment to Increase Activit...",My students are the greatest students but are ...,...,31,6,188,0.17588,0.416224,105,34,78,37,23
4,p013780,a63b5547a7239eae4c1872670848e61a,Mr.,CA,2016-08-06 09:09:11,Grades 6-8,Health & Sports,Health & Wellness,We need clean water for our culinary arts class!,My students are athletes and students who are ...,...,13,2,98,0.285417,0.557192,44,23,33,18,9


In [106]:
%%time
df.to_csv('../data/training_clean.csv')

Wall time: 24.1 s


#### Make that all into a nice little file that will hopefully work nicely

In [114]:
%%file clean_and_feature.py

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from keras.preprocessing import sequence, text
from keras.layers import Input, Embedding

from nltk import word_tokenize
from nltk.corpus import stopwords
from textblob import TextBlob

import datetime as dt
import pandas as pd
import numpy as np
import warnings
import string
import collections

import matplotlib.pyplot as plt
%matplotlib inline

def blob_the_text(text):
    """
    take in a text and apply a bunch of text blob features to it
    INPUT: text string
    OUTPUT: a tuple of everything you might want textblob to run on that text
            sentiment polarity,
            sentiment subjectivity,
            count of nouns,
            count of pronouns,
            count of verbs,
            count of adjectives,
            count of adverbs
    """
    tb = TextBlob(text)

    nouns = ['NN', 'NNS', 'NNP', 'NNPS'] #singular, plural regular nouns, singular, plural proper nouns
    pronouns = ['PRP', 'PRP$', 'WP', 'WP$'] #personal pronouns, possessive personal, wh pronouns, possessive wh pronouns
    verbs = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'] #verb base, past tense, gerund, past participle, singular present, 3rd person present
    adjectives = ['JJ', 'JJR', 'JJS'] #adjective, comparative, superlative
    adverbs = ['RB', 'RBR', 'RBS', 'WRB'] #adverb, compartive, superlative, wh- adverb

    tagcol = collections.namedtuple('tag', ['word', 'pos'])
    tags = [tagcol(word[0], word[1]) for word in tb.tags]

    try:
        pol = tb.sentiment.polarity
    except:
        pol = 0.0
    try:
        subj = tb.sentiment.subjectivity
    except:
        subj = 0.0
    ncount = sum(collections.Counter(tag.pos for tag in tags if tag.pos in nouns).values())
    procount = sum(collections.Counter(tag.pos for tag in tags if tag.pos in pronouns).values())
    vcount = sum(collections.Counter(tag.pos for tag in tags if tag.pos in verbs).values())
    adjcount = sum(collections.Counter(tag.pos for tag in tags if tag.pos in adjectives).values())
    advcount = sum(collections.Counter(tag.pos for tag in tags if tag.pos in adverbs).values())

    # print('polarity {} subjectivity {}'.format(pol, subj))
    # print('pos tags: {}'.format(posstring))
    # trying = TextBlob(df['text'][0]).tags
    # tag = collections.namedtuple('tag', ['word', 'pos'])
    # tags = [tag(thing[0], thing[1]) for thing in trying]
    return [pol, subj, ncount, procount, vcount, adjcount, advcount]

stop_words = list(set(stopwords.words('english')))
warnings.filterwarnings('ignore')
punctuation = string.punctuation

# declare some strings
id_column = 'id'
missing_token = ' UNK '

# read in our data, parse_dates=['column name'] will read that column as a datetime object, can take a boolean, list of integers / names, list of lists or a dictionary,
# does different things depending on which one you use read the docs~
train = pd.read_csv('data/train.csv', parse_dates=['project_submitted_datetime'])
# test = pd.read_csv('data/test.csv', parse_dates=['project_submitted_datetime'])
hopes = pd.read_csv('data/resources.csv').fillna(missing_token)

# # lets make a master df of the train and test data to make our lives easier!
# df = pd.concat([train,test], axis=0)
# no lets not that was awful
df = train

# A new column for total price
hopes['total_price'] = hopes['quantity']*hopes['price']

# Make an aggregate df to join to our normal df
# the .agg method takes in a function, string, or a dictionary or list of strings or functions.  The dictionary keys will be column names upon which functions should be run
# I named it after the horse in Shadow of the Colossus~ the description column is now a count of how many, so it can be renamed to (number of )items
agro = {'description':'count', 'quantity':'sum', 'price':'sum', 'total_price':'sum'}
aggregatedf = hopes.groupby('id').agg(agro).rename(columns={'description':'items'})

# now lets use that string functionality of .agg to get the min, max, and mean values!
for maths in ['min', 'max', 'mean']:
    # romanized Japanese horse name from game, and that guy that changes names in ff because why not lets have fun with variable names they're just for here anyway
    aguro = {'quantity':maths, 'price':maths, 'total_price':maths}
    namingway = {'quantity':maths+'_quantity', 'price':maths+'_price', 'total_price':maths+'_total_price'}
    
    # do some aggregation and join it to our previously created df
    temporary = hopes.groupby('id').agg(aguro).rename(columns=namingway).fillna(0)
    aggregatedf = aggregatedf.join(temporary)
# This didn't work whoops # aggregatedf = aggregatedf.join([hopes.groupby('id').agg({'quantity':maths, 'price':maths, 'total_price':maths}).rename(columns={'quantity':maths+'_quantity', 'price':maths+'_price', 'total_price':maths+'_total_price'}).fillna(0) for maths in ['min', 'max', 'mean']])

# and finally give it the original description columns aggregated together with a space in between them
aggregatedf = aggregatedf.join(hopes.groupby('id').agg({'description':lambda x:' '.join(x.values.astype(str))}).rename(columns={'description':'resource_description'}))

# Join that together with our everything df and check it out
df = df.join(aggregatedf, on='id')
df.head()

# using datetime to make the above features
df['year'] = df['project_submitted_datetime'].dt.year
df['month'] = df['project_submitted_datetime'].dt.month
df['year_day'] = df['project_submitted_datetime'].dt.dayofyear
df['month_day'] = df['project_submitted_datetime'].dt.day
df['week_day'] = df['project_submitted_datetime'].dt.weekday
df['hour'] = df['project_submitted_datetime'].dt.hour

# fill empty values with missing token ' UNK '
df['project_essay_3'] = df['project_essay_3'].fillna(missing_token)
df['project_essay_4'] = df['project_essay_4'].fillna(missing_token)

# get length of each essay and its title
df['essay1_len'] = df['project_essay_1'].apply(len)
df['essay2_len'] = df['project_essay_2'].apply(len)
df['essay3_len'] = df['project_essay_3'].apply(len)
df['essay4_len'] = df['project_essay_4'].apply(len)
df['title_len'] = df['project_title'].apply(len)

# Combine the essays into one string
df['text'] = df.apply(lambda row: ' '.join([str(row['project_essay_1']),
                                            str(row['project_essay_2']),
                                            str(row['project_essay_3']),
                                            str(row['project_essay_4'])]), axis=1)

# get our delicious features from that massive text
df['char_count'] = df['text'].apply(len)
df['word_count'] = df['text'].apply(lambda x: len(x.split()))
df['word_density'] = df['char_count'] / (df['word_count'] + 1)
df['punctuation_count'] = df['text'].apply(lambda x: len("".join(_ for _ in x if _ in punctuation)))
df['title_word_count'] = df['text'].apply(lambda x: len([word for word in x.split() if word.istitle()]))
df['upper_case_word_count'] = df['text'].apply(lambda x: len([word for word in x.split() if word.isupper()]))
df['stopword_count'] = df['text'].apply(lambda x: len([word for word in x.split() if word.lower() in stop_words]))

df[['polarity', 'subjectivity', 'noun_count', 'pronoun_count', 'verb_count', 'adjective_count', 'adverb_count']] = pd.DataFrame([(blob_the_text(row['text'])) for index, row in df.iterrows()], index = df.index)

df.to_csv('../data/training_clean.csv')

Overwriting clean_and_feature.py


#### TF-IDF style features
+ 1-3 NGram TF-IDF for Article Text at word level
+ 1-3 NGram TF-IDF for Project Title at word level
+ 1-3 NGram TF-IDF for Resource Text at word level
+ 1-3 NGram TF-IDF for Article Text at character level
+ 1-3 NGram TF-IDF for Project Title at character level
+ 1-3 NGram TF-IDF for Resource Text at character level

In [134]:
train = df[['teacher_number_of_previously_posted_projects', 'project_is_approved', 'items', 'quantity', 'price', 'total_price', 'min_quantity', 'min_price', 'min_total_price',
          'max_quantity', 'max_price', 'max_total_price', 'year', 'month', 'year_day', 'month_day', 'week_day', 'hour', 'essay1_len', 'essay2_len', 'essay3_len', 'essay4_len',
          'title_len', 'char_count', 'word_count', 'word_density', 'punctuation_count', 'title_word_count', 'stopword_count', 'polarity', 'subjectivity', 'noun_count',
          'pronoun_count', 'verb_count', 'adjective_count', 'adverb_count', 'text', 'resource_description', 'project_resource_summary', 'project_title']]

In [135]:
%%time
X = train.drop('project_is_approved', axis=1)
y = train['project_is_approved']

X['resource_text'] = X.apply(lambda row: ' '.join([str(row['resource_description']), str(row['project_resource_summary'])]), axis=1)
X = X.drop(['resource_description','project_resource_summary'], axis=1)

In [136]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [130]:
%%time

article_text = list(X_train['text'].values)
title_text = list(X_train['project_title'].values)
resource_text = list(X_train['resource_text'].values)

Wall time: 38.9 ms


In [131]:
%%time
# word level tf-idf for article text
article_vectorizer = TfidfVectorizer(max_features=2500, analyzer='word', stop_words='english', ngram_range=(1,3), dtype=np.float32)
article_vectorizer.fit(article_text)
article_word_tfidf = article_vectorizer.transform(article_text)

Wall time: 2min 53s


In [132]:
%%time
# word level tf-idf for titles
vectorizer = TfidfVectorizer(max_features=2500, analyzer='word', stop_words='english', ngram_range=(1,3), dtype=np.float32)
vectorizer.fit(title_text)
title_word_tfidf = vectorizer.transform(title_text)

Wall time: 4.11 s


In [133]:
%%time
# word level tf-idf for resource text
resource_vectorizer = TfidfVectorizer(max_features=2500, analyzer='word', stop_words='english', ngram_range=(1,3), dtype=np.float32)
resource_vectorizer.fit(resource_text)
resource_word_tfidf = resource_vectorizer.transform(resource_text)

Wall time: 56.4 s


In [137]:
%%time
X_train = X_train.drop(['text', 'project_title', 'resource_text'], axis=1)
extra = sp.sparse.csr_matrix(X_train.astype(float))

Wall time: 294 ms


In [138]:
extra.shape

(136560, 35)

In [139]:
X_trained = sp.sparse.hstack((article_word_tfidf, extra))
X_trained.shape

(136560, 2535)

### Repeat for test group

In [140]:
%%time

article_text = list(X_test['text'].values)
title_text = list(X_test['project_title'].values)
resource_text = list(X_test['resource_text'].values)

Wall time: 28.9 ms


In [141]:
%%time
# word level tf-idf for article text
article_vectorizer = TfidfVectorizer(max_features=2500, analyzer='word', stop_words='english', ngram_range=(1,3), dtype=np.float32)
article_vectorizer.fit(article_text)
article_word_tfidf = article_vectorizer.transform(article_text)

Wall time: 1min 2s


In [None]:
%%time
# word level tf-idf for titles
vectorizer = TfidfVectorizer(max_features=2500, analyzer='word', stop_words='english', ngram_range=(1,3), dtype=np.float32)
vectorizer.fit(title_text)
title_word_tfidf = vectorizer.transform(title_text)

In [None]:
%%time
# word level tf-idf for resource text
resource_vectorizer = TfidfVectorizer(max_features=2500, analyzer='word', stop_words='english', ngram_range=(1,3), dtype=np.float32)
resource_vectorizer.fit(resource_text)
resource_word_tfidf = resource_vectorizer.transform(resource_text)

In [143]:
%%time
X_test = X_test.drop(['text', 'project_title', 'resource_text'], axis=1)
extra = sp.sparse.csr_matrix(X_test.astype(float))

Wall time: 104 ms


In [144]:
X_tested = sp.sparse.hstack((article_word_tfidf, extra))
X_tested.shape

(45520, 2535)

In [145]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [152]:
%%time
for C in [1, 10, 100, 1000, 1e4]:
    logreg = LogisticRegression(C=C)
    logreg.fit(X_trained, y_train)
    y_pred_class = logreg.predict(X_tested)

    print('C: {} Score: {}'.format(C, metrics.accuracy_score(y_test, y_pred_class)))

C: 1 Score: 0.8486379613356766
C: 10 Score: 0.8486379613356766
C: 100 Score: 0.8486599297012303
C: 1000 Score: 0.8486379613356766
C: 10000.0 Score: 0.848615992970123
Wall time: 1min 21s


### and once more for everything

In [153]:
%%time

article_text = list(X['text'].values)
title_text = list(X['project_title'].values)
resource_text = list(X['resource_text'].values)

Wall time: 57.8 ms


In [154]:
%%time
# word level tf-idf for article text
article_vectorizer = TfidfVectorizer(max_features=2500, analyzer='word', stop_words='english', ngram_range=(1,3), dtype=np.float32)
article_vectorizer.fit(article_text)
article_word_tfidf = article_vectorizer.transform(article_text)

Wall time: 3min 49s


In [155]:
%%time
# word level tf-idf for titles
vectorizer = TfidfVectorizer(max_features=2500, analyzer='word', stop_words='english', ngram_range=(1,3), dtype=np.float32)
vectorizer.fit(title_text)
title_word_tfidf = vectorizer.transform(title_text)

Wall time: 5.49 s


In [156]:
%%time
# word level tf-idf for resource text
resource_vectorizer = TfidfVectorizer(max_features=2500, analyzer='word', stop_words='english', ngram_range=(1,3), dtype=np.float32)
resource_vectorizer.fit(resource_text)
resource_word_tfidf = resource_vectorizer.transform(resource_text)

Wall time: 1min 15s


In [157]:
%%time
X = X.drop(['text', 'project_title', 'resource_text'], axis=1)
extra = sp.sparse.csr_matrix(X.astype(float))

Wall time: 376 ms


In [158]:
%%time
X_training_arc = sp.sparse.hstack((article_word_tfidf, extra))
X_training_arc = sp.sparse.hstack((title_word_tfidf, extra))
X_training_arc = sp.sparse.hstack((resource_word_tfidf, extra))
X_training_arc.shape

Wall time: 1.37 s


In [160]:
X_training_arc.shape

(182080, 2535)

In [166]:
%%time
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=5000, activation='relu', input_dim=2535))
model.add(Dense(units=5000, activation='relu'))
model.add(Dense(units=2, activation='softmax'))

TypeError: softmax() got an unexpected keyword argument 'axis'

TypeError: softmax() got an unexpected keyword argument 'axis'

In [24]:
%%time
# create a dictionary mapping tokens to their tfidf values
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
tfidf = pd.DataFrame(columns=['resource_word_tfidf']).from_dict(dict(tfidf), orient='index')
tfidf.columns = ['resource_word_tfidf']

Wall time: 18 ms


In [30]:
%%time
# 15 highest tf-idf from that list
tfidf.sort_values(by=['resource_word_tfidf'], ascending=False).head(15)

Wall time: 6.96 ms


Unnamed: 0,resource_word_tfidf
branches book,7.823436
diaries dork diaries,7.802456
superbright,7.792129
dork diaries dork,7.785303
diaries dork,7.781908
branches,7.622444
12 amp 34,7.61095
34 18 34,7.61095
amp 34 18,7.61095
paper 12 amp,7.608097


In [26]:
%%time
# Character level tf-idfs
# article text
vectorizer = TfidfVectorizer(max_features=2000, analyzer='char', stop_words='english', ngram_range=(1,3), dtype=np.float32)
vectorizer.fit(article_text)
article_char_tfidf = vectorizer.transform(article_text)

Wall time: 12min 3s


In [27]:
%%time
# project title
vectorizer = TfidfVectorizer(max_features=2000, analyzer='char', stop_words='english', ngram_range=(1,3), dtype=np.float32)
vectorizer.fit(title_text)
title_char_tfidf = vectorizer.transform(title_text)

Wall time: 22.9 s


In [28]:
%%time
# resource text
vectorizer = TfidfVectorizer(max_features=2000, analyzer='char', stop_words='english', ngram_range=(1,3), dtype=np.float32)
vectorizer.fit(resource_text)
resource_char_tfidf = vectorizer.transform(resource_text)

Wall time: 3min 55s


In [None]:
# To Be Continued...  My feeble attempts that weren't anywhere near all encompassing are below!

In [None]:
athing = resource_df[resource_df['id'] == 'p069063']

In [None]:
athing_length = len(athing)
for row in athing.itertuples():
    print(round(row[3] * row[4], 2))
athing_length

In [None]:
sumprice = []
numbought = []
avgprice = []

for row in train_df.itertuples():
    try:
        df = resource_df[resource_df['id'] == row[1]]
        df_length = len(df)
        

In [None]:
train_df.head(1)

In [None]:
def resource_scrape(idnum):
    df = resource_df[resource_df['id'] == idnum]
    try:
        foo = [round(row[3] * row[4], 2) for row in df.itertuples()]
        

In [None]:
data['project_is_approved'].value_counts()

In [None]:
data['teacher_number_of_previously_posted_projects'].value_counts() > 5