# Text classification task: Bag of Words Approach Exploration

Comparing the performance of TF-IDF and simple word-count features across different feature dimensionalities and classifiers.

## TL;DR 

- A good setting for TF-IDF seems to be max_features = 5000 (above this threshold the classification results do not vary significantly)

- A good setting for the simple word-count seems to be max_features = 5000 (same consideration as above)

- The Random Forest Classifier parameters have been tuned with a RandomizedSearchCV, but the resulting "best_params" do not perform better than the defaults. Therefore, the defaults are used in the subsequent classifications (time-saving)

## Packages

In [1]:
#---- magic trio + special guest
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#---- utils
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from pprint import pprint 
from sklearn.model_selection import RandomizedSearchCV
import time

In [6]:
df = pd.read_pickle("../data/df_preprocessed_eng.pckle")

df.head()

| Unnamed: 0 | description | event_id | category | category_label | lang | lang_reliab | desc_stemm | desc_lemm | desc_stemm_no_badwords | desc_lemm_no_badwords |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `<p><span>Hey explorersssss, what's up?</span><...` | 239719250 | language/ethnic identity | 15 | English | 99 | secret spot hong kong night view food adventur... | secret spot hong kong night view food adventur... | secret spot hong kong night view food adventur... | secret spot hong kong night view food adventur... |
| 1 | `<p>Free, unauditioned, collaborative pop choir...` | gpjktmywhbnb | music | 18 | English | 99 | sing us free unaudit collabor pop choir identi... | sing u free unauditioned collaborative pop cho... | sing us unaudit collabor pop choir identifi wo... | sing u unauditioned collaborative pop choir id... |
| 2 | `<p>We provide a forum to learn about, promote ...` | drrtzmywhbgb | career/business | 3 | English | 98 | east valley busi network meetup provid forum l... | east valley business network meetup provide fo... | east valley busi network provid forum learn pr... | east valley business network provide forum lea... |
| 3 | `<p><b>【WhyNot!?JAPAN + MeetUp Collaboration Ev...` | 239719229 | socializing | 27 | English | 92 | friendli friday whynot japan meetup collabor e... | friendly friday whynot japan meetup collaborat... | friendli whynot japan collabor whynot japan si... | friendly whynot japan collaboration whynot jap... |
| 4 | `<p>This is an introductory meeting to get to k...` | mtzxwmywjbjb | tech | 30 | English | 99 | authent option introductori meet get know fell... | authentication option introductory meet get kn... | authent option introductori get fellow coder s... | authentication option introductory get fellow ... |


# Some Statistics

## Plain Descriptions

In [5]:
print("word count (tot):", sum(df['desc_no_html_no_link_no_emoji'].str.split().map(lambda x: len(x))))
print("average word count per description:", int(sum(df['desc_no_html_no_link_no_emoji'].str.split().map(lambda x: len(x)))/(df['desc_no_html_no_link_no_emoji'].shape[0])))

unique_words = set()
df['desc_no_html_no_link_no_emoji'].str.split().map(unique_words.update)
print("total unique words:", len(unique_words))

print("--------------------------------")

from collections import Counter
wc = Counter()
df['desc_no_html_no_link_no_emoji'].str.split().map(wc.update)
print("top 20 most common words:", wc.most_common(20))


word count (tot): 23648302
average word count per description: 176
total unique words: 858114
--------------------------------
top 20 most common words: [('the', 765406), ('and', 699862), ('to', 650766), ('a', 442209), ('of', 415498), ('you', 275024), ('is', 271980), ('in', 270494), ('for', 269016), ('will', 196949), ('on', 169135), ('your', 165993), ('at', 162985), ('with', 159761), ('be', 152646), ('are', 147640), ('or', 124561), ('-', 124375), ('that', 103818), ('have', 102617)]


## After Stemming

In [128]:
print("word count (tot):", sum(df['desc_stemm'].str.split().map(lambda x: len(x))))
print("average word count per description:", int(sum(df['desc_stemm'].str.split().map(lambda x: len(x)))/(df['desc_stemm'].shape[0])))

unique_words = set()
df['desc_stemm'].str.split().map(unique_words.update)
print("total unique words:", len(unique_words))

print("--------------------------------")

from collections import Counter
wc = Counter()
df['desc_stemm'].str.split().map(wc.update)
print("top 20 most common words:", wc.most_common(20))



word count (tot): 12830221
average word count per description: 106
total unique words: 164722
--------------------------------
top 20 most common words: [('event', 81348), ('pleas', 77385), ('pm', 71569), ('us', 66477), ('time', 65960), ('meet', 65883), ('group', 61979), ('get', 59971), ('class', 53120), ('free', 52327), ('join', 52031), ('one', 51165), ('bring', 48424), ('new', 47862), ('meetup', 47363), ('learn', 46186), ('make', 45425), ('may', 44927), ('peopl', 44504), ('day', 43799)]


## After Lemmatization

In [129]:
print("word count (tot):", sum(df['desc_lemm'].str.split().map(lambda x: len(x))))
print("average word count per description:", int(sum(df['desc_lemm'].str.split().map(lambda x: len(x)))/(df['desc_lemm'].shape[0])))

unique_words = set()
df['desc_lemm'].str.split().map(unique_words.update)
print("total unique words:", len(unique_words))

print("--------------------------------")

from collections import Counter
wc = Counter()
df['desc_lemm'].str.split().map(wc.update)
print("top 20 most common words:", wc.most_common(20))



word count (tot): 12830221
average word count per description: 106
total unique words: 188989
--------------------------------
top 20 most common words: [('event', 81320), ('please', 77248), ('pm', 71549), ('u', 66477), ('time', 65386), ('get', 62897), ('group', 61909), ('class', 53082), ('free', 52366), ('join', 51640), ('make', 51170), ('one', 50560), ('new', 48535), ('bring', 48530), ('meet', 47924), ('may', 44921), ('people', 44504), ('meetup', 43942), ('day', 43794), ('learn', 43693)]


## After Stemming and Further Word Removal

In [130]:
print("word count (tot):", sum(df['desc_stemm_no_badwords'].str.split().map(lambda x: len(x))))
print("average word count per description:", int(sum(df['desc_stemm_no_badwords'].str.split().map(lambda x: len(x)))/(df['desc_stemm_no_badwords'].shape[0])))

unique_words = set()
df['desc_stemm_no_badwords'].str.split().map(unique_words.update)
print("total unique words:", len(unique_words))

print("--------------------------------")

from collections import Counter
wc = Counter()
df['desc_stemm_no_badwords'].str.split().map(wc.update)
print("top 20 most common words:", wc.most_common(20))



word count (tot): 11159308
average word count per description: 92
total unique words: 164648
--------------------------------
top 20 most common words: [('us', 66477), ('get', 59971), ('class', 53120), ('one', 51165), ('learn', 46186), ('make', 45425), ('may', 44927), ('park', 43420), ('work', 37086), ('start', 36371), ('use', 36094), ('go', 35779), ('see', 34699), ('need', 34050), ('take', 33265), ('includ', 31464), ('busi', 30854), ('experi', 30481), ('game', 29758), ('walk', 28881)]


## After Lemmatization and Further Word Removal

In [132]:
print("word count (tot):", sum(df['desc_lemm_no_badwords'].str.split().map(lambda x: len(x))))
print("average word count per description:", int(sum(df['desc_lemm_no_badwords'].str.split().map(lambda x: len(x)))/(df['desc_lemm_no_badwords'].shape[0])))

unique_words = set()
df['desc_lemm_no_badwords'].str.split().map(unique_words.update)
print("total unique words:", len(unique_words))

print("--------------------------------")

from collections import Counter
wc = Counter()
df['desc_lemm_no_badwords'].str.split().map(wc.update)
print("top 20 most common words:", wc.most_common(20))


word count (tot): 11246062
average word count per description: 93
total unique words: 188993
--------------------------------
top 20 most common words: [('u', 66477), ('get', 62886), ('class', 53082), ('make', 51101), ('one', 50560), ('may', 44921), ('learn', 43618), ('go', 40522), ('park', 37339), ('see', 36691), ('work', 36625), ('take', 36177), ('start', 36080), ('use', 34252), ('need', 33708), ('experience', 33129), ('include', 31295), ('game', 29503), ('life', 29462), ('business', 29450)]
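
The five cells above repeat the same computation on different columns. As a minimal sketch, the statistics could be gathered by a single helper. `corpus_stats` is a hypothetical name that does not appear in the original notebook, and it assumes whitespace tokenization, as in the cells above.

```python
from collections import Counter

def corpus_stats(texts, top_n=20):
    """Return total tokens, average tokens per text, vocabulary size and top words.

    `texts` is any iterable of strings (a pandas Series of descriptions works too).
    """
    counts = Counter()
    n_texts = 0
    for text in texts:
        counts.update(text.split())  # whitespace tokenization
        n_texts += 1
    total = sum(counts.values())
    return {
        "total_words": total,
        "avg_words": total // n_texts if n_texts else 0,
        "unique_words": len(counts),
        "top_words": counts.most_common(top_n),
    }

# Toy input, made up for illustration.
stats = corpus_stats(["free class in the park", "free game night", "learn to walk"], top_n=1)
print(stats)
```

With the notebook's dataframe this would be called as, for example, `corpus_stats(df['desc_lemm'])`.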


## Text representation

In [63]:
# Number of trees in the random forest
n_estimators = [100, 200, 500, 1000]
# Number of features to consider at every split
max_features = ['auto', 'sqrt', 100, 300, 1000]
# Maximum number of levels in tree
max_depth = [10, 30, 50, 70, 100, None]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)

{'bootstrap': [True, False],
 'max_depth': [10, 30, 50, 70, 100, None],
 'max_features': ['auto', 'sqrt', 100, 300, 1000],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [100, 200, 500, 1000]}
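
To put the search budget in perspective, the snippet below counts how many parameter combinations this grid contains and what share of them a randomized search samples. It is only arithmetic on the grid above; `n_iter = 20` and `cv = 3` are the values passed to `RandomizedSearchCV` inside `run_experiment` later in the notebook.

```python
from math import prod

# Same grid as the cell above, copied so the snippet is self-contained.
random_grid = {
    "n_estimators": [100, 200, 500, 1000],
    "max_features": ["auto", "sqrt", 100, 300, 1000],
    "max_depth": [10, 30, 50, 70, 100, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Size of the full grid: product of the number of options per parameter.
n_combinations = prod(len(values) for values in random_grid.values())
n_iter = 20  # candidates sampled by the randomized search
cv = 3       # cross-validation folds per candidate

print("combinations in the grid:", n_combinations)
print("fraction sampled:", round(n_iter / n_combinations, 4))
print("total fits:", n_iter * cv)
```

The search therefore explores under 1% of the grid, which is worth keeping in mind when reading its "best" parameters.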


In [2]:
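def run_experiment(X_data, y_data, 
                   mode = "tfidf", classifier = "rf", 
                   tf_idf_features = None, max_features = 10000, 
                   count_vect_features = None,
                   random_search = False, param_grid = None, params = None,
                   verbose = False, get_exec_time = False, n_jobs = 35):
    """Vectorize the text (or reuse precomputed features), make a stratified
    80/20 split, fit a Random Forest and return (model, test accuracy).

    mode: "tfidf" or "count".
    tf_idf_features / count_vect_features: precomputed feature matrix to reuse
        instead of fitting a new vectorizer on X_data.
    random_search: if True, wrap the forest in a RandomizedSearchCV over param_grid.
    params: keyword arguments for RandomForestClassifier (used when random_search is False).
    """
    t0 = time.time()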
    if mode == "tfidf":
        
        if tf_idf_features is None:
            
            tf_vectorizer = TfidfVectorizer(analyzer = "word", tokenizer = None, norm = 'l1',
                                        preprocessor = None, max_features = max_features, sublinear_tf = False)
            if verbose:
                print("extracting tf-idf features...")
            X_data = tf_vectorizer.fit_transform(X_data)
            
        else:
        
            if verbose:
                print("using previously extracted tf-idf features")
            X_data = tf_idf_features
        if verbose:
            print("tf-idf features shape:", end="\t")
            print(X_data.shape)

    elif mode == "count":
        
        if count_vect_features is None:
            if verbose:
                print("extracting count-resulting features...")

            count_vect = CountVectorizer(max_features = max_features, analyzer = 'word', ngram_range = (1,1))
            X_data = count_vect.fit_transform(X_data)
        else:
            if verbose:
                print("using previously extracted count-vect features")
            X_data = count_vect_features
        
        if verbose:
            print("count-vect features shape:", end="\t")
            print(X_data.shape)

    else:
        raise NotImplementedError
        
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size = 0.2, random_state = 42, stratify = y_data)
    
    if classifier == "rf":
        if verbose:
            print("Classifying with Random Forest...")
        if random_search:
            if verbose:
                print("Init RandomizedSearchCV...")
            rf = RandomForestClassifier(n_jobs=22)
            model = RandomizedSearchCV(estimator = rf, param_distributions = param_grid, n_iter = 20, cv = 3, verbose=10, random_state=42, n_jobs = 2)
        else:
            if params is not None:
                print("Using best params previously found")
                model = RandomForestClassifier(**params, n_jobs=n_jobs, verbose = 1)
            else:
                model = RandomForestClassifier(n_jobs=n_jobs, verbose = 1) 
    
        
    else:
        raise NotImplementedError
    if verbose:
        print(model)
        print("Starting to fit")
    model = model.fit(X_train, y_train)
    if verbose:
        print("Scoring")
    score = model.score(X_test, y_test)  # mean accuracy on the held-out 20% split
    if verbose:
        print("Test score:", end="\t")
        print(score)
    if get_exec_time:
        print(f"Experiment run in {round(time.time()-t0, 2)}s")
    return model, score
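
Note that `run_experiment` vectorizes the full dataset before calling `train_test_split`, so the vocabulary and IDF weights are computed with the test documents included. A common alternative is to split the raw text first and fit the vectorizer on the training part only, for instance with a `Pipeline`. The sketch below uses a made-up toy corpus and small hyperparameters chosen for illustration. It is not the notebook's configuration, and no claim is made about how it would change the reported scores.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy corpus (made up for illustration): 10 "music" and 10 "tech" descriptions.
texts = ([f"sing choir concert song {i}" for i in range(10)]
         + [f"python code meetup laptop {i}" for i in range(10)])
labels = ["music"] * 10 + ["tech"] * 10

# Split the raw text first, so test documents never reach the vectorizer's fit.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", norm="l1", max_features=50)),
    ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipeline.fit(X_train, y_train)           # vocabulary and IDF come from X_train only
score = pipeline.score(X_test, y_test)   # mean accuracy on the held-out texts
print("test accuracy:", score)
```

A pipeline like this can also be passed directly to `RandomizedSearchCV`, in which case the vectorizer is refit inside every cross-validation fold.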

# Testing TF-IDF feature extraction on the lemmatized description

In [51]:
#using 10000 tf-idf features
tf_vectorizer = TfidfVectorizer(analyzer = "word", tokenizer = None, norm = 'l1',
                                        preprocessor = None, max_features = 10000, sublinear_tf = False)
tf_idf_desc_lemm_10000_feat = tf_vectorizer.fit_transform(df.desc_lemm)
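
As a reminder of what the vectorizer computes, here is a small pure-Python sketch of TF-IDF with L1 normalization. It follows the smoothed IDF formula documented for scikit-learn's default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The toy documents and the `tfidf_l1` helper name are assumptions made for illustration, and tokenization is plain whitespace splitting rather than scikit-learn's token pattern.

```python
import math
from collections import Counter

def tfidf_l1(docs):
    """Return (vocabulary, rows) with one L1-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts (sublinear_tf = False)
        weights = [tf[t] * idf[t] for t in vocab]
        total = sum(weights)  # L1 norm, since all weights are non-negative
        rows.append([w / total if total else 0.0 for w in weights])
    return vocab, rows

vocab, rows = tfidf_l1(["free class free", "free game", "learn game"])
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

With `norm = 'l1'` every non-empty row sums to 1, whereas `norm = 'l2'` scales each row to unit Euclidean length.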


## Random Search on Random Forest Params

In [74]:
%%time
model, score = run_experiment(df.desc_lemm, df.category, 
               tf_idf_features=tf_idf_desc_lemm_10000_feat, 
               param_grid = random_grid,
               verbose = True, random_search=True)

using previously extracted tf-idf features
tf-idf features shape:	(120809, 10000)
Classifying with Random Forest...
Init RandomizedSearchCV...
Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:   41.7s
[Parallel(n_jobs=2)]: Done   4 tasks      | elapsed:  1.3min
[Parallel(n_jobs=2)]: Done   9 tasks      | elapsed:  2.1min
[Parallel(n_jobs=2)]: Done  14 tasks      | elapsed:  4.6min
[Parallel(n_jobs=2)]: Done  21 tasks      | elapsed:  8.3min
[Parallel(n_jobs=2)]: Done  28 tasks      | elapsed: 43.3min
[Parallel(n_jobs=2)]: Done  37 tasks      | elapsed: 53.0min
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 75.4min
[Parallel(n_jobs=2)]: Done  57 tasks      | elapsed: 80.6min
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed: 83.8min finished


Test score:	0.6710537207184836
CPU times: user 14h 49min 6s, sys: 1min 2s, total: 14h 50min 9s
Wall time: 2h 4min 37s


In [83]:
model.best_score_
best_params = model.best_params_

best_params

{'n_estimators': 1000,
 'min_samples_split': 10,
 'min_samples_leaf': 1,
 'max_features': 1000,
 'max_depth': None,
 'bootstrap': False}

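The tuned model reaches a test score of 0.671, while the default parameters on the same 10000 TF-IDF features reach 0.676 (run further below). Performance is therefore similar, and the defaults are kept for simplicity.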

In [82]:
#using 1000 tf-idf features
tf_vectorizer = TfidfVectorizer(analyzer = "word", tokenizer = None, norm = 'l1',
                                        preprocessor = None, max_features = 1000, sublinear_tf = False)
tf_idf_desc_lemm_1000_feat = tf_vectorizer.fit_transform(df.desc_lemm)


In [94]:
%%time
run_experiment(df.desc_lemm, df.category, 
               tf_idf_features=tf_idf_desc_lemm_1000_feat, 
               verbose = True, random_search=False, params = None)

using previously extracted tf-idf features
tf-idf features shape:	(120809, 1000)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=40, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.
[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:  3.6min finished


Scoring


[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.
[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:    2.0s finished


Test score:	0.6583478188891648
CPU times: user 2h 18min 22s, sys: 8.56 s, total: 2h 18min 31s
Wall time: 3min 42s


(RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=40, oob_score=False, random_state=None, verbose=1,
                        warm_start=False), 0.6583478188891648)

In [95]:
#using 300 tf-idf features
tf_vectorizer = TfidfVectorizer(analyzer = "word", tokenizer = None, norm = 'l1',
                                        preprocessor = None, max_features = 300, sublinear_tf = False)
tf_idf_desc_lemm_300_feat = tf_vectorizer.fit_transform(df.desc_lemm)


In [96]:
%%time
run_experiment(df.desc_lemm, df.category, 
               tf_idf_features=tf_idf_desc_lemm_300_feat, 
               verbose = True, random_search=False, params = None)

using previously extracted tf-idf features
tf-idf features shape:	(120809, 300)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=40, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.
[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:  1.1min finished
[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.


Scoring
Test score:	0.6099246751096764
CPU times: user 39min 3s, sys: 6.67 s, total: 39min 10s
Wall time: 1min 6s


[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:    0.7s finished


(RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=40, oob_score=False, random_state=None, verbose=1,
                        warm_start=False), 0.6099246751096764)

In [97]:
%%time
run_experiment(df.desc_lemm, df.category, 
               tf_idf_features=tf_idf_desc_lemm_10000_feat, 
               verbose = True, random_search=False, params = None)

using previously extracted tf-idf features
tf-idf features shape:	(120809, 10000)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=40, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.
[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:  1.4min finished
[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.


Scoring
Test score:	0.6759788097011836
CPU times: user 36min 53s, sys: 7.86 s, total: 37min
Wall time: 1min 23s


[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:    0.3s finished


(RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=40, oob_score=False, random_state=None, verbose=1,
                        warm_start=False), 0.6759788097011836)

In [None]:
#using 20000 tf-idf features
tf_vectorizer = TfidfVectorizer(analyzer = "word", tokenizer = None, norm = 'l1',
                                        preprocessor = None, max_features = 20000, sublinear_tf = False)
tf_idf_desc_lemm_20000_feat = tf_vectorizer.fit_transform(df.desc_lemm)

In [101]:
%%time
run_experiment(df.desc_lemm, df.category, 
               tf_idf_features=tf_idf_desc_lemm_20000_feat, 
               verbose = True, random_search=False, params = None)

using previously extracted tf-idf features
tf-idf features shape:	(120809, 20000)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=40, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.
[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:  1.5min finished
[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.


Scoring
Test score:	0.6739508318847778
CPU times: user 41min 50s, sys: 7.41 s, total: 41min 57s
Wall time: 1min 30s


[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:    0.3s finished


(RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=40, oob_score=False, random_state=None, verbose=1,
                        warm_start=False), 0.6739508318847778)

In [102]:
#using 5000 tf-idf features
tf_vectorizer = TfidfVectorizer(analyzer = "word", tokenizer = None, norm = 'l1',
                                        preprocessor = None, max_features = 5000, sublinear_tf = False)
tf_idf_desc_lemm_5000_feat = tf_vectorizer.fit_transform(df.desc_lemm)

In [103]:
%%time
run_experiment(df.desc_lemm, df.category, 
               tf_idf_features=tf_idf_desc_lemm_5000_feat, 
               verbose = True, random_search=False, params = None)

using previously extracted tf-idf features
tf-idf features shape:	(120809, 5000)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=40, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.
[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:  1.2min finished
[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.


Scoring
Test score:	0.671674530254118
CPU times: user 32min 55s, sys: 6.39 s, total: 33min 1s
Wall time: 1min 10s


[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:    0.3s finished


(RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=40, oob_score=False, random_state=None, verbose=1,
                        warm_start=False), 0.671674530254118)

In [104]:
#using 2000 tf-idf features
tf_vectorizer = TfidfVectorizer(analyzer = "word", tokenizer = None, norm = 'l1',
                                        preprocessor = None, max_features = 2000, sublinear_tf = False)
tf_idf_desc_lemm_2000_feat = tf_vectorizer.fit_transform(df.desc_lemm)

In [105]:
%%time
run_experiment(df.desc_lemm, df.category, 
               tf_idf_features=tf_idf_desc_lemm_2000_feat, 
               verbose = True, random_search=False, params = None)

using previously extracted tf-idf features
tf-idf features shape:	(120809, 2000)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=40, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.
[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:   60.0s finished
[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.


Scoring
Test score:	0.6683635460640676
CPU times: user 28min 7s, sys: 7.09 s, total: 28min 14s
Wall time: 1min 1s


[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:    0.3s finished


(RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=40, oob_score=False, random_state=None, verbose=1,
                        warm_start=False), 0.6683635460640676)

In [117]:
%%time

#using 2000 tf-idf features, this time with l2 norm and sublinear tf scaling
tf_vectorizer = TfidfVectorizer(analyzer = "word", tokenizer = None, norm = 'l2',
                                        preprocessor = None, max_features = 2000, sublinear_tf = True)
tf_idf_desc_lemm_2000_feat = tf_vectorizer.fit_transform(df.desc_lemm)

run_experiment(df.desc_lemm, df.category, 
               tf_idf_features=tf_idf_desc_lemm_2000_feat, 
               verbose = True, random_search=False, params = None)

using previously extracted tf-idf features
tf-idf features shape:	(120809, 2000)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=40, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.
[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:  1.0min finished
[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.


Scoring
Test score:	0.6701018127638441
CPU times: user 28min 22s, sys: 7.08 s, total: 28min 29s
Wall time: 1min 17s


[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:    0.3s finished
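
To make the comparison across feature counts easier to read, the snippet below collects the test scores printed by the l1-normalized TF-IDF runs above and shows the change from one setting to the next. The numbers are copied from the outputs; only the tabulation is new. These are single runs with `random_state=None` on the forest, so small differences should be read with caution.

```python
# Test accuracies copied from the TF-IDF runs above (norm='l1', default Random Forest).
tfidf_scores = {
    300: 0.6099246751096764,
    1000: 0.6583478188891648,
    2000: 0.6683635460640676,
    5000: 0.671674530254118,
    10000: 0.6759788097011836,
    20000: 0.6739508318847778,
}

previous = None
for n_features in sorted(tfidf_scores):
    score = tfidf_scores[n_features]
    # Change in accuracy relative to the next-smaller feature count.
    gain = "" if previous is None else f"  ({score - previous:+.4f} vs previous)"
    print(f"max_features={n_features:>5}: {score:.4f}{gain}")
    previous = score

best = max(tfidf_scores, key=tfidf_scores.get)
print("best max_features:", best)
```

Going from 5000 to 10000 features changes the score by about +0.004 and from 10000 to 20000 by about -0.002, which is consistent with the TL;DR remark that results barely move above 5000 features.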


# Testing Count Vectorized version

In [None]:
%%time
#testing 10000 features
count_vect = CountVectorizer(max_features = 10000, analyzer = 'word', ngram_range = (1,1))
count_vect_10000_features = count_vect.fit_transform(df.desc_lemm)

In [115]:
run_experiment(df.desc_lemm, df.category, 
               mode = "count", count_vect_features = count_vect_10000_features, 
               verbose = True, random_search=False, params = None)

using previously extracted count-vect features
count-vect features shape:	(120809, 10000)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=40, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.
[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:  2.3min finished
[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.


Scoring
Test score:	0.6763099081201888


[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:    0.5s finished


(RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=40, oob_score=False, random_state=None, verbose=1,
                        warm_start=False), 0.6763099081201888)

In [116]:
%%time

count_vect = CountVectorizer(max_features = 1000, analyzer = 'word', ngram_range = (1,1))
count_vect_1000_features = count_vect.fit_transform(df.desc_lemm)

run_experiment(df.desc_lemm, df.category, 
               mode = "count", count_vect_features = count_vect_1000_features, 
               verbose = True, random_search=False, params = None)

using previously extracted count-vect features
count-vect features shape:	(120809, 1000)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=40, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.
[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:   50.3s finished
[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.


Scoring
Test score:	0.6602516347984438
CPU times: user 25min 13s, sys: 6.89 s, total: 25min 20s
Wall time: 1min 7s


[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:    0.3s finished


In [120]:
%%time

count_vect = CountVectorizer(max_features = 300, analyzer = 'word', ngram_range = (1,1))
count_vect_300_features = count_vect.fit_transform(df.desc_lemm)

run_experiment(df.desc_lemm, df.category, 
               mode = "count", count_vect_features = count_vect_300_features, 
               verbose = True, random_search=False, params = None)

using previously extracted count-vect features
count-vect features shape:	(120809, 300)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=40, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.
[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:   44.0s finished
[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.


Scoring


[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:    0.3s finished


Test score:	0.6141875672543664
CPU times: user 43min 54s, sys: 7.73 s, total: 44min 1s
Wall time: 1min 26s
Compiler : 113 ms


In [121]:
%%time

count_vect = CountVectorizer(max_features = 5000, analyzer = 'word', ngram_range = (1,1))
count_vect_5000_features = count_vect.fit_transform(df.desc_lemm)

run_experiment(df.desc_lemm, df.category, 
               mode = "count", count_vect_features = count_vect_5000_features, 
               verbose = True, random_search=False, params = None)

using previously extracted count-vect features
count-vect features shape:	(120809, 5000)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=40, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.
[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:  1.4min finished
[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.


Scoring


[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:    0.3s finished


Test score:	0.6756477112821786
CPU times: user 37min 7s, sys: 12.7 s, total: 37min 19s
Wall time: 1min 47s


In [122]:
%%time

count_vect = CountVectorizer(max_features = 2000, analyzer = 'word', ngram_range = (1,1))
count_vect_2000_features = count_vect.fit_transform(df.desc_lemm)

run_experiment(df.desc_lemm, df.category, 
               mode = "count", count_vect_features = count_vect_2000_features, 
               verbose = True, random_search=False, params = None)

using previously extracted count-vect features
count-vect features shape:	(120809, 2000)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=40, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.
[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:  1.0min finished
[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.


Scoring
Test score:	0.6699362635543415
CPU times: user 28min 18s, sys: 4.81 s, total: 28min 22s
Wall time: 1min 18s


[Parallel(n_jobs=40)]: Done 100 out of 100 | elapsed:    0.3s finished
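
The four word-count runs above are easier to compare side by side. The sketch below is only a convenience summary, not part of the original experiments: the scores are copied from the "Test score" lines above, and the names `count_scores` and `gain_per_step` are hypothetical.

```python
# Test scores copied from the word-count runs above (desc_lemm, Random Forest, 100 trees).
count_scores = {
    300: 0.6141875672543664,
    1000: 0.6602516347984438,
    2000: 0.6699362635543415,
    5000: 0.6756477112821786,
}


def gain_per_step(scores):
    """Return (max_features, score, gain over the previous setting), ascending."""
    rows = []
    previous = None
    for n_features in sorted(scores):
        score = scores[n_features]
        gain = None if previous is None else score - previous
        rows.append((n_features, score, gain))
        previous = score
    return rows


# Print one line per setting: max_features, test score, gain over the previous row.
for n_features, score, gain in gain_per_step(count_scores):
    gain_text = "" if gain is None else f"{gain:+.4f}"
    print(f"{n_features:>5}  {score:.4f}  {gain_text}")
```

Going from 300 to 1000 features adds about 0.046 to the test score, while each later step adds less than 0.01, which is in line with the diminishing returns noted in the TL;DR. Each score comes from a single run with `random_state=None`, so the smaller differences may partly be noise.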


## Comparison on plain descriptions (HTML, links and emoji removed)

In [3]:
df_plain = pd.read_pickle("../data/backup/df_no_html_no_link_no_emoji_lang.pckl")

In [4]:
run_experiment(df_plain.desc_no_html_no_link_no_emoji, df_plain.category, 
               mode = "count", max_features=5000, 
               verbose = True, random_search=False, params = None)

extracting count-resulting features...
count-vect features shape:	(134125, 5000)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators='warn',
                       n_jobs=35, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=35)]: Using backend ThreadingBackend with 35 concurrent workers.
[Parallel(n_jobs=35)]: Done   7 out of  10 | elapsed:   18.7s remaining:    8.0s
[Parallel(n_jobs=35)]: Done  10 out of  10 | elapsed:   19.0s finished
[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   2 out of  10 | elapsed:    0.1s remaining:    0.3s


Scoring
Test score:	0.6003727865796832


[Parallel(n_jobs=10)]: Done  10 out of  10 | elapsed:    0.1s finished


(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                        max_depth=None, max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=35,
                        oob_score=False, random_state=None, verbose=1,
                        warm_start=False), 0.6003727865796832)

In [5]:
run_experiment(df_plain.desc_no_html_no_link_no_emoji, df_plain.category, 
               mode = "tfidf", max_features=5000, 
               verbose = True, random_search=False, params = None)

extracting tf-idf features...
tf-idf features shape:	(134125, 5000)
Classifying with Random Forest...
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators='warn',
                       n_jobs=35, oob_score=False, random_state=None, verbose=1,
                       warm_start=False)
Starting to fit


[Parallel(n_jobs=35)]: Using backend ThreadingBackend with 35 concurrent workers.
[Parallel(n_jobs=35)]: Done   7 out of  10 | elapsed:   18.3s remaining:    7.9s
[Parallel(n_jobs=35)]: Done  10 out of  10 | elapsed:   18.6s finished


Scoring
Test score:	0.6012301957129543


[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   2 out of  10 | elapsed:    0.1s remaining:    0.2s
[Parallel(n_jobs=10)]: Done  10 out of  10 | elapsed:    0.1s finished


(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                        max_depth=None, max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=35,
                        oob_score=False, random_state=None, verbose=1,
                        warm_start=False), 0.6012301957129543)
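
Before reading these two scores against the lemmatized runs, it helps to check whether the setups match. The sketch below copies the settings and scores from the outputs above (all with `max_features=5000`); the helper names `same_setup` and `compare` are hypothetical and only serve this illustration.

```python
# Settings and scores copied from the cell outputs above; all runs use max_features=5000.
# "rows" is the first dimension of the printed feature-matrix shape,
# "trees" is the n_estimators value of the fitted forest.
runs = [
    {"text": "plain", "mode": "count", "rows": 134125, "trees": 10,
     "score": 0.6003727865796832},
    {"text": "plain", "mode": "tfidf", "rows": 134125, "trees": 10,
     "score": 0.6012301957129543},
    {"text": "desc_lemm", "mode": "count", "rows": 120809, "trees": 100,
     "score": 0.6756477112821786},
]


def same_setup(a, b):
    """True when two runs share the dataset size and the number of trees."""
    return a["rows"] == b["rows"] and a["trees"] == b["trees"]


def compare(a, b):
    """Return the score difference (b minus a) and whether the setups match."""
    return b["score"] - a["score"], same_setup(a, b)


plain_count, plain_tfidf, lemm_count = runs
for a, b in [(plain_count, plain_tfidf), (plain_count, lemm_count)]:
    diff, matched = compare(a, b)
    label = "like-for-like" if matched else "different rows/trees"
    print(f"{a['text']}/{a['mode']} -> {b['text']}/{b['mode']}: {diff:+.4f} ({label})")
```

On plain text, tf-idf and word-count differ by less than 0.001. The gap of roughly 0.075 to the lemmatized word-count run mixes three changes at once (preprocessing, 10 versus 100 trees, and a different number of rows), so it cannot be attributed to lemmatization alone. Rerunning the plain-text cells with `n_estimators=100` would be one way to isolate the effect of preprocessing.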