Lambda School Data Science

*Unit 4, Sprint 1, Module 3*

---

# Document Classification (Prepare)

Today's guided module project will be different. You already know how to do classification. You already know how to extract features from documents. That means you're ready to combine and practice those skills in a Kaggle competition. We will open with a five-minute sprint explaining the competition, and then give you 25 minutes to work. After those 25 minutes are up, I will give a 5-minute demo of an NLP technique that will help you with document classification (*and **maybe** the competition*).

Today's all about having fun and practicing your skills. The competition will begin

## Learning Objectives
* <a href="#p0">Part 0</a>: Kaggle Competition
* <a href="#p1">Part 1</a>: Text Feature Extraction & Classification Pipelines
* <a href="#p2">Part 2</a>: Latent Semantic Indexing
* <a href="#p3">Part 3</a>: Word Embeddings with Spacy

---
---

# Text Feature Extraction & Classification Pipelines (Learn)
<a id="p1"></a>

## Overview

Sklearn pipelines allow you to stitch together multiple components of a machine learning process. The idea is that you can pass in your raw data and get predictions out of the pipeline. This ability to pass raw input and receive a prediction from a single object makes pipelines well suited for production, because you can pickle a pipeline without worrying about separate data preprocessing steps.
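
To make the pickling point concrete, here is a minimal sketch. The toy corpus, labels, and the name `toy_pipe` are invented for illustration and are not part of the module's data. It fits a small pipeline, serializes it, and shows that the restored object still accepts raw strings.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented purely for illustration
docs = [
    "the whisky has notes of caramel and oak",
    "smoky peat and sea salt on the finish",
    "the match ended with a late goal",
    "the striker scored twice in the final",
]
labels = [0, 0, 1, 1]

toy_pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
toy_pipe.fit(docs, labels)

# The fitted vocabulary and the fitted model are serialized together
blob = pickle.dumps(toy_pipe)
restored = pickle.loads(blob)

# The restored pipeline takes raw text; no separate preprocessing step is needed
new_docs = ["a peaty whisky with oak", "a goal in the final"]
print(restored.predict(new_docs))
```

In practice you would write the bytes to a file (for example with `pickle.dump`) and load them in the serving process, using the same library versions on both sides.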

In [1]:
# Import Statements
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Dataset
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism',
              'talk.religion.misc']

data = fetch_20newsgroups(subset='train', categories=categories)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [3]:
# Create Pipeline Components

vect = TfidfVectorizer(stop_words='english')
rfc = RandomForestClassifier()

In [4]:
# Define the Pipeline
pipe = Pipeline([
                 #Vectorizer
                 ('vect', vect), 
                 # Classifier
                 ('clf', rfc)
                ])

In [5]:
parameters = {
    'vect__max_df': ( 0.75, 1.0),
    'vect__min_df': (.02, .05),
    'vect__max_features': (500,1000),
    'clf__n_estimators':(5, 10,),
    'clf__max_depth':(15,20)
}

grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(data.data, data.target)

Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   21.1s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:  1.4min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'vect__max_df': (0.75, 1.0), 'vect__min_df': (0.02, 0.05), 'vect__max_features': (500, 1000), 'clf__n_estimators': (5, 10), 'clf__max_depth': (15, 20)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
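
The fit count in the log above follows from the size of the grid: five hyperparameters with two values each give 2^5 = 32 candidates, and each candidate is fit once per fold. A short sketch of that arithmetic, using scikit-learn's `ParameterGrid` to expand the same dictionary (the variable names here are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the grid search cell above
grid = {
    "vect__max_df": (0.75, 1.0),
    "vect__min_df": (0.02, 0.05),
    "vect__max_features": (500, 1000),
    "clf__n_estimators": (5, 10),
    "clf__max_depth": (15, 20),
}

n_candidates = len(ParameterGrid(grid))  # 2 * 2 * 2 * 2 * 2
n_folds = 5                              # cv=5
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)  # 32 160
```

Adding one more value to any single hyperparameter raises the candidate count by half again (48 candidates, 240 fits), which is why exhaustive grids get expensive quickly.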

In [29]:
data.target

array([0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0,
       0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,

In [21]:
type(data.data)

list

In [6]:
grid_search.best_score_

0.8833138856476079

In [7]:
grid_search.predict(['Send me lots of money now', 'you won the lottery in Nigeria'])

array([1, 1])

---

## Follow Along 

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model (try using the pipe method I just demoed)

### Load Competition Data

In [8]:
import pandas as pd
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

### Define Pipeline Components

In [9]:
# === Instantiate pipeline components === #

vect = TfidfVectorizer(stop_words='english')
clf = RandomForestClassifier()

pipe = Pipeline([('vect', vect), ('clf', clf)])

### Define Your Search Space
You're looking for both the best hyperparameters of your vectorizer and your classification model. 

In [11]:
# Look at the data
train.head()

| | id | description | category |
|---|---|---|---|
| 0 | 1 | A marriage of 13 and 18 year old bourbons. A m... | 2 |
| 1 | 2 | There have been some legendary Bowmores from t... | 1 |
| 2 | 3 | This bottling celebrates master distiller Park... | 2 |
| 3 | 4 | What impresses me most is how this whisky evol... | 1 |
| 4 | 9 | A caramel-laden fruit bouquet, followed by une... | 2 |


In [14]:
# === Arrange data into X and y === #

# y - target (kept as a DataFrame so y_train["category"] works below)
y_train = train[["category"]]
# The test file has no category column, so there is no y_test to build

# X - features
X_train = train[["description"]]
X_test = test[["description"]]

In [35]:
X_train["description"].head()

0    A marriage of 13 and 18 year old bourbons. A m...
1    There have been some legendary Bowmores from t...
2    This bottling celebrates master distiller Park...
3    What impresses me most is how this whisky evol...
4    A caramel-laden fruit bouquet, followed by une...
Name: description, dtype: object

In [37]:
# === Using RandomizedSearchCV === #
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

# Define the parameters of the search
parameters = {
    'vect__max_df': (0.98, 1.0),
    'vect__min_df': (.02, .03),
    'vect__max_features': (500, 800),
    'clf__n_estimators': (8, 12,),
    'clf__max_depth': (16, 24)
}
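
The cell above imports `uniform` and `randint` but passes fixed tuples, so the search can only pick from two values per hyperparameter. As a hedged sketch of the alternative, the dictionary below uses scipy distributions over roughly the same ranges (the exact ranges are an assumption, chosen for illustration). `ParameterSampler` is the scikit-learn utility that draws candidates from such a dictionary, which makes it a cheap way to preview what a randomized search would try.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Illustrative ranges; tune them to your own data
param_distributions = {
    "vect__max_df": uniform(loc=0.75, scale=0.25),  # floats in [0.75, 1.0]
    "vect__max_features": randint(500, 1001),       # integers 500..1000
    "clf__n_estimators": randint(8, 13),            # integers 8..12
    "clf__max_depth": randint(16, 25),              # integers 16..24
}

# n_iter=10 matches the default number of candidates in RandomizedSearchCV
samples = list(ParameterSampler(param_distributions, n_iter=10, random_state=42))

for candidate in samples[:3]:
    print(candidate)
```

The same dictionary can be passed to `RandomizedSearchCV` in place of the tuple-based one; raising `n_iter` then explores more of the space instead of repeating the same few combinations.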

In [39]:
# === Run the search === #
rand_search = RandomizedSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)

rand_search.fit(X_train["description"], y_train["category"])

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   32.2s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   36.3s finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
          fit_params=None, iid='warn', n_iter=10, n_jobs=-1,
          param_distributions={'vect__max_df': (0.98, 1.0), 'vect__min_df': (0.02, 0.03), 'vect__max_features': (500, 800), 'clf__n_estimators': (8, 12), 'clf__max_depth': (16, 24)},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=1)
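
Once a search has been fit, it is worth looking at which combination won before making predictions. The sketch below runs a tiny randomized search on a toy corpus (the documents, labels, and `toy_` names are invented for illustration) and reads back `best_params_`, `best_score_`, and `cv_results_`, which are available on any fitted search object.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration; the real search uses the competition data
toy_docs = [
    "caramel and vanilla with a sweet oak finish",
    "sweet honey, vanilla and toasted oak",
    "vanilla fudge and caramel on the nose",
    "honey and oak with a caramel finish",
    "heavy peat smoke and sea salt",
    "smoke, brine and a peat finish",
    "peat smoke with salt and iodine",
    "brine and iodine under thick smoke",
]
toy_labels = [1, 1, 1, 1, 2, 2, 2, 2]

toy_pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=0)),
])
toy_params = {
    "vect__max_df": (0.9, 1.0),
    "clf__n_estimators": (8, 12),
    "clf__max_depth": (4, 8),
}

toy_search = RandomizedSearchCV(toy_pipe, toy_params, n_iter=4, cv=2, random_state=0)
toy_search.fit(toy_docs, toy_labels)

print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))

# One row per sampled candidate, best rank first
results = pd.DataFrame(toy_search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```

Comparing `mean_test_score` across rows shows whether the winning combination is clearly better or just noise, which helps decide whether a submission is worth one of the day's slots.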

### Make a Submission File
*Note:* You are only allowed two submissions a day. Only submit if you feel you cannot achieve higher test accuracy. 

In [40]:
# Predictions on test sample
pred = rand_search.predict(test['description'])

In [41]:
submission = pd.DataFrame({'id': test['id'], 'category':pred})
submission['category'] = submission['category'].astype('int64')

In [43]:
# Make Sure the Category is an Integer
submission["category"].head()

0    2
1    2
2    1
3    1
4    1
Name: category, dtype: int64

In [44]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
submission.to_csv('./data/submission1.csv', index=False)
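
The comment in the cell above suggests versioning submission files with an integer or a timestamp. A minimal sketch of the timestamp variant follows; the helper name `submission_path` and the filename pattern are assumptions made for this example.

```python
from datetime import datetime


def submission_path(tag, directory="./data"):
    """Return a path like <directory>/submission_<tag>_YYYYMMDD-HHMMSS.csv."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    return f"{directory}/submission_{tag}_{stamp}.csv"


path = submission_path("rscv")
print(path)

# With a submission DataFrame in scope:
# submission.to_csv(path, index=False)
```

Because each run gets a fresh name, earlier submission files are never overwritten, and the tag records which approach produced them.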

```bash
kaggle competitions submit ds8-which-whiskey -f submission1.csv -m "First sub using RSCV"
```
`Successfully submitted to DS8 Which Whiskey`

## Challenge

You're trying to achieve 90% accuracy with your model.

---
---

# Latent Semantic Indexing (Learn)
<a id="p2"></a>

## Overview
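
In this section, LSI means a TF-IDF matrix followed by truncated SVD, which is how the cells below build it. As a rough sketch of what that does to the data, the example here uses a toy corpus (invented for illustration, as are the `toy_` names) and shows the wide, sparse TF-IDF matrix being projected onto a small number of dense components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus invented for illustration
toy_docs = [
    "caramel and vanilla with a sweet oak finish",
    "sweet honey, vanilla and toasted oak",
    "vanilla fudge and caramel on the nose",
    "heavy peat smoke and sea salt",
    "smoke, brine and a peat finish",
    "peat smoke with salt and iodine",
]

# Step 1: one column per vocabulary term, mostly zeros
tfidf = TfidfVectorizer()
X_sparse = tfidf.fit_transform(toy_docs)

# Step 2: project onto a few latent components (2 here; the notebook uses 100)
toy_svd = TruncatedSVD(n_components=2, random_state=0)
X_dense = toy_svd.fit_transform(X_sparse)

print(X_sparse.shape)  # (number of documents, vocabulary size)
print(X_dense.shape)   # (number of documents, 2)
print(toy_svd.explained_variance_ratio_.sum())
```

The classifier then trains on the dense components instead of the raw term columns, so `n_components` becomes one more hyperparameter to tune.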

In [45]:
# Import

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, 
                   algorithm="randomized",
                   n_iter=10)

In [48]:
params = { 
    "lsi__svd__n_components": [10,100,250],
    "lsi__vect__max_df": [.9, .95, 1.0],
    "clf__n_estimators": (5, 10, 20),
}

In [49]:
# LSI
lsi = Pipeline([('vect', vect), ('svd', svd)])

# Pipe
pipe = Pipeline([('lsi', lsi), ('clf', rfc)])

In [50]:
# Fit
grid_search = GridSearchCV(pipe, params, cv=5, n_jobs=4, verbose=1)
grid_search.fit(data.data, data.target)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  2.1min
[Parallel(n_jobs=4)]: Done 135 out of 135 | elapsed:  6.4min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('lsi', Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'lsi__svd__n_components': [10, 100, 250], 'lsi__vect__max_df': [0.9, 0.95, 1.0], 'clf__n_estimators': (5, 10, 20)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [51]:
grid_search.best_score_

0.8926487747957993

## Follow Along
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) to your pipeline. *Note:* You can grid search a nested pipeline, but you have to chain the step names with double underscores, e.g. `lsi__svd__n_components`
4. Make a submission to Kaggle 
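
The double-underscore names mentioned in step 3 can be listed directly rather than guessed. This sketch rebuilds the nested pipeline shape used in this section (with default components, under the illustrative names `inner` and `outer`) and shows both how to discover the names and that `set_params` accepts them.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

inner = Pipeline([("vect", TfidfVectorizer()), ("svd", TruncatedSVD())])
outer = Pipeline([("lsi", inner), ("clf", RandomForestClassifier())])

# Names are built as <step>__<param> or <step>__<substep>__<param>
names = sorted(outer.get_params().keys())
print([n for n in names if n.startswith("lsi__svd__")])

# set_params takes the same names that a search's parameter dictionary uses
outer.set_params(lsi__svd__n_components=50, clf__n_estimators=20)
print(outer.named_steps["lsi"].named_steps["svd"].n_components)  # 50
```

A misspelled key in a search dictionary raises an error at fit time, so checking it against `get_params()` first can save a failed run.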


### Define Pipeline Components

In [56]:
# === Instantiate all the things! === #

# SVD
svd = TruncatedSVD(n_components=100,
                   algorithm="randomized",
                   n_iter=10)

# Vectorizer
vect = TfidfVectorizer(stop_words="english")

# Classifier
clf = RandomForestClassifier()

# LSI
lsi = Pipeline([('vect', vect), ('svd', svd)])

# Full pipeline: LSI followed by the classifier
pipe = Pipeline([('lsi', lsi), ('clf', clf)])

### Define Your Search Space
You're looking for both the best hyperparameters of your vectorizer and your classification model. 

In [61]:
# Define parameters and instantiate search object
parameters = {
    "lsi__svd__n_components": [10, 100, 200],
    "lsi__vect__max_df": (0.75, 1.0),
    "lsi__vect__min_df": (.02, .03),
    "clf__max_depth":(5, 10, 15, 20),
    "clf__n_estimators": [64, 80, 100, 120, 160]
}

rand_search = RandomizedSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)

In [62]:
# Train using random search
rand_search.fit(X_train["description"], y_train["category"])

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  2.1min finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=Pipeline(memory=None,
     steps=[('lsi', Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
          fit_params=None, iid='warn', n_iter=10, n_jobs=-1,
          param_distributions={'lsi__svd__n_components': [10, 100, 200], 'lsi__vect__max_df': (0.75, 1.0), 'lsi__vect__min_df': (0.02, 0.03), 'clf__max_depth': (5, 10, 15, 20), 'clf__n_estimators': [64, 80, 100, 120, 160]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=1)

### Make a Submission File
*Note:* You are only allowed two submissions a day. Only submit if you feel you cannot achieve higher test accuracy. 

In [63]:
# Predictions on test sample
pred = rand_search.predict(test["description"])

In [64]:
submission = pd.DataFrame({'id': test['id'], 'category':pred})
submission['category'] = submission['category'].astype('int64')

In [65]:
# Make Sure the Category is an Integer
submission["category"].head()

0    2
1    2
2    1
3    1
4    1
Name: category, dtype: int64

In [66]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
submission.to_csv('./data/submission2_lsi.csv', index=False)

### Kaggle Submission #2

```bash
kaggle competitions submit ds8-which-whiskey -f submission2_lsi.csv -m "2nd submission using RSCV and LSI"
```

```
╭─ dasci » tobiasfyi » ..ssification/data »  master ● ?
╰─ kaggle competitions submit ds8-which-whiskey -f submission2_lsi.csv -m "2nd submission using RSCV and LSI"
100%|████████████████████████████████████████████████████████| 1.91k/1.91k [00:01<00:00, 1.19kB/s]
Successfully submitted to DS8 Which Whiskey%
```

> Your submission scored 0.88372, which is an improvement of your previous score of 0.82558. Great job!

---

## Thank You, Try Again

### Define Pipeline Components

In [67]:
# === Instantiate all the things! === #

# SVD
svd = TruncatedSVD(n_components=100,
                   algorithm="randomized",
                   n_iter=10)

# Vectorizer
vect = TfidfVectorizer(stop_words="english")

# Classifier
clf = RandomForestClassifier()

# LSI
lsi = Pipeline([('vect', vect), ('svd', svd)])

# Full pipeline: LSI followed by the classifier
pipe = Pipeline([('lsi', lsi), ('clf', clf)])

### Define Your Search Space
You're looking for both the best hyperparameters of your vectorizer and your classification model. 

In [70]:
# Define parameters and instantiate search object
parameters = {
    "lsi__svd__n_components": [10, 64, 80, 100, 160],
    "lsi__vect__max_df": (0.75, 1.0),
    "lsi__vect__min_df": (.02, .03),
    "clf__max_depth": (5, 8, 12, 16, 20, 24),
    "clf__max_features": ["auto", "sqrt", "log2", None],
    "clf__n_estimators": [64, 80, 100, 120, 160]
}

rand_search = RandomizedSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)

In [71]:
# Train using random search
rand_search.fit(X_train["description"], y_train["category"])

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  2.5min finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=Pipeline(memory=None,
     steps=[('lsi', Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
          fit_params=None, iid='warn', n_iter=10, n_jobs=-1,
          param_distributions={'lsi__svd__n_components': [10, 64, 80, 100, 160], 'lsi__vect__max_df': (0.75, 1.0), 'lsi__vect__min_df': (0.02, 0.03), 'clf__max_depth': (5, 8, 12, 16, 20, 24), 'clf__max_features': ['auto', 'sqrt', 'log2', None], 'clf__n_estimators': [64, 80, 100, 120, 160]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=1)

### Make a Submission File
*Note:* You are only allowed two submissions a day. Only submit if you feel you cannot achieve higher test accuracy. 

In [72]:
# Predictions on test sample
pred = rand_search.predict(test["description"])

In [73]:
submission = pd.DataFrame({'id': test['id'], 'category':pred})
submission['category'] = submission['category'].astype('int64')

In [74]:
# Make Sure the Category is an Integer
submission["category"].head()

0    2
1    2
2    1
3    1
4    1
Name: category, dtype: int64

In [75]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
submission.to_csv('./data/submission3_lsi.csv', index=False)

### Kaggle Submission #3

> Your submission scored 0.80232, which is not an improvement of your best score. Keep trying!

## Challenge

Continue to apply Latent Semantic Indexing (LSI) to various datasets. 

---
---

# Word Embeddings with Spacy (Learn)
<a id="p3"></a>

## Overview

In [76]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [77]:
doc = nlp("Two bananas in pyjamas")

In [78]:
bananas_vector = doc.vector
print(len(bananas_vector))

300


In [82]:
def get_word_vectors(docs):
    return [nlp(doc).vector for doc in docs]

In [83]:
X = get_word_vectors(data.data)

len(X) == len(data.data)

True
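
A function like `get_word_vectors` can also be placed inside a pipeline so that raw text goes in and predictions come out, as with the TF-IDF pipelines earlier. The sketch below shows the wiring with `FunctionTransformer`. To stay runnable without a spaCy model, it uses `toy_embed`, a hypothetical stand-in that returns letter frequencies; it is not a real word embedding, and the corpus and labels are invented as well.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def toy_embed(docs):
    """Hypothetical stand-in for get_word_vectors: one 26-dim vector per document."""
    vectors = []
    for doc in docs:
        counts = np.zeros(26)
        for ch in doc.lower():
            if "a" <= ch <= "z":
                counts[ord(ch) - ord("a")] += 1
        # Normalize to frequencies; guard against empty documents
        vectors.append(counts / max(counts.sum(), 1))
    return np.vstack(vectors)


embed_pipe = Pipeline([
    ("embed", FunctionTransformer(toy_embed)),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Toy corpus invented for illustration
toy_docs = [
    "caramel and vanilla with a sweet oak finish",
    "sweet honey, vanilla and toasted oak",
    "heavy peat smoke and sea salt",
    "smoke, brine and a peat finish",
]
toy_labels = [1, 1, 2, 2]

embed_pipe.fit(toy_docs, toy_labels)
print(embed_pipe.predict(["peat and smoke"]))
```

To use real embeddings, swap `toy_embed` for a function that stacks `nlp(doc).vector` for each document, assuming the spaCy model is loaded in the same process.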

In [85]:
rfc.fit(X, data.target)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [86]:
rfc.score(X, data.target)

0.9953325554259043
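
Note that the score above is computed on the same data the forest was fit on, so it says little about how the model handles new documents. A hedged sketch of a held-out check follows. It uses synthetic 300-dimensional features from `make_classification` as a stand-in for the document vectors, so the numbers it prints are not comparable to the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 300-dimensional document vectors
X_toy, y_toy = make_classification(
    n_samples=400, n_features=300, n_informative=20, random_state=0
)

# Hold out a quarter of the rows that the model never sees during fitting
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_tr, y_tr)

train_acc = forest.score(X_tr, y_tr)
val_acc = forest.score(X_val, y_val)
print("train accuracy:", train_acc)
print("validation accuracy:", val_acc)
```

The gap between the two numbers is a rough measure of overfitting; the validation figure is the one to compare against a leaderboard score.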

---

## Gradient Boosted Classifier

> [lightgbm.LGBMClassifier](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier)


In [5]:
# Imports
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

import spacy

# Import the classifier class
from lightgbm import LGBMClassifier

In [2]:
# Load test and train (again)
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

print(train.shape)
print(test.shape)

(2586, 3)
(288, 2)


In [3]:
class LemmaTokenizer(object):
    """Class to find lemmas."""
    
    def __init__(self):
        self.spacynlp = spacy.load("en_core_web_lg")

    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        return [
            token.lemma_.strip()
            for token in nlpdoc
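            if (
                # Keep multi-character or alphanumeric lemmas...
                (len(token.lemma_) > 1 or token.lemma_.isalnum())
                # ...that are not stop words, punctuation, or symbols.
                # Parentheses matter: `and` binds tighter than `or`.
                and not token.is_stop
                and not token.is_punct
                and token.pos_ not in ("PUNCT", "SYM")
            )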
        ]

### Define / instantiate pipeline components

In [6]:
# Linear dimensionality reduction with truncated SVD
# Instantiate the SVD object
svd = TruncatedSVD(n_components=100, algorithm="randomized", n_iter=5)

In [92]:
# Use the class as tokenizer
vect = TfidfVectorizer(tokenizer=LemmaTokenizer(), ngram_range=(1, 2))

# Instantiate the classifiers
rfc = RandomForestClassifier()
lgbm = LGBMClassifier()  # used as the final step of the pipeline below

In [93]:
# LSI
lsi = Pipeline([('vect', vect), ('svd', svd)])

# Pipe
pipe = Pipeline([('lsi', lsi), ('clf', lgbm)])

In [94]:
# Specify the parameters of the randomized search
params = {
    "lsi__svd__n_components": [40, 50, 60],
    "lsi__vect__max_df": [0.8, 0.85, 0.9],
    "lsi__vect__min_df": [0.2, 0.15, 0.1],
    "clf__max_depth": [-1, 5, 6, 8],
    # LGBMClassifier names this parameter min_child_samples
    "clf__min_child_samples": [16, 20, 24, 32, 40],
    "clf__num_leaves": [12, 16, 24, 32, 40],
    "clf__learning_rate": [0.05, 0.1, 0.2],
}

In [97]:
# Fit using randomized search
rand_search = RandomizedSearchCV(pipe, params, cv=5, n_jobs=4, verbose=1)
rand_search.fit(train["description"], train["category"])

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


PicklingError: Could not pickle the task to send it to the workers.

Exception in thread QueueFeederThread:
Traceback (most recent call last):
  File "/Users/Tobias/.vega/nlp-9igaqrSk/lib/python3.7/site-packages/sklearn/externals/joblib/externals/loky/backend/queues.py", line 150, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "/Users/Tobias/.vega/nlp-9igaqrSk/lib/python3.7/site-packages/sklearn/externals/joblib/externals/loky/backend/reduction.py", line 243, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "/Users/Tobias/.vega/nlp-9igaqrSk/lib/python3.7/site-packages/sklearn/externals/joblib/externals/loky/backend/reduction.py", line 236, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "/Users/Tobias/.vega/nlp-9igaqrSk/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py", line 284, in dump
    return Pickler.dump(self, obj)
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 437, in dum

In [51]:
rand_search.best_score_

0.8926487747957993

## Gradient Boosted Classifier, Round 2

> [XGBClassifier](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier)

In [None]:
# ====== Imports ====== #

import pandas as pd
import os

from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

from xgboost import XGBClassifier
import spacy

---
---

## Challenge

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) to your pipeline. *Note:* You can grid search a nested pipeline, but you have to chain the step names with double underscores, e.g. `lsi__svd__n_components`
    - Try to extract word embeddings with Spacy and use those embeddings as your features for a classification model.
4. Make a submission to Kaggle 

# Review

To review this module: 
* Continue working on the Kaggle competition
* Find another text classification task to work on