# Building an NLP Pipeline Class

For today's pair problem, you're going to build a class that vectorizes an arbitrary list of documents. The goal is to build something that takes in a bunch of raw text and returns the cleaned, vectorized text as a matrix. I'll get you started with a template.

In [1]:
class NLPPipe:
    
    def __init__(self, vectorizer, cleaning_function, tokenizer, stemmer):
        self.vectorizer = vectorizer
        self.cleaning_function = cleaning_function
        self.tokenizer = tokenizer
        self.stemmer = stemmer
    
    def fit(self, text):
        pass
    
    def transform(self, text):
        pass

## Passing Functions
As a quick note, if you want to pass a function into a class you can do so like this:

In [None]:
def print_the_word_bob_three_times():
    for i in range(3):
        print('bob')
        
class this_is_an_example:
    
    def __init__(self, function_input):
        self.function_to_run = function_input
        
    def do_the_thing(self):
        self.function_to_run()

In [None]:
example = this_is_an_example(print_the_word_bob_three_times)

Note that when we pass the function in above, we **do not invoke it with parentheses**! We hand over the function itself, and the class calls it later.

In [None]:
example.do_the_thing()
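To see why the parentheses matter, here is a small sketch contrasting the two cases. The names `say_hello` and `Runner` are made up for this illustration and are not part of the exercise.

```python
def say_hello():
    print('hello')
    return 'done'


class Runner:
    """Hypothetical helper class, mirroring the example above."""

    def __init__(self, function_input):
        self.function_to_run = function_input


# Passing the function object: nothing runs yet, the class can call it later.
good = Runner(say_hello)

# Invoking with parentheses: say_hello runs immediately, and the class
# only receives its return value (the string 'done'), not the function.
bad = Runner(say_hello())

print(callable(good.function_to_run))  # True
print(callable(bad.function_to_run))   # False
```

In the second case, trying to call `bad.function_to_run()` would raise a `TypeError`, because a string is not callable.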

## Order of Operations

Both the `.fit` and `.transform` methods take in *raw* `text` (a *list* of text documents), clean it, and then vectorize it. So, in your `cleaning_function`, you should:

1. Loop through each document in `text`, and for each one:
2. Pick out the individual words using your `tokenizer`
3. Capture only the "meaningful" portion (the stem) of each of these words using your `stemmer`
4. Join the clean words (stemmed tokens) back together into a single document
5. Finally, output all the text as another list of (clean) documents, to give to the `vectorizer`
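The steps above can be sketched roughly as follows. This is one possible shape, not the required one: the signature `(text, tokenizer, stemmer)` is an assumption, `ToyTokenizer` and `ToyStemmer` are hypothetical stand-ins that only mimic the `.tokenize()` and `.stem()` methods of NLTK's tokenizers and stemmers, and the lowercasing and punctuation stripping is an extra cleaning step I've added for illustration.

```python
import string


class ToyTokenizer:
    """Hypothetical stand-in exposing .tokenize(), like NLTK tokenizers do."""

    def tokenize(self, document):
        return document.split()


class ToyStemmer:
    """Hypothetical stand-in exposing .stem(), like NLTK stemmers do."""

    def stem(self, word):
        # Crude suffix stripping, only for demonstration purposes.
        for suffix in ('ing', 's'):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word


def simple_cleaning_function(text, tokenizer, stemmer):
    cleaned_text = []
    for document in text:                                   # step 1
        # Assumed extra step: lowercase and strip punctuation.
        document = document.lower().translate(
            str.maketrans('', '', string.punctuation))
        tokens = tokenizer.tokenize(document)               # step 2
        stems = [stemmer.stem(token) for token in tokens]   # step 3
        cleaned_text.append(' '.join(stems))                # step 4
    return cleaned_text                                     # step 5


docs = ['The cats are running!', 'Dogs bark.']
print(simple_cleaning_function(docs, ToyTokenizer(), ToyStemmer()))
# ['the cat are runn', 'dog bark']
```

Because the stand-ins use the same method names as the NLTK objects, swapping in `TreebankWordTokenizer()` and `PorterStemmer()` should not require changing the function itself.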

Both `.fit` and `.transform` run the `cleaning_function` on the given `text` first, and then fit or transform (respectively) the class's `vectorizer` using the cleaned result.
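As a rough illustration of that flow, here is a minimal sketch of how the two methods might be wired together. Everything here is an assumption made for the example: the class is named `NLPPipeSketch` to keep it separate from your own `NLPPipe`, the cleaning function is assumed to take `(text, tokenizer, stemmer)`, and `ToyCountVectorizer`, `ToyTokenizer`, and `IdentityStemmer` are hypothetical stand-ins so the block runs without scikit-learn or NLTK.

```python
class ToyCountVectorizer:
    """Hypothetical stand-in with a fit/transform interface."""

    def fit(self, documents):
        # Learn a sorted vocabulary from the (already cleaned) documents.
        self.vocabulary = sorted(
            {word for doc in documents for word in doc.split()})
        return self

    def transform(self, documents):
        # One row of word counts per document, one column per vocabulary word.
        return [[doc.split().count(word) for word in self.vocabulary]
                for doc in documents]


class ToyTokenizer:
    def tokenize(self, document):
        return document.split()


class IdentityStemmer:
    def stem(self, word):
        return word


def lowercase_cleaner(text, tokenizer, stemmer):
    # Assumed signature: (text, tokenizer, stemmer) -> list of clean documents.
    return [' '.join(stemmer.stem(t) for t in tokenizer.tokenize(doc.lower()))
            for doc in text]


class NLPPipeSketch:

    def __init__(self, vectorizer, cleaning_function, tokenizer, stemmer):
        self.vectorizer = vectorizer
        self.cleaning_function = cleaning_function
        self.tokenizer = tokenizer
        self.stemmer = stemmer

    def fit(self, text):
        # Clean first, then let the vectorizer learn from the clean text.
        clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
        self.vectorizer.fit(clean_text)
        return self

    def transform(self, text):
        # Apply the same cleaning before vectorizing new documents.
        clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
        return self.vectorizer.transform(clean_text)


pipe = NLPPipeSketch(ToyCountVectorizer(), lowercase_cleaner,
                     ToyTokenizer(), IdentityStemmer())
pipe.fit(['Red fish', 'Blue fish'])
result = pipe.transform(['red red fish', 'green fish'])
print(result)
# [[0, 1, 2], [0, 1, 0]]
```

Note that `green` never appeared during `.fit`, so it has no column in the output. Running the same cleaning in both methods is what keeps the training and test documents comparable.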

## What We Want

So what I want is the ability to do something like:

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import PorterStemmer

nlp = NLPPipe(CountVectorizer(), simple_cleaning_function_i_made, TreebankWordTokenizer(), PorterStemmer())
nlp.fit(train_corpus)
nlp.transform(test_corpus)
```
This should return the test corpus in its vectorized form.

# Solution!

In [13]:
import re
import string

In [14]:
import nltk
nltk.download()

AttributeError: module 'regex' has no attribute 'compile'