## Preprocess our text data
We need to convert the text into something useful (i.e. numbers) for the machine learning model. We are going to use the Bag-of-Words (BOW) approach.

Preprocessing can be done in two ways:
- modify the original dataframe TEXT column
- preprocess as part of your pipeline so you don’t edit the original data
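
As a minimal sketch of the second option, the cleaning step can be written as a pure function that returns a new string and never touches the dataframe. The name `clean_note` and the sample notes are hypothetical, not part of the original notebook.

```python
import math

def clean_note(text):
    """Return a cleaned copy of a single note without modifying the input."""
    # Treat None and NaN (a float) as a missing note
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return " "
    # Replace newlines and carriage returns with spaces
    return str(text).replace("\n", " ").replace("\r", " ")

# Hypothetical notes, including two kinds of missing values
raw_notes = ["first line\nsecond line", None, float("nan"), "carriage\rreturn"]
cleaned_notes = [clean_note(note) for note in raw_notes]
print(cleaned_notes)
```

With pandas, something like `df.TEXT.map(clean_note)` should give a cleaned Series while leaving `df` itself unchanged.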

In [None]:
def preprocess_text(df):
    # Fill missing values (NaN) with a space and replace newlines ('\n') and carriage returns ('\r') with spaces.
    # Note: this modifies the TEXT column of the dataframe that is passed in.
    df.TEXT = df.TEXT.fillna(' ')
    df.TEXT = df.TEXT.str.replace('\n',' ')
    df.TEXT = df.TEXT.str.replace('\r',' ')
    return df
# preprocess the text to deal with known issues
df_train = preprocess_text(df_train)
df_valid = preprocess_text(df_valid)
df_test = preprocess_text(df_test)

### Word tokenize

In [None]:
import nltk
from nltk import word_tokenize
word_tokenize('This should be tokenized. 02/02/2018 sentence has stars**')

The default shows that some punctuation is separated and that numbers stay in the sentence. We will write our own tokenizer function to
* replace punctuation with spaces
* replace numbers with spaces
* lower case all words

In [None]:
import string
def tokenizer_better(text):
    # tokenize the text by replacing punctuation and numbers with spaces and lowercase all words
    
    punc_list = string.punctuation+'0123456789'
    t = str.maketrans(dict.fromkeys(punc_list, " "))
    text = text.lower().translate(t)
    tokens = word_tokenize(text)
    return tokens
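
To see what this tokenizer does without downloading any NLTK data, here is a dependency-free sketch. It swaps `word_tokenize` for `str.split`, which is an assumption on my part: once punctuation and digits have become spaces, splitting on whitespace should give very similar tokens, but it is not guaranteed to match NLTK in every case. The name `tokenizer_simple` is hypothetical.

```python
import string

def tokenizer_simple(text):
    # Map every punctuation character and digit to a space
    punc_list = string.punctuation + "0123456789"
    table = str.maketrans(dict.fromkeys(punc_list, " "))
    # Lowercase, apply the mapping, then split on runs of whitespace
    # (str.split stands in for nltk's word_tokenize here)
    return text.lower().translate(table).split()

tokens = tokenizer_simple("This should be tokenized. 02/02/2018 sentence has stars**")
print(tokens)
```

The date and the trailing stars disappear entirely, and only lowercase words remain.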

### Convert free-text into tokens
With a tokenizer in place, we need a way to count the tokens in each discharge summary.

We will use the built-in `CountVectorizer` from the scikit-learn package. This vectorizer simply counts how many times each word occurs in the note.

There is also a `TfidfVectorizer`, which takes into account how often words are used across all notes, but for this project let’s use the simpler one (I got similar results with the second one too).
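
To give a feel for the difference, here is a rough sketch of the idea behind TF-IDF in plain Python. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, which I believe is what scikit-learn uses by default, and it leaves out the row normalization that `TfidfVectorizer` also applies. The toy notes are made up for illustration.

```python
import math
from collections import Counter

notes = ["data science is about the data",
         "the science is amazing",
         "predictive modeling is part of data science"]
tokenized = [note.split() for note in notes]

# Document frequency: in how many notes does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
n_notes = len(notes)

# Smoothed inverse document frequency (assumed formula, see the text above)
idf = {word: math.log((1 + n_notes) / (1 + df)) + 1
       for word, df in doc_freq.items()}

# Weight the raw counts of the first note by idf
counts = Counter(tokenized[0])
tfidf_first = {word: count * idf[word] for word, count in counts.items()}
print(tfidf_first)
```

A word such as "is" that appears in every note gets the smallest weight, while a word that appears in only one note gets the largest.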

In [None]:
sample_text = ['Data science is about the data', 'The science is amazing', 'Predictive modeling is part of data science']

## Fit the `CountVectorizer`
Fit the `CountVectorizer` to learn the words in your data, and then transform your data to create counts for each word.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(tokenizer = tokenizer_better)
vect.fit(sample_text)
# matrix is stored as a sparse matrix (since you have a lot of zeros)
X = vect.transform(sample_text)

The matrix `X` will be a sparse matrix, but if you convert it to an array (`X.toarray()`), you will see this:

`array([[1, 0, 2, 1, 0, 0, 0, 0, 1, 1],
       [0, 1, 0, 1, 0, 0, 0, 0, 1, 1],
       [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]], dtype=int64)`

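There are 3 rows (since we have 3 notes), and each column holds the count of one word. You can see the column names with `vect.get_feature_names()` (in scikit-learn 1.0 and later, use `vect.get_feature_names_out()` instead, since the older method is deprecated).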
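
To make the columns concrete, here is a small plain-Python sketch that rebuilds the same count matrix by hand. It is not how `CountVectorizer` is implemented; it only assumes that the columns are the sorted vocabulary, which matches the array above for this example.

```python
from collections import Counter

sample_text = ['Data science is about the data',
               'The science is amazing',
               'Predictive modeling is part of data science']
tokenized = [text.lower().split() for text in sample_text]

# "fit": the vocabulary is the sorted set of all words seen
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# "transform": one row per note, one column per vocabulary word
rows = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

The first column is "about" and the third is "data", which is why the first row starts with `1, 0, 2`.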


Now fit the vectorizer on the real notes. Use only the training data, because you don’t want to include any new words that show up in the validation and test sets.

There is a hyperparameter called `max_features` that you can set to limit how many words are included in the vectorizer. It keeps only the top N most frequently used words.
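
The effect of `max_features` can be sketched in plain Python: count every word across all notes and keep the N most common. This is only an illustration with made-up notes; how scikit-learn breaks ties between equally frequent words may differ from `Counter.most_common`.

```python
from collections import Counter

notes = ["Data science is about the data",
         "The science is amazing",
         "Predictive modeling is part of data science"]

# Total count of each word over the whole corpus
total_counts = Counter(word for note in notes for word in note.lower().split())

# Keep only the N most frequent words as the vocabulary
max_features = 3
top_words = [word for word, _ in total_counts.most_common(max_features)]
vocabulary = sorted(top_words)
print(vocabulary)
```

Every other word is simply ignored when the notes are turned into counts.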

In [None]:
# fit our vectorizer. This will take a while depending on your computer. 
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_features = 3000, tokenizer = tokenizer_better)
# this could take a while
vect.fit(df_train.TEXT.values)

If we look at the most frequently used words, we see that many of them might not add any value to our model.

These words are called stop words, and we can remove them easily (if we want) with the CountVectorizer. 

There are lists of common stop words for different NLP corpora, but we will just make up our own based on the image below.

In [None]:
my_stop_words = ['the','and','to','of','was','with','a','on',
                 'in','for','name','is','patient','s','he',
                 'at','as','or','one','she','his','her','am',                 
                 'were','you','pt','pm','by','be','had','your',
                 'this','date','from','there','an','that','p',
                 'are','have','has','h','but','o','namepattern',
                 'which','every','also']

Feel free to add your own stop words if you want.
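
As a quick illustration of what stop-word removal does to a single note, here is a plain-Python sketch. The sentence is made up, `some_stop_words` is just a subset of the list above, and adding "hospital" is a hypothetical example of extending it.

```python
# A subset of the stop words defined above
some_stop_words = ['the', 'and', 'to', 'of', 'was', 'with', 'a', 'on', 'patient']

# Hypothetical note, already lowercased and split into tokens
tokens = "the patient was admitted to the hospital with chest pain".split()

# Drop every token that is a stop word
kept = [tok for tok in tokens if tok not in set(some_stop_words)]
print(kept)

# Adding your own stop words is just list concatenation
extended_stop_words = some_stop_words + ['hospital']
kept_extended = [tok for tok in tokens if tok not in set(extended_stop_words)]
print(kept_extended)
```

Only the words that are likely to carry signal are left to be counted.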

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_features = 3000, 
                       tokenizer = tokenizer_better, 
                       stop_words = my_stop_words)
# this could take a while
vect.fit(df_train.TEXT.values)

## Transform our notes into numerical matrices 
At this point, I will only use the training and validation data so I’m not tempted to see how it works on the test data yet.

In [None]:
X_train_tf = vect.transform(df_train.TEXT.values)
X_valid_tf = vect.transform(df_valid.TEXT.values)
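
Because the vocabulary was learned from the training notes only, both matrices have the same columns, and any word that appears only in the validation notes is dropped. The sketch below shows that behavior in plain Python with made-up notes; the helper `transform` and the `_demo` variables are hypothetical stand-ins, not the scikit-learn API.

```python
from collections import Counter

train_notes = ["data science is about the data", "the science is amazing"]
valid_notes = ["the data is surprising"]  # "surprising" never occurs in training

# Vocabulary comes from the training notes only
vocabulary = sorted({word for note in train_notes for word in note.split()})

def transform(notes, vocabulary):
    """Count vocabulary words in each note; words outside the vocabulary are ignored."""
    rows = []
    for note in notes:
        counts = Counter(note.split())
        rows.append([counts[word] for word in vocabulary])
    return rows

X_train_demo = transform(train_notes, vocabulary)
X_valid_demo = transform(valid_notes, vocabulary)
print(vocabulary)
print(X_valid_demo)
```

The validation row has one column per training word, and "surprising" contributes nothing to it.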

We also need our output labels as separate variables.


In [None]:
y_train = df_train.OUTPUT_LABEL
y_valid = df_valid.OUTPUT_LABEL