### Representing Text

As described in the introduction, this notebook looks at how to represent text in a way that is useful for analysis.


In most programming languages, text is represented as a string. Statistical analysis, however, works on numerical vectors. More than this, we want a numerical representation that preserves as much as possible of the information in the text, e.g. word relationships and contextual meaning.


Below you will learn one method for converting text into numbers: bag of words models. This is a first step towards creating numerical representations of the information encoded in the text.


Let's find out how.

### Bag of words models (BoW)

BoW is one of the simplest and most common ways of representing text. In this approach we take a given string and represent it as a frequency count of all the words it contains, ignoring case. For example, the string *"Sometimes I like to eat an orange, but sometimes I prefer eating an apple"* would be represented as:

* Sometimes: 2
* I: 2
* Like: 1
* To: 1
* Eat: 1
* An: 2
* Orange: 1
* But: 1
* Prefer: 1
* Eating: 1
* Apple: 1
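
As a minimal sketch of the counting idea above, the snippet below uses Python's `collections.Counter`. Lowercasing and the letters-only regular expression are assumptions made for this illustration, not part of the notebook's exercises.

```python
import re
from collections import Counter

sentence = "Sometimes I like to eat an orange, but sometimes I prefer eating an apple"

# Lowercase so "Sometimes" and "sometimes" are counted together,
# then keep only runs of letters as words (this drops the comma).
words = re.findall(r"[a-z]+", sentence.lower())

# Counter maps each distinct word to the number of times it appears.
bag = Counter(words)
print(bag)
```

Note that `bag` no longer records where in the sentence each word appeared.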
 
   
While working through the implementation, think about what the drawbacks of this approach might be.

Building a BoW model has two main parts:
1. Keep a count of all the words/tokens that appear in a string
2. Ignore the order in which those words appear

In practice, more than two steps are required to produce a bag of words model from raw text input. Let's walk through the process together.

#### 1. Preprocessing

There can be several steps here, but the aim is to remove duplicate and nonsense words from our final corpus. Some of the steps we might need include:
1. Converting text to lower case
2. Removing all non-word characters
3. Removing all punctuation
4. Removing multiple spaces

These steps help ensure that only a small set of useful words remains in our corpus.

##### Exercise: Have a look at the sentence given below and complete the preprocess function to apply all the above steps

In [0]:
text = " Do Preprocessing on this piece of text! You should   be able to obtain a clean version. Text should always be cleaned"

## Write your code here
def preprocess(text):
    return ''

preprocessed_text = preprocess(text)

##### Example Answer:
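
One possible answer is sketched below. It assumes that punctuation should be replaced by a space rather than deleted outright, so that words on either side of a punctuation mark are not glued together. Other choices are equally valid.

```python
import re

def preprocess(text):
    # 1. Convert to lower case
    text = text.lower()
    # 2-3. Replace anything that is not a word character or whitespace
    #      (this covers punctuation) with a space
    text = re.sub(r"[^\w\s]", " ", text)
    # 4. Collapse runs of whitespace and trim both ends
    text = re.sub(r"\s+", " ", text).strip()
    return text

text = " Do Preprocessing on this piece of text! You should   be able to obtain a clean version. Text should always be cleaned"
preprocessed_text = preprocess(text)
print(preprocessed_text)
```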

#### 2. Tokenization 

This is the process of breaking text into tokens/words. In this simple example we'll just split the text on whitespace.

##### Exercise: First try to create your own tokenization function

In [0]:
def tokenize(text):
    # You must return a list of tokens; you can use the str.split() method
    return []

tokens = tokenize(preprocessed_text)

##### Example Answer:
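
A minimal sketch of one possible answer, using `str.split()`. The short `preprocessed_text` string is a stand-in chosen for this illustration, so that the snippet runs on its own.

```python
def tokenize(text):
    # With no arguments, str.split() splits on any run of whitespace
    # and never returns empty strings.
    return text.split()

# Stand-in for the output of the preprocessing step
preprocessed_text = "do preprocessing on this piece of text"
tokens = tokenize(preprocessed_text)
print(tokens)
```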

#### 3. Removing Stopwords

We need to remove function words like 'is', 'and', 'the', 'are' etc., as they carry little information about the content of the text and add clutter to the analysis.


In [0]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
 
stopWords = set(stopwords.words('english'))
print(stopWords)

def removeStopwords(tokens, stopWords):
  # Write your logic to remove stopwords
  return []

filtered_tokens = removeStopwords(tokens, stopWords)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
{"needn't", 'such', 'will', 'theirs', "shan't", 'doesn', 'and', "don't", "hasn't", "you've", 'weren', 'had', 'can', 'haven', 'is', "doesn't", 'of', "haven't", 'herself', 't', 'not', "wouldn't", 'here', 'no', 'until', 'she', 'm', "didn't", 'under', 'down', 'ma', "it's", 'hadn', "she's", 'so', 'below', 'more', 'why', 'yourself', "aren't", 'he', 'have', 'for', 'has', 'in', 've', 'during', 'few', 'was', 'having', 'me', 'who', 'hers', 'on', 'further', 'into', 'as', 'been', "should've", "weren't", 'same', "mustn't", 'about', 'from', 'to', 'how', "mightn't", 'it', 'yours', 'don', "couldn't", 'that', 'this', 'once', 'than', 'nor', 'a', 'her', "isn't", 'just', 'against', 'while', 'are', 'y', 'aren', 'mustn', 'your', 'at', 'themselves', 'won', 'what', 'should', 'again', 'shouldn', 'does', "hadn't", 'you', 'wasn', 'were', 'shan', 'where', 'these', 'the', 'his', 'above', 'hasn', 'doing', 

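
One way to fill in `removeStopwords` is a simple list comprehension, sketched below. The small `stopWords` set here is a hypothetical hand-picked stand-in so that the snippet runs without downloading the nltk data. In the notebook you would pass the nltk stop-word set instead.

```python
def removeStopwords(tokens, stopWords):
    # Keep only the tokens that are not in the stop-word set
    return [token for token in tokens if token not in stopWords]

# Hypothetical hand-picked stop words, used only for this illustration;
# in the notebook, use set(stopwords.words('english')) instead.
stopWords = {"do", "on", "this", "of", "you", "should", "be", "to", "a"}

tokens = ["do", "preprocessing", "on", "this", "piece", "of", "text"]
filtered_tokens = removeStopwords(tokens, stopWords)
print(filtered_tokens)
```

Using a `set` for the stop words keeps each membership check fast, which matters once the token list grows.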

#### 4. Stemming and Lemmatization

In many European languages (e.g. English, German, Spanish), words can take on several forms. This process of changing a word from its root form into other forms is called inflection. In English, for example, inflection is used to signal different cases, different actors and different tenses: `play` can become `plays`, `playing`, `played` etc.

In most cases we want to treat all of these forms in the same way, so it is useful to convert them back to their root form. Stemming does this by stripping endings from a word.

Lemmatization serves the same purpose but is usually more accurate than stemming. For example, if the word 'go' appears in its past tense 'went', stemming cannot arrive at 'go' from 'went', but lemmatization can. Stemming is a rule-based technique, whereas lemmatization is dictionary-based. Lemmatization is therefore more accurate, but it is slower because it has to look words up in a dictionary. The base form of a word in this case is called a lemma.
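
To make the rule-based versus dictionary-based distinction concrete, here is a deliberately crude toy sketch. `toy_stem`, `toy_lemmatize`, the suffix list and the tiny lookup table are all hypothetical and invented for this illustration. Real stemmers and lemmatizers are far more careful.

```python
# Toy rule-based "stemmer": strip one of a few common suffixes.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip if at least three characters would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy dictionary-based "lemmatizer": look the word up in a table.
LEMMAS = {"went": "go", "gone": "go", "plays": "play",
          "playing": "play", "played": "play"}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the table
    return LEMMAS.get(word, word)

for word in ["plays", "playing", "played", "went"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The suffix rules handle `plays`, `playing` and `played`, but they leave `went` unchanged. Only the lookup table can map it to `go`.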

Since it is non-trivial to implement stemming yourself, we'll use a well-known library called `nltk`. NLTK is a multipurpose NLP package that we'll use at several points during this minicourse.


A detailed dive into the techniques we're using has been omitted to keep this short. If you want to go deeper into how stemming works, see: https://www.geeksforgeeks.org/python-stemming-words-with-nltk/

##### Exercise: Complete the function below using nltk

In [0]:
import nltk

def stem(tokens):
    # Use stemming from nltk here
    return tokens

stemmed_tokens = stem(tokens)

##### Example Answer:
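
One possible answer, sketched with nltk's `PorterStemmer`. The choice of the Porter stemmer is an assumption, since nltk ships several stemmers and the exercise does not name one. The short token list is a stand-in so that the snippet runs on its own.

```python
from nltk.stem import PorterStemmer

def stem(tokens):
    # Create the stemmer once and apply it to every token
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]

# Stand-in tokens for this illustration
tokens = ["play", "plays", "playing", "played"]
stemmed_tokens = stem(tokens)
print(stemmed_tokens)
```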

In practice you'll usually use libraries for all of the above tasks, but it's always a good idea to get your hands dirty in the beginning.

### Count Vectors 

Everything we've done so far has been preparing the text so that we can count its words properly. Now that you know how things work, let's look at the count matrix.

For this task you can use either `gensim` or `sklearn`.

`gensim` is faster, but `sklearn` is easier to use and integrates better with downstream tasks, so we're going to work with that. Here's the `CountVectorizer` example from the sklearn documentation:

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
 ]
 
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

Here `X` contains the counts of each word present in the text, with one row per document and one column per word. Let's take a look at the words in our corpus, and the bag of words representation of each snippet.

In [0]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [0]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
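
To see where these numbers come from, the sketch below rebuilds the same vocabulary and count matrix by hand. It assumes the default tokenization, which lowercases the text and keeps words of two or more characters. It is an illustration of the idea, not how `sklearn` is implemented internally.

```python
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Lowercase each document and keep words of two or more characters
docs = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]

# The vocabulary is the sorted set of all distinct words; each word is a column
vocabulary = sorted({word for doc in docs for word in doc})

# One row per document: how often each vocabulary word occurs in it
matrix = [[doc.count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```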


You can change a few parameters to make sure irrelevant stop words (and, or, is, the etc.) are removed (`stop_words`). You can also control the maximum number of words that should be considered (`max_features`). Moreover, tokenizers, stemmers, lemmatizers and other parts of the pipeline can also be provided. Have a look at the documentation to play around with the parameters: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

### tf-idf vectors

tf-idf (term frequency-inverse document frequency) is an improvement over count vectors because it takes into account how many documents in the corpus each word appears in.

Suppose you are working with several documents related to medicine. There'll be a lot of domain-specific words that occur in most of the documents and are not predictive of anything. For example, 'hospital', 'doctor' and 'drugs' might normally convey significant meaning, but may have little value in this context.

We need some way to penalize counts of such words and that's where tf-idf comes in.

**The basic idea is:** take the count of each word and divide it by the number of different documents that the word occurs in (sometimes using the log of that number)

Again, as we don't have the time to go into detail on this topic, you can find a more detailed discussion here: https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76


sklearn provides a vectorizer similar to `CountVectorizer` that produces tf-idf vectors directly:

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
 ]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

In [0]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [0]:
print(X.toarray())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
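
The sketch below recomputes these weights by hand. It assumes scikit-learn's documented defaults for `TfidfVectorizer`: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by scaling each row to unit Euclidean length. It is an illustration of the formula, not the library's actual implementation.

```python
import math
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Lowercase each document and keep words of two or more characters
docs = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
df = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Smoothed inverse document frequency (assumed sklearn default)
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

def tfidf_vector(doc):
    # Term count times idf, then scale the row to unit length
    weights = [doc.count(word) * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf = [tfidf_vector(doc) for doc in docs]
for row in tfidf:
    print([round(value, 8) for value in row])
```

Words that occur in every document ('is', 'the', 'this') get the lowest idf, so rarer words such as 'first' or 'second' end up with larger weights.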


### Using BoW models

Once you have converted your texts into a BoW representation (counts or tf-idf) and you have a label for each of them, you can train any classifier or regression model on top. This simple technique of counting words helps solve a lot of tasks for which you have labelled data.
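
As a rough sketch of what this can look like, the snippet below chains a `TfidfVectorizer` and a `LogisticRegression` classifier with `make_pipeline`. The four texts and their labels are hypothetical toy data invented for this illustration, and logistic regression is just one possible choice of model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, invented for this illustration
texts = [
    "the doctor prescribed new drugs",
    "the hospital hired a doctor",
    "the team won the football match",
    "the striker scored a late goal",
]
labels = ["medicine", "medicine", "sport", "sport"]

# The pipeline turns raw text into tf-idf vectors, then fits the classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

predictions = model.predict(["a doctor at the hospital"])
print(predictions)
```

With so little data the prediction is not meaningful. The point is only how the pieces fit together.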

However, for most real-world problems you do not have any labelled data, yet you still need to solve the problem to a sufficient extent. We'll look at how you can do that in later notebooks.