# Preprocessing


## Installation of spaCy

Two libraries are widely used in the NLP world: spaCy and NLTK. We chose spaCy here because it is easy to get started with.
> If you want to know the differences between spaCy and NLTK, consult this article:
> https://medium.com/@akankshamalhotra24/introduction-to-libraries-of-nlp-in-python-nltk-vs-spacy-42d7b2f128f2


***Installation via pip***

````
pip3 install -U spacy
````

***Installation via conda***
````
conda install -c conda-forge spacy
````

**Then you must install the language model!**

````
python3 -m spacy download en_core_web_sm
````

## Installation of scikit-learn

scikit-learn is an important package for machine learning. In this example we will use it to train a model that it already implements (logistic regression).
> Installation guide:
> https://scikit-learn.org/stable/install.html


In [8]:
total = []
for i in [['a','b','c'],'ee',['d','e','f']]:
    print(i)
    if type(i) == str:
        total += [i]
    total += i

print(total)


['a', 'b', 'c']
ee
ok
['d', 'e', 'f']
['a', 'b', 'c', 'ee', 'e', 'e', 'd', 'e', 'f']


## Dataset

We will start with a simple example to familiarize you with the basic concepts of NLP.


In [1]:
data = {"I hate school": -1, "I hate apples":-1, "I love you":1, "I love trees":1}

## Preparing the data

The model we'll create is a word-based language model, which means each input unit is a single word (alternatively, some language models learn subword units like characters).

###  Tokenization

The first preprocessing step is to tokenize each of the sentences into individual words. The check further down compares against tokens in their original casing, so do not lowercase them here. In this first example we ask you to write your own tokenization function. Later we will teach you how to do this with spaCy.

We will create a `text_to_tokens` function because we will reuse this piece of code for the tests.

A few rules:

* Make sure to detach the punctuation from the words.
    * `['ok', ',', 'I', 'will', 'do', 'it', '.']` -> OK
    * `['ok,', 'I', 'will', 'do', 'it.']` -> NOT ok.

In [2]:
def text_to_tokens(text):  
            
    # ADD YOUR CODE HERE...
    
    return tokens



# !! ======== DO NOT CHANGE  ======== !!
# Just some code to check that your function is working.
# If you get AssertionError, your function is not doing what is expected.
test_string = "ok, I will do it. it's not an issue! But are you sure?"
test_tokens = ['ok', ',', 'I', 'will', 'do', 'it',  '.', 'it', "'", 's', 'not', 'an', 'issue', '!', 'But', 'are', 'you', 'sure', '?']
assert text_to_tokens(test_string) == test_tokens
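
One possible way to satisfy the rules above is a regular expression that treats every run of word characters as one token and every punctuation mark as its own token. This is only a sketch: the name `regex_tokenize` is hypothetical, and it assumes that splitting `it's` into `it`, `'`, `s` is what you want, as the check above expects.

```python
import re

def regex_tokenize(text):
    """Split a text into word tokens and single punctuation marks."""
    # \w+      matches a run of letters, digits or underscores (a word)
    # [^\w\s]  matches one character that is neither a word character
    #          nor whitespace, so each punctuation mark is its own token
    return re.findall(r"\w+|[^\w\s]", text)

print(regex_tokenize("ok, I will do it."))
# ['ok', ',', 'I', 'will', 'do', 'it', '.']
```

Whitespace is never matched by either alternative, so repeated spaces between words do not produce empty tokens.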

### Vocabulary indexing

Now that you have done the tokenization, you need to convert those tokens to numbers because, as we know, computers can only process numbers.

The simplest approach is to give a number to each **new** word.

```python
text = ['my', 'simple', 'sentence', '.',  'my', 'simple',  'text', '.']

vocabulary = {
  "my": 0,
  "simple": 1,
  "sentence": 2,
  ".": 3,
  "text": 4,
}
```

Let's try it then! Create a function `make_vocabulary()` that takes a list of words as input and returns a dict where the key is the word and the value is an index.

In [3]:
def make_vocabulary(tokens):
    """
    This function will create a Dict out of a List of String to provide a unique number to each unique string.
    
    :param tokens: List of strings, each containing a word or a punctuation mark.
    :return: A Dict with each unique string of the text as key and a unique Int as value.
    """
    # ADD YOUR CODE HERE....

    return dic

text = ['my', 'simple', 'sentence', '.',  'my', 'simple',  'text', '.']
print(make_vocabulary(text))

{'my': 0, 'simple': 1, 'sentence': 2, '.': 3, 'text': 4}
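
If you get stuck, here is a minimal sketch of one way to assign an index to each new word. The name `build_vocabulary_sketch` is hypothetical; it relies on the fact that the next free index is always the current size of the dict.

```python
def build_vocabulary_sketch(tokens):
    """Give every distinct token the next free integer, in order of first appearance."""
    vocabulary = {}
    for token in tokens:
        if token not in vocabulary:
            # len(vocabulary) is 0 for the first new token, 1 for the second, ...
            vocabulary[token] = len(vocabulary)
    return vocabulary

text = ['my', 'simple', 'sentence', '.', 'my', 'simple', 'text', '.']
print(build_vocabulary_sketch(text))
# {'my': 0, 'simple': 1, 'sentence': 2, '.': 3, 'text': 4}
```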


### Embeddings

Now it's time for the last preprocessing step! We will bring all our work together.
You will use the functions you created to build an embedding system.

The goal is to convert your tokens to numbers using your vocabulary list.

For example:

```python
text = "my simple sentence.  my simple  text."

vocabulary = {
  "my": 0,
  "simple": 1,
  "sentence": 2,
  ".": 3,
  "text": 4,
}

embedded = [0, 1, 2, 3, 0, 1, 4, 3]
```

To create `embedded` we applied the `vocabulary` to `text`. The first word of the sentence is `my`; if I do `vocabulary['my']` I will get 0, because that is the value I assigned to it in the `vocabulary`. Then I can simply loop over each token, look up its value in the vocabulary and add it to the `embedded` list.


Embed the sentence: `"my simple sentence.  my simple  text."`

In [4]:
def embed_text(text):
    """Embed a string and return a list of int."""
    
    # ADD YOUR CODE HERE
    
    return embedded

# !! ======== DO NOT CHANGE  ======== !!
# Just some code to check that your function is working.
# If you get AssertionError, your function is not doing what is expected.
text = "my simple sentence.  my simple  text."
assert embed_text(text) == [0, 1, 2, 3, 0, 1, 4, 3]
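
The three steps (tokenize, build the vocabulary, look up each token) can be chained as in the sketch below. It is self-contained, so it uses its own regular-expression tokenizer and a hypothetical name `embed_text_sketch` instead of your functions; treat it as one possible solution, not the expected one.

```python
import re

def embed_text_sketch(text):
    """Tokenize a string, index its tokens and return the list of indices."""
    # Step 1: words and punctuation marks become separate tokens
    tokens = re.findall(r"\w+|[^\w\s]", text)

    # Step 2: each new token gets the next free index
    vocabulary = {}
    for token in tokens:
        vocabulary.setdefault(token, len(vocabulary))

    # Step 3: replace every token by its index
    return [vocabulary[token] for token in tokens]

print(embed_text_sketch("my simple sentence.  my simple  text."))
# [0, 1, 2, 3, 0, 1, 4, 3]
```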

## Perform the steps on the above given dataset
### Step 1: create vocabulary

In the next step we want to train a model to predict whether a sentence is positive or negative. In our training dataset, we have a few labeled examples which we can use to train the model. -1 means negative, 1 means positive.

First we generate the vocabulary for this dataset. Can you see why the first steps are necessary?

In [5]:
# first we determine the tokens for all the sentences in the dataset
all_tokens = [text_to_tokens(text) for text in data.keys()]

# we want to make a list of all the DIFFERENT tokens, so that no token appears twice.
concat_list = list(set([j for i in all_tokens for j in i]))

#Now we can construct a vocabulary from this list of words.
vocabulary = make_vocabulary(concat_list)
print(vocabulary)

{'you': 0, 'I': 1, 'apples': 2, 'trees': 3, 'love': 4, 'school': 5, 'hate': 6}


If everything went well, your result will look like the printed output above. The index assigned to each word may differ on your machine, because the order of a `set` is not guaranteed.
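
If you prefer indices that are the same on every run, one option is to remove duplicates with `dict.fromkeys`, which keeps the order of first appearance, instead of `set`. The sketch below assumes that splitting on whitespace is enough, which holds here only because the example sentences contain no punctuation; the name `stable_vocabulary` is hypothetical.

```python
data = {"I hate school": -1, "I hate apples": -1, "I love you": 1, "I love trees": 1}

# Assumption: whitespace splitting is sufficient for these punctuation-free sentences
all_tokens = [text.split() for text in data]

# dict.fromkeys keeps only the first occurrence of each token, in order of appearance
unique_tokens = list(dict.fromkeys(token for tokens in all_tokens for token in tokens))

stable_vocabulary = {token: index for index, token in enumerate(unique_tokens)}
print(stable_vocabulary)
# {'I': 0, 'hate': 1, 'school': 2, 'apples': 3, 'love': 4, 'you': 5, 'trees': 6}
```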

### Step 2: embed sentences in dataset
We want to construct a matrix (= a list of lists), where every list contains the embedding of 1 sentence.

Please take your time to understand the steps in the following cell.

In [6]:
#We initialize an empty matrix
matrix = []
#We loop over the lists of tokens of the different sentences
for tokens in all_tokens:
    #We embed the tokens for each sentence
    embedded = [vocabulary[token] for token in tokens]
    #we add the embedding to the matrix
    matrix.append(embedded)
    
print(matrix)


[[1, 6, 5], [1, 6, 2], [1, 4, 0], [1, 4, 3]]


If everything went well, your result will look like the printed output above: one list of three indices per sentence. The numbers themselves depend on your vocabulary.

Please take some time to validate that the numbers indeed correspond to the words in the sentences.
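
One way to do this validation is to invert the vocabulary and translate every row of the matrix back into words. The sketch below hard-codes the `vocabulary` and `matrix` values printed in this notebook so that it runs on its own; your numbers may differ.

```python
# Values copied from the outputs printed in this notebook
vocabulary = {'you': 0, 'I': 1, 'apples': 2, 'trees': 3, 'love': 4, 'school': 5, 'hate': 6}
matrix = [[1, 6, 5], [1, 6, 2], [1, 4, 0], [1, 4, 3]]

# Swap keys and values so that we can look up a word by its index
index_to_token = {index: token for token, index in vocabulary.items()}

decoded = [" ".join(index_to_token[index] for index in row) for row in matrix]
print(decoded)
# ['I hate school', 'I hate apples', 'I love you', 'I love trees']
```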

### Step 3: train a model
Here is where some magic will happen. Don't despair if you don't fully understand what is going on.

In [7]:
from sklearn.linear_model import LogisticRegression

#We make a target list that contains all of the labels -1 or 1
y = list(data.values())
print(y)

[-1, -1, 1, 1]


In [8]:
matrix

[[1, 6, 5], [1, 6, 2], [1, 4, 0], [1, 4, 3]]

In [9]:
y

[-1, -1, 1, 1]

In [15]:
#We train a model giving it our embedded sentences (matrix) and the known labels (y). From this it will try to learn 
#which words are important for determining positive or negative sentiment.
clf = LogisticRegression(random_state=0).fit(matrix, y)


In [18]:
#We now want to use this trained model to predict the sentiment of a new sentence that the model has never seen before
test = "I love apples"
embedded = [vocabulary[token] for token in text_to_tokens(test)]
print(embedded)

print("The model predicts that the class is: ", clf.predict([embedded])[0])

[1, 4, 2]
The model predicts that the class is:  1


The model correctly sees the sentence as positive. Now another example. It will give an error; can you identify why?

In [21]:
test = "I hate fish"
embedded = [vocabulary[token] for token in text_to_tokens(test)]

print(clf.predict([embedded]))
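
Once you have found the cause, the sketch below shows what happens during the lookup and one common workaround: reserving an extra index for words that are not in the vocabulary. The vocabulary is hard-coded from the output printed earlier, whitespace splitting is assumed to be enough for this sentence, and `UNKNOWN_INDEX` is a hypothetical name.

```python
# Vocabulary copied from the output printed earlier in this notebook
vocabulary = {'you': 0, 'I': 1, 'apples': 2, 'trees': 3, 'love': 4, 'school': 5, 'hate': 6}
tokens = "I hate fish".split()

try:
    embedded = [vocabulary[token] for token in tokens]
except KeyError as error:
    # 'fish' never appeared in the training sentences, so it has no index
    print("Unknown token:", error)

# Hypothetical workaround: map every unknown word to one extra index
UNKNOWN_INDEX = len(vocabulary)
embedded = [vocabulary.get(token, UNKNOWN_INDEX) for token in tokens]
print(embedded)
# [1, 6, 7]
```

Note that this only avoids the lookup error. The classifier has never seen this index during training, so its prediction for such a sentence should not be trusted.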