# NLP

![deep](../asset/deep.png)

# Task

![deep](../asset/nlp.png)

# Inputs

In the vision domain, the input is an image, which is already an array of numbers.

A time series is also an array of numbers.

**But in NLP, how do we convert text into a numeric representation?**

# Bag of words (BoW)

![bag](../asset/bag_word.png)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

corpus = ["j'aime les frites", "LISP c'est trop bien !", "j'aime les jeux dragon's Lair"]

X = vectorizer.fit_transform(corpus)
print("Matrix", X.toarray())
print("Vocabulary", vectorizer.vocabulary_)
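To make the counting explicit, here is a minimal pure-Python sketch of what a bag-of-words model computes. It is an illustration, not scikit-learn's implementation: the helper names (`tokenize`, `bag_of_words`) and `toy_corpus` are made up for this example, and the regular expression is assumed to mirror `CountVectorizer`'s default behaviour (lowercased tokens of at least two word characters).

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep tokens made of 2+ word characters
    # (assumed to mirror CountVectorizer's default token pattern)
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(corpus):
    tokenized = [tokenize(text) for text in corpus]
    # The vocabulary is the sorted set of all tokens seen in the corpus
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # One row per sentence, one column per vocabulary word
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


toy_corpus = ["j'aime les frites", "LISP c'est trop bien !", "j'aime les jeux dragon's Lair"]
vocabulary, matrix = bag_of_words(toy_corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

Single letters such as the `j` of `j'aime` are dropped by the two-character rule, which is why they do not appear in the vocabulary.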

Get matrix representation

In [None]:
import pandas as pd

df = pd.DataFrame(data=X.toarray(), columns=vectorizer.get_feature_names_out())
print(df)

Display top n words

In [None]:
import plotly.graph_objects as go


def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]


common_words = get_top_n_words(corpus, 30)
df = pd.DataFrame(common_words, columns=['unigram', 'count'])

fig = go.Figure([go.Bar(x=df['unigram'], y=df['count'])])
fig.update_layout(title=go.layout.Title(text="Top 30 unigrams"))
fig.show()

## Text similarity

In [None]:
from sklearn.metrics import jaccard_score
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["j'aime les frites",
          "LISP c'est trop bien !",
          "j'aime les jeux dragon's Lair",
          "j'adore les frites",
          "envoyer un mail",
          "mangé saussise"]

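# Vectorise the corpus. binary=True keeps 0/1 presence flags, which is what
# jaccard_score expects (a repeated word would otherwise give a count > 1)
vectorizer = CountVectorizer(binary=True)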
X = vectorizer.fit_transform(corpus)
arr = X.toarray()


def simlarity_search(arr, input_text):
    # Vectorise the input text
    input_text = vectorizer.transform([input_text]).toarray()[0]

    # Compute the Jaccard score between the input and each sentence in the corpus
    scores = []
    for idx in range(arr.shape[0]):
        score = jaccard_score(input_text, arr[idx])
        scores.append([score, corpus[idx]])

    # Sort by score
    scores = sorted(scores, key=lambda x: x[0])[::-1]
    for score, sentence in scores:
        print(f"{score}: {sentence}")


simlarity_search(arr, "j'aime manger des frites")
print("----------------------")
simlarity_search(arr, "envoyé des email")
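The Jaccard score used above is the size of the intersection divided by the size of the union. On binary vectors, `jaccard_score` computes this ratio over the positions set to 1. A minimal sketch on plain token sets follows; `jaccard_similarity` is a hypothetical helper written for this illustration, and its values can differ from the cell above because the vectorizer ignores words that are not in its vocabulary.

```python
def jaccard_similarity(tokens_a, tokens_b):
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty inputs
    # Shared words divided by all distinct words
    return len(set_a & set_b) / len(union)


score = jaccard_similarity(["aime", "les", "frites"], ["adore", "les", "frites"])
print(score)
```

The two token lists share 2 words out of 4 distinct words, so the score is 0.5.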

# Data Cleaning

The problem with text is that the same word can be written in many ways: capitalisation (`Cat`, `cat`), inflection and derivation (`help`, `helping`, `helped`, `helpful`), plus punctuation and stop words (`the`, `that`, etc.).
The aim of data cleaning is to reduce the number of distinct words and their variety.

To do this, we will use [spaCy](https://spacy.io).

In [None]:
!pip install -U -q spacy
!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm

In [None]:
corpus = ["j'aime, les frites",
          "comment installer un site internet ?",
          "LISP c'est trop bien !",
          "j'aime les jeux dragon's Lair",
          "j'adore les frites",
          "Envoyer un mail",
          "mangé: saussise"]

In [None]:
import spacy

nlp = spacy.load("fr_core_news_sm")

## Tokenizer

The goal is to split a sentence into tokens:

`I love to play Dragon's Lair` --> `I`, `love`, `to`, ...

In [None]:
docs = [nlp(sentence) for sentence in corpus]
for token in docs[0]:
    print(token)

## StopWords

Stop words are common words that contribute little information to a text document. Words like `the`, `is`, `a` carry little meaning and add noise to the text data.

In [None]:
for sentence in docs:
    clean_sentence = []
    for token in sentence:
        if not token.is_stop:
            clean_sentence.append(str(token))
    print(' '.join(clean_sentence))

## Punctuation

Removing punctuation can be useful. However, for other embedding techniques such as deep learning models, it is not always the best solution, because punctuation can carry meaning.

In [None]:
for sentence in docs:
    clean_sentence = []
    for token in sentence:
        if not token.is_stop and not token.is_punct:
            clean_sentence.append(str(token))
    print(' '.join(clean_sentence))

## Lemmatization

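The goal is to convert a word to its base form (lemma): inflected forms such as `help`, `helping`, `helped` all map to `help`. A derived word such as `helpful` is usually kept as its own lemma.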

In [None]:
for sentence in docs:
    clean_sentence = []
    for token in sentence:

        if token.is_stop or token.is_punct:
            continue

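        # spaCy v2 used the placeholder lemma "-PRON-" for pronouns; keep the lowercased token in that case
        if token.lemma_ != "-PRON-":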
            lem_word = token.lemma_.lower()
        else:
            lem_word = token.lower_

        clean_sentence.append(str(lem_word))

    print(' '.join(clean_sentence))

# Bag of words with data cleaning



In [None]:
def clean_sentence(sentence, nlp):
    """Drop stop words and punctuation, then return the lowercased lemmas as one string."""
    lemmas = []
    for token in nlp(sentence):

        if token.is_stop or token.is_punct:
            continue

        # spaCy v2 used the placeholder lemma "-PRON-" for pronouns
        if token.lemma_ != "-PRON-":
            lem_word = token.lemma_.lower()
        else:
            lem_word = token.lower_

        lemmas.append(lem_word)

    return ' '.join(lemmas)


corpus = ["j'aime les frites",
          "LISP c'est trop bien !",
          "j'aime les jeux dragon's Lair",
          "j'adore les frites",
          "envoyer un mail",
          "mangé saussise"]
docs = [clean_sentence(sentence, nlp) for sentence in corpus]

In [None]:
# Vectorise the cleaned corpus. binary=True keeps 0/1 presence flags for jaccard_score
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)
arr = X.toarray()

simlarity_search(arr, clean_sentence("j'aime manger des frites", nlp))
print("----------------------")
simlarity_search(arr, clean_sentence("envoyé des email", nlp))

# TF-IDF

`TfidfVectorizer` and `CountVectorizer` are both used in natural language processing to turn text into vectors. However, there is a fundamental difference between the two.

`CountVectorizer` simply counts the number of times a word appears in a document (the bag-of-words approach), while `TfidfVectorizer` takes into account not only how many times a word appears in a document but also how informative that word is across the whole corpus.

This is done by penalizing words that appear in many documents: their weight is reduced because such words are likely to be less important.

Neither technique is better in general. It depends on the application, so it is worth testing both.

## How it works

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency:

![bag](../asset/tf_idf_formul.png)

## Terminology

* **t**: term (word)
* **d**: document (set of words)
* **N**: number of documents in the corpus
* **corpus**: the total document set

## Term Frequency (TF)

`tf(t,d) = count of t in d / number of words in d`

## Document Frequency

`df(t) = number of documents that contain t`

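## Inverse Document Frequency (IDF)
While computing TF, all terms are considered equally important. However, certain terms such as “is”, “of”, and “that” may appear many times but carry little importance. IDF down-weights terms that occur in many documents and up-weights rare ones. The `+ 1` in the denominator avoids a division by zero for a term that appears in no document.

`idf(t) = log(N / (df(t) + 1))`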

## Final formula

`tf-idf(t, d) = tf(t, d) * idf(t) = tf(t, d) * log(N / (df(t) + 1))`
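The formulas of this section can be sketched in a few lines of plain Python. This is an illustration of the definitions given here with the `+ 1` smoothing, not scikit-learn's implementation: `TfidfVectorizer` uses a differently smoothed IDF and normalises each vector by default, so its numbers will differ. The function names and `toy_corpus` are assumptions made for this example.

```python
import math


def tf(term, doc_tokens):
    # Count of the term in the document / number of words in the document
    return doc_tokens.count(term) / len(doc_tokens)


def df(term, corpus_tokens):
    # Number of documents that contain the term
    return sum(1 for doc in corpus_tokens if term in doc)


def idf(term, corpus_tokens):
    n_docs = len(corpus_tokens)
    return math.log(n_docs / (df(term, corpus_tokens) + 1))


def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


toy_corpus = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "the cat sleeps on the sofa",
    "a bird sings in the tree",
]
corpus_tokens = [doc.split() for doc in toy_corpus]

for term in ("car", "the"):
    print(term, round(tf_idf(term, corpus_tokens[0], corpus_tokens), 4))
```

In this toy corpus, `car` appears in a single document and gets a positive weight, whereas `the` appears in every document and, with this particular smoothing, ends up with a slightly negative weight.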

## Example

Sentence A : The car is driven on the road.

Sentence B : The truck is driven on the highway.

![bag](../asset/tf_idf_example.png)

## Vector similarity

![bag](../asset/vector_sim.png)

where a and b are vectors in a multidimensional space.

The value of cos(θ) lies in the range [−1, 1]:

- −1 indicates vectors pointing in opposite directions, i.e. opposite meaning
    - "north" and "south" are opposite
- 0 indicates independent (or orthogonal) vectors, i.e. no similarity
    - "dog" and "moon" are generally independent and have no contextual relation.
- 1 indicates vectors pointing in the same direction, i.e. high similarity
    - "happy" and "joyful" both represent positive emotions and are thus similar in the context of sentiment.

Bag-of-words and TF-IDF vectors contain no negative values, so for them the cosine similarity lies in [0, 1].

![bag](../asset/formule_cosine.png)

where ||A|| is the Euclidean norm of A:

![bag](../asset/euclide.png)
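As a sanity check of the definition, cosine similarity can be written directly as the dot product divided by the product of the Euclidean norms. The sketch below uses plain Python lists; `cosine_similarity_manual` is a hypothetical helper for this illustration, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math


def cosine_similarity_manual(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector has no direction
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity_manual([1, 2], [2, 4])
orthogonal = cosine_similarity_manual([1, 0], [0, 1])
opposite = cosine_similarity_manual([1, 0], [-1, 0])
print(same_direction, orthogonal, opposite)
```

The three calls cover the three reference cases: vectors with the same direction, orthogonal vectors, and opposite vectors.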

## Usage

In [None]:
print(docs)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(docs)
vectorizer.get_feature_names_out()

In [None]:
def tfidf_simlarity_search(vectorizer, dataset_matrix, dataset, input_text):
    # Vectorise the input text
    query_vec = vectorizer.transform([input_text])
    # Apply cosine similarity between the dataset and the query vector
    results = cosine_similarity(dataset_matrix, query_vec).reshape((-1,))
    print(f"Query: {input_text}")
    for i in results.argsort()[-10:][::-1]:
        print(f"{i + 1} - {dataset[i]}")


query = clean_sentence("j'aime manger des frites", nlp)
tfidf_simlarity_search(vectorizer, vectors, docs, query)
print("----------------------")
query = clean_sentence("envoyé des email", nlp)
tfidf_simlarity_search(vectorizer, vectors, docs, query)

# Classifier

## k-NN: A Simple Classifier

The k-Nearest Neighbors (k-NN) classifier is one of the simplest machine learning classification algorithms. In fact, it is so simple that it does not actually “learn” anything. Instead, it relies directly on the distance between feature vectors (which in our case are the TF-IDF vectors of the sentences): a new sample receives the label that is most common among its k closest training samples.

See the scikit-learn [user guide](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification) for details.

![bag](../asset/knn.png)
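The idea can be sketched without any library: measure the distance from the query to every training vector, keep the k closest, and take a majority vote. This is a simplified illustration, not what `KNeighborsClassifier` does internally; it assumes a single label per sample, Euclidean distance, and unweighted votes. The function `knn_predict` and the toy 2-D vectors are made up for this example.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    # Euclidean distance from the query to every training vector
    distances = [
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    ]
    # Keep the k closest samples and return the majority label
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train_vectors = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
train_labels = ["fruit", "fruit", "fruit", "tech", "tech", "tech"]

prediction_a = knn_predict(train_vectors, train_labels, [0.1, 0.1])
prediction_b = knn_predict(train_vectors, train_labels, [1.0, 1.0])
print(prediction_a, prediction_b)
```

There is no training step: all the work happens at prediction time, which is why k-NN becomes slow on large datasets.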

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.neighbors import KNeighborsClassifier

# Train dataset
x = ["j'aime les pomme vert", "les orange sont pas top", "une grosse poire", "la belle poire orange"]
# Multilabel ground truth
y = [['apple', 'green'], ['orange'], ['pear', 'green'], ['pear', 'orange']]

# Test dataset
x_test = ["pomme vert bio", "je suis orange", "la belle orange poire"]
y_test = [['apple', 'green'], ['orange'], ['pear', 'orange']]

# Encode labels
encoder = MultiLabelBinarizer()
y_encode = encoder.fit_transform(y)
y_test_encode = encoder.transform(y_test)

# Create a simple pipeline that computes TF-IDF features
# and trains a multilabel k-NN classifier on them
knn_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', KNeighborsClassifier(n_neighbors=3, weights="distance")),
])

# Train the model on the raw sentences and the encoded labels
knn_pipeline.fit(x, y_encode)
# Compute the testing accuracy (a sample counts as correct only if all its labels match)
prediction = knn_pipeline.predict(x_test)
print('Test accuracy is {}'.format(accuracy_score(y_test_encode, prediction)))
print(classification_report(y_test_encode, prediction))

In [None]:
print(y_test_encode)
print(prediction)