# Lesson 4: Practical Deep Learning for Coders - Natural Language (NLP)

This notebook collects my personal notes and work from lesson 4 of the course "Practical Deep Learning for Coders" by Jeremy Howard ([here](https://www.youtube.com/watch?v=toUgBQv1BT8&list=PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU&index=4) for the video on YT) and the related fastbook, the Jupyter book from fast.ai. The lecture is partly based on chapter 10 of the book and on the Kaggle notebook [Getting started with NLP for absolute beginners](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners).

The lesson explains how to analyse natural language documents using Natural Language Processing (NLP).
The lecture approaches NLP through the Hugging Face ecosystem, especially the Transformers library, rather than the fastai library, and we are going to explain why.

Lesson structure:
- What is NLP?
- Why the Hugging Face library?
- Understanding Fine-tuning;
- ULMFiT: the first fine-tuned NLP model;
- Kaggle Competition: classify similarity of phrases used to describe US patents;
- Homework

In [None]:
# Install the required packages
! pip install -r requirements.txt

In [None]:
# Suppress warnings (comment out the next two lines to see them)
import warnings
warnings.filterwarnings('ignore')

## What is NLP?

One area where deep learning has dramatically improved in the last couple of years is natural language processing (NLP). Computers can now generate text, translate automatically from one language to another, analyze comments, label words in sentences, and much more.

Perhaps the most widely practically useful application of NLP is classification -- that is, classifying a document automatically into some category. This can be used, for instance, for:

- Sentiment analysis (e.g., are people saying positive or negative things about your product);
- Author identification (what author most likely wrote some document);
- Legal discovery (which documents are in scope for a trial);
- Organizing documents by topic;
- Triaging emails.

## Why the Hugging Face library?

The lecture zooms in on NLP within the Hugging Face library. Now, you might be thinking, "Why not stick with good ol' fastai?" Well, here's the scoop – using more than one library is like having a toolbox with different gadgets. It gives you a broader perspective and helps wrap your head around the concepts from different angles.

But why the love for Hugging Face? Simple. It's the big shot in the world of NLP. Let's talk architecture. Unlike fastai, Hugging Face Transformers doesn't come with a neatly layered API; it sits at a lower level, giving you more control. It's not as plug-and-play as fastai; instead, it's a more hands-on library.

## Understanding Fine-tuning

What does it mean to fine-tune a pre-trained model?
We have already done it in the first lecture, but let's look at what is actually happening.


Alright, imagine you've got this pre-trained model, right? It's like a brain that's already seen a bunch of stuff. Now, fine-tuning is like tweaking the settings on that model to make it super smart for a specific task.

Let's imagine tuning the slider from the previous example.
Someone with experience might share a few hints to get us started. The pre-trained model is that kind of external hint: it's decent, but it might not be perfect for your jam. So, you fine-tune it. You slide that bar a bit to make it just right.

Now, let's get a bit more techy. Say you trained a model to recognize cats and dogs, but now you want it to tell the difference between penguins and polar bears. You don't wanna start from scratch because that's like reinventing the wheel. Instead, you fine-tune it. You take what it already knows about cats and dogs and give it a crash course on penguins and polar bears.

It's like saying, "Hey brain, you know a lot about animals, but let's focus on these icy fellas for a bit." You adjust the model's parameters, those fancy settings that make it tick, to make it a polar bear and penguin expert.

As we have seen in the first lecture, the last layers of a neural network are specific to the task, while the first layers are more generic (edges, corners). In transfer learning and fine-tuning, we can delete the last layers, replace them with a new, randomly initialized layer, and train that on the new task!
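
To make the "swap the last layer" idea concrete, here is a minimal pure-Python sketch. Everything in it is a toy assumption made up for illustration: `BODY_WEIGHTS` stands in for the generic pre-trained layers, `new_task` is invented data, and `train_head` is a hypothetical helper. It is not how fastai or Transformers do it internally; it only shows the body staying frozen while a randomly initialized head is trained.

```python
import random

random.seed(0)

# Stand-in for the pre-trained early layers (hypothetical values).
# These weights are never updated during fine-tuning.
BODY_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]

def body(x):
    """Frozen feature extractor: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in BODY_WEIGHTS]

def predict(head, x):
    """Apply the head (two weights plus a bias) to the body's features."""
    feats = body(x) + [1.0]  # the trailing 1.0 multiplies the bias term
    return sum(w * f for w, f in zip(head, feats))

def loss(head, data):
    """Mean squared error of the head on the given (input, target) pairs."""
    return sum((predict(head, x) - y) ** 2 for x, y in data) / len(data)

def train_head(head, data, lr=0.05, epochs=200):
    """Plain SGD that updates the head only; the body is left untouched."""
    for _ in range(epochs):
        for x, y in data:
            feats = body(x) + [1.0]
            error = predict(head, x) - y
            head = [w - lr * 2 * error * f for w, f in zip(head, feats)]
    return head

# Made-up data for the "new task".
new_task = [([1, 0], 2.0), ([0, 1], 0.0), ([2, 1], 1.0), ([3, 1], 2.5), ([1, 3], -1.5)]

# The old task-specific layer is thrown away; the new one starts out random.
head = [random.uniform(-0.1, 0.1) for _ in range(3)]
initial_loss = loss(head, new_task)
head = train_head(head, new_task)
final_loss = loss(head, new_task)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.6f}")
```

In practice, libraries usually go one step further and later unfreeze the body too, training it with a small learning rate.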

## ULMFiT: the first fine-tuned NLP model

ULMFiT (Universal Language Model Fine-tuning), introduced by Jeremy Howard and Sebastian Ruder in their [paper](https://arxiv.org/pdf/1801.06146.pdf), was one of the first techniques to apply fine-tuning to NLP and became an influential foundation for later research.

The example task is sentiment analysis on IMDb reviews: deciding whether each review is positive or negative.

The approach is divided into three steps, as in the figure XXXXXXX:

- Step 1: Build a language model (LM) trained on Wikipedia. The LM is trained to predict the next word in Wikipedia articles. To do that well, it has to learn far more than vocabulary: it needs to pick up how language is structured, and something about the topics the articles cover, from mathematics to politics.

- Step 2: Fine-tune the language model on IMDb reviews for a few epochs. The task is still next-word prediction, but now on the text of the reviews, so the model adapts to their style and vocabulary.

- Step 3: Use the fine-tuned language model to initialize a classifier, which is then trained to label reviews as positive or negative. This is the core of the transfer-learning process.

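ULMFiT was built on a Recurrent Neural Network (RNN). Since then, transformers have become the standard architecture, and they are very good at picking up the meaning of words in context. Many transformer models are pre-trained slightly differently: instead of predicting the next word, they predict words that have been omitted from a piece of text.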



## Kaggle Competition: classify similarity of phrases used to describe US patents

The competition ["U.S. Patent Phrase to Phrase Matching"](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching) asks participants to build a model that matches phrases in U.S. patents. Here are some details:

> In this competition, you will train your models on a novel semantic similarity dataset to extract relevant information by matching key phrases in patent documents. Determining the semantic similarity between phrases is critically important during the patent search and examination process to determine if an invention has been described before. For example, if one invention claims "television set" and a prior publication describes "TV set", a model would ideally recognize these are the same and assist a patent attorney or examiner in retrieving relevant documents. This extends beyond paraphrase identification; if one invention claims a "strong material" and another uses "steel", that may also be a match. What counts as a "strong material" varies per domain (it may be steel in one domain and ripstop fabric in another, but you wouldn't want your parachute made of steel). We have included the Cooperative Patent Classification as the technical domain context as an additional feature to help you disambiguate these situations.

Let's download the data and understand the problem! You need an API key to use the Kaggle API; to get one, click on your profile picture on the Kaggle website, choose My Account, then click Create New API Token. This will save a file called `kaggle.json` to your PC. Store it in the `.kaggle` folder in your HOME directory, and accept the competition rules on the competition page.

In [12]:
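from pathlib import Path

# Paste the contents of your kaggle.json file between the quotes
creds = ''
cred_path = Path('~/.kaggle/kaggle.json').expanduser()
# Only write the file if credentials were provided and none exist yet
if creds and not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)  # restrict the token file to the current user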

In [13]:
path = Path('us-patent-phrase-to-phrase-matching')
if not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

Downloading us-patent-phrase-to-phrase-matching.zip to c:\Users\conti\Progetti\agile-lab\books\fastbook\04_natural_language


100%|██████████| 682k/682k [00:00<00:00, 1.19MB/s]







Documents in NLP datasets are generally in one of two main forms:

- Larger documents: One text file per document, often organised into one folder per category
- Smaller documents: One document (or document pair, optionally with metadata) per row in a CSV file.

Let's look at our data and see what we've got in pandas!

In [15]:
# `ls` is a Unix command, so this cell fails in a Windows shell
!ls {path}

'ls' is not recognized as an internal or external command,
 operable program or batch file.
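
Since `ls` is not available in the Windows shell used above, one portable alternative is to list the folder from Python instead. A minimal sketch with `pathlib`; `list_dir` is a helper name made up for this example, and the demo call lists the current directory so that it runs anywhere:

```python
from pathlib import Path

def list_dir(folder):
    """Return the sorted names of the entries in `folder` (works on any OS)."""
    return sorted(p.name for p in Path(folder).iterdir())

# Demo on the current working directory
print(list_dir('.'))
```

In this notebook, the call would be `list_dir(path)`, using the `path` variable defined in the download cell.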


In [17]:
import pandas as pd
df = pd.read_csv(path/'train.csv')
df.head()

|   | id | anchor | target | context | score |
|---|----|--------|--------|---------|-------|
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.5 |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.5 |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.0 |


In [18]:
df.describe(include='object')

|   | id | anchor | target | context |
|---|----|--------|--------|---------|
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |


We can see that in the 36473 rows, there are 733 unique anchors, 106 contexts, and nearly 30000 targets. Some anchors are very common, with "component composite coating" for instance appearing 152 times.

Basically, in this competition, we are tasked with comparing two words or short phrases, and scoring them based on whether they're similar or not, based on which patent class they were used in. With a score of 1 it is considered that the two inputs have identical meaning, and 0 means they have totally different meaning. For instance, abatement and eliminating process have a score of 0.5, meaning they're somewhat similar, but not identical.

How do we solve this problem? It is not framed as a classification problem, and each document is very short (3-4 words), but we can map it onto a simpler, familiar problem such as classification!
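
One possible way to read "map it onto classification" is to treat each distinct score as a category. The sketch below assumes the scores only take the five values 0, 0.25, 0.5, 0.75 and 1 (only some of them appear in the rows shown above), and `score_to_class` and `class_to_score` are hypothetical helper names, not part of any library:

```python
# Assumed set of possible scores, from least to most similar
SCORES = [0.0, 0.25, 0.5, 0.75, 1.0]

def score_to_class(score):
    """Map a similarity score to a class index between 0 and 4."""
    return SCORES.index(score)

def class_to_score(label):
    """Map a class index back to its similarity score."""
    return SCORES[label]

# Scores of the first five training rows shown above
head_scores = [0.5, 0.75, 0.25, 0.5, 0.0]
labels = [score_to_class(s) for s in head_scores]
print(labels)  # [2, 3, 1, 2, 0]
```

The mapping is reversible, so a predicted class can always be turned back into a score.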

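We could represent the input to the model as something like *"TEXT1: abatement; TEXT2: eliminating process"*. We'll need to add the context to this too. The exact labels are arbitrary; what matters is that each field is marked consistently. In pandas, we can use `+` to concatenate string columns, like so: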

In [20]:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object
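
To see what a full, untruncated input looks like, the same template can be written as a plain Python function applied to a single row. `build_input` is a helper name made up for this sketch; the labels and field order mirror the pandas expression above, and the row values are copied from row 4 of the training data:

```python
def build_input(context, target, anchor):
    """Combine the three fields into one string, as in the pandas cell above."""
    return f"TEXT1: {context}; TEXT2: {target}; ANC1: {anchor}"

# Row 4 of the training data
row = {"anchor": "abatement", "target": "forest region", "context": "A47"}
example = build_input(row["context"], row["target"], row["anchor"])
print(example)  # TEXT1: A47; TEXT2: forest region; ANC1: abatement
```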