# HTML tags

![html](https://cdn.lynda.com/course/170427/170427-637363828865101045-16x9.jpg)

To get accurate results from your NLP model, give it as much information as you can (only information that is useful and well-formatted, of course).

For example, an image may appear before each important word in a text, or blocks of text may be separated by a lot of spaces. That kind of layout carries meaning that is easily lost when you extract the raw text.

Let's take a concrete use-case:

![text](https://i.imgur.com/2METpwn.png)

You could provide the text from this image like this:

```
Becode 1st December 2020 Cantersteen 10 Bruxelles 1000 Bruxelles Dear learners,
```

But it will be hard for your model to extract meaningful information from it. Even for you, this text would not be easy to read.

A first solution could be to format and sort it.

```
Becode
Cantersteen 10
1000 Bruxelles

1st December 2020
Bruxelles

Dear learners,
```

A bit better, but it's still not perfect: the model doesn't understand your line breaks. It only understands text and spaces (which are part of the text, too).

So we can add a tag. By convention, people often reuse the HTML tag `<br>`, which stands for line **br**eak.

So we can do something like:
```html
Becode
Cantersteen 10
1000 Bruxelles

1st December 2020
Bruxelles
<br>
Dear learners,
```
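
As a rough illustration of the idea above, here is a minimal Python sketch that turns line breaks into explicit `<br>` tokens. The function name `mark_line_breaks` is hypothetical, and tagging every line break (rather than only some of them) is an assumption; depending on your documents, you may want to mark only the breaks that carry meaning.

```python
def mark_line_breaks(text: str, tag: str = "<br>") -> str:
    """Replace every line break with an explicit tag (hypothetical helper)."""
    # Put the tag between lines, padded with spaces so it stays a separate token.
    tagged = f" {tag} ".join(text.strip().splitlines())
    # Collapse the remaining runs of whitespace into single spaces.
    return " ".join(tagged.split())


letter = "Becode\nCantersteen 10\n1000 Bruxelles\n\nDear learners,"
print(mark_line_breaks(letter))
```

An empty line in the input becomes two consecutive tags, so the model can still tell a paragraph break from a simple line break.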

## Create your own tags

Sometimes, you want to add visual information that is not in the text. It could be emojis, recurring images at specific places in front of the text, etc.

In those cases, you can create your own tags, but be careful to only do that if:
1. You are sure that this information will help the model
2. There is enough repetition of this tag to allow the model to understand the meaning of it.

For example, in our letter, we could tell the model that the address and the date are on different sides of the page. We could decide to add a `<LEFT_SECTION>` tag (the name is just an arbitrary choice). If I give the model only this document, or add other documents that don't contain this tag, the model will not understand its meaning! But if I give it 100 documents that use the same tag each time, the model could start to understand the link.

```
Becode
Cantersteen 10
1000 Bruxelles

<LEFT_SECTION>
1st December 2020
Bruxelles
</LEFT_SECTION>


Dear learners,
```
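
A small Python sketch of how such a tag could be added, and how you might check that it appears often enough in your dataset. Both helper names (`wrap_section`, `count_documents_with_tag`) are hypothetical, and how many occurrences count as "enough" is something you have to judge for your own data.

```python
from typing import Iterable


def wrap_section(lines: Iterable[str], tag: str) -> str:
    """Surround a block of lines with an opening and a closing custom tag."""
    return "\n".join([f"<{tag}>", *lines, f"</{tag}>"])


def count_documents_with_tag(documents: Iterable[str], tag: str) -> int:
    """Count how many documents contain the opening tag at least once."""
    return sum(1 for doc in documents if f"<{tag}>" in doc)


date_block = wrap_section(["1st December 2020", "Bruxelles"], "LEFT_SECTION")
print(date_block)
print(count_documents_with_tag([date_block, "Dear learners,"], "LEFT_SECTION"))
```

If the count is very low compared to the size of your dataset, the tag is unlikely to help the model.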

## Tags are sometimes dangerous

**Pro tip:** Sometimes you will find those tags in the extracted text. You should always ask yourself:

*Does it make sense or is it confusing?*

For example, in this text:

```
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of

 type and scrambled it to make a type specimen book. 
```

Here the line breaks don't add any information; they are only there for style and readability.
So if you extract those line breaks (and you will with some document formats), you get:
```
Lorem Ipsum is simply dummy text of the printing and typesetting industry.<br>

Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of<br>

 type and scrambled it to make a type specimen book. 
```

You should remove them! You could, for example, use regular expressions for that.
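
For instance, a minimal sketch with Python's built-in `re` module could look like this. The function name `remove_line_break_tags` is hypothetical, and the pattern assumes the tags are written as `<br>`, `<br/>` or `<br />`.

```python
import re

# Match a <br> tag (also <br/> or <br />) together with the whitespace around it.
BR_TAG = re.compile(r"\s*<br\s*/?>\s*", flags=re.IGNORECASE)


def remove_line_break_tags(text: str) -> str:
    """Replace each <br> tag and its surrounding whitespace with a single space."""
    return BR_TAG.sub(" ", text).strip()


extracted = "took a galley of<br>\n\n type and scrambled it"
print(remove_line_break_tags(extracted))
```

The surrounding whitespace is swallowed together with the tag, so the sentence is stitched back into one line without double spaces.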

You will also encounter formatting tags like `<b>` or `<i>` (bold and italic). Once again, depending on your task, you may want to remove them. If you do document classification, they can bias the model. If you do Named Entity Recognition, they could help the model.

Try to always ask yourself the question:
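
If you decide to drop the formatting tags, one possible sketch is the following. The function name `strip_formatting_tags` is hypothetical, and the pattern only covers plain `<b>` and `<i>` tags without attributes.

```python
import re

# Match opening and closing <b> / <i> tags, but not other tags such as <br>.
FORMATTING_TAG = re.compile(r"</?(?:b|i)\s*>", flags=re.IGNORECASE)


def strip_formatting_tags(text: str) -> str:
    """Remove <b> and <i> tags while keeping the text they enclose."""
    return FORMATTING_TAG.sub("", text)


print(strip_formatting_tags("<b>Becode</b> is in <i>Bruxelles</i>"))
```

Only the tags are removed; the words they enclose stay in place.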

*Would it help me to do the task or not?*

If the answer is no, just remove them.