# 📝 🧮 💡 Text to numbers to insights

Coming up in this talk:

- A high-level intro to neural NLP
- Some hands-on clicking around this notebook to see what it means in practice

Not coming up in this talk:

- Generative LLMs
- ChatGPT
- AI Hype

We are going to move quickly, so do come back to this notebook on your own time, or hmu on email or LinkedIn if you want to go deeper ✨

#### About me

- ~8 years in academia, PhD in NLP (2019)
- Worked in a couple of ML startups
- Team lead MI@Fremtind

---


### 🧐 Language is hard

> _*yo no soy marinero, soy capitán*_

There are many languages

> _*that jaguar is sick*_

Language is contextual

> _*a boujee elon stan*_

Language is constantly evolving

#### How do we represent words?

- Tabular data is collected facts
- Pictures are RGB matrices
- Words (and meaningful sequences of words) have no such obvious numeric form

---

### 🎓 Distributional semantics

> _You shall know a word by the company it keeps_ (Firth, 1957)

```
My son loves to eat [bananas].
[Cookies] are sweet.
```

- The meaning of a word is a function of the meaning of the words it co-occurs with 🤯
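
Before any neural network gets involved, "the company a word keeps" can simply be counted. The sketch below is a minimal illustration, not part of the original talk: the helper `cooccurrence_counts`, the window size of 2, and the last two corpus sentences are assumptions made up for the example.

```python
from collections import Counter, defaultdict

# Toy corpus: the two example sentences (lowercased, punctuation dropped)
# plus two made-up sentences added for illustration (assumption).
corpus = [
    "my son loves to eat bananas",
    "cookies are sweet",
    "my son loves to eat cookies",
    "bananas are sweet",
]


def cooccurrence_counts(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


counts = cooccurrence_counts(corpus)

# "bananas" and "cookies" never appear together, yet they keep the same company.
print(counts["bananas"])
print(counts["cookies"])
```

In this toy corpus the two words end up with identical context counts, which is the intuition behind treating them as similar in meaning.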

But how do we express that function? What is its output?
→ We need some sort of universal function approximator

---

### 🔮 Neural representations

**Word2Vec**, one of the first widely successful neural word representations, at an extremely high level:

- Get a lot of data
- Split the text into words
- Train a feed-forward network to predict a vector of a word given the vectors of the surrounding words
- Update the word vectors by backpropagating the error

→ You end up with a dense representation for each word. This representation has semantic properties!

```
distance(vectors["banana"], vectors["cake"]) < distance(vectors["banana"], vectors["lego"])

vectors.most_similar(vectors["fruit"] - vectors["apple"] + vectors["potato"])
=> "vegetable"
```
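
To make the two properties concrete, here is a small sketch with hand-made vectors. Everything in it is an assumption for illustration: the 4-D toy vectors are not trained embeddings, and `similarity` and `most_similar` are hypothetical helpers, not a real library API. The analogy is read as "apple is to fruit as potato is to ?", computed as `fruit - apple + potato`.

```python
import math

# Hand-made 4-D toy vectors, NOT trained embeddings (assumption).
# Axes, loosely: [food, fruit, vegetable, concrete item]
vectors = {
    "fruit":     [1.0, 1.0, 0.0, 0.0],
    "vegetable": [1.0, 0.0, 1.0, 0.0],
    "apple":     [1.0, 1.0, 0.0, 1.0],
    "potato":    [1.0, 0.0, 1.0, 1.0],
    "banana":    [1.0, 0.9, 0.0, 1.0],
    "cake":      [1.0, 0.2, 0.0, 1.0],
    "lego":      [0.0, 0.0, 0.0, 1.0],
}


def similarity(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, exclude=()):
    """Return the vocabulary word closest to `query`, skipping `exclude`."""
    candidates = (word for word in vectors if word not in exclude)
    return max(candidates, key=lambda word: similarity(query, vectors[word]))


# Property 1: banana is closer to cake than to lego.
banana_closer_to_cake = (
    similarity(vectors["banana"], vectors["cake"])
    > similarity(vectors["banana"], vectors["lego"])
)
print(banana_closer_to_cake)

# Property 2: vector arithmetic for the analogy fruit - apple + potato.
query = [
    f - a + p
    for f, a, p in zip(vectors["fruit"], vectors["apple"], vectors["potato"])
]
answer = most_similar(query, exclude=("fruit", "apple", "potato"))
print(answer)
```

The input words are excluded from the candidates, which is the usual convention for analogy queries; otherwise one of the inputs often wins.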

### Nowadays

- Longer text representations
- Contextual embeddings
- Transformer architectures

In [None]:
import altair as alt
import pandas as pd
import spacy

from bs4 import BeautifulSoup
from urllib.request import urlopen
from sentence_transformers import SentenceTransformer
from umap import UMAP


PREPROCESS = spacy.load("en_core_web_sm")
TRANSFORMER = SentenceTransformer("distiluse-base-multilingual-cased-v2")


def get_text(url):
    """Download a page and return its visible text with whitespace collapsed."""
    page = urlopen(url)
    html = page.read().decode("utf-8")
    soup = BeautifulSoup(html, "html.parser")
    return " ".join(soup.get_text().split())


def get_data_df(data, name=pd.NA):
    """Split text into sentences and embed each one, one row per sentence."""
    data = PREPROCESS(data)
    df = pd.DataFrame()
    # Keep only sentences longer than 80 characters
    df["text"] = [sentence.text for sentence in data.sents if len(sentence.text) > 80]
    df["embedding"] = list(TRANSFORMER.encode(list(df["text"])))
    df["name"] = [name for _ in range(len(df))]
    return df


def reduce_dimensions(df):
    """Project the embeddings down to 2-D with UMAP so they can be plotted."""
    reduced = UMAP().fit_transform(list(df["embedding"]))
    df["x"], df["y"] = reduced[:, 0], reduced[:, 1]
    return df

### 🔵 🔴 🟢 Representing Wikipedia articles with a Transformer

In [None]:
blading = get_data_df(get_text("https://en.wikipedia.org/wiki/Inline_skating"), "blading")
muppets = get_data_df(get_text("https://en.wikipedia.org/wiki/The_Muppets"), "muppets")
skateboarding = get_data_df(get_text("https://en.wikipedia.org/wiki/Skateboarding"), "skateboarding")
smurfs = get_data_df(get_text("https://en.wikipedia.org/wiki/The_Smurfs"), "smurfs")

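# ignore_index=True gives every row a unique index label across the four articles
df = pd.concat([blading, muppets, skateboarding, smurfs], ignore_index=True)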
df = reduce_dimensions(df)

In [None]:
alt.Chart(df.drop_duplicates(subset=["text"])).mark_circle(size=100).encode(
    x="x",
    y="y",
    tooltip=["text", "name"],
    color="name"
).properties(width=800, height=500).interactive()

### 🔮 Semantic search on the cheap

In [None]:
from scipy.spatial.distance import cosine

query = "what are other common names for inline skating?"
embedded_query = TRANSFORMER.encode([query])[0]
df["distance"] = [cosine(embedded_query, embedding) for embedding in df["embedding"]]

list(df.sort_values(by="distance")["text"].head(5))
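
The search above boils down to "rank every sentence by cosine distance to the query and keep the closest few". A dependency-free sketch of that idea follows. The `search` helper, the example texts, and the 3-D vectors are assumptions made up for illustration; real sentence embeddings have hundreds of dimensions.

```python
import heapq
import math


def cosine_distance(u, v):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def search(query_vec, texts, embeddings, top_k=5):
    """Return the `top_k` texts whose embeddings are closest to the query."""
    scored = [
        (cosine_distance(query_vec, emb), text)
        for text, emb in zip(texts, embeddings)
    ]
    return [text for _, text in heapq.nsmallest(top_k, scored)]


# Hand-made 3-D stand-ins for real sentence embeddings (assumption).
texts = ["skating sentence", "muppets sentence", "smurfs sentence"]
embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.2], [0.0, 0.3, 0.9]]
query_vec = [1.0, 0.2, 0.0]

results = search(query_vec, texts, embeddings, top_k=2)
print(results)
```

With real data, `query_vec` would come from encoding the query and `embeddings` from the `embedding` column, so nothing beyond the embeddings needs to be stored or trained.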

### 🔮 Classification with no training

In [None]:
df["label"] = [
    "sports" if x in ("blading", "skateboarding") else "tv-shows" 
    for x in df["name"]
]

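# Try it out with other articles!
wiki = get_text("https://en.wikipedia.org/wiki/Seinfeld")
# Note: the model truncates input past its maximum sequence length,
# so this embedding reflects only the start of the article
embedded_wiki = TRANSFORMER.encode([wiki])[0]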
df["distance"] = [cosine(embedded_wiki, embedding) for embedding in df["embedding"]]

df.sort_values(by="distance")["label"].head(1)
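
Taking `head(1)` amounts to a 1-nearest-neighbour classifier, so a single odd sentence can decide the label. One possible refinement is a majority vote over the k nearest sentences. The sketch below assumes a hypothetical helper `knn_label` and uses made-up distances and labels, not output from the notebook.

```python
from collections import Counter


def knn_label(distances, labels, k=5):
    """Return the most common label among the k smallest distances."""
    ranked = sorted(zip(distances, labels))[:k]
    votes = Counter(label for _, label in ranked)
    return votes.most_common(1)[0][0]


# Made-up distances and labels, for illustration only.
distances = [0.31, 0.35, 0.36, 0.38, 0.40, 0.90]
labels = ["sports", "tv-shows", "tv-shows", "tv-shows", "sports", "sports"]

nearest_one = knn_label(distances, labels, k=1)
nearest_five = knn_label(distances, labels, k=5)
print(nearest_one)   # label of the single nearest sentence
print(nearest_five)  # majority label among the five nearest
```

In this made-up example the single nearest neighbour and the majority of the five nearest disagree, which is exactly the case where voting helps. On the DataFrame, something like `df.nsmallest(5, "distance")["label"].mode()` should give a comparable result.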