In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

pd.set_option('display.max_rows', 120)
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None
%matplotlib inline
sns.set_style("ticks")

# 8. Text Mining

In this Jupyter notebook, we will go through a typical text mining pipeline. Our goal is to train a machine learning model that classifies texts scraped from company websites into two classes: 
- ```0```: Text from a non-software company website.
- ```1```: Text from a software company website.

To do so, we will do the following:
1. Load a labelled and an unlabelled dataset.
2. Preprocess the texts.
3. Vectorize the texts.
4. Split the labelled dataset into a training set and a test set.
5. Train a logit regression classifier.
6. Use the trained logit regression classifier to predict the missing labels in the unlabelled dataset.
7. Calculate text similarities.

## 8.1 Load data

We will work with two different datasets: a labelled dataset and an unlabelled one. For the observations in the labelled dataset, we know whether the texts are from a software company website ("class_1" in the figure) or not ("class_2"). For the observations in the unlabelled dataset, we do not have this information and we want to **predict the missing classes**.

<img src="../misc/labelled_data.png" width="400">

### Load labelled data

First we load "labelled_data.txt", a text file with "labelled" website data. "Labelled" means that each observation is already categorized into a group.

Our data is in a table format with 4 columns:
- **"ID"**: unique identifiers for each observation
- **"url"**: the website address from where text was downloaded
- **"text"**: the downloaded website text
- **"software"**: the label which tells us whether a website is from a software company ("1") or not ("0")

In [None]:
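# on_bad_lines="skip" drops malformed lines (it replaces error_bad_lines=False, which was removed in pandas 2.0)
labelled_data = pd.read_csv("../data/labelled_data.txt", sep="\t", encoding="utf-8", on_bad_lines="skip")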
labelled_data.head(5)

We have 2000 rows (observations) and 4 columns.

In [None]:
labelled_data.shape

Let's have a look at the website of a random software company in our dataset.

In [None]:
from IPython.display import IFrame

random_software_url = labelled_data[labelled_data["software"] == 1].sample(1)["url"].values[0]

display(IFrame(random_software_url, width=800, height=400))

Let's first have a look at the "software" column and check out how many ```1``` (software companies) and ```0``` (other company types) we have in our data. 

For that, we select the "software" column and then use pandas' ```value_counts()``` method to count the ones and zeros in that column. As we can see, there are 1716 ```0``` and 284 ```1``` in the dataset, which means that we have many more non-software firms than software firms in our labelled dataset.

In [None]:
labelled_data["software"].value_counts()

Let's have a closer look at our text data.

First, we may want to see how long our texts are. We can do so by selecting the "text" column and using ```apply(len)``` on it. This returns the length (number of characters) of each text in our dataset. By adding pandas' ```describe()``` method, we get some descriptive statistics, for example the mean number of characters per website text (2558.8).

In [None]:
labelled_data["text"].apply(len).describe()

In [None]:
labelled_data["text"].apply(len).plot(kind="hist", bins=100)

### Load unlabelled data

We can now use the same procedure as above to load a dataset with "unlabelled" data. In our case this means that the dataset contains website data without the "software" label, so we don't know whether the companies are software firms or not...but we will know soon using machine learning magic 🧙!

The ```shape``` attribute tells us that we have 937 observations and three columns (compared to the labelled dataset, we are missing the "software" column).

In [None]:
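# on_bad_lines="skip" drops malformed lines (it replaces error_bad_lines=False, which was removed in pandas 2.0)
unlabelled_data = pd.read_csv("../data/unlabelled_data.txt", sep="\t", encoding="utf-8", on_bad_lines="skip")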
print(unlabelled_data.shape)
unlabelled_data.head(5)

In [None]:
unlabelled_data["text"].apply(len).describe()

The mean text length does not differ much between the two datasets.

In [None]:
labelled_data["text"].apply(len).mean() - unlabelled_data["text"].apply(len).mean()

## 8.2 Text preprocessing

### Excluding short texts

As we saw above, there are quite a few websites with texts that are shorter than 500 characters, and some even have only a single character. Let's exclude them because short or no text usually means little or no information at all.

After that, we have 1,649 observations left in our labelled dataset.

In [None]:
print("Dataframe shape before: ", labelled_data.shape)
labelled_data = labelled_data[labelled_data["text"].apply(len) > 499]
print("Dataframe shape after: ", labelled_data.shape)

The same procedure should also be applied to the unlabelled dataset.

In [None]:
print("Dataframe shape before: ", unlabelled_data.shape)
unlabelled_data = unlabelled_data[unlabelled_data["text"].apply(len) > 499]
print("Dataframe shape after: ", unlabelled_data.shape)

### Standardising text

Right now, the texts in our dataset are exactly as they were downloaded from the company websites.

Let's have a look at an example by displaying the observation with ```ID == 1928```. We also alter a pandas option so that more of each text is displayed.

In [None]:
pd.set_option('display.max_colwidth', 1000)

labelled_data[labelled_data["ID"] == 1928]

As we can see, there can be quite a lot of special characters (e.g. "%" or "€") and numbers in the text which we may want to exclude from our further analysis. We may also want to standardise all characters to lowercase, such that "Software" and "software" are recognized as the same word by the computer.

**Note** however that this is just one of many ways to preprocess your texts. Most modern Natural Language Processing (NLP) pipelines do not preprocess at all.

We will import Python's *regular expression* package ```re``` and apply its ```sub("FILTER", "REPLACE_STRING", "TEXT")``` function to the text column of our labelled dataset. We pass ```sub()``` a so-called regular expression that tells Python to delete all characters in the text that are not part of this list of characters:

```"abcdefghijklmnopqrstuvwxyzäöüß&' "```

The method ```lower()``` will cast all characters to lowercase (we apply it first, so that uppercase letters are not deleted), while ```strip()``` will delete leading and trailing whitespace (e.g. ```"end of sentence    "``` will become ```"end of sentence"```).

We will replace the original text in the "text" column with the result of this operation.
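As a quick illustration of this cleaning step, the sketch below applies the same regular expression to a single made-up sentence. The helper name ```clean_text``` and the sample string are assumptions for illustration only; they are not part of the dataset or the notebook's own code.

```python
import re

# Characters we keep: lowercase letters, German umlauts, "ß", "&", space and apostrophe.
KEEP_PATTERN = re.compile(r"[^abcdefghijklmnopqrstuvwxyzäöüß& ']")

def clean_text(text):
    """Lowercase the text, drop all characters outside the keep list, strip outer whitespace."""
    return KEEP_PATTERN.sub("", str(text).lower()).strip()

# Made-up example sentence with digits, symbols and surrounding whitespace
sample = "  Über 100% Software & Service für 25 €!  "
print(clean_text(sample))
```

Note that deleting characters can leave several consecutive spaces behind. This is usually harmless here, because the word-level vectorizer used later splits the text into words anyway.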

In [None]:
import re

labelled_data["text"] = labelled_data["text"].apply(lambda x: re.sub("[^abcdefghijklmnopqrstuvwxyzäöüß& ']", "", str(x).lower()).strip())

Let's see how this step changed the text of our example above:

In [None]:
labelled_data[labelled_data["ID"] == 1928]

Looks good! Let's apply the same operation on the text column of the unlabelled dataset.

In [None]:
unlabelled_data["text"] = unlabelled_data["text"].apply(lambda x: re.sub("[^abcdefghijklmnopqrstuvwxyzäöüß& ']", "", str(x).lower()).strip())

In [None]:
unlabelled_data.head(4)

## 8.3 Text vectorization

The machine learning algorithms we will use require numerical data as input. Raw text data will not work! This means that we have to convert our texts to some kind of numerical representation without losing too much information. Converting a text from a sequence of characters to a vector of numbers is called *text vectorization*.

<img src="../misc/text_vectorization.png" width="600">

There are many different ways to vectorize texts, from fancy techniques like [word embeddings](https://en.wikipedia.org/wiki/Word_embedding) and topic models like [latent dirichlet allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) to simple [word count models](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

### TFIDF vectorization

Today we will keep it rather simple and use an approach called **TFIDF** ([term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) which basically counts how many times a word appears in a document and then reweights this count by the word's frequency over all known documents. The latter results in a decreased weight for common terms like "the", "and", "house" etc.
- **term frequency (TF)**: The number of times a term $t$ (word) appears in a document $d$ adjusted by the length of the document (number of all words $t'$ in document $d$).

\begin{equation*}
TF(t, d) = \frac{f_{t,d}}{\sum_{t^\prime \in d} f_{t^\prime,d}}
\end{equation*}

- **inverse document frequency (IDF)**: Based on the number of documents $n_t$ in which an individual term $t$ appears, relative to the number of all documents $N$.
\begin{equation*}
IDF(t) = \log{\frac{N}{1 + n_t}}
\end{equation*}

- **term frequency-inverse document frequency (TFIDF)**: The product of both measures, which down-weights common words like "the" and gives more weight to rare words like "software".

\begin{equation*}
TFIDF(t, d) = TF(t, d) \cdot IDF(t)
\end{equation*}
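To make the three formulas above concrete, here is a minimal sketch that computes them by hand in plain Python. The four tiny documents and the function names ```tf```, ```idf``` and ```tfidf``` are made up for illustration. Scikit-learn's vectorizer uses a slightly different, smoothed IDF and normalizes each vector, so its numbers will not match this sketch exactly.

```python
import math

# Four tiny, made-up documents that are already split into words
docs = [
    ["wir", "entwickeln", "software", "und", "apps"],
    ["wir", "bauen", "getreide", "an"],
    ["wir", "backen", "brot"],
    ["der", "hof", "verkauft", "getreide"],
]

def tf(term, doc):
    """Term frequency: count of the term divided by the number of words in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / (1 + n_t))."""
    n_t = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + n_t))

def tfidf(term, doc, docs):
    """TFIDF: term frequency multiplied by inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

# "wir" appears in three of four documents, "software" in only one
for term in ["wir", "software"]:
    print(term, tfidf(term, docs[0], docs))
```

In this toy corpus, the frequent word "wir" ends up with a weight of zero in the first document, while the rare word "software" gets a positive weight.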

We will use scikit-learn's [TFIDF Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to generate TFIDF vectors from our texts. Scikit-learn is one of the most popular machine learning packages for Python and includes all kinds of ML algorithms from clustering to classification.

In [None]:
display(IFrame("https://scikit-learn.org/stable/", width=1200, height=500))

Scikit-learn (sklearn) should come pre-installed with Anaconda. Otherwise, you can install it using pip:

```pip install -U scikit-learn```

### Generating TFIDF vectors from text

Let's import sklearn's ```TfidfVectorizer``` class.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

First, we have to initialize a TFIDF vectorizer. We pass the ```analyzer="word"``` parameter to tell the function to analyze our texts at the level of words (rather than single characters, for example).

In [None]:
vectorizer = TfidfVectorizer(analyzer='word')

We now have to teach our vocabulary to the vectorizer. For that we can use the vectorizer's ```fit()``` method on our texts. In this step, the vectorizer also calculates the inverse document frequencies.

In [None]:
fitted_vectorizer = vectorizer.fit(labelled_data["text"])

We can now use the fitted vectorizer's ```transform()``` method to transform any text document to a TFIDF vector.

Let's transform the sentence ```"dies ist ein test"``` and output the resulting vector using Python's ```print()``` function.

In [None]:
example_tfidf = fitted_vectorizer.transform(["dies ist ein test"])
print(example_tfidf)

The output you see is a so-called *sparse matrix*. In a sparse matrix, only non-zero elements are stored, together with their indexes. This saves A LOT of memory. In the example above, there are only four non-zero elements in the matrix; their coordinates/indexes are given in the parentheses on the left. The values on the right-hand side are the corresponding TFIDF values for the words identified by those coordinates.
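The idea behind sparse storage can be sketched in a few lines of plain Python. The vector values below are made up, and the dictionary layout is only an illustration of the principle; the sparse matrix returned by scikit-learn is a SciPy object with a more compact internal layout.

```python
# A mostly-empty "dense" vector with one slot per vocabulary word (made-up values)
dense_vector = [0.0, 0.0, 0.41, 0.0, 0.0, 0.0, 0.58, 0.0, 0.0, 0.70]

# Sparse form: keep only index -> value pairs for the non-zero entries
sparse_vector = {index: value for index, value in enumerate(dense_vector) if value != 0.0}
print(sparse_vector)

def to_dense(sparse, length):
    """Rebuild the full vector from the sparse mapping."""
    return [sparse.get(index, 0.0) for index in range(length)]

# Converting back gives the original vector, so no information is lost
print(to_dense(sparse_vector, len(dense_vector)) == dense_vector)
```

With a vocabulary of tens of thousands of words and only a few hundred distinct words per text, storing just the non-zero entries is far cheaper than storing every zero.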

Let's have a look at the vocabulary that the vectorizer learned from our data. We access the ```vocabulary_``` attribute of the fitted vectorizer to retrieve the full vocabulary (a Python dictionary that maps each word to its index) and then use a ```for``` loop to print the first items in the vocabulary:

In [None]:
dic  = {"key1": "value1", "key2": "value2"}
dic

In [None]:
dic.items()

In [None]:
vocabulary = fitted_vectorizer.vocabulary_

# little loop to print the first items in the vocabulary
for count, item in enumerate(vocabulary.items()):
    print(item)
    if count >= 10:
        break

We can also get the vocabulary index of a word directly. This only works for words that were learned during the ```fit()``` and thus are in the vocabulary; looking up an unknown word (like "istari" below) raises a ```KeyError```:

In [None]:
vocabulary["test"]

In [None]:
vocabulary["istari"]

But how many words are in our vocabulary, actually? We can see that by calculating its length using Python's ```len()``` function:

In [None]:
len(vocabulary)

That is quite a lot of words. It may be a good idea to shrink our vocabulary a bit, especially because this will reduce both memory consumption and the required computational power when we start doing ML magic!

A common approach to do so is to apply so-called *popularity-based filtering*. Here, we exclude very common and/or extremely uncommon words from our vocabulary. This can be achieved by passing the corresponding parameters to the vectorizer when we create it. 

Let's overwrite our vectorizer and create a new one which includes only those words that appear in a maximum (```max_df```) of 80% and a minimum (```min_df```) of 1% of the websites. 
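What such a filter does can be sketched by hand: count in how many documents each word occurs and keep only the words whose share of documents lies between the two thresholds. The function name ```filter_vocabulary```, the toy documents and the thresholds used here are assumptions for illustration, not the notebook's actual settings.

```python
# Made-up, already tokenized documents
docs = [
    ["der", "hof", "verkauft", "getreide"],
    ["der", "test", "der", "software"],
    ["der", "test", "läuft"],
    ["der", "laden", "verkauft", "brot"],
    ["der", "hof", "testet", "software"],
]

def filter_vocabulary(docs, min_df, max_df):
    """Keep words whose share of documents lies within [min_df, max_df]."""
    n_docs = len(docs)
    document_frequency = {}
    for doc in docs:
        for word in set(doc):  # count each word at most once per document
            document_frequency[word] = document_frequency.get(word, 0) + 1
    return sorted(
        word for word, count in document_frequency.items()
        if min_df <= count / n_docs <= max_df
    )

# "der" occurs in every document and is dropped; words seen only once are dropped too
print(filter_vocabulary(docs, min_df=0.4, max_df=0.8))
```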

In [None]:
vectorizer = TfidfVectorizer(analyzer='word', min_df=0.01, max_df=0.8)
fitted_vectorizer = vectorizer.fit(labelled_data["text"])
vocabulary = fitted_vectorizer.vocabulary_

In [None]:
TfidfVectorizer()

The vocabulary should be way smaller now:

In [None]:
len(vocabulary)

This should have excluded super frequent words like "der", so the lookup below is expected to raise a ```KeyError```.

In [None]:
vocabulary["der"]

Moderately common words like "test" should still be included.

In [None]:
vocabulary["test"]

## 8.4 Train and test sets

We will now split our labelled dataset into a **training set** and a **test set**. The training set will be used to train our machine learning model to predict the correct labels (classes). After the training, we will use the trained model to predict the labels of all the observations in the test set, which was not used for training. Based on the prediction performance in the test set, we can evaluate the prediction performance of our trained model.

This two-step approach is used to make sure that the ML model does not simply memorize all the observations in the training data, but instead derives universal rules on how to distinguish the different classes. This universal ability is called **generalization** and is very desirable in machine learning. In contrast, the over-memorization of the training set and a model's resulting bad performance on other data is called **overfitting**.

<img src="../misc/training_split.png" width="500">

The **training** of a machine learning model describes the process of teaching the model how to achieve a certain learning task. In the case of a classification task, we give the model a list of properties $X$ (**features** or **attributes**) that are used to calculate a predicted outcome (**label** or **class**) $\hat{Y}$. We then compare the predicted outcome $\hat{Y}$ to the true outcome $Y$, which we know because we have a labelled dataset. In our example, we will use the trained ML model to predict whether a text comes from a software company website ($\hat{Y}$ = 1 or 0) and then compare our predictions against the true values of $Y$. 

The difference between the predicted and the true outcome is then used to calculate an **error**. We then start to **optimize** (**train**) the model by adjusting its internal numbers $W$ (**weights**) to minimize the error.
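The loop of predicting, measuring the error and adjusting the weights can be sketched with plain gradient descent on a toy problem. The data, the learning rate and the number of steps are made up for illustration. Scikit-learn's logistic regression uses a more sophisticated solver and adds regularization by default, so this is only a sketch of the principle, not of its implementation.

```python
import math

# Toy data: one feature per observation and a binary label (made-up values)
X = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
Y = [0, 0, 0, 1, 1, 1]

def predict(x, w, b):
    """Predicted probability that the label is 1 (logistic function)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def error(w, b):
    """Mean log loss between the predictions and the true labels."""
    return -sum(
        y * math.log(predict(x, w, b)) + (1 - y) * math.log(1 - predict(x, w, b))
        for x, y in zip(X, Y)
    ) / len(X)

w, b = 0.0, 0.0        # initial weights
learning_rate = 0.1    # hypothetical step size
initial_error = error(w, b)

for step in range(500):
    # Gradient of the log loss with respect to w and b
    grad_w = sum((predict(x, w, b) - y) * x for x, y in zip(X, Y)) / len(X)
    grad_b = sum((predict(x, w, b) - y) for x, y in zip(X, Y)) / len(X)
    # Move the weights a small step in the direction that reduces the error
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_error = error(w, b)
print(initial_error, final_error)
```

The error after training should be clearly lower than the error of the untrained model, which predicts a probability of 0.5 for every observation.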

<img src="../misc/training.png" width="500">

This type of training is called **supervised learning** because we supervise the training by comparing each label predicted by the model to its known true label.

In our case, the features (attributes) of our observations are the texts as TFIDF vectors and our labels are the "1" and "0" classes in the "software" column. In another ML task, the attributes could be, for example, the properties/features of a house (location, size, number of rooms etc.) and the outcome we want to predict could be the house's selling price.

So let's first shuffle our data (always do that to make sure there is no systematic order in your data) and then create a list with our features $X$ and a list with the corresponding labels $Y$. For the features, we select the text column from our labelled dataset and transform the texts to TFIDF vectors using our fitted vectorizer. 

In [None]:
labelled_data = labelled_data.sample(frac=1.0, random_state=12) # a fixed random_state ensures that the shuffle will result in the same order every time
features = fitted_vectorizer.transform(labelled_data["text"])
labels = labelled_data["software"]

Remember that we had 1,649 observations in our labelled dataset. 

In [None]:
labelled_data.shape

Let's take the first 1,250 observations for the training set. The remaining observations will be assigned to the test set and put aside for the moment.

In [None]:
features_trainset = features[:1250]
labels_trainset = labels[:1250]

features_testset = features[1250:]
labels_testset = labels[1250:]

*Note: In a real-life ML task, you should fit your TFIDF vectorizer on the training set only, not on the test set, so that no information from the test set leaks into the model.*
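One possible way to follow this advice is to split the raw texts first and let a scikit-learn pipeline fit the vectorizer on the training part only. The sketch below uses ```train_test_split``` and ```make_pipeline``` from scikit-learn on a made-up mini corpus; the texts, labels and split size are assumptions for illustration, and this is an alternative to the manual slicing above rather than the notebook's own procedure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up mini corpus; in the notebook this would be the "text" and "software" columns
texts = [
    "wir programmieren software", "software entwicklung und apps",
    "cloud software für unternehmen", "wir entwickeln apps",
    "bauernhof mit getreide", "bäckerei mit frischem brot",
    "wir verkaufen möbel", "autohaus und werkstatt",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw texts first; stratify keeps the class shares similar in both parts
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=12
)

# The pipeline fits the vectorizer on the training texts only
model = make_pipeline(
    TfidfVectorizer(analyzer="word"),
    LogisticRegression(class_weight="balanced"),
)
model.fit(texts_train, labels_train)

# The test texts are transformed with the vocabulary learned from the training set
print(model.predict(texts_test))
```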

## 8.5 Training a logit regression classifier

Let's train our first model. We will start with something basic: A [logistic regression classifier](https://en.wikipedia.org/wiki/Logistic_regression). This classifier is a pretty popular model for binary outcome variables. Again, we will use scikit-learn for this task.

In [None]:
from sklearn.linear_model import LogisticRegression

First, we have to initialize the logistic regression model. We pass it the parameter ```class_weight="balanced"``` because we have a pretty unbalanced dataset (one class is way more frequent than the other). The ```"balanced"``` setting makes sure that the model pays more attention to the infrequent class (in our case the "software" ```1``` class).
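To get a feeling for what ```"balanced"``` means, the sketch below computes class weights with the heuristic described in the scikit-learn documentation, ```n_samples / (n_classes * count_of_class)```. It uses the class counts of the full labelled dataset reported earlier as an assumption; the actual training set is smaller, so the weights used by the model will differ somewhat.

```python
# Class counts of the full labelled dataset (before filtering and splitting)
class_counts = {0: 1716, 1: 284}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
class_weights = {
    label: n_samples / (n_classes * count) for label, count in class_counts.items()
}
print(class_weights)
```

Errors on the rare software class are weighted roughly six times as heavily as errors on the frequent non-software class in this example.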

In [None]:
logit_classifier = LogisticRegression(class_weight="balanced")

We can now train the classifier using its ```fit()``` method and passing the features and corresponding labels of our training set.

In [None]:
trained_logit_classifier = logit_classifier.fit(features_trainset, labels_trainset)

You just trained your first machine learning model! But how good is it at distinguishing the web texts of software firms from other firm types? 

Let's test that with an example sentence that we convert to a TFIDF vector using our fitted vectorizer:

In [None]:
bauernhof_tfidf = fitted_vectorizer.transform(["das ist ein bauernhof und wir bauen getreide an"])

In [None]:
print(bauernhof_tfidf)

We can now pass this TFIDF vector to our trained logit classifier and tell it to predict its label using the ```predict()``` method:

In [None]:
trained_logit_classifier.predict(bauernhof_tfidf)

The predicted label is ```0``` (aka "not a software company"). Awesome 🥳🥳🥳!

We can also check out the probability for both classes by using the ```predict_proba()``` method:

In [None]:
trained_logit_classifier.predict_proba(bauernhof_tfidf)

The first number above gives you the probability that the label is ```0``` while the second number is the probability that the label is ```1```. So the classifier is not too confident (about 60-65%) that the text comes from the website of a non-software company (**WARNING: Your results may differ!**).
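For a binary logistic regression, these probabilities come from passing a weighted sum of the features through the logistic (sigmoid) function. The sketch below shows the calculation with hypothetical weights, a hypothetical intercept and made-up TFIDF values; none of these numbers are taken from the trained model.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for three vocabulary words and a hypothetical intercept
weights = {"software": 2.1, "programmieren": 1.4, "getreide": -1.8}
intercept = -0.5

# Hypothetical TFIDF values of one text
text_vector = {"software": 0.7, "programmieren": 0.7}

# Weighted sum of the features plus intercept
score = intercept + sum(weights.get(word, 0.0) * value for word, value in text_vector.items())

prob_software = sigmoid(score)
print([1 - prob_software, prob_software])  # same order as the classes 0 and 1
```

Words with positive weights push the probability towards class ```1```, words with negative weights push it towards class ```0```.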

Let's try one more example:

In [None]:
programmer_tfidf = fitted_vectorizer.transform(["wir programmieren software"])
trained_logit_classifier.predict_proba(programmer_tfidf)

In [None]:
trained_logit_classifier.predict(programmer_tfidf)

Wow! In this example, the model is very confident (about 97%) that the text comes from a software company (again, your results may differ).

We should now test our trained classifier using the test set which we put aside:

In [None]:
predicted_labels = trained_logit_classifier.predict(features_testset)

We can now ```print()``` the predicted labels...

In [None]:
print(predicted_labels)

...and the true labels...

In [None]:
print(labels_testset.values)

...and compare them one by one.

In [None]:
print(predicted_labels == labels_testset.values)

Wow...at first glance that looks pretty convincing. It seems like most of the predicted labels match their true counterparts, but not all of them.

Let's quantify the classifier's prediction performance by generating a scikit-learn **classification report**. The report contains several measures that allow us to evaluate the performance of our trained model in the test set:

- **precision**: the fraction of observations predicted to have a given label (e.g. $\hat{y} = 1$) that actually have this true label ($y = 1$). The same is computed for label ```0```.
- **recall**: the fraction of observations with a given true label (e.g. $y = 1$) that were also predicted to have this label ($\hat{y} = 1$). The same is computed for label ```0```.
- **f1-score**: a composite measure that combines both precision and recall (their harmonic mean).
- **support**: simply the number of observations with this true label in the test set.
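The four measures can be computed by hand for label ```1``` from a small set of made-up true and predicted labels. The lists below are invented for illustration and are not the notebook's test set.

```python
# Made-up true and predicted labels for ten observations
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(true_labels, predicted))
true_positives = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 1)
false_positives = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 1)
false_negatives = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 0)

# Share of predicted 1s that are truly 1
precision = true_positives / (true_positives + false_positives)
# Share of true 1s that were found
recall = true_positives / (true_positives + false_negatives)
# Harmonic mean of precision and recall
f1_score = 2 * precision * recall / (precision + recall)
# Number of observations with true label 1
support = sum(1 for y in true_labels if y == 1)

print(precision, recall, f1_score, support)
```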

<img src="../misc/classification_report.png" width="200">

In [None]:
from sklearn.metrics import classification_report

print(classification_report(labels_testset, predicted_labels))

Not too bad! As you can see, the classification report gives you precision, recall, f1-score, and support for both labels (```1``` and ```0```). We could summarize the report as follows:
- 95% of the observations that were labeled ```0``` by the classifier actually have the true label ```0```. For label ```1``` this value is only 71%.
- 92% of the observations that have the true label ```0``` were also predicted to have the label ```0``` by the classifier. For label ```1``` this value is only 75%.

So if our goal was to identify most of the software firms in the unlabelled dataset (**true positives**) while limiting the number of non-software firms that are wrongly classified as software firms (**false positives**), we could say:

*About 7 out of 10 firms that were predicted to be software firms are actually software firms, and we are able to recover roughly 3 out of 4 software firms in the dataset.*

As a final step, we can create a new column with our predicted labels in our labelled dataset...

In [None]:
labelled_data.loc[labelled_data.index[1250:], 'prediction'] = predicted_labels

...and have a look at a random observation that has been assigned the wrong class. 

In [None]:
wrong_prediction = labelled_data[(labelled_data["software"] != labelled_data["prediction"]) & (labelled_data["prediction"].notnull())].sample(1)
wrong_prediction

In [None]:
display(IFrame(wrong_prediction["url"].values[0], width=1200, height=350))

Finally, we create a new column "predicted_software" in our unlabelled dataset and predict the labels using our trained logit classifier.

In [None]:
# vectorize all texts at once and predict their labels in a single call
unlabelled_data["predicted_software"] = trained_logit_classifier.predict(fitted_vectorizer.transform(unlabelled_data["text"]))
unlabelled_data.head(3)

Our ML model found about 125 software firms in our unlabelled dataset. 

In [None]:
unlabelled_data["predicted_software"].value_counts()

## 8.6 Calculating text similarities

Representing texts as vectors has another advantage: It allows us to easily calculate distances between pairs of texts. By calculating such pairwise distances, we are able to identify pairs of texts that are very similar (close) or dissimilar (distant) to each other. Calculating the distances between observations in our dataset can be understood as comparing the business activities of the companies as they are described on their websites. 

Let's try this out and see whether we can find Geographic Information System (GIS) software companies in our unlabelled dataset!

First, we want to retrain our TFIDF vectorizer on our unlabelled dataset without the popularity-based filtering. This makes sure that all the words (about 42,000) in the unlabelled dataset are included in the resulting vocabulary.

In [None]:
vectorizer = TfidfVectorizer(analyzer='word')
trained_vectorizer = vectorizer.fit(unlabelled_data["text"])
vocabulary = vectorizer.vocabulary_
len(vocabulary)

Next, we create a single TFIDF vector from a description of the company type we are looking for.

In [None]:
search_tfidf = trained_vectorizer.transform(["gis geographie geographische informationssysteme geodaten"])

Let's create another TFIDF vector from AI keywords, too.

In [None]:
search_tfidf_2 = trained_vectorizer.transform(["künstliche intelligenz artificial intelligence ai ki machine learning maschinelles lernen"])

In [None]:
print(search_tfidf)

We will use the popular [**cosine similarity**](https://www.machinelearningplus.com/nlp/cosine-similarity/) to calculate the similarity between our TFIDF vectors. Cosine similarity measures the similarity between vectors based on their orientation in the high-dimensional vector space they live in. The smaller the angle between two document vectors, the more similar the documents are.
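Cosine similarity is simple enough to compute by hand: it is the dot product of two vectors divided by the product of their lengths. The sketch below uses made-up three-dimensional vectors; the function name ```cosine_similarity_manual``` is hypothetical, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine_similarity_manual(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no direction to compare for an all-zero vector
    return dot_product / (norm_a * norm_b)

query = [1.0, 1.0, 0.0]
doc_a = [2.0, 2.0, 0.0]  # same direction as the query, only longer
doc_b = [0.0, 0.0, 3.0]  # shares no dimension with the query
doc_c = [1.0, 0.0, 1.0]  # shares one dimension with the query

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b), ("doc_c", doc_c)]:
    print(name, cosine_similarity_manual(query, doc))
```

Because only the direction matters, a long text and a short text that use the same words in the same proportions are treated as very similar.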

Again, we will use a function from scikit-learn, [```cosine_similarity()```](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html), to calculate the cosine similarity between our search TFIDF vectors and the TFIDF vectors representing the company website texts in our unlabelled dataset. We do so by applying the function to the text column and creating the new columns "similarity_gis" and "similarity_ai" with the output.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

unlabelled_data["similarity_gis"] = unlabelled_data["text"].apply(lambda x: cosine_similarity(search_tfidf, trained_vectorizer.transform([x])).tolist()[0][0])

unlabelled_data["similarity_ai"] = unlabelled_data["text"].apply(lambda x: cosine_similarity(search_tfidf_2, trained_vectorizer.transform([x])).tolist()[0][0])

Let's have a look at those software firms that are most similar to our search vectors. We can do that by restricting our dataframe to predicted software firms and sorting it (```sort_values()```) by our new similarity columns. 

In [None]:
top_hit = unlabelled_data[unlabelled_data["predicted_software"] == 1].sort_values(by=["similarity_gis"], ascending=False).head(1)
top_hit

In [None]:
display(IFrame(top_hit["url"].values[0], width=1200, height=350))

In [None]:
top_hit_ai = unlabelled_data[unlabelled_data["predicted_software"] == 1].sort_values(by=["similarity_ai"], ascending=False).head(5)
top_hit_ai

In [None]:
display(IFrame(top_hit_ai["url"].values[0], width=1200, height=350))