<h1><center>EZ Text Mining with Python</center></h1>

<center>Jan Kinne</center>

If you work in Colab, run the line below to download the example data from the GitHub repo.

In [None]:
!npx degit datawizard1337/EZ_Text_Mining -f

# Content

In this Jupyter notebook, we will go through a typical text mining pipeline. Our goal is to train a machine learning model that classifies texts scraped from company websites into two classes: 
- **0**: Text from a non-software company website.
- **1**: Text from a software company website.

Chapters:
1. Load a labelled and an unlabelled dataset.
2. Preprocess the texts.
3. Vectorize the texts.
4. Split the labelled dataset into a training set and a test set.
5. Train a logit regression classifier.
6. Calculate text similarities.
7. Train a simple artificial neural network classifier.

First, we load [pandas](https://pandas.pydata.org/), a Python package that allows us to load, manipulate, and analyse tabular data. It is a very popular package and the de facto standard for data scientists. We will shorten the name pandas to "pd" because we will have to type it quite often during our analysis. We will also set an option so that plots are rendered directly in this Jupyter notebook.

In [None]:
import pandas as pd

%matplotlib inline 

# 1. Load data

We will now use the pandas **read_csv()** function to read in the text files containing our data. The function returns an object (a so-called "pandas dataframe") that holds our data in a table format.

<img src="pics/labelled_data.png" width="400">

### Load labelled data

First we load "labelled_data.txt", a text file with "labelled" website data. "Labelled" means that each observation is already categorized as being part of a specific group or class.

In [None]:
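# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0 (needs pandas >= 1.3)
labelled_data = pd.read_csv("labelled_data.txt", sep="\t", encoding="utf-8", on_bad_lines="skip")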

We can now use pandas' **head()** method on our dataframe to show the first lines of our loaded labelled dataset. As we can see, our data is in a table format with 4 columns:
- "ID": unique identifiers for each observation
- "url": the website address from where the text was downloaded
- "text": the downloaded website text
- "software": the label which tells us whether a website is from a software company ("1") or not ("0")

In [None]:
labelled_data.head(5)

We can use pandas' **shape** attribute on our dataframe to see how many rows and columns the dataframe has.

The output tells us that we have 2,000 rows and 4 columns.

In [None]:
labelled_data.shape

Let's first have a look at the "software" column and check out how many "1"s (software companies) and "0"s (other company types) we have in our data. 

For that we select the "software" column by placing it in square brackets behind our dataframe and then use pandas' **value_counts()** method to count the ones and zeros in that column. As we can see, 1,716 "0"s and 284 "1"s are in the dataset, which means that we have many more non-software firms than software firms in our labelled dataset.

In [None]:
labelled_data["software"].value_counts()

Let's have a closer look at our text data.

First, we may want to see how long our texts are. We can do so by selecting the "text" column and using **apply(len)** on it. This returns the length (number of characters) of each text in our dataset. By adding pandas' **describe()** method, we get some descriptive statistics, for example the mean number of characters per website text (2,558.8).

In [None]:
labelled_data["text"].apply(len).describe()

By adding an additional statement in the format **DATA[DATA["COLUMN"] == CONDITION]** we restrict the analysis to software firms only.

In [None]:
labelled_data[labelled_data["software"] == 1]["text"].apply(len).describe()

We can also easily plot the distribution of website text lengths by using pandas' **plot()** method and passing the keywords for a histogram with 100 bins (**kind="hist", bins=100**).

In [None]:
labelled_data["text"].apply(len).plot(kind="hist", bins=100)

### Load unlabelled data

We can now use the same procedure as above to load a dataset with "unlabelled" data. In our case this means that the dataset contains website data without the "software" label, so we don't know whether the companies are software firms or not...but we will know soon by applying some machine learning magic!

In [None]:
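# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0 (needs pandas >= 1.3)
unlabelled_data = pd.read_csv("unlabelled_data.txt", sep="\t", encoding="utf-8", on_bad_lines="skip")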
unlabelled_data.head(5)

Using the **shape** attribute, we see that we have 937 observations in this dataset.

In [None]:
unlabelled_data.shape

# 2. Text preprocessing

### Excluding short texts

As we saw above, there are quite a number of websites with texts shorter than 500 characters, and some even have only a single character in their "text" column. We can exclude such websites from our further analysis by overwriting the dataframe with a selection from the same dataframe restricted to websites with more than 499 characters in the text column.

**shape** tells us that we now have 1,649 observations left in our labelled dataset.

In [None]:
labelled_data = labelled_data[labelled_data["text"].apply(len) > 499]
labelled_data.shape

The same procedure should also be applied to the unlabelled dataset.

In [None]:
unlabelled_data = unlabelled_data[unlabelled_data["text"].apply(len) > 499]
unlabelled_data.shape

### Standardising text

Right now, the texts in our dataset are exactly as they were downloaded from the company websites.

Let's have a look at an example by displaying the observation with ID 697. We also alter a pandas option so that more of each text is displayed.

In [None]:
pd.set_option('display.max_colwidth', 1000)

labelled_data[labelled_data["ID"] == 697]

As you can see, there can be quite a lot of special characters and numbers in the text, which we may want to exclude from our further analysis. We may also want to standardise all characters to lowercase, such that "Software" and "software" are recognized as the same word by the computer.

We will import Python's "regular expression" module (**re**) and apply its **sub("FILTER", "REPLACE_STRING", TEXT)** function to the text column of our labelled dataset. We pass sub() a regular expression telling it to delete all characters in the text that are not part of this list of characters: **"abcdefghijklmnopqrstuvwxyzäöüß&. '"**. 

The method **lower()** will cast all characters to lowercase, while **strip()** will delete leading and trailing whitespace (e.g. "  last word    " --> "last word").

The results of this *operation* will be used to overwrite the original "text" column by using:

labelled_data["text"] = *operation*

In [None]:
import re

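# Keep lowercase letters, umlauts, ß, &, period, space, and apostrophe (same whitelist as for the unlabelled data below)
labelled_data["text"] = labelled_data["text"].apply(lambda text: re.sub("[^abcdefghijklmnopqrstuvwxyzäöüß&. ']", "", str(text).lower()).strip())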

Let's see how this step changed the text of our example above:

In [None]:
labelled_data[labelled_data["ID"] == 697]

Looks good! Let's apply the same operation on the text column of the unlabelled dataset.

In [None]:
unlabelled_data["text"] = unlabelled_data["text"].apply(lambda text: re.sub("[^abcdefghijklmnopqrstuvwxyzäöüß&. ']", "", str(text).lower()).strip())
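
To see what this cleaning step does in isolation, here is a minimal sketch on a made-up sentence. The helper name `clean_text` and the sample string are assumptions for illustration only; the pattern is the character whitelist described above.

```python
import re

# Everything that is NOT a lowercase letter, umlaut, ß, &, period, space, or apostrophe
UNWANTED = re.compile(r"[^abcdefghijklmnopqrstuvwxyzäöüß&. ']")

def clean_text(text):
    """Lowercase the text, delete unwanted characters, and trim outer whitespace."""
    return UNWANTED.sub("", str(text).lower()).strip()

sample = "  Über 20 Jahre Software-Entwicklung & Beratung!  "
cleaned = clean_text(sample)
print(repr(cleaned))
```

The digits, the hyphen, and the exclamation mark are deleted. Note that deleting "20" leaves two spaces in a row; this should not matter for the word-based vectorizer used later.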

# 3. Text vectorization

The machine learning algorithms we will use require us to input our data in a numerical form. Raw text data as input will not work! This means that we have to transfer our texts to some kind of numerical representation without losing too much information. Transferring a text from a sequence of characters to a vector of numbers is called "text vectorization".

<img src="pics/text_vectorization.png" width="600">

There are many different ways to vectorize texts, from fancy techniques like [word embeddings](https://en.wikipedia.org/wiki/Word_embedding) and topic models like [latent Dirichlet allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) to simple [word count models](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

### TF-IDF vectorization

Today we will keep it rather simple and use an approach called **TF-IDF** ([term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)).
- **term frequency (TF)**: The number of times a term $t$ (word) appears in a document $d$, divided by the length of the document (the total count of all terms $t'$ in document $d$).

\begin{equation*}
TF(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}
\end{equation*}

- **inverse document frequency (IDF)**: The logarithm of the total number of documents $N$ divided by the number of documents $n_t$ in which the term $t$ appears (plus one, to avoid a division by zero).
\begin{equation*}
IDF(t) = \log{\frac{N}{1 + n_t}}
\end{equation*}

- **term frequency-inverse document frequency (TFIDF)**: This step weights down common words like "the" and gives more weight to rarer words like "software".

\begin{equation*}
TFIDF(t, d) = TF(t, d) * IDF(t)
\end{equation*}
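
The three formulas above can be traced by hand in a few lines of plain Python. This is only a sketch on four made-up mini documents, and the function names are chosen for illustration. Keep in mind that scikit-learn's vectorizer uses, by default, a smoothed IDF variant and normalises each vector, so its numbers will not match this sketch exactly.

```python
import math

# Four made-up "website texts", already cleaned and lowercased
documents = [
    "wir entwickeln software",
    "wir verkaufen brot",
    "software und beratung",
    "wir backen brot und kuchen",
]
tokenized = [doc.split() for doc in documents]
N = len(tokenized)  # total number of documents

def tf(term, tokens):
    """Term frequency: count of the term divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """Inverse document frequency: log(N / (1 + n_t))."""
    n_t = sum(1 for tokens in tokenized if term in tokens)
    return math.log(N / (1 + n_t))

def tfidf(term, tokens):
    """TF-IDF: product of term frequency and inverse document frequency."""
    return tf(term, tokens) * idf(term)

first_doc = tokenized[0]
for term in first_doc:
    print(term, round(tfidf(term, first_doc), 3))
```

In the first document, the common word "wir" (found in three of the four documents) gets a weight of zero, while the rarer word "entwickeln" gets the highest weight.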

We will use scikit-learn's [TF-IDF Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to generate TF-IDF vectors from our texts. Scikit-learn is the most popular machine learning package for Python and includes all kinds of ML algorithms for tasks like classification and clustering, but also handy preprocessing tools.

<img src="pics/sklearn.png" width="500">

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

### Generating TF-IDF vectors from text

*Note: In this tutorial, we fit our TF-IDF vectorizer on our entire labelled dataset. In real life, you should instead fit your vectorizer on the "training set" part of your labelled dataset only.*

First, we have to initialize a TF-IDF vectorizer. 

In [None]:
vectorizer = TfidfVectorizer(analyzer='word')

We now have to teach our vocabulary to the vectorizer. For that we can use the vectorizer's **fit()** method on our texts. In this step, the vectorizer also calculates the inverse document frequencies.

In [None]:
trained_vectorizer = vectorizer.fit(labelled_data["text"])

We can now use the trained vectorizer's **transform()** method to transform any text document to a TF-IDF vector.

Let's transform the sentence "dies ist ein Test" and output the resulting vector using Python's **print()** function.

In [None]:
example_tfidf = trained_vectorizer.transform(["dies ist ein test"])
print(example_tfidf)

The output you see is a so-called sparse matrix. In a sparse matrix, only non-zero elements are stored and mapped using indexes. This actually saves A LOT of memory. In the example above, there are only four non-zero elements in the matrix and their coordinates/indexes are given in the parentheses on the left. The elements on the right-hand side give you the corresponding TF-IDF value for the word mapped by the coordinates.
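
The idea behind sparse storage can be sketched with a plain Python dictionary. This is only an illustration with made-up values; it is not how scipy stores its matrices internally.

```python
# A made-up dense row vector: most entries are zero
dense_row = [0.0, 0.0, 0.42, 0.0, 0.0, 0.0, 0.71, 0.0, 0.0, 0.56]

# Sparse version: remember only (row, column) -> value for the non-zero entries
sparse_row = {(0, col): value for col, value in enumerate(dense_row) if value != 0.0}

for coordinates, value in sparse_row.items():
    print(coordinates, value)

# The dense vector can be rebuilt at any time by filling the gaps with zeros
restored = [sparse_row.get((0, col), 0.0) for col in range(len(dense_row))]
```

Here only three of the ten entries need to be stored. With a vocabulary of thousands of words and short texts, the savings become much larger.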

Let's have a look at the vocabulary the vectorizer learned from our data. We access the **vocabulary_** attribute of the trained vectorizer to retrieve the full vocabulary and then use a little loop to print the first items in the vocabulary:

In [None]:
vocabulary = trained_vectorizer.vocabulary_

#little loop to print the first items in the vocabulary
for count, item in enumerate(iter(vocabulary.items())):
    print(item)
    if count >= 10:
        break

We can also retrieve the vocabulary index of a word directly:

In [None]:
vocabulary["test"]

But how many words are in our vocabulary, actually? We can see that by calculating its length using Python's **len()** function:

In [None]:
len(vocabulary)

Those are quite a lot of words. It may be a good idea to shrink down our vocabulary a bit, especially because this will reduce both memory consumption and required computational power when we start doing ML magic!

A common approach to do so is to apply so-called popularity-based filtering. Here, we exclude very common and/or extremely uncommon words from our vocabulary. This can be achieved by passing the corresponding parameters to the vectorizer when we initialize it. 

Let's overwrite our vectorizer and create a new one which includes words that appear in at least 1% of the documents.

In [None]:
vectorizer = TfidfVectorizer(analyzer='word', min_df=0.01)
trained_vectorizer = vectorizer.fit(labelled_data["text"])

The vocabulary should be way smaller now:

In [None]:
vocabulary = trained_vectorizer.vocabulary_
len(vocabulary)
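
What the minimum document frequency filter does can be sketched in plain Python. The documents and the function name `frequent_vocabulary` are made up for illustration, and the threshold of 50% is chosen only because the toy corpus is so small.

```python
from collections import Counter

documents = [
    "wir entwickeln software",
    "wir verkaufen brot",
    "software und beratung",
    "wir backen brot und kuchen",
]

def frequent_vocabulary(docs, min_df):
    """Keep only words that appear in at least a share `min_df` of the documents."""
    document_frequency = Counter()
    for doc in docs:
        # set() makes sure each word is counted once per document
        document_frequency.update(set(doc.split()))
    threshold = min_df * len(docs)
    return sorted(word for word, count in document_frequency.items() if count >= threshold)

full_vocabulary = frequent_vocabulary(documents, min_df=0.0)
small_vocabulary = frequent_vocabulary(documents, min_df=0.5)
print(len(full_vocabulary), small_vocabulary)
```

Of the nine distinct words, only the four that appear in at least two of the four documents survive the filter.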

# 4. Train and test sets

We will now split our labelled dataset into a **training set** and a **test set**. The training set will be used to train our machine learning model to predict the correct labels (classes) of the observations in the training set based on their corresponding texts. After training, we will use the trained model to predict the labels of all the observations in the test set (which was not used for training). Based on the prediction performance in the test set, we can evaluate our trained model.

This two-step approach is used to make sure that the ML model does not simply memorize all the observations in the training data, but instead derives universal rules or patterns to distinguish the different classes. This ability is called **generalization** and it is very desirable in machine learning. In contrast, overspecialization on the training set, and the resulting bad performance of a model on other data, is called **overfitting**.

<img src="pics/training_split.png" width="500">

The **training** of a machine learning model describes the process of teaching the model how to achieve a certain learning task. In the case of classification tasks, we give the model a list of properties $X$ (**features** or **attributes**) that are used to calculate a predicted outcome (label/class) $\hat{Y}$. We then compare the predicted outcome $\hat{Y}$ to the true outcome $Y$ that we know because we have a labelled dataset. The difference between the predicted and the true outcome is then used to calculate an **error**. We then start to **optimize** (train) the model by adjusting its internal numbers $W$ (**weights**) to minimize the error.

<img src="pics/training.png" width="500">
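
The predict, compare, adjust cycle described above can be sketched as a tiny training loop in plain Python. This is a toy example under stated assumptions: a single made-up feature per observation, a logistic model with one weight `w` and one offset `b`, and plain gradient descent as the optimizer. The libraries used later in this notebook rely on more refined optimizers.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (made up): one feature per observation and its true label
X = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
Y = [0, 0, 0, 1, 1, 1]

def mean_error(w, b):
    """Average cross-entropy error between predictions and true labels."""
    total = 0.0
    for x, y in zip(X, Y):
        y_hat = sigmoid(w * x + b)
        total -= y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return total / len(X)

w, b = 0.0, 0.0  # the model's internal numbers (weights) before training
learning_rate = 0.5
error_before = mean_error(w, b)

for step in range(1000):
    grad_w = grad_b = 0.0
    for x, y in zip(X, Y):
        y_hat = sigmoid(w * x + b)  # predicted outcome
        grad_w += (y_hat - y) * x   # how the error changes with w
        grad_b += (y_hat - y)       # how the error changes with b
    # Adjust the weights a little in the direction that lowers the error
    w -= learning_rate * grad_w / len(X)
    b -= learning_rate * grad_b / len(X)

error_after = mean_error(w, b)
predictions = [round(sigmoid(w * x + b)) for x in X]
print(error_before, error_after, predictions)
```

The error after training is lower than before, and the rounded predictions match the true labels of this toy dataset.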

In our case, the features (attributes) of our observations are the texts represented as numerical TFIDF vectors and our labels are the "1" and "0" classes in the "software" column. 

So let's first shuffle our data and then create a list with our features $X$ and a list with the corresponding labels $Y$. For the features we select the text column from our labelled dataset and transform the texts to TFIDF vectors using our trained vectorizer. 

In [None]:
labelled_data = labelled_data.sample(frac=1.0, random_state=12)
features = trained_vectorizer.transform(labelled_data["text"])
labels = labelled_data["software"]

Remember that we had 1,649 observations in our labelled dataset. 

In [None]:
labelled_data.shape

Let's take the first 1,250 observations for the training set. The remaining observations will be assigned to the test set and put aside for the moment.

In [None]:
features_trainset = features[:1250]
labels_trainset = labels[:1250]

features_testset = features[1250:]
labels_testset = labels[1250:]

The features in our training set are stored in a sparse matrix format with dimension 1250 (the number of our training samples) by 2628 (the size of our vocabulary).

In [None]:
features_trainset

The labels in our training set are integers stored in a pandas Series (a one-dimensional labelled array, like a single dataframe column).

In [None]:
labels_trainset

# 5. Training a logit regression classifier

Let's train our first model. We will start with something basic: A [logistic regression classifier](https://en.wikipedia.org/wiki/Logistic_regression). This classifier is a pretty popular model for classifying binary outcome variables. Again, we will use scikit-learn for this task.

In [None]:
from sklearn.linear_model import LogisticRegression

First, we have to initialize the logistic regression. We pass it the parameter **class_weight="balanced"** because we have a rather unbalanced dataset (one class is way more frequent than the other). The "balanced" setting will make sure that the model pays more attention to the infrequent class (in our case the software == 1 class).

In [None]:
logit_classifier = LogisticRegression(class_weight="balanced")
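
To get a feeling for what "balanced" means, the weights can be computed by hand: scikit-learn documents them as the number of samples divided by the number of classes times the class count. The sketch below uses the class counts from section 1 (before short texts were removed) purely as an illustration; the weights for the actual training set will differ somewhat.

```python
# Class counts taken from the value_counts() output in section 1 (illustration only)
class_counts = {0: 1716, 1: 284}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" weight per class: n_samples / (n_classes * class count)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(label, round(weight, 3))
```

With these counts, a misclassified software firm would count about six times as much during training as a misclassified non-software firm.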

We can now train the classifier using its **fit()** method and passing the features and corresponding labels of our training set.

In [None]:
trained_logit_classifier = logit_classifier.fit(features_trainset, labels_trainset)

You just trained your first machine learning model! We can have a look at the "learned" weights in the model by using the **coef_** attribute.

In [None]:
trained_logit_classifier.coef_

In total we have 2,628 weights in our model, each one associated with a word in our vocabulary.

In [None]:
trained_logit_classifier.coef_.shape

But how good is our trained model at distinguishing the web texts of software firms from other firm types? 

Let's test that with an example sentence that we transfer to a TFIDF vector using our trained vectorizer:

In [None]:
bauernhof_tfidf = trained_vectorizer.transform(["das ist ein bauernhof"])

We can now pass this TFIDF vector to our trained logit classifier and tell it to predict its label using the **predict()** method:

In [None]:
trained_logit_classifier.predict(bauernhof_tfidf)

The predicted label is "0" (aka "not a software company"). Awesome!

We can also check out the probability for both classes by using the **predict_proba()** method:

In [None]:
trained_logit_classifier.predict_proba(bauernhof_tfidf)

The first number above gives you the probability that the label is "0" while the second number is the probability that the label is "1". So the classifier is not too confident (about 64%) that the text comes from the website of a non-software company (**WARNING: Your results may differ!**).

Let's try one more example:

In [None]:
programmer_tfidf = trained_vectorizer.transform(["wir programmieren software"])
trained_logit_classifier.predict_proba(programmer_tfidf)

Wow! In this example, the model is almost certain (96%) that the text comes from a software company.

We should now test our classifier using the test set which we put aside:

In [None]:
predicted_labels = trained_logit_classifier.predict(features_testset)

We can now **print()** the predicted labels...

In [None]:
print(predicted_labels)

...and the true labels...

In [None]:
print(labels_testset.values)

...and compare them one by one.

In [None]:
print(predicted_labels == labels_testset.values)

Wow...at first glance that looks pretty convincing. It seems like most of the predicted labels match their true counterparts, but not all of them.

Let's quantify the classifier's prediction performance by generating a scikit-learn **classification report**. The report contains several measures that allow us to evaluate the performance of our trained model in the test set:

- **precision**: of all observations that were predicted to have the label $\hat{y} = 1$, the fraction that actually have the true label $y = 1$.
- **recall**: of all observations that have the true label $y = 1$, the fraction that were also predicted to have $\hat{y} = 1$.
- **f1-score**: a composite measure that combines both precision and recall (their harmonic mean).
- **support**: simply the number of observations with this true label in the test set.
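
The measures in the list above can be computed by hand for label "1". The following sketch uses ten made-up true and predicted labels, not results from our model.

```python
# Made-up labels for ten observations (not taken from our dataset)
true_labels      = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
predicted_labels = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

pairs = list(zip(true_labels, predicted_labels))
true_positives  = sum(1 for t, p in pairs if t == 1 and p == 1)
false_positives = sum(1 for t, p in pairs if t == 0 and p == 1)
false_negatives = sum(1 for t, p in pairs if t == 1 and p == 0)

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1_score = 2 * precision * recall / (precision + recall)
support = sum(1 for t in true_labels if t == 1)

print(precision, recall, round(f1_score, 3), support)
```

In this toy case, 3 of the 4 predicted "1"s are correct (precision 0.75), and 3 of the 5 true "1"s were found (recall 0.6).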

<img src="pics/classification_report.png" width="400">

In [None]:
from sklearn.metrics import classification_report

print(classification_report(labels_testset, predicted_labels))

Not too bad! As you can see, the classification report gives you precision, recall, f1-score, and support for both labels ("1" and "0"). We could summarize the report as follows:
- Precision: 95% of the observations that were labeled "0" by the classifier actually have the true label "0". For label "1" this value is only 69%.
- Recall: 92% of the observations that have the true label "0" were also predicted to have the label "0" by the classifier. For label "1" this value is only 78%.

So if our goal was to identify most of the software firms in the unlabelled dataset (**true positives**) while limiting the number of non-software firms that are wrongly classified as software firms (**false positives**), we could say:
"7 of 10 firms that were predicted to be software firms are actually software firms, and we are able to recover 8 of 10 software firms in the dataset".

As a final step, we can create a new column with our predicted labels in our labelled dataset...

In [None]:
labelled_data.loc[labelled_data.index[1250]:, 'prediction'] = predicted_labels

...and have a look at some observations that were incorrectly predicted. In the two examples below, our model is actually correct and the original labels are wrong. Seems like the person who created the base data misclassified the companies...

In [None]:
labelled_data[(labelled_data["software"] != labelled_data["prediction"]) & (labelled_data["prediction"].notnull())].sample(2, random_state=1)

Finally, we create a new column "software" in our unlabelled dataset and predict the labels using our trained logit classifier. Actually, this is what we wanted to do in the first place. Instead of going through the texts and classifying each company manually, we trained an ML model to do the work for us. The huge advantage of our model: humans are slow, but with our model we can classify hundreds of companies in a second.

In [None]:
unlabelled_data["software"] = unlabelled_data["text"].apply(lambda text: trained_logit_classifier.predict(trained_vectorizer.transform([text]))[0])
unlabelled_data.head(3)

# 6. Calculating text similarities

Representing texts as numerical vectors has another advantage: It allows us to easily calculate similarities between pairs of texts. By calculating such pairwise distances, we are able to identify pairs of texts that are very similar (close) or dissimilar (distant) to each other. Calculating the similarities between observations in our dataset can be understood as comparing the business models of the companies as they are described on their websites. Let's try this out and see whether we can find GIS (Geographic Information System) software companies in our unlabelled dataset!

First, we want to fit our TFIDF vectorizer on our unlabelled dataset without the popularity-based filtering. This makes sure that all the words (41,955) in the unlabelled dataset are included in the resulting vocabulary.

In [None]:
vectorizer = TfidfVectorizer(analyzer='word')
trained_vectorizer = vectorizer.fit(unlabelled_data["text"])
vocabulary = vectorizer.vocabulary_
len(vocabulary)

Next, we create a single TFIDF vector from a keyword list of the company type we are looking for.

In [None]:
search_tfidf = trained_vectorizer.transform(["gis geographie geographische informationssysteme geodaten"])

We will use the popular [**cosine similarity**](https://www.machinelearningplus.com/nlp/cosine-similarity/) to calculate similarities between our TFIDF vectors. Cosine similarity measures the similarity between vectors based on their orientation in the high-dimensional vector space. The smaller the angle between two document vectors, the more similar they are.
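
For two vectors, cosine similarity is the dot product divided by the product of the two vector lengths. Here is a minimal plain-Python sketch with made-up three-dimensional vectors; the helper name `cosine_sim` is chosen for illustration, and returning 0.0 for an all-zero vector is an assumption of this sketch.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot_product = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    if length_a == 0 or length_b == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot_product / (length_a * length_b)

same_direction = cosine_sim([1, 1, 0], [2, 2, 0])   # same orientation, different length
no_overlap = cosine_sim([1, 0, 0], [0, 1, 0])       # no shared words at all
partial_overlap = cosine_sim([1, 1, 0], [0, 1, 1])  # one shared word

print(same_direction, no_overlap, partial_overlap)
```

Because only the orientation counts, a long and a short text that use the same words in the same proportions are treated as (almost) identical.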

Again, we will use a function from scikit-learn, [**cosine_similarity()**](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html), to calculate the cosine similarities between our search TFIDF vector and the TFIDF vectors representing the company website texts in our unlabelled dataset. We do so by applying the function to the text column and creating a new column "similarity" with the output.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

unlabelled_data["similarity"] = unlabelled_data["text"].apply(lambda text: cosine_similarity(search_tfidf, trained_vectorizer.transform([text])).tolist()[0][0])

Let's have a look at those software firms that are most similar to our search vector. We can do that by restricting our dataframe to software firms and sorting it (**sort_values()**) by our new "similarity" column. 

In [None]:
unlabelled_data[unlabelled_data["software"] == 1].sort_values(by=["similarity"], ascending=False).head(2)

# 7. Training a neural network

<img src="pics/ANN.png" width="600">

For our next classifier, we will use one of those fancy artificial neural networks (ANN) everyone is talking about. ANNs consist of several **layers** of so-called **neurons** (or **nodes**) which are linked with all the neurons in the previous and in the next layer. These connections are called **edges** and have weights attached to them which control the strength of the connection between the two neurons they connect. These edge weights are adjusted during the training process.

The first layer (**input layer**) of the ANN takes the feature vector X as input, whereby each individual neuron processes a single element from X (i.e. a word in our case). The features are piped through the network where the internal weights transform them, before they reach the last layer of the network (**output layer**). The final output of the last layer should be in some way relatable to the problem we are trying to solve with the ANN. In our case, we want to know whether X is likely to represent class "1" or "0" (a binary classification problem). Because of that, our output layer will consist of only a single neuron that takes all the outputs from the previous layer as input and squeezes them into a number between 0.0 and 1.0 using a so-called **sigmoid activation function**:

<img src="pics/sigmoid.png" width="400">
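
The sigmoid function itself is a one-liner. The sketch below evaluates it for a few made-up inputs to show the squeezing effect.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0.0 and 1.0."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 3))
```

Strongly negative inputs end up close to 0.0, strongly positive inputs close to 1.0, and an input of zero gives exactly 0.5.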

We use this output to calculate an error between our predicted class $\hat{y}$ and the true class $y$. This error can then be used to adjust all the weights in the ANN in order to minimize the overall error for all our training samples.

For a proper dive into Deep Learning, check out Andrew Ng's great video series: https://www.youtube.com/playlist?list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0

For our ANN we will use [Keras](https://keras.io/), a Python deep learning library. From Keras, we need to import some modules first.

In [None]:
from keras import layers, models, optimizers

We can now construct a simple ANN by defining its layers and combining them in a Keras model.

In [None]:
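# Number of input features = number of columns in the training features (2,628 in our run)
vocabulary_size = features_trainset.shape[1]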

# The input layer with a number of neurons equal to the size of our vocabulary
input_layer = layers.Input(shape=(vocabulary_size,))

# The only hidden layer in our simple ANN, consisting of 25 neurons that process the input_layer's outputs
hidden_layer = layers.Dense(25)(input_layer)

# The output layer which consists of only a single neuron with a sigmoid activation function
# The input for this layer are the outputs of the hidden_layer
output_layer = layers.Dense(1, activation="sigmoid")(hidden_layer)

# Initialize the model using the Keras Model class, passing it the input and output layers
simple_ANN = models.Model(inputs=input_layer, outputs=output_layer)

In a final step, we compile the model and prepare it for training by defining the:
- **optimizer**: The algorithm we want to use to adjust the weights in our ANN and find their optimum setting during the training.
- **learning rate (lr)**: A parameter that controls how vigorously we change the weights in our ANN during each training step. If we are too vigorous, we will overshoot and if we are too reluctant it will take ages to find the optimum weights.
- **loss**: The loss function (also called cost, error, or objective function) we want to use to determine how far off our predictions are from the true values.

In [None]:
# Recent Keras versions expect "learning_rate"; the older "lr" keyword is deprecated
simple_ANN.compile(optimizer=optimizers.Adam(learning_rate=0.005), loss='binary_crossentropy')
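
The effect of the learning rate can be illustrated with a toy problem. The sketch below uses plain gradient descent (not Adam) on a made-up loss function $(w - 3)^2$ whose optimum is at $w = 3$; the function name `minimise` and the three learning rates are assumptions chosen for illustration.

```python
def minimise(learning_rate, steps=20, start=0.0):
    """Run plain gradient descent on the toy loss (w - 3)^2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)         # derivative of (w - 3)^2
        w -= learning_rate * gradient  # step against the gradient
    return w

w_reluctant = minimise(0.001)   # too small: barely moves in 20 steps
w_reasonable = minimise(0.1)    # gets close to the optimum at 3
w_vigorous = minimise(1.1)      # too large: overshoots further with every step

print(w_reluctant, w_reasonable, w_vigorous)
```

After 20 steps, only the middle setting ends up near 3. The smallest rate is still far from the optimum, and the largest rate jumps back and forth across it with growing distance.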

We can now also get a summary of our model using the **summary()** method. As you can see, there are 65,751 trainable parameters in our model. So, let's train them!

In [None]:
simple_ANN.summary()
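
The number of trainable parameters can be verified by hand: every neuron in a dense layer has one weight per input plus one bias term. A short sketch (the helper name `dense_parameters` is made up for illustration):

```python
def dense_parameters(n_inputs, n_neurons):
    """Weights (one per input and neuron) plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

vocabulary_size = 2628  # size of our vocabulary, as stated above

hidden_parameters = dense_parameters(vocabulary_size, 25)  # hidden layer with 25 neurons
output_parameters = dense_parameters(25, 1)                # single output neuron
total_parameters = hidden_parameters + output_parameters

print(hidden_parameters, output_parameters, total_parameters)
```

The hidden layer accounts for 65,725 parameters and the output neuron for 26, which adds up to the 65,751 reported in the summary.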

For training, we **fit()** the model on our training set. We also need to define:
- **epochs**: The number of full iterations through all the observations in the training data. If we adjust the weights only very reluctantly (i.e. with a small learning rate), we are likely to need more epochs.
- **validation_split**: The share of observations from the training data that will be taken aside as the **validation set**. The validation set is like an additional test set which will not be used for training the neural network but for its validation while training.

After each epoch, the loss in both the training data and the validation data is reported. This allows us to monitor the training process and whether the ANN is actually learning something (i.e. decreasing the loss). Ideally, the performance in both the training and the validation data should improve. If the performance in the validation data stagnates while it is still improving in the training data, we are likely to overfit our model to the training data. 

In [None]:
import scipy
# Keras needs a dense array here, so we convert the sparse feature matrix with its toarray() method
simple_ANN.fit(features_trainset.toarray(), labels_trainset, epochs=10, validation_split=0.1)

After the training, we can use our simple ANN to predict the labels for our test set observations using the **predict()** method.

In [None]:
predicted_labels = simple_ANN.predict(features_testset.toarray())

Because predict() returns the probability (e.g. 0.61) that an observation has the label "1", we need to round the predictions using numpy's **round()** function.

In [None]:
import numpy as np

predicted_labels = np.round(predicted_labels)
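
Rounding is the same as applying a decision threshold of 0.5, with one small catch: numpy's round(), like Python's built-in round(), rounds exact halves to the nearest even number, so a probability of exactly 0.5 becomes 0. The plain-Python sketch below, with made-up probabilities, shows the difference to an explicit threshold.

```python
# Made-up predicted probabilities for label "1"
probabilities = [0.61, 0.08, 0.5, 0.97, 0.49]

# Explicit threshold: everything at or above 0.5 becomes label 1
threshold = 0.5
thresholded = [1 if p >= threshold else 0 for p in probabilities]

# Rounding: an exact 0.5 is rounded to the nearest even number, i.e. 0
rounded = [round(p) for p in probabilities]

print(thresholded)
print(rounded)
```

An explicit threshold also makes it easy to trade precision against recall later on, for example by requiring a probability of 0.7 before calling a firm a software company.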

And we can finally print out a classification report.

In [None]:
print(classification_report(labels_testset, predicted_labels))