# Social Media Insight Using Naive Bayes


Text-based datasets contain a lot of information, whether they are books, historical documents, social media, e-mail, or any of the other ways we communicate via writing. Extracting features from text-based datasets and using them for classification is a difficult problem. There are, however, some common patterns for text mining. We look at disambiguating terms in social media using the Naive Bayes algorithm, which is a powerful and surprisingly simple algorithm. Naive Bayes takes a few shortcuts to properly compute the probabilities for classification, hence the term naive in the name. It can also be extended to other types of datasets quite easily and doesn't rely on numerical features. The model is a baseline for text mining studies, as the process can work reasonably well for a variety of datasets.

We will cover the following topics in this chapter:
- Downloading data from social network APIs
- Transformers for text
- Naive Bayes classifier
- Using JSON for saving and loading datasets
- The NLTK library for extracting features from text
- The F-measure for evaluation

## Disambiguation

Text is often called an **unstructured** format. There is a lot of information there, but it is just there; no headings, no required format, loose syntax and other problems prohibit the easy extraction of information from text. The data is also highly connected, with lots of mentions and cross-references—just not in a format that allows us to easily extract it!

We can compare the information stored in a book with that stored in a large database to see the difference. In the book, there are characters, themes, places, and lots of information. However, the book needs to be read and, more importantly, interpreted to gain this information. The database sits on your server with column names and data types. All the information is there and the level of interpretation needed is quite low. Information about the data, such as its type or meaning is called **metadata**, and text lacks it. A book also contains some metadata in the form of a table of contents and index but the degree is significantly lower than that of a database. 

One of the problems is the term **disambiguation**. When a person uses the word bank, is this a financial message or an environmental message (such as river bank)? This type of disambiguation is quite easy in many circumstances for humans (although there are still troubles), but much harder for computers to do.

Here, we will look at disambiguating the use of the term Python on Twitter's stream. A message on Twitter is called a tweet and is limited to 140 characters. This means there is little room for context. There isn't much metadata available although hashtags are often used to denote the topic of the tweet.

When people talk about Python, they could be talking about the following things:
- The programming language Python
- Monty Python, the classic comedy group
- The snake Python
- A make of shoe called Python

There can be many other things called Python. The aim of our experiment is to take a tweet mentioning Python and determine whether it is talking about the programming language, based only on the content of the tweet.

## Downloading data from a social network

We are going to download a corpus of data from Twitter and use it to sort out spam from useful content. Twitter provides a robust API for collecting information from its servers and this API is free for small-scale usage. It is, however, subject to some conditions that you'll need to be aware of if you start using Twitter's data in a commercial setting.

First, you'll need to sign up for a Twitter account (which is free). Go to http://twitter.com and register an account if you do not already have one.

Next, you'll need to ensure that you only make a certain number of requests per minute. This limit is currently 180 requests per hour. It can be tricky ensuring that you don't breach this limit, so it is highly recommended that you use a library to talk to Twitter's API.

You will need a key to access Twitter's data. Go to http://twitter.com and sign in to your account.

When you are logged in, go to https://apps.twitter.com/ and click on Create New App.

Create a name and description for your app, along with a website address. If you don't have a website to use, insert a placeholder. Leave the Callback URL field blank for this app—we won't need it. Agree to the terms of use (if you do) and click on Create your Twitter application.

Keep the resulting website open—you'll need the access keys that are on this page. Next, we need a library to talk to Twitter. There are many options; the one I like is simply called twitter, and is the official Twitter Python library.

Note: You can install twitter using pip3 install twitter if you are using pip to install your packages. If you are using another system,check the documentation at https://github.com/sixohsix/ twitter.

In [1]:
# import twitter
# import os
# import json

# consumer_key = "<Your Consumer Key Here>"
# consumer_secret = "<Your Consumer Secret Here>"
# access_token = "<Your Access Token Here>"
# access_token_secret = "<Your Access Token Secret Here>"
# authorization = twitter.OAuth(access_token, access_token_secret,
# consumer_key, consumer_secret)

# output_filename = os.path.join(os.path.expanduser("~"),
#  "Data", "twitter", "python_tweets.json")


# t = twitter.Twitter(auth=authorization)

# with open(output_filename, 'a') as output_file:
#     search_results = t.search.tweets(q="python", count=100)['statuses']
#     for tweet in search_results:
#         if 'text' in tweet:
#             output_file.write(json.dumps(tweet))
#             output_file.write("\n\n")

In the preceding loop, we also perform a check to see whether there is text in the tweet or not. Not all of the objects returned by twitter will be actual tweets (some will be actions to delete tweets and others). The key difference is the inclusion of text as a key, which we test for.

Running this for a few minutes will result in 100 tweets being added to the
output file.

## Loading and classifying the dataset

After we have collected a set of tweets (our dataset), we need labels to perform classification. The dataset we have stored is nearly in a JSON format. JSON is a format for data that doesn't impose much structure and is directly readable in JavaScript (hence the name, JavaScript Object Notation). JSON defines basic objects such as numbers, strings, lists and dictionaries, making it a good format for storing datasets if they contain data that isn't numerical. If your dataset is fully numerical, you would save space and time using a matrix-based format like in NumPy.

To parse it, we can use the json library but we will have to first split the file by newlines to get the actual tweet objects themselves.

## Loading data without the twitterAPI
We do not need to used the twitterAPI or anything of the like. There is a saved .txt file that has 100 tweets that we will use. If you wish to use the twitterAPI feel free to do so.

In [2]:
DATA = 'data/social-media-data/'
TWITTER = DATA + 'posts.txt'

In [3]:
tweets_list = []
with open(TWITTER, "r") as file:
    content = file.read().split('\n\n')
    for i, line in enumerate(content):
        try: 
            user, tweet = line.split('\n')
            tweets_list.append([user, tweet])
        except:
            print(f'no user: id = {i} \n tweet: {line} \n')
            tweets_list.append(['no user', tweet])

print(f"Loaded {len(tweets_list)} tweets")

no user: id = 29 
 tweet: Available Now At -… http://t.co/tvgQugZZT5 

no user: id = 31 
 tweet: No_not_that_one
RT @SamuelHLowe: - Excuse me, do you have snake belts?
- Not sure, but I can check in the back.
- Please do, my python's pants keep falling… 

no user: id = 33 
 tweet: .....))))))))))) 

no user: id = 51 
 tweet: halulan
@python_octopus 
乙です！！ 

Loaded 102 tweets


We are now interested in classifying whether an item is relevant to us or not (in this case, relevant means refers to the programming language Python). We will use the IPython Notebook's ability to embed HTML and talk between JavaScript and Python to create a viewer of tweets to allow us to easily and quickly classify the tweets as spam or not.

The code will present a new tweet to the user (you) and ask for a label: is it relevant
or not? It will then store the input and present the next tweet to be labeled.

In [15]:
# labels = []

# print(f""" Instructions: 
#            Enter a 1 if the tweet is relevant, enter 2 otherwise.""")

# n = len(tweets_list)+1
# idx = 0
# while len(labels) <= n:
#     print(f"""Tweet: {tweets_list[idx][1]}""")
#     a = input()
#     try:
#         if a == 'exit':
#             break
#         val = int(a)
#         if val in [1,2]:
#             labels.append(int(a))
#             idx += 1
#         else:
#             print('invalid input: must be 1 or 2')
#     except:
#         print('invalid input: must be 1 or 2')
    

In [19]:
import numpy as np

# # Open a file in write mode
# with open(DATA+"labels.txt", "w") as f:
#     # Write the array string to the file
#     my_array = np.array(labels)
#     array_string = np.array2string(my_array)
#     f.write(array_string)

## Creating a replicable dataset from Twitter

In data mining, there are lots of variables. These aren't just in the data mining algorithms—they also appear in the data collection, environment, and many other factors. Being able to replicate your results is important as it enables you to verify or improve upon your results.

**Note:** Getting 80 percent accuracy on one dataset with algorithm X, and 90 percent accuracy on another dataset with algorithm Y doesn't mean that Y is better. We need to be able to test on the same dataset in the same conditions to be able to properly compare.

Your labeling of tweets might be different from what is here. While there are obvious examples where a given tweet relates to the python programming language, there will always be gray areas where the labeling isn't obvious. One tough gray area was tweets in non-English languages that couldn't be read.

Due to these factors, it is difficult to replicate experiments on databases that are extracted from social media, and Twitter is no exception. Twitter explicitly disallows sharing datasets directly.

One solution to this is to share tweet IDs only, which you can share freely. In this section, we will first create a simulation of tweet ID dataset that we can theoretically be freely share, named `label.txt` in our data folder.

In [42]:
labels = []

# a label with -1 is n/a
with open(DATA+'labels.txt', 'r') as f:
    content = f.read()[1:-1]
    label_vals = content.split('\n')
    for val in label_vals:
        for label in val.split():
            labels.append(int(label))

In [43]:
n_samples = min(len(tweets_list), len(labels))
print(n_samples)

102


Now that we have the tweet IDs and labels, we can recreate the original dataset if we wanted to but we will not do that here since the code up until now creates the dataset for you.

## Text transformers

Now that we have our dataset, how are we going to perform data mining on it?

Text-based datasets include books, essays, websites, manuscripts, programming code, and other forms of written expression. All of the algorithms we have seen so far deal with numerical or categorical features, so how do we convert our text into a format that the algorithm can deal with?

There are a number of measurements that could be taken. For instance, average word and average sentence length are used to predict the readability of a document. However, there are lots of feature types such as word occurrence which we will now investigate.

### Bag-of-words

One of the simplest but highly effective models is to simply count each word in the dataset. We create a matrix, where each row represents a document in our dataset and each column represents a word. The value of the cell is the frequency of that word in the document.

Here's an excerpt from The Lord of the Rings, J.R.R. Tolkien:

```
Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie.
 - J.R.R. Tolkien's epigraph to The Lord of The Rings
```

The word the appears nine times in this quote, while the words in, for, to, and one each appear four times. The word ring appears three times, as does the word of. 

We can create a dataset from this, choosing a subset of words and counting the frequency:

|Word| the| one| ring| to|
|--|--|--|--|--|
|Frequency| 9| 4| 3| 4|

We can use the counter class to do a simple count for a given string. When counting words, it is normal to convert all letters to lowercase, which we do when creating thestring. The code is as follows:

In [44]:
s = """Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie. """.lower()
words = s.split()
from collections import Counter
c = Counter(words)
c.most_common(5)

[('the', 9), ('for', 4), ('in', 4), ('to', 4), ('one', 4)]

Printing `c.most_common(5)` gives the list of the top five most frequently occurring words. Ties are not handled well as only five are given and a very large number of words all share a tie for fifth place.

The bag-of-words model has three major types. 
- The first is to use the raw frequencies, as shown in the preceding example. This does have a drawback when documents vary in size from fewer words to many words, as the overall values will be very different. 
- The second model is to use the **normalized frequency**, where each document's sum equals 1. This is a much better solution as the length of the document doesn't matter as much. 
- The third type is to simply use binary features—a value is 1 if the word occurs *at all* and 0 if it doesn't. We will use binary representation in this case.

Another popular (arguably more popular) method for performing normalization is called **term frequency - inverse document frequency**, or **tf-idf**. In this weighting scheme, term counts are first normalized to frequencies and then divided by the number of documents in which it appears in the corpus.

There are a number of libraries for working with text data in Python. We will use a major one, called **Natural Language ToolKit (NLTK)**. The `scikit-learn` library also has the `CountVectorizer` class that performs a similar action, and it is recommended you take a look at it. However the NLTK version has more options for word tokenization. If you are doing natural language processing in python, NLTK is a great library to use.

## N-grams
A step up from single bag-of-words features is that of **n-grams**. An n-gram is a
subsequence of n consecutive tokens.

They are counted the same way, with the n-grams forming a word that is put in the bag. The value of a cell in this dataset is the frequency that a particular n-gram appears in the given document.

**Note:** The value of n is a parameter. For English, setting it to between 2 to 5 is a good start, although some applications call for higher values

As an example, for n=3, we extract the first few n-grams in the following quote:

*Always look on the bright side of life.*

The first n-gram (of size 3) is Always look on, the second is look on the, the third is on the bright. As you can see, the n-grams overlap and cover three words.

Word n-grams have advantages over using single words. This simple concept introduces some context to word use by considering its local environment, without a large overhead of understanding the language computationally. A disadvantage of using n-grams is that the matrix becomes even sparser—word n-grams are unlikely to appear twice (especially in tweets and other short documents!).

Specially for social media and other short documents, word n-grams are unlikely to appear in too many different tweets, unless it is a retweet. However, in larger documents, word n-grams are quite effective for many applications.

Another form of n-gram for text documents is that of a **character n-gram**. Rather than using sets of words, we simply use sets of characters (although character n-grams have lots of options for how they are computed!). **This type of dataset can pick up words that are misspelled, as well as providing other benefits**.

## Other features (further reading)

There are other features that can be extracted too. These include syntactic features, such as the usage of particular words in sentences. Part-of-speech tags are also popular for data mining applications that need to understand meaning in text. Such feature types won't be covered in this book. If you are interested in learning more:
- *Python 3 Text Processing with NLTK 3 Cookbook, Jacob Perkins, Packt publication.*

## Naive Bayes

Naive Bayes is a probabilistic model that is unsurprisingly built upon a naive interpretation of Bayesian statistics. Despite the naive aspect, the method performs very well in a large number of contexts. It can be used for classification of many different feature types and formats, but we will focus on one: binary features in the bag-of-words model.

## Bayes' theorem

For most of us, when we were taught statistics, we started from a frequentist approach. In this approach, we assume the data comes from some distribution and we aim to determine what the parameters are for that distribution. However, those parameters are (perhaps incorrectly) assumed to be fixed. We use our model to describe the data, even testing to ensure the data fits our model.

Bayesian statistics instead model how people (non-statisticians) actually reason. We have some data and we use that data to update our model about how likely something is to occur. In Bayesian statistics, we use the data to describe the model rather than using a model and confirming it with data (as per the frequentist approach).

Bayes' theorem computes the value of $P(A|B)$, that is, knowing that $B$ has occurred,  what is the probability of $A$. In most cases, $B$ is an observed event such as it rained yesterday, and $A$ is a prediction it will rain today. For data mining, $B$ is usually we observed this sample and $A$ is it belongs to this class. We will see how to use Bayes' theorem for data mining in the next section. The equation for Bayes' theorem is given as follows:

$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$

## A simple example

As an example, we want to determine the probability that an e-mail containing the word drugs is spam (as we believe that such a tweet may be a pharmaceutical spam). $A$, in this context, is the probability that this tweet is spam. We can compute $P(A)$, called the prior belief directly from a training dataset by computing the percentage of tweets in our dataset that are spam. If our dataset contains 30 spam messages for every 100 e-mails, $P(A) =  0.3$.

$B$, in this context, is this tweet contains the word 'drugs'. Likewise, we can compute $P(B)$ by computing the percentage of tweets in our dataset containing the word drugs. If 10 e-mails in every 100 of our training dataset contain the word drugs, $P(B) = 0.1$. Note that we don't care if the e-mail is spam or not when computing this value. 

$P(B|A)$ is the probability that an e-mail contains the word drugs if it is spam. It is also easy to compute from our training dataset. We look through our training set for spam e-mails and compute the percentage of them that contain the word drugs. Of our 30 spam e-mails, if 6 contain the word drugs, then $P(B|A) = 0.2$.

From here, we use Bayes' theorem to compute P(A|B), which is the probability that
a tweet containing the word drugs is spam. Using the previous equation, we see the
result is 0.6. This indicates that if an e-mail has the word drugs in it, there is a 60
percent chance that it is spam.

Note the empirical nature of the preceding example—we use evidence directly from our training dataset, not from some preconceived distribution. In contrast, a frequentist view of this would rely on us creating a distribution of the probability of words in tweets to compute similar equations.

## Naive Bayes algorithm

Looking back at our Bayes' theorem equation, we can use it to compute the probability that a given sample belongs to a given class. This allows the equation to be used as a classification algorithm.

With $C$ as a given class and $D$ as a sample in our dataset, we create the elements necessary for Bayes' theorem, and subsequently Naive Bayes. Naive Bayes is a classification algorithm that utilizes Bayes' theorem to compute the probability that a new data sample belongs to a particular class.

$P(C)$ is the probability of a class, which is computed from the training dataset itself (as we did with the spam example). We simply compute the percentage of samples in our training dataset that belong to the given class.

$P(D)$ is the probability of a given data sample. It can be difficult to compute this, as the sample is a complex interaction between different features, but luckily it is a constant across all classes. Therefore, we don't need to compute it at all. We will see later how to get around this issue.

$P(D|C)$ is the probability of the data point belonging to the class. This could also be difficult to compute due to the different features. However, this is where we **introduce the naive part of the Naive Bayes algorithm**. We naively assume that each feature is independent of each other. Rather than computing the full probability of $P(D|C)$, we compute the probability of each feature $D_1, D_2, D_3, …$ and so on. Then,
we multiply them together:

$$
P(D|C) = P(D_1|C)P(D_2|C)\cdots P(D_n|C)
$$

Each of these values is relatively easy to compute with binary features; we simply compute the percentage of times it is equal in our sample dataset.

In contrast, if we were to perform a non-naive Bayes version of this part, we would need to compute the correlations between different features for each class. Such computation is infeasible at best, and nearly impossible without vast amounts of data
or adequate language analysis models.

From here, the algorithm is straightforward. We compute $P(C|D)$ for each possible class, ignoring the $P(D)$ term. Then we choose the class with the highest probability. As the P(D) term is consistent across each of the classes, ignoring it has no impact on the final prediction.

## How it works

As an example, suppose we have the following (binary) feature values from a sample in our dataset: $D = [1, 0, 0, 1]$.

Our training dataset contains two classes with 75 percent of samples belonging to the class 0, and 25 percent belonging to the class 1. The likelihood of the feature values for each class are as follows:

For class 0: $[P(D_1=1|C=0),P(D_2=1|C=0),P(D_3=1|C=0),P(D_4=1|C=0)] = [0.3, 0.4, 0.4, 0.7]$

For class 1: $[P(D_1=1|C=1),P(D_2=1|C=1),P(D_3=1|C=1),P(D_4|=1C=1)] = [0.7, 0.3, 0.4, 0.9]$

These values are to be interpreted as: for feature 1, $D_1$, it is equal to 1 in 30 percent of cases for class 0, i.e. $P(D_1=1|C=0) = .3$.

We can now compute the probability that this sample should belong to the class 0:

$P(C=0) = 0.75$ which is the probability that the class is 0.

Note that $P(D)$ isn't needed for the Naive Bayes algorithm. Let's take a look at the calculation:

$P(D|C=0) = P(D_1|C=0)P(D_2|C=0)P(D_3|C=0)P(D_4|C=0) = 0.3\cdot 0.6 \cdot 0.6 \cdot 0.7= 0.0756$

**Note:** The listed probabilities are for values of 1 for each feature. Therefore, the probability of a 0 is its inverse: $P(0) = 1 – P(1)$.

Now, we can compute the probability of the data point belonging to this class. An important point to note is that we haven't computed P(D), so this isn't a real probability. However, it is good enough to compare against the same value for the probability of the class 1. Let's take a look at the calculation:

$$
P(C=0|D) = \frac{P(C=0) P(D|C=0)}{P(D)}
= \frac{0.75 \cdot 0.0756}{P(D)}
= \frac{0.0567}{P(D)}
$$

Now, we compute the same values for the class 1: $P(C=1) = 0.25$

$P(D)$ isn't needed for naive Bayes. Let's take a look at the calculation:

$P(D|C=1) = P(D_1|C=1)P(D_2|C=1)P(D_3|C=1)P(D_4|C=1)
= 0.7 \cdot 0.7 \cdot 0.6 \cdot 0.9
= 0.2646$

$$
P(C=1|D) = \frac{P(C=1) P(D|C=1)}{P(D)}
= \frac{0.25 \cdot 0.2646}{P(D)}
= \frac{0.06615}{P(D)}
$$

The data point should be classified as belonging to the class 1. You may have guessed this while going through the equations anyway; however, you may have been a bit surprised that the final decision was so close. After all, the probabilities in computing P(D|C) were much, much higher for the class 1. This is because we introduced a prior belief that most samples generally belong to the class 0.

If the classes had been equal sizes, the resulting probabilities would be much different. Try it yourself by changing both $P(C=0)$ and $P(C=1)$ to 0.5 for equal class sizes and computing the result again.

## Application

We will now create a pipeline that takes a tweet and determines whether it is relevant or not, based only on the content of that tweet. To perform the word extraction, we will be using the NLTK, a library that contains a large number of tools for performing analysis on natural language.

We are going to create a pipeline to extract the word features and classify the tweets using Naive Bayes. Our pipeline has the following steps:

1. Transform the original text documents into a dictionary of counts using NLTK's word_tokenize function.
2. Transform those dictionaries into a vector matrix using the DictVectorizer transformer in scikit-learn. This is necessary to enable the Naive Bayes classifier to read the feature values extracted in the first step.
3. Train the Naive Bayes classifier.

## Extracting word counts

We are going to use NLTK to extract our word counts. We still want to use it in a pipeline, but NLTK doesn't conform to our transformer interface. We will therefore need to create a basic transformer to do this to obtain both fit and transform methods, enabling us to use this in a pipeline.

First, set up the transformer class. We don't need to fit anything in this class, as this transformer simply extracts the words in the document. Therefore, our fit is an empty function, except that it returns self which is necessary for transformer objects. Our transform is a little more complicated. We want to extract each word from each document and record True if it was discovered. We are only using the binary features here—True if in the document, False otherwise. If we wanted to use the frequency we would set up counting dictionaries, as we have done in several of the past chapters.

Let's take a look at the code:

In [67]:
from sklearn.base import TransformerMixin
import nltk
from nltk import word_tokenize

nltk.download('punkt')


class NLTKBOW(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [{word: True for word in word_tokenize(document)}
                 for document in X]

[nltk_data] Downloading package punkt to /Users/robed/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Converting dictionaries to a matrix

This step converts the dictionaries built as per the previous step into a matrix that can be used with a classifier. This step is made quite simple through the `DictVectorizer` transformer.

The `DictVectorizer` class simply takes a list of dictionaries and converts them into a matrix. The features in this matrix are the keys in each of the dictionaries, and the values correspond to the occurrence of those features in each sample. Dictionaries are easy to create in code, but many data algorithm implementations prefer matrices. This makes `DictVectorizer` a very useful class.

In [50]:
from sklearn.feature_extraction import DictVectorizer

## Training the Naive Bayes classifier

Finally, we need to set up a classifier and we are using Naive Bayes for this section. As our dataset contains only binary features, we use the `BernoulliNB` classifier that is designed for binary features. As a classifier, it is very easy to use. As with `DictVectorizer`, we simply import it and add it to our pipeline:

In [51]:
from sklearn.naive_bayes import BernoulliNB

## Putting it all together

Now, create a pipeline putting together the components from before. Our pipeline
has three parts:
- The `NLTKBOW` transformer we created
- A `DictVectorizer` transformer
- A `BernoulliNB` classifier
The code is as follows:

In [61]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([('bag-of-words', NLTKBOW()),
                     ('vectorizer', DictVectorizer()),
                     ('naive-bayes', BernoulliNB())])

We can nearly run our pipeline now, which we will do with `cross_val_score` as we have done many times before. Before that though, we will introduce a better evaluation metric than the accuracy metric we used before. As we will see, the use of accuracy is not adequate for datasets when the number of samples in each class is different.

## Evaluation using the F1-score
When choosing an evaluation metric, it is always important to consider cases where that evaluation metric is not useful. Accuracy is a good evaluation metric in many cases, as it is easy to understand and simple to compute. However, it can be easily faked. In other words, in many cases you can create algorithms that have a high accuracy by poor utility.

While our dataset of tweets (typically, your results may vary) contains about 50 percent programming-related and 50 percent nonprogramming (see below), many datasets aren't as **balanced** as this.

As an example, an e-mail spam filter may expect to see more than 80 percent of incoming e-mails be spam. A spam filter that simply labels everything as spam is quite useless; however, it will obtain an accuracy of 80 percent!

To get around this problem, we can use other evaluation metrics. One of the most commonly employed is called an f1-score (also called f-score, f-measure, or one of many other variations on this term).

The f1-score is defined on a per-class basis and is based on two concepts: the precision and recall. The **precision** is the percentage of all the samples that were predicted as belonging to a specific class that were actually from that class. The **recall** is the percentage of samples in the dataset that are in a class and actually labeled as belonging to that class.

In the case of our application, we could compute the value for both classes (relevant and not relevant). However, we are really interested in the spam. Therefore, our precision computation becomes the question: of all the tweets that were predicted as being relevant, what percentage were actually relevant? Likewise, the recall becomes the question: of all the relevant tweets in the dataset, how many were predicted as being relevant?

After you compute both the precision and recall, the f1-score is the harmonic mean of the precision and recall:

$$
F_1 = 2\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}
$$

To use the f1-score in scikit-learn methods, simply set the scoring parameter to f1. By default, this will return the f1-score of the class with label 1. Running the code on our dataset, we simply use the following line of code:

In [62]:
labels = np.array(labels)

In [63]:
print(f"{np.mean(labels == 1):.2%}% have class 1")

50.00%% have class 1


In [64]:
tweets = np.array([tweet[1] for tweet in tweets_list])

In [68]:
from sklearn.model_selection import cross_val_score


scores = cross_val_score(pipeline, tweets, labels, scoring='f1')
print(f"Score: {np.mean(scores):.3f}")

Score: 0.711


The result is 0.711, which means we can accurately determine if a tweet using Python relates to the programing language 70 percent of the time. This is using a dataset with only 102 tweets in it.

## Getting useful features from models

One question you may ask is what are the best features for determining if a tweet is relevant or not? We can extract this information from of our Naive Bayes model and find out which features are the best individually, according to Naive Bayes. First we fit a new model. While the cross_val_score gives us a score across different folds of cross-validated testing data, it doesn't easily give us the trained models themselves. To do this, we simply fit our pipeline with the tweets, creating a new model.

A pipeline gives you access to the individual steps through the named_steps attribute and the name of the step (we defined these names ourselves when we created the pipeline object itself). For instance, we can get the Naive Bayes model:

In [71]:
model = pipeline.fit(tweets, labels)
nb = model.named_steps['naive-bayes']
feature_probabilities = nb.feature_log_prob_
top_features = np.argsort(-feature_probabilities[1])[:50]

From this model, we can extract the probabilities for each word. These are stored as log probabilities, which is simply $log(P(A|f))$, where $f$ is a given feature.

The preceding code will just give us the indices and not the actual feature values. This isn't very useful, so we will map the feature's indices to the actual values. The key is the `DictVectorizer` step of the pipeline, which created the matrices for us. Luckily this also records the mapping, allowing us to find the feature names that correlate to different columns.

From here, we can print out the names of the top features by looking them up in the feature_names_ attribute of DictVectorizer. Enter the following lines into a new cell and run it to print out a list of the top features:

In [73]:
dv = model.named_steps['vectorizer']
for i, feature_index in enumerate(top_features):
    if i < 20:
        print(i, dv.feature_names_[feature_index],
        np.exp(feature_probabilities[1][feature_index]))

0 : 0.49056603773584906
1 @ 0.471698113207547
2 http 0.43396226415094336
3 Python 0.32075471698113206
4 - 0.169811320754717


The first few features include :, http, # and @. These are likely to be noise (although the use of a colon is not very common outside programming), based on the data we collected. Collecting more data is critical to smoothing out these issues. Looking through the list though, we get a number of more obvious programming features:

There are some others too that refer to Python in a work context, and therefore might be referring to the programming language (although freelance snake handlers may also use similar terms, they are less common on Twitter):

Looking through these features gives us quite a few benefits. We could train people to recognize these tweets, look for commonalities (which give insight into a topic), or even get rid of features that make no sense.