# Social Media Insight Using Naive Bayes


Text-based datasets contain a lot of information, whether they are books, historical documents, social media, e-mail, or any of the other ways we communicate via writing. Extracting features from text-based datasets and using them for classification is a difficult problem. There are, however, some common patterns for text mining. We look at disambiguating terms in social media using the Naive Bayes algorithm, which is a powerful and surprisingly simple algorithm. Naive Bayes takes a few shortcuts when computing the probabilities for classification, most notably by assuming that features are independent of one another, hence the term naive in the name. It can also be extended to other types of datasets quite easily and doesn't rely on numerical features. The model is a common baseline for text mining studies, as the process can work reasonably well for a variety of datasets.

We will cover the following topics in this chapter:
- Downloading data from social network APIs
- Transformers for text
- Naive Bayes classifier
- Using JSON for saving and loading datasets
- The NLTK library for extracting features from text
- The F-measure for evaluation

## Disambiguation

Text is often called an **unstructured** format. There is a lot of information there, but it is just there; no headings, no required format, loose syntax and other problems prohibit the easy extraction of information from text. The data is also highly connected, with lots of mentions and cross-references—just not in a format that allows us to easily extract it!

We can compare the information stored in a book with that stored in a large database to see the difference. In the book, there are characters, themes, places, and lots of information. However, the book needs to be read and, more importantly, interpreted to gain this information. The database sits on your server with column names and data types. All the information is there and the level of interpretation needed is quite low. Information about the data, such as its type or meaning, is called **metadata**, and text lacks it. A book also contains some metadata in the form of a table of contents and an index, but the degree is significantly lower than that of a database.

One of these problems is term **disambiguation**. When a person uses the word bank, is this a financial message or an environmental message (such as river bank)? This type of disambiguation is quite easy for humans in many circumstances (although there are still difficulties), but much harder for computers to do.

Here, we will look at disambiguating the use of the term Python on Twitter's stream. A message on Twitter is called a tweet and is limited to 140 characters. This means there is little room for context. There isn't much metadata available although hashtags are often used to denote the topic of the tweet.

When people talk about Python, they could be talking about the following things:
- The programming language Python
- Monty Python, the classic comedy group
- The snake Python
- A make of shoe called Python

There can be many other things called Python. The aim of our experiment is to take a tweet mentioning Python and determine whether it is talking about the programming language, based only on the content of the tweet.

## Downloading data from a social network

We are going to download a corpus of data from Twitter and use it to sort out spam from useful content. Twitter provides a robust API for collecting information from its servers and this API is free for small-scale usage. It is, however, subject to some conditions that you'll need to be aware of if you start using Twitter's data in a commercial setting.

First, you'll need to sign up for a Twitter account (which is free). Go to http://twitter.com and register an account if you do not already have one.

Next, you'll need to ensure that you only make a certain number of requests in a given time window. This limit is currently 180 requests per hour. It can be tricky ensuring that you don't breach this limit, so it is highly recommended that you use a library to talk to Twitter's API.
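
To make the rate limit concrete, here is a minimal sketch of how calls could be spaced out by hand. The `Throttle` class and its parameters are hypothetical names introduced for illustration; they are not part of the twitter library, and a real client library may handle limits differently.

```python
import time


class Throttle:
    """Space out calls so that at most max_calls happen per period (seconds)."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        # Minimum gap between two consecutive calls
        self.min_interval = period / max_calls
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


# 180 requests per hour works out to one request every 20 seconds
throttle = Throttle(max_calls=180, period=3600)
```

Calling `throttle.wait()` before each request would keep the average rate under the stated limit, at the cost of not allowing short bursts.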

You will need a key to access Twitter's data. Go to http://twitter.com and sign in to your account.

When you are logged in, go to https://apps.twitter.com/ and click on Create New App.

Create a name and description for your app, along with a website address. If you don't have a website to use, insert a placeholder. Leave the Callback URL field blank for this app—we won't need it. Agree to the terms of use (if you do) and click on Create your Twitter application.

Keep the resulting website open, as you'll need the access keys that are on this page. Next, we need a library to talk to Twitter. There are many options; the one I like is simply called twitter.

Note: You can install twitter using pip3 install twitter if you are using pip to install your packages. If you are using another system, check the documentation at https://github.com/sixohsix/twitter.

In [1]:
# import twitter
# import os
# import json

# consumer_key = "<Your Consumer Key Here>"
# consumer_secret = "<Your Consumer Secret Here>"
# access_token = "<Your Access Token Here>"
# access_token_secret = "<Your Access Token Secret Here>"
# authorization = twitter.OAuth(access_token, access_token_secret,
# consumer_key, consumer_secret)

# output_filename = os.path.join(os.path.expanduser("~"),
#  "Data", "twitter", "python_tweets.json")


# t = twitter.Twitter(auth=authorization)

# with open(output_filename, 'a') as output_file:
#     search_results = t.search.tweets(q="python", count=100)['statuses']
#     for tweet in search_results:
#         if 'text' in tweet:
#             output_file.write(json.dumps(tweet))
#             output_file.write("\n\n")

In the preceding loop, we also perform a check to see whether there is text in the tweet or not. Not all of the objects returned by twitter will be actual tweets (some will be actions to delete tweets and others). The key difference is the inclusion of text as a key, which we test for.

Running this once will add up to 100 tweets to the output file.

## Loading and classifying the dataset

After we have collected a set of tweets (our dataset), we need labels to perform classification. The dataset we have stored is nearly in a JSON format. JSON is a format for data that doesn't impose much structure and is directly readable in JavaScript (hence the name, JavaScript Object Notation). JSON defines basic objects such as numbers, strings, lists and dictionaries, making it a good format for storing datasets if they contain data that isn't numerical. If your dataset is fully numerical, you would save space and time using a matrix-based format like in NumPy.

To parse it, we can use the json library but we will have to first split the file by newlines to get the actual tweet objects themselves.
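
As a rough sketch of that parsing step, the following splits the saved text on blank lines and decodes each piece with the json library. The `parse_tweets` helper and the sample string are hypothetical; the sample only imitates the layout written by the download loop.

```python
import json


def parse_tweets(raw):
    """Parse JSON objects separated by blank lines, keeping only real tweets."""
    tweets = []
    for chunk in raw.split("\n\n"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the final separator
        obj = json.loads(chunk)
        if "text" in obj:  # non-tweet objects (such as deletions) have no text
            tweets.append(obj)
    return tweets


# Hypothetical sample in the same layout as the output file
raw = '{"id": 1, "text": "python 3 is out"}\n\n{"delete": {"id": 2}}\n\n'
tweets = parse_tweets(raw)
print(len(tweets), tweets[0]["text"])
```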

## Loading data without the Twitter API
We do not need to use the Twitter API to continue. A saved .txt file containing 100 tweets is provided, and we will use it here. If you wish to use the Twitter API instead, feel free to do so.

In [2]:
DATA = 'data/social-media-data/'
TWITTER = DATA + 'posts.txt'

In [3]:
tweets_set = {}
tweets_list = []
with open(TWITTER, "r") as file:
    content = file.read().split('\n\n')
    for i, line in enumerate(content):
        try: 
            user, tweet = line.split('\n')
            tweets_set[i] = {'user': user,
                            'text': tweet}
            tweets_list.append([user, tweet])
        except ValueError:
            # Skip entries that are not exactly a user line and a tweet line
            continue

print(f"Loaded {len(tweets_list)} tweets")

Loaded 99 tweets


We are now interested in classifying whether an item is relevant to us or not (in this case, relevant means that it refers to the programming language Python). We will use a simple text prompt in the notebook to view each tweet and quickly label it as relevant or not.

The code will present a new tweet to the user (you) and ask for a label: is it relevant
or not? It will then store the input and present the next tweet to be labeled.

In [4]:
tweet_sample = tweets_set
labels = []

In [5]:
# instructions = """
# Instructions: Click in textbox. Enter 1 if the tweet is relevant, enter 2 otherwise.
# Enter 'end' to stop labeling.
# """
# print(instructions)
# while len(labels) <= len(tweets_list):
#     index = len(labels)
#     if index not in tweet_sample:
#         # No tweet was parsed at this position; record it as n/a
#         print(f'key value {index} DNE')
#         labels.append(-1)
#         continue
#
#     print(f"\nTweet: {tweet_sample[index]['text']}\n")
#     a = input()
#     if a == 'end':
#         break
#     try:
#         val = int(a)
#     except ValueError:
#         print('invalid input: must be 1 or 2')
#         continue
#     if val in [1, 2]:
#         labels.append(val)
#     else:
#         print('invalid input: must be 1 or 2')
    

In [6]:
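# a label with -1 is n/a
with open(DATA+'labels.txt', 'r') as f:
    content = f.read()
    label_vals = content.split('\n')
    for val in label_vals:
        if val.strip():  # ignore blank lines, such as a trailing newline
            labels.append(int(val))  # store labels as integers, not strings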

In [7]:
n_samples = min(len(tweet_sample), len(labels))

## Creating a replicable dataset from Twitter

In data mining, there are lots of variables. These aren't just in the data mining algorithms—they also appear in the data collection, environment, and many other factors. Being able to replicate your results is important as it enables you to verify or improve upon your results.

**Note:** Getting 80 percent accuracy on one dataset with algorithm X, and 90 percent accuracy on another dataset with algorithm Y doesn't mean that Y is better. We need to be able to test on the same dataset in the same conditions to be able to properly compare.

Your labeling of tweets might be different from what is here. While there are obvious examples where a given tweet relates to the Python programming language, there will always be gray areas where the labeling isn't obvious. One tough gray area was tweets in non-English languages that couldn't be read.

Due to these factors, it is difficult to replicate experiments on databases that are extracted from social media, and Twitter is no exception. Twitter explicitly disallows sharing datasets directly.

One solution to this is to share tweet IDs only, which you can share freely. In this section, we will first create a simulated tweet ID dataset that could, in principle, be shared freely.

In [8]:
dataset = []

for tweet_id, label in enumerate(labels):
    # The position of each label doubles as our simulated tweet ID
    dataset.append([tweet_id, label])

In [9]:
with open(DATA+'replicable.txt', 'w') as f:
    for item in dataset:
        f.write(str(item) + '\n')

Now that we have the tweet IDs and labels, we could recreate the original dataset. We will not do that here, since the code up to this point already creates the dataset for you.
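
Since JSON is the storage format discussed earlier, one alternative to writing each pair with str() is to dump the whole list as JSON, which can be read back without custom parsing. This is only a sketch: the pairs and the file name below are made up for illustration, and real IDs would come from the tweet objects.

```python
import json
import os
import tempfile

# Hypothetical (tweet ID, label) pairs; -1 marks an entry with no usable label
dataset = [[0, 1], [1, 2], [2, -1]]

# Write to a temporary directory so the sketch does not touch real data
path = os.path.join(tempfile.mkdtemp(), "replicable.json")
with open(path, "w") as f:
    json.dump(dataset, f)

with open(path) as f:
    restored = json.load(f)

# Drop entries labeled -1 (n/a) before using the labels
usable = [(tweet_id, label) for tweet_id, label in restored if label != -1]
print(usable)
```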

## Text transformers

Now that we have our dataset, how are we going to perform data mining on it? Text-based datasets include books, essays, websites, manuscripts, programming code, and other forms of written expression. All of the algorithms we have seen so far deal with numerical or categorical features, so how do we convert our text into a format that the algorithm can deal with?

There are a number of measurements that could be taken. For instance, average word length and average sentence length are used to predict the readability of a document. However, there are many other feature types, such as word occurrence, which we will now investigate.

### Bag-of-words

One of the simplest but highly effective models is to simply count each word in the
dataset. We create a matrix, where each row represents a document in our dataset
and each column represents a word. The value of the cell is the frequency of that
word in the document.
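
The matrix described above can be sketched in a few lines of plain Python. The two documents are a toy corpus invented for illustration, and splitting on whitespace is a simplifying assumption rather than a proper tokenizer.

```python
from collections import Counter

# Toy corpus, not taken from the tweet dataset
documents = ["the cat sat on the mat", "the dog sat"]

# One Counter per document, then a shared, sorted vocabulary for the columns
counts = [Counter(doc.lower().split()) for doc in documents]
vocabulary = sorted({word for c in counts for word in c})

# Each row is a document, each column a word, each cell a frequency
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```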

Here's an excerpt from J.R.R. Tolkien's epigraph to The Lord of the Rings:

Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie.

The word the appears nine times in this quote, while the words in, for, to, and one each appear four times. The word ring appears three times, as does the word of.

We can create a dataset from this, choosing a subset of words and counting the frequency:

| Word | the | one | ring | to |
|---|---|---|---|---|
| Frequency | 9 | 4 | 3 | 4 |
We can use the Counter class from the collections module to do a simple count for a given string. When counting words, it is normal to convert all letters to lowercase, which we do when creating the string. The code is as follows:

In [None]:
s = """Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie. """.lower()
words = s.split()
from collections import Counter
c = Counter(words)

Printing c.most_common(5) gives the list of the top five most frequently occurring words: the (nine times), followed by for, in, to, and one (four times each). Ties are not handled specially: if several words shared the count at the cutoff, only enough of them to fill the list would be returned.
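
If ties at the cutoff matter, one option is a small helper that returns every word sharing the last qualifying count. The function `most_common_with_ties` below is a hypothetical helper written for this sketch, not part of the Counter class.

```python
from collections import Counter


def most_common_with_ties(counter, n):
    """Return the top n entries plus any others tied with the nth count."""
    ranked = counter.most_common()
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]
    return [(word, count) for word, count in ranked if count >= cutoff]


# Toy example: b and c are tied for second place
c = Counter("a a a b b c c d".split())
print(c.most_common(2))
print(most_common_with_ties(c, 2))
```
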

The bag-of-words model has three major types. The first is to use the raw frequencies, as shown in the preceding example. This has a drawback when documents vary in length, as the overall values will be very different. The second is to use the normalized frequency, where each document's sum equals 1. This is a much better solution, as the length of the document doesn't matter as much. The third is to simply use binary features: a value is 1 if the word occurs at all and 0 if it doesn't. We will use the binary representation in this chapter.
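
The three variants can be compared side by side on a single toy document. This is a sketch using plain dictionaries; the document string is invented for illustration, and words that are absent are treated as 0 implicitly.

```python
from collections import Counter

document = "one ring to rule them all one ring to find them"  # toy document

# 1. Raw frequencies
raw = Counter(document.split())

# 2. Normalized frequencies: the values for one document sum to 1
total = sum(raw.values())
normalized = {word: count / total for word, count in raw.items()}

# 3. Binary features: 1 if the word occurs at all
binary = {word: 1 for word in raw}

print(raw["ring"], normalized["ring"], binary["ring"])
```
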

Another popular (arguably more popular) method for performing normalization is called term frequency - inverse document frequency, or tf-idf. In this weighting scheme, term counts are first normalized to frequencies and then divided by the number of documents in the corpus in which the term appears. We will use tf-idf in Chapter 10, Clustering News Articles.
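
The weighting as described here can be sketched directly. Note the assumption: this follows the simplified description in the text (frequency divided by document count), whereas many implementations, including scikit-learn's, use a logarithm of the inverse document frequency instead. The corpus is a toy example.

```python
from collections import Counter

# Toy corpus, not taken from the tweet dataset
documents = [
    "python code is fun",
    "the python ate the mouse",
    "monty python is fun",
]
tokenized = [doc.split() for doc in documents]

# Number of documents in which each word appears
document_frequency = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(tokens):
    """Normalized term frequency divided by document frequency."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: (count / total) / document_frequency[word]
            for word, count in counts.items()}


weights = [tf_idf(tokens) for tokens in tokenized]
print(weights[0])
```

A word such as python, which appears in every document, ends up with a lower weight than a word such as code, which appears in only one.
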

There are a number of libraries for working with text data in Python. We will use a major one, called the Natural Language Toolkit (NLTK). The scikit-learn library also has the CountVectorizer class that performs a similar action, and it is recommended you take a look at it (we will use it in Chapter 9, Authorship Attribution). However, the NLTK version has more options for word tokenization. If you are doing natural language processing in Python, NLTK is a great library to use.

In [None]:
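# Keep only tweets that were parsed and have a real label (-1 means n/a)
keep = [i for i in range(n_samples) if i in tweet_sample and str(labels[i]).strip() not in ('', '-1')]
sample_tweets = [tweet_sample[i]['text'].lower() for i in keep]
labels = [int(labels[i]) for i in keep]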

In [None]:
import numpy as np
y_true = np.array(labels)

In [None]:
print("{:.1f}% have class 1".format(np.mean(y_true == 1) * 100))

In [None]:
from sklearn.base import TransformerMixin
from nltk import word_tokenize

class NLTKBOW(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [{word: True for word in word_tokenize(document)}
                 for document in X]
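
To show what this transformer produces and what the vectorizing step after it does, here is a dependency-free sketch. The functions `bag_of_words` and `vectorize` are hypothetical stand-ins: the first splits on whitespace instead of calling NLTK, and the second only imitates the idea of mapping dictionaries to matrix columns.

```python
def bag_of_words(documents):
    """Whitespace-split stand-in for the NLTK-based transform step."""
    return [{word: True for word in doc.split()} for doc in documents]


def vectorize(feature_dicts):
    """Map each feature dictionary onto a fixed, sorted column order."""
    vocabulary = sorted({word for features in feature_dicts for word in features})
    rows = [[1 if word in features else 0 for word in vocabulary]
            for features in feature_dicts]
    return vocabulary, rows


docs = ["python is great", "python bit me"]  # toy tweets
vocabulary, rows = vectorize(bag_of_words(docs))
print(vocabulary)
print(rows)
```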

In [None]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

In [None]:

pipeline = Pipeline([('bag-of-words', NLTKBOW()),
                     ('vectorizer', DictVectorizer()),
                     ('naive-bayes', BernoulliNB())
                     ])
scores = cross_val_score(pipeline, sample_tweets, y_true, cv=10, scoring='f1')
print("Score: {:.3f}".format(np.mean(scores)))
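
The `scoring='f1'` argument asks for the F-measure, which combines precision and recall for the positive class. As a rough sketch of that calculation, assuming class 1 is the positive class, the hypothetical helper below computes it by hand on made-up labels.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 is relevant, 2 is not relevant
y_true_example = [1, 1, 1, 2, 2]
y_pred_example = [1, 1, 2, 1, 2]
print(f1_score_binary(y_true_example, y_pred_example))
```

Unlike plain accuracy, this score ignores correctly rejected items, so a classifier cannot score well simply by predicting the majority class when relevant tweets are rare.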