# Social Media Insight Using Naive Bayes


Text-based datasets contain a lot of information, whether they are books, historical documents, social media posts, e-mail, or any of the other ways we communicate in writing. Extracting features from text-based datasets and using them for classification is a difficult problem. There are, however, some common patterns for text mining. In this chapter, we look at disambiguating terms in social media using the Naive Bayes algorithm, which is powerful yet surprisingly simple. Naive Bayes takes a shortcut when computing the probabilities for classification: it assumes that the features are independent of one another given the class, hence the term naive in the name. It can also be extended to other types of datasets quite easily and doesn't rely on numerical features. The model is a common baseline for text mining studies, as it works reasonably well on a variety of datasets.
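
To make the "naive" shortcut concrete, here is a minimal sketch of a Bernoulli-style Naive Bayes classifier written by hand. It treats every word as independent of the others given the class, so the score for a class is just the prior multiplied by one term per vocabulary word. The toy documents, the labels, and the function names `train` and `predict_proba` are made up for illustration; this is not the implementation scikit-learn uses.

```python
from collections import Counter

# Hypothetical toy data: each document is a set of words with a label
# (1 = about the programming language, 0 = about something else).
docs = [({"python", "code"}, 1),
        ({"python", "import"}, 1),
        ({"python", "snake"}, 0),
        ({"snake", "zoo"}, 0)]
vocab = sorted(set().union(*(words for words, _ in docs)))

def train(docs):
    """Count documents per class and, per class, documents containing each word."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {c: Counter() for c in class_counts}
    for words, label in docs:
        word_counts[label].update(words)
    return class_counts, word_counts

def predict_proba(words, class_counts, word_counts):
    """Return P(class | words) under the naive independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(class)
        for w in vocab:
            # P(word present | class), with add-one (Laplace) smoothing
            p_w = (word_counts[c][w] + 1) / (n_c + 2)
            score *= p_w if w in words else (1 - p_w)
        scores[c] = score
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}

class_counts, word_counts = train(docs)
posterior = predict_proba({"python", "code"}, class_counts, word_counts)
print(posterior)
```

Multiplying per-word probabilities is what keeps the model simple: there is no need to estimate how words co-occur, only how often each word appears in each class.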

We will cover the following topics in this chapter:
- Downloading data from social network APIs
- Transformers for text
- Naive Bayes classifier
- Using JSON for saving and loading datasets
- The NLTK library for extracting features from text
- The F-measure for evaluation

## Disambiguation

Text is often called an **unstructured** format. There is a lot of information there, but it is just there; no headings, no required format, loose syntax and other problems prohibit the easy extraction of information from text. The data is also highly connected, with lots of mentions and cross-references—just not in a format that allows us to easily extract it!

We can compare the information stored in a book with that stored in a large database to see the difference. In the book, there are characters, themes, places, and lots of information. However, the book needs to be read and, more importantly, interpreted to gain this information. The database sits on your server with column names and data types. All the information is there and the level of interpretation needed is quite low. Information about the data, such as its type or meaning, is called **metadata**, and text largely lacks it. A book does contain some metadata in the form of a table of contents and an index, but far less than a database does.

One of these problems is **term disambiguation**. When a person uses the word bank, is this a financial message or an environmental message (such as river bank)? This type of disambiguation is quite easy for humans in many circumstances (although mistakes still happen), but much harder for computers.

Here, we will look at disambiguating the use of the term Python in Twitter's stream. A message on Twitter is called a tweet and was originally limited to 140 characters (the limit has since been raised to 280). Either way, there is little room for context. There isn't much metadata available, although hashtags are often used to denote the topic of the tweet.

When people talk about Python, they could be talking about the following things:
- The programming language Python
- Monty Python, the classic comedy group
- The snake Python
- A make of shoe called Python

There can be many other things called Python. The aim of our experiment is to take a tweet mentioning Python and determine whether it is talking about the programming language, based only on the content of the tweet.

## Downloading data from a social network

We are going to download a corpus of data from Twitter and use it to sort out spam from useful content. Twitter provides a robust API for collecting information from its servers and this API is free for small-scale usage. It is, however, subject to some conditions that you'll need to be aware of if you start using Twitter's data in a commercial setting.

First, you'll need to sign up for a Twitter account (which is free). Go to http://twitter.com and register an account if you do not already have one.

Next, you'll need to ensure that you only make a certain number of requests in a given time period. This limit is currently 180 requests per hour. It can be tricky to ensure that you don't breach this limit, so it is highly recommended that you use a library to talk to Twitter's API.
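
If you do end up issuing requests yourself, one simple way to stay under a rate limit is to space the calls out evenly. The following is a minimal sketch under that assumption, taking the 180-requests-per-hour figure quoted above at face value; the `Throttle` class and its parameters are hypothetical and are not part of any Twitter library.

```python
import time

class Throttle:
    """Space calls so that at most max_calls happen per period (in seconds)."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self.clock = clock    # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now

# 180 requests per hour works out to one request every 20 seconds.
throttle = Throttle(max_calls=180, period=60 * 60)
```

Calling `throttle.wait()` immediately before each request keeps the requests at least 20 seconds apart. Even spacing is more conservative than the limit strictly requires, which is a reasonable trade-off for a small download script.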

You will need a key to access Twitter's data. Go to http://twitter.com and sign in to your account.

When you are logged in, go to https://apps.twitter.com/ and click on Create New App.

Create a name and description for your app, along with a website address. If you don't have a website to use, insert a placeholder. Leave the Callback URL field blank for this app, as we won't need it. Agree to the terms of use (if you do) and click on Create your Twitter application.

Keep the resulting website open, as you'll need the access keys that are on this page.

Next, we need a library to talk to Twitter. There are many options; the one I like is simply called twitter, a widely used third-party Python library for the Twitter API.

Note: You can install twitter using pip3 install twitter if you are using pip to install your packages. If you are using another system, check the documentation at https://github.com/sixohsix/twitter.

Create a new IPython Notebook to download the data. We will create several notebooks in this chapter for different purposes, so it might be a good idea to also create a folder to keep track of them. This first notebook, ch6_get_twitter, is specifically for downloading new Twitter data.

First, we import the twitter library and set our authorization tokens. The consumer key and consumer secret are available on the Keys and Access Tokens tab on your Twitter app's page. To get the access tokens, you'll need to click on the Create my access token button, which is on the same page. Enter the keys into the appropriate places in your notebook's code.
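
As a rough sketch of this authorization step, assuming the twitter package from https://github.com/sixohsix/twitter is installed: the placeholder strings and the helper names `unfilled` and `make_authorization` are illustrative and not part of the library. The library's `twitter.OAuth` class takes the access token, access token secret, consumer key, and consumer secret, in that order.

```python
# Placeholder credentials: replace each value with the one from your app's page.
credentials = {
    "consumer_key": "<Your Consumer Key Here>",
    "consumer_secret": "<Your Consumer Secret Here>",
    "access_token": "<Your Access Token Here>",
    "access_token_secret": "<Your Access Token Secret Here>",
}

def unfilled(creds):
    """Return the names of any credentials that are still placeholders."""
    return sorted(name for name, value in creds.items() if value.startswith("<"))

def make_authorization(creds):
    """Build the OAuth object that the twitter package uses to sign requests."""
    import twitter  # imported here so the placeholders can be checked without the package
    return twitter.OAuth(creds["access_token"], creds["access_token_secret"],
                         creds["consumer_key"], creds["consumer_secret"])

missing = unfilled(credentials)
if missing:
    print("Fill in these values first:", ", ".join(missing))
else:
    authorization = make_authorization(credentials)
```

Keeping the four values in one dictionary makes it easy to check that nothing was left as a placeholder before any request is sent.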

In [None]:
# Locations of the downloaded tweets and of their class labels for the Twitter dataset.
import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
classes_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")

In [None]:
import json
tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line)['text'])
print("Loaded {} tweets".format(len(tweets)))

In [None]:
with open(classes_filename) as inf:
    labels = json.load(inf)

In [None]:
n_samples = min(len(tweets), len(labels))

In [None]:
sample_tweets = [t.lower() for t in tweets[:n_samples]]
labels = labels[:n_samples]

In [None]:
import numpy as np
y_true = np.array(labels)

In [None]:
print("{:.1f}% have class 1".format(np.mean(y_true == 1) * 100))

In [None]:
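from sklearn.base import BaseEstimator, TransformerMixin
# word_tokenize needs NLTK's tokenizer data to be downloaded beforehand.
from nltk import word_tokenize

class NLTKBOW(BaseEstimator, TransformerMixin):
    """Turn each document into a bag-of-words dictionary of {word: True}."""

    def fit(self, X, y=None):
        # Nothing to learn: the transformation is stateless.
        return self

    def transform(self, X):
        return [{word: True for word in word_tokenize(document)}
                for document in X]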
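
To see what a transformer like this produces, and roughly what a dictionary vectorizer then does with it, here is a dependency-free sketch. It makes two simplifying assumptions: `str.split()` stands in for NLTK's `word_tokenize`, and a plain list of lists stands in for the sparse matrix a real vectorizer would return. The two example sentences and the `bag_of_words` name are made up for illustration.

```python
def bag_of_words(document):
    # str.split() is a crude stand-in for nltk.word_tokenize
    return {word: True for word in document.lower().split()}

documents = ["python is a programming language",
             "the python is a snake"]
bags = [bag_of_words(d) for d in documents]

# One column per distinct word, in sorted order
vocabulary = sorted(set().union(*bags))

# One row per document: 1 if the word occurs in it, 0 otherwise
matrix = [[1 if word in bag else 0 for word in vocabulary] for bag in bags]

for row in matrix:
    print(row)
```

Each row records only whether a word is present, not how often, which matches the binary features a Bernoulli Naive Bayes model expects.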

In [None]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

In [None]:

pipeline = Pipeline([('bag-of-words', NLTKBOW()),
                     ('vectorizer', DictVectorizer()),
                     ('naive-bayes', BernoulliNB())
                     ])
scores = cross_val_score(pipeline, sample_tweets, y_true, cv=10, scoring='f1')
print("Score: {:.3f}".format(np.mean(scores)))
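
The `scoring='f1'` argument asks for the F-measure (F1 score), which is the harmonic mean of precision and recall for the positive class. As a minimal sketch of how that number is computed, the function below works on two lists of labels; the function name and the example labels are made up for illustration and are not results from the Twitter dataset.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 score for the positive class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)  # how many predicted positives were correct
    recall = tp / (tp + fn)     # how many actual positives were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 1 = programming language, 0 = anything else
example_true = [1, 1, 1, 0, 0, 0]
example_pred = [1, 1, 0, 1, 0, 0]
example_f1 = f1_score_binary(example_true, example_pred)
print("F1: {:.3f}".format(example_f1))
```

Unlike plain accuracy, this score ignores true negatives, so a classifier cannot score well simply by predicting the majority class when the positive class is rare.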