# Off-Platform Project: Classifying Tweets

In this off-platform project, you will use a Naive Bayes Classifier to find patterns in real tweets. Three files contain tweets gathered from New York, London, and Paris: `new_york.json`, `london.json`, and `paris.json`.

The goal is to build a classifier that takes any tweet (or sentence) and predicts whether it came from New York, London, or Paris.

## Step 1. Investigate the Data

To begin, let's take a look at the data. 


In [1]:
import pandas as pd

# import NYC tweet data
new_york_tweets = pd.read_json("new_york.json", lines=True)

# print the number of tweets
print(len(new_york_tweets))

# print the features of a tweet
print(new_york_tweets.columns)

# print the text of the tweet at index 12 (the 13th tweet) in the NYC dataset
print(new_york_tweets.loc[12]["text"])

4723
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')
Be best #ThursdayThoughts
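
The `lines=True` argument tells pandas that each line of the file is a separate JSON object (the JSON Lines layout), rather than the whole file being one large JSON document. As a rough illustration using only the standard library, with made-up records standing in for real tweets:

```python
import io
import json

# Hypothetical stand-in for a file such as new_york.json:
# one JSON object per line, each with a "text" field.
fake_file = io.StringIO(
    '{"id": 1, "text": "first tweet", "lang": "en"}\n'
    '{"id": 2, "text": "second tweet", "lang": "en"}\n'
    '{"id": 3, "text": "troisième tweet", "lang": "fr"}\n'
)

# Parse each non-empty line on its own, which is roughly what lines=True asks pandas to do.
records = [json.loads(line) for line in fake_file if line.strip()]

print(len(records))        # number of tweets
print(sorted(records[0]))  # field names of the first tweet
print(records[2]["text"])  # positions start at 0, so index 2 is the third tweet
```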


Let's do the same for London and Paris data.

In [2]:
# import London data
london_tweets = pd.read_json("london.json", lines=True)

# print the number of tweets
print(len(london_tweets))

# print the features of a tweet
print(london_tweets.columns)

# print the text of the tweet at index 20 (the 21st tweet) in the London dataset
print(london_tweets.loc[20]["text"])

5341
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'extended_tweet', 'quote_count',
       'reply_count', 'retweet_count', 'favorite_count', 'entities',
       'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities'],
      dtype='object')
@JoshUJWorld @chris1989jean @richardupshall Local old people’s home please


In [3]:
# import Paris data
paris_tweets = pd.read_json("paris.json", lines=True)

# print the number of tweets
print(len(paris_tweets))

# print the features of a tweet
print(paris_tweets.columns)

# print the text of the tweet at index 134 (the 135th tweet) in the Paris dataset
print(paris_tweets.loc[134]['text'])

2510
Index(['created_at', 'id', 'id_str', 'text', 'source', 'truncated',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'display_text_range',
       'extended_entities', 'possibly_sensitive', 'quoted_status_id',
       'quoted_status_id_str', 'quoted_status', 'quoted_status_permalink',
       'extended_tweet'],
      dtype='object')
Une chute d’arbre sur la ligne D, une chute d’arbre sur la ligne A
Soit la nature m’en veut, soit je sème des arbre… https://t.co/aJFYqwCcmv


## Step 2. Classifying using language: Naive Bayes Classifier

We're going to create a Naive Bayes Classifier! Let's begin by looking at the way language is used differently in these three locations. First, grab the text of all of the tweets in each dataset as a list.

Then combine the three lists into a single variable named `all_tweets`.

Let's also make the labels associated with those tweets. `0` represents a New York tweet, `1` represents a London tweet, and `2` represents a Paris tweet.

In [4]:
# grab the text of the tweets in each dataset
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()

# combine three lists into one
all_tweets = new_york_text + london_text + paris_text

# combine labels as well
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)
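
Because `labels` is built in the same order as `all_tweets`, position `i` in one list always describes position `i` in the other. A small sketch with placeholder strings (not real tweets) shows the alignment and how to check how many examples each class has:

```python
from collections import Counter

# Placeholder data standing in for the three lists of tweet text.
toy_new_york = ["ny tweet a", "ny tweet b", "ny tweet c"]
toy_london = ["london tweet a", "london tweet b"]
toy_paris = ["paris tweet a"]

toy_tweets = toy_new_york + toy_london + toy_paris
toy_labels = [0] * len(toy_new_york) + [1] * len(toy_london) + [2] * len(toy_paris)

# Every tweet must have exactly one label.
assert len(toy_tweets) == len(toy_labels)

# Pair each tweet with its label to confirm the ordering lines up.
for text, label in zip(toy_tweets, toy_labels):
    print(label, text)

# Count examples per class; uneven counts mean the classes are imbalanced.
class_sizes = Counter(toy_labels)
print(class_sizes)
```

In the real data the classes are uneven (4,723 New York, 5,341 London, and 2,510 Paris tweets), which is worth keeping in mind when reading the accuracy score later.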

## Step 3. Making a Training and Test Set

We can now break our data into a training set and a test set using scikit-learn's `train_test_split` function. Pass the data first, followed by the labels. Set the optional parameter `test_size` to `0.2` so that 20% of the tweets are held out for testing, and set the optional parameter `random_state` to `1` so that your data is split in the same way as the data in our solution code.

The `train_test_split` function returns four items in this order:
1. The training data
2. The testing data
3. The training labels
4. The testing labels

In [5]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(all_tweets, labels, test_size=0.2, random_state=1)

print(len(train_data))
print(len(test_data))

10059
2515
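
The two lengths printed above follow from `test_size=0.2`: there are 12,574 tweets in total, and 0.2 × 12,574 = 2,514.8 is rounded up to 2,515 test tweets, leaving 10,059 for training. The function below is a simplified, hypothetical stand-in (`simple_train_test_split` is not part of scikit-learn, and it will not reproduce scikit-learn's exact shuffle). It only illustrates the idea of shuffling once and slicing the data and the labels with the same indices:

```python
import math
import random

def simple_train_test_split(data, labels, test_size=0.2, seed=1):
    # Shuffle the positions once so data and labels stay aligned.
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)

    # Round the test share up, matching the sizes seen above.
    n_test = math.ceil(test_size * len(data))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    return (
        [data[i] for i in train_idx],
        [data[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )

# Toy data: ten items whose label is simply their original position.
toy_data = list("abcdefghij")
toy_labels = list(range(10))

toy_train, toy_test, toy_train_labels, toy_test_labels = simple_train_test_split(
    toy_data, toy_labels
)
print(len(toy_train), len(toy_test))
```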


## Step 4. Making the Count Vectors

To use a Naive Bayes Classifier, we need to transform our tweets into count vectors. Recall that, conceptually, this changes the sentence `"I love New York, New York"` into a list that contains:

* Two `1`s because the words `"I"` and `"love"` each appear once.
* Two `2`s because the words `"New"` and `"York"` each appear twice.
* Many `0`s because every other word in the training set didn't appear at all.
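
The counting described above can be sketched in a few lines of plain Python. Note that scikit-learn's `CountVectorizer`, with its default settings, lowercases the text and only keeps tokens of two or more characters, so the one-letter word `"I"` would be dropped and `"New"` would be counted as `"new"`. The sketch imitates that default tokenization with a regular expression; the vocabulary is a tiny made-up one rather than the real training vocabulary:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of 2+ word characters, imitating CountVectorizer's defaults.
    return re.findall(r"\b\w\w+\b", text.lower())

def count_vector(text, vocabulary):
    # One count per vocabulary word, in vocabulary order; unknown words are ignored.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary, sorted alphabetically as CountVectorizer does.
vocabulary = ["bye", "london", "love", "new", "paris", "york"]

sentence = "I love New York, New York"
tokens = tokenize(sentence)
vector = count_vector(sentence, vocabulary)
print(tokens)
print(vector)
```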

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# create a CountVectorizer
counter = CountVectorizer()

# teach the counter our vocabulary
counter.fit(train_data)

# transform train_data and test_data into Count Vectors
train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)

# check what a tweet looks like as a Count Vector
print(train_data[3])
print(train_counts[3])

saying bye is hard. Especially when youre saying bye to comfort.
  (0, 5022)	2
  (0, 6371)	1
  (0, 9552)	1
  (0, 12314)	1
  (0, 13903)	1
  (0, 23994)	2
  (0, 27146)	1
  (0, 29397)	1
  (0, 30274)	1


## Step 5. Train and Test the Naive Bayes Classifier

We now have the inputs to our classifier. Let's use the count vectors to train and test the Naive Bayes Classifier!

In [7]:
from sklearn.naive_bayes import MultinomialNB

# make a MultinomialNB
classifier = MultinomialNB()

# train the classifier; fit() calculates all of the probabilities used in Bayes' theorem
# the model is now ready to quickly predict the location of a new tweet
classifier.fit(train_counts, train_labels)

# test the model
predictions = classifier.predict(test_counts)
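
To make concrete what `fit()` and `predict()` compute, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to `MultinomialNB`'s default `alpha=1.0`. It is an illustration on made-up sentences, not scikit-learn's implementation: the function names `train_nb` and `predict_nb`, the whitespace tokenization, and the toy data are all assumptions of this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    # Count how often each class and each (class, word) pair occurs.
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)

    # Prior: share of training documents in each class.
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}

    # Likelihood: smoothed share of each word among all words seen in the class.
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(model, doc):
    # Score each class as log prior + sum of log likelihoods; skip unseen words.
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for token in doc.lower().split():
            if token in log_likelihood[c]:
                score += log_likelihood[c][token]
        scores[c] = score
    return max(scores, key=scores.get)

# Made-up training sentences: 0 = New York, 1 = London, 2 = Paris.
toy_docs = [
    "subway pizza bagel",
    "tube pub tea",
    "metro baguette cafe",
    "pizza subway yankees",
]
toy_labels = [0, 1, 2, 0]

model = train_nb(toy_docs, toy_labels)
print(predict_nb(model, "pizza on the subway"))
print(predict_nb(model, "tea at the pub"))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on long texts.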

## Step 6. Evaluating Your Model

Now that the classifier has made its predictions, let's see how well it did. 

In [8]:
from sklearn.metrics import accuracy_score

model_score = accuracy_score(test_labels, predictions)

# prints the fraction (between 0 and 1) of tweets in the test set that the classifier correctly classified
print(model_score)


0.6779324055666004


Another way to evaluate your model is by looking at the **confusion matrix**. A confusion matrix is a table that describes how your classifier made its predictions. For example, if there were two labels, A and B, a confusion matrix might look like this:

```
9 1
3 5
```

In this example, the first row shows how the classifier classified the true A's. It guessed that 9 of them were A's and 1 of them was a B. The second row shows how the classifier did on the true B's. It guessed that 3 of them were A's and 5 of them were B's.
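
A confusion matrix like the one above can be built by counting (true label, predicted label) pairs. The sketch below does this in plain Python for made-up labels chosen to reproduce the example; `build_confusion_matrix` is a hypothetical helper, not a scikit-learn function:

```python
def build_confusion_matrix(true_labels, predicted_labels, classes):
    # Rows are true classes, columns are predicted classes.
    index = {cls: i for i, cls in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix

# 10 true A's (9 guessed A, 1 guessed B) and 8 true B's (3 guessed A, 5 guessed B).
true_labels = ["A"] * 10 + ["B"] * 8
predicted_labels = ["A"] * 9 + ["B"] * 1 + ["A"] * 3 + ["B"] * 5

matrix = build_confusion_matrix(true_labels, predicted_labels, classes=["A", "B"])
for row in matrix:
    print(*row)

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(matrix)))
toy_accuracy = correct / len(true_labels)
print(toy_accuracy)
```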

For our project using tweets, there were three classes: `0` for New York, `1` for London, and `2` for Paris. The confusion matrix is therefore 3×3, with one row per true city and one column per predicted city.

In [9]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(test_labels, predictions))

[[541 404  28]
 [203 824  34]
 [ 38 103 340]]
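
The accuracy reported earlier can be recovered from this matrix: the diagonal holds the correctly classified tweets, so accuracy is the diagonal sum divided by the total. The same numbers also give a per-city breakdown. The snippet below is a small worked calculation on the matrix printed above; the variable names are just for illustration:

```python
cities = ["New York", "London", "Paris"]
matrix = [
    [541, 404, 28],
    [203, 824, 34],
    [38, 103, 340],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(cities)))
accuracy = correct / total
print(f"accuracy: {correct}/{total} = {accuracy:.4f}")

for i, city in enumerate(cities):
    actual = sum(matrix[i])                    # tweets truly from this city (row sum)
    predicted = sum(row[i] for row in matrix)  # tweets predicted as this city (column sum)
    recall = matrix[i][i] / actual
    precision = matrix[i][i] / predicted
    print(f"{city}: recall {recall:.3f}, precision {precision:.3f}")
```

Recall is the share of a city's tweets that were labeled correctly; precision is the share of predictions for that city that were right.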


## Step 7. Test Your Own Tweet

Nice work! The confusion matrix should line up with your intuition. Tweets that were actually from New York are classified as either New York or London tweets, but almost never as Paris tweets (28 of 973). London tweets are likewise rarely labeled as Paris (34 of 1,061). Paris tweets are misclassified more often than that (141 of 481), but when the classifier does predict Paris it is usually right (340 of 402). Tweets from two English-speaking cities are harder to tell apart than tweets written in different languages.

Now it's your chance to write a tweet and see how the classifier works!

In [10]:
tweet = "Time is such a precious asset, there are so many interesting things to do in the world, no time to be lazy."

tweet_counts = counter.transform([tweet])
tweet_prediction = classifier.predict(tweet_counts)
print(tweet_prediction)

[1]
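
`predict` returns a numeric label, so `[1]` means the classifier guessed London. To see how confident the model is, `MultinomialNB` also provides `predict_proba`, which returns one probability per class. The following self-contained sketch trains on four made-up sentences (not the real tweet data), so its probabilities are only illustrative; `city_names` and the `toy_` variables are names introduced for this example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

city_names = ["New York", "London", "Paris"]

# Made-up training sentences: 0 = New York, 1 = London, 2 = Paris.
toy_tweets = [
    "subway pizza bagel",
    "tube pub tea",
    "metro baguette cafe",
    "pizza subway yankees",
]
toy_labels = [0, 1, 2, 0]

toy_counter = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_counter.fit_transform(toy_tweets), toy_labels)

new_tweet = "pizza on the subway"
new_counts = toy_counter.transform([new_tweet])

# predict gives the most likely label; predict_proba gives one probability per class.
predicted_label = int(toy_classifier.predict(new_counts)[0])
probabilities = toy_classifier.predict_proba(new_counts)[0]

print("Predicted city:", city_names[predicted_label])
for label, probability in zip(toy_classifier.classes_, probabilities):
    print(f"{city_names[label]}: {probability:.3f}")
```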
