# Project: Classifying Tweets

In this project, I will use a Naive Bayes Classifier to find patterns in real tweets. Three files are provided: `new_york.json`, `london.json`, and `paris.json`. They contain tweets that were gathered from those locations.

The goal is to create a classification algorithm that can classify any tweet (or sentence) and predict whether that sentence came from New York, London, or Paris.

# Load the Data

In [128]:
import pandas as pd

london_tweets = pd.read_json('london.json', lines=True)
paris_tweets = pd.read_json('paris.json', lines=True)
new_york_tweets = pd.read_json('new_york.json', lines=True)
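
The `lines=True` argument tells pandas to read the file as JSON Lines, that is, one JSON object per line. As a minimal sketch of what that does, the snippet below parses two made-up records from an in-memory string. The `sample` text and its fields are illustrative and are not taken from the real files.

```python
from io import StringIO

import pandas as pd

# Two made-up records in JSON Lines format: one JSON object per line.
sample = (
    '{"id": 1, "text": "Hello from London"}\n'
    '{"id": 2, "text": "Bonjour de Paris"}\n'
)

# lines=True makes pandas parse each line as a separate record (row).
sample_tweets = pd.read_json(StringIO(sample), lines=True)

print(sample_tweets.shape)
print(sample_tweets["text"].tolist())
```

Each record becomes a row and each key becomes a column, which is why the real DataFrames have columns such as `text` and `lang`.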

# Investigate the Data

To begin, let's take a look at the data and answer a few questions.

* How many London tweets are there? How many Paris or New York ones are there?
* What are the columns, or features, of a tweet?
* What is the text of the tweet at index 12 in the New York dataset?
    


In [115]:

print(len(new_york_tweets))
print(london_tweets.columns)

print(london_tweets.loc[12])
print(new_york_tweets.loc[12]["text"])
print("\n")

print(f'Number of Tweets in London: {len(london_tweets)}')

4723
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'extended_tweet', 'quote_count',
       'reply_count', 'retweet_count', 'favorite_count', 'entities',
       'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities'],
      dtype='object')
created_at                                           2018-07-26 13:39:42+00:00
id                                                         1022476558630572033
id_str                                                     1022476558630572032
text                         I saw this on the BBC and thought you should s...
display_text_

# Classifying using language: Naive Bayes Classifier

We're going to create a Naive Bayes Classifier. Let's begin by looking at the way language is used differently in these three locations. Let's grab the text of all of the tweets and make it one big list. Then combine all three into a variable named `all_tweets`.

Let's also make the labels associated with those tweets.
`0` represents a New York tweet, `1` represents a London tweet, and `2` represents a Paris tweet. Finish the definition of `labels`.

In [116]:
tweets_list = [new_york_tweets, london_tweets, paris_tweets]

all_tweets = []
labels = []
num = 0
for tweets in tweets_list:
    all_tweets += tweets['text'].tolist()
    labels += [num] * len(tweets)
    num += 1
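
The same loop can also be written with `enumerate`, which supplies the label for each dataset without a separate counter. This is a minimal sketch that uses short made-up lists in place of the real `text` columns. All names in it are illustrative.

```python
# Made-up stand-ins for the text of each city's tweets.
new_york_text = ["bagels again", "subway delays"]
london_text = ["cup of tea", "mind the gap", "lovely weather"]
paris_text = ["bonjour"]

all_text = []
all_labels = []
# enumerate yields 0 for New York, 1 for London, 2 for Paris.
for label, texts in enumerate([new_york_text, london_text, paris_text]):
    all_text += texts
    all_labels += [label] * len(texts)

print(all_labels)  # [0, 0, 1, 1, 1, 2]
```

The two lists stay aligned: the label at position `i` always belongs to the text at position `i`.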


# Making a Training and Test Set

We can now break our data into a training set and a test set using scikit-learn's `train_test_split` function. It takes the data, followed by the labels. Set the optional parameter `test_size` to `0.2` so that 20% of the tweets are held out for testing. Finally, set the optional parameter `random_state` to `1`, which makes the split reproducible and matches the split used in the solution code.

This function returns 4 items in this order:
1. The training data
2. The testing data
3. The training labels
4. The testing labels


In [117]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(all_tweets, labels, test_size=0.2, random_state=1)

print(len(train_data))
print(len(test_labels))

10059
2515
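
To see the return order and the effect of `random_state` on a small scale, here is a minimal sketch with ten made-up samples. The names `sample_data` and `sample_labels` are illustrative only.

```python
from sklearn.model_selection import train_test_split

# Ten made-up samples with alternating labels.
sample_data = [f"tweet {i}" for i in range(10)]
sample_labels = [i % 2 for i in range(10)]

train_x, test_x, train_y, test_y = train_test_split(
    sample_data, sample_labels, test_size=0.2, random_state=1
)

# 20% of 10 samples go to the test set, the remaining 80% to training.
print(len(train_x), len(test_x))  # 8 2

# Reusing the same random_state reproduces the same split.
again = train_test_split(sample_data, sample_labels, test_size=0.2, random_state=1)
print(again[1] == test_x)  # True
```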


# Making the Count Vectors

To use a Naive Bayes Classifier, we need to transform our lists of words into count vectors. Recall that this changes the sentence `"I love New York, New York"` into a list that contains:

* Two `1`s because the words `"I"` and `"love"` each appear once.
* Two `2`s because the words `"New"` and `"York"` each appear twice.
* Many `0`s because every other word in the training set didn't appear at all.

To start, create a `CountVectorizer` named `counter`.

Next, call the `.fit()` method using `train_data` as a parameter. This teaches the counter our vocabulary.

Finally, let's transform `train_data` and `test_data` into Count Vectors. Call `counter`'s `.transform()` method using `train_data` as a parameter and store the result in `train_counts`. Do the same for `test_data` and store the result in `test_counts`.

Print `train_data[77]` and `train_counts[77]` to see what a tweet looks like as a Count Vector.

In [130]:
from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer()
counter.fit(train_data)

train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)

print(train_data[77])
print(train_counts[77])


Despite this I’m still loving the heat btw 🌞
  (0, 4879)	1
  (0, 7847)	1
  (0, 12487)	1
  (0, 16565)	1
  (0, 25639)	1
  (0, 26698)	1
  (0, 26897)	1


# Train and Test the Naive Bayes Classifier

We now have the inputs to our classifier. Let's use the CountVectors to train and test the Naive Bayes Classifier.

First, make a `MultinomialNB` named `classifier`.

Next, call `classifier`'s `.fit()` method. This method takes two parameters: the training data and the training labels. `train_counts` contains the training data and `train_labels` contains the labels for that data.

Calling `.fit()` calculates all of the probabilities used in Bayes' theorem. The model is now ready to quickly predict the location of a new tweet.

Finally, let's test our model. Call `classifier`'s `.predict()` method using `test_counts` as a parameter and store the result in `predictions`.

In [119]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

predictions = classifier.predict(test_counts)
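
As a rough illustration of what `.fit()` stores, the sketch below trains a `MultinomialNB` on three made-up tweets, then inspects the learned class priors and the class probabilities for a new sentence. The tiny corpus is an assumption for demonstration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: one tweet per class.
toy_tweets = ["new york subway", "london tea time", "bonjour paris je suis"]
toy_labels = [0, 1, 2]

toy_counter = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_counter.fit_transform(toy_tweets), toy_labels)

# Prior probability of each class: one tweet each, so 1/3 apiece.
priors = np.exp(toy_classifier.class_log_prior_)
print(priors)

# Posterior probability of each class for a new sentence.
new_counts = toy_counter.transform(["je suis"])
probs = toy_classifier.predict_proba(new_counts)
print(probs.round(3))
print(toy_classifier.predict(new_counts))  # [2]
```

`.predict()` returns the class with the highest of these probabilities, so the sentence built from Paris-only words is labeled `2`.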

# Evaluating the Model

Now that the classifier has made its predictions, let's see how well it did. Let's look at two different ways to do this. First, call scikit-learn's `accuracy_score` function. It returns the fraction of tweets in the test set that the classifier classified correctly.



In [120]:
from sklearn.metrics import accuracy_score

print(accuracy_score(test_labels, predictions))

0.6779324055666004


The other way to evaluate the model is by looking at the **confusion matrix**. A confusion matrix is a table that describes how your classifier made its predictions. For example, if there were two labels, A and B, a confusion matrix might look like this:

```
9 1
3 5
```

In this example, the first row shows how the classifier classified the true A's. It guessed that 9 of them were A's and 1 of them was a B. The second row shows how the classifier did on the true B's. It guessed that 3 of them were A's and 5 of them were B's.
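
The two-label example can be reproduced with scikit-learn's `confusion_matrix`. The label lists below are made up so that they match the counts described above.

```python
from sklearn.metrics import confusion_matrix

# 10 true A's followed by 8 true B's.
true_labels = ["A"] * 10 + ["B"] * 8
# Guesses: 9 A's and 1 B for the true A's, then 3 A's and 5 B's for the true B's.
guesses = ["A"] * 9 + ["B"] * 1 + ["A"] * 3 + ["B"] * 5

# labels fixes the row and column order: A first, then B.
matrix = confusion_matrix(true_labels, guesses, labels=["A", "B"])
print(matrix)
```

Rows correspond to the true labels and columns to the predicted labels, so the diagonal holds the correct predictions.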

In this project using tweets, there were three classes: `0` for New York, `1` for London, and `2` for Paris.

In [121]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(test_labels, predictions))

[[541 404  28]
 [203 824  34]
 [ 38 103 340]]
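
The accuracy reported earlier can be recovered from this matrix, along with the share of each city's tweets that were classified correctly. This is a minimal sketch that copies the printed numbers into a NumPy array.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
matrix = np.array([
    [541, 404, 28],   # true New York
    [203, 824, 34],   # true London
    [38, 103, 340],   # true Paris
])

# Accuracy is the diagonal (correct predictions) over the total.
accuracy = np.trace(matrix) / matrix.sum()
print(accuracy)

# Per-class recall: correct predictions divided by each row's total.
recall = np.diag(matrix) / matrix.sum(axis=1)
for city, r in zip(["New York", "London", "Paris"], recall):
    print(f"{city}: {r:.3f}")
```

The rows sum to 2515, the size of the test set, and the diagonal sums to 1705 correct predictions.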


# Test Your Own Tweet

The confusion matrix lines up with intuition. The classifier predicts tweets that were actually from New York as either New York or London tweets, but almost never as Paris tweets (28 of 973). Similarly, it rarely mistakes a London tweet for a Paris one (34 of 1,061), and it classifies most of the tweets that were actually from Paris correctly (340 of 481). Tweets coming from two English-speaking countries are harder to distinguish than tweets in different languages.

Now it's your chance to write a tweet and see how the classifier works! I wrote a `write_tweet` function, which takes the text as a string argument and prints its prediction.
Remember a `0` represents New York, a `1` represents London, and a `2` represents Paris.

In [127]:
def write_tweet(text):
    tweet = text
    tweet_counts = counter.transform([tweet])
    print(classifier.predict(tweet_counts))


write_tweet('Je suis Bartek')
write_tweet('Well it\'s not you, who drink tea')
write_tweet('New york rule')




[2]
[1]
[0]
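
The printed labels are easier to read when mapped back to city names. The helper below is a hypothetical variant of `write_tweet` that returns a name instead of printing an array. The names `CITY_NAMES` and `predict_city` are not part of the original project. So that it runs on its own, the sketch trains on three made-up tweets rather than the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Same label scheme as the project: 0, 1, 2.
CITY_NAMES = {0: "New York", 1: "London", 2: "Paris"}


def predict_city(text, counter, classifier):
    """Return the predicted city name for a single tweet."""
    label = classifier.predict(counter.transform([text]))[0]
    return CITY_NAMES[int(label)]


# Made-up training data standing in for the real tweets.
toy_tweets = ["new york subway", "london tea time", "bonjour paris je suis"]
toy_counter = CountVectorizer().fit(toy_tweets)
toy_classifier = MultinomialNB().fit(toy_counter.transform(toy_tweets), [0, 1, 2])

print(predict_city("je suis", toy_counter, toy_classifier))  # Paris
```

Passing the project's own `counter` and `classifier` in place of the toy ones should give named predictions for the real model.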
