# Off-Platform Project: Classifying Tweets

In this off-platform project, you will use a Naive Bayes Classifier to find patterns in real tweets. We've given you three files: `new_york.json`, `london.json`, and `paris.json`. These three files contain tweets that we gathered from those locations.

The goal is to create a classification algorithm that can classify any tweet (or sentence) and predict whether that sentence came from New York, London, or Paris.

# Investigate the Data

Import the data and print the following information:
* The number of tweets.
* The columns, or features, of a tweet.
* The text of the tweet at index 12 in the New York dataset (the 13th tweet, since indexing starts at 0).


In [1]:
import pandas as pd

new_york_tweets = pd.read_json("new_york.json", lines=True)
print(len(new_york_tweets))
print(new_york_tweets.columns)
print(new_york_tweets.loc[12]["text"])

4723
Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities', 'extended_entities', 'extended_tweet', 'favorite_count',
       'favorited', 'filter_level', 'geo', 'id', 'id_str',
       'in_reply_to_screen_name', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'place',
       'possibly_sensitive', 'quote_count', 'quoted_status',
       'quoted_status_id', 'quoted_status_id_str', 'quoted_status_permalink',
       'reply_count', 'retweet_count', 'retweeted', 'source', 'text',
       'timestamp_ms', 'truncated', 'user', 'withheld_in_countries'],
      dtype='object')
Be best #ThursdayThoughts
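The `lines=True` argument tells pandas that the file holds one JSON object per line rather than a single JSON array. Here is a minimal sketch of that behaviour, assuming a made-up two-record sample held in memory (the records and the `sample_tweets` name are hypothetical, not taken from the project files):

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: each line is one complete JSON object, as in a JSON Lines file.
sample = StringIO(
    '{"id": 1, "text": "first made-up tweet", "lang": "en"}\n'
    '{"id": 2, "text": "second made-up tweet", "lang": "en"}\n'
)

sample_tweets = pd.read_json(sample, lines=True)

print(len(sample_tweets))            # number of rows (one per line)
print(list(sample_tweets.columns))   # keys of the JSON objects become columns
print(sample_tweets.loc[1]["text"])  # .loc[1] selects the row labeled 1, i.e. the second row
```

Because the default index starts at 0, `.loc[12]` on the real data returns the row labeled 12.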


In the code block below, load the London and Paris tweets into DataFrames named `london_tweets` and `paris_tweets`.



In [2]:
# Load the London tweets
london_tweets = pd.read_json("london.json", lines=True)
print(len(london_tweets))
print(london_tweets.columns)
print(london_tweets.loc[12]["text"])

5341
Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities', 'extended_entities', 'extended_tweet', 'favorite_count',
       'favorited', 'filter_level', 'geo', 'id', 'id_str',
       'in_reply_to_screen_name', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'place',
       'possibly_sensitive', 'quote_count', 'quoted_status',
       'quoted_status_id', 'quoted_status_id_str', 'quoted_status_permalink',
       'reply_count', 'retweet_count', 'retweeted', 'source', 'text',
       'timestamp_ms', 'truncated', 'user'],
      dtype='object')
I saw this on the BBC and thought you should see it:

The precious metal sparking a new gold rush - https://t.co/ScW4MOSobZ


In [3]:
# Load the Paris tweets
paris_tweets = pd.read_json("paris.json", lines=True)
print(len(paris_tweets))
print(paris_tweets.columns)
print(paris_tweets.loc[12]["text"])

2510
Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities', 'extended_entities', 'extended_tweet', 'favorite_count',
       'favorited', 'filter_level', 'geo', 'id', 'id_str',
       'in_reply_to_screen_name', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'place',
       'possibly_sensitive', 'quote_count', 'quoted_status',
       'quoted_status_id', 'quoted_status_id_str', 'quoted_status_permalink',
       'reply_count', 'retweet_count', 'retweeted', 'source', 'text',
       'timestamp_ms', 'truncated', 'user'],
      dtype='object')
Hauts-de-Seine : l’incendie d’Issy-les-Moulineaux prive aussi 16 000 foyers de courant https://t.co/Hlb02Fpliy


# Classifying using language: Naive Bayes Classifier

We're going to create a Naive Bayes Classifier! Let's begin by looking at the way language is used differently in these three locations. Let's grab the text of all of the tweets and make it one big list. In the code block below, we've created a list of all the New York tweets. Do the same for `london_tweets` and `paris_tweets`.

Then combine all three into a variable named `all_tweets` by using the `+` operator. For example, `all_tweets = new_york_text + london_text + ...`

Let's also make the labels associated with those tweets. `0` represents a New York tweet, `1` represents a London tweet, and `2` represents a Paris tweet. Finish the definition of `labels`.

In [5]:
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()
all_tweets = new_york_text + london_text + paris_text
# One label per tweet, in the same order as all_tweets
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)

In [9]:
print(all_tweets[:5])
print(labels[:5])

['@DelgadoforNY19 Calendar marked.', 'petition to ban more than one spritz of cologne', 'People really be making up beef with you in they head lol', '30 years old.. wow what a journey... I moved to NYC at 22 young and dumb, without even $100 in my bank account and… https://t.co/awjzsvoGS7', 'At first glance it looked like asparagus with chicken and gravy smothered over it or potatoes. She gotta be extra w… https://t.co/InBNnsKuWu']
[0, 0, 0, 0, 0]
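Before training, it can help to check how balanced the three classes are, since that sets a baseline for judging accuracy later. The sketch below rebuilds the labels from the tweet counts printed earlier (4723, 5341, and 2510) so that it runs without the data files; the names `label_counts` and `majority_share` are illustrative.

```python
from collections import Counter

# Same construction as above, using the dataset sizes printed earlier.
labels = [0] * 4723 + [1] * 5341 + [2] * 2510

label_counts = Counter(labels)
print(label_counts)

# Share of the largest class: the accuracy of always guessing that class.
majority_share = max(label_counts.values()) / len(labels)
print(round(majority_share, 3))
```

A classifier that always guessed London would be right for roughly 42% of the tweets, so a useful model should do better than that.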


# Making a Training and Test Set

We can now break our data into a training set and a test set. We'll use scikit-learn's `train_test_split` function to do this. It takes two required arguments: the data, followed by the labels. Set the optional parameter `test_size` to `0.2` so that 20% of the tweets are held out for testing, and set `random_state` to `1` so that your data is split the same way as in our solution code.

In [11]:
from sklearn.model_selection import train_test_split
train_data, test_data, train_labels, test_labels = train_test_split(
    all_tweets, labels, test_size=0.2, random_state=1
)

In [12]:
print(len(train_data))
print(len(test_data))

10059
2515
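To see what `test_size` and `random_state` do without needing the JSON files, here is a small sketch that uses placeholder strings in place of the real tweets (the `fake_tweets` data is invented, only the dataset sizes come from the output above). It checks that the split sizes match and that repeating the call with the same `random_state` gives the same split.

```python
from sklearn.model_selection import train_test_split

# Placeholder data with the same sizes as the three datasets above.
fake_labels = [0] * 4723 + [1] * 5341 + [2] * 2510
fake_tweets = [f"tweet {i} from city {label}" for i, label in enumerate(fake_labels)]

train_a, test_a, train_labels_a, test_labels_a = train_test_split(
    fake_tweets, fake_labels, test_size=0.2, random_state=1
)
train_b, test_b, train_labels_b, test_labels_b = train_test_split(
    fake_tweets, fake_labels, test_size=0.2, random_state=1
)

print(len(train_a), len(test_a))  # 10059 2515, as in the output above
print(test_a == test_b)           # same random_state, same split

# Tweets and labels are shuffled together, so each pair stays aligned.
print(all(t.endswith(str(l)) for t, l in zip(test_a, test_labels_a)))
```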


# Making the Count Vectors

To use a Naive Bayes Classifier, we need to transform our lists of tweets into count vectors, where each entry records how many times a given word appears in a tweet.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
counter = CountVectorizer()
counter.fit(train_data)  # learn the vocabulary from the training tweets only
train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)
print(train_data[3])
print(train_counts[3])  # "saying" and "bye" each appear twice in this tweet


saying bye is hard. Especially when youre saying bye to comfort.
  (0, 5022)	2
  (0, 6371)	1
  (0, 9552)	1
  (0, 12314)	1
  (0, 13903)	1
  (0, 23994)	2
  (0, 27146)	1
  (0, 29397)	1
  (0, 30274)	1
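The sparse output above lists `(row, column)` pairs with a count, where each column stands for one word in the vocabulary. As a rough sketch of how to read it, the code below fits a separate vectorizer on just that one sentence and maps the columns back to words. The names `demo_counter` and `word_counts` are illustrative, and because the vocabulary here contains only this sentence, the column indices differ from those printed above.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "saying bye is hard. Especially when youre saying bye to comfort."

# Fit on this single sentence only (an assumption made to keep the example small).
demo_counter = CountVectorizer()
demo_counts = demo_counter.fit_transform([sentence])

# vocabulary_ maps each word to its column index; invert it to label the counts.
index_to_word = {index: word for word, index in demo_counter.vocabulary_.items()}
row = demo_counts.toarray()[0]
word_counts = {index_to_word[i]: int(count) for i, count in enumerate(row)}

print(word_counts)
```

With the default settings, `CountVectorizer` lowercases the text and ignores punctuation, so "Especially" is counted as "especially". The sentence has nine distinct words, which matches the nine entries in the output above.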


# Train and Test the Naive Bayes Classifier

We now have the inputs to our classifier. Let's use the count vectors to train and test the Naive Bayes Classifier!

In [18]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)
predictions = classifier.predict(test_counts)
print(predictions)

[0 2 1 ... 1 0 1]
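To make the classifier less of a black box, the sketch below recomputes its probabilities by hand on a tiny invented corpus (the `toy_tweets` are hypothetical, not project data). It assumes the default `alpha=1.0` smoothing of `MultinomialNB`: each word's probability within a city is its count plus one, divided by the city's total word count plus the vocabulary size.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: two short "tweets" per city.
toy_tweets = [
    "subway to brooklyn", "bagel in manhattan",   # label 0: New York
    "tube to camden", "tea in london",            # label 1: London
    "metro vers montmartre", "bonjour paris",     # label 2: Paris
]
toy_labels = np.array([0, 0, 1, 1, 2, 2])

toy_counter = CountVectorizer()
toy_counts = toy_counter.fit_transform(toy_tweets)
toy_classifier = MultinomialNB()  # default alpha=1.0
toy_classifier.fit(toy_counts, toy_labels)

query = toy_counter.transform(["manhattan bagel"])
sklearn_probs = toy_classifier.predict_proba(query)[0]

# Manual Naive Bayes: prior * product of smoothed word probabilities.
dense = toy_counts.toarray()
x = query.toarray()[0]
vocab_size = dense.shape[1]
scores = []
for city in (0, 1, 2):
    word_totals = dense[toy_labels == city].sum(axis=0)
    word_probs = (word_totals + 1) / (word_totals.sum() + vocab_size)
    prior = np.mean(toy_labels == city)
    scores.append(prior * np.prod(word_probs ** x))
manual_probs = np.array(scores) / np.sum(scores)

print(manual_probs.round(3))
print(sklearn_probs.round(3))
```

The two printed arrays should agree, and New York (label 0) gets the highest probability because both query words appear only in the New York examples.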


# Evaluating Your Model

Now that the classifier has made its predictions, let's see how well it did. 


In [19]:
from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels, predictions))


0.6779324055666004


In [20]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_labels, predictions))

[[541 404  28]
 [203 824  34]
 [ 38 103 340]]


# Test Your Own Tweet

In the confusion matrix, each row is a true city and each column is a predicted city. Tweets that were actually from New York are classified as either New York or London tweets, but almost never as Paris tweets (28 of 973). Similarly, only 34 London tweets are misclassified as Paris, and most Paris tweets (340 of 481) are classified correctly. Tweets coming from two English-speaking countries are harder to distinguish than tweets in different languages.
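The confusion matrix printed above can be turned into per-city numbers. This sketch copies the matrix from that output and computes the overall accuracy along with recall (the share of each city's tweets that were classified correctly) and precision (the share of each city's predictions that were right). The variable names are illustrative.

```python
import numpy as np

# Copied from the confusion matrix output above.
# Rows are true labels, columns are predicted labels (0 = New York, 1 = London, 2 = Paris).
matrix = np.array([
    [541, 404, 28],
    [203, 824, 34],
    [38, 103, 340],
])

correct = np.diag(matrix)                 # correctly classified tweets per city
accuracy = correct.sum() / matrix.sum()   # matches accuracy_score above
recall = correct / matrix.sum(axis=1)     # divide by row totals (true counts)
precision = correct / matrix.sum(axis=0)  # divide by column totals (predicted counts)

print(round(accuracy, 4))
for city, r, p in zip(["New York", "London", "Paris"], recall, precision):
    print(f"{city}: recall={r:.3f}, precision={p:.3f}")
```

Paris has the highest precision of the three, which reflects how rarely an English-language tweet is labeled as Paris, while New York has the lowest recall because so many of its tweets are labeled as London.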

Now let's write a tweet and see how the classifier works! 

In [22]:
tweet = "I love coding so much, it's awesome!"
tweet_counts = counter.transform([tweet])
print(classifier.predict(tweet_counts))  # the model predicts London (label 1)

[1]


In [24]:
tweet = "I live in Manhattan!"
tweet_counts = counter.transform([tweet])
print(classifier.predict(tweet_counts))  # the model predicts New York (label 0)

[0]


In [26]:
tweet = "I love Paris!"
tweet_counts = counter.transform([tweet])
print(classifier.predict(tweet_counts))  # the model predicts Paris (label 2). Amazing!

[2]
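Repeating the transform-then-predict steps for every new tweet gets tedious. One possible shortcut, sketched here on an invented toy corpus rather than the real data, is to bundle the vectorizer and classifier with scikit-learn's `make_pipeline` and wrap the prediction in a small helper that returns a city name. The `predict_city` helper, the `CITY_NAMES` mapping, and the toy tweets are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

CITY_NAMES = {0: "New York", 1: "London", 2: "Paris"}

# Hypothetical toy corpus standing in for train_data and train_labels.
toy_tweets = [
    "subway to brooklyn", "bagel in manhattan",   # label 0: New York
    "tube to camden", "tea in london",            # label 1: London
    "metro vers montmartre", "bonjour paris",     # label 2: Paris
]
toy_labels = [0, 0, 1, 1, 2, 2]

# The pipeline vectorizes raw text and then classifies it in one step.
city_model = make_pipeline(CountVectorizer(), MultinomialNB())
city_model.fit(toy_tweets, toy_labels)


def predict_city(tweet):
    """Return the predicted city name for a single raw tweet."""
    label = city_model.predict([tweet])[0]
    return CITY_NAMES[int(label)]


print(predict_city("I love Paris!"))
print(predict_city("I live in Manhattan!"))
```

With the real data, the same pattern would be fitted on `train_data` and `train_labels` instead of the toy lists.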
