# Classifying Tweets

This project uses three data files: `new_york.json`, `london.json`, and `paris.json`. Each file contains tweets gathered from that location.

The goal is to create a classification algorithm that can classify any tweet (or sentence) and predict whether that sentence came from New York, London, or Paris.

![title](download.png)

# Investigate the Data

In [2]:
import pandas as pd

new_york_tweets = pd.read_json("new_york.json", lines=True)
print(len(new_york_tweets))
print()
print(new_york_tweets.columns)
print()
print(new_york_tweets.loc[12]["text"])
print()
new_york_tweets.head(3)

4723

Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')

Be best #ThursdayThoughts



Unnamed: 0,created_at,id,id_str,text,display_text_range,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,...,lang,timestamp_ms,extended_tweet,possibly_sensitive,quoted_status_id,quoted_status_id_str,quoted_status,quoted_status_permalink,extended_entities,withheld_in_countries
0,2018-07-26 13:32:33+00:00,1022474755625164800,1022474755625164800,@DelgadoforNY19 Calendar marked.,"[16, 32]","<a href=""http://twitter.com/download/android"" ...",False,1.022208e+18,1.022208e+18,8.290618e+17,...,en,2018-07-26 13:32:33.060,,,,,,,,
1,2018-07-26 13:32:34+00:00,1022474762491183104,1022474762491183104,petition to ban more than one spritz of cologne,,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,...,en,2018-07-26 13:32:34.697,,,,,,,,
2,2018-07-26 13:32:35+00:00,1022474765750226945,1022474765750226944,People really be making up beef with you in th...,,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,...,en,2018-07-26 13:32:35.474,,,,,,,,


The cell above prints:

* The number of tweets.
* The columns, or features, of a tweet.
* The text of the tweet with index label 12 in the New York dataset (the 13th row, since the index starts at 0).
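
As a small illustration of the same calls, the sketch below reads JSON Lines text from an in-memory string instead of a file. The three records and the names `sample` and `sample_tweets` are made up for this example; only `pd.read_json(..., lines=True)`, `len`, `.columns`, and `.loc` mirror the cell above.

```python
import io
import pandas as pd

# Three made-up records in JSON Lines format (one JSON object per line).
sample = io.StringIO(
    '{"id": 1, "text": "first tweet"}\n'
    '{"id": 2, "text": "second tweet"}\n'
    '{"id": 3, "text": "third tweet"}\n'
)

sample_tweets = pd.read_json(sample, lines=True)

print(len(sample_tweets))            # number of rows
print(list(sample_tweets.columns))   # column names
# .loc selects by index label; with the default 0-based index,
# label 1 is the second row.
print(sample_tweets.loc[1]["text"])
```

Because `.loc` works on index labels and not on ordinal positions, `new_york_tweets.loc[12]` returns the row labelled 12, which is the 13th row of a default-indexed DataFrame.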

In [16]:
london_tweets = pd.read_json("london.json", lines=True)
paris_tweets = pd.read_json("paris.json", lines=True)
print(len(london_tweets))
print(len(paris_tweets))

5341
2510


# Classifying using language: Naive Bayes Classifier

In [17]:
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()

all_tweets = new_york_text + london_text + paris_text
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)

# Making a Training and Test Set

In [18]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the tweets for testing; random_state makes the split reproducible.
train_data, test_data, train_labels, test_labels = train_test_split(
    all_tweets, labels, test_size=0.2, random_state=1
)
print(len(train_data), len(test_data))

10059 2515


# Making the Count Vectors

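To use a Naive Bayes Classifier, we need to transform our lists of tweets (plain strings) into count vectors: one vector per tweet, recording how many times each vocabulary word appears in it.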

In [24]:
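from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer()

# Learn the vocabulary from the training tweets only.
counter.fit(train_data)

# Convert both sets to sparse count matrices using that vocabulary.
train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)

# Compare one raw tweet with its count vector.
print(train_data[3])
print()
print(train_counts[3])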

saying bye is hard. Especially when youre saying bye to comfort.

  (0, 5022)	2
  (0, 6371)	1
  (0, 9552)	1
  (0, 12314)	1
  (0, 13903)	1
  (0, 23994)	2
  (0, 27146)	1
  (0, 29397)	1
  (0, 30274)	1
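
Each line of the sparse output above is a `(row, column)` pair followed by a count; the two entries with a count of 2 match the two words that appear twice in the tweet ("saying" and "bye"). To make the mapping from words to columns easier to see, here is a minimal sketch on two made-up sentences. The names `toy_sentences`, `toy_counter`, `toy_counts`, and `vocab` are hypothetical and not part of the project code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, used only for illustration.
toy_sentences = ["saying bye is hard", "bye bye comfort"]

toy_counter = CountVectorizer()
toy_counts = toy_counter.fit_transform(toy_sentences)

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)
print(vocab)

# Dense view of the sparse matrix: one row per sentence, one column per word.
print(toy_counts.toarray())
```

With the default settings the vocabulary is lowercased and ordered alphabetically, so the second sentence gets a 2 in the "bye" column and zeros for the words it does not contain.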


# Train and Test the Naive Bayes Classifier

In [26]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)
predictions = classifier.predict(test_counts)

# Evaluating the Model

In [28]:
from sklearn.metrics import accuracy_score

print(accuracy_score(test_labels, predictions))

0.6779324055666004


In [29]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_labels, predictions))

[[541 404  28]
 [203 824  34]
 [ 38 103 340]]
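
The accuracy printed earlier can be recovered from this confusion matrix: correct predictions sit on the diagonal, and each row holds the true tweets of one city. The sketch below copies the matrix from the output above and derives accuracy and per-city recall with NumPy; the names `cm`, `accuracy`, and `recall_per_city` are introduced here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels
# (0 = New York, 1 = London, 2 = Paris).
cm = np.array([[541, 404, 28],
               [203, 824, 34],
               [38, 103, 340]])

# Accuracy: correct predictions (diagonal) over all test tweets.
accuracy = np.trace(cm) / cm.sum()

# Recall per city: correct predictions over the true tweets of that city.
recall_per_city = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(recall_per_city)
```

The largest off-diagonal entry is 404: New York tweets predicted as London. Most of the model's errors are confusions between those two cities, while Paris is separated more cleanly.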


# Test Your Own Tweet

## Predict Where My Tweet Is From

### TWEET 1
Times Square is preparing for Phase 2 of reopening in NYC. Tables are appropriately distanced with 2 chairs per table and are being regularly cleaned by the Times Square Alliance Sanitation team (pictured: Lisa Ringgold, Supervisor, and Herbert Murray, Assistant Supervisor

### TWEET 2

Pastel pink skies over iconic London scenes Destellos What have you been revisiting since the easing of lockdown? Let us know belowDorso de la mano con el dedo índice señalando hacia abajo #BecauseImALondoner

### TWEET 3

Boire régulièrement de l'eau, éviter les efforts physiques, maintenir son appartement au frais...  Pendant la #canicule, adoptez les bons réflexes !

New York                   |  London                   |    Paris                |
:-------------------------:|:-------------------------:|:-----------------------:|
![](tweet1.png)            |  ![](tweet2.png)          |  ![](tweet3.png)        |

In [42]:
tweet = "Times Square is preparing for Phase 2 of reopening in NYC. Tables are appropriately distanced with 2 chairs per table and are being regularly cleaned by the Times Square Alliance Sanitation team (pictured: Lisa Ringgold, Supervisor, and Herbert Murray, Assistant Supervisor"

tweet_counts = counter.transform([tweet])
result1 = classifier.predict(tweet_counts)
print(result1)

tweet2 = "Pastel pink skies over iconic London scenes Destellos What have you been revisiting since the easing of lockdown? Let us know belowDorso de la mano con el dedo índice señalando hacia abajo #BecauseImALondoner"

tweet_counts = counter.transform([tweet2])
result2 = classifier.predict(tweet_counts)
print(result2)


tweet3 = "Boire régulièrement de l'eau, éviter les efforts physiques, maintenir son appartement au frais...  Pendant la #canicule, adoptez les bons réflexes !"

tweet_counts = counter.transform([tweet3])
result3 = classifier.predict(tweet_counts)
print(result3)

[0]
[1]
[2]


All three tweets are classified correctly:

[0] -> New York

[1] -> London

[2] -> Paris
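
To print city names directly, the numeric labels can be looked up in a small dictionary. This is a minimal sketch: `city_names` and `label_to_city` are hypothetical helpers, and the mapping assumes the label order used when `labels` was built (0 for New York, 1 for London, 2 for Paris).

```python
# Label order follows how `labels` was built earlier: 0, 1, 2.
city_names = {0: "New York", 1: "London", 2: "Paris"}


def label_to_city(prediction):
    """Map the first element of a predict()-style array to a city name."""
    return city_names[int(prediction[0])]


# A one-element list stands in for the array returned by classifier.predict().
print(label_to_city([0]))
```

In the notebook this could be called as `label_to_city(result1)`, since `classifier.predict` returns a one-element array for a single tweet.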