# Classifying Tweets

This project uses a Naive Bayes Classifier to find patterns in real tweets. There are three files: `new_york.json`, `london.json`, and `paris.json`. Each file contains tweets gathered from the corresponding city.

The goal is to create a classification algorithm that can classify any tweet (or sentence) and predict whether that sentence came from New York, London, or Paris.

## Investigate the Data

To begin, let's take a look at the data. I've imported the datasets and printed the following information:
* The number of tweets.
* The columns, or features, of a tweet.
* The text of the tweet at index 12 (the 13th row) in each dataset.


In [1]:
import pandas as pd

new_york_tweets = pd.read_json("new_york.json", lines=True)
print(len(new_york_tweets))
print(new_york_tweets.columns)
print(new_york_tweets.loc[12]["text"])

4723
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')
Be best #ThursdayThoughts


In [2]:
london_tweets = pd.read_json("london.json", lines=True)
print(len(london_tweets))
print(london_tweets.columns)
print(london_tweets.loc[12]["text"])

5341
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'extended_tweet', 'quote_count',
       'reply_count', 'retweet_count', 'favorite_count', 'entities',
       'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities'],
      dtype='object')
I saw this on the BBC and thought you should see it:

The precious metal sparking a new gold rush - https://t.co/ScW4MOSobZ


In [3]:
paris_tweets = pd.read_json("paris.json", lines=True)
print(len(paris_tweets))
print(paris_tweets.columns)
print(paris_tweets.loc[12]["text"])

2510
Index(['created_at', 'id', 'id_str', 'text', 'source', 'truncated',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'display_text_range',
       'extended_entities', 'possibly_sensitive', 'quoted_status_id',
       'quoted_status_id_str', 'quoted_status', 'quoted_status_permalink',
       'extended_tweet'],
      dtype='object')
Hauts-de-Seine : l’incendie d’Issy-les-Moulineaux prive aussi 16 000 foyers de courant https://t.co/Hlb02Fpliy
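The `lines=True` argument tells pandas to treat every line of the file as its own JSON object (the JSON Lines layout). As a minimal sketch of what that layout looks like, the snippet below parses a two-record sample with only the standard library. The sample records and the `read_json_lines` helper are made up for illustration and are not part of the datasets or of pandas.

```python
import json

# Hypothetical two-record sample in JSON Lines form: one JSON object per line.
sample = (
    '{"id": 1, "text": "Be best", "lang": "en"}\n'
    '{"id": 2, "text": "Bonjour", "lang": "fr"}\n'
)

def read_json_lines(raw):
    """Parse a JSON Lines string into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]

records = read_json_lines(sample)
print(len(records))                 # number of records (tweets)
print(sorted(records[0].keys()))    # the fields of one record
print(records[1]["text"])           # the text of the record at index 1
```

Each parsed record plays the role of one row in the DataFrame, and its keys become the columns.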


## Tokenization

In [4]:
from nltk.tokenize import word_tokenize

def list_tokenizer(list_of_text):
    """Tokenize each text, keeping only lowercased alphabetic tokens."""
    tokenized_list = []
    for text in list_of_text:
        # isalpha() drops punctuation, numbers, and URL fragments;
        # lower() makes the remaining tokens case-insensitive.
        tokens = [word.lower() for word in word_tokenize(text) if word.isalpha()]
        tokenized_list.append(tokens)
    return tokenized_list

new_york_tokenized_text = list_tokenizer(new_york_tweets.text.tolist())
london_tokenized_text = list_tokenizer(london_tweets.text.tolist())
paris_tokenized_text = list_tokenizer(paris_tweets.text.tolist())

print(new_york_tokenized_text[4])

['at', 'first', 'glance', 'it', 'looked', 'like', 'asparagus', 'with', 'chicken', 'and', 'gravy', 'smothered', 'over', 'it', 'or', 'potatoes', 'she', 'got', 'ta', 'be', 'extra', 'https']
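The trailing `'https'` in the output above is what remains of a link: the tokenizer splits the URL into pieces, and only the purely alphabetic piece survives the `isalpha()` filter. The sketch below reproduces that filter on a hand-written token list and shows one possible way to strip links before tokenizing. The token list, `URL_PATTERN`, `strip_urls`, and `keep_alpha_lower` are all illustrative assumptions, not part of the original notebook.

```python
import re

# Assumed pattern: anything starting with http:// or https:// up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(text):
    """Remove http(s) URLs so their fragments never reach the tokenizer."""
    return URL_PATTERN.sub("", text).strip()

def keep_alpha_lower(tokens):
    """Mirror the isalpha()/lower() filter used in list_tokenizer."""
    return [token.lower() for token in tokens if token.isalpha()]

# Hypothetical tokens, shaped like a tokenizer's output for a tweet ending in a link.
tokens = ["She", "got", "ta", "be", "extra", "!", "https", ":", "//t.co/abc"]
print(keep_alpha_lower(tokens))
print(strip_urls("She gotta be extra! https://t.co/abc"))
```

Whether to strip links is a design choice: since nearly every link leaves the same `https` token behind, it may carry little information about the city.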


## Removal of Stopwords

In [5]:
from nltk.corpus import stopwords

def remove_stop_words(tokenized_list, language):
    stop_words = set(stopwords.words(language))
    filtered = []
    for text in tokenized_list:
        new_list = [word for word in text if word not in stop_words]
        filtered.append(new_list)
    return filtered

new_york_important_text = remove_stop_words(new_york_tokenized_text, 'english')
london_important_text = remove_stop_words(london_tokenized_text, 'english')
paris_important_text = remove_stop_words(paris_tokenized_text, 'french')

new_york_important_text[4]

['first',
 'glance',
 'looked',
 'like',
 'asparagus',
 'chicken',
 'gravy',
 'smothered',
 'potatoes',
 'got',
 'ta',
 'extra',
 'https']
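The stop-word language is chosen per dataset, not per tweet, so a tweet written in a different language than its dataset's list keeps all of its stop words. The sketch below illustrates the effect with two tiny hand-written stop-word sets. These sets and the `drop_stop_words` helper are assumptions for illustration and are much smaller than NLTK's real lists.

```python
# Tiny illustrative stop-word sets (NOT the NLTK lists).
EN_STOP = {"the", "of", "and", "is"}
FR_STOP = {"le", "la", "de", "et"}

def drop_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the given stop-word set."""
    return [token for token in tokens if token not in stop_words]

french_tokens = ["le", "chat", "de", "la", "voisine"]

# Filtering French text with the English set removes nothing.
print(drop_stop_words(french_tokens, EN_STOP))
# Filtering with the matching language leaves only the content words.
print(drop_stop_words(french_tokens, FR_STOP))
```

The datasets include a `lang` column, which could in principle be used to choose the list per tweet; the notebook does not do this.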

## Stemming Using the SnowballStemmer

In [6]:
from nltk.stem import SnowballStemmer

def stem(tokenized_list, language):
    # The stemmer is a SnowballStemmer, so it is named accordingly.
    stemmer = SnowballStemmer(language)
    new_list = []
    for text in tokenized_list:
        new_text = []
        for word in text:
            new_text.append(stemmer.stem(word))
        new_list.append(new_text)
    return new_list

new_york_stemmed = stem(new_york_important_text, 'english')
london_stemmed = stem(london_important_text, 'english')
paris_stemmed = stem(paris_important_text, 'french')
new_york_stemmed[4]

['first',
 'glanc',
 'look',
 'like',
 'asparagus',
 'chicken',
 'gravi',
 'smother',
 'potato',
 'got',
 'ta',
 'extra',
 'https']

## Classifying using language: Naive Bayes Classifier

I'm going to create a Naive Bayes Classifier that looks at how language is used differently in these three locations. I will combine the preprocessed text of all the tweets into one big list and build a matching list of labels: `0` represents a New York tweet, `1` represents a London tweet, and `2` represents a Paris tweet.

In [7]:
all_tweets = new_york_stemmed + london_stemmed + paris_stemmed
labels = [0] * len(new_york_stemmed) + [1] * len(london_stemmed) + [2] * len(paris_stemmed)

In [8]:
all_tweets_updated = []
for tweet in all_tweets:
    new_tweet = " ".join(tweet)
    all_tweets_updated.append(new_tweet)
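The two cells above rely on the documents and the labels staying in the same order. A minimal sketch of that bookkeeping on toy data follows; the `toy_*` names and their contents are made up for illustration.

```python
# Hypothetical stemmed token lists, two tweets per city.
toy_new_york = [["first", "glanc"], ["subway", "delay"]]
toy_london = [["saw", "bbc"], ["tube", "strike"]]
toy_paris = [["incendi", "foyer"], ["metro", "grev"]]

# Concatenate in a fixed order and build labels in that same order.
toy_all = toy_new_york + toy_london + toy_paris
toy_labels = [0] * len(toy_new_york) + [1] * len(toy_london) + [2] * len(toy_paris)

# Join each token list back into one string, the form a text vectorizer expects.
toy_docs = [" ".join(tokens) for tokens in toy_all]

for doc, label in zip(toy_docs, toy_labels):
    print(label, doc)
```

Checking that the two lists have the same length is a cheap guard against a mismatch between tweets and labels.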

## Making a Training and Test Set

In [9]:
from sklearn.model_selection import train_test_split
train_data, test_data, train_labels, test_labels = train_test_split(all_tweets_updated, labels, test_size=0.2, random_state=1)
print(len(train_data), len(test_data))

10059 2515
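The printed sizes follow from the dataset totals: 4723 + 5341 + 2510 = 12574 tweets, and 20% of that, rounded up, is 2515. The sketch below reproduces the arithmetic with a simplified shuffle-and-slice split. `simple_split` is a hypothetical helper, not scikit-learn's implementation, and its shuffle order will differ from `train_test_split`.

```python
import math
import random

def simple_split(items, labels, test_size=0.2, seed=1):
    """Shuffle indices, then hold out ceil(test_size * n) items for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_items = [items[i] for i in train_idx]
    test_items = [items[i] for i in test_idx]
    train_lab = [labels[i] for i in train_idx]
    test_lab = [labels[i] for i in test_idx]
    return train_items, test_items, train_lab, test_lab

# Placeholder documents with the same class sizes as the three datasets.
toy_labels = [0] * 4723 + [1] * 5341 + [2] * 2510
toy_docs = [f"tweet {i}" for i in range(len(toy_labels))]

toy_train, toy_test, toy_train_labels, toy_test_labels = simple_split(toy_docs, toy_labels)
print(len(toy_train), len(toy_test))  # 10059 2515
```

Fixing the seed, as `random_state=1` does in the cell above, makes the split reproducible between runs.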


## Making the Count Vectors

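To use a Naive Bayes Classifier, I need to transform the lists of words into count vectors. With `ngram_range=(1, 3)`, the vectorizer counts every one-, two-, and three-word sequence found in the training data.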

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
counter = CountVectorizer(ngram_range = (1, 3))
counter.fit(train_data)
train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)
print(train_data[3])
print(train_counts[3])

say bye hard especi your say bye comfort
  (0, 14107)	2
  (0, 14110)	1
  (0, 14111)	1
  (0, 14112)	1
  (0, 19246)	1
  (0, 30860)	1
  (0, 30876)	1
  (0, 30877)	1
  (0, 43646)	1
  (0, 43655)	1
  (0, 43656)	1
  (0, 88461)	2
  (0, 88489)	2
  (0, 88490)	1
  (0, 88491)	1
  (0, 115969)	1
  (0, 116002)	1
  (0, 116003)	1
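Each line of the sparse output above reads `(row, feature_index) count`. The 18 entries correspond to the 18 distinct 1- to 3-word sequences in the example tweet, which can be checked with a small standard-library sketch. `ngram_counts` is a hypothetical helper that splits on whitespace; `CountVectorizer` uses its own tokenizer, so results may differ on text containing punctuation or one-letter words.

```python
from collections import Counter

def ngram_counts(text, min_n=1, max_n=3):
    """Count every n-gram of length min_n..max_n in a whitespace-split text."""
    tokens = text.split()
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts

counts = ngram_counts("say bye hard especi your say bye comfort")
print(len(counts))                                   # 18 distinct n-grams
print([gram for gram, c in counts.items() if c == 2])  # the n-grams that occur twice
```

Three n-grams occur twice (`say`, `bye`, and `say bye`), which is consistent with the three entries of value `2` in the printed vector.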


## Training and Testing the Naive Bayes Classifier

In [11]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)
predictions = classifier.predict(test_counts)
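To make the idea behind the classifier concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It scores each class by its log prior plus the summed log likelihoods of the words in the document. The functions `train_nb` and `predict_nb` and the toy documents are illustrative assumptions; this is not scikit-learn's implementation, and it works on single words only, not n-grams.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from labeled documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocab = set()
    for doc, label in zip(docs, labels):
        words = doc.split()
        word_counts[label].update(words)
        vocab.update(words)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Add alpha to every word count so unseen words never get probability 0.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for word in doc.split():
            if word in log_likelihood[c]:  # words outside the vocabulary are ignored
                score += log_likelihood[c][word]
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus: 0 = New York, 1 = London, 2 = Paris.
toy_docs = ["subway bagel broadway", "tube pub thames", "metro baguett sein",
            "bagel subway yanke", "pub tube queue", "baguett metro louvr"]
toy_labels = [0, 1, 2, 0, 1, 2]

log_prior, log_likelihood = train_nb(toy_docs, toy_labels)
print(predict_nb("bagel broadway", log_prior, log_likelihood))  # 0
print(predict_nb("metro louvr", log_prior, log_likelihood))     # 2
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.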

## Evaluating the Model

I will evaluate the classifier by printing the accuracy score and the confusion matrix.

In [12]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_labels, predictions)
print(f'The accuracy is: {round((accuracy * 100), 2)}%')

The accuracy is: 72.25%


In [13]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_labels, predictions))

[[566 403   4]
 [225 828   8]
 [ 11  47 423]]
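In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels, so the diagonal holds the correct predictions. The sketch below recomputes the accuracy from the matrix printed above and adds per-city recall and precision. The variable names are illustrative.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
matrix = [[566, 403, 4],
          [225, 828, 8],
          [11, 47, 423]]
cities = ["New York", "London", "Paris"]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total
print(f"accuracy: {round(accuracy * 100, 2)}%")  # 72.25%

for i, city in enumerate(cities):
    recall = matrix[i][i] / sum(matrix[i])                   # share of true tweets found
    precision = matrix[i][i] / sum(row[i] for row in matrix)  # share of predictions correct
    print(f"{city}: recall={recall:.2f}, precision={precision:.2f}")
```

Most of the errors are confusion between New York and London (403 and 225 tweets), while Paris is separated far more cleanly, as one would expect when two of the three cities share a language.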


In [14]:
def predict(tweet):
    # Note: the English stop-word list and stemmer are used for every input.
    stop_words = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    # Lowercase before the stop-word check so capitalized stop words
    # (for example "The") are removed as well.
    tokens = [word.lower() for word in word_tokenize(tweet)]
    tokens = [word for word in tokens if word not in stop_words]
    stemmed_text = " ".join(stemmer.stem(word) for word in tokens)
    tweet_counts = counter.transform([stemmed_text])
    result = classifier.predict(tweet_counts)
    if result[0] == 0:
        return "This tweet is from New York"
    elif result[0] == 1:
        return "This tweet is from London"
    else:
        return "This tweet is from Paris"
    
print(predict("Pouvez-vous parlez plus lentement s’il vous plaît?"))

This tweet is from Paris
