# Emojis Speak More than Words

GOAL: 
    1. give an "issue word" as an input (ex. ocasio, climate change) and find the most related emoji
    to kinda grasp people's opinions
    2. give any word or a saying and get a emoji that is most related ex. sparkle --> ✨


In [7]:
import pickle
import numpy as np
import pandas as pd
from collections import Counter

In [46]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
import scipy.sparse as sp

In [None]:
# read topic words data
from read_tweets import read_tweets
reading = read_tweets()
tweets = reading.emoji_tweets(['ocasio cortez', 'climate change', 'greta'], 
                              num_batches = 20, num_tweets = 100)

In [3]:
from clean_tweets import clean_tweets
cleaning = clean_tweets()
tweets_df = cleaning.tweets_df(tweets)

[nltk_data] Downloading package stopwords to /Users/sara/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
cleaning.top_emojis(tweets_df)

     0    1    2    3    4    5    6    7    8    9   ...   15   16   17   18  \
0     😂    😭    ❤    🔥    🤣    😍    🥺    ♀    ✨    🙏  ...    😊    🙌    🤷    🚨   
1  1029  603  602  338  285  283  218  216  199  198  ...  147  144  143  141   

    19   20   21   22   23   24  
0    👀    🎉    😩    🥰    👏    🤦  
1  139  138  133  131  125  123  

[2 rows x 25 columns]


too many laughing face...

In [8]:
tw = [word for word in tweets_df['emoji']]
tw_counts = Counter(tw)

In [9]:
common_emoji = tweets_df[tweets_df['emoji'].isin(pd.DataFrame(tw_counts.most_common(30))[0])]
common_emoji_removed = tweets_df[~tweets_df['emoji'].isin(pd.DataFrame(tw_counts.most_common(30))[0])]

In [10]:
X = tweets_df['tweets'].values
y = tweets_df['emoji']

In [16]:
stopwords = set(list(ENGLISH_STOP_WORDS) + ['rt', 'follow', 'dm', 'https', 'ur', 'll' ,'amp', 'subscribe', 'don', 've', 'retweet', 'im', 'http','lt'])
tfidf = TfidfVectorizer(max_features=10000, max_df = .8, min_df = .001, stop_words = stopwords, ngram_range = (1,2))
tfidf.fit(X)
X_tfidf = tfidf.transform(X)

Since the emojis that is taking up the majority class in the dataset is too dominant in our predictions, I will try to balance
them out by taking them out seperately and apply random undersampler.

In [18]:
# majority emoji index
majority_idx = np.where(y.isin(pd.DataFrame(tw_counts.most_common(30))[0]))
y_majority = y[y.isin(pd.DataFrame(tw_counts.most_common(30))[0])]

In [21]:
# minority
minority_idx = np.where(~y.isin(pd.DataFrame(tw_counts.most_common(30))[0]))
y_minority = y[~y.isin(pd.DataFrame(tw_counts.most_common(30))[0])]

In [22]:
X_minority = X_tfidf[minority_idx]

In [23]:
X_minority.shape

(8718, 1181)

In [24]:
len(y_majority)

6758

In [25]:
X_majority = X_tfidf[majority_idx]

In [26]:
print(X_majority.shape, len(y_majority))

(6758, 1181) 6758


In [27]:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0, replacement=False)
X_subsample, y_subsample = rus.fit_sample(X_majority, y_majority)
print(X_subsample.shape)
print(len(y_subsample))

Using TensorFlow backend.


(3270, 1181)
3270


when the majority classes are balanced the size dropped to 3270 from 6758.

In [35]:
X_tfidf = sp.vstack((X_subsample, X_minority))
y = np.concatenate((y_subsample, y_minority), axis=None)

In [36]:
X_tfidf

<11988x1181 sparse matrix of type '<class 'numpy.float64'>'
	with 53289 stored elements in Compressed Sparse Row format>

Let's model tweets with naive bayes and svm algorithms

In [31]:
np.random.seed(123)

In [32]:
print("Working with {} tweets".format(len(tweets_df['tweets'].unique())))

Working with 10134 tweets


In [37]:
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=0)

In [42]:
%%time
# Naive Bayes multinom
nb = naive_bayes.MultinomialNB()
nb.fit(X_train, y_train)
# predict the labels on validation dataset
predictions_nb = nb.predict(X_test)
# Use accuracy_score function to get the accuracy
nb_score = accuracy_score(predictions_nb, y_test)*100
print("Naive Bayes Accuracy Score -> ", nb_score)

Naive Bayes Accuracy Score ->  3.9616346955796495
CPU times: user 504 ms, sys: 119 ms, total: 623 ms
Wall time: 802 ms


In [43]:
%%time
sgd = SGDClassifier(loss="log", alpha=.0001, max_iter=50, penalty="elasticnet")
sgd.fit(X_train, y_train)
# predict the labels on validation dataset
predictions_sgd = nb.predict(X_test)
# Use accuracy_score function to get the accuracy
sgd_score = accuracy_score(predictions_sgd, y_test)*100
print("Stochastic Gradient Descent Accuracy Score -> ", sgd_score)

Stochastic Gradient Descent Accuracy Score ->  3.9616346955796495
CPU times: user 12.7 s, sys: 77.5 ms, total: 12.7 s
Wall time: 13 s


In [44]:
%%time
# Naive Bayes gaussian
gnb = naive_bayes.GaussianNB()
gnb.fit(X_train.todense(), y_train)
predictions_gnb = gnb.predict(X_test.todense())
gnb_score = accuracy_score(predictions_gnb, y_test)*100
print("Gaussian Naive Bayes Accuracy Score -> ", gnb_score)

Gaussian Naive Bayes Accuracy Score ->  1.834862385321101
CPU times: user 18 s, sys: 4.5 s, total: 22.6 s
Wall time: 24.4 s


In [47]:
%%time
# SVM
svm = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto', probability=True)
svm.fit(X_train, y_train)
# predict the labels on validation dataset
predictions_svm = svm.predict(X_test)
# Use accuracy_score function to get the accuracy
score = accuracy_score(predictions_svm, y_test)*100
print("SVM Accuracy Score -> ",score)

SVM Accuracy Score ->  3.878231859883236
CPU times: user 3min 42s, sys: 995 ms, total: 3min 43s
Wall time: 3min 44s


In [58]:
# prediction
def prediction(model, text, top_n = 5):
    test_tfidf = tfidf.transform([text])
    if model == gnb:
        probs = model.predict_proba(test_tfidf.todense())
        predict_rank = pd.DataFrame({type(model).__name__+' predictions': gnb.classes_, 'probs': probs[0]})
        predict_rank = predict_rank[predict_rank['probs']>0]
        predict_rank = predict_rank.sort_values(by = 'probs', ascending = False)

    else:
        probs = model.predict_proba(test_tfidf.todense())
        predict_rank = pd.DataFrame({type(model).__name__+' predictions': model.classes_, 'probs': probs[0]})
        predict_rank = predict_rank.sort_values(by = 'probs', ascending = False)
    
    return(predict_rank[:top_n].reset_index(drop=True)) 


def print_prediction(text, models = [gnb, sgd, nb, svm], top_n = 5):
    df = pd.DataFrame()
    for i in models:
        df = pd.concat([df, prediction(i, text, top_n)], axis=1)
    print('top {} predictions for {} is:'.format(top_n, text))
    return(df)

In [59]:
print_prediction(text = 'trump', top_n = 5)

top 5 predictions for trump is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,✳,1.0,🙄,0.011932,🇺🇸,0.027303,🇺🇸,0.019855
1,🔹,3.6296039999999996e-19,🇺🇸,0.010432,🙄,0.025196,🙄,0.018551
2,⬆,1.5116709999999998e-36,👇,0.006659,👍,0.021081,👍,0.016725
3,🚂,7.013839e-37,👍,0.006611,💯,0.019423,💯,0.015774
4,✝,1.125989e-47,💯,0.005627,👇,0.018425,❤,0.013888


In [105]:
print_prediction(text = 'family and friends', top_n = 5)

top 5 predictions for family and friends is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,❣,1.0,🙏,0.004573,💯,0.014022,👀,0.016168
1,💞,1.320782e-251,😭,0.003691,👀,0.013604,🌈,0.012484
2,✌,2.458752e-297,👀,0.003635,🙏,0.012978,😭,0.011766
3,,,♀,0.003447,♀,0.012398,♀,0.011762
4,,,💯,0.003026,▶,0.012207,🙏,0.011747


In [61]:
print_prediction(text = 'nothing makes me happier than a box of chocolate', top_n = 5)

top 5 predictions for nothing makes me happier than a box of chocolate is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,🍇,1.0,🍫,0.011397,💕,0.012347,🙌,0.011355
1,🍩,1.330273e-41,🍓,0.005292,🙌,0.011763,💕,0.011031
2,🍏,9.078833e-53,🍒,0.002982,🎉,0.0112,💙,0.00972
3,🍫,2.5660070000000003e-62,😅,0.002827,😅,0.010808,😅,0.009511
4,🍰,5.353955e-100,🔥,0.002744,💙,0.010464,👏,0.009226


In [62]:
print_prediction(text = 'vegan', top_n = 5)

top 5 predictions for vegan is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,🐓,1.0,🌱,0.08556,🌱,0.052167,🌱,0.051256
1,🍧,5.154602e-14,💚,0.010678,💚,0.028289,💚,0.014383
2,📦,5.629186e-25,😋,0.010346,🙌,0.022048,😍,0.013056
3,🍎,1.690493e-30,🙌,0.007129,😍,0.022029,🙌,0.012728
4,🔝,2.786584e-34,😍,0.006986,🤦,0.017476,🤦,0.0116


In [63]:
print_prediction(text = 'summer', top_n = 5)

top 5 predictions for summer is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,🛶,0.9999262,☀,0.064304,☀,0.043573,☀,0.032439
1,👿,7.380386e-05,🌞,0.017167,😎,0.026232,😎,0.015742
2,9️⃣,1.161721e-25,😎,0.012658,😩,0.020328,💖,0.011224
3,☘,9.909009000000001e-27,😩,0.005896,😍,0.016703,🥳,0.010965
4,↗,2.103527e-29,🌊,0.004832,💖,0.014513,🌊,0.010678


In [102]:
print_prediction(text = 'summer bbq party', top_n = 5)

top 5 predictions for summer bbq party is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,🥂,0.9999502,☀,0.013314,😎,0.019299,☀,0.020681
1,🍸,4.975591e-05,😎,0.006509,☀,0.017132,😎,0.014065
2,🏖,9.127978e-60,🌞,0.005897,🙌,0.016888,🌞,0.011599
3,🌴,3.451603e-173,🙌,0.00523,😩,0.01612,😩,0.010735
4,💃,8.269649999999999e-240,😩,0.004256,♂,0.013838,👇,0.010594


In [65]:
print_prediction(text = 'stranger things', top_n = 5)

top 5 predictions for stranger things is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,🌋,1.0,😭,0.004527,😂,0.017186,😊,0.014409
1,,,😂,0.004319,😊,0.014692,😂,0.013899
2,,,☺,0.003998,😘,0.013075,😘,0.013284
3,,,😊,0.003438,😭,0.013032,😭,0.011362
4,,,😳,0.00307,☺,0.012446,🥺,0.010856


In [66]:
print_prediction(text = 'climate change', top_n = 5)

top 5 predictions for climate change is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,🆘,0.9999999,🌍,0.011687,🙄,0.039085,🌍,0.020188
1,👣,6.425125e-08,🙄,0.010044,🌍,0.0224,🙄,0.019182
2,🚒,8.491305e-12,🌎,0.009147,👍,0.022125,♀,0.012484
3,⚖,2.953214e-15,✅,0.006062,♀,0.022088,👇,0.012225
4,♻,2.351706e-25,👍,0.005856,👇,0.019619,🤣,0.011445


In [101]:
print_prediction(text = 'dog', top_n = 5)

top 5 predictions for dog is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,🌽,1.0,🐶,0.010022,👏,0.014246,😭,0.012833
1,🌶,4.88541e-60,👏,0.003865,😭,0.012664,👏,0.0119
2,🐕,3.480263e-70,😭,0.003393,💚,0.012479,😂,0.011702
3,🎾,1.181557e-92,🥺,0.002934,🥺,0.012424,💚,0.011054
4,🐍,8.174489e-102,💚,0.002863,😂,0.012173,🌈,0.010888


In [68]:
print_prediction(text = 'yoga', top_n = 5)

top 5 predictions for yoga is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,🕉,1.0,🧘,0.012289,♀,0.0261,♀,0.014034
1,👦,1.554235e-124,♀,0.011016,😅,0.013098,🤦,0.011845
2,🤨,2.737787e-194,😅,0.00376,🤦,0.013094,😍,0.011221
3,🧘,1.724725e-202,🤦,0.003254,😊,0.01176,😅,0.01114
4,👧,7.103169000000001e-205,💗,0.002898,😍,0.011661,💗,0.010201


In [121]:
print_prediction(text = 'my iphone cracked', top_n = 5)

top 5 predictions for my iphone cracked is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,🔔,1.0,📱,0.004838,❤,0.012674,❤,0.012484
1,📱,1.971204e-52,🔥,0.003454,🔥,0.012653,🔥,0.012304
2,🎁,4.0916379999999995e-224,🌍,0.003271,🤷,0.011373,🌍,0.011367
3,,,🤷,0.003156,♂,0.011158,🤷,0.011302
4,,,♂,0.0031,👍,0.009242,♂,0.011138


In [82]:
print_prediction(text = 'greta thunberg', top_n = 5)

top 5 predictions for greta thunberg is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,🇸🇪,1.0,😅,0.009108,😅,0.018671,😊,0.012172
1,😦,4.472056e-78,🌍,0.005396,🥺,0.013022,💖,0.011521
2,📸,1.419535e-207,🥳,0.003486,💖,0.012526,🥳,0.011514
3,,,🥺,0.003364,😊,0.012429,😅,0.011422
4,,,✊,0.003012,🥳,0.009615,🥺,0.010849


In [71]:
print_prediction(text = 'abortion', top_n = 5)

top 5 predictions for abortion is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,👎,1.0,👩,0.007129,👍,0.014945,👍,0.01513
1,😣,1.218548e-103,🤝,0.004031,🤦,0.012061,🤦,0.013282
2,😑,7.438464e-108,👍,0.003849,♂,0.010951,🚨,0.011677
3,🤝,2.903257e-206,🤦,0.003013,🚨,0.01036,♂,0.01134
4,👩,1.539567e-220,😳,0.002974,😳,0.00983,🔥,0.010032


In [76]:
print_prediction(text = 'my iphone is broken', top_n = 5)

top 5 predictions for my iphone is broken is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,🔔,1.0,📱,0.004838,❤,0.012674,❤,0.012484
1,📱,1.971204e-52,🔥,0.003454,🔥,0.012653,🔥,0.012304
2,🎁,4.0916379999999995e-224,🌍,0.003271,🤷,0.011373,🌍,0.011367
3,,,🤷,0.003156,♂,0.011158,🤷,0.011302
4,,,♂,0.0031,👍,0.009242,♂,0.011138


In [119]:
print_prediction(text = 'cold', top_n = 5)

top 5 predictions for cold is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,🥶,0.9999798,🤦,0.003867,🤦,0.013504,🤦,0.014498
1,☔,2.022491e-05,♂,0.003644,♂,0.012262,🤷,0.011752
2,🍿,1.741959e-22,🤷,0.003367,🤷,0.012226,♂,0.011698
3,🤕,1.4750239999999998e-64,♀,0.002817,👍,0.009239,♀,0.011464
4,💰,6.130437e-308,👍,0.002816,♀,0.009164,🔥,0.011102


In [133]:
print_prediction(text = 'ocean', top_n = 5)

top 5 predictions for ocean is:


Unnamed: 0,GaussianNB predictions,probs,SGDClassifier predictions,probs.1,MultinomialNB predictions,probs.2,SVC predictions,probs.3
0,🦈,1.0,🌊,0.033187,🌊,0.017034,🥺,0.010601
1,🐬,2.034963e-09,🎶,0.004635,🥺,0.012535,💚,0.01043
2,🐟,2.3436170000000003e-28,😔,0.002922,💚,0.012513,🙄,0.010332
3,🐳,5.7059560000000003e-33,🐬,0.002652,🙄,0.011864,🙏,0.009673
4,🐙,2.08541e-36,👌,0.00261,❤,0.011651,🎶,0.009545


The prediction is looking a lot better when I randomly under sampled the top 30 emojis in the dataset. 
The Stochastic Gradient descent classifier which is pretty fast, works pretty well. I can trust emojis with probabilty 
higher than 0.01. Otherwise, Gaussian Naive Bayes model differs from other models and gives somewhat good emojis that the 
other models did not pick up. However, gives very bad assumptions when given an input that is not common in the dataset.

I think worst out of the four models is the SVC which is very slow in terms of running time. Multinomial Naive Bayes is the fastest
and works very similar to SVC model. I think it is a good baseline model.

In [None]:
cross validate train