# Emoji Prediction
A very quick intro to the scikit-learn machine learning library

### Project Description
[Click me!](https://competitions.codalab.org/competitions/17344)

### Emoji Prediction is Natural Language Processing (NLP)
__Wiki def__ - Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. 
<br> [Other applications!](https://machinelearningmastery.com/applications-of-deep-learning-for-natural-language-processing/)</br>
![title](imgs/reading_robot.png)

### Emoji Prediction is Text Classification
__Wiki def__ - In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
![title](imgs/classification.png)

### Emoji Prediction is Sentiment Analysis
__Wiki def__ - Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. 
![title](imgs/sentiment-analysis.jpg)

### Emoji Prediction!
Tweets = Documents
<br> Emojis = Sentiment/Categories </br>

![title](imgs/twitter.png)

### NLP Stages
Data mining
<br> Pre-processing </br>
<br> __Training__ (Supervised Learning)</br>


### Supervised Learning!
 
![title](imgs/supervised_learning.png)

### OUR DATA

In [1]:
print(open("datasets/1k_tweets.txt", encoding="utf8").read())


In [2]:
print(open("datasets/1k_labels.txt", encoding="utf8").read())

In [3]:
print(open("datasets/us_mapping.txt", encoding="utf8").read())

### HOW do we do this???

In [4]:
# standard library
import sys
import time

# scikit imports
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# scikit classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.metrics import accuracy_score
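
All of the classifiers imported above share the same `fit` / `predict` / `score` interface, so one can be swapped for another without touching the rest of the code. A minimal sketch of that idea, assuming a tiny made-up dataset (the `texts`, `labels`, and `candidates` names are hypothetical and not part of this project):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; the labels are made-up category ids.
texts = [
    "love this sunny day", "so happy right now",
    "best coffee ever", "great night with friends",
    "stuck in traffic again", "so tired of rain",
    "worst monday ever", "missed my flight",
]
labels = ["0", "0", "0", "0", "1", "1", "1", "1"]

features = CountVectorizer().fit_transform(texts)

# Any estimator with fit/score can go in this dict.
candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(features, labels)
    scores[name] = clf.score(features, labels)  # training accuracy only

print(scores)
```

The scores here are measured on the training data, so they only show that the interface is interchangeable, not which model generalizes best.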



### Organize data

In [5]:
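# read data: one tweet per line, with its emoji label on the same line of the labels file
# (iterating over the file avoids the empty trailing entry that split('\n') leaves
#  when a file ends with a newline)
with open("datasets/1k_tweets.txt", encoding="utf8") as f:
    tweets = [line.rstrip("\n") for line in f]
with open("datasets/1k_labels.txt", encoding="utf8") as f:
    emojis = [line.rstrip("\n") for line in f]
#tweets = open("datasets/test_tweets.txt", encoding="utf8").read().split('\n')
#emojis = open("datasets/test_labels.txt", encoding="utf8").read().split('\n')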

print(tweets)
print(emojis)

# Learn the vocabulary dictionary and return term-document matrix
# Count Vectorizer - Convert a collection of text documents to a matrix of token counts
count_vect = CountVectorizer()
term_doc_matrix = count_vect.fit_transform(tweets)

# Normalize term document matrix
tfidf_transformer = TfidfTransformer()
normalized_tdm = tfidf_transformer.fit_transform(term_doc_matrix)



['LoL @ West Covina, California ', 'Things got a little festive at the office #christmas2016 @ RedRock… ', 'Step out and explore. # ️ @ Ellis Island Cafe ', '@user @ Cathedral Preparatory School ', "My baby bear @ Bubby's ", "RuPaul's Drag Race bingo fun. Drag Queens be SEXY! #rupaulsdragrace @user abwyman #la… ", 'Black History like a Mufffffaaaaaka #blacchyna done thru her yugioh trap card like hell … ', 'Just light makeup ️ #blueeyes #lupusgirl #photography #modelingagency #modeling #smiling… ', "@ BJ's Restaurant and Brewhouse ", 'So lovely catching up with my soul sister @user @ University of Victoria ', 'Perfect for this weather ️ #dessert #snowice #snowwhite #lasvegas #summer @ Snow White Cafe ', 'Had fun (at @user in New York, NY) ', 'Well Damn @ Oklahoma City, Oklahoma ', "'scuse me while I kiss the sky. ___ : nikkileekv @ Malibu, California ", 'Fun in the sun ️ @ Brownstone Park, Portland, CT ', 'Celebrating #LAstyle @ Calle Tacos ', 'I think today is about to be a great day.

### Real quick...
### [Term Document Matrix](https://en.wikipedia.org/wiki/Document-term_matrix)
![title](imgs/tdm.png)
### [Bag-Of-Words Model](https://en.wikipedia.org/wiki/Bag-of-words_model)
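Example: word order is ignored, so once filler words are dropped and the rest are reduced to their base form, both sentences give the same bag.
<br> "John likes dog yells"  ==> ["John", "like", "dog", "yell"] </br>
<br> "John yells like a dog" ==> ["John", "like", "dog", "yell"] </br>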
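
To make the "order is ignored" point concrete, here is a minimal sketch using only the standard library. The `bag_of_words` helper and the two sentences are made up for illustration, and it does no stemming or stop-word removal, it only counts tokens:

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())


first = bag_of_words("the dog chased the cat")
second = bag_of_words("the cat chased the dog")

print(first)
print(first == second)  # same words, different order
```

The two sentences mean different things, yet their bags are identical, which is exactly the information a bag-of-words model throws away.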
### [Count Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
Converts a collection of text documents to a matrix of token counts: one row per document and one column per vocabulary term (the term-document matrix).
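
A small sketch of what that matrix looks like, assuming a two-document toy corpus (`docs` is made up). It reads the column order from the fitted `vocabulary_` attribute, which maps each token to its column index:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: each string is one "document".
docs = ["the dog chased the cat", "the cat slept"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: rows = documents

# vocabulary_ maps token -> column index; sort tokens by that index.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(columns)
print(counts.toarray())
```

Each row is one document and each number is how many times that column's token appears in it, so "the" gets a 2 in the first row.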

### [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
![title](imgs/TF-IDF_print.png)
![title](imgs/tf-idf-matrix.png)
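
The weighting can also be worked through by hand. This is a sketch under stated assumptions: the counts are made up, and the formula used is the smoothed variant documented for scikit-learn's `TfidfTransformer` defaults (`idf = ln((1 + n) / (1 + df)) + 1`, followed by L2-normalizing each row), which may differ from the textbook formula shown in the images.

```python
import math

# Hypothetical term counts: rows are documents, columns are terms.
terms = ["cat", "dog", "the"]
counts = [
    [1, 0, 2],
    [0, 1, 1],
]

n_docs = len(counts)

# Document frequency: in how many documents does each term appear?
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]


tfidf = [tfidf_row(row) for row in counts]

for term, weight in zip(terms, idf):
    print(term, round(weight, 3))
print(tfidf)
```

"the" appears in every document, so its IDF is the smallest; "cat" and "dog" each appear in only one document and are weighted more heavily.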

In [6]:
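print("----- FEATURES -----")
# get_feature_names() was removed in newer scikit-learn releases;
# get_feature_names_out() replaces it and returns an array, hence .tolist()
print(count_vect.get_feature_names_out().tolist())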

----- FEATURES -----
['00p', '10', '101', '1015', '1103a', '12', '1251', '125th', '13', '14', '14612', '14th', '15', '16', '17', '18', '18236', '1st', '2012', '2014', '2015', '2016', '2017', '21', '21225', '21daystoplantbased', '21st', '22', '24', '245x3', '24inspired', '24th', '25', '2810', '285', '2k16', '2lbs', '2nd', '2x', '30', '30https', '31', '3100', '33', '34th', '360', '388', '39', '3rd', '420', '45p', '4s', '500', '5163uno', '529', '5fofo', '5rights', '60', '60th', '615', '675', '6th', '7000', '713', '75', '7th', '80', '807', '8lbs', '8pm', '95', '98', '___', '__lissuurr', '_ch3w', '_maddyfritz', '_th3r3almik3y_', 'a1', 'aaah', 'able', 'about', 'abraham', 'absolutely', 'abwyman', 'ac', 'academy', 'accomplished', 'achieve', 'achorusline', 'ackeeandsaltfish', 'across', 'actually', 'add', 'addicted', 'admiring', 'admission', 'adoring', 'adult', 'adventures', 'ae', 'aerial', 'af', 'afford', 'afsp', 'after', 'afternoon', 'afterparty', 'afterwork', 'again', 'agentprovacateur', 'agg

In [7]:
print("----- TDM -----")
print(term_doc_matrix.toarray())

----- TDM -----
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [8]:
print("----- NORMALIZED TDM -----")
print(normalized_tdm.toarray())

----- NORMALIZED TDM -----
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
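
The printed arrays look like nothing but zeros because each document uses only a few of the vocabulary terms, which is why scikit-learn returns a sparse matrix in the first place. A quick sketch of how to check that without printing the whole array, assuming a made-up three-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus.
docs = ["the dog chased the cat", "the cat slept", "birds sing"]
matrix = CountVectorizer().fit_transform(docs)

n_docs, n_terms = matrix.shape
density = matrix.nnz / (n_docs * n_terms)  # fraction of non-zero cells

print("shape:", matrix.shape)
print("non-zero entries:", matrix.nnz)
print("density:", round(density, 3))
```

On real tweet data the density is typically far lower than in this toy example, since the vocabulary runs to thousands of terms while a tweet holds only a handful of words.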


### Make your machines learn things

In [9]:
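# fit the model to the training data (TF-IDF features and their emoji labels)
classifier = MLPClassifier().fit(normalized_tdm, emojis)

# try it out: these predictions are made on the same tweets the model was trained on
predictions = classifier.predict(normalized_tdm)

print(predictions)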

['2' '17' '0' '18' '1' '9' '2' '0' '8' '13' '0' '1' '2' '18' '12' '3' '5'
 '14' '16' '1' '0' '13' '8' '16' '10' '1' '0' '4' '13' '9' '6' '6' '6' '4'
 '17' '2' '15' '1' '2' '14' '2' '0' '3' '9' '4' '0' '0' '0' '16' '4' '2'
 '5' '2' '14' '0' '4' '7' '7' '3' '0' '3' '0' '5' '0' '5' '4' '0' '16' '0'
 '0' '0' '14' '5' '1' '8' '5' '0' '7' '9' '0' '4' '12' '6' '3' '8' '9' '6'
 '3' '3' '11' '13' '19' '16' '18' '4' '5' '2' '10' '7' '11' '5' '15' '17'
 '9' '18' '1' '0' '1' '0' '19' '10' '2' '0' '1' '0' '0' '1' '0' '11' '7'
 '3' '0' '0' '2' '2' '11' '18' '2' '15' '0' '13' '4' '0' '8' '3' '0' '5'
 '1' '1' '17' '1' '1' '1' '8' '7' '0' '10' '0' '4' '9' '6' '2' '0' '16'
 '1' '1' '3' '9' '11' '12' '10' '3' '2' '8' '0' '14' '18' '12' '9' '3' '5'
 '19' '11' '8' '0' '6' '0' '16' '5' '13' '1' '2' '8' '6' '12' '17' '17'
 '2' '1' '14' '0' '4' '0' '2' '8' '0' '3' '2' '2' '0' '2' '8' '2' '2' '1'
 '0' '9' '0' '13' '2' '2' '14' '4' '1' '16' '0' '3' '3' '4' '14' '10' '6'
 '18' '17' '16' '2' '5' '17' '3' '1' '0' 

In [10]:
# accuracy_score takes (y_true, y_pred); this is accuracy on the training tweets
print(accuracy_score(emojis, predictions))

0.998
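
That score is measured on the very tweets the classifier was trained on, so it mostly shows that the model memorized them. A fairer check holds some data back. The sketch below is one way to do it, not this project's method: it assumes a tiny made-up dataset, swaps in `MultinomialNB` for speed instead of `MLPClassifier`, and uses `train_test_split` and `make_pipeline`, neither of which is imported in the cells above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the tweets and emoji labels.
texts = [
    "love this sunny day", "so happy right now",
    "best coffee ever", "great night with friends",
    "stuck in traffic again", "so tired of rain",
    "worst monday ever", "missed my flight",
]
labels = ["0", "0", "0", "0", "1", "1", "1", "1"]

# Hold back 25% of the examples; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline learns the vocabulary and IDF weights from the training split only.
model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

print("train accuracy:", train_acc)
print("held-out accuracy:", test_acc)
```

With the real dataset, the gap between the two numbers gives a rough sense of how much the model is overfitting.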
