## Language Detection from Text
(Machine-Learning Text Classification Model)
> *Workflow:*
> * Load necessary **packages and libraries**.
> * **Data Agumention and Organization:** Use data from multiple sources and combine them with proper organisation to increase size of data. Split data into train and test sets.
> * **EDA & Data Visualization:** Analyze, Understand and Explore the Data.
> * **Preprocessing:**
    * **(NLP -> Text Cleaning + Text Vectorization)**
        * **Text cleaning:** Remove emojis, Convert Emoticons to Words, Apply Text Contractions, Correct Abbreviations, Remove Mentions and URLs, Remove Special Characters, Punctuations, HTML tags, Escape Characters and Extra Spaces, Apply Spell Check, Lemmatize/Stem texts.
        * **Text Vectorization:** Convert text data to numerical vectors. (I wil use **Tfidf** vectorization technique, but other methods like Bag of Words or Word embedding can also be used.)
    * **Label Encoding:** Machine Learning models can not understand text labelled categories hence we must convert text labels to numerical values before feeding our data to a ML model.
> * **Training:** Feed processed data to various **Machine Learning/Deep Learning models** and **Test** each model's accuracy using Test data (created during train-test-split).
> * **Create Pipeline:** Choose the **best performing model** and create a **model pipeline**.
> * **Ground Testing:** Use the trained model to predict language from user inputs.

## Load Dependencies
(I prefer loading all the necessary files and libraries at the beginning, makes notebook cleaner!)

In [17]:
# # install extra packages (not included in kaggle by default)
# !pip install nltk # provides text cleaning and normalizing function
# !pip install contractions # should't -> should not
# !pip install pyspellchecker  # python spell checker

In [18]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline

In [19]:
import os
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings("ignore")

In [20]:
import re

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

In [21]:
# pip install contractions

In [22]:
# pip install pyspellchecker

In [23]:
import contractions


In [24]:
from spellchecker import SpellChecker

In [25]:
# Basic Libraries
import os
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings("ignore")

# libraries for preprocessing
import re
import contractions
from spellchecker import SpellChecker
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
from nltk.stem import WordNetLemmatizer

# initiating lemmatizer(kaggle specific):
# Define a install path for nltk
if os.environ.get('KAGGLE_KERNEL_RUN_TYPE', ''):
    nltk_path='/kaggle/working'
else:
    nltk_path="{}".format(os.getcwd())

isnltk_installed = os.path.isdir(f'{nltk_path}/nltk_data/corpora/wordnet')

# Install relevent libraries to nltk path
if isnltk_installed:
    nltk.data.path.append(f'{nltk_path}/nltk_data')
else:
    # Make directort name 'nlrk_data' in current work directory '/kaggle/working/'
    !mkdir nltk_data
    # Download neccessaty package as .zip file ('corpora' directory are automatically created)
    nltk.download('wordnet', f"{nltk_path}/nltk_data")
    nltk.download('omw-1.4', f"{nltk_path}/nltk_data/")
    # Unzip .zip file in folder '/kaggle/working/nltk_data/corpora'
    !unzip /kaggle/working/nltk_data/corpora/wordnet.zip -d /kaggle/working/nltk_data/corpora
    !unzip /kaggle/working/nltk_data/corpora/omw-1.4.zip -d /kaggle/working/nltk_data/corpora
    # Add custom location nltk file data path
    nltk.data.path.append(f'{nltk_path}/nltk_data')

# libraries for ML
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\RA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\RA\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
A subdirectory or file nltk_data already exists.
[nltk_data] Downloading package wordnet to
[nltk_data]     c:\Users\RA\Desktop\emotion_detection/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     c:\Users\RA\Desktop\emotion_detection/nltk_data/...
[nltk_data]   Package omw-1.4 is already up-to-date!
'unzip' is not recognized as an internal or external command,
operable program or batch file.
'unzip' is not recognized as an internal or external command,
operable program or batch file.


## Data Augmentation and Organization
(Use data from multiple sources and combine them with proper organisation to increase size of data)

In [26]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [27]:
# source 1:
s1_train = pd.read_csv('train.txt', names=['text','emotion'], delimiter=';')
s1_val = pd.read_csv('val.txt', names=['text','emotion'], delimiter=';')
s1_test = pd.read_csv('test.txt', names=['text','emotion'], delimiter=';')
data1 = pd.concat((s1_train,s1_val,s1_test), axis=0)
display(data1)

Unnamed: 0,text,emotion
0,im feeling rather rotten so im not very ambiti...,sadness
1,im updating my blog because i feel shitty,sadness
2,i never make her separate from me because i do...,sadness
3,i left with my bouquet of red and yellow tulip...,joy
4,i was feeling a little vain when i did this one,sadness
...,...,...
15995,i just had a very brief time in the beanbag an...,sadness
15996,i am now turning and i feel pathetic that i am...,sadness
15997,i feel strong and good overall,joy
15998,i feel like this was such a rude comment and i...,anger


In [28]:
# classes and value counts:
print(data1.emotion.value_counts())
print("Number of classes: ",len(data1.emotion.value_counts()))

emotion
joy         6761
sadness     5797
anger       2709
fear        2373
love        1641
surprise     719
Name: count, dtype: int64
Number of classes:  6


In [29]:
# source 2
with open('text.txt') as f:
    i=0
    # display first 5 lines of the raw data
    for line in f:
        if(i==5): break;
        print(line)
        i+=1

[ 1.  0.  0.  0.  0.  0.  0.] During the period of falling in love, each time that we met and especially when we had not met for a long time.

[ 0.  1.  0.  0.  0.  0.  0.] When I was involved in a traffic accident.

[ 0.  0.  1.  0.  0.  0.  0.] When I was driving home after  several days of hard work, there was a motorist ahead of me who was driving at 50 km/hour and refused, despite his low speeed to let me overtake.

[ 0.  0.  0.  1.  0.  0.  0.] When I lost the person who meant the most to me.

[ 0.  0.  0.  0.  1.  0.  0.] The time I knocked a deer down - the sight of the animal's injuries and helplessness.  The realization that the animal was so badly hurt that it had to be put down, and when the animal screamed at the moment of death.



In [30]:
# organize data, create dataset:
text = []
labs = []
emotion = []
label_map = ["joy", 'fear', "anger", "sadness", "disgust", "shame", "guilt"]   # information provided with the dataset
with open('text.txt','r') as f:
    for line in f:
        line = line.strip()
        label = line[1:line.find(']')].strip().split()
        sent = line[line.find(']')+1:].strip()
        text.append(sent)
        labs.append(label)
        
        
for i in range(len(labs)):
    for j in range(len(labs[0])):
        if labs[i][j]=='1.':
            emotion.append(label_map[j])
            
            
data2 = pd.concat((pd.Series(text),pd.Series(emotion)), axis=1, keys=['text','emotion'])
data2

Unnamed: 0,text,emotion
0,"During the period of falling in love, each tim...",joy
1,When I was involved in a traffic accident.,fear
2,When I was driving home after several days of...,anger
3,When I lost the person who meant the most to me.,sadness
4,The time I knocked a deer down - the sight of ...,disgust
...,...,...
7475,Two years back someone invited me to be the tu...,anger
7476,I had taken the responsibility to do something...,sadness
7477,I was at home and I heard a loud sound of spit...,disgust
7478,I did not do the homework that the teacher had...,shame


In [31]:
# classes and value counts:
print(data2.emotion.value_counts())
print("Number of classes: ",len(data2.emotion.value_counts()))

emotion
joy        1084
anger      1080
sadness    1079
fear       1078
disgust    1057
guilt      1057
shame      1045
Name: count, dtype: int64
Number of classes:  7


In [32]:
# merge data from both the sources into 1 dataset:
data = pd.concat((data1,data2),axis=0)
data

Unnamed: 0,text,emotion
0,im feeling rather rotten so im not very ambiti...,sadness
1,im updating my blog because i feel shitty,sadness
2,i never make her separate from me because i do...,sadness
3,i left with my bouquet of red and yellow tulip...,joy
4,i was feeling a little vain when i did this one,sadness
...,...,...
7475,Two years back someone invited me to be the tu...,anger
7476,I had taken the responsibility to do something...,sadness
7477,I was at home and I heard a loud sound of spit...,disgust
7478,I did not do the homework that the teacher had...,shame


In [33]:
# classes and value counts:
print(data.emotion.value_counts())
print("Number of classes: ",len(data.emotion.value_counts()))
print('Data Size:',data.shape[0])

emotion
joy         7845
sadness     6876
anger       3789
fear        3451
love        1641
disgust     1057
guilt       1057
shame       1045
surprise     719
Name: count, dtype: int64
Number of classes:  9
Data Size: 27480


### Clean the dataest

In [34]:
data.duplicated().sum()

58

In [35]:
# remove duplicate rows:
idx = data.loc[data.duplicated()==True].index
data.drop(idx, axis=0, inplace=True)
data.reset_index(inplace=True, drop=True)

In [36]:
print(data.emotion.value_counts())
print("Number of classes:",len(data.emotion.value_counts()))
print("Data Size:",data.shape[0])

emotion
joy         7804
sadness     6833
anger       3780
fear        3439
love        1635
disgust     1056
guilt       1053
shame       1038
surprise     715
Name: count, dtype: int64
Number of classes: 9
Data Size: 27353


In [37]:
# check data with same text but different emotions
data.loc[data.text.duplicated()==True].head()

Unnamed: 0,text,emotion
2597,i have had several new members tell me how com...,joy
2871,i feel not having a generous spirit or a forgi...,joy
3220,i don t always feel like i have amazing style ...,surprise
3776,i feel like i am in paradise kissing those swe...,love
3981,i feel so tortured by it,anger


In [38]:
# check few of these texts:
display(data.loc[data.text==data.text.iloc[5045]])
display(data.loc[data.text==data.text.iloc[6102]])
display(data.loc[data.text==data.text.iloc[6529]])
display(data.loc[data.text==data.text.iloc[7566]])

Unnamed: 0,text,emotion
5045,i need to get in touch with what i want and ho...,love


Unnamed: 0,text,emotion
6102,i get the feeling people think im indecisive a...,fear


Unnamed: 0,text,emotion
6529,i feel like its important to vote on all of th...,joy


Unnamed: 0,text,emotion
7566,i drove back to the beach staring at the thing...,sadness


In [39]:
# removing these data:
idx = data[data.text.duplicated()==True].index
data.drop(idx, axis=0, inplace=True)
data.reset_index(inplace=True, drop=True)

In [40]:
print(data.emotion.value_counts())
print("Number of classes:",len(data.emotion.value_counts()))
print("Data Size:",data.shape[0])

emotion
joy         7786
sadness     6829
anger       3774
fear        3432
love        1623
disgust     1054
guilt       1051
shame       1036
surprise     709
Name: count, dtype: int64
Number of classes: 9
Data Size: 27294


In [41]:
val = np.array(data.emotion.value_counts().values)
lab = data.emotion.value_counts().index.tolist()
prob = val/val.sum()
prob

array([0.28526416, 0.25020151, 0.13827215, 0.12574192, 0.05946362,
       0.03861655, 0.03850663, 0.03795706, 0.02597641])

### Train Test Split

In [42]:
# train test split
x_train,x_test,y_train,y_test = train_test_split(data.text.values, data.emotion.values, test_size = 0.2, random_state=42)

x_train.shape,y_train.shape,x_test.shape,y_test.shape

((21835,), (21835,), (5459,), (5459,))

In [43]:
x_train[11]

"When I promised to visit my boyfriend and I didn't fulfil the promise."

## Preprocessing

### Text Preprocessing and Vectorization (NLP)
1. **Text Cleaning**
    * **Lowercasing sentences**
    * **Removing:**
        * Hashtags, Mentions, URLs, Emojis
        * Punctuation, Numbers, Special Characters, Escape Characters etc.
        * Stopwords
    * **Text Normalization:** Using Lemmatization technique
    
    
2. **Vectorization:** Using Tfidf Vectorization method

### Text Cleaning

In [44]:
# function for NLP
def nlp(text):
    def remove_emoji(text):
        emoji_pattern = re.compile(
          '['
          u'\U0001F600-\U0001F64F'  # emoticons
          u'\U0001F300-\U0001F5FF'  # symbols & pictographs
          u'\U0001F680-\U0001F6FF'  # transport & map symbols
          u'\U0001F1E0-\U0001F1FF'  # flags
          u'\U00002702-\U000027B0'
          u'\U000024C2-\U0001F251'
          ']+',
          flags=re.UNICODE)
        return emoji_pattern.sub(r'', text)
    
    def emoticons_to_text(text):
        EMOTICONS = {
            u"xD":"Funny face", u"XD":"Funny face",
            u":3":"Happy face", u":o":"Happy face",
            u"=D":"Laughing",
            u"D:":"Sadness", u"D;":"Great dismay", u"D=":"Great dismay",
            u":O":"Surprise", u":‑O":"Surprise", u":‑o":"Surprise", u":o":"Surprise", u"o_O":"Surprise",
            u":-0":"Shock", u":X":"Kiss", u";D":"Wink or smirk",
            u":p":"cheeky, playful", u":b":"cheeky, playful", u"d:":"cheeky, playful",
            u"=p":"cheeky, playful", u"=P":"cheeky, playful",
            u":L":"annoyed", u":S":"annoyed", u":@":"annoyed",
            u":$":"blushing", u":x":"Sealed lips",
            u"^.^":"Laugh", u"^_^":"Laugh",
            u"T_T":"Sad", u";_;":"Sad", u";n;":"Sad", u";;":"Sad", u"QQ":"Sad"
        }
        for emot in EMOTICONS:
            text = re.sub(emot, EMOTICONS[emot], text)
        return text
    
    def abbr_to_text(text):
        sample_abbr = {"$" : " dollar ", "€" : " euro ", "4ao" : "for adults only", "a.m" : "before midday", "a3" : "anytime anywhere anyplace",
                        "aamof" : "as a matter of fact", "acct" : "account", "adih" : "another day in hell", "afaic" : "as far as i am concerned",
                        "afaict" : "as far as i can tell", "afaik" : "as far as i know", "afair" : "as far as i remember", "afk" : "away from keyboard",
                        "app" : "application", "approx" : "approximately", "apps" : "applications", "asap" : "as soon as possible", "asl" : "age, sex, location",
                        "atk" : "at the keyboard", "ave." : "avenue", "aymm" : "are you my mother", "ayor" : "at your own risk",
                        "b&b" : "bed and breakfast", "b+b" : "bed and breakfast", "b.c" : "before christ", "b2b" : "business to business",
                        "b2c" : "business to customer", "b4" : "before", "b4n" : "bye for now", "b@u" : "back at you", "bae" : "before anyone else",
                        "bak" : "back at keyboard", "bbbg" : "bye bye be good", "bbc" : "british broadcasting corporation", "bbias" : "be back in a second",
                        "bbl" : "be back later", "bbs" : "be back soon", "be4" : "before", "bfn" : "bye for now", "blvd" : "boulevard", "bout" : "about",
                        "brb" : "be right back", "bros" : "brothers", "brt" : "be right there", "bsaaw" : "big smile and a wink", "btw" : "by the way",
                        "bwl" : "bursting with laughter", "c/o" : "care of", "cet" : "central european time", "cf" : "compare", "cia" : "central intelligence agency",
                        "csl" : "can not stop laughing", "cu" : "see you", "cul8r" : "see you later", "cv" : "curriculum vitae", "cwot" : "complete waste of time",
                        "cya" : "see you", "cyt" : "see you tomorrow", "dae" : "does anyone else", "dbmib" : "do not bother me i am busy", "diy" : "do it yourself",
                        "dm" : "direct message", "dwh" : "during work hours", "e123" : "easy as one two three", "eet" : "eastern european time", "eg" : "example",
                        "embm" : "early morning business meeting", "encl" : "enclosed", "encl." : "enclosed", "etc" : "and so on", "faq" : "frequently asked questions",
                        "fawc" : "for anyone who cares", "fb" : "facebook", "fc" : "fingers crossed", "fig" : "figure","fimh" : "forever in my heart", 
                        "ft." : "feet", "ft" : "featuring", "ftl" : "for the loss", "ftw" : "for the win", "fwiw" : "for what it is worth", "fyi" : "for your information",
                        "g9" : "genius", "gahoy" : "get a hold of yourself", "gal" : "get a life", "gcse" : "general certificate of secondary education",
                        "gfn" : "gone for now", "gg" : "good game", "gl" : "good luck", "glhf" : "good luck have fun", "gmt" : "greenwich mean time",
                        "gmta" : "great minds think alike", "gn" : "good night", "g.o.a.t" : "greatest of all time", "goat" : "greatest of all time",
                        "goi" : "get over it", "gps" : "global positioning system", "gr8" : "great", "gratz" : "congratulations", "gyal" : "girl",
                        "h&c" : "hot and cold", "hp" : "horsepower", "hr" : "hour", "hrh" : "his royal highness", "ht" : "height", "ibrb" : "i will be right back",
                        "ic" : "i see", "icq" : "i seek you", "icymi" : "in case you missed it","idc" : "i do not care", "idgadf" : "i do not give a damn fuck",
                        "idgaf" : "i do not give a fuck", "idk" : "i do not know", "ie" : "that is", "i.e" : "that is", "iykyk":"if you know you know",
                        "ifyp" : "i feel your pain", "IG" : "instagram", "ig":"instagram", "iirc" : "if i remember correctly", "ilu" : "i love you",
                        "ily" : "i love you", "imho" : "in my humble opinion", "imo" : "in my opinion", "imu" : "i miss you", "iow" : "in other words",
                        "irl" : "in real life", "j4f" : "just for fun", "jic" : "just in case", "jk" : "just kidding", "jsyk" : "just so you know",
                        "l8r" : "later", "lb" : "pound", "lbs" : "pounds", "ldr" : "long distance relationship", "lmao" : "laugh my ass off", "luv":"love",
                        "lmfao" : "laugh my fucking ass off", "lol" : "laughing out loud", "ltd" : "limited","ltns" : "long time no see", "m8" : "mate",
                        "mf" : "motherfucker", "mfs" : "motherfuckers", "mfw" : "my face when","mofo" : "motherfucker","mph" : "miles per hour","mr" : "mister",
                        "mrw" : "my reaction when", "ms" : "miss", "mte" : "my thoughts exactly", "nagi" : "not a good idea", "nbc" : "national broadcasting company",
                        "nbd" : "not big deal", "nfs" : "not for sale", "ngl" : "not going to lie", "nhs" : "national health service", "nrn" : "no reply necessary",
                        "nsfl" : "not safe for life", "nsfw" : "not safe for work", "nth" : "nice to have", "nvr" : "never", "nyc" : "new york city",
                        "oc" : "original content", "og" : "original", "ohp" : "overhead projector", "oic" : "oh i see", "omdb" : "over my dead body",
                        "omg" : "oh my god", "omw" : "on my way", "p.a" : "per annum", "p.m" : "after midday", "pm" : "prime minister", "poc" : "people of color",
                        "pov" : "point of view", "pp" : "pages", "ppl" : "people", "prw" : "parents are watching", "ps" : "postscript", "pt" : "point",
                        "ptb" : "please text back", "pto" : "please turn over","qpsa" : "what happens", "ratchet" : "rude", "rbtl" : "read between the lines",
                        "rlrt" : "real life retweet",  "rofl" : "rolling on the floor laughing", "roflol" : "rolling on the floor laughing out loud",
                        "rotflmao" : "rolling on the floor laughing my ass off", "rt" : "retweet", "ruok" : "are you ok", "sfw" : "safe for work", "sk8" : "skate",
                        "smh" : "shake my head", "sq" : "square", "srsly" : "seriously", 
                        "ssdd" : "same stuff different day", "tbh" : "to be honest", "tbs" : "tablespooful", "tbsp" : "tablespooful", "tfw" : "that feeling when",
                        "thks" : "thank you", "tho" : "though", "thx" : "thank you", "tia" : "thanks in advance", "til" : "today i learned", "tl;dr" : "too long i did not read", "tldr" : "too long i did not read",
                        "tmb" : "tweet me back", "tntl" : "trying not to laugh", "ttyl" : "talk to you later", "u" : "you", "u2" : "you too", "u4e" : "yours for ever",
                        "utc" : "coordinated universal time", "w/" : "with", "w/o" : "without", "w8" : "wait", "wassup" : "what is up", "wb" : "welcome back",
                        "wtf" : "what the fuck", "wtg" : "way to go", "wtpa" : "where the party at", "wuf" : "where are you from", "wuzup" : "what is up",
                        "wywh" : "wish you were here", "yd" : "yard", "ygtr" : "you got that right", "ynk" : "you never know", "zzz" : "sleeping bored and tired"
                    }
        sample_abbr_pattern = re.compile(r'(?<!\w)(' + '|'.join(re.escape(key) for key in sample_abbr.keys()) + r')(?!\w)')
        text = sample_abbr_pattern.sub(lambda x: sample_abbr[x.group()], text)
        return text
    
    def correct_spellings(text):
        spell = SpellChecker()
        corrected_text = []
        misspelled_words = spell.unknown(text.split())
        for word in text.split():
            if word in misspelled_words:
                corrected_text.append(spell.correction(word))
            else:
                corrected_text.append(word)
        return " ".join(str(word) for word in corrected_text)
    
    
    def lemmatize(text):
        lemmatizer = WordNetLemmatizer()
        words = ' '.join([lemmatizer.lemmatize(word) for word in text.split() if word not in stopwords.words('english')])
        return words
    
    text = remove_emoji(text)
    text = emoticons_to_text(text)
    text = str(text).lower()
    text = abbr_to_text(text)
    text = re.sub(r'<.*?>','', text)  # HTML tags
    text = re.sub(r'https?://\S+|www\.\S+','',text)  # URLs
    text = re.sub(r'@\S+','',text)  # Mentions
    text = re.sub(r'&\S+','',text)  # html characters
    text = re.sub(r'[^\x00-\x7f]','',text)  # non-ASCII
    text = contractions.fix(text)  # update contractions
    text = re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', "", text)  # punctuations, special chars
    text = re.sub(r'\s+', ' ', text)
    text = correct_spellings(text)
    text = lemmatize(text)
    return text

In [45]:
# cleaning texts
x_train_clean = np.array([nlp(text) for text in tqdm(x_train, desc='Progress')])
x_test_clean = np.array([nlp(sent) for sent in tqdm(x_test, desc='Progress')])

Progress:   0%|          | 0/21835 [00:00<?, ?it/s]

Progress:   0%|          | 0/5459 [00:00<?, ?it/s]

In [46]:
# text after cleaning
x_train_clean[11]

'promised visit boyfriend fulfil promise'

### Vectorization

In [47]:
tfidf = TfidfVectorizer()
tfidf.fit(x_train)
x_train_ready = tfidf.transform(x_train_clean)
x_test_ready = tfidf.transform(x_test_clean)

x_train_ready, x_test_ready

(<21835x17974 sparse matrix of type '<class 'numpy.float64'>'
 	with 191470 stored elements in Compressed Sparse Row format>,
 <5459x17974 sparse matrix of type '<class 'numpy.float64'>'
 	with 46960 stored elements in Compressed Sparse Row format>)

### Label Encoding

In [48]:
enc = LabelEncoder()
enc.fit(y_train)
y_train_ready = enc.transform(y_train)
y_test_ready = enc.transform(y_test)

In [49]:
labels = enc.classes_
labels

array(['anger', 'disgust', 'fear', 'guilt', 'joy', 'love', 'sadness',
       'shame', 'surprise'], dtype=object)

## Machine Learning
(Apply processed data to various ML models like: ```Logistic Regression```, ```Naive Bayes```, ```SVM```, ```KNN```, ```Decision Trees```, ```Random Forest Ensemble```, and we also try to make a ```Stacking Ensemble```, by top 2 best preforming models)

In [50]:
result = {}

In [61]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In [51]:
log = LogisticRegression(max_iter=10000, penalty='l1', solver='liblinear')
log.fit(x_train_ready,y_train_ready)
log_score = log.score(x_test_ready,y_test_ready)
result['Logistic Regression']=log_score
print(log_score)

0.7906209928558344


In [62]:
from sklearn.naive_bayes import MultinomialNB


In [52]:
nb = MultinomialNB(fit_prior=False)
nb.fit(x_train_ready,y_train_ready)
nb_score = nb.score(x_test_ready,y_test_ready)
result['Naive Bayes']=nb_score
print(nb_score)

0.6975636563473163


In [63]:
from sklearn.svm import LinearSVC

In [53]:
svm = LinearSVC(C=0.5)
svm.fit(x_train_ready,y_train_ready)
svm_score = svm.score(x_test_ready,y_test_ready)
result['SVM']=svm_score
print(svm_score)

0.7853086645905843


In [64]:
from sklearn.tree import DecisionTreeClassifier

In [54]:
dt = DecisionTreeClassifier(criterion='gini', max_depth=1000, 
                            splitter='best', random_state=1)
dt.fit(x_train_ready,y_train_ready)
dt_score = dt.score(x_test_ready,y_test_ready)
result['Decision Tree']=dt_score
print(dt_score)

0.7497710203333944


In [65]:
from sklearn.neighbors import KNeighborsClassifier

In [55]:
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(x_train_ready,y_train_ready)
knn_score = knn.score(x_test_ready,y_test_ready)
result['KNN']=knn_score
print(knn_score)

0.6702692800879282


In [66]:
from sklearn.ensemble import RandomForestClassifier


In [56]:
rf = RandomForestClassifier(n_estimators=200)
rf.fit(x_train_ready,y_train_ready)
rf_score = rf.score(x_test_ready,y_test_ready)
result['Random Forest']=rf_score
print(rf_score)

0.7810954387250412


In [67]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In [57]:
# stack best performing models and try to create a better model
stack1 = StackingClassifier(estimators=[('nb',LogisticRegression(max_iter=10000, penalty='l1', solver='liblinear')),
                                        ('svm',LinearSVC(C=0.5))], 
                            final_estimator=LogisticRegression(max_iter=10000, penalty='l1', solver='liblinear'))
stack1.fit(x_train_ready,y_train_ready)
stack1_score = stack1.score(x_test_ready,y_test_ready)
result['Stack1: Log+SVM']=stack1_score
print(stack1_score)

0.7959333211210845


## Model Comparison
(Compare preformance of trained models)

In [68]:
import pandas as pd

In [58]:
# display results
result_df = []
for model,score in result.items():
    result_df.append([model,score])
    
result_df = pd.DataFrame(result_df, columns=['Model','Test Score'])
result_df = result_df.style
result_df = result_df.highlight_max(subset=['Test Score'], color = 'lightgreen')
result_df = result_df.highlight_min(subset=['Test Score'], color = 'pink')
display(result_df)

Unnamed: 0,Model,Test Score
0,Logistic Regression,0.790621
1,Naive Bayes,0.697564
2,SVM,0.785309
3,Decision Tree,0.749771
4,KNN,0.670269
5,Random Forest,0.781095
6,Stack1: Log+SVM,0.795933


Note that ```Stack1 Ensemble```, that is created by stacking 2 best preforming individual models: ```Logistic Regression``` and ```SVM``` performed slightly better than both of them 

In [69]:
from sklearn.pipeline import Pipeline

In [59]:
# compile and model and vectorizer using pipeline
model = Pipeline([('vectorizer',tfidf),('model',stack1)])
model.score(x_test_clean,y_test_ready)

0.7959333211210845

In [70]:
import pickle

In [60]:
# save model
pickle.dump(model,open('model.pkl','wb'))
# save preprocessing function
pickle.dump(nlp,open('nlp.pkl','wb'))
# save label encoder
pickle.dump(enc,open('encoder.pkl','wb'))

## Ground Testing

In [71]:
# function to make predictions
def predict(text):
    pred = model.predict([nlp(text)])
    return enc.inverse_transform(pred)[0]

In [72]:
predict("i will succeed one day!"), predict('will i survive?'), predict('such a lying bastard you are!')

('joy', 'joy', 'shame')