In [0]:
import pandas as pd
import numpy as np
import re

# Preliminary analysis of the EmoContext data

The `data` folder contains the following files:

In [0]:
! ls ./data -1

test_with_labels.txt
test_without_labels.txt
training.txt
validation.txt


## Training set

In [0]:
training_df = pd.read_csv("./data/training.txt", sep='\t')

In [0]:
training_df.head()

Unnamed: 0,id,turn1,turn2,turn3,label
0,0,Don't worry I'm girl,hmm how do I know if you are,What's ur name?,others
1,1,When did I?,saw many times i think -_-,No. I never saw you,angry
2,2,By,by Google Chrome,Where you live,others
3,3,U r ridiculous,I might be ridiculous but I am telling the truth.,U little disgusting whore,angry
4,4,Just for time pass,wt do u do 4 a living then,Maybe,others


In [0]:
training_df.tail()

Unnamed: 0,id,turn1,turn2,turn3,label
30155,30155,I don't work,I could take your shift,I am a student,others
30156,30156,I'm not getting you 😭😭😭,Why are you crying??,Because you are not making any sense,sad
30157,30157,Haha,"no, seriously. What is up with that o-o",Had your breakfast?,others
30158,30158,Do you sing?,yea a lil,Nice,others
30159,30159,Me to,People be driving me crazy,Come on sleep with me,others


## Validation set

In [0]:
validation_df = pd.read_csv("./data/validation.txt", sep = "\t")

In [0]:
validation_df.head()

Unnamed: 0,id,turn1,turn2,turn3,label
0,0,Then dont ask me,YOURE A GUY NOT AS IF YOU WOULD UNDERSTAND,IM NOT A GUY FUCK OFF,angry
1,1,Mixed things such as??,the things you do.,Have you seen minions??,others
2,2,Today I'm very happy,and I'm happy for you ❤,I will be marry,happy
3,3,Woah bring me some,left it there oops,Brb,others
4,4,it is thooooo,I said soon master.,he is pressuring me,others


In [0]:
validation_df.tail()

Unnamed: 0,id,turn1,turn2,turn3,label
2750,2750,U are my book,book for what? ugliness? THANK YOU,U like ur self,others
2751,2751,I'll be crying,You just want to make ppl cry:P,ppl,others
2752,2752,Thanks for sending,hahaha you're welcome! 😤😤,Why are u not sending,others
2753,2753,Write it,Mr. F,U understand me?,others
2754,2754,Yes,okay I'll give you a ticket,Ohk,others


## Test set

The test set, unlike the two previous ones, has no `label` column:

In [0]:
test_df = pd.read_csv("./data/test_without_labels.txt", sep='\t')

In [0]:
test_df.head()

Unnamed: 0,id,turn1,turn2,turn3
0,0,Hmm,What does your bio mean?,I don’t have any bio
1,1,What you like,very little things,Ok
2,2,Yes,How so?,I want to fuck babu
3,3,what did you guess,what what,fuck
4,4,We ?,of course we will!,What gender movies you like??


In [0]:
test_df.tail()

Unnamed: 0,id,turn1,turn2,turn3
5504,5504,Not youuu,I also didn't not not.,How to calll
5505,5505,Welcome,"Why, thank you.","I don't know, you tell"
5506,5506,Yes,IF ONLY I COULD AFFORD THIS,How are you
5507,5507,For my information,It's our responsibility to clarify everything.,What is mountain dew?
5508,5508,ok........... where you work ?,I'm off this whole week,ok..............


## Answers for the test set (gold standard)

In [0]:
test_set_answers = pd.read_csv("./data/test_with_labels.txt", sep="\t", usecols=["label"])

In [0]:
test_set_answers.head()

Unnamed: 0,label
0,others
1,others
2,others
3,others
4,others


In [0]:
test_set_answers.tail()

Unnamed: 0,label
5504,others
5505,others
5506,others
5507,others
5508,others


## Unnecessary columns

Since the `id` column duplicates the information in the pandas index, I drop it:

In [0]:
training_df.drop('id', axis=1, inplace=True)
test_df.drop('id', axis=1, inplace=True)
validation_df.drop('id', axis=1, inplace=True)
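Before dropping a column as redundant, it may be worth confirming that it really carries the same values as the index. The sketch below shows one way to do that; `id_matches_index` and `toy_df` are hypothetical names introduced only for illustration, and the toy frame stands in for the real files.

```python
import pandas as pd


def id_matches_index(df, id_col="id"):
    """Return True if `id_col` holds exactly the same values as the index."""
    return bool((df[id_col].to_numpy() == df.index.to_numpy()).all())


# Toy frame standing in for one of the EmoContext files (made-up data).
toy_df = pd.DataFrame({"id": [0, 1, 2], "turn1": ["a", "b", "c"]})

# Drop the column only when it is confirmed to be a copy of the index.
if id_matches_index(toy_df):
    toy_df = toy_df.drop(columns="id")
```

With the real data the same check would have to run right after `pd.read_csv`, before the `drop` calls above.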

## Missing data

None of the sets has missing values:

In [0]:
training_df.isnull().any().any()

False

In [0]:
test_df.isnull().any().any()

False

In [0]:
validation_df.isnull().any().any()

False
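`isnull()` only detects real `NaN` values, so a message that is empty or consists of whitespace would pass the check above. A minimal sketch of a per-column report that counts both cases follows; `missing_report` is a hypothetical helper and the toy frame is made up, so whether the real data contains blank messages is not established here.

```python
import pandas as pd


def missing_report(df, text_cols=("turn1", "turn2", "turn3")):
    """Count NaN values and blank (empty or whitespace-only) strings per column."""
    nan_counts = df.isnull().sum()
    # NaN cells compare as False here, so they are not counted twice.
    blank_counts = pd.Series(
        {col: int(df[col].str.strip().eq("").sum()) for col in text_cols}
    )
    report = pd.DataFrame({"nan": nan_counts, "blank": blank_counts})
    # Columns outside `text_cols` have no blank count; report them as 0.
    return report.fillna(0).astype(int)


toy_df = pd.DataFrame(
    {
        "turn1": ["hi", " ", None],
        "turn2": ["a", "b", "c"],
        "turn3": ["", "x", "y"],
        "label": ["others", "sad", "happy"],
    }
)
report = missing_report(toy_df)
```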

## Emoji encoding

Some messages contain emojis. I think they may be an important feature for predicting the category. How are they encoded?

In [0]:
emoticon_row = training_df.iloc[21]

In [0]:
emoticon_row.loc['turn2']

'Yes I love to dance 😻'

In [0]:
emoticon_row.loc['turn3']

'😂😂😂 so you have legs too'

It turns out that emojis are part of Unicode and are encoded in the same way as any other character. Displaying them correctly requires a font that has glyphs for characters outside the Latin alphabet, such as Chinese characters or emojis. More here: https://stackoverflow.com/questions/19091320/special-characters-emoticons-in-text-file

In [0]:
emoticon_row.loc['turn3'].encode("utf-8")

b'\xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x98\x82 so you have legs too'

In [0]:
emoticon_row.loc['turn2'].encode("utf-8")

b'Yes I love to dance \xf0\x9f\x98\xbb'

In [0]:
emoticon_row_2 = training_df.iloc[38]
emoticon_row_2

turn1                   That was mean
turn2    haha the truth usually is. 👍
turn3            Are you bored of me?
label                          others
Name: 38, dtype: object

In [0]:
emoticon_row_2.loc['turn2'].encode("utf-8")

b'haha the truth usually is. \xf0\x9f\x91\x8d'

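As shown below, indexing a `bytes` object (`b''`) returns the integer value of a byte. Bytes in the ASCII range are only displayed as characters, so the first byte, `h`, is 104: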

In [0]:
emoticon_row_2.loc['turn2'].encode("utf-8")[0]

104

As a Python string the emoji has length 1, because it is a single Unicode code point (even though it takes four bytes in UTF-8):

In [0]:
len(emoticon_row_2.loc['turn2'][-1])

1
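The difference between code points and bytes can be made explicit with a few hand-picked characters. The sketch below is illustrative only: the `samples` dictionary is an assumption and is not drawn from the dataset. It also shows that some emojis are sequences of more than one code point, so `len(...) == 1` does not hold for all of them.

```python
# Hand-picked samples (not taken from the dataset), written as escapes
# so that the code points are unambiguous.
samples = {
    "face with tears of joy": "\U0001F602",
    "thumbs up": "\U0001F44D",
    "red heart + variation selector": "\u2764\ufe0f",
    "thumbs up + skin tone modifier": "\U0001F44D\U0001F3FD",
}

# For each sample: (number of code points, number of UTF-8 bytes).
sizes = {
    name: (len(text), len(text.encode("utf-8")))
    for name, text in samples.items()
}

for name, (code_points, utf8_bytes) in sizes.items():
    print(f"{name}: {code_points} code point(s), {utf8_bytes} byte(s)")
```

A consequence for the later steps: splitting a string into characters treats the two-code-point samples as two separate items.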

## Distribution of variables

In [0]:
def get_set_statistics(set_df):
    """
    Return a dataframe with an aggregated summary, per `label`, of four columns of the input dataframe (`set_df`): `label`, `turn1`, `turn2`, `turn3`.
    For the `label` column the summary contains the frequency (`count`) and the relative frequency (`freq`).
    Each of the `turn*` columns is summarised with
        the average length of the message (`average_len`),
        the standard deviation of the message length (`std_avg_len`),
        the length of the shortest message in the column (`min_len`),
        the length of the longest message in the column (`max_len`).
    """
    
    no_rows = set_df.shape[0]
    
    aggregation = {
    'label': {
        'count': 'count',
        'freq': lambda x: x.shape[0]/no_rows
    },
    'turn1': {
        'average_len': lambda x: np.mean(x.str.len()),
        'std_avg_len': lambda x: np.std(x.str.len()),
        'min_len': lambda x: np.min(x.str.len()),
        'max_len': lambda x: np.max(x.str.len())
    },
    'turn2': {
        'average_len': lambda x: np.mean(x.str.len()),
        'std_avg_len': lambda x: np.std(x.str.len()),
        'min_len': lambda x: np.min(x.str.len()),
        'max_len': lambda x: np.max(x.str.len())
    },
    'turn3': {
        'average_len': lambda x: np.mean(x.str.len()),
        'std_avg_len': lambda x: np.std(x.str.len()),
        'min_len': lambda x: np.min(x.str.len()),
        'max_len': lambda x: np.max(x.str.len())
    }
    }
    
    print("The set contains {} observations".format(set_df.shape[0]))
    aggregated_df = set_df.groupby(by='label').agg(aggregation).round(2).sort_values(by=[('label', 'count')], ascending=False)
    return aggregated_df
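Passing a dict of dicts to `agg` (so-called nested renaming) was deprecated in pandas 0.20 and removed in pandas 1.0, so the function above may fail on a current installation. Below is a sketch of an alternative that avoids nested renaming; `length_statistics` is a hypothetical name, the column names differ from the original (`mean`, `std`, `min`, `max`), and pandas' `std` uses the sample estimator (`ddof=1`) whereas `np.std` defaults to `ddof=0`, so the standard deviations will differ slightly.

```python
import pandas as pd


def length_statistics(set_df, turn_cols=("turn1", "turn2", "turn3")):
    """Per-label message-length statistics plus label counts and frequencies."""
    # Replace every message with its length in characters.
    lengths = set_df[list(turn_cols)].apply(lambda col: col.str.len())
    lengths["label"] = set_df["label"]

    # One (column, statistic) pair per turn column and aggregation.
    stats = lengths.groupby("label").agg(["mean", "std", "min", "max"]).round(2)

    counts = set_df["label"].value_counts()
    stats[("label", "count")] = counts
    stats[("label", "freq")] = (counts / len(set_df)).round(2)
    return stats.sort_values(by=("label", "count"), ascending=False)


# Toy data, made up for illustration.
toy_df = pd.DataFrame(
    {
        "turn1": ["hi", "hello", "x"],
        "turn2": ["a", "bb", "ccc"],
        "turn3": ["z", "zz", "zzz"],
        "label": ["others", "others", "sad"],
    }
)
toy_stats = length_statistics(toy_df)
```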

### Training

In [0]:
get_set_statistics(training_df)

The set contains 30160 observations



label,label_count,label_freq,turn1_average_len,turn1_std_avg_len,turn1_min_len,turn1_max_len,turn2_average_len,turn2_std_avg_len,turn2_min_len,turn2_max_len,turn3_average_len,turn3_std_avg_len,turn3_min_len,turn3_max_len
others,14948,0.5,17.46,13.61,1,308,25.49,16.34,1,149,16.93,15.01,1,630
angry,5506,0.18,20.92,16.44,1,175,26.5,16.62,1,114,22.19,17.21,1,197
sad,5463,0.18,18.55,14.54,1,238,24.35,15.43,1,115,18.86,14.82,1,142
happy,4243,0.14,19.96,13.43,1,130,27.93,16.11,1,111,14.09,12.2,1,160


### Validation

In [0]:
get_set_statistics(validation_df)


The set contains 2755 observations


label,label_count,label_freq,turn1_average_len,turn1_std_avg_len,turn1_min_len,turn1_max_len,turn2_average_len,turn2_std_avg_len,turn2_min_len,turn2_max_len,turn3_average_len,turn3_std_avg_len,turn3_min_len,turn3_max_len
others,2338,0.85,17.54,13.11,1,122,25.69,16.72,2,101,16.92,13.54,1,206
angry,150,0.05,19.01,13.25,1,88,27.91,17.09,3,81,19.1,15.6,3,111
happy,142,0.05,18.85,12.76,1,66,28.3,17.19,3,95,13.97,16.55,1,119
sad,125,0.05,19.94,27.21,2,286,22.9,14.42,3,82,19.78,13.51,1,66


### Test

Since the data frame with the test observations has no labels, I temporarily add them as a `label` column:

In [0]:
test_df['label'] = test_set_answers
test_df.head()

Unnamed: 0,turn1,turn2,turn3,label
0,Hmm,What does your bio mean?,I don’t have any bio,others
1,What you like,very little things,Ok,others
2,Yes,How so?,I want to fuck babu,others
3,what did you guess,what what,fuck,others
4,We ?,of course we will!,What gender movies you like??,others


In [0]:
get_set_statistics(test_df)

The set contains 5509 observations



label,label_count,label_freq,turn1_average_len,turn1_std_avg_len,turn1_min_len,turn1_max_len,turn2_average_len,turn2_std_avg_len,turn2_min_len,turn2_max_len,turn3_average_len,turn3_std_avg_len,turn3_min_len,turn3_max_len
others,4677,0.85,17.69,14.33,1,293,25.91,16.74,1,105,16.85,16.22,1,460
angry,298,0.05,17.91,12.73,1,116,24.6,15.41,4,85,19.77,14.38,1,101
happy,284,0.05,20.54,15.98,1,172,25.91,14.49,4,77,13.37,14.0,1,85
sad,250,0.05,19.95,31.03,1,458,25.5,16.75,3,82,17.9,23.07,1,308


In [0]:
test_df.drop('label', axis=1, inplace=True)
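The three tables above suggest that the label distribution differs between the sets: `others` makes up about half of the training set but about 85% of the validation and test sets. A compact way to put such distributions side by side is sketched below; `label_distribution` is a hypothetical helper and the toy frames are made up.

```python
import pandas as pd


def label_distribution(frames):
    """Relative label frequencies, one column per named data frame.

    `frames` maps a set name to a data frame with a `label` column.
    """
    shares = {
        name: df["label"].value_counts(normalize=True)
        for name, df in frames.items()
    }
    # Labels absent from a set get a share of 0 instead of NaN.
    return pd.DataFrame(shares).fillna(0).round(2)


# Toy frames, made up for illustration.
toy_train = pd.DataFrame({"label": ["others", "angry"]})
toy_test = pd.DataFrame({"label": ["others", "others", "others", "sad"]})
distribution = label_distribution({"train": toy_train, "test": toy_test})
```

With the real data the call would take `training_df`, `validation_df` and the temporarily labelled `test_df`.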

## Shortest messages

The shortest messages in each set are only one character long. Are they all emojis? What about slightly longer messages (e.g. up to and including four characters)?

In [0]:
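def get_shortest_messages(set_df, one_char=False, limit=4, head=True):
    """
    Return the rows of `set_df` in which at least one of `turn1`, `turn2`, `turn3`
    is at most `limit` characters long. With `one_char=True` the limit is set to 1.
    With `head=True` only the first 15 matching rows are returned.
    """
    if one_char:
        # only messages of length one qualify
        limit = 1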
    mask = (set_df['turn1'].str.len() <= limit) | (set_df['turn2'].str.len() <= limit) | (set_df['turn3'].str.len() <= limit)
    if head:
        return set_df.loc[mask].head(15)
    return set_df.loc[mask]

### Training

In [0]:
get_shortest_messages(training_df, one_char=True)

Unnamed: 0,turn1,turn2,turn3,label
19,Ur creator is very bad,you are only the creator of your brain.,😑,sad
24,Bcoz u dont know wat is to miss someone,but sometimes one can't express the same,😢,sad
67,😂 😂 right,Appatasiri. High five then! 🖑😂😂😂😂,😂,happy
152,I have a good sense of humor,I think that's funny.,😁,happy
153,But who paid,still waiting,😂,happy
182,Haha! I act so dumb sometimes and I knew it,Haha. How was it though? :3,😁,others
191,Tell me about it,your header,😂,happy
206,?,u asked me if u cn ask me something,I mean pussy cat,others
218,You cannot see my hair,I'm in your closet,😂,happy
265,K think me as husband,Spoken like a person who has never been married.,S,others


In [0]:
get_shortest_messages(training_df)

Unnamed: 0,turn1,turn2,turn3,label
2,By,by Google Chrome,Where you live,others
7,Ok,ok im back!!,"So, how are u",others
9,Bay,in the bay,😘 love you,others
10,I hate my boyfriend,you got a boyfriend?,Yes,angry
13,Bad,Bad bad! That's the bad kind of bad.,I have no gf,sad
14,Ok get it......,I made it an option,Ok,others
15,Money money and lots of money😍😍,I need to get it tailored but I'm in love with...,😁😁,happy
16,My gf left ne,Get over it. Go out with someone else.,Me*,sad
18,You are lying and i know that,"I KNOW YOU'RE LYING, AB BYS",😭😭,sad
19,Ur creator is very bad,you are only the creator of your brain.,😑,sad


### Validation

In [0]:
get_shortest_messages(validation_df, one_char=True)

Unnamed: 0,turn1,turn2,turn3,label
8,Shall we meet,you say- you're leaving soon...anywhere you wa...,?,others
22,Send me any video or songs,Video or Text,S,others
131,"I mean, what else?",Aiyyo! at all XD,😂,happy
168,Did you married me,yes I did,?,others
263,?,ITS TOO PINK FOR ME.,Too pink,others
305,?,you know well what you did !,You you don't like it,others
348,better do skydiving,"If at first you don't succeed, don't take up s...",k,others
354,?,Go on... This thing?,R u there,others
394,U and me,I don't WANT to handle you...,😢,sad
409,Wr r frm 😜,lol. I saw 1st half last night. 😜,😅,happy


In [0]:
get_shortest_messages(validation_df)

Unnamed: 0,turn1,turn2,turn3,label
3,Woah bring me some,left it there oops,Brb,others
8,Shall we meet,you say- you're leaving soon...anywhere you wa...,?,others
10,Your pic pz,thank you X‑D,wc,others
15,Ok,Thank you. xD,What about cortanan,others
21,But...,then,I'm feeling nervous,sad
22,Send me any video or songs,Video or Text,S,others
23,Why,why what,How r u,others
33,You,how about ur family.. Still single?,Can you love mi,others
35,Hell,im already there xoxo 😂,Good night sweet dreams baby,others
39,Ok,... tries to ignore pain :'(,Where to go?,others


### Test

In [0]:
get_shortest_messages(test_df, one_char=True)

Unnamed: 0,turn1,turn2,turn3
9,Nice to meet u,"Hi, nice to meet you too! 😸😂",😁
63,Whenever I want,same pinch!!! =/,😁
76,Let me love you,Amazing song though ! :'‑),😂
92,Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes,no yes no yes,😂
118,I'm also good,you're welcome,😀
124,Wat do u mean?,i mean ear :D LOL,😂
195,?,sounds like some young kids tag names frm the ...,sounds like you wanna have sex
382,Coz I want to,WHY ARE WE YELLING?,🙄
428,Love 😘👭👫❤️❤️you,looking so beautiful ma ! 😀,💛
434,💩,Already did.,😆


In [0]:
get_shortest_messages(test_df)

Unnamed: 0,turn1,turn2,turn3
0,Hmm,What does your bio mean?,I don’t have any bio
1,What you like,very little things,Ok
2,Yes,How so?,I want to fuck babu
3,what did you guess,what what,fuck
4,We ?,of course we will!,What gender movies you like??
9,Nice to meet u,"Hi, nice to meet you too! 😸😂",😁
10,Yupp,why?,Don't know I'm tired
13,First you hurt me,okay,So I talked rude
17,by,In Suits.,have good day
24,sure will u cal me ni8,Ok,then


The data frames above show that messages of length 1 are not necessarily emojis. They can also be single letters or punctuation marks.
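To go beyond eyeballing the first rows, the one-character messages could be grouped by their Unicode general category. This is only a sketch: `single_char_categories` is a hypothetical helper, and the major class `S` (symbol) covers emojis but also other symbols such as currency or math signs, so it is an approximation.

```python
import unicodedata
from collections import Counter


def single_char_categories(messages):
    """Count one-character messages by major Unicode category.

    L = letter, P = punctuation, S = symbol (includes most emojis),
    N = number, Z = separator.
    """
    counts = Counter()
    for message in messages:
        if len(message) == 1:
            # `unicodedata.category` returns e.g. 'Lu', 'Po', 'So';
            # the first letter is the major class.
            counts[unicodedata.category(message)[0]] += 1
    return counts


# Made-up examples resembling the rows shown above.
example_counts = single_char_categories(["\U0001F602", "?", "S", "k", "hello"])
```

On the real data the function could be called with, for example, `training_df['turn3']`.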

## Emojis and categories

Is there a relationship between the emoji that appears in any of the messages and the category (`label`)?

I keep only the conversations that contain at least one emoji. In UTF-8 each of the emojis seen so far is encoded as four bytes, `\x..\x..\x..\x..`, where each `.` is a hexadecimal digit.

In [0]:
emoji_pattern = re.compile(u"(["                    
u"\U0001F600-\U0001F64F"  # emoticons
u"\U0001F300-\U0001F5FF"  # symbols & pictographs
u"\U0001F680-\U0001F6FF"  # transport & map symbols
u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                "])", flags= re.UNICODE) 
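The four ranges above do not cover every emoji. For instance, the heart in row 2 of the validation set appears to be U+2764, which lies in the Dingbats block, and newer faces such as U+1F914 lie in the Supplemental Symbols and Pictographs block. The sketch below adds three further blocks; `extended_emoji_pattern` is a hypothetical name, and the list is still not exhaustive.

```python
import re

# The four ranges used in the original pattern.
original_ranges = (
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
)

# Additional blocks (an assumption about what is worth covering).
extra_ranges = (
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats (includes U+2764 HEAVY BLACK HEART)
)

original_like_pattern = re.compile("[" + original_ranges + "]")
extended_emoji_pattern = re.compile("[" + original_ranges + extra_ranges + "]")

sample = "and I'm happy for you \u2764 \U0001F914"
found_original = original_like_pattern.findall(sample)
found_extended = extended_emoji_pattern.findall(sample)
```

Broader ranges also match non-emoji symbols (for example some arrows and stars), so the trade-off should be checked against the data.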

In [0]:
def is_row_with_emojis(cell):
    """
    Return True if `cell` contains at least one character matched by `emoji_pattern`.
    """
    return bool(emoji_pattern.search(cell))

In [0]:
def purge_non_emojis(cell):
    return ''.join(re.findall(emoji_pattern, cell))

In [0]:
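def get_rows_with_emoticons(set_df):
    """
    Return the rows of `set_df` that contain at least one emoji in `turn1`, `turn2` or `turn3`,
    with every message reduced to its emojis only. The `label` column is kept unchanged.
    """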
    mask = (set_df['turn1'].apply(is_row_with_emojis)) | (set_df['turn2'].apply(is_row_with_emojis)) | (set_df['turn3'].apply(is_row_with_emojis))
    
    # keep only the rows that contain at least one emoji
    emojis_df = set_df.loc[mask]
    
    # keep the `label` column
    label_col = emojis_df['label']
    
    emojis_df = emojis_df.applymap(purge_non_emojis)
    
    emojis_df['label'] = label_col
    
    return emojis_df

### Training

In [0]:
training_df_emojis_only = get_rows_with_emoticons(training_df)

training_df_emojis_only.head()

Unnamed: 0,turn1,turn2,turn3,label
9,,,😘,others
15,😍😍,😍,😁😁,happy
18,,,😭😭,sad
19,,,😑,sad
21,,😻,😂😂😂,happy


### Validation

In [0]:
validation_df_emojis_only = get_rows_with_emoticons(validation_df)

validation_df_emojis_only.head()

Unnamed: 0,turn1,turn2,turn3,label
13,,😿,,sad
16,,😹,,others
35,,😂,,others
47,😂,,,others
56,,,😂😂,happy


### Test

In [0]:
# temporarily add `label` to `test_df`
test_df['label'] = test_set_answers

In [0]:
test_df_emojis_only = get_rows_with_emoticons(test_df)

test_df_emojis_only.head()

Unnamed: 0,turn1,turn2,turn3,label
6,,,😁,happy
9,,😸😂,😁,happy
14,🙊,😏,🙊,others
19,,👍,,others
21,,😁😁😁😁,,others


In [0]:
test_df.drop('label', axis=1, inplace=True)

I check what fraction of each set consists of observations with at least one emoji. It turns out that at most about 17% of the observations (in the training set) contain at least one emoji.

In [0]:
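def get_percentage_emojis_rows(df_full, df_emojis_only):
    """
    Return the fraction (between 0 and 1) of the rows of `df_full` that are present in `df_emojis_only`,
    i.e. that contain at least one emoji.
    """
    return df_emojis_only.shape[0]/df_full.shape[0]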

In [0]:
### Training
get_percentage_emojis_rows(training_df, training_df_emojis_only)

0.16982758620689656

In [0]:
### Validation
get_percentage_emojis_rows(validation_df, validation_df_emojis_only)

0.10417422867513612

In [0]:
### Test
get_percentage_emojis_rows(test_df, test_df_emojis_only)

0.11944091486658195
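The question that opens this section, whether emojis relate to the category, could be approached by counting how often each emoji occurs under each label. The following is a minimal sketch under assumptions: `emoji_counts_by_label` and `emoji_re` are hypothetical names, the pattern repeats the four ranges used above, and the toy frame is made up.

```python
import re
from collections import Counter

import pandas as pd

emoji_re = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "]"
)


def emoji_counts_by_label(set_df, turn_cols=("turn1", "turn2", "turn3")):
    """Return a table with one row per emoji and one column per label."""
    counters = {}
    for label, group in set_df.groupby("label"):
        counter = Counter()
        for col in turn_cols:
            for text in group[col]:
                counter.update(emoji_re.findall(text))
        counters[label] = counter
    # Emojis never seen under a label get a count of 0.
    return pd.DataFrame(counters).fillna(0).astype(int)


# Toy data, made up for illustration.
toy_df = pd.DataFrame(
    {
        "turn1": ["\U0001F602 ok", "a"],
        "turn2": ["x", "\U0001F62D"],
        "turn3": ["\U0001F602", "b"],
        "label": ["happy", "sad"],
    }
)
emoji_table = emoji_counts_by_label(toy_df)
```

Raw counts depend on the size of each class, so dividing each column by the number of rows with that label may give a fairer comparison.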

## Unique emojis in the set

Finding the unique emojis in the set should help with manually assigning each emoji to one of the four categories. The `others` category should be used only when we are not sure which of the three remaining categories an emoji belongs to. I work on the training data only.

In [0]:
training_df_emojis_only = get_rows_with_emoticons(training_df)

In [0]:
testowa_df = get_rows_with_emoticons(training_df).head(5)

In [0]:
def get_unique_emojis(set_df):
    """
    Given a dataframe get a set of unique emojis
    from the input dataframe.
    """
    
    unique_emojis = set()
    
    for _, *cells in set_df.drop(['label'], axis=1).itertuples():
        for cell in cells:
            emojis = list(cell)
            unique_emojis.update(emojis)
    return unique_emojis

In [0]:
unique_emojis_training_set = get_unique_emojis(training_df_emojis_only)

In [0]:
len(unique_emojis_training_set)

221

In [0]:
unique_emojis_training_set

{'🌍',
 '🌞',
 '🌟',
 '🌱',
 '🌷',
 '🌸',
 '🌹',
 '🍌',
 '🍒',
 '🍓',
 '🍗',
 '🍜',
 '🍞',
 '🍭',
 '🍰',
 '🍶',
 '🍷',
 '🍺',
 '🍻',
 '🍼',
 '🍾',
 '🎁',
 '🎂',
 '🎃',
 '🎈',
 '🎉',
 '🎧',
 '🎵',
 '🎶',
 '🏀',
 '🏃',
 '🏋',
 '🏕',
 '🏖',
 '🏡',
 '🏣',
 '🏻',
 '🏼',
 '🏽',
 '🏾',
 '🏿',
 '🐇',
 '🐍',
 '🐒',
 '🐓',
 '🐔',
 '🐘',
 '🐙',
 '🐛',
 '🐝',
 '🐞',
 '🐠',
 '🐨',
 '🐬',
 '🐭',
 '🐰',
 '🐱',
 '🐶',
 '🐷',
 '🐹',
 '🐺',
 '🐻',
 '🐼',
 '👀',
 '👄',
 '👅',
 '👆',
 '👇',
 '👈',
 '👉',
 '👊',
 '👋',
 '👌',
 '👍',
 '👎',
 '👏',
 '👐',
 '👗',
 '👙',
 '👦',
 '👧',
 '👨',
 '👩',
 '👪',
 '👫',
 '👬',
 '👭',
 '👮',
 '👯',
 '👵',
 '👶',
 '👷',
 '👺',
 '👻',
 '👼',
 '👽',
 '👿',
 '💁',
 '💃',
 '💋',
 '💍',
 '💎',
 '💐',
 '💑',
 '💓',
 '💔',
 '💕',
 '💖',
 '💗',
 '💘',
 '💙',
 '💚',
 '💛',
 '💜',
 '💝',
 '💞',
 '💡',
 '💤',
 '💩',
 '💪',
 '💭',
 '💯',
 '💰',
 '💵',
 '📆',
 '📞',
 '📲',
 '🔊',
 '🔙',
 '🔜',
 '🔥',
 '🔪',
 '🔱',
 '🕺',
 '🖑',
 '🖕',
 '😀',
 '😁',
 '😂',
 '😃',
 '😄',
 '😅',
 '😆',
 '😇',
 '😈',
 '😉',
 '😊',
 '😋',
 '😌',
 '😍',
 '😎',
 '😏',
 '😐',
 '😑',
 '😒',
 '😓',
 '😔',
 '😕',
 '😖',
 '😗',
 '😘',
 '😙',
 '😚',
 '😛',
 '😜',
 '😝',
 '😞'
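The set above contains the five code points U+1F3FB to U+1F3FF. These are skin tone modifiers that change the appearance of the preceding emoji rather than standalone emojis, and they end up in the set because `list(cell)` splits every string into single code points. A sketch of filtering them out before manual labelling follows; `SKIN_TONE_MODIFIERS` and `drop_modifiers` are hypothetical names.

```python
# Emoji modifiers (Fitzpatrick skin tones), U+1F3FB through U+1F3FF.
SKIN_TONE_MODIFIERS = {chr(code_point) for code_point in range(0x1F3FB, 0x1F400)}


def drop_modifiers(emojis):
    """Return the set of emojis without the skin tone modifiers."""
    return {emoji for emoji in emojis if emoji not in SKIN_TONE_MODIFIERS}


# Made-up example: thumbs up, one modifier, face with tears of joy.
example = {"\U0001F44D", "\U0001F3FD", "\U0001F602"}
cleaned = drop_modifiers(example)
```

Applied to `unique_emojis_training_set`, this would leave fewer items to classify by hand.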

## Spelling errors

# TODO

5. Are there words with typos, misspellings, etc.?

6. Do the algorithms we are going to use need emojis as bytes (`b''`), or can the emojis stay encoded as Unicode strings?

7. Write a function that, grouping by `label`, counts for each observation the unique emojis in each `turn*` column

8. Use embeddings trained on Twitter (perhaps some exist for our problem domain?)

9. Train our own embeddings on our training data

10. Translate emojis and emoticons into words, e.g. :) -> smiling face
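For item 10, the Python standard library already knows the official Unicode name of each character, which gives a cheap first approximation for emojis. The sketch below is an assumption about how this could be done, not a settled design: `emoji_to_words` is a hypothetical helper, the `0x1F300` threshold is a simplification, and text emoticons such as `:)` would need a separate hand-made mapping.

```python
import unicodedata


def emoji_to_words(text):
    """Replace characters at or above U+1F300 with their lower-cased Unicode names."""
    pieces = []
    for char in text:
        if ord(char) >= 0x1F300:
            # Characters unknown to this Python's Unicode database yield "".
            name = unicodedata.name(char, "")
            pieces.append(" " + name.lower() + " " if name else char)
        else:
            pieces.append(char)
    # Collapse the extra spaces introduced around the names.
    return " ".join("".join(pieces).split())


translated = emoji_to_words("so funny \U0001F602\U0001F602")
print(translated)
```

Official names are sometimes far from how an emoji is actually used, so the result may need manual review.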