#### General Steps to Follow

1. Problem Statement
2. Importing Packages
3. Data Collection
4. Checking the Data

### --------------------------------------------------------------------------------------------------------------------------------------------------------

## 1) Problem Statement
- I will implement a model that takes a sentence (such as "Let's go see the baseball game tonight!") as input and predicts the most appropriate emoji for it (⚾️).

### --------------------------------------------------------------------------------------------------------------------------------------------------------

## 2) Importing Packages

#### Add the repository directory path to the Python path

In [1]:
import os
import sys

# Repository root: the parent of the current working directory
REPO_DIR_PATH = os.path.normpath(os.path.dirname(os.getcwd()))

sys.path.append(REPO_DIR_PATH)

In [2]:
import pandas as pd
from src.utils import read_glove_vectors

### --------------------------------------------------------------------------------------------------------------------------------------------------------

## 3) Data Collection

- I used the dataset from the "Sequence Models" course in the "Deep Learning" Specialization on Coursera.
- The dataset consists of pairs (X, Y), where:
    - X contains sentences (strings).
    - Y contains an integer label between 0 and 4 corresponding to an emoji for each sentence.
- The dataset is located at `data/emojify_data.csv`.
- I will also use a pre-trained set of word embeddings, specifically 50-dimensional GloVe vectors, to represent each word. The embeddings are saved at `data/glove.6B.50d.txt`.

<center>
    <img src="../images/data_set.png" alt="Description of the image" width="800" height="800">
</center>
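
For reference, the integer labels can be turned back into emoji with a small lookup table. The mapping below is an assumption based on the Coursera assignment the dataset comes from (this notebook does not state it), and `EMOJI_BY_LABEL` and `label_to_emoji` are hypothetical names used only for illustration.

```python
# Assumed label-to-emoji mapping (from the Coursera assignment, not from this notebook).
EMOJI_BY_LABEL = {
    0: "\u2764\ufe0f",  # red heart
    1: "\u26be",        # baseball
    2: "\U0001F604",    # smiling face
    3: "\U0001F61E",    # disappointed face
    4: "\U0001F374",    # fork and knife
}


def label_to_emoji(label: int) -> str:
    """Return the emoji for an integer label in the range 0-4."""
    return EMOJI_BY_LABEL[int(label)]


print(label_to_emoji(1))
```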


### --------------------------------------------------------------------------------------------------------------------------------------------------------

## 4) Checking the Data

#### Checking the Emojifier Data

In [3]:
data = pd.read_csv("../data/emojify_data.csv")

In [4]:
data.head()

| | French macaroon is so tasty | 4 | Unnamed: 2 | Unnamed: 3 |
|---|---|---|---|---|
| 0 | work is horrible | 3 | | |
| 1 | I am upset | 3 | | [3] |
| 2 | throw the ball | 1 | | [2] |
| 3 | Good joke | 2 | | |
| 4 | what is your favorite baseball game | 1 | | |


In [5]:
len(data)

182

In [6]:
# Longest sentence in the dataset, measured in whitespace-separated words
max_length_sentence = data.iloc[:, 0].str.split().str.len().max()
print(max_length_sentence)

10


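- The loaded data consists of 182 sentences along with their corresponding labels.
- Because `pd.read_csv` treats the first line of the file as a header by default, the first sentence ("French macaroon is so tasty", label 4) ended up as the column names and is not counted; passing `header=None` would keep it as data.
- The sentences in the dataset contain a maximum of 10 words.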
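
The column names in the `data.head()` output above suggest that the CSV file has no header row. A minimal sketch of loading such a file without losing the first sentence is shown below. It uses a few inline rows in place of the real file, and the column names `sentence` and `label` are illustrative choices, not names from the dataset.

```python
import io

import pandas as pd

# A few rows in the same layout as the output above, standing in for the real file.
sample_csv = io.StringIO(
    "French macaroon is so tasty,4,,\n"
    "work is horrible,3,,\n"
    "I am upset,3,,[3]\n"
    "throw the ball,1,,[2]\n"
)

# header=None keeps the first row as data; usecols drops the two extra columns.
sample = pd.read_csv(sample_csv, header=None, usecols=[0, 1])
sample.columns = ["sentence", "label"]  # illustrative names

# Longest sentence, in whitespace-separated words.
max_words = sample["sentence"].str.split().str.len().max()

print(len(sample), max_words)
```

With the real file, the same call would take `"../data/emojify_data.csv"` instead of `sample_csv`.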

#### Checking the Pretrained Word Embeddings

In [7]:
glove_file = "../data/glove.6B.50d.txt"
words, word_to_vec_map = read_glove_vectors(glove_file)
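
The implementation of `read_glove_vectors` lives in `src/utils` and is not shown in this notebook. GloVe text files store one word per line, followed by its space-separated vector components, so a reader could look roughly like the sketch below. The name `load_glove_sketch` and the toy 3-dimensional data are assumptions for illustration only.

```python
import io

import numpy as np


def load_glove_sketch(lines):
    """Parse GloVe-style text lines: a word followed by its vector components."""
    vocabulary = set()
    vectors = {}
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        word, values = parts[0], parts[1:]
        vocabulary.add(word)
        vectors[word] = np.array(values, dtype=np.float64)
    return vocabulary, vectors


# Toy 3-dimensional embeddings standing in for data/glove.6B.50d.txt.
toy_file = io.StringIO("the 0.1 0.2 0.3\nball -0.5 0.0 0.25\n")
toy_words, toy_map = load_glove_sketch(toy_file)

print(len(toy_words), toy_map["ball"].shape)
```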

In [10]:
word_to_vec_map["the"].shape

(50,)

In [11]:
word_to_vec_map["unknown"]

array([ 0.89855 ,  0.30093 ,  0.38384 , -0.07748 ,  1.2406  ,  0.6338  ,
       -0.49759 ,  0.59377 , -0.16398 , -0.079284,  0.6614  , -0.17841 ,
        0.064431,  0.15498 ,  0.63783 , -0.12535 , -0.045814,  0.084162,
       -0.84272 ,  0.25469 , -0.53641 ,  0.058337,  0.53229 ,  0.60801 ,
        0.41529 , -1.2192  , -1.1077  , -0.29251 ,  0.50284 ,  0.65703 ,
        2.2331  , -1.2356  ,  0.18461 , -1.1709  ,  0.56209 ,  0.3741  ,
        0.24536 , -0.21032 , -0.35088 ,  0.20336 ,  0.098822, -0.15596 ,
        0.088795,  0.17909 ,  0.21729 , -0.50994 , -0.48693 , -0.07791 ,
        0.55245 , -0.62789 ])

In [21]:
len(words)

400000

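- The vocabulary contains 400,000 words, each mapped to a 50-dimensional embedding vector.
- The word "unknown" has its own embedding, which can serve as a fallback vector for words that are missing from the vocabulary.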
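
One way to use such a fallback is to look up each word and substitute the "unknown" vector when a word is missing. The sketch below uses a toy 3-dimensional map in place of `word_to_vec_map`, and `embed_sentence` is a hypothetical helper, not a function from this repository. Words are lowercased first because the `glove.6B` vocabulary is uncased.

```python
import numpy as np

# Toy 3-dimensional embeddings; the real map holds 50-dimensional GloVe vectors.
toy_map = {
    "throw": np.array([0.2, -0.1, 0.4]),
    "ball": np.array([0.0, 0.3, 0.1]),
    "unknown": np.array([0.5, 0.5, 0.5]),
}


def embed_sentence(sentence, vec_map, fallback_word="unknown"):
    """Stack one vector per word, using the fallback vector for missing words."""
    fallback = vec_map[fallback_word]
    tokens = sentence.lower().split()
    return np.stack([vec_map.get(token, fallback) for token in tokens])


# "the" is not in the toy map, so it receives the fallback vector.
matrix = embed_sentence("Throw the ball", toy_map)
print(matrix.shape)
```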