<a href="https://colab.research.google.com/github/arssite/Datalysis/blob/main/embeding_SEntiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embedding
An embedding is a deep learning technique that represents categorical variables, such as the words of a vocabulary, as dense vectors of real numbers. Because the vectors are learned during training, the model can place related categories close together and use those relationships for prediction or classification tasks.

Benefits of using embedding:

- Reduced dimensionality: Embeddings are typically much lower-dimensional than one-hot encodings of the same categories, which can improve computational efficiency and reduce overfitting.
- Improved interpretability: Embeddings can be projected to two or three dimensions and visualized to understand the relationships between different categories.
- Increased flexibility: Embeddings can be used with a variety of deep learning models, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

How embedding works:

1. One-hot encoding: Conceptually, each category is first converted into a one-hot encoded vector. This is a vector of zeros, except for the position corresponding to the category, which is set to 1.
2. Embedding layer: The one-hot encoded vectors are then passed through an embedding layer. This layer is a trainable weight matrix with one row per category, so multiplying a one-hot vector by it selects that category's row. In practice, frameworks skip the explicit one-hot step and look the row up directly by integer index.
3. Output: The output of the embedding layer is a vector of real numbers for each category. These vectors can then be used as input to other deep learning models.

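The equivalence between the one-hot multiplication and a direct row lookup can be checked with a small, dependency-free sketch. The 4-category vocabulary, the 2-dimensional size, and the weight values are made up for illustration; a real layer would learn its weights during training.

```python
# Hypothetical weight matrix: 4 categories, 2 embedding dimensions.
# The numbers are arbitrary stand-ins for learned weights.
weights = [
    [0.1, 0.9],  # category 0
    [0.4, 0.2],  # category 1
    [0.7, 0.5],  # category 2
    [0.3, 0.8],  # category 3
]

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

def embed_by_matmul(index):
    """Multiply the one-hot row vector by the weight matrix."""
    vec = one_hot(index, len(weights))
    dim = len(weights[0])
    return [sum(vec[r] * weights[r][c] for r in range(len(weights)))
            for c in range(dim)]

def embed_by_lookup(index):
    """Select the row directly, skipping the one-hot vector entirely."""
    return list(weights[index])

# Both routes give the same vector for every category.
for category in range(len(weights)):
    assert embed_by_matmul(category) == embed_by_lookup(category)

print(embed_by_lookup(2))  # [0.7, 0.5]
```

The lookup form is the cheaper of the two, since it avoids building a vector as long as the vocabulary for every input.
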
Applications of embedding:

- Natural language processing: Embeddings are commonly used in natural language processing tasks such as sentiment analysis, text classification, and machine translation.
- Computer vision: Embeddings can also be used in computer vision tasks such as image classification and object detection.
- Recommendation systems: Embeddings can be used to learn user preferences and recommend items that the user is likely to be interested in.

In [1]:
import numpy as np
docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "Python is a high-level programming language.",
    "Stack Overflow is a question and answer website for professional and enthusiast programmers.",
    "Artificial intelligence is reshaping industries across the globe.",
    "The universe is vast and full of mysteries.",
    "Climate change is a pressing issue that requires global cooperation.",
    "The sun rises in the east and sets in the west.",
    "Music has the power to evoke strong emotions.",
    "Education is the key to unlocking opportunities.",
    "Health is wealth.",
    "The journey of a thousand miles begins with a single step.",
    "Reading is to the mind what exercise is to the body.",
    "Life is like a box of chocolates; you never know what you're gonna get.",
    "Laughter is the best medicine.",
    "The only way to do great work is to love what you do.",
    "Success is not final, failure is not fatal: It is the courage to continue that counts.",
    "Believe you can and you're halfway there.",
    "Yesterday is history, tomorrow is a mystery, but today is a gift. That is why it is called the present.",
    "Happiness is not something ready made. It comes from your own actions.",
    "Be yourself; everyone else is already taken.",
    "In three words I can sum up everything I've learned about life: it goes on.",
    "To be yourself in a world that is constantly trying to make you something else is the greatest accomplishment.",
    "Life is either a daring adventure or nothing at all.",
    "Success is not the key to happiness. Happiness is the key to success. If you love what you are doing, you will be successful.",
    "Don't cry because it's over, smile because it happened."
]

In [3]:
from keras.preprocessing.text import Tokenizer
# Default settings: lowercase the text and strip punctuation before splitting on spaces
token=Tokenizer()

In [4]:
# Build the vocabulary: count every word and index words by descending frequency
token.fit_on_texts(docs)

In [7]:
len(token.word_index)

175

In [5]:
token.word_index

{'is': 1,
 'the': 2,
 'to': 3,
 'a': 4,
 'you': 5,
 'and': 6,
 'it': 7,
 'that': 8,
 'in': 9,
 'what': 10,
 'not': 11,
 'of': 12,
 'key': 13,
 'life': 14,
 'success': 15,
 'happiness': 16,
 'be': 17,
 'over': 18,
 "you're": 19,
 'do': 20,
 'love': 21,
 'can': 22,
 'something': 23,
 'yourself': 24,
 'else': 25,
 'because': 26,
 'quick': 27,
 'brown': 28,
 'fox': 29,
 'jumps': 30,
 'lazy': 31,
 'dog': 32,
 'lorem': 33,
 'ipsum': 34,
 'dolor': 35,
 'sit': 36,
 'amet': 37,
 'consectetur': 38,
 'adipiscing': 39,
 'elit': 40,
 'python': 41,
 'high': 42,
 'level': 43,
 'programming': 44,
 'language': 45,
 'stack': 46,
 'overflow': 47,
 'question': 48,
 'answer': 49,
 'website': 50,
 'for': 51,
 'professional': 52,
 'enthusiast': 53,
 'programmers': 54,
 'artificial': 55,
 'intelligence': 56,
 'reshaping': 57,
 'industries': 58,
 'across': 59,
 'globe': 60,
 'universe': 61,
 'vast': 62,
 'full': 63,
 'mysteries': 64,
 'climate': 65,
 'change': 66,
 'pressing': 67,
 'issue': 68,
 'requires': 69

In [6]:
token.word_counts

OrderedDict([('the', 19),
             ('quick', 1),
             ('brown', 1),
             ('fox', 1),
             ('jumps', 1),
             ('over', 2),
             ('lazy', 1),
             ('dog', 1),
             ('lorem', 1),
             ('ipsum', 1),
             ('dolor', 1),
             ('sit', 1),
             ('amet', 1),
             ('consectetur', 1),
             ('adipiscing', 1),
             ('elit', 1),
             ('python', 1),
             ('is', 27),
             ('a', 10),
             ('high', 1),
             ('level', 1),
             ('programming', 1),
             ('language', 1),
             ('stack', 1),
             ('overflow', 1),
             ('question', 1),
             ('and', 5),
             ('answer', 1),
             ('website', 1),
             ('for', 1),
             ('professional', 1),
             ('enthusiast', 1),
             ('programmers', 1),
             ('artificial', 1),
             ('intelligence', 1),
             ('r

In [8]:
token.document_count

26
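
The tokenizer attributes inspected above can be approximated in plain Python. This is a rough sketch rather than the Keras implementation: the `FILTERS` string is assumed to match the Tokenizer's default filter set, and `tokenize` and `build_word_index` are hypothetical helper names. It uses two sentences from `docs` as input.

```python
from collections import Counter

# Characters replaced by spaces before splitting (assumed to match the
# Keras Tokenizer default). The apostrophe is not in the set, so a word
# like "you're" stays a single token, as in the word_index shown above.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Return (word_index, word_counts); the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # most_common() sorts by count and keeps first-seen order among ties.
    ranked = counts.most_common()
    index = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}
    return index, counts

sample = ["Health is wealth.", "Laughter is the best medicine."]
word_index, word_counts = build_word_index(sample)
print(word_index)
print(len(word_index), len(sample))  # vocabulary size, document count
```

Indices start at 1 so that 0 stays free to act as a padding value later on.
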

In [9]:
# Replace every word in each document with its integer index from word_index
seq=token.texts_to_sequences(docs)
seq

[[2, 27, 28, 29, 30, 18, 2, 31, 32],
 [33, 34, 35, 36, 37, 38, 39, 40],
 [41, 1, 4, 42, 43, 44, 45],
 [46, 47, 1, 4, 48, 6, 49, 50, 51, 52, 6, 53, 54],
 [55, 56, 1, 57, 58, 59, 2, 60],
 [2, 61, 1, 62, 6, 63, 12, 64],
 [65, 66, 1, 4, 67, 68, 8, 69, 70, 71],
 [2, 72, 73, 9, 2, 74, 6, 75, 9, 2, 76],
 [77, 78, 2, 79, 3, 80, 81, 82],
 [83, 1, 2, 13, 3, 84, 85],
 [86, 1, 87],
 [2, 88, 12, 4, 89, 90, 91, 92, 4, 93, 94],
 [95, 1, 3, 2, 96, 10, 97, 1, 3, 2, 98],
 [14, 1, 99, 4, 100, 12, 101, 5, 102, 103, 10, 19, 104, 105],
 [106, 1, 2, 107, 108],
 [2, 109, 110, 3, 20, 111, 112, 1, 3, 21, 10, 5, 20],
 [15, 1, 11, 113, 114, 1, 11, 115, 7, 1, 2, 116, 3, 117, 8, 118],
 [119, 5, 22, 6, 19, 120, 121],
 [122, 1, 123, 124, 1, 4, 125, 126, 127, 1, 4, 128, 8, 1, 129, 7, 1, 130, 2, 131],
 [16, 1, 11, 23, 132, 133, 7, 134, 135, 136, 137, 138],
 [17, 24, 139, 25, 1, 140, 141],
 [9, 142, 143, 144, 22, 145, 146, 147, 148, 149, 150, 14, 7, 151, 152],
 [3, 17, 24, 9, 4, 153
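
The word-to-integer mapping can be illustrated with a minimal sketch. The `word_index` below is a hypothetical four-word vocabulary, not the one fitted above, and `to_sequence` is a made-up helper. Skipping unknown words is assumed to mirror the tokenizer's behaviour when no out-of-vocabulary token is configured.

```python
# Hypothetical word index in the same shape as token.word_index
# (indices start at 1).
word_index = {"is": 1, "the": 2, "health": 3, "wealth": 4}

def to_sequence(text, index):
    """Map each known word to its integer index and skip unknown words."""
    words = text.lower().replace(".", " ").split()
    return [index[w] for w in words if w in index]

print(to_sequence("Health is wealth.", word_index))          # [3, 1, 4]
print(to_sequence("Health is the new wealth.", word_index))  # [3, 1, 2, 4]
```

In the second call the word "new" is not in the vocabulary, so it leaves no trace in the output sequence.
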

In [10]:
from keras.utils import pad_sequences
# Right-pad with zeros so every sequence matches the longest one (24 tokens here)
sequences = pad_sequences(seq,padding='post')
sequences

array([[  2,  27,  28,  29,  30,  18,   2,  31,  32,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [ 33,  34,  35,  36,  37,  38,  39,  40,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [ 41,   1,   4,  42,  43,  44,  45,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [ 46,  47,   1,   4,  48,   6,  49,  50,  51,  52,   6,  53,  54,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [ 55,  56,   1,  57,  58,  59,   2,  60,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  2,  61,   1,  62,   6,  63,  12,  64,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [ 65,  66,   1,   4,  67,  68,   8,  69,  70,  71,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  2,  72,  73,   9,   2,  74,   6
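
Post-padding itself is a small operation, sketched here in plain Python. `pad_post` is a hypothetical helper written for illustration, and the input sequences are shortened, made-up examples rather than the full `seq` list.

```python
def pad_post(seqs, value=0):
    """Right-pad every sequence with `value` up to the longest length."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [value] * (max_len - len(s)) for s in seqs]

padded = pad_post([[2, 27, 28], [86, 1, 87, 5, 9], [41]])
for row in padded:
    print(row)
# [2, 27, 28, 0, 0]
# [86, 1, 87, 5, 9]
# [41, 0, 0, 0, 0]
```

Padding after the tokens rather than before keeps each sentence left-aligned, and the value 0 never collides with a real word because word indices start at 1.
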

In [11]:
from keras.datasets import imdb
from keras import Sequential
from keras.layers import Dense,SimpleRNN,Embedding,Flatten

In [13]:
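model=Sequential()
# Toy configuration: a 17-index vocabulary, 2-dimensional vectors, sequences of length 5.
# Embedding `sequences` from above would instead need input_dim=176 (175 words plus
# the padding index 0) and input_length=24.
model.add(Embedding(17,output_dim=2,input_length=5))
model.summary()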

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 5, 2)              34        
                                                                 
Total params: 34 (136.00 Byte)
Trainable params: 34 (136.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
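
The figures in the summary can be reproduced by hand. The sketch below assumes 32-bit float weights and uses a placeholder weight table with made-up token ids; it only checks the arithmetic and the shapes, not any Keras behaviour.

```python
# Values taken from the Embedding call above.
input_dim, output_dim, input_length = 17, 2, 5

# One trainable row of `output_dim` floats per vocabulary index.
params = input_dim * output_dim
size_bytes = params * 4  # assuming 32-bit floats

# A batch of shape (batch, input_length) becomes
# (batch, input_length, output_dim) after the lookup.
table = [[0.0] * output_dim for _ in range(input_dim)]  # placeholder weights
batch = [[1, 4, 16, 0, 0], [3, 3, 7, 2, 0]]             # made-up ids below 17
output = [[table[i] for i in row] for row in batch]

print(params, size_bytes)                              # 34 136
print(len(output), len(output[0]), len(output[0][0]))  # 2 5 2
```

The `None` in the summary's output shape stands for the batch dimension, which is 2 in this sketch.
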


In [15]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
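
For reference, the loss named in the compile call can be written out in a few lines. This is a plain-Python sketch of the standard binary cross-entropy formula. The function below and its `eps` clipping value are illustrative assumptions, not the Keras implementation, and the labels and predictions are made up.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Confident, correct predictions give a small loss ...
print(binary_crossentropy([1, 0], [0.9, 0.1]))
# ... and confident, wrong predictions give a large one.
print(binary_crossentropy([1, 0], [0.1, 0.9]))
```

The loss expects one probability per example, so a sentiment model built on the embedding layer would typically end in a single sigmoid unit.
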
