### **1. Term Frequency**

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'The greatest glory in living lies not in never falling, but in rising every time we fall.',
    'The way to get started is to quit talking and begin doing.',
    'If life were predictable it would cease to be life, and be without flavor.',
    'Life is what happens when you are busy making other plans'
]

# Learn the vocabulary and build the document-term matrix in one step
X = vectorizer.fit_transform(corpus)

The fit_transform function extracts the vocabulary and counts how often each word occurs in each of the given sentences. It returns a document-term matrix, which we store here in the variable X.
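To make the two steps concrete, here is a minimal pure-Python sketch of what fit_transform does conceptually. It is not scikit-learn's implementation: the helper name `fit_transform_sketch` and the toy corpus are made up for illustration, and the tokenization only mirrors CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter


def fit_transform_sketch(documents):
    """Hypothetical helper: build a sorted vocabulary and a dense count matrix."""
    # Lowercase, then keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]

    # "fit": collect every distinct token and sort alphabetically
    vocabulary = sorted({token for doc in tokenized for token in doc})
    column = {term: i for i, term in enumerate(vocabulary)}

    # "transform": one row per document, one column per vocabulary term
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(doc).items():
            row[column[term]] = count
        matrix.append(row)
    return vocabulary, matrix


toy_corpus = ["to be or not to be", "to do is to be"]
vocabulary, matrix = fit_transform_sketch(toy_corpus)
print(vocabulary)  # ['be', 'do', 'is', 'not', 'or', 'to']
print(matrix)      # [[2, 0, 0, 1, 1, 2], [1, 1, 1, 0, 0, 2]]
```

The real CountVectorizer returns a sparse matrix rather than nested lists, which matters once the vocabulary grows to thousands of terms.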

In [2]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

['and', 'are', 'be', 'begin', 'busy', 'but', 'cease', 'doing', 'every', 'fall', 'falling', 'flavor', 'get', 'glory', 'greatest', 'happens', 'if', 'in', 'is', 'it', 'lies', 'life', 'living', 'making', 'never', 'not', 'other', 'plans', 'predictable', 'quit', 'rising', 'started', 'talking', 'the', 'time', 'to', 'way', 'we', 'were', 'what', 'when', 'without', 'would', 'you']


Scikit-learn's CountVectorizer converts a collection of text documents into a matrix of token counts. The list above is the vocabulary it learned from the corpus, sorted alphabetically; each term corresponds to one column of the matrix.

In [3]:
# Retrieve the matrix as a dense NumPy array
X.toarray()

# Transform a new document according to the learned vocabulary
vectorizer.transform(['A glory.']).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

The toarray() function converts the sparse matrix into a dense NumPy array so that it can be displayed. The transform() function converts the given documents into a document-term matrix using the vocabulary that has already been learned. In this example, “A glory.” becomes the vector [0,0,0,0,0,0,0,0,0,0,0,0,0,1,…,0] because the word “glory” occurs once in the document and sits in the 14th column of the matrix X (index 13 when counting from zero), while “a” is not in the vocabulary and therefore has no entry in the vector.
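The column position can be checked directly instead of counting zeros by hand. The sketch below refits a vectorizer on the same four sentences and looks up “glory” through the fitted `vocabulary_` mapping; it assumes a scikit-learn installation and nothing beyond the public CountVectorizer API.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'The greatest glory in living lies not in never falling, but in rising every time we fall.',
    'The way to get started is to quit talking and begin doing.',
    'If life were predictable it would cease to be life, and be without flavor.',
    'Life is what happens when you are busy making other plans',
]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# vocabulary_ maps each learned term to its column index
glory_index = vectorizer.vocabulary_['glory']
print(glory_index)                    # 13
print('a' in vectorizer.vocabulary_)  # False

# Words outside the vocabulary are silently dropped by transform()
row = vectorizer.transform(['A glory.']).toarray()[0]
print(row.nonzero()[0].tolist())      # [13]
```

Note that the default tokenizer only keeps tokens of two or more characters, so a single-letter word such as “a” would not enter the vocabulary even if it appeared in the training corpus.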

### **2. Term Frequency – Inverse Document Frequency**

In [4]:
from sklearn.feature_extraction.text import TfidfTransformer

# Create the tf-idf transformer (idf without smoothing)
transformer = TfidfTransformer(smooth_idf=False)

# Learn the idf weights from the count matrix X and store the tf-idf sparse matrix in tfidf
tfidf = transformer.fit_transform(X)

# Retrieve the matrix as a dense NumPy array, as before
tfidf.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.21080243, 0.        , 0.        , 0.21080243, 0.21080243,
        0.21080243, 0.        , 0.        , 0.21080243, 0.21080243,
        0.        , 0.        , 0.63240729, 0.        , 0.        ,
        0.21080243, 0.        , 0.21080243, 0.        , 0.21080243,
        0.21080243, 0.        , 0.        , 0.        , 0.        ,
        0.21080243, 0.        , 0.        , 0.14957063, 0.21080243,
        0.        , 0.        , 0.21080243, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.21871556, 0.        , 0.        , 0.30825419, 0.        ,
        0.        , 0.        , 0.30825419, 0.        , 0.        ,
        0.        , 0.        , 0.30825419, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.21871556, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.3

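TfidfTransformer transforms a count matrix into a normalized tf or tf-idf representation. With smooth_idf=False, each count is multiplied by idf(t) = ln(n / df(t)) + 1, where n is the number of documents and df(t) is the number of documents containing term t; each row is then scaled to unit Euclidean length.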
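As a rough check on the numbers in the first row above, the weighting can be recomputed by hand. This sketch assumes the unsmoothed formula scikit-learn documents for smooth_idf=False, idf(t) = ln(n / df(t)) + 1, followed by L2 normalization of each row; the helper name `tfidf_row` is hypothetical, and the regex tokenizer only imitates CountVectorizer's defaults.

```python
import math
import re
from collections import Counter

corpus = [
    'The greatest glory in living lies not in never falling, but in rising every time we fall.',
    'The way to get started is to quit talking and begin doing.',
    'If life were predictable it would cease to be life, and be without flavor.',
    'Life is what happens when you are busy making other plans',
]

# Lowercase and keep tokens of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenized for term in set(doc))


def tfidf_row(tokens):
    """Hypothetical helper: tf-idf weights of one document, L2-normalized."""
    counts = Counter(tokens)
    weights = {t: c * (math.log(n_docs / df[t]) + 1.0) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


first = tfidf_row(tokenized[0])
for term in ('glory', 'in', 'the'):
    print(term, round(first[term], 4))
```

The three printed weights should agree with the entries 0.2108, 0.6324 and 0.1496 in the first row of the matrix: “in” occurs three times in the sentence, while “the” is down-weighted because it also appears in the second sentence.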

### **3. Word2Vec**

In [5]:
import warnings

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

import gensim
from gensim.models import Word2Vec

warnings.filterwarnings(action='ignore')

# Tokenizer models used by sent_tokenize and word_tokenize
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

# Read the text file
with open("./alice_in_wonderland.txt", "r") as sample:
    s = sample.read()

# Replace newline characters with spaces
f = s.replace("\n", " ")

data = []
# Iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []

    # Tokenize the sentence into lowercase words
    for j in word_tokenize(i):
        temp.append(j.lower())

    data.append(temp)

# Create the CBOW model (the default, sg=0).
# vector_size was called size before gensim 4.0.
model1 = gensim.models.Word2Vec(data, min_count=1, vector_size=100, window=5)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


The CBOW model above is configured with three hyperparameters:

a. min_count: the minimum number of times a word must occur in the corpus to be included when training the model

b. vector_size (named size before gensim 4.0): the number of embedding dimensions

c. window: the maximum distance between the target word and the words around it
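The effect of each hyperparameter can be illustrated without training anything. The following is a simplified pure-Python sketch, not gensim's code: the helpers `apply_min_count` and `context_window` and the toy sentences are hypothetical, and details of the real implementation (for example, gensim may use a smaller effective window for individual words) are ignored.

```python
from collections import Counter

sentences = [
    ["alice", "was", "beginning", "to", "get", "very", "tired"],
    ["alice", "was", "not", "a", "bit", "hurt"],
]


def apply_min_count(sentences, min_count):
    """Drop every word whose total corpus frequency is below min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [[t for t in sentence if counts[t] >= min_count] for sentence in sentences]


def context_window(tokens, position, window):
    """Return the words at most `window` positions away from tokens[position]."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


# min_count=2 keeps only words seen at least twice in the whole corpus
print(apply_min_count(sentences, 2))       # [['alice', 'was'], ['alice', 'was']]

# window=2 around "beginning" (position 2 of the first sentence)
print(context_window(sentences[0], 2, 2))  # ['alice', 'was', 'to', 'get']

# vector_size sets the length of the vector stored for each vocabulary word
vector_size = 4
vocabulary = sorted({t for s in sentences for t in s})
embeddings = {word: [0.0] * vector_size for word in vocabulary}
print(len(embeddings), len(embeddings["alice"]))  # 11 4
```

With min_count = 1, as in the cell above, no word is filtered out, which is convenient for a small text but keeps many rare words with poorly trained vectors.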

In [6]:
# Print results
# (gensim 4.0 and later expose similarity() on the model's word vectors, model1.wv)
print("Cosine similarity between 'alice' and 'wonderland' - CBOW: ",
      model1.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' and 'machines' - CBOW: ",
      model1.wv.similarity('alice', 'machines'))

Cosine similarity between 'alice' and 'wonderland' - CBOW:  0.9975647
Cosine similarity between 'alice' and 'machines' - CBOW:  0.9896047


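Word2Vec can be used to compute the similarity between two words in the vocabulary by calling the similarity() function, which returns the cosine similarity of their vectors. In gensim 4.0 and later it is called on the model's word vectors, as model.wv.similarity().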
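Cosine similarity itself is simple to compute: it is the dot product of two vectors divided by the product of their lengths. The sketch below uses made-up three-dimensional vectors rather than real embeddings, and the function name `cosine_similarity` is chosen for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for word embeddings
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(round(same_direction, 6))  # 1.0
print(round(orthogonal, 6))      # 0.0
print(round(opposite, 6))        # -1.0
```

Because only the direction of the vectors matters, values close to 1 mean the two words were placed in nearly the same direction in the embedding space.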

In [7]:
# Create the skip-gram model (sg=1); vector_size was called size before gensim 4.0
model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)

# Print results
print("Cosine similarity between 'alice' and 'wonderland' - Skip Gram: ",
      model2.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' and 'machines' - Skip Gram: ",
      model2.wv.similarity('alice', 'machines'))

Cosine similarity between 'alice' and 'wonderland' - Skip Gram:  0.94407004
Cosine similarity between 'alice' and 'machines' - Skip Gram:  0.9343238


To build a skip-gram model, we only need to add one hyperparameter, ‘sg’, to the Word2Vec call: sg = 1 selects skip-gram, while the default sg = 0 selects CBOW.
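The difference between the two modes lies in how training examples are formed: CBOW predicts a target word from its surrounding context, whereas skip-gram predicts each context word from the target. The sketch below only builds those examples for a toy sentence; the function names are hypothetical and no model is trained.

```python
def context(tokens, position, window):
    """Words at most `window` positions to the left or right of tokens[position]."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: one example per position, (context words -> target word)."""
    return [(context(tokens, i, window), target) for i, target in enumerate(tokens)]


def skipgram_examples(tokens, window):
    """Skip-gram: one example per (target word -> single context word) pair."""
    return [
        (target, neighbour)
        for i, target in enumerate(tokens)
        for neighbour in context(tokens, i, window)
    ]


tokens = ["alice", "fell", "down", "the", "hole"]

print(cbow_examples(tokens, 1)[1])       # (['alice', 'down'], 'fell')
print(skipgram_examples(tokens, 1)[:3])  # [('alice', 'fell'), ('fell', 'alice'), ('fell', 'down')]
print(len(cbow_examples(tokens, 1)), len(skipgram_examples(tokens, 1)))  # 5 8
```

Skip-gram produces more, finer-grained examples from the same text, which is one reason it is usually slower to train than CBOW.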