Original file is located at
    https://colab.research.google.com/drive/1JJegGeIK0LrEjUYR_MDxnIScD4YVWO7V

1.   Label Encoding
2.   One Hot Encoding
3.   Count Vectorization (TF-IDF Vectorizer, Bag Of Words (BOW))
4.   Word Embedding (Word2Vec, GloVe, FastText)
5.   N-gram Features

# **1 - Label Encoding**

# Label Encoding is a technique for converting categorical variables (text labels) into numeric representations.
# Each unique category is assigned a unique integer.
# It is quick and easy to apply, but it cannot capture relationships between categories; for example, it has no notion that nurse and doctor are closer to each other than to the other occupations.
# Drawback: it ignores meaning, and the integers suggest an ordering that does not really exist.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder

In [None]:
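categories = ['teacher', 'nurse', 'police', 'doctor']
encoder = LabelEncoder()
# fit_transform sorts the unique categories alphabetically and numbers them from 0.
encoded_labels = encoder.fit_transform(categories)
# Column names are Turkish: 'Meslek' = occupation, 'Etiket' = label.
df = pd.DataFrame({'Meslek': categories, 'Etiket': encoded_labels})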

In [None]:
df.head()

Unnamed: 0,Meslek,Etiket
0,teacher,3
1,nurse,1
2,police,2
3,doctor,0
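
The integers in the table above follow alphabetical order of the category names, not the order in which they were listed. A minimal plain-Python sketch of the same idea, with no scikit-learn involved (the names `classes` and `label_of` are illustrative, not part of any library):

```python
categories = ['teacher', 'nurse', 'police', 'doctor']

# Sort the unique categories, then number them 0..n-1.
classes = sorted(set(categories))
label_of = {name: index for index, name in enumerate(classes)}

# Replace every category by its integer label.
encoded = [label_of[name] for name in categories]

print(label_of)  # {'doctor': 0, 'nurse': 1, 'police': 2, 'teacher': 3}
print(encoded)   # [3, 1, 2, 0]
```

Note that `doctor` (0) and `nurse` (1) end up next to each other only because of alphabetical sorting, which is exactly why these integers should not be read as distances.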


# **2 - One Hot Encoding**

# One Hot Encoding is a technique for converting categorical variables into binary vectors.
# Each category is represented by a binary vector in which exactly one element is "hot" (1) and all the others are "cold" (0).
# It works well when there are few categories. With many categories it adds one column per category, which inflates the data and raises the computational cost.
# Every cell holds one of two values: 0 or 1.

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
categories = ['teacher', 'nurse', 'police', 'doctor']
data = pd.DataFrame({'Meslek': categories})
encoder = OneHotEncoder(sparse_output=False, dtype=int)
encoded_data = encoder.fit_transform(data)
# The encoder orders its output columns by sorted category, not by input order,
# so the column labels are taken from the fitted encoder.
encoded_df = pd.DataFrame(encoded_data, columns=encoder.categories_[0])

In [None]:
encoded_df.head()

Unnamed: 0,doctor,nurse,police,teacher
0,0,0,0,1
1,0,1,0,0
2,0,0,1,0
3,1,0,0,0
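
To make the mechanics explicit, here is a small plain-Python sketch of one-hot encoding. It assumes the columns are laid out in sorted category order (doctor, nurse, police, teacher), which is also how scikit-learn's `OneHotEncoder` arranges them by default; the helper `one_hot` is hypothetical.

```python
categories = ['teacher', 'nurse', 'police', 'doctor']
classes = sorted(set(categories))  # column order of the encoded vectors

def one_hot(name, classes):
    """Return a list with a single 1 at the position of `name`."""
    return [1 if name == c else 0 for c in classes]

rows = [one_hot(name, classes) for name in categories]

print(classes)
for name, row in zip(categories, rows):
    print(name, row)
```

Each row contains exactly one 1, and its position identifies the category; for `teacher` that is the last column.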


# **3 - Count Vectorization**

# A technique for converting documents into numeric vectors based on how often words occur in them.

**a )  TF-IDF Vectorizer:**
It combines term frequency (TF), how often a term occurs in a document, with inverse document frequency (IDF), which down-weights terms that occur in many documents.


In [None]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
documents = ['Bu ilk belgedir',
            'İkinci belge budur',
            'Ve ücüncü belgemiz',
            'İlk belge hangisidir']

data = pd.DataFrame({'Text': documents})
vectorizer = TfidfVectorizer()
tfidf_vector = vectorizer.fit_transform(data['Text'])

tfidf_vector = pd.DataFrame(tfidf_vector.toarray(), columns= vectorizer.get_feature_names_out())

In [None]:
tfidf_vector.head()

Unnamed: 0,belge,belgedir,belgemiz,bu,budur,hangisidir,ilk,kinci,lk,ve,ücüncü
0,0.0,0.57735,0.0,0.57735,0.0,0.0,0.57735,0.0,0.0,0.0,0.0
1,0.486934,0.0,0.0,0.0,0.617614,0.0,0.0,0.617614,0.0,0.0,0.0
2,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.57735
3,0.486934,0.0,0.0,0.0,0.0,0.617614,0.0,0.0,0.617614,0.0,0.0
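
The weights in the table above can be reproduced by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) and uses the tokens exactly as they appear in the table's column names; the helper `tfidf` is illustrative only.

```python
import math
from collections import Counter

# Tokens copied from the column names of the table (already lowercased and split).
docs = [
    ['bu', 'ilk', 'belgedir'],
    ['kinci', 'belge', 'budur'],
    ['ve', 'ücüncü', 'belgemiz'],
    ['lk', 'belge', 'hangisidir'],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: weight} for one document, L2-normalised."""
    tf = Counter(doc)
    # Raw weight = term count * smoothed IDF.
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = [tfidf(doc) for doc in docs]
print(round(weights[0]['bu'], 6))     # 0.57735
print(round(weights[1]['belge'], 6))  # 0.486934
print(round(weights[1]['budur'], 6))  # 0.617614
```

`belge` occurs in two of the four documents, so its IDF is lower and it ends up with a smaller weight than the words that occur in only one document.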


# TF-IDF

In [None]:
import nltk

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

from nltk.tokenize import word_tokenize

from sklearn.decomposition import NMF, PCA

from sklearn.manifold import TSNE

import spacy

In [None]:
# The trailing spaces matter: adjacent string literals are joined as-is, so
# without them the last word of one line fuses with the first word of the next.
text = "Doğal dil işleme bilgisayar bilimi altalanıdır " \
       "yapay zeka ve hasaplamalı bilmidir " \
       "Bilgisayar ve insan dili kesişimidir"

# A token is a meaningful unit of the text (here, a word).
tokens = word_tokenize(text)

print(len(text), tokens[:50])

# TF-IDF
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform([text])
print('TF-IDF:')
print(tfidf_vec.get_feature_names_out()[:50])
print(X_tfidf.toarray()[0][:50])

118 ['Doğal', 'dil', 'işleme', 'bilgisayar', 'bilimi', 'altalanıdır', 'yapay', 'zeka', 've', 'hasaplamalı', 'bilmidir', 'Bilgisayar', 've', 'insan', 'dili', 'kesişimidir']
TF-IDF:
['altalanıdır' 'bilgisayar' 'bilimi' 'bilmidir' 'dil' 'dili' 'doğal'
 'hasaplamalı' 'insan' 'işleme' 'kesişimidir' 've' 'yapay' 'zeka']
[0.2236068 0.4472136 0.2236068 0.2236068 0.2236068 0.2236068 0.2236068
 0.2236068 0.2236068 0.2236068 0.2236068 0.4472136 0.2236068 0.2236068]


#  **Bag Of Words (BOW):**

# It builds a vocabulary from the unique words in the corpus and represents each document as a vector of word frequencies.
# In our example, we can show how often each vocabulary word occurs in each document.
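
Before turning to `CountVectorizer`, the idea can be sketched in a few lines of plain Python. The two English sentences are made-up stand-ins, and splitting on whitespace is a simplifying assumption rather than a real tokenizer.

```python
from collections import Counter

# Hypothetical toy corpus.
docs = ["the cat sat", "the cat saw the dog"]
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: every distinct word in the corpus, sorted for a stable column order.
vocabulary = sorted({word for doc in tokenized for word in doc})

# One count vector per document, aligned with the vocabulary.
bow = [[Counter(doc)[word] for word in vocabulary] for doc in tokenized]

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(bow)         # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Word order is lost: only the counts per vocabulary entry remain.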

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
documents = ['Bu ilk belgedir',
            'İkinci belge budur',
            'Ve ücüncü belgemiz',
            'İlk belge hangisidir']

data = pd.DataFrame({'Text':documents})
vectorizer = CountVectorizer()
bow_vectors = vectorizer.fit_transform(data['Text'])

bow_df = pd.DataFrame(bow_vectors.toarray(), columns = vectorizer.get_feature_names_out())

In [4]:
bow_df.head()

Unnamed: 0,belge,belgedir,belgemiz,bu,budur,hangisidir,ilk,kinci,lk,ve,ücüncü
0,0,1,0,1,0,0,1,0,0,0,0
1,1,0,0,0,1,0,0,1,0,0,0
2,0,0,1,0,0,0,0,0,0,1,1
3,1,0,0,0,0,1,0,0,1,0,0
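
The columns `kinci` and `lk` in the table above are truncated forms of "İkinci" and "İlk". Python lowercases the Turkish capital `İ` to `i` followed by a combining dot (U+0307), which is not a word character, so a regex tokenizer drops the lone `i`. The sketch below reproduces this with the same pattern that `CountVectorizer` uses by default; `turkish_lower` is a hypothetical helper, not a library function.

```python
import re

TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # words of two or more word characters

def naive_tokens(text):
    """Lowercase with str.lower(), then tokenize."""
    return re.findall(TOKEN_PATTERN, text.lower())

def turkish_lower(text):
    """Lowercase with the Turkish dotted/dotless i handled explicitly."""
    return text.replace('İ', 'i').replace('I', 'ı').lower()

def turkish_tokens(text):
    return re.findall(TOKEN_PATTERN, turkish_lower(text))

print(naive_tokens('İkinci belge budur'))    # ['kinci', 'belge', 'budur']
print(turkish_tokens('İkinci belge budur'))  # ['ikinci', 'belge', 'budur']
```

One possible way to apply this in the notebook would be to pass the helper through the vectorizer's `preprocessor` argument, e.g. `CountVectorizer(preprocessor=turkish_lower)`.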


# **4 - Word Embedding**

Word embedding is an NLP technique that represents words as dense vectors in a continuous vector space. By capturing semantic meaning and the relationships between words, it allows natural language to be understood and processed more effectively. Word embeddings are learned from large text corpora using neural network models, and they provide dense representations that tend to improve NLP model performance compared with sparse ones such as one-hot or count vectors.
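
Closeness between embeddings is commonly measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors that are purely illustrative (they are not learned from any data) to show how related words can end up closer to each other than unrelated ones.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only.
toy_embeddings = {
    'doctor':  [0.9, 0.8, 0.1],
    'nurse':   [0.8, 0.9, 0.2],
    'teacher': [0.1, 0.2, 0.9],
}

sim_doctor_nurse = cosine_similarity(toy_embeddings['doctor'], toy_embeddings['nurse'])
sim_doctor_teacher = cosine_similarity(toy_embeddings['doctor'], toy_embeddings['teacher'])

print(sim_doctor_nurse > sim_doctor_teacher)  # True
```

This is the kind of relationship that label encoding and one-hot encoding cannot express, since their vectors carry no notion of similarity.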


**a ) Word2Vec:**

Word2Vec is a neural-network-based model that learns continuous vector representations (embeddings) of words from large text corpora. These embeddings capture semantic and syntactic relationships between words, giving more meaningful representations that reflect the contexts in which words appear.

**- CBOW (Continuous Bag of Words)**: Predicts the center (target) word from the context words that surround it.

**- Skip-gram**: Predicts the surrounding context words from the target word. Given a target word at the center, Skip-gram aims to predict the context words that typically appear around it.
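
The difference between the two variants is easiest to see in the training examples each one extracts from a sentence. The sketch below is a simplified illustration, assuming a context window of one word on each side; `training_pairs` is a hypothetical helper and does not train anything.

```python
def training_pairs(sentence, window=1):
    """Return (cbow_pairs, skipgram_pairs) for one tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        context = [sentence[j] for j in range(lo, hi) if j != i]
        if not context:
            continue
        cbow.append((context, target))                 # context -> target
        skipgram.extend((target, c) for c in context)  # target -> each context word
    return cbow, skipgram

cbow_pairs, skipgram_pairs = training_pairs(["Ben", "severim", "elmaları"])

print(cbow_pairs)
# [(['severim'], 'Ben'), (['Ben', 'elmaları'], 'severim'), (['severim'], 'elmaları')]
print(skipgram_pairs)
# [('Ben', 'severim'), ('severim', 'Ben'), ('severim', 'elmaları'), ('elmaları', 'severim')]
```

CBOW produces one example per position, with the whole context as input, while Skip-gram produces one example per (target, context word) pair.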


## CBOW

In [None]:
import pandas as pd
from gensim.models import Word2Vec

In [None]:
sentences = [["Ben", "severim", "elmaları"],
            ["Ben", "yerim", "meyve"],
            ['elmalar', "lezzetlidir"],
            ['meyveler','sağlar','vitamin']]

# sg=0 selects CBOW; min_count=1 keeps every word, even those seen only once.
cbow = Word2Vec(sentences, min_count=1, vector_size=300, sg=0)
vectors = cbow.wv

vector_df = pd.DataFrame(vectors.vectors, index = vectors.index_to_key)

In [None]:
vector_df.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
Ben,-0.000179,7.9e-05,0.001701,0.003003,-0.003101,-0.002372,0.002153,0.002991,-0.001672,-0.001254,...,-0.001503,0.001901,0.00306,-0.001367,0.002655,0.001792,0.00196,0.000171,0.002738,-0.00234
vitamin,-0.002748,0.0031,-6.6e-05,-0.000656,0.001535,-0.001365,0.000914,0.002313,0.002022,-0.002504,...,0.003024,0.002979,-0.002736,-0.001004,0.003296,0.001701,-0.000529,-0.002897,0.000987,-0.002225
sağlar,0.002711,-0.001486,-0.000356,0.000335,-6.4e-05,0.000383,0.002038,-7e-06,-0.001082,-0.000504,...,-0.001695,0.000377,0.000961,-0.000512,0.003311,0.002783,0.000805,0.002373,0.001964,-0.00186
meyveler,-0.001719,-0.002223,-0.002592,0.00277,-0.000661,-0.002285,-0.001385,0.001715,-0.000956,-0.00125,...,0.001591,-0.001087,-0.003089,0.001262,0.002387,-0.001878,-0.002622,-0.000991,-0.001644,-0.000772
lezzetlidir,-0.000648,-0.001756,0.003149,-0.0031,0.001501,0.001801,-0.00047,0.003002,0.003295,-0.001825,...,0.002365,0.000634,0.001733,0.002127,0.000637,-0.002043,-2e-06,0.002756,-0.002033,0.003146


## Skip-gram
Takes a target word and tries to predict the words around it.

In [None]:
import pandas as pd
from gensim.models import Word2Vec

In [None]:
sentences = [["Ben", "severim", "elmaları"],
            ["Ben", "yerim", "meyve"],
            ['elmalar', "lezzetlidir"],
            ['meyveler','sağlar','vitamin']]

# sg=1 selects Skip-gram; the other settings match the CBOW example.
skip_gram = Word2Vec(sentences, min_count=1, vector_size=300, sg=1)
vectors = skip_gram.wv

vector_df = pd.DataFrame(vectors.vectors, index = vectors.index_to_key)

In [None]:
vector_df.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
Ben,-0.000179,7.9e-05,0.001701,0.003003,-0.003101,-0.002372,0.002153,0.002991,-0.001672,-0.001254,...,-0.001503,0.001901,0.00306,-0.001367,0.002655,0.001792,0.00196,0.000171,0.002738,-0.00234
vitamin,-0.002748,0.0031,-6.6e-05,-0.000656,0.001535,-0.001365,0.000914,0.002313,0.002022,-0.002504,...,0.003024,0.002979,-0.002736,-0.001004,0.003296,0.001701,-0.000529,-0.002897,0.000987,-0.002225
sağlar,0.002711,-0.001486,-0.000356,0.000335,-6.4e-05,0.000383,0.002038,-7e-06,-0.001082,-0.000504,...,-0.001695,0.000377,0.000961,-0.000512,0.003311,0.002783,0.000805,0.002373,0.001964,-0.00186
meyveler,-0.001719,-0.002223,-0.002592,0.00277,-0.000661,-0.002285,-0.001385,0.001715,-0.000956,-0.00125,...,0.001591,-0.001087,-0.003089,0.001262,0.002387,-0.001878,-0.002622,-0.000991,-0.001644,-0.000772
lezzetlidir,-0.000648,-0.001756,0.003149,-0.0031,0.001501,0.001801,-0.00047,0.003002,0.003295,-0.001825,...,0.002365,0.000634,0.001733,0.002127,0.000637,-0.002043,-2e-06,0.002756,-0.002033,0.003146


In short, CBOW predicts a word from the context around it, whereas Skip-gram predicts the context from a given word. In NLP applications, both are typically used to build word embedding matrices.

**b  )  GloVe:** GloVe stands for Global Vectors for Word Representation. It is an unsupervised learning algorithm that builds word embeddings by capturing global word co-occurrence patterns in a corpus.

In [None]:
# HOMEWORK (GloVe)
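
As a starting point, the co-occurrence statistics that GloVe builds on can be counted in plain Python. This sketch covers only the counting step, assuming a symmetric window of one word; the optimization that turns these counts into vectors is not shown, and `cooccurrence_counts` is a hypothetical helper.

```python
from collections import Counter

sentences = [["Ben", "severim", "elmaları"],
             ["Ben", "yerim", "meyve"]]

def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, sentence[j])] += 1
    return counts

counts = cooccurrence_counts(sentences)

print(counts[("Ben", "severim")])  # 1
print(counts[("Ben", "meyve")])    # 0
```

GloVe then fits word vectors so that their dot products approximate the logarithm of such counts, gathered over the whole corpus.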

# **5 - N-gram Features**

# N-gram features are contiguous sequences of n words in a text document. By considering groups of words as well as individual words, they capture contextual information and relationships between words.
# Each token represents a single item, as determined by the splitting criterion.
# Ignoring word order entirely, as a plain bag of words does, is not always a good idea.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
documents = ["Bu ilk belgedir.",
             "İkinci belge budur.",
             "Ve üçüncü belgemiz.",
             "İlk belge hangisidir?"]

data = pd.DataFrame({'Text': documents})
# ngram_range=(2, 3) keeps bigrams and trigrams only; single words are excluded.
ngram = CountVectorizer(ngram_range=(2, 3))
ngram_vectors = ngram.fit_transform(data['Text'])
ngram_df = pd.DataFrame(ngram_vectors.toarray(), columns=ngram.get_feature_names_out())

In [None]:
ngram_df.head()

Unnamed: 0,belge budur,belge hangisidir,bu ilk,bu ilk belgedir,ilk belgedir,kinci belge,kinci belge budur,lk belge,lk belge hangisidir,ve üçüncü,ve üçüncü belgemiz,üçüncü belgemiz
0,0,0,1,1,1,0,0,0,0,0,0,0
1,1,0,0,0,0,1,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,1,1
3,0,1,0,0,0,0,0,1,1,0,0,0
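
The n-gram columns in the table above can be generated by sliding a window of n tokens over each document. A minimal sketch for the first document, assuming it has already been lowercased and tokenized (the helper `ngrams` is illustrative):

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['bu', 'ilk', 'belgedir']

# Bigrams and trigrams together, as with ngram_range=(2, 3).
features = ngrams(tokens, 2) + ngrams(tokens, 3)

print(features)  # ['bu ilk', 'ilk belgedir', 'bu ilk belgedir']
```

These are the three columns that carry a 1 in row 0 of the table.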
