# Spell checking with fastText

In this notebook I try out a simple spell-checking application built on fastText.

fastText supports subword representations: each word is represented as a bag of character n-grams, in addition to the word itself.

I will try to build a spell checker based on fastText word vectors. Given a misspelled word, the task is to find the vocabulary word whose vector is closest to the misspelled word's vector in the trained embedding space.

If the query word is already in our vocabulary, we leave it as it is; otherwise we replace it with the vocabulary word whose vector is closest to the query's subword-based representation.
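
As a rough illustration of the subword idea, the sketch below lists the character n-grams of a word after wrapping it in the `<` and `>` boundary markers that fastText uses. The helper name `char_ngrams` is my own and is not part of gensim or fastText; the real library also hashes these n-grams into a fixed number of buckets, which is left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, including boundary markers.

    Hypothetical helper for illustration only; not a gensim/fastText API.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# A misspelling still shares many n-grams with the correct spelling.
shared = set(char_ngrams("sennheiser")) & set(char_ngrams("sennhayser"))
print(sorted(shared))
```

The overlap printed at the end is what lets a misspelled, out-of-vocabulary word land near its correct spelling.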
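
The decision rule above can be sketched without a trained model. In the toy version below, n-gram count vectors stand in for the learned fastText embeddings, and cosine similarity picks the closest vocabulary entry. The names `ngram_counts`, `cosine`, `correct`, and the three-word vocabulary are all made up for this illustration.

```python
import math
from collections import Counter


def ngram_counts(word, min_n=3, max_n=6):
    """Count character n-grams of `word` (a stand-in for a learned vector)."""
    wrapped = f"<{word}>"
    return Counter(
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    )


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def correct(word, vocab):
    """Keep in-vocabulary words; otherwise return the closest vocabulary entry."""
    if word in vocab:
        return word
    query = ngram_counts(word)
    return max(vocab, key=lambda candidate: cosine(query, ngram_counts(candidate)))


toy_vocab = ["sennheiser", "vestel", "dockers"]  # assumed example vocabulary
print(correct("sennhayser", toy_vocab))
print(correct("vestel", toy_vocab))
```

A trained model replaces the count vectors with dense embeddings, but the control flow stays the same: look the word up first, and search for a neighbour only when the lookup fails.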

In [1]:
import pandas as pd
import numpy as np
import bs4
from bs4 import BeautifulSoup
import lxml
import requests
import re
import nltk

import gensim
from gensim.models import KeyedVectors
from gensim.models import FastText

In [2]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.float_format", lambda x: '%.2f' % x)

In [3]:
def cleaning_text(text):
    # Lowercase, drop parenthesised fragments, then strip punctuation.
    text_nobracket = re.sub(r"\(.*?\)", '', str(text).lower())
    text_nopunct = re.sub(r'[^\w\s]', '', text_nobracket)
    return text_nopunct

In [4]:
import nltk
WPT = nltk.WordPunctTokenizer()

def tokenize(text):
    tokens = WPT.tokenize(text)
    return tokens

In [5]:
from nltk.tokenize import sent_tokenize
def tokenize_sent(text):
    tokens = sent_tokenize(text)
    return tokens

# HOPI

In [6]:
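# verify=False skips TLS certificate verification; keep it only if the certificate cannot be validated.
r = requests.get('https://hopi.com.tr/markalar', verify=False)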
soup = BeautifulSoup(r.content, 'lxml')



In [7]:
# List the names of all tags that directly contain a text node.
print({node.parent.name for node in soup.find_all(string=True)})

{'style', 'header', 'nav', 'ul', 'h2', 'h5', 'noscript', 'a', 'div', 'section', 'h1', 'label', 'li', 'p', '[document]', 'span', 'title', 'body', 'h6', 'form', 'strong', 'html', 'svg', 'button', 'main', 'head', 'figure', 'footer', 'script'}


In [8]:
hopi = []
for brand in soup.find_all("span", {"class" : "title"}):
    hopi.append(brand.text)

In [9]:
hopi[:5]

['01 BURDA AVM', '49.COM.TR', 'A101', 'ABDULLAH KİĞILI', 'AGORA ANTALYA AVM']

# MORHIPO

In [10]:
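# verify=False skips TLS certificate verification; keep it only if the certificate cannot be validated.
r = requests.get('https://www.morhipo.com/markalar/0/marka', verify=False)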
soup = BeautifulSoup(r.content, 'lxml')



In [11]:
# List the names of all tags that directly contain a text node.
print({node.parent.name for node in soup.find_all(string=True)})

{'style', 'header', 'ul', 'ol', 'noscript', 'a', 'div', 'b', 'h1', 'li', 'p', '[document]', 'span', 'title', 'body', 'strong', 'html', 'button', 'main', 'head', 'footer', 'script'}


In [12]:
morhipo = []
for li in soup.find_all(class_= "chaar-item col-xxs-12 col-xs-6 col-sm-3"):
    morhipo.append(li.a.get('href'))

In [13]:
morhipo[:5]

['/a-m-eyewear', '/a-spor', '/abottega', '/agspalding-bros', '/akent']

# FIRM_LIST

In [14]:
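# Note: judging by the samples above, morhipo holds URL slugs ('/a-spor') while hopi holds
# display names ('A101'), so this difference removes nothing; it mainly de-duplicates the slugs.
firm_list = list(set(morhipo) - set(hopi))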

In [15]:
firmname = pd.DataFrame(firm_list, columns = ["FIRMNAME"])
firmname.FIRMNAME = [re.sub("/", '', str(x).lower()) for x in firmname.FIRMNAME]
firmname.FIRMNAME = [re.sub("-", ' ', str(x)) for x in firmname.FIRMNAME]
firmname["FIRMNAME2"] = [re.sub(" ", '', str(x)) for x in firmname.FIRMNAME]
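
To make the effect of this cell concrete, here is the same normalisation applied to a single slug in plain Python. The function name `slug_variants` is hypothetical; the input is one of the Morhipo slugs shown earlier.

```python
import re


def slug_variants(slug):
    """Turn a URL slug into a spaced name and a joined name.

    Mirrors the FIRMNAME / FIRMNAME2 columns; the helper itself is illustrative.
    """
    spaced = re.sub("-", " ", re.sub("/", "", slug.lower()))
    joined = re.sub(" ", "", spaced)
    return spaced, joined


print(slug_variants("/a-m-eyewear"))
```

Keeping both variants gives a query typed with or without spaces a chance of matching a vocabulary entry.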

In [16]:
firmname.loc[:,'FIRMNAME'] = firmname.loc[:,'FIRMNAME'].apply(lambda x: cleaning_text(x))
firmname.loc[:,'Sent_Token1'] = firmname.loc[:,'FIRMNAME'].apply(lambda x: tokenize_sent(x) )

firmname.loc[:,'FIRMNAME2'] = firmname.loc[:,'FIRMNAME2'].apply(lambda x: cleaning_text(x))
firmname.loc[:,'Sent_Token2'] = firmname.loc[:,'FIRMNAME2'].apply(lambda x: tokenize_sent(x) )

In [17]:
firmname.head()

|   | FIRMNAME | FIRMNAME2 | Sent_Token1 | Sent_Token2 |
|---|---|---|---|---|
| 0 | exuma | exuma | [exuma] | [exuma] |
| 1 | joygears | joygears | [joygears] | [joygears] |
| 2 | redoxon | redoxon | [redoxon] | [redoxon] |
| 3 | baby londy | babylondy | [baby londy] | [babylondy] |
| 4 | densmood | densmood | [densmood] | [densmood] |


In [18]:
brand_list =  firmname["Sent_Token1"].tolist() + firmname["Sent_Token2"].tolist()
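
Note the shape of this corpus: judging by the table above, each "sentence" is a one-element list that holds a whole brand name, so a multi-word brand becomes a single vocabulary entry rather than separate words. The small sketch below uses rows copied from that table to show what the resulting vocabulary would look like; `sample_rows` and `toy_vocab` are names introduced only for this illustration.

```python
# Rows in the same shape as Sent_Token1 and Sent_Token2 above.
sample_rows = [["baby londy"], ["babylondy"], ["exuma"], ["exuma"]]

# Collect the distinct tokens, the way a min_count=1 vocabulary would.
toy_vocab = sorted({token for sentence in sample_rows for token in sentence})
print(toy_vocab)
```

A one-token sentence has no neighbouring tokens for the context window to use, so the matches further down presumably rely on shared character n-grams.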

# fastText Spell Checking

In [19]:
import gensim
from gensim.models.fasttext import FastText

In [20]:
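# Passing brand_list to the constructor already builds the vocabulary and runs
# an initial training pass with the default number of epochs; the train() call
# below then continues training for 10 more epochs.
cbow_fasttext = FastText(brand_list, vector_size=400, window=7, min_count=1, min_n=3, max_n=6)
%time cbow_fasttext.train(brand_list, total_examples=len(brand_list), epochs=10)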

Wall time: 1.11 s


(178700, 178700)

#### Spell Checking

In [22]:
# Test queries: misspellings plus a few correct spellings as controls.
wrong_words = [
    "sennhayser",
    "senheiser",
    "sennheiser",
    "diyiturk",
    "lc waykiki",
    "sennhayser",
    "dokers",
    "digitürk",
    "lc waykiki",
    "dikiturk",
    "vestl",
    "rusel hobs",
    "russell hobbs",
]

In [46]:
def spellcheck(tests, model, vocab):
    # Materialise the vocabulary once instead of on every iteration.
    candidates = list(vocab)
    for word in tests:
        if word in vocab:
            print('{} exists in the vocabulary. No correction required'.format(word))
        else:
            # Pick the vocabulary entry whose vector is closest to the query's vector.
            suggestion = model.wv.most_similar_to_given(word, candidates)
            print("Suggested word for {} : {}".format(word, suggestion))

if __name__ == "__main__":
    model = cbow_fasttext
    vocab = cbow_fasttext.wv.key_to_index.keys()
    spellcheck(wrong_words, model, vocab)

Suggested word for sennhayser : sennheiser
Suggested word for senheiser : sennheiser
sennheiser exists in the vocabulary. No correction required
Suggested word for diyiturk : digiturk
Suggested word for lc waykiki : lc waikiki
Suggested word for sennhayser : sennheiser
Suggested word for dokers : dockers
Suggested word for digitürk : digiturk
Suggested word for lc waykiki : lc waikiki
Suggested word for dikiturk : digiturk
Suggested word for vestl : vestel
Suggested word for rusel hobs : russell hobbs
russell hobbs exists in the vocabulary. No correction required


In [26]:
cbow_fasttext.wv.most_similar_to_given("sennhayser", list(vocab))

'sennheiser'

In [27]:
cbow_fasttext.wv.most_similar_to_given("lc waykiki", list(vocab))

'lc waikiki'

In [28]:
cbow_fasttext.wv.most_similar_to_given("lcwaykiki", list(vocab))

'lcwaikiki'

In [29]:
cbow_fasttext.wv.most_similar_to_given("diyiturk", list(vocab))

'digiturk'

In [30]:
cbow_fasttext.wv.most_similar_to_given("dikiturk", list(vocab))

'digiturk'

In [31]:
cbow_fasttext.wv.most_similar_to_given("dokers", list(vocab))

'dockers'

In [32]:
cbow_fasttext.wv.most_similar_to_given("vestl", list(vocab))

'vestel'

In [33]:
cbow_fasttext.wv.most_similar_to_given("rusel hobs", list(vocab))

'russell hobbs'