# TF-IDF - “Term Frequency — Inverse Document Frequency”


Bu teknik, belgelerdeki bir sözcüğü ölçmek için kullanılan bir tekniktir, genellikle her sözcük için belgede ve corpus'da sözcüğün önemini belirten bir ağırlık değeri hesaplarız. Bu yöntem, Information Retrieval ve Text Mining'de yaygın olarak kullanılan bir tekniktir.

TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

_Terminology_
* t — term (word)
* d — document (set of words)
* N — count of corpus
* corpus — the total document set

## Term Frequency

TF, bir document'de bir kelimenin ne kadar sıklıkta bulunduğunu ölçer.

TF bu şekilde formülize edilir.

tf(t,d) = count of t in d / number of words in d

## Document Frequency

DF, tüm corpus kümesi içerisinde document'in önemini ölçer, bu TF'ye çok benzer. Tek fark, TF'nin, d belgesindeki bir t terimi için frekans sayacı olmasıdır, burada DF, N belge kümesindeki yani corpus'undaki t teriminin geçme sayısıdır. Diğer bir deyişle, DF, t teriminin geçtiği document sayısıdır. 
t terimi document'te en az bir kez yer alıyorsa, t terimi document'te yer alır deriz, terimin d document'i içinde kaç kez bulunduğunu bilmemize gerek yoktur.

df(t) = occurrence of t in documents

Bu değerleri de bir aralıkta tutmak için, df'i toplam belge sayısına bölerek normalize etmemiz gerekir. Asıl amacımız bir terimin bilgilendirici bir terim mi yoksa değil mi olduğunu bilmektir ve DF bunun tam tersidir.

## Inverse Document Frequency

IDF, t teriminin bilgilendiriciliğini ölçen Document Frequency'nin tersidir.
IDF'yi hesapladığımızda, stop word'ler gibi en çok ortaya çıkan sözcükler için bu değer çok düşük olacaktır (çünkü "is" gibi stop word'ler hemen hemen tüm belgelerde mevcuttur ve N / df bu sözcüğe çok düşük bir değer verecektir. ). Bu istediğimiz şeydir, anlamına göre bir ağırlık verilmiş olacak. 

idf(t) = N/df

Yukarıdaki formül ile ilgili birkaç sorun ile karşılacağız ki onlar şunlardır; 
örneğin, çok büyük corpus'larda mesela 10.000 document'e sahip corpus'da yukarıdaki denklem çok yüksek değerler verecektir. Bu büyük değerleri normalleştirmek için idf(t) = N/df in logaritmasını alacağız.
Bir sorgulama yaptığımızda ve kelime corpus içerisinde yoksa df 0 sıfır olacaktır. Bir sayıyı sıfıra bölmek istemediğimiz için ise paydaya 1 ekleyip değeri biraz daha sabit ve anlamlı bir hale dönüştürüyoruz.

**idf(t) = log(N/(df + 1))**

Son olarak, TF ve IDF değerlerini çarpım işleminden geçirerek TF-IDF skorunu elde etmiş olacağız. 

### tf-idf(t, d) = tf(t, d) * log(N/(df + 1)) 

## Import libraries

In [12]:
!pip install num2words
!pip install TurkishStemmer
import nltk
nltk.download('stopwords')
nltk.download('punkt')

#from TurkishStemmer import TurkishStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from collections import Counter
from num2words import num2words

import nltk
import os
import string
import numpy as np
import copy
import pandas as pd
import pickle
import re
import math
import sklearn as sk

Collecting num2words
[?25l  Downloading https://files.pythonhosted.org/packages/eb/a2/ea800689730732e27711c41beed4b2a129b34974435bdc450377ec407738/num2words-0.5.10-py3-none-any.whl (101kB)
[K     |███▎                            | 10kB 19.8MB/s eta 0:00:01[K     |██████▌                         | 20kB 15.0MB/s eta 0:00:01[K     |█████████▊                      | 30kB 13.8MB/s eta 0:00:01[K     |█████████████                   | 40kB 13.4MB/s eta 0:00:01[K     |████████████████▏               | 51kB 11.2MB/s eta 0:00:01[K     |███████████████████▍            | 61kB 11.7MB/s eta 0:00:01[K     |██████████████████████▋         | 71kB 11.9MB/s eta 0:00:01[K     |█████████████████████████▉      | 81kB 12.0MB/s eta 0:00:01[K     |█████████████████████████████   | 92kB 12.2MB/s eta 0:00:01[K     |████████████████████████████████| 102kB 7.3MB/s 
Installing collected packages: num2words
Successfully installed num2words-0.5.10
Collecting TurkishStemmer
  Downloading https://f

## Documents

In [1]:
docA = """Elon Musk is a South African-born American industrial engineer, entrepreneur, who co-founded Paypal and founded aerospace transportation services company SpaceX. He is also one of the early investors in Tesla, an electric car company, and now the Chief Executive Officer of the firm as well. With a $70.5 bn fortune, Musk sits comfortably at the 7th spot in the world's top 10 billionaires list.
Early years
 
Born in 1971, Elon Musk displayed an early talent in computers and video games and at the age of 12 created a video game code that earned him fame, as well as a fortune. He acquired a Canadian passport to avoid supporting apartheid in Africa at the time and also due to greater economic opportunities in the United States.
 
Paypal and SpaceX
 
Musk has created several successful firms since he left Stanford University in 1997. That year, he created the first company - Zip2. The firm,  which provided maps and business directories to online newspapers, was bought by Compaq in 1999. Then he created X.com which eventually merged with PayPal, which went public in 2001, and in 2002 eBay bought the firm for $1.5 bn. Musk in 2002, formed SpaceX and since then the aerospace firm has covered several milestones over the years. SpaceX became the first company to successfully relaunch and land at the first stage of an orbital rocket in late 2017. In 2020, SpaceX made history on May 30, after it flew NASA astronauts Doug Hurley and Bob Behnken, to space aboard its Crew Dragon spacecraft using a Falcon 9 rocket.
 
Tesla
 
In 2004, Musk invested heavily in Tesla, an electric car company as he believed electric vehicles to be the future of mobility. Two years after introducing the first car, Tesla, in 2008, introduced the Model S sedan, which was praised by automotive critics for its performance and design. The company eventually made him a billionaire at the age of 40 in 2012. Musk, as the CEO of Tesla, landed himself in several controversies which resulted in stocks tanking by a huge margin. In 2018, Musk ran into trouble after his tweet falsely claimed that he had secured funding and was considering taking Tesla private at $420 per share. It resulted in both him and Tesla paying a $20 million fine and Musk agreeing to step down as chairman of Tesla's board. However, despite controversies, Tesla stocks have continued to grow adding to his wealth
"""



docB = """Born in 1971 in South Africa of a model and dietitian, Maye Musk, and an electromechanical engineer, Errol Musk, whom Elon has described as "a terrible human being," Elon Reeve Musk is the eldest of his parents' three children, and a citizen of three countries: South Africa, Canada, and the US.

Musk spent his childhood with his nose in books and computers. A small, introverted boy, he was ostracized by his schoolmates and regularly beaten up by class bullies, until he became big enough to defend himself after a growth spurt in his teens.
Musk moved to Silicon Valley in summer 1995. He registered in a PhD program in applied physics at Stanford University – but withdrew after only two days. His brother Kimball Musk, who is 15 months younger than Elon, had just graduated from Queen's University with a business degree and come to join him in California.The early Internet was heating up, and the brothers decided to launch a startup they called Zip2, an online business directory equipped with maps.

In due course, the brothers found angel investors for Zip2 and built it into a successful company. In 1999, the brothers sold Zip2 to computer maker Compaq for $307 million (280 million).

Elon then founded an online financial services company, X.com, on his own. Its main rival was a company called Confinity, founded by Peter Thiel and two others just months after X.com, with offices in the same building. The two companies merged in March 2000 and took on the name of their main product, PayPal, a person-to-person online money transfer service.

Ebay, the online auction service, bought PayPal in October 2002 for $1.5 billion worth of Ebay shares. At the age of 31, Elon Musk, who had been the largest shareholder in PayPal with 11.7% of its equity shares, found himself holding $165 million worth of Ebay stock."""


In [3]:
# Corpus oluşturdum.
corpus = []
corpus.append(docA)
corpus.append(docB)
N = len(corpus)

## Preprocessing

Bu noktada bazı yapılması zorunlu diyebileceğimiz adımları izleyeceğim. Bu işlemler için gerekli fonksiyonları tanımlayacağım.
Adımlar şu şekilde; 
* corpus içerisindeki tüm kelimeleri lowercase yapmak, 
* noktalama işaretlerini silmek, 
* stop word'leri silmek, 
* Stemming.

In [4]:
def convert_lower_case(data):
    return np.char.lower(data)

In [5]:
def remove_stop_words(data):
    stop_words = stopwords.words('english')
    words = word_tokenize(str(data))
    new_text = ""
    for w in words:
        if w not in stop_words and len(w) > 1:
            new_text = new_text + " " + w
    return new_text

In [6]:
def remove_punctuation(data):
    symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
    for i in range(len(symbols)):
        data = np.char.replace(data, symbols[i], ' ')
        data = np.char.replace(data, "  ", " ")
    data = np.char.replace(data, ',', '')
    return data

In [7]:
def remove_apostrophe(data):
    return np.char.replace(data, "'", "")

In [8]:
def stemming(data):
    stemmer = PorterStemmer()
    
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        new_text = new_text + " " + stemmer.stem(w)
    return new_text

In [9]:
def convert_numbers(data):
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        try:
            w = num2words(int(w), lang='eng')
        except:
            a = 0
        new_text = new_text + " " + w
    new_text = np.char.replace(new_text, "-", " ")
    return new_text

In [10]:
def preprocess(data):
    data = convert_lower_case(data)
    data = remove_punctuation(data) 
    data = remove_apostrophe(data)
    data = remove_stop_words(data)
    data = convert_numbers(data)
    data = stemming(data)
    data = remove_punctuation(data)
    data = convert_numbers(data)
    data = stemming(data) 
    data = remove_punctuation(data) 
    data = remove_stop_words(data) 
    return data

In [14]:
processed_text = []

for i in corpus[:N]:
    text = i.strip()
    processed_text.append(word_tokenize(str(preprocess(text))))

### Creating Bag of Words

In [15]:
bowA = processed_text[0]
bowB = processed_text[1]

In [16]:
wordSet = set(bowA).union(set(bowB))

In [21]:
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

In [25]:
# check first 30 words
import itertools
dict(itertools.islice(wordDictA.items(), 30))

{'age': 0,
 'becam': 0,
 'canada': 0,
 'comfort': 0,
 'confin': 0,
 'earn': 0,
 'equiti': 0,
 'execut': 0,
 'fame': 0,
 'fund': 0,
 'growth': 0,
 'heavili': 0,
 'howev': 0,
 'internet': 0,
 'late': 0,
 'onlin': 0,
 'pay': 0,
 'peter': 0,
 'prai': 0,
 'public': 0,
 'secur': 0,
 'sedan': 0,
 'sit': 0,
 'take': 0,
 'teen': 0,
 'ten': 0,
 'time': 0,
 'twelv': 0,
 'worth': 0,
 'zip2': 0}

In [26]:
for word in bowA:
    wordDictA[word] += 1

    
for word in bowB:
    wordDictB[word] += 1

In [28]:
# re-check the change of first 30 words
dict(itertools.islice(wordDictA.items(), 30))

{'age': 2,
 'becam': 1,
 'canada': 0,
 'comfort': 1,
 'confin': 0,
 'earn': 1,
 'equiti': 0,
 'execut': 1,
 'fame': 1,
 'fund': 1,
 'growth': 0,
 'heavili': 1,
 'howev': 1,
 'internet': 0,
 'late': 1,
 'onlin': 1,
 'pay': 1,
 'peter': 0,
 'prai': 1,
 'public': 1,
 'secur': 1,
 'sedan': 1,
 'sit': 1,
 'take': 1,
 'teen': 0,
 'ten': 1,
 'time': 1,
 'twelv': 2,
 'worth': 0,
 'zip2': 1}

In [30]:
import pandas as pd
CountVectorizer = pd.DataFrame([wordDictA, wordDictB])
CountVectorizer

Unnamed: 0,zip2,becam,fame,age,sedan,secur,growth,howev,peter,public,teen,comfort,fund,twelv,confin,equiti,execut,take,late,prai,earn,heavili,time,pay,canada,sit,onlin,internet,worth,ten,parent,left,histori,beaten,hundr,co,servic,list,continu,electr,...,defend,passport,degr,financ,call,firm,troubl,eighti,summer,launch,newspap,video,margin,doug,huge,eventu,largest,nineti,relaunch,octob,equip,eight,stage,queen,stanford,terribl,comput,well,transport,orbit,childhood,silicon,sinc,seven,econom,share,earli,spent,tweet,futur
0,1,1,1,2,1,1,0,1,0,1,0,1,1,2,0,0,1,1,1,1,1,1,1,1,0,1,1,0,0,1,0,1,1,0,4,1,1,1,1,3,...,0,1,0,0,0,5,1,0,0,0,1,2,1,1,1,2,0,2,1,0,0,1,1,0,1,0,1,2,1,1,0,0,2,1,1,1,3,0,1,1
1,3,1,0,1,0,0,1,0,1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,4,1,2,0,1,0,0,1,6,0,3,0,0,0,...,1,0,1,1,2,0,0,1,1,1,0,0,0,0,0,0,1,2,0,1,1,0,0,1,1,1,2,0,0,0,1,1,0,1,0,2,1,1,0,0


# Term - Frequency

In [36]:
def Term_Frequency(wordDict, bow):
    """This function creates Term Frequency.

    The Equation:
    tf(t,d) = count of t in d / number of words in d."""
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

In [35]:
tfBowA = Term_Frequency(wordDictA, bowA)
tfBowB = Term_Frequency(wordDictB, bowB)

# Inverse Document Frequency

In [37]:
def Inverse_doc_freq(docList):
    """This function creates Inverse-Document Freqeuncy
    
    The Equation:
    idf(t) = log(N/(df + 1))
    """
    import math
    idfDict = {}
    N = len(docList)
    
    # w kelimesini içeren dökümanları hesaplar
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    
    for doc in docList:
        for word, val in doc.items():
            if val > 0 :
                idfDict[word] += 1
    
    # N sayısını paydada ki document-frequency'e bölüp logaritmasını alır.
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val+1))
        return idfDict

In [38]:
idf = Inverse_doc_freq([wordDictA, wordDictB])

# Term Frequency - Inverse Document Frequency (TF-IDF)

In [44]:
def TFIDF(tfBow, idf):
    tfidf = {}
    for word, val in tfBow.items():
        if idf[word] < 0 :
            idf[word] = idf[word] * (-1)
        tfidf[word] = val * idf[word]
    return tfidf

In [45]:
tfidfA = TFIDF(tfBowA, idf)
tfidfB = TFIDF(tfBowB, idf)

# Result

In [46]:
pd.DataFrame([tfidfA, tfidfB])

Unnamed: 0,zip2,becam,fame,age,sedan,secur,growth,howev,peter,public,teen,comfort,fund,twelv,confin,equiti,execut,take,late,prai,earn,heavili,time,pay,canada,sit,onlin,internet,worth,ten,parent,left,histori,beaten,hundr,co,servic,list,continu,electr,...,defend,passport,degr,financ,call,firm,troubl,eighti,summer,launch,newspap,video,margin,doug,huge,eventu,largest,nineti,relaunch,octob,equip,eight,stage,queen,stanford,terribl,comput,well,transport,orbit,childhood,silicon,sinc,seven,econom,share,earli,spent,tweet,futur
0,0.001423,0.007018,0.003509,0.014035,0.003509,0.003509,0.0,0.003509,0.0,0.003509,0.0,0.003509,0.003509,0.007018,0.0,0.0,0.003509,0.003509,0.003509,0.003509,0.003509,0.003509,0.003509,0.003509,0.0,0.003509,0.007018,0.0,0.0,0.003509,0.0,0.003509,0.003509,0.0,0.02807,0.003509,0.007018,0.003509,0.003509,0.010526,...,0.0,0.003509,0.0,0.0,0.0,0.017544,0.003509,0.0,0.0,0.0,0.003509,0.007018,0.003509,0.003509,0.003509,0.007018,0.0,0.014035,0.003509,0.0,0.0,0.003509,0.003509,0.0,0.007018,0.0,0.007018,0.007018,0.003509,0.003509,0.0,0.0,0.007018,0.007018,0.003509,0.007018,0.021053,0.0,0.003509,0.003509
1,0.00582,0.009569,0.0,0.009569,0.0,0.0,0.004785,0.0,0.004785,0.0,0.004785,0.0,0.0,0.0,0.004785,0.004785,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004785,0.0,0.038278,0.004785,0.009569,0.0,0.004785,0.0,0.0,0.004785,0.057416,0.0,0.028708,0.0,0.0,0.0,...,0.004785,0.0,0.004785,0.004785,0.009569,0.0,0.0,0.004785,0.004785,0.004785,0.0,0.0,0.0,0.0,0.0,0.0,0.004785,0.019139,0.0,0.004785,0.004785,0.0,0.0,0.004785,0.009569,0.004785,0.019139,0.0,0.0,0.0,0.004785,0.004785,0.0,0.009569,0.0,0.019139,0.009569,0.004785,0.0,0.0
