# TF-IDF VECTORIZER

## The goal is to convert text into numerical values that a machine can work with. TF-IDF weights each word by how often it occurs within a document, offset by how many documents in the collection contain it.

# TF-IDF(t, d) = TF(t, d) × IDF(t)

#### TF (Term Frequency) = How often the term occurs within a document.<br>IDF (Inverse Document Frequency) = The more documents a word appears in, the lower its IDF value.<br>Stopwords carry little meaning and appear in almost every document, so their IDF is low, which keeps their TF-IDF score low. Informative keywords appear in fewer documents, so their IDF is high.
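
To make the formula concrete, here is a minimal sketch that computes TF-IDF by hand on a three-document toy corpus. The corpus and the helper names `idf` and `tfidf` are made up for illustration. TF is taken as the raw count, and IDF uses the smoothed form ln((1 + n) / (1 + df(t))) + 1, which is the default in scikit-learn's `TfidfVectorizer`. The L2 normalization that scikit-learn also applies to each row by default is left out to keep the example short.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the meeting is on the monday",
    "the invoice is attached",
    "win the free prize now",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(term, tokens):
    # TF is the raw count of the term in this document.
    return tokens.count(term) * idf(term)

print(f"idf('the')   = {idf('the'):.3f}")    # term present in every document
print(f"idf('prize') = {idf('prize'):.3f}")  # term present in a single document
print(f"tfidf('prize', doc 3) = {tfidf('prize', tokenized[2]):.3f}")
```

The word that occurs in every document gets the smallest possible IDF, while the word confined to one document gets a larger one.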

In [25]:
import numpy as np
import pandas as pd
import string
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [10]:
df = pd.read_csv("spam_ham_dataset.csv")
df

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\nth...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\n( see a...",0
2,3624,ham,"Subject: neon retreat\nho ho ho , we ' re arou...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\nthis deal is to ...,0
...,...,...,...,...
5166,1518,ham,Subject: put the 10 on the ft\nthe transport v...,0
5167,404,ham,Subject: 3 / 4 / 2000 and following noms\nhpl ...,0
5168,2933,ham,Subject: calpine daily gas nomination\n>\n>\nj...,0
5169,1409,ham,Subject: industrial worksheets for august 2000...,0


In [11]:
df = df[ ["text","label_num"] ]
df

Unnamed: 0,text,label_num
0,Subject: enron methanol ; meter # : 988291\nth...,0
1,"Subject: hpl nom for january 9 , 2001\n( see a...",0
2,"Subject: neon retreat\nho ho ho , we ' re arou...",0
3,"Subject: photoshop , windows , office . cheap ...",1
4,Subject: re : indian springs\nthis deal is to ...,0
...,...,...
5166,Subject: put the 10 on the ft\nthe transport v...,0
5167,Subject: 3 / 4 / 2000 and following noms\nhpl ...,0
5168,Subject: calpine daily gas nomination\n>\n>\nj...,0
5169,Subject: industrial worksheets for august 2000...,0


In [12]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HBA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
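ps = nltk.stem.porter.PorterStemmer()
# Load the English stopword list once into a set for fast membership checks.
stop_words = set( nltk.corpus.stopwords.words('english') )

def clean_text(text):
    # Lowercase so that "Free" and "free" map to the same token.
    text = text.lower()
    # Delete every ASCII punctuation character.
    text = text.translate( str.maketrans( '','',string.punctuation ) )
    words = text.split()
    # Drop tokens that consist only of digits.
    words = [ word for word in words if not word.isdigit() ]
    # Remove stopwords, then reduce each remaining word to its Porter stem.
    words = [ ps.stem(word) for word in words if word not in stop_words ]
    return " ".join(words)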
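
As a rough illustration of the first three steps of `clean_text`, the sketch below traces the visible start of the first email through lowercasing, punctuation removal, and digit filtering. Stopword removal and stemming are deliberately left out so the snippet runs without NLTK data.

```python
import string

# Visible beginning of the first row in the dataset.
sample = "Subject: enron methanol ; meter # : 988291"

lowered = sample.lower()
# str.maketrans('', '', chars) builds a table that deletes every character in chars.
no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
tokens = no_punct.split()
# Keep only tokens that are not purely numeric.
no_digits = [token for token in tokens if not token.isdigit()]

print(no_digits)
```

The colon, semicolon, and hash sign disappear with the punctuation, and the meter number is dropped by the digit filter.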


In [14]:
df = df.copy()  # independent copy, so the new column is not written to a slice of the original frame
df["clean_text"] = df["text"].apply(clean_text)


In [15]:
df

Unnamed: 0,text,label_num,clean_text
0,Subject: enron methanol ; meter # : 988291\nth...,0,subject enron methanol meter follow note gave ...
1,"Subject: hpl nom for january 9 , 2001\n( see a...",0,subject hpl nom januari see attach file hplnol...
2,"Subject: neon retreat\nho ho ho , we ' re arou...",0,subject neon retreat ho ho ho around wonder ti...
3,"Subject: photoshop , windows , office . cheap ...",1,subject photoshop window offic cheap main tren...
4,Subject: re : indian springs\nthis deal is to ...,0,subject indian spring deal book teco pvr reven...
...,...,...,...
5166,Subject: put the 10 on the ft\nthe transport v...,0,subject put ft transport volum decreas contrac...
5167,Subject: 3 / 4 / 2000 and following noms\nhpl ...,0,subject follow nom hpl take extra mmcf weekend...
5168,Subject: calpine daily gas nomination\n>\n>\nj...,0,subject calpin daili ga nomin juli mention ear...
5169,Subject: industrial worksheets for august 2000...,0,subject industri worksheet august activ attach...


In [17]:
X = df["clean_text"]
y = df["label_num"]

In [19]:
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform( df["clean_text"] )
X_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 307420 stored elements and shape (5171, 37932)>
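
The summary above reports 307420 stored values in a 5171 × 37932 matrix. A quick calculation with those figures shows how sparse the TF-IDF matrix is and why it is kept in a sparse format:

```python
# Figures copied from the sparse-matrix summary.
stored = 307_420
n_docs, n_terms = 5_171, 37_932

# Fraction of cells that hold a stored value.
density = stored / (n_docs * n_terms)
# Each stored value in a row corresponds to one distinct vocabulary term in that email.
avg_terms_per_doc = stored / n_docs

print(f"density: {density:.4%}")
print(f"average distinct terms per email: {avg_terms_per_doc:.1f}")
```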

In [20]:
X_train, X_test, y_train, y_test = train_test_split( X_tfidf, y, test_size=0.3, random_state=42 )
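
One caveat with the steps above: the vectorizer is fitted on every row before the split, so the vocabulary and IDF weights also reflect emails that end up in the test set. A possible alternative, sketched here on a tiny made-up corpus standing in for `df["clean_text"]` and `df["label_num"]`, is to split the raw text first and let a scikit-learn pipeline fit the vectorizer on the training portion only. The `stratify` argument is an addition that the notebook does not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the cleaned emails and their labels (0 = ham, 1 = spam).
texts = [
    "meet monday discuss ga nomin", "attach file contract volum",
    "pleas review deal book", "schedul meter read tomorrow",
    "cheap softwar offer click", "win free prize click",
    "free pill discount offer", "cheap watch free ship",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text, not the TF-IDF matrix.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# fit() learns vocabulary and IDF weights from the training texts only;
# predict() reuses them to transform the test texts.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.1))
pipeline.fit(text_train, y_train)
predictions = pipeline.predict(text_test)
print(predictions)
```

With this layout, words that occur only in test emails are ignored at prediction time, which is closer to how the model would behave on unseen mail.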

### The alpha parameter of the model: in Naive Bayes, it is the smoothing term that prevents words never observed in a class during training from being assigned zero probability.
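
The effect of alpha can be sketched with additive smoothing on made-up word counts. The counts, the vocabulary, and the helper `word_prob` are hypothetical; the formula (count + alpha) / (total + alpha × vocabulary size) is the smoothed estimate that `MultinomialNB` uses, where the "counts" in this notebook are TF-IDF weights rather than integers.

```python
from collections import Counter

# Hypothetical word counts observed in spam training emails.
spam_counts = Counter({"free": 3, "prize": 2, "click": 1})
# "meeting" is in the vocabulary but never appears in spam.
vocabulary = ["free", "prize", "click", "meeting"]

def word_prob(word, counts, vocab, alpha):
    # P(word | class) = (count + alpha) / (total + alpha * |V|)
    total = sum(counts[w] for w in vocab)
    return (counts[word] + alpha) / (total + alpha * len(vocab))

for alpha in (0.0, 0.1, 1.0):
    print(alpha, word_prob("meeting", spam_counts, vocabulary, alpha))
```

With alpha set to 0 the unseen word gets probability 0, which would wipe out the whole product of word probabilities for that class. Any positive alpha keeps it small but nonzero.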

In [39]:
model = MultinomialNB(alpha=0.1)

In [40]:
model.fit( X_train, y_train )

In [41]:
y_pred = model.predict( X_test )
y_pred

array([0, 1, 0, ..., 0, 1, 0], shape=(1552,))

In [42]:
print( classification_report( y_test, y_pred ) )

              precision    recall  f1-score   support

           0       0.98      0.99      0.99      1121
           1       0.97      0.96      0.96       431

    accuracy                           0.98      1552
   macro avg       0.98      0.97      0.97      1552
weighted avg       0.98      0.98      0.98      1552
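
For reading the report: `support` is the number of true samples per class (1121 + 431 = 1552 test emails), and the other columns follow from the counts of true positives, false positives, and false negatives. The sketch below shows the definitions on hypothetical counts that are not taken from this run.

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp)
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one class, chosen only for illustration.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```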

