# Data Cleaning and Vectorization For NLP

## Install and Import

In [52]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 50)

In [53]:
#!pip install nltk

## Tokenization

In [54]:
import nltk        # Library that provides the tokenizers and stop word lists used below.

In [55]:
sample_text= "Oh man, this is pretty cool. We will do more such things. @mynet"

In [56]:
from nltk.tokenize import sent_tokenize, word_tokenize

__sent_tokenize -->__ Splits the text into sentences while keeping each sentence intact. The `lower()` call (not the tokenizer itself) converts all letters to lowercase beforehand:

In [57]:
sentence_token = sent_tokenize(sample_text.lower())
sentence_token

['oh man, this is pretty cool.', 'we will do more such things.', '@mynet']

__word_tokenize -->__ Splits the text word by word. As before, `lower()` converts all letters to lowercase first. Punctuation marks are also returned as separate tokens:

__!__ In practice, word_tokenize is used more often than sent_tokenize. __!__

In [58]:
word_token = word_tokenize(sample_text.lower())
word_token

['oh',
 'man',
 ',',
 'this',
 'is',
 'pretty',
 'cool',
 '.',
 'we',
 'will',
 'do',
 'more',
 'such',
 'things',
 '.',
 '@',
 'mynet']

## Removing Punctuation and Numbers

As the second step after tokenization, we need to get rid of punctuation marks and numbers.

ML models are commonly used for classification or sentiment analysis. For these analyses, numbers and punctuation marks are usually removed as well.

__isalpha -->__ Checks whether every character in the token is a letter. Tokens made up only of letters pass; punctuation marks and numeric tokens are dropped. (If numbers should be kept as well, __.isalnum()__ can be used instead of isalpha.)

In [59]:
tokens_without_punc = [w for w in word_token if w.isalpha()] # use .isalnum() to keep numbers as well
tokens_without_punc

['oh',
 'man',
 'this',
 'is',
 'pretty',
 'cool',
 'we',
 'will',
 'do',
 'more',
 'such',
 'things',
 'mynet']
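
The difference between the two string methods can be checked on a handful of tokens. The token list below is made up for illustration and is not taken from the corpus:

```python
# Hypothetical tokens chosen to show each case.
tokens = ["cool", "30", "vx2", ".", "@", "mynet"]

# isalpha(): True only if every character is a letter.
letters_only = [t for t in tokens if t.isalpha()]

# isalnum(): True if every character is a letter or a digit.
letters_and_digits = [t for t in tokens if t.isalnum()]

print(letters_only)        # ['cool', 'mynet']
print(letters_and_digits)  # ['cool', '30', 'vx2', 'mynet']
```

Note that a mixed token such as `vx2` is dropped by `isalpha()` even though it contains letters.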

## Removing Stopwords

We will do the cleaning in two different ways: once for classification and once for sentiment analysis (where it matters to us whether the text is positive or negative).

In [60]:
#nltk.download('stopwords')

In [61]:
from nltk.corpus import stopwords

We assigned the stop words of the language we are working with to the variable stop_words:

In [62]:
stop_words = stopwords.words("english")
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

We had already turned the corpus (the data) into word tokens. We will now remove the stop words from these tokens:

In [63]:
tokens_without_punc

['oh',
 'man',
 'this',
 'is',
 'pretty',
 'cool',
 'we',
 'will',
 'do',
 'more',
 'such',
 'things',
 'mynet']

To remove the stop words from the tokens obtained from the corpus, we use a list comprehension: take each token in __tokens_without_punc__, keep it if it is not in the stop word list, and drop it if it is. The result is stored in token_without_sw, which the following cells use. This gets rid of the stop words as well:
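
The cell for this step (In [64]) does not appear in this export. A minimal sketch consistent with the later use of `token_without_sw` might look as follows. The short `stop_words` list is a hand-picked subset used only to keep the example self-contained; the notebook itself uses the full `stopwords.words("english")` list:

```python
# Tokens from the previous step (copied from the output above).
tokens_without_punc = ["oh", "man", "this", "is", "pretty", "cool", "we",
                       "will", "do", "more", "such", "things", "mynet"]

# Assumption: a small subset of NLTK's English stop words, for illustration only.
stop_words = ["this", "is", "we", "will", "do", "more", "such"]

# Keep only the tokens that are NOT stop words.
token_without_sw = [t for t in tokens_without_punc if t not in stop_words]
print(token_without_sw)
# ['oh', 'man', 'pretty', 'cool', 'things', 'mynet']
```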

## Data Normalization-Lemmatization

In NLP, data normalization is done with lemmatization or stemming. Lemmatization is usually preferred because it reduces a word to its dictionary form (lemma) and therefore preserves its meaning.

In [65]:
from nltk.stem import WordNetLemmatizer

In [66]:
#nltk.download('wordnet')

The lemmatizer recognizes 'children' as the plural form of 'child' and reduces it to that lemma. If the suffix were derivational instead, the word would have a different dictionary meaning and the suffix would not be stripped:

In [67]:
WordNetLemmatizer().lemmatize("children")

'child'

With __WordNetLemmatizer().lemmatize("children", pos='n')__ we can also state whether the word is a noun or a verb. (The default is noun.)

Using a list comprehension, we reduced every token in the stop-word-free corpus to its lemma:

In [68]:
lem = [WordNetLemmatizer().lemmatize(t) for t in token_without_sw]

In [69]:
lem

['oh', 'man', 'pretty', 'cool', 'thing', 'mynet']

## Data Normalization-Stemming

In [70]:
from nltk.stem import PorterStemmer

Although 'driving' has a dictionary meaning of its own, stemming ignores this and reduces the word to its stem:

In [71]:
PorterStemmer().stem("driving")

'drive'

In [72]:
stem = [PorterStemmer().stem(t) for t in token_without_sw]

In [73]:
stem

['oh', 'man', 'pretti', 'cool', 'thing', 'mynet']

## Joining

We merged all tokens in the list into a single string with join.

In [74]:
" ".join(lem)

'oh man pretty cool thing mynet'

## Cleaning Function - for classification (NOT for sentiment analysis)

All of the cleaning steps above can be wrapped in a single function.

If we are going to do classification and no sentiment analysis, we can use the function below:

In [75]:
def cleaning(data):
    
    #1. Tokenize
    text_tokens = word_tokenize(data.lower()) 
    
    #2. Remove Puncs
    tokens_without_punc = [w for w in text_tokens if w.isalpha()]
    
    #3. Removing Stopwords
    tokens_without_sw = [t for t in tokens_without_punc if t not in stop_words]
    
    #4. lemma
    text_cleaned = [WordNetLemmatizer().lemmatize(t) for t in tokens_without_sw]
    
    #joining
    return " ".join(text_cleaned)

In [76]:
pd.Series(sample_text).apply(cleaning)

0    oh man pretty cool thing mynet
dtype: object

## Cleaning Function - for sentiment analysis

For sentiment analysis, it is important that the affirmative and negative auxiliary verbs (such as don't and aren't) remain in the text.

In [80]:
sample_text= "Oh man, this is pretty cool. We will do more such things. don't aren't are not. no problem"

Below, we replace every apostrophe (') in the text with an empty string and assign the result to a variable.

We then pass this variable to word_tokenize to split it into tokens. Because the apostrophe is gone, the token arent no longer matches the stop word aren't, so these words stay in the corpus after stop word removal:

In [82]:
s = sample_text.replace("'",'')
word = word_tokenize(s)
word 

['Oh',
 'man',
 ',',
 'this',
 'is',
 'pretty',
 'cool',
 '.',
 'We',
 'will',
 'do',
 'more',
 'such',
 'things',
 '.',
 'dont',
 'arent',
 'are',
 'not',
 '.',
 'no',
 'problem']

Sometimes are not is written instead of aren't. We also need to prevent these separately written forms from being removed at the stop word step. The function defined below takes care of this.

First, we remove the apostrophes from the auxiliary verbs.

Second, we split the text into word tokens and convert them to lowercase.

Third, we remove numbers and punctuation marks.

Fourth, before removing stop words, we use a for loop to take 'not' and 'no' out of the stop word list so that these words remain in our data. We then remove the remaining stop words with a list comprehension.

Fifth, we reduce the tokens to their lemmas with lemmatization.

In [79]:
# Keep the negation words: take "not" and "no" out of the stop word list.
for i in ["not", "no"]:
    if i in stop_words:          # guard so that re-running this cell does not raise ValueError
        stop_words.remove(i)

def cleaning_fsa(data):
    
    #1. Remove apostrophes to keep negative auxiliary verbs in the text (don't -> dont)
    text = data.replace("'",'')
         
    #2. Tokenize
    text_tokens = word_tokenize(text.lower()) 
    
    #3. Remove punctuation and numbers
    tokens_without_punc = [w for w in text_tokens if w.isalpha()]
    
    #4. Remove stopwords ("not" and "no" are kept)
    tokens_without_sw = [t for t in tokens_without_punc if t not in stop_words]
    
    #5. lemma
    text_cleaned = [WordNetLemmatizer().lemmatize(t) for t in tokens_without_sw]
    
    #joining
    return " ".join(text_cleaned)
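
Because `stop_words.remove(...)` changes the shared list in place, the earlier `cleaning` function would also start keeping 'not' and 'no' once this cell has run. One possible alternative is to build a separate stop word set for sentiment analysis. The sketch below uses a hand-picked stop word subset and hypothetical names (`base_stop_words`, `sentiment_stop_words`, `drop_stop_words`):

```python
# Assumption: a tiny stand-in for stopwords.words("english"), for illustration only.
base_stop_words = ["this", "is", "are", "not", "no", "don't", "aren't", "do"]

# Build a separate set without the negation words; the original list is untouched.
negations = {"not", "no"}
sentiment_stop_words = set(base_stop_words) - negations

def drop_stop_words(tokens, stop_words):
    """Return the tokens that are not in the given stop word collection."""
    return [t for t in tokens if t not in stop_words]

# Tokens as they look after apostrophe removal, lowercasing and isalpha filtering.
tokens = ["this", "is", "cool", "dont", "arent", "are", "not", "no", "problem"]

kept_for_sentiment = drop_stop_words(tokens, sentiment_stop_words)
kept_for_classification = drop_stop_words(tokens, base_stop_words)

print(kept_for_sentiment)       # ['cool', 'dont', 'arent', 'not', 'no', 'problem']
print(kept_for_classification)  # ['cool', 'dont', 'arent', 'problem']
```

With two separate collections, both cleaning functions can be used in the same session without affecting each other.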

In [50]:
stop_words

apply is a pandas method, so we cannot call it on sample_text until the string is wrapped in a Series. After converting the text to a Series, we applied the function defined above with apply, so the whole cleaning is done in a single line:

In [83]:
pd.Series(sample_text).apply(cleaning_fsa)

0    oh man pretty cool thing dont arent not no pro...
dtype: object

## CountVectorization and TF-IDF Vectorization

We have a corpus of tweets written about airlines. Using this corpus, we will see how CountVectorization and TF-IDF Vectorization work.

In [110]:
df = pd.read_csv("airline_tweets.csv")

In [111]:
df.head()

| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials t... | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn't today... Must mean I n... | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | @VirginAmerica it's really aggressive to blast... | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it's a really big bad thing... | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |


For text classification, NLP datasets are typically reduced to two features: __text__ and __label__. For this reason we kept only these two columns from the corpus:

In [112]:
df = df[['airline_sentiment','text']]
df

| | airline_sentiment | text |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials t... |
| 2 | neutral | @VirginAmerica I didn't today... Must mean I n... |
| 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | negative | @VirginAmerica and it's a really big bad thing... |
| ... | ... | ... |
| 14635 | positive | @AmericanAir thank you we got on a different f... |
| 14636 | negative | @AmericanAir leaving over 20 minutes Late Flig... |
| 14637 | neutral | @AmericanAir Please bring American Airlines to... |
| 14638 | negative | @AmericanAir you have my money, you change my ... |


We took the first 8 rows of df with all of its features (to keep the example easy to follow):

In [114]:
df = df.iloc[:8, :]
df

| | airline_sentiment | text |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials t... |
| 2 | neutral | @VirginAmerica I didn't today... Must mean I n... |
| 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | negative | @VirginAmerica and it's a really big bad thing... |
| 5 | negative | @VirginAmerica seriously would pay $30 a fligh... |
| 6 | positive | @VirginAmerica yes, nearly every time I fly VX... |
| 7 | neutral | @VirginAmerica Really missed a prime opportuni... |


We assigned a copy of df to df2. (Modifying a slice of a DataFrame directly can trigger pandas' SettingWithCopyWarning, so we work on an explicit copy):

In [115]:
df2 = df.copy()

We applied the cleaning_fsa function defined above (the one built for sentiment analysis) to the text column of df2 with apply, which completes the cleaning for this DataFrame.

In [117]:
df2["text"] = df2["text"].apply(cleaning_fsa)

__!!__ When building a model, word order within a sentence matters because of its grammatical structure, but the order of the sentences relative to each other does not. __!!__

In [118]:
df2

| | airline_sentiment | text |
|---|---|---|
| 0 | neutral | virginamerica dhepburn said |
| 1 | positive | virginamerica plus added commercial experience... |
| 2 | neutral | virginamerica today must mean need take anothe... |
| 3 | negative | virginamerica really aggressive blast obnoxiou... |
| 4 | negative | virginamerica really big bad thing |
| 5 | negative | virginamerica seriously would pay flight seat ... |
| 6 | positive | virginamerica yes nearly every time fly vx ear... |
| 7 | neutral | virginamerica really missed prime opportunity ... |


## CountVectorization

We will use CountVectorizer to convert the texts into numeric form.

In [119]:
X = df2["text"]                   # Tweets (input text)
y = df2["airline_sentiment"]      # Target label

In [120]:
from sklearn.model_selection import train_test_split

Our data contains 8 sentences in total; we split them evenly into train and test sets:

In [121]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, stratify = y, random_state = 42)

In [122]:
from sklearn.feature_extraction.text import CountVectorizer

We follow the same pattern as with scaling in ML and DL: fit_transform on X_train and only transform on X_test (to prevent data leakage).

When fit is applied to X_train, all unique tokens in X_train are identified; transform then counts how often each of these tokens occurs in each document.

When transform is applied to X_test, the counting is done against the vocabulary learned from X_train. For example, if the word 'car' occurs in X_test but not in X_train, it is skipped, because the documents we fitted on do not contain it.

In other words, transform always works with the unique tokens of X_train.

This is why X_train should be kept as large as possible, so that it covers as many tokens as possible.
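
The fit/transform behaviour described above can be imitated in a few lines of plain Python. This is only a sketch of the idea, not scikit-learn's implementation: it splits on whitespace, whereas CountVectorizer has its own tokenization rules, and the function names and the test sentence are made up for illustration:

```python
from collections import Counter

def fit_vocabulary(docs):
    """Learn the vocabulary: map each unique token to a column index (alphabetical)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count tokens per document; tokens missing from the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows

train_docs = ["virginamerica dhepburn said",
              "virginamerica really big bad thing"]
# Hypothetical test sentence: 'aggressive' never occurs in train_docs.
test_docs = ["virginamerica really aggressive"]

vocabulary = fit_vocabulary(train_docs)
train_counts = transform(train_docs, vocabulary)
test_counts = transform(test_docs, vocabulary)

print(vocabulary)
print(train_counts)  # [[0, 0, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1, 1]]
print(test_counts)   # [[0, 0, 0, 1, 0, 0, 1]]
```

The test row has no column for 'aggressive', so that word contributes nothing to the vector.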

In [123]:
vectorizer = CountVectorizer()
X_train_count = vectorizer.fit_transform(X_train)
X_test_count = vectorizer.transform(X_test)

The unique tokens learned by the vectorizer are used as the feature names:

In [124]:
vectorizer.get_feature_names()   # Newer scikit-learn versions use vectorizer.get_feature_names_out() (get_feature_names() was removed in 1.2).

['another',
 'away',
 'bad',
 'big',
 'dhepburn',
 'ear',
 'every',
 'fly',
 'go',
 'mean',
 'must',
 'nearly',
 'need',
 'really',
 'said',
 'take',
 'thing',
 'time',
 'today',
 'trip',
 'virginamerica',
 'vx',
 'worm',
 'yes']

We converted X_train_count to an array and can see that the tokens in each document were counted one by one:

In [125]:
X_train_count.toarray()

array([[0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
        1, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
        0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0,
        0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
        0, 0]], dtype=int64)

We converted the X_train array into a DataFrame and used the feature names as column names. We can see how many times each token occurs in each document:

In [136]:
df_count = pd.DataFrame(X_train_count.toarray(), columns = vectorizer.get_feature_names())
df_count

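| | another | away | bad | big | dhepburn | ear | every | fly | go | mean | must | nearly | need | really | said | take | thing | time | today | trip | virginamerica | vx | worm | yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |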


To compare with the DataFrame above, we printed the actual X_train data below and checked how often each word really occurs:

In [138]:
X_train

6    virginamerica yes nearly every time fly vx ear...
0                          virginamerica dhepburn said
2    virginamerica today must mean need take anothe...
4                   virginamerica really big bad thing
Name: text, dtype: object

In [133]:
X_train[6]   # Row label 6, i.e. the first sentence (position 0) of X_train.

'virginamerica yes nearly every time fly vx ear worm go away'

vectorizer.vocabulary_ --> Maps each token seen in X_train to its column index in the count matrix (not to its frequency).

In [134]:
vectorizer.vocabulary_

{'virginamerica': 20,
 'yes': 23,
 'nearly': 11,
 'every': 6,
 'time': 17,
 'fly': 7,
 'vx': 21,
 'ear': 5,
 'worm': 22,
 'go': 8,
 'away': 1,
 'dhepburn': 4,
 'said': 14,
 'today': 18,
 'must': 10,
 'mean': 9,
 'need': 12,
 'take': 15,
 'another': 0,
 'trip': 19,
 'really': 13,
 'big': 3,
 'bad': 2,
 'thing': 16}
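
To make the meaning of these integers concrete: they are column positions, assigned in alphabetical order of the tokens. A small check in plain Python, using a few entries copied from the output above (the variable names are illustrative):

```python
# A few entries copied from the vectorizer.vocabulary_ output above.
vocabulary = {"virginamerica": 20, "yes": 23, "away": 1,
              "another": 0, "bad": 2, "big": 3}

# Sorting the tokens by their stored integer recovers the column order.
by_column = sorted(vocabulary, key=vocabulary.get)
print(by_column)
# ['another', 'away', 'bad', 'big', 'virginamerica', 'yes']

# That order is the same as plain alphabetical order of the tokens.
print(by_column == sorted(vocabulary))  # True
```

So the value 20 for 'virginamerica' says that it is the 21st column, not that it occurs 20 times.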

## TF-IDF

How sklearn's TF-IDF differs from the standard TF-IDF:
https://towardsdatascience.com/how-sklearns-tf-idf-is-different-from-the-standard-tf-idf-275fa582e73d

In [103]:
from sklearn.feature_extraction.text import TfidfVectorizer

We assigned TfidfVectorizer() to a variable and again applied fit and transform.

The transform learned on X_train is also applied to X_test, but a token that occurs in X_test and not in X_train is ignored. The IDF term itself is smoothed: with the default smooth_idf=True, scikit-learn adds 1 to both the number of documents and the document frequency and computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of training documents and df(t) is the number of documents that contain token t. This avoids a division by zero and keeps the values from becoming infinite or NaN.

In [104]:
tf_idf_vectorizer = TfidfVectorizer()
X_train_tf_idf = tf_idf_vectorizer.fit_transform(X_train)
X_test_tf_idf = tf_idf_vectorizer.transform(X_test)

In [105]:
tf_idf_vectorizer.get_feature_names()      # tf_idf_vectorizer.get_feature_names_out() in newer scikit-learn versions

['another',
 'away',
 'bad',
 'big',
 'dhepburn',
 'ear',
 'every',
 'fly',
 'go',
 'mean',
 'must',
 'nearly',
 'need',
 'really',
 'said',
 'take',
 'thing',
 'time',
 'today',
 'trip',
 'virginamerica',
 'vx',
 'worm',
 'yes']

We again converted the train data to an array and then to a DataFrame:

In [106]:
X_train_tf_idf.toarray()

array([[0.        , 0.31200802, 0.        , 0.        , 0.        ,
        0.31200802, 0.31200802, 0.31200802, 0.31200802, 0.        ,
        0.        , 0.31200802, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.31200802, 0.        , 0.        ,
        0.16281873, 0.31200802, 0.31200802, 0.31200802],
       [0.        , 0.        , 0.        , 0.        , 0.66338461,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.66338461,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.34618161, 0.        , 0.        , 0.        ],
       [0.37082034, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.37082034,
        0.37082034, 0.        , 0.37082034, 0.        , 0.        ,
        0.37082034, 0.        , 0.        , 0.37082034, 0.37082034,
        0.19350944, 0.        , 0.        , 0.        ],
       [0.   

We converted it to a DataFrame and used the unique tokens as column names, which gives us the new features:

In [140]:
df_tfidf = pd.DataFrame(X_train_tf_idf.toarray(), columns = tf_idf_vectorizer.get_feature_names())
df_tfidf

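| | another | away | bad | big | dhepburn | ear | every | fly | go | mean | must | nearly | need | really | said | take | thing | time | today | trip | virginamerica | vx | worm | yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.312008 | 0.0 | 0.0 | 0.0 | 0.312008 | 0.312008 | 0.312008 | 0.312008 | 0.0 | 0.0 | 0.312008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.312008 | 0.0 | 0.0 | 0.162819 | 0.312008 | 0.312008 | 0.312008 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.663385 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.663385 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.346182 | 0.0 | 0.0 | 0.0 |
| 2 | 0.37082 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.37082 | 0.37082 | 0.0 | 0.37082 | 0.0 | 0.0 | 0.37082 | 0.0 | 0.0 | 0.37082 | 0.37082 | 0.193509 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.483803 | 0.483803 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.483803 | 0.0 | 0.0 | 0.483803 | 0.0 | 0.0 | 0.0 | 0.252468 | 0.0 | 0.0 | 0.0 |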


For comparison, we again look at X_train[6] (the sentence at position 0). In the first row above, the token __virginamerica__ has a reduced weight: it occurs in every training row, so TF-IDF treats it much like a stop word. A token that occurs in every row carries little information and does not help to separate the classes. (CountVectorizer did not down-weight virginamerica, but TF-IDF did.)

In [141]:
X_train[6]

'virginamerica yes nearly every time fly vx ear worm go away'
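
The weights in the TF-IDF matrix can be reproduced by hand for a short document. The sketch below assumes the TfidfVectorizer defaults (smoothed IDF, raw term counts, L2 normalization of each row) and recomputes the row for 'virginamerica dhepburn said'; the helper name `smooth_idf` is illustrative:

```python
import math

n_docs = 4  # number of training documents in X_train

def smooth_idf(df):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

# Document 'virginamerica dhepburn said': each token occurs once (tf = 1).
# 'virginamerica' is in all 4 training documents; the other two tokens are in 1.
doc_freq = {"virginamerica": 4, "dhepburn": 1, "said": 1}

raw = {tok: 1 * smooth_idf(df) for tok, df in doc_freq.items()}

# L2-normalize the row so that its squared weights sum to 1.
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {tok: w / norm for tok, w in raw.items()}

for tok, w in weights.items():
    print(f"{tok}: {w:.6f}")
# virginamerica: 0.346182
# dhepburn: 0.663385
# said: 0.663385
```

These values agree with row 1 of the TF-IDF table: the token that appears in every document gets an IDF of exactly 1, the lowest possible value, and so ends up with the smallest weight in the row.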

Among the words that are present in the row at index 2, the token with the lowest weight is virginamerica:

In [142]:
df_tfidf.loc[2].sort_values(ascending=False)

another          0.370820
mean             0.370820
trip             0.370820
today            0.370820
take             0.370820
must             0.370820
need             0.370820
virginamerica    0.193509
fly              0.000000
thing            0.000000
worm             0.000000
vx               0.000000
bad              0.000000
big              0.000000
time             0.000000
dhepburn         0.000000
go               0.000000
said             0.000000
really           0.000000
away             0.000000
nearly           0.000000
ear              0.000000
every            0.000000
yes              0.000000
Name: 2, dtype: float64

We converted X_test into a DataFrame (using the features learned from X_train):

In [None]:
pd.DataFrame(X_test_tf_idf.toarray(), columns = tf_idf_vectorizer.get_feature_names())

In [None]:
X_test

The first sentence of X_test is the one with index 3. The word aggressive from this sentence is not among the X_test features, which means that it did not occur in X_train and was ignored. This makes prediction harder for the model, so it is important to fit on as much training data as possible.

In [None]:
X_test[3]