![image.png](attachment:image.png)

# Data Cleaning and Vectorization For NLP

## Install and Import

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 50)

![image.png](attachment:image.png)

In [2]:
!pip install nltk






![image.png](attachment:image.png)

## Tokenization

In [3]:
import nltk

In [4]:
sample_text= "Awesome!!!, This is fantastic!!!. NLP and Computer Visions is amazing. We are very pleased... 3456"

In [5]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [6]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Yaramis\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

![image.png](attachment:image.png)

In [7]:
sentence_token = sent_tokenize(sample_text.lower())    # lowercase first, then split the text into sentences
sentence_token

['awesome!!',
 '!, this is fantastic!!!.',
 'nlp and computer visions is amazing.',
 'we are very pleased... 3456']

![image.png](attachment:image.png)

In [8]:
word_token = word_tokenize(sample_text.lower())  # split the text into word-level tokens
word_token

['awesome',
 '!',
 '!',
 '!',
 ',',
 'this',
 'is',
 'fantastic',
 '!',
 '!',
 '!',
 '.',
 'nlp',
 'and',
 'computer',
 'visions',
 'is',
 'amazing',
 '.',
 'we',
 'are',
 'very',
 'pleased',
 '...',
 '3456']

w.isalnum() is a string method that checks whether a token contains only alphanumeric characters (letters or digits) and returns True if it does.
If w consists only of letters or digits, the token is included in the tokens_without_punc list.
After running such a filter, tokens_without_punc therefore holds only the tokens from word_token that are made up of letters or digits. Tokens containing punctuation or other special characters are left out.

## Removing Punctuation and Numbers

w.isalpha() is a string method that checks whether a token contains only alphabetic characters (letters) and returns True if it does.
If w consists only of letters (that is, it contains no punctuation or digits), the token is included in the tokens_without_punc list.

![image.png](attachment:image.png)

In [9]:
tokens_without_punc = [w for w in word_token if w.isalpha()]   # list comprehension
tokens_without_punc  # punctuation, special characters and numbers are removed; use isalnum() instead to keep numbers

['awesome',
 'this',
 'is',
 'fantastic',
 'nlp',
 'and',
 'computer',
 'visions',
 'is',
 'amazing',
 'we',
 'are',
 'very',
 'pleased']

This step yields a cleaned word list with punctuation and numbers removed. That is useful in text analysis or language processing when only the words matter and numbers and special characters can be ignored.
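To make the difference between the two filters concrete, here is a small sketch on a hand-made token list (the tokens are illustrative and are not taken from the notebook's data). Note that tokens containing an apostrophe or a hyphen fail both checks.

```python
# Hypothetical tokens chosen to show where isalpha() and isalnum() differ.
tokens = ["nlp", "3456", "covid19", "n't", "...", "e-mail"]

# Letters only: digits, mixed tokens and punctuation are dropped.
alpha_only = [t for t in tokens if t.isalpha()]

# Letters and/or digits: only tokens with other characters are dropped.
alnum_only = [t for t in tokens if t.isalnum()]

print(alpha_only)  # ['nlp']
print(alnum_only)  # ['nlp', '3456', 'covid19']
```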

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [10]:
tokens_without_punc = [w for w in word_token if w.isalnum()]        # isalnum() keeps tokens made of letters and/or digits
tokens_without_punc          # punctuation and special characters are removed; numbers are kept

['awesome',
 'this',
 'is',
 'fantastic',
 'nlp',
 'and',
 'computer',
 'visions',
 'is',
 'amazing',
 'we',
 'are',
 'very',
 'pleased',
 '3456']

![image.png](attachment:image.png)

## Removing Stopwords

In [11]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Yaramis\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [12]:
from nltk.corpus import stopwords

![image.png](attachment:image.png)

In [13]:
stop_words = stopwords.words("english")  # English stop words, which will also be removed
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [14]:
stopwords.words ('turkish')

['acaba',
 'ama',
 'aslında',
 'az',
 'bazı',
 'belki',
 'biri',
 'birkaç',
 'birşey',
 'biz',
 'bu',
 'çok',
 'çünkü',
 'da',
 'daha',
 'de',
 'defa',
 'diye',
 'eğer',
 'en',
 'gibi',
 'hem',
 'hep',
 'hepsi',
 'her',
 'hiç',
 'için',
 'ile',
 'ise',
 'kez',
 'ki',
 'kim',
 'mı',
 'mu',
 'mü',
 'nasıl',
 'ne',
 'neden',
 'nerde',
 'nerede',
 'nereye',
 'niçin',
 'niye',
 'o',
 'sanki',
 'şey',
 'siz',
 'şu',
 'tüm',
 've',
 'veya',
 'ya',
 'yani']

In [15]:
stopwords.words ('french')

['au',
 'aux',
 'avec',
 'ce',
 'ces',
 'dans',
 'de',
 'des',
 'du',
 'elle',
 'en',
 'et',
 'eux',
 'il',
 'ils',
 'je',
 'la',
 'le',
 'les',
 'leur',
 'lui',
 'ma',
 'mais',
 'me',
 'même',
 'mes',
 'moi',
 'mon',
 'ne',
 'nos',
 'notre',
 'nous',
 'on',
 'ou',
 'par',
 'pas',
 'pour',
 'qu',
 'que',
 'qui',
 'sa',
 'se',
 'ses',
 'son',
 'sur',
 'ta',
 'te',
 'tes',
 'toi',
 'ton',
 'tu',
 'un',
 'une',
 'vos',
 'votre',
 'vous',
 'c',
 'd',
 'j',
 'l',
 'à',
 'm',
 'n',
 's',
 't',
 'y',
 'été',
 'étée',
 'étées',
 'étés',
 'étant',
 'étante',
 'étants',
 'étantes',
 'suis',
 'es',
 'est',
 'sommes',
 'êtes',
 'sont',
 'serai',
 'seras',
 'sera',
 'serons',
 'serez',
 'seront',
 'serais',
 'serait',
 'serions',
 'seriez',
 'seraient',
 'étais',
 'était',
 'étions',
 'étiez',
 'étaient',
 'fus',
 'fut',
 'fûmes',
 'fûtes',
 'furent',
 'sois',
 'soit',
 'soyons',
 'soyez',
 'soient',
 'fusse',
 'fusses',
 'fût',
 'fussions',
 'fussiez',
 'fussent',
 'ayant',
 'ayante',
 'ayantes',


In [16]:
tokens_without_punc  # the cleaned tokens so far

['awesome',
 'this',
 'is',
 'fantastic',
 'nlp',
 'and',
 'computer',
 'visions',
 'is',
 'amazing',
 'we',
 'are',
 'very',
 'pleased',
 '3456']

Next, we take the cleaned tokens and check whether each one is in the stop word list. Tokens that are not in it are collected into a new list.

![image.png](attachment:image.png)

In [17]:
# For sentiment analysis, keep negations and negative auxiliary verbs instead of removing them.
token_without_sw = [t for t in tokens_without_punc if t not in stop_words]
token_without_sw

['awesome',
 'fantastic',
 'nlp',
 'computer',
 'visions',
 'amazing',
 'pleased',
 '3456']

![image.png](attachment:image.png)

## Data Normalization-Lemmatization

In [18]:
import nltk

In [19]:
from nltk.stem import WordNetLemmatizer

In [20]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Yaramis\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

This code downloads the WordNet dataset using the NLTK (Natural Language Toolkit) library in Python. WordNet is a lexical resource frequently used in natural language processing projects. It provides a dictionary and a hierarchical (tree-like) structure covering word meanings, synonyms, antonyms and the relations between words.

The command nltk.download('wordnet') downloads the WordNet dataset. Once it is available, you can use WordNet data for tasks such as finding the base forms of words, inspecting word meanings and finding related words.

For example, after the download, the cell below uses WordNetLemmatizer to find the lemma ("child") of the word "children". Operations like this are common in natural language processing and text analysis.

In [21]:
WordNetLemmatizer().lemmatize("children")

'child'

In [22]:
WordNetLemmatizer().lemmatize("runs", pos='v')  # pos='v' tells the lemmatizer that the word is a verb; the default is 'n' (noun)

'run'

When calling WordNetLemmatizer().lemmatize(), the pos (part of speech) argument states which word class (noun, verb, adjective, etc.) the word belongs to. This gives the lemmatizer the grammatical information it needs. Some of the values pos can take:

'n': noun
'v': verb
'a': adjective
'r': adverb

For example, to lemmatize "runs" as a verb, pass pos='v'. The call WordNetLemmatizer().lemmatize("runs", pos='v') lemmatizes "runs" according to the "v" (verb) word class.

Lemmatization is the process of finding the base form (lemma) of a word. It takes the word's inflections and structure into account in order to recover its dictionary form. Here "runs" is a verb, which is why pos='v' is specified.

The output is "run": because "runs" is treated as a verb, lemmatization reduces it to "run".
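In practice the word class of each token is usually not known in advance. One common approach, sketched here as an assumption rather than as part of the original notebook, is to run a part-of-speech tagger that emits Penn Treebank tags and map each tag to one of the four pos letters listed above. The helper name penn_to_wordnet_pos is hypothetical.

```python
def penn_to_wordnet_pos(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBZ') to a WordNet pos letter.

    Hypothetical helper: anything that is not clearly an adjective,
    verb or adverb falls back to 'n', the lemmatizer's default.
    """
    if tag.startswith("J"):    # JJ, JJR, JJS
        return "a"
    if tag.startswith("V"):    # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("RB"):   # RB, RBR, RBS (not RP, which is a particle)
        return "r"
    return "n"


for tag in ["VBZ", "JJ", "RB", "NNS"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

The returned letter could then be passed as the pos argument of lemmatize() for each tagged token, for instance with tags produced by nltk.pos_tag (which needs its own tagger resource to be downloaded).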

In [23]:
token_without_sw

['awesome',
 'fantastic',
 'nlp',
 'computer',
 'visions',
 'amazing',
 'pleased',
 '3456']

In [24]:
lem =[WordNetLemmatizer().lemmatize(t) for t in token_without_sw]

This code lemmatizes the words in a list. It iterates over the token_without_sw list and builds a new list, lem, holding the lemmatized form of each word. This is a common way to obtain the base forms of words after unnecessary words such as stop words have been removed.

In [25]:
lem

['awesome',
 'fantastic',
 'nlp',
 'computer',
 'vision',
 'amazing',
 'pleased',
 '3456']

![image.png](attachment:image.png)

## Data Normalization-Stemming

In [26]:
from nltk.stem import PorterStemmer

In [27]:
PorterStemmer().stem("development")

'develop'

In [28]:
stem = [PorterStemmer().stem(t) for t in token_without_sw]

In [29]:
stem

['awesom', 'fantast', 'nlp', 'comput', 'vision', 'amaz', 'pleas', '3456']

In [34]:
# we can see that the stems are not very meaningful

## Joining

In [30]:
" ".join(lem)

'awesome fantastic nlp computer vision amazing pleased 3456'

This code joins the words in the lem list into a single string, placing a space between consecutive words.

![image.png](attachment:image.png)

## Cleaning Function - for classification (NOT for sentiment analysis)

In [31]:
def cleaning(data):
    
    #1. Tokenize
    text_tokens = word_tokenize(data.lower()) 
    
    #2. Remove Puncs
    tokens_without_punc = [w for w in text_tokens if w.isalpha()]
    
    #3. Removing Stopwords
    tokens_without_sw = [t for t in tokens_without_punc if t not in stop_words]
    
    #4. lemma
    text_cleaned = [WordNetLemmatizer().lemmatize(t) for t in tokens_without_sw]
    
    #joining
    return " ".join(text_cleaned)

In [32]:
sample_text

'Awesome!!!, This is fantastic!!!. NLP and Computer Visions is amazing. We are very pleased... 3456'

In [33]:
pd.Series(sample_text).apply(cleaning)

0    awesome fantastic nlp computer vision amazing ...
dtype: object

sample_text is a plain Python string, so .apply() cannot be called on it directly: .apply() is a method of pandas Series (and DataFrames). That is why the string is wrapped in pd.Series above. To process a plain Python list, call the function on each element with a loop or a list comprehension instead.
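As a minimal sketch of the list-based alternative, the example below applies a function to every element of a plain Python list with a list comprehension. The simple_clean function is a hypothetical, dependency-free stand-in for the notebook's cleaning function (it only lowercases, strips surrounding punctuation and drops non-alphabetic words).

```python
import string


def simple_clean(text: str) -> str:
    """Hypothetical stand-in for cleaning(): lowercase, strip punctuation, keep letters only."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return " ".join(w for w in words if w.isalpha())


texts = ["Awesome!!! This is fantastic.", "We are very pleased... 3456"]

# A list has no .apply(), so call the function on each element instead.
cleaned = [simple_clean(t) for t in texts]
print(cleaned)  # ['awesome this is fantastic', 'we are very pleased']
```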

## Cleaning Function - for sentiment analysis

![image.png](attachment:image.png)

In [34]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# English stop word list
stop_words = set(stopwords.words('english'))

# Words to take out of the stop word list, because we want them to stay in the text
words_to_remove = ['not', 'no', 'is']

# Build the updated stop word list without those words
updated_stop_words = [word for word in stop_words if word not in words_to_remove]

# A sample sentence
sentence = "This is not a good idea, but no problem."

# Split the sentence into tokens
tokens = word_tokenize(sentence.lower())

# Filter out the stop words
filtered_tokens = [token for token in tokens if token not in updated_stop_words]

# Show the result
print(filtered_tokens)


['is', 'not', 'good', 'idea', ',', 'no', 'problem', '.']
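The same idea can be written with set operations, which also makes the membership test fast. This is a sketch on a tiny hand-made stop word set (an assumption for illustration, not the NLTK list), with hypothetical variable names.

```python
# Hypothetical mini stop word set standing in for stopwords.words('english').
base_stop_words = {"this", "is", "a", "but", "not", "no"}

# Negation words we want to keep for sentiment analysis.
negations = {"not", "no"}

# Set difference removes the negations from the stop word set.
sentiment_stop_words = base_stop_words - negations

tokens = ["this", "is", "not", "a", "good", "idea", "but", "no", "problem"]
kept = [t for t in tokens if t not in sentiment_stop_words]
print(kept)  # ['not', 'good', 'idea', 'no', 'problem']
```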


In [35]:
sample_text1= "Awesome!!!, This is fantastic!!!. wonderful..now there. wasn't We are very pleased...3456. don't eat, isn't. no problem for me"

sample_text1.replace("'",''): this call replaces the apostrophes (') in sample_text1 with an empty string (''), which removes them from the text.

In [36]:
s = sample_text1.replace("'",'')
word = word_tokenize(s)
word 

['Awesome',
 '!',
 '!',
 '!',
 ',',
 'This',
 'is',
 'fantastic',
 '!',
 '!',
 '!',
 '.',
 'wonderful',
 '..',
 'now',
 'there',
 '.',
 'wasnt',
 'We',
 'are',
 'very',
 'pleased',
 '...',
 '3456.',
 'dont',
 'eat',
 ',',
 'isnt',
 '.',
 'no',
 'problem',
 'for',
 'me']

In [37]:
stop_words = set(stopwords.words('english'))
words_to_remove = ['dont']   # words to keep in the text (a no-op if 'dont' is not in the stop word list)
stop_words = [word for word in stop_words if word not in words_to_remove]

def cleaning_fsa(data):
    # 1. Remove apostrophes so that negative contractions (e.g. "don't" -> "dont") survive stop word removal
    text = data.replace("'", '')
    # 2. Tokenize
    text_tokens = word_tokenize(text.lower())
    # 3. Remove punctuation and numbers
    tokens_without_punc = [w for w in text_tokens if w.isalpha()]
    # 4. Remove stopwords
    tokens_without_sw = [t for t in tokens_without_punc if t not in stop_words]
    # 5. Lemmatization
    text_cleaned = [WordNetLemmatizer().lemmatize(t) for t in tokens_without_sw]
    # Joining
    return " ".join(text_cleaned)

In [38]:
pd.Series(s).apply(cleaning_fsa)

0    awesome fantastic wonderful wasnt pleased dont...
dtype: object

## CountVectorization and TF-IDF Vectorization

In [39]:
df = pd.read_csv("airline_tweets.csv")


In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   tweet_id                      14640 non-null  int64  
 1   airline_sentiment             14640 non-null  object 
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object 
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object 
 6   airline_sentiment_gold        40 non-null     object 
 7   name                          14640 non-null  object 
 8   negativereason_gold           32 non-null     object 
 9   retweet_count                 14640 non-null  int64  
 10  text                          14640 non-null  object 
 11  tweet_coord                   1019 non-null   object 
 12  tweet_created                 14640 non-null  object 
 13  t

In [46]:
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [47]:
df = df[['airline_sentiment','text']]
df

|  | airline_sentiment | text |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials t... |
| 2 | neutral | @VirginAmerica I didn't today... Must mean I n... |
| 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | negative | @VirginAmerica and it's a really big bad thing... |
| ... | ... | ... |
| 14635 | positive | @AmericanAir thank you we got on a different f... |
| 14636 | negative | @AmericanAir leaving over 20 minutes Late Flig... |
| 14637 | neutral | @AmericanAir Please bring American Airlines to... |
| 14638 | negative | @AmericanAir you have my money, you change my ... |


In [48]:
df.head(8)

|  | airline_sentiment | text |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials t... |
| 2 | neutral | @VirginAmerica I didn't today... Must mean I n... |
| 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | negative | @VirginAmerica and it's a really big bad thing... |
| 5 | negative | @VirginAmerica seriously would pay $30 a fligh... |
| 6 | positive | @VirginAmerica yes, nearly every time I fly VX... |
| 7 | neutral | @VirginAmerica Really missed a prime opportuni... |


df2 = df.copy() copies a pandas DataFrame (df). The aim is to create a new DataFrame (df2) with the same data and structure, without changing the original DataFrame.

Copying is useful for backing up data, or when the same dataset is needed for different analyses or operations. You can run different analyses or try out changes on the copy while the original stays untouched, which helps preserve data integrity.

In short, df2 = df.copy() creates a full copy of df, so the original data and structure are preserved.
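A small sketch of this behaviour on a toy DataFrame (the data and the names df_toy and df_copy are made up for illustration): changing the copy leaves the original untouched.

```python
import pandas as pd

# Toy data, not the airline tweets.
df_toy = pd.DataFrame({"text": ["Great FLIGHT", "Late Again"]})

# Independent copy: edits to df_copy do not affect df_toy.
df_copy = df_toy.copy()
df_copy["text"] = df_copy["text"].str.lower()

print(df_toy["text"].tolist())   # ['Great FLIGHT', 'Late Again']
print(df_copy["text"].tolist())  # ['great flight', 'late again']
```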

In [49]:
df2 = df.copy()

![image.png](attachment:image.png)

In [50]:
df2["text"] = df2["text"].apply(cleaning)    # apply the cleaning function defined above to every text

This line applies the cleaning function to the text data in the "text" column of the DataFrame. Each text is cleaned and the result is stored back in the "text" column of df2.

In other words, every entry in the "text" column is passed through cleaning: it is lowercased and tokenized, punctuation, numbers and stop words are removed, and the remaining words are lemmatized.

As a result, df2 holds the cleaned version of the text data. This is a routine step when preparing text for analysis or modelling.

In [51]:
df2

|  | airline_sentiment | text |
|---|---|---|
| 0 | neutral | virginamerica dhepburn said |
| 1 | positive | virginamerica plus added commercial experience... |
| 2 | neutral | virginamerica today must mean need take anothe... |
| 3 | negative | virginamerica really aggressive blast obnoxiou... |
| 4 | negative | virginamerica really big bad thing |
| ... | ... | ... |
| 14635 | positive | americanair thank got different flight chicago |
| 14636 | negative | americanair leaving minute late flight warning... |
| 14637 | neutral | americanair please bring american airline |
| 14638 | negative | americanair money change flight answer phone s... |


![image.png](attachment:image.png)

## CountVectorization

In [56]:
X = df2["text"]
y = df2["airline_sentiment"]

In [57]:
from sklearn.model_selection import train_test_split

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, random_state = 42)

The **stratify = y** parameter is used here. It keeps the class proportions of y the same in both the training set and the test set, so the model is trained and evaluated on data with comparable class balance.
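The effect of stratification can be checked by counting the labels on each side of the split. The sketch below uses made-up labels with a 60/40 class ratio (an assumption for illustration); with stratify set, both parts keep that ratio.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 60% "negative" and 40% "positive".
X_toy = list(range(100))
y_toy = ["negative"] * 60 + ["positive"] * 40

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Count the labels in each part of the split.
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts)  # 48 negative, 32 positive
print(test_counts)   # 12 negative, 8 positive
```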

In [59]:
from sklearn.feature_extraction.text import CountVectorizer

![image.png](attachment:image.png)

In [60]:
vectorizer = CountVectorizer()                      # same fit/transform pattern as scaling features in ML
X_train_count = vectorizer.fit_transform(X_train)   # fit learns all unique tokens; transform counts them
X_test_count = vectorizer.transform(X_test)         # words in X_test that never appear in X_train are ignored,
                                                    # so fitting on a large corpus gives better results

Here CountVectorizer is used to convert the data into count vectors (bag of words). CountVectorizer is a widely used way to turn text into word-level count vectors.

The conversion happens in two steps:

fit_transform(X_train): This step fits CountVectorizer on the training data (X_train). It learns all unique words and builds a count matrix in which each word is represented by one column. Each cell holds the number of times that word occurs in that document.

transform(X_test): This step applies the same transformation to the test data (X_test). Only words that were seen in the training data are counted, and those counts form the count matrix for the test data. In other words, the test data is represented using the vocabulary learned from the training data.

As a result, X_train_count and X_test_count are the bag-of-words count representations of the text. This representation lets the text be processed numerically and fed to machine learning models.

In [66]:
vectorizer.get_feature_names_out()            # returns all unique tokens

array(['aa', 'aaaand', 'aaadvantage', ..., 'zrh', 'zukes', 'zurich'],
      dtype=object)

In [67]:
X_train_count.toarray()   

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

Calling toarray() on X_train_count converts the sparse matrix into a dense matrix and returns it as a NumPy array. A sparse matrix stores only the non-zero values together with their row and column indices, whereas the dense array stores every cell explicitly, zeros included.
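A minimal sketch of the sparse versus dense distinction, using a small hand-made matrix and SciPy's csr_matrix (the values are illustrative only):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero matrix, like a bag-of-words count matrix in miniature.
dense = np.array([[0, 0, 1, 0],
                  [0, 0, 0, 0],
                  [2, 0, 0, 0]])

sparse = csr_matrix(dense)

print(sparse.shape)  # (3, 4): same logical shape as the dense array
print(sparse.nnz)    # 2: only the non-zero entries are stored

# toarray() materialises every cell again, zeros included.
back = sparse.toarray()
print(back)
```

For a vocabulary of several thousand words, most cells are zero, so calling toarray() on a large matrix can use far more memory than the sparse form.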

In [70]:
df_count = pd.DataFrame(X_train_count.toarray(), columns = vectorizer.get_feature_names_out())   
# Every token seen in X_train appears here as a column, in alphabetical order.
df_count

Unnamed: 0,aa,aaaand,aaadvantage,aaalwayslate,aadavantage,aadelay,aadv,aadvantage,aafail,aal,aaron,aarp,ab,abandon,abandoned,abandonment,abassinet,abbreve,abc,abcletjetbluestreamfeed,abcnetwork,abcnews,abducted,abi,abigailedge,...,ystday,ystrdy,yuck,yucki,yuma,yummy,yup,yvonne,yvonneokaka,yvr,yxe,yxu,yyz,zabsonre,zakkohane,zero,zfv,zipper,zkatcher,zombie,zone,zoom,zrh,zukes,zurich
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11707,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11708,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
11709,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11710,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [71]:
df_count.duplicated().any()

True

In [72]:
duplicate_rows = df_count[df_count.duplicated()]
duplicate_rows

Unnamed: 0,aa,aaaand,aaadvantage,aaalwayslate,aadavantage,aadelay,aadv,aadvantage,aafail,aal,aaron,aarp,ab,abandon,abandoned,abandonment,abassinet,abbreve,abc,abcletjetbluestreamfeed,abcnetwork,abcnews,abducted,abi,abigailedge,...,ystday,ystrdy,yuck,yucki,yuma,yummy,yup,yvonne,yvonneokaka,yvr,yxe,yxu,yyz,zabsonre,zakkohane,zero,zfv,zipper,zkatcher,zombie,zone,zoom,zrh,zukes,zurich
236,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
344,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
386,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
566,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
594,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11643,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11654,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11674,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11679,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [73]:
X_train

1262                                     united would cost
10772    usairways used get email snack time check got ...
4204     united flight cancelled flightlations one due ...
5491     southwestair frustrated idea great crew thanks...
12096      americanair narrowly made standby lot snag trip
                               ...                        
305      virginamerica flight booking problem section w...
10579    usairways thanks hour flight pit phx zero ente...
4514     southwestair assigned seating cousin amp proba...
7131                    jetblue enough stop flying jetblue
4927     southwestair beautiful fleet perfect evening f...
Name: text, Length: 11712, dtype: object

In [74]:
X_train[65]

'virginamerica flight dal dca tried check could status please'

In [75]:
vectorizer.vocabulary_     # dict mapping each token to its column index; the indices follow the alphabetical order of the tokens

{'united': 8574,
 'would': 9131,
 'cost': 1819,
 'usairways': 8684,
 'used': 8702,
 'get': 3394,
 'email': 2596,
 'snack': 7499,
 'time': 8230,
 'check': 1391,
 'got': 3493,
 'neither': 5416,
 'tomorrow': 8283,
 'trip': 8394,
 'sent': 7217,
 'flight': 3079,
 'cancelled': 1214,
 'flightlations': 3093,
 'one': 5703,
 'due': 2489,
 'weather': 8922,
 'mechanical': 5051,
 'paid': 5864,
 'hotel': 3879,
 'bag': 684,
 'held': 3723,
 'transfer': 8338,
 'southwestair': 7589,
 'frustrated': 3287,
 'idea': 3962,
 'great': 3529,
 'crew': 1893,
 'thanks': 8108,
 'happycustomer': 3655,
 'americanair': 304,
 'narrowly': 5366,
 'made': 4898,
 'standby': 7693,
 'lot': 4825,
 'snag': 7501,
 'pay': 5953,
 'accommodation': 57,
 'flightling': 3096,
 'reason': 6548,
 'know': 4516,
 'saying': 7092,
 'lying': 4891,
 'cause': 1292,
 'see': 7181,
 'aa': 0,
 'tsa': 8431,
 'traveler': 8358,
 'id': 3960,
 'added': 104,
 'boarding': 942,
 'pas': 5910,
 'included': 4054,
 'precheck': 6236,
 'flying': 3134,
 'tomorro'

The code below sorts the vocabulary built by CountVectorizer by index number and prints it in reverse (largest index first). This is a handy way to list the words in the vocabulary together with their index numbers.

In [76]:
sorted_vocab = sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1], reverse=True)

for word, index in sorted_vocab:
    print(word, index)


zurich 9257
zukes 9256
zrh 9255
zoom 9254
zone 9253
zombie 9252
zkatcher 9251
zipper 9250
zfv 9249
zero 9248
zakkohane 9247
zabsonre 9246
yyz 9245
yxu 9244
yxe 9243
yvr 9242
yvonneokaka 9241
yvonne 9240
yup 9239
yummy 9238
yuma 9237
yucki 9236
yuck 9235
ystrdy 9234
ystday 9233
yr 9232
yponthebeat 9231
yow 9230
youve 9229
youth 9228
yout 9227
yousuck 9226
yourstoryhere 9225
yourphonesystemsucks 9224
yourock 9223
youretheworst 9222
youredoingitwrong 9221
youre 9220
yourairlinesucks 9219
youragentshavenoclue 9218
young 9217
youknowyouwantto 9216
youdidit 9215
youcouldntmakethis 9214
youcandobetter 9213
youareonyourown 9212
york 9211
yogurt 9210
yoga 9209
yo 9208
yikes 9207
yield 9206
yet 9205
yesterday 9204
yest 9203
yeseniahernandez 9202
yes 9201
yer 9200
yep 9199
yellow 9198
yelling 9197
yelled 9196
yell 9195
yeg 9194
yeehaw 9193
year 9192
yeah 9191
yea 9190
yday 9189
yayayay 9188
yay 9187
yasssss 9186
yasss 9185
yard 9184
yall 9183
yaffasolin 9182
ya 9181
xzmscw 9180
xxx 9179
xx 9178
x

In [77]:
sorted_vocab = sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1], reverse=False)

for word, index in sorted_vocab:
    print(word, index)


aa 0
aaaand 1
aaadvantage 2
aaalwayslate 3
aadavantage 4
aadelay 5
aadv 6
aadvantage 7
aafail 8
aal 9
aaron 10
aarp 11
ab 12
abandon 13
abandoned 14
abandonment 15
abassinet 16
abbreve 17
abc 18
abcletjetbluestreamfeed 19
abcnetwork 20
abcnews 21
abducted 22
abi 23
abigailedge 24
ability 25
able 26
aboard 27
abounds 28
abq 29
absolute 30
absolutely 31
absorb 32
absoulutely 33
absurd 34
absurdly 35
abt 36
abundance 37
abuse 38
abused 39
abysmal 40
ac 41
accelerate 42
accept 43
acceptable 44
accepted 45
accepting 46
acces 47
access 48
accessible 49
accessing 50
accident 51
accidentally 52
accommodate 53
accommodated 54
accommodates 55
accommodating 56
accommodation 57
accompaniment 58
accompany 59
accomplish 60
accomplished 61
according 62
accordingly 63
account 64
accountability 65
accrue 66
accruing 67
acct 68
accts 69
accurate 70
accurately 71
accuratetraveltimes 72
accused 73
achieve 74
achieves 75
achieving 76
ack 77
acknowledge 78
acknowledgement 79
acknowledgment 80
acnewsguy 81
a

![image.png](attachment:image.png)

## TF-IDF

In [78]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [79]:
tf_idf_vectorizer = TfidfVectorizer()
X_train_tf_idf = tf_idf_vectorizer.fit_transform(X_train)   # fit learns the vocabulary and, for each term, how many documents contain it
X_test_tf_idf = tf_idf_vectorizer.transform(X_test)         # applies the TF-IDF formula with the statistics learned from X_train

In [82]:
tf_idf_vectorizer.get_feature_names_out()

array(['aa', 'aaaand', 'aaadvantage', ..., 'zrh', 'zukes', 'zurich'],
      dtype=object)

In [83]:
X_train_tf_idf.toarray()         # the values are now floats

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
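
The printed corners are all zeros because each document contains only a handful of the vocabulary's tokens, so the matrix is overwhelmingly sparse. `fit_transform` returns a SciPy sparse matrix for this reason, and `toarray()` is only needed for display. A small sketch, again on a hypothetical toy corpus, of how the sparsity could be inspected without densifying:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
docs = ["late flight again", "great flight crew", "late again"]
X_toy = TfidfVectorizer().fit_transform(docs)

n_rows, n_cols = X_toy.shape
# nnz is the number of stored (non-zero) entries
density = X_toy.nnz / (n_rows * n_cols)

print(X_toy.shape, X_toy.nnz, round(density, 3))
```

On a real corpus with thousands of columns and short documents, the density is typically far lower than in this toy example.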

In [86]:
df_tfidf = pd.DataFrame(X_train_tf_idf.toarray(), columns = tf_idf_vectorizer.get_feature_names_out())
df_tfidf

Unnamed: 0,aa,aaaand,aaadvantage,aaalwayslate,aadavantage,aadelay,aadv,aadvantage,aafail,aal,aaron,aarp,ab,abandon,abandoned,abandonment,abassinet,abbreve,abc,abcletjetbluestreamfeed,abcnetwork,abcnews,abducted,abi,abigailedge,...,ystday,ystrdy,yuck,yucki,yuma,yummy,yup,yvonne,yvonneokaka,yvr,yxe,yxu,yyz,zabsonre,zakkohane,zero,zfv,zipper,zkatcher,zombie,zone,zoom,zrh,zukes,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11707,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11708,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.315675,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11710,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [87]:
X_train[70]

'virginamerica need change reservation virgin credit card need modify phone waive change fee online'
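
A caveat when matching a text to its row in the TF-IDF frame: if `X_train` is a pandas Series produced by `train_test_split` (an assumption, since the split is not shown in this section), `X_train[70]` looks up the original index label 70, whereas the rows of `df_tfidf` are numbered by position from 0. A sketch with hypothetical data (`texts`, `train`, `train_reset`) showing the difference and one way to align the two:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data, used only for illustration
texts = pd.Series([f"tweet {i}" for i in range(10)])
train, test = train_test_split(texts, test_size=0.3, random_state=0)

# The split keeps the original labels, in shuffled order
first_label = train.index[0]
print(first_label, train[first_label], train.iloc[0])

# After reset_index, label k and position k refer to the same text,
# matching the 0..n-1 row numbering of a freshly built DataFrame
train_reset = train.reset_index(drop=True)
print(train_reset[0] == train.iloc[0])
```

With the index reset, `train_reset[k]` and row `k` of a frame built from the transformed matrix refer to the same document.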

In [88]:
df_tfidf.loc[170].sort_values(ascending=False)

course           0.396082
failed           0.385889
place            0.357716
job              0.329648
company          0.323544
                   ...   
flightglobal     0.000000
flighting        0.000000
flightlanding    0.000000
flightlation     0.000000
zurich           0.000000
Name: 170, Length: 9258, dtype: float64

In [89]:
df_tfidf.sample(50)

Unnamed: 0,aa,aaaand,aaadvantage,aaalwayslate,aadavantage,aadelay,aadv,aadvantage,aafail,aal,aaron,aarp,ab,abandon,abandoned,abandonment,abassinet,abbreve,abc,abcletjetbluestreamfeed,abcnetwork,abcnews,abducted,abi,abigailedge,...,ystday,ystrdy,yuck,yucki,yuma,yummy,yup,yvonne,yvonneokaka,yvr,yxe,yxu,yyz,zabsonre,zakkohane,zero,zfv,zipper,zkatcher,zombie,zone,zoom,zrh,zukes,zurich
8347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4307,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9798,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4664,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3676,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9480,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [90]:
df_tfidf.head()

Unnamed: 0,aa,aaaand,aaadvantage,aaalwayslate,aadavantage,aadelay,aadv,aadvantage,aafail,aal,aaron,aarp,ab,abandon,abandoned,abandonment,abassinet,abbreve,abc,abcletjetbluestreamfeed,abcnetwork,abcnews,abducted,abi,abigailedge,...,ystday,ystrdy,yuck,yucki,yuma,yummy,yup,yvonne,yvonneokaka,yvr,yxe,yxu,yyz,zabsonre,zakkohane,zero,zfv,zipper,zkatcher,zombie,zone,zoom,zrh,zukes,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
