# Movie Reviews and Bag-of-Words Modeling

🎯 The goal of this challenge is to play with ***Bag-of-words*** modeling of texts.

✍️ The dataset below contains 2000 reviews, each classified as _"positive"_ or _"negative"_.

In [1]:
import pandas as pd

data = pd.read_csv("https://d32aokrjazspmn.cloudfront.net/materials/movie_reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [2]:
data.shape

(2000, 2)

## 1. Preprocessing

❓ **Question (Text Cleaning)** ❓

- Write a `preprocessing` function that cleans a sentence, and apply it to all reviews. The function should:
    - strip leading and trailing whitespace
    - convert letters to lowercase
    - remove numbers
    - remove punctuation
    - tokenize (split into words)
    - lemmatize (reduce each word to its base form)
- You can store the cleaned reviews in a column named `clean_reviews`.
- Do not remove stopwords at this stage; we will explain why in section `3. N-gram Modeling`.

In [7]:
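import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Note: word_tokenize and WordNetLemmatizer rely on NLTK data packages
# (typically fetched once with nltk.download, e.g. 'punkt' and 'wordnet').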

def preprocessing(sentence):
    # Stripping leading and trailing whitespace
    sentence = sentence.strip()
    # Lowercasing
    sentence = sentence.lower()
    # Removing numbers
    sentence = ''.join(char for char in sentence if not char.isdigit())
    # Removing punctuation
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')
    # Tokenizing
    tokenized = word_tokenize(sentence)
    # Lemmatizing
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word) for word in tokenized]
    cleaned_sentence = " ".join(lemmatized)
    return cleaned_sentence
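
The character-level steps above (strip, lowercase, drop digits and punctuation) can also be done in a single pass with a translation table. The following is only a sketch of that alternative, not the notebook's solution: the helper name `strip_digits_and_punctuation` is hypothetical, and it assumes ASCII digits only, whereas `char.isdigit()` in the cell above also matches non-ASCII digits.

```python
import string

# Translation table that deletes every ASCII digit and punctuation character.
# Assumption: ASCII digits only (str.isdigit() would also match other scripts).
_DELETE_TABLE = str.maketrans("", "", string.digits + string.punctuation)


def strip_digits_and_punctuation(sentence: str) -> str:
    """Strip, lowercase, and drop digits and punctuation in one pass."""
    return sentence.strip().lower().translate(_DELETE_TABLE)


# Hypothetical sample sentence, not taken from the dataset.
cleaned = strip_digits_and_punctuation("  The 2 Sequels were... BAD!  ")
print(cleaned.split())
```

Tokenization and lemmatization would still run afterwards, exactly as in `preprocessing`.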


In [8]:
data['clean_reviews'] = data.reviews.apply(preprocessing)
data.head()

Unnamed: 0,target,reviews,clean_reviews
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couple go to a church party drin...
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastard quick movie review damn that...
2,neg,it is movies like these that make a jaded movi...,it is movie like these that make a jaded movie...
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first feature...
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ps...


❓ **Question (LabelEncoding)** ❓

Encode your target with a `LabelEncoder` and store the result in a column named `target_encoded`.

In [9]:
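# Import the class directly: `from sklearn import preprocessing` would
# shadow the `preprocessing` function defined earlier in this notebook.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data["target_encoded"] = le.fit_transform(data.target)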

In [10]:
# Quick check
data.head()

Unnamed: 0,target,reviews,clean_reviews,target_encoded
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couple go to a church party drin...,0
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastard quick movie review damn that...,0
2,neg,it is movies like these that make a jaded movi...,it is movie like these that make a jaded movie...,0
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first feature...,0
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ps...,0
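
To see which integer each class receives, it can help to inspect the fitted encoder. The snippet below is a small sketch on made-up labels (not the real dataset); it relies on `LabelEncoder` storing its classes in sorted order, so `"neg"` maps to 0 and `"pos"` to 1.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only, not the movie review dataset.
labels = ["neg", "pos", "pos", "neg"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {str(label): index for index, label in enumerate(encoder.classes_)}

# inverse_transform recovers the original string labels from the codes.
decoded = encoder.inverse_transform(encoded)

print(mapping)
print(list(decoded))
```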


## 2. Bag-of-Words Modeling

❓ **Question (NaiveBayes with unigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

vectorizer = CountVectorizer()
naivebayes = MultinomialNB()

X_bow = vectorizer.fit_transform(data.clean_reviews)

cv_nb = cross_validate(
    naivebayes,
    X_bow,
    data.target_encoded,
    scoring = "accuracy"
)

round(cv_nb['test_score'].mean(),2)

0.82
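
In the cell above the vectorizer is fitted on all reviews before cross-validation, so each fold's vocabulary includes words from its own test split. One possible variant is to wrap the vectorizer and the classifier in a pipeline, so that the vocabulary is learned inside each training fold. The sketch below uses a made-up toy corpus and `cv=2` only so that it runs on its own; it makes no claim about the score on the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus: 1 = positive, 0 = negative.
texts = [
    "a great and moving film",
    "wonderful acting and a great plot",
    "great fun from start to finish",
    "a moving and wonderful story",
    "a dull and boring film",
    "terrible acting and a boring plot",
    "boring from start to finish",
    "a dull and terrible story",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is refitted on the training part of every fold.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

cv_results = cross_validate(pipeline, texts, labels, cv=2, scoring="accuracy")
mean_accuracy = cv_results["test_score"].mean()
print(round(mean_accuracy, 2))
```

On the notebook's data, the same pipeline could be passed to `cross_validate` with `data.clean_reviews` and `data.target_encoded`.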

## 3. N-gram Modeling

👀 Remember that we asked you not to remove stopwords. Why?

👉 We are going to train the Naive Bayes model on bigrams. In a sentence such as "I do not like coriander", it is therefore important to capture the "do not" bigram in order to detect the negation. Removing stopwords would discard both "do" and "not" and erase that signal.

❓ **Question (NaiveBayes with bigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [12]:
vectorizer = CountVectorizer(ngram_range = (2,2))
naivebayes = MultinomialNB()

X_bow = vectorizer.fit_transform(data.clean_reviews)

cv_nb = cross_validate(
    naivebayes,
    X_bow,
    data.target_encoded,
    scoring = "accuracy"
)

round(cv_nb['test_score'].mean(),2)

0.84

🏁 Congratulations! You now know how to train a Naive Bayes model on vectorized texts.

💾 Don't forget to `git add/commit/push` your notebook...

