# ***AIN 313 - Assignment 2***

### ***Instructor:*** Erkut Erdem
### ***Assistant:*** Sibel Kapan
### ***Topic:*** Naive Bayes
### ***Subject:*** Book Genre Classification with Naive Bayes
### ***Student Info:*** Can Ali Ateş
### ***Student ID:*** 2200765002

# ***PART I - Theory Questions***

## ***Maximum Likelihood Estimation (MLE)***

![Question1](https://drive.google.com/uc?id=1VQ27utuzMXbF0yU3s6jbOZmuQFUpeCq3)

### ***Answer 1:***
Let $P(X=Head) = k$. <br> <br>
$Likelihood(k) = k \cdot (1-k) \cdot k \cdot (1-k) \cdot k = (1-k)^2 k^3$<br> <br>
***Maximizing the likelihood is equivalent to maximizing the log-likelihood, because the logarithm is a monotonically increasing transformation.*** <br> <br>
$\log(Likelihood(k)) = \log((1-k)^2 k^3) = 2\log(1-k) + 3\log(k)$ <br> <br>
***At the maximum, the derivative has to be 0.*** <br><br>
$\frac{d}{dk}\log(Likelihood(k)) = -\frac{2}{1-k} + \frac{3}{k} = 0 \;\;\; so \;\;\; k_{estimated} = \frac{3}{5}$ <br><br>
$MLE(k) = \frac{3}{5}$
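
The closed-form result can be cross-checked numerically. The sketch below assumes the observed sequence is H, T, H, T, H (3 heads, 2 tails), which is what the likelihood above implies; the names `log_likelihood`, `grid` and `k_hat` are illustrative only.

```python
import math

# Counts implied by the likelihood k(1-k)k(1-k)k: 3 heads and 2 tails.
heads, tails = 3, 2

def log_likelihood(k):
    """Log-likelihood of a coin with P(Head) = k for the observed counts."""
    return heads * math.log(k) + tails * math.log(1 - k)

# Evaluate the log-likelihood on a fine grid over (0, 1) and keep the maximizer.
grid = [i / 1000 for i in range(1, 1000)]
k_hat = max(grid, key=log_likelihood)

print(k_hat)                     # grid maximizer
print(heads / (heads + tails))   # closed-form estimate 3/5
```

Both printed values should agree, since the grid contains the closed-form maximizer.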

<br> <br>

![Question2](https://drive.google.com/uc?id=1do2mG9tCEgCqbpbM8094AIM4FbFRRrcW)

### ***Answer 2:***
***For a normal distribution, the maximum likelihood estimate of the mean is the sample mean.*** <br> <br>
$MLE(µ) = \frac{1}{N}\sum_{i=1}^{N}X_i = \frac{1}{3}(2 + 3 + 5) = \frac{10}{3}$

<br> <br>

![Question3](https://drive.google.com/uc?id=178e3tRjTWbP724p3FnogAF44GKNsHSS4)

### ***Answer 3:***
$Likelihood(θ) = P(x=3).P(x=0).P(x=2).P(x=1).P(x=3).P(x=2).P(x=1).P(x=0).P(x=2).P(x=1)$<br> <br>
***Maximizing the likelihood is equivalent to maximizing the log-likelihood, because the logarithm is a monotonically increasing transformation.*** <br> <br>
$log(Likelihood(θ)) = \sum_{i=1}^{n}logP(X_i|θ) = 2(log\frac{2}{3}+logθ)+3(log\frac{1}{3} + logθ)+3(log\frac{2}{3}+log(1-θ))+2(log\frac{1}{3}+log(1-θ))$ <br><br>
$l(θ) = log(Likelihood(θ)) = 5logθ + 5log(1-θ) + C$ <br> <br>
$\frac{d}{dθ}log(Likelihood(θ)) = \frac{5}{θ} - \frac{5}{1-θ} = 0 \;\;\; so \;\;\; θ_{estimated} = 0.5$ <br> <br>
$MLE(θ) = 0.5$
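
As a numerical cross-check, the log-likelihood can be maximized on a grid. The probability table in `pmf` is an assumption inferred from the terms of the log-likelihood above (the question itself is only available as an image); the function names are illustrative.

```python
import math

# Sample as listed in the likelihood above.
sample = [3, 0, 2, 1, 3, 2, 1, 0, 2, 1]

def pmf(x, theta):
    # Assumed distribution, inferred from the log-likelihood terms:
    # P(0) = 2θ/3, P(1) = θ/3, P(2) = 2(1-θ)/3, P(3) = (1-θ)/3.
    table = {0: 2 * theta / 3, 1: theta / 3,
             2: 2 * (1 - theta) / 3, 3: (1 - theta) / 3}
    return table[x]

def log_likelihood(theta):
    return sum(math.log(pmf(x, theta)) for x in sample)

# Grid search over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_likelihood)
print(theta_hat)
```

Under that assumed table, the grid maximizer coincides with the analytical estimate.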

<br>

![Question4](https://drive.google.com/uc?id=10adBKY3XgtmLizPMpCDl696f_yHx3MWX)

### ***Answer 4:***
$f(x|θ) = 1/θ \;\;\; if \;\;\; 0≤x≤θ,\;\; otherwise\;\; 0$ <br> <br>
$Likelihood(θ) = \frac{1}{θ^n} \;\;\; if \;\;\; 0≤x_i≤θ \;\;\; for \;\;\; (i = 1,...,n), \;\;\; otherwise \;\; 0$ <br> <br>
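***The likelihood is 0 for any $θ$ smaller than the largest observation, and $\frac{1}{θ^n}$ is a decreasing function of $θ$, so the likelihood is maximized at the smallest admissible value: $MLE = θ_{estimated} = max(X_1, ... , X_n)$***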
<br>
<br>
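
A small sketch of the same argument in code, using a hypothetical sample (the values in `xs` and `candidates` are made up for illustration): the likelihood is zero below the sample maximum and shrinks above it.

```python
def likelihood(theta, xs):
    """Likelihood of Uniform(0, theta) for the sample xs."""
    if theta <= 0 or any(x < 0 or x > theta for x in xs):
        return 0.0
    return theta ** (-len(xs))

xs = [0.4, 1.7, 0.9]                    # hypothetical observations
candidates = [0.5, 1.0, 1.7, 2.0, 3.0]  # hypothetical values of theta to compare

best = max(candidates, key=lambda t: likelihood(t, xs))
print(best, max(xs))
```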

## ***Naive Bayes***

![Question5](https://drive.google.com/uc?id=1Aoja8E7COlHxae3oaBIokDhOGyW3-Zau)

### ***Answer 1:***
***Laplace smoothing is needed because the unsmoothed estimate of $P(A=16-25|C=y)$ is $0.0$, which would make the whole product zero.*** <br> <br>
$P(C=y) = \frac{5}{10}$ <br><br>
$P(C=n) = \frac{5}{10}$ <br><br> 
$P(S=m|C=y) = \frac{3+1}{5+4} = \frac{4}{9}$ <br><br> 
$P(S=m|C=n) = \frac{1+1}{5+4} = \frac{2}{9}$ <br><br>
$P(E=u|C=y) = \frac{2+1}{5+4} = \frac{3}{9}$ <br><br>
$P(E=u|C=n) = \frac{2+1}{5+4} = \frac{3}{9}$ <br><br>
$P(A=16-25|C=y) = \frac{0+1}{5+4} = \frac{1}{9}$ <br><br>
$P(A=16-25|C=n) = \frac{3+1}{5+4} = \frac{4}{9}$ <br><br>
$P(I=working|C=y) = \frac{1+1}{5+4} = \frac{2}{9}$ <br><br>
$P(I=working|C=n) = \frac{2+1}{5+4} = \frac{3}{9}$ <br> <br>
$P(C=y).P(S=m|C=y).P(E=u|C=y).P(A=16-25|C=y).P(I=w|C=y) = 0.0018$ <br><br>
$P(C=n).P(S=m|C=n).P(E=u|C=n).P(A=16-25|C=n).P(I=w|C=n) = 0.0055$ <br><br>

***$0.0055 > 0.0018$, so the predicted Credit Label = no.*** <br>
***As a result, the 23-year-old male university graduate working-class customer is predicted not to get credit.***
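
The two products can be reproduced exactly with rational arithmetic. This is a minimal sketch that copies the smoothed probabilities from the answer above; the dictionary layout and the helper `joint_score` are illustrative.

```python
from fractions import Fraction

# Priors and smoothed conditionals copied from the answer above,
# in the order S=m, E=u, A=16-25, I=working.
prior = {"y": Fraction(5, 10), "n": Fraction(5, 10)}
cond = {
    "y": [Fraction(4, 9), Fraction(3, 9), Fraction(1, 9), Fraction(2, 9)],
    "n": [Fraction(2, 9), Fraction(3, 9), Fraction(4, 9), Fraction(3, 9)],
}

def joint_score(label):
    """Prior times the product of the conditionals for one class."""
    score = prior[label]
    for p in cond[label]:
        score *= p
    return score

scores = {label: joint_score(label) for label in prior}
prediction = max(scores, key=scores.get)

for label, score in scores.items():
    print(label, score, round(float(score), 4))
print("predicted label:", prediction)
```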



# ***PART II -  Book Genre Classification with Naïve Bayes***

***Import Libraries***

In [None]:
# Import necessary libraries.
import math
import nltk
import numpy as np
import pandas as pd
from collections import defaultdict
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

***Read the Data***

In [None]:
# Upgrade gdown to avoid possible download errors.
!pip install --upgrade --no-cache-dir gdown

# Download the dataset from Google Drive to Google Colab.
!gdown --id 1ZbB40qw7BU24q3G6gEMGD5kGMlJg33es

# Read CSV file.
df = pd.read_csv("book_dataset_a2.csv", sep = "\t")

# Display the first 5 rows of the DataFrame.
df.head()

Unnamed: 0,title,author,description,coverImg,genre
0,The Hunger Games,Suzanne Collins,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,https://i.gr-assets.com/images/S/compressed.ph...,Young Adult
1,Harry Potter and the Order of the Phoenix,"J.K. Rowling, Mary GrandPré (Illustrator)",There is a door at the end of a silent corrido...,https://i.gr-assets.com/images/S/compressed.ph...,Fantasy
2,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,https://i.gr-assets.com/images/S/compressed.ph...,Classics
3,Pride and Prejudice,"Jane Austen, Anna Quindlen (Introduction)",Alternate cover edition of ISBN 9780679783268S...,https://i.gr-assets.com/images/S/compressed.ph...,Classics
4,Twilight,Stephenie Meyer,About three things I was absolutely positive.\...,https://i.gr-assets.com/images/S/compressed.ph...,Young Adult


***Preprocess the Data***

In [None]:
# Drop the title and author columns.
df = df.drop(['title', 'author'], axis = 1)

# Convert all characters to lowercase.
df['description'] = df['description'].str.lower()

# Replace punctuation with spaces.
df['description'] = df['description'].str.replace(r'[^\w\s]+', ' ', regex = True)

# Replace digits with spaces.
df['description'] = df['description'].str.replace(r'\d+', ' ', regex = True)

# Drop every remaining character that is not an ASCII letter or whitespace.
# This strips non-Latin scripts and accented letters; foreign words written in plain Latin letters remain.
df['description'] = df['description'].str.replace(r'[^a-zA-Z\s]+', '', regex = True)

# Keep the stop words in a frozenset so that membership tests are fast.
stop_words = frozenset(ENGLISH_STOP_WORDS)

# Remove the stop words from the descriptions, then save the cleaned descriptions to a new column.
df['cleaned_description'] = df['description'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

# Remove single characters that are surrounded by spaces.
df['description'] = df['description'].str.replace(r' \w ', ' ', regex = True)
df['cleaned_description'] = df['cleaned_description'].str.replace(r' \w ', ' ', regex = True)

# Collapse runs of spaces, tabs and newlines into a single space.
df['description'] = df['description'].str.replace(r'\s+', ' ', regex = True)
df['cleaned_description'] = df['cleaned_description'].str.replace(r'\s+', ' ', regex = True)

# Display the first 5 records of the DataFrame.
df.head()

Unnamed: 0,description,coverImg,genre,cleaned_description
0,winning means fame and fortune losing means ce...,https://i.gr-assets.com/images/S/compressed.ph...,Young Adult,winning means fame fortune losing means certai...
1,there is door at the end of silent corridor an...,https://i.gr-assets.com/images/S/compressed.ph...,Fantasy,door end silent corridor haunting harry pottte...
2,the unforgettable novel of childhood in sleepy...,https://i.gr-assets.com/images/S/compressed.ph...,Classics,unforgettable novel childhood sleepy southern ...
3,alternate cover edition of isbn since its imme...,https://i.gr-assets.com/images/S/compressed.ph...,Classics,alternate cover edition isbn immediate success...
4,about three things was absolutely positive fir...,https://i.gr-assets.com/images/S/compressed.ph...,Young Adult,things absolutely positive edward vampire seco...
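
One detail of the single-character filter is worth illustrating. The pattern used in the cell above matches a character only when it has a space on both sides, and regex matches cannot overlap, so a single character at the start of the text or directly after another removed character survives. The sketch below shows this on a made-up string and compares it with a word-boundary pattern; the alternative is a suggestion, not what the notebook ran.

```python
import re

text = "x a b c story"   # hypothetical input with four single-character tokens

# Pattern from the preprocessing cell: a single character between two spaces.
spaced = re.sub(r" \w ", " ", text)

# Alternative: a single character between word boundaries, then tidy the spaces.
bounded = re.sub(r"\b\w\b", " ", text)
bounded = re.sub(r"\s+", " ", bounded).strip()

print(repr(spaced))
print(repr(bounded))
```

With the space-delimited pattern, "x" and "b" are left in the text, while the word-boundary version removes all four single characters.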


### ***2.1 Understanding the Data***

In [None]:
# Group the genres, then choose History genre.
history_books = df.groupby(['genre']).get_group('History')

# Calculate the word frequencies in History genre.
history_books_df = history_books.cleaned_description.str.split(expand=True).stack().value_counts().reset_index()

# Arrange column names. 
history_books_df.columns = ['Word', 'Frequency'] 

# Display most frequent 3 words in History genre.
print("Most Frequent 3 Words in History Genre")
history_books_df.iloc[:3, :]

Most Frequent 3 Words in History Genre


Unnamed: 0,Word,Frequency
0,history,776
1,war,603
2,world,512


***Split the dataset into train and test sets.***

In [None]:
# Feature and label splitting.
X = df['description'].values   # We will use this for stopword experiments.
cleaned_X = df['cleaned_description'].values    # We will use this for no-stopword experiments.
y = df['genre'].values    

# Encode the labels.
encoder = LabelEncoder()
encoded_y = encoder.fit_transform(y)

# Stratified train (80%) and test (20%) split with a fixed random seed.
X_train, X_test, y_train, y_test = train_test_split(X, encoded_y, test_size=0.2, random_state = 42, stratify = encoded_y)
cleaned_X_train, cleaned_X_test, cleaned_y_train, cleaned_y_test = train_test_split(cleaned_X, encoded_y, test_size=0.2, random_state = 42, stratify = encoded_y)

### ***2.2 Implementing Naive Bayes***

In [None]:
class NaiveBayesClassifier:
    
    # Initialize the Classifier.
    def __init__(self, train_X, train_Y, test_X, test_Y):
        self.tokenizer = nltk.tokenize.WhitespaceTokenizer()   # Splits text into tokens on whitespace.
        self.train_sentences = train_X
        self.train_labels = train_Y
        self.test_sentences = test_X
        self.test_labels = test_Y
        self.priors = {}
        self.label_items = {}
        self.word_counts = {}
        self.vocab = None
    
    # BoW Method.
    def bow(self, ngram, stopword):
        vectorizer = CountVectorizer(ngram_range = ngram, min_df = 3, stop_words = stopword)
        X = vectorizer.fit_transform(self.train_sentences)
        self.vocab = vectorizer.get_feature_names_out()
        X = X.toarray()
        # Calculate the frequencies of words in descriptions.
        for l in range(10):
            # Create a dict in dict for each genre type.
            self.word_counts[l] = defaultdict(lambda: 0)
        # Fill the dicts by word frequencies.
        for i in range(X.shape[0]):
            l = self.train_labels[i]
            for j in range(len(self.vocab)):
                self.word_counts[l][self.vocab[j]] += X[i][j]
    
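    # Laplace (add-one) smoothing.
    def laplace_smoothing(self, word, text_label):
        # Add 1 to the word's count in this class (the dividend).
        a = self.word_counts[text_label][word] + 1
        # Add the vocabulary size to the divisor. Note that label_items holds the number of
        # training documents in the class (set in fit), not the total word count of the class.
        b = self.label_items[text_label] + len(self.vocab)
        # Return the log of the smoothed ratio.
        return math.log(a/b)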
    
    # Fit the training set.
    def fit(self, x, y, labels):
        # Group data based on labels.
        grouped_data = {}
        for l in labels:
            grouped_data[l] = x[np.where(y == l)]
        # Arrange priors.
        for l, data in grouped_data.items():
            self.label_items[l] = len(data)
            self.priors[l] = math.log(self.label_items[l] / len(x))
    
    # Predict the genre.
    def predict(self, labels, x):
        result = []
        for text in x:
            # Start each label's score from its log prior.
            label_scores = {l: self.priors[l] for l in labels}
            # Unique whitespace-separated tokens, so each word counts once per text.
            words = set(self.tokenizer.tokenize(text))
            # Add the smoothed log probability of every in-vocabulary word to each label's score.
            for word in words:
                if word not in self.vocab: 
                    continue
                for l in labels:
                    label_scores[l] += self.laplace_smoothing(word, l)
            # Choose the label with the highest score.
            result.append(max(label_scores, key=label_scores.get))
        return result
    
    # Run Model, then calculate the accuracy.
    def calculate_accuracy(self):
        labels = [0,1,2,3,4,5,6,7,8,9]
        self.fit(self.train_sentences, self.train_labels, labels)
        pred = self.predict(labels, self.test_sentences)
        return accuracy_score(self.test_labels, pred)
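
In `laplace_smoothing` above, the divisor adds the vocabulary size to `label_items[text_label]`, which `fit` sets to the number of training documents in the class. The textbook multinomial form divides by the total number of word occurrences in the class instead, so that the smoothed values sum to one over the vocabulary. The following is a minimal sketch of that textbook form on a hypothetical toy corpus; it is not the implementation used for the reported results, and all names in it are illustrative.

```python
import math
from collections import Counter

# Hypothetical toy corpus of (tokens, label) pairs.
train = [
    (["magic", "king", "magic"], "Fantasy"),
    (["murder", "detective"], "Mystery"),
    (["king", "murder"], "Mystery"),
]

vocab = sorted({tok for toks, _ in train for tok in toks})

# Per-class word counts.
counts = {}
for toks, label in train:
    counts.setdefault(label, Counter()).update(toks)

def log_prob(word, label):
    """Add-one smoothed log P(word | label) with the class token total in the divisor."""
    total_tokens = sum(counts[label].values())
    return math.log((counts[label][word] + 1) / (total_tokens + len(vocab)))

# The smoothed probabilities of one class sum to 1 over the vocabulary.
print(sum(math.exp(log_prob(w, "Fantasy")) for w in vocab))
```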

In [None]:
# Configure Classifiers for Each Experiment. 
naive_bayes_stopword_uni = NaiveBayesClassifier(X_train, y_train, X_test, y_test)
naive_bayes_no_stopword_uni = NaiveBayesClassifier(cleaned_X_train, cleaned_y_train, cleaned_X_test, cleaned_y_test)
naive_bayes_stopword_bi = NaiveBayesClassifier(X_train, y_train, X_test, y_test)
naive_bayes_no_stopword_bi = NaiveBayesClassifier(cleaned_X_train, cleaned_y_train, cleaned_X_test, cleaned_y_test)

In [None]:
# Unigram with Stopwords Experiment.
naive_bayes_stopword_uni.bow((1,1), None)
stopword_uni = naive_bayes_stopword_uni.calculate_accuracy()

In [None]:
# Unigram without Stopwords Experiment
naive_bayes_no_stopword_uni.bow((1,1), 'english')
no_stopword_uni = naive_bayes_no_stopword_uni.calculate_accuracy()

In [None]:
# Bigram with Stopwords Experiment.
naive_bayes_stopword_bi.bow((2,2), None)
stopword_bi = naive_bayes_stopword_bi.calculate_accuracy()

In [None]:
# Bigram without Stopwords Experiment.
naive_bayes_no_stopword_bi.bow((2,2), 'english')
no_stopword_bi = naive_bayes_no_stopword_bi.calculate_accuracy()
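
Note on the bigram runs: `predict` splits each description into single whitespace-separated words, while `CountVectorizer` with `ngram_range = (2,2)` stores space-joined word pairs in `vocab`. For the two to be comparable, the test-time tokens would also have to be n-grams. A minimal sketch of such a tokenizer follows; the helper `ngrams` is hypothetical and only approximates `CountVectorizer`, which additionally applies its own token pattern and stop-word filtering.

```python
def ngrams(text, n):
    """Return the space-joined n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("dark magic returns", 1))
print(ngrams("dark magic returns", 2))
print(ngrams("dark", 2))
```

With `n = 1` the helper reduces to plain whitespace tokenization, so the same function could serve both the unigram and the bigram experiments.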

### ***2.3 Error Analysis***

#### Misclassifications in Naive Bayes classification can mainly come from 4 problems:
##### 1. Feature Independence
***We assume that all features are independent of each other, but in real-life data this assumption rarely holds.***
##### 2. Zero-Frequency Problem
***The frequency-based probability estimate is zero when a word never occurs together with a class label in the training data.***
##### 3. Data Scarcity
***Insufficient data makes the probability estimates unreliable, which leads to uncertain predictions by the classifier.***
##### 4. Poorly Preprocessed Data
***Real-world data is raw, so it has to be preprocessed before developing a Naive Bayes model.***

#### Analysis of the misclassifications based on these problems:
* ***The zero-frequency problem is prevented by applying Laplace smoothing.***
* ***The feature independence assumption may cause some of these misclassifications. Some books may belong to 2 or more categories at the same time, so their descriptions can contain words from multiple categories.***
* ***The genre types are not equally distributed, so data scarcity in the smaller genres may cause misclassifications.***
* ***I may not have cleaned and preprocessed the defects in the description data well enough, so the model doesn't fit well and some of the books are misclassified.***

### ***2.4 Model Analysis***

#### ***2.4.a  Analyzing effect of the words on prediction***

In [None]:
def show_presence_absence(genre_type):
    
    # Group the genres by specified genre.
    genre_type_books = df.groupby(['genre']).get_group(genre_type)
    
    # Take all descriptions of that genre.
    descriptions = genre_type_books.cleaned_description.values
    
    # Create a TF-IDF vectorizer to weight each word within the descriptions of this genre.
    vec = TfidfVectorizer()
    
    # Calculate weight of each word in the specified genre.
    tf_idf =  vec.fit_transform(descriptions)
    
    # Calculate top 10 Presence and top 10 Absence Words.
    pa_df = pd.DataFrame(tf_idf.toarray(), columns=vec.get_feature_names_out())
    sum_df = pa_df.sum().sort_values(ascending = False)
    print(f"Top 10 Presence Word of {genre_type}")
    display(sum_df[:10])
    print(f"\nTop 10 Absence Word of {genre_type}")
    display(sum_df[-10:])
    print()

In [None]:
# Show top 10 presence and absence words for each genre.
for genre in df.genre.unique():
    show_presence_absence(genre)

Top 10 Presence Word of Young Adult


life       58.282313
new        54.833681
love       48.075244
school     46.174400
world      44.724624
just       39.510522
girl       38.606984
year       37.814978
friends    36.046324
time       35.885485
dtype: float64


Top 10 Absence Word of Young Adult


frostbiterose       0.047297
academylissa        0.047297
kissrose            0.047297
stonehouse          0.044589
theworstbookever    0.044589
zandri              0.044589
boiled              0.044589
aaronpatterson      0.044589
mstersmith          0.044589
blogspot            0.044589
dtype: float64


Top 10 Presence Word of Fantasy


world     95.110394
new       76.753067
magic     69.048592
life      68.367681
time      62.165881
power     49.386563
king      48.944918
book      48.828420
series    47.763852
dark      47.474750
dtype: float64


Top 10 Absence Word of Fantasy


ripercussioni    0.042019
nei              0.042019
solcano          0.042019
mai              0.042019
porta            0.042019
solitudine       0.042019
eppure           0.042019
divora           0.042019
sollecita        0.042019
avere            0.042019
dtype: float64


Top 10 Presence Word of Classics


novel      16.551707
story      16.163125
life       15.137923
new        14.187331
love       13.164633
world      12.817884
young      11.592062
edition    11.344683
stories    11.160772
tale       10.762718
dtype: float64


Top 10 Absence Word of Classics


gale         0.027459
irons        0.027459
succdait     0.027459
donner       0.027459
rappel       0.027459
accepter     0.027459
populaire    0.027459
fort         0.027459
jusqu        0.027459
retourn      0.027459
dtype: float64


Top 10 Presence Word of Science Fiction


world      25.221669
earth      24.477492
planet     21.754202
new        20.598337
human      19.730130
life       18.512623
time       18.274109
war        16.717374
science    16.252629
space      15.841428
dtype: float64


Top 10 Absence Word of Science Fiction


galley           0.025254
unintentional    0.025254
astray           0.025254
evitable         0.025254
verse            0.025254
satisfaction     0.025254
robbie           0.025254
tercentenary     0.025254
runaround        0.025254
liar             0.025254
dtype: float64


Top 10 Presence Word of Fiction


life      97.400085
novel     86.619769
world     79.237466
love      78.531064
new       77.600116
story     77.086735
family    68.990492
man       60.956649
young     57.657292
time      51.478449
dtype: float64


Top 10 Absence Word of Fiction


uzis             0.030141
fiefdoms         0.030141
vulgarities      0.030141
loudmouths       0.030141
cockneyisms      0.030141
rollers          0.030141
flirtatiously    0.030141
ruminate         0.030141
insuring         0.030141
plagiarize       0.023797
dtype: float64


Top 10 Presence Word of Horror


new       13.516661
world     12.644026
horror    11.733001
house     11.451890
life      11.007154
story     10.418000
town      10.382720
man       10.359287
old        9.900066
dead       9.856169
dtype: float64


Top 10 Absence Word of Horror


macdonald       0.045304
ledge           0.045304
ladder          0.045304
battleground    0.045304
fll             0.045304
quitters        0.045304
hellish         0.045304
rung            0.045304
fw              0.045304
gallery         0.045304
dtype: float64


Top 10 Presence Word of Romance


love      72.277248
life      68.586490
new       48.624354
man       47.789836
just      42.137954
heart     38.180313
time      37.025938
woman     36.963350
family    34.950529
like      34.718493
dtype: float64


Top 10 Absence Word of Romance


endangering    0.02682
pledging       0.02682
unanimous      0.02682
thieu          0.02682
whistles       0.02682
obliging       0.02682
revelry        0.02682
gored          0.02682
waif           0.02682
paine          0.02682
dtype: float64


Top 10 Presence Word of Mystery


new          30.093188
murder       27.928576
killer       26.973581
life         24.037760
case         22.937338
mystery      22.863099
old          22.738117
man          22.265749
detective    21.763649
death        20.549863
dtype: float64


Top 10 Absence Word of Mystery


indebted         0.045595
rewards          0.045595
hefty            0.045595
rescuing         0.045595
conflagration    0.045595
tangent          0.045595
wooed            0.045595
valued           0.045595
psycho           0.045595
utilize          0.045595
dtype: float64


Top 10 Presence Word of History


history     23.059677
war         21.473389
world       17.615884
new         14.282164
book        13.328499
story       13.159550
american    12.334931
life        11.448896
years       10.210757
great        9.691610
dtype: float64


Top 10 Absence Word of History


cow              0.032862
giornale         0.032862
tojo             0.032862
togo             0.032862
choosing         0.032862
isolationists    0.032862
evoke            0.032862
dictatorial      0.032862
springfield      0.032862
consolidated     0.032862
dtype: float64


Top 10 Presence Word of Thriller


life        8.723201
new         8.059924
world       7.685948
man         7.427947
reacher     7.382393
killer      6.371205
thriller    6.075177
time        5.813557
woman       5.774562
secret      5.713321
dtype: float64


Top 10 Absence Word of Thriller


glowing        0.031804
luminous       0.031804
breather       0.031804
storylines     0.031804
religions      0.031804
instance       0.031804
charlemagne    0.031804
manic          0.031804
turners        0.031804
essentially    0.031804
dtype: float64




In [None]:
class NaiveBayesClassifierUpdated:
    
    # Initialize the Classifier.
    def __init__(self, train_X, train_Y, test_X, test_Y):
        self.tokenizer = nltk.tokenize.WhitespaceTokenizer()   # Splits text into tokens on whitespace.
        self.train_sentences = train_X
        self.train_labels = train_Y
        self.test_sentences = test_X
        self.test_labels = test_Y
        self.priors = {}
        self.label_items = {}
        self.word_counts = {}
        self.vocab = None
    
    # TF-IDF Method.
    def tf_idf(self, stopword):
        vectorizer = TfidfVectorizer(ngram_range = (1,1), stop_words = stopword)
        X = vectorizer.fit_transform(self.train_sentences)
        self.vocab = vectorizer.get_feature_names_out()
        X = X.toarray()
        # Accumulate the TF-IDF weight of each word per genre.
        for l in range(10):
            self.word_counts[l] = defaultdict(lambda: 0)
        for i in range(X.shape[0]):
            l = self.train_labels[i]
            for j in range(len(self.vocab)):
                self.word_counts[l][self.vocab[j]] += X[i][j]
    
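    # Laplace (add-one) smoothing.
    def laplace_smoothing(self, word, text_label):
        # Add 1 to the word's accumulated weight in this class (the dividend).
        a = self.word_counts[text_label][word] + 1
        # Add the vocabulary size to the divisor. Note that label_items holds the number of
        # training documents in the class (set in fit), not the total weight of the class.
        b = self.label_items[text_label] + len(self.vocab)
        # Return the log of the smoothed ratio.
        return math.log(a/b)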
    
    # Fit the training set.
    def fit(self, x, y, labels):
        # Group data based on labels.
        grouped_data = {}
        for l in labels:
            grouped_data[l] = x[np.where(y == l)]
        # Arrange priors.
        for l, data in grouped_data.items():
            self.label_items[l] = len(data)
            self.priors[l] = math.log(self.label_items[l] / len(x))
    
    # Predict the genre.
    def predict(self, labels, x):
        result = []
        for text in x:
            # Start each label's score from its log prior.
            label_scores = {l: self.priors[l] for l in labels}
            # Unique whitespace-separated tokens, so each word counts once per text.
            words = set(self.tokenizer.tokenize(text))
            # Add the smoothed log value of every in-vocabulary word to each label's score.
            for word in words:
                if word not in self.vocab: 
                    continue
                for l in labels:
                    label_scores[l] += self.laplace_smoothing(word, l)
            # Choose the label with the highest score.
            result.append(max(label_scores, key=label_scores.get))
        return result
    
    # Run Model, then calculate the accuracy.
    def calculate_accuracy(self):
        labels = [0,1,2,3,4,5,6,7,8,9]
        self.fit(self.train_sentences, self.train_labels, labels)
        pred = self.predict(labels, self.test_sentences)
        return accuracy_score(self.test_labels, pred)

In [None]:
tfidf_stopword_uni = NaiveBayesClassifierUpdated(X_train, y_train, X_test, y_test)
tfidf_stopword_uni.tf_idf(None)
stopword_tfidf = tfidf_stopword_uni.calculate_accuracy()

In [None]:
tfidf_no_stopword_uni = NaiveBayesClassifierUpdated(cleaned_X_train, cleaned_y_train, cleaned_X_test, cleaned_y_test)
tfidf_no_stopword_uni.tf_idf('english')
no_stopword_tfidf = tfidf_no_stopword_uni.calculate_accuracy()

#### ***2.4.b Stop Words***

In [None]:
# Group the genres, then choose Fantasy genre.
fantasy_books = df.groupby(['genre']).get_group('Fantasy')

# Calculate the word frequencies in Fantasy genre.
fantasy_books_df = fantasy_books.cleaned_description.str.split(expand=True).stack().value_counts().reset_index()

# Arrange column names. 
fantasy_books_df.columns = ['Word', 'Frequency'] 

# Display the 10 most frequent words in the Fantasy genre. 
print("10 non-stop words that most strongly predict that the Fantasy books")
fantasy_books_df.iloc[:10, :]

10 non-stop words that most strongly predict that the Fantasy books


Unnamed: 0,Word,Frequency
0,world,2779
1,new,1991
2,life,1744
3,magic,1544
4,time,1373
5,power,1010
6,old,972
7,book,971
8,dark,955
9,love,942


In [None]:
# Group the genres, then choose Mystery genre.
mystery_books = df.groupby(['genre']).get_group('Mystery')

# Calculate the word frequencies in Mystery genre.
mystery_books_df = mystery_books.cleaned_description.str.split(expand=True).stack().value_counts().reset_index()

# Arrange column names. 
mystery_books_df.columns = ['Word', 'Frequency'] 

# Display most frequent 10 words in Mystery genre. 
print("10 non-stop words that most strongly predict that the Mystery books")
mystery_books_df.iloc[:10, :]

10 non-stop words that most strongly predict that the Mystery books


Unnamed: 0,Word,Frequency
0,new,797
1,murder,632
2,life,581
3,killer,570
4,case,499
5,man,492
6,mystery,482
7,old,476
8,detective,441
9,family,419


#### ***2.4.c Analyzing effect of the stop words***


###### ***Removing stop words***
***The main idea behind removing stop words is that they "don't add any new information". When we remove them, the amount of data to process decreases, so the algorithm runs faster. Also, most stop words are general and frequent across the training records of every class, so in the BoW method they cannot help to distinguish between classes.***

###### ***Holding stop words***
***Stop words can change the polarity of a word (for example through negation), so problems like sentiment analysis are easily affected by them. When non-stop words gain their meaning together with stop words, we have to keep the stop words. However, keeping stop words needs more memory than removing them, so the algorithm becomes less memory-efficient.***

###### ***Interpretation over Problem***
***Stop words should have little effect on our problem, because the words that we need mostly don't depend on negativity or positivity. The experiments that include stop words will probably need more memory and time, but give an accuracy close to the experiments without stop words. As a result, dropping stop words from the data should be more efficient.*** 

### ***2.5 Calculation of Accuracy***

In [None]:
print(f"Accuracy of Unigram with Stopwords Experiment: %{100 * stopword_uni:.2f}")

Accuracy of Unigram with Stopwords Experiment: %47.13


In [None]:
print(f"Accuracy of Unigram without Stopwords Experiment: %{100 * no_stopword_uni:.2f}")

Accuracy of Unigram without Stopwords Experiment: %48.51


In [None]:
print(f"Accuracy of Bigram with Stopwords Experiment: %{100 * stopword_bi:.2f}")

Accuracy of Bigram with Stopwords Experiment: %23.03


In [None]:
print(f"Accuracy of Bigram without Stopwords Experiment: %{100 * no_stopword_bi:.2f}")

Accuracy of Bigram without Stopwords Experiment: %23.03


In [None]:
print(f"Accuracy of TF-IDF with Stopwords Experiment: %{100 * stopword_tfidf:.2f}")

Accuracy of TF-IDF with Stopwords Experiment: %44.05


In [None]:
print(f"Accuracy of TF-IDF without Stopwords Experiment: %{100 * no_stopword_tfidf:.2f}")

Accuracy of TF-IDF without Stopwords Experiment: %48.48


# ***REPORT***

### 1-) Overview of the Problem

#### In this problem, we develop a Naive Bayes algorithm from scratch with the BoW and TF-IDF methods for text classification. The data consists of description and book genre pairs, and we predict a book's genre from its description.

### 2-) Preprocessing the Data

#### In this part, I first drop the author and title columns, convert all descriptions to lowercase, remove punctuation and numbers, and remove the characters that are not ASCII letters, such as Arabic script. After that, I remove the stop words from the descriptions and add the cleaned descriptions to the dataframe as a new column, because we will evaluate our Naive Bayes model on descriptions both with and without stop words. After creating the new column, I remove single characters, tabs and newlines from both the cleaned and the not-cleaned description columns. 

### 3-) Understanding the Data Part

#### In this part, I group the dataframe by genre type, then choose the History genre. I count the word frequencies for the History genre and show the 3 words with the highest frequencies. According to the results, the History genre can be guessed by using the words "history", "war" and "world". This suggests that each genre has frequent words that help to describe it.

* #### Split the Dataset
    * #### In this part I split the features and labels. I use two feature sets because I will test my algorithm on descriptions both with and without stop words. I split my dataset into 80% train and 20% test with a stratified random split and a fixed random seed.

### 4-) Implementing Naive Bayes Part

#### In this part, I implement my own Naive Bayes class based on the BoW method. The class contains the initializer, bow, laplace_smoothing, fit, predict and calculate_accuracy methods. 

##### Interpretation of these methods:
* #### init(): initializes the dictionaries and variables that I will use in the class.
* #### bow(): builds the bag-of-words counts. I pass 3 parameters to the vectorizer. The first parameter is ngram_range, which defines the number of words in each term. The second parameter is min_df, which removes from the vocabulary the terms that occur in fewer documents than the threshold value. The last parameter is stop_words, which removes the stop words.
##### Why do I use min_df = 3?
##### The bigram experiment caused RAM issues. To prevent this, I had 2 options: take a sample or decrease the vocabulary size. If I took a sample, some of the genres would have too few occurrences in my train set, which would cause problems in the prediction step. So I chose to decrease the vocabulary size by using the min_df = 3 parameter, which drops the terms that occur in fewer than 3 documents.
* #### laplace_smoothing(): applies Laplace smoothing to prevent the zero-frequency problem, then calculates the log probability.
* #### fit(): calculates the class priors from the train set.
* #### predict(): predicts the genres of the books from the given descriptions.
* #### calculate_accuracy(): runs the model, then calculates the accuracy of the model.
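
The memory pressure described above comes largely from holding a dense document-term matrix. As a hedged alternative, per-class counts can be accumulated directly in dictionaries, with a document-frequency threshold playing the role of `min_df`. The sketch below uses plain whitespace tokens and hypothetical data; `count_terms` is an illustrative helper, not part of the notebook.

```python
from collections import Counter, defaultdict

def count_terms(docs, labels, min_df=3):
    """Count terms per class, keeping only terms found in at least min_df documents."""
    # Document frequency: each term counts once per document.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))
    vocab = {term for term, df in doc_freq.items() if df >= min_df}

    # Per-class term counts, restricted to the pruned vocabulary.
    per_class = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        per_class[label].update(t for t in doc.split() if t in vocab)
    return vocab, per_class

# Hypothetical mini corpus.
docs = ["war history war", "history world", "war world history", "magic king"]
labels = ["History", "History", "History", "Fantasy"]

vocab, per_class = count_terms(docs, labels, min_df=2)
print(sorted(vocab))
print(dict(per_class["History"]))
```

Only the non-zero counts are stored, so memory grows with the number of distinct terms per class rather than with documents times vocabulary size.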

#### After implementing Naive Bayes, I create 4 classifiers to use in the experiments. I tested my Naive Bayes with the BoW method on:
* #### Stopwords Included - Unigram
* #### Stopwords Not Included - Unigram
* #### Stopwords Included - Bigram
* #### Stopwords Not Included - Bigram

### 5-) Error Analysis Part

#### I analyzed the reasons for the misclassifications based on known Naive Bayes issues and interpreted them under the related section.

### 6-) Model Analysis Part

* #### Analyzing effect of the words on prediction
    * #### I show the top 10 presence and top 10 absence words of each genre type.
    * #### I reimplement Naive Bayes based on the TF-IDF methodology. I use TfidfVectorizer, which is equivalent to CountVectorizer followed by TfidfTransformer. After the new implementation, I create 2 classifiers to test my Naive Bayes with the TF-IDF method on:
        * #### Stopwords Included
        * #### Stopwords Not Included
    * #### For the presence or absence of a word, we consider the weight of the word within a genre type. The weight depends on the word's occurrences among all words: the words with the highest summed weights are the presence words, and the words with the lowest summed weights are the absence words. Presence words are more important because they are often characteristic of the specified genre type and help to separate the genre from the others. Absence words are very rare tokens, so, like stop words, we can drop them because they carry almost no weight while classifying the book genre.

* #### Stop Words and Analyzing Effect of the Stopwords
    * #### I show the top 10 non-stop words for the Fantasy and Mystery genres, then analyze the effects of the stop words in the related part.

### 7-) Result Analysis

![Question1](https://drive.google.com/uc?id=17489OWskvEO2pqcl2wo4LHw1bFNoQgc_)

1. #### Comment on BoW Method - Unigram Experiments

#### As expected, removing stop words increases the accuracy (from 47.13% to 48.51%), because we take the descriptions word by word and the result is affected only by the individual words.

2. #### Comment on BoW Method - Bigram Experiments

#### The result is not affected by removing stop words, and the accuracy is much lower than in the unigram experiments: both bigram experiments give exactly 23.03%. One reason is visible in the code: predict() splits each test description into single words, while the bigram vocabulary contains only word pairs, so no test word is found in the vocabulary and every prediction is decided by the class priors alone. Other possible reasons are:
* #### Data Preprocessing Step: I may not have preprocessed the data correctly for the bigram experiment. In the bigram experiment the terms are taken as 2-word pairs such as "life long", and my preprocessing steps may not be suitable for this range.
* #### Splitting the Data: The genre types are not equally distributed, so even with a stratified split the smaller genres contribute few bigrams to the train data, and books of those genres can be misclassified. That situation can easily decrease the accuracy.
* #### Meaningless Pairs: Taking 2 words as a pair may produce meaningless combinations like "flower hell", and if these pairs do not occur frequently enough, the model can't be fit well, so the accuracy decreases.

***Our results show no difference between the experiments with and without stop words. Since the word pairs in the vocabulary never match the single-word tokens used at prediction time, both experiments fall back to the same class priors, which would explain the identical accuracy.***
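
A score built only from the class priors always selects the most frequent training label, so its accuracy equals the share of that label in the test set. Comparing a model with this majority-class baseline is a quick way to check whether its features contribute anything. A minimal sketch with hypothetical labels (the helper name and the label lists are made up for illustration):

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return hits / len(test_labels)

# Hypothetical encoded labels.
train_labels = [0, 0, 0, 1, 1, 2]
test_labels = [0, 1, 0, 2]

baseline = majority_baseline_accuracy(train_labels, test_labels)
print(f"Majority-class baseline accuracy: {100 * baseline:.2f}%")
```

If a classifier's accuracy matches this baseline exactly, its predictions are probably not using the features at all.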

3. #### Comment on TF-IDF Method Experiments

#### The results changed as expected: removing stop words increased the accuracy (from 44.05% to 48.48%). As shown in the related part, some words have large weights and some have small weights, and TF-IDF works based on this principle. It gives an accuracy close to the BoW method, which shows that TF-IDF does not improve our model over the BoW method. This can depend on the preprocessing and the data splitting, so we can't generalize that TF-IDF does not improve Naive Bayes. The finding holds only for my implementation and methodology steps.  

### 8-) Conclusion
#### In this project I implemented Naive Bayes with the BoW and TF-IDF methods to predict book genres from the descriptions of the books. I ran 6 experiments, which are 
* (BoW, Stopword, Unigram)
* (BoW, No Stopword, Unigram)
* (BoW, Stopword, Bigram)
* (BoW, No Stopword, Bigram)
* (TF-IDF, Stopword)
* (TF-IDF, No Stopword).

#### Based on these experiments, with my preprocessing, data splitting and Naive Bayes implementation, the model works best on the unigram range without stop words. That shows Naive Bayes can be a convenient classifier for text classification. 