## General Overview of Problem and Structure of Our Program

### Overview

Our task is to determine which of five categories (sport, business, tech, politics, entertainment) a news article belongs to, using a naive Bayes classifier. We implement the classifier ourselves and verify its performance on the given dataset.
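
For reference, the decision rule applied by a naive Bayes text classifier can be written as below. Here $c$ ranges over the five categories and $w_1, \dots, w_n$ are the distinct words (or bigrams) of the article. The notation is ours, and this is only a sketch of the standard log-space formulation:

```latex
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long articles.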

### Structure of Our Program

We define all of our functions and imports in the first cell, so that any change only has to be made in one place. The functions are parameterized: for example, the "naive_bayes" function takes 'bigram' and 'hasStopWords' arguments. These let us reuse the same function in different settings: 'bigram' switches between the unigram and bigram models, and setting 'hasStopWords' to True removes English stop words. We hope this improves the readability of our code.

In [1]:
import numpy as np
import pandas as pd
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import math

def shuffle(nump_arr):
    # Fisher-Yates shuffle (in place): swap each row with a randomly chosen row
    # at or before it, which gives a uniformly random permutation of the rows.
    for i in range(nump_arr.shape[0] - 1, 0, -1):
        j = random.randint(0, i)
        nump_arr[[i, j]] = nump_arr[[j, i]]
    return nump_arr

def calculate_accuracy(y_test,y_pred):
    a = 0
    for ol in range(len(y_test)):
        if(y_test[ol] == y_pred[ol]):
            a += 1
            
    print("Accuracy is: %" + str(100*a/len(y_test)))

def split_category(x,y):
    
    sport = []
    business = []
    politics = []
    tech = []
    entertainment = []
    
    for i in range(len(y)):
        
        if(y[i] == "sport"):
            sport.append(x[i])
        elif(y[i] == "business"):
            business.append(x[i])
        elif(y[i] == "politics"):
            politics.append(x[i])
        elif(y[i] == "tech"):
            tech.append(x[i])
        else:
            entertainment.append(x[i])
            
    return (sport,business,politics,tech,entertainment)

def is_feasible(feature_names):
    sport_words = ["play","ball","match"]
    tech_words = ["web","computer","machine"]
    politics_words = ["government","election","campaign"]
    business_words = ["inflation","profit","business"]
    entertainment_words = ["dance","comedy","show"]
    
    print("How choosen 3 words for sports is a good measure:")
    for i in sport_words:
        ind = feature_names.index(i)
        
        print(i.upper() + " frequency in sport class: " + str(sport_frequency[ind]) + "------------>>>>>>>>>>>>")
        print(i + " frequency in business class: " + str(business_frequency[ind]))
        print(i + " frequency in tech class: " + str(tech_frequency[ind]))
        print(i + " frequency in politics class: " + str(politics_frequency[ind]))
        print(i + " frequency in entertainment class: " + str(entertainment_frequency[ind]))

    print("**************************************************")
    
    print()
    
    print("How choosen 3 words for tech is a good measure:")
    for i in tech_words:
        ind = feature_names.index(i)
        
        print(i.upper() + " frequency in tech class: " + str(tech_frequency[ind]) + "------------>>>>>>>>>>>>")
        print(i + " frequency in business class: " + str(business_frequency[ind]))
        print(i + " frequency in sport class: " + str(sport_frequency[ind]))
        print(i + " frequency in politics class: " + str(politics_frequency[ind]))
        print(i + " frequency in entertainment class: " + str(entertainment_frequency[ind]))
        
        
    print("**************************************************")
    
    print()
    
    print("How choosen 3 words for politics is a good measure:")
    for i in politics_words:
        ind = feature_names.index(i)
        
        print(i.upper() + " frequency in politics class: " + str(politics_frequency[ind]) + "------------>>>>>>>>>>>>")
        print(i + " frequency in business class: " + str(business_frequency[ind]))
        print(i + " frequency in sport class: " + str(sport_frequency[ind]))
        print(i + " frequency in tech class: " + str(tech_frequency[ind]))
        print(i + " frequency in entertainment class: " + str(entertainment_frequency[ind]))
        
        
    print("**************************************************")
    
    print()
    
    print("How choosen 3 words for business is a good measure:")
    for i in business_words:
        ind = feature_names.index(i)
        
        print(i.upper() + " frequency in business class: " + str(business_frequency[ind]) + "------------>>>>>>>>>>>>")
        print(i + " frequency in politics class: " + str(politics_frequency[ind]))
        print(i + " frequency in sport class: " + str(sport_frequency[ind]))
        print(i + " frequency in tech class: " + str(tech_frequency[ind]))
        print(i + " frequency in entertainment class: " + str(entertainment_frequency[ind]))
        
        
    print("**************************************************")
    
    print()
    
    print("How choosen 3 words for entertainment is a good measure:")
    for i in entertainment_words:
        ind = feature_names.index(i)
        
        print(i.upper() + " frequency in entertainment class: " + str(entertainment_frequency[ind]) + "------------>>>>>>>>>>>>")
        print(i + " frequency in politics class: " + str(politics_frequency[ind]))
        print(i + " frequency in sport class: " + str(sport_frequency[ind]))
        print(i + " frequency in tech class: " + str(tech_frequency[ind]))
        print(i + " frequency in business class: " + str(business_frequency[ind]))
        
        
def naive_bayes(x_train,y_train,x_test,y_test,bigram = False,hasStopWords=False):
    
    stopWords = []
    
    if(hasStopWords):
        # Convert the frozenset to a list, the type documented for stop_words
        stopWords = list(ENGLISH_STOP_WORDS)
    
    p_sport = math.log(y_train.tolist().count("sport") / len(y_train))
    p_business = math.log(y_train.tolist().count("business") / len(y_train))
    p_tech = math.log(y_train.tolist().count("tech") / len(y_train))
    p_politics = math.log(y_train.tolist().count("politics") / len(y_train))
    p_entertainment = math.log(y_train.tolist().count("entertainment") / len(y_train))

    sport_word_len=0
    business_word_len=0
    tech_word_len=0
    politics_word_len=0
    entertainment_word_len=0

    for i in range(len(y_train)):
        for j in x_train[i].values():
            if(j != 0):
                if(y_train[i] == "sport"):
                    sport_word_len += 1
                elif(y_train[i] == "business"):
                    business_word_len += 1
                
                elif(y_train[i] == "tech"):
                    tech_word_len += 1
                
                elif(y_train[i] == "politics"):
                    politics_word_len += 1
                
                elif(y_train[i] == "entertainment"):
                    entertainment_word_len += 1
    
    y_pred = []
    
    for i in x_test:
        
        
        
        # Tokenize the test document into unigrams or bigrams (build the vectorizer only once)
        if(bigram):
            vectorizer = CountVectorizer(analyzer='word' ,ngram_range=(2, 2),stop_words=stopWords)
        else:
            vectorizer = CountVectorizer(stop_words=stopWords)
        X = vectorizer.fit_transform([i])
        feature_names = vectorizer.get_feature_names_out()
        
        
        sport_prob = p_sport
        business_prob = p_business
        tech_prob = p_tech
        politics_prob = p_politics
        entertainment_prob = p_entertainment
        
        for word in feature_names:
            
            sport_p = 0
            business_p = 0
            tech_p = 0
            politics_p = 0
            entertainment_p = 0
            
            for j in range(len(y_train)):
                if(y_train[j] == "sport"):
                    sport_p += x_train[j].get(word,0)
                    
                elif(y_train[j] == "business"):
                    business_p += x_train[j].get(word,0)
                    
                    
                elif(y_train[j] == "tech"):
                    tech_p += x_train[j].get(word,0)
                    
                    
                elif(y_train[j] == "politics"):
                    politics_p += x_train[j].get(word,0)
                    
                  
                else:
                    entertainment_p += x_train[j].get(word,0)
            
            
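            #Laplace-style smoothing, applied only to words unseen in a class:
            #they get 1 / (number of class documents + 24746), where 24746 is a hard-coded constant.
            #Words that were seen keep their unsmoothed relative frequency.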
            if(sport_p == 0):
                sport_p = 1 / (y_train.tolist().count("sport") + 24746)
            else:
                
                sport_p /= sport_word_len
                
                
            if(business_p == 0):
                business_p = 1 / (y_train.tolist().count("business") + 24746)
            else:
                
                business_p /= business_word_len
                
                
            if(tech_p == 0):
                tech_p = 1 / (y_train.tolist().count("tech") + 24746)
            else:
                tech_p /= tech_word_len
                
            if(politics_p == 0):
                politics_p = 1 / (y_train.tolist().count("politics") + 24746)
            else:
                politics_p /= politics_word_len
                
            if(entertainment_p == 0):
                entertainment_p = 1 / (y_train.tolist().count("entertainment") + 24746)
            else:
                entertainment_p /= entertainment_word_len
            
            
            
            sport_prob += math.log(sport_p)
            business_prob += math.log(business_p)
            tech_prob += math.log(tech_p)
            politics_prob += math.log(politics_p)
            entertainment_prob += math.log(entertainment_p)
            
        
            
        
        pred = ""
        max_probability = max([sport_prob,business_prob,tech_prob,politics_prob,entertainment_prob])
        
        indx = [sport_prob,business_prob,tech_prob,politics_prob,entertainment_prob].index(max_probability)
        
        if(indx == 0):
            pred = "sport"
        elif(indx == 1):
            pred = "business"
        elif(indx == 2):
            pred = "tech"
        elif(indx == 3):
            pred = "politics"
        else:
            pred = "entertainment"
            
        y_pred.append(pred)
        
    return y_pred

def find_top_ten_words(class_name,ascendingg = False,bigram = False,hasStopWords=False):
    
    stopWords = []
    
    if(hasStopWords):
        # Convert the frozenset to a list, the type documented for stop_words
        stopWords = list(ENGLISH_STOP_WORDS)
    
    dataset = np.copy(df)

    x = dataset[:,1]
    y = dataset[:,2]

    categories = split_category(x,y)

    sport_category = categories[0]
    business_category = categories[1]
    politics_category = categories[2]
    tech_category = categories[3]
    entertainment_category = categories[4]
    
    category = []
    
    if(class_name == "sport"):
        category = sport_category
    elif(class_name == "business"):
        category = business_category
    elif(class_name == "politics"):
        category = politics_category
    elif(class_name == "tech"):
        category = tech_category
    else:
        category = entertainment_category
    
    
    
    countVectorizer = CountVectorizer(stop_words=stopWords)
    
    if(bigram):
        countVectorizer = CountVectorizer(analyzer='word',ngram_range=(2,2),stop_words=stopWords)
    
    wordCount = countVectorizer.fit_transform(category)
    arr = wordCount.toarray()
    feature_names = countVectorizer.get_feature_names_out()
    
    dct = {}
    
    for i in countVectorizer.get_feature_names_out():
        dct[i] = 0
        
    for j in range(len(arr)):
        for l in range(len(arr[j])):
            dct[feature_names[l]] += arr[j][l]
        
    
    
        
    df2 = pd.DataFrame(list(dct.items()))

    df2 = df2.sort_values(1, ascending=ascendingg)
    return df2.head(10)

def narrow_the_dict(ascendingg = False,bigram=False,word_size=1000):
    
    dataset = np.copy(df)

    x = dataset[:,1]
    y = dataset[:,2]

    categories = split_category(x,y)

    sport_category = categories[0]
    business_category = categories[1]
    politics_category = categories[2]
    tech_category = categories[3]
    entertainment_category = categories[4]
    
    category = x.tolist()
    
    
    
    tfIdfTransformer = TfidfTransformer(smooth_idf=True,use_idf=True)
    # Convert the frozenset to a list, the type documented for stop_words
    countVectorizer = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
    
    if(bigram):
        countVectorizer = CountVectorizer(analyzer='word',ngram_range=(2,2),stop_words=list(ENGLISH_STOP_WORDS))
    
    wordCount = countVectorizer.fit_transform(category)
    newTfIdf = tfIdfTransformer.fit_transform(wordCount)
    
    newTfIdf = newTfIdf.toarray()
    dct = {}

    for i in countVectorizer.get_feature_names_out():
        dct[i] = 0
    
    for x in range(len(category)):
        dfa = pd.DataFrame(newTfIdf[x], index=countVectorizer.get_feature_names_out(),columns=["TF-IDF"])
        dfa = dfa.sort_values('TF-IDF',ascending=ascendingg)
    
        if(dct[dfa.head(1).index[0]] < dfa.head(1).values[0][0]):
            dct[dfa.head(1).index[0]] = dfa.head(1).values[0][0]
        
    df2 = pd.DataFrame(list(dct.items()))

    df2 = df2.sort_values(1, ascending=False)
    
    return list(df2.head(word_size).to_dict()[0].values())

In [2]:
df = pd.read_csv('English Dataset.csv')
df = df.to_numpy()
categorizable = np.copy(df)

In [3]:
x = df[:,1]
y = df[:,2]

x_cat = categorizable[:,1]
y_cat = categorizable[:,2]

In [4]:
vectorizer_cat = CountVectorizer()
X_cat = vectorizer_cat.fit_transform(x_cat)
feature_names = vectorizer_cat.get_feature_names_out()

In [5]:
x_cat = X_cat.toarray()

# Part 1

### To find 3 words for each category, we split the data by class name.

In [6]:
categories = split_category(x_cat,y_cat)

In [7]:
sport = np.array(categories[0])
business = np.array(categories[1])
politics = np.array(categories[2])
tech = np.array(categories[3])
entertainment = np.array(categories[4])



In [8]:
sport_transpose = sport.transpose()
business_transpose = business.transpose()
politics_transpose = politics.transpose()
tech_transpose = tech.transpose()
entertainment_transpose = entertainment.transpose()


In [9]:
#Total frequency of each word in each class: every row of a transposed matrix
#holds one word, so summing along axis 1 adds up its counts over the class documents
sport_frequency = sport_transpose.sum(axis=1).tolist()
business_frequency = business_transpose.sum(axis=1).tolist()
politics_frequency = politics_transpose.sum(axis=1).tolist()
tech_frequency = tech_transpose.sum(axis=1).tolist()
entertainment_frequency = entertainment_transpose.sum(axis=1).tolist()
    


In [10]:
feature_names = feature_names.tolist()

In [11]:
is_feasible(feature_names)

How choosen 3 words for sports is a good measure:
PLAY frequency in sport class: 193------------>>>>>>>>>>>>
play frequency in business class: 8
play frequency in tech class: 81
play frequency in politics class: 21
play frequency in entertainment class: 77
BALL frequency in sport class: 102------------>>>>>>>>>>>>
ball frequency in business class: 0
ball frequency in tech class: 5
ball frequency in politics class: 1
ball frequency in entertainment class: 3
MATCH frequency in sport class: 180------------>>>>>>>>>>>>
match frequency in business class: 3
match frequency in tech class: 4
match frequency in politics class: 6
match frequency in entertainment class: 3
**************************************************

How choosen 3 words for tech is a good measure:
WEB frequency in tech class: 150------------>>>>>>>>>>>>
web frequency in business class: 2
web frequency in sport class: 0
web frequency in politics class: 0
web frequency in entertainment class: 2
COMPUTER frequency in tech clas

### So we can say that our word choices are feasible for this dataset and we can proceed with naive Bayes classification.

#### For sport class: 'play','ball' and 'match'
#### For tech class: 'web','computer' and 'machine'
#### For politics class: 'government','election' and 'campaign'
#### For business class: 'inflation','profit' and 'business'
#### For entertainment class: 'dance','comedy' and 'show'
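
The feasibility check above boils down to asking in which class a candidate word occurs most often. A minimal standalone sketch of that idea follows. It uses plain whitespace tokenization and invented toy texts instead of the notebook's `CountVectorizer` matrices, and the helper names `class_word_counts` and `dominant_class` are hypothetical:

```python
from collections import Counter, defaultdict

def class_word_counts(texts, labels):
    """Return {label: Counter of lower-cased whitespace tokens}."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    return counts

def dominant_class(word, counts):
    """Class in which `word` occurs most often (ties go to the first class seen)."""
    return max(counts, key=lambda label: counts[label][word])

# Toy data, invented for illustration only.
texts = [
    "the match was a great match",
    "profit rose as inflation fell",
    "the ball hit the post",
]
labels = ["sport", "business", "sport"]

counts = class_word_counts(texts, labels)
print(dominant_class("match", counts))
print(dominant_class("profit", counts))
```

A word is a reasonable indicator for a class when its count there clearly exceeds its count in every other class, which is what the printed frequencies above show for the chosen words.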

# Part 2 Naive Bayes

We use an 80-20 train-test split. Naive Bayes needs a substantial amount of training data, but the test set must not be too small either if the measured accuracy is to be reliable. That is why we chose this proportion.
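
As an illustration of the split, here is a small sketch that shuffles and then cuts at 80%. The function name `split_80_20` and the fixed seed are assumptions made for the example. The row count of 1490 is also an assumption, chosen because it is consistent with the 1192-row training slice used in the cells below:

```python
import random

def split_80_20(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of `rows` and cut it into (train, test) parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded only to make the sketch reproducible
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Assumed dataset size of 1490 rows, represented here by their indices.
train, test = split_80_20(range(1490))
print(len(train), len(test))
```

Shuffling before cutting matters when the file is ordered by category, since otherwise the test part could contain only one or two classes.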

In this part we use our naive_bayes function to predict the categories of the test articles, then compare the predictions with the true test labels to calculate the accuracy.
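
For comparison with our implementation, the following is a compact sketch of multinomial naive Bayes with textbook add-one (Laplace) smoothing. It is not the code used in this notebook: it smooths every word count, not only the zero counts, and it uses whitespace tokens on invented toy documents. The names `train_nb` and `predict_nb` are hypothetical:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect per-class token counts, class document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counter in word_counts.values() for w in counter}
    return word_counts, class_counts, vocab

def predict_nb(doc, word_counts, class_counts, vocab, alpha=1.0):
    """Return the class with the highest smoothed log-posterior."""
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_docs)  # log prior
        for w in doc.lower().split():
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-alpha smoothing: every count is incremented, not only the zeros.
            score += math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data, invented for illustration only.
docs = ["the team won the match", "shares and profit rose", "the striker scored in the match"]
labels = ["sport", "business", "sport"]
model = train_nb(docs, labels)
print(predict_nb("profit rose again", *model))
print(predict_nb("the match", *model))
```

Smoothing all counts keeps the per-class probabilities summing to one over the vocabulary, which a zero-count-only fallback does not guarantee.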

## Unigram Version

In [12]:
df = shuffle(df)
x = df[:,1]
y = df[:,2]

x2 = np.copy(x)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(x)
feature_names = vectorizer.get_feature_names_out()

x_data = X.toarray()

x_train = x_data[:1192]
y_train = y[:1192]
x_test = x[1192:]
y_test = y[1192:]


In [13]:
test_keys = feature_names


In [14]:
#Creating our word dictionary
x_train = x_train.tolist()
for i in range(len(x_train)):
    test_values = x_train[i]
    
    res = {test_keys[j]: test_values[j] for j in range(len(test_keys))}
    
    x_train[i] = res
    



In [15]:
y_pred = naive_bayes(x_train,y_train,x_test,y_test)

In [16]:
calculate_accuracy(y_test,y_pred)

Accuracy is: %94.96644295302013


## Bigram Version

In [17]:
df = shuffle(df)
x = df[:,1]
y = df[:,2]

vectorizer2 = CountVectorizer(analyzer='word' ,ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(x)
feature_names = vectorizer2.get_feature_names_out()

x_data = X2.toarray()

x_train = x_data[:1192]
y_train = y[:1192]
x_test = x[1192:]
y_test = y[1192:]

In [18]:
test_keys = feature_names

In [19]:
#Creating our word dictionary
x_train = x_train.tolist()
for i in range(len(x_train)):
    test_values = x_train[i]
    
    res = {test_keys[j]: test_values[j] for j in range(len(test_keys))}
    
    x_train[i] = res        

In [20]:
y_pred = naive_bayes(x_train,y_train,x_test,y_test,bigram=True)
calculate_accuracy(y_test,y_pred)

Accuracy is: %62.75167785234899


#### As can be seen, the accuracies of the unigram and bigram models differ significantly (unigram is better than bigram). This may be caused by stop words and by the size of the bigram dictionary.
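
One way to see why dictionary size can hurt the bigram model: a bigram in a test article is much less likely to have appeared in training than a single word. The sketch below measures that on invented toy sentences. The helpers `ngrams` and `coverage` are hypothetical names, and this is only one plausible contributor to the gap, not a measurement on our dataset:

```python
def ngrams(text, n):
    """Lower-cased whitespace n-grams of `text`, joined with spaces."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coverage(train_docs, test_doc, n):
    """Fraction of the test document's n-grams that also occur in training."""
    seen = {g for doc in train_docs for g in ngrams(doc, n)}
    test_grams = ngrams(test_doc, n)
    return sum(g in seen for g in test_grams) / len(test_grams)

# Toy data, invented for illustration only.
train_docs = ["the market rose sharply", "the team won the match"]
test_doc = "the market won"

print(coverage(train_docs, test_doc, 1))  # share of known unigrams
print(coverage(train_docs, test_doc, 2))  # share of known bigrams
```

Every test n-gram that was never seen in training falls back to the same small smoothed probability in every class, so it contributes little to telling the classes apart.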

# Part 3

In this part we use our find_top_ten_words function to get the 10 words that most strongly predict each class. The function ranks words by their total count within the class.

## Unigram

### 10 words for sport class whose presence most strongly predicts:

In [21]:
find_top_ten_words("sport")

Unnamed: 0,0,1
7978,the,6620
8076,to,3189
593,and,2532
4082,in,2510
5545,of,1826
3261,for,1127
3772,he,1105
5589,on,1014
1349,but,992
4264,is,985


### 10 words for business class whose presence most strongly predicts:

In [22]:
find_top_ten_words("business")

Unnamed: 0,0,1
8812,the,7133
8904,to,3306
6143,of,2864
4669,in,2821
1047,and,2161
7681,said,1100
4908,is,1072
8808,that,1052
3926,for,1045
4929,it,1011


### 10 words for tech class whose presence most strongly predicts:

In [23]:
find_top_ten_words("tech")

Unnamed: 0,0,1
8834,the,7498
8945,to,4149
6054,of,3425
679,and,3017
4534,in,2316
8833,that,1676
4831,is,1597
4845,it,1465
3656,for,1299
6092,on,1111


### 10 words for politics class whose presence most strongly predicts:

In [24]:
find_top_ten_words("politics")

Unnamed: 0,0,1
8226,the,7957
8318,to,3913
5666,of,2840
687,and,2559
4269,in,2159
7183,said,1445
3949,he,1410
3486,for,1237
8223,that,1195
5702,on,1186


### 10 words for entertainment class whose presence most strongly predicts:

In [25]:
find_top_ten_words("entertainment")

Unnamed: 0,0,1
8822,the,5822
676,and,2130
8932,to,2108
6150,of,2048
4608,in,1996
3706,for,1097
6191,on,881
9503,was,818
4807,it,738
4794,is,684


### 10 words for sport class whose absence most strongly predicts:

In [26]:
find_top_ten_words("sport",True)

Unnamed: 0,0,1
0,00,1
4681,lena,1
4680,leisurely,1
4673,legions,1
4671,legality,1
4665,ledley,1
4663,lecturer,1
4662,lebeouf,1
4651,leander,1
4643,leaderless,1


### 10 words for business class whose absence most strongly predicts:

In [27]:
find_top_ten_words("business",True)

Unnamed: 0,0,1
4878,invests,1
6770,preference,1
6762,predecessor,1
6761,precursor,1
6760,precious,1
6759,preceding,1
3149,doughnuts,1
6758,prebon,1
6756,prayers,1
6755,pratt,1


### 10 words for tech class whose absence most strongly predicts:

In [28]:
find_top_ten_words("tech",True)

Unnamed: 0,0,1
0,00,1
4999,kilometres,1
4998,kilobit,1
4997,killzone,1
4996,killing,1
4991,kicks,1
4990,kick,1
4989,khan,1
5008,kitts,1
4987,keyword,1


### 10 words for politics class whose absence most strongly predicts:

In [29]:
find_top_ten_words("politics",True)

Unnamed: 0,0,1
9122,zone,1
6133,plus,1
6130,ploy,1
6129,plough,1
2786,dragged,1
6126,plight,1
2788,dragon,1
2790,drastically,1
6119,please,1
2792,drawing,1


### 10 words for entertainment class whose absence most strongly predicts:

In [30]:
find_top_ten_words("entertainment",True)

Unnamed: 0,0,1
9806,zutons,1
3120,embark,1
8079,skeletons,1
3122,embarrassed,1
8078,skeleton,1
3124,embedded,1
3125,embrace,1
3126,embraced,1
3127,embroiled,1
3130,emergency,1


## Bigram

### 10 words for sport class whose presence most strongly predicts:

In [31]:
find_top_ten_words("sport",bigram=True)

Unnamed: 0,0,1
26180,in the,867
35807,of the,531
5736,at the,316
19211,for the,303
36493,on the,249
53003,to the,236
52292,to be,191
49593,the first,173
27833,it was,155
23451,he said,148


### 10 words for business class whose presence most strongly predicts:

In [32]:
find_top_ten_words("business",bigram=True)

Unnamed: 0,0,1
38274,of the,658
27897,in the,632
56070,to the,260
22130,for the,257
53871,the us,244
39000,on the,227
52080,that the,214
46529,said the,182
5873,and the,179
52529,the company,155


### 10 words for tech class whose presence most strongly predicts:

In [33]:
find_top_ten_words("tech",bigram=True)

Unnamed: 0,0,1
37190,of the,773
26185,in the,610
56069,to be,297
37933,on the,283
56924,to the,281
24123,he said,250
61968,will be,245
20460,for the,217
28372,it is,211
50355,such as,189


### 10 words for politics class whose presence most strongly predicts:

In [34]:
find_top_ten_words("politics",bigram=True)

Unnamed: 0,0,1
35721,of the,679
25767,in the,576
53590,to the,378
23243,he said,323
43900,said the,310
52793,to be,299
36469,on the,278
50206,the government,259
19967,for the,252
4286,and the,228


### 10 words for entertainment class whose presence most strongly predicts:

In [35]:
find_top_ten_words("entertainment",bigram=True)

Unnamed: 0,0,1
31027,of the,598
22857,in the,546
5768,at the,256
17726,for the,221
45758,to the,205
31597,on the,196
45169,to be,195
4395,and the,168
49684,will be,135
42842,the film,129


### 10 words for sport class whose absence most strongly predicts:

In [36]:
find_top_ten_words("sport",True,True)

Unnamed: 0,0,1
0,00 and,1
36840,open mirza,1
36841,open now,1
36843,open one,1
36844,open ousting,1
36845,open qualifying,1
36846,open quarter,1
36847,open reached,1
36848,open saw,1
36849,open semi,1


### 10 words for business class whose absence most strongly predicts:

In [37]:
find_top_ten_words("business",True,True)

Unnamed: 0,0,1
0,000 12,1
38788,on business,1
38789,on buying,1
38790,on campaigning,1
38792,on case,1
38793,on certain,1
38794,on charges,1
38795,on china,1
38796,on christmas,1
38797,on citigroup,1


### 10 words for tech class whose absence most strongly predicts:

In [38]:
find_top_ten_words("tech",True,True)

Unnamed: 0,0,1
0,00 per,1
38261,online world,1
38262,only 0051,1
38263,only 10,1
38264,only 11,1
38266,only 15,1
38268,only 25,1
38269,only 273,1
38270,only 30,1
38271,only 36,1


### 10 words for politics class whose absence most strongly predicts:

In [39]:
find_top_ten_words("politics",True,True)

Unnamed: 0,0,1
0,00 for,1
36670,only documents,1
36671,only doing,1
36672,only english,1
36673,only few,1
36674,only five,1
36677,only fresh,1
36678,only from,1
36679,only going,1
36680,only got,1


### 10 words for entertainment class whose absence most strongly predicts:

In [40]:
find_top_ten_words("entertainment",True,True)

Unnamed: 0,0,1
0,000 267,1
31786,only handful,1
31787,only have,1
31788,only if,1
31790,only it,1
31791,only just,1
31792,only matter,1
31793,only mcdonald,1
31794,only on,1
31796,only other,1


# By Excluding Stop Words

## Unigram

### 10 words for sport class whose presence most strongly predicts:

In [41]:
find_top_ten_words("sport",hasStopWords=True)

Unnamed: 0,0,1
6707,said,636
3314,game,356
8563,year,331
2722,england,329
7842,time,296
8454,win,295
8513,world,269
5801,players,209
2053,cup,206
7713,team,205


### 10 words for business class whose presence most strongly predicts:

In [42]:
find_top_ten_words("business",hasStopWords=True)

Unnamed: 0,0,1
7505,said,1100
9454,year,456
5754,mr,393
5438,market,284
5862,new,273
3767,firm,261
4172,growth,257
2244,company,253
3215,economy,233
4114,government,215


### 10 words for tech class whose presence most strongly predicts:

In [43]:
find_top_ten_words("tech",hasStopWords=True)

Unnamed: 0,0,1
7445,said,1064
6223,people,647
5759,new,349
5640,mr,349
5546,mobile,343
8541,technology,303
9094,users,268
7924,software,265
9087,use,260
5738,net,256


### 10 words for politics class whose presence most strongly predicts:

In [44]:
find_top_ten_words("politics",hasStopWords=True)

Unnamed: 0,0,1
7006,said,1445
5295,mr,1073
4632,labour,494
3661,government,464
2859,election,424
1161,blair,395
5752,party,376
5821,people,372
5177,minister,286
5405,new,280


### 10 words for entertainment class whose presence most strongly predicts:

In [45]:
find_top_ten_words("entertainment",hasStopWords=True)

Unnamed: 0,0,1
7474,said,594
3496,film,583
1161,best,430
9493,year,315
5757,music,255
5856,new,234
945,awards,184
8930,uk,171
451,actor,169
5955,number,165


## Bigram

### 10 words for sport class whose presence most strongly predicts:

In [46]:
find_top_ten_words("sport",bigram=True,hasStopWords=True)

Unnamed: 0,0,1
46977,year old,125
28268,new zealand,68
46642,world cup,65
18038,grand slam,58
7045,champions league,57
42461,told bbc,53
3455,australian open,52
25636,manchester united,46
10406,davis cup,42
36890,second half,42


### 10 words for business class whose presence most strongly predicts:

In [47]:
find_top_ten_words("business",bigram=True,hasStopWords=True)

Unnamed: 0,0,1
9578,chief executive,81
16068,economic growth,53
44948,stock market,45
4662,analysts said,44
31226,mr glazer,40
31967,new york,38
32792,oil prices,35
31217,mr ebbers,35
11683,consumer spending,32
44945,stock exchange,32


### 10 words for tech class whose presence most strongly predicts:

In [48]:
find_top_ten_words("tech",bigram=True,hasStopWords=True)

Unnamed: 0,0,1
37738,said mr,135
27224,mobile phone,73
4301,bbc news,60
27225,mobile phones,58
44863,told bbc,57
19649,high definition,56
29071,news website,50
25115,mac mini,45
48791,wi fi,43
2855,anti virus,42


### 10 words for politics class whose presence most strongly predicts:

In [49]:
find_top_ten_words("politics",bigram=True,hasStopWords=True)

Unnamed: 0,0,1
27179,mr blair,222
32417,prime minister,191
27185,mr brown,165
42992,tony blair,124
17503,general election,119
27214,mr howard,92
42890,told bbc,87
26165,michael howard,83
22716,kilroy silk,79
23960,lib dems,77


### 10 words for entertainment class whose presence most strongly predicts:

In [50]:
find_top_ten_words("entertainment",bigram=True,hasStopWords=True)

Unnamed: 0,0,1
5728,box office,68
23671,new york,48
38974,year old,46
4998,best film,43
20916,los angeles,41
23320,named best,41
38359,won best,41
36950,vera drake,39
22222,million dollar,38
11260,dollar baby,37


### The Importance of Stopwords

The tables above show the difference that removing stop words makes. Without removing them, the top 10 words are mostly stop words such as "the", "to" and "is", which are poor predictors of whether a document belongs to a given category: they say nothing about a class. Since our task is classification, it makes sense to remove them when building the model. If the task were machine translation, on the other hand, it would be useful to keep them.
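
The effect can be reproduced on a tiny scale. The sketch below ranks words by raw frequency with and without a stop list. The stop list here is a small hand-picked set assumed for the example (not scikit-learn's `ENGLISH_STOP_WORDS`), and `top_words` is a hypothetical helper:

```python
from collections import Counter

def top_words(docs, k=10, stop_words=frozenset()):
    """Return the k most frequent whitespace tokens, skipping `stop_words`."""
    counts = Counter(
        word
        for doc in docs
        for word in doc.lower().split()
        if word not in stop_words
    )
    return counts.most_common(k)

# Toy data and stop list, invented for illustration only.
sport_docs = ["the match was the best match", "the team won"]
toy_stop_words = {"the", "was"}

print(top_words(sport_docs, k=2))
print(top_words(sport_docs, k=2, stop_words=toy_stop_words))
```

Without the stop list the ranking is led by "the"; with it, the class-specific word "match" comes first.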

## Calculating importance of words by using tf-idf scoring

To narrow down the dictionary for both the unigram and bigram models, we used tf-idf scoring on the whole dataset. Tf-idf assigns a weight to each word; the higher the weight, the more important the word is in the dataset. So we sorted the words by weight and selected the n most relevant ones. For the unigram model this turned out to be unnecessary: the model was already fast, and shrinking the dictionary decreased accuracy. So for the unigram model we only removed stop words, which increased accuracy from about 95% to about 97% in the runs shown here.
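
The selection idea can be sketched without any library. The version below assumes the smoothed idf formula `ln((1 + N) / (1 + df)) + 1`, skips the row normalization that `TfidfTransformer` applies by default, and scores every term by its best value in any document, so it is a simplification of `narrow_the_dict` and not a reimplementation of it. The names `tfidf_rows` and `select_vocabulary` are hypothetical:

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Per-document {term: tf * idf} with idf = ln((1 + N) / (1 + df)) + 1."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

def select_vocabulary(docs, size):
    """Keep the `size` terms with the highest tf-idf score in any single document."""
    best = Counter()
    for row in tfidf_rows(docs):
        for term, score in row.items():
            best[term] = max(best[term], score)
    return [term for term, _ in best.most_common(size)]

# Toy data, invented for illustration only.
docs = ["the market rose", "the team won", "the market fell"]
print(select_vocabulary(docs, 4))
```

Terms that occur in every document, such as "the" here, receive the lowest idf and are the first to be dropped when the dictionary is narrowed.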

For the bigram model, however, narrowing down the dictionary had a good effect: it increased accuracy and made the model faster. We chose to keep the 60000 most relevant bigrams. 60000 may seem like a lot compared to the full dictionary size of 244227, but fewer words make the model less accurate, so this was a trade-off and 60000 is our choice. Admittedly, the narrowed model is not dramatically faster than the original bigram model, but its accuracy is much better: bigram accuracy went from 63% to 90-95% once we narrowed down the dictionary and removed stop words.

In [51]:

df = shuffle(df)
x = df[:,1]
y = df[:,2]

x2 = np.copy(x)

# Convert the frozenset to a list, the type documented for stop_words
vectorizer = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
X = vectorizer.fit_transform(x)
feature_names = vectorizer.get_feature_names_out()

x_data = X.toarray()

x_train = x_data[:1192]
y_train = y[:1192]
x_test = x[1192:]
y_test = y[1192:]

test_keys = feature_names

x_train = x_train.tolist()
for i in range(len(x_train)):
    test_values = x_train[i]
    
    res = {test_keys[j]: test_values[j] for j in range(len(test_keys))}
    
    x_train[i] = res




y_pred = naive_bayes(x_train,y_train,x_test,y_test,hasStopWords=True)

calculate_accuracy(y_test,y_pred)  

Accuracy is: %97.31543624161074


In [52]:
narrowed_dict2 = narrow_the_dict(bigram=True,word_size=60000)

In [53]:
df = shuffle(df)
x = df[:,1]
y = df[:,2]

x2 = np.copy(x)

vectorizer = CountVectorizer(vocabulary=narrowed_dict2,analyzer='word',ngram_range=(2,2))
X = vectorizer.fit_transform(x)
feature_names = vectorizer.get_feature_names_out()

x_data = X.toarray()

x_train = x_data[:1192]
y_train = y[:1192]
x_test = x[1192:]
y_test = y[1192:]

test_keys = feature_names

x_train = x_train.tolist()
for i in range(len(x_train)):
    test_values = x_train[i]
    
    res = {test_keys[j]: test_values[j] for j in range(len(test_keys))}
    
    x_train[i] = res




y_pred = naive_bayes(x_train,y_train,x_test,y_test,bigram=True,hasStopWords=True)

calculate_accuracy(y_test,y_pred)  

Accuracy is: %93.95973154362416
