## AIN313 ASSIGNMENT 2

Name: Ebrar Pınar Kuz

Student ID: 2220765017

In [216]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer
import warnings
warnings.filterwarnings('ignore')


In [217]:
df=pd.read_csv("English Dataset.csv")
print(df.head())
print(df.shape)

   ArticleId                                               Text  Category
0       1833  worldcom ex-boss launches defence lawyers defe...  business
1        154  german business confidence slides german busin...  business
2       1101  bbc poll indicates economic gloom citizens in ...  business
3       1976  lifestyle  governs mobile choice  faster  bett...      tech
4        917  enron bosses in $168m payout eighteen former e...  business
(1490, 3)


## PART 1- Understanding the data 

In [218]:
print(df['Category'].value_counts())
df['Text'] = df['Text'].str.lower()


Category
sport            346
business         336
politics         274
entertainment    273
tech             261
Name: count, dtype: int64


In [219]:
vectorizer = CountVectorizer(
    stop_words= None,
    token_pattern=r'(?u)\b[a-zA-Z]+\b'  )
X = vectorizer.fit_transform(df['Text'])
vocab = vectorizer.get_feature_names_out()

word_freq_df = pd.DataFrame(0, index=vocab, columns=df['Category'].unique())

for cat in df['Category'].unique():
    subset_index = (df['Category'] == cat).to_numpy()  
    counts = X[subset_index]
    word_freq = np.sum(counts.toarray(), axis=0)
    word_freq_df.loc[:, cat] = word_freq
top_words = word_freq_df.sum(axis=1).sort_values(ascending=False).head(30)
result_df = word_freq_df.loc[top_words.index]


print(result_df)

      business  tech  politics  sport  entertainment
the       7133  7498      7957   6620           5822
to        3306  4149      3913   3189           2108
of        2864  3425      2840   1826           2048
and       2161  3017      2559   2532           2130
a         2296  2773      2532   2651           1898
in        2821  2316      2159   2510           1996
s         1367   806      1074   1440           1247
for       1045  1299      1237   1127           1097
is        1072  1597      1167    985            684
that      1052  1676      1195    863            491
it        1011  1465       998    974            738
on         906  1111      1186   1014            881
said      1100  1064      1445    636            594
was        596   653      1013    943            818
he         287   484      1410   1105            584
be         573  1057      1080    614            502
with       612   787       645    803            662
has        835   686       611    650         

### REPORT 

The counts of the categories are very close to each other. The most common category is sport with 346 news articles, and the least common category is tech with 261 news articles.

When we look at the most frequent words in each category, stopwords such as `the`, `to`, and `of` fill the top of the table above. Setting those aside, the most common words are:

For the business category: said, year, mr, market, new, firm

For the tech category: said, people, mr, new, mobile, technology

For the politics category: said, mr, labour, government, election, blair

For the sport category: said, game, year, england, time, win

For the entertainment category: said, film, best, year, music

I chose these words only by looking at their counts in each category, but counts alone are not a reliable basis for conclusions, because most of these words appear in more than one category. Therefore, we should choose words that have a high count in one category and a very low count in the others.

In [220]:
# Here I check the per-category ratio of each word to see whether it represents one class well.
ratio_df = word_freq_df.div(word_freq_df.sum(axis=1) + 1e-9, axis=0) 
max_cat = ratio_df.idxmax(axis=1)
max_val = ratio_df.max(axis=1)


all_representative_words = pd.DataFrame({
    'main_category': max_cat,
    'strength': max_val,
    'total_count': word_freq_df.sum(axis=1) 
})

min_appearances = 5 
filtered_rep_words = all_representative_words[
    all_representative_words['total_count'] > min_appearances
]

top_3_per_category_df = filtered_rep_words.groupby('main_category') \
                                          .apply(lambda x: x.sort_values('strength', ascending=False).head(3))

top_3_per_category_df = top_3_per_category_df.reset_index(level=0, drop=True)

final_word_list = top_3_per_category_df.index
report_stats_df = word_freq_df.loc[final_word_list]


display(report_stats_df)

          business  tech  politics  sport  entertainment
yukos          122     0         0      0              0
gm              55     0         0      0              0
worldcom        54     0         0      0              0
actress          0     0         0      0            116
festival         0     0         0      0            100
aviator          0     0         0      0             76
tory             0     0       170      0              0
tories           0     0       158      0              0
lib              0     0       133      0              0
roddick          0     0         0     97              0


### REPORT PART 1: Understanding the Data

To determine if predicting a category from its text is feasible, we analyzed the words that are most representative of each category. Instead of using raw word counts, which can be misleading, we analyzed the word frequency *ratios*. This method checks how strongly a word is associated with one specific category compared to all others.

Based on this ratio analysis, we can say that specific keywords can successfully represent their main category. The following are three specific keyword examples for each class that we found to be useful:

* Business class: `yukos`, `gm`, `worldcom`
* Tech class: `spam`, `mobiles`, `gadgets`
* Politics class: `tory`, `tories`, `lib`
* Sport class: `roddick`, `referee`, `athens`
* Entertainment class: `actress`, `festival`, `aviator`
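
To make the ratio idea concrete, here is a small sketch that applies it to three words whose counts appear in the outputs above (`said`, `yukos`, `tory`). The names `counts`, `ratios`, `strength`, and `main_category` are chosen for illustration only and are not part of the notebook's pipeline.

```python
import pandas as pd

cats = ["business", "tech", "politics", "sport", "entertainment"]

# Per-category counts copied from the printed tables above.
counts = pd.DataFrame(
    [
        [1100, 1064, 1445, 636, 594],  # said
        [122, 0, 0, 0, 0],             # yukos
        [0, 0, 170, 0, 0],             # tory
    ],
    index=["said", "yukos", "tory"],
    columns=cats,
)

# Share of each word's occurrences that falls into each category.
ratios = counts.div(counts.sum(axis=1), axis=0)

strength = ratios.max(axis=1)          # how concentrated the word is
main_category = ratios.idxmax(axis=1)  # where it is concentrated

print(pd.DataFrame({"main_category": main_category, "strength": strength}))
```

A word such as `said` is frequent but spread over all five categories, so its strength stays low, while `yukos` and `tory` occur in a single category and reach the maximum strength of 1.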

### Part 2-  Implementing Naive Bayes

In [221]:
# Implementing the Naive Bayes algorithm (multinomial, count features)
class MyNaiveBayes:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.class_word_counts = {}
        self.class_total_words = {}
        self.class_priors = {}

        self.vocab_size = X.shape[1]
        y = np.asarray(y)

        for c in self.classes:
            # rows (documents) that belong to class c
            X_c = X[y == c]

            # per-word counts in class c, plus 1 for Laplace smoothing
            self.class_word_counts[c] = np.sum(X_c, axis=0) + 1
            # smoothed total = raw word total + vocabulary size
            self.class_total_words[c] = np.sum(self.class_word_counts[c])
            # prior P(c) = share of training documents in class c
            self.class_priors[c] = X_c.shape[0] / X.shape[0]

    def predict(self, X):
        preds = []
        for x in X:  # x is one document as a 1 x vocab sparse row
            class_probs = {}
            for c in self.classes:
                log_prior = np.log(self.class_priors[c])
                # sum over words of count(word) * log P(word | c)
                log_likelihood = np.sum(x.multiply(np.log(self.class_word_counts[c] / self.class_total_words[c])))
                class_probs[c] = log_prior + log_likelihood
            # pick the class with the highest log posterior score
            preds.append(max(class_probs, key=class_probs.get))
        return preds
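
The scoring rule used by the class above can be checked by hand on a tiny example. The sketch below is only an illustration: the three-word vocabulary, the toy counts, and the helper names `smoothed_log_likelihood` and `log_posterior` are assumptions, not part of the assignment code.

```python
import numpy as np

# Toy document-term counts. Columns are the hypothetical words
# ["goal", "market", "the"]; rows are documents.
X_toy = np.array([
    [2, 0, 3],   # sport
    [1, 0, 2],   # sport
    [0, 3, 4],   # business
])
y_toy = np.array(["sport", "sport", "business"])


def smoothed_log_likelihood(X, y, c):
    """log P(word | c) with add-one (Laplace) smoothing."""
    counts = X[y == c].sum(axis=0) + 1
    return np.log(counts / counts.sum())


def log_posterior(x, X, y, c):
    """log P(c) + sum over words of count * log P(word | c)."""
    log_prior = np.log(np.mean(y == c))
    return log_prior + x @ smoothed_log_likelihood(X, y, c)


x_new = np.array([1, 0, 1])  # a new document containing "goal" and "the" once each
scores = {c: log_posterior(x_new, X_toy, y_toy, c) for c in np.unique(y_toy)}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

For the sport class the smoothed counts are 4, 1, and 6 (total 11), so the score is log(2/3) + log(4/11) + log(6/11); the business score is lower because `goal` never occurs there, and the new document is assigned to sport.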


In [222]:
def evaluate_model(vectorizer, X_text, y, label):
    
    if hasattr(vectorizer, "vocabulary_"):
        X = vectorizer.transform(X_text)
    else:
        X = vectorizer.fit_transform(X_text)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = MyNaiveBayes()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    acc = accuracy_score(y_test, preds)
    print(f"{label}: {acc*100:.2f}%")
    return acc


In [223]:


acc_uni = evaluate_model(
    CountVectorizer(stop_words=None, ngram_range=(1,1)),
    df["Text"], df["Category"], "BoW (Unigram) - NB"
)

BoW (Unigram) - NB: 97.65%


In [224]:

acc_bi = evaluate_model(
    CountVectorizer(stop_words=None, ngram_range=(2,2)),
    df["Text"], df["Category"], "BoW (Bigram) - NB"
)

BoW (Bigram) - NB: 96.98%


### PART 3-

In [225]:
vectorizer_part3 = CountVectorizer(stop_words=None, ngram_range=(1,1))


X_part3 = vectorizer_part3.fit_transform(df["Text"])
y_part3 = df["Category"]


vocab_part3 = vectorizer_part3.get_feature_names_out()


model_part3 = MyNaiveBayes()
model_part3.fit(X_part3, y_part3)


classes_part3 = model_part3.classes

In [226]:
# --- Part 3a: 
print("Top 10 Presence and Absence Words for each class")

likelihoods = {}
for c in classes_part3:
    
    probs = np.asarray(model_part3.class_word_counts[c] / model_part3.class_total_words[c]).flatten()
    likelihoods[c] = probs


likelihood_df = pd.DataFrame(likelihoods, index=vocab_part3)

presence_words = {}
absence_words = {}
for c in classes_part3:
    print(f"\nCategory: {c}")
    
    presence_words[c] = likelihood_df[c].sort_values(ascending=False).head(10)
    print("Top 10 presence words:")
    print(presence_words[c])

    
    absence_words[c] = likelihood_df[c].sort_values(ascending=True).head(10)
    print("\nTop 10 absence words:")
    print(absence_words[c])

Top 10 Presence and Absence Words for each class

Category: business
Top 10 presence words:
the     0.053576
to      0.024836
of      0.021516
in      0.021193
and     0.016237
said    0.008268
is      0.008058
that    0.007908
for     0.007855
it      0.007600
Name: business, dtype: float64

Top 10 absence words:
04secs       0.000008
0400         0.000008
033          0.000008
028          0.000008
0130         0.000008
0100         0.000008
007          0.000008
0051         0.000008
001st        0.000008
mudslides    0.000008
Name: business, dtype: float64

Category: entertainment
Top 10 presence words:
the    0.051877
and    0.018985
to     0.018789
of     0.018254
in     0.017791
for    0.009782
on     0.007858
was    0.007296
it     0.006584
is     0.006103
Name: entertainment, dtype: float64

Top 10 absence words:
0530gmt     0.000009
01          0.000009
028         0.000009
03          0.000009
zib         0.000009
zidane      0.000009
ziers       0.000009
zinc        0.00000

### Part 3a: Analyzing the Effect of Words

We looked at the "Top 10 presence words" and "Top 10 absence words" for each category, based on the probabilities calculated by our model.

**1. Presence Words**

As the output lists show, the **Top 10 presence words** for every single category are common stopwords. Words like `the`, `to`, `of`, and `and` are at the top of all lists.

This happens because our model was trained *with* these words, and they appear very often in all texts. So, the model's math correctly finds that they have the highest probability. However, these words are not informative and do not help us tell the categories apart from each other.

**2. Absence Words**

On the other hand, the **Top 10 absence words** lists show the words with the lowest possible probability (around `0.000008`). These are words that never occur in the training texts of that category, so after Laplace smoothing they all receive the same minimum value. The lists are mostly filled with very rare words, numbers, or words that only occur in other categories (like `04secs`, `zidane`, `zombie`, or `zooropa`).

Because the model has no evidence for these words in the category, it gives them the lowest possible score, and the ten words shown are simply the first of many tied words. We can conclude that, just like the stopword-filled "presence" lists, these "absence" lists are also not very useful for understanding what makes each category unique. This tells us we need to filter out stopwords to find the real keywords, which we will do in Part 3b.
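
The shared minimum value in the absence lists follows directly from add-one smoothing. The sketch below uses a made-up four-word vocabulary and made-up counts for a single class (both are assumptions for illustration) to show that every word with a raw count of zero ends up at the same floor, 1 / (total words in the class + vocabulary size).

```python
import numpy as np

vocab = np.array(["film", "goal", "the", "zidane"])  # hypothetical vocabulary
raw_counts = np.array([30, 0, 500, 0])               # hypothetical counts inside one class

smoothed = raw_counts + 1             # add-one (Laplace) smoothing
probs = smoothed / smoothed.sum()     # P(word | class)

# Lowest value any word can get in this class.
floor = 1 / (raw_counts.sum() + len(vocab))

# Every unseen word sits exactly at the floor, so they are all tied.
unseen = vocab[np.isclose(probs, floor)]
print(floor, list(unseen))
```

Since all unseen words are tied, sorting by probability and taking the first ten returns an arbitrary slice of them rather than the ten most "absent" words.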

In [227]:
# TF-IDF 

tfidf_transformer = TfidfTransformer()
tfidf_matrix = tfidf_transformer.fit_transform(X_part3)

# find the tf-idf value for each word 
tfidf_scores = np.asarray(tfidf_matrix.mean(axis=0)).ravel()

tfidf_df = pd.DataFrame({'word': vocab_part3, 'tfidf': tfidf_scores}).sort_values('tfidf', ascending=False)

# select the most important words
N_FEATURES = 2000
important_words = set(tfidf_df.head(N_FEATURES)['word'])
print(f"Selected the most important {len(important_words)} words with TF-IDF ")


Selected the most important 2000 words with TF-IDF 


In [228]:
#TF-IDF with stopwords
vectorizer_subset = CountVectorizer(
    stop_words=None,
    ngram_range=(1, 1),
    vocabulary=important_words  
)
acc_tfd = evaluate_model(
    vectorizer_subset,
    df["Text"],
    df["Category"],
    f"BoW (Unigram - Top {N_FEATURES} TF-IDF Words)"
)

BoW (Unigram - Top 2000 TF-IDF Words): 97.99%


To find the optimal number of features for our model, I experimented with different vocabulary sizes (N_FEATURES) based on their TF-IDF scores.

I observed that the model's accuracy increased as the number of features grew, peaking at `N_FEATURES = 2000`. However, when more than 2000 words were used, the accuracy began to decrease. This suggests that 2000 features is a good value for this dataset, as it balances capturing informative signals against including noise or irrelevant details. So I selected `N_FEATURES = 2000` for this experiment.

In [229]:
#TF-IDF without stopwords
vectorizer_subset = CountVectorizer(
    stop_words='english',
    ngram_range=(1, 1),
    vocabulary=important_words  
)
acc_tfd_stop = evaluate_model(
    vectorizer_subset,
    df["Text"],
    df["Category"],
    f"BoW (Unigram - Top {N_FEATURES} TF-IDF Words)"
)

BoW (Unigram - Top 2000 TF-IDF Words): 97.65%


In [230]:
# --- Part 3b:
print("Strongest words for each category(except stopwords)")


non_stop_presence = {}
for c in classes_part3:
   
    all_class_words = likelihood_df[c]
    
    filtered_words = all_class_words.drop(ENGLISH_STOP_WORDS, errors='ignore')
    
    top_10_non_stopwords = filtered_words.sort_values(ascending=False).head(10)
    
    non_stop_presence[c] = top_10_non_stopwords 
    
    print(f"\nCategory: {c}")
    print("Top 10 important non-stopwords:")
    print(top_10_non_stopwords)

Strongest words for each category(except stopwords)

Category: business
Top 10 important non-stopwords:
said          0.008268
year          0.003432
mr            0.002959
market        0.002140
new           0.002058
firm          0.001968
growth        0.001938
company       0.001908
economy       0.001757
government    0.001622
Name: business, dtype: float64

Category: entertainment
Top 10 important non-stopwords:
said      0.005301
film      0.005203
best      0.003840
year      0.002815
music     0.002281
new       0.002094
awards    0.001648
uk        0.001532
actor     0.001515
number    0.001479
Name: entertainment, dtype: float64

Category: politics
Top 10 important non-stopwords:
said          0.010071
mr            0.007480
labour        0.003447
government    0.003239
election      0.002960
blair         0.002758
party         0.002626
people        0.002598
minister      0.001999
new           0.001957
Name: politics, dtype: float64

Category: sport
Top 10 important non-s

### Part 3b: Analysis of Non-Stopword Lists

These lists, which were generated after filtering out the stopwords, are more informative than the "presence" lists from Part 3a, which were dominated by stopwords like `the` and `to`.

By removing the stopwords, we can now see the **true, informative keywords** that the model is using to identify each category.

The word `said` is at the top of the list in every category. While `said` is not on the sklearn stopword list, it clearly acts as a stopword for this news dataset.

When we skip past common words like `said` and `mr`, we see that:
* Business is identified by `market`, `firm`, `growth`, `company`, and `economy`.
* Politics is clearly identified by `labour`, `government`, `election`, and `blair`.
* Entertainment is identified by `film`, `music`, `awards`, and `actor`.
* Sport is identified by `game`, `england`, `win`, `players`, `cup`, and `team`.
* Tech is identified by `mobile`, `technology`, `users`, `software`, and `net`.

 



In [231]:

acc_uni_sw = evaluate_model(
    CountVectorizer(stop_words='english', ngram_range=(1,1)),
    df["Text"], df["Category"], "BoW (Unigram) - Stopwords - NB"
)

BoW (Unigram) - Stopwords - NB: 97.65%


In [232]:
# === 4️⃣ BoW (Bigram + Stopwords Removed) ===
acc_bi_sw = evaluate_model(
    CountVectorizer(stop_words='english', ngram_range=(2,2)),
    df["Text"], df["Category"], "BoW (Bigram) - Stopwords - NB"
)


BoW (Bigram) - Stopwords - NB: 94.63%


In [233]:
print("\n=== Accuracy Summary ===")
print(f"BoW Unigram: {acc_uni*100:.2f}%")
print(f"BoW Bigram: {acc_bi*100:.2f}%")
print(f"BoW Unigram Stopwords: {acc_uni_sw*100:.2f}%")
print(f"BoW Bigram Stopwords: {acc_bi_sw*100:.2f}%")
print(f"TF-IDF: {acc_tfd*100:.2f}%")
print(f"TF-IDF Stopwords: {acc_tfd_stop*100:.2f}%")



=== Accuracy Summary ===
BoW Unigram: 97.65%
BoW Bigram: 96.98%
BoW Unigram Stopwords: 97.65%
BoW Bigram Stopwords: 94.63%
TF-IDF: 97.99%
TF-IDF Stopwords: 97.65%



### REPORT-About stopwords and Accuracy Results

"Why do we remove stop words?"

* As seen in Part 3a, keeping stopwords hides the true signals: the `presence` lists contained only stopwords. Removing them acts as a filter, allowing the model to focus on the words that actually carry meaning.
* Filtering stopwords allows us to interpret the model. We can now confirm that our model correctly learned that `film` predicts 'Entertainment' and `game` predicts 'Sport'.
* Removing stopwords seems sensible at first, but the results show that it never improved accuracy: the unigram score stayed at 97.65%, and the bigram and TF-IDF scores dropped.

"Why should we keep stop words?"

* Removing stopwords can break some meaningful word pairs. For example, in a bigram model, the phrase "King *of* Spain" is meaningful, and removing "of" would break it. Accordingly, the BoW bigram accuracy is 96.98% with stopwords but only 94.63% after removing them.
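
The "King of Spain" example can be reproduced with the vectorizer's own analyzer. The sketch below is a small illustration; the phrase and the variable names `with_stop` and `without_stop` are chosen here for demonstration. In scikit-learn, stopwords are dropped before the n-grams are built, so the surviving words are joined into a new pair.

```python
from sklearn.feature_extraction.text import CountVectorizer

phrase = "King of Spain"

# Bigrams with stopwords kept.
with_stop = CountVectorizer(ngram_range=(2, 2)).build_analyzer()(phrase)

# Bigrams after removing English stopwords.
without_stop = CountVectorizer(
    ngram_range=(2, 2), stop_words="english"
).build_analyzer()(phrase)

print(with_stop)     # ['king of', 'of spain']
print(without_stop)  # ['king spain']
```

With stopwords removed, the two original bigrams disappear and are replaced by `king spain`, a pair that never appears in the text itself.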

However, for a unigram (single word) Naive Bayes classifier, stopwords primarily add noise. We would expect the model's accuracy to be higher *after* removing them, but in this case the accuracy did not change. A probable reason is that the stopwords are few compared with the rest of the vocabulary and occur at similar rates in every category, so they are not enough to change the predictions.

The most interesting result came from the **TF-IDF models**.
* The **best accuracy** of all experiments was **97.99%**, which came from the TF-IDF model that kept its stopwords (the fifth model in the summary).
* When we removed the stopwords first (the sixth model), the accuracy actually dropped to *97.65%*.

A likely reason is that TF-IDF already accounts for stopwords to some extent. The "IDF" part of the score sees that words like 'the' or 'to' are in almost all documents, so it gives them the lowest weight. Note that in this notebook TF-IDF is only used to choose the 2000-word vocabulary; the classifier itself is still trained on raw counts.

So the fifth model did not need manual stopword removal. Removing them in the sixth model did not help and, just like in the bigram model, it resulted in a slightly lower accuracy.
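
The effect of the IDF weighting mentioned above can be seen with a short calculation. The sketch assumes scikit-learn's default smoothed formula for `TfidfTransformer`, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and uses a made-up document frequency of 50 for the rarer word.

```python
import numpy as np

n_docs = 1490  # number of articles in the dataset

# Document frequencies: a word found in every article vs. a
# hypothetical word found in only 50 articles.
df_counts = np.array([1490, 50])

# Smoothed IDF (the TfidfTransformer default, smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + df_counts)) + 1
print(idf)
```

A word that appears in every article gets the minimum IDF of 1, while the rarer word gets a clearly higher weight. The final TF-IDF score also multiplies in the term frequency, so a very frequent stopword can still keep a high average score and remain in the selected vocabulary.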