# **Business Problem**

The dataset contains 2,225 news articles in two columns. The task is to predict an article's category from its text.

- `data`: the news article text.
- `labels`: the category of the news article.

**Categories (Labels)**

The dataset appears to be from BBC News and categorizes articles into topics such as:

- Entertainment
- Politics
- Sport
- Technology (stored as `tech` in the `labels` column)
- Business

# **Loading the Data**

In [1]:
import pandas as pd 
import numpy as np
import nltk
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('bbc_data.csv')
df

| | data | labels |
|---|---|---|
| 0 | Musicians to tackle US red tape Musicians gro... | entertainment |
| 1 | U2s desire to be number one U2, who have won ... | entertainment |
| 2 | Rocker Doherty in on-stage fight Rock singer ... | entertainment |
| 3 | Snicket tops US box office chart The film ada... | entertainment |
| 4 | Oceans Twelve raids box office Oceans Twelve,... | entertainment |
| ... | ... | ... |
| 2220 | Warning over Windows Word files Writing a Mic... | tech |
| 2221 | Fast lifts rise into record books Two high-sp... | tech |
| 2222 | Nintendo adds media playing to DS Nintendo is... | tech |
| 2223 | Fast moving phone viruses appear Security fir... | tech |


# **Data Exploration**

In [3]:
df.shape

(2225, 2)

In [4]:
df.dtypes

data      object
labels    object
dtype: object

In [5]:
df.isnull().sum()

data      0
labels    0
dtype: int64

In [6]:
df.duplicated().sum()

np.int64(99)
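
To make the duplicate check concrete, here is a minimal sketch on a hypothetical four-row frame (the `toy` data is invented for illustration; only the column names match the notebook). It shows that `duplicated()` compares whole rows, so an article repeated under a different label is only caught when the check is restricted to the `data` column.

```python
import pandas as pd

# Hypothetical toy frame standing in for bbc_data.csv (same column names).
toy = pd.DataFrame({
    "data": ["film tops chart", "film tops chart", "new phone virus", "film tops chart"],
    "labels": ["entertainment", "entertainment", "tech", "tech"],
})

# duplicated() flags a row only when every column matches an earlier row,
# so the last row (same text, different label) is not counted here.
n_duplicates = int(toy.duplicated().sum())

# Restricting the check to the text column also catches conflicting labels.
n_duplicate_texts = int(toy.duplicated(subset="data").sum())

# Drop exact duplicates and look at the remaining class balance.
deduped = toy.drop_duplicates().reset_index(drop=True)
class_counts = deduped["labels"].value_counts()

print(n_duplicates, n_duplicate_texts, len(deduped))
print(class_counts)
```

On the real data, the same `value_counts()` call on `labels` would show whether the categories are roughly balanced, which matters when accuracy is the only metric.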

# **Data (Text) Preprocessing**

**Text Cleaning**

In [7]:
df = df.drop_duplicates()

In [8]:
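import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Requires the NLTK stopword list: nltk.download('stopwords')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once, not once per token

corpus = []
for news in df['data'].tolist():
    # Keep letters and digits only, lowercase, and split into tokens
    tokens = re.sub('[^a-zA-Z0-9]', ' ', news).lower().split()
    # Drop stopwords and stem the remaining tokens
    stems = [stemmer.stem(tok) for tok in tokens if tok not in stop_words]
    corpus.append(' '.join(stems))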
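
The cleaning steps above can also be wrapped in a reusable function, which makes them easy to try on a single sentence. The sketch below is an illustration, not the notebook's code: `clean_text` and the tiny `STOPWORDS` set are hypothetical names chosen so the example runs without downloading NLTK data, whereas the notebook uses the full `stopwords.words('english')` list.

```python
import re
from nltk.stem import PorterStemmer

# Hypothetical, deliberately tiny stopword set so the sketch runs without
# downloading NLTK data; the notebook uses stopwords.words('english').
STOPWORDS = frozenset({"a", "an", "the", "to", "of", "in", "on", "is", "and", "who", "have"})
stemmer = PorterStemmer()


def clean_text(text):
    """Replace non-alphanumerics with spaces, lowercase, drop stopwords, stem."""
    tokens = re.sub(r"[^a-zA-Z0-9]", " ", text).lower().split()
    return " ".join(stemmer.stem(tok) for tok in tokens if tok not in STOPWORDS)


cleaned = clean_text("U2, who have won three awards, want to tackle the charts!")
print(cleaned)
```

With such a function, the corpus could be built with `df['data'].apply(clean_text)`, assuming the same `df` as in the notebook.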

**Train Test Split**

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(corpus, df['labels'], test_size = 0.2, random_state = 42)
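
As a sanity check on the split sizes, the sketch below repeats the split on placeholder data. It assumes that dropping the 99 duplicates from 2,225 rows leaves 2,126 articles; the `docs` and `labels` lists are stand-ins, not the real corpus. The second call shows the optional `stratify` argument, which the notebook does not use, and which keeps class proportions similar in both parts.

```python
from sklearn.model_selection import train_test_split

# Assumption: 2,225 rows minus the 99 duplicates leaves 2,126 articles.
n_articles = 2225 - 99
docs = [f"doc {i}" for i in range(n_articles)]   # placeholder texts
labels = [i % 5 for i in range(n_articles)]      # placeholder labels, 5 classes

# Same settings as the notebook: 20% held out, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    docs, labels, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))

# Optional variant (not used in the notebook): a stratified split.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    docs, labels, test_size=0.2, random_state=42, stratify=labels
)
per_class_test = [y_te_s.count(c) for c in range(5)]
print(per_class_test)
```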

**Vectorization**

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train = pd.DataFrame(vectorizer.fit_transform(X_train).toarray(), columns = vectorizer.get_feature_names_out())
X_test = pd.DataFrame(vectorizer.transform(X_test).toarray(), columns = vectorizer.get_feature_names_out())

# **Modelling**

**Naive Bayes**

In [11]:
#Modelling
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)

#Prediction
ypred_train = model.predict(X_train)
ypred_test = model.predict(X_test)

#Evaluation
from sklearn.metrics import accuracy_score

accuracy_train = accuracy_score(y_train, ypred_train)
accuracy_test = accuracy_score(y_test, ypred_test)

print('Accuracy(Train): ', accuracy_train)
print('Accuracy(Test): ', accuracy_test)

Accuracy(Train):  0.9929411764705882
Accuracy(Test):  0.9671361502347418
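
One possible way to package the vectorizer and the classifier together is a scikit-learn `Pipeline`, so that raw (cleaned) text goes in and a label comes out. This is a sketch of an alternative arrangement, not what the notebook does, and the four training sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical, already-cleaned training texts with two classes.
toy_texts = [
    "team win match goal",
    "player score goal match",
    "phone softwar viru updat",
    "softwar comput network phone",
]
toy_labels = ["sport", "sport", "tech", "tech"]

# The pipeline fits the vectorizer and the classifier in one call and
# applies the same vocabulary automatically at prediction time.
clf = Pipeline([
    ("counts", CountVectorizer()),
    ("nb", MultinomialNB()),
])
clf.fit(toy_texts, toy_labels)

prediction = clf.predict(["goal match team"])[0]
print(prediction)
```

A pipeline like this also avoids fitting the vectorizer on test data by accident, because `fit` only ever sees the training texts.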


**SVM**

In [12]:
from sklearn.svm import SVC

model = SVC(random_state=1)  # explicit integer seed; the boolean True behaves as 1
model.fit(X_train, y_train)

#Prediction
ypred_train = model.predict(X_train)
ypred_test = model.predict(X_test)

#Evaluation
from sklearn.metrics import accuracy_score

accuracy_train = accuracy_score(y_train, ypred_train)
accuracy_test = accuracy_score(y_test, ypred_test)

print('Accuracy(Train): ', accuracy_train)
print('Accuracy(Test): ', accuracy_test)

Accuracy(Train):  0.9982352941176471
Accuracy(Test):  0.9553990610328639


**Random Forest**

In [13]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=1)  # explicit integer seed; the boolean True behaves as 1
model.fit(X_train, y_train)

#Prediction
ypred_train = model.predict(X_train)
ypred_test = model.predict(X_test)

#Evaluation
from sklearn.metrics import accuracy_score

accuracy_train = accuracy_score(y_train, ypred_train)
accuracy_test = accuracy_score(y_test, ypred_test)

print('Accuracy(Train): ', accuracy_train)
print('Accuracy(Test): ', accuracy_test)

Accuracy(Train):  1.0
Accuracy(Test):  0.9624413145539906


**Decision Tree**

In [14]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=1)  # explicit integer seed; the boolean True behaves as 1
model.fit(X_train, y_train)

#Prediction
ypred_train = model.predict(X_train)
ypred_test = model.predict(X_test)

#Evaluation
from sklearn.metrics import accuracy_score

accuracy_train = accuracy_score(y_train, ypred_train)
accuracy_test = accuracy_score(y_test, ypred_test)

print('Accuracy(Train): ', accuracy_train)
print('Accuracy(Test): ', accuracy_test)

Accuracy(Train):  1.0
Accuracy(Test):  0.8356807511737089
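
To compare the four models side by side, the reported test accuracies can be converted back into counts of misclassified articles. The sketch below assumes the test set holds 426 articles (2,225 rows minus 99 duplicates, with 20% held out); that size is inferred from the numbers above rather than printed by the notebook.

```python
# Assumption: 2,225 - 99 = 2,126 articles, split into 1,700 train and 426 test.
N_TEST = 426

# Test accuracies as reported in the outputs above.
test_accuracy = {
    "Naive Bayes": 0.9671361502347418,
    "SVM": 0.9553990610328639,
    "Random Forest": 0.9624413145539906,
    "Decision Tree": 0.8356807511737089,
}

# accuracy * N_TEST is the number of correct predictions; the rest are errors.
test_errors = {
    name: N_TEST - round(acc * N_TEST) for name, acc in test_accuracy.items()
}

# Print models from fewest to most test errors.
for name, n_wrong in sorted(test_errors.items(), key=lambda kv: kv[1]):
    print(f"{name:<14} {n_wrong:>3} of {N_TEST} test articles misclassified")
```

Read this way, Naive Bayes makes the fewest test errors, while the decision tree fits the training set perfectly yet misclassifies far more test articles, which points to overfitting.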
