# Introduction
The data set contains news stories and the category to which each story belongs.

FEATURES:

- STORY: A part of the main content of the article to be published as a piece of news.
- SECTION: The genre/category the STORY falls in.

Each story falls into one of four distinct sections. The sections are labelled as follows:

- Politics: 0
- Technology: 1
- Entertainment: 2
- Business: 3
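
For readable reports, the numeric labels can be mapped back to section names. The following is a minimal sketch; `SECTION_NAMES` and `section_name` are illustrative names that do not appear in the notebook itself.

```python
# Label mapping as stated in the introduction above.
SECTION_NAMES = {0: "Politics", 1: "Technology", 2: "Entertainment", 3: "Business"}

def section_name(label):
    """Return the section name for a numeric SECTION label (hypothetical helper)."""
    return SECTION_NAMES[int(label)]

print(section_name(3))  # Business
```

With pandas, the same dictionary could be passed to `Series.map` to add a readable column.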

In [None]:
#Mount Google Drive so the data set can be read from it
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [None]:
#Import packages and libraries
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score


In [None]:
# Download the following NLTK corpora once
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
#Loading data set
news_data=pd.read_excel('/content/gdrive/MyDrive/predictnews/Data_Train.xlsx')
news_data

Unnamed: 0,STORY,SECTION
0,But the most painful was the huge reversal in ...,3
1,How formidable is the opposition alliance amon...,0
2,Most Asian currencies were trading lower today...,3
3,"If you want to answer any question, click on ‘...",1
4,"In global markets, gold prices edged up today ...",3
...,...,...
7623,"Karnataka has been a Congress bastion, but it ...",0
7624,"The film, which also features Janhvi Kapoor, w...",2
7625,The database has been created after bringing t...,1
7626,"The state, which has had an uneasy relationshi...",0


In [None]:
print(news_data.shape)
news_data.head()

(7628, 2)


Unnamed: 0,STORY,SECTION
0,But the most painful was the huge reversal in ...,3
1,How formidable is the opposition alliance amon...,0
2,Most Asian currencies were trading lower today...,3
3,"If you want to answer any question, click on ‘...",1
4,"In global markets, gold prices edged up today ...",3


In [None]:
#Summary statistics of STORY (count, unique, top, freq) grouped by SECTION
news_data.groupby('SECTION').describe()

SECTION,STORY count,STORY unique,STORY top,STORY freq
0,1686,1673,This story has been published from a wire agen...,4
1,2772,2731,This story has been published from a wire agen...,13
2,1924,1914,We will leave no stone unturned to make the au...,3
3,1246,1233,This story has been published from a wire agen...,11


In [None]:
# Remove duplicate rows so repeated stories do not bias training or appear in both the train and test sets
news_data.drop_duplicates(inplace=True)
#Punctuation string for reference (string.punctuation covers ASCII punctuation only)
punctuate = string.punctuation
punctuate

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
#Method to remove punctuation marks from the data
def punc_clear(news):
    news_no_punc = "".join([p for p in news if p not in punctuate])
    return news_no_punc

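#To remove stop words (matching is case-sensitive, so capitalised words such as "The" are kept)
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
def clear_stopword(news):
    words = news.split()
    news = " ".join([i for i in words if i not in stop_words])
    return news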

#To lemmatize every word, treating it as a verb (POS tag 'v')
lemmer = nltk.stem.WordNetLemmatizer()
def lemme(words):
    return " ".join([lemmer.lemmatize(word,'v') for word in words.split()])

def final_text(raw):
    cleaned_text = clear_stopword(punc_clear(raw))
    return lemme(cleaned_text)
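
The character-by-character filter in `punc_clear` can also be written with `str.translate`, which deletes all listed characters in one pass. The sketch below is an alternative, not the notebook's method; `punc_clear_fast` and `EXTRA_CHARS` are hypothetical names, and adding curly quotes is an assumption made for illustration, since they are not part of `string.punctuation`.

```python
import string

# Assumption for illustration: also strip curly quotes, which are not in
# string.punctuation and therefore survive a filter based on it alone.
EXTRA_CHARS = "‘’“”"
_PUNC_TABLE = str.maketrans("", "", string.punctuation + EXTRA_CHARS)

def punc_clear_fast(news):
    """Delete every listed punctuation character from the text in one pass."""
    return news.translate(_PUNC_TABLE)

print(punc_clear_fast("If you want to answer, click on ‘Answer’."))
# If you want to answer click on Answer
```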

In [None]:
#Applying the cleaning method to the entire data
news_data['CLEAN_STORY'] = news_data['STORY'].apply(final_text)
news_data

Unnamed: 0,STORY,SECTION,CLEAN_STORY
0,But the most painful was the huge reversal in ...,3,But painful huge reversal fee income unheard a...
1,How formidable is the opposition alliance amon...,0,How formidable opposition alliance among Congr...
2,Most Asian currencies were trading lower today...,3,Most Asian currencies trade lower today South ...
3,"If you want to answer any question, click on ‘...",1,If want answer question click ‘Answer’ After c...
4,"In global markets, gold prices edged up today ...",3,In global market gold price edge today disappo...
...,...,...,...
7623,"Karnataka has been a Congress bastion, but it ...",0,Karnataka Congress bastion also give BJP first...
7624,"The film, which also features Janhvi Kapoor, w...",2,The film also feature Janhvi Kapoor revolve ar...
7625,The database has been created after bringing t...,1,The database create bring together criminal re...
7626,"The state, which has had an uneasy relationshi...",0,The state uneasy relationship mainland since d...


In [None]:
# Build the bag-of-words vocabulary from the cleaned stories, then convert the counts to TF-IDF weights
bow = CountVectorizer().fit(news_data['CLEAN_STORY'])
print(len(bow.vocabulary_))
bow_data = bow.transform(news_data['CLEAN_STORY'])
print(bow_data.shape)
tfidf = TfidfTransformer().fit(bow_data)
tfidf_data = tfidf.transform(bow_data)

34514
(7551, 34514)
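
To make the TF-IDF step concrete, here is a small worked sketch in plain Python. It assumes a made-up three-document corpus and whitespace tokenization (simpler than `CountVectorizer`'s default tokenizer), and uses the smoothed IDF formula and L2 normalisation that scikit-learn documents as the `TfidfTransformer` defaults.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the data set).
docs = ["gold price edge up", "gold market trade lower", "film feature star"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(w for doc in tokenized for w in set(doc))
# Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Raw term counts times IDF, then L2-normalised."""
    counts = Counter(tokens)
    weights = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

rows = [tfidf_row(doc) for doc in tokenized]
# "gold" occurs in two documents, so it weighs less than "price" in the first row.
print(rows[0][vocab.index("gold")] < rows[0][vocab.index("price")])  # True
```

Terms that appear in many stories receive lower weights, so the classifier relies more on words that distinguish one section from another.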


In [None]:
X=tfidf_data
y=news_data['SECTION']

In [None]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
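
In the cells above, the vocabulary and IDF weights are fitted on all rows before the split, so the test stories contribute to the features. Whether this affects the reported scores is not measured here. One way to avoid it is to wrap vectorization and the classifier in a scikit-learn `Pipeline` and fit it on the training texts only. The sketch below assumes a tiny hand-made corpus; the texts and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy, hand-made corpus (hypothetical); 0 = Politics, 1 = Technology.
train_texts = [
    "election vote parliament minister",
    "minister vote party election",
    "smartphone app software update",
    "software chip smartphone launch",
]
train_labels = [0, 0, 1, 1]
test_texts = ["party election vote", "app software launch"]

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step.
model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
model.fit(train_texts, train_labels)  # vocabulary and IDF come from train only
pred = model.predict(test_texts).tolist()
print(pred)  # [0, 1]
```

With the real data, the split would be applied to `news_data['CLEAN_STORY']` first and the pipeline fitted on the training portion.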

In [None]:
#Fit using multinomial Naive Bayes Algorithm
clf = MultinomialNB().fit(X_train, y_train)
y_pred=clf.predict(X_test)

In [None]:
#Confusion Matrix (rows: true labels, columns: predicted labels)
cm=confusion_matrix(y_test,y_pred)
cm

array([[297,  21,   0,   3],
       [  3, 553,   0,   3],
       [ 11,  32, 311,   0],
       [  1,  31,   0, 245]])
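
Per-class precision and recall can be read off the confusion matrix directly. The sketch below copies the matrix from the output above and uses plain Python; the variable names are illustrative.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = [
    [297, 21, 0, 3],
    [3, 553, 0, 3],
    [11, 32, 311, 0],
    [1, 31, 0, 245],
]
names = ["Politics", "Technology", "Entertainment", "Business"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

for i, name in enumerate(names):
    recall = cm[i][i] / sum(cm[i])                    # correct / all truly in class i
    precision = cm[i][i] / sum(row[i] for row in cm)  # correct / all predicted as class i
    print(f"{name}: precision={precision:.3f} recall={recall:.3f}")
print(f"accuracy={accuracy:.4f}")
```

Of the 105 misclassified test stories, 84 are stories from other sections predicted as Technology (column 1), so Technology has high recall but the lowest precision of the four classes.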

In [None]:
#Model accuracy on the test set
score=accuracy_score(y_test,y_pred)
score

0.9305095962938451

In [None]:
#Fit using Logistic Regression Algorithm
from sklearn.linear_model import LogisticRegression
lo=LogisticRegression()
lo.fit(X_train,y_train)
lo_pred=lo.predict(X_test)
lo_score=accuracy_score(y_test,lo_pred)
lo_score

0.9602911978821972

In [None]:
#Fit using Random Forest Classifier Algorithm
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=100)
rf.fit(X_train,y_train)
rf_pred=rf.predict(X_test)
rf_score=accuracy_score(y_test,rf_pred)
rf_score

0.9410986101919259

## Conclusion

Logistic Regression achieved the highest test accuracy, about 96.0%, ahead of Random Forest (about 94.1%) and Multinomial Naive Bayes (about 93.1%).