# 1. Import Statements

### 1.1 Installing Required Libraries

In [None]:
!pip install contractions

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-2.0.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.0.0 textsearch-0.0.24


In [None]:
!pip install --user -U nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### 1.2 Importing Required Libraries

In [None]:
import os
import re 
import random
import string      # for string operations    
import pandas as pd
import numpy as np     
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import plotly.express as px
# Set up NLTK
import nltk                                # Python library for NLP
#nltk.download('punkt')

#from nltk.corpus import twitter_samples    # sample Twitter dataset from NLTK
#from nltk.corpus import stopwords          # module for stop words that come with NLTK
#from nltk.stem import PorterStemmer        # module for stemming
#from nltk.stem import WordNetLemmatizer    # module for Lemmatization

#from nltk.tokenize import TweetTokenizer
#nltk.download('wordnet')
#nltk.download('omw-1.4')
#nltk.download("stopwords")

from pprint import pprint

# Spacy specific libraries 
import spacy
import contractions

# Import label encoder
from sklearn import preprocessing
# Bag of words
from sklearn.feature_extraction.text import CountVectorizer
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
# Naive bayes classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

# Random Forest classifier
from sklearn.ensemble import RandomForestClassifier

# KNN classifier
from sklearn.neighbors import KNeighborsClassifier

# Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier

# Evaluation metrics
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report

# 2. Download the Data

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
!gdown 1I3-pQFzbSufhpMrUKAROBLGULXcWiB9u

Downloading...
From: https://drive.google.com/uc?id=1I3-pQFzbSufhpMrUKAROBLGULXcWiB9u
To: /content/flipitnews-data.csv
100% 5.06M/5.06M [00:00<00:00, 17.2MB/s]


# 3. Problem Statement
- **Categorize news articles** into one of five categories (**politics, technology, sports, business, and entertainment**) based on their content.
- **Build and compare** several classification models.


# 4. Read & Explore Data

## 4.1 Read News Data

In [None]:
dataframe = pd.read_csv("./flipitnews-data.csv")
dataframe.head()

,Category,Article
0,Technology,tv future in the hands of viewers with home th...
1,Business,worldcom boss left books alone former worldc...
2,Sports,tigers wary of farrell gamble leicester say ...
3,Sports,yeading face newcastle in fa cup premiership s...
4,Entertainment,ocean s twelve raids box office ocean s twelve...


### 4.1.1 Data Shape

In [None]:
dataframe.shape

(2225, 2)

### 4.1.2 Data Types

In [None]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  2225 non-null   object
 1   Article   2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


## 4.2 Explore News Data

### 4.2.1 Exploring News Categories

In [None]:
fig = px.pie(dataframe, names='Category',hole=0.3, title='News Category Pie Chart')
fig.show()

In [None]:
dataframe.Category.value_counts()     

Sports           511
Business         510
Politics         417
Technology       401
Entertainment    386
Name: Category, dtype: int64
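
The counts above can be turned into class shares to judge how balanced the dataset is. The following is a minimal sketch that re-enters the printed counts by hand; the variable names are illustrative only. On the real frame, `dataframe.Category.value_counts(normalize=True)` should give the same shares directly.

```python
# Counts copied from the value_counts() output above.
counts = {
    "Sports": 511,
    "Business": 510,
    "Politics": 417,
    "Technology": 401,
    "Entertainment": 386,
}

total = sum(counts.values())

# Fraction of all articles that falls into each category.
shares = {name: n / total for name, n in counts.items()}

# Ratio between the largest and the smallest class.
ratio = max(counts.values()) / min(counts.values())

for name, share in shares.items():
    print(f"{name:<15}{share:.3f}")
print(f"largest/smallest class ratio: {ratio:.2f}")
```

The largest class is only about 1.3 times the size of the smallest, so the categories are fairly balanced.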

# 5. Processing the Textual Data

## 5.1 Pre-processing Articles
- Convert to lowercase
- Expand contractions
- Lemmatize and remove stop words
- Remove ASCII punctuation (digits and symbols such as £ are kept)

In [None]:
def process_sentence(sentence, nlp_object):
    # Convert to lowercase
    sentence = sentence.lower()
    
    # Expand contractions (e.g. "don't" -> "do not")
    sentence = contractions.fix(sentence)
    
    # Lemmatization and removing stopwords
    doc = nlp_object(sentence)
    sentence = " ".join([token.lemma_ for token in doc if not token.is_stop])
    
    # Remove punctuation
    for p in string.punctuation:
        sentence = sentence.replace(p, " ")
    sentence = re.sub(r"\s+", " ", sentence) # Replace all whitespace characters with space
    
    return sentence
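
The punctuation step can be checked on its own without loading spaCy. The sketch below isolates it in a hypothetical helper, `strip_punctuation`, which is not part of the notebook. It uses `str.translate` instead of the loop, and it adds a final `.strip()` that `process_sentence` does not have.

```python
import re
import string


def strip_punctuation(text):
    """Replace each ASCII punctuation character with a space, then collapse whitespace."""
    # string.punctuation covers ASCII punctuation only, so symbols such as "£" are kept.
    table = str.maketrans({p: " " for p in string.punctuation})
    text = text.translate(table)
    # Collapse runs of whitespace; .strip() is an addition not present in process_sentence.
    return re.sub(r"\s+", " ", text).strip()


sample = "high-definition tvs, and the uk's sky+ system (dvr and pvr)."
cleaned = strip_punctuation(sample)
print(cleaned)  # high definition tvs and the uk s sky system dvr and pvr
```

Replacing the apostrophe with a space is what leaves stray single-letter tokens such as `s` in the processed articles, and `£` survives because it is not in `string.punctuation`.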

In [None]:
from tqdm.notebook import tqdm
# tqdm to see real time progress
tqdm.pandas()

nlp = spacy.load('en_core_web_sm') # English pipeline optimized for CPU

In [None]:
dataframe["processed_Article"] = dataframe["Article"].progress_apply(lambda x : process_sentence(x, nlp))


### 5.1.1 Display Articles - Pre- and Post-processing

In [None]:
dataframe[["Article","processed_Article"]].head()

,Article,processed_Article
0,tv future in the hands of viewers with home th...,tv future hand viewer home theatre system plas...
1,worldcom boss left books alone former worldc...,worldcom boss leave book worldcom boss bernie ...
2,tigers wary of farrell gamble leicester say ...,tiger wary farrell gamble leicester rush make ...
3,yeading face newcastle in fa cup premiership s...,yeade face newcastle fa cup premiership newcas...
4,ocean s twelve raids box office ocean s twelve...,ocean s raid box office ocean s crime caper se...


## 5.2 Encoding and Transforming the data
- Encoding the target variable
- Bag of Words
- TF-IDF
- Train-Test Split


### 5.2.1 Encoding the target variable

In [None]:
# LabelEncoder maps each distinct category name to an integer code.
# Classes are sorted alphabetically before codes are assigned.
label_encoder = preprocessing.LabelEncoder()

# Encode the labels in column 'Category'.
dataframe['Target_category'] = label_encoder.fit_transform(dataframe['Category'])

In [None]:
dataframe['Target_category'].value_counts()

3    511
0    510
2    417
4    401
1    386
Name: Target_category, dtype: int64

In [None]:
dataframe['Category'].value_counts()

Sports           511
Business         510
Politics         417
Technology       401
Entertainment    386
Name: Category, dtype: int64
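
Comparing the two outputs shows which integer stands for which category (for example, 3 and Sports both have 511 rows). The mapping can also be read from the encoder itself. This is a minimal sketch on a small hand-written list of labels rather than the real column; `categories` and `mapping` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

# A few labels in arbitrary order, standing in for dataframe['Category'].
categories = ["Technology", "Business", "Sports", "Sports", "Entertainment", "Politics"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# classes_ is sorted, so the label at position i is the one encoded as integer i.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts integer codes back to category names.
print(encoder.inverse_transform([3, 0]))
```

On the notebook's fitted `label_encoder`, the same `classes_` attribute and `inverse_transform` method can be used to translate model predictions back to category names.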

### 5.2.2 Bag of Words
- Bag of Words (BoW) converts a corpus into a document-term matrix: one row per document, one column per vocabulary word, and each cell holds the number of times that word occurs in that document.
- Limitations of the BoW approach:
  - It **ignores word order**, so the **meaning a word takes from its context cannot be recovered** from this representation.
  - The **intuition** that **high-frequency words are more important** or more informative **fails for stop words such as "is", "the", "an", "I"**, and also **when the corpus is context-specific**. For example, in a corpus about COVID-19, the word "coronavirus" occurs in most documents and so does little to tell them apart.
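
The word-order limitation can be seen on a toy corpus. The sketch below builds a count matrix by hand with `collections.Counter` instead of `CountVectorizer`; the three sentences and the helper `bow_vector` are made up for illustration.

```python
from collections import Counter

docs = ["dog bites man", "man bites dog", "the dog sleeps"]

# Vocabulary: every distinct token in the corpus, in sorted order.
vocab = sorted({token for doc in docs for token in doc.split()})


def bow_vector(doc):
    """Return the count of each vocabulary term in doc, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]


matrix = [bow_vector(doc) for doc in docs]

print(vocab)
for doc, row in zip(docs, matrix):
    print(row, doc)
```

The first two sentences mean different things but get identical rows, because only the counts are kept.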


In [None]:
# CountVectorizer builds the bag-of-words matrix; stop_words="english" also drops scikit-learn's built-in English stop words.
cv = CountVectorizer(stop_words="english")
bow_rep = cv.fit_transform(dataframe['processed_Article']).todense()
df = pd.DataFrame(bow_rep)
df.columns = cv.get_feature_names_out()
df.index = dataframe['processed_Article']
df

processed_Article,00,000,0001,000bn,000th,001,001and,001st,004,0051,...,zoom,zooropa,zornotza,zorro,zubair,zuluaga,zurich,zuton,zvonareva,zvyagintsev
tv future hand viewer home theatre system plasma high definition tv digital video recorder move living room way people watch tv radically different year time accord expert panel gather annual consumer electronic las vegas discuss new technology impact favourite pastime lead trend programme content deliver viewer home network cable satellite telecom company broadband service provider room portable device talk technology ce digital personal video recorder dvr pvr set box like s tivo uk s sky system allow people record store play pause forward wind tv programme want essentially technology allow personalised tv build high definition tv set big business japan slow europe lack high definition programming people forward wind advert forget abide network channel schedule put la carte entertainment network cable satellite company worried mean term advertising revenue brand identity view loyalty channel lead technology moment concern raise europe particularly grow uptake service like sky happen today month year time uk adam hume bbc broadcast s futurologist tell bbc news website like bbc issue lose advertising revenue pressing issue moment commercial uk broadcaster brand loyalty important talk content brand network brand say tim hanlon brand communication firm starcom mediavest reality broadband connection anybody producer content add challenge hard promote programme choice mean say stacey jolna senior vice president tv guide tv group way people find content want watch simplify tv viewer mean network term channel leaf google s book search engine future instead scheduler help people find want watch kind channel model work young ipod generation take control gadget play suit panel recognise old generation comfortable familiar schedule channel brand know get want choice hand mr hanlon suggest end kid diaper push button possible available say mr hanlon ultimately consumer tell market want 50 000 new gadget technology showcase ce enhance tv watch experience high definition tv set 
new model lcd liquid crystal display tv launch dvr capability build instead external box example launch humax s 26 inch lcd tv 80 hour tivo dvr dvd recorder s big satellite tv company directtv launch brand dvr 100 hour recording capability instant replay search function set pause rewind tv 90 hour microsoft chief bill gate announce pre keynote speech partnership tivo call tivotogo mean people play record programme windows pc mobile device reflect increase trend free multimedia people watch want want,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
worldcom boss leave book worldcom boss bernie ebbers accuse oversee 11bn £ 5 8bn fraud accounting decision witness tell juror david myers comment question defence lawyer argue mr ebber responsible worldcom s problem phone company collapse 2002 prosecutor claim loss hide protect firm s share mr myers plead guilty fraud assist prosecutor monday defence lawyer reid weingarten try distance client allegation cross examination ask mr myers know mr ebber accounting decision aware mr myers reply know mr ebber accounting entry worldcom book mr weingarten press reply witness mr myers admit order false accounting entry request worldcom chief financial officer scott sullivan defence lawyer try paint mr sullivan admit fraud testify later trial mastermind worldcom s accounting house card mr ebber team look portray affable boss admission pe graduate economist ability mr ebber transform worldcom relative unknown 160bn telecom giant investor darling late 1990 worldcom s problem mount competition increase telecom boom petere firm finally collapse shareholder lose 180bn 20 000 worker lose job mr ebber trial expect month find guilty ceo face substantial jail sentence firmly declare innocence,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tiger wary farrell gamble leicester rush make bid andy farrell great britain rugby league captain decide switch code anybody involve process way away go stage tiger boss john wells tell bbc radio leicester moment lot unknown andy farrell medical situation go big big gamble farrell persistent knee problem operation knee week ago expect month leicester saracen believe head list rugby union club interest sign farrell decide 15 man game union well believe better play back initially m sure step league union involve centre say well think england prefer progress position row use rugby league skill forward jury cross divide club balance strike cost gamble option bring ready replacement,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
yeade face newcastle fa cup premiership newcastle united face trip ryman premier league leader yeade fa cup round game arguably highlight draw potential money spinner non league yeading beat slough second round conference exeter city knock doncaster saturday travel old trafford meet holder manchester united january arsenal draw home stoke chelsea play host scunthorpe non league draw hinckley united hold brentford goalless draw sunday meet league leader luton win replay martin allen s team griffin park number premiership team face difficult away game championship side weekend 8 9 january place everton visit plymouth liverpool travel burnley crystal palace sunderland fulham face carle cup semi finalists watford bolton meet ipswich aston villa draw sheffield united premiership struggler norwich blackburn west brom away west ham cardiff preston north end respectively southampton visit northampton having beat league carling cup early season middlesbrough draw away swindon notts county spur entertain brighton white hart lane arsenal v stoke swindon notts co v middlesbrough man utd v exeter plymouth v everton leicester v blackpool derby v wigan sunderland v crystal palace wolve v millwall yeade v newcastle hull v colchester tottenham v brighton read v stockport swansea birmingham v leed hartlepool v boston milton keynes don v peterborough oldham v man city chelsea v scunthorpe cardiff v blackburn charlton v rochdale west ham v norwich sheff utd v aston villa preston v west brom rotherham v yeovil burnley v liverpool bournemouth v chester coventry v crewe watford v fulham ipswich v bolton portsmouth v gillingham northampton v southampton qpr v nottm forest luton v hinckley brentford match play weekend 8 9 january,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ocean s raid box office ocean s crime caper sequel star george clooney brad pitt julia roberts go straight number box office chart take 40 8 m £ 21 m weekend ticket sale accord studio estimate sequel follow master criminal try pull major heist europe knock week s number national treasure place wesley snipe blade trinity second take 16 1 m £ 8 4 m round animate fable polar express star tom hank festive comedy christmas krank ocean s box office triumph mark fourth big opening december release film lord ring trilogy sequel narrowly beat 2001 predecessor ocean s take 38 1 m £ 19 8 m opening weekend 184 m £ 95 8 m total remake 1960s film star frank sinatra rat pack ocean s direct oscar win director steven soderbergh soderbergh return direct hit sequel reunite clooney pitt roberts matt damon andy garcia elliott gould catherine zeta jones join star cast s fun good holiday movie say dan fellman president distribution warner bros critic complimentary 110 m £ 57 2 m project los angeles time label dispirit vanity project milder review new york times dub sequel unabashedly trivial,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
car pull retail figure retail sale fall 0 3 january big monthly decline august drive heavy fall car sale 3 3 fall car sale expect come december s 4 rise car sale fuel generous pre christmas special offer exclude car sector retail sale 0 6 january twice analyst expect retail spending expect rise 2005 quickly 2004 steve gallagher chief economist sg corporate investment banking say january s figure decent number see number see second half 2004 pretty healthy add sale appliance electronic store 0 6 january sale hardware store drop 0 3 furniture store sale dip 0 1 sale clothing clothing accessory store jump 1 8 sale general merchandise store category include department store rise 0 9 strong gain consumer spend gift voucher give christmas sale restaurant bar coffee house rise 0 3 grocery store sale 0 5 december overall retail sale rise 1 1 exclude car sector sale rise 0 3 parul jain deputy chief economist nomura securities international say consumer spending continue rise 2005 slow rate growth 2004 consumer continue retain strength quarter say van rourke bond strategist popular security agree late retail sale figure slightly strong expect,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
kilroy unveils immigration policy ex chatshow host robert kilroy silk attack uk policy immigration say britain s open door approach hit low wage indigenous worker verita leader say people benefit immigrant place like poland employer landlord member metropolitan elite mep say party admit foreigner require specific skill offer argue asylum cost £ 2bn year 14 000 successful applicant mr kilroy silk say work £ 143 000 successful asylum seeker say verita want grant amnesty britain claim asylum child deport britain fair share asylum seeker united nations convention human right argue mr kilroy silk say want spend extra £ 500 m year help provide refugee abroad,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
rem announce new glasgow concert band rem announce plan perform 10 000 scottish fan rescheduled gig band play dub europe s big tent glasgow green tuesday 14 june force pull concert secc glasgow month bassist mike mills contract flu fan buy ticket original 22 february attend rescheduled concert june gig act warm rem s open air concert balloch castle country park bank loch lomond day later promoter regular music book glasgow green secc available suitable date mark mackie director regular music say fantastic news show rem s commitment scottish fan come glasgow truly unique gig rem gig kick start promise memorable summer scottish music lover grammy award winner u2 play hampden 21 june oasis perform national stadium glasgow 29 june coldplay announce concert bellahouston park glasgow 1 july t park hold balado near kinross 9 10 july ticketweb secc box office write customer buy ticket february gig ask want attend new buy ticket person urge return point purchase concert give refund cut date swap ticket 1 april remain sale public,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
political squabble snowball s commonplace argue blair brown like squabbling school kid supporter need grow stop bicker analysis fact get wrong s child fight adult solid reason trivial argument mature protagonist hard stop got go key feature endless feud agree d well end want word participant genuinely want row stop think worth prolong argument tiny bit ensure view hear successive attempt end argument word ensure argument go case mr blair mr brown successive book publish ensure issue die isn t participant stupid s actually individual behave entirely rationally give incentive face s piece economic theory explain obscure post neo classical endogenous growth theory chancellor quote ubiquitous piece game theory respectable policy wonk familiar s refer prisoner s dilemma base parable tell economics degree course sheriff prisoner story go prisoner jointly charge heinous crime lock separate cell sheriff desperately need confession provide evidence convict crime confession prisoner minimal sentence trump charge clearly prisoner good strategy mouth shut short sentence clever sheriff idea induce talk tell prisoner separately confess confess ll let crime tell don t confess confess ll life prisoner confront choice good bet confess partner doesn t confess ll completely partner confess d well confess ensure don t life result course prisoner confess sheriff let prisoner individual logic behave way well agree shut don t worry don t entirely follow look google 283 000 entry prisoner dilemma ramification truly capture economist couple decade parable describe situation obvious sensible choice take collectively rational choice individually behave selfishly cold war arm race example classic case russia america well arm lot arm long want arm arm race ensue result individually logical decision buy arm result arm level high economic tell prisoner dilemma repeat experience time s hard escape perverse logic s good exhort people stop buy arm stop argue incentive encourage carry incentive 
change case labour party believe rift blair brown camp bad report suggest solomon s wisdom need deploy solve problem parent know ingenious solution argument solution affect incentive participant example famous rule divide choose way allocate piece cake slice greedy child case apparently endless argument want come end ensure person word lose win row cost prolong row briefing book matter exceed benefit have word get point rest party enforce ll protagonist retreat pretty quickly,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
dataframe['Article'].iloc[0]

'tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high

In [None]:
dataframe['processed_Article'].iloc[0]

'tv future hand viewer home theatre system plasma high definition tv digital video recorder move living room way people watch tv radically different year time accord expert panel gather annual consumer electronic las vegas discuss new technology impact favourite pastime lead trend programme content deliver viewer home network cable satellite telecom company broadband service provider room portable device talk technology ce digital personal video recorder dvr pvr set box like s tivo uk s sky system allow people record store play pause forward wind tv programme want essentially technology allow personalised tv build high definition tv set big business japan slow europe lack high definition programming people forward wind advert forget abide network channel schedule put la carte entertainment network cable satellite company worried mean term advertising revenue brand identity view loyalty channel lead technology moment concern raise europe particularly grow uptake service like sky happen 

### 5.2.3 TF-IDF for vectorizing the data
- Term frequency (TF) measures how often a word (or token) occurs in a document.

  - Words (or tokens) that occur more often in a document have a higher term frequency. It is calculated for every word in a document.

  - There are several ways to define term frequency. One of the most common formulations divides the count of term $t$ in document $d$ by the total number of terms in $d$:

$$TF(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$
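
The formula can be sketched in plain Python. This is a minimal illustration, assuming whitespace tokenization and a hypothetical helper `term_frequency`; it is not how `TfidfVectorizer` computes its values.

```python
from collections import Counter


def term_frequency(tokens):
    """TF(t, d): count of each term divided by the total number of tokens in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}


# The first few tokens of the first processed article, used as a tiny document.
tokens = "tv future hand viewer home theatre system plasma high definition tv".split()
tf = term_frequency(tokens)

print(round(tf["tv"], 4))      # "tv" occurs 2 times out of 11 tokens
print(round(tf["plasma"], 4))  # "plasma" occurs once
```

With its default settings, `TfidfVectorizer` uses the raw count as the term frequency, multiplies it by a smoothed inverse document frequency, `ln((1 + n) / (1 + df(t))) + 1`, and then L2-normalises each row, so its output values will differ from this normalised TF.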

In [None]:
tf_idf_vectorizer = TfidfVectorizer()
tf_idf_rep = tf_idf_vectorizer.fit_transform(dataframe['processed_Article']).todense()
df = pd.DataFrame(tf_idf_rep)
df.columns = tf_idf_vectorizer.get_feature_names_out()
df.index = dataframe['processed_Article']
display(df)

processed_Article,00,000,0001,000bn,000th,001,001and,001st,004,0051,...,zoom,zooropa,zornotza,zorro,zubair,zuluaga,zurich,zuton,zvonareva,zvyagintsev
tv future hand viewer home theatre system plasma high definition tv digital video recorder move living room way people watch tv radically different year time accord expert panel gather annual consumer electronic las vegas discuss new technology impact favourite pastime lead trend programme content deliver viewer home network cable satellite telecom company broadband service provider room portable device talk technology ce digital personal video recorder dvr pvr set box like s tivo uk s sky system allow people record store play pause forward wind tv programme want essentially technology allow personalised tv build high definition tv set big business japan slow europe lack high definition programming people forward wind advert forget abide network channel schedule put la carte entertainment network cable satellite company worried mean term advertising revenue brand identity view loyalty channel lead technology moment concern raise europe particularly grow uptake service like sky happen today month year time uk adam hume bbc broadcast s futurologist tell bbc news website like bbc issue lose advertising revenue pressing issue moment commercial uk broadcaster brand loyalty important talk content brand network brand say tim hanlon brand communication firm starcom mediavest reality broadband connection anybody producer content add challenge hard promote programme choice mean say stacey jolna senior vice president tv guide tv group way people find content want watch simplify tv viewer mean network term channel leaf google s book search engine future instead scheduler help people find want watch kind channel model work young ipod generation take control gadget play suit panel recognise old generation comfortable familiar schedule channel brand know get want choice hand mr hanlon suggest end kid diaper push button possible available say mr hanlon ultimately consumer tell market want 50 000 new gadget technology showcase ce enhance tv watch experience high definition tv set 
new model lcd liquid crystal display tv launch dvr capability build instead external box example launch humax s 26 inch lcd tv 80 hour tivo dvr dvd recorder s big satellite tv company directtv launch brand dvr 100 hour recording capability instant replay search function set pause rewind tv 90 hour microsoft chief bill gate announce pre keynote speech partnership tivo call tivotogo mean people play record programme windows pc mobile device reflect increase trend free multimedia people watch want want,0.0,0.019421,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
worldcom boss leave book worldcom boss bernie ebbers accuse oversee 11bn £ 5 8bn fraud accounting decision witness tell juror david myers comment question defence lawyer argue mr ebber responsible worldcom s problem phone company collapse 2002 prosecutor claim loss hide protect firm s share mr myers plead guilty fraud assist prosecutor monday defence lawyer reid weingarten try distance client allegation cross examination ask mr myers know mr ebber accounting decision aware mr myers reply know mr ebber accounting entry worldcom book mr weingarten press reply witness mr myers admit order false accounting entry request worldcom chief financial officer scott sullivan defence lawyer try paint mr sullivan admit fraud testify later trial mastermind worldcom s accounting house card mr ebber team look portray affable boss admission pe graduate economist ability mr ebber transform worldcom relative unknown 160bn telecom giant investor darling late 1990 worldcom s problem mount competition increase telecom boom petere firm finally collapse shareholder lose 180bn 20 000 worker lose job mr ebber trial expect month find guilty ceo face substantial jail sentence firmly declare innocence,0.0,0.024302,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
tiger wary farrell gamble leicester rush make bid andy farrell great britain rugby league captain decide switch code anybody involve process way away go stage tiger boss john wells tell bbc radio leicester moment lot unknown andy farrell medical situation go big big gamble farrell persistent knee problem operation knee week ago expect month leicester saracen believe head list rugby union club interest sign farrell decide 15 man game union well believe better play back initially m sure step league union involve centre say well think england prefer progress position row use rugby league skill forward jury cross divide club balance strike cost gamble option bring ready replacement,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
yeade face newcastle fa cup premiership newcastle united face trip ryman premier league leader yeade fa cup round game arguably highlight draw potential money spinner non league yeading beat slough second round conference exeter city knock doncaster saturday travel old trafford meet holder manchester united january arsenal draw home stoke chelsea play host scunthorpe non league draw hinckley united hold brentford goalless draw sunday meet league leader luton win replay martin allen s team griffin park number premiership team face difficult away game championship side weekend 8 9 january place everton visit plymouth liverpool travel burnley crystal palace sunderland fulham face carle cup semi finalists watford bolton meet ipswich aston villa draw sheffield united premiership struggler norwich blackburn west brom away west ham cardiff preston north end respectively southampton visit northampton having beat league carling cup early season middlesbrough draw away swindon notts county spur entertain brighton white hart lane arsenal v stoke swindon notts co v middlesbrough man utd v exeter plymouth v everton leicester v blackpool derby v wigan sunderland v crystal palace wolve v millwall yeade v newcastle hull v colchester tottenham v brighton read v stockport swansea birmingham v leed hartlepool v boston milton keynes don v peterborough oldham v man city chelsea v scunthorpe cardiff v blackburn charlton v rochdale west ham v norwich sheff utd v aston villa preston v west brom rotherham v yeovil burnley v liverpool bournemouth v chester coventry v crewe watford v fulham ipswich v bolton portsmouth v gillingham northampton v southampton qpr v nottm forest luton v hinckley brentford match play weekend 8 9 january,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ocean s raid box office ocean s crime caper sequel star george clooney brad pitt julia roberts go straight number box office chart take 40 8 m £ 21 m weekend ticket sale accord studio estimate sequel follow master criminal try pull major heist europe knock week s number national treasure place wesley snipe blade trinity second take 16 1 m £ 8 4 m round animate fable polar express star tom hank festive comedy christmas krank ocean s box office triumph mark fourth big opening december release film lord ring trilogy sequel narrowly beat 2001 predecessor ocean s take 38 1 m £ 19 8 m opening weekend 184 m £ 95 8 m total remake 1960s film star frank sinatra rat pack ocean s direct oscar win director steven soderbergh soderbergh return direct hit sequel reunite clooney pitt roberts matt damon andy garcia elliott gould catherine zeta jones join star cast s fun good holiday movie say dan fellman president distribution warner bros critic complimentary 110 m £ 57 2 m project los angeles time label dispirit vanity project milder review new york times dub sequel unabashedly trivial,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
car pull retail figure retail sale fall 0 3 january big monthly decline august drive heavy fall car sale 3 3 fall car sale expect come december s 4 rise car sale fuel generous pre christmas special offer exclude car sector retail sale 0 6 january twice analyst expect retail spending expect rise 2005 quickly 2004 steve gallagher chief economist sg corporate investment banking say january s figure decent number see number see second half 2004 pretty healthy add sale appliance electronic store 0 6 january sale hardware store drop 0 3 furniture store sale dip 0 1 sale clothing clothing accessory store jump 1 8 sale general merchandise store category include department store rise 0 9 strong gain consumer spend gift voucher give christmas sale restaurant bar coffee house rise 0 3 grocery store sale 0 5 december overall retail sale rise 1 1 exclude car sector sale rise 0 3 parul jain deputy chief economist nomura securities international say consumer spending continue rise 2005 slow rate growth 2004 consumer continue retain strength quarter say van rourke bond strategist popular security agree late retail sale figure slightly strong expect,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
kilroy unveils immigration policy ex chatshow host robert kilroy silk attack uk policy immigration say britain s open door approach hit low wage indigenous worker verita leader say people benefit immigrant place like poland employer landlord member metropolitan elite mep say party admit foreigner require specific skill offer argue asylum cost £ 2bn year 14 000 successful applicant mr kilroy silk say work £ 143 000 successful asylum seeker say verita want grant amnesty britain claim asylum child deport britain fair share asylum seeker united nations convention human right argue mr kilroy silk say want spend extra £ 500 m year help provide refugee abroad,0.0,0.084371,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
rem announce new glasgow concert band rem announce plan perform 10 000 scottish fan rescheduled gig band play dub europe s big tent glasgow green tuesday 14 june force pull concert secc glasgow month bassist mike mills contract flu fan buy ticket original 22 february attend rescheduled concert june gig act warm rem s open air concert balloch castle country park bank loch lomond day later promoter regular music book glasgow green secc available suitable date mark mackie director regular music say fantastic news show rem s commitment scottish fan come glasgow truly unique gig rem gig kick start promise memorable summer scottish music lover grammy award winner u2 play hampden 21 june oasis perform national stadium glasgow 29 june coldplay announce concert bellahouston park glasgow 1 july t park hold balado near kinross 9 10 july ticketweb secc box office write customer buy ticket february gig ask want attend new buy ticket person urge return point purchase concert give refund cut date swap ticket 1 april remain sale public,0.0,0.026081,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
political squabble snowball s commonplace argue blair brown like squabbling school kid supporter need grow stop bicker analysis fact get wrong s child fight adult solid reason trivial argument mature protagonist hard stop got go key feature endless feud agree d well end want word participant genuinely want row stop think worth prolong argument tiny bit ensure view hear successive attempt end argument word ensure argument go case mr blair mr brown successive book publish ensure issue die isn t participant stupid s actually individual behave entirely rationally give incentive face s piece economic theory explain obscure post neo classical endogenous growth theory chancellor quote ubiquitous piece game theory respectable policy wonk familiar s refer prisoner s dilemma base parable tell economics degree course sheriff prisoner story go prisoner jointly charge heinous crime lock separate cell sheriff desperately need confession provide evidence convict crime confession prisoner minimal sentence trump charge clearly prisoner good strategy mouth shut short sentence clever sheriff idea induce talk tell prisoner separately confess confess ll let crime tell don t confess confess ll life prisoner confront choice good bet confess partner doesn t confess ll completely partner confess d well confess ensure don t life result course prisoner confess sheriff let prisoner individual logic behave way well agree shut don t worry don t entirely follow look google 283 000 entry prisoner dilemma ramification truly capture economist couple decade parable describe situation obvious sensible choice take collectively rational choice individually behave selfishly cold war arm race example classic case russia america well arm lot arm long want arm arm race ensue result individually logical decision buy arm result arm level high economic tell prisoner dilemma repeat experience time s hard escape perverse logic s good exhort people stop buy arm stop argue incentive encourage carry incentive 
change case labour party believe rift blair brown camp bad report suggest solomon s wisdom need deploy solve problem parent know ingenious solution argument solution affect incentive participant example famous rule divide choose way allocate piece cake slice greedy child case apparently endless argument want come end ensure person word lose win row cost prolong row briefing book matter exceed benefit have word get point rest party enforce ll protagonist retreat pretty quickly,0.0,0.016557,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
dataframe['Article'].iloc[20]



In [None]:
dataframe['processed_Article'].iloc[20]

'security warn fbi virus federal bureau investigation warn computer virus spread e mail purport fbi e mail come fbi gov address tell recipient access illegal website message warn internet use monitor fbi s internet fraud complaint center attachment e mail contain virus fbi say message ask recipient click attachment answer question internet use questionnaire attachment contain virus infect recipient s computer accord agency clear virus infect computer user warn open attachment unsolicited e mail people know recipient similar solicitation know fbi engage practice send unsolicited e mail public manner fbi say statement bureau investigate phoney e mail agency early month shut fbi gov account communicate public security breach spokeswoman say incident appear unrelated '

### 5.2.4 Train-test split


In [None]:
df = dataframe[["processed_Article","Target_category"]]
X = df.processed_Article
y = df.Target_category
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
X_train.shape

(1668,)

In [None]:
X_test.shape

(557,)
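
The two shapes follow from how `train_test_split` handles a fractional `test_size`: the test count is rounded up and the training set receives the remainder. A quick check of that arithmetic, assuming the 2,225 articles reported in the questionnaire in section 8:

```python
import math

n_samples = 2225      # total articles, per the questionnaire in section 8
test_size = 0.25      # same fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # 556.25 is rounded up
n_train = n_samples - n_test                # training set takes the rest

print(n_train, n_test)   # 1668 557
```

This is why the split is not an exact 75:25 ratio: 2,225 is not divisible by 4.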

# 6. Model Training & Evaluation


## 6.1 Simple Approach

### 6.1.1 Naive Bayes Model

In [None]:
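from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Token counts -> TF-IDF weights -> multinomial Naive Bayes
nb = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB()),
              ])
nb.fit(X_train, y_train)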
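
The first two pipeline steps can be sketched by hand to show what the classifier actually receives. This is a simplified illustration, assuming the default `TfidfTransformer` settings (smoothed idf, L2 row normalisation) and a plain whitespace tokenizer instead of `CountVectorizer`'s own token pattern. The three-document corpus is made up for the example.

```python
import math
from collections import Counter

docs = ["cup final win", "film box office win", "cup draw"]  # toy corpus, not from the dataset

# Step 1 ('vect'): raw term counts per document
counts = [Counter(doc.split()) for doc in docs]
vocab = sorted(set(term for c in counts for term in c))

# Step 2 ('tfidf'): smoothed inverse document frequency per term
n_docs = len(docs)
df = {t: sum(1 for c in counts if t in c) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(c):
    """Weight counts by idf, then scale the row to unit L2 length."""
    raw = [c[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_row(c) for c in counts]
for term, weight in zip(vocab, tfidf[0]):
    print(f"{term:8s} {weight:.3f}")
```

In the first document, "final" occurs in only one document and so ends up with a larger weight than "cup" or "win", which each occur in two.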

### 6.1.2 Evaluate Model

In [None]:
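from sklearn.metrics import accuracy_score, classification_report

y_pred = nb.predict(X_test)

# scikit-learn metrics take the true labels first, then the predictions
print('accuracy %s' % accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))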

accuracy 0.9676840215439856
              precision    recall  f1-score   support

           0       0.97      0.96      0.96       136
           1       1.00      0.92      0.96        96
           2       0.92      0.99      0.96        98
           3       0.98      1.00      0.99       124
           4       0.96      0.97      0.97       103

    accuracy                           0.97       557
   macro avg       0.97      0.97      0.97       557
weighted avg       0.97      0.97      0.97       557
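
Each row of the report is derived from three counts for that class: true positives, false positives, and false negatives. The sketch below recomputes them in plain Python. The helper name `per_class_metrics` and the toy labels are illustrative only and are not taken from the notebook's test set.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn   # support = number of true samples of this class

# Hypothetical toy labels for illustration
y_true_toy = [0, 0, 1, 1, 1, 2]
y_pred_toy = [0, 1, 1, 1, 2, 2]
for label in sorted(set(y_true_toy)):
    print(label, per_class_metrics(y_true_toy, y_pred_toy, label))
```

Read this way, the class 1 row above (precision 1.00, recall 0.92) says that every article predicted as class 1 really was class 1, while roughly 8% of the true class 1 articles were assigned to another class.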



## 6.2 Functionalized Code

In [None]:
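def text_classification_model(classifier_model):
    """Fit a count -> TF-IDF -> classifier pipeline and print its test report.

    Relies on the notebook-level X_train, X_test, y_train and y_test.
    """
    model = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', classifier_model),
                     ])
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))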
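
One possible variation, shown only as a sketch: pass the data in as arguments and return the fitted pipeline together with its accuracy, so results can be stored and compared instead of only printed. The name `fit_and_score` and the toy sentences are hypothetical and not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def fit_and_score(classifier, X_tr, y_tr, X_te, y_te):
    """Hypothetical variant of text_classification_model that returns its results."""
    model = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', classifier)])
    model.fit(X_tr, y_tr)
    return model, accuracy_score(y_te, model.predict(X_te))

# Toy data for illustration only (0 = sport, 1 = entertainment)
toy_train = ["cup final goal", "league cup win", "film box office", "oscar film star"]
toy_labels = [0, 0, 1, 1]
toy_test = ["cup goal", "film star"]

model, acc = fit_and_score(MultinomialNB(), toy_train, toy_labels, toy_test, [0, 1])
print(acc)
```

With the notebook's data the call would be `fit_and_score(dt_model, X_train, y_train, X_test, y_test)`.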

### 6.2.1 Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(random_state=42)
text_classification_model(dt_model)

              precision    recall  f1-score   support

           0       0.82      0.76      0.79       136
           1       0.64      0.72      0.68        96
           2       0.72      0.81      0.76        98
           3       0.93      0.90      0.91       124
           4       0.81      0.75      0.78       103

    accuracy                           0.79       557
   macro avg       0.79      0.79      0.79       557
weighted avg       0.80      0.79      0.79       557



### 6.2.2 Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=7)
text_classification_model(knn_model)

              precision    recall  f1-score   support

           0       0.95      0.89      0.92       136
           1       0.97      0.90      0.93        96
           2       0.86      0.92      0.89        98
           3       0.96      1.00      0.98       124
           4       0.94      0.97      0.96       103

    accuracy                           0.94       557
   macro avg       0.93      0.93      0.93       557
weighted avg       0.94      0.94      0.94       557



### 6.2.3 Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=42)
text_classification_model(rf_model)

              precision    recall  f1-score   support

           0       0.90      0.96      0.93       136
           1       0.96      0.94      0.95        96
           2       0.94      0.93      0.93        98
           3       0.97      0.99      0.98       124
           4       0.99      0.89      0.94       103

    accuracy                           0.95       557
   macro avg       0.95      0.94      0.95       557
weighted avg       0.95      0.95      0.95       557



# 7. Observe and comment on the performances of all the models used
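
A minimal way to line up the results, using the accuracy values printed in the classification reports in section 6. The numbers are copied by hand from those outputs and rounded to two decimals. They are not recomputed here.

```python
# Accuracy values copied from the reports in section 6
reported_accuracy = {
    "Naive Bayes": 0.97,
    "Decision Tree": 0.79,
    "Nearest Neighbors": 0.94,
    "Random Forest": 0.95,
}

# Sort from best to worst accuracy
ranking = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    print(f"{name:18s} {acc:.2f}")
```

On this split Naive Bayes ranks first, Random Forest and Nearest Neighbors follow closely, and the single Decision Tree trails by a wide margin. These figures come from one train-test split, so small gaps such as 0.95 versus 0.94 may not be meaningful.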

# 8. Question & Answers
Questionnaire:
- How many news articles are present in the dataset that we have?
  - 2225
- Most of the news articles are from _____ category.
  - Sports category (511 articles)
- Only ___ articles belong to the ‘Technology’ category.
  - 401
- What are Stop Words and why should they be removed from the text data?
  - Stop words are very common words (such as "the", "is", and "and" in English) that appear in almost every document. Every language has its own set.
  - By removing stop words, we **remove low-information tokens** from the text so that the model can **focus on the words that distinguish one category from another**.
- Explain the difference between Stemming and Lemmatization.
  - **Stemming** **strips the last few characters from a word** using fixed rules, which often **produces a stem that is misspelled or has a different meaning**.
  - **Lemmatization uses vocabulary and part-of-speech information to reduce a word to its meaningful base form (lemma)**.
  - **Lemmatization is generally more accurate than stemming**.
  - **Lemmatization is preferred when context matters**, whereas stemming is adequate when context is not important.
  - Stemming is faster than lemmatization.
- Which technique, Bag of Words or TF-IDF, is considered more effective?
  - TF-IDF is generally more effective because it gives more weight to rare, distinctive words and down-weights words that are common across documents.
    - TF-IDF
      - It computes how important a word is to a document relative to the whole corpus.
      - It reduces the effect of stop words on a document's feature vector, because words that occur in most documents receive low weights.
    - Bag of Words
      - It computes the frequency (or presence) of a word in a document of a corpus.
      - It weights a word by its count alone, so it cannot distinguish stop words from informative words.
- What are the shapes of the train and test data sets after performing a 75:25 split?
  - Train data shape - (1668,)
  - Test data shape - (557,) 
- Which of the following is found to be the best performing model?
  - a. Random Forest
  - b. Nearest Neighbors
  - c. Naive Bayes
  - Ans: **c. Naive Bayes performs best** (accuracy 0.97, against 0.95 for Random Forest and 0.94 for Nearest Neighbors)
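- For this particular use case, precision and recall are equally important. (T/F)
  - True. Precision is the share of articles predicted as a category that truly belong to it, while recall is the share of articles in a category that the model finds. In news categorization no single kind of error is obviously costlier than the other, so both matter equally.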
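
The stemming versus lemmatization answer above can be illustrated with a toy sketch. Both functions are hypothetical stand-ins: `crude_stem` strips a few suffixes (it is not the Porter algorithm) and `lookup_lemma` uses a tiny hand-written dictionary (it is not WordNet).

```python
SUFFIXES = ("ing", "ies", "ed", "es", "s")   # hypothetical suffix list

def crude_stem(word):
    """Strip the first matching suffix, without checking that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary for illustration only
LEMMAS = {"studies": "study", "better": "good", "was": "be", "running": "run"}

def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

for w in ["studies", "running", "better", "was"]:
    print(f"{w:8s} stem={crude_stem(w):8s} lemma={lookup_lemma(w)}")
```

The stemmer is fast but turns "studies" into "stud" and "running" into "runn", while the dictionary lookup returns real words and can handle irregular forms such as "better" and "was".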