# Text classification: News Category Prediction

### [Hackathon Link](https://www.machinehack.com/course/predict-the-news-category-hackathon/)

### Problem Statement with information about dataset
Use Natural Language Processing to predict, from the story text, which genre or category a piece of news falls into.

Size of training set: 7,628 records
Size of test set: 2,748 records

FEATURES:

STORY:  A part of the main content of the article to be published as a piece of news.
SECTION: The genre/category the STORY falls in.

Each story falls into one of four distinct sections, labelled as follows:

Politics: 0  
Technology: 1  
Entertainment: 2  
Business: 3  
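
The label mapping above can be kept in a small lookup table so that numeric predictions can be decoded later. This is a minimal sketch; the names `SECTION_NAMES` and `decode_sections` are illustrative and are not part of the original notebook.

```python
# Mapping copied from the problem statement; the helper name is hypothetical.
SECTION_NAMES = {0: "Politics", 1: "Technology", 2: "Entertainment", 3: "Business"}

def decode_sections(labels):
    """Translate numeric SECTION labels into readable category names."""
    return [SECTION_NAMES[int(label)] for label in labels]

# The first four training labels shown later in the notebook are 3, 0, 3, 1
decoded = decode_sections([3, 0, 3, 1])
print(decoded)
```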

### Libraries import

In [None]:
#Suppress warnings in the console
import warnings
warnings.filterwarnings('ignore')

#Standard library
import string

#Default data and plotting libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#NLP-specific libraries
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

#Evaluation metrics
from sklearn.metrics import classification_report

#Download the following NLTK resources once
# nltk.download('stopwords')
# nltk.download('wordnet')

#Model libraries
from sklearn.naive_bayes import MultinomialNB

### Import Data
#### Download: [MachineHack Site](https://www.machinehack.com/wp-content/uploads/2019/07/Participants_Data_News_category-20190729T063600Z-001.zip)

In [None]:
input_train_df = pd.read_excel('./sample_data/Data_Train.xlsx')

### Explore DataSet

In [None]:
input_train_df.head()

,STORY,SECTION
0,But the most painful was the huge reversal in ...,3
1,How formidable is the opposition alliance amon...,0
2,Most Asian currencies were trading lower today...,3
3,"If you want to answer any question, click on ‘...",1
4,"In global markets, gold prices edged up today ...",3


In [None]:
input_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7628 entries, 0 to 7627
Data columns (total 2 columns):
STORY      7628 non-null object
SECTION    7628 non-null int64
dtypes: int64(1), object(1)
memory usage: 119.3+ KB


In [None]:
#Printing the shape of the dataset
print(input_train_df.shape)

(7628, 2)


In [None]:
#Printing the group by description of each category
input_train_df.groupby("SECTION").describe()

SECTION,STORY count,STORY unique,STORY top,STORY freq
0,1686,1673,This story has been published from a wire agen...,4
1,2772,2731,This story has been published from a wire agen...,13
2,1924,1914,We will leave no stone unturned to make the au...,3
3,1246,1233,This story has been published from a wire agen...,11


In [None]:
#Check for null values: number of missing entries per column
input_train_df.isnull().sum()

STORY      0
SECTION    0
dtype: int64

In [None]:
#Removing exact duplicate rows so repeated stories are not over-weighted during training
input_train_df.drop_duplicates(inplace = True)
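
Called without arguments, `drop_duplicates` only removes rows that match in every column, so a story that appears under two different `SECTION` labels is kept. (The bag-of-words matrix built later has 7,551 rows, so 77 of the 7,628 rows were dropped here.) The sketch below uses a hypothetical four-row frame to contrast the default behaviour with de-duplicating on `STORY` alone and to list stories with conflicting labels.

```python
import pandas as pd

# Toy frame (hypothetical data): rows 0 and 1 are exact duplicates,
# rows 2 and 3 share a STORY but carry different SECTION labels.
toy_df = pd.DataFrame({
    "STORY": ["wire story", "wire story", "budget news", "budget news"],
    "SECTION": [1, 1, 3, 0],
})

exact_dedup = toy_df.drop_duplicates()                  # compares all columns
story_dedup = toy_df.drop_duplicates(subset=["STORY"])  # compares STORY only

# Stories that appear under more than one label
label_counts = toy_df.groupby("STORY")["SECTION"].nunique()
conflicting_stories = label_counts[label_counts > 1].index.tolist()

print(len(exact_dedup), len(story_dedup), conflicting_stories)
```

Whether conflicting rows should be dropped or kept is a modelling decision that the notebook does not address.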

### EDA
Code resources:      
[Kaggle notebook on EDA](https://www.kaggle.com/amokrane/eda-and-text-classification-with-scikit-learn)

In [None]:
# Bar chart of the target column
# fig, ax = plt.subplots(figsize=(10, 7))
# sections_class = input_train_df["SECTION"].value_counts()
# sections_class.plot(kind='bar', ax=ax)
# plt.title('Bar chart of news categories')
# plt.show()

In [None]:
# print("Percentage distribution of target column categories:-")
# (input_train_df.groupby('SECTION').size()/input_train_df['SECTION'].count())*100
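
The percentage distribution can also be reproduced from the per-category counts in the `groupby` output above, without loading the data. This is a small sketch; the counts are the pre-deduplication values from that table, and the variable names are illustrative.

```python
import pandas as pd

# Story counts per SECTION, copied from the groupby table (before de-duplication)
section_counts = pd.Series({0: 1686, 1: 2772, 2: 1924, 3: 1246}, name="count")

# Share of each category as a percentage of all stories
section_pct = (section_counts / section_counts.sum() * 100).round(2)
print(section_pct)

# Ratio between the largest and smallest class, a rough imbalance indicator
imbalance_ratio = section_counts.max() / section_counts.min()
print(round(imbalance_ratio, 2))
```

Technology (label 1) is the largest class and Business (label 3) the smallest, which is worth keeping in mind when reading per-class metrics.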

In [None]:
#Divide target column into individual categories
# .copy() avoids pandas' SettingWithCopyWarning when columns are added later
# politics = input_train_df[input_train_df["SECTION"] == 0].copy()
# technology = input_train_df[input_train_df["SECTION"] == 1].copy()
# entertainment = input_train_df[input_train_df["SECTION"] == 2].copy()
# business = input_train_df[input_train_df["SECTION"] == 3].copy()

In [None]:
#Get the length (in characters) of the news stories in each category
# politics["news_length"] = politics.STORY.apply(len)
# technology["news_length"] = technology.STORY.apply(len)
# entertainment["news_length"] = entertainment.STORY.apply(len)
# business["news_length"] = business.STORY.apply(len)

**Distribution of story length per category**  
More info on dist plot: [TowardsDataScience](https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0)

In [None]:
# fig = plt.figure(figsize=(11.7,8.27))
# sns.distplot(politics.news_length, hist=True, label="politics")
# sns.distplot(technology.news_length, hist=True, label="technology")
# sns.distplot(entertainment.news_length, hist=True, label="entertainment")
# sns.distplot(business.news_length, hist=False, label="business")
# plt.ylabel('Density')
# plt.xlabel('News story length (characters)')
# fig.legend()
# plt.show()

### Data Preprocessing

**1) Text cleaning**  
Remove punctuation and special characters, remove stop words, and lemmatize the remaining words. Lowercasing is not done in this step; `CountVectorizer` applies it later by default.

**1.1 Remove Punctuation**  
Punctuation marks and special characters carry little predictive power, so we strip them out.

In [None]:
all_punctuations = string.punctuation + '‘’,:”][],' 

#Method to remove punctuation marks from the data
def punc_remover(raw_text):
    no_punct = "".join([i for i in raw_text if i not in all_punctuations])
    return no_punct

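#Method to remove stopwords from the data
#The stop list is built once; rebuilding it for every word is very slow.
#Note: the match is case-sensitive, so capitalised stop words such as "But" are kept.
english_stopwords = set(stopwords.words('english'))
def stopword_remover(no_punc_text):
    words = no_punc_text.split()
    no_stp_words = " ".join([i for i in words if i not in english_stopwords])
    return no_stp_words

#Method to lemmatize the words in the data
#pos='v' lemmatizes every token as a verb (e.g. "edged" -> "edge")
lemmer = nltk.stem.WordNetLemmatizer()
def lem(words):
    return " ".join([lemmer.lemmatize(word, 'v') for word in words.split()])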

#Method to perform a complete cleaning
def text_cleaner(raw):
    cleaned_text = stopword_remover(punc_remover(raw))
    return lem(cleaned_text)
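
The cleaned output shown further down still contains words such as "But", "How", and "Most". That follows from the stop-word lookup being case-sensitive while NLTK's English stop list is lowercase. The sketch below illustrates the difference with a tiny hard-coded stop list, which is an assumption made so the example runs without downloading NLTK data; it skips lemmatization and is not the notebook's cleaner.

```python
import string

# Tiny hard-coded stop list: an assumption so the sketch runs without NLTK data
STOP_WORDS = {"but", "the", "was", "in", "is", "most", "a"}
# Translation table that deletes ASCII punctuation and curly quotes
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "‘’“”")

def clean_case_sensitive(text):
    """Mirrors the notebook's comparison: tokens are matched as-is."""
    words = text.translate(PUNCT_TABLE).split()
    return " ".join(w for w in words if w not in STOP_WORDS)

def clean_case_insensitive(text):
    """Lowercases each token before the stop-word lookup."""
    words = text.translate(PUNCT_TABLE).split()
    return " ".join(w for w in words if w.lower() not in STOP_WORDS)

sample = "But the most painful was the huge reversal in fee income."
print(clean_case_sensitive(sample))
print(clean_case_insensitive(sample))
```

Switching the real cleaner to a case-insensitive lookup would change the vocabulary size and the outputs recorded in this notebook, so it is left as an option rather than applied.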

In [None]:
#Applying the cleaner method to the entire data
input_train_df['CLEAN_STORY'] = input_train_df['STORY'].apply(text_cleaner)

In [None]:
#Checking the new dataset
print(input_train_df.head())

                                               STORY  SECTION  \
0  But the most painful was the huge reversal in ...        3   
1  How formidable is the opposition alliance amon...        0   
2  Most Asian currencies were trading lower today...        3   
3  If you want to answer any question, click on ‘...        1   
4  In global markets, gold prices edged up today ...        3   

                                         CLEAN_STORY  
0  But painful huge reversal fee income unheard a...  
1  How formidable opposition alliance among Congr...  
2  Most Asian currencies trade lower today South ...  
3  If want answer question click Answer After cli...  
4  In global market gold price edge today disappo...  


In [None]:
#Saving the cleaned training data as a pickle so the slow cleaning step need not be rerun
input_train_df.to_pickle('./sample_data/Pickles/input_train_df.pickle')

In [None]:
input_train_df = pd.read_pickle('./sample_data/Pickles/input_train_df.pickle')

**2) Count Vectors and TF-IDF Vectors**  
Create count vectors and TF-IDF vectors to feed into the model.

**2.1 Create bag-of-words model using CountVectorizer**

In [None]:
#Building the bag-of-words vocabulary from the cleaned data
bow_dictionary = CountVectorizer().fit(input_train_df['CLEAN_STORY'])

In [None]:
#Number of unique tokens in the bow_dictionary vocabulary
len(bow_dictionary.vocabulary_)

35189

In [None]:
#Using the bow_dictionary to create count vectors for the cleaned data.
bow = bow_dictionary.transform(input_train_df['CLEAN_STORY'])

In [None]:
#Printing the shape of the bag of words model
print(bow.shape)

(7551, 35189)


**2.2 Create TF-IDF Vectors**

In [None]:
#Fitting the TF-IDF transformer on the bag-of-words counts
tfidf_transformer = TfidfTransformer().fit(bow)

#Transforming the bag-of-words counts into TF-IDF vectors
storytfidf = tfidf_transformer.transform(bow)
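
To make the two-step vectorization concrete, the sketch below runs it on a hypothetical three-document corpus and compares it with scikit-learn's `TfidfVectorizer`, which combines both steps. With default settings scikit-learn computes `idf = ln((1 + n) / (1 + df)) + 1` and L2-normalizes each row, so a token that occurs in every document gets an IDF of exactly 1.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical three-document corpus
docs = [
    "gold price edge up today",
    "asian currencies trade lower today",
    "foldable phone launch today",
]

# Two-step route used in the notebook
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step equivalent with default settings
vectorizer = TfidfVectorizer()
one_step = vectorizer.fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
row_norms = np.linalg.norm(two_step.toarray(), axis=1)
# "today" occurs in all three documents
today_idf = vectorizer.idf_[vectorizer.vocabulary_["today"]]

print(two_step.shape, same, row_norms, today_idf)
```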

### Training the classification models

In [None]:
#Creating a Multinomial Naive Bayes classifier and
#fitting it on the TF-IDF vectors of the training data
classifier = MultinomialNB().fit(storytfidf, input_train_df['SECTION'])
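
The notebook imports `classification_report` but never scores the model on held-out data. One possible way to do that is sketched below, under stated assumptions: `holdout_report` is a hypothetical helper, the toy corpus stands in for `CLEAN_STORY` and `SECTION`, and `TfidfVectorizer` is swapped in for the `CountVectorizer` plus `TfidfTransformer` pair, which is equivalent with default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def holdout_report(texts, labels, test_size=0.2, seed=42):
    """Fit TF-IDF + MultinomialNB on a training split and score the held-out split."""
    x_train, x_val, y_train, y_val = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed)
    vectorizer = TfidfVectorizer()
    # Vocabulary and IDF weights are learned from the training split only
    model = MultinomialNB().fit(vectorizer.fit_transform(x_train), y_train)
    preds = model.predict(vectorizer.transform(x_val))
    return classification_report(y_val, preds, output_dict=True, zero_division=0)

# Hypothetical toy corpus: five short stories per SECTION label
toy_texts = (["election vote parliament minister"] * 5
             + ["phone app software chip"] * 5
             + ["film actor music award"] * 5
             + ["market stock bank profit"] * 5)
toy_labels = [0] * 5 + [1] * 5 + [2] * 5 + [3] * 5

report = holdout_report(toy_texts, toy_labels, test_size=0.4)
print(report["accuracy"])
```

On the real data the call would be along the lines of `holdout_report(input_train_df['CLEAN_STORY'], input_train_df['SECTION'])`. The toy corpus is trivially separable, so its score says nothing about performance on the hackathon data.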

### Predicting For The Test Set

In [None]:
final_test_df = pd.read_excel('./sample_data/Data_Test.xlsx')
final_test_df['CLEAN_STORY'] = final_test_df['STORY'].apply(text_cleaner)

In [None]:
#Saving the cleaned test data as a pickle so the slow cleaning step need not be rerun
final_test_df.to_pickle('./sample_data/Pickles/final_test_df.pickle')

In [None]:
final_test_df = pd.read_pickle('./sample_data/Pickles/final_test_df.pickle')

In [None]:
#Printing the test data as an array (raw STORY and CLEAN_STORY columns)
print(final_test_df.values)

[['2019 will see gadgets like gaming smartphones and wearable medical devices lifting the user experience to a whole new level\n\n\nmint-india-wire consumer technologyconsumer technology trends in New Yeartech gadgetsFoldable phonesgaming smartphoneswearable medical devicestechnology\n\n\nNew Delhi: Gadgets have become an integral part of our lives with most of us relying on some form of factor to communicate, commute, work, be informed or entertained. Year 2019 will see some gadgets lifting the user experience to a whole new level. Here’s what we can expect to see:\n\n\nSmartphones with foldable screens: Foldable phones are finally moving from the concept stage to commercial launches. They are made up of organic light-emitting diode (OLED) panels with higher plastic substrates, allowing them to be bent without damage.\n\n\nUS-based display maker Royole Corp’s foldable phone, FlexPai, has already arrived in select markets, while Samsung’s unnamed foldable phone is expected sometime nex

### Create a pipeline that vectorizes the text and fits the classifier

In [None]:
#Importing the Pipeline module from sklearn
from sklearn.pipeline import Pipeline

#Initializing the pipeline with necessary transformations and the required classifier
pipe = Pipeline([
('bow', CountVectorizer()),
('tfidf', TfidfTransformer()),
('classifier', MultinomialNB())])

#Fitting the training data to the pipeline
pipe.fit(input_train_df['CLEAN_STORY'], input_train_df['SECTION'])

Pipeline(memory=None,
     steps=[('bow', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
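
Because the pipeline bundles vectorization with the classifier, it can be handed to scikit-learn's cross-validation utilities, which refit the vocabulary and IDF weights inside each fold. The sketch below is an illustration on a hypothetical toy corpus; `build_pipeline` is an assumed helper name that recreates the same three steps.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def build_pipeline():
    """Recreate the notebook's three pipeline steps (hypothetical helper)."""
    return Pipeline([
        ("bow", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("classifier", MultinomialNB()),
    ])

# Hypothetical toy corpus: five short stories per SECTION label
toy_texts = (["election vote parliament minister"] * 5
             + ["phone app software chip"] * 5
             + ["film actor music award"] * 5
             + ["market stock bank profit"] * 5)
toy_labels = [0] * 5 + [1] * 5 + [2] * 5 + [3] * 5

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(build_pipeline(), toy_texts, toy_labels,
                         cv=cv, scoring="accuracy")
print(scores.mean())
```

With the real data, `toy_texts` and `toy_labels` would be replaced by `input_train_df['CLEAN_STORY']` and `input_train_df['SECTION']`.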

In [None]:
#Predicting the SECTION 
test_preds_mnb = pipe.predict(final_test_df['CLEAN_STORY'])

#Collecting the predictions in a DataFrame
output = pd.DataFrame(test_preds_mnb, columns = ['SECTION'])

In [None]:
final_test_df.shape

(2748, 2)

In [None]:
#Writing the predictions to an Excel sheet
output.to_excel("./Predictions/prediction3.xlsx")