# Classifying News

I downloaded news items from the InShorts website using a Python script.
The news is sorted into 5 categories: Automobile, Entertainment, Hatke, Science and Technology.
Each category contains at least 1000 news items for training the model and 25 for testing it.

###### Author: Yash Sharma
###### Date: 10th July 2017
###### Library Used: sklearn 

##### sklearn official [documentation](http://scikit-learn.org/stable/documentation.html) used as a reference

In [71]:
# Importing methods from sklearn library
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier, RidgeClassifierCV, RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

In [2]:
# Loading training data from InShorts_News/Train dir
category = ['Automobile', 'Entertainment', 'Hatke', 'Science', 'Technology']
data_train = load_files('InShorts_News/Train', random_state=42, shuffle=True, categories = category)

In [3]:
# Target Names
print ("Target Name: {}".format(data_train.target_names))

Target Name: ['Automobile', 'Entertainment', 'Hatke', 'Science', 'Technology']


# Extracting features from text files

## Bag of Words: 
The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears.
We'll use the CountVectorizer class from the sklearn library to create a bag of words.

Initialize the "CountVectorizer" object, which is scikit-learn's bag of words tool.  
> vectorizer = CountVectorizer()

You can also pass various arguments, such as whether to remove stop words, how to tokenize the text, or the maximum number of features we want.
[Here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is the complete list of arguments we can pass.

Then we call the fit_transform method and pass the data we want to vectorize
> fit_transform(data.data)

fit_transform() does two things:
1. It fits the model and learns the vocabulary.
2. It transforms our training data into feature vectors.
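To make the two steps concrete, here is a minimal hand-rolled sketch in plain Python. It is not how `CountVectorizer` is implemented: the helper names `fit_vocabulary` and `transform_counts` and the toy sentences are made up for illustration, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter


def fit_vocabulary(documents):
    """Step 1: learn the vocabulary, mapping each word to a column index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def transform_counts(documents, vocabulary):
    """Step 2: turn each document into a vector of word counts."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # Words that are not in the learned vocabulary are simply ignored.
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows


train_docs = ["the car is fast", "the phone is new", "the new car"]
vocabulary = fit_vocabulary(train_docs)
train_vectors = transform_counts(train_docs, vocabulary)

# An unseen document reuses the vocabulary learned from the training data.
test_vectors = transform_counts(["the fast rocket is the best"], vocabulary)

print(vocabulary)
print(train_vectors)
print(test_vectors)
```

The last call mirrors what happens with the test set later in this notebook: it goes through `transform` only, so words that never appeared in the training data ("rocket", "best") do not get a column.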

In [59]:
# Initializing CountVectorizer
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 1))

# Fitting our data
X_train_count = vectorizer.fit_transform(data_train.data)

In [5]:
# Shape of the data_features
print ("\nShape after vectorizing ", X_train_count.shape)


Shape after vectorizing  (7770, 27186)


## From Occurrences to Frequencies

Longer documents have higher average count values than shorter documents, even though they might talk about the same topics, and this skews the features. To avoid these discrepancies, it suffices to divide the number of occurrences of each word in a document by the total number of words in the document. These new features are called Term Frequencies.

Luckily, sklearn has a class for this: TfidfTransformer.
Tf–idf stands for “Term Frequency times Inverse Document Frequency”.

On top of term frequencies, it downscales the weights of words that occur in many documents in the corpus and are therefore less informative than those that occur in only a small portion of the corpus. This refinement helps our model fit better.

> tfidf = TfidfTransformer()

> tfidf_features = tfidf.fit_transform(data_features)
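As a rough worked example of the idea, the sketch below computes term frequencies and a textbook inverse document frequency, `log(N / df)`, by hand. This formula is an assumption chosen for illustration: `TfidfTransformer` by default uses a smoothed variant of the idf and normalizes each row, so its numbers will differ. The toy documents and the helper name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

docs = [
    "car engine car".split(),
    "car phone".split(),
    "car rocket science".split(),
]


def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


weights = tf_idf(docs)
for row in weights:
    print({word: round(value, 3) for word, value in row.items()})
```

The word "car" occurs in every document, so its weight drops to zero even where it is the most frequent word, while rarer words such as "engine" or "phone" keep a positive weight.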

In [60]:
# Initialize TfidfTransformer
tfidf = TfidfTransformer(use_idf=True)

# fit the transformer
X_train_tfidf = tfidf.fit_transform(X_train_count)

In [8]:
# Shape of tfidf_features
print ("\nShape after applying Tf-idf", X_train_tfidf.shape)


Shape after applying Tf-idf (7770, 27186)


## Note: 

The two steps we performed can be merged into a single step using TfidfVectorizer. It combines CountVectorizer and TfidfTransformer: it first vectorizes the data and then applies the tf-idf weighting.
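A small sketch of that equivalence, using a made-up three-document corpus rather than the InShorts data. It assumes both routes are given the same settings (here `stop_words='english'` and `use_idf=True`).

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "new electric car launched",
    "new phone launched today",
    "scientists study distant stars",
]

# Two steps: raw counts first, then tf-idf weighting.
counts = CountVectorizer(stop_words='english').fit_transform(docs)
two_step = TfidfTransformer(use_idf=True).fit_transform(counts)

# One step: TfidfVectorizer performs both internally.
one_step = TfidfVectorizer(stop_words='english', use_idf=True).fit_transform(docs)

print(two_step.shape, one_step.shape)
print(np.allclose(two_step.toarray(), one_step.toarray()))
```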

# Test Data
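Load the test data into memory and apply the same processing (bag of words, then tf-idf). Note that we only call transform() here, not fit_transform(), so the test data reuses the vocabulary and idf weights learned from the training data.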

In [61]:
# Loading test data passing the same categories as of training data
data_test = load_files('InShorts_News/Test', random_state=42, shuffle=True, categories = category)

# storing text and target in two variables
X_test = data_test.data
y_test = data_test.target

# Performing vectorization and tf-idf on test data
X_test_count = vectorizer.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_count)

# Train our model

## 1. MultinomialNB

In [10]:
# clf stands for classifier
clf = MultinomialNB()

# fit (train) our model
clf.fit(X_train_tfidf, data_train.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Let's first test our model on some sample data

In [12]:
# First text is for Entertainment and second for Science
sample_text = [
    "A wax statue of actor Ranveer Singh, who turned 32 today, was unveiled at the Grévin wax museum in Paris. Ranveer is the third Bollywood celebrity after Shah Rukh Khan and Aishwarya Rai to get a statue at the museum. \"It's a truly special birthday present and one that will always bring back beautiful memories of Paris,\" said Ranveer.",
    
    "Some physicists suggest that if quantum world is real and time-symmetric, it must allow for 'retrocausality', implying influences can travel backwards in time.\
     Retrocausality, however, doesn't imply that signals can be communicated from future to the past, scientists added. \
     If found correct, it would prove Einstein correct that quantum theory is incomplete and quantum entanglement is not possible."
]

# Vectorize sample data
X_new_counts = vectorizer.transform(sample_text)
X_new_tfidf = tfidf.transform(X_new_counts)

# predict the output
predicted = clf.predict(X_new_tfidf)

# Print the prediction made by the model
for pred in predicted:
    print (data_train.target_names[pred])

Entertainment
Science


It worked: our model classified both sample texts correctly.

Let's test our model on test data

In [13]:
# Making predictions on test data
y_mnb_predict = clf.predict(X_test_tfidf)

### Classification Report and Accuracy Score for MultinomialNB Model 

In [14]:
# classification report (precision, recall and F1 score per class)
print(classification_report(y_test, y_mnb_predict, target_names=data_test.target_names))

# accuracy score
print ("Accuracy Score: {}".format(accuracy_score(y_test, y_mnb_predict)))

               precision    recall  f1-score   support

   Automobile       0.80      0.64      0.71        25
Entertainment       0.86      1.00      0.93        25
        Hatke       1.00      0.40      0.57        25
      Science       0.68      1.00      0.81        25
   Technology       0.69      0.80      0.74        25

  avg / total       0.81      0.77      0.75       125

Accuracy Score: 0.768
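To see how the numbers in the report relate to each other, the sketch below recomputes a few of them from the table above. It assumes only the reported recall values, the reported precision for Hatke, and the support of 25 per class. The variable names are made up for illustration.

```python
# Recall per class, copied from the MultinomialNB report above.
recall = {
    "Automobile": 0.64,
    "Entertainment": 1.00,
    "Hatke": 0.40,
    "Science": 1.00,
    "Technology": 0.80,
}
support = 25  # test samples per class

# recall = correctly predicted samples of a class / true samples of that class
correct = {name: round(value * support) for name, value in recall.items()}
accuracy = sum(correct.values()) / (support * len(recall))

# F1 is the harmonic mean of precision and recall (Hatke has precision 1.00).
hatke_precision = 1.00
hatke_f1 = 2 * hatke_precision * recall["Hatke"] / (hatke_precision + recall["Hatke"])

print(correct)
print("Accuracy:", accuracy)
print("Hatke F1:", round(hatke_f1, 2))
```

The per-class counts add up to 96 correct predictions out of 125, which matches the reported accuracy of 0.768. Hatke's perfect precision combined with a recall of 0.40 means the model never mislabels another category as Hatke, but it finds only 10 of the 25 Hatke items.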


## 2. Support Vector Machine (SVM)

In [42]:
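# SVM classifier (a linear SVM trained with stochastic gradient descent)
# Note: newer scikit-learn releases replace the n_iter argument with max_iter
svm_clf = SGDClassifier(loss='hinge', penalty='l2', alpha=0.001, n_iter=5, random_state=42)

# fit on our training data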
svm_clf.fit(X_train_tfidf, data_train.target)

SGDClassifier(alpha=0.001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
       warm_start=False)

In [43]:
# make prediction using svm model
y_svm_predict = svm_clf.predict(X_test_tfidf)

### Classification Report and Accuracy Score for SVM Model

In [44]:
# classification report (precision, recall and F1 score per class)
print(classification_report(y_test, y_svm_predict, target_names=data_test.target_names))

# accuracy score
print ("\nAccuracy Score: {}".format(accuracy_score(y_test, y_svm_predict)))

               precision    recall  f1-score   support

   Automobile       0.69      0.80      0.74        25
Entertainment       0.88      0.92      0.90        25
        Hatke       1.00      0.44      0.61        25
      Science       0.69      1.00      0.82        25
   Technology       0.83      0.76      0.79        25

  avg / total       0.82      0.78      0.77       125


Accuracy Score: 0.784


## 3. RidgeClassifierCV

In [62]:
# Ridge classifier
ridge_clf = RidgeClassifierCV(alphas=(0.01, 0.1, 1.0))

# fit on our training data
ridge_clf.fit(X_train_tfidf, data_train.target)

RidgeClassifierCV(alphas=(0.01, 0.1, 1.0), class_weight=None, cv=None,
         fit_intercept=True, normalize=False, scoring=None)

In [64]:
# make prediction using the ridge model
y_ridge_predict = ridge_clf.predict(X_test_tfidf)

### Classification Report and Accuracy Score for RidgeClassifierCV Model

In [67]:
# classification report (precision, recall and F1 score per class)
print(classification_report(y_test, y_ridge_predict, target_names=data_test.target_names))

# accuracy score
print ("\nAccuracy Score: {}".format(accuracy_score(y_test, y_ridge_predict)))

# Best alpha among the candidates.
# RidgeClassifierCV does the cross-validation itself. We just need to pass a list of alphas and it picks the best one
# from the list.
print ("\nAlpha:", ridge_clf.alpha_)

               precision    recall  f1-score   support

   Automobile       0.88      0.56      0.68        25
Entertainment       0.89      1.00      0.94        25
        Hatke       1.00      0.64      0.78        25
      Science       0.78      1.00      0.88        25
   Technology       0.64      0.84      0.72        25

  avg / total       0.84      0.81      0.80       125


Accuracy Score: 0.808

Alpha: 1.0


# Building Pipeline
All the steps we have performed so far can be condensed by building a pipeline.

A Pipeline chains multiple estimators (including a final classifier) into one. This is useful because a process often has a fixed sequence of steps. To make our work easier, we create a pipeline that handles the process for us step by step.
    
Pipeline serves two purposes here:

##### Convenience: 
    You only have to call fit and predict once on your data to fit a whole sequence of estimators.

##### Joint parameter selection: 
    You can grid search over parameters of all estimators in the pipeline at once.

#### Note: All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).
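The convenience point can be checked on a toy example: fitting the steps by hand and fitting them through a pipeline should give the same predictions. The sketch below uses a made-up four-document corpus and `MultinomialNB` as the final step to keep it small. The `_demo` names are hypothetical and chosen so they do not clash with the variables used elsewhere in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: label 0 = cars, label 1 = phones.
train_texts = [
    "new electric car launched",
    "car maker recalls sedan",
    "smartphone gets bigger screen",
    "phone maker unveils smartphone",
]
train_labels = [0, 0, 1, 1]
test_texts = ["electric sedan launched", "bigger phone screen"]

# Manual version: every step is fitted and applied by hand.
vec_demo = CountVectorizer()
tfidf_demo = TfidfTransformer()
clf_demo = MultinomialNB()
X_train_demo = tfidf_demo.fit_transform(vec_demo.fit_transform(train_texts))
clf_demo.fit(X_train_demo, train_labels)
manual_pred = clf_demo.predict(tfidf_demo.transform(vec_demo.transform(test_texts)))

# Pipeline version: one fit call and one predict call on the raw text.
pipe_demo = Pipeline([
    ('countvectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipe_demo.fit(train_texts, train_labels)
pipe_pred = pipe_demo.predict(test_texts)

print(manual_pred, pipe_pred)
print(np.array_equal(manual_pred, pipe_pred))
```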

In [72]:
# Pipeline with CountVectorizer, TfidfTransformer and RidgeClassifier

# Now we don't need to call CountVectorizer and then TfidfTransformer ourselves.
# The pipeline performs the processing step by step for us.

pipe = Pipeline ([
    ('countvectorizer', CountVectorizer(stop_words='english', ngram_range=(1, 1))),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('ridge', RidgeClassifier(alpha=1))
    #('BNB', BernoulliNB(alpha=1))
    #('MNB', MultinomialNB(alpha=0.1))
])

In [73]:
# Fit the dataset and make prediction
pipe.fit(data_train.data, data_train.target)

# Testing our model
pipe_predict = pipe.predict(data_test.data)

In [74]:
# Accuracy Score
print ("Accuracy Score:", accuracy_score(data_test.target, pipe_predict))

Accuracy Score: 0.808


# Tuning Parameters using GridSearchCV
Classifiers and estimators tend to have many parameters, and we get better results when these are set well. But how can we find good parameters for every classifier and estimator? We could tweak them by hand and check the result each time, but that takes a lot of time. To make our work easier, sklearn provides GridSearchCV, which takes a classifier (or any estimator, including a pipeline) and tunes it by trying different parameter combinations with cross-validation.
But remember, it's a very expensive job.


Parameters:
> svm_clf = SGDClassifier(loss='hinge', penalty='l2', alpha=0.01, n_iter=5, random_state=42)

> svm_clf.get_params()

Hyper-parameters:
They are the "higher-level" structural information about the model, and they are typically set before training the model.

> pipeline.get_params()

In [6]:
# Let's see what parameters are available for us to tune
# (the output shown below is from a run where the last pipeline step was ('MNB', MultinomialNB()))
pipe.get_params()

{'MNB': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 'MNB__alpha': 1.0,
 'MNB__class_prior': None,
 'MNB__fit_prior': True,
 'countvectorizer': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 1), preprocessor=None, stop_words='english',
         strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
         tokenizer=None, vocabulary=None),
 'countvectorizer__analyzer': 'word',
 'countvectorizer__binary': False,
 'countvectorizer__decode_error': 'strict',
 'countvectorizer__dtype': numpy.int64,
 'countvectorizer__encoding': 'utf-8',
 'countvectorizer__input': 'content',
 'countvectorizer__lowercase': True,
 'countvectorizer__max_df': 1.0,
 'countvectorizer__max_features': None,
 'countvectorizer__min_df': 1,
 'countvectorizer__ngram_range': (1, 1),
 'countvectorizer__preprocessor': None

In [10]:
# List of parameters we want to tune
parameters = {
    'countvectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'tfidf__use_idf': (True, False),
    #'MNB__alpha': (1.0, 0.1, 0.01, 0.001)
}
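To get a feel for why the search is expensive, the sketch below enumerates the candidate combinations for the grid defined above and multiplies by the number of cross-validation folds. It uses only `itertools`. The fold count of 10 is taken from the `cv=10` argument used in this notebook, and the helper name `count_fits` is made up for illustration.

```python
from itertools import product

param_grid = {
    'countvectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'tfidf__use_idf': (True, False),
}


def count_fits(grid, n_folds):
    """Return (candidates, model fits) for a full grid search, excluding the final refit."""
    candidates = list(product(*grid.values()))
    return len(candidates), len(candidates) * n_folds


n_candidates, n_fits = count_fits(param_grid, n_folds=10)
print(n_candidates, "candidates,", n_fits, "fits")

# Adding four alpha values for the classifier step (as in the commented-out
# 'MNB__alpha' line) multiplies the work by four.
param_grid['MNB__alpha'] = (1.0, 0.1, 0.01, 0.001)
print(count_fits(param_grid, n_folds=10))
```

Every extra parameter multiplies the number of candidates, and each candidate is fitted once per fold, so even a small grid quickly turns into dozens or hundreds of training runs.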

In [11]:
# Create a Grid with the pipeline and parameter we want to tune
gs_clf = GridSearchCV(pipe, parameters, n_jobs=-1, cv=10)

# Fit the dataset
gs_clf.fit(data_train.data, data_train.target)

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english...inear_tf=False, use_idf=True)), ('MNB', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'MNB__alpha': (1.0, 0.1, 0.01, 0.001), 'tfidf__use_idf': (True, False), 'countvectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [12]:
# Print the best mean score and the best parameter we'll get after tuning
print ("Best Mean Score:", gs_clf.best_score_)
print ("Best Parameter Settings: ", gs_clf.best_params_)

Best Mean Score: 0.880308880309
Best Parameter Settings:  {'MNB__alpha': 0.1, 'tfidf__use_idf': False, 'countvectorizer__ngram_range': (1, 1)}


In [14]:
# GridSearchCV was created with refit=True (the default), so the best
# parameter combination has already been refit on the whole training set.

# Testing model
gs_predict = gs_clf.predict(X_test)

# Accuracy Score
print ("Accuracy Score:", accuracy_score(data_test.target, gs_predict))

Accuracy Score: 0.768


In [68]:
gs_clf.cv_results_

{'mean_fit_time': array([ 1.38943822,  1.36962538,  5.2649421 ,  5.78236973,  8.77949331,
         8.44525688,  1.4203624 ,  1.37572818,  4.91590009,  4.83553863,
         8.73370049,  7.88531094,  1.31288307,  1.29937325,  4.3249243 ,
         4.20538888,  7.94481905,  7.42882466,  1.30177519,  1.28676481,
         4.33618236,  4.19953537,  7.74887702,  7.70577683]),
 'mean_score_time': array([ 0.13089452,  0.13259399,  0.31143126,  0.35615344,  0.39838338,
         0.38432345,  0.14059715,  0.13324549,  0.2992125 ,  0.26744015,
         0.38849933,  0.34439511,  0.12854149,  0.12238741,  0.25498149,
         0.25072863,  0.36586027,  0.33889112,  0.12393808,  0.12203667,
         0.26098549,  0.24217203,  0.36270773,  0.33814077]),
 'mean_test_score': array([ 0.83861004,  0.83732304,  0.80733591,  0.80913771,  0.7965251 ,
         0.8036036 ,  0.87258687,  0.88030888,  0.85791506,  0.85933076,
         0.85070785,  0.84980695,  0.85842986,  0.87027027,  0.86718147,
         0.8712998

# Saving the Model

In [75]:
# joblib is a standalone package; sklearn.externals.joblib has been removed from recent scikit-learn releases
import joblib
joblib.dump(pipe, 'Pipeline_RidgeClassifier_InShortsNews.sav')

['Pipeline_RidgeClassifier_InShortsNews.sav']

Now we can use this model in the future to make predictions without retraining it; the model is already trained.

To load the model
> model = joblib.load(model_name)
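A minimal round-trip sketch of saving and loading, assuming the standalone `joblib` package is installed. The small pipeline, the toy texts and the file name `demo_pipeline.sav` are stand-ins made up for this example. In the notebook, the object being saved would be the trained `pipe`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Small stand-in model trained on made-up data (0 = cars, 1 = phones).
texts = ["new electric car", "fast car engine", "new phone screen", "phone camera app"]
labels = [0, 0, 1, 1]
demo_model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
]).fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this example.
    model_path = os.path.join(tmp_dir, 'demo_pipeline.sav')
    joblib.dump(demo_model, model_path)
    restored = joblib.load(model_path)

samples = ["car engine", "phone app"]
print(demo_model.predict(samples))
print(restored.predict(samples))
```

Because the whole pipeline is saved, the restored object accepts raw text directly. It is generally safest to load a saved model with the same scikit-learn version that was used to train it.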