<a class="anchor" id="0"></a>
# [NLP : Reports & News Classification](https://www.kaggle.com/vbmokin/nlp-reports-news-classification)
## Automatic Environmental Reports & News Classification (Ukrainian)

# Acknowledgements

This notebook builds on the following notebooks:
* BERT model from the article and accompanying notebook [A Visual Guide to Using BERT for the First Time](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)
* EDA from the notebook [NLP - EDA, Bag of Words, TF IDF, GloVe, BERT](https://www.kaggle.com/vbmokin/nlp-eda-bag-of-words-tf-idf-glove-bert)
* Classification model from the notebook [Heart Disease - Automatic AdvEDA & FE & 20 models](https://www.kaggle.com/vbmokin/heart-disease-automatic-adveda-fe-20-models)
* My similar notebook [NLP for EN : BERT Classification for Water Report](https://www.kaggle.com/vbmokin/nlp-for-en-bert-classification-for-water-report)

My dataset [NLP : Reports & News Classification](https://www.kaggle.com/vbmokin/nlp-reports-news-classification)

Source of models:
https://huggingface.co/transformers/pretrained_models.html

<a class="anchor" id="0.1"></a>
## Table of Contents

1. [Import libraries](#1)
1. [Download data](#2)
1. [BERT: Data Preparation and Modeling](#3)
1. [Text Classification and Prediction](#4)

## 1. Import libraries <a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

In [None]:
#!pip install transformers

In [None]:
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

import torch
import transformers as ppb

import warnings
warnings.filterwarnings('ignore')

## 2. Download data <a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

In [None]:
df = pd.read_csv('../input/nlp-reports-news-classification/water_problem_nlp_ua_for_Kaggle_100.csv', delimiter=';', 
                 header=0, encoding='cp1251')
df = df.fillna(0)

convert_dict = {'text': str, 
                'env_problems': int,
                'pollution': int, 
                'treatment': int,
                'climate': int,
                'biomonitoring': int} 
  
df = df.astype(convert_dict)
df

In [None]:
df.info()

## 3. BERT: Data Preparation and Modeling <a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)

In [None]:
# For pre-trained DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-multilingual-cased')

# Other models: https://huggingface.co/transformers/pretrained_models.html

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

In [None]:
# Tokenize the texts: split them into words and subwords and map these to the token ids BERT expects
tokenized = df['text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

# Length of the longest tokenized text
max_len = max(len(i) for i in tokenized.values)

# Right-pad every sequence with 0 (the padding token id) up to max_len
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
padded.shape

In [None]:
# Create the attention mask so the model ignores the padding (1 = real token, 0 = padding)
attention_mask = np.where(padded != 0, 1, 0)
print(attention_mask.shape)
attention_mask
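
The padding and masking steps above can be illustrated on a tiny example. This is only a sketch: the token ids below are made up, and it assumes, as the cells above do, that `0` is the padding id.

```python
import numpy as np

# Hypothetical token-id sequences of different lengths (values are made up)
toy_tokenized = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 3, 102]]

# Length of the longest sequence
max_len = max(len(ids) for ids in toy_tokenized)

# Right-pad every sequence with 0 up to the longest length
toy_padded = np.array([ids + [0] * (max_len - len(ids)) for ids in toy_tokenized])

# 1 marks a real token, 0 marks padding
toy_mask = np.where(toy_padded != 0, 1, 0)

print(toy_padded.shape)
print(toy_mask)
```

Each row of `toy_mask` has as many ones as the original sequence had tokens, so padded positions do not contribute to attention.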

In [None]:
# Modeling: run a forward pass through DistilBERT without tracking gradients
input_ids = torch.tensor(padded).to(torch.int64)
attention_mask = torch.tensor(attention_mask).to(torch.int64)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [None]:
# Use the last hidden state of the first token ([CLS]) of each text as its feature vector
features = last_hidden_states[0][:,0,:].numpy()
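
The slice `[:,0,:]` may be easier to follow on a small stand-in array. The sketch below uses NumPy instead of the real model output, and the shapes are illustrative assumptions (DistilBERT's hidden size is 768; 4 is used here for brevity).

```python
import numpy as np

# Stand-in for the model output: (batch, sequence length, hidden size)
batch, seq_len, hidden = 3, 6, 4
toy_hidden_states = np.arange(batch * seq_len * hidden, dtype=float).reshape(batch, seq_len, hidden)

# Keep every example, only token position 0, and every hidden unit
toy_features = toy_hidden_states[:, 0, :]

print(toy_features.shape)
```

The result has one row per text and one column per hidden unit, which is the 2D layout scikit-learn classifiers expect.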

## 4. Text Classification and Prediction <a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)

In [None]:
def target_prediction(df, features, target, test_size=0.2):
    # Train a text classification model and evaluate its predictions for the label column "target" in df
    
    # Target
    labels = df[target]
    
    # EDA
    print()
    # Extracting the number of examples of each class
    Relevant_len = df[df[target] == 1].shape[0]
    Not_len = df[df[target] == 0].shape[0]
    # Draw bar plot
    plt.rcParams['figure.figsize'] = (7, 5)
    plt.bar(10, Relevant_len, 3, label="Relevant", color='green')
    plt.bar(15, Not_len, 3, label="Not", color='red')
    plt.legend(loc='upper center')
    plt.ylabel('Number of examples')
    plt.title('Proportion of examples for ' + target)
    plt.show()
    
    # Train, test split 
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=test_size)
    
    # Train a simple model
    print(f'Classification for {target}:')
    parameters = {'C': np.linspace(0.0001, 100, 20)}
    model = GridSearchCV(LogisticRegression(), parameters)
    model.fit(train_features, train_labels)

    print('best parameters: ', model.best_params_)
    print('best cross-validation score: ', model.best_score_)
    
    # Test prediction
    test_pred = model.predict(test_features)
    print('Score of the test prediction -', accuracy_score(test_labels, test_pred),'\n\n')
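
With only about 100 texts and possibly rare classes, the random split in the function above can vary a lot between runs. The sketch below shows one possible variant on synthetic data, not the notebook's own method: `stratify`, `random_state`, `max_iter=1000` and `cv=5` are assumptions added here for reproducibility and convergence, and `make_classification` merely stands in for the BERT features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for (features, labels) with roughly 80/20 class imbalance
X, y = make_classification(n_samples=100, n_features=20, weights=[0.8, 0.2], random_state=0)

# Stratified split keeps the class proportions similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

# Same grid over the regularization strength C as in the function above
grid = {'C': np.linspace(0.0001, 100, 20)}
search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=5)
search.fit(X_train, y_train)

test_acc = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, test_acc)
```

For imbalanced targets, accuracy alone can look good even when the rare class is never predicted, so comparing it with the majority-class share is a useful sanity check.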

In [None]:
# List of the target features in df
cols = df.columns.tolist()[1:]
print('Target columns:', cols)

In [None]:
%%time
# Solving NLP Classification tasks
print('Solving NLP Classification tasks')
for col in cols:
    target_prediction(df, features, col, test_size=0.4)

I hope you find this kernel useful and enjoyable.

Your comments and feedback are most welcome.

[Go to Top](#0)