# Exercise 2: NLP/Text classification

Our goal is to build a model that predicts the category of a given text (business, sport, politics, entertainment, tech).

## I. Data preprocessing

### 1) Load the dataset


In [0]:
from google.colab import files
!pip install -q kaggle

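To get your API token:
go to [Kaggle.com](https://www.kaggle.com/), open your account page (click the profile picture at the top right), then scroll down and click "Create API Token". This downloads the `kaggle.json` file that is uploaded in the next cell.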

In [35]:
uploaded = files.upload()

Saving kaggle.json to kaggle (1).json


In [36]:

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 /root/.kaggle/kaggle.json

kaggle.json
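
The shell commands above can also be written in Python. This makes it easier to deal with a renamed upload: the message `Saving kaggle.json to kaggle (1).json` suggests that Colab stored the new upload under a different name, so `cp kaggle.json` may pick up an older copy. A minimal sketch, assuming the token file is in the current directory; the helper name `install_kaggle_token` and its parameters are hypothetical:

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(src, home=None):
    """Copy a Kaggle API token to <home>/.kaggle/kaggle.json with mode 600.

    `src` is the path of the uploaded token file, for example
    "kaggle (1).json" if the upload was renamed. Hypothetical helper.
    """
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # same as: mkdir -p ~/.kaggle
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)                   # same as: cp <src> ~/.kaggle/kaggle.json
    os.chmod(target, 0o600)                        # same as: chmod 600 (owner read/write only)
    return target
```

For example, `install_kaggle_token("kaggle (1).json")` would install the most recent upload under the name the Kaggle CLI expects.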


In [37]:
!kaggle datasets download -d datacolab/news-classification

news-classification.zip: Skipping, found more recently modified local copy (use --force to force download)


In [0]:
import pandas as pd

# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv('news-classification.zip', compression='zip', encoding="ISO-8859-1", sep=',', on_bad_lines='skip')

In [39]:
df.head()

Unnamed: 0,news,type
0,China had role in Yukos split-up\n \n China le...,business
1,Oil rebounds from weather effect\n \n Oil pric...,business
2,Indonesia 'declines debt freeze'\n \n Indonesi...,business
3,$1m payoff for former Shell boss\n \n Shell is...,business
4,US bank in $515m SEC settlement\n \n Five Bank...,business


In [40]:
df['news'][0]

'China had role in Yukos split-up\n \n China lent Russia $6bn (Â£3.2bn) to help the Russian government renationalise the key Yuganskneftegas unit of oil group Yukos, it has been revealed.\n \n The Kremlin said on Tuesday that the $6bn which Russian state bank VEB lent state-owned Rosneft to help buy Yugansk in turn came from Chinese banks. The revelation came as the Russian government said Rosneft had signed a long-term oil supply deal with China. The deal sees Rosneft receive $6bn in credits from China\'s CNPC.\n \n According to Russian newspaper Vedomosti, these credits would be used to pay off the loans Rosneft received to finance the purchase of Yugansk. Reports said CNPC had been offered 20% of Yugansk in return for providing finance but the company opted for a long-term oil supply deal instead. Analysts said one factor that might have influenced the Chinese decision was the possibility of litigation from Yukos, Yugansk\'s former owner, if CNPC had become a shareholder. Rosneft an
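
The `Â£` in the output above is the usual signature of a UTF-8 pound sign (`£`) that was decoded as ISO-8859-1. The sketch below reproduces the effect and reverses it. It assumes the CSV is actually UTF-8 encoded, which the notebook does not confirm; `fix_mojibake` is a hypothetical helper name.

```python
def fix_mojibake(text):
    """Undo a UTF-8 -> ISO-8859-1 mis-decoding.

    Returns the text unchanged when the round trip is not possible,
    i.e. when the string was not garbled in this particular way.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Reproduce the problem: encode as UTF-8, then decode with the wrong codec.
garbled = "£3.2bn".encode("utf-8").decode("iso-8859-1")
print(garbled)                # Â£3.2bn
print(fix_mojibake(garbled))  # £3.2bn
```

If the assumption holds, reading the file with `encoding="utf-8"` in `pd.read_csv` would avoid the garbling at the source instead of repairing it afterwards.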

### 2) Remove stop words


In [41]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [0]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 


stop_words = set(stopwords.words('english'))
def removing_stop_words(text) :
  text_tokenized = word_tokenize(text)
  text_tokenized = [w for w in text_tokenized if w not in stop_words]
  return ' '.join(text_tokenized)

df['news_tokenized_0stopwords'] = df['news'].map(removing_stop_words)

In [43]:
df['news_tokenized_0stopwords'][0]

"China role Yukos split-up China lent Russia $ 6bn ( Â£3.2bn ) help Russian government renationalise key Yuganskneftegas unit oil group Yukos , revealed . The Kremlin said Tuesday $ 6bn Russian state bank VEB lent state-owned Rosneft help buy Yugansk turn came Chinese banks . The revelation came Russian government said Rosneft signed long-term oil supply deal China . The deal sees Rosneft receive $ 6bn credits China 's CNPC . According Russian newspaper Vedomosti , credits would used pay loans Rosneft received finance purchase Yugansk . Reports said CNPC offered 20 % Yugansk return providing finance company opted long-term oil supply deal instead . Analysts said one factor might influenced Chinese decision possibility litigation Yukos , Yugansk 's former owner , CNPC become shareholder . Rosneft VEB declined comment . `` The two companies [ Rosneft CNPC ] agreed pre-payment long-term deliveries , '' said Russian oil official Sergei Oganesyan . `` There nothing unusual pre-payment five 

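Note that `The` survives in the output above: NLTK's English stop-word list is lowercase, and the membership test `w not in stop_words` is case-sensitive. Below is a minimal sketch of a case-insensitive filter. To stay self-contained it uses a tiny hand-written stop-word set and `str.split`, which are stand-ins for `stopwords.words('english')` and `word_tokenize`; the function name is hypothetical.

```python
# Stand-in for set(stopwords.words('english')); illustrative subset only.
STOP_WORDS = {"the", "in", "had", "to", "of", "it", "has", "been"}


def remove_stop_words_ci(text, stop_words=STOP_WORDS):
    """Drop stop words, comparing in lowercase so that 'The' is removed too."""
    tokens = text.split()  # stand-in for word_tokenize(text)
    kept = [w for w in tokens if w.lower() not in stop_words]
    return " ".join(kept)


print(remove_stop_words_ci("The Kremlin said the deal had been signed"))
# Kremlin said deal signed
```

Lowercasing the text before the stop-word step would have the same effect in the notebook, since the next step lowercases everything anyway.
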
### 3) Remove undesired punctuation


In [0]:
import re  # regular expressions

def removing_undesired_punctuation(text):
  """Lowercase the text and remove the characters ( ) [ ] { } ' , . ` ? : ; ! & ^"""
  text = text.lower()
  # Raw string with escaped brackets: same character class, without the invalid "\]" escape
  text = re.sub(r"[()\[\]{}',.`?:;!&^]", '', text)
  return text

df['news_tokenized_0stopwords'] = df['news_tokenized_0stopwords'].map(removing_undesired_punctuation)

In [45]:
df['news_tokenized_0stopwords'][0]

'china role yukos split-up china lent russia $ 6bn  â£32bn  help russian government renationalise key yuganskneftegas unit oil group yukos  revealed  the kremlin said tuesday $ 6bn russian state bank veb lent state-owned rosneft help buy yugansk turn came chinese banks  the revelation came russian government said rosneft signed long-term oil supply deal china  the deal sees rosneft receive $ 6bn credits china s cnpc  according russian newspaper vedomosti  credits would used pay loans rosneft received finance purchase yugansk  reports said cnpc offered 20 % yugansk return providing finance company opted long-term oil supply deal instead  analysts said one factor might influenced chinese decision possibility litigation yukos  yugansk s former owner  cnpc become shareholder  rosneft veb declined comment   the two companies  rosneft cnpc  agreed pre-payment long-term deliveries   said russian oil official sergei oganesyan   there nothing unusual pre-payment five six years   the announcemen
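
The output above shows two side effects of this step: runs of spaces are left where punctuation was removed, and characters outside the list (`$`, `%`, `-`) are kept. A small sketch, assuming the same set of characters, that also collapses the leftover whitespace; `clean_text` is a hypothetical name:

```python
import re

# Characters removed by the cleaning step: ( ) [ ] { } ' , . ` ? : ; ! & ^
PUNCT_RE = re.compile(r"[()\[\]{}',.`?:;!&^]")


def clean_text(text):
    """Lowercase, strip the listed punctuation, and collapse repeated spaces."""
    text = PUNCT_RE.sub("", text.lower())
    return " ".join(text.split())  # split() with no argument drops empty runs


print(clean_text("China lent Russia $ 6bn ( £3.2bn ) , it revealed ."))
# china lent russia $ 6bn £32bn it revealed
```

Note that removing `.` also turns `3.2bn` into `32bn`, so numeric values are altered by this cleaning.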

In [46]:
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### 4) Word lemmatization


In [0]:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet

wordnet_lemmatizer = WordNetLemmatizer()

# The lemma of a word depends on its part of speech (POS), so we first tag the word
# with nltk.pos_tag, map the first letter of the tag to the matching WordNet POS constant,
# and then lemmatize according to that POS.


def get_wordnet_pos(word):

    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def lemmitazing_according_to_word(text) :
    text_tokenized = word_tokenize(text)
    text_lemmatized = []
    for word in text_tokenized :
        text_lemmatized.append(wordnet_lemmatizer.lemmatize(word,get_wordnet_pos(word)))
    return ' '.join(text_lemmatized)

df['news_lemmatized'] = df['news_tokenized_0stopwords'].map(lambda x : lemmitazing_according_to_word(x))       

In [48]:
df['news_lemmatized'][0]

'china role yukos split-up china lent russia $ 6bn â£32bn help russian government renationalise key yuganskneftegas unit oil group yukos reveal the kremlin say tuesday $ 6bn russian state bank veb lent state-owned rosneft help buy yugansk turn come chinese bank the revelation come russian government say rosneft sign long-term oil supply deal china the deal see rosneft receive $ 6bn credit china s cnpc accord russian newspaper vedomosti credit would use pay loan rosneft receive finance purchase yugansk report say cnpc offer 20 % yugansk return provide finance company opt long-term oil supply deal instead analyst say one factor might influence chinese decision possibility litigation yukos yugansk s former owner cnpc become shareholder rosneft veb decline comment the two company rosneft cnpc agree pre-payment long-term delivery say russian oil official sergei oganesyan there nothing unusual pre-payment five six year the announcement help explain rosneft medium-sized indebted relatively un
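
`get_wordnet_pos` tags each word on its own, so the tagger never sees the neighbouring words it normally relies on. One alternative is to tag the whole token list once and lemmatize each word with its in-context tag. The sketch below assumes the NLTK resources downloaded earlier are available; the function names are hypothetical, and the result may differ from the output shown above.

```python
# First letter of a Penn Treebank tag -> WordNet POS code.
# 'a', 'n', 'v', 'r' are the values of wordnet.ADJ, NOUN, VERB and ADV.
PENN_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS; default to noun."""
    return PENN_TO_WORDNET.get(tag[:1].upper(), "n")


def lemmatize_in_context(text):
    """Tokenize, tag the full token sequence once, then lemmatize each word."""
    # Imported here so the mapping above can be used without NLTK data installed.
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)  # one call, so each tag can use its context
    return " ".join(
        lemmatizer.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged
    )
```

A single `pos_tag` call per document also avoids invoking the tagger once per word, which is likely to be faster on long articles.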

### 5) One-hot encoding


In [49]:
df.head()

Unnamed: 0,news,type,news_tokenized_0stopwords,news_lemmatized
0,China had role in Yukos split-up\n \n China le...,business,china role yukos split-up china lent russia $ ...,china role yukos split-up china lent russia $ ...
1,Oil rebounds from weather effect\n \n Oil pric...,business,oil rebounds weather effect oil prices recover...,oil rebound weather effect oil price recover a...
2,Indonesia 'declines debt freeze'\n \n Indonesi...,business,indonesia declines debt freeze indonesia longe...,indonesia decline debt freeze indonesia longer...
3,$1m payoff for former Shell boss\n \n Shell is...,business,$ 1m payoff former shell boss shell pay $ 1m ...,$ 1m payoff former shell bos shell pay $ 1m â£...
4,US bank in $515m SEC settlement\n \n Five Bank...,business,us bank $ 515m sec settlement five bank americ...,u bank $ 515m sec settlement five bank america...


In [0]:
import pandas as pd
data = df.drop(['news','news_tokenized_0stopwords'],axis=1)
one_hot = pd.get_dummies(data["type"])
data = data.drop("type",axis=1)
data = data.join(one_hot)

In [51]:
data.head()

Unnamed: 0,news_lemmatized,business,entertainment,politics,sport,tech
0,china role yukos split-up china lent russia $ ...,1,0,0,0,0
1,oil rebound weather effect oil price recover a...,1,0,0,0,0
2,indonesia decline debt freeze indonesia longer...,1,0,0,0,0
3,$ 1m payoff former shell bos shell pay $ 1m â£...,1,0,0,0,0
4,u bank $ 515m sec settlement five bank america...,1,0,0,0,0
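
To make the encoding explicit, here is a pure-Python sketch of what the table above contains: one column per class in alphabetical order, with a single 1 per row. The helpers `one_hot` and `decode` are hypothetical and only illustrate the idea; they are not how `pd.get_dummies` is implemented.

```python
def one_hot(labels):
    """Return (classes, rows): sorted class names and one 0/1 row per label."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    rows = [
        [1 if index[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]
    return classes, rows


def decode(classes, row):
    """Recover the class name from a one-hot row (position of the maximum)."""
    return classes[row.index(max(row))]


classes, rows = one_hot(["business", "tech", "sport", "business"])
print(classes)                   # ['business', 'sport', 'tech']
print(rows[1])                   # [0, 0, 1]
print(decode(classes, rows[2]))  # sport
```

In recent pandas versions `get_dummies` returns boolean columns by default; passing `dtype=int` gives the 0/1 values shown in the table.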


### 6) Split the dataset for training and testing


In [0]:
from sklearn.model_selection import train_test_split
X = data[['news_lemmatized']]
y = data[['business','entertainment','politics','sport','tech']]
# Preliminary split on the raw text; it is recomputed in step 7 once X has been vectorized
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=30)

### 7) Word vectorization

In [53]:
from sklearn.feature_extraction.text import CountVectorizer

def count_vect(data_clean):
  cv = CountVectorizer(stop_words='english')
  data_cv = cv.fit_transform(data_clean.news_lemmatized)
  # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
  data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
  data_dtm.index = data_clean.index
  return data_dtm

X = count_vect(X)
X.head()  

Unnamed: 0,000,0001,00051,001,002,003,004secs,007,01,0100,0130,019secs,02,0200,022,0227,024,025,027,028,03,030,0300,04,0400,041,0469,048,05,050505,053,0530,0530gmt,053bn,056,06,0619,0630,067,07,...,zen,zenden,zenith,zephaniah,zephyr,zeppelin,zero,zeta,zhang,zhaoxing,zheng,zib,zidane,ziers,zillion,zimbabwe,zinc,zinedine,zip,ziyi,zodiac,zoe,zoellick,zola,zomba,zombic,zombie,zombies,zone,zonealarm,zoom,zooropa,zornotza,zorro,zubair,zuluaga,zurich,zutons,zvonareva,zvyagintsev
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=30)
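
`count_vect` fits the vocabulary on the full dataset before the split, so words that occur only in test documents still become features. A common alternative is to fit on the training texts and only transform the test texts. The sketch below illustrates that ordering with a deliberately tiny pure-Python bag-of-words class. `TinyCountVectorizer` is a hypothetical stand-in, not scikit-learn's `CountVectorizer`, which exposes the same two steps as `fit_transform` and `transform`.

```python
from collections import Counter


class TinyCountVectorizer:
    """Minimal bag-of-words: whitespace tokens, lowercase, sorted vocabulary."""

    def fit(self, texts):
        vocab = sorted({w for t in texts for w in t.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(vocab)}
        return self

    def transform(self, texts):
        rows = []
        for t in texts:
            row = [0] * len(self.vocabulary_)
            for word, count in Counter(t.lower().split()).items():
                if word in self.vocabulary_:  # unseen words are ignored
                    row[self.vocabulary_[word]] = count
            rows.append(row)
        return rows


train_texts = ["oil price rebound", "oil deal sign"]
test_texts = ["oil oil zeppelin"]

vec = TinyCountVectorizer().fit(train_texts)  # vocabulary from training data only
X_train_counts = vec.transform(train_texts)
X_test_counts = vec.transform(test_texts)     # 'zeppelin' is not a feature
print(sorted(vec.vocabulary_))  # ['deal', 'oil', 'price', 'rebound', 'sign']
print(X_test_counts)            # [[0, 2, 0, 0, 0]]
```

With this ordering the test set cannot influence the feature space, which gives a more honest estimate of how the model handles unseen text.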

## II. Model training 


### - Create a simple neural network architecture (MLP...) and use it to train the model

In [55]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier()
mlp = mlp.fit(X_train,y_train)
mlp_pred = mlp.predict(X_test)
mlp.score(X_test,y_test)

0.9251497005988024

### - Print the number of misclassified predictions (optional)

In [60]:
import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix

cnfe = confusion_matrix(
    y_test.values.argmax(axis=1), mlp_pred.argmax(axis=1))
cnfe

array([[144,   0,   1,   0,   0],
       [  8, 100,   0,   0,   0],
       [ 14,   0, 112,   0,   0],
       [  1,   0,   0, 166,   0],
       [ 10,   2,   1,   1, 108]])

The number of misclassified texts is 38.
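
The count can be derived from the confusion matrix itself: correct predictions sit on the diagonal, so the misclassified ones are the total minus the diagonal sum. A short sketch using the matrix printed above (variable names are illustrative):

```python
# Confusion matrix from the cell above (rows: true class, columns: predicted).
cm = [
    [144, 0, 1, 0, 0],
    [8, 100, 0, 0, 0],
    [14, 0, 112, 0, 0],
    [1, 0, 0, 166, 0],
    [10, 2, 1, 1, 108],
]

total = sum(sum(row) for row in cm)              # all test texts
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal entries
misclassified = total - correct

print(total, correct, misclassified)  # 668 630 38
print(round(correct / total, 4))      # 0.9431
```

The accuracy implied by the matrix (about 0.943) is higher than the 0.925 returned by `mlp.score`. A likely explanation is that, with one-hot targets, `MLPClassifier` treats the task as multilabel and `score` counts a row as correct only when all five outputs match exactly, whereas `argmax` always picks one class, and a row of all zeros maps to index 0. That would also be consistent with most errors falling in the first column.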

In [0]:
import numpy as np  # needed for np.arange below
from sklearn.model_selection import GridSearchCV

parameters = {
    'solver': ['lbfgs'],
    'max_iter': [1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000],
    'alpha': 10.0 ** -np.arange(1, 10),
    'hidden_layer_sizes': np.arange(10, 15),
    'random_state': [0,1,2,3,4,5,6,7,8,9]
}
clf = GridSearchCV(MLPClassifier(), parameters, n_jobs=-1)

clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.best_params_)
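
Before launching this search it helps to estimate its cost, since `GridSearchCV` fits one model per parameter combination and per cross-validation fold. The sketch below counts the combinations in the grid above; the fold count of 5 is an assumption based on the default `cv` in recent scikit-learn versions.

```python
import math

# Number of candidate values per parameter in the grid above.
grid_sizes = {
    "solver": 1,              # ['lbfgs']
    "max_iter": 11,           # 1000, 1100, ..., 2000
    "alpha": 9,               # 10.0 ** -np.arange(1, 10)
    "hidden_layer_sizes": 5,  # np.arange(10, 15)
    "random_state": 10,       # 0, 1, ..., 9
}

n_candidates = math.prod(grid_sizes.values())
cv_folds = 5  # assumption: default cv of GridSearchCV
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 4950 24750
```

With several thousand candidates on a wide document-term matrix, the search may take a long time. Dropping `random_state` from the grid (it is a seed rather than a model hyperparameter) or sampling the grid with `RandomizedSearchCV` are two ways to reduce the cost.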


