<a href="https://colab.research.google.com/github/txusser/Master_IA_Sanidad/blob/main/Modulo_3/NLP1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We install some dependencies.

In [1]:
!mkdir tutorial
!git clone https://github.com/scikit-learn/scikit-learn.git
!pip install scikit-learn
!cp -r scikit-learn/doc/tutorial/text_analytics/ tutorial
!python tutorial/text_analytics/data/twenty_newsgroups/fetch_data.py

mkdir: cannot create directory ‘tutorial’: File exists
fatal: destination path 'scikit-learn' already exists and is not an empty directory.


We download the dataset that we will use in this example.
It is the 20 Newsgroups dataset: a collection of newsgroup posts (approximately 20,000 documents in total, of which the training subset loaded here contains 11,314), divided almost evenly into 20 groups based on the topics they cover. It was originally collected by Ken Lang, likely for his paper "Newsweeder: Learning to filter netnews."

In [2]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',shuffle=True)
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [3]:
len(twenty_train.data)

11314

In [4]:
for t in twenty_train.target[:10]:
  print(twenty_train.target_names[t])

rec.autos
comp.sys.mac.hardware
comp.sys.mac.hardware
comp.graphics
sci.space
talk.politics.guns
sci.med
comp.sys.ibm.pc.hardware
comp.os.ms-windows.misc
comp.sys.mac.hardware


In [5]:
print("\n".join(twenty_train.data[4].split("\n")))

From: jcm@head-cfa.harvard.edu (Jonathan McDowell)
Subject: Re: Shuttle Launch Question
Organization: Smithsonian Astrophysical Observatory, Cambridge, MA,  USA
Distribution: sci
Lines: 23

From article <C5owCB.n3p@world.std.com>, by tombaker@world.std.com (Tom A Baker):
>>In article <C5JLwx.4H9.1@cs.cmu.edu>, ETRAT@ttacs1.ttu.edu (Pack Rat) writes...
>>>errors. ...".  I am wondering what an "expected error" might
>>>be.  Sorry if this is a really dumb question, but
> 
> Parity errors in memory or previously known conditions that were waivered.
>    "Yes that is an error, but we already knew about it"
> I'd be curious as to what the real meaning of the quote is.
> 
> tom


My understanding is that the 'expected errors' are basically
that don't have the right values in yet because they aren't
set till after launch, and suchlike. Rather than fix the code
and possibly introduce new bugs, they just tell the crew

 - Jonathan





Text preprocessing, tokenization, and stop-word filtering are all included in *CountVectorizer*, which builds a dictionary of features and transforms the documents into feature vectors.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)
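To make the shape above less abstract, here is a minimal sketch on a two-document toy corpus. The names `toy_docs`, `toy_vect` and `toy_counts` are made up for this illustration and are not part of the tutorial data. Each column of the matrix corresponds to one vocabulary entry, and each cell holds how often that word occurs in the document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration.
toy_docs = ["the cat sat", "the cat sat on the mat"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index; sort tokens by that index.
columns = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(columns)               # ['cat', 'mat', 'on', 'sat', 'the']

# One row per document, one column per vocabulary entry.
print(toy_counts.toarray())  # [[1 0 0 1 1]
                             #  [1 1 1 1 2]]
```

With default settings no stop words are removed, which is why "the" and "on" still appear as columns here.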

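The fitted vectorizer stores its vocabulary as a dictionary that maps each word to its column index in the count matrix. Let's look up the index of one word: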

In [7]:
count_vect.vocabulary_.get(u'algorithm')

27366

Counting occurrences is a good start, but there's an issue: longer documents will have higher average count values than shorter ones, even if they discuss the same topics.

To avoid these potential discrepancies, we simply divide the number of occurrences of each word in a document by the total number of words in that document. These new features are called Term Frequencies (tf).

Another refinement, besides transforming the values into tf, is to reduce the weights of words that appear in many documents across the corpus, as they are less informative than words occurring in a smaller portion of the corpus.

This downscaling is called tf-idf, which stands for "Term Frequency times Inverse Document Frequency."
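The two ideas can be worked through by hand on a tiny count matrix. This is a sketch under stated assumptions: `counts` is a made-up matrix, tf is computed exactly as described above (count divided by document length), and the idf formula is the smoothed variant `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default. Note that *TfidfTransformer* by default rescales each row to unit Euclidean length instead of dividing by the word total, so its numbers will differ from the ones printed here.

```python
import numpy as np

# Hypothetical counts: 3 documents (rows) x 3 words (columns).
counts = np.array([[2, 1, 0],
                   [4, 0, 2],
                   [1, 0, 0]], dtype=float)

# Term frequency: each count divided by the total words in its document.
tf = counts / counts.sum(axis=1, keepdims=True)

# Document frequency: in how many documents each word appears.
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed formula, see lead-in).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Words that occur in every document keep weight 1; rarer words are boosted.
tfidf = tf * idf

print(np.round(tf, 3))
print(np.round(idf, 3))    # [1.    1.693 1.693]
print(np.round(tfidf, 3))
```

The first word appears in all three documents, so its idf stays at 1, while the two rarer words receive a larger weight.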

Both tf and tf-idf can be calculated as follows using *TfidfTransformer*:

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

In the example code above, we first use the "*fit*" method to fit our estimator to the data and, second, the "*transform*" method to convert our count matrix into a tf representation (with *use_idf=False* the idf weighting is skipped; with the default settings the result is a tf-idf representation). These two steps can be combined to achieve the same final result while avoiding redundant processing. This is done by using the "*fit_transform*" method.

Exercise: Replace with the "*fit_transform*" method.

In [9]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)
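As a quick sanity check, the following sketch confirms on a small made-up count matrix (`toy_counts`, not the newsgroup data) that calling "*fit*" followed by "*transform*" yields the same matrix as a single "*fit_transform*" call.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 3 documents x 3 words.
toy_counts = np.array([[2, 1, 0],
                       [4, 0, 2],
                       [1, 0, 0]])

# Two explicit steps: learn the idf weights, then apply them.
two_step = TfidfTransformer().fit(toy_counts).transform(toy_counts)

# One combined step.
one_step = TfidfTransformer().fit_transform(toy_counts)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)  # True
```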

Now that we have our features, we can train a classifier to try to predict the category of a post. Let's start with a naive Bayes classifier, which provides a good baseline for this task. There are several variants of this classifier; the most suitable for word counts is the multinomial variant.

In [10]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
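To see what the multinomial variant does with counts, here is a minimal sketch on invented data. The two columns stand for the counts of two hypothetical words ("engine" and "orbit"), and `toy_X`, `toy_y` and `toy_nb` are names introduced only for this example. The smoothing parameter *alpha* is set explicitly to 1.0.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts; columns: "engine", "orbit".
toy_X = np.array([[3, 0],
                  [2, 1],
                  [0, 3],
                  [1, 2]])
toy_y = np.array([0, 0, 1, 1])  # 0 = cars, 1 = space (made-up labels)

toy_nb = MultinomialNB(alpha=1.0).fit(toy_X, toy_y)

# Two unseen documents: one mentions only "engine", the other only "orbit".
toy_new = np.array([[2, 0],
                    [0, 2]])
toy_pred = toy_nb.predict(toy_new)
toy_proba = toy_nb.predict_proba(toy_new)

print(toy_pred)                # [0 1]
print(np.round(toy_proba, 3))  # rows sum to 1, one row per document
```

The class whose training documents contained more of the observed word receives the higher probability.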

To try to predict the outcome for a new document, we need to extract the features using almost the same feature extraction pipeline as before. The difference is that we call "*transform*" instead of "*fit_transform*", since the pipeline has already been fitted to the training set.

In [11]:
docs_new = ['Jesus Loves You. Follow his guidance and you will be able to reach heaven', 'Earth is 7 light years away from the sun','You need to ensure that the engine temperature does not reach above 90 degrees']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
  print('%r => %s' % (doc, twenty_train.target_names[category]))

'Jesus Loves You. Follow his guidance and you will be able to reach heaven' => soc.religion.christian
'Earth is 7 light years away from the sun' => sci.space
'You need to ensure that the engine temperature does not reach above 90 degrees' => rec.autos
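The reason for calling "*transform*" rather than "*fit_transform*" on new documents can be illustrated with a small sketch. The corpus and names below (`train_docs`, `new_docs`, `vect`) are invented for the example. The vocabulary is fixed at fit time, so new documents are mapped onto the same columns and words never seen during training are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and new documents.
train_docs = ["engine oil change", "launch orbit engine"]
new_docs = ["engine warp drive"]  # "warp" and "drive" were never seen

vect = CountVectorizer()
train_counts = vect.fit_transform(train_docs)  # learns the vocabulary
new_counts = vect.transform(new_docs)          # reuses the vocabulary

# Both matrices share the same number of columns.
print(train_counts.shape, new_counts.shape)  # (2, 5) (1, 5)

# Only "engine" (column 1) is counted; unseen words are dropped.
print(new_counts.toarray())                  # [[0 1 0 0 0]]
```

Calling "*fit_transform*" on the new documents instead would build a different vocabulary, and the resulting columns would no longer match what the classifier was trained on.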


To make the vectorizer => transformer => classifier pipeline easier to work with, there is a *Pipeline* class that behaves like a composite classifier.

In [12]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', MultinomialNB())])
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())])
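The following sketch checks, on an invented four-document corpus, that a *Pipeline* produces the same predictions as chaining the three steps by hand. All names (`toy_docs`, `toy_labels`, `new_docs`, `pipe`) are made up for the illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data.
toy_docs = ["engine oil brake", "car engine wheel",
            "orbit rocket launch", "moon orbit probe"]
toy_labels = [0, 0, 1, 1]
new_docs = ["rocket engine launch orbit", "car brake oil"]

# Manual chain: fit each step on the training data, then transform new data.
vect, tfidf, nb = CountVectorizer(), TfidfTransformer(), MultinomialNB()
X_train = tfidf.fit_transform(vect.fit_transform(toy_docs))
nb.fit(X_train, toy_labels)
manual_pred = nb.predict(tfidf.transform(vect.transform(new_docs)))

# Same steps wrapped in a Pipeline: fit and predict take raw text directly.
pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", MultinomialNB())])
pipe.fit(toy_docs, toy_labels)
pipe_pred = pipe.predict(new_docs)

print(np.array_equal(manual_pred, pipe_pred))  # True
```

The step names ("vect", "tfidf", "clf") are arbitrary labels; they become useful later when parameters of individual steps have to be addressed.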

We can use the same dataset to evaluate the predictive accuracy of the model. For this, the dataset reserves a test portion:

In [13]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test',shuffle=True)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.7738980350504514
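The expression `np.mean(predicted == twenty_test.target)` is simply the fraction of correct predictions. As a small sketch with invented label arrays (`y_true`, `y_pred`), it gives the same number as `accuracy_score` from `sklearn.metrics`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 2])

# Comparing the arrays gives booleans; their mean is the share of matches.
manual_acc = np.mean(y_pred == y_true)
library_acc = accuracy_score(y_true, y_pred)

print(manual_acc, library_acc)  # both are 4/6, about 0.667
```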

We achieved an accuracy of about 77%. Let's see if we can do better with a linear Support Vector Machine (SVM), which is widely considered one of the best text classification algorithms (although it is slightly slower than Naive Bayes). In scikit-learn, *SGDClassifier* with *loss='hinge'* trains a linear SVM by stochastic gradient descent:

In [14]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42,max_iter=5, tol=None))])
text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.8248805098247477

We have improved accuracy by about 5 percentage points (from roughly 77% to 82%) using an SVM. scikit-learn provides additional utilities for a more detailed performance analysis of the results:

In [15]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,target_names=twenty_test.target_names))


                          precision    recall  f1-score   support

             alt.atheism       0.73      0.71      0.72       319
           comp.graphics       0.78      0.72      0.75       389
 comp.os.ms-windows.misc       0.73      0.78      0.75       394
comp.sys.ibm.pc.hardware       0.74      0.67      0.70       392
   comp.sys.mac.hardware       0.81      0.83      0.82       385
          comp.windows.x       0.84      0.76      0.80       395
            misc.forsale       0.84      0.90      0.87       390
               rec.autos       0.91      0.90      0.90       396
         rec.motorcycles       0.93      0.96      0.95       398
      rec.sport.baseball       0.88      0.90      0.89       397
        rec.sport.hockey       0.88      0.99      0.93       399
               sci.crypt       0.84      0.96      0.90       396
         sci.electronics       0.83      0.62      0.71       393
                 sci.med       0.87      0.86      0.87       396
         

In [16]:
# Compute the confusion matrix once (rows: true class, columns: predicted class).
cm = metrics.confusion_matrix(twenty_test.target, predicted)
names = twenty_train.target_names
# List every pair of distinct classes confused more than 20 times.
for i, true_name in enumerate(names):
  for j, predicted_name in enumerate(names):
    if i != j and cm[i, j] > 20:
      print('%s  -  %s = %s' % (true_name, predicted_name, cm[i, j]))

alt.atheism  -  soc.religion.christian = 44
comp.graphics  -  comp.os.ms-windows.misc = 21
comp.graphics  -  comp.windows.x = 24
comp.os.ms-windows.misc  -  comp.sys.ibm.pc.hardware = 21
comp.sys.ibm.pc.hardware  -  comp.os.ms-windows.misc = 27
comp.sys.ibm.pc.hardware  -  comp.sys.mac.hardware = 26
comp.sys.ibm.pc.hardware  -  sci.electronics = 22
comp.sys.mac.hardware  -  comp.sys.ibm.pc.hardware = 22
comp.windows.x  -  comp.graphics = 32
comp.windows.x  -  comp.os.ms-windows.misc = 42
rec.sport.baseball  -  rec.sport.hockey = 32
sci.electronics  -  comp.sys.ibm.pc.hardware = 22
sci.electronics  -  sci.crypt = 30
talk.politics.misc  -  talk.politics.guns = 102
talk.religion.misc  -  alt.atheism = 45
talk.religion.misc  -  soc.religion.christian = 56
talk.religion.misc  -  talk.politics.guns = 22
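The precision and recall columns of the classification report can be derived directly from a confusion matrix. The sketch below uses a small invented three-class example (`y_true`, `y_pred`) rather than the newsgroup predictions. In scikit-learn's convention, rows are true classes and columns are predicted classes, so precision divides the diagonal by the column sums and recall divides it by the row sums.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for three classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)  # [[2 1 0]
           #  [0 2 0]
           #  [1 0 2]]

correct = np.diag(cm)
precision = correct / cm.sum(axis=0)  # correct / everything predicted as the class
recall = correct / cm.sum(axis=1)     # correct / everything truly in the class

print(np.round(precision, 2))  # [0.67 0.67 1.  ]
print(np.round(recall, 2))     # [0.67 1.   0.67]
```

Large off-diagonal cells, like the ones listed above for closely related newsgroups, lower the recall of the row class and the precision of the column class.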


We have already encountered some parameters like *use_idf* in the *TfidfTransformer*. Classifiers tend to have many parameters; for example, *MultinomialNB* includes a smoothing parameter *alpha*, and *SGDClassifier* has a penalty parameter *alpha* along with configurable loss and penalty terms in the objective function.

We can try optimizing these parameters to improve our classification. Instead of manually tuning the parameters of the various pipeline components, we can perform an exhaustive search for the best parameters across a grid of possible values. For instance, we can try the classifier on either single words or bigrams, with or without idf, and with a penalty parameter of 0.01 or 0.001 for the linear SVM.

Such an exhaustive search can be computationally expensive. If we have multiple CPU cores, we can instruct the process to test these eight parameter combinations in parallel. By setting the *n_jobs* parameter to -1, grid search will detect how many cores are available and use all of them:

In [20]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False),'clf__alpha': (1e-2, 1e-3)}
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data[:1000], twenty_train.target[:1000])
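The keys of the parameter dictionary follow the pattern `<step name>__<parameter name>`, which is how the search reaches inside the pipeline. To see where the eight combinations come from, the grid can be enumerated with `ParameterGrid` from `sklearn.model_selection`. This is only an illustrative sketch; the variable names `grid` and `combos` are made up.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above: 2 x 2 x 2 candidate values.
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}

grid = ParameterGrid(parameters)
combos = list(grid)

print(len(combos))  # 8
for combo in combos:
    print(combo)
```

With `cv=5`, each of the 8 combinations is fitted on 5 train/validation splits, which gives 40 fits before the best configuration is refitted on all the data passed to `fit`.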

The *best_score_* and *best_params_* attributes store the best mean cross-validated score and the parameter configuration corresponding to that score:

In [21]:
gs_clf.best_score_

0.766
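The winning configuration itself is held in *best_params_*. Since its values depend on the run above, the sketch below shows how to read it using a small self-contained search instead. The eight-document corpus and the names `toy_docs`, `toy_labels`, `toy_pipe`, `toy_grid` and `toy_search` are invented for the illustration, and `cv=2` is chosen only because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two classes (0 = cars, 1 = space).
toy_docs = ["engine oil brake", "car engine wheel",
            "brake wheel tyre", "car tyre oil",
            "orbit rocket launch", "moon orbit probe",
            "rocket probe launch", "moon launch orbit"]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])
toy_grid = {"tfidf__use_idf": (True, False), "clf__alpha": (1e-2, 1e-3)}

toy_search = GridSearchCV(toy_pipe, toy_grid, cv=2)
toy_search.fit(toy_docs, toy_labels)

# best_params_ is a dict keyed by the same names used in the grid.
for name in sorted(toy_grid):
    print("%s: %r" % (name, toy_search.best_params_[name]))

# cv_results_ keeps one entry per evaluated combination.
print(len(toy_search.cv_results_["params"]), "combinations evaluated")
```

The same loop over `sorted(parameters)` can be applied to `gs_clf.best_params_` from the search above.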