<a href="https://colab.research.google.com/github/Viny2030/UNED/blob/main/inputText_Classification_using_SpaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction

#### About the dataset
We will use the Amazon Fine Food Reviews dataset.

#### What are we trying to achieve?
We are going to tackle an interesting natural language processing problem: sentiment classification, a form of text classification.
We will explore textual data with the spaCy library and build a text classification model.

### Here is a breakdown of the concepts I will try to explain.
We will extract the following linguistic features, which are used later when building language models:
1. tokenization,
1. part-of-speech tagging,
1. dependency parsing,
1. lemmatization,
1. named entity recognition,
1. sentence boundary detection.

Visualizing data
1. explacy - explaining how parsing is done
1. displaCy - visualizing named entities

Word vectors and similarity
1. sense2vec - using contextual information for building word embeddings

Text classification model
1. spaCy TextCategorizer



### Loading data

In [1]:

!pip install -U pip setuptools wheel
!pip install -U spacy



In [2]:
!python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.8.0/en_core_web_trf-3.8.0-py3-none-any.whl (457.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 457.4/457.4 MB 27.5 MB/s eta 0:00:00
Collecting spacy-curated-transformers<1.0.0,>=0.2.2 (from en-core-web-trf==3.8.0)
  Downloading spacy_curated_transformers-0.3.0-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting curated-transformers<0.2.0,>=0.1.0 (from spacy-curated-transformers<1.0.0,>=0.2.2->en-core-web-trf==3.8.0)
  Downloading curated_transformers-0.1.1-py2.py3-none-any.whl.metadata (965 bytes)
Collecting curated-tokenizers<0.1.0,>=0.0.9 (from spacy-curated-transformers<1.0.0,>=0.2.2->en-core-web-trf==3.8.0)
  Downloading curated_tokenizers-0.0.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Downloading spacy_curated_transformers-0.3.0-py2.py3-none-any.whl (236 kB)
Downloading cur

In [3]:
import pandas as pd
import numpy as np
import spacy
from spacy import displacy
from spacy.util import minibatch, compounding



In [4]:
!pip install --upgrade spacy



In [5]:
!pip install matplotlib

Collecting numpy<2,>=1.21 (from matplotlib)
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 54.0 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
blis 1.0.1 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.
thinc 8.3.2 requires numpy<2.1.0,>=2.0.0; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
Successfully installed numpy-1.26.4


Let's load a spaCy pipeline and parse a few sample sentences before turning to the food reviews data.
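Later cells refer to a `train_df` DataFrame with `Text` and `Score` columns, but no cell shown here creates it. Below is a minimal sketch of one way to build it, assuming the Kaggle `Reviews.csv` layout (a 1 to 5 `Score` column and a `Text` column). The helper name `load_reviews`, the choice to drop 3-star reviews, and the mapping of 4-5 stars to 1 and 1-2 stars to 0 are assumptions, not something the notebook confirms.

```python
import io
import pandas as pd

def load_reviews(source, n_rows=None):
    """Read the reviews CSV and return a frame with Text and a binary Score.

    `source` may be a file path or an open file-like object.
    """
    df = pd.read_csv(source, usecols=["Score", "Text"], nrows=n_rows)
    df = df.dropna(subset=["Text"])
    # Assumption: 3-star reviews are treated as neutral and dropped.
    df = df[df["Score"] != 3].copy()
    # Assumption: 4-5 stars become 1 (positive), 1-2 stars become 0 (negative).
    df["Score"] = (df["Score"] > 3).astype(int)
    return df.reset_index(drop=True)

# Tiny in-memory stand-in for the real file, so the sketch runs anywhere.
sample_csv = io.StringIO(
    "Id,Score,Text\n"
    "1,5,Great tea\n"
    "2,1,Awful smell\n"
    "3,3,It is okay\n"
    "4,4,Tasty snack\n"
)
train_df = load_reviews(sample_csv)
print(train_df)
```

With the real data the call would look like `train_df = load_reviews("Reviews.csv")`, where the path is a placeholder for wherever the file lives.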

In [9]:
import spacy
nlp= spacy.load("en_core_web_trf")
doc = nlp("Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")
for token in doc:
    # token text // coarse part-of-speech tag // dependency relation to the head
    print(token.text, "//", token.pos_, "//", token.dep_)

Jane // PROPN // nsubj
bought // VERB // ROOT
me // PRON // dative
these // DET // det
books // NOUN // dobj
. // PUNCT // punct
Jane // PROPN // nsubj
bought // VERB // ROOT
a // DET // det
book // NOUN // dobj
for // ADP // dative
me // PRON // pobj
. // PUNCT // punct
She // PRON // nsubj
dropped // VERB // ROOT
a // DET // det
line // NOUN // dobj
to // ADP // prep
him // PRON // pobj
. // PUNCT // punct
Thank // VERB // ROOT
you // PRON // dobj
. // PUNCT // punct
She // PRON // nsubj
sleeps // VERB // ROOT
. // PUNCT // punct
I // PRON // nsubj
sleep // VERB // ROOT
a // DET // det
lot // NOUN // npadvmod
. // PUNCT // punct
I // PRON // nsubjpass
was // AUX // auxpass
born // VERB // ROOT
in // ADP // prep
Madrid.the // PROPN // punct
cat // NOUN // nsubjpass
was // AUX // auxpass
chased // VERB // ROOT
by // ADP // agent
the // DET // det
dog // NOUN // pobj
. // PUNCT // punct
I // PRON // nsubjpass
was // AUX // auxpass
born // VERB // ROOT
in // ADP // prep
Madrid // PROPN /

In [10]:
doc

Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.

### Linguistic features

#### Tokenization
The first step in any NLP pipeline is tokenizing the text, i.e. breaking paragraphs down into sentences and then sentences into words, punctuation marks and so on.

We will load an English language model to tokenize our English text.

Every language is different and has different rules, so spaCy provides separate trained pipelines for many languages.

In [11]:
for token in doc:
    print(token.text, "//", token.pos_, "//", token.dep_)

Jane // PROPN // nsubj
bought // VERB // ROOT
me // PRON // dative
these // DET // det
books // NOUN // dobj
. // PUNCT // punct
Jane // PROPN // nsubj
bought // VERB // ROOT
a // DET // det
book // NOUN // dobj
for // ADP // dative
me // PRON // pobj
. // PUNCT // punct
She // PRON // nsubj
dropped // VERB // ROOT
a // DET // det
line // NOUN // dobj
to // ADP // prep
him // PRON // pobj
. // PUNCT // punct
Thank // VERB // ROOT
you // PRON // dobj
. // PUNCT // punct
She // PRON // nsubj
sleeps // VERB // ROOT
. // PUNCT // punct
I // PRON // nsubj
sleep // VERB // ROOT
a // DET // det
lot // NOUN // npadvmod
. // PUNCT // punct
I // PRON // nsubjpass
was // AUX // auxpass
born // VERB // ROOT
in // ADP // prep
Madrid.the // PROPN // punct
cat // NOUN // nsubjpass
was // AUX // auxpass
chased // VERB // ROOT
by // ADP // agent
the // DET // det
dog // NOUN // pobj
. // PUNCT // punct
I // PRON // nsubjpass
was // AUX // auxpass
born // VERB // ROOT
in // ADP // prep
Madrid // PROPN /

In [12]:
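spacy_tok = spacy.load('en_core_web_sm')
# Re-parse the raw text with the small model (pass the text rather than the Doc built by the other pipeline)
car_lin = spacy_tok(doc.text)
car_lin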



Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.

The parsed text looks just like the original, but we will see below what actually happened.
We can see how the parsing was done visually through **explacy**.

In [13]:
!wget https://raw.githubusercontent.com/tylerneylon/explacy/master/explacy.py

--2024-12-03 14:07:37--  https://raw.githubusercontent.com/tylerneylon/explacy/master/explacy.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6896 (6.7K) [text/plain]
Saving to: ‘explacy.py’


2024-12-03 14:07:38 (44.5 MB/s) - ‘explacy.py’ saved [6896/6896]



In [16]:
import explacy
explacy.print_parse_info(spacy_tok,'Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.')


Dep tree           Token      Dep type  Lemma      Part of Sp
────────────────── ────────── ───────── ────────── ──────────
               ┌─► Jane       nsubj     Jane       PROPN     
           ┌┬──┼── bought     ROOT      buy        VERB      
           ││  └─► me         dative    I          PRON      
           ││  ┌─► these      det       these      DET       
           │└─►└── books      dobj      book       NOUN      
           └─────► .          punct     .          PUNCT     
               ┌─► Jane       nsubj     Jane       PROPN     
          ┌┬┬──┴── bought     ROOT      buy        VERB      
          │││  ┌─► a          det       a          DET       
          ││└─►└── book       dobj      book       NOUN      
          │└──►┌── for        dative    for        ADP       
          │    └─► me         pobj      I          PRON      
          └──────► .          punct     .          PUNCT     
               ┌─► She        nsubj     she        PRON      
        

#### Part-of-speech tagging
After tokenization we can parse the text and tag each token with its part of speech. spaCy uses statistical models in the background to predict which tag applies to each word based on its context.

##### Lemmatization
Lemmatization is the process of extracting the uninflected (base) form of a word, called its lemma.
For example:

Adjectives: best, better → good
Adverbs: worse, worst → badly
Nouns: ducks, children → duck, child
Verbs: standing, stood → stand


In [22]:
!pip install numpy --upgrade

Collecting numpy
  Downloading numpy-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Downloading numpy-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.3/16.3 MB 86.7 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cupy-cuda12x 12.2.0 requires numpy<1.27,>=1.20, but you have numpy 2.1.3 which is incompatible.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.1.3 which is incompatible.
langchain 0.3.7 requires numpy<2,>=1; python_version < "3.12", but you have numpy 2.1.3 which is incompatible

In [23]:
# Collect one row of attributes per token, then build the DataFrame in one step
rows = []
for token in car_lin:
    rows.append({
        'text': token.text,
        'lemma': token.lemma_,  # no trailing comma: it would turn the value into a 1-tuple
        'pos': token.pos_,
        'tag': token.tag_,
        'dep': token.dep_,
        'shape': token.shape_,
        'is_alpha': token.is_alpha,
        'is_stop': token.is_stop,
        'is_punctuation': token.is_punct,
    })
tokenized_text = pd.DataFrame(rows)

tokenized_text[:20]

| |text|lemma|pos|tag|dep|shape|is_alpha|is_stop|is_punctuation|
|---|---|---|---|---|---|---|---|---|---|
|0|Jane|Jane|PROPN|NNP|nsubj|Xxxx|True|False|False|
|1|bought|buy|VERB|VBD|ROOT|xxxx|True|False|False|
|2|me|I|PRON|PRP|dative|xx|True|True|False|
|3|these|these|DET|DT|det|xxxx|True|True|False|
|4|books|book|NOUN|NNS|dobj|xxxx|True|False|False|
|5|.|.|PUNCT|.|punct|.|False|False|True|
|6|Jane|Jane|PROPN|NNP|nsubj|Xxxx|True|False|False|
|7|bought|buy|VERB|VBD|ROOT|xxxx|True|False|False|
|8|a|a|DET|DT|det|x|True|True|False|
|9|book|book|NOUN|NN|dobj|xxxx|True|False|False|


#### Named Entity Recognition (NER)
A named entity is a real-world object with a name, such as a person or an organization.

spaCy recognizes the following entity types automatically:

|Type|Description|
|------|------|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including "%".|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|"first", "second", etc.|
|CARDINAL|Numerals that do not fall under another type.|

In [24]:
spacy.displacy.render(car_lin, style='ent', jupyter=True)

In [25]:
spacy.explain('GPE')  # describe an entity label

'Countries, cities, states'
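displaCy draws whatever is stored in `doc.ents`, and the same spans can be read programmatically. A small sketch follows. To stay runnable without downloading a trained model, it uses a blank English pipeline with a rule-based `entity_ruler` and three hand-written patterns. These patterns are illustrative assumptions; the notebook's own entities come from the statistical NER in `en_core_web_sm`.

```python
import spacy

# Blank pipeline: tokenizer only, no trained components.
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")
# Hypothetical patterns, chosen to mirror the sample sentences.
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Jane"},
    {"label": "GPE", "pattern": "Madrid"},
    {"label": "DATE", "pattern": "1995"},
])

doc_rules = nlp_rules("Jane was born in Madrid during 1995.")
# Each entity is a Span with its text, label and character offsets.
entities = [(ent.text, ent.label_, ent.start_char, ent.end_char)
            for ent in doc_rules.ents]
for row in entities:
    print(row)
```

The same loop over `car_lin.ents` would list the entities that the trained model predicted.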

#### Dependency parsing
Syntactic parsing, or dependency parsing, is the process of assigning a syntactic structure to each sentence: every token is linked to a head token by a labelled relation such as subject (`nsubj`) or direct object (`dobj`).
spaCy provides a parse tree that exposes this structure.
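To make the head and child relations concrete, here is a minimal sketch that rebuilds the parse printed earlier for "Jane bought me these books." by hand and walks it. Constructing the `Doc` from explicit `heads` and `deps` is only a way to run this without a trained parser; with a loaded model the same attributes are filled in automatically.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["Jane", "bought", "me", "these", "books", "."]
heads = [1, 1, 1, 4, 1, 1]  # absolute index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "dative", "det", "dobj", "punct"]
parsed = Doc(vocab, words=words, heads=heads, deps=deps)

# The root verb is the token whose relation is ROOT.
root = next(tok for tok in parsed if tok.dep_ == "ROOT")
# Its direct children carry the subject and object relations.
subjects = [tok.text for tok in root.children if tok.dep_ == "nsubj"]
objects = [tok.text for tok in root.children if tok.dep_ == "dobj"]
# A subtree gives the full phrase headed by a token.
object_phrase = [" ".join(t.text for t in tok.subtree)
                 for tok in root.children if tok.dep_ == "dobj"]
print(root.text, subjects, objects, object_phrase)
```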

##### Sentence boundary detection
Figuring out where a sentence starts and ends is a very important part of NLP.

In [26]:
sentence_spans = list(car_lin.sents)
sentence_spans

[Jane bought me these books.,
 Jane bought a book for me.,
 She dropped a line to him.,
 Thank you.,
 She sleeps.,
 I sleep a lot.,
 I was born in Madrid.the,
 cat was chased by the dog.,
 I was born in Madrid during 1995.Out of all this,
 , something good will come.,
 Susan left after the rehearsal.,
 She did it well.,
 She sleeps during the morning, but she sleeps.]
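Two of the spans above are split in the wrong place ("I was born in Madrid.the" and "...1995.Out of all this") because the sample text has no space after some periods. A rough sketch of one way to repair the input before parsing, using a blank pipeline with the rule-based `sentencizer` so that no model download is needed. The regular expression is an assumption for this sample text: it would also split abbreviations and URLs, so treat it as a heuristic only.

```python
import re
import spacy

nlp_sent = spacy.blank("en")
nlp_sent.add_pipe("sentencizer")  # rule-based sentence splitter

raw = "I was born in Madrid.the cat was chased by the dog."
# Assumption: a period directly followed by a letter marks a missing space.
fixed = re.sub(r"\.(?=[A-Za-z])", ". ", raw)

raw_tokens = [t.text for t in nlp_sent(raw)]
fixed_sents = [s.text for s in nlp_sent(fixed).sents]
print(raw_tokens)   # the tokenizer keeps "Madrid.the" as one token
print(fixed_sents)  # after the fix, two sentences are found
```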

In [27]:
displacy.render(car_lin, style='dep', jupyter=True,options={'distance': 140})

Scroll down if you can't see the output above.
You can also customize the dependency visualizer's output, as below.

In [28]:
options = {'compact': True, 'bg': 'violet','distance': 140,
           'color': 'white', 'font': 'Trebuchet MS'}
displacy.render(car_lin, jupyter=True, style='dep', options=options)
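Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. A hedged sketch: it passes `jupyter=False` and `manual=True` with a hand-written parse, so it runs without a trained model. The three-word sentence and its arcs are made up for illustration.

```python
from spacy import displacy

# Hand-written parse in displaCy's manual format (illustrative data).
manual_parse = {
    "words": [
        {"text": "Jane", "tag": "PROPN"},
        {"text": "bought", "tag": "VERB"},
        {"text": "books", "tag": "NOUN"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 1, "end": 2, "label": "dobj", "dir": "right"},
    ],
}

# jupyter=False makes render() return the SVG markup as a string.
svg = displacy.render(
    manual_parse,
    style="dep",
    manual=True,
    jupyter=False,
    options={"compact": True, "distance": 140},
)
print(svg[:60])
```

The returned string can be written to an `.svg` file and opened in a browser.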

In [29]:
spacy.explain("ADJ") ,spacy.explain("det") ,spacy.explain("ADP") ,spacy.explain("prep")  # to understand tags

('adjective', 'determiner', 'adposition', 'prepositional modifier')

#### Processing Noun chunks

In [30]:
# One row per noun chunk; build the DataFrame from a list of dicts
rows = []
for chunk in car_lin.noun_chunks:
    rows.append({
        'text': chunk.text,
        'root': chunk.root,            # Token object, displayed as its text
        'root.text': chunk.root.text,  # no trailing comma: it would create a 1-tuple
        'root.dep_': chunk.root.dep_,
        'root.head.text': chunk.root.head.text,
    })
noun_chunks_df = pd.DataFrame(rows)

noun_chunks_df[:20]

| |text|root|root.text|root.dep_|root.head.text|
|---|---|---|---|---|---|
|0|Jane|Jane|Jane|nsubj|bought|
|1|me|me|me|dative|bought|
|2|these books|books|books|dobj|bought|
|3|Jane|Jane|Jane|nsubj|bought|
|4|a book|book|book|dobj|bought|
|5|me|me|me|pobj|for|
|6|She|She|She|nsubj|dropped|
|7|a line|line|line|dobj|dropped|
|8|him|him|him|pobj|to|
|9|you|you|you|dobj|Thank|


### Visualizing using Scattertext

https://spacy.io/universe/project/scattertext


In [50]:
!pip install -q scattertext

  Preparing metadata (setup.py) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.


In [51]:
import scattertext as st

ModuleNotFoundError: No module named 'scattertext'

In [49]:
import spacy

from scattertext import SampleCorpora, produce_scattertext_explorer
from scattertext import produce_scattertext_html
from scattertext.CorpusFromPandas import CorpusFromPandas

nlp = spacy.load('en_core_web_sm')
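# Sample corpus bundled with scattertext (2012 US convention speeches)
convention_df = SampleCorpora.ConventionData2012.get_data()
# Build the corpus from the DataFrame, not from a spaCy Doc
corpus = CorpusFromPandas(convention_df,
                          category_col='party',
                          text_col='text',
                          nlp=nlp).build()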

ModuleNotFoundError: No module named 'scattertext'

In [None]:
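# spaCy v3: load a named pipeline and use `disable=` to switch components off
nlp = spacy.load('en_core_web_sm', disable=["tagger", "ner"])
# Parse a 1,000-review slice once and reuse the same slice for the corpus
subset = train_df[49500:50500].copy()
subset['parsed'] = subset.Text.apply(nlp)
corpus = st.CorpusFromParsedDocuments(subset,
                             category_col='Score',
                             parsed_col='parsed').build()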

In [None]:
html = st.produce_scattertext_explorer(corpus,
          category=1,
          category_name='Positive',
          not_category_name='Negative',
          width_in_pixels=700,
          minimum_term_frequency=15,
          term_significance = st.LogOddsRatioUninformativeDirichletPrior(),
          )

In [None]:
# write the interactive scattertext visualisation to disk and embed it in the notebook
from IPython.display import IFrame

filename = "positive-vs-negative.html"
with open(filename, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=filename, width=900, height=900)


### Word vectors and similarity

OK, let's do some modelling and focus on scoring our food!

### sense2vec

The goal is to get something better than a plain word2vec model.

The idea behind sense2vec is super simple. If the problem is that duck as in waterfowl and duck as in crouch are different concepts, the straightforward solution is to just have two entries, duckN and duckV. Trask et al. (2015) published a nice set of experiments showing that the idea worked well.

It attaches part-of-speech tags such as verb, noun or adjective to words, and these tags are then used to tell apart the senses a word takes in context.
1. Please book [VERB] my ticket.
2. Read the book [NOUN].

Read more [here](https://explosion.ai/blog/sense2vec-with-spacy) and [here](https://github.com/explosion/sense2vec).

The vectors used here were trained on Reddit comments, and Reddit talks about food a lot, so we can get nice similarity vectors for food items.
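The "two entries per word" idea can be sketched with a plain dictionary keyed by `word|SENSE`, which is the key format sense2vec uses. The 3-dimensional vectors below are made up for illustration and are not real sense2vec data; `make_key`, `cosine` and `most_similar` are hypothetical helpers written for this sketch, not the library's API.

```python
import numpy as np

# Made-up vectors: nouns cluster in one direction, verbs in another.
toy_vectors = {
    "duck|NOUN": np.array([0.9, 0.1, 0.0]),
    "goose|NOUN": np.array([0.8, 0.2, 0.1]),
    "duck|VERB": np.array([0.0, 0.2, 0.9]),
    "crouch|VERB": np.array([0.1, 0.1, 0.8]),
}

def make_key(word, sense):
    """Join a word and its sense tag into one lookup key."""
    return f"{word}|{sense}"

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(key, n=1):
    """Return the n keys closest to `key`, with their similarity scores."""
    scores = [(other, cosine(toy_vectors[key], vec))
              for other, vec in toy_vectors.items() if other != key]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

print(most_similar(make_key("duck", "NOUN")))
print(most_similar(make_key("duck", "VERB")))
```

Because the noun and the verb have separate entries, the same surface word "duck" gets different nearest neighbours depending on its tag.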


In [57]:
!pip install sense2vec
import sense2vec
from sense2vec import Sense2VecComponent

Collecting sense2vec
  Downloading sense2vec-2.0.2-py2.py3-none-any.whl.metadata (54 kB)
Collecting numpy>=1.15.0 (from sense2vec)
  Using cached numpy-2.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading sense2vec-2.0.2-py2.py3-none-any.whl (40 kB)
Using cached numpy-2.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
Installing collected packages: numpy, sense2vec
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cupy-cuda12x 12.2.0 requires numpy<1.27,>=1.20, but you have numpy 2.0.2 which is incompatible.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.0.2 which is incompatible.
langchain 0.3.7 requires numpy<2,>=1; python_versi

In [59]:
import sense2vec
from sense2vec import Sense2VecComponent

s2v = Sense2VecComponent(spacy_tok.vocab).from_disk('../input/reddit-vectors-for-sense2vec-spacy/reddit_vectors-1.1.0/reddit_vectors-1.1.0/') # Initialize Sense2VecComponent with vocab
spacy_tok.add_pipe("sense2vec", config={"data_path": "../input/reddit-vectors-for-sense2vec-spacy/reddit_vectors-1.1.0/reddit_vectors-1.1.0/"})  # Add the component using its name "sense2vec" and providing the path via config


ValueError: Can't read file: ../input/reddit-vectors-for-sense2vec-spacy/reddit_vectors-1.1.0/reddit_vectors-1.1.0/strings.json

In [58]:
import sense2vec
from sense2vec import Sense2VecComponent

s2v = Sense2VecComponent('../input/reddit-vectors-for-sense2vec-spacy/reddit_vectors-1.1.0/reddit_vectors-1.1.0/')
spacy_tok.add_pipe(s2v)
doc = spacy_tok(u"dessert.")
freq = doc[0]._.s2v_freq
vector = doc[0]._.s2v_vec
most_similar = doc[0]._.s2v_most_similar(5)
most_similar,freq

ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <sense2vec.component.Sense2VecComponent object at 0x7d00e8cb4ee0> (name: 'None').

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

In [54]:
doc = spacy_tok(u"burger")
most_similar = doc[0]._.s2v_most_similar(4)
most_similar

AttributeError: [E046] Can't retrieve unregistered extension attribute 's2v_most_similar'. Did you forget to call the `set_extension` method?

In [55]:
doc = spacy_tok(u"peanut butter")
most_similar = doc[0]._.s2v_most_similar(4)
most_similar

AttributeError: [E046] Can't retrieve unregistered extension attribute 's2v_most_similar'. Did you forget to call the `set_extension` method?

Similarity between entities can be kind of fun.


The following attributes are available via the `._` property, for example `token._.in_s2v`:

|Name|Attribute Type|Type|Description|
|--------|---------------|-------------|---------------|
|in_s2v|property|bool|Whether a key exists in the vector map.|
|s2v_freq|property|int|The frequency of the given key.|
|s2v_vec|property|ndarray[float32]|The vector of the given key.|
|s2v_most_similar|method|list|Get the n most similar terms. Returns a list of ((word, sense), score) tuples.|



## SpaCy Text Categorizer

We will train a multi-label convolutional neural network text classifier on our food reviews, using spaCy's TextCategorizer component.

spaCy provides a classification model with multiple, non-mutually exclusive labels. You can change the model architecture rather easily, but by default, the TextCategorizer class uses a convolutional neural network to assign position-sensitive vectors to each word in the document. The TextCategorizer uses its own CNN model, to avoid sharing weights with the other pipeline components. The document tensor is then summarized by concatenating max and mean pooling, and a multilayer perceptron is used to predict an output vector of length nr_class, before a logistic activation is applied elementwise. The value of each output neuron is the probability that the corresponding class is present.
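The pooling and output steps described above can be illustrated with a few lines of NumPy. This is a toy sketch of the idea, not spaCy's implementation: random numbers stand in for the CNN's token vectors, a single linear layer stands in for the multilayer perceptron, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, width, nr_class = 6, 4, 2  # arbitrary sizes for the sketch

# Stand-in for the position-sensitive token vectors produced by the CNN.
token_vectors = rng.normal(size=(n_tokens, width))

# Summarize the document by concatenating max and mean pooling over tokens.
max_pooled = token_vectors.max(axis=0)
mean_pooled = token_vectors.mean(axis=0)
doc_summary = np.concatenate([max_pooled, mean_pooled])  # shape (2 * width,)

# One linear layer in place of the multilayer perceptron.
W = rng.normal(size=(nr_class, 2 * width))
b = np.zeros(nr_class)
logits = W @ doc_summary + b

# Elementwise logistic activation: one independent probability per class.
scores = 1.0 / (1.0 + np.exp(-logits))
print(scores)
```

Because the logistic function is applied to each output separately, the scores need not sum to 1, which is what allows labels that are not mutually exclusive.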


#### Prepare data
Let's prepare the data in the form spaCy expects.
We build a list of (text, label) tuples.
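In spaCy v3, which the install cells above pull in, training functions expect `Example` objects rather than raw tuples. A minimal sketch of the conversion follows, using a blank pipeline so it runs without a model. The two review snippets and their labels are illustrative stand-ins for rows of `train_df`.

```python
import spacy
from spacy.training import Example

nlp_blank = spacy.blank("en")

# Stand-in (text, label) tuples in the shape built by the notebook.
pairs = [
    ("Very smooth taste, I love this tea", 1),
    ("One of the most disgusting smelling cat foods", 0),
]

# Each Example pairs a tokenized Doc with its gold-standard categories.
examples = [
    Example.from_dict(nlp_blank.make_doc(text),
                      {"cats": {"POSITIVE": bool(score)}})
    for text, score in pairs
]
for eg in examples:
    print(eg.reference.text, eg.reference.cats)
```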

In [None]:
train_df['tuples'] = train_df.apply(
    lambda row: (row['Text'],row['Score']), axis=1)
train = train_df['tuples'].tolist()
train[:1]

In [None]:
train[-2:]

In [None]:
#functions from spacy documentation
def load_data(limit=0, split=0.8):
    train_data = train
    np.random.shuffle(train_data)
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    cats = [{'POSITIVE': bool(y)} for y in labels]
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])

def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 1e-8  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 1e-8  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}

#("Number of texts to train from","t" , int)
n_texts=30000
#You can increase texts count if you have more computational power.

#("Number of training iterations", "n", int))
n_iter=10
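As a worked example of the arithmetic inside `evaluate`, the sketch below computes precision, recall and F-score for seven made-up predictions at the same 0.5 threshold. The helper `precision_recall_f1` and the data are illustrative only.

```python
def precision_recall_f1(gold, predicted, threshold=0.5):
    """Return (precision, recall, f1) for binary gold labels and scores."""
    tp = sum(1 for g, p in zip(gold, predicted) if p >= threshold and g)
    fp = sum(1 for g, p in zip(gold, predicted) if p >= threshold and not g)
    fn = sum(1 for g, p in zip(gold, predicted) if p < threshold and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = [1, 1, 1, 1, 1, 0, 0]
predicted = [0.9, 0.8, 0.7, 0.4, 0.2, 0.6, 0.1]

# 3 true positives, 1 false positive, 2 false negatives
p, r, f = precision_recall_f1(gold, predicted)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.75 0.6 0.667
```

Here precision is 3 / (3 + 1), recall is 3 / (3 + 2), and the F-score is their harmonic mean.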

In [None]:
nlp = spacy.load('en_core_web_sm')  # create english Language class

In [None]:
# add the text classifier to the pipeline if it doesn't exist
# spaCy v3: nlp.add_pipe takes the registered factory name and returns the component;
# 'textcat_multilabel' allows a single, non-exclusive POSITIVE label
if 'textcat_multilabel' not in nlp.pipe_names:
    textcat = nlp.add_pipe('textcat_multilabel', last=True)
# otherwise, get it, so we can add labels to it
else:
    textcat = nlp.get_pipe('textcat_multilabel')

# add label to text classifier
textcat.add_label('POSITIVE')

# load the dataset
print("Loading food reviews data...")
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)
print("Using {} examples ({} training, {} evaluation)"
      .format(n_texts, len(train_texts), len(dev_texts)))
train_data = list(zip(train_texts,
                      [{'cats': cats} for cats in train_cats]))

### Training model

In [None]:
from spacy.training import Example

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != textcat.name]
with nlp.select_pipes(disable=other_pipes):  # only train the text classifier
    # spaCy v3 trains on Example objects instead of (text, annotations) tuples
    train_examples = [Example.from_dict(nlp.make_doc(text), annotations)
                      for text, annotations in train_data]
    optimizer = nlp.initialize(lambda: train_examples)
    print("Training the model...")
    print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
    for i in range(n_iter):
        losses = {}
        # batch up the examples using spaCy's minibatch; a fixed batch size is used
        # here instead of the original compounding(4., 32., 1.001) schedule
        batches = minibatch(train_examples, size=32)
        for batch in batches:
            nlp.update(batch, sgd=optimizer, drop=0.2, losses=losses)
        with textcat.model.use_params(optimizer.averages):
            # evaluate on the dev data split off in load_data()
            scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
        print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'  # print a simple table
              .format(losses[textcat.name], scores['textcat_p'],
                      scores['textcat_r'], scores['textcat_f']))


In [None]:
# test the trained model
test_text1 = 'This tea is fun to watch as the flower expands in the water. Very smooth taste and can be used again and again in the same day. If you love tea, you gotta try these "flowering teas"'
test_text2="I bought this product at a local store, not from this seller.  I usually use Wellness canned food, but thought my cat was bored and wanted something new.  So I picked this up, knowing that Evo is a really good brand (like Wellness).<br /><br />It is one of the most disgusting smelling cat foods I've ever had the displeasure of using.  I was gagging while trying to put it into the bowl.  My cat took one taste and walked away, and chose to eat nothing until I replaced it 12 hours later with some dry food.  I would try another flavor of their food - since I know it's high quality - but I wouldn't buy the duck flavor again."
doc = nlp(test_text1)
test_text1, doc.cats

The score for the positive review is indeed close to 1.

In [None]:
doc2 = nlp(test_text2)
test_text2, doc2.cats

The score for the negative review is close to 0.

In [None]:
output_dir=%pwd
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

In [None]:
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text2)
print(test_text2, doc2.cats)

The model looks pretty good. We can definitely improve it further by feeding it more data and applying data augmentation.
Thanks for reading. Hope you learnt something new :)