

---



# I. Document Feature Matrix (dfm)

A **dfm** (document-feature matrix) is a representation of text that records how often each word occurs in each document of a corpus.

In a dfm, the rows represent the individual documents in the corpus and the columns represent the terms that appear in those documents. Each cell gives the number of times a term occurs in a particular document.
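To make the row/column/cell layout concrete, here is a minimal base-R sketch that builds a tiny count matrix by hand. The two toy documents and the names `docs`, `vocab`, and `toy_dfm` are illustrative assumptions, not part of the notebook; quanteda's `dfm()` does considerably more than this.

```r
# Two toy documents (hypothetical, for illustration only)
docs <- c(d1 = "the cat sat on the mat", d2 = "the dog sat")

# Naive whitespace tokenization; names d1/d2 are preserved
tokens_list <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: every distinct token, in order of first appearance
vocab <- unique(unlist(tokens_list))

# Count each vocabulary term in each document, then put documents in rows
toy_dfm <- t(vapply(
  tokens_list,
  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(toy_dfm) <- vocab

print(toy_dfm)
```

Each row is a document, each column a term, and `toy_dfm["d1", "the"]` is 2 because "the" occurs twice in the first document.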

In [1]:
#install.packages("quanteda")
library(quanteda)


Package version: 3.3.1
Unicode version: 13.0
ICU version: 69.1

Parallel computing: 20 of 20 threads used.

See https://quanteda.io for tutorials and examples.



---
## An example

In [2]:
my_text <- c(
  "Mario es un gran empresario. Creó su propia empresa a partir de un modesto emprendimiento.",
  "Las empresas, los emprendedores y sus emprendimientos generan riqueza."
)

my_corpus <- corpus(my_text)
my_corpus

In [3]:
toks <- tokens(my_corpus)
print(toks)

Tokens consisting of 2 documents.
text1 :
 [1] "Mario"      "es"         "un"         "gran"       "empresario"
 [6] "."          "Creó"       "su"         "propia"     "empresa"   
[11] "a"          "partir"    
[ ... and 5 more ]

text2 :
 [1] "Las"             "empresas"        ","               "los"            
 [5] "emprendedores"   "y"               "sus"             "emprendimientos"
 [9] "generan"         "riqueza"         "."              



In [4]:
my_first_dfm <- dfm(toks)
print(my_first_dfm)

Document-feature matrix of: 2 documents, 25 features (48.00% sparse) and 0 docvars.
       features
docs    mario es un gran empresario . creó su propia empresa
  text1     1  1  2    1          1 2    1  1      1       1
  text2     0  0  0    0          0 1    0  0      0       0
[ reached max_nfeat ... 15 more features ]


In [6]:
# Convert the dfm to an ordinary data frame to inspect every column
my_first_matrix <- convert(my_first_dfm, to = "data.frame")
print(my_first_matrix)
names(my_first_matrix)

  doc_id mario es un gran empresario . creó su propia empresa a partir de
1  text1     1  1  2    1          1 2    1  1      1       1 1      1  1
2  text2     0  0  0    0          0 1    0  0      0       0 0      0  0
  modesto emprendimiento las empresas , los emprendedores y sus emprendimientos
1       1              1   0        0 0   0             0 0   0               0
2       0              0   1        1 1   1             1 1   1               1
  generan riqueza
1       0       0
2       1       1




---


## Tokenizing the corpus

To create a dfm from a corpus, the corpus must first be tokenized. This means deciding which preprocessing steps to apply before the data are converted into a dfm.
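The kind of decisions involved can be sketched in base R on a single sentence. This is only an illustration: the sentence and the three-word `stop_words` list are hypothetical, and quanteda's `tokens()` and `stopwords("en")` are far more thorough.

```r
# Hypothetical input sentence and a deliberately tiny stopword list
text <- "The Senate, and the House of Representatives!"
stop_words <- c("the", "and", "of")

tk <- tolower(text)                 # 1. lowercase
tk <- gsub("[[:punct:]]", "", tk)   # 2. strip punctuation
tk <- strsplit(tk, "\\s+")[[1]]     # 3. split on whitespace
tk <- tk[!tk %in% stop_words]       # 4. drop stopwords

print(tk)
```

Each step shrinks or normalizes the vocabulary, so the choices made here determine which columns the dfm will have.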

In [7]:
toks_nopunct_stop_low <- tokens(data_corpus_inaugural, remove_punct = TRUE) %>%
                         tokens_remove(pattern = stopwords("en")) %>%
                         tokens_tolower()
toks_nopunct_stop_low

Tokens consisting of 59 documents and 4 docvars.
1789-Washington :
 [1] "fellow-citizens" "senate"          "house"           "representatives"
 [5] "among"           "vicissitudes"    "incident"        "life"           
 [9] "event"           "filled"          "greater"         "anxieties"      
[ ... and 640 more ]

1793-Washington :
 [1] "fellow"     "citizens"   "called"     "upon"       "voice"     
 [6] "country"    "execute"    "functions"  "chief"      "magistrate"
[11] "occasion"   "proper"    
[ ... and 50 more ]

1797-Adams :
 [1] "first"       "perceived"   "early"       "times"       "middle"     
 [6] "course"      "america"     "remained"    "unlimited"   "submission" 
[11] "foreign"     "legislature"
[ ... and 1,058 more ]

1801-Jefferson :
 [1] "friends"   "fellow"    "citizens"  "called"    "upon"      "undertake"
 [7] "duties"    "first"     "executive" "office"    "country"   "avail"    
[ ... and 801 more ]

1805-Jefferson :
 [1] "proceeding"    "fellow"        "ci



---



## Creating a dfm



In [8]:
dfm_inaug_nostem <- dfm(toks_nopunct_stop_low)
print(dfm_inaug_nostem)


Document-feature matrix of: 59 documents, 9,285 features (92.70% sparse) and 4 docvars.
                 features
docs              fellow-citizens senate house representatives among
  1789-Washington               1      1     2               2     1
  1793-Washington               0      0     0               0     0
  1797-Adams                    3      1     0               2     4
  1801-Jefferson                2      0     0               0     1
  1805-Jefferson                0      0     0               0     7
  1809-Madison                  1      0     0               0     0
                 features
docs              vicissitudes incident life event filled
  1789-Washington            1        1    1     2      1
  1793-Washington            0        0    0     0      0
  1797-Adams                 0        0    2     0      0
  1801-Jefferson             0        0    1     0      0
  1805-Jefferson             0        0    2     0      0
  1809-Madison               

The dfm contains 9,285 *features*, which in this particular case comprise words, numbers, and symbols. Had punctuation not been removed, punctuation marks would have been included as well.

A *feature*, then, is whatever defines a column of the matrix. In a dfm, features are the terms or words extracted from the corpus documents and used to build the matrix representation. They may be individual words, word combinations (n-grams), lexical features, syntactic features, or any other element considered relevant to the text analysis.

For example, in a corpus of social media documents collected for sentiment analysis, the features in the dfm might be words or phrases that express emotions, such as "happy", "sad", "furious", or "cheerful".

What would happen to the number of features if we had applied stemming while processing the text?

  

In [9]:
toks_nopunct_stop_low_stem <- tokens_wordstem(toks_nopunct_stop_low)
dfm_inaug_stem <- dfm(toks_nopunct_stop_low_stem)
print(dfm_inaug_stem)

Document-feature matrix of: 59 documents, 5,458 features (89.34% sparse) and 4 docvars.
                 features
docs              fellow-citizen senat hous repres among vicissitud incid life
  1789-Washington              1     1    2      2     1          1     1    1
  1793-Washington              0     0    0      0     0          0     0    0
  1797-Adams                   3     1    3      3     4          0     0    2
  1801-Jefferson               2     0    0      1     1          0     0    1
  1805-Jefferson               0     0    0      0     7          0     0    2
  1809-Madison                 1     0    0      1     0          1     0    1
                 features
docs              event fill
  1789-Washington     2    1
  1793-Washington     0    0
  1797-Adams          0    0
  1801-Jefferson      0    0
  1805-Jefferson      1    0
  1809-Madison        0    1
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 5,448 more features ]
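The drop from 9,285 to 5,458 features happens because several word forms collapse onto one stem. The base-R sketch below imitates that effect with a deliberately crude suffix rule; `crude_stem` is a hypothetical stand-in and does not reproduce the stemmer behind `tokens_wordstem()`.

```r
# Hypothetical word list with several inflected forms
words <- c("nation", "nations", "govern", "governs", "governed")

# Crude stand-in for a stemmer: strip a trailing "ed" or "s"
crude_stem <- function(w) sub("(ed|s)$", "", w)
stems <- crude_stem(words)

n_before <- length(unique(words))  # distinct features before stemming
n_after  <- length(unique(stems))  # distinct features after stemming

print(c(before = n_before, after = n_after))
print(table(stems))                # counts are pooled under each stem
```

Fewer distinct strings means fewer columns, and the counts of the merged forms add up under the shared stem.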





---


## Sparsity

Document-feature matrix of: 59 documents, 9,285 features (92.70% sparse) and 4 docvars.

**Sparsity**, in the most general sense, is the property of having a large share of null or zero values relative to the total number of possible elements.

Sparsity measures how little of a data structure or matrix is actually populated. For example, a numeric matrix in which most elements are zero and only a few are nonzero is said to be *sparse*. In the dfm summarized above, 92.70% of the cells are zero, because most words do not occur in most documents.
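As a worked check, the sparsity of the first two-document example (reported as 48.00%) can be recomputed in base R as the share of zero cells. The counts below are copied from the data frame printed earlier; the variable names are illustrative.

```r
# Counts for the 25 features of the two-sentence example, in printed order:
# the first 15 features come from text1, the last 10 from text2,
# and "." (position 6) is the only feature shared by both documents.
text1 <- c(1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, rep(0, 10))
text2 <- c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, rep(1, 10))
m <- rbind(text1, text2)

# Sparsity = proportion of cells equal to zero
sparsity <- mean(m == 0)
print(sparsity)   # 24 zero cells out of 50
```

The same ratio, zero cells over total cells, is what the percentage in the dfm header reports.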



---


# II. Selecting **features**

Several functions are available for selecting and exploring *features* once text documents have been converted to a dfm.

## **topfeatures()**

The **topfeatures()** function returns the most frequent features.


In [10]:
print(topfeatures(dfm_inaug_nostem))

    people government         us        can       must       upon      great 
       584        564        505        487        376        371        344 
       may     states      world 
       343        334        319 


In [11]:
print(topfeatures(dfm_inaug_stem))

nation govern  peopl     us    can  state  great   must  power   upon 
   691    657    632    505    487    452    378    376    375    371 


What differences do you notice between the *topfeatures* of the stemmed dfm and those of the unstemmed dfm?


For example, before stemming, "nation" was not even among the top 10 features, whereas after stemming it ranks first with 691 occurrences. This illustrates how stemming pools related word forms into a single feature and so changes the picture of which words are used most.

In [12]:
print(topfeatures(dfm_inaug_nostem,20))

    people government         us        can       must       upon      great 
       584        564        505        487        376        371        344 
       may     states      world      shall    country     nation      every 
       343        334        319        316        308        305        300 
       one      peace        new      power        now     public 
       267        258        250        241        229        225 


Why does this happen? Can you think of a way to investigate it?

In [14]:
# Key words in context
kw_nation <- kwic(toks_nopunct_stop_low, pattern = "nation*")
print(head(kw_nation, 10))

Keyword-in-context with 10 matches.                                                                             
 [1789-Washington, 159]           almighty rules universe presides councils |
 [1789-Washington, 225]           every step advanced character independent |
 [1789-Washington, 361] assemblage communities interests another foundation |
 [1789-Washington, 423]                    smiles heaven can never expected |
       [1797-Adams, 56]    signally protected country first representatives |
      [1797-Adams, 169]             faith loss consideration credit foreign |
      [1797-Adams, 179]  partial conventions insurrection threatening great |
      [1797-Adams, 262]        adapted genius character situation relations |
      [1797-Adams, 360]               upon peace order prosperity happiness |
      [1797-Adams, 381]               ancient idea congregations men cities |
                                                               
 nations  | whose providential aids can sup

In [15]:
?topfeatures

topfeatures              package:quanteda              R Documentation

_I_d_e_n_t_i_f_y _t_h_e _m_o_s_t _f_r_e_q_u_e_n_t _f_e_a_t_u_r_e_s _i_n _a _d_f_m

_D_e_s_c_r_i_p_t_i_o_n:

     List the most (or least) frequently occurring features in a dfm,
     either as a whole or separated by document.

_U_s_a_g_e:

     topfeatures(
       x,
       n = 10,
       decreasing = TRUE,
       scheme = c("count", "docfreq"),
       groups = NULL
     )
     
_A_r_g_u_m_e_n_t_s:

       x: the object whose features will be returned

       n: how many top features should be returned

decreasing: If 'TRUE', return the 'n' most frequent features; otherwise
          return the 'n' least frequent features

  scheme: one of 'count' for total feature frequency (within 'group' if
          applicable), or 'docfreq' for the document frequencies of
          features

  groups: grouping variable for sampling, equal in length to the number
  







### Grouping by a variable

In [14]:
head(docvars(data_corpus_inaugural))

  Year  President FirstName                 Party
1 1789 Washington    George                  none
2 1793 Washington    George                  none
3 1797      Adams      John            Federalist
4 1801  Jefferson    Thomas Democratic-Republican
5 1805  Jefferson    Thomas Democratic-Republican
6 1809    Madison     James Democratic-Republican


In [16]:
print(head(topfeatures(dfm_inaug_nostem,5, groups = President)))

$Adams
government     people      union       upon    country 
        33         27         22         21         18 

$Biden
     us america     can     one  nation 
     27      18      16      15      12 

$Buchanan
      states        shall constitution          may       people 
          22           18           17           15           13 

$Bush
freedom  nation      us america     can 
     36      27      27      27      24 

$Carter
   can nation    new   must     us 
    13     10      9      8      8 

$Cleveland
    people government     public      every      shall 
        35         29         19         14         14 



In [17]:
print(head(topfeatures(dfm_inaug_nostem,5, groups = Party)))

$Democratic
        us     people        can government       must 
       222        199        173        143        138 

$`Democratic-Republican`
government      great     states        war        may 
        68         61         56         51         49 

$Federalist
    people government        may    nations    country 
        20         16         13         11          9 

$none
       can      every government        may    present 
         9          9          9          7          6 

$Republican
    people government        can         us       must 
       264        240        228        218        201 

$Whig
  government       states       people        power constitution 
          88           61           57           57           55 
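Conceptually, grouping sums the rows of the count matrix within each group before ranking the features. A minimal base-R sketch of that idea follows; the matrix `m` and the group labels in `party` are hypothetical, and this is not a description of how quanteda implements `groups`.

```r
# Hypothetical counts: three documents, three features
m <- rbind(
  doc1 = c(people = 3, union = 2, war = 0),
  doc2 = c(people = 2, union = 4, war = 0),
  doc3 = c(people = 1, union = 0, war = 5)
)
# Hypothetical grouping variable, one label per document
party <- c("A", "A", "B")

# Sum the document rows within each group
grouped <- rowsum(m, group = party)
print(grouped)

# Most frequent feature in each group
top_by_group <- apply(grouped, 1, function(r) names(r)[which.max(r)])
print(top_by_group)
```

Here group A pools doc1 and doc2, so its counts are the column sums of those two rows.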



**scheme=count** & **scheme=docfreq**

In [18]:
print(topfeatures(dfm_inaug_nostem,5, scheme = "count"))


    people government         us        can       must 
       584        564        505        487        376 


In [19]:
print(topfeatures(dfm_inaug_nostem,5, scheme = "docfreq"))

 people     can   great      us country 
     57      56      56      56      54 


This means that "*people*" occurs 584 times in total and appears in 57 of the 59 documents in the corpus.
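The difference between the two schemes can be sketched in base R: `count` adds up the cell values of a column, while `docfreq` counts how many documents have a nonzero value in it. The toy matrix below is hypothetical.

```r
# Hypothetical counts: three documents, three features
m <- rbind(
  doc1 = c(people = 5, nation = 0, war = 1),
  doc2 = c(people = 1, nation = 0, war = 0),
  doc3 = c(people = 0, nation = 9, war = 0)
)

term_count <- colSums(m)      # total occurrences of each feature
doc_freq   <- colSums(m > 0)  # number of documents containing each feature

print(term_count)
print(doc_freq)
```

In this toy data "nation" has the highest total count but appears in only one document, so the two rankings differ.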



---


## **dfm_select()**, **dfm_remove()**, **dfm_trim()**

These commands select specific features from a dfm object.

In [19]:
?dfm_select()

dfm_select              package:quanteda               R Documentation

_S_e_l_e_c_t _f_e_a_t_u_r_e_s _f_r_o_m _a _d_f_m _o_r _f_c_m

_D_e_s_c_r_i_p_t_i_o_n:

     This function selects or removes features from a dfm or fcm, based
     on feature name matches with 'pattern'.  The most common usages
     are to eliminate features from a dfm already constructed, such as
     stopwords, or to select only terms of interest from a dictionary.

_U_s_a_g_e:

     dfm_select(
       x,
       pattern = NULL,
       selection = c("keep", "remove"),
       valuetype = c("glob", "regex", "fixed"),
       case_insensitive = TRUE,
       min_nchar = NULL,
       max_nchar = NULL,
       padding = FALSE,
       verbose = quanteda_options("verbose")
     )
     
     dfm_remove(x, ...)
     
     dfm_keep(x, ...)
     
     fcm_select(
       x,
       pattern = NULL,
       selection = c("keep", "remove"),
       valuetype = c("glob", "regex", "fixed"),
   

### **dfm_remove()**

Suppose we did not remove the stopwords when tokenizing the documents.

In [20]:
toks <- tokens(data_corpus_inaugural)


In [21]:
# Build the dfm from the tokens created above, then drop English stopwords
dfm_inaugural_nostop <- dfm(toks) %>%
  dfm_remove(pattern = stopwords("en"))

print(dfm_inaugural_nostop)

Document-feature matrix of: 59 documents, 9,303 features (92.65% sparse) and 4 docvars.
                 features
docs              fellow-citizens senate house representatives : among
  1789-Washington               1      1     2               2 1     1
  1793-Washington               0      0     0               0 1     0
  1797-Adams                    3      1     0               2 0     4
  1801-Jefferson                2      0     0               0 1     1
  1805-Jefferson                0      0     0               0 0     7
  1809-Madison                  1      0     0               0 0     0
                 features
docs              vicissitudes incident life event
  1789-Washington            1        1    1     2
  1793-Washington            0        0    0     0
  1797-Adams                 0        0    2     0
  1801-Jefferson             0        0    1     0
  1805-Jefferson             0        0    2     0
  1809-Madison               0        0    1     0
[ reac
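Removing features after the dfm is built amounts to dropping columns whose names match a pattern. A base-R sketch of that idea, with a hypothetical matrix and a hypothetical three-word stopword list, looks like this:

```r
# Hypothetical counts that still include stopword columns
m <- rbind(
  doc1 = c(the = 10, senate = 1, of = 7, house = 2),
  doc2 = c(the = 8,  senate = 0, of = 5, house = 1)
)
stop_words <- c("the", "of", "and")   # hypothetical mini list

# Keep only the columns whose names are not stopwords
m_nostop <- m[, !colnames(m) %in% stop_words, drop = FALSE]
print(m_nostop)
```

Note that punctuation columns would survive this step, which matches the `:` column still visible in the output above.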

### **dfm_trim()**

The *min_termfreq* and *max_termfreq* arguments of this function set the frequency thresholds that a feature must meet to be kept for the analysis.

In [22]:
dfm_trim(dfm_inaugural_nostop , min_termfreq = 20)


Document-feature matrix of: 59 documents, 661 features (59.48% sparse) and 4 docvars.
                 features
docs              fellow-citizens : among life greater order   , day present  .
  1789-Washington               1 1     1    1       1     2  70   2       5 23
  1793-Washington               0 1     0    0       0     0   5   0       1  4
  1797-Adams                    3 0     4    2       0     4 201   1       2 33
  1801-Jefferson                2 1     1    1       1     1 128   1       0 37
  1805-Jefferson                0 0     7    2       0     3 142   1       3 41
  1809-Madison                  1 0     0    1       0     0  47   0       1 21
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 651 more features ]
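The effect of a minimum term frequency can be sketched in base R as a filter on column totals. The matrix and the threshold of 20 below are illustrative, and the sketch assumes the threshold is inclusive (a feature with a total of exactly 20 is kept).

```r
# Hypothetical counts: two documents, three features
m <- rbind(
  doc1 = c(liberty = 12, day = 1, union = 9),
  doc2 = c(liberty = 10, day = 0, union = 3)
)

term_freq <- colSums(m)   # total occurrences across all documents
print(term_freq)

# Keep only features whose total frequency reaches the threshold
min_termfreq <- 20
trimmed <- m[, term_freq >= min_termfreq, drop = FALSE]
print(trimmed)
```

Only "liberty" survives, because it is the only feature with at least 20 occurrences across the whole toy corpus.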

### **dfm_select()**

How many "long" words (here, features of at least 10 characters) are there in the corpus documents?


In [23]:
dfmat_inaug_long <- dfm_select(dfm_inaugural_nostop, min_nchar = 10)
print(dfmat_inaug_long)

Document-feature matrix of: 59 documents, 2,442 features (95.05% sparse) and 4 docvars.
                 features
docs              fellow-citizens representatives vicissitudes notification
  1789-Washington               1               2            1            1
  1793-Washington               0               0            0            0
  1797-Adams                    3               2            0            0
  1801-Jefferson                2               0            0            0
  1805-Jefferson                0               0            0            0
  1809-Madison                  1               0            0            0
                 features
docs              transmitted veneration predilection flattering inclination
  1789-Washington           1          1            1          1           1
  1793-Washington           0          0            0          0           0
  1797-Adams                0          2            0          0           1
  1801-Jefferson    
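Selecting by length is a filter on the number of characters in each feature name. A base-R sketch with a hypothetical handful of feature names:

```r
# Hypothetical feature names
feats <- c("fellow-citizens", "senate", "representatives", "life", "veneration")

# Keep the features with at least 10 characters
min_nchar <- 10
long_feats <- feats[nchar(feats) >= min_nchar]

print(nchar(feats))
print(long_feats)
```

The hyphen counts as a character, which is why "fellow-citizens" qualifies as a long feature.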

quanteda is extremely flexible and allows these commands to be combined.

In [23]:
# Keep only the stems that occur in at least 90% of the documents
dfm_inaug_stem_docfreq <- dfm_trim(dfm_inaug_stem, min_docfreq = 0.90, docfreq_type = "prop")

print(dfm_inaug_stem_docfreq)

Document-feature matrix of: 59 documents, 11 features (6.01% sparse) and 4 docvars.
                 features
docs              countri can time nation may peopl govern great world right
  1789-Washington       5   9    1      4   6     4      9     3     1     2
  1793-Washington       1   0    0      0   1     1      1     0     0     0
  1797-Adams           10   9    3     24  13    20     20     5     3     2
  1801-Jefferson        4   3    0      4   8     2     14     2     3     7
  1805-Jefferson        5   6    6      6  10     0      3     1     2     4
  1809-Madison          5   5    1     10   1     1      0     0     2     5
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 1 more feature ]


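Only 11 features appear in at least 90% of the documents, yet the printout shows only 10! This is because `print()` displays at most 10 features by default (note the `max_nfeat` message at the end of the output); `topfeatures()` lists all 11.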

In [24]:
print(topfeatures(dfm_inaug_stem,11, scheme = "docfreq"))


 nation   peopl countri     can   great      us   right    time     may  govern 
     58      57      56      56      56      56      55      54      54      54 
  world 
     54 
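The 90% threshold can be checked by hand. With 59 documents, a feature needs a document frequency of at least 54 to reach a proportion of 0.90, since 53/59 falls just short. The sketch below uses the document frequencies printed above and assumes the proportion is computed as document frequency divided by the number of documents.

```r
n_docs <- 59

# Document frequencies copied from the topfeatures() output above
doc_freq <- c(nation = 58, peopl = 57, countri = 56, can = 56, great = 56,
              us = 56, right = 55, time = 54, may = 54, govern = 54, world = 54)

prop <- doc_freq / n_docs
print(round(prop, 3))

# Smallest document count whose proportion reaches 0.90
min_docs_needed <- ceiling(0.90 * n_docs)
print(min_docs_needed)

# All 11 features clear the threshold; 53 documents would not be enough
print(all(prop >= 0.90))
print(53 / n_docs)
```

This confirms that the 11 listed stems are exactly the kind of feature the trimming step keeps.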
