<a href="https://colab.research.google.com/github/alu0101061672/TAAD/blob/main/TAAD_SR_2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Advanced Data Analysis Techniques: Recommender Systems***
Master's Degree in Cybersecurity and Data Intelligence - Universidad de La Laguna

The goal of this project is to implement a content-based recommender system that recommends the most suitable documents to a user by means of the KNN classification algorithm.

The software receives a plain-text file with a .txt extension containing the set of documents that may be recommended to the end user, one document per line. It also receives a second file with the documents the user likes.

The software must produce the following output:

> For each document, a table with the following columns:
- Term index.
- Term.
- TF.
- IDF.
- TF-IDF.

> Cosine similarity between each pair of documents.


***Sonia Díaz Santos***

## Exploratory Data Analysis

This section covers the exploratory data analysis: installing the required libraries, reading the input files, cleaning the data, and merging the candidate documents with the documents the user liked.

### Installing libraries

The libraries required for this project are installed first.

In [43]:
pip install lenskit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [44]:
import pandas as pd
import numpy as np
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

%matplotlib inline

### Data preprocessing

Next, the file of candidate documents and the file of documents the user liked are read. The data are then cleaned and the two sets of documents are merged.

#### Reading the files

This step reads the two files the project needs: the documents that may be recommended to the user and the documents the user liked.

In [45]:
# Mount the Google Drive account
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [46]:
# Read the data (one document per line)
posibles_documentos_a_recomendar = pd.read_csv("/content/gdrive/My Drive/Master/TAAD 2022 (2)/documents.txt", sep='\n', header=None)
documentos_favoritos = pd.read_csv("/content/gdrive/My Drive/Master/TAAD 2022 (2)/documents_liked.txt", sep='\n', header=None)
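
Some pandas versions reject `sep='\n'` in `read_csv`. As a hedged alternative, the sketch below reads a one-document-per-line file with the standard library only. The helper name `read_documents` is hypothetical and is not part of the original notebook.

```python
from pathlib import Path


def read_documents(path):
    """Return one string per non-empty line of a plain-text file."""
    text = Path(path).read_text(encoding="utf-8")
    # Skip blank lines and trim surrounding whitespace from each document.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Under that assumption, `pd.DataFrame({"Documentos": read_documents(path)})` would give a one-column dataframe comparable to the one built above.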

#### Data cleaning

The data read above are now cleaned.

In [47]:
# Name the single column 'Documentos'
posibles_documentos_a_recomendar.columns = ["Documentos"]
documentos_favoritos.columns = ["Documentos"]

In [48]:
posibles_documentos_a_recomendar

| | Documentos |
|---|---|
| 0 | 7. Here's a bright, informal red that opens wi... |
| 1 | 1. Aromas include tropical fruit, broom, brims... |
| 2 | 3. Tart and snappy, the flavors of lime flesh ... |
| 3 | 4. Pineapple rind, lemon pith and orange bloss... |
| 4 | 2. This is ripe and fruity, a wine that is smo... |
| 5 | 6. Blackberry and raspberry aromas show a typi... |
| 6 | 5. Much like the regular bottling from 2012, t... |


In [49]:
# Loop over the documents to do two things. First, build the list idx with each document's number (read from the
# first character only, so this assumes single-digit numbers); idx is later used to create a document-number column.
# Second, strip the leading number and period from each document.
# The last line removes every occurrence of the document's final character (the closing period), so periods inside the text are dropped as well.

idx = []

for i in range(posibles_documentos_a_recomendar.shape[0]):
  documento = posibles_documentos_a_recomendar.loc[i, "Documentos"]
  idx.append(int(documento[0]))
  documento = re.sub(r'^\d\.', '', documento)
  posibles_documentos_a_recomendar.loc[i, "Documentos"] = documento.replace(documento[-1], '')

idx

[7, 1, 3, 4, 2, 6, 5]
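
The cleaning loop reads the document number from the first character and uses `str.replace` on the final character. A minimal sketch of a stricter variant follows, assuming every line has the form `<number>. <text>`; the helper name `split_number_and_text` is hypothetical. It accepts multi-digit numbers and removes only one trailing period, so it would not reproduce the outputs recorded in this notebook exactly.

```python
import re

# Leading document number (one or more digits) followed by a period.
NUMBERED_LINE = re.compile(r"^\s*(\d+)\.\s*(.*)$", re.DOTALL)


def split_number_and_text(line):
    """Split '12. Some text.' into (12, 'Some text')."""
    match = NUMBERED_LINE.match(line)
    if match is None:
        raise ValueError(f"line does not start with a document number: {line!r}")
    number = int(match.group(1))
    text = match.group(2).strip()
    # Remove a single trailing period, leaving periods inside the text alone.
    if text.endswith("."):
        text = text[:-1]
    return number, text


print(split_number_and_text("12. Tart and snappy. Crisp acidity."))
```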

In [50]:
# Create the column that holds each document's number
posibles_documentos_a_recomendar.insert(0,'Número de documento', idx)

In [51]:
posibles_documentos_a_recomendar

| | Número de documento | Documentos |
|---|---|---|
| 0 | 7 | Here's a bright, informal red that opens with... |
| 1 | 1 | Aromas include tropical fruit, broom, brimsto... |
| 2 | 3 | Tart and snappy, the flavors of lime flesh an... |
| 3 | 4 | Pineapple rind, lemon pith and orange blossom... |
| 4 | 2 | This is ripe and fruity, a wine that is smoot... |
| 5 | 6 | Blackberry and raspberry aromas show a typica... |
| 6 | 5 | Much like the regular bottling from 2012, thi... |


In [52]:
# Sort the dataframe by document number in ascending order
posibles_documentos_a_recomendar_ordered = posibles_documentos_a_recomendar.sort_values(by='Número de documento', ascending=True)

In [53]:
posibles_documentos_a_recomendar_ordered

| | Número de documento | Documentos |
|---|---|---|
| 1 | 1 | Aromas include tropical fruit, broom, brimsto... |
| 4 | 2 | This is ripe and fruity, a wine that is smoot... |
| 2 | 3 | Tart and snappy, the flavors of lime flesh an... |
| 3 | 4 | Pineapple rind, lemon pith and orange blossom... |
| 6 | 5 | Much like the regular bottling from 2012, thi... |
| 5 | 6 | Blackberry and raspberry aromas show a typica... |
| 0 | 7 | Here's a bright, informal red that opens with... |


In [54]:
# Use the document number as the index
posibles_documentos_a_recomendar_ordered.set_index('Número de documento',inplace=True)

In [55]:
posibles_documentos_a_recomendar_ordered

| Número de documento | Documentos |
|---|---|
| 1 | Aromas include tropical fruit, broom, brimsto... |
| 2 | This is ripe and fruity, a wine that is smoot... |
| 3 | Tart and snappy, the flavors of lime flesh an... |
| 4 | Pineapple rind, lemon pith and orange blossom... |
| 5 | Much like the regular bottling from 2012, thi... |
| 6 | Blackberry and raspberry aromas show a typica... |
| 7 | Here's a bright, informal red that opens with... |


In [56]:
documentos_favoritos

| | Documentos |
|---|---|
| 0 | Aromas have tropical fruit, broom, brimstone a... |
| 1 | The wine was all stainless-steel fermented. Th... |
| 2 | 6I like blackberry and raspberry aromas, green... |


#### Merging the two sets of documents

The next step is to merge the candidate documents with the documents the user liked.

In [57]:
# Concatenate the candidate documents and the documents the user liked; the latter end up in the last 3 rows
documento_completo = pd.concat([posibles_documentos_a_recomendar_ordered, documentos_favoritos], ignore_index=True)

In [58]:
documento_completo

| | Documentos |
|---|---|
| 0 | Aromas include tropical fruit, broom, brimsto... |
| 1 | This is ripe and fruity, a wine that is smoot... |
| 2 | Tart and snappy, the flavors of lime flesh an... |
| 3 | Pineapple rind, lemon pith and orange blossom... |
| 4 | Much like the regular bottling from 2012, thi... |
| 5 | Blackberry and raspberry aromas show a typica... |
| 6 | Here's a bright, informal red that opens with... |
| 7 | Aromas have tropical fruit, broom, brimstone a... |
| 8 | The wine was all stainless-steel fermented. Th... |
| 9 | 6I like blackberry and raspberry aromas, green... |


In [59]:
# Add a document-number column so that numbering starts at 1 instead of 0
documento_completo['Número de documento'] = ['1','2','3','4','5','6','7','8','9','10']

In [60]:
# Use that column as the index, so the index starts at 1
documento_completo.set_index('Número de documento', inplace=True)

In [61]:
documento_completo

| Número de documento | Documentos |
|---|---|
| 1 | Aromas include tropical fruit, broom, brimsto... |
| 2 | This is ripe and fruity, a wine that is smoot... |
| 3 | Tart and snappy, the flavors of lime flesh an... |
| 4 | Pineapple rind, lemon pith and orange blossom... |
| 5 | Much like the regular bottling from 2012, thi... |
| 6 | Blackberry and raspberry aromas show a typica... |
| 7 | Here's a bright, informal red that opens with... |
| 8 | Aromas have tropical fruit, broom, brimstone a... |
| 9 | The wine was all stainless-steel fermented. Th... |
| 10 | 6I like blackberry and raspberry aromas, green... |


## Building the table

The table of documents is now built. For every document, both the candidates and those the user liked, it holds the term index, the term, and the TF-IDF values.

In [62]:
# Create an empty dataframe with one row per document number and the columns 'Índice del término', 'Término' and 'TF-IDF'
tabla = pd.DataFrame(documento_completo, index = documento_completo.index, columns = ['Índice del término', 'Término', 'TF-IDF'], dtype = object)

In [63]:
tabla

| Número de documento | Índice del término | Término | TF-IDF |
|---|---|---|---|
| 1 | | | |
| 2 | | | |
| 3 | | | |
| 4 | | | |
| 5 | | | |
| 6 | | | |
| 7 | | | |
| 8 | | | |
| 9 | | | |
| 10 | | | |


### Term index and term

This section fills in the term index and term columns.

In [64]:
# Fill in the term index and term columns
for i in range(documento_completo["Documentos"].shape[0]):
  # Count the number of terms in each document (terms are split on whitespace)
  num_of_terms = len(documento_completo["Documentos"][i].split())
  tabla["Término"][i] = documento_completo["Documentos"][i].split()
  tabla["Índice del término"][i] = np.array(range(num_of_terms))

In [65]:
tabla

| Número de documento | Índice del término | Término | TF-IDF |
|---|---|---|---|
| 1 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Aromas, include, tropical, fruit,, broom,, br... | |
| 2 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [This, is, ripe, and, fruity,, a, wine, that, ... | |
| 3 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Tart, and, snappy,, the, flavors, of, lime, f... | |
| 4 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Pineapple, rind,, lemon, pith, and, orange, b... | |
| 5 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Much, like, the, regular, bottling, from, 201... | |
| 6 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Blackberry, and, raspberry, aromas, show, a, ... | |
| 7 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Here's, a, bright,, informal, red, that, open... | |
| 8 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Aromas, have, tropical, fruit,, broom,, brims... | |
| 9 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [The, wine, was, all, stainless-steel, ferment... | |
| 10 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [6I, like, blackberry, and, raspberry, aromas,... | |


In [66]:
# Show the first document's entry in the term column: a list of all the terms in that document
tabla["Término"][0]

['Aromas',
 'include',
 'tropical',
 'fruit,',
 'broom,',
 'brimstone',
 'and',
 'dried',
 'herb',
 'The',
 'palate',
 "isn't",
 'overly',
 'expressive,',
 'offering',
 'unripened',
 'apple,',
 'citrus',
 'and',
 'dried',
 'sage',
 'alongside',
 'brisk',
 'acidity']
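
Splitting on whitespace leaves punctuation attached to the terms (`'fruit,'`, `'broom,'`) and keeps the original casing. For comparison, the sketch below mimics the default tokenization that scikit-learn documents for its text vectorizers: lowercasing plus the token pattern `(?u)\b\w\w+\b`. The helper name `tokenize` is hypothetical, and stop-word removal is not reproduced here.

```python
import re

# scikit-learn's documented default token_pattern for its text vectorizers.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("Aromas include tropical fruit, broom, brimstone and dried herb"))
print(tokenize("The palate isn't overly expressive"))
```

Under this pattern `isn't` yields only `isn`, because the single-character `t` is too short to count as a token.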

In [67]:
# Check against the original text that the split is correct
[documento_completo["Documentos"][0]]

[" Aromas include tropical fruit, broom, brimstone and dried herb The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity"]

### TF-IDF

This section fills in the TF-IDF column.

In [68]:
# For each document, fit a TfidfVectorizer on that document alone, store the resulting TF-IDF vector in the table, and collect the fitted vocabulary (term -> column index) in total_tf_idf
total_tf_idf = []
for i in range(documento_completo.shape[0]):
  tf_idf = TfidfVectorizer(stop_words='english')
  tf_idf_matrix = tf_idf.fit_transform([documento_completo["Documentos"][i]])
  tabla["TF-IDF"][i] = tf_idf_matrix.todense()
  total_tf_idf.append(tf_idf.vocabulary_)

In [69]:
total_tf_idf

[{'aromas': 3,
  'include': 12,
  'tropical': 18,
  'fruit': 10,
  'broom': 6,
  'brimstone': 4,
  'dried': 8,
  'herb': 11,
  'palate': 16,
  'isn': 13,
  'overly': 15,
  'expressive': 9,
  'offering': 14,
  'unripened': 19,
  'apple': 2,
  'citrus': 7,
  'sage': 17,
  'alongside': 1,
  'brisk': 5,
  'acidity': 0},
 {'ripe': 13,
  'fruity': 10,
  'wine': 17,
  'smooth': 14,
  'structured': 15,
  'firm': 7,
  'tannins': 16,
  'filled': 6,
  'juicy': 11,
  'red': 12,
  'berry': 2,
  'fruits': 9,
  'freshened': 8,
  'acidity': 1,
  'drinkable': 5,
  'certainly': 4,
  'better': 3,
  '2016': 0},
 {'tart': 14,
  'snappy': 11,
  'flavors': 4,
  'lime': 7,
  'flesh': 5,
  'rind': 10,
  'dominate': 2,
  'green': 6,
  'pineapple': 8,
  'pokes': 9,
  'crisp': 1,
  'acidityunderscoring': 0,
  'wine': 15,
  'stainless': 12,
  'steel': 13,
  'fermented': 3},
 {'pineapple': 15,
  'rind': 17,
  'lemon': 9,
  'pith': 16,
  'orange': 13,
  'blossom': 3,
  'start': 20,
  'aromas': 0,
  'palate': 14,
  '

In [70]:
tabla

| Número de documento | Índice del término | Término | TF-IDF |
|---|---|---|---|
| 1 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Aromas, include, tropical, fruit,, broom,, br... | [[[[[0.20851441 0.20851441 0.20851441 0.208514... |
| 2 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [This, is, ripe, and, fruity,, a, wine, that, ... | [[[[[0.23570226 0.23570226 0.23570226 0.235702... |
| 3 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Tart, and, snappy,, the, flavors, of, lime, f... | [[[[[0.22941573 0.22941573 0.22941573 0.229415... |
| 4 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Pineapple, rind,, lemon, pith, and, orange, b... | [[[[[0.21320072 0.21320072 0.21320072 0.213200... |
| 5 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Much, like, the, regular, bottling, from, 201... | [[[[[0.21320072 0.21320072 0.21320072 0.213200... |
| 6 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Blackberry, and, raspberry, aromas, show, a, ... | [[[[[0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0... |
| 7 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Here's, a, bright,, informal, red, that, open... | [[[[[0.23570226 0.23570226 0.23570226 0.235702... |
| 8 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Aromas, have, tropical, fruit,, broom,, brims... | [[[[[0.23570226 0.23570226 0.23570226 0.235702... |
| 9 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [The, wine, was, all, stainless-steel, ferment... | [[[[[0.28867513 0.28867513 0.28867513 0.288675... |
| 10 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [6I, like, blackberry, and, raspberry, aromas,... | [[[[[0.23570226 0.23570226 0.23570226 0.235702... |
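
Because each vectorizer in the TF-IDF loop is fitted on a single document, every term receives the same IDF, so the stored vectors are in effect length-normalized term counts. The sketch below shows one way the per-term TF, IDF, and TF-IDF columns listed in the introduction could be computed over the whole corpus. It assumes raw counts for TF and the smoothed IDF that scikit-learn documents, `ln((1 + n) / (1 + df)) + 1`; the function names are hypothetical, and neither stop-word removal nor L2 normalization is applied.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tf_idf_tables(documents):
    """Return, per document, rows of (term index, term, TF, IDF, TF-IDF)."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    term_index = {term: i for i, term in enumerate(vocabulary)}
    n_docs = len(documents)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    tables = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency within this document
        rows = [(term_index[t], t, counts[t], idf[t], counts[t] * idf[t])
                for t in sorted(counts)]
        tables.append(rows)
    return tables


# Two made-up documents used only for illustration.
docs = ["dried herb and dried sage", "crisp acidity and green herb"]
for row in tf_idf_tables(docs)[0]:
    print(row)
```

Terms that occur in both example documents (`and`, `herb`) get an IDF of 1, while terms found in only one of them get a higher weight.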


## Cosine similarity

In [71]:
# Compute the TF-IDF matrix over all documents
tf_idf_liked = TfidfVectorizer(stop_words='english')
tf_idf_matrix_liked = tf_idf_liked.fit_transform(documento_completo['Documentos'])

In [72]:
# Check the dimensions of the matrix
tf_idf_matrix_liked.shape

(10, 121)

In [73]:
# Compute the cosine similarity of the TF-IDF matrix with itself
cosine_similarity_matrix = cosine_similarity(tf_idf_matrix_liked, tf_idf_matrix_liked)

In [74]:
# Check the dimensions of the resulting cosine similarity matrix
cosine_similarity_matrix.shape

(10, 10)
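
Cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The sketch below spells this out in plain Python to make the computation explicit. The function names are hypothetical, and returning 0 for an all-zero vector is a convention chosen here.

```python
import math


def cosine_similarity_vectors(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Square matrix with the similarity of every vector to every other one."""
    return [[cosine_similarity_vectors(u, v) for v in vectors] for u in vectors]


# Three made-up vectors: the first two share one dimension, the third shares none.
for row in similarity_matrix([[1, 1, 0], [1, 0, 0], [0, 0, 2]]):
    print([round(value, 3) for value in row])
```

The result is symmetric with ones on the diagonal (up to floating-point rounding), which is why a 10-document corpus yields a 10 x 10 matrix.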

In [75]:
cosine_similarity_matrix

array([[1.        , 0.01778818, 0.        , 0.04667929, 0.        ,
        0.05860527, 0.10846279, 0.77990298, 0.        , 0.07390319],
       [0.01778818, 1.        , 0.03066838, 0.        , 0.0240976 ,
        0.01698506, 0.16138702, 0.02090135, 0.04010935, 0.02141872],
       [0.        , 0.03066838, 1.        , 0.06975101, 0.02686441,
        0.08618358, 0.        , 0.        , 0.80933387, 0.10868035],
       [0.04667929, 0.        , 0.06975101, 1.        , 0.        ,
        0.05355276, 0.04969092, 0.01878787, 0.09122319, 0.01925292],
       [0.        , 0.0240976 , 0.02686441, 0.        , 1.        ,
        0.02855701, 0.        , 0.        , 0.03513436, 0.08305926],
       [0.05860527, 0.01698506, 0.08618358, 0.05355276, 0.02855701,
        1.        , 0.07995458, 0.068862  , 0.07514291, 0.67306898],
       [0.10846279, 0.16138702, 0.        , 0.04969092, 0.        ,
        0.07995458, 1.        , 0.08555014, 0.        , 0.0447356 ],
       [0.77990298, 0.02090135, 0.       

In [76]:
# The next cell needs a zero-based index, so build a list of the table's index values minus one.
indice_sub = [elemento - 1 for elemento in list(map(int, tabla.index))]

In [77]:
# Row 0 corresponds to document 1 and row 9 to document 10; the columns are labelled with the document numbers 1 to 10
similarity_scores = pd.DataFrame(cosine_similarity_matrix[indice_sub], columns=["1","2","3", "4","5", "6","7", "8","9", "10"])
similarity_scores.sort_values(by=["1","2","3", "4","5", "6","7", "8","9", "10"],ascending=False)

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.017788 | 0.0 | 0.046679 | 0.0 | 0.058605 | 0.108463 | 0.779903 | 0.0 | 0.073903 |
| 7 | 0.779903 | 0.020901 | 0.0 | 0.018788 | 0.0 | 0.068862 | 0.08555 | 1.0 | 0.0 | 0.086837 |
| 6 | 0.108463 | 0.161387 | 0.0 | 0.049691 | 0.0 | 0.079955 | 1.0 | 0.08555 | 0.0 | 0.044736 |
| 9 | 0.073903 | 0.021419 | 0.10868 | 0.019253 | 0.083059 | 0.673069 | 0.044736 | 0.086837 | 0.094758 | 1.0 |
| 5 | 0.058605 | 0.016985 | 0.086184 | 0.053553 | 0.028557 | 1.0 | 0.079955 | 0.068862 | 0.075143 | 0.673069 |
| 3 | 0.046679 | 0.0 | 0.069751 | 1.0 | 0.0 | 0.053553 | 0.049691 | 0.018788 | 0.091223 | 0.019253 |
| 1 | 0.017788 | 1.0 | 0.030668 | 0.0 | 0.024098 | 0.016985 | 0.161387 | 0.020901 | 0.040109 | 0.021419 |
| 8 | 0.0 | 0.040109 | 0.809334 | 0.091223 | 0.035134 | 0.075143 | 0.0 | 0.0 | 1.0 | 0.094758 |
| 2 | 0.0 | 0.030668 | 1.0 | 0.069751 | 0.026864 | 0.086184 | 0.0 | 0.0 | 0.809334 | 0.10868 |
| 4 | 0.0 | 0.024098 | 0.026864 | 0.0 | 1.0 | 0.028557 | 0.0 | 0.0 | 0.035134 | 0.083059 |


In [78]:
# The matrix is symmetric, so only the lower triangle (diagonal included) is kept
triangle = np.tril(cosine_similarity_matrix)

In [79]:
triangle

array([[1.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.01778818, 1.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.03066838, 1.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.04667929, 0.        , 0.06975101, 1.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.0240976 , 0.02686441, 0.        , 1.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.05860527, 0.01698506, 0.08618358, 0.05355276, 0.02855701,
        1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.10846279, 0.16138702, 0.        , 0.04969092, 0.        ,
        0.07995458, 1.        , 0.        , 0.        , 0.        ],
       [0.77990298, 0.02090135, 0.       
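
`np.tril` with its default `k=0` keeps the main diagonal, so each document's similarity to itself remains in the result. As a hedged alternative for listing every unordered pair of documents exactly once, the sketch below uses `itertools.combinations`. The function name and the 3 x 3 matrix are made up for illustration.

```python
from itertools import combinations


def pairwise_similarities(matrix):
    """Yield (doc_a, doc_b, similarity) for every pair with doc_a < doc_b, numbered from 1."""
    n = len(matrix)
    for i, j in combinations(range(n), 2):
        yield i + 1, j + 1, matrix[i][j]


# Hypothetical 3x3 similarity matrix used only for illustration.
example = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.0],
    [0.7, 0.0, 1.0],
]
for doc_a, doc_b, score in pairwise_similarities(example):
    print(f"documents {doc_a} and {doc_b}: {score:.2f}")
```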

## Results



In [80]:
# This cell walks the lower triangle of the similarity matrix row by row (indices 0 to 9); each row holds one document's similarity to the other documents.
# Within each row, every similarity value is checked in turn.
# A document is recommended when its similarity is at least 0.6 and below 1 (1 being a document compared with itself).
for j, fila in enumerate(triangle[indice_sub], start=1):
  for x, z in enumerate(fila, start=1):
    if 0.6 <= z < 1.0:
      print("For document", j, "the recommended document is", x, "with a cosine similarity of", round(z, 2))


For document 8 the recommended document is 1 with a cosine similarity of 0.78
For document 9 the recommended document is 3 with a cosine similarity of 0.81
For document 10 the recommended document is 6 with a cosine similarity of 0.67
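
The loop scans the whole lower triangle, so it would also report pairs of candidate documents that happen to exceed the threshold. A minimal sketch of a narrower variant follows, assuming documents 1 to 7 are the candidates and documents 8 to 10 are the ones the user liked, as in the concatenation step. The function name `recommend` is hypothetical, and the scores are the rounded values from the similarity table earlier in this section.

```python
def recommend(similarity_rows, threshold=0.6):
    """Map each liked document to the candidates whose similarity is >= threshold.

    similarity_rows: {liked_doc_number: [similarity to candidate 1, 2, ...]}
    """
    recommendations = {}
    for liked_doc, scores in similarity_rows.items():
        hits = [(candidate, score)
                for candidate, score in enumerate(scores, start=1)
                if score >= threshold]
        # Best match first.
        recommendations[liked_doc] = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return recommendations


# Similarities of liked documents 8, 9 and 10 to candidates 1-7.
liked_vs_candidates = {
    8: [0.779903, 0.020901, 0.0, 0.018788, 0.0, 0.068862, 0.08555],
    9: [0.0, 0.040109, 0.809334, 0.091223, 0.035134, 0.075143, 0.0],
    10: [0.073903, 0.021419, 0.10868, 0.019253, 0.083059, 0.673069, 0.044736],
}
for liked_doc, hits in recommend(liked_vs_candidates).items():
    for candidate, score in hits:
        print(liked_doc, candidate, round(score, 2))
```

Restricting the columns to the candidates removes the need for the `z < 1.0` check, since a liked document is never compared with itself.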
