# Clustering with K-means

The dataset is a collection of short text quotes, which we want to group using K-means.

|quotes|
|------|
|Graphics designers are most creative people|
|Artificial Intelligence or AI is the last invention - humans could ever make|
|Snooker is a billiards sport for normally two players.|
|Snooker is played on a large (12 feet by 6 feet) table that is covered with a smooth green material.|
|FOREX is the stock market for trading currencies|
|Software Engineering is hotter and hotter topic in Silicon Valley|

Install nltk
```
conda install nltk
```

In [2]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import collections #For fetching dictionary of labels & clusters
import nltk #Natural Language Toolkit
from nltk import word_tokenize #Word tokenization is the process of splitting a large sample of text into words.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer #Normalizing Sentences
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from pprint import pprint

In [13]:
# Download the additional NLP resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/tuteggito/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/tuteggito/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/tuteggito/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Load the data

In [5]:
sentences = pd.read_csv('../datasets/quotes/quotes.csv', header=0)

In [6]:
sentences.head(10)  # Check that the file loaded correctly

Unnamed: 0,Quotes
0,Graphics designers are most creative people
1,Artificial Intelligence or AI is the last inve...
2,Snooker is a billiards sport for normally two ...
3,Snooker is played on a large (12 feet by 6 fee...
4,FOREX is the stock market for trading currencies
5,Software Engineering is hotter and hotter topi...
6,Love is blind
7,Snooker is popular in the United Kingdom and m...
8,The flying or operating of aircraft is known a...
9,AI is likely to be either the best or worst th...


Convert the DataFrame column to a list

In [7]:
sentences_list = sentences["Quotes"].tolist()

In [8]:
sentences_list

['Graphics designers are most creative people',
 'Artificial Intelligence or AI is the last invention - humans could ever make',
 'Snooker is a billiards sport for normally two players.',
 'Snooker is played on a large (12 feet by 6 feet) table that is covered with a smooth green material.',
 'FOREX is the stock market for trading currencies',
 'Software Engineering is hotter and hotter topic in Silicon Valley',
 'Love is blind',
 'Snooker is popular in the United Kingdom and many other countries',
 'The flying or operating of aircraft is known as aviation.',
 'AI is likely to be either the best or worst thing happen to humanity',
 'Design is Intelligence made visible.',
 'Falling in love is like being on drugs.',
 'There is only one happiness in Life to Love and to be loved.',
 "Boeing 777 is considered world's largest economical plane in the world of Aviation.",
 'Warren Buffet is famous for making good investments.He knows stock markets',
 'The biggest of the many uses of aviation a

Create a tokenization function

Tokenization is a fundamental step in Natural Language Processing (NLP) because it turns unstructured text into a form that machines can process. In essence, tokenization splits text into smaller units (words, subwords, or characters) called tokens.
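To make the three granularities concrete, here is a minimal sketch that uses only the standard library. It is not NLTK's `word_tokenize`; the regular expression and the `chunk` helper are assumptions made for illustration, and real subword tokenizers learn their vocabulary from data rather than cutting fixed-size pieces.

```python
import re

text = "Snooker is played on a large table."

# Word-level tokens: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level tokens: every non-space character on its own.
char_tokens = [ch for ch in text if not ch.isspace()]


def chunk(word, size=3):
    """Toy subword split: cut a word into fixed-size pieces (illustrative only)."""
    return [word[i:i + size] for i in range(0, len(word), size)]


subword_tokens = [piece for w in word_tokens for piece in chunk(w)]

print(word_tokens)
print(subword_tokens)
print(char_tokens[:7])
```

The word-level list is the closest analogue to what the notebook feeds into stemming and TF-IDF.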

In [10]:
def tokenizer(text):
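  tokens = word_tokenize(text)  # Split the text into word tokens
  stemmer = PorterStemmer()
  stop_words = set(stopwords.words('english'))  # Build the set once instead of once per token
  # Drop stop words (case-sensitive match) and reduce the remaining tokens to their stems
  tokens = [stemmer.stem(t) for t in tokens if t not in stop_words]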
  return tokens

In [11]:
stemmer = PorterStemmer()
stemmer.stem("running")

'run'

In [14]:
print(tokenizer("I am running to the store."))  # Check that the tokenization function works correctly

['i', 'run', 'store', '.']
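The `'i'` in the output above appears to come from the order of operations: the stop-word test compares the original token `"I"` against a lowercase stop-word list, so it passes the filter, and the stemmer then lowercases it. The sketch below reproduces that effect without NLTK. `STOP_WORDS` is a hypothetical four-word stand-in for the English list, and `str.lower()` stands in for the stemmer's lowercasing, so no stemming happens here.

```python
# Hypothetical mini stop-word set standing in for NLTK's English list,
# whose entries are lowercase.
STOP_WORDS = {"i", "am", "to", "the"}


def filter_case_sensitive(tokens):
    # Same comparison as in tokenizer(): "I" != "i", so "I" survives the filter.
    return [t.lower() for t in tokens if t not in STOP_WORDS]


def filter_case_insensitive(tokens):
    # Lowercase before comparing, so capitalised stop words are removed too.
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]


tokens = ["I", "am", "running", "to", "the", "store", "."]
print(filter_case_sensitive(tokens))    # ['i', 'running', 'store', '.']
print(filter_case_insensitive(tokens))  # ['running', 'store', '.']
```

Whether this matters depends on the caller: if the text is already lowercased before it reaches the tokenizer, both variants behave the same.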


Define the function that clusters the sentences

Train a K-means model
Build the TF-IDF matrix

In [19]:
def cluster_sentences(sentences_list, k):

  # Create the TF-IDF vectorizer, removing stop words
  # TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.
  tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenizer, stop_words=stopwords.words('english'), lowercase=True, token_pattern=None)

  # Build the TF-IDF matrix for the sentences
  # This turns the text into feature vectors that can be fed to the estimator.
  tfidf_matrix = tfidf_vectorizer.fit_transform(sentences_list)
  print(tfidf_matrix)  # Check that the matrix was created correctly

  kmeans = KMeans(n_clusters=k)
  kmeans.fit(tfidf_matrix)

  clusters = collections.defaultdict(list)

  for i, label in enumerate(kmeans.labels_):
    clusters[label].append(i)

  return dict(clusters)
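For intuition about what `KMeans.fit` does with the vectors, the following is a minimal pure-Python sketch of the classic assign-and-update loop (Lloyd's algorithm) on toy 2-D points. It is not scikit-learn's implementation: the `kmeans` function, the fixed iteration count, and the choice of the first `k` points as starting centroids are assumptions made to keep the example deterministic, whereas scikit-learn defaults to k-means++ initialisation.

```python
def kmeans(points, k, iterations=10):
    # Assumption: use the first k points as initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (5.0, 6.0), (1.0, 0.0), (6.0, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)     # [0, 1, 0, 1, 0, 1]
print(centroids)
```

In `cluster_sentences` the points are the rows of the TF-IDF matrix, and the returned dictionary maps each label to the indices of the sentences assigned to it.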

Test the model

In [20]:
k = 7
clusters = cluster_sentences(sentences_list,k)
for cluster in range(k):
  print("\nCLUSTER ",cluster,":\n")
  for i, sentence in enumerate(clusters[cluster]):
    print("\t",(i+1),": ",sentences_list[sentence])

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 200 stored elements and shape (28, 138)>
  Coords	Values
  (0, 62)	0.49415409169888774
  (0, 41)	0.45066279882410365
  (0, 39)	0.5554516262200696
  (0, 99)	0.49415409169888774
  (1, 17)	0.2859764225487876
  (1, 71)	0.2859764225487876
  (1, 14)	0.24707893715800808
  (1, 80)	0.3524721130295035
  (1, 72)	0.3524721130295035
  (1, 4)	0.3135746276387239
  (1, 69)	0.2859764225487876
  (1, 36)	0.3524721130295035
  (1, 49)	0.3524721130295035
  (1, 88)	0.3135746276387239
  (2, 115)	0.31255512990364265
  (2, 24)	0.4164007148801905
  (2, 118)	0.4164007148801905
  (2, 95)	0.4164007148801905
  (2, 128)	0.4164007148801905
  (2, 102)	0.4164007148801905
  (2, 5)	0.18804657403751973
  (3, 115)	0.18903214915171598
  (3, 5)	0.11372984996883059
  (3, 101)	0.2518375624370003
  (3, 78)	0.2518375624370003
  :	:
  (24, 107)	0.4643250517595315
  (24, 11)	0.4643250517595315
  (24, 21)	0.4643250517595315
  (24, 117)	0.4643250517595315
  (25, 115)	0.397
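Each printed entry above has the form `(row, column)  value`: the row is the sentence index, the column is the index of a term in the vocabulary, and the value is that term's TF-IDF weight in that sentence. The sketch below rebuilds such entries by hand for three toy documents. The documents are hypothetical and already tokenized; the weighting uses the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which matches the defaults documented for `TfidfVectorizer`, but it is an illustration rather than the library's code.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents.
docs = [["snooker", "table"], ["snooker", "player"], ["stock", "market"]]

vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc))


def tfidf_row(doc):
    counts = Counter(doc)
    # Raw term count times smoothed idf.
    weights = {
        t: counts[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t in counts
    }
    # L2-normalise so every row has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    # Sparse view: only the non-zero entries, as (column index, value) pairs.
    return sorted((vocab.index(t), w / norm) for t, w in weights.items())


for row, doc in enumerate(docs):
    for col, value in tfidf_row(doc):
        print((row, col), round(value, 4))
```

In the first toy document, "snooker" receives a lower weight than "table" because it also occurs in the second document. Terms shared between sentences end up in the same column, which is what lets K-means place those sentences close together.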

