### Datasets

At this point, the input to this process is a dataframe with the following columns:
1. announcement_id
2. description
3. price
4. locali
5. superficie
6. bagni
7. piano

All the data has already been cleaned: the description is pre-processed and the other features are cleaned.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_excel('Database.xlsx')

In [3]:
df

Unnamed: 0,Description,price,locali,superficie,bagni,piano
0,Le residenze di Rinascimento Quinto completano...,192000,4,46,1,2
1,Il complesso residenziale è concepito per aver...,705000,3,67,2,3
2,L’iniziativa prevede la realizzazione di tre e...,650000,3,56,2,3


#### 1) Information
The first matrix will have this format: $m_{ij} = value$ where $i \in \{announcement_1, ..., announcement_n\}$ and $j \in \{price, locali, superficie, bagni, piano \}$. Here n is the number of announcements.

#### Not all announcements may have all the fields mentioned above; if an announcement is missing a field, do not take it into account.
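One way to skip incomplete announcements is to drop the rows that lack any of the fields of interest before building the matrix. The following is a minimal sketch on toy data: the values and the names `toy` and `complete` are hypothetical, and it assumes that missing fields are stored as `NaN`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): the third announcement has no price
toy = pd.DataFrame({
    "price": [192000, 705000, np.nan],
    "locali": [4, 3, 3],
    "superficie": [46, 67, 56],
    "bagni": [1, 2, 2],
    "piano": [2, 3, 3],
})

columns_of_interest = ["price", "locali", "superficie", "bagni", "piano"]

# Keep only the announcements that have every field of interest
complete = toy.dropna(subset=columns_of_interest)
print(complete)
```

If missing values are encoded differently (for example as empty strings), they would need to be converted to `NaN` first.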

In [4]:
def information_matrix(df):
    """ information matrix creator
    input: dataframe with all the information, pre-processed
    output format: dataframe"""
    columns_of_interest = ["price", "locali", "superficie", "bagni", "piano"]
    return df[columns_of_interest]

In [5]:
information_matrix = information_matrix(df)

In [6]:
information_matrix

Unnamed: 0,price,locali,superficie,bagni,piano
0,192000,4,46,1,2
1,705000,3,67,2,3
2,650000,3,56,2,3


In [56]:
# In case we need it as an array
information_array = information_matrix.values
information_array

array([[192000,      4,     46,      1,      2],
       [705000,      3,     67,      2,      3],
       [650000,      3,     56,      2,      3]], dtype=int64)

#### 2) Description
The second matrix will have this format: $m_{ij} = tfIdf_{ij}$ where $i \in \{announcement_1, ..., announcement_n\}$ and $j \in \{word_1, ...,word_m\}$. Here n is the number of announcements and m is the size of the vocabulary. 
This time, you must implement the Tf-Idf by yourself (not with libraries). 
Make sure to use the complete description inside the link of the announcement.

In [76]:
df

Unnamed: 0,Description,price,locali,superficie,bagni,piano
0,Le residenze di Rinascimento Quinto completano...,192000,4,46,1,2
1,Il complesso residenziale è concepito per aver...,705000,3,67,2,3
2,L’iniziativa prevede la realizzazione di tre e...,650000,3,56,2,3


#### Tf-Idf

##### Tf-idf = Term frequency * Inverse document frequency

##### 1. Term frequency
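Number of times that term t occurs in document d. The `frequency` function below divides this count by the number of words in the document, so it works with a relative frequency.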


##### 2. Inverse document frequency 
It is a measure of how much information the word provides, i.e., whether it is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

log(total number of documents / number of documents where the term appears)

Note that if the term is not in any document, this leads to a division by zero. It is therefore common to adjust the denominator to 1 + the number of documents where the term appears.
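The adjusted denominator can be sketched as a small helper. The name `smoothed_idf` is hypothetical, and the base-10 logarithm is an assumption chosen to match the `log(..., 10)` call used in this notebook.

```python
from math import log10

def smoothed_idf(total_docs, docs_with_term):
    """Inverse document frequency with the denominator shifted by 1,
    so that a term appearing in no document does not divide by zero."""
    return log10(total_docs / (1 + docs_with_term))

# A term that appears in none of 10 documents no longer raises an error
print(smoothed_idf(10, 0))
# A term that appears in every document gets a slightly negative value
print(smoothed_idf(10, 10))
```

One side effect of this adjustment is that a term present in every document receives a negative weight instead of zero, which may or may not be acceptable depending on how the matrix is used afterwards.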

In [9]:
# Libraries
from collections import defaultdict
from math import log

In [77]:
# Sub-functions
def vocabulary(df):
    """input: dataframe with all the information, pre-processed
    output: a list with all the unique words of the descriptions"""
    voc_list = []
    seen = set()  # set membership is faster than searching the list
    # The first column holds the description
    for text in df.iloc[:, 0]:
        for word in str(text).split():
            if word not in seen:
                seen.add(word)
                voc_list.append(word)
    return voc_list

def inv_freq(df):
    """input: dataframe with all the information, pre-processed
    output: a dictionary, with key: word; value: inverse document frequency of the word"""
    # A list with all the unique words of the dataframe
    voc_list = vocabulary(df)
    # Total number of documents
    Tot_num_docs = len(df)
    # Number of documents where the term appears (dictionary)
    num_doc_dict = defaultdict(int)
    for i in range(len(df)):
        # Count each word at most once per document
        for word in set(str(df.values[i][0]).split()):
            num_doc_dict[word] += 1
    # Inverse document frequency for each word (base-10 logarithm)
    inv_freq_dict = defaultdict(float)
    for word in voc_list:
        inv_freq_dict[word] = log(Tot_num_docs / num_doc_dict[word], 10)
    return inv_freq_dict

def frequency(L):
    """input: a list of words
    output: a dictionary, with key: word; value: frequency of that word in the list"""
    d_freq = defaultdict(int)
    for word in L:
        d_freq[word] += 1 
    for key in d_freq:
        d_freq[key] = d_freq[key]/len(L)
    return d_freq

In [102]:
# Final function
def Tfidf(df):
    """input: dataframe with all the information, pre-processed
    output: a dataframe with one row per document (row number) and one column
    per word; each cell holds the tf-idf of that word in that document
    (0 if the word does not occur in the document)"""
    d = {}
    # Create a dictionary with all the words and the inverse document frequency of each word
    inv_freq_dict = inv_freq(df)
    # Create a frequency dictionary: as key1: id_doc; value1: dictionary
    # key2: word in document; value2: frequency of each word
    for i in range(len(df)):
        L = str(df.values[i][0]).split()
        d[i] = frequency(L)
    # Create the final dictionary, with the tf-idf of each word in each document
    for i in range(len(df)):
        for key in d[i]:
            d[i][key] = d[i][key]*inv_freq_dict[key]
    # Build the matrix from the local dictionary d (not from a global variable)
    D = pd.DataFrame(d).T
    D = D.fillna(value = 0)
    return D

In [103]:
Description_Matrix = Tfidf(df)
Description_Matrix

Unnamed: 0,"A/A+,",Figli.,Gruppo,Il,Le,L’idea,L’iniziativa,L’intero,Mezzaroma,Ogni,...,stilistica,suoi,terrazze,tre,una,unità,"urbana,",verde,volumi,è
0,0.0,0.01988,0.01988,0.0,0.01988,0.0,0.0,0.0,0.01988,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.00994,0.0,0.0,0.0,0.00994,0.0,0.0,...,0.0,0.00994,0.0,0.0,0.00994,0.0,0.00994,0.00994,0.0,0.007337
2,0.00734,0.0,0.0,0.0,0.0,0.00734,0.00734,0.0,0.0,0.00734,...,0.00734,0.0,0.00734,0.00734,0.0,0.00734,0.0,0.0,0.00734,0.002709


#### Clustering
This step consists of clustering the house announcements using K-means++. 
To do that, you can use this Python library (the cells below use scikit-learn's `KMeans`, whose default initialization is k-means++). 
Choose the optimal number of clusters using the Elbow-Method.

In [98]:
information_matrix

Unnamed: 0,price,locali,superficie,bagni,piano
0,192000,4,46,1,2
1,705000,3,67,2,3
2,650000,3,56,2,3


In [47]:
from sklearn.cluster import KMeans

In [93]:
K = KMeans(n_clusters=2).fit(information_matrix.values)

In [96]:
centroids = K.cluster_centers_

In [97]:
centroids

array([[6.775e+05, 3.000e+00, 6.150e+01, 2.000e+00, 3.000e+00],
       [1.920e+05, 4.000e+00, 4.600e+01, 1.000e+00, 2.000e+00]])

In [106]:
K.score(information_matrix.values)

-1512500060.5

In [113]:
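# Elbow method: KMeans.score returns the negative inertia
# (minus the within-cluster sum of squared distances) for each k

km = [KMeans(n_clusters = i) for i in range(1, 4)]
score = [model.fit(information_matrix.values).score(information_matrix.values) for model in km]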

In [114]:
score

[-158652666889.3335, -1512500060.5, -0.0]
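The elbow method looks for the value of k after which adding more clusters stops reducing the inertia by much. The following is a minimal sketch on toy data: the points in `X` and the helper name `elbow_inertias` are hypothetical, and `random_state=0` is fixed only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (hypothetical): three well-separated groups of points
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10],
              [20, 0], [20, 1], [21, 0]], dtype=float)

def elbow_inertias(X, k_values, random_state=0):
    """Return the K-means inertia (within-cluster sum of squares) for each k."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, init="k-means++", n_init=10,
                       random_state=random_state).fit(X)
        inertias.append(model.inertia_)
    return inertias

k_values = list(range(1, 6))
inertias = elbow_inertias(X, k_values)
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 2))
```

Plotting `inertias` against `k_values` and picking the k where the curve bends gives the elbow; for this toy data the drop should flatten after k = 3. With only three announcements, as in the sample dataframe shown here, the curve has too few points to be informative.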

Missing functions:
- Elbow-Method
- Jaccard-Similarity
- Wordcloud
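
As a starting point for the Jaccard similarity, here is a minimal sketch that compares two sets. It assumes the sets contain announcement row numbers belonging to two clusters; the function name, the example sets, and the convention for two empty sets are all hypothetical choices.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(set_a & set_b) / len(union)

# Hypothetical clusters, given as sets of announcement row numbers
cluster_from_information = {0, 1, 2, 5}
cluster_from_description = {1, 2, 3}
similarity = jaccard_similarity(cluster_from_information, cluster_from_description)
print(similarity)
```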