### Datasets

At this point, the input to this process is a dataframe with the following columns:
1. announcement_id
2. description
3. price
4. locali
5. superficie
6. bagni
7. piano

All the data has been cleaned: the description has been pre-processed, and the other features have been cleaned as well.

In [12]:
import pandas as pd
import numpy as np

In [5]:
df = pd.read_excel('Database.xlsx')

In [6]:
df

Unnamed: 0,Description,price,locali,superficie,bagni,piano
0,Le residenze di Rinascimento Quinto completano...,192000,4,46,1,2
1,Il complesso residenziale è concepito per aver...,705000,3,67,2,3
2,L’iniziativa prevede la realizzazione di tre e...,650000,3,56,2,3


#### 1) Information
The first matrix will have this format: $m_{ij} = value$, where $i \in \{announcement_1, ..., announcement_n\}$ and $j \in \{price, locali, superficie, bagni, piano\}$. Here $n$ is the number of announcements.

#### Some announcements may not have all the fields mentioned above; if that is the case, do not take them into account.
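
One way to read the note above is to discard every announcement that lacks a value in any of the five fields. The following is a minimal sketch under that assumption; the dataframe `toy` and its values are hypothetical, and missing fields are assumed to be stored as `NaN`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); NaN marks a missing field.
toy = pd.DataFrame({
    "price": [192000, 705000, np.nan],
    "locali": [4, 3, 3],
    "superficie": [46, np.nan, 56],
    "bagni": [1, 2, 2],
    "piano": [2, 3, 3],
})

columns_of_interest = ["price", "locali", "superficie", "bagni", "piano"]

# Keep only the announcements that have a value in every column of interest.
complete = toy.dropna(subset=columns_of_interest)
print(len(toy), len(complete))
```

Only the first toy row survives here, because the other two each miss one field. If missing values were encoded differently (for example as empty strings), they would need to be converted to `NaN` first.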

In [9]:
def information_matrix(df):
    """ information matrix creator
    input: dataframe with all the information, pre-processed
    output format: dataframe"""
    columns_of_interest = ["price", "locali", "superficie", "bagni", "piano"]
    return df[columns_of_interest]

In [10]:
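# Note: this rebinds the name information_matrix from the function to its result,
# so this cell cannot be run a second time without redefining the function first.
information_matrix = information_matrix(df)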

In [11]:
information_matrix

Unnamed: 0,price,locali,superficie,bagni,piano
0,192000,4,46,1,2
1,705000,3,67,2,3
2,650000,3,56,2,3


In [15]:
# In case we need it as an array
information_array = information_matrix.values
information_array

array([[192000,      4,     46,      1,      2],
       [705000,      3,     67,      2,      3],
       [650000,      3,     56,      2,      3]], dtype=int64)

#### 2) Description
The second matrix will have this format: $m_{ij} = tfIdf_{ij}$, where $i \in \{announcement_1, ..., announcement_n\}$ and $j \in \{word_1, ..., word_m\}$. Here $n$ is the number of announcements and $m$ is the cardinality of the vocabulary.
This time, you must implement the Tf-Idf yourself (not with libraries).
Make sure to use the complete description found at the announcement's link.
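
The $n \times m$ format described above can be illustrated with plain Python lists. This is only a sketch: it assumes the weights are already available as a nested dictionary keyed by document id and then by word, and the names `weights`, `vocab` and `matrix` as well as the numeric values are hypothetical.

```python
# Hypothetical Tf-Idf weights for two documents, stored as {doc_id: {word: weight}}.
weights = {
    0: {"residenze": 0.48, "il": 0.35},
    1: {"complesso": 0.48, "il": 0.35},
}

# Fix a column order: one column per vocabulary word.
vocab = sorted({word for doc in weights.values() for word in doc})

# One row per document; words that do not occur in a document get weight 0.0.
matrix = [
    [weights[doc_id].get(word, 0.0) for word in vocab]
    for doc_id in sorted(weights)
]

print(vocab)
for row in matrix:
    print(row)
```

Sorting the vocabulary is just one way to fix the column order; any order works as long as it is the same for every row.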

In [16]:
df

Unnamed: 0,Description,price,locali,superficie,bagni,piano
0,Le residenze di Rinascimento Quinto completano...,192000,4,46,1,2
1,Il complesso residenziale è concepito per aver...,705000,3,67,2,3
2,L’iniziativa prevede la realizzazione di tre e...,650000,3,56,2,3


In [None]:
# In this case, the output is a dictionary of dictionaries: {doc_id: {word1: tf-idf1, word2: tf-idf2, ...}, ...}

#### Tf-Idf

##### Tf-idf = Term frequency * Inverse document frequency

##### 1. Term frequency
Number of times that term t occurs in document d.
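
As a small illustration of this definition, the raw counts for one document can be obtained with `collections.Counter`. The sentence used here is a hypothetical toy document, split on whitespace.

```python
from collections import Counter

# Hypothetical toy document, tokenized on whitespace.
document = "il progetto di Roma e il giardino di Roma"
tf = Counter(document.split())

# Raw term frequencies for a few words; absent words count as 0.
print(tf["il"], tf["Roma"], tf["giardino"], tf["assente"])
```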


##### 2. Inverse document frequency 
It is a measure of how much information the word provides, i.e., whether it is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient:

$idf(t) = \log(\text{total number of documents} / \text{number of documents where the term appears})$

Note that if the term does not appear in any document, this leads to a division by zero. It is therefore common to adjust the denominator to 1 + the number of documents where the term appears. The cells below use the base-10 logarithm and the unadjusted denominator, since every vocabulary word occurs in at least one description.
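
The difference between the plain and the adjusted denominator can be checked on a toy corpus. This is a sketch only: the three tokenized documents and the helper names `doc_freq`, `idf` and `idf_smoothed` are hypothetical, and the base-10 logarithm is an assumption.

```python
from math import log10

# Hypothetical toy corpus of three tokenized documents.
docs = [
    ["il", "progetto", "di", "Roma"],
    ["il", "complesso", "di", "Milano"],
    ["la", "casa", "di", "Roma"],
]
n_docs = len(docs)

def doc_freq(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

def idf(term):
    """Plain IDF; raises ZeroDivisionError for a term absent from every document."""
    return log10(n_docs / doc_freq(term))

def idf_smoothed(term):
    """IDF with the denominator adjusted to 1 + document frequency."""
    return log10(n_docs / (1 + doc_freq(term)))

print(idf("di"), idf("Roma"), idf("Milano"))
print(idf_smoothed("assente"), idf_smoothed("di"))
```

A word that appears in every document ("di") gets a plain IDF of 0, while a word that appears in one document out of three ("Milano") gets the largest value. With the adjusted denominator an unseen word no longer raises an error, but a word present in every document ends up with a slightly negative weight.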

In [80]:
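def vocabulary(df):
    """Return the distinct words of all descriptions, in order of first appearance."""
    voc_list = []
    seen = set()  # set lookup avoids scanning voc_list for every word
    for index in range(len(df)):
        # the description is the first column of the dataframe
        for word in str(df.values[index][0]).split():
            if word not in seen:
                seen.add(word)
                voc_list.append(word)
    return voc_list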

In [81]:
voc_list = vocabulary(df)

In [82]:
voc_list

['Le',
 'residenze',
 'di',
 'Rinascimento',
 'Quinto',
 'completano',
 'il',
 'programma',
 'edilizio',
 'Rione',
 'Rinascimento,',
 'progetto',
 'città',
 'giardino',
 'per',
 'Roma',
 'del',
 'Gruppo',
 'Pietro',
 'Mezzaroma',
 'e',
 'Figli.',
 'Il',
 'complesso',
 'residenziale',
 'è',
 'concepito',
 'avere',
 'minor',
 'impatto',
 'ambientale',
 'paesaggistico',
 'possibile,',
 'così',
 'da',
 'integrarsi',
 'armoniosamente',
 'con',
 'verde',
 'Parco',
 'Talenti.',
 'L’intero',
 'quartiere',
 'difatti',
 'al',
 'centro',
 'una',
 'profonda',
 'riqualificazione',
 'urbana,',
 'che',
 'fa',
 'dell’eco-compatibilità',
 'della',
 'ricerca',
 'architettonica',
 'i',
 'suoi',
 'punti',
 'maggior',
 'forza.',
 'L’iniziativa',
 'prevede',
 'la',
 'realizzazione',
 'tre',
 'edifici',
 'distinti.',
 'Ogni',
 'unità',
 'abitativa,',
 'in',
 'classe',
 'energetica',
 'A/A+,',
 'finemente',
 'rifinita,',
 'composta',
 'materiali',
 'prima',
 'qualità',
 'forniti',
 'dalle',
 'migliori',
 'azi

In [51]:
from collections import defaultdict

In [105]:
from math import log

In [99]:
d

{0: defaultdict(int,
             {'Le': 1,
              'residenze': 1,
              'di': 2,
              'Rinascimento': 1,
              'Quinto': 1,
              'completano': 1,
              'il': 2,
              'programma': 1,
              'edilizio': 1,
              'Rione': 1,
              'Rinascimento,': 1,
              'progetto': 1,
              'città': 1,
              'giardino': 1,
              'per': 1,
              'Roma': 1,
              'del': 1,
              'Gruppo': 1,
              'Pietro': 1,
              'Mezzaroma': 1,
              'e': 1,
              'Figli.': 1}),
 1: defaultdict(int,
             {'Il': 1,
              'complesso': 1,
              'residenziale': 1,
              'è': 2,
              'concepito': 1,
              'per': 1,
              'avere': 1,
              'il': 2,
              'minor': 1,
              'impatto': 1,
              'ambientale': 1,
              'e': 2,
              'paesaggistico': 1,
     

In [65]:
# d is a dictionary whose keys are the document ids (row numbers).
# Each value is another dictionary that maps each word to its frequency in that document.

In [116]:
# Total number of documents
Tot_num_docs = len(df)

# Number of documents in which each term appears
num_doc_dict = defaultdict(int)
for i in range(len(df)):
    # set() so that each word is counted at most once per document
    for word in set(str(df.values[i][0]).split()):
        num_doc_dict[word] += 1

# Inverse document frequency of each word (base-10 logarithm)
inv_freq_dict = defaultdict(float)
for word in voc_list:
    inv_freq_dict[word] = log(Tot_num_docs / num_doc_dict[word], 10)

In [121]:
inv_freq_dict  # a dictionary mapping each word to its inverse document frequency

defaultdict(float,
            {'Le': 0.47712125471966244,
             'residenze': 0.47712125471966244,
             'di': 0.0,
             'Rinascimento': 0.47712125471966244,
             'Quinto': 0.47712125471966244,
             'completano': 0.47712125471966244,
             'il': 0.17609125905568124,
             'programma': 0.47712125471966244,
             'edilizio': 0.47712125471966244,
             'Rione': 0.47712125471966244,
             'Rinascimento,': 0.47712125471966244,
             'progetto': 0.17609125905568124,
             'città': 0.47712125471966244,
             'giardino': 0.47712125471966244,
             'per': 0.0,
             'Roma': 0.47712125471966244,
             'del': 0.17609125905568124,
             'Gruppo': 0.47712125471966244,
             'Pietro': 0.47712125471966244,
             'Mezzaroma': 0.47712125471966244,
             'e': 0.0,
             'Figli.': 0.47712125471966244,
             'Il': 0.47712125471966244,
             'co

In [122]:
def frequency_dict(L):
    """Return a dictionary mapping each word of the list L to its number of occurrences."""
    d_freq = defaultdict(int)
    for word in L:
        d_freq[word] += 1
    return d_freq

In [129]:
# Build the term-frequency dictionary of every document (row)
d = {}
for i in list(df.index.values):
    L = str(df.values[i][0]).split()
    d[i] = frequency_dict(L)

In [127]:
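# Single-entry check of the scaling step. It is left commented out because running it
# before the loop below would multiply the weight of "Le" in document 0 by its IDF twice.
# d[0]["Le"] = d[0]["Le"]*inv_freq_dict["Le"]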

In [131]:
# Turn each term frequency into a Tf-Idf weight: frequency * inverse document frequency
for i in range(len(df)):
    for key in d[i]:
        d[i][key] = d[i][key]*inv_freq_dict[key]

In [132]:
d

{0: defaultdict(int,
             {'Le': 0.47712125471966244,
              'residenze': 0.47712125471966244,
              'di': 0.0,
              'Rinascimento': 0.47712125471966244,
              'Quinto': 0.47712125471966244,
              'completano': 0.47712125471966244,
              'il': 0.3521825181113625,
              'programma': 0.47712125471966244,
              'edilizio': 0.47712125471966244,
              'Rione': 0.47712125471966244,
              'Rinascimento,': 0.47712125471966244,
              'progetto': 0.17609125905568124,
              'città': 0.47712125471966244,
              'giardino': 0.47712125471966244,
              'per': 0.0,
              'Roma': 0.47712125471966244,
              'del': 0.17609125905568124,
              'Gruppo': 0.47712125471966244,
              'Pietro': 0.47712125471966244,
              'Mezzaroma': 0.47712125471966244,
              'e': 0.0,
              'Figli.': 0.47712125471966244}),
 1: defaultdict(int,
          