### Content-based recommendation.

#### The dataset: data released for the Big Data Challenge organized by Telecom Italia Mobile (TIM) in 2016.

##### It contains data on 2,107,755 companies in seven Italian metropolitan areas: Milan, Turin, Venice, Rome, Naples, Bari, and Palermo. For every company, the following are given:

- the exact location in latitude and longitude,
- the metropolitan area itself,
- the size of the company in terms of employees,
- the age,
- the ATECO code, which indicates the economic sector in which the company operates.

In [1]:
# importing the packages 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import folium

from scipy.spatial.distance import pdist, squareform

In [2]:
data = pd.read_csv("data/manufacturing_companies_high_intraindex.csv")

In [3]:
data.head(2)

Unnamed: 0,ids,lon,lat,kind,location,name,ateco,size,age,intrasector
0,86211,16.872418,41.122014,U,Bari,GOLDEN LADY COMPANY SOCIETA PER AZIONI,14,grande,over20y,0.93
1,510137,16.9281,40.8194,S,Bari,VEBAD S.P.A.,23,media,over20y,0.91


### Prepare data for content-based recommendation (CBR)

In [4]:
sub_cols = ['location','size','age']
subset = data[sub_cols]

In [5]:
subset.isna().any()

location    False
size        False
age         False
dtype: bool

In [6]:
# One-hot encode each categorical attribute; keep the company name as the row label
location = pd.get_dummies(subset['location'])
size = pd.get_dummies(subset['size'])
age = pd.get_dummies(subset['age'])
name = data[['name']]


In [7]:
df = pd.concat([name, location, size, age], axis=1)
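
The three separate `get_dummies` calls can also be collapsed into one. The sketch below is an alternative rather than the notebook's own code: the `toy` frame and its company names are made up for illustration, and `dtype=int` is passed because recent pandas versions return boolean dummy columns by default. With the default prefix, the columns come out as `location_Bari`, `size_grande`, and so on, instead of the bare labels used in the notebook.

```python
import pandas as pd

# Hypothetical toy rows using the same categorical columns as the notebook.
toy = pd.DataFrame({
    "name": ["A S.P.A.", "B S.R.L.", "C S.P.A."],
    "location": ["Bari", "Bari", "Venezia"],
    "size": ["grande", "media", "media"],
    "age": ["over20y", "over20y", "10y-20y"],
})

# Encode all three attributes in one call, with the company name as the index.
encoded = pd.get_dummies(
    toy.set_index("name"),
    columns=["location", "size", "age"],
    dtype=int,  # force 0/1 integers instead of booleans
)
print(encoded)

# Every company has exactly one location, one size, and one age bucket,
# so every encoded row contains exactly three ones.
print(encoded.sum(axis=1))
```

Because every row contains exactly three ones, the Jaccard measure computed later can only take a handful of distinct values.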

In [8]:
df

Unnamed: 0,name,Bari,Milano,Napoli,Palermo,Roma,Torino,Venezia,grande,media,0y-1y,10y-20y,1y-2y,2y-5y,5y-10y,over20y
0,GOLDEN LADY COMPANY SOCIETA PER AZIONI,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1
1,VEBAD S.P.A.,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1
2,NATUZZI S.P.A.,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1
3,LUIGI LAVAZZA - SOCIETA PER AZIONI ABBREVIABI...,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1
4,MARINA RINALDI S.R.L.,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10300,PESPOW S.P.A.,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0
10301,TECNODOM S.P.A.,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0
10302,T.R.S. EVOLUTION S.P.A.,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1
10303,MORETTO S.P.A.,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1


In [9]:
df = df.set_index(['name'])

In [10]:
df

name,Bari,Milano,Napoli,Palermo,Roma,Torino,Venezia,grande,media,0y-1y,10y-20y,1y-2y,2y-5y,5y-10y,over20y
GOLDEN LADY COMPANY SOCIETA PER AZIONI,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1
VEBAD S.P.A.,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1
NATUZZI S.P.A.,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1
LUIGI LAVAZZA - SOCIETA PER AZIONI ABBREVIABILE ANCHE NELLA SIGLA: LAVAZZA S.P.A.,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1
MARINA RINALDI S.R.L.,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
PESPOW S.P.A.,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0
TECNODOM S.P.A.,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0
T.R.S. EVOLUTION S.P.A.,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1
MORETTO S.P.A.,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1


### Find the distance between each multidimensional point and all the other points

In [11]:
# Pairwise Jaccard distances between the one-hot rows, in condensed (upper-triangle) form
jaccard_distances = pdist(df.values, metric='jaccard')
print(jaccard_distances)

[0.5 0.  0.  ... 0.5 0.  0.5]


In [12]:
len(jaccard_distances)

53091360
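
The length of the condensed vector is the number of unordered pairs of companies, n(n - 1)/2. A quick check, assuming the 10305 rows reported by the square matrix shape in this notebook, also gives a rough idea of how much memory the full square matrix needs as `float64`:

```python
from math import comb

n_companies = 10305  # number of rows in the encoded table

# Unordered pairs of distinct companies: n * (n - 1) / 2
n_pairs = comb(n_companies, 2)
print(n_pairs)  # 53091360

# The square form stores n * n float64 values at 8 bytes each.
square_bytes = n_companies * n_companies * 8
print(f"{square_bytes / 1e6:.1f} MB")  # 849.5 MB
```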

In [13]:
# Expand the condensed vector into a symmetric square matrix with a zero diagonal
square_jaccard_distances = squareform(jaccard_distances)
print(square_jaccard_distances)

[[0.  0.5 0.  ... 0.5 0.8 0.5]
 [0.5 0.  0.5 ... 0.8 0.5 0.8]
 [0.  0.5 0.  ... 0.5 0.8 0.5]
 ...
 [0.5 0.8 0.5 ... 0.  0.5 0. ]
 [0.8 0.5 0.8 ... 0.5 0.  0.5]
 [0.5 0.8 0.5 ... 0.  0.5 0. ]]


In [14]:
square_jaccard_distances.shape

(10305, 10305)

In [15]:
# Turn distances into similarities: 1.0 means identical location, size, and age
jaccard_similarity_array = 1 - square_jaccard_distances
print(jaccard_similarity_array)


[[1.  0.5 1.  ... 0.5 0.2 0.5]
 [0.5 1.  0.5 ... 0.2 0.5 0.2]
 [1.  0.5 1.  ... 0.5 0.2 0.5]
 ...
 [0.5 0.2 0.5 ... 1.  0.5 1. ]
 [0.2 0.5 0.2 ... 0.5 1.  0.5]
 [0.5 0.2 0.5 ... 1.  0.5 1. ]]
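
The values 1.0, 0.5, 0.2, and 0.0 in this matrix follow directly from the encoding: each company has exactly three ones, so two companies sharing k attributes have a Jaccard similarity of k / (6 - k). The sketch below recomputes this by hand for four rows copied from the encoded table; the helper `jaccard_similarity` is written here for illustration and is not part of SciPy.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two equal-length 0/1 vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)    # set in both vectors
    either = sum(1 for x, y in zip(a, b) if x or y)   # set in at least one
    return both / either

# Column order: Bari, Milano, Napoli, Palermo, Roma, Torino, Venezia,
#               grande, media, 0y-1y, 10y-20y, 1y-2y, 2y-5y, 5y-10y, over20y
golden_lady = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]  # Bari, grande, over20y
vebad       = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]  # Bari, media, over20y
moretto     = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]  # Venezia, media, over20y
pespow      = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0]  # Venezia, media, 10y-20y

print(jaccard_similarity(golden_lady, golden_lady))  # 3 shared -> 3/3 = 1.0
print(jaccard_similarity(golden_lady, vebad))        # 2 shared -> 2/4 = 0.5
print(jaccard_similarity(golden_lady, moretto))      # 1 shared -> 1/5 = 0.2
print(jaccard_similarity(golden_lady, pespow))       # 0 shared -> 0/6 = 0.0
```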


In [16]:
df = df.reset_index()

In [17]:
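# Note: despite its name, distance_df holds Jaccard similarities (1 - distance),
# labelled by company name on both axes
distance_df = pd.DataFrame(jaccard_similarity_array, index=df['name'], columns=df['name'])
distance_df.head()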

name,GOLDEN LADY COMPANY SOCIETA PER AZIONI,VEBAD S.P.A.,NATUZZI S.P.A.,LUIGI LAVAZZA - SOCIETA PER AZIONI ABBREVIABILE ANCHE NELLA SIGLA: LAVAZZA S.P.A.,MARINA RINALDI S.R.L.,HARMONT & BLAINE JEANS HARMONT & BLAINE S.P.A.,FORM DESIGN SOCIETA A RESPONSABILITA LIMITATA,A. DE ROBERTIS & FIGLI S.P.A.,VINCENZO ZUCCHI - SOCIETA PER AZIONI,TOORA CASTING SPA,...,ARD F.LLI RACCANELLO S.P.A. - INDUSTRIA VERNICI E SMALTI,GIUSEPPE BELLORA S.P.A.,SAFILO - SOCIETA AZIONARIA FABBRICA ITALIANA LAVORAZIONE OCCHIALI - S.P.A. - IN BREVE SAFILO S.P.A. -,ALIPLAST S.P.A.,RESCHIGLIAN S.R.L.,PESPOW S.P.A.,TECNODOM S.P.A.,T.R.S. EVOLUTION S.P.A.,MORETTO S.P.A.,GOLDEN LADY COMPANY SOCIETA PER AZIONI
GOLDEN LADY COMPANY SOCIETA PER AZIONI,1.0,0.5,1.0,1.0,1.0,1.0,0.2,0.5,1.0,0.2,...,0.0,0.2,0.2,0.2,0.2,0.0,0.0,0.5,0.2,0.5
VEBAD S.P.A.,0.5,1.0,0.5,0.5,0.5,0.5,0.5,1.0,0.5,0.5,...,0.2,0.5,0.0,0.5,0.5,0.2,0.2,0.2,0.5,0.2
NATUZZI S.P.A.,1.0,0.5,1.0,1.0,1.0,1.0,0.2,0.5,1.0,0.2,...,0.0,0.2,0.2,0.2,0.2,0.0,0.0,0.5,0.2,0.5
LUIGI LAVAZZA - SOCIETA PER AZIONI ABBREVIABILE ANCHE NELLA SIGLA: LAVAZZA S.P.A.,1.0,0.5,1.0,1.0,1.0,1.0,0.2,0.5,1.0,0.2,...,0.0,0.2,0.2,0.2,0.2,0.0,0.0,0.5,0.2,0.5
MARINA RINALDI S.R.L.,1.0,0.5,1.0,1.0,1.0,1.0,0.2,0.5,1.0,0.2,...,0.0,0.2,0.2,0.2,0.2,0.0,0.0,0.5,0.2,0.5
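
With the similarity matrix labelled by name, a recommendation is a lookup of one column sorted in descending order. The helper below is a minimal sketch of that idea, not the notebook's code: the function name `most_similar`, its parameters, and the four-company `toy` matrix are assumptions made for illustration. It excludes the queried company from its own results and guards against duplicated names, which do occur in this table (GOLDEN LADY appears twice among the columns).

```python
import pandas as pd

def most_similar(sim_df, company, n=5):
    """Return the n companies most similar to `company`, excluding itself."""
    col = sim_df[company]
    # A duplicated name selects several columns; keep the first one.
    if isinstance(col, pd.DataFrame):
        col = col.iloc[:, 0]
    # Drop the company's own row(s), whose similarity is trivially 1.0.
    scores = col[col.index != company]
    return scores.sort_values(ascending=False).head(n)

# Hypothetical symmetric similarity matrix for four made-up companies.
names = ["A", "B", "C", "D"]
toy = pd.DataFrame(
    [[1.0, 0.5, 0.2, 0.0],
     [0.5, 1.0, 1.0, 0.2],
     [0.2, 1.0, 1.0, 0.5],
     [0.0, 0.2, 0.5, 1.0]],
    index=names, columns=names,
)

top = most_similar(toy, "A", n=2)
print(top)
```

On the real data many companies tie at a similarity of 1.0, so the order among the top results is arbitrary unless a secondary criterion is added.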


In [18]:
#distance_df= distance_df.reset_index()

In [19]:
#distance_df.iloc[4:5, :].values

In [20]:
# Rank all companies by similarity to VEBAD S.P.A., most similar first
print(distance_df[' VEBAD S.P.A.'].sort_values(ascending=False))

name
 CINEMECCANICA S.P.A.                                                                                                                                1.0
 VIDEOGRAFICA S.R.L.                                                                                                                                 1.0
 PETTENON COSMETICS S.P.A.                                                                                                                           1.0
 AIR FIRE SOCIETA PER AZIONI                                                                                                                         1.0
 PLASTIK S.P.A.                                                                                                                                      1.0
                                                                                                                                                    ... 
 PARKER HANNIFIN MANUFACTURING S.R.L.                                        