<h1>Facebook Live Sellers in Thailand Data Set</h1>
<p style="text-align: right"><q><strong>Author:</strong> Danilo R. Santos <strong>E-mail: </strong><a href="">danilo_santosrs@hotmail.com</a></q></p>
<p>This project uses the file "live.csv", which contains the data set analysed below.</p>

<p><strong>Characteristics of the <i>dataset</i>:</strong></p>
<blockquote>
    <p>Facebook pages of 10 Thai fashion and cosmetics retail sellers. Posts of a different nature (video, photos, statuses, and links). Engagement metrics consist of comments, shares, and reactions.</p>
    <p>The variability of consumer engagement is analysed through a Principal Component Analysis, highlighting the changes induced by the use of Facebook Live. The seasonal component is analysed through a study of the averages of the different engagement metrics for different time-frames (hourly, daily and monthly). Finally, we identify statistical outlier posts, that are qualitatively analyzed further, in terms of their selling approach and activities.</p>
    <p>Data source: <a href="https://archive.ics.uci.edu/ml/datasets/Facebook+Live+Sellers+in+Thailand" target="_blank">https://archive.ics.uci.edu/ml/datasets/Facebook+Live+Sellers+in+Thailand</a></p>
    <p>Creators:</p>
    <ul>
        <li>Nassim Dehouche</li>
    </ul>
</blockquote>

<p>Use the links below to navigate the parts of this project:</p>
<ol>
    <li><a href="#topico1">Starting the project</a></li>
    <li><a href="#topico2">Load the data and do a brief cleanup</a></li>
    <li><a href="#topico3">Transform the data</a></li>
    <li><a href="#topico4">Scale the data</a></li>
    <li><a href="#topico5">Run the OPTICS algorithm</a></li>
</ol>

<h2><a name="topico1">1. Starting the project</a></h2>
<p>The cell below imports the libraries used in this project.</p>

In [41]:
import pandas as pd

import numpy as np

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, RobustScaler
from sklearn.cluster import OPTICS, cluster_optics_dbscan

import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt

import seaborn as sns

<h2><a name="topico2">2. Load the data and do a brief cleanup</a></h2>
<p>The cells below are described in the following order:</p>
<ul>
    <li>Loads the data into a Pandas <i>dataframe</i> and displays its first rows</li>
    <li>Checks the dimensions of the <i>dataframe</i></li>
    <li>Drops the unnecessary columns</li>
    <li>Counts the duplicated records</li>
    <li>Removes the duplicated records</li>
    <li>Checks whether there are null values</li>
    <li>Inspects some information about the columns of the <i>dataframe</i></li>
    <li>Selects only the columns that will be used in the <a href="https://pt.wikipedia.org/wiki/Aprendizado_de_m%C3%A1quina" target="_blank">ML</a> process and stores them in <strong>X</strong></li>
</ul>
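<p>The cleaning steps listed above can be sketched on a tiny, made-up frame. The name <i>toy</i> and its values are illustrative assumptions and are not taken from "live.csv"; this is a minimal sketch of the same Pandas calls, not the notebook's actual run.</p>

```python
import pandas as pd

# Hypothetical stand-in for live.csv; all values are invented for illustration.
toy = pd.DataFrame({
    "status_id": ["a", "b", "b", "c"],
    "status_type": ["video", "photo", "photo", "link"],
    "num_reactions": [10, 5, 5, 7],
    "Column1": [None, None, None, None],  # empty trailing column
})

# Drop the column that carries no information.
toy = toy.drop(columns=["Column1"])

# Count, then remove, rows that are exact duplicates of an earlier row.
n_duplicates = int(toy.duplicated().sum())
toy = toy.drop_duplicates()

# True if any remaining cell is missing.
has_nulls = bool(toy.isnull().any().any())

# Keep only the modelling columns; .copy() makes X_toy independent of toy.
X_toy = toy[["status_type", "num_reactions"]].copy()

print(n_duplicates, has_nulls, X_toy.shape)
```

<p>Taking an explicit copy at the last step means that columns added to the selection later do not trigger Pandas warnings about writing to a slice of another frame.</p>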

In [2]:
df_live = pd.read_csv('live.csv')
df_live.head()

Unnamed: 0,status_id,status_type,status_published,num_reactions,num_comments,num_shares,num_likes,num_loves,num_wows,num_hahas,num_sads,num_angrys,Column1,Column2,Column3,Column4
0,246675545449582_1649696485147474,video,4/22/2018 6:00,529,512,262,432,92,3,1,1,0,,,,
1,246675545449582_1649426988507757,photo,4/21/2018 22:45,150,0,0,150,0,0,0,0,0,,,,
2,246675545449582_1648730588577397,video,4/21/2018 6:17,227,236,57,204,21,1,1,0,0,,,,
3,246675545449582_1648576705259452,photo,4/21/2018 2:29,111,0,0,111,0,0,0,0,0,,,,
4,246675545449582_1645700502213739,photo,4/18/2018 3:22,213,0,0,204,9,0,0,0,0,,,,


In [3]:
df_live.shape

(7050, 16)

In [4]:
# Drop the four empty trailing columns in a single call
df_live.drop(columns=['Column1', 'Column2', 'Column3', 'Column4'], inplace=True)
df_live.head()

Unnamed: 0,status_id,status_type,status_published,num_reactions,num_comments,num_shares,num_likes,num_loves,num_wows,num_hahas,num_sads,num_angrys
0,246675545449582_1649696485147474,video,4/22/2018 6:00,529,512,262,432,92,3,1,1,0
1,246675545449582_1649426988507757,photo,4/21/2018 22:45,150,0,0,150,0,0,0,0,0
2,246675545449582_1648730588577397,video,4/21/2018 6:17,227,236,57,204,21,1,1,0,0
3,246675545449582_1648576705259452,photo,4/21/2018 2:29,111,0,0,111,0,0,0,0,0
4,246675545449582_1645700502213739,photo,4/18/2018 3:22,213,0,0,204,9,0,0,0,0


In [5]:
df_live.duplicated().sum()

51

In [6]:
df_live.drop_duplicates(inplace=True)

In [7]:
df_live.isnull().sum().any()

False

In [8]:
df_live.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6999 entries, 0 to 7049
Data columns (total 12 columns):
status_id           6999 non-null object
status_type         6999 non-null object
status_published    6999 non-null object
num_reactions       6999 non-null int64
num_comments        6999 non-null int64
num_shares          6999 non-null int64
num_likes           6999 non-null int64
num_loves           6999 non-null int64
num_wows            6999 non-null int64
num_hahas           6999 non-null int64
num_sads            6999 non-null int64
num_angrys          6999 non-null int64
dtypes: int64(9), object(3)
memory usage: 710.8+ KB


In [19]:
# .copy() makes X an independent frame, so new columns can be added to it safely later
X = df_live[['status_type', 'num_reactions', 'num_comments', 'num_shares', 'num_likes',
             'num_loves', 'num_wows', 'num_hahas', 'num_sads', 'num_angrys']].copy()
X.head()

Unnamed: 0,status_type,num_reactions,num_comments,num_shares,num_likes,num_loves,num_wows,num_hahas,num_sads,num_angrys
0,video,529,512,262,432,92,3,1,1,0
1,photo,150,0,0,150,0,0,0,0,0
2,video,227,236,57,204,21,1,1,0,0
3,photo,111,0,0,111,0,0,0,0,0
4,photo,213,0,0,204,9,0,0,0,0


<h2><a name="topico3">3. Transform the data</a></h2>
<p>The cells below are described in the following order:</p>
<ul>
    <li>Encodes the <i>status_type</i> column using <i>LabelEncoder</i> and adds a new column called <i>status_type_coded</i> to the <i>dataframe</i> <strong>X</strong></li>
    <li>Displays the first rows of the <i>dataframe</i></li>
    <li>Displays the classes encoded by <i>LabelEncoder</i></li>
    <li>Encodes the <i>status_type</i> column using <i>OneHotEncoder</i></li>
    <li>Adds a new column called <i>status_type_link</i> to the <i>dataframe</i> <strong>X</strong></li>
    <li>Displays the first rows of the <i>dataframe</i></li>
    <li>Adds a new column called <i>status_type_photo</i> to the <i>dataframe</i> <strong>X</strong></li>
    <li>Displays the first rows of the <i>dataframe</i></li>
    <li>Adds a new column called <i>status_type_status</i> to the <i>dataframe</i> <strong>X</strong></li>
    <li>Displays the first rows of the <i>dataframe</i></li>
    <li>Adds a new column called <i>status_type_video</i> to the <i>dataframe</i> <strong>X</strong></li>
    <li>Displays the first rows of the <i>dataframe</i></li>
</ul>
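<p>To make the two encodings above concrete, here is a small sketch on a made-up array of status types. The variable names (<i>status</i>, <i>codes</i>, <i>onehot</i>) are assumptions for illustration only. It shows that <i>LabelEncoder</i> assigns integer codes following the sorted class names, and that the one-hot columns follow the same order, which can be read from <i>categories_</i>.</p>

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up sample of the four post types found in the data set.
status = np.array(["video", "photo", "link", "status", "photo"])

# LabelEncoder sorts the classes, so link=0, photo=1, status=2, video=3.
le = LabelEncoder()
codes = le.fit_transform(status)
print(le.classes_, codes)

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature.
ohe = OneHotEncoder(categories="auto")
onehot = ohe.fit_transform(status.reshape(-1, 1)).toarray()

# categories_[0] lists the classes in the same order as the output columns.
print(ohe.categories_[0])
print(onehot)
```

<p>Reading the column order from <i>categories_</i> avoids relying on hard-coded positions when the one-hot columns are later given names.</p>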

In [23]:
le_status_type = LabelEncoder()
# fit_transform learns the classes and returns one integer code per row
X['status_type_coded'] = le_status_type.fit_transform(X['status_type'])

In [24]:
X.head()

Unnamed: 0,status_type,num_reactions,num_comments,num_shares,num_likes,num_loves,num_wows,num_hahas,num_sads,num_angrys,status_type_coded
0,video,529,512,262,432,92,3,1,1,0,3
1,photo,150,0,0,150,0,0,0,0,0,1
2,video,227,236,57,204,21,1,1,0,0,3
3,photo,111,0,0,111,0,0,0,0,0,1
4,photo,213,0,0,204,9,0,0,0,0,1


In [25]:
le_status_type.classes_

array(['link', 'photo', 'status', 'video'], dtype=object)

In [26]:
ohe = OneHotEncoder(categories='auto')
# Encode only status_type; passing all of X would also one-hot encode every numeric column
X_ohe = ohe.fit_transform(X[['status_type']]).toarray()
print(X_ohe)

[[0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 ...
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]]


In [27]:
X['status_type_link'] = X_ohe[:, 0]

In [28]:
X.head()

Unnamed: 0,status_type,num_reactions,num_comments,num_shares,num_likes,num_loves,num_wows,num_hahas,num_sads,num_angrys,status_type_coded,status_type_link
0,video,529,512,262,432,92,3,1,1,0,3,0.0
1,photo,150,0,0,150,0,0,0,0,0,1,0.0
2,video,227,236,57,204,21,1,1,0,0,3,0.0
3,photo,111,0,0,111,0,0,0,0,0,1,0.0
4,photo,213,0,0,204,9,0,0,0,0,1,0.0


In [29]:
X['status_type_photo'] = X_ohe[:, 1]

In [30]:
X.head()

Unnamed: 0,status_type,num_reactions,num_comments,num_shares,num_likes,num_loves,num_wows,num_hahas,num_sads,num_angrys,status_type_coded,status_type_link,status_type_photo
0,video,529,512,262,432,92,3,1,1,0,3,0.0,0.0
1,photo,150,0,0,150,0,0,0,0,0,1,0.0,1.0
2,video,227,236,57,204,21,1,1,0,0,3,0.0,0.0
3,photo,111,0,0,111,0,0,0,0,0,1,0.0,1.0
4,photo,213,0,0,204,9,0,0,0,0,1,0.0,1.0


In [31]:
X['status_type_status'] = X_ohe[:, 2]

In [32]:
X.head()

Unnamed: 0,status_type,num_reactions,num_comments,num_shares,num_likes,num_loves,num_wows,num_hahas,num_sads,num_angrys,status_type_coded,status_type_link,status_type_photo,status_type_status
0,video,529,512,262,432,92,3,1,1,0,3,0.0,0.0,0.0
1,photo,150,0,0,150,0,0,0,0,0,1,0.0,1.0,0.0
2,video,227,236,57,204,21,1,1,0,0,3,0.0,0.0,0.0
3,photo,111,0,0,111,0,0,0,0,0,1,0.0,1.0,0.0
4,photo,213,0,0,204,9,0,0,0,0,1,0.0,1.0,0.0


In [33]:
X['status_type_video'] = X_ohe[:, 3]

In [34]:
X.head()

Unnamed: 0,status_type,num_reactions,num_comments,num_shares,num_likes,num_loves,num_wows,num_hahas,num_sads,num_angrys,status_type_coded,status_type_link,status_type_photo,status_type_status,status_type_video
0,video,529,512,262,432,92,3,1,1,0,3,0.0,0.0,0.0,1.0
1,photo,150,0,0,150,0,0,0,0,0,1,0.0,1.0,0.0,0.0
2,video,227,236,57,204,21,1,1,0,0,3,0.0,0.0,0.0,1.0
3,photo,111,0,0,111,0,0,0,0,0,1,0.0,1.0,0.0,0.0
4,photo,213,0,0,204,9,0,0,0,0,1,0.0,1.0,0.0,0.0
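<p>As an aside, the four indicator columns built step by step above could also be produced in one call with <i>pd.get_dummies</i>. This is a different technique from the <i>OneHotEncoder</i> approach used in this notebook, shown only as a hedged alternative on a made-up frame named <i>toy</i>.</p>

```python
import pandas as pd

# Hypothetical frame with one row per post type (illustrative values only).
toy = pd.DataFrame({"status_type": ["video", "photo", "link", "status"]})

# One indicator column per category, named status_type_<category>.
dummies = pd.get_dummies(toy["status_type"], prefix="status_type", dtype=float)

# Attach the indicator columns next to the original column.
toy = toy.join(dummies)

print(list(dummies.columns))
print(toy)
```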


<h2><a name="topico4">4. Scale the data</a></h2>
<p>The cells below are described in the following order:</p>
<ul>
    <li>Creates a copy of the <i>dataframe</i> <strong>X</strong> called <strong>X_work</strong> containing only the columns that will be used in the <a href="https://pt.wikipedia.org/wiki/Aprendizado_de_m%C3%A1quina" target="_blank">ML</a> process</li>
    <li>Displays the first rows of the <i>dataframe</i></li>
    <li>Scales the data using <i>RobustScaler</i> and stores the result in <i>data_scaled</i></li>
    <li>Converts the array returned by the scaler into a Pandas <i>DataFrame</i> and then displays its first rows</li>
</ul>
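<p>As a rough illustration of what <i>RobustScaler</i> does with <i>quantile_range=(30.0, 70.0)</i>, the sketch below compares it with a manual computation on a made-up column: each value has the median subtracted and is then divided by the distance between the 30th and 70th percentiles. The array <i>x</i> is an assumption for illustration only.</p>

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up single feature with one large outlier.
x = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])

rs = RobustScaler(quantile_range=(30.0, 70.0))
scaled = rs.fit_transform(x)

# Manual equivalent: centre on the median, scale by the 30th-70th percentile range.
median = np.median(x)
q30, q70 = np.percentile(x, [30.0, 70.0])
manual = (x - median) / (q70 - q30)

print(rs.center_, rs.scale_)
print(np.allclose(scaled, manual))
```

<p>Because the centre and scale come from quantiles rather than the mean and standard deviation, a single extreme value barely changes them. For a 0/1 indicator column whose median is 1, the same formula maps 1 to 0 and 0 to a negative value.</p>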

In [35]:
X_work = X[['num_reactions', 'num_comments', 'num_shares', 'num_likes', 'num_loves', 'num_wows', 'num_hahas', 
              'num_sads', 'num_angrys', 'status_type_link', 'status_type_photo', 'status_type_status', 'status_type_video']]

In [36]:
X_work.head()

Unnamed: 0,num_reactions,num_comments,num_shares,num_likes,num_loves,num_wows,num_hahas,num_sads,num_angrys,status_type_link,status_type_photo,status_type_status,status_type_video
0,529,512,262,432,92,3,1,1,0,0.0,0.0,0.0,1.0
1,150,0,0,150,0,0,0,0,0,0.0,1.0,0.0,0.0
2,227,236,57,204,21,1,1,0,0,0.0,0.0,0.0,1.0
3,111,0,0,111,0,0,0,0,0,0.0,1.0,0.0,0.0
4,213,0,0,204,9,0,0,0,0,0.0,1.0,0.0,0.0


In [37]:
rs = RobustScaler(copy=True, quantile_range=(30.0, 70.0), with_centering=True, with_scaling=True)
data_scaled = rs.fit_transform(X_work)

In [46]:
X_scaled = pd.DataFrame(data_scaled, columns=X_work.columns, dtype=float)
X_scaled.head()

Unnamed: 0,num_reactions,num_comments,num_shares,num_likes,num_loves,num_wows,num_hahas,num_sads,num_angrys,status_type_link,status_type_photo,status_type_status,status_type_video
0,3.098684,36.285714,131.0,3.0,46.0,3.0,1.0,1.0,0.0,0.0,-1.0,0.0,1.0
1,0.605263,-0.285714,0.0,0.744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.111842,16.571429,28.5,1.176,10.5,1.0,1.0,0.0,0.0,0.0,-1.0,0.0,1.0
3,0.348684,-0.285714,0.0,0.432,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.019737,-0.285714,0.0,1.176,4.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [48]:
X_scaled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6999 entries, 0 to 6998
Data columns (total 13 columns):
num_reactions         6999 non-null float64
num_comments          6999 non-null float64
num_shares            6999 non-null float64
num_likes             6999 non-null float64
num_loves             6999 non-null float64
num_wows              6999 non-null float64
num_hahas             6999 non-null float64
num_sads              6999 non-null float64
num_angrys            6999 non-null float64
status_type_link      6999 non-null float64
status_type_photo     6999 non-null float64
status_type_status    6999 non-null float64
status_type_video     6999 non-null float64
dtypes: float64(13)
memory usage: 710.9 KB


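<h2><a name="topico5">5. Run the OPTICS algorithm</a></h2>
<p>The cells below are described in the following order:</p>
<ul>
    <li>Configures the OPTICS model</li>
    <li>Fits the model to <strong>X_scaled</strong></li>
    <li>Extracts DBSCAN-style labels at epsilon cuts of 0.5 and 2.0, then draws the reachability plot and the resulting clusterings</li>
</ul>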
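<p>Before running OPTICS on the real data, a small sketch on synthetic points may help show how one fitted model can be cut at a chosen epsilon with <i>cluster_optics_dbscan</i>. The two blobs, the names <i>points</i> and <i>labels_eps1</i>, and the values <i>min_samples=5</i> and <i>eps=1.0</i> are assumptions chosen for illustration; they are not the settings used in this notebook.</p>

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Two tight, well-separated synthetic blobs (30 points each).
rng = np.random.RandomState(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
blob_b = rng.normal(loc=10.0, scale=0.1, size=(30, 2))
points = np.vstack([blob_a, blob_b])

# Fit once; OPTICS stores the ordering, reachability and core distances.
optics = OPTICS(min_samples=5).fit(points)

# Cut the reachability profile at eps=1.0 to get a DBSCAN-like labeling.
labels_eps1 = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=1.0,
)

# Label -1 marks noise, so it is excluded from the cluster count.
n_clusters = len(set(labels_eps1) - {-1})
print(n_clusters)
```

<p>The same fitted model can be cut again at another <i>eps</i> without refitting, which is why several epsilon values can be compared after a single fit.</p>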

In [57]:
mdl_cluster = OPTICS(min_samples = 50,
                     metric = 'l2',
                     metric_params = None,
                     cluster_method = 'dbscan',
                     predecessor_correction = True,
                     min_cluster_size = None,
                     algorithm = 'auto',
                     n_jobs = -1)

In [58]:
mdl_cluster.fit(X_scaled)

OPTICS(algorithm='auto', cluster_method='dbscan', eps=None, leaf_size=30,
       max_eps=inf, metric='l2', metric_params=None, min_cluster_size=None,
       min_samples=50, n_jobs=-1, p=2, predecessor_correction=True, xi=0.05)

In [59]:
# Extract DBSCAN-style labelings from the fitted model at two epsilon cuts
labels_050 = cluster_optics_dbscan(reachability=mdl_cluster.reachability_,
                                   core_distances=mdl_cluster.core_distances_,
                                   ordering=mdl_cluster.ordering_, eps=0.5)
labels_200 = cluster_optics_dbscan(reachability=mdl_cluster.reachability_,
                                   core_distances=mdl_cluster.core_distances_,
                                   ordering=mdl_cluster.ordering_, eps=2)

# NumPy view of the scaled data for positional indexing;
# the scatter panels plot the first two columns (num_reactions, num_comments)
X_arr = X_scaled.values

space = np.arange(len(X_scaled))
reachability = mdl_cluster.reachability_[mdl_cluster.ordering_]
labels = mdl_cluster.labels_[mdl_cluster.ordering_]

plt.figure(figsize=(10, 7))
G = gridspec.GridSpec(2, 3)
ax1 = plt.subplot(G[0, :])
ax2 = plt.subplot(G[1, 0])
ax3 = plt.subplot(G[1, 1])
ax4 = plt.subplot(G[1, 2])

# Reachability plot
colors = ['g.', 'r.', 'b.', 'y.', 'c.']
for klass, color in zip(range(0, 5), colors):
    Xk = space[labels == klass]
    Rk = reachability[labels == klass]
    ax1.plot(Xk, Rk, color, alpha=0.3)
ax1.plot(space[labels == -1], reachability[labels == -1], 'k.', alpha=0.3)
ax1.plot(space, np.full_like(space, 2., dtype=float), 'k-', alpha=0.5)
ax1.plot(space, np.full_like(space, 0.5, dtype=float), 'k-.', alpha=0.5)
ax1.set_ylabel('Reachability (epsilon distance)')
ax1.set_title('Reachability Plot')

# OPTICS
colors = ['g.', 'r.', 'b.', 'y.', 'c.']
for klass, color in zip(range(0, 5), colors):
    Xk = X_arr[mdl_cluster.labels_ == klass]
    ax2.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
ax2.plot(X_arr[mdl_cluster.labels_ == -1, 0], X_arr[mdl_cluster.labels_ == -1, 1], 'k+', alpha=0.1)
ax2.set_title('Automatic Clustering\nOPTICS')

# DBSCAN at 0.5
colors = ['g', 'greenyellow', 'olive', 'r', 'b', 'c']
for klass, color in zip(range(0, 6), colors):
    Xk = X_arr[labels_050 == klass]
    ax3.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3, marker='.')
ax3.plot(X_arr[labels_050 == -1, 0], X_arr[labels_050 == -1, 1], 'k+', alpha=0.1)
ax3.set_title('Clustering at 0.5 epsilon cut\nDBSCAN')

# DBSCAN at 2.
colors = ['g.', 'm.', 'y.', 'c.']
for klass, color in zip(range(0, 4), colors):
    Xk = X_arr[labels_200 == klass]
    ax4.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
ax4.plot(X_arr[labels_200 == -1, 0], X_arr[labels_200 == -1, 1], 'k+', alpha=0.1)
ax4.set_title('Clustering at 2.0 epsilon cut\nDBSCAN')

plt.tight_layout()
plt.show()