# Subject 04 - Regression Analysis in Python pt 04 b

## CRISP-DM Evaluation - Part II

### 331 Determine Next Steps
<p>Central theme: Determine Next Steps</p><ul><li>It is the last task of the CRISP-DM Evaluation phase</li><li>It uses the analyses from the previous tasks (Evaluate Results and Review Process) to decide what comes next</li></ul><p>Possible next steps:</p><ul><li>Finish the project and move on to the Deployment phase</li><li>Start a new CRISP-DM iteration</li><li>Return to a specific phase (e.g. Data Preparation) to improve it with new techniques</li><li>Start a new project based on the insights obtained</li></ul><p>Important factors for the decision:</p><ul><li>Deployment potential of the results and models</li><li>Estimated improvement potential versus the effort required</li><li>Recommended continuations and alternative improvements</li></ul><p>Outputs:</p><ul><li>List of possible actions</li><li>Decision on the next steps, based on the analysis of the factors above</li></ul>

### 332 Project Validation
<h1>Central theme: the Evaluation phase of the CRISP-DM process</h1><h2>Goal of the Evaluation phase</h2><ul><li>It is not meant to assess the model or its metrics (that is done in the Modeling phase)</li><li>Look at the project from a business perspective to judge whether it makes sense</li></ul><h2>Example with Kickstarter data</h2><ul><li>Model that classifies whether a project will succeed or fail</li><li>Variables used:<ul><li>actual amount pledged</li><li>goal amount</li><li>time live</li><li>etc.</li></ul></li><li>Metric (accuracy) initially 83% -&gt; looks good</li></ul><h2>Problem identified</h2><ul><li>The variables used will not be available in practice<ul><li>E.g. at the start of a project, the actual amount pledged does not exist yet</li></ul></li><li>The model only checks whether the amount pledged reached the goal in order to classify success<ul><li>This makes sense for historical data, but not for classifying new projects</li></ul></li></ul><h2>Solution</h2><ul><li>Assess whether the variables make sense for the practical application</li><li>Possibly return to earlier phases (Data Understanding, Data Preparation)</li><li>Redefine the target variable or the predictor variables</li></ul>
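The availability check described above can be sketched in a few lines. This is a minimal illustration, not the course's own code: the `known_at_launch` map is a hypothetical classification of the columns, and the toy rows only mimic the shape of the Kickstarter data.

```python
import pandas as pd

# Toy rows shaped like the Kickstarter data used in this lesson (illustrative values).
df_demo = pd.DataFrame({
    "usd_goal_real": [153395, 3000000, 500000],
    "time_range": [59, 60, 30],
    "usd_pledged_real": [0, 242100, 100],
    "backers": [0, 15, 1],
})

# Hypothetical availability map: is the value known on the day the campaign launches?
# This classification is an assumption and should be confirmed with the business side.
known_at_launch = {
    "usd_goal_real": True,
    "time_range": True,
    "usd_pledged_real": False,  # only exists after pledges come in
    "backers": False,           # only exists after people back the project
}

# Keep only the predictors that would exist when a new project has to be scored.
usable = [col for col in df_demo.columns if known_at_launch[col]]
X_demo = df_demo[usable]
print(list(X_demo.columns))
```

Writing the map down explicitly makes the business assumption reviewable, which is the point of the Evaluation phase.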

In [None]:
from google.colab import files
upload = files.upload()

Saving DataPrepFinal.csv to DataPrepFinal.csv


In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("DataPrepFinal.csv")

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,09/10/15,100000,11/08/15 12:12,0,failed,0,GB,0.0,0,USD 153395
1,1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,01/11/17,3000000,02/09/17 04:43,242100,failed,15,US,10000.0,242100,USD 3000000
2,2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,26/02/13,4500000,12/01/13 00:20,22000,failed,3,US,22000.0,22000,USD 4500000
3,3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,16/04/12,500000,17/03/12 03:24,100,failed,1,US,100.0,100,USD 500000
4,4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,29/08/15,1950000,04/07/15 08:35,128300,canceled,14,US,128300.0,128300,USD 1950000


In [None]:
df = df.drop(columns=['Unnamed: 0', 'name', 'category', 'goal', 'pledged', 'usd pledged'])

In [None]:
df.head()

Unnamed: 0,ID,main_category,currency,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real
0,1000002330,Publishing,GBP,09/10/15,11/08/15 12:12,failed,0,GB,0,USD 153395
1,1000003930,Film & Video,USD,01/11/17,02/09/17 04:43,failed,15,US,242100,USD 3000000
2,1000004038,Film & Video,USD,26/02/13,12/01/13 00:20,failed,3,US,22000,USD 4500000
3,1000007540,Music,USD,16/04/12,17/03/12 03:24,failed,1,US,100,USD 500000
4,1000011046,Film & Video,USD,29/08/15,04/07/15 08:35,canceled,14,US,128300,USD 1950000


In [None]:
df.dtypes

ID                   int64
main_category       object
currency            object
deadline            object
launched            object
state               object
backers              int64
country             object
usd_pledged_real     int64
usd_goal_real       object
dtype: object

In [None]:
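# A lambda is an anonymous function, defined inline for a one-off operation
# str.strip('USD ') would remove any of the characters U, S, D and space from both ends;
# removeprefix (Python 3.9+) drops only the literal "USD " prefix
df['usd_goal_real'] = df['usd_goal_real'].apply(lambda x: x.removeprefix('USD '))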

In [None]:
df.head()

Unnamed: 0,ID,main_category,currency,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real
0,1000002330,Publishing,GBP,09/10/15,11/08/15 12:12,failed,0,GB,0,153395
1,1000003930,Film & Video,USD,01/11/17,02/09/17 04:43,failed,15,US,242100,3000000
2,1000004038,Film & Video,USD,26/02/13,12/01/13 00:20,failed,3,US,22000,4500000
3,1000007540,Music,USD,16/04/12,17/03/12 03:24,failed,1,US,100,500000
4,1000011046,Film & Video,USD,29/08/15,04/07/15 08:35,canceled,14,US,128300,1950000
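One detail about `str.strip` is easy to miss: its argument is a set of characters, not a literal prefix. The short sketch below, which assumes Python 3.9+ for `str.removeprefix` and uses a made-up `label` string, shows where the two differ.

```python
raw = "USD 153395"

# Both calls give the same result here, because digits are not in the strip set.
print(raw.strip("USD "))         # '153395'
print(raw.removeprefix("USD "))  # '153395'

# Hypothetical value where the difference shows: every character is in the strip set.
label = "USD DUSD"
print(repr(label.strip("USD ")))         # ''
print(repr(label.removeprefix("USD ")))  # 'DUSD'
```

For the goal column the result is the same either way, since the values end in digits.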


In [None]:
# str.replace swaps each regex match for a new value; here it drops the trailing " HH:MM" time
df['launched'] = df['launched'].str.replace(r' \d\d:\d\d', '', regex=True)

In [None]:
df.head()

Unnamed: 0,ID,main_category,currency,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real
0,1000002330,Publishing,GBP,09/10/15,11/08/15,failed,0,GB,0,153395
1,1000003930,Film & Video,USD,01/11/17,02/09/17,failed,15,US,242100,3000000
2,1000004038,Film & Video,USD,26/02/13,12/01/13,failed,3,US,22000,4500000
3,1000007540,Music,USD,16/04/12,17/03/12,failed,1,US,100,500000
4,1000011046,Film & Video,USD,29/08/15,04/07/15,canceled,14,US,128300,1950000
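An alternative to removing the time with a regular expression is to parse the full timestamp and then reset the time to midnight. This is a sketch under the assumption that every value follows the `dd/mm/yy HH:MM` layout seen in the sample rows.

```python
import pandas as pd

# Two values copied from the sample rows above.
launched_raw = pd.Series(["11/08/15 12:12", "02/09/17 04:43"])

# Parse date and time together, then keep only the date part (time set to 00:00).
launched = pd.to_datetime(launched_raw, format="%d/%m/%y %H:%M").dt.normalize()
print(launched)
```

Parsing with an explicit format raises an error on malformed values, whereas a regex that fails to match leaves them unchanged without warning.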


In [None]:
df.dtypes

ID                   int64
main_category       object
currency            object
deadline            object
launched            object
state               object
backers              int64
country             object
usd_pledged_real     int64
usd_goal_real       object
dtype: object

In [None]:
df['usd_goal_real'] = df['usd_goal_real'].astype('int64')

In [None]:
df.dtypes

ID                   int64
main_category       object
currency            object
deadline            object
launched            object
state               object
backers              int64
country             object
usd_pledged_real     int64
usd_goal_real        int64
dtype: object

In [None]:
df['deadline'] = pd.to_datetime(df['deadline'], format='%d/%m/%y')
df['launched'] = pd.to_datetime(df['launched'], format='%d/%m/%y')

In [None]:
df.dtypes

ID                           int64
main_category               object
currency                    object
deadline            datetime64[ns]
launched            datetime64[ns]
state                       object
backers                      int64
country                     object
usd_pledged_real             int64
usd_goal_real                int64
dtype: object

In [None]:
df['time_range'] = (df['deadline']-df['launched']).dt.days

In [None]:
df.head()

Unnamed: 0,ID,main_category,currency,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real,time_range
0,1000002330,Publishing,GBP,2015-10-09,2015-08-11,failed,0,GB,0,153395,59
1,1000003930,Film & Video,USD,2017-11-01,2017-09-02,failed,15,US,242100,3000000,60
2,1000004038,Film & Video,USD,2013-02-26,2013-01-12,failed,3,US,22000,4500000,45
3,1000007540,Music,USD,2012-04-16,2012-03-17,failed,1,US,100,500000,30
4,1000011046,Film & Video,USD,2015-08-29,2015-07-04,canceled,14,US,128300,1950000,56
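The campaign duration can be reproduced on a couple of rows to confirm the day-first format. This minimal sketch uses a small `dates` frame (a name introduced here for illustration) with values taken from the table above.

```python
import pandas as pd

# Dates from rows 0 and 3 of the sample output, in day/month/two-digit-year order.
dates = pd.DataFrame({
    "deadline": ["09/10/15", "16/04/12"],
    "launched": ["11/08/15", "17/03/12"],
})

# An explicit format avoids month/day ambiguity (09/10/15 is 9 October, not 10 September).
for col in ["deadline", "launched"]:
    dates[col] = pd.to_datetime(dates[col], format="%d/%m/%y")

# Subtracting two datetime columns gives timedeltas; .dt.days extracts whole days.
dates["time_range"] = (dates["deadline"] - dates["launched"]).dt.days
print(dates)
```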


In [None]:
df.dtypes

ID                           int64
main_category               object
currency                    object
deadline            datetime64[ns]
launched            datetime64[ns]
state                       object
backers                      int64
country                     object
usd_pledged_real             int64
usd_goal_real                int64
time_range                   int64
dtype: object

In [None]:
from google.colab import files
upload = files.upload()

Saving campaign revisao.csv to campaign revisao (1).csv


In [None]:
df_c = pd.read_csv('campaign revisao.csv')

In [None]:
df_c.head()

Unnamed: 0,ID,Text Description,Video,Image,Infographic,Reviews,Risks
0,1000002330,1,1,1,1,0,1
1,1000003930,1,1,1,1,1,0
2,1000004038,1,0,1,1,0,1
3,1000007540,1,1,0,1,0,0
4,1000011046,1,0,1,0,0,0


In [None]:
df_c['Text Description'].unique()

array([1])

In [None]:
# Dropping the column (it only contains the value 1, so it carries no information)
df_c = df_c.drop(columns=['Text Description'])

In [None]:
df_c.head()

Unnamed: 0,ID,Video,Image,Infographic,Reviews,Risks
0,1000002330,1,1,1,0,1
1,1000003930,1,1,1,1,0
2,1000004038,0,1,1,0,1
3,1000007540,1,0,1,0,0
4,1000011046,0,1,0,0,0


In [None]:
df.head()

Unnamed: 0,ID,main_category,currency,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real,time_range
0,1000002330,Publishing,GBP,2015-10-09,2015-08-11,failed,0,GB,0,153395,59
1,1000003930,Film & Video,USD,2017-11-01,2017-09-02,failed,15,US,242100,3000000,60
2,1000004038,Film & Video,USD,2013-02-26,2013-01-12,failed,3,US,22000,4500000,45
3,1000007540,Music,USD,2012-04-16,2012-03-17,failed,1,US,100,500000,30
4,1000011046,Film & Video,USD,2015-08-29,2015-07-04,canceled,14,US,128300,1950000,56


In [None]:
df = df.merge(df_c, how='right', on=['ID'] )

In [None]:
df.head()

Unnamed: 0,ID,main_category,currency,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real,time_range,Video,Image,Infographic,Reviews,Risks
0,1000002330,Publishing,GBP,2015-10-09,2015-08-11,failed,0,GB,0,153395,59,1,1,1,0,1
1,1000003930,Film & Video,USD,2017-11-01,2017-09-02,failed,15,US,242100,3000000,60,1,1,1,1,0
2,1000004038,Film & Video,USD,2013-02-26,2013-01-12,failed,3,US,22000,4500000,45,0,1,1,0,1
3,1000007540,Music,USD,2012-04-16,2012-03-17,failed,1,US,100,500000,30,1,0,1,0,0
4,1000011046,Film & Video,USD,2015-08-29,2015-07-04,canceled,14,US,128300,1950000,56,0,1,0,0,0
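With `how='right'`, every row of the right-hand table is kept, even when the left-hand table has no matching `ID`. The sketch below uses two small hypothetical frames to show this. The `indicator` and `validate` arguments are optional pandas checks that the notebook does not use.

```python
import pandas as pd

# Hypothetical miniature versions of the project and campaign tables.
projects = pd.DataFrame({"ID": [1, 2, 3], "state": ["failed", "successful", "live"]})
campaign = pd.DataFrame({"ID": [2, 3, 4], "Video": [1, 0, 1]})

merged = projects.merge(
    campaign,
    how="right",            # keep all campaign rows
    on="ID",
    indicator=True,         # adds a _merge column: both / left_only / right_only
    validate="one_to_one",  # raises if an ID is duplicated on either side
)
print(merged)

# Rows that exist only in the campaign table have missing project columns.
print(merged[merged["_merge"] == "right_only"])
```

Counting the `right_only` rows after a join is a cheap way to see how many records would enter the model with missing values.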


In [None]:
from google.colab import files
upload = files.upload()

Saving invested.csv to invested (1).csv


In [None]:
df_invest = pd.read_csv('invested.csv')

In [None]:
df_invest.head()

Unnamed: 0,ID,backedLocation,age
0,1000003930,BR,18
1,1000004038,US,57
2,1000007540,US,43
3,1000011046,BR,29
4,1000014025,BR,77


In [None]:
df_invest.sort_values(by= ['ID'])

Unnamed: 0,ID,backedLocation,age
27400,106144,US,63
10585,106144,GBK,70
17409,1003381,BR,68
594,1003381,US,29
2925,1017454,US,62
...,...,...,...
16812,1098729640,GBK,61
33627,1098729640,GBK,43
16814,1098735707,BR,66
33629,1098735707,US,62


In [None]:
# Convert to boolean indicator (dummy) columns
df_invest = pd.get_dummies(df_invest, columns = ['backedLocation'])

In [None]:
df_invest.head()

Unnamed: 0,ID,age,backedLocation_BR,backedLocation_GBK,backedLocation_US
0,1000003930,18,True,False,False
1,1000004038,57,False,False,True
2,1000007540,43,False,False,True
3,1000011046,29,True,False,False
4,1000014025,77,True,False,False


In [None]:
df_invest.sort_values(by=['ID'])

Unnamed: 0,ID,age,backedLocation_BR,backedLocation_GBK,backedLocation_US
27400,106144,63,False,False,True
10585,106144,70,False,True,False
17409,1003381,68,True,False,False
594,1003381,29,False,False,True
2925,1017454,62,False,False,True
...,...,...,...,...,...
16812,1098729640,61,False,True,False
33627,1098729640,43,False,True,False
16814,1098735707,66,True,False,False
33629,1098735707,62,False,False,True


In [None]:
df_invest = df_invest.groupby(by=['ID']).agg({'age':'mean', 'backedLocation_BR':'sum','backedLocation_GBK':'sum','backedLocation_US':'sum'})

In [None]:
df_invest.head()

ID,age,backedLocation_BR,backedLocation_GBK,backedLocation_US
106144,66.5,0,1,1
1003381,48.5,1,0,1
1017454,41.5,0,0,2
1024013,77.0,2,0,0
1024208,71.0,0,1,1
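The aggregation from one row per backer to one row per project can be checked on the first four backer rows shown earlier. This sketch uses pandas named aggregation, a stylistic choice that differs from the dict form used in the notebook. The output column names are introduced here for illustration.

```python
import pandas as pd

# First four backer rows from the sorted output above.
backers = pd.DataFrame({
    "ID": [106144, 106144, 1003381, 1003381],
    "backedLocation": ["US", "GBK", "BR", "US"],
    "age": [63, 70, 68, 29],
})

# One boolean column per location.
dummies = pd.get_dummies(backers, columns=["backedLocation"])

# Collapse to one row per project: mean age, and a count of backers per location.
per_project = dummies.groupby("ID").agg(
    age_mean=("age", "mean"),
    backers_BR=("backedLocation_BR", "sum"),
    backers_GBK=("backedLocation_GBK", "sum"),
    backers_US=("backedLocation_US", "sum"),
)
print(per_project)
```

Summing boolean dummies counts the `True` values, which is why the location columns become counts of backers per project.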


In [None]:
# Performing the join
df = df.merge(df_invest, how='right', on ='ID')

In [None]:
df.head()

Unnamed: 0,ID,main_category,currency,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real,time_range,Video,Image,Infographic,Reviews,Risks
0,1000002330,Publishing,GBP,2015-10-09,2015-08-11,failed,0,GB,0,153395,59,1,1,1,0,1
1,1000003930,Film & Video,USD,2017-11-01,2017-09-02,failed,15,US,242100,3000000,60,1,1,1,1,0
2,1000004038,Film & Video,USD,2013-02-26,2013-01-12,failed,3,US,22000,4500000,45,0,1,1,0,1
3,1000007540,Music,USD,2012-04-16,2012-03-17,failed,1,US,100,500000,30,1,0,1,0,0
4,1000011046,Film & Video,USD,2015-08-29,2015-07-04,canceled,14,US,128300,1950000,56,0,1,0,0,0


In [None]:
df['state'].unique()

array(['failed', 'canceled', 'successful', 'live', 'undefined',
       'suspended'], dtype=object)

In [None]:
# Binary target: 1 for 'successful', 0 for every other state
df['state'] = df['state'].apply(lambda x: 1 if x == 'successful' else 0)
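Encoding a target by string comparison fails without any error when the label matches nothing: the column is then all zeros. A quick check of the class counts, sketched here on a hypothetical `state` series, catches that before modeling.

```python
import pandas as pd

# Hypothetical sample of campaign states.
state = pd.Series(["failed", "canceled", "successful", "live", "successful"])

target = (state == "successful").astype(int)
print(target.value_counts().to_dict())

# A label that matches no row (here a misspelling) produces a single-class target.
misspelled = (state == "sucessful").astype(int)
print(misspelled.value_counts().to_dict())

# Guard: a usable binary target must contain both classes.
assert target.nunique() == 2
```

A single-class target makes any classifier look perfect, so this check belongs right after the encoding step.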

In [None]:
# One-hot encode the categorical columns
df = pd.get_dummies(df, columns = ['main_category', 'currency','country' ])

## Modeling
<https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html>

<https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html>

In [None]:
y = df['state']
X = df.drop(columns = ['state', 'ID', 'deadline', 'launched'])

In [None]:
X.head()

Unnamed: 0,backers,usd_pledged_real,usd_goal_real,time_range,Video,Image,Infographic,Reviews,Risks,main_category_Art,...,country_JP,country_LU,country_MX,"country_N0""",country_NL,country_NO,country_NZ,country_SE,country_SG,country_US
0,0,0,153395,59,1,1,1,0,1,False,...,False,False,False,False,False,False,False,False,False,False
1,15,242100,3000000,60,1,1,1,1,0,False,...,False,False,False,False,False,False,False,False,False,True
2,3,22000,4500000,45,0,1,1,0,1,False,...,False,False,False,False,False,False,False,False,False,True
3,1,100,500000,30,1,0,1,0,0,False,...,False,False,False,False,False,False,False,False,False,True
4,14,128300,1950000,56,0,1,0,0,0,False,...,False,False,False,False,False,False,False,False,False,True


In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
classificacao = DecisionTreeClassifier(random_state=0)

In [None]:
modeloClassif= classificacao.fit(X_train, y_train)

In [None]:
modeloClassif.score(X_test, y_test)

1.0
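A perfect or near-perfect score is a reason to look for leakage rather than to celebrate. The sketch below uses synthetic data, not the Kickstarter file, in which success is defined as the pledged amount reaching the goal. It compares a tree that sees the pledged amount with one that does not. All values and the `holdout_accuracy` helper are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic campaigns: success is fully determined by pledged >= goal.
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "usd_goal_real": rng.uniform(1_000, 100_000, n),
    "usd_pledged_real": rng.uniform(0, 100_000, n),
    "time_range": rng.integers(10, 61, n),
})
demo["state"] = (demo["usd_pledged_real"] >= demo["usd_goal_real"]).astype(int)


def holdout_accuracy(features):
    """Train a decision tree on the given columns and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[features], demo["state"], test_size=0.2, random_state=42
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    return model.score(X_te, y_te)


# With the pledged amount, the tree can reconstruct the target rule.
leaky = holdout_accuracy(["usd_goal_real", "usd_pledged_real", "time_range"])
# Without it, only information known at launch is left.
honest = holdout_accuracy(["usd_goal_real", "time_range"])
print(f"with pledged amount: {leaky:.3f}, without: {honest:.3f}")
```

The gap between the two scores is what the Evaluation phase is meant to surface: the first model is accurate on historical data but cannot be used on a campaign that has not started yet.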

## Evaluation

### 333 Process Review
<h2>Central theme:</h2><p>Evaluating the results (Evaluation) of a data science project following CRISP-DM</p><h3>Topics covered:</h3><ul><li>Evaluating the results in practice</li><li>Reviewing the process that was carried out<ul><li>Documentation</li><li>Points of attention and risks</li><li>Optimization of the steps</li><li>Decision on the next steps</li></ul></li><li>Example from an older project<ul><li>Neighborhood recommendation in Porto Alegre</li><li>Data collection</li><li>Data preparation</li><li>Modeling</li><li>Results</li><li>Process review<ul><li>Strengths</li><li>Points for improvement</li></ul></li></ul></li><li>Defining the next steps<ul><li>Improvements to the project</li><li>Proceed to deployment</li><li>Create parallel projects</li></ul></li></ul><p>The video shows a practical example of evaluating a data science project, reviewing both the results obtained and the process carried out. Strengths and opportunities for improvement are identified, and the next steps are then defined: improving the current project, putting it into production, or creating new projects derived from the lessons learned.</p>


### 334 Module Review
<h1>Evaluating Results in CRISP-DM</h1><h2>Topics covered</h2><h3>Nature of the Evaluation phase</h3><ul><li>Less constructive phase (it involves no code)</li><li>More introspective and interpretive</li><li>Similar to a &quot;retrospective&quot; (Scrum)</li></ul><h3>Difference between Modeling and Evaluation</h3><ul><li>Modeling: evaluation of the models</li><li>Evaluation: overall evaluation of the project</li></ul><h3>Tasks</h3><h4>Evaluate Results</h4><ul><li>Evaluate the results from a business perspective</li><li>Analyze the model outputs and the insights obtained</li><li>List possible next projects</li></ul><h4>Review Process</h4><ul><li>Review the entire project process</li><li>List improvements and failures to be addressed</li></ul><h4>Determine Next Steps</h4><ul><li>Decide the next steps based on the analyses</li><li>Go to production, return to earlier phases, or create a new project</li></ul><h3>Outputs</h3><ul><li>Assessment of Data Mining Results</li><li>Approved Models (those that met the Business Success Criteria)</li><li>Reviewed Process</li><li>List of Actions</li></ul>