<a href="https://colab.research.google.com/github/eccardoso/mvp-analise-de-dados-e-boas-praticas/blob/main/MVP_An%C3%A1lise_de_Dados.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Analysis and Best Practices MVP
**Edson da Costa Cardoso**

### **MVP**

What is not yet detailed in this notebook, or could be improved, to meet the expectations for the MVP:

- Text blocks that explain each step and each decision in your code, telling a complete and understandable story from beginning to end;
- Good coding practices;
- After each chart, one paragraph summarizing the main findings, analyzing the results, and raising any points of attention.

### **1. Problem Definition**

The dataset used in this project is the violence database for the state of Rio de Janeiro, covering January 2014 to February 2021, broken down by municipality and type of crime. Each row holds the monthly counts recorded for one municipality. The attributes identify the municipality (`fmun_cod`, `fmun`), the period (`ano`, `mes`, `mes_ano`), and the region (`regiao`), followed by one count per type of occurrence, such as `hom_doloso`, `latrocinio`, `ameaca`, and `registro_ocorrencias`. The data is loaded from the CSV file `BaseMunicipioMensal.csv` in this project's GitHub repository.

In [1]:
# Import the libraries used in this notebook
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms  # for handling missing values
from matplotlib import cm
from pandas import set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

### **2. Data Loading**

We will use the pandas package (Python Data Analysis Library) to load the data from a .csv file hosted on GitHub.

Once the dataset is loaded, we will explore it briefly.

In [13]:
# Load a CSV file from a URL using pandas

# URL of the dataset to import
url = "https://raw.githubusercontent.com/eccardoso/mvp-analise-de-dados-e-boas-praticas/main/BaseMunicipioMensal.csv"

# Read the file; its fields are separated by semicolons
dataset = pd.read_csv(url, delimiter=';')
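The `delimiter=';'` argument matters because `read_csv` splits on commas by default. The sketch below illustrates the difference on a small in-memory sample; the sample text is hypothetical, built from a few columns of the first rows shown in this notebook, and is not the real file.

```python
import io

import pandas as pd

# Hypothetical sample: a few columns from rows shown in this notebook,
# written with ';' as the field separator.
sample_csv = (
    "fmun_cod;fmun;ano;mes;regiao;hom_doloso\n"
    "3300100;Angra dos Reis;2014;1;Interior;11\n"
    "3300159;Aperibé;2014;1;Interior;0\n"
    "3300456;Belford Roxo;2014;1;Baixada Fluminense;29\n"
)

# With the default separator (','), each line is read as one single column.
wrong = pd.read_csv(io.StringIO(sample_csv))

# With sep=';' (delimiter is an alias), the fields become separate columns.
right = pd.read_csv(io.StringIO(sample_csv), sep=";")

print(wrong.shape)   # one column only
print(right.shape)   # six columns
print(right.dtypes)  # numeric fields are inferred as integers
```

If a load ever produces a DataFrame with a single column whose name contains semicolons, the separator is the first thing to check.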

In [14]:
dataset.head()

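| | fmun_cod | fmun | ano | mes | mes_ano | regiao | hom_doloso | lesao_corp_morte | latrocinio | cvli | ... | cmp | cmba | ameaca | pessoas_desaparecidas | encontro_cadaver | encontro_ossada | pol_militares_mortos_serv | pol_civis_mortos_serv | registro_ocorrencias | fase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3300100 | Angra dos Reis | 2014 | 1 | 2014m01 | Interior | 11 | 0 | 0 | 11 | ... | 8 | 0 | 98 | 13 | 3 | 0 | 0 | 0 | 561 | 3 |
| 1 | 3300159 | Aperibé | 2014 | 1 | 2014m01 | Interior | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 2 | 3300209 | Araruama | 2014 | 1 | 2014m01 | Interior | 2 | 0 | 0 | 2 | ... | 5 | 0 | 91 | 10 | 1 | 0 | 0 | 0 | 480 | 3 |
| 3 | 3300225 | Areal | 2014 | 1 | 2014m01 | Interior | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 4 | 3300233 | Armação dos Búzios | 2014 | 1 | 2014m01 | Interior | 2 | 0 | 0 | 2 | ... | 3 | 2 | 46 | 0 | 0 | 0 | 0 | 0 | 309 | 3 |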


### **3. Data Analysis**

**3.1. Descriptive Statistics**

We start by examining the dimensions of the dataset, its column information, and a few example rows.

In [15]:
# Show the dimensions of the dataset (rows, columns)
print(dataset.shape)

(7912, 60)
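As a sanity check on this shape, the row count can be compared with the period stated in section 1. The sketch below counts the months from Jan/2014 to Feb/2021 and divides the 7912 rows by that number; the reading that each municipality contributes exactly one row per month is an assumption, not something the notebook states.

```python
import pandas as pd

# Monthly periods from Jan/2014 to Feb/2021, inclusive (range given in section 1).
months = pd.period_range(start="2014-01", end="2021-02", freq="M")

n_rows = 7912          # row count reported by dataset.shape
n_months = len(months)

# Assumption: one row per municipality per month, so the division should be exact.
rows_per_month, remainder = divmod(n_rows, n_months)

print(n_months, rows_per_month, remainder)  # 86 92 0
```

The exact division (86 months, 92 rows per month) is consistent with a complete municipality-by-month panel. On the loaded data, `dataset.groupby(["ano", "mes"]).size()` would show whether every month really has the same number of rows.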


In [16]:
# Show column names, non-null counts, and dtypes
# (info() prints its own report and returns None, so no print() is needed)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7912 entries, 0 to 7911
Data columns (total 60 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   fmun_cod                    7912 non-null   int64 
 1   fmun                        7912 non-null   object
 2   ano                         7912 non-null   int64 
 3   mes                         7912 non-null   int64 
 4   mes_ano                     7912 non-null   object
 5   regiao                      7912 non-null   object
 6   hom_doloso                  7912 non-null   int64 
 7   lesao_corp_morte            7912 non-null   int64 
 8   latrocinio                  7912 non-null   int64 
 9   cvli                        7912 non-null   int64 
 10  hom_por_interv_policial     7912 non-null   int64 
 11  letalidade_violenta         7912 non-null   int64 
 12  tentat_hom                  7912 non-null   int64 
 13  lesao_corp_dolosa           7912 non-null   int6
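The `Non-Null Count` column above reports 7912 for every column listed, which suggests there are no explicit missing values in them. A compact way to check all 60 columns at once is sketched below; the sample DataFrame is hypothetical and includes one missing value added on purpose, purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the NaN is injected for illustration only and
# does not come from the real dataset.
sample = pd.DataFrame({
    "fmun": ["Angra dos Reis", "Aperibé", "Araruama"],
    "hom_doloso": [11, 0, 2],
    "ameaca": [98, np.nan, 91],
})

# Count missing values per column and keep only the columns that have any.
missing_per_column = sample.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

Note that `isna()` only detects explicit nulls. Rows such as Aperibé and Areal in January 2014, where every displayed count is 0, would pass this check, so whether those zeros are genuine counts may deserve a separate look.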

In [17]:
# Show the first 10 rows of the dataset
dataset.head(10)

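| | fmun_cod | fmun | ano | mes | mes_ano | regiao | hom_doloso | lesao_corp_morte | latrocinio | cvli | ... | cmp | cmba | ameaca | pessoas_desaparecidas | encontro_cadaver | encontro_ossada | pol_militares_mortos_serv | pol_civis_mortos_serv | registro_ocorrencias | fase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3300100 | Angra dos Reis | 2014 | 1 | 2014m01 | Interior | 11 | 0 | 0 | 11 | ... | 8 | 0 | 98 | 13 | 3 | 0 | 0 | 0 | 561 | 3 |
| 1 | 3300159 | Aperibé | 2014 | 1 | 2014m01 | Interior | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 2 | 3300209 | Araruama | 2014 | 1 | 2014m01 | Interior | 2 | 0 | 0 | 2 | ... | 5 | 0 | 91 | 10 | 1 | 0 | 0 | 0 | 480 | 3 |
| 3 | 3300225 | Areal | 2014 | 1 | 2014m01 | Interior | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 4 | 3300233 | Armação dos Búzios | 2014 | 1 | 2014m01 | Interior | 2 | 0 | 0 | 2 | ... | 3 | 2 | 46 | 0 | 0 | 0 | 0 | 0 | 309 | 3 |
| 5 | 3300258 | Arraial do Cabo | 2014 | 1 | 2014m01 | Interior | 0 | 0 | 0 | 0 | ... | 1 | 0 | 26 | 0 | 0 | 0 | 0 | 0 | 176 | 3 |
| 6 | 3300308 | Barra do Piraí | 2014 | 1 | 2014m01 | Interior | 1 | 0 | 0 | 1 | ... | 19 | 0 | 56 | 0 | 0 | 0 | 0 | 0 | 248 | 3 |
| 7 | 3300407 | Barra Mansa | 2014 | 1 | 2014m01 | Interior | 5 | 0 | 0 | 5 | ... | 25 | 0 | 54 | 2 | 1 | 0 | 0 | 0 | 430 | 3 |
| 8 | 3300456 | Belford Roxo | 2014 | 1 | 2014m01 | Baixada Fluminense | 29 | 0 | 0 | 29 | ... | 16 | 0 | 231 | 22 | 3 | 0 | 0 | 0 | 1367 | 3 |
| 9 | 3300506 | Bom Jardim | 2014 | 1 | 2014m01 | Interior | 0 | 0 | 0 | 0 | ... | 3 | 1 | 11 | 1 | 0 | 0 | 0 | 0 | 59 | 3 |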


In [18]:
# Show the last 10 rows of the dataset
dataset.tail(10)

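| | fmun_cod | fmun | ano | mes | mes_ano | regiao | hom_doloso | lesao_corp_morte | latrocinio | cvli | ... | cmp | cmba | ameaca | pessoas_desaparecidas | encontro_cadaver | encontro_ossada | pol_militares_mortos_serv | pol_civis_mortos_serv | registro_ocorrencias | fase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7902 | 3305604 | Silva Jardim | 2021 | 2 | 2021m02 | Interior | 0 | 0 | 0 | 0 | ... | 2 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 49 | 2 |
| 7903 | 3305703 | Sumidouro | 2021 | 2 | 2021m02 | Interior | 0 | 0 | 0 | 0 | ... | 1 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 39 | 2 |
| 7904 | 3305752 | Tanguá | 2021 | 2 | 2021m02 | Interior | 0 | 0 | 0 | 0 | ... | 0 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 50 | 2 |
| 7905 | 3305802 | Teresópolis | 2021 | 2 | 2021m02 | Interior | 1 | 0 | 0 | 1 | ... | 6 | 0 | 76 | 4 | 0 | 0 | 1 | 0 | 456 | 2 |
| 7906 | 3305901 | Trajano de Moraes | 2021 | 2 | 2021m02 | Interior | 0 | 0 | 0 | 0 | ... | 1 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 19 | 2 |
| 7907 | 3306008 | Três Rios | 2021 | 2 | 2021m02 | Interior | 0 | 0 | 0 | 0 | ... | 20 | 0 | 38 | 1 | 2 | 0 | 0 | 0 | 280 | 2 |
| 7908 | 3306107 | Valença | 2021 | 2 | 2021m02 | Interior | 0 | 0 | 0 | 0 | ... | 0 | 1 | 43 | 3 | 0 | 0 | 0 | 0 | 151 | 2 |
| 7909 | 3306156 | Varre-Sai | 2021 | 2 | 2021m02 | Interior | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 2 |
| 7910 | 3306206 | Vassouras | 2021 | 2 | 2021m02 | Interior | 1 | 0 | 0 | 1 | ... | 4 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 82 | 2 |
| 7911 | 3306305 | Volta Redonda | 2021 | 2 | 2021m02 | Interior | 4 | 0 | 0 | 4 | ... | 11 | 0 | 84 | 4 | 0 | 0 | 0 | 0 | 726 | 2 |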


In [19]:
# Statistical summary of the numeric columns (count, mean, standard deviation, minimum, quartiles, maximum)
dataset.describe()

| | fmun_cod | ano | mes | hom_doloso | lesao_corp_morte | latrocinio | cvli | hom_por_interv_policial | letalidade_violenta | tentat_hom | ... | cmp | cmba | ameaca | pessoas_desaparecidas | encontro_cadaver | encontro_ossada | pol_militares_mortos_serv | pol_civis_mortos_serv | registro_ocorrencias | fase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7912.0 | 7912.0 | 7912.0 | 7912.0 | 7912.0 | 7912.0 | 7912.0 | 7912.0 | 7912.0 | 7912.0 | ... | 7912.0 | 7912.0 | 7912.0 | 7912.0 | 7912.0 | 7912.0 | 7912.0 | 7912.0 | 7912.0 | 7912.0 |
| mean | 3303128.0 | 2017.093023 | 6.383721 | 4.124747 | 0.038675 | 0.147118 | 4.310541 | 1.032609 | 5.34315 | 5.213094 | ... | 15.094666 | 1.357305 | 56.610465 | 4.633342 | 0.377654 | 0.032356 | 0.01997 | 0.002149 | 681.379929 | 2.976744 |
| std | 1840.157 | 2.066629 | 3.494963 | 12.285731 | 0.256914 | 0.718919 | 12.98946 | 4.872507 | 17.263436 | 18.53153 | ... | 61.527923 | 9.745406 | 206.706432 | 20.645999 | 1.486448 | 0.258765 | 0.187744 | 0.04896 | 3225.305376 | 0.150724 |
| min | 3300100.0 | 2014.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 25% | 3301578.0 | 2015.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49.0 | 3.0 |
| 50% | 3303154.0 | 2017.0 | 6.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | 3.0 | 0.0 | 16.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 114.0 | 3.0 |
| 75% | 3304632.0 | 2019.0 | 9.0 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 4.0 | 4.0 | ... | 11.0 | 1.0 | 46.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 401.25 | 3.0 |
| max | 3306305.0 | 2021.0 | 12.0 | 146.0 | 5.0 | 17.0 | 165.0 | 92.0 | 217.0 | 271.0 | ... | 1221.0 | 248.0 | 3469.0 | 286.0 | 27.0 | 10.0 | 5.0 | 2.0 | 36489.0 | 3.0 |
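In the summary above, the mean sits well above the median for most count columns. For `registro_ocorrencias`, for example, the mean is about 681 against a median of 114 and a maximum of 36489, which points to a strongly right-skewed distribution driven by a few large municipalities. A minimal sketch of how to put mean and median side by side is shown below. It uses only the ten rows displayed by `dataset.head(10)`, so the numbers it prints describe that small sample and not the full dataset.

```python
import pandas as pd

# Values copied from the ten rows displayed by dataset.head(10).
sample = pd.DataFrame({
    "hom_doloso": [11, 0, 2, 0, 2, 0, 1, 5, 29, 0],
    "registro_ocorrencias": [561, 0, 480, 0, 309, 176, 248, 430, 1367, 59],
})

# One row per column, with mean, median, and max next to each other.
summary = sample.agg(["mean", "median", "max"]).T

# A ratio well above 1 suggests a long right tail.
summary["mean_over_median"] = summary["mean"] / summary["median"]

print(summary)
```

On the full dataset the same ratio would need care, since several columns have a median of 0 and the division is then undefined. Comparing the 75% quartile with the maximum is one alternative in those cases.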
