# Practical Project: Health Insurance Cost Analysis
Author: Dinorah de Farias Chagas, [linkedin](https://www.linkedin.com/in/dinorahfariasc/), [github](https://github.com/dinorahfariasc).

## Data
The dataset used in this project is the [Medical Insurance dataset](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/medical_insurance_dataset.csv), which has the following structure:

| Parameter | Description | Type |
|---|----|---|
| age | Age in years | int |
| gender | Male or Female | int (1 or 2) |
| bmi | Body mass index | float |
| no_of_children | Number of children | int |
| smoker | Whether the person smokes | int (0 or 1) |
| region | US region: NW, NE, SW, SE | int (1, 2, 3 or 4, respectively) |
| charges | Annual insurance cost in USD | float |
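
Because `region` is stored as an integer code, a readable label can help when inspecting rows or labelling plots. The snippet below is a minimal sketch of that decoding, using the code order given in the table; the `REGION_LABELS` dictionary, the `sample` frame, and the `region_label` column are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Code-to-label mapping taken from the table above (1..4 -> NW, NE, SW, SE).
REGION_LABELS = {1: "NW", 2: "NE", 3: "SW", 4: "SE"}

# Tiny hand-made frame for illustration only; not rows from the real dataset.
sample = pd.DataFrame({"region": [3, 4, 1, 2]})

# Add a readable column alongside the integer codes.
sample["region_label"] = sample["region"].map(REGION_LABELS)
print(sample)
```

The integer codes are kept as they are, since the regression models later in the project need numeric inputs.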

### The main goals of this project are:
* Perform an exploratory data analysis (EDA) and identify the attributes that most influence the cost of health insurance.
* Build linear regression models with one and with multiple variables to predict the cost.
* Use Ridge Regression to refine the performance of the linear regression models.

### Importing the libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

In [2]:
filepath = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/medical_insurance_dataset.csv'
df = pd.read_csv(filepath, header=None)

In [3]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,19,1,27.9,0,1,3,16884.924
1,18,2,33.77,1,0,4,1725.5523
2,28,2,33.0,3,0,4,4449.462
3,33,2,22.705,0,0,1,21984.47061
4,32,2,28.88,0,0,1,3866.8552


In [5]:
headers = ['age','gender','bmi','no_of_children','smoker','region','charger']
df.columns = headers

In [13]:
df.head(10)

Unnamed: 0,age,gender,bmi,no_of_children,smoker,region,charger
0,19,1,27.9,0,1,3,16884.924
1,18,2,33.77,1,0,4,1725.5523
2,28,2,33.0,3,0,4,4449.462
3,33,2,22.705,0,0,1,21984.47061
4,32,2,28.88,0,0,1,3866.8552
5,31,1,25.74,0,?,4,3756.6216
6,46,1,33.44,1,0,4,8240.5896
7,37,1,27.74,3,0,1,7281.5056
8,37,2,29.83,2,0,2,6406.4107
9,60,1,25.84,0,0,1,28923.13692


In [8]:
df.describe()

Unnamed: 0,gender,bmi,no_of_children,region,charger
count,2772.0,2772.0,2772.0,2772.0,2772.0
mean,1.507215,30.701349,1.101732,2.559885,13261.369959
std,0.500038,6.129449,1.214806,1.130761,12151.768945
min,1.0,15.96,0.0,1.0,1121.8739
25%,1.0,26.22,0.0,2.0,4687.797
50%,2.0,30.4475,1.0,3.0,9333.01435
75%,2.0,34.77,2.0,4.0,16577.7795
max,2.0,53.13,5.0,4.0,63770.42801


In [14]:
df.replace("?", np.nan, inplace=True)

In [15]:
df.head(10)

Unnamed: 0,age,gender,bmi,no_of_children,smoker,region,charger
0,19,1,27.9,0,1.0,3,16884.924
1,18,2,33.77,1,0.0,4,1725.5523
2,28,2,33.0,3,0.0,4,4449.462
3,33,2,22.705,0,0.0,1,21984.47061
4,32,2,28.88,0,0.0,1,3866.8552
5,31,1,25.74,0,,4,3756.6216
6,46,1,33.44,1,0.0,4,8240.5896
7,37,1,27.74,3,0.0,1,7281.5056
8,37,2,29.83,2,0.0,2,6406.4107
9,60,1,25.84,0,0.0,1,28923.13692


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2772 entries, 0 to 2771
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             2768 non-null   object 
 1   gender          2772 non-null   int64  
 2   bmi             2772 non-null   float64
 3   no_of_children  2772 non-null   int64  
 4   smoker          2765 non-null   object 
 5   region          2772 non-null   int64  
 6   charger         2772 non-null   float64
dtypes: float64(2), int64(3), object(2)
memory usage: 151.7+ KB


## Data preparation
The `age` and `smoker` columns are not stored as numeric values: `df.info()` reports both with dtype `object`.
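
To illustrate why a single placeholder is enough to make a column non-numeric, here is a minimal sketch on a hand-made sample (the `raw` text and the `toy` frame are made up for illustration and are not rows from the dataset). It also shows `pd.to_numeric` with `errors="coerce"`, which is one possible way to count unparseable entries and is not the approach used in the notebook.

```python
import io
import pandas as pd

# Hand-made sample mimicking the file layout (no header row, "?" for missing).
raw = io.StringIO("19,1\n?,0\n28,?\n")
toy = pd.read_csv(raw, header=None, names=["age", "smoker"])

# A single "?" makes pandas load the whole column as text, not as numbers.
print(toy.dtypes)

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = toy.apply(pd.to_numeric, errors="coerce")

# Number of entries per column that could not be read as numbers.
print(numeric.isna().sum())
```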

As seen in the steps above, the dataset has missing entries, marked with "?", which we replaced with NaN. Since missing values are rare (4 in `age` and 7 in `smoker`, out of 2772 rows), I decided to handle them as follows:

- For `age`, replace them with the overall mean of the column.
- For `smoker`, which is a categorical attribute, replace them with the most frequent value.

Note also that the `charger` column (the annual charges, named `charger` in the `headers` list defined earlier) has values with up to five decimal places; we will round them to two.
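
The three steps described above can be sketched on a small hand-made frame. This is only an illustration under assumptions: the `toy` frame and its values are made up, and `fillna` is used here to compute the fill values directly, whereas the notebook cells below type in the mean and the mode by hand.

```python
import numpy as np
import pandas as pd

# Hand-made frame for illustration; column names follow the notebook.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "smoker": [0.0, 0.0, np.nan],
    "charger": [16884.924, 1725.5523, 4449.462],
})

# Numeric column: fill gaps with the column mean (NaN is skipped by mean()).
toy["age"] = toy["age"].fillna(toy["age"].mean())

# Categorical column: fill gaps with the most frequent value.
# mode() returns a Series (there can be ties), so take its first entry.
toy["smoker"] = toy["smoker"].fillna(toy["smoker"].mode()[0])

# Keep two decimal places in the cost column.
toy["charger"] = toy["charger"].round(2)
print(toy)
```

Computing the fill values from the data avoids retyping them if the dataset changes, at the cost of filling `age` with a non-integer mean.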

In [18]:
df[['age','smoker']] = df[['age','smoker']].apply(pd.to_numeric) 

In [26]:
print(df[['age']].mean())
print(df[['smoker']].mode())

age    39.109827
dtype: float64
   smoker
0     0.0


In [28]:
df.describe()

Unnamed: 0,age,gender,bmi,no_of_children,smoker,region,charger
count,2768.0,2772.0,2772.0,2772.0,2765.0,2772.0,2772.0
mean,39.109827,1.507215,30.701349,1.101732,0.203978,2.559885,13261.369959
std,14.091633,0.500038,6.129449,1.214806,0.403026,1.130761,12151.768945
min,18.0,1.0,15.96,0.0,0.0,1.0,1121.8739
25%,26.0,1.0,26.22,0.0,0.0,2.0,4687.797
50%,39.0,2.0,30.4475,1.0,0.0,3.0,9333.01435
75%,51.0,2.0,34.77,2.0,0.0,4.0,16577.7795
max,64.0,2.0,53.13,5.0,1.0,4.0,63770.42801


In [30]:
# Fill missing values: age with the rounded mean (39), smoker with the mode (0)
df[['age']] = df[['age']].replace(np.nan, 39)
df[['smoker']] = df[['smoker']].replace(np.nan, 0)

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2772 entries, 0 to 2771
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             2772 non-null   float64
 1   gender          2772 non-null   int64  
 2   bmi             2772 non-null   float64
 3   no_of_children  2772 non-null   int64  
 4   smoker          2772 non-null   float64
 5   region          2772 non-null   int64  
 6   charger         2772 non-null   float64
dtypes: float64(4), int64(3)
memory usage: 151.7 KB


In [32]:
df[['charger']] = df[['charger']].round(2)
df.head()

Unnamed: 0,age,gender,bmi,no_of_children,smoker,region,charger
0,19.0,1,27.9,0,1.0,3,16884.92
1,18.0,2,33.77,1,0.0,4,1725.55
2,28.0,2,33.0,3,0.0,4,4449.46
3,33.0,2,22.705,0,0.0,1,21984.47
4,32.0,2,28.88,0,0.0,1,3866.86
