# Introduction to Scikit-Learn (sklearn)

This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.

What we're going to cover:
0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm for our problems
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together!

## 0. An end-to-end Scikit-Learn workflow

In [172]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [173]:
# Create X (features matrix)
X = heart_disease.drop("target", axis = 1)  # X is everything except target

#Create y(labels)
y = heart_disease["target"]


In [174]:
# 2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 100)

#We'll keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [175]:
#3 Fit the model to the training data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.2)


In [176]:
# Fit on the training labels: passing y_test here raises
# "ValueError: Found input variables with inconsistent numbers of samples: [242, 61]"
clf.fit(X_train, y_train);

In [None]:
X_train

In [None]:
y_preds = clf.predict(X_test)
y_preds

In [None]:
y_test

In [None]:
#4. Evaluate the model on the training data and test data
clf.score(X_train,y_train)


In [None]:
clf.score(X_test,y_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix,accuracy_score
print(classification_report(y_test,y_preds))


In [None]:
confusion_matrix(y_test,y_preds)

In [None]:
# 5. Improve a model
# Try different amounts of n_estimators
import numpy as np
np.random.seed(42)
for i in range(10,100,10):
    print(f'Trying model with {i} estimators..')
    clf= RandomForestClassifier(n_estimators=i).fit(X_train,y_train)
    print(f"Model accuracy on test set : {clf.score(X_test,y_test) * 100:.2f}")
    print("")

In [None]:
# Save a model and load it
import pickle
with open("Random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)


In [None]:
with open("Random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)


## 1. Getting our data ready to be used with machine learning

Three main things we have to do:
1. Split the data into features and labels (usually `X` and `y`);
2. Fill (also called imputing) or disregard missing values;
3. Convert non-numerical values to numerical values (also called feature encoding).

In [None]:
heart_disease.head()

In [None]:
X = heart_disease.drop("target",axis = 1)
y= heart_disease["target"]
X

In [None]:
y.head()

In [None]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.2)


In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
X.shape[0] * 0.8

In [None]:
len(heart_disease)

### 1.1 Make sure it's all numerical

In [None]:
car_sales = pd.read_csv('zero-to-mastery-ml-master/data/car-sales-extended.csv')
car_sales.head()

In [None]:
len(car_sales)

In [None]:
car_sales.dtypes

In [None]:
#Split into X/y
X = car_sales.drop("Price",axis =1)
y= car_sales["Price"]

In [None]:
# Split into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
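# Build machine learning model
# Note: this is expected to fail with a ValueError, because "Make" and "Colour" still hold strings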
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train,y_train)
model.score(X_test,y_test)



In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Assuming the categorical columns are "Make", "Colour" and "Doors"
categorical_features = ["Make", "Colour", "Doors"]

# Create the transformer for the categorical columns
column_transformer = ColumnTransformer(
    transformers=[
        ("encoder", OneHotEncoder(), categorical_features)
    ],
    remainder='passthrough'  # Keeps the other, unspecified columns as they are
)

# X is the features DataFrame
# Apply the transformation
transformed_X = column_transformer.fit_transform(X)
transformed_X

### How OneHotEncoder works
Taking the `Colour` column as an example, we can see that it holds values such as red, green and blue.
Some models, such as Random Forest, need inputs that are not categorical, so we use OneHotEncoder.
When it is applied to the column, each colour becomes a binary vector:

* red: 100
* green: 010
* blue: 001

In this case, we'll take the categorical variables out of the data.
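
As a rough illustration of the idea above, here is a minimal pure-Python sketch of one-hot encoding. The sample values and the category order are assumptions chosen to match the red/green/blue example; Scikit-Learn's `OneHotEncoder` determines its own column order, so its output columns may be arranged differently.

```python
# Minimal one-hot encoding sketch (illustrative only, not the OneHotEncoder implementation).
colours = ["red", "green", "blue", "green"]  # hypothetical sample column

# Fix an order for the categories: one output column per category.
categories = ["red", "green", "blue"]


def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 everywhere else."""
    return [1 if value == category else 0 for category in categories]


encoded = [one_hot(colour, categories) for colour in colours]

for colour, row in zip(colours, encoded):
    print(colour, row)
```

Each row contains exactly one 1, so the model sees numbers without any implied ordering between the colours.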

In [None]:
pd.DataFrame(transformed_X)

`pd.get_dummies()` is a function used to convert categorical variables into dummy/indicator variables. It is widely used in data preprocessing because models need numerical inputs.

In [None]:

dummies = pd.get_dummies(car_sales[["Make","Colour", "Doors"]])
dummies

In [None]:
#Let's refit the model
np.random.seed(42)
X_train,X_test,y_train,y_test = train_test_split(transformed_X,y,test_size = 0.2)
model.fit(X_train,y_train)

In [None]:

model.score(X_test,y_test)



### 1.2 What if there were missing values?

1. Fill them with some value (also known as imputation).
2. Remove the samples with missing data altogether.
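
The two options above can be sketched on a plain Python list. The odometer readings are made up for illustration, and `None` stands in for a missing value.

```python
# Hypothetical odometer readings; None marks a missing value.
odometer = [35000.0, None, 84000.0, 31000.0]

# Option 1: imputation, fill each missing entry with the mean of the known values.
known = [value for value in odometer if value is not None]
mean_value = sum(known) / len(known)
filled = [mean_value if value is None else value for value in odometer]

# Option 2: removal, drop the samples that have a missing entry.
dropped = [value for value in odometer if value is not None]

print(filled)
print(dropped)
```

Imputation keeps every sample at the cost of introducing an estimated value, while removal keeps only observed values at the cost of losing samples.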

In [None]:
#importing car sales missing data
car_sales_missing = pd.read_csv('zero-to-mastery-ml-master/data/car-sales-extended-missing-data.csv')
car_sales_missing.head()

In [None]:
car_sales_missing.isna().sum()

In [None]:
# Create X and y
X = car_sales_missing.drop("Price", axis =1 )
y = car_sales_missing["Price"]
                           

In [None]:
# We need to handle the NaN values
car_sales_missing


#### Option 1: Fill missing data with pandas

In [None]:
car_sales_missing["Doors"].value_counts()

In [None]:
# Fill the "Make" column (assigning the result back updates the original DataFrame)
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")
# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")
# Fill the "Odometer (KM)" column with its mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())
# Fill the missing "Doors" values with 4
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

                                 


In [None]:
car_sales_missing.isna().sum()

In [None]:
# Remove rows with missing Price value
# dropna() is the pandas method for dropping rows or columns that contain missing values
car_sales_missing.dropna(inplace = True)


In [None]:
car_sales_missing.isna().sum()

In [None]:
len(car_sales_missing)

In [None]:
X = car_sales_missing.drop("Price",axis = 1)
y = car_sales_missing["Price"]

`remainder='passthrough'` is an effective way to make sure every column you still need is kept in the final output after specific transformations are applied, which makes it easier to manage complex data pipelines where only some columns need transforming.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make","Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder = "passthrough")
transformed_X = transformer.fit_transform(X)  # use X so the Price label is not included as a feature
transformed_X

#### Option 2: Fill missing values with Scikit-Learn

Normalization (also called min-max scaling) - This rescales all the numerical values to between 0 and 1, with the lowest value mapped to 0 and the highest value mapped to 1. Scikit-Learn provides functionality for this in the `MinMaxScaler` class.

Standardization - This subtracts the mean value from each feature (so the resulting features have 0 mean). It then scales the features to unit variance (by dividing each feature by its standard deviation). Scikit-Learn provides functionality for this in the `StandardScaler` class.

Feature scaling is usually not needed for your target variable.

Feature scaling is usually not needed with tree-based models (for example, Random Forest), since they can handle features on varying scales.
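
To make the two scaling formulas concrete, here is a small hand-computed sketch using only the standard library. The feature values are hypothetical, and the code only mirrors the arithmetic described above rather than calling the Scikit-Learn scaler classes.

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical feature column

# Normalization (min-max scaling): (x - min) / (max - min)
lowest, highest = min(values), max(values)
normalized = [(x - lowest) / (highest - lowest) for x in values]

# Standardization: (x - mean) / standard deviation
# (population standard deviation, i.e. dividing by n rather than n - 1)
mean = statistics.fmean(values)
std = statistics.pstdev(values)
standardized = [(x - mean) / std for x in values]

print(normalized)
print([round(z, 3) for z in standardized])
```

After normalization the values span 0 to 1; after standardization they have mean 0 and unit variance.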

In [None]:
car_sales_missing= pd.read_csv('zero-to-mastery-ml-master/data/car-sales-extended-missing-data.csv')
car_sales_missing.head()

In [None]:
car_sales_missing.isna().sum()

In [None]:
#Drop the rows with no labels
car_sales_missing.dropna(subset = ["Price"], inplace = True)
car_sales_missing.isna().sum()

In [None]:
#Split into X and y
X = car_sales_missing.drop("Price" , axis =1 )
y = car_sales_missing["Price"]


About the code below:
* `SimpleImputer`: fills missing values in columns using different strategies;
* `ColumnTransformer`: applies different transformations to specific columns of a DataFrame.

The explanation below also applies to the other lines with similar commands:
* `cat_imputer`: fills missing values in categorical variables with the word 'missing';
* `cat_features`: the columns that `cat_imputer` will be applied to.

`ColumnTransformer` combines the imputers defined above.
It is applied to the DataFrame `X` (the input data). It fits the imputers to the DataFrame (fit) and transforms the data (transform), producing a new dataset (`filled_X`) in which all missing values have been filled as specified.

In [None]:
# Fill missing values with scikit learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' and numerical values with the mean
cat_imputer = SimpleImputer(strategy = "constant", fill_value = "missing")
door_imputer = SimpleImputer(strategy = 'constant', fill_value =4 )
num_imputer = SimpleImputer(strategy = 'mean')

#Define columns
cat_features = ["Make","Colour"]
door_features = ["Doors"]
num_features = ["Odometer (KM)"]

#Create an imputer(something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_features),
    ("num_imputer",num_imputer,num_features)
])

#Transform the data
filled_X = imputer.fit_transform(X)
filled_X

In [None]:
car_sales_filled = pd.DataFrame(filled_X,
                                columns = ["Make","Colour","Doors","Odometer (KM)"])
car_sales_filled.head()

In [None]:
car_sales_filled.isna().sum()

In [None]:
X = car_sales_filled

In [None]:
# We need to convert the string variables to numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Assuming the categorical columns are "Make", "Colour" and "Doors"
categorical_features = ["Make", "Colour", "Doors"]

# Create the transformer for the categorical columns
column_transformer = ColumnTransformer(
    transformers=[
        ("encoder", OneHotEncoder(), categorical_features)
    ],
    remainder='passthrough'  # Keeps the other, unspecified columns as they are
)

# Apply the transformation to the filled features DataFrame
transformed_X = column_transformer.fit_transform(car_sales_filled)
transformed_X

In [None]:
#now we've got our data as numbers and filled (no missing values)
#let's fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size = 0.2)
model = RandomForestRegressor(n_estimators = 100)
model.fit(X_train,y_train)
model.score(X_test,y_test)
                                                    


In [None]:
len(car_sales_filled), len(car_sales)

## 2. Choosing the right estimator/algorithm for your problem
['0. An end-to-end Scikit-Learn workflow',
 '1. Getting the data ready',
 '2. Choose the right estimator/algorithm for our problems',
 '3. Fit the model/algorithm and use it to make prediction on our data',
 '4. Evaluating a model',
 '5. Improve a model',
 '6. Save and load a trained model',
 '7. Putting it all together!']

Some things to note:
* Scikit-Learn refers to machine learning models and algorithms as estimators.
* Classification problem - predicting a category (heart disease or not).
* Sometimes you'll see `clf` (short for classifier) used as a classification estimator.
* Regression problem - predicting a number (selling price of a car).

If you're working on a machine learning problem with Scikit-Learn and aren't sure which model to use, refer to the Scikit-Learn machine learning map:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
   

### 2.1 Picking a machine learning model for a regression problem


In [None]:
#get California housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

In [None]:

housing_df = pd.DataFrame(housing["data"], columns = housing["feature_names"])
housing_df

In [None]:
housing_df["target"] = housing["target"]
housing_df.head()

In [None]:
housing_df = housing_df.drop("MedHouseVal", axis = 1, errors = "ignore")  # no-op if the column does not exist
housing_df

In [None]:
housing_df.columns

Ridge regression, also known as Tikhonov regularization, is a technique used in statistics and machine learning to analyse data that suffer from multicollinearity, that is, high correlation between the independent variables. It is an extension of linear regression and is used to help prevent overfitting, which can occur in predictive models.
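
To give a feel for what the regularization does, here is a deliberately simplified sketch: a single feature, no intercept, and made-up data. Under those assumptions, minimising `sum((y - w*x)**2) + alpha * w**2` has the closed-form solution `w = sum(x*y) / (sum(x*x) + alpha)`. The helper `ridge_slope` is hypothetical and is not how Scikit-Learn's `Ridge` is implemented.

```python
# Hypothetical data lying exactly on the line y = 2x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]


def ridge_slope(x, y, alpha):
    """Weight w minimising sum((y_i - w * x_i)^2) + alpha * w^2 (one feature, no intercept)."""
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    return sum_xy / (sum_xx + alpha)


# alpha = 0 is ordinary least squares; larger alpha shrinks the weight towards 0.
for alpha in (0.0, 1.0, 14.0):
    print(alpha, ridge_slope(x, y, alpha))
```

The penalty trades a little bias for smaller, more stable weights, which is why it can help when features are strongly correlated.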

In [None]:
#import algorithm
from sklearn.linear_model import Ridge

#Setup random seed
np.random.seed(42)

#Create the data
X = housing_df.drop("target", axis = 1)
y = housing_df["target"]  # median house price in $100,000s

#Split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Instantiate and fit the model
model = Ridge()
model.fit(X_train, y_train)

#Check the score of the model(on the test set)
model.score(X_test, y_test)


In [None]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor

#Setup random seed
np.random.seed(42)

#create the data
X = housing_df.drop("target", axis =1)
y = housing_df["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.2)

#Create a random forest model
model = RandomForestRegressor()
model.fit(X_train, y_train)

#Check the score
model.score(X_test, y_test)

### 2.2 Picking a machine learning model for a classification problem

In [None]:
heart_disease = pd.read_csv('zero-to-mastery-ml-master/data/heart-disease.csv')
heart_disease.head()

In [None]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# Setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 1000)
clf.fit(X_train,y_train)

# Evaluate the RandomForestClassifier
clf.score(X_test,y_test)


In [None]:
heart_disease["target"].value_counts()


## 3. Fit the model/algorithm on our data and use it to make predictions
### 3.1 Fitting the model to the data

Different names for:
* `X` = features, feature variables, data
* `y` = labels, targets, target variables

In [None]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# Setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 100)
clf.fit(X_train,y_train)

# Evaluate the RandomForestClassifier
clf.score(X_test,y_test)


In [None]:
X.head()

In [None]:
y.head()

In [None]:
y.tail()

### 3.2 Make predictions using a machine learning model
2 ways to make predictions:
1. `predict()`
2. `predict_proba()`

* `predict()`: use it when you need a direct decision/label/classification.
* `predict_proba()`: use it when you need to assess the model's uncertainty or make decisions that depend on how confident the predictions are (for example, when different costs are attached to different kinds of classification error).
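
The relationship between the two outputs can be sketched without a trained model. The probabilities below are invented for illustration; the sketch assumes the hard label is simply the class with the highest probability, which is how many classifiers, random forests included, derive `predict()` from their probability estimates.

```python
# Hypothetical class probabilities, one row per sample: [P(class 0), P(class 1)].
probabilities = [
    [0.89, 0.11],
    [0.49, 0.51],
    [0.43, 0.57],
    [0.84, 0.16],
    [0.18, 0.82],
]

# A hard label keeps only the most probable class for each sample.
labels = [row.index(max(row)) for row in probabilities]

# The probability of the chosen class shows how confident that decision was.
confidence = [max(row) for row in probabilities]

for label, conf in zip(labels, confidence):
    print(f"predicted class {label} with probability {conf:.2f}")
```

The second sample is labelled 1 with a probability of only 0.51, a distinction that the hard label alone hides.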

In [None]:
#Use a trained model to make predictions
X_test.head()

In [None]:
clf.predict(X_test)

In [None]:
np.array(y_test)

In [None]:
# Compare predictions to truth labels to evaluate the model
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)  # The fraction of correct predictions, i.e. the model's accuracy on this data


In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_preds)

Make predictions with `predict_proba()`

In [None]:
# predict_proba() returns probabilities of a classification label
clf.predict_proba(X_test[:5])  # The output is a set of probability estimates
# The left column is the probability of the output being 0
# and the right column is the probability of the output being 1

In [None]:
#let's predict() on the same data...
clf.predict(X_test[:5])

In [None]:
heart_disease["target"].value_counts()

In [None]:
housing_df.head()

In [None]:
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis = 1)  # All columns except the target column
y = housing_df["target"]

# Split into training and test sets
X_train,X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Create model instance
model = RandomForestRegressor()

#Fit the model to the data
model.fit(X_train, y_train)
#Make predictions
y_preds = model.predict(X_test)

In [None]:
y_preds[:10]

In [None]:
np.array(y_test[:10])

In [None]:
#Compare the prediction to the truth
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,y_preds)

In [None]:
housing_df["target"]

## 4. Evaluating a machine learning model

Three ways to evaluate Scikit-Learn models/estimators:
1. Estimator's built-in `score` method
2. The `scoring` parameter
3. Problem-specific metric functions



### 4.1 Evaluating a model with the `score` method

In [None]:
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)

# Create the data
X = heart_disease.drop("target", axis = 1)  # All columns except the target column
y = heart_disease["target"]

# Split into training and test sets
X_train,X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Create model instance
model = RandomForestClassifier(n_estimators =100)

#Fit the model to the data
model.fit(X_train, y_train)

In [None]:
# For classifiers, score() returns the mean accuracy: the highest value is 1.0, the lowest is 0.0
# For regression algorithms, the default score() metric is R^2 (at most 1.0, and it can be negative)
model.score(X_train,y_train)

### 4.2 Evaluating a model using the `scoring` parameter

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)

# Create the data
X = heart_disease.drop("target", axis = 1)  # All columns except the target column
y = heart_disease["target"]

# Split into training and test sets
X_train,X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Create model instance
model = RandomForestClassifier(n_estimators =100)

#Fit the model to the data
model.fit(X_train, y_train)

In [None]:
cross_val_score(model,X,y, cv = 5)

In [None]:
np.random.seed(42)

# Single training and test split score
clf_single_score = model.score(X_test, y_test)

# Take the mean of the 5-fold cross-validation score
clf_cross_val_score = np.mean(cross_val_score(model, X, y, cv = 5))

#Compare the two
clf_single_score, clf_cross_val_score

In [None]:
# The scoring parameter is set to None by default
cross_val_score(model,X,y, cv = 5,scoring = None)

#### 4.2.1 Classification model evaluation metrics
1. Accuracy
2. Area under ROC curve
3. Confusion matrix
4. Classification report

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

clf = RandomForestClassifier()
cv_scores = cross_val_score(clf, X, y, cv = 5)  # a new name, so the cross_val_score function is not overwritten

In [None]:
np.mean(cv_scores)
print(f"Heart Disease Classifier Cross-Validated Accuracy: {np.mean(cv_scores) * 100:.2f}%")

**Area under the receiver operating characteristic curve (AUC/ROC)**

* Area under curve (AUC)
* ROC curve

ROC curves are a comparison of a model's true positive rate (tpr) versus a model's false positive rate (fpr).

* True positive = model predicts 1 when truth is 1
* False positive = model predicts 1 when truth is 0
* True negative = model predicts 0 when truth is 0
* False negative = model predicts 0 when truth is 1
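
The four outcomes above are enough to compute one point of a ROC curve by hand. The labels and scores below are hypothetical, and `rates_at` is an illustrative helper rather than a Scikit-Learn function.

```python
# Hypothetical true labels and predicted probabilities of class 1.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.65, 0.3, 0.2, 0.7, 0.8, 0.1]


def rates_at(threshold, y_true, y_score):
    """Return (true positive rate, false positive rate) at a decision threshold."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp / (tp + fn), fp / (fp + tn)


# Each threshold gives one (fpr, tpr) point; sweeping thresholds traces the curve.
for threshold in (0.0, 0.5, 1.1):
    tpr, fpr = rates_at(threshold, y_true, y_score)
    print(f"threshold={threshold}: tpr={tpr}, fpr={fpr}")
```

Lowering the threshold catches more true positives but also lets in more false positives, which is the trade-off the ROC curve visualises.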

In [None]:
# Create X_test... etc
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
from sklearn.metrics import roc_curve
#Fit the classifier
clf.fit(X_train,y_train)

# Make predictions with probabilities
y_probs = clf.predict_proba(X_test)
y_probs[:10], len(y_probs)

In [None]:
y_probs_positive = y_probs[:,1]
y_probs_positive[:10]

In [None]:
# Calculate fpr, tpr and thresholds
# roc_curve computes the false positive rate (FPR), the true positive rate (TPR) and the decision thresholds for different cut-off points on the predicted probabilities.
# It is widely used in binary classification
fpr,tpr, thresholds = roc_curve(y_test,y_probs_positive)

# Check the false positive rates
fpr

In [None]:
# Create a function for plotting ROC curves
import matplotlib.pyplot as plt

def plot_roc_curve(fpr, tpr):
    """
    Plots a ROC curve given the false positive rate (fpr)
    and true positive rate (tpr) of a model.
    """
    # Plot ROC curve
    plt.plot(fpr, tpr, color = "orange", label = "ROC")
    # Plot line with no predictive power (baseline), from (0, 0) to (1, 1).
    # The "Guessing" label is what appears in the plot legend.
    plt.plot([0, 1], [0, 1], color = "darkblue", linestyle = "--", label = "Guessing")
    # Customize the plot
    plt.xlabel("False positive rate (fpr)")
    plt.ylabel("True positive rate (tpr)")
    plt.title("Receiver Operating Characteristic (ROC) Curve")
    plt.legend()
    plt.show()

# The dashed line is a baseline for comparing model performance: a model above it
# has some predictive power, while a model below it does worse than random guessing.
plot_roc_curve(fpr, tpr)

In [None]:
from sklearn.metrics import roc_auc_score  # Area under the curve, i.e. everything below the ROC curve

roc_auc_score(y_test, y_probs_positive)

ROC curves and AUC metrics are evaluation metrics for binary classification models (a model which predicts one thing or another, such as heart disease or not).

The ROC curve compares the true positive rate (tpr) versus the false positive rate (fpr) at different classification thresholds.

The AUC metric tells you how good your model is at choosing between classes (for example, how good it is at deciding whether someone has heart disease or not). A perfect model will get an AUC score of 1.
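
One way to read the AUC score is as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The sketch below computes it that way on hypothetical data; `auc_by_pairs` is an illustrative helper, not the algorithm `roc_auc_score` uses internally.

```python
# Hypothetical true labels and predicted probabilities of class 1.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.65, 0.3, 0.2, 0.7, 0.8, 0.1]


def auc_by_pairs(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(auc_by_pairs(y_true, y_score))  # imperfect ranking
print(auc_by_pairs(y_true, y_true))   # scores equal to the labels: a perfect ranking
```

A score of 0.5 would mean the ranking is no better than guessing, while 1.0 means every positive is ranked above every negative.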
                                                                             

In [None]:
# Plot perfect ROC curve and AUC score
fpr, tpr, thresholds = roc_curve(y_test, y_test)
plot_roc_curve(fpr,tpr)

In [None]:
#Perfect AUC score
roc_auc_score(y_test, y_test)

**Confusion matrix**

A confusion matrix is a quick way to compare the labels a model predicts and the actual labels it was supposed to predict.
In essence, it gives you an idea of where the model is getting confused.
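
As a small sketch of what such a matrix contains, the counts can be built by hand for a binary problem. The labels are hypothetical; the layout used here (rows are actual labels, columns are predicted labels) matches the convention of Scikit-Learn's `confusion_matrix`.

```python
from collections import Counter

# Hypothetical actual and predicted labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Count how often each (actual, predicted) combination occurs.
counts = Counter(zip(y_true, y_pred))

# Rows: actual label 0 then 1. Columns: predicted label 0 then 1.
matrix = [[counts[(actual, predicted)] for predicted in (0, 1)] for actual in (0, 1)]

for row in matrix:
    print(row)
```

The diagonal holds the correct predictions; the off-diagonal cells show where the model confuses one class for the other.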

In [None]:
from sklearn.metrics import confusion_matrix
y_preds = model.predict(X_test)

confusion_matrix(y_test, y_preds)

In [None]:
pd.crosstab(y_test,
            y_preds,
            rownames=["Actual Label"],
            colnames=["Predicted Labels"])

In [None]:
# Option 1
# Make our confusion matrix more visual with Seaborn's heatmap()
import seaborn as sns
# Set the font scale
sns.set(font_scale = 1.5)

# Create a confusion matrix
conf_mat = confusion_matrix(y_test, y_preds)

#Plot it using Seaborn
sns.heatmap(conf_mat);


#### Option 2: Creating a confusion matrix using Scikit-Learn
To use the newer methods of creating a confusion matrix with Scikit-Learn, you will need Scikit-Learn version 1.0+.

In [None]:
import sklearn
sklearn.__version__

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(estimator = model,X=X,y=y)

In [None]:
ConfusionMatrixDisplay.from_predictions(y_true = y_test,
                                        y_pred = y_preds)

**Classification Report**

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_preds))

In [None]:
# Where precision and recall become valuable
import numpy as np
disease_true = np.zeros(10000)
disease_true[0] = 1 # only one positive

disease_preds = np.zeros(10000)#Model predicts every case as 0
pd.DataFrame(classification_report(disease_true,
                                  disease_preds,
                                  output_dict = True))

To summarize classification metrics:

* Accuracy is a good measure to start with if all classes are balanced (e.g. the same number of samples labelled 0 and 1).
* Precision and recall become more important when classes are imbalanced.
* If false positive predictions are worse than false negatives, aim for higher precision.
* If false negative predictions are worse than false positives, aim for higher recall.
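
The imbalanced example used in this section (one positive case in 10,000, with a model that always predicts 0) can be worked through by hand. This is a plain-Python sketch of the arithmetic, with precision and recall reported as 0.0 when their denominator is zero, which is an assumption made here for simplicity.

```python
# Highly imbalanced labels: 1 positive case in 10,000 samples.
n_samples = 10_000
disease_true = [0] * n_samples
disease_true[0] = 1
disease_preds = [0] * n_samples  # a "model" that always predicts 0

pairs = list(zip(disease_true, disease_preds))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / n_samples
# Precision and recall are undefined when their denominator is 0; use 0.0 in that case.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

print(f"accuracy={accuracy:.4f}, precision={precision}, recall={recall}")
```

Accuracy looks excellent even though the model never finds the one positive case, which is exactly what recall exposes.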


#### 4.2.2 Regression model evaluation

The ones we're going to cover are:
1. R² (pronounced r-squared) or coefficient of determination
2. Mean absolute error (MAE)
3. Mean squared error (MSE)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing["data"], columns = housing["feature_names"])
housing_df["target"] = housing["target"]
np.random.seed(42)

X = housing_df.drop("target", axis = 1)
y= housing_df["target"]

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2)

model = RandomForestRegressor(n_estimators = 100)
model.fit(X_train, y_train)

In [None]:
model.score(X_test,y_test)

In [None]:
housing_df.head()

In [None]:
y_test

In [None]:
y_test.mean()

In [None]:
from sklearn.metrics import r2_score

#Fill an array with y_test mean
y_test_mean = np.full(len(y_test), y_test.mean())

In [None]:
y_test_mean[:10]

In [None]:
r2_score(y_true = y_test,
         y_pred= y_test_mean)
# An R² of 0 means the model does no better than simply predicting the mean for every observation.
# In this specific case R² comes out as 0, because the prediction (always the mean) explains none of the
# variation of the actual values around their own mean. This is a basic check that sets a performance
# baseline for comparing more complex models against simply predicting the mean.

In [None]:
r2_score(y_true = y_test,
         y_pred = y_test)
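
The two `r2_score` checks above (predicting the mean, then predicting the true values) can be reproduced by hand from the definition R² = 1 - SS_res / SS_tot. The numbers are hypothetical and `r2` is an illustrative helper.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]                      # hypothetical targets
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)  # always predict the mean

print(r2(y_true, mean_pred))  # baseline: predicting the mean
print(r2(y_true, y_true))     # perfect predictions
```

Predictions worse than the mean baseline make SS_res larger than SS_tot, which is how R² can drop below 0.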

**Mean absolute error (MAE)**

MAE is the average of the absolute differences between predictions and actual values.
It gives you an idea of how wrong your model's predictions are.
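
A hand-computed sketch of the definition, using made-up values, also shows why the absolute value matters: without it, positive and negative errors can cancel out.

```python
# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 6.0, 2.0, 7.0]

differences = [p - t for t, p in zip(y_true, y_pred)]

# MAE: the mean of the absolute differences.
mae = sum(abs(d) for d in differences) / len(differences)

# The mean of the signed differences lets errors of opposite sign cancel.
mean_signed = sum(differences) / len(differences)

print(f"MAE={mae}, mean signed difference={mean_signed}")
```

Here the signed mean is 0 even though two of the four predictions are off by 1, so it would wrongly suggest a perfect model.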

In [None]:
#MAE
from sklearn.metrics import mean_absolute_error

y_preds = model.predict(X_test)
mae= mean_absolute_error(y_test, y_preds)
mae

In [None]:
df = pd.DataFrame(data={"actual values" : y_test,
                        "predicted values":y_preds})
df["differences"] = df["predicted values"] - df["actual values"]
df.head(10)

In [None]:
y_preds

In [None]:
y_test

In [None]:
np.abs(df["differences"]).mean()  # MAE by hand: the mean of the absolute differences

**Mean squared error (MSE)**

MSE is the mean of the squared errors between actual and predicted values.
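
The sketch below works the definition through on made-up values and then adds a single large error, to illustrate how squaring makes MSE more sensitive to outliers than MAE. The helpers `mse` and `mae` are illustrative, not the Scikit-Learn functions.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between predictions and actual values."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of the absolute differences between predictions and actual values."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0, 7.0]           # hypothetical targets
y_pred = [2.0, 6.0, 2.0, 7.0]           # errors: -1, +1, 0, 0
y_pred_outlier = [2.0, 6.0, 2.0, 11.0]  # same, but the last error is now +4

print(mse(y_true, y_pred), mae(y_true, y_pred))
print(mse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))
```

One error of 4 triples the MAE but multiplies the MSE by nine, so MSE is the stricter metric when large mistakes are especially costly.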

In [None]:
#Mean squared error
from sklearn.metrics import mean_squared_error

y_preds = model.predict(X_test)
mse = mean_squared_error(y_test, y_preds)
mse

In [None]:
df["squared_differences"] = np.square(df["differences"])
df.head()

In [None]:
# Calculate MSE by hand
squared = np.square(df["differences"])
squared.mean()

In [None]:
df_large_error = df.copy()
df_large_error.iloc[0, df_large_error.columns.get_loc("squared_differences")] = 16  # a single .iloc call, so the copy is actually modified
df_large_error

In [None]:
#Calculate MSE with large error
df_large_error["squared_differences"].mean()

In [None]:
df_large_error.iloc[1:100, df_large_error.columns.get_loc("squared_differences")] = 20  # only change the squared differences
df_large_error

#### 4.2.3 Finally, using the `scoring` parameter

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

X = heart_disease.drop("target" , axis = 1)
y = heart_disease["target"]

clf = RandomForestClassifier(n_estimators = 100)


In [None]:
np.random.seed(42)
#Cross validation accuracy
cv_acc= cross_val_score(clf, X, y, cv = 5, scoring = "accuracy")

In [None]:
#Cross-validated accuracy
print(f'The cross-validated accuracy is: {np.mean(cv_acc) * 100:.2f}%')

In [None]:
#Precision
cv_precision = cross_val_score(clf, X, y, cv = 5, scoring = "precision")
cv_precision

In [None]:
#Cross-validated precision
print(f'The cross-validated precision is: {np.mean(cv_precision)}')

In [None]:
#Recall 
cv_recall = cross_val_score(clf, X, y, cv = 5, scoring ="recall")
cv_recall

In [None]:
# Cross-validated recall
print(f"The cross-validated recall is: {np.mean(cv_recall)}")

Let's see the `scoring` parameter being used for a regression problem...

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing["data"], columns = housing["feature_names"])
housing_df["target"] = housing["target"]

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

X = housing_df.drop("target", axis =1)
y = housing_df["target"]
model = RandomForestRegressor(n_estimators = 100)

In [None]:
np.random.seed(42)
cv_r2 = cross_val_score (model,X,y,cv = 3, scoring = None)
np.mean(cv_r2)

In [None]:
cv_r2

In [None]:
# Mean squared error
cv_mse = cross_val_score(model,X,y,cv=3, scoring  ="neg_mean_squared_error")
np.mean(cv_mse)

In [None]:
# Mean absolute error
cv_mae = cross_val_score(model,X,y,cv=3, scoring  ="neg_mean_absolute_error")
np.mean(cv_mae)

In [None]:
cv_mae
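
The `neg_` prefix exists because Scikit-Learn scorers follow a "higher is better" convention, so error metrics are returned negated. A small sketch of converting them back, using synthetic data and `LinearRegression` swapped in for the random forest to keep it fast (both are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (stand-in for the housing data)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=42)
reg_demo = LinearRegression()

# Scores come back as negative MSE, one value per fold
neg_mse = cross_val_score(reg_demo, X_demo, y_demo, cv=3, scoring="neg_mean_squared_error")

# Flip the sign to recover the usual MSE, then take the root for RMSE
mse_per_fold = -neg_mse
rmse_per_fold = np.sqrt(mse_per_fold)

print(neg_mse)
print(rmse_per_fold)
```

A value closer to zero means a better model, whichever sign convention is used.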

## 4.3 Using different evaluation metrics as Scikit-Learn functions
The third way to evaluate Scikit-Learn machine learning models/estimators is to use the `sklearn.metrics` module.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

np.random.seed(42)
#Create X, y:
X = heart_disease.drop("target", axis = 1)
y= heart_disease["target"]
#Split data
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size = 0.2)
#Create model
clf = RandomForestClassifier()
#Fit model
clf.fit(X_train,y_train)

#Make predictions
y_preds = clf.predict(X_test)

#Evaluate model using evaluation functions

print("Classifier metrics on the test set")
print(f"Accuracy: {accuracy_score(y_test, y_preds) * 100:.2f}%")
print(f"Precision: {precision_score(y_test,y_preds)}")
print(f"Recall: {recall_score(y_test,y_preds)}")
print(f"F1_score: {f1_score(y_test,y_preds)}")
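
To make these four functions less of a black box, the sketch below recomputes them by hand from the counts of true/false positives and negatives. The label arrays are hypothetical toy values, not the heart disease predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels (illustrative only)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```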

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

np.random.seed(42)
#Create X, y (regression metrics need the housing data, not the heart disease labels):
X = housing_df.drop("target", axis = 1)
y = housing_df["target"]
#Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
#Create model
model = RandomForestRegressor()
#Fit model
model.fit(X_train, y_train)

#Make predictions
y_preds = model.predict(X_test)

print("Regression metrics on the test set")
print(f"R2 score: {r2_score(y_test, y_preds)}")
print(f"MAE: {mean_absolute_error(y_test,y_preds)}")
print(f"MSE: {mean_squared_error(y_test, y_preds)}")
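
R^2 compares the model's squared error with that of a baseline that always predicts the mean of the targets. A minimal sketch of that definition on hypothetical values, checked against `sklearn.metrics`:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions (illustrative only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

residuals = y_true - y_pred
mae_manual = np.mean(np.abs(residuals))
mse_manual = np.mean(residuals ** 2)

ss_res = np.sum(residuals ** 2)                 # error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
print(mae_manual, mean_absolute_error(y_true, y_pred))
print(mse_manual, mean_squared_error(y_true, y_pred))
```

An R^2 of 1.0 means perfect predictions, 0.0 means no better than predicting the mean, and negative values mean worse than that baseline.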

## 5. Improve a model
First predictions = baseline predictions.
First model = baseline model.

From a data perspective:
* Could we collect more data? (Generally, the more data, the better.)
* Could we improve our data?

From a model perspective:
* Is there a better model we could use?
* Could we improve the current model?

Parameters vs. hyperparameters:
* Parameters = patterns the model finds in the data
* Hyperparameters = settings on a model you can adjust to (potentially) improve its ability to find patterns
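
One way to see the difference in code: hyperparameters are available before fitting through `get_params()`, while what the model learned only exists after `fit()` (by Scikit-Learn convention, in attributes ending with an underscore). A small sketch on synthetic data, with illustrative names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)

# Hyperparameters: chosen by us, readable before any training happens
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
hyperparameters = forest.get_params()
print(hyperparameters["n_estimators"], hyperparameters["max_depth"])

# Learned from data: only available after fitting
forest.fit(X_demo, y_demo)
print(len(forest.estimators_))               # the fitted decision trees
print(forest.feature_importances_.round(3))  # importance learned per feature
```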

Three ways to adjust hyperparameters:
1. By hand
2. Randomly with `RandomizedSearchCV`
3. Exhaustively with `GridSearchCV`

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 100)

### 5.1 Tuning hyperparameters by hand

Let's make three sets: training, validation and test.

In [None]:
clf.get_params()

We're going to try and adjust:

* max_depth
* max_features
* min_samples_leaf
* min_samples_split
* n_estimators

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_preds(y_true, y_preds):
    """
    Performs evaluation comparison on y_true labels vs. y_pred labels.
    """
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    metric_dict = {
        "accuracy": round(accuracy, 2),
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "f1": round(f1, 2)
    }
    print(f"Acc: {accuracy * 100:.2f}%")
    print(f"Precision: {precision * 100:.2f}%")
    print(f"Recall: {recall * 100:.2f}%")
    print(f"F1 Score: {f1 * 100:.2f}%")
    return metric_dict




In [None]:
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Set the seed for random number generator
np.random.seed(42)

# Shuffle the data
heart_disease_shuffled = heart_disease.sample(frac=1)

# Split into X and y
X = heart_disease_shuffled.drop("target", axis=1)
y = heart_disease_shuffled["target"]  # Use the shuffled DataFrame to ensure the target matches the features

# Calculate indices for train, validation, and test split
train_split = round(0.7 * len(heart_disease_shuffled))  # 70% of data
valid_split = round(train_split + 0.15 * len(heart_disease_shuffled))  # 15% of data

# Split the data into train, validation, and test sets
X_train, y_train = X[:train_split], y[:train_split]
X_valid, y_valid = X[train_split:valid_split], y[train_split:valid_split]
X_test, y_test = X[valid_split:], y[valid_split:]

clf = RandomForestClassifier()
clf.fit(X_train,y_train)
y_preds = clf.predict(X_valid)

#Evaluate the classifier on the validation set
baseline_metrics = evaluate_preds(y_valid, y_preds)
baseline_metrics
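
The manual index slicing above can also be done with two calls to `train_test_split`. This is a sketch of an alternative, not the approach the notebook uses: it runs on synthetic data, and the `stratify` argument (which keeps class proportions similar across the sets) is an extra that the manual split does not do.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# First split: 70% train, 30% held out
X_train, X_temp, y_train, y_temp = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

# Second split: halve the held-out part into 15% validation and 15% test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

print(len(X_train), len(X_valid), len(X_test))
```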


In [None]:
clf.get_params()

In [None]:
np.random.seed(42)

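#Create a second classifier with different hyperparameters
#(n_estimators=200 is an illustrative value; the default of 100 would just repeat the baseline)
clf_2 = RandomForestClassifier(n_estimators=200)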
clf_2.fit(X_train,y_train)
y_preds2 = clf_2.predict(X_valid)
clf_2_metrics = evaluate_preds(y_valid,y_preds2)
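
Tuning by hand boils down to a loop: try a value, fit on the training set, score on the validation set, keep the best. A minimal sketch of that loop for `max_depth`, on synthetic data with illustrative candidate values (all assumptions, not the notebook's results):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data and a single train/validation split (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = {}
for depth in [2, 5, None]:  # candidate max_depth values, chosen for illustration
    model_demo = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    model_demo.fit(X_tr, y_tr)
    results[depth] = accuracy_score(y_val, model_demo.predict(X_val))

# Pick the depth with the highest validation accuracy
best_depth = max(results, key=results.get)
print(results)
print("Best max_depth:", best_depth)
```

This gets tedious quickly with several hyperparameters, which is the motivation for automating the search.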

### 5.2 Hyperparameter tuning with RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV
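# "auto" was removed from max_features in scikit-learn 1.3 (for classifiers it meant "sqrt"),
# so "log2" is used here as the second option
grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 15, 20, 30],
        "max_features": ["sqrt", "log2"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]
       }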
np.random.seed(42)

#Split into X and y
X = heart_disease_shuffled.drop("target", axis=1)
y = heart_disease_shuffled["target"]  

#Split data
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size = 0.2)
#Create model
clf = RandomForestClassifier(n_jobs= None)

#Set up RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator = clf,
                            param_distributions=grid,
                            n_iter=10, #number of models to try
                            cv=5,
                            verbose =2)
#Fit the randomized search
rs_clf.fit(X_train,y_train);

In [None]:
rs_clf.best_params_

In [None]:
#Make predictions with the best hyperparameters
rs_y_preds = rs_clf.predict(X_test)

#Evaluate the prediction
rs_metrics = evaluate_preds(y_test, rs_y_preds)
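
Besides lists of values, `param_distributions` also accepts distributions to sample from, such as `scipy.stats.randint`. The sketch below is a hedged variation on the search above, run on synthetic data with small, illustrative ranges so that it finishes quickly:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the heart disease data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

param_distributions = {
    "n_estimators": randint(10, 60),     # integers sampled from [10, 60)
    "min_samples_split": randint(2, 8),  # integers sampled from [2, 8)
    "max_features": ["sqrt", "log2"],    # lists are sampled uniformly
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,         # number of sampled combinations
    cv=3,
    random_state=42,  # makes the sampled combinations reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```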

### 5.3 Hyperparameter tuning with GridSearchCV

In [None]:
grid

In [None]:
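# "auto" is no longer a valid max_features value (removed in scikit-learn 1.3), so "log2" is used instead
grid_2 = {'n_estimators': [100, 200, 500],
 'max_depth': [None],
 'max_features': ['sqrt', 'log2'],
 'min_samples_split': [6],
 'min_samples_leaf': [1, 2]}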

In [None]:
from sklearn.model_selection import GridSearchCV,train_test_split

np.random.seed(42)

#Split into X and y
X = heart_disease_shuffled.drop("target", axis=1)
y = heart_disease_shuffled["target"]  

#Split data
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size = 0.2)
#Create model
clf = RandomForestClassifier(n_jobs= None)

#Set up GridSearchCV
gs_clf = GridSearchCV(estimator = clf,
                     param_grid=grid_2,
                     cv=5,
                     verbose =2)
#Fit the grid search
gs_clf.fit(X_train,y_train);

In [None]:
gs_clf.best_params_

In [None]:
gs_y_preds = gs_clf.predict(X_test)

#Evaluate the prediction
gs_metrics = evaluate_preds(y_test,gs_y_preds)
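
Unlike the randomized search, a grid search tries every combination, so the cost grows multiplicatively with each added value. A quick way to count the work before launching it is `ParameterGrid`. The grid below is a hypothetical one of the same shape as `grid_2`:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid for illustration
demo_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [6],
    "min_samples_leaf": [1, 2],
}

n_candidates = len(ParameterGrid(demo_grid))  # 3 * 1 * 2 * 1 * 2
n_folds = 5
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)  # 12 60
```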

### Comparing the metrics of the different models

In [None]:
compare_metrics = pd.DataFrame({"baseline":baseline_metrics,
                                "clf_2": clf_2_metrics,
                                "random search": rs_metrics,
                                "grid search": gs_metrics})
compare_metrics.plot.bar(figsize = (10,8))


## 6. Saving and loading trained machine learning models
Two ways to save and load machine learning models:
1. With Python's `pickle` module
2. With the `joblib` module


In [None]:
import pickle
#Save the model (the with-block closes the file once writing is done)
with open("gs_random_forest_modek_1.pkl", "wb") as f:
    pickle.dump(gs_clf, f)

In [None]:
#Load a saved model
with open("gs_random_forest_modek_1.pkl", "rb") as f:
    loaded_pickle_model = pickle.load(f)

pickle_y_preds = loaded_pickle_model.predict(X_test)
evaluate_preds(y_test, pickle_y_preds)
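
The second option, `joblib`, works much the same way through `joblib.dump` and `joblib.load`. A minimal sketch, using a small model trained on synthetic data and a temporary directory so nothing is left on disk (the filename is hypothetical):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small model on synthetic data (stand-in for gs_clf)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)
model_demo = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_random_forest.joblib")  # hypothetical filename
    joblib.dump(model_demo, model_path)     # save the fitted model
    loaded_model = joblib.load(model_path)  # load it back

# The loaded model should predict exactly like the original
same_predictions = (loaded_model.predict(X_demo) == model_demo.predict(X_demo)).all()
print(same_predictions)
```

As with `pickle`, only load files you trust, since loading can execute arbitrary code.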

## 7. Putting it all together

In [178]:
data = pd.read_csv('zero-to-mastery-ml-master/data/car-sales-extended-missing-data.csv')
data.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [179]:
data.dtypes

Make              object
Colour            object
Odometer (KM)    float64
Doors            float64
Price            float64
dtype: object

In [181]:
data.isna().sum() # Check the number of missing values per column

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

Steps:
1. Fill missing data
2. Convert data to numbers
3. Build a model on the data

In [186]:
import pandas as pd 
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Setup random seed
import numpy as np
np.random.seed(42)

# Import data and drop rows with missing labels
data = pd.read_csv('zero-to-mastery-ml-master/data/car-sales-extended-missing-data.csv')
data.dropna(subset=["Price"], inplace=True)  # Drop rows with missing values, but only where the Price column is missing

# Define different features and transformer pipeline
categorical_features = ["Make", "Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

door_features = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))
])

numeric_features = ["Odometer (KM)"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))
])

# Setup preprocessing (fill missing values, then convert to numbers)
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('door', door_transformer, door_features),
        ('num', numeric_transformer, numeric_features)
    ]
)

# Creating a preprocessing and modelling pipeline
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor())
])

# Split data
X = data.drop("Price", axis=1)
y = data["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit and score the model
model.fit(X_train, y_train)
model.score(X_test, y_test)


0.22188417408787875
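
A `Pipeline` can itself be tuned with `GridSearchCV`: each hyperparameter is addressed by joining the step names with double underscores, e.g. `preprocessor__num__imputer__strategy`. The sketch below shows the naming scheme on a small made-up DataFrame with a trimmed-down pipeline. The data, the grid values and the names `demo`, `pipe`, `pipe_grid` and `gs_pipe` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up car data (not the real car sales CSV)
rng = np.random.default_rng(42)
n = 60
demo = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota"], size=n),
    "Odometer (KM)": rng.uniform(10_000, 200_000, size=n),
})
demo["Price"] = 30_000 - 0.1 * demo["Odometer (KM)"] + rng.normal(0, 1_000, size=n)
# Knock out every 10th odometer reading so the imputer has work to do
demo.iloc[::10, demo.columns.get_loc("Odometer (KM)")] = np.nan

preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Make"]),
    ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="mean"))]), ["Odometer (KM)"]),
])
pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(random_state=42)),
])

# step name + "__" + nested step name + "__" + parameter name
pipe_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__n_estimators": [10, 50],
}

gs_pipe = GridSearchCV(pipe, pipe_grid, cv=3)
gs_pipe.fit(demo.drop("Price", axis=1), demo["Price"])
print(gs_pipe.best_params_)
```

Because the imputation lives inside the pipeline, it is refit on each training fold, which avoids leaking information from the validation folds.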