# Statistics module

## Project A - Analysis of the "Wine Quality" dataset

### Instructions

- The project must be submitted by 22/11, before the start of class
- The project may be done in groups of up to 4 members
- There will be 2 projects, A and B, but each group chooses and submits only 1 of them

- The submission must be a Jupyter notebook with explicit, commented code. In addition, the concepts, decisions, and conclusions used must be highlighted in the notebook

### Project information

The dataset to be used is available at:
    https://archive.ics.uci.edu/ml/datasets/Wine+Quality

Data Set Information:

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: [Web Link] or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.


Attribute Information:

Input variables (based on physicochemical tests): <br>
- 1 - fixed acidity : The primary fixed acids found in wine are tartaric, succinic, citric, and malic.
- 2 - volatile acidity : The volatile (gaseous) acids present in wine.
- 3 - citric acid : A weak organic acid, found naturally in citrus fruits.
- 4 - residual sugar : Amount of sugar left after fermentation.
- 5 - chlorides : Amount of salt present in wine.
- 6 - free sulfur dioxide : SO2 is used to protect wine from oxidation and microbial spoilage.
- 7 - total sulfur dioxide
- 8 - density
- 9 - pH : Used to measure the acidity of the wine.
- 10 - sulphates : Added sulfites preserve freshness and protect wine from oxidation and bacteria.
- 11 - alcohol : Percentage of alcohol present in wine.

Output variable (based on sensory data): <br>
- 12 - quality (score between 0 and 10)

Number of Instances: red wine - 1599; white wine - 4898

### Step 1

**EDA - Exploratory data analysis**

- Analysis of measures of position, dispersion, and correlation (univariate and bivariate analyses): histogram, boxplot, heatmap, etc.
- Removal of outliers, if necessary (always explaining the choice)
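
One common way to approach the outlier step is Tukey's 1.5 × IQR rule. The sketch below is only an illustration under stated assumptions: the helper name `iqr_outlier_mask` and the toy values are hypothetical and are not part of the assignment or of the wine data.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy values standing in for one physicochemical column (made up for illustration)
toy = pd.Series([0.07, 0.08, 0.075, 0.09, 0.085, 0.08, 0.61], name="chlorides")
mask = iqr_outlier_mask(toy)

print(toy[mask])      # values flagged as outliers
cleaned = toy[~mask]  # values kept for further analysis
```

Whether flagged rows should actually be dropped remains a modelling decision that the notebook has to justify, since extreme values may belong to the rare excellent or poor wines.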


In [1]:
# Import libraries
import pandas as pd # standard library for data manipulation
import numpy as np # numerical arrays and vector operations
import seaborn as sns # statistical data visualization
import matplotlib.pyplot as plt # plotting and figure customization

# Display scikit-learn estimators as diagrams
from sklearn import set_config
set_config(display="diagram")

# Preprocessing
from sklearn.preprocessing import StandardScaler # standardizes features to zero mean and unit variance
from sklearn.preprocessing import PowerTransformer # makes skewed features more Gaussian-like (Yeo-Johnson or Box-Cox)
from sklearn.preprocessing import QuantileTransformer # maps features to a uniform or normal distribution via quantiles
from sklearn.preprocessing import PolynomialFeatures # generates polynomial and interaction terms
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.utils import resample # draws a random sample; used here to downsample the white wine dataset

# Model selection
from sklearn.model_selection import train_test_split # splits the dataset into training and test sets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold # repeated k-fold cross-validation
from sklearn.model_selection import StratifiedShuffleSplit
from numpy import arange

# Models
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, ElasticNet, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

# Metrics
from sklearn.metrics import r2_score, mean_squared_error # regression metrics
from sklearn.metrics import accuracy_score, confusion_matrix # classification metrics
from sklearn.metrics import plot_confusion_matrix # removed in scikit-learn 1.2; requires an older version

# Import plotly and cufflinks in offline mode
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)


In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
# Only needed on Google Colab
# Mount Google Drive to access the files
# from google.colab import drive
# drive.mount('/content/gdrive')
#
# # Store the project path
# root_path = 'gdrive/My Drive/Estatistica_Projeto_A/'
#
# # Create the DataFrame objects
# df_red = pd.read_csv(root_path+'winequality-red.csv',sep=';')
# df_white = pd.read_csv(root_path+'winequality-white.csv',sep=';')

In [4]:
# Create the DataFrame objects (the CSV files use ';' as the separator)
df_red = pd.read_csv('winequality-red.csv',sep=';')
df_white = pd.read_csv('winequality-white.csv',sep=';')

In [5]:
# characteristics of red wines
print(df_red.head())
print(df_red.info())
print(df_red.describe())


   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 

In [6]:
# characteristics of white wines
print(df_white.head())
print(df_white.info())
print(df_white.describe())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.0              0.27         0.36            20.7      0.045   
1            6.3              0.30         0.34             1.6      0.049   
2            8.1              0.28         0.40             6.9      0.050   
3            7.2              0.23         0.32             8.5      0.058   
4            7.2              0.23         0.32             8.5      0.058   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 45.0                 170.0   1.0010  3.00       0.45   
1                 14.0                 132.0   0.9940  3.30       0.49   
2                 30.0                  97.0   0.9951  3.26       0.44   
3                 47.0                 186.0   0.9956  3.19       0.40   
4                 47.0                 186.0   0.9956  3.19       0.40   

   alcohol  quality  
0      8.8        6  
1      9.5        6  
2     10.1        6 

The white wine dataset contains many more samples than the red wine dataset (4898 vs. 1599). To avoid developing a model biased toward white wines, we downsample the white wine dataset before exploring the data further.


In [7]:
# Checking whether any instances have missing data
df_red.isna().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [8]:
df_white.isna().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [9]:
# Resampling: downsample the white wines to the size of the red wine dataset
m,n = df_red.shape
df_white_resample = resample(df_white, 
                              replace = False, # sample without replacement (a random subset of rows)
                              n_samples = m, # same size as the df_red dataset
                              random_state = 42 ) # fix the random seed so the subset is the same across runs

In [10]:
df_red.insert(0,'type','red') # add a column identifying the wine type
df_white_resample.insert(0,'type','white') # add a column identifying the wine type

# Adding the two datasets together in a single dataframe
df_all = pd.concat([df_red,df_white_resample])

In [11]:
# Adding a new column that classifies each wine as 'good' (quality >= 6) or 'bad'
df_all['qualityclass'] = df_all.quality.apply((lambda x: 'good' if x >= 6 else 'bad'))
print(df_all.qualityclass.head())
df_all.info()

0     bad
1     bad
2     bad
3    good
4     bad
Name: qualityclass, dtype: object
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3198 entries, 0 to 4144
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   type                  3198 non-null   object 
 1   fixed acidity         3198 non-null   float64
 2   volatile acidity      3198 non-null   float64
 3   citric acid           3198 non-null   float64
 4   residual sugar        3198 non-null   float64
 5   chlorides             3198 non-null   float64
 6   free sulfur dioxide   3198 non-null   float64
 7   total sulfur dioxide  3198 non-null   float64
 8   density               3198 non-null   float64
 9   pH                    3198 non-null   float64
 10  sulphates             3198 non-null   float64
 11  alcohol               3198 non-null   float64
 12  quality               3198 non-null   int64  
 13  qualityclass          3198 non-null   o

In [12]:
# Creating a list of categorical and numerical columns to use in our analysis
numerical= df_all.drop(['quality'], axis=1).select_dtypes('number').columns

categorical = df_all.select_dtypes('object').columns

print(f'Numerical Columns:  {df_all[numerical].columns}')
print('\n')
print(f'Categorical Columns: {df_all[categorical].columns}')

Numerical Columns:  Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol'],
      dtype='object')


Categorical Columns: Index(['type', 'qualityclass'], dtype='object')


In [13]:
skew_limit = 0.75 # Threshold for flagging skewed features. An absolute skewness below about 1 is generally acceptable for linear models.
skew_vals = df_all[numerical].skew()
skew_cols = skew_vals[abs(skew_vals)> skew_limit].sort_values(ascending=False)
skew_cols

chlorides              4.958764
free sulfur dioxide    1.971879
sulphates              1.835162
residual sugar         1.696211
fixed acidity          1.445604
volatile acidity       0.964305
dtype: float64
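
The skewed columns listed above are later passed through a Yeo-Johnson `PowerTransformer`. As a rough illustration of what that transform does to skewness, here is a minimal sketch on synthetic log-normal data; the column name `x` and the sample are assumptions made for the example, not the wine data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# Synthetic right-skewed column (log-normal); NOT the wine data
toy = pd.DataFrame({"x": rng.lognormal(mean=0.0, sigma=1.0, size=2000)})

skew_before = toy["x"].skew()

# Yeo-Johnson estimates a power parameter that makes the column more Gaussian-like
pt = PowerTransformer(method="yeo-johnson", standardize=True)
x_transformed = pt.fit_transform(toy[["x"]]).ravel()
skew_after = pd.Series(x_transformed).skew()

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

In a real workflow the transformer should be fitted on the training split only, which is what placing it inside a pipeline achieves.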

In [14]:
print( f"Skewness: {df_all['quality'].skew()}")

Skewness: 0.18060745486468138


In [15]:
df_all['volatile acidity'].iplot(kind='hist');

In [16]:
# Plotting all numeric variables except the output variable
df_all[numerical].iplot(kind='hist',subplots=True,bins=50);

In [17]:
# Plotting the output variable
df_all.quality.iplot(kind='hist')


### Step 2

**Linear Regression**

- Build an algorithm that estimates the variable “Quality” as a function of the wines' physicochemical characteristics
- Add comments on the technique used and an analysis of the variables used, along with the respective findings. Interpret the result
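
To make the interpretation part of this step concrete, the sketch below fits a standardized linear regression on synthetic data and reads off the coefficients. The data-generating formula and all `*_toy` names are assumptions for illustration only; with standardized inputs, each coefficient is the expected change in the target per one standard deviation of that feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features: the third one has no effect on the target by construction
X_toy = rng.normal(size=(500, 3))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# The scaler is fitted on the training split only, inside the pipeline
pipe_toy = make_pipeline(StandardScaler(), LinearRegression())
pipe_toy.fit(X_tr, y_tr)

coefs = pipe_toy.named_steps["linearregression"].coef_
r2_toy = r2_score(y_te, pipe_toy.predict(X_te))
print("standardized coefficients:", np.round(coefs, 2))
print("test R2:", round(r2_toy, 3))
```

The sign of each coefficient gives the direction of the association and its magnitude gives a rough ranking of influence, provided the features are not strongly collinear.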



In [18]:
# Transforming categorical variables into numeric values
df_all_dummies = pd.get_dummies(df_all, prefix_sep='_', columns=['type', 
                                                                'qualityclass'])
# Checking our dataset
df_all_dummies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3198 entries, 0 to 4144
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         3198 non-null   float64
 1   volatile acidity      3198 non-null   float64
 2   citric acid           3198 non-null   float64
 3   residual sugar        3198 non-null   float64
 4   chlorides             3198 non-null   float64
 5   free sulfur dioxide   3198 non-null   float64
 6   total sulfur dioxide  3198 non-null   float64
 7   density               3198 non-null   float64
 8   pH                    3198 non-null   float64
 9   sulphates             3198 non-null   float64
 10  alcohol               3198 non-null   float64
 11  quality               3198 non-null   int64  
 12  type_red              3198 non-null   uint8  
 13  type_white            3198 non-null   uint8  
 14  qualityclass_bad      3198 non-null   uint8  
 15  qualityclass_good    

In [28]:
# Running several basic models with their default parameters
rmse_test =[]
r2_test =[]
model_names =[]

numerical2= df_all_dummies.drop(['quality','qualityclass_bad','qualityclass_good'], axis=1).select_dtypes('number').columns

X= df_all_dummies.drop(['quality','qualityclass_bad','qualityclass_good'], axis=1)
y= df_all_dummies['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

s = StandardScaler()
p = PowerTransformer(method='yeo-johnson', standardize=True)
#p = QuantileTransformer(output_distribution='normal')
poly = PolynomialFeatures(degree=2)

lin = LinearRegression()
rr = Ridge()
las = Lasso()
el= ElasticNet()
knn = KNeighborsRegressor()
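#clf_rbf_svm = SVC(kernel='rbf',gamma=1, C=1,decision_function_shape='ovr')
clf_rbf_svm = SVC(kernel='rbf',gamma='auto') # a classifier: it treats each quality score as a class label
# NOTE: penalty='elasticnet' requires l1_ratio; without it, fit() raises the ValueError shown in the output of the next cell
log_reg = LogisticRegression(penalty='elasticnet',multi_class='multinomial',solver='saga', random_state=42)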

models = [lin,rr,las,el,knn,clf_rbf_svm,log_reg]

ct = make_column_transformer((p,skew_cols.index),(s,list(set(numerical2.values).difference(set(skew_cols.index.values)))),remainder='passthrough')

clf = make_pipeline(ct, LinearRegression())
clf

In [29]:

for model in models:
      
    pipe = make_pipeline(ct, model)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    rmse_test.append(round(np.sqrt(mean_squared_error(y_test, y_pred)),2))
    r2_test.append(round(r2_score(y_test, y_pred),2))
    print (f'model : {model} and  rmse score is : {round(np.sqrt(mean_squared_error(y_test, y_pred)),2)}, r2 score is {round(r2_score(y_test, y_pred),2)}')

model_names = ['Linear Regression','Ridge','Lasso','ElasticNet','KNeighbors','SVM rbf','Logistic_Regression']
result_df = pd.DataFrame({'RMSE':rmse_test,'R2_Test':r2_test}, index=model_names)
result_df

model : LinearRegression() and  rmse score is : 0.7, r2 score is 0.28
model : Ridge() and  rmse score is : 0.7, r2 score is 0.28
model : Lasso() and  rmse score is : 0.83, r2 score is -0.0
model : ElasticNet() and  rmse score is : 0.83, r2 score is -0.0
model : KNeighborsRegressor() and  rmse score is : 0.73, r2 score is 0.23
model : SVC(gamma='auto') and  rmse score is : 0.75, r2 score is 0.19


ValueError: l1_ratio must be between 0 and 1; got (l1_ratio=None)
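
The ValueError above is raised because `penalty='elasticnet'` needs an explicit `l1_ratio` (the mix between L1 and L2 regularization). A minimal sketch of a working configuration follows, on synthetic data; the value `l1_ratio=0.5`, the raised `max_iter`, and the toy arrays are assumptions chosen for the example, not tuned settings for the wine data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic binary problem: the label depends on the first two features only
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

clf_en = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="elasticnet",
        solver="saga",      # the solver that supports the elastic-net penalty
        l1_ratio=0.5,       # assumed value: equal mix of L1 and L2
        max_iter=5000,      # saga may need more iterations to converge
        random_state=42,
    ),
)
clf_en.fit(X_toy, y_toy)
acc = clf_en.score(X_toy, y_toy)
print("training accuracy:", round(acc, 3))
```

Note that logistic regression predicts class labels, so RMSE and R2 computed on its output compare labels rather than continuous estimates.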

In [21]:
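# Experimenting with the SVM algorithm (a classifier: each quality score is treated as a class)
C = 1.0  # SVM regularization parameter
gamma = 1 # RBF kernel coefficient
svm_scaler = StandardScaler()
X_train_scaled = svm_scaler.fit_transform(X_train) # fit the scaler on the training set only
X_test_scaled = svm_scaler.transform(X_test)
clf_rbf_svm = SVC(kernel='rbf',gamma=gamma, C=C,decision_function_shape='ovr')
clf_rbf_svm.fit(X_train_scaled, y_train)
y_test_pred = clf_rbf_svm.predict(X_test_scaled)
print(clf_rbf_svm.score(X_test_scaled,y_test)) # mean accuracy on the test set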

In [None]:
# Dropping the target and the quality-class dummy columns (which are derived from the target) before modelling
X = df_all_dummies.drop(['quality', 'qualityclass_bad', 'qualityclass_good'], axis = 1)
y = df_all_dummies['quality']

In [None]:
# Splitting our data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
std = StandardScaler()
X_train_std = std.fit_transform(X_train) # fit the scaler on the training set only, then reuse it on the test set, to avoid information leakage
X_test_std = std.transform(X_test)

In [None]:
# Instantiate the model
linreg = LinearRegression()
# Fit the data (i.e., pass the training data to the model so it can learn from it)
linreg.fit(X_train_std, y_train)
# Predict on the unseen test set
y_pred = linreg.predict(X_test_std)
# Measure the quality of the results
R2 = r2_score(y_test, y_pred)
print('R2:', R2)

In [None]:
# Instantiate the model
ridge = Ridge()
# Fit the data (i.e., pass the training data to the model so it can learn from it)
ridge.fit(X_train_std, y_train)
# Predict on the unseen test set
y_pred_ridge = ridge.predict(X_test_std)
R2_ridge = r2_score(y_test, y_pred_ridge)
print('R2 - Ridge:', np.round(R2_ridge, 4))

In [None]:
# Instantiate the model
lasso = Lasso()
# Fit the data (i.e., pass the training data to the model so it can learn from it)
lasso.fit(X_train_std, y_train)
# Predict on the unseen test set
y_pred_lasso = lasso.predict(X_test_std)
R2_lasso = r2_score(y_test, y_pred_lasso)
print('R2 - Lasso:', np.round(R2_lasso, 4))

In [None]:
# Instantiate the model
EN = ElasticNet()
# Fit the data (i.e., pass the training data to the model so it can learn from it)
EN.fit(X_train_std, y_train)
# Predict on the unseen test set
y_pred_EN = EN.predict(X_test_std)
R2_EN = r2_score(y_test, y_pred_EN)
print('R2 - ElasticNet:', np.round(R2_EN, 4))

In [None]:
from pca import pca

In [None]:
# Initialize
model_pca = pca()
# Fit transform
out = model_pca.fit_transform(X)

# Print the top features. The results show that f1 is best, followed by f2 etc
print(out['topfeat'])

In [None]:
# Define the model evaluation method
#cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# Define the model: RidgeCV picks the best alpha from the grid using 10-fold cross-validation
ridge_cv = RidgeCV(alphas=arange(0, 5, 0.01), cv=10)


# Fit the data (i.e., pass the training data to the model so it can learn from it)
ridge_cv.fit(X_train_std, y_train)
# Predict on the unseen test set
y_pred_ridge_cv = ridge_cv.predict(X_test_std)
R2_ridge_cv = r2_score(y_test, y_pred_ridge_cv)
print('R2 - RidgeCV:', np.round(R2_ridge_cv, 4))

In [None]:
model_pca.plot()

In [None]:
#Rescaling the data
scaler_pca = StandardScaler() 

X_pca_rescale = scaler_pca.fit_transform(X_test) #By default, PCA() centers the data, but does not scale it.

pca = PCA()
X_pca_rescale = pca.fit_transform(X_pca_rescale)
per_var =  np.round(pca.explained_variance_ratio_*100, decimals=1)
labels = [str(x) for x in range(1,len(per_var)+1)]
plt.clf()
plt.bar(x=range(1,len(per_var)+1),height=per_var)
plt.tick_params(axis = 'x',
                which = 'both',
                bottom = False,
                top = False,
                labelbottom= True)
#plt.xticks([1, 2, 3], ['PC 1', 'PC 2', 'PC 3'])
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Components')
plt.title('Scree Plot')
plt.show()
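
A scree plot is often read together with the cumulative explained variance, which gives a simple rule for how many components to keep. The sketch below illustrates that rule on synthetic data built from two latent factors; the 95% threshold and every `*_toy` name are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Six observed features generated from two latent factors plus small noise (synthetic)
latent = rng.normal(size=(400, 2))
loadings = rng.normal(size=(2, 6))
X_toy = latent @ loadings + 0.1 * rng.normal(size=(400, 6))

# PCA centers the data but does not scale it, so standardize first
X_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA().fit(X_scaled)

cum_var = np.cumsum(pca_toy.explained_variance_ratio_)
# Smallest number of components whose cumulative share reaches 95% (assumed threshold)
n_keep = int(np.searchsorted(cum_var, 0.95) + 1)

print("cumulative explained variance:", np.round(cum_var, 3))
print("components kept:", n_keep)
```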

In [None]:
i = np.identity(X_test.shape[1])  # identity matrix
coef = pca.transform(i)
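# explained_variance_ holds one value per principal component, not per original feature
pd.DataFrame(pca.explained_variance_, index=[f'PC{k+1}' for k in range(pca.n_components_)], columns=['explained variance'])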

In [None]:
# number of components
n_pcs= pca.components_.shape[0]

# for each component, index of the feature with the largest absolute loading
most_important = [np.abs(pca.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = X_test.columns

# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# using LIST COMPREHENSION HERE AGAIN
dic = {(i+1): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df = pd.DataFrame(sorted(dic.items()),columns=['PC','Feature'])

df

In [None]:
# Reducing the set of variables
#X_reduce = df_all_dummies[df.Feature[0:8]]
feature_select = ['total sulfur dioxide','free sulfur dioxide','residual sugar','fixed acidity','alcohol','type_red','volatile acidity']
X_reduce = df_all_dummies[feature_select]
y = df_all_dummies['quality']

In [None]:
# Splitting our data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X_reduce, y, test_size=0.3, random_state = 42)
std = StandardScaler()
X_train_std = std.fit_transform(X_train) # fit the scaler on the training set only, then reuse it on the test set, to avoid information leakage
X_test_std = std.transform(X_test)

In [None]:
# Instantiate the model
linreg = LinearRegression()
# Fit the data (i.e., pass the training data to the model so it can learn from it)
linreg.fit(X_train_std, y_train)
# Predict on the unseen test set
y_pred = linreg.predict(X_test_std)
# Measure the quality of the results
R2 = r2_score(y_test, y_pred)
print('R2:', R2)

In [None]:
# Instantiate the model
ridge = Ridge()
# Fit the data (i.e., pass the training data to the model so it can learn from it)
ridge.fit(X_train_std, y_train)
# Predict on the unseen test set
y_pred_ridge = ridge.predict(X_test_std)
R2_ridge = r2_score(y_test, y_pred_ridge)
print('R2 - Ridge:', np.round(R2_ridge, 4))

In [None]:
# Define the model evaluation method
#cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# Define the model: RidgeCV picks the best alpha from the grid using 10-fold cross-validation
ridge_cv = RidgeCV(alphas=arange(0, 5, 0.01), cv=10)


# Fit the data (i.e., pass the training data to the model so it can learn from it)
ridge_cv.fit(X_train_std, y_train)
# Predict on the unseen test set
y_pred_ridge_cv = ridge_cv.predict(X_test_std)
R2_ridge_cv = r2_score(y_test, y_pred_ridge_cv)
print('R2 - RidgeCV:', np.round(R2_ridge_cv, 4))

### Step 3

**Logistic regression**

- Given that wines with scores >= 6 are considered good-quality wines, build an algorithm that classifies wines as “Good” or “Bad” based on their physicochemical characteristics;
- Add comments on the technique used and an analysis of the variables used, along with the respective findings. Interpret the result
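
As a possible starting point for this step, the sketch below trains a logistic regression on a good/bad label derived from a score threshold of 6 and reports odds ratios. The two features, the formula generating the score, and all `*_toy` names are synthetic assumptions for illustration; they are not the wine data and not the notebook's own solution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Synthetic stand-ins for two features and a quality score (NOT the wine data)
alcohol = rng.normal(10.5, 1.0, size=600)
acidity = rng.normal(0.5, 0.15, size=600)
score = 6 + 0.8 * (alcohol - 10.5) - 3.0 * (acidity - 0.5) + rng.normal(0, 0.5, size=600)

X_toy = np.column_stack([alcohol, acidity])
y_toy = np.where(score >= 6, "good", "bad")  # same threshold as in the assignment

# Stratify so both splits keep the same good/bad proportion
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

logit = make_pipeline(StandardScaler(), LogisticRegression())
logit.fit(X_tr, y_tr)

cm = confusion_matrix(y_te, logit.predict(X_te), labels=["bad", "good"])
# exp(coefficient) = multiplicative change in the odds of "good" per one standard deviation
odds_ratios = np.exp(logit.named_steps["logisticregression"].coef_[0])

print(cm)
print("odds ratios (alcohol, acidity):", np.round(odds_ratios, 2))
```

An odds ratio above 1 means the feature raises the odds of a wine being classified as good, and a value below 1 means it lowers them.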

In [None]:
df_all_dummies2 = pd.get_dummies(df_all, prefix_sep='_', columns=['type'])
# Dropping the quality score and the class label from the features; the class label is the target

X = df_all_dummies2.drop(['quality','qualityclass'], axis = 1)
y = df_all_dummies2['qualityclass']

In [None]:
# Splitting our data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)

In [None]:
## Create a decision tree and fit it to the training data
clf_dt = DecisionTreeClassifier(random_state=42)
clf_dt = clf_dt.fit(X_train, y_train)
clf_dt.score(X_test,y_test)



In [None]:
plt.clf()
plt.figure(figsize=(15, 7.5))
plot_tree(clf_dt, 
          filled=True, 
          rounded=True, 
          class_names=["bad", "good"], 
          feature_names=X.columns); 

In [None]:
plot_confusion_matrix(clf_dt, 
                      X_test, 
                      y_test, 
                      display_labels=["bad", "good"])

In [None]:
path = clf_dt.cost_complexity_pruning_path(X_train, y_train) # determine values for alpha
ccp_alphas = path.ccp_alphas # extract different values for alpha
ccp_alphas = ccp_alphas[:-1] # exclude the maximum value for alpha

clf_dts = [] # create an array that we will put decision trees into

## now create one decision tree per value for alpha and store it in the array
for ccp_alpha in ccp_alphas:
    clf_dt = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf_dt.fit(X_train, y_train)
    clf_dts.append(clf_dt)

In [None]:
train_scores = [clf_dt.score(X_train, y_train) for clf_dt in clf_dts]
test_scores = [clf_dt.score(X_test, y_test) for clf_dt in clf_dts]
plt.clf()
fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test", drawstyle="steps-post")
ax.legend()
plt.show()

In [None]:
# Pick the alpha with the smallest gap between training and test accuracy
dif_scores = [train_scores[i]-test_scores[i] for i in range(len(test_scores))]
best_alpha = ccp_alphas[dif_scores.index(min(dif_scores))]
best_alpha
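
Choosing alpha from the train/test gap uses the test set for tuning, and the smallest gap can also point to an over-pruned tree. One alternative, sketched here on synthetic data, is to pick the alpha with the best cross-validated accuracy on the training split only. The dataset from `make_classification`, the 5 folds, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (NOT the wine data)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)
# Drop the largest alpha (single-node tree); clip tiny negative values from rounding
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))

# Mean 5-fold cross-validated accuracy for each candidate alpha, training split only
cv_means = [
    cross_val_score(
        DecisionTreeClassifier(random_state=42, ccp_alpha=a), X_tr, y_tr, cv=5
    ).mean()
    for a in alphas
]
best_alpha_cv = alphas[int(np.argmax(cv_means))]

pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha_cv).fit(X_tr, y_tr)
print("best alpha:", best_alpha_cv, "test accuracy:", round(pruned.score(X_te, y_te), 3))
```

With this approach the test set is touched only once, for the final accuracy estimate.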

In [None]:
## Build and train a new decision tree, only this time use the optimal value for alpha
clf_dt_pruned = DecisionTreeClassifier(random_state=42, 
                                       ccp_alpha=best_alpha)
clf_dt_pruned = clf_dt_pruned.fit(X_train, y_train)
clf_dt_pruned.score(X_test,y_test)

In [None]:
plt.figure(figsize=(20, 10))
plot_tree(clf_dt_pruned, 
          filled=True, 
          rounded=True, 
          class_names=["bad", "good"], 
          feature_names=X.columns); 

In [None]:
plot_confusion_matrix(clf_dt_pruned, 
                      X_test, 
                      y_test, 
                      display_labels=["bad", "good"])