# Poisonous or not (mushrooms)

The dataset contains a number of mushrooms, each labeled as poisonous or not poisonous. The goal is to build a model that helps determine which type a given mushroom is.

## Data analysis

**Importing libraries**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport  # successor of the deprecated pandas_profiling package

**Suppressing warnings in the log**

In [None]:
import warnings
warnings.filterwarnings('ignore')

**Reading the data**

Both files are read, the test set and the training set, along with the test labels.

The datasets are concatenated so that both can be analyzed together.

The `id` column is dropped since it is not a usable variable.

In [None]:
FOLDER_PATH = '../data/raw/'

df_train = pd.read_csv(f'{FOLDER_PATH}train.csv')
df_test = pd.read_csv(f'{FOLDER_PATH}test.csv')
df_class_test = pd.read_csv(f'{FOLDER_PATH}sample_submission.csv')
df_class_test.replace(
    {'Edibla': 'Edible', 'Poisonousa': 'Poisonous'}, inplace=True
)
df_test = df_test.merge(df_class_test, on='id')
df = pd.concat([df_train, df_test]).sort_values('id')
df.drop('id', inplace=True, errors='ignore', axis=1)
df
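
To make the merge-and-concatenate step easier to follow, here is a minimal sketch on toy data. The three small frames stand in for `train.csv`, `test.csv` and `sample_submission.csv`; the `odor` column and all values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (hypothetical column and values).
toy_train = pd.DataFrame(
    {'id': [1, 3], 'odor': ['Almond', 'Foul'], 'class': ['Edible', 'Poisonous']}
)
toy_test = pd.DataFrame({'id': [2, 4], 'odor': ['Almond', 'Foul']})
toy_labels = pd.DataFrame({'id': [2, 4], 'class': ['Edibla', 'Poisonousa']})

# Fix the misspelled labels, attach them to the test rows by id,
# then stack both sets, restore the id order and drop the id column.
toy_labels = toy_labels.replace({'Edibla': 'Edible', 'Poisonousa': 'Poisonous'})
toy_test = toy_test.merge(toy_labels, on='id')
toy_df = pd.concat([toy_train, toy_test]).sort_values('id').drop(columns='id')

print(toy_df['class'].tolist())  # ['Edible', 'Edible', 'Poisonous', 'Poisonous']
```

The default inner merge keeps only the test rows whose `id` has a label, so comparing the row count before and after the merge is a cheap sanity check.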

**The data types are checked; every column turns out to be of type object**

They are most likely categorical variables

In [None]:
df.dtypes

**The dataset is described to see how many unique values each column has**

In [None]:
df.describe(include='all').T

The correlations between variables are examined

In [None]:
corr = df.apply(lambda x: pd.factorize(x)[0]).corr(
    method='pearson', min_periods=1
)
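
As a small illustration of what `pd.factorize` does before the correlation is computed, the sketch below encodes a toy column (the values are hypothetical). Codes are assigned in order of first appearance, so the numeric distance between two codes carries no meaning, and the Pearson coefficients computed on them are best read as a rough indication only.

```python
import pandas as pd

# Hypothetical categorical column; the values are illustrative only.
colors = pd.Series(['Brown', 'White', 'Brown', 'Yellow'])

# factorize returns integer codes plus the unique values,
# both in order of first appearance.
codes, uniques = pd.factorize(colors)

print(codes.tolist())  # [0, 1, 0, 2]
print(list(uniques))   # ['Brown', 'White', 'Yellow']
```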

In [None]:
sns.set_theme(style="white")

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(18, 18))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

sns.heatmap(
    corr, mask=mask, cmap=cmap, vmax=.3, center=0, square=True,
    linewidths=.5, cbar_kws={"shrink": .5}, annot=True
)
plt.show()

The string values "None" and "?" are replaced with NaN so that every column containing missing values can be dropped

In [None]:
df = df.replace({"None": np.nan, "?": np.nan})
df.dropna(axis=1, inplace=True)
df.columns.tolist()
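
Since `dropna(axis=1)` removes a column as soon as it holds a single NaN, it can help to look at the share of missing values per column first. The sketch below does this on a toy frame; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with the two placeholder strings used for missing data.
toy = pd.DataFrame({
    'odor': ['None', 'Foul', 'Almond', 'None'],
    'stalk_root': ['?', 'Bulbous', '?', '?'],
    'cap_shape': ['Flat', 'Bell', 'Flat', 'Convex'],
})

toy = toy.replace({'None': np.nan, '?': np.nan})

# Fraction of missing values per column, before anything is dropped.
missing_share = toy.isna().mean()
print(missing_share.to_dict())  # {'odor': 0.5, 'stalk_root': 0.75, 'cap_shape': 0.0}

# Only columns without any missing value survive.
kept = toy.dropna(axis=1)
print(kept.columns.tolist())  # ['cap_shape']
```

If dropping on a single NaN turns out to be too aggressive, `dropna` also accepts a `thresh` argument that sets the minimum number of non-missing values a column needs to be kept.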

In [None]:
df.head()

*The correlations are computed. I tried keeping only the highest ones, but the models scored very low when trained on them*

In [None]:
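corr = df.apply(lambda x: pd.factorize(x)[0]).corr(
    method='pearson', min_periods=1
)
# Keep the columns whose correlation with 'class' lies in [-0.50, 0.50];
# this also leaves out 'class' itself, whose self-correlation is 1.
colums_corr = corr[['class']][
    (corr['class'] <= 0.50) & (corr['class'] >= -0.50)
]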
colums_corr

The names of the columns that will be used are extracted

In [None]:
columns_to_use = colums_corr.index.tolist()
columns_to_use

Looking over the data, all the variables appear to be categorical, so they will have to be transformed

In [None]:
df_work = df[columns_to_use + ["class"]]
df_work.head()

In [None]:
df_work.shape

In [None]:
# Reassign instead of modifying in place: df_work is a slice of df
df_work = df_work.drop_duplicates()
df_work.shape

In [None]:
df_original = df_work.copy()
df_work_copy = df_work.copy()

## Variable treatment

The variables are encoded: `class` becomes 1 (Poisonous) or 0 so there is an output to compare against, and every other variable is expanded into one 0/1 column per distinct value it takes.

I also tried encoding the variables as integers, but the models turned out very accurate, which made me suspect it was not the best solution

In [None]:
for column in df_original:
    if column == 'class':
        # Target: 1 for 'Poisonous', 0 otherwise
        df_work_copy[column] = (
            df_work_copy[column] == 'Poisonous'
        ).astype(int)
        continue

    # One 0/1 indicator column per distinct value of the feature
    for unique in df_work_copy[column].unique().tolist():
        df_work_copy[f'{column}_{unique}'] = (
            df_work_copy[column] == unique
        ).astype(int)

    # The original categorical column is no longer needed
    df_work_copy.drop(column, axis=1, inplace=True)

df_work_copy.head()
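
The manual loop above amounts to one-hot encoding. As a hedged alternative, the sketch below gets a comparable result with `pd.get_dummies` on a toy frame; the `odor` column and its values are hypothetical, and this is not the notebook's own method.

```python
import pandas as pd

# Toy frame: one target column and one categorical feature.
toy = pd.DataFrame({
    'class': ['Poisonous', 'Edible', 'Edible'],
    'odor': ['Foul', 'Almond', 'Foul'],
})

# Binary target: 1 for 'Poisonous', 0 otherwise.
target = (toy['class'] == 'Poisonous').astype(int)

# One 0/1 column per distinct value of every remaining feature.
features = pd.get_dummies(toy.drop(columns='class'), dtype=int)

print(features.columns.tolist())  # ['odor_Almond', 'odor_Foul']
print(target.tolist())            # [1, 0, 0]
```

One difference to keep in mind: `get_dummies` orders the new columns by sorted category value, while the loop orders them by first appearance.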

## Training

The features and the target are separated to train the candidate models

In [None]:
df = df_work_copy.copy()
# Select the features by name rather than relying on 'class' being the first column
X = df.drop(columns='class')
y = df['class']
X

The libraries for the models to be used are imported

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

Splitting the data into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X.shape
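
The split above is random on every run. As a sketch of a reproducible variant, `train_test_split` accepts `random_state` to fix the shuffle and `stratify` to keep the class proportions equal in both parts. The toy arrays and the seed value 42 are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, perfectly balanced binary target.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# random_state fixes the shuffle; stratify keeps the 50/50 class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)  # (16, 2) (4, 2)
print(int(y_te.sum()))         # 2
```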

Creating the ML models (instantiating the classes)

In [None]:
lr = LogisticRegression()
multi_nb = GaussianNB()
knn = KNeighborsClassifier()
svc = SVC()
tree = DecisionTreeClassifier(
    criterion='log_loss',
    max_features="sqrt"
)

### Working with logistic regression

In [None]:
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
result = lr.score(X_test, y_test)

### Validations

In [None]:
"Accuracy: %.2f%%" % (result*100.0)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm
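
To make the layout of `cm` concrete: scikit-learn puts the true labels on the rows and the predictions on the columns, so for a 0/1 target the flattened matrix reads TN, FP, FN, TP. The sketch below uses made-up labels, with 1 standing for poisonous as in the encoding used here.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = poisonous, 0 = edible.
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(int(tn), int(fp), int(fn), int(tp))  # 2 1 1 2

# Recall on the poisonous class: share of poisonous samples that were caught.
recall_poisonous = tp / (tp + fn)
print(round(float(recall_poisonous), 2))  # 0.67
```

In this setting a false negative is a poisonous mushroom predicted as edible, so the recall of class 1 is arguably worth watching alongside the overall accuracy.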

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
scores = cross_val_score(lr, X_train, y_train, cv=10)
scores

In [None]:
scores.mean()

### Working with Naive Bayes

In [None]:
multi_nb.fit(X_train, y_train)
y_pred = multi_nb.predict(X_test)
result = multi_nb.score(X_test, y_test)

### Validations

In [None]:
"Accuracy: %.2f%%" % (result*100.0)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
scores = cross_val_score(multi_nb, X_train, y_train, cv=10)
scores

In [None]:
scores.mean()

### Working with KNN

In [None]:
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
result = knn.score(X_test, y_test)

### Validations

In [None]:
"Accuracy: %.2f%%" % (result*100.0)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
scores = cross_val_score(knn, X_train, y_train, cv=10)
scores

In [None]:
scores.mean()

### Working with Support Vector Machine

In [None]:
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
result = svc.score(X_test, y_test)

### Validations

In [None]:
"Accuracy: %.2f%%" % (result*100.0)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
scores = cross_val_score(svc, X_train, y_train, cv=10)
scores

In [None]:
scores.mean()

### Working with a decision tree

In [None]:
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
result = tree.score(X_test, y_test)

### Validations

In [None]:
"Accuracy: %.2f%%" % (result*100.0)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
scores = cross_val_score(tree, X_train, y_train, cv=10)
scores

In [None]:
scores.mean()
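
The five sections above repeat the same fit, score and cross-validation steps. A hypothetical helper such as `evaluate` below could run them in a loop. This is only a sketch under assumptions: the data comes from `make_classification` rather than the mushroom dataset, and only two of the models are included to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the mushroom dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)


def evaluate(model, X_tr, y_tr, X_te, y_te, cv=10):
    """Fit the model; return (test accuracy, mean cross-validation accuracy)."""
    model.fit(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    cv_acc = cross_val_score(model, X_tr, y_tr, cv=cv).mean()
    return test_acc, cv_acc


models = {
    'lr': LogisticRegression(),
    'tree': DecisionTreeClassifier(random_state=0),
}
results = {
    name: evaluate(model, X_tr, y_tr, X_te, y_te)
    for name, model in models.items()
}

for name, (test_acc, cv_acc) in results.items():
    print(f'{name}: test={test_acc:.2f} cv={cv_acc:.2f}')
```

Collecting the scores in one dictionary also makes it easier to compare the models side by side than scrolling between sections.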

In [None]:
# profile = ProfileReport(df, title="Pandas Profiling Report")
# profile.to_file("../reports/second_chance.html")