# Quiz 2 Starter Code

You can use this notebook to arrive at the answers for the quiz on Model Evaluation. It also provides some hints and starter code.

For this quiz, we use a dataset derived from a variety of sources, including the Chicago Open Data portal and the Census. For details, see the quiz.

First, we'll load the data for you:

In [1]:
import pandas as pd

crime_acs = pd.read_csv("../data/crime_acs.csv")

Your target variable will be Arrest. Your classifier should consider the following features:

- `Primary Type`
- `Ward`
- `FBI Code`
- `Percent White`
- `Percent Black`
- `Median Income`

## Part 1: Data Preparation


### Question 1
First, create the training and testing sets. Use a test split of 0.2 (`test_size=0.2`) and a `random_state` of 0.
As a quick sanity check, how many rows are in the training set?

In [2]:
from sklearn.model_selection import train_test_split
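# Split into training and testing sets using random_state=0. Then, find the number of rows in the training set.
c_train, c_test = train_test_split(crime_acs, test_size=0.2, random_state=0)
# Note: `c_train.head` without parentheses displays the bound method together with the frame's repr;
# the row count itself is available as c_train.shape[0] or len(c_train).
c_train.head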

<bound method NDFrame.head of                        Date        Primary Type  \
35016   2019-02-03 04:57:00     CRIMINAL DAMAGE   
56096   2019-02-21 02:21:00               THEFT   
78820   2019-06-24 11:20:00             BATTERY   
139403  2019-07-19 03:05:00             BATTERY   
90866   2019-08-16 01:38:00           NARCOTICS   
...                     ...                 ...   
176963  2019-07-25 09:00:00  DECEPTIVE PRACTICE   
117952  2019-09-13 01:05:00             BATTERY   
173685  2019-05-09 10:44:00        PROSTITUTION   
43567   2019-09-05 08:19:00     CRIMINAL DAMAGE   
199340  2019-09-03 11:21:00             BATTERY   

                                            Description  \
35016                                       TO PROPERTY   
56096                                      RETAIL THEFT   
78820                                            SIMPLE   
139403                          DOMESTIC BATTERY SIMPLE   
90866                               POSS: HEROIN(WHITE)   
...
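
The row count asked for in Question 1 can be read from the `shape` of the training frame. A minimal sketch on a synthetic stand-in frame (the `toy` data and its column values are made up for illustration and are not the quiz dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for crime_acs: 10 rows of made-up data.
toy = pd.DataFrame({"Ward": range(10), "Arrest": [True, False] * 5})

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

n_train_rows = toy_train.shape[0]  # equivalently, len(toy_train)
print(n_train_rows, toy_test.shape[0])
```

The same `shape[0]` call on `c_train` gives the answer for the real data.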

### Question 2

Next, convert the label "Arrest" into a numerical (rather than boolean) feature in both your training and testing data (i.e., a 1 for an arrest, and a 0 for no arrest).

In the training data, what fraction of recorded crimes resulted in an arrest? (Express it as a decimal between 0 and 1.)

In [4]:
# For the Arrest column, convert True to 1 and False to 0
# Do this for both the train and test sets
c_train['Arrest'] = c_train['Arrest'].astype(int)
c_test['Arrest'] = c_test['Arrest'].astype(int)

# Now, in the training data, find the fraction of crimes that resulted in arrest
# Your answer may look like train_df['Arrest'].value_counts() / train_df.shape[0]
p_arrest_c_train = c_train['Arrest'].value_counts() / c_train.shape[0]

print(p_arrest_c_train[1])

# Hint: your answer should be a number between .2 and .3

0.21383834211996794
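
Because the converted label is 0/1, its mean is the arrest fraction directly, which avoids indexing into `value_counts()`. A small sketch with made-up labels (the `arrest` series below is illustrative only):

```python
import pandas as pd

# Made-up labels for illustration only (not the quiz data).
arrest = pd.Series([True, False, False, True, False])

arrest_int = arrest.astype(int)   # True -> 1, False -> 0
frac_mean = arrest_int.mean()     # fraction of rows equal to 1

# Same quantity via value_counts; normalize=True divides by the row count.
frac_counts = arrest_int.value_counts(normalize=True)[1]

print(frac_mean, frac_counts)
```

Both expressions give the same number, so either form can be used for the sanity check.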


### Question 3

Next, we will want to pre-process the continuous numeric features (Percent White, Percent Black, Median Income). This will mean normalizing each feature, and imputing missing values.

First, note that administrative data often uses encodings to indicate missing data.

So we should make sure to perform sanity checks (e.g. ensure that your percentages fall between 0 and 1, that income follows a reasonable distribution, etc.) 

The `Median Income` uses such an administrative code for some missing values--what is that code?

In [5]:
# Take a look at the Median Income column: some values may be NaN, but some will be an administrative code

print(c_train['Median Income'].unique())
print(c_train['Median Income'].dtype)

# Check the unique values against a few candidate administrative codes
candidate_codes = (-42, -999999, -666666666.0)
for value in set(c_train["Median Income"].unique()):
    if value in candidate_codes:
        print(value)

[    nan  62212.  21792. ...  37670. 104444. 100788.]
float64
-666666666.0
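
Rather than testing a fixed list of guesses, one option is to look for values that are impossible for the variable. The sketch below assumes that a negative income can only be a placeholder; the series and the code `-1.0` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up incomes; -1.0 plays the role of a hypothetical administrative code.
income = pd.Series([52000.0, np.nan, -1.0, 61000.0, -1.0])

# Income cannot be negative, so any negative value is a candidate code.
suspect = income[income < 0].value_counts()
print(suspect)

# The minimum also exposes the code as an implausible value.
print(income.min())
```

Applied to `c_train['Median Income']`, the same filter would list each negative value along with how often it occurs.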


### Question 4

Replace the administrative code in question 3 with NaN for the training and testing sets. Then, normalize `Percent White`, `Percent Black`, and `Median Income` in the way that we have learned:
1. Find the mean and standard deviation in the training set.
2. Subtract the training mean from each value, then divide by the training standard deviation.

Finally, replace the missing values in each column with that column's mean.

After going through these steps--normalizing and imputing missing values--what is the mean value in the test set for "Median Income"?

In [None]:
import numpy as np
# 1. First, replace the administrative code you found previously with NaN for the test and train sets.
admin_code = -666666666.0
# Assign the result back instead of calling replace(..., inplace=True) on a column selection,
# which is not guaranteed to modify the parent DataFrame. np.nan is used because
# the np.NaN alias was removed in NumPy 2.0.
c_train["Median Income"] = c_train["Median Income"].replace(admin_code, np.nan)
c_test["Median Income"] = c_test["Median Income"].replace(admin_code, np.nan)

print(c_train['Median Income'].unique())
print(c_train['Median Income'].dtype)

# Check whether the administrative code has been removed (nothing should print)
candidate_codes = (-42, -999999, -666666666.0)
for value in set(c_train["Median Income"].unique()):
    if value in candidate_codes:
        print(value)

# 2. Now, normalize. You can do this a few different ways, but you might consider writing a for loop.
# If you go that path the code below gets you started:
cols = ["Percent White", "Percent Black", "Median Income"]
for col in cols:
    # Compute the mean and standard deviation on the training set
    train_mean = c_train[col].mean()
    train_std = c_train[col].std()

    # Normalize the training set
    c_train[col] = (c_train[col] - train_mean) / train_std

    # Normalize the test set (using the training mean and standard deviation)
    c_test[col] = (c_test[col] - train_mean) / train_std

    # Replace missing values with the training-set mean.
    # Note: train_mean is the mean computed before normalization.
    c_train[col] = c_train[col].fillna(train_mean)
    c_test[col] = c_test[col].fillna(train_mean)

# Now, the sanity check: after this process, get the test mean for Median Income
MI_test_mean = c_test["Median Income"].mean()
print(MI_test_mean)

# Hint: It should be between 2000 and 3000
# Quiz solution: 2044
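
The normalization steps can be checked by hand on a tiny example. The sketch below uses made-up numbers and shows two possible fill values for the missing entries; which one a given assignment expects is an assumption to verify against its hint. The cell above fills with the pre-normalization training mean.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
train = pd.Series([10.0, 20.0, np.nan, 30.0])
test = pd.Series([20.0, np.nan, 40.0])

# Statistics come from the training split only (NaNs are skipped by default).
mu = train.mean()    # mean of 10, 20, 30
sigma = train.std()  # sample standard deviation (ddof=1)

train_z = (train - mu) / sigma
test_z = (test - mu) / sigma  # the test split reuses the training statistics

# Two possible fill values for the remaining NaNs:
filled_raw = test_z.fillna(mu)   # imputed entry sits on the original scale
filled_std = test_z.fillna(0.0)  # imputed entry sits at the standardized training mean

print(filled_raw.tolist(), filled_raw.mean())
print(filled_std.tolist(), filled_std.mean())
```

Even one entry filled on the original scale pulls the column mean far from zero, which is why the choice of fill value matters for the sanity check.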

### Question 5

There is just one more data preparation step: encoding features from the categorical variables ("Primary Type", "Ward", and "FBI Code"). The standard way to encode categorical features in machine learning is one-hot encoding. The function `pd.get_dummies` will be useful.

An inherent issue arises with this approach when a value appears in either your training or testing data, but not in both. If a value appears in your training set but not your testing set, create a column with all 0's in your testing set. If a value appears in your testing set but not your training set, drop it from your testing data. 
So:
1. Use `get_dummies` to one-hot encode "Primary Type", "Ward", and "FBI Code".
2. For features that appear in the training data but not the testing data, create them and populate them with 0's.
3. Drop features that appear in the testing data but not the training data.
4. Finally, make sure that the columns that we aren't going to use for classification are dropped. These are: `Date`, `Description`, `Location Description`, `Domestic`, `Beat`, `District`, `Block`, `Community Area`.

How many columns are now in your training and testing data?

In [14]:
import pandas as pd
# 1. One-hot encode these columns using get_dummies()
encode_cols = ['Primary Type', 'Ward', 'FBI Code']

# Your code to one-hot encode here
train_encoded = pd.get_dummies(c_train, columns=encode_cols)
test_encoded = pd.get_dummies(c_test, columns=encode_cols)

# Now, here is some code to get the columns as lists:
train_cols = set(train_encoded.columns.to_list())
test_cols = set(test_encoded.columns.to_list())

# 2. For features in the training data but not testing data, create them and populate with 0
# Your code to find the columns in training but not testing
# Sample code to set them to zero:
in_training_not_testing = train_cols - test_cols
for missing_col in in_training_not_testing:
    test_encoded[missing_col] = 0
    
# 3. For features in testing but not training, drop them
# Your code to find features in testing but not training
# Sample code to drop them
in_testing_not_training = test_cols - train_cols
for extra_col in in_testing_not_training:
    test_encoded.drop(columns=extra_col, inplace=True)

# Finally, drop the columns that will not be used for your model from both (these are given above)
columns_to_drop = ["Date", "Description", "Location Description", "Domestic", "Beat", "District", "Block", "Community Area"]

train_encoded.drop(columns=columns_to_drop, inplace=True, errors='ignore')
test_encoded.drop(columns=columns_to_drop, inplace=True, errors='ignore')

# Put the test columns in the same order as the training columns
test_encoded = test_encoded[train_encoded.columns]

# How many are you left with?
print(f"Number of columns in training data: {train_encoded.shape[1]}")
print(f"Number of columns in testing data: {test_encoded.shape[1]}")
# Hint: It should be between 100 and 200

Number of columns in training data: 112
Number of columns in testing data: 112
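
Steps 2 and 3 (add missing columns as zeros, drop extra columns) can also be done in one call with `DataFrame.reindex`. This is an alternative sketch on made-up data, not the starter code's approach:

```python
import pandas as pd

# Made-up categorical data: "C" appears only in train, "D" only in test.
train = pd.DataFrame({"Ward": ["A", "B", "C"]})
test = pd.DataFrame({"Ward": ["A", "D"]})

train_enc = pd.get_dummies(train, columns=["Ward"])
test_enc = pd.get_dummies(test, columns=["Ward"])

# reindex keeps exactly the training columns, in training order:
# columns missing from test are created and filled with 0, extras are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```

A side effect is that the test columns end up in the same order as the training columns, which matters when the frames are passed to a scikit-learn estimator.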


The function below performs a sanity check by verifying that your training set and your testing set wound up with the same features, and that you have successfully imputed all missing values.

You can run it before moving on to modeling as a simple check to make sure you're in a good place:

In [15]:
def sanity_check(train_df, test_df):

    # Sort features alphabetically
    train_df = train_df.reindex(sorted(train_df.columns), axis=1)
    test_df = test_df.reindex(sorted(test_df.columns), axis=1)

    # Check that they have the same features
    # (Index.equals also handles frames with different numbers of columns)
    if train_df.columns.equals(test_df.columns):
        print("Success: Features match")
    else:
        print("Failure: Features do not match")

    # Check that no NAs remain
    if not train_df.isna().any().any() and \
       not test_df.isna().any().any():
        print("Success: No NAs remain")
    else:
        print("Failure: NAs remain")

## Part 2: Modeling

In this part, you will train and evaluate models using different classification techniques and hyperparameters. You will now need to separate your train and test data into the training and testing features and target variables.

### Question 6

As mentioned, the target variable will be `Arrest`; train on the other features.

First, logistic regression. Use a GridSearch with 2-fold cross-validation, and try tuning the penalty and the value for C.

- For the penalty, try l2 and no penalty.
- For C, try the values (0.01, 0.1, 1).
- Evaluate based on accuracy.

Which combination of penalty and C gives the best results from the GridSearch?

In [16]:
# If you need it, you can load train and test data that has been prepared for you, per the steps above
train_df = pd.read_csv("../data/quiz2_train.csv")
test_df = pd.read_csv("../data/quiz2_test.csv")

In [8]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Separate the features (X) from the target variable (y)
X_train = train_df.drop(columns=['Arrest'])  # Drop the 'Arrest' column to get the features
y_train = train_df['Arrest']  # The target variable is 'Arrest'

X_test = test_df.drop(columns=['Arrest'])  # Same for the test data
y_test = test_df['Arrest']

# 2. Set up the parameters to try in the grid search
# Note: penalty=None requires scikit-learn 1.2 or later; earlier versions spell it 'none'
param_grid = {
    'penalty': ['l2', None],  # Try 'l2' and no penalty
    'C': [0.01, 0.1, 1],  # Try different values of C
}

# 3. Create the logistic regression model
log_reg = LogisticRegression(random_state=0)

# 4. Set up GridSearchCV with 2-fold cross-validation
grid_model = GridSearchCV(log_reg, param_grid, cv=2, scoring='accuracy', verbose=1)

# 5. Fit the model on the training data
grid_model.fit(X_train, y_train)

# 6. Show the full grid search results
print(grid_model.cv_results_)

# Show the best combination of hyperparameters
print(f"Best hyperparameter combination: {grid_model.best_params_}")

# Show the mean cross-validation accuracy of the best combination
print(f"Best mean score: {grid_model.best_score_}")

# 7. Evaluate the model on the test set
y_pred = grid_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test set: {test_accuracy}")


# Try l2 and no penalty
# For C, try .01, .1, 1
# Use 2-fold cross validation
# Use random_state = 0--e.g., LogisticRegression(random_state=0)
# What combination gives the best mean accuracy?
# Hint: you can access results using grid_model.cv_results_
# Per the quiz: no penalty and C = 0.1

Fitting 2 folds for each of 6 candidates, totalling 12 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:  2.5min finished


{'mean_fit_time': array([16.40823913,  0.08118677, 29.06654406,  0.08343661, 28.95643723,
        0.09478784]), 'std_fit_time': array([0.31228924, 0.00171804, 0.12913537, 0.0024153 , 0.02910602,
       0.00710058]), 'mean_score_time': array([0.29956031, 0.        , 0.30224431, 0.        , 0.30979848,
       0.        ]), 'std_score_time': array([0.0006597 , 0.        , 0.00142753, 0.        , 0.00279164,
       0.        ]), 'param_C': masked_array(data=[0.01, 0.01, 0.1, 0.1, 1, 1],
             mask=[False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_penalty': masked_array(data=['l2', None, 'l2', None, 'l2', None],
             mask=[False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'C': 0.01, 'penalty': 'l2'}, {'C': 0.01, 'penalty': None}, {'C': 0.1, 'penalty': 'l2'}, {'C': 0.1, 'penalty': None}, {'C': 1, 'penalty': 'l2'}, {'C': 1, 'penalty': None}], 'split0_test_score': array([0.8

### Question 7

Next, try tuning a linear support vector machine classifier. Again, use 2-fold cross-validation, and for C try the values (0.01, 0.1, 1). Once again, score on accuracy. Which value for C produces the best score in the GridSearch?

In [None]:
# Same as above, but this time with a linear SVM.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Separate the features (X) from the target variable (y)
X_train = train_df.drop(columns=['Arrest'])  # Drop the 'Arrest' column to get the features
y_train = train_df['Arrest']  # The target variable is 'Arrest'

X_test = test_df.drop(columns=['Arrest'])  # Same for the test data
y_test = test_df['Arrest']
# Define the SVM classifier with a linear kernel
svm_model = SVC(kernel='linear', random_state=0)

# Define the values of C to try
param_grid = {'C': [0.01, 0.1, 1]}

# Set up GridSearchCV with 2-fold cross-validation
grid_model = GridSearchCV(svm_model, param_grid, cv=2, scoring='accuracy', verbose=3)

# Fit the model (the features and target must already be separated, as above)
grid_model.fit(X_train, y_train)

# Show the best parameters
print(f"Best C value: {grid_model.best_params_['C']}")

# The full results are also available
print(grid_model.cv_results_)

Fitting 2 folds for each of 3 candidates, totalling 6 fits
[CV] C=0.01 ..........................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .............................. C=0.01, score=0.865, total=18.5min
[CV] C=0.01 ..........................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 18.5min remaining:    0.0s
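
The log above shows a single fit taking about 18.5 minutes. If run time is a concern, one commonly used alternative for linear SVMs is `LinearSVC`. It is a different estimator from `SVC(kernel='linear')` (a different solver and, by default, a squared hinge loss), so its scores need not match and it may not be what the quiz expects. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

linear_grid = GridSearchCV(
    LinearSVC(random_state=0, max_iter=5000),  # max_iter raised to help convergence
    {"C": [0.01, 0.1, 1]},
    cv=2,
    scoring="accuracy",
)
linear_grid.fit(X, y)

print(linear_grid.best_params_, linear_grid.best_score_)
```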


### Question 8

The last model type you want to consider is a Naive Bayes classifier. Likewise, use 2-fold cross-validation, and evaluate on accuracy. What is the mean accuracy score?

In [17]:
# Finally, naive Bayes. No need to use GridSearch for this one, since you're not tuning any hyperparameters
# You can use cross_val_score:
from sklearn.model_selection import cross_val_score
# do 2-fold cross-validation and give the mean accuracy
from sklearn.naive_bayes import GaussianNB

# Separate the features from the target variable
X_train = train_df.drop('Arrest', axis=1)  # The features (without the "Arrest" column)
y_train = train_df['Arrest']  # The target variable ("Arrest")

# Create the naive Bayes classifier
nb_model = GaussianNB()

# Run 2-fold cross-validation and compute the accuracy of each fold
cv_scores = cross_val_score(nb_model, X_train, y_train, cv=2, scoring='accuracy')

# Compute the mean accuracy
mean_accuracy = cv_scores.mean()

# Show the result
print(f'Mean accuracy: {mean_accuracy}')


Mean accuracy: 0.8623648578121235


### Question 9

Finally, we want to determine which features are most important.

For these examples, the SVM with the parameters in question 7 will have slightly edged out the others. Train a Linear SVM classifier with that value for C. 

For the features used to train this model, which has the largest positive contribution towards classification as positive? (i.e., which has the largest positive coefficient?)

In [None]:
# Train your Linear SVM with the best value you found for C
# You can access model coefficients using model.coef_ (the array will be in the same order as the features)
from sklearn.svm import SVC

# Best value of C from the previous question (set to 0.01 here)
best_C = 0.01

# Create the SVM model
svm_model = SVC(C=best_C, kernel='linear')

# Train it on the training data
svm_model.fit(X_train, y_train)

# Get the model coefficients
coefficients = svm_model.coef_

# Show the coefficients
print("Model coefficients:")
print(coefficients)

# Find which feature has the largest positive coefficient
# coef_ has shape (1, n_features) here, so index the single row before argmax
max_positive_idx = coefficients[0].argmax()
features = X_train.columns  # Feature names

# Show the feature with the largest positive contribution
print(f"The feature with the largest positive contribution is: {features[max_positive_idx]}")

# Answer: Narcotics
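
To see more than the single top feature, the coefficients can be paired with the column names and sorted. The sketch below uses synthetic data and hypothetical feature names (`f0` to `f3`); with the quiz data, `X_train.columns` would supply the names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data with hypothetical feature names, for illustration only.
X_arr, y = make_classification(n_samples=200, n_features=4, random_state=0)
X = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

model = SVC(C=0.01, kernel="linear").fit(X, y)

# coef_ has shape (1, n_features) for a binary problem; take the single row.
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values(ascending=False)
print(coefs)

top_feature = coefs.index[0]  # feature with the largest positive coefficient
print(top_feature)
```

Sorting also shows how close the runner-up features are, which a bare `argmax` hides.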
