## **Checkpoint #1: *Open the Notebook and Load the Data***


For help running the notebook in Colab, see https://docs.google.com/presentation/d/1z9HhCQ2ldON_tb9JLuJlb_bH9cghNhHwXCE-nqxbaUI/edit?usp=sharing.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd drive/MyDrive/Regression\ Workshop

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
df = pd.read_csv('2016_Citizen_Survey_Master_Dataset_20240218.csv')

In [None]:
df_shape = ...

print(f"Initial shape of the dataframe: {df_shape}")

## **Checkpoint #2: *Prepare the Data with Imputation and Encoding***


### Imputation

Data imputation replaces null values with a substitute value, such as the mean of a quantitative variable or the mode of a qualitative variable.

Here are some helpful links:

* https://www.tutorialspoint.com/how-to-handle-missing-values-of-categorical-variables-in-python#:~:text=in%20the%20section.-,Fill%20the%20missing%20values%20with%20the%20computed%20mode%20using%20the,method_name%20parameter%20as%20'mode'
* https://saturncloud.io/blog/how-to-count-nan-and-null-values-in-a-pandas-dataframe/
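
As a rough illustration of the idea above, here is a minimal sketch on a small made-up DataFrame (the `toy` frame and its `age` and `city` columns are hypothetical and are not part of the survey dataset). It counts the nulls, fills the numeric column with its mean and the categorical column with its mode, then counts again.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (hypothetical, not the survey dataset)
toy = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "city": ["Austin", "Dallas", None, "Austin"],
})

# Total number of missing values across every column
nulls_before = int(toy.isnull().sum().sum())

# Numeric column: replace missing values with the column mean
toy["age"] = toy["age"].fillna(toy["age"].mean())

# Categorical column: replace missing values with the most frequent value.
# mode() returns a Series because ties are possible, so take the first entry.
toy["city"] = toy["city"].fillna(toy["city"].mode()[0])

nulls_after = int(toy.isnull().sum().sum())
print(nulls_before, nulls_after)
```

Mean imputation is sensitive to outliers, so the median can be a reasonable alternative for skewed numeric columns.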

In [None]:
# get the total number of nulls in the entire dataframe
nulls = ...

print(f"Total number of nulls in dataset (original): {nulls}")

In [None]:
# Data evaluation:
#   shape
#   remove null values
#   convert categorical values, etc.

# Step 1: clean / fill null values (handled in this cell)
# Step 2: rename values and columns (mainly the area column)

# DON'T REMOVE
df.drop(columns=["Information preference about city government activities - Other"], inplace=True, errors="ignore")

for column in df.columns:
  num_nulls_in_column = ...                     # get the number of nulls in the column (modify the line from last cell)
  if num_nulls_in_column != 0:
    column_dtype = ...                          # get the dtype of this column (look this up)
    if column_dtype == 'object':                # categorical columns
      mode = ...                                # calculate mode of column
      df[column] = ...                          # fill all null values in the column with this mode
    elif column_dtype in ('float64', 'int64'):  # numerical columns
      mean = ...                                # calculate mean of column
      df[column] = ...                          # same as line above (fill nulls)

new_nulls = ...
print(f"Total number of nulls in dataset (after imputation): {new_nulls}")

### Encoding

Encoding converts categorical data into numerical values that the model can work with.
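
To make the two encoders used in this notebook concrete, here is a small sketch on made-up data (the `toy` frame and its `owns_home` and `transport` columns are hypothetical). It label-encodes a two-category column and one-hot encodes a three-category column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only (hypothetical, not the survey dataset)
toy = pd.DataFrame({
    "owns_home": ["yes", "no", "yes", "no"],     # two categories
    "transport": ["car", "bike", "bus", "car"],  # three categories
})

# Label encoding: classes are sorted, so "no" becomes 0 and "yes" becomes 1
le = LabelEncoder()
toy["owns_home"] = le.fit_transform(toy["owns_home"])

# One-hot encoding: one indicator column per category
toy = pd.get_dummies(toy, columns=["transport"])

print(toy.columns.tolist())
```

Label encoding imposes an arbitrary order on the categories, which is harmless for two values but can mislead a model when there are more. That is one reason to prefer one-hot encoding for columns with several categories.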

In [None]:
# rename the 'Area of College Station' column to 'area' (in place)
...

y = df["area"]

In [None]:
num_categorical_before = df.select_dtypes(include=['object']).shape[1]
print(f"Number of categorical variables before encoding: {num_categorical_before}")

In [None]:
# Determine the encoding technique for a column based on its number of unique values.
# Don't worry about the details: columns with more than 10 unique values return None here
# and are frequency-encoded later, a strategy suited to high-cardinality categorical variables.
def determine_encoding(column):
    unique_values = column.nunique()
    # If the number of unique values is too high, return None to avoid encoding
    if unique_values > 10:
        return None
    elif unique_values == 2:  # for binary columns
        return "Label"
    else:
        return "OneHot"

X = df.drop(columns=["area"], inplace=False, errors="ignore")

one_hot_columns = []
drop_columns = []

for col in X.columns:
  if X[col].dtype == 'object':            # Check if the column is categorical
    encoding_type = ...                   # use function to determine which encoding
    if encoding_type == "Label":
      le = ...                            # instantiate a label encoder object (default params)
      X[col] = ...                        # fit/transform the column in X and reassign
    elif encoding_type == "OneHot":
      one_hot_columns.append(col)
    else:
      drop_columns.append(col)

# perform One-Hot encoding with dummies using one_hot_columns list (look up "pandas dummies")
X = ...

encoded_shape = X.shape
print(f"Shape of the Encoded DataFrame: {encoded_shape}")

num_categorical_after = X.select_dtypes(include=['object']).shape[1]
print(f"Number of categorical variables after encoding: {num_categorical_after}")

In [None]:
# There are still some categorical variables left, specifically the high-cardinality ones

# select columns with dtype of 'object' --> look up "pandas select dtypes"
remaining_categorical_vars = ...

# write lambda expression to see if number of unique values is greater than 10 for each column in remaining_categorical_vars
columns_with_many_categories = remaining_categorical_vars.apply(...)

# trying out frequency encoding:
columns_to_encode = columns_with_many_categories[columns_with_many_categories].index.tolist()

for col in columns_to_encode:
    frequency = ...                   # frequency of each category with normalization --> look up "value_counts"
    X[col] = X[col].map(frequency)

num_categorical_after = X.select_dtypes(include=['object']).shape[1]
print(f"Number of categorical variables after encoding: {num_categorical_after}")  # should now be 0
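
Frequency encoding, as used in the cell above, replaces each category with the share of rows that carry it. A minimal sketch on made-up data (the `toy` frame and its `street` column are hypothetical):

```python
import pandas as pd

# Toy data for illustration only (hypothetical, not the survey dataset)
toy = pd.DataFrame({"street": ["Elm", "Oak", "Elm", "Pine", "Elm", "Oak"]})

# Share of rows belonging to each category (the shares sum to 1)
frequency = toy["street"].value_counts(normalize=True)

# Replace each category with its share
toy["street"] = toy["street"].map(frequency)

print(toy["street"].tolist())
```

Two categories that happen to be equally common receive the same code, so this encoding can lose information. In exchange, it keeps the number of columns fixed no matter how many categories there are.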

## **Checkpoint #3: *Run the Base Models***


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [None]:
# split the dataset
# test size should be 0.2 and random state should be 42
X_train, X_test, y_train, y_test = ...

# Checking the shapes of the resulting splits
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
def fit_predict_eval(model):
  # fit model on training data
  ...

  # predict labels for X_test
  model_pred = ...

  # create classification report from predictions and actual y's --> look up on Google
  model_report = ...

  # create confusion matrix from predictions and actual y's --> look up on Google
  model_matrix = ...

  return model_report, model_matrix
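
For orientation, here is a sketch of the same fit, predict, and evaluate flow on synthetic data. The data comes from `make_classification` and the names `X_toy`, `y_toy`, and `clf` are stand-ins chosen for this example; this is not the workshop solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the survey features and labels
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)

# Hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Fit on the training rows, then predict labels for the held-out rows
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Text summary of precision, recall, and F1 per class
report = classification_report(y_te, pred)

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_te, pred, labels=[0, 1])

print(report)
print(matrix)
```

Both metric functions take the true labels first and the predictions second. Swapping the arguments transposes the confusion matrix and exchanges precision with recall.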

The examples below use Logistic Regression and Random Forest, both of which perform very well on this dataset.

In [None]:
lr = LogisticRegression()   # penalty/regularization is optional

lr_report, lr_matrix = fit_predict_eval(lr)

print(lr_report)
print(lr_matrix)

In [None]:
rfc = RandomForestClassifier()   # default hyperparameters; tuning is optional

rfc_report, rfc_matrix = fit_predict_eval(rfc)

print(rfc_report)
print(rfc_matrix)

**Now, try creating the models with K-Nearest Neighbors and Support Vector Classifiers.** Search online for help creating each model object.

In [None]:
knn = ...

knn_report, knn_matrix = fit_predict_eval(knn)

print(knn_report)
print(knn_matrix)

In [None]:
svm = ...

svm_report, svm_matrix = fit_predict_eval(svm)

print(svm_report)
print(svm_matrix)

## **Checkpoint #4: *Fine-Tune the Hyperparameters***


**Finally, it's your turn to create the best model using SVM or KNN! Alternatively, try to improve the LogisticRegression parameters to get an even better performance!** Try different models with different hyperparameters.
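
One possible way to try several hyperparameter settings systematically is a cross-validated grid search. The sketch below is an assumption about how you might approach it, not a required method. It uses synthetic data from `make_classification`, and the candidate values in `param_grid` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the encoded survey features and labels
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Candidate values are arbitrary and only meant as a starting point
param_grid = {
    "n_neighbors": [3, 5, 11],
    "weights": ["uniform", "distance"],
}

# Evaluate every combination with 5-fold cross-validation on the training rows
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)        # best combination found
print(search.score(X_te, y_te))   # accuracy of the refit best model on held-out rows
```

The search only ever sees the training rows, so the held-out score remains an honest estimate of how the tuned model generalizes.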

In [None]:
# TODO