## **Checkpoint #1: *Open the Notebook and Load the Data***


Go to https://docs.google.com/presentation/d/1z9HhCQ2ldON_tb9JLuJlb_bH9cghNhHwXCE-nqxbaUI/edit?usp=sharing for help on running the notebook in Colab.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# cd drive/MyDrive/Regression\ Workshop
%cd drive/Shareddrives/Aggie\ Data\ Science\ Club/Workshops/SP24\ Classification

/content/drive/Shareddrives/Aggie Data Science Club/Workshops/SP24 Classification


In [3]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [4]:
df = pd.read_csv('2016_Citizen_Survey_Master_Dataset_20240218.csv')


## **Checkpoint #2: *Prepare the Data with Imputation and Encoding***


In [5]:
nulls = df.isnull().sum().sum()
print(f"Total number of nulls in dataset (original): {nulls}")

Total number of nulls in dataset (original): 125767


Reference on filling missing values of categorical variables with the mode: https://www.tutorialspoint.com/how-to-handle-missing-values-of-categorical-variables-in-python#:~:text=in%20the%20section.-,Fill%20the%20missing%20values%20with%20the%20computed%20mode%20using%20the,method_name%20parameter%20as%20'mode'.



In [6]:
# Data preparation plan:
#   1. Check the shape and count the null values
#   2. Handle null values by imputation (mode for categorical columns, mean for numerical columns)
#   3. Rename values and columns (mainly the area column)
#   4. Convert the categorical values to numbers

df.drop(columns=["Information preference about city government activities - Other"], inplace=True, errors="ignore")

for column in df.columns:
  num_nulls_in_column = df[column].isnull().sum()
  if num_nulls_in_column != 0:
    column_dtype = df[column].dtype
    if column_dtype == 'object':  # Categorical columns
      mode = df[column].mode()[0]
      df[column] = df[column].fillna(mode)
    elif column_dtype in ('float64', 'int64'):  # Numerical columns
      mean = df[column].mean()
      df[column] = df[column].fillna(mean)

new_nulls = df.isnull().sum().sum()
print(f"Total number of nulls in dataset (after imputation): {new_nulls}")

Total number of nulls in dataset (after imputation): 0
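
The imputation rule used above (mode for categorical columns, mean for numerical columns) can be seen more easily on a tiny table. This is a minimal sketch: the DataFrame `toy` and its column names are made up for illustration and are not part of the survey data, and `is_numeric_dtype` stands in for the notebook's dtype check.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data; these column names are NOT from the survey dataset.
toy = pd.DataFrame({
    "dwelling": ["house", None, "apartment", "house"],
    "years_in_city": [2.0, 4.0, None, 6.0],
})

for column in toy.columns:
    if toy[column].isnull().any():
        if is_numeric_dtype(toy[column]):
            # Numerical column: fill gaps with the column mean.
            toy[column] = toy[column].fillna(toy[column].mean())
        else:
            # Categorical column: fill gaps with the most frequent value.
            toy[column] = toy[column].fillna(toy[column].mode()[0])

print(toy)
print(f"Nulls remaining: {toy.isnull().sum().sum()}")
```

Mode and mean imputation are simple defaults. They keep every row, but they also pull the filled values toward the most common answer, which is worth remembering when a column has many nulls.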


In [7]:
df.rename(columns={'Area of College Station': 'area'}, inplace=True)

y = df["area"]

In [8]:
num_categorical_before = df.select_dtypes(include=['object']).shape[1]
print(f"Number of categorical variables before encoding: {num_categorical_before}")

Number of categorical variables before encoding: 117


In [9]:
# Choose an encoding technique for a column based on its number of unique values.
# Don't worry about the details: columns with more than 10 unique values are skipped here and
# frequency-encoded later, a strategy that suits high-cardinality categorical variables.
def determine_encoding(column):
    unique_values = column.nunique()
    # Too many unique values: return None so the column is not label- or one-hot-encoded
    if unique_values > 10:
        return None
    elif unique_values == 2:  # for binary columns
        return "Label"
    else:
        return "OneHot"

X = df.drop(columns=["area"], inplace=False, errors="ignore")

one_hot_columns = []
drop_columns = []  # high-cardinality columns; despite the name they are not dropped, only left unencoded for now

for col in X.columns:
  if X[col].dtype == 'object':  # Check if the column is categorical
    encoding_type = determine_encoding(X[col])
    if encoding_type == "Label":
      le = LabelEncoder()
      X[col] = le.fit_transform(X[col])
    elif encoding_type == "OneHot":
      one_hot_columns.append(col)
    else:
      drop_columns.append(col)

X = pd.get_dummies(X, columns=one_hot_columns)

encoded_shape = X.shape
print(f"Shape of the Encoded DataFrame: {encoded_shape}")
num_categorical_after = X.select_dtypes(include=['object']).shape[1]
print(f"Number of categorical variables after encoding: {num_categorical_after}")

Shape of the Encoded DataFrame: (2015, 518)
Number of categorical variables after encoding: 10
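
To make the two encodings above concrete, here is a small sketch on made-up data. The DataFrame `toy` and its columns are hypothetical. A binary column goes through `LabelEncoder`, and a low-cardinality column goes through `pd.get_dummies`, mirroring the branches of `determine_encoding`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; these column names are NOT from the survey dataset.
toy = pd.DataFrame({
    "owns_home": ["yes", "no", "yes", "no"],                # 2 unique values
    "dwelling": ["house", "apartment", "condo", "house"],   # 3 unique values
})

# Binary column: LabelEncoder maps the sorted classes ("no", "yes") to 0 and 1.
toy["owns_home"] = LabelEncoder().fit_transform(toy["owns_home"])

# Low-cardinality column: one indicator column per category.
toy = pd.get_dummies(toy, columns=["dwelling"])

print(toy.columns.tolist())
print(toy.shape)
```

One-hot encoding is why the encoded DataFrame grows so wide: each category of each one-hot column becomes its own column.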


In [10]:
# There are still some categorical variables left, specifically the high-cardinality ones

remaining_categorical_vars = X.select_dtypes(include=['object'])
columns_with_many_categories = remaining_categorical_vars.apply(lambda col: col.nunique() > 10)

print(columns_with_many_categories)

# Frequency encoding: replace each category with its relative frequency in the column
columns_to_encode = columns_with_many_categories[columns_with_many_categories].index.tolist()
print(columns_to_encode)

for col in columns_to_encode:
    frequency = X[col].value_counts(normalize=True)  # frequency of each category
    X[col] = X[col].map(frequency)

num_categorical_after = X.select_dtypes(include=['object']).shape[1]
print(f"Number of categorical variables after encoding: {num_categorical_after}")  # now there are 0 categorical variables

Dwelling type Other                                                                         True
Additional comments about specific city services or departments                             True
What do you value MOST about living in College Station?                                     True
If you could change ONE THING about College Station, what would it be                       True
What would you say should be College Station's highest priority                             True
What types of retail and commercial development would you like to see in College Station    True
How could the city's customer service be improved                                           True
Information preference Other                                                                True
How could the city improve its public communication efforts?                                True
Additional comments about College Station's municipal buildings and facilities              True
dtype: bool
['Dwelling type Ot
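
As a quick illustration of the frequency encoding used in this cell, the sketch below applies the same `value_counts(normalize=True)` and `map` steps to a made-up Series. The values in `comments` are hypothetical and only stand in for free-text survey answers.

```python
import pandas as pd

# Hypothetical free-text answers; NOT taken from the survey dataset.
comments = pd.Series(["parks", "traffic", "parks", "schools", "parks"])

# Relative frequency of each category (the fractions sum to 1).
frequency = comments.value_counts(normalize=True)

# Replace every category with its frequency.
encoded = comments.map(frequency)

print(frequency)
print(encoded.tolist())
```

Note the trade-off: categories that occur equally often ("traffic" and "schools" here) receive the same number, so the model can no longer tell them apart.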

## **Checkpoint #3: *Run the Base Models***


In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Checking the shapes of the resulting splits
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1612, 518), (403, 518), (1612,), (403,))

In [12]:
def fit_predict_eval(model):
  model.fit(X_train, y_train)

  model_pred = model.predict(X_test)

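  # NOTE: scikit-learn's signature is (y_true, y_pred). Predictions are passed first here,
  # so in the reports below precision and recall are swapped, "support" counts predictions,
  # and the confusion-matrix rows are predicted labels.
  model_report = classification_report(model_pred, y_test)
  model_matrix = confusion_matrix(model_pred, y_test)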

  return model_report, model_matrix
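
The scikit-learn metric functions take their arguments as `(y_true, y_pred)`, while `fit_predict_eval` passes the predictions first. The sketch below, on made-up labels, shows what the order changes: swapping the arguments transposes the confusion matrix. The names `y_true_demo` and `y_pred_demo` are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only.
y_true_demo = ["A", "A", "A", "B"]
y_pred_demo = ["A", "A", "B", "B"]

# Conventional order: rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Swapped order: rows are predicted labels, columns are true labels.
cm_swapped = confusion_matrix(y_pred_demo, y_true_demo)

print(cm)
print(cm_swapped)
```

Accuracy is the same either way, but per-class precision and recall trade places, so the order matters when reading the reports.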

In [13]:
lr = LogisticRegression()   #Penalty/regularization is optional

lr_report, lr_matrix = fit_predict_eval(lr)

print(lr_report)
print(lr_matrix)

                                                           precision    recall  f1-score   support

Area A - North or Rock Prairie Rd. and West of Texas Ave.       0.80      0.88      0.84       116
    Area B - North of Bird Pond Rd and East of Texas Ave.       0.78      0.93      0.84        67
    Area C - South of Rock Prairie Rd. and West of Hwy. 6       0.96      0.89      0.93       148
       Area D - South of Bird Pond Rd. and East of Hwy. 6       0.85      0.69      0.76        72

                                                 accuracy                           0.86       403
                                                macro avg       0.85      0.85      0.84       403
                                             weighted avg       0.87      0.86      0.86       403

[[102   6   4   4]
 [  2  62   1   2]
 [  5   8 132   3]
 [ 18   4   0  50]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
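
The warning above suggests two remedies: raise `max_iter` or scale the data. A minimal sketch of both together follows. It runs on synthetic data from `make_classification` rather than the survey data, and `max_iter=1000` is an arbitrary illustrative value, not a tuned one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for X_train / y_train (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=4, random_state=42
)

# Scale the features first, then fit logistic regression with a higher iteration limit.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

print(scaled_lr.score(X_demo, y_demo))
```

Because the pipeline behaves like a single estimator, it could be passed to `fit_predict_eval` in place of the bare `LogisticRegression()`. Whether it removes the warning on the survey data would need to be checked by running it.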


In [14]:
knn = KNeighborsClassifier()   # n_neighbors is optional (default is 5)

knn_report, knn_matrix = fit_predict_eval(knn)

print(knn_report)
print(knn_matrix)

                                                           precision    recall  f1-score   support

Area A - North or Rock Prairie Rd. and West of Texas Ave.       0.45      0.39      0.42       146
    Area B - North of Bird Pond Rd and East of Texas Ave.       0.24      0.24      0.24        79
    Area C - South of Rock Prairie Rd. and West of Hwy. 6       0.45      0.43      0.44       141
       Area D - South of Bird Pond Rd. and East of Hwy. 6       0.32      0.51      0.40        37

                                                 accuracy                           0.39       403
                                                macro avg       0.36      0.39      0.37       403
                                             weighted avg       0.39      0.39      0.39       403

[[57 35 43 11]
 [22 19 25 13]
 [41 23 61 16]
 [ 7  3  8 19]]


In [15]:
rfc = RandomForestClassifier()   # n_estimators is optional (default is 100)

rfc_report, rfc_matrix = fit_predict_eval(rfc)

print(rfc_report)
print(rfc_matrix)

                                                           precision    recall  f1-score   support

Area A - North or Rock Prairie Rd. and West of Texas Ave.       1.00      0.99      1.00       128
    Area B - North of Bird Pond Rd and East of Texas Ave.       1.00      1.00      1.00        80
    Area C - South of Rock Prairie Rd. and West of Hwy. 6       1.00      1.00      1.00       137
       Area D - South of Bird Pond Rd. and East of Hwy. 6       0.98      1.00      0.99        58

                                                 accuracy                           1.00       403
                                                macro avg       1.00      1.00      1.00       403
                                             weighted avg       1.00      1.00      1.00       403

[[127   0   0   1]
 [  0  80   0   0]
 [  0   0 137   0]
 [  0   0   0  58]]


In [16]:
svm = SVC()   # Kernel is optional

svm_report, svm_matrix = fit_predict_eval(svm)

print(svm_report)
print(svm_matrix)

                                                           precision    recall  f1-score   support

Area A - North or Rock Prairie Rd. and West of Texas Ave.       0.15      0.28      0.20        67
    Area B - North of Bird Pond Rd and East of Texas Ave.       0.00      0.00      0.00         0
    Area C - South of Rock Prairie Rd. and West of Hwy. 6       0.85      0.35      0.49       336
       Area D - South of Bird Pond Rd. and East of Hwy. 6       0.00      0.00      0.00         0

                                                 accuracy                           0.33       403
                                                macro avg       0.25      0.16      0.17       403
                                             weighted avg       0.73      0.33      0.44       403

[[ 19  21  21   6]
 [  0   0   0   0]
 [108  59 116  53]
 [  0   0   0   0]]


  _warn_prf(average, modifier, msg_start, len(result))


## **Checkpoint #4: *Fine-Tune the Hyperparameters***


In [18]:
# Try different models with different hyperparameters.
# At a minimum, improve on the base SVM or KNN results above.
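
One possible starting point for this checkpoint is a grid search over KNN hyperparameters with feature scaling in the same pipeline. This is a sketch under assumptions, not the workshop's reference solution. It uses synthetic data from `make_classification`, and the grid values are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for X_train / y_train (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=4, random_state=42
)

# Scaling matters for distance-based models such as KNN, so it goes inside the pipeline.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid; the "knn__" prefix routes each parameter to the pipeline step named "knn".
param_grid = {
    "knn__n_neighbors": [3, 5, 11],
    "knn__weights": ["uniform", "distance"],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

To try this on the workshop data, replace `X_demo` and `y_demo` with `X_train` and `y_train`. The same pattern works for `SVC` by swapping the estimator and using a grid over, for example, `C` and `kernel`.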