# K-Prototype For Enrollment

This model is deployed for the marketing team, to show not only where we should target geofencing but also what to target it with. Given the sheer number of programs, it helps to know who our students are, where they live, how old they are, and so on, and to use that information for targeted marketing when possible. By combining a K-Prototype algorithm with the Haversine formula, I was able to identify the distinct clusters of our students and, in the interactive map I created, allow filtering by distance from the college.

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [None]:
%%html
<style>
table {float:left}
</style>

In [None]:
# Import module for data visualization
from plotnine import *
import plotnine

# Import module for k-prototype cluster
from kmodes.kprototypes import KPrototypes

# Format scientific notation from Pandas
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [None]:
# Load data and lowercase the column names
enroll = (pd.read_csv('202110 - 202280 FA, SP, SU HC 20th-D.csv', encoding = 'cp1252')
              .rename(columns = str.lower)
         )

# For practice, isolate a single semester as the K-Prototype algorithm is time intensive
enroll = enroll[enroll['term'] == 202110]

# Sort values by term and id. The result is assigned back because sort_values
# returns a new dataframe rather than sorting in place
enroll = enroll.sort_values(['term', 'id'], ascending = [True, True]).reset_index(drop = True)

## Data Exploration

In [None]:
# Examine missing data
miss_data = pd.DataFrame({'# missing':enroll.isna().sum()})

miss_data['% rep missing'] = enroll.isna().sum()/len(enroll) * 100

miss_data

### Columns To Eliminate Or Impute

  * *pidm*, *lname*, *fname*, *mi*, *rescode*, *degcode*, *majr*, *mrtl*, *ecode*, *race_desc*, *cnty*, *addr#*, *edgoal*, *goaldesc*, *program*, *dob*, *confid*
    * Remove all of these because they are either unnecessary for the k-prototype algorithm or repetitive
  * *age*: Missing just one value. Therefore, I keep only the rows where enroll['age'].notna()
  * *gender*: Missing just five values. Therefore, I keep only the rows where enroll['gender'].notna()
  * *cnty_desc1*: Missing 165 values (0.432%). Therefore, I keep only the rows where enroll['cnty_desc1'].notna()
  * *ethn_desc*: Missing 1725 values (4.514%). Therefore, I will create an additional category, 'missing'.
  * *mrtl*: While marital status may have some benefit in predictive analysis, we are missing nearly 62% of the entries, which is too much to impute given it's a categorical variable. It is unclear what value could be derived from the remaining 38%, so it is removed for now. After completing the first run-through, I might add this variable back in to see if the modeling produces any more insights.

In [None]:
# Pare down dataframe based on criteria above. The copy makes the later
# column assignments act on an independent dataframe
enroll = enroll[['id', 'term', 'age', 'totcr', 'status', 'stype', 'resd_desc', 'degree',
                 'majr_desc1', 'gender', 'ethn_desc', 'cnty_desc1']].copy()

# Fill missing values for *ethn_desc* with 'missing', effectively creating
# a new category
enroll['ethn_desc'] = enroll['ethn_desc'].fillna('missing')

# Change the *term* column to a string as it is a label, not actually an 
# integer
enroll['term'] = enroll['term'].astype('str')

# Filter out the NaN values as there were not enough to warrant
# data imputation
mask1 = enroll['age'].notna()
mask2 = enroll['gender'].notna()
mask3 = enroll['cnty_desc1'].notna()

enroll = enroll[mask1 & mask2 & mask3]

## The Data

After removing the extraneous data and addressing missing values, we are left with 38044 observations and 12 columns. The model will be run without the *id* column as it provides no predictive value. The *term* column will be used as a filter to assess whether the clusters of students who enroll vary depending on the semester (e.g. Fall vs. Spring semester).

Beyond these two columns, there are eight categorical variables and two float variables. The float variables will be centered and scaled. The categorical variables are passed to the K-Prototype algorithm as they are, identified by their column positions, so no indicator variables are needed.

Let's inspect the nature of the categorical variables.


In [None]:
# Identify the number of unique values for each category, ignoring
# the 'id' and 'term' column
enroll.select_dtypes('object').nunique()

#for i in list(enroll['ethn_desc'].unique()):
#    print("  * " + "'" + i + "'")

The *status* column has four unique categories:

  * 'LH': Students who are enrolled in fewer than 6 credits (< 6)
  * 'HT': Students who are enrolled in at least 6 credits but fewer than 12 (6 <= x < 12)
  * 'FT': Students who are enrolled in *at least* 12 credits. (>= 12)
  * 'W': Students who withdrew from the semester
  
*stype* has six unique categories:

  * 'A': Auditors
  * 'G': Guest students
  * 'H': High school students
  * 'N': New Students
  * 'T': Transfer students
  * 'C': Continuing students
  
*resd_desc* has 10 categories:

  * 'Kansas Resident-out of BC': Student who lives in KS but not in the Co.
  * 'Out of state Resident': Student who is not from KS
  * 'International': International student
  * 'County Resident': Co. resident
  * 'Undocumented-out of BC': Undocumented student who is not a Co. resident
  * 'Appl for Perm Res/Asylum/TPS': Perm Resident?
  * 'Undocumented-BC': Undocumented student from Co.
  * 'Appl for Perm Res/Asylum OOS': Perm Resident?
  * 'Appl for Perm Res/Asylum': Perm Resident?
  * 'App for Per Res/Asylum OOS/TPS': Perm Resident?

*degree* has 14 categories:
  * 'Associate in Arts'
  * 'Associate in Science'
  * 'Associate in Applied Science'
  * 'Associate in General Studies'
  * 'Non Degree Seeking'
  * '30 but less than 45 cr hrs'
  * 'Certification'
  * '45 but less than 60 cr hrs'
  * 'Preparatory Coursework'
  * '16 but less than 30 cr hrs'
  * 'Degree-seeking non-BCCC'
  * 'CERTB 30-44 cr hrs'
  * 'CERTA less than 30 cr hrs'
  * 'CERTC 45-59 cr hrs'
  
*majr_desc1* has 117 categories:
  * 'Psychology'
  * 'Pre-Physical Therapy'
  * 'Pre-Engineering'
  * 'Pre-Computer Science'
  * 'Pre-Nursing/Health Science'
  * 'Political Science'
  * 'Mathematics'
  * 'Accounting'
  * 'Liberal Arts'
  * 'Elementary Education'
  * 'Nursing'
  * 'Fire Science'
  * 'Pre-Veterinarian'
  * 'Business Technology'
  * 'History'
  * 'Sociology/Social Work'
  * 'Business Administration'
  * 'Undeclared'
  * 'Music'
  * 'Elementary Education/BEST'
  * 'Internetworking Management'
  * 'Biological Science'
  * 'Engineering Technology'
  * 'Pre-Medicine'
  * 'English/Literature'
  * 'Theatre Performance'
  * 'Criminal Justice'
  * 'Pre-Physician Assistant'
  * 'Software Development'
  * 'Music Education'
  * 'Agriculture'
  * 'Chemistry'
  * 'Health Sciences'
  * 'Culinary Arts'
  * 'Accounting Assistant'
  * 'Fine Arts and Communication'
  * 'Humanities,Soc/Beh Sciences'
  * 'Farm and Ranch Manangement'
  * 'Marketing, Mgmt, Entrepreneur'
  * 'Secondary Education'
  * 'Digital Media'
  * 'Cyber Security'
  * 'Art'
  * 'Athletic Training'
  * 'Interactive, Digital & 3D Tech'
  * 'Data Analytics'
  * 'Mass Communication-Sport Media'
  * 'Mass Communication-Radio/TV'
  * 'Mass Communication-Journalism'
  * 'Pre-Pharmacy'
  * 'Sport Management'
  * 'Foreign Language'
  * 'Web Development'
  * 'Physician Coding'
  * 'Speech Communication'
  * 'Hotel Management'
  * 'Physics'
  * 'Mass Communications'
  * 'Early Childhood Education'
  * 'Religion'
  * 'Automotive Technology'
  * 'Theatre Technical'
  * 'Exercise Science'
  * 'Economics'
  * 'Automotive Collision Repair'
  * 'Welding Technology'
  * 'Marketing'
  * 'Entrepreneurship'
  * 'Preparatory Cousework 1 year'
  * 'Eng Tech-Drafting'
  * 'Education and Public Services'
  * 'Agribusiness'
  * 'Science, Engineering, and Math'
  * 'Database Administration'
  * 'Electrical Apprenticeship'
  * 'Business Medical Specialist'
  * 'Livestock Mgmt/Merchandising'
  * 'Prep Coursework Grad 4 years'
  * 'Food Science Business'
  * 'Homeland Security'
  * 'Philosophy'
  * 'Business and Industry'
  * 'Restaurant Management'
  * 'Dance'
  * 'Unmanned Aircraft Systems'
  * 'Surveying Technology'
  * 'Pre Nursing'
  * 'Workforce Development'
  * 'Eng Tech-Manufacturing'
  * 'Food Science Technology'
  * 'Engineering Graph Technology'
  * 'Transfer Major'
  * 'Construction Trades Apprentice'
  * 'Nurse Aide'
  * 'Plumber/Pipefitter Apprentice'
  * 'Construction Technology'
  * 'Unified Teaching'
  * 'Culinary Arts-Culinarian'
  * 'Adv'd Emerg Med Tech'
  * 'Diesel Technology'
  * 'Culinary Arts-Sous Chef'
  * 'Early Childhood CDA'
  * 'Emergency Medical Technician'
  * 'Eng Tech-Industrial Controls'
  * 'Patient Care Pathways'
  * 'Preparatory Coursework 2 years'
  * 'Food Science and Safety'
  * 'Medication Aide'
  * 'MOS Test Prep'
  * 'Professional Culinary Arts'
  * 'Pre-Health Professions'
  * 'Oper Train'g Assist'd Living'
  * 'Early Ed Child Unified Edu'
  * 'Bus Admin 75 WSU'
  * 'Prof Culinary Arts-Sous Chef'
  * 'Prof Culinary Arts-Culinarian'
  * 'Fire Science Leadership'
  
*gender* has three categories:
  * 'M': Male
  * 'F': Female
  * 'N': Neutral?
 
*ethn_desc* has 9 categories:
  * 'Black'
  * 'missing'
  * 'Caucasian/White'
  * 'Hispanic'
  * 'Undeclared'
  * 'Mixed'
  * 'Asian'
  * 'American Indian/Alaskan'
  * 'Pacific Islander/Hawaiian'
  
*cnty_desc1* has 421 categories:
  * Due to the space the full list would take up, it is not included here. It is also not particularly helpful.

### Examine the numerical data

.describe() summarizes only the numerical columns of a dataframe by default.

In [None]:
enroll.describe().T

The average number of credit hours taken over all of these semesters is 8.518 with a standard deviation of 4.633. The average age is 23.692 with a standard deviation of 8.245. The minimum number of credit hours enrolled in is 0.000, while the minimum age of a student enrolled in the semesters examined here is 13.583. The maximum age and maximum credit hours enrolled are 87.250 and 29, respectively.

In [None]:
# Create dataframe of the stypes. It will include a percent rep column,
# and the total number of students in that student type (headcount).
# Naming the columns explicitly keeps this independent of the column names
# that value_counts().reset_index() produces in a given pandas version.
df_stype = (enroll['stype'].value_counts()
              .rename_axis('Stype')
              .reset_index(name = 'Total')
           )
df_stype['Percentage'] = df_stype['Total']/df_stype['Total'].sum()
df_stype = df_stype.sort_values('Total', ascending = True).reset_index(drop = True)


In [None]:
# This dataframe still focuses on stype but totals the 
# headcount (total), gives total credit hours per 
# stype (totcr), and give the average age (age). 

df_stype2 = (enroll.groupby('stype')
                   .agg({
                         'stype':'count',
                         'totcr':'sum',
                         'age':'mean'
                        }
                 ).rename(columns = {'stype':'total'}
                 ).reset_index()
                  .sort_values('total', ascending = True)
            )

df_stype2


In [None]:
# Plot data. Interesting thing I learned here. ggplot from R has been 
# imported into Python, which is awesome because I know it much 
# better than matplotlib.

plotnine.options.figure_size = (8, 4.8)
(
    ggplot(data = df_stype2) + 
      geom_bar(aes(x = 'stype',
                   y = 'totcr'),
               fill = np.where(df_stype2['stype'] == 'N', '#981220', '#80797c'),
               stat = 'identity') + 
      labs(title = 'Stype that has the highest crhr',
           x = 'Stype',
           y = 'Total credit hours') +
      scale_x_discrete(limits = df_stype2['stype'].tolist()) + 
      theme_minimal() +
      coord_flip()
)

**Standardizing the numerical data**

K-Prototype, like K-Means, measures the numerical features with Euclidean distance, so it is sensitive to their scale. Therefore, we need to standardize *age* and *totcr*: credit hours run from 0 to 29, while ages run from about 14 to 87.
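As a minimal sketch of what standardizing means here, the example below centers each feature at zero and rescales it to unit standard deviation, which is the transformation `StandardScaler` applies with its default settings. The ages and credit hours are made-up values, not rows from the enrollment file, and `standardize` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical sample: ages (years) and total credit hours for five students
ages = np.array([18.0, 19.5, 22.0, 35.0, 61.0])
credits = np.array([15.0, 12.0, 9.0, 6.0, 3.0])

def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    return (values - values.mean()) / values.std()

ages_z = standardize(ages)
credits_z = standardize(credits)

# After scaling, neither feature dominates the distance just because of its units
print(ages_z.round(3))
print(credits_z.round(3))
```

Because both columns end up with the same spread, a one-unit difference in age and a one-unit difference in credit hours contribute comparably to the numeric part of the distance.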

In [None]:
# The first step is to create a special enrollment dataframe (enroll_sp) 
# that has just the identifiers and the numeric values
enroll_sp = enroll[['id', 'term', 'age', 'totcr']]

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize standardscaler
std_scaler = StandardScaler()
 
# Standardize the numeric data (*age* and *totcr*) and write the scaled values
# into a copy of enroll by position, so every row keeps its original index.
# After the filtering above, enroll no longer has a 0..n-1 index; building a
# new dataframe and attaching the ids by index would mismatch rows, which is
# the likely source of duplicated rows and stray NaN values after a merge.
filt_df = enroll.copy()
filt_df[['age', 'totcr']] = std_scaler.fit_transform(enroll_sp[['age', 'totcr']].to_numpy())

# Keep one row per student and term, then drop the identifier columns, which
# are not used for clustering. The column order is otherwise unchanged.
filt_df = (filt_df.drop_duplicates(subset = ['id', 'term'])
               .drop(['id', 'term'], axis = 1)
               .dropna()
          )

# Finally, for the K-Prototype Algorithm, you have to store the index of the categorical
# columns. That is what we are doing below.
cat_columns_pos = [filt_df.columns.get_loc(i) for i in list(filt_df.select_dtypes('object').columns)]

The K-Prototype implementation takes an array as input, so we convert the dataframe (excluding the *id* and *term* columns) into a NumPy array.

In [None]:
# Convert dataframe to matrix
dfmatrix = filt_df.to_numpy()
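To make the two inputs concrete, here is a small sketch on a made-up four-column frame (the `toy` names and values are hypothetical, not the enrollment data). It shows that the categorical columns are identified by position rather than by name, and that a mixed frame converts to a single object-typed array.

```python
import pandas as pd

# Hypothetical three-row frame with the numeric columns first
toy = pd.DataFrame({
    'age': [0.5, -1.2, 0.7],
    'totcr': [1.1, -0.3, -0.8],
    'status': ['FT', 'HT', 'LH'],
    'gender': ['F', 'M', 'F'],
})

# Positions (not names) of the non-numeric columns
cat_cols = toy.select_dtypes(exclude='number').columns
toy_cat_pos = [toy.columns.get_loc(col) for col in cat_cols]

# A frame that mixes floats and strings becomes one object-dtype array
toy_matrix = toy.to_numpy()

print(toy_cat_pos)
print(toy_matrix.dtype, toy_matrix.shape)
```

The positions only stay valid as long as the array keeps the same column order as the dataframe they were computed from.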

### Determine Appropriate Number Of Clusters
Now, as is customary, we will use the *elbow method* to determine the appropriate number of clusters. With all-numerical data, as in K-Means, we would usually use the within-cluster sum of squared errors (WSSE) based on Euclidean distance. However, Euclidean distance cannot be used for categorical variables. Therefore, for K-Prototype, we will use a cost function that combines a distance for the numerical variables with a dissimilarity measure for the categorical variables.
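The sketch below illustrates the kind of mixed dissimilarity that the K-Prototype cost is built from: squared Euclidean distance on the numeric features plus a weight, usually written gamma, times the number of categorical mismatches. The function name, the example student, and the value of gamma are assumptions for illustration; the `kmodes` library calculates a gamma from the data when none is supplied, and its internals may differ in detail.

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """Squared Euclidean distance plus gamma times the categorical mismatches."""
    diff = np.asarray(x_num, dtype=float) - np.asarray(proto_num, dtype=float)
    numeric_part = float(np.sum(diff ** 2))
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part

# Hypothetical standardized student and cluster prototype
student_num = [0.5, -1.0]                 # scaled age, scaled totcr
student_cat = ['FT', 'F', 'Hispanic']
proto_num = [0.0, 0.0]
proto_cat = ['FT', 'M', 'Hispanic']

# 0.25 + 1.0 from the numeric part, one mismatch weighted by gamma = 0.5
d = mixed_dissimilarity(student_num, student_cat, proto_num, proto_cat, gamma=0.5)
print(d)
```

The total cost plotted in the elbow chart is this quantity summed over all students, each measured against the prototype of its own cluster.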

In [None]:
# Choose optimal K using the elbow method
cost = []

for cluster in range(1, 10):
    try:
        kprototype = KPrototypes(n_jobs = -1, n_clusters = cluster, init = 'Huang', random_state = 0)
        kprototype.fit_predict(dfmatrix, categorical = cat_columns_pos)
        cost.append(kprototype.cost_)
        #print('Cluster Initiation: {}'.format(cluster))
    except Exception:
        # Stop at the first cluster count that cannot be fit
        break
    
# Converting the results into a dataframe and plotting them. The cluster range
# is sized to the costs actually collected in case the loop stopped early.
df_cost = pd.DataFrame({'Cluster':range(1, len(cost) + 1), 'Cost':cost})

In [None]:
# Look at the % change in cost with each successive K group that is added.
(df_cost.assign(**{'% Change': df_cost['Cost'].pct_change() * 100})
        .fillna("")
)

In [None]:
# Create Scree Plot of the Cost function to choose optimal K value
plotnine.options.figure_size = (8, 4.8)
(
    ggplot(data = df_cost)+
    geom_line(aes(x = 'Cluster',
                  y = 'Cost'))+
    geom_point(aes(x = 'Cluster',
                   y = 'Cost'))+
    geom_label(aes(x = 'Cluster',
                   y = 'Cost',
                   label = 'Cluster'),
               size = 10,
               nudge_y = 1000) +
    labs(title = 'Optimal number of clusters with Elbow Method')+
    xlab('Number of Clusters k')+
    ylab('Cost')+
    theme_minimal()
)

Based on the *scree plot* above, K = 3 appears to be the optimal number of clusters, so that is where we started. However, I found three clusters insufficient for driving insight. After experimenting with different numbers of clusters, I settled on seven.

In [None]:
# Fit the clusters
kprototype = KPrototypes(n_jobs = -1, n_clusters = 7, init = 'Huang', random_state = 0)
kprototype.fit_predict(dfmatrix, categorical = cat_columns_pos)

In [None]:
# cluster centroids
kprototype.cluster_centroids_

# check the iteration of the clusters created
kprototype.n_iter_

# Check the cost of the clusters created
kprototype.cost_

### Interpreting The Clusters

In [None]:
# add the clusters to the dataframe
filt_df['id'], filt_df['term'] = enroll_sp['id'], enroll_sp['term']

# Name all seven clusters; mapping only the first three would leave the
# remaining cluster labels without a segment
segment_names = ['First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh']

filt_df['cluster labels'] = kprototype.labels_
filt_df['segment'] = filt_df['cluster labels'].map(dict(enumerate(segment_names)))

# Order clusters
filt_df['segment'] = pd.Categorical(filt_df['segment'], categories = segment_names)

Now we can dig into each cluster to see what characteristics are common among its students.

In [None]:
# First, create a dataframe with the original ages and total credit hours enrolled
# Essentially destandardizing them.
age_cr = enroll[['id', 'age', 'totcr']].reset_index(drop = True)

# Merge the filtered dataframe with the original ages and credit hours
integrated_clusters = (filt_df.merge(age_cr, how = 'left', on = 'id').drop(['age_x', 'totcr_x'], axis = 1)
                           [['id', 'term', 'age_y', 'totcr_y', 'status', 'stype', 'resd_desc', 
                             'degree', 'majr_desc1', 'gender', 'ethn_desc', 'cnty_desc1', 
                             'cluster labels', 'segment']]
                             .rename(columns = {'age_y':'age',
                                                'totcr_y':'totcr'})
                      )

integrated_clusters1 = integrated_clusters[integrated_clusters['segment'] == 'First']
integrated_clusters2 = integrated_clusters[integrated_clusters['segment'] == 'Second']
integrated_clusters3 = integrated_clusters[integrated_clusters['segment'] == 'Third']

In [None]:
cat_mode = {}
for i in list(integrated_clusters1.select_dtypes('object').columns):
    cat_mode[i] = integrated_clusters1[i].mode()

In [None]:
num_mean = {}

for i in list(integrated_clusters1.select_dtypes('float64').columns):
    num_mean[i] = integrated_clusters1[i].mean()

A quick perusal shows that just looking at the means and modes of these students is not going to be enough to glean anything helpful from each grouping. 
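Even so, laying the summaries for every segment side by side, rather than inspecting one segment at a time, makes it easier to judge where means and modes fall short. The following is a minimal sketch on made-up rows; `toy_clusters`, `first_mode`, and `profile` are hypothetical names, and the same pattern could be pointed at the clustered dataframe.

```python
import pandas as pd

# Hypothetical clustered students (not real enrollment rows)
toy_clusters = pd.DataFrame({
    'segment': ['First', 'First', 'First', 'Second', 'Second'],
    'age': [18.0, 19.0, 20.0, 34.0, 40.0],
    'totcr': [15.0, 12.0, 15.0, 6.0, 3.0],
    'status': ['FT', 'FT', 'FT', 'HT', 'LH'],
    'stype': ['N', 'N', 'C', 'C', 'C'],
})

def first_mode(series):
    """Most frequent value; ties resolve to the first value in sorted order."""
    return series.mode().iloc[0]

# One row per segment: headcount, numeric means, and categorical modes
profile = toy_clusters.groupby('segment').agg(
    headcount=('age', 'size'),
    mean_age=('age', 'mean'),
    mean_totcr=('totcr', 'mean'),
    top_status=('status', first_mode),
    top_stype=('stype', first_mode),
)

print(profile)
```

A table like this shows at a glance which segments differ only in one or two columns, which is usually where a finer breakdown is needed.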

# Longitude, Latitude Converter

In this script, we use the requests library to send a GET request to the OpenStreetMap Nominatim API. We pass the address as a query parameter and limit the results to one match. We also specify the response format as JSON.

Once we receive the JSON response, we extract the latitude and longitude from the first match, if one exists. If there are no matches, we leave the latitude and longitude fields empty.

Finally, we write the address, latitude, and longitude to the output CSV file.

Note that OpenStreetMap Nominatim has usage limits, so you should read the usage policy and make sure that your use of the service complies with the terms of use.

This process did work for converting the addresses to longitude and latitude with OpenStreetMap. Its usage policy asks users not to run large batch jobs, and I'm not sure whether one semester counts as a large batch; in this case, it was about 6100 addresses. The service allows only one request per second, so the run took one hour and forty-two minutes, but to my surprise, it worked! Note that *Nominatim* is OpenStreetMap's geocoding service. The *geopy* library also provides a client for it, which is used later in the call geopy.geocoders.Nominatim(user_agent = 'MyCoder').

In [None]:
import csv
import time
import requests

# OpenStreetMap Nominatim API URL
NOMINATIM_URL = 'https://nominatim.openstreetmap.org/search'

# Nominatim's usage policy asks for an identifying User-Agent and no more
# than one request per second. The application name is a placeholder.
HEADERS = {'User-Agent': 'MyCoder'}

# CSV file containing addresses
input_file = '202310 Addresses Missing.csv'

# CSV file to write output
output_file = 'long_lat_202310_missing.csv'

# Open input file and output file
with open(input_file, 'r', newline='') as csv_input, open(output_file, 'w', newline='') as csv_output:
    # Create CSV reader and writer objects
    reader = csv.reader(csv_input)
    writer = csv.writer(csv_output)

    # Write header row to output file
    writer.writerow(['Address', 'Latitude', 'Longitude'])

    # Loop through each row in the input file
    for row in reader:
        # Skip blank rows
        if not row:
            continue

        # Get the address from the current row
        address = row[0]

        # Send a GET request to the OpenStreetMap Nominatim API
        response = requests.get(NOMINATIM_URL, headers=HEADERS, timeout=30,
                                params={'q': address, 'format': 'json', 'limit': 1})

        # Parse the JSON response; treat a failed request as no match
        json_data = response.json() if response.ok else []

        # Get the latitude and longitude from the JSON response
        if len(json_data) > 0:
            latitude = json_data[0]['lat']
            longitude = json_data[0]['lon']
        else:
            latitude = ''
            longitude = ''

        # Write the address, latitude, and longitude to the output file
        writer.writerow([address, latitude, longitude])

        # Pause so that requests stay at or below one per second
        time.sleep(1)


### The Code Below Is For Accessing Mapquest Geocode API

MapQuest did an outstanding job. You're allowed 15,000 requests a month, which is roughly two semesters' worth, and it found every single address and came back with the correct longitude and latitude. It is limited, mostly, to the United States. The address file has no headers, just four columns: the street address, the city, the state, and the zip code. It generally processes only about one row per second, so 6100 addresses take between an hour and 20 minutes and an hour and 40 minutes. I did one load at night and it ran a little faster. During the day, expect a semester of 6500 students to take two to three hours.
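As a quick illustration of the headerless four-column layout described above, the sketch below reads two made-up rows from an in-memory file and builds the single-line address string that gets sent to the geocoder. The sample addresses and the `build_addresses` helper are hypothetical.

```python
import csv
import io

# Hypothetical address file: no header; street, city, state, zip code
sample_file = io.StringIO(
    '123 Main St,Wichita,KS,67202\n'
    '"456 Oak Ave, Apt 2",Andover,KS,67002\n'
)

def build_addresses(handle):
    """Return one 'street, city, state zip' string per well-formed row."""
    addresses = []
    for row in csv.reader(handle):
        # Skip rows that do not have exactly the four expected columns
        if len(row) != 4:
            continue
        street, city, state, zip_code = (field.strip() for field in row)
        addresses.append(f'{street}, {city}, {state} {zip_code}')
    return addresses

addresses = build_addresses(sample_file)
print(addresses)
```

Quoting a street that contains a comma, as in the second row, keeps it in a single column so the row still has four fields.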

In [None]:
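import csv
import os
import requests

# MapQuest Geocoding API URL
MAPQUEST_URL = 'https://www.mapquestapi.com/geocoding/v1/address'

# API key, read from an environment variable so that it is not stored in the
# notebook (the variable name MAPQUEST_API_KEY is an assumption)
API_KEY = os.environ['MAPQUEST_API_KEY']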

# CSV file containing addresses
input_file = '202310 Addresses.csv'

# CSV file to write output
output_file = 'long_lat_202310.csv'

# Open input file and output file
with open(input_file, 'r') as csv_input, open(output_file, 'w', newline='') as csv_output:
    # Create CSV reader and writer objects
    reader = csv.reader(csv_input)
    writer = csv.writer(csv_output)

    # Write header row to output file. The city is kept as its own column
    # because the mapping section filters the addresses by city.
    writer.writerow(['address', 'city', 'longitude', 'latitude'])

    # Loop through each row in the input file
    for row in reader:
        # Get the address components from the current row
        street, city, state, zip_code = row

        # Build the address string
        address = f'{street}, {city}, {state} {zip_code}'

        # Send a GET request to the MapQuest Geocoding API
        response = requests.get(MAPQUEST_URL, params={'key': API_KEY, 'location': address}, timeout=30)

        # Parse the JSON response
        json_data = response.json()

        # Get the latitude and longitude from the first location of the
        # first result, if there is one
        results = json_data.get('results', [])
        if results and results[0].get('locations'):
            lat_lng = results[0]['locations'][0]['latLng']
            latitude = lat_lng['lat']
            longitude = lat_lng['lng']
        else:
            latitude = ''
            longitude = ''

        # Write the address, city, longitude, and latitude to the output file
        writer.writerow([address, city, longitude, latitude])

# Plot Longitude And Latitude

This code comes from the website below:

https://towardsdatascience.com/clustering-geospatial-data-f0584f0b04ec

The code below works splendidly. To make the geospatial clustering useful for our marketing department, I limited the scope to the metro area, which is where the majority of our students come from.

In [None]:
#!pip install folium
#!pip install geopy
#!pip install scikit-learn
#!pip install MiniSom

In [None]:
# Load libraries
import numpy as np
import pandas as pd

# For plotting
import matplotlib.pyplot as plt
import seaborn as sns

# for geospatial
import folium
import geopy

# for machine learning
from sklearn import preprocessing, cluster
import scipy

# for deep learning
import minisom

In [None]:
# Load Dataframe
long_lat = pd.read_csv('long_lat_202310.csv')

long_lat.columns = [i.title() for i in long_lat.columns]

# Filter dataframe to Wichita and the surrounding area. The cities kept 
# in the filtered data account for ##.##% of all students enrolled for 
# Spring 2023.

long_lat = (long_lat[long_lat['City'].isin(['Wichita', 'El Dorado', 'Andover', 'Derby', 'Augusta', 'Rose Hill',
                                            'Haysville', 'Valley Center', 'Bel Aire', 'Towanda', 'Mulvane', 'Benton',
                                            'Douglass', 'Park City', 'Maize', 'Goddard', 'Kechi'
                                           ])]
                                     .reset_index(drop = True)
           )


In [None]:
# Curtail dataframe to just the Address, Latitude, and Longitude. 
# The new 'Count' column is used to create a category for the map 
# and the K-Means geospatial algorithm.
long_lat = (pd.DataFrame(long_lat.groupby(['Address', 'Latitude', 'Longitude'])['Address'].count())
              .rename(columns = {'Address':'Count'})
              .reset_index()
           )

# Bucket each address by how many students share it
categories = []

for i in long_lat['Count']:
    if i <= 2:
        categories.append('Low')
    elif i < 7:
        categories.append('Med')
    else:
        categories.append('High')
        
long_lat['Categories'] = categories

#pd.DataFrame(long_lat.groupby('Categories')['Address'].count())

In [None]:
# Get the coordinates for the city
city = 'Wichita'

# Get location
locator = geopy.geocoders.Nominatim(user_agent = 'MyCoder')
location = locator.geocode(city)
print(location)

# Keep latitude and longitude only
location = [location.latitude, location.longitude]
print('[lat, long]:', location)

Now *folium* will be used to create a map. It is a convenient package that allows us to plot interactive maps without needing to load a shapefile. Each address is plotted as a point whose size is proportional to the number of students at that address (*Count*) and whose color is based on its category (*Categories*). I'm also going to add a small piece of HTML code to the default map to display the legend.

In [None]:
x, y = 'Latitude', 'Longitude'
color = 'Categories'
size = 'Count'
popup = 'Address'
data = long_lat.copy()

# create color column
lst_colors = ['red', 'green', 'orange']
lst_elements = sorted(list(long_lat[color].unique()))
data['color'] = data[color].apply(lambda x: lst_colors[lst_elements.index(x)])


# create size column (scaled to a marker radius in pixels)
scaler = preprocessing.MinMaxScaler(feature_range = (3, 15))
data['size'] = scaler.fit_transform(data[size].values.reshape(-1, 1)).reshape(-1)

# Initialize the map with the starting location
map_ = folium.Map(location = location, tiles = 'cartodbpositron', zoom_start = 11)

# add points. CircleMarker takes its size through the radius argument
data.apply(lambda row: folium.CircleMarker(location = [row[x], row[y]], popup = row[popup],
                                           color = row['color'], fill = True,
                                           radius = row['size']).add_to(map_), axis = 1)

# add html legend
legend_html = """<div style = "position:fixed; bottom:10px; left:10px; border:2px solid black; 
                z-index:9999; font-size:14px;">&nbsp;<b>"""+color+""":</b>
                <br>"""

for i in lst_elements:
    legend_html = legend_html+"""&nbsp;<i class="fa fa-circle 
                              fa-1x" style = "color:"""+lst_colors[lst_elements.index(i)]+"""">
                              </i>&nbsp;"""+str(i)+"""<br>"""

legend_html = legend_html+"""</div>"""

map_.get_root().html.add_child(folium.Element(legend_html))

# plot the map
map_

### Create Scree Plot To Find Best K

In [None]:
X = long_lat[["Latitude","Longitude"]]
max_k = 10

## iterations
distortions = [] 
for i in range(1, max_k+1):
    if len(X) >= i:
        model = cluster.KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
        model.fit(X)
        distortions.append(model.inertia_)
        
## best k: the largest second difference, i.e. where the curve bends most.
## Entry i of np.diff(distortions, 2) is centered on k = i + 2.
k = int(np.argmax(np.diff(distortions, 2))) + 2

## plot
fig, ax = plt.subplots()

ax.plot(range(1, len(distortions)+1), distortions)
ax.axvline(k, ls='--', color="red", label="k = "+str(k))
ax.set(title='The Elbow Method', xlabel='Number of clusters', 
       ylabel="Distortion")
ax.legend()
ax.grid(True)

plt.show()

### Create the K-Means Geospatial Algorithm And Apply It To The Data

In [None]:
# Set number of clusters *k* at 15
k = 15

# Initialize k-means model
model = cluster.KMeans(n_clusters=k, init='k-means++')

# clustering
X = long_lat[["Latitude","Longitude"]]

dtf_X = X.copy()

# find real centroids
dtf_X["cluster"] = model.fit_predict(X)

closest, distances = scipy.cluster.vq.vq(model.cluster_centers_, 
                     dtf_X.drop("cluster", axis=1).values)

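# Flag the row that sits closest to each centroid. A single positional
# assignment avoids chained indexing, which can silently fail to update
dtf_X["centroids"] = 0
dtf_X.iloc[closest, dtf_X.columns.get_loc("centroids")] = 1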

# add clustering info to the original dataset
long_lat[["cluster","centroids"]] = dtf_X[["cluster","centroids"]]

long_lat.sample(5)

# View number of addresses in each cluster grouping
(pd.DataFrame(long_lat.groupby('cluster')['Address'].count())
   .rename(columns = {'Address':'Count'})
   .reset_index()
   .sort_values('Count', ascending = False)
)

In [None]:
## plot
fig, ax = plt.subplots()
sns.scatterplot(x="Latitude", y="Longitude", data=long_lat, 
                palette=sns.color_palette("bright",k),
                hue='cluster', size="centroids", size_order=[1,0],
                legend="brief", ax=ax).set_title('Clustering(k=' + str(k) + ')')

th_centroids = model.cluster_centers_
ax.scatter(th_centroids[:,0], th_centroids[:,1], s=50, c='black', 
           marker="x")


In [None]:
model = cluster.AffinityPropagation()

### Plot the K-Means Clusters On The Geospatial Map

In [None]:
# Create objects for plotting
x, y = 'Latitude', 'Longitude'
color = 'cluster'
size = 'Count'
popup = 'Address'
marker = 'centroids'
data = long_lat.copy()

# create color column (one random color per cluster). The list of clusters
# has to be built first so that there is a color for every cluster
lst_elements = sorted(list(long_lat[color].unique()))
lst_colors = ['#%06X' % np.random.randint(0, 0xFFFFFF) for i in range(len(lst_elements))]
data['color'] = data[color].apply(lambda x: lst_colors[lst_elements.index(x)])

# create size column (scaled to a marker radius in pixels)
scaler = preprocessing.MinMaxScaler(feature_range = (3, 15))
data['size'] = scaler.fit_transform(data[size].values.reshape(-1, 1)).reshape(-1)

# Initialize the map with the starting location
map_ = folium.Map(location = location, tiles = 'cartodbpositron', zoom_start = 11)

# add points. CircleMarker takes its size through the radius argument
data.apply(lambda row: folium.CircleMarker(location = [row[x], row[y]], popup = row[popup],
                                           color = row['color'], fill = True,
                                           radius = row['size']).add_to(map_), axis = 1)

# add html legend
legend_html = """<div style = "position:fixed; bottom:10px; left:10px; border:2px solid black; 
                z-index:9999; font-size:14px;">&nbsp;<b>"""+color+""":</b>
                <br>"""

for i in lst_elements:
    legend_html = legend_html+"""&nbsp;<i class="fa fa-circle 
                              fa-1x" style = "color:"""+lst_colors[lst_elements.index(i)]+"""">
                              </i>&nbsp;"""+str(i)+"""<br>"""

legend_html = legend_html+"""</div>"""

lst_elements = sorted(list(long_lat[marker].unique()))
data[data[marker]==1].apply(lambda row: folium.Marker(location = [row[x], row[y]], popup = row[marker], draggable = False, icon = folium.Icon(color = 'black')).add_to(map_), axis =1)
map_.get_root().html.add_child(folium.Element(legend_html))

# plot the map

map_

In [None]:
long_lat.to_excel(r'C:\pathway_to_files\Enrollments\K-Means_202310.xlsx', index = False, header=True)

### Calculate The Distance From Metro Campus For All of Our Students

We use the Haversine formula to calculate the distance from each student address to our two major campuses (Andover and El Dorado). If the address is closer to Andover than to El Dorado, we store the distance to Andover; otherwise we store the distance to El Dorado. This way, we store the shortest distance from the student's home to a major campus.
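For larger files, the same calculation can be written directly on NumPy arrays. The sketch below is one possible array-based variant, not the approach used in this notebook: the function name and the two student coordinates are hypothetical, while the campus coordinates and the Earth radius in miles are the ones used in this section.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars or NumPy arrays in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Hypothetical student coordinates (degrees)
student_lat = np.array([37.70, 37.81])
student_lon = np.array([-97.30, -96.90])

# Distance to each campus, then the smaller of the two for every student
to_andover = haversine_miles(student_lat, student_lon, 37.7062631, -97.1276288)
to_eldorado = haversine_miles(student_lat, student_lon, 37.8073112, -96.8851877)
nearest = np.minimum(to_andover, to_eldorado)

print(nearest.round(2))
```

Computing both campus distances as whole columns and taking `np.minimum` replaces the row-by-row comparison with two array operations.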

In [None]:
import pandas as pd
import numpy as np
from math import radians, sin, cos, sqrt, atan2

# Fixed point coordinates
andover_lat = 37.7062631  # Andover BCC Latitude
andover_lon = -97.1276288 # Andover BCC longitude
eldo_lat = 37.8073112 # Eldo BCC Latitude
eldo_lon = -96.8851877 # Eldo BCC Longitude

# Haversine formula to calculate distance between two points
def calculate_distance(fix_lat1, fix_lon1, dflat, dflon, fix_lat2, fix_lon2):
    # Convert latitude and longitude to radians
    rlat1, rlon1, rlat2, rlon2, rlat3, rlon3 = map(radians, [fix_lat1, fix_lon1, dflat, dflon, fix_lat2, fix_lon2])

    # Haversine formula
    dlon = rlon2 - rlon1
    dlat = rlat2 - rlat1
    a = sin(dlat / 2)**2 + cos(rlat1) * cos(rlat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance1 = 3958.8 * c # Earth's mean radius is about 3958.8 miles, so distances are in miles
    
    dlon2 = rlon2 - rlon3
    dlat2 = rlat2 - rlat3
    a2 = sin(dlat2 / 2)**2 + cos(rlat3) * cos(rlat2) * sin(dlon2 / 2)**2
    c2 = 2 * atan2(sqrt(a2), sqrt(1 - a2))
    distance2 = 3958.8 * c2
    
    if distance1 < distance2:
        distance = distance1
    else:
        distance = distance2
        
    return distance

# Read the clustered addresses written in the previous section into a pandas DataFrame
df = pd.read_excel(r'C:\pathway_to_files\Enrollments\K-Means_202310.xlsx')

# Calculate distance from fixed point for each row
df['Distance'] = np.vectorize(calculate_distance)(andover_lat, andover_lon, df['Latitude'], df['Longitude'], eldo_lat, eldo_lon)

# Write DataFrame with Distance to an Excel file
df.to_excel(r'C:\pathway_to_file\Enrollments\K-Means_202310_distance.xlsx', index = False, header = True)