## Guest Instructor - HR Analytics

### Day 1: Data gathering & ETL (Excel, Tableau)
Intro to HR Analytics
- Common Names: People Analytics, Talent Analytics, People Insights
- Organization Structure and Functions
- Business Areas
    - Talent Acquisition
    - HR Management
    - Operational Effectiveness
    - Diversity & Inclusion
    - Learning & Development
    - Performance Evaluation, Compensation etc.
   
Webscraping using Excel

Gathering & exporting city coordinates from Tableau

ETL in Python (S. 1)

Files: ```canadacities.csv```

### Day 2: Clustering Algorithms (Python, Jupyter Notebook)
Working in HR Analytics
- Vendor Tools
    - LinkedIn Insights
    - Glassdoor, Indeed
    - Gartner, Talent Neuron
- Popular Projects
    - Sentiment Analysis
    - Location Strategy
    - Skills Mapping
    - Career Pathing
    - Surveying, Email Nudging, etc.

Clustering Algorithms in Python (S. 2-5)

Take home: Do the same for US cities ```uscities.csv```

Can you create a function that will cluster according to your needs at the click of a button?

### Day 3: Reporting and Driving Insights (Tableau)
Career Advice
- Interview Process and Prepping
- Working Style: High-Level vs. Detailed, Process-oriented vs. Results-oriented
- Mentorship: WeCareer
    
Create Final Report in Tableau
- Calculated Field: Number of Cities in Cluster
- Filters: City Type, Population
- Parameters: Include Outliers T|F
- Formatting: Aliases, Colors, Shape and size
    
Take home: Replicate completed report

In [None]:
!pip install hdbscan
!pip install folium

##### Python Packages
<b>HDBSCAN</b>: https://hdbscan.readthedocs.io/en/latest/index.html<br>
<b>Folium</b>: https://python-visualization.github.io/folium/

In [None]:
import pandas as pd
import numpy as np
import hdbscan
import folium
import re
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.neighbors import KNeighborsClassifier
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
plt.style.use('ggplot')

## Step 1. Preparing the Data
#### Load, Clean, Extract

In [None]:
filepath = '../Data/canadacities.csv'
df = pd.read_csv(filepath)

df

In [None]:
df.isnull()

In [None]:
df.duplicated()
# df.duplicated().any()

In [None]:
df.describe()

In [None]:
# Option 1: Drop cities based on population threshold
df_drop1 = df.drop(df[df.population_2020 < 50000].index)

# Option 2: Drop cities based on city type
df_drop2 = df.drop(df[df.type == 'CA'].index)

In [None]:
# extract coordinates for faster computation
X = np.array(df[['lat','long']], dtype='float64')

X

### Optional: Assign weights to cities based on size (Core Predict)

##### Option 3. Using Population
```weight = ln(population) - q```
- Growth slows as a population approaches its carrying capacity (limited resources), so the natural log is used to dampen the effect of very large populations
- q is a qualifier: cities whose weight falls to 0 because they do not meet the population threshold are dropped

##### Option 4. Using Class Ranking
```weight = 1 / ranking * 5```
- For some datasets, cities are ranked 1-4 based on their level i.e. capital city, metropolitan, town, county etc.
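
To make the two weight formulas concrete, here is a small sketch in plain Python. The populations are made-up illustrative values, `q = 8` mirrors the code cell in this section, and the helper names `population_weight` and `ranking_weight` are hypothetical. Clamping the weight at 0 is an assumption added here so the result can always be used as a repeat count.

```python
import math

def population_weight(population, q=8):
    """Option 3: truncate ln(population) - q to an integer, never below 0."""
    return max(int(math.log(population) - q), 0)

def ranking_weight(ranking):
    """Option 4: truncate 1 / ranking * 5 to an integer."""
    return int(1 / ranking * 5)

# Made-up populations, for illustration only
for population in (5_000, 50_000, 500_000, 5_000_000):
    print(population, population_weight(population))

# Rankings 1-4 as described above
for ranking in (1, 2, 3, 4):
    print(ranking, ranking_weight(ranking))
```

A weight of 0 removes the city once rows are repeated by weight, which is how the qualifier q acts as a population threshold.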

In [None]:
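# Option 3: log-population weight with q = 8; clip so small cities get 0 instead of a negative repeat count
df['weight'] = (np.log(df.population_2020) - 8).clip(lower=0).astype('int')

# Option 4: alternative to Option 3, requires a 'ranking' column; uncomment to use it instead
# df['weight'] = (1 / df.ranking * 5).astype('int')

# duplicate rows by weight (rows with weight 0 are dropped)
df_weight = df.reindex(df.index.repeat(df.weight)).reset_index(drop=True)

# carry out the clustering using df_weight; Step 5 drops the duplicates before saving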

## Step 2. Validating the Data
#### Visualizations with Folium

In [None]:
# Geographic coordinates distribution
plt.scatter(X[:,0], X[:,1], alpha=0.2, s=50)

In [None]:
# Geolocation mapping
m = folium.Map(location=[df.lat.mean(), df.long.mean()], zoom_start=9, tiles='Stamen Toner')

In [None]:
for _, row in df.iterrows():
    folium.CircleMarker(location=[row.lat, row.long]).add_to(m)

In [None]:
# formatting - regex strips every character that is not a letter or a space from the popup text
for _, row in df.iterrows():
    folium.CircleMarker(location=[row.lat, row.long],
                        radius=5,
                        popup=re.sub(r'[^a-zA-Z ]+', '', row.city),
                        color='#1787FE',
                        fill=True,
                        fill_color='#1787FE'
                       ).add_to(m)

m

In [None]:
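# color palette for clusters; `cols` is not defined elsewhere in this notebook, so any list of hex colors works here
cols = ['#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231',
        '#911eb4', '#46f0f0', '#f032e6', '#bcf60c', '#fabebe']

def create_map(df, cluster_column):
    m = folium.Map(location=[df.lat.mean(), df.long.mean()], zoom_start=9, tiles='Stamen Toner')
    
    for _, row in df.iterrows():
        if row[cluster_column] == -1:
            # outliers (label -1) are drawn in black
            cluster_color = '#000000'
        else:
            # cycle through the palette when there are more clusters than colors
            cluster_color = cols[row[cluster_column] % len(cols)]
            
        folium.CircleMarker(location=[row['lat'], row['long']],
                            radius=5,
                            popup=row[cluster_column],
                            color=cluster_color,
                            fill=True,
                            fill_color=cluster_color
                           ).add_to(m)
        
    return m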

## Step 3. Exploring Solutions
### Solution A. DBSCAN
Density-Based Spatial Clustering of Applications with Noise<br>
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

#### Parameters
```eps``` = epsilon, the local radius used to expand clusters
- DBSCAN never takes a step larger than eps, but by chaining multiple steps a cluster can grow much larger than eps
- kilometers to radians conversion: eps = x/6371 (Earth's mean radius in km)
- miles to radians conversion: eps = x/3959 (Earth's mean radius in miles)
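
The conversions above can be sketched in plain Python, together with the great-circle (haversine) distance that the `haversine` metric measures. The function names and the sample points are assumptions made for illustration; only the radii 6371 and 3959 come from the notes above.

```python
import math

EARTH_RADIUS_KM = 6371.0
EARTH_RADIUS_MILES = 3959.0

def km_to_radians(km):
    """Convert a distance on the Earth's surface from kilometers to radians."""
    return km / EARTH_RADIUS_KM

def miles_to_radians(miles):
    """Convert a distance on the Earth's surface from miles to radians."""
    return miles / EARTH_RADIUS_MILES

def haversine_radians(lat1, lon1, lat2, lon2):
    """Great-circle distance in radians between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * math.asin(math.sqrt(a))

# eps for a 40 km radius, expressed in radians
eps = km_to_radians(40)

# Two made-up points a quarter of the way around the equator
d = haversine_radians(0.0, 0.0, 0.0, 90.0)
print(eps, d, d * EARTH_RADIUS_KM)
```

Two cities count as neighbors when their haversine distance in radians is at most `eps`, which is why the coordinates are converted with `np.radians` before fitting.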

In [None]:
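x = 40  # neighborhood radius in km; example value taken from the HDBSCAN cell in Solution B, tune as needed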
model1 = DBSCAN(eps=x/6371, min_samples=2, algorithm='ball_tree', metric='haversine').fit(np.radians(X))
class_predictions = model1.labels_

class_predictions

In [None]:
df['CLUSTER_DBSCAN'] = class_predictions

df

In [None]:
# label -1 marks outliers, so it is excluded from the cluster count
print(f'Number of clusters: {len(np.unique(class_predictions[class_predictions!=-1]))}')
print(f'Number of outliers: {len(class_predictions[class_predictions==-1])}')
print(f'Silhouette score: {silhouette_score(X[class_predictions!=-1], class_predictions[class_predictions!=-1])}')

In [None]:
m1 = create_map(df, 'CLUSTER_DBSCAN')

m1

### Solution B. HDBSCAN
Hierarchical DBSCAN allows varying density of clusters<br>
https://hdbscan.readthedocs.io/en/latest/parameter_selection.html

#### Parameters
```min_samples``` = minimum number of neighbours to a core point<br>
```min_cluster_size``` = minimum size a final cluster can be

- Increasing min_samples will increase the size of the clusters, but it does so by discarding data as outliers via DBSCAN
- Increasing min_cluster_size while keeping min_samples small keeps those outliers and merges any small clusters with their most similar neighbor until all clusters are above min_cluster_size
- This is the H part of HDBSCAN.

In [None]:
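x = 40  # cluster_selection_epsilon in km
y = 5   # min_cluster_size
z = 2   # min_samples
# haversine metric so that the epsilon in radians and the radian coordinates are interpreted consistently
model2 = hdbscan.HDBSCAN(min_cluster_size=y, min_samples=z, metric='haversine', cluster_selection_epsilon=x/6371).fit(np.radians(X))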
class_predictions = model2.labels_

df['CLUSTER_HDBSCAN'] = class_predictions

# label -1 marks outliers, so it is excluded from the cluster count
print(f'Number of clusters: {len(np.unique(class_predictions[class_predictions!=-1]))}')
print(f'Number of outliers: {len(class_predictions[class_predictions==-1])}')
print(f'Silhouette score: {silhouette_score(X[class_predictions!=-1], class_predictions[class_predictions!=-1])}')

In [None]:
m2 = create_map(df, 'CLUSTER_HDBSCAN')

m2

### Solution C. Hybrid (HDBSCAN + K-Nearest Neighbors)
To eliminate outliers, a two-step hybrid method assigns each HDBSCAN outlier to the pre-existing cluster of its nearest neighbor, using a k-nearest neighbors classifier with k = 1.<br>
https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
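
The reassignment idea can be sketched without any library: each point labelled -1 takes the label of its single nearest clustered point, which is what `KNeighborsClassifier(n_neighbors=1)` does in this section. The sketch assumes plain Euclidean distance on latitude and longitude in degrees, the coordinates are made up, and the name `assign_outliers` is hypothetical.

```python
import math

def assign_outliers(points, labels):
    """Give every point labelled -1 the label of its nearest clustered point."""
    clustered = [(p, lab) for p, lab in zip(points, labels) if lab != -1]
    result = []
    for point, label in zip(points, labels):
        if label == -1:
            # nearest neighbor with k = 1: pick the closest clustered point
            _, label = min(clustered, key=lambda item: math.dist(point, item[0]))
        result.append(label)
    return result

# Made-up coordinates: two clusters and one outlier close to the first cluster
points = [(43.7, -79.4), (43.6, -79.6), (45.5, -73.6), (45.4, -73.7), (44.0, -78.0)]
labels = [0, 0, 1, 1, -1]
print(assign_outliers(points, labels))
```

The sketch needs at least one clustered point to work; if every label is -1 there is nothing to assign the outliers to.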

In [None]:
# instantiate, split, train
classifier = KNeighborsClassifier(n_neighbors=1)

df_train = df[df.CLUSTER_HDBSCAN!=-1]
df_predict = df[df.CLUSTER_HDBSCAN==-1]

X_train = np.array(df_train[['lat', 'long']], dtype='float64')
y_train = np.array(df_train['CLUSTER_HDBSCAN'])
X_predict = np.array(df_predict[['lat', 'long']], dtype='float64')

classifier.fit(X_train, y_train)
predictions = classifier.predict(X_predict)

In [None]:
predictions

In [None]:
# appending cluster_hybrid column
df['CLUSTER_HYBRID'] = df['CLUSTER_HDBSCAN']
df.loc[df.CLUSTER_HDBSCAN==-1, 'CLUSTER_HYBRID'] = predictions

In [None]:
df

In [None]:
m3 = create_map(df, 'CLUSTER_HYBRID')

m3

## Step 4. Comparing Solutions

#### Using a simple histogram to compare and determine the optimal solution
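
Alongside the histogram, a few summary numbers per solution can make the comparison easier to read. This is a minimal sketch using only the standard library; the helper name `summarize_labels` and the example labels are assumptions for illustration.

```python
from collections import Counter

def summarize_labels(labels):
    """Return cluster count, outlier count, and the sizes of the real clusters."""
    sizes = Counter(labels)
    outliers = sizes.pop(-1, 0)  # label -1 marks outliers
    return {
        'clusters': len(sizes),
        'outliers': outliers,
        'largest': max(sizes.values(), default=0),
        'smallest': min(sizes.values(), default=0),
    }

# Made-up labels: three clusters and two outliers
example_labels = [0, 0, 0, 1, 1, -1, 2, 2, 2, 2, -1]
print(summarize_labels(example_labels))
```

With the notebook's data, the same helper should work on each cluster column in turn, for example `summarize_labels(df['CLUSTER_DBSCAN'])`.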

In [None]:
df['CLUSTER_DBSCAN'].value_counts().plot.hist(bins=70, alpha=0.4, label='DBSCAN')
df['CLUSTER_HDBSCAN'].value_counts().plot.hist(bins=70, alpha=0.4, label='HDBSCAN')
df['CLUSTER_HYBRID'].value_counts().plot.hist(bins=70, alpha=0.4, label='Hybrid')

plt.legend()
plt.title('Comparing DBSCAN, HDBSCAN, and Hybrid Approaches')
plt.xlabel('Cluster Size')

## Step 5. Save Data to File

In [None]:
#organize
df = df.drop_duplicates().sort_values(by=['CLUSTER_HDBSCAN', 'city'])
#save
df.to_csv('canadacities_CLUSTER.csv', encoding='utf-8', index=False)

## Extension. Create a Function to Run All the Clustering Algorithms with One Click
What inputs and outputs does the function take?

What are the parameters that the users can decide on?

How to locate, load, and save data?

In [None]:
# hint

import os

def city_cluster(files, eps, min_cl, min_sp):
    for file in files:
        df = pd.read_csv(f'Data/{file}')
        X = np.array(df[['lat', 'long']], dtype='float64')  # same coordinate extraction as Step 1
        """
        
        
        
        """
        
        
        
    print(os.listdir('Data'))
    

### END