## Starting point

- From the EDA: we identified redundant features (many geographical fields) and irrelevant ones (because of the large amount of missing data). We can use this information in this second approach to filter and clean the data in the same way.
- From the initial model: we built a pure classifier using the selected features and corrected the imbalance using different techniques. The model that gave us the best results was the one using SMOTE as a resampling technique. We will use this model and the confusion matrix as a baseline to compare whatever is produced with this second approach.

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()  # for plot styling
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn import preprocessing
from time import time
from sklearn.metrics import precision_score, recall_score
from imblearn.under_sampling import RandomUnderSampler

## Brainstorming about different clustering algorithms

1. KMeans works best when the clusters are roughly spherical, and we do not know the shape of the data in this multidimensional space.
2. DBSCAN: getting the parameters right can be hard, so we discard this method.
3. If time allows: try other promising clustering methods to check which one performs better.

Why not just partition by region? Do we have enough data per region? During the exploration we analyzed the distribution of the data by region, city, continent, etc. Most of these fields take a wide range of values (some more than 300), which makes it infeasible to train such a large number of models. Besides, whenever a new lead comes from a region not seen before, there will be no model for it.

In conclusion, we cannot base the clustering on a single field: if that information is missing for a new lead, we have no means of assigning the lead to a cluster, other than picking one at random.
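The "enough data per region" question can be checked by counting rows per category. The following is only a sketch on toy data: the column name `region` and the threshold `min_rows` are assumptions made for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy data: the column name "region" and the threshold below are assumptions.
toy = pd.DataFrame({"region": ["A"] * 50 + ["B"] * 5 + ["C"] * 2 + [None] * 3})

min_rows = 10  # hypothetical minimum number of leads needed to train a model

counts = toy["region"].value_counts()      # rows per region, missing values excluded
too_small = counts[counts < min_rows]      # regions with too little data for a model
n_missing = toy["region"].isna().sum()     # leads that have no region at all

print("distinct regions:", counts.size)
print("regions below threshold:", list(too_small.index))
print("leads without a region:", n_missing)
```

Regions below the threshold and leads without a region are exactly the cases where a one-model-per-region design has nothing to fall back on.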

Since we just want to build a quick proof of concept and we have limited time for this exercise, we will not investigate this approach further unless we discover that it gives us some improvement.

## KMeans Approach

The idea is to first cluster the data to help identify different regions and perhaps to expose clearer patterns inside each cluster. Once we have these clusters, we will build a separate RandomForestClassifier for each cluster and compare the average results with the previous baseline.

Steps of this approach:
1. Collect cleaned dataset: with the transformed and scaled features, removing non-relevant ones.
2. K-Means: find the optimal number of clusters for this dataset. Prepare K-means with that configuration.
3. For each cluster train a different classifier and get the scores.
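
The steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's pipeline: the data, the choice of k=2, and the helper `predict_routed` are all assumptions. It also shows how a new lead could be routed, by assigning it to the nearest centroid with `KMeans.predict` and then using that cluster's classifier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 5))               # synthetic scaled features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # synthetic binary label

# Step 2: cluster the feature space (k=2 is arbitrary here).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 3: one classifier per cluster, trained only on that cluster's rows.
models = {}
for cid in np.unique(km.labels_):
    mask = km.labels_ == cid
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    models[cid] = clf.fit(X[mask], y[mask])

def predict_routed(X_new):
    """Assign each row to its nearest centroid, then use that cluster's model."""
    cluster_ids = km.predict(X_new)
    preds = np.empty(len(X_new), dtype=int)
    for cid, model in models.items():
        rows = cluster_ids == cid
        if rows.any():
            preds[rows] = model.predict(X_new[rows])
    return preds

print(predict_routed(X[:5]))
```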

## Dataset

Reminder of the filters and transformations done in this dataset:
- extract year and month from signup_date
- remove fields that do not help the clustering, such as *fakeId*, *signup_date* or *year* (but we will keep *month* as before), and fields with a high percentage of missing data, such as *company_industrygroups*, *category_sectors*, *company_employes_qty*.
- remove data from the initial years (before 2014): the volume is low and there is not a single successful lead, so it risks introducing noise.
- encode categorical features as numeric alternatives (the id of the category)
- min-max scaler for numerical values.
- remove redundant geographical features. 
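
As a reminder of what these transformations look like in code, here is a small sketch on a toy frame. Only *signup_date* comes from the dataset; the other column names (`lead_source`, `visits`) and all values are made up for illustration, and the actual cleaning code may differ.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame: all column names except signup_date are illustrative assumptions.
raw = pd.DataFrame({
    "signup_date": pd.to_datetime(["2013-05-01", "2014-02-10", "2015-11-30"]),
    "lead_source": ["web", "partner", "web"],
    "visits": [3.0, 10.0, 7.0],
})

# Extract year and month, then drop the early, low-volume years.
raw["year"] = raw["signup_date"].dt.year
raw["month"] = raw["signup_date"].dt.month
clean = raw[raw["year"] >= 2014].copy()

# Replace the categorical column with integer category ids.
clean["lead_source_cat"] = clean["lead_source"].astype("category").cat.codes

# Scale the numerical column to the [0, 1] range.
clean["visits_mm"] = MinMaxScaler().fit_transform(clean[["visits"]]).ravel()

# Drop the fields that are no longer needed for clustering.
clean = clean.drop(columns=["signup_date", "year", "lead_source", "visits"])
print(clean)
```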

In [None]:
df = pd.read_pickle("training_df")
df.head()

In [None]:
len(df.columns)

We have 10 final features.

## Find optimal number of clusters. K-means

Let's have a look at a couple of dimensions at a time to get an idea of how the points are distributed in the space.

In [None]:
sns.set_context('poster')
sns.set_color_codes()
plot_kwds = {'alpha' : 0.25, 's' : 80, 'linewidths':0}
plt.scatter(df.campaign_name_cat, df.lead_source_cat, c='b', **plot_kwds)
frame = plt.gca()
frame.axes.get_xaxis().set_visible(False)
frame.axes.get_yaxis().set_visible(False)

In [None]:
plt.scatter(df.place_within_tenant_mm, df.city_mm, c='b', **plot_kwds)
frame = plt.gca()
frame.axes.get_xaxis().set_visible(False)
frame.axes.get_yaxis().set_visible(False)

For each value of k, we will initialise k-means and read its inertia attribute. **Inertia** is the sum of squared distances from each point to its closest centroid, i.e., the centroid of its assigned cluster. So $I=\sum_i d(i, c_r)$, where $c_r$ is the centroid of the cluster assigned to point $i$ and $d$ is the squared distance.
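
To make the definition concrete, the sketch below recomputes the inertia by hand on synthetic data and compares it with the value reported by scikit-learn. The two blobs and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated toy blobs in 2-D (synthetic data, for illustration only).
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

# Squared Euclidean distance from each point to its assigned centroid, summed.
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centroids) ** 2).sum()

print(manual_inertia, km.inertia_)
```

The two printed numbers should agree up to floating-point error.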

Let's explore up to 15 clusters to check if we find an optimal value. We use "k-means++" instead of "random" to ensure centroids are initialized with some distance between them. In most cases, this will be an improvement over "random".

In [None]:
def km_check(data, no_runs=25, no_clusters=15, k_means_method='k-means++'):
    """Run k-means for k = 1..no_clusters and plot the inertia (SSE) against k.

    The plot is used to pick the number of clusters by eye (elbow method).
    """
    cluster_range = range(1, no_clusters + 1)  # include no_clusters itself
    sse = []
    for cluster in cluster_range:
        # n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed
        kmeans = KMeans(n_clusters=cluster,
                        init=k_means_method,
                        max_iter=500,
                        n_init=no_runs)
        kmeans.fit(data)
        sse.append(kmeans.inertia_)
    analysis = pd.DataFrame({'Cluster': list(cluster_range), 'SSE': sse})
    plt.figure(figsize=(12, 6))
    plt.plot(analysis['Cluster'], analysis['SSE'], marker='o')
    plt.xlabel('Number of clusters')
    plt.ylabel('Inertia/SSE')
    plt.show()

In [None]:
start = time()
km_check(df)
end = time()
result = end - start
print('Training time = %.3f seconds' % result)

The ideal number of clusters seems to be 4. Let's try to color the clusters on the previous plots.

In [None]:
def fit_algorithm(data, algorithm, args, kwds):
    """fit an algorithm to the data"""
    start = time()
    labels = algorithm(*args, **kwds).fit_predict(data)
    end = time()
    result = end - start
    print('Fitting time = %.3f seconds' % result)
    return labels

In [None]:
def plot_clusters(algorithm, labels, feature1, feature2):
    palette = sns.color_palette('viridis', np.unique(labels).max() + 1)
    colors = [palette[x] if x >= 0 else (0.0, 0.0, 0.0) for x in labels]
    plt.scatter(feature1, feature2, c=colors, **plot_kwds)
    frame = plt.gca()
    frame.axes.get_xaxis().set_visible(False)
    frame.axes.get_yaxis().set_visible(False)
    plt.title('Clusters found by {}'.format(str(algorithm.__name__)), fontsize=24)

In [None]:
# n_jobs is no longer accepted by KMeans (removed in scikit-learn 1.0)
labels = fit_algorithm(df, KMeans, (), {'n_clusters':4, 'n_init':25, 'init':'k-means++', 'max_iter':500,'random_state':42})
plot_clusters(KMeans, labels, df.place_within_tenant_mm, df.city_mm)

In [None]:
plot_clusters(KMeans, labels, df.campaign_name_cat, df.lead_source_cat)

Are we happy with these clusters? It is difficult to say, given that each plot is only a two-dimensional slice of the multi-dimensional space considered by the algorithm. However, in the second slice we can see a clear division between two clusters.
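
One way to judge the clusters over all dimensions at once, rather than slice by slice, is the silhouette score. It was not part of the original analysis, so the sketch below is only a suggestion. It runs on synthetic data (four blobs in ten dimensions, standing in for the scaled features) and compares several values of k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Synthetic stand-in for the scaled feature matrix: 4 blobs in 10 dimensions,
# centred on the first four unit vectors.
centers = np.eye(4, 10)
X = np.vstack([c + rng.normal(0.0, 0.03, size=(100, 10)) for c in centers])

scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Mean silhouette over all points: close to 1 means compact, well-separated clusters.
    scores[k] = silhouette_score(X, labels_k)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real dataset the score is expensive for many rows, so passing the `sample_size` argument of `silhouette_score` may be needed.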

## Train a classifier for each cluster

We need to add a new column to the dataset with the cluster assigned to each point. This way we can split the data by cluster and train a different model on each one.

In [None]:
def add_cluster_id(data, labels):
    data['cluster_id'] = labels
    return data

In [None]:
df_clustered = add_cluster_id(df,labels)
df_clustered.head()

In [None]:
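def check_undersampling(data):
    """Return True when class 1 is rare enough in `data` to need undersampling."""
    # count labels in the data passed in, not in the global df
    t = data.label.value_counts()
    # size of class 1 relative to class 0, as a percentage
    percentage = t[1] * 100 / t[0]
    # if the percentage is less than 30% we use undersampling
    return percentage < 30.0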

In [None]:
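def fit_model(model, X0_train, y0_train):
    """Fit the model, print the training time and return the fitted model."""
    start = time()
    model.fit(X0_train,y0_train)
    end = time()
    result = end - start
    # print before returning, otherwise the line is never reached
    print('Training time = %.3f seconds' % result)
    return model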

In [None]:
def get_train_test_sets(data):
    # remove cluster_id
    d = data.drop(['cluster_id'], axis=1)
    X = d.drop(['label'], axis=1)
    y = d['label']  # Labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test
    return X_train, X_test, y_train, y_test

In [None]:
def train_eval_model(data):
    X_train, X_test, y_train, y_test = get_train_test_sets(data)
    # similar configuration as the baseline model
    clf=RandomForestClassifier(n_estimators= 200,
         min_samples_split= 5,
         min_samples_leaf= 4,
         max_features= 'sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
         max_depth= 10,
         bootstrap= False,
         n_jobs=4)
    start = time()
    if check_undersampling(data):
        undersample = RandomUnderSampler(sampling_strategy='majority')
        X_train_under, y_train_under = undersample.fit_resample(X_train, y_train)
        clf.fit(X_train_under,y_train_under)
    else:
        clf.fit(X_train,y_train)
    end = time()
    result = end - start
    print('Training time = %.3f seconds' % result)
    # get predictions
    y_pred = clf.predict(X_test)
    # get precision and recall (scikit-learn expects y_true first, then y_pred)
    p = precision_score(y_test, y_pred, average=None)
    print("precision scores")
    print(p)
    r = recall_score(y_test, y_pred, average=None)
    print("recall scores")
    print(r)
    return p[0], r[0], p[1], r[1]
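
The argument order of the metric functions matters: scikit-learn's signature is `precision_score(y_true, y_pred)`, and the same holds for `recall_score`. The toy example below, with made-up labels, shows that swapping the arguments silently swaps precision and recall.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # 3 actual positives
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]   # 2 predicted positives, 1 of them correct

# Correct order: ground truth first, predictions second.
precision = precision_score(y_true, y_pred)   # TP / predicted positives = 1/2
recall = recall_score(y_true, y_pred)         # TP / actual positives = 1/3

# Swapped order: the result is the recall, reported under the name "precision".
swapped_precision = precision_score(y_pred, y_true)

print(precision, recall, swapped_precision)
```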

In [None]:
rows = []
for i in range(4):
    # get the cluster dataframe
    cluster_df = df_clustered[df_clustered['cluster_id'] == i]
    print("Start training and evaluation for cluster {}".format(i))
    # train the model and evaluate it
    p_0, r_0, p_1, r_1 = train_eval_model(cluster_df)
    rows.append({'cluster_id': i,
                 'precision_conv': p_1,
                 'recall_conv': r_1,
                 'precision_non_conv': p_0,
                 'recall_non_conv': r_0})
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
results = pd.DataFrame(
    rows,
    columns=['cluster_id', 'precision_conv', 'recall_conv', 'precision_non_conv', 'recall_non_conv'])

In [None]:
results.head()

The results are not bad at all. Cluster 0 seems to be the weakest performer. The cluster sizes and the average metrics for this solution are:

In [None]:
for i in range(4):
    # get the cluster dataframe
    cluster_df = df_clustered[df_clustered['cluster_id'] == i]
    print("Size of cluster_{} = {}".format(i,len(cluster_df)))

In [None]:
results.mean()
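
Note that a plain mean gives every cluster the same weight, whatever its size. If the clusters differ a lot in size, an average weighted by cluster size may describe the overall behaviour better. The sketch below uses made-up numbers (not results from this notebook) and a hypothetical `size` column to show the difference.

```python
import pandas as pd

# Made-up per-cluster metrics and sizes, for illustration only.
toy_results = pd.DataFrame({
    "cluster_id": [0, 1, 2, 3],
    "recall_conv": [0.50, 0.80, 0.90, 0.70],
    "size": [1000, 200, 100, 700],
})

# Unweighted mean: each cluster counts the same.
simple_mean = toy_results["recall_conv"].mean()

# Weighted mean: each cluster counts in proportion to its number of rows.
weights = toy_results["size"] / toy_results["size"].sum()
weighted_mean = (toy_results["recall_conv"] * weights).sum()

print(simple_mean, weighted_mean)
```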

# Conclusions

By clustering and training one small model per cluster we did not see much improvement in the detection of converted leads. Still, it is a nice way to show that this approach is also valid, and it could be considered if the data source becomes really big and no big-data resources are available.

# References

https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

https://machinelearningmastery.com/clustering-algorithms-with-python/

https://realpython.com/k-means-clustering-python/