# Motivation for this extra analysis

When we shared the missing-data results, we noticed that the two features at the top of that list come from the same source, which could point to a hidden pattern. We therefore run a quick analysis to check whether the presence or absence of that source is relevant as a predictor.

Steps we are going to follow:
1. Transform one of the top missing features into a new synthetic binary feature: 1 means we have information from this source and 0 means we do not.
2. Add this feature to the correlation matrix and check whether it is relevant for the label we want to predict.
3. If it is relevant, generate a new dataset that includes the synthetic feature.
4. Search for the best model again, using downsampling and applying the lessons from the previous notebook.
5. Compare the results.
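
Step 4 relies on downsampling. The previous notebook's exact procedure is not reproduced here, so the following is only a minimal sketch of one common approach: randomly dropping rows until every class is as large as the rarest one. The helper name `downsample_majority`, the toy frame, and the `random_state` value are assumptions made for illustration.

```python
import pandas as pd

def downsample_majority(data, label_col="label", random_state=0):
    """Randomly drop rows so every class has as many rows as the rarest one."""
    smallest = data[label_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in data.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Hypothetical toy data: 8 rows of class 0 and 2 rows of class 1.
toy = pd.DataFrame({"x": range(10), "label": [0] * 8 + [1] * 2})
balanced = downsample_majority(toy)
```

Downsampling should only be applied to the training split, so that the evaluation data keeps the original class balance.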

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math

In [None]:
df = pd.read_csv('interview.csv')
df.head()

In [None]:
# Drop the first column (assumed to be a leftover index column from the CSV export)
df = df.iloc[:,1:]

## Creating the synthetic feature

In [None]:
df['company_industrygroups'].isnull().sum()

In [None]:
# Mark missing values with the sentinel string '0' (assumes '0' never occurs as a real value)
df['company_industrygroups'] = df['company_industrygroups'].fillna('0')

In [None]:
df['tp_data'] = 1

In [None]:
df.loc[df['company_industrygroups']=='0','tp_data'] = 0

In [None]:
df.tp_data.value_counts()
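
The cells above build the indicator in three steps using a sentinel string. As a hedged alternative, the same flag can be derived directly from the missing-value mask. The toy frame below is made up for illustration and only the column names come from this notebook. Unlike the cells above, this variant leaves `company_industrygroups` untouched and does not treat a literal `'0'` string as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real interview.csv contents.
toy = pd.DataFrame({"company_industrygroups": ["software", np.nan, "retail", np.nan]})

# 1 where the source provided a value, 0 where it is missing.
toy["tp_data"] = toy["company_industrygroups"].notna().astype(int)
```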

In [None]:
leads_per_tp_data = df.groupby(["tp_data", "label"]).size().to_frame(name='count').reset_index()
sns.scatterplot(data=leads_per_tp_data, x="tp_data", y="count", hue="label", palette="Set2")

In [None]:
leads_per_tp_data
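
Raw counts are hard to compare when the two `tp_data` groups differ in size. One possible complement, sketched here on made-up data and assuming `label` takes a small number of discrete values, is to look at the share of each label within each group.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would pass df["tp_data"] and df["label"].
toy = pd.DataFrame({
    "tp_data": [0, 0, 0, 0, 1, 1, 1, 1],
    "label":   [0, 0, 0, 1, 0, 1, 1, 1],
})

# normalize="index" makes every row (one per tp_data value) sum to 1.
label_share = pd.crosstab(toy["tp_data"], toy["label"], normalize="index")
```

If the rows of `label_share` look very different from each other, the indicator carries some information about the label.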

## Add the new feature to the cleaned dataset

In [None]:
c_df = pd.read_pickle("training_df")
c_df.head()

In [None]:
c_df['tp_data'] = df['tp_data']
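
Assigning a column from another DataFrame aligns on the index, not on row position. This works as long as `c_df` kept the original index of `df` through the cleaning steps, which is an assumption we cannot verify from this notebook alone. A small sketch with made-up frames shows the behaviour and a cheap sanity check.

```python
import pandas as pd

# Hypothetical frames: "cleaned" lost row 1 during cleaning but kept the original index.
raw = pd.DataFrame({"tp_data": [1, 0, 1, 0]}, index=[0, 1, 2, 3])
cleaned = pd.DataFrame({"f": [0.1, 0.2, 0.3]}, index=[0, 2, 3])

# Values are matched by index label, so row 1 of raw is simply skipped.
cleaned["tp_data"] = raw["tp_data"]

# Any NaN here would mean some cleaned rows had no matching index in raw.
unmatched = int(cleaned["tp_data"].isna().sum())
```

If the cleaned dataset had its index reset at some point, the values would be matched to the wrong rows without raising an error, so checking this is worthwhile.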

## Repeat the correlation matrix

In [None]:
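plt.rcParams["figure.figsize"] = (20,10)

def paint_correlation_matrix(data):
    # Draw the correlation matrix, ordered by correlation with the label
    corrmat = data.corr()
    # Number of columns in the matrix (the old first-row count broke on NaN or a missing index 0)
    k = corrmat.shape[0]
    cols = corrmat.nlargest(k, 'label')['label'].index
    # Reuse the pandas correlations in label-sorted order; np.corrcoef returns NaN if any value is missing
    cm = corrmat.loc[cols, cols].values
    sns.set(font_scale=1.25)
    hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
    # Adjust the layout once the heatmap exists, otherwise an empty figure is created
    plt.tight_layout()
    plt.show()
    
paint_correlation_matrix(c_df)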
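
The heatmap can be hard to read with many features. As a rough complement, the two numbers that matter for this analysis can be read straight from the correlation matrix: how strongly `tp_data` correlates with `label`, and which other predictor it overlaps with most. The data below is synthetic and the column names `existing_predictor` and `noise` are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
existing = rng.integers(0, 2, size=n)

# Synthetic data: tp_data copies existing_predictor about 90% of the time.
toy = pd.DataFrame({
    "existing_predictor": existing,
    "tp_data": np.where(rng.random(n) < 0.9, existing, 1 - existing),
    "noise": rng.normal(size=n),
})
# The label follows existing_predictor about 80% of the time.
toy["label"] = np.where(rng.random(n) < 0.8, existing, 1 - existing)

corr = toy.corr()

# Correlation between the new feature and the label.
corr_with_label = corr.loc["tp_data", "label"]

# Predictor (excluding the label and tp_data itself) that overlaps most with tp_data.
strongest_overlap = corr["tp_data"].drop(["tp_data", "label"]).abs().idxmax()
```

A high value for the strongest overlap suggests that the new feature mostly repeats information the model already has.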

In [None]:
c_df.to_pickle("training_df2")

## Conclusion

We found some relationship between the new synthetic feature and the label. However, once the feature is placed alongside the others, it also correlates with an existing predictor, so we do not expect it to improve the model much.