# **CSI 382 - Data Mining and Knowledge Discovery**

# **Lab 3 - Exploratory Data Analysis**

Tukey defined data analysis in 1961 as:
"Procedures for analyzing data, techniques for interpreting the results of such
procedures, ways of planning the gathering of data to make its analysis easy, i.e., more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."

A statistical model can be used or not, but primarily EDA is for seeing what the
data can tell us beyond the formal modeling or hypothesis testing task.

## **GETTING TO KNOW THE DATA SET**

In Lab 3 we use exploratory methods to delve into the [churn](https://drive.google.com/file/d/1X8qRaBWHIFg7QS5WLYgMVf9UCK7qFpwt/view?usp=sharing) data set from the UCI Repository of Machine Learning Databases at the University of
California, Irvine.
Churn, also called attrition, is a term used to indicate a customer leaving the
service of one company in favor of another company. To begin, it is often best
simply to take a look at the field values for some of the records.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/bigml_59c28831336c6604c800002a.csv')

#Check number of rows and columns in the dataset
print("The dataset has %d rows and %d columns." % df.shape)

In [None]:
df.head(10)

The data set contains 20 variables worth of information about 3333 customers,
along with an indication of whether or not that customer churned (left the company).

In [None]:
print(df.dtypes)

## **Preprocess data**


We can see that the columns "state", "international plan", and "voice mail plan" hold string values. The latter two take only the values "yes" and "no" and are therefore converted to 1 and 0, respectively. The "churn" column appears to be loaded as a boolean (the later cells index it with `False`/`True`), so it is left unchanged and simply compared against 0 and 1.

The "state" column is converted using the LabelEncoder, which replaces each unique label with a unique integer. A label encoder is used instead of dummy variables because the column has many distinct values; converting them into dummy variables would add one column per state and could distort, for example, a PCA or the feature importances of tree-based models.

The "phone number" column is removed because every customer has their own phone number, so it cannot help separate churners from non-churners.
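To make the difference between the two encodings concrete, here is a minimal sketch on a made-up five-row "state" column (the rows are hypothetical, not taken from the churn file). It uses pandas category codes as a stand-in for `LabelEncoder`; both assign one integer per distinct label in sorted order.

```python
import pandas as pd

# Hypothetical toy column standing in for the real "state" field
toy = pd.DataFrame({"state": ["KS", "OH", "NJ", "OH", "KS"]})

# Label encoding: a single integer column, one code per distinct label
# (categories are sorted, so KS -> 0, NJ -> 1, OH -> 2)
label_codes = toy["state"].astype("category").cat.codes

# Dummy (one-hot) encoding: one new column per distinct label
dummies = pd.get_dummies(toy["state"], prefix="state")

print(label_codes.tolist())   # one column of integers
print(dummies.shape)          # (rows, number of distinct states)
```

Keep in mind that the integer codes impose an arbitrary ordering on the states, which some models may treat as meaningful.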


In [None]:
df.columns

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

def preprocess_data(df):
    pre_df = df.copy()

    # Replace the spaces in the column names with underscores
    pre_df.columns = [s.replace(" ", "_") for s in pre_df.columns]

    # convert string columns to integers
    pre_df["international_plan"] = pre_df["international_plan"].apply(lambda x: 0 if x=="no" else 1)
    pre_df["voice_mail_plan"] = pre_df["voice_mail_plan"].apply(lambda x: 0 if x=="no" else 1)

    #Dropping unnecessary attribute
    pre_df = pre_df.drop(["phone_number"], axis=1)

    # Encode the state labels (strings) as integers
    le = LabelEncoder()
    le.fit(pre_df['state'])
    pre_df['state'] = le.transform(pre_df['state'])

    return pre_df

In [None]:
pre_df = preprocess_data(df)

pre_df.head(3)

In [None]:
pre_df.dtypes

## **Correlated Variables**

One should take care to avoid feeding correlated variables to one’s data mining
and statistical models. At best, using correlated variables will overemphasize
one data component; at worst, using correlated variables will cause the model to
become unstable and deliver unreliable results.

The data set contains three variables: total_day_minutes, total_day_calls, and
total_day_charge. The data description indicates that the charge variable may
be a function of minutes and calls, with the result that the variables would be
correlated. We investigate using the correlation matrix plot shown below.

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(14, 10))

corr = pre_df.corr()

ax = sns.heatmap(corr, cmap = "RdBu_r")

ax.invert_yaxis()


plt.title("Heatmap of pairwise correlation of the columns")

plt.show()

### **Inference**

1. There does not seem to be any relationship between day minutes and day
calls or between day calls and day charge.
2. On the other hand, there is a perfect linear relationship between total day
minutes and total day charge, indicating that day charge is a simple linear
function of day minutes only.
3. Since day charge is correlated perfectly with day minutes, we should eliminate
one of the two variables.
4. Investigation of the evening, night, and international components reflected
similar findings, and we thus also eliminate evening charge, night charge,
and international charge.
5. Dimensionality of the solution space is reduced, so that certain data mining
algorithms may more efficiently find the globally optimal solution.

We have therefore reduced the number of predictors from 20 to 16 by eliminating
redundant variables.
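The elimination step itself is not shown in the cells above. The following is a minimal sketch of how it might look, run here on synthetic data so that it is self-contained. The per-minute rate of 0.17 and the column names `total_night_charge` and `total_intl_charge` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for pre_df (all values are made up)
rng = np.random.default_rng(0)
minutes = rng.uniform(0, 350, size=200)
toy = pd.DataFrame({
    "total_day_minutes": minutes,
    "total_day_calls": rng.integers(50, 150, size=200),
    "total_day_charge": 0.17 * minutes,  # assumed linear rate, illustration only
})

# A correlation of (numerically) 1 confirms that charge is redundant
r = toy["total_day_minutes"].corr(toy["total_day_charge"])
print("corr(minutes, charge) = %.4f" % r)

# Drop whichever charge columns are present (names assumed from the text)
charge_cols = ["total_day_charge", "total_eve_charge",
               "total_night_charge", "total_intl_charge"]
reduced = toy.drop(columns=[c for c in charge_cols if c in toy.columns])
print(reduced.columns.tolist())
```

On the real data, the same `drop` call applied to `pre_df` would be expected to remove four columns, matching the reduction from 20 to 16 predictors described above.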

## **EXPLORING CATEGORICAL VARIABLES**

One of the primary reasons for performing exploratory data analysis is to
investigate the variables, look at histograms of the numeric variables, examine
the distributions of the categorical variables, and explore the relationships
among sets of variables.

For example, Figure 3 shows that we have clearly more samples for customers
without churn than for customers with churn. We therefore have a class imbalance
in the target variable, which could lead to predictive models that are biased
towards the majority class (i.e. no churn). To deal with this issue, we will
investigate the use of oversampling when building the models.
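As a rough illustration of what oversampling means, the sketch below duplicates minority-class rows at random until both classes are the same size. It is only one simple variant (random oversampling with replacement), shown on a hypothetical ten-row frame; the lab does not prescribe a particular method.

```python
import pandas as pd

# Hypothetical imbalanced toy data: 8 non-churners, 2 churners
toy = pd.DataFrame({"x": range(10), "churn": [0] * 8 + [1] * 2})

majority = toy[toy["churn"] == 0]
minority = toy[toy["churn"] == 1]

# Draw minority rows with replacement until they match the majority count
minority_up = minority.sample(n=len(majority), replace=True, random_state=17)

balanced = pd.concat([majority, minority_up]).reset_index(drop=True)
counts = balanced["churn"].value_counts()
print(counts)
```

When a model is evaluated on held-out data, oversampling is usually applied to the training split only, so that duplicated rows do not leak into the test set.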

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure()

ax = fig.add_axes([0,0,1,1])

target = ['no-churn', 'churn']

hists = [len(pre_df[pre_df["churn"] == 0]),len(pre_df[pre_df["churn"] == 1])]
ax.bar(target,hists, color=['blue','orange'],)

plt.title("Churn Distribution")
plt.ylabel('#samples')

plt.ylim(0,3000)

plt.show()

In [None]:
pre_df.shape

In [None]:
churn_perc = pre_df["churn"].sum() * 100 / pre_df["churn"].shape[0]
print("Churn percentage is %.3f%%." % churn_perc)

For example, Figure 4 shows a comparison of the proportion of churners (orange)
and non-churners (blue) among customers who either had selected the
International Plan (yes, 323 customers) or had not selected it (no, 3010
customers). The graphic appears to indicate that a greater proportion of International Plan holders are churning, but it is difficult to be sure.

In [None]:
def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-1.5.1.min.js?noext',
            },
          });
        </script>
        '''))

### **International Plan**

In [None]:
#plot libraries
import plotly
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode

configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook

from plotly.offline import plot, iplot

# do not show any warnings
import warnings
warnings.filterwarnings('ignore')

# SEED = 17 # specify seed for reproducible results
pd.set_option('display.max_columns', None) # prevents abbreviation (with '...') of columns in prints

colors = plotly.colors.DEFAULT_PLOTLY_COLORS
churn_dict = {0: "no churn", 1: "churn"}

churn = pre_df[pre_df["churn"] == 1]
no_churn = pre_df[pre_df["churn"] == 0]

def create_churn_trace(col, visible=False):
    return go.Histogram(
        x=churn[col],
        name='churn',
        marker = dict(color = colors[1]),
        visible=visible,
    )

def create_no_churn_trace(col, visible=False):
    return go.Histogram(
        x=no_churn[col],
        name='no churn',
        marker = dict(color = colors[0]),
        visible = visible,
    )

features_for_hist = ['international_plan']
active_idx = 0
traces_churn = [(create_churn_trace(col) if i != active_idx else create_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
traces_no_churn = [(create_no_churn_trace(col) if i != active_idx else create_no_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
data = traces_churn + traces_no_churn

# n_features = len(features_for_hist)
# steps = []
# for i in range(n_features):
#     step = dict(
#         method = 'restyle',
#         args = ['visible', [False] * len(data)],
#         label = features_for_hist[i],
#     )
#     step['args'][1][i] = True # Toggle i'th trace to "visible"
#     step['args'][1][i + n_features] = True # Toggle i'th trace to "visible"
#     steps.append(step)


layout = dict(
    yaxis=dict(
        title='#samples',
        automargin=True,
    ),
    xaxis=dict(
        title='Feature: international_plan',
        automargin=True,
    ),
)

fig = dict(data=data, layout=layout)

iplot(fig, filename='histogram')


#### **Inference**

Note that $137/(137 + 186) = 42.4\%$ of the International Plan holders churned,
compared with only $346/(346 + 2664) = 11.5\%$ of those without the International Plan. Customers selecting the International Plan are more than three times
as likely to leave the company’s service as those without the plan.

This EDA on the International Plan has indicated that:
1. Perhaps we should investigate what it is about the International Plan that is
inducing customers to leave!
2. We should expect that whatever data mining algorithms we use to predict
churn, the model will probably include whether or not the customer selected
the International Plan.
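The arithmetic above can be checked with a small contingency table. This sketch rebuilds the table from the four counts quoted in the text rather than from the data file, so it runs on its own. The variable names are illustrative.

```python
import pandas as pd

# Counts quoted in the inference above
counts = pd.DataFrame(
    {"no churn": [2664, 186], "churn": [346, 137]},
    index=pd.Index(["no", "yes"], name="international_plan"),
)

# Churn rate within each row = churners / row total
churn_rate = counts["churn"] / counts.sum(axis=1)

# How many times more likely plan holders are to churn
ratio = churn_rate["yes"] / churn_rate["no"]

print((churn_rate * 100).round(1))
print("ratio: %.2f" % ratio)

# On the real frame, a comparable table could be obtained with:
# pd.crosstab(pre_df["international_plan"], pre_df["churn"], normalize="index")
```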

### **Voice Mail Plan**

Let us now turn to the VoiceMail Plan. Figure 5 shows in a bar graph that those
who do not have the VoiceMail Plan are more likely to churn than those who do
have the plan.

In [None]:
#plot libraries
import plotly
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode

configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook

from plotly.offline import plot, iplot

# do not show any warnings
import warnings
warnings.filterwarnings('ignore')

SEED = 17 # specify seed for reproducible results
pd.set_option('display.max_columns', None) # prevents abbreviation (with '...') of columns in prints

colors = plotly.colors.DEFAULT_PLOTLY_COLORS
churn_dict = {0: "no churn", 1: "churn"}

churn = pre_df[pre_df["churn"] == 1]
no_churn = pre_df[pre_df["churn"] == 0]

def create_churn_trace(col, visible=False):
    return go.Histogram(
        x=churn[col],
        name='churn',
        marker = dict(color = colors[1]),
        visible=visible,
    )

def create_no_churn_trace(col, visible=False):
    return go.Histogram(
        x=no_churn[col],
        name='no churn',
        marker = dict(color = colors[0]),
        visible = visible,
    )

features_for_hist = ['voice_mail_plan']

traces_churn = [(create_churn_trace(col) if i != active_idx else create_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
traces_no_churn = [(create_no_churn_trace(col) if i != active_idx else create_no_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
data = traces_churn + traces_no_churn

# n_features = len(features_for_hist)
# steps = []
# for i in range(n_features):
#     step = dict(
#         method = 'restyle',
#         args = ['visible', [False] * len(data)],
#         label = features_for_hist[i],
#     )
#     step['args'][1][i] = True # Toggle i'th trace to "visible"
#     step['args'][1][i + n_features] = True # Toggle i'th trace to "visible"
#     steps.append(step)


layout = dict(
    yaxis=dict(
        title='#samples',
        automargin=True,
    ),
    xaxis=dict(
        title='Feature: voice_mail_plan',
        automargin=True,
    ),
)

fig = dict(data=data, layout=layout)

iplot(fig, filename='histogram')


#### **Inference**

First of all, $842 + 80 = 922$ customers have the VoiceMail Plan, while $2008 +
403 = 2411$ do not. We then find that $403/2411 = 16.7\%$ of those without the
VoiceMail Plan are churners, compared to $80/922 = 8.7\%$ of customers who do
have the VoiceMail Plan. Thus, customers without the VoiceMail Plan are nearly
twice as likely to churn as customers with the plan.

This EDA on the VoiceMail Plan has indicated that:
1. Perhaps we should enhance the VoiceMail Plan further or make it easier for
customers to join it, as an instrument for increasing customer loyalty.
2. We should expect that whatever data mining algorithms we use to predict
churn, the model will probably include whether or not the customer selected
the VoiceMail Plan. Our confidence in this expectation is perhaps not quite
as high as that for the International Plan.

### **State**

For example, in Figure 6 we can see that some states, such as AK, HI, and IA,
have a lower proportion of customers who churn, while others, such as WA, MD,
and TX, have a higher proportion. This suggests that we should incorporate the
state into our further analysis, because it could help to predict whether a
customer is going to churn.
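Stacked counts are hard to compare when states have different numbers of customers, so it can help to look at the churn proportion per state directly. The sketch below does this on a hypothetical eight-row frame; on the lab data the same `groupby` could be applied to `df`.

```python
import pandas as pd

# Hypothetical toy rows; the real data has one row per customer
toy = pd.DataFrame({
    "state": ["AK", "AK", "AK", "AK", "TX", "TX", "TX", "TX"],
    "churn": [False, False, False, True, True, True, False, False],
})

# The mean of a boolean column is the share of True values
state_rate = (
    toy.groupby("state")["churn"]
    .mean()
    .sort_values(ascending=False)
)
print(state_rate)
```

Each state contributes only a small share of the 3333 records, so differences between states should be read with some caution.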

In [None]:
#plot libraries
import plotly
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode

configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook

from plotly.offline import plot, iplot

# do not show any warnings
import warnings
warnings.filterwarnings('ignore')

SEED = 17 # specify seed for reproducible results
pd.set_option('display.max_columns', None) # prevents abbreviation (with '...') of columns in prints

colors = plotly.colors.DEFAULT_PLOTLY_COLORS
churn_dict = {0: "no churn", 1: "churn"}

state_churn_df = df.groupby(["state", "churn"]).size().unstack()

trace1 = go.Bar(
    x=state_churn_df.index,
    y=state_churn_df[False],
    marker = dict(color = colors[0]),
    name='no churn'
)
trace2 = go.Bar(
    x=state_churn_df.index,
    y=state_churn_df[True],
    marker = dict(color = colors[1]),
    name='churn'
)
data = [trace1, trace2]
layout = go.Layout(
    title='Churn distribution per state',
    autosize=True,
    barmode='stack',
    margin=go.layout.Margin(l=50, r=50),
    xaxis=dict(
        title='state',
        tickangle=45
    ),
    yaxis=dict(
        title='#samples',
        automargin=True,
    ),
    legend=dict(
        x=0,
        y=1,
    ),
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='stacked-bar')


In [None]:
state_churn_df

### **Can you find a pattern?**

In [None]:
#plot libraries
import plotly
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode

configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook

from plotly.offline import plot, iplot

# do not show any warnings
import warnings
warnings.filterwarnings('ignore')

colors = plotly.colors.DEFAULT_PLOTLY_COLORS
churn_dict = {0: "no churn", 1: "churn"}

churn = pre_df[pre_df["churn"] == 1]
no_churn = pre_df[pre_df["churn"] == 0]

features_not_for_hist = ["state", "phone_number", "churn", "international_plan", "voice_mail_plan", "total_eve_charge", "total_eve_minutes"]
features_for_hist = [x for x in pre_df.columns if x not in features_not_for_hist]
active_idx = 0
traces_churn = [(create_churn_trace(col) if i != active_idx else create_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
traces_no_churn = [(create_no_churn_trace(col) if i != active_idx else create_no_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
data = traces_churn + traces_no_churn

n_features = len(features_for_hist)
steps = []
for i in range(n_features):
    step = dict(
        method = 'restyle',
        args = ['visible', [False] * len(data)],
        label = features_for_hist[i],
    )
    step['args'][1][i] = True # Toggle i'th trace to "visible"
    step['args'][1][i + n_features] = True # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active = active_idx,
    currentvalue = dict(
        prefix = "Feature: ",
        xanchor= 'center',
    ),
    pad = {"t": 50},
    steps = steps,
)]

layout = dict(
    sliders=sliders,
    yaxis=dict(
        title='#samples',
        automargin=True,
    ),
)

fig = dict(data=data, layout=layout)

iplot(fig)

## **Distribution Box Plots**

Next, we take a look at the box plots for each feature. A box plot visualizes the following statistics:
* the median
* the first quartile (Q1) and the third quartile (Q3), which together define the interquartile range (IQR)
* the lower fence (Q1 - 1.5 IQR) and the upper fence (Q3 + 1.5 IQR)
* the maximum and the minimum value
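The statistics in this list can be computed by hand, as in the minimal sketch below on ten made-up values. It uses NumPy's default (linear interpolation) percentiles; plotting libraries may use a slightly different quartile convention, so the numbers need not match a rendered box plot exactly.

```python
import numpy as np

# Made-up sample with one obviously large value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond them are usually drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1=%.2f median=%.2f Q3=%.2f IQR=%.2f" % (q1, median, q3, iqr))
print("fences: [%.2f, %.2f]" % (lower_fence, upper_fence))
print("outliers:", outliers.tolist())
```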

In [None]:
configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook

def create_box_churn_trace(col, visible=False):
    return go.Box(
        y=churn[col],
        name='churn',
        marker = dict(color = colors[1]),
        visible=visible,
    )

def create_box_no_churn_trace(col, visible=False):
    return go.Box(
        y=no_churn[col],
        name='no churn',
        marker = dict(color = colors[0]),
        visible = visible,
    )

features_not_for_hist = ["state", "phone_number", "churn"]
features_for_hist = [x for x in pre_df.columns if x not in features_not_for_hist]
# remove features with too few distinct values (e.g. binary features), because a box plot does not make sense for them
features_for_box = [col for col in features_for_hist if len(churn[col].unique())>5]

active_idx = 0
box_traces_churn = [(create_box_churn_trace(col) if i != active_idx else create_box_churn_trace(col, visible=True)) for i, col in enumerate(features_for_box)]
box_traces_no_churn = [(create_box_no_churn_trace(col) if i != active_idx else create_box_no_churn_trace(col, visible=True)) for i, col in enumerate(features_for_box)]
data = box_traces_churn + box_traces_no_churn

n_features = len(features_for_box)
steps = []
for i in range(n_features):
    step = dict(
        method = 'restyle',
        args = ['visible', [False] * len(data)],
        label = features_for_box[i],
    )
    step['args'][1][i] = True # Toggle i'th trace to "visible"
    step['args'][1][i + n_features] = True # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active = active_idx,
    currentvalue = dict(
        prefix = "Feature: ",
        xanchor= 'center',
    ),
    pad = {"t": 50},
    steps = steps,
    len=1,
)]

layout = dict(
    sliders=sliders,
    yaxis=dict(
        title='value',
        automargin=True,
    ),
    legend=dict(
        x=0,
        y=1,
    ),
)

fig = dict(data=data, layout=layout)

iplot(fig, filename='box_slider')

## **USING EDA TO UNCOVER ANOMALOUS FIELDS**

Exploratory data analysis will sometimes uncover strange or anomalous records
or fields which the earlier data cleaning phase may have missed. Consider, for
example, the area code field in the present data set. Although the area codes
contain numerals, they can also be used as categorical variables, since they can
classify customers according to geographical location.

We are intrigued by the fact that the area code field contains only three different
values for all the records (408, 415, and 510), all three of which are in
California, as shown by Figure 14.

In [None]:
configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook

features_for_hist = ['area_code']
traces_churn = [(create_churn_trace(col) if i != active_idx else create_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
traces_no_churn = [(create_no_churn_trace(col) if i != active_idx else create_no_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
data = traces_churn + traces_no_churn

layout = dict(
    yaxis=dict(
        title='#samples',
        automargin=True,
    ),
    xaxis=dict(
        title='Feature: area_code',
        automargin=True,
    ),
)

fig = dict(data=data, layout=layout)

iplot(fig, filename='histogram')

In [None]:
#plot libraries
import plotly
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode

configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook

state_churn_df = df.groupby(["state", "churn"]).size().unstack()
trace1 = go.Bar(
    x=state_churn_df.index,
    y=state_churn_df[False],
    marker = dict(color = colors[0]),
    name='no churn'
)
trace2 = go.Bar(
    x=state_churn_df.index,
    y=state_churn_df[True],
    marker = dict(color = colors[1]),
    name='churn'
)
data = [trace1, trace2]
layout = go.Layout(
    title='Churn distribution per state',
    autosize=True,
    barmode='stack',
    margin=go.layout.Margin(l=50, r=50),
    xaxis=dict(
        title='state',
        tickangle=45
    ),
    yaxis=dict(
        title='#samples',
        automargin=True,
    ),
    legend=dict(
        x=0,
        y=1,
    ),
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='stacked-bar')


In [None]:

configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook


colors = plotly.colors.DEFAULT_PLOTLY_COLORS
churn_dict = {0: "no churn", 1: "churn"}

area1 = pre_df[pre_df["area_code"] == 408]
area2 = pre_df[pre_df["area_code"] == 415]
area3 = pre_df[pre_df["area_code"] == 510]

def create_area1_trace(col, visible=False):
    return go.Histogram(
        x=area1[col],
        name='408',
        marker = dict(color = colors[0]),
        visible=visible,
    )

def create_area2_trace(col, visible=False):
    return go.Histogram(
        x=area2[col],
        name='415',
        marker = dict(color = colors[1]),
        visible=visible,
    )

def create_area3_trace(col, visible=False):
    return go.Histogram(
        x=area3[col],
        name='510',
        marker = dict(color = colors[2]),
        visible = visible,
    )

features_for_hist = ['state']

traces1_churn = [(create_area1_trace(col) if i != active_idx else create_area1_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
traces2_churn = [(create_area2_trace(col) if i != active_idx else create_area2_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
traces3_churn = [(create_area3_trace(col) if i != active_idx else create_area3_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]

data = traces1_churn + traces2_churn + traces3_churn

layout = dict(
    yaxis=dict(
        title='#samples',
        automargin=True,
    ),
    xaxis=dict(
        title='Feature: state',
        automargin=True,
    ),
)

fig = dict(data=data, layout=layout)

iplot(fig, filename='histogram')

### **Inference**

The three area codes are all Californian, yet the state field assigns the same
customers to states across the country, so the two fields cannot both be right.
We should therefore be wary of this area code field, perhaps going so far as not
to include it as input to the data mining models in the next phase. On the other
hand, it may be the state field that is in error. Either way, further communication
with someone familiar with the data history, or a domain expert, is called for
before inclusion of these variables in the data mining models.
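A quick numerical check of this inconsistency is to count how many distinct states appear under each area code. The sketch below uses a hypothetical six-row frame; if area codes were reliable geographic labels, one would expect each code to map to a single state.

```python
import pandas as pd

# Hypothetical toy rows illustrating the mismatch
toy = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH", "OK", "AL"],
    "area_code": [415, 415, 415, 408, 415, 510],
})

# Number of distinct states observed for each area code
states_per_area = toy.groupby("area_code")["state"].nunique()
print(states_per_area)

# The full table of co-occurrences can be inspected with a crosstab
print(pd.crosstab(toy["area_code"], toy["state"]))
```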

## **EXPLORING NUMERICAL VARIABLES**

Next, we turn to an exploration of the numerical predictive variables. We begin
with numerical summary measures, including minimum and maximum; measures
of center, such as mean, median, and mode; and measures of variability,
such as standard deviation. Figure 16 shows these summary measures for some
of our numerical variables. We see, for example, that the minimum account
length is one month, the maximum is 243 months, and the mean and median
are about the same, at around 101 months, which is an indication of symmetry.
Notice that several variables show this evidence of symmetry, including all the
minutes, charge, and call fields.
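The "mean roughly equals median" check for symmetry can be tabulated directly, optionally alongside the sample skewness. The sketch below uses two made-up columns, one symmetric and one right-skewed, to show what the comparison looks like in each case.

```python
import pandas as pd

# Made-up columns: one symmetric, one with a long right tail
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 1, 2, 2, 3, 20],
})

# One row per variable: mean, median, and sample skewness
summary = toy.agg(["mean", "median", "skew"]).T
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

A difference near zero (and a skewness near zero) is consistent with symmetry, while a mean well above the median points to a right tail.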

In [None]:
df.describe()

### **Customer Service Calls**

We turn next to graphical analysis of our numerical variables. Figure 17 is a
histogram of customer service calls, with churn overlay. Figure 17 hints that the
proportion of churn may be greater for higher numbers of customer service calls,
but it is difficult to discern this result unequivocally.

In [None]:

configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook

def create_churn_trace(col, visible=False):
    return go.Histogram(
        x=churn[col],
        name='churn',
        marker = dict(color = colors[1]),
        visible=visible,
    )

def create_no_churn_trace(col, visible=False):
    return go.Histogram(
        x=no_churn[col],
        name='no churn',
        marker = dict(color = colors[0]),
        visible = visible,
    )

features_for_hist = ['customer_service_calls']
active_idx = 0
traces_churn = [(create_churn_trace(col) if i != active_idx else create_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
traces_no_churn = [(create_no_churn_trace(col) if i != active_idx else create_no_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
data = traces_churn + traces_no_churn

n_features = len(features_for_hist)
steps = []
for i in range(n_features):
    step = dict(
        method = 'restyle',
        args = ['visible', [False] * len(data)],
        label = features_for_hist[i],
    )
    step['args'][1][i] = True # Toggle i'th trace to "visible"
    step['args'][1][i + n_features] = True # Toggle i'th trace to "visible"
    steps.append(step)


layout = dict(
    yaxis=dict(
        title='#samples',
        automargin=True,
    ),
    xaxis=dict(
        title='Feature: Customer Service Calls',
        automargin=True,
    ),
)

fig = dict(data=data, layout=layout)

iplot(fig, filename='histogram')


#### **Inference**

This EDA on the customer service calls has indicated that:

1. We should track carefully the number of customer service calls made by
each customer. By the third call, specialized incentives should be offered
to retain customer loyalty.
2. We should expect that whatever data mining algorithms we use to predict
churn, the model will probably include the number of customer service
calls made by the customer.
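Because raw counts make the trend hard to read, a row-normalized table of churn share per number of calls can help. The following sketch uses ten hypothetical customers; the same `crosstab` call could be pointed at `pre_df`.

```python
import pandas as pd

# Hypothetical toy data: calls to customer service and churn flag
toy = pd.DataFrame({
    "customer_service_calls": [0, 0, 1, 1, 2, 3, 4, 4, 5, 6],
    "churn":                  [0, 0, 0, 1, 0, 0, 1, 1, 1, 0],
})

# normalize="index" turns each row into proportions that sum to 1
share = pd.crosstab(
    toy["customer_service_calls"], toy["churn"], normalize="index"
)

# Column 1 holds the churn share for each call count
churn_share = share[1]
print(churn_share)
```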

## **Can you find a pattern?**

In [None]:
#plot libraries
import plotly
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode

configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook

from plotly.offline import plot, iplot

# do not show any warnings
import warnings
warnings.filterwarnings('ignore')

SEED = 17 # specify seed for reproducible results
pd.set_option('display.max_columns', None) # prevents abbreviation (with '...') of columns in prints

colors = plotly.colors.DEFAULT_PLOTLY_COLORS
churn_dict = {0: "no churn", 1: "churn"}

churn = pre_df[pre_df["churn"] == 1]
no_churn = pre_df[pre_df["churn"] == 0]

features_not_for_hist = ["state", "phone_number", "churn"]
features_for_hist = ["total_day_minutes", "total_day_calls", "total_eve_calls", "total_night_calls", "total_intl_calls"]
active_idx = 0
traces_churn = [(create_churn_trace(col) if i != active_idx else create_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
traces_no_churn = [(create_no_churn_trace(col) if i != active_idx else create_no_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
data = traces_churn + traces_no_churn

n_features = len(features_for_hist)
steps = []
for i in range(n_features):
    step = dict(
        method = 'restyle',
        args = ['visible', [False] * len(data)],
        label = features_for_hist[i],
    )
    step['args'][1][i] = True # Toggle i'th trace to "visible"
    step['args'][1][i + n_features] = True # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active = active_idx,
    currentvalue = dict(
        prefix = "Feature: ",
        xanchor= 'center',
    ),
    pad = {"t": 50},
    steps = steps,
)]

layout = dict(
    sliders=sliders,
    yaxis=dict(
        title='#samples',
        automargin=True,
    ),
)

fig = dict(data=data, layout=layout)

iplot(fig, filename='histogram_slider')

## **EXPLORING MULTIVARIATE RELATIONSHIPS - Scatter-plots**

We turn next to an examination of possible multivariate associations of numerical
variables with churn, using two- and three-dimensional scatter plots.

### **2-D Scatterplots**

In [None]:
import matplotlib
#figure(figsize=(10, 8), dpi=80)

label = [0,1]
colors = ['blue','orange']

scatter = plt.scatter(pre_df['total_day_minutes'], pre_df['customer_service_calls'], c=pre_df['churn'], cmap=matplotlib.colors.ListedColormap(colors))
plt.xlabel('total_day_minutes')
plt.ylabel('customer_service_calls')
plt.title('Scatter plot of total_day_minutes against customer_service_calls')
plt.legend(handles=scatter.legend_elements()[0],
           labels=target)
plt.show()

In [None]:
import matplotlib
#figure(figsize=(10, 8), dpi=80)

label = [0,1]
colors = ['blue','orange']

scatter = plt.scatter(pre_df['total_day_minutes'], pre_df['number_vmail_messages'], c=pre_df['churn'], cmap=matplotlib.colors.ListedColormap(colors))
plt.xlabel('total_day_minutes')
plt.ylabel('number_vmail_messages')
plt.title('Scatter plot of total_day_minutes against number_vmail_messages')
plt.legend(handles=scatter.legend_elements()[0],
           labels=target)
plt.show()

### **3-D Scatterplots**

Sometimes, three-dimensional scatter plots can be helpful as well.

In [None]:
import plotly.graph_objects as go
import numpy as np
configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook

x, y, z = pre_df['total_day_minutes'], pre_df['customer_service_calls'], pre_df['total_eve_minutes']

fig = go.Figure(data=[go.Scatter3d(x=x, y=y, z=z,
    mode='markers', marker=dict(
        size=4,
        color = pre_df['churn'].astype(int),
        opacity=0.8
    ),

    )])
fig.update_layout(scene = dict(
                    xaxis_title='total_day_minutes',
                    yaxis_title='customer_service_calls',
                    zaxis_title='total_eve_minutes'),margin=dict(l=0, r=0, b=0, t=0))
fig.show()


## **SELECTING INTERESTING SUBSETS OF THE DATA FOR FURTHER INVESTIGATION**

We may use scatter plots (or histograms) to identify interesting subsets of the
data, in order to study these subsets more closely. In Figure 3.25 we see that
customers with high day minutes and high evening minutes are more likely to
churn.

In [None]:
import matplotlib
# figure(figsize=(10, 8), dpi=80)

label = [0,1]
colors = ['blue','orange']

scatter = plt.scatter(pre_df['total_day_minutes'], pre_df['total_eve_minutes'], c=pre_df['churn'], cmap=matplotlib.colors.ListedColormap(colors))
plt.xlabel('total_day_minutes')
plt.ylabel('total_eve_minutes')
plt.title('Scatter plot of total_day_minutes against total_eve_minutes')
plt.legend(handles=scatter.legend_elements()[0],
           labels=target)
# plt.grid(True)
plt.show()

It turns out that over $43\%$ of the customers who have both high day minutes and
high evening minutes are churners. This is approximately three times the churn
rate of the overall customer base in the data set. Therefore, it is recommended
that we consider how we can develop strategies for keeping our heavy-use
customers happy so that they do not leave the company’s service, perhaps
through discounting the higher levels of minutes used.
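A subset churn rate of this kind can be computed with a boolean mask. The text does not say what counts as "high" minutes, so the cutoffs in this sketch (250 for both variables) are hypothetical, as are the eight toy rows.

```python
import pandas as pd

# Hypothetical toy customers
toy = pd.DataFrame({
    "total_day_minutes": [120, 310, 280, 90, 300, 150, 260, 200],
    "total_eve_minutes": [180, 290, 270, 300, 100, 210, 265, 190],
    "churn":             [0,   1,   1,   0,   0,   0,   0,   1],
})

# Assumed cutoffs for "high" usage; tune these from the scatter plot
DAY_CUTOFF = 250
EVE_CUTOFF = 250

heavy = toy[(toy["total_day_minutes"] > DAY_CUTOFF)
            & (toy["total_eve_minutes"] > EVE_CUTOFF)]

heavy_rate = heavy["churn"].mean()
overall_rate = toy["churn"].mean()
print("heavy users: %d, churn rate %.1f%%" % (len(heavy), 100 * heavy_rate))
print("overall churn rate %.1f%%" % (100 * overall_rate))
```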

## **Binning**

There are various strategies for binning numerical variables. One approach is to
make the classes of equal width, analogous to equal-width histograms. Another
approach is to try to equalize the number of records in each class. You may
consider yet another approach, which attempts to partition the data set into
identifiable groups of records, which, with respect to the target variable, have
behavior similar to that for other records in the same class.
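The first two strategies map onto `pd.cut` (equal width) and `pd.qcut` (roughly equal record counts). The sketch below applies both to ten made-up values; the choice of four bins is arbitrary.

```python
import pandas as pd

# Made-up values with one large observation
minutes = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 80, 200])

# Equal-width bins: same interval length, uneven counts
equal_width = pd.cut(minutes, bins=4)
width_counts = equal_width.value_counts().sort_index()

# Equal-frequency bins: interval edges follow the quantiles
equal_freq = pd.qcut(minutes, q=4)
freq_counts = equal_freq.value_counts().sort_index()

print(width_counts)
print(freq_counts)
```

With a long tail, equal-width binning tends to crowd most records into the first bin, while equal-frequency binning keeps the bin sizes balanced at the cost of uneven interval widths.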

In [None]:
configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook

def create_churn_trace(col, visible=False):
    return go.Histogram(
        x=churn[col],
        name='churn',
        marker = dict(color = colors[1]),
        visible=visible,
    )

def create_no_churn_trace(col, visible=False):
    return go.Histogram(
        x=no_churn[col],
        name='no churn',
        marker = dict(color = colors[0]),
        visible = visible,
    )

features_for_hist = ['customer_service_calls']
active_idx = 0
traces_churn = [(create_churn_trace(col) if i != active_idx else create_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
traces_no_churn = [(create_no_churn_trace(col) if i != active_idx else create_no_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
data = traces_churn + traces_no_churn

layout = dict(
    yaxis=dict(
        title='#samples',
        automargin=True,
    ),
    xaxis=dict(
        title='Feature: Customer Service Calls',
        automargin=True,
    ),
)

fig = dict(data=data, layout=layout)

iplot(fig, filename='histogram')

For example, recall Figure 17, where we saw that customers with fewer than
four calls to customer service had a lower churn rate than that of customers who
had four or more calls to customer service. We may therefore decide to bin the
customer service calls variable into two classes, low and high.
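A two-class bin of this kind can be created as a new column, as in the sketch below. The cutoff of four calls follows the text; the column name `csc_bin` and the seven toy rows are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy customers
toy = pd.DataFrame({
    "customer_service_calls": [0, 1, 2, 3, 4, 5, 9],
    "churn":                  [0, 0, 0, 1, 1, 1, 0],
})

# "high" for four or more calls, "low" otherwise
toy["csc_bin"] = np.where(toy["customer_service_calls"] >= 4, "high", "low")

# Churn rate within each bin
bin_rate = toy.groupby("csc_bin")["churn"].mean()
print(bin_rate)
```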

In [None]:

configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook

def create_churn_trace(col, visible=False):
    return go.Histogram(
        x=churn[col],
        name='churn',
        marker = dict(color = colors[1]),
        visible=visible,
        xbins=dict(
            start=churn[col].min(),
            end=churn[col].max()+1,
            size=4
        ),
        autobinx = False
    )

def create_no_churn_trace(col, visible=False):
    return go.Histogram(
        x=no_churn[col],
        name='no churn',
        marker = dict(color = colors[0]),
        visible = visible,
        xbins=dict(
            start=no_churn[col].min(),
            end=no_churn[col].max()+1,
            size=4
            ),
        autobinx = False
    )

features_for_hist = ['customer_service_calls']
active_idx = 0
traces_churn = [(create_churn_trace(col) if i != active_idx else create_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
traces_no_churn = [(create_no_churn_trace(col) if i != active_idx else create_no_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
data = traces_churn + traces_no_churn


layout = dict(
    yaxis=dict(
        title='#samples',
        automargin=True,
    ),
    xaxis=dict(
        title='Feature: Customer Service Calls',
        automargin=True,
    ),
)

fig = dict(data=data, layout=layout)

iplot(fig, filename='histogram')


# **That's all for today**

# **Tasks**

Go to this [url](https://drive.google.com/file/d/13dnJ0Uszfcj95py2f17njZbE6HDa6BNS/view?usp=sharing) and download the data first. In order to know more about the dataset please refer to these links - [UCI/heart_disease](https://archive.ics.uci.edu/ml/datasets/Heart+Disease), or [Kaggle/heart_disease](https://www.kaggle.com/ronitf/heart-disease-uci).

**The "target" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.**

 Now try to do the following:

1. Find the correlation matrix of the variables. Is there any correlation among the variables?
2. Explore the categorical and numerical variables using histograms, scatterplots, and boxplots. Can you find some pattern that indicates a relationship with the target?
3. Can you do some scatterplots (either 2D or 3D) to find some interesting subsets of the data?
