# Causal Inference Using Pearl's Framework

## Objectives:

- Perform a causal inference task using Pearl’s framework.
- Infer the causal graph from observational data and then validate the graph.
- Merge machine learning with causal inference.


## Load Data and Libraries

In [1]:
# Libraries

import pandas as pd
import numpy as np
from sklearn.preprocessing import Normalizer, MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
from IPython.display import Image
import copy

import warnings
warnings.filterwarnings("ignore")


In [None]:
# add scripts
sys.path.append(os.path.abspath("../scripts/"))

from utils import Utils
Util = Utils()

In [None]:
# Load Data

raw_df = pd.read_csv("../data/data.csv")
raw_df.head()

In [None]:
# Check Dataset

raw_df.shape

In [None]:
raw_df.info()

### Observation
- The dataset has 569 rows and 33 columns.
- The last column is completely empty.
- None of the other columns contains null values.
- All variables are floats except the Id column and the diagnosis variable, which is a string.
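
These observations can be checked programmatically rather than read off the `info()` output. The following is a minimal sketch on a toy frame standing in for `raw_df`; the column name `empty_col` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_df (hypothetical names and values).
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "empty_col": [np.nan, np.nan, np.nan],
})

# Null count per column; a count equal to the row count marks an empty column.
null_counts = toy_df.isna().sum()
empty_cols = null_counts[null_counts == len(toy_df)].index.tolist()

# Non-numeric columns, to confirm which variables are not floats or integers.
non_numeric = toy_df.select_dtypes(exclude="number").columns.tolist()

print(null_counts)
print("Completely empty columns:", empty_cols)
print("Non-numeric columns:", non_numeric)
```

Running the same three statements on `raw_df` should single out the last column as empty and `diagnosis` as the only non-numeric variable, if the observations above hold.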

## Clean Data

In [None]:
# removing null column and id
clean_df = raw_df.iloc[:,1:-1]
clean_df.info()
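
Positional slicing works here because the identifier is the first column and the empty column is the last. A slightly more defensive variant, sketched below on a hypothetical toy frame, drops columns by content and by name so that it does not depend on column order. The names `id` and `empty_col` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_df (hypothetical names and values).
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "empty_col": [np.nan, np.nan, np.nan],
})

# Drop every column that is entirely null, then drop the identifier by name.
clean_toy = toy_df.dropna(axis=1, how="all").drop(columns=["id"])

print(clean_toy.columns.tolist())
```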

### Check for Outliers

In [None]:
test = Util.check_outlier(clean_df.iloc[:,1:])
test

### Observation
- There are no major outliers
- Each variable has some minor outliers
- There are no null values
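
The implementation of `Util.check_outlier` is not shown in this notebook. One common way to count outliers per variable is the interquartile-range (IQR) rule, sketched below on toy data. This is an assumption about the general approach, not the helper's actual code, and `count_iqr_outliers` is a hypothetical name.

```python
import pandas as pd

def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per column that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (df < lower) | (df > upper)
    return outside.sum()

# Toy data: column "a" has one extreme value, column "b" has none.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Raising `k` (for example to 3.0) restricts the count to more extreme values, which is one way to separate "major" from "minor" outliers.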

## Perform Exploratory Analysis

### Univariate Analysis

In [None]:
# Univariate Analysis
Util.describe(clean_df)

In [None]:
# check the target variable
target = clean_df["diagnosis"]
ax = sns.countplot(x=target)                   # M = 212, B = 357
counts = target.value_counts()
B, M = counts["B"], counts["M"]                # look up by label, not by sort order
print('Number of Benign: ', B)
print('Number of Malignant: ', M)

### Observation
- The table reports summary statistics (mean, std, min, 25%, 50%, 75%, and max) for all numerical variables.
- The values of area_mean and area_worst vary widely.
- Many variables have a median value of 0.
- The maximum of area_worst is 4254, while the maximum of fractal_dimension_se is 0.029840. Because the scales differ this much, the data should be standardized or normalized before visualization, feature selection, and classification.
- The bar plot of diagnosis shows that 37% of patients (212/569) are Malignant and 63% (357/569) are Benign.
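
The definition of `Util.scale_and_normalize` is not shown. Given the `MinMaxScaler` and `Normalizer` imports at the top of the notebook, one plausible version chains the two, as sketched below. This is an assumption, not the helper's confirmed code; `scale_and_normalize_sketch` is a hypothetical name, and all toy values other than the two maxima quoted above are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, Normalizer

def scale_and_normalize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Min-max scale each column to [0, 1], then rescale each row to unit L2 norm."""
    scaled = MinMaxScaler().fit_transform(df)        # column-wise scaling
    normalized = Normalizer().fit_transform(scaled)  # row-wise normalization
    return pd.DataFrame(normalized, columns=df.columns, index=df.index)

# Toy data with very different scales (hypothetical except for the two maxima).
toy = pd.DataFrame({
    "area_worst": [185.2, 686.5, 4254.0],
    "fractal_dimension_se": [0.0009, 0.0032, 0.02984],
})
scaled_toy = scale_and_normalize_sketch(toy)
print(scaled_toy)
```

After this transformation both columns lie in [0, 1], so they can share one axis in a plot.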

In [None]:
clean_df.iloc[:,1:] = Util.scale_and_normalize(clean_df.iloc[:,1:])
data = clean_df.copy()
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(18,10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart",palette ="Set2")
plt.xticks(rotation=90)

### Observation
- Some variables can be used to classify the diagnosis clearly because their distributions for Benign and Malignant cancer are
  clearly separated
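
The visual impression from the violin plot can be backed by a number. The sketch below ranks features by a standardized difference of class means (absolute mean difference divided by the pooled standard deviation). It uses hypothetical toy data, and this statistic is one possible choice among several.

```python
import pandas as pd

# Toy data: one feature separates the classes, the other overlaps (hypothetical values).
toy = pd.DataFrame({
    "diagnosis": ["M", "M", "M", "B", "B", "B"],
    "feat_separated": [0.90, 0.80, 0.85, 0.10, 0.20, 0.15],
    "feat_overlapping": [0.50, 0.40, 0.60, 0.45, 0.55, 0.50],
})

grouped = toy.groupby("diagnosis")
means = grouped.mean()
stds = grouped.std()

# Pooled standard deviation of the two classes (equal group sizes assumed).
pooled_std = ((stds.loc["M"] ** 2 + stds.loc["B"] ** 2) / 2) ** 0.5

# Larger values mean the class distributions sit further apart.
separation = ((means.loc["M"] - means.loc["B"]).abs() / pooled_std).sort_values(ascending=False)
print(separation)
```

Features at the top of this ranking are the ones whose violins show the least overlap between the two classes.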

### Bivariate Analysis

In [None]:
plot_pair(clean_df, [0,8], [16,17])

In [None]:
plot_pair(clean_df, [8,15], [16,17])

In [None]:
# bivariate Analysis
# correlation matrix
corr_matrix = clean_df.iloc[:,1:].corr()       # skip the string diagnosis column
matrix = np.triu(corr_matrix)
fig, ax = plt.subplots(figsize=(17, 10))
ax = sns.heatmap(corr_matrix, annot=True, mask=matrix)

### Observation
- The distributions of the variables are right-skewed.
- There is a strong correlation between radius and concave points, concavity, and compactness.
- There is a strong positive correlation between smoothness and both compactness and fractal dimension.
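
With about thirty features, the strongest correlations are easier to read from a sorted list than from the heatmap. The following is a minimal sketch on synthetic data; the column names and values are hypothetical. It keeps the strict upper triangle of the correlation matrix so that each pair appears once.

```python
import numpy as np
import pandas as pd

# Synthetic data: "perimeter" is built to track "radius"; "noise" is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "radius": base,
    "perimeter": 2 * base + rng.normal(scale=0.05, size=200),
    "noise": rng.normal(size=200),
})

corr = toy.corr()

# Keep only entries above the diagonal so each pair is listed once.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().dropna()

# Sort by absolute correlation, strongest first.
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs)
```

Applied to the numeric columns of `clean_df`, the head of `pairs` would list the feature pairs behind the observations above.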