In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('data.csv')

In [None]:
print(df.shape)
df.head()

# 1. Data Understanding
Familiarize yourself with the dataset and interpret the significance of each feature.

In [None]:
df.describe()

The summary statistics suggest that the variables may already have been scaled.  
However, none of the usual normalization schemes appears to have been applied:  

Standardization (z-score scaling to mean 0 and standard deviation 1) is excluded for the following reasons:
- The means of the variables are not 0.
- The standard deviations of the variables are not 1.

Min-Max normalization is excluded for the following reasons:

- The minimum values of the variables are not 0.
- The maximum values of the variables are not 1.
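
The four checks above can be run directly on the summary statistics. The sketch below is a minimal illustration on made-up data; the helper name `scaling_report` and the tolerance `tol` are assumptions introduced here, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def scaling_report(frame: pd.DataFrame, tol: float = 1e-2) -> pd.DataFrame:
    """Flag, per column, whether it looks standardized or min-max scaled."""
    stats = frame.agg(['mean', 'std', 'min', 'max']).T
    # Standardized columns have mean ~0 and standard deviation ~1
    stats['standardized'] = (stats['mean'].abs() < tol) & ((stats['std'] - 1).abs() < tol)
    # Min-max scaled columns have minimum ~0 and maximum ~1
    stats['min_max_scaled'] = (stats['min'].abs() < tol) & ((stats['max'] - 1).abs() < tol)
    return stats

# Toy data (hypothetical): one raw column and the same column rescaled to [0, 1]
rng = np.random.default_rng(0)
raw = rng.normal(loc=50, scale=10, size=1000)
demo = pd.DataFrame({
    'raw': raw,
    'min_max': (raw - raw.min()) / (raw.max() - raw.min()),
})
report = scaling_report(demo)
print(report[['standardized', 'min_max_scaled']])
```

On the real data, calling `scaling_report(df)` would give one row per column, which makes it easy to see at a glance that neither scheme matches.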

Principal Component Analysis does not appear to have been applied either, since the columns are not on the same scale: 'col_10', for example, spans a much wider range of values than the others.

Let's count the distinct values in each feature to get a sense of what each one represents.

In [None]:
for col in df.columns:
    print(f"Count of distinct values for {col}:",len(df[col].unique()))

The counts suggest that "col_1", "col_3" and the target "label" are categorical, taking the values [0, 1], [0, 0.5, 1] and [0, 1] respectively.  
All the other variables are numerical.  
Since every column is technically numeric, all of them will be treated as numerical in the modeling part.
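
The split between categorical and numerical columns can also be derived from the distinct-value counts rather than read off by eye. The following is a minimal sketch under stated assumptions: the helper `split_by_cardinality` and the cut-off `max_levels=3` are hypothetical choices, and the toy values are made up.

```python
import pandas as pd

def split_by_cardinality(frame, target, max_levels=3):
    """Return (categorical, numerical) feature names based on distinct-value counts."""
    counts = frame.drop(columns=[target]).nunique()
    categorical = counts[counts <= max_levels].index.tolist()
    numerical = counts[counts > max_levels].index.tolist()
    return categorical, numerical

# Toy frame (hypothetical values) mirroring the pattern described above
demo = pd.DataFrame({
    'col_a': [0, 1, 0, 1, 1, 0],
    'col_b': [0.0, 0.5, 1.0, 0.5, 0.0, 1.0],
    'col_c': [0.12, 0.87, 0.45, 0.33, 0.91, 0.58],
    'label': [0, 1, 0, 1, 1, 0],
})
categorical, numerical = split_by_cardinality(demo, target='label')
print(categorical, numerical)
```

A cardinality threshold is only a heuristic: a numeric column with few observed values would be misclassified, so the result is worth a manual check.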

In [None]:
df.columns

In [None]:
NUMERICAL = ['col_0', 'col_2', 'col_4', 'col_5', 'col_6', 'col_7', 'col_8', 'col_9', 'col_10', 'col_11', 'col_12', 'col_13']
CATEGORICAL = ['col_1', 'col_3']
TARGET = ['label']

#### Check the Normal Distribution  
I first considered the Shapiro-Wilk test, but with more than 5000 values SciPy warns that its p-value may not be accurate.  
In this case, let's use the Anderson-Darling test instead.  
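
To see how the Anderson-Darling output is read, it can help to run it on data whose distribution is known. The sketch below uses two synthetic samples (an assumption made purely for illustration) and compares the test statistic with the critical value at the 5% level: a statistic above the critical value rejects the null hypothesis of normality.

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(42)
samples = {
    'normal': rng.normal(size=6000),
    'exponential': rng.exponential(size=6000),
}

rejected_at_5 = {}
for name, values in samples.items():
    result = anderson(values, dist='norm')
    # Locate the critical value that corresponds to the 5% significance level
    idx = list(result.significance_level).index(5.0)
    # A statistic above the critical value rejects H0 (the data are normal)
    rejected_at_5[name] = bool(result.statistic > result.critical_values[idx])
    print(f"{name}: statistic={result.statistic:.2f}, "
          f"reject normality at 5%: {rejected_at_5[name]}")
```

The strongly skewed exponential sample produces a statistic far above every critical value, which is the pattern to look for in the per-column output.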

In [None]:
# from scipy.stats import shapiro

# for column in df.columns:
#     stat, p = shapiro(df[column])
#     alpha = 0.05
#     if p > alpha:
#         print(f'Column {column} looks Gaussian (fail to reject H0)')
#     else:
#         print(f'Column {column} does not look Gaussian (reject H0)')


In [None]:
from scipy.stats import anderson

for column in df.columns:
    result = anderson(df[column])
    print(f"\n{column}:")

    for i in range(len(result.critical_values)):
        sl, cv = result.significance_level[i], result.critical_values[i]
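        if result.statistic < cv:
            # Statistic below the critical value: fail to reject H0 (normality)
            print(f'At {sl}% significance level, data looks normal (fail to reject H0)')
        else:
            # Statistic at or above the critical value: reject H0
            print(f'At {sl}% significance level, data does not look normal (reject H0)')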


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Count plot of categorical variables
for feature in CATEGORICAL:
    plt.figure(figsize=(6, 4))
    sns.countplot(data=df, x=feature)
    plt.title(f'Count Plot of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.show()

In [None]:
for feature in NUMERICAL:
    plt.figure(figsize=(6, 4))
    sns.histplot(data=df, x=feature, kde=True)
    plt.title(f'Histogram of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.show()

In [None]:
# Box plots of numerical variables
for feature in NUMERICAL:
    plt.figure(figsize=(8, 6))
    sns.boxplot(data=df, y=feature)
    plt.title(f'Box Plot of {feature}')
    plt.ylabel(feature)
    plt.show()

# 2. Feature Importance
Identify the essential features in the dataset. What makes these features significant? How do they influence the outcome variable?

#### a) Analyze the correlation  
To determine the influence of each variable on the outcome variable, it is necessary to analyze the correlation with the target variable.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix
corr_matrix = df.corr()

# Sort the correlation matrix based on the target variable
target_correlation = corr_matrix['label'].sort_values(ascending=False)

# Plotting the correlation matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Plotting the correlation of features to the target variable
plt.figure(figsize=(8, 6))
target_correlation.plot(kind='bar')
plt.title('Correlation of Features to label')
plt.xlabel('Features')
plt.ylabel('Correlation')
plt.show()

To decide which features count as "highly correlated", we can choose a threshold of 0.5 on the absolute correlation.

In [None]:
# Drop the target itself, whose correlation with itself is always 1
correlations = corr_matrix['label'].drop('label')
high_corr_features = correlations[correlations.abs() > 0.5]
print("Features highly correlated with target variable:", high_corr_features.index.tolist())

We can observe that the contribution of all variables is low.

#### b) Mutual information scores

For further analysis, we will assess whether the relationship between variables is nonlinear and examine the contribution of each variable.
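
The reason mutual information complements correlation can be shown on a synthetic example. In the sketch below (made-up data, not the notebook's), the target depends on the feature only through its absolute value, so the Pearson correlation is close to zero while the mutual information is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
x_demo = rng.uniform(-1, 1, size=2000)
# Hypothetical target: 1 when x is far from zero, a symmetric nonlinear dependence
y_demo = (np.abs(x_demo) > 0.5).astype(int)

pearson = np.corrcoef(x_demo, y_demo)[0, 1]
mi_demo = mutual_info_classif(x_demo.reshape(-1, 1), y_demo, random_state=0)[0]
print(f"Pearson correlation: {pearson:.3f}")
print(f"Mutual information:  {mi_demo:.3f}")
```

Passing `random_state` makes the estimate reproducible, since the estimator adds a small amount of noise to continuous features.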

In [None]:
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Separate the features (X) and the target (y)
X = df.drop('label', axis=1)
y = df['label']

# Calculate mutual information
mi = mutual_info_classif(X, y)

# Convert mutual information scores to a DataFrame
mi_series = pd.Series(mi, index=X.columns)

# Print the features sorted by mutual information score
print(mi_series.sort_values(ascending=False))

In [None]:
sum(mi_series.sort_values(ascending=False))

All the variables have a mutual information score close to 0, indicating a low contribution even with nonlinear analysis. However, we can identify the top 6 features that have a score higher than 0.05.

#### c) Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

X = df.drop('label', axis=1)
y = df['label']

model = RandomForestClassifier()
model.fit(X, y)

importances = model.feature_importances_
for feature, importance in zip(X.columns, importances):
    print(f"Feature: {feature}, Importance: {importance}")

#### d) L1 regularization

In [None]:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.004)

lasso.fit(X, y)

importance = np.abs(lasso.coef_)

feature_importance = sorted(list(zip(X.columns, importance)), key=lambda x: x[1], reverse=True)

for feature, importance in feature_importance:
    print(f"Feature: {feature}, Importance: {importance}")

Based on L1 regularization with alpha = 0.004, the important features are: ['col_1','col_3','col_6','col_9','col_10','col_11','col_12']

In [None]:
IMPORTANT_FEATURES = ['col_1','col_3','col_6','col_9','col_10','col_11','col_12']

The features 'col_12' and 'col_6' are highlighted by every method, which gives them more weight, even though their overall contribution is low.
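
One way to compare features across methods whose scores live on different scales is to rank the features within each method and average the ranks. The sketch below illustrates the idea with invented scores for three placeholder features; the numbers are hypothetical and are not the outputs obtained above.

```python
import pandas as pd

# Hypothetical scores for three features from three methods (not real outputs)
scores = pd.DataFrame({
    'abs_correlation': {'f1': 0.30, 'f2': 0.10, 'f3': 0.20},
    'mutual_info':     {'f1': 0.06, 'f2': 0.01, 'f3': 0.04},
    'rf_importance':   {'f1': 0.40, 'f2': 0.35, 'f3': 0.25},
})

# Rank within each method (1 = most important), then average the ranks
ranks = scores.rank(ascending=False)
mean_rank = ranks.mean(axis=1).sort_values()
print(mean_rank)
```

A feature that sits near the top under every method ends up with a low mean rank, which matches the intuition of being "mentioned in all methods".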

# 3. Outlier Detection
Can you identify outliers in the dataset? What techniques do you use to identify these outliers, and how would you handle them?

Many techniques can be used to detect and handle outliers. Here I will detail three of them:

In [None]:
X = df.drop('label', axis=1)

#### 1. Gaussian distribution (or Z-score)
Before using this method, we should check whether the variables follow a Gaussian distribution. As the normality check in section 1 showed, none of them does, so strictly speaking the method does not apply, unless we assume that the data should be normally distributed and treat any value that departs from that as an outlier.

In [None]:
from scipy.stats import zscore

z_scores = zscore(df)

mask = np.abs(z_scores) > 3

outliers = df.where(mask)  # Replace non-outlier values with NaN
outliers = outliers.dropna(how='all')  # Keep rows with at least one outlier value
print(f"Number of outliers using Zscore: {len(outliers)} ({round(100*len(outliers)/len(df),2)}% of rows)")

Because the variables are not normally distributed, this outlier-detection method can give us misleading results unless we assume they should be normal.  
We will raise the threshold to 4 to minimize the loss of data.
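
As a point of reference for choosing between the two thresholds, the share of values that a truly normal column would place beyond each one can be computed from the standard normal tail. This is a small sketch of that arithmetic; it says nothing about the actual data, which is not normal.

```python
from scipy.stats import norm

# Two-sided tail probability P(|Z| > t) for a standard normal variable
expected_fraction = {t: 2 * norm.sf(t) for t in (3, 4)}
for t, frac in expected_fraction.items():
    print(f"|z| > {t}: expected {100 * frac:.4f}% of values per column")
```

If the observed share is much larger than these figures, that points to heavy-tailed columns rather than a handful of isolated bad points.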

In [None]:
mask = np.abs(z_scores) > 4
outliers = df.where(mask)
outliers = outliers.dropna(how='all')
print(f"Number of outliers using Zscore: {len(outliers)} ({round(100*len(outliers)/len(df),2)}% of rows)")

The percentage is more acceptable, even though it is still high, so we will adopt this threshold for the rest of the project.

In [None]:
# Keep only the rows in which every column has |z| < 4
mask = np.abs(z_scores) < 4
df = df.where(mask)
df = df.dropna(how='any')
df.shape

#### 2. IQR (Interquartile Range) 

In [None]:
Q1 = X.quantile(0.25)
Q3 = X.quantile(0.75)
IQR = Q3 - Q1

# Define a mask for values outside the IQR range
mask = ((X < (Q1 - 1.5 * IQR)) | (X > (Q3 + 1.5 * IQR)))

outliers = X.where(mask)  # This will replace non-outliers with NaNs
outliers = outliers.dropna(how='all')
# X was built before the z-score filtering, so the percentage is relative to len(X)
print(f"Number of outliers using IQR: {len(outliers)} ({round(100*len(outliers)/len(X),2)}% of rows)")

The IQR method flagged a large number of rows as "outliers". Because removing them all risked losing significant data, I explored alternative methods to support a more informed decision.
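
When dropping flagged rows would discard too much data, one alternative is to cap values at the IQR fences instead of removing them. The sketch below shows this capping approach, which is not what the notebook does; the helper name `clip_to_iqr_fences` and the toy values are assumptions for illustration.

```python
import pandas as pd

def clip_to_iqr_fences(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Cap each column at [Q1 - k*IQR, Q3 + k*IQR] instead of dropping rows."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # axis=1 aligns the per-column bounds with the frame's columns
    return frame.clip(lower=q1 - k * iqr, upper=q3 + k * iqr, axis=1)

# Toy column (hypothetical values) with one extreme point
demo = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0, 100.0]})
capped = clip_to_iqr_fences(demo)
print(capped)
```

Capping keeps every row, at the cost of distorting the tails, so it suits cases where the extreme values are suspected to be measurement artifacts.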

#### 3. Histograms

This method is based on histograms and the assumption of a normal distribution: the further a value lies from the mean, measured in standard deviations, the more likely it is to be flagged as an outlier. It can be useful when the data are roughly normal, but it becomes less reliable as the distribution deviates from normality.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

threshold = 3

# Flag rows whose value exceeds mean + threshold * std in any column (upper tail only)
outlier_frames = []
for column in df.columns:
    feature = df[column]
    cutoff = feature.mean() + threshold * feature.std()
    outlier_frames.append(df[feature > cutoff])

# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
outliers = pd.concat(outlier_frames)

In [None]:
# Plot histograms with outliers highlighted
for column in df.columns:
    feature = df[column]
    plt.figure(figsize=(8, 6))
    plt.hist(feature, bins='auto', alpha=0.7, color='blue')
    plt.hist(outliers[column], bins='auto', alpha=0.5, color='red')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.title(f'Histogram of {column}')
    plt.legend(['Data', 'Outliers'])
    plt.show()

# 4. Feature Engineering
How would you create new features from the existing ones to better capture the underlying patterns in the data?


The idea is to create combinations of existing features, and here we can use two methods:

#### a) Polynomial features  
With `degree=2`, `PolynomialFeatures` adds the square of each feature and the product of every pair of features (plus a constant bias column). The goal is to see whether these terms capture the underlying patterns better.

In [None]:
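from sklearn.preprocessing import PolynomialFeatures
# Rebuild X from the outlier-filtered df so its rows line up with df['label']
X = df.drop('label', axis=1)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)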

In [None]:
X_poly.shape

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix
corr_matrix = pd.DataFrame(np.c_[X_poly, df['label']]).corr()

# Correlation of every generated feature with the target (last column),
# excluding the target's correlation with itself
target_col = corr_matrix.columns[-1]
correlations = corr_matrix[target_col].drop(target_col)
high_corr_features = correlations[correlations.abs() > 0.5]
print("Features highly correlated with target variable:", high_corr_features.index.tolist())

#### b) Interaction features 
The idea is to multiply features pairwise to create cross (interaction) terms, then let an L1-regularized model indicate which of them carry signal.

In [None]:
from tqdm.notebook import tqdm
from sklearn.linear_model import Lasso

In [None]:
important_features = set()
X_ = df.drop('label', axis=1)  # Base features plus any cross-features that get selected
y = df['label']
feature_names = df.columns.drop('label')

for i, col1 in enumerate(tqdm(feature_names)):
    # Start from the base features and add col1 * col2 for every col2 at or after col1
    X = df.drop('label', axis=1)
    for col2 in feature_names[i:]:
        X[f'{col1}_{col2}'] = X[col1] * X[col2]

    # Fit an L1-regularized linear model; non-zero coefficients mark retained features
    lasso = Lasso(alpha=1)
    lasso.fit(X, y)
    coefs = np.abs(lasso.coef_)

    ranked = sorted(zip(X.columns, coefs), key=lambda x: x[1], reverse=True)
    for feature, coef in ranked:
        if coef > 0:
            print(f"Feature: {feature}, Importance: {coef}")
            X_[feature] = X[feature]
            important_features.add(feature)

print(len(important_features))
print(important_features)

Some cross-features received low but non-zero scores, yet testing them did not result in significant improvements. Consequently, these features will not be adopted.

# 5. Model Building and Evaluation
Build a predictive model using the dataset. Which model did you choose and why? How well does your model perform?

In [None]:
X = df.drop('label', axis=1)
# Use only important features
# X = X[IMPORTANT_FEATURES]
y = df['label']

#### a) Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize DecisionTreeClassifier
dt = DecisionTreeClassifier()

# Fit the model
dt.fit(X_train, y_train)

# Predict the response for test dataset
y_pred = dt.predict(X_test)

# Model Accuracy
print("Decision Tree model accuracy(in %):", accuracy_score(y_test, y_pred)*100)


#### b) Neural Network

In [None]:
import tensorflow as tf

In [None]:
df.columns

Since all the variables are numerical, they will all be treated as numeric features in the modeling part.

In [None]:
NUMERIC_FEATURE = df.columns[:-1]
# NUMERIC_FEATURE = ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_6', 'col_7', 'col_9', 'col_10', 'col_11', 'col_12', 'col_13']
# NUMERIC_FEATURE = IMPORTANT_FEATURES

In [None]:
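# Note: tf.feature_column is deprecated in recent TensorFlow releases;
# this cell assumes a TensorFlow 2.x version that still provides it.
numeric_columns = []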
for feature in NUMERIC_FEATURE:
    num_col = tf.feature_column.numeric_column(feature)
    numeric_columns.append(num_col)
feature_columns = numeric_columns

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train, test = train_test_split(df, test_size=0.2)
print(len(train), 'train examples')
print(len(test), 'test examples')

In [None]:
def df_to_dataset(dataframe,shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    
    labels = dataframe.pop('label')
    
    features = dataframe
    
    ds = tf.data.Dataset.from_tensor_slices((dict(features),labels.values))
    
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    
    ds = ds.batch(batch_size)
    
    return ds

In [None]:
batch_size = 1024
ds_train = df_to_dataset(train, batch_size=batch_size)
ds_test = df_to_dataset(test, shuffle=False, batch_size=batch_size)

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(feature_columns),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(4, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
             loss=tf.keras.losses.BinaryCrossentropy(),
             metrics=['accuracy'])

In [None]:
history = model.fit(ds_train,
                    validation_data=ds_test,
                    epochs=200)
# loss: 0.5379 - accuracy: 0.7210 - val_loss: 0.5376 - val_accuracy: 0.7209

In [None]:
loss, accuracy = model.evaluate(ds_test)
# loss: 0.5377 - accuracy: 0.7204

In [None]:
train_accuracy = history.history['accuracy']
train_loss = history.history['loss']
val_accuracy = history.history['val_accuracy']
val_loss = history.history['val_loss']
epochs = range(len(train_accuracy))

In [None]:
plt.plot(epochs,train_accuracy,label='train_accuracy')
plt.plot(epochs,val_accuracy,label='val_accuracy')
plt.legend()
plt.show()

In [None]:
plt.plot(epochs,train_loss,label='train_loss')
plt.plot(epochs,val_loss,label='val_loss')
plt.legend()
plt.show()

The graphs show how the model fits the data and converges toward the accuracy it can reach.  
After conducting all the tests, I decided to adopt the feed-forward neural network (FNN), which reaches an accuracy of about 72%.
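
An accuracy figure is easier to interpret next to the score of a trivial model that always predicts the most frequent class. The sketch below computes that baseline on made-up labels; in the notebook, `df['label']` would take the place of the hypothetical `labels` series, and the resulting number here says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical labels; in the notebook this would be df['label']
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# Accuracy obtained by always predicting the most frequent class
baseline_accuracy = labels.value_counts(normalize=True).max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

If the model's accuracy is only slightly above this baseline, most of its score comes from class imbalance rather than from the features.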

# 6. Communication
Describe your process and findings clearly and understandably. Are you able to simplify complex data science concepts into everyday language?    
#### a) Describe your process and findings clearly and understandably

The dataset provided posed a significant challenge, but my curiosity drove me to explore and comprehend it. However, due to time constraints, I had to allocate my time across different questions, leaving room for further exploration in each area.  

I began by thoroughly understanding the data since it was crucial to have a clear understanding of the dataset at hand. I utilized various statistical descriptive methods and tests to explore the distribution of the dataset. Visualizations played a vital role in gaining different perspectives on the data.  

Determining feature importance was a critical step, as it required several statistical measures and methods to quantify each feature's contribution to the target variable. This process allowed me to identify the features that carry the most information about the target.  

Detecting outliers proved to be a challenging task as it necessitated identifying and handling outliers without sacrificing valuable information they might contain. Features that contained outliers had the potential to impact the entire dataset, unless we implemented methods to replace outliers with interpolated data or filled them with averages. This process often involved dealing with missing values.  

Feature engineering played a vital role in enhancing the dataset's power and ensuring a better representation of the target variable. This step involved creating new features that might have been hidden in the original dataset.  

The modeling phase, while comparatively straightforward, required care. Choosing appropriate models, fine-tuning hyperparameters, and analyzing training logs can be time-consuming, but these steps are necessary to reach higher accuracy.


#### b) Are you able to simplify complex data science concepts into everyday language?
The field of Data Science continues to evolve, becoming increasingly complex over time, especially with the integration of Artificial Intelligence which remains a research domain. New concepts emerge regularly, adding to the challenge of simplifying and conveying these ideas effectively. However, at its core, Data Science is no more than an advanced application of everyday life experiences, enhanced by technology. Leveraging relatable analogies and real-life examples proves to be an effective strategy in simplifying complex concepts. Additionally, incorporating storytelling techniques establishes a connection with the audience, fostering better understanding and enabling them to immerse themselves in the intricacies of the concept, thus enhancing memory retention. Simplifying concepts through non-technical graphs aids in visual comprehension, allowing the audience to grasp and absorb information more efficiently compared to solely auditory explanations.