# UCI SECOM dataset


After researching the dataset (my notes are in the *Word document* in the same folder as this analysis), I will now start working with the data. I have looked at the dataset (see the sample further down) and I know it has many columns with numerical variables.

#### First we import all the important stuff and our dataset. 

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer

from sklearn.feature_selection import VarianceThreshold

SemiCom = pd.read_csv("uci-secom.csv")
np.random.seed(0)

#### I will also add a function that makes sure the output is shown on full screen and not in a scrollable block.

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
return false;
}

Let's first see the size of our dataset. As the output below shows, we have to deal with a pretty big dataset.
Now let's look a little closer to see what type of data we have.

In [None]:
print(SemiCom.shape)

mentions something about the nature of these columns.

***
Here we can take a quick look at our data.
***

In [None]:
SemiCom.sample(5)

### A small explanation

***
By looking at the sample and reading about this dataset on Kaggle, I can explain what it contains.
The dataset holds information about a machine with a lot of sensors (about 600 of them). Each sensor output is either numerical or NaN. Next to that there are
two other columns: Time and Pass/Fail. Our goal is to predict as well as we can whether a row will pass or fail, using the most important sensor features.
***

## Cleaning the data

***
Before I can start to work with this dataset I need to clean it. The dataset description says there are missing values, so let's start with those:

* I first want to see what I'm dealing with, so I can decide whether I want to remove columns or fill in values.
***

In [None]:
#Let's check how many values (rows * columns) we have in this dataset
totaldata = np.prod(SemiCom.shape)
totaldata

In [None]:
#Total amount of missing data
missingdata = SemiCom.isnull().sum()
totalmissingdata = missingdata.sum()
totalmissingdata 

***
Now one thing I want to do is check what percentage of all values in this dataset is missing.
***

In [None]:
(totalmissingdata/totaldata) * 100

In [None]:
#I wanted to check which columns had the most NaN values
aa = missingdata.sort_values(ascending=False)

In [None]:
aa.plot()

Comment: can you study where these missing values occur? They look like they are co-occurring. Can you separate those rows and analyze them separately?
You can also check the data description for any statement that may explain the missing values.
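
One possible way to follow up on this comment is sketched below. It is only an illustration on a small made-up frame (`toy` and its column names are invented, not taken from the SECOM file): it groups columns that are missing in exactly the same rows and separates the rows that contain gaps.

```python
import numpy as np
import pandas as pd

# Toy stand-in for SemiCom: "s1" and "s2" are always missing together.
toy = pd.DataFrame({
    "s1": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "s2": [2.0, np.nan, 1.0, np.nan, 4.0, 3.0],
    "s3": [np.nan, 1.0, 2.0, 3.0, 4.0, 5.0],
    "s4": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

is_missing = toy.isnull()

# Keep only columns that have at least one missing value.
is_missing = is_missing.loc[:, is_missing.any()]

# Group columns whose missing-value pattern is identical across all rows.
pattern_groups = {}
for col in is_missing.columns:
    key = tuple(is_missing[col])
    pattern_groups.setdefault(key, []).append(col)
co_missing = [cols for cols in pattern_groups.values() if len(cols) > 1]
print(co_missing)

# Rows with at least one missing value can then be analysed on their own.
rows_with_gaps = toy[toy.isnull().any(axis=1)]
print(len(rows_with_gaps))
```

On the real data the same idea would be applied to `SemiCom` instead of `toy`; whether exact pattern matches exist there is something to check, not something this sketch guarantees.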

# Step 1: The missing-value threshold

***
As you can see, not much of the data is missing, so removing it won't have a big impact because the dataset has very many values. But a clean dataset is necessary to make our prediction more accurate. So my plan is to use a threshold of 15%: when a column is missing more than 15% of its values, the column gets removed.
***

In [None]:
threshold = 0.15

columns_to_drop = missingdata[missingdata > threshold * len(SemiCom)].index
print(columns_to_drop)

***
So here we can see all the columns that are above the threshold and need to be removed. My next step is dropping these columns and checking the column list before and after to confirm that they have been removed. I wanted to do this with the 'dropna()' function, but that drops rows or columns based on their missing values. It cannot be used to drop columns that you specify by name.
***

In [None]:
print(SemiCom.columns)

In [None]:
SemiCom_dropped = SemiCom.drop(columns=columns_to_drop) #Dropping the columns that have more than 15% missing values

print(SemiCom_dropped.columns)
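
As a side note on 'dropna()': it cannot drop columns by name, but its `thresh` argument (the minimum number of non-missing values a column needs to be kept) can express the same 15% rule. The sketch below compares both approaches on a made-up frame; `toy`, `max_missing` and `min_non_null` are names I introduce here only for the illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],        # 0 missing
    "b": [np.nan, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],     # 1 missing (10%)
    "c": [np.nan, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],  # 2 missing (20%)
})
threshold = 0.15

# Approach used in this notebook: list the columns, then drop them by name.
missing = toy.isnull().sum()
dropped_by_name = toy.drop(columns=missing[missing > threshold * len(toy)].index)

# Alternative: keep columns with at least `min_non_null` non-missing values.
max_missing = int(threshold * len(toy))  # largest number of NaNs still allowed
min_non_null = len(toy) - max_missing
dropped_by_thresh = toy.dropna(axis=1, thresh=min_non_null)

print(list(dropped_by_name.columns), list(dropped_by_thresh.columns))
```

Both versions keep "a" and "b" and remove "c" on this toy frame.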

***

As you can see, the number of columns has gone down from 592 to 540. Now we need to look for other ways to remove useless columns, because we still have too many.
After doing some research and asking ChatGPT how I could clean a dataset that has many numerical columns, I found the variance threshold: you remove the columns whose values barely change from row to row. A column that is almost constant is not very useful.

This is useful for me because my dataset probably has many columns like that, which won't provide any extra information for the model.

***

# Step 2: Variance threshold

*** 
First we have to make sure that all columns have a numerical type.
***

In [None]:
#Time could not be converted to float, and because time is not useful to the model, I removed it

SemiCom_dropped = SemiCom_dropped.drop(['Time'], axis=1)


In [None]:
print(SemiCom_dropped.dtypes)

***
I chose a variance threshold of 0.05 because, in my opinion, a column with a variance below 0.05 changes so little that it won't affect the model.
***

In [None]:

thresholder = VarianceThreshold(threshold=0.05)

X_high_variance = thresholder.fit_transform(SemiCom_dropped)
#put the remaining columns in a list
selected_features = SemiCom_dropped.columns[thresholder.get_support()].tolist()
SemiCom_filtered = SemiCom_dropped[selected_features]

SemiCom_filtered.columns
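
One thing to keep in mind is that variance depends on the unit of each column, so a fixed cut-off of 0.05 treats a sensor measured in thousandths differently from one measured in hundreds. The sketch below shows this on invented data and uses min-max scaling as one possible way to make the cut-off comparable. The scaling step is my assumption, not something this notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "small_scale": rng.uniform(0.0, 0.1, size=200),    # varies, but on a tiny scale
    "large_scale": rng.uniform(0.0, 100.0, size=200),  # varies on a large scale
    "constant": np.full(200, 3.0),                     # never changes
})

# On the raw values the fixed threshold also removes the small-scale column.
raw_selector = VarianceThreshold(threshold=0.05)
raw_selector.fit(toy)
kept_raw = toy.columns[raw_selector.get_support()].tolist()

# After min-max scaling every non-constant column spans [0, 1],
# so the same threshold only removes the constant column.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
scaled_selector = VarianceThreshold(threshold=0.05)
scaled_selector.fit(scaled)
kept_scaled = toy.columns[scaled_selector.get_support()].tolist()

print(kept_raw, kept_scaled)
```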

***
Yep! That was a good one. We cut our columns roughly in half, from 540 to 251.

***

# Step 3: Correlation Matrix

***
With this method we want to remove highly correlated columns. The reason is that we probably have columns that carry similar information. Removing them shrinks the dataset and leaves us with more relevant information.

I will be using a threshold of 0.8, which means that any column with an absolute correlation above 0.8 to an earlier column is added to the list of columns that will be removed. The "for" loops compare every pair of columns once, using the positions "i" and "j". When a pair is highly correlated, the column at position "j" is put on the list, and all columns on that list are removed.
***

In [None]:

correlation_matrix = SemiCom_filtered.corr().abs()
threshold = 0.8  #Remove columns with a correlation above 0.8

#Find columns with high correlation
highly_correlated_cols = []

for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if correlation_matrix.iloc[i, j] > threshold:
            colname_i = correlation_matrix.columns[i]
            colname_j = correlation_matrix.columns[j]
            highly_correlated_cols.append(colname_j)

#Dropping highly correlated columns
SemiCom_correlated = SemiCom_filtered.drop(columns=highly_correlated_cols)
print(highly_correlated_cols)
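
The double loop above can also be written with an upper-triangle mask, which is a common pandas idiom. The sketch below is one way to do it on invented data; it additionally keeps the target column out of the filter, which is my own choice and differs from the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "x1": base,
    "x2": base + rng.normal(scale=0.01, size=300),  # near-duplicate of x1
    "x3": rng.normal(size=300),                     # independent of x1
    "Pass/Fail": rng.choice([-1, 1], size=300),     # target, kept out of the filter
})

features = toy.drop(columns=["Pass/Fail"])
corr = features.corr().abs()

# Keep only the part above the diagonal so every pair is looked at once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped when it correlates strongly with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop, list(reduced.columns))
```

Like the loop, this drops the later column of each highly correlated pair, and each column name ends up in the list only once.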


In [None]:
SemiCom_correlated.columns

***
Now we're getting somewhere. We went from 251 to 138 columns. I think this is pretty decent, but I'm not completely happy with the number of columns left. So I'm going to do some more research and find other ways to reduce this number.
***

# Step 4: Fill Missing values

***
One of the last cleaning steps is filling the remaining missing values. I'm doing this by filling each missing value with the mean or the median. But which one is the best option for this dataset? When I searched online I found that the mean is often used when the distribution is fairly symmetric; in that case the median can also be used and the difference won't be big. When the distribution is skewed, the mean is less useful, because the median is less sensitive to outliers.

So our first step is seeing what distribution this dataset has.
***

In [None]:
correlationmat = SemiCom_correlated.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(np.abs(correlationmat), vmax=.8, square=True);


Is the target included here or not? Make it clear!

[optional] Try out hierarchical clustering to find similar columns; it may also be useful for the missing values.

***
Let's look at the skewness of our dataset.
***

In [None]:
skewness = SemiCom_correlated.skew()

plt.figure(figsize=(8, 6))
plt.hist(skewness, bins=20)
plt.title('Skewness')
plt.show()

The tail values are important to identify. Maybe you want to remove these or note any pattern; later, if you build a model, you can check whether it works on these points.

***
So from the graph above we can conclude that we're dealing with skewed distributions. Going by the research above, the median would be the more robust choice in that case. The cells below still fill/impute the remaining missing values with the mean, so the median is worth trying as an alternative.
***

In [None]:
missingvalues = SemiCom_correlated.isnull().sum()
print(missingvalues.sum())
print(missingvalues.shape)

In [None]:
SemiCom_correlated = SemiCom_correlated.fillna(SemiCom_correlated.mean())
missingdata = SemiCom_correlated.isnull().sum()
totalmissingdata = missingdata.sum()
print(totalmissingdata)
print(missingdata.shape)
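
To make the mean versus median trade-off concrete, here is a small sketch on invented numbers. It uses `SimpleImputer`, which is imported at the top of this notebook but not used so far; the frame `toy` is made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "skewed": [1.0, 1.0, 2.0, 2.0, 100.0, np.nan],    # one extreme value
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
})

# Mean fill, as done in the cell above.
mean_filled = toy.fillna(toy.mean())

# Median fill with scikit-learn's imputer.
median_imputer = SimpleImputer(strategy="median")
median_filled = pd.DataFrame(median_imputer.fit_transform(toy), columns=toy.columns)

print(mean_filled.loc[5, "skewed"], median_filled.loc[5, "skewed"])
print(mean_filled.loc[5, "symmetric"], median_filled.loc[5, "symmetric"])
```

In the skewed column the single extreme value pulls the mean far away from the typical readings, while the median stays close to them. In the symmetric column both give the same fill value.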

Build a baseline early.
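
A baseline could look like the sketch below: a classifier that always predicts the most frequent class. The data is invented, and the label coding (-1 for pass, 1 for fail) and the roughly 7% fail rate are assumptions for the demo, not values read from this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.where(rng.random(1000) < 0.07, 1, -1)  # about 7% "fail" (1), invented

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Always predicts the majority class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, y_pred)
baseline_fail_recall = recall_score(y_test, y_pred, pos_label=1)
print(baseline_accuracy, baseline_fail_recall)
```

The point of the comparison: on imbalanced data this do-nothing model already reaches a high accuracy while finding none of the fails, so any real model should be judged against it.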

# Step 5: Handling outliers - Is this a good method for my dataset?

***

To clean my dataset some more, my next step was to get rid of the outliers. But when I started removing the outliers, 1/3 of my dataset went away. So I'm not too sure about handling the outliers.

In [None]:
missing_values = SemiCom_correlated.isnull().sum()
print(missing_values)

Optional: after modeling you can look into regions where the error is large, by selecting points with a large prediction error and comparing the statistics of the features for those points with the rest of the points. For example, side-by-side bar plots for each feature. Make a plot of the difference of the mean or median.

Because you have too many features, maybe you can group columns via hierarchical clustering and then analyze within each group.

consider the model you want to use, does this model need scaling?
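
For logistic regression with the 'sag' solver, the scikit-learn documentation notes that fast convergence is only guaranteed when the features have approximately the same scale. A possible set-up is sketched below on invented data; putting `StandardScaler` in a pipeline is my suggestion, not something done elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three features on very different scales.
X = rng.normal(size=(400, 3)) * np.array([1.0, 100.0, 0.01])
y = (X[:, 0] + X[:, 1] / 100.0 > 0).astype(int)

# The scaler is fitted together with the model, so it only ever sees
# the data passed to fit().
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="sag", max_iter=6000, random_state=0),
)
scaled_model.fit(X, y)
train_accuracy = scaled_model.score(X, y)
print(train_accuracy)
```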


Calculate the correlation of all features with the target and make a bar plot. Find the highly correlating features and make a scatter plot for those.
Calculate the mutual information of all features with the target and make a bar plot. Find the highly correlating features and make a scatter plot for those.
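
A rough sketch of both rankings on invented data is shown below. The column names are made up, and on the real data this would have to run after the missing values are filled, because `mutual_info_classif` does not accept NaN.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
target = rng.choice([-1, 1], size=n)
toy = pd.DataFrame({
    "informative": target + rng.normal(scale=0.5, size=n),  # follows the target
    "noise": rng.normal(size=n),                            # unrelated to the target
    "Pass/Fail": target,
})

X = toy.drop(columns=["Pass/Fail"])
y = toy["Pass/Fail"]

# Absolute linear correlation of every feature with the target.
target_corr = X.corrwith(y).abs().sort_values(ascending=False)

# Mutual information also picks up non-linear relations.
mutual_info = pd.Series(
    mutual_info_classif(X, y, random_state=0), index=X.columns
).sort_values(ascending=False)

print(target_corr)
print(mutual_info)
# target_corr.plot.bar() and mutual_info.plot.bar() would draw the bar plots.
```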

The goal is to predict ==. The strategy is to find what is informing us about the target.

Compare performance as you initially planned. Histograms.
Compare the count of each class and establish whether it is imbalanced.

In [None]:
SemiCom_correlated.boxplot()

In [None]:
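threshold_percentile = 99

#Set every value above its column's 99th percentile to NaN.
#Note: this changes SemiCom_correlated itself, not a copy.
for column in SemiCom_correlated.columns:
    threshold = np.percentile(SemiCom_correlated[column], threshold_percentile)
    SemiCom_correlated.loc[SemiCom_correlated[column] > threshold, column] = np.nan
#Drop every row that lost at least one value
df_cleaned = SemiCom_correlated.dropna()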
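
Many rows disappear because each column flags its own top 1% of rows, and a row is dropped as soon as any single column flags it. The sketch below shows that effect on invented data and compares it with clipping (capping values at the limit), which keeps every row. Clipping is an alternative I am suggesting here, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(1000, 20)), columns=[f"s{i}" for i in range(20)]
)

# 99th percentile of every column.
upper_limits = toy.quantile(0.99)

# Row-dropping approach, done on a new frame so `toy` stays untouched.
masked = toy.where(toy <= upper_limits)  # values above the limit become NaN
rows_left_after_drop = len(masked.dropna())

# Clipping caps the extreme values but keeps every row.
clipped = toy.clip(upper=upper_limits, axis=1)
rows_left_after_clip = len(clipped)

print(rows_left_after_drop, rows_left_after_clip)
```

With only 20 columns a noticeable share of the rows is already lost, so with well over a hundred columns a much larger loss is to be expected.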


In [None]:
df_cleaned.boxplot()

In [None]:
print(df_cleaned.shape)
print(SemiCom_correlated.shape)


In [None]:
SemiCom_correlated = SemiCom_correlated.fillna(SemiCom_correlated.mean())
missing_values = SemiCom_correlated.isnull().sum()
print(missing_values)

***
Now we look at the accuracy of a model trained on the dataset that still has its outliers, and then at the accuracy on the dataset that has been cleaned of outliers. From this we can conclude whether we need to remove outliers. And if not, there is a big question of why there are so many outliers.
***

### Uncleaned from outliers

In [None]:

X = SemiCom_correlated.drop(columns=['Pass/Fail'])
y = SemiCom_correlated['Pass/Fail']

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(solver='sag', max_iter=6000)
model.fit(X_train, y_train)

# Make predictions on the original test set
y_pred_original = model.predict(X_test)
print("Accuracy: ", model.score(X_test,y_test)*100)

***
Here we split the dataset into train and test. Now we have an accuracy, but let's look at the confusion matrix.
***

In [None]:
lr = LogisticRegression(random_state=1, max_iter=6000, solver='sag')
lr.fit(X_train, y_train) 
y_pred = lr.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
plt.rcParams['figure.figsize'] = (5, 5)
sns.set(style = 'dark', font_scale = 1.4)
sns.heatmap(cm, annot = True, annot_kws = {"size": 15})
print(cm)


***
To summarize:

- True Positives (TP): 288
- False Positives (FP): 2
- False Negatives (FN): 24
- True Negatives (TN): 0
***

In [None]:
print("Accuracy: ", lr.score(X_test,y_test)*100)

### Cleaned from outliers

In [None]:
# same thing here as above but now with the outliers dropped.
X = df_cleaned.drop(columns=['Pass/Fail'])
y = df_cleaned['Pass/Fail']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(solver='sag', max_iter=8000)

model.fit(X_train, y_train)

y_pred_original = model.predict(X_test)
print("Accuracy: ", model.score(X_test,y_test)*100)

In [None]:
lr = LogisticRegression(random_state=1, solver='sag', max_iter=8000)
lr.fit(X_train, y_train) 
y_pred = lr.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
plt.rcParams['figure.figsize'] = (5, 5)
sns.set(style = 'dark', font_scale = 1.4)
sns.heatmap(cm, annot = True, annot_kws = {"size": 15})

print(cm)

***
To summarize:

- True Positives (TP): 92
- False Positives (FP): 2
- False Negatives (FN): 9
- True Negatives (TN): 0
***

## Conclusion:

The confusion matrix lets us calculate 3 things:

- Accuracy -> correct predictions / total number of predictions = (TP + TN) / (TP + FP + FN + TN)
- Precision -> TP / (TP + FP)
- Recall -> TP / (TP + FN)

When I calculate these for both confusion matrices, this is the outcome:

Without outliers:

A : 89.32% (92 / 103)
P : 97.87%
R : 91.09%

With outliers:

A : 91.72% (288 / 314)
P : 99.31%
R : 92.31%
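
As a quick check, the figures can be recomputed from the four counts in each summary. This is a small sketch using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)), which reproduce the precision and recall values listed above. The helper name `confusion_metrics` is my own and not part of any library.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return accuracy, precision and recall from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Counts copied from the two "To summarize" lists above.
with_outliers = confusion_metrics(tp=288, fp=2, fn=24, tn=0)
without_outliers = confusion_metrics(tp=92, fp=2, fn=9, tn=0)

for name, (acc, prec, rec) in [
    ("With outliers", with_outliers),
    ("Without outliers", without_outliers),
]:
    print(f"{name}: A={acc:.2%}  P={prec:.2%}  R={rec:.2%}")
```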


This shows that the performance and accuracy with outliers are higher than without outliers. So in conclusion we will not be removing outliers.


# Step 6: hierarchical clustering

-
-
-
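
This step is still open. As a starting point, one possible approach is sketched below: cluster the columns with SciPy, using 1 minus the absolute correlation as the distance between two columns. The data, the average-linkage method and the cut-off of 0.2 are all assumptions for the illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
toy = pd.DataFrame({
    "a1": a,
    "a2": a + rng.normal(scale=0.05, size=300),  # near-copy of a1
    "b1": b,
    "b2": b + rng.normal(scale=0.05, size=300),  # near-copy of b1
})

# Distance between two columns: 0 when perfectly correlated, 1 when uncorrelated.
distance = 1 - toy.corr().abs()
condensed = squareform(distance.values, checks=False)

# Average-linkage clustering, cut so that columns closer than 0.2 share a group.
tree = linkage(condensed, method="average")
labels = fcluster(tree, t=0.2, criterion="distance")
groups = pd.Series(labels, index=toy.columns)
print(groups)
```

Each group could then be analysed together, or reduced to one representative column.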


# Step 7: Oversampler / Undersampler

While working on another problem I came across a useful method that is very helpful for this dataset. Let me explain why.

In [None]:
value_counts = SemiCom_correlated['Pass/Fail'].value_counts()

plt.bar(value_counts.index, value_counts.values)
#show the count above each bar (x position = class label, so it lines up with the bar)
for label, count in zip(value_counts.index, value_counts.values):
    plt.text(label, count + 0.5, str(count), ha='center')

plt.xlabel('Result')
plt.ylabel('Count')
plt.title('Fail/Pass')

plt.show()

***
The difference between pass and fail is huge, which is why our model never guesses the "negative". A solution to this problem is using the oversampler. With this we can equalize the number of pass and fail samples. Then the model should be able to predict better than it does now. So let's test that out.
***

### Let's test if this is useful for our model

In [None]:

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import classification_report

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

### Oversampled results: Precision, recall, accuracy
***

In [None]:
oversampler = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train, y_train)

classifier = LogisticRegression(solver='sag', max_iter=9000)
classifier.fit(X_train_resampled, y_train_resampled)
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))

confusion = confusion_matrix(y_test, y_pred)

print(confusion)
sns.heatmap(confusion, annot=True)

***

The reason we're using this is to look at the difference between the majority class (pass) and the minority class (fail). As we can see, the result for the minority class is pretty bad. There is a majority of false negatives.
Now we are going to equalize the samples to hopefully change the performance.
***

### Undersampler: Creating equal samples to test this theory
***

In [None]:
rus = RandomUnderSampler(random_state=42)

X_train_undersampled, y_train_undersampled = rus.fit_resample(X_train, y_train)

classifier.fit(X_train_undersampled, y_train_undersampled)

#Note: this report is computed on the undersampled training data, not on X_test
y_pred = classifier.predict(X_train_undersampled)
print(classification_report(y_train_undersampled, y_pred))
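
A report on the resampled training rows tends to look better than what the model does on new data. The sketch below shows one way to resample only the training split and still score on the untouched test split. It uses invented data and a manual NumPy undersampling step as a stand-in for `RandomUnderSampler`; the label coding (-1 pass, 1 fail) is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
y = np.where(rng.random(n) < 0.07, 1, -1)  # about 7% "fail" (1), invented
X = rng.normal(size=(n, 4))
X[:, 0] += 1.5 * (y == 1)                  # one feature carries some signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Manual random undersampling of the training split only.
fail_idx = np.flatnonzero(y_train == 1)
pass_idx = rng.choice(
    np.flatnonzero(y_train == -1), size=len(fail_idx), replace=False
)
keep = np.concatenate([fail_idx, pass_idx])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train[keep], y_train[keep])

# Score on the test split, which still has the original class balance.
y_pred = clf.predict(X_test)
test_fail_recall = recall_score(y_test, y_pred, pos_label=1)
print(classification_report(y_test, y_pred))
```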

### This one is undersampled: (Equalized)
***

In [None]:
y_pred = classifier.predict(X_train_undersampled)
cmequal = confusion_matrix(y_train_undersampled, y_pred)
sns.heatmap(cmequal, annot=True)

*** 
One thing I noticed here is that the negatives are still very low. So to maybe get a better model, I wanted to play around with the sample size.

## sampling_strategy added

In [None]:
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)

X_train_undersampled, y_train_undersampled = rus.fit_resample(X_train, y_train)

classifier.fit(X_train_undersampled, y_train_undersampled)

y_pred = classifier.predict(X_train_undersampled)
print(classification_report(y_train_undersampled, y_pred))

In [None]:
y_pred = classifier.predict(X_train_undersampled)
cm = confusion_matrix(y_train_undersampled, y_pred)
sns.heatmap(cm, annot=True)

In [None]:
print(confusion)

In [None]:
print(cmequal)

In [None]:
print(cm)

***
This is pretty good, because in our dataset false negatives are relatively better than false positives: a predicted fail that actually passed is better than the other way around.
***

# GridSearch

In [None]:
# Define the parameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [25, 30, 40, 50, 54, 53, 52, 55, 56, 57, 60, 70, 80],
    'min_samples_leaf': [1, 2, 3, 4, 5],
}
# Create a decision tree classifier
model = DecisionTreeClassifier()

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best parameter and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(best_params)
print(best_score)
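
By default `GridSearchCV` scores a classifier on plain accuracy, which can look high on imbalanced data even when the minority class is never found. The sketch below shows a variant with `scoring="balanced_accuracy"` and stratified folds. The data, the reduced parameter grid and the choice of metric are assumptions for the illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 600
y = np.where(rng.random(n) < 0.1, 1, -1)  # about 10% "fail" (1), invented
X = rng.normal(size=(n, 4))
X[:, 0] += 2.0 * (y == 1)                 # one feature carries some signal

param_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_split": [10, 30, 50],
    "min_samples_leaf": [1, 3, 5],
}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    scoring="balanced_accuracy",  # average of the recall of each class
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```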

## Undersampled?

In [None]:
# Define the parameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [5, 10, 20, 21, 22, 23, 25, 30, 40, 50],
    'min_samples_leaf': [1, 2, 3, 4, 5],
}
# Create a decision tree classifier
model = DecisionTreeClassifier()

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X_train_undersampled, y_train_undersampled)

# Get the best parameter and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(best_params)
print(best_score)