<h1><center>Dimension Reduction Tutorial</center></h1>

## What is Dimension Reduction?

Dimension reduction is the process of eliminating variables that have little effect on the target and finding the most relevant and meaningful properties of the dataset. Sometimes we work on a dataset with more than a thousand features, and it is very hard to analyze each and every variable. Dimension reduction helps us map the data to a lower-dimensional space so that we can extract patterns and insights more easily.

The Divorce dataset that we used in the first part of this project has 54 features, which makes it difficult to know which predictors to use in our model. In the first part of the project I used a chi-square test to find significant predictors and chose the 8 variables with the highest scores.

In this part, we will go through other methods of dimension reduction and compare the results of each method to see which works better.

## Why is Dimension Reduction required?

Here are three benefits of applying dimension reduction techniques to a dataset:

1) Applying dimension reduction reduces the space we need to store the data.

2) Fewer dimensions mean less processing and computation time.

3) Visualizing data is easier with fewer dimensions.

## Common Dimension Reduction Techniques

### Missing Value Ratio

This method removes columns with many missing values. Imagine a column in which more than half of the values are null: even if we impute the column, it will not give us useful information, so it is better to just get rid of it.

Running `X.isnull().sum()/len(X)*100` gives the percentage of missing values in each column. Since our data has no null values, every number is zero, so this technique is not useful in this case.
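For a dataset that does contain nulls, the filter could look like the minimal sketch below. The toy DataFrame and the 50% cut-off are assumptions chosen for illustration; they are not taken from the Divorce data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is 75% missing, column "c" is 25% missing.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [0.5, np.nan, 0.1, 0.2],
})

# Percentage of missing values per column (same expression as in the text).
missing_pct = toy.isnull().sum() / len(toy) * 100

# Assumed cut-off: drop any column with more than 50% missing values.
threshold = 50
kept_columns = missing_pct[missing_pct <= threshold].index.tolist()
reduced = toy[kept_columns]

print(missing_pct)
print(kept_columns)
```

The threshold is a judgment call; a stricter or looser value may suit other datasets better.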

### Low Variance Filter

Imagine an attribute for which all the records have the same value; this column will not improve our model. In our case, suppose everyone answered Never = 0 to one of the questions. A column with very low variance is not helpful, and we prefer to drop it.

First, we need to calculate the variance, so we need to make sure that there are no nulls in the data. In my case there are none, so I just calculated the variance.

Running `X.var()` shows that this is again not a good method here: the variances of all the columns are so close to each other that it is hard to decide which one to remove.
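When the variances do differ clearly, the filter can be sketched as follows. The toy columns and the 0.5 cut-off are assumptions for illustration only, and the comparison is only meaningful when all columns share the same scale.

```python
import pandas as pd

# Hypothetical toy answers on a shared scale; "q_const" never varies.
toy = pd.DataFrame({
    "q_const": [0, 0, 0, 0, 0],
    "q_low":   [1, 1, 1, 1, 2],
    "q_high":  [0, 4, 1, 3, 2],
})

variances = toy.var()  # sample variance of each column (ddof=1)

# Assumed cut-off; in practice choose it after looking at the sorted variances.
threshold = 0.5
low_variance = variances[variances < threshold].index.tolist()
reduced = toy.drop(columns=low_variance)

print(variances.sort_values())
print(low_variance)
```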

### High Correlation Filter

Some of the questions might carry the same meaning for couples, so their answers might be the same. In that case the correlation between those questions will be very close to one, so I believe this might be a good method to filter some attributes in the dataset. After finding a pair of attributes with a high correlation, we can remove one of them and keep the other for our model.

After running `X.corr()` we get a 54 x 54 matrix. Checking all of these numbers by eye is not practical, so I do not prefer this method either.
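One way to avoid reading the whole matrix by eye is to let pandas flag the highly correlated pairs. The sketch below is only an illustration: the toy data and the 0.9 cut-off are assumptions, and it keeps the first column of each correlated pair.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "q2" is almost a copy of "q1", "q3" is unrelated.
rng = np.random.default_rng(0)
q1 = rng.integers(0, 5, size=200).astype(float)
toy = pd.DataFrame({
    "q1": q1,
    "q2": q1 + rng.normal(0, 0.1, size=200),
    "q3": rng.integers(0, 5, size=200).astype(float),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is looked at once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Assumed cut-off: drop the second column of any pair correlated above 0.9.
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
```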

### Backward Feature Elimination

Follow the below steps to understand and use the ‘Backward Feature Elimination’ technique: (https://www.analyticsvidhya.com/blog/)

1) We first take all the n variables present in our dataset and train the model using them

2) We then calculate the performance of the model

3) Now, we compute the performance of the model after eliminating each variable (n times), i.e., we drop one variable every time and train the model on the remaining n-1 variables

4) We identify the variable whose removal has produced the smallest (or no) change in the performance of the model, and then drop that variable

5) Repeat this process until no variable can be dropped
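
The steps above can be sketched as a simple loop. This is a minimal illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the Divorce dataset, 5-fold cross-validated accuracy as the performance measure, and a fixed target of 3 features as the stopping rule. Note that scikit-learn's `RFE`, used in the implementation that follows, ranks features by the fitted model's coefficients instead of re-scoring the model after each removal, so it is a related but not identical procedure.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the Divorce dataset): 6 features, 3 informative.
X_arr, y_toy = make_classification(n_samples=200, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])


def cv_accuracy(features):
    """Mean 5-fold cross-validated accuracy using only the given columns."""
    model = LogisticRegression(solver="liblinear")
    return cross_val_score(model, X_toy[features], y_toy, cv=5).mean()


remaining = list(X_toy.columns)
n_to_keep = 3  # assumed stopping rule: stop once this many features are left

while len(remaining) > n_to_keep:
    # Score the model with each feature left out in turn.
    scores = {f: cv_accuracy([c for c in remaining if c != f]) for f in remaining}
    # Drop the feature whose removal leaves the best-performing model.
    weakest = max(scores, key=scores.get)
    remaining.remove(weakest)

print(remaining)
```

Each pass retrains the model once per remaining feature, which is why this approach becomes expensive as the number of features grows.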


#### Implementation

In [2]:
# Importing required libraries:
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from   sklearn  import metrics
from   sklearn.model_selection import train_test_split
from   sklearn.linear_model    import LogisticRegression
from sklearn.feature_selection import RFE

# Read the data:
df = pd.read_csv("C:\\Users\\Azarm\\Desktop\\BCIT\\AdvancedTopics\\DataSets\\Divorce.csv",header = 0)

# Separate the target and independent variables
X = df.copy()       # Create separate copy to prevent unwanted tampering of data.
del X['Divorce']     # Delete target variable.

# Target variable
y = df['Divorce']

# Create the object of the model
model = LogisticRegression()

# Specify the number of features to select (keyword argument required by recent scikit-learn)
rfe = RFE(model, n_features_to_select=8)

# fit the model
rfe = rfe.fit(X, y)

# Print the selection mask and the ranking of every feature
print('\n\nFEATURES SELECTED\n\n')
print(rfe.support_)

print('\n\nRANKING OF FEATURES\n\n')
print(rfe.ranking_)



FEATURES SELECTED


[False False  True False False  True False False False False False False
 False False False False  True  True False False False False False False
 False  True False False False False False False False False False False
 False False  True  True False False False False False False False False
  True False False False False False]


RANKING OF FEATURES


[13  9  1 28 24  1 47 38 33 41 29 27 32 17  8 26  1  1  6  5 31 44 42 37
 25  1 30  2 15 11  3 19 22 12 36 18 43 14  1  1 16 23 20  7 45 40 34 35
  1 21 46  4 10 39]


#### Result

|    | <div style="text-align: left">Attribute Name</div> | <div style="text-align: left">Information</div> |
|----|---------------|---------------------------------------------------------------------------------------------------------|
| 1  | Q3            | <div style="text-align: left">When we need it, we can take our discussions with my spouse from the beginning and correct it.</div> |
| 2  | Q6            | <div style="text-align: left">We don't have time at home as partners.</div> |
| 3  | ***Q17        | <div style="text-align: left">We share the same views about being happy in our life with my spouse.</div> |
| 4  | ***Q18        | <div style="text-align: left">My spouse and I have similar ideas about how marriage should be.</div> |
| 5  | Q26           | <div style="text-align: left">I know my spouse's basic anxieties.</div> |
| 6  | Q39           | <div style="text-align: left">Our discussions often occur suddenly.</div> |
| 7  | ***Q40        | <div style="text-align: left">We're just starting a discussion before I know what's going on.</div> |
| 8  | Q49           | <div style="text-align: left">We're just starting a discussion before I know what's going on.</div> |

#### Logistic Regression 

In [2]:
# Implementing Logistic Regression

from   sklearn.model_selection import train_test_split
from   sklearn.linear_model    import LogisticRegression

# Re-assign X with only the columns selected by backward feature elimination.
X = X[['Q3', 'Q6', 'Q17','Q18','Q26','Q39','Q40','Q49']]

y = df['Divorce']

# Split data.
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.25,random_state=0)

# Perform logistic regression.
logisticModel = LogisticRegression(fit_intercept=True, solver='liblinear',random_state=0)

# Fit the model.
logisticModel.fit(X_train,y_train)
y_pred=logisticModel.predict(X_test)
# print(y_pred)

# Show accuracy scores.
print('Results without scaling:')

# Show confusion matrix
cm = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
print("\nConfusion Matrix")
print(cm)

TN = cm[0][0] # True Negative  (Col 0, Row 0)
FN = cm[0][1] # False Negative (Col 0, Row 1)
FP = cm[1][0] # False Positive (Col 1, Row 0)
TP = cm[1][1] # True Positive  (Col 1, Row 1)


precision = (TP/(FP + TP))
print("\nPrecision:  " + str(round(precision, 3)))

recall = (TP/(TP + FN))
print("Recall:     " + str(round(recall,3)))

F1 = 2*((precision*recall)/(precision+recall))
print("F1:         " + str(round(F1,3)))

print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))

Results without scaling:

Confusion Matrix
Predicted   0   1
Actual           
0          22   0
1           1  20

Precision:  1.0
Recall:     0.952
F1:         0.976
Accuracy:  0.9767441860465116


#### Cross Fold Validation

In [3]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import numpy as np

# enumerate splits - returns train and test arrays of indexes.
# scikit-learn k-fold cross-validation
from sklearn.model_selection import KFold

# prepare cross validation with five shuffled folds.
kfold         = KFold(n_splits=5, shuffle=True)
accuracyList  = []
precisionList = []
f1List        = []
recallList = []
count         = 0

for train_index, test_index in kfold.split(X):
    # kfold.split returns positional indexes, so select rows with iloc.
    X_train = X.iloc[train_index]
    X_test  = X.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test  = y.iloc[test_index]

    # Perform logistic regression.
    logisticModel = LogisticRegression(fit_intercept=True,
                                       solver='liblinear')
    # Fit the model.
    logisticModel.fit(X_train, y_train)

    y_pred = logisticModel.predict(X_test)

    # Show confusion matrix and accuracy scores.
    cm = pd.crosstab(y_test, y_pred,
                     rownames=['Actual'],
                     colnames=['Predicted'])
    count += 1

    # Calculate the scores for this fold and add each one to its list once.
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)

    accuracyList.append(accuracy)
    precisionList.append(precision)
    recallList.append(recall)
    f1List.append(f1)



# Show averages of scores over multiple runs.
print("\nAccuracy and Standard Deviation For All Folds:")
print("Average accuracy:  " + str(np.mean(accuracyList)))
print("Accuracy std:      " + str(np.std(accuracyList)))
print("Average precision: " + str(np.mean(precisionList)))
print("Precision std:     " + str(np.std(precisionList)))


Accuracy and Standard Deviation For All Folds:
Average accuracy:  0.9882352941176471
Accuracy std:      0.014408763192842228
Average precision: 1.0
Precision std:     0.0
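
The same kind of fold-by-fold evaluation can be written more compactly with scikit-learn's `cross_validate`, which also reports recall and F1. The sketch below is an illustration on synthetic data, not a rerun of the results above; the stratified splitter and the fixed `random_state` are assumptions added to make the folds balanced and reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (not the Divorce dataset) with 8 features.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Five shuffled folds that preserve the class balance in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = cross_validate(
    LogisticRegression(solver="liblinear"),
    X_toy, y_toy, cv=cv,
    scoring=["accuracy", "precision", "recall", "f1"],
)

# Each metric is stored under the key "test_<metric>", one value per fold.
for metric in ["accuracy", "precision", "recall", "f1"]:
    fold_scores = results[f"test_{metric}"]
    print(f"{metric:<10} mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```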


### Forward Feature Selection

This is the opposite process of the Backward Feature Elimination we saw above. Instead of eliminating features, we try to find the best features which improve the performance of the model. This technique works as follows: (https://www.analyticsvidhya.com/blog/)

1) We start with a single feature. 

2) Essentially, we train the model n number of times using each feature separately

3) The variable giving the best performance is selected as the starting variable

4) Then we repeat this process and add one variable at a time. The variable that produces the highest increase in performance is retained

5) We repeat this process until no significant improvement is seen in the model’s performance
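
The steps above correspond closely to scikit-learn's `SequentialFeatureSelector` with `direction="forward"` (available from scikit-learn 0.24). The sketch below is a minimal illustration on synthetic data; the choice of 3 features, accuracy scoring and 5 folds are assumptions, and the stopping rule here is a fixed feature count instead of "no significant improvement".

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the Divorce dataset): 6 features, 3 informative.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)

# Start from no features and greedily add the one that improves
# cross-validated accuracy the most, until 3 features are selected.
selector = SequentialFeatureSelector(
    LogisticRegression(solver="liblinear"),
    n_features_to_select=3,
    direction="forward",
    scoring="accuracy",
    cv=5,
)
selector.fit(X_toy, y_toy)

selected = selector.get_support(indices=True)  # column positions of chosen features
print(selected)
```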

#### Implementation

In [2]:
# Importing required libraries:
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from   sklearn  import metrics
from   sklearn.model_selection import train_test_split
from   sklearn.linear_model    import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.feature_selection import f_regression

# Read the data:
df = pd.read_csv("C:\\Users\\Azarm\\Desktop\\BCIT\\AdvancedTopics\\DataSets\\Divorce.csv",header = 0)

# Separate the target and independent variables
X = df.copy()       # Create separate copy to prevent unwanted tampering of data.
del X['Divorce']     # Delete target variable.

# Target variable
y = df['Divorce']

# f_regression is a univariate scoring function for feature selection: each feature
# is scored on its own against the target. It returns (F-statistics, p-values).
ffs = f_regression(X, y)

# Keep every feature whose F-statistic is at least 700.
variable = []
for i in range(len(X.columns)):   # iterate over all columns, including the last one
    if ffs[0][i] >= 700:
        variable.append(X.columns[i])
    
print(variable)


['Q9', 'Q11', 'Q15', 'Q17', 'Q18', 'Q19', 'Q20', 'Q40']
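
Instead of a hand-picked score threshold, the top-scoring features can be requested directly with `SelectKBest`. The sketch below is an illustration on synthetic data with hypothetical column names; it uses `f_classif`, the univariate F-test scorer scikit-learn provides for classification targets, and `k=4` is an arbitrary choice. Like the code above, it ranks each feature on its own and does not retrain a model at each step.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (not the Divorce dataset) with hypothetical names.
X_arr, y_toy = make_classification(n_samples=200, n_features=10, n_informative=4,
                                   n_redundant=0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"Q{i + 1}" for i in range(10)])

# Score every feature against the class label and keep the 4 best.
selector = SelectKBest(score_func=f_classif, k=4).fit(X_toy, y_toy)
top_features = X_toy.columns[selector.get_support()].tolist()

# Sorted F-statistics, useful for checking how clear the cut is.
scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
print(scores.head(4))
print(top_features)
```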


#### Result

|    | <div style="text-align: left">Attribute Name</div> | <div style="text-align: left">Information</div> |
|----|---------------|---------------------------------------------------------------------------------------------------------|
| 1  | Q9            | <div style="text-align: left">I enjoy traveling with my wife.</div> |
| 2  | Q11           | <div style="text-align: left">I think that one day in the future, when I look back, I see that my spouse and I have been in harmony with each other.</div> |
| 3  | Q15           | <div style="text-align: left">Our dreams with my spouse are similar and harmonious.</div> |
| 4  | ***Q17        | <div style="text-align: left">We share the same views about being happy in our life with my spouse.</div> |
| 5  | ***Q18        | <div style="text-align: left">My spouse and I have similar ideas about how marriage should be.</div> |
| 6  | Q19           | <div style="text-align: left">My spouse and I have similar ideas about how roles should be in marriage.</div> |
| 7  | Q20           | <div style="text-align: left">My spouse and I have similar values in trust.</div> |
| 8  | ***Q40        | <div style="text-align: left">We're just starting a discussion before I know what's going on.</div> |

#### Logistic Regression

In [5]:
# Implementing Logistic Regression

from   sklearn.model_selection import train_test_split
from   sklearn.linear_model    import LogisticRegression

# Re-assign X with only the columns selected by forward feature selection.
X = X[['Q9', 'Q11', 'Q15','Q17','Q18','Q19','Q20','Q40']]

y = df['Divorce']

# Split data.
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.25,random_state=0)

# Perform logistic regression.
logisticModel = LogisticRegression(fit_intercept=True, solver='liblinear',random_state=0)

# Fit the model.
logisticModel.fit(X_train,y_train)
y_pred=logisticModel.predict(X_test)
# print(y_pred)

# Show accuracy scores.
print('Results without scaling:')

# Show confusion matrix
cm = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
print("\nConfusion Matrix")
print(cm)

TN = cm[0][0] # True Negative  (Col 0, Row 0)
FN = cm[0][1] # False Negative (Col 0, Row 1)
FP = cm[1][0] # False Positive (Col 1, Row 0)
TP = cm[1][1] # True Positive  (Col 1, Row 1)


precision = (TP/(FP + TP))
print("\nPrecision:  " + str(round(precision, 3)))

recall = (TP/(TP + FN))
print("Recall:     " + str(round(recall,3)))

F1 = 2*((precision*recall)/(precision+recall))
print("F1:         " + str(round(F1,3)))

print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))

Results without scaling:

Confusion Matrix
Predicted   0   1
Actual           
0          22   0
1           2  19

Precision:  1.0
Recall:     0.905
F1:         0.95
Accuracy:  0.9534883720930233


#### Cross Fold Validation

In [6]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import numpy as np

# enumerate splits - returns train and test arrays of indexes.
# scikit-learn k-fold cross-validation
from sklearn.model_selection import KFold

# prepare cross validation with five shuffled folds.
kfold         = KFold(n_splits=5, shuffle=True)
accuracyList  = []
precisionList = []
f1List        = []
recallList = []
count         = 0

for train_index, test_index in kfold.split(X):
    # kfold.split returns positional indexes, so select rows with iloc.
    X_train = X.iloc[train_index]
    X_test  = X.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test  = y.iloc[test_index]

    # Perform logistic regression.
    logisticModel = LogisticRegression(fit_intercept=True,
                                       solver='liblinear')
    # Fit the model.
    logisticModel.fit(X_train, y_train)

    y_pred = logisticModel.predict(X_test)

    # Show confusion matrix and accuracy scores.
    cm = pd.crosstab(y_test, y_pred,
                     rownames=['Actual'],
                     colnames=['Predicted'])
    count += 1

    # Calculate the scores for this fold and add each one to its list once.
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)

    accuracyList.append(accuracy)
    precisionList.append(precision)
    recallList.append(recall)
    f1List.append(f1)



# Show averages of scores over multiple runs.
print("\nAccuracy and Standard Deviation For All Folds:")
print("Average accuracy:  " + str(np.mean(accuracyList)))
print("Accuracy std:      " + str(np.std(accuracyList)))
print("Average precision: " + str(np.mean(precisionList)))
print("Precision std:     " + str(np.std(precisionList)))


Accuracy and Standard Deviation For All Folds:
Average accuracy:  0.9823529411764707
Accuracy std:      0.02352941176470589
Average precision: 1.0
Precision std:     0.0


### Factor Analysis

## Conclusion

In this tutorial, we went through 5 different methods of dimension reduction. I would like to mention that Backward Feature Elimination (BFE) and Forward Feature Selection (FFS) are both time-consuming and computationally expensive, so they are best used when we have a small number of input variables.

Based on these results, the best method of dimension reduction for the Divorce dataset was Backward Feature Elimination.

### Comparison Table

|<div style="text-align: left">Dimension Reduction Method</div>|<div style="text-align: left">Selected Attributes</div>|<div style="text-align: left">Model Accuracy</div>|<div style="text-align: left">Cross Fold Average Accuracy</div>|
|----|---------------|-------|------------------------------------------------------------------------------------------------|
|<div style="text-align: left">Chi-Square Test</div>|  <div style="text-align: left">Q5, Q9, Q17, Q18, Q19, Q35, Q36, Q40</div>      | <div style="text-align: center">0.9534883720930233</div> | <div style="text-align: center">0.9823529411764707</div>|
|<div style="text-align: left">Backward Feature Elimination</div>|  <div style="text-align: left">Q3, Q6, Q17, Q18, Q26, Q39, Q40, Q49</div>  | <div style="text-align: center">0.9767441860465116</div> | <div style="text-align: center">0.9882352941176471</div>|
|<div style="text-align: left">Forward Feature Selection</div>|  <div style="text-align: left">Q9, Q11, Q15, Q17, Q18, Q19, Q20, Q40</div> | <div style="text-align: center">0.9534883720930233</div> |<div style="text-align: center">0.9823529411764707</div> |

## References

https://www.analyticsvidhya.com/blog/

https://www.datacamp.com/