## Readme

This Jupyter Notebook analyzes the correlation among the relevant numerical features in a dataset to identify and remove highly correlated ones. The purpose of this analysis is to reduce multicollinearity, which can negatively impact the performance of machine learning models.

From the 20 relevant features analyzed, the following 8 features are retained for further use due to their lower correlation:
- 'reading_time'
- 'mistakes_dist_ratio'
- 'polysyllabcount'
- 'sentence_count'
- 'difficult_words'
- 'comma_count'
- 'transitional_phrases_c'
- 'text_dist_words_ratio'

The other 12 features were excluded because of high pairwise correlation with another feature (Pearson correlation coefficient of 0.9 or above).
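
To make the metric concrete, here is a minimal sketch of the Pearson coefficient computed by hand and checked against pandas. The helper `pearson_r` and the toy values are illustrative assumptions; they are not taken from the notebook's dataset.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    # Covariance term divided by the product of the two spreads
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)


# Hypothetical toy data: reading_time is exactly proportional to char_count
toy = pd.DataFrame({
    "char_count": [100, 250, 400, 550, 700],
    "reading_time": [1.5, 3.75, 6.0, 8.25, 10.5],
    "comma_count": [3, 1, 4, 1, 5],
})

r_manual = pearson_r(toy["char_count"], toy["reading_time"])
r_pandas = toy["char_count"].corr(toy["reading_time"])  # pandas defaults to Pearson
print(r_manual, r_pandas)
```

A feature that is a fixed multiple of another yields a coefficient of 1 (up to floating-point error), which is why such pairs carry no additional linear information.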



## Contents

- [1. Import Libraries](#1-Import-Libraries)
- [2. Load Data](#2-Load-Data)
- [3. Correlation Analysis](#3-Correlation-Analysis)
- [4. Analyze Manually What to Remove](#4-Analyze-Manually-What-to-Remove)
- [5. Repeat Correlation Analysis for Remaining Features](#5-Repeat-Correlation-Analysis-for-Remaining-Features)
- [6. List of Remaining Features](#6-List-of-Remaining-Features)


In [1]:
# 1. Import Libraries
import pandas as pd
import numpy as np

# 2. Load Data
df = pd.read_csv('numeric_features_added_v1.csv')

# List of relevant features
relevant_features = [
    'syllable_count', 'letters_count', 'char_count', 'reading_time_minutes', 'reading_time', 
    'preprocessed_text_count', 'lexicon_count', 'word_count_in_full_text', 'stopword_count_in_full_text', 
    'monosyllabcount', 'mistakes_dist_ratio', 'polysyllabcount', 'preprocessed_text_dist_count', 'sentence_count', 
    'mistakes_dist_dist_ratio', 'difficult_words', 'comma_count', 'transitional_phrases_c', 'transitional_phrases_dist_c', 
    'text_dist_words_ratio'
]

# Create a DataFrame
relevant_features_df = pd.DataFrame(relevant_features, columns=['Relevant Features'])

# Filter the original DataFrame to only include relevant features
relevant_features_data = df[relevant_features]

# 3. Correlation Analysis
# Compute the correlation matrix
corr_matrix = relevant_features_data.corr()

# Unstack the correlation matrix to get pairs of features and their correlation
corr_pairs = corr_matrix.unstack()

# Convert the series to a DataFrame and reset the index
corr_pairs_df = pd.DataFrame(corr_pairs, columns=['Correlation']).reset_index()

# Rename columns for clarity
corr_pairs_df.columns = ['Feature1', 'Feature2', 'Correlation']

# Remove self-correlations (where Feature1 == Feature2)
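# .copy() makes this an independent DataFrame, so adding the 'Pair' column below does not raise SettingWithCopyWarning
corr_pairs_df = corr_pairs_df[corr_pairs_df['Feature1'] != corr_pairs_df['Feature2']].copy()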

# Drop duplicate pairs (e.g., (A, B) and (B, A))
corr_pairs_df['Pair'] = corr_pairs_df.apply(lambda row: tuple(sorted([row['Feature1'], row['Feature2']])), axis=1)
corr_pairs_df = corr_pairs_df.drop_duplicates(subset='Pair').drop(columns='Pair')

# Sort the correlations in descending order
sorted_corr_pairs_df = corr_pairs_df.sort_values(by='Correlation', ascending=False)

# Suggest features for removal with correlation >= 0.90
high_corr_features = sorted_corr_pairs_df[sorted_corr_pairs_df['Correlation'] >= 0.90]
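
The unstack, filter, and de-duplicate steps above can also be done in one pass by reading only the cells above the diagonal of the correlation matrix. The following is a sketch of that alternative, not the notebook's own method; the function name `unique_corr_pairs` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def unique_corr_pairs(data: pd.DataFrame) -> pd.DataFrame:
    """Return each unordered feature pair once, sorted by correlation (descending)."""
    corr = data.corr()  # Pearson by default
    # Positions strictly above the diagonal: this skips self-correlations
    # and the mirrored (B, A) copies in a single step.
    rows, cols = np.triu_indices(len(corr), k=1)
    pairs = pd.DataFrame({
        "Feature1": corr.index[rows],
        "Feature2": corr.columns[cols],
        "Correlation": corr.to_numpy()[rows, cols],
    })
    return pairs.sort_values(by="Correlation", ascending=False, ignore_index=True)


# Hypothetical toy data, not the notebook's CSV
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # b = 2 * a
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
pairs = unique_corr_pairs(toy)
print(pairs)
```

With `ignore_index=True` the result is renumbered from 0, so its row labels will differ from the labels shown in the notebook's output.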


In [2]:
sorted_corr_pairs_df.head(30)

|  | Feature1 | Feature2 | Correlation |
|---|---|---|---|
| 64 | reading_time_minutes | reading_time | 1.0 |
| 43 | char_count | reading_time_minutes | 1.0 |
| 44 | char_count | reading_time | 1.0 |
| 358 | transitional_phrases_c | transitional_phrases_dist_c | 0.999887 |
| 23 | letters_count | reading_time_minutes | 0.999829 |
| 24 | letters_count | reading_time | 0.999829 |
| 22 | letters_count | char_count | 0.999829 |
| 127 | lexicon_count | word_count_in_full_text | 0.999818 |
| 106 | preprocessed_text_count | lexicon_count | 0.999419 |
| 107 | preprocessed_text_count | word_count_in_full_text | 0.999208 |


### Analyze manually what to remove (keep one feature from each highly correlated pair)

In [9]:
features_to_remove = [
    'reading_time_minutes',
    'char_count',
    'letters_count',
    'syllable_count',
    'lexicon_count',
    'word_count_in_full_text',
    'transitional_phrases_dist_c',
    'mistakes_dist_dist_ratio',
    'preprocessed_text_dist_count',
    'monosyllabcount',
    'preprocessed_text_count',
    'stopword_count_in_full_text'
]
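
As a sanity check, the retained list can be derived from the removal list rather than retyped by hand. This is a minimal sketch using the feature names from the cells above; the variable names `all_features` and `retained_features` are introduced here for illustration only.

```python
# The 20 candidate features and the 12 removed ones, copied from the cells above
all_features = [
    'syllable_count', 'letters_count', 'char_count', 'reading_time_minutes', 'reading_time',
    'preprocessed_text_count', 'lexicon_count', 'word_count_in_full_text', 'stopword_count_in_full_text',
    'monosyllabcount', 'mistakes_dist_ratio', 'polysyllabcount', 'preprocessed_text_dist_count', 'sentence_count',
    'mistakes_dist_dist_ratio', 'difficult_words', 'comma_count', 'transitional_phrases_c',
    'transitional_phrases_dist_c', 'text_dist_words_ratio',
]
features_to_remove = [
    'reading_time_minutes', 'char_count', 'letters_count', 'syllable_count',
    'lexicon_count', 'word_count_in_full_text', 'transitional_phrases_dist_c',
    'mistakes_dist_dist_ratio', 'preprocessed_text_dist_count', 'monosyllabcount',
    'preprocessed_text_count', 'stopword_count_in_full_text',
]

removed = set(features_to_remove)

# Guard against typos: every removed name must be one of the candidates
unknown = removed - set(all_features)
if unknown:
    raise ValueError(f"Not in the candidate list: {sorted(unknown)}")

# Keep the original column order while dropping the removed names
retained_features = [name for name in all_features if name not in removed]
print(len(retained_features), retained_features)
```

Deriving the list this way keeps the second correlation pass in sync with `features_to_remove` if the removal decisions change later.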


### Repeat correlation analysis for remaining features

In [11]:
# 1. Import Libraries
import pandas as pd
import numpy as np

# 2. Load Data
df = pd.read_csv('numeric_features_added_v1.csv')

# List of relevant features
relevant_features = [
    'reading_time',
    'mistakes_dist_ratio',
    'polysyllabcount', 'sentence_count',
    'difficult_words', 'comma_count',
    'transitional_phrases_c', 'text_dist_words_ratio'
]

# Create a DataFrame
relevant_features_df = pd.DataFrame(relevant_features, columns=['Relevant Features'])

# Filter the original DataFrame to only include relevant features
relevant_features_data = df[relevant_features]

# 3. Correlation Analysis
# Compute the correlation matrix
corr_matrix = relevant_features_data.corr()

# Unstack the correlation matrix to get pairs of features and their correlation
corr_pairs = corr_matrix.unstack()

# Convert the series to a DataFrame and reset the index
corr_pairs_df = pd.DataFrame(corr_pairs, columns=['Correlation']).reset_index()

# Rename columns for clarity
corr_pairs_df.columns = ['Feature1', 'Feature2', 'Correlation']

# Remove self-correlations (where Feature1 == Feature2)
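# .copy() makes this an independent DataFrame, so adding the 'Pair' column below does not raise SettingWithCopyWarning
corr_pairs_df = corr_pairs_df[corr_pairs_df['Feature1'] != corr_pairs_df['Feature2']].copy()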

# Drop duplicate pairs (e.g., (A, B) and (B, A))
corr_pairs_df['Pair'] = corr_pairs_df.apply(lambda row: tuple(sorted([row['Feature1'], row['Feature2']])), axis=1)
corr_pairs_df = corr_pairs_df.drop_duplicates(subset='Pair').drop(columns='Pair')

# Sort the correlations in descending order
sorted_corr_pairs_df = corr_pairs_df.sort_values(by='Correlation', ascending=False)

# Suggest features for removal with correlation >= 0.90
high_corr_features = sorted_corr_pairs_df[sorted_corr_pairs_df['Correlation'] >= 0.90]


In [12]:
sorted_corr_pairs_df.head(30)

|  | Feature1 | Feature2 | Correlation |
|---|---|---|---|
| 20 | polysyllabcount | difficult_words | 0.896468 |
| 2 | reading_time | polysyllabcount | 0.859977 |
| 4 | reading_time | difficult_words | 0.846999 |
| 3 | reading_time | sentence_count | 0.801289 |
| 5 | reading_time | comma_count | 0.703225 |
| 15 | mistakes_dist_ratio | text_dist_words_ratio | 0.702323 |
| 21 | polysyllabcount | comma_count | 0.676264 |
| 37 | difficult_words | comma_count | 0.675047 |
| 6 | reading_time | transitional_phrases_c | 0.664977 |
| 19 | polysyllabcount | sentence_count | 0.639405 |
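
The filter used in the cells above keeps only pairs with `Correlation >= 0.90`, so a strongly negative pair (for example -0.95) would not be flagged. Whether such pairs exist in this dataset is not shown in the output. The sketch below shows a variant that compares the absolute value instead; the helper `flag_high_corr` and the toy pair table are hypothetical.

```python
import pandas as pd


def flag_high_corr(pairs: pd.DataFrame, threshold: float = 0.90) -> pd.DataFrame:
    """Return the rows whose absolute correlation is at or above the threshold."""
    return pairs[pairs["Correlation"].abs() >= threshold]


# Hypothetical pair table in the same layout as sorted_corr_pairs_df
toy_pairs = pd.DataFrame({
    "Feature1": ["a", "a", "b"],
    "Feature2": ["b", "c", "c"],
    "Correlation": [0.896, -0.95, 0.40],
})

flagged = flag_high_corr(toy_pairs)
print(flagged)
```

In this toy table the 0.896 pair stays below the threshold, as the top remaining pair does in the output above, while the negative pair is caught only because of the absolute value.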


### List of remaining features (with correlation < 0.9)

Of the 20 relevant numerical features, the following 8 are retained for further use. The remaining 12 are excluded because of high correlation (0.9 or above):

- 'reading_time'
- 'mistakes_dist_ratio'
- 'polysyllabcount'
- 'sentence_count'
- 'difficult_words'
- 'comma_count'
- 'transitional_phrases_c'
- 'text_dist_words_ratio'
