# Feature Engineering Phase 2: Correlation Analysis & Numeric Weights

This notebook focuses on Phase 2 of feature engineering:
1. Correlation analysis with the outcome variable 'risk'
2. Creating JSON mappings for categorical variables
3. Assigning numeric weights to categories, ranked by their share of high-risk students
4. Sentiment analysis and NLP processing for comments

## 1. Import Libraries and Load Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from scipy.stats import chi2_contingency, pearsonr
import json
import warnings
warnings.filterwarnings('ignore')

# For sentiment analysis and NLP
try:
    from textblob import TextBlob
    TEXTBLOB_AVAILABLE = True
except ImportError:
    print("TextBlob not available. Will use basic sentiment scoring.")
    TEXTBLOB_AVAILABLE = False

plt.style.use('default')
sns.set_palette("husl")

TextBlob not available. Will use basic sentiment scoring.


In [2]:
# Load the engineered dataset from Phase 1
df = pd.read_csv('../data/refined_data_for_model/engineered_student_data.csv')
print(f"Dataset shape: {df.shape}")
print(f"\nDataset info:")
df.info()
print(f"\nRisk distribution:")
print(df['risk'].value_counts())

Dataset shape: (282, 41)

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 282 entries, 0 to 281
Data columns (total 41 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   student_id                    282 non-null    int64  
 1   country                       282 non-null    object 
 2   course                        282 non-null    object 
 3   student_cohort                282 non-null    object 
 4   academic_status               282 non-null    object 
 5   failed_subjects               282 non-null    int64  
 6   study_skills(attended)        282 non-null    object 
 7   referral                      282 non-null    object 
 8   pp_meeting                    282 non-null    object 
 9   self_assessment               109 non-null    object 
 10  readiness_assessment_results  282 non-null    object 
 11  follow_up                     282 non-null    object 
 12  follow_up_type          

## 2. Correlation Analysis with Outcome Variable

### Step 1: Prepare Data for Correlation Analysis

In [3]:
# Create a copy for analysis
df_analysis = df.copy()

# Encode the target variable for correlation analysis
risk_encoder = LabelEncoder()
df_analysis['risk_encoded'] = risk_encoder.fit_transform(df_analysis['risk'])

print("Risk encoding:")
for i, risk_level in enumerate(risk_encoder.classes_):
    print(f"  {risk_level}: {i}")

# Separate features by type
numeric_features = df_analysis.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df_analysis.select_dtypes(include=['object']).columns.tolist()

# Remove target and ID columns from feature lists
numeric_features = [col for col in numeric_features if col not in ['student_id', 'risk_encoded']]
categorical_features = [col for col in categorical_features if col not in ['student_id', 'risk']]

print(f"\nNumeric features ({len(numeric_features)}): {numeric_features}")
print(f"\nCategorical features ({len(categorical_features)}): {categorical_features}")

Risk encoding:
  high: 0
  medium: 1

Numeric features (16): ['failed_subjects', 'subject_1_assess_1', 'subject_1_assess_2', 'subject_1_assess_3', 'subject_1_assess_4', 'attendance_1', 'subject_2_assess_1', 'subject_2_assess_2', 'subject_2_assess_3', 'subject_2_assess_4', 'attendance_2', 'subject_3_assess_1', 'subject_3_assess_2', 'subject_3_assess_3', 'subject_3_assess_4', 'attendance_3']

Categorical features (23): ['country', 'course', 'student_cohort', 'academic_status', 'study_skills(attended)', 'referral', 'pp_meeting', 'self_assessment', 'readiness_assessment_results', 'follow_up', 'follow_up_type', 'subject_1', 'learn_jcu_issues_1', 'lecturer_referral_1', 'subject_2', 'learn_jcu_issues_2', 'lecturer_referral_2', 'subject_3', 'learn_jcu_issues_3', 'lecturer_referral_3', 'comments', 'identified_issues', 'course_group']
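
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `high` maps to 0 and `medium` to 1 above. The sketch below uses toy labels and a toy feature (illustrative values, not taken from the dataset) to show what that means for reading correlation signs: a feature that rises with high risk correlates negatively with the encoded target.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels and a toy feature (illustrative values only)
risk = np.array(['medium', 'high', 'medium', 'high', 'medium', 'high'])
failed = np.array([0, 3, 1, 4, 0, 2])

encoder = LabelEncoder()
risk_encoded = encoder.fit_transform(risk)

# classes_ is sorted, so 'high' -> 0 and 'medium' -> 1
class_codes = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(class_codes)

# More failed subjects go with 'high' (code 0), so the correlation is negative
r = np.corrcoef(failed, risk_encoded)[0, 1]
print(f"r = {r:.3f}")
```

If a positive sign for "more risk" is preferred, one option is an explicit mapping such as `{'medium': 0, 'high': 1}` instead of relying on alphabetical order.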


### Step 2: Numeric Features Correlation

In [4]:
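# Calculate correlations for numeric features.
# risk_encoded is binary (high=0, medium=1), so Pearson's r here is the
# point-biserial correlation; a negative r means the feature rises with HIGH risk.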
numeric_correlations = {}
print("NUMERIC FEATURES CORRELATION WITH RISK:")
print("=" * 50)

for feature in numeric_features:
    # Remove any NaN values for correlation calculation
    valid_data = df_analysis[[feature, 'risk_encoded']].dropna()

    if len(valid_data) > 1:
        correlation, p_value = pearsonr(valid_data[feature], valid_data['risk_encoded'])
        numeric_correlations[feature] = {
            'correlation': correlation,
            'p_value': p_value,
            'abs_correlation': abs(correlation)
        }
        print(f"{feature:25} | Corr: {correlation:6.3f} | P-value: {p_value:6.3f}")
    else:
        numeric_correlations[feature] = {
            'correlation': 0,
            'p_value': 1.0,
            'abs_correlation': 0
        }
        print(f"{feature:25} | No valid data for correlation")

# Sort by absolute correlation
sorted_numeric = sorted(numeric_correlations.items(),
                       key=lambda x: x[1]['abs_correlation'], reverse=True)

print(f"\nTOP NUMERIC CORRELATIONS:")
for feature, stats in sorted_numeric[:10]:
    print(f"{feature:25} | {stats['correlation']:6.3f}")

NUMERIC FEATURES CORRELATION WITH RISK:
failed_subjects           | Corr: -0.670 | P-value:  0.000
subject_1_assess_1        | Corr:  0.431 | P-value:  0.000
subject_1_assess_2        | Corr:  0.246 | P-value:  0.000
subject_1_assess_3        | Corr:  0.296 | P-value:  0.000
subject_1_assess_4        | Corr:  0.352 | P-value:  0.000
attendance_1              | Corr:  0.399 | P-value:  0.000
subject_2_assess_1        | Corr:  0.346 | P-value:  0.000
subject_2_assess_2        | Corr:  0.386 | P-value:  0.000
subject_2_assess_3        | Corr:  0.340 | P-value:  0.000
subject_2_assess_4        | Corr:  0.410 | P-value:  0.000
attendance_2              | Corr:  0.428 | P-value:  0.000
subject_3_assess_1        | Corr:  0.336 | P-value:  0.000
subject_3_assess_2        | Corr:  0.303 | P-value:  0.000
subject_3_assess_3        | Corr:  0.342 | P-value:  0.000
subject_3_assess_4        | Corr:  0.344 | P-value:  0.000
attendance_3              | Corr:  0.428 | P-value:  0.000

TOP NUMERIC COR
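
Because the encoded target has only two values, Pearson's r against it is the point-biserial correlation. A minimal check on toy data (the arrays below are made up for illustration) confirms that `scipy.stats.pointbiserialr` returns the same coefficient and p-value as `pearsonr`:

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Toy data (illustrative only): a binary target and a continuous score
target = np.array([0, 0, 0, 1, 1, 1, 1, 0])
score = np.array([35.0, 42.0, 50.0, 68.0, 74.0, 81.0, 59.0, 47.0])

r_pearson, p_pearson = pearsonr(score, target)
r_pb, p_pb = pointbiserialr(target, score)

print(f"Pearson:        r={r_pearson:.4f}, p={p_pearson:.4f}")
print(f"Point-biserial: r={r_pb:.4f}, p={p_pb:.4f}")
```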

### Step 3: Categorical Features Association

In [5]:
# Calculate chi-square test for categorical features
categorical_associations = {}
print("CATEGORICAL FEATURES ASSOCIATION WITH RISK:")
print("=" * 55)

for feature in categorical_features:
    try:
        # Create contingency table
        contingency_table = pd.crosstab(df_analysis[feature], df_analysis['risk'])

        # Perform chi-square test
        chi2, p_value, dof, expected = chi2_contingency(contingency_table)

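        # Calculate Cramér's V (effect size).
        # If the feature has a single category, min(shape) - 1 is 0 and V is undefined,
        # so report NaN explicitly instead of dividing by zero.
        n = contingency_table.sum().sum()
        min_dim = min(contingency_table.shape) - 1
        cramers_v = np.sqrt(chi2 / (n * min_dim)) if min_dim > 0 else np.nan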

        categorical_associations[feature] = {
            'chi2': chi2,
            'p_value': p_value,
            'cramers_v': cramers_v,
            'unique_values': df_analysis[feature].nunique()
        }

        print(f"{feature:25} | Chi2: {chi2:8.2f} | P-val: {p_value:6.3f} | Cramér's V: {cramers_v:.3f}")

    except Exception as e:
        categorical_associations[feature] = {
            'chi2': 0,
            'p_value': 1.0,
            'cramers_v': 0,
            'unique_values': df_analysis[feature].nunique()
        }
        print(f"{feature:25} | Error: {str(e)}")

# Sort by Cramér's V (NaN is treated as 0 so undefined values sort last)
sorted_categorical = sorted(categorical_associations.items(),
                          key=lambda x: np.nan_to_num(x[1]['cramers_v']), reverse=True)

print(f"\nTOP CATEGORICAL ASSOCIATIONS:")
for feature, stats in sorted_categorical[:10]:
    print(f"{feature:25} | Cramér's V: {stats['cramers_v']:6.3f}")

CATEGORICAL FEATURES ASSOCIATION WITH RISK:
country                   | Chi2:    26.74 | P-val:  0.685 | Cramér's V: 0.308
course                    | Chi2:    11.27 | P-val:  0.588 | Cramér's V: 0.200
student_cohort            | Chi2:    89.20 | P-val:  0.000 | Cramér's V: 0.562
academic_status           | Chi2:   184.76 | P-val:  0.000 | Cramér's V: 0.809
study_skills(attended)    | Chi2:     5.57 | P-val:  0.351 | Cramér's V: 0.140
referral                  | Chi2:     2.75 | P-val:  0.600 | Cramér's V: 0.099
pp_meeting                | Chi2:     4.38 | P-val:  0.224 | Cramér's V: 0.125
self_assessment           | Chi2:     0.00 | P-val:  1.000 | Cramér's V: 0.000
readiness_assessment_results | Chi2:     0.00 | P-val:  1.000 | Cramér's V: nan
follow_up                 | Chi2:     0.72 | P-val:  0.395 | Cramér's V: 0.051
follow_up_type            | Chi2:     1.94 | P-val:  0.586 | Cramér's V: 0.083
subject_1                 | Chi2:     9.22 | P-val:  0.512 | Cramér's V: 0.181
learn_j
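
The `nan` reported for `readiness_assessment_results` comes from a contingency table with a single row, where Cramér's V is undefined. The helper below is a sketch of one way to isolate the calculation and make that case explicit; the function name `cramers_v` and the toy tables are assumptions for illustration. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so the perfect-association example turns it off to reach exactly 1.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table, correction=True):
    """Cramér's V for a 2-D contingency table; NaN when it is undefined."""
    table = np.asarray(table)
    min_dim = min(table.shape) - 1
    if min_dim < 1 or table.sum() == 0:
        return float('nan')  # a single row or column carries no association
    chi2 = chi2_contingency(table, correction=correction)[0]
    return float(np.sqrt(chi2 / (table.sum() * min_dim)))

# Toy tables (illustrative counts only)
perfect = cramers_v([[10, 0], [0, 10]], correction=False)
independent = cramers_v([[10, 10], [10, 10]])
single_category = cramers_v([[30, 70]])
print(perfect, independent, single_category)
```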

## 3. Create JSON Mapping for Categorical Variables

In [6]:
# Create detailed mappings for categorical variables based on their association with risk
categorical_mappings = {}

print("CREATING CATEGORICAL VARIABLE MAPPINGS:")
print("=" * 50)

for feature in categorical_features:
    print(f"\nAnalyzing {feature}:")

    # Get value counts by risk level
    risk_breakdown = pd.crosstab(df_analysis[feature], df_analysis['risk'], normalize='index') * 100

    # Calculate risk scores for each category
    category_scores = {}

    for category in df_analysis[feature].unique():
        if pd.isna(category):
            continue

        category_data = df_analysis[df_analysis[feature] == category]

        if len(category_data) > 0:
            # Calculate percentage of high risk students in this category
            high_risk_pct = (category_data['risk'] == 'high').mean() * 100
            medium_risk_pct = (category_data['risk'] == 'medium').mean() * 100

            # Composite risk score: high risk counts double.
            # With only two risk levels (high + medium = 100%), this reduces to (high_risk_pct + 100) / 3.
            risk_score = (high_risk_pct * 2 + medium_risk_pct) / 3

            category_scores[str(category)] = {
                'risk_score': round(risk_score, 2),
                'high_risk_pct': round(high_risk_pct, 1),
                'medium_risk_pct': round(medium_risk_pct, 1),
                'count': len(category_data)
            }

            print(f"  {category}: Risk Score={risk_score:.1f}, High={high_risk_pct:.1f}%, Medium={medium_risk_pct:.1f}%")

    # Sort categories by risk score
    sorted_categories = sorted(category_scores.items(), key=lambda x: x[1]['risk_score'], reverse=True)

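    # Assign numeric weights based on ranking: the highest-risk category gets
    # weight N, the lowest gets 1. Categories with tied risk scores still receive
    # distinct weights, in the order the stable sort leaves them.
    weights = {}
    for i, (category, scores) in enumerate(sorted_categories):
        weight = len(sorted_categories) - i
        weights[category] = weight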

    categorical_mappings[feature] = {
        'association_strength': categorical_associations[feature]['cramers_v'],
        'category_details': category_scores,
        'numeric_weights': weights,
        'unique_count': len(category_scores)
    }

print(f"\nCompleted mappings for {len(categorical_mappings)} categorical features.")

CREATING CATEGORICAL VARIABLE MAPPINGS:

Analyzing country:
  australia: Risk Score=44.4, High=33.3%, Medium=66.7%
  bangladesh: Risk Score=43.8, High=31.2%, Medium=68.8%
  bhutan: Risk Score=43.5, High=30.6%, Medium=69.4%
  chile: Risk Score=33.3, High=0.0%, Medium=100.0%
  colombia: Risk Score=66.7, High=100.0%, Medium=0.0%
  ghana: Risk Score=33.3, High=0.0%, Medium=100.0%
  india: Risk Score=42.2, High=26.5%, Medium=73.5%
  jordan: Risk Score=33.3, High=0.0%, Medium=100.0%
  kenya: Risk Score=46.5, High=39.5%, Medium=60.5%
  malaysia: Risk Score=33.3, High=0.0%, Medium=100.0%
  nepal: Risk Score=45.6, High=36.7%, Medium=63.3%
  papua new guinea: Risk Score=42.6, High=27.8%, Medium=72.2%
  philippines: Risk Score=44.4, High=33.3%, Medium=66.7%
  vietnam: Risk Score=40.7, High=22.2%, Medium=77.8%
  zimbabwe: Risk Score=50.0, High=50.0%, Medium=50.0%
  nigeria: Risk Score=50.0, High=50.0%, Medium=50.0%
  china: Risk Score=47.2, High=41.7%, Medium=58.3%
  uganda: Risk Score=57.1, High=
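
As a worked example of the scoring and ranking above, the sketch below recomputes the composite score for a few countries using the percentages printed in the output, then assigns rank-based weights the same way the loop does. The helper names `composite_risk_score` and `rank_weights` are hypothetical and exist only for this illustration.

```python
def composite_risk_score(high_pct, medium_pct):
    """Composite score used above: high risk counts double."""
    return (high_pct * 2 + medium_pct) / 3

def rank_weights(scores):
    """Rank categories by score: highest score gets weight N, lowest gets 1."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return {name: len(ordered) - i for i, (name, _) in enumerate(ordered)}

# Percentages copied from the printed output above
scores = {
    'colombia': composite_risk_score(100.0, 0.0),
    'australia': composite_risk_score(33.3, 66.7),
    'chile': composite_risk_score(0.0, 100.0),
    'ghana': composite_risk_score(0.0, 100.0),
}
weights = rank_weights(scores)
print({k: round(v, 1) for k, v in scores.items()})
print(weights)
```

Chile and Ghana have identical scores yet end up with different weights, because the ranking assigns one weight per position. If tied categories should share a weight, a dense rank over the scores would be one alternative.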

### Save JSON Mapping File

In [7]:
# Create comprehensive mapping structure
feature_mappings = {
    'metadata': {
        'created_date': pd.Timestamp.now().isoformat(),
        'dataset_shape': df.shape,
        'risk_levels': list(df['risk'].unique()),
        'risk_distribution': df['risk'].value_counts().to_dict()
    },
    'numeric_features': {
        'correlations': numeric_correlations,
        'top_features': [item[0] for item in sorted_numeric[:10]]
    },
    'categorical_features': categorical_mappings
}

# Save to JSON file
output_path = '../data/refined_data_for_model/feature_mappings.json'
with open(output_path, 'w') as f:
    json.dump(feature_mappings, f, indent=2)

print(f"Feature mappings saved to: {output_path}")
print(f"\nMapping summary:")
print(f"  - Numeric features analyzed: {len(numeric_correlations)}")
print(f"  - Categorical features analyzed: {len(categorical_mappings)}")
print(f"  - Total features mapped: {len(numeric_correlations) + len(categorical_mappings)}")

Feature mappings saved to: ../data/refined_data_for_model/feature_mappings.json

Mapping summary:
  - Numeric features analyzed: 16
  - Categorical features analyzed: 23
  - Total features mapped: 39
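
One caveat with the saved file: `json.dump` writes a NaN value (such as the undefined Cramér's V above) as the bare token `NaN`, which Python reads back but strict JSON parsers may reject, and NumPy integer scalars are not serializable at all. The sketch below shows one possible sanitizing step; `to_jsonable` is a hypothetical helper and the sample dictionary is illustrative.

```python
import json
import math
import numpy as np

def to_jsonable(obj):
    """Recursively convert NumPy scalars and NaN into strict-JSON-friendly values."""
    if isinstance(obj, dict):
        return {str(k): to_jsonable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(v) for v in obj]
    if isinstance(obj, np.generic):
        obj = obj.item()  # NumPy scalar -> plain Python scalar
    if isinstance(obj, float) and math.isnan(obj):
        return None  # strict JSON has no NaN literal
    return obj

sample = {
    'association_strength': np.float64('nan'),
    'count': np.int64(12),
    'dataset_shape': (282, 41),
}
# allow_nan=False makes json raise if any NaN slipped through
encoded = json.dumps(to_jsonable(sample), allow_nan=False)
print(encoded)
```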


## 4. Apply Numeric Weights to Categorical Variables

In [8]:
# Create a new dataframe with numeric weights applied
df_weighted = df.copy()

print("APPLYING NUMERIC WEIGHTS TO CATEGORICAL VARIABLES:")
print("=" * 55)

for feature in categorical_features:
    if feature in categorical_mappings:
        weights = categorical_mappings[feature]['numeric_weights']

        # Create new weighted column
        weighted_col = f"{feature}_weighted"
        df_weighted[weighted_col] = df_weighted[feature].astype(str).map(weights)

        # Unmapped values (including missing values, which astype(str) turns into 'nan')
        # fall back to the median weight
        median_weight = np.median(list(weights.values()))
        df_weighted[weighted_col] = df_weighted[weighted_col].fillna(median_weight)

        print(f"{feature:25} -> {weighted_col:35} | Unique weights: {len(weights)}")

        # Show weight distribution
        weight_dist = df_weighted[weighted_col].value_counts().sort_index()
        print(f"  Weight distribution: {dict(weight_dist)}")

print(f"\nWeighted dataset shape: {df_weighted.shape}")
print(f"New weighted columns added: {len([col for col in df_weighted.columns if col.endswith('_weighted')])}")

APPLYING NUMERIC WEIGHTS TO CATEGORICAL VARIABLES:
country                   -> country_weighted                    | Unique weights: 32
  Weight distribution: {1: np.int64(1), 2: np.int64(1), 3: np.int64(1), 4: np.int64(1), 5: np.int64(1), 6: np.int64(3), 7: np.int64(1), 8: np.int64(1), 9: np.int64(1), 10: np.int64(1), 11: np.int64(5), 12: np.int64(9), 13: np.int64(49), 14: np.int64(18), 15: np.int64(62), 16: np.int64(16), 17: np.int64(3), 18: np.int64(3), 19: np.int64(30), 20: np.int64(38), 21: np.int64(12), 22: np.int64(2), 23: np.int64(2), 24: np.int64(4), 25: np.int64(2), 26: np.int64(3), 27: np.int64(7), 28: np.int64(1), 29: np.int64(1), 30: np.int64(1), 31: np.int64(1), 32: np.int64(1)}
course                    -> course_weighted                     | Unique weights: 14
  Weight distribution: {1: np.int64(17), 2: np.int64(9), 3: np.int64(8), 4: np.int64(38), 5: np.int64(7), 6: np.int64(34), 7: np.int64(28), 8: np.int64(31), 9: np.int64(18), 10: np.int64(42), 11: np.int64(26), 1
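
The mapping step can be seen in isolation on a small series. This sketch uses hypothetical weights and values to show that both an unseen category and a missing value end up with the median weight:

```python
import numpy as np
import pandas as pd

# Hypothetical weights for a three-category feature
weights = {'a': 3, 'b': 2, 'c': 1}
values = pd.Series(['a', 'c', 'unseen', None])

median_weight = float(np.median(list(weights.values())))
# Anything without an entry in `weights` maps to NaN, then receives the median
weighted = values.astype(str).map(weights).fillna(median_weight)
print(weighted.tolist())
```

For a column that is mostly missing, such as `self_assessment` (109 non-null of 282), this means most rows carry the same fallback weight.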

## 5. Sentiment Analysis and NLP for Comments

### Step 1: Identify Comment Fields

In [9]:
# Identify potential comment/text fields
comment_fields = []
text_fields = df_weighted.select_dtypes(include=['object']).columns.tolist()

print("IDENTIFYING COMMENT FIELDS:")
print("=" * 30)

for field in text_fields:
    if field not in ['student_id', 'risk'] and not field.endswith('_weighted'):
        # Check if field contains longer text (potential comments)
        avg_length = df_weighted[field].astype(str).str.len().mean()
        unique_ratio = df_weighted[field].nunique() / len(df_weighted)

        print(f"{field:25} | Avg Length: {avg_length:5.1f} | Unique Ratio: {unique_ratio:.3f}")

        # Consider as comment field if average length > 20 chars or high uniqueness
        if avg_length > 20 or unique_ratio > 0.5:
            comment_fields.append(field)
            print(f"  -> Identified as COMMENT field")

print(f"\nComment fields identified: {comment_fields}")

IDENTIFYING COMMENT FIELDS:
country                   | Avg Length:   6.7 | Unique Ratio: 0.113
course                    | Avg Length:   7.2 | Unique Ratio: 0.050
student_cohort            | Avg Length:   7.3 | Unique Ratio: 0.028
academic_status           | Avg Length:  12.6 | Unique Ratio: 0.014
study_skills(attended)    | Avg Length:  14.5 | Unique Ratio: 0.021
referral                  | Avg Length:  10.8 | Unique Ratio: 0.018
pp_meeting                | Avg Length:   9.1 | Unique Ratio: 0.014
self_assessment           | Avg Length:   2.8 | Unique Ratio: 0.007
readiness_assessment_results | Avg Length:  22.0 | Unique Ratio: 0.004
  -> Identified as COMMENT field
follow_up                 | Avg Length:   2.5 | Unique Ratio: 0.007
follow_up_type            | Avg Length:   5.2 | Unique Ratio: 0.014
subject_1                 | Avg Length:   6.0 | Unique Ratio: 0.039
learn_jcu_issues_1        | Avg Length:   6.8 | Unique Ratio: 0.007
lecturer_referral_1       | Avg Length:  14.4 | Uniq
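
The length-or-uniqueness rule flags `readiness_assessment_results` even though its unique ratio of 0.004 corresponds to a single distinct value across 282 rows. A possible stricter variant, sketched below under the assumption that constant columns should never count as comments, drops missing values before measuring and requires at least two distinct values. The function name and the sample strings are hypothetical.

```python
import pandas as pd

def looks_like_free_text(series, min_avg_len=20, min_unique_ratio=0.5):
    """Hypothetical stricter variant of the heuristic above."""
    values = series.dropna().astype(str)
    if values.nunique() < 2:
        return False  # a constant column cannot carry comment-like information
    avg_len = values.str.len().mean()
    unique_ratio = values.nunique() / len(values)
    return bool(avg_len > min_avg_len or unique_ratio > min_unique_ratio)

# Illustrative inputs: one long constant value, and a few distinct comments
constant = pd.Series(['assessment not completed yet'] * 5)
comments = pd.Series(['missed two classes, follow-up booked',
                      'no concerns raised this term', None,
                      'struggling with referencing in essays'])
print(looks_like_free_text(constant), looks_like_free_text(comments))
```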

### Step 2: Sentiment Analysis Function

In [10]:
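def analyze_sentiment(text):
    """
    Score the sentiment of a free-text value.

    Returns a dict with sentiment_score (-1 to 1), sentiment_category
    ('positive', 'negative' or 'neutral'), text_length and word_count.
    Missing or blank text is scored as neutral.
    """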
    if pd.isna(text) or str(text).strip() == '':
        return {
            'sentiment_score': 0,
            'sentiment_category': 'neutral',
            'text_length': 0,
            'word_count': 0
        }

    text_str = str(text).lower()

    if TEXTBLOB_AVAILABLE:
        # Use TextBlob for sentiment analysis
        blob = TextBlob(text_str)
        sentiment_score = blob.sentiment.polarity  # -1 to 1
    else:
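        # Simple keyword-based sentiment scoring (fallback, capped at +/-0.5).
        # Keywords are matched as substrings, so 'excellent' also matches 'excellently'.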
        positive_words = ['good', 'excellent', 'great', 'outstanding', 'positive', 'strong', 'effective']
        negative_words = ['poor', 'bad', 'terrible', 'weak', 'negative', 'struggling', 'concerning', 'issue']

        pos_count = sum(1 for word in positive_words if word in text_str)
        neg_count = sum(1 for word in negative_words if word in text_str)

        # Simple scoring
        if pos_count > neg_count:
            sentiment_score = min(0.5, pos_count * 0.2)
        elif neg_count > pos_count:
            sentiment_score = max(-0.5, -neg_count * 0.2)
        else:
            sentiment_score = 0

    # Categorize sentiment
    if sentiment_score > 0.1:
        category = 'positive'
    elif sentiment_score < -0.1:
        category = 'negative'
    else:
        category = 'neutral'

    return {
        'sentiment_score': round(sentiment_score, 3),
        'sentiment_category': category,
        'text_length': len(text_str),
        'word_count': len(text_str.split())
    }

print("Sentiment analysis function defined.")
print("\nTesting sentiment analysis:")
test_texts = [
    "This student is performing excellently",
    "Student is struggling with attendance issues",
    "Average performance, no major concerns"
]

for text in test_texts:
    result = analyze_sentiment(text)
    print(f"'{text}' -> {result['sentiment_category']} ({result['sentiment_score']})")

Sentiment analysis function defined.

Testing sentiment analysis:
'This student is performing excellently' -> positive (0.2)
'Student is struggling with attendance issues' -> negative (-0.4)
'Average performance, no major concerns' -> neutral (0)
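
The fallback scorer matches keywords as substrings, which is why "excellently" and "issues" register in the test above. The same behaviour can also produce unintended hits. The sketch below contrasts substring matching with a whole-word alternative; both helper names and the sample sentence are hypothetical, and the whole-word version is shown only as a comparison, not as what the notebook does.

```python
import re

negative_words = ['bad', 'issue', 'weak']

def substring_hits(text, words):
    """Count keywords found anywhere in the text (the fallback's behaviour)."""
    text = text.lower()
    return sum(1 for w in words if w in text)

def whole_word_hits(text, words):
    """Count keywords that appear as complete words (stricter alternative)."""
    text = text.lower()
    return sum(1 for w in words if re.search(rf"\b{re.escape(w)}\b", text))

sample = "Badge collected; attendance issues discussed"
print(substring_hits(sample, negative_words))   # 'bad' in 'badge', 'issue' in 'issues'
print(whole_word_hits(sample, negative_words))  # neither is a complete word here
```

Whole-word matching avoids the false hit on "badge" but also misses the plural "issues", so a keyword list used that way would need inflected forms or stemming.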


### Step 3: Apply Sentiment Analysis to Comment Fields

In [11]:
# Apply sentiment analysis to identified comment fields
print("APPLYING SENTIMENT ANALYSIS:")
print("=" * 35)

for field in comment_fields:
    print(f"\nProcessing {field}...")

    # Apply sentiment analysis
    sentiment_results = df_weighted[field].apply(analyze_sentiment)

    # Extract sentiment components
    df_weighted[f"{field}_sentiment_score"] = [r['sentiment_score'] for r in sentiment_results]
    df_weighted[f"{field}_sentiment_category"] = [r['sentiment_category'] for r in sentiment_results]
    df_weighted[f"{field}_text_length"] = [r['text_length'] for r in sentiment_results]
    df_weighted[f"{field}_word_count"] = [r['word_count'] for r in sentiment_results]

    # Show sentiment distribution
    sentiment_dist = df_weighted[f"{field}_sentiment_category"].value_counts()
    print(f"  Sentiment distribution: {dict(sentiment_dist)}")

    # Analyze sentiment vs risk
    sentiment_risk = pd.crosstab(df_weighted[f"{field}_sentiment_category"],
                                df_weighted['risk'], normalize='index') * 100
    print(f"  Sentiment vs Risk (% by sentiment):")
    print(sentiment_risk.round(1))

if comment_fields:
    print(f"\nSentiment analysis completed for {len(comment_fields)} comment fields.")
    print(f"Added {len(comment_fields) * 4} new sentiment-related columns.")
else:
    print("\nNo comment fields identified for sentiment analysis.")

APPLYING SENTIMENT ANALYSIS:

Processing readiness_assessment_results...
  Sentiment distribution: {'neutral': np.int64(282)}
  Sentiment vs Risk (% by sentiment):
risk                                             high  medium
readiness_assessment_results_sentiment_category              
neutral                                          33.7    66.3

Processing comments...
  Sentiment distribution: {'neutral': np.int64(270), 'negative': np.int64(12)}
  Sentiment vs Risk (% by sentiment):
risk                         high  medium
comments_sentiment_category              
negative                     91.7     8.3
neutral                      31.1    68.9

Sentiment analysis completed for 2 comment fields.
Added 8 new sentiment-related columns.
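
The percentages above are row-normalized: `normalize='index'` divides each sentiment row by its own total. The sketch below reproduces the `comments` table from counts inferred from the printed percentages and group sizes (an assumption, since the raw rows are not shown here): 12 negative comments split 11 high and 1 medium, and 270 neutral split 84 high and 186 medium.

```python
import pandas as pd

# Counts inferred from the percentages printed above (an assumption, not raw data)
toy = pd.DataFrame({
    'sentiment': ['negative'] * 12 + ['neutral'] * 270,
    'risk': ['high'] * 11 + ['medium'] * 1 + ['high'] * 84 + ['medium'] * 186,
})

# normalize='index' divides each row by its own total, so rows sum to 100%
pct = pd.crosstab(toy['sentiment'], toy['risk'], normalize='index') * 100
print(pct.round(1))
```

The 91.7% high-risk share among negative comments rests on only 12 rows, so it is worth treating as a weak signal until it is checked on more data.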


## 6. Final Feature Engineering Summary

In [12]:
# Save the final engineered dataset
output_path = '../data/refined_data_for_model/fully_engineered_student_data.csv'
df_weighted.to_csv(output_path, index=False)

print("=" * 60)
print("PHASE 2 FEATURE ENGINEERING COMPLETION SUMMARY")
print("=" * 60)

print(f"\n✅ COMPLETED TASKS:")
print(f"   1. ✅ Loaded engineered_student_data.csv")
print(f"   2. ✅ Analyzed correlations for {len(numeric_features)} numeric features")
print(f"   3. ✅ Analyzed associations for {len(categorical_features)} categorical features")
print(f"   4. ✅ Created JSON mapping with numeric weights")
print(f"   5. ✅ Applied numeric weights to categorical variables")
print(f"   6. ✅ Performed sentiment analysis on {len(comment_fields)} comment fields")
print(f"   7. ✅ Created comprehensive feature-engineered dataset")

print(f"\n📊 FINAL DATASET STATUS:")
print(f"   • Original features: {df.shape[1]}")
print(f"   • Final features: {df_weighted.shape[1]}")
print(f"   • New features added: {df_weighted.shape[1] - df.shape[1]}")
print(f"   • Weighted categorical features: {len([col for col in df_weighted.columns if col.endswith('_weighted')])}")
print(f"   • Sentiment features: {len([col for col in df_weighted.columns if 'sentiment' in col])}")

print(f"\n📁 OUTPUT FILES:")
print(f"   • Feature mappings: ../data/refined_data_for_model/feature_mappings.json")
print(f"   • Engineered dataset: {output_path}")

print(f"\n🚀 READY FOR PREDICTION MODEL:")
print(f"   • Dataset shape: {df_weighted.shape}")
print(f"   • Target variable: 'risk' (medium/high)")
print(f"   • Features ready for model training")

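# Show column summary.
# Note: the "original" filter below does not exclude the *_text_length and
# *_word_count columns, so its count includes those NLP columns as well.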
print(f"\n📋 FEATURE SUMMARY:")
original_features = [col for col in df_weighted.columns if not col.endswith('_weighted') and 'sentiment' not in col and col != 'risk_encoded']
weighted_features = [col for col in df_weighted.columns if col.endswith('_weighted')]
sentiment_features = [col for col in df_weighted.columns if 'sentiment' in col or 'text_length' in col or 'word_count' in col]

print(f"   • Original features: {len(original_features)}")
print(f"   • Weighted categorical: {len(weighted_features)}")
print(f"   • Sentiment/NLP features: {len(sentiment_features)}")
print(f"   • Total engineered features: {len(weighted_features) + len(sentiment_features)}")

PHASE 2 FEATURE ENGINEERING COMPLETION SUMMARY

✅ COMPLETED TASKS:
   1. ✅ Loaded engineered_student_data.csv
   2. ✅ Analyzed correlations for 16 numeric features
   3. ✅ Analyzed associations for 23 categorical features
   4. ✅ Created JSON mapping with numeric weights
   5. ✅ Applied numeric weights to categorical variables
   6. ✅ Performed sentiment analysis on 2 comment fields
   7. ✅ Created comprehensive feature-engineered dataset

📊 FINAL DATASET STATUS:
   • Original features: 41
   • Final features: 72
   • New features added: 31
   • Weighted categorical features: 23
   • Sentiment features: 4

📁 OUTPUT FILES:
   • Feature mappings: ../data/refined_data_for_model/feature_mappings.json
   • Engineered dataset: ../data/refined_data_for_model/fully_engineered_student_data.csv

🚀 READY FOR PREDICTION MODEL:
   • Dataset shape: (282, 72)
   • Target variable: 'risk' (medium/high)
   • Features ready for model training

📋 FEATURE SUMMARY:
   • Original features: 45
   • Weighted 