## PERSONALITY PREDICTOR


**Table of contents:**

1. IMPORTING LIBRARIES AND LOADING DATA

2. DATA CLEANING AND PREPROCESSING

3. EXPLORATORY DATA ANALYSIS (EDA)

4. FEATURE ENGINEERING

5. MODEL

6. GIVE INPUTS 

7. FIND THE OUTPUT

### IMPORTING LIBRARIES AND LOADING DATA


In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

In [2]:
import os
import kagglehub

# Download the latest version of the dataset
path = kagglehub.dataset_download("rakeshkapilavai/extrovert-vs-introvert-behavior-data")

print("Path to dataset files:", path)
# Build the file path portably instead of concatenating strings
data = pd.read_csv(os.path.join(path, "personality_dataset.csv"))
data.head()

Path to dataset files: /home/mutaician/.cache/kagglehub/datasets/rakeshkapilavai/extrovert-vs-introvert-behavior-data/versions/2


| | Time_spent_Alone | Stage_fear | Social_event_attendance | Going_outside | Drained_after_socializing | Friends_circle_size | Post_frequency | Personality |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | No | 4.0 | 6.0 | No | 13.0 | 5.0 | Extrovert |
| 1 | 9.0 | Yes | 0.0 | 0.0 | Yes | 0.0 | 3.0 | Introvert |
| 2 | 9.0 | Yes | 1.0 | 2.0 | Yes | 5.0 | 2.0 | Introvert |
| 3 | 0.0 | No | 6.0 | 7.0 | No | 14.0 | 8.0 | Extrovert |
| 4 | 3.0 | No | 9.0 | 4.0 | No | 8.0 | 5.0 | Extrovert |


### DATA CLEANING AND PREPROCESSING

**This section prepares the raw data for analysis: it inspects the dataset, removes duplicates, imputes missing values, encodes the categorical features, and splits the data into training and test sets. Each step prints its results so they can be verified.**

#### Step 1: Initial Data Exploration
**Goal**: Understand the structure, size, and basic characteristics of our dataset

In [3]:
print("=== STEP 1: INITIAL DATA EXPLORATION ===")
print(f"Dataset Shape: {data.shape}")
print(f"Total Records: {data.shape[0]}")
print(f"Total Features: {data.shape[1]}")
print("\nColumn Names and Data Types:")
print(data.dtypes)
print("\nFirst 5 rows:")
print(data.head())
print("\nBasic Statistics:")
print(data.describe())

=== STEP 1: INITIAL DATA EXPLORATION ===
Dataset Shape: (2900, 8)
Total Records: 2900
Total Features: 8

Column Names and Data Types:
Time_spent_Alone             float64
Stage_fear                    object
Social_event_attendance      float64
Going_outside                float64
Drained_after_socializing     object
Friends_circle_size          float64
Post_frequency               float64
Personality                   object
dtype: object

First 5 rows:
   Time_spent_Alone Stage_fear  Social_event_attendance  Going_outside  \
0               4.0         No                      4.0            6.0   
1               9.0        Yes                      0.0            0.0   
2               9.0        Yes                      1.0            2.0   
3               0.0         No                      6.0            7.0   
4               3.0         No                      9.0            4.0   

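  Drained_after_socializing  Friends_circle_size  Post_frequency Personality  
0                        No                 13.0             5.0   Extrovert  
1                       Yes                  0.0             3.0   Introvert  
2                       Yes                  5.0             2.0   Introvert  
3                        No                 14.0             8.0   Extrovert  
4                        No                  8.0             5.0   Extrovert  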

#### Step 2: Missing Values Analysis
**Goal**: Identify missing data so that it can be handled in Step 6

In [4]:
print("=== STEP 2: MISSING VALUES ANALYSIS ===")
print("Missing values per column:")
missing_values = data.isnull().sum()
print(missing_values)
print(f"\nTotal missing values: {missing_values.sum()}")
# Divide by the total number of cells (rows x columns), not by the number of rows
print(f"Percentage of missing cells: {(missing_values.sum() / data.size) * 100:.2f}%")

if missing_values.sum() > 0:
    print("\nColumns with missing values:")
    for col in missing_values[missing_values > 0].index:
        pct = (missing_values[col] / len(data)) * 100
        print(f"  {col}: {missing_values[col]} missing ({pct:.2f}%)")
else:
    print("\n✅ No missing values found!")

=== STEP 2: MISSING VALUES ANALYSIS ===
Missing values per column:
Time_spent_Alone             63
Stage_fear                   73
Social_event_attendance      62
Going_outside                66
Drained_after_socializing    52
Friends_circle_size          77
Post_frequency               65
Personality                   0
dtype: int64

Total missing values: 458
Percentage of missing cells: 1.97%

Columns with missing values:
  Time_spent_Alone: 63 missing (2.17%)
  Stage_fear: 73 missing (2.52%)
  Social_event_attendance: 62 missing (2.14%)
  Going_outside: 66 missing (2.28%)
  Drained_after_socializing: 52 missing (1.79%)
  Friends_circle_size: 77 missing (2.66%)
  Post_frequency: 65 missing (2.24%)
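
An overall missing-data percentage can be read two ways: as a share of all cells, or as a share of rows that contain at least one gap. The sketch below computes both on a small hypothetical frame; the `toy` data and the `missing_summary` helper are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> dict:
    """Summarise missingness at the cell level and at the row level."""
    per_column = df.isnull().sum()
    return {
        "per_column": per_column.to_dict(),
        # Share of all cells (rows x columns) that are missing
        "cell_pct": per_column.sum() / df.size * 100,
        # Share of rows that contain at least one missing value
        "row_pct": df.isnull().any(axis=1).mean() * 100,
    }

# Hypothetical toy frame: 4 rows x 2 columns, with 2 missing cells in 2 different rows
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": ["x", "y", None, "z"]})
summary = missing_summary(toy)
print(summary)
```

The row-level figure is usually the more relevant one when deciding between dropping incomplete rows and imputing them.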


#### Step 3: Duplicate Detection
**Goal**: Identify and remove any duplicate records to maintain data quality

In [5]:
print("=== STEP 3: DUPLICATE DETECTION ===")
initial_rows = len(data)
duplicate_count = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

if duplicate_count > 0:
    print(f"Percentage of duplicates: {(duplicate_count / initial_rows) * 100:.2f}%")
    data = data.drop_duplicates()
    print(f"✅ Removed {duplicate_count} duplicate rows")
    print(f"Dataset shape after removing duplicates: {data.shape}")
else:
    print("✅ No duplicate rows found!")

print(f"Final dataset size: {len(data)} rows")

=== STEP 3: DUPLICATE DETECTION ===
Number of duplicate rows: 388
Percentage of duplicates: 13.38%
✅ Removed 388 duplicate rows
Dataset shape after removing duplicates: (2512, 8)
Final dataset size: 2512 rows
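
Before dropping duplicates it can help to look at them. With only a few low-cardinality survey features, two different respondents could plausibly give identical answers, and the output above does not tell us which case applies here. A minimal sketch on hypothetical toy data (the `toy` frame is an assumption for illustration) shows how to list every member of a duplicate group:

```python
import pandas as pd

# Hypothetical toy frame; rows 0 and 2 are identical
toy = pd.DataFrame({
    "Stage_fear": ["No", "Yes", "No", "Yes"],
    "Going_outside": [6.0, 0.0, 6.0, 2.0],
})

# keep=False flags every member of a duplicate group, not just the repeats
dup_groups = toy[toy.duplicated(keep=False)]
print(dup_groups)

# drop_duplicates keeps the first occurrence of each group by default
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduped))
```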


#### Step 4: Column Classification
**Goal**: Classify columns into numeric and categorical for appropriate processing

In [6]:
print("=== STEP 4: COLUMN CLASSIFICATION ===")

# Define column types based on the dataset structure
numeric_columns = ['Time_spent_Alone', 'Social_event_attendance', 'Going_outside', 
                  'Friends_circle_size', 'Post_frequency']
categorical_columns = ['Stage_fear', 'Drained_after_socializing']
target_column = 'Personality'

print(f"Numeric columns ({len(numeric_columns)}): {numeric_columns}")
print(f"Categorical columns ({len(categorical_columns)}): {categorical_columns}")
print(f"Target column: {target_column}")

# Verify all columns are accounted for
all_feature_columns = numeric_columns + categorical_columns + [target_column]
missing_cols = [col for col in data.columns if col not in all_feature_columns]
if missing_cols:
    print(f"⚠️  Unclassified columns: {missing_cols}")
else:
    print("✅ All columns properly classified!")

=== STEP 4: COLUMN CLASSIFICATION ===
Numeric columns (5): ['Time_spent_Alone', 'Social_event_attendance', 'Going_outside', 'Friends_circle_size', 'Post_frequency']
Categorical columns (2): ['Stage_fear', 'Drained_after_socializing']
Target column: Personality
✅ All columns properly classified!
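
The column lists above are written by hand. As a possible alternative, they could be derived from the dtypes, which avoids typos when columns change. This is only a sketch on a hypothetical three-column frame, not the notebook's method:

```python
import pandas as pd

# Hypothetical toy frame that mirrors a few of the dataset's columns
toy = pd.DataFrame({
    "Time_spent_Alone": [4.0, 9.0],
    "Stage_fear": ["No", "Yes"],
    "Personality": ["Extrovert", "Introvert"],
})
target = "Personality"
features = toy.drop(columns=[target])

# Numeric dtypes go to one list, everything else is treated as categorical
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```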


#### Step 5: Data Quality Checks
**Goal**: Verify that our data values make logical sense

In [7]:
print("=== STEP 5: DATA QUALITY CHECKS ===")

# Check for negative values in columns where they shouldn't exist
print("Checking for negative values in numeric columns:")
for col in numeric_columns:
    negative_count = (data[col] < 0).sum()
    if negative_count > 0:
        print(f"  ⚠️  {col}: {negative_count} negative values")
    else:
        print(f"  ✅ {col}: No negative values")

# Check value ranges for reasonableness
print("\nValue ranges for numeric columns:")
for col in numeric_columns:
    min_val = data[col].min()
    max_val = data[col].max()
    mean_val = data[col].mean()
    print(f"  {col}: Range [{min_val:.1f} - {max_val:.1f}], Mean: {mean_val:.1f}")

# Check categorical column values
print("\nCategorical column values:")
for col in categorical_columns:
    unique_vals = data[col].unique()
    print(f"  {col}: {unique_vals}")

# Check target variable distribution
print(f"\nTarget variable ({target_column}) distribution:")
target_counts = data[target_column].value_counts()
for personality, count in target_counts.items():
    percentage = (count / len(data)) * 100
    print(f"  {personality}: {count} ({percentage:.1f}%)")

=== STEP 5: DATA QUALITY CHECKS ===
Checking for negative values in numeric columns:
  ✅ Time_spent_Alone: No negative values
  ✅ Social_event_attendance: No negative values
  ✅ Going_outside: No negative values
  ✅ Friends_circle_size: No negative values
  ✅ Post_frequency: No negative values

Value ranges for numeric columns:
  Time_spent_Alone: Range [0.0 - 11.0], Mean: 4.2
  Social_event_attendance: Range [0.0 - 10.0], Mean: 4.2
  Going_outside: Range [0.0 - 7.0], Mean: 3.2
  Friends_circle_size: Range [0.0 - 15.0], Mean: 6.6
  Post_frequency: Range [0.0 - 10.0], Mean: 3.8

Categorical column values:
  Stage_fear: ['No' 'Yes' nan]
  Drained_after_socializing: ['No' 'Yes' nan]

Target variable (Personality) distribution:
  Extrovert: 1417 (56.4%)
  Introvert: 1095 (43.6%)
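
The range check above prints minimum and maximum values for manual review. One way to automate it is to compare each column against explicit bounds. In the sketch below the bounds are assumptions taken from the ranges observed above, not documented limits of the dataset, and `out_of_range` is a hypothetical helper:

```python
import pandas as pd

# Assumed plausible bounds (taken from the observed ranges, not from dataset documentation)
expected_bounds = {"Time_spent_Alone": (0, 11), "Going_outside": (0, 7)}

def out_of_range(df, bounds):
    """Count, per column, the non-missing values that fall outside [low, high]."""
    counts = {}
    for col, (low, high) in bounds.items():
        values = df[col].dropna()
        counts[col] = int(((values < low) | (values > high)).sum())
    return counts

# Hypothetical toy frame with one violation per column and one missing value
toy = pd.DataFrame({
    "Time_spent_Alone": [4.0, 12.0, None],
    "Going_outside": [6.0, -1.0, 3.0],
})
violations = out_of_range(toy, expected_bounds)
print(violations)
```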


#### Step 6: Handle Missing Values
**Goal**: Apply appropriate imputation strategies if missing values are found

In [8]:
print("=== STEP 6: HANDLE MISSING VALUES ===")

# Create a copy for preprocessing
data_clean = data.copy()

# Check the current frame rather than the counts computed before deduplication
if data_clean.isnull().values.any():
    print("Applying imputation strategies...")
    
    # Impute numeric columns with median (robust to outliers)
    if data_clean[numeric_columns].isnull().any().any():
        numeric_imputer = SimpleImputer(strategy='median')
        data_clean[numeric_columns] = numeric_imputer.fit_transform(data_clean[numeric_columns])
        print("✅ Numeric columns: Missing values filled with median")
    
    # Impute categorical columns with mode (most frequent value)
    if data_clean[categorical_columns].isnull().any().any():
        categorical_imputer = SimpleImputer(strategy='most_frequent')
        data_clean[categorical_columns] = categorical_imputer.fit_transform(data_clean[categorical_columns])
        print("✅ Categorical columns: Missing values filled with mode")
    
    # Verify no missing values remain
    remaining_missing = data_clean.isnull().sum().sum()
    print(f"\nRemaining missing values: {remaining_missing}")
else:
    print("✅ No missing values to handle!")

print(f"Dataset shape after handling missing values: {data_clean.shape}")

=== STEP 6: HANDLE MISSING VALUES ===
Applying imputation strategies...
✅ Numeric columns: Missing values filled with median
✅ Categorical columns: Missing values filled with mode

Remaining missing values: 0
Dataset shape after handling missing values: (2512, 8)
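
In this step the imputers are fitted on the full dataset before the train-test split in Step 8, so the test rows contribute to the medians and modes. A common alternative is to fit on the training rows only and reuse the learned values for the test rows. The sketch below illustrates the difference on hypothetical toy data; the frame and the manual split are assumptions for illustration, not the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: the first four rows play the training set, the last two the test set
toy = pd.DataFrame({"Friends_circle_size": [1.0, np.nan, 3.0, 5.0, np.nan, 100.0]})
num_cols = ["Friends_circle_size"]
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Fit on training rows only, then reuse the learned median for the test rows
imputer = SimpleImputer(strategy="median")
train[num_cols] = imputer.fit_transform(train[num_cols])
test[num_cols] = imputer.transform(test[num_cols])

print("Median learned from training rows:", imputer.statistics_[0])
print("Median of all rows (what a pre-split fit would use):", toy["Friends_circle_size"].median())
```

With roughly 2% of values missing per column the practical effect on this dataset is probably small, but fitting on the training rows keeps the test set strictly unseen.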


#### Step 7: Encode Categorical Variables
**Goal**: Convert categorical variables to numeric format for machine learning

In [9]:
print("=== STEP 7: ENCODE CATEGORICAL VARIABLES ===")

# Encode target variable
label_encoder = LabelEncoder()
data_clean[target_column] = label_encoder.fit_transform(data_clean[target_column])
print(f"Target variable encoded: {dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))}")

# One-hot encode other categorical variables
print("\nApplying one-hot encoding to categorical features...")
data_encoded = pd.get_dummies(data_clean, columns=categorical_columns, drop_first=True)

print(f"Original shape: {data_clean.shape}")
print(f"After encoding: {data_encoded.shape}")
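# This is the net change in column count: with drop_first=True each two-level
# column is replaced by a single dummy column, so the difference is 0
print(f"New columns added: {data_encoded.shape[1] - data_clean.shape[1]}")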

# Display new column names
new_columns = [col for col in data_encoded.columns if col not in data_clean.columns]
if new_columns:
    print(f"New encoded columns: {new_columns}")

print("✅ Categorical encoding completed!")

=== STEP 7: ENCODE CATEGORICAL VARIABLES ===
Target variable encoded: {'Extrovert': np.int64(0), 'Introvert': np.int64(1)}

Applying one-hot encoding to categorical features...
Original shape: (2512, 8)
After encoding: (2512, 8)
New columns added: 0
New encoded columns: ['Stage_fear_Yes', 'Drained_after_socializing_Yes']
✅ Categorical encoding completed!
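
The column count stays at 8 because `drop_first=True` turns each Yes/No column into a single indicator column. A minimal sketch on hypothetical toy data shows this, along with an explicit mapping that gives the same 0/1 values (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one binary categorical column
toy = pd.DataFrame({
    "Stage_fear": ["No", "Yes", "No"],
    "Going_outside": [6.0, 0.0, 7.0],
})

# drop_first=True drops the "No" level, leaving one indicator column for "Yes"
encoded = pd.get_dummies(toy, columns=["Stage_fear"], drop_first=True)
print(list(encoded.columns))

# An explicit mapping produces the same 0/1 values for a binary column
mapped = toy["Stage_fear"].map({"No": 0, "Yes": 1})
print(mapped.tolist())
```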


#### Step 8: Train-Test Split
**Goal**: Split the data into training and testing sets for model evaluation

In [10]:
print("=== STEP 8: TRAIN-TEST SPLIT ===")

# Separate features and target
X = data_encoded.drop(columns=[target_column])
y = data_encoded[target_column]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Feature columns: {list(X.columns)}")

# Perform stratified split to maintain class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
train_share = X_train.shape[0] / len(X)
print(f"Split ratio: {train_share:.0%} train, {1 - train_share:.0%} test")

# Verify class balance is maintained
print("\nClass distribution in training set:")
train_dist = y_train.value_counts(normalize=True)
for class_val, percentage in train_dist.items():
    class_name = label_encoder.inverse_transform([class_val])[0]
    print(f"  {class_name}: {percentage:.1%}")

print("\nClass distribution in test set:")
test_dist = y_test.value_counts(normalize=True)
for class_val, percentage in test_dist.items():
    class_name = label_encoder.inverse_transform([class_val])[0]
    print(f"  {class_name}: {percentage:.1%}")

print("\n✅ Data preprocessing completed successfully!")
print("\n" + "="*50)
print("PREPROCESSING SUMMARY:")
# `data` was deduplicated in Step 3, so report the row count from before deduplication
print(f"• Original dataset: ({initial_rows}, {data.shape[1]})")
print(f"• Final processed dataset: {data_encoded.shape}")
print(f"• Training samples: {X_train.shape[0]}")
print(f"• Test samples: {X_test.shape[0]}")
print(f"• Features: {X_train.shape[1]}")
print("• Ready for Exploratory Data Analysis!")
print("="*50)

=== STEP 8: TRAIN-TEST SPLIT ===
Features shape: (2512, 7)
Target shape: (2512,)
Feature columns: ['Time_spent_Alone', 'Social_event_attendance', 'Going_outside', 'Friends_circle_size', 'Post_frequency', 'Stage_fear_Yes', 'Drained_after_socializing_Yes']

Training set: 2009 samples
Test set: 503 samples
Split ratio: 80% train, 20% test

Class distribution in training set:
  Extrovert: 56.4%
  Introvert: 43.6%

Class distribution in test set:
  Extrovert: 56.5%
  Introvert: 43.5%

✅ Data preprocessing completed successfully!

==================================================
PREPROCESSING SUMMARY:
• Original dataset: (2900, 8)
• Final processed dataset: (2512, 8)
• Training samples: 2009
• Test samples: 503
• Features: 7
• Ready for Exploratory Data Analysis!
==================================================
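
The near-identical class shares in the two sets come from `stratify=y`. A minimal sketch on synthetic labels illustrates the effect; the 60/40 label vector and the single dummy feature are assumptions for illustration, not the notebook's data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows with a 60/40 class split
X = pd.DataFrame({"feature": range(100)})
y = pd.Series([0] * 60 + [1] * 40)

# stratify=y keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The mean of a 0/1 label vector is the share of class 1
print(f"Class 1 share in train: {y_tr.mean():.2f}")
print(f"Class 1 share in test:  {y_te.mean():.2f}")
```

Without `stratify`, the shares in a small test set can drift from the overall distribution by chance, which makes metrics harder to compare across runs.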
