# AI Text Detector - Data Exploration

This notebook explores a dataset of 11,637 AI-generated texts and 11,637 human-written texts.

## Dataset Overview
- **Total samples**: 23,274 (perfectly balanced)
- **AI-generated**: 11,637 samples (label = 1)
- **Human-written**: 11,637 samples (label = 0)
- **Duplicates**: 428 repeated texts (22,846 unique)
- **Features**: Text content (`text`) and binary labels (`generated`)

In [1]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import re
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

Libraries imported successfully!


In [2]:
# Load the dataset
df = pd.read_csv('../Training_Essay_Data.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nDataset info:")
print(df.info())
print(f"\nFirst few rows:")
df.head()

Dataset shape: (23274, 2)

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23274 entries, 0 to 23273
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       23274 non-null  object
 1   generated  23274 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 363.8+ KB
None

First few rows:


                                                text  generated
0  Car-free cities have become a subject of incre...          1
1   Car Free Cities Car-free cities, a concept ga...          1
2     A Sustainable Urban Future Car-free cities ...          1
3     Pioneering Sustainable Urban Living In an e...          1
4     The Path to Sustainable Urban Living In an ...          1


In [3]:
# Check for missing values and data quality
print("Missing values:")
print(df.isnull().sum())

print(f"\nLabel distribution:")
label_counts = df['generated'].value_counts()
print(label_counts)
print(f"\nPercentage distribution:")
print(df['generated'].value_counts(normalize=True) * 100)

# Check for duplicates
print(f"\nDuplicate texts: {df['text'].duplicated().sum()}")
print(f"Total unique texts: {df['text'].nunique()}")

Missing values:
text         0
generated    0
dtype: int64

Label distribution:
generated
1    11637
0    11637
Name: count, dtype: int64

Percentage distribution:
generated
1    50.0
0    50.0
Name: proportion, dtype: float64

Duplicate texts: 428
Total unique texts: 22846
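
The 428 duplicate texts reported above are worth handling before training, since a repeated essay can otherwise land in both the training and the evaluation split. The following is a minimal sketch, assuming the `text` and `generated` columns shown earlier. The toy frame and the names `toy`, `conflicting`, and `deduped` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical rows, same column names).
toy = pd.DataFrame({
    "text": ["essay a", "essay b", "essay a", "essay c", "essay c"],
    "generated": [1, 0, 1, 0, 1],
})

# Texts that appear under more than one label are ambiguous, so flag them first.
labels_per_text = toy.groupby("text")["generated"].nunique()
conflicting = sorted(labels_per_text[labels_per_text > 1].index)

# Drop the ambiguous texts entirely, then keep the first copy of each remaining text.
deduped = (
    toy[~toy["text"].isin(conflicting)]
    .drop_duplicates(subset="text", keep="first")
    .reset_index(drop=True)
)

print(f"Conflicting texts: {conflicting}")
print(f"Rows before: {len(toy)}, after: {len(deduped)}")
```

Whether to drop label-conflicting texts or resolve them by hand is a judgment call. The sketch only shows one conservative option.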


In [4]:
# Visualize label distribution
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Class Distribution', 'Class Percentage'),
    specs=[[{"type": "bar"}, {"type": "pie"}]]
)

# Bar chart
labels = ['Human (0)', 'AI Generated (1)']
counts = [label_counts[0], label_counts[1]]
colors = ['#FF6B6B', '#4ECDC4']

fig.add_trace(
    go.Bar(x=labels, y=counts, marker_color=colors, name='Count'),
    row=1, col=1
)

# Pie chart
fig.add_trace(
    go.Pie(labels=labels, values=counts, marker_colors=colors, name='Distribution'),
    row=1, col=2
)

fig.update_layout(
    title_text="Dataset Label Distribution",
    showlegend=False,
    height=400
)

fig.show()

In [5]:
# Text length analysis
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
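# Rough proxy: counts '.'-separated chunks, so abbreviations, decimals,
# and a trailing period all inflate the number
df['sentence_count'] = df['text'].str.split('.').str.len()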

print("Text Statistics by Class:")
stats_by_class = df.groupby('generated')[['text_length', 'word_count', 'sentence_count']].agg([
    'mean', 'median', 'std', 'min', 'max'
]).round(2)

print(stats_by_class)

Text Statistics by Class:
          text_length                             word_count                 \
                 mean  median      std  min   max       mean median     std   
generated                                                                     
0             2329.92  2169.0  1033.71  239  8436     419.78  392.0  183.89   
1             1984.40  2048.0   839.41    1  5078     312.38  324.0  123.20   

                    sentence_count                         
          min   max           mean median    std min  max  
generated                                                  
0          48  1367          21.63   21.0  10.22   1  102  
1           1   785          15.15   13.0   6.53   1   94  
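
The table above reports a minimum of 1 character and 1 word in the AI-generated class, which points to at least one degenerate sample. The following sketch shows one way to inspect and filter such rows. The threshold `MIN_WORDS` and the toy data are assumptions for illustration, and a sensible cutoff depends on what the short rows actually contain.

```python
import pandas as pd

# Hypothetical threshold; choose it after looking at the actual short rows.
MIN_WORDS = 5

# Toy stand-in for the real dataset (same column names as the notebook).
toy = pd.DataFrame({
    "text": ["x", "A full sentence with more than five words in it.", "Too short"],
    "generated": [1, 0, 1],
})
toy["word_count"] = toy["text"].str.split().str.len()

# Inspect the rows below the threshold before deciding to drop them.
too_short = toy[toy["word_count"] < MIN_WORDS]
print(too_short[["text", "generated", "word_count"]])

# Keep only rows that meet the threshold.
filtered = toy[toy["word_count"] >= MIN_WORDS].reset_index(drop=True)
print(f"Kept {len(filtered)} of {len(toy)} rows")
```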


In [6]:
# Visualize text length distributions
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Character Length Distribution', 'Word Count Distribution', 
                   'Character Length by Class', 'Word Count by Class')
)

# Overall distributions
fig.add_trace(
    go.Histogram(x=df['text_length'], nbinsx=50, name='Character Length', 
                marker_color='lightblue', opacity=0.7),
    row=1, col=1
)

fig.add_trace(
    go.Histogram(x=df['word_count'], nbinsx=50, name='Word Count', 
                marker_color='lightgreen', opacity=0.7),
    row=1, col=2
)

# By class
for label, color in zip([0, 1], ['#FF6B6B', '#4ECDC4']):
    class_data = df[df['generated'] == label]
    label_name = 'Human' if label == 0 else 'AI Generated'
    
    fig.add_trace(
        go.Histogram(x=class_data['text_length'], nbinsx=30, 
                    name=f'{label_name} - Chars', 
                    marker_color=color, opacity=0.6),
        row=2, col=1
    )
    
    fig.add_trace(
        go.Histogram(x=class_data['word_count'], nbinsx=30, 
                    name=f'{label_name} - Words', 
                    marker_color=color, opacity=0.6),
        row=2, col=2
    )

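# Overlay the per-class histograms so the semi-transparent bars can be compared directly
fig.update_layout(title_text="Text Length Analysis", height=800, barmode='overlay')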
fig.show()

In [7]:
# Linguistic feature analysis
def extract_linguistic_features(text):
    """Extract various linguistic features from text"""
    features = {}
    
    # Basic counts
    features['punctuation_count'] = len(re.findall(r'[.!?;,:]', text))
    features['exclamation_count'] = text.count('!')
    features['question_count'] = text.count('?')
    features['uppercase_count'] = sum(1 for c in text if c.isupper())
    features['digit_count'] = sum(1 for c in text if c.isdigit())
    
    # Ratios
    text_length = len(text)
    if text_length > 0:
        features['punctuation_ratio'] = features['punctuation_count'] / text_length
        features['uppercase_ratio'] = features['uppercase_count'] / text_length
        features['digit_ratio'] = features['digit_count'] / text_length
    else:
        features['punctuation_ratio'] = 0
        features['uppercase_ratio'] = 0
        features['digit_ratio'] = 0
    
    # Average word length
    words = text.split()
    if words:
        features['avg_word_length'] = np.mean([len(word.strip('.,!?;:')) for word in words])
        features['unique_word_ratio'] = len(set(words)) / len(words)
    else:
        features['avg_word_length'] = 0
        features['unique_word_ratio'] = 0
    
    return features

# Apply feature extraction
print("Extracting linguistic features...")
linguistic_features = df['text'].apply(extract_linguistic_features).apply(pd.Series)
df_features = pd.concat([df, linguistic_features], axis=1)

print("Feature extraction completed!")
print(f"New features: {list(linguistic_features.columns)}")

Extracting linguistic features...
Feature extraction completed!
New features: ['punctuation_count', 'exclamation_count', 'question_count', 'uppercase_count', 'digit_count', 'punctuation_ratio', 'uppercase_ratio', 'digit_ratio', 'avg_word_length', 'unique_word_ratio']
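
Several of the count features above can also be computed column-wise with pandas string methods, which avoids calling a Python function on every row. This is a sketch of that alternative, not the notebook's method. The toy series is illustrative, and the regex versions are only approximately equivalent: `[A-Z]` covers ASCII uppercase only, and an empty string yields NaN for the ratio instead of 0.

```python
import pandas as pd

# Toy input standing in for df['text'].
toy = pd.Series(["Hello, World! Is it 2024?", "plain text"])

# Same punctuation pattern as in extract_linguistic_features, applied to the whole column.
vectorized = pd.DataFrame({
    "punctuation_count": toy.str.count(r"[.!?;,:]"),
    "exclamation_count": toy.str.count("!"),
    "question_count": toy.str.count(r"\?"),
    "uppercase_count": toy.str.count(r"[A-Z]"),  # ASCII only, unlike str.isupper()
    "digit_count": toy.str.count(r"\d"),
})
vectorized["punctuation_ratio"] = vectorized["punctuation_count"] / toy.str.len()

print(vectorized)
```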


In [8]:
# Compare linguistic features between classes
feature_cols = ['punctuation_ratio', 'uppercase_ratio', 'avg_word_length', 'unique_word_ratio']

fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[f'{col.replace("_", " ").title()} Distribution' for col in feature_cols]
)

positions = [(1,1), (1,2), (2,1), (2,2)]

for i, (feature, pos) in enumerate(zip(feature_cols, positions)):
    for label, color in zip([0, 1], ['#FF6B6B', '#4ECDC4']):
        class_data = df_features[df_features['generated'] == label]
        label_name = 'Human' if label == 0 else 'AI Generated'
        
        fig.add_trace(
            go.Histogram(x=class_data[feature], nbinsx=30, 
                        name=f'{label_name}', 
                        marker_color=color, opacity=0.6,
                        legendgroup=f'group{label}',
                        showlegend=(i == 0)),
            row=pos[0], col=pos[1]
        )

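# Overlay the per-class histograms so the semi-transparent bars can be compared directly
fig.update_layout(title_text="Linguistic Features Comparison", height=800, barmode='overlay')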
fig.show()

# Statistical comparison
print("\nStatistical comparison of features:")
comparison_stats = df_features.groupby('generated')[feature_cols].agg(['mean', 'std']).round(4)
print(comparison_stats)


Statistical comparison of features:
          punctuation_ratio         uppercase_ratio         avg_word_length  \
                       mean     std            mean     std            mean   
generated                                                                     
0                    0.0156  0.0048          0.0169  0.0143          4.4349   
1                    0.0156  0.0047          0.0114  0.0061          5.1714   

                  unique_word_ratio          
              std              mean     std  
generated                                    
0          0.3028            0.4892  0.0898  
1          0.4788            0.5997  0.1316  
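
Raw means are hard to compare across features with different scales. One common way to put them on the same footing is a standardized mean difference (Cohen's d). The sketch below applies it to the means and standard deviations printed above, using the equal-group-size form of the pooled standard deviation, which fits here because both classes have 11,637 rows. The helper `cohens_d` and the dictionary `printed_stats` are illustrative names, not part of the notebook.

```python
import math


def cohens_d(mean_a, std_a, mean_b, std_b):
    """Standardized mean difference (b minus a) for two groups of equal size."""
    pooled_std = math.sqrt((std_a ** 2 + std_b ** 2) / 2)
    return (mean_b - mean_a) / pooled_std


# (human mean, human std, AI mean, AI std), copied from the printed table.
printed_stats = {
    "punctuation_ratio": (0.0156, 0.0048, 0.0156, 0.0047),
    "uppercase_ratio": (0.0169, 0.0143, 0.0114, 0.0061),
    "avg_word_length": (4.4349, 0.3028, 5.1714, 0.4788),
    "unique_word_ratio": (0.4892, 0.0898, 0.5997, 0.1316),
}

effect_sizes = {name: cohens_d(*values) for name, values in printed_stats.items()}
for name, d in effect_sizes.items():
    print(f"{name:>18}: d = {d:+.2f}")
```

On the printed values, `avg_word_length` gives the largest standardized difference and `punctuation_ratio` gives essentially none. Positive values mean the AI-generated class has the higher mean.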


In [9]:
# Sample texts from each class
print("=" * 80)
print("SAMPLE AI-GENERATED TEXT:")
print("=" * 80)
ai_sample = df[df['generated'] == 1]['text'].iloc[0][:500] + "..."
print(ai_sample)

print("\n" + "=" * 80)
print("SAMPLE HUMAN-WRITTEN TEXT:")
print("=" * 80)
human_sample = df[df['generated'] == 0]['text'].iloc[0][:500] + "..."
print(human_sample)

print("\n" + "=" * 80)
print("KEY OBSERVATIONS:")
print("=" * 80)
print("1. AI sample: more formal and structured, with fewer grammar errors")
print("2. Human sample: more casual, with typos and natural variation")
print("3. Dataset is balanced (11,637 per class), though 428 texts are duplicates")
print("4. AI texts use longer words (mean 5.17 vs 4.43 chars) and a higher unique-word ratio (0.60 vs 0.49)")

SAMPLE AI-GENERATED TEXT:
Car-free cities have become a subject of increasing interest and debate in recent years, as urban areas around the world grapple with the challenges of congestion, pollution, and limited resources. The concept of a car-free city involves creating urban environments where private automobiles are either significantly restricted or completely banned, with a focus on alternative transportation methods and sustainable urban planning. This essay explores the benefits, challenges, and potential solutio...

SAMPLE HUMAN-WRITTEN TEXT:
Have you ever hurd are know about are solar system and whats all in it ? Are solar system is made up of alot of things such ass oher planets ,stars ,l arg rocks,and different galexys . Out of all the plantes Earth is the 3rd planet away from the sun .

Venus has its ups and downs and what I mean about this is that venus has good reasons why people shold go study it . Some bad reasons is that Are planets traval at different speeds .

And o

In [10]:
# Save processed data for model training
df_features.to_csv('../data/processed_training_data.csv', index=False)
print("Processed data saved to '../data/processed_training_data.csv'")

# Save feature statistics for reference
import json

feature_stats = {
    'total_samples': len(df),
    'ai_samples': len(df[df['generated'] == 1]),
    'human_samples': len(df[df['generated'] == 0]),
    'avg_text_length': df['text_length'].mean(),
    'avg_word_count': df['word_count'].mean(),
    'feature_columns': list(linguistic_features.columns)
}

with open('../data/dataset_stats.json', 'w') as f:
    json.dump(feature_stats, f, indent=2)

print("Dataset statistics saved to '../data/dataset_stats.json'")
print("\nData exploration completed! Ready for model training.")

Processed data saved to '../data/processed_training_data.csv'
Dataset statistics saved to '../data/dataset_stats.json'

Data exploration completed! Ready for model training.
