# Password Data Exploratory Analysis

This notebook performs Exploratory Data Analysis (EDA) on a password dataset stored in a plain text file (one password per line).

## 1. Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from collections import Counter

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')

## 2. Load Data

Specify the path to your password file and load the passwords into a pandas Series.

In [None]:
# --- Configuration ---
PASSWORD_FILE_PATH = '../data/raw/your_passwords.txt'  # <--- *** UPDATE THIS PATH ***
# ---------------------

passwords = []
try:
    # Read one password per line. errors='ignore' silently drops bytes that are
    # not valid UTF-8, so affected passwords load with those characters missing.
    # Leading/trailing whitespace is stripped and empty lines are skipped.
    with open(PASSWORD_FILE_PATH, 'r', encoding='utf-8', errors='ignore') as f:
        passwords = [pw for pw in (line.strip() for line in f) if pw]
    print(f"Successfully loaded {len(passwords)} passwords from '{PASSWORD_FILE_PATH}'.")
except FileNotFoundError:
    print(f"Error: File not found at '{PASSWORD_FILE_PATH}'. Please check the path.")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")

# Convert the list of passwords into a pandas Series
if passwords:
    password_series = pd.Series(passwords, name='password')
    print("\nSample passwords:")
    print(password_series.head())
else:
    print("\nNo passwords were loaded. Creating an empty Series.")
    # Create an empty series to prevent errors in later cells
    password_series = pd.Series([], dtype=str, name='password')

## 3. Basic Statistics

Summarize the dataset: the total number of passwords, the number of unique passwords, the most common password, and how often it occurs.

In [None]:
if not password_series.empty:
    print("Basic Password Statistics:")
    # describe() for object dtype gives count, unique, top, freq
    print(password_series.describe())
else:
    print("Password series is empty. Cannot calculate statistics.")
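
`describe()` reports only the single most common password. As a possible extension, the sketch below lists the top few passwords with their counts and share of the dataset. The helper name `top_passwords` and the toy `sample` data are hypothetical and exist only for illustration; in the notebook you would pass `password_series` instead.

```python
import pandas as pd

def top_passwords(series: pd.Series, n: int = 10) -> pd.DataFrame:
    """Return the n most frequent passwords with their counts and share of the total."""
    counts = series.value_counts().head(n)  # sorted by frequency, descending
    return pd.DataFrame({
        'count': counts,
        'share': counts / len(series),
    })

# Hypothetical toy sample, for illustration only
sample = pd.Series(
    ['123456', 'password', '123456', 'qwerty', '123456', 'password'],
    name='password',
)
top = top_passwords(sample, n=2)
print(top)
```

A heavily skewed `share` column usually indicates that a handful of passwords dominate the dataset, which is worth knowing before interpreting the later plots.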

## 4. Password Length Distribution

Analyze the distribution of password lengths.

In [None]:
if not password_series.empty:
    password_lengths = password_series.str.len()

    print("Password Length Statistics:")
    print(password_lengths.describe())

    # Plotting the distribution
    plt.figure(figsize=(14, 7))
    # Lengths are integers, so draw one bar per length (discrete=True) when the
    # range is small; fall back to 100 bins when the longest password is very long.
    max_len = int(password_lengths.max())
    if max_len <= 100:
        sns.histplot(password_lengths, discrete=True, stat='count')
    else:
        sns.histplot(password_lengths, bins=100, stat='count')
    plt.title('Distribution of Password Lengths', fontsize=16)
    plt.xlabel('Password Length', fontsize=12)
    plt.ylabel('Frequency (Count)', fontsize=12)
    # Adjust x-ticks for readability: roughly 20 ticks across the axis
    tick_step = max(1, max_len // 20)
    plt.xticks(range(0, max_len + tick_step, tick_step))
    plt.xlim(0, max_len + 1)  # Set x-axis limits
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()
else:
    print("Password series is empty. Cannot analyze lengths.")
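
The histogram shows the shape of the distribution, but a cumulative table can make questions like "what fraction of passwords have at most 8 characters?" easier to answer. The following is a minimal sketch of one way to compute that. The helper name `length_coverage` and the toy `sample` data are hypothetical; in the notebook you would pass `password_series`.

```python
import pandas as pd

def length_coverage(series: pd.Series) -> pd.DataFrame:
    """Count passwords per length and the cumulative share up to each length."""
    lengths = series.str.len()
    counts = lengths.value_counts().sort_index()  # index = length, ascending
    out = pd.DataFrame({'count': counts})
    out['cumulative_share'] = counts.cumsum() / counts.sum()
    out.index.name = 'length'
    return out

# Hypothetical toy sample, for illustration only
sample = pd.Series(['abc', 'abcd', 'abcd', 'abcdefgh'])
length_cov = length_coverage(sample)
print(length_cov)
```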

## 5. Character Frequency Analysis

Analyze the frequency of individual characters appearing in the passwords.

In [None]:
if not password_series.empty:
    # Concatenate all passwords into a single string
    all_chars = "".join(password_series.astype(str))

    # Count character frequencies
    char_counts = Counter(all_chars)

    # Convert to a DataFrame for easier plotting
    char_freq_df = pd.DataFrame(char_counts.items(), columns=['Character', 'Frequency'])
    char_freq_df = char_freq_df.sort_values(by='Frequency', ascending=False)

    print("Character Frequency Analysis (Top 30):")
    print(char_freq_df.head(30))

    # Plotting the frequency of the top N characters
    top_n = 40
    plt.figure(figsize=(15, 7))
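    # Assign hue explicitly: passing palette without hue is deprecated in seaborn 0.13+.
    # Note: legend=False on barplot also requires seaborn 0.13 or newer.
    sns.barplot(x='Character', y='Frequency', hue='Character', data=char_freq_df.head(top_n), palette='viridis', legend=False)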
    plt.title(f'Frequency of Top {top_n} Characters', fontsize=16)
    plt.xlabel('Character', fontsize=12)
    plt.ylabel('Total Frequency', fontsize=12)
    plt.xticks(rotation=45)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()
else:
    print("Password series is empty. Cannot analyze character frequencies.")
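
Total frequency can be inflated by passwords that repeat one character many times. As an optional complement, the sketch below counts how many passwords contain each character at least once. The helper name `char_password_coverage` and the toy `sample` data are hypothetical; in the notebook you would pass `password_series`.

```python
from collections import Counter

import pandas as pd

def char_password_coverage(series: pd.Series) -> pd.DataFrame:
    """For each character, count how many passwords contain it at least once."""
    presence = Counter()
    for pw in series.astype(str):
        presence.update(set(pw))  # set() so each password counts a character once
    df = pd.DataFrame(list(presence.items()), columns=['Character', 'Passwords'])
    df['Share'] = df['Passwords'] / len(series)
    return df.sort_values(by='Passwords', ascending=False).reset_index(drop=True)

# Hypothetical toy sample, for illustration only
sample = pd.Series(['aaa', 'ab', 'b1'])
char_cov = char_password_coverage(sample)
print(char_cov)
```

In the toy sample, `a` occurs four times in total but appears in only two passwords, which is the distinction this view is meant to surface.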

## 6. Further Analysis Ideas

*   **Character Type Analysis:** Analyze the usage of lowercase letters, uppercase letters, digits, and symbols.
*   **Common Substring/N-gram Analysis:** Identify frequently occurring sequences of characters (e.g., '123', 'password', 'qwerty').
*   **Positional Character Analysis:** Analyze character frequencies based on their position in the password (e.g., first character, last character).
*   **Entropy Calculation:** Estimate the randomness or complexity of passwords, for example with the Shannon entropy of each password's character distribution.
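
The ideas above could be started along the following lines. This is a rough sketch, not a definitive implementation. All helper names (`char_type_flags`, `top_ngrams`, `positional_counts`, `shannon_entropy`) and the toy `sample` data are hypothetical. The character classes are ASCII-only by assumption, and the per-password Shannon entropy is only a crude proxy for complexity, not a measure of how guessable a password is.

```python
import math
from collections import Counter

import pandas as pd

def char_type_flags(series: pd.Series) -> pd.DataFrame:
    """Flag which character classes each password uses (ASCII classes only)."""
    s = series.astype(str)
    return pd.DataFrame({
        'has_lower': s.str.contains(r'[a-z]'),
        'has_upper': s.str.contains(r'[A-Z]'),
        'has_digit': s.str.contains(r'[0-9]'),
        'has_symbol': s.str.contains(r'[^a-zA-Z0-9]'),
    })

def top_ngrams(series: pd.Series, n: int = 3, top: int = 10) -> list:
    """Return the most common character n-grams across all passwords."""
    counts = Counter()
    for pw in series.astype(str):
        # Slide a window of width n over the password
        counts.update(pw[i:i + n] for i in range(len(pw) - n + 1))
    return counts.most_common(top)

def positional_counts(series: pd.Series, position: int = 0) -> Counter:
    """Count characters at a given index; use -1 for the last character."""
    chars = series.astype(str).str[position]  # NaN where the index is out of range
    return Counter(chars.dropna())

def shannon_entropy(password: str) -> float:
    """Shannon entropy (bits per character) of the password's own character distribution."""
    if not password:
        return 0.0
    total = len(password)
    counts = Counter(password)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical toy sample, for illustration only
sample = pd.Series(['password1', 'Passw0rd!', '123456', 'qwerty123'])

flags = char_type_flags(sample)
print(flags.mean())                      # share of passwords using each class
print(top_ngrams(sample, n=3, top=3))    # most frequent trigrams
print(positional_counts(sample, -1))     # last-character counts
print(sample.apply(shannon_entropy))     # entropy per password
```

In the notebook, each helper would take `password_series` in place of `sample`. The n-gram loop runs in pure Python, so on very large datasets it may be worth sampling first.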