# PySpark Analysis of Most Common Passwords
This notebook analyzes a dataset containing leaked passwords using PySpark.
The objective is to understand the distribution and characteristics of these passwords.

## Environment Setup

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
!tar xf spark-3.0.0-bin-hadoop2.7.tgz

!pip install -q findspark

In [4]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop2.7"

## Google Drive Mount

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## PySpark Initialization and Data Loading

In [6]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [8]:
data = spark.read.text("/content/drive/MyDrive/Colab Notebooks/breachcompilation.txt")

## Basic Exploratory Data Analysis

In [10]:
from pyspark.sql.functions import col, count

password_counts = data.groupBy("value").agg(count("value").alias("count")).orderBy(col("count").desc())

The number of distinct passwords in the dataset (`password_counts` holds one row per unique value):

In [13]:
num_rows = password_counts.count()
print("Number of rows in password_counts:", num_rows)

Number of rows in password_counts: 384153427


The 20 most frequently used passwords:

In [18]:
password_counts.show(20)

+----------+-------+
|     value|  count|
+----------+-------+
|    123456|6690567|
| 123456789|2286809|
|    111111| 983775|
|  password| 964657|
|    qwerty| 887545|
|    abc123| 860278|
|  12345678| 844370|
| password1| 757901|
|   1234567| 744492|
|    123123| 679338|
|1234567890| 673299|
|    000000| 665562|
|     12345| 599246|
|  iloveyou| 443963|
|1q2w3e4r5t| 346500|
|      1234| 340034|
|   123456a| 312049|
|qwertyuiop| 302333|
|    monkey| 290794|
|    123321| 286281|
+----------+-------+
only showing top 20 rows



## Length Distribution

In [14]:
from pyspark.sql.functions import length

# Add a new column for password length
password_lengths = password_counts.withColumn("length", length("value"))

# Group by password length and count
length_distribution = password_lengths.groupBy("length").count().orderBy("length")

# Show the distribution
length_distribution.show()

+------+---------+
|length|    count|
+------+---------+
|     1|      278|
|     2|     8410|
|     3|   118303|
|     4|   590153|
|     5|  2834300|
|     6| 22276725|
|     7| 38878206|
|     8|124435146|
|     9| 51686418|
|    10| 63587053|
|    11| 22078970|
|    12| 16071520|
|    13|  9344199|
|    14|  6972754|
|    15| 11316434|
|    16|  4400637|
|    17|  1399869|
|    18|  1238388|
|    19|   782862|
|    20|   921617|
+------+---------+
only showing top 20 rows



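As the table shows, 8 characters is the most common length among distinct passwords, followed by 10 and then 9 characters. Because `password_counts` holds one row per distinct password, these counts do not reflect how often each password occurs in the dataset.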
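
The difference between counting distinct passwords and counting occurrences can be illustrated with a small plain-Python sketch. The `rows` list is a hypothetical stand-in for a few `password_counts` rows, not data from the notebook.

```python
from collections import Counter

# Hypothetical (password, occurrences) pairs standing in for password_counts rows.
rows = [("123456", 5), ("qwerty", 3), ("password", 4), ("iloveyou", 1)]

distinct_by_length = Counter()   # one vote per distinct password
weighted_by_length = Counter()   # one vote per occurrence in the dataset

for password, occurrences in rows:
    distinct_by_length[len(password)] += 1
    weighted_by_length[len(password)] += occurrences

print(dict(distinct_by_length))
print(dict(weighted_by_length))
```

In PySpark, the weighted variant would presumably sum the `count` column per length (for example with `sum` from `pyspark.sql.functions`) instead of counting rows.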

## Password Classification

In [15]:
from pyspark.sql.functions import when, col

# Classify passwords based on their composition
password_class = password_lengths.withColumn("type",
    when(col("value").rlike("^[0-9]+$"), "Numeric")
    .when(col("value").rlike("^[A-Za-z]+$"), "Alphabetic")
    .when(col("value").rlike("^[A-Za-z0-9]+$"), "Alphanumeric")
    .otherwise("Complex"))

# Group by type and count
type_distribution = password_class.groupBy("type").count().orderBy("count", ascending=False)

# Show the distribution
type_distribution.show()

+------------+---------+
|        type|    count|
+------------+---------+
|Alphanumeric|209751869|
|  Alphabetic|102299695|
|     Complex| 38251465|
|     Numeric| 33850398|
+------------+---------+
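
The `when` chain above assigns the first matching label, so the order of the rules matters: a purely numeric password also matches the alphanumeric pattern but is labelled "Numeric" because that rule comes first. A minimal plain-Python sketch of the same rule order follows; `classify` is a hypothetical helper that approximates the `rlike` checks with `re.fullmatch`.

```python
import re

def classify(password: str) -> str:
    # Same order as the when() chain: the first matching rule wins.
    if re.fullmatch(r"[0-9]+", password):
        return "Numeric"
    if re.fullmatch(r"[A-Za-z]+", password):
        return "Alphabetic"
    if re.fullmatch(r"[A-Za-z0-9]+", password):
        return "Alphanumeric"
    # Everything else, including symbols, spaces and non-ASCII letters.
    return "Complex"

for sample in ["123456", "qwerty", "abc123", "p@ssw0rd", "contraseña"]:
    print(sample, "->", classify(sample))
```

Note that passwords containing non-ASCII letters fall into "Complex" under these patterns, since the character classes only cover `A-Z`, `a-z` and `0-9`.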



## Pattern Filtering

In [16]:
# Filter passwords for specific patterns
pattern_123 = password_counts.filter(col("value").contains("123"))
pattern_password = password_counts.filter(col("value").contains("password"))
pattern_qwerty = password_counts.filter(col("value").contains("qwerty"))

# Show counts of specific patterns
print(f"Number of passwords containing '123': {pattern_123.count()}")
print(f"Number of passwords containing 'password': {pattern_password.count()}")
print(f"Number of passwords containing 'qwerty': {pattern_qwerty.count()}")

Number of passwords containing '123': 7795718
Number of passwords containing 'password': 56344
Number of passwords containing 'qwerty': 207964
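
The `contains` filter is case-sensitive, so a value such as `Password1` is not counted under `'password'`. The sketch below contrasts exact and case-folded substring matching in plain Python; the `passwords` list is a hypothetical sample, not taken from the dataset.

```python
# Hypothetical sample of distinct passwords.
passwords = ["Password1", "password", "QWERTY123", "abc123", "letmein"]
patterns = ["123", "password", "qwerty"]

# Case-sensitive matching, analogous to the filters above.
exact = {pat: sum(pat in p for p in passwords) for pat in patterns}
# Case-insensitive matching: lower-case each password before searching.
folded = {pat: sum(pat in p.lower() for p in passwords) for pat in patterns}

print(exact)
print(folded)
```

In PySpark, a similar effect could be obtained by applying `lower` from `pyspark.sql.functions` to the column before calling `contains`.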


## Entropy Analysis

Entropy is a measure of unpredictability or randomness. In the context of passwords, higher entropy means that the password is more random and thus harder to crack. The formula for entropy is:

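\[ H = -\sum_{x} P(x) \log_2 P(x) \]

where \( P(x) \) is the relative frequency of character \( x \) in the password, so \( H \) is measured in bits per character. This measure depends only on how varied the characters are, not on the password's length or on how predictable the sequence is.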
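
As a worked example of the formula, consider the string `password`. It has 8 characters, of which `s` appears twice and the other six characters appear once each, so

\[ H = -\left( 6 \cdot \tfrac{1}{8} \log_2 \tfrac{1}{8} + \tfrac{2}{8} \log_2 \tfrac{2}{8} \right) = 6 \cdot \tfrac{3}{8} + \tfrac{2}{8} \cdot 2 = 2.25 + 0.5 = 2.75 \]

A string made of a single repeated character, such as `111111`, has \( P(x) = 1 \) for its only character and therefore \( H = -1 \cdot \log_2 1 = 0 \).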

In [11]:
from pyspark.sql.functions import udf
from math import log2
from collections import Counter
from pyspark.sql.types import DoubleType

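def calculate_entropy(password):
    # Null or empty passwords carry no information
    if not password:
        return 0.0
    # Relative frequency of each distinct character in the password
    prob = [n / len(password) for n in Counter(password).values()]
    # Shannon entropy in bits per character (a single repeated character yields -0.0)
    return -sum(p * log2(p) for p in prob)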

entropy_udf = udf(calculate_entropy, DoubleType())

password_entropy = password_counts.withColumn("entropy", entropy_udf("value"))
password_entropy.show()

+----------+-------+-----------------+
|     value|  count|          entropy|
+----------+-------+-----------------+
|    123456|6690567|2.584962500721156|
| 123456789|2286809|3.169925001442312|
|    111111| 983775|             -0.0|
|  password| 964657|             2.75|
|    qwerty| 887545|2.584962500721156|
|    abc123| 860278|2.584962500721156|
|  12345678| 844370|              3.0|
| password1| 757901| 2.94770277922009|
|   1234567| 744492|2.807354922057604|
|    123123| 679338|1.584962500721156|
|1234567890| 673299|3.321928094887362|
|    000000| 665562|             -0.0|
|     12345| 599246|2.321928094887362|
|  iloveyou| 443963|             2.75|
|1q2w3e4r5t| 346500|3.321928094887362|
|      1234| 340034|              2.0|
|   123456a| 312049|2.807354922057604|
|qwertyuiop| 302333|3.321928094887362|
|    monkey| 290794|2.584962500721156|
|    123321| 286281|1.584962500721156|
+----------+-------+-----------------+
only showing top 20 rows



## Dictionary Word Analysis
This requires a list of common dictionary words. For simplicity, we'll consider a small list, but in a real-world scenario, you'd use a comprehensive dictionary.

In [18]:
dictionary_words = ["password", "admin", "welcome", "user", "letmein"]

# Keep only passwords that exactly match one of the dictionary words
dict_passwords = password_counts.filter(password_counts["value"].isin(dictionary_words))
dict_passwords.show()

+--------+------+
|   value| count|
+--------+------+
|password|964657|
| welcome| 61807|
| letmein| 54791|
|    user| 50949|
|   admin| 11820|
+--------+------+



## Levenshtein Distance Analysis
The Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another.

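Computing this for all pairs in a large dataset is computationally intensive, because the number of pairs grows quadratically with the number of rows. Even a 0.1% sample of the 384,153,427 distinct passwords is roughly 384,000 rows, and its cross join produces on the order of 10^11 pairs. I haven't run the cells below because of timeouts.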

In [None]:
from pyspark.sql.functions import levenshtein

# Sample a subset of the data for demonstration purposes
sample_df = password_counts.sample(withReplacement=False, fraction=0.001, seed=42)

# Cartesian product to get all pairs
cartesian_df = sample_df.alias("df1").crossJoin(sample_df.alias("df2"))

# Calculate Levenshtein distance
levenshtein_df = cartesian_df.select(
    col("df1.value").alias("password1"),
    col("df2.value").alias("password2"),
    levenshtein(col("df1.value"), col("df2.value")).alias("distance")
)
levenshtein_df.show()
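
To make the distance concrete, here is a minimal plain-Python sketch of the standard dynamic-programming computation. It is an illustration only, not Spark's implementation, and `levenshtein_distance` is a hypothetical helper name.

```python
def levenshtein_distance(a: str, b: str) -> int:
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]

print(levenshtein_distance("password", "password1"))
print(levenshtein_distance("kitten", "sitting"))
```

Each call costs time proportional to the product of the two string lengths, which is why applying it to every pair of a large sample becomes expensive.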

## N-gram Analysis
Identify common substrings (n-grams) in passwords.


In [None]:
from pyspark.ml.feature import NGram
from pyspark.sql.functions import explode, split

# Split each password into individual characters to create an array
# (depending on the Spark version, splitting on an empty pattern may leave a trailing empty string)
password_array = password_counts.withColumn("characters", split(password_counts["value"], ""))

# Now, apply NGram on this array column
ngram = NGram(n=3, inputCol="characters", outputCol="ngrams")

# Transform and explode to get individual n-grams
# (NGram joins the tokens of each n-gram with spaces, e.g. "1 2 3")
ngram_df = ngram.transform(password_array)
ngram_exploded = ngram_df.select(explode(ngram_df.ngrams).alias("ngram"))
ngram_counts = ngram_exploded.groupBy("ngram").count().orderBy("count", ascending=False)
ngram_counts.show()
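
Since the cell above was not run, the idea of character n-gram counting can be sketched in plain Python on a tiny hypothetical sample. `char_ngrams` is an illustrative helper, not part of PySpark.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> list:
    # All contiguous substrings of length n; empty if the text is shorter than n.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Hypothetical sample of distinct passwords.
passwords = ["123456", "abc123", "123123"]

# Count every trigram across all passwords.
counts = Counter(gram for p in passwords for gram in char_ngrams(p))
print(counts.most_common(1))
```

This counts each distinct password once. Weighting every trigram by the password's occurrence count would instead reflect how often the substring appears across the whole dataset.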
