# Creation of Main Dataset
The researchers obtained two datasets of common passwords, merged them, and derived features from each password.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_1 = pd.read_csv("common_passwords.csv")
df_2 = pd.read_csv("top_200_password_2020_by_country.csv")

## Merging Datasets

### First Dataset

In [3]:
df_1.head()

Unnamed: 0,password,length,num_chars,num_digits,num_upper,num_lower,num_special,num_vowels,num_syllables
0,123456,6,0,6,0,0,0,0,1
1,password,8,8,0,0,8,0,2,2
2,12345678,8,0,8,0,0,0,0,1
3,qwerty,6,6,0,0,6,0,1,3
4,123456789,9,0,9,0,0,0,0,1


The first dataset already included useful features, such as password length and the number of special characters. Because the researchers did not want to compute syllable counts for the passwords in the second dataset, they dropped the `num_syllables` column.

In [4]:
df_1.drop("num_syllables", axis=1, inplace=True)
df_1.head()

Unnamed: 0,password,length,num_chars,num_digits,num_upper,num_lower,num_special,num_vowels
0,123456,6,0,6,0,0,0,0
1,password,8,8,0,0,8,0,2
2,12345678,8,0,8,0,0,0,0
3,qwerty,6,6,0,0,6,0,1
4,123456789,9,0,9,0,0,0,0


### Second Dataset

In [5]:
df_2.head()

Unnamed: 0,country_code,country,Rank,Password,User_count,Time_to_crack,Global_rank,Time_to_crack_in_seconds
0,au,Australia,1,123456,308483,< 1 second,1.0,0
1,au,Australia,2,password,191880,< 1 second,5.0,0
2,au,Australia,3,lizottes,98220,3 Hours,,10800
3,au,Australia,4,password1,86884,< 1 second,16.0,0
4,au,Australia,5,123456789,75856,< 1 second,2.0,0


The second dataset shared nothing with the first except the passwords themselves, so the researchers dropped every column except `Password`. They then derived the same features that the first dataset provides.

In [6]:
import re

In [7]:
# Keep only the password column, renamed to match the first dataset
df_2 = pd.DataFrame({
    "password": df_2["Password"]
})
df_2["length"] = df_2["password"].apply(len)
# Count ASCII letters, digits, uppercase letters, and lowercase letters
df_2["num_chars"] = df_2["password"].apply(lambda x: len(re.findall("[A-Za-z]", x)))
df_2["num_digits"] = df_2["password"].apply(lambda x: len(re.findall("[0-9]", x)))
df_2["num_upper"] = df_2["password"].apply(lambda x: len(re.findall("[A-Z]", x)))
df_2["num_lower"] = df_2["password"].apply(lambda x: len(re.findall("[a-z]", x)))
# Anything that is neither an ASCII letter nor a digit counts as special
df_2["num_special"] = df_2["length"] - df_2["num_chars"] - df_2["num_digits"]
# Only lowercase vowels are matched; uppercase vowels are not counted
df_2["num_vowels"] = df_2["password"].apply(lambda x: len(re.findall("[aeiou]", x)))

df_2.head()

Unnamed: 0,password,length,num_chars,num_digits,num_upper,num_lower,num_special,num_vowels
0,123456,6,0,6,0,0,0,0
1,password,8,8,0,0,8,0,2
2,lizottes,8,8,0,0,8,0,3
3,password1,9,8,1,0,8,0,2
4,123456789,9,0,9,0,0,0,0
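
The per-column lambdas above can also be written as one helper that returns every feature for a single password, which makes the definitions easier to test on their own. The following is a minimal sketch, not the researchers' code: the name `password_features` and the sample password `Qwerty!23` are hypothetical, and the vowel rule (lowercase only) is an assumption copied from the `[aeiou]` pattern in the cell above.

```python
import re

import pandas as pd

VOWELS = "aeiou"  # assumption: lowercase only, mirroring the "[aeiou]" pattern


def password_features(password: str) -> dict:
    """Return the count features for one password (hypothetical helper)."""
    length = len(password)
    num_upper = len(re.findall("[A-Z]", password))
    num_lower = len(re.findall("[a-z]", password))
    num_digits = len(re.findall("[0-9]", password))
    num_chars = num_upper + num_lower
    return {
        "password": password,
        "length": length,
        "num_chars": num_chars,
        "num_digits": num_digits,
        "num_upper": num_upper,
        "num_lower": num_lower,
        # Whatever is neither an ASCII letter nor a digit is treated as special.
        "num_special": length - num_chars - num_digits,
        "num_vowels": sum(ch in VOWELS for ch in password),
    }


# Build a small frame from a list of passwords (illustrative values only).
sample = pd.DataFrame([password_features(p) for p in ["password1", "Qwerty!23"]])
print(sample)
```

A frame built this way has the same column order as the first dataset after `num_syllables` is dropped, which is what the later concatenation relies on.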


### Combining Both Datasets

In [8]:
df = pd.concat([df_1, df_2], axis=0)
(len(df_1), len(df_2), len(df))

(10000, 9800, 19800)

The researchers checked whether the merged dataset contained more unique passwords than either source dataset.

In [9]:
print("Length of first dataset: {}".format(len(df_1)))
print("Length of second dataset: {}".format(len(df_2)))
print("Number of unique passwords in merged dataset: {}".format(len(df.drop_duplicates())))

Length of first dataset: 10000
Length of second dataset: 9800
Number of unique passwords in merged dataset: 12816
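
The count above comes from `drop_duplicates()` on whole rows, so it equals the number of unique passwords only if both datasets assign identical feature values to the same password. The sketch below illustrates the difference on toy data. The frames `toy_1` and `toy_2` and their values are hypothetical stand-ins, and `ignore_index=True` is an optional choice that the original cell does not make.

```python
import pandas as pd

# Toy stand-ins for df_1 and df_2 (hypothetical values, not the real data).
toy_1 = pd.DataFrame({"password": ["123456", "qwerty"], "length": [6, 6]})
toy_2 = pd.DataFrame({"password": ["123456", "lizottes"], "length": [6, 8]})

# Both frames should expose the same columns before they are stacked.
assert list(toy_1.columns) == list(toy_2.columns)

# ignore_index=True gives the stacked frame a fresh 0..n-1 index
# instead of repeating each input's row labels.
toy = pd.concat([toy_1, toy_2], axis=0, ignore_index=True)

# Duplicates judged on every column.
unique_rows = len(toy.drop_duplicates())
# Duplicates judged on the password column only.
unique_passwords = toy["password"].nunique()

print(len(toy), unique_rows, unique_passwords)
```

When the two counts differ on real data, the same password appears with different feature values, which is worth inspecting before deduplicating.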


The merged dataset contained 12,816 unique rows, more than either source dataset alone (10,000 and 9,800), so the researchers dropped the duplicate rows and used the result for their study.

In [10]:
df = df.drop_duplicates()
df.head()

Unnamed: 0,password,length,num_chars,num_digits,num_upper,num_lower,num_special,num_vowels
0,123456,6,0,6,0,0,0,0
1,password,8,8,0,0,8,0,2
2,12345678,8,0,8,0,0,0,0
3,qwerty,6,6,0,0,6,0,1
4,123456789,9,0,9,0,0,0,0


## More Columns
The researchers added entropy and password-strength columns to their data using the [`password_strength`](https://pypi.org/project/password-strength/) library.

In [11]:
from password_strength import PasswordStats
import math

In [12]:
df["bits_of_entropy"] = df["password"].apply(lambda x: PasswordStats(x).entropy_bits)
df["strength"] = df["password"].apply(lambda x: PasswordStats(x).strength())

In [13]:
# df.query('strength == strength.max()')
df.head()

Unnamed: 0,password,length,num_chars,num_digits,num_upper,num_lower,num_special,num_vowels,bits_of_entropy,strength
0,123456,6,0,6,0,0,0,0,15.509775,0.172331
1,password,8,8,0,0,8,0,2,22.458839,0.249543
2,12345678,8,0,8,0,0,0,0,24.0,0.266667
3,qwerty,6,6,0,0,6,0,1,15.509775,0.172331
4,123456789,9,0,9,0,0,0,0,28.529325,0.316992
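
The values in the table above are consistent with a simple rule: entropy equals the password length times the base-2 logarithm of the number of distinct characters, and strength equals that entropy divided by 90. The sketch below reproduces the displayed numbers under that assumption. The rule is inferred from the five rows shown rather than taken from the library's documentation, the function names are hypothetical, and the library may scale strength differently for higher-entropy passwords.

```python
import math


def approx_entropy_bits(password: str) -> float:
    """length * log2(distinct characters); a rule inferred from the table above."""
    if not password:
        return 0.0
    return len(password) * math.log2(len(set(password)))


def approx_strength(password: str) -> float:
    """Entropy divided by 90; only checked against the five rows shown above."""
    return approx_entropy_bits(password) / 90


for pw in ["123456", "password", "12345678"]:
    print(pw, round(approx_entropy_bits(pw), 6), round(approx_strength(pw), 6))
```

Under this rule, `123456` and `qwerty` receive the same score because both have six characters, all distinct, which matches the identical values in rows 0 and 3.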
