## Dataset and Problem Introduction

In this analysis, we explore password regulations and implement recommended password checks.

<p>The notebook goes through the rules in <a href="https://pages.nist.gov/800-63-3/sp800-63b.html">NIST Special Publication 800-63B</a> which details what checks a <em>verifier</em> should perform to make sure users don't pick bad passwords. The passwords used are from users of a fictional company and we use Python to flag the bad passwords. </p>

<b>Data Sources:</b> 
* https://github.com/danielmiessler/SecLists/tree/master/Passwords
* https://github.com/first20hours/google-10000-english

<br>Reference: https://www.datacamp.com/

In [21]:
# Importing the pandas module
import pandas as pd

# Loading in datasets/users.csv 
users = pd.read_csv("datasets/users.csv")

# Printing out how many users we've got
print(len(users))

# Taking a look at the first 12 users
users.head(12)

982


Unnamed: 0,id,user_name,password
0,1,vance.jennings,joobheco
1,2,consuelo.eaton,0869347314
2,3,mitchel.perkins,fabypotter
3,4,odessa.vaughan,aharney88
4,5,araceli.wilder,acecdn3000
5,6,shawn.harrington,5278049
6,7,evelyn.gay,master
7,8,noreen.hale,murphy
8,9,gladys.ward,lwsves2
9,10,brant.zimmerman,1190KAREN5572497


## Passwords shouldn't be too short
<p>If we take a look at the first 12 users above, we already see some bad passwords. The first thing we should check according to the NIST Special Publication 800-63B is:</p>
<blockquote>
  <p>Verifiers SHALL require subscriber-chosen memorized secrets to be at least 8 characters in length.</p>
</blockquote>
<p>So passwords of users shouldn't be too short.</p>

In [22]:
# Calculating the lengths of users' passwords
users['length'] = users['password'].str.len() 

# Flagging the users with too short passwords
users['too_short'] = users['length'] < 8

# Counting and printing the number of users with too short passwords
print(users['too_short'].sum())

# Taking a look at the first 12 rows
users.head(12)

376


Unnamed: 0,id,user_name,password,length,too_short
0,1,vance.jennings,joobheco,8,False
1,2,consuelo.eaton,0869347314,10,False
2,3,mitchel.perkins,fabypotter,10,False
3,4,odessa.vaughan,aharney88,9,False
4,5,araceli.wilder,acecdn3000,10,False
5,6,shawn.harrington,5278049,7,True
6,7,evelyn.gay,master,6,True
7,8,noreen.hale,murphy,6,True
8,9,gladys.ward,lwsves2,7,True
9,10,brant.zimmerman,1190KAREN5572497,16,False
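
NIST SP 800-63B counts each Unicode code point as one character for these length requirements, which is also how Python's `len` and pandas' `.str.len()` count. The sketch below is a minimal illustration on made-up passwords and is not part of the original analysis; `MIN_LENGTH` and `sample` are names assumed here. It also treats a missing password as too short, which is a design choice rather than something the guideline spells out.

```python
import pandas as pd

MIN_LENGTH = 8  # minimum length from the NIST rule quoted above

# Made-up passwords for illustration; None stands in for a missing value.
sample = pd.Series(["master", "joobheco", "пароль12", None])

# Treat a missing password as empty so that it is flagged rather than skipped.
lengths = sample.fillna("").str.len()
too_short = lengths < MIN_LENGTH

print(too_short.tolist())
```

Without the `fillna("")` step, a missing password has no length to compare against 8, so it may not be flagged as too short.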


## Common passwords
<blockquote>
  <p>Verifiers SHALL compare the prospective secrets against a list that contains values known to be commonly-used, expected, or compromised.</p>
  <ul>
  <li>Passwords obtained from previous breach corpuses.</li>
  <li>Dictionary words.</li>
  <li>Repetitive or sequential characters (e.g. ‘aaaaaa’, ‘1234abcd’).</li>
  <li>Context-specific words, such as the name of the service, the username, and derivatives thereof.</li>
  </ul>
</blockquote>
<p>Because many websites don't follow the NIST guidelines and don't encrypt the passwords they store, large lists of the most popular passwords can be found online. To implement this check, let's start by loading the 10,000 most common passwords from the <a href="https://github.com/danielmiessler/SecLists/tree/master/Passwords">SecLists repository</a>.</p>

In [23]:
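# Reading in the top 10000 passwords
# (read_csv's squeeze=True was removed in pandas 2.0, so we squeeze the
# one-column DataFrame into a Series instead)
common_passwords = pd.read_csv("datasets/10_million_password_list_top_10000.txt",
                               header=None).squeeze("columns")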

# Taking a look at the top 20
common_passwords.head(20)

0        123456
1      password
2      12345678
3        qwerty
4     123456789
5         12345
6          1234
7        111111
8       1234567
9        dragon
10       123123
11     baseball
12       abc123
13     football
14       monkey
15      letmein
16       696969
17       shadow
18       master
19       666666
Name: 0, dtype: object

The list of passwords is ordered with the most common passwords first, so we shouldn't be surprised to see passwords like <code>123456</code> and <code>qwerty</code> above. 
    
As hackers also have access to this list of common passwords, it's important that none of our users use these passwords. Let's flag all the passwords in our user database that are among the top 10,000 used passwords.

In [24]:
# Flagging the users with passwords that are common passwords
users['common_password'] = users['password'].isin(common_passwords) 

# Counting and printing the number of users using common passwords
print(users['common_password'].sum())

# Taking a look at the first 12 rows
users.head(12)

129


Unnamed: 0,id,user_name,password,length,too_short,common_password
0,1,vance.jennings,joobheco,8,False,False
1,2,consuelo.eaton,0869347314,10,False,False
2,3,mitchel.perkins,fabypotter,10,False,False
3,4,odessa.vaughan,aharney88,9,False,False
4,5,araceli.wilder,acecdn3000,10,False,False
5,6,shawn.harrington,5278049,7,True,False
6,7,evelyn.gay,master,6,True,True
7,8,noreen.hale,murphy,6,True,True
8,9,gladys.ward,lwsves2,7,True,False
9,10,brant.zimmerman,1190KAREN5572497,16,False,False
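
One caveat when loading a password list with `pd.read_csv`: by default, entries such as `null` or `NA` are read as missing values, and a file containing only digits would be read as numbers. The sketch below shows one way to guard against both on a toy in-memory list; `toy_file`, `common_set`, and `is_common` are hypothetical names, and the toy entries are made up. It assumes one password per line with no commas or quote characters, which `read_csv` would otherwise interpret.

```python
import io
import pandas as pd

# Toy stand-in for the password list file (assumption: one password per line).
toy_file = io.StringIO("123456\npassword\nnull\nqwerty\n")

common = pd.read_csv(
    toy_file,
    header=None,
    dtype=str,              # keep digit-only passwords as strings
    keep_default_na=False,  # keep entries such as "null" instead of NaN
).squeeze("columns")

# A set gives fast membership tests when checking one password at a time.
common_set = set(common)

def is_common(password: str) -> bool:
    """Return True if the password appears verbatim in the toy list."""
    return password in common_set

print(is_common("null"), is_common("joobheco"))
```

The same options could be passed to the `read_csv` call on the real file. Whether that changes the count above depends on the list's contents, which we have not checked here.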


## Passwords shouldn't be common words
<blockquote>
  <p>Verifiers SHALL compare the prospective secrets against a list that contains […] dictionary words.</p>
</blockquote>
<p>This follows the same logic as before: It is easy for hackers to check users' passwords against common English words and therefore common English words make bad passwords. Let's check our users' passwords against the top 10,000 English words from <a href="https://github.com/first20hours/google-10000-english">Google's Trillion Word Corpus</a>.</p>

In [25]:
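# Reading in a list of the 10000 most common words
# (read_csv's squeeze=True was removed in pandas 2.0, so we squeeze the
# one-column DataFrame into a Series instead)
words = pd.read_csv("datasets/google-10000-english.txt",
                    header=None).squeeze("columns")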

# Flagging the users with passwords that are common words
users['common_word'] = users['password'].str.lower().isin(words) 

# Counting and printing the number of users using common words as passwords
print(users['common_word'].sum())

# Taking a look at the first 12 rows
users.head(12)

137


Unnamed: 0,id,user_name,password,length,too_short,common_password,common_word
0,1,vance.jennings,joobheco,8,False,False,False
1,2,consuelo.eaton,0869347314,10,False,False,False
2,3,mitchel.perkins,fabypotter,10,False,False,False
3,4,odessa.vaughan,aharney88,9,False,False,False
4,5,araceli.wilder,acecdn3000,10,False,False,False
5,6,shawn.harrington,5278049,7,True,False,False
6,7,evelyn.gay,master,6,True,True,True
7,8,noreen.hale,murphy,6,True,True,True
8,9,gladys.ward,lwsves2,7,True,False,False
9,10,brant.zimmerman,1190KAREN5572497,16,False,False,False
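
Note that this check is an exact, case-insensitive match: a password is flagged only if the whole password, lowercased, is in the word list. The toy example below uses made-up words and passwords to show what that does and does not catch; `toy_words` and `toy_passwords` are names assumed for illustration.

```python
import pandas as pd

# Made-up stand-ins for the word list and the users' passwords.
toy_words = pd.Series(["master", "murphy", "chelsea"])
toy_passwords = pd.Series(["Master", "MURPHY", "master1", "joobheco"])

# Lowercase first so that "Master" and "MURPHY" match the lowercase word list.
is_word = toy_passwords.str.lower().isin(toy_words)

print(is_word.tolist())
```

A password such as `master1` slips through because it is not identical to any word, so this flag should be read as a lower bound on dictionary-based passwords.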


## Passwords shouldn't be your name
<blockquote>
  <p>Verifiers SHALL compare the prospective secrets against a list that contains […] context-specific words, such as the name of the service, the username, and derivatives thereof.</p>
</blockquote>
<p>There are many things we could check here, but for now let's just flag passwords that are the same as either a user's first or last name.</p>

In [26]:
# Extracting first and last names into their own columns
users['first_name'] = users['user_name'].str.extract(r'(^\w+)', expand = False)
users['last_name'] = users['user_name'].str.extract(r'(\w+$)', expand = False)

# Flagging the users with passwords that match their names
users['uses_name'] = (
    (users['password'].str.lower() == users['first_name']) |
    (users['password'].str.lower() == users['last_name']))

# Counting and printing the number of users using names as passwords
print(users['uses_name'].sum())

# Taking a look at the first 12 rows
users.head(12)

Unnamed: 0,id,user_name,password,length,too_short,common_password,common_word,first_name,last_name,uses_name
0,1,vance.jennings,joobheco,8,False,False,False,vance,jennings,False
1,2,consuelo.eaton,0869347314,10,False,False,False,consuelo,eaton,False
2,3,mitchel.perkins,fabypotter,10,False,False,False,mitchel,perkins,False
3,4,odessa.vaughan,aharney88,9,False,False,False,odessa,vaughan,False
4,5,araceli.wilder,acecdn3000,10,False,False,False,araceli,wilder,False
5,6,shawn.harrington,5278049,7,True,False,False,shawn,harrington,False
6,7,evelyn.gay,master,6,True,True,True,evelyn,gay,False
7,8,noreen.hale,murphy,6,True,True,True,noreen,hale,False
8,9,gladys.ward,lwsves2,7,True,False,False,gladys,ward,False
9,10,brant.zimmerman,1190KAREN5572497,16,False,False,False,brant,zimmerman,False
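
The NIST text also mentions "derivatives thereof", which the equality check above does not cover. As a possible extension, the sketch below flags passwords that merely contain the first or last name. It is an illustration on invented users, not part of the original analysis; `toy` and `contains_name` are hypothetical names.

```python
import pandas as pd

# Invented users for illustration only.
toy = pd.DataFrame({
    "user_name": ["jane.doe", "john.smith", "alex.kim"],
    "password": ["doe", "JohnS2019", "joobheco"],
})

# Same extraction as in the cell above.
toy["first_name"] = toy["user_name"].str.extract(r"(^\w+)", expand=False)
toy["last_name"] = toy["user_name"].str.extract(r"(\w+$)", expand=False)

def contains_name(row) -> bool:
    """Return True if the lowercased password contains either name."""
    password = row["password"].lower()
    return row["first_name"] in password or row["last_name"] in password

toy["contains_name"] = toy.apply(contains_name, axis=1)

print(toy["contains_name"].tolist())
```

A substring test like this is stricter than the equality check, and very short names may trigger false positives, so the trade-off would need to be weighed against the real data.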


## Passwords shouldn't be repetitive
<blockquote>
  <p>Verifiers SHALL compare the prospective secrets [so that they don't contain] repetitive or sequential characters (e.g. ‘aaaaaa’, ‘1234abcd’).</p>
</blockquote>
<p>Checking for <em>repetitiveness</em> can be arbitrarily complex, but here we'll do something simple: flag all passwords in which the same character appears four or more times in a row.</p>

In [27]:
# Flagging the users with passwords with >= 4 repeats
users['too_many_repeats'] = users['password'].str.contains(r'(.)\1\1\1')

# Taking a look at the users with too many repeats
users[users['too_many_repeats']]

Unnamed: 0,id,user_name,password,length,too_short,common_password,common_word,first_name,last_name,uses_name,too_many_repeats
146,147,patti.dixon,555555,6,True,True,False,patti,dixon,False,True
572,573,cornelia.bradley,555555,6,True,True,False,cornelia,bradley,False,True
644,645,essie.lopez,11111,5,True,True,False,essie,lopez,False,True
798,799,charley.key,888888,6,True,True,False,charley,key,False,True
807,808,thurman.osborne,rinnnng0,8,False,False,False,thurman,osborne,False,True
941,942,mitch.ferguson,aaaaaa,6,True,True,False,mitch,ferguson,False,True


## Final flag
<p>Now that we have implemented all the basic tests suggested by NIST Special Publication 800-63B, we can flag the bad passwords.</p>

In [28]:
# Flagging all passwords that are bad
users['bad_password'] = ( 
    users['too_short'] | 
    users['common_password'] |
    users['common_word'] |
    users['uses_name'] |
    users['too_many_repeats'] )

# Counting and printing the number of bad passwords
print(users['bad_password'].sum())

# Looking at the first 25 bad passwords
users.loc[users['bad_password'], 'password'].head(25)

424


5       5278049
6        master
7        murphy
8       lwsves2
11      hubbard
13       310356
15      oZ4k0QE
16      chelsea
17      zvc1939
18       nickgd
21     cocacola
22      woodard
25        AJ9Da
26       ewokzs
28      YyGjz8E
30         reid
34      jOYZBs8
38      wwewwf1
43       225377
45       NdZ7E6
47        CQB3Z
48        diffo
51    123456789
52      y8uM7D6
56      mikeloo
Name: password, dtype: object
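
To see which rules contribute most, the individual flags can be summarized side by side. The sketch below does this on a small made-up table of flags rather than on the real `users` data; `flags` is a name assumed for illustration, and only three of the five flag columns are included to keep it short.

```python
import pandas as pd

# Made-up flags for three users, for illustration only.
flags = pd.DataFrame({
    "too_short": [True, False, False],
    "common_password": [True, False, False],
    "too_many_repeats": [False, True, False],
})

# A password is bad if any rule flags it (same result as chaining with |).
bad_password = flags.any(axis=1)

# How many users each rule flags, and how many rules each user breaks.
per_rule = flags.sum()
rules_broken = flags.sum(axis=1)

print(bad_password.tolist())
print(per_rule.to_dict())
print(rules_broken.tolist())
```

Because one password can break several rules, the per-rule counts add up to more than the number of bad passwords, as they do in the analysis above.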