## Introduction
<p><img src="https://i.imgur.com/kjWF1So.jpg" alt="Different characters on a computer screen"></p>
<p>According to a 2019 <a href="https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/PasswordCheckup-HarrisPoll-InfographicFINAL.pdf">Google / Harris Poll</a>, 24% of Americans have used common passwords, like <code>abc123</code>, <code>Password</code>, and <code>Admin</code>. Even more concerning, 59% of Americans have incorporated personal information, such as their name or birthday, into their password. This makes it unsurprising that 4 in 10 Americans have had their personal information compromised online. Passwords built from commonly used phrases and personal information are drastically easier to crack.</p>
<p>You may have noticed over the years that password requirements have increased in complexity, including recommendations to change your passwords every couple of months. Compiled from industry recommendations, below is a list of password requirements you will be asked to test:</p>
<p><strong>Password Requirements:</strong></p>
<ol>
<li>Must be at least 10 characters in length</li>
<li>Must contain at least:<ul>
<li>one lower case letter </li>
<li>one upper case letter </li>
<li>one numeric character </li>
<li>one non-alphanumeric character</li></ul></li>
<li>Must not contain the phrase <code>password</code> (case insensitive)</li>
<li>Must not contain the user's first or last name, e.g., if the user's name is <code>John Smith</code>, then <code>SmItH876!</code> is not a valid password.</li>
</ol>
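
<p>As an illustration of how the requirements above translate into individual checks, here is a minimal sketch in plain Python. The helper name <code>check_rules</code> and its rule labels are hypothetical and not part of the project. It also assumes that any character that is neither a letter nor a digit counts as non-alphanumeric, and it relies on Python's Unicode-aware string methods.</p>

```python
def check_rules(password: str, first: str, last: str) -> dict:
    """Return a rule-name -> bool mapping for one password (hypothetical helper)."""
    lowered = password.lower()
    return {
        "length": len(password) >= 10,
        "lower": any(c.islower() for c in password),
        "upper": any(c.isupper() for c in password),
        "digit": any(c.isdigit() for c in password),
        # assumption: anything that is not a letter or digit counts, including "_"
        "symbol": any(not c.isalnum() for c in password),
        "no_phrase": "password" not in lowered,
        "no_name": first.lower() not in lowered and last.lower() not in lowered,
    }


# The example from requirement 4: John Smith with the password SmItH876!
report = check_rules("SmItH876!", "John", "Smith")
failed = [rule for rule, ok in report.items() if not ok]
print(failed)  # ['length', 'no_name']
```

<p>Returning one flag per rule, rather than a single boolean, makes it easy to see which requirement a given password violates.</p>
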
<p>Here is the dataset that you will investigate in this project:</p>
<div style="background-color: #ebf4f7; color: #595959; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/logins.csv</b></div>
Each row represents a login credential. There are no missing values and you can consider the dataset "clean".
<ul>
    <li><b>id:</b> the user's unique ID.</li>
    <li><b>username:</b> the username with the format {firstname}.{lastname}.</li>
    <li><b>password:</b> the password that may or may not meet the requirements. <i>Note: passwords should never be saved in plaintext. Always store them as salted hashes when working with real, live passwords!</i></li>
</ul>
</div>
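
<p>The note about never saving plaintext passwords can be made concrete with a short sketch. It uses only the Python standard library (<code>hashlib.pbkdf2_hmac</code>, <code>os.urandom</code>, <code>hmac.compare_digest</code>). The function names and the iteration count are assumptions chosen for illustration, not a vetted production configuration.</p>

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumed work factor for illustration; tune for real systems


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Derive a salted PBKDF2-HMAC-SHA256 digest; returns (salt, digest)."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


# Store only the salt and the digest, never the password itself.
stored_salt, stored_digest = hash_password("aA1.bB2,cC3!")
print(verify_password("aA1.bB2,cC3!", stored_salt, stored_digest))  # True
print(verify_password("wrong-guess", stored_salt, stored_digest))   # False
```
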
<p>Warning: This dataset contains some <strong>real</strong> passwords leaked from <strong>real</strong> websites. These passwords have been filtered, but may still include words that are explicit and offensive.</p>
<p>From here on out, it will be your task to explore and manipulate the existing data until you can answer the two questions described in the instructions panel. Feel free to import as many packages as you need to complete your task, and add cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at the answer!</p>
<p><strong>Note:</strong> To complete this project, you need to know how to manipulate strings in pandas DataFrames and be familiar with regular expressions. Before starting this project we recommend that you have completed the following courses: <a href="https://learn.datacamp.com/courses/data-cleaning-in-python">Data Cleaning in Python</a> and <a href="https://learn.datacamp.com/courses/regular-expressions-in-python">Regular Expressions in Python</a>.</p>

In [34]:
import pandas as pd
import re

In [35]:
# load dataset
df = pd.read_csv('datasets/logins.csv')
print(f'shape: {df.shape}')
print(f'head:\n{df.head()}')

shape: (982, 3)
head:
   id         username               password
0   1   vance.jennings         vanceRules888!
1   2   consuelo.eaton  Mail_Pen%Scarlets.414
2   3  mitchel.perkins               Z00+1960
3   4   odessa.vaughan              D-rockyou
4   5   araceli.wilder             Araceli}r3


In [36]:
# check that every username consists of exactly two parts separated by "."
df.username.str.split('.').str.len().value_counts()

2    982
Name: username, dtype: int64
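
Since every username splits into exactly two parts, the first and last names could also be materialized as columns up front. The sketch below uses a small stand-in frame built from two usernames shown in the head output; the column names `first_name` and `last_name` are assumptions, not columns of the dataset.

```python
import pandas as pd

# Stand-in data; the real project reads datasets/logins.csv instead.
users = pd.DataFrame({"username": ["vance.jennings", "consuelo.eaton"]})

# expand=True returns one column per part instead of a list per row
names = users["username"].str.split(".", expand=True)
names.columns = ["first_name", "last_name"]  # hypothetical column names

users = users.join(names)
print(users)
```

Having the name parts as columns allows vectorized string checks later, instead of splitting inside a row-wise function.
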

In [37]:
def validate(row: pd.Series) -> bool:
    """Password validator function. Returns True if password:
        - is at least 10 characters in length
        - contains at least one lower case letter
        - contains at least one upper case letter
        - contains at least one numeric character
        - contains at least one non-alphanumeric character
        - does not contain the phrase 'password' (case insensitive)
        - does not contain the user's first or last name
    """
    # extract parts
    fname, lname = row.username.lower().split('.')
    pwd = row.password
    
    result = True
    
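    # check length, lower case, upper case, numeric, non-alphanumeric
    # (adjacent string literals are concatenated, so no '+' is needed)
    pattern = re.compile(
        r'\A'
        r'(?=\D*\d)'            # at least one digit
        r'(?=[^a-z]*[a-z])'     # at least one lower case letter
        r'(?=[^A-Z]*[A-Z])'     # at least one upper case letter
        r'(?=[\w]*[\W])'        # at least one non-word character ('_' does not count)
        r'[\D\d]{10,}'          # at least 10 characters of any kind
        r'\Z'
    )
    result = result and (pattern.match(pwd) is not None)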
    
    # check password phrase
    result = result and ('password' not in pwd.lower())
    
    # check first name
    result = result and (fname not in pwd.lower())
    
    # check last name
    result = result and (lname not in pwd.lower())
    
    return result
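
One subtlety in the look-ahead `(?=[\w]*[\W])`: in Python's `re` module, `\w` includes the underscore, so `\W` never matches `_`. Whether an underscore should count as a non-alphanumeric character depends on how the requirement is read. The sketch below only demonstrates the difference between the two character classes; the variable names are illustrative.

```python
import re

word_based = re.compile(r'\W')           # anything that is not a word character
strict = re.compile(r'[^a-zA-Z0-9]')     # anything that is not an ASCII letter or digit

sample = 'aA1_bB2_cC3_'  # underscores are the only symbols in this password

print(bool(word_based.search(sample)))  # False: '_' is a word character
print(bool(strict.search(sample)))      # True: '_' is not a letter or digit
```

Passwords whose only symbol is an underscore are therefore rejected by the `\W`-based check but would be accepted by the stricter reading of "non-alphanumeric".
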
    
    

In [38]:
# test 'validate' function

# good password
assert validate(
    pd.Series({'username': 'qwer.tyui', 'password': 'aA1.bB2,cC3!'})
) is True, '"good password" test failed'

# incorrect password length
assert validate(
    pd.Series({'username': 'qwer.tyui', 'password': 'aA1.bB2,'})
) is False, '"incorrect password length" test failed'

# no lower case letter
assert validate(
    pd.Series({'username': 'qwer.tyui', 'password': 'XA1.XB2,XC3!'})
) is False, '"no lower case letter" test failed'

# no upper case letter
assert validate(
    pd.Series({'username': 'qwer.tyui', 'password': 'ax1.bx2,cx3!'})
) is False, '"no upper case letter" test failed'

# no numeric character
assert validate(
    pd.Series({'username': 'qwer.tyui', 'password': 'aAx.bBx,cCx!'})
) is False, '"no numeric character" test failed'

# no non-alphanumeric character
assert validate(
    pd.Series({'username': 'qwer.tyui', 'password': 'aA1xbB2xcC3x'})
) is False, '"no non-alphanumeric character" test failed'

# contains password phrase
assert validate(
    pd.Series({'username': 'qwer.tyui', 'password': 'aA1.password,cC3!'})
) is False, '"contains password phrase" test failed'
assert validate(
    pd.Series({'username': 'qwer.tyui', 'password': 'aA1.Password,cC3!'})
) is False, '"contains password phrase" test failed'
assert validate(
    pd.Series({'username': 'qwer.tyui', 'password': 'aA1.paSsword,cC3!'})
) is False, '"contains password phrase" test failed'

# contains first name
assert validate(
    pd.Series({'username': 'qwer.tyui', 'password': 'qwer.bB2,cC3!'})
) is False, '"contains first name" test failed'
assert validate(
    pd.Series({'username': 'qwer.tyui', 'password': 'Qwer.bB2,cC3!'})
) is False, '"contains first name" test failed'
assert validate(
    pd.Series({'username': 'qwer.tyui', 'password': 'qWer.bB2,cC3!'})
) is False, '"contains first name" test failed'

# contains last name
assert validate(
    pd.Series({'username': 'qwer.tyui', 'password': 'tyui.bB2,cC3!'})
) is False, '"contains last name" test failed'
assert validate(
    pd.Series({'username': 'Qwer.tyui', 'password': 'Tyui.bB2,cC3!'})
) is False, '"contains last name" test failed'
assert validate(
    pd.Series({'username': 'qWer.tyui', 'password': 'tYui.bB2,cC3!'})
) is False, '"contains last name" test failed'

print('All tests passed')

All tests passed


In [39]:
# store the validation result for each row in a new column
df['result'] = df.apply(validate, axis=1)

In [40]:
# get validation result percentages
validation_results = df['result'].value_counts(normalize=True)
validation_results

False    0.749491
True     0.250509
Name: result, dtype: float64

### 1. What percentage of users have invalid passwords?

In [41]:
# save percentage of users having invalid passwords (Answer # 1)
bad_pass = float(round(validation_results[False], 2))
bad_pass

0.75
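
As a cross-check, the same share can be computed without `value_counts`, because the mean of a boolean Series is the fraction of `True` values. The sketch below uses a tiny stand-in Series rather than the real `result` column.

```python
import pandas as pd

# Stand-in for df['result']: one valid password out of four
result = pd.Series([True, False, False, False])

# Negate so that True marks an invalid password, then average
invalid_share = (~result).mean()
print(round(float(invalid_share), 2))  # 0.75
```
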

### 2. Which users need to change their passwords?

In [42]:
# save usernames of users whose passwords failed validation, sorted alphabetically
email_list = df[~df.result]['username'].sort_values()
email_list.head()

931    abdul.rowland
713     addie.cherry
857     adele.moreno
291     adeline.bush
663      adolfo.kane
Name: username, dtype: object