## Introduction
<p><img src="https://i.imgur.com/kjWF1So.jpg" alt="Different characters on a computer screen"></p>
<p>According to a 2019 <a href="https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/PasswordCheckup-HarrisPoll-InfographicFINAL.pdf">Google / Harris Poll</a>, 24% of Americans have used common passwords, like <code>abc123</code>, <code>Password</code>, and <code>Admin</code>. Even more concerning, 59% of Americans have incorporated personal information, such as their name or birthday, into their password. This makes it unsurprising that 4 in 10 Americans have had their personal information compromised online. Passwords built from commonly used phrases and personal information make cracking a password drastically easier.</p>
<p>You may have noticed over the years that password requirements have increased in complexity, including recommendations to change your passwords every couple of months. Compiled from industry recommendations, below is a list of password requirements you will be asked to test:</p>
<p><strong>Password Requirements:</strong></p>
<ol>
<li>Must be at least 10 characters in length</li>
<li>Must contain at least:<ul>
<li>one lower case letter </li>
<li>one upper case letter </li>
<li>one numeric character </li>
<li>one non-alphanumeric character</li></ul></li>
<li>Must not contain the phrase <code>password</code> (case insensitive)</li>
<li>Must not contain the user's first or last name, e.g., if the user's name is <code>John Smith</code>, then <code>SmItH876!</code> is not a valid password.</li>
</ol>
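<p>As a rough illustration, the four requirements above can be sketched as a plain-Python check on a single password. The function name <code>is_valid_password</code> and its parameters are hypothetical, and "non-alphanumeric" is assumed here to mean anything outside ASCII letters and digits; the notebook cells further down use pandas string methods instead.</p>

```python
import re


def is_valid_password(password: str, first_name: str, last_name: str) -> bool:
    """Return True if the password meets all four requirements (sketch)."""
    lowered = password.lower()
    return (
        # Rule 1: at least 10 characters
        len(password) >= 10
        # Rule 2: one lower case, one upper case, one digit, one other character
        and re.search(r"[a-z]", password) is not None
        and re.search(r"[A-Z]", password) is not None
        and re.search(r"[0-9]", password) is not None
        and re.search(r"[^a-zA-Z0-9]", password) is not None  # assumed definition
        # Rule 3: must not contain "password" (case insensitive)
        and "password" not in lowered
        # Rule 4: must not contain the first or last name (case insensitive)
        and first_name.lower() not in lowered
        and last_name.lower() not in lowered
    )


# The example from requirement 4: too short, and it contains "smith"
print(is_valid_password("SmItH876!", "John", "Smith"))
# A password that passes every rule for a user named consuelo.eaton
print(is_valid_password("Mail_Pen%Scarlets.414", "consuelo", "eaton"))
```

<p>The first call prints <code>False</code> and the second prints <code>True</code>.</p>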
<p>Here is the dataset that you will investigate in this project:</p>
<div style="background-color: #ebf4f7; color: #595959; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/logins.csv</b></div>
Each row represents a login credential. There are no missing values and you can consider the dataset "clean".
<ul>
    <li><b>id:</b> the user's unique ID.</li>
    <li><b>username:</b> the username with the format {firstname}.{lastname}.</li>
    <li><b>password:</b> the password that may or may not meet the requirements. <i>Note, passwords should never be saved in plaintext; always hash (and salt) them when working with real live passwords!</i></li>
</ul>
</div>
<p>Warning: This dataset contains some <strong>real</strong> passwords leaked from <strong>real</strong> websites. These passwords have been filtered, but may still include words that are explicit and offensive.</p>
<p>From here on out, it will be your task to explore and manipulate the existing data until you can answer the two questions described in the instructions panel. Feel free to import as many packages as you need to complete your task, and add cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at the answer!</p>
<p><strong>Note:</strong> To complete this project, you need to know how to manipulate strings in pandas DataFrames and be familiar with regular expressions. Before starting this project we recommend that you have completed the following courses: <a href="https://learn.datacamp.com/courses/data-cleaning-in-python">Data Cleaning in Python</a> and <a href="https://learn.datacamp.com/courses/regular-expressions-in-python">Regular Expressions in Python</a>.</p>

In [79]:
# Use this cell to begin your analysis, and add as many as you would like!
# import packages
import pandas as pd

# import data
df = pd.read_csv('datasets/logins.csv')

# check the df
df.head(5)

Unnamed: 0,id,username,password
0,1,vance.jennings,vanceRules888!
1,2,consuelo.eaton,Mail_Pen%Scarlets.414
2,3,mitchel.perkins,Z00+1960
3,4,odessa.vaughan,D-rockyou
4,5,araceli.wilder,Araceli}r3


In [80]:
# length of password
df['password_length'] = df['password'].str.len()

# flag passwords shorter than 10 characters
df['short_password'] = df['password_length'] < 10

# check the df
df.head(5)

Unnamed: 0,id,username,password,password_length,short_password
0,1,vance.jennings,vanceRules888!,14,False
1,2,consuelo.eaton,Mail_Pen%Scarlets.414,21,False
2,3,mitchel.perkins,Z00+1960,8,True
3,4,odessa.vaughan,D-rockyou,9,True
4,5,araceli.wilder,Araceli}r3,10,False


In [81]:
# check whether the password contains at least one of each:
#                      lower case letter
#                      upper case letter
#                      numeric character
#                      non-alphanumeric character
# (raw strings keep Python from treating \W as a string escape)
df['all_in'] = ((df['password'].str.contains(r'[a-z]')) &
                 (df['password'].str.contains(r'[A-Z]')) &
                 (df['password'].str.contains(r'[0-9]')) &
                 (df['password'].str.contains(r'\W')))

df.head()

Unnamed: 0,id,username,password,password_length,short_password,all_in
0,1,vance.jennings,vanceRules888!,14,False,True
1,2,consuelo.eaton,Mail_Pen%Scarlets.414,21,False,True
2,3,mitchel.perkins,Z00+1960,8,True,False
3,4,odessa.vaughan,D-rockyou,9,True,False
4,5,araceli.wilder,Araceli}r3,10,False,True
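
One detail of the character check is easy to miss: `\W` matches anything that is not a letter, a digit, or an underscore, so a password whose only special character is `_` does not satisfy it. The short comparison below uses two made-up passwords to show the difference between `\W` and an explicit `[^a-zA-Z0-9]` class. Whether the project intends an underscore to count as non-alphanumeric is not stated in the source, so treat this as an illustration rather than a correction.

```python
import pandas as pd

# Hypothetical sample passwords that differ only in their special character
samples = pd.Series(["Summer_2020x", "Summer-2020x"])

# \W excludes letters, digits, AND the underscore
non_word = samples.str.contains(r"\W")

# An explicit negated class treats the underscore as a special character
non_alnum = samples.str.contains(r"[^a-zA-Z0-9]")

print(pd.DataFrame({
    "password": samples,
    "non_word": non_word,
    "non_alnum": non_alnum,
}))
```

Only the second password passes the `\W` test, while both pass the explicit class.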


In [82]:
# password contains 'password' (case insensitive)
df['contain_password'] = df['password'].str.contains('password', case = False)

df.head()

Unnamed: 0,id,username,password,password_length,short_password,all_in,contain_password
0,1,vance.jennings,vanceRules888!,14,False,True,False
1,2,consuelo.eaton,Mail_Pen%Scarlets.414,21,False,True,False
2,3,mitchel.perkins,Z00+1960,8,True,False,False
3,4,odessa.vaughan,D-rockyou,9,True,False,False
4,5,araceli.wilder,Araceli}r3,10,False,True,False


In [83]:
# extract the user's first and last name from the username
df['first_name'] = df['username'].str.extract(r'(^\w+)', expand=False)
df['last_name'] = df['username'].str.extract(r'(\w+$)', expand=False)

# flag passwords that contain the user's first name
# (df.loc[index, ...] relies on the default 0..n-1 index from read_csv)
df['contain_first_name'] = False
for index, value in enumerate(df['first_name']):
    if value in df.loc[index, 'password'].lower():
        df.loc[index, 'contain_first_name'] = True

# flag passwords that contain the user's last name
df['contain_last_name'] = False
for index, value in enumerate(df['last_name']):
    if value in df.loc[index, 'password'].lower():
        df.loc[index, 'contain_last_name'] = True

# flag passwords that contain either name
df['contain_name'] = ((df['contain_first_name']) |
                      (df['contain_last_name']))

df.head()

Unnamed: 0,id,username,password,password_length,short_password,all_in,contain_password,first_name,last_name,contain_first_name,contain_last_name,contain_name
0,1,vance.jennings,vanceRules888!,14,False,True,False,vance,jennings,True,False,True
1,2,consuelo.eaton,Mail_Pen%Scarlets.414,21,False,True,False,consuelo,eaton,False,False,False
2,3,mitchel.perkins,Z00+1960,8,True,False,False,mitchel,perkins,False,False,False
3,4,odessa.vaughan,D-rockyou,9,True,False,False,odessa,vaughan,False,False,False
4,5,araceli.wilder,Araceli}r3,10,False,True,False,araceli,wilder,True,False,True
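
The two loops above can also be written without index lookups. The sketch below is one possible alternative, not the notebook's method: it splits the username on the dot (assuming every username follows the `{firstname}.{lastname}` format described in the dataset notes) and compares each name with the lowercased password row by row. The `toy` DataFrame is made up from three rows shown in the output above.

```python
import pandas as pd

# Toy data in the same shape as logins.csv (three rows copied from the output)
toy = pd.DataFrame({
    "username": ["vance.jennings", "consuelo.eaton", "araceli.wilder"],
    "password": ["vanceRules888!", "Mail_Pen%Scarlets.414", "Araceli}r3"],
})

# Split "{firstname}.{lastname}" into two columns
names = toy["username"].str.split(".", n=1, expand=True)
names.columns = ["first_name", "last_name"]

# Compare against the lowercased password, one row at a time
lowered = toy["password"].str.lower()
toy["contain_name"] = [
    first in pw or last in pw
    for first, last, pw in zip(names["first_name"], names["last_name"], lowered)
]

print(toy)
```

For these three rows the flags match the `contain_name` column above: `True`, `False`, `True`.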


In [84]:
# put everything together: a password is bad if it breaks any rule
df['bad_password'] = (
                      (df['short_password']) |
                      (~df['all_in']) |
                      (df['contain_password']) |
                      (df['contain_name'])
)

df = df.drop(['first_name', 'last_name', 'contain_first_name', 'contain_last_name'], axis = 1)
df.head()

Unnamed: 0,id,username,password,password_length,short_password,all_in,contain_password,contain_name,bad_password
0,1,vance.jennings,vanceRules888!,14,False,True,False,True,True
1,2,consuelo.eaton,Mail_Pen%Scarlets.414,21,False,True,False,False,False
2,3,mitchel.perkins,Z00+1960,8,True,False,False,False,True
3,4,odessa.vaughan,D-rockyou,9,True,False,False,False,True
4,5,araceli.wilder,Araceli}r3,10,False,True,False,True,True


In [85]:
# What percentage of users have invalid passwords?
bad_pass = round(df['bad_password'].sum()/df.shape[0], 2)

bad_pass


0.75
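
Note that the value above is a fraction between 0 and 1 rather than a number out of 100. Because `True` counts as 1 and `False` as 0, the mean of a boolean column gives the same share as dividing the sum by the row count. A minimal sketch with made-up flags:

```python
import pandas as pd

# Hypothetical flags: three of four passwords are invalid
flags = pd.Series([True, False, True, True])

share_from_sum = round(flags.sum() / flags.shape[0], 2)
share_from_mean = round(flags.mean(), 2)

print(share_from_sum, share_from_mean)
```

Both expressions give 0.75 for these four flags.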

In [86]:
# Which users need to change their passwords?
email_list = df[df['bad_password']].username.sort_values()

email_list

931           abdul.rowland
713            addie.cherry
857            adele.moreno
291            adeline.bush
663             adolfo.kane
775             adolfo.lara
51             ahmad.hopper
298              aida.combs
898           aisha.jenkins
471               al.dunlap
356            alana.franco
546         alberta.leblanc
306            alec.robbins
831    alejandra.stephenson
44          alejandro.burke
195        alejandro.nieves
483        alexander.thomas
920       alexandria.hinton
93        alexis.mccullough
219         alexis.reynolds
456          alfonso.weaver
366           alfonzo.johns
595          alisa.campbell
781             alisa.cohen
442             alison.neal
452          allan.marshall
338           alonzo.fowler
751           amado.bridges
207        amado.fitzgerald
543           amber.summers
               ...         
64              ursula.wood
664       valentin.castillo
551           valeria.curry
0            vance.jennings
731           vaness

In [87]:
# Importing the pandas module
import pandas as pd

# Loading in datasets/logins.csv
logins = pd.read_csv("datasets/logins.csv")

# Rule 1: Not too short
# Create a boolean variable
length_check = logins['password'].str.len() >= 10
# Separate using boolean indexing
valid_pws = logins[length_check]
bad_pws = logins[~length_check]

# Rule 2: All the types of characters
# Let's create a boolean index for each character requirement
# [ ] is used to indicate a set of characters
# e.g. [abc] will match 'a', 'b', or 'c'.
# We can use a-z to represent all lowercase chars between a and z
lcase = valid_pws['password'].str.contains(r'[a-z]')
ucase = valid_pws['password'].str.contains(r'[A-Z]')
# \W matches any character that is not a letter, digit, or underscore
special = valid_pws['password'].str.contains(r'\W')
# \d matches any decimal digit; for ASCII text this is equivalent to [0-9]
numeric = valid_pws['password'].str.contains(r'\d')
# A password needs to have all these as true 
# If any of these are false, we need it to return false
# In other words, all of these have to be true to return true
# We can use the & (and) operator
char_check = lcase & ucase & numeric & special
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
bad_pws = pd.concat([bad_pws, valid_pws[~char_check]], ignore_index=True)
valid_pws = valid_pws[char_check]

# Rule 3: Must not contain the phrase password (case insensitive)
banned_phrases = valid_pws['password'].str.contains('password', case=False) 
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
bad_pws = pd.concat([bad_pws, valid_pws[banned_phrases]], ignore_index=True)
valid_pws = valid_pws[~banned_phrases]

# Rule 4: Must not contain the user's first or last name
# Work on a copy so that adding columns does not raise SettingWithCopyWarning
valid_pws = valid_pws.copy()
# Extracting first and last names into their own columns
valid_pws['first_name'] = valid_pws['username'].str.extract(r'(^\w+)', expand=False)
valid_pws['last_name'] = valid_pws['username'].str.extract(r'(\w+$)', expand=False)
# Flag rows whose lowercased password contains either name
name_check = valid_pws.apply(
    lambda row: row['first_name'] in row['password'].lower()
    or row['last_name'] in row['password'].lower(),
    axis=1,
)
# Move the flagged rows to bad_pws (pd.concat replaces the removed DataFrame.append)
bad_pws = pd.concat([bad_pws, valid_pws[name_check]], ignore_index=True)
valid_pws = valid_pws[~name_check]

# Answering the questions
bad_pass = round(bad_pws.shape[0] / logins.shape[0], 2)
print("Percentage of users with invalid passwords", bad_pass)
email_list = bad_pws['username'].sort_values()
print(email_list)

Percentage of users with invalid passwords 0.75
405           abdul.rowland
309            addie.cherry
372            adele.moreno
517            adeline.bush
279             adolfo.kane
337             adolfo.lara
16             ahmad.hopper
122              aida.combs
700           aisha.jenkins
199               al.dunlap
147            alana.franco
593         alberta.leblanc
521            alec.robbins
671    alejandra.stephenson
434         alejandro.burke
482        alejandro.nieves
205        alexander.thomas
400       alexandria.hinton
453       alexis.mccullough
93          alexis.reynolds
568          alfonso.weaver
151           alfonzo.johns
611          alisa.campbell
342             alisa.cohen
567             alison.neal
190          allan.marshall
142           alonzo.fowler
652           amado.bridges
88         amado.fitzgerald
592           amber.summers
               ...         
20              ursula.wood
280       valentin.castillo
596           valeria.curry
