<a href="https://colab.research.google.com/github/anilaksu/AI-and-Data-Science-Codes/blob/Natural-Language-Processing/Data_Science_Project_2_Password_Strength_with_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Predict Password Strength using Natural Language Processing**


Anil Aksu

Personal e-mail: aaa293@cornell.edu

In a nutshell, we need to classify passwords into three strength categories:


1.   0: Weak
2.   1: Medium
3.   2: Strong



## Notebook Organization:
- **Business Understanding**
- **Data Collection**
- **Data Cleaning**
- **Data Analysis**
- **Feature Engineering**
- **Model Building**
- **Deployment**

The full solution to the project can be found at this [link](https://drive.google.com/drive/folders/1ET_ggkzxtvQ287RyzHB5mcGxNyEbQe48).

In [1]:
# Here we set our working directory in our google drive to access datasets externally
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/ColabNotebooks/Data Science Projects/Password Strength NLP Project

Mounted at /content/drive
/content/drive/MyDrive/ColabNotebooks/Data Science Projects/Password Strength NLP Project


#1. Data Collection

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

##1.1 Reading data from SQL Database

You can inspect the contents of the SQLite file online at this [link](https://www.sqliteonline.com).

In [7]:
# Here we query password data with SQLite
import sqlite3
conn = sqlite3.connect('password_data.sqlite')       # Here we establish a connection
df = pd.read_sql_query("SELECT * FROM Users", conn)   # Here we form a data table with SQL query
df.head()

Unnamed: 0,index,password,strength
0,0,zxe870819,1
1,1,xw46454nr23l,1
2,2,soporte13,1
3,3,accounts6000webhost.com,2
4,4,c443balg,1
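
The query pattern above can be tried without the project file. The following is a minimal sketch, assuming a throwaway in-memory database: the table name `Users` and the three rows are copied from the output above, while the reduced two-column schema (no `index` column) and the variable names `tables` and `sample_df` are illustrative only. It also wraps the connection in `contextlib.closing`, so the connection is released once the data is in memory.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Build a throwaway in-memory database that mimics the Users table
# (the three rows below are copied from the output shown above).
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE Users (password TEXT, strength INTEGER)")
    conn.executemany(
        "INSERT INTO Users (password, strength) VALUES (?, ?)",
        [("zxe870819", 1), ("soporte13", 1), ("accounts6000webhost.com", 2)],
    )
    conn.commit()

    # List the tables first, so a wrong table name is noticed early.
    tables = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type = 'table'", conn
    )["name"].tolist()

    # Same call as in the notebook; the connection is closed when the block exits.
    sample_df = pd.read_sql_query("SELECT * FROM Users", conn)

print(tables)
print(sample_df)
```

Querying `sqlite_master` is a quick way to confirm the table name before running `SELECT * FROM Users` against an unfamiliar file.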


#2. Data Cleaning

In [9]:
df.drop(['index'], axis = 1, inplace = True) # Here we drop index, which has no use in our analysis
df.head()

Unnamed: 0,password,strength
0,zxe870819,1
1,xw46454nr23l,1
2,soporte13,1
3,accounts6000webhost.com,2
4,c443balg,1


In [10]:
# Here we check duplicates
df.duplicated().sum()

0

In [11]:
# Here we check missing values
df.isnull().sum()

password    0
strength    0
dtype: int64

In [12]:
# Here we check data types
df.dtypes

password    object
strength     int64
dtype: object

In [13]:
# Here we check that strength only takes the values 0, 1 and 2
df['strength'].unique()

array([1, 2, 0])
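
If the label check should fail loudly rather than be read off by eye, it can be turned into an assertion. This is only a sketch on a toy frame: `toy`, `allowed` and `unexpected` are illustrative names, and the three passwords are taken from outputs elsewhere in this notebook.

```python
import pandas as pd

# Toy frame standing in for df; the real one is loaded from SQLite above.
toy = pd.DataFrame({
    "password": ["12345", "zxe870819", "hosting4meze!@#"],
    "strength": [0, 1, 2],
})

allowed = {0, 1, 2}  # 0: weak, 1: medium, 2: strong
unexpected = set(toy["strength"].unique().tolist()) - allowed

# An empty set means every label is one of the three expected classes.
assert not unexpected, f"unexpected strength labels: {unexpected}"
```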

#3. Data Analysis

##3.1 Semantic Analysis

Questions about character composition that may relate to password strength:
1.   How many passwords contain only numeric characters?
2.   How many passwords have all of their letters in upper case?
3.   How many passwords contain only alphanumeric characters?
4.   How many passwords are in title case?
5.   How many passwords contain at least one special character?

In [15]:
# Here we select the passwords that contain only numeric characters
df[df['password'].str.isnumeric()]

Unnamed: 0,password,strength
12280,943801,0
14992,12345,0
20958,147856,0
21671,140290,0
23269,123987,0
28569,1233214,0
31329,159456,0
32574,363761,0
37855,4524344,0
43648,5521597,0


In [16]:
df[df['password'].str.isnumeric()].shape

(26, 2)

In [17]:
# Here we count the passwords whose letters are all upper case (isupper ignores digits and symbols)
df[df['password'].str.isupper()].shape

(1506, 2)

In [21]:
# Here we select the passwords that contain only alphanumeric characters
df[df['password'].str.isalnum()]

Unnamed: 0,password,strength
0,zxe870819,1
1,xw46454nr23l,1
2,soporte13,1
4,c443balg,1
5,16623670p,1
...,...,...
99995,obejofi215,1
99996,fmiopvxb64,1
99997,czvrbun38,1
99998,mymyxe430,1


In [23]:
df[df['password'].str.isalnum()].shape

(97203, 2)

In [26]:
# Here we select the passwords that are in title case (istitle also accepts capitals separated by digits)
df[df['password'].str.istitle()]

Unnamed: 0,password,strength
64,Hisanthoshjasika0,2
242,Therockrockbottom72,2
338,1A2S3D4F,1
367,13269123A,1
526,Csicskarozsika1,2
...,...,...
99168,1053815198M,1
99192,Alfranx05122023,2
99375,Kensington1956,2
99590,V13000993J,1


In [33]:
df[df['password'].str.istitle()].shape

(932, 2)
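
Note that `str.isupper()` and `str.istitle()` only look at cased characters, so a password such as `1A2S3D4F` (which appears in the title-case output above) passes both checks even though half of it is digits. The sketch below compares the built-in checks with a stricter variant; `only_uppercase_letters` is a hypothetical helper, not part of the notebook, and `QWERTY` is a made-up example, while the other passwords come from the outputs above.

```python
# Passwords taken from the outputs above, plus "QWERTY" as a made-up example.
samples = ["1A2S3D4F", "13269123A", "Kensington1956", "QWERTY", "943801"]

def only_uppercase_letters(password):
    # Stricter than str.isupper(): every character must be an upper-case letter.
    return password.isalpha() and password.isupper()

for p in samples:
    print(p, p.isupper(), p.istitle(), only_uppercase_letters(p))
```

With pandas, the same strict condition could be written as `df['password'].str.isalpha() & df['password'].str.isupper()`.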

In [40]:
# Here we check whether a password contains at least one special character
import string           # Standard-library string constants
string.punctuation      # ASCII punctuation characters

def find_semantics(row):
  # Return True as soon as one character of the password is ASCII punctuation
  return any(char in string.punctuation for char in row)

In [38]:
df[df['password'].apply(find_semantics)]

Unnamed: 0,password,strength
3,accounts6000webhost.com,2
68,12463773800+,1
98,p.r.c.d.g.,1
145,cita-cita,1
180,karolina.susnina0U,2
...,...,...
99748,maiselis.com,1
99845,hosting4meze!@#,2
99954,semista_bakung15,2
99980,halflife2010!LEB,2


In [39]:
df[df['password'].apply(find_semantics)].shape

(2663, 2)
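
The five checks above can also be collected into a single summary. The following is a minimal sketch on a five-password sample rather than the full dataset: `sample`, `checks` and `summary` are illustrative names, `ABC123` is a made-up password, and the other four are taken from the outputs above.

```python
import string

import pandas as pd

sample = pd.Series(["12345", "ABC123", "zxe870819", "Kensington1956", "cita-cita"])

# One boolean mask per question from the list at the top of this section.
checks = {
    "numeric": sample.str.isnumeric(),
    "upper": sample.str.isupper(),
    "alnum": sample.str.isalnum(),
    "title": sample.str.istitle(),
    "special": sample.apply(lambda p: any(c in string.punctuation for c in p)),
}

# Summing a boolean column counts the passwords that satisfy the check.
summary = pd.DataFrame(checks).sum()
print(summary)
```

Applied to `df['password']` instead of `sample`, the same dictionary of masks would give all five counts in one table.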

#4. Feature Engineering

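The right features depend on the problem being solved. In our case, we define the following features, where each frequency is the fraction of the password's characters that belong to the given class:
1.   Length
2.   Lower-case frequency
3.   Upper-case frequency
4.   Digit frequency
5.   Special character frequency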



In [41]:
# Here we add Length as a column
df["length"] = df['password'].str.len()
df.head()

Unnamed: 0,password,strength,length
0,zxe870819,1,9
1,xw46454nr23l,1,12
2,soporte13,1,9
3,accounts6000webhost.com,2,23
4,c443balg,1,8


In [45]:
# Here we define the lower-case frequency: the fraction of characters that are lower case
def freq_lowercase(row):
  return len([char for char in row if char.islower()])/len(row)

In [46]:
# Here we define the upper-case frequency: the fraction of characters that are upper case
def freq_uppercase(row):
  return len([char for char in row if char.isupper()])/len(row)

In [47]:
# Here we define the digit frequency: the fraction of characters that are numeric
def freq_numerical_case(row):
  return len([char for char in row if char.isnumeric()])/len(row)

In [48]:
# Here we apply the functions above
df['lower_freq'] = np.round(df["password"].apply(freq_lowercase), 3)
df['upper_freq'] = np.round(df["password"].apply(freq_uppercase), 3)
df['digit_freq'] = np.round(df["password"].apply(freq_numerical_case), 3)
df.head()

Unnamed: 0,password,strength,length,lower_freq,upper_freq,digit_freq
0,zxe870819,1,9,0.333,0.0,0.667
1,xw46454nr23l,1,12,0.417,0.0,0.583
2,soporte13,1,9,0.778,0.0,0.222
3,accounts6000webhost.com,2,23,0.783,0.0,0.174
4,c443balg,1,8,0.625,0.0,0.375
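
As a cross-check of the table above, the features can be recomputed for a single password in plain Python. This is a sketch, not the notebook's own code: `char_fraction` and `password_features` are hypothetical helpers, and the `special_freq` definition (fraction of characters found in `string.punctuation`) is an assumption, since that feature is not computed in the cells shown so far.

```python
import string

def char_fraction(password, predicate):
    # Fraction of characters in `password` for which `predicate` is True.
    return round(sum(predicate(c) for c in password) / len(password), 3)

def password_features(password):
    return {
        "length": len(password),
        "lower_freq": char_fraction(password, str.islower),
        "upper_freq": char_fraction(password, str.isupper),
        "digit_freq": char_fraction(password, str.isnumeric),
        # Assumed definition, not taken from the notebook cells shown here.
        "special_freq": char_fraction(password, lambda c: c in string.punctuation),
    }

print(password_features("accounts6000webhost.com"))
print(password_features("zxe870819"))
```

For `accounts6000webhost.com`, the length, lower-case and digit values agree with row 3 of the table (23, 0.783, 0.174). Like the functions above, this sketch divides by the password length, so an empty password would raise a `ZeroDivisionError`.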
