In this notebook we are going to implement a password strength checker.

In [1]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cpu'

In [3]:
# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0.
# on_bad_lines='warn' skips malformed rows and reports each skipped line.
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Datasets/Password.csv', on_bad_lines='warn')

Skipping line 2810: expected 2 fields, saw 5
Skipping line 4641: expected 2 fields, saw 5
Skipping line 7171: expected 2 fields, saw 5
Skipping line 11220: expected 2 fields, saw 5
Skipping line 13809: expected 2 fields, saw 5
Skipping line 14132: expected 2 fields, saw 5
Skipping line 14293: expected 2 fields, saw 5
Skipping line 14865: expected 2 fields, saw 5
Skipping line 17419: expected 2 fields, saw 5
Skipping line 22801: expected 2 fields, saw 5
Skipping line 25001: expected 2 fields, saw 5
Skipping line 26603: expected 2 fields, saw 5
Skipping line 26742: expected 2 fields, saw 5
Skipping line 29702: expected 2 fields, saw 5
Skipping line 32767: expected 2 fields, saw 5
Skipping line 32878: expected 2 fields, saw 5
Skipping line 35643: expected 2 fields, saw 5
Skipping line 36550: expected 2 fields, saw 5
Skipping line 38732: expected 2 fields, saw 5
Skipping line 40567

Our data has two columns: the password and its strength label. Because every password comes with a label, this is a supervised learning task.

In [4]:
data

index,password,strength
0,kzde5577,1
1,kino3434,1
2,visi7k1yr,1
3,megzy123,1
4,lamborghin1,1
...,...,...
669635,10redtux10,1
669636,infrared1,1
669637,184520socram,1
669638,marken22a,1


**Problem Solving Intuition:** How can we build a model that takes a password and predicts how strong it is?
We use a **TF-IDF** matrix to convert each password into a numeric vector, so that the model can learn character patterns and separate passwords by strength level.
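
To make the idea concrete, here is a minimal sketch that computes character-level TF-IDF weights by hand for three made-up passwords. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 and L2 row normalization, which are the scikit-learn defaults for TfidfVectorizer. The function name `char_tfidf` and the toy passwords are hypothetical and not part of this notebook.

```python
import math
from collections import Counter

def char_tfidf(passwords):
    """Character-level TF-IDF with smoothed IDF and L2 row normalization."""
    n_docs = len(passwords)
    # Document frequency: number of passwords that contain each character.
    doc_freq = Counter(ch for pw in passwords for ch in set(pw))
    # Smoothed IDF (assumed formula): ln((1 + n) / (1 + df)) + 1
    idf = {ch: math.log((1 + n_docs) / (1 + df)) + 1 for ch, df in doc_freq.items()}
    rows = []
    for pw in passwords:
        counts = Counter(pw)  # raw term frequency per character
        weights = {ch: tf * idf[ch] for ch, tf in counts.items()}
        # Guard against an empty password, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({ch: w / norm for ch, w in weights.items()})
    return rows

toy_passwords = ["abc123", "aaa111", "x9#Qz!"]  # made-up examples
toy_rows = char_tfidf(toy_passwords)
print(toy_rows[0])
```

In the first toy password, characters that appear in only one password (such as `b`) end up with a higher weight than characters shared with another password (such as `a`), which is the behaviour the model relies on.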

# Data Preprocessing

In [5]:
data.dropna(inplace=True)

In [6]:
data['strength'].unique()

array([1, 2, 0])

0.   Weak
1.   Strong
2.   Very Strong
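
As a rough sketch, the mapping above can be kept in a dictionary and used to check how the classes are distributed. The name `STRENGTH_NAMES` and the toy labels are assumptions made for illustration; on the real data the same two calls would be applied to the strength column.

```python
import pandas as pd

# Mapping taken from the list above; the variable name is hypothetical.
STRENGTH_NAMES = {0: "Weak", 1: "Strong", 2: "Very Strong"}

# Toy labels for illustration only, not taken from the dataset.
toy_labels = pd.Series([1, 1, 0, 2, 1, 0])

named = toy_labels.map(STRENGTH_NAMES)
# Share of each class; a heavily skewed result signals class imbalance.
class_share = named.value_counts(normalize=True)
print(class_share)
```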



In [7]:
data = np.array(data)
data

array([['kzde5577', 1],
       ['kino3434', 1],
       ['visi7k1yr', 1],
       ...,
       ['184520socram', 1],
       ['marken22a', 1],
       ['fxx4pw4g', 1]], dtype=object)

In [8]:
x = [item[0] for item in data]
y = [item[1] for item in data]

Here we split each password into its individual characters, so that TF-IDF treats every character as a token and the model learns the strength level from the characters a password contains.

In [9]:
def custom_tokenizer(text):
    # Treat every character of the password as a separate token.
    return list(text)

vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer)
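
One detail worth checking: TfidfVectorizer lowercases its input by default (lowercase=True), even when a custom tokenizer is supplied, which would explain why the feature names printed further down contain no uppercase letters. The sketch below compares the default with lowercase=False on made-up passwords. The names `char_tokenizer`, `folded` and `cased` are hypothetical, and token_pattern=None is passed only to silence the warning about the unused default pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def char_tokenizer(text):
    # One token per character, as in the cell above.
    return list(text)

toy_passwords = ["Abc123", "abc123", "XyZ!9"]  # made-up examples

# Default settings: input is lowercased before tokenization.
folded = TfidfVectorizer(tokenizer=char_tokenizer, token_pattern=None)
folded.fit(toy_passwords)

# lowercase=False keeps 'A' and 'a' as distinct features.
cased = TfidfVectorizer(tokenizer=char_tokenizer, token_pattern=None, lowercase=False)
cased.fit(toy_passwords)

folded_vocab = sorted(folded.vocabulary_)
cased_vocab = sorted(cased.vocabulary_)
print(folded_vocab)
print(cased_vocab)
```

Whether case should be preserved is a modelling choice. Mixed case is a common ingredient of strong passwords, so keeping it may help, but the results reported in this notebook were produced with the default setting.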

In [10]:
tfidf_matrix = vectorizer.fit_transform(x)



In [11]:
tfidf_matrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
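
Note that tfidf_matrix is a SciPy sparse matrix, and toarray() materializes every cell, which can use a lot of memory for a dataset of about 670,000 rows. A lighter way to inspect it, sketched here on a small made-up stand-in matrix, is to look at its shape and density and to densify only the rows of interest. The values and variable names below are illustrative assumptions.

```python
import numpy as np
from scipy import sparse

# Small stand-in for tfidf_matrix (3 passwords x 5 characters), values made up.
toy_matrix = sparse.csr_matrix(np.array([
    [0.0, 0.6, 0.0, 0.8, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.0, 0.7],
]))

n_rows, n_cols = toy_matrix.shape
# Fraction of cells that actually store a non-zero value.
density = toy_matrix.nnz / (n_rows * n_cols)
# Densify a single row instead of the whole matrix.
first_row_dense = toy_matrix[0].toarray().ravel()

print(toy_matrix.shape, toy_matrix.nnz, density)
print(first_row_dense)
```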

In [12]:
vectorizer.get_feature_names_out()

array(['\x01', '\x02', '\x04', '\x05', '\x06', '\x08', '\x0e', '\x0f',
       '\x10', '\x11', '\x12', '\x13', '\x16', '\x17', '\x18', '\x19',
       '\x1b', '\x1c', '\x1d', '\x1e', ' ', '!', '"', '#', '$', '%', '&',
       '(', ')', '*', '+', '-', '.', '/', '0', '1', '2', '3', '4', '5',
       '6', '7', '8', '9', ';', '<', '=', '>', '?', '@', '[', '\\', ']',
       '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
       'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
       'x', 'y', 'z', '{', '|', '}', '~', '\x7f', '\x81', '\x8d', '\xa0',
       '¡', '¢', '¤', '¦', '§', '¨', '«', '¯', '°', '±', '²', '³', '´',
       'µ', '¶', '·', '¹', 'º', '»', '¼', '½', '¾', '¿', '×', 'ß', 'à',
       'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'í', 'î', 'ï',
       'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', '÷', 'ù', 'ú', 'û', 'ü', 'ý',
       'þ', 'ÿ', 'œ', 'ƒ', '—', '‚', '‡', '…', '‹', '›', '™'],
      dtype=object)

Let's look at the TF-IDF weights for a single password.

In [13]:
sample = tfidf_matrix[5]
sample_df = pd.DataFrame(sample.T.todense(), index=vectorizer.get_feature_names_out(), columns=['TF-IDF'])
sample_df = sample_df.sort_values('TF-IDF', ascending=False)
sample_df

character,TF-IDF
a,0.329761
q,0.311920
f,0.301258
v,0.296521
z,0.292744
...,...
8,0.000000
7,0.000000
6,0.000000
5,0.000000


# Train Model


## Logistic Regression

In [14]:
x_train, x_test, y_train, y_test = train_test_split(tfidf_matrix, y, test_size=0.2, random_state=42)

In [15]:
logistic = LogisticRegression(random_state=42)
logistic.fit(x_train, y_train)
yhat = logistic.predict(x_test)
print(classification_report(y_test, yhat))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


              precision    recall  f1-score   support

           0       0.58      0.30      0.39     17908
           1       0.84      0.94      0.89     99519
           2       0.82      0.70      0.75     16501

    accuracy                           0.82    133928
   macro avg       0.75      0.64      0.68    133928
weighted avg       0.80      0.82      0.80    133928
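
The warning printed before the report says that the solver stopped at its iteration limit before converging. A minimal sketch of the first suggestion in that message, raising max_iter, is shown below on made-up toy data. The value 1000, the toy passwords and the toy labels are assumptions, not tuned settings for this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up passwords and labels, for illustration only.
toy_passwords = ["abc", "12345", "qwerty", "pass1234", "hello2020",
                 "sunshine77", "G7#kL9!pQ2x", "zX8$wR3@mN6v", "T4&bY1*cU5e#"]
toy_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

toy_vectorizer = TfidfVectorizer(analyzer="char", lowercase=False)
toy_features = toy_vectorizer.fit_transform(toy_passwords)

# LogisticRegression defaults to max_iter=100; a larger limit gives
# the solver more room to converge.
toy_model = LogisticRegression(max_iter=1000, random_state=42)
toy_model.fit(toy_features, toy_labels)

# n_iter_ holds the iterations actually used; a value below max_iter
# means the solver stopped before hitting the limit.
print(toy_model.n_iter_, toy_model.max_iter)
```

On the full dataset the limit that is needed may differ, so it is worth comparing n_iter_ with max_iter after fitting and checking whether the classification report changes.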



Let's see what the model predicts for a new password.

In [16]:
new_pass = 'al265#@2694ffasdd'
new_pass = vectorizer.transform([new_pass]).toarray()
print(f'Strength Level is: {logistic.predict(new_pass)}')

Strength Level is: [2]
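
For repeated checks, the vectorizer and the classifier could be bundled into a single scikit-learn pipeline, so that raw password strings go in and a readable label comes out. The sketch below trains on made-up toy data only to stay self-contained, so its prediction carries no meaning. The names `STRENGTH_NAMES`, `checker` and `describe_strength` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Label names as described earlier in this notebook.
STRENGTH_NAMES = {0: "Weak", 1: "Strong", 2: "Very Strong"}

# Made-up passwords and labels, for illustration only.
toy_passwords = ["abc", "12345", "qwerty", "pass1234", "hello2020",
                 "sunshine77", "G7#kL9!pQ2x", "zX8$wR3@mN6v", "T4&bY1*cU5e#"]
toy_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# The pipeline applies character-level TF-IDF, then logistic regression.
checker = make_pipeline(
    TfidfVectorizer(analyzer="char", lowercase=False),
    LogisticRegression(max_iter=1000, random_state=42),
)
checker.fit(toy_passwords, toy_labels)

def describe_strength(password, model=checker):
    """Return the predicted strength name for a single password."""
    label = int(model.predict([password])[0])
    return STRENGTH_NAMES[label]

result = describe_strength("al265#@2694ffasdd")
print(f"Strength Level is: {result}")
```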
