
# <center>Python - Advanced Wrangling With Pandas - Practice Solutions<a class="tocSkip"></center>
# <center>QTM 350: Data Science Computing <a class="tocSkip"></center>    
# <center>Davi Moreira <a class="tocSkip"></center>

## Introduction <a class="tocSkip">
<hr>

<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/2024S_dsc_emory_qtm_350/main/lecture_material/material-topic-03/img/py4ds.png" width="200"/>
</div>
</center>


This topic material is based on the [Python Programming for Data Science](https://www.tomasbeuzen.com/python-programming-for-data-science/README.html) book and adapted for our purposes in the course.

## Exercise

In this set of practice exercises we'll be looking at a cool dataset of real passwords (made available from actual data breaches) sourced and compiled from [Information is Beautiful](https://informationisbeautiful.net/visualizations/top-500-passwords-visualized/?utm_content=buffer994fa&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer) and contributed to [R's Tidy Tuesday project](https://github.com/rfordatascience/tidytuesday). These passwords are common ("bad") passwords that you should avoid using! But we're going to use this dataset to practice some regex skills.

Let's start by importing pandas with the alias `pd`.

In [82]:
import pandas as pd

## Exercise

The dataset has the following columns:

|variable          |class     |description |
|:-----------------|:---------|:-----------|
|rank              |int    | popularity in their database of released passwords |
|password          |str | Actual text of the password |
|category          |str | What category does the password fall in to?|
|value             |float    | Time to crack by online guessing |
|time_unit         |str | Time unit to match with value |
|offline_crack_sec |float    | Time to crack offline in seconds |
|rank_alt          |int    | Rank 2 |
|strength          |int    | Strength = quality of password where 10 is highest, 1 is lowest, please note that these are relative to these generally bad passwords |
|font_size         |int    | Used to create the graphic for KIB |


In these exercises, we're only interested in the `password`, `value` and `time_unit` columns, so import only these three columns as a dataframe named `df` from this url: <https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv>

In [83]:
df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv',
                 usecols=['password', 'value', 'time_unit'],
                 skipfooter = 7,
                 engine='python')
df

|     |password  |value  |time_unit |
|:----|:---------|:------|:---------|
|0    |password  |6.91   |years     |
|1    |123456    |18.52  |minutes   |
|2    |12345678  |1.29   |days      |
|3    |1234      |11.11  |seconds   |
|4    |qwerty    |3.72   |days      |
|...  |...       |...    |...       |
|495  |reddog    |3.72   |days      |
|496  |alexande  |6.91   |years     |
|497  |college   |3.19   |months    |
|498  |jester    |3.72   |days      |


## Exercise

An online password attack is when someone tries to break into your account by trying a very large number of username/password combinations. For each `password` in our dataset, the `value` column shows the estimated time an online password attack would take to guess it. The `time_unit` column shows the units of that time value (e.g., hours, days, or years).

It would be much easier to compare the "online password guessing time" across passwords if every `value` used the same unit. So your first task is to convert all of the values to hours, using the conversion factors provided below (e.g., 1 day is 24 hours and 1 week is 168 hours; a month is taken as 30 days and a year as 365 days). Any values already given in hours need no conversion, which is why `hours` does not appear in the dictionary.

In [84]:
units = {
    "seconds": 1 / 3600,
    "minutes": 1 / 60,
    "days": 24,
    "weeks": 168,
    "months": 720,
    "years": 8760,
}

for key, val in units.items():
    df.loc[df['time_unit'] == key, 'value'] *= val 

df['time_unit'] = 'hours'
df.head()

|     |password  |value     |time_unit |
|:----|:---------|:---------|:---------|
|0    |password  |60531.6   |hours     |
|1    |123456    |0.308667  |hours     |
|2    |12345678  |30.96     |hours     |
|3    |1234      |0.003086  |hours     |
|4    |qwerty    |89.28     |hours     |
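
The loop above can also be written as a single vectorized step with `Series.map`. The following is a minimal sketch on a five-row sample copied from the first rows of the dataset. The `sample` frame, the `value_hours` column, and the extra `"hours": 1` entry are assumptions made for illustration and are not part of the original solution.

```python
import pandas as pd

# Conversion factors to hours. The "hours": 1 entry is an added assumption
# so that rows already in hours map to a factor of 1 instead of NaN.
units = {
    "seconds": 1 / 3600,
    "minutes": 1 / 60,
    "hours": 1,
    "days": 24,
    "weeks": 168,
    "months": 720,
    "years": 8760,
}

# Small illustrative sample (first five rows of the dataset shown earlier).
sample = pd.DataFrame({
    "password": ["password", "123456", "12345678", "1234", "qwerty"],
    "value": [6.91, 18.52, 1.29, 11.11, 3.72],
    "time_unit": ["years", "minutes", "days", "seconds", "days"],
})

# Look up the factor for every row at once, then multiply element-wise.
factors = sample["time_unit"].map(units)
sample["value_hours"] = sample["value"] * factors

print(sample)
```

Writing the result to a new column keeps the original `value` and `time_unit` intact, so the conversion can be re-run safely; the loop version modifies `value` in place and would multiply twice if executed again before `time_unit` is overwritten.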


## Exercise

How many passwords begin with the sequence `123`?

In [85]:
df['password'].str.contains(r"^123").sum()

9
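
For a fixed prefix like this one, a regex is not strictly necessary. As a small sketch on a made-up list of passwords (the `passwords` Series below is an illustrative assumption), `str.startswith` gives the same mask as the anchored pattern `^123`:

```python
import pandas as pd

# Illustrative sample only; not the full dataset.
passwords = pd.Series(["123456", "12345678", "1234", "qwerty", "abc123"])

# Regex version: "^" anchors the match to the start of the string.
by_regex = passwords.str.contains(r"^123")

# Plain-string version: no regex needed for a literal prefix.
by_prefix = passwords.str.startswith("123")

# Both approaches flag exactly the same rows.
assert by_regex.equals(by_prefix)
print(by_prefix.sum())  # 3 ("abc123" contains 123 but does not start with it)
```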

## Exercise

What is the average time in hours needed to crack these passwords that begin with `123`? How does this compare to the average of all passwords in the dataset?

In [86]:
print(f"Avg. time to crack passwords beginning with 123: {df[df['password'].str.contains(r'^123')]['value'].mean():.0f} hrs")
print(f"Avg. time to crack for all passwords in dataset: {df['value'].mean():.0f} hrs")

Avg. time to crack passwords beginning with 123: 107 hrs
Avg. time to crack for all passwords in dataset: 13918 hrs


## Exercise

How many passwords do not contain a number?

In [87]:
df[df['password'].str.contains(r"^[^0-9]*$")].shape[0]

446

## Exercise

How many passwords contain at least one number?

In [88]:
df[df['password'].str.contains(r".*[0-9].*")].shape[0]

54
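
The two counts (446 without a number, 54 with at least one) add up to the 500 passwords in the dataset, because the two patterns are logical complements. A minimal sketch on a made-up sample (the `passwords` Series is an illustrative assumption) shows that the shorter pattern `[0-9]` is enough, since `str.contains` searches anywhere in the string, and that negating its mask reproduces the "no number" mask:

```python
import pandas as pd

# Illustrative sample only; not the full dataset.
passwords = pd.Series(["password", "123456", "trustno1", "qwerty", "abc123"])

# str.contains searches anywhere in the string, so the surrounding ".*"
# in ".*[0-9].*" is optional.
has_digit = passwords.str.contains(r"[0-9]")

# Every character from start to end must be a non-digit.
no_digit = passwords.str.contains(r"^[^0-9]*$")

print(has_digit.sum())  # 3
print(no_digit.sum())   # 2

# The "no digit" mask is exactly the negation of the "has digit" mask.
assert (~has_digit).equals(no_digit)
```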

## Exercise

Is there an obvious difference in online cracking time between passwords that don't contain a number vs passwords that contain at least one number?

In [89]:
print(f"        Avg. time to crack passwords without a number: {df[df['password'].str.contains(r'^[^0-9]*$')]['value'].mean():.0f} hrs")
print(f"Avg. time to crack passwords with at least one number: {df[df['password'].str.contains(r'.*[0-9].*')]['value'].mean():.0f} hrs")

        Avg. time to crack passwords without a number: 8095 hrs
Avg. time to crack passwords with at least one number: 62005 hrs
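
When comparing the two averages, it may help to look at the median alongside the mean, since a single extreme value such as `trustno1` (808285.2 hours, and it contains a digit) can pull a group's mean up sharply. The sketch below uses six rows whose hour values appear in the outputs of this notebook; the `sample` frame and the `group` labels are assumptions made for illustration.

```python
import pandas as pd

# Six rows with values (in hours) taken from outputs shown in this notebook.
sample = pd.DataFrame({
    "password": ["password", "123456", "12345678", "1234", "qwerty", "trustno1"],
    "value": [60531.6, 0.308667, 30.96, 0.003086, 89.28, 808285.2],
})

# Label each row by whether its password contains a digit.
group = (
    sample["password"]
    .str.contains(r"[0-9]")
    .map({True: "with digit", False: "no digit"})
    .rename("group")
)

# Summarize both groups in one call instead of filtering twice.
summary = sample.groupby(group)["value"].agg(["count", "mean", "median"])
print(summary)
```

On this small sample, the "with digit" median is far below its mean, which suggests the gap between the two group means is driven by a few outliers rather than by digits making passwords uniformly harder to guess.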


## Exercise

How many passwords contain at least one of the following punctuation characters: `[.!?\\-]` (hint: remember this dataset contains *weak* passwords...)?

In [90]:
df[df['password'].str.contains(r'[.!?\\-]')].shape[0]

0
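
To see what this character class actually matches, here is a small sketch using Python's built-in `re` module, which `str.contains` relies on by default. Inside square brackets, `.` and `?` are literal characters, `\\` matches a literal backslash, and a trailing `-` is a literal hyphen. The `candidates` list is made up for illustration.

```python
import re

# Same pattern as in the solution above.
pattern = re.compile(r'[.!?\\-]')

# Hypothetical test strings; "back\\slash" is the text back\slash.
candidates = ["pass.word", "hello!", "why?", "a-b", "back\\slash",
              "password", "123456"]

# Keep only the strings where the pattern is found anywhere.
matches = [s for s in candidates if pattern.search(s)]
print(matches)
```

Note that the pattern also accepts a backslash, so it is slightly broader than "period, exclamation mark, question mark, or hyphen"; the count of 0 is unaffected either way.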

## Exercise

Which password(s) in the dataset took the shortest time to crack by online guessing? Which took the longest?

In [91]:
# idxmin()/idxmax() return index labels, so look the rows up with .loc
min_row, max_row = df.loc[df['value'].idxmin()], df.loc[df['value'].idxmax()]
print("Shortest:\n", min_row)
print("\nLongest:\n", max_row)


Shortest:
 password         1234
value        0.003086
time_unit       hours
Name: 3, dtype: object

Longest:
 password     trustno1
value        808285.2
time_unit       hours
Name: 25, dtype: object
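
Since the question asks for password(s), note that `idxmin()` and `idxmax()` return only the first matching label, so ties would go unreported. A minimal sketch of a tie-aware alternative follows; the `sample` frame, its passwords, and its values are entirely hypothetical and chosen only to create a tie.

```python
import pandas as pd

# Hypothetical data with a deliberate tie for the minimum value.
sample = pd.DataFrame({
    "password": ["aaaa", "bbbb", "cccc", "dddd"],
    "value": [0.5, 10.0, 0.5, 99.0],
})

# Boolean filtering keeps every row that equals the extreme value.
shortest = sample[sample["value"] == sample["value"].min()]
longest = sample[sample["value"] == sample["value"].max()]

print("Shortest:\n", shortest)
print("\nLongest:\n", longest)
```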




# <center>Have fun!<a class="tocSkip"></center>