## Student Performance Indicator


#### Life cycle of Machine learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model

### 1) Problem statement
- This project understands how financial losses are explained by Cyber Security threats. 

### 2) Data Collection
- Dataset Source - https://www.kaggle.com/datasets/gojoyuno/cyber-breach-analysis-dataset/
- The data consists of 18 column and 1000 rows.

### 2.1 Import Data and Required Packages
####  Importing Pandas, Numpy, Matplotlib, Seaborn and Warings Library.

In [12]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

#### Import the CSV Data as Pandas DataFrame

In [13]:
df = pd.read_csv('./data/breached_services_info.csv')

#### Show Top 5 Records

In [14]:
df.head()

Unnamed: 0.1,Unnamed: 0,Name,Title,Domain,BreachDate,AddedDate,ModifiedDate,PwnCount,Description,LogoPath,DataClasses,IsVerified,IsFabricated,IsSensitive,IsRetired,IsSpamList,IsMalware,IsSubscriptionFree
0,0,000webhost,000webhost,000webhost.com,2015-03-01,2015-10-26T23:35:45Z,2017-12-10T21:44:27Z,14936670,"In approximately March 2015, the free web host...",https://haveibeenpwned.com/Content/Images/Pwne...,"['Email addresses', 'IP addresses', 'Names', '...",True,False,False,False,False,False,False
1,1,123RF,123RF,123rf.com,2020-03-22,2020-11-15T00:59:50Z,2020-11-15T01:07:10Z,8661578,"In March 2020, the stock photo site <a href=""h...",https://haveibeenpwned.com/Content/Images/Pwne...,"['Email addresses', 'IP addresses', 'Names', '...",True,False,False,False,False,False,False
2,2,126,126,126.com,2012-01-01,2016-10-08T07:46:05Z,2016-10-08T07:46:05Z,6414191,"In approximately 2012, it's alleged that the C...",https://haveibeenpwned.com/Content/Images/Pwne...,"['Email addresses', 'Passwords']",False,False,False,False,False,False,False
3,3,17Media,17,17app.co,2016-04-19,2016-07-08T01:55:03Z,2016-07-08T01:55:03Z,4009640,"In April 2016, customer data obtained from the...",https://haveibeenpwned.com/Content/Images/Pwne...,"['Device information', 'Email addresses', 'IP ...",True,False,False,False,False,False,False
4,4,17173,17173,17173.com,2011-12-28,2018-04-28T04:53:15Z,2018-04-28T04:53:15Z,7485802,"In late 2011, <a href=""https://news.softpedia....",https://haveibeenpwned.com/Content/Images/Pwne...,"['Email addresses', 'Passwords', 'Usernames']",False,False,False,False,False,False,False


In [15]:
# Rename first column
df = df.rename(columns={'Unnamed: 0': 'Record'})

#### Shape of the dataset

In [16]:
df.shape

(777, 18)

### 2.2 Dataset information

-  Name : The name of the breached entity.
-  Title : A brief title or label for the breach.
-  Domain : The domain name associated with the breach (if available).
-  BreachDate : The date when the breach occurred.
-  AddedDate : The date when the breach was added to the database.
-  ModifiedDat : The date when the breach data was last modified.
-  PwnCount : The number of accounts impacted by the breach.
-  Description : A description of the breach.
-  LogoPath : The path to the logo associated with the breached entity.
- IsVerified : A boolean indicating whether the data breach has been verified.
- IsFabricated : A boolean indicating whether the data has been fabricated.
- IsSensitive : A boolean indicating whether the data is sensitive.
- IsRetired : A boolean indicating whether the data is retired.
- IsSpamList : A boolean indicating whether the data is part of a spam list.
- IsMalware : A boolean indicating whether the data includes malware.
- IsSubscriptionFree : A boolean indicating whether the service is subscription-free.

### 3. Data Checks to perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set
- Check various categories present in the different categorical column

### 3.1 Check Missing values

In [17]:
df.isna().sum()

Record                 0
Name                   0
Title                  0
Domain                38
BreachDate             0
AddedDate              0
ModifiedDate           0
PwnCount               0
Description            0
LogoPath               0
DataClasses            0
IsVerified             0
IsFabricated           0
IsSensitive            0
IsRetired              0
IsSpamList             0
IsMalware              0
IsSubscriptionFree     0
dtype: int64

### 3.2 Check Duplicates

In [18]:
df.duplicated().sum()

0

#### There are no duplicates  values in the data set

### 3.3 Check data types

In [19]:
# Check Null and Dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Record              777 non-null    int64 
 1   Name                777 non-null    object
 2   Title               777 non-null    object
 3   Domain              739 non-null    object
 4   BreachDate          777 non-null    object
 5   AddedDate           777 non-null    object
 6   ModifiedDate        777 non-null    object
 7   PwnCount            777 non-null    int64 
 8   Description         777 non-null    object
 9   LogoPath            777 non-null    object
 10  DataClasses         777 non-null    object
 11  IsVerified          777 non-null    bool  
 12  IsFabricated        777 non-null    bool  
 13  IsSensitive         777 non-null    bool  
 14  IsRetired           777 non-null    bool  
 15  IsSpamList          777 non-null    bool  
 16  IsMalware           777 no

### 3.4 Checking the number of unique values of each column

In [20]:
df.nunique()

Record                777
Name                  777
Title                 777
Domain                720
BreachDate            650
AddedDate             773
ModifiedDate          768
PwnCount              776
Description           777
LogoPath              717
DataClasses           401
IsVerified              2
IsFabricated            2
IsSensitive             2
IsRetired               2
IsSpamList              2
IsMalware               2
IsSubscriptionFree      2
dtype: int64

### 3.5 Check statistics of data set

In [9]:
#    - **Quantitative Variables:** `PwnCount`
#    - **Categorical Variables:** `IsVerified`, `IsFabricated`, `IsSensitive`, `IsRetired`, `IsSpamList`, `IsMalware`, `IsSubscriptionFree`
#    - **Date Variables:** `BreachDate`, `AddedDate`, `ModifiedDate`
#    - **Text Variables:** `Name`, `Title`, `Domain`, `Description`
#    - **Path Variables:** `LogoPath`

# Convert to timezone-aware datetime
df['AddedDate'] = df['AddedDate'].dt.tz_convert('UTC')

# Cast variables
cast_dict = {
    'PwnCount': 'int64',
    'IsVerified': 'category',
    'IsFabricated': 'category',
    'IsSensitive': 'category',
    'IsRetired': 'category',
    'IsSpamList': 'category',
    'IsMalware': 'category',
    'IsSubscriptionFree': 'category',
    'BreachDate': 'datetime64[ns, UTC]',  # Specify timezone-aware dtype
    'AddedDate': 'datetime64[ns, UTC]',    # Specify timezone-aware dtype
    'ModifiedDate': 'datetime64[ns, UTC]'  # Specify timezone-aware dtype
}
df = df.astype(cast_dict)

AttributeError: Can only use .dt accessor with datetimelike values

In [21]:
df.describe()

Unnamed: 0,Record,PwnCount
count,777.0,777.0
mean,388.0,17396760.0
std,224.444871,70068860.0
min,0.0,858.0
25%,194.0,269552.0
50%,388.0,1141278.0
75%,582.0,5970416.0
max,776.0,772905000.0


## 4. Processing

In [22]:
# Define method to calculate the estimated financial loss
def estimate_financial_loss(row):
    base_cost_per_account = 1  # Fictitious base cost per account
    sensitivity_multiplier = 1.5 if row["IsSensitive"] else 1
    malware_multiplier = 2 if row["IsMalware"] else 1
    verification_multiplier = 1 if row["IsVerified"] else 0.8

    # Calculate estimated loss
    estimated_loss = (
        row["PwnCount"]
        * base_cost_per_account
        * sensitivity_multiplier
        * malware_multiplier
        * verification_multiplier
    )

    # Add hypothetical legal/PR costs
    legal_pr_cost = 500  # Arbitrary fixed cost

    estimated_loss += legal_pr_cost

    return estimated_loss


# Apply the function to the dataset
df["financial_estimated_loss"] = df.apply(estimate_financial_loss, axis=1)

In [23]:
# Save cleaned data
df.to_csv('./data/breached_services_info.csv', index=False)

In [None]:
# TODO - Add more EDA for better data understanding