## Data Breach Analytics 2005 - 2017

**By Miriam Rodriguez**


### Data Description

#### Dataset
- Dataset name:              Privacy_Rights_Clearinghouse-Data-Breaches-Export.csv
- Source:                    https://www.privacyrights.org/data-breaches
- Data Description and FAQ:  https://www.privacyrights.org/chronology-data-breaches-faq 
- Duration of the data:      2005 through 2017

#### Data Breach Types
   - CARD - Payment Card Fraud: Fraud involving debit and credit cards that is not accomplished via hacking (e.g., skimming devices at point-of-service terminals).
   - HACK - Hacking or Malware: Hacked by an outside party or infected by malware.
   - INSD - Insider: Someone with legitimate access (such as an employee, contractor, or customer) intentionally breaches information.
   - PHYS - Physical Loss: Paper documents that are lost, discarded, or stolen (non-electronic).
   - PORT - Portable Device: Lost, discarded, or stolen laptop, PDA, smartphone, memory stick, CD, hard drive, data tape, etc.
   - STAT - Stationary Device: Stationary computer loss (lost, inappropriately accessed, discarded, or stolen computer or server not designed for mobility).
   - DISC - Unintended Disclosure: Disclosure not involving hacking, intentional breach, or physical loss (e.g., sensitive information posted publicly, mishandled, or sent to the wrong party via online publishing, email, mail, or fax).
   - Unknown

#### Institution/Organization Type
-	BSF - Businesses - Financial and Insurance Services
-	BSO - Businesses - Other
-	BSR - Businesses - Retail/Merchant – Including Online Retail
-	EDU - Educational Institutions
-	GOV - Government & Military
-	MED - Healthcare - Medical Providers & Medical Insurance Services
-	NGO - Nonprofit Organizations

#### Data elements/column names
1.	Date Made Public: Date Breach information released to public (date: year, month, day)
2.	Company: Company breached (text)
3.	City: City of breached company (text)
4.	State: State of breached company (text)
5.	Type of Breach: Refer to four-character Breach Type above
6.	Type of Organization: Refer to three-character Institution/Organization Type above
7.	Total Records: Number of records breached (integer)
8.	Description of Incident: Text describing breach (text)
9.	Information Source: Location of database source (text)
10.	Source URL: Location of data source URL (text)
11.	Year of Breach: Four digit year (numeric)
12.	Latitude: Location latitude (signed decimal number)
13.	Longitude: Location longitude (signed decimal number)


In [219]:
# import python packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Importing Data & Basic Statistics

In [260]:
#import breach data ... open or read the breach data
df=pd.read_csv("Privacy_Rights_Clearinghouse-Data-Breaches-Export.csv")
df.head()


Unnamed: 0,Date Made Public,Company,City,State,Type of breach,Type of organization,Total Records,Description of incident,Information Source,Source URL,Year of Breach,Latitude,Longitude
0,16-May-08,Greil Memorial Psychiatric Hospital,Montgomery,Alabama,PHYS,EDU,0,Index cards containing patients \n ...,Dataloss DB,,2008.0,32.366805,-86.299969
1,21-Mar-08,Compass Bank,Birmingham,Alabama,INSD,BSF,1000000,"A database containing names, account \n ...",Dataloss DB,,2008.0,33.520661,-86.80249
2,7-Aug-07,Electronic Data Systems,Montgomery,Alabama,INSD,BSO,498,A former employee \n was arrested t...,Dataloss DB,,2007.0,32.366805,-86.299969
3,3-Jun-07,Gadsden State Community College,College Gadsden,Alabama,PHYS,EDU,400,Students who took \n an Art Appreci...,Dataloss DB,,2007.0,34.025272,-85.995891
4,5-Apr-07,DCH Health Systems,Tuscaloosa,Alabama,PORT,MED,6000,An encrypted disc \n and hardcopy d...,Dataloss DB,,2007.0,33.209841,-87.569174


### Determine data types and missing values

In [261]:
# how many missing values in each column or variable
df.isnull().sum()

Date Made Public              0
Company                       0
City                       2520
State                        68
Type of breach                0
Type of organization          0
Total Records                38
Description of incident       3
Information Source           54
Source URL                 5410
Year of Breach               33
Latitude                      0
Longitude                     0
dtype: int64

In [262]:
# Describe data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8202 entries, 0 to 8201
Data columns (total 13 columns):
Date Made Public           8202 non-null object
Company                    8202 non-null object
City                       5682 non-null object
State                      8134 non-null object
Type of breach             8202 non-null object
Type of organization       8202 non-null object
Total Records              8164 non-null object
Description of incident    8199 non-null object
Information Source         8148 non-null object
Source URL                 2792 non-null object
Year of Breach             8169 non-null float64
Latitude                   8202 non-null float64
Longitude                  8202 non-null float64
dtypes: float64(3), object(10)
memory usage: 833.1+ KB


**There are 8202 records total.**  

**There are data quality issues.**


**Based on the counts above, City, State, Total_Recs, Description, and Breach_Year have missing values.**


Column name: City 
- How resolved: Drop column. 
- Justification: About 31% of rows (2520 of 8202) have no city, and State is all the analysis needs.

Column name: State
- How resolved: Replace null values with 'United States'
- Justification: Since this value was not provided in the file, the assumption is that the breach was national.

Column name: Total_Recs 
- How resolved: Replace null values with zero, strip the thousands separators, then convert to float and finally to int.
- Justification: This makes it possible to measure the quantitative impact.

Column name: Description 
- How resolved: Replace null values with 'None'
- Justification: This value was not provided in the file.

Column name: Breach_Year
- How resolved:   Drop column
- Justification:  The publication date carries the year and has no missing values, so this field is not needed.


Remove the following fields as they are not needed:
- Information Source 
- Source URL 
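
The drop-or-fill decisions above depend on how much of each column is missing. A minimal sketch of turning null counts into shares, using a small made-up frame (the `toy` data is hypothetical, not taken from the export):

```python
import pandas as pd

# Hypothetical toy frame standing in for the breach export.
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "City": ["Montgomery", None, None, "Tuscaloosa"],
    "State": ["Alabama", "Alabama", None, "Alabama"],
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same call would report 2520 / 8202, roughly 31%, for City.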

In [263]:
#Two columns (Information Source, Source URL) are not necessary for analysis. They will be dropped.
df = df.drop(['Information Source', 'Source URL'], axis=1)
df.head()

Unnamed: 0,Date Made Public,Company,City,State,Type of breach,Type of organization,Total Records,Description of incident,Year of Breach,Latitude,Longitude
0,16-May-08,Greil Memorial Psychiatric Hospital,Montgomery,Alabama,PHYS,EDU,0,Index cards containing patients \n ...,2008.0,32.366805,-86.299969
1,21-Mar-08,Compass Bank,Birmingham,Alabama,INSD,BSF,1000000,"A database containing names, account \n ...",2008.0,33.520661,-86.80249
2,7-Aug-07,Electronic Data Systems,Montgomery,Alabama,INSD,BSO,498,A former employee \n was arrested t...,2007.0,32.366805,-86.299969
3,3-Jun-07,Gadsden State Community College,College Gadsden,Alabama,PHYS,EDU,400,Students who took \n an Art Appreci...,2007.0,34.025272,-85.995891
4,5-Apr-07,DCH Health Systems,Tuscaloosa,Alabama,PORT,MED,6000,An encrypted disc \n and hardcopy d...,2007.0,33.209841,-87.569174


Remove spaces from field names (rename)
- Total Records:           Total_Recs
- Description of incident: Description
- Type of organization:    Organization_Type
- Type of breach:          Breach_Type
- Date Made Public:        Date_Public
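
The renames in this list could also be applied in a single call. A minimal sketch, assuming a one-row stand-in frame (`toy` is hypothetical) that carries the original column names:

```python
import pandas as pd

# Mapping taken from the rename list above.
RENAMES = {
    "Total Records": "Total_Recs",
    "Description of incident": "Description",
    "Type of organization": "Organization_Type",
    "Type of breach": "Breach_Type",
    "Date Made Public": "Date_Public",
}

# Hypothetical one-row frame with the original column names.
toy = pd.DataFrame(
    [["16-May-08", "PHYS", "EDU", "0", "Index cards"]],
    columns=["Date Made Public", "Type of breach", "Type of organization",
             "Total Records", "Description of incident"],
)

# rename() only touches the columns named in the mapping.
renamed = toy.rename(columns=RENAMES)
print(list(renamed.columns))
```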

In [264]:
#Drop column Breach_Year which has missing values. No longer needed since Date_Public has no missing values.
df = df.drop(['Year of Breach'], axis=1)
df.head()

Unnamed: 0,Date Made Public,Company,City,State,Type of breach,Type of organization,Total Records,Description of incident,Latitude,Longitude
0,16-May-08,Greil Memorial Psychiatric Hospital,Montgomery,Alabama,PHYS,EDU,0,Index cards containing patients \n ...,32.366805,-86.299969
1,21-Mar-08,Compass Bank,Birmingham,Alabama,INSD,BSF,1000000,"A database containing names, account \n ...",33.520661,-86.80249
2,7-Aug-07,Electronic Data Systems,Montgomery,Alabama,INSD,BSO,498,A former employee \n was arrested t...,32.366805,-86.299969
3,3-Jun-07,Gadsden State Community College,College Gadsden,Alabama,PHYS,EDU,400,Students who took \n an Art Appreci...,34.025272,-85.995891
4,5-Apr-07,DCH Health Systems,Tuscaloosa,Alabama,PORT,MED,6000,An encrypted disc \n and hardcopy d...,33.209841,-87.569174


In [265]:
# rename columns
df = df.rename(columns={'Total Records': 'Total_Recs'})
df.head()

Unnamed: 0,Date Made Public,Company,City,State,Type of breach,Type of organization,Total_Recs,Description of incident,Latitude,Longitude
0,16-May-08,Greil Memorial Psychiatric Hospital,Montgomery,Alabama,PHYS,EDU,0,Index cards containing patients \n ...,32.366805,-86.299969
1,21-Mar-08,Compass Bank,Birmingham,Alabama,INSD,BSF,1000000,"A database containing names, account \n ...",33.520661,-86.80249
2,7-Aug-07,Electronic Data Systems,Montgomery,Alabama,INSD,BSO,498,A former employee \n was arrested t...,32.366805,-86.299969
3,3-Jun-07,Gadsden State Community College,College Gadsden,Alabama,PHYS,EDU,400,Students who took \n an Art Appreci...,34.025272,-85.995891
4,5-Apr-07,DCH Health Systems,Tuscaloosa,Alabama,PORT,MED,6000,An encrypted disc \n and hardcopy d...,33.209841,-87.569174


In [266]:
# rename columns Description of incident: Description
df = df.rename(columns={'Description of incident': 'Description'})

In [267]:
# rename columns Type of organization:    Organization_Type
df = df.rename(columns={'Type of organization': 'Organization_Type'})

In [268]:
# rename columns Date Made Public:    Date_Public
df = df.rename(columns={'Date Made Public': 'Date_Public'})

In [269]:
# rename columns Type of breach:          Breach_Type
df = df.rename(columns={'Type of breach': 'Breach_Type'})
df.head()

Unnamed: 0,Date_Public,Company,City,State,Breach_Type,Organization_Type,Total_Recs,Description,Latitude,Longitude
0,16-May-08,Greil Memorial Psychiatric Hospital,Montgomery,Alabama,PHYS,EDU,0,Index cards containing patients \n ...,32.366805,-86.299969
1,21-Mar-08,Compass Bank,Birmingham,Alabama,INSD,BSF,1000000,"A database containing names, account \n ...",33.520661,-86.80249
2,7-Aug-07,Electronic Data Systems,Montgomery,Alabama,INSD,BSO,498,A former employee \n was arrested t...,32.366805,-86.299969
3,3-Jun-07,Gadsden State Community College,College Gadsden,Alabama,PHYS,EDU,400,Students who took \n an Art Appreci...,34.025272,-85.995891
4,5-Apr-07,DCH Health Systems,Tuscaloosa,Alabama,PORT,MED,6000,An encrypted disc \n and hardcopy d...,33.209841,-87.569174


*First, resolve the missing Year of Breach values: the publication date is always present, so convert it from an object to a datetime timestamp and take the year from that.*

In [270]:
df['Date_Public'] =  pd.to_datetime(df['Date_Public'])
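
The dates in this file look like `16-May-08`. A hedged sketch of the same conversion with an explicit format string, so the parser does not have to guess the layout; the sample values are copied from the table output above and `sample` is an illustrative name:

```python
import pandas as pd

# Two date strings in the day-month-two-digit-year layout seen in the export.
sample = pd.Series(["16-May-08", "7-Aug-07"])

# %d = day, %b = abbreviated month name, %y = two-digit year.
parsed = pd.to_datetime(sample, format="%d-%b-%y")

# The year is then available without the separate 'Year of Breach' column.
years = parsed.dt.year
print(years.tolist())
```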

In [271]:
# replace null value with zero 
df = df.fillna({'Total_Recs': '0'})
df.isnull().sum()

Date_Public             0
Company                 0
City                 2520
State                  68
Breach_Type             0
Organization_Type       0
Total_Recs              0
Description             3
Latitude                0
Longitude               0
dtype: int64

In [272]:
#Strip thousands separators from Total_Recs and convert to float
df["Total_Recs"] = df["Total_Recs"].str.replace(",","").astype(float)

In [273]:
#Convert Total_Recs to integer
df['Total_Recs'] = df['Total_Recs'].astype(np.int64)
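
The fill, strip, and cast steps above can be sketched as one chain. This is an illustration on a made-up series (`raw` is hypothetical), not a replacement for the cells above:

```python
import pandas as pd

# Hypothetical raw values: comma-separated counts, a plain count, a missing value, a zero.
raw = pd.Series(["1,000,000", "498", None, "0"])

cleaned = pd.to_numeric(
    raw.fillna("0")                          # missing counts become "0"
       .str.replace(",", "", regex=False)    # drop thousands separators
).astype("int64")

print(cleaned.tolist())
```

After this step an unknown record count and a reported count of zero are indistinguishable, which matters when averaging or ranking by Total_Recs.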

In [274]:
df.head()

Unnamed: 0,Date_Public,Company,City,State,Breach_Type,Organization_Type,Total_Recs,Description,Latitude,Longitude
0,2008-05-16,Greil Memorial Psychiatric Hospital,Montgomery,Alabama,PHYS,EDU,0,Index cards containing patients \n ...,32.366805,-86.299969
1,2008-03-21,Compass Bank,Birmingham,Alabama,INSD,BSF,1000000,"A database containing names, account \n ...",33.520661,-86.80249
2,2007-08-07,Electronic Data Systems,Montgomery,Alabama,INSD,BSO,498,A former employee \n was arrested t...,32.366805,-86.299969
3,2007-06-03,Gadsden State Community College,College Gadsden,Alabama,PHYS,EDU,400,Students who took \n an Art Appreci...,34.025272,-85.995891
4,2007-04-05,DCH Health Systems,Tuscaloosa,Alabama,PORT,MED,6000,An encrypted disc \n and hardcopy d...,33.209841,-87.569174


In [275]:
#Drop column City, which has 2520 missing values. Not needed, as we are matching GDP by State
df = df.drop(['City'], axis=1)
df.head()

Unnamed: 0,Date_Public,Company,State,Breach_Type,Organization_Type,Total_Recs,Description,Latitude,Longitude
0,2008-05-16,Greil Memorial Psychiatric Hospital,Alabama,PHYS,EDU,0,Index cards containing patients \n ...,32.366805,-86.299969
1,2008-03-21,Compass Bank,Alabama,INSD,BSF,1000000,"A database containing names, account \n ...",33.520661,-86.80249
2,2007-08-07,Electronic Data Systems,Alabama,INSD,BSO,498,A former employee \n was arrested t...,32.366805,-86.299969
3,2007-06-03,Gadsden State Community College,Alabama,PHYS,EDU,400,Students who took \n an Art Appreci...,34.025272,-85.995891
4,2007-04-05,DCH Health Systems,Alabama,PORT,MED,6000,An encrypted disc \n and hardcopy d...,33.209841,-87.569174


In [276]:
# replace null value with 'United States' 
df = df.fillna({'State': 'United States'})

In [277]:
# replace null value with 'None'
df = df.fillna({'Description': 'None'})
df.isnull().sum()

Date_Public          0
Company              0
State                0
Breach_Type          0
Organization_Type    0
Total_Recs           0
Description          0
Latitude             0
Longitude            0
dtype: int64

In [278]:
df.head()

Unnamed: 0,Date_Public,Company,State,Breach_Type,Organization_Type,Total_Recs,Description,Latitude,Longitude
0,2008-05-16,Greil Memorial Psychiatric Hospital,Alabama,PHYS,EDU,0,Index cards containing patients \n ...,32.366805,-86.299969
1,2008-03-21,Compass Bank,Alabama,INSD,BSF,1000000,"A database containing names, account \n ...",33.520661,-86.80249
2,2007-08-07,Electronic Data Systems,Alabama,INSD,BSO,498,A former employee \n was arrested t...,32.366805,-86.299969
3,2007-06-03,Gadsden State Community College,Alabama,PHYS,EDU,400,Students who took \n an Art Appreci...,34.025272,-85.995891
4,2007-04-05,DCH Health Systems,Alabama,PORT,MED,6000,An encrypted disc \n and hardcopy d...,33.209841,-87.569174


# Data understanding & processing (ETL)

In [279]:
#show the information about the dataset - no missing data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8202 entries, 0 to 8201
Data columns (total 9 columns):
Date_Public          8202 non-null datetime64[ns]
Company              8202 non-null object
State                8202 non-null object
Breach_Type          8202 non-null object
Organization_Type    8202 non-null object
Total_Recs           8202 non-null int64
Description          8202 non-null object
Latitude             8202 non-null float64
Longitude            8202 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(5)
memory usage: 576.8+ KB


**Breach_Type and Organization_Type are categories**
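
Since these two columns hold a small fixed set of codes, they could be stored as pandas categoricals and mapped to readable labels. A minimal sketch on a made-up frame; `BREACH_LABELS` and `Breach_Label` are illustrative names, with the labels taken from the Data Breach Types list:

```python
import pandas as pd

# Code-to-label mapping from the Data Breach Types list.
BREACH_LABELS = {
    "CARD": "Payment Card Fraud",
    "HACK": "Hacking or Malware",
    "INSD": "Insider",
    "PHYS": "Physical Loss",
    "PORT": "Portable Device",
    "STAT": "Stationary Device",
    "DISC": "Unintended Disclosure",
}

# Hypothetical toy frame with a few breach codes.
toy = pd.DataFrame({"Breach_Type": ["HACK", "PHYS", "HACK", "DISC"]})

# Store the codes as a categorical column.
toy["Breach_Type"] = toy["Breach_Type"].astype("category")
counts = toy["Breach_Type"].value_counts()

# Add a readable label next to each code.
toy["Breach_Label"] = toy["Breach_Type"].astype(str).map(BREACH_LABELS)
print(toy)
```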

In [280]:
#describe the column Type of Organization only (e.g., count, unique, frequency)
df['Organization_Type'].describe()

count     8202
unique       7
top        MED
freq      4078
Name: Organization_Type, dtype: object

##### 'MED' is top organization type

In [281]:
#describe the column Type of Breach only (e.g., count, unique, frequency)
df['Breach_Type'].describe()


count     8202
unique       8
top       HACK
freq      2454
Name: Breach_Type, dtype: object

##### 'HACK' is top type of breach.

In [282]:
#describe the column State only (e.g., count, unique, frequency)
df['State'].describe()

count           8202
unique            66
top       California
freq            1287
Name: State, dtype: object

In [284]:
df.to_csv("databreach_cleaned.csv")

Business questions include comparisons of data breaches across U.S. states.
- Some large states (e.g., California) with many businesses and large populations have more data breach cases than smaller U.S. states such as Wyoming and Idaho.
- Data on gross domestic product (GDP) for U.S. states between 2005 and 2017 needs to be collected.
- Normalize the number of data breaches by each state's GDP. Do a join by state code and year.

##### 'California' is the top state for reported breaches; however, the frequency needs to be normalized by GDP to get a comparable statistic. Create a new dataframe with data breaches by state, then normalize.
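
A minimal sketch of the normalization idea, under stated assumptions: the GDP figures below are made up, the `Year`, `Breaches`, `GDP_Millions`, and `Breaches_per_100B_GDP` names are illustrative, and the join here uses state name and year.

```python
import pandas as pd

# Hypothetical breach rows (state and publication date only).
breaches = pd.DataFrame({
    "State": ["California", "California", "Wyoming"],
    "Date_Public": pd.to_datetime(["2008-05-16", "2008-03-21", "2008-08-07"]),
})
breaches["Year"] = breaches["Date_Public"].dt.year

# Count breaches per state and year.
counts = breaches.groupby(["State", "Year"]).size().reset_index(name="Breaches")

# Hypothetical GDP in millions of dollars, one row per state and year.
gdp = pd.DataFrame({
    "State": ["California", "Wyoming"],
    "Year": [2008, 2008],
    "GDP_Millions": [2_000_000.0, 40_000.0],
})

# Join on state and year, then express breaches per $100 billion of GDP.
merged = counts.merge(gdp, on=["State", "Year"], how="left")
merged["Breaches_per_100B_GDP"] = merged["Breaches"] / merged["GDP_Millions"] * 100_000
print(merged)
```

In this toy example the smaller state ends up with the higher normalized rate even though it has fewer breaches, which is the effect the normalization is meant to expose.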

In [286]:
df.groupby('State').size()

State
Alabama                   78
Alaska                    23
Arizona                  135
Arkansas                  54
Beijing                    1
Berlin                     1
British Columbia           3
Buckinghamshire            2
California              1287
Cheshire                   1
Colorado                 165
Connecticut              138
Delaware                  22
District Of Columbia     152
Dublin                     1
Florida                  435
Georgia                  244
Grand Bahama               1
Guangdong                  1
Hawaii                    27
Idaho                     22
Illinois                 335
Indiana                  208
Iowa                      67
Kansas                    54
Kentucky                 112
London                     2
Louisiana                 61
Maine                     33
Maryland                 374
                        ... 
Nebraska                  38
Nevada                    62
New Hampshire             44
New Jers

In [259]:
#import gdp data ... open or read the gdp data
gp=pd.read_csv("gdpstate_naics_all/gdpstate_naics_all.csv")
gp.head()


Unnamed: 0,GeoFIPS,GeoName,Region,ComponentId,ComponentName,IndustryId,IndustryClassification,Description,1997,1998,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,0,United States,,200,Gross domestic product (GDP) by state (million...,1,...,All industry total,8542530,9024434,...,14626598,14320114,14859772,15406002,16041243,16576738,17312308,18007206,18509998,19263350
1,0,United States,,200,Gross domestic product (GDP) by state (million...,2,...,Private industries,7459395,7894015,...,12716179,12352979,12826507,13348439,13957545,14468465,15149621,15776274,16224645,16925936
2,0,United States,,200,Gross domestic product (GDP) by state (million...,3,11,"Agriculture, forestry, fishing, and hunting",108796,99940,...,154525,137655,160217,197241,185800,221821,204404,184791,177580,173445
3,0,United States,,200,Gross domestic product (GDP) by state (million...,4,111-112,Farms,88136,79030,...,126345,109800,129725,166249,151489,186960,167709,145476,136672,(NA)
4,0,United States,,200,Gross domestic product (GDP) by state (million...,5,113-115,"Forestry, fishing, and related activities",20660,20910,...,28180,27855,30492,30992,34311,34861,36695,39315,40907,(NA)


##### Keep GeoName, ComponentId (to select by ComponentId later), and years 2005 - 2017.  All other columns will be dropped.  Only ComponentId 200 will be used.
##### Group and sum by GeoName, then match with the breach data by year.
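
The plan above can be sketched on a small stand-in frame. The rows, values, and the ComponentId 900 entry are made up for illustration, and `year_cols` is shortened to two years; the real file would use the columns `'2005'` through `'2017'`.

```python
import pandas as pd

# Hypothetical rows mimicking the GDP file layout; values are made up.
gp_toy = pd.DataFrame({
    "GeoName": ["Alabama", "Alabama", "Alaska"],
    "ComponentId": [200, 900, 200],
    "2005": ["150", "1", "40"],
    "2006": ["160", "2", "(NA)"],
})
year_cols = ["2005", "2006"]

# Keep only ComponentId 200 plus GeoName and the year columns.
kept = gp_toy.loc[gp_toy["ComponentId"] == 200, ["GeoName"] + year_cols].copy()

# "(NA)" and other non-numeric entries become NaN.
kept[year_cols] = kept[year_cols].apply(pd.to_numeric, errors="coerce")

# Sum per state; min_count=1 keeps an all-missing year as NaN instead of 0.
by_state = kept.groupby("GeoName")[year_cols].sum(min_count=1)

# Reshape to one row per state and year so it can be joined to the breach data.
gdp_long = by_state.reset_index().melt(
    id_vars="GeoName", var_name="Year", value_name="GDP_Millions"
)
gdp_long["Year"] = gdp_long["Year"].astype(int)
print(gdp_long)
```

One caveat: the `gp.head()` output shows that ComponentId 200 includes "All industry total" alongside its sub-industries, so summing every row per state may count the same output more than once. Filtering to a single industry row before grouping may be closer to what is intended.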