# NORS Dataset Exploration

## ENVIRONMENT

Let's install and import the libraries needed for this project.

Type this into the terminal:

```pip install notebook pandas matplotlib```

Then, let's begin with the imports.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

## ACQUISITION OF DATA

Information about the dataset and a download link are available at [NORS Data CDC](https://data.cdc.gov/Foodborne-Waterborne-and-Related-Diseases/NORS/5xkq-dg7x/about_data).

In [3]:
# Read the CSV file. low_memory=False makes pandas parse the file in one pass,
# so column dtypes are inferred consistently and the mixed-dtype warning is avoided.
df = pd.read_csv('../data/NORS_20250114.csv', low_memory=False)

In [4]:
df.head()

Unnamed: 0,Year,Month,State,Primary Mode,Etiology,Serotype or Genotype,Etiology Status,Setting,Illnesses,Hospitalizations,Info On Hospitalizations,Deaths,Info On Deaths,Food Vehicle,Food Contaminated Ingredient,IFSAC Category,Water Exposure,Water Type,Animal Type
0,2023,1,Minnesota,Food,Norovirus Genogroup IX,GII.P15 GIX.1,Confirmed,Restaurant: Sit-down dining,23,0.0,23.0,0.0,23.0,,,,,,
1,2023,1,Massachusetts,Indeterminate/unknown,Norovirus,,Suspected,Long-term care/nursing home/assisted living facility,7,0.0,0.0,0.0,0.0,,,,,,
2,2023,1,North Carolina,Person-to-person,Norovirus unknown,,Confirmed,Long-term care/nursing home/assisted living facility,23,1.0,23.0,0.0,23.0,,,,,,
3,2023,1,Wisconsin,Person-to-person,Norovirus unknown;Clostridium difficile,,Confirmed;Confirmed,Long-term care/nursing home/assisted living facility,12,0.0,12.0,0.0,12.0,,,,,,
4,2023,1,Wisconsin,Person-to-person,,,,Long-term care/nursing home/assisted living facility,9,0.0,9.0,0.0,9.0,,,,,,


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66713 entries, 0 to 66712
Data columns (total 19 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          66713 non-null  int64  
 1   Month                         66713 non-null  int64  
 2   State                         66713 non-null  object 
 3   Primary Mode                  66713 non-null  object 
 4   Etiology                      50375 non-null  object 
 5   Serotype or Genotype          16470 non-null  object 
 6   Etiology Status               50375 non-null  object 
 7   Setting                       60804 non-null  object 
 8   Illnesses                     66713 non-null  int64  
 9   Hospitalizations              58155 non-null  float64
 10  Info On Hospitalizations      58480 non-null  float64
 11  Deaths                        58785 non-null  float64
 12  Info On Deaths                58463 non-null  float64
 13  F

## PREPARATION OF THE DATA

Let's keep only outbreaks from 2009 onward, since 2009 is the earliest year for which all modes of outbreaks are available.

In [6]:
df.shape

(66713, 19)

In [7]:
df = df[df["Year"] >= 2009]

In [8]:
df.shape

(51694, 19)

In [9]:
df.head()

Unnamed: 0,Year,Month,State,Primary Mode,Etiology,Serotype or Genotype,Etiology Status,Setting,Illnesses,Hospitalizations,Info On Hospitalizations,Deaths,Info On Deaths,Food Vehicle,Food Contaminated Ingredient,IFSAC Category,Water Exposure,Water Type,Animal Type
0,2023,1,Minnesota,Food,Norovirus Genogroup IX,GII.P15 GIX.1,Confirmed,Restaurant: Sit-down dining,23,0.0,23.0,0.0,23.0,,,,,,
1,2023,1,Massachusetts,Indeterminate/unknown,Norovirus,,Suspected,Long-term care/nursing home/assisted living facility,7,0.0,0.0,0.0,0.0,,,,,,
2,2023,1,North Carolina,Person-to-person,Norovirus unknown,,Confirmed,Long-term care/nursing home/assisted living facility,23,1.0,23.0,0.0,23.0,,,,,,
3,2023,1,Wisconsin,Person-to-person,Norovirus unknown;Clostridium difficile,,Confirmed;Confirmed,Long-term care/nursing home/assisted living facility,12,0.0,12.0,0.0,12.0,,,,,,
4,2023,1,Wisconsin,Person-to-person,,,,Long-term care/nursing home/assisted living facility,9,0.0,9.0,0.0,9.0,,,,,,


In [10]:
def show_missing(df):
    """
    Takes a dataframe and returns a dataframe with stats
    on missing and null values with their percentages.
    """
    null_count = df.isnull().sum()
    null_percentage = (null_count / df.shape[0]) * 100
    empty_count = pd.Series(((df == ' ') | (df == '')).sum())
    empty_percentage = (empty_count / df.shape[0]) * 100
    nan_count = pd.Series(((df == 'nan') | (df == 'NaN')).sum())
    nan_percentage = (nan_count / df.shape[0]) * 100
    dfx = pd.DataFrame({'num_missing': null_count, 'missing_percentage': null_percentage,
                         'num_empty': empty_count, 'empty_percentage': empty_percentage,
                         'nan_count': nan_count, 'nan_percentage': nan_percentage})
    return dfx
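
The three checks in `show_missing` can be tried on a small frame before running them on the full dataset. This is a minimal sketch: the `toy` frame and its values are made up for illustration and are not taken from the NORS file.

```python
import pandas as pd

# Hypothetical toy data: one real null, one empty string, one literal "nan" string.
toy = pd.DataFrame({
    "Etiology": ["Norovirus", None, "", "nan"],
    "Illnesses": [23, 7, 12, 9],
})

# Same three checks as show_missing, computed per column.
null_count = toy.isnull().sum()
empty_count = ((toy == " ") | (toy == "")).sum()
nan_text_count = ((toy == "nan") | (toy == "NaN")).sum()

summary = pd.DataFrame({
    "num_missing": null_count,
    "num_empty": empty_count,
    "nan_count": nan_text_count,
})
print(summary)
```

Each kind of missing marker is counted separately, so a column can look complete under `isnull()` while still holding empty strings or the text `"nan"`.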

In [11]:
show_missing(df)

Unnamed: 0,num_missing,missing_percentage,num_empty,empty_percentage,nan_count,nan_percentage
Year,0,0.0,0,0.0,0,0.0
Month,0,0.0,0,0.0,0,0.0
State,0,0.0,0,0.0,0,0.0
Primary Mode,0,0.0,0,0.0,0,0.0
Etiology,11240,21.743336,0,0.0,0,0.0
Serotype or Genotype,37738,73.00267,0,0.0,0,0.0
Etiology Status,11240,21.743336,0,0.0,0,0.0
Setting,5404,10.453824,0,0.0,0,0.0
Illnesses,0,0.0,0,0.0,0,0.0
Hospitalizations,4264,8.248539,0,0.0,0,0.0


In [12]:
def get_values(df, columns):
    """
    Take a dataframe and a list of columns and
    returns the value counts for the columns.
    """
    for column in columns:
        print('=====================================')
        print(df[column].value_counts(dropna=False))
        print('\n')

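def show_values(df, param):
    """
    Print the value counts for every column when param is 'all',
    otherwise for the list of columns given in param.
    """
    if param == 'all':
        get_values(df, df.columns)
    else:
        get_values(df, param)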
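
The `dropna=False` argument is what makes missing values show up in these counts. A minimal sketch on a made-up series (the values are illustrative only):

```python
import pandas as pd

# Hypothetical toy column with one missing value.
toy = pd.Series(["Food", "Food", None, "Water"], name="Primary Mode")

# dropna=False keeps the missing value as its own category.
with_missing = toy.value_counts(dropna=False)

# The default (dropna=True) silently leaves it out.
without_missing = toy.value_counts()

print(with_missing)
print(without_missing)
```

With the default setting the counts sum to 3 rather than 4, which would hide how sparse a column is.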

In [13]:
df.head(100)

Unnamed: 0,Year,Month,State,Primary Mode,Etiology,Serotype or Genotype,Etiology Status,Setting,Illnesses,Hospitalizations,Info On Hospitalizations,Deaths,Info On Deaths,Food Vehicle,Food Contaminated Ingredient,IFSAC Category,Water Exposure,Water Type,Animal Type
0,2023,1,Minnesota,Food,Norovirus Genogroup IX,GII.P15 GIX.1,Confirmed,Restaurant: Sit-down dining,23,0.0,23.0,0.0,23.0,,,,,,
1,2023,1,Massachusetts,Indeterminate/unknown,Norovirus,,Suspected,Long-term care/nursing home/assisted living facility,7,0.0,0.0,0.0,0.0,,,,,,
2,2023,1,North Carolina,Person-to-person,Norovirus unknown,,Confirmed,Long-term care/nursing home/assisted living facility,23,1.0,23.0,0.0,23.0,,,,,,
3,2023,1,Wisconsin,Person-to-person,Norovirus unknown;Clostridium difficile,,Confirmed;Confirmed,Long-term care/nursing home/assisted living facility,12,0.0,12.0,0.0,12.0,,,,,,
4,2023,1,Wisconsin,Person-to-person,,,,Long-term care/nursing home/assisted living facility,9,0.0,9.0,0.0,9.0,,,,,,
5,2023,1,Wisconsin,Person-to-person,Norovirus unknown,,Suspected,Long-term care/nursing home/assisted living facility,4,0.0,4.0,0.0,4.0,,,,,,
6,2023,1,Wisconsin,Person-to-person,Norovirus Genogroup II,,Suspected,Long-term care/nursing home/assisted living facility,18,0.0,18.0,0.0,18.0,,,,,,
7,2023,1,Alabama,Indeterminate/unknown,,,,School/college/university,7,1.0,7.0,0.0,7.0,,,,,,
8,2023,1,Ohio,Food,Norovirus Genogroup II,GII.P untypeable GII.6,Suspected,Restaurant: Sit-down dining,2,0.0,2.0,0.0,2.0,,,,,,
9,2023,1,Minnesota,Person-to-person,Norovirus unknown,,Suspected,Long-term care/nursing home/assisted living facility,11,,0.0,,0.0,,,,,,


In [14]:
show_values(df, ['State',
                 'Primary Mode',
                 'Etiology',
                 'Serotype or Genotype',
                 'Etiology Status',
                 'Setting',
                 'Food Vehicle',
                 'Food Contaminated Ingredient',
                 'IFSAC Category',
                 'Water Exposure',
                 'Water Type',
                 'Animal Type'])

State
Wisconsin               3564
Ohio                    2856
Virginia                2751
Illinois                2654
Minnesota               2623
Pennsylvania            2605
New York                2570
Michigan                2421
Massachusetts           2286
Oregon                  2279
Colorado                1624
Texas                   1358
California              1198
North Carolina          1170
South Carolina          1165
Arizona                 1090
Florida                  944
Rhode Island             927
Multistate               888
Connecticut              885
Maine                    845
Kentucky                 843
Iowa                     820
Tennessee                816
New Hampshire            796
Washington               777
West Virginia            746
Alabama                  675
Nevada                   620
Indiana                  608
Utah                     567
Kansas                   539
Missouri                 514
Nebraska                 497
Montana 

In [15]:
show_missing(df)

Unnamed: 0,num_missing,missing_percentage,num_empty,empty_percentage,nan_count,nan_percentage
Year,0,0.0,0,0.0,0,0.0
Month,0,0.0,0,0.0,0,0.0
State,0,0.0,0,0.0,0,0.0
Primary Mode,0,0.0,0,0.0,0,0.0
Etiology,11240,21.743336,0,0.0,0,0.0
Serotype or Genotype,37738,73.00267,0,0.0,0,0.0
Etiology Status,11240,21.743336,0,0.0,0,0.0
Setting,5404,10.453824,0,0.0,0,0.0
Illnesses,0,0.0,0,0.0,0,0.0
Hospitalizations,4264,8.248539,0,0.0,0,0.0


Let's take note of the columns that we want to keep and the ones that have multiple entries and need to be exploded.

In [16]:
columns_to_keep = ['Year',
                   'Month',
                   'State',
                   'Primary Mode',
                   'Etiology',
                   'Setting',
                   'Illnesses',
                   'Hospitalizations',
                   'Info On Hospitalizations',
                   'Deaths',
                   'Info On Deaths',
                   'Food Vehicle',
                   'Food Contaminated Ingredient',
                   'IFSAC Category',
                   'Water Exposure',
                   'Water Type',
                   'Animal Type'
                   ]

columns_to_explode = ['Etiology',
                       'Setting',
                       'Food Vehicle',
                       'Food Contaminated Ingredient',
                       'Water Exposure',
                       'Water Type',
                       'Animal Type']

In [17]:
df = df[columns_to_keep]

In [18]:
df.shape

(51694, 17)

In [19]:
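# Split each multi-valued column on ';' and give every value its own row.
# Rows that come from the same outbreak keep the same index label.
for colname in columns_to_explode:
    df = df.assign(**{colname: df[colname].str.split(';')}).explode(colname)

df.shape
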
(58577, 17)
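
One consequence of exploding the columns one after another is worth keeping in mind: an outbreak with several values in more than one column ends up with one row per combination, and its numeric columns are repeated on each of those rows. The sketch below illustrates this on a made-up frame; the `Setting` values with a `;` are hypothetical, and the de-duplication step is only one possible way to avoid double counting, not something the notebook itself does.

```python
import pandas as pd

# Hypothetical toy data: outbreak 0 has two etiologies and two settings.
toy = pd.DataFrame({
    "Etiology": ["Norovirus unknown;Clostridium difficile", "Salmonella enterica"],
    "Setting": ["Restaurant;Private home", "School"],
    "Illnesses": [12, 5],
})

# Same split-and-explode loop as in the notebook.
exploded = toy.copy()
for colname in ["Etiology", "Setting"]:
    exploded = exploded.assign(**{colname: exploded[colname].str.split(";")}).explode(colname)

# Outbreak 0 now appears once per (etiology, setting) combination.
rows_per_outbreak = exploded.index.value_counts().sort_index()

# Summing Illnesses directly counts outbreak 0 four times.
total_illnesses_naive = exploded["Illnesses"].sum()

# Keeping the first row per original index label counts each outbreak once.
total_illnesses_dedup = exploded[~exploded.index.duplicated()]["Illnesses"].sum()

print(exploded)
print(rows_per_outbreak)
print(total_illnesses_naive, total_illnesses_dedup)
```

Counts of categories such as etiology or setting are safe to take from the exploded frame, but totals of `Illnesses`, `Hospitalizations`, or `Deaths` should be computed per original outbreak.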

In [20]:
df.head(100)

Unnamed: 0,Year,Month,State,Primary Mode,Etiology,Setting,Illnesses,Hospitalizations,Info On Hospitalizations,Deaths,Info On Deaths,Food Vehicle,Food Contaminated Ingredient,IFSAC Category,Water Exposure,Water Type,Animal Type
0,2023,1,Minnesota,Food,Norovirus Genogroup IX,Restaurant: Sit-down dining,23,0.0,23.0,0.0,23.0,,,,,,
1,2023,1,Massachusetts,Indeterminate/unknown,Norovirus,Long-term care/nursing home/assisted living facility,7,0.0,0.0,0.0,0.0,,,,,,
2,2023,1,North Carolina,Person-to-person,Norovirus unknown,Long-term care/nursing home/assisted living facility,23,1.0,23.0,0.0,23.0,,,,,,
3,2023,1,Wisconsin,Person-to-person,Norovirus unknown,Long-term care/nursing home/assisted living facility,12,0.0,12.0,0.0,12.0,,,,,,
3,2023,1,Wisconsin,Person-to-person,Clostridium difficile,Long-term care/nursing home/assisted living facility,12,0.0,12.0,0.0,12.0,,,,,,
4,2023,1,Wisconsin,Person-to-person,,Long-term care/nursing home/assisted living facility,9,0.0,9.0,0.0,9.0,,,,,,
5,2023,1,Wisconsin,Person-to-person,Norovirus unknown,Long-term care/nursing home/assisted living facility,4,0.0,4.0,0.0,4.0,,,,,,
6,2023,1,Wisconsin,Person-to-person,Norovirus Genogroup II,Long-term care/nursing home/assisted living facility,18,0.0,18.0,0.0,18.0,,,,,,
7,2023,1,Alabama,Indeterminate/unknown,,School/college/university,7,1.0,7.0,0.0,7.0,,,,,,
8,2023,1,Ohio,Food,Norovirus Genogroup II,Restaurant: Sit-down dining,2,0.0,2.0,0.0,2.0,,,,,,


In [21]:
df.shape

(58577, 17)

In [22]:
df.to_csv('../data/outbreaks.csv')
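
Because `to_csv` writes the index by default, the repeated index labels left by `explode` are saved as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. A minimal sketch of one alternative, assuming the index should be kept as an outbreak identifier: the name `outbreak_id` is hypothetical, and an in-memory buffer stands in for the file on disk.

```python
import io
import pandas as pd

# Hypothetical exploded frame: index label 3 is repeated for one outbreak.
exploded = pd.DataFrame(
    {"Etiology": ["Norovirus unknown", "Clostridium difficile", "Norovirus Genogroup II"],
     "Illnesses": [12, 12, 18]},
    index=[3, 3, 6],
)

# Default behaviour: the index is written as an unnamed first column.
default_buffer = io.StringIO()
exploded.to_csv(default_buffer)
default_buffer.seek(0)
default_columns = pd.read_csv(default_buffer).columns.tolist()

# Naming the index first gives the column a readable header.
named_buffer = io.StringIO()
exploded.rename_axis("outbreak_id").to_csv(named_buffer)
named_buffer.seek(0)
restored = pd.read_csv(named_buffer, index_col="outbreak_id")

print(default_columns)
print(restored)
```

Keeping the identifier makes it possible to group the exploded rows back into their original outbreaks later.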