# Descriptive Statistics for Common Datasets
This script can be run on each of the datasets; dataset-specific scripts will be written separately.
It counts and removes duplicate rows, calculates the number of unique values per column, calculates the number and percentage of nulls per column, and counts the rows that share the same set of null columns.

_Author: Jared Gauntt_

## Prepare for Analysis


### Set Parameters

In [1]:
localFolder='C:/Users/jared/Documents/My Files/DAEN 690/Analysis/'
fileName='DAEN 690 2021-02-14 V2.xlsx'
sheetName='Medications'

### Import Libraries

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

### Import From Excel Spreadsheet

In [3]:
#Import single tab 
df=pd.read_excel(localFolder+fileName,sheet_name=sheetName)

#Print (1) number of rows/columns and (2) column names/types for quick confirmation of successful import
numOriginalRows=len(df)
numCols=len(df.columns)
print(sheetName)
print('Number of Rows = '+str(numOriginalRows))
print('Number of Columns = '+str(numCols))
df.dtypes

Medications
Number of Rows = 63311
Number of Columns = 4


PatientId                            int64
Medication_Given_RXCUI_Code        float64
Medication_Given_Description        object
Personnel_Performer_ID_Internal     object
dtype: object

## Numeric Analysis

### Duplicate Rows
Duplicate rows will likely need to be removed during data conditioning.

In [4]:
#Determine which rows are duplicates (True=duplicate, False=first instance of row)
duplicateRowIdentifier=df.duplicated()

#Calculate number of duplicate rows (each True counts as 1 when summed)
numDuplicateRows=int(duplicateRowIdentifier.sum())
numUniqueRows=numOriginalRows-numDuplicateRows

#Calculate percentage of rows that are duplicates
percentDuplicateRows=round(numDuplicateRows/len(df)*100,4)

#Print results
print('Number of Duplicate Rows =')
print(numDuplicateRows)
print('Percentage of Rows that are Duplicates = ')
print(percentDuplicateRows)

Number of Duplicate Rows =
9272
Percentage of Rows that are Duplicates = 
14.6452
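
As a small illustration of what `duplicated()` flags, the sketch below uses a hypothetical toy frame (the values are invented and are not taken from the Medications sheet). With the default `keep='first'`, only the repeats after the first occurrence are flagged, which is why the count above excludes the first instance of each row. Passing `keep=False` flags every member of a duplicated group, which can help when inspecting the groups themselves.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical
toy = pd.DataFrame({
    'PatientId': [1, 2, 1, 1],
    'Medication_Given_Description': ['Aspirin', 'Oxygen', 'Aspirin', 'Aspirin'],
})

# Default keep='first': only repeats after the first occurrence are flagged
flagged_repeats = toy.duplicated()

# keep=False: every row that belongs to a duplicated group is flagged
flagged_all = toy.duplicated(keep=False)

num_repeats = int(flagged_repeats.sum())    # rows that would be removed
num_in_groups = int(flagged_all.sum())      # rows involved in any duplication
print(num_repeats, num_in_groups)
```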


Duplicate rows will be removed before conducting the remaining analysis in this script.

In [5]:
#Reduce to the rows that were not flagged as duplicates (~ inverts the boolean mask)
df=df.loc[~duplicateRowIdentifier,:]

#Confirm
print('Expected Number of Rows = '+str(numUniqueRows))
print('Updated Data Frame Shape =')
print(df.shape)

Expected Number of Rows = 54039
Updated Data Frame Shape =
(54039, 4)
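
For reference, filtering with the inverted `duplicated()` mask should give the same result as calling `drop_duplicates()` directly. The sketch below checks this on a hypothetical toy frame; the code values are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0
toy = pd.DataFrame({
    'PatientId': [1, 2, 1],
    'Medication_Given_RXCUI_Code': [10.0, 20.0, 10.0],
})

# Approach used in this script: keep rows that are not flagged as duplicates
via_mask = toy.loc[~toy.duplicated(), :]

# One-step alternative: drop_duplicates() also keeps the first occurrence by default
via_drop = toy.drop_duplicates()

same_result = via_mask.equals(via_drop)
print(same_result)
```

The mask-based approach has the advantage that the duplicate count is available before any rows are removed.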


### Unique Values Per Column
This is a simple calculation of the number of unique values per individual column.

In [6]:
#Calculate the number of unique values per column (dropna=False counts NULL as a value)
dsNumUnique=df.nunique(dropna=False)
dsNumUnique.name='Number of Unique Values'
print(dsNumUnique)

PatientId                          38501
Medication_Given_RXCUI_Code           33
Medication_Given_Description          33
Personnel_Performer_ID_Internal      845
Name: Number of Unique Values, dtype: int64
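
Because `dropna=False` is used, a column that contains nulls has the null counted as one of its unique values. The two medication columns each contain nulls, so their count of 33 includes the null. The sketch below shows the difference on a hypothetical toy frame with invented column names.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: column 'b' contains nulls, column 'a' does not
toy = pd.DataFrame({
    'a': [1, 1, 2],
    'b': ['x', np.nan, np.nan],
})

# Default behaviour: nulls are ignored
unique_excluding_null = toy.nunique()

# dropna=False: a null counts as one additional unique value
unique_including_null = toy.nunique(dropna=False)

print(unique_excluding_null)
print(unique_including_null)
```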


### Null Values Per Column
This is a simple calculation of the number/percentage of null values per individual column.

In [7]:
#Calculate the number of null values per column
dsNumNull=df.isnull().sum()
dsNumNull.name='Number Rows With Nulls'
print(dsNumNull)

PatientId                          0
Medication_Given_RXCUI_Code        6
Medication_Given_Description       6
Personnel_Performer_ID_Internal    0
Name: Number Rows With Nulls, dtype: int64


In [8]:
#Calculate the percentage of null values per column
dsPercentNull=(dsNumNull/numUniqueRows*100).round(2)
dsPercentNull.name='Percent Rows With Nulls'
print(dsPercentNull)

PatientId                          0.00
Medication_Given_RXCUI_Code        0.01
Medication_Given_Description       0.01
Personnel_Performer_ID_Internal    0.00
Name: Percent Rows With Nulls, dtype: float64
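
As a cross-check on the percentage calculation, the mean of a boolean null mask is the fraction of null rows, so the count and the division can be done in one step. This is a minimal sketch on a hypothetical toy frame, not a replacement for the cells above.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: one null out of four rows in column 'a'
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': ['w', 'x', 'y', 'z'],
})

# Two-step form used in this script: count nulls, then divide by the row count
percent_two_step = (toy.isnull().sum() / len(toy) * 100).round(2)

# One-step form: the mean of True/False values is the fraction of True
percent_one_step = (toy.isnull().mean() * 100).round(2)

print(percent_one_step)
```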


In [9]:
#Merge data series (by column) together
dfPerCol=pd.DataFrame()
dfPerCol[dsNumUnique.name]=dsNumUnique
dfPerCol[dsNumNull.name]=dsNumNull
dfPerCol[dsPercentNull.name]=dsPercentNull
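
An alternative way to combine named series into one table is `pd.concat` with `axis=1`, which uses each series name as the column header. The sketch below uses hypothetical toy series with invented values.

```python
import pandas as pd

# Hypothetical per-column results, indexed by column name
num_unique = pd.Series({'a': 3, 'b': 2}, name='Number of Unique Values')
num_null = pd.Series({'a': 0, 'b': 1}, name='Number Rows With Nulls')

# axis=1 places the series side by side, aligned on their shared index
combined = pd.concat([num_unique, num_null], axis=1)
print(combined)
```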

### Null Values Per Row
For each row, this section determines which columns contain a null value, producing a data series with one entry per unique row in the dataset. Each entry is a tuple of column names. The resulting table lists each distinct tuple of column names with its corresponding row count. Once the data subsets for the project questions are determined, this will help assess the completeness of those subsets.

In [10]:
#For a single row, determine which columns have null values
def NullsPerRow(dsRow):
    columnsNull=list(dsRow[dsRow.isnull()].index)
    columnsNull.sort()
    columnsNull=tuple(columnsNull) #tuples are hashable, so pandas unique() can process them; lists cannot be used
    return columnsNull
dsNulls=df.apply(NullsPerRow,axis=1)

#Create data frame for counting 
dfNulls=pd.DataFrame(dsNulls.unique(),columns=['Columns With Null'])
dfNulls['Number of Rows']=0

#Count the number of rows per each tuple of null columns
for index in dfNulls.index:
    dfNulls.loc[index,'Number of Rows']=len(dsNulls[dsNulls==dfNulls.loc[index,'Columns With Null']])
dfNulls.sort_values(by='Number of Rows',ascending=False,inplace=True)
dfNulls.reset_index(drop=True,inplace=True)

#Add a column for percent of rows
dfNulls['Percent of Rows']=(dfNulls['Number of Rows']/numUniqueRows*100).round(2)
dfNulls

| | Columns With Null | Number of Rows | Percent of Rows |
|---|---|---|---|
| 0 | () | 54033 | 99.99 |
| 1 | (Medication_Given_Description, Medication_Given_RXCUI_Code) | 6 | 0.01 |
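
The same per-row null pattern count can also be sketched without the explicit counting loop. The version below is an illustration under stated assumptions, not the method used above: it builds the tuples from the boolean null mask and tallies them with `collections.Counter`. The toy frame and all names in it are hypothetical.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 3 are null in both medication columns
toy = pd.DataFrame({
    'PatientId': [1, 2, 3, 4, 5],
    'Medication_Given_RXCUI_Code': [10.0, np.nan, 20.0, np.nan, 30.0],
    'Medication_Given_Description': ['A', None, 'B', None, 'C'],
})

# For each row, collect the sorted names of the columns that are null
null_mask = toy.isnull()
patterns = [tuple(sorted(null_mask.columns[row])) for row in null_mask.to_numpy()]

# Tally identical patterns; most_common() orders them by descending count
pattern_counts = Counter(patterns).most_common()

summary = pd.DataFrame({
    'Columns With Null': [pattern for pattern, _ in pattern_counts],
    'Number of Rows': [count for _, count in pattern_counts],
})
summary['Percent of Rows'] = (summary['Number of Rows'] / len(toy) * 100).round(2)
print(summary)
```

An empty tuple means the row has no nulls, matching the `()` entry in the table above.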


## Visualizations

In [11]:
print(sheetName+' Dataset\n')
print('Original Number of Rows = '+str(numOriginalRows))
print('Duplicate Number of Rows = '+str(numDuplicateRows))
print('Percent Duplicate Rows = '+str(percentDuplicateRows)+'\n')
print('Duplicate rows removed prior to remaining analysis')
print('Unique Number of Rows = '+str(numUniqueRows)+'\n')

Medications Dataset

Original Number of Rows = 63311
Duplicate Number of Rows = 9272
Percent Duplicate Rows = 14.6452

Duplicate rows removed prior to remaining analysis
Unique Number of Rows = 54039



In [12]:
dfPerCol

| | Number of Unique Values | Number Rows With Nulls | Percent Rows With Nulls |
|---|---|---|---|
| PatientId | 38501 | 0 | 0.0 |
| Medication_Given_RXCUI_Code | 33 | 6 | 0.01 |
| Medication_Given_Description | 33 | 6 | 0.01 |
| Personnel_Performer_ID_Internal | 845 | 0 | 0.0 |


In [13]:
dfNulls

| | Columns With Null | Number of Rows | Percent of Rows |
|---|---|---|---|
| 0 | () | 54033 | 99.99 |
| 1 | (Medication_Given_Description, Medication_Given_RXCUI_Code) | 6 | 0.01 |
