<h1><center>Performance Assessment: Exploratory Data Analysis (OEM2)</center></h1>
<h3><center>by Bader Ale</center></h3>

For this Performance Assessment, I will be using the medical data contained in the D207 Definitions and Datafile directory.

# Part 1: Research Question and Variables
The research question for this analysis is:
**Is there a relationship between the number of times the primary physician visited the patient during their hospital stay and the occurrence of readmission within 30 days of the patient's discharge from the facility?**

The first thing we have to do is import the original CSV file that contains our data. To do this, we must first import the necessary packages.

In [6]:
# Importing Libraries
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell # Importing so we can run multiple lines in one cell
InteractiveShell.ast_node_interactivity = "all" # Display the output of every expression in a cell, not just the last one

In [7]:
# Reading in the original CSV file
df = pd.read_csv(r'F:\GitHub Repos\WGU_MSDA\D207_Exploratory Data Analysis\medical_clean.csv')

After importing our CSV file, we will see the first 5 records of our dataframe and see the overall shape/size.

In [8]:
# Returning first 5 records of dataframe
df.head(5)

Unnamed: 0,CaseOrder,Customer_id,Interaction,UID,City,State,County,Zip,Lat,Lng,...,TotalCharge,Additional_charges,Item1,Item2,Item3,Item4,Item5,Item6,Item7,Item8
0,1,C412403,8cd49b13-f45a-4b47-a2bd-173ffa932c2f,3a83ddb66e2ae73798bdf1d705dc0932,Eva,AL,Morgan,35621,34.3496,-86.72508,...,3726.70286,17939.40342,3,3,2,2,4,3,3,4
1,2,Z919181,d2450b70-0337-4406-bdbb-bc1037f1734c,176354c5eef714957d486009feabf195,Marianna,FL,Jackson,32446,30.84513,-85.22907,...,4193.190458,17612.99812,3,4,3,4,4,4,3,3
2,3,F995323,a2057123-abf5-4a2c-abad-8ffe33512562,e19a0fa00aeda885b8a436757e889bc9,Sioux Falls,SD,Minnehaha,57110,43.54321,-96.63772,...,2434.234222,17505.19246,2,4,4,4,3,4,3,3
3,4,A879973,1dec528d-eb34-4079-adce-0d7a40e82205,cd17d7b6d152cb6f23957346d11c3f07,New Richland,MN,Waseca,56072,43.89744,-93.51479,...,2127.830423,12993.43735,3,5,5,3,4,5,5,5
4,5,C544523,5885f56b-d6da-43a3-8760-83583af94266,d2f0425877b10ed6bb381f3e2579424a,West Point,VA,King William,23181,37.59894,-76.88958,...,2113.073274,3716.525786,2,1,3,3,5,3,4,3


In [9]:
# Returning number of (rows, columns)
df.shape

(10000, 50)

Our dataframe has a total of 10,000 rows and 50 columns. Next, we will return a list of all variables and their datatypes.

In [10]:
# Return variables, datatypes and non-null status of each.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 50 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   CaseOrder           10000 non-null  int64  
 1   Customer_id         10000 non-null  object 
 2   Interaction         10000 non-null  object 
 3   UID                 10000 non-null  object 
 4   City                10000 non-null  object 
 5   State               10000 non-null  object 
 6   County              10000 non-null  object 
 7   Zip                 10000 non-null  int64  
 8   Lat                 10000 non-null  float64
 9   Lng                 10000 non-null  float64
 10  Population          10000 non-null  int64  
 11  Area                10000 non-null  object 
 12  TimeZone            10000 non-null  object 
 13  Job                 10000 non-null  object 
 14  Children            10000 non-null  int64  
 15  Age                 10000 non-null  int64  
 16  Incom

For our research question, the pertinent variables are _ReAdmis_ and _Doc_visits_.

# Part 2: Detection and Treatment of Duplicates
Our first task is to detect and treat any duplicated values in our entire dataset.

In [None]:
# Returning a total count of duplicated values
df.duplicated().value_counts()

Here we can see there are **no** duplicated rows, represented by the "False 10000" output (which matches the total number of rows returned by `.shape`). We can now move to the next section of data cleaning: detection and treatment of missing values.
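
As a small illustration of what `duplicated()` reports when repeats do exist, the sketch below uses a made-up three-row frame (the column names and values are hypothetical and are not taken from the medical data). It counts the repeated rows and then drops them, which is one common treatment.

```python
import pandas as pd

# Toy frame (hypothetical values, not the medical data): rows 0 and 2 are identical
toy = pd.DataFrame({"id": [1, 2, 1], "city": ["Eva", "Marianna", "Eva"]})

# duplicated() flags every row that repeats an earlier row
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())  # number of repeated rows

# drop_duplicates() keeps the first occurrence of each distinct row
deduped = toy.drop_duplicates()

print(n_duplicates)
print(deduped.shape)
```

Summing the boolean mask gives the count directly, which can be easier to read than `value_counts()` when only the number of duplicates matters.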

# Part 3: Detection and Treatment of Missing Values
In this section, we will see if there are any missing values for all variables in the dataset.

In [None]:
# Returning a list of variables with total counts for missing values 
df.isnull().sum()

Here we can see that there are 7 columns with missing values: **Children**, **Age**, **Income**, **Soft_drink**, **Overweight**, **Anxiety** and **Initial_days**.
<br>
<br>1) **Children** and **Age** are considered *discrete quantitative variables* because they can only take whole-number values
<br>2) **Income** and **Initial_days** are considered *continuous quantitative variables* because they can take any value within a range, including decimals
<br>3) **Overweight**, **Soft_drink** and **Anxiety** are considered *nominal qualitative variables* because they record a yes/no category with no inherent order

We will return some basic statistics on the quantitative variables so that we can compare them before and after imputation.

In [None]:
# Checking statistical information on the quantitative columns with missing data
df[['Children', 'Age', 'Income','Initial_days']].describe()

Using the seaborn package, we can create histograms of the quantitative variables **Children**, **Age**, **Income** and **Initial_days** to visually
analyze their distributions. Before doing so, we must import the seaborn package into our notebook.

In [None]:
# Importing seaborn package with the inline magic function
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline 

In [None]:
#Plotting histograms
sns.displot(df, x='Children')
sns.displot(df, x='Age')
sns.displot(df, x='Income')
sns.displot(df, x='Initial_days')

From these graphs, we can see that:

1) Both **Children** and **Income** are positively skewed (skewed to the right)
2) **Age** is uniformly distributed 
3) **Initial_days** has a bimodal distribution

For **Income**, **Children** and **Initial_days** variables, we will treat missing values by imputation using the median value while for **Age** we will be using the mean for imputation.
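
The reasoning behind choosing the median for skewed variables can be sketched with a few made-up numbers (hypothetical values, not the medical data): a single extreme observation pulls the mean far from the bulk of the data, while the median stays at the middle observation.

```python
import statistics

# Hypothetical right-skewed incomes with one very large value
incomes = [20_000, 25_000, 30_000, 35_000, 400_000]

mean_income = statistics.mean(incomes)      # pulled upward by the extreme value
median_income = statistics.median(incomes)  # stays at the middle observation

print(mean_income, median_income)
```

For a roughly symmetric variable such as **Age**, the mean and median are close, so either choice has a similar effect on the distribution.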

In [None]:
# Performing imputation 
# Assigning the result back to the column avoids relying on inplace=True on a column selection
df['Children'] = df['Children'].fillna(df['Children'].median()) # Using median value for Children
df['Income'] = df['Income'].fillna(df['Income'].median()) # Using median value for Income
df['Initial_days'] = df['Initial_days'].fillna(df['Initial_days'].median()) # Using median value for Initial_days
df['Age'] = df['Age'].fillna(df['Age'].mean()) # Using mean value for Age

In [None]:
# Checking statistics again for comparison
df[['Children', 'Age', 'Income','Initial_days']].describe()

In [None]:
# Plotting histograms again to check for changes in skewness
sns.displot(df, x='Children')
sns.displot(df, x='Age')
sns.displot(df, x='Income')
sns.displot(df, x='Initial_days')

Here we can see that the shape of each distribution is largely preserved after imputation, as shown by both the summary statistics and the histograms.
Now we will focus on the remaining categorical variables **Overweight**, **Anxiety**, and **Soft_drink**.


In [None]:
#Detecting amount of missing values in each
print(df['Overweight'].isnull().value_counts())
print('') 
print(df['Soft_drink'].isnull().value_counts())
print('')
print(df['Anxiety'].isnull().value_counts())

Let's begin with the **Overweight** column, which is of the "0 or 1" kind. To treat its missing values, we will calculate the percentage of each category (0 and 1) and impute the missing values with the most frequent category.

In [None]:
# Determining value counts for each category in the Overweight column
df['Overweight'].value_counts(normalize=True, sort=True) # We are using the parameter normalize to return the proportions rather than frequencies

We can see a nearly 70/30 split for 1:0 ratio - this means we can impute the missing values using '1'

In [None]:
# Treating missing values in Overweight column with '1' (highest percentage category)
df['Overweight'] = df['Overweight'].fillna(1)
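
The same "most frequent category" imputation can also be written without hard-coding the fill value. The following is a minimal sketch on a made-up series (hypothetical values, not the medical data), using `mode()` to look up the most common category.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 column with missing entries (not the medical data)
flag = pd.Series([1, 1, 0, np.nan, 1, np.nan])

# mode() ignores NaN by default and returns the most frequent value(s)
most_common = flag.mode()[0]

# Fill the gaps with that value and keep the result in a new series
flag_filled = flag.fillna(most_common)

print(most_common)
print(flag_filled.tolist())
```

Looking the value up this way keeps the cell correct if the proportions change, at the cost of hiding the explicit 70/30 check that the notebook performs by eye.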

Now we will follow the same procedure for the **Anxiety** column

In [None]:
# Determining value counts for each category in the Anxiety column
df['Anxiety'].value_counts(normalize=True, sort=True) # We are using the parameter normalize to return the proportions rather than frequencies

In [None]:
# Treating missing values in Anxiety column with '0' (highest percentage category)
df['Anxiety'] = df['Anxiety'].fillna(0)

We will now focus on the **Soft_drink** column, which is of the 'Yes/No' type. We will first fill its missing values using the same method used above for **Overweight** and **Anxiety**, and then re-express the variable as '0/1' using ordinal encoding.

In [None]:
# We will run the .unique() function to determine number of unique values for specified variables
df['Soft_drink'].unique()

We see that the unique values are 'No', 'Yes' and _nan_ for missing values. We will use the percentage method mentioned above to fill these in.

In [None]:
# Determining value counts for each category in the Soft_drink column
df['Soft_drink'].value_counts(normalize=True, sort=True, dropna=True) # not including the NaN values

Here we see a roughly 75/25 split in favor of 'No', so we will use 'No' to fill in the missing values.

In [None]:
# Treating missing values in Soft_drink column with 'No' (highest percentage category)
df['Soft_drink'] = df['Soft_drink'].fillna('No')

In [None]:
# Rechecking for any missing values
df['Soft_drink'].unique()

We will now begin the ordinal encoding process...

In [None]:
# Replicate the variable in preparation for replacing its categorical values with
# numeric ones. This replicated variable will store the re-expressed values once converted

df['Soft_drink_numeric'] = df['Soft_drink']

In [None]:
# Checking duplicated column 'Soft_drink_numeric' vs original "Soft_drink"
df[['Soft_drink_numeric', 'Soft_drink']]

In [None]:
# Set up a dictionary specifically for converting the categorical values to numeric values.
dict_soft_drink = {'Soft_drink_numeric': {'No': 0, 'Yes': 1}}

In [None]:
# Use the dictionary to replace the variable's values. The replace function will replace the values according to the rules in
# the dictionary dict_soft_drink and store them in the existing data frame.
df.replace(dict_soft_drink, inplace=True)
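
An alternative way to express the same re-encoding is `Series.map`, shown here as a sketch on a made-up series (hypothetical values, not the medical data). It builds the numeric column in one step instead of copying the column and then replacing its values.

```python
import pandas as pd

# Hypothetical Yes/No column (not the medical data)
soft_drink = pd.Series(["No", "Yes", "No", "No"])

# map() applies the dictionary element-wise; any value missing from the
# dictionary would become NaN, which makes unexpected categories easy to spot
soft_drink_numeric = soft_drink.map({"No": 0, "Yes": 1})

print(soft_drink_numeric.tolist())
```

The two approaches give the same result when every value appears in the dictionary; they differ only in how unmapped values are handled (`replace` leaves them unchanged, `map` turns them into NaN).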

We have detected and treated all missing values. We will now run `.info()` on the dataframe to confirm that no null values remain.

In [None]:
# Checking info on dataframe
df.info()

As one can see, there are no more variables with missing values.

# Part 4: Detection and Treatment of Outliers
In this section, we will check for any outliers in the quantitative variables and treat them accordingly.
Let's first visualize the boxplots for the variables.

In [None]:
# Children boxplot
sns.boxplot(x='Children', data=df)

In [None]:
# Age boxplot
sns.boxplot(x='Age', data=df)

In [None]:
# Income boxplot
sns.boxplot(x='Income', data=df)

In [None]:
# Initial_days boxplot
sns.boxplot(x='Initial_days', data=df)

Here we can see that **Age** and **Initial_days** have no outliers while **Children** has 4 outliers and **Income** has multiple. Since we do not know if the outliers are factual errors, we will first extract the outliers, save them as their own dataframe and then remove them from the original dataframe.
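
For context on what the boxplots flag: with seaborn's default settings, points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles are drawn as outliers. The sketch below reproduces that rule on made-up numbers (hypothetical values, not the medical data). The exact quartile values can differ slightly between libraries, so this is an approximation of the plot rather than a replacement for the z-score method used in this notebook.

```python
import numpy as np

# Hypothetical values with one obvious extreme (not the medical data)
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones a boxplot would mark as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(lower_fence, upper_fence)
print(outliers.tolist())
```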

In [None]:
# Quantity of outliers in the Income column
df['Income'][df['Income'] > 46466.7975].count()

In [None]:
# Getting basic statistical info on Income and Children columns
df['Income'].describe()
df['Children'].describe()


We will use z-scores to extract all records whose z-score has an absolute value greater than 3 (that is, below -3 or above 3). We must first import the SciPy package.

In [None]:
# Importing SciPy package
import scipy.stats as stats

In [None]:
# Creating a new column for the Income z-scores and Children z-scores
df['Income_z_Scores'] = stats.zscore(df['Income'])
df['Children_z_Scores'] = stats.zscore(df['Children'])
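
To make the z-score calculation concrete, here is a minimal sketch that computes it by hand on made-up numbers (hypothetical values, not the medical data). It uses the population standard deviation (`ddof=0`), which is also the default used by `scipy.stats.zscore`.

```python
import numpy as np

# Hypothetical values with mean 5 and population standard deviation 2
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# z = (x - mean) / standard deviation
z = (values - values.mean()) / values.std(ddof=0)

# Records whose z-score lies outside [-3, 3] would be treated as outliers
is_outlier = np.abs(z) > 3

print(z.tolist())
print(int(is_outlier.sum()))
```

None of these toy values is far enough from the mean to be flagged; a value would need to be more than three standard deviations away.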

In [None]:
# Viewing first 10 records for both Income and Children z-score columns
df[['Income', 'Income_z_Scores']].head(10)
df[['Children', 'Children_z_Scores']].head(10)

In [None]:
# Extracting records with z-scores below -3 or above 3 and saving them as new variables
income_outliers = df.query('Income_z_Scores < -3 | Income_z_Scores > 3')
children_outliers = df.query('Children_z_Scores < -3 | Children_z_Scores > 3')

In [None]:
# Creating a dataframe with Children and Income outliers removed and saving as df_new
df_new = df[(df['Income_z_Scores'] > -3) & (df['Income_z_Scores'] < 3) & (df['Children_z_Scores'] > -3) & (df['Children_z_Scores'] < 3)]

In [None]:
# Counting any records left in df_new with a z-score below -3 or above 3 in the Children and Income columns (both counts should be 0)
df_new['Children_z_Scores'].loc[lambda x : (x < -3) | (x > 3)].count()
df_new['Income_z_Scores'].loc[lambda x : (x < -3) | (x > 3)].count()

### Extraction to CSV

We will now extract our treated dataframe into a CSV file

In [None]:
# Extracting dataframe 'df_new' into a CSV file named Medical_Data_Treated.csv
df_new.to_csv('Medical_Data_Treated.csv', index=True) # to_csv returns None when given a file path, so the result is not assigned

# Part 5: Principal Component Analysis (PCA)


In [None]:
from sklearn.decomposition import PCA # to apply PCA
from sklearn.preprocessing import StandardScaler # to standardize features
import numpy as np 
import matplotlib.pyplot as plt

Even though it is not required, we will visualize the variable correlations using a heatmap.

In [None]:
# Correlation heatmap
plt.figure(figsize = (10,10)) # Creating slightly larger heatmap for ease of visualization
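corr = df_new.select_dtypes(include=np.number).corr() # Correlations of the numeric columns only; text columns cannot be correlated
mask = np.triu(np.ones_like(corr, dtype=bool)) # Creating heatmap mask for the upper triangle
sns.heatmap(corr, square=True, xticklabels=True, yticklabels=True, cmap='vlag', mask=mask) # Heatmap parameters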

In [None]:
# Selecting only numeric columns and saving to new dataframe
df_new_numeric = df_new.select_dtypes(include=np.number) 
df_new_numeric.dtypes # Checking to see if all variables are numeric

In [None]:
# Selecting the variables to perform PCA and naming variable test_pca
test_pca = df_new_numeric[['Population', 'Children', 'Age', 'Income', 'Doc_visits', 'Initial_days', 'TotalCharge', 'Additional_charges', 'VitD_levels', 'Full_meals_eaten']] # 'Doc_visits' is listed once so that no column is duplicated
test_pca

In [None]:
# Normalizing test_pca dataframe
test_pca_normalized = (test_pca-test_pca.mean()) / test_pca.std()
test_pca_normalized

In [None]:
# Initiating pca object
pca = PCA(n_components=test_pca.shape[1])

In [None]:
# Fitting data
pca.fit(test_pca_normalized)

In [None]:
# Creating new dataframe with PCs
pc_names = [f'PC{i + 1}' for i in range(test_pca.shape[1])] # One label per principal component
test_pca2 = pd.DataFrame(pca.transform(test_pca_normalized), columns=pc_names)
test_pca2

In [None]:
# Creating loadings
loadings = pd.DataFrame(pca.components_.T,
                        columns=pc_names,
                        index = test_pca_normalized.columns)
loadings

In [None]:
# Creating covariance matrix and eigenvalues
cov_matrix = np.dot(test_pca_normalized.T, test_pca_normalized) / test_pca.shape[0]
eigenvalues = [np.dot(eigenvector.T, np.dot(cov_matrix,eigenvector)) for eigenvector in pca.components_]
eigenvalues
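
As a cross-check on the idea behind this cell, the sketch below computes eigenvalues directly from the covariance matrix of standardized data using NumPy alone. The data are randomly generated (hypothetical, not the medical data). Because each standardized column has variance 1, the eigenvalues sum to the number of variables, which is what makes the "eigenvalue greater than 1" threshold meaningful. Note that the cell above divides by n while pandas `.std()` divides by n - 1, so its eigenvalues are scaled by (n - 1)/n, a negligible difference at this sample size.

```python
import numpy as np

# Hypothetical data: 200 rows, 3 variables, with two of them correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] += X[:, 0]

# Standardize each column using the sample standard deviation (ddof=1)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance of standardized data (np.cov also uses ddof=1), i.e. the correlation matrix
cov = np.cov(Z, rowvar=False)

# Eigenvalues of a symmetric matrix, sorted from largest to smallest
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]

print(eigenvalues)
print(eigenvalues.sum())
```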

In [None]:
# Plotting eigenvalues
plt.plot(eigenvalues)
plt.xlabel('Number of Components')
plt.ylabel('EigenValues')
plt.axhline(y=1, color='red') # Visualizing eigenvalue = 1 to use in Kaiser Rule
plt.show()
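
The Kaiser rule referenced in the plot can also be applied numerically: keep the components whose eigenvalue is greater than 1. The following is a minimal sketch using made-up eigenvalues (hypothetical values, not the output of the cells above).

```python
# Hypothetical eigenvalues, sorted from largest to smallest
eigenvalues = [2.1, 1.4, 1.05, 0.98, 0.7, 0.4]

# Kaiser rule: retain components with an eigenvalue greater than 1
n_keep = sum(1 for ev in eigenvalues if ev > 1)

# Labels of the retained components, numbered from PC1
kept = [f"PC{i + 1}" for i, ev in enumerate(eigenvalues) if ev > 1]

print(n_keep)
print(kept)
```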