## Exploratory Data Analysis of Diabetes dataset

### Task Description   
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. 
The objective of the dataset is to diagnostically predict whether a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. The dataset (.csv file) contains several independent variables (medical predictor variables) and a single dependent target variable (Outcome).
Note: Please keep in mind that the use of Python is needed to showcase your skills and writing.



### Objective of this Notebook
This notebook is an example of how to perform an EDA on a dataset, with the goal of providing clear and easy-to-understand Python code. After following this notebook you will have an understanding of:
* How Python can be used to perform EDA.
* How markdown, code blocks and code comments can be used to explain code.
* How a hypothesis-driven approach to EDA can be helpful for understanding the value of data.




### Notebook Methodology

This notebook will guide the reader through the initial exploration of the dataset provided with the goal of establishing the suitability of the data to be used for further analysis. We will seek to understand the potentially useful features (columns) and gain trust in the suitability of this dataset as the basis of predictive analysis. We will follow the principle of EDA as set out by John Tukey in that we will seek to confirm the expected and show the unexpected. Ultimately we will seek to answer the questions:
1. Is this dataset suitable for analysis of diabetes in patients?
2. Are there features of the dataset that are likely to be useful in the prediction of diabetes in patients?
3. Are there problems in the dataset that should be considered when using this dataset?
4. How should we proceed with creating a predictive model? 

### Notebook Description
In order to gain confidence in the dataset and identify potentially interesting features, we will use the first section of this notebook to load the dataset, explore the size and shape of the data, and understand the type of data each column contains. Any problems in the data, such as unexpected or missing values, will be identified and dealt with. Any resulting problematic columns will be excluded. We conclude this section by making a number of hypotheses based on our initial exploration of the data. The results of testing these hypotheses will guide future work on this dataset.

The second section will look deeper into the contents of the columns that we expect to be useful. This will allow us to confirm or reject our hypotheses from section one and ultimately help us answer the questions "Are there features of the dataset that are likely to be useful in the prediction of diabetes in patients?" and "Are there problems in the dataset that should be considered when using this dataset?". We will check the variability of the independent and dependent variables and look for correlation between the variables. This will allow us to identify potential features for statistical models that help us understand the drivers of the dependent variable.

The final section will summarize the findings, present potential next steps and answer our final question: "How should we proceed with creating a predictive model?".

### Section 1: Understanding the Data

#### Section 1.1 Load and Review Data
We go into the EDA with the expectation that each column contains complete and valid data. We will inspect the columns and rows to find any violations of this expectation and deal with them by removing, changing or ignoring the data, until we are satisfied that we can give a positive answer to our initial question: "Is this dataset suitable for analysis of diabetes in patients?"

In [174]:
# We will use pandas to load, explore and manipulate the data. Our first step is to import the required package.
import pandas as pd

# To make statistics easier to read we set the default display properties to 2 decimal places
pd.options.display.float_format = "{:,.2f}".format

In [175]:

# We will create a function to load the data into a dataframe.
# This allows us to easily re-run this step whenever necessary in case of needing to revert to fresh data.

def load_diabetes_data(data="data/diabetes (1).csv", display_metadata=True):
    '''
    Load diabetes data from a CSV file into a pandas DataFrame.
    Parameters: 
    - data (str): The path to the CSV file containing diabetes data.
    - display_metadata (bool): Whether to print the shape and column info.
    Returns: a pandas DataFrame containing diabetes data from the csv file.

    '''
    diabetes_df = pd.read_csv(data)
    if display_metadata:
        # DataFrame.shape is (rows, columns), in that order
        no_rows, no_cols = diabetes_df.shape
        print(f"Number of columns: {no_cols}")
        print(f"Number of rows: {no_rows}")
        print("Column Info:")
        diabetes_df.info()
        
    return diabetes_df



In [176]:
# We run the function to load the data and display its metadata
diabetes_df = load_diabetes_data()

Number of columns: 9
Number of rows: 768
Column Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   767 non-null    float64
 2   BloodPressure             765 non-null    object 
 3   SkinThickness             764 non-null    float64
 4   Insulin                   767 non-null    float64
 5   BMI                       767 non-null    float64
 6   DiabetesPedigreeFunction  765 non-null    object 
 7   Age                       768 non-null    int64  
 8   Outcome                   766 non-null    float64
dtypes: float64(5), int64(2), object(2)
memory usage: 54.1+ KB


#### Initial understanding of data
In Section 1.1 we have loaded the data and reviewed the metadata. We have found a small number of null values in the **Glucose**, **BloodPressure**, **SkinThickness**, **Insulin**, **BMI**, **DiabetesPedigreeFunction**, and **Outcome** columns.  
We will consider how to handle these in the data cleaning phase, either by removing the affected rows or by ignoring the affected columns.
Based on the metadata we see that the data is mainly numeric, along with two columns of type "object". This usually points to a column that contains mixed data types, missing data or problematic data.  
We note that the **BloodPressure** and **DiabetesPedigreeFunction** columns need to be cleaned so that each has a single data type rather than object. This may be solved while dealing with the null values.
Once we have cleaned the data we can continue to see if we can use the data further.
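
The link between a stray text entry and the "object" dtype can be reproduced on a toy example. The sketch below uses a small made-up DataFrame (the values are illustrative and are not rows from the diabetes file) to show how one non-numeric entry turns a whole column into "object", how `isna().sum()` counts nulls per column, and how coercing to numeric exposes the offending text.

```python
import pandas as pd

# Toy data, invented for illustration only (not rows from the diabetes file).
toy_df = pd.DataFrame(
    {
        "Glucose": [148.0, None, 183.0],
        "BloodPressure": [72, "missed value", None],
    }
)

# A column mixing numbers and text is stored with the generic "object" dtype.
print(toy_df.dtypes)

# Count the null values in each column.
null_counts = toy_df.isna().sum()
print(null_counts)

# Coercing to numeric turns unparseable text into NaN.
coerced = pd.to_numeric(toy_df["BloodPressure"], errors="coerce")

# Entries that were present but could not be parsed are the "bad text" values.
bad_text = toy_df.loc[coerced.isna() & toy_df["BloodPressure"].notna(), "BloodPressure"]
print(bad_text.tolist())
```

Separating true nulls from unparseable text in this way makes it easier to decide whether a value should be dropped or repaired.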

#### Section 1.2 Identify Problems in data


We will ready our data for inspection by cleaning null values and fixing data types.
As noted before, **BloodPressure** and **DiabetesPedigreeFunction** are "object" data types. This can be caused by data not conforming to expectations. There were also null values identified in many columns. We will display any problematic and null values in the dataset so we can decide how to clean them before moving on to further analysis of the data.

In [177]:
# We display a sample of rows to understand the data inside the columns.
display(diabetes_df.head(10))
    

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72,35.0,0.0,33.6,0.627,50,1.0
1,1,85.0,66,29.0,0.0,26.6,0.351,31,0.0
2,8,183.0,64,0.0,0.0,23.3,0.672,32,1.0
3,1,89.0,66,23.0,94.0,28.1,0.167,21,0.0
4,0,137.0,40,35.0,168.0,43.1,2.288,33,1.0
5,5,116.0,74,0.0,0.0,25.6,0.201,30,0.0
6,3,78.0,50,32.0,88.0,31.0,0.248,26,1.0
7,10,115.0,0,0.0,0.0,35.3,0.134,29,0.0
8,2,197.0,70,45.0,543.0,30.5,0.158,53,1.0
9,8,125.0,96,0.0,0.0,0.0,0.232,54,1.0


Based on the results above, we expect all the columns to be numeric in some way. To find out which values of the BloodPressure and DiabetesPedigreeFunction columns are non-numeric, 
we should inspect the non-conforming rows.

In [164]:
# Here we make a function that returns the rows that do not conform to a numeric datatype

def non_conform_rows(df, columns):
    """
    Return a new DataFrame containing rows that do not conform to specified data types.

    Parameters:
    - df: The DataFrame to check for non-conforming rows.
    - columns (list): A list of column names to check for data type conformity.
    
    Returns:
    - non_conforming_df (pandas.DataFrame): A new DataFrame containing rows that do not conform to numeric data type.

    Example usage:
    >>> non_conforming_df = non_conform_rows(diabetes_df, columns=["BloodPressure", "DiabetesPedigreeFunction"])
    """
    non_conforming_rows = []
    
    for column in columns:
        non_conforming_rows.append(df[pd.to_numeric(df[column], errors='coerce').isna()])
    
    non_conforming_df = pd.concat(non_conforming_rows, ignore_index=True)
    
    return non_conforming_df

In [255]:
# We call our function with our dataframe and all columns in that dataframe and allow the output to be displayed so we can see what the issues look like
non_conform_rows(diabetes_df, diabetes_df.columns)


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,5,,104,0.0,0.0,37.7,0.151,52,1.0
1,2,100.0,,23.0,0.0,29.7,0.368,21,0.0
2,5,143.0,,0.0,0.0,45.0,0.19,47,
3,7,136.0,,26.0,135.0,26.0,0.647,51,0.0
4,2,112.0,missed value,32.0,0.0,35.7,0.148,21,0.0
5,9,156.0,86,,155.0,34.3,1.189,42,1.0
6,3,148.0,66,,0.0,32.5,0.256,22,0.0
7,2,87.0,0,,0.0,28.9,0.773,25,0.0
8,0,104.0,64,,116.0,27.8,0.454,23,0.0
9,1,128.0,98,41.0,,32.0,1.321,33,1.0


In [223]:
# We can check what percentage of our data is affected by counting the rows that do not conform and the total rows in the df.
# Note: non_conform_rows returns a row once per offending column, so a row with problems in two columns is counted twice.
count_non_conforming_rows = len(non_conform_rows(diabetes_df, diabetes_df.columns))
total_rows = len(diabetes_df)
# Calculate the share of bad rows and display the values using an f-string
percentage_null = 100 * count_non_conforming_rows / total_rows
print(f"{count_non_conforming_rows} rows with non-conforming values")
print(f"{percentage_null:.2f} % of rows have non-conforming values:")

18 rows with non-conforming values
2.34 % of rows have non-conforming values:
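
Because `non_conform_rows` appends the offending rows once per column, a row with problems in two columns appears twice in its output, so the count above can overstate the number of distinct rows affected. Below is a minimal sketch of an alternative count that combines the per-column checks into one boolean mask so that each row is counted once. The helper name `count_non_conforming_unique` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def count_non_conforming_unique(df, columns):
    """Count rows with at least one value that cannot be parsed as a number."""
    # Start with "no row is bad" and flag rows column by column.
    bad_row_mask = pd.Series(False, index=df.index)
    for column in columns:
        bad_row_mask |= pd.to_numeric(df[column], errors="coerce").isna()
    return int(bad_row_mask.sum())


# Toy data, invented for illustration: row 1 is bad in both columns.
toy_df = pd.DataFrame(
    {
        "BloodPressure": [72, "missed value", 64, None],
        "Outcome": [1.0, None, 0.0, 1.0],
    }
)

unique_bad_rows = count_non_conforming_unique(toy_df, toy_df.columns)
print(f"{unique_bad_rows} distinct rows with non-conforming values")
```

On the toy frame, concatenating per-column results would yield three rows, while the mask-based count reports two distinct rows.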


#### 1.3 Cleaning Problem Values

Since fewer than 2.5% of rows are problematic, we can remove them from our dataset and convert the object columns to specific data types. Based on the review of the data, **BloodPressure** should be an integer and **DiabetesPedigreeFunction** should be a float. We will create a function that performs both steps.


In [251]:
# Create a function that returns a dataframe containing only conforming rows, with corrected datatypes for the given columns.
def conform_rows(df, columns, dtype_mapping={"BloodPressure": "int64", "DiabetesPedigreeFunction": "float64"}):
    """
    Return a new DataFrame containing only rows that conform to numeric data types.

    Parameters:
    - df: The DataFrame to check for non-conforming rows.
    - columns (list): A list of column names to check for data type conformity.
    - dtype_mapping (dict): dictionary mapping column names to datatypes.
    
    Returns:
    - conforming_df (pandas.DataFrame): A new DataFrame containing rows that conform to numeric data type.

    """
    # Check each column and create mask of true / false value where there are conforming / non-conforming values
    for column in columns:
        conforming_mask = pd.to_numeric(df[column], errors='coerce').notna()
        df = df[conforming_mask]
    
    #apply the dtype_mapping to the dataframe
    if dtype_mapping:
        df = df.astype(dtype_mapping)    

    return df


In [260]:
# call the cleaning function and assign to a new df
diabetes_df_clean = conform_rows(diabetes_df, diabetes_df.columns)
# Check the info of the new df to see if the changes have been applied 
diabetes_df_clean.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 752 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               752 non-null    int64  
 1   Glucose                   752 non-null    float64
 2   BloodPressure             752 non-null    int64  
 3   SkinThickness             752 non-null    float64
 4   Insulin                   752 non-null    float64
 5   BMI                       752 non-null    float64
 6   DiabetesPedigreeFunction  752 non-null    float64
 7   Age                       752 non-null    int64  
 8   Outcome                   752 non-null    float64
dtypes: float64(6), int64(3)
memory usage: 58.8 KB


We see that 752 rows remain, that every column is fully non-null, and that all columns now have numeric types.
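
A quick programmatic check can back up this reading of the `info()` output. The sketch below is an assumption about how such a check might look, run on a small made-up frame rather than on `diabetes_df_clean`. It confirms that no nulls remain and that every column has a numeric dtype, and it shows `reset_index(drop=True)` as one option for renumbering the rows after some have been dropped.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy stand-in for a cleaned frame whose index has a gap after dropping a row.
toy_clean = pd.DataFrame(
    {"BloodPressure": [72, 66, 64], "Outcome": [1.0, 0.0, 1.0]},
    index=[0, 1, 3],
)

# True only if no cell in any column is null.
no_nulls_left = not toy_clean.isna().any().any()

# True only if every column has a numeric dtype.
all_numeric = all(is_numeric_dtype(toy_clean[col]) for col in toy_clean.columns)

print(f"No nulls left: {no_nulls_left}")
print(f"All columns numeric: {all_numeric}")

# Optional: renumber rows 0..n-1 so that labels and positions agree again.
toy_reset = toy_clean.reset_index(drop=True)
print(toy_reset.index.tolist())
```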

In [None]:
# display(diabetes_df.describe())
# There are values of 0.00 in columns where this would not be expected, such as BloodPressure, and some very high values such as 999999 for Glucose.
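
The note above mentions zeros in columns where they would not be expected, as well as some very high values. Here is a minimal sketch of how these could be counted, using made-up values. The column list `zero_suspect_columns` and the threshold `glucose_upper_limit` are assumptions chosen for illustration, not values from the original analysis.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy_df = pd.DataFrame(
    {
        "Glucose": [148.0, 85.0, 999999.0, 89.0],
        "BloodPressure": [72, 0, 64, 66],
        "BMI": [33.6, 26.6, 0.0, 28.1],
    }
)

# Assumed list of columns where a zero is treated as suspicious.
zero_suspect_columns = ["Glucose", "BloodPressure", "BMI"]

# Count how many zeros each suspect column contains.
zero_counts = (toy_df[zero_suspect_columns] == 0).sum()
print(zero_counts)

# Assumed cut-off used only to flag obviously extreme entries.
glucose_upper_limit = 1000
extreme_glucose = toy_df[toy_df["Glucose"] > glucose_upper_limit]
print(extreme_glucose)
```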