#  Diabetes Analysis

---------------
## Context
---------------

Diabetes is one of the most common diseases worldwide, and the number of diabetic patients is growing every year. The main cause of diabetes remains unknown, but scientists believe that both genetic factors and environmental factors such as lifestyle play a major role.

A few years ago, research was conducted on the Pima tribe in America (also known as the Pima Indians). It found that women in this tribe are prone to developing diabetes at an early age. The records in this dataset were selected from a larger database under several constraints. In particular, all patients are females at least 21 years old of Pima Indian heritage.

-----------------
## Objective
-----------------

To analyse different aspects of diabetes among women of the Pima Indian tribe.

-------------------------
## Feature descriptions
-------------------------

The dataset has the following information:

* Pregnancies: Number of times pregnant
* Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test
* BloodPressure: Diastolic blood pressure (mm Hg)
* SkinThickness: Triceps skin fold thickness (mm)
* Insulin: 2-Hour serum insulin (mu U/ml)
* BMI: Body mass index (weight in kg/(height in m)^2)
* DiabetesPedigreeFunction: A function that scores the likelihood of diabetes based on family history.
* Age: Age in years
* Outcome: Class variable (0: the person is not diabetic; 1: the person is diabetic)
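
To make the BMI definition in the list above concrete, here is a minimal sketch of the formula. The function name `bmi` and the sample weight and height are illustrative assumptions; they are not values from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kg divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical example values (not taken from the dataset)
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 2))
```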

## Import the necessary libraries

In [12]:
# Import libraries for data manipulation
import pandas as pd
import numpy as np

# Import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.gofplots import ProbPlot

# Import libraries for building a linear regression model using statsmodels
from statsmodels.formula.api import ols
import statsmodels.api as sm

import scipy.stats as stats
from scipy.stats import levene, ttest_ind, f_oneway, kruskal, shapiro, kstest
from scipy.stats import chi2_contingency

# Importing Linear Regression from sklearn
from sklearn.linear_model import LinearRegression

# Import library for preparing data
from sklearn.model_selection import train_test_split

# Import library for data preprocessing
from sklearn.preprocessing import MinMaxScaler

import warnings
warnings.filterwarnings("ignore")

## Load dataset 

In [9]:
pima = pd.read_csv("./diabetes.csv")
pima.shape

(768, 9)
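
Before running any tests, it can help to look for placeholder values in the loaded frame. The sketch below assumes that a zero in a measurement column may stand for a missing reading, since a glucose concentration or BMI of zero is not physiologically plausible. Whether this applies to `diabetes.csv` is an assumption that this notebook does not verify. The frame `sample` and the list `suspect_columns` are hypothetical and only reuse the dataset's column names.

```python
import pandas as pd

# Tiny hypothetical frame that reuses the dataset's column names (NOT real data)
sample = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "Insulin": [0, 0, 94],
    "BMI": [33.6, 26.6, 0.0],
    "Outcome": [1, 0, 1],
})

# Columns where a zero is assumed to be implausible (an assumption, see above)
suspect_columns = ["Glucose", "BloodPressure", "Insulin", "BMI"]

# Count zeros per suspect column; True counts as 1 when summed
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)
```

On the real data, the same check would be `(pima[suspect_columns] == 0).sum()`. How to treat any zeros it finds (drop, impute, or keep) is a separate decision.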

## **Statistical Tests - Continuous Variables**

### T-Test

In [15]:
# Separate data into diabetic and non-diabetic groups
diabetic = pima[pima['Outcome'] == 1]
non_diabetic = pima[pima['Outcome'] == 0]

# Perform independent t-tests on every column except the target.
# All columns are numeric; "Outcome" is skipped inside the loop.
numerical_columns = pima.columns

for col in numerical_columns:
    if col != "Outcome":
        t_stat, p_val = stats.ttest_ind(diabetic[col], non_diabetic[col])
        print(f"Independent t-test for {col}:")
        print("t-statistic:", t_stat)
        print("p-value:", p_val)
        if p_val < 0.05:
            print(f"The difference in {col} between the two groups is statistically significant.")
        else:
            print(f"The difference in {col} between the two groups is not statistically significant.")
        print()

Independent t-test for Pregnancies:
t-statistic: 6.298430550035151
p-value: 5.065127298053476e-10
The difference in Pregnancies between the two groups is statistically significant.

Independent t-test for Glucose:
t-statistic: 15.67806773326978
p-value: 2.97341391544088e-48
The difference in Glucose between the two groups is statistically significant.

Independent t-test for BloodPressure:
t-statistic: 4.5689706614887955
p-value: 5.711009916900583e-06
The difference in BloodPressure between the two groups is statistically significant.

Independent t-test for SkinThickness:
t-statistic: 4.82826557728321
p-value: 1.6631333908835786e-06
The difference in SkinThickness between the two groups is statistically significant.

Independent t-test for Insulin:
t-statistic: 5.026611354340738
p-value: 6.219870924284867e-07
The difference in Insulin between the two groups is statistically significant.

Independent t-test for BMI:
t-statistic: 9.09702964503362
p-value: 7.868367931282461e-19
The difference in BMI between the two groups is statistically significant.
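
By default, `stats.ttest_ind` pools the two group variances (`equal_var=True`). When the groups differ in size and spread, Welch's variant (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic data. The arrays `group_a` and `group_b`, their sizes, and their distribution parameters are arbitrary assumptions, not the Pima data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two groups with unequal size and spread (NOT the Pima data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=140.0, scale=30.0, size=40)
group_b = rng.normal(loc=110.0, scale=20.0, size=80)

# Levene's test gives a rough check of the equal-variance assumption
levene_stat, levene_p = stats.levene(group_a, group_b)

# Student's t-test (pooled variance) versus Welch's t-test (separate variances)
student = stats.ttest_ind(group_a, group_b)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print("Levene p-value:", levene_p)
print("Student t:", student.statistic, "p:", student.pvalue)
print("Welch   t:", welch.statistic, "p:", welch.pvalue)
```

If Levene's test suggests unequal variances, the Welch result is usually the safer one to report. With many columns tested at once, a multiple-comparison correction may also be worth considering.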

## Relationship of Continuous variables

### Pearson Correlation

In [16]:
for col1 in numerical_columns:
    for col2 in numerical_columns:
        if col1 != col2:
            correlation_coefficient, p_val = stats.pearsonr(pima[col1], pima[col2])
            print(f"Correlation between {col1} and {col2}:")
            print("Correlation coefficient:", correlation_coefficient)
            print("p-value:", p_val)
            if p_val < 0.05:
                print("The correlation is statistically significant at the 5% level.")
            else:
                print("The correlation is not statistically significant at the 5% level.")
            print()
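
The nested loop above visits every ordered pair, so each pair of columns is reported twice (A with B, then B with A). As a minimal sketch, `itertools.combinations` yields each unordered pair once. The helper `pearson_r` and the dictionary `toy` are hypothetical and use made-up numbers rather than the Pima data. The helper only shows how the coefficient is computed; unlike `stats.pearsonr`, it does not return a p-value.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy columns (not the Pima data)
toy = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],  # increases linearly with A
    "C": [4.0, 3.0, 2.0, 1.0],  # decreases linearly with A
}

# combinations() yields each unordered pair of column names exactly once
results = {
    (c1, c2): pearson_r(toy[c1], toy[c2]) for c1, c2 in combinations(toy, 2)
}
for pair, r in results.items():
    print(pair, round(r, 3))
```

On the real data, the same idea would be `combinations(numerical_columns, 2)` with `stats.pearsonr` inside the loop. The output of the notebook's own cell follows.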

Correlation between Pregnancies and Glucose:
Correlation coefficient: 0.12802224682727548
p-value: 0.0003755097066736405
The correlation is statistically significant at the 5% level.

Correlation between Pregnancies and BloodPressure:
Correlation coefficient: 0.20898725849828428
p-value: 5.009006407843621e-09
The correlation is statistically significant at the 5% level.

Correlation between Pregnancies and SkinThickness:
Correlation coefficient: 0.009393145275494907
p-value: 0.7949469195138572
The correlation is not statistically significant at the 5% level.

Correlation between Pregnancies and Insulin:
Correlation coefficient: -0.01878025894733311
p-value: 0.6033062142356396
The correlation is not statistically significant at the 5% level.

Correlation between Pregnancies and BMI:
Correlation coefficient: 0.021545924770438516
p-value: 0.5510450832855005
The correlation is not statistically significant at the 5% level.

Correlation between Pregnancies and DiabetesPedigreeFunction:
Corr