# Diabetes Survey Analysis 
1. Data quality check (errors, missing values)
2. Inclusion and exclusion
3. Analysis of demographic variables (age: mean and standard deviation; percentages, frequencies)
4. Estimation of probable type (percentages, frequencies)
5. Association between demographic variables and estimated probable type (gender, marital status, education level; chi-square test)
6. Self-management and knowledge (percentages, frequencies)
7. Association between demographic variables and self-management and knowledge (gender, marital status, education level; chi-square test)


## Task #1: Data Quality Check 

In [1]:
# import libraries
import pandas as pd 
import researchpy as rp 

In [2]:
data = pd.read_excel('../data/diabetes.xlsx')

In [3]:
data.head(2) 

Unnamed: 0,Timestamp,Are you willing to participant in the survey?,Age of Participant,Weight(kg),Usual Blood Pressure(mmHg),Height(in),Gender,Marital status,Education Level,Occupation,...,Do you smoke ?,Overall health in the past 4 weeks ?,Does diabetes affect day to day activities ?,Family history of diabetes,Whether stayed in hospital overnight,Reason for most recent stay in hospital,Where do you go for diabetes check-up,Number of diabetes check-ups in the last 12 month,"Representative name ( bithi,nodee,muhibullah,...etc)",Representative(Short Name/Initials):
0,2021-04-28 22:40:52.318,Yes,24.0,,,,Male,Unmarried,Undergraduate,Health Care Service Provider,...,No,Very good,No,No,,,,,,
1,2021-04-30 19:58:59.820,Yes,42.0,,,,Male,Married,SSC,Public Service,...,No,Excellent,Yes,Yes,,,,,,


In [4]:
# shape: rows and columns 
data.shape

(306, 33)

In [5]:
# info: information about survey 
data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 33 columns):
 #   Column                                                                                Non-Null Count  Dtype         
---  ------                                                                                --------------  -----         
 0   Timestamp                                                                             306 non-null    datetime64[ns]
 1   Are you willing to participant in the survey?                                         304 non-null    object        
 2   Age of Participant                                                                    305 non-null    float64       
 3   Weight(kg)                                                                            303 non-null    float64       
 4   Usual Blood Pressure(mmHg)                                                            303 non-null    object        
 5   Height(in)                          

In [6]:
# columns 
data.columns

Index(['Timestamp', 'Are you willing to participant in the survey? ',
       'Age of Participant', 'Weight(kg)', 'Usual Blood Pressure(mmHg)',
       'Height(in)', 'Gender', 'Marital status', 'Education Level ',
       'Occupation ', 'Division', 'District',
       'At which age you diagnosed diabetes?',
       'What type of diabetes do you have?',
       'Did you inject insulin within the first 3 months of being diagnosed ? ',
       'Did you continue injecting for more than one year after you first injected insulin? ',
       'How do you control your diabetes now ? ',
       'Do you take any medication for any other condition ? ',
       'What type of medication do you take ?',
       'Do you know enough about when to take your medication ? ',
       'How often do you test your own blood glucose level ? ',
       ' Do you know enough about what you should eat to help you manage your diabetes ? ',
       'Do you know about the role of Physical activity in managing your diabetes ? ',
  

In [7]:
# drop columns that are not used in the analysis
data.drop(columns=['Timestamp', 'Are you willing to participant in the survey? ',
                   'Representative name ( bithi,nodee,muhibullah,...etc) ',
                   'Representative(Short Name/Initials):',
                   'Weight(kg)', 'Usual Blood Pressure(mmHg)',
                   'Height(in)'], inplace=True)

In [8]:
# check missing data (data errors): count missing values per column
data.isnull().sum() 

Age of Participant                                                                      1
Gender                                                                                  1
Marital status                                                                          1
Education Level                                                                         1
Occupation                                                                              1
Division                                                                                1
District                                                                                1
At which age you diagnosed diabetes?                                                    1
What type of diabetes do you have?                                                      1
Did you inject insulin within the first 3 months of being diagnosed ?                   1
Did you continue injecting for more than one year after you first injected insulin?     1
How do you

In [9]:
# drop missing data 
data.dropna(inplace=True) 

In [10]:
# check shape of data again 
data.shape

(303, 26)
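The shape falls from 306 to 303 rows, so `dropna` removed three respondents. Before discarding rows it can help to see which rows are affected and how many values each one is missing. A minimal sketch on a hypothetical toy frame (the column values below are invented for illustration, not taken from the survey):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the survey data
toy = pd.DataFrame({
    "Age": [24.0, 42.0, np.nan, 57.0],
    "Gender": ["Male", "Male", "Female", None],
    "Division": ["Dhaka", None, "Khulna", None],
})

# Number of missing cells in each row
missing_per_row = toy.isnull().sum(axis=1)

# Rows that dropna() would remove (at least one missing value)
to_drop = toy[missing_per_row > 0]

# Same call as in the notebook, but without inplace so the original is kept
cleaned = toy.dropna()

print(f"rows before: {len(toy)}, dropped: {len(to_drop)}, after: {len(cleaned)}")
```

Keeping `to_drop` around makes it possible to report the excluded records in the inclusion and exclusion step.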

In [11]:
# check missing values again 
data.isnull().sum() 

Age of Participant                                                                      0
Gender                                                                                  0
Marital status                                                                          0
Education Level                                                                         0
Occupation                                                                              0
Division                                                                                0
District                                                                                0
At which age you diagnosed diabetes?                                                    0
What type of diabetes do you have?                                                      0
Did you inject insulin within the first 3 months of being diagnosed ?                   0
Did you continue injecting for more than one year after you first injected insulin?     0
How do you

In [12]:
data.columns = data.columns.str.strip().str.replace(' ', '_') 
data.columns

Index(['Age_of_Participant', 'Gender', 'Marital_status', 'Education_Level',
       'Occupation', 'Division', 'District',
       'At_which_age_you_diagnosed_diabetes?',
       'What_type_of_diabetes_do_you_have?',
       'Did_you_inject_insulin_within_the_first_3_months_of_being_diagnosed_?',
       'Did_you_continue_injecting_for_more_than_one_year_after_you_first_injected_insulin?',
       'How_do_you_control_your_diabetes_now_?',
       'Do_you_take_any_medication_for_any_other_condition_?',
       'What_type_of_medication_do_you_take_?',
       'Do_you_know_enough_about_when_to_take_your_medication_?',
       'How_often_do_you_test_your_own_blood_glucose_level_?',
       'Do_you_know_enough_about_what_you_should_eat_to_help_you_manage_your_diabetes_?',
       'Do_you_know_about_the_role_of_Physical_activity_in_managing_your_diabetes_?',
       'Do_you_smoke_?', 'Overall_health_in_the_past_4_weeks_?',
       'Does_diabetes_affect_day_to_day_activities_?',
       'Family_history_of_di
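Replacing each single space with an underscore turns a double space in a question into a double underscore, which is why names such as `Reason_for_most_recent__stay_in_hospital` appear later in the analysis. One possible alternative, sketched here on two hypothetical raw names, collapses every run of whitespace into a single underscore:

```python
import pandas as pd

# Hypothetical raw column names with stray and repeated spaces
raw = pd.Index([" Do you smoke ? ", "Reason for most recent  stay in hospital"])

# Approach used in the notebook: each single space becomes an underscore
simple = raw.str.strip().str.replace(" ", "_", regex=False)

# Alternative: collapse each run of whitespace into one underscore
collapsed = raw.str.strip().str.replace(r"\s+", "_", regex=True)

print(list(simple))
print(list(collapsed))
```

Switching to the collapsed form would change the column names, so every later cell that refers to the double-underscore names would need the same update.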

In [13]:
data.replace(to_replace=['No ( suggested type 2)', 'Yes ( suggested type 1)'], 
             value=['Probable Type 2', 'Probable Type 1'], inplace=True)

In [14]:
data['Did_you_inject_insulin_within_the_first_3_months_of_being_diagnosed_?'].value_counts()

Probable Type 2    235
Probable Type 1     68
Name: Did_you_inject_insulin_within_the_first_3_months_of_being_diagnosed_?, dtype: int64

In [15]:
data['Did_you_continue_injecting_for_more_than_one_year_after_you_first_injected_insulin?'].value_counts() 

Probable Type 2    219
Probable Type 1     84
Name: Did_you_continue_injecting_for_more_than_one_year_after_you_first_injected_insulin?, dtype: int64
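The `data.replace(...)` call above scans every column of the frame. If the recoding is meant only for the two insulin questions, it can be limited to those columns with a mapping dictionary. A sketch on toy data; the short column names are hypothetical stand-ins for the long question names:

```python
import pandas as pd

# Toy frame; column names are shortened, hypothetical stand-ins
toy = pd.DataFrame({
    "insulin_first_3_months": ["No ( suggested type 2)", "Yes ( suggested type 1)"],
    "continued_over_1_year": ["Yes ( suggested type 1)", "No ( suggested type 2)"],
    "smoker": ["No", "Yes"],
})

# Same recoding as in the notebook, expressed as a mapping
mapping = {
    "No ( suggested type 2)": "Probable Type 2",
    "Yes ( suggested type 1)": "Probable Type 1",
}

# Apply the mapping only to the columns it is meant for
cols = ["insulin_first_3_months", "continued_over_1_year"]
toy[cols] = toy[cols].replace(mapping)

print(toy)
```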

In [16]:
# export clean data 
data.to_csv("../data/diabetes_cleaned_data_v2.csv", index=False)

In [17]:
! ls 

01_DataCleaning.ipynb  02_DiabetesInitialAnalysis.ipynb


In [18]:
import pandas as pd 
import seaborn as sns 
import researchpy as rp 

In [19]:
# reading clean data (same path the cleaning notebook exported to)
df = pd.read_csv("../data/diabetes_cleaned_data_v2.csv")
df.head(10)

In [8]:
df.shape 

(303, 26)

In [9]:
df.columns

Index(['Age_of_Participant', 'Gender', 'Marital_status', 'Education_Level',
       'Occupation', 'Division', 'District',
       'At_which_age_you_diagnosed_diabetes?',
       'What_type_of_diabetes_do_you_have?',
       'Did_you_inject_insulin_within_the_first_3_months_of_being_diagnosed_?',
       'Did_you_continue_injecting_for_more_than_one_year_after_you_first_injected_insulin?',
       'How_do_you_control_your_diabetes_now_?',
       'Do_you_take_any_medication_for_any_other_condition_?',
       'What_type_of_medication_do_you_take_?',
       'Do_you_know_enough_about_when_to_take_your_medication_?',
       'How_often_do_you_test_your_own_blood_glucose_level_?',
       'Do_you_know_enough_about_what_you_should_eat_to_help_you_manage_your_diabetes_?',
       'Do_you_know_about_the_role_of_Physical_activity_in_managing_your_diabetes_?',
       'Do_you_smoke_?', 'Overall_health_in_the_past_4_weeks_?',
       'Does_diabetes_affect_day_to_day_activities_?',
       'Family_history_of_di

## Task #2: Analysis of demographic variables 

In [10]:
# distribution of age 
rp.codebook(df['Age_of_Participant'])

Variable: Age_of_Participant    Data Type: float64 

 Number of Obs.: 303 
 Number of missing obs.: 0 
 Percent missing: 0.0 
 Number of unique values: 50 

 Range: [20.0, 80.0] 
 Mean: 52.53 
 Standard Deviation: 10.16 
 Mode: 57.0 
 10th Percentile: 40.0 
 25th Percentile: 46.0 
 50th Percentile: 53.0 
 75th Percentile: 58.0 
 90th Percentile: 65.0 







In [11]:
df['Age_of_Participant'].describe()

count    303.000000
mean      52.531353
std       10.158260
min       20.000000
25%       46.000000
50%       53.000000
75%       58.000000
max       80.000000
Name: Age_of_Participant, dtype: float64

In [12]:
df['Age_of_Participant'].value_counts(bins=[20, 40, 60, 80], sort=True)

(40.0, 60.0]      215
(60.0, 80.0]       56
(19.999, 40.0]     32
Name: Age_of_Participant, dtype: int64

In [32]:
df['Age_of_Participant'].value_counts(bins=[20, 40, 60, 80], sort=True, ascending=True, normalize=True) * 100  

(19.999, 40.0]    10.561056
(60.0, 80.0]      18.481848
(40.0, 60.0]      70.957096
Name: Age_of_Participant, dtype: float64
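The interval labels such as `(19.999, 40.0]` are generated by `value_counts(bins=...)`. For a report, `pd.cut` with explicit labels gives named age groups and keeps them in age order. A sketch on a hypothetical list of ages; the group labels are an illustrative choice:

```python
import pandas as pd

# Hypothetical ages, chosen to include the bin edges
ages = pd.Series([20, 35, 40, 41, 53, 60, 61, 80], name="Age")

# Same edges as the notebook; include_lowest keeps age 20 in the first group
groups = pd.cut(
    ages,
    bins=[20, 40, 60, 80],
    labels=["20-40", "41-60", "61-80"],
    include_lowest=True,
)

# Frequencies and percentages per group
counts = groups.value_counts(sort=False)
percent = (groups.value_counts(sort=False, normalize=True) * 100).round(2)

print(pd.DataFrame({"Count": counts, "Percent": percent}))
```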

In [72]:
# distribution of age 
rp.summary_cont(df['Age_of_Participant'])





   Variable            N      Mean     SD       SE      95% Conf.  Interval
0  Age_of_Participant  303.0  52.5314  10.1583  0.5836  51.383     53.6797
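The standard error and 95% interval in this summary can be checked by hand. The figures are consistent with a t-based interval, mean ± t × SE with n − 1 degrees of freedom; that `researchpy` computes it this way is an assumption here, checked only against the numbers shown. A sketch using the reported N, mean and SD:

```python
import math
from scipy import stats

# Values reported in the summary table
n, mean, sd = 303, 52.5314, 10.1583

# Standard error of the mean
se = sd / math.sqrt(n)

# Two-sided 95% critical value from the t distribution (assumed method)
t_crit = stats.t.ppf(0.975, df=n - 1)

lower = mean - t_crit * se
upper = mean + t_crit * se

print(f"SE = {se:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```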


In [13]:
dm = rp.summary_cat(df[['Gender', 'Marital_status', 'Education_Level',
       'Occupation']])
dm 

    Variable         Outcome                                             Count  Percent
0   Gender           Male                                                  211    69.64
1                    Female                                                 92    30.36
2   Marital_status   Married                                               296    97.69
3                    Unmarried                                               7     2.31
4   Education_Level  Post graduate (MSc, Ms, MD, PhD, Diploma and O...      93    30.69
5                    Graduate                                               76    25.08
6                    HSC                                                    40    13.20
7                    SSC                                                    37    12.21
8                    Undergraduate                                          34    11.22
9                    No education                                           23     7.59
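The Count and Percent columns are plain frequencies and relative frequencies, so they can be cross-checked with pandas alone. A sketch that rebuilds the Gender column from the counts reported above (211 male, 92 female) instead of loading the survey file:

```python
import pandas as pd

# Rebuild a Gender column from the reported counts
gender = pd.Series(["Male"] * 211 + ["Female"] * 92, name="Gender")

# Frequencies and percentages, rounded to two decimals
counts = gender.value_counts()
percent = (gender.value_counts(normalize=True) * 100).round(2)

summary = pd.DataFrame({"Count": counts, "Percent": percent})
print(summary)
```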


In [48]:
dm.to_csv('demographics.csv', index=False)

## Task #3: Estimation of probable type 

In [45]:
estimation = rp.summary_cat(df[['At_which_age_you_diagnosed_diabetes?',
       'What_type_of_diabetes_do_you_have?',
       'Did_you_inject_insulin_within_the_first_3_months_of_being_diagnosed_?',
       'Did_you_continue_injecting_for_more_than_one_year_after_you_first_injected_insulin?']])

In [46]:
estimation.to_csv('estimation_of_probable_type.csv', index=False)

In [69]:
crosstab, res = rp.crosstab(df['Gender'], 
                            df['At_which_age_you_diagnosed_diabetes?'], 
                            test= "chi-square")

In [70]:
crosstab

At_which_age_you_diagnosed_diabetes?  16 to 35  36 to 50  51-65  66 and above  Under 16 years  All
Gender
Female                                      28        49     14             1               0   92
Male                                        32       158     15             5               1  211
All                                         60       207     29             6               1  303


In [71]:
res 

   Chi-square test              results
0  Pearson Chi-square ( 4.0) =  17.2957
1  p-value =                     0.0017
2  Cramer's V =                  0.2389
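The crosstab has several very small cells (0, 1, 1 and 5), and the chi-square approximation is usually considered unreliable when many expected counts fall below 5. A sketch that recomputes the test from the observed counts with `scipy` and counts the cells with small expected frequencies; using `scipy` instead of `researchpy` is a substitution made here for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the crosstab, without the "All" margins
# Columns: 16 to 35, 36 to 50, 51-65, 66 and above, Under 16 years
observed = np.array([
    [28, 49, 14, 1, 0],    # Female
    [32, 158, 15, 5, 1],   # Male
])

# Pearson chi-square test of independence
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

# Cramer's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))
n = observed.sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

# Cells whose expected count is below the usual threshold of 5
small_cells = int((expected < 5).sum())

print(f"chi2({dof}) = {chi2:.4f}, p = {p:.4f}, V = {cramers_v:.4f}")
print(f"cells with expected count < 5: {small_cells} of {expected.size}")
```

If many cells are flagged, merging sparse categories (for example the youngest and oldest diagnosis groups with their neighbours) before testing is one common remedy.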


## Task #4: Self Management and Knowledge      

In [41]:
result = rp.summary_cat(df[['How_do_you_control_your_diabetes_now_?',
       'Do_you_take_any_medication_for_any_other_condition_?',
       'What_type_of_medication_do_you_take_?',
       'Do_you_know_enough_about_when_to_take_your_medication_?',
       'How_often_do_you_test_your_own_blood_glucose_level_?',
       'Do_you_know_enough_about_what_you_should_eat_to_help_you_manage_your_diabetes_?',
       'Do_you_know_about_the_role_of_Physical_activity_in_managing_your_diabetes_?',
       'Do_you_smoke_?', 'Overall_health_in_the_past_4_weeks_?',
       'Does_diabetes_affect_day_to_day_activities_?',
       'Family_history_of_diabetes', 'Whether_stayed_in_hospital_overnight',
       'Reason_for_most_recent__stay_in_hospital',
       'Where_do__you__go_for_diabetes_check-up',
       'Number_of_diabetes_check-ups_in_the_last__12_month']])

In [42]:
result.to_csv('knowledge_management.csv', index=False)

In [44]:
result

Unnamed: 0,Variable,Outcome,Count,Percent
0,How_do_you_control_your_diabetes_now_?,Tablets,49,16.17
1,,"Tablets, Diet",49,16.17
2,,"Diet, Physical activity",35,11.55
3,,"Tablets, Physical activity",27,8.91
4,,"Tablets, Diet, Physical activity",23,7.59
...,...,...,...,...
87,Number_of_diabetes_check-ups_in_the_last__12_m...,Three or more time,244,80.53
88,,Twice,37,12.21
89,,,13,4.29
90,,once,8,2.64
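The outcomes for `How_do_you_control_your_diabetes_now_?` are combinations such as "Tablets, Diet", so each combination is counted as its own category. To see how many respondents mention each individual method, the answers can be split into indicator columns. A sketch on four hypothetical answers, assuming the options are always separated by a comma and a space:

```python
import pandas as pd

# Hypothetical multi-select answers in the same style as the survey output
answers = pd.Series([
    "Tablets",
    "Tablets, Diet",
    "Diet, Physical activity",
    "Tablets, Diet, Physical activity",
])

# One 0/1 column per option (assumes ", " is the separator)
indicators = answers.str.get_dummies(sep=", ")

# Number and percentage of respondents mentioning each option
per_option = indicators.sum()
per_option_pct = (indicators.mean() * 100).round(2)

print(pd.DataFrame({"Count": per_option, "Percent": per_option_pct}))
```

Because one respondent can select several options, these percentages can add up to more than 100.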
