**Does a family history of overweight have an effect on the lifestyle habits of overweight individuals? Do these individuals make healthier lifestyle choices when there is a history of overweight in the family?**

The question is investigated through the variable family_history_with_overweight, a categorical variable with yes/no responses.

The dataset is filtered to include only those in the overweight and obese categories (using the NObeyesdad variable).

The family_history_with_overweight variable will be investigated in relation to the following variables:
- FAF (frequency of physical activity): float value, 0-3.
- FCVC (frequency of consumption of vegetables): float value, 1-3.
- FAVC (frequent consumption of high-caloric food): previously yes/no, recoded to 0/1 (yes = 0, no = 1, so that a higher value reflects the healthier habit).
- SCC (monitoring of daily calories): previously yes/no, recoded to 1/0.

- An additive variable "Sum" was created as the sum of the previous 4, on a scale from 1 to 8 (higher = healthier habits). This variable is used in the regression model.
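
The construction of the summed score can be sketched on a few made-up rows. The values below are hypothetical (not observations from the dataset), `toy` is an illustrative name, and the yes/no coding is assumed to follow the recoding cells further down in this notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only (not real observations).
toy = pd.DataFrame({
    "FAF": [1.0, 2.0, 3.0],
    "FCVC": [1.0, 2.0, 3.0],
    "FAVC": ["yes", "yes", "no"],
    "SCC": ["no", "yes", "yes"],
})

# Recode the yes/no variables to numbers.
# FAVC: 'yes' (frequent high-caloric food) -> 0, 'no' -> 1.
toy["FAVC"] = toy["FAVC"].map({"yes": 0.0, "no": 1.0})
# SCC: 'yes' (monitors calories) -> 1, 'no' -> 0.
toy["SCC"] = toy["SCC"].map({"yes": 1.0, "no": 0.0})

# Additive score across the four lifestyle variables.
toy["Sum"] = toy[["FAF", "FCVC", "FAVC", "SCC"]].sum(axis=1)
print(toy)
```

With this coding, a higher score corresponds to more of the healthier habits.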

In [None]:
import pandas as pd
import numpy as np

In [None]:
!pip install -U -q PyDrive
 
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
link = 'https://drive.google.com/file/d/1owE21jzuj7VBDN-AsVgpH6W6GTL7N0Kp/view?usp=share_link'
# splitting the URL to keep only the file ID (named file_id to avoid shadowing the built-in id)
file_id = link.split("/")[-2]

downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('ObesityDataSet_raw_and_data_sinthetic.csv')

#at this point, you can name your dataframe whatever you want and use .head() to check out the first few rows
obesity = pd.read_csv('ObesityDataSet_raw_and_data_sinthetic.csv')
obesity.head()

In [None]:
types ={'family_history_with_overweight': 'category', 'NObeyesdad': 'category', 'FCVC': 'float64'}

obesity = obesity.astype(types)
# changing datatypes of variables
obesity.dtypes

Univariate Descriptive Statistics:

In [None]:
print(obesity['NObeyesdad'].value_counts())
# getting counts of each weight group

In [None]:
obesity['NObeyesdad'].value_counts().plot(kind='bar', title='Bar Graph for Weight Category', xlabel='Weight Category', ylabel='Frequency')
# visualizing counts

It is useful to see how observations are distributed across the weight categories, to judge whether the sample is large enough to restrict the analysis to the overweight and obese groups.

In [None]:
obesity_filt = obesity.filter(["NObeyesdad", "family_history_with_overweight", "FAVC", "FCVC", "SCC", "FAF"], axis = 1)
obesity_filt.head()
# filtering to include just the variables of interest for the research question

In [None]:
# filter NObeyesdad to keep only the Obesity Type I, II, III and Overweight Level I, II categories
overweight_levels = ["Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III", "Overweight_Level_I", "Overweight_Level_II"]
# .copy() so that later column assignments modify an independent DataFrame rather than a slice
obobesity = obesity_filt.loc[obesity_filt['NObeyesdad'].isin(overweight_levels)].copy()

obobesity.head()

In [None]:
# changing variable coding to numerical values, to make it easier for building models later
# 'yes' (frequent high-caloric food) -> 0, 'no' -> 1
obobesity['FAVC'] = obobesity['FAVC'].map({'yes': 0.0, 'no': 1.0})

obobesity.head()

In [None]:
# changing variable coding to numerical values, to make it easier for building models later
# 'yes' (monitors calories) -> 1, 'no' -> 0
obobesity['SCC'] = obobesity['SCC'].map({'yes': 1.0, 'no': 0.0})
obobesity.head()

In [None]:
obobesity.groupby("family_history_with_overweight")["NObeyesdad"].count()
# checking group sizes to see whether each group is large enough for the comparisons

Both groups have > 30 observations, so by the central limit theorem the sampling distribution of each group mean can be treated as approximately normal.

The two groups are: 
overweight people with family history of overweight and overweight people with no family history of overweight. 

In [None]:
obeswhist = obobesity.loc[(obobesity['family_history_with_overweight'] == "yes")]
obeswhist

In [None]:
obenohist = obobesity.loc[(obobesity['family_history_with_overweight'] == "no")]
obenohist

In [None]:
obeswhist.describe()

In [None]:
obenohist.describe()

Looking at these summary statistics, the group with no family history of overweight has a lower mean vegetable consumption but a higher mean physical activity frequency. So this group shows some healthier habits, but the two measures do not both point in the direction of the hypothesis.

The standard deviation of physical activity also differs visibly between the 2 groups: it is higher in the no-history group.
This helps describe the data but does not answer the question of whether one group has healthier lifestyle habits than the other.

In [None]:
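# numeric_only=True skips the categorical columns, which have no skewness
obeswhist.skew(numeric_only=True)

In [None]:
obenohist.skew(numeric_only=True)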

Comparing the two sets of skewness values, the quantitative lifestyle variables (frequency of vegetable consumption, physical activity frequency) are distributed differently depending on family history of overweight.

In the group with a family history of overweight, the distribution leans towards the left tail for vegetable consumption and slightly towards the right for physical activity. In the no-history group, vegetable consumption is much closer to the middle and physical activity leans towards the right. This suggests that the latter group may have healthier habits.
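
As a reminder of how to read the sign of `.skew()`, here is a small sketch on made-up numbers (not from the dataset; the series names are illustrative): a positive value means a longer right tail, a negative value a longer left tail.

```python
import pandas as pd

# Hypothetical samples chosen only to show the sign convention.
right_tailed = pd.Series([1, 1, 1, 2, 2, 3, 8])  # a few large values stretch the right tail
left_tailed = pd.Series([1, 6, 7, 7, 8, 8, 8])   # a few small values stretch the left tail

print("right-tailed skew is positive:", right_tailed.skew() > 0)
print("left-tailed skew is negative:", left_tailed.skew() < 0)
```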

In [None]:
obenohist[['FCVC', 'FAF', 'FAVC', 'SCC']].plot(kind='box')

In [None]:
obeswhist[['FCVC', 'FAF', 'FAVC', 'SCC']].plot(kind='box')

The box plots show the spread of the data in the two groups. It makes sense that the with-history group looks more "normally" distributed, with a wider range of quartile values, since the sample is much larger there.
There are some outliers in the other group which will be worth exploring in later stages.

In [None]:
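# numeric_only=True restricts the correlation matrix to the numeric columns
obeswhist.corr(numeric_only=True)

In [None]:
obenohist.corr(numeric_only=True)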

Testing for correlations between variables: perhaps those who work out often also eat more vegetables. There does appear to be some relationship, but since the correlation is higher in the no-history group, it is not very useful here (the question is whether those with a family history of overweight are more conscious of their lifestyle habits).

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set(style='white')

sns.barplot(x='NObeyesdad', y='FCVC', data=obeswhist)
plt.xticks(rotation=45)

In [None]:
sns.set(style='white')

sns.barplot(x='NObeyesdad', y='FCVC', data=obenohist)
plt.xticks(rotation=45)

In [None]:
sns.set(style='white')

sns.barplot(x='NObeyesdad', y='FCVC', hue='family_history_with_overweight', data=obobesity)
plt.xticks(rotation=45)

These bar graphs were really helpful in visualizing the data. I found that the no-history group has no one in the Obesity Type III category, either because the sample is so small or because there is some relationship between having a family history of obesity and being in this category. Bar plots like these could help visualize how the means of many of these numerical variables differ across groups. In the plot for the group with a history of obesity, the highest weight category (Obesity Type III) consumes the most vegetables. This is consistent with the hypothesis that they are more aware of their lifestyle habits; however, it is the only statistic so far that helps answer the question.

Looking at the EDA so far, all of it has been helpful in understanding what the data looks like. However, it could be useful to recode categorical variables like FAVC and SCC to numerical values and add them to the descriptive statistical analysis. This way, we could answer the question using not just vegetable consumption and physical activity but also frequent consumption of high-caloric food and calorie monitoring. This could all be done in the same quantitative way through a binary coding (like 1 being yes and 0 being no).

In [None]:
data_crosstab = pd.crosstab(obobesity['family_history_with_overweight'],
                            obobesity['NObeyesdad'], 
                               margins = False)
data_crosstab

Is there a relationship between weight category and whether there is a history of overweight in the family?

In [None]:
from scipy.stats import chi2_contingency

chisqresult = chi2_contingency(data_crosstab)

chisqresult
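
The result of `chi2_contingency` can be unpacked into its four parts, which makes the output easier to read. The sketch below uses a hypothetical table of counts (not the crosstab from this notebook), so the numbers it prints say nothing about the real data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table of counts (rows: family history no / yes).
table = np.array([[30, 20, 10],
                  [60, 90, 120]])

# Test statistic, p-value, degrees of freedom, and expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
# The chi-square approximation is usually considered reasonable
# when all expected counts are at least 5.
print("smallest expected count:", round(float(expected.min()), 2))
```

A p-value below the chosen threshold would indicate that weight category and family history are not independent.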

So far, it appears that other variables coded as categorical should be included in this analysis. I got very little information from this EDA. Although I do see positive correlations showing some relationship between eating vegetables and exercising, I did not see them differ significantly between the groups with and without a family history of overweight.
I do wonder whether recoding the variables into 0's and 1's, to have more quantitative variables to work with, would be helpful.
Using only what is already quantitative could introduce a form of survivorship or omission bias, since filtering the data and exploring it without the categorical variables meant they were simply not counted in answering the question.

(This was the last 2 prompts combined)

In [None]:
obobesity['Sum'] = obobesity['FAVC'] + obobesity['FCVC'] + obobesity['SCC'] + obobesity['FAF']


In [None]:
obobesity.head()

Logistic Regression: 

y = family_history_with_overweight, binary categorical

x (predictor) = Sum, numeric

In [None]:
import statsmodels.api as sapi
import statsmodels.stats.api as sms
import statsmodels.stats.outliers_influence as st_inf
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as mc
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from sklearn.metrics import classification_report
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.stats as statsmodstats

In [None]:
logformula ='family_history_with_overweight ~ Sum'

In [None]:
obobesity.dtypes

In [None]:
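# recode the outcome to numeric 0/1; the column is a pandas category, so cast to float after mapping
obobesity['family_history_with_overweight'] = obobesity['family_history_with_overweight'].map({'no': 0.0, 'yes': 1.0}).astype(float)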

In [None]:
logmod = smf.logit(logformula, data=obobesity).fit()
logmod.summary()

Checking assumptions: 

In [None]:
# Checking appropriate outcome type, getting counts for each of the two outcome values
obobesity['family_history_with_overweight'].value_counts()

2 categories in the binary outcome variable. Assumption is met.

In [None]:
# Checking independence of observations. Plotting the deviance residuals in observation order to look for patterns.

fig = plt.figure(figsize=(8,5))
ax = fig.add_subplot(111, title="Residual Series Plot",
                    xlabel="Index Number", ylabel="Deviance Residuals")
ax.plot(np.arange(len(logmod.resid_dev)), logmod.resid_dev)
plt.axhline(y=0, ls="--", color='red');

Met, no patterns in the plot.

Not concerned with VIF since only one predictor. 

In [None]:
# Checking for outliers. Getting statistics for the deviance residuals.

logmod.resid_dev.describe()

No deviance residual exceeds 3 in absolute value. Met.

In [None]:
# Checking linearity of independent variables and log-odds.
# Fitted values from the model (log odds or logit) plotted against the x variable.
# Recent seaborn versions require x and y to be passed as keyword arguments.
sns.regplot(x=obobesity['Sum'], y=logmod.fittedvalues)

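Linear. Met. Note that with a single predictor the fitted log-odds are a linear function of Sum by construction, so this plot cannot by itself reveal non-linearity.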


Sufficient sample size assumption met looking at numbers from EDA. 



All assumptions met. Can continue to the summary and interpretation. 

In [None]:
logmod.summary()

**Model interpretation**

First, the p-value for Sum in this model is < 0.05, so the effect is statistically significant and we can reject the null hypothesis. The summed lifestyle variable is therefore a significant predictor of family_history_with_overweight.

For every one-unit increase in Sum, the log-odds of the participant having a history of overweight in the family decrease by 0.233.
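
Assuming the 0.233 decrease reported above is the fitted coefficient on Sum (that is, -0.233 on the log-odds scale), it can be converted to an odds ratio by exponentiating. A minimal sketch of that arithmetic, with illustrative variable names:

```python
import math

# Assumed coefficient on Sum, taken from the interpretation above (log-odds scale).
coef = -0.233

# Exponentiating a logit coefficient gives the multiplicative change in the odds.
odds_ratio = math.exp(coef)
percent_change = (odds_ratio - 1) * 100

print(f"odds ratio = {odds_ratio:.3f}")
print(f"change in odds per one-unit increase in Sum = {percent_change:.1f}%")
```

An odds ratio below 1 means the odds of a family history of overweight fall as Sum increases.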

With a pseudo R-squared of 0.0069, it is evident that the model is not a good fit for the data. A value > 0.2 would indicate a good fit; since the value for this model is around 0.007, different models could be proposed to better fit the data in the way it is compiled.

The LLR p-value is associated with a likelihood-ratio test comparing 2 models: one that reflects the null hypothesis (intercept only), and another that includes the relationship between the 2 selected variables.

Here the LLR p-value is > 0.05, so, applying the same decision rule as for the coefficient p-value, we fail to reject the null. The model is therefore unlikely to give a meaningful representation of the data.
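
Both the pseudo R-squared and the LLR p-value are functions of two log-likelihoods: that of the fitted model and that of the intercept-only model (statsmodels exposes these on the fitted result as `llf` and `llnull`). The sketch below shows the arithmetic for a model with one predictor, using hypothetical log-likelihood values rather than the ones from this notebook; the function names are illustrative.

```python
import math

def mcfadden_r2(llf, llnull):
    """McFadden's pseudo R-squared: 1 - lnL(model) / lnL(null)."""
    return 1.0 - llf / llnull

def llr_test_one_df(llf, llnull):
    """Likelihood-ratio statistic and p-value for one added predictor (df = 1)."""
    stat = 2.0 * (llf - llnull)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods, not the values from this notebook's model.
llf, llnull = -495.0, -500.0
r2 = mcfadden_r2(llf, llnull)
stat, p = llr_test_one_df(llf, llnull)
print(f"pseudo R2 = {r2:.4f}, LLR stat = {stat:.2f}, p = {p:.4f}")
```

This illustrates that a model can have a small LLR p-value and still a very low pseudo R-squared: the first measures whether the predictor helps at all, the second how much.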


The way you interpret the p-value is the same, in the sense that you reject H0 if it is less than the chosen threshold.

In plainer language, this particular model is not great at capturing the relationship between family history of overweight and the Sum variable created from the 4 lifestyle choice indicators.
There may still be a relationship: overweight individuals with overweight relatives may still be more (or less) conscious of their lifestyle choices. However, this summed variable and the proposed model are not the best way of showing it.

Perhaps the fault lies with the additive approach, in that eating vegetables, tracking calories, physical activity, and consumption of high-caloric foods are not interrelated. A more accurate summed variable could be created, or each variable could be investigated on its own.
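
One way to investigate the variables on their own would be to enter them as separate predictors, so that each habit gets its own coefficient. The sketch below is only an illustration of that idea: it fits the model on randomly generated stand-in data with the same column names (`toy` and `separate` are illustrative names), so its coefficients carry no meaning about the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in data; values are random and purely illustrative.
toy = pd.DataFrame({
    "FAF": rng.uniform(1, 3, n),
    "FCVC": rng.uniform(1, 3, n),
    "FAVC": rng.integers(0, 2, n).astype(float),
    "SCC": rng.integers(0, 2, n).astype(float),
    "family_history_with_overweight": rng.integers(0, 2, n).astype(float),
})

# One coefficient per habit instead of a single coefficient on Sum.
separate = smf.logit(
    "family_history_with_overweight ~ FAF + FCVC + FAVC + SCC", data=toy
).fit(disp=False)

print(separate.params.round(3))
```

On the real data, the same formula could be passed with `data=obobesity`, and the multicollinearity check (VIF) would then become relevant since there is more than one predictor.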