# Predictive Analysis of the Percentage of Obese Population on the Nutrition__Physical_Activity__and_Obesity Dataset

### About Dataset
#### Obesity is a worldwide problem that causes many serious medical conditions. The share of the population with obesity is projected to reach about 45% by 2035, and the proportion of morbid obesity and the associated healthcare costs are expected to rise with it. A system that estimates the percentage of the obese population for a given period, based on attributes such as age range, income range, location, education, gender, class, and the low and high confidence limits of the estimate, can help in the fight against obesity.

#### - Load and analyse the dataset
#### - Clean the dataset of null values and irrelevant columns, if any
#### - Drop unnecessary columns from the dataset
#### - Apply feature encoding and scaling to the required columns
#### - Split the data into training and test sets
#### - Because the target is a continuous percentage, this is a regression problem, so a linear regression algorithm is used
#### - Evaluate the model with the R² score

### Importing Basic Libraries


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Load the data


In [2]:
data = pd.read_csv("Nutrition__Physical_Activity__and_Obesity_-_Behavioral_Risk_Factor_Surveillance_System.csv")
data.head()

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,Datasource,Class,Topic,Question,Data_Value_Unit,Data_Value_Type,...,GeoLocation,ClassID,TopicID,QuestionID,DataValueTypeID,LocationID,StratificationCategory1,Stratification1,StratificationCategoryId1,StratificationID1
0,2011,2011,AL,Alabama,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(32.84057112200048, -86.63186076199969)",OWS,OWS1,Q036,VALUE,1,Total,Total,OVR,OVERALL
1,2011,2011,AL,Alabama,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(32.84057112200048, -86.63186076199969)",OWS,OWS1,Q036,VALUE,1,Gender,Male,GEN,MALE
2,2011,2011,AL,Alabama,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(32.84057112200048, -86.63186076199969)",OWS,OWS1,Q036,VALUE,1,Gender,Female,GEN,FEMALE
3,2011,2011,AL,Alabama,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(32.84057112200048, -86.63186076199969)",OWS,OWS1,Q036,VALUE,1,Education,Less than high school,EDU,EDUHS
4,2011,2011,AL,Alabama,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(32.84057112200048, -86.63186076199969)",OWS,OWS1,Q036,VALUE,1,Education,High school graduate,EDU,EDUHSGRAD


### Analysing the data from the Dataset

In [3]:
data.columns

Index(['YearStart', 'YearEnd', 'LocationAbbr', 'LocationDesc', 'Datasource',
       'Class', 'Topic', 'Question', 'Data_Value_Unit', 'Data_Value_Type',
       'Data_Value', 'Data_Value_Alt', 'Data_Value_Footnote_Symbol',
       'Data_Value_Footnote', 'Low_Confidence_Limit', 'High_Confidence_Limit ',
       'Sample_Size', 'Total', 'Age(years)', 'Education', 'Gender', 'Income',
       'Race/Ethnicity', 'GeoLocation', 'ClassID', 'TopicID', 'QuestionID',
       'DataValueTypeID', 'LocationID', 'StratificationCategory1',
       'Stratification1', 'StratificationCategoryId1', 'StratificationID1'],
      dtype='object')
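
The listing above shows `'High_Confidence_Limit '` with a trailing space, which is easy to mistype in later cells. A minimal sketch of normalizing the headers with `str.strip()`, using a small hypothetical frame (`toy`) rather than the real CSV:

```python
import pandas as pd

# Hypothetical frame that mimics the trailing space in the real header
toy = pd.DataFrame({
    "Low_Confidence_Limit": [30.5],
    "High_Confidence_Limit ": [33.5],
})

# Strip leading and trailing whitespace from every column name
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

If the same step were applied to `data`, any later cell that selects `'High_Confidence_Limit '` would have to use the name without the trailing space.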

In [4]:
#Find the shape of the Data
print("SHAPE: ",data.shape)

SHAPE:  (53392, 33)


### Checking for any missing values


In [5]:
print(data.isnull().any())

YearStart                     False
YearEnd                       False
LocationAbbr                  False
LocationDesc                  False
Datasource                    False
Class                         False
Topic                         False
Question                      False
Data_Value_Unit                True
Data_Value_Type               False
Data_Value                     True
Data_Value_Alt                 True
Data_Value_Footnote_Symbol     True
Data_Value_Footnote            True
Low_Confidence_Limit           True
High_Confidence_Limit          True
Sample_Size                    True
Total                          True
Age(years)                     True
Education                      True
Gender                         True
Income                         True
Race/Ethnicity                 True
GeoLocation                    True
ClassID                       False
TopicID                       False
QuestionID                    False
DataValueTypeID             

In [6]:
data.isnull().sum() / len(data) * 100


YearStart                       0.000000
YearEnd                         0.000000
LocationAbbr                    0.000000
LocationDesc                    0.000000
Datasource                      0.000000
Class                           0.000000
Topic                           0.000000
Question                        0.000000
Data_Value_Unit               100.000000
Data_Value_Type                 0.000000
Data_Value                      9.450854
Data_Value_Alt                  9.450854
Data_Value_Footnote_Symbol     90.549146
Data_Value_Footnote            90.549146
Low_Confidence_Limit            9.450854
High_Confidence_Limit           9.450854
Sample_Size                     9.450854
Total                          96.428304
Age(years)                     78.577315
Education                      85.713215
Gender                         92.856608
Income                         74.998127
Race/Ethnicity                 71.426431
GeoLocation                     1.887923
ClassID         

#### Note: Total, Age(years), Education, Gender, Income, and Race/Ethnicity are dropped because of their high percentage of missing values.
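
The selection described in the note can also be made programmatically. The sketch below is one possible way to do it, not the notebook's own method: the helper name `high_missing_columns`, the 50% threshold, and the toy frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def high_missing_columns(frame: pd.DataFrame, threshold: float = 50.0) -> list:
    """Return the names of columns whose missing-value percentage exceeds threshold."""
    pct_missing = frame.isnull().mean() * 100
    return pct_missing[pct_missing > threshold].index.tolist()

# Hypothetical frame: Data_Value is 25% missing, Gender is 75% missing
toy = pd.DataFrame({
    "Data_Value": [32.0, 32.3, np.nan, 33.6],
    "Gender": [np.nan, "Male", np.nan, np.nan],
    "LocationID": [1, 1, 1, 1],
})

sparse_cols = high_missing_columns(toy, threshold=50.0)
print(sparse_cols)
```

In the percentages above, the non-missing shares of Total, Age(years), Education, Gender, Income, and Race/Ethnicity add up to roughly 100%, which suggests that each row fills only one of these columns and that the same information may be available in `StratificationCategory1` and `Stratification1`.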

### Taking the necessary columns in data


In [7]:
# Note: 'High_Confidence_Limit ' keeps the trailing space found in the CSV header
necessary = ['YearStart', 'Data_Value', 'Low_Confidence_Limit', 'High_Confidence_Limit ',
             'Sample_Size', 'LocationID', 'ClassID']
data = data[necessary]
data


Unnamed: 0,YearStart,Data_Value,Low_Confidence_Limit,High_Confidence_Limit,Sample_Size,LocationID,ClassID
0,2011,32.0,30.5,33.5,7304.0,1,OWS
1,2011,32.3,29.9,34.7,2581.0,1,OWS
2,2011,31.8,30.0,33.6,4723.0,1,OWS
3,2011,33.6,29.9,37.6,1153.0,1,OWS
4,2011,32.8,30.2,35.6,2402.0,1,OWS
...,...,...,...,...,...,...,...
53387,2016,,,,,78,PA
53388,2016,,,,,78,PA
53389,2016,,,,,78,PA
53390,2016,,,,,78,PA


In [8]:
data.isnull().sum() / len(data) * 100


YearStart                 0.000000
Data_Value                9.450854
Low_Confidence_Limit      9.450854
High_Confidence_Limit     9.450854
Sample_Size               9.450854
LocationID                0.000000
ClassID                   0.000000
dtype: float64

### Dropping missing values from the dataset


In [9]:
data = data.dropna()
data

Unnamed: 0,YearStart,Data_Value,Low_Confidence_Limit,High_Confidence_Limit,Sample_Size,LocationID,ClassID
0,2011,32.0,30.5,33.5,7304.0,1,OWS
1,2011,32.3,29.9,34.7,2581.0,1,OWS
2,2011,31.8,30.0,33.6,4723.0,1,OWS
3,2011,33.6,29.9,37.6,1153.0,1,OWS
4,2011,32.8,30.2,35.6,2402.0,1,OWS
...,...,...,...,...,...,...,...
53382,2016,13.3,8.0,21.2,212.0,78,PA
53383,2016,25.3,16.4,37.0,137.0,78,PA
53384,2016,18.3,10.8,29.2,154.0,78,PA
53385,2016,24.1,19.9,28.9,820.0,78,PA


In [10]:
data.isnull().sum() / len(data) * 100


YearStart                 0.0
Data_Value                0.0
Low_Confidence_Limit      0.0
High_Confidence_Limit     0.0
Sample_Size               0.0
LocationID                0.0
ClassID                   0.0
dtype: float64

### Feature Encoding on the dataset


In [11]:
df = pd.get_dummies(data, columns=['YearStart','ClassID'])
df

Unnamed: 0,Data_Value,Low_Confidence_Limit,High_Confidence_Limit,Sample_Size,LocationID,YearStart_2011,YearStart_2012,YearStart_2013,YearStart_2014,YearStart_2015,YearStart_2016,ClassID_FV,ClassID_OWS,ClassID_PA
0,32.0,30.5,33.5,7304.0,1,1,0,0,0,0,0,0,1,0
1,32.3,29.9,34.7,2581.0,1,1,0,0,0,0,0,0,1,0
2,31.8,30.0,33.6,4723.0,1,1,0,0,0,0,0,0,1,0
3,33.6,29.9,37.6,1153.0,1,1,0,0,0,0,0,0,1,0
4,32.8,30.2,35.6,2402.0,1,1,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53382,13.3,8.0,21.2,212.0,78,0,0,0,0,0,1,0,0,1
53383,25.3,16.4,37.0,137.0,78,0,0,0,0,0,1,0,0,1
53384,18.3,10.8,29.2,154.0,78,0,0,0,0,0,1,0,0,1
53385,24.1,19.9,28.9,820.0,78,0,0,0,0,0,1,0,0,1
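
The output above shows 0/1 indicator columns. Recent pandas releases (2.0 and later, as far as I am aware) return boolean columns from `get_dummies` by default, so the sketch below passes `dtype=int` to keep the 0/1 form. The toy frame is hypothetical and only borrows the column names from the notebook.

```python
import pandas as pd

# Hypothetical rows that reuse the notebook's column names
toy = pd.DataFrame({
    "Data_Value": [32.0, 24.1, 18.3],
    "YearStart": [2011, 2016, 2016],
    "ClassID": ["OWS", "PA", "PA"],
})

# dtype=int forces 0/1 integers regardless of the pandas default
encoded = pd.get_dummies(toy, columns=["YearStart", "ClassID"], dtype=int)

print(encoded.columns.tolist())
```

Keeping every dummy level together with an intercept makes the indicator columns linearly dependent, so the individual coefficients are not unique. Passing `drop_first=True` is one common way to avoid that; the notebook does not do so.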


### Feature Scaling


In [12]:
from sklearn.preprocessing import RobustScaler
rs = RobustScaler()
rs

RobustScaler()

In [13]:
df[["Sample_Size"]] = rs.fit_transform(df[["Sample_Size"]])
df.head()

Unnamed: 0,Data_Value,Low_Confidence_Limit,High_Confidence_Limit,Sample_Size,LocationID,YearStart_2011,YearStart_2012,YearStart_2013,YearStart_2014,YearStart_2015,YearStart_2016,ClassID_FV,ClassID_OWS,ClassID_PA
0,32.0,30.5,33.5,3.12084,1,1,0,0,0,0,0,0,1,0
1,32.3,29.9,34.7,0.702509,1,1,0,0,0,0,0,0,1,0
2,31.8,30.0,33.6,1.799283,1,1,0,0,0,0,0,0,1,0
3,33.6,29.9,37.6,-0.028674,1,1,0,0,0,0,0,0,1,0
4,32.8,30.2,35.6,0.610855,1,1,0,0,0,0,0,0,1,0
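
With its default settings, `RobustScaler` subtracts the median of a column and divides by its interquartile range (75th minus 25th percentile). The sketch below checks that against a manual computation on the five sample sizes shown above. Because the scaler here is fit on only those five rows, the scaled values differ from the notebook output, which was fit on the whole column.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Sample sizes taken from the first five rows shown above
sizes = np.array([[7304.0], [2581.0], [4723.0], [1153.0], [2402.0]])

scaled = RobustScaler().fit_transform(sizes)

# Manual equivalent: subtract the median, divide by the interquartile range
median = np.median(sizes)
q25, q75 = np.percentile(sizes, [25, 75])
manual = (sizes - median) / (q75 - q25)

print(np.allclose(scaled, manual))
```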


### Split the data into training and testing sets


In [14]:
from sklearn.model_selection import train_test_split
X = df.drop('Data_Value', axis=1)
y = df['Data_Value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
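
In the cells above, `RobustScaler` is fit on the full frame before the split, so the test rows contribute to the median and interquartile range. A common alternative is to wrap scaling and regression in a `Pipeline`, so that the scaler is fit on the training rows only. The sketch below illustrates this on synthetic data; the generated values and the step names (`scale_size`, `preprocess`, `lr`) are assumptions, and this is not the notebook's own workflow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the cleaned data (values are made up)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Sample_Size": rng.integers(50, 8000, size=n).astype(float),
    "LocationID": rng.integers(1, 79, size=n),
})
toy["Data_Value"] = 30 + 0.001 * toy["Sample_Size"] + rng.normal(0, 1, size=n)

X = toy.drop("Data_Value", axis=1)
y = toy["Data_Value"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale only Sample_Size; pass the remaining columns through unchanged
preprocess = ColumnTransformer(
    [("scale_size", RobustScaler(), ["Sample_Size"])],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("lr", LinearRegression())])

# fit() learns the scaler statistics from X_train only
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

`model.score` returns the R² on the held-out rows, so it plays the same role as the `r2_score` call used later in the notebook.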

### Build and train a linear regression model


In [15]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr

LinearRegression()

In [16]:
lr.fit(X_train, y_train)

LinearRegression()

In [17]:
y_pred = lr.predict(X_test)


### Calculate the R² score


In [18]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print("r2_score:", r2)

r2_score: 0.9988032633230085
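
`r2_score` reports the coefficient of determination, 1 - SS_res / SS_tot, which is a goodness-of-fit measure rather than a classification accuracy. The sketch below reproduces it by hand and adds two error metrics in the target's own units (percentage points). The arrays `y_true` and `y_hat` are hypothetical values chosen for illustration, not notebook results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true and predicted percentages, for illustration only
y_true = np.array([32.0, 32.3, 31.8, 33.6, 32.8])
y_hat = np.array([31.9, 32.5, 31.8, 33.4, 33.0])

# R² by hand: 1 - residual sum of squares / total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Errors expressed in percentage points
mae = mean_absolute_error(y_true, y_hat)
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(r2_manual, r2_score(y_true, y_hat), mae, rmse)
```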


## The linear regression model reaches an R² of about 0.9988 on the test set, meaning it explains roughly 99.9% of the variance in Data_Value
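
The features include `Low_Confidence_Limit` and `High_Confidence_Limit`, which bracket `Data_Value` by construction, so it may be worth checking how much of the score depends on them. The sketch below shows one way to run that comparison. It uses synthetic data built so that the target lies between the two limits, and the helper `holdout_r2` is hypothetical, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target sits between its lower and upper limits
rng = np.random.default_rng(0)
n = 500
value = rng.uniform(10, 40, size=n)
half_width = rng.uniform(1, 5, size=n)
toy = pd.DataFrame({
    "Data_Value": value,
    "Low_Confidence_Limit": value - half_width,
    "High_Confidence_Limit": value + half_width * rng.uniform(0.8, 1.2, size=n),
    "Sample_Size": rng.integers(50, 8000, size=n).astype(float),
})

def holdout_r2(feature_cols):
    """Fit a linear regression on feature_cols and return the test-set R²."""
    X_train, X_test, y_train, y_test = train_test_split(
        toy[feature_cols], toy["Data_Value"], test_size=0.2, random_state=42
    )
    fitted = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, fitted.predict(X_test))

r2_with_limits = holdout_r2(
    ["Low_Confidence_Limit", "High_Confidence_Limit", "Sample_Size"]
)
r2_without_limits = holdout_r2(["Sample_Size"])
print(r2_with_limits, r2_without_limits)
```

Running the same comparison on the notebook's `df`, with and without the two confidence-limit columns, would show how well the remaining features predict `Data_Value` on their own.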