## Topic: Obesity, Physical Activity and Diet (England) 
---
Data description:



|<center> Definition of| Variables used in the datasets|
|:------------|:-------------------------------------------------------------------------------------------| 
| <center>   Year     | Financial year within which the episode finished|
| <center>   ONS_Code  | ONS nine-character geographic code|
| <center>  Org_Code  | ODS organisational code|
| <center>  Org_Name   | ODS organisational name|
| <center>  Org_Type   | ODS organisational type|
| <center>Classification|Measure by which the metrics can be broken down by: <br/><br/> FAE_Primary_Obesity - Finished Admission Episodes with a primary diagnosis of Obesity <br/> FAE_PrimarySecondary_Obesity - Finished Admission Episodes with a primary or secondary diagnosis of obesity <br/> FCE_PrimarySecondary_Obesity_Bariatric - Finished Consultant Episodes with a primary diagnosis of obesity and a main or secondary procedure of 'Bariatric Surgery'|
| <center> Metric_Primary|Demographic by which the data is presented (gender or age group)|
| <center>Metric_Secondary|Demographic breakdown|
| <center>Value|Number of admissions for each Classification/Metric|

Source: https://digital.nhs.uk/catalogue/PUB23742

## Questions for Time Series Dataset
1. Check the document 'obes-phys-acti-diet-eng-2017-rep', pages 8, 9 and 16.
2. Analyse admissions by age group and by gender across years, and calculate the percentage increase.
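
The percentage increase in question 2 can be sketched as follows. This is a minimal illustration rather than the final analysis: the counts are the England totals summed over age groups for three of the years in the time series data, and the names `counts`, `pct_last_year` and `pct_since_start` are made up for this example.

```python
import pandas as pd

# England totals (sum over age groups) for three selected years.
counts = pd.DataFrame(
    {
        "2002/03": [29199, 1275, 345],
        "2014/15": [440273, 9130, 6032],
        "2015/16": [524704, 9929, 6438],
    },
    index=[
        "FAE_PrimarySecondary_Obesity",
        "FAE_Primary_Obesity",
        "FCE_PrimarySecondary_Obesity_Bariatric",
    ],
)

# Percentage change = (new / old - 1) * 100, computed row by row.
pct_last_year = (counts["2015/16"] / counts["2014/15"] - 1) * 100
pct_since_start = (counts["2015/16"] / counts["2002/03"] - 1) * 100

print(pct_last_year.round(1))
print(pct_since_start.round(1))
```

The same formula applies to any pair of year columns once the data are pivoted so that each year is a column.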

## Questions for CCG Dataset
1. Calculate obesity prevalence by region (see 'obes-phys-acti-diet-eng-2017-rep', page 15)

## Differences between df1 & df2 -> mainly the region
1. df1 only includes England as a whole, while df2 also includes parts of England (south England...)
2. df1 covers several years, but df2 only covers 2015/16
3. df2 has missing values in Org_Code (9 rows, according to the isnull() check)
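
A quick way to check how the two layouts differ is to compare their column sets. The sketch below uses empty stand-in frames (`ts` and `ccg` are names made up for this example) carrying the column names reported for df1 (after its empty columns are dropped) and df2; it is an illustration, not part of the original analysis.

```python
import pandas as pd

# Stand-in frames carrying only the column names of df1 and df2.
ts = pd.DataFrame(columns=[
    "Year", "ONS_Code", "Org_Code", "Org_Name",
    "Classification", "Metric_Primary", "Metric_Secondary", "Value",
])
ccg = pd.DataFrame(columns=[
    "Year", "Org_Type", "ONS_Code", "Org_Code", "Org_Name",
    "Classification", "Metric_Primary", "Value",
])

# Columns present in one frame but not in the other.
only_in_ts = sorted(set(ts.columns) - set(ccg.columns))
only_in_ccg = sorted(set(ccg.columns) - set(ts.columns))

print("only in time series:", only_in_ts)
print("only in CCG:", only_in_ccg)
```

On the real frames, `df1['Year'].unique()` and `df2['Year'].unique()` would show the difference in time span in the same way.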

In [4]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

In [5]:
# import libraries
import pandas as pd
import numpy as np
import os

# read data
#path="/Users/chloe/Desktop/ST445_Project"
path='/Users/lin/Desktop/ST445_Project' # another laptop
os.chdir(path)
df1 = pd.read_csv('Time_series_data_1516.csv', sep=',') # Time Series Dataset
df2 = pd.read_csv('CCG_data_1516.csv', sep=',') # CCG: Clinical Commissioning Groups
df3 = pd.read_csv('LA_data_1516.csv', sep=',')

In [6]:
# check dimension of data
print(df1.shape)
print(df2.shape)
print(df3.shape)

# check missing value
print(df1.isnull().sum())
print(df2.isnull().sum())
print(df3.isnull().sum())

(462, 16)
(2043, 8)
(1458, 8)
Year                  0
ONS_Code              0
Org_Code              0
Org_Name              0
Classification        0
Metric_Primary        0
Metric_Secondary      0
Value                 0
Unnamed: 8          462
Unnamed: 9          462
Unnamed: 10         462
Unnamed: 11         462
Unnamed: 12         462
Unnamed: 13         462
Unnamed: 14         462
Unnamed: 15         462
dtype: int64
Year              0
Org_Type          0
ONS_Code          0
Org_Code          9
Org_Name          0
Classification    0
Metric_Primary    0
Value             0
dtype: int64
Year              0
Org_Type          0
ONS_Code          0
Org_Code          9
Org_Name          0
Classification    0
Metric_Primary    0
Value             0
dtype: int64


In [7]:
# inspect the dataframe internals: columns 8-15 are empty 'Unnamed' columns
print(vars(df1))
# drop the eight empty columns by position
df1 = df1.drop(df1.columns[8:16], axis=1)
df1.head(2)

{'is_copy': None, '_data': BlockManager
Items: Index(['Year', 'ONS_Code', 'Org_Code', 'Org_Name', 'Classification',
       'Metric_Primary', 'Metric_Secondary', 'Value', 'Unnamed: 8',
       'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12',
       'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15'],
      dtype='object')
Axis 1: RangeIndex(start=0, stop=462, step=1)
FloatBlock: slice(8, 16, 1), 8 x 462, dtype: float64
IntBlock: slice(7, 8, 1), 1 x 462, dtype: int64
ObjectBlock: slice(0, 7, 1), 7 x 462, dtype: object, '_item_cache': {}}


,Year,ONS_Code,Org_Code,Org_Name,Classification,Metric_Primary,Metric_Secondary,Value
0,2015/16,E92000001,ENG,England,FAE_Primary_Obesity,Gender,All persons,9929
1,2015/16,E92000001,ENG,England,FAE_Primary_Obesity,Gender,Male,2573


In [8]:
# sort the data by Year, Classification and Metric_Primary
# (sort_values replaces the deprecated sort_index(by=...))
df1 = df1.sort_values(by=['Year','Classification','Metric_Primary'])

# reset index because sort will change the order
df1 = df1.reset_index(drop=True)

  


In [9]:
# sum Value by 'Year' & 'Classification' for the rows where Metric_Primary == 'AgeGroup'
df0 = df1[df1.Metric_Primary=='AgeGroup'].groupby(['Year', 'Classification']).agg({'Value': 'sum'}).rename(columns={'Value': 'Count'}).reset_index()
df1 = pd.merge(df1, df0, how='left', on=['Year', 'Classification'])

#df0 = df1.loc[df1['Metric_Primary']=='AgeGroup'].groupby(['Year', 'Classification'])[['Value']].sum().rename(columns={'Value': 'Count'})

In [10]:
# pivot df0 so that the counts can be plotted as curves or compared across years
df0_p = df0.pivot(index='Classification', columns='Year', values='Count'); df0_p

Classification,2002/03,2003/04,2004/05,2005/06,2006/07,2007/08,2008/09,2009/10,2010/11,2011/12,2012/13,2013/14,2014/15,2015/16
FAE_PrimarySecondary_Obesity,29199,33524,40724,51997,67163,80772,102834,142061,211499,266659,292396,365568,440273,524704
FAE_Primary_Obesity,1275,1711,2034,2561,3862,5014,7985,10569,11566,11736,10957,9325,9130,9929
FCE_PrimarySecondary_Obesity_Bariatric,345,474,743,1035,1951,2722,4219,7213,8082,8794,8024,6384,6032,6438


In [11]:
# Ideally, Count should equal the 'Gender' / 'All persons' value for the same year and classification
df1.head(10)

,Year,ONS_Code,Org_Code,Org_Name,Classification,Metric_Primary,Metric_Secondary,Value,Count
0,2002/03,E92000001,ENG,England,FAE_PrimarySecondary_Obesity,AgeGroup,16-24,912,29199
1,2002/03,E92000001,ENG,England,FAE_PrimarySecondary_Obesity,AgeGroup,25-34,2288,29199
2,2002/03,E92000001,ENG,England,FAE_PrimarySecondary_Obesity,AgeGroup,35-44,4371,29199
3,2002/03,E92000001,ENG,England,FAE_PrimarySecondary_Obesity,AgeGroup,45-54,5661,29199
4,2002/03,E92000001,ENG,England,FAE_PrimarySecondary_Obesity,AgeGroup,55-64,6721,29199
5,2002/03,E92000001,ENG,England,FAE_PrimarySecondary_Obesity,AgeGroup,65-74,5391,29199
6,2002/03,E92000001,ENG,England,FAE_PrimarySecondary_Obesity,AgeGroup,75+,2738,29199
7,2002/03,E92000001,ENG,England,FAE_PrimarySecondary_Obesity,AgeGroup,Under 16,1117,29199
8,2002/03,E92000001,ENG,England,FAE_PrimarySecondary_Obesity,Gender,All persons,29237,29199
9,2002/03,E92000001,ENG,England,FAE_PrimarySecondary_Obesity,Gender,Male,12068,29199
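
The rows above already show that the two totals do not quite agree for 2002/03: the age groups sum to 29199 while 'All persons' is 29237. A minimal sketch of that check follows, using only the 2002/03 FAE_PrimarySecondary_Obesity rows shown above (`demo`, `age_total`, `all_persons` and `gap` are names made up for this example). The source does not say why the totals differ; records without a known age would be one possible explanation, but that is an assumption.

```python
import pandas as pd

# 2002/03 FAE_PrimarySecondary_Obesity rows copied from the output above.
rows = [
    ("AgeGroup", "16-24", 912),
    ("AgeGroup", "25-34", 2288),
    ("AgeGroup", "35-44", 4371),
    ("AgeGroup", "45-54", 5661),
    ("AgeGroup", "55-64", 6721),
    ("AgeGroup", "65-74", 5391),
    ("AgeGroup", "75+", 2738),
    ("AgeGroup", "Under 16", 1117),
    ("Gender", "All persons", 29237),
]
demo = pd.DataFrame(rows, columns=["Metric_Primary", "Metric_Secondary", "Value"])

# Total over the age groups versus the published 'All persons' figure.
age_total = int(demo.loc[demo["Metric_Primary"] == "AgeGroup", "Value"].sum())
is_all = (demo["Metric_Primary"] == "Gender") & (demo["Metric_Secondary"] == "All persons")
all_persons = int(demo.loc[is_all, "Value"].iloc[0])
gap = all_persons - age_total

print(age_total, all_persons, gap)
```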


In [12]:
# sort_values replaces the deprecated sort_index(by=...)
df2 = df2.sort_values(by=['Org_Name','Org_Type'])


In [15]:
print(df1.head(3))
print(df2.head(3))

      Year   ONS_Code Org_Code Org_Name                Classification  \
0  2002/03  E92000001      ENG  England  FAE_PrimarySecondary_Obesity   
1  2002/03  E92000001      ENG  England  FAE_PrimarySecondary_Obesity   
2  2002/03  E92000001      ENG  England  FAE_PrimarySecondary_Obesity   

  Metric_Primary Metric_Secondary  Value  Count  
0       AgeGroup            16-24    912  29199  
1       AgeGroup            25-34   2288  29199  
2       AgeGroup            35-44   4371  29199  
        Year  Org_Type   ONS_Code Org_Code Org_Name       Classification  \
0    2015/16  National  E92000001      NaN  ENGLAND  FAE_Primary_Obesity   
227  2015/16  National  E92000001      NaN  ENGLAND  FAE_Primary_Obesity   
454  2015/16  National  E92000001      NaN  ENGLAND  FAE_Primary_Obesity   

    Metric_Primary  Value  
0      All persons  9,929  
227           Male  2,573  
454         Female  7,356  
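
Note that Value in df2 is printed with thousands separators ('9,929'), which suggests the column was read as text rather than as numbers. A minimal sketch of converting such strings is shown below; `demo` and `Value_num` are names made up for this example, and the three values are the ones printed above.

```python
import pandas as pd

# The three df2 rows printed above, with Value as text.
demo = pd.DataFrame({
    "Metric_Primary": ["All persons", "Male", "Female"],
    "Value": ["9,929", "2,573", "7,356"],
})

# Strip the thousands separator, then convert; anything that is still
# not a number becomes NaN instead of raising an error.
demo["Value_num"] = pd.to_numeric(
    demo["Value"].str.replace(",", "", regex=False),
    errors="coerce",
)

print(demo)
```

Passing `thousands=','` to `pd.read_csv` is another option, provided the column contains no other non-numeric markers.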


In [25]:
# sort_values replaces the deprecated sort_index(by=...)
df3 = df3.sort_values(by=['Org_Name'])

# all rows are for 2015/16
df3_all = df3[df3.Metric_Primary=='All persons']
df3_p_name = df3_all.pivot(index='Org_Name', columns='Classification', values='Value').reset_index()
# pivot() with a list as index raised
# "ValueError: Wrong number of items passed 486, placement implies 2" here,
# so build the two-level index with set_index() and unstack() instead
df3_p_type = df3_all.set_index(['Org_Type','Org_Name','Classification'])['Value'].unstack('Classification').reset_index()