# Obesity analysis

In this project, you answer a set of questions about the following dataset:

[Obesity among adults by country, 1975-2016](https://www.kaggle.com/amanarora/obesity-among-adults-by-country-19752016/)

The data is public and was published on Kaggle.

## Questions

1. What was the average obesity percentage by sex worldwide in 2015?

2. Which 5 countries had the largest and the smallest increase in obesity rates over the observed period?

3. Which countries had the highest and lowest obesity percentages in 2015?

4. What is the average difference in obesity percentage between the sexes over the years for Brazil?

5. Can you plot a chart showing the evolution of obesity for both sexes worldwide?

## EDA
---

### data loading

In [36]:
import pandas as pd

df = pd.read_csv('../datasets/obesity_cleaned.csv')

### data exploration and preprocessing

In [3]:
# the full dataset
df

Unnamed: 0.1,Unnamed: 0,Country,Year,Obesity (%),Sex
0,0,Afghanistan,1975,0.5 [0.2-1.1],Both sexes
1,1,Afghanistan,1975,0.2 [0.0-0.6],Male
2,2,Afghanistan,1975,0.8 [0.2-2.0],Female
3,3,Afghanistan,1976,0.5 [0.2-1.1],Both sexes
4,4,Afghanistan,1976,0.2 [0.0-0.7],Male
...,...,...,...,...,...
24565,24565,Zimbabwe,2015,4.5 [2.4-7.6],Male
24566,24566,Zimbabwe,2015,24.8 [18.9-31.3],Female
24567,24567,Zimbabwe,2016,15.5 [12.0-19.2],Both sexes
24568,24568,Zimbabwe,2016,4.7 [2.5-8.0],Male


In [4]:
df.shape

(24570, 5)

In [5]:
df.columns

Index(['Unnamed: 0', 'Country', 'Year', 'Obesity (%)', 'Sex'], dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24570 entries, 0 to 24569
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   24570 non-null  int64 
 1   Country      24570 non-null  object
 2   Year         24570 non-null  int64 
 3   Obesity (%)  24570 non-null  object
 4   Sex          24570 non-null  object
dtypes: int64(2), object(3)
memory usage: 959.9+ KB


In [7]:
df.describe()

Unnamed: 0.1,Unnamed: 0,Year
count,24570.0,24570.0
mean,12284.5,1995.5
std,7092.892393,12.121165
min,0.0,1975.0
25%,6142.25,1985.0
50%,12284.5,1995.5
75%,18426.75,2006.0
max,24569.0,2016.0


In [8]:
df.Year.unique()

array([1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985,
       1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996,
       1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007,
       2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016])

In [9]:
df.Country.unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia (Plurinational State of)', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Brunei Darussalam', 'Bulgaria',
       'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo', 'Cook Islands', 'Costa Rica',
       'Croatia', 'Cuba', 'Cyprus', 'Czechia', "Côte d'Ivoire",
       "Democratic People's Republic of Korea",
       'Democratic Republic of the Congo', 'Denmark', 'Djibouti',
       'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Eswatini', 'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon',
       'Gambia

In [10]:
df.Sex.unique()

array(['Both sexes', 'Male', 'Female'], dtype=object)

In [13]:
df.Sex.value_counts()

Sex
Both sexes    8190
Male          8190
Female        8190
Name: count, dtype: int64

In [14]:
df['Obesity (%)'].unique()

array(['0.5 [0.2-1.1]', '0.2 [0.0-0.6]', '0.8 [0.2-2.0]', ...,
       '24.4 [18.8-30.6]', '24.8 [18.9-31.3]', '4.7 [2.5-8.0]'],
      dtype=object)

In [12]:
df['Obesity (%)'].value_counts()

Obesity (%)
No data            504
0.4 [0.1-1.0]       55
0.6 [0.2-1.3]       47
0.3 [0.1-0.7]       46
0.3 [0.1-0.8]       46
                  ... 
9.5 [6.7-13.1]       1
4.4 [2.2-7.6]        1
14.1 [9.1-20.3]      1
9.8 [6.9-13.4]       1
4.7 [2.5-8.0]        1
Name: count, Length: 16375, dtype: int64

In [37]:
# drop the leftover 'Unnamed: 0' column (a copy of the row index)
df.drop(columns=['Unnamed: 0'], inplace=True)
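
An alternative to dropping the column afterwards is to keep it from appearing at load time. The sketch below assumes the CSV's first column has an empty header, which is the usual reason pandas names it `Unnamed: 0`; the inline `csv_text` is a stand-in for the real file and reuses three rows from the preview above.

```python
import io

import pandas as pd

# Stand-in for obesity_cleaned.csv: the first column has no header name.
csv_text = """,Country,Year,Obesity (%),Sex
0,Afghanistan,1975,0.5 [0.2-1.1],Both sexes
1,Afghanistan,1975,0.2 [0.0-0.6],Male
2,Afghanistan,1975,0.8 [0.2-2.0],Female
"""

# index_col=0 uses the first column as the row index instead of
# loading it as a regular column called 'Unnamed: 0'.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.columns.tolist())
```

With the real file, the same `index_col=0` argument would be passed to the `pd.read_csv` call from the loading cell.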

In [38]:
# drop rows that have no measurement ('No data')
df.drop(df[df['Obesity (%)'] == 'No data'].index, inplace=True)
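
Before discarding the `No data` rows, it can help to see which countries they belong to, since a country with no measurements at all disappears from every later result. This is a minimal sketch on toy data; the country names `A` and `B` are placeholders, not values from the dataset.

```python
import pandas as pd

# Toy frame: country "B" has no measurements at all.
sample = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [1975, 1976, 1975, 1976],
    "Obesity (%)": ["0.5 [0.2-1.1]", "0.6 [0.2-1.3]", "No data", "No data"],
})

# Boolean mask marking the rows without a measurement.
missing = sample["Obesity (%)"].eq("No data")

# Count the missing rows per country and show only the affected countries.
missing_per_country = missing.groupby(sample["Country"]).sum()
print(missing_per_country[missing_per_country > 0])

# Equivalent to the drop above, written as a boolean filter.
cleaned = sample[~missing].copy()
```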

In [42]:
# new numeric column: the point estimate that precedes the bracketed interval
df['Obesity'] = df['Obesity (%)'].apply(lambda x:x.split()[0]).astype(float)
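
The `split()[0]` approach keeps only the point estimate. If the bracketed bounds are also of interest, a regular expression can pull all three numbers at once. The sketch below assumes every value follows the `estimate [low-high]` layout seen in the preview; the column names `estimate`, `low` and `high` are illustrative choices, not part of the dataset.

```python
import pandas as pd

raw = pd.Series(
    ["0.5 [0.2-1.1]", "24.8 [18.9-31.3]", "No data"],
    name="Obesity (%)",
)

# Named groups become column names; rows that do not match yield NaN.
number = r"\d+(?:\.\d+)?"
pattern = rf"^(?P<estimate>{number}) \[(?P<low>{number})-(?P<high>{number})\]$"

parsed = raw.str.extract(pattern).astype(float)
print(parsed)
```

Because non-matching rows come back as `NaN`, this variant also tolerates `No data` entries without dropping them first.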

In [43]:
df

Unnamed: 0,Country,Year,Obesity (%),Sex,Obesity
0,Afghanistan,1975,0.5 [0.2-1.1],Both sexes,0.5
1,Afghanistan,1975,0.2 [0.0-0.6],Male,0.2
2,Afghanistan,1975,0.8 [0.2-2.0],Female,0.8
3,Afghanistan,1976,0.5 [0.2-1.1],Both sexes,0.5
4,Afghanistan,1976,0.2 [0.0-0.7],Male,0.2
...,...,...,...,...,...
24565,Zimbabwe,2015,4.5 [2.4-7.6],Male,4.5
24566,Zimbabwe,2015,24.8 [18.9-31.3],Female,24.8
24567,Zimbabwe,2016,15.5 [12.0-19.2],Both sexes,15.5
24568,Zimbabwe,2016,4.7 [2.5-8.0],Male,4.7


In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24066 entries, 0 to 24569
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Country      24066 non-null  object 
 1   Year         24066 non-null  int64  
 2   Obesity (%)  24066 non-null  object 
 3   Sex          24066 non-null  object 
 4   Obesity      24066 non-null  float64
dtypes: float64(1), int64(1), object(3)
memory usage: 1.1+ MB


In [45]:
df.describe()

Unnamed: 0,Year,Obesity
count,24066.0,24066.0
mean,1995.5,12.448932
std,12.12117,10.407428
min,1975.0,0.1
25%,1985.0,3.9
50%,1995.5,10.6
75%,2006.0,18.175
max,2016.0,63.3


### insight extraction

1. **What was the average obesity percentage by sex worldwide in 2015?**

In [49]:
df[df.Year == 2015].Obesity.mean()

19.46282722513089
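
The single number above pools the `Both sexes`, `Male` and `Female` rows into one average. Since the question asks for a breakdown by sex, a `groupby` on the `Sex` column is probably closer to what is wanted. The sketch below uses made-up values for two placeholder countries, so its output says nothing about the real 2015 figures.

```python
import pandas as pd

# Illustrative values only; "A" and "B" are placeholder countries.
sample = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B", "A"],
    "Year": [2015, 2015, 2015, 2015, 2015, 2015, 2014],
    "Sex": ["Both sexes", "Male", "Female", "Both sexes", "Male", "Female", "Male"],
    "Obesity": [10.0, 8.0, 12.0, 20.0, 16.0, 24.0, 7.0],
})

# Keep 2015 only, then average across countries separately for each sex.
by_sex = sample.loc[sample["Year"] == 2015].groupby("Sex")["Obesity"].mean()
print(by_sex)
```

On the notebook's data the same expression would be `df[df.Year == 2015].groupby('Sex').Obesity.mean()`. Note that this is an unweighted mean over countries, not a population-weighted world average.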

2. **Which 5 countries had the largest and the smallest increase in obesity rates over the observed period?**

In [62]:
dados_agrup = df.groupby(['Country', 'Year'])['Obesity'].mean().reset_index()
dados_agrup

Unnamed: 0,Country,Year,Obesity
0,Afghanistan,1975,0.500000
1,Afghanistan,1976,0.500000
2,Afghanistan,1977,0.566667
3,Afghanistan,1978,0.566667
4,Afghanistan,1979,0.633333
...,...,...,...
8017,Zimbabwe,2012,13.933333
8018,Zimbabwe,2013,14.233333
8019,Zimbabwe,2014,14.566667
8020,Zimbabwe,2015,14.833333


In [63]:
# change in the obesity rate per country (last year minus first year)

dados_agrup_2 = dados_agrup.groupby('Country')['Obesity'].apply(lambda x : x.iloc[-1] - x.iloc[0])
dados_agrup_2

Country
Afghanistan                            4.933333
Albania                               15.200000
Algeria                               20.600000
Andorra                               12.800000
Angola                                 7.300000
                                        ...    
Venezuela (Bolivarian Republic of)    15.966667
Viet Nam                               1.966667
Yemen                                 14.300000
Zambia                                 6.533333
Zimbabwe                              11.566667
Name: Obesity, Length: 191, dtype: float64
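
Note that `dados_agrup` averages the `Both sexes`, `Male` and `Female` rows of each country and year together. A possible variant, sketched here on toy data, restricts the calculation to the `Both sexes` rows and sorts by year explicitly, so the first and last positions are guaranteed to be the earliest and latest years. The countries `A` and `B` and their values are invented for illustration.

```python
import pandas as pd

# Illustrative values only.
sample = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [1975, 1975, 2016, 2016, 1975, 1975, 2016, 2016],
    "Sex": ["Both sexes", "Male"] * 4,
    "Obesity": [1.0, 0.5, 6.0, 4.0, 10.0, 9.0, 12.0, 11.0],
})

# Use the combined figure only, ordered by country and year.
both = sample[sample["Sex"] == "Both sexes"].sort_values(["Country", "Year"])

# Percentage-point change between the first and last observed year.
change = both.groupby("Country")["Obesity"].agg(lambda s: s.iloc[-1] - s.iloc[0])
print(change.sort_values(ascending=False))
```

The result is an absolute change in percentage points; dividing by the first-year value instead would give a relative growth rate, which ranks countries differently.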

In [64]:
paises = dados_agrup_2.sort_values(ascending=False)
paises

Country
Tuvalu          33.666667
Niue            31.000000
Kiribati        30.000000
Tonga           28.100000
Cook Islands    27.866667
                  ...    
Timor-Leste      3.533333
Bangladesh       3.400000
Japan            3.266667
Singapore        3.066667
Viet Nam         1.966667
Name: Obesity, Length: 191, dtype: float64

In [66]:
# the 5 countries with the largest increase in obesity
paises.head()

Country
Tuvalu          33.666667
Niue            31.000000
Kiribati        30.000000
Tonga           28.100000
Cook Islands    27.866667
Name: Obesity, dtype: float64

In [67]:
# the 5 countries with the smallest increase in obesity
paises.tail()

Country
Timor-Leste    3.533333
Bangladesh     3.400000
Japan          3.266667
Singapore      3.066667
Viet Nam       1.966667
Name: Obesity, dtype: float64
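
The question from the list about the countries with the highest and lowest obesity levels in 2015 has no cell in this section. One way it might be approached, assuming the `Both sexes` figure is the level of interest, is with `nlargest` and `nsmallest`. The data below is invented and the country names are placeholders.

```python
import pandas as pd

# Illustrative values only.
sample = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Year": [2015, 2015, 2015, 2015],
    "Sex": ["Both sexes"] * 4,
    "Obesity": [5.0, 30.0, 12.0, 2.0],
})

# One value per country: the 2015 figure for both sexes combined.
mask = (sample["Year"] == 2015) & (sample["Sex"] == "Both sexes")
levels_2015 = sample[mask].set_index("Country")["Obesity"]

highest = levels_2015.nlargest(2)   # top of the ranking
lowest = levels_2015.nsmallest(2)   # bottom of the ranking
print(highest)
print(lowest)
```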

4. **What is the average difference in obesity percentage between the sexes over the years for Brazil?**

In [71]:
br = df[df.Country == 'Brazil']
br

Unnamed: 0,Country,Year,Obesity (%),Sex,Obesity
2898,Brazil,1975,5.2 [3.3-7.9],Both sexes,5.2
2899,Brazil,1975,3.0 [1.4-5.7],Male,3.0
2900,Brazil,1975,7.3 [4.0-12.0],Female,7.3
2901,Brazil,1976,5.5 [3.5-8.1],Both sexes,5.5
2902,Brazil,1976,3.2 [1.5-5.8],Male,3.2
...,...,...,...,...,...
3019,Brazil,2015,18.0 [13.9-22.6],Male,18.0
3020,Brazil,2015,24.9 [20.3-29.8],Female,24.9
3021,Brazil,2016,22.1 [18.7-25.7],Both sexes,22.1
3022,Brazil,2016,18.5 [14.1-23.5],Male,18.5


In [86]:
br['Diff'] = br.groupby('Country')['Obesity'].diff()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  br['Diff'] = br.groupby('Country')['Obesity'].diff()
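
The warning appears because `br` was created by filtering `df`, so pandas cannot tell whether the assignment is meant to affect `df` as well. Taking an explicit copy when filtering is the usual way to make the intent clear. A minimal sketch, with an invented `Doubled` column standing in for any new column:

```python
import pandas as pd

df_sample = pd.DataFrame({
    "Country": ["Brazil", "Brazil", "Chile"],
    "Obesity": [5.2, 3.0, 7.0],
})

# .copy() makes br_sample an independent DataFrame, so adding a column
# to it is unambiguous and leaves df_sample untouched.
br_sample = df_sample[df_sample["Country"] == "Brazil"].copy()
br_sample["Doubled"] = br_sample["Obesity"] * 2

print(br_sample)
```

In the notebook this would correspond to writing `br = df[df.Country == 'Brazil'].copy()` in the earlier cell.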


In [93]:
br

Unnamed: 0,Country,Year,Obesity (%),Sex,Obesity,Diff
2898,Brazil,1975,5.2 [3.3-7.9],Both sexes,5.2,
2899,Brazil,1975,3.0 [1.4-5.7],Male,3.0,-2.2
2900,Brazil,1975,7.3 [4.0-12.0],Female,7.3,4.3
2901,Brazil,1976,5.5 [3.5-8.1],Both sexes,5.5,-1.8
2902,Brazil,1976,3.2 [1.5-5.8],Male,3.2,-2.3
...,...,...,...,...,...,...
3019,Brazil,2015,18.0 [13.9-22.6],Male,18.0,-3.6
3020,Brazil,2015,24.9 [20.3-29.8],Female,24.9,6.9
3021,Brazil,2016,22.1 [18.7-25.7],Both sexes,22.1,-2.8
3022,Brazil,2016,18.5 [14.1-23.5],Male,18.5,-3.6
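
Because `diff()` subtracts each row from the one before it, the `Diff` column mixes several comparisons: with the row order shown above, `Female` rows hold Female minus Male, `Male` rows hold Male minus `Both sexes`, and `Both sexes` rows compare against the previous year's `Female` value. A pivot makes the Female minus Male gap explicit. The sketch below uses only the four Male and Female values for 1975 and 2015 visible in the table, so the mean it prints covers two years, not the whole period; the column name `Gap` is an illustrative choice.

```python
import pandas as pd

# Male and Female values for Brazil taken from the rows shown above.
br_sample = pd.DataFrame({
    "Year": [1975, 1975, 2015, 2015],
    "Sex": ["Male", "Female", "Male", "Female"],
    "Obesity": [3.0, 7.3, 18.0, 24.9],
})

# One row per year, one column per sex.
by_year = br_sample.pivot(index="Year", columns="Sex", values="Obesity")

# Gap between the sexes for each year, in percentage points.
by_year["Gap"] = by_year["Female"] - by_year["Male"]

mean_gap = by_year["Gap"].mean()
print(by_year)
print(round(mean_gap, 2))
```

Applied to the full `br` frame after excluding the `Both sexes` rows, the same pivot would give one gap per year from 1975 to 2016, and its mean would answer the question.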
