In [1]:
%cd ..

D:\SoftUni\Data Science\Project


### Imports

In [2]:
import pandas as pd
import numpy as np
from src import functions

### Loading data

In [3]:
mental_disorders_data = pd.read_csv('data/mental_disorders.csv')

In [4]:
mental_disorders_data

Unnamed: 0,Location,Year,Age,Sex,Cause of death or injury,Measure,Value,Lower bound,Upper bound
0,Afghanistan,2021.0,Age-standardized,Both,Mental disorders,"Prevalent cases per 100,000",17904.245131,15492.673337,20696.401183
1,Angola,2021.0,Age-standardized,Both,Mental disorders,"Prevalent cases per 100,000",14517.480604,12748.492025,16638.169306
2,Albania,2021.0,Age-standardized,Both,Mental disorders,"Prevalent cases per 100,000",12239.869258,10809.258625,13998.851842
3,Andorra,2021.0,Age-standardized,Both,Mental disorders,"Prevalent cases per 100,000",16246.594797,14193.602393,18600.414692
4,United Arab Emirates,2021.0,Age-standardized,Both,Mental disorders,"Prevalent cases per 100,000",13856.088574,12223.777898,15829.943793
...,...,...,...,...,...,...,...,...,...
201,Egypt,2021.0,Age-standardized,Both,Mental disorders,"Prevalent cases per 100,000",15360.979870,13533.839581,17589.294552
202,Sudan,2021.0,Age-standardized,Both,Mental disorders,"Prevalent cases per 100,000",16524.691383,14636.103054,18906.590807
203,China,2021.0,Age-standardized,Both,Mental disorders,"Prevalent cases per 100,000",11404.913749,10541.437052,12378.076465
204,Institute for Health Metrics and Evaluation (I...,,,,,,,,


### Metadata

The mental disorders dataset has 204 observations of 9 features. It contains information about the prevalence of mental illness worldwide in 2021 for both sexes. The data is age-standardized, which means that it takes into account the differences in the age structure of each country.[[1]](#references) The prevalence is provided in the 'Value' column and is measured in prevalent cases per 100,000 people. Thus the values of different countries can be compared without bias from differing age structures.
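To illustrate why age standardization matters, here is a minimal sketch of direct standardization. All age groups, weights, and rates below are hypothetical numbers chosen for illustration; they are not taken from this dataset, and the exact method used by the data provider may differ.

```python
# Hypothetical standard-population weights per age group (sum to 1).
standard_weights = {"0-14": 0.25, "15-64": 0.65, "65+": 0.10}

# Both hypothetical countries have identical age-specific prevalence
# (cases per 100,000) but very different age structures.
age_specific_rates = {"0-14": 8000.0, "15-64": 15000.0, "65+": 12000.0}
country_a_age_share = {"0-14": 0.45, "15-64": 0.52, "65+": 0.03}
country_b_age_share = {"0-14": 0.15, "15-64": 0.65, "65+": 0.20}


def weighted_rate(rates, weights):
    """Weighted average of age-specific rates."""
    return sum(rates[group] * weights[group] for group in rates)


# Crude rates use each country's own age structure, so they differ.
crude_a = weighted_rate(age_specific_rates, country_a_age_share)
crude_b = weighted_rate(age_specific_rates, country_b_age_share)

# Standardized rates use the same reference weights, so they are equal.
std_a = weighted_rate(age_specific_rates, standard_weights)
std_b = weighted_rate(age_specific_rates, standard_weights)

print(round(crude_a), round(crude_b))
print(round(std_a), round(std_b))
```

The crude rates differ only because of the age mix, while the standardized rates coincide, which is the property that makes cross-country comparison meaningful.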

The source of this dataset is https://www.healthdata.org/research-analysis/health-risks-issues/mental-health.

### Exploration and cleaning

In [5]:
mental_disorders_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Location                  206 non-null    object 
 1   Year                      204 non-null    float64
 2   Age                       204 non-null    object 
 3   Sex                       204 non-null    object 
 4   Cause of death or injury  204 non-null    object 
 5   Measure                   204 non-null    object 
 6   Value                     204 non-null    float64
 7   Lower bound               204 non-null    float64
 8   Upper bound               204 non-null    float64
dtypes: float64(4), object(5)
memory usage: 14.6+ KB


In [6]:
mental_disorders_data.duplicated().unique()

array([False])

There are no duplicated observations. The only missing values are in the last two rows.

These last two rows are not observations but contain information about the data provider and terms of use. It is safe to delete them.
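As a cross-check before deleting by position, the footer rows can also be located by their missing values. The sketch below uses a small toy frame with hypothetical footer text that only mimics the layout of the export; it is not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame: two real rows followed by two footer rows that only
# fill the first column (footer texts are hypothetical placeholders).
toy = pd.DataFrame({
    "Location": ["Afghanistan", "Angola", "Provider note", "Terms of use note"],
    "Year": [2021.0, 2021.0, np.nan, np.nan],
    "Value": [17904.2, 14517.5, np.nan, np.nan],
})

# Rows that contain at least one missing value.
incomplete_rows = toy[toy.isna().any(axis=1)]
print(incomplete_rows)

# Alternative to positional slicing: keep rows that have a 'Value'.
# .copy() makes the result an independent DataFrame.
cleaned = toy.dropna(subset=["Value"]).copy()
print(len(cleaned))
```

Dropping by missing values does not depend on the footer always being exactly two rows long, which may be more robust if the export format changes.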

In [7]:
mental_disorders_data = mental_disorders_data.iloc[:-2]

In [8]:
mental_disorders_data.tail(2)

Unnamed: 0,Location,Year,Age,Sex,Cause of death or injury,Measure,Value,Lower bound,Upper bound
202,Sudan,2021.0,Age-standardized,Both,Mental disorders,"Prevalent cases per 100,000",16524.691383,14636.103054,18906.590807
203,China,2021.0,Age-standardized,Both,Mental disorders,"Prevalent cases per 100,000",11404.913749,10541.437052,12378.076465


The column names are standardized to snake_case to make them easier to work with.

In [9]:
mental_disorders_data = mental_disorders_data.rename(functions.to_snake_case, axis='columns')


In [10]:
mental_disorders_data.head(3)

Unnamed: 0,location,year,age,sex,cause_of_death_or_injury,measure,value,lower_bound,upper_bound
0,Afghanistan,2021.0,Age-standardized,Both,Mental disorders,"Prevalent cases per 100,000",17904.245131,15492.673337,20696.401183
1,Angola,2021.0,Age-standardized,Both,Mental disorders,"Prevalent cases per 100,000",14517.480604,12748.492025,16638.169306
2,Albania,2021.0,Age-standardized,Both,Mental disorders,"Prevalent cases per 100,000",12239.869258,10809.258625,13998.851842
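The `functions.to_snake_case` helper lives in the project's `src` package and is not shown in this notebook. A minimal sketch of what such a helper might look like follows; the implementation is an assumption, and the real function may handle more cases.

```python
import re


def to_snake_case(name: str) -> str:
    """Hypothetical stand-in for functions.to_snake_case."""
    # Replace every run of non-alphanumeric characters with one underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Drop leading/trailing underscores and lowercase the result.
    return cleaned.strip("_").lower()


columns = ["Location", "Cause of death or injury", "Lower bound"]
renamed = [to_snake_case(column) for column in columns]
print(renamed)
```

Because `DataFrame.rename` accepts a callable, a function of this shape can be passed directly as the mapper for the column axis.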


### Feature selection

In [11]:
mental_disorders_data.nunique()

location                    204
year                          1
age                           1
sex                           1
cause_of_death_or_injury      1
measure                       1
value                       204
lower_bound                 204
upper_bound                 204
dtype: int64

All observations are age-standardized, cover both sexes, and refer to 2021. The prevalence of mental disorders is measured in a common unit. So each of the columns 'year', 'age', 'sex', 'cause_of_death_or_injury' and 'measure' has a single unique value for all observations. Of the rest, the features of interest for this study are 'location' and 'value'.
I will extract them into a new dataframe. The 'location' column will be renamed to 'country' for consistency with the other dataset. The 'value' column will also be renamed to 'mental_disorders_per_100k' to describe the feature better. Its values will be rounded to whole numbers for better readability.
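The single-valued columns can also be identified programmatically from the `nunique()` result instead of being read off by eye. The sketch below runs on a toy frame with a subset of the columns; it is an illustration, not the selection step used in this notebook.

```python
import pandas as pd

# Toy frame with two constant columns and two informative ones.
toy = pd.DataFrame({
    "location": ["Afghanistan", "Angola", "Albania"],
    "year": [2021.0, 2021.0, 2021.0],
    "sex": ["Both", "Both", "Both"],
    "value": [17904.245131, 14517.480604, 12239.869258],
})

# Columns with exactly one unique value carry no information
# for distinguishing observations.
unique_counts = toy.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()

informative = toy.drop(columns=constant_columns)
print(constant_columns)
print(informative.columns.tolist())
```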

In [12]:
mental_disorders_by_county = mental_disorders_data[['location', 'value']]

In [13]:
mental_disorders_by_county = mental_disorders_by_county.rename(columns = {'location': 'country', 'value': 'mental_disorders_per_100k'})

In [14]:
mental_disorders_by_county['mental_disorders_per_100k'] = mental_disorders_by_county['mental_disorders_per_100k'].round(0)

In [15]:
mental_disorders_by_county.head(3)

Unnamed: 0,country,mental_disorders_per_100k
0,Afghanistan,17904.0
1,Angola,14517.0
2,Albania,12240.0


In [16]:
mental_disorders_by_county.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 2 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   country                    204 non-null    object 
 1   mental_disorders_per_100k  204 non-null    float64
dtypes: float64(1), object(1)
memory usage: 3.3+ KB


The dtypes of the resulting features are appropriate, so there is no need to transform them.

### Saving transformed data
The resulting dataset is saved for further use in this study.

In [17]:
mental_disorders_by_county.to_csv('data/mental_disorder_by_country.csv', index=False)
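A quick round-trip check can confirm that the saved file reloads with the same columns and values. The sketch below writes a toy frame to an in-memory buffer rather than to disk, so it does not touch the project's `data` folder; the frame contents are a small illustrative subset.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "country": ["Afghanistan", "Angola"],
    "mental_disorders_per_100k": [17904.0, 14517.0],
})

# Write without the index, as in the cell above, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises an AssertionError if columns, dtypes, or values differ.
pd.testing.assert_frame_equal(toy, reloaded)
print(reloaded.columns.tolist())
```

Saving with `index=False` is what keeps an extra unnamed index column from appearing when the file is loaded again.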

<a id='references'></a>

### References
1. [Age standardization - ourworldindata.org](https://ourworldindata.org/age-standardization)
2. [Guide](../data/guiding_questions.pdf)