# Analysis of Worldwide Suicide Rates
By Dayana Kassenova

<img src="figs/suicide-image.JPG" width=800 /> 

### Content
+ Introduction: Suicide Rates
+ Data description
+ Formulation of Research questions
+ Data preparation: cleaning and shaping

## 1. Introduction: Suicide Rates

&emsp;  Suicide occurs at every stage of life and is one of the leading causes of death in many countries. Each suicide is a tragic loss of human life, with lasting consequences for the people left behind, such as family and friends.

&emsp; More precisely, suicide often follows depression or severe stress: people feel unable to cope with their problems and look to the future without hope. Everyone experiences both victories and trials. Being hired into a new position, starting a family, or buying a new house are energizing victories, while losing friends or family members, or facing aggression at home, are among the hardest trials. Under such difficulties, some people come to consider suicide.


 &emsp;  The goal of this Global Suicide Rates analysis is to present facts and statistics from 27,820 observations, described by age range, gender (male/female), number of suicides, country population, GDP (gross domestic product) per year and per person, HDI (human development index), and generation, covering 1985 to 2016 across the world. In addition, we will look for the factors that had the most influence on the number of suicides in 101 countries.

 &emsp;  Furthermore, according to the World Health Organization, about 1 million people die by suicide each year, and a much larger number attempt suicide around the globe. On the other hand, suicide rates have declined in some countries since the late 20th century. For instance, Armenia had 7.51 suicides per 100k in 1990; two decades later the rate had dropped sharply to 0.72 suicides per 100k.


### Sources
1. The dataset is from Kaggle: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016
2. The image is from https://ru.depositphotos.com/127753404/stock-illustration-woman-with-tied-eyes-put.html
3. Background information is from https://www.who.int/news-room/fact-sheets/detail/suicide

## 2. Data description

 &emsp;  As the introduction suggests, people of all ages die by suicide in every country, often under pressure, facing barriers, or lacking mental healthcare. In this analysis I compare suicide counts across age groups, genders, generations, and countries, together with the rate of suicides per 100k population, from 1985 to 2016.

I am keen on this topic because everyone should be aware of this issue and be ready to help the people around them in time.

My analysis is based on a dataset covering these 31 years, which I found on Kaggle.
Below are the variables that I used for my analysis:

+  Country – the country where the suicides were recorded.
+  Year – the year in which the suicides were recorded.
+  Sex – gender as male/female.
+  Age – age group of the individuals who committed suicide.
+  Suicides_no – number of suicides.
+  Population – number of people in that country, sex, and age group.
+  Ratio (suicides_no/population) – number of suicides per 100k of population.
+  HDI for year – human development index for that year.
+  GDP_for_year – gross domestic product for that year.
+  GDP_per_capita – gross domestic product per person.
+  Generation – generation cohort (for example Boomers or Generation X).


## 3. Formulation of Research questions

During the analysis, I will answer these key questions:

1. Analyze - What are the proportions of women and men among suicides?
2. Analyze - Which gender and age groups are most likely to commit suicide?
3. Analyze - Has the suicide rate increased or decreased over time (based on 'suicide/pop')?
4. Analyze - Which countries had the highest suicide rates among men and women?
5. Analyze - Do GDP (gross domestic product) and HDI affect the number of suicides?
6. Analyze - Which generations had the highest suicide rates during 1985-2016?
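
Several of these questions compare groups by rate. One detail worth sketching: a group-level rate per 100k is the total number of suicides divided by the total population of the group, not the mean of the row-level rates. The sketch below uses made-up numbers; only the column names follow the dataset, and the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the dataset.
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female"],
    "suicides_no": [30, 10, 8, 2],
    "population": [100_000, 900_000, 100_000, 900_000],
})

# Sum counts and populations per group first, then form the rate.
totals = toy.groupby("sex")[["suicides_no", "population"]].sum()
rate_per_100k = totals["suicides_no"] / totals["population"] * 100_000

# Share of all suicides attributed to each sex.
share = totals["suicides_no"] / totals["suicides_no"].sum()

print(rate_per_100k.round(2))
print(share.round(2))
```

Averaging the row-level rates instead would weight a small population as heavily as a large one, which is why the sums are taken first.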


## 4. Data preparation: cleaning and shaping

✤   To begin, I import the essential libraries for the statistics and graphs that follow.

+ **numpy** - Python's core library for numerical operations.
+ **pandas** - for loading and manipulating tabular data in DataFrames.
+ **seaborn** - for statistical plots such as histograms, scatter plots, and box plots.
+ **matplotlib** - for displaying charts and graphs.
+ **warnings** - for controlling (here, suppressing) warning messages.

In [45]:
# Importing the 'must-have' libraries:

import numpy as np  # numerical arrays and math
import pandas as pd  # DataFrames
import seaborn as sns  # statistical plots and color themes
import matplotlib.pyplot as plt  # plotting
import warnings
warnings.filterwarnings('ignore')  # hide warning messages
%matplotlib inline
# render plots inside the notebook cells

✤  At this step, I work with the ready-made Worldwide Suicide Rates dataset. The plan is:
+ to read all columns from the table.
+ to see the first five and last five rows.
+ to delete unnecessary columns.
+ to check whether there are null values.
+ to fill them with values.
+ to list the datatypes of all columns.
+ to describe the distribution of the numerical values.
+ to find all unique years, age groups, generations, and countries, and count them.
+ to check for duplicates.
+ to show the final version of the dataset ✤
 

✤ Here we begin by reading the 'suicide' csv file from the datasets folder:

In [56]:
#reading my dataset from 'suicide' csv file:
rating = pd.read_csv('datasets/suicide.csv')

✤ First, I want to retrieve the column names to see whether they contain unnecessary spaces or symbols:

In [57]:
rating.columns

Index(['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicides/100k pop', 'country-year', 'HDI for year',
       ' gdp_for_year ($) ', 'gdp_per_capita ($)', 'generation'],
      dtype='object')

 ✤ Now we can tidy the column names by removing spaces and symbols such as ($) with the str.replace() function:

In [58]:
rating.columns = rating.columns.str.strip().str.lower().str.replace(' ', '', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False).str.replace('$', '', regex=False)
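
The cleaning step can be checked on the column names printed above without loading the file. This is a minimal sketch; the loop over characters is a restructuring for illustration, not the notebook's own code, and `raw` holds only a subset of the columns.

```python
import pandas as pd

# A subset of the raw column names as printed by rating.columns.
raw = pd.Index(['country', 'suicides/100k pop', 'HDI for year',
                ' gdp_for_year ($) ', 'gdp_per_capita ($)'])

cleaned = raw.str.strip().str.lower()
# regex=False treats '(', ')' and '$' as plain characters, not regex syntax.
for ch in [' ', '(', ')', '$']:
    cleaned = cleaned.str.replace(ch, '', regex=False)

print(list(cleaned))
```

Passing `regex=False` keeps the behavior the same across pandas versions, since the default for this parameter changed in pandas 2.0.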

✤ A general overview of the suicide rates dataset:

In [59]:
#to show first 5 rows of dataset in table format
rating.head()

| | country | year | sex | age | suicides_no | population | suicides/100kpop | country-year | hdiforyear | gdp_for_year | gdp_per_capita | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 21 | 312900 | 6.71 | Albania1987 | NaN | 2156624900 | 796 | Generation X |
| 1 | Albania | 1987 | male | 35-54 years | 16 | 308000 | 5.19 | Albania1987 | NaN | 2156624900 | 796 | Silent |
| 2 | Albania | 1987 | female | 15-24 years | 14 | 289700 | 4.83 | Albania1987 | NaN | 2156624900 | 796 | Generation X |
| 3 | Albania | 1987 | male | 75+ years | 1 | 21800 | 4.59 | Albania1987 | NaN | 2156624900 | 796 | G.I. Generation |
| 4 | Albania | 1987 | male | 25-34 years | 9 | 274300 | 3.28 | Albania1987 | NaN | 2156624900 | 796 | Boomers |


In [60]:
#to show last 5 rows of dataset in table format:
rating.tail()

| | country | year | sex | age | suicides_no | population | suicides/100kpop | country-year | hdiforyear | gdp_for_year | gdp_per_capita | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27815 | Uzbekistan | 2014 | female | 35-54 years | 107 | 3620833 | 2.96 | Uzbekistan2014 | 0.675 | 63067077179 | 2309 | Generation X |
| 27816 | Uzbekistan | 2014 | female | 75+ years | 9 | 348465 | 2.58 | Uzbekistan2014 | 0.675 | 63067077179 | 2309 | Silent |
| 27817 | Uzbekistan | 2014 | male | 5-14 years | 60 | 2762158 | 2.17 | Uzbekistan2014 | 0.675 | 63067077179 | 2309 | Generation Z |
| 27818 | Uzbekistan | 2014 | female | 5-14 years | 44 | 2631600 | 1.67 | Uzbekistan2014 | 0.675 | 63067077179 | 2309 | Generation Z |
| 27819 | Uzbekistan | 2014 | female | 55-74 years | 21 | 1438935 | 1.46 | Uzbekistan2014 | 0.675 | 63067077179 | 2309 | Boomers |


✤  The shape attribute gives the total number of rows and columns in the dataset:

In [29]:
print(rating.shape)

(27820, 12)


✤  I think we should delete the 'country-year' column with the drop() function,
because we already have 'country' and 'year' columns, so it adds nothing:

In [61]:
rating.drop(['country-year'], axis=1,inplace=True)

✤ Because countries differ widely in population (for example 10 million versus 1 million), the raw counts in 'suicides_no' cannot be compared directly. That is why I created a new column: the number of suicides divided by the population, scaled to 100,000 people. It carries the same information as 'suicides/100kpop', rounded to one decimal:

In [62]:
ratio = rating['suicides_no'] / rating['population']
# Multiply by 100 000 to express the ratio as suicides per 100k people,
# rounded to one decimal place.
rating['suicide/pop'] = (ratio * 100000).round(1)
rating.head()

| | country | year | sex | age | suicides_no | population | suicides/100kpop | hdiforyear | gdp_for_year | gdp_per_capita | generation | suicide/pop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 21 | 312900 | 6.71 | NaN | 2156624900 | 796 | Generation X | 6.7 |
| 1 | Albania | 1987 | male | 35-54 years | 16 | 308000 | 5.19 | NaN | 2156624900 | 796 | Silent | 5.2 |
| 2 | Albania | 1987 | female | 15-24 years | 14 | 289700 | 4.83 | NaN | 2156624900 | 796 | Generation X | 4.8 |
| 3 | Albania | 1987 | male | 75+ years | 1 | 21800 | 4.59 | NaN | 2156624900 | 796 | G.I. Generation | 4.6 |
| 4 | Albania | 1987 | male | 25-34 years | 9 | 274300 | 3.28 | NaN | 2156624900 | 796 | Boomers | 3.3 |
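
The per-100k values can be reproduced by hand from the first three rows shown above. A minimal sketch; the `sample` frame is built here only for illustration:

```python
import pandas as pd

# suicides_no and population from the first three rows of the table.
sample = pd.DataFrame({
    "suicides_no": [21, 16, 14],
    "population": [312900, 308000, 289700],
})

# Suicides per 100 000 people, rounded to one decimal place.
sample["suicide/pop"] = (sample["suicides_no"] / sample["population"] * 100_000).round(1)
print(sample)
```

For the first row, 21 / 312900 × 100 000 ≈ 6.71, which rounds to the 6.7 shown in the new column.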


 ✤ Since the new 'suicide/pop' column duplicates 'suicides/100kpop', we no longer need the old column and can delete it with the drop() function:

In [64]:
rating.drop(['suicides/100kpop'], axis=1,inplace=True)

 ✤ As the first 5 rows of the table show, there are null values (NaN) in the HDI column, so we should count the missing values in every column:

In [65]:
# Use isnull() to count the empty values in each column:
rating.isnull().sum()

country               0
year                  0
sex                   0
age                   0
suicides_no           0
population            0
hdiforyear        19456
gdp_for_year          0
gdp_per_capita        0
generation            0
suicide/pop           0
dtype: int64
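
It can also help to see the missing values as a share of all rows. For 'hdiforyear' that is 19456 / 27820, or roughly 70%. A minimal sketch on a hypothetical toy frame (the `hdi` values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: three of four HDI values are missing.
toy = pd.DataFrame({
    "hdi": [np.nan, 0.62, np.nan, np.nan],
    "year": [1987, 1988, 1989, 1990],
})

# isnull() gives True/False per cell; the column mean is the missing share.
missing_share = toy.isnull().mean()
print(missing_share)
```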

✤ Only the 'hdiforyear' column has missing values: 19456 of 27820 rows. I fill them from the previous and next available values in the dataset, on the assumption that HDI was relatively stable over this period (around 0.6-0.7 according to the dataset). Note that this fill is not grouped by country, so a gap at the start or end of one country's rows takes its value from the neighboring country:

In [66]:
# Fill missing values forward, then backward for any leading gaps.
# Alternative: rating['hdiforyear'].fillna(0, inplace=True)
rating = rating.ffill().bfill()
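
A possible alternative, sketched here as an assumption rather than the method used in this notebook, is to fill within each country so that values never cross a country boundary. The toy frame and its country names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two countries, rows sorted by country and year.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "hdiforyear": [0.60, np.nan, np.nan, np.nan, 0.90, np.nan],
})

# Global fill: B's first row inherits A's last known value (0.60).
global_fill = toy["hdiforyear"].ffill().bfill()

# Per-country fill: each country is filled only from its own values.
per_country = toy.groupby("country")["hdiforyear"].transform(
    lambda s: s.ffill().bfill()
)

print(pd.DataFrame({"global": global_fill, "per_country": per_country}))
```

A country with no HDI values at all would stay NaN under the per-country version and would need separate handling.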

✤ Now none of the columns has missing values:

In [11]:
# Count the empty values in each column again with isnull():
rating.isnull().sum()
# Result: no NaN rows remain.

✤  Additionally, we should know the datatypes of our variables, which will help in the analysis:

In [67]:
rating.dtypes

country            object
year                int64
sex                object
age                object
suicides_no         int64
population          int64
hdiforyear        float64
gdp_for_year       object
gdp_per_capita      int64
generation         object
suicide/pop       float64
dtype: object
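
The 'gdp_for_year' column is reported as object, which means pandas stores it as text rather than numbers. One common cause is thousands separators in the raw file. The sketch below assumes comma separators, which is not confirmed by the output shown here, and the two sample strings are hypothetical:

```python
import pandas as pd

# Assumption: the raw text uses commas as thousands separators.
gdp_text = pd.Series(["2,156,624,900", "63,067,077,179"], name="gdp_for_year")

# Remove the separators, then convert the strings to integers.
gdp_numeric = pd.to_numeric(gdp_text.str.replace(",", "", regex=False))
print(gdp_numeric.dtype)
```

After such a conversion the column could be included in describe() and in correlations alongside 'gdp_per_capita'.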

✤  Next is a numerical description of the dataset. For each numeric column it shows the count, mean, standard deviation, quartiles (from which the interquartile range follows), and min/max values:

In [68]:
rating[['year','suicides_no','population','suicide/pop','gdp_per_capita','hdiforyear']].describe()

| | year | suicides_no | population | suicide/pop | gdp_per_capita | hdiforyear |
|---|---|---|---|---|---|---|
| count | 27820.0 | 27820.0 | 27820.0 | 27820.0 | 27820.0 | 27820.0 |
| mean | 2001.258375 | 242.574407 | 1844794.0 | 12.816078 | 16866.464414 | 0.761476 |
| std | 8.469055 | 902.047917 | 3911779.0 | 18.961707 | 18887.576472 | 0.091981 |
| min | 1985.0 | 0.0 | 278.0 | 0.0 | 251.0 | 0.483 |
| 25% | 1995.0 | 3.0 | 97498.5 | 0.9 | 3447.0 | 0.694 |
| 50% | 2002.0 | 25.0 | 430150.0 | 6.0 | 9372.0 | 0.766 |
| 75% | 2008.0 | 131.0 | 1486143.0 | 16.6 | 24874.0 | 0.836 |
| max | 2016.0 | 22338.0 | 43805210.0 | 225.0 | 126352.0 | 0.944 |
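
describe() reports the 25% and 75% quartiles but not the interquartile range itself, which is the difference between them. For 'suicide/pop' in the table above that is 16.6 - 0.9 = 15.7. A minimal sketch of the same calculation with quantile(), on hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy values, not taken from the dataset.
values = pd.Series([0, 1, 2, 3, 4])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range
print(q1, q3, iqr)
```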


✤ Moreover, we should know in which years suicides were recorded, and which age groups and generations appear in the data:

In [14]:
columns = ['year', 'age', 'generation']
for col in columns:
    print("{}'s - Unique values - \n {}".format(col, rating[col].unique()))

year's - Unique values - 
 [1987 1988 1989 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
 2003 2004 2005 2006 2007 2008 2009 2010 1985 1986 1990 1991 2012 2013
 2014 2015 2011 2016]
age's - Unique values - 
 ['15-24 years' '35-54 years' '75+ years' '25-34 years' '55-74 years'
 '5-14 years']
generation's - Unique values - 
 ['Generation X' 'Silent' 'G.I. Generation' 'Boomers' 'Millenials'
 'Generation Z']
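
The age groups are stored as text, so sorting them alphabetically puts '5-14 years' after '35-54 years'. One way to get the natural order in tables and plots, sketched here as a suggestion rather than a step this notebook performs, is an ordered categorical type:

```python
import pandas as pd

# Natural order of the age groups listed above.
age_order = ['5-14 years', '15-24 years', '25-34 years',
             '35-54 years', '55-74 years', '75+ years']

ages = pd.Series(['15-24 years', '35-54 years', '75+ years',
                  '25-34 years', '55-74 years', '5-14 years'])

# Plain string sorting compares character by character.
as_text = sorted(ages)

# An ordered categorical sorts by the order given in age_order.
age_type = pd.CategoricalDtype(age_order, ordered=True)
as_category = ages.astype(age_type).sort_values()

print(as_text)
print(as_category.tolist())
```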


✤  Here I retrieve all distinct (unique) countries in the data where suicides were recorded. They appear in order of first occurrence, which in this file is alphabetical.

In [15]:
print('Number of distinct countries:')
print(rating['country'].nunique())  # count them

distinct_countries = rating['country'].unique()
print(distinct_countries)  # show all of them

Number of distinct countries:
101
['Albania' 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Aruba' 'Australia'
 'Austria' 'Azerbaijan' 'Bahamas' 'Bahrain' 'Barbados' 'Belarus' 'Belgium'
 'Belize' 'Bosnia and Herzegovina' 'Brazil' 'Bulgaria' 'Cabo Verde'
 'Canada' 'Chile' 'Colombia' 'Costa Rica' 'Croatia' 'Cuba' 'Cyprus'
 'Czech Republic' 'Denmark' 'Dominica' 'Ecuador' 'El Salvador' 'Estonia'
 'Fiji' 'Finland' 'France' 'Georgia' 'Germany' 'Greece' 'Grenada'
 'Guatemala' 'Guyana' 'Hungary' 'Iceland' 'Ireland' 'Israel' 'Italy'
 'Jamaica' 'Japan' 'Kazakhstan' 'Kiribati' 'Kuwait' 'Kyrgyzstan' 'Latvia'
 'Lithuania' 'Luxembourg' 'Macau' 'Maldives' 'Malta' 'Mauritius' 'Mexico'
 'Mongolia' 'Montenegro' 'Netherlands' 'New Zealand' 'Nicaragua' 'Norway'
 'Oman' 'Panama' 'Paraguay' 'Philippines' 'Poland' 'Portugal'
 'Puerto Rico' 'Qatar' 'Republic of Korea' 'Romania' 'Russian Federation'
 'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and Grenadines'
 'San Marino' 'Serbia' 'Seychelle

✤ I should also check for duplicated rows, that is, rows that occur more than once:

In [16]:
a = rating.duplicated(['country', 'year', 'sex', 'age', 'suicides_no'])
a.sum()

# result is 0
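
To show what duplicated() would report if the data did contain repeats, here is a minimal sketch on a hypothetical toy frame with one repeated row. The `key` list is an illustrative choice of identifying columns:

```python
import pandas as pd

# Hypothetical toy data: the second row repeats the first.
toy = pd.DataFrame({
    "country": ["A", "A", "A"],
    "year": [2000, 2000, 2001],
    "sex": ["male", "male", "male"],
    "age": ["15-24 years", "15-24 years", "15-24 years"],
    "suicides_no": [5, 5, 5],
})

key = ["country", "year", "sex", "age"]

# duplicated() marks every repeat after the first occurrence as True.
n_duplicates = toy.duplicated(subset=key).sum()

# drop_duplicates() keeps the first occurrence of each key.
deduplicated = toy.drop_duplicates(subset=key)

print(n_duplicates, len(deduplicated))
```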

✤ Let's see the final version of my dataset (without missing values):

In [69]:
# to see final version of dataset
# as an example, retrieve first 5 rows:
rating.head()

| | country | year | sex | age | suicides_no | population | hdiforyear | gdp_for_year | gdp_per_capita | generation | suicide/pop |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 21 | 312900 | 0.619 | 2156624900 | 796 | Generation X | 6.7 |
| 1 | Albania | 1987 | male | 35-54 years | 16 | 308000 | 0.619 | 2156624900 | 796 | Silent | 5.2 |
| 2 | Albania | 1987 | female | 15-24 years | 14 | 289700 | 0.619 | 2156624900 | 796 | Generation X | 4.8 |
| 3 | Albania | 1987 | male | 75+ years | 1 | 21800 | 0.619 | 2156624900 | 796 | G.I. Generation | 4.6 |
| 4 | Albania | 1987 | male | 25-34 years | 9 | 274300 | 0.619 | 2156624900 | 796 | Boomers | 3.3 |


✤ After checking for duplicates, deleting the unnecessary columns, and creating the 'suicide/pop' column, we can check the total number of rows and columns:

In [37]:
# by using shape, we find out number of rows & columns:
print(rating.shape)

(27820, 11)


###  ✤  Now my dataset is clean: the unnecessary columns are removed and the missing values are filled.

### ✤ I now understand the "Worldwide Suicide Rates" dataset better through its unique values (years, age groups, generations, countries), datatypes, and numerical summary, and I am ready to build graphs that answer the research questions in more detail.