<h1>
    <font color=0000FF>
        Global Suicide Rates 
    </font>
</h1>

---

## This script contains the following:

### 01. Executive Summary


### 02. Data Source, Collection & Contents
- Data Source
- Data Collection
- Data Contents
- Limitations 

### 03. Data Profile

#### A. Import libraries

#### B. Import Data

#### C. Review data and basic descriptive statistical analysis

#### D. Data Wrangling
- Renaming columns
- Checking for missing data
- Dropping columns not needed for further analysis
- Checking shape of new df
- Checking for duplicates
- Checking for mixed-type data
- Exporting clean dataset

### 04. Questions for Analysis



---

## 01. Executive Summary

This dataset provides information on suicide rates at a global level. According to the World Health Organization (WHO), more than 700 000 people die by suicide every year. Furthermore, for each suicide there are more than 20 suicide attempts.

Suicides and suicide attempts have a ripple effect on families, friends, colleagues, communities and societies. Suicides are preventable, and much can be done to prevent them at the individual, community and national levels. This analysis explores the key factors behind global suicide rates. It also looks at the geographical areas where the suicide rate is highest and lowest, and at the demographic groups (age group, gender, generation) in which suicide is most common, in order to develop insights for suicide prevention.

---

## 02. Data Source, Collection & Contents 


### Data Source

The data for this project is an open-source dataset downloaded from [Kaggle](https://www.kaggle.com/datasets/russellyates88/suicide-rates-overview-1985-to-2016?select=master.csv).


### Data Collection

This is a compiled dataset pulled from four other datasets linked by time and place which are as follows: 

* United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506
* World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#
* [Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook
* World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/


### Data Contents 

This dataset compares socio-economic information with suicide rates by year and country. It was built to find signals correlated with increased suicide rates among different cohorts globally, across the socio-economic spectrum.


### Limitations

As suicide is a taboo topic, it is possible that the number of suicides recorded for a particular country has been under-reported or misreported.

---

## 03. Data Profile

### A. Import Libraries

In [5]:
# Import libraries
import pandas as pd
import numpy as np
import os

### B. Import Data

In [1]:
#Turn project folder path into a string
'/Users/aysha/Documents/Achievement 6 - Global Suicide Rates/'

'/Users/aysha/Documents/Achievement 6 - Global Suicide Rates/'

In [2]:
path = r'/Users/aysha/Documents/Achievement 6 - Global Suicide Rates/'

In [3]:
path

'/Users/aysha/Documents/Achievement 6 - Global Suicide Rates/'

In [6]:
#Import data
df = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'master.csv'), index_col = False)

### C. Review data and basic descriptive statistical analysis

In [11]:
df.head()

| | country | year | sex | age | suicides_no | population | suicides/100k pop | country-year | HDI for year | gdp_for_year ($) | gdp_per_capita ($) | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 21 | 312900 | 6.71 | Albania1987 | NaN | 2156624900 | 796 | Generation X |
| 1 | Albania | 1987 | male | 35-54 years | 16 | 308000 | 5.19 | Albania1987 | NaN | 2156624900 | 796 | Silent |
| 2 | Albania | 1987 | female | 15-24 years | 14 | 289700 | 4.83 | Albania1987 | NaN | 2156624900 | 796 | Generation X |
| 3 | Albania | 1987 | male | 75+ years | 1 | 21800 | 4.59 | Albania1987 | NaN | 2156624900 | 796 | G.I. Generation |
| 4 | Albania | 1987 | male | 25-34 years | 9 | 274300 | 3.28 | Albania1987 | NaN | 2156624900 | 796 | Boomers |


In [8]:
#Check the information of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   country             27820 non-null  object 
 1   year                27820 non-null  int64  
 2   sex                 27820 non-null  object 
 3   age                 27820 non-null  object 
 4   suicides_no         27820 non-null  int64  
 5   population          27820 non-null  int64  
 6   suicides/100k pop   27820 non-null  float64
 7   country-year        27820 non-null  object 
 8   HDI for year        8364 non-null   float64
 9    gdp_for_year ($)   27820 non-null  object 
 10  gdp_per_capita ($)  27820 non-null  int64  
 11  generation          27820 non-null  object 
dtypes: float64(2), int64(4), object(6)
memory usage: 2.5+ MB
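The `df.info()` output above shows two quirks: the column name ` gdp_for_year ($)` carries a leading space, and its dtype is `object` rather than a number. The following is a minimal sketch of one way to tidy both, run on a toy dataframe. It assumes the values are stored as text with comma thousands separators, which is a guess, since the notebook does not show the raw strings.

```python
import pandas as pd

# Toy frame mimicking the two quirks seen in df.info().
# The comma-separated number format is an assumption for illustration.
toy = pd.DataFrame({
    "country": ["Albania", "Albania"],
    " gdp_for_year ($)": ["2,156,624,900", "2,126,000,000"],
})

# Strip stray whitespace from every column name.
toy.columns = toy.columns.str.strip()

# Remove the thousands separators, then convert the text to a numeric dtype.
toy["gdp_for_year ($)"] = pd.to_numeric(
    toy["gdp_for_year ($)"].str.replace(",", "", regex=False)
)

print(toy.dtypes)
```

If the real column uses a different text format, the `str.replace` step would need to be adapted accordingly.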


In [12]:
#Descriptive Statistical Analysis
df.describe()

| | year | suicides_no | population | suicides/100k pop | HDI for year | gdp_per_capita ($) |
|---|---|---|---|---|---|---|
| count | 27820.0 | 27820.0 | 27820.0 | 27820.0 | 8364.0 | 27820.0 |
| mean | 2001.258375 | 242.574407 | 1844794.0 | 12.816097 | 0.776601 | 16866.464414 |
| std | 8.469055 | 902.047917 | 3911779.0 | 18.961511 | 0.093367 | 18887.576472 |
| min | 1985.0 | 0.0 | 278.0 | 0.0 | 0.483 | 251.0 |
| 25% | 1995.0 | 3.0 | 97498.5 | 0.92 | 0.713 | 3447.0 |
| 50% | 2002.0 | 25.0 | 430150.0 | 5.99 | 0.779 | 9372.0 |
| 75% | 2008.0 | 131.0 | 1486143.0 | 16.62 | 0.855 | 24874.0 |
| max | 2016.0 | 22338.0 | 43805210.0 | 224.97 | 0.944 | 126352.0 |


### D. Data Wrangling

##### Renaming columns

In [19]:
# Renaming column 'suicides_no' to 'count of suicides'
# Renaming column 'suicides/100k pop' to 'suicide rate'
df.rename(columns = {'suicides_no' : 'count of suicides',
                     'suicides/100k pop' : 'suicide rate'}, inplace = True)

##### Checking for missing data

In [20]:
#Checking for missing data
df.isnull().sum()

country                   0
year                      0
sex                       0
age                       0
count of suicides         0
population                0
suicide rate              0
country-year              0
HDI for year          19456
 gdp_for_year ($)         0
gdp_per_capita ($)        0
generation                0
dtype: int64

The column 'HDI for year' has 19456 missing values out of 27820 rows (about 70%). As the variable is numeric, the missing values could have been imputed with the mean or median. However, this variable is not required for further analysis, so I will drop it from the dataset.
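For reference, the imputation option mentioned above could look roughly like the sketch below. It runs on a small made-up dataframe, not the real data, and the column name `HDI imputed` is hypothetical. It first computes the share of missing values per column, then fills the gaps with the column median.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "HDI for year": [0.70, np.nan, np.nan, 0.80],
})

# Share of missing values per column (the mean of a boolean mask).
missing_share = toy.isnull().mean()
print(missing_share)

# Median imputation: the median ignores NaN values by default.
hdi_median = toy["HDI for year"].median()
toy["HDI imputed"] = toy["HDI for year"].fillna(hdi_median)
print(toy)
```

When a large share of a column is missing, imputed values would dominate it, which is one more reason dropping the column is a reasonable choice here.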

##### Dropping columns not needed for further analysis

In [21]:
#Dropping column 'country-year' as it's a composite key of the country and year variables
#Dropping column 'HDI for year' as it's not required for further analysis
#The result is stored in a new dataframe, df1, so the original df is left unchanged

df1 = df.drop(columns = ['HDI for year','country-year'])

##### Checking shape of new df

In [22]:
#Check shape of new df

df1.shape

(27820, 10)

##### Checking for duplicates

In [23]:
#Checking for duplicates

df1_dups = df1[df1.duplicated()]

In [24]:
df1_dups

Unnamed: 0,country,year,sex,age,count of suicides,population,suicide rate,gdp_for_year ($),gdp_per_capita ($),generation


There are no duplicates.
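As a compact alternative to inspecting the filtered dataframe, the duplicate check can be reduced to a single count. The sketch below uses a made-up dataframe that deliberately contains one repeated row, to show what a positive result would look like and how it could be removed.

```python
import pandas as pd

# Made-up data with one exact duplicate row (rows 0 and 1 are identical).
toy = pd.DataFrame({
    "country": ["Albania", "Albania", "Austria"],
    "year": [1987, 1987, 1987],
    "count of suicides": [21, 21, 5],
})

# duplicated() marks every repeat of an earlier row as True.
n_dups = int(toy.duplicated().sum())
print(n_dups)

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates()
print(deduped.shape)
```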

##### Checking for mixed-type data

In [25]:
#Checking for mixed-type data columns

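for col in df1.columns.tolist():
    # Flag rows whose value type differs from the type of the first row in this column
    # (applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
    weird = (df1[[col]].applymap(type) != df1[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df1[weird]) > 0:
        print(col)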

There is no mixed-type data in the dataframe **df1**.
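A shorter way to express the same kind of check is to count the distinct Python types in each column. The helper `mixed_type_columns` below is a hypothetical name introduced for this sketch, and the dataframe is made up so that one column is clean and one is mixed.

```python
import pandas as pd

def mixed_type_columns(frame):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in frame.columns if frame[col].map(type).nunique() > 1]

# Made-up data: 'clean' holds only strings, 'mixed' holds str, int and float.
toy = pd.DataFrame({
    "clean": ["a", "b", "c"],
    "mixed": ["a", 1, 2.5],
})

flagged = mixed_type_columns(toy)
print(flagged)
```

Note that a missing value in a text column is stored as a float `NaN`, so a check like this would also flag string columns that contain missing data.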

##### Exporting clean dataset

In [26]:
# Exporting df1
df1.to_csv(os.path.join(path, '02 Data','Prepared Data', 'master_clean_df1_6_1.csv'))
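One detail worth knowing about the export: by default `to_csv` also writes the dataframe index, which comes back as an extra `Unnamed: 0` column when the file is read again. Whether that matters depends on how the file is used later. The sketch below shows the difference on a made-up dataframe, writing to in-memory buffers instead of the project folder.

```python
import io
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({"country": ["Albania"], "year": [1987]})

# Default export: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# Export with index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```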

---

## 04. Questions for Analysis

After some initial data cleaning and consistency checks, the following questions remain for further analysis:

1. What patterns can we see in the total number of suicides over the years?
2. Are suicide rates climbing or falling in various countries? What do suicides by country look like over the years?
3. Which ten countries have the highest and the lowest suicide rates?
4. Which variables (such as gender or age) correlate with suicide rates?
5. What are the major factors behind suicides: low GDP per capita, belonging to a certain generation, mental health reasons, etc.?
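As a starting point for the country ranking question, the sketch below shows one possible way to compute a suicide rate per country. It sums suicides and population per country and derives the rate from the totals, since averaging the per-row rates would give small age groups the same weight as large ones. The data is made up, the column name `rate per 100k` is hypothetical, and the renamed column `count of suicides` from the wrangling step is assumed.

```python
import pandas as pd

# Made-up data: two countries, two demographic groups each.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2000, 2000, 2000],
    "count of suicides": [10, 30, 5, 5],
    "population": [100_000, 100_000, 50_000, 150_000],
})

# Sum suicides and population per country, then derive the rate from the totals.
totals = toy.groupby("country")[["count of suicides", "population"]].sum()
totals["rate per 100k"] = totals["count of suicides"] / totals["population"] * 100_000

# Rank from highest to lowest; head/tail give the top and bottom ten.
ranked = totals.sort_values("rate per 100k", ascending=False)
top_ten = ranked.head(10)
bottom_ten = ranked.tail(10)
print(ranked)
```

The same grouping by `year` instead of `country` would give a first view of the trend over time.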


---