#### **Problem Formulation**

Our Dataset: https://www.kaggle.com/datasets/ronaldonyango/global-suicide-rates-1990-to-2022

##### Our Question: Which `socioeconomic factors` might be behind suicide rates around the world?

#### Import Important Libraries


In [2]:
import numpy as np
import math
import matplotlib.gridspec as gridspec
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
from sklearn.linear_model import LinearRegression
sb.set() # set the default Seaborn style for graphics
import plotly.express as px

In [3]:
data = pd.read_csv('age_std_suicide_rates_1990-2022.csv')
data.head()

Unnamed: 0,RegionCode,RegionName,CountryCode,CountryName,Year,Sex,SuicideCount,CauseSpecificDeathPercentage,StdDeathRate,DeathRatePer100K,Population,GDP,GDPPerCapita,GNI,GNIPerCapita,InflationRate,EmploymentPopulationRatio
0,EU,Europe,ALB,Albania,1992,Male,33,0.331959,2.335802,2.076386,3247039.0,652175000.0,200.85222,906184200.0,1740.0,226.005421,45.315
1,EU,Europe,ALB,Albania,1992,Female,14,0.19186,0.86642,0.874563,3247039.0,652175000.0,200.85222,906184200.0,1740.0,226.005421,45.315
2,EU,Europe,ALB,Albania,1993,Male,46,0.477724,3.330938,2.937233,3227287.0,1185315000.0,367.279225,1024263000.0,2110.0,85.004751,47.798
3,EU,Europe,ALB,Albania,1993,Female,27,0.385164,1.755077,1.686025,3227287.0,1185315000.0,367.279225,1024263000.0,2110.0,85.004751,47.798
4,EU,Europe,ALB,Albania,1994,Male,37,0.419406,2.678796,2.332619,3207536.0,1880951000.0,586.416135,1216681000.0,2300.0,22.565053,50.086


<span style='font-size:xxx-large'>**Data Preparation & Cleaning**</span>

<span style='font-size:x-large'>**Preliminary Feature Selection**</span>
> Select the relevant variables to find which one is the best predictor of the suicide rate.




In [4]:
# Selecting relevant variables for analysis
data = data[['RegionCode', 'Year', 'Sex', 'StdDeathRate',  'Population', 'GDPPerCapita',  'GNIPerCapita', 'InflationRate', 'EmploymentPopulationRatio']]
print("The shape of the dataset before cleaning", data.shape)
data = data.dropna()
assert not data.isnull().values.any()  # confirm that no missing values remain
print("The shape of the new dataset",data.shape)
print("===============================")
data.info()

The shape of the dataset before cleaning (5928, 9)
The shape of the new dataset (4732, 9)
<class 'pandas.core.frame.DataFrame'>
Index: 4732 entries, 0 to 5371
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   RegionCode                 4732 non-null   object 
 1   Year                       4732 non-null   int64  
 2   Sex                        4732 non-null   object 
 3   StdDeathRate               4732 non-null   float64
 4   Population                 4732 non-null   float64
 5   GDPPerCapita               4732 non-null   float64
 6   GNIPerCapita               4732 non-null   float64
 7   InflationRate              4732 non-null   float64
 8   EmploymentPopulationRatio  4732 non-null   float64
dtypes: float64(6), int64(1), object(2)
memory usage: 369.7+ KB
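
The row count falls from 5928 to 4732 because `dropna()` discards every row with at least one missing value. Before dropping, it can help to see which columns carry the gaps. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "StdDeathRate": [2.3, 0.9, np.nan, 1.8],
    "GDPPerCapita": [200.9, np.nan, 367.3, 367.3],
    "InflationRate": [226.0, 226.0, 85.0, 85.0],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()
print("rows before:", len(toy), "rows after:", len(cleaned))
```

Running the same `isna().sum()` on `data` before the `dropna()` call would show which of the nine selected columns account for the removed rows.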


In [5]:
def removeOutliers(dataSet, quantifier):
    """Keep only the rows whose `quantifier` value lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = dataSet[quantifier].quantile(0.25)
    q3 = dataSet[quantifier].quantile(0.75)

    IQR = q3 - q1  # interquartile range

    lower_bound = q1 - 1.5 * IQR
    upper_bound = q3 + 1.5 * IQR
    cleanedDataset = dataSet[(dataSet[quantifier] >= lower_bound) & (dataSet[quantifier] <= upper_bound)]

    return cleanedDataset

`Outliers will be removed during EDA, because suicide rates vary by country.`
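
To illustrate the 1.5 × IQR rule used by `removeOutliers`, here is a small sketch on made-up numbers. The helper `remove_outliers_iqr` and the frame `toy` are hypothetical names introduced only for this example. The helper repeats the same logic so that the block runs on its own:

```python
import pandas as pd

def remove_outliers_iqr(frame, column):
    """Same 1.5 * IQR rule as removeOutliers, repeated so this example is self-contained."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # between() is inclusive on both ends, matching the >= / <= comparison.
    return frame[frame[column].between(lower, upper)]

# Made-up rates: five ordinary values and one extreme value.
toy = pd.DataFrame({"StdDeathRate": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})

trimmed = remove_outliers_iqr(toy, "StdDeathRate")
print(trimmed)
```

Here Q1 is 2.25 and Q3 is 4.75, so the accepted range is [-1.5, 8.5] and only the value 100.0 is dropped.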


In [7]:
# Encode region and sex as integers, then split the data by sex

# EU: Europe, AS: Asia, OA: Oceania, CSA: Central and South America, NAC: North America & Caribbean, AF: Africa
regioncode_map = {'EU': 0, 'AS':1, 'OA':2, 'NAC':3, 'CSA':4, 'AF':5}
data['RegionCode'] = data['RegionCode'].replace(regioncode_map)
data['RegionCode']=data['RegionCode'].astype('category')
#data.head()

sex_mapping = {'Male': 0, 'Female': 1}
data['Sex'] = data['Sex'].replace(sex_mapping)
maleData = data[data['Sex']==0]
femaleData = data[data['Sex']==1]
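
The encoding and split above can be checked on a small example. The sketch below uses a made-up frame (`toy`) and, as an assumption, swaps `.replace` for `Series.map`, which turns any label missing from the dictionary into `NaN` so that unmapped codes are easy to spot. It is an alternative illustration, not the notebook's own method:

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up).
toy = pd.DataFrame({
    "RegionCode": ["EU", "EU", "AS", "AF"],
    "Sex": ["Male", "Female", "Male", "Female"],
    "StdDeathRate": [2.3, 0.9, 5.1, 1.2],
})

region_map = {"EU": 0, "AS": 1, "OA": 2, "NAC": 3, "CSA": 4, "AF": 5}
sex_map = {"Male": 0, "Female": 1}

encoded = toy.copy()
# map() leaves NaN for labels missing from the dictionary.
encoded["RegionCode"] = encoded["RegionCode"].map(region_map).astype("category")
encoded["Sex"] = encoded["Sex"].map(sex_map)

# Check that every label was covered by the dictionaries.
all_mapped = bool(encoded[["RegionCode", "Sex"]].notna().all().all())
print("all labels mapped:", all_mapped)

# Split by the encoded sex column.
male = encoded[encoded["Sex"] == 0]
female = encoded[encoded["Sex"] == 1]
print(len(male), "male rows,", len(female), "female rows")
```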

---------------------------------