# 01 Target Modeling

*Let's look at how the targets can be modeled in a regression, a classification, and a survival model (if you are not familiar with the first two, check [this overview](https://www.geeksforgeeks.org/ml-classification-vs-regression/)).*

* * *

# Imports

In [1]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

In [2]:
import pandas as pd
import numpy as np

from lifelines.datasets import load_dd
from src.dd_column_cfg import id_col, drop_cols, cat_cols, num_cols, duration_col, event_col, target_cols

# Data

[Democracy and Dictatorship dataset](https://lifelines.readthedocs.io/en/latest/lifelines.datasets.html#lifelines.datasets.load_dd)

Classification of political regimes as democracy or dictatorship, covering 202 countries from 1946 (or the year of independence) to 2008.

In [3]:
data = load_dd()

data = data.reset_index().rename(columns={'index': id_col})
data.democracy = np.where(data.democracy=='Democracy', 1,0)

data.shape

(1808, 13)

The `duration` column holds the duration in years; the `observed` column holds the event occurrence (1 if the event was observed, 0 if the observation is censored).

In [4]:
data.head()

Unnamed: 0,regime_id,ctryname,cowcode2,politycode,un_region_name,un_continent_name,ehead,leaderspellreg,democracy,regime,start_year,duration,observed
0,0,Afghanistan,700,700.0,Southern Asia,Asia,Mohammad Zahir Shah,Mohammad Zahir Shah.Afghanistan.1946.1952.Mona...,0,Monarchy,1946,7,1
1,1,Afghanistan,700,700.0,Southern Asia,Asia,Sardar Mohammad Daoud,Sardar Mohammad Daoud.Afghanistan.1953.1962.Ci...,0,Civilian Dict,1953,10,1
2,2,Afghanistan,700,700.0,Southern Asia,Asia,Mohammad Zahir Shah,Mohammad Zahir Shah.Afghanistan.1963.1972.Mona...,0,Monarchy,1963,10,1
3,3,Afghanistan,700,700.0,Southern Asia,Asia,Sardar Mohammad Daoud,Sardar Mohammad Daoud.Afghanistan.1973.1977.Ci...,0,Civilian Dict,1973,5,0
4,4,Afghanistan,700,700.0,Southern Asia,Asia,Nur Mohammad Taraki,Nur Mohammad Taraki.Afghanistan.1978.1978.Civi...,0,Civilian Dict,1978,1,0


## Target Cols

**Duration**

In [5]:
val_max = data[duration_col].max()
val_max

47

In [6]:
data[duration_col].describe()

count    1808.000000
mean        5.043695
std         6.208406
min         1.000000
25%         1.000000
50%         3.000000
75%         6.000000
max        47.000000
Name: duration, dtype: float64

**Event**

In [7]:
data[event_col].value_counts()

1    1468
0     340
Name: observed, dtype: int64

# Modeling as Regression problem

In [8]:
data[num_cols+cat_cols+[duration_col]].head(5)

Unnamed: 0,un_region_name,un_continent_name,democracy,regime,duration
0,Southern Asia,Asia,0,Monarchy,7
1,Southern Asia,Asia,0,Civilian Dict,10
2,Southern Asia,Asia,0,Monarchy,10
3,Southern Asia,Asia,0,Civilian Dict,5
4,Southern Asia,Asia,0,Civilian Dict,1


# Modeling as Classification problem

## Time independent

In [9]:
data[num_cols+cat_cols+[event_col]].head(5)

Unnamed: 0,un_region_name,un_continent_name,democracy,regime,observed
0,Southern Asia,Asia,0,Monarchy,1
1,Southern Asia,Asia,0,Civilian Dict,1
2,Southern Asia,Asia,0,Monarchy,1
3,Southern Asia,Asia,0,Civilian Dict,0
4,Southern Asia,Asia,0,Civilian Dict,0


## Time Dependent

* `observed == 0` and `duration < t` -> remove (censored before t, so no information at t)
* `observed == 1` and `duration <= t` -> 1 (event occurred by t)
* `duration > t` -> 0 (still running at t)


Example: will a new government make it past 4 years? (1 = it ended within 4 years, 0 = it was still running.)

In [10]:
years=4

In [11]:
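conds = [(data[event_col]==0) & (data[duration_col]<years), # censored before t -> remove (no information at t)
         (data[event_col]==1) & (data[duration_col]<=years), # event observed by t -> 1
         data[duration_col]>years] # still running at t -> 0
# 2 marks rows to drop; rows censored exactly at t match no condition and get np.select's default of 0
choices = [2, 1, 0]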

In [12]:
data['observed_years'] = np.select(conds, choices)
data.observed_years.value_counts()

1    1009
0     664
2     135
Name: observed_years, dtype: int64

In [13]:
data_years = data[data.observed_years!=2].copy()

In [14]:
data_years[num_cols+cat_cols+['observed_years']].head(5)

Unnamed: 0,un_region_name,un_continent_name,democracy,regime,observed_years
0,Southern Asia,Asia,0,Monarchy,0
1,Southern Asia,Asia,0,Civilian Dict,0
2,Southern Asia,Asia,0,Monarchy,0
3,Southern Asia,Asia,0,Civilian Dict,0
5,Southern Asia,Asia,0,Civilian Dict,0
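
The labeling rules above can be wrapped in a small helper so that other horizons are easy to try. This is only a sketch: the function name `label_at_horizon` and the output column `event_by_t` are hypothetical, and the toy frame stands in for `data`. It mirrors the notebook's logic, including the fact that rows censored exactly at `t` are kept with label 0.

```python
import pandas as pd


def label_at_horizon(df, t, duration_col="duration", event_col="observed"):
    """Hypothetical helper: binary label 'event_by_t' for a fixed horizon t.

    Rows censored before t are dropped, because nothing is known about them at t.
    """
    censored_early = (df[event_col] == 0) & (df[duration_col] < t)
    out = df.loc[~censored_early].copy()
    # 1 if the event was observed at or before t, otherwise 0 (still running at t)
    out["event_by_t"] = ((out[event_col] == 1) & (out[duration_col] <= t)).astype(int)
    return out


# Toy data (not the real dataset): duration in years, observed = event indicator
toy = pd.DataFrame({
    "duration": [7, 10, 5, 1, 2, 4, 4],
    "observed": [1, 1, 0, 0, 1, 1, 0],
})

labeled = label_at_horizon(toy, t=4)
print(labeled)
```

In the toy frame only the row with `duration == 1` and `observed == 0` is dropped; the row censored at exactly 4 years stays with label 0.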


# Modeling as Survival problem

Two targets: the duration and the event indicator.

In [15]:
data[num_cols+cat_cols+target_cols].head(5)

Unnamed: 0,un_region_name,un_continent_name,democracy,regime,duration,observed
0,Southern Asia,Asia,0,Monarchy,7,1
1,Southern Asia,Asia,0,Civilian Dict,10,1
2,Southern Asia,Asia,0,Monarchy,10,1
3,Southern Asia,Asia,0,Civilian Dict,5,0
4,Southern Asia,Asia,0,Civilian Dict,1,0
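
To illustrate why a survival model needs both target columns at once, here is a minimal hand-rolled Kaplan-Meier estimate on the five rows shown above. It is a sketch for intuition only, not the lifelines implementation (in practice one would likely use `lifelines.KaplanMeierFitter`); the function name `kaplan_meier` is hypothetical.

```python
import pandas as pd


def kaplan_meier(durations, events):
    """Hypothetical helper: Kaplan-Meier survival estimate at each distinct event time."""
    df = pd.DataFrame({"duration": durations, "event": events})
    survival = 1.0
    estimate = {}
    for t in sorted(df.loc[df["event"] == 1, "duration"].unique()):
        at_risk = (df["duration"] >= t).sum()  # rows still under observation at t
        ended = ((df["duration"] == t) & (df["event"] == 1)).sum()  # events exactly at t
        survival *= 1 - ended / at_risk
        estimate[int(t)] = float(survival)
    return pd.Series(estimate, name="survival")


# duration and observed values of the first five rows displayed above
km = kaplan_meier(durations=[7, 10, 10, 5, 1], events=[1, 1, 1, 0, 0])
print(km)
```

The censored rows (durations 5 and 1) never count as events; they only leave the at-risk set before the first event time. That is the information the regression and classification targets cannot express.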
