## 0. DATA FOR SYNTHETIC MODELS

In [2]:
import pandas as pd
import numpy as np 
import warnings

## 1) Prepare data for synthetic models

We will develop models that generate new synthetic data resembling our score dataset. For this to work, the synthetic data must share the format of the original data, so we keep one row per area rather than creating a column for each area.
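
As a rough illustration of the two layouts, the sketch below builds a tiny long-format table (one row per patient, day, and area) and the wide alternative with one column per area. The values and the names `long_df` and `wide_df` are made up for this example and are not taken from the raw subsets.

```python
import pandas as pd

# Hypothetical toy data in the long layout: one row per (patient_id, day, area_id).
long_df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "day": ["2020-01-01"] * 4,
    "area_id": [13, 14, 13, 14],
    "area_score": [0.47, 0.30, 0.52, 0.41],
})

# The wide alternative (one column per area), shown only for contrast.
wide_df = long_df.pivot_table(
    index=["patient_id", "day"], columns="area_id", values="area_score"
)

print(long_df.shape)  # (4, 4)
print(wide_df.shape)  # (2, 2)
```

Keeping the long layout means the generated rows can be compared column by column with the original rows.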
We start working with the two raw subsets:

In [4]:
df1 = pd.read_csv('../data/raw_data/raw_subset_data1.csv')
df2 = pd.read_csv('../data/raw_data/raw_subset_data2.csv')

In [5]:
df1 = df1.drop(columns=['education_id', 'Unnamed: 0'])
df2 = df2.drop(columns=['education_id', 'id', 'Unnamed: 0'])

In [6]:
# Remove rows with a missing gender or birthdate
df1 = df1[df1['gender'].notnull()] 
df1 = df1[df1['birthdate'].notnull()] 

df2 = df2[df2['gender'].notnull()] 
df2 = df2[df2['birthdate'].notnull()] 
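
The chained `notnull()` filters can also be written as a single `dropna(subset=...)` call. The sketch below checks on a small made-up table (`toy` is a hypothetical name, not one of the raw subsets) that both forms keep the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing gender and one missing birthdate.
toy = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "gender": [1.0, np.nan, 0.0, 1.0],
    "birthdate": ["2016-02-09", "2015-05-01", None, "2014-11-23"],
})

# Chained boolean filters, in the same style as the cell above.
chained = toy[toy["gender"].notnull()]
chained = chained[chained["birthdate"].notnull()]

# Equivalent single call: drop rows where either column is missing.
one_call = toy.dropna(subset=["gender", "birthdate"])

print(chained.equals(one_call))         # True
print(one_call["patient_id"].tolist())  # [1, 4]
```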

In [7]:
df2.head()

|   | patient_id | day | score | area_id | area_score | gender | birthdate |
|---|---|---|---|---|---|---|---|
| 0 | 5763 | 2020-01-01T00:00:00.000Z | 0.4174 | 13 | 0.473 | 1.0 | 2016-02-09 |
| 1 | 5763 | 2020-01-01T00:00:00.000Z | 0.4174 | 14 | 0.2977 | 1.0 | 2016-02-09 |
| 2 | 5763 | 2020-01-01T00:00:00.000Z | 0.4174 | 15 | 0.1667 | 1.0 | 2016-02-09 |
| 3 | 5763 | 2020-01-01T00:00:00.000Z | 0.4174 | 16 | 0.2477 | 1.0 | 2016-02-09 |
| 4 | 5763 | 2020-01-01T00:00:00.000Z | 0.4174 | 22 | 0.1667 | 1.0 | 2016-02-09 |


We merge the subsets and eliminate duplicates from their intersection:

In [12]:
merged_df = pd.concat([df1, df2], ignore_index=True, sort=False)

In [14]:
# Drop rows that appear twice because the two subsets overlap
clean_data = merged_df.drop_duplicates(['day', 'patient_id', 'area_id', 'area_score'])
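
To make the merge and deduplication step concrete, here is a minimal sketch on two made-up, overlapping subsets. The frames `a` and `b` and their values are hypothetical. The trailing `.copy()` is an addition that is not in the notebook; it can help avoid pandas' `SettingWithCopyWarning` when columns are assigned later.

```python
import pandas as pd

# Two hypothetical subsets that share one identical row (patient 1, area 14).
a = pd.DataFrame({
    "patient_id": [1, 1],
    "day": ["2020-01-01", "2020-01-01"],
    "area_id": [13, 14],
    "area_score": [0.47, 0.30],
})
b = pd.DataFrame({
    "patient_id": [1, 2],
    "day": ["2020-01-01", "2020-01-01"],
    "area_id": [14, 13],
    "area_score": [0.30, 0.52],
})

# Stack the subsets, then keep the first occurrence of each key combination.
merged = pd.concat([a, b], ignore_index=True, sort=False)
key = ["day", "patient_id", "area_id", "area_score"]
deduped = merged.drop_duplicates(key).copy()

print(len(merged), len(deduped))  # 4 3
```

Because `area_score` is part of the key, two rows for the same patient, day, and area that differ only in score would both be kept.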


Finally, we remove records with implausible birthdates and save the file that will be used as input for the models.

In [15]:
warnings.filterwarnings('ignore')

clean_data['birthdate'] = pd.to_datetime(clean_data['birthdate'], format='%Y-%m-%d', errors='coerce').dt.strftime('%Y-%m-%d')
clean_data["day"] = pd.to_datetime(clean_data["day"], errors='coerce').dt.strftime('%Y-%m-%d')

In [17]:
clean_data = clean_data[clean_data['birthdate'].notnull()] 

In [18]:
clean_data['birth_year'] = pd.to_datetime(clean_data['birthdate']).dt.year

clean_data = clean_data.dropna(subset=['birth_year'])
clean_data = clean_data[(clean_data['birth_year'] >= 1900) & (clean_data['birth_year'] <= 2023)] # keep only plausible birth years

In [19]:
clean_data = clean_data.drop('birth_year', axis=1)
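
The birthdate cleanup above can be sketched more compactly without a temporary `birth_year` column. The example below uses made-up values (`toy` is a hypothetical name) and the same 1900 to 2023 bounds as the notebook; it is one possible way to write the filter, not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy data: one valid date, one too old, one unparseable, one in the future.
toy = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "birthdate": ["2016-02-09", "1850-01-01", "not a date", "2030-06-15"],
})

# Unparseable strings become NaT instead of raising an error.
parsed = pd.to_datetime(toy["birthdate"], format="%Y-%m-%d", errors="coerce")

# Keep rows whose date parsed and whose year falls within the plausible range.
valid = parsed.notna() & parsed.dt.year.between(1900, 2023)
cleaned = toy[valid]

print(cleaned["patient_id"].tolist())  # [1]
```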

In [22]:
# clean_data.to_csv('../data/input_syntheticmodels.csv', index=False)

The data is now prepared for the models; see Jupyter Notebooks 1 and 2.