## Merging the datasets with each state's static information

We merge four preprocessed datasets that contain state-level information on hospital beds, health insurance coverage, population, and age groups.

In [21]:
pip install openpyxl



In [22]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

We load the datasets.

In [23]:
df_estado_1 = pd.read_csv('/content/acs_2018_health_insurance_coverage_estimates_cleaned.csv')
df_estado_2 = pd.read_csv('/content/kff_usa_hospital_beds_per_capita_2018_cleaned.csv')
df_estado_3 = pd.read_csv('/content/population_cleaned.csv')
df_estado_4 = pd.read_csv('/content/age_groups_cleaned.csv')


Before joining the datasets, we apply a small cleaning step to avoid missing values.

We drop the rows with missing values from the first dataset and remove the rows for Puerto Rico ("PR") from the population and age-group datasets, since Puerto Rico is not present in all of the datasets.

In [24]:
df_estado_1 = df_estado_1.dropna()
df_estado_3 = df_estado_3[df_estado_3['state'] != 'PR']
df_estado_4 = df_estado_4[df_estado_4['state'] != 'PR']
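One way to find out which states are missing from some of the datasets is to compare their sets of keys. The following is a minimal sketch on toy frames with made-up values; the names `frames`, `common_states` and `extra_states` are illustrative and are not part of the notebook.

```python
import pandas as pd

# Toy frames standing in for the four state-level datasets (values are made up).
frames = {
    "insurance": pd.DataFrame({"state": ["AL", "WY"]}),
    "beds": pd.DataFrame({"state": ["AL", "WY"]}),
    "population": pd.DataFrame({"state": ["AL", "PR", "WY"]}),
    "age_groups": pd.DataFrame({"state": ["AL", "PR", "WY"]}),
}

# States present in every dataset.
common_states = set.intersection(*(set(df["state"]) for df in frames.values()))

# For each dataset, the states that at least one other dataset lacks.
extra_states = {
    name: sorted(set(df["state"]) - common_states) for name, df in frames.items()
}
print(extra_states)
```

Any state listed in `extra_states` would be dropped by an inner join, so filtering it out explicitly beforehand makes that decision visible.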


We inspect the columns of each dataset.

In [25]:
print("Columns of df_estado_1:", df_estado_1.columns)
print("Columns of df_estado_2:", df_estado_2.columns)
print("Columns of df_estado_3:", df_estado_3.columns)
print("Columns of df_estado_4:", df_estado_4.columns)

Columns of df_estado_1: Index(['state', 'acs_variable', 'estimate', 'margin_of_error',
       'estimate_type_population', 'health_insurance_coverage',
       'private_insurance_coverage', 'public_coverage', 'age_group_under_19',
       'age_group_overall', 'not_in_labor_force', 'labor_force_Unknown'],
      dtype='object')
Columns of df_estado_2: Index(['state', 'state_local_government', 'non_profit', 'for_profit', 'total'], dtype='object')
Columns of df_estado_3: Index(['state', 'population', 'pop_density'], dtype='object')
Columns of df_estado_4: Index(['state', 'population', 'agegroup', 'pct_pop'], dtype='object')


We rename certain columns to avoid duplicate names and to make each variable easier to interpret.

In [26]:
df_estado_2 = df_estado_2.rename(columns={'state_local_government': 'bedsstate_local_government', 'non_profit': 'bedsnon_profit', 'for_profit': 'bedsfor_profit', 'total': 'bedstotal' })
df_estado_3 = df_estado_3.rename(columns={'population': 'population_state', 'pop_density': 'pop_density_state'})
df_estado_4 = df_estado_4.rename(columns={'population': 'population_agegroup', 'pct_pop': 'pct_pop_agegroup'})
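The renaming matters because pandas appends `_x` and `_y` suffixes to non-key columns that share a name across the two sides of a merge. As a rough sketch, a helper like the hypothetical `overlapping_columns` below (not part of the notebook) can list such clashes before merging. The toy frames only reproduce the original column names of two of the datasets.

```python
from itertools import combinations

import pandas as pd

# Empty toy frames with the original (pre-rename) column names.
frames = {
    "df_estado_3": pd.DataFrame(columns=["state", "population", "pop_density"]),
    "df_estado_4": pd.DataFrame(columns=["state", "population", "agegroup", "pct_pop"]),
}


def overlapping_columns(frames, key="state"):
    """Return {(name_a, name_b): [shared non-key columns]} for each clashing pair."""
    clashes = {}
    for (name_a, df_a), (name_b, df_b) in combinations(frames.items(), 2):
        shared = sorted((set(df_a.columns) & set(df_b.columns)) - {key})
        if shared:
            clashes[(name_a, name_b)] = shared
    return clashes


print(overlapping_columns(frames))
# {('df_estado_3', 'df_estado_4'): ['population']}
```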

We merge the four datasets on the "state" column, which identifies the state each row belongs to.

In [27]:
merged_df = df_estado_1.merge(df_estado_2, on='state', how='inner') \
    .merge(df_estado_3, on='state', how='inner') \
    .merge(df_estado_4, on='state', how='inner')

print(merged_df.head())


  state acs_variable   estimate  margin_of_error  estimate_type_population  \
0    AL    DP03_0096  4307566.0           8603.0                         1   
1    AL    DP03_0096  4307566.0           8603.0                         1   
2    AL    DP03_0096  4307566.0           8603.0                         1   
3    AL    DP03_0096  4307566.0           8603.0                         1   
4    AL    DP03_0096  4307566.0           8603.0                         1   

   health_insurance_coverage  private_insurance_coverage  public_coverage  \
0                          1                           0                0   
1                          1                           0                0   
2                          1                           0                0   
3                          1                           0                0   
4                          1                           0                0   

   age_group_under_19  age_group_overall  ...  labor_force_Unknown  
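The first rows of the output repeat the same insurance values because the merge on "state" is many-to-many: every insurance row of a state is paired with every age-group row of that state. The sketch below reproduces this on toy data with made-up values and uses the `validate` argument of `DataFrame.merge` to state the expected relationship explicitly. The frame names are illustrative.

```python
import pandas as pd

# Toy data: two ACS variables and three age groups for one state (values are made up).
insurance = pd.DataFrame({"state": ["AL", "AL"], "acs_variable": ["A", "B"]})
beds = pd.DataFrame({"state": ["AL"], "bedstotal": [3.1]})
age_groups = pd.DataFrame({"state": ["AL"] * 3, "agegroup": [0, 1, 2]})

# Many-to-one: each insurance row matches exactly one beds row,
# so the row count is preserved.
step1 = insurance.merge(beds, on="state", how="inner", validate="many_to_one")

# Many-to-many: every insurance row pairs with every age-group row of the state.
step2 = step1.merge(age_groups, on="state", how="inner")
print(len(insurance), len(step1), len(step2))  # 2 2 6

# validate raises MergeError when the declared relationship does not hold.
try:
    step1.merge(age_groups, on="state", validate="many_to_one")
except pd.errors.MergeError as exc:
    print("not many-to-one:", exc)
```

Declaring the relationship is optional, but it turns an unexpected multiplication of rows into an explicit error instead of a silently larger table.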

We check that there are no missing values after the merge.

In [28]:
print(merged_df.isnull().sum())

state                         0
acs_variable                  0
estimate                      0
margin_of_error               0
estimate_type_population      0
health_insurance_coverage     0
private_insurance_coverage    0
public_coverage               0
age_group_under_19            0
age_group_overall             0
not_in_labor_force            0
labor_force_Unknown           0
bedsstate_local_government    0
bedsnon_profit                0
bedsfor_profit                0
bedstotal                     0
population_state              0
pop_density_state             0
population_agegroup           0
agegroup                      0
pct_pop_agegroup              0
dtype: int64
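Because an inner join drops unmatched states rather than filling them with NaN, a zero null count does not by itself show that every state survived the merge. A complementary check, sketched here on toy key columns with made-up values, is an outer merge with `indicator=True`. The names `left`, `right`, `check` and `unmatched` are illustrative.

```python
import pandas as pd

# Toy key columns (made up): "PR" appears only on the right-hand side.
left = pd.DataFrame({"state": ["AL", "WY"]})
right = pd.DataFrame({"state": ["AL", "PR", "WY"]})

# An outer merge with indicator=True adds a "_merge" column that records
# whether each key was found in "both", "left_only" or "right_only".
check = left.merge(right, on="state", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["state", "_merge"]]
print(unmatched)
```

An empty `unmatched` frame would indicate that both sides contain exactly the same set of states.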


In [29]:
merged_df

Unnamed: 0,state,acs_variable,estimate,margin_of_error,estimate_type_population,health_insurance_coverage,private_insurance_coverage,public_coverage,age_group_under_19,age_group_overall,...,labor_force_Unknown,bedsstate_local_government,bedsnon_profit,bedsfor_profit,bedstotal,population_state,pop_density_state,population_agegroup,agegroup,pct_pop_agegroup
0,AL,DP03_0096,4307566.0,8603.0,1,1,0,0,0,1,...,1,1.4,0.8,0.9,3.1,4887871,96.509389,293203,0,0.059986
1,AL,DP03_0096,4307566.0,8603.0,1,1,0,0,0,1,...,1,1.4,0.8,0.9,3.1,4887871,96.509389,297900,1,0.060947
2,AL,DP03_0096,4307566.0,8603.0,1,1,0,0,0,1,...,1,1.4,0.8,0.9,3.1,4887871,96.509389,310495,2,0.063524
3,AL,DP03_0096,4307566.0,8603.0,1,1,0,0,0,1,...,1,1.4,0.8,0.9,3.1,4887871,96.509389,315680,3,0.064584
4,AL,DP03_0096,4307566.0,8603.0,1,1,0,0,0,1,...,1,1.4,0.8,0.9,3.1,4887871,96.509389,325220,4,0.066536
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30595,WY,DP03_0118P,17.7,1.3,0,0,0,0,0,0,...,0,2.6,0.5,0.4,3.5,577737,5.950611,33583,13,0.058129
30596,WY,DP03_0118P,17.7,1.3,0,0,0,0,0,0,...,0,2.6,0.5,0.4,3.5,577737,5.950611,24585,14,0.042554
30597,WY,DP03_0118P,17.7,1.3,0,0,0,0,0,0,...,0,2.6,0.5,0.4,3.5,577737,5.950611,16203,15,0.028046
30598,WY,DP03_0118P,17.7,1.3,0,0,0,0,0,0,...,0,2.6,0.5,0.4,3.5,577737,5.950611,10323,16,0.017868


We save the resulting data matrix, which combines the four state-level datasets.

In [31]:
merged_df.to_csv('states_data_clean.csv', index=False)
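As a quick sanity check, the saved file can be read back and compared with the frame in memory. This is a minimal sketch on a toy frame with made-up values, written to a temporary directory so that it does not touch the real output file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for merged_df (values are made up).
df = pd.DataFrame({"state": ["AL", "WY"], "bedstotal": [3.1, 3.5], "agegroup": [0, 16]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "states_data_clean.csv"
    # index=False keeps the row index out of the file, so no extra
    # "Unnamed: 0" column appears when the file is read back.
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises AssertionError if values or column names differ.
pd.testing.assert_frame_equal(df, reloaded, check_dtype=False)
print(reloaded.columns.tolist())
```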
