## Packages

In [None]:
%pip install scikit-learn

In [1]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

## General information about the dataset

In [None]:
# load the DataFrame
#import pandas as pd
# first, `os` is used to switch to the directory above `src`, which holds the data and results
import os
os.getcwd()
os.chdir(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))


aids = pd.read_csv('data/aids.csv', sep =",")

In [6]:
aids.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2139 entries, 0 to 2138
Data columns (total 25 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   num      2139 non-null   int64  
 1   time     2139 non-null   int64  
 2   trt      2139 non-null   int64  
 3   age      2139 non-null   int64  
 4   wtkg     2139 non-null   float64
 5   hemo     2139 non-null   int64  
 6   homo     2139 non-null   int64  
 7   drugs    2139 non-null   int64  
 8   karnof   2139 non-null   int64  
 9   oprior   2139 non-null   int64  
 10  z30      2139 non-null   int64  
 11  zprior   2139 non-null   int64  
 12  preanti  2139 non-null   int64  
 13  race     2139 non-null   int64  
 14  gender   2139 non-null   int64  
 15  str2     2139 non-null   int64  
 16  strat    2139 non-null   int64  
 17  symptom  2139 non-null   int64  
 18  treat    2139 non-null   int64  
 19  offtrt   2139 non-null   int64  
 20  cd40     2139 non-null   int64  
 21  cd420    2139 

## Creating dictionaries for labeling

The individual features are described in the `README`. Labels are created for the non-continuous features. These features are stored as numeric codes, so the data itself does not need to be replaced.

In [8]:
trt_dict = {0: 'ZDV only', 1: 'ZDV + ddI', 2: 'ZDV + Zal', 3: 'ddI only'}
hemo_dict = {0: 'no hemophilia', 1: 'hemophilia'}
homo_dict = {0: 'no homosexual', 1: 'homosexual'}
drugs_dict = {0: 'no IV drugs use', 1: 'IV drugs use'}
oprior_dict = {0: 'no prior antiretroviral therapy (no ZDV)', 1: 'prior antiretroviral therapy (no ZDV)'}
z30_dict = {0: 'no ZDV-Therapy 30 days before randomisation', 1: 'ZDV-Therapy 30 days before randomisation'}
zprior_dict = {0: 'no ZDV-Therapy before randomisation', 1: 'ZDV-Therapy before randomisation'}
race_dict = {0: 'White', 1: 'Non-White'}
gender_dict = {0: 'Female', 1: 'Male'}
str2_dict = {0: 'naive', 1: 'experienced'}
strat_dict = {1: 'Antiretroviral Naive', 2: '> 1 but <= 52 weeks of prior antiretroviral therapy', 3: '> 52 weeks'}
symptom_dict = {0: 'asymptomatic', 1: 'symptomatic'}
treat_dict = {0: 'ZDV only', 1: 'others'}
cid_dict = {0: 'censoring', 1: 'failure'}
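
The dictionaries above can be applied for display only, leaving the numeric codes in place. A minimal sketch, assuming a toy Series as a stand-in for the `hemo` column (the values are illustrative and not taken from `aids.csv`):

```python
import pandas as pd

# Toy stand-in for the `hemo` column; illustrative values, not from aids.csv
codes = pd.Series([0, 1, 0, 0], name="hemo")
hemo_dict = {0: 'no hemophilia', 1: 'hemophilia'}

# map() returns a new Series, so the numeric codes stay untouched
labels = codes.map(hemo_dict)

print(labels.tolist())
print(codes.tolist())
```

Because `map` returns a new object, the labeled version can be used for tables and plots while the statistics continue to run on the original codes.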


In [9]:
aids.head()

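| | num | time | trt | age | wtkg | hemo | homo | drugs | karnof | oprior | ... | str2 | strat | symptom | treat | offtrt | cd40 | cd420 | cd80 | cd820 | cid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 948 | 2 | 48 | 89.8128 | 0 | 0 | 0 | 100 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 422 | 477 | 566 | 324 | 0 |
| 1 | 1 | 1002 | 3 | 61 | 49.4424 | 0 | 0 | 0 | 90 | 0 | ... | 1 | 3 | 0 | 1 | 0 | 162 | 218 | 392 | 564 | 1 |
| 2 | 2 | 961 | 3 | 45 | 88.452 | 0 | 1 | 1 | 90 | 0 | ... | 1 | 3 | 0 | 1 | 1 | 326 | 274 | 2063 | 1893 | 0 |
| 3 | 3 | 1166 | 3 | 47 | 85.2768 | 0 | 1 | 0 | 100 | 0 | ... | 1 | 3 | 0 | 1 | 0 | 287 | 394 | 1590 | 966 | 0 |
| 4 | 4 | 1090 | 0 | 43 | 66.6792 | 0 | 1 | 0 | 100 | 0 | ... | 1 | 3 | 0 | 0 | 0 | 504 | 353 | 870 | 782 | 0 |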


## Contingency table

Contingency tables are built for the categorical features in preparation for the association statistics. Fisher's exact test is used.

In [15]:
import numpy as np
var1 = np.array(aids["hemo"])
var2 = np.array(aids["homo"])

In [57]:

import statsmodels.api as sm
import pickle
d = {"Hemophilia": var1, "Homosexuality": var2}
d = pd.DataFrame(d)
table = sm.stats.Table.from_data(d)

from scipy import stats
oddsratio, pvalue = stats.fisher_exact(table.table)
print("Odds-Ratio:", oddsratio, "p-value:", pvalue)


Odds-Ratio: 0.02075294999063495 p-value: 3.669612705132406e-74


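The odds ratio of 0.021 indicates that the odds are not equal across the two rows of the 2 x 2 table. At first glance the p-value looks like 3.67, which would point to an error, since a p-value lies between 0 and 1. The output is in scientific notation, however: 3.67e-74 means 3.67 × 10^-74, a valid and extremely small p-value. I want to look at this more closely and build a fourfold (2 x 2) table, referred to here as a confusion matrix.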
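
The reported values can be cross-checked by hand from the four cell counts. The following is a minimal sketch using only the standard library. It assumes the usual two-sided definition of Fisher's exact test (the sum of the probabilities of all tables, with fixed margins, that are no more likely than the observed one). The helper `table_prob` is a hypothetical name introduced for illustration, and the result may differ slightly from `scipy` because of its internal tolerance.

```python
from fractions import Fraction
from math import comb

# Cell counts of the 2 x 2 table: hemo in rows, homo in columns
a, b = 554, 1405   # hemo = 0: homo = 0, homo = 1
c, d = 171, 9      # hemo = 1: homo = 0, homo = 1

# Sample odds ratio (cross-product ratio)
odds_ratio = (a * d) / (b * c)

row0, col0, n = a + b, a + c, a + b + c + d

def table_prob(x):
    """Hypergeometric probability that the top-left cell equals x, margins fixed."""
    return Fraction(comb(col0, x) * comb(n - col0, row0 - x), comb(n, row0))

lo = max(0, row0 + col0 - n)   # smallest feasible top-left count
hi = min(row0, col0)           # largest feasible top-left count
p_obs = table_prob(a)

# Two-sided p-value: total probability of all tables no more likely than the observed one
p_value = float(sum(p for p in (table_prob(x) for x in range(lo, hi + 1)) if p <= p_obs))

print(f"odds ratio = {odds_ratio:.6f}")
print(f"p-value    = {p_value:.3e}")
```

Exact rational arithmetic via `Fraction` avoids underflow in the intermediate binomial terms; the conversion to `float` happens only once, at the end.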

## Confusion Matrix

In [53]:
import pandas as pd
var1 = aids["hemo"].replace(hemo_dict)
var2 = aids["homo"].replace(homo_dict)
v1 = pd.Series(var1,name="Hemophilia")
v2 = pd.Series(var2, name="Homosexuality")
df_confusion = pd.crosstab(v1, v2, margins=False) # with margins=True, the row and column totals are added as well


In [58]:
df_confusion

| Hemophilia / Homosexuality | homosexual | no homosexual |
|---|---|---|
| hemophilia | 9 | 171 |
| no hemophilia | 1405 | 554 |


In [50]:
# save the table; the context manager closes the file even if dump() fails
with open("tables/tbl1.pickle", "wb") as file_to_write:
    pickle.dump(df_confusion, file_to_write)