### RESEARCH CRITERIA:
- analysis of the passengers who survived
- analysis of the passengers who did not survive
- analysis of the variables
- analyze which parts of the ship were safest, depending on the passenger's class
    1. the ship went down bow first
    2. for a while, the safest part was the stern
    3. once the ship broke in half, the bow sank immediately and the stern shortly after
    4. first class was on the upper decks, second class was amidships toward the stern, third class was low down at the bow and at the stern
- select the variable that we consider the most influential
- assign a different weight to each column and aggregate the results
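
The last criterion (a different weight per column, then an aggregate) could be sketched as follows. This is only an illustration under stated assumptions: the weights, the per-column scores and the 0.5 threshold are all made up here and are not values chosen by the notebook.

```python
import pandas as pd

# Hypothetical weights, for illustration only; real values would have to be tuned
weights = {'Sex': 0.5, 'Pclass': 0.3, 'Age': 0.2}

# Made-up per-column scores in [0, 1] for three passengers
# (for example, the survival rate of the group each passenger belongs to)
scores = pd.DataFrame({
    'Sex':    [0.74, 0.19, 0.74],
    'Pclass': [0.63, 0.24, 0.24],
    'Age':    [0.33, 0.33, 0.58],
})

# Weighted sum of the column scores, one value per passenger
weighted = sum(scores[col] * w for col, w in weights.items())

# Hypothetical threshold that turns the score into a 0/1 prediction
prediction = (weighted >= 0.5).astype(int)
print(pd.DataFrame({'score': weighted.round(3), 'prediction': prediction}))
```

Keeping the weights in one dictionary makes it easy to try different weightings without touching the aggregation step.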

### DIVISION OF WORK:
- one person analyzes the survivors/non-survivors
- one person analyzes the variables
- one person analyzes the situation of the ship

In [117]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [118]:
# import the data files
relevant_cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
train = pd.read_csv('train.csv', usecols = relevant_cols + ['Survived'])
test = pd.read_csv('test.csv', usecols = relevant_cols)

In [119]:
# prepare the data so that it can be used in the analysis
# - encode the "Sex" column as binary (male = 0, female = 1)
train['Sex'] = train['Sex'].map({'male': 0, 'female': 1})
# - convert the "Age" column to integer (missing values are replaced with the mean age)
train['Age'] = train['Age'].fillna(train['Age'].mean())
train['Age'] = train['Age'].astype(int)

In [120]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int64  
 3   Age       891 non-null    int32  
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
dtypes: float64(1), int32(1), int64(5)
memory usage: 45.4 KB
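
The preprocessing above is applied to `train` only, while `test` is loaded but left untouched. A minimal sketch of one way to share the same steps between both frames is shown below. The helper name `preprocess`, its `age_fill` parameter and the tiny demo frames are assumptions made for illustration; filling the test ages with the training mean is also a design choice, not something the notebook states.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, age_fill: float) -> pd.DataFrame:
    """Return a copy with 'Sex' encoded as 0/1 and 'Age' as a gap-free integer."""
    out = df.copy()
    # Map only the 'Sex' column, so no other column is affected
    out['Sex'] = out['Sex'].map({'male': 0, 'female': 1})
    # Fill missing ages with a value computed once, then truncate to integer
    out['Age'] = out['Age'].fillna(age_fill).astype(int)
    return out

# Made-up frames standing in for train.csv and test.csv
train_demo = pd.DataFrame({'Sex': ['male', 'female', 'female'], 'Age': [22.0, None, 38.0]})
test_demo = pd.DataFrame({'Sex': ['female', 'male'], 'Age': [None, 30.5]})

age_mean = train_demo['Age'].mean()  # computed on the training data only
train_clean = preprocess(train_demo, age_mean)
test_clean = preprocess(test_demo, age_mean)
print(test_clean)
```

Because the helper works on a copy, the raw frames stay available for later checks.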


In [121]:
# survival rate by sex
female = train.loc[train.Sex == 1]["Survived"]
print(f"Female survival rate: {(sum(female)/len(female))*100}%")

male = train.loc[train.Sex == 0]["Survived"]
print(f"Male survival rate: {(sum(male)/len(male))*100}%")

Female survival rate: 74.20382165605095%
Male survival rate: 18.890814558058924%
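
The same rates can be obtained in one step with `groupby`, since the mean of a 0/1 column is the share of ones. The sketch below uses a made-up frame with the notebook's encoding, not the real data; the output column names are chosen here for readability.

```python
import pandas as pd

# Made-up rows with the same encoding as the notebook (Sex: 0 = male, 1 = female)
demo = pd.DataFrame({
    'Sex':      [0, 0, 0, 0, 1, 1, 1, 1],
    'Survived': [0, 0, 0, 1, 1, 1, 1, 0],
})

# Mean of 'Survived' per group = survival rate; count = group size
rates = demo.groupby('Sex')['Survived'].agg(['mean', 'count'])
rates['mean'] = (rates['mean'] * 100).round(1)
rates = rates.rename(columns={'mean': 'survival_rate_pct', 'count': 'passengers'})
print(rates)
```

Reporting the group size next to each rate helps judge how much a percentage can be trusted.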


In [122]:
# survival rate by passenger class (1st, 2nd, 3rd)
classes = pd.get_dummies(train["Pclass"])
classes.rename(columns={1: "First", 2: "Second", 3: "Third"}, inplace = True)

firstClass = train.loc[classes.First == 1]["Survived"]
print(f"First class survival rate: {(sum(firstClass)/len(firstClass))*100}%")

secondClass = train.loc[classes.Second == 1]["Survived"]
print(f"Second class survival rate: {(sum(secondClass)/len(secondClass))*100}%")

thirdClass = train.loc[classes.Third == 1]["Survived"]
print(f"Third class survival rate: {(sum(thirdClass)/len(thirdClass))*100}%")

First class survival rate: 62.96296296296296%
Second class survival rate: 47.28260869565217%
Third class survival rate: 24.236252545824847%


In [123]:
# survival rate by age
# group the ages into categories
train.loc[train['Age'] <= 12, 'Age'] = 0                           # junior
train.loc[(train['Age'] > 12) & (train['Age'] <= 19), 'Age'] = 1   # teen
train.loc[(train['Age'] > 19) & (train['Age'] <= 30), 'Age'] = 2   # young adult
train.loc[(train['Age'] > 30) & (train['Age'] <= 40), 'Age'] = 3   # adult1
train.loc[(train['Age'] > 40) & (train['Age'] <= 50), 'Age'] = 4   # adult2
train.loc[(train['Age'] > 50) & (train['Age'] <= 60), 'Age'] = 5   # adult3
train.loc[train['Age'] > 60, 'Age'] = 6   # senior
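
The cell above overwrites `Age` in place, so running it a second time would send every row to category 0, because all the codes 0 to 6 are themselves `<= 12`. A possible alternative, sketched here on made-up ages, is `pd.cut` writing to a separate series. The bin edges mirror the thresholds above; the label names and the variables `age_group` and `age_code` are assumptions for illustration.

```python
import pandas as pd

# Made-up ages covering every boundary used in the notebook
ages = pd.Series([4, 12, 13, 19, 20, 30, 31, 45, 60, 61, 80], name='Age')

# Right-closed bins: (-1, 12], (12, 19], (19, 30], (30, 40], (40, 50], (50, 60], (60, inf]
bins = [-1, 12, 19, 30, 40, 50, 60, float('inf')]
labels = ['junior', 'teen', 'young adult', 'adult1', 'adult2', 'adult3', 'senior']

# The raw ages are left intact, so this step can be re-run safely
age_group = pd.cut(ages, bins=bins, labels=labels)
age_code = age_group.cat.codes  # integer codes 0..6, in the same order as the labels
print(pd.DataFrame({'Age': ages, 'AgeGroup': age_group, 'AgeCode': age_code}))
```

The integer codes line up with the 0 to 6 categories used by the later cells.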

In [140]:
junior = train.loc[train.Age == 0]["Survived"]
print(f"0-12 yo survival rate: {(sum(junior)/len(junior))*100}% ({len(junior)})")

teen = train.loc[train.Age == 1]["Survived"]
print(f"13-19 yo survival rate: {(sum(teen)/len(teen))*100}% ({len(teen)})")

guy = train.loc[train.Age == 2]["Survived"]
print(f"20-30 yo survival rate: {(sum(guy)/len(guy))*100}% ({len(guy)})")

adult1 = train.loc[train.Age == 3]["Survived"]
print(f"31-40 yo survival rate: {(sum(adult1)/len(adult1))*100}% ({len(adult1)})")

adult2 = train.loc[train.Age == 4]["Survived"]
print(f"41-50 yo survival rate: {(sum(adult2)/len(adult2))*100}% ({len(adult2)})")

adult3 = train.loc[train.Age == 5]["Survived"]
print(f"51-60 yo survival rate: {(sum(adult3)/len(adult3))*100}% ({len(adult3)})")

senior = train.loc[train.Age == 6]["Survived"]
print(f"61+ yo survival rate: {(sum(senior)/len(senior))*100}% ({len(senior)})")

0-12 yo survival rate: 57.971014492753625% (69)
13-19 yo survival rate: 41.05263157894737% (95)
20-30 yo survival rate: 32.78301886792453% (424)
31-40 yo survival rate: 44.516129032258064% (155)
41-50 yo survival rate: 39.285714285714285% (84)
51-60 yo survival rate: 40.476190476190474% (42)
61+ yo survival rate: 22.727272727272727% (22)


In [138]:
# survival rate of first-class women, by age group
femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 0) & (classes.First == 1)]["Survived"]
print(f"0-12 yo female survival rate in first class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 1) & (classes.First == 1)]["Survived"]
print(f"13-19 yo female survival rate in first class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 2) & (classes.First == 1)]["Survived"]
print(f"20-30 yo female survival rate in first class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 3) & (classes.First == 1)]["Survived"]
print(f"31-40 yo female survival rate in first class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 4) & (classes.First == 1)]["Survived"]
print(f"41-50 yo female survival rate in first class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 5) & (classes.First == 1)]["Survived"]
print(f"51-60 yo female survival rate in first class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 6) & (classes.First == 1)]["Survived"]
print(f"61+ yo female survival rate in first class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

0-12 yo female survival rate in first class: 0.0% (1)
13-19 yo female survival rate in first class: 100.0% (13)
20-30 yo female survival rate in first class: 96.66666666666667% (30)
31-40 yo female survival rate in first class: 100.0% (24)
41-50 yo female survival rate in first class: 92.3076923076923% (13)
51-60 yo female survival rate in first class: 100.0% (11)
61+ yo female survival rate in first class: 100.0% (2)


In [142]:
# survival rate of second-class women, by age group
femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 0) & (classes.Second == 1)]["Survived"]
print(f"0-12 yo female survival rate in second class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 1) & (classes.Second == 1)]["Survived"]
print(f"13-19 yo female survival rate in second class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 2) & (classes.Second == 1)]["Survived"]
print(f"20-30 yo female survival rate in second class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 3) & (classes.Second == 1)]["Survived"]
print(f"31-40 yo female survival rate in second class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 4) & (classes.Second == 1)]["Survived"]
print(f"41-50 yo female survival rate in second class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 5) & (classes.Second == 1)]["Survived"]
print(f"51-60 yo female survival rate in second class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

# this group turned out to be empty: the computation below fails with a division by zero
# because the training set has no second-class women aged 61+, so 0.0% (0) is printed by hand
# femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 6) & (classes.Second == 1)]["Survived"]
# print(f"test: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% {len(femaleAgePclass)}")  # ERROR: DIVISION BY ZERO
print("61+ yo female survival rate in second class: 0.0% (0)")

0-12 yo female survival rate in second class: 100.0% (8)
13-19 yo female survival rate in second class: 100.0% (8)
20-30 yo female survival rate in second class: 90.0% (30)
31-40 yo female survival rate in second class: 94.11764705882352% (17)
41-50 yo female survival rate in second class: 90.0% (10)
51-60 yo female survival rate in second class: 66.66666666666666% (3)
61+ yo female survival rate in second class: 0.0% (0)
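
The hand-written `0.0% (0)` line exists because `sum(x)/len(x)` raises `ZeroDivisionError` on an empty selection. One way to avoid hard-coding it is a small helper that reports an undefined rate as NaN. The function name `survival_rate` and the demo frame are hypothetical, and the demo filters on a plain `Pclass` column rather than the notebook's `classes` dummies.

```python
import math
import pandas as pd

def survival_rate(survived: pd.Series) -> float:
    """Percentage of survivors, or NaN when the group has no passengers."""
    if len(survived) == 0:
        return math.nan  # 0 survivors out of 0 passengers: the rate is undefined
    return survived.mean() * 100

# Made-up second-class passengers; Age holds the 0..6 category codes
demo = pd.DataFrame({
    'Sex':      [1, 1, 1, 0],
    'Age':      [2, 2, 5, 6],
    'Pclass':   [2, 2, 2, 2],
    'Survived': [1, 0, 1, 0],
})

for code, label in [(2, '20-30'), (5, '51-60'), (6, '61+')]:
    group = demo.loc[(demo.Sex == 1) & (demo.Age == code) & (demo.Pclass == 2), 'Survived']
    print(f"{label} yo female survival rate in second class: {survival_rate(group)}% ({len(group)})")
```

Returning NaN rather than 0.0 keeps an empty group distinguishable from a group in which nobody survived.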


In [143]:
# survival rate of third-class women, by age group
femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 0) & (classes.Third == 1)]["Survived"]
print(f"0-12 yo female survival rate in third class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 1) & (classes.Third == 1)]["Survived"]
print(f"13-19 yo female survival rate in third class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 2) & (classes.Third == 1)]["Survived"]
print(f"20-30 yo female survival rate in third class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 3) & (classes.Third == 1)]["Survived"]
print(f"31-40 yo female survival rate in third class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 4) & (classes.Third == 1)]["Survived"]
print(f"41-50 yo female survival rate in third class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

# this group turned out to be empty: the computation below fails with a division by zero
# because the training set has no third-class women aged 51-60, so 0.0% (0) is printed by hand
# femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 5) & (classes.Third == 1)]["Survived"]
# print(f"test: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% {len(femaleAgePclass)}")
print("51-60 yo female survival rate in third class: 0.0% (0)")

femaleAgePclass = train.loc[(train.Sex == 1) & (train.Age == 6) & (classes.Third == 1)]["Survived"]
print(f"61+ yo female survival rate in third class: {(sum(femaleAgePclass)/len(femaleAgePclass))*100}% ({len(femaleAgePclass)})")

0-12 yo female survival rate in third class: 47.82608695652174% (23)
13-19 yo female survival rate in third class: 59.09090909090909% (22)
20-30 yo female survival rate in third class: 53.246753246753244% (77)
31-40 yo female survival rate in third class: 46.15384615384615% (13)
41-50 yo female survival rate in third class: 0.0% (8)
51-60 yo female survival rate in third class: 0.0% (0)
61+ yo female survival rate in third class: 100.0% (1)
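
The three cells above repeat the same filter for every class and age group. A more compact sketch, assuming the class is read from a `Pclass` column directly, groups on both keys at once. The demo rows are made up and `summary`, `rate` and `n` are names chosen here.

```python
import pandas as pd

# Made-up passengers; Sex: 0 = male, 1 = female; Age holds the 0..6 category codes
demo = pd.DataFrame({
    'Sex':      [1, 1, 1, 1, 1, 1, 0],
    'Pclass':   [1, 1, 2, 3, 3, 3, 3],
    'Age':      [2, 2, 0, 0, 2, 2, 2],
    'Survived': [1, 1, 1, 0, 1, 0, 0],
})

women = demo[demo['Sex'] == 1]

# One row per (class, age group) that actually occurs; empty groups simply do not appear
summary = women.groupby(['Pclass', 'Age'])['Survived'].agg(rate='mean', n='count')
summary['rate'] = summary['rate'] * 100
print(summary)
```

Since empty combinations are absent from the result, no division by zero can occur and no placeholder line is needed.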
