# Training and Prediction with a Logistic Regression Model of Titanic Survival

## I. Data Source and Description

### (i) Data source

The data was published on Kaggle and is used here to analyze passenger survival records from the Titanic.

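### (ii) Data description

### Meaning of each column:
PassengerId: passenger ID<br>
Survived: survival status (0 = did not survive, 1 = survived)<br>
Pclass: ticket class (1 = first, 2 = second, 3 = third)<br>
Name: passenger name<br>
Sex: passenger sex<br>
Age: passenger age<br>
SibSp: number of siblings and spouses aboard<br>
Parch: number of parents and children aboard<br>
Ticket: ticket number<br>
Fare: ticket fare<br>
Cabin: cabin number<br>
Embarked: port of embarkation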

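## II. Analysis Goal

A logistic regression model is fitted to the training data to predict the probability of survival on the Titanic, and the fitted model is interpreted. The model is then applied to the test data.

## III. Analysis Workflow

### (i) Loading and cleaning the data

#### 1. Loading the data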

In [1]:
import pandas as pd

In [2]:
survival = pd.read_csv(r"C:\Users\52699\Desktop\数据集\泰坦尼克号生存数据\train.csv")

In [3]:
survival

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


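#### 2. Cleaning the data

##### (1) Consolidating columns

To keep the model simple, this step drops PassengerId, Name, Ticket, Fare, Cabin, and Embarked, which this analysis treats as uninformative for survival, and combines SibSp and Parch into a single FamilyNum column (number of family members aboard).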

In [4]:
survival_train = survival.copy()

In [5]:
survival_train = survival_train.drop(['PassengerId','Name','Ticket','Fare','Cabin','Embarked'],axis = 1)

In [6]:
survival_train['FamilyNum'] = survival_train['SibSp'] + survival_train['Parch']

In [7]:
survival_train = survival_train.drop(['SibSp','Parch'],axis = 1)

In [8]:
survival_train

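| | Survived | Pclass | Sex | Age | FamilyNum |
|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 |
| 1 | 1 | 1 | female | 38.0 | 1 |
| 2 | 1 | 3 | female | 26.0 | 0 |
| 3 | 1 | 1 | female | 35.0 | 1 |
| 4 | 0 | 3 | male | 35.0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | male | 27.0 | 0 |
| 887 | 1 | 1 | female | 19.0 | 0 |
| 888 | 0 | 3 | female | NaN | 3 |
| 889 | 1 | 1 | male | 26.0 | 0 |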


##### (2) Handling missing values

In [9]:
survival_train.isnull().sum()

Survived       0
Pclass         0
Sex            0
Age          177
FamilyNum      0
dtype: int64

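The check shows 177 missing values in the Age column of the training data. Missing values would interfere with fitting the model, so they are filled with the mean of the observed ages.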

In [10]:
survival_train = survival_train.fillna(survival_train['Age'].mean())
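The call above applies `fillna` to the whole DataFrame, which is safe here only because Age is the only column with missing values. A minimal sketch of a column-scoped variant, on a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up data: one missing age, and no other missing values.
toy = pd.DataFrame({
    "Age": [20.0, None, 40.0],
    "Sex": ["male", "female", "male"],
})

# Series.mean() skips missing values by default, so this is the mean of 20 and 40.
age_mean = toy["Age"].mean()

# Fill only the Age column, leaving every other column untouched.
toy["Age"] = toy["Age"].fillna(age_mean)

print(toy)
```

Scoping the fill to one column avoids writing an age value into other columns if they ever contain missing entries.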

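##### (3) Converting data types

Age is stored as float. It is converted to int here so that ages are whole years; note that `astype(int)` drops the fractional part rather than rounding.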

In [11]:
survival_train['Age'] = survival_train['Age'].astype(int)
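To make the effect of this conversion concrete, the sketch below compares truncation with rounding on a few made-up ages (the values are illustrative, not rows from the dataset):

```python
import pandas as pd

# Made-up fractional ages.
ages = pd.Series([34.6, 0.92, 29.7])

# astype(int) truncates toward zero: the fractional part is discarded.
truncated = ages.astype(int)

# Rounding first gives the nearest whole year instead.
rounded = ages.round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```

Either choice works for the model; the point is only that the two conversions can differ by a year for the same passenger.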

The cleaned training data:

In [12]:
survival_train.head()

| | Survived | Pclass | Sex | Age | FamilyNum |
|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22 | 1 |
| 1 | 1 | 1 | female | 38 | 1 |
| 2 | 1 | 3 | female | 26 | 0 |
| 3 | 1 | 1 | female | 35 | 1 |
| 4 | 0 | 3 | male | 35 | 0 |


### (ii) Building and fitting the model

#### 1. Building and fitting the model

In [13]:
import statsmodels.api as sm

In [14]:
# Create dummy variables for Pclass and Sex; drop_first drops the first level of each as the baseline
survival_train_dummies = pd.get_dummies(data = survival_train,columns = ['Pclass','Sex'],dtype = int,drop_first = True)

In [15]:
# Check the correlations between the explanatory variables
survival_train_dummies.corr()

| | Survived | Age | FamilyNum | Pclass_2 | Pclass_3 | Sex_male |
|---|---|---|---|---|---|---|
| Survived | 1.0 | -0.067809 | 0.016639 | 0.093349 | -0.322308 | -0.543351 |
| Age | -0.067809 | 1.0 | -0.24737 | 0.010199 | -0.285608 | 0.082533 |
| FamilyNum | 0.016639 | -0.24737 | 1.0 | -0.038594 | 0.071142 | -0.200988 |
| Pclass_2 | 0.093349 | 0.010199 | -0.038594 | 1.0 | -0.56521 | -0.064746 |
| Pclass_3 | -0.322308 | -0.285608 | 0.071142 | -0.56521 | 1.0 | 0.137143 |
| Sex_male | -0.543351 | 0.082533 | -0.200988 | -0.064746 | 0.137143 | 1.0 |


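Using the correlation matrix and the rule of thumb that an absolute coefficient above 0.8 indicates strong correlation, no pair of explanatory variables is strongly correlated: the largest in magnitude is -0.565, between Pclass_2 and Pclass_3. The cleaned data therefore shows no sign of serious multicollinearity, and the model can be built and fitted.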
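The threshold check can be written out explicitly. The sketch below copies the pairwise correlations between explanatory variables from the matrix above into a plain dictionary (`pairs` and `THRESHOLD` are names introduced here for illustration) and looks for any pair beyond 0.8 in absolute value:

```python
# Pairwise correlations between explanatory variables, copied from the matrix.
pairs = {
    ("Age", "FamilyNum"): -0.24737,
    ("Age", "Pclass_2"): 0.010199,
    ("Age", "Pclass_3"): -0.285608,
    ("Age", "Sex_male"): 0.082533,
    ("FamilyNum", "Pclass_2"): -0.038594,
    ("FamilyNum", "Pclass_3"): 0.071142,
    ("FamilyNum", "Sex_male"): -0.200988,
    ("Pclass_2", "Pclass_3"): -0.56521,
    ("Pclass_2", "Sex_male"): -0.064746,
    ("Pclass_3", "Sex_male"): 0.137143,
}

THRESHOLD = 0.8  # rule-of-thumb cutoff for "strong" correlation

# Pairs whose absolute correlation exceeds the cutoff.
flagged = [pair for pair, r in pairs.items() if abs(r) > THRESHOLD]

# The most strongly correlated pair, whether or not it is flagged.
strongest = max(pairs, key=lambda pair: abs(pairs[pair]))

print("flagged:", flagged)
print("strongest:", strongest, pairs[strongest])
```

Pairwise correlations only detect collinearity between two variables at a time, so this is a screening step rather than a complete diagnostic.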

In [16]:
# Separate the explanatory variables from the response variable
X = survival_train_dummies.drop('Survived',axis = 1)
Y = survival_train_dummies['Survived']

In [17]:
# Add an intercept column, then build and fit the model
X_with_constant = sm.add_constant(X)
model_result = sm.Logit(Y,X_with_constant).fit()
model_result.summary()

Optimization terminated successfully.
         Current function value: 0.444686
         Iterations 6


| | | | |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 891 |
| Model: | Logit | Df Residuals: | 885 |
| Method: | MLE | Df Model: | 5 |
| Date: | Thu, 22 Aug 2024 | Pseudo R-squ.: | 0.3322 |
| Time: | 17:33:37 | Log-Likelihood: | -396.22 |
| converged: | True | LL-Null: | -593.33 |
| Covariance Type: | nonrobust | LLR p-value: | 5.211e-83 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.0512 | 0.403 | 10.058 | 0.000 | 3.262 | 4.841 |
| Age | -0.0394 | 0.008 | -5.055 | 0.000 | -0.055 | -0.024 |
| FamilyNum | -0.2176 | 0.064 | -3.373 | 0.001 | -0.344 | -0.091 |
| Pclass_2 | -1.1766 | 0.261 | -4.509 | 0.000 | -1.688 | -0.665 |
| Pclass_3 | -2.3480 | 0.243 | -9.677 | 0.000 | -2.824 | -1.872 |
| Sex_male | -2.7851 | 0.198 | -14.072 | 0.000 | -3.173 | -2.397 |
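The Pseudo R-squ. value in the summary is McFadden's pseudo R², defined as one minus the ratio of the fitted model's log-likelihood to that of the intercept-only model. As a quick check, it can be recomputed from the two log-likelihoods reported in the summary (the variable names below are introduced for illustration):

```python
# Values copied from the model summary.
log_likelihood = -396.22        # Log-Likelihood of the fitted model
log_likelihood_null = -593.33   # LL-Null: intercept-only model

# McFadden's pseudo R-squared.
pseudo_r2 = 1 - log_likelihood / log_likelihood_null

print(round(pseudo_r2, 4))
```

This statistic measures improvement in log-likelihood over the null model and is not the share of variance explained, so it should not be read like the R² of a linear regression.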


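#### 2. Interpreting the model results

The fit gives a pseudo R² of 0.3322. This is McFadden's pseudo R², not the coefficient of determination of a linear regression, and a value of this size indicates a reasonably good fit for a logistic model. The p-values of all coefficients are below 0.05, so each explanatory variable has a statistically significant effect on the response. Because a logistic regression is a linear model passed through a sigmoid function, each coefficient is a change in the log-odds of survival and must be exponentiated into an odds ratio before it can be interpreted.

The calculations follow: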

In [18]:
import numpy as np

In [19]:
# Age
np.exp(-0.0394)

0.9613660857925073

In [20]:
# FamilyNum
np.exp(-0.2176)

0.8044471561818398

In [21]:
# Pclass_2
np.exp(-1.1766)

0.3083252643980769

In [22]:
# Pclass_3
np.exp(-2.348)

0.09556009140552765

In [23]:
# Sex_male
np.exp(-2.7851)

0.06172291643069087
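The same five exponentiations can be done in one pass, together with the implied percentage change in the odds of survival. This is a sketch using the rounded coefficients from the summary table; the `coefficients` dictionary is introduced here for illustration:

```python
import math

# Coefficients copied (rounded) from the model summary.
coefficients = {
    "Age": -0.0394,
    "FamilyNum": -0.2176,
    "Pclass_2": -1.1766,
    "Pclass_3": -2.3480,
    "Sex_male": -2.7851,
}

# exp(coefficient) is the odds ratio for a one-unit increase in the variable.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

# (odds ratio - 1) * 100 is the percentage change in the odds.
percent_change = {name: (ratio - 1) * 100 for name, ratio in odds_ratios.items()}

for name in coefficients:
    print(f"{name:10s} odds ratio = {odds_ratios[name]:.3f}, "
          f"change in odds = {percent_change[name]:+.1f}%")
```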

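Conclusion: age, number of family members, ticket class, and sex all have a statistically significant effect on survival. The figures below are changes in the odds of survival, holding the other variables fixed; they are not changes in the probability itself.<br>
Each additional year of age lowers the odds of survival by about (1 - 0.96) = 4%.<br>
Each additional family member lowers the odds of survival by about (1 - 0.80) = 20%.<br>
Compared with first-class passengers, the odds of survival are about (1 - 0.31) = 69% lower in second class and about (1 - 0.096) = 90% lower in third class.<br>
The odds of survival for men are about (1 - 0.06) = 94% lower than for women.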
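Because the coefficients act on the log-odds, the change in survival probability produced by a given change in a variable depends on the passenger's other characteristics. The sketch below applies the sigmoid function by hand using the rounded coefficients from the summary table; `predict_probability` and the example passengers are hypothetical and introduced only for illustration:

```python
import math

# Coefficients copied (rounded) from the model summary.
COEF = {
    "const": 4.0512,
    "Age": -0.0394,
    "FamilyNum": -0.2176,
    "Pclass_2": -1.1766,
    "Pclass_3": -2.3480,
    "Sex_male": -2.7851,
}

def predict_probability(age, family_num, pclass, sex):
    """Survival probability from the linear predictor and the sigmoid."""
    log_odds = COEF["const"] + COEF["Age"] * age + COEF["FamilyNum"] * family_num
    if pclass == 2:
        log_odds += COEF["Pclass_2"]
    elif pclass == 3:
        log_odds += COEF["Pclass_3"]
    if sex == "male":
        log_odds += COEF["Sex_male"]
    return 1.0 / (1.0 + math.exp(-log_odds))

# Two hypothetical 30-year-old first-class passengers travelling alone.
woman_first = predict_probability(30, 0, 1, "female")
man_first = predict_probability(30, 0, 1, "male")

# Drop in probability when each gains one family member aboard.
drop_woman = woman_first - predict_probability(30, 1, 1, "female")
drop_man = man_first - predict_probability(30, 1, 1, "male")

print(f"woman: p = {woman_first:.3f}, drop = {drop_woman:.3f}")
print(f"man:   p = {man_first:.3f}, drop = {drop_man:.3f}")
```

The odds ratio for one extra family member is the same for both passengers, yet the probability falls by a different amount in each case, which is why the coefficients are reported as changes in odds.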

### (iii) Prediction

#### 1. Loading the test data and cleaning it

In [24]:
survival_test = pd.read_csv(r"C:\Users\52699\Desktop\数据集\泰坦尼克号生存数据\test.csv")

In [25]:
survival_test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [26]:
survival_test['FamilyNum'] = survival_test['SibSp'] + survival_test['Parch']

In [27]:
survival_test = survival_test.drop(['PassengerId','Name','SibSp','Parch','Ticket','Fare','Cabin','Embarked'],axis = 1)

In [28]:
survival_test.isnull().sum()

Pclass        0
Sex           0
Age          86
FamilyNum     0
dtype: int64

In [29]:
survival_test = survival_test.fillna(survival_test['Age'].mean())

In [30]:
survival_test['Age'] = survival_test['Age'].astype('int')

In [31]:
survival_test.head()

| | Pclass | Sex | Age | FamilyNum |
|---|---|---|---|---|
| 0 | 3 | male | 34 | 0 |
| 1 | 3 | female | 47 | 1 |
| 2 | 2 | male | 62 | 0 |
| 3 | 3 | male | 27 | 0 |
| 4 | 3 | female | 22 | 2 |


#### 2. Predicting on the test data

In [32]:
# Convert Pclass to str so its values match string category labels
survival_test['Pclass'] = survival_test['Pclass'].astype(str)
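When encoding a test set, the dummy columns need to match the ones the model was fitted on, even if some category never appears in the test rows. A minimal sketch on made-up data (the `toy` frame is illustrative, not from the dataset) shows how declaring the full category list with `pd.Categorical` keeps the columns stable:

```python
import pandas as pd

# Made-up test-like data with no first-class passengers and no women.
toy = pd.DataFrame({
    "Pclass": ["3", "2"],
    "Sex": ["male", "male"],
    "Age": [34, 62],
})

# Naive encoding: columns depend on which values happen to be present.
naive = pd.get_dummies(data=toy, columns=["Pclass", "Sex"], dtype=int, drop_first=True)

# Declaring every possible category first makes the columns independent of the data.
fixed = toy.copy()
fixed["Pclass"] = pd.Categorical(fixed["Pclass"], categories=["1", "2", "3"])
fixed["Sex"] = pd.Categorical(fixed["Sex"], categories=["female", "male"])
fixed = pd.get_dummies(data=fixed, columns=["Pclass", "Sex"], dtype=int, drop_first=True)

print(list(naive.columns))
print(list(fixed.columns))
```

With fixed categories, `drop_first=True` always drops the same baseline levels (`'1'` and `'female'`), so the encoded test columns line up with the training design matrix.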

In [33]:
# Create dummy variables, with fixed categories so the columns match the training data
survival_test['Pclass'] = pd.Categorical(survival_test['Pclass'],categories = ['1','2','3'])
survival_test['Sex'] = pd.Categorical(survival_test['Sex'],categories = ['female','male'])
survival_test_dummies = pd.get_dummies(data = survival_test, columns = ['Pclass','Sex'],dtype = int,drop_first = True)

In [34]:
# Build the predictor matrix from the encoded test data
X_test = survival_test_dummies

In [35]:
# Add the intercept column and predict with the fitted model
X_test_constant = sm.add_constant(X_test)
predict_value = model_result.predict(X_test_constant)

In [36]:
predict_value

In [37]:
# Attach the predicted probabilities to the test data
survival_test['Survival probability'] = predict_value
survival_test

The output recorded for these two cells is not reproduced. It had Length: 891 because the recorded run passed the training matrix `X_with_constant` to `predict`, so its values were fitted probabilities for the training rows rather than predictions for the test data. With `X_test_constant`, `predict` returns one survival probability per row of the test data.