In [23]:
import pandas as pd
import matplotlib.pyplot as plt
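# SimHei provides the CJK glyphs needed to render the Chinese column names in plot labels
plt.rcParams['font.sans-serif'] = ['SimHei']
# Draw the minus sign as an ASCII hyphen, which SimHei can display
plt.rcParams['axes.unicode_minus'] = False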

In [10]:
train_data = pd.read_csv('data/train.csv',names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],index_col='乘客ID',header=0)

In [11]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   是否幸存    891 non-null    int64  
 1   仓位等级    891 non-null    int64  
 2   姓名      891 non-null    object 
 3   性别      891 non-null    object 
 4   年龄      714 non-null    float64
 5   兄弟姐妹个数  891 non-null    int64  
 6   父母子女个数  891 non-null    int64  
 7   船票信息    891 non-null    object 
 8   票价      891 non-null    float64
 9   客舱      204 non-null    object 
 10  登船港口    889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


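The training data has 891 rows and 11 columns: 1 dependent variable (`是否幸存`, survived, with no missing values) and 10 independent variables. Of these, `年龄` (age), `客舱` (cabin), and `登船港口` (port of embarkation) contain missing values.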
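
The missing-value counts read off the `info()` output can also be computed directly. A minimal sketch, assuming a small hand-built frame `toy` (hypothetical, standing in for `train_data`, with illustrative values only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_data; values are illustrative only
toy = pd.DataFrame({
    '是否幸存': [0, 1, 1, 0],
    '年龄': [22.0, np.nan, 26.0, 35.0],
    '客舱': [None, 'C85', None, None],
    '登船港口': ['S', 'C', None, 'S'],
})

# Count missing entries per column, then express them as a share of all rows
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

print(pd.DataFrame({'missing': missing_count, 'share': missing_share}))
```

On the real data, `train_data.isna().sum()` should report 177, 687, and 2 missing values for `年龄`, `客舱`, and `登船港口`, which is what the non-null counts imply (891 - 714, 891 - 204, 891 - 889).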

In [4]:
train_data.head()

乘客ID,是否幸存,仓位等级,姓名,性别,年龄,兄弟姐妹个数,父母子女个数,船票信息,票价,客舱,登船港口
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### I. Descriptive Statistics

In [5]:
train_data.describe()

,是否幸存,仓位等级,年龄,兄弟姐妹个数,父母子女个数,票价
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


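1. `是否幸存` (survived): a 0/1 label rather than a quantity, so **change the variable type**.
2. `仓位等级` (passenger class): at first glance it takes the values 1, 2, and 3, which represent class tiers; change the variable type, then inspect its distribution.
3. `年龄` (age): **has missing values**. The distribution is slightly right-skewed (mean 29.7 vs. median 28), with younger passengers in the majority. The minimum is greater than 0, and no outliers have been found so far.
4. `兄弟姐妹个数` (number of siblings): most passengers have none; the maximum is 8.
5. `父母子女个数` (number of parents and children): the large majority are 0; the maximum is 6.
6. `票价` (fare): the maximum fare (512.33) is far above most fares (75% are at or below 31.0). One would expect roughly one price per cabin class, yet there are many distinct fares here.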
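
The fare question in item 6 can be checked by counting distinct fares within each class. A minimal sketch, using only the five rows printed by `head()` as a stand-in for `train_data` (the names `sample` and `fare_by_class` are hypothetical):

```python
import pandas as pd

# The five rows printed by head(), used as a stand-in for the full train_data
sample = pd.DataFrame({
    '仓位等级': [3, 1, 3, 1, 3],
    '票价': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Number of distinct fares and the fare range within each class
fare_by_class = sample.groupby('仓位等级')['票价'].agg(['nunique', 'min', 'max'])
print(fare_by_class)
```

Even in these five rows each class has several distinct fares, so the fare is not fixed per class. Running the same `groupby` on `train_data` would show how wide the spread is in the full data.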

Change the types of the `是否幸存` and `仓位等级` variables:

In [14]:
train_data['是否幸存'] = train_data['是否幸存'].astype('object')
train_data['仓位等级'] = train_data['仓位等级'].astype('object')

In [16]:
train_data.describe(include=['O'])

,是否幸存,仓位等级,姓名,性别,船票信息,客舱,登船港口
count,891,891,891,891,891,204,889
unique,2,3,891,2,681,147,3
top,0,3,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,549,491,1,577,7,4,644


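1. `是否幸存` (survived): 0 accounts for about 62% of passengers (549 of 891).
2. `仓位等级` (passenger class): 3 accounts for about 55%, which suggests that 3 is the lowest class.
3. `姓名` (name): every value is unique, so it is **excluded from modeling**.
4. `性别` (sex): male accounts for about 65%.
5. `船票信息` (ticket): there are many distinct values (681, about 76% of all rows) and the content is messy with no clear pattern, so it is **not considered for modeling**.
6. `客舱` (cabin): many values are missing; among the 204 non-null records, the 147 distinct values amount to about 72%, and the content shows no pattern, so it is **not considered for modeling**.
7. `登船港口` (port of embarkation): S accounts for about 72% of the non-null records.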
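
These percentages follow from the `count`, `unique`, and `freq` rows of the `describe(include=['O'])` output. As a quick check, the sketch below recomputes them from those printed numbers (the dictionary `summary` is a hand-copied illustration, not part of the notebook):

```python
# (numerator, denominator) pairs copied from the describe(include=['O']) output
summary = {
    '是否幸存=0': (549, 891),            # freq / count
    '仓位等级=3': (491, 891),            # freq / count
    '性别=male': (577, 891),             # freq / count
    '船票信息 unique': (681, 891),       # unique / count
    '客舱 unique (non-null)': (147, 204),  # unique / non-null count
    '登船港口=S': (644, 889),            # freq / non-null count
}

# Convert each ratio to a percentage rounded to one decimal place
shares = {name: round(100 * num / den, 1) for name, (num, den) in summary.items()}

for name, pct in shares.items():
    print(f'{name}: {pct}%')
```

On the DataFrame itself, `train_data['性别'].value_counts(normalize=True)` gives the same kind of share for a single column.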

In [21]:
train_data[['仓位等级','船票信息','客舱']].to_csv('data/ticket_info.csv')

### II. Variable Distributions

In [24]:
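# plt.bar needs both x positions and heights, so count passengers per class first
class_counts = train_data['仓位等级'].value_counts().sort_index()
plt.bar(class_counts.index.astype(str), class_counts.values)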

In [17]:
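X_train = train_data[['仓位等级','兄弟姐妹个数','父母子女个数','票价']]
# 是否幸存 was cast to object earlier; scikit-learn cannot infer the label type
# from an object column, so cast it back to int before fitting
y_train = train_data['是否幸存'].astype('int')

In [18]:
from sklearn.linear_model import LogisticRegression
# penalty='none' disables regularization (newer scikit-learn releases spell this penalty=None)
logr = LogisticRegression(penalty='none')
logr.fit(X_train,y_train)
print(logr.coef_,logr.intercept_)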
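
scikit-learn classifiers reject a target whose dtype is `object`, reporting `Unknown label type: 'unknown'`; an `astype('object')` cast on the label column produces exactly such a target. A minimal sketch on hypothetical toy labels, assuming `sklearn.utils.multiclass.type_of_target` is the check involved (it is the helper scikit-learn uses to classify targets, though the exact call path may differ between versions):

```python
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Hypothetical toy labels; the real column is train_data['是否幸存']
labels = pd.Series([0, 1, 1, 0])

as_object = labels.astype('object')  # what an astype('object') cast yields
as_int = as_object.astype('int')     # cast back to integers before fitting

# Ask scikit-learn how it would classify each version of the target
kind_object = type_of_target(as_object)
kind_int = type_of_target(as_int)
print(kind_object, kind_int)
```

Casting the label back to an integer dtype (or never casting it to `object` in the first place) lets the classifier recognise it as a binary target.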