## 1 Chapter 1: Loading Data and Initial Inspection

### 1.1 Loading the Data
Download the dataset from https://www.kaggle.com/c/titanic/overview

#### 1.1.1 Import numpy and pandas

In [1]:
import numpy as np
import pandas as pd

#### 1.1.2 Load the data
(1) Load the data using a relative path  
(2) Load the data using an absolute path

In [2]:
# Relative path
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [18]:
# Absolute path
import os
path = os.getcwd()
csv_path = os.path.join(path, 'train.csv')  # portable, unlike a hard-coded '\\' separator
df = pd.read_csv(csv_path)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


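【Hint】If loading with a relative path raises an error, use os.getcwd() to check the current working directory.  
【Think】Now that you know how to load data, compare pd.read_csv() and pd.read_table(). What would you need to do to make them behave the same? Look into the difference between '.tsv' and '.csv' files; how would you load each?  
【Summary】Loading data is the first step of any analysis. We will meet different data formats (e.g. .csv, .tsv, .xlsx), but the approach to loading them is the same. When you hit an unfamiliar problem in later work and projects, look it up, use Google, understand the business logic, and be clear about what the inputs and outputs are.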
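
As a small illustration of the hint above, the sketch below builds an absolute path from the current working directory and checks that the file exists before reading it. It uses `pathlib` rather than `os`, which is a substitution on my part, and it assumes `train.csv` sits in the working directory; adjust the file name and location to your own setup.

```python
from pathlib import Path

import pandas as pd

# Assumption: train.csv lives in the current working directory.
csv_path = Path.cwd() / "train.csv"

print(csv_path)           # the absolute path pandas will be asked to open
print(csv_path.exists())  # False means pd.read_csv('train.csv') would fail as well

# Only read the file when it is really there.
if csv_path.exists():
    df = pd.read_csv(csv_path)
    print(df.shape)
```

If `exists()` prints `False`, either move the file into the printed directory or pass the full path to the file instead.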

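### 【Think】
**1. Differences between read_csv() and read_table()**

Both load delimited text data.

The default separator of `pandas.read_csv` is [`,`](https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L543), while the default separator of `pandas.read_table` is [`\t`](https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L701).

As a result, when a comma-separated file is read, `pandas.read_csv` puts each field in its own column, whereas `pandas.read_table` finds no tabs to split on and puts each whole line into a single column.

Setting the separator, as in `pandas.read_table('train.csv', sep=',')`, gives the same result as `pandas.read_csv`.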

In [48]:
# 1. read_csv
df = pd.read_csv('train.csv')

# One column per field (sep=',')
df.head(3).to_numpy()

array([[1, 0, 3, 'Braund, Mr. Owen Harris', 'male', 22.0, 1, 0,
        'A/5 21171', 7.25, nan, 'S'],
       [2, 1, 1, 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
        'female', 38.0, 1, 0, 'PC 17599', 71.2833, 'C85', 'C'],
       [3, 1, 3, 'Heikkinen, Miss. Laina', 'female', 26.0, 0, 0,
        'STON/O2. 3101282', 7.925, nan, 'S']], dtype=object)

In [52]:
df.head(3).to_numpy().shape

(3, 12)

In [49]:
# 2. read_table
df_txt = pd.read_table('train.csv')

# One column per line (sep='\t')
df_txt.head(3).to_numpy()

array([['1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S'],
       ['2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C'],
       ['3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S']],
      dtype=object)

In [51]:
df_txt.head(3).to_numpy().shape

(3, 1)

In [50]:
# Pass sep to read_table so that it behaves like read_csv
df_txt_sep = pd.read_table('train.csv', sep=',')
df_txt_sep.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


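**2. Differences between `.tsv` and `.csv`, and how to load each**

`tsv` is a variant of `csv` in which the fields of each record are separated by tab characters.

`tsv` => `pandas.read_csv('train.tsv', sep='\t')`, i.e. set the separator explicitly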
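
To make the comparison concrete, here is a minimal sketch that parses the same two records once as comma-separated and once as tab-separated text. The sample data is made up for illustration (it is not the Titanic file), and `io.StringIO` stands in for real files on disk.

```python
import io

import pandas as pd

# Made-up sample records, identical apart from the delimiter.
csv_text = "id,name\n1,Ann\n2,Bob\n"
tsv_text = "id\tname\n1\tAnn\n2\tBob\n"

df_csv = pd.read_csv(io.StringIO(csv_text))            # default sep=','
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # tab-separated data via read_csv
df_tab = pd.read_table(io.StringIO(tsv_text))          # default sep='\t'

# All three frames should hold the same two rows and two columns.
print(df_csv.equals(df_tsv), df_tsv.equals(df_tab))
```

The function name matters less than the `sep` argument: either function can read either format once the separator matches the file.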




#### 1.1.3 Task 3: Read the data in chunks of 1000 rows

In [25]:
data = pd.read_csv('train.csv', chunksize=1000)

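【Think】What is chunked reading, and why read in chunks?

Passing the `chunksize` parameter to `pandas.read_csv` returns an iterator over DataFrames instead of a single DataFrame.

Each DataFrame it yields holds the next chunk of the file, with at most `chunksize` rows.

When the data file is very large, this loads only part of it into memory at a time, which reduces memory usage.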
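
A minimal sketch of consuming such an iterator is shown below. The ten-row input and `chunksize=4` are made up so the example runs without the Titanic file; the idea is the same with `chunksize=1000`.

```python
import io

import pandas as pd

# Made-up 10-row file standing in for a large CSV.
csv_text = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"

total = 0
n_chunks = 0
# chunksize=4 is arbitrary here; each chunk is an ordinary DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()  # aggregate per chunk instead of loading everything
    n_chunks += 1

print(n_chunks, total)  # 3 chunks (4 + 4 + 2 rows), sum of 0..9 is 45
```

Only one chunk is held in memory at a time, so running totals like this work even when the whole file would not fit.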

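#### 1.1.4 Task 4: Change the column headers to Chinese and use the passenger ID as the index [for English-language material, translating the headers can make the data more intuitive to work with]
PassengerId => passenger ID  
Survived    => whether the passenger survived  
Pclass      => passenger class (1st/2nd/3rd)  
Name        => passenger name  
Sex         => sex  
Age         => age  
SibSp       => number of siblings/spouses aboard  
Parch       => number of parents/children aboard  
Ticket      => ticket information  
Fare        => fare  
Cabin       => cabin  
Embarked    => port of embarkation  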

In [57]:
# Replace the headers after reading
df = pd.read_csv('train.csv')
df.columns = ['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐 妹个数','父母子女个数','船票信息','票价','客舱','登船港口']
df.head()

Unnamed: 0,乘客ID,是否幸存,仓位等级,姓名,性别,年龄,兄弟姐 妹个数,父母子女个数,船票信息,票价,客舱,登船港口
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# Replace the original headers while reading
df = pd.read_csv('train.csv', 
                 header=0,
                 index_col='乘客ID',
                 names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐 妹个数','父母子女个数','船票信息','票价','客舱','登船港口'])
df.head()

Unnamed: 0_level_0,是否幸存,仓位等级,姓名,性别,年龄,兄弟姐 妹个数,父母子女个数,船票信息,票价,客舱,登船港口
乘客ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


【Think】One way to change the headers to Chinese is to replace the English headers with Chinese ones. Are there other ways?
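
One possible alternative, sketched below, is to rename columns through a mapping with `DataFrame.rename` and then set the index with `set_index`. The two-column frame is made up for illustration, and the mapping is deliberately partial; with the real data you would list every column you want to rename.

```python
import pandas as pd

# Made-up frame; only the column names mirror the Titanic data.
df_demo = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

# Columns missing from the mapping would simply keep their original names.
mapping = {"PassengerId": "乘客ID", "Survived": "是否幸存"}
df_cn = df_demo.rename(columns=mapping).set_index("乘客ID")

print(df_cn.columns.tolist())
print(df_cn.index.name)
```

Compared with assigning a full list to `df.columns`, a mapping does not depend on column order, and `rename` returns a new frame rather than modifying the original.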

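### 1.2 Initial Inspection
After importing the data, you will usually want an overview of its overall structure and a few sample rows: for example, the size of the data, how many columns there are, the type of each column, and whether it contains nulls.

#### 1.2.1 Task 1: View basic information about the data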

In [59]:
# Basic information about this DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
是否幸存       891 non-null int64
仓位等级       891 non-null int64
姓名         891 non-null object
性别         891 non-null object
年龄         714 non-null float64
兄弟姐 妹个数    891 non-null int64
父母子女个数     891 non-null int64
船票信息       891 non-null object
票价         891 non-null float64
客舱         204 non-null object
登船港口       889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [60]:
# Summary statistics
df.describe()

Unnamed: 0,是否幸存,仓位等级,年龄,兄弟姐 妹个数,父母子女个数,票价
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


【Hint】Several functions can do this; try summarizing them yourself.

[DataFrame API reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
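
As a starting point for that summary, the sketch below lists a few other standard attributes and methods that give a quick overview. The small frame is made up so the example runs on its own; this is a partial list, not a complete one.

```python
import pandas as pd

# Made-up frame with one numeric and one text column.
df_demo = pd.DataFrame({
    "age": [22.0, None, 26.0],
    "sex": ["male", "female", "female"],
})

print(df_demo.shape)                    # (rows, columns)
print(df_demo.columns.tolist())         # column names
print(df_demo.dtypes)                   # type of each column
print(df_demo.describe(include="all"))  # statistics for text columns as well
```

`describe()` on its own summarizes only numeric columns; `include="all"` adds counts and most frequent values for the others.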


#### 1.2.2 Task 2: View the first 10 rows and the last 15 rows of the table

In [61]:
df.head(10)

Unnamed: 0_level_0,是否幸存,仓位等级,姓名,性别,年龄,兄弟姐 妹个数,父母子女个数,船票信息,票价,客舱,登船港口
乘客ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [62]:
df.tail(15)

Unnamed: 0_level_0,是否幸存,仓位等级,姓名,性别,年龄,兄弟姐 妹个数,父母子女个数,船票信息,票价,客舱,登船港口
乘客ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
877,0,3,"Gustafsson, Mr. Alfred Ossian",male,20.0,0,0,7534,9.8458,,S
878,0,3,"Petroff, Mr. Nedelio",male,19.0,0,0,349212,7.8958,,S
879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S
880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0,,S
882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S
883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5,,S
885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.05,,S
886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.125,,Q


#### 1.2.3 Task 3: Check whether the data is null, returning True where a value is null and False elsewhere

In [63]:
df.isnull().head(3)

Unnamed: 0_level_0,是否幸存,仓位等级,姓名,性别,年龄,兄弟姐 妹个数,父母子女个数,船票信息,票价,客舱,登船港口
乘客ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,False,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,True,False


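【Summary】All of the operations above are ways of inspecting the data itself, which is part of any data analysis.

【Think】From what other angles can a dataset be inspected? Look for answers; they will help a great deal with the analysis that follows.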
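
A few further angles, offered as suggestions rather than a complete answer, are missing-value counts per column, the frequency of each category, and the number of distinct values. The frame below is made up for illustration.

```python
import pandas as pd

# Made-up frame with some missing values.
df_demo = pd.DataFrame({
    "age": [22.0, None, 26.0, None],
    "port": ["S", "C", "S", None],
})

missing_per_column = df_demo.isnull().sum()   # number of nulls in each column
port_counts = df_demo["port"].value_counts()  # frequency of each category, nulls excluded
n_unique = df_demo.nunique()                  # distinct non-null values per column

print(missing_per_column)
print(port_counts)
print(n_unique)
```

Summing the boolean frame from `isnull()` turns the cell-by-cell True/False view into a compact per-column count.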

### 1.3 Saving the Data

#### 1.3.1 Task 1: Save the data you loaded and modified as a new file, train_chinese.csv, in the working directory

In [4]:
# Chinese text appears garbled otherwise => set the encoding explicitly
df.to_csv('train_chinese.csv',encoding="gb2312")
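
`gb2312` covers a limited character set, so saving can raise a `UnicodeEncodeError` if the data contains characters outside it. A commonly used alternative is `utf-8-sig`, which writes a byte-order mark that spreadsheet programs such as Excel typically use to recognize UTF-8. The sketch below is a suggestion, not part of the original exercise; the frame and the file name `train_chinese_demo.csv` are made up, and the file goes to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with Chinese headers and a named index, like the one in this chapter.
df_demo = pd.DataFrame({"是否幸存": [0, 1]}, index=pd.Index([1, 2], name="乘客ID"))

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "train_chinese_demo.csv")  # hypothetical file name
    df_demo.to_csv(out_path, encoding="utf-8-sig")
    # Read the file back with the same encoding to confirm the headers survive.
    df_back = pd.read_csv(out_path, encoding="utf-8-sig", index_col="乘客ID")

print(df_back)
```

Whichever encoding you choose, pass the same one to `pd.read_csv` when loading the file again.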

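【Summary】That covers loading data and getting started. Next we will work with operations on the data itself, focusing on how numpy and pandas are used in real work and project settings.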