# First Steps in Machine Learning: Data Analysis and Preprocessing

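### This exercise covers:
### 1. Loading data with Python
### 2. Taking a quick look at the data
### 3. Viewing summary statistics
### 4. Handling missing data
### 5. Converting non-numeric data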

## 1. Loading data with Python

First, we learn how to import data with Python.
Two statements are enough:

**☞ Example code:**
```python
import pandas as pd

# For a local file, pass a path instead, e.g. "d:\\titanic_train.csv"
train = pd.read_csv("http://jizhi-10061919.file.myqcloud.com/kaggle_sklearn/titanic_train.csv")
```

**☞ Hands-on practice:**

In [1]:
# Import the pandas library
import pandas as pd

# Read the .csv file with read_csv and assign the result to the variable train
train = pd.read_csv("http://jizhi-10061919.file.myqcloud.com/kaggle_sklearn/titanic_train.csv")
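
If the remote URL is unreachable, the same call also accepts a local path or a file-like object. The sketch below is only an illustration: `sample_csv` is a hypothetical three-row stand-in written by hand, not the real Titanic file, and it is wrapped in `io.StringIO` so the block runs without network access.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics a few Titanic columns
sample_csv = """PassengerId,Survived,Sex,Age
1,0,male,22
2,1,female,38
3,1,female,
"""

# read_csv accepts a URL, a local path, or a file-like object
sample = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)                 # (rows, columns)
print(sample["Age"].isna().sum())   # empty fields are parsed as NaN
```

The empty Age field in the last row is read as NaN, which is how missing values show up later in this exercise.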

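## 2. Taking a quick look at the data
To get a first impression of the data, use the DataFrame's head() and tail() methods, its columns attribute, and its info() method.

**☞ Example code:**
```python
print(train.head(5))     # show the first 5 rows
print(train.tail(5))     # show the last 5 rows
train.columns    # list the column names
train.info()     # show the dtype and non-null count of each column
```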
**☞ Hands-on practice:**

In [3]:
# Show the first 5 rows
train.head(5)


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [4]:
# Display the DataFrame (the first 10 rows are shown below)
train


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
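
The inspection calls from this section can be tried on a small frame without downloading anything. The following is a minimal sketch: `df` is a hypothetical miniature built by hand from the first six rows shown above, and the `isnull().sum()` line is an extra step not used in the original cells, added here as a compact way to count missing values per column.

```python
import pandas as pd

# Hypothetical miniature of the Titanic table (values taken from the first six rows)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
})

print(df.head(2))          # first two rows
print(df.tail(2))          # last two rows
print(list(df.columns))    # column names as a plain list
df.info()                  # dtype and non-null count of each column

# Count missing values per column, a compact complement to info()
missing = df.isnull().sum()
print(missing)
```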


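## 3. Viewing summary statistics
To understand the data further, use the shape attribute and the DataFrame's describe() method.

**☞ Example code:**
```python
train.shape      # dimensions of the dataset as (rows, columns)
train.describe()  # summary statistics of the numeric columns
```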
**☞ Hands-on practice:**

In [5]:
# Check the dimensions of the dataset (number of rows and columns)

# View summary statistics
train.describe()


| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
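
By default describe() summarizes only the numeric columns, which is why Name, Sex and the other text columns are absent from the output above. As a hedged illustration on a hypothetical four-row frame `df` (not the real dataset), passing `include="all"` asks pandas to summarize the non-numeric columns as well:

```python
import pandas as pd

# Hypothetical four-row sample with one numeric and one text column
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "female"],
})

numeric_summary = df.describe()             # numeric columns only (default)
full_summary = df.describe(include="all")   # numeric and non-numeric columns

print(df.shape)
print(numeric_summary)
print(full_summary)
```

For text columns the summary reports count, unique, top and freq instead of mean and quantiles.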


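## 4. Handling missing data
The data still needs cleaning. The describe() output above counts only 714 Age values for 891 passengers, so 177 ages are missing. We should neither bluntly drop every row that has a missing value nor delete the whole column, because the remaining data in those rows still helps train the algorithm. The film Titanic mentions "women and children first", so age matters for this problem.

There are many strategies for handling missing data. A simple one is to call .fillna() on the column and fill the gaps with the column's median.

Select one column of the DataFrame: train["Age"]

Fill with the median: train["Age"].fillna(train["Age"].median())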

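**☞ Example code:**
```python
# Returns a new Series with NaN replaced by the median;
# train itself is unchanged until the result is assigned back
train["Age"].fillna(train["Age"].median())
```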
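
To make the trade-off between dropping and filling concrete, here is a small sketch on a hypothetical six-value series `ages` (invented for illustration, not taken from the dataset). It compares dropna(), which discards the incomplete entries, with the median fill described above.

```python
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([22.0, 38.0, None, 35.0, None, 54.0], name="Age")

dropped = ages.dropna()                # discards the entries with missing ages
filled = ages.fillna(ages.median())    # keeps every entry

print(len(ages), len(dropped), len(filled))
print(ages.median())                   # median() skips NaN by default
```

Dropping shrinks the data from six entries to four, while filling keeps all six, at the cost of inserting an estimated value.
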
**☞ Hands-on practice:**

In [6]:
# Fill the missing values of the Age column with the median (first, inspect the column)
train["Age"]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5       NaN
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17      NaN
18     31.0
19      NaN
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26      NaN
27     19.0
28      NaN
29      NaN
       ... 
861    21.0
862    48.0
863     NaN
864    24.0
865    42.0
866    27.0
867    31.0
868     NaN
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878     NaN
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [13]:
train["Age"].fillna(train["Age"].median())

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5      28.0
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17     28.0
18     31.0
19     28.0
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26     28.0
27     19.0
28     28.0
29     28.0
       ... 
861    21.0
862    48.0
863    28.0
864    24.0
865    42.0
866    27.0
867    31.0
868    28.0
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878    28.0
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888    28.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [15]:
help(train["Age"].fillna)

Help on method fillna in module pandas.core.series:

fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs) method of pandas.core.series.Series instance
    Fill NA/NaN values using the specified method
    
    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame). (values not
        in the dict/Series/DataFrame will not be filled). This value cannot
        be a list.
    method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series
        pad / ffill: propagate last valid observation forward to next valid
        backfill / bfill: use NEXT valid observation to fill gap
    axis : {0, 'index'}
    inplace : boolean, default False
        If True, fill in place. Note: t

In [14]:
train["Age"]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5       NaN
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17      NaN
18     31.0
19      NaN
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26      NaN
27     19.0
28      NaN
29      NaN
       ... 
861    21.0
862    48.0
863     NaN
864    24.0
865    42.0
866    27.0
867    31.0
868     NaN
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878     NaN
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [16]:
train["Age"] = train["Age"].fillna(train["Age"].median())

In [17]:
train["Age"]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5      28.0
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17     28.0
18     31.0
19     28.0
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26     28.0
27     19.0
28     28.0
29     28.0
       ... 
861    21.0
862    48.0
863    28.0
864    24.0
865    42.0
866    27.0
867    31.0
868    28.0
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878    28.0
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888    28.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
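
The cells above show that calling fillna() on its own leaves train["Age"] unchanged (In [14] still contains NaN), because inplace defaults to False and the method returns a new Series. A minimal sketch of the same behaviour, using a hypothetical three-row frame `df`:

```python
import pandas as pd

# Hypothetical frame with one missing age
df = pd.DataFrame({"Age": [22.0, None, 26.0]})

df["Age"].fillna(df["Age"].median())             # result is discarded
still_missing = int(df["Age"].isna().sum())      # the column still has its NaN

df["Age"] = df["Age"].fillna(df["Age"].median())  # assign the result back
missing_after = int(df["Age"].isna().sum())

print(still_missing, missing_after)
print(df["Age"].tolist())
```

Assigning the result back to the column, as In [16] does, is what makes the change stick.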

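## 5. Converting non-numeric data
When we called .describe() earlier, only the numeric columns appeared. Some columns are not numeric (strings, booleans, and so on) and cannot be fed directly to a machine learning algorithm. In this dataset those are Name, Sex, Cabin, Embarked, and Ticket; each must either be dropped or converted to a numeric column.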
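
One way to list the non-numeric columns programmatically, rather than reading them off the describe() output, is select_dtypes(). This is a sketch on a hypothetical two-row frame `df`; the original notebook does not use this call.

```python
import pandas as pd

# Hypothetical two-row sample mixing numeric and text columns
df = pd.DataFrame({
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.2833],
    "Sex": ["male", "female"],
    "Embarked": ["S", "C"],
})

non_numeric = df.select_dtypes(exclude="number").columns.tolist()
numeric = df.select_dtypes(include="number").columns.tolist()

print(non_numeric)   # columns that need dropping or converting
print(numeric)       # columns usable as they are
```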

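(1) Converting the Sex column
The Sex column is not numeric, but sex is highly informative here, so it must be converted to numbers before a machine learning algorithm can use it.

**☞ Example code:**
```python
print(train["Sex"].unique())  # list the distinct values
train.loc[train["Sex"] == "male", "Sex"] = 1  # replace "male" with 1
```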

In [18]:
train["Sex"].unique()

array(['male', 'female'], dtype=object)

In [19]:
# This run encodes female as 1 and male as 2; any consistent numeric coding works
train.loc[train["Sex"] == "female", "Sex"] = 1
train.loc[train["Sex"] == "male", "Sex"] = 2
train["Sex"]

0      2
1      1
2      1
3      1
4      2
5      2
6      2
7      2
8      1
9      1
10     1
11     1
12     2
13     2
14     1
15     1
16     2
17     2
18     1
19     1
20     2
21     2
22     1
23     2
24     1
25     1
26     2
27     2
28     1
29     2
      ..
861    2
862    1
863    1
864    2
865    1
866    1
867    2
868    2
869    2
870    2
871    1
872    2
873    2
874    1
875    1
876    2
877    2
878    2
879    1
880    1
881    2
882    1
883    2
884    2
885    1
886    2
887    1
888    1
889    2
890    2
Name: Sex, Length: 891, dtype: int64
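
An alternative to one .loc assignment per category is a single map() call with a dictionary. The sketch below assumes the encoding male=1, female=0 and uses a hypothetical four-row frame `df`; the name `sex_codes` is invented for illustration.

```python
import pandas as pd

# Hypothetical four-row sample
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Assumed encoding for illustration: male -> 1, female -> 0
sex_codes = {"male": 1, "female": 0}
df["Sex"] = df["Sex"].map(sex_codes)

print(df["Sex"].tolist())
print(df["Sex"].dtype)
```

Values that are missing from the dictionary become NaN under map(), so it is worth checking unique() first, as In [18] does. Depending on the pandas version, the .loc approach may also leave the column with dtype object, whereas map() with integer codes yields an integer column when every value is matched.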

**☞ Run and inspect:**

In [20]:
# Confirm the distinct values of the Sex and Embarked columns
print(train["Sex"].unique())
print(train["Embarked"].unique())

[2 1]
['S' 'C' 'Q' nan]


**☞ Hands-on practice:**

In [33]:
# Replace male with 1
train.loc[train["Sex"] == "male", "Sex"] = 1

# Replace female with 0


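(2) Converting the Embarked column
The Embarked column gets the same treatment as Sex. Its values are S, C, Q, and missing (nan); each letter abbreviates a place name.

Fill the missing values with 'S' first, then let S=0, C=1, Q=2:

**☞ Example code:**
```python
# Replace missing values with 'S' before encoding
train["Embarked"] = train["Embarked"].fillna('S')
```
**☞ Hands-on practice:**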

In [21]:
train["Embarked"]

0      S
1      C
2      S
3      S
4      S
5      Q
6      S
7      S
8      S
9      C
10     S
11     S
12     S
13     S
14     S
15     S
16     Q
17     S
18     S
19     C
20     S
21     S
22     Q
23     S
24     S
25     S
26     C
27     S
28     Q
29     S
      ..
861    S
862    S
863    S
864    S
865    S
866    C
867    S
868    S
869    S
870    S
871    S
872    S
873    S
874    C
875    C
876    S
877    S
878    S
879    C
880    S
881    S
882    S
883    S
884    S
885    Q
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: object

In [22]:
train["Embarked"].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

In [24]:
len(train.loc[train["Embarked"] == "S", "Embarked"])

644

In [25]:
train["Embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [26]:
nan

NameError: name 'nan' is not defined

In [27]:
pd.nan

AttributeError: module 'pandas' has no attribute 'nan'
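
The two errors above arise because nan is not a Python built-in and pandas exposes no pd.nan attribute. The value lives in NumPy as np.nan, and missing entries are normally detected with isna() rather than by comparison. A minimal sketch on a hypothetical series `embarked`:

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing entries
embarked = pd.Series(["S", "C", np.nan, "Q", np.nan], name="Embarked")

print(np.nan == np.nan)        # False: NaN never compares equal, even to itself
print(pd.isna(np.nan))         # True
print(embarked.isna().sum())   # number of missing entries

missing_positions = embarked[embarked.isna()].index.tolist()
print(missing_positions)       # index labels of the missing entries
```

Because NaN never equals itself, a filter such as train["Embarked"] == np.nan would match nothing, which is why isna() is the usual tool.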

In [28]:
train["Embarked"] = train["Embarked"].fillna('NAN')

In [29]:
len(train.loc[train["Embarked"] == 'NAN', "Embarked"])

2

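In [30]:
# First use fillna to replace all missing values with "S"


# Replace "S" with 0
train.loc[train["Embarked"] == "S", "Embarked"] = 0

# Replace "C" with 1
train.loc[train["Embarked"] == "C", "Embarked"] = 1

# Replace "Q" with 2
train.loc[train["Embarked"] == "Q", "Embarked"] = 2

# In this run the missing values were filled with the placeholder "NAN" above; encode it as 3
train.loc[train["Embarked"] == "NAN", "Embarked"] = 3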
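
The whole Embarked conversion can also be written in two steps: fill the gaps with the most frequent value, then map the letters to codes. This is a sketch under assumptions: `df` is a hypothetical five-row frame, `port_codes` is an invented name holding the S=0, C=1, Q=2 encoding from the text, and mode() is used to look up the most frequent value instead of hard-coding 'S'.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row sample with one missing port
df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q", "S"]})

# Most frequent value of the column ("S" in this sample)
most_common = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_common)

# Encoding from the text: S=0, C=1, Q=2
port_codes = {"S": 0, "C": 1, "Q": 2}
df["Embarked"] = df["Embarked"].map(port_codes)

print(most_common)
print(df["Embarked"].tolist())
```

Filling before mapping matters: a NaN left in the column would pass through map() as NaN and the column could not hold plain integers.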


**☞ Hands-on practice:**

In [31]:
train  # look at the data again


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 2 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 2 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 0 |
| 5 | 6 | 0 | 3 | Moran, Mr. James | 2 | 28.0 | 0 | 0 | 330877 | 8.4583 | NaN | 2 |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | 2 | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | 0 |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | 2 | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | 0 |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 1 | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | 0 |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | 1 | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | 1 |


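After this initial analysis and preprocessing, we have data that can be fed into a machine learning model. In the next lesson we move on to machine learning proper.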