This file is based on the [Google ML course](https://developers.google.cn/machine-learning/crash-course/) and its lesson [First Steps with TF](https://colab.research.google.com/notebooks/mlcc/intro_to_pandas.ipynb?hl=en).

In [1]:
import pandas as pd
print(pd.__version__)

0.22.0


## Data Structure
`DataFrame` and `Series` are the two primary data structures of pandas. A `Series` is a single column; a `DataFrame` contains one or more `Series`, each with a name.

In [2]:
pd.Series(['a', 'b', 'c'])

0    a
1    b
2    c
dtype: object
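To make the relationship between the two structures concrete, here is a minimal sketch that gives a `Series` a name and wraps it in a `DataFrame`. The variable names `letters` and `frame` are made up for illustration.

```python
import pandas as pd

# A Series can carry a name; that name becomes the column label.
letters = pd.Series(['a', 'b', 'c'], name='Names')
frame = letters.to_frame()

print(frame.columns.tolist())  # the single column is labeled 'Names'
print(type(frame['Names']))    # each column of a DataFrame is itself a Series
```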

### New `Series` by `apply` function

In [3]:
s = pd.Series([100, 200, 300, 400, 500])
large_s = s.apply(lambda val: val > 200)
print(type(large_s))
print(large_s)

<class 'pandas.core.series.Series'>
0    False
1    False
2     True
3     True
4     True
dtype: bool
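For a simple comparison like this one, the same boolean `Series` can also be produced without `apply`. The sketch below compares the two approaches and then uses the result as a filter; the names `by_apply` and `by_compare` are illustrative.

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400, 500])

by_apply = s.apply(lambda val: val > 200)  # calls the lambda once per element
by_compare = s > 200                       # vectorized comparison on the whole Series

print(by_apply.equals(by_compare))  # both give the same boolean Series
print(s[by_compare].tolist())       # a boolean Series can select matching elements
```

`apply` remains useful when the per-element logic cannot be written as a simple vectorized expression.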


### New `DataFrame` by `dict`
A `DataFrame` can be built from a `dict` that maps column names to `Series`. If the `Series` differ in length, the missing values are filled with `NA`/`NaN`.

In [4]:
names = pd.Series(['a', 'b', 'c', 'd'])
numbers = pd.Series([100, 101, 102])
df = pd.DataFrame({'Names': names, 'Numbers': numbers})
print(df)

  Names  Numbers
0     a    100.0
1     b    101.0
2     c    102.0
3     d      NaN
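The `NaN` appears because pandas lines up the `Series` by index label, not by position. A small sketch with explicitly shifted labels (chosen here purely for illustration) shows the effect more clearly:

```python
import pandas as pd

names = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
numbers = pd.Series([100, 101, 102], index=[1, 2, 3])  # labels shifted by one

# Rows are matched on index labels; labels missing from one Series give NaN.
aligned = pd.DataFrame({'Names': names, 'Numbers': numbers})
print(aligned)
```

Here both `Series` have three elements, yet the result has four rows: label 0 has no number and label 3 has no name.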


### Get one column

In [5]:
names = df['Names']
print(type(names))
names

<class 'pandas.core.series.Series'>


0    a
1    b
2    c
3    d
Name: Names, dtype: object

### Get first records

In [6]:
first_two_records = df[0:2]
print(first_two_records)

  Names  Numbers
0     a    100.0
1     b    101.0
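Slicing with `df[0:2]` selects rows by position. As a sketch, the same two rows can also be taken with `iloc` or `head`; the frame below is rebuilt by hand so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd'],
                   'Numbers': [100.0, 101.0, 102.0, None]})

by_slice = df[0:2]      # plain slice, as in the cell above
by_iloc = df.iloc[0:2]  # explicit position-based indexing
by_head = df.head(2)    # first n rows

print(by_slice.equals(by_iloc) and by_iloc.equals(by_head))
```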


### Add new column

In [7]:
df['age'] = pd.Series([20, 30, 40])
df

  Names  Numbers   age
0     a    100.0  20.0
1     b    101.0  30.0
2     c    102.0  40.0
3     d      NaN   NaN
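A new column does not have to come from a separate `Series`. The following sketch, using made-up column names `source` and `doubled`, assigns a scalar and a value derived from an existing column:

```python
import pandas as pd

df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd'],
                   'Numbers': [100, 101, 102, 103]})

df['source'] = 'demo'              # a scalar is repeated for every row
df['doubled'] = df['Numbers'] * 2  # computed element-wise from another column

print(df)
```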


## Index
Both `Series` and `DataFrame` have an `index` property that assigns an identifier to each element of a `Series` and each row of a `DataFrame`.

In [8]:
names.index

RangeIndex(start=0, stop=4, step=1)

In [9]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [10]:
df.reindex([2, 0, 1, 3])

  Names  Numbers   age
2     c    102.0  40.0
0     a    100.0  20.0
1     b    101.0  30.0
3     d      NaN   NaN
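`reindex` returns a new object and leaves the original untouched. It also accepts labels that are not in the index, in which case the new rows are filled with `NaN`. A minimal sketch, where label 4 is deliberately absent:

```python
import pandas as pd

df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd']})

reordered = df.reindex([2, 0, 4])  # label 4 does not exist in df
print(reordered)
print(df.index.tolist())           # the original index is unchanged
```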


### Shuffle the `DataFrame`

In [11]:
import numpy as np
df.reindex(np.random.permutation(df.index))

  Names  Numbers   age
2     c    102.0  40.0
3     d      NaN   NaN
0     a    100.0  20.0
1     b    101.0  30.0
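As an alternative sketch (not the approach used in the lesson), `DataFrame.sample` with `frac=1` also returns all rows in random order. The `random_state` value is an arbitrary choice that makes the shuffle repeatable.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd']})

# frac=1 samples 100% of the rows without replacement, i.e. a shuffle.
shuffled = df.sample(frac=1, random_state=0)

print(shuffled)
print(sorted(shuffled.index) == list(df.index))  # same rows, different order
```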


## DataFrame
### read_csv()

In [12]:
california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")
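`read_csv` accepts file-like objects as well as URLs and paths. To try it without network access, the sketch below parses an in-memory string; the three columns and two rows are copied from the dataset's first records, and `csv_text` and `small_df` are illustrative names.

```python
import io
import pandas as pd

csv_text = """longitude,latitude,median_house_value
-114.31,34.19,66900.0
-114.47,34.4,80100.0
"""

# io.StringIO makes the string behave like an open text file.
small_df = pd.read_csv(io.StringIO(csv_text), sep=",")

print(small_df.shape)
print(small_df.dtypes)
```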

### head()
Display the first few records of a `DataFrame` (five by default).

In [13]:
california_housing_dataframe.head()

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
0    -114.31     34.19                15.0       5612.0          1283.0      1015.0       472.0         1.4936             66900.0
1    -114.47      34.4                19.0       7650.0          1901.0      1129.0       463.0           1.82             80100.0
2    -114.56     33.69                17.0        720.0           174.0       333.0       117.0         1.6509             85700.0
3    -114.57     33.64                14.0       1501.0           337.0       515.0       226.0         3.1917             73400.0
4    -114.57     33.57                20.0       1454.0           326.0       624.0       262.0          1.925             65500.0


### describe()
Display summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) for the numeric columns of a `DataFrame`.

In [14]:
california_housing_dataframe.describe()

         longitude   latitude  housing_median_age  total_rooms  total_bedrooms   population  households  median_income  median_house_value
count      17000.0    17000.0             17000.0      17000.0         17000.0      17000.0     17000.0        17000.0             17000.0
mean   -119.562108  35.625225           28.589353  2643.664412      539.410824  1429.573941  501.221941       3.883578       207300.912353
std       2.005166    2.13734           12.586937  2179.947071      421.499452  1147.852959  384.520841       1.908157       115983.764387
min        -124.35      32.54                 1.0          2.0             1.0          3.0         1.0         0.4999             14999.0
25%        -121.79      33.93                18.0       1462.0           297.0        790.0       282.0       2.566375            119400.0
50%        -118.49      34.25                29.0       2127.0           434.0       1167.0       409.0         3.5446            180400.0
75%         -118.0      37.72                37.0      3151.25          648.25       1721.0      605.25          4.767            265000.0
max        -114.31      41.95                52.0      37937.0          6445.0      35682.0      6082.0        15.0001            500001.0
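The result of `describe()` is itself a `DataFrame`, so individual statistics can be looked up by row and column label. A minimal sketch on a five-row frame (the values are the `housing_median_age` entries from the first five records; `ages` and `stats` are illustrative names):

```python
import pandas as pd

ages = pd.DataFrame({'housing_median_age': [15.0, 19.0, 17.0, 14.0, 20.0]})
stats = ages.describe()

# Rows of the summary are labeled 'count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'.
print(stats.loc['mean', 'housing_median_age'])  # 17.0
print(stats.loc['50%', 'housing_median_age'])   # 17.0 (the median)
```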


### hist(column)
Plot a histogram showing the distribution of values in a column.

In [15]:
california_housing_dataframe.hist('housing_median_age')

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x109c24400>]],
      dtype=object)
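The cell output above is only the plot object; the histogram itself is drawn by matplotlib. To inspect the numbers behind such a plot without drawing anything, one option is `numpy.histogram`. This is a sketch on five sample values with an arbitrary choice of 3 equal-width bins (`hist` uses 10 bins by default), not a reproduction of the plot above.

```python
import numpy as np
import pandas as pd

ages = pd.Series([15.0, 19.0, 17.0, 14.0, 20.0])

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(ages, bins=3)

print(counts.tolist())  # [2, 1, 2]
print(edges.tolist())   # [14.0, 16.0, 18.0, 20.0]
```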