# Pandas Basics

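## Importing the data
* https://raw.githubusercontent.com/Code-Gym/python-dataset/master/u.user.txt
* Use the `read_csv` function to load the data
* `sep`: the field delimiter (here `|`)
* `index_col`: the column to use as the row index (here `user_id`)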

In [7]:
import pandas as pd
users = pd.read_csv("https://raw.githubusercontent.com/Code-Gym/python-dataset/master/u.user.txt",
                   sep="|",index_col="user_id")
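
As a minimal offline sketch of what `sep` and `index_col` do, the same call can be run on an in-memory sample. The three rows are taken from the `head()` output of this notebook; the header line and the names `sample` and `demo` are assumptions made for illustration.

```python
import io

import pandas as pd

# Small in-memory sample in the same pipe-delimited layout (header line assumed).
sample = io.StringIO(
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|23|M|writer|32067\n"
)

# sep="|" splits each line on the pipe character;
# index_col="user_id" turns that column into the row index.
demo = pd.read_csv(sample, sep="|", index_col="user_id")

print(demo.index.name)     # name of the row index
print(list(demo.columns))  # the remaining data columns
```

Because `user_id` becomes the index, it no longer appears among the data columns.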

## Print the first five rows

In [8]:
users.head()

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |


## Print the last five rows

In [9]:
users.tail()

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 939 | 26 | F | student | 33319 |
| 940 | 32 | M | administrator | 2215 |
| 941 | 20 | M | student | 97229 |
| 942 | 48 | F | librarian | 78209 |
| 943 | 22 | M | student | 77841 |
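
Both `head` and `tail` return five rows by default and accept an optional row count. A small sketch on a hypothetical six-row frame (the name `demo` and the sixth age value are made up for illustration):

```python
import pandas as pd

# Hypothetical six-row frame indexed by user_id.
demo = pd.DataFrame(
    {"age": [24, 53, 23, 24, 33, 42]},
    index=pd.Index(range(1, 7), name="user_id"),
)

first_three = demo.head(3)  # first three rows
last_two = demo.tail(2)     # last two rows
default_head = demo.head()  # no argument: five rows

print(first_three)
print(last_two)
```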


## Number of rows
* `shape` is a tuple
* It has two elements: the number of rows and the number of columns

In [10]:
users.shape[0]

943

## Number of columns

In [11]:
users.shape[1]
# The count excludes user_id, which is the index rather than a data column

4
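
To illustrate why the column count excludes the index, the sketch below reads the same hypothetical two-row sample with and without `index_col`. The names `raw`, `with_index`, and `without_index` are assumptions for this example only.

```python
import io

import pandas as pd

# Hypothetical two-row sample with three fields per line.
raw = "user_id|age|gender\n1|24|M\n2|53|F\n"

with_index = pd.read_csv(io.StringIO(raw), sep="|", index_col="user_id")
without_index = pd.read_csv(io.StringIO(raw), sep="|")

# shape is a (rows, columns) tuple, so it can be unpacked directly.
rows, cols = with_index.shape
print(rows, cols)           # user_id is the index, so it is not counted as a column
print(without_index.shape)  # user_id stays a regular column here
print(len(with_index))      # len() gives the row count as well
```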

## Column names and data types

In [12]:
users.dtypes

age            int64
gender        object
occupation    object
zip_code      object
dtype: object
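
One plausible reason `zip_code` is not numeric is that at least one value cannot be parsed as a number; that is an assumption about the dataset, not something shown in this notebook. The sketch below uses made-up values to show how a single non-numeric entry keeps a column as text.

```python
import pandas as pd

# Hypothetical values: the last zip code contains letters.
demo = pd.DataFrame(
    {
        "age": [24, 53, 23],
        "zip_code": ["85711", "94043", "V3N4P"],
    }
)

print(demo.dtypes)

# Helper checks from pandas.api.types avoid comparing dtype names by hand.
age_is_integer = pd.api.types.is_integer_dtype(demo["age"])
zip_is_numeric = pd.api.types.is_numeric_dtype(demo["zip_code"])
print(age_is_integer, zip_is_numeric)
```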

## Selecting a single column

In [13]:
users.occupation
# or equivalently: users["occupation"]

user_id
1         technician
2              other
3             writer
4         technician
5              other
           ...      
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object

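## Print the first entry in the occupation column
* With an integer index, `[ ]` after the column looks up by index label, not by position: `[1]` returns the row whose `user_id` is 1, which happens to be the first row here. Use `.iloc[0]` to select strictly by position.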

In [14]:
users.occupation[1]

'technician'
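
The difference between label-based and position-based lookup can be sketched on a small Series. The values come from the first three rows of this notebook; the variable names are illustrative.

```python
import pandas as pd

# Series indexed by user_id labels that start at 1, as in the users frame.
occupation = pd.Series(
    ["technician", "other", "writer"],
    index=pd.Index([1, 2, 3], name="user_id"),
    name="occupation",
)

first_by_label = occupation.loc[1]      # the row whose user_id label is 1
first_by_position = occupation.iloc[0]  # the first row, whatever its label is

print(first_by_label, first_by_position)
```

Using `.loc` or `.iloc` explicitly keeps the intent clear when labels and positions do not line up.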


## How many distinct occupations are there?
* `nunique` counts the distinct values in a column

In [15]:
users.occupation.nunique()

21

## How many users are in each occupation?
* `value_counts` returns the count for each value, sorted in descending order

In [16]:
users.occupation.value_counts()

student          196
other            105
educator          95
administrator     79
engineer          67
programmer        66
librarian         51
writer            45
executive         32
scientist         31
artist            28
technician        27
marketing         26
entertainment     18
healthcare        16
retired           14
lawyer            12
salesman          12
none               9
doctor             7
homemaker          7
Name: occupation, dtype: int64
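
As a small sketch of how `nunique` and `value_counts` relate, the example below runs both on a made-up six-element Series and also shows the `normalize=True` option, which returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical occupation values for six users.
jobs = pd.Series(
    ["student", "other", "student", "writer", "student", "other"],
    name="occupation",
)

n_distinct = jobs.nunique()                 # number of distinct values
counts = jobs.value_counts()                # occurrences per value, largest first
shares = jobs.value_counts(normalize=True)  # same ordering, as fractions of the total

print(n_distinct)
print(counts)
print(shares)
```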

## Which occupation appears most often?
* `value_counts` sorts in descending order, so `head(1)` returns the most frequent value

In [19]:
users.occupation.value_counts().head(1)

student    196
Name: occupation, dtype: int64

In [20]:
# top five
users.occupation.value_counts().head()

student          196
other            105
educator          95
administrator     79
engineer          67
Name: occupation, dtype: int64

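## Which occupation appears least often?
* `tail(1)` returns only the last row; in this data doctor and homemaker are tied at 7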

In [23]:
users.occupation.value_counts().tail(1)

homemaker    7
Name: occupation, dtype: int64

In [24]:
users.occupation.value_counts().tail()

lawyer       12
salesman     12
none          9
doctor        7
homemaker     7
Name: occupation, dtype: int64
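
Because `head(1)` and `tail(1)` return a single row, they hide ties such as doctor and homemaker above. One possible way to list every tied value is to compare against the minimum or maximum count. The sketch below is an alternative to the notebook's approach and rebuilds a few of the counts by hand.

```python
import pandas as pd

# A subset of the counts shown in the outputs above, entered by hand.
counts = pd.Series(
    {"student": 196, "lawyer": 12, "salesman": 12, "none": 9, "doctor": 7, "homemaker": 7}
)

# A boolean mask keeps every row that matches the extreme value, ties included.
least_common = counts[counts == counts.min()]
most_common = counts[counts == counts.max()]

print(least_common)
print(most_common)
```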

## Basic summary statistics
* By default, `describe` summarizes only the numeric columns

In [25]:
users.describe()

|  | age |
|---|---|
| count | 943.0 |
| mean | 34.051962 |
| std | 12.19274 |
| min | 7.0 |
| 25% | 25.0 |
| 50% | 31.0 |
| 75% | 43.0 |
| max | 73.0 |


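## Summary statistics for all columns
* Covers both numeric and string columns
* Pass `include="all"` to `describe`
* `unique` is the number of distinct values
* `top` is the most frequent value (e.g. the occupation `student`)
* `freq` is how many times the most frequent value appears (e.g. the number of students)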
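
The summary rows described above can be checked by hand on a small frame. This is a sketch on made-up data; the name `demo` and its four rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical four-row frame with one numeric and one string column.
demo = pd.DataFrame(
    {
        "age": [24, 53, 23, 24],
        "occupation": ["technician", "other", "writer", "technician"],
    }
)

summary = demo.describe(include="all")
print(summary)

# unique/top/freq are filled for the string column,
# mean/std/quantiles only for the numeric column.
n_unique = summary.loc["unique", "occupation"]
top_value = summary.loc["top", "occupation"]
top_freq = summary.loc["freq", "occupation"]
mean_age = summary.loc["mean", "age"]
```

Cells that do not apply to a column's type are reported as `NaN`.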

In [26]:
users.describe(include="all")


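|  | age | gender | occupation | zip_code |
|---|---|---|---|---|
| count | 943.0 | 943 | 943 | 943 |
| unique | NaN | 2 | 21 | 795 |
| top | NaN | M | student | 55414 |
| freq | NaN | 670 | 196 | 9 |
| mean | 34.051962 | NaN | NaN | NaN |
| std | 12.19274 | NaN | NaN | NaN |
| min | 7.0 | NaN | NaN | NaN |
| 25% | 25.0 | NaN | NaN | NaN |
| 50% | 31.0 | NaN | NaN | NaN |
| 75% | 43.0 | NaN | NaN | NaN |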
