# Pandas1

**[1] Package**<br>
**[2] Series**<br>
**[3] DataFrame**<br>
**[4] Read data from file to DataFrame**<br>
**[5] Descriptive statistics**<br>

## [1] Package

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas as pd

## [2] Series

- **Form a series from an array**

In [2]:
Series1 = pd.Series([4, 7, -5, 3])
Series1

0    4
1    7
2   -5
3    3
dtype: int64

In [3]:
# get value
Series1.values

array([ 4,  7, -5,  3], dtype=int64)

In [4]:
# get index
Series1.index

RangeIndex(start=0, stop=4, step=1)

- **Form a series and assign an index array**

In [5]:
Series2 = pd.Series([4, 7, -5, 3], index = ["a","b","c","d"])
Series2

a    4
b    7
c   -5
d    3
dtype: int64

In [6]:
Series2.index

Index(['a', 'b', 'c', 'd'], dtype='object')

- **Subset selection**

In [7]:
# Select a single value
Series2['a']

4

In [8]:
# Select a set of values
Series2[['c', 'a', 'd']]

c   -5
a    4
d    3
dtype: int64

- **Use boolean array to filter data** 

In [9]:
Series2 > 2

a     True
b     True
c    False
d     True
dtype: bool

In [10]:
Series2[Series2 > 2]

a    4
b    7
d    3
dtype: int64

- **Use bitwise operators to combine conditions**

In [11]:
(Series2 > 2) & (Series2 < 5) 

a     True
b    False
c    False
d     True
dtype: bool

In [12]:
Series2[(Series2 > 2) & (Series2 < 5)]

a    4
d    3
dtype: int64
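
The conditions above are combined with `&` rather than `and` because each comparison yields a whole boolean Series, not a single `True`/`False`. The parentheses matter as well, since `&` binds more tightly than `>` and `<`. The following is a minimal sketch that rebuilds the values of `Series2` under the name `s` (a name chosen here for illustration) and also shows the analogous `|` (or) and `~` (not) operators:

```python
import pandas as pd

# Same values and index as Series2 above
s = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "d"])

# Element-wise AND: each comparison must be parenthesised
both = s[(s > 2) & (s < 5)]

# Element-wise OR and NOT follow the same pattern
either = s[(s < 0) | (s > 5)]
not_big = s[~(s > 2)]

print(both.index.tolist())     # ['a', 'd']
print(either.index.tolist())   # ['b', 'c']
print(not_big.index.tolist())  # ['c']
```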

## Exercise.A

**(A.1) Create a series with <code>score_list</code> as the value and <code>ID_list</code> as the index.**

In [3]:
ID_list = ['S01', 'S02', 'S03', 'S04', 'S05']
score_list = [7.0, 5.5, 9.0, 5.0, 7.5]

In [5]:
myseries = pd.Series(score_list, index = ID_list)
myseries

S01    7.0
S02    5.5
S03    9.0
S04    5.0
S05    7.5
dtype: float64

**(A.2) Select the data of students <code>'S01'</code> and <code>'S03'</code>.**

In [7]:
myseries[['S01', 'S03']]

S01    7.0
S03    9.0
dtype: float64

**(A.3) Select students with a score less than 6.**

In [8]:
myseries[myseries<6]

S02    5.5
S04    5.0
dtype: float64

In [10]:
type(myseries<6)

pandas.core.series.Series

## [3] DataFrame

- **Form a DataFrame from a dictionary**

In [11]:
# Create a dictionary
data = {"state": ["Ohio","Ohio","Ohio","Nevada","Nevada","Nevada"],
        "year":[2000,2001,2002,2001,2002,2003],
        "pop":[1.5,1.7,3.6,2.4,2.9,3.2]}

In [12]:
df = pd.DataFrame(data, index = ["a","b","c","d","e","f"])
df

Unnamed: 0,state,year,pop
a,Ohio,2000,1.5
b,Ohio,2001,1.7
c,Ohio,2002,3.6
d,Nevada,2001,2.4
e,Nevada,2002,2.9
f,Nevada,2003,3.2


- **Labels**

In [16]:
df.index

Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

In [17]:
df.columns

Index(['state', 'year', 'pop'], dtype='object')

- **Subset selection - loc**

In [18]:
df.loc["b", "state"]   # a single value: row label, then column label

'Ohio'

In [19]:
df.loc[["b","c","d"], ["state","year"]]

Unnamed: 0,state,year
b,Ohio,2001
c,Ohio,2002
d,Nevada,2001


- **Subset selection - iloc**

In [20]:
df.iloc[1:4,0:2]

Unnamed: 0,state,year
b,Ohio,2001
c,Ohio,2002
d,Nevada,2001


In [22]:
df.iloc[1,0]

'Ohio'
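
One difference between the two selectors is easy to miss: a `loc` slice works on labels and includes the end label, while an `iloc` slice works on integer positions and excludes the stop position. The sketch below rebuilds the same `df` as above and selects the same block both ways; the names `by_label` and `by_position` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
     "year": [2000, 2001, 2002, 2001, 2002, 2003],
     "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]},
    index=["a", "b", "c", "d", "e", "f"],
)

# loc slices by label and includes the end labels "d" and "year"
by_label = df.loc["b":"d", "state":"year"]

# iloc slices by position and excludes the stop positions 4 and 2
by_position = df.iloc[1:4, 0:2]

print(by_label.equals(by_position))  # True
```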

- **Subset selection - column names**

In [23]:
df[["state","year"]]   # same as df.loc[:,["state","year"]]

Unnamed: 0,state,year
a,Ohio,2000
b,Ohio,2001
c,Ohio,2002
d,Nevada,2001
e,Nevada,2002
f,Nevada,2003


- **Subset selection - positions**

In [24]:
df[1:4]                # same as df.iloc[1:4,:]

Unnamed: 0,state,year,pop
b,Ohio,2001,1.7
c,Ohio,2002,3.6
d,Nevada,2001,2.4


- **Use boolean array to filter data**

In [25]:
df[df.year > 2001]

Unnamed: 0,state,year,pop
c,Ohio,2002,3.6
e,Nevada,2002,2.9
f,Nevada,2003,3.2


In [26]:
df[df.state == "Ohio"]

Unnamed: 0,state,year,pop
a,Ohio,2000,1.5
b,Ohio,2001,1.7
c,Ohio,2002,3.6


- **Use bitwise operators to combine conditions**

In [27]:
df[(df.state == "Ohio") & (df.year > 2000)]

Unnamed: 0,state,year,pop
b,Ohio,2001,1.7
c,Ohio,2002,3.6


In [14]:
type(df.state)

pandas.core.series.Series

## Exercise.B

**(B.1) Create a dictionary with <code>company_name, profit, assets</code> as the keys and <code>list1, list2, list3</code> as the values. Print out the dictionary.**<br>

In [15]:
list1 = ['JPMorgan Chase','Apple','Bank of America','Amazon','Microsoft']
list2 = [40.4, 63.9, 17.9, 21.3, 51.3]
list3 = [3689.3, 354.1, 2832.2, 321.2, 304.1]

In [16]:
data  = {"company_name":list1, "profit":list2, "assets":list3}
data

{'company_name': ['JPMorgan Chase',
  'Apple',
  'Bank of America',
  'Amazon',
  'Microsoft'],
 'profit': [40.4, 63.9, 17.9, 21.3, 51.3],
 'assets': [3689.3, 354.1, 2832.2, 321.2, 304.1]}

**(B.2) Create a dataframe named <code>mydf</code> based on the dictionary defined in (B.1) and use <code>a, b, c, d, e</code> as index. Print out the dataframe.**

In [18]:
mydf = pd.DataFrame(data, index = ['a','b','c','d','e'])
mydf

Unnamed: 0,company_name,profit,assets
a,JPMorgan Chase,40.4,3689.3
b,Apple,63.9,354.1
c,Bank of America,17.9,2832.2
d,Amazon,21.3,321.2
e,Microsoft,51.3,304.1


**(B.3) Use <code>loc</code> to select a subset as follows.**

| company name | assets| 
|:-:|:-:|
|Apple|354.1|
|Amazon|321.2|
|Microsoft|304.1|

In [21]:
mydf.loc[['b','d','e'],["company_name","assets"]]

Unnamed: 0,company_name,assets
b,Apple,354.1
d,Amazon,321.2
e,Microsoft,304.1


**(B.4) Use <code>iloc</code> to select the same subset.**

In [23]:
mydf.iloc[ [1,3,4] , [0,2] ]

Unnamed: 0,company_name,assets
b,Apple,354.1
d,Amazon,321.2
e,Microsoft,304.1


## [4] Read data from file

### [4.1] From csv file to dataframe

- **Read csv file**

In [24]:
# Adjust the path to match where parks.csv is stored,
# e.g. 'parks.csv' if the file sits next to the notebook
park_df = pd.read_csv('../dataset/parks.csv')

- **View data**

In [30]:
park_df.head(5)

Unnamed: 0,Park Code,Park Name,State,Acres,Latitude,Longitude
0,ACAD,Acadia National Park,ME,47390,44.35,-68.21
1,ARCH,Arches National Park,UT,76519,38.68,-109.57
2,BADL,Badlands National Park,SD,242756,43.75,-102.5
3,BIBE,Big Bend National Park,TX,801163,29.25,-103.25
4,BISC,Biscayne National Park,FL,172924,25.65,-80.08


In [31]:
park_df.shape

(56, 6)
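
If `parks.csv` is not at hand, the same steps can be tried on CSV text held in memory. This is only a sketch: it assumes the file's header row matches the column names displayed above, and it copies the first three rows from the `head` output. `csv_text` and `sample_df` are names introduced here for illustration.

```python
import io
import pandas as pd

# Assumed layout of parks.csv, using the first three rows shown above
csv_text = """Park Code,Park Name,State,Acres,Latitude,Longitude
ACAD,Acadia National Park,ME,47390,44.35,-68.21
ARCH,Arches National Park,UT,76519,38.68,-109.57
BADL,Badlands National Park,SD,242756,43.75,-102.5
"""

# read_csv accepts a file-like object as well as a path
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (3, 6)
print(sample_df.head(2))
```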

### [4.2] Select a subset

- **Subset selection - column names**

In [32]:
park_df[["Park Code", "State"]]

Unnamed: 0,Park Code,State
0,ACAD,ME
1,ARCH,UT
2,BADL,SD
3,BIBE,TX
4,BISC,FL
5,BLCA,CO
6,BRCA,UT
7,CANY,UT
8,CARE,UT
9,CAVE,NM


- **Subset selection - integer positions**

In [33]:
park_df[10:15]

Unnamed: 0,Park Code,Park Name,State,Acres,Latitude,Longitude
10,CHIS,Channel Islands National Park,CA,249561,34.01,-119.42
11,CONG,Congaree National Park,SC,26546,33.78,-80.78
12,CRLA,Crater Lake National Park,OR,183224,42.94,-122.1
13,CUVA,Cuyahoga Valley National Park,OH,32950,41.24,-81.55
14,DENA,Denali National Park and Preserve,AK,3372402,63.33,-150.5


- **Subset selection - loc**

In [34]:
park_df.loc[[10,11,12,13,14],["Park Code","Park Name"]]

Unnamed: 0,Park Code,Park Name
10,CHIS,Channel Islands National Park
11,CONG,Congaree National Park
12,CRLA,Crater Lake National Park
13,CUVA,Cuyahoga Valley National Park
14,DENA,Denali National Park and Preserve


- **Subset selection - iloc**

In [35]:
park_df.iloc[10:15,0:2]

Unnamed: 0,Park Code,Park Name
10,CHIS,Channel Islands National Park
11,CONG,Congaree National Park
12,CRLA,Crater Lake National Park
13,CUVA,Cuyahoga Valley National Park
14,DENA,Denali National Park and Preserve


### [4.3] Filter data

- **Filter data by one condition**

In [36]:
park_df[park_df.State == "UT"]

Unnamed: 0,Park Code,Park Name,State,Acres,Latitude,Longitude
1,ARCH,Arches National Park,UT,76519,38.68,-109.57
6,BRCA,Bryce Canyon National Park,UT,35835,37.57,-112.18
7,CANY,Canyonlands National Park,UT,337598,38.2,-109.93
8,CARE,Capitol Reef National Park,UT,241904,38.2,-111.17
55,ZION,Zion National Park,UT,146598,37.3,-113.05


- **Filter data by multiple conditions**

In [37]:
park_df[(park_df.State == "UT") & (park_df.Acres > 50000)]

Unnamed: 0,Park Code,Park Name,State,Acres,Latitude,Longitude
1,ARCH,Arches National Park,UT,76519,38.68,-109.57
7,CANY,Canyonlands National Park,UT,337598,38.2,-109.93
8,CARE,Capitol Reef National Park,UT,241904,38.2,-111.17
55,ZION,Zion National Park,UT,146598,37.3,-113.05


- **Filter data by <code>isin</code>**

In [38]:
# Chained comparisons with | work, but grow with every extra state
park_df[(park_df.State =='WA')|(park_df.State =='OR')|(park_df.State == 'CA')]

Unnamed: 0,Park Code,Park Name,State,Acres,Latitude,Longitude
10,CHIS,Channel Islands National Park,CA,249561,34.01,-119.42
12,CRLA,Crater Lake National Park,OR,183224,42.94,-122.1
31,JOTR,Joshua Tree National Park,CA,789745,33.79,-115.9
36,LAVO,Lassen Volcanic National Park,CA,106372,40.49,-121.51
39,MORA,Mount Rainier National Park,WA,235625,46.85,-121.75
40,NOCA,North Cascades National Park,WA,504781,48.7,-121.2
41,OLYM,Olympic National Park,WA,922651,47.97,-123.5
43,PINN,Pinnacles National Park,CA,26606,36.48,-121.16
44,REDW,Redwood National Park,CA,112512,41.3,-124.0
47,SEKI,Sequoia and Kings Canyon National Parks,CA,865952,36.43,-118.68


In [39]:
# isin tests membership in a list and gives the same rows
park_df[park_df.State.isin(['WA', 'OR', 'CA'])]

Unnamed: 0,Park Code,Park Name,State,Acres,Latitude,Longitude
10,CHIS,Channel Islands National Park,CA,249561,34.01,-119.42
12,CRLA,Crater Lake National Park,OR,183224,42.94,-122.1
31,JOTR,Joshua Tree National Park,CA,789745,33.79,-115.9
36,LAVO,Lassen Volcanic National Park,CA,106372,40.49,-121.51
39,MORA,Mount Rainier National Park,WA,235625,46.85,-121.75
40,NOCA,North Cascades National Park,WA,504781,48.7,-121.2
41,OLYM,Olympic National Park,WA,922651,47.97,-123.5
43,PINN,Pinnacles National Park,CA,26606,36.48,-121.16
44,REDW,Redwood National Park,CA,112512,41.3,-124.0
47,SEKI,Sequoia and Kings Canyon National Parks,CA,865952,36.43,-118.68


In [40]:
park_west_df = park_df[park_df.State.isin(['WA', 'OR', 'CA'])]
park_west_df

Unnamed: 0,Park Code,Park Name,State,Acres,Latitude,Longitude
10,CHIS,Channel Islands National Park,CA,249561,34.01,-119.42
12,CRLA,Crater Lake National Park,OR,183224,42.94,-122.1
31,JOTR,Joshua Tree National Park,CA,789745,33.79,-115.9
36,LAVO,Lassen Volcanic National Park,CA,106372,40.49,-121.51
39,MORA,Mount Rainier National Park,WA,235625,46.85,-121.75
40,NOCA,North Cascades National Park,WA,504781,48.7,-121.2
41,OLYM,Olympic National Park,WA,922651,47.97,-123.5
43,PINN,Pinnacles National Park,CA,26606,36.48,-121.16
44,REDW,Redwood National Park,CA,112512,41.3,-124.0
47,SEKI,Sequoia and Kings Canyon National Parks,CA,865952,36.43,-118.68
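
The equivalence of the two filters can be checked directly. The sketch below uses a small hand-built frame (`parks`, with a few code/state pairs copied from the outputs above) rather than the full file, and also shows that `~` inverts an `isin` mask:

```python
import pandas as pd

# A few (Park Code, State) pairs taken from the outputs above
parks = pd.DataFrame({
    "Park Code": ["ACAD", "ARCH", "CHIS", "CRLA", "MORA"],
    "State": ["ME", "UT", "CA", "OR", "WA"],
})

west = ["WA", "OR", "CA"]
chained = parks[(parks.State == "WA") | (parks.State == "OR") | (parks.State == "CA")]
with_isin = parks[parks.State.isin(west)]
print(chained.equals(with_isin))        # True

# ~ inverts the mask: parks outside the three states
not_west = parks[~parks.State.isin(west)]
print(not_west["Park Code"].tolist())   # ['ACAD', 'ARCH']
```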


## Exercise.C

**(C.1) Read the csv file <code>diabetes.csv</code> using pandas. Display the first 10 rows.**

• **Pregnancies**: Number of times pregnant<br>
• **Glucose**: Plasma glucose concentration over 2 hours in an oral glucose tolerance test<br>
• **BloodPressure**: Diastolic blood pressure (mm Hg)<br>
• **SkinThickness**: Triceps skin fold thickness (mm)<br>
• **Insulin**: 2-Hour serum insulin (mu U/ml)<br>
• **BMI**: Body mass index (weight in kg/(height in m)²)<br>
• **DiabetesPedigreeFunction**: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)<br>
• **Age**: Age (years)<br>
• **Outcome**: Class variable (0 if non-diabetic, 1 if diabetic)<br>

In [7]:
diabetes_df = pd.read_csv("../dataset/diabetes.csv")
diabetes_df.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


**(C.2) What is the number of rows and columns in this data set?**

In [3]:
diabetes_df.shape

(768, 9)

**(C.3) Select (display) column <code>BloodPressure</code> and column <code>BMI</code>.** 

In [4]:
diabetes_df.loc[:, ["BloodPressure", "BMI"] ]

Unnamed: 0,BloodPressure,BMI
0,72,33.6
1,66,26.6
2,64,23.3
3,66,28.1
4,40,43.1
...,...,...
763,76,32.9
764,70,36.8
765,72,26.2
766,60,30.1


**(C.4) Select rows with <code>BMI</code> greater than 50.**

In [5]:
diabetes_df[diabetes_df.BMI > 50]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
120,0,162,76,56,100,53.2,0.759,25,1
125,1,88,30,42,99,55.0,0.496,26,1
177,0,129,110,46,130,67.1,0.319,26,1
193,11,135,0,0,0,52.3,0.578,40,1
247,0,165,90,33,680,52.3,0.427,23,0
303,5,115,98,0,0,52.9,0.209,28,1
445,0,180,78,63,14,59.4,2.42,25,1
673,3,123,100,35,240,57.3,0.88,22,0


**(C.5) Select rows with either <code>BMI</code> greater than 50 or <code>BloodPressure</code> greater than 110.**

In [6]:
diabetes_df[ (diabetes_df.BMI>50)|(diabetes_df.BloodPressure>110)  ]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
106,1,96,122,0,0,22.4,0.207,27,0
120,0,162,76,56,100,53.2,0.759,25,1
125,1,88,30,42,99,55.0,0.496,26,1
177,0,129,110,46,130,67.1,0.319,26,1
193,11,135,0,0,0,52.3,0.578,40,1
247,0,165,90,33,680,52.3,0.427,23,0
303,5,115,98,0,0,52.9,0.209,28,1
445,0,180,78,63,14,59.4,2.42,25,1
673,3,123,100,35,240,57.3,0.88,22,0
691,13,158,114,0,0,42.3,0.257,44,1


## [5] Descriptive statistics

- **A summary of a DataFrame**

In [41]:
park_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56 entries, 0 to 55
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Park Code  56 non-null     object 
 1   Park Name  56 non-null     object 
 2   State      56 non-null     object 
 3   Acres      56 non-null     int64  
 4   Latitude   56 non-null     float64
 5   Longitude  56 non-null     float64
dtypes: float64(2), int64(1), object(3)
memory usage: 2.8+ KB


- **Change data type**

In [42]:
park_df.Acres = park_df.Acres.astype(float)
park_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56 entries, 0 to 55
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Park Code  56 non-null     object 
 1   Park Name  56 non-null     object 
 2   State      56 non-null     object 
 3   Acres      56 non-null     float64
 4   Latitude   56 non-null     float64
 5   Longitude  56 non-null     float64
dtypes: float64(3), object(3)
memory usage: 2.8+ KB
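
Note that `astype` returns a new Series and leaves the DataFrame untouched until the result is assigned back, which is why the cell above assigns to `park_df.Acres`. A minimal sketch on a small hand-built frame (`parks`, with values copied from the `head` output) illustrates this. It uses bracket notation, which also works for column names that contain spaces, such as `"Park Code"`:

```python
import pandas as pd

parks = pd.DataFrame({
    "Park Code": ["ACAD", "ARCH", "BADL"],
    "Acres": [47390, 76519, 242756],
})

# astype returns a converted copy; the original column is unchanged
converted = parks["Acres"].astype(float)
print(parks["Acres"].dtype)   # still the original integer dtype

# The column changes only once the result is assigned back
parks["Acres"] = converted
print(parks["Acres"].dtype)   # float64

# Attribute access cannot express a name with a space; brackets can
print(parks["Park Code"].tolist())
```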


- **Descriptive statistics of numerical columns**

In [43]:
park_df.describe()

Unnamed: 0,Acres,Latitude,Longitude
count,56.0,56.0,56.0
mean,927929.1,41.233929,-113.234821
std,1709258.0,10.908831,22.440287
min,5550.0,19.38,-159.28
25%,69010.5,35.5275,-121.57
50%,238764.5,38.55,-110.985
75%,817360.2,46.88,-103.4
max,8323148.0,67.78,-68.21
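
The result of `describe()` is itself a DataFrame, so individual statistics can be picked out with `loc`, and each row corresponds to an ordinary method such as `mean()` or `median()`. The sketch below checks this on a small frame (`acres`, an illustrative name) built from the first four `Acres` values shown earlier:

```python
import pandas as pd

# First four Acres values from the head() output above
acres = pd.DataFrame({"Acres": [47390.0, 76519.0, 242756.0, 801163.0]})

summary = acres.describe()

# Rows of the summary are labelled by statistic name
mean_from_summary = summary.loc["mean", "Acres"]
median_from_summary = summary.loc["50%", "Acres"]

print(mean_from_summary, acres["Acres"].mean())
print(median_from_summary, acres["Acres"].median())
```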


- **Value count of categorical columns**

In [44]:
park_df.State.value_counts()

AK            8
CA            7
UT            5
CO            4
FL            3
WA            3
AZ            3
TX            2
SD            2
HI            2
WY            1
MN            1
ME            1
ND            1
OR            1
SC            1
CA, NV        1
MT            1
NM            1
NV            1
MI            1
VA            1
KY            1
WY, MT, ID    1
OH            1
TN, NC        1
AR            1
Name: State, dtype: int64
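
The output of `value_counts()` is a Series indexed by the distinct values and sorted from most to least frequent, so it can be indexed and filtered like any other Series. A minimal sketch, using a short made-up list of state codes (`states` is hypothetical data, not taken from the file):

```python
import pandas as pd

# Hypothetical sample of state codes
states = pd.Series(["UT", "CA", "UT", "AK", "UT", "CA"], name="State")
counts = states.value_counts()

# Look up the count for one value by its label
print(counts["UT"])                        # 3

# The most frequent value comes first
print(counts.index[0])                     # UT

# Keep only the values that occur more than once
print(counts[counts > 1].index.tolist())   # ['UT', 'CA']
```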

## Exercise.D

**(D.1) Use the same dataset <code>diabetes.csv</code> in (C.1). What is the data type of each variable?**

In [8]:
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


**(D.2) Change the data type of <code>Outcome</code> to <code>object</code>.**

In [11]:
diabetes_df.Outcome = diabetes_df.Outcome.astype(object)
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB


**(D.3) Get the average of variable <code>Age</code>.**

In [16]:
diabetes_df.describe().loc["mean","Age"]   # same as diabetes_df.Age.mean()

33.240885416666664

**(D.4) Get the value count of variable <code>Outcome</code>.**

In [17]:
diabetes_df.Outcome.value_counts()

0    500
1    268
Name: Outcome, dtype: int64