# Pandas 2

**[1] Subset selection**
- Series<br>
- DataFrame<br>

**[2] Combining DataFrames**
- Concatenate<br>
- Merge<br>

In [None]:
import pandas as pd

## [1] Subset selection

### [1.1] Series 

In [None]:
s1 = pd.Series([4, 7, -5, 3], index = ["a", "b", "c", "d"])
s1

- **Select a single value**

In [None]:
s1["a"]

- **Select a set of values**

In [None]:
s1[["c", "a", "d"]]

- **Use a boolean array to filter data**

In [None]:
s1 > 2

In [None]:
s1[s1 > 2]

- **Use bitwise operators to combine conditions**

In [None]:
(s1 > 2) & (s1 < 5) 

In [None]:
s1[(s1 > 2) & (s1 < 5)]
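
The `&` operator above has two companions for element-wise logic: `|` (or) and `~` (not). The sketch below rebuilds `s1` so it runs on its own; the variable names `either` and `not_gt2` are illustrative only. Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

s1 = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "d"])

# OR: keep values that are below 0 or above 5
either = s1[(s1 < 0) | (s1 > 5)]
print(either)

# NOT: invert a condition, keeping values that are not greater than 2
not_gt2 = s1[~(s1 > 2)]
print(not_gt2)
```

The Python keywords `and`, `or`, and `not` do not work here, because they expect a single truth value rather than a boolean Series.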

## Exercise.A

**(A.1) The following pandas Series holds students' scores, with the student IDs as the index. Show the scores of students <code>S01</code> and <code>S03</code>.**

In [None]:
scores = pd.Series([7.0, 5.5, 9.0, 5.0, 7.5], index = ['S01', 'S02', 'S03', 'S04', 'S05'])
scores

**(A.2) Using the same Series as in (A.1), display the data for students who scored less than 6.**

### [1.2] DataFrame 

In [None]:
data = {"state": ["Ohio","Ohio","Ohio","Nevada","Nevada","Nevada"],
        "year":[2000,2001,2002,2001,2002,2003],
        "pop":[1.5,1.7,3.6,2.4,2.9,3.2]}

In [None]:
df = pd.DataFrame(data, index = ["a","b","c","d","e","f"])
df

- **Labels**

In [None]:
df.index

In [None]:
df.columns

- **Subset selection - <code>loc</code>**

In [None]:
df.loc[["b", "c", "d"], ["state", "year"]]

- **Subset selection - <code>iloc</code>**

In [None]:
df.iloc[1:4, 0:2]
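
One difference between the two accessors is easy to miss when slicing: a label slice with `loc` includes the end label, while a position slice with `iloc` excludes the end position. The sketch below rebuilds the same `df` so it runs on its own; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(data, index=["a", "b", "c", "d", "e", "f"])

# Label slice: "d" is included, so this returns rows b, c, d
by_label = df.loc["b":"d", ["state", "year"]]

# Position slice: position 4 is excluded, so this returns positions 1, 2, 3
by_position = df.iloc[1:4, 0:2]

# Both selections pick the same rows and columns
print(by_label.equals(by_position))
```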

- **Subset selection - column names**

In [None]:
df[["state", "year"]]   # same as df.loc[:,["state", "year"]]

- **Subset selection - positions**

In [None]:
df[1:4]                # same as df.iloc[1:4, :]

- **Subset selection - use a boolean array to filter data**

In [None]:
df[df.year > 2001]

In [None]:
df[df.state == "Ohio"]

- **Subset selection - use bitwise operators to combine conditions**

In [None]:
df[(df.state == "Ohio") & (df.year > 2000)]

## Exercise.B

**(B.1) Read the CSV file <code>diabetes.csv</code> using pandas. Display the first 5 rows.**

**(B.2) Select (display) column <code>BloodPressure</code> and column <code>BMI</code>.** 

**(B.3) Select rows with <code>BMI</code> greater than 50.**

**(B.4) Select the rows where either the <code>BMI</code> is greater than 50 or the <code>BloodPressure</code> is greater than 110.**

## [2] Combining DataFrames

### [2.1] Concatenation

In [None]:
df1 = pd.DataFrame({"col1":[1,2,3],"col2":[4,5,6],"col3":[7,8,9]}, index = ['a','b','c'])
df2 = pd.DataFrame({"col1":[11,22,33],"col2":[44,55,66],"col3":[77,88,99]},index = ['a','b','c'])
display(df1)
display(df2)

- **Concatenating dataframes (by default, axis = 0)**

In [None]:
pd.concat([df1, df2])

In [None]:
pd.concat([df1, df2], ignore_index = True)
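
The two calls above differ only in the resulting index. As a small sketch (rebuilding `df1` and `df2` so it runs on its own; `stacked` and `renumbered` are illustrative names), the default keeps the original labels, which leaves duplicates, while `ignore_index = True` replaces them with a fresh 0 to n-1 range.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]},
                   index=["a", "b", "c"])
df2 = pd.DataFrame({"col1": [11, 22, 33], "col2": [44, 55, 66], "col3": [77, 88, 99]},
                   index=["a", "b", "c"])

# Default: original labels are kept, so "a", "b", "c" each appear twice
stacked = pd.concat([df1, df2])
print(list(stacked.index))

# With duplicate labels, selecting "a" returns two rows
print(stacked.loc["a"])

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([df1, df2], ignore_index=True)
print(list(renumbered.index))
```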

- **Concatenating dataframes (axis = 1)**

In [None]:
pd.concat([df1,df2], axis = 1)

- **Concatenating dataframes with different columns**

In [None]:
df3 = pd.DataFrame({"col1":[1,2,3],"col2":[4,5,6],"col3":[7,8,9]}, index = ['a','b','c'])
df4 = pd.DataFrame({"col2":[11,22,33],"col3":[44,55,66],"col4":[77,88,99]},index = ['a','b','c'])
display(df3)
display(df4)

In [None]:
pd.concat([df3, df4], axis = 0)
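
When the column sets differ, `pd.concat` keeps the union of the columns by default (`join = "outer"`) and fills the missing cells with `NaN`. Passing `join = "inner"` keeps only the columns that both dataframes share. The sketch below rebuilds `df3` and `df4` so it runs on its own; `outer` and `inner` are illustrative names.

```python
import pandas as pd

df3 = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]},
                   index=["a", "b", "c"])
df4 = pd.DataFrame({"col2": [11, 22, 33], "col3": [44, 55, 66], "col4": [77, 88, 99]},
                   index=["a", "b", "c"])

# Default (outer): all four columns are kept, missing cells become NaN
outer = pd.concat([df3, df4], axis=0)
print(outer)

# Inner: only the shared columns col2 and col3 are kept
inner = pd.concat([df3, df4], axis=0, join="inner")
print(inner)
```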

## Exercise.C

**(C.1) Import the datasets <code>municipality_info_part1.csv</code> and <code>municipality_info_part2.csv</code> as dataframes. The columns in the two datasets are described as follows. Display the first five rows of each dataframe.**
- Municipality_number (object)
- Population (int)
- Area (float)

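Note: Use the <code>dtype</code> parameter to specify the data types. Reading <code>Municipality_number</code> as <code>object</code> keeps leading zeros (for example <code>0301</code>).<br>
<code>dtype = {"Municipality_number": object, "Population": int, "Area": float}</code>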

**(C.2) How many rows are there in each dataframe?**

**(C.3) Concatenate the two dataframes from (C.1) along the rows and assign the returned dataframe to a new variable named <code>mcp_info</code>.**

**(C.4) How many rows are there in the <code>mcp_info</code> dataframe?**

### [2.2] Merge

In [None]:
df1 = pd.DataFrame({'employID':['E011','E012','E013','E014','E015','E016','E017'], 
                    'name':['John','Diana','Matthew','Jerry','Kathy','Sara','Alex']})
df2 = pd.DataFrame({'employID':['E010','E012','E013','E015','E016','E017'], 
                    'birthday':['20-07','12-06','18-01','16-05','02-10','19-08']})

display(df1)
display(df2)

- **Left join**

In [None]:
pd.merge(df1, df2, how = 'left', on = 'employID' )

- **Inner join**

In [None]:
pd.merge(df1, df2, how = 'inner', on = 'employID' )

- **Outer join**

In [None]:
pd.merge(df1, df2, how = 'outer', on = 'employID' )
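
To see which rows each join keeps, one option is the `indicator` parameter of `pd.merge`, which adds a `_merge` column marking every row as `left_only`, `right_only`, or `both`. The sketch below rebuilds `df1` and `df2` so it runs on its own; `merged` is an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame({"employID": ["E011", "E012", "E013", "E014", "E015", "E016", "E017"],
                    "name": ["John", "Diana", "Matthew", "Jerry", "Kathy", "Sara", "Alex"]})
df2 = pd.DataFrame({"employID": ["E010", "E012", "E013", "E015", "E016", "E017"],
                    "birthday": ["20-07", "12-06", "18-01", "16-05", "02-10", "19-08"]})

# indicator=True adds a "_merge" column describing where each key was found
merged = pd.merge(df1, df2, how="outer", on="employID", indicator=True)
print(merged[["employID", "_merge"]])

# Compare the number of rows produced by each join type
for how in ["left", "inner", "outer"]:
    print(how, len(pd.merge(df1, df2, how=how, on="employID")))
```

On this data the left join returns 7 rows (every row of `df1`), the inner join 5 rows (IDs present in both), and the outer join 8 rows (all IDs from either side).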

## Exercise.D

**(D.1) Import the dataset <code>municipality_name.csv</code> as a dataframe named <code>mcp_name</code>. The columns in the dataset are described as follows.**<br>
- Municipality_number (object)
- Municipality_name (object)

Hint: Use the argument <code>encoding = "iso8859_10"</code> to specify the character encoding. 


**(D.2) The dataframe <code>mcp_info</code> obtained in (C.3) lacks the municipality names. Retrieve the <code>Municipality_name</code> column from the dataframe <code>mcp_name</code> and add it to <code>mcp_info</code> as a new column. Store the resulting dataframe in a new variable named <code>mcp_full_info</code>.**<br>
Expected result:

||Municipality_number|Population|Area|Municipality_name|
|--:|--:|--:|--:|--:|
|**0**|0301|673469|454.03|OSLO|
|**1**|1101|14898|431.66|EIGERSUND|
|**2**|...|...|...|...|

**(D.3) Using the dataframe <code>mcp_full_info</code> obtained in (D.2), list the five most populous municipalities.**