## Pandas 3
**[1] Missing data handling**<br>
- Pandas missing values<br>
- Detect missing values<br>
- Drop missing values<br>
- Fill in missing values<br>

**[2] Basic calculation**<br>
- Series<br>
- DataFrame<br>

**[3] Data aggregation**<br>
- GroupBy<br>
- pivot_table<br>
- crosstab<br>

In [None]:
import pandas as pd

## [1] Missing data handling

### [1.1] Pandas missing values

- **Python None object**

In [None]:
x = None
type(x)

- **Pandas NaN** 

In [None]:
mySeries = pd.Series([1, 2, None, 4])  # np.nan in place of None gives the same result (requires `import numpy as np`)
mySeries

In [None]:
type(mySeries[2])
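
The cells above can be summarised in a small sketch. It assumes a numeric Series with the default integer index; the variable name `s` is only illustrative. It shows that `None` is stored as `NaN`, which forces a float dtype, and that `NaN` should be tested with `isna()` rather than `==`.

```python
import pandas as pd

# None in a numeric list is stored as NaN, so the dtype becomes float64
s = pd.Series([1, 2, None, 4])
print(s.dtype)          # float64
print(s.isna().sum())   # 1

# NaN is not equal to itself, so == cannot be used to find it
print(s[2] == s[2])     # False
print(pd.isna(s[2]))    # True
```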

### [1.2] Detect missing values

In [None]:
sales_df = pd.DataFrame({"Q1":[176, 132, 157, None, 135, 108, None, 191, 191, 136], 
                         "Q2":[214, 180, 230, None, 185, 123, 213, 143, 245, 213], 
                         "Q3":[78, 232, 218, None, 175, 188, 129, None, 80, 71], 
                         "Q4":[219, 158, 188, None, 203, 163, None, 189, 242, 204]},
                           index = ["A","B","C","D","E","F","G","H","I","J"])
sales_df

- **Use <code>isna()</code> to detect if the value is missing**

In [None]:
sales_df.isna()

- **Use <code>isna()</code> to filter data**

In [None]:
sales_df[sales_df.Q1.isna()]
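
As a further sketch of filtering with `isna()`, the example below uses a small made-up frame (`df` and its values are assumptions, not the data above). It selects rows with a gap in any column, and uses `notna()`, the complement of `isna()`, to keep rows where a column is present.

```python
import pandas as pd

df = pd.DataFrame({"Q1": [176, None, 157], "Q2": [214, 180, None]},
                  index=["A", "B", "C"])

# Rows where at least one column is missing
rows_with_gap = df[df.isna().any(axis=1)]
print(rows_with_gap.index.tolist())   # ['B', 'C']

# notna() is the complement of isna(): keep rows where Q1 is present
q1_present = df[df.Q1.notna()]
print(q1_present.index.tolist())      # ['A', 'C']
```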

### [1.3] Drop missing values

- **Use <code>dropna()</code> to drop rows containing any NaNs**

In [None]:
sales_df.dropna() # same as dropna(how = "any")

- **Use <code>dropna()</code> to drop rows that are all NaNs**

In [None]:
sales_df.dropna(how = "all")

- **Use the argument <code>subset</code> to define in which columns to look for missing values**<br>

In [None]:
sales_df.dropna(subset = ['Q1', 'Q3'], how = "all")

- **Drop columns**

In [None]:
sales_df2 = pd.DataFrame({"Q1":[176, 132, 157, 212, 135, 108, 151, 191, 191, 136], 
                         "Q2":[214, 180, 230, 192, 185, 123, 213, 143, 245, 213], 
                         "Q3":[78, 232, 218, None, 175, 188, 129, None, 80, 71], 
                         "Q4":[219, 158, None, None, 203, 163, None, 189, 242, 204]},
                           index = ["A","B","C","D","E","F","G","H","I","J"])
sales_df2

In [None]:
sales_df2.dropna(axis = 1)
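
One point worth checking is that `dropna()` returns a new DataFrame and leaves the original unchanged. The sketch below uses a small made-up frame (the name `df` and its values are assumptions) and also combines `axis = 1` with `how = "all"` to drop only the columns that are entirely missing.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({"Q1": [176, 132, 157],
                   "Q2": [214, nan, 230],
                   "Q3": [nan, nan, nan]},
                  index=["A", "B", "C"])

# Drop every column that contains at least one NaN
no_gaps = df.dropna(axis=1)
print(no_gaps.columns.tolist())     # ['Q1']

# Drop only the columns in which all values are NaN
not_empty = df.dropna(axis=1, how="all")
print(not_empty.columns.tolist())   # ['Q1', 'Q2']

# The original frame still has all three columns
print(df.shape)                     # (3, 3)
```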

### [1.4] Filling in missing values

In [None]:
month_sales_df = pd.DataFrame({"Month":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],
                        "Sales":[266.0, 145.9, 183.1, None, 180.3, 168.5, 231.8, None, 192.8, 122.9, 203.4, 211.7]})
month_sales_df

- **Replace missing values with a constant**

In [None]:
month_sales_df.fillna(0)

- **Replace missing values with the last valid observation**

In [None]:
month_sales_df.ffill()  # fillna(method = "ffill") is deprecated in recent pandas versions

- **Replace missing values with the next valid observation**

In [None]:
month_sales_df.bfill()  # fillna(method = "bfill") is deprecated in recent pandas versions
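
Another common choice, not covered above, is to fill a gap with a statistic of the column itself. The sketch below is one possible approach on a short made-up Series (`sales` and its values are assumptions). It fills with the mean of the observed values and compares the result with a forward fill.

```python
import pandas as pd

sales = pd.Series([266.0, 145.9, None, 180.3],
                  index=["Jan", "Feb", "Mar", "Apr"])

# mean() skips the NaN, so it averages the three observed values
mean_filled = sales.fillna(sales.mean())
print(round(mean_filled["Mar"], 1))   # 197.4

# Forward fill copies the last valid observation (Feb) into Mar
forward_filled = sales.ffill()
print(forward_filled["Mar"])          # 145.9
```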

## Exercise.A

**(A.1) Import the dataset <code>melbourne.csv</code>. Show the number of rows and columns.**

**(A.2) Delete every row that contains any NaN. Show the number of rows and columns in the updated dataframe.**

**(A.3) Use the dataframe in (A.1). Calculate the number of missing values in each column.**

**(A.4) Use the dataframe in (A.1).  The <code>Car</code> column records the number of parking spaces for each property. What is the average number of parking spaces in this dataset?**<br>
Hint: <code>df.column.mean()</code>

**(A.5) Use the dataframe in (A.1). Fill the missing values in the <code>Car</code> column with 0 and apply this change directly to the dataframe. What is the average number of parking spaces now?**<br>

## [2] Basic Calculation

### [2.1] Series: Calculate statistics

- **Without missing values**

In [None]:
s1 = pd.Series([1,2,3,4,5])
print(s1.mean())
print(s1.sum())
print(s1.max())
print(s1.count())

- **With missing values**

In [None]:
s2 = pd.Series([1,2,3,4,None])
print(s2.mean())
print(s2.sum())
print(s2.max())
print(s2.count())
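
The difference between the two cells comes from the default `skipna = True`: the statistics ignore `NaN`, and `count()` counts only the non-missing values. A minimal sketch, with `s` as an illustrative name:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, None])

# By default NaN is skipped: the mean is 10 / 4
print(s.mean())               # 2.5

# With skipna=False a single NaN makes the result NaN
print(s.mean(skipna=False))   # nan

# count() ignores NaN, len() does not
print(s.count(), len(s))      # 4 5
```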

### [2.2] DataFrame: Create a new column derived from existing columns

In [None]:
df_expense = pd.DataFrame({'grocery':[3050, 2800, 2750, 2300, 3150, 2900],
                         'transportation':[1050, 900, 1150, 1850, 1250, 950],
                         'dining out':[1200, 1950, 1350, 3250, 1050, 2800],
                         'entertainment':[1250, 1050, 2500, 3150, 2000, 1050]},
                        index =['Jan','Feb','Mar','Apr','May','Jun'])
df_expense

- **Example-1: Calculate the sum of all the values in each row (<code>axis = 1</code> sums across the columns)**

In [None]:
df_expense["total_expense"] = df_expense.sum(axis = 1)
df_expense

- **Example-2: Calculate the sum of the values of two columns**

In [None]:
df_expense['necessary_expense']  = df_expense.grocery + df_expense.transportation
df_expense
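
Attribute access such as `df_expense.grocery` only works when the column name is a valid Python identifier. A column like `dining out` contains a space, so it needs bracket notation. The sketch below uses a cut-down, made-up version of the frame; `df` and `food_expense` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"grocery": [3050, 2800],
                   "dining out": [1200, 1950]},
                  index=["Jan", "Feb"])

# Bracket notation works for any column name, including names with spaces
df["food_expense"] = df["grocery"] + df["dining out"]
print(df["food_expense"].tolist())   # [4250, 4750]
```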

- **Example-3: Calculate the percentage**

In [None]:
df_expense['necessary_expense(%)'] =  round((df_expense.necessary_expense/ df_expense.total_expense)*100,2)
df_expense

- **Example-4: Convert continuous data to categorical data using <code>cut()</code>.**

In [None]:
score_df = pd.DataFrame({"score":[75,60,40,100,85,55,65,20,70,0]})
score_df

In [None]:
score_df["grade"] = pd.cut(score_df.score, bins = [0,34,44,54,64,74,100], 
                           labels = ["F","E","D","C","B","A"],
                           include_lowest= True)
score_df
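
To see where the boundaries fall, the sketch below applies the same bins to a few edge scores (the Series `scores` is made up for illustration). Each bin includes its right edge, and `include_lowest = True` makes the first bin also include 0. Without it, a score of 0 falls outside every bin and becomes `NaN`.

```python
import pandas as pd

scores = pd.Series([0, 34, 35, 64, 65, 100])
bins = [0, 34, 44, 54, 64, 74, 100]
labels = ["F", "E", "D", "C", "B", "A"]

# Bins are closed on the right: 34 is still an F, 35 is an E
grades = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)
print(grades.tolist())   # ['F', 'F', 'E', 'C', 'B', 'A']

# Without include_lowest, the left edge 0 is excluded from the first bin
strict = pd.cut(scores, bins=bins, labels=labels)
print(strict.isna().tolist())   # [True, False, False, False, False, False]
```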

## Exercise.B

In [None]:
import random
random.seed(0)
product_df = pd.DataFrame({"Q1":[random.randint(100,150) for i in range(5)],"Q2":[random.randint(100,150) for i in range(5)],
                           "Q3":[random.randint(120,160) for i in range(5)],"Q4":[random.randint(120,160) for i in range(5)]}, index = ["A","B","C","D","E"])
product_df

**(B.1) Given the dataframe <code>product_df</code> above, each column represents quarterly sales, and the index is the product name. Calculate the annual sales of each product.**<br>
Expected output:

||Q1|Q2|Q3|Q4|Annual|
|--:|--:|--:|--:|--:|--:|
|**A**|124|132|150|128|534|
|**B**|...|...|...|...|...|
|**C**|...|...|...|...|...|

**(B.2) Calculate quarterly average sales for each product.**<br>
Expected output:

||Q1|Q2|Q3|Q4|Annual|Avg_Q|
|--:|--:|--:|--:|--:|--:|--:|
|**A**|124|132|150|128|534|133.50|
|**B**|...|...|...|...|...|...|
|**C**|...|...|...|...|...|...|

**(B.3) For each product, calculate first-half sales as a percentage of annual sales. Round numbers to two decimal places.**<br>
Expected output:


||Q1|Q2|Q3|Q4|Annual|Avg_Q|H1(%)|
|--:|--:|--:|--:|--:|--:|--:|--:|
|**A**|124|132|150|128|534|133.50|48.94|
|**B**|...|...|...|...|...|...|...|
|**C**|...|...|...|...|...|...|...|

## [3] Data aggregation

In [None]:
sales_df = pd.DataFrame({ "Product":["A","A","A","A","A","A","B","B","B","B","B","B"],
                           "Quarter":["Q1","Q1","Q1","Q2","Q2","Q2","Q1","Q1","Q1","Q2","Q2","Q2"],
                           "Month":["Jan","Feb","Mar","Apr","May","Jun","Jan","Feb","Mar","Apr","May","Jun"],
                           "Sales":[67, 57, 87, 50, 97, 68, 78, 102, 113, 98, 80, 84]})
sales_df 

- **The method <code>groupby()</code> returns a GroupBy object.**

In [None]:
sales_gb_product = sales_df.groupby("Product") 
type(sales_gb_product)

- **GroupBy object - attribute**

In [None]:
sales_gb_product.groups

- **GroupBy object - methods**

In [None]:
sales_gb_product.size()

In [None]:
sales_gb_product.mean(numeric_only = True)  # skip the text columns Quarter and Month

In [None]:
sales_gb_product.max()

In [None]:
sales_gb_product.Sales.max()

In [None]:
sales_gb_product.Sales.agg(['mean','min','max'])
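
If the result columns should have descriptive names, `agg()` also accepts keyword arguments of the form `new_name = (column, function)`. A minimal sketch on made-up data; the frame `df` and the names `avg_sales` and `top_sales` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Product": ["A", "A", "B", "B"],
                   "Sales": [60, 80, 100, 90]})

# Each keyword names an output column: (input column, aggregation)
summary = df.groupby("Product").agg(avg_sales=("Sales", "mean"),
                                    top_sales=("Sales", "max"))
print(summary.loc["A", "avg_sales"])   # 70.0
print(summary.loc["B", "top_sales"])   # 100
```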

- **GroupBy based on two keys**

In [None]:
# Step1: Get a groupby object
sales_gb_product_Q = sales_df.groupby(['Product','Quarter'])

# Step2: Calculate the sum within each group
sales_gb_product_Q.Sales.sum()

In [None]:
# Write the above steps in one line
sales_df.groupby(['Product','Quarter']).Sales.sum()

- **Other methods: (1) <code>pivot_table()</code>**

In [None]:
sales_df.pivot_table(index = "Product", 
                     columns = "Quarter", 
                     values = "Sales", 
                     aggfunc = "sum")  # the string name avoids a warning about the builtin sum in recent pandas

- **Other methods: (2) <code>crosstab()</code>**

In [None]:
pd.crosstab(index = sales_df.Product, 
            columns = sales_df.Quarter, 
            values = sales_df.Sales, 
            aggfunc = "sum")  # the string name avoids a warning about the builtin sum in recent pandas
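
The three approaches in this section produce the same numbers; they differ mainly in the shape of the result. The sketch below uses a small made-up frame (`df` and its values are assumptions). It reshapes the two-key `groupby()` result with `unstack()` and compares it with `pivot_table()` and `crosstab()`.

```python
import pandas as pd

df = pd.DataFrame({"Product": ["A", "A", "A", "B", "B", "B"],
                   "Quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
                   "Sales": [10, 5, 20, 30, 40, 1]})

# groupby returns a Series with a two-level index; unstack moves Quarter to the columns
by_group = df.groupby(["Product", "Quarter"]).Sales.sum().unstack()

by_pivot = df.pivot_table(index="Product", columns="Quarter",
                          values="Sales", aggfunc="sum")

by_crosstab = pd.crosstab(index=df.Product, columns=df.Quarter,
                          values=df.Sales, aggfunc="sum")

print(by_group.values.tolist())      # [[15, 20], [30, 41]]
print(by_pivot.values.tolist())      # [[15, 20], [30, 41]]
print(by_crosstab.values.tolist())   # [[15, 20], [30, 41]]
```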

## Exercise.C

**(C.1) Given the following dataframe, what is the highest score for the midterm exam and the highest score for the final exam?**

In [None]:
exam_df = pd.DataFrame({"ID":["S01","S02","S03","S04","S01","S02","S03","S04"],
                       "Exam":["midterm","midterm","midterm","midterm","final","final","final","final"],
                       "Score":[79, 56, 75, 93, 73, 73, 65, 87]})
exam_df                   

**(C.2) The final grade is calculated from the average of the midterm and final exam scores. Calculate final grades for all students.**

**(C.3) The data below records quarterly sales for two stores in Bergen and Oslo. 
What are the total annual sales in 2019 and 2020?**

In [None]:
product_df = pd.DataFrame({"Year":["2019","2019","2019","2019","2019","2019","2019","2019","2020","2020","2020","2020","2020","2020","2020","2020"],
                           "Quarter":["Q1","Q2","Q3","Q4","Q1","Q2","Q3","Q4","Q1","Q2","Q3","Q4","Q1","Q2","Q3","Q4"],
                       "Location":["Oslo","Oslo","Oslo","Oslo","Bergen","Bergen","Bergen","Bergen","Oslo","Oslo","Oslo","Oslo","Bergen","Bergen","Bergen","Bergen"],
                       "Sales":[136, 146, 147, 214, 178, 188, 210, 111, 203, 100, 144, 197, 177, 100, 189, 194]})
product_df    

**(C.4) What are the annual sales of the two stores in 2019 and 2020?**