In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("Query_Groupby.ipynb")

In [None]:
import pandas as pd
import numpy as np

# <span style="color:red">Treating Pandas Series as a Numpy array</span>

Pandas and NumPy are designed to be compatible with each other, which means that, in most cases, you can apply NumPy functions to a pandas series. 

For example: Consider the `Housing Conditions in Copenhagen` dataset. This is a table classifying 1681 residents of twelve areas in Copenhagen in terms of:
- **housing**: the type of housing they had (tower blocks, apartments, atrium houses and terraced houses)
- **influence**: their feeling of influence on apartment management (low, medium, high)
- **contact**: their degree of contact with neighbours (low, high)
- **satisfaction**: their satisfaction with housing conditions (low, medium, high)
- **n**: the number of cases/residents



In [None]:
copen = pd.read_csv('data/copen.dat', sep='\\s+') 
copen.head()

`copen.loc[:, 'n']` returns a pandas series with the values of the column `n`. You can treat the returned series like a NumPy array and apply the `np.sum` function to find the sum of all values in that column. 
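
A minimal self-contained sketch of the same idea, using a made-up series named `counts` rather than the `copen` data (the values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from copen.dat)
counts = pd.Series([21, 34, 10, 5], name="n")

total = np.sum(counts)      # NumPy function applied directly to a series
average = np.mean(counts)   # other NumPy reductions work the same way

print(total, counts.sum())  # both routes give the same total
print(average)
```

The equivalent series methods (`counts.sum()`, `counts.mean()`) return the same values, so either style can be used.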

In [None]:
np.sum(copen.loc[:, 'n'])

Similarly, we know that when an operator combines a NumPy array with a scalar value, the operation is applied to each element of the array. 
The same is true when an operator combines a pandas series with a scalar value. 
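
As a small sketch of this element-wise behavior (the array `values` and the series built from it are toy data assumed for illustration):

```python
import numpy as np
import pandas as pd

values = np.array([5, 20, 35])          # toy NumPy array
series = pd.Series(values, name="n")    # the same numbers as a series

array_mask = values > 20    # comparison is applied element by element
series_mask = series > 20   # same behavior, result is a boolean series
doubled = series * 2        # arithmetic operators behave the same way

print(array_mask)
print(series_mask)
print(doubled)
```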

For example: 

In [None]:
copen.loc[:, 'n'] > 20 

# <span style="color:red">Selecting rows based on conditions</span>

In this notebook, we will explore a few more methods and behaviors of the pandas dataframe that will help us in our data exploration. 

So far, we have learnt how to select rows and columns using the `loc` and `iloc` properties. Now, we will look at selecting rows based on a condition. Such conditions are expressed as boolean expressions. We can select the matching rows using boolean masking or using the `query` method. 

For example: given the `copen.dat` dataset, suppose we want to select all the rows where residents lived in an apartment. 

### <span style="color:green">Option 1: using boolean masking</span>

Step 1: Create a boolean mask from the condition (in this example, the condition is that housing is apartments). <br>
Step 2: Apply this mask to the entire dataframe. This selects only those rows where the mask is True.  <br>
Final result: Step 2 returns a pandas dataframe containing the matching rows, even if only one row has residents living in apartments. 
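
The two steps can be sketched on a tiny made-up table. The dataframe `toy` below is hypothetical; it only borrows the column names `housing` and `n` from `copen`:

```python
import pandas as pd

# Toy table (hypothetical values) with the same column names as copen
toy = pd.DataFrame({
    "housing": ["tower", "apartments", "atrium", "apartments"],
    "n": [21, 61, 13, 43],
})

# Step 1: build the boolean mask, one True/False value per row
mask = toy["housing"] == "apartments"

# Step 2: apply the mask; only rows where the mask is True are kept
selected = toy[mask]

print(mask.tolist())
print(selected)
```

Note that the selected rows keep their original row labels, which is why the mask and the dataframe line up row by row.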

<img src='pics/maskingonDF.jpg' width=1000/>

In [None]:
from IPython.display import HTML

HTML("""
    <video width="500" height="300" controls>
        <source src="videos/query.mp4" type="video/mp4" width=400>
    </video>
""")

In [None]:
from IPython.display import HTML

HTML("""
    <video width="500" height="300" controls>
        <source src="videos/query2.mp4" type="video/mp4" width=400>
    </video>
""")

In [None]:
# Step1: Generate the boolean mask where housing is set to apartments
mask = (copen.loc[:, 'housing'] == 'apartments') # or simply (copen['housing'] == 'apartments')
mask

In [None]:
# Step2: Apply the mask on the dataframe; this will select the rows where mask is True
copen[mask] 

### <span style="color:green">Option 2: using query method</span> 

We could have simplified the above selection using the `query` method. The `query` method takes a string as its input argument. This string is the boolean expression describing the condition you would like to query the dataframe on. 
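
Here is a small sketch of how such query strings are written, using a hypothetical dataframe `toy` and a hypothetical threshold variable `min_n` (neither is part of the notebook's data):

```python
import pandas as pd

# Toy table (hypothetical values)
toy = pd.DataFrame({
    "housing": ["tower", "apartments", "atrium", "apartments"],
    "contact": ["low", "high", "high", "low"],
    "n": [21, 61, 13, 43],
})

# Column names are written bare; string values need their own quotes
apartments = toy.query("housing == 'apartments'")

# Conditions can be combined with `and` / `or` inside the string
apartments_high = toy.query("housing == 'apartments' and contact == 'high'")

# A Python variable can be referenced by prefixing its name with @
min_n = 40
large = toy.query("n > @min_n")

print(apartments)
print(apartments_high)
print(large)
```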

For example: 

In [None]:
query_str = ("housing == 'apartments'")
copen.query(query_str)

### Question 1

Find the total number of residents who lived in `apartments`. Store your result in a variable, `your_ans`. 

Hint: Think of the steps you need to go through to reach the answer. 
1. Select all the rows where housing is apartments
2. Select the column name 'n'
3. Sum all the values of that column to get the total number of residents

#### (a): Solve the question using boolean masking. 

In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q1(a)")

#### (b): Solve the question using query method. 

In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q1(b)")

### Question 2

Find the number of residents who have a high degree of contact with their neighbors. Store your answer in the variable `your_ans`. 

Think of the steps you would need to reach the solution. 

#### (a): Solve the question using boolean masking. 

In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q2(a)")

#### (b): Solve the question using query method. 

In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q2(b)")

### Question 3

Find the number of residents who lived in apartments and had high contact with their neighbors. Store your answer in the variable `your_ans`. 

Solve this question using boolean masking. 

In [None]:
your_ans = ...
your_ans

In [None]:
your_ans.item() == 448

#### Solving Question 3 using query method

In [None]:
high_contact_apartments = copen.query('housing == "apartments" and contact == "high"')
your_ans = np.sum(high_contact_apartments['n'])
your_ans

### Question 4

Find the number of residents who lived in apartments and had high contact with their neighbors, but had low satisfaction with their housing conditions. Store your result in the variable `your_ans`. 

#### (a) Solve your question with boolean masking 

In [None]:




your_ans = ...
your_ans

In [None]:
grader.check("q4")

#### (b) Solve it using query method

In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q4(b)")

# <span style="color:red">Pandas Groupby</span>

During data exploration, it's often useful to analyze a dataset by dividing it into groups. For example, if we have data on Netflix users, including their age, the genres of movies they've watched in the past six months, their gender, and geographic locations, we might want to investigate which movie genres are most popular among users. To do this, we could group Netflix users based on the genres they've watched. For instance, we could look at which users watched action movies and which ones watched horror films.

The pandas `groupby` method allows us to accomplish such tasks, typically by following these steps:
1. **Splitting** the data into groups based on some criteria. 
2. **Applying** a function to each group independently.
3. **Combining** the results into a data structure (typically a dataframe). 
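
The three steps above can be sketched by hand and compared with `groupby`. The dataframe `users` is made-up data that loosely follows the Netflix example, and the manual loop is only an illustration of the idea, not how pandas implements it:

```python
import pandas as pd

# Toy data (hypothetical), one row per user
users = pd.DataFrame({
    "genre": ["action", "horror", "action", "horror", "action"],
    "age": [20, 30, 40, 50, 30],
})

# Manual version of the three steps
pieces = {}
for genre in users["genre"].unique():
    group = users[users["genre"] == genre]   # 1. split
    pieces[genre] = group["age"].mean()      # 2. apply
manual = pd.Series(pieces).sort_index()      # 3. combine

# The same computation expressed with groupby
grouped = users.groupby("genre")["age"].mean()

print(manual)
print(grouped)
```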

Let's understand it further through an example. Suppose you are given the [US Congress Legislators](https://github.com/unitedstates/congress-legislators?tab=readme-ov-file) dataset, consisting of the currently serving members of Congress. 

In [None]:
columns = ["full_name","birthday","gender","type","state","party"]
congress_members = pd.read_csv('data/legislators-current.csv', usecols=columns)
congress_members.head()

Side note: We can check which unique values exist in a column using the NumPy `unique` function. 
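
A short sketch on a made-up series (the name `genders` and its values are assumptions for illustration). Passing `return_counts=True` additionally reports how often each value occurs:

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values)
genders = pd.Series(["M", "F", "M", "M", "F"])

unique_values = np.unique(genders)                       # sorted distinct values
values, counts = np.unique(genders, return_counts=True)  # values plus frequencies

print(unique_values)
print(values, counts)
```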

In [None]:
np.unique(congress_members['gender'])

Let’s say we want to count the total number of male and female congress members. To achieve this, we would follow these steps:
1. **Split** the dataframe into groups based on gender.
2. **Apply** the count function to each group to determine the total number of rows, where each row represents an individual congress member. This will give the total count for each gender.
3. **Combine** the results from each group into a final dataframe.
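
One detail about step 2 is worth a small sketch: `count` counts the non-missing values in each column, so it matches the number of rows only when the counted column has no missing entries, whereas `size` counts every row. The dataframe `members` below is made-up data with one deliberately missing name:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with one missing full_name
members = pd.DataFrame({
    "gender": ["F", "M", "M", "F", "M"],
    "full_name": ["Ann", "Bob", np.nan, "Dee", "Eli"],
})

by_gender = members.groupby("gender")

non_missing = by_gender["full_name"].count()  # skips the missing name
rows = by_gender.size()                       # counts every row in the group

print(non_missing)
print(rows)
```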

<img src="pics/groupby.png" width=900/>

In [None]:
from IPython.display import HTML

HTML("""
    <video width="500" height="300" controls>
        <source src="videos/groupby.mp4" type="video/mp4" width=400>
    </video>
""")

In [None]:
# Look for rows with a missing full_name, since count() ignores missing values
congress_members[congress_members['full_name'].isna()]

In [None]:
female = congress_members.query("gender == 'F'")
female.shape[0]

In [None]:
male = congress_members.query("gender == 'M'")
male.shape[0]

In [None]:
# Step 1: split the rows into groups by gender
grouped = congress_members[['gender', 'full_name']].groupby('gender')

In [None]:
grouped.get_group('F')

In [None]:
# Step 2 and Step 3: apply count to each group and combine the results
grouped.count()

In the resulting dataframe, there is one column called `full_name`, and `gender` is used as the row index. The dataframe has two row labels, one for each value of the column used to split the dataset.

If you want the row index labels to also appear as a column, you can do the following:

```
congress_members[['gender', 'full_name']].groupby('gender', as_index=False)
```

In [None]:
grouped = congress_members[['gender', 'full_name']].groupby('gender', as_index=False)
result = grouped.count()
result

If you look at the above dataframe, you'll see that the column names, like full_name, have been preserved during the splitting process. To give the full_name column a more meaningful name in the resulting dataframe, you can easily rename the column.

In [None]:
result = result.rename(columns = {'full_name': 'Total Congress Members'})
result

### Question 5

Write Python code to calculate the total number of congress members for each political party. Store the result in a variable named `your_ans`. The resulting dataframe, `your_ans`, will have two columns: `party`, which represents the parties (Democrat, Independent, and Republican), and `Total Congress Members`, which shows the count of members from each party.

In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q5")

### Question 6

Write Python code to count the total number of congress members from each party in the state of Texas. Store your result in the variable `your_ans`. 

In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q6")

In the **Apply** step, we usually do one of the following:

**Aggregation**: compute a summary statistic (or statistics) for each group. For example:
- Compute group sums or means.
- Compute group sizes / counts.

A complete list of built-in aggregation functions can be found here: [Built-in aggregation methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#built-in-aggregation-methods).

**Transformation**:  perform some group-specific computations and return a like-indexed object. For example: 
- Standardize data (zscore) within a group

**Filtering**: discard some groups, according to a group-wise computation that evaluates to True or False. For example: 
- Discard data that belong to groups with only a few members.
- Filter out data based on the group sum or mean.



### Apply user-defined function on a groupby (using agg)

You can apply your own aggregation function. Keep in mind, however, that your function should return a single value for each group. For example, suppose we want to write our own aggregation function to find the mean of each group. 

Given the dataframe `employees`, find the average age (rounded to two decimal places) of people of each gender. 

In [None]:
employees = pd.DataFrame({'Name': ['Tyler', 'Kyla', 'Kevin', 'Cynthia', 'Bailey'],
                  'Age': [45, 32, 18, 59, 22],
                  'Sex': ['M', 'F', 'M', 'F', 'F'], 
                  'Salary': [100133, 59599, 86747, 98494, 103056]})
employees

In [None]:
# Solution 1: Using in-built function

grouped = employees[['Age', 'Sex']].groupby('Sex', as_index=False)
ans = grouped.mean()
ans

In [None]:
# Solution 2: Using user-defined function

def findAvg(group):
    avg_age = np.mean(group)
    return np.round(avg_age, 2)

In [None]:
grouped = employees[['Age', 'Sex']].groupby('Sex', as_index=False)
ans = grouped.agg(findAvg)
ans

The function `findAvg` is called for **each group**, and its body is executed on each column of the group separately, so every column of every group is reduced to a single value.
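
To see what the function actually receives, here is a small sketch that records the type of its argument. The dataframe `people`, the function `rounded_mean`, and the set `seen_types` are hypothetical names introduced for illustration; the exact number of calls pandas makes is an implementation detail, so the sketch does not rely on it:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
people = pd.DataFrame({
    "Sex": ["M", "F", "M", "F"],
    "Age": [40, 30, 20, 50],
    "Salary": [100.0, 200.0, 300.0, 400.0],
})

seen_types = set()

def rounded_mean(values):
    # Record what kind of object pandas hands to the function
    seen_types.add(type(values).__name__)
    # Reduce the values to a single number, as an aggregation must
    return np.round(np.mean(values), 2)

result = people.groupby("Sex").agg(rounded_mean)

print(seen_types)
print(result)
```

With this usage the function is expected to receive one column of one group at a time as a series, which is why a single function can summarize several columns.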
For example: 

In [None]:
grouped = employees[['Age', 'Sex', 'Salary']].groupby('Sex', as_index=False)
ans = grouped.agg(findAvg)
ans

<img src="pics/agg.png" width=800 />

### Transforming groups 

The [built-in transformation methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#built-in-transformation-methods) can be applied to transform the values in each group. Each transformed group has the same index as the original group. 
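
A minimal sketch of why the shared index matters, standardizing values within each group (a z-score). The dataframe `scores` and its columns are made-up data; passing the names `"mean"` and `"std"` to `transform` broadcasts each group's statistic back to that group's rows:

```python
import pandas as pd

# Toy data (hypothetical values)
scores = pd.DataFrame({
    "team": ["a", "b", "a", "b"],
    "points": [10.0, 20.0, 30.0, 60.0],
})

by_team = scores.groupby("team")["points"]

summary = by_team.mean()                 # aggregation: one value per group
group_mean = by_team.transform("mean")   # transformation: one value per row
group_std = by_team.transform("std")

# Because the index matches, the results combine row by row with the input
scores["zscore"] = (scores["points"] - group_mean) / group_std

print(summary)
print(scores)
```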

For example: 

Consider the `employees` dataframe, and let's replace the `Age` column with its cumulative sum within each gender group. 

In [None]:
employees

In [None]:
grouped = employees[['Age', 'Sex']].groupby('Sex', as_index=False)
grouped.cumsum()

You can also apply a user-defined function to each group using the `transform` method. For example, suppose we want to subtract the group mean from each value of the `Age` column. 

In [None]:
def my_transform(x):
    return x - np.mean(x)

In [None]:
grouped = employees[['Age', 'Sex']].groupby('Sex', as_index=False)
grouped.transform(my_transform)

In [None]:
grouped = employees[['Age', 'Sex', 'Salary']].groupby('Sex', as_index=False)
grouped.transform(my_transform)

<img src="pics/trans.png" width=800 />

### Filtering groups

The `filter` method takes a User-Defined Function (UDF) that, when applied to an entire group, returns either `True` or `False`. The result of the filter method is then the subset of groups for which the UDF returned `True`.
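
As a sketch of this, the following keeps only the groups that have at least two members. The dataframe `pets` and the function `has_at_least_two` are hypothetical names used for illustration:

```python
import pandas as pd

# Toy data (hypothetical values)
pets = pd.DataFrame({
    "kind": ["cat", "dog", "cat", "bird", "cat", "dog"],
    "age": [3, 5, 7, 2, 4, 9],
})

def has_at_least_two(group):
    # Receives all rows of one group as a dataframe and returns True or False
    return len(group) >= 2

kept = pets.groupby("kind").filter(has_at_least_two)

print(kept)
```

The result contains the original rows of the surviving groups, not one summary row per group.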


For example, suppose we want to keep only the groups whose average age is greater than 35 in the `employees` dataframe. Since the mean age is 31.50 for males and 37.67 for females, the rows for males are filtered out. 

<img src="pics/filteringGroupby.jpg" width=900/>

In [None]:
def filter_age(x):
    return np.mean(x['Age']) > 35

In [None]:
grouped = employees[['Age', 'Sex', 'Salary']].groupby('Sex', as_index=False)
grouped.filter(filter_age)