In [1]:
import pandas as pd

df = pd.read_csv("ads (1).csv")

## Subsetting by group

- It is very common to select only parts of a dataset for analysis. 
- Examples?

## Subsetting by group in pandas

- Pandas gives you a few ways to do this. 
- The one we will focus on is called "boolean indexing"
- For those coming from SQL, check out [query](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html), which lets you write the filter as a SQL-like string expression
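
To make the comparison concrete, here is a minimal sketch of the same filter written both ways. The `toy` DataFrame is hypothetical; it only borrows the `ad` and `click` column names from the ads data.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the ads dataset used in this notebook
toy = pd.DataFrame({
    "ad": ["ingredients", "cost", "cost", "ingredients"],
    "click": [True, False, True, False],
})

# Boolean indexing: build a True/False mask, then pass it to the DataFrame
by_mask = toy[toy["ad"] == "cost"]

# DataFrame.query: the same filter written as a string expression
by_query = toy.query("ad == 'cost'")

# Both approaches keep the same rows
print(by_mask.equals(by_query))
```

Both forms select the same rows; which one reads better is mostly a matter of taste.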

## Selecting a column 

- Recall that to select a column (a pandas `Series`), you use this syntax:

In [2]:
df["ad"]

0      ingredients
1      ingredients
2      ingredients
3      ingredients
4             cost
          ...     
995           cost
996           cost
997           cost
998           cost
999           cost
Name: ad, Length: 1000, dtype: object

## From selection to Boolean indexing

- Comparing a column to a value returns one True/False per row; that result is what Boolean indexing uses

In [3]:
df["ad"] == "cost"

0      False
1      False
2      False
3      False
4       True
       ...  
995     True
996     True
997     True
998     True
999     True
Name: ad, Length: 1000, dtype: bool

What is that? 

In [4]:
ix = df["ad"] == "cost"
type(ix)

pandas.core.series.Series
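
Because the mask is an ordinary Series of True/False values, you can inspect it like any other Series. A small sketch, using a hypothetical `toy` DataFrame rather than the ads data:

```python
import pandas as pd

# Hypothetical toy data standing in for the "ad" column
toy = pd.DataFrame({"ad": ["ingredients", "cost", "cost", "ingredients", "cost"]})

mask = toy["ad"] == "cost"

print(mask.dtype)   # the mask holds booleans
print(mask.sum())   # True counts as 1, so this is the number of matching rows
print(mask.mean())  # share of rows that match
```

Summing the mask is a quick way to check how many rows a filter will keep before you apply it.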

What is this doing?

In [5]:
df[ix]

Unnamed: 0.1,Unnamed: 0,ad,click,age
4,4,cost,True,
7,7,cost,False,
9,9,cost,True,57.0
12,12,cost,False,49.0
14,14,cost,True,
...,...,...,...,...
995,995,cost,False,33.0
996,996,cost,False,
997,997,cost,True,
998,998,cost,False,64.0


In [6]:
## another way to write it 

df[df["ad"] == "cost"]

Unnamed: 0.1,Unnamed: 0,ad,click,age
4,4,cost,True,
7,7,cost,False,
9,9,cost,True,57.0
12,12,cost,False,49.0
14,14,cost,True,
...,...,...,...,...
995,995,cost,False,33.0
996,996,cost,False,
997,997,cost,True,
998,998,cost,False,64.0


### Where is the for loop?
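
One way to think about it: the comparison is applied to every row for you. The sketch below, on hypothetical toy data, writes the loop out by hand and checks that it keeps the same rows as the Boolean index.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the ads dataset
toy = pd.DataFrame({
    "ad": ["ingredients", "cost", "cost"],
    "age": [31.0, 57.0, 49.0],
})

# Explicit loop: test each value and remember the index labels that match
keep = []
for label, value in toy["ad"].items():
    if value == "cost":
        keep.append(label)
looped = toy.loc[keep]

# Boolean indexing: pandas does the row-by-row comparison internally
vectorized = toy[toy["ad"] == "cost"]

print(looped.equals(vectorized))
```

The Boolean version is shorter and is usually much faster on large data, since the comparison runs inside pandas rather than in a Python loop.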

# Combining Boolean indexes? 

In [7]:
# What do you notice about this syntax?

df[(df["ad"] == "cost") & (df["age"] < 30)]

Unnamed: 0.1,Unnamed: 0,ad,click,age
35,35,cost,False,18.0
50,50,cost,False,19.0
68,68,cost,True,26.0
91,91,cost,False,22.0
97,97,cost,False,21.0
...,...,...,...,...
913,913,cost,True,24.0
935,935,cost,True,27.0
957,957,cost,False,20.0
971,971,cost,False,17.0
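
A short sketch of the combining operators, on hypothetical toy data. Masks are combined with `&` (and), `|` (or), and `~` (not). Python's `and`/`or` raise a ValueError on a Series, and each comparison needs its own parentheses because `&` and `|` bind more tightly than `==` and `<`. Naming the masks first, as below, is one way to sidestep the parentheses.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the ads dataset
toy = pd.DataFrame({
    "ad": ["cost", "cost", "ingredients", "ingredients"],
    "age": [22.0, 57.0, 19.0, 40.0],
})

# Name each condition so the combined filter reads clearly
is_cost = toy["ad"] == "cost"
under_30 = toy["age"] < 30

both = toy[is_cost & under_30]     # cost ad AND under 30
either = toy[is_cost | under_30]   # cost ad OR under 30
not_cost = toy[~is_cost]           # everything except the cost ad

print(both)
print(either)
print(not_cost)
```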


### Compute the mean age for those who clicked on the cost ad

- Do you notice any problems? Welcome to data analysis 😀

In [16]:
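# numeric_only=True skips text columns such as "ad", which newer pandas versions refuse to average
df[(df["ad"] == "cost") & (df["click"] == True)].mean(numeric_only=True).reset_index()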


Unnamed: 0,index,0
0,Unnamed: 0,494.076389
1,click,1.0
2,age,45.520548


In [34]:
# how the professor did it: filter in two steps, then average only the age column

cost_df = df[df["ad"] == "cost"]
cocli_df = cost_df[cost_df["click"] == True]
cocli_df["age"].mean()

44.732375
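
A compact alternative, sketched on hypothetical toy data: `.loc` accepts a mask and a column name together, so the filter and the column selection happen in one step. The sketch also shows that `Series.mean()` skips missing values by default, which is worth keeping in mind when the age column has gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({
    "ad": ["cost", "cost", "cost", "ingredients"],
    "click": [True, True, False, True],
    "age": [20.0, np.nan, 50.0, 40.0],
})

# Rows where the ad is "cost" and the user clicked
mask = (toy["ad"] == "cost") & toy["click"]

# Select matching rows and only the age column in one step
ages = toy.loc[mask, "age"]

print(ages.mean())              # missing ages are skipped by default
print(ages.mean(skipna=False))  # NaN if any age is missing
```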

# Possible fixes

- [fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)
- [dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)
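
A minimal sketch of how the two options differ, on hypothetical toy data: `fillna` keeps every row and replaces the missing value, while `dropna` removes the rows that have one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({
    "ad": ["cost", "cost", "ingredients"],
    "age": [20.0, np.nan, 40.0],
})

# Option 1: fill the missing age with the mean of the known ages
filled = toy.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())

# Option 2: drop rows whose age is missing
dropped = toy.dropna(subset=["age"])

print(filled)
print(dropped)
```

Filling keeps the sample size but pulls the data toward the mean; dropping keeps only observed values but shrinks the dataset. Which is appropriate depends on the analysis.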

In [17]:
# fill nulls in the age column only, using the mean age
mean_age = df["age"].mean()
df["age"] = df["age"].fillna(mean_age)

In [21]:
import pandas as pd

filename = 'Class data - Sheet1 (1).csv'

cd = pd.read_csv(filename)

In [22]:
cd = cd.dropna()

In [None]:
cd

In [30]:
cd[(cd["Age"] == "21") & (cd["State"] == "CO")].reset_index()

Unnamed: 0,index,Person,State,Age,Shoe size,Sign,Chipotle order,Dog or Cat?,Major,SKI?
0,12,JP,CO,21,10.0,Cancer,Burrito,Dog,Finance,Ski
1,27,Elise,CO,21,8.0,Gemini,Burrito,Cat,Ops and Analytics,ski
2,37,Greyson,CO,21,10.5,virgo,steak burrito,Dog,Finance and Info,board
3,47,Justin,CO,21,9.5,Capricorn,Bowl,dog,info/finance,ski
