In [1]:
import pandas as pd

df = pd.read_csv("sample.csv", index_col=0)
df

Unnamed: 0,student_no,department
0,1,computer
1,2,music
2,3,medical


# Accessor

A DataFrame exposes each of its columns as an attribute, much like a field variable in Java.

To access a column, use dictionary-like syntax or plain attribute syntax:

- `df['col_name']`
- `df.col_name`

Both forms return a `Series` object.

In [2]:
df['department'] is df.department

True

In [3]:
type(df['department'])

pandas.core.series.Series
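
The two access styles are not always interchangeable. The sketch below uses a small made-up frame (the column names `student no` and `count` are chosen only for illustration) to show where attribute access stops working: names that are not valid Python identifiers, and names that clash with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame; the column names are made up for illustration.
df = pd.DataFrame({
    "department": ["computer", "music", "medical"],
    "student no": [1, 2, 3],   # name contains a space
    "count": [3, 5, 7],        # name clashes with the DataFrame.count method
})

by_bracket = df["department"]   # bracket access works for any column name
by_attribute = df.department    # attribute access works for identifier-like names

# Both styles give a Series holding the same values
same_values = by_bracket.equals(by_attribute)

# A name with a space can only be reached with brackets
with_space = df["student no"]

# When a name clashes, the attribute resolves to the method, not the column
count_is_method = callable(df.count)
count_column = df["count"]
```

Bracket access is the safer default in reusable code; attribute access is mostly a convenience for interactive work.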

# Conditional Selection

A comparison on a column gives us a `pandas.Series` of `True`/`False` values. We can use that Series as a mask to keep only the rows we want.

`df[(column with predicate)]`

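**Note that each condition must be wrapped in parentheses when you combine two or more**, because `&` binds more tightly than comparison operators such as `==`:
```
df[(df["A"] == a) & (df["B"] == b)]
```

The expression is evaluated in three steps: 1. the left condition, 2. the right condition, 3. the element-wise combination of the two masks.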
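
As a minimal sketch of those three steps, each mask can be given a name before combining. The four-row frame below is invented for illustration and only reuses two column names from the cell that follows.

```python
import pandas as pd

# Tiny stand-in frame; the values are made up for illustration.
state_df = pd.DataFrame({
    "State": ["OH", "OH", "WV", "NJ"],
    "International plan": ["Yes", "No", "Yes", "No"],
})

# Steps 1 and 2: build each boolean mask on its own
is_oh = state_df["State"] == "OH"
has_plan = state_df["International plan"] == "Yes"

# Step 3: combine the masks element-wise
both = state_df[is_oh & has_plan]     # AND: rows where both masks are True
either = state_df[is_oh | has_plan]   # OR: rows where at least one mask is True
not_oh = state_df[~is_oh]             # NOT: rows where the mask is False
```

Naming the masks also removes the need for extra parentheses, since each comparison is already finished before `&` or `|` is applied.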

In [4]:
state_df = pd.read_csv("output.csv")

# Series of True/False
state_df["State"] == "OH"

# multiple conditions
state_df[(state_df["State"] == "OH") & (state_df["International plan"] == "Yes")]

Unnamed: 0.1,Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,...,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn,New Column,Churn Binary
3,3,OH,84,408,Yes,No,0,299.4,71.0,50.9,...,196.9,89.0,8.86,6.6,7,1.78,2.0,False,100,0
205,231,OH,63,415,Yes,Yes,36,199.0,110.0,33.83,...,197.6,92.0,8.89,11.0,6,2.97,1.0,False,100,0
782,808,OH,61,510,Yes,Yes,16,143.5,76.0,24.4,...,147.7,95.0,6.65,11.3,3,3.05,0.0,False,100,0
1212,1238,OH,147,415,Yes,Yes,24,219.9,118.0,37.38,...,352.5,111.0,15.86,8.1,4,2.19,3.0,False,100,0
1392,1418,OH,29,415,Yes,Yes,37,235.0,101.0,39.95,...,139.8,106.0,6.29,5.7,7,1.54,2.0,False,100,0
1836,1862,OH,133,408,Yes,No,0,254.7,103.0,43.3,...,178.1,103.0,8.01,8.0,3,2.16,0.0,True,100,1
2568,2594,OH,115,510,Yes,No,0,345.3,81.0,58.7,...,217.5,107.0,9.79,11.8,8,3.19,1.0,True,100,1
2878,2904,OH,136,408,Yes,No,0,183.4,103.0,31.18,...,200.4,122.0,9.02,10.4,9,2.81,2.0,False,100,0
3183,3209,OH,68,415,Yes,Yes,24,125.7,92.0,21.37,...,214.5,108.0,9.65,14.2,6,3.83,3.0,True,100,1


## SQL-like

- `isin()`
- `isnull(), notnull()`

In [5]:
state_df["State"].isin(["OH", "WV"])
state_df["State"].notnull()

# negate the mask with ~ to exclude the unwanted values
~state_df["State"].isin(["OH", "WV"])

0       True
1       True
2       True
3       True
4       True
        ... 
3302    True
3303    True
3304    True
3305    True
3306    True
Name: State, Length: 3307, dtype: bool
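
The sketch below lines these methods up with their rough SQL counterparts on a made-up four-row column that includes one missing value. The data is invented for illustration; only the method names come from the list above.

```python
import pandas as pd

# Made-up column with one missing value
state_df = pd.DataFrame({"State": ["OH", "WV", None, "NJ"]})

in_list = state_df["State"].isin(["OH", "WV"])   # roughly: State IN ('OH', 'WV')
not_in_list = ~in_list                            # roughly: State NOT IN ('OH', 'WV')
is_missing = state_df["State"].isnull()          # roughly: State IS NULL
not_missing = state_df["State"].notnull()        # roughly: State IS NOT NULL

# isin() returns False for the missing value, so ~isin() returns True for it.
# Combine with notnull() if missing rows should be dropped as well.
selected = state_df[not_missing & not_in_list]
```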

# Indexing - loc, iloc

See also the file `2_1_iloc_loc_location_label.ipynb`.

pandas provides 2 accessors to retrieve data. Both are **row-first**: the first argument selects rows, so it helps to think of the object as a list of rows.

||col_1|col_2|
|-|-|-|
|32|1|2|
|12|1|2|

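- `iloc[]` - selects data by integer position, such as the 1st or 2nd row
- `loc[]` - selects data by index label

**Both use square brackets, not parentheses**:

`iloc[0]`: gives us `[1, 2]`

`loc[0]`: raises `KeyError`, because no row is labelled `0`; `loc[12]`: gives us `[1, 2]`

The second argument selects the column: a column name for `loc`, an integer position for `iloc`.
`df.loc[12, "col_1"]`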
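
A minimal sketch of the difference, using the same index labels as the table above (32 and 12). The cell values in the second row are changed here, as an assumption for illustration, so the two rows can be told apart.

```python
import pandas as pd

# Same index labels as the table above; second-row values differ on purpose
df = pd.DataFrame({"col_1": [1, 3], "col_2": [2, 4]}, index=[32, 12])

first_by_position = df.iloc[0]   # first row, regardless of its label
row_by_label = df.loc[12]        # row whose index label is 12 (the second row)

cell_by_label = df.loc[12, "col_2"]   # row label, column name
cell_by_position = df.iloc[1, 1]      # row position, column position

# No row carries the label 0, so loc[0] fails with KeyError
try:
    df.loc[0]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True
```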

In [6]:
df = pd.read_csv("sample.csv", index_col=1)
df  # note that the index now starts at 1

Unnamed: 0,student_no,department
1,0,computer
2,1,music
3,2,medical


In [7]:
# access the "2nd" row
df.iloc[1]

# access the value
df.iloc[1, 1]

'music'

In [8]:
# slicing works as it does for a Python list
# access all rows of the "second" column
df.iloc[:, 1]

1    computer
2       music
3     medical
Name: department, dtype: object

In [9]:
# negative index works
df.iloc[-1, 1]

'medical'

In [10]:
# access by index label
df.loc[2]

student_no        1
department    music
Name: 2, dtype: object

## Setting index
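Setting a new index is useful when one of the columns identifies the rows better than the current index does. `set_index()` returns a new DataFrame and leaves `df` unchanged unless you reassign the result.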

In [11]:
df.set_index("department")

Unnamed: 0_level_0,student_no
department,Unnamed: 1_level_1
computer,0
music,1
medical,2
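
A short sketch of how a new index might be used afterwards. The frame is rebuilt by hand here to match the output shown above, and the variable names (`by_department`, `restored`) are illustrative only.

```python
import pandas as pd

# Hand-built stand-in for the frame shown above
df = pd.DataFrame(
    {"student_no": [0, 1, 2], "department": ["computer", "music", "medical"]},
    index=[1, 2, 3],
)

# set_index() returns a new DataFrame; keep the result in a variable
by_department = df.set_index("department")

# The original frame still has its old index and its department column
original_index = list(df.index)

# Rows can now be looked up by department name with loc
music_student = by_department.loc["music", "student_no"]

# reset_index() moves the index back into an ordinary column
restored = by_department.reset_index()
```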


# Assign

## Create new column

Creating a new column is easy: assign a value to a column name that does not exist yet. A single value is repeated for every row.

In [12]:
df["location"] = "Ireland"
df

Unnamed: 0,student_no,department,location
1,0,computer,Ireland
2,1,music,Ireland
3,2,medical,Ireland


You can also assign a list, as long as its length matches the number of rows.

In [13]:
aList = [3, 5, 7]

df["count"] = aList

In [14]:
df

Unnamed: 0,student_no,department,location,count
1,0,computer,Ireland,3
2,1,music,Ireland,5
3,2,medical,Ireland,7
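
The sketch below collects the assignment forms in one place and adds two cases not shown above: a column derived from another column, and a list of the wrong length. The names `double_count` and `bad` are hypothetical and used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"department": ["computer", "music", "medical"]})

df["location"] = "Ireland"               # scalar: repeated for every row
df["count"] = [3, 5, 7]                  # list: one value per row
df["double_count"] = df["count"] * 2     # derived from an existing column

# A list whose length differs from the number of rows is rejected
try:
    df["bad"] = [1, 2]
    length_mismatch_raises = False
except ValueError:
    length_mismatch_raises = True
```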
