<a href="https://colab.research.google.com/github/alimoorreza/CS167-sp25-notes/blob/main/Day02_Pandas_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day02
## 🐼 Pandas Tutorial

#### CS167: Machine Learning, Spring 2025


📜 [Syllabus](https://analytics.drake.edu/~reza/teaching/cs167_sp25/cs167_syllabus_sp25.pdf)

## Overview of Day02:
<!--- Notebook #1 setup walkthrough-->
- Pandas Tutorial
- Exercises for Pandas

# 🐼 Pandas
__Pandas__ is a super powerful Python data analysis library.
- it's built on top of another super powerful library called `numpy`

Using Google Colab, `pandas` should already be installed. If you see `In [*]` next to a cell, it means your computer is working on the task.

## Overview of Pandas Tutorial

Four main goals:
1. __Overview__ of Pandas
    - Datatypes `DataFrame` and `Series`
    - helpful functions
2. Select __columns__ in DataFrames
3. Select __rows__ in DataFrames
4. Select __subsets__ of the DataFrame (both rows and columns).

##  Pandas Datatypes: `DataFrame` and `Series`

In `pandas`, there are two main datatypes, `DataFrame` and `Series`:

Let's start with `DataFrame`

[Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) defines `DataFrames` as:
> Two-dimensional, size-mutable, potentially heterogeneous tabular data.

- Basically, think of `DataFrames` as our Excel sheets: two-dimensional, tabular data.
- Each column has a name, and you can use these names to filter and create subsets of data.
- Often, you'll see `DataFrame` variables abbreviated to `df`.

In [1]:
# The first step is to mount your Google Drive to your Colab account.
# You will be asked to authorize Colab to access your Google Drive. Follow the steps it leads you through.

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Creating a DataFrame using `pd.read_csv()`:
While you can create a DataFrame from scratch, most often we'll be importing data from a `.csv` file:
- pandas has a helpful function for this: `pd.read_csv()`, which takes the path to the csv file as an argument [[documentation]](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
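
To try `pd.read_csv()` without mounting a drive, here is a minimal sketch that reads CSV text from memory instead of from a file. The CSV content and the names `csv_text` and `df_demo` are made up for illustration; they are not part of the course datasets.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk (not a course dataset).
csv_text = """name,price,rating
Cafe A,$,4.5
Cafe B,$$,3.9
Cafe C,$$$,
"""

# pd.read_csv accepts a path or a file-like object; StringIO wraps the string.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo)         # the first line of the text became the column names
print(df_demo.dtypes)  # pandas infers a type for each column
```

The empty `rating` field on the last line is read as a missing value (`NaN`), which is the same thing you see in the `pat` column of the restaurant data.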

In [5]:
#you should be able to run this without any issue.
import pandas as pd

path = "/content/drive/MyDrive/cs167_sp25/datasets/restaurant.csv"
df_rest = pd.read_csv(path)
print(df_rest)

    alt  bar  fri  hun   pat price rain  res     type    est target
0   Yes   No   No  Yes  Some   $$$   No  Yes   French   0-10    Yes
1   Yes   No   No  Yes  Full     $   No   No     Thai  30-60     No
2    No  Yes   No   No  Some     $   No   No   Burger   0-10    Yes
3   Yes   No  Yes  Yes  Full     $   No   No     Thai  10-30    Yes
4   Yes   No  Yes   No  Full   $$$   No  Yes   French    >60     No
5    No  Yes   No  Yes  Some    $$  Yes  Yes  Italian   0-10    Yes
6    No  Yes   No   No   NaN     $  Yes   No   Burger   0-10     No
7    No   No   No  Yes  Some    $$  Yes  Yes     Thai   0-10    Yes
8    No  Yes  Yes   No  Full     $  Yes   No   Burger    >60     No
9   Yes  Yes  Yes  Yes  Full   $$$   No  Yes  Italian  10-30     No
10   No   No   No   No   NaN     $   No   No     Thai   0-10     No
11  Yes  Yes  Yes  Yes  Full     $   No   No   Burger  30-60    Yes


In [None]:
df_rest["hun"]

### 📣 Helpful Method Alert: `df.head()`

The `.head()` method can be called on any DataFrame, and by default will display the first 5 rows of the data, as well as the names of the columns.
- if you want it to display more than 5 rows, you can provide a number as an argument to the method.

In IPython notebooks, the value of the last expression in a cell is displayed automatically, without needing `print()`.

So, when you put those two facts together, you get this nifty functionality:

In [None]:
df_rest.head(7)

Unnamed: 0,alt,bar,fri,hun,pat,price,rain,res,type,est,target
0,Yes,No,No,Yes,Some,$$$,No,Yes,French,0-10,Yes
1,Yes,No,No,Yes,Full,$,No,No,Thai,30-60,No
2,No,Yes,No,No,Some,$,No,No,Burger,0-10,Yes
3,Yes,No,Yes,Yes,Full,$,No,No,Thai,10-30,Yes
4,Yes,No,Yes,No,Full,$$$,No,Yes,French,>60,No
5,No,Yes,No,Yes,Some,$$,Yes,Yes,Italian,0-10,Yes
6,No,Yes,No,No,,$,Yes,No,Burger,0-10,No
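
A small self-contained sketch of how the argument to `.head()` changes the number of rows returned. The DataFrame `df_nums` is a made-up example, and `.tail()` is included as the mirror-image method for the last rows.

```python
import pandas as pd

# A made-up 10-row DataFrame so the row counts are easy to check.
df_nums = pd.DataFrame({"n": range(10)})

first_five = df_nums.head()    # default: first 5 rows
first_three = df_nums.head(3)  # first 3 rows
last_two = df_nums.tail(2)     # tail() works the same way from the end

print(len(first_five), len(first_three), len(last_two))
```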


### 📣 Helpful Attribute Alert: `df.shape`
Want to know the dimensions of your DataFrame? Use `df.shape`



In [None]:
df_rest.shape

(12, 11)
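
Since `shape` is a tuple of `(rows, columns)`, it can be unpacked into two variables. A minimal sketch with a made-up DataFrame `df_small`:

```python
import pandas as pd

# Made-up data: 3 rows, 2 columns.
df_small = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is an attribute, so there are no parentheses after it.
n_rows, n_cols = df_small.shape

print("rows:", n_rows)
print("columns:", n_cols)
print("len(df) also gives the row count:", len(df_small))
```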

### 📣 Other Helpful Attributes
>> `df.columns`, `df.iloc[]`, etc.

>> We will explore these in more detail later.



In [None]:
df_rest.columns

Index(['alt', 'bar', 'fri', 'hun', 'pat', 'price', 'rain', 'res', 'type',
       'est', 'target'],
      dtype='object')

In [None]:
df_rest.type

Unnamed: 0,type
0,French
1,Thai
2,Burger
3,Thai
4,French
5,Italian
6,Burger
7,Thai
8,Burger
9,Italian


In [None]:
df_rest['res']

Unnamed: 0,res
0,Yes
1,No
2,No
3,No
4,Yes
5,Yes
6,No
7,Yes
8,No
9,Yes


In [None]:
df_rest['alt']

Unnamed: 0,alt
0,Yes
1,Yes
2,No
3,Yes
4,Yes
5,No
6,No
7,No
8,No
9,Yes


In [None]:
df_rest.hun

Unnamed: 0,hun
0,Yes
1,Yes
2,No
3,Yes
4,No
5,Yes
6,No
7,Yes
8,No
9,Yes


In [None]:
df_rest['hun']

Unnamed: 0,hun
0,Yes
1,Yes
2,No
3,Yes
4,No
5,Yes
6,No
7,Yes
8,No
9,Yes


In [11]:
data_1d_list = ['tiger', 'lion', 'honey badger']
# dataframe creation without a user-defined column name
my_df_3 = pd.DataFrame(data_1d_list)
my_df_3

Unnamed: 0,0
0,tiger
1,lion
2,honey badger


In [12]:
data_1d_list = ['tiger', 'lion', 'honey badger']
# dataframe creation with a user-defined column name
my_df_3 = pd.DataFrame(data_1d_list, columns=['animal name'])
my_df_3

Unnamed: 0,animal name
0,tiger
1,lion
2,honey badger


In [None]:
data = ["summer", "winter", "fall"]
df_new = pd.DataFrame(data, columns=["season names"])
df_new["season names"]

Unnamed: 0,season names
0,summer
1,winter
2,fall


In [None]:
data_my = [["des moines", 60], ["philadelphia", 75], ["bloomington", 80], ["st paul", 65], ["fairfax", 67]]
df_2d = pd.DataFrame(data_my, columns=["city name", "temperature"])
df_2d.head()


Unnamed: 0,city name,temperature
0,des moines,60
1,philadelphia,75
2,bloomington,80
3,st paul,65
4,fairfax,67


In [6]:
# data frame creation using 2d list

data_2d_list = [['tiger', 10], ['lion', 15], ['honey badger', 5]]
my_df_1 = pd.DataFrame(data_2d_list, columns=['animal name', 'age'])
my_df_1

Unnamed: 0,animal name,age
0,tiger,10
1,lion,15
2,honey badger,5


In [10]:
# data frame creation using dictionary

data_dict = {'animal name': ['tiger', 'lion', 'honey badger'], 'age': [10, 15, 5]}
my_df_2 = pd.DataFrame(data_dict)
my_df_2
#print("total number of samples: ", my_df_2.shape[0])

# HEADS UP! the number of items in each list should be the same


Unnamed: 0,animal name,age
0,tiger,10
1,lion,15
2,honey badger,5
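
To see why the lists in the dictionary need the same number of items, here is a small sketch that deliberately leaves one age out. The name `bad_data` is made up for illustration; pandas refuses to build the DataFrame and raises a `ValueError`.

```python
import pandas as pd

# Three animal names but only two ages: the columns cannot line up.
bad_data = {"animal name": ["tiger", "lion", "honey badger"], "age": [10, 15]}

try:
    pd.DataFrame(bad_data)
    error_raised = False
except ValueError as err:
    error_raised = True
    print("ValueError:", err)
```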


In [None]:
data_my = {"city names": ["des moines", "philadelphia", "bloomington", "st paul", "fairfax", "austin"], "temperature":[60, 67, 70, 80, 56, 90]}
df_dict = pd.DataFrame(data_my)
print(df_dict)


     city names  temperature
0    des moines           60
1  philadelphia           67
2   bloomington           70
3       st paul           80
4       fairfax           56
5        austin           90


In [None]:
df_2d["city name"]

Unnamed: 0,city name
0,des moines
1,philadelphia
2,bloomington
3,st paul
4,fairfax


In [None]:
df_rest.iloc[0]

Unnamed: 0,0
alt,Yes
bar,No
fri,No
hun,Yes
pat,Some
price,$$$
rain,No
res,Yes
type,French
est,0-10


## Other ways of creating DataFrames (without explicitly providing the path to a .csv file):
The syntax for creating a DataFrame from scratch looks like this:
- `pandas.DataFrame(data, index, columns)`
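
A minimal sketch of all three arguments used together, with made-up scores and hypothetical row labels `student_a` and `student_b`. When `index` is left out, pandas labels the rows 0, 1, 2, and so on.

```python
import pandas as pd

scores = [[90, 85], [70, 95]]  # made-up data, one inner list per row

df_idx = pd.DataFrame(
    scores,
    index=["student_a", "student_b"],  # row labels
    columns=["quiz", "exam"],          # column names
)

print(df_idx)
print(df_idx.loc["student_b", "exam"])  # look up one cell by its labels
```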


In [None]:
# Example#1: create an empty DataFrame()
data_frame1 = pd.DataFrame()
print(data_frame1)

Empty DataFrame
Columns: []
Index: []


## Creating DataFrame from a 1D List

In [None]:
# Example#2: initializing a DataFrame with list of items (without a column name)
data = ["reza", "chris", "eric"]
df_1 = pd.DataFrame(data)
print(df_1)

       0
0   reza
1  chris
2   eric


In [None]:
# Example#3: initializing a DataFrame with a list of items (with a column name)

data_list = [10, 20, 30, 40, 50, 60] # initialize list elements

df_1 = pd.DataFrame(data_list, columns=['numbers']) # create the pandas DataFrame with the column name provided explicitly
print('size of the dataframe df_1', df_1.shape)
# print dataframe
df_1

size of the dataframe df_1 (6, 1)


Unnamed: 0,numbers
0,10
1,20
2,30
3,40
4,50
5,60


In [None]:
# Example#4: adding a column name (ie, "last name") to the DataFrame
data = ["reza", "chris", "eric"]
df_1 = pd.DataFrame(data, columns=["last name"])
print(df_1)

  last name
0      reza
1     chris
2      eric


## Creating DataFrame from a 2D list:

In [None]:
# Example#8: initialize list of lists (each inner list corresponds to one row in the DataFrame)
data_2d_list = [['reza', 1], ['chris', 2], ['eric', 3]]

# Create the pandas DataFrame
df_3 = pd.DataFrame(data_2d_list, columns=['name', 'score'])

# print dataframe.
df_3


Unnamed: 0,name,score
0,reza,1
1,chris,2
2,eric,3


## Creating DataFrame from a Dictionary

In [None]:
# Example#5: Create the pandas DataFrame with the column names provided explicitly
data_dict = {'col1':[1,2,3], 'col2':[4,5,6], 'col3':[7,8,9]}
df_2 = pd.DataFrame(data_dict)
print('size of the dataframe df_2', df_2.shape)
# print dataframe
df_2

size of the dataframe df_2 (3, 3)


Unnamed: 0,col1,col2,col3
0,1,4,7
1,2,5,8
2,3,6,9


In [None]:
# Example#6: Initializing a DataFrame with a dictionary of items allows you to specify the column names along with their corresponding values.
data_source = {'first name': ['a', 'b', 'c'], 'last name':['A', 'B', 'C']}
df_2 = pd.DataFrame(data_source)
print(df_2)

# Example#7: Initializing a DataFrame with a dictionary of items allows you to specify the column names along with their corresponding values.
data_source = {"first name":["alimoor", "chris", "eric"], "last name":["reza", "porter", "manley"], "scores":[2, 3, 4]}
df_2 = pd.DataFrame(data_source)
df_2.head()

  first name last name
0          a         A
1          b         B
2          c         C


Unnamed: 0,first name,last name,scores
0,alimoor,reza,2
1,chris,porter,3
2,eric,manley,4


## Column Names

Want to see a list of all of the columns in your dataset? Try using `df.columns`

In [None]:
col = df_rest.columns
col

Index(['alt', 'bar', 'fri', 'hun', 'pat', 'price', 'rain', 'res', 'type',
       'est', 'target'],
      dtype='object')

If a column name contains no spaces (it must be a valid Python identifier) and does not clash with an existing DataFrame attribute or method, you can also reference it using dot notation like so:

In [None]:
df_rest.type # seeing the values of different rows in the column named 'type'

Unnamed: 0,type
0,French
1,Thai
2,Burger
3,Thai
4,French
5,Italian
6,Burger
7,Thai
8,Burger
9,Italian
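
Dot notation has a couple of pitfalls that bracket notation avoids. The sketch below uses a made-up DataFrame `df_cols` with one column whose name contains a space and one column that happens to be called `shape`, the same name as a built-in DataFrame attribute.

```python
import pandas as pd

df_cols = pd.DataFrame({
    "city name": ["des moines", "st paul"],  # contains a space
    "shape": ["round", "square"],            # same name as the df.shape attribute
    "temp": [60, 65],
})

print(df_cols.temp.tolist())          # dot notation works for a simple name
print(df_cols["city name"].tolist())  # a name with a space needs brackets
print(df_cols.shape)                  # this is the (rows, columns) attribute, not the column
print(df_cols["shape"].tolist())      # brackets always reach the column
```

Bracket notation works in every case, so it is the safer default when in doubt.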


## Selecting Rows in DataFrames using `loc` and `iloc`:
Simply put:
- `loc` gets DataFrame rows and columns by __labels/names__
- `iloc` gets DataFrame rows and columns by __index/position__
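
The difference is easiest to see when the row labels are not numbers. A small sketch with a made-up DataFrame `df_pets` whose rows are labeled `"x"`, `"y"`, and `"z"`:

```python
import pandas as pd

df_pets = pd.DataFrame(
    {"animal": ["tiger", "lion", "honey badger"], "age": [10, 15, 5]},
    index=["x", "y", "z"],  # hypothetical row labels
)

row_by_label = df_pets.loc["y"]     # the row labeled "y"
row_by_position = df_pets.iloc[1]   # the row at position 1 (the second row)

cell_by_label = df_pets.loc["z", "age"]  # row label, then column name
cell_by_position = df_pets.iloc[2, 1]    # row position, then column position

print(row_by_label)
print(cell_by_label, cell_by_position)
```

Passing a row and a column together, as in the last two lookups, is also how you select a subset of both rows and columns.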

In [14]:
# Load a new csv file, 'titanic.csv'. You can find it on Blackboard under the datasets module.
path = '/content/drive/MyDrive/cs167_sp25/datasets/titanic.csv'

# read the file into a dataframe
df_titanic = pd.read_csv(path)
print('data.shape: ', df_titanic.shape)
df_titanic.head()

data.shape:  (891, 15)


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [None]:
print(df_titanic.loc[880])   # 880 is really a row label (a "name"), not a position

survived                 1
pclass                   2
sex                 female
age                   25.0
sibsp                    0
parch                    1
fare                  26.0
embarked                 S
class               Second
who                  woman
adult_male           False
deck                   NaN
embark_town    Southampton
alive                  yes
alone                False
Name: 880, dtype: object


Let's take a subset of titanic and try to use `loc` and `iloc`:

In [None]:
subset = df_titanic.loc[800:805] # since it's a label, it will take rows labeled 800, 801, 802, 803, 804, and 805.
print(subset)

     survived  pclass     sex    age  sibsp  parch      fare embarked   class  \
800         0       2    male  34.00      0      0   13.0000        S  Second   
801         1       2  female  31.00      1      1   26.2500        S  Second   
802         1       1    male  11.00      1      2  120.0000        S   First   
803         1       3    male   0.42      0      1    8.5167        C   Third   
804         1       3    male  27.00      0      0    6.9750        S   Third   
805         0       3    male  31.00      0      0    7.7750        S   Third   

       who  adult_male deck  embark_town alive  alone  
800    man        True  NaN  Southampton    no   True  
801  woman       False  NaN  Southampton   yes  False  
802  child       False    B  Southampton   yes  False  
803  child       False  NaN    Cherbourg   yes  False  
804    man        True  NaN  Southampton   yes   True  
805    man        True  NaN  Southampton    no   True  
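
Notice that the `loc` slice included its end label. A minimal sketch, using a made-up DataFrame `df_range`, of how the same slice behaves differently with `loc` and `iloc`:

```python
import pandas as pd

# Made-up data with the default row labels 0..9.
df_range = pd.DataFrame({"value": range(100, 110)})

by_label = df_range.loc[2:5]      # labels 2, 3, 4, and 5 (end label included)
by_position = df_range.iloc[2:5]  # positions 2, 3, and 4 (end excluded, like a Python list)

print(len(by_label), by_label.index.tolist())
print(len(by_position), by_position.index.tolist())
```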


In [None]:
subset = df_titanic.loc[800:805] # since it's a label, it will take rows labeled 800, 801, 802, 803, 804, and 805.
subset.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
800,0,2,male,34.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
801,1,2,female,31.0,1,1,26.25,S,Second,woman,False,,Southampton,yes,False
802,1,1,male,11.0,1,2,120.0,S,First,child,False,B,Southampton,yes,False
803,1,3,male,0.42,0,1,8.5167,C,Third,child,False,,Cherbourg,yes,False
804,1,3,male,27.0,0,0,6.975,S,Third,man,True,,Southampton,yes,True


In [None]:
subset.loc[800] # will show the 1st row in the DataFrame called 'subset'

Unnamed: 0,800
survived,0
pclass,2
sex,male
age,34.0
sibsp,0
parch,0
fare,13.0
embarked,S
class,Second
who,man


In [None]:
subset.loc[805] # will show the 6th row in the DataFrame called 'subset'

Unnamed: 0,805
survived,0
pclass,3
sex,male
age,31.0
sibsp,0
parch,0
fare,7.775
embarked,S
class,Third
who,man


In [None]:
subset.loc[806] # no row in 'subset' has the label 806 (labels run from 800 to 805), hence a KeyError

KeyError: 806

In [None]:
subset.iloc[5]  # works: position 5 is the 6th row of 'subset' (the row labeled 805)

Unnamed: 0,805
survived,0
pclass,3
sex,male
age,31.0
sibsp,0
parch,0
fare,7.775
embarked,S
class,Third
who,man
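
A subset keeps the row labels of the DataFrame it came from, which is why `loc` and `iloc` need different numbers here. The sketch below shows this on made-up data and uses `reset_index(drop=True)` to renumber the labels from 0; the names `df_full`, `sub`, and `renumbered` are for illustration only.

```python
import pandas as pd

df_full = pd.DataFrame({"value": range(10)})  # made-up data, labels 0..9
sub = df_full.loc[6:8]                        # keeps labels 6, 7, 8

print(sub.index.tolist())
print(sub.iloc[0]["value"])  # first row by position

try:
    sub.loc[0]               # there is no row labeled 0 in the subset
    key_error_raised = False
except KeyError:
    key_error_raised = True
    print("KeyError: no row labeled 0")

# drop=True discards the old labels instead of keeping them as a new column.
renumbered = sub.reset_index(drop=True)
print(renumbered.loc[0, "value"])  # label 0 now exists
```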


## Pandas Datatypes: `Series`
- `Series` are 1D arrays with axis labels.
    - Each __row__ in a DataFrame is a `Series`.
    - Each __column__ in a DataFrame is also a `Series`.

In [None]:
print(type(df_rest.iloc[0])) #the first row in the dataframe
print(type(df_rest['type'])) #the column 'type' from the dataframe

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
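
A short sketch of what the axis labels look like for each kind of `Series`, using a made-up DataFrame `df_s`. A column's labels are the row labels, while a row's labels are the column names.

```python
import pandas as pd

df_s = pd.DataFrame({"animal": ["tiger", "lion"], "age": [10, 15]})

col = df_s["age"]    # a column as a Series
row = df_s.iloc[0]   # a row as a Series

print(col.index.tolist(), col.name)  # row labels, and the column's name
print(row.index.tolist(), row.name)  # column names, and the row's label
```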
