<a href="https://colab.research.google.com/github/dymiyata/intro-to-ml-and-ai-2025-2026/blob/main/intro_to_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working With Data Using the Pandas Library

To do any sort of machine learning, we first need to be comfortable working with data.  After all, data is the backbone of machine learning.

Here we will give an overview of working with the `pandas` library.  

This is a library used to work with datasets.  
Think of it like Google Sheets or Microsoft Excel but using Python.

## Housekeeping

Before we begin, save a copy of this notebook to your own Google Drive.
To do this, click File->Save a copy in Drive.

Once you're in the copy, rename the file to `intro_to_pandas_<insert your name here>`

## Importing Pandas

`pandas` is a Python library, so in order to access its functionality, we need to import it.  If you are running things locally on your machine (i.e. not in Google Colab), then you will have to install `pandas` before you can use it.

Here is a link to the pandas documentation: [pandas docs](https://pandas.pydata.org/docs/)

In [None]:
import pandas as pd

## Importing the dataset

The dataset we'll use today came from a website called [Kaggle](https://www.kaggle.com/).
* Kaggle is a great source for those looking to do data science or machine learning
* You can find many datasets, competitions, and example notebooks
* Feel free to explore Kaggle in your free time

The specific dataset we are using consists of the first 721 pokemon.






The following code will import the data set.  For now, don't worry too much about how it works.  It simply downloads the dataset from Kaggle so we can work with it.

The dataset is stored in a .csv file.
* This stands for 'comma-separated values'
* It is just a big text file for storing a table of data
* The different values in the table are separated with commas (hence the name)
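
To make the idea of a .csv file concrete, here is a minimal sketch that builds a tiny CSV by hand and loads it with `pd.read_csv`. The data (names, types, and numbers) is made up for illustration and is not taken from the Kaggle file. `io.StringIO` is only used so the example can run without a file on disk.

```python
import io
import pandas as pd

# A tiny, made-up CSV held in a string (hypothetical data, not the Kaggle file).
# The first line holds the column names; each later line is one row.
csv_text = """Name,Type,HP
Sparky,Electric,35
Leafy,Grass,45
Bubbles,Water,44
"""

# pd.read_csv accepts a file path or a file-like object, so StringIO works here.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df)
print(toy_df.shape)  # (number of rows, number of columns)
```

Each comma in the text marks the boundary between two columns, and each new line starts a new row of the table.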

In [None]:
import kagglehub

path = kagglehub.dataset_download("abcsds/pokemon")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'pokemon' dataset.
Path to dataset files: /kaggle/input/pokemon


### Using Pandas

The `pandas` library is built around a data type called a *dataframe*. This is the main data type we will use to store our data in this class.

The code below reads the `Pokemon.csv` file that we downloaded and creates a pandas dataframe.  
* Remember we imported `pandas` as `pd`

Then, we take that dataframe and store it in the variable df.

In [None]:
# If you run into an error try running the previous cell again
# If it still doesn't work, you may need to adjust the path to the file
# Check the printed output of the previous cell to find the right path

df = pd.read_csv("/kaggle/input/pokemon/Pokemon.csv")

Let's explore the data set a bit
* `DataFrame.head()` and `DataFrame.tail()` return the first few rows and the last few rows of the dataframe respectively.
* If you pass an integer `n` as an argument to either function, it'll return exactly `n` rows.
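
As a small sketch of how `head()` and `tail()` behave, the example below uses a made-up ten-row dataframe (the column names `id` and `score` are hypothetical) so the results are easy to check by eye.

```python
import pandas as pd

# Hypothetical toy data: ten rows with ids 0 through 9.
toy_df = pd.DataFrame({"id": range(10), "score": [n * n for n in range(10)]})

first_rows = toy_df.head()   # no argument: the first 5 rows
last_three = toy_df.tail(3)  # the last 3 rows

print(first_rows)
print(last_three)
```

With no argument, both functions default to 5 rows.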



Let's run some of these to get a feel for what our dataset looks like

Now let's try the following functions.  See if you can figure out what they do:
* `DataFrame.info()`
* `DataFrame.describe()`

We can also sort our data using any of the columns. Suppose we want to sort the pokemon based on their speed.  We'd use the following command:

`df.sort_values(by="Speed")`
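
Here is a minimal sketch of the same call on a made-up three-row dataframe (the names and speeds are hypothetical). It is meant to show two things: the default order is smallest first, and `sort_values` returns a new dataframe rather than changing the original.

```python
import pandas as pd

# Hypothetical toy data, not the real Pokemon dataset.
toy_df = pd.DataFrame({
    "Name": ["Sparky", "Leafy", "Bubbles"],
    "Speed": [90, 45, 60],
})

# By default, rows are ordered from the smallest Speed to the largest.
sorted_df = toy_df.sort_values(by="Speed")

print(sorted_df)
print(toy_df)  # the original keeps its row order
```

Notice that the row labels on the left travel with their rows, so they appear out of order in the sorted result.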

How can we figure out which pokemon has the highest *HP* stat?

Suppose we don't care about all the stats. We just care about the names of the pokemon. Let's run the following command:
`df["Name"]`
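
Selecting a single column gives back a pandas *Series* (one column of values) rather than a dataframe. The sketch below, on made-up data, contrasts that with passing a list of column names, which gives back a dataframe.

```python
import pandas as pd

# Hypothetical toy data, not the real Pokemon dataset.
toy_df = pd.DataFrame({
    "Name": ["Sparky", "Leafy", "Bubbles"],
    "HP": [35, 45, 44],
    "Speed": [90, 45, 43],
})

names = toy_df["Name"]               # one label -> a Series
subset = toy_df[["Name", "Speed"]]   # a list of labels -> a DataFrame

print(type(names).__name__)
print(type(subset).__name__)
```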

Now suppose we want to select entries based on their labels (either row labels or column labels).

For this we have the `DataFrame.loc` indexer. Note that it uses square brackets, not parentheses.

Try `df.loc[0, ["Name", "Legendary"]]`

Then try `df.loc[0:10, ["Name", "Legendary"]]`

Next, try `df.loc[:, ["Name", "Legendary"]]`

What about `df.loc[12:15, "Name":"Attack"]`
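
One detail worth seeing on a small example: slices in `loc` include *both* endpoints, unlike ordinary Python slices. The sketch below uses a made-up five-row dataframe (all names and values are hypothetical).

```python
import pandas as pd

# Hypothetical toy data, not the real Pokemon dataset.
toy_df = pd.DataFrame({
    "Name": ["Sparky", "Leafy", "Bubbles", "Rocky", "Flare"],
    "Type": ["Electric", "Grass", "Water", "Rock", "Fire"],
    "HP": [35, 45, 44, 40, 39],
    "Speed": [90, 45, 43, 20, 65],
})

single_row = toy_df.loc[0, ["Name", "HP"]]   # one row label -> a Series
row_slice = toy_df.loc[1:3, ["Name", "HP"]]  # rows labelled 1, 2 AND 3
col_slice = toy_df.loc[:, "Name":"HP"]       # every row, columns Name through HP

print(single_row)
print(row_slice)
print(col_slice)
```

The column slice `"Name":"HP"` follows the order of the columns in the dataframe, so it picks up `"Type"` along the way.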

How can we return a dataframe consisting of  the `"Name"` and all 6 stats (`"HP"`, `"Attack"`, `"Defense"`, `"Sp. Atk"`, `"Sp. Def"`, `"Speed"`) for the 60th to 67th pokemon?

Instead of using the labels, we can also use indices for the rows and columns.  For example `"#"` is column `0`, `"Name"` is column `1`, `"Type 1"` is column `2`, etc.

This is done using the `DataFrame.iloc` indexer.  It is very similar to `DataFrame.loc`, except you use integer positions, not the actual labels. The "i" in "iloc" stands for "integer" (as in integer position).

For example, try running the following:
`df.iloc[[0, 1, 2], [1, 2, 3]]`.
Can you see how this works?
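
As a rough sketch on made-up data (all names and values are hypothetical), the example below shows position-based selection with lists and with slices. Unlike label slices in `loc`, position slices in `iloc` leave out the end position, just like ordinary Python slices.

```python
import pandas as pd

# Hypothetical toy data, not the real Pokemon dataset.
toy_df = pd.DataFrame({
    "Name": ["Sparky", "Leafy", "Bubbles", "Rocky", "Flare"],
    "Type": ["Electric", "Grass", "Water", "Rock", "Fire"],
    "HP": [35, 45, 44, 40, 39],
    "Speed": [90, 45, 43, 20, 65],
})

# Lists of positions: rows 0, 1, 2 and columns 0 and 2 (Name and HP).
by_lists = toy_df.iloc[[0, 1, 2], [0, 2]]

# Slices of positions: rows 1 and 2, columns 0 and 1 (the end is excluded).
by_slices = toy_df.iloc[1:3, 0:2]

print(by_lists)
print(by_slices)
```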

Again, let's return a dataframe consisting of the `"Name"` and all 6 stats (`"HP"`, `"Attack"`, `"Defense"`, `"Sp. Atk"`, `"Sp. Def"`, `"Speed"`) for the 60th to 67th pokemon.  However, this time use `DataFrame.iloc`.

We can also use booleans to filter our dataframe.

Try running the following command:
`df[df["Type 1"]=="Psychic"]`

Then try `df[(df["Type 1"]=="Psychic") & (df["Legendary"]==True)]`

Then try `df[df["Name"].str.startswith("W")]`

Then try `df[df["Type 1"].isin(["Fire", "Ghost", "Normal"])]`

Lastly, try `df[df["Attack"] > 100]`
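
To see what is going on inside the square brackets, the sketch below splits a filter into two steps on made-up data (names, types, and speeds are hypothetical). The comparison produces a column of `True`/`False` values, often called a mask, and the dataframe keeps only the rows where the mask is `True`.

```python
import pandas as pd

# Hypothetical toy data, not the real Pokemon dataset.
toy_df = pd.DataFrame({
    "Name": ["Sparky", "Flare", "Ember", "Bubbles"],
    "Type": ["Electric", "Fire", "Fire", "Water"],
    "Speed": [90, 65, 50, 43],
})

# Step 1: the comparison gives one True/False value per row.
mask = toy_df["Type"] == "Fire"

# Step 2: indexing with the mask keeps only the True rows.
fire = toy_df[mask]

# Two conditions are combined with &, each wrapped in parentheses.
fast_fire = toy_df[(toy_df["Type"] == "Fire") & (toy_df["Speed"] > 60)]

print(mask)
print(fire)
print(fast_fire)
```

The parentheses around each condition matter, because `&` would otherwise be evaluated before the comparisons.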

What if you add `.info()` to the end of any of these?

How would you figure out how many pokemon are psychic type and also have a Speed stat of at least 120?

Next we'll create a copy of `df` so that we can edit things without changing the original dataframe. We'll call this copy `df_copy`
* Adding the argument `deep=True` ensures that changing the values in the copy won't also change values in the original (this is also the default behavior of `copy()`)


In [None]:
df_copy = df.copy(deep=True)

Let's add a new column to `df_copy`

Try running `df_copy["Attack Speed Ratio"] = df_copy["Attack"] / df_copy["Speed"]`

Then use `df_copy.head()` to see the results
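
Here is a minimal sketch of the same pattern on made-up data (the names, numbers, and the column name `"Attack Plus Speed"` are all hypothetical). It adds a column to a copy and checks that the original dataframe is left alone.

```python
import pandas as pd

# Hypothetical toy data, not the real Pokemon dataset.
toy_df = pd.DataFrame({
    "Name": ["Sparky", "Leafy", "Bubbles"],
    "Attack": [55, 49, 48],
    "Speed": [90, 45, 43],
})

toy_copy = toy_df.copy(deep=True)

# Arithmetic on two columns is done row by row, producing a new column.
toy_copy["Attack Plus Speed"] = toy_copy["Attack"] + toy_copy["Speed"]

print(toy_copy)
print(toy_df.columns.tolist())  # the original has no new column
```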

There are many more things we can do with `pandas` dataframes.  This is only scratching the surface.  If there is ever something you want to be able to do, you can peruse the documentation I linked above, or try googling whatever it is you want to do (I do this all the time).

Your goal with learning pandas should just be to build familiarity with it.  You won't be expected to memorize all of the functions.  Instead, as long as you know *what* you can do, you can always do a google search to figure out *how* to do it.

# **Homework 1**

Below, we have homework number 1.

These problems will use the dataframe of pokemon data that we used in class.  This should be called `df` in this notebook.  

If your runtime stopped, you may have to run the earlier cells that we used to download the data and define the dataframe.

## Instructions for Homework
* For each problem, find a way to work with the dataframe to answer the question
* Show any relevant functions/code you used in order to get to your answer
* Have your answer clearly stated by using a text cell
  * At the top of the window you should see "+ Code" and "+ Text" buttons. To add a text cell click the "+ Text" button. If you want to try formatting things nicely, look up the markdown markup language.
  * You could write something like "Problem 5 answer: \<type answer here\>"
  * Feel free to format your answer how you see fit.  Just make sure it's clear to me what your answer is.
* This homework (and probably most) will be graded mostly for completion. If you get stuck on a problem but it's clear that you made an effort then that's mainly what I'll look for.
  * Keep the code you write, even if you don't figure out how to solve a problem.
  * If you do get stuck, please also write a short explanation of what you tried and what ended up happening.  It's really good to practice explaining what you're having trouble with.  The people who do this the best usually end up learning the most.
* Try to solve each problem using only things we've covered here (unless I specifically suggest looking something up). If you still need further help, you can turn to Google (do not use ChatGPT or another LLM). However, don't just search for how to solve this specific problem.
  * Instead think about what specific functionality you might need to figure out a certain step in a problem. Then search for how to use Pandas to do that.
    * Don't search, "If I have a dataframe of pokemon and their stats, how do I \<insert problem here\>". You will probably not learn this way.
    * Searching something like: "How do I sort a Pandas dataframe based on a certain column" is totally fine.
  * Then make sure you understand each step of whatever code examples you find and try to apply that knowledge to your use case.



### HW Problem 1

Find the average `Total` stat for Legendary Pokemon. Then do the same for non-Legendary pokemon.  Which is bigger? Does this seem reasonable to you?

*Optional challenge:* Figure out which type has the *most* legendary pokemon. For this, try googling the `value_counts()` function for Pandas.



### HW Problem 2

Find the five water-type pokemon with the highest HP stat.
* Note: A pokemon is considered water type as long as at least one of `Type 1` or `Type 2` is water.

### HW Problem 3

Figure out how many pokemon have a higher Attack stat than Defense stat.

### HW Problem 4

Which generation had the fastest pokemon on average?

### HW Problem 5

Create your own stat that you think measures something interesting by combining existing stats (Get Creative!).
* For example, maybe doing `(Attack + Sp. Atk + HP)/3` (i.e. the average of these three stats) measures some sort of "bulkiness". If this is high, it seems the pokemon can hit hard, but also take a lot of hits. (Don't use this example)
* Give a name to this new stat and create a new column for it
* Find the top 5 pokemon based on this metric
* Explain why you think your new stat is interesting and meaningful (this can just be a couple sentences).

### HW Problem 6

What's your favorite pokemon? (for those familiar with pokemon, you can only pick from generations 1 to 6... sorry)
* If you don't really have one or aren't so familiar with pokemon, pick a number and use the pokemon with that `#`.

When ranking the pokemon by speed, where does your chosen pokemon rank?