**Learning Outcomes:**

-- Build familiarity with pandas and pandas syntax.

-- Learn key data structures: DataFrame, Series, and Index.

-- Demonstrate how the `Series` class builds on NumPy data types. Remember that a `Series` stores its values in a NumPy array.

-- Understand methods for extracting data: .loc, .iloc, and [].

There are three fundamental data structures in `pandas`:

1. **`Series`**: 1D labeled array data; best thought of as columnar data.
2. **`DataFrame`**: 2D tabular data with rows and columns.
3. **`Index`**: A sequence of row/column labels.
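
As a rough illustration of how the three structures relate, the sketch below builds a tiny `DataFrame` and pulls a `Series` and an `Index` out of it. The variable names (`df_demo`, `ages`, `row_labels`, `col_labels`) and the two rows of data are made up for illustration.

```python
import pandas as pd

# A tiny, made-up table (illustrative data only)
df_demo = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

ages = df_demo["Age"]          # one column of the table is a Series
row_labels = df_demo.index     # row labels live in an Index (a RangeIndex by default)
col_labels = df_demo.columns   # column labels are stored in an Index as well

print(type(df_demo).__name__)
print(type(ages).__name__)
print(isinstance(row_labels, pd.Index), isinstance(col_labels, pd.Index))
```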

# The `DataFrame`


The `DataFrame` is conceptually a two-dimensional data structure with an index and multiple columns of content, where each column has a label. You can think of the `DataFrame` itself as simply a two-axis labeled array.

* A Pandas `DataFrame` is built on top of NumPy arrays, providing a more flexible and **labeled** interface for working with structured data.

| Name    | Age | City     |
|---------|-----|----------|
| Alice   | 25  | New York |
| Bob     | 30  | Paris    |
| Charlie | 35  | London   |


* Each column in a `DataFrame` can be considered as a 1-dimensional NumPy array, and the entire DataFrame is essentially a collection of these arrays.

* Each column in a `DataFrame` has a unique name, which allows for easy identification and access.
    
* Rows in a `DataFrame` are also labeled: each row has an index label, which is usually (but not necessarily) unique.

* You can create a `DataFrame` from a NumPy array using the `pd.DataFrame()` constructor from the Pandas library.
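
The last point can be sketched as follows. This is a minimal example under stated assumptions: the array values, the column names `x` and `y`, and the row labels `r1` to `r3` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 array of values (made up for illustration)
arr = np.array([[1, 2], [3, 4], [5, 6]])

# The labels are supplied separately from the data
df_from_array = pd.DataFrame(arr, columns=["x", "y"], index=["r1", "r2", "r3"])
print(df_from_array)

# Each column can be converted back to a 1-dimensional NumPy array
x_values = df_from_array["x"].to_numpy()
print(type(x_values), x_values.ndim)
```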

### `DataFrames`

Typically, we will work with `Series` using the perspective that they are columns in a `DataFrame`. We can think of a **`DataFrame`** as a collection of **`Series`** that all share the same **`Index`**. A Series is a 1-D labeled array of data. We can think of it as columnar data.



#### Creating a `DataFrame`

There are many ways to create a `DataFrame`. Here, we will cover the most popular approaches:

1. From a CSV file.
2. Using a list and column name(s).
3. From a dictionary.
4. From a `Series`.





In [1]:
# `pd` is the conventional alias for Pandas, as `np` is for NumPy
import pandas as pd

### Creating a DataFrame From a CSV file
Data are often stored in CSV (comma-separated values) files. We can import a CSV file into a `DataFrame` by passing the file path (or URL) as an argument to the pandas function `pd.read_csv("filename.csv")`.

In [2]:
#  pd.read_csv("filename.csv")

In [3]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/sumonacalpoly/Datasets/main/Iris.csv")
df

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
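
If you want to try `pd.read_csv` without a file or network access, one option is to read from an in-memory string. The sketch below assumes a shortened, three-column version of the first two Iris rows shown above; `csv_text`, `df_small`, and `df_by_id` are illustrative names. It also shows the `index_col` parameter, which uses a column as the row labels instead of the default 0, 1, 2, ...

```python
import io
import pandas as pd

# A shortened copy of the first two Iris rows (three columns only)
csv_text = """Id,SepalLengthCm,Species
1,5.1,Iris-setosa
2,4.9,Iris-setosa
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
df_small = pd.read_csv(io.StringIO(csv_text))
print(df_small)

# index_col turns the named column into the row labels
df_by_id = pd.read_csv(io.StringIO(csv_text), index_col="Id")
print(df_by_id)
```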


### Creating a DataFrame Using a List and Column Name(s)

We'll now explore creating a `DataFrame` with data of our own. More generally, the syntax for creating a `DataFrame` is:
        
`pandas.DataFrame(data, index, columns)`



Consider the following examples. The first code cell creates a `DataFrame` with a single column `Numbers`.

In [4]:
df_list = pd.DataFrame([1, 2, 3], columns=["Numbers"])
df_list

Unnamed: 0,Numbers
0,1
1,2
2,3




The second creates a `DataFrame` with the columns `Number` and `Description`. Notice that a 2D list of values is required to initialize the second `DataFrame`: each nested list represents a single row of data.


In [5]:
df_list = pd.DataFrame([[1, "one"], [2, "two"]], columns = ["Number", "Description"])
df_list

Unnamed: 0,Number,Description
0,1,one
1,2,two
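
The two examples above rely on the default integer index. As a small sketch of the `index` argument from the general syntax, the same rows can be given explicit row labels; the labels `"a"` and `"b"` and the name `df_labeled` are assumptions chosen for illustration.

```python
import pandas as pd

# Same data as above, now with explicit (hypothetical) row labels
df_labeled = pd.DataFrame(
    [[1, "one"], [2, "two"]],
    index=["a", "b"],
    columns=["Number", "Description"],
)
print(df_labeled)
```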


### From a Dictionary

A third way to create a `DataFrame` is with a dictionary. The dictionary keys represent the column names, and the dictionary values represent the column values.



In [6]:
# Specify the columns as keys of the dictionary
df_dict = pd.DataFrame({
    "Fruit": ["Strawberry", "Orange"],
    "Price": [5.49, 3.99]
})
df_dict

Unnamed: 0,Fruit,Price
0,Strawberry,5.49
1,Orange,3.99


### From a `Series`

 A `Series` represents a column of a `DataFrame`; more generally, it can be any 1-dimensional array-like object. It contains both:

- A sequence of **values** of the same type.
- A sequence of data labels called the **index**.

The `Series` is one of the core data structures in pandas. You can think of it as a cross between a list and a dictionary: the items are stored in order, and there are labels with which you can retrieve them. An easy way to visualize this is as two columns of data. The first is the special index, much like the keys of a dictionary, while the second is your actual data. Note that the data column has a label of its own, which can be retrieved using the `.name` attribute. This differs from dictionaries and is useful when merging multiple columns of data.
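
A short sketch of the list/dictionary analogy and the `.name` attribute, reusing the fruit prices from the dictionary example; the variable names `prices` and `df_prices` are illustrative.

```python
import pandas as pd

# A named Series: the index plays the role of dictionary keys
prices = pd.Series([5.49, 3.99], index=["Strawberry", "Orange"], name="Price")

print(prices.name)         # label of the data column
print(prices["Orange"])    # dictionary-style lookup by label
print(prices.iloc[0])      # list-style lookup by position

# When the Series becomes a DataFrame, its name becomes the column label
df_prices = prices.to_frame()
print(df_prices)
```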


By default, the index of a `Series` is a sequence of integers beginning at 0. Optionally, a list of desired labels can be passed to the `index` argument.

It follows, then, that a `DataFrame` is equivalent to a collection of `Series` that all share the same `Index`.

In fact, we can initialize a `DataFrame` by merging two or more `Series`.

In [7]:
import pandas as pd


s = pd.Series([-1, 10, 2], index = ["a", "b", "c"])
s


a    -1
b    10
c     2
dtype: int64

In [8]:
s.index

Index(['a', 'b', 'c'], dtype='object')

In [9]:
#Indices can also be changed after initialization.
s.index = ["first", "second", "third"]
s

first     -1
second    10
third      2
dtype: int64

In [10]:
#Another Example:
s = pd.Series(["Welcome", "to", "Data 301"])

In [11]:
# Accessing data values within the Series
s.values

array(['Welcome', 'to', 'Data 301'], dtype=object)

If we pass in a list of whole numbers, for instance, pandas sets the type to `int64`. Under the hood, pandas stores `Series` values in a typed array using the NumPy library. This offers a significant speedup when processing data compared with traditional Python lists.

In [12]:
# Let's create a little list of numbers
numbers = [1, 2, 3]
# And turn that into a series
pd.Series(numbers)
# And we see that the resulting Series has dtype int64

0    1
1    2
2    3
dtype: int64
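
To see the NumPy array underneath a `Series`, a minimal sketch is shown below. The names `s_num`, `values`, `doubled`, and `s_mixed` are illustrative; the exact integer width may depend on your platform and pandas version, so the code only checks that the dtype is an integer type.

```python
import numpy as np
import pandas as pd

s_num = pd.Series([1, 2, 3])

# The values are held in a typed NumPy array
values = s_num.to_numpy()
print(type(values))
print(s_num.dtype)

# Arithmetic is applied to the whole array at once, with no Python loop
doubled = s_num * 2
print(doubled)

# Mixing in a float makes pandas pick a float dtype for the whole Series
s_mixed = pd.Series([1, 2.5, 3])
print(s_mixed.dtype)
```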

# Practice Examples: Set-1

Write the code to generate the following two `Series`, `s_a` and `s_b`.

```
s_a =

r1	a1
r2	a2
r3	a3

```
and

```
s_b =

r1	b1
r2	b2
r3	b3

```


In [21]:
# s_a[1]
s_a = pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
print(s_a)
s_b = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])
print(s_b)

r1    a1
r2    a2
r3    a3
dtype: object
r1    b1
r2    b2
r3    b3
dtype: object


We can turn an individual `Series` into a `DataFrame`, as shown below:

In [23]:
pd.DataFrame(s_a)

Unnamed: 0,0
r1,a1
r2,a2
r3,a3


**In general, we can build a `DataFrame` with several columns using the dictionary method, where each dictionary value (a list or a `Series`) becomes one column:**

In [None]:
import pandas as pd
data = {
    "column1": ["a1", "a2", "a3"],
    "column2": ["x1", "x2", "x3"]
}
df = pd.DataFrame(data, index=["r1", "r2", "r3"])
df

Unnamed: 0,column1,column2
r1,a1,x1
r2,a2,x2
r3,a3,x3
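
The dictionary values can also be `Series` objects. The sketch below combines the two practice `Series` into one `DataFrame`; the column names `"A"` and `"B"` and the names `df_ab` and `df_concat` are assumptions chosen for illustration. `pd.concat` with `axis=1` is shown as one possible alternative, not the only way to do this.

```python
import pandas as pd

s_a = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index=["r1", "r2", "r3"])

# Dictionary keys become column names; the shared index becomes the row labels
df_ab = pd.DataFrame({"A": s_a, "B": s_b})
print(df_ab)

# Alternative: name each Series, then place them side by side as columns
df_concat = pd.concat([s_a.rename("A"), s_b.rename("B")], axis=1)
print(df_concat)
```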


## Slicing in `DataFrames` to extract subsets of data



The simplest way to manipulate a `DataFrame` is to extract a subset of rows and columns, known as **slicing**.

Common ways we may want to extract data are grabbing:

- The first or last `n` rows in the `DataFrame`.
- Data with a certain label.
- Data at a certain position.

We will do so with four primary tools of the `DataFrame` class:

1. `.head` and `.tail`
2. `.loc`
3. `.iloc`
4. `[]`

### Extracting data with `.head` and `.tail`

The simplest scenario in which we want to extract data is when we simply want to select the first or last few rows of the `DataFrame`.

To extract the first `n` rows of a `DataFrame` `df`, we use the syntax `df.head(n)`; to extract the last `n` rows, we use `df.tail(n)`. If `n` is omitted, both default to 5 rows.


In [None]:
df.head()

Unnamed: 0,column1,column2
r1,a1,x1
r2,a2,x2
r3,a3,x3


In [None]:
df.head(1)

Unnamed: 0,column1,column2
r1,a1,x1


In [None]:
df.tail(2)

Unnamed: 0,column1,column2
r2,a2,x2
r3,a3,x3


### Label-based Extraction: Indexing with `.loc`

For the more complex task of extracting data with specific column or index labels, we can use `.loc`. In pandas, `.loc[]` is an indexer used to access rows and columns in a `DataFrame` by label (not by position).

The **row labels** (commonly referred to as the **index**) are the bold text on the far *left* of a `DataFrame`, while the **column labels** are the column names found at the *top* of a `DataFrame`.
```
df.loc[row_label, column_label]
```

To grab data with `.loc`, we must specify the row and column label(s) where the data exists. The row labels are the first argument to `.loc`; the column labels are the second.

Arguments to `.loc` can be:

- A single value.
- A slice.
- A list.
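
The three kinds of arguments can be sketched on the small `r1`/`r2`/`r3` frame used earlier (rebuilt here as `df_loc` so the example is self-contained; the result names are illustrative). Note that a label slice in `.loc` includes both endpoints.

```python
import pandas as pd

df_loc = pd.DataFrame(
    {"column1": ["a1", "a2", "a3"], "column2": ["x1", "x2", "x3"]},
    index=["r1", "r2", "r3"],
)

# Single values: one row label and one column label give a single cell
single = df_loc.loc["r1", "column2"]

# A slice of row labels: both "r1" and "r2" are included
sliced = df_loc.loc["r1":"r2", "column1"]

# Lists of labels: the result is a smaller DataFrame
listed = df_loc.loc[["r1", "r3"], ["column1", "column2"]]

print(single)
print(sliced)
print(listed)
```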


### Integer-based Extraction: Indexing with `.iloc`

Slicing with `.iloc` works similarly to `.loc`. However, `.iloc` uses the *integer positions* of rows and columns rather than the labels (think to yourself: **loc** uses **labels**; **iloc** uses **integer positions**). The arguments to `.iloc` also behave similarly: single values, lists, slices, and any combination of these are permitted. One difference is that an `.iloc` slice excludes its end position, while a `.loc` slice includes its end label.



### `.loc[]` vs `.iloc[]`

| Feature     | `.loc[]`                 | `.iloc[]`               |
|-------------|--------------------------|--------------------------|
| Access by   | **Label**                | **Integer position**     |
| Example     | `df.loc["A"]`            | `df.iloc[0]`             |

---
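
A minimal side-by-side sketch of the two indexers, assuming a small frame with row labels `A`, `B`, `C` (the names `df_cmp`, `by_label`, `by_position`, and `same_cell` are illustrative). It shows that a label slice includes its end label, while a position slice stops before its end position.

```python
import pandas as pd

df_cmp = pd.DataFrame(
    {"name": ["Sumona", "Eli", "Michael"],
     "role": ["Professor", "Student", "Student"]},
    index=["A", "B", "C"],
)

# Label slice: rows "A" through "B", end label included
by_label = df_cmp.loc["A":"B"]

# Position slice: positions 0 and 1, end position 2 excluded
by_position = df_cmp.iloc[0:2]

print(by_label)
print(by_position)

# The same cell addressed by label and by position
same_cell = df_cmp.loc["B", "name"] == df_cmp.iloc[1, 0]
print(same_cell)
```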


In [None]:
df.loc['r1']

Unnamed: 0,r1
column1,a1
column2,x1


In [None]:
# Retrieve the value in the row labeled 'r1' and the column named 'column2'.
df.loc['r1', 'column2']

In [None]:
# Remember, iloc uses integer-based indexing, referring to the position of rows and columns rather than their labels.
#Accessing a single element
print(df.iloc[0, 1])  # Accesses the element at row 0, column 1

In [None]:
# Accessing a row
print(df.iloc[1])  # Accesses the entire row at index 1


In [None]:
# Accessing a column
print(df.iloc[:, 0])  # Accesses the entire column at index 0

In [None]:
# Slicing rows and columns
print(df.iloc[0:2, 1:])  # Accesses rows 0 and 1, and columns starting from index 1

In [None]:
import pandas as pd

data = {
    "name": ["Sumona", "Eli", "Michael"],
    "role": ["Professor", "Student", "Student"]
}

df_1 = pd.DataFrame(data, index=["A", "B", "C"])
df_1

Remember, `iloc` uses integer-based indexing, referring to the position of rows and columns rather than their labels.

### How to change the index of a DataFrame

The index of a DataFrame is essentially the row labels, used to uniquely identify each row. By default, when you create a DataFrame, it has a numerical index (0, 1, 2, etc.). However, often it's more meaningful to use a specific column or a combination of columns as the index. This is where `set_index` comes in.

`set_index` returns a new DataFrame with the modified index. The original DataFrame is not changed.
You can set a single column or a list of columns as the index.

In [None]:
# Sample DataFrame
data = {'Name': ['Sally', 'Harry', 'Sherlock'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)
df



In [None]:
# Set 'Name' as the index
df_indexed = df.set_index('Name')

print(df_indexed)

In [None]:
print(df)
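
The same idea extends to a list of columns. The sketch below is one possible illustration rather than a recommended layout: it rebuilds the sample data as `df_people`, uses `City` and `Name` together as the index, and then undoes the change with `reset_index`. The names `df_people`, `df_multi`, and `df_back` are assumptions.

```python
import pandas as pd

df_people = pd.DataFrame({
    "Name": ["Sally", "Harry", "Sherlock"],
    "Age": [25, 30, 35],
    "City": ["New York", "Paris", "London"],
})

# A list of columns produces a two-level index
df_multi = df_people.set_index(["City", "Name"])
print(df_multi)

# Rows are then addressed with a tuple of labels, one per index level
print(df_multi.loc[("Paris", "Harry"), "Age"])

# set_index returned a new object, so the original still has all three columns
print(list(df_people.columns))

# reset_index moves the index levels back into ordinary columns
df_back = df_multi.reset_index()
print(df_back)
```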

* Accessing Rows Without `set_index`

1.   Using `.iloc` for integer-based indexing
2.   Using boolean indexing based on column values
3.   Using `.loc` with numerical indices (less common)




In [None]:
#This accesses the first row using its integer position (index 0).
print(df.iloc[0])



In [None]:
# Accessing the second and third rows (positions 1 and 2, excluding 3)
print(df.iloc[1:3])

In [None]:
# Using boolean indexing based on column values:

# Accessing rows where the 'Name' column is 'Sally'
# This creates a boolean mask where True marks the rows whose 'Name' is 'Sally';
# the mask is then used to select those rows from the DataFrame.
print(df[df['Name'] == 'Sally'])



In [None]:
# Accessing rows where the 'Age' column is greater than 30
#this creates a boolean mask for rows where the 'Age' column is greater than 30 and selects them.
print(df[df['Age'] > 30])

In [None]:
#Using .loc with numerical indices (less common):

# Accessing the first row (index 0)
print(df.loc[0])

# Accessing the second and third rows (indices 1 and 2)
print(df.loc[[1, 2]])

* Accessing Rows after `set_index`

In [None]:
# Accessing rows using .loc
print(df_indexed.loc['Sally'])  # Access the row with index label 'Sally'
print(df_indexed.loc[['Sally', 'Harry']])  # Access the rows with index labels 'Sally' and 'Harry'

# Accessing a single value: select the 'Age' column first, then the index label
print(df_indexed['Age']['Sally'])  # Access the 'Age' value for the row with index 'Sally'

# In-Class Practice Examples: Set-2

In [None]:
# Let's start by importing the pandas library
import pandas as pd

## Question: How can you create the following DataFrame with multiple columns?

```
	column1	  column2
r1	a1	    x1
r2	a2	    x2
r3	a3	    x3
```

In [None]:
# Let's create three school records for students and their
# class grades. I'll create each as a Series which has a student name, the class name, and the score.
record1 = pd.Series({'Name': 'Alice',
                        'Class': 'Physics',
                        'Score': 85})
record2 = pd.Series({'Name': 'Jack',
                        'Class': 'Chemistry',
                        'Score': 82})
record3 = pd.Series({'Name': 'Helen',
                        'Class': 'Biology',
                        'Score': 90})

In [None]:
# Write the code to convert the three Series records into a DataFrame



In [None]:
#display the dataframe: show several rows of the dataframe

In [None]:
# You'll notice here that Jupyter creates a nice bit of HTML to render the results of the
# dataframe. So we have the index, which is the leftmost column and is the school name, and
# then we have the rows of data, where each row has a column header which was given in our initial
# record dictionaries

### Creating a DataFrame Using a List of Dictionaries

In [None]:
# An alternative method is that you could use a list of dictionaries, where each dictionary
# represents a row of data.

students = [{'Name': 'Alice',
              'Class': 'Physics',
              'Score': 85},
            {'Name': 'Jack',
             'Class': 'Chemistry',
             'Score': 82},
            {'Name': 'Helen',
             'Class': 'Biology',
             'Score': 90}]

# Then we pass this list of dictionaries into the DataFrame function
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])


In [None]:
#  print the dataframe (display)


In [None]:
# Extract data using the .iloc and .loc attributes. Because the
# DataFrame is two-dimensional, passing a single value to the .loc indexing operator will return
# a Series if there's only one row to return.

# For instance, if we wanted to select the data associated with school2, what is the code using
# the .loc attribute?


In [None]:
# One of the powers of the pandas DataFrame is that you can quickly select data based on multiple axes.
# For instance, if you wanted to just list the student names for school1, you would supply two
# parameters to .loc, one being the row index and the other being the column name.

# Write the code for finding (and then display)  school1's student names


In [None]:
#  As we saw, .loc does row selection, and it can take two parameters,
# the row index and the list of column names. The .loc attribute also supports slicing.

# If we wanted to select all rows, we can use a colon to indicate a full slice from beginning to end.
# This is just like slicing a list in Python. Then we can add the column name as the
# second parameter as a string. If we wanted to include multiple columns, we could do so in a list.
# and Pandas will bring back only the columns we have asked for.

# Write the code to ask for all the names and scores for all schools using the .loc operator.
df.loc[:,['Name', 'Score']]

In [None]:
# Question: Code Critique - What is wrong with this code, and how would you fix it so that you get a DataFrame with two columns and the index a, b, c?
import pandas as pd


s = pd.Series([9.5, 10, 25], ["apple","pie","pizza"],index = ["a", "b", "c"])
s
