# Week 4
# Pandas Data Frames (Part 1)

[Pandas](https://pandas.pydata.org/) is a major tool for data scientists working in Python. It provides data structures and data manipulation tools designed to make data cleaning and analysis fast and easy.

References:
- Textbook Chapter 5: Getting Started with Pandas
- [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html)
- [Pandas Exercises on w3resource](https://www.w3resource.com/python-exercises/pandas/index.php)

In [1]:
import numpy as np
import pandas as pd # pd is the universally-used abbreviation

Pandas provides two data types that build on NumPy arrays:
- `Series`: extends a 1D array with labels, used to represent a single feature
- `DataFrame`: extends a 2D array with row and column labels, used to represent a data table

We will focus on data frames today, as most data sets are stored in table format.

In [2]:
# Define a DataFrame from scratch
df1 = pd.DataFrame(np.random.rand(5, 3),
                   columns=['Feature1', 'Feature2', 'Feature3'])
df1.head() # prints the first several rows

Unnamed: 0,Feature1,Feature2,Feature3
0,0.983302,0.27843,0.192737
1,0.247968,0.037161,0.100702
2,0.094191,0.911979,0.519443
3,0.231755,0.83948,0.465626
4,0.264518,0.668683,0.729336


In [3]:
# Print the shape of the data frame
print(df1.shape)

(5, 3)


In [4]:
# Row indices
print(df1.index)
# print(df1.index.values)

RangeIndex(start=0, stop=5, step=1)


In [5]:
# Column indices
print(df1.columns)
# print(df1.columns.values)

Index(['Feature1', 'Feature2', 'Feature3'], dtype='object')


In [6]:
# Access elements using .loc[row_index, col_index]
# Ex: Print the Feature1 value on the first row
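One possible solution, as a minimal sketch. Because `df1` above is filled with random numbers, the frame is rebuilt here from the values displayed in the output so the example is reproducible.

```python
import pandas as pd

# Rebuild df1 from the values printed above (the original used np.random.rand)
df1 = pd.DataFrame(
    [[0.983302, 0.278430, 0.192737],
     [0.247968, 0.037161, 0.100702],
     [0.094191, 0.911979, 0.519443],
     [0.231755, 0.839480, 0.465626],
     [0.264518, 0.668683, 0.729336]],
    columns=['Feature1', 'Feature2', 'Feature3'])

# .loc takes a row label and a column label; the first row has label 0
first_value = df1.loc[0, 'Feature1']
print(first_value)  # 0.983302
```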



In [7]:
# Index slicing
# Ex: Print the Feature2 value for the first 3 rows
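One possible solution, sketched on a frame rebuilt from the values displayed earlier. Note that a label slice in `.loc` includes both endpoints, so `0:2` selects three rows.

```python
import pandas as pd

# Rebuild df1 from the values printed above (the original used np.random.rand)
df1 = pd.DataFrame(
    [[0.983302, 0.278430, 0.192737],
     [0.247968, 0.037161, 0.100702],
     [0.094191, 0.911979, 0.519443],
     [0.231755, 0.839480, 0.465626],
     [0.264518, 0.668683, 0.729336]],
    columns=['Feature1', 'Feature2', 'Feature3'])

# Label slice 0:2 is inclusive on both ends: rows 0, 1, and 2
first_three = df1.loc[0:2, 'Feature2']
print(first_three)
```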


In [8]:
# Ex: Print the Feature2 and Feature3 values for the last 3 rows
# .loc[] is label-based: a negative number is treated as a label, not as a position from the end
print(df1.loc[2:4, ['Feature2', 'Feature3']])

   Feature2  Feature3
2  0.911979  0.519443
3  0.839480  0.465626
4  0.668683  0.729336
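When the row labels are not known in advance, position-based selection with `.iloc` is one alternative, since it accepts negative positions. A minimal sketch on a frame rebuilt from the values displayed earlier:

```python
import pandas as pd

# Rebuild df1 from the values printed above (the original used np.random.rand)
df1 = pd.DataFrame(
    [[0.983302, 0.278430, 0.192737],
     [0.247968, 0.037161, 0.100702],
     [0.094191, 0.911979, 0.519443],
     [0.231755, 0.839480, 0.465626],
     [0.264518, 0.668683, 0.729336]],
    columns=['Feature1', 'Feature2', 'Feature3'])

# .iloc selects by position, so -3: means "the last three rows"
last_three = df1.iloc[-3:][['Feature2', 'Feature3']]
print(last_three)
```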


In [9]:
# Ex: Use boolean indexing to extract the last 3 rows
row_indices = (df1.index >= 2)
print(row_indices)
print(df1.loc[row_indices, :])

[False False  True  True  True]
   Feature1  Feature2  Feature3
2  0.094191  0.911979  0.519443
3  0.231755  0.839480  0.465626
4  0.264518  0.668683  0.729336


## Basic Table Operations
- Change a value 
- Add a new row
- Add a new column
- Remove a row
- Remove a column

In [10]:
data = [[60, 70, 80],
        [66, 88, 77],
        [100, 60, 30],
        [85, 87, 83]]
scores = pd.DataFrame(data,
                      index=['Alice', 'Bob', 'Chris', 'David'],
                      columns=['Quiz1', 'Quiz2', 'Final'])
scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60,70,80
Bob,66,88,77
Chris,100,60,30
David,85,87,83


In [11]:
# Change Alice's final score to 90.
scores.loc['Alice', 'Final'] = 90
scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60,70,90
Bob,66,88,77
Chris,100,60,30
David,85,87,83


In [12]:
# Add a new row: "Edward": [77, 88, 99]
scores.loc['Edward', :] = [77, 88, 99]
scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,85.0,87.0,83.0
Edward,77.0,88.0,99.0


In [17]:
# Append a new data frame
more_scores = pd.DataFrame(data={'Quiz1': [67, 76],
                                 'Quiz2': [78, 87],
                                 'Final': [89, 98]},
                           index=['Flora', 'Gabriel']) # Represent data as a dictionary
print(more_scores)
total_scores = pd.concat([scores, more_scores]) # concat() creates a new data frame; DataFrame.append() was removed in pandas 2.0
total_scores

         Quiz1  Quiz2  Final
Flora       67     78     89
Gabriel     76     87     98


Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,85.0,87.0,83.0
Edward,77.0,88.0,99.0
Flora,67.0,78.0,89.0
Gabriel,76.0,87.0,98.0


In [21]:
# Add a column "ExtraCredit"
total_scores['ExtraCredit'] = [0, 1, 2, 3, 4, 5, 6]
total_scores

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0
Bob,66.0,88.0,77.0,1
Chris,100.0,60.0,30.0,2
David,85.0,87.0,83.0,3
Edward,77.0,88.0,99.0,4
Flora,67.0,78.0,89.0,5
Gabriel,76.0,87.0,98.0,6


In [22]:
# Add additional columns from another data frame
# will be discussed in Chapter 8


In [23]:
# Remove record for Chris
total_scores.drop('Chris') # drop() creates a new data frame



Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0
Bob,66.0,88.0,77.0,1
David,85.0,87.0,83.0,3
Edward,77.0,88.0,99.0,4
Flora,67.0,78.0,89.0,5
Gabriel,76.0,87.0,98.0,6


In [24]:
# Remove column "ExtraCredit"
total_scores.drop('ExtraCredit', axis=1) # drop() creates a new data frame



Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,85.0,87.0,83.0
Edward,77.0,88.0,99.0
Flora,67.0,78.0,89.0
Gabriel,76.0,87.0,98.0


In [25]:
# Remove both David and Flora
total_scores.drop(['David', 'Flora'])

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0
Bob,66.0,88.0,77.0,1
Chris,100.0,60.0,30.0,2
Edward,77.0,88.0,99.0,4
Gabriel,76.0,87.0,98.0,6
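Because `drop()` returns a new data frame, the original is unchanged unless the result is assigned back. The sketch below illustrates this on a three-student subset of the table; the variable name `without_chris` is only for illustration.

```python
import pandas as pd

# A three-student subset of the scores table shown above
scores = pd.DataFrame({'Quiz1': [60, 66, 100],
                       'Quiz2': [70, 88, 60],
                       'Final': [90, 77, 30]},
                      index=['Alice', 'Bob', 'Chris'])

without_chris = scores.drop('Chris')   # new frame without the row
print('Chris' in scores.index)         # True: the original is untouched
print('Chris' in without_chris.index)  # False

# To keep the change, assign the result back
scores = scores.drop('Chris')
print(list(scores.index))              # ['Alice', 'Bob']
```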


## Table Arithmetic
- Apply an operation uniformly to all values in a column
- Arithmetic with multiple columns
- Calculate statistics
- Apply a user-defined function to all rows

In [None]:
# Double the extra credits
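One way to do this, as a minimal sketch: arithmetic on a column is applied to every value at once. Only the `ExtraCredit` column is reproduced here.

```python
import pandas as pd

# Only the ExtraCredit column of total_scores is reproduced here
total_scores = pd.DataFrame(
    {'ExtraCredit': [0, 1, 2, 3, 4, 5, 6]},
    index=['Alice', 'Bob', 'Chris', 'David', 'Edward', 'Flora', 'Gabriel'])

# Multiply every value in the column by 2 and store the result back
total_scores['ExtraCredit'] = total_scores['ExtraCredit'] * 2
print(total_scores['ExtraCredit'].tolist())  # [0, 2, 4, 6, 8, 10, 12]
```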



In [None]:
# Calculate grades:
#   Grades = Quiz1 * 25% + Quiz2 * 25% + Final * 50% + ExtraCredit
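A sketch of the formula using column arithmetic. It assumes the `ExtraCredit` values as displayed in the table earlier (0 through 6, not doubled); if the doubling step has been run first, the results will differ.

```python
import pandas as pd

# total_scores as displayed above (assumes ExtraCredit has not been doubled)
total_scores = pd.DataFrame(
    {'Quiz1': [60.0, 66.0, 100.0, 85.0, 77.0, 67.0, 76.0],
     'Quiz2': [70.0, 88.0, 60.0, 87.0, 88.0, 78.0, 87.0],
     'Final': [90.0, 77.0, 30.0, 83.0, 99.0, 89.0, 98.0],
     'ExtraCredit': [0, 1, 2, 3, 4, 5, 6]},
    index=['Alice', 'Bob', 'Chris', 'David', 'Edward', 'Flora', 'Gabriel'])

# Columns are combined row by row, matched on the index
total_scores['Grades'] = (total_scores['Quiz1'] * 0.25
                          + total_scores['Quiz2'] * 0.25
                          + total_scores['Final'] * 0.50
                          + total_scores['ExtraCredit'])
print(total_scores['Grades'])
```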


In [None]:
# Ex: Curve the grades:
# Formula: CurvedGrades = sqrt(Grades) * 10
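One possible solution, using NumPy's element-wise `np.sqrt`. The `grades` values below are an assumption: they are what the grade formula above yields from the displayed table when `ExtraCredit` is not doubled.

```python
import numpy as np
import pandas as pd

# Assumed Grades values, computed from the displayed table (ExtraCredit not doubled)
grades = pd.Series([77.5, 78.0, 57.0, 87.5, 94.75, 85.75, 95.75],
                   index=['Alice', 'Bob', 'Chris', 'David',
                          'Edward', 'Flora', 'Gabriel'],
                   name='Grades')

# np.sqrt works element-wise on a Series and keeps the index
curved = np.sqrt(grades) * 10
print(curved.round(2))
```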



In [None]:
# Calculate the min, max, mean, median, variance, and std of the final grades
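A sketch using `Series.agg`, which computes several statistics in one call. The `grades` values are the same assumption as before (computed from the displayed table with `ExtraCredit` not doubled). Note that pandas computes the sample variance and standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Assumed Grades values, computed from the displayed table (ExtraCredit not doubled)
grades = pd.Series([77.5, 78.0, 57.0, 87.5, 94.75, 85.75, 95.75],
                   index=['Alice', 'Bob', 'Chris', 'David',
                          'Edward', 'Flora', 'Gabriel'],
                   name='Grades')

# Each name in the list is a built-in Series method
stats = grades.agg(['min', 'max', 'mean', 'median', 'var', 'std'])
print(stats)
```

The individual methods (`grades.min()`, `grades.mean()`, and so on) give the same numbers one at a time.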



In [None]:
# Ex: Define a function num2letter() that converts a numerical grade to a letter grade.
# For example, num2letter(95) returns 'A', num2letter(59) returns 'F'
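One possible definition. The exercise only fixes two data points (95 is an 'A', 59 is an 'F'), so the 90/80/70/60 cutoffs below are an assumption; adjust them to the course's grading scale.

```python
def num2letter(grade):
    """Convert a numerical grade to a letter grade.

    The 90/80/70/60 cutoffs are assumed, not given in the exercise.
    """
    if grade >= 90:
        return 'A'
    elif grade >= 80:
        return 'B'
    elif grade >= 70:
        return 'C'
    elif grade >= 60:
        return 'D'
    else:
        return 'F'

print(num2letter(95))  # A
print(num2letter(59))  # F
```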



In [None]:
# Apply the function to the final grade column
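A sketch using `Series.apply`, which calls the function once per value. Both the cutoffs in `num2letter` and the `grades` values are assumptions carried over from the earlier exercises.

```python
import pandas as pd

def num2letter(grade):
    # Assumed cutoffs: 90/80/70/60
    if grade >= 90:
        return 'A'
    elif grade >= 80:
        return 'B'
    elif grade >= 70:
        return 'C'
    elif grade >= 60:
        return 'D'
    return 'F'

# Assumed Grades values, computed from the displayed table (ExtraCredit not doubled)
grades = pd.Series([77.5, 78.0, 57.0, 87.5, 94.75, 85.75, 95.75],
                   index=['Alice', 'Bob', 'Chris', 'David',
                          'Edward', 'Flora', 'Gabriel'],
                   name='Grades')

# apply() passes each value to num2letter and collects the results
letters = grades.apply(num2letter)
print(letters.tolist())  # ['C', 'C', 'F', 'B', 'A', 'B', 'A']
```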



## Example: Revisit 80 Cereal Data

Using pandas DataFrames, let's repeat our analysis of the 80 Cereal Data:
- Load the csv file using `pd.read_csv()`
- Examine the data
- Explore the ratings
- Analyze sugar contents

In [26]:
# Load the dataset
raw_data = pd.read_csv('cereal.csv') 
raw_data.head() # by default, the first row of the file supplies the column names, and rows get an integer index

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


In [27]:
# Display the shape



In [28]:
# Display the columns



In [29]:
# Display the data types
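The three inspection steps (shape, columns, data types) can be sketched as follows. Since `cereal.csv` is not available here, the example uses a hypothetical stand-in frame `sample` built from four columns of the five rows shown by `head()` above; with the real file, use `raw_data` instead.

```python
import pandas as pd

# Stand-in for raw_data: four columns of the five rows shown by head()
sample = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran',
             'All-Bran with Extra Fiber', 'Almond Delight'],
    'sugars': [6, 8, 5, 0, 8],
    'weight': [1.0, 1.0, 1.0, 1.0, 1.0],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912, 34.384843]})

print(sample.shape)    # (rows, columns)
print(sample.columns)  # column labels
print(sample.dtypes)   # one data type per column
```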



In [30]:
# Display all cereal names



In [31]:
# Display all cereal ratings
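Selecting a single column with square brackets returns a `Series`. A minimal sketch for the names and ratings, again on a hypothetical stand-in frame `sample` built from the five rows shown by `head()`:

```python
import pandas as pd

# Stand-in for raw_data: two columns of the five rows shown by head()
sample = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran',
             'All-Bran with Extra Fiber', 'Almond Delight'],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912, 34.384843]})

names = sample['name']      # a Series of cereal names
ratings = sample['rating']  # a Series of ratings
print(names.tolist())
print(ratings.tolist())
```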



In [32]:
# Find the product name with highest rating
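One approach is `idxmax()`, which returns the row label of the largest value; that label can then be passed to `.loc`. The sketch runs on a hypothetical stand-in frame `sample` built from the five rows shown by `head()`, so the result applies to those five rows only, not to the full data set.

```python
import pandas as pd

# Stand-in for raw_data: two columns of the five rows shown by head()
sample = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran',
             'All-Bran with Extra Fiber', 'Almond Delight'],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912, 34.384843]})

best_row = sample['rating'].idxmax()       # row label of the highest rating
best_name = sample.loc[best_row, 'name']
print(best_name)  # All-Bran with Extra Fiber (among these five rows)
```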



In [33]:
# Display all cereals with rating above 60
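This is a direct use of boolean indexing: build a True/False mask from the condition and pass it to `.loc`. A sketch on a hypothetical stand-in frame `sample` built from the five rows shown by `head()`:

```python
import pandas as pd

# Stand-in for raw_data: two columns of the five rows shown by head()
sample = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran',
             'All-Bran with Extra Fiber', 'Almond Delight'],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912, 34.384843]})

mask = sample['rating'] > 60   # one boolean per row
high = sample.loc[mask, :]     # keep only the rows where mask is True
print(high)
```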



In [34]:
# Calculate sugar per ounce
# sugar per ounce = sugar per serving / ounce per serving
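A sketch of the formula as a column division. It assumes that `sugars` holds the sugar per serving and `weight` holds the ounces per serving; the column name `sugar_per_oz` is made up for this example. The stand-in frame `sample` is built from the five rows shown by `head()`.

```python
import pandas as pd

# Stand-in for raw_data: three columns of the five rows shown by head()
sample = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran',
             'All-Bran with Extra Fiber', 'Almond Delight'],
    'sugars': [6, 8, 5, 0, 8],            # assumed: sugar per serving
    'weight': [1.0, 1.0, 1.0, 1.0, 1.0]}) # assumed: ounces per serving

# Division is applied row by row
sample['sugar_per_oz'] = sample['sugars'] / sample['weight']
print(sample[['name', 'sugar_per_oz']])
```

The `head()` output above shows a value of -1 in the `potass` column, which suggests missing values may be coded as -1 in this file; this sketch does not handle that case.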



In [35]:
# Which product has the highest amount of sugar per ounce?
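One way to answer this is to compare the column against its maximum, which returns every product that ties for the top value (`idxmax()` would return only the first). The sketch makes the same assumptions as the formula above (`sugars` per serving, `weight` in ounces per serving, hypothetical column `sugar_per_oz`) and runs on the five rows shown by `head()`, so the result applies to those rows only.

```python
import pandas as pd

# Stand-in for raw_data: three columns of the five rows shown by head()
sample = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran',
             'All-Bran with Extra Fiber', 'Almond Delight'],
    'sugars': [6, 8, 5, 0, 8],
    'weight': [1.0, 1.0, 1.0, 1.0, 1.0]})
sample['sugar_per_oz'] = sample['sugars'] / sample['weight']

# Keep every row that reaches the maximum, so ties are not lost
top = sample.loc[sample['sugar_per_oz'] == sample['sugar_per_oz'].max(), 'name']
print(top.tolist())  # ['100% Natural Bran', 'Almond Delight']
```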

