# Week 4
# Pandas Data Frames (Part 1)

[Pandas](https://pandas.pydata.org/) is a major tool for data scientists working in Python. It provides data structures and data manipulation tools designed to make data cleaning and analysis fast and easy.

References:
- Textbook Chapter 5: Getting Started with Pandas

In [2]:
import numpy as np
import pandas as pd # pd is the universally-used abbreviation

Pandas provides two data types that build on numpy arrays:
- Series: a labeled 1D array, used to represent a single feature
- DataFrame: a labeled 2D array, used to represent a data table

We will focus on data frames today, as most data sets are stored in table format.
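
To make the relationship between the two types concrete, here is a minimal sketch with made-up values (the names `demo` and `col` are illustrative only): selecting one column of a DataFrame gives back a Series.

```python
import numpy as np
import pandas as pd

# A small DataFrame built from a 2D numpy array (values are made up)
demo = pd.DataFrame(np.array([[1.0, 2.0], [3.0, 4.0]]),
                    columns=['A', 'B'])

col = demo['A']              # selecting one column gives a Series
print(type(demo).__name__)   # DataFrame
print(type(col).__name__)    # Series
print(col.to_numpy())        # the column's values as a 1D numpy array
```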

In [7]:
# Define a DataFrame from scratch
df1 = pd.DataFrame(np.random.rand(5, 3),
                   columns=['Feature1', 'Feature2', 'Feature3'])
df1.head(3)  # displays the first 3 rows (the default is 5)

Unnamed: 0,Feature1,Feature2,Feature3
0,0.441358,0.440195,0.248595
1,0.583467,0.80258,0.364975
2,0.467576,0.254247,0.420916


In [8]:
# Print the shape of the data frame
print(df1.shape)

(5, 3)


In [11]:
# Row indices
# print(df1.index)
print(df1.index.values)

[0 1 2 3 4]


In [13]:
# Column indices
# print(df1.columns)
print(df1.columns.values)

['Feature1' 'Feature2' 'Feature3']


In [18]:
# Extract a column
df1['Feature1'] # returns a Series

0    0.441358
1    0.583467
2    0.467576
Name: Feature1, dtype: float64

In [17]:
# Extract multiple columns
# df1[['Feature1', 'Feature2']].head(3)
df1.head(3)[['Feature1', 'Feature2']]

Unnamed: 0,Feature1,Feature2
0,0.441358,0.440195
1,0.583467,0.80258
2,0.467576,0.254247


In [24]:
# Access elements using .loc[row_index, col_index]
# Example: Print the Feature1 value on the first row

print(df1.loc[0, 'Feature1'])
# print(df1[0, 'Feature1']) # This will cause an error

# Ex: Display the value in the lower-right corner
print(df1.loc[4, 'Feature3'])

# There is another accessor that uses integer positions: iloc[]
# print(df1.iloc[0, 0])
print(df1.iloc[-1, -1])

0.4413580480511703
0.9338144577140153
0.9338144577140153


In [25]:
# Extract a row
df1.loc[2] # returns a Series

Feature1    0.467576
Feature2    0.254247
Feature3    0.420916
Name: 2, dtype: float64

In [26]:
# Index slicing
# Print the Feature2 value for the first 3 rows
df1.loc[0:2, "Feature2"] # Both the start index and the end index are inclusive

0    0.440195
1    0.802580
2    0.254247
Name: Feature2, dtype: float64

In [27]:
# Ex: Print the Feature2 and Feature3 values for the last 3 rows
# .loc[] is label-based, so negative positions do not work here; use .iloc[] for those

df1.loc[2:4, ['Feature2', 'Feature3']]

Unnamed: 0,Feature2,Feature3
2,0.254247,0.420916
3,0.23692,0.007152
4,0.911047,0.933814
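
The difference between label-based and position-based slicing is easy to mix up, so here is a small sketch on a made-up frame (`demo` is hypothetical, not part of the notebook's data). With the default integer index, `.loc` includes the end label while `.iloc` excludes the end position, and only `.iloc` understands negative positions.

```python
import pandas as pd

demo = pd.DataFrame({'x': [10, 20, 30, 40, 50]})

by_label = demo.loc[2:4, 'x']     # labels 2, 3, 4 (end label included)
by_position = demo.iloc[2:4, 0]   # positions 2, 3 (end position excluded)
last_three = demo.iloc[-3:, 0]    # negative positions work with .iloc

print(by_label.tolist())      # [30, 40, 50]
print(by_position.tolist())   # [30, 40]
print(last_three.tolist())    # [30, 40, 50]
```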


In [28]:
# Ex: Use boolean indexing to extract the last 3 rows
row_indices = (df1.index >= 2)
print(row_indices)
df1.loc[row_indices, :]

[False False  True  True  True]


Unnamed: 0,Feature1,Feature2,Feature3
2,0.467576,0.254247,0.420916
3,0.28194,0.23692,0.007152
4,0.764278,0.911047,0.933814


In [30]:
df1.loc[(df1.index % 2 == 0), :] # It would be hard to do the same
                                # in Excel.

Unnamed: 0,Feature1,Feature2,Feature3
0,0.441358,0.440195,0.248595
2,0.467576,0.254247,0.420916
4,0.764278,0.911047,0.933814


In [32]:
# Ex: Extract rows whose Feature2 value is less than 0.5
df1.loc[(df1['Feature2'] < 0.5), :]

df1[(df1['Feature2'] < 0.5)] # a simpler expression

Unnamed: 0,Feature1,Feature2,Feature3
0,0.441358,0.440195,0.248595
2,0.467576,0.254247,0.420916
3,0.28194,0.23692,0.007152
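
Boolean conditions can also be combined. The sketch below uses a made-up frame (`demo` and its values are assumptions for illustration) and the element-wise operators `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

demo = pd.DataFrame({'Feature1': [0.1, 0.6, 0.4, 0.9],
                     'Feature2': [0.7, 0.2, 0.3, 0.8]})

low1 = demo['Feature1'] < 0.5
low2 = demo['Feature2'] < 0.5

both = demo[low1 & low2]          # rows where both features are below 0.5
either = demo[low1 | low2]        # rows where at least one is below 0.5
neither = demo[~(low1 | low2)]    # rows where neither is below 0.5

print(both.index.tolist())     # [2]
print(either.index.tolist())   # [0, 1, 2]
print(neither.index.tolist())  # [3]
```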


## Basic Table Operations
- Change a value 
- Add a new row
- Add a new column
- Remove a row
- Remove a column

In [3]:
data = [[60, 70, 80],
        [66, 88, 77],
        [100, 60, 30],
        [85, 87, 83]]
scores = pd.DataFrame(data,
                      index=['Alice', 'Bob', 'Chris', 'David'],
                      columns=['Quiz1', 'Quiz2', 'Final'])
scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60,70,80
Bob,66,88,77
Chris,100,60,30
David,85,87,83


In [4]:
# Change Alice's final score to 90.
scores.loc['Alice', 'Final'] = 90
scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60,70,90
Bob,66,88,77
Chris,100,60,30
David,85,87,83


In [8]:
# Add a new row: "Edward": [77, 88, 99]
scores.loc['Edward', :] = [77, 88, 99]
scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,85.0,87.0,83.0
Edward,77.0,88.0,99.0


In [12]:
# To make sure we are modifying an existing row (and not adding a new one), check the index first:
name = 'David'
if name in scores.index:
    scores.loc[name, ['Quiz1', 'Quiz2']] = [95, 85]
    print(scores)
else:
    print("Error:", name, "does not exist.")

        Quiz1  Quiz2  Final
Alice    60.0   70.0   90.0
Bob      66.0   88.0   77.0
Chris   100.0   60.0   30.0
David    95.0   85.0   83.0
Edward   77.0   88.0   99.0


In [14]:
# What if we only want to display David's record?
scores.loc[["David"], :]

Unnamed: 0,Quiz1,Quiz2,Final
David,95.0,85.0,83.0


In [17]:
print("Data types:\n", scores.dtypes) # Each column has its own data type
# Change Quiz1 to integers
scores['Quiz1'] = scores['Quiz1'].astype(int)
print("Data types after modification:")
print(scores.dtypes)
scores

Data types:
 Quiz1      int32
Quiz2    float64
Final    float64
dtype: object
Data types after modification:
Quiz1      int32
Quiz2    float64
Final    float64
dtype: object


Unnamed: 0,Quiz1,Quiz2,Final
Alice,60,70.0,90.0
Bob,66,88.0,77.0
Chris,100,60.0,30.0
David,95,85.0,83.0
Edward,77,88.0,99.0


In [18]:
scores.loc["Fred", ["Quiz1", "Quiz2"]] = [100, 100]
scores
# This will create a missing value in the Final column
# pandas uses "NaN" (Not a Number) to indicate a missing value

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,95.0,85.0,83.0
Edward,77.0,88.0,99.0
Fred,100.0,100.0,
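
Once a table contains missing values, it helps to know how to find and handle them. The following is a minimal sketch on a separate made-up frame (`demo` is hypothetical) using `isna()`, `fillna()`, and `dropna()`. Whether to fill or drop depends on the analysis, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Quiz1': [60, 100], 'Final': [80, np.nan]},
                    index=['Alice', 'Fred'])

missing = demo['Final'].isna()        # True where the value is missing
print(missing.sum())                  # number of missing Final scores: 1

filled = demo.fillna({'Final': 0})    # new frame with NaN in Final replaced by 0
complete = demo.dropna()              # new frame without rows containing NaN

print(demo['Final'].mean())           # NaN is skipped by default: 80.0
```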


In [19]:
# Append a new data frame
more_scores = pd.DataFrame(data={'Quiz1': [67, 76],
                                 'Quiz2': [78, 87],
                                 'Final': [89, 98]},
                           index=['Flora', 'Gabriel']) # Represent data as a dictionary
print(more_scores)
total_scores = pd.concat([scores, more_scores]) # pd.concat() creates a new data frame
total_scores

         Quiz1  Quiz2  Final
Flora       67     78     89
Gabriel     76     87     98


Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,95.0,85.0,83.0
Edward,77.0,88.0,99.0
Fred,100.0,100.0,
Flora,67.0,78.0,89.0
Gabriel,76.0,87.0,98.0
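
One thing to watch for with `pd.concat()` is that it keeps the row labels of both inputs, so duplicate labels are possible. The sketch below uses two made-up single-row frames (`a` and `b` are hypothetical) to show the default behavior, `ignore_index=True`, and `verify_integrity=True`.

```python
import pandas as pd

a = pd.DataFrame({'Quiz1': [60]}, index=['Alice'])
b = pd.DataFrame({'Quiz1': [70]}, index=['Alice'])

stacked = pd.concat([a, b])                        # keeps both 'Alice' labels
renumbered = pd.concat([a, b], ignore_index=True)  # replaces labels with 0, 1

duplicate_detected = False
try:
    pd.concat([a, b], verify_integrity=True)       # refuses duplicate labels
except ValueError as err:
    duplicate_detected = True
    print("Duplicate labels:", err)

print(stacked.index.tolist())      # ['Alice', 'Alice']
print(renumbered.index.tolist())   # [0, 1]
```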


In [20]:
total_scores.shape

(8, 3)

In [22]:
# Add a column "ExtraCredit"
total_scores['ExtraCredit'] = [0, 1, 2, 3, 4, 5, 6, 7]
total_scores

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0
Bob,66.0,88.0,77.0,1
Chris,100.0,60.0,30.0,2
David,95.0,85.0,83.0,3
Edward,77.0,88.0,99.0,4
Fred,100.0,100.0,,5
Flora,67.0,78.0,89.0,6
Gabriel,76.0,87.0,98.0,7


In [23]:
total_scores['ExtraCredit'] = [0, 0, 0, 0, 0, 0, 0, 0]
total_scores

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0
Bob,66.0,88.0,77.0,0
Chris,100.0,60.0,30.0,0
David,95.0,85.0,83.0,0
Edward,77.0,88.0,99.0,0
Fred,100.0,100.0,,0
Flora,67.0,78.0,89.0,0
Gabriel,76.0,87.0,98.0,0
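
When every row should get the same value, a single scalar is enough; pandas repeats it for each row. The sketch below uses a made-up frame (`demo` and the `Bonus` column are hypothetical) and also shows that a list of the wrong length is rejected.

```python
import pandas as pd

demo = pd.DataFrame({'Quiz1': [60, 66, 100]},
                    index=['Alice', 'Bob', 'Chris'])

demo['ExtraCredit'] = 0        # the scalar is broadcast to all rows

try:
    demo['Bonus'] = [1, 2]     # 2 values for 3 rows
except ValueError as err:
    print("Length mismatch:", err)

print(demo)
```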


In [None]:
# Add additional columns from another data frame
# will be discussed in Chapter 8


In [25]:
# Remove record for Chris
scores_without_chris = total_scores.drop('Chris') # drop() creates a new data frame

total_scores # The original data frame is not affected.
scores_without_chris 

# If you still want to keep a copy of the original data, assign the result from 
# drop() to a new variable. In this way you have two data frames.

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0
Bob,66.0,88.0,77.0,0
David,95.0,85.0,83.0,0
Edward,77.0,88.0,99.0,0
Fred,100.0,100.0,,0
Flora,67.0,78.0,89.0,0
Gabriel,76.0,87.0,98.0,0


In [None]:
# If the original data frame is no longer needed, then simply assign the drop
# result to the same variable.
# (This cell was not run here, so later outputs still include Chris.)

total_scores = total_scores.drop('Chris')

total_scores

In [None]:
# Modifying an existing data frame directly is called an "in-place" operation.
# (This cell was not run here, so later outputs still include David.)
total_scores.drop("David", inplace=True)

In [None]:
total_scores

In [26]:
# Remove column "ExtraCredit"
total_scores = total_scores.drop('ExtraCredit', axis=1) # drop() creates a new data frame
# total_scores.drop('ExtraCredit', axis=1, inplace=True)

total_scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,95.0,85.0,83.0
Edward,77.0,88.0,99.0
Fred,100.0,100.0,
Flora,67.0,78.0,89.0
Gabriel,76.0,87.0,98.0


In [27]:
# Remove both Fred and Flora
total_scores = total_scores.drop(['Fred', 'Flora'])
total_scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,95.0,85.0,83.0
Edward,77.0,88.0,99.0
Gabriel,76.0,87.0,98.0
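
`drop()` also accepts the keywords `index=` and `columns=`, which some readers find clearer than `axis=1`, and `errors='ignore'` to skip labels that do not exist. A minimal sketch on a made-up frame (`demo` and the label `'Zed'` are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({'Quiz1': [60, 66], 'ExtraCredit': [0, 0]},
                    index=['Alice', 'Bob'])

no_extra = demo.drop(columns='ExtraCredit')   # same as drop('ExtraCredit', axis=1)
no_bob = demo.drop(index='Bob')               # same as drop('Bob')
unchanged = demo.drop('Zed', errors='ignore') # missing label is skipped, no error

print(no_extra.columns.tolist())   # ['Quiz1']
print(no_bob.index.tolist())       # ['Alice']
print(unchanged.shape)             # (2, 2)
```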


## Table Arithmetic
- Perform an operation uniformly to all values in a column
- Arithmetic with multiple columns
- Calculate statistics
- Apply a user-defined function to all rows

In [28]:
# Create a data frame called grades
grades = pd.DataFrame([[56, 67, 78, 5],
                       [66, 77, 88, 8],
                       [98, 97, 85, 3]],
                      index=["Superman", "Hulk", "Thor"],
                      columns=["Quiz1", "Quiz2", "Final", "ExtraCredit"])
grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Superman,56,67,78,5
Hulk,66,77,88,8
Thor,98,97,85,3


In [29]:
# Double the extra credits

grades['ExtraCredit'] = grades['ExtraCredit'] * 2

grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Superman,56,67,78,10
Hulk,66,77,88,16
Thor,98,97,85,6


In [30]:
# Calculate the final grades:
#  FinalGrade = Quiz1 * 25% + Quiz2 * 25% + Final * 50% + ExtraCredit

grades['FinalGrade'] = (grades['Quiz1'] * 0.25 + grades['Quiz2'] * 0.25
                        + grades['Final'] * 0.5 + grades['ExtraCredit'])

grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit,FinalGrade
Superman,56,67,78,10,79.75
Hulk,66,77,88,16,95.75
Thor,98,97,85,6,97.25


In [31]:
# Ex: Curve the grades:
# Formula: CurvedGrades = sqrt(FinalGrade) * 10

# Attempt 1: use the ** operator
# grades['CurvedGrades'] = grades['FinalGrade'] ** 0.5 * 10
# grades

# Attempt 2: use numpy.sqrt()
grades['CurvedGrades'] = np.sqrt(grades['FinalGrade']) * 10 
# numpy functions operate element-wise on a whole Series
grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit,FinalGrade,CurvedGrades
Superman,56,67,78,10,79.75,89.302855
Hulk,66,77,88,16,95.75,97.851929
Thor,98,97,85,6,97.25,98.615415


In [38]:
# Calculate the max, min, mean, and standard deviation of the curved grades

print("Max grade:", grades['CurvedGrades'].max())
print("Min grade:", grades['CurvedGrades'].min())
print("Student with max grade:", grades['CurvedGrades'].idxmax())
print("Student with min grade:", grades['CurvedGrades'].idxmin())
print("Average grade:", grades['CurvedGrades'].mean())
print("Use numpy.std() to calculate the standard deviation:",
      np.std(grades['CurvedGrades']))
# numpy.std() returns the population standard deviation (ddof=0);
# Series.std() uses ddof=1 by default, so the two results differ.
# You can break a long statement into multiple lines by starting a new line
# after a comma inside the parentheses

Max grade: 98.6154146165801
Min grade: 89.30285549745875
Student with max grade: Thor
Student with min grade: Superman
Average grade: 95.25673302264784
Use numpy.std() to calculate the standard deviation: 4.221549539998431


In [41]:
# Ex: Define a function num2letter() that converts a numerical grade to a letter grade.
# For example, num2letter(95) returns 'A', num2letter(59) returns 'F'

def num2letter(num):
    
    # 90+ -> A
    # >= 80 and < 90 -> B
    # >= 70 and < 80 -> C
    # >= 60 and < 70 -> D
    # otherwise: F
    
    if num >= 90:
        return 'A'
    elif num >= 80:
        return 'B'
    elif num >= 70:
        return 'C'
    elif num >= 60:
        return 'D'
    else:
        return 'F'

In [42]:
num2letter(59)

'F'

In [43]:
# We cannot apply this function directly to a Series:
# "if num >= 90" needs a single True/False, not one per row
num2letter(grades['CurvedGrades'])

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [44]:
# Apply the function to the curved grade column

grades['LetterGrade'] = grades['CurvedGrades'].apply(num2letter)
grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit,FinalGrade,CurvedGrades,LetterGrade
Superman,56,67,78,10,79.75,89.302855,B
Hulk,66,77,88,16,95.75,97.851929,A
Thor,98,97,85,6,97.25,98.615415,A
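
As an alternative to `apply()` with a hand-written function, `pd.cut()` can bin numeric values into labeled ranges. This is a sketch on made-up grades (the Series `demo` is hypothetical), with bin edges chosen to mirror the thresholds in `num2letter()`; it is one possible approach, not the one used in this notebook.

```python
import numpy as np
import pandas as pd

demo = pd.Series([59.0, 60.0, 89.3, 90.0, 97.9])

letters = pd.cut(demo,
                 bins=[-np.inf, 60, 70, 80, 90, np.inf],
                 labels=['F', 'D', 'C', 'B', 'A'],
                 right=False)   # each bin is [low, high): 90 falls into 'A'

print(letters.tolist())   # ['F', 'D', 'B', 'A', 'A']
```

The result is a categorical Series, so the letter grades keep their order (F < D < C < B < A) for sorting.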


In [46]:
# Clean up the data frame by dropping the intermediate columns
grades = grades.drop(['FinalGrade', 'CurvedGrades'], axis=1)
grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit,LetterGrade
Superman,56,67,78,10,B
Hulk,66,77,88,16,A
Thor,98,97,85,6,A


## Example: Revisit 80 Cereal Data

Using pandas and its DataFrame, let's repeat our analysis of the 80 Cereal Data:
- Load the csv file using `pd.read_csv()`
- Examine the data
- Explore the ratings
- Analyze sugar contents

In [47]:
import pandas as pd

In [50]:
# Load the dataset
raw_data = pd.read_csv('mydata/cereal.csv') 
raw_data.head(5) # by default, column names come from the first row, and integer indexing is used.

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


In [51]:
# Display the shape
raw_data.shape

(77, 16)

In [52]:
# Display the columns
print("Columns:")
print(raw_data.columns.values)

Columns:
['name' 'mfr' 'type' 'calories' 'protein' 'fat' 'sodium' 'fiber' 'carbo'
 'sugars' 'potass' 'vitamins' 'shelf' 'weight' 'cups' 'rating']


In [53]:
# Display the data types

raw_data.dtypes

name         object
mfr          object
type         object
calories      int64
protein       int64
fat           int64
sodium        int64
fiber       float64
carbo       float64
sugars        int64
potass        int64
vitamins      int64
shelf         int64
weight      float64
cups        float64
rating      float64
dtype: object

In [54]:
# Display all cereal names

for i in raw_data.index:
    print(raw_data.loc[i, 'name'])

100% Bran
100% Natural Bran
All-Bran
All-Bran with Extra Fiber
Almond Delight
Apple Cinnamon Cheerios
Apple Jacks
Basic 4
Bran Chex
Bran Flakes
Cap'n'Crunch
Cheerios
Cinnamon Toast Crunch
Clusters
Cocoa Puffs
Corn Chex
Corn Flakes
Corn Pops
Count Chocula
Cracklin' Oat Bran
Cream of Wheat (Quick)
Crispix
Crispy Wheat & Raisins
Double Chex
Froot Loops
Frosted Flakes
Frosted Mini-Wheats
Fruit & Fibre Dates; Walnuts; and Oats
Fruitful Bran
Fruity Pebbles
Golden Crisp
Golden Grahams
Grape Nuts Flakes
Grape-Nuts
Great Grains Pecan
Honey Graham Ohs
Honey Nut Cheerios
Honey-comb
Just Right Crunchy  Nuggets
Just Right Fruit & Nut
Kix
Life
Lucky Charms
Maypo
Muesli Raisins; Dates; & Almonds
Muesli Raisins; Peaches; & Pecans
Mueslix Crispy Blend
Multi-Grain Cheerios
Nut&Honey Crunch
Nutri-Grain Almond-Raisin
Nutri-grain Wheat
Oatmeal Raisin Crisp
Post Nat. Raisin Bran
Product 19
Puffed Rice
Puffed Wheat
Quaker Oat Squares
Quaker Oatmeal
Raisin Bran
Raisin Nut Bran
Raisin Squares
Rice Chex
Rice Kr

In [65]:
# Display all cereal names and ratings

for i in raw_data.index:
    print("%25s - %.2f" % (raw_data.loc[i, 'name'],
          raw_data.loc[i, 'rating']))
#     print("{:<25} - {:.2f}".format(raw_data.loc[i, 'name'],
#           raw_data.loc[i, 'rating']))

                100% Bran - 68.40
        100% Natural Bran - 33.98
                 All-Bran - 59.43
All-Bran with Extra Fiber - 93.70
           Almond Delight - 34.38
  Apple Cinnamon Cheerios - 29.51
              Apple Jacks - 33.17
                  Basic 4 - 37.04
                Bran Chex - 49.12
              Bran Flakes - 53.31
             Cap'n'Crunch - 18.04
                 Cheerios - 50.76
    Cinnamon Toast Crunch - 19.82
                 Clusters - 40.40
              Cocoa Puffs - 22.74
                Corn Chex - 41.45
              Corn Flakes - 45.86
                Corn Pops - 35.78
            Count Chocula - 22.40
       Cracklin' Oat Bran - 40.45
   Cream of Wheat (Quick) - 64.53
                  Crispix - 46.90
   Crispy Wheat & Raisins - 36.18
              Double Chex - 44.33
              Froot Loops - 32.21
           Frosted Flakes - 31.44
      Frosted Mini-Wheats - 58.35
Fruit & Fibre Dates; Walnuts; and Oats - 40.92
            Fruitful Bran - 41.02
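
Looking up every cell with `.loc` inside a loop works, but iterating over the columns directly is usually shorter. Here is a sketch on a made-up two-row frame (`demo` and its cereal names are hypothetical) that pairs the two columns with `zip()` and formats each line with an f-string, right-aligning the name in 25 characters.

```python
import pandas as pd

demo = pd.DataFrame({'name': ['Cereal A', 'Cereal B'],
                     'rating': [68.402973, 33.983679]})

# Pair up the two columns row by row and build one formatted line per row
lines = [f"{name:>25} - {rating:.2f}"
         for name, rating in zip(demo['name'], demo['rating'])]

for line in lines:
    print(line)
```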
 

In [69]:
# Use "idxmax()" to find the product name with highest rating
idx_max_rating = raw_data['rating'].idxmax()
raw_data.loc[idx_max_rating, 'name']

'All-Bran with Extra Fiber'

In [71]:
# Display all cereals with rating above 60
raw_data.loc[raw_data['rating'] > 60, 'name']

0                     100% Bran
3     All-Bran with Extra Fiber
20       Cream of Wheat (Quick)
54                  Puffed Rice
55                 Puffed Wheat
63               Shredded Wheat
64       Shredded Wheat 'n'Bran
65    Shredded Wheat spoon size
Name: name, dtype: object

In [74]:
# Display the name, sugars, and weight columns
raw_data[['name', 'sugars', 'weight']]

Unnamed: 0,name,sugars,weight
0,100% Bran,6,1.0
1,100% Natural Bran,8,1.0
2,All-Bran,5,1.0
3,All-Bran with Extra Fiber,0,1.0
4,Almond Delight,8,1.0
...,...,...,...
72,Triples,3,1.0
73,Trix,12,1.0
74,Wheat Chex,3,1.0
75,Wheaties,3,1.0


In [76]:
# Calculate sugar per ounce
# sugar per ounce = sugar per serving / ounce per serving
raw_data['sugar per ounce'] = raw_data['sugars'] / raw_data['weight']
raw_data.head(20)

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,sugar per ounce
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973,6.0
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679,8.0
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505,5.0
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912,0.0
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843,8.0
5,Apple Cinnamon Cheerios,G,C,110,2,2,180,1.5,10.5,10,70,25,1,1.0,0.75,29.509541,10.0
6,Apple Jacks,K,C,110,2,0,125,1.0,11.0,14,30,25,2,1.0,1.0,33.174094,14.0
7,Basic 4,G,C,130,3,2,210,2.0,18.0,8,100,25,3,1.33,0.75,37.038562,6.015038
8,Bran Chex,R,C,90,2,1,200,4.0,15.0,6,125,25,1,1.0,0.67,49.120253,6.0
9,Bran Flakes,P,C,90,3,0,210,5.0,13.0,5,190,25,3,1.0,0.67,53.313813,5.0


In [79]:
# Which products have the highest amount of sugar per ounce?
# Sort in ascending order, then keep the last 5 rows
sugary_cereals = raw_data.sort_values('sugar per ounce').tail()
sugary_cereals[['name', 'sugar per ounce']]

Unnamed: 0,name,sugar per ounce
14,Cocoa Puffs,13.0
18,Count Chocula,13.0
6,Apple Jacks,14.0
66,Smacks,15.0
30,Golden Crisp,15.0


In [80]:
# Which products contain no sugar at all?
no_sugar_cereals = raw_data[raw_data['sugars'] == 0.0]
no_sugar_cereals[['name', 'sugar per ounce']]

Unnamed: 0,name,sugar per ounce
3,All-Bran with Extra Fiber,0.0
20,Cream of Wheat (Quick),0.0
54,Puffed Rice,0.0
55,Puffed Wheat,0.0
63,Shredded Wheat,0.0
64,Shredded Wheat 'n'Bran,0.0
65,Shredded Wheat spoon size,0.0
