# Week 4
# Pandas Data Frames (Part 1)

[Pandas](https://pandas.pydata.org/) is a major tool for data scientists working in Python. It provides data structures and data manipulation tools designed to make data cleaning and analysis fast and easy.

References:
- Textbook Chapter 5: Getting Started with Pandas

In [1]:
import numpy as np
import pandas as pd # pd is the universally-used abbreviation

Pandas provides two data types that build on numpy arrays:
- Series: a labeled 1D array, used to represent a single feature
- DataFrame: a labeled 2D array, used to represent a data table

We will focus on data frames today, as most data sets are stored in table format.
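
To make the two data types concrete, here is a minimal sketch. The names `heights` and `table` and their values are made up for illustration and are not part of the course data.

```python
import numpy as np
import pandas as pd

# A Series wraps a 1D array and attaches a label to each value.
heights = pd.Series(np.array([1.62, 1.75, 1.80]), name="height_m")

# A DataFrame wraps a 2D array and labels both rows and columns.
table = pd.DataFrame(np.array([[1.62, 55.0],
                               [1.75, 70.0],
                               [1.80, 82.0]]),
                     columns=["height_m", "weight_kg"])

print(heights.ndim, heights.shape)   # 1 (3,)
print(table.ndim, table.shape)       # 2 (3, 2)

# Each column of a DataFrame is itself a Series.
print(type(table["height_m"]).__name__)  # Series

# The underlying numpy array is still available when needed.
print(type(table.to_numpy()).__name__)   # ndarray
```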

In [13]:
# Define a DataFrame from scratch
df1 = pd.DataFrame(np.random.rand(5, 3),
                   columns=['Feature1', 'Feature2', 'Feature3'])
df1.head(3)  # displays the first 3 rows

Unnamed: 0,Feature1,Feature2,Feature3
0,0.405935,0.761449,0.847433
1,0.530033,0.537026,0.916869
2,0.626821,0.708161,0.047923


In [14]:
# Print the shape of the data frame
print(df1.shape)

(5, 3)


In [17]:
# Row indices
# print(df1.index)
print(df1.index.values)

[0 1 2 3 4]


In [22]:
# Column indices
# print(df1.columns)
print(df1.columns.values)

['Feature1' 'Feature2' 'Feature3']


In [33]:
# Extract a column
df1['Feature1'] # returns a Series

0    0.405935
1    0.530033
2    0.626821
3    0.620581
4    0.448071
Name: Feature1, dtype: float64

In [26]:
# Access elements using .loc[row_index, col_index]
# Ex: Print the Feature1 value on the first row
print(df1.loc[0, 'Feature1'])

# Ex: Display the value in the lower-right corner
print(df1.loc[4, 'Feature3'])

# There is another expression using integer positions: iloc[]
print(df1.iloc[0, 0])
print(df1.iloc[4, 2])

0.4059349834708027
0.8673834018716187
0.4059349834708027
0.8673834018716187
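
With the default integer index, `.loc[]` and `.iloc[]` look interchangeable. The difference shows up once the row labels are not 0, 1, 2, ... The sketch below uses a hypothetical table `demo` with string labels to illustrate this.

```python
import pandas as pd

# Hypothetical table whose row labels are strings rather than 0, 1, 2, ...
demo = pd.DataFrame({"Feature1": [0.1, 0.2, 0.3],
                     "Feature2": [1.1, 1.2, 1.3]},
                    index=["a", "b", "c"])

# .loc[] looks up by label.
by_label = demo.loc["c", "Feature2"]

# .iloc[] looks up by integer position and accepts negative positions.
by_position = demo.iloc[2, 1]
last_cell = demo.iloc[-1, -1]

print(by_label, by_position, last_cell)  # 1.3 1.3 1.3
```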


In [34]:
# Extract a row
df1.loc[2] # returns a Series

Feature1    0.626821
Feature2    0.708161
Feature3    0.047923
Name: 2, dtype: float64

In [27]:
# Index slicing
# Print the Feature2 value for the first 3 rows
df1.loc[0:2, "Feature2"] # Both the start index and the end index are inclusive

0    0.761449
1    0.537026
2    0.708161
Name: Feature2, dtype: float64
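
The inclusive end point is specific to label-based slicing. Position-based slicing with `.iloc[]` follows the usual Python rule and excludes the end. A small sketch with a made-up table `demo`:

```python
import pandas as pd

demo = pd.DataFrame({"Feature2": [10, 20, 30, 40, 50]})

# Label-based slice: rows labeled 0, 1, and 2 (end label included).
label_slice = demo.loc[0:2, "Feature2"]

# Position-based slice: positions 0 and 1 only (end position excluded).
position_slice = demo.iloc[0:2, 0]

# iloc accepts negative positions, so the last 3 rows are easy to request.
last_three = demo.iloc[-3:, 0]

print(len(label_slice), len(position_slice))  # 3 2
print(last_three.tolist())                    # [30, 40, 50]
```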

In [28]:
# Ex: Print the Feature2 and Feature3 values for the last 3 rows
# .loc[] selects by label, so negative positions (e.g. -3:) do not count from the end
df1.loc[2:4, ['Feature2', 'Feature3']]

Unnamed: 0,Feature2,Feature3
2,0.708161,0.047923
3,0.443285,0.484035
4,0.056282,0.867383


In [30]:
# Ex: Use boolean indexing to extract the last 3 rows
row_indices = (df1.index >= 2)
print(row_indices)
df1.loc[row_indices, :]

[False False  True  True  True]


Unnamed: 0,Feature1,Feature2,Feature3
2,0.626821,0.708161,0.047923
3,0.620581,0.443285,0.484035
4,0.448071,0.056282,0.867383


In [31]:
df1.loc[(df1.index <= 2), :]

Unnamed: 0,Feature1,Feature2,Feature3
0,0.405935,0.761449,0.847433
1,0.530033,0.537026,0.916869
2,0.626821,0.708161,0.047923


In [35]:
# Extract rows whose Feature2 value is less than 0.5
df1.loc[(df1['Feature2'] < 0.5), :]

Unnamed: 0,Feature1,Feature2,Feature3
3,0.620581,0.443285,0.484035
4,0.448071,0.056282,0.867383
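
Boolean conditions can also be combined. The sketch below uses a hypothetical table `demo` with rounded values. Note that pandas uses `&`, `|`, and `~` instead of `and`, `or`, and `not`, and that each condition needs its own parentheses.

```python
import pandas as pd

demo = pd.DataFrame({"Feature1": [0.41, 0.53, 0.63, 0.62, 0.45],
                     "Feature2": [0.76, 0.54, 0.71, 0.44, 0.06]})

# Rows where both features exceed 0.5
both = demo.loc[(demo["Feature1"] > 0.5) & (demo["Feature2"] > 0.5), :]

# Rows where at least one condition holds
either = demo.loc[(demo["Feature1"] > 0.6) | (demo["Feature2"] < 0.1), :]

# Rows where the condition does NOT hold
negated = demo.loc[~(demo["Feature2"] < 0.5), :]

print(both.index.tolist())     # [1, 2]
print(either.index.tolist())   # [2, 3, 4]
print(negated.index.tolist())  # [0, 1, 2]
```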


## Basic Table Operations
- Change a value 
- Add a new row
- Add a new column
- Remove a row
- Remove a column

In [38]:
data = [[60, 70, 80],
        [66, 88, 77],
        [100, 60, 30],
        [85, 87, 83]]
scores = pd.DataFrame(data,
                      index=['Alice', 'Bob', 'Chris', 'David'],
                      columns=['Quiz1', 'Quiz2', 'Final'])
scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60,70,80
Bob,66,88,77
Chris,100,60,30
David,85,87,83


In [39]:
# Change Alice's final score to 90.
scores.loc['Alice', 'Final'] = 90
scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60,70,90
Bob,66,88,77
Chris,100,60,30
David,85,87,83


In [43]:
# Add a new row: "Edward": [77, 88, 99]
scores.loc['Edward', :] = [77, 88, 99]
scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,85.0,87.0,83.0
Edward,77.0,88.0,99.0


In [46]:
print("Data types:\n", scores.dtypes) # Each column has its own data type
# Change Quiz1 to integers
scores['Quiz1'] = scores['Quiz1'].astype(int)
print("Data types after modification:")
print(scores.dtypes)
scores

Data types:
 Quiz1      int32
Quiz2    float64
Final    float64
dtype: object
Data types after modification:
Quiz1      int32
Quiz2    float64
Final    float64
dtype: object


Unnamed: 0,Quiz1,Quiz2,Final
Alice,60,70.0,90.0
Bob,66,88.0,77.0
Chris,100,60.0,30.0
David,85,87.0,83.0
Edward,77,88.0,99.0


In [47]:
scores.loc["Fred", ["Quiz1", "Quiz2"]] = [100, 100]
scores
# This creates a missing value in the Final column
# pandas uses "NaN" (Not a Number) to indicate a missing value

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,85.0,87.0,83.0
Edward,77.0,88.0,99.0
Fred,100.0,100.0,
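
Missing values behave differently from ordinary numbers, so it helps to know how to detect and handle them. The following is a minimal sketch on a made-up two-row table `demo`; replacing a missing score with 0 is only one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Quiz1": [60.0, 100.0],
                     "Final": [90.0, np.nan]},
                    index=["Alice", "Fred"])

# isna() marks the missing entries with True.
missing_mask = demo["Final"].isna()
n_missing = int(missing_mask.sum())

# NaN never compares equal to anything, including itself,
# so use isna() rather than == to test for it.
nan_equals_nan = (np.nan == np.nan)

# Most pandas statistics skip NaN by default.
final_mean = demo["Final"].mean()

# fillna() returns a new Series with the gaps replaced.
filled = demo["Final"].fillna(0)

print(n_missing, nan_equals_nan, final_mean)  # 1 False 90.0
print(filled.tolist())                        # [90.0, 0.0]
```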


In [52]:
# Append a new data frame
more_scores = pd.DataFrame(data={'Quiz1': [67, 76],
                                 'Quiz2': [78, 87],
                                 'Final': [89, 98]},
                           index=['Flora', 'Gabriel']) # Represent data as a dictionary
print(more_scores)
total_scores = pd.concat([scores, more_scores]) # pd.concat() creates a new data frame
total_scores

         Quiz1  Quiz2  Final
Flora       67     78     89
Gabriel     76     87     98


Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Chris,100.0,60.0,30.0
David,85.0,87.0,83.0
Edward,77.0,88.0,99.0
Fred,100.0,100.0,
Flora,67.0,78.0,89.0
Gabriel,76.0,87.0,98.0


In [53]:
total_scores.shape

(8, 3)
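
`pd.concat()` keeps the original row labels and aligns columns by name. The sketch below, using two hypothetical tables `first` and `second`, shows what happens when labels repeat or columns do not match.

```python
import pandas as pd

first = pd.DataFrame({"Quiz1": [60, 66], "Quiz2": [70, 88]})
second = pd.DataFrame({"Quiz1": [67], "Final": [89]})

# Row labels are carried over unchanged, so duplicates are possible.
stacked = pd.concat([first, second])

# ignore_index=True discards the old labels and renumbers the rows.
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())        # [0, 1, 0]
print(renumbered.index.tolist())     # [0, 1, 2]

# Columns are matched by name; cells with no source value become NaN.
print(renumbered.columns.tolist())          # ['Quiz1', 'Quiz2', 'Final']
print(int(renumbered.isna().sum().sum()))   # 3
```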

In [57]:
# Add a column "ExtraCredit"
total_scores['ExtraCredit'] = [0, 1, 2, 3, 4, 5, 6, 4.5]
total_scores

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0.0
Bob,66.0,88.0,77.0,1.0
Chris,100.0,60.0,30.0,2.0
David,85.0,87.0,83.0,3.0
Edward,77.0,88.0,99.0,4.0
Fred,100.0,100.0,,5.0
Flora,67.0,78.0,89.0,6.0
Gabriel,76.0,87.0,98.0,4.5


In [None]:
total_scores['ExtraCredit'] = [0, 0, 0, 0, 0, 0, 0, 0]
total_scores

In [None]:
# Add additional columns from another data frame
# will be discussed in Chapter 8


In [59]:
# Remove record for Chris
scores_without_chris = total_scores.drop('Chris') # drop() creates a new data frame

total_scores # The original data frame is not affected.
scores_without_chris 

# If you still want to keep a copy of the original data, assign the result from 
# drop() to a new variable. In this way you have two data frames.

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0.0
Bob,66.0,88.0,77.0,1.0
Chris,100.0,60.0,30.0,2.0
David,85.0,87.0,83.0,3.0
Edward,77.0,88.0,99.0,4.0
Fred,100.0,100.0,,5.0
Flora,67.0,78.0,89.0,6.0
Gabriel,76.0,87.0,98.0,4.5


In [60]:
# If the original data frame is no longer needed, then simply assign the drop
# result to the same variable.

total_scores = total_scores.drop('Chris')

total_scores

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0.0
Bob,66.0,88.0,77.0,1.0
David,85.0,87.0,83.0,3.0
Edward,77.0,88.0,99.0,4.0
Fred,100.0,100.0,,5.0
Flora,67.0,78.0,89.0,6.0
Gabriel,76.0,87.0,98.0,4.5


In [61]:
# Modifying an existing data frame is called an "in-place" operation.
total_scores.drop("David", inplace=True)

In [62]:
total_scores

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Alice,60.0,70.0,90.0,0.0
Bob,66.0,88.0,77.0,1.0
Edward,77.0,88.0,99.0,4.0
Fred,100.0,100.0,,5.0
Flora,67.0,78.0,89.0,6.0
Gabriel,76.0,87.0,98.0,4.5
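
The two styles of `drop()` differ in what they return, which is a common source of bugs. Here is a short sketch on a made-up table `demo`; the label "Zoe" is deliberately absent.

```python
import pandas as pd

demo = pd.DataFrame({"Quiz1": [60, 66, 100]},
                    index=["Alice", "Bob", "Chris"])

# Default: drop() returns a new data frame and leaves demo untouched.
smaller = demo.drop("Chris")
print(demo.shape, smaller.shape)  # (3, 1) (2, 1)

# With inplace=True, demo itself changes and the call returns None,
# so do not assign the result back to a variable.
result = demo.drop("Bob", inplace=True)
print(result)                     # None
print(demo.index.tolist())        # ['Alice', 'Chris']

# Dropping a label that does not exist raises KeyError
# unless errors="ignore" is passed.
unchanged = demo.drop("Zoe", errors="ignore")
print(unchanged.shape)            # (2, 1)
```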


In [63]:
# Remove column "ExtraCredit"
total_scores = total_scores.drop('ExtraCredit', axis=1) # drop() creates a new data frame
# total_scores.drop('ExtraCredit', axis=1, inplace=True)

total_scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Edward,77.0,88.0,99.0
Fred,100.0,100.0,
Flora,67.0,78.0,89.0
Gabriel,76.0,87.0,98.0


In [64]:
# Remove both Fred and Flora
total_scores = total_scores.drop(['Fred', 'Flora'])
total_scores

Unnamed: 0,Quiz1,Quiz2,Final
Alice,60.0,70.0,90.0
Bob,66.0,88.0,77.0
Edward,77.0,88.0,99.0
Gabriel,76.0,87.0,98.0


## Table Arithmetic
- Apply an operation uniformly to all values in a column
- Arithmetic with multiple columns
- Calculate statistics
- Apply a user-defined function to all rows

In [65]:
# Create a data frame called grades
grades = pd.DataFrame([[56, 67, 78, 5],
                       [66, 77, 88, 8],
                       [98, 97, 85, 3]],
                      index=["Superman", "Hulk", "Thor"],
                      columns=["Quiz1", "Quiz2", "Final", "ExtraCredit"])
grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Superman,56,67,78,5
Hulk,66,77,88,8
Thor,98,97,85,3


In [66]:
# Double the extra credits

grades['ExtraCredit'] = grades['ExtraCredit'] * 2

grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit
Superman,56,67,78,10
Hulk,66,77,88,16
Thor,98,97,85,6


In [67]:
# Calculate the final grades:
#  FinalGrade = Quiz1 * 25% + Quiz2 * 25% + Final * 50% + ExtraCredit

grades['FinalGrade'] = (grades['Quiz1'] * 0.25 + grades['Quiz2'] * 0.25
                        + grades['Final'] * 0.5 + grades['ExtraCredit'])

grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit,FinalGrade
Superman,56,67,78,10,79.75
Hulk,66,77,88,16,95.75
Thor,98,97,85,6,97.25
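
The weighted sum can also be written with the weights stored in a Series, which keeps the formula in one place. This is only an alternative sketch on a small copy of the data; `demo` and `weights` are names introduced here for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"Quiz1": [56, 66],
                     "Quiz2": [67, 77],
                     "Final": [78, 88]},
                    index=["Superman", "Hulk"])

# Column-by-column arithmetic, as in the cell above (without extra credit).
explicit = demo["Quiz1"] * 0.25 + demo["Quiz2"] * 0.25 + demo["Final"] * 0.5

# Same computation: the weights are matched to columns by name,
# then each row is summed.
weights = pd.Series({"Quiz1": 0.25, "Quiz2": 0.25, "Final": 0.5})
weighted = (demo * weights).sum(axis=1)

print(explicit.tolist())  # [69.75, 79.75]
print(weighted.tolist())  # [69.75, 79.75]
```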


In [69]:
# Ex: Curve the grades:
# Formula: CurvedGrades = sqrt(FinalGrade) * 10

# Attempt 1: use the ** operator
# grades['CurvedGrades'] = grades['FinalGrade'] ** 0.5 * 10
# grades

# Attempt 2: use numpy.sqrt()
grades['CurvedGrades'] = np.sqrt(grades['FinalGrade']) * 10
# numpy functions operate element-wise on a whole column
grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit,FinalGrade,CurvedGrades
Superman,56,67,78,10,79.75,89.302855
Hulk,66,77,88,16,95.75,97.851929
Thor,98,97,85,6,97.25,98.615415


In [70]:
# Calculate the min, max, and variance of the curved grades

print("Max grade:", grades['CurvedGrades'].max())
print("Min grade:", grades['CurvedGrades'].min())
print("Use numpy.var() to calculate the variance:",
      np.var(grades['CurvedGrades']))
# You can break a long statement into multiple lines by starting a new line
# after a comma inside parentheses

Max grade: 98.6154146165801
Min grade: 89.30285549745875
Use numpy.var() to calculate the variance: 17.821480518660966
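
The remaining statistics (mean, median, std) are available as Series methods. One detail worth knowing: numpy computes the population variance by default (dividing by n), while the pandas methods `var()` and `std()` compute the sample versions (dividing by n - 1). The sketch below re-enters the three curved grades by hand as `curved`, so it runs on its own.

```python
import numpy as np
import pandas as pd

curved = pd.Series([89.302855, 97.851929, 98.615415], name="CurvedGrades")

mean_grade = curved.mean()
median_grade = curved.median()

# numpy default: population variance (ddof=0)
population_var = np.var(curved.to_numpy())

# pandas default: sample variance and std (ddof=1)
sample_var = curved.var()
sample_std = curved.std()

# Passing ddof=0 to pandas reproduces the numpy result.
pandas_population_var = curved.var(ddof=0)

print("Mean:", mean_grade)
print("Median:", median_grade)
print("Population variance:", population_var)
print("Sample variance:", sample_var)
print("Sample std:", sample_std)

# describe() reports several statistics at once.
print(curved.describe())
```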


In [71]:
# Ex: Define a function num2letter() that converts a numerical grade to a letter grade.
# For example, num2letter(95) returns 'A', num2letter(59) returns 'F'

def num2letter(num):
    
    # 90+ -> A
    # >= 80 and < 90 -> B
    # >= 70 and < 80 -> C
    # >= 60 and < 70 -> D
    # otherwise: F
    
    if num >= 90:
        return 'A'
    elif num >= 80:
        return 'B'
    elif num >= 70:
        return 'C'
    elif num >= 60:
        return 'D'
    else:
        return 'F'

In [72]:
num2letter(59)

'F'

In [73]:
# We cannot apply this function directly to a whole column:
# `num >= 90` produces one True/False per row, which `if` cannot evaluate
num2letter(grades['CurvedGrades'])

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [74]:
# Apply the function to the curved grade column

grades['LetterGrade'] = grades['CurvedGrades'].apply(num2letter)
grades

Unnamed: 0,Quiz1,Quiz2,Final,ExtraCredit,FinalGrade,CurvedGrades,LetterGrade
Superman,56,67,78,10,79.75,89.302855,B
Hulk,66,77,88,16,95.75,97.851929,A
Thor,98,97,85,6,97.25,98.615415,A
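
`apply()` works with any Python function. For this particular task (placing numbers into ranges), pandas also offers `pd.cut()`. The sketch below is an alternative, not the approach used above; the bin edges are chosen to mirror the thresholds in `num2letter()`, which is repeated here so the block runs on its own, and the extra grades 59.0 and 70.0 are made up to exercise the lower bins.

```python
import pandas as pd

def num2letter(num):
    # Same thresholds as the function defined earlier
    if num >= 90:
        return 'A'
    elif num >= 80:
        return 'B'
    elif num >= 70:
        return 'C'
    elif num >= 60:
        return 'D'
    else:
        return 'F'

curved = pd.Series([89.302855, 97.851929, 98.615415, 59.0, 70.0])

# Option 1: call the function once per value
via_apply = curved.apply(num2letter)

# Option 2: bin the values; right=False makes each bin [low, high),
# so 70.0 falls in the C bin and 90.0 would fall in the A bin
via_cut = pd.cut(curved,
                 bins=[-float("inf"), 60, 70, 80, 90, float("inf")],
                 labels=["F", "D", "C", "B", "A"],
                 right=False)

print(via_apply.tolist())  # ['B', 'A', 'A', 'F', 'C']
print(via_cut.tolist())    # ['B', 'A', 'A', 'F', 'C']
```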


## Example: Revisit 80 Cereal Data

Using pandas data frames, let's repeat our analysis of the 80 Cereal Data:
- Load the csv file using `pd.read_csv()`
- Examine the data
- Explore the ratings
- Analyze sugar contents

In [80]:
# Load the dataset
raw_data = pd.read_csv('data/cereal.csv')
# By default, column names come from the first row and rows get integer indices 0, 1, 2, ...
raw_data.head(5)

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
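
The defaults of `pd.read_csv()` can be changed. As an illustration, the sketch below reads a tiny CSV from a string (so it runs without the data file) and uses the `index_col` parameter to turn the `name` column into row labels. The two rows and four columns are a hand-copied subset of the table above.

```python
import io
import pandas as pd

csv_text = """name,mfr,calories,rating
100% Bran,N,70,68.402973
All-Bran,K,70,59.425505
"""

# Default behavior: integer row index, 'name' is an ordinary column.
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col picks a column to serve as the row labels instead.
named_index = pd.read_csv(io.StringIO(csv_text), index_col="name")

print(default_index.index.tolist())          # [0, 1]
print(named_index.index.tolist())            # ['100% Bran', 'All-Bran']
print(named_index.loc["All-Bran", "rating"]) # 59.425505
```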


In [76]:
# Display the shape

raw_data.shape

(77, 16)

In [77]:
# Display the columns

print(raw_data.columns.values)

['name' 'mfr' 'type' 'calories' 'protein' 'fat' 'sodium' 'fiber' 'carbo'
 'sugars' 'potass' 'vitamins' 'shelf' 'weight' 'cups' 'rating']


In [78]:
# Display the data types

raw_data.dtypes

name         object
mfr          object
type         object
calories      int64
protein       int64
fat           int64
sodium        int64
fiber       float64
carbo       float64
sugars        int64
potass        int64
vitamins      int64
shelf         int64
weight      float64
cups        float64
rating      float64
dtype: object

In [85]:
# Display all cereal names

for i, name in enumerate(raw_data['name'], start=1):
    print("%2d   %s" % (i, name))

 1   100% Bran
 2   100% Natural Bran
 3   All-Bran
 4   All-Bran with Extra Fiber
 5   Almond Delight
 6   Apple Cinnamon Cheerios
 7   Apple Jacks
 8   Basic 4
 9   Bran Chex
10   Bran Flakes
11   Cap'n'Crunch
12   Cheerios
13   Cinnamon Toast Crunch
14   Clusters
15   Cocoa Puffs
16   Corn Chex
17   Corn Flakes
18   Corn Pops
19   Count Chocula
20   Cracklin' Oat Bran
21   Cream of Wheat (Quick)
22   Crispix
23   Crispy Wheat & Raisins
24   Double Chex
25   Froot Loops
26   Frosted Flakes
27   Frosted Mini-Wheats
28   Fruit & Fibre Dates; Walnuts; and Oats
29   Fruitful Bran
30   Fruity Pebbles
31   Golden Crisp
32   Golden Grahams
33   Grape Nuts Flakes
34   Grape-Nuts
35   Great Grains Pecan
36   Honey Graham Ohs
37   Honey Nut Cheerios
38   Honey-comb
39   Just Right Crunchy  Nuggets
40   Just Right Fruit & Nut
41   Kix
42   Life
43   Lucky Charms
44   Maypo
45   Muesli Raisins; Dates; & Almonds
46   Muesli Raisins; Peaches; & Pecans
47   Mueslix Crispy Blend
48   Multi-Grain Che

In [88]:
# Display all cereal ratings

print(raw_data['rating'].values)

[68.402973 33.983679 59.425505 93.704912 34.384843 29.509541 33.174094
 37.038562 49.120253 53.313813 18.042851 50.764999 19.823573 40.400208
 22.736446 41.445019 45.863324 35.782791 22.396513 40.448772 64.533816
 46.895644 36.176196 44.330856 32.207582 31.435973 58.345141 40.917047
 41.015492 28.025765 35.252444 23.804043 52.076897 53.371007 45.811716
 21.871292 31.072217 28.742414 36.523683 36.471512 39.241114 45.328074
 26.734515 54.850917 37.136863 34.139765 30.313351 40.105965 29.924285
 40.69232  59.642837 30.450843 37.840594 41.50354  60.756112 63.005645
 49.511874 50.828392 39.259197 39.7034   55.333142 41.998933 40.560159
 68.235885 74.472949 72.801787 31.230054 53.131324 59.363993 38.839746
 28.592785 46.658844 39.106174 27.753301 49.787445 51.592193 36.187559]


In [97]:
# Find the product name with highest rating

# Attempt 1.
max_rating = raw_data['rating'].max()
raw_data[raw_data['rating'] == max_rating][['name', 'rating']]

# Attempt 2. idxmax() returns the row label of the maximum, which is what .loc[] expects
max_rating_index = raw_data['rating'].idxmax()
raw_data.loc[max_rating_index, ['name', 'rating']]

# Attempt 3. Use sort_values()
raw_data.sort_values('rating', ascending=False)

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.00,0.50,93.704912
64,Shredded Wheat 'n'Bran,N,C,90,3,0,0,4.0,19.0,0,140,0,1,1.00,0.67,74.472949
65,Shredded Wheat spoon size,N,C,90,3,0,0,3.0,20.0,0,120,0,1,1.00,0.67,72.801787
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.00,0.33,68.402973
63,Shredded Wheat,N,C,80,2,0,0,3.0,16.0,0,95,0,1,0.83,1.00,68.235885
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14,Cocoa Puffs,G,C,110,1,1,180,0.0,12.0,13,55,25,2,1.00,1.00,22.736446
18,Count Chocula,G,C,110,1,1,180,0.0,12.0,13,65,25,2,1.00,1.00,22.396513
35,Honey Graham Ohs,Q,C,120,1,2,220,1.0,12.0,11,45,25,2,1.00,1.00,21.871292
12,Cinnamon Toast Crunch,G,C,120,1,3,210,0.0,13.0,9,45,25,2,1.00,0.75,19.823573
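
When looking up the maximum, it matters whether a method returns a position or a label. The sketch below uses a made-up three-row table `demo` whose row labels (10, 20, 30) differ from the positions (0, 1, 2), and also shows `nlargest()` as a shortcut for sorting and taking the top rows.

```python
import pandas as pd

demo = pd.DataFrame({"name": ["100% Natural Bran",
                              "All-Bran with Extra Fiber",
                              "All-Bran"],
                     "rating": [33.983679, 93.704912, 59.425505]},
                    index=[10, 20, 30])

# argmax() gives the integer position; pair it with .iloc[].
position = demo["rating"].argmax()
# idxmax() gives the row label; pair it with .loc[].
label = demo["rating"].idxmax()
print(position, label)  # 1 20

best_by_position = demo.iloc[position]["name"]
best_by_label = demo.loc[label, "name"]
print(best_by_position == best_by_label)  # True

# nlargest() returns the top rows without sorting the whole table by hand.
top_two = demo.nlargest(2, "rating")
print(top_two["name"].tolist())  # ['All-Bran with Extra Fiber', 'All-Bran']
```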


In [99]:
# Display all cereals with rating above 60
raw_data[raw_data['rating'] > 60]


Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
20,Cream of Wheat (Quick),N,H,100,3,0,80,1.0,21.0,0,-1,0,2,1.0,1.0,64.533816
54,Puffed Rice,Q,C,50,1,0,0,0.0,13.0,0,15,0,3,0.5,1.0,60.756112
55,Puffed Wheat,Q,C,50,2,0,0,1.0,10.0,0,50,0,3,0.5,1.0,63.005645
63,Shredded Wheat,N,C,80,2,0,0,3.0,16.0,0,95,0,1,0.83,1.0,68.235885
64,Shredded Wheat 'n'Bran,N,C,90,3,0,0,4.0,19.0,0,140,0,1,1.0,0.67,74.472949
65,Shredded Wheat spoon size,N,C,90,3,0,0,3.0,20.0,0,120,0,1,1.0,0.67,72.801787


In [None]:
# Calculate sugar per ounce
# sugar per ounce = sugar per serving / ounces per serving



In [None]:
# Which product has the highest amount of sugar per ounce?
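
One possible way to approach the two exercises above is sketched here on three rows copied from the table. It assumes that the `sugars` column holds sugar per serving and the `weight` column holds ounces per serving; that mapping is an assumption, so check it against the dataset description. The result applies to these three rows only, not to the full dataset.

```python
import pandas as pd

demo = pd.DataFrame({"name": ["100% Bran", "100% Natural Bran", "Puffed Rice"],
                     "sugars": [6, 8, 0],        # assumed: sugar per serving
                     "weight": [1.0, 1.0, 0.5]}) # assumed: ounces per serving

# Column arithmetic divides row by row.
demo["sugar_per_ounce"] = demo["sugars"] / demo["weight"]

# idxmax() returns the row label of the largest value.
sweetest_label = demo["sugar_per_ounce"].idxmax()
sweetest_name = demo.loc[sweetest_label, "name"]

print(demo["sugar_per_ounce"].tolist())  # [6.0, 8.0, 0.0]
print(sweetest_name)                     # 100% Natural Bran
```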

