# Week 4
# Pandas Data Frames (Part 1)

[Pandas](https://pandas.pydata.org/) is a major tool for data scientists working in Python. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy.

References:
- Textbook Chapter 5: Getting Started with Pandas

In [None]:
import numpy as np
import pandas as pd # pd is the universally-used abbreviation

Pandas provides two data types that build on NumPy arrays:
- Series: a labeled 1D array, used to represent a single feature
- DataFrame: a labeled 2D array, used to represent a data table

We will focus on data frames today, as most data sets are stored in table format.
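
As a minimal sketch of how the two types relate (the names `s`, `df` and the labels below are made up for illustration): each column of a data frame is itself a Series, and `.to_numpy()` returns the underlying values as a NumPy array.

```python
import numpy as np
import pandas as pd

# A Series is a 1D array of values paired with an index of labels
s = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])

# A DataFrame is a 2D table with both row labels and column labels
df = pd.DataFrame(np.arange(6).reshape(3, 2),
                  index=['a', 'b', 'c'],
                  columns=['x', 'y'])

print(type(df['x']))         # each column is a Series
print(df.to_numpy().shape)   # the values as a plain 2D NumPy array
```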

In [None]:
# Define a DataFrame from scratch
df1 = pd.DataFrame(np.random.rand(5, 3),
                   columns=['Feature1', 'Feature2', 'Feature3'])
df1.head(3)  # displays the first 3 rows

In [None]:
# Print the shape of the data frame
print(df1.shape)

In [None]:
# Row indices
# print(df1.index)
print(df1.index.values)

In [None]:
# Column indices
# print(df1.columns)
print(df1.columns.values)

In [None]:
# Extract a column
df1['Feature1'] # returns a Series

In [None]:
# Access elements using .loc[row_index, col_index]
# Example: Print the Feature1 value on the first row
print(df1.loc[0, 'Feature1'])

# Ex: Display the value in the lower-right corner


# # There is another expression using integer indices: iloc[]
# print(df1.iloc[0, 0])
# print(df1.iloc[4, 2])
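
One possible way to reach the lower-right corner, sketched on a stand-in table with known values (the name `demo` is illustrative, not part of the notebook). `.loc[]` needs the actual labels, while `.iloc[]` accepts negative positions that count from the end.

```python
import numpy as np
import pandas as pd

# Illustrative 5x3 table holding the values 0..14
demo = pd.DataFrame(np.arange(15).reshape(5, 3),
                    columns=['Feature1', 'Feature2', 'Feature3'])

# Label-based: look up the last row label and the last column label
by_label = demo.loc[demo.index[-1], demo.columns[-1]]

# Position-based: negative positions count from the end
by_position = demo.iloc[-1, -1]

print(by_label, by_position)
```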

In [None]:
# Extract a row
df1.loc[2] # returns a Series

In [None]:
# Index slicing
# Print the Feature2 value for the first 3 rows
df1.loc[0:2, "Feature2"] # Both the start index and the end index are inclusive
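
A short sketch contrasting the two slicing conventions, again on an illustrative table named `demo`: a `.loc[]` slice includes its end label, whereas an `.iloc[]` slice follows the usual Python rule and excludes its end position.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(15).reshape(5, 3),
                    columns=['Feature1', 'Feature2', 'Feature3'])

label_slice = demo.loc[0:2, 'Feature2']   # labels 0, 1, 2 -> three rows
position_slice = demo.iloc[0:2, 1]        # positions 0, 1 -> two rows

print(len(label_slice), len(position_slice))
```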

In [None]:
# Ex: Print the Feature2 and Feature3 values for the last 3 rows
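# .loc[] is label-based, so a negative integer is treated as a label rather than
# a position counted from the end; use .iloc[] for negative positions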
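
Two possible solutions, sketched on an illustrative table `demo` rather than the random `df1`: select by position with `.iloc[]`, or convert the last three positions into labels first and then use `.loc[]`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(15).reshape(5, 3),
                    columns=['Feature1', 'Feature2', 'Feature3'])

# Position-based: last three rows, columns from position 1 onward
last3_by_position = demo.iloc[-3:, 1:]

# Label-based: take the last three row labels, then name the columns
last3_by_label = demo.loc[demo.index[-3:], ['Feature2', 'Feature3']]

print(last3_by_position)
print(last3_by_label)
```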



In [None]:
# Ex: Use boolean indexing to extract the last 3 rows
row_indices = (df1.index >= 2)
print(row_indices)
df1.loc[row_indices, :]

In [None]:
df1.loc[(df1.index <= 2), :]

In [None]:
# Ex: Extract rows whose Feature2 value is less than 0.5
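
A possible approach, shown on made-up values so the result is predictable (the name `demo` and its numbers are illustrative): comparing a column to a number produces a boolean Series, which `.loc[]` accepts as a row selector.

```python
import pandas as pd

demo = pd.DataFrame({'Feature1': [0.1, 0.9, 0.4, 0.7],
                     'Feature2': [0.8, 0.2, 0.5, 0.3]})

# One True/False per row; 0.5 itself does not satisfy "< 0.5"
mask = demo['Feature2'] < 0.5

# Keep only the rows where the mask is True, with all columns
low_feature2 = demo.loc[mask, :]

print(mask.tolist())
print(low_feature2)
```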




## Basic Table Operations
- Change a value 
- Add a new row
- Add a new column
- Remove a row
- Remove a column

In [None]:
data = [[60, 70, 80],
        [66, 88, 77],
        [100, 60, 30],
        [85, 87, 83]]
scores = pd.DataFrame(data,
                      index=['Alice', 'Bob', 'Chris', 'David'],
                      columns=['Quiz1', 'Quiz2', 'Final'])
scores

In [None]:
# Change Alice's final score to 90.
scores.loc['Alice', 'Final'] = 90
scores

In [None]:
# Add a new row: "Edward": [77, 88, 99]
scores.loc['Edward', :] = [77, 88, 99]
scores

In [None]:
print("Data types:\n", scores.dtypes) # Each column has its own data type
# Change Quiz1 to integers
scores['Quiz1'] = scores['Quiz1'].astype(int)
print("Data types after modification:")
print(scores.dtypes)
scores

In [None]:
scores.loc["Fred", ["Quiz1", "Quiz2"]] = [100, 100]
scores
# This creates a missing value in the Final column.
# pandas uses "NaN" (Not a Number) to indicate a missing value.
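
A small sketch of how missing values behave, using an illustrative table `scores_demo` built directly with `np.nan`. It assumes nothing beyond standard pandas methods: `isna()` flags the gaps, a column containing NaN is stored as floats, and statistics such as `mean()` skip NaN by default.

```python
import numpy as np
import pandas as pd

scores_demo = pd.DataFrame({'Quiz1': [60, 100],
                            'Final': [80, np.nan]},
                           index=['Alice', 'Fred'])

missing_per_column = scores_demo.isna().sum()   # count of NaN in each column
final_dtype = scores_demo['Final'].dtype        # float64, because NaN is a float
final_mean = scores_demo['Final'].mean()        # NaN is skipped by default
filled = scores_demo.fillna(0)                  # new data frame with NaN replaced

print(missing_per_column)
print(final_dtype, final_mean)
```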

In [None]:
# Append a new data frame
more_scores = pd.DataFrame(data={'Quiz1': [67, 76],
                                 'Quiz2': [78, 87],
                                 'Final': [89, 98]},
                           index=['Flora', 'Gabriel']) # Represent data as a dictionary
print(more_scores)
total_scores = pd.concat([scores, more_scores]) # pd.concat() creates a new data frame
total_scores

In [None]:
total_scores.shape

In [None]:
# Add a column "ExtraCredit"
total_scores['ExtraCredit'] = [0, 1, 2, 3, 4, 5, 6, 4.5]
total_scores

In [None]:
# Reset the column: a single value is broadcast to every row
total_scores['ExtraCredit'] = 0
total_scores

In [None]:
# Add additional columns from another data frame
# will be discussed in Chapter 8
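
As a preview only, and not necessarily the approach Chapter 8 takes, here is one way to pull columns from a second data frame with `DataFrame.join()`. The tables `quizzes` and `finals` are made up; the point is that rows are matched by index label, not by position.

```python
import pandas as pd

quizzes = pd.DataFrame({'Quiz1': [60, 66]}, index=['Alice', 'Bob'])

# Same students, listed in a different order
finals = pd.DataFrame({'Final': [77, 80]}, index=['Bob', 'Alice'])

# join() keeps the rows of `quizzes` and aligns `finals` on the index labels
combined = quizzes.join(finals)

print(combined)
```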


In [None]:
# Remove record for Chris
scores_without_chris = total_scores.drop('Chris') # drop() creates a new data frame

total_scores # The original data frame is not affected.
scores_without_chris 

# If you still want to keep a copy of the original data, assign the result from 
# drop() to a new variable. In this way you have two data frames.

In [None]:
# If the original data frame is no longer needed, then simply assign the drop
# result to the same variable.

total_scores = total_scores.drop('Chris')

total_scores

In [None]:
# Modifying an existing data frame is called an "in-place" operation.
total_scores.drop("David", inplace=True)
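
The difference between the two styles can be sketched on a small illustrative table (`demo` is not part of the notebook). Note that with `inplace=True` the method returns `None`, so its result should not be assigned back to the variable.

```python
import pandas as pd

demo = pd.DataFrame({'Quiz1': [60, 66, 100]},
                    index=['Alice', 'Bob', 'Chris'])

# Default: drop() returns a new data frame and leaves demo untouched
smaller = demo.drop('Chris')
rows_after_copy_drop = len(demo)

# In-place: demo itself is modified and the call returns None
result = demo.drop('Bob', inplace=True)

print(smaller.index.tolist(), demo.index.tolist(), result)
```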

In [None]:
total_scores

In [None]:
# Remove column "ExtraCredit"
total_scores = total_scores.drop('ExtraCredit', axis=1) # drop() creates a new data frame
# total_scores.drop('ExtraCredit', axis=1, inplace=True)

total_scores

In [None]:
# Remove both Fred and Flora
total_scores = total_scores.drop(['Fred', 'Flora'])
total_scores

## Table Arithmetic
- Perform an operation uniformly to all values in a column
- Arithmetic with multiple columns
- Calculate statistics
- Apply a user-defined function to all rows

In [None]:
# Create a data frame called grades
grades = pd.DataFrame([[56, 67, 78, 5],
                       [66, 77, 88, 8],
                       [98, 97, 85, 3]],
                      index=["Superman", "Hulk", "Thor"],
                      columns=["Quiz1", "Quiz2", "Final", "ExtraCredit"])
grades

In [None]:
# Double the extra credits

grades['ExtraCredit'] = grades['ExtraCredit'] * 2

grades

In [None]:
# Calculate the final grades:
#  FinalGrade = Quiz1 * 25% + Quiz2 * 25% + Final * 50% + ExtraCredit

# Parentheses let the expression span several lines without a backslash
grades['FinalGrade'] = (grades['Quiz1'] * 0.25 + grades['Quiz2'] * 0.25
                        + grades['Final'] * 0.5 + grades['ExtraCredit'])

grades

In [None]:
# Ex: Curve the grades:
# Formula: CurvedGrades = sqrt(Grades) * 10

# Attempt 1: use the ** operator
# grades['CurvedGrades'] = grades['FinalGrade'] ** 0.5 * 10
# grades

# Attempt 2: use numpy.sqrt()
grades['CurvedGrades'] = np.sqrt(grades['FinalGrade']) * 10 
# numpy functions operate element-wise on a whole Series (or list) of values
grades

In [None]:
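# Calculate the min, max, mean, median, variance, and std of the curved grades

print("Max grade:", grades['CurvedGrades'].max())
print("Min grade:", grades['CurvedGrades'].min())
print("Mean grade:", grades['CurvedGrades'].mean())
print("Median grade:", grades['CurvedGrades'].median())
print("Use numpy.var() to calculate the variance:",
      np.var(grades['CurvedGrades']))
print("Use numpy.std() to calculate the standard deviation:",
      np.std(grades['CurvedGrades']))
# Inside parentheses, you can break a long statement into multiple lines by
# starting a new line after a comma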
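
One detail worth checking when computing a variance: NumPy and pandas use different defaults. `np.var()` divides by n (`ddof=0`, population variance), while `Series.var()` divides by n - 1 (`ddof=1`, sample variance). The sketch below uses made-up values and converts to a NumPy array first so that the NumPy default is unambiguous.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 6.0])   # mean 4, squared deviations sum to 8

population_var = np.var(values.to_numpy())   # 8 / 3, since ddof=0 by default
sample_var = values.var()                    # 8 / 2, since ddof=1 by default
matched = values.var(ddof=0)                 # pandas asked to use ddof=0

print(population_var, sample_var, matched)
```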

In [None]:
# Ex: Define a function num2letter() that converts a numerical grade to a letter grade.
# For example, num2letter(95) returns 'A', num2letter(59) returns 'F'

def num2letter(num):
    
    # 90+ -> A
    # >= 80 and < 90 -> B
    # >= 70 and < 80 -> C
    # >= 60 and < 70 -> D
    # otherwise: F
    
    if num >= 90:
        return 'A'
    elif num >= 80:
        return 'B'
    elif num >= 70:
        return 'C'
    elif num >= 60:
        return 'D'
    else:
        return 'F'

In [None]:
num2letter(59)

In [None]:
# We cannot apply this function directly to a Series: the comparison num >= 90
# produces a Series of booleans, so the `if` statement raises a ValueError.
num2letter(grades['CurvedGrades'])
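
To see why the call fails, here is a self-contained sketch with a hypothetical two-way helper `is_passing()` standing in for `num2letter()`. An `if` statement needs a single True or False, but comparing a Series to a number gives one boolean per element.

```python
import pandas as pd

def is_passing(num):
    # Hypothetical scalar helper: the `if` expects one True/False value
    if num >= 60:
        return 'pass'
    return 'fail'

marks = pd.Series([95, 59])

try:
    is_passing(marks)            # `marks >= 60` is a Series of booleans
    error_name = None
except ValueError as err:
    error_name = type(err).__name__

# Series.apply() calls the function once per element instead
labels = marks.apply(is_passing)

print(error_name, labels.tolist())
```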

In [None]:
# Apply the function to the curved grade column

grades['LetterGrade'] = grades['CurvedGrades'].apply(num2letter)
grades
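
As an alternative to `apply()`, and only as a sketch rather than the method this notebook teaches, `pd.cut()` can bin a whole column at once. The sample `marks` are made up; `right=False` makes each interval include its lower bound, matching the `>=` thresholds in `num2letter()`.

```python
import numpy as np
import pandas as pd

marks = pd.Series([95, 89.9, 80, 70, 60, 59])

letters = pd.cut(marks,
                 bins=[-np.inf, 60, 70, 80, 90, np.inf],
                 labels=['F', 'D', 'C', 'B', 'A'],
                 right=False)    # intervals are [low, high)

print(letters.tolist())
```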

## Example: Revisit 80 Cereal Data

Using pandas data frames, let's repeat our analysis of the 80 Cereal Data:
- Load the csv file using `pd.read_csv()`
- Examine the data
- Explore the ratings
- Analyze sugar contents

In [None]:
import pandas as pd

In [None]:
# Load the dataset
raw_data = pd.read_csv('data/cereal.csv') 
raw_data.head(5) # By default, column names come from the first row of the file, and rows get integer indices.

In [None]:
# Display the shape


In [None]:
# Display the columns


In [None]:
# Display the data types



In [None]:
# Display all cereal names


In [None]:
# Display all cereal ratings



In [None]:
# Find the product name with highest rating



In [None]:
# Display all cereals with rating above 60


In [None]:
# Display the sugar and weight columns, sorted by sugar


In [None]:
# Calculate sugar per ounce
# sugar per ounce = sugar per serving / ounce per serving


In [None]:
# Which product has the highest amount of sugar per ounce?


In [None]:
# Are there other products with the maximum sugar per ounce?
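
The exercises above can be approached along the following lines. This is a sketch on made-up rows, not the real file: the column names `name`, `rating`, `sugars`, and `weight` are assumptions about `data/cereal.csv`, and the values are invented so that two products tie for the maximum.

```python
import pandas as pd

# Made-up rows standing in for data/cereal.csv; column names are assumed
cereal = pd.DataFrame({
    'name':   ['Alpha Flakes', 'Bran Bites', 'Choco Puffs', 'Sugar Pops'],
    'rating': [68.4, 59.4, 22.7, 35.3],
    'sugars': [5, 6, 12, 15],          # grams per serving
    'weight': [1.0, 1.0, 1.0, 1.25],   # ounces per serving
})

print(cereal.shape)
print(cereal.dtypes)

# Product with the highest rating: idxmax() gives the row label of the maximum
best = cereal.loc[cereal['rating'].idxmax(), 'name']

# Cereals with rating above 60
top_rated = cereal.loc[cereal['rating'] > 60, ['name', 'rating']]

# Sugar and weight columns, sorted by sugar
by_sugar = cereal[['name', 'sugars', 'weight']].sort_values('sugars')

# Sugar per ounce = sugar per serving / ounces per serving
cereal['sugar_per_oz'] = cereal['sugars'] / cereal['weight']

# A boolean mask finds every product tied at the maximum, not just the first
max_sugar_per_oz = cereal['sugar_per_oz'].max()
sweetest = cereal.loc[cereal['sugar_per_oz'] == max_sugar_per_oz, 'name']

print(best)
print(sweetest.tolist())
```

Comparing against the maximum, rather than calling `idxmax()`, is what answers the last question, since `idxmax()` reports only the first row that reaches the maximum.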
