---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: E-Commerce

### 📋 **Topic**: DataFrames and Analyzing Data with Python

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---


## PART 1

**Topic**: We will learn the most useful data structure for data analysis in Python: the DataFrame. We will also see how easy it is to read and write data using Python/pandas.


## 1. DataFrames

DataFrames can be thought of as Excel sheets on steroids:
- Each row refers to an observation (e.g. an entity, purchase, review, user, ...)
- Each column refers to an attribute of the observation (e.g. age, height, ...)
- The first row usually has the names of each column
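
As a rough illustration of this layout, the sketch below builds a tiny table from made-up purchase records. The `purchases` name, its columns, and its values are hypothetical and not part of the course data.

```python
import pandas as pd

# Hypothetical purchases: each dictionary is one observation (a row),
# and each key is an attribute of that observation (a column)
purchases = pd.DataFrame(
    [
        {"user": "u1", "item": "book", "price": 12.5},
        {"user": "u2", "item": "pen", "price": 1.2},
        {"user": "u1", "item": "lamp", "price": 30.0},
    ]
)

print(purchases)                   # three rows, one per purchase
print(purchases.columns.tolist())  # column names come from the dictionary keys
```

Here the column names play the role of the header row in an Excel sheet.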


In [None]:
# 1.1 Creating a DataFrame

# First, let's import pandas
import pandas as pd
import numpy as np

# To make a DataFrame
test_scores = [19, 18, 20, 20, 17]
quiz_scores = [18, 17, 19, 20, 20]
names = ["Apostolos", "Aja", "Calai", "Katalina", "Cal"]

# pandas automatically names the columns of the DataFrame according to the dictionary keys
df = pd.DataFrame(
    {"test_scores": test_scores, "quiz_scores": quiz_scores, "names": names}
)
print(df)


In [None]:
# 1.2 Viewing DataFrames

# The head() function prints the first few lines of the DataFrame
# For DataFrames with millions of rows, this will be important...
print(df.head())


In [None]:
# Similarly, the tail() function prints the last few rows
# If your df is very small, they will be the same :)
print(df.tail())


In [None]:
# 1.3 Getting information about a DataFrame

# To access any column of a DataFrame
print(df["test_scores"])
print(type(df["test_scores"]))

In [None]:
print(df["names"])
print(type(df["names"]))

In [None]:
# A useful way to summarize a DataFrame fast
print(df.describe())

In [None]:
# Types of each column
print(df.dtypes)


In [None]:
# More detailed information about the DataFrame
# info() prints its report directly and returns None, so we do not wrap it in print()
df.info()

In [None]:
# How many observations (rows)
print(df.shape[0])  # or len(df)
# How many columns
print(df.shape[1])  # or len(df.columns)
# Names of rows (index)
print(df.index.tolist())
# Names of columns
print(df.columns.tolist())


In [None]:
# 1.4 Changing a DataFrame

# You can create a new column in the DataFrame as follows
df["new_column"] = df["quiz_scores"]
print(df)

# You can delete a column in the DataFrame by using drop()
df = df.drop("new_column", axis=1)
print(df.head())


In [None]:
# 1.5 Accessing subsets of your DataFrame

# You can access rows and columns directly
# In pandas, we use .loc[] for label-based indexing and .iloc[] for position-based indexing

# Select specific rows and columns by position
print(df.iloc[0:4]["quiz_scores"])  # rows 0-3, quiz_scores column
print(df.iloc[[0, 1, 3]])  # rows 0, 1, 3, all columns
print(df.iloc[0:4, 0:2])  # rows 0-3, columns 0-1
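
The cell above only exercises `.iloc`. A minimal sketch of the label-based counterpart `.loc` follows, assuming a small table whose row labels are names (the `scores` table and its index are made up for illustration).

```python
import pandas as pd

# Hypothetical table with names as row labels instead of 0, 1, 2
scores = pd.DataFrame(
    {"test_scores": [19, 18, 20], "quiz_scores": [18, 17, 19]},
    index=["Apostolos", "Aja", "Calai"],
)

# .loc selects by label; a label slice includes BOTH endpoints
by_label = scores.loc["Apostolos":"Aja", "quiz_scores"]

# .iloc selects by position; a position slice excludes the end position
by_position = scores.iloc[0:2, 1]

print(by_label)
print(by_position)
```

Both selections return the same two rows; the difference is whether you name the rows or count them.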


In [None]:
# You can select only those rows that satisfy a condition

# Select rows that have test_scores greater than 19
print(df[df["test_scores"] > 19])

# Select rows with either test score > 19 OR quiz score <= 18
print(df[(df["test_scores"] > 19) | (df["quiz_scores"] <= 18)])

# Select rows with both test score > 19 AND quiz score <= 18
print(df[(df["test_scores"] > 19) & (df["quiz_scores"] <= 18)])


In [None]:
# Select rows with test score less or equal to 19
print(df[~(df["test_scores"] > 19)])  # ~ is the negation operator
print(df[df["test_scores"] <= 19])

# Select rows with test scores equal to 19
print(df[df["test_scores"] == 19])
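
It can help to look at the condition on its own: a comparison on a column produces a boolean Series (a "mask"), and indexing with it keeps the rows marked `True`. The sketch below uses a made-up `df_demo` table; `query()` is shown as one possible alternative spelling, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical scores, mirroring the small table from section 1.1
df_demo = pd.DataFrame(
    {"test_scores": [19, 18, 20, 20, 17], "quiz_scores": [18, 17, 19, 20, 20]}
)

# The comparison returns one True/False value per row
mask = df_demo["test_scores"] > 19
print(mask)

# Indexing with the mask keeps only the rows where it is True
high = df_demo[mask]

# query() expresses the same filter as a string
same = df_demo.query("test_scores > 19")

print(high)
```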


## 2. Reading data


In [None]:
# 2.1 Reading CSV files
# Reading data with Python/pandas is super easy
# Let's load the reviews data

# Load it to a DataFrame (assuming we're running from base directory)
df = pd.read_csv("../data/reviews.csv")
print(type(df))


In [None]:
# 2.2 Exploring the data
# Let's explore it a little
print(df.describe())
print(df.head())
# tail() returns the last five rows of the DataFrame by default
print(df.tail())
# len() returns the number of rows of the DataFrame
print(len(df))
# len(df.columns) returns the number of columns of the DataFrame
print(len(df.columns))


In [None]:
# The following returns the number of rows that have "verified" equal to 1
print(len(df[df["verified"] == 1]))
# What proportion of reviews are verified? (the mean of a 0/1 column is the share of 1s)
print(df["verified"].mean())

# What are the unique different possible ratings?
print(df["productRating"].unique())
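
The same ideas can be tried without the course file. The sketch below uses a made-up `reviews_demo` table (its values are hypothetical) and adds `value_counts()`, which counts how often each rating occurs rather than only listing the distinct values.

```python
import pandas as pd

# Hypothetical reviews; the column names follow the ones used above
reviews_demo = pd.DataFrame(
    {"verified": [1, 0, 1, 1], "productRating": [5, 2, 5, 4]}
)

# Count of verified reviews: sum of a True/False mask
n_verified = (reviews_demo["verified"] == 1).sum()

# Share of verified reviews: mean of a 0/1 column
share_verified = reviews_demo["verified"].mean()

# Number of reviews per rating, most frequent first
rating_counts = reviews_demo["productRating"].value_counts()

print(n_verified, share_verified)
print(rating_counts)
```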


In [None]:
# 2.3 Writing data
# Let's keep only two star reviews
df_twostar = df[df["productRating"] == 2]
# Let's write the new file
df_twostar.to_csv("../temp/reviews_2star.csv", index=False)

# Super easy, right? And way way faster than Excel.
# And, as will become apparent, way more possibilities...
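
A quick way to convince yourself that writing worked is to read the file back. The sketch below does a write/read round trip in a temporary directory, so it does not depend on `../temp/` existing; the `reviews_demo` data is made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical reviews
reviews_demo = pd.DataFrame(
    {"productRating": [5, 2, 2, 4], "verified": [1, 0, 1, 1]}
)
two_star = reviews_demo[reviews_demo["productRating"] == 2]

# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "reviews_2star.csv"
    # index=False leaves the row labels out of the file
    two_star.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

Without `index=False`, the old row labels would be written as an extra unnamed column and show up again when the file is read.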


## PART 2

**Topic**: We will now look at how to analyze data. We will also introduce pandas method chaining, which helps us write more readable code.


In [None]:
# 1. Data
# Load the homes dataset
df_homes = pd.read_csv("../data/homes.csv")

# What is this data?
# This data contains information about homes in a county
# We can take a quick look at the dataset size and information
print(len(df_homes))
print(df_homes.columns.tolist())


In [None]:
# Or we can look at a summary of the data set to get some more advanced information
print(df_homes.describe())

# Which distinct conditions appear in the data?
print(df_homes["condition"].unique())


## 2. Pandas: isolating data


In [None]:
# 2.1 The select operation
# This operation is used to extract columns from our data by using their name

# For example, let's say we want to extract the total value column
values = df_homes[["totalvalue"]]  # Double brackets return DataFrame
print(values.mean())  # mean() on a DataFrame returns one value per column (as a Series)

print(type(values))

# Alternatively, single brackets return a Series
values_series = df_homes["totalvalue"]
print(values_series.mean())

# For example let's say we want to extract more than one column
df_new = df_homes[["totalvalue", "yearbuilt"]]

# If we want all columns except certain ones
df_new = df_homes.drop(["yearbuilt"], axis=1)
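
The difference between single and double brackets is easiest to see by checking the types. A small sketch on a made-up `homes_demo` table (the values are hypothetical):

```python
import pandas as pd

# Hypothetical homes
homes_demo = pd.DataFrame(
    {"totalvalue": [100000, 250000, 175000], "yearbuilt": [1990, 2015, 2003]}
)

as_frame = homes_demo[["totalvalue"]]  # list of names -> DataFrame
as_series = homes_demo["totalvalue"]   # single name  -> Series

frame_mean = as_frame.mean()    # Series with one entry per column
series_mean = as_series.mean()  # a single number

print(type(as_frame), type(as_series))
print(frame_mean)
print(series_mean)
```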


In [None]:
# 2.2 The filter operation
# This operation keeps only rows that satisfy a condition

# Keep only houses built in 2015
df_new = df_homes[df_homes["yearbuilt"] == 2015]

# Keep houses not built in 2015
df_new = df_homes[df_homes["yearbuilt"] != 2015]

# Keep houses built after 2015
df_new = df_homes[df_homes["yearbuilt"] > 2015]

# Keep houses that were built in 2011, 2013, or 2015
df_new = df_homes[df_homes["yearbuilt"].isin([2011, 2013, 2015])]
# Keep houses that are either in Scottsville or Crozet
df_new = df_homes[df_homes["city"].isin(["SCOTTSVILLE", "CROZET"])]


In [None]:
# Keep houses with the maximum number of bedrooms
max_bedrooms = df_homes["bedroom"].max()
df_new = df_homes[df_homes["bedroom"] == max_bedrooms]
# Alternatively
df_new = df_homes[df_homes["bedroom"] == df_homes["bedroom"].max()]

# Multiple conditions
df_new = df_homes[
    (df_homes["city"] == "CROZET") & (df_homes["finsqft"] > df_homes["finsqft"].mean())
]
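
When filters get long, naming each condition can make them easier to read. The sketch below does this on a made-up `homes_demo` table (cities and sizes are hypothetical) and also shows `~` combined with `isin()` to keep everything outside a list.

```python
import pandas as pd

# Hypothetical homes
homes_demo = pd.DataFrame(
    {
        "city": ["CROZET", "SCOTTSVILLE", "CROZET", "KESWICK"],
        "finsqft": [1500, 2200, 3000, 1800],
    }
)

# Name each condition, then combine them with & (and), | (or), ~ (not)
in_crozet = homes_demo["city"] == "CROZET"
above_average = homes_demo["finsqft"] > homes_demo["finsqft"].mean()
big_crozet = homes_demo[in_crozet & above_average]

# Houses that are in neither Scottsville nor Crozet
elsewhere = homes_demo[~homes_demo["city"].isin(["SCOTTSVILLE", "CROZET"])]

print(big_crozet)
print(elsewhere)
```

When conditions are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than comparisons such as `==` and `>`.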


In [None]:
# 2.3 The sort operation
# This operation helps us sort according to whatever we want
df_new = df_homes.sort_values("finsqft", ascending=False)
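
`sort_values()` also accepts a list of columns, with one sort direction per column. A minimal sketch on made-up data (the `homes_demo` values are hypothetical):

```python
import pandas as pd

# Hypothetical homes
homes_demo = pd.DataFrame(
    {
        "city": ["CROZET", "SCOTTSVILLE", "CROZET"],
        "finsqft": [1500, 2200, 3000],
    }
)

# Sort by city A-Z, and within each city by decreasing size
sorted_demo = (
    homes_demo
    .sort_values(["city", "finsqft"], ascending=[True, False])
    .reset_index(drop=True)  # renumber the rows 0, 1, 2 after sorting
)

print(sorted_demo)
```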


## 3. Pandas: method chaining


In [None]:
# 3.1 Chaining operations
# Sometimes we want to perform many operations on a dataset
# For example, let's say we want to (i) only select houses in Crozet
# (ii) only keep the "totalvalue", and "lotsize" columns, and
# (iii) sort our data by decreasing lotsize order

# We could do it step by step:
df_crozet = df_homes[df_homes["city"] == "CROZET"]
df_crozet = df_crozet[["totalvalue", "lotsize"]]
df_crozet = df_crozet.sort_values("lotsize", ascending=False)
print(df_crozet.head())


In [None]:
# Or we can chain the operations together:
df_crozet_2 = (
    df_homes[df_homes["city"] == "CROZET"]  # (i) keep only houses in Crozet
    [["totalvalue", "lotsize"]]  # (ii) keep the two columns we want
    .sort_values("lotsize", ascending=False)  # (iii) sort by decreasing lot size
)

# Method chaining makes code more readable and follows the data flow
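
One common way to lay out a chain is to wrap it in parentheses and put one operation per line. The sketch below does this on a made-up `homes_demo` table (values are hypothetical). It uses `.loc` with a `lambda` to filter rows and pick columns in one step, which is one option among several and not the only way to write it.

```python
import pandas as pd

# Hypothetical homes
homes_demo = pd.DataFrame(
    {
        "city": ["CROZET", "SCOTTSVILLE", "CROZET", "CROZET"],
        "totalvalue": [300000, 200000, 450000, 250000],
        "lotsize": [0.5, 1.2, 2.0, 0.3],
    }
)

crozet_demo = (
    homes_demo
    # (i) keep Crozet rows and (ii) keep two columns
    .loc[lambda d: d["city"] == "CROZET", ["totalvalue", "lotsize"]]
    # (iii) sort by decreasing lot size
    .sort_values("lotsize", ascending=False)
)

print(crozet_demo)
```

The `lambda` receives the DataFrame at that point in the chain, so the filter does not need to refer back to `homes_demo` by name.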


## 4. Pandas: deriving data


In [None]:
# 4.1 Creating new columns
# We can create new columns using assignment

# Create a copy of the data to work with
df_new = df_homes.copy()

# Create a new column by dividing two existing columns
df_new["value_sqft"] = df_new["totalvalue"] / df_new["finsqft"]

# Select only the columns we want and sort by value per square foot
df_new = df_new[["yearbuilt", "condition", "finsqft", "totalvalue", "city", "value_sqft"]]
df_new = df_new.sort_values("value_sqft", ascending=False)

print(df_new.head())


In [None]:
# We can also create conditional columns
# First, let's create the value_sqft column again
df_new = df_homes.copy()
df_new["value_sqft"] = df_new["totalvalue"] / df_new["finsqft"]

# Create a column that shows if a house has high value per sqft
median_value_sqft = df_new["value_sqft"].median()
df_new["high_value_sqft"] = df_new["value_sqft"] > median_value_sqft

# Select columns and sort
df_new = df_new[["yearbuilt", "condition", "finsqft", "totalvalue", "city", "value_sqft", "high_value_sqft"]]
df_new = df_new.sort_values("value_sqft", ascending=False)

print(df_new.head())
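
The column above holds `True`/`False`. If you would rather store labels, `np.where` picks one of two values per row based on a condition. A small sketch on made-up data (the `homes_demo` values and the `"high"`/`"low"` labels are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical homes
homes_demo = pd.DataFrame(
    {"totalvalue": [200000, 300000, 150000], "finsqft": [1000, 2000, 1500]}
)
homes_demo["value_sqft"] = homes_demo["totalvalue"] / homes_demo["finsqft"]

# Label each house relative to the median value per square foot
median_value_sqft = homes_demo["value_sqft"].median()
homes_demo["value_label"] = np.where(
    homes_demo["value_sqft"] > median_value_sqft, "high", "low"
)

print(homes_demo)
```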


In [None]:
# Creating several new columns, one assignment at a time
df_new = df_homes.copy()

# Create value per square foot
df_new["value_sqft"] = df_new["totalvalue"] / df_new["finsqft"]

# Create remodel indicator (1 if remodeled, 0 if not)
df_new["remodel"] = (df_new["yearremodeled"] > 0).astype(int)

# Select only the columns we want and sort
df_new = df_new[["value_sqft", "remodel", "city"]]
df_new = df_new.sort_values("value_sqft")

print(df_new.head())
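
If you prefer to define several columns in a single call, `assign()` is one option. It returns a new DataFrame and leaves the original untouched, so no explicit `copy()` is needed. A minimal sketch on made-up data (the `homes_demo` values are hypothetical):

```python
import pandas as pd

# Hypothetical homes; yearremodeled is 0 when a house was never remodeled
homes_demo = pd.DataFrame(
    {
        "totalvalue": [200000, 300000],
        "finsqft": [1000, 2000],
        "yearremodeled": [0, 2010],
    }
)

derived = homes_demo.assign(
    # value per square foot
    value_sqft=lambda d: d["totalvalue"] / d["finsqft"],
    # 1 if remodeled, 0 if not
    remodel=lambda d: (d["yearremodeled"] > 0).astype(int),
)

print(derived)
```

Because `assign()` returns a DataFrame, it fits naturally into a method chain.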


In [None]:
# 4.2 Summary statistics
# This computes summary statistics and creates a new DataFrame

# First, filter houses with yearbuilt info
df_homes_filtered = df_homes[df_homes["yearbuilt"] > 0]

# Compute summary statistics
df_homes_stats = df_homes_filtered.agg({"yearbuilt": ["min", "max", "count", "mean", "median"]}).round(2)

print(df_homes_stats)

# Alternative approach using describe
df_homes_stats = df_homes_filtered["yearbuilt"].describe()


In [None]:
# 4.3 Group by operations
# This function groups cases by common values of one or more columns

# First, filter houses with yearbuilt info
df_homes_filtered = df_homes[df_homes["yearbuilt"] > 0]

# Group by city and compute summary statistics
df_homes_stats = (
    df_homes_filtered
    .groupby("city")["yearbuilt"]
    .agg(["min", "max", "count", "mean", "median"])
    .round(2)
    .reset_index()
)

print(df_homes_stats)
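
To see what a group-by produces, it helps to try it on a table small enough to check by hand. The sketch below uses made-up data and pandas "named aggregation" to give the output columns custom names; the names `oldest`, `newest`, and `n_homes` are hypothetical.

```python
import pandas as pd

# Hypothetical homes
homes_demo = pd.DataFrame(
    {
        "city": ["CROZET", "CROZET", "SCOTTSVILLE"],
        "yearbuilt": [1990, 2010, 2000],
    }
)

stats_demo = (
    homes_demo
    .groupby("city")
    .agg(
        oldest=("yearbuilt", "min"),     # new name = (column, function)
        newest=("yearbuilt", "max"),
        n_homes=("yearbuilt", "count"),
    )
    .reset_index()  # turn the group labels back into a regular column
)

print(stats_demo)
```

The result has one row per city, which is the key difference from the within-group calculations where every original row is kept.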


In [None]:
# If you want a within-group quantity attached to every row, rather than one summary row per group
# First, filter houses with yearbuilt info
df_homes_filtered = df_homes[df_homes["yearbuilt"] > 0].copy()

# Calculate median year built by city
median_by_city = df_homes_filtered.groupby("city")["yearbuilt"].median()

# Add the median back to our dataframe
df_homes_filtered["median_yearbuilt"] = df_homes_filtered["city"].map(median_by_city)

# Create a new column indicating if the house is at least as new as the city median (note the >=)
df_homes_filtered["new"] = (df_homes_filtered["yearbuilt"] >= df_homes_filtered["median_yearbuilt"]).astype(int)

# Select only the columns we want
df_homes_stats = df_homes_filtered[["yearbuilt", "condition", "finsqft", "city", "median_yearbuilt", "new"]]

print(df_homes_stats.head())
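
The group-then-map step above can also be written with `transform()`, which computes a per-group value and returns it aligned to the original rows. This is an alternative sketch on made-up data (the `homes_demo` values are hypothetical), not a replacement for the approach above.

```python
import pandas as pd

# Hypothetical homes
homes_demo = pd.DataFrame(
    {
        "city": ["CROZET", "CROZET", "CROZET", "SCOTTSVILLE", "SCOTTSVILLE"],
        "yearbuilt": [1980, 2000, 2020, 1950, 1970],
    }
)

# One median per city, repeated on every row of that city
homes_demo["median_yearbuilt"] = (
    homes_demo.groupby("city")["yearbuilt"].transform("median")
)

# 1 if the house is at least as new as its city median, 0 otherwise
homes_demo["new"] = (
    homes_demo["yearbuilt"] >= homes_demo["median_yearbuilt"]
).astype(int)

print(homes_demo)
```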


In [None]:
# You can group by more than one variable
# Group by both city and the "new" indicator we just created
df_homes_stats = (
    df_homes_filtered
    .groupby(["city", "new"])["yearbuilt"]
    .agg(["min", "max", "count", "mean", "median"])
    .round(2)
    .reset_index()
)

print(df_homes_stats)


---

## 🎉 FIN!!

### Recap:
We covered the most useful analysis data structure in Python -- the DataFrame. We'll use that structure a lot in the next few weeks.

### Next week:
We will learn how to visualize data using Python.

---
