# Pandas II

# Announcements - Monday November 14
* Download the files on Canvas->Files->Mike's Lecture Notes->lec28_pandas
* **Final Exam** 
  * December 19
  * Please fill out the [exam conflict form](https://cs220.cs.wisc.edu/f22/surveys.html)
* [Python Tutor](https://pythontutor.com/python-debugger.html#mode=edit)
* Read: [Intro to Pandas](https://cs220.cs.wisc.edu/f22/materials/readings/pandas-intro.html)
* [Common Project Issues Thread on Piazza](https://piazza.com/class/l7f7vr5x63n7l1)
* **P8 Timeout error** on gradescope - TA will email you with instructions to correct and resubmit
* [Zoom Link](https://uwmadison.zoom.us/j/9741859842?pwd=OURuZnZuL0lhYlJkNVJHR1pLeUQwUT09)
  * Projector Only
  * No Audio
  * The class is not livestreamed 

In [1]:
import pandas as pd
from pandas import Series, DataFrame
# We can explicitly import Series and DataFrame. Why might we do this?

###  Series Review


# Series from `list`

In [2]:
scores_list = [54, 22, 19, 73, 80]
scores_series = Series(scores_list)
scores_series

# What is the terminology for:  0, 1, 2, ... ??       A:  
# What is the terminology for:  54, 22, 19, .... ??   A:  

0    54
1    22
2    19
3    73
4    80
dtype: int64

#### Selecting certain scores.
What are all the scores `> 50`?

In [4]:
b = scores_series > 50
scores_series[b]

0    54
3    73
4    80
dtype: int64

**Answer:** Boolean indexing. Try the following...

In [5]:
scores_series[[True, True, False, False, True]] # often called a "mask"

0    54
1    22
4    80
dtype: int64

The list of Booleans acts as a "mask" over our data: only the positions marked `True` are kept.
The process of selecting data that meet our criteria is known as "filtering".
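
As a minimal sketch of the same idea, the mask can be built and applied in one expression, and masks can be combined. The variable names `high` and `middle` are made up for illustration.

```python
from pandas import Series

scores = Series([54, 22, 19, 73, 80])

# Build the mask and filter in a single expression.
high = scores[scores > 50]

# Masks can be combined with & (and), | (or), and ~ (not);
# each comparison needs its own parentheses.
middle = scores[(scores > 20) & (scores < 75)]

print(high.tolist())    # [54, 73, 80]
print(middle.tolist())  # [54, 22, 73]
```

Note that the filtered Series keeps the original index labels (0, 3, 4 for `high`), not a fresh 0, 1, 2.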

# Series from `dict`

In [6]:
# Imagine we hire students and track their weekly hours
week1 = Series({"Rita":5, "Therese":3, "Janice": 6})
week2 = Series({"Rita":3, "Therese":7, "Janice": 4})
week3 = Series({"Therese":5, "Janice":5, "Rita": 8}) # Wrong order! Will this matter?
print(week1)
print(week2)
print(week3)

Rita       5
Therese    3
Janice     6
dtype: int64
Rita       3
Therese    7
Janice     4
dtype: int64
Therese    5
Janice     5
Rita       8
dtype: int64


####  Add each person's Week 3 hours to their Week 1 hours

In [7]:
week1 + week3

Janice     11
Rita       13
Therese     8
dtype: int64

#### Total up everyone's hours

In [8]:
total_hours = week1+ week2 + week3
total_hours

Janice     15
Rita       16
Therese    15
dtype: int64

#### What is week1 / week3 ?

In [9]:
# Notice that we didn't have to worry about the order of indices
week1 / week3

Janice     1.200
Rita       0.625
Therese    0.600
dtype: float64

#### What type of values are stored in  week1 > week2?

In [10]:
print(week1)
print(week2)
week1 > week2
# Notice that indices are ordered the same

Rita       5
Therese    3
Janice     6
dtype: int64
Rita       3
Therese    7
Janice     4
dtype: int64


Rita        True
Therese    False
Janice      True
dtype: bool

####  What is week1 > week3?

In [12]:
print(week1)
print(week3)
# week1 > week3 # Does it work?
# No: relational operators can only compare Series
# whose indices are in the same order

# How do we fix this?
week1.sort_index() > week3.sort_index()


Rita       5
Therese    3
Janice     6
dtype: int64
Therese    5
Janice     5
Rita       8
dtype: int64


Janice      True
Rita       False
Therese    False
dtype: bool
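
The difference between arithmetic and comparison can be sketched as follows. This is an illustration under the assumption of a recent pandas version; `reindex` is shown as one alternative to `sort_index`, not as the lecture's method.

```python
from pandas import Series

week1 = Series({"Rita": 5, "Therese": 3, "Janice": 6})
week3 = Series({"Therese": 5, "Janice": 5, "Rita": 8})

# Arithmetic aligns on index labels, so the ordering does not matter.
total = week1 + week3
print(total["Rita"])  # 13

# Comparison does not align; a different index order raises ValueError.
try:
    week1 > week3
except ValueError as err:
    print("ValueError:", err)

# Option 1: sort both indices, as in the cell above.
by_sorting = week1.sort_index() > week3.sort_index()

# Option 2: reorder week3 to match week1, which keeps week1's order.
by_reindex = week1 > week3.reindex(week1.index)
print(by_reindex.to_dict())
```

The second option is handy when the original row order should be preserved in the result.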


# Lecture 28:  Pandas 2 - DataFrames


Learning Objectives:
- Create a DataFrame from 
 - a dictionary of Series, lists, or dicts
 - a list of Series, lists, dicts
- Select a column, row, cell, or rectangular region of a DataFrame
- Convert CSV files into DataFrames and DataFrames into CSV Files
- Access the head or tail of a DataFrame

**Big Idea**: DataFrames store 2-dimensional data in tables! A DataFrame is a collection of Series.

## You can create a DataFrame in a variety of ways!

- dictionary of Series
- dictionary of lists
- dictionary of dictionaries
- list of dictionaries
- list of lists

### From a dictionary of Series

In [15]:
# this is most like bringing in a csv file with a header row

names = Series(["Alice", "Bob", "Cindy", "Dan"])
scores = Series([6, 7, 8, 9])

# to make a DataFrame using a dictionary of Series, 
# we need to write column names for the keys
df = DataFrame({"names": names, "scores": scores})
df

   names  scores
0  Alice       6
1    Bob       7
2  Cindy       8
3    Dan       9


### From a dictionary of lists

In [16]:
name_list = ["Alice", "Bob", "Cindy", "Dan"]
score_list = [6, 7, 8, 9]

# same as above with lists. Remember, Series act like lists
df = DataFrame({"names": name_list, "scores": score_list})
df

   names  scores
0  Alice       6
1    Bob       7
2  Cindy       8
3    Dan       9


### From a dictionary of dictionaries
The keys of the inner dictionaries become the row index, so we need to make up keys that match across the columns.

In [18]:


data = {
    "Player name": {0: "Alice", 1: "Bob", 2: "Cindy", 3: "Dan"},
    "Score": {0: 6, 1: 7, 2: 8, 3: 9}
}
data

df = DataFrame(data)
df

  Player name  Score
0       Alice      6
1         Bob      7
2       Cindy      8
3         Dan      9


### From a list of dicts

In [20]:
# this is much like when we transformed our csv rows into dictionaries
# so we could stop looking up which column with header.index("score")

data = [
    {"Player name": "Alice", "Score": 6},
    {"Player name": "Bob", "Score": 7},
    {"Player name": "Cindy", "Score": 8},
    {"Player name": "Dan", "Score": 9}
]
data

df = DataFrame(data)
df

  Player name  Score
0       Alice      6
1         Bob      7
2       Cindy      8
3         Dan      9


### From a list of lists

In [21]:
data = [
    ["Alice", 6],
    ["Bob", 7],
    ["Cindy", 8],
    ["Dan", 9]
]
data

df = DataFrame(data)
df

       0  1
0  Alice  6
1    Bob  7
2  Cindy  8
3    Dan  9


### Explicitly naming the columns
We have to add the column names ourselves. We do this with `columns = [name1, name2, ....]`

In [23]:
data = [
    ["Alice", 6],
    ["Bob", 7],
    ["Cindy", 8],
    ["Dan", 9]
]
data

df = DataFrame(data, columns = ["name", "score"])
df

    name  score
0  Alice      6
1    Bob      7
2  Cindy      8
3    Dan      9


### Explicitly naming the indices
We can use `index = [name1, name2, ...]` to rename the index of each row

In [25]:
data = [
    {"Player name": "Alice", "Score": 6},
    {"Player name": "Bob", "Score": 7},
    {"Player name": "Cindy", "Score": 8},
    {"Player name": "Dan", "Score": 9}
]
data

df = DataFrame(data, index = ['a','b','c','d'])
df

  Player name  Score
a       Alice      6
b         Bob      7
c       Cindy      8
d         Dan      9


In [None]:
# TODO: 
# Make a DataFrame of 4 people you know with different ages
# Give names to both the columns and rows

# Share how you did this with your neighbor
# If you both did it the same way, try it a different way.

## Select a column, row, cell, or rectangular region of a DataFrame
### Data lookup: Series
- `s.loc[X]`   <- lookup by pandas index
- `s.iloc[X]`  <- lookup by integer position
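
A short sketch of the two lookups, using a made-up Series (`temps` is hypothetical and not part of the lecture files):

```python
from pandas import Series

# Hypothetical data for illustration only.
temps = Series({"Mon": 41, "Tue": 38, "Wed": 45})

print(temps.loc["Tue"])   # 38, looked up by index label
print(temps.iloc[1])      # 38, looked up by integer position
print(temps.iloc[-1])     # 45, negative positions count from the end
```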

In [None]:
hours = Series({"Alice": 6, "Bob": 7, "Cindy": 8, "Dan": 9})
hours

In [None]:
# Lookup Bob's hours by pandas index.


In [None]:
# Lookup Bob's hours by integer position.


In [None]:
# Lookup Cindy's hours by pandas index.


###  Data lookup: DataFrame


- `d.loc[r]`     lookup ROW by pandas ROW index
- `d.iloc[r]`    lookup ROW by ROW integer position
- `d[c]`         lookup COL by pandas COL index
- `d.loc[r, c]`  lookup by pandas ROW index and pandas COL index
- `d.iloc[r, c]`  lookup by ROW integer position and COL integer position

In [26]:
# We often call the object that we make df
data = [
    ["Hope", 10],
    ["Peace", 7],
    ["Joy", 4],
    ["Love", 11]
]
df = DataFrame(data, index = ["H", "P", "J", "L"], columns = ["Player name", "Score"])
df

  Player name  Score
H        Hope     10
P       Peace      7
J         Joy      4
L        Love     11


### What are 3 different ways of accessing row L? 

In [30]:
df.loc["L"]
df.iloc[3]
df.loc["L",:]
df.iloc[-1,:]

Player name    Love
Score            11
Name: L, dtype: object

### How about accessing a column?

In [34]:
df["Score"]
df.loc[:,"Score"]
df.iloc[:,1]
df.Score

H    10
P     7
J     4
L    11
Name: Score, dtype: int64

### What are 3 different ways to access a single cell?

In [38]:
df["Score"]["L"] # df["Score"] is a Series; ["L"] indexes into that Series
df.loc["L","Score"]
df.iloc[3,1]
df.Score["L"]

11

## How to set values for a specific entry?

- `d.loc[r, c] = new_val`
- `d.iloc[r, c] = new_val`

In [39]:
# change player L's name
df.loc["L", "Player name"] = "Luisa"
df

  Player name  Score
H        Hope     10
P       Peace      7
J         Joy      4
L       Luisa     11


In [None]:
# then add 3 to that player's score using .loc


In [None]:
# add 7 to a different player's score using .iloc


### Find the max score and the mean score

In [40]:
# find the max and mean of the "Score" column
# remember columns are Series
print(df["Score"].max(), df["Score"].mean())

11 8.0


### Find the highest scoring player
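
One possible approach, sketched on a fresh copy of the original player table so it does not depend on earlier edits to `df`: `idxmax()` gives the index label of the largest value, and a Boolean mask returns every row tied for the maximum. The names `best_label` and `top_rows` are made up for this sketch.

```python
from pandas import DataFrame

data = [["Hope", 10], ["Peace", 7], ["Joy", 4], ["Love", 11]]
df = DataFrame(data, index=["H", "P", "J", "L"], columns=["Player name", "Score"])

# idxmax() returns the index label of the largest value in the Series.
best_label = df["Score"].idxmax()
print(best_label)                          # L
print(df.loc[best_label, "Player name"])   # Love

# A Boolean mask returns every row tied for the maximum.
top_rows = df[df["Score"] == df["Score"].max()]
print(top_rows["Player name"].tolist())    # ['Love']
```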

##  Slicing a DataFrame

- `df.iloc[ROW_SLICE, COL_SLICE]` <- make a rectangular slice from the DataFrame using integer positions
- `df.loc[ROW_SLICE, COL_SLICE]` <- make a rectangular slice from the DataFrame using index labels

In [None]:
df.iloc[1:3, 0:2]

In [None]:
df.loc["P":"J", "Player name":"Score"] # notice that this way is inclusive of endpoints
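
The endpoint behavior can be checked directly. This sketch rebuilds the original player table and compares the two slices; the names `by_position` and `by_label` are made up.

```python
from pandas import DataFrame

data = [["Hope", 10], ["Peace", 7], ["Joy", 4], ["Love", 11]]
df = DataFrame(data, index=["H", "P", "J", "L"], columns=["Player name", "Score"])

# iloc: integer positions, stop position excluded (rows 1 and 2).
by_position = df.iloc[1:3, 0:2]

# loc: index labels, stop label included (rows P through J).
by_label = df.loc["P":"J", "Player name":"Score"]

print(by_position.equals(by_label))    # True
print(df.iloc[1:2].index.tolist())     # ['P']
print(df.loc["P":"J"].index.tolist())  # ['P', 'J']
```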

## Set values for sliced DataFrame

- `d.loc[ROW_SLICE, COL_SLICE] = new_val` <- set value by ROW INDEX and COL INDEX
- `d.iloc[ROW_SLICE, COL_SLICE] = new_val` <- set value by ROW Integer position and COL Integer position

In [None]:
df

In [None]:
df.loc["P":"J", "Score"] += 5
df

### Pandas allows selecting non-contiguous rows and columns

In [None]:
# just get Player name for Index P and L
df.loc[["P", "L"],"Player name"]

In [None]:
# add 2 to the scores of the players in rows P and L
df.loc[["P", "L"],"Score"] += 2
df

## Boolean indexing on a DataFrame

- `d[BOOL SERIES]`  <- makes a new DataFrame containing only the rows where the Boolean Series is True

In [None]:
df

### Make a Series of Booleans based on Score >= 15

In [None]:
b

### use b to slice the DataFrame
if b is true, include this row in the new df

### do the last two things in a single step
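
A sketch of the two-step and one-step versions, using made-up scores so that some rows pass the threshold (the values below are hypothetical, not the lecture's data):

```python
from pandas import DataFrame

# Hypothetical scores chosen for illustration.
df = DataFrame(
    {"Player name": ["Hope", "Peace", "Joy", "Love"], "Score": [10, 17, 9, 20]},
    index=["H", "P", "J", "L"],
)

b = df["Score"] >= 15              # Boolean Series with the same index as df
print(b.tolist())                  # [False, True, False, True]

filtered = df[b]                   # keeps only the rows where b is True
one_step = df[df["Score"] >= 15]   # same result without the intermediate name
print(filtered.index.tolist())     # ['P', 'L']
print(filtered.equals(one_step))   # True
```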

## Creating DataFrame from csv

In [None]:
# it's that easy!  
df = pd.read_csv("IMDB-Movie-Data.csv")
df

###   View the first few lines of the DataFrame
- `.head(n)` gets the first n rows; the default is 5

### get the first 2 rows

###   View the last few lines of the DataFrame
- `.tail(n)` gets the last n rows; the default is 5
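
A small sketch of both methods on a made-up ten-row table (a hypothetical stand-in for the movie data):

```python
from pandas import DataFrame

# Hypothetical ten-row table standing in for a larger dataset.
df = DataFrame({"n": range(10), "square": [i * i for i in range(10)]})

print(len(df.head()))             # 5
print(df.head(2)["n"].tolist())   # [0, 1]
print(df.tail(3)["n"].tolist())   # [7, 8, 9]
```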

### What are the first and last years in our dataset?

In [None]:
# Extract Year column


In [None]:
# First and last Year


In [None]:
# Format Review
print("First year: {}, Last year: {}".format(???))

### What are the rows that correspond to movies whose title contains "Harry" ? 
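
One way to approach this kind of question is the `.str.contains` method, which builds a Boolean mask from a column of strings. The sketch below uses a small made-up table; the column name `Title` is assumed to match the CSV.

```python
from pandas import DataFrame

# Small illustrative table; the column names are assumptions about the CSV.
movies = DataFrame({
    "Title": ["Harry Brown", "Arrival", "When Harry Met Sally", "La La Land"],
    "Year": [2009, 2016, 1989, 2016],
})

# True for each row whose title contains the substring "Harry".
mask = movies["Title"].str.contains("Harry")
harry_rows = movies[mask]
print(harry_rows["Title"].tolist())  # ['Harry Brown', 'When Harry Met Sally']
```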


### What is the movie at index 6 ? 

In [None]:
df

## Notice that there are two index columns
- That happens because writing a DataFrame to a CSV file also writes its index as a new column
- So if the DataFrame already contains an index column, you end up with two index columns
- Let's fix that problem
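
The effect can be reproduced without touching any files. This sketch writes a tiny made-up DataFrame to CSV text in memory and reads it back; `io.StringIO` stands in for a file on disk, and all variable names are hypothetical.

```python
import io
import pandas as pd

df = pd.DataFrame({"Title": ["A", "B"], "Year": [2015, 2016]})

# With the default index=True, the row index is written as an unnamed first column.
with_index = df.to_csv()
print(with_index.splitlines()[0])   # ,Title,Year

# Reading that text back adds a fresh index and keeps the old one as a column.
reread = pd.read_csv(io.StringIO(with_index))
print(reread.columns.tolist())      # ['Unnamed: 0', 'Title', 'Year']

# Fix 1: do not write the index in the first place.
clean = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(clean.columns.tolist())       # ['Title', 'Year']

# Fix 2: tell read_csv that the first column is the index.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(restored.columns.tolist())    # ['Title', 'Year']
```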

### How can you use slicing to get just columns with Title and Year?

In [None]:
df2 = ???
df2
# notice that this does not have the 'index' column

### How can you use slicing to get rid of the first column?

In [None]:
df = df.iloc[???] #all the rows, not column 0
df

### Write a df to a csv file

In [None]:
df.to_csv("better_movies.csv", index = False)

## Practice on your own.....Data Analysis with Data Frames


### What are all the movies that have above average run time (long movies)? 

In [None]:
long_movies = ???
long_movies

### Which long movie has the lowest rating?

In [None]:
# of these movies, what was the min rating? 
min_rating = ???
min_rating

In [None]:
# Which movies had this min rating?
???

### What are all long movies with someone in the cast named "Emma" ? 

In [None]:
???

### What is the title of the shortest movie?

In [None]:
???

### What movie had the highest revenue?

In [None]:
df["Revenue"].max() # does not work. Why?

In [None]:
# We need to clean our data
# Some movies have M at the end and others don't.
# All revenues are in millions of dollars.
def format_revenue(revenue):
    """ 
    Checks the last character of the string and formats accordingly
    """
    if type(revenue) == float: # need this in here if we run code multiple times
        return revenue
    elif revenue[-1] == 'M': # some have an "M" at the end
        return ??? # TODO: convert relevant part of the string to float and multiply by 1e6
    else:
        return ??? # TODO: convert to float and multiply by 1e6

In [None]:
# What movie had the highest revenue?
revenue = df["Revenue"].apply(format_revenue) # apply a function to a column; returns a Series
print(revenue.head())
max_revenue = revenue.max()

# make a copy of our df
rev_df = df.copy()
rev_df["Revenue (float)"] = revenue
rev_df
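
The `apply` pattern can be sketched on its own with made-up strings. The helper `to_dollars` and the sample values are hypothetical; this sketch assumes every value is a string and does not handle the float case that `format_revenue` checks for.

```python
from pandas import Series

def to_dollars(text):
    """Hypothetical cleaner: '12.5M' or '12.5' (both in millions) -> dollars."""
    if text.endswith("M"):
        text = text[:-1]          # drop the trailing "M"
    return float(text) * 1e6

raw = Series(["333.13M", "126.46", "0.5M"])   # made-up revenue strings
dollars = raw.apply(to_dollars)               # calls to_dollars on each value
print(dollars.tolist())
print(dollars.idxmax())                       # 0
```

Because `apply` returns a new Series with the same index, the result can be stored as a new column and used for lookups such as `idxmax()`.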

In [None]:
# Now we can answer the question!
???

In [None]:
# Or more generally...
rev_df.sort_values(by = "Revenue (float)", ascending = False)

In [None]:
df

### What is the average runtime for movies by "Francis Lawrence"?

### More complicated questions...

In [None]:
# Which director had the highest average rating? 

# one way is to make a python dict of director, list of ratings
director_dict = dict()

# make the dictionary: key is director, value is list of ratings
for i in range(len(df)):
    director = df.loc[i, "Director"]
    rating = df.loc[i, "Rating"]
    #print(i, director, rating)
    if director not in director_dict:
        director_dict[director] = []
    director_dict[director].append(rating)

# make a ratings dict: key is director, value is average rating
# only include directors with > 4 movies
ratings_dict = {k:sum(v)/len(v) for (k,v) in director_dict.items() if len(v) > 4}

#sort a dict by values
dict(sorted(ratings_dict.items(), key=lambda t:t[-1], reverse=True))
    

In [None]:
# FOR DEMONSTRATION PURPOSES ONLY
# We haven't learned (and will not learn) about "groupby"
# Pandas has many operations which will be helpful!

# Consider what you already know, and what Pandas can solve
# when formulating your solutions.
rating_groups = df.groupby("Director")["Rating"]
rating_groups.mean()[rating_groups.count() > 4].sort_values(ascending=False)

In [None]:
# Extra Practice: Make up some of your own questions about the movies