<a target="_blank" rel="noopener noreferrer" href="https://colab.research.google.com/github/epacuit/introduction-machine-learning/blob/main/tutorials/tutorial3.ipynb">![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)</a>


Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

(tutorial3)=
# Tutorial 3: Brief Introduction to Pandas

This tutorial will provide a very brief introduction to the Pandas library. Pandas is a powerful data manipulation library for Python. 

For a more in-depth introduction to Pandas, read the [Pandas Documentation](https://pandas.pydata.org/docs/).



#### Import the Pandas library and read a dataset

The first step is to read a dataset into a Pandas DataFrame. A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns).

In [None]:
import pandas as pd 

In [None]:
url = 'https://raw.githubusercontent.com/epacuit/introduction-machine-learning/refs/heads/main/tutorials/comedy_comparisons_metadata.csv'

df = pd.read_csv(url)

In [None]:
type(df) # the type of the object is a DataFrame

In [None]:
df.head() # shows the first 5 rows of the DataFrame

In [None]:
df.columns # lists the columns of the DataFrame

In [None]:
df.shape # shows the shape of the DataFrame (rows, columns)

In [None]:
len(df) # shows the number of rows in the DataFrame

#### Creating a DataFrame



In [None]:
df_1 = pd.DataFrame({
    'A': [1, 2, 3], 
    'B': [4, 5, 6],
    'C': ['x', 'y', 'z']})

df_1

In [None]:
df_2 = pd.DataFrame(
    [[1, 4, 'x'], [2, 5, 'y'], [3, 6, 'z']], 
    columns=['A', 'B', 'C'])
df_2
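The description of a DataFrame given earlier (heterogeneous columns, size-mutable) can be checked on a small example. The following is a minimal sketch; the frame `toy` and its columns are made up for illustration and are not part of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy data (not the tutorial's dataset).
toy = pd.DataFrame({
    "A": [1, 2, 3],        # integers
    "B": [0.5, 1.5, 2.5],  # floats
    "C": ["x", "y", "z"],  # strings
})

print(toy.dtypes)  # one dtype per column: columns may hold different types
print(toy.shape)   # (3, 3)

# Size-mutable: assigning a new column changes the shape of the same object.
toy["D"] = toy["A"] * 2
print(toy.shape)   # (3, 4)
```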

#### Selecting Columns

The next step is to select specific columns of the DataFrame. Selecting a single column returns a Series.

In [None]:
df['title'] # A column can be accessed by using the column name as a key

In [None]:
type(df['title']) # the type of a column is a Series

In [None]:
titles = df['title'] # the column can be stored in a variable

titles.head() # shows the first 5 rows of the Series

In [None]:
print(f"The 11th title is {titles[10]}") # a value in a Series can be accessed by using the index as a key

print(f'The 11th title is {df["title"][10]}') # you can also access a value in a Series by using the DataFrame and column name

print("\nThe first 5 values of the title: ", titles[0:5]) # A Series object can be sliced like a list

print("\nThe last 5 values of the title: ", titles[-5:]) # A Series object can be sliced like a list 
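In the cell above, `titles[10]` looks up the index label 10, which happens to coincide with the position because the DataFrame uses the default 0, 1, 2, ... index. When the labels are not the default ones, it can be clearer to say explicitly whether a label or a position is meant. A minimal sketch, using a made-up Series `s` rather than the tutorial's `titles`:

```python
import pandas as pd

# Hypothetical Series whose labels differ from positions.
s = pd.Series(["a", "b", "c", "d"], index=[10, 20, 30, 40])

by_label = s.loc[20]       # label-based lookup -> "b"
by_position = s.iloc[0]    # position-based lookup -> "a"
last_two = s.iloc[-2:]     # position-based slice -> "c", "d"

print(by_label, by_position)
print(last_two.tolist())
```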


In [None]:
print("The unique values of the view_count column are: ", df['view_count'].unique()) # shows the unique values of a column

print("The number of non-missing values in the view_count column is: ", df['view_count'].count()) # shows the number of non-missing values of a column

print("The number of unique values of the view_count column is: ", df['view_count'].nunique()) # shows the number of unique values of a column (missing values are ignored)

print("The number of unique values computed with a set is: ", len(list(set(df['view_count'])))) # Another way to count the unique elements in a column; the result can differ from nunique() when the column has missing values

Typically, it is faster to use Pandas's built-in methods.

In [None]:
%%timeit

df['view_count'].nunique()

In [None]:
%%timeit 

len(list(set(df['view_count'])))
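Besides speed, the counting approaches used above treat missing values differently. The following sketch uses a made-up Series `s` (not the tutorial's `view_count` column) to show the difference; the exact count returned by the `set` approach is left unstated because it depends on how repeated NaN values are handled.

```python
import pandas as pd

# Hypothetical Series with a repeated value and two missing values.
s = pd.Series([1.0, 2.0, 2.0, float("nan"), float("nan")])

n_present = s.count()                      # 3: non-missing values only
n_unique = s.nunique()                     # 2: NaN is ignored by default
n_unique_with_nan = s.nunique(dropna=False)  # 3: NaN counted once
n_from_set = len(set(s))                   # NaN != NaN, so this can over-count

print(n_present, n_unique, n_unique_with_nan, n_from_set)
```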

#### Filtering the DataFrame

In [None]:
df[df['view_count'] > 1000000] # shows the rows that have a view_count greater than 1,000,000

In [None]:
df['view_count'] > 1000000 # returns a boolean Series

In [None]:
df[df['comment_count'].isin([5, 10, 1000])] # shows the rows that have a comment_count that is either 5, 10, or 1000

In [None]:
df[(df['view_count'] > 1000000) & (df['comment_count'] > 1000)] # shows the rows that have a view_count greater than 1,000,000 and a comment_count greater than 1000
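Boolean Series can also be combined with `|` (or) and negated with `~` (not). Each comparison needs its own parentheses because `&` and `|` bind more tightly than `>`. A minimal sketch on a made-up frame `toy` that only borrows the column names from the dataset:

```python
import pandas as pd

# Hypothetical data; only the column names mirror the tutorial's dataset.
toy = pd.DataFrame({
    "view_count": [500, 2_000_000, 3_000_000, 50],
    "comment_count": [5, 2000, 10, 1500],
})

popular = toy["view_count"] > 1_000_000   # boolean Series
chatty = toy["comment_count"] > 1000      # boolean Series

both = toy[popular & chatty]      # rows satisfying both conditions
either = toy[popular | chatty]    # rows satisfying at least one condition
not_popular = toy[~popular]       # rows where the condition is False

print(both.index.tolist())         # [1]
print(either.index.tolist())       # [1, 2, 3]
print(not_popular.index.tolist())  # [0, 3]
```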

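One issue to be aware of is that when you filter a DataFrame, the index of the original DataFrame is preserved: each remaining row keeps its original label, so the labels of the filtered DataFrame may no longer run consecutively from 0.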


In [None]:
df[df["view_count"] % 2 == 0] # get all rows where the view_count is even

In [None]:
# get the element of the view_count column with index label 1
print("The 2nd element of the view_count column is ", df.loc[1, 'view_count'])
df[df["view_count"] % 2 == 0].loc[1] # raises a KeyError if the row with index label 1 was filtered out (its view_count is not even)

In [None]:
df[df["view_count"] % 2 == 0].values[1] # the values attribute returns a NumPy array, so [1] selects the second row of the filtered DataFrame by position
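Two other common ways to deal with the preserved index are selecting by position with `.iloc` and renumbering the rows with `reset_index(drop=True)`. A minimal sketch on a made-up frame `toy` (not the tutorial's dataset):

```python
import pandas as pd

# Hypothetical data: the rows labeled 0 and 2 have an even view_count.
toy = pd.DataFrame({"view_count": [4, 7, 10, 13]})

even = toy[toy["view_count"] % 2 == 0]
print(even.index.tolist())  # [0, 2]: original labels are kept

# Select the second remaining row by position, regardless of its label.
second_by_position = even.iloc[1]["view_count"]

# Or renumber the rows so that labels and positions agree again.
renumbered = even.reset_index(drop=True)
second_by_label = renumbered.loc[1, "view_count"]

print(second_by_position, second_by_label)  # 10 10
```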

#### Statistics about the DataFrame

In [None]:
df["view_count"].sum() # shows the sum of the view_count column (missing values are skipped)

In [None]:
sum(df["view_count"]) # outputs nan because the column has missing values (listed as NaN)

In [None]:
sum(df["view_count"].dropna()) # outputs the sum of the view_count column without the missing values
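The difference between Python's built-in `sum` and the Series method can be reproduced on a small example. This is a sketch with a made-up Series `s`, not the tutorial's `view_count` column:

```python
import math
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series([1.0, float("nan"), 3.0])

builtin_total = sum(s)               # nan: NaN propagates through addition
pandas_total = s.sum()               # 4.0: NaN is skipped by default
strict_total = s.sum(skipna=False)   # nan: skipping can be turned off
n_missing = s.isna().sum()           # 1: number of missing values

print(builtin_total, pandas_total, strict_total, n_missing)
```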

Again, it is faster to use Pandas's built-in methods.

In [None]:
%%timeit

df["view_count"].sum()  

In [None]:
%%timeit 

sum(df["view_count"].dropna())

In [None]:
df["like_count"].max() # shows the max of the like_count column

In [None]:
df.describe() # shows the summary statistics of the DataFrame

#### Combining DataFrames

In [None]:
df_3 = pd.DataFrame({
    'A': [10, 11, 12], 
    'B': [13, 14, 15],
    'C': ['xx', 'yy', 'zz']})

df_3

In [None]:
combined_df = pd.concat([df_1, df_2, df_3], axis=0) # concatenates the three DataFrames along the rows (stacked vertically)

combined_df

In [None]:
combined_df = pd.concat([df_1, df_2, df_3], axis=1) # concatenates the three DataFrames along the columns (placed side by side)

combined_df
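Note that `pd.concat` keeps the original row labels when stacking rows and keeps duplicate column names when placing frames side by side. A minimal sketch with two made-up frames, `left` and `right`, showing both effects and the `ignore_index=True` option for renumbering rows:

```python
import pandas as pd

# Hypothetical frames with the same column names.
left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

stacked = pd.concat([left, right], axis=0)
print(stacked.index.tolist())     # [0, 1, 0, 1]: row labels repeat

renumbered = pd.concat([left, right], axis=0, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]

side_by_side = pd.concat([left, right], axis=1)
print(side_by_side.columns.tolist())  # ['A', 'B', 'A', 'B']: names repeat
```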

### Task

In [None]:
def compare_views(df, video_id1, video_id2): 
    """Return True if the view count of video_id1 is greater than the view count of video_id2; otherwise, return False.

    If video_id1 or video_id2 is not in the DataFrame, raise a ValueError with the message '<video_id> is not in the DataFrame', where <video_id> is the ID that is missing from the DataFrame."""
    
    # YOUR CODE HERE
    raise NotImplementedError()


In [None]:
assert compare_views(df, 'vzpD6OogahQ', 'yzGWOpop6i8') == True
assert compare_views(df, 'yzGWOpop6i8', 'vzpD6OogahQ') == False
assert compare_views(df, 'DE1-cD3pTkA', 'XZqSz_X-j8Y') == False
assert compare_views(df, 'XZqSz_X-j8Y', 'DE1-cD3pTkA') == True
assert compare_views(df, 'yzGWOpop6i8', 'yzGWOpop6i8') == False
try:
    compare_views(df, 'vzpD6OogahQ', 'not_in_df1')
except ValueError as e:
    assert str(e) == 'not_in_df1 is not in the DataFrame'
try:
    compare_views(df, 'not_in_df2', 'vzpD6OogahQ')
except ValueError as e:
    assert str(e) == 'not_in_df2 is not in the DataFrame'