<a target="_blank" rel="noopener noreferrer" href="https://colab.research.google.com/github/epacuit/introduction-machine-learning/blob/main/tutorials/tutorial2.ipynb">![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)</a>

# Tutorial 2    

# Part 1

#### 1. Reading `csv` files

A `csv` file is a comma separated values file. It is a simple file format used to store tabular data, such as a spreadsheet or database. The first row of the file typically contains the column names, and the following rows contain the data.

The file `comedy_comparisons_metadata.csv` contains metadata about videos on YouTube.  The file is available at the following URL: [https://raw.githubusercontent.com/epacuit/introduction-machine-learning/refs/heads/main/tutorials/comedy_comparisons_metadata.csv](https://raw.githubusercontent.com/epacuit/introduction-machine-learning/refs/heads/main/tutorials/comedy_comparisons_metadata.csv)

Use the `csv` Python package ([https://docs.python.org/3/library/csv.html](https://docs.python.org/3/library/csv.html)) to read the file. Create a list `metadata` that contains one dictionary for each row in the file. The keys of each dictionary should be the column names, and the values should be the corresponding values in that row. For example, every dictionary in the list should have the keys "video_id", "title", "view_count", "like_count", "comment_count", and "duration", corresponding to the columns in the file.

In [1]:

import csv

### BEGIN SOLUTION
metadata = []
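# newline='' is recommended by the csv docs so quoted fields containing line breaks parse correctly
with open('./comedy_comparisons_metadata.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    header = next(reader)  # skip the header row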
    for row in reader:
        metadata.append({
            "video_id": row[0],
            "title": row[1],
            "view_count": row[2],
            "like_count": row[3],
            "comment_count": row[4],
            "duration": row[5],
        })
### END SOLUTION
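
As an alternative to indexing each row by position, `csv.DictReader` builds the dictionaries directly from the header row. The sketch below is only an illustration: it reads a small in-memory sample rather than the real file, and the sample rows (ids, titles, and counts) are made up.

```python
import csv
import io

# Hypothetical two-row sample standing in for comedy_comparisons_metadata.csv.
sample_csv = (
    "video_id,title,view_count,like_count,comment_count,duration\n"
    'abc123,"Cats, being cats",1500,20,3,95\n'
    "xyz789,Stand-up clip,,5,0,210\n"
)

# DictReader takes the keys from the header row, so no manual indexing is needed.
reader = csv.DictReader(io.StringIO(sample_csv))
sample_metadata = [dict(row) for row in reader]

print(sample_metadata[0]["title"])             # the quoted comma stays inside the title
print(repr(sample_metadata[1]["view_count"]))  # a missing value is read as an empty string
```

Every value is read as a string, including the counts, so numeric columns still need to be converted with `int` before doing arithmetic.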


In [2]:
assert len(metadata) == 11541
assert all([type(x) == dict for x in metadata])
assert all([sorted(list(x.keys())) == sorted(['video_id', 'duration', 'title', 'view_count', 'like_count', 'comment_count', ]) for x in metadata])

In [3]:
def avg_view_count(metadata):
    """
    Calculate the average view count of the videos in the metadata.   Return the average rounded to two decimal places.
    """
    ### BEGIN SOLUTION

    total = 0
    for video in metadata:
        total += int(video['view_count']) if video['view_count'] else 0
    return round(total / len(metadata), 2)

    ### END SOLUTION
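
The solution above counts a missing `view_count` as 0 while still dividing by the total number of videos, which is what the assertions in the next cell expect. For comparison, here is a sketch of a variant that averages only over the videos with a recorded view count. The name `avg_view_count_known` and the toy data are hypothetical and not part of the tutorial.

```python
def avg_view_count_known(metadata):
    """Average view count over rows with a non-empty view_count (hypothetical variant)."""
    counts = [int(video["view_count"]) for video in metadata if video["view_count"]]
    # Guard against an empty list so we never divide by zero.
    return round(sum(counts) / len(counts), 2) if counts else 0.0


# Toy data: the second video has no recorded view count.
toy = [{"view_count": "100"}, {"view_count": ""}, {"view_count": "300"}]
print(avg_view_count_known(toy))  # averages 100 and 300 only
```

On this toy list the variant gives 200.0, whereas counting the missing value as 0 would give 133.33, so the two conventions can differ noticeably when many values are missing.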



In [4]:
assert avg_view_count(metadata) == 891988.54
assert avg_view_count(metadata[100:200]) == 1152895.47

#### 2. Write a function that accepts the `metadata` list, a `video_id`, and a column name, and returns the value of that column for the video with that id.

For instance, `get_value(metadata, 'DE1-cD3pTkA', 'like_count')` should return the number of likes for the video with id "DE1-cD3pTkA".

In [5]:
def get_value(metadata, video_id, col_name):

    ### BEGIN SOLUTION

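    # Take the first row whose id matches; this raises IndexError if the id is absent.
    num = [md for md in metadata if md['video_id'] == video_id][0].get(col_name, 0)

    # Missing values are stored as empty strings; treat them as 0.
    return int(num) if num != '' else 0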

    ### END SOLUTION
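
The solution above scans the whole list on every call, which can become slow when the function is called once per comparison. One possible alternative, sketched below, is to build a dictionary keyed by `video_id` once and look values up in it. The names `build_index` and `get_value_indexed` are hypothetical, and unlike the solution above this sketch returns 0 for an unknown id instead of raising an error.

```python
def build_index(metadata):
    """Map each video_id to its row dictionary (hypothetical helper)."""
    # If an id appears more than once, the last row wins.
    return {md["video_id"]: md for md in metadata}


def get_value_indexed(index, video_id, col_name):
    """Return the column as an int; a missing id, column, or empty string counts as 0."""
    num = index.get(video_id, {}).get(col_name, "")
    return int(num) if num != "" else 0


# Toy data for illustration only.
toy = [
    {"video_id": "a", "view_count": "10", "like_count": ""},
    {"video_id": "b", "view_count": "25", "like_count": "4"},
]
index = build_index(toy)
print(get_value_indexed(index, "b", "like_count"))
```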



In [6]:
assert get_value(metadata, 'XZqSz_X-j8Y', 'view_count') == 1919
assert get_value(metadata, 'XZqSz_X-j8Y', 'like_count') == 7
assert get_value(metadata, 'XZqSz_X-j8Y', 'comment_count') == 3
 

# Part 2: Predicting Video Comparisons from Metadata

In this part, we will attempt to predict which of two YouTube videos is considered funnier based on their metadata.  

The dataset `test_comedy_comparisons_restricted.csv` is a subset of the *YouTube Comedy Slam Preference* dataset, available from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/223/youtube+comedy+slam+preference+data). It contains pairwise comparisons of videos, where each row records the video IDs of two videos and indicates which one was rated as funnier by a user.  

You can access the file at the following URL:  
[https://raw.githubusercontent.com/epacuit/introduction-machine-learning/refs/heads/main/tutorials/test_comedy_comparisons_restricted.csv](https://raw.githubusercontent.com/epacuit/introduction-machine-learning/refs/heads/main/tutorials/test_comedy_comparisons_restricted.csv).

## Tasks

1. **Read the Dataset**: Read the file `test_comedy_comparisons_restricted.csv` and create a list of dictionaries. Each dictionary should have the keys `"video_id_1"`, `"video_id_2"`, and `"winner"`.  
   - `"video_id_1"` and `"video_id_2"` should store the video IDs being compared.  
   - `"winner"` should be `1` if `video_id_1` is considered funnier and `0` if `video_id_2` is considered funnier.  
   - Store these dictionaries in a list named `comparisons`.  

2. **Implement Comparison Functions**: Write three different comparison functions of the following form:  

   ```python
   def is_funnier(metadata, video_id1, video_id2):
       """
       Returns True if video_id1 is predicted to be funnier than video_id2 based on metadata.
       """
   ```

   Each function should predict which video is funnier based on some metadata attribute, such as:
   - Number of views
   - Number of likes
   - Number of comments

3. **Evaluate Accuracy**: Write a function `evaluate` that accepts the metadata, the list of comparisons created in step 1, and a comparison function, and returns the *accuracy* of that comparison function. The accuracy is the proportion of comparisons where the function correctly predicts the funnier video.
  

In [7]:
import csv

### BEGIN SOLUTION
comparisons = []

# Read the file test_comedy_comparisons_restricted.csv and store one dictionary per row in the list comparisons

### END SOLUTION
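
One way this step might look is sketched below. It rests on an assumption that should be checked against the actual file: that each row has three fields with no header row, namely two video IDs followed by `left` or `right`, as in the original UCI dataset. If `test_comedy_comparisons_restricted.csv` has a header or encodes the winner differently, the parsing needs to be adjusted. The function name `read_comparisons` and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical sample; the real file's layout should be checked before reusing this.
sample_csv = "abc123,xyz789,left\nxyz789,qrs456,right\n"


def read_comparisons(file_obj):
    """Build comparison dictionaries from rows of (id_1, id_2, 'left' or 'right')."""
    result = []
    for video_id_1, video_id_2, label in csv.reader(file_obj):
        result.append({
            "video_id_1": video_id_1,
            "video_id_2": video_id_2,
            # 1 if the first (left) video won, 0 if the second (right) video won
            "winner": 1 if label.strip().lower() == "left" else 0,
        })
    return result


sample_comparisons = read_comparisons(io.StringIO(sample_csv))
print(sample_comparisons[0])
```

To read the real file, the same function could be passed a file object opened with `open(path, newline='')`.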

In [8]:


def is_funnier_1(metadata, video_id1, video_id2):
    ### BEGIN SOLUTION
    
    raise NotImplementedError
    
    ### END SOLUTION
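
As an illustration of the expected shape of a comparison function (not a reference solution), the sketch below predicts that the video with more views is funnier. The helper `view_count_of`, the function name `is_funnier_by_views`, and the toy metadata are all hypothetical. Treating a missing id or an empty count as 0 views is an assumption.

```python
def view_count_of(metadata, video_id):
    """Return the view count for video_id as an int; missing ids or values count as 0."""
    for md in metadata:
        if md["video_id"] == video_id:
            return int(md["view_count"]) if md["view_count"] else 0
    return 0


def is_funnier_by_views(metadata, video_id1, video_id2):
    """Predict that video_id1 is funnier when it has strictly more views than video_id2."""
    return view_count_of(metadata, video_id1) > view_count_of(metadata, video_id2)


# Toy metadata for illustration only.
toy_metadata = [
    {"video_id": "a", "view_count": "10"},
    {"video_id": "b", "view_count": "25"},
    {"video_id": "c", "view_count": ""},
]
print(is_funnier_by_views(toy_metadata, "b", "a"))
```

Ties are resolved in favor of the second video here, because the comparison is strict. The same pattern applies to likes or comments by reading a different column.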



In [9]:
def is_funnier_2(metadata, video_id1, video_id2):
    ### BEGIN SOLUTION
    
    raise NotImplementedError
    
    ### END SOLUTION




In [10]:
def is_funnier_3(metadata, video_id1, video_id2):
    ### BEGIN SOLUTION
    
    raise NotImplementedError
    
    ### END SOLUTION



In [11]:
def evaluate(metadata, comparisons, is_funnier):
    ### BEGIN SOLUTION
    
    raise NotImplementedError
    
    ### END SOLUTION
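
The accuracy computation can be sketched as follows. This is one possible reading of the task, not the reference solution: a prediction of `True` is mapped to winner `1`, `False` to winner `0`, and accuracy is the number of matches divided by the number of comparisons. The names `evaluate_sketch` and `first_has_more_views`, as well as the toy data, are hypothetical.

```python
def evaluate_sketch(metadata, comparisons, is_funnier):
    """Return the fraction of comparisons where is_funnier agrees with the recorded winner."""
    if not comparisons:
        return 0.0  # assumption: report 0.0 rather than dividing by zero
    correct = 0
    for comp in comparisons:
        predicted = 1 if is_funnier(metadata, comp["video_id_1"], comp["video_id_2"]) else 0
        if predicted == comp["winner"]:
            correct += 1
    return correct / len(comparisons)


def first_has_more_views(metadata, video_id1, video_id2):
    """Toy comparison function: more views means funnier; missing values count as 0."""
    views = {md["video_id"]: int(md["view_count"] or 0) for md in metadata}
    return views.get(video_id1, 0) > views.get(video_id2, 0)


toy_metadata = [
    {"video_id": "a", "view_count": "10"},
    {"video_id": "b", "view_count": "25"},
    {"video_id": "c", "view_count": ""},
]
toy_comparisons = [
    {"video_id_1": "a", "video_id_2": "b", "winner": 0},  # predicted 0, correct
    {"video_id_1": "b", "video_id_2": "c", "winner": 0},  # predicted 1, wrong
    {"video_id_1": "a", "video_id_2": "c", "winner": 1},  # predicted 1, correct
    {"video_id_1": "c", "video_id_2": "a", "winner": 0},  # predicted 0, correct
]
print(evaluate_sketch(toy_metadata, toy_comparisons, first_has_more_views))
```

On this toy data three of the four predictions match the recorded winner, so the accuracy is 0.75.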



In [12]:

print("The accuracy of is_funnier_1 is", evaluate(metadata, comparisons, is_funnier_1))

print("The accuracy of is_funnier_2 is", evaluate(metadata, comparisons, is_funnier_2))

print("The accuracy of is_funnier_3 is", evaluate(metadata, comparisons, is_funnier_3))

NotImplementedError: 