**<h1 style="text-align: center">✨ Data Analysis with Disney ✨</h1>**
<p style="text-align: center"> 🔱 Project by: Fatima Khan, Ilker Dom, and Pei Huang 🔱 </p>
<hr/>

## Data dictionary ##
|ind. | name | description |
|----|----|:----|
|0.| **header** | Column names |
|1.| **show_id** | Unique id |
|2.| **type** | Movie or TV Show |
|3.| **title** | Name of the movie/show |
|4.| **director** | Directors of the movie/show |
|5.| **cast** | Main cast of the movie/show |
|6.| **country** | Country of production |
|7.| **date_added** | Date added to Disney Plus |
|8.| **release_year** | Original release year of the movie/TV show |
|9.| **rating** | Rating of the movie/show |
|10.| **duration** | Total duration of the movie/show |
|11.| **listed_in** | Genres in which the movie is listed |  
|12.| **description** | One-Line content description |

## Step 1: Reading the Data File 🕵️

In [None]:
from csv import reader # Reads a CSV file row by row.

with open('./Data/disney_plus_titles.csv', encoding="utf-8") as opened_file:
    read_file = reader(opened_file)
    data_list = list(read_file) # Store the data as a list of lists (one inner list per row).

In [None]:
# extracting only the column names
data_header = data_list[0]

# extracting the data, excluding the column names
data = data_list[1:]
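
As a small illustration of the header/data split above, here is a sketch that runs `csv.reader` on an in-memory sample instead of the real file. The sample rows and the names `sample_csv`, `sample_header`, and `sample_data` are made up for this example. It also shows why we use `reader` rather than splitting each line on commas: a quoted field that contains a comma stays in one piece.

```python
import io
from csv import reader

# Hypothetical sample standing in for the real CSV file (not actual dataset rows).
sample_csv = io.StringIO(
    'show_id,type,title\n'
    's1,Movie,"Example, the Movie"\n'
    's2,TV Show,Example Show\n'
)

rows = list(reader(sample_csv))
sample_header = rows[0]  # first row holds the column names
sample_data = rows[1:]   # everything after it is the actual data

print(sample_header)
print(sample_data)
```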

## Step 2: Data Exploration 🚀

In [None]:
'''
This function lets you view a slice of the data, from a start row up to (but not including) an end row.
It takes four parameters:
1. the dataset you want to explore
2. the index of the row you want to start at (inclusive)
3. the index of the row you want to stop at (exclusive)
4. whether to display the total number of rows and columns of the table (True or False)
By default, it uses our Disney dataset and displays the first three rows along with the row and column counts.
'''
def explore_data(data_local = data, start_row_local = 0, end_row_local = 3, rows_and_cols_local = True):
    sliced_data = data_local[start_row_local: end_row_local]
    for d in sliced_data:
        print(d)
    if rows_and_cols_local: # only print the totals if the user asked for them
        print("Total rows: ", len(data_local))
        print("Total columns: ", len(data_local[0]))

explore_data()

#### **Display Top 5 Rows**

In [None]:
explore_data(data, 0, 5, True)

#### **Display Last Five Rows**

First we need to modify the function so that it displays rows up to and including the last one, because the end index of a slice is exclusive.

In [None]:
'''
Our previous function could not fetch the last rows of the table: with negative indexing,
a slice that ends at -1 stops just before the last row, because the end index is exclusive.
The following version checks whether the end index is -1 and, if so, slices
the data from the start index through the end of the table.
Otherwise, it performs slicing as usual.
'''
def explore_data(data_local = data, start_row_local = -5, end_row_local = -1, show_last_row = True):
    if show_last_row and end_row_local == -1:
        sliced_data = data_local[start_row_local:]
    else:
        sliced_data = data_local[start_row_local: end_row_local]
    for d in sliced_data: 
        print(d)
        
explore_data()
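
To make the exclusive end index concrete, here is a minimal sketch on a made-up list of six placeholder rows (the names `rows`, `without_last`, and `with_last` are only for illustration):

```python
# Six placeholder "rows" standing in for the dataset.
rows = ["r0", "r1", "r2", "r3", "r4", "r5"]

# The end index is exclusive, so -1 stops just before the last row.
without_last = rows[-5:-1]

# Leaving the end index out slices through the end of the list.
with_last = rows[-5:]

print(without_last)  # ['r1', 'r2', 'r3', 'r4']
print(with_last)     # ['r1', 'r2', 'r3', 'r4', 'r5']
```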

## Step 3: Separating Movies and Shows ✂️

In [None]:
'''
In this part we create two lists to separate the movies from the TV shows, plus a third list for anything else.
We iterate over the data, convert the value in the type column (index 1) to lower case,
check which category it belongs to, and append the row to the matching list.
'''
disney_movies = []
disney_shows = []
disney_other = []
for d in data:
    content_type = d[1].lower() # the 'type' column: "Movie" or "TV Show"
    if 'movie' in content_type:
        disney_movies.append(d)
    elif 'tv' in content_type:
        disney_shows.append(d)
    else:
        disney_other.append(d)

print("Disney Movies: ", len(disney_movies))
print("Disney Shows: ", len(disney_shows))
print("Disney Other: ", len(disney_other))
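
The same separation can be wrapped in a reusable function. The sketch below is one possible variation, not the notebook's own method: `split_by_type` and the sample rows are hypothetical, and it compares the whole lower-cased value instead of looking for a substring.

```python
def split_by_type(rows, type_index=1):
    """Split rows into (movies, shows, other) based on the type column."""
    movies, shows, other = [], [], []
    for row in rows:
        content_type = row[type_index].strip().lower()
        if content_type == "movie":
            movies.append(row)
        elif content_type == "tv show":
            shows.append(row)
        else:
            other.append(row)
    return movies, shows, other

# Made-up rows in the same column order as the dataset (show_id, type, title).
sample_rows = [
    ["s1", "Movie", "Example Movie"],
    ["s2", "TV Show", "Example Show"],
    ["s3", "", "Row With Missing Type"],
]
sample_movies, sample_shows, sample_other = split_by_type(sample_rows)
print(len(sample_movies), len(sample_shows), len(sample_other))
```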


## Step 4: Get Unique Values 💎

In [None]:
'''
This function returns the unique values of any specified column.
It takes the dataset and a column index as its arguments.
It creates a new list, result_list, to store the unique values.
Then, for each row in the dataset, it checks whether the value in that column already exists in result_list
and, if not, adds it to the list.
'''
def list_of_elements(my_data_local = data, col_index_local = 8):
    result_list = []
    for d in my_data_local:
        if d[col_index_local] not in result_list:
            result_list.append(d[col_index_local])
    return result_list
list_of_elements(data, 8)
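
For larger datasets, checking `not in` against a list gets slow because every check scans the whole list. A possible alternative, sketched here under the assumption that the column values are strings (and therefore hashable), is `dict.fromkeys`, which keeps the first occurrence of each value in its original order. The function name `unique_in_order` and the sample rows are made up for this example.

```python
def unique_in_order(rows, col_index):
    """Return the unique values of one column, in order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(row[col_index] for row in rows))

# Made-up single-column rows standing in for the rating column.
sample_ratings = [["G"], ["PG"], ["G"], ["TV-G"], ["PG"]]
unique_ratings = unique_in_order(sample_ratings, 0)
print(unique_ratings)  # ['G', 'PG', 'TV-G']
```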

## Step 5: No. of Movies and Shows for Each Unique Value in a Column 

In [None]:
def elements_count(my_data_local, col_index_local):
    element_count_dict = {}
    for d in my_data_local: 
        if d[col_index_local] not in element_count_dict:
            element_count_dict[d[col_index_local]] = 1 
        else: 
            element_count_dict[d[col_index_local]] += 1
    sorted_elements_count = dict(sorted(element_count_dict.items(), key=lambda item: item[1], reverse=True))
    return sorted_elements_count

# elements_count(disney_shows, 8)
elements_count(disney_movies, 8)


How `sorted(element_count_dict.items(), key=lambda item: item[1], reverse=True)` works:

- `sorted()` takes an iterable and returns a new sorted list built from that iterable.
- `element_count_dict.items()` provides the iterable to be sorted: a view of the dictionary's (key, value) pairs.
- `key=lambda item: item[1]` specifies the sorting key. Here `item` is each (key, value) tuple and `item[1]` is its value, so the sorting is based on the values of the dictionary.
- `reverse=True` specifies that the sorting should be in descending order.

Wrapping the result in `dict()` turns the sorted list of pairs back into a dictionary that keeps this order.
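
Here is a small worked example of that sorting step on a made-up dictionary (the counts are invented for illustration, not taken from the dataset):

```python
# Invented counts, only to show the sorting behaviour.
sample_counts = {"PG": 2, "G": 5, "TV-G": 3}

# Sort the (key, value) pairs by value, largest first.
sorted_pairs = sorted(sample_counts.items(), key=lambda item: item[1], reverse=True)
print(sorted_pairs)  # [('G', 5), ('TV-G', 3), ('PG', 2)]

# dict() keeps the order of the sorted pairs (Python 3.7+).
sorted_counts = dict(sorted_pairs)
print(sorted_counts)  # {'G': 5, 'TV-G': 3, 'PG': 2}
```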

## Step 6: Fetching Unique Values for '_Listed in_' Column

In [None]:
def list_of_elements(my_data_local):
    result_list = []
    for d in my_data_local:
        if d[10] not in result_list:
            result_list.append(d[10])
    print(result_list)
list_of_elements(data)

## Step 7: Get Unique Dictionary of Category and Count 👯

In [None]:
'''
In step 6, we saw that multiple genres were listed in the same row, so each combination was treated as a "unique value"
even though the individual genres were duplicates.
We want to count how many titles belong to each individual category.
This function takes three parameters: our dataset, the column index, and the separator.
By default, the column index is set to 10, because the categories are stored at index 10 (the listed_in column),
and since the values are separated by commas, sep_local is set to ','.
The function iterates over each row, splitting the value in the "listed_in" column into individual categories.
Then, for each category, we use the strip() function to remove surrounding whitespace and either add it to our
newly created dictionary with a count of 1 or, if it is already there, increase its count.
Finally, we sort the dictionary by count in descending order.
'''
def elements_count(my_data_local, col_index_local = 10, sep_local = ','): # sep_local is the separator
    element_count_dict = {} # empty dictionary that will hold the genres and their counts
    for d in my_data_local:
        values = d[col_index_local].split(sep_local) # split the string so each element holds a single genre
        for value in values:
            value = value.strip() # remove surrounding whitespace from each element
            if value not in element_count_dict:
                element_count_dict[value] = 1
            else:
                element_count_dict[value] += 1
    sorted_elements_count = dict(sorted(element_count_dict.items(), key=lambda item: item[1], reverse=True))
    return sorted_elements_count
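
The split, strip, and count steps can also be sketched with `collections.Counter` from the standard library. This is an alternative illustration, not the function used in this notebook; `count_split_values` and the sample rows are hypothetical.

```python
from collections import Counter

def count_split_values(rows, col_index, sep=","):
    """Count each separated, whitespace-stripped value found in one column."""
    counts = Counter()
    for row in rows:
        # Split the cell into parts and count each part after stripping spaces.
        counts.update(part.strip() for part in row[col_index].split(sep))
    # most_common() returns (value, count) pairs sorted by count, largest first.
    return dict(counts.most_common())

# Made-up rows with a single genre column at index 0.
sample_genres = [["Animation, Family"], ["Family, Comedy"], ["Family"]]
genre_counts = count_split_values(sample_genres, 0)
print(genre_counts)
```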

In [None]:
elements_count(disney_shows)

In [None]:
elements_count(disney_movies)

Fill in the blanks below manually, based on the output of the two cells above.

Most movies are listed in <u>'Family'</u>, followed by 'Comedy'.

Most shows are listed in <u>'Animation'</u>, followed by 'Action-Adventure'.

## Step 8: Average Durations of Movies and Shows ⌛

#### Function to extract only the numeric values from the 'duration' column

In [None]:
'''
In the duration column, data is represented like this: "24 minutes", "3 seasons", etc.
In order to find the average durations of movies and TV shows, we first need to write a function that extracts
only the numeric values from the column.
To do this, we create a new empty list to store our converted data (converted_durations).
Then, for each row, we split the string on spaces and take the first element (at index 0) of the split list,
which gives us only the number in every row.
Finally, we convert these numbers to integers and append them to the new list.
The default column index of -3 points at the duration column, the third column from the end.
'''
def duration_converstion(my_data_local, col_index_local = -3):
    converted_durations = []
    for d in my_data_local:
        # split on spaces, take the first element of the split list, then convert it to int
        # because we will need to compute the average later
        converted_durations.append(int(d[col_index_local].split(' ')[0]))
    return converted_durations
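
The conversion above assumes every duration starts with a number; `int()` raises a `ValueError` on a blank or unexpected value. A more tolerant helper could look like the sketch below. `parse_duration` and the sample strings are hypothetical and not part of the notebook.

```python
def parse_duration(text):
    """Return the leading integer of a duration string, or None if there is none."""
    parts = text.split()  # split on any whitespace; an empty string gives []
    if not parts or not parts[0].isdigit():
        return None
    return int(parts[0])

print(parse_duration("24 minutes"))  # 24
print(parse_duration("3 seasons"))   # 3
print(parse_duration(""))            # None
```

Rows that come back as `None` can then be skipped before averaging.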


In [None]:

''' Storing the durations of movies and shows in two separate lists '''
converted_movies = duration_converstion(disney_movies)
converted_shows = duration_converstion(disney_shows)

print(converted_movies)


#### Function to find average duration

In [None]:
'''
Finding the average now becomes simple. All we have to do is divide the sum of the list by the length of
the list. For readability, we also round the answer to 3 decimal places.
'''
def get_average(durations):
    return round(sum(durations)/len(durations),3)
print(f'Average Duration of Movies: {get_average(converted_movies)} minutes')
print(f'Average No. of Seasons in TV Shows: ~{get_average(converted_shows)} seasons')
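
One edge case worth noting: dividing by `len(durations)` fails with a `ZeroDivisionError` if the list is empty. The sketch below shows one way to guard against that. `safe_average` is a hypothetical helper and the numbers are invented for the example.

```python
def safe_average(values, digits=3):
    """Return the rounded mean of values, or None for an empty list."""
    if not values:
        return None
    return round(sum(values) / len(values), digits)

print(safe_average([90, 100, 95]))  # 95.0
print(safe_average([1, 2, 2]))      # 1.667
print(safe_average([]))             # None
```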