# Working with Pandas on a Netflix dataset

## Exercise 1: Loading Data

**Objective**: Load the `netflix_titles.csv` file into a Pandas DataFrame. Display the first 5 rows using the `head()` method to ensure the data is loaded correctly.
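
As a starting point, the loading step might look like the sketch below. It reads a tiny made-up CSV from memory so it runs without the real file; the rows and the choice of columns are placeholders, not data from `netflix_titles.csv`.

```python
import io

import pandas as pd

# Made-up two-row stand-in for netflix_titles.csv (placeholder values only).
csv_text = """type,title,release_year
Movie,Example Movie,2020
TV Show,Example Show,2021
"""

# With the real file on disk this would be: df = pd.read_csv("netflix_titles.csv")
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default (fewer if the frame is shorter).
print(df.head())
```

If the column names and the first few values look as expected, the load most likely worked.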

## Exercise 2: Basic Information

**Objective**: Get basic information about the DataFrame.

- Display the shape (number of rows and columns) of the DataFrame.
- Use the `info()` method to get a summary of the DataFrame, including the data types of each column and the number of non-null entries.
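
A minimal sketch of both calls on a made-up three-row frame (the values are placeholders). Note that `shape` is an attribute, not a method, and that `info()` prints its summary directly rather than returning it.

```python
import pandas as pd

# Toy frame with one missing country so the non-null counts differ per column.
df = pd.DataFrame({
    "title": ["Example Movie", "Example Show", "Another Movie"],
    "country": ["Country A", None, "Country B"],
    "release_year": [2020, 2021, 2019],
})

# shape is a (rows, columns) tuple.
print(df.shape)

# info() prints column dtypes and non-null counts to stdout and returns None.
df.info()
```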

## Exercise 3: Indexing and Selection

**Objective**: Select specific columns and rows from the DataFrame.

- Select the `title`, `country`, and `release_year` columns and display the first 10 rows.
- Select and display the row for the movie "Dick Johnson Is Dead" using boolean indexing.
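
One way to approach both tasks, sketched on a toy frame. Only the title "Dick Johnson Is Dead" comes from the exercise; every other value here is a made-up placeholder.

```python
import pandas as pd

# Toy frame; all values except the first title are placeholders.
df = pd.DataFrame({
    "title": ["Dick Johnson Is Dead", "Example Show", "Another Movie"],
    "country": ["Country A", "Country B", "Country A"],
    "release_year": [2020, 2021, 2019],
    "rating": ["PG-13", "TV-MA", "R"],
})

# A list of column names inside [] selects several columns at once.
subset = df[["title", "country", "release_year"]].head(10)
print(subset)

# Boolean indexing: the comparison builds a True/False mask, one entry per row,
# and df[mask] keeps only the rows where the mask is True.
mask = df["title"] == "Dick Johnson Is Dead"
row = df[mask]
print(row)
```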

## Exercise 4: Slicing

**Objective**: Use slicing to create a sub-DataFrame.

- Create a sub-DataFrame containing only the first 20 entries of the Netflix dataset.
- From this sub-DataFrame, select only the `title` and `rating` columns.
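
A sketch of position-based slicing with `iloc`, using a generated 25-row toy frame (titles and ratings are placeholders). Plain `df[:20]` or `df.head(20)` would give the same rows here.

```python
import pandas as pd

# Generated 25-row toy frame so that taking the first 20 rows is meaningful.
df = pd.DataFrame({
    "title": [f"Title {i}" for i in range(25)],
    "rating": ["PG-13", "TV-MA", "R", "TV-14", "PG"] * 5,
    "release_year": [2000 + i for i in range(25)],
})

# iloc slices by position; the end of the slice is exclusive, so this is rows 0..19.
first_20 = df.iloc[:20]

# Column selection on the sub-DataFrame.
title_rating = first_20[["title", "rating"]]
print(title_rating)
```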

## Exercise 5: Basic Data Manipulation

**Objective**: Add a new column named `duration_minutes` that holds each title's duration as an integer number of minutes. The `duration` column uses two formats: minutes for movies (e.g., "100 min") and seasons for TV shows (e.g., "1 Season" or "2 Seasons"). For movies, extract the number of minutes directly. For TV shows, assume 10 episodes per season and 45 minutes per episode, so one season counts as 450 minutes.

**Hints**:

1. **Differentiating Between Movies and TV Shows**: Use the `type` column to apply different logic for movies and TV shows.
2. **Extracting Numeric Values**: You can extract the numeric part of a `duration` string using regular expressions. For movies this number is the duration in minutes; for TV shows it is the number of seasons.
3. **Assumptions for TV Shows**: For TV shows, assume each season contains 10 episodes, with each episode lasting 45 minutes.
4. **Using the `apply` Method**: Use the DataFrame's `apply` method with a custom function that converts a row's duration string into minutes. Remember to set `axis=1` so the function is called once per row.
5. **Creating the New Column**: After calculating the duration in minutes for both movies and TV shows, store the results in the `duration_minutes` column.
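
The hints above can be sketched as follows. This is one possible shape, not the only solution. The helper name `to_minutes`, the two constants, and the assumption that the `type` column holds exactly `"Movie"` for movies are illustrative choices; the toy rows reuse the example strings from the objective.

```python
import re

import pandas as pd

# Assumptions taken from the exercise text.
EPISODES_PER_SEASON = 10
MINUTES_PER_EPISODE = 45


def to_minutes(row):
    """Convert one row's duration string to minutes (hypothetical helper)."""
    match = re.search(r"\d+", str(row["duration"]))
    if match is None:
        return None  # missing or unparseable duration
    number = int(match.group())
    if row["type"] == "Movie":  # assumes movies are labeled exactly "Movie"
        return number
    # Anything else is treated as a TV show measured in seasons.
    return number * EPISODES_PER_SEASON * MINUTES_PER_EPISODE


# Toy rows using the example strings from the objective.
df = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show"],
    "duration": ["100 min", "1 Season", "2 Seasons"],
})

# axis=1 passes each row (as a Series) to to_minutes.
df["duration_minutes"] = df.apply(to_minutes, axis=1)
print(df)
```

If some rows have no usable duration, the helper returns `None` for them and the resulting column will no longer be a plain integer column, so those rows may need separate handling.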

## Exercise 6: Querying

**Objective**: Use query methods to filter the DataFrame.

- Filter the DataFrame to show only movies released in 2020 or later.
- Further refine this query to show only movies with a rating of `PG-13`.
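
A minimal sketch using `DataFrame.query`, assuming movies are labeled `"Movie"` in the `type` column. The four toy rows are made up. Boolean masks (for example `df[(df["type"] == "Movie") & (df["release_year"] >= 2020)]`) would work equally well.

```python
import pandas as pd

# Made-up rows covering each case: too old, matching, wrong type, wrong rating.
df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "title": ["A", "B", "C", "D"],
    "release_year": [2019, 2020, 2021, 2021],
    "rating": ["PG-13", "PG-13", "TV-MA", "R"],
})

# query() takes the filter as a string; column names are used directly.
recent_movies = df.query("type == 'Movie' and release_year >= 2020")
print(recent_movies)

# Refine the previous result with a second condition.
recent_pg13 = recent_movies.query("rating == 'PG-13'")
print(recent_pg13)
```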

## Exercise 7: Grouping and Aggregating

**Objective**: Analyze the Netflix dataset by performing grouping and aggregation operations.

- **Count Content by Country**: Calculate the total number of Netflix titles (both movies and TV shows combined) for each country. Display the top 5 countries by total title count.
- **Average Duration of Movies**: Find the average duration across all movies in the dataset. Assume every movie's duration is accurately recorded in minutes.

### Part 1: Count Content by Country

- **Hint 1**: Use the `groupby()` method to group the dataset by the `country` column. This lets you perform calculations on each group of titles that share the same country.
- **Hint 2**: After grouping the data by country, use the `size()` method to count the number of titles within each country. This method returns the size of each group.
- **Hint 3**: Once you have the counts, use the `sort_values(ascending=False)` method to sort the countries based on their title counts in descending order. This will help you identify the top countries with the most titles.
- **Hint 4**: To display the top 5 countries, you can use the `.head(5)` method after sorting the values. This will give you the first 5 rows of your sorted series, which correspond to the top 5 countries by title count.
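
Chained together, the four hints might look like this sketch. The country names and counts are placeholders chosen so that no two groups tie.

```python
import pandas as pd

# Placeholder countries with 6, 5, 4, 3, 2 and 1 titles respectively.
countries = (
    ["Country A"] * 6 + ["Country B"] * 5 + ["Country C"] * 4
    + ["Country D"] * 3 + ["Country E"] * 2 + ["Country F"] * 1
)
df = pd.DataFrame({
    "title": [f"Title {i}" for i in range(len(countries))],
    "country": countries,
})

# Group by country, count the rows per group, sort descending, keep the top 5.
counts = df.groupby("country").size().sort_values(ascending=False)
top5 = counts.head(5)
print(top5)
```

By default `groupby()` leaves out rows whose grouping key is missing, so titles without a country are not counted unless the missing values are filled first.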

### Part 2: Average Duration of Movies

- **Hint 5**: For movies, the `duration` column holds a number followed by the text "min". After filtering to movies, use `.str.extract(r'(\d+)')` to pull out the numeric part of each string (the raw-string prefix `r` keeps Python from treating `\d` as an invalid escape sequence), then `.astype(int)` to convert the extracted strings to integers. Note that `.astype(int)` raises an error if any extracted value is missing.
- **Hint 6**: After converting the duration column to integers, use the `mean()` method to calculate the average duration of movies. This method computes the mean value of a numeric column.
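
A sketch of Part 2 on made-up rows, assuming movies are labeled `"Movie"` in the `type` column. Passing `expand=False` is a choice made here so that `str.extract` returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Made-up rows: three movies and one TV show that must be excluded.
df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "duration": ["90 min", "110 min", "1 Season", "100 min"],
})

# Keep only movies so that season counts do not distort the average.
movies = df[df["type"] == "Movie"]

# Extract the digits, then convert the extracted strings to integers.
minutes = movies["duration"].str.extract(r"(\d+)", expand=False).astype(int)

average_minutes = minutes.mean()
print(average_minutes)
```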

## Exercise 8: Handling Missing Values

**Objective**: Handle missing values in the DataFrame.

- Identify columns with missing values and the count of missing values in each.
- For the `country` column, replace missing values with a default value (e.g., "Unknown").
- Drop any rows where the `title` or `release_year` is missing.

**Hints**:

- **Hint 1**: Chain the `.isnull()` and `.sum()` methods to count the missing values in each column.
- **Hint 2**: To fill missing values in the `country` column, consider using the `.fillna()` method with a default value such as "Unknown".
- **Hint 3**: Rows lacking essential information like `title` or `release_year` might not be useful for analysis. Consider removing these rows with the `.dropna()` method, passing the relevant columns to its `subset` parameter.
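
The three steps could be sketched like this on a toy frame with a few deliberately missing values (all values are placeholders).

```python
import pandas as pd

# Toy frame with gaps in every column.
df = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "country": ["Country A", "Country B", None, None],
    "release_year": [2020, 2019, None, 2021],
})

# isnull() marks missing cells as True; sum() counts the True values per column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Replace missing countries with a default label.
df["country"] = df["country"].fillna("Unknown")

# Drop rows where title or release_year is missing; other columns are ignored.
df = df.dropna(subset=["title", "release_year"])
print(df)
```

Assigning the results back (rather than relying on `inplace=True`) keeps each step explicit and easy to check.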

## Exercise 9: Exporting Data

**Objective**: Export a modified DataFrame to a new CSV file.

- Ensure that the index is not included in the exported file.
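
A minimal sketch of the export step. The output filename `netflix_titles_clean.csv` and the temporary directory are hypothetical choices made so the example does not overwrite anything; the toy frame stands in for your modified DataFrame.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the modified DataFrame (placeholder values).
df = pd.DataFrame({"title": ["A", "B"], "release_year": [2020, 2021]})

# Hypothetical output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "netflix_titles_clean.csv")

# index=False leaves the row index out of the file.
df.to_csv(out_path, index=False)

# Read the header line back to confirm there is no extra index column.
with open(out_path) as f:
    header = f.readline().strip()
print(header)
```

Without `index=False`, the file would start with an unnamed extra column holding the row labels.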

