# Session 6: Pandas I  

## Overview  
This session focuses on analyzing two datasets: Netflix and Titanic. Below is a summary of the exercises and their objectives using Pandas methods.  

---

## Netflix Dataset Exercises  

1. **Identify Missing Ratings**  
   - **Task**: Check whether the 'rating' column contains missing values, and count the missing values in each column to assess the completeness of the data.  

2. **Films from 2021 in Your Country**  
   - **Task**: Replace missing values in the 'country' column with "Unknown" and identify films from 2021 that correspond to your country.  

3. **Movies from 2020 with Complete Information**  
   - **Task**: Filter the dataset to find movies from 2020 that have all necessary data fields populated.  

4. **Year with the Most Titles**  
   - **Task**: Determine which year had the highest number of entries in the dataset.  

5. **Average Annual Releases Since 2010**  
   - **Task**: Calculate the average number of titles released each year since 2010.  

---

## Titanic Dataset Exercises  

1. **Gender-Based Survival Percentage**  
   - **Task**: Calculate the survival rate for each gender by dividing the number of survivors by the total number of passengers in each gender group.  

2. **Survival Percentage by Gender and Class**  
   - **Task**: Compute survival rates for each combination of gender and passenger class to analyze trends across groups.  

---

#### Loading Data Sets

In [35]:
import pandas as pd

path = 'netflix_titles.csv'

df = pd.read_csv(path)

path = 'train_and_test2.csv'

titanic = pd.read_csv(path)

## Exercises

### Pandas I

Home exercises for Netflix:

1. Is there any missing rating?
2. How many films in 2021 correspond to your country?
3. What's the number of movies in 2020 with full information?
4. Which year has the most titles?
5. What has been the average number of releases per year since 2010?

And for Titanic:

1. Calculate Gender-Based Survival Percentage
2. Calculate Survival Percentage Grouped by Gender and Class

---

### Netflix

#### Exercise 1:

We check whether there are any missing values in the 'rating' column.

In [36]:
# Select rows where the 'rating' column is NaN using the 'isnull' method
missing_ratings_df = df[df['rating'].isnull()]

# Display the index labels of the rows where 'rating' is NaN
print(missing_ratings_df.index)


Index([5989, 6827, 7312, 7537], dtype='int64')
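To cover the "completeness" part of the task, the same check can be extended to every column at once. The snippet below is a minimal sketch on a small, made-up DataFrame (the `toy` frame and its values are hypothetical, not taken from the Netflix file); it relies only on `isna()`, `sum()` and `any()`.

```python
import pandas as pd

# Hypothetical toy data standing in for the Netflix dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": ["PG", None, "TV-MA", None],
    "country": ["Egypt", "Spain", None, "India"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if at least one rating is missing
has_missing_rating = bool(toy["rating"].isna().any())
print(has_missing_rating)
```

Applied to the real data, `df.isna().sum()` would give the per-column counts in the same way.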


#### Exercise 2:

Counting how many movies released in 2021 have "Egypt" as their country

In [37]:
# Filtering rows where 'release_year' is 2021 and 'country' is Egypt and 'type' is 'Movie'
films_2021_my_country = df[(df['release_year'] == 2021) & (df['country'] == "Egypt") & (df['type'] == 'Movie')]

# Counting the number of rows in the filtered DataFrame to show amount of movies
films_2021_my_country_count = films_2021_my_country.shape[0]

print(f"Number of movies from Egypt: {films_2021_my_country_count}")

Number of movies from Egypt: 0
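The exact comparison `df['country'] == "Egypt"` only matches rows whose country cell is exactly that string. If a cell lists several countries (for example a co-production), a substring match may be more appropriate. The sketch below also applies the "Unknown" replacement mentioned in the overview. It uses a hypothetical toy frame, so the count it prints says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data; the real file may be formatted differently
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "release_year": [2021, 2021, 2021, 2020],
    "country": ["Egypt", "France, Egypt", "Egypt", None],
})

# Replace missing countries with a placeholder value
toy["country"] = toy["country"].fillna("Unknown")

# Plain substring match so that multi-country cells are counted too
is_egypt = toy["country"].str.contains("Egypt", regex=False)
mask = (toy["release_year"] == 2021) & (toy["type"] == "Movie") & is_egypt

egypt_movies_2021 = int(mask.sum())
print(egypt_movies_2021)
```

Filling the missing values first keeps `str.contains` from returning NaN for those rows, which would otherwise break the boolean mask.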


#### Exercise 3:

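Counting the number of titles from 2020 that have no missing values in any column. Note that the filter below does not restrict 'type' to 'Movie', so TV shows are included in the count.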



In [38]:
# Filtering rows with .notnull() method, to eliminate rows with missing values
movies_2020_full_info = df[(df['release_year'] == 2020) & (df.notnull().all(axis=1))] # .all(axis=1) to check all columns

# Counting the number of rows in the filtered DataFrame to show amount of movies with full information
movies_2020_full_info_count = movies_2020_full_info.shape[0]

print(f"Number of movies from 2020 with full information: {movies_2020_full_info_count}")

Number of movies from 2020 with full information: 442
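If the count should cover movies only, one possible variant is to drop incomplete rows with `dropna()` and then filter on both year and type. This is a sketch on hypothetical toy data, not a recomputation of the figure above.

```python
import pandas as pd

# Hypothetical toy data with one incomplete row
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "release_year": [2020, 2020, 2020, 2019],
    "director": ["X", None, "Y", "Z"],
})

# Keep only rows without any missing value
complete = toy.dropna()

# Restrict to movies released in 2020
movies_2020_complete = complete[
    (complete["release_year"] == 2020) & (complete["type"] == "Movie")
]
print(len(movies_2020_complete))
```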


#### Exercise 4:

Finding the year with the most titles

In [39]:
# Counting the number of titles per year using the 'value_counts' method
titles_per_year = df['release_year'].value_counts()

# Finding the year with the most titles using the 'idxmax' method
year_with_most_titles = titles_per_year.idxmax()

print(f"The year with the most titles is {year_with_most_titles} with {titles_per_year[year_with_most_titles]} titles")  # look up that year's title count

The year with the most titles is 2018 with 1147 titles
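`idxmax` returns a single label, so if two years were tied for the maximum only one of them would be reported. The following sketch, on made-up years, shows one way to list every year that reaches the maximum count.

```python
import pandas as pd

# Hypothetical release years with a tie between 2018 and 2019
years = pd.Series([2018, 2018, 2019, 2019, 2020], name="release_year")

counts = years.value_counts()
top_count = int(counts.max())

# All years whose count equals the maximum
top_years = sorted(counts[counts == top_count].index.tolist())
print(top_years, top_count)
```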


#### Exercise 5:

Calculating the average number of releases per year from 2010 to the latest year in the dataset

In [40]:
# Filtering titles from 2010 onwards
titles_from_2010 = df[df['release_year'] >= 2010]

# Calculating the average number of releases per year
average_releases_2010 = titles_from_2010['release_year'].value_counts().mean()

print(f"The average number of releases per year from 2010 onwards is {average_releases_2010:.2f}.") #using .2f to round to 2 decimal places

The average number of releases per year from 2010 onwards is 622.67.
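`value_counts().mean()` divides the number of titles by the number of distinct years that actually appear in the data. If a year in the range had no titles at all, it would be left out of the denominator. The sketch below, on hypothetical years, contrasts that with an average over the full range of years, built with `reindex`.

```python
import pandas as pd

# Hypothetical release years; 2012 has no titles
years = pd.Series([2010, 2010, 2011, 2013, 2013, 2013])
recent = years[years >= 2010]

per_year = recent.value_counts()

# Average over years that appear in the data: 6 titles / 3 years
avg_present_years = per_year.mean()

# Average over every year from 2010 to the latest year: 6 titles / 4 years
full_range = range(2010, int(recent.max()) + 1)
avg_all_years = per_year.reindex(full_range, fill_value=0).mean()

print(avg_present_years, avg_all_years)
```

Both figures are the same whenever every year in the range has at least one title.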


---
### Titanic


#### Exercise 1:

In this exercise, we determine the survival rate for each gender by dividing the number of survivors by the total number of passengers in each gender group. Because the survival column holds 0/1 values, the mean of each group is exactly that ratio.


In [41]:
# Grouping data by 'Sex' and taking the mean of the 0/1 survival column, which gives the survival rate
gender_survival_percentage = titanic.groupby('Sex')['2urvived'].mean()

print(f"Male Survival Percentage: {gender_survival_percentage[0] * 100:.2f}%")    # 0 for Male.
print(f"Female Survival Percentage: {gender_survival_percentage[1] * 100:.2f}%")  # 1 for Female.

Male Survival Percentage: 12.93%
Female Survival Percentage: 50.00%
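The equivalence between the group mean and "survivors divided by passengers" can be checked directly. This is a minimal sketch on a hypothetical six-row frame that reuses the column names and the 0 = male / 1 = female coding from the cell above.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the notebook
toy = pd.DataFrame({
    "Sex": [0, 0, 0, 0, 1, 1],
    "2urvived": [1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby("Sex")["2urvived"]

# Explicit ratio: survivors / passengers per group
rate_by_division = grouped.sum() / grouped.count()

# Shortcut: mean of a 0/1 column
rate_by_mean = grouped.mean()

print(rate_by_division)
print(rate_by_mean)
```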


#### Exercise 2:

In this exercise, we calculate the survival rate for each combination of gender and class.


In [42]:
# Grouping data by 'Sex' and 'Pclass' and calculating mean survival rate
gender_class_survival_percentage = titanic.groupby(['Sex', 'Pclass'])['2urvived'].mean()

# Printing the results in a loop to display each class individually
for (sex, pclass), percentage in gender_class_survival_percentage.items():
    gender = "Male" if sex == 0 else "Female" 
    print(f"{gender} in Class {pclass}: {percentage * 100:.2f}% survival rate")

Male in Class 1: 25.14% survival rate
Male in Class 2: 9.94% survival rate
Male in Class 3: 9.53% survival rate
Female in Class 1: 63.19% survival rate
Female in Class 2: 66.04% survival rate
Female in Class 3: 33.33% survival rate
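As an alternative to printing in a loop, the grouped result can be reshaped into a small table with one row per gender and one column per class. The sketch below uses hypothetical toy data and `unstack`; the label mapping assumes the same 0 = male / 1 = female coding as above.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the notebook
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 1, 1, 1],
    "Pclass":   [1, 1, 3, 1, 3, 3],
    "2urvived": [1, 0, 0, 1, 1, 0],
})

rates = toy.groupby(["Sex", "Pclass"])["2urvived"].mean()

# Move 'Pclass' from the index to the columns and express rates as percentages
table = (rates.unstack("Pclass") * 100).round(2)

# Replace the numeric codes with readable labels
table.index = table.index.map({0: "Male", 1: "Female"})
print(table)
```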
