# Python for Data Science
## Session 6
### Datasets – Pandas I 

---

### ***Note Before Proceeding***
1. Download the "netflix_titles.csv" and "train_and_test2 2.csv" files.
2. Update the "path" variable in each section to the location where you saved the corresponding file.
3. The Netflix dataset does not contain any titles released in 2021 from the Philippines.

## Netflix Questions

### **Question 1. Is there any missing rating?**

In [175]:
"""
This code checks if any of the Netflix titles contain missing ratings:
1. Loads the dataset
2. Finds titles with NaN values for 'rating' 
3. Prints the count, titles, and more details of missing ratings
"""

import pandas as pd
path = '/Users/biancabaldonado/Desktop/netflix_titles.csv'
#Edit this path depending on where you saved the netflix_titles.csv file

df = pd.read_csv(path)

#Indexing the titles that do not have a value for the rating
missing_ratings = df[df['rating'].isna()]

#Printing the number of titles missing
print("----Number of rows with missing ratings----")
print(f'There are {len(missing_ratings)} titles that do not have ratings')

#Printing the names of these titles with missing ratings
print("\n----Titles with missing 'rating'----")
for title in missing_ratings['title']:
    print(title)

#Displaying the full rows for the titles with missing ratings
print("\n----Information of the 4 Titles with missing 'rating'----")
missing_ratings

----Number of rows with missing ratings----
There are 4 titles that do not have ratings

----Titles with missing 'rating'----
13TH: A Conversation with Oprah Winfrey & Ava DuVernay
Gargantia on the Verdurous Planet
Little Lunch
My Honor Was Loyalty

----Information of the 4 Titles with missing 'rating'----


|  | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5989 | s5990 | Movie | 13TH: A Conversation with Oprah Winfrey & Ava ... | NaN | Oprah Winfrey, Ava DuVernay | NaN | January 26, 2017 | 2017 | NaN | 37 min | Movies | Oprah Winfrey sits down with director Ava DuVe... |
| 6827 | s6828 | TV Show | Gargantia on the Verdurous Planet | NaN | Kaito Ishikawa, Hisako Kanemoto, Ai Kayano, Ka... | Japan | December 1, 2016 | 2013 | NaN | 1 Season | Anime Series, International TV Shows | After falling through a wormhole, a space-dwel... |
| 7312 | s7313 | TV Show | Little Lunch | NaN | Flynn Curry, Olivia Deeble, Madison Lu, Oisín ... | Australia | February 1, 2018 | 2015 | NaN | 1 Season | Kids' TV, TV Comedies | Adopting a child's perspective, this show take... |
| 7537 | s7538 | Movie | My Honor Was Loyalty | Alessandro Pepe | Leone Frisa, Paolo Vaccarino, Francesco Miglio... | Italy | March 1, 2017 | 2015 | NaN | 115 min | Dramas | Amid the chaos and horror of World War II, a c... |
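
The same check can be extended to every column at once. The following is a minimal sketch on a small made-up frame (`sample` and its values are hypothetical, not taken from the dataset). It assumes only that missing entries are stored as NaN, which is how `pd.read_csv` represents empty fields by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for netflix_titles.csv
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["PG", np.nan, "TV-MA", "R"],
})

# isna() returns a Boolean frame; sum() counts the True values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# any() answers the yes/no question for a single column
has_missing_rating = sample["rating"].isna().any()
print(has_missing_rating)
```

On the real data, `df.isna().sum()` would report the missing count for `rating` alongside every other column in one call.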


### **Question 2. How many films in 2021 correspond to your country?**

In [176]:
"""
This code counts the titles from the Philippines released in 2021, broken down into Movies and TV Shows.
If no titles were released in 2021, it reports the most recent release year instead and counts the titles for that year.
"""

ph_titles = df[df['country'] == 'Philippines']

# Count titles, movies, and TV shows from 2021
count_ph_titles_2021 = len(ph_titles[ph_titles['release_year'] == 2021])
count_ph_movies_2021 = len(ph_titles[(ph_titles['release_year'] == 2021) & (ph_titles['type'] == 'Movie')])
count_ph_tvshows_2021 = len(ph_titles[(ph_titles['release_year'] == 2021) & (ph_titles['type'] == 'TV Show')])

print("----Titles Released in 2021 from the Philippines----")

# If no titles from 2021, show the most recent year instead
if count_ph_titles_2021 == 0:
    recent_year = ph_titles['release_year'].max()
    count_ph_titles_recent = len(ph_titles[ph_titles['release_year'] == recent_year])
    count_ph_movies_recent = len(ph_titles[(ph_titles['release_year'] == recent_year) & (ph_titles['type'] == 'Movie')])
    count_ph_tvshows_recent = len(ph_titles[(ph_titles['release_year'] == recent_year) & (ph_titles['type'] == 'TV Show')])

    print(f'There were {count_ph_titles_2021} titles released in 2021 in the Philippines. Since there are no releases during this given year, we look into the most recent releases instead.')
    print("\n----Titles Released in the Most Recent Year in the Philippines----")
    print(f"The most recent release year is {recent_year}. There were {count_ph_titles_recent} titles released in {recent_year} from the Philippines, {count_ph_movies_recent} of which are movies and {count_ph_tvshows_recent} of which are TV shows.")
else:
    print(f"{count_ph_titles_2021} titles were released in 2021 in the Philippines, including {count_ph_movies_2021} movies and {count_ph_tvshows_2021} TV shows.")


----Titles Released in 2021 from the Philippines----
There were 0 titles released in 2021 in the Philippines. Since there are no releases during this given year, we look into the most recent releases instead.

----Titles Released in the Most Recent Year in the Philippines----
The most recent release year is 2020. There were 5 titles released in 2020 from the Philippines, 4 of which are movies and 1 of which are TV shows.
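
Note that the equality test `df['country'] == 'Philippines'` keeps only rows whose `country` field is exactly "Philippines". If a title can list several countries in that field (an assumption about the dataset's layout that should be checked before relying on it), co-productions are left out of the count. A hedged sketch of a broader filter, using made-up rows in a hypothetical frame named `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the comma-separated country list is an assumption
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["Philippines", "Philippines, United States", "Japan", np.nan],
})

# Exact match: only rows where the whole field equals "Philippines"
exact = toy[toy["country"] == "Philippines"]

# Substring match: regex=False searches for the literal text,
# na=False treats missing countries as non-matches
partial = toy[toy["country"].str.contains("Philippines", na=False, regex=False)]

print(len(exact), len(partial))
```

Which filter is appropriate depends on whether "corresponds to your country" should include co-productions.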


### **Question 3. What's the number of movies in 2020 with full information?**


In [177]:
"""
This code looks for the movies released in 2020 that have full information.
"""

#Indexing the movies from 2020 
films_2020 = df[(df['release_year'] == 2020) & (df['type'] == 'Movie')]

#Removing the movies that have missing information / NaN values
films_2020_complete = films_2020.dropna()

#Counting the number of movies with complete information, after those with missing values have been dropped
num_films_complete = len(films_2020_complete)

#Printing the value of the movies with full information
print("----Number of Movies in 2020 with Full Information----")
print(f'There are {num_films_complete} movies from 2020 with full information.')


----Number of Movies in 2020 with Full Information----
There are 409 movies from 2020 with full information.
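
Called with no arguments, `dropna()` removes a row if any of its columns is missing, so "full information" above means every column is populated. The sketch below uses made-up rows (the frame `toy` is hypothetical) to show how the `subset` parameter narrows that definition to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in different columns
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", np.nan, "Z"],
    "cast": [np.nan, "P", "Q"],
})

# Default: drop a row if ANY column is missing
complete_all = toy.dropna()

# subset: only the listed columns are checked for missing values
complete_director = toy.dropna(subset=["director"])

print(len(complete_all), len(complete_director))
```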


### **Question 4. Give me the year with the most titles**


In [178]:
"""
This code looks for the year with the most titles.
"""

#Counting the titles released per year
films_per_year = df['release_year'].value_counts()

#Looking for the year with the most titles
year_with_most_films = films_per_year.idxmax()
film_count = films_per_year.max()

#Printing the year with the most titles
print("----Year with the Most Titles----")
print(f'{year_with_most_films} is the year with the most titles, as it specifically has {film_count} titles.')

----Year with the Most Titles----
2018 is the year with the most titles, as it specifically has 1147 titles.
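
`idxmax()` returns a single label, so if two years were tied for the top count only one of them would be reported. A small sketch on made-up years (the Series `years` is hypothetical) shows one way to list every tied year:

```python
import pandas as pd

# Hypothetical release years with a deliberate tie between 2017 and 2018
years = pd.Series([2018, 2018, 2017, 2017, 2019])

# Count titles per year
counts = years.value_counts()

# idxmax() reports just one of the tied years
single_best = counts.idxmax()

# Keep every year whose count equals the maximum
top_years = sorted(counts[counts == counts.max()].index.tolist())

print(single_best, top_years)
```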


### **Question 5. What has been the average in terms of releases from 2010?**

In [179]:
"""
This code looks for the average number of releases since 2010.
"""

#Indexing the dataframe to see all releases since 2010
since_2010 = df[df['release_year'] >= 2010]

#Grouping by release year
releases_per_year = since_2010.groupby('release_year').size()

#Averaging the total releases since 2010
average_releases = releases_per_year.mean()

#Printing the average release
print("----Average Releases Since 2010----")
print(f"The average number of releases per year since 2010 is {average_releases:.2f}")

----Average Releases Since 2010----
The average number of releases per year since 2010 is 622.67
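
The average above is taken over the years that appear in the data: `groupby('release_year').size()` produces one entry per year that has at least one title, so a year with no titles would not pull the mean down. The sketch below uses made-up years (the frame `toy` is hypothetical) to show the difference and how `reindex` can include empty years if that is the intended definition.

```python
import pandas as pd

# Hypothetical toy data: no titles in 2011, one title before 2010
toy = pd.DataFrame({"release_year": [2009, 2010, 2010, 2012, 2012, 2012, 2013]})

since_2010 = toy[toy["release_year"] >= 2010]
per_year = since_2010.groupby("release_year").size()

# Mean over years that have at least one title (2010, 2012, 2013)
mean_present_years = per_year.mean()

# Mean over every calendar year in the range, counting 2011 as zero
all_years = range(2010, 2014)
mean_all_years = per_year.reindex(all_years, fill_value=0).mean()

print(mean_present_years, mean_all_years)
```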


## Titanic Questions

### **Question 1. Calculate Gender-Based Survival Percentage**

In [180]:
"""
This code calculates the mean survival rate for each gender
"""

path = '/Users/biancabaldonado/Desktop/train_and_test2 2.csv'
#Edit this path depending on where you saved the "train_and_test2 2.csv" file

df = pd.read_csv(path)

#Grouping the data by gender ('Sex') and calculating the mean gender-based survival percentage
gender_survival = df.groupby('Sex')['2urvived'].mean() 
print("----Gender-Based Survival Percentage----")
print(gender_survival)

#Identifying the gender with the higher survival percentage
highest_survival_sex = gender_survival.idxmax()  
highest_survival_percentage = gender_survival.max()  

print("\n----Analysis of Gender-Based Survival Percentage----")
print(f"The Sex {highest_survival_sex} has the highest survival percentage, which is {highest_survival_percentage:.2f}")

----Gender-Based Survival Percentage----
Sex
0    0.1293
1    0.5000
Name: 2urvived, dtype: float64

----Analysis of Gender-Based Survival Percentage----
The Sex 1 has the highest survival percentage, which is 0.50
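
The values returned by `.mean()` are proportions between 0 and 1, because the survival column holds 0/1 flags. A minimal sketch on made-up passengers (the frame `toy` is hypothetical; only the column names follow the dataset) shows two ways to present them as percentages:

```python
import pandas as pd

# Hypothetical toy passengers; column names mirror the Titanic file
toy = pd.DataFrame({
    "Sex": [0, 0, 0, 0, 1, 1],
    "2urvived": [0, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 column is the share of ones in each group
rate = toy.groupby("Sex")["2urvived"].mean()

# Option 1: scale the whole Series to percent
pct = rate.mul(100).round(2)

# Option 2: let the format specifier do the scaling when printing
formatted = f"{rate.max():.2%}"

print(pct)
print(formatted)
```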


### **Question 2. Calculate Survival Percentage Grouped by Gender and Class**

In [181]:
"""
This code calculates the mean survival rate for each gender and each class accordingly
"""
#Grouping the data by gender ('Sex') and class ('Pclass') and calculating the mean survival rate for each combination
gender_class_survival = df.groupby(['Sex', 'Pclass'])['2urvived'].mean() 

print("----Survival Percentage Grouped by Gender and Class----")
print(gender_class_survival)
#Identifying the gender-class combination with the highest survival percentage
highest_survival_sexclass = gender_class_survival.idxmax()  # Getting the index (Sex, Pclass) with the highest survival rate
highest_survival_sexclass_percentage = gender_class_survival.max()  # Getting the highest survival percentage

print("\n----Highest Survival Rate Grouped by Gender and Class----")
print(f"The gender-class combination with the highest survival percentage is Sex {highest_survival_sexclass[0]} and Pclass {highest_survival_sexclass[1]} with a survival percentage of {highest_survival_sexclass_percentage:.2f}")

----Survival Percentage Grouped by Gender and Class----
Sex  Pclass
0    1         0.251397
     2         0.099415
     3         0.095335
1    1         0.631944
     2         0.660377
     3         0.333333
Name: 2urvived, dtype: float64

----Highest Survival Rate Grouped by Gender and Class----
The gender-class combination with the highest survival percentage is Sex 1 and Pclass 2 with a survival percentage of 0.66
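
Grouping by two columns yields a Series with a two-level index, which is why `idxmax()` returns a `(Sex, Pclass)` tuple. For reading, the same numbers can be pivoted into a grid with `unstack`. A sketch on made-up passengers (the frame `toy` is hypothetical; only the column names follow the dataset):

```python
import pandas as pd

# Hypothetical toy passengers; column names mirror the Titanic file
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 0, 1, 1, 1, 1],
    "Pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "2urvived": [1, 0, 0, 0, 1, 1, 1, 0],
})

# Series indexed by (Sex, Pclass)
rates = toy.groupby(["Sex", "Pclass"])["2urvived"].mean()

# The index label of the maximum is a (Sex, Pclass) tuple
best = rates.idxmax()

# Move the Pclass level into columns: rows are Sex, columns are Pclass
table = rates.unstack("Pclass")

print(best)
print(table)
```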
