
# **Web Scraping & Data Handling Challenge**



### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
   - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies,
     or a specific genre, etc.) to ensure a diverse dataset.
```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```
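A minimal sketch of these two filters, assuming the scraped columns are named `Release Year` and `IMDb Rating` (as in the scraping cells of this notebook) and that values arrive as strings such as `'(2024)'`, `'7.8'`, or `'N/A'`. The function name and the sample rows are hypothetical.

```python
import pandas as pd


def filter_recent_high_rated(df, current_year, min_rating=7.0, years_back=2):
    """Keep rows released within `years_back` years and rated >= `min_rating`."""
    out = df.copy()
    # Scraped values are strings; anything without a 4-digit year becomes NaN.
    out["Release Year"] = pd.to_numeric(
        out["Release Year"].astype(str).str.extract(r"(\d{4})")[0], errors="coerce"
    )
    # 'N/A' and other non-numeric ratings become NaN and fail the comparison below.
    out["IMDb Rating"] = pd.to_numeric(out["IMDb Rating"], errors="coerce")
    recent = out["Release Year"] >= current_year - years_back
    good = out["IMDb Rating"] >= min_rating
    return out[recent & good].reset_index(drop=True)


# Hypothetical sample rows for illustration only.
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Release Year": ["(2024)", "2019", "2023", "N/A"],
    "IMDb Rating": ["7.8", "8.1", "6.2", "9.0"],
})
filtered = filter_recent_high_rated(sample, current_year=2025)
print(filtered)
```

Passing `current_year` explicitly keeps the function reproducible; in the notebook it could be taken from `datetime.date.today().year`.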

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the largest number of offerings.
   ```
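The three analysis steps could be sketched as follows, assuming `Genre` and `Streaming Services` hold comma-separated strings and `'N/A'` marks a missing value, as in the scraping cells of this notebook. The `summarize` helper and the sample data are hypothetical.

```python
import pandas as pd


def summarize(df):
    """Return (mean rating, top-5 genre counts, most common streaming service)."""
    ratings = pd.to_numeric(df["IMDb Rating"], errors="coerce")
    mean_rating = ratings.mean()  # NaN values are skipped by default

    def count_items(column):
        # Split "A, B" cells into one row per item, then count occurrences.
        items = df[column].dropna().str.split(",").explode().str.strip()
        items = items[(items != "") & (items != "N/A")]
        return items.value_counts()

    top_genres = count_items("Genre").head(5)
    services = count_items("Streaming Services")
    top_service = services.idxmax() if not services.empty else None
    return mean_rating, top_genres, top_service


# Hypothetical sample rows for illustration only.
sample = pd.DataFrame({
    "IMDb Rating": ["8.0", "6.0", "N/A"],
    "Genre": ["Drama, Crime", "Drama", "Comedy"],
    "Streaming Services": ["Netflix", "Netflix, Hotstar", "N/A"],
})
mean_rating, top_genres, top_service = summarize(sample)
print(mean_rating, top_service)
print(top_genres)
```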

**3. Data Export:**

```
   - Dump the filtered and analysed data into a CSV file for further processing and reporting.

   - Save the CSV file in your Drive folder and share the Drive link in the Colab notebook, with view access granted to anyone with the link.
```

**Submission:**
```
- Submit a link to your Colab made for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your code shouldn't have any errors and should be executable in one go.

- Before the conclusion, include your dataset Drive link in the notebook.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.
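One way to approach note 1 is a small retry wrapper around the HTTP call. The sketch below is an assumption rather than part of the assignment: `fetch_with_retries` is a hypothetical helper, and the `get` argument stands for any callable that behaves like `requests.get`.

```python
import time


def fetch_with_retries(url, get, retries=3, delay=1.0, **kwargs):
    """Call get(url, **kwargs) up to `retries` times; return the response or None."""
    for attempt in range(1, retries + 1):
        try:
            response = get(url, **kwargs)
            response.raise_for_status()  # treat 4xx/5xx responses as failures
            return response
        except Exception as exc:  # broad on purpose: network, HTTP, and timeout errors
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc}")
            if attempt < retries:
                time.sleep(delay)  # pause before trying again
    return None
```

In the notebook this could be called as `fetch_with_retries(movie_url, requests.get, headers=headers, timeout=10)`, with a `None` result recorded as `'N/A'` for that title.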








# **Start The Project**

## **Task 1:- Web Scraping**

In [None]:
# Install all necessary libraries
!pip install bs4
!pip install requests

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


In [None]:
# Import all necessary libraries

from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

## **Scraping Movie Data**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

url = 'https://www.justwatch.com/in/movies?release_year_from=2000'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

movie_links = soup.find_all('a', href=True)
movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]
url_list = ['https://www.justwatch.com' + x for x in movie_urls]

movie_data = []

for movie_url in url_list:
    response = requests.get(movie_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    title = soup.find('h1').text.strip() if soup.find('h1') else 'N/A'
    release_year = soup.find('span', class_='release-year').text.strip() if soup.find('span', class_='release-year') else 'N/A'
    imdb_rating = soup.find('span', class_='imdb-score').text.strip().split(' ')[0] if soup.find('span', class_='imdb-score') else 'N/A'

    streaming_services = []
    service_tags = soup.find_all('a', class_='offer')
    for tag in service_tags:
        img = tag.find('img')
        service = img['alt'].strip() if img and 'alt' in img.attrs else 'N/A'
        streaming_services.append(service)

    movie_data.append({
        'Title': title,
        'Release Year': release_year,
        'IMDb Rating': imdb_rating,
        'Streaming Services': ', '.join(streaming_services)
    })

df = pd.DataFrame(movie_data)

output_path = '/content/drive/My Drive/movie_data.csv'
df.to_csv(output_path, index=False)

print(f"Movie data scraping complete and saved to '{output_path}'.")

Mounted at /content/drive
Movie data scraping complete and saved to '/content/drive/My Drive/movie_data.csv'.


### **Fetching Movie URLs**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import files
from google.colab import drive
drive.mount ('/content/drive')

# Base URL for movie listings
url = 'https://www.justwatch.com/in/movies?release_year_from=2000'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all movie links and build full URLs
movie_links = soup.find_all('a', href=True)
movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]
url_list = ['https://www.justwatch.com' + x for x in movie_urls]

movie_url_df = pd.DataFrame({'Movie URL': url_list})

# Save the DataFrame to a CSV file
movie_url_df.to_csv('/content/drive/My Drive/movie_url.csv', index=False)

# Report where the URLs were saved
print("Movie URLs have been scraped and saved to '/content/drive/My Drive/movie_url.csv'.")
print(f"A total of {len(movie_urls)} movie URLs were found.")
# Optionally, print a few sample URLs:
print(f"Sample URLs: {movie_urls[:5]}")  # Prints the first 5 URLs

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Movie URLs have been scraped and saved to '/content/drive/My Drive/movie_url.csv'.
A total of 110 movie URLs were found.
Sample URLs: ['/in/movie/pushpa-the-rule-part-2', '/in/movie/marco-2024', '/in/movie/sookshma-darshini', '/in/movie/ore-dake-level-up-na-ken-reawakening', '/in/movie/bhool-bhulaiyaa-3']
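A listing page can link to the same title more than once (poster and caption, for example), so the collected links may contain repeats. Whether that happens here is an assumption; the sketch below shows one way to resolve and de-duplicate the links while keeping their order. `build_unique_urls` is a hypothetical helper.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.justwatch.com"


def build_unique_urls(hrefs, marker="/movie/", base=BASE_URL):
    """Resolve hrefs containing `marker` against `base`, dropping repeats in order."""
    seen = set()
    urls = []
    for href in hrefs:
        if marker not in href:
            continue  # skip navigation links and other non-title pages
        full = urljoin(base, href)
        if full not in seen:
            seen.add(full)
            urls.append(full)
    return urls


# Sample hrefs; the repeated and '/in/about' entries are made up for illustration.
hrefs = [
    "/in/movie/marco-2024",
    "/in/about",
    "/in/movie/marco-2024",
    "/in/movie/bhool-bhulaiyaa-3",
]
unique_urls = build_unique_urls(hrefs)
print(unique_urls)
```

The same helper could serve the TV listing by passing `marker="/tv-show/"`.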


## **Scraping Release Year**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import drive


# Mount Google Drive
drive.mount('/content/drive')

# Set the URL and headers
url = 'https://www.justwatch.com/in/movies?release_year_from=2000'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Fetch the main page
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract movie URLs
movie_links = soup.find_all('a', href=True)
movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]
url_list = ['https://www.justwatch.com' + x for x in movie_urls]

# Initialize storage for release years
release_years = []

# Scrape each movie page
for movie_url in url_list:
    try:
        response = requests.get(movie_url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract release year
        year_tag = soup.find('span', class_='release-year')
        release_year = year_tag.text.strip() if year_tag else 'N/A'
        release_years.append(release_year)

        # Respect server limits

    except Exception as e:
        print(f"Error fetching {movie_url}: {e}")
        release_years.append('Error')

# Save to DataFrame and CSV
df = pd.DataFrame({'release_year': release_years})
df.to_csv('/content/drive/My Drive/movie_release_years.csv', index=False)

print("Release years have been scraped and saved to 'movie_release_years.csv'.")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Release years have been scraped and saved to 'movie_release_years.csv'.


## **Scraping Genres**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from google.colab import files

# Mount Google Drive (if in Google Colab)
from google.colab import drive
drive.mount('/content/drive')

# URL to scrape
url = 'https://www.justwatch.com/in/movies?release_year_from=2000'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Initial request to fetch the main page
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract movie URLs
movie_links = soup.find_all('a', href=True)
movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]
url_list = ['https://www.justwatch.com' + x for x in movie_urls]

# Initialize genres list
genres = []

# Loop through movie URLs to scrape genres
for movie_url in url_list:
    response = requests.get(movie_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract genres; guard against pages where the container is missing
    div = soup.find('div', class_='title-info')
    genre_list = []
    if div:
        for tag in div.find_all('span'):
            genre = tag.text.strip()
            if genre:
                genre_list.append(genre)
    # Append exactly one value per movie so rows stay aligned
    genres.append(', '.join(genre_list) if genre_list else 'N/A')
df = pd.DataFrame({'Genre': genres})
df.to_csv('/content/drive/My Drive/movie_genres.csv', index=False)

print("Genres have been scraped and saved to 'movie_genres.csv'.")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


AttributeError: 'NoneType' object has no attribute 'find_all'

## **Scraping IMDb Rating**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import drive
from google.colab import files
import time

# Mount Google Drive
drive.mount('/content/drive')

# URL of the JustWatch page to scrape
url = 'https://www.justwatch.com/in/movies?release_year_from=2000'

# Headers to mimic a web browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Get the HTML content of the page
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all movie links
movie_links = soup.find_all('a', href=True)

# Extract movie URLs that contain '/movie/'
movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]

# Add the base URL to each movie URL
url_list = ['https://www.justwatch.com' + x for x in movie_urls]

# List to store IMDb ratings
imdb_ratings = []

# Loop through each movie URL and extract the IMDb rating
for movie_url in url_list:
    response = requests.get(movie_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the element containing the IMDb rating using its class
    rating_tag = soup.find('span', class_='imdb-score')

    # Extract the rating or 'N/A' if not found
    imdb_ratings.append(rating_tag.text.strip().split(' ')[0] if rating_tag else 'N/A')
# Create a Pandas DataFrame to store the ratings
df = pd.DataFrame({'IMDb Rating': imdb_ratings})

# Save the DataFrame to a CSV file on Google Drive
df.to_csv('/content/drive/My Drive/movie_imdb_ratings.csv', index=False)

print("IMDb ratings have been scraped and saved to 'movie_imdb_ratings.csv'.")

## **Scraping Runtime/Duration**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Mount Google Drive (if in Google Colab)
from google.colab import drive
drive.mount('/content/drive')

url = 'https://www.justwatch.com/in/movies?release_year_from=2000'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Fetch the main page and parse
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract movie links
movie_links = soup.find_all('a', href=True)
movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]
url_list = ['https://www.justwatch.com' + x for x in movie_urls]

# Initialize a list to store data
runtime_data = []

# Scrape runtime for each movie
for movie_url in url_list:
    try:
        response = requests.get(movie_url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract runtime (adjust the class as needed)
        runtime_tag = soup.find('div', class_='title-detail-hero-details__item')
        runtime = runtime_tag.text.strip() if runtime_tag else 'N/A'

        runtime_data.append({ 'Runtime': runtime})
        time.sleep(1)  # Add delay to avoid being blocked

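    except Exception as e:
        # Report the failure but still record one row for this movie
        print(f"Error scraping {movie_url}: {e}")
        runtime_data.append({'Runtime': 'Error'})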

# Save the data to CSV
df = pd.DataFrame(runtime_data)
df.to_csv('/content/drive/My Drive/movie_runtime.csv', index=False)

print("Runtimes have been scraped and saved to 'movie_runtime.csv'.")
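The scraped runtime is stored as raw text, which is awkward to sort or average. Assuming the text looks like `2h 15min` or `45min` (the exact format on the page is an assumption), a small parser can convert it to minutes. `runtime_to_minutes` is a hypothetical helper.

```python
import re


def runtime_to_minutes(text):
    """Convert strings such as '2h 15min' or '45min' to minutes; None if unparseable."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", text or "")
    if not match or not any(match.groups()):
        return None  # covers 'N/A', 'Error', empty strings, and unexpected formats
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes


for raw in ["2h 15min", "45min", "1h", "N/A"]:
    print(raw, "->", runtime_to_minutes(raw))
```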


## **Scraping Age Rating**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


# Mount Google Drive (if in Google Colab)
from google.colab import drive
drive.mount('/content/drive')

# Define the base URL and headers
base_url = 'https://www.justwatch.com/in/movies?release_year_from=2000'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Fetch the main page
response = requests.get(base_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract movie URLs
movie_links = soup.find_all('a', href=True)
movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]
url_list = ['https://www.justwatch.com' + x for x in movie_urls]

# Scrape Age Ratings
age_rating_data = []

for movie_url in url_list:
    try:
        response = requests.get(movie_url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract Age Rating; append exactly one value per movie so rows stay aligned
        age_rating = 'N/A'
        for tag in soup.find_all('div', class_='title-info'):
            for div in tag.find_all('div', class_='detail-infos'):
                heading = div.find('h3', class_='detail-infos__subheading')
                if heading and heading.text.strip() == 'Age rating':
                    value_div = div.find('div', class_='detail-infos__value')
                    if value_div:
                        age_rating = value_div.text.strip()
                    break
        age_rating_data.append(age_rating)


    except Exception as e:
        print(f"Error scraping {movie_url}: {e}")
        age_rating_data.append('N/A')

# Create a DataFrame
df = pd.DataFrame({'Age Rating': age_rating_data})

# Save to CSV
df.to_csv('/content/drive/My Drive/movie_age_rating.csv', index=False)

print("Age Ratings have been scraped and saved to 'movie_age_rating.csv'.")


## **Fetching Production Countries Details**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Base URL
url = 'https://www.justwatch.com/in/movies?release_year_from=2000'

# Headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Step 1: Fetch Movie Links
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract movie URLs
movie_links = soup.find_all('a', href=True)
movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]
url_list = ['https://www.justwatch.com' + x for x in movie_urls]

# Step 2: Fetch Production Countries
production_country = []

for movie_url in url_list:
    response = requests.get(movie_url, headers=headers)
    movie_soup = BeautifulSoup(response.text, 'html.parser')

    # Look up the "Production country" detail; append exactly one value per movie
    country = 'N/A'
    title_info = movie_soup.find('div', class_='title-info')  # Main details container
    if title_info:
        for detail_div in title_info.find_all('div', class_='detail-infos'):
            heading = detail_div.find('h3', class_='detail-infos__subheading')
            if heading and heading.text.strip() == 'Production country':
                value_div = detail_div.find('div', class_='detail-infos__value')
                if value_div:
                    country = value_div.text.strip()
                break  # Stop once the matching heading is found
    production_country.append(country)


# Step 3: Save Data to CSV
df = pd.DataFrame({ 'Production Country': production_country})
output_path = '/content/drive/My Drive/movie_production_country.csv'
df.to_csv(output_path, index=False)

print(f"Production countries have been scraped and saved to '{output_path}'.")
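The age rating and production country cells both search the `detail-infos` blocks for a heading and read the adjacent value. That lookup could be factored into one helper, sketched below. `get_detail_value` is a hypothetical name, and the sample HTML is made up to mirror the class names used above, not copied from the site.

```python
from bs4 import BeautifulSoup


def get_detail_value(soup, heading_text, default="N/A"):
    """Return the text of the detail block whose subheading equals `heading_text`."""
    for block in soup.find_all("div", class_="detail-infos"):
        heading = block.find("h3", class_="detail-infos__subheading")
        if heading and heading.get_text(strip=True) == heading_text:
            value = block.find("div", class_="detail-infos__value")
            return value.get_text(strip=True) if value else default
    return default  # heading not present on this page


# Hypothetical markup for illustration only.
html = """
<div class="title-info">
  <div class="detail-infos">
    <h3 class="detail-infos__subheading">Age rating</h3>
    <div class="detail-infos__value">UA</div>
  </div>
  <div class="detail-infos">
    <h3 class="detail-infos__subheading">Production country</h3>
    <div class="detail-infos__value">India</div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
print(get_detail_value(soup, "Age rating"))
print(get_detail_value(soup, "Production country"))
print(get_detail_value(soup, "Runtime"))
```

Because the helper always returns exactly one value, each scraped list gets one entry per title and the columns stay the same length.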


## **Fetching Streaming Service Details**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

from google.colab import files
from google.colab import drive
drive.mount('/content/drive')

url = 'https://www.justwatch.com/in/movies?release_year_from=2000'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

movie_links = soup.find_all('a', href=True)
movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]
url_list = ['https://www.justwatch.com' + x for x in movie_urls]

streaming_service = []

for movie_url in url_list:
    response = requests.get(movie_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Collect the alt text of each offer's image as the platform name
    service_tags = soup.find_all('a', class_='offer')
    platforms = [
        img['alt'].strip()
        for img in (tag.find('img') for tag in service_tags)
        if img and img.get('alt')
    ]

    streaming_service.append(', '.join(platforms) if platforms else 'N/A')

df = pd.DataFrame({'Streaming Service': streaming_service})

# Save the DataFrame to a CSV file
df.to_csv('movie_streaming_service.csv', index=False)
df.to_csv('/content/drive/My Drive/movie_streaming_service.csv.csv', index=False)

print("Streaming Service have been scraped and saved to 'movie_streaming_service.csv'.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Streaming Service have been scraped and saved to 'movie_streaming_service.csv'.


## **Now Creating Movies DataFrame**

Fetching Movie Titles

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import drive
from google.colab import files

drive.mount('/content/drive')

url = 'https://www.justwatch.com/in/movies?release_year_from=2000'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all movie links and build full URLs
movie_links = soup.find_all('a', href=True)
movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]
url_list = ['https://www.justwatch.com' + x for x in movie_urls]

movies_titles = []

for movie_url in url_list:
    response = requests.get(movie_url,headers= headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract movie title
    title_tag = soup.find('h1')
    title = title_tag.text.strip() if title_tag else 'N/A'
    movies_titles.append(title)

df = pd.DataFrame({'Title': movies_titles})
df.to_csv('/content/drive/My Drive/movie_titles.csv', index=False)

print("Movie titles have been scraped and saved to 'movie_titles.csv'.")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Movie titles have been scraped and saved to 'movie_titles.csv'.


In [None]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

# Check if the file exists before attempting to read it
import os
file_path = '/content/drive/My Drive/movie_titles.csv'  # Replace with the actual path if needed
if not os.path.exists(file_path):
    print(f"Error: The file '{file_path}' does not exist. Please make sure it was created in the previous scraping step.")
else:
    # Assuming you've already scraped and saved the data into these CSV files:
    movie_titles_df = pd.read_csv(file_path)
    movie_release_years_df = pd.read_csv('/content/drive/My Drive/movie_release_years.csv')
    movie_genres_df = pd.read_csv('/content/drive/My Drive/movie_genres.csv')
    movie_imdb_ratings_df = pd.read_csv('/content/drive/My Drive/movie_imdb_ratings.csv')
    movie_runtime_df = pd.read_csv('/content/drive/My Drive/movie_runtime.csv')
    movie_age_rating_df = pd.read_csv('/content/drive/My Drive/movie_age_rating.csv')
    movie_production_country_df = pd.read_csv('/content/drive/My Drive/movie_production_country.csv')
    movie_streaming_service_df = pd.read_csv('/content/drive/My Drive/movie_streaming_service.csv.csv')

    # Concatenate (combine) the DataFrames horizontally
    movies_df = pd.concat([ movie_titles_df, movie_release_years_df, movie_genres_df, movie_imdb_ratings_df,
                           movie_runtime_df, movie_age_rating_df, movie_production_country_df,
                           movie_streaming_service_df], axis=1)

    # Print the first few rows to check the result (a bare expression inside a block is not displayed)
    print(movies_df.head())

    # Save the combined DataFrame to a new CSV file
    movies_df.to_csv('/content/drive/My Drive/movies_data.csv', index=False)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
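Concatenating with `axis=1` pairs rows purely by position, so the result is only correct if every per-field CSV has the same number of rows in the same order. A sketch of an alternative, assuming each per-field DataFrame also carried the `Movie URL` column (the CSVs above do not, so this is hypothetical): join on that key instead.

```python
import pandas as pd


def combine_on_key(frames, key="Movie URL"):
    """Outer-join DataFrames on `key` so each row describes exactly one title."""
    combined = frames[0]
    for frame in frames[1:]:
        combined = combined.merge(frame, on=key, how="outer")
    return combined


# Hypothetical per-field tables; the ratings table is missing one title.
titles = pd.DataFrame({"Movie URL": ["u1", "u2"], "Title": ["A", "B"]})
ratings = pd.DataFrame({"Movie URL": ["u2"], "IMDb Rating": ["7.5"]})
movies = combine_on_key([titles, ratings])
print(movies)
```

With a key-based join, a title missing from one table gets a `NaN` in that column instead of shifting every later row.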


## **Scraping TV Show Data**

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from google.colab import drive
from google.colab import files

drive.mount('/content/drive')

tv_url = 'https://www.justwatch.com/in/tv-shows?release_year_from=2000'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

page = requests.get(tv_url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

tv_links = soup.find_all('a',href=True)
tv_show_urls = [link['href'] for link in tv_links if '/tv-show/' in link['href']]
full_tv_show_urls = ['https://www.justwatch.com' + url for url in tv_show_urls]

tv_show_data = []
for tv_show_url in full_tv_show_urls:
    response = requests.get(tv_show_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract data
    title = soup.find('h1', class_="title-detail-hero__details__title").text.strip() if soup.find('h1', class_="title-detail-hero__details__title") else 'N/A'
    release_year = soup.find('span', class_='release-year').text.strip() if soup.find('span', class_='release-year') else 'N/A'
    imdb_rating = soup.find('span', class_='imdb-score').text.strip().split(' ')[0] if soup.find('span', class_='imdb-score') else 'N/A'

    # Extract streaming services
    streaming_services = []
    service_tags = soup.find_all('a', class_='offer')
    for tag in service_tags:
        img = tag.find('img')
        # Guard against offers without an <img> tag or without alt text
        service = img['alt'].strip() if img and 'alt' in img.attrs else 'N/A'
        streaming_services.append(service)

    # Initialize genre to 'N/A' before the try-except block
    genre = 'N/A'

    # Extract genre
    try:
        div = soup.find('div', class_='title-info')
        if div:
            detail_div = div.find_all('div', class_='detail-infos')[1]  # Adjust class as needed
            if detail_div:
                span_tags = detail_div.find_all('span')  # Find all <span> tags if genres are stored there
                genre_list = []  # Initialize an empty list for genres
                for tag in span_tags:
                    genre_text = tag.text.strip()
                    if genre_text:
                        genre_list.append(genre_text)
                # Join the genres into a comma-separated string; keep 'N/A' if none were found
                if genre_list:
                    genre = ', '.join(genre_list)
    except IndexError:
        pass  # Handle the IndexError if it occurs
    except Exception as e:
        print(f"Error extracting genre from {tv_show_url}: {e}")  # Print the error for debugging


    # Append all data for the current TV show as one dictionary
    tv_show_data.append({
        'Title': title,
        'Release Year': release_year,
        'IMDb Rating': imdb_rating,
        'Genre': genre,  # Use the extracted genre value
        'Streaming Services': ', '.join(streaming_services)
    })

df = pd.DataFrame(tv_show_data)

output_path = '/content/drive/My Drive/tv_shows_data.csv'
df.to_csv(output_path, index=False)

print(f"TV show data has been scraped and saved to '{output_path}'.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
TV show data has been scraped and saved to '/content/drive/My Drive/tv_shows_data.csv'.


## **Fetching TV Show URL Details**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import drive
from google.colab import files

drive.mount('/content/drive')

# URL for TV shows listing
tv_url = 'https://www.justwatch.com/in/tv-shows?release_year_from=2000'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

page = requests.get(tv_url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

tv_links = soup.find_all('a',href=True)
tv_show_urls = [link['href'] for link in tv_links if '/tv-show/' in link['href']]
full_tv_show_urls = ['https://www.justwatch.com' + url for url in tv_show_urls]

tv_show_Url_details_df = pd.DataFrame(full_tv_show_urls, columns=['TV Show URLs'])
output_path = '/content/drive/My Drive/tv_show_urls.csv'
tv_show_Url_details_df.to_csv(output_path, index=False)

print(f"TV show URLs have been scraped and saved to '{output_path}'.")
print(tv_show_Url_details_df.head())

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
TV show URLs have been scraped and saved to '/content/drive/My Drive/tv_show_urls.csv'.
                                        TV Show URLs
0    https://www.justwatch.com/in/tv-show/squid-game
1    https://www.justwatch.com/in/tv-show/paatal-lok
2  https://www.justwatch.com/in/tv-show/solo-leve...
3  https://www.justwatch.com/in/tv-show/thukra-ke...
4  https://www.justwatch.com/in/tv-show/the-day-o...


## **Fetching TV Show Title Details**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import drive
from google.colab import files

drive.mount('/content/drive')

tv_url = 'https://www.justwatch.com/in/tv-shows?release_year_from=2000'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

page = requests.get(tv_url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

tv_links = soup.find_all('a',href=True)
tv_show_urls = [link['href'] for link in tv_links if '/tv-show/' in link['href']]
full_tv_show_urls = ['https://www.justwatch.com' + url for url in tv_show_urls]


# List to store extracted data
tv_show_details = []
for tv_show_url in full_tv_show_urls:
    response = requests.get(tv_show_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')



    title_tag = soup.find('h1')

    title = title_tag.text.strip() if title_tag else 'N/A'
    tv_show_details.append(title)

df = pd.DataFrame({'Title': tv_show_details})
output_path = '/content/drive/My Drive/tv_show_titles.csv'
df.to_csv(output_path, index=False)

print(f"TV show titles have been scraped and saved to '{output_path}'.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
TV show titles have been scraped and saved to '/content/drive/My Drive/tv_show_titles.csv'.


## **Fetching Release Year**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import drive
from google.colab import files

drive.mount('/content/drive')

tv_url = 'https://www.justwatch.com/in/tv-shows?release_year_from=2000'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get(tv_url,headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

tv_show_links = soup.find_all('a', href=True)
tv_show_urls = [link['href'] for link in tv_show_links if '/tv-show/' in link['href']]
full_tv_show_urls = ['https://www.justwatch.com' + url for url in tv_show_urls]


# List to store release years
release_years = []

# Extract release year for each TV show
for tv_show_url in full_tv_show_urls:
    try:
        response = requests.get(tv_show_url, headers=headers)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract release year
        year_tag = soup.find('span', class_='release-year')  # Adjust class if needed
        release_year = year_tag.text.strip() if year_tag else 'N/A'

        # Append to the list
        release_years.append(release_year)


    except Exception as e:
        print(f"Error processing {tv_show_url}: {e}")
        release_years.append('N/A')

# Create a DataFrame from the extracted data
df = pd.DataFrame({'Release Year': release_years})


output_path = '/content/drive/My Drive/tv_show_release_years.csv'
df.to_csv(output_path, index=False)

print(f"Release years have been scraped and saved to '{output_path}'.")



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Release years have been scraped and saved to '/content/drive/My Drive/tv_show_release_years.csv'.


## **Fetching TV Show Genre Details**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Base URL for TV show listings
tv_url = 'https://www.justwatch.com/in/tv-shows?release_year_from=2000'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Fetch the main page and stop with a clear error if it cannot be retrieved
# (raise_for_status is used instead of exit(), which would shut down the notebook kernel)
response = requests.get(tv_url, headers=headers)
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')

# Find all TV show links and build full URLs
tv_links = soup.find_all('a', href=True)
tv_urls = [link['href'] for link in tv_links if '/tv-show/' in link['href']]
tv_url_list = ['https://www.justwatch.com' + x for x in tv_urls]

# Initialize list to store genres
tv_genres = []

# Scrape each TV show's genres
for tv_url in tv_url_list:
    try:
        response = requests.get(tv_url, headers=headers)
        if response.status_code != 200:
            print(f"Failed to fetch page: {tv_url}")
            tv_genres.append('N/A')
            continue

        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the div containing genres
        div = soup.find('div', class_='title-info')  # Update the class if necessary based on actual site HTML

        # Extract genres
        genre_list = []
        if div:
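            # Genres are expected in the second 'detail-infos' block; guard against pages with fewer blocks
            detail_divs = div.find_all('div', class_='detail-infos')  # Adjust class as needed
            detail_div = detail_divs[1] if len(detail_divs) > 1 else None
            if detail_div: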
                span_tags = detail_div.find_all('span')  # Find all <span> tags if genres are stored there
                for tag in span_tags:
                    genre = tag.text.strip()
                    if genre:
                        genre_list.append(genre)

        # Handle case where no genres are found
        if not genre_list:
            genre_list.append('N/A')

        # Append the genres as a comma-separated string
        tv_genres.append(', '.join(genre_list))

    except Exception as e:
        print(f"An error occurred while processing {tv_url}: {e}")
        tv_genres.append('N/A')

# Create DataFrame and save to CSV
df_tv_genres = pd.DataFrame({'Genre': tv_genres})
csv_path = '/content/drive/My Drive/tv_show_genres.csv'
df_tv_genres.to_csv(csv_path, index=False)

print(f"TV show genres have been scraped and saved to '{csv_path}'.")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
TV show genres have been scraped and saved to '/content/drive/My Drive/tv_show_genres.csv'.


## **Fetching IMDB Rating Details**

In [None]:
# Write Your Code here
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
from google.colab import files

tv_url = 'https://www.justwatch.com/in/tv-shows?release_year_from=2000'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

page = requests.get(tv_url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

# Find all TV show links and build full URLs
tv_links = soup.find_all('a',href=True)
tv_show_urls = [link['href'] for link in tv_links if '/tv-show/' in link['href']]
full_tv_show_urls = ['https://www.justwatch.com' + url for url in tv_show_urls]


# List to store extracted data
tv_imdb_ratings = []
for tv_show_url in full_tv_show_urls:
    try:
        response = requests.get(tv_show_url, headers=headers)
        soup = BeautifulSoup(response.text,'html.parser')
        # Find the IMDB rating
        rating_tag = soup.find('span', class_='imdb-score')
        imdb_rating = rating_tag.text.strip().split(' ')[0] if rating_tag else 'N/A'

        tv_imdb_ratings.append(imdb_rating)

    except Exception as e:
        print(f"Error processing {tv_show_url}: {e}")
        tv_imdb_ratings.append('N/A')



# Create a DataFrame from the extracted data
df = pd.DataFrame({'IMDB Rating': tv_imdb_ratings})

output_path = '/content/drive/My Drive/tv_show_imdb_ratings.csv'
df.to_csv(output_path, index=False)

print(f"IMDB ratings have been scraped and saved to '{output_path}'.")



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
IMDB ratings have been scraped and saved to '/content/drive/My Drive/tv_show_imdb_ratings.csv'.


## **Fetching Age Rating Details**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

tv_url = 'https://www.justwatch.com/in/tv-shows?release_year_from=2000'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Fetch the main page
response = requests.get(tv_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

tv_links = soup.find_all('a', href=True)
tv_urls = [link['href'] for link in tv_links if '/tv-show/' in link['href']]
tv_url_list = ['https://www.justwatch.com' + x for x in tv_urls]

tv_show_age_ratings = []

for tv_url in tv_url_list:
    response = requests.get(tv_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    age_rating = 'N/A'  # Default value if not found
    age_rating_tag = soup.find('div', class_='title-info')

    if age_rating_tag:
        divs = age_rating_tag.find_all('div', class_='detail-infos')
        for div in divs:
            heading = div.find('h3', class_='detail-infos__subheading')
            if heading and heading.text.strip() == 'Age rating':
                age_rating_div = div.find('div', class_='detail-infos__value')
                if age_rating_div:
                    age_rating = age_rating_div.text.strip()
                break  # Exit the loop once found

    tv_show_age_ratings.append(age_rating)

df = pd.DataFrame({'Age Rating': tv_show_age_ratings})
output_path = '/content/drive/My Drive/tv_show_age_rating.csv'
df.to_csv(output_path, index=False)

print(f"Age ratings have been scraped and saved to '{output_path}'.")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Age ratings have been scraped and saved to '/content/drive/My Drive/tv_show_age_rating.csv'.


## **Fetching Production Country Details**

In [None]:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
from google.colab import files

# Base URL for TV show listings (if not already defined)
tv_url = 'https://www.justwatch.com/in/tv-shows?release_year_from=2000'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get(tv_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

tv_links = soup.find_all('a', href=True)
tv_urls = [link['href'] for link in tv_links if '/tv-show/' in link['href']]
tv_url_list = ['https://www.justwatch.com' + x for x in tv_urls]

tv_production_country_data = []

for tv_url in tv_url_list:
    response = requests.get(tv_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Default value, so every show contributes exactly one row even when
    # the 'Production country' heading is missing from the page
    production_country = 'N/A'
    title_info = soup.find('div', class_='title-info')
    if title_info:
        detail_divs = title_info.find_all('div', class_='detail-infos')
        for detail_div in detail_divs:
            heading = detail_div.find('h3', class_='detail-infos__subheading')
            if heading and heading.text.strip() == 'Production country':
                production_country_div = detail_div.find('div', class_='detail-infos__value')
                if production_country_div:
                    production_country = production_country_div.text.strip()
                break  # Exit the inner loop once found

    tv_production_country_data.append(production_country)

# Create a DataFrame and save to CSV
df = pd.DataFrame({'Production Country': tv_production_country_data})
df.to_csv('/content/drive/My Drive/tv_show_production_country.csv', index=False)

print("Production countries for TV shows have been scraped and saved to 'tv_show_production_country.csv'.")







Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Production countries for TV shows have been scraped and saved to 'tv_show_production_country.csv'.
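
The age rating and production country cells follow the same pattern: locate the `detail-infos` block whose subheading matches a label, then read its value. As a rough sketch, that lookup could be factored into one helper. The function name `get_detail_value` and the HTML fragment are hypothetical; the fragment only mirrors the class names used in the cells above, and the live page markup may differ.

```python
from bs4 import BeautifulSoup


def get_detail_value(soup, label, default='N/A'):
    """Return the value of the 'detail-infos' block whose subheading equals label."""
    for block in soup.find_all('div', class_='detail-infos'):
        heading = block.find('h3', class_='detail-infos__subheading')
        if heading and heading.get_text(strip=True) == label:
            value = block.find('div', class_='detail-infos__value')
            return value.get_text(strip=True) if value else default
    return default


# Hypothetical fragment for illustration, not copied from JustWatch
sample_html = """
<div class="title-info">
  <div class="detail-infos">
    <h3 class="detail-infos__subheading">Age rating</h3>
    <div class="detail-infos__value">UA</div>
  </div>
  <div class="detail-infos">
    <h3 class="detail-infos__subheading">Production country</h3>
    <div class="detail-infos__value"> India </div>
  </div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
print(get_detail_value(sample_soup, 'Age rating'))
print(get_detail_value(sample_soup, 'Production country'))
print(get_detail_value(sample_soup, 'Runtime'))
```

Because the helper always returns either a value or the default, each show yields exactly one entry per field, which keeps the per-field lists the same length.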


## **Fetching Streaming Service Details**

In [None]:
# Write Your Code here
import requests
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
from bs4 import BeautifulSoup
from google.colab import files

# Base URL for TV show listings (if not already defined)
tv_url = 'https://www.justwatch.com/in/tv-shows?release_year_from=2000'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get(tv_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

tv_links = soup.find_all('a', href=True)
tv_urls = [link['href'] for link in tv_links if '/tv-show/' in link['href']]
tv_url_list = ['https://www.justwatch.com' + x for x in tv_urls]

streaming_services = []  # One comma-separated string of platforms per TV show

for tv_url in tv_url_list:
    response = requests.get(tv_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all offer tags (service_tags) for the current TV show
    service_tags = soup.find_all('a', class_='offer')

    # The platform name is read from the logo's alt text; skip offers without one
    platforms = []
    for tag in service_tags:
        img = tag.find('img')
        if img and img.get('alt'):
            platforms.append(img['alt'].strip())

    # Join the platforms into one string, or record 'N/A' if none were found
    streaming_services.append(', '.join(platforms) if platforms else 'N/A')

df = pd.DataFrame({'Streaming Service': streaming_services})
df.to_csv('/content/drive/My Drive/tv_show_streaming_service.csv', index=False)

print("Streaming services have been scraped and saved to 'tv_show_streaming_service.csv'.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Streaming services have been scraped and saved to 'tv_show_streaming_service.csv'.


## **Fetching Duration Details**

In [None]:
# Write Your Code here
import requests
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
from google.colab import files


tv_url = 'https://www.justwatch.com/in/tv-shows?release_year_from=2000'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

page = requests.get(tv_url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

# Find all TV show links and build full URLs
tv_links = soup.find_all('a',href=True)
tv_show_urls = [link['href'] for link in tv_links if '/tv-show/' in link['href']]
full_tv_show_urls = ['https://www.justwatch.com' + url for url in tv_show_urls]


# Initialize list to store TV show durations
tv_show_duration = []

# Loop through the individual TV show URLs
for tv_show_url in full_tv_show_urls:
    try:
        response = requests.get(tv_show_url, headers=headers)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')

            # Find the element containing duration information
            runtime_tag = soup.find('div', class_='title-detail-hero-details__item')

            # Extract duration, handling cases where it's not found
            runtime = runtime_tag.text.strip().split()[0] if runtime_tag and runtime_tag.text.strip() else 'N/A'
            tv_show_duration.append({'Runtime': runtime})
        else:
            print(f"Failed to fetch {tv_show_url}: HTTP {response.status_code}")
            tv_show_duration.append({'Runtime': 'Error'})

    except Exception as e:
        print(f"Error scraping {tv_show_url}: {e}")
        tv_show_duration.append({'Runtime': 'Error'})


# Create a DataFrame and save to CSV
df = pd.DataFrame(tv_show_duration)
df.to_csv('/content/drive/My Drive/tv_show_duration.csv', index=False)

print("TV show durations have been scraped and saved to 'tv_show_duration.csv'.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
TV show durations have been scraped and saved to 'tv_show_duration.csv'.


## **Creating TV Show DataFrame**

In [None]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

tv_show_titles_df = pd.read_csv('/content/drive/My Drive/tv_show_titles.csv')
tv_show_release_years_df = pd.read_csv('/content/drive/My Drive/tv_show_release_years.csv')
tv_show_genres_df = pd.read_csv('/content/drive/My Drive/tv_show_genres.csv')
tv_show_imdb_ratings_df = pd.read_csv('/content/drive/My Drive/tv_show_imdb_ratings.csv')
tv_show_age_rating_df = pd.read_csv('/content/drive/My Drive/tv_show_age_rating.csv')
tv_show_production_country_df = pd.read_csv('/content/drive/My Drive/tv_show_production_country.csv')
tv_show_streaming_service_df = pd.read_csv('/content/drive/My Drive/tv_show_streaming_service.csv')
tv_show_duration_df = pd.read_csv('/content/drive/My Drive/tv_show_duration.csv')

tv_shows_df = pd.concat([tv_show_titles_df, tv_show_release_years_df, tv_show_genres_df,
                         tv_show_imdb_ratings_df, tv_show_age_rating_df, tv_show_production_country_df,
                         tv_show_streaming_service_df, tv_show_duration_df], axis=1)
tv_shows_df.to_csv('/content/drive/My Drive/tv_shows_data.csv', index=False)

# Kept as the last expression in the cell so the notebook displays the preview
tv_shows_df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
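
Concatenating the per-field CSVs with `axis=1` pairs rows purely by position, so the result only lines up if every cell scraped the same URLs in the same order and appended exactly one value per show. One possible alternative, sketched below, is to store a key with each field and merge on it. The `URL` column and the sample rows are assumptions made for illustration; the CSVs written above do not contain a URL column.

```python
from functools import reduce

import pandas as pd

# Hypothetical per-field tables, each keyed by the show URL
titles = pd.DataFrame({
    'URL': ['/in/tv-show/a', '/in/tv-show/b', '/in/tv-show/c'],
    'Title': ['Show A', 'Show B', 'Show C'],
})
years = pd.DataFrame({
    'URL': ['/in/tv-show/c', '/in/tv-show/a', '/in/tv-show/b'],  # different order
    'Release Year': [2021, 2019, 2023],
})
genres = pd.DataFrame({
    'URL': ['/in/tv-show/a', '/in/tv-show/c'],  # one show missing
    'Genre': ['Drama', 'Comedy'],
})

# Left-merge every table onto the titles so each show keeps exactly one row
combined = reduce(
    lambda left, right: left.merge(right, on='URL', how='left'),
    [titles, years, genres],
)
print(combined)
```

With a key, a reordered or missing row shows up as a `NaN` in the affected column instead of silently shifting values onto the wrong show.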


## **Task 2 :- Data Filtering & Analysis**

In [None]:
import pandas as pd
from datetime import datetime

movies_df = pd.read_csv('/content/drive/My Drive/movies_data.csv')
tv_shows_df = pd.read_csv('/content/drive/My Drive/tv_shows_data.csv')

# The TV ratings cell writes the header 'IMDB Rating'; align it with the movie column name
tv_shows_df = tv_shows_df.rename(columns={'IMDB Rating': 'IMDb Rating'})

# Earliest release year that still falls within the last two years
cutoff_year = datetime.now().year - 2

# Compare release years as numbers. pd.to_datetime is avoided here because it would
# read a bare integer such as 2023 as nanoseconds since 1970, not as a calendar year.
movies_df['release_year'] = pd.to_numeric(
    movies_df['release_year'].astype(str).str.extract(r'(\d{4})')[0], errors='coerce')
tv_shows_df['Release Year'] = pd.to_numeric(
    tv_shows_df['Release Year'].astype(str).str.extract(r'(\d{4})')[0], errors='coerce')

# Convert 'IMDb Rating' to numeric before filtering ('N/A' becomes NaN)
movies_df['IMDb Rating'] = pd.to_numeric(movies_df['IMDb Rating'], errors='coerce')
tv_shows_df['IMDb Rating'] = pd.to_numeric(tv_shows_df['IMDb Rating'], errors='coerce')

# Keep titles from the last two years with an IMDb rating of 7 or higher
filtered_movies = movies_df[(movies_df['release_year'] >= cutoff_year) & (movies_df['IMDb Rating'] >= 7)]
filtered_tv_shows = tv_shows_df[(tv_shows_df['Release Year'] >= cutoff_year) & (tv_shows_df['IMDb Rating'] >= 7)]

avg_movie_rating = filtered_movies['IMDb Rating'].mean()
avg_tv_show_rating = filtered_tv_shows['IMDb Rating'].mean()

all_genres = pd.concat([filtered_movies['Genre'], filtered_tv_shows['Genre']]).dropna()
top_genres = all_genres.value_counts().head(5)

# Check if the column 'Streaming Service' exists in both DataFrames
# Print a warning message and continue if not found
if 'Streaming Service' not in filtered_movies.columns or 'Streaming Service' not in filtered_tv_shows.columns:
    print("Warning: 'Streaming Service' column not found in one or both DataFrames. Skipping streaming service analysis.")
else:
    all_streaming_services = pd.concat([filtered_movies['Streaming Service'], filtered_tv_shows['Streaming Service']]).dropna()
    top_streaming_service = all_streaming_services.value_counts().idxmax()

# Save only the filtered rows (not the full data with the filtered rows appended to it)
filtered_movies.to_csv('/content/drive/My Drive/filtered_movies.csv', index=False)
filtered_tv_shows.to_csv('/content/drive/My Drive/filtered_tv_shows.csv', index=False)

print("Filtered movies and TV shows have been saved successfully.")

Filtered movies and TV shows have been saved successfully.




## **Analyzing Top Genres**

In [None]:
# Write Your Code here
import pandas as pd
from google.colab import drive
from collections import Counter

# Mount Google Drive
drive.mount('/content/drive')

# Load data
movies_df = pd.read_csv('/content/drive/My Drive/movies_data.csv')
tv_shows_df = pd.read_csv('/content/drive/My Drive/tv_shows_data.csv')

# Function to analyze top genres for a DataFrame
def analyze_top_genres(df, top_n=5):
    genre_counts = Counter()
    for genre_string in df['Genre']:
        if isinstance(genre_string, str):
            genres = genre_string.split(', ')
            genre_counts.update(genres)
    return genre_counts.most_common(top_n)

# Analyze top movie genres
top_movie_genres = analyze_top_genres(movies_df)
print("Top Movie Genres:")
for genre, count in top_movie_genres:
    print(f"- {genre}: {count}")

# Analyze top TV show genres
top_tv_show_genres = analyze_top_genres(tv_shows_df)
print("\nTop TV Show Genres:")
for genre, count in top_tv_show_genres:
    print(f"- {genre}: {count}")

# Create a DataFrame for top genres
top_genres_df = pd.DataFrame({
    'Movie Genres': [genre for genre, count in top_movie_genres],
    'Movie Counts': [count for genre, count in top_movie_genres],
    'TV Show Genres': [genre for genre, count in top_tv_show_genres],
    'TV Show Counts': [count for genre, count in top_tv_show_genres]
})

# Save to CSV
top_genres_df.to_csv('/content/drive/My Drive/analyz_top_genre.csv', index=False)
print("\nTop genres have been analyzed and saved to 'analyz_top_genre.csv'.")

## **Finding Predominant Streaming Service**

In [None]:
import pandas as pd
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Load movie and TV show data
movies_df = pd.read_csv('/content/drive/My Drive/movies_data.csv')
tv_shows_df = pd.read_csv('/content/drive/My Drive/tv_shows_data.csv')

# Combine both datasets
all_content_df = pd.concat([movies_df, tv_shows_df], ignore_index=True)

# Ensure 'Streaming Service' column is properly split and exploded
all_content_df['Streaming Service'] = all_content_df['Streaming Service'].str.split(', ')
all_content_df = all_content_df.explode('Streaming Service')

# Group by streaming service and count occurrences
service_counts = all_content_df.groupby('Streaming Service')['Title'].count().reset_index()

# Rename columns for clarity
service_counts.rename(columns={'Streaming Service': 'Service', 'Title': 'Count'}, inplace=True)

# Sort by count in descending order
service_counts = service_counts.sort_values('Count', ascending=False)

# Find the predominant streaming service
predominant_service = service_counts.iloc[0]['Service']

predominant_service_df = pd.DataFrame({'Predominant Streaming Service': [predominant_service]})

predominant_service_df.to_csv('/content/drive/My Drive/predominant_streaming_service.csv', index=False)
print(f"The predominant streaming service is: {predominant_service}")


## **Task 3 :- Data Export**

In [None]:
import pandas as pd
from datetime import datetime, timedelta

# Load the datasets
movie_df = pd.read_csv('/content/drive/My Drive/movies_data.csv')
tv_shows_df = pd.read_csv('/content/drive/My Drive/tv_shows_data.csv')

# Define the two-year cutoff
two_years_ago = datetime.now().year - 2

# Helper function to safely convert series to numeric
def safe_numeric(series):
    return pd.to_numeric(series, errors='coerce')

# Ensure 'release_year' and 'Release Year' are strings
movie_df['release_year'] = movie_df['release_year'].astype(str)
tv_shows_df['Release Year'] = tv_shows_df['Release Year'].astype(str)

# Extract the year using regex and convert to numeric
movie_df['release_year'] = safe_numeric(movie_df['release_year'].str.extract(r'(\d{4})')[0])
tv_shows_df['Release Year'] = safe_numeric(tv_shows_df['Release Year'].str.extract(r'(\d{4})')[0])

# The TV ratings cell writes the header 'IMDB Rating'; align it with the movie column name
tv_shows_df = tv_shows_df.rename(columns={'IMDB Rating': 'IMDb Rating'})

# Filter movies released in the last two years with IMDb Rating >= 7
filtered_movies = movie_df[
    (movie_df['release_year'] >= two_years_ago) &
    (safe_numeric(movie_df['IMDb Rating']) >= 7)
].dropna(subset=['release_year', 'IMDb Rating'])  # Drop rows with NaN in critical columns

# Filter TV shows released in the last two years with IMDb Rating >= 7
filtered_tv_shows = tv_shows_df[
    (tv_shows_df['Release Year'] >= two_years_ago) &
    (safe_numeric(tv_shows_df['IMDb Rating']) >= 7)
].dropna(subset=['Release Year', 'IMDb Rating'])


# Save filtered data to CSV files
filtered_movies.to_csv('/content/drive/My Drive/filtered_movies.csv', index=False)
filtered_tv_shows.to_csv('/content/drive/My Drive/filtered_tv_shows.csv', index=False)

print("Filtered movies and TV shows have been saved successfully.")


Filtered movies and TV shows have been saved successfully.
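
The export cell compares release years as plain numbers against a cutoff year. The toy example below, built on made-up rows rather than scraped data, illustrates that comparison and shows why a bare integer year should not be passed to `pd.to_datetime`. The column names match the TV show DataFrame; everything else is an assumption for illustration.

```python
from datetime import datetime

import pandas as pd

this_year = datetime.now().year

# Made-up rows for illustration only
sample = pd.DataFrame({
    'Title': ['Old Show', 'Recent Hit', 'Recent Miss', 'Unknown'],
    'Release Year': [2005, this_year, this_year - 1, None],
    'IMDb Rating': ['8.1', '7.4', '6.2', 'N/A'],
})

cutoff_year = this_year - 2
years = pd.to_numeric(sample['Release Year'], errors='coerce')
ratings = pd.to_numeric(sample['IMDb Rating'], errors='coerce')  # 'N/A' becomes NaN

# NaN compares as False, so rows with a missing year or rating drop out
recent_and_good = sample[(years >= cutoff_year) & (ratings >= 7)]
print(recent_and_good['Title'].tolist())

# Pitfall: an integer is interpreted as nanoseconds since the epoch
naive = pd.to_datetime(pd.Series([2023]))
print(naive.dt.year.iloc[0])
```

The second print shows 1970 rather than 2023, which is why the year is kept numeric instead of being converted to a datetime.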


# **Dataset Drive Link (View Access with Anyone) -**

# ***Congratulations!!! You have completed your Assignment.***