# **Web Scraping & Data Handling Challenge**



### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
   - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies,
     or a specific genre, etc.) to ensure a diverse dataset.

```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```
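
The two filters above can be sketched in Pandas as follows. This is a minimal illustration on made-up rows, not the graded solution: the column names `Release Year` and `IMDB Rating` match the DataFrame built later in this notebook, while the helper name `filter_recent_high_rated` and the sample data are assumptions introduced here.

```python
from datetime import date

import pandas as pd


def filter_recent_high_rated(df, years=2, min_rating=7.0, today=None):
    """Keep rows released in the last `years` years with rating >= min_rating."""
    today = today or date.today()
    cutoff_year = today.year - years
    # Comparisons against NaN are False, so rows with missing values drop out
    mask = (df["Release Year"] >= cutoff_year) & (df["IMDB Rating"] >= min_rating)
    return df[mask].reset_index(drop=True)


# Made-up sample rows, for illustration only
sample_df = pd.DataFrame({
    "Title": ["a", "b", "c", "d"],
    "Release Year": [2024, 2018, 2023, 2024],
    "IMDB Rating": [7.7, 8.5, 6.1, None],
})

# A fixed "today" keeps the example reproducible
filtered = filter_recent_high_rated(sample_df, today=date(2024, 11, 15))
print(filtered)
```

Because only the release year is scraped, "the last 2 years" is approximated at year granularity here.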

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the most significant number of offerings.
      
   ```   
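
The three analysis steps above might look like this in Pandas. The sketch assumes the same column layout as the DataFrame built later in this notebook (a comma-separated `Genre` string and a list of services per title); the sample rows are made up.

```python
import pandas as pd

# Made-up sample rows; only the column layout mirrors this notebook
sample_df = pd.DataFrame({
    "Title": ["a", "b", "c"],
    "Genre": ["Comedy, Horror", "Drama, Comedy", "Drama"],
    "IMDB Rating": [7.0, 8.0, 6.0],
    "Streaming Service": [["netflix", "primevideo"], ["netflix"], None],
})

# Average rating (missing values are skipped by default)
average_rating = sample_df["IMDB Rating"].mean()

# Split the comma-separated genre string into one row per genre, then count
genre_counts = sample_df["Genre"].dropna().str.split(", ").explode().value_counts()
top_genres = genre_counts.head(5)

# Each cell holds a list of services (or None), so explode before counting
service_counts = sample_df["Streaming Service"].dropna().explode().value_counts()
top_service = service_counts.idxmax()

print(average_rating, top_genres.to_dict(), top_service)
```

`explode` is what lets a title with several genres or services count once toward each of them.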

**3. Data Export:**

```
   - Dump the filtered and analysed data into a CSV file for further processing and reporting.

   - Keep the CSV file in your Drive folder and share the Drive link in the Colab notebook, with view access enabled for anyone with the link.
```
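
A minimal sketch of the export step, assuming the list-valued `Streaming Service` column used later in this notebook. Lists are joined into plain strings first so the CSV stays readable. The example writes to an in-memory buffer; in practice a file path would be passed to `to_csv` instead (any file name is your choice).

```python
import io

import pandas as pd

# Made-up sample row, for illustration only
sample_df = pd.DataFrame({
    "Title": ["a"],
    "IMDB Rating": [7.7],
    "Streaming Service": [["netflix", "primevideo"]],
})

export_df = sample_df.copy()
# Turn each list of services into a comma-separated string (empty if missing)
export_df["Streaming Service"] = export_df["Streaming Service"].apply(
    lambda services: ", ".join(services) if isinstance(services, list) else ""
)

buffer = io.StringIO()
export_df.to_csv(buffer, index=False)  # index=False leaves out the row numbers
csv_text = buffer.getvalue()
print(csv_text)
```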

**Submission:**
```
- Submit a link to your Colab made for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your code shouldn't have any errors and should run end to end in one go.

- Place the Drive link to your dataset in the notebook, just before the Conclusion section.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.
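
One way to meet the error-handling requirement in note 1 is a small retry wrapper with increasing delays. The sketch below is an assumption about how this could be structured, not part of the assignment text. `call_with_retries` is a hypothetical helper, and the `sleep` parameter exists only so the example can run without actually waiting.

```python
import time


def call_with_retries(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_retries attempts have failed.

    Returns func()'s result, or None if every attempt raised an exception.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:  # broad on purpose for this sketch
            print(f"Attempt {attempt + 1} failed: {exc}")
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return None


# Stand-in for a network call: fails twice, then succeeds
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = call_with_retries(flaky, sleep=lambda seconds: None)
print(result, attempts["count"])
```

In a scraper, `func` would wrap the page request, so the retry policy lives in one place instead of being repeated in every loop.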








# **Start The Project**

## **Task 1: Web Scraping**

In [None]:
#Installing all necessary libraries
!pip install bs4
!pip install requests

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


In [None]:
#Import all necessary libraries
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

## **Scraping Movie Data**

In [None]:
def fetch_movie_urls(url):
    """Fetch a page and return it parsed as a BeautifulSoup object.

    On a non-200 response, a tuple (message, status_code) is returned instead,
    so callers should check the result with isinstance(..., BeautifulSoup).
    """
    headers = {                                         #browser-like User-Agent so the site does not reject the request
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    response = requests.get(url, headers=headers)       #request the page
    if response.status_code != 200:                     #report a failed response to the caller
        return "Failed to retrieve the page, status code:", response.status_code
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup


url = 'https://www.justwatch.com/in/movies?release_year_from=2000'
soup=fetch_movie_urls(url)                              #calling the function for soup making
print(soup.prettify())                                  #prettifying the soup

<!DOCTYPE html>
<html data-vue-meta="%7B%22dir%22:%7B%22ssr%22:%22ltr%22%7D,%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-vue-meta-server-rendered="" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8" data-vue-meta="ssr"/>
  <meta content="IE=edge" data-vue-meta="ssr" httpequiv="X-UA-Compatible"/>
  <meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" data-vue-meta="ssr" name="viewport"/>
  <meta content="JustWatch" data-vue-meta="ssr" property="og:site_name"/>
  <meta content="794243977319785" data-vue-meta="ssr" property="fb:app_id"/>
  <meta content="/appassets/img/JustWatch_logo_with_claim.png" data-vmid="og:image" data-vue-meta="ssr" property="og:image"/>
  <meta content="606" data-vmid="og:image:width" data-vue-meta="ssr" property="og:image:width"/>
  <meta content="302" data-vmid="og:image:height" data-vue-meta="ssr" pro

## **Fetching Movie URLs**

In [None]:
movie_links = soup.find_all('a', href=True)                                       #finding all link in href using a tag
movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]  #filtering the links in which /movie/ exists

url_list=[]
for x in movie_urls:                                                              #creating a list of url's
  url_list.append('https://www.justwatch.com'+x)                                  #appending all urls in movie_urls in the empty list
url_list

['https://www.justwatch.com/in/movie/bhool-bhulaiyaa-3',
 'https://www.justwatch.com/in/movie/stree-2',
 'https://www.justwatch.com/in/movie/deadpool-3',
 'https://www.justwatch.com/in/movie/the-substance',
 'https://www.justwatch.com/in/movie/ntr-30',
 'https://www.justwatch.com/in/movie/meiyazhagan',
 'https://www.justwatch.com/in/movie/vettaiyan',
 'https://www.justwatch.com/in/movie/venom-3-2024',
 'https://www.justwatch.com/in/movie/kishkkindha-kandam',
 'https://www.justwatch.com/in/movie/do-patti',
 'https://www.justwatch.com/in/movie/singham-again-2024-0',
 'https://www.justwatch.com/in/movie/gladiator',
 'https://www.justwatch.com/in/movie/amaran-2024',
 'https://www.justwatch.com/in/movie/martin',
 'https://www.justwatch.com/in/movie/lubber-pandhu',
 'https://www.justwatch.com/in/movie/the-wild-robot',
 'https://www.justwatch.com/in/movie/alien-romulus',
 'https://www.justwatch.com/in/movie/black-2024',
 'https://www.justwatch.com/in/movie/gaganachari',
 'https://www.justwatc
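
If the listing page links to the same title more than once, the list above will contain duplicates. The following sketch shows one way to build absolute URLs and drop repeats while keeping the original order. `build_unique_urls` and the sample hrefs are assumptions made for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.justwatch.com"


def build_unique_urls(hrefs, marker="/movie/"):
    """Return absolute URLs for hrefs containing marker, without duplicates."""
    matching = [urljoin(BASE_URL, href) for href in hrefs if marker in href]
    # dict.fromkeys keeps the first occurrence of each URL, in order
    return list(dict.fromkeys(matching))


# Made-up hrefs, shaped like the ones found on the listing page
sample_hrefs = [
    "/in/movie/stree-2",
    "/in/movies",
    "/in/movie/stree-2",
    "/in/movie/gladiator",
]
unique_urls = build_unique_urls(sample_hrefs)
print(unique_urls)
```

Passing `marker="/tv-show/"` would reuse the same helper for TV show links.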

## **Scraping Movie Title**

In [None]:
movie_title=soup.find_all("a",href=True)                                         #find every <a> tag that has an href
title=[link["href"] for link in movie_title if "/movie/" in link["href"]]        #keep only links to movie pages
title_lst=[]
for data in title:
  title_lst.append(data.split("/")[-1])                                          #the last path segment is the title slug, e.g. 'stree-2'
title_lst


['bhool-bhulaiyaa-3',
 'stree-2',
 'deadpool-3',
 'the-substance',
 'ntr-30',
 'meiyazhagan',
 'vettaiyan',
 'venom-3-2024',
 'kishkkindha-kandam',
 'do-patti',
 'singham-again-2024-0',
 'gladiator',
 'amaran-2024',
 'martin',
 'lubber-pandhu',
 'the-wild-robot',
 'alien-romulus',
 'black-2024',
 'gaganachari',
 'siddharth-roy',
 'ajayante-randam-moshanam',
 'bagheera-2024',
 'lucky-baskhar',
 'furiosa',
 'strange-darling',
 'the-buckingham-murders',
 'kill-2024',
 'all-we-imagine-as-light',
 'tumbbad',
 'khel-khel-mein',
 'transformers-one-2024',
 'longlegs',
 'bhool-bhulaiyaa-2',
 'venom-2018',
 'vicky-vidya-ka-woh-wala-video',
 'thalapathy-68',
 'pushpa',
 'viswam-2024',
 'my-old-ass',
 'kanguva',
 'thangalaan',
 'kalki-2898-ad',
 'golam',
 'yudhra',
 'it-ends-with-us',
 'untitled-soorarai-pottru-remake',
 'vedaa',
 'bhool-bhulaiyaa',
 'swag',
 '365-days',
 'munjha',
 'venom-let-there-be-carnage',
 'vaazhai',
 'caddo-lake',
 'zwigato',
 'salaar',
 'a-quiet-place-day-one',
 'stree',
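
The values above are URL slugs rather than display titles. A rough way to make them readable is sketched below. `slug_to_title` is a hypothetical helper, and the result is only an approximation, since slugs can carry working titles or disambiguation suffixes (for example `venom-3-2024`). Reading the title from the detail page would be more reliable.

```python
def slug_to_title(slug):
    """Turn a URL slug such as 'the-wild-robot' into 'The Wild Robot'."""
    return " ".join(word.capitalize() for word in slug.split("-") if word)


readable = [slug_to_title(s) for s in ["the-wild-robot", "bhool-bhulaiyaa-3", "365-days"]]
print(readable)
```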


## **Scraping Release Year**

In [None]:
import time                               #importing time to stop my request for few seconds to handle response status code 429
release_years = []                        #creating empty list for release year
max_retries = 3                           #setting max retries to 3

for link in url_list:                     #looping through the url_list
  retries = 0                             #reset the retry counter and default the year to None for each URL
  release_year = None

  while retries < max_retries:            #while loop for retries
    soup1 = fetch_movie_urls(link)        #calling the function for soup making
    if isinstance(soup1, BeautifulSoup):  # Check if it's a BeautifulSoup object
      release_year_element = soup1.find("h1", class_="title-detail-hero__details__title")  #finding the release year
      if release_year_element:
        release_year = release_year_element.find('span', class_='release-year')
        if release_year:
          release_year = release_year.get_text(strip=True).strip("()")
          break
        else:
          release_year = None
      else:
        release_year = None
    else:
      print(f"Failed to fetch page for {link}: {soup1}") # Print error if not BeautifulSoup

    retries += 1                                          #incrementing retries by 1
    time.sleep(3)                                         #stopping the request for 3 seconds

  release_years.append(release_year)
release_years

['2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2000',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2018',
 '2024',
 '2024',
 '2024',
 '2022',
 '2018',
 '2024',
 '2024',
 '2021',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2007',
 '2024',
 '2020',
 '2024',
 '2021',
 '2024',
 '2024',
 '2023',
 '2023',
 '2024',
 '2018',
 '2024',
 '2024',
 '2024',
 '2024',
 '2003',
 '2024',
 '2018',
 '2024',
 '2024',
 '2018',
 '2023',
 '2018',
 '2024',
 '2024',
 '2024',
 '2024',
 '2023',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2001',
 '2015',
 '2024',
 '2019',
 '2024',
 '2024',
 '2024',
 '2016',
 '2024',
 '2022',
 '2013',
 '2024',
 '2023',
 '2015',
 '2022',
 '2022',
 '2019',
 '2024',
 '2019',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024',
 '2024']
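
Each field in this notebook is scraped in its own loop, so every detail page is requested once per field. A possible alternative is to fetch a page once and read all fields from the same soup. The sketch below runs on a small made-up HTML snippet that reuses the class names seen in the cells of this notebook. `parse_detail_page`, `detail_value`, and the sample markup are assumptions, and the live page structure may differ.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the class names used in this notebook
SAMPLE_HTML = """
<h1 class="title-detail-hero__details__title">Example
  <span class="release-year">(2024)</span></h1>
<h3 class="detail-infos__subheading">Genres</h3>
<div class="detail-infos__value"><span>Comedy, Horror</span></div>
<h3 class="detail-infos__subheading">Runtime</h3>
<div class="detail-infos__value">2h 38min</div>
"""


def text_or_none(element):
    """Return the stripped text of an element, or None if it is missing."""
    return element.get_text(strip=True) if element else None


def detail_value(soup, heading):
    """Return the value that follows a given detail heading, or None."""
    h3 = soup.find("h3", class_="detail-infos__subheading", string=heading)
    if h3 is None:
        return None
    return text_or_none(h3.find_next("div", class_="detail-infos__value"))


def parse_detail_page(soup):
    """Collect several fields from one already-fetched detail page."""
    h1 = soup.find("h1", class_="title-detail-hero__details__title")
    year = text_or_none(h1.find("span", class_="release-year")) if h1 else None
    return {
        "Release Year": year.strip("()") if year else None,
        "Genre": detail_value(soup, "Genres"),
        "Runtime": detail_value(soup, "Runtime"),
        "Age Rating": detail_value(soup, "Age rating"),
    }


record = parse_detail_page(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(record)
```

With this shape, one request per title yields a complete record, which reduces the load on the site and the chance of hitting status code 429.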

## **Scraping Genres**

In [None]:
import time                                      #importing time to stop my request for few seconds to handle response status code 429
movie_genre_list = []                             #creating empty list for movie genre
max_retries = 3                                   #setting max retries to 3

for link in url_list:                             #looping through the url_list
  retries = 0                                     #reset the retry counter for each URL
  genre_text = None                               #default value if no genre is found

  while retries < max_retries:                    #while loop for retries
    soup2 = fetch_movie_urls(link)                #fetch the page once per attempt
    if isinstance(soup2, BeautifulSoup):
      h3_element = soup2.find('h3', class_='detail-infos__subheading', string='Genres')          #finding genre
      if h3_element:
        try:
          div_element = h3_element.find_next('div', class_='detail-infos__value')
          genre_text = div_element.find('span').text
          break  # Exit loop if successful
        except AttributeError:
          genre_text = None
      else:
        genre_text = None
    else:
      print(f"Failed to fetch page for {link}: {soup2}")              #show the request deny

    retries += 1                                                       #incrementing retries by 1
    time.sleep(3)                                                      #stopping the request for 3 seconds

  movie_genre_list.append(genre_text)

movie_genre_list

['Comedy, Horror',
 'Comedy, Horror',
 'Comedy, Science-Fiction, Action & Adventure',
 'Horror, Science-Fiction, Drama',
 'Action & Adventure, Drama, Mystery & Thriller',
 'Kids & Family, Drama',
 'Action & Adventure, Crime, Drama',
 'Science-Fiction, Action & Adventure, Mystery & Thriller',
 'Drama, Mystery & Thriller',
 'Mystery & Thriller, Drama, Crime',
 'Action & Adventure, Drama',
 'Drama, Action & Adventure',
 'Drama, War & Military, Action & Adventure',
 'Action & Adventure, Drama, Mystery & Thriller',
 'Comedy, Drama, Sport, Kids & Family, Romance',
 'Science-Fiction, Animation',
 'Horror, Science-Fiction, Mystery & Thriller',
 'Mystery & Thriller, Horror, Science-Fiction',
 'Science-Fiction, Comedy, Fantasy',
 'Romance, Drama',
 'Action & Adventure, Drama, Comedy',
 'Action & Adventure',
 'Crime, Drama, Mystery & Thriller',
 'Action & Adventure, Science-Fiction, Mystery & Thriller',
 'Horror, Mystery & Thriller, Crime',
 'Mystery & Thriller, Crime, Drama',
 'Mystery & Thrille

## **Scraping IMDb Rating**

In [None]:
import time                                                                                             #importing time to stop my request for few seconds to handle response status code 429

imdb_rt = []                                                                                             #creating empty list for imdb rating
max_retries = 3                                                                                           #setting max retries to 3

for link in url_list:                                                                                     #looping through the url_list
  retries = 0                                                                                             #reset the retry counter and default the rating to None for each URL
  rating_text = None

  while retries < max_retries:                                                                             #while loop for retries
    soup = fetch_movie_urls(link)                                                                          #fetching soup
    if isinstance(soup, BeautifulSoup):                                                                    # Check if it's a BeautifulSoup object
      try:
        imdb_rt_element = soup.find_all("span", class_="imdb-score")                                         #finding imdb rating
        if imdb_rt_element:  # Check if any elements were found
          rating_text = imdb_rt_element[0].text.strip()  # Extract from the first element
          break
      except AttributeError:                                                                               #if error occurs
        rating_text = None
    else:
      print(f"Failed to fetch page for {link}: {soup}")                                                      # Print error if not BeautifulSoup

    retries += 1                                                                                             #incrementing retries by 1
    time.sleep(3)                                                                                            #stopping the request for 3 seconds

  imdb_rt.append(rating_text)

imdb_rt

['5.2 (70k)',
 '7.0 (36k)',
 '7.7 (393k)',
 '7.5 (143k)',
 '6.1 (16k)',
 '8.4 (9.5k)',
 '7.3 (36k)',
 '6.2 (44k)',
 '8.3 (4.5k)',
 '6.6 (66k)',
 '6.0 (69k)',
 '8.5 (2m)',
 '8.5 (8.6k)',
 '2.5 (35k)',
 '8.3 (4.2k)',
 '8.3 (84k)',
 '7.2 (170k)',
 '7.2 (1.7k)',
 '7.1 (771)',
 '6.7 (2.7k)',
 '7.2 (6.8k)',
 '7.2 (4k)',
 '8.4 (3.2k)',
 '7.5 (232k)',
 '7.1 (30k)',
 '6.0 (9.3k)',
 '7.6 (32k)',
 '7.3 (1.6k)',
 '8.2 (66k)',
 '6.5 (25k)',
 '7.6 (33k)',
 '6.7 (137k)',
 '5.7 (34k)',
 '6.6 (561k)',
 '6.1 (7.6k)',
 '5.8 (20k)',
 '7.6 (91k)',
 '4.3 (544)',
 '7.0 (18k)',
 '5.3 (11k)',
 '7.1 (4.8k)',
 '7.1 (55k)',
 '7.2 (4k)',
 '5.5 (13k)',
 '6.5 (51k)',
 '7.2 (9.6k)',
 '6.0 (3.6k)',
 '7.5 (35k)',
 '7.8 (2.7k)',
 '3.3 (101k)',
 '6.4 (23k)',
 '5.9 (275k)',
 '7.8 (3.6k)',
 '6.9 (28k)',
 '7.0 (9.8k)',
 '6.6 (71k)',
 '6.3 (119k)',
 '7.5 (43k)',
 '6.2 (51k)',
 '5.2 (734)',
 '6.9 (47k)',
 '7.0 (7.3k)',
 '8.3 (657k)',
 '7.8 (16k)',
 '7.7 (7k)',
 '6.8 (1.9k)',
 '7.4 (18k)',
 '8.5 (40k)',
 '6.1 (101k)',
 '8.2 (1
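
The scraped strings combine the rating with a vote count, such as `'7.7 (393k)'`. A small parser can split the two. This is a sketch that assumes the `k` and `m` suffixes mean thousands and millions; `parse_imdb_rating` is a hypothetical helper.

```python
import re

RATING_PATTERN = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:\((\d+(?:\.\d+)?)([km]?)\))?", re.IGNORECASE
)
# Assumption: 'k' stands for thousand and 'm' for million
MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_imdb_rating(raw):
    """Return (rating, votes) from a string like '7.7 (393k)'."""
    if not isinstance(raw, str):
        return None, None
    match = RATING_PATTERN.match(raw)
    if not match:
        return None, None
    rating = float(match.group(1))
    votes = None
    if match.group(2):
        votes = round(float(match.group(2)) * MULTIPLIERS[match.group(3).lower()])
    return rating, votes


parsed = [parse_imdb_rating(value) for value in ["7.7 (393k)", "8.5 (2m)", "7.1 (771)", None]]
print(parsed)
```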

## **Scraping Runtime/Duration**

In [None]:
import time                                                                #importing time to stop my request for few seconds to handle response status code 429
runtime_list = []                                                            #creating empty list for runtime
max_retries = 3                                                              #setting max retries to 3

for link in url_list:                                                        #looping through the url_list
  retries = 0                                                                 #reset the retry counter and default the runtime to None for each URL
  runtime_element = None

  while retries < max_retries:                                                #while loop for retries
    soup = fetch_movie_urls(link)                                            #fetching soup
    if isinstance(soup, BeautifulSoup):                                      # Check if it's a BeautifulSoup object
      try:
        h3_element = soup.find('h3', class_='detail-infos__subheading', string='Runtime')  #finding runtime
        if h3_element:
          runtime_element = h3_element.find_next("div", class_="detail-infos__value").text.strip()
          break  # Exit the loop if successful
        else:                                                                #heading not found on this page
          runtime_element = None
      except AttributeError:                                                 #if error occurs
        runtime_element = None
    else:
      print(f"Failed to fetch page for {link}: {soup}")                      # Print error if not BeautifulSoup

    retries += 1                                                              #incrementing retries by 1
    time.sleep(3)                                                             #stopping the request for 3 seconds

  runtime_list.append(runtime_element)

runtime_list

['2h 38min',
 '2h 27min',
 '2h 8min',
 '2h 21min',
 '2h 56min',
 '2h 57min',
 '2h 43min',
 '1h 49min',
 '2h 13min',
 '2h 7min',
 '2h 24min',
 '2h 35min',
 '2h 47min',
 '2h 27min',
 '2h 26min',
 '1h 42min',
 '1h 59min',
 '2h 30min',
 '1h 54min',
 '2h 28min',
 '2h 27min',
 '2h 38min',
 '2h 50min',
 '2h 29min',
 '1h 37min',
 '1h 47min',
 '1h 45min',
 '1h 58min',
 '1h 44min',
 '2h 15min',
 '1h 44min',
 '1h 41min',
 '2h 23min',
 '1h 52min',
 '2h 32min',
 '3h 3min',
 '2h 59min',
 '2h 33min',
 '1h 29min',
 '2h 34min',
 '2h 36min',
 '3h 0min',
 '2h 0min',
 '2h 22min',
 '2h 10min',
 '2h 35min',
 '2h 25min',
 '2h 38min',
 '2h 30min',
 '1h 54min',
 '2h 3min',
 '1h 37min',
 '2h 14min',
 '1h 43min',
 '1h 46min',
 '2h 55min',
 '1h 39min',
 '2h 8min',
 '1h 34min',
 '2h 12min',
 '2h 7min',
 '2h 52min',
 '2h 0min',
 '2h 36min',
 '2h 19min',
 '1h 56min',
 '2h 7min',
 '2h 39min',
 '3h 21min',
 '2h 36min',
 '2h 35min',
 '1h 50min',
 '2h 1min',
 '1h 57min',
 '2h 29min',
 '1h 30min',
 '2h 18min',
 '1h 32min
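
Runtime strings such as `'2h 38min'` are hard to sort or average. Converting them to minutes makes them numeric. The helper below is a sketch that assumes the `Xh Ymin` format shown in the output above; `runtime_to_minutes` is a hypothetical name.

```python
import re


def runtime_to_minutes(raw):
    """Convert a string like '2h 38min' to total minutes, or None if unparseable."""
    if not isinstance(raw, str):
        return None
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)min)?\s*", raw)
    # Require at least one of the two parts to be present
    if not match or not any(match.groups()):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


minutes = [runtime_to_minutes(value) for value in ["2h 38min", "3h 0min", "45min", None]]
print(minutes)
```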

## **Scraping Age Rating**

In [None]:
import time                                                               #importing time to stop my request for few seconds to handle response status code 429
age_rt = []                                                               #creating empty list for age rating
max_retries = 3                                                           #setting max retries to 3

for link in url_list:                                                      #looping through the url_list
  retries = 0                                                              #reset the retry counter for each URL
  age_rt_element = None                                                    #default value if no age rating is found

  while retries < max_retries:                                              #while loop for retries
    soup = fetch_movie_urls(link)                                          #fetching soup
    if isinstance(soup, BeautifulSoup):                                    # Check if it's a BeautifulSoup object
      try:                                                                #try block for error handling
        h3_element = soup.find('h3', class_='detail-infos__subheading', string='Age rating')
        age_rt_element = h3_element.find_next_sibling('div', class_='detail-infos__value').text.strip()
        break                                                              # Exit loop if successful
      except AttributeError:
        age_rt_element = None
    else:
      print(f"Failed to fetch page for: {link}:{soup}")  # Print the link and the error returned by the fetch

    retries += 1
    time.sleep(3)

  age_rt.append(age_rt_element)

age_rt

['UA',
 'UA',
 'A',
 None,
 'UA',
 'U',
 'UA',
 None,
 'UA',
 None,
 'UA',
 None,
 None,
 'UA',
 'U',
 None,
 'A',
 'UA',
 None,
 'A',
 'UA',
 'UA',
 'UA',
 'A',
 None,
 'A',
 'A',
 'A',
 'A',
 'UA',
 'U',
 'A',
 'UA',
 'UA',
 'UA',
 'UA',
 'UA',
 'UA',
 None,
 'UA',
 'UA',
 'UA',
 'UA',
 'UA',
 'A',
 'U',
 None,
 None,
 None,
 None,
 'UA',
 None,
 'U',
 'A',
 'U',
 'A',
 None,
 'UA',
 'U',
 None,
 None,
 'UA',
 'A',
 'U',
 'UA',
 'UA',
 'UA',
 'U',
 'A',
 'UA',
 None,
 None,
 'UA',
 None,
 None,
 None,
 'A',
 'A',
 None,
 None,
 'U',
 None,
 'UA',
 None,
 'UA',
 None,
 None,
 None,
 None,
 'UA',
 'A',
 None,
 None,
 'A',
 'A',
 'A',
 'UA',
 'UA',
 'A',
 None,
 'UA',
 'UA',
 'A',
 None,
 'UA',
 None,
 None,
 None,
 None,
 None]

## **Fetching Production Countries Details**

In [None]:
import time                                                              #importing time to stop my request for few seconds to handle response status code 429
country_list = []                                                          #creating empty list for country
max_retries = 3                                                            #setting max retries to 3

for link in url_list:                                                      #looping through the url_list
  retries = 0                                                               #reset the retry counter for each URL
  elements = None                                                           #default value if no country is found

  while retries < max_retries:                                              #while loop for retries
    soup = fetch_movie_urls(link)                                          #fetching soup
    if isinstance(soup, BeautifulSoup):                                    # Check if it's a BeautifulSoup object
      h3_element = soup.find('h3', class_='detail-infos__subheading', string=lambda text: text is not None and 'Production country' in text)  #guard against headings without a single text string
      if h3_element:
        try:
          elements = h3_element.find_next_sibling('div', class_='detail-infos__value').text.strip()
          break  # Exit loop if successful
        except AttributeError:                                                #handling error
          elements = None
      else:
        elements = None
    else:
      print(f"Failed to fetch page for: {link}")                              # Print error if not BeautifulSoup

    retries += 1                                                              #incrementing retries by 1
    time.sleep(3)                                                             #stopping the request for 3 seconds

  country_list.append(elements)

country_list

['India',
 'India',
 'United States',
 'United Kingdom, France',
 'India',
 'India',
 'India',
 'United States',
 'India',
 'India',
 'India',
 'United Kingdom, United States',
 'India',
 'India',
 'India',
 'Japan, United States',
 'New Zealand, Canada, United Kingdom, United States, Hungary, Australia',
 'India',
 'India',
 'India',
 'India',
 'India',
 'India',
 'Australia, United States',
 'United States',
 'United Kingdom, India',
 'India',
 'Netherlands, France, India, Italy, Luxembourg',
 'India, Sweden',
 'India',
 'United States',
 'Canada, United States',
 'India',
 'China, United States',
 'India',
 'India',
 'India',
 'India',
 'United Kingdom, Canada, United States',
 'India',
 'India',
 'India',
 'India',
 'India',
 'United States',
 'India',
 'India',
 'India',
 'India',
 'Poland',
 'India',
 'United States',
 'India',
 'United States',
 'India',
 'India',
 'United Kingdom, Canada, United States',
 'India',
 'United States',
 'India',
 'United States',
 'India',
 'South 

## **Fetching Streaming Service Details**

In [None]:
import time                                                              #importing time to stop my request for few seconds to handle response status code 429
import re                                                               #importing re to handle regular expression
from urllib.parse import unquote                                          #importing unquote to handle url to be used in another cell

service_list = []                                                          #creating empty list for service
max_retries = 3                                                            #setting max retries to 3

for link in url_list:                                                      #looping through the url_list
  retries = 0                                                               #reset the retry counter for each URL
  data1 = None                                                              #default value if the page cannot be fetched

  while retries < max_retries:                                              #while loop for retries
    soup = fetch_movie_urls(link)                                          #fetching soup
    if isinstance(soup, BeautifulSoup):                                    # Check if it's a BeautifulSoup object
      try:                                                                #try block for error handling
        data = soup.findAll("a", class_="offer", href=True)
        data1 = [item["href"] for item in data]
        break  # Exit loop if successful
      except AttributeError:
        data1 = None
    else:
      print(f"Failed to fetch page for: {link}")                          # Print error if not BeautifulSoup

    retries += 1                                                              #incrementing retries by 1
    time.sleep(3)                                                         #stopping the request for 3 seconds

  service_list.append(data1)

In [None]:
import re                                                              #importing re to handle regular expression
from urllib.parse import unquote

def extract_streaming_services(urls_list):                                #function for extracting streaming services
  """
  This function extracts streaming services like Netflix, Hotstar, etc. from a list of URLs using regex.

  Args:
    urls_list (list): A list of lists of URLs to extract streaming services from.

  Returns:
    list: A list of lists of extracted streaming services. Each inner list corresponds to an item in urls_list
           and contains all the streaming services found in that item's URLs. Empty lists are replaced with None.
  """
  services = []                                                           #creating empty list for service
  pattern = r'//(?:www\.)?([a-z0-9]+)\.com'                               #captures the host name before .com (other domain endings are not matched)
  for url_list in urls_list:                                                #looping through the url_list
    item_services = []                                                      #creating empty list for service
    if url_list is None:                                                    #if url_list is none
      services.append(None)                                                 #append none in services
    else:                                                                  #if url_list is not none
      for url in url_list:                                                  #looping through the url_list
        decoded_url = unquote(url)                                         #decoding the url
        match = re.search(pattern, decoded_url)                           #searching the pattern in the decoded_url
        if match:                                                         #if match found
          service = match.group(1)                                        #captured host name, e.g. 'netflix'
          item_services.append(service)                                   #appending the service in the empty list
      # Replace empty item_services with None
      services.append(item_services if item_services else None)

  return services                                                         #returning the services


services = extract_streaming_services(service_list)                        #calling the function for extracting streaming services
services
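
The regular expression above only recognises hosts ending in `.com`. A more general sketch is shown below. It assumes that each offer link may embed the provider's address as an encoded URL and that the last embedded URL is the provider's own. Both assumptions, the helper name `service_from_offer_url`, and the sample links are hypothetical and should be checked against the real `href` values. Taking the second-to-last host label is also a simplification that fails for endings such as `.co.uk`.

```python
import re
from urllib.parse import unquote, urlparse


def service_from_offer_url(url):
    """Guess the provider name from an offer link, or return None."""
    decoded = unquote(url)
    # Assumption: the last embedded http(s) URL is the provider's own address
    embedded = re.findall(r"https?://[^\s&?]+", decoded)
    if not embedded:
        return None
    host = urlparse(embedded[-1]).hostname or ""
    labels = [label for label in host.split(".") if label != "www"]
    # Second-to-last label, e.g. 'netflix' in 'www.netflix.com'
    return labels[-2] if len(labels) >= 2 else None


# Made-up links for illustration only
sample_links = [
    "https://click.example.com/a?r=https%3A%2F%2Fwww.netflix.com%2Ftitle%2F1",
    "https://www.mxplayer.in/movie/x",
    "not a url",
]
guessed = [service_from_offer_url(link) for link in sample_links]
print(guessed)
```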

## **Now Creating Movies DataFrame**

In [None]:
len(title_lst),len(url_list),len(release_years),len(movie_genre_list),len(imdb_rt),len(runtime_list),len(age_rt),len(country_list),len(services)

In [None]:
movies_df=pd.DataFrame({                                        #creating movies dataframe
    "Title":title_lst,
    "URL":url_list,
    "Release Year":release_years,
    "Genre":movie_genre_list,
    "IMDB Rating":imdb_rt,
    "Runtime":runtime_list,
    "Age Rating":age_rt,
    "Production Country":country_list,
    "Streaming Service":services
})
movies_df


In [None]:
movies_df.isnull().sum()                                     #checking for null values

In [None]:
import re

# Keep only the numeric rating and drop the number of reviewers, e.g. '7.7 (393k)' -> '7.7'
# str.extract returns NaN when no number is found, so missing or malformed values do not raise an error
movies_df['IMDB Rating'] = movies_df['IMDB Rating'].str.extract(r'(\d+(?:\.\d+)?)', expand=False)

# Convert the column to numeric type
movies_df['IMDB Rating'] = pd.to_numeric(movies_df['IMDB Rating'], errors='coerce')
movies_df['Release Year']=pd.to_numeric(movies_df['Release Year'],errors='coerce')
movies_df

## **Scraping TV Show Data**

In [None]:
# URL from which the TV show data will be fetched
tv_url='https://www.justwatch.com/in/tv-shows?release_year_from=2000'
# Fetch the page with the shared helper (which sends the User-Agent header) and parse it with BeautifulSoup
soup=fetch_movie_urls(tv_url)
# Print the prettified HTML content
print(soup.prettify())

## **Fetching TV Show URL Details**

In [None]:
url_tv_show_list=[]                                                              #creating empty list for tv show url
show_url_data=soup.find_all('a',href=True)                                      #finding all link in href using a tag
show_url=[link['href'] for link in show_url_data if '/tv-show/' in link['href']]  #filtering the links in which /tv-show/ exists
for data in show_url:                                                           #looping through the show_url
  url_tv_show_list.append('https://www.justwatch.com'+data)                    #appending all urls in show_url in the empty list
url_tv_show_list



## **Fetching TV Show Title Details**

In [None]:
title_tv_show_list=[]                                                              #creating empty list for tv show title
show_title_data=soup.find_all('a',href=True)                                      #finding all link in href using a tag
show_title=[link['href'] for link in show_title_data if '/tv-show/' in link['href']]  #filtering the title from show_title
for data in show_title:                                                           #looping through the show_title
  title_tv_show_list.append(data.split("/")[-1])                                 #appending the title in the empty list after spliting the show_title
title_tv_show_list
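
The titles collected above are URL slugs (lowercase words joined by hyphens). If a more readable title is wanted, the slug can be converted as in the sketch below. `slug_to_title` and the sample href are hypothetical, and the result is only an approximation of the display title shown on the page.

```python
def slug_to_title(href):
    """Turn the last path segment of a link into a title-cased string."""
    slug = href.rstrip("/").split("/")[-1]        # last path segment, e.g. "the-family-man"
    return slug.replace("-", " ").title()         # hyphens become spaces, words are capitalised

# Hypothetical href in the same shape as the filtered links
print(slug_to_title("/in/tv-show/the-family-man"))
```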


## **Fetching Release Year**

In [None]:
# Write Your Code here
import time                                                              #used to pause between retries when the server responds with status code 429
max_retries = 3                                                           #maximum number of attempts per URL

show_r_year_list=[]                                                       #creating empty list for release year
for link in url_tv_show_list:                                             #looping through the TV show URLs
  retries = 0                                                              #resetting the retry counter for every show
  release_year_s = None                                                    #stays None if the release year cannot be scraped
  while retries < max_retries:                                              #retrying until the year is found or the attempts run out
    soup1 = fetch_movie_urls(link)
    if isinstance(soup1, BeautifulSoup):  # Check if it's a BeautifulSoup object
      release_year_s_element = soup1.find("h1", class_="title-detail-hero__details__title")
      if release_year_s_element:
        release_year_span = release_year_s_element.find('span', class_='release-year')
        if release_year_span:
          release_year_s = release_year_span.get_text(strip=True).strip("()")   #removing the surrounding parentheses
          break                                                                              # Exit loop if successful
    else:
      print(f"Failed to fetch page for {link}: {soup1}") # Print error if not BeautifulSoup

    retries += 1                                                              #incrementing retries by 1
    time.sleep(3)                                                              #pausing for 3 seconds before the next attempt

  show_r_year_list.append(release_year_s)                                      #appending the release year (or None) to the list
show_r_year_list
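
The same fetch, extract, retry and sleep pattern is repeated in each of the following cells. One possible way to factor it out is sketched below. `scrape_with_retries`, `flaky_fetch` and `extract_year` are hypothetical names, and the fake fetcher only imitates a page that fails twice before succeeding; in the notebook the fetcher would be `fetch_movie_urls` and the extractor would return `None` when the page is not a `BeautifulSoup` object or the element is missing.

```python
import time

def scrape_with_retries(link, fetch, extract, max_retries=3, delay=3):
    """Call fetch(link) and extract(page) until a value is found or attempts run out."""
    for attempt in range(max_retries):
        page = fetch(link)
        value = extract(page) if page is not None else None
        if value is not None:
            return value                          # success, stop retrying
        if attempt < max_retries - 1:
            time.sleep(delay)                     # wait before the next attempt
    return None                                   # every attempt failed

# Hypothetical stand-ins so the sketch runs without network access
calls = {"n": 0}

def flaky_fetch(link):
    calls["n"] += 1
    return None if calls["n"] < 3 else {"year": "(2021)"}

def extract_year(page):
    return page["year"].strip("()")

print(scrape_with_retries("https://example.com/show", flaky_fetch, extract_year, delay=0))
```

Because the result starts as `None` inside the function, a failed show can never inherit the value scraped for the previous one.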


## **Fetching TV Show Genre Details**

In [None]:
# Write Your Code here
import time                                                              #importing time to stop my request for few seconds to handle response status code 429
show_genre_list = []                                                      #creating empty list for genre
max_retries = 3                                                           #setting max retries to 3
for link in url_tv_show_list:                                             #looping through the url_list
  retries = 0                                                              #resetting the retry counter for every show
  genre_text_s = None                                                      #stays None if the genre cannot be scraped
  while retries < max_retries:                                              #retrying until the genre is found or the attempts run out
    soup2 = fetch_movie_urls(link)                                           #fetching soup
    if isinstance(soup2, BeautifulSoup):                                    # Check if it's a BeautifulSoup object
      h3_element_s = soup2.find('h3', class_='detail-infos__subheading', string='Genres')
      if h3_element_s:
        try:
          genre_value_s = h3_element_s.find_next('div', class_='detail-infos__value')
          genre_text_s = genre_value_s.find('span').text
          break  # Exit loop if successful
        except AttributeError:                                              #raised when the value div or its span is missing
          genre_text_s = None
      else:                                                              #the Genres heading is not on the page
        genre_text_s = None
    else:
      print(f"Failed to fetch page for {link}: {soup2}")                 # Print error if not BeautifulSoup

    retries += 1                                                              #incrementing retries by 1
    time.sleep(3)                                                             #stopping the request for 3 seconds

  show_genre_list.append(genre_text_s)                                     #appending the genre in the empty list
show_genre_list



## **Fetching IMDB Rating Details**

In [None]:
# Write Your Code here
import time                                                              #importing time to stop my request for few seconds to handle response status code 429
s_imdb_r_list = []                                                       #creating empty list for imdb rating
max_retries = 3                                                           #setting max retries to 3
for link in url_tv_show_list:                                            #looping through the url_list
  retries = 0                                                              #setting retries to 0 and value none if retries becomes zero
  rating_text = None
  while retries < max_retries:                                            #while loop for retries
    soup = fetch_movie_urls(link)                                         #fetching soup
    if isinstance(soup, BeautifulSoup):                                   # Check if it's a BeautifulSoup object
      try:                                                              #try block for error handling
        s_rt_data=soup.find_all("span", class_="imdb-score")
        if s_rt_data:  # Check if any elements were found
          rating_text = s_rt_data[0].text.strip()
          break
          # Exit the loop if successful
      except AttributeError:                                             #if error occurs
        rating_text = None
    else:
      print(f"Failed to fetch page for {link}: {soup}")                  # Print error if not BeautifulSoup

    retries += 1                                                           #incrementing retries by 1
    time.sleep(3)                                                          #stopping the request for 3 seconds

  s_imdb_r_list.append(rating_text)                                     #appending the imdb rating in the empty list
s_imdb_r_list



## **Fetching Age Rating Details**

In [None]:
# Write Your Code here
s_age_rt_list = []                                                       #creating empty list for age rating
max_retries = 3                                                          #setting max retries to 3
for link in url_tv_show_list:                                            #looping through the url_list
  retries = 0                                                              #resetting the retry counter for every show
  s_age_rt_element = None                                                  #stays None if the age rating cannot be scraped
  while retries < max_retries:                                            #while loop for retries
    soup = fetch_movie_urls(link)                                         #fetching soup
    if isinstance(soup, BeautifulSoup):                                   # Check if it's a BeautifulSoup object
      try:                                                              #try block for error handling
        h3_element_s = soup.find('h3', class_='detail-infos__subheading', string='Age rating')
        s_age_rt_element = h3_element_s.find_next_sibling('div', class_='detail-infos__value').text.strip()
        break
      except AttributeError:
        s_age_rt_element = None
    else:
      print(f"Failed to fetch page for: {link}")                         # Print error if not BeautifulSoup

    retries += 1                                                           #incrementing retries by 1
    time.sleep(3)                                                          #stopping the request for 3 seconds
  s_age_rt_list.append(s_age_rt_element)                                 #appending the age rating in the empty list
s_age_rt_list



## **Fetching Production Country details**

In [None]:
# Write Your Code here
s_country_list = []                                                       #creating empty list for country
max_retries = 3                                                           #setting max retries to 3
for link in url_tv_show_list:                                             #looping through the url_list
  retries = 0                                                              #resetting the retry counter for every show
  elements = None                                                          #stays None if the production country cannot be scraped
  while retries < max_retries:                                              #retrying until the country is found or the attempts run out
    soup = fetch_movie_urls(link)                                         #fetching soup
    if isinstance(soup, BeautifulSoup):                                   # Check if it's a BeautifulSoup object
      h3_element_s = soup.find('h3', class_='detail-infos__subheading', string=lambda text: text is not None and 'Production country' in text)
      if h3_element_s:
        try:
          elements = h3_element_s.find_next_sibling('div', class_='detail-infos__value').text.strip()
          break  # Exit loop if successful
        except AttributeError:                                              #if error occurs
          elements = None
      else:
        elements = None
    else:
      print(f"Failed to fetch page for: {link}")                         # Print error if not BeautifulSoup

    retries += 1                                                              #incrementing retries by 1
    time.sleep(3)                                                         #stopping the request for 3 seconds

  s_country_list.append(elements)                                        #appending the country in the empty list
s_country_list


## **Fetching Streaming Service details**

In [None]:
import time                                                              #importing time to stop my request for few seconds to handle response status code 429
import re                                                              #importing re to handle regular expression
from urllib.parse import unquote                                         #importing unquote to handle url to be used in another cell

service_list = []                                                        #creating empty list for service
max_retries = 3                                                          #setting max retries to 3

for link in url_tv_show_list:                                            #looping through the url_list
  retries = 0                                                              #setting retries to 0 and value none if retries becomes zero
  data1 = None

  while retries < max_retries:                                            #while loop for retries
    soup = fetch_movie_urls(link)                                         #fetching soup
    if isinstance(soup, BeautifulSoup):                                  # Check if it's a BeautifulSoup object
      try:
        data = soup.findAll("a", class_="offer", href=True)
        data1 = [item["href"] for item in data]
        break  # Exit loop if successful
      except AttributeError:                                              #if error occurs
        data1 = None
    else:
      print(f"Failed to fetch page for: {link}")                         # Print error if not BeautifulSoup

    retries += 1
    time.sleep(3)

  service_list.append(data1)                                             #appending the service in the empty list

In [None]:
import re                                                              #importing re to handle regular expression
from urllib.parse import unquote                                         #importing unquote to handle url to be used in another cell

def extract_streaming_services(urls_list):                               #function for extracting streaming services
  """
  This function extracts streaming services like Netflix, Hotstar, etc. from a list of URLs using regex.

  Args:
    urls_list (list): A list of lists of URLs to extract streaming services from.

  Returns:
    list: A list of lists of extracted streaming services. Each inner list corresponds to an item in urls_list
           and contains all the streaming services found in that item's URLs. Empty lists are replaced with None.
  """
  services = []                                                           #creating empty list for service
  pattern = r'//(?:www\.)?([a-z0-9]+)\.com'                               #regular expression for extracting streaming services
  for url_list in urls_list:                                                #looping through the url_list
    item_services = []                                                      #creating empty list for service
    if url_list is None:                                                    #if url_list is none
      services.append(None)                                                 #append none in services
    else:                                                                  #if url_list is not none
      for url in url_list:                                                  #looping through the url_list
        decoded_url = unquote(url)                                         #decoding the url
        match = re.search(pattern, decoded_url)                           #searching the pattern in the decoded_url
        if match:                                                         #if match found
          service = match.group(1)                                        #taking the service name captured by the pattern
          item_services.append(service)                                   #appending the service in the empty list
      # Replace empty item_services with None
      services.append(item_services if item_services else None)

  return services                                                         #returning the services


services = extract_streaming_services(service_list)                        #calling the function for extracting streaming services
services
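
A title can carry several offers from the same provider, so the extracted list may repeat a service, which inflates later counts. The sketch below applies the same regular expression and keeps each service once per title. `unique_services` and the sample URLs are hypothetical; the first sample assumes the provider address is embedded percent-encoded in a redirect link, which is why the URL is decoded before matching.

```python
import re
from urllib.parse import unquote

# Same pattern as above: it only recognises lowercase ".com" hosts
SERVICE_PATTERN = re.compile(r'//(?:www\.)?([a-z0-9]+)\.com')

def unique_services(offer_urls):
    """Return each service found in offer_urls once, in first-seen order, or None."""
    found = []
    for url in offer_urls or []:
        match = SERVICE_PATTERN.search(unquote(url))   # decode %3A%2F%2F before matching
        if match:
            found.append(match.group(1))
    return list(dict.fromkeys(found)) or None

# Hypothetical offer links for a single title
sample_offers = [
    "https://click.example.com/a?r=https%3A%2F%2Fwww.netflix.com%2Ftitle%2F1",
    "https://www.netflix.com/title/1",
    "https://www.primevideo.com/detail/abc",
]
print(unique_services(sample_offers))
```

Providers hosted on other top-level domains are not matched by this pattern, so their offers are silently skipped.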

## **Fetching Duration Details**

In [None]:
# Write Your Code here
s_runtime_list = []                                                      #creating empty list for runtime
import time                                                              #importing time to stop my request for few seconds to handle response status code 429
max_retries = 3                                                          #setting max retries to 3
for link in url_tv_show_list:
  retries = 0
  runtime_text = None
  while retries < max_retries:
    soup = fetch_movie_urls(link)
    if isinstance(soup, BeautifulSoup):
      h3_element_s=soup.find('h3', class_='detail-infos__subheading', string='Runtime')
      if h3_element_s:
        try:
          runtime_text = h3_element_s.find_next_sibling('div', class_='detail-infos__value').text.strip()
          break
        except AttributeError:
          runtime_text = None
    else:
      print(f"Failed to fetch page for {link}: {soup}")

    retries += 1                                                              #incrementing retries by 1
    time.sleep(3)                                                             #stopping the request for 3 seconds
  s_runtime_list.append(runtime_text)                                     #appending the runtime in the empty list
s_runtime_list


## **Creating TV Show DataFrame**

In [None]:
# Write Your Code here
import pandas as pd                                                       #importing pandas to create dataframe
import numpy as np                                                        #importing numpy to create dataframe
show_df=pd.DataFrame({                                                    #creating tv show dataframe
    "TV shows Title":title_tv_show_list,
    "TV shows URL":url_tv_show_list,
    "TV shows Release Year":show_r_year_list,
    "TV shows Genre":show_genre_list,
    "TV shows IMDB Rating":s_imdb_r_list,
    "TV shows Runtime":s_runtime_list,
    "TV shows Age Rating":s_age_rt_list,
    "TV shows Production Country":s_country_list,
    "TV shows Streaming Service":services
})
show_df


In [None]:
show_df.isnull().sum()                                                  #checking for null values

In [None]:
import re

# Keep only the numeric rating and drop the reviewer count from the IMDb rating text
# (values without a decimal number become None instead of raising an IndexError)
show_df['TV shows IMDB Rating'] = show_df['TV shows IMDB Rating'].apply(lambda x: (re.findall(r'\d+\.\d+', x) or [None])[0] if isinstance(x, str) else x)

# Convert the column to numeric type
show_df['TV shows IMDB Rating'] = pd.to_numeric(show_df['TV shows IMDB Rating'], errors='coerce')
show_df['TV shows Release Year']=pd.to_numeric(show_df['TV shows Release Year'],errors='coerce')
show_df.head(10)

## **Task 2 :- Data Filtering & Analysis**

## **Calculating Mean IMDB Ratings for both Movies and TV Shows**

In [None]:
# Write Your Code here
mean_imdb_rating_movies=movies_df['IMDB Rating'].mean()       #mean IMDb rating of the scraped movies
print("The mean IMDb rating of the scraped movies is:",mean_imdb_rating_movies)
mean_imdb_rating_tv_shows=show_df['TV shows IMDB Rating'].mean()  #mean IMDb rating of the scraped TV shows
print("The mean IMDb rating of the scraped TV shows is:",mean_imdb_rating_tv_shows)



## **Analyzing Top Genres**

In [None]:

all_genres_movie = []
for genres in movies_df['Genre']:
  if genres is not None:                 # Check if genres is not None
    all_genres_movie.extend(genres.split(', '))
genre_counts = pd.Series(all_genres_movie).value_counts()
print("The top genre movies are : ")
print(genre_counts.head(10))
# Repeating the same count for TV shows
all_genres_show = []                                       #creating an empty list
for genres in show_df['TV shows Genre']:                   #looping through the genre strings
  if genres is not None:                   #skipping shows whose genre could not be scraped
    all_genres_show.extend(genres.split(', '))             #splitting the comma-separated genres into individual items
genre_counts_show = pd.Series(all_genres_show).value_counts()         #counting how often each genre occurs
print("The top genre shows are : ")
print(genre_counts_show.head(10))                                   #printing the top 10 genres
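
The task asks for the top 5 genres, and the loops above can be expressed with pandas string methods alone. The following is a sketch under the assumption that genres are stored as comma-separated strings; `top_genres` and the sample Series are hypothetical.

```python
import pandas as pd

def top_genres(genre_column, n=5):
    """Count individual genres in a comma-separated column and return the n most common."""
    return (
        genre_column.dropna()            # ignore titles without a scraped genre
        .str.split(', ')                 # "Drama, Crime" -> ["Drama", "Crime"]
        .explode()                       # one row per genre
        .str.strip()                     # remove stray whitespace
        .value_counts()
        .head(n)
    )

# Hypothetical genre strings in the same shape as the scraped column
sample_genres = pd.Series(["Drama, Crime", "Comedy", None, "Drama, Comedy", "Drama"])
print(top_genres(sample_genres))
```

In the notebook it could be called as `top_genres(movies_df['Genre'])` or `top_genres(show_df['TV shows Genre'])`.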



In [None]:
#Let's visualize it using a word cloud
from wordcloud import WordCloud, STOPWORDS                    #importing stopwords from wordcloud
import matplotlib.pyplot as plt                               #importing matplotlib for plotting

# Combine genre counts from movies and TV shows
all_genres_combined = pd.concat([pd.Series(all_genres_movie), pd.Series(all_genres_show)])
genre_counts_combined = all_genres_combined.value_counts()

# Create text from genre counts
text = " ".join(genre for genre in all_genres_combined)

# Generate word cloud
stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=stopwords, min_font_size=10).generate(text)

# Plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()



## **Finding Predominant Streaming Service**

In [None]:
# Handle mixed format of streaming services in movies
streaming_movie_counts = movies_df["Streaming Service"].explode().str.split(', ').explode()
streaming_movie_counts = streaming_movie_counts[streaming_movie_counts != 'None'].value_counts()
print("Streaming Movie Counts:")
print(streaming_movie_counts)

# Count the occurrences of each streaming service for TV shows
streaming_show_counts = show_df["TV shows Streaming Service"].explode().str.split(', ').explode()
streaming_show_counts = streaming_show_counts[streaming_show_counts != 'None'].value_counts()
print("\nStreaming Show Counts:")
print(streaming_show_counts)
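
The two counts above are printed separately, while the task asks for the single streaming service with the most offerings. One way to combine them is sketched below. `top_service` is a hypothetical helper and the sample counts are made up for illustration; they are not scraped results.

```python
import pandas as pd

def top_service(*count_series):
    """Add up per-service counts from several Series and return (service, total)."""
    combined = pd.concat(list(count_series)).groupby(level=0).sum()   # sum counts that share a service name
    return combined.idxmax(), int(combined.max())

# Made-up counts, standing in for streaming_movie_counts and streaming_show_counts
movie_counts = pd.Series({"netflix": 12, "primevideo": 9})
show_counts = pd.Series({"netflix": 7, "hotstar": 15, "primevideo": 11})
print(top_service(movie_counts, show_counts))
```

In the notebook the call would be `top_service(streaming_movie_counts, streaming_show_counts)`.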

In [None]:
#Let's visualize it using a word cloud
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Combine movie and show streaming counts
all_streaming_counts = pd.concat([streaming_movie_counts, streaming_show_counts])

# Create text from streaming counts (weighted by frequency)
text = " ".join([service for service, count in all_streaming_counts.items() for _ in range(count)])

# Generate word cloud
stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=stopwords, min_font_size=10).generate(text)

# Plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()


## **Task 3 :- Data Export**

In [None]:
#saving the final movie and TV show DataFrames as CSV files
movies_df.to_csv('Final Movies Data.csv', index=False)
show_df.to_csv('Final Shows Data.csv', index=False)



In [None]:
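# Keep only the movies whose genre includes Drama, handling missing values
# (the search term must be lowercase because the genre text is lowercased before the comparison)
filtered_movies = movies_df[movies_df['Genre'].apply(lambda x: 'drama' in x.lower() if isinstance(x, str) else False)]

# Keep only the TV shows whose genre includes Drama, handling missing values
filtered_shows = show_df[show_df['TV shows Genre'].apply(lambda x: 'drama' in x.lower() if isinstance(x, str) else False)]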

filtered_movies.to_csv('Filter Movies Data.csv', index=False)
filtered_shows.to_csv('Filter Shows Data.csv', index=False)
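
The task description also asks for titles released in the last 2 years with an IMDb rating of 7 or higher. A minimal sketch of that filter follows. `filter_recent_and_rated` is a hypothetical helper, the sample rows are made up, and because only the release year is scraped, "last 2 years" is approximated here as a release year no older than the current year minus 2.

```python
from datetime import date
import pandas as pd

def filter_recent_and_rated(df, year_col, rating_col, years=2, min_rating=7.0, today=None):
    """Keep rows released within the last `years` years and rated at least min_rating."""
    current_year = (today or date.today()).year
    recent = df[year_col] >= current_year - years     # NaN years compare as False
    well_rated = df[rating_col] >= min_rating         # NaN ratings compare as False
    return df[recent & well_rated]

# Made-up rows using the same column names as movies_df
sample_df = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Release Year": [2021, 2022, 2024, 2023, None],
    "IMDB Rating": [8.0, 6.5, 7.0, 7.4, 9.0],
})
# `today` is pinned only so the example is reproducible
print(filter_recent_and_rated(sample_df, "Release Year", "IMDB Rating", today=date(2024, 6, 1)))
```

For the TV show DataFrame the column names would be `'TV shows Release Year'` and `'TV shows IMDB Rating'`.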

# **Dataset Drive Link (View Access with Anyone) -**

# ***Congratulations!!! You have completed your Assignment.***