# **Web Scraping & Data Handling Challenge**



### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
   - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies,
     or a specific genre, etc.) to ensure a diverse dataset.
```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the largest number of offerings.
   ```

**3. Data Export:**

```
   - Export the filtered and analysed data to a CSV file for further processing and reporting.

   - Keep the CSV file in your Drive folder and share the Drive link in the Colab notebook, with view access granted to anyone who has the link.
```

**Submission:**
```
- Submit a link to the Colab notebook you created for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your code should be free of errors and should run end to end in one go.

- Place your dataset Drive link in the notebook, just before the conclusion.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.








# **Start The Project**

## **Task 1:- Web Scraping**

In [2]:
# Installing all necessary libraries
!pip install bs4
!pip install requests



In [3]:
# Import all necessary libraries
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import time

## **Scraping Movies Data**

In [4]:
# Specifying the URL from which movies related data will be fetched
url='https://www.justwatch.com/in/movies?release_year_from=2000'

# Sending an HTTP GET request to the URL
response = requests.get(url)
# Parsing the HTML content using BeautifulSoup with the 'html.parser'
soup=BeautifulSoup(response.content,'html.parser')
# Printing the prettified HTML content
print(soup.prettify())

<!DOCTYPE html>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>
 403
</title>
403 Forbidden
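
The `403 Forbidden` page above means the server refused the request, so every later `find_all` call runs against an error page and returns nothing. A minimal sketch of a guard that stops early in that case follows. The names `PageFetchError` and `ensure_page_ok` are hypothetical; the helper only inspects the values that a `requests` response exposes as `status_code` and `reason`.

```python
class PageFetchError(RuntimeError):
    """Raised when a page could not be fetched successfully."""


def ensure_page_ok(status_code, reason=""):
    """Raise PageFetchError unless the HTTP status code is in the 2xx range.

    Hypothetical helper: pass in response.status_code and response.reason.
    """
    if not 200 <= status_code < 300:
        raise PageFetchError(f"Request failed with HTTP {status_code} {reason}".strip())
    return True


# Example with the status seen in the output above
try:
    ensure_page_ok(403, "Forbidden")
except PageFetchError as err:
    message = str(err)
    print(message)
```

`requests` also provides `response.raise_for_status()`, which raises an `HTTPError` for 4xx and 5xx responses and could be used instead of a custom helper.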



## **Fetching Movie URLs**

In [5]:
# Collect the absolute URL of every movie linked from the listing page
list_of_links = []
main_url = r'https://www.justwatch.com'
anchor_tags = soup.find_all('a', class_ = 'title-list-grid__item--link')
for tag in anchor_tags:
  href = tag.get('href')
  if href:  # skip anchors that have no href attribute
    list_of_links.append(main_url + href)

print(list_of_links)
print(len(list_of_links))

[]
0
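
Link building can also be sketched with the standard library, which handles relative paths and makes it easy to skip missing or repeated `href` values. This is an illustrative sketch only: `build_links` is a hypothetical helper and the sample paths are made up.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.justwatch.com"


def build_links(hrefs, base_url=BASE_URL):
    """Turn raw href values into unique absolute URLs, keeping first-seen order."""
    links = []
    seen = set()
    for href in hrefs:
        if not href:  # anchors without an href yield None
            continue
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Made-up href values, for illustration only
sample_hrefs = ["/in/movie/example-one", None, "/in/movie/example-one", "/in/movie/example-two"]
links = build_links(sample_hrefs)
print(links)
```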


## **Scraping Movie Title**

In [6]:
# Fetch each movie page and read its title from the <h1> tag
list_of_Movie_Title = []
for link in list_of_links:
  try:
    response = requests.get(link)

    # # Check if the response status code is 429 (Too Many Requests)
    # if response.status_code == 429:
    #   # If a 429 error is encountered, wait 5 seconds and retry once
    #   time.sleep(5)
    #   response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')
    title_tag = soup.find('h1')

    if title_tag:
      # Keep only the text before the first '('
      title = title_tag.text.split('(')[0].strip()
      list_of_Movie_Title.append(title)

    else:
      list_of_Movie_Title.append('Title Not Found')

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_Movie_Title.append('Title Not Found')

  time.sleep(3)
print(list_of_Movie_Title)
print(len(list_of_Movie_Title))

[]
0


## **Scraping Release Year**

In [7]:
# Fetch each movie page and read the release year from the muted span
list_of_release_year = []
for link in list_of_links:
  try:
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')

    release_tag = soup.find('span', class_ = 'text-muted')
    if release_tag:
      release_year = release_tag.text.strip()
      # Characters 1 to 4 hold the year when the text looks like '(2023)'
      list_of_release_year.append(release_year[1:5])
    else:
      # Append a placeholder so this list stays the same length as the others
      list_of_release_year.append('Release Year Not Found')

  except Exception as e:
    print(e)
    list_of_release_year.append('Release Year Not Found')

  time.sleep(3)
print(list_of_release_year)
print(len(list_of_release_year))

[]
0
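
Slicing with `[1:5]` depends on the text having exactly the shape `(YYYY)`. A slightly more tolerant way to pull out the year is sketched below with a regular expression; `extract_year` is a hypothetical helper, and the sample strings are assumptions about what the span might contain.

```python
import re

# Four-digit year starting with 19 or 20, bounded so longer numbers do not match
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_year(text):
    """Return the first four-digit year found in text as an int, or None."""
    match = YEAR_PATTERN.search(text or "")
    return int(match.group(0)) if match else None


print(extract_year("(2023)"))
print(extract_year("Release Year Not Found"))
```

Returning `None` rather than a placeholder string keeps the column easy to convert to a numeric dtype later.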


## **Scraping Genres**

In [8]:
# Fetch each movie page and read the value listed under the 'Genres' subheading
list_of_Genre = []
for link in list_of_links:
  try:
    response = requests.get(link)

    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
      # If a 429 error is encountered, wait 5 seconds and retry once
      time.sleep(5)
      response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    genre_tag = soup.find('h3', class_ = 'detail-infos__subheading', string = 'Genres')
    if genre_tag:
      genre = genre_tag.find_next('div', class_ = 'detail-infos__value')
      list_of_Genre.append(genre.text.strip())
    else:
      list_of_Genre.append('Genre Not Found')
  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_Genre.append('Genre Not Found')

print(list_of_Genre)
print(len(list_of_Genre))

[]
0
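
The 429 handling in these cells waits a fixed 5 seconds and retries once. A retry with an increasing delay could be sketched as follows. This is a minimal illustration, not the notebook's method: `get_with_backoff` is a hypothetical helper, `fetch` stands for any callable such as `requests.get`, and `FakeResponse` exists only so the example runs without network access.

```python
import time


def get_with_backoff(fetch, url, max_retries=3, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url), retrying on HTTP 429 with a doubling delay.

    fetch is any callable returning an object with a status_code attribute,
    for example requests.get.
    """
    response = fetch(url)
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        sleep(base_delay * (2 ** attempt))  # 5s, 10s, 20s, ...
        response = fetch(url)
    return response


class FakeResponse:
    """Stand-in for a real HTTP response, used only in this demo."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate two rate-limited answers followed by a success
statuses = iter([429, 429, 200])
waits = []
result = get_with_backoff(
    lambda url: FakeResponse(next(statuses)),
    "https://example.com",
    sleep=waits.append,  # record the delays instead of actually sleeping
)
print(result.status_code, waits)
```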


## **Scraping IMDb Rating**

In [9]:
# Fetch each movie page and read the IMDb score from the rating listing
list_of_IMDB_rating = []
for link in list_of_links:
  try:
    response = requests.get(link)

    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
      # If a 429 error is encountered, wait 5 seconds and retry once
      time.sleep(5)
      response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    IMDB_Tag = soup.find_all('div', class_ = 'jw-scoring-listing__rating')
    # The second rating block is read, so at least two blocks must be present
    if len(IMDB_Tag) > 1:
      rating = IMDB_Tag[1].text.split('(')[0].strip()
      list_of_IMDB_rating.append(rating)
    else:
      list_of_IMDB_rating.append('Rating Not Found')

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_IMDB_rating.append('Rating Not Found')

print(list_of_IMDB_rating)
print(len(list_of_IMDB_rating))

[]
0
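
The rating is stored as text, so it has to be converted before any numeric filter can use it. A small sketch of that conversion is below; `parse_rating` is a hypothetical helper, and the sample format `"7.8 (120k)"` is an assumption inferred from the `split('(')` call in the cell above.

```python
import re


def parse_rating(text):
    """Return the first decimal number in text as a float, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", text or "")
    return float(match.group(0)) if match else None


print(parse_rating("7.8 (120k)"))
print(parse_rating("Rating Not Found"))
```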


## **Scraping Runtime/Duration**

In [10]:
# Fetch each movie page and read the value listed under the 'Runtime' subheading
list_of_runtime = []

for link in list_of_links:
  try:
    response = requests.get(link)

    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
      # If a 429 error is encountered, wait 5 seconds and retry once
      time.sleep(5)
      response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    runtime_tag = soup.find('h3', class_ = 'detail-infos__subheading', string = 'Runtime')
    if runtime_tag:
      runtime = runtime_tag.find_next('div', class_ = 'detail-infos__value')
      list_of_runtime.append(runtime.text.strip())

    else:
      list_of_runtime.append('Runtime Not Found')

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_runtime.append('Runtime Not Found')

print(list_of_runtime)
print(len(list_of_runtime))

[]
0


## **Scraping Age Rating**

In [11]:
# Fetch each movie page and read the value listed under the 'Age rating' subheading
list_of_Age = []
for link in list_of_links:
  try:
    response = requests.get(link)

    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
      # If a 429 error is encountered, wait 5 seconds and retry once
      time.sleep(5)
      response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    Age_rating = soup.find('h3', class_ = 'detail-infos__subheading', string = 'Age rating')
    if Age_rating:
      Age = Age_rating.find_next('div', class_ = 'detail-infos__value')
      list_of_Age.append(Age.text.strip())

    else:
      list_of_Age.append('Age Rating Not Found')

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_Age.append('Age Rating Not Found')

print(list_of_Age)
print(len(list_of_Age))

[]
0


## **Fetching Production Countries Details**

In [12]:
# Fetch each movie page and read the value listed under the 'Production country' subheading
list_of_Production_Countries = []
for link in list_of_links:
  try:
    response = requests.get(link)

    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
      # If a 429 error is encountered, wait 5 seconds and retry once
      time.sleep(5)
      response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    countries_tag = soup.find('h3', class_ = 'detail-infos__subheading', string = ' Production country ')
    if countries_tag:
      country = countries_tag.find_next('div', class_ = 'detail-infos__value')
      list_of_Production_Countries.append(country.text.strip())

    else:
      list_of_Production_Countries.append("Country Not Found")

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_Production_Countries.append("Country Not Found")

print(list_of_Production_Countries)
print(len(list_of_Production_Countries))

[]
0


## **Fetching Streaming Service Details**

In [13]:
# Fetch each movie page and read the provider name from the offer icon's alt text
list_of_Streaming_Service = []
for link in list_of_links:
  try:
    response = requests.get(link)

    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
      # If a 429 error is encountered, wait 5 seconds and retry once
      time.sleep(5)
      response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    # soup.find returns only the first matching icon, so one service is recorded per title
    streaming_service = soup.find('img', class_ = 'offer__icon')
    if streaming_service:
      streaming  = streaming_service.get('alt')
      list_of_Streaming_Service.append(streaming)

    else:
      list_of_Streaming_Service.append('Streaming Service Not Found')

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_Streaming_Service.append('Streaming Service Not Found')

print(list_of_Streaming_Service)
print(len(list_of_Streaming_Service))


[]
0


## **Now Creating Movies DataFrame**

In [14]:
# Assemble the per-field lists into a single movies DataFrame
movie_df = pd.DataFrame()
movie_df['Title'] =  list_of_Movie_Title
movie_df['Release Year'] = list_of_release_year
movie_df['Genres'] = list_of_Genre
movie_df['IMDB-Rating'] = list_of_IMDB_rating
movie_df['Runtime'] = list_of_runtime
movie_df['Age Rating'] = list_of_Age
movie_df['Production Country'] = list_of_Production_Countries
movie_df['Streaming Service'] = list_of_Streaming_Service
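
Each field above is collected in its own loop, so every detail page is requested once per field and the lists must all end up the same length for the DataFrame to build. One alternative, sketched here under assumptions, is to fetch each page once and build one record per link. `scrape_records`, `fetch_html`, and the parser callables are hypothetical names, and the fake pages exist only so the example runs offline.

```python
def scrape_records(links, fetch_html, parsers):
    """Fetch each link once and apply every field parser to the same HTML.

    fetch_html: callable(url) -> HTML string (hypothetical; could wrap requests.get)
    parsers: dict mapping column name -> callable(html) -> value
    """
    records = []
    for link in links:
        record = {"URL": link}
        try:
            html = fetch_html(link)
        except Exception as exc:
            print(f"Could not fetch {link}: {exc}")
            html = None
        for column, parse in parsers.items():
            try:
                record[column] = parse(html) if html is not None else None
            except Exception:
                record[column] = None  # a failed field never shifts the other columns
        records.append(record)
    return records


# Fake pages standing in for real HTML, for illustration only
fake_pages = {
    "https://example.com/a": "Title A|2023",
    "https://example.com/b": "Title B|2019",
}
parsers = {
    "Title": lambda html: html.split("|")[0],
    "Release Year": lambda html: int(html.split("|")[1]),
}
records = scrape_records(list(fake_pages), fake_pages.__getitem__, parsers)
print(records)
```

Passing the resulting list of dictionaries to `pd.DataFrame(records)` gives one row per link, with missing values where a page or field failed.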

## **Scraping TV Show Data**

In [15]:
# Specifying the URL from which tv show related data will be fetched
tv_url='https://www.justwatch.com/in/tv-shows?release_year_from=2000'
# Sending an HTTP GET request to the URL
response = requests.get(tv_url)
# Parsing the HTML content using BeautifulSoup with the 'html.parser'
soup=BeautifulSoup(response.content,'html.parser')
# Printing the prettified HTML content
print(soup.prettify())

<!DOCTYPE html>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>
 403
</title>
403 Forbidden



## **Fetching TV Show URL Details**

In [16]:
# Collect the absolute URL of every TV show linked from the listing page
list_of_links = []
main_Tv_Url = 'https://www.justwatch.com'
anchor_Tv_tags = soup.find_all('a', class_ = 'title-list-grid__item--link')
for tag in anchor_Tv_tags:
  href = tag.get('href')
  if href:  # skip anchors that have no href attribute
    list_of_links.append(main_Tv_Url + href)
print(list_of_links)
print(len(list_of_links))

[]
0


## **Fetching TV Show Title Details**

In [17]:
# Fetch each TV show page and read its title from the <h1> tag
list_of_Tv_title = []
for link in list_of_links:
  try:
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')

    tv_title_tag = soup.find('h1')
    if tv_title_tag:
      title = tv_title_tag.text.strip()
      list_of_Tv_title.append(title)

    else:
      list_of_Tv_title.append('Title Not Found')

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_Tv_title.append('Title Not Found')

  time.sleep(3)

print(list_of_Tv_title)
print(len(list_of_Tv_title))

[]
0


## **Fetching Release Year**

In [18]:
# Fetch each TV show page and read the release year from the muted span
list_of_Tv_Release_year = []
for link in list_of_links:
  try:
    response = requests.get(link)

    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
      # If a 429 error is encountered, wait 5 seconds and retry once
      time.sleep(5)
      response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    tv_release_year = soup.find('span', class_ = 'text-muted')
    if tv_release_year:
      release_year = tv_release_year.text.strip()
      # Characters 1 to 4 hold the year when the text looks like '(2023)'
      list_of_Tv_Release_year.append(release_year[1:5])
    else:
      list_of_Tv_Release_year.append('Release Year Not Found')

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_Tv_Release_year.append('Release Year Not Found')

  # time.sleep(3)

print(list_of_Tv_Release_year)
print(len(list_of_Tv_Release_year))

[]
0


## **Fetching TV Show Genre Details**

In [19]:
# Fetch each TV show page and read the value listed under the 'Genres' subheading
list_of_tv_Genre = []
for link in list_of_links:
  try:
    response = requests.get(link)

    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
      # If a 429 error is encountered, wait 5 seconds and retry once
      time.sleep(5)
      response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    tv_genre = soup.find('h3', class_ = 'detail-infos__subheading', string = 'Genres')
    if tv_genre:
      genres = tv_genre.find_next('div', class_ = 'detail-infos__value')
      list_of_tv_Genre.append(genres.text.strip())

    else:
      list_of_tv_Genre.append('Genre Not Found')

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_tv_Genre.append('Genre Not Found')

print(list_of_tv_Genre)
print(len(list_of_tv_Genre))

[]
0


## **Fetching IMDB Rating Details**

In [20]:
# Fetch each TV show page and read the IMDb score from the rating listing
list_of_tv_imdb = []
for link in list_of_links:
  try:
    response = requests.get(link)

    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
      # If a 429 error is encountered, wait 5 seconds and retry once
      time.sleep(5)
      response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    tv_imdb = soup.find_all('div', class_ = 'jw-scoring-listing__rating')
    # The second rating block is read, so at least two blocks must be present
    if len(tv_imdb) > 1:
      imdb = tv_imdb[1].text.split('(')[0].strip()
      list_of_tv_imdb.append(imdb)

    else:
      list_of_tv_imdb.append('IMDB Not Found')

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_tv_imdb.append('IMDB Not Found')

print(list_of_tv_imdb)
print(len(list_of_tv_imdb))

[]
0


## **Fetching Age Rating Details**

In [21]:
# Fetch each TV show page and read the value listed under the 'Age rating' subheading
list_of_tv_Age = []
for link in list_of_links:
  try:
    response = requests.get(link)

    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
      # If a 429 error is encountered, wait 5 seconds and retry once
      time.sleep(5)
      response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    tv_age = soup.find('h3', class_ = 'detail-infos__subheading', string = 'Age rating')
    if tv_age:
      age = tv_age.find_next('div', class_ = 'detail-infos__value')
      list_of_tv_Age.append(age.text.strip())

    else:
      list_of_tv_Age.append('Age Rating Not Found')

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_tv_Age.append('Age Rating Not Found')

print(list_of_tv_Age)
print(len(list_of_tv_Age))

[]
0


## **Fetching Production Country details**

In [22]:
# Fetch each TV show page and read the value listed under the 'Production country' subheading
list_of_tv_countries = []
for link in list_of_links:
  try:
    response = requests.get(link)

    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
      # If a 429 error is encountered, wait 5 seconds and retry once
      time.sleep(5)
      response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    tv_country = soup.find('h3', class_ = 'detail-infos__subheading', string = ' Production country ')
    if tv_country:
      countries = tv_country.find_next('div', class_ = 'detail-infos__value')
      list_of_tv_countries.append(countries.text.strip())

    else:
      list_of_tv_countries.append('Production Country Not Found')

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_tv_countries.append('Production Country Not Found')

print(list_of_tv_countries)
print(len(list_of_tv_countries))

[]
0


## **Fetching Streaming Service details**

In [23]:
# Fetch each TV show page and read the provider name from the offer icon's alt text
list_of_tv_streaming = []
for link in list_of_links:
  try:
    response = requests.get(link)

    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
      # If a 429 error is encountered, wait 5 seconds and retry once
      time.sleep(5)
      response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    # soup.find returns only the first matching icon, so one service is recorded per title
    tv_streaming = soup.find('img', class_ = 'offer__icon')
    if tv_streaming:
      streaming = tv_streaming.get('alt')
      list_of_tv_streaming.append(streaming)

    else:
      list_of_tv_streaming.append('Streaming Service Not Found')

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_tv_streaming.append('Streaming Service Not Found')

print(list_of_tv_streaming)
print(len(list_of_tv_streaming))

[]
0


## **Fetching Duration Details**

In [24]:
# Fetch each TV show page and read the value listed under the 'Runtime' subheading
list_of_tv_runtime = []
for link in list_of_links:
  try:
    response = requests.get(link)

    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
      # If a 429 error is encountered, wait 5 seconds and retry once
      time.sleep(5)
      response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    tv_runtime = soup.find('h3', class_ = 'detail-infos__subheading', string = 'Runtime')
    if tv_runtime:
      runtime = tv_runtime.find_next('div', class_ = 'detail-infos__value')
      list_of_tv_runtime.append(runtime.text.strip())

    else:
      list_of_tv_runtime.append('Runtime Not Found')

  except Exception as e:
    print(e)
    # Keep this list aligned with list_of_links when a request fails
    list_of_tv_runtime.append('Runtime Not Found')

print(list_of_tv_runtime)
print(len(list_of_tv_runtime))

[]
0


## **Creating TV Show DataFrame**

In [25]:
# Assemble the per-field lists into a single TV shows DataFrame
Tv_Show_df = pd.DataFrame()
Tv_Show_df['Title'] = list_of_Tv_title
Tv_Show_df['Release Year'] = list_of_Tv_Release_year
Tv_Show_df['Genres'] = list_of_tv_Genre
Tv_Show_df['IMDB-Rating'] = list_of_tv_imdb
Tv_Show_df['Age Rating'] = list_of_tv_Age
Tv_Show_df['Production Country'] = list_of_tv_countries
Tv_Show_df['Streaming Service'] = list_of_tv_streaming
Tv_Show_df['Runtime'] = list_of_tv_runtime

In [26]:
Tv_Show_df

Unnamed: 0,Title,Release Year,Genres,IMDB-Rating,Age Rating,Production Country,Streaming Service,Runtime


## **Task 2 :- Data Filtering & Analysis**

In [27]:
# Task 2 filters:
# - Only include movies and TV shows released in the last 2 years (from the current date).
# - Only include movies and TV shows with an IMDb rating of 7 or higher.

# Work on a copy so the scraped movie_df stays unchanged
df = movie_df.copy()
df.head()

Unnamed: 0,Title,Release Year,Genres,IMDB-Rating,Runtime,Age Rating,Production Country,Streaming Service


In [28]:
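from datetime import datetime

# Convert Release Year from object to a numeric dtype; non-numeric placeholders
# such as 'Release Year Not Found' become NaN instead of raising an error
df['Release Year'] = pd.to_numeric(df['Release Year'], errors='coerce')
current_year = datetime.now().year

# Keep movies released in the current or the previous calendar year
recent_movie = df[df['Release Year'] > current_year - 2].copy()
recent_movie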

Unnamed: 0,Title,Release Year,Genres,IMDB-Rating,Runtime,Age Rating,Production Country,Streaming Service


In [29]:
# - Only include movies with an IMDb rating of 7 or higher.
recent_movie = recent_movie.copy()  # avoid assigning into a slice of df
# Non-numeric placeholders such as 'Rating Not Found' become NaN and are filtered out
recent_movie['IMDB-Rating'] = pd.to_numeric(recent_movie['IMDB-Rating'], errors='coerce')
Filter_Movies = recent_movie[recent_movie['IMDB-Rating'] >= 7]
Filter_Movies

Unnamed: 0,Title,Release Year,Genres,IMDB-Rating,Runtime,Age Rating,Production Country,Streaming Service


In [30]:
# - Only include TV shows released in the last 2 years (from the current date).
df_Tv = Tv_Show_df.copy()

# Convert the TV shows' own Release Year column (not the movies' column in df)
df_Tv['Release Year'] = pd.to_numeric(df_Tv['Release Year'], errors='coerce')
tv_current_year = datetime.now().year

recent_tv_show = df_Tv[df_Tv['Release Year'] > tv_current_year - 2].copy()
recent_tv_show

Unnamed: 0,Title,Release Year,Genres,IMDB-Rating,Age Rating,Production Country,Streaming Service,Runtime


In [31]:
# - Only include TV shows with an IMDb rating of 7 or higher.
recent_tv_show = recent_tv_show.copy()  # avoid assigning into a slice of df_Tv
# Non-numeric placeholders such as 'IMDB Not Found' become NaN and are filtered out
recent_tv_show['IMDB-Rating'] = pd.to_numeric(recent_tv_show['IMDB-Rating'], errors='coerce')
Filter_Tv_Show = recent_tv_show[recent_tv_show['IMDB-Rating'] >= 7]
Filter_Tv_Show

Unnamed: 0,Title,Release Year,Genres,IMDB-Rating,Age Rating,Production Country,Streaming Service,Runtime
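
Because every DataFrame above is empty, the two filters never run on real rows. The sketch below applies the same two criteria to a small made-up frame so the behaviour can be checked. `filter_recent_high_rated` is a hypothetical helper, the sample rows are invented, and `current_year` is passed explicitly only to make the example reproducible.

```python
from datetime import datetime

import pandas as pd


def filter_recent_high_rated(frame, min_rating=7.0, years=2, current_year=None):
    """Keep rows released within the last `years` calendar years and rated >= min_rating."""
    current_year = current_year or datetime.now().year
    out = frame.copy()
    # Placeholder strings turn into NaN, and comparisons with NaN are False
    out["Release Year"] = pd.to_numeric(out["Release Year"], errors="coerce")
    out["IMDB-Rating"] = pd.to_numeric(out["IMDB-Rating"], errors="coerce")
    mask = (out["Release Year"] > current_year - years) & (out["IMDB-Rating"] >= min_rating)
    return out[mask].reset_index(drop=True)


# Invented rows, for illustration only
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Release Year": ["2024", "2024", "2019", "Release Year Not Found"],
    "IMDB-Rating": ["8.1", "Rating Not Found", "9.0", "7.5"],
})
filtered = filter_recent_high_rated(sample, current_year=2024)
print(filtered)
```

Only title `A` passes both conditions: `B` has no rating, `C` is too old, and `D` has no usable release year.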


## **Calculating Mean IMDB Ratings for both Movies and Tv Shows**

In [32]:
# Mean IMDb rating for movies
# Missing ratings become NaN, which mean() skips; replacing them with 0 would lower the average
df['IMDB-Rating'] = pd.to_numeric(df['IMDB-Rating'], errors='coerce')
movies_mean = df['IMDB-Rating'].mean()
movies_mean

nan

In [33]:
# Mean IMDb rating for TV shows (computed from df_Tv, not from the movies frame df)
df_Tv['IMDB-Rating'] = pd.to_numeric(df_Tv['IMDB-Rating'], errors='coerce')
tv_shows_mean = df_Tv['IMDB-Rating'].mean()
tv_shows_mean

nan

## **Analyzing Top Genres**



In [34]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = " ".join(df['Genres'])
wordcloud = WordCloud(width = 1000, height = 600, background_color = 'White').generate(text)

plt.figure(figsize = (8,5))
plt.imshow(wordcloud)

plt.axis('off')
plt.show()

ValueError: We need at least 1 word to plot a word cloud, got 0.

In [None]:
# top Genres of Tv show
text = " ".join(df_Tv['Genres'])
wordcloud = WordCloud(width = 1000, height = 600, background_color = 'White').generate(text)

plt.figure(figsize = (8,5))
plt.imshow(wordcloud)

plt.axis('off')
plt.show()
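
A word cloud shows genres visually but does not give the counts needed to name a top 5, and it splits multi-word genres into separate words. A counting approach is sketched below with the standard library. `top_genres` is a hypothetical helper, the sample strings are invented, and the comma-separated format of each genre entry is an assumption about the scraped text.

```python
from collections import Counter


def top_genres(genre_strings, n=5, missing="Genre Not Found"):
    """Count individual genres across comma-separated entries and return the n most common."""
    counts = Counter()
    for entry in genre_strings:
        if not entry or entry == missing:
            continue  # ignore titles whose genre could not be scraped
        for genre in entry.split(","):
            genre = genre.strip()
            if genre:
                counts[genre] += 1
    return counts.most_common(n)


# Invented genre strings, for illustration only
sample_genres = ["Drama, Action & Adventure", "Drama", "Comedy, Drama", "Genre Not Found", "Comedy"]
print(top_genres(sample_genres))
```

Applied to the notebook's data, the input would be the `Genres` columns of `df` and `df_Tv` combined.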

## **Finding Predominant Streaming Service**

In [None]:
# Count how many scraped movies are listed for each streaming service, most common first
streaming_service_count = df.groupby('Streaming Service').size().reset_index(name = 'Count')
streaming_service_count_sorted = streaming_service_count.sort_values(by = 'Count', ascending = False)
streaming_service_count_sorted

In [None]:
# Let's visualize it using a word cloud
text = " ".join(df['Streaming Service'])
wordcloud = WordCloud(width = 1000, height = 600, background_color = 'White').generate(text)
plt.figure(figsize = (8,5))
plt.imshow(wordcloud)

plt.axis('off')
plt.show()

## **Task 3 :- Data Export**

In [None]:
# Saving the full movies DataFrame in CSV format
df.to_csv("Original_DataFrame.csv", index = False)
print('Export Successfully')

In [None]:
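# Saving the filtered movies and TV shows in CSV format
Filter_Movies.to_csv('Filter_All_Movies.csv', index = False)
# The TV file name below is a suggested choice, mirroring the movies file name
Filter_Tv_Show.to_csv('Filter_All_Tv_Shows.csv', index = False)
print('Export Successfully')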