Student name: Bimal Kandel

Write a program in Python to scrape IMDb using the following method:

1. Web scraping:
Use a web scraping tool (like BeautifulSoup or Scrapy) to extract IMDb movie reviews.
Ensure you gather relevant fields (e.g., review, name, date).

2. Data Cleaning:
Remove irrelevant data, empty reviews, and duplicates.
Handle missing data appropriately.

3. Data Transformation (Optional/Advanced):
Convert all text to lowercase.
Optionally, perform text tokenization or stemming for future analysis.

4. Prepare Data for Analysis:
Store cleaned data in a structured format (CSV).
Create a Jupyter notebook documenting your process with code, explanations, and necessary visualizations.

5. Report:
Summarize the steps taken in data acquisition, cleaning, and transformation.



This project is based on web scraping IMDb's 100 top-ranking celebrities.

In [120]:
# Step 1: Importing required libraries
import requests
import json
from bs4 import BeautifulSoup
import pandas as pd


Step 2: Installing the NLTK library for tokenization and stemming

In [105]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


Step 3: Importing the nltk library along with its tokenizer and stemmer

In [107]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer


Step 4: Load the URL and send a GET request. If the request fails, try setting a proper User-Agent header so that the request looks like it comes from a browser.

In [108]:
# A plain request without browser-like headers is rejected with 403 Forbidden
url = 'https://www.imdb.com/chart/starmeter/?ref_=nv_cel_m'
resp = requests.get(url)
print(resp)

<Response [403]>


Step 5: Setting a User-Agent header so that the request looks like a browser request.

In [109]:
import requests
from bs4 import BeautifulSoup

# Define a function for web scraping
def scrape_imdb_starmeter(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    # Send a GET request to fetch the webpage content with headers
    response = requests.get(url, headers=headers)
    
    # Print the HTTP status code for debugging
    print(f"HTTP Status Code: {response.status_code}")
    
    # Check if the response is successful
    if response.status_code != 200:
        print("Failed to retrieve the webpage. Returning None.")
        return None

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Initialize lists to store data
    celebrity_names = []
    ranks = []
    
    # Find relevant data: each celebrity is within a td tag with class 'name'
    for i, celeb in enumerate(soup.find_all('td', class_='name'), 1):
        name = celeb.a.text.strip()  # Extract the celebrity name and clean up white spaces
        celebrity_names.append(name)
        ranks.append(i)  # The order in the list implies ranking
    
    # Return the lists of celebrity names and ranks
    return celebrity_names, ranks

# URL for IMDb Most Popular Celebrities
url = 'https://www.imdb.com/chart/starmeter/?ref_=nv_cel_m'

# Call the function and get the result
result = scrape_imdb_starmeter(url)

# Check if the result is not None before unpacking
if result:
    celebrity_names, ranks = result
    # Print the scraped data
    print(celebrity_names)
    print(ranks)
else:
    print("No data was retrieved.")


HTTP Status Code: 200
[]
[]
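
The empty lists show that the selector `td` with class `name` matched nothing in the returned HTML, even though the request itself succeeded. A minimal sketch of how one might check which elements a page actually contains before committing to a selector is given below. The `sample_html` string and the `count_matches` helper are hypothetical stand-ins for illustration, not the real IMDb markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page: it has no <td class="name"> cells,
# but it does carry a JSON-LD <script> block.
sample_html = """
<html><head>
<script type="application/ld+json">{"@type": "ItemList", "itemListElement": []}</script>
</head><body><ul><li>Example Person</li></ul></body></html>
"""

def count_matches(html, tag, **attrs):
    """Return how many elements in `html` match the given tag and attributes."""
    soup = BeautifulSoup(html, 'html.parser')
    return len(soup.find_all(tag, **attrs))

table_cells = count_matches(sample_html, 'td', class_='name')
json_ld_blocks = count_matches(sample_html, 'script', type='application/ld+json')
print(f"td.name cells: {table_cells}, JSON-LD blocks: {json_ld_blocks}")
```

A count of zero for the table cells alongside a non-zero count for the JSON-LD block points toward reading the structured data instead, which is the approach taken in the later steps.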


Step 6: Confirming that the page is fetched successfully once the 403 error is resolved.

In [110]:
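# Fetch the page at module level so that `response` is available to later cells
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    print("Successfully fetched the page!")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")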


Successfully fetched the page!


Step 7: Parse the HTML content with Beautiful Soup

In [121]:

soup = BeautifulSoup(response.content, 'html.parser')

In [122]:
# Print the first 2000 characters of the parsed HTML to verify
print(soup.prettify()[:2000])

<!DOCTYPE html>
<html lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width" name="viewport"/>
  <script>
   if(typeof uet === 'function'){ uet('bb', 'LoadTitle', {wb: 1}); }
  </script>
  <script>
   window.addEventListener('load', (event) => {
        if (typeof window.csa !== 'undefined' && typeof window.csa === 'function') {
            var csaLatencyPlugin = window.csa('Content', {
                element: {
                    slotId: 'LoadTitle',
                    type: 'service-call'
                }
            });
            csaLatencyPlugin('mark', 'clickToBodyBegin', 1728859217490);
        }
    })
  </script>
  <title>
   Most Popular Celebs
  </title>
  <meta content="As determined by IMDb users" data-id="main" name="description"/>
  <script type="application/ld+json">
   {"@type":"ItemList","itemListElement":[{"@type":"ListItem","item":{"@type":

Step 8: Extract the JSON-LD data from the <script> tag

In [123]:

script_tag = soup.find('script', type='application/ld+json')


In [124]:
# Load the JSON data
import json

# Initialize lists to store names and URLs
names = []
urls = []

if script_tag:
    data = json.loads(script_tag.string)
    celebrities = data.get("itemListElement", [])
else:
    # Fall back to an empty list so that later cells still run
    celebrities = []
    print("No script tag found.")
    

Step 9: Extracting the top 100 celebrity names and URLs

In [125]:
# Extract celebrity names and URLs
for celeb in celebrities:
    name = celeb['item']['name']
    url = celeb['item']['url']
    print(f'Name: {name}, URL: {url}')
if not celebrities:
    print("No celebrity entries found.")

Name: Adam Brody, URL: https://www.imdb.com/name/nm0111013/
Name: Kris Kristofferson, URL: https://www.imdb.com/name/nm0001434/
Name: Maggie Smith, URL: https://www.imdb.com/name/nm0001749/
Name: Nicholas Alexander Chavez, URL: https://www.imdb.com/name/nm12560173/
Name: Justine Lupe, URL: https://www.imdb.com/name/nm4419771/
Name: Cooper Koch, URL: https://www.imdb.com/name/nm2599408/
Name: Ari Graynor, URL: https://www.imdb.com/name/nm0310966/
Name: Cristin Milioti, URL: https://www.imdb.com/name/nm2129662/
Name: Kristen Bell, URL: https://www.imdb.com/name/nm0068338/
Name: Margaret Qualley, URL: https://www.imdb.com/name/nm4960279/
Name: John Amos, URL: https://www.imdb.com/name/nm0025309/
Name: Eve Hewson, URL: https://www.imdb.com/name/nm2016723/
Name: Jackie Tohn, URL: https://www.imdb.com/name/nm0865626/
Name: Monica Bellucci, URL: https://www.imdb.com/name/nm0000899/
Name: Colin Farrell, URL: https://www.imdb.com/name/nm0268199/
Name: Emma Corrin, URL: https://www.imdb.com/name

In [126]:
for celeb in celebrities:
    names.append(celeb['item']['name'])
    urls.append(celeb['item']['url'])
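
The loops above index `celeb['item']['name']` and `celeb['item']['url']` directly, which raises a `KeyError` if any entry lacks a field. A more defensive variant might look like the sketch below. The function name `extract_celebrities` and the sample payload are hypothetical; only the `itemListElement`, `item`, `name`, and `url` keys are taken from the notebook.

```python
import json

def extract_celebrities(json_ld_text):
    """Return (name, url) pairs from an ItemList JSON-LD string, skipping incomplete entries."""
    data = json.loads(json_ld_text)
    pairs = []
    for entry in data.get("itemListElement", []):
        item = entry.get("item", {})
        name, url = item.get("name"), item.get("url")
        if name and url:  # skip entries with a missing or empty field
            pairs.append((name.strip(), url))
    return pairs

# Hypothetical sample: the second entry has no URL and should be dropped
sample = json.dumps({
    "@type": "ItemList",
    "itemListElement": [
        {"@type": "ListItem", "item": {"name": "Adam Brody", "url": "https://www.imdb.com/name/nm0111013/"}},
        {"@type": "ListItem", "item": {"name": "Example Person"}},
    ],
})
pairs = extract_celebrities(sample)
print(pairs)
```

Returning pairs rather than two parallel lists keeps each name attached to its URL, and the pairs can be passed to `pd.DataFrame(pairs, columns=['Name', 'URL'])` if a DataFrame is needed.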

Step 10: Creating the DataFrame from the names and URLs

In [127]:
# Create a DataFrame using the extracted data
import pandas as pd
df = pd.DataFrame({
    'Name': names,  
    'URL': urls    
})


In [128]:
df.head()

                        Name                                    URL
0                 Adam Brody   https://www.imdb.com/name/nm0111013/
1         Kris Kristofferson   https://www.imdb.com/name/nm0001434/
2               Maggie Smith   https://www.imdb.com/name/nm0001749/
3  Nicholas Alexander Chavez  https://www.imdb.com/name/nm12560173/
4               Justine Lupe   https://www.imdb.com/name/nm4419771/
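
The assignment's data-cleaning step (removing empty entries and duplicates, and handling missing data) could be applied to a DataFrame of this shape roughly as sketched below. The `clean_celebrities` helper and the sample rows, including the `example.com` placeholder URL, are hypothetical and only illustrate one possible approach.

```python
import pandas as pd

# Hypothetical rows shaped like the DataFrame above, with a duplicate and gaps added
raw = pd.DataFrame({
    'Name': ['Adam Brody', 'Maggie Smith', 'Adam Brody', '  ', None],
    'URL': [
        'https://www.imdb.com/name/nm0111013/',
        'https://www.imdb.com/name/nm0001749/',
        'https://www.imdb.com/name/nm0111013/',
        'https://example.com/placeholder/',
        None,
    ],
})

def clean_celebrities(frame):
    """Strip whitespace, drop rows with missing or empty fields, and remove duplicates."""
    cleaned = frame.copy()
    cleaned['Name'] = cleaned['Name'].str.strip()
    cleaned = cleaned.dropna(subset=['Name', 'URL'])   # rows with missing values
    cleaned = cleaned[cleaned['Name'] != '']           # rows whose name was only whitespace
    # Treat the URL as the unique key, since two people may share a name
    return cleaned.drop_duplicates(subset='URL').reset_index(drop=True)

cleaned = clean_celebrities(raw)
print(cleaned)
```

Deduplicating on `URL` rather than `Name` is a design choice: the IMDb profile URL identifies a person uniquely, whereas names can collide.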


Step 11: Data transformation (converting all text to lowercase)

In [113]:
df['Name'] = df['Name'].str.lower()
print(df)

                         Name                                    URL
0                  adam brody   https://www.imdb.com/name/nm0111013/
1          kris kristofferson   https://www.imdb.com/name/nm0001434/
2                maggie smith   https://www.imdb.com/name/nm0001749/
3   nicholas alexander chavez  https://www.imdb.com/name/nm12560173/
4                justine lupe   https://www.imdb.com/name/nm4419771/
..                        ...                                    ...
95               mira sorvino   https://www.imdb.com/name/nm0000227/
96           natalie martinez   https://www.imdb.com/name/nm2358540/
97                 maya hawke   https://www.imdb.com/name/nm1638321/
98     jessica parker kennedy   https://www.imdb.com/name/nm2498781/
99                gavin creel   https://www.imdb.com/name/nm1342128/

[100 rows x 2 columns]


Step 12: Performing text tokenization and stemming for future analysis.

In [129]:
# Optional - Tokenization
import nltk
nltk.download('punkt')
df['Tokenized'] = df['Name'].apply(word_tokenize)
print(df)

                         Name                                    URL  \
0                  Adam Brody   https://www.imdb.com/name/nm0111013/   
1          Kris Kristofferson   https://www.imdb.com/name/nm0001434/   
2                Maggie Smith   https://www.imdb.com/name/nm0001749/   
3   Nicholas Alexander Chavez  https://www.imdb.com/name/nm12560173/   
4                Justine Lupe   https://www.imdb.com/name/nm4419771/   
..                        ...                                    ...   
95               Mira Sorvino   https://www.imdb.com/name/nm0000227/   
96           Natalie Martinez   https://www.imdb.com/name/nm2358540/   
97                 Maya Hawke   https://www.imdb.com/name/nm1638321/   
98     Jessica Parker Kennedy   https://www.imdb.com/name/nm2498781/   
99                Gavin Creel   https://www.imdb.com/name/nm1342128/   

                        Tokenized  
0                   [Adam, Brody]  
1           [Kris, Kristofferson]  
2                 [Maggie, 

[nltk_data] Downloading package punkt to /Users/iambimalk/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [130]:
# Optional - Stemming
ps = PorterStemmer()
df['Stemmed'] = df['Tokenized'].apply(lambda x: [ps.stem(word) for word in x])
print(df)

                         Name                                    URL  \
0                  Adam Brody   https://www.imdb.com/name/nm0111013/   
1          Kris Kristofferson   https://www.imdb.com/name/nm0001434/   
2                Maggie Smith   https://www.imdb.com/name/nm0001749/   
3   Nicholas Alexander Chavez  https://www.imdb.com/name/nm12560173/   
4                Justine Lupe   https://www.imdb.com/name/nm4419771/   
..                        ...                                    ...   
95               Mira Sorvino   https://www.imdb.com/name/nm0000227/   
96           Natalie Martinez   https://www.imdb.com/name/nm2358540/   
97                 Maya Hawke   https://www.imdb.com/name/nm1638321/   
98     Jessica Parker Kennedy   https://www.imdb.com/name/nm2498781/   
99                Gavin Creel   https://www.imdb.com/name/nm1342128/   

                        Tokenized                     Stemmed  
0                   [Adam, Brody]               [adam, brodi]  
1      

Step 13: Saving the DataFrame to a CSV file

In [131]:

df.to_csv('popular_celebrities.csv', index=False)
print("Data saved in CSV format")

Data saved in CSV format
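
One point to keep in mind when reloading this file: list-valued columns such as `Tokenized` and `Stemmed` are written to CSV as their string representation, so they come back as plain strings. The sketch below shows one way to restore them, assuming a small stand-in DataFrame and an in-memory buffer in place of `popular_celebrities.csv`.

```python
import ast
import io

import pandas as pd

# Stand-in DataFrame with one list-valued column, as produced by tokenization
frame = pd.DataFrame({
    'Name': ['Adam Brody'],
    'Tokenized': [['Adam', 'Brody']],
})

# Write to an in-memory buffer instead of a file to keep the sketch side-effect free
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(type(restored.loc[0, 'Tokenized']))  # the list is read back as a string

# ast.literal_eval safely parses the string form back into a Python list
restored['Tokenized'] = restored['Tokenized'].apply(ast.literal_eval)
print(restored.loc[0, 'Tokenized'])
```

If the token lists are needed often, a format that preserves nested types (for example JSON via `DataFrame.to_json`) may be more convenient than CSV.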
