## Web Scraping Using Python and Beautiful Soup to Create a DataFrame for Further Analysis

In this project I will scrape data from the IMDb movie website using the `BeautifulSoup` and `requests` libraries.
The data will be stored in a pandas `DataFrame` for further analysis.

![imdb imdb](imdb2.png)

Import the libraries needed for this project.

In [1]:
# Import the libraries

import pandas as pd  # to create the DataFrame
import numpy as np  # to count the values
from bs4 import BeautifulSoup  # to parse the HTML content
import requests  # to send HTTP requests from Python


Create a variable and store the URL of the page that you want to scrape.

In [2]:
# Assigning the url to a variable named "url"

url = "https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating"

The `requests` module lets you send HTTP requests from Python.
A request returns a `Response` object that holds all the response data (content, encoding, status code, and so on).

In [3]:
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

In [4]:
response

<Response [200]>
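
A `<Response [200]>` means the request succeeded. As a minimal sketch that is not part of the original notebook, the fetch could be wrapped in a small helper that fails loudly on an error status. The function name `fetch_html`, the `client` parameter, the header value, and the 10-second timeout are all assumptions made for illustration.

```python
import requests


def fetch_html(url, client=requests, timeout=10):
    """Return the raw body of `url`, raising on HTTP error statuses.

    `client` is a hypothetical injection point: anything with a
    requests-style `get` method works, which keeps the helper testable offline.
    """
    # Illustrative header value; it is an assumption, not taken from the notebook.
    headers = {"User-Agent": "Mozilla/5.0 (tutorial scraper)"}
    response = client.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.content
```

With such a helper, `BeautifulSoup(fetch_html(url), 'html.parser')` would do the same job as the two lines in cell 3, with an explicit failure if the page cannot be fetched.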

Create an empty `list` for each field, so that the scraped values can be appended to it.

In [5]:
movie_name = []
year = []
time = []
rating = []
metascore = []
votes = []
gross = []
description = []
Director = []
Stars = []

Store the relevant part of the page, one `div` per movie, in the variable `movie_data`.

In [6]:
movie_data = soup.find_all('div', attrs={'class': 'lister-item mode-advanced'})
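
To make the selection step concrete, here is a small offline sketch of the same kind of lookup. The HTML snippet is made up to mimic the class names used above and is not real IMDb markup; the names `sample_html`, `sample_soup`, `cards`, and `titles` are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the class names used above; not real IMDb HTML.
sample_html = """
<div class="lister-item mode-advanced"><h3><a>Movie A</a></h3></div>
<div class="lister-item mode-advanced"><h3><a>Movie B</a></h3></div>
<div class="sidebar"><h3><a>Not a movie</a></h3></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# A multi-class string is compared against the full value of the class attribute.
cards = sample_soup.find_all("div", attrs={"class": "lister-item mode-advanced"})

# Each result is a Tag, so nested elements are reachable by attribute access.
titles = [card.h3.a.text for card in cards]
print(titles)
```

Only the two `div` elements with the matching class are returned, so the sidebar entry is ignored.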

Use a `for` loop to extract each field with the `find` functions, do some light cleaning, and append the results to the lists created above.

In [7]:


for store in movie_data:
    name = store.h3.a.text
    movie_name.append(name)
    
    year_of_release = store.h3.find('span', class_ = "lister-item-year text-muted unbold").text.replace('(','').replace(')','')
    year.append(year_of_release)
    
    runtime = store.p.find('span', class_ = 'runtime').text.replace(' min', '')
    time.append(runtime)
    
    rate = store.find("div", "inline-block ratings-imdb-rating").text.replace('\n','')
    rating.append(rate)
    
    # class_="metascore" matches any span carrying that class, not only "metascore favorable"
    meta_tag = store.find("span", class_="metascore")
    meta = meta_tag.text.strip() if meta_tag else ''
    metascore.append(meta)
    
    value = store.find_all('span', attrs = {'name' : 'nv'})
    
    vote = value[0].text
    votes.append(vote)
    
    # Votes and gross share the same attribute (name="nv"), so both are read from `value` by index
    grosses = value[1].text if len(value) > 1 else '******'
    gross.append(grosses)
    
    # Description of the Movies --
    description_1 = store.find_all("p", class_ = "text-muted")
    description_2 = description_1[1].text.replace("\n",'') if len(description_1) > 1 else "******"
    description.append(description_2)
    
    # Director: the third <p> is expected to hold text of the form "Director: X | Stars: A, B"
    director_1 = store.find_all("p")
    director_2 = director_1[2].text.replace('\n', '').split("|")
    director_3 = director_2[0].split(':')
    director_4 = director_3[1]
    Director.append(director_4)
    
    # Stars: everything after the "Stars:" label in the second half of the same text
    cast = director_2[1].split(':')
    cast_1 = cast[1:]
    Stars.append(cast_1)
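
The director and stars step relies on fixed list positions, which raises an `IndexError` if a part is missing. As an alternative sketch, not the notebook's own method, the same splitting can be done in a helper that looks at the labels instead. The function name `split_credits` and the sample string are assumptions for illustration; the real credits text may be shaped differently.

```python
def split_credits(credits_text):
    """Split 'Director: X | Stars: A, B' into (directors, stars) lists.

    Returns an empty list for any part that is missing instead of raising.
    """
    directors, stars = [], []
    for part in credits_text.replace("\n", "").split("|"):
        # partition() never fails: a missing ':' simply yields empty names.
        label, _, names = part.partition(":")
        cleaned = [n.strip() for n in names.split(",") if n.strip()]
        if label.strip().lower().startswith("director"):
            directors = cleaned
        elif label.strip().lower().startswith("star"):
            stars = cleaned
    return directors, stars


# Assumed shape of the credits paragraph text, for illustration only.
sample = "\n    Director:\nFrank Darabont\n| \n    Stars:\nTim Robbins, \nMorgan Freeman\n"
print(split_credits(sample))
```

Unlike the loop above, this returns the stars as separate names rather than one comma-separated string inside a list.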
    
      
   
    
    

Create a `DataFrame` from the lists using the `pandas` library.

In [8]:
movie_df = pd.DataFrame({
    "Name of movie": movie_name,
    "Year of release": year,
    "Runtime": time,
    "Ratings": rating,
    "Metascore": metascore,
    "Votes": votes,
    "Gross": gross,
    "Discription": description,
    "Director": Director,
    "Stars": Stars,
})

Use `np.count_nonzero()` to count the scraped movie names; the result should equal `100`.

In [9]:
np.count_nonzero(movie_name)

100
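
The count above checks only `movie_name`. `pd.DataFrame` raises a `ValueError` when the lists passed to it differ in length, so it can help to compare all of them at once. The helper below is a minimal sketch; the name `check_equal_lengths` and the toy lists are illustrative.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length; else return that length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # All lengths are equal here; default to 0 when no columns were given.
    return next(iter(lengths.values()), 0)


# Toy lists standing in for the scraped ones.
n_rows = check_equal_lengths({"movie_name": ["A", "B"], "year": ["1994", "1972"]})
print(n_rows)
```

If one field was skipped for some movie, the error message shows which list is short.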

View the first 30 rows of the data.

Feature engineering and data cleaning can now be done before exploring the data for insights, for example in a dashboard or as input to a movie recommendation system.

In [10]:
movie_df.head(30)

| Unnamed: 0 | Name of movie | Year of release | Runtime | Ratings | Metascore | Votes | Gross | Discription | Director | Stars |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jai Bhim | 2021 | 164 | 9.3 | | 173420 | #138 | When a tribal man is arrested for a case of al... | T.J. Gnanavel | [Suriya, Lijo Mol Jose, Manikandan, Rajisha Vi... |
| 1 | The Shawshank Redemption | 1994 | 142 | 9.3 | 80.0 | 2541720 | $28.34M | Two imprisoned men bond over a number of years... | Frank Darabont | [Tim Robbins, Morgan Freeman, Bob Gunton, Will... |
| 2 | The Godfather | 1972 | 175 | 9.2 | 100.0 | 1748755 | $134.97M | The aging patriarch of an organized crime dyna... | Francis Ford Coppola | [Marlon Brando, Al Pacino, James Caan, Diane K... |
| 3 | Soorarai Pottru | 2020 | 153 | 9.1 | | 107193 | #249 | Nedumaaran Rajangam "Maara" sets out to make t... | Sudha Kongara | [Suriya, Paresh Rawal, Aparna Balamurali, Prak... |
| 4 | The Dark Knight | 2008 | 152 | 9.0 | 84.0 | 2491936 | $534.86M | When the menace known as the Joker wreaks havo... | Christopher Nolan | [Christian Bale, Heath Ledger, Aaron Eckhart, ... |
| 5 | The Godfather: Part II | 1974 | 202 | 9.0 | 90.0 | 1212878 | $57.30M | The early life and career of Vito Corleone in ... | Francis Ford Coppola | [Al Pacino, Robert De Niro, Robert Duvall, Dia... |
| 6 | 12 Angry Men | 1957 | 96 | 9.0 | 96.0 | 751005 | $4.36M | The jury in a New York City murder trial is fr... | Sidney Lumet | [Henry Fonda, Lee J. Cobb, Martin Balsam, John... |
| 7 | The Lord of the Rings: The Return of the King | 2003 | 201 | 8.9 | 94.0 | 1752433 | $377.85M | Gandalf and Aragorn lead the World of Men agai... | Peter Jackson | [Elijah Wood, Viggo Mortensen, Ian McKellen, O... |
| 8 | Pulp Fiction | 1994 | 154 | 8.9 | 94.0 | 1955545 | $107.93M | The lives of two mob hitmen, a boxer, a gangst... | Quentin Tarantino | [John Travolta, Uma Thurman, Samuel L. Jackson... |
| 9 | Schindler's List | 1993 | 195 | 8.9 | 94.0 | 1297684 | $96.90M | In German-occupied Poland during World War II,... | Steven Spielberg | [Liam Neeson, Ralph Fiennes, Ben Kingsley, Car... |


In [11]:
movie_df.shape

(100, 10)
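
The cleaning step mentioned earlier can be sketched with a few plain-Python helpers that turn the scraped strings into numbers. The function names are illustrative, and the input formats are assumptions based on the values shown in the table (for example `$28.34M` for gross, with `#138` or `******` where no gross figure was found).

```python
def parse_votes(text):
    """'2,541,720' -> 2541720; returns None if the text is not a whole number."""
    digits = str(text).replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


def parse_gross_millions(text):
    """'$28.34M' -> 28.34; returns None for placeholders such as '******' or '#138'."""
    cleaned = str(text).strip()
    if cleaned.startswith("$") and cleaned.endswith("M"):
        try:
            # Drop the leading '$' and trailing 'M' before converting.
            return float(cleaned[1:-1])
        except ValueError:
            return None
    return None


def parse_runtime_minutes(text):
    """'142' or '142 min' -> 142; returns None if no leading number is present."""
    first = str(text).strip().split(" ")[0]
    return int(first) if first.isdigit() else None


print(parse_votes("2,541,720"), parse_gross_millions("$28.34M"), parse_runtime_minutes("142"))
```

With pandas, such helpers could be applied column by column, for example `movie_df["Gross"].map(parse_gross_millions)`.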

In [12]:
# to_csv returns None when given a path, so the result is not assigned back to movie_df
movie_df.to_csv('movie_df')
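
The `Unnamed: 0` column in the table above is what pandas produces when a CSV that was saved with its index is read back. The sketch below shows the effect on a toy frame, using an in-memory buffer instead of a file; `toy_df` and the other variable names are illustrative.

```python
import io

import pandas as pd

# Toy frame standing in for movie_df.
toy_df = pd.DataFrame({"Name of movie": ["Jai Bhim"], "Ratings": ["9.3"]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

Passing `index=False` when saving is one way to avoid the extra column if the row numbers are not needed.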