# WEB- SCRAPING - BOLLYWOOD - FILMOGRAPHIES - PROJECT

 
Data Source: [Bollywood Filmography](https://en.wikipedia.org/wiki/Category:Indian_filmographies)
![](https://www.studytonight.com/python/web-scraping/images/web-scraping-course.jpg)

## What is web scraping?
 - **Web scraping** is the process of collecting structured web data in an automated fashion. It's also called web data extraction. In general, web data extraction is used by people and businesses who want to make use of the vast amount of publicly available web data to make smarter decisions.


## How does web scraping work?
- Identify the target website.
- Collect URLs of the pages where you want to extract data from.
- Make a request to these URLs to get the HTML of the page.
- Use locators to find the data in the HTML.
- Save the data in a JSON or CSV file or some other structured format.
![](https://www.webharvy.com/images/web%20scraping.png)
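
The steps above can be sketched end to end without touching the network. The snippet below is only an illustration under stated assumptions, not the code used in this project: a small hard-coded HTML string (`SAMPLE_HTML`) stands in for a downloaded page, the standard library's `html.parser` is used as the locator instead of Beautiful Soup, and the CSV is written to an in-memory buffer. The class name `LinkCollector` is made up for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Assumption: a hard-coded snippet stands in for the HTML returned by a request.
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/John_Abraham_filmography">John Abraham filmography</a></li>
  <li><a href="/wiki/Zeenat_Aman_filmography">Zeenat Aman filmography</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collects (link text, href) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are currently inside
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Locate the data in the HTML
collector = LinkCollector()
collector.feed(SAMPLE_HTML)
collector.close()

# Save the data in a structured format (CSV, kept in memory here)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "url"])
writer.writerows(collector.links)
csv_text = buffer.getvalue()
print(csv_text)
```

In the rest of this notebook the same roles are played by `requests` (download), `BeautifulSoup` (locate) and `pandas` (save).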

## About Filmography
- A filmography is a list of films related by some criteria. For example, an actor's career filmography is the list of films they have appeared in; a director's comedy filmography is the list of comedy films directed by a particular director. The term has been in use since at least 1957.





![](https://1.bp.blogspot.com/-rDT18tMdEAY/YFcFTbmnDAI/AAAAAAAAWhE/CimiIVZ19BYdtcV_gfvHd4I4te9kE8COgCLcBGAsYHQ/s1280/All%2BActors%2BBollywood%2BLyricist%2BWeb.jpg)

## Project Idea

In this project I will parse through the filmographies of Bollywood actors and actresses.


I will retrieve information from the page [Bollywood Filmography](https://en.wikipedia.org/wiki/Category:Indian_filmographies) using **web scraping**.

## Project Goal 
The project goal is to build a web scraper that extracts all the desired information and assembles it into CSV files. The format of the output CSV files is shown below:


|#|Actor/Actress Name|Profiles_urls|Images_url|
|-|------------------|-------------|----------|
|0|John Abraham|https://en.wikipedia.org/wiki/|https://en.wikipedia.org/wiki/|
|1|Kajal Aggarwal|https://en.wikipedia.org/wiki/|https://en.wikipedia.org/wiki/|
|187|Prithviraj Sukumaran filmography|https://en.wikipedia.org/wiki/|https://en.wikipedia.org/wiki/|



|#|Year|Title|Role|Actor/Actress name|
|-|----|-----|----|------------------|
|0|2003|Jism|Kabir lal|John Abraham|
|1|2003|Saaya|Dr.Akash "Akki" Bhatnagar|John Abraham|
|223|2015-2016|Daar Sabko Lagta hai|Host/presenter|Bipasha Basu|

## Project Steps 
The outline of the project:

1. Download the webpage using `requests`
2. Parse the HTML source code using the `BeautifulSoup` library and extract the desired information
3. Build the scraper components
4. Compile the extracted information into Python lists and dictionaries
5. Convert the Python dictionaries into `Pandas DataFrames`
6. Write the information to the final CSV files
7. Future work and references


>## Packages Used:
>1. Requests: for downloading the HTML code from the Wikipedia URL
>2. BeautifulSoup4: for parsing and extracting data from the HTML string
>3. Pandas: to gather my data into a DataFrame for further processing

## How to run the code

This tutorial is an executable [Jupyter notebook](https://jupyter.org) hosted on [Jovian](https://www.jovian.ai). You can _run_ this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your computer*.


### Option 1: Running using free online resources (1-click, recommended)
The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Binder**. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on [Google Colab](https://colab.research.google.com) or [Kaggle](https://kaggle.com) to use these platforms.

### Option 2: Running on your computer locally
To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.

>>**Jupyter Notebooks**: This tutorial is a [Jupyter notebook](https://jupyter.org) - a document made of cells. Each cell can contain code written in Python or explanations in plain English. You can execute code cells and view the results, e.g., numbers, messages, graphs, tables, files, etc., instantly within the notebook. Jupyter is a powerful platform for experimentation and analysis. Don't be afraid to mess around with the code & break things - you'll learn a lot by encountering and fixing errors. You can use the "Kernel > Restart & Clear Output" menu option to clear all outputs and start again from the top.

# Let's Start the Project :



>>Note : We will use the `Jovian` library and its `commit()` function throughout the code to save our progress as we move along.

In [1]:
!pip install jovian --upgrade --quiet
import jovian
# Execute this to save new versions of the notebook
jovian.commit(project="WEB- SCRAPING - BOLLYWOOD - FILMOGRAPHIES - PROJECT")

[jovian] Updating notebook "deepakkumawat2120/web-scraping-bollywood-filmographies-project" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/deepakkumawat2120/web-scraping-bollywood-filmographies-project


'https://jovian.ai/deepakkumawat2120/web-scraping-bollywood-filmographies-project'

## Download the webpage using `requests`

>The requests module allows you to send HTTP requests using Python.
>The HTTP request returns a Response Object with all the response data (content, encoding, status, etc).




>We use `pip`, a package-management system, to install and manage software. Since the platform we selected is **Binder**, we have to run a line of code starting with `!pip install` to install `requests`. You will see several more `!pip` lines when we install other packages.

>To use prewritten functions from a library, we use the `import` statement. For example, once we run `import requests` after installation, we can use any function from the `requests` library.

In [2]:
!pip install requests --upgrade  --quiet 
import requests

## **requests.get()**

To **download a web page**, we use `requests.get()` to **send an HTTP request** to the **server hosting the filmographies page** (Wikipedia). The function returns a **response object**, which holds **the HTTP response**.

![](https://nimbus-screenshots.s3.amazonaws.com/s/0734f0696813fdf86885d92ab8455334.png)

In [3]:
Filmography_url = 'https://en.wikipedia.org/wiki/Category:Indian_filmographies'
response = requests.get(Filmography_url)

## Status Code


Now we have to check whether we successfully sent the HTTP request and got an HTTP response back. Because we're NOT using a browser, we get no direct visual feedback when a request fails.

In general, the way to check whether the server sent back a valid HTTP response is the status code. In the requests library, `requests.get` returns a response object, which contains the page contents and a status code indicating whether the HTTP request was successful. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

If the request was successful, `response.status_code` is set to a value between 200 and 299.
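
The 200 to 299 rule can be written as a small helper. This is a minimal sketch: `is_success` is a hypothetical name chosen for this example, and it works on plain integers so it can be tried without sending any request.

```python
from http import HTTPStatus


def is_success(status_code: int) -> bool:
    """Return True for any status code in the 200-299 range."""
    return 200 <= status_code <= 299


for code in (200, 204, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    label = "success" if is_success(code) else "not successful"
    print(code, HTTPStatus(code).phrase, "->", label)
```

With a real response you would call `is_success(response.status_code)`. The requests library also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.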

In [4]:
response.status_code    #Here we are checking the Status code, -> 200-299 will mean that the request was successful

200

The HTTP response contains HTML that is ready to be displayed in a browser. Here we can use `response.text` to retrieve the HTML document.

In [5]:
page_contents = response.text
len(page_contents)    #The `len` function tells us the number of characters in the downloaded HTML

60002

We have 60,002 characters in the HTML that we have just downloaded in a second.

In [6]:
page_contents[:1000]    #This displays the first 1000 characters of `page_contents`

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Category:Indian filmographies - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"ab565daa-d678-4a5e-be05-463a6fcf1a18","wgCSPNonce":false,"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":14,"wgPageName":"Category:Indian_filmographies","wgTitle":"Indian filmographies","wgCurRevisionId":950929752,"wgRevisionId":950929752,"wgArticleId":38075388,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Template Category TOC via CatAutoTOC on category with 201–300 pages","CatAutoTOC generates standard 

- What we see above is the source code of the web page. It is written in a language called HTML. 
- It defines the content and structure of the web page, which browsers like Chrome use to display it.

In [7]:
with open('webpage.html','w', encoding='utf-8') as f:  #Writing the html page to a file locally, i.e. a copy of real html page
    f.write(page_contents)

Here, we save the text that we have downloaded into an `HTML` file using the `open` statement.

Now, an HTML file named `webpage.html` has been created.
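
A quick way to confirm that a saved copy matches the downloaded text is to read it back and compare. The sketch below assumes a short sample string (`sample_contents`, containing one non-ASCII character) in place of `page_contents`, and writes into a temporary folder so that it does not touch the real `webpage.html`. Using the same explicit `utf-8` encoding for writing and reading is an assumption that fits this page, whose HTML declares `charset="UTF-8"`.

```python
import tempfile
from pathlib import Path

# Assumption: a short sample string stands in for `page_contents`.
sample_contents = '<p>café</p>'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'webpage.html'
    # Write and read with the same explicit encoding so the copy stays identical
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample_contents)
    with open(path, 'r', encoding='utf-8') as f:
        saved_copy = f.read()

print(saved_copy == sample_contents)
```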

![](https://nimbus-screenshots.s3.amazonaws.com/s/a51a561ef65a84cc65fbde63b2ab53fe.png)

In [8]:
jovian.commit() #Saving the work done till now

[jovian] Updating notebook "deepakkumawat2120/web-scraping-bollywood-filmographies-project" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/deepakkumawat2120/web-scraping-bollywood-filmographies-project


'https://jovian.ai/deepakkumawat2120/web-scraping-bollywood-filmographies-project'

# Parse the HTML source code using Beautiful Soup library


<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2021/04/56856232112.png" width="500" height="240" align="centre"/>

>### What is Beautiful Soup?

>Beautiful Soup is **a Python package** for **parsing HTML and XML documents**. Beautiful Soup enables us to get data out of sequences of characters. It creates a parse tree for parsed pages that can be used to extract data from HTML. It's a handy tool when it comes to web scraping. You can read more on their documentation site. https://www.crummy.com/software/BeautifulSoup/bs4/doc/#getting-help

>To extract information from the HTML source code of a webpage programmatically, we can use the Beautiful Soup library. Let's install the library and import **the BeautifulSoup class** from **the bs4 module.**

In [9]:
!pip install beautifulsoup4 --upgrade  --quiet 
from bs4 import BeautifulSoup
doc = BeautifulSoup(page_contents, 'html.parser')  #Now 'doc' contains entire html in parsed format

In [10]:
type(doc)

bs4.BeautifulSoup

## Inspecting the HTML source code of a web page



>In the Beautiful Soup library, we can specify `html.parser` to ask Python to read the page as a tree of components, instead of reading it as one long string. 

>We should have some basic knowledge of HTML before moving further with the project.

>HyperText Markup Language (HTML) is the set of markup symbols or codes inserted into a file intended for display on the Internet. The markup tells web browsers how to display a web page's words and images.

![](https://www.softwaretestinghelp.com/wp-content/qa/uploads/2020/12/html-code.jpg)

## **An HTML tag comprises three parts:**

1. **Name**: (`html`, `head`, `body`, `div`, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
2. **Attributes**: (`href`, `target`, `class`, `id`, etc.) Properties of tag used by the browser to customize how a tag is displayed and decide what happens on user interactions.
3. **Children**: A tag can contain some text or other tags or both between the opening and closing segments, e.g., `<div>Some content</div>`.
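
These three parts map directly onto what Beautiful Soup exposes for a parsed tag. The sketch below assumes a single hand-written `<a>` tag (`snippet`) rather than the real page, so the values are only illustrative.

```python
from bs4 import BeautifulSoup

# Assumption: a one-line HTML snippet written for this example.
snippet = ('<a href="/wiki/John_Abraham_filmography" '
           'title="John Abraham filmography">John Abraham filmography</a>')
tag = BeautifulSoup(snippet, 'html.parser').find('a')

print(tag.name)      # Name: what kind of tag this is
print(tag.attrs)     # Attributes: a dictionary of attribute names and values
print(tag['href'])   # A single attribute value
print(tag.text)      # Children: the text between the opening and closing segments
```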


## Common Tags and Attributes

Following are some of the most commonly used HTML tags:

* `html`
* `head`
* `title`
* `body`
* `div`
* `span`
* `h1` to `h6`
* `p`
* `img`
* `ul`, `ol` and `li`
* `table`, `tr`, `th` and `td`
* `style`
* ...

Each tag supports several attributes. Following are some common attributes used to modify the behavior of tags:

* `id`
* `style`
* `class`
* `href` (used with `<a>`)
* `src` (used with `<img>`)


With **a BeautifulSoup object**, we can get **a specific type of tag in the HTML** by passing the name of the tag, as shown in the code cell below.

Here, we use the `find()` function of BeautifulSoup to find the first `<title>` tag in the HTML document and display its content

In [11]:
title = doc.find('title')
title

<title>Category:Indian filmographies - Wikipedia</title>

## Inspecting HTML in the Browser

>To view the **source code** of any webpage right within **your browser**, you can **right click** anywhere on a page and **select** the **"Inspect"** option. You access the **"Developer Tools"** mode, where you can see the source code as **a tree**. You can expand and collapse various nodes and find the source code for a specific portion of the page.



## Actors/Actress Name

![](https://nimbusweb.me/nimbus-screenshots/bcd87344895ebe660af3385470e3006a)

![](https://nimbus-screenshots.s3.amazonaws.com/s/7082b6ef9577ad635dd662e80eae01df.png)

In [12]:
# here we get all the anchor tags from HTML page in the class external text
contents_a_tags = doc.find_all('a',{'class':'external text'})

In [13]:
len(contents_a_tags) # we will get the length of contents_a_tags

28

In [14]:
contents_a_tags

[<a class="external text" href="https://en.wikipedia.org/wiki/Category:Indian_filmographies">Top</a>,
 <a class="external text" href="https://en.wikipedia.org/w/index.php?title=Category:Indian_filmographies&amp;from=0">0–9</a>,
 <a class="external text" href="https://en.wikipedia.org/w/index.php?title=Category:Indian_filmographies&amp;from=A">A</a>,
 <a class="external text" href="https://en.wikipedia.org/w/index.php?title=Category:Indian_filmographies&amp;from=B">B</a>,
 <a class="external text" href="https://en.wikipedia.org/w/index.php?title=Category:Indian_filmographies&amp;from=C">C</a>,
 <a class="external text" href="https://en.wikipedia.org/w/index.php?title=Category:Indian_filmographies&amp;from=D">D</a>,
 <a class="external text" href="https://en.wikipedia.org/w/index.php?title=Category:Indian_filmographies&amp;from=E">E</a>,
 <a class="external text" href="https://en.wikipedia.org/w/index.php?title=Category:Indian_filmographies&amp;from=F">F</a>,
 <a class="external text" hr

## Now we will use `BeautifulSoup` to extract the `Names` , `URLs` and `Images` from the HTML Page 

### Filmography Name

In [15]:
# This is to get the name of actors/actress.

a_tags = doc.find_all('a',{'class': None})
len(a_tags)

255

In [16]:
titles_a_tag = []
for tag in a_tags:
    if 'filmography' in tag.text:
        titles_a_tag.append(tag['title'].replace('filmography',''))
titles_a_tag[:5]

['John Abraham ',
 'Kajal Aggarwal ',
 'Samantha Ruth Prabhu ',
 'Ali ',
 'Zeenat Aman ']
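
The names in the output above keep a trailing space, because only the word `filmography` is removed and the space in front of it stays. A minimal sketch of one way to tidy this up is shown below; `clean_name` is a hypothetical helper and `sample_titles` is sample data shaped like the `title` attributes used above.

```python
def clean_name(title: str) -> str:
    """Drop the word 'filmography' and any surrounding whitespace."""
    return title.replace('filmography', '').strip()


# Assumption: sample titles in the same shape as the `title` attributes above.
sample_titles = ['John Abraham filmography', 'Ali filmography']
names = [clean_name(t) for t in sample_titles]
print(names)
```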

### Filmography URL's

In [17]:
# Now we need to fetch the URLs of all the filmographies
# I have taken every link whose text contains "filmography".
link_a_tag = []
for tag in a_tags:
    if 'filmography' in tag.text:
        link_a_tag.append(tag['href'])
link_a_tag[:10]

['/wiki/John_Abraham_filmography',
 '/wiki/Kajal_Aggarwal_filmography',
 '/wiki/Samantha_Ruth_Prabhu_filmography',
 '/wiki/Ali_filmography',
 '/wiki/Zeenat_Aman_filmography',
 '/wiki/Ambareesh_filmography',
 '/wiki/Dev_Anand_filmography',
 '/wiki/Anil_Panachooran_filmography',
 '/wiki/Ramesh_Aravind_filmography',
 '/wiki/Allu_Arjun_filmography']

In [18]:
# To get the full link we need to give the "Base URL" which is given below
Filmography0_url = "https://en.wikipedia.org/"+ link_a_tag[0]
Filmography0_url # This is the first URl.

'https://en.wikipedia.org//wiki/John_Abraham_filmography'

In [19]:
# We need to loop over the links to get all the URLs
Filmography_urls = []

for i in range(len(link_a_tag)):
    
    Filmography_urls.append("https://en.wikipedia.org/"+ link_a_tag[i])
    
Filmography_urls[:10]

['https://en.wikipedia.org//wiki/John_Abraham_filmography',
 'https://en.wikipedia.org//wiki/Kajal_Aggarwal_filmography',
 'https://en.wikipedia.org//wiki/Samantha_Ruth_Prabhu_filmography',
 'https://en.wikipedia.org//wiki/Ali_filmography',
 'https://en.wikipedia.org//wiki/Zeenat_Aman_filmography',
 'https://en.wikipedia.org//wiki/Ambareesh_filmography',
 'https://en.wikipedia.org//wiki/Dev_Anand_filmography',
 'https://en.wikipedia.org//wiki/Anil_Panachooran_filmography',
 'https://en.wikipedia.org//wiki/Ramesh_Aravind_filmography',
 'https://en.wikipedia.org//wiki/Allu_Arjun_filmography']
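
The URLs above contain a double slash after the domain, because the base URL ends with `/` and every relative link starts with `/`. One way to avoid this is `urljoin` from the standard library, sketched below on two of the relative links from the output above. `BASE_URL` and `relative_links` are names chosen for this example.

```python
from urllib.parse import urljoin

BASE_URL = 'https://en.wikipedia.org/'

# Assumption: two relative links copied from the output above.
relative_links = ['/wiki/John_Abraham_filmography', '/wiki/Ali_filmography']

# urljoin resolves each relative link against the base URL
full_urls = [urljoin(BASE_URL, link) for link in relative_links]
print(full_urls)
```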

### Images for all the URL's

###### Taking one picture as an example, we will now get the images for all the filmographies



![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/John_Abraham_at_Trailor_launch_of_%27Shootout_At_Wadala%272013.jpg/318px-John_Abraham_at_Trailor_launch_of_%27Shootout_At_Wadala%272013.jpg)

In [20]:
# Here we create a function to get an image for each of the filmographies that we extracted from the Filmography page
# We search for the "img" tags on each individual URL
# and keep the first of the first two images that is a Wikimedia Commons thumbnail.
def Filmography_Images(Filmography):
    Images = []
    thumb_prefix = "//upload.wikimedia.org/wikipedia/commons/thumb/"
    for url in Filmography:
        response1 = requests.get(url)
        topic_doc1 = BeautifulSoup(response1.text,'html.parser')
        # Look at the first two "img" tags only; slicing avoids an IndexError on pages with fewer images
        sources = [img.get('src', '') for img in topic_doc1.find_all('img')[:2]]
        matches = [src for src in sources if thumb_prefix in src]
        Images.append(matches[0] if matches else None)   # None when no thumbnail is found
    return Images

In [21]:
Images = Filmography_Images(Filmography_urls)
Images

['//upload.wikimedia.org/wikipedia/commons/thumb/6/6e/Johnnyabrahama.jpg/220px-Johnnyabrahama.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/2/2d/Kajal_Aggarwal_Lakme_Fashion_Week_2017.jpg/220px-Kajal_Aggarwal_Lakme_Fashion_Week_2017.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/9/99/Samantha_at_10_Enradhukulla_Teaser_Launch.jpg/220px-Samantha_at_10_Enradhukulla_Teaser_Launch.jpg',
 None,
 '//upload.wikimedia.org/wikipedia/commons/thumb/0/08/Zeenat_Aman_at_the_Society_Achievers_Awards_2018_%28cropped%29.jpg/196px-Zeenat_Aman_at_the_Society_Achievers_Awards_2018_%28cropped%29.jpg',
 None,
 '//upload.wikimedia.org/wikipedia/commons/thumb/7/71/Portrait_of_Dev_Anand_1951.jpg/200px-Portrait_of_Dev_Anand_1951.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Merge-arrow.svg/50px-Merge-arrow.svg.png',
 None,
 '//upload.wikimedia.org/wikipedia/commons/thumb/f/f0/Allu_Arjun_at_62nd_Filmfare_awards_south.jpg/220px-Allu_Arjun_at_62nd_Filmfare_awards_south.jpg',
 '//uplo
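
The image paths above start with `//`, which means they are protocol-relative and cannot be opened as they are outside a browser. The sketch below shows one way to turn them into full `https` URLs while passing `None` through unchanged. `absolute_image_url` is a hypothetical helper, and the sample list reuses one path from the output above.

```python
from urllib.parse import urljoin


def absolute_image_url(src):
    """Turn a protocol-relative image path into a full https URL; keep None as None."""
    if src is None:
        return None
    # The scheme (https) is taken from the base URL, the host from `src`
    return urljoin('https://en.wikipedia.org/', src)


# Assumption: one path copied from the output above, plus a missing image.
samples = [
    '//upload.wikimedia.org/wikipedia/commons/thumb/6/6e/Johnnyabrahama.jpg/220px-Johnnyabrahama.jpg',
    None,
]
image_urls = [absolute_image_url(src) for src in samples]
print(image_urls)
```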

In [22]:
# We will check the length of all the Filmographies, Their URls and Their Images
len(titles_a_tag),len(Filmography_urls),len(Images)

(187, 187, 187)

## Creating a DataFrame using Pandas for Lists derived till now

> What is **Pandas**?

>Pandas is a software library written for the Python programming language for data manipulation and analysis. 
In particular, it offers data structures and operations for manipulating numerical tables and time series.

>What is a **DataFrame**?

>A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with rows and columns.
DataFrame makes it easier for us to work with tabular data and analyse it.

In [23]:
!pip install pandas --upgrade --quiet    #Installing Pandas Library
import pandas as pd

Now, First we will create a `Python Dictionary` with the `filmographies Names` and `filmographies URLs` and `filmographies Images` that we have extracted above.

In [24]:
Filmography_dict = {
    'Actor/Actress_Name' : titles_a_tag,
    'Url_Of_Profiles': Filmography_urls,
    'Images' : Images
}

In [25]:
Filmography_df = pd.DataFrame(Filmography_dict)  #Here we convert the dictionary into a Pandas DataFrame

Now, let us check the length of the DataFrame that we have created, which contains the `filmography Names`, `filmography URLs` and `filmography Images`

In [26]:
len(Filmography_df)

187

We can see that the DataFrame consists of **187 items**, which is equal to the number of filmographies.

Therefore, we can be sure that we have extracted the complete information that we had intended to.

In [39]:
Filmography_df

|#|Actor/Actress_Name|Url_Of_Profiles|Images|
|-|------------------|---------------|------|
|0|John Abraham|https://en.wikipedia.org//wiki/John_Abraham_fi...|//upload.wikimedia.org/wikipedia/commons/thumb...|
|1|Kajal Aggarwal|https://en.wikipedia.org//wiki/Kajal_Aggarwal_...|//upload.wikimedia.org/wikipedia/commons/thumb...|
|2|Samantha Ruth Prabhu|https://en.wikipedia.org//wiki/Samantha_Ruth_P...|//upload.wikimedia.org/wikipedia/commons/thumb...|
|3|Ali|https://en.wikipedia.org//wiki/Ali_filmography| |
|4|Zeenat Aman|https://en.wikipedia.org//wiki/Zeenat_Aman_fil...|//upload.wikimedia.org/wikipedia/commons/thumb...|
|...|...|...|...|
|182|Seeman|https://en.wikipedia.org//wiki/Seeman_filmography| |
|183|Aparna Sen|https://en.wikipedia.org//wiki/Aparna_Sen_film...| |
|184|Jisshu Sengupta|https://en.wikipedia.org//wiki/Jisshu_Sengupta...|//upload.wikimedia.org/wikipedia/commons/thumb...|
|185|Vijay Sethupathi|https://en.wikipedia.org//wiki/Vijay_Sethupath...|//upload.wikimedia.org/wikipedia/commons/thumb...|


We have finally created a DataFrame which contains the `filmography Names`, `filmography URLs` and `filmography Images` of all the `187` filmographies mentioned on the `Bollywood Filmographies` page.

In [None]:
# Let's Save the work again before going further
jovian.commit()

## Next Steps
### Now, we will go into the  individual `Filmographies` page and extract the rest of the required information

![](https://nimbus-screenshots.s3.amazonaws.com/s/fa8aae3a65f38cd2a56da4e49799708b.png)

#### Let's start with extracting all the information from the Filmographies of some particular `URL's`

In [29]:
# We will take the tables from Filmography URLs 0, 1, 3, 4 and 23, and from them the Year, Title and Role
urls = [Filmography_urls[0],Filmography_urls[1],Filmography_urls[3],Filmography_urls[4],Filmography_urls[23]]

In [30]:
# The get_docs function retrieves all the tables that have the class "wikitable".
# It also reads the "h1" heading of each page and removes the word "filmography",
# which leaves the Actor/Actress name.
def get_docs(films_urls):
    doc_list = []
    title_list = []
    for url in films_urls:
        # Fetch and parse the URL using bs4
        response = requests.get(url)
        if response.status_code == 200 : # check the status_code 
            topic_doc = BeautifulSoup(response.text,'html.parser')
            title_tags = topic_doc.find('h1').text.replace(' filmography','')
            body_tags = topic_doc.find_all('table',{'class' : 'wikitable'})
            doc_list.append(body_tags)
            title_list.append(title_tags)
        else :
            print('Failed to load page {}'.format(url)) # if status_code is not equal to 200
    return doc_list,title_list

In [31]:
Table_urls, title_tags = get_docs(urls) # Calling get_docs function on urls


In [32]:
title_tags # These are the Actor/Actress name for which we are accessing the data.

['John Abraham',
 'Kajal Aggarwal',
 'Ali',
 'Zeenat Aman',
 'Nandamuri Balakrishna']

In [33]:
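# We have written a function to convert the tables of each page into a Pandas DataFrame.
# For every page we read the second "wikitable" (index 1) and keep its first three columns.
from io import StringIO

def DataFrames(Tables):
    dfs = []
    for i in Tables:
        # Wrapping the HTML in StringIO avoids the warning that newer pandas versions
        # show when a literal HTML string is passed to read_html
        df=pd.read_html(StringIO(str(i)))   # list with one DataFrame per table
        df=pd.DataFrame(df[1])
        df=df[df.columns[0:3]]
        dfs.append(df)
    return dfs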
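
Picking the table by position works for the pages selected here, but the position can differ from page to page. A possible alternative, sketched below, is to pick the first table that has the columns we need. This is not the approach used in this notebook: `pick_filmography_table` and `REQUIRED_COLUMNS` are hypothetical names, and the two small DataFrames are hand-made stand-ins for what `pd.read_html` returns for one page.

```python
import pandas as pd

REQUIRED_COLUMNS = ['Year', 'Title', 'Role']


def pick_filmography_table(tables):
    """Return the required columns of the first table that has them all, else None."""
    for table in tables:
        if all(col in table.columns for col in REQUIRED_COLUMNS):
            return table[REQUIRED_COLUMNS]
    return None


# Assumption: two tiny hand-made tables stand in for the tables of one page.
awards = pd.DataFrame({'Year': [2004], 'Award': ['Example award']})
films = pd.DataFrame({'Year': [2003], 'Title': ['Jism'],
                      'Role': ['Kabir Lal'], 'Notes': ['']})

chosen = pick_filmography_table([awards, films])
print(chosen)
```

Returning `None` when no table matches lets the caller skip that page instead of stopping with an error.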
    

In [34]:
# here we have the names of few actor/actress which we have taken in our data.
# we are adding the actor/actress column in the dataframe. 
dfs = DataFrames(Table_urls)
for df, title in zip(dfs, title_tags) :
    df['Actor/Actress'] = title
dfs

[    Year                    Title  \
 0   2003                     Jism   
 1   2003                    Saaya   
 2   2003                     Paap   
 3   2004                  Aetbaar   
 4   2004                   Lakeer   
 5   2004                    Dhoom   
 6   2004                 Madhoshi   
 7   2005                    Elaan   
 8   2005                    Karam   
 9   2005                     Kaal   
 10  2005                  Viruddh   
 11  2005                    Water   
 12  2005             Garam Masala   
 13  2006                    Zinda   
 14  2006            Taxi No. 9211   
 15  2006                   Baabul   
 16  2006            Kabul Express   
 17  2007            Salaam-e-Ishq   
 18  2007               No Smoking   
 19  2007     Dhan Dhana Dhan Goal   
 20  2008                  Dostana   
 21  2009                 New York   
 22  2010                Aashayein   
 23  2010          Jhootha Hi Sahi   
 24  2011             7 Khoon Maaf   
 25  2011   

In [35]:
Table_contents = pd.concat(dfs) # we concatenate all the DataFrames using pd.concat()
Table_contents

|#|Year|Title|Role|Actor/Actress|
|-|----|-----|----|-------------|
|0|2003|Jism|Kabir Lal|John Abraham|
|1|2003|Saaya|Dr. Aakash Bhatnagar|John Abraham|
|2|2003|Paap|Inspector Shiven Verma|John Abraham|
|3|2004|Aetbaar|Aryan Trivedi|John Abraham|
|4|2004|Lakeer|Sahil Mishra|John Abraham|
|...|...|...|...|...|
|86|2017|Sallu Ki Shaadi|Sallu's mother|Zeenat Aman|
|87|2017|Love Life & Screw Ups|Joanna|Zeenat Aman|
|88|2019|Panipat|Sakeena Begum|Zeenat Aman|
|89|TBA|Margaon: The Closed File|Unnamed role|Zeenat Aman|
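
By default `pd.concat` keeps the row labels of each original DataFrame, which appears to be why the index in the output above does not run continuously across actors. If one continuous index is preferred, `ignore_index=True` renumbers the rows. The sketch below uses two small hand-made frames (`first` and `second`) in place of the per-actor tables in `dfs`.

```python
import pandas as pd

# Assumption: two tiny hand-made frames stand in for the per-actor tables.
first = pd.DataFrame({'Year': ['2003', '2003'], 'Title': ['Jism', 'Saaya']})
second = pd.DataFrame({'Year': ['2019'], 'Title': ['Panipat']})

kept_labels = pd.concat([first, second])                    # labels restart for each frame
combined = pd.concat([first, second], ignore_index=True)    # labels run from 0 upwards

print(list(kept_labels.index))
print(list(combined.index))
```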


## Now we have all the required information that we wish to get from the web page

In [36]:
# The first data frame is
Filmography_df

|#|Actor/Actress_Name|Url_Of_Profiles|Images|
|-|------------------|---------------|------|
|0|John Abraham|https://en.wikipedia.org//wiki/John_Abraham_fi...|//upload.wikimedia.org/wikipedia/commons/thumb...|
|1|Kajal Aggarwal|https://en.wikipedia.org//wiki/Kajal_Aggarwal_...|//upload.wikimedia.org/wikipedia/commons/thumb...|
|2|Samantha Ruth Prabhu|https://en.wikipedia.org//wiki/Samantha_Ruth_P...|//upload.wikimedia.org/wikipedia/commons/thumb...|
|3|Ali|https://en.wikipedia.org//wiki/Ali_filmography| |
|4|Zeenat Aman|https://en.wikipedia.org//wiki/Zeenat_Aman_fil...|//upload.wikimedia.org/wikipedia/commons/thumb...|
|...|...|...|...|
|182|Seeman|https://en.wikipedia.org//wiki/Seeman_filmography| |
|183|Aparna Sen|https://en.wikipedia.org//wiki/Aparna_Sen_film...| |
|184|Jisshu Sengupta|https://en.wikipedia.org//wiki/Jisshu_Sengupta...|//upload.wikimedia.org/wikipedia/commons/thumb...|
|185|Vijay Sethupathi|https://en.wikipedia.org//wiki/Vijay_Sethupath...|//upload.wikimedia.org/wikipedia/commons/thumb...|


In [37]:
# The Second data  frame is
Table_contents

|#|Year|Title|Role|Actor/Actress|
|-|----|-----|----|-------------|
|0|2003|Jism|Kabir Lal|John Abraham|
|1|2003|Saaya|Dr. Aakash Bhatnagar|John Abraham|
|2|2003|Paap|Inspector Shiven Verma|John Abraham|
|3|2004|Aetbaar|Aryan Trivedi|John Abraham|
|4|2004|Lakeer|Sahil Mishra|John Abraham|
|...|...|...|...|...|
|86|2017|Sallu Ki Shaadi|Sallu's mother|Zeenat Aman|
|87|2017|Love Life & Screw Ups|Joanna|Zeenat Aman|
|88|2019|Panipat|Sakeena Begum|Zeenat Aman|
|89|TBA|Margaon: The Closed File|Unnamed role|Zeenat Aman|


## Here we are going to write all the above used functions in a single cell



In [None]:
def Requested_page(url):                  #This is the function to get BeautifulSoup object for any given URL
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    # Parse the `response' text using BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc


# This is the function to get the image inside each particular Filmography URL
def Filmography_Images(Filmography): 
    Images = []
    thumb_prefix = "//upload.wikimedia.org/wikipedia/commons/thumb/"
    for url in Filmography:
        response1 = requests.get(url)
        topic_doc1 = BeautifulSoup(response1.text,'html.parser')   # Parse using Beautiful Soup
        # Look at the first two "img" tags only; slicing avoids an IndexError on pages with fewer images
        sources = [img.get('src', '') for img in topic_doc1.find_all('img')[:2]]
        matches = [src for src in sources if thumb_prefix in src]
        Images.append(matches[0] if matches else None)   # None when no thumbnail is found
    return Images


# The get_docs function retrieves all the tables that have the class "wikitable"
# together with the Actor/Actress name taken from the "h1" heading of each page
def get_docs(films_urls):
    doc_list = []
    title_list = []
    for url in films_urls:
        # Fetch and parse the URL using bs4
        response = requests.get(url)
        if response.status_code == 200 : # check the status_code 
            topic_doc = BeautifulSoup(response.text,'html.parser')
            title_tags = topic_doc.find('h1').text.replace(' filmography','')
            body_tags = topic_doc.find_all('table',{'class' : 'wikitable'})
            doc_list.append(body_tags)
            title_list.append(title_tags)
        else :
            print('Failed to load page {}'.format(url)) # if status_code is not equal to 200
    return doc_list,title_list



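# We have written a function to convert the tables of each page into a Pandas DataFrame.
# For every page we read the second "wikitable" (index 1) and keep its first three columns.
from io import StringIO

def DataFrames(Tables):
    dfs = []
    for i in Tables:
        # Wrapping the HTML in StringIO avoids the warning that newer pandas versions
        # show when a literal HTML string is passed to read_html
        df=pd.read_html(StringIO(str(i)))   # list with one DataFrame per table
        df=pd.DataFrame(df[1])
        df=df[df.columns[0:3]]
        dfs.append(df)
    return dfs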
    

### This is exactly what our desired output was when we began this project.
> Let us now save this DataFrame as a CSV file

In [None]:
Filmography_df.to_csv('Filmographies.csv', index=False)   #Saving the Filmography DataFrame as a CSV file
print('file converted successfully')

file converted successfully


In [None]:
Table_contents.to_csv('Data-inside-Filmographies.csv')     #Converting the Table Contents Dataframe to a CSV File
print('file converted successfully')

file converted successfully
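
Whether the row index is written makes a visible difference when a CSV file is read back: with the index included, pandas shows it as an extra column named `Unnamed: 0`. The sketch below illustrates this with a one-row DataFrame written to in-memory buffers instead of real files; the frame and variable names are made up for this example.

```python
import io
import pandas as pd

# Assumption: a one-row frame written for this example.
df = pd.DataFrame({'Year': ['2003'], 'Title': ['Jism']})

with_index = io.StringIO()
df.to_csv(with_index)                    # default: the row index becomes the first column
without_index = io.StringIO()
df.to_csv(without_index, index=False)    # only the data columns are written

cols_with = list(pd.read_csv(io.StringIO(with_index.getvalue())).columns)
cols_without = list(pd.read_csv(io.StringIO(without_index.getvalue())).columns)
print(cols_with)
print(cols_without)
```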


![](https://nimbus-screenshots.s3.amazonaws.com/s/4bdcc119a356d0d5ec13d81c322452a5.png)

## Summary

- At the very start, I manually went through all the filmographies on the [Bollywood Filmography](https://en.wikipedia.org/wiki/Category:Indian_filmographies) page.

- I then collected the titles, URLs and images of all the filmographies.

- While collecting the filmographies I was also getting some extra links and could not find a common class to filter them out, so I kept only the tags that contain the word "filmography" in `tag.text`.

- Next, I decided to collect the tables from each filmography URL. They contain Title, Year, Role, Language, Notes and References, but some of them have missing data, so I kept the Title, Year and Role columns, which are common to all the URLs.

- While fetching the tables, I found that many of the URLs contain several different tables with no common attribute, so I went through some of the URLs manually, checked which tables they had in common, and took those out as DataFrames.

- From these DataFrames we kept only the common columns, i.e. Title, Year and Role.

- Finally, we managed to `parse` the filmographies.

- At last, we saved all the data into CSV files, which we can use to answer a lot of further questions.

## Let us look at the steps that we took from start to finish :

1. We downloaded the webpage using requests

2. We parsed the HTML source code using the BeautifulSoup library and extracted the desired information, i.e.

   - The names of 'Filmographies'
   - URLs of each of those Filmographies
   - Images of each of those Filmographies

3. We created a DataFrame using Pandas from the Python lists that we derived in the previous step

4. We extracted detailed information for some of the filmographies in the list, such as:

   - Year of release
   - Title of movie
   - Role in movie
   - Actor/Actress name

5. We then created a Python dictionary to save all these details

6. We converted the Python dictionary into Pandas DataFrames

7. We now had 2 DataFrames

8. We converted the DataFrames into CSV files, which was the goal of our project.

## Future Work

We can now explore this data further to extract meaningful information from it.

With further analysis of the data, we can answer a lot of questions like:

- Which actor/actress has the highest number of movies?
- Which actor has the highest number of movies as a director?
- Which actor has made the highest number of movies in a particular year?

And the list goes on. As a next step, we will get the tables from each of the URLs rather than from a few selected ones.

>In the future, I would like to make this dataset even richer with more data from other filmography lists. I would then like to analyse the entire data to learn a lot more about movies than I currently know.

## References
[1] Python official documentation. https://docs.python.org/3/

[2] Beautiful Soup documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

[3] Aakash N S, Introduction to Web Scraping, 2021. https://jovian.ai/aakashns/python-web-scraping-and-rest-api

[4] Pandas library documentation. https://pandas.pydata.org/docs/

[5] Filmography Website.  https://en.wikipedia.org/wiki/Category:Indian_filmographies

[6] Working with Jupyter Notebook https://www.youtube.com/watch?v=lNPofGL28lU and https://towardsdatascience.com/write-markdown-latex-in-the-jupyter-notebook-10985edb91fd

In [None]:
jovian.commit(files=['Filmographies.csv', 'Data-inside-Filmographies.csv'])

[jovian] Updating notebook "deepakkumawat2120/web-scraping-bollywood-filmographies-project" on https://jovian.ai
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/deepakkumawat2120/web-scraping-bollywood-filmographies-project


'https://jovian.ai/deepakkumawat2120/web-scraping-bollywood-filmographies-project'