# HTML Parser for IMDB
---
**Author Github:** [Giray Coskun](https://github.com/giraycoskun)

**Valid Links:**

https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&view=simple

https://www.imdb.com/search/title/?groups=top_1000&view=simple&sort=user_rating,desc&count=10&start=5&ref_=adv_nxt

---
**Legal issues on Web Scraping:**

https://www.lawinsociety.org/legal-perspectives-on-scraping-data-from-the-modern-web

---

#### Tutorial: https://realpython.com/beautiful-soup-web-scraper-python/

In [1]:
from bs4 import BeautifulSoup
import requests

## Constants

In [2]:
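COUNT = 100  # number of titles requested per results page
START = 1    # position in the ranking of the first title on the page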

In [3]:
URL = f"https://www.imdb.com/search/title/?groups=top_1000&view=simple&sort=user_rating,desc&count={COUNT}&start={START}&ref_=adv_nxt"
print(URL)

https://www.imdb.com/search/title/?groups=top_1000&view=simple&sort=user_rating,desc&count=100&start=1&ref_=adv_nxt
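
The same URL can also be assembled from a dictionary of query parameters, which avoids hand-written string concatenation. The following is a minimal sketch using only the standard library; `BASE_URL` and `build_search_url` are hypothetical names introduced here, and the parameter values are copied from the URL above.

```python
from urllib.parse import urlencode

# Hypothetical helper names; the query parameters mirror the URL printed above.
BASE_URL = "https://www.imdb.com/search/title/"


def build_search_url(count=100, start=1):
    """Return the top-1000 search URL for one page of results."""
    params = {
        "groups": "top_1000",
        "view": "simple",
        "sort": "user_rating,desc",
        "count": count,
        "start": start,
        "ref_": "adv_nxt",
    }
    # safe="," keeps the comma in the sort value unescaped, as in the original URL
    return BASE_URL + "?" + urlencode(params, safe=",")


print(build_search_url(count=100, start=1))
```

Keeping the parameters in a dictionary makes it easier to change `count` or `start` without touching the rest of the query string.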


## Page Parsing

In [4]:
page = requests.get(URL, timeout=10)  # give up after 10 seconds instead of waiting indefinitely
print("Status Code:", page.status_code)

Status Code: 200


In [5]:
soup = BeautifulSoup(page.content, 'html.parser')

In [6]:
soup.find(class_="lister-item-content").find('a').text

'The Shawshank Redemption'

In [7]:
soup.find(class_="lister-item-content").find(class_="lister-item-index unbold text-primary").text.strip('.')

'1'

In [8]:
soup.find(class_="lister-item-content").find(class_="lister-item-year text-muted unbold").text.strip('()')

'1994'

In [9]:
soup.find(class_="lister-item-content").find(class_="col-imdb-rating").text.strip()

'9.3'
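
The lookups above assume every element is present; `find` returns `None` when nothing matches, and calling `.text` on `None` raises an `AttributeError`. The sketch below shows one possible way to guard against that. It runs on a small hand-written HTML sample rather than the live page: the markup, the second "Unrated Example" entry, and the helper `text_or_none` are all hypothetical and only mimic the class names used above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used above;
# the second item deliberately has no rating element.
SAMPLE_HTML = """
<div class="lister-item-content">
  <span class="lister-item-index unbold text-primary">1.</span>
  <a href="/title/tt0111161/">The Shawshank Redemption</a>
  <span class="lister-item-year text-muted unbold">(1994)</span>
  <div class="col-imdb-rating"><strong>9.3</strong></div>
</div>
<div class="lister-item-content">
  <span class="lister-item-index unbold text-primary">2.</span>
  <a href="/title/tt0000000/">Unrated Example</a>
  <span class="lister-item-year text-muted unbold">(2020)</span>
</div>
"""


def text_or_none(parent, class_name, strip_chars=None):
    """Return the stripped text of the first descendant with class_name, or None."""
    node = parent.find(class_=class_name)
    if node is None:
        return None
    return node.get_text(strip=True).strip(strip_chars)


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for item in sample_soup.find_all(class_="lister-item-content"):
    sample_title = item.find("a").get_text(strip=True)
    sample_year = text_or_none(item, "lister-item-year", "()")
    sample_rating = text_or_none(item, "col-imdb-rating")
    rows.append((sample_title, sample_year, sample_rating))

print(rows)
```

Matching on a single class name such as `lister-item-year` also works when the element carries additional classes, so the lookup does not depend on the exact order of the full class string.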

## CLASS: Movie

In [10]:
class Movie:
    """One entry of the IMDB top-rated list."""

    def __init__(self, title, year, rank, rating=None):
        self.title = title
        self.rank = rank
        self.year = year
        self.rating = rating

    def print_(self):
        # Print "rank. title, year", followed by the rating when one is known
        line = str(self.rank) + ". " + self.title + ", " + str(self.year)
        if self.rating is not None:
            line += ", " + str(self.rating)
        print(line)
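
As an alternative to a hand-written `__init__`, the same record could be expressed as a dataclass whose `__str__` produces the line that `print_` prints. This is only a sketch; `MovieRecord` is a hypothetical name chosen so it does not clash with the `Movie` class above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MovieRecord:
    """Hypothetical dataclass variant of the Movie class above."""
    title: str
    year: str
    rank: int
    rating: Optional[str] = None

    def __str__(self):
        # Same layout as print_: "rank. title, year[, rating]"
        line = f"{self.rank}. {self.title}, {self.year}"
        if self.rating is not None:
            line += f", {self.rating}"
        return line


print(MovieRecord("The Godfather", "1972", 2, "9.2"))
print(MovieRecord("The Godfather", "1972", 2))
```

Returning the string from `__str__` instead of printing inside the method lets the same text be reused for logging or for writing to a file.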

## Function to Iterate over Page

In [11]:
contents = soup.find_all(class_="lister-item-content")
movie_list = []
for element in contents:
    # The first link inside each item holds the title
    title = element.find('a').text
    year = element.find(class_="lister-item-year text-muted unbold").text.strip('()')
    rank = int(element.find(class_="lister-item-index unbold text-primary").text.strip('.'))
    rating = element.find(class_="col-imdb-rating").text.strip()

    temp = Movie(title, year, rank, rating)
    temp.print_()
    movie_list.append(temp)

1. The Shawshank Redemption, 1994, 9.3
2. The Godfather, 1972, 9.2
3. The Dark Knight, 2008, 9
4. The Godfather: Part II, 1974, 9
5. The Lord of the Rings: The Return of the King, 2003, 8.9
6. Pulp Fiction, 1994, 8.9
7. Schindler's List, 1993, 8.9
8. 12 Angry Men, 1957, 8.9
9. Inception, 2010, 8.8
10. Fight Club, 1999, 8.8
11. The Lord of the Rings: The Fellowship of the Ring, 2001, 8.8
12. Forrest Gump, 1994, 8.8
13. Il buono, il brutto, il cattivo, 1966, 8.8
14. The Lord of the Rings: The Two Towers, 2002, 8.7
15. The Matrix, 1999, 8.7
16. Goodfellas, 1990, 8.7
17. Star Wars: Episode V - The Empire Strikes Back, 1980, 8.7
18. One Flew Over the Cuckoo's Nest, 1975, 8.7
19. Seppuku, 1962, 8.7
20. Gisaengchung, 2019, 8.6
21. Interstellar, 2014, 8.6
22. Cidade de Deus, 2002, 8.6
23. Sen to Chihiro no kamikakushi, 2001, 8.6
24. Saving Private Ryan, 1998, 8.6
25. The Green Mile, 1999, 8.6
26. La vita è bella, 1997, 8.6
27. Se7en, 1995, 8.6
28. The Silence of the Lambs, 1991, 8.6
29. Star W
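
The loop above handles a single page. To cover more of the list with smaller pages, the `start` parameter has to advance by `count` each time. The sketch below computes those offsets; it assumes that `start` is 1-based and that the list has 1000 entries, as the `top_1000` group name suggests. `page_starts` is a hypothetical helper, not part of the original notebook.

```python
def page_starts(total=1000, count=100):
    """Return the 1-based start offset of every results page.

    Assumes `start` counts from 1 and each page holds `count` titles.
    """
    return list(range(1, total + 1, count))


starts = page_starts(total=1000, count=100)
print(starts[:3], "...", starts[-1])
```

Each offset can then be substituted for `START` when building the URL for the corresponding page.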

## Write to File

In [12]:
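import csv

output_filename = "film_list.csv"
# newline='' lets the csv module control line endings; utf-8 keeps accented titles intact
with open(output_filename, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Rank", "Title", "Year", "Rating"])
    for movie in movie_list:
        # csv.writer quotes titles that contain commas so they stay in one column
        writer.writerow([movie.rank, movie.title, movie.year, movie.rating])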
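
Several titles in the list contain commas (for example "Il buono, il brutto, il cattivo"), so a CSV file only reads back correctly if those fields are quoted. The following sketch checks this with the standard `csv` module and an in-memory buffer instead of a real file; the two sample rows are taken from the output above.

```python
import csv
import io

# Two rows from the output above; the first title contains commas
sample_rows = [
    (13, "Il buono, il brutto, il cattivo", "1966", "8.8"),
    (26, "La vita è bella", "1997", "8.6"),
]

# Write to an in-memory buffer so the example leaves no file behind
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Rank", "Title", "Year", "Rating"])
writer.writerows(sample_rows)

# Read the text back and confirm every row still has four columns
buffer.seek(0)
reader = csv.reader(buffer)
header = next(reader)
read_back = [tuple(row) for row in reader]

print(header)
print(read_back[0])
```

Note that every value comes back as a string, so the rank needs an explicit `int(...)` conversion if it is used as a number afterwards.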