# Basic web scraping project
## IMDB movies you must watch
<img src = "https://jeremymattheiss.files.wordpress.com/2017/07/jaws-logo.png?w=640">
<br>

__Importing libraries__

In [174]:
from requests import get
import pandas as pd
from bs4 import BeautifulSoup

__Defining the URL to be scraped__

In [210]:
url = "https://www.imdb.com/list/ls002448041/"

__Getting the page from the URL and parsing it into a searchable HTML tree with the BeautifulSoup library__

In [226]:
response = get(url)
text1 = response.text

In [228]:
html_soup = BeautifulSoup(text1,'html.parser')
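
To make concrete what parsing gives you, here is a minimal sketch on a tiny hand-written HTML string. The markup, the movie title, and the names `sample_html`, `sample_soup`, and `first_div` are made up for illustration and are not the real IMDb page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text; not the real IMDb markup.
sample_html = """
<div class="lister-item mode-detail">
  <h3><a href="/title/tt0000001/">Example Movie</a></h3>
  <span class="year">(1999)</span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Once parsed, tags can be looked up by name and read like objects.
first_div = sample_soup.find("div")
print(first_div["class"])            # class attribute, split into tokens
print(first_div.find("a").text)      # text inside the first <a> tag
print(first_div.find("span").text)   # text inside the first <span> tag
```

The same lookups work on `html_soup`, only with far more tags to search through.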

__Selecting the section and class that hold each movie's information__

In [229]:
movie_detail = html_soup.findAll('div',class_='lister-item mode-detail')
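
One detail worth knowing: passing a two-word string to `class_` matches the class attribute as an exact string, so the order of the class names matters. The sketch below, on made-up markup, compares that with a CSS selector via `select`, which does not depend on order. The names `sample_html`, `exact`, and `by_css` are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same two classes, written in different orders.
sample_html = """
<div class="lister-item mode-detail">A</div>
<div class="mode-detail lister-item">B</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Exact string match on the class attribute: order-sensitive.
exact = soup.find_all("div", class_="lister-item mode-detail")

# CSS selector: matches any div carrying both classes, in any order.
by_css = soup.select("div.lister-item.mode-detail")

print(len(exact), len(by_css))
```

For a page that always writes the classes in the same order, both approaches return the same elements.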

__Appending each value to its corresponding list in a for loop__
<br>
Note that each variable (title, year, etc.) sits at a different path in the HTML structure. Finding it is a matter of trial and error: locate where each string of interest is stored, then translate that location into `find`/`findAll` calls.

In [214]:
titles = []
years = []
age_restriction = []
directors = []
revenues = []
genres = []
    
for i in movie_detail:

    try:    
        #title
        title = i.findAll('a')[1].text.strip()
        titles.append(title)
        
        #year
        year = i.findAll('span')[1].text.strip()
        year = year.replace('(','').replace(')','').replace('I','').replace(' ','')
        years.append(year)
        
        #director
        director = i.findAll('p')[2].find('a').text.strip()
        directors.append(director)
        
        #age
        age = i.find('p').find('span').text.strip()
        age = age.replace("min",'').replace('Livre','None')
        age_restriction.append(age)
        
        #revenue
        revenue = i.findAll('p')[3].findAll('span')[4].text
        revenue = revenue.replace('$','').replace('M','')
        revenues.append(revenue)
        
        #genre
        genre = i.findAll('span')[6].text.replace('\n','').strip()
        genres.append(genre)
        
    except IndexError:
        
        #Include None when gross revenue is missing
        
        revenues.append(None)      
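
The trial-and-error step described above can be made quicker by printing every candidate tag with its position. This is a minimal sketch on simplified, made-up markup for a single list item; the real page has more tags, so the indexes here will not match those used in the loop.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for one list item.
item_html = """
<div class="lister-item mode-detail">
  <h3>
    <span>1.</span>
    <a href="#">Example Movie</a>
    <span>(1999)</span>
  </h3>
  <p><span>12</span><span>120 min</span><span>Drama, Thriller</span></p>
</div>
"""

item = BeautifulSoup(item_html, "html.parser").find("div")

# List every <span> with its position so the right index can be picked.
span_texts = [span.text.strip() for span in item.find_all("span")]
for index, text in enumerate(span_texts):
    print(index, repr(text))
```

Running the same loop on one element of `movie_detail` shows which index holds the year, the genre, and so on for that movie.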
        

__Handling different indexes in the HTML__
<br>
Because of this page's specific HTML structure, two movies did not have their genre and revenue information at the same index as the rest. <br>
So I made specific, one-off corrections here. Note, however, that for larger amounts of data (and more index inconsistencies) there are better ways to solve this.
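
One possible alternative, sketched here on made-up markup, is to extract each field independently and build one record per movie, so a missing field becomes `None` without shifting the other columns. The helper `safe_text`, the `item` and `genre` class names, and the sample HTML are all assumptions for illustration, not part of the original notebook.

```python
from bs4 import BeautifulSoup

def safe_text(extract):
    """Run one extraction; return None if the tag or index is missing."""
    try:
        return extract().text.strip()
    except (IndexError, AttributeError):
        return None

# Hypothetical markup: the second item has no genre span.
sample_html = """
<div class="item"><a>First</a><span class="genre"> Drama </span></div>
<div class="item"><a>Second</a></div>
"""

rows = []
for item in BeautifulSoup(sample_html, "html.parser").find_all("div", class_="item"):
    # Each field is looked up on its own, so one failure does not affect the rest.
    rows.append({
        "movie_title": safe_text(lambda: item.find("a")),
        "genre": safe_text(lambda: item.find("span", class_="genre")),
    })

print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame`, and every row keeps the same set of columns.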

In [215]:

genres.insert(14, movie_detail[14].findAll('span')[6].text.replace('\n','').strip())
genres.insert(39, movie_detail[40].findAll('span')[4].text.replace('\n','').strip())


__Creating a pandas DataFrame from the lists__

In [217]:

data = pd.DataFrame({'movie_title': titles, 
                     'release_year': years,
                     'director': directors,
                     'gross_revenue': revenues,
                     'age_restriction': age_restriction,
                     'genre': genres})

data.tail(20)

| Unnamed: 0 | movie_title | release_year | director | gross_revenue | age_restriction | genre |
|---|---|---|---|---|---|---|
| 80 | Apocalypto | 2006 | Mel Gibson | 50.87 | 16.0 | Action, Adventure, Drama |
| 81 | Os Infiltrados | 2006 | Martin Scorsese | 132.38 | 18.0 | Crime, Drama, Thriller |
| 82 | Pequena Miss Sunshine | 2006 | Jonathan Dayton | 59.89 | 14.0 | Comedy, Drama |
| 83 | Onde os Fracos Não Têm Vez | 2007 | Ethan Coen | 74.28 | 16.0 | Crime, Drama, Thriller |
| 84 | Batman: O Cavaleiro das Trevas | 2008 | Christopher Nolan | 534.86 | 12.0 | Action, Crime, Drama |
| 85 | Quem Quer Ser um Milionário? | 2008 | Danny Boyle | 141.32 | 16.0 | Drama, Romance |
| 86 | Se Beber, Não Case! | 2009 | Todd Phillips | 277.32 | 14.0 | Comedy |
| 87 | Se Beber, Não Case! Parte II | 2011 | Todd Phillips | 254.46 | 16.0 | Comedy |
| 88 | Um Parto de Viagem | 2010 | Todd Phillips | 100.54 | 14.0 | Comedy, Drama |
| 89 | Avatar | 2009 | James Cameron | 760.51 | 12.0 | Action, Adventure, Fantasy |
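
Building a DataFrame from a dictionary of lists only works when all lists have the same length, which is why the lists have to line up before this step. The sketch below shows what happens otherwise, using short made-up lists (`sample_titles`, `sample_genres`) rather than the scraped ones.

```python
import pandas as pd

# Hypothetical lists: the genre list is one entry short, as if a lookup had failed.
sample_titles = ["First", "Second", "Third"]
sample_genres = ["Drama", "Comedy"]

try:
    pd.DataFrame({"movie_title": sample_titles, "genre": sample_genres})
    built = True
except ValueError as error:
    built = False
    print("Could not build the DataFrame:", error)

# A cheap check to run before building the frame.
lengths = {"movie_title": len(sample_titles), "genre": len(sample_genres)}
print(lengths)
```

Printing the lengths of the real lists in the same way is a quick way to spot which column lost or gained an entry.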


__Exporting the DataFrame to CSV__

In [232]:
data.to_csv('MovieList.csv')
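
By default `to_csv` also writes the row index as a first column without a header, and pandas names that column `Unnamed: 0` when the file is read back. If that column is not wanted, `index=False` leaves it out. The sketch below uses a small made-up frame (`sample`) and an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

# Small hypothetical frame standing in for `data`.
sample = pd.DataFrame({"movie_title": ["Avatar"], "release_year": ["2009"]})

# Default: the row index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(sample.to_csv()))
print(list(with_index.columns))

# index=False leaves the row index out of the output.
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))
print(list(without_index.columns))
```

Applied to this notebook, that would be `data.to_csv('MovieList.csv', index=False)`.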