# Webscrape Final
#### This notebook combines the methods from the prework notebooks to scrape the final dataset used in this project.

#### The steps are as follows:
- export a table of listings for each streaming service, including the link to more details for each show
- combine all tables into one
- merge titles available on multiple services into a single row
- use the link to more details to bring in country and genre data as new columns
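The combine and merge steps above can be sketched roughly as follows. This is only an illustration on two tiny made-up tables: the titles, the `example.com` links, and the choice of `link` as the merge key are assumptions, not the notebook's actual approach.

```python
import pandas as pd

# Hypothetical miniature per-service tables shaped like the scraper output
df_a = pd.DataFrame({
    "title": ["Show A", "Movie B"],
    "year": [2008, 2010],
    "link": ["https://example.com/show/show-a-2008",
             "https://example.com/movie/movie-b-2010"],
    "netflix": ["Yes", "Yes"],
})
df_b = pd.DataFrame({
    "title": ["Movie B", "Show C"],
    "year": [2010, 2016],
    "link": ["https://example.com/movie/movie-b-2010",
             "https://example.com/show/show-c-2016"],
    "Hulu": ["Yes", "Yes"],
})

# Step 2: stack all tables; service columns missing from a table become NaN
combined = pd.concat([df_a, df_b], ignore_index=True)

# Step 3: one row per link (assumed unique per title);
# first() keeps the first non-null value in each column
merged = combined.groupby("link", as_index=False, sort=False).first()

# Mark services where a title was not found
service_cols = ["netflix", "Hulu"]
merged[service_cols] = merged[service_cols].fillna("No")
print(merged)
```

Grouping on the link rather than the title is one possible design choice: it avoids merging different titles that happen to share a name.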

### Import needed packages

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re

### Define function to scrape listings

In [9]:
# Function created in the lists prework notebook, edited to bring in the link as a new column.
from io import StringIO  # lets pd.read_html parse an HTML string without a deprecation warning

def scrape_reelgood(baseurl, service):
    '''
    Scrapes from reelgood the full list of movies and tv shows for a specified streaming service.
    baseurl: listing URL ending in the offset parameter, e.g. 'https://reelgood.com/source/netflix?offset='
    service: name of the column that flags availability on this service.
    Returns a pandas dataframe with title, tv flag, year, age rating, imdb score, rotten tomatoes score,
    the service column (set to 'Yes'), and link to more details for each listing.
    '''
    offset = 0
    pages = []  # one dataframe per results page
    base = 'https://reelgood.com'

    while True:
        response = requests.get(baseurl + str(offset))  # gets the webpage as response
        soup = BeautifulSoup(response.content, 'lxml')  # turns response into soup
        table_list = soup.find_all('table', attrs={'class': 'css-1179hly'})  # our webpage should only have one table

        # no table means no more results; any other unexpected count also stops the loop instead of retrying forever
        if len(table_list) != 1:
            break

        table = table_list[0]
        page_df = pd.read_html(StringIO(str(table)))[0]  # read_html returns a list of dataframes (only 1 here)

        # next lines used to get the links as a column
        links = table.find_all('a', attrs={'href': re.compile("/")})
        # the link for each listing is taken from every other anchor found in the table
        page_df['link'] = [base + links[2 * i]['href'] for i in range(len(links) // 2)]

        pages.append(page_df)
        offset += 50  # moves to the next page of 50 tv/movie listings

    df = pd.concat(pages, ignore_index=True)  # combines all pages into one dataframe
    df.columns = ['pic','title','tv','year','rating','score_imdb','score_rotten',str(service),'episodes','track', 'link']
    df[str(service)] = 'Yes'  # every listing scraped here is available on this service
    df = df.drop(columns=['pic', 'episodes', 'track'])

    return df
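
The link extraction in the function above keeps only every other anchor (`links[2*i]`). A minimal sketch of why, assuming each table row links to the same detail page twice (for instance once from the poster cell and once from the title cell). The markup below is hypothetical and much simpler than the real page, and `html.parser` is used only to keep the sketch free of the `lxml` dependency.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: every row contains two anchors pointing at the same page
html = """
<table>
  <tr><td><a href="/show/example-show-2008"><img alt=""></a></td>
      <td><a href="/show/example-show-2008">Example Show</a></td></tr>
  <tr><td><a href="/movie/example-movie-2010"><img alt=""></a></td>
      <td><a href="/movie/example-movie-2010">Example Movie</a></td></tr>
</table>
"""

base = "https://reelgood.com"
soup = BeautifulSoup(html, "html.parser")
anchors = soup.find_all("a", attrs={"href": re.compile("/")})

# Keep anchors 0, 2, 4, ... so each row contributes exactly one link
links = [base + anchors[2 * i]["href"] for i in range(len(anchors) // 2)]
print(links)
```

If the page layout changes so that rows no longer carry exactly two anchors, the number of links will stop matching the number of table rows and the column assignment in the function will raise an error, which is a useful early warning.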

### Try it out for Netflix!

In [10]:
%%time
df_netflix = scrape_reelgood('https://reelgood.com/source/netflix?offset=', 'netflix')

Wall time: 1min 36s


### A little EDA before running it for all other services

In [11]:
df_netflix.head()

Unnamed: 0,title,tv,year,rating,score_imdb,score_rotten,netflix,link
0,Breaking Bad,TV,2008,18+,9.5,96%,Yes,https://reelgood.com/show/breaking-bad-2008
1,Inception,,2010,13+,8.8,87%,Yes,https://reelgood.com/movie/inception-2010
2,Back to the Future,,1985,7+,8.5,96%,Yes,https://reelgood.com/movie/back-to-the-future-...
3,The Matrix,,1999,18+,8.7,88%,Yes,https://reelgood.com/movie/the-matrix-1999
4,Stranger Things,TV,2016,16+,8.8,93%,Yes,https://reelgood.com/show/stranger-things-2016


In [12]:
df_netflix.tail()

Unnamed: 0,title,tv,year,rating,score_imdb,score_rotten,netflix,link
5779,Shikari,,1991,,,,Yes,https://reelgood.com/movie/shikari-1991
5780,Love Family,TV,2013,,,,Yes,https://reelgood.com/show/love-family-2013
5781,The Show,,2017,,,,Yes,https://reelgood.com/movie/the-show-2017
5782,The Stolen,,2017,,5.2,,Yes,https://reelgood.com/movie/the-stolen-2017
5783,Bal Ganesh,TV,2012,,,,Yes,https://reelgood.com/show/bal-ganesh-2012


In [14]:
df_netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5784 entries, 0 to 5783
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         5784 non-null   object 
 1   tv            2036 non-null   object 
 2   year          5784 non-null   int64  
 3   rating        2988 non-null   object 
 4   score_imdb    5530 non-null   float64
 5   score_rotten  1906 non-null   object 
 6   netflix       5784 non-null   object 
 7   link          5784 non-null   object 
dtypes: float64(1), int64(1), object(6)
memory usage: 361.6+ KB
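
The `info()` output above shows `score_rotten` stored as `object` (the values carry a `%` sign) and `tv` populated only for TV shows. A possible cleanup, sketched on three rows taken from the outputs above; the new column names `score_rotten_num` and `is_tv` are hypothetical and this step is not part of the original notebook.

```python
import pandas as pd

# Three rows copied from the head()/tail() outputs above
sample = pd.DataFrame({
    "title": ["Breaking Bad", "Inception", "The Stolen"],
    "tv": ["TV", None, None],
    "score_rotten": ["96%", "87%", None],
})

# Strip the percent sign and convert to a numeric column; missing scores stay NaN
sample["score_rotten_num"] = pd.to_numeric(sample["score_rotten"].str.rstrip("%"))

# A missing 'tv' value means the listing is a movie
sample["is_tv"] = sample["tv"].notna()
print(sample)
```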


### Looks good at first glance. Let's run this for all services

In [15]:
%%time
df_prime = scrape_reelgood('https://reelgood.com/source/amazon?offset=', 'Prime Video')

Wall time: 5min 38s


In [16]:
print(df_prime.shape)
df_prime.tail()

(15488, 8)


Unnamed: 0,title,tv,year,rating,score_imdb,score_rotten,Prime Video,link
15483,A Place to Stand,,2015,,8.1,,Yes,https://reelgood.com/movie/a-place-to-stand-2015
15484,The Cure,,2019,,7.2,,Yes,https://reelgood.com/movie/the-cure-2019
15485,Fine Lines,,2019,7+,7.3,,Yes,https://reelgood.com/movie/fine-lines-2019
15486,World's Best Beaches,TV,2016,,,,Yes,https://reelgood.com/show/worlds-best-beaches-...
15487,Kayak to Klemtu,,2018,,6.3,,Yes,https://reelgood.com/movie/kayak-to-klemtu-2018


In [17]:
%%time
df_hulu = scrape_reelgood('https://reelgood.com/source/hulu?offset=', 'Hulu')

Wall time: 46.4 s


In [18]:
print(df_hulu.shape)
df_hulu.tail()

(2806, 8)


Unnamed: 0,title,tv,year,rating,score_imdb,score_rotten,Hulu,link
2801,pocket.watch JillianTubeHD Ultimate mishmash,TV,2018,,,,Yes,https://reelgood.com/show/pocketwatch-jilliant...
2802,Gen H,TV,2012,,,,Yes,https://reelgood.com/show/gen-h-2012
2803,Crime Shock: Asia Exposed,TV,2013,,,,Yes,https://reelgood.com/show/crime-shock-asia-exp...
2804,Walter and Dude,TV,2012,,,,Yes,https://reelgood.com/show/walter-and-dude-2012
2805,Revolt,TV,2017,,,,Yes,https://reelgood.com/show/revolt-2017


In [19]:
df_disney = scrape_reelgood('https://reelgood.com/source/disney_plus?offset=', 'Disney+')

In [21]:
print(df_disney.shape)
df_disney.head()

(803, 8)


Unnamed: 0,title,tv,year,rating,score_imdb,score_rotten,Disney+,link
0,Star Wars: A New Hope,,1977,7+,8.6,92%,Yes,https://reelgood.com/movie/star-wars-1977
1,Star Wars: The Empire Strikes Back,,1980,7+,8.7,94%,Yes,https://reelgood.com/movie/the-empire-strikes-...
2,The Lion King,,1994,all,8.5,93%,Yes,https://reelgood.com/movie/the-lion-king-1994
3,The Avengers,,2012,13+,8.0,92%,Yes,https://reelgood.com/movie/the-avengers-2012
4,Toy Story,,1995,all,8.3,100%,Yes,https://reelgood.com/movie/toy-story-1995


In [22]:
%%time
df_hbomax = scrape_reelgood('https://reelgood.com/source/hbo_max?offset=', 'HBO Max')

Wall time: 31.3 s


In [23]:
print(df_hbomax.shape)
df_hbomax.tail()

(2016, 8)


Unnamed: 0,title,tv,year,rating,score_imdb,score_rotten,HBO Max,link
2011,What Animals See,,2018,,,,Yes,https://reelgood.com/movie/what-animals-see-2018
2012,The Moon's Spell On The Great Barrier Reef,,2014,,,,Yes,https://reelgood.com/movie/the-moons-spell-on-...
2013,Mandrake Telefilm: Part 1,,2013,,,,Yes,https://reelgood.com/movie/mandrake-telefilm-p...
2014,The Hunt for the Slave Ship Guerrero,,2018,,,,Yes,https://reelgood.com/movie/the-hunt-for-the-sl...
2015,The Rise & Fall Of T-Rex,,2015,,,,Yes,https://reelgood.com/movie/the-rise-fall-of-tr...


### Exporting the individual dataframes, for our records.

In [24]:
df_netflix.to_csv('../data/df_netflix.csv',index = False)
df_prime.to_csv('../data/df_prime.csv',index = False)
df_hulu.to_csv('../data/df_hulu.csv',index = False)
df_disney.to_csv('../data/df_disney.csv',index = False)
df_hbomax.to_csv('../data/df_hbomax.csv',index = False)
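
The same export can be written as a loop over a dictionary of dataframes, followed by a quick read-back check. This is a sketch only: the one-row frames and the temporary output directory are stand-ins so that the example runs on its own, whereas the notebook itself writes to `../data/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in frames; in the notebook these would be df_netflix, df_hulu, and so on
frames = {
    "df_netflix": pd.DataFrame({"title": ["Breaking Bad"], "year": [2008], "netflix": ["Yes"]}),
    "df_hulu": pd.DataFrame({"title": ["Revolt"], "year": [2017], "Hulu": ["Yes"]}),
}

out_dir = Path(tempfile.mkdtemp())  # temporary folder used only for this sketch

# Write one CSV per dataframe, named after its dictionary key
for name, frame in frames.items():
    frame.to_csv(out_dir / f"{name}.csv", index=False)

# Read one file back to confirm the columns and row count survived the round trip
reloaded = pd.read_csv(out_dir / "df_netflix.csv")
print(reloaded)
```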