# Data Collection of TV Shows on Streaming Platforms
## TV Shows Hosted on Four Streaming Platforms, with IMDb and Rotten Tomatoes Ratings
## By: Bryan Kolano, October 30th, 2022
#### Data from [Kaggle Dataset by User: Ruchi Bhatia](https://www.kaggle.com/datasets/ruchi798/tv-shows-on-netflix-prime-video-hulu-and-disney) 
***

#### Background of Project

I found an interesting dataset on Kaggle.com about TV shows on streaming platforms.  The dataset contains each show's title, debut year, age rating, the streaming platform it is hosted on, and its Internet Movie Database (IMDb) and Rotten Tomatoes ratings.  <br>

In a previous notebook, I conducted a visual exploratory data analysis to examine the best and worst shows on each platform.  I thought it would be an interesting aside to gather more information about each TV show to support further analysis.  I inspected IMDb's website for a few of the TV shows and saw that each show has a short one- to two-sentence description and a couple of genre tags.  For example, the show "Breaking Bad" has the genre tags crime, drama, and thriller.  <br>

To further enhance the streaming TV show dataset, I decided to web-scrape IMDb to collect each show's description and tags and then append them to the original data.

In [10]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import lxml
import time
from janitor import clean_names
import re



#### Import and clean data
The first steps are reading in the dataset, removing columns that are not needed, changing column data types, and filling NAs. <br>

Additionally, as I began to scrape IMDb for show information, I discovered the scraper struggled on TV shows that have Marvel in the name.  For example, the original dataset says "Marvel's Agents of S.H.I.E.L.D", but on IMDb the title is just "Agents of S.H.I.E.L.D."  Therefore, "Marvel's" was removed from any title containing it.

In [54]:
df = pd.read_csv('tv_shows.csv')

#remove unnecessary columns
df.drop(['Unnamed: 0', 'ID', 'Type'], axis = 1, inplace = True)

#use janitor to standardize column names
df = clean_names(df)

#This step below isn't really necessary; I used it in my other notebook to grab rating values from IMDb and Rotten Tomatoes.
#For example, the imdb column in the dataset would say 9.5/10, and these new columns would turn it into 9.5.  The data type
# of the new columns is changed to float as well.
df['imdb_raw'] = df['imdb'].str.split('/').str[0]
df['rotten_raw'] = df['rotten_tomatoes'].str.split('/').str[0]

df = df.astype(
    {'imdb_raw':'float',
    'rotten_raw': 'float'}
)

#NAs are filled
df.fillna(0, inplace = True)

#find all titles containing "Marvel's" and remove the "Marvel's " prefix
df['title'] = df['title'].apply(lambda x: re.split(r"\'s ", x)[1]  if "Marvel's" in x else x)
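The lambda above splits on every `'s ` and keeps only the second piece, so a title containing a second possessive would be cut short. A minimal alternative sketch, assuming the goal is only to strip a leading "Marvel's " prefix (the helper name `strip_marvel_prefix` is hypothetical, not part of the original notebook):

```python
import re

def strip_marvel_prefix(title):
    """Remove a leading "Marvel's " from a title and leave everything else untouched."""
    # ^ anchors the match to the start, so later possessives in the title survive
    return re.sub(r"^Marvel's\s+", "", title)

print(strip_marvel_prefix("Marvel's Agents of S.H.I.E.L.D"))
print(strip_marvel_prefix("Breaking Bad"))
```

Under that assumption it could be applied to the column with `df['title'].map(strip_marvel_prefix)`.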


#### Scraping function
The function below takes in a TV show title and attempts to collect its genre tags (or themes) and its show description.  <br>

To find a show's IMDb page, you cannot just plug the title into the URL.  IMDb uses a title ID as part of a show's URL.  For example, with the show "Breaking Bad," you cannot simply add "Breaking Bad" to a URL to find its IMDb page, because the URL for "Breaking Bad" is 'https://www.imdb.com/title/tt0903747/'.  To get that title ID, we create an individual search URL for each show, grab the title ID from the search results, and then go to the show's title ID URL.  <br>

Continuing the "Breaking Bad" example, we can add "+" between each word of the title and then plug that into a set URL format.  For "Breaking Bad," we plug "breaking+bad" into the format to create the URL: "https://www.imdb.com/find?q=breaking+bad&ref_=nv_sr_sm".  We can then find the title ID on that search page, navigate to the show's actual page, and scrape the desired information.
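As a small illustration of the URL handling described above, the sketch below builds the search URL with `urllib.parse.quote_plus`, which joins words with "+" and also escapes characters such as "&" that would otherwise break the query string. The helper names and the example `href` value are assumptions for illustration; the actual link format on IMDb's search page may differ.

```python
import re
from urllib.parse import quote_plus

def build_search_url(title):
    """Return the IMDb search URL for a title, escaping special characters."""
    return f"https://www.imdb.com/find?q={quote_plus(title.lower())}&ref_=nv_sr_sm"

def extract_title_id(href):
    """Pull an ID such as 'tt0903747' out of a link like '/title/tt0903747/...'."""
    match = re.search(r"/title/(tt\d+)", href)
    return match.group(1) if match else None

print(build_search_url("Breaking Bad"))
# The href below is a made-up example of what a search-result link might look like
print(extract_title_id("/title/tt0903747/?ref_=fn_al_tt_1"))
```

Extracting only the ID, rather than reusing the whole `href`, makes it possible to rebuild a clean show URL of the form 'https://www.imdb.com/title/tt0903747/'.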


In [55]:
def get_imdb_info(title):
   
    try:
        
        #combine the individual words of our title
        combine_title = '+'.join(title.lower().split(' '))

        #search for that show
        search_url = f'https://www.imdb.com/find?q={combine_title}&ref_=nv_sr_sm'
        page = requests.get(search_url)

        #Turn our HTML page into beautiful soup object
        soup = BeautifulSoup(page.text, 'html.parser')

        #Need to find the title ID of the particular show
        search = soup.select('li.find-title-result:nth-child(1) > div:nth-child(2) > div:nth-child(1) > a:nth-child(1)')
        search_extension = search[0].attrs.get('href')
        
        #setting up some variables for error checking
        title_lower = title.lower()
        #found_title is the title pulled from our scrape
        found_title = search[0].get_text().lower()
        #regex for removing punctuation
        remove_punct = r'[^\w\s]'
        #sometimes the script will be unable to match tv shows, so the following lines remove punctuation
        #to be able to check if the show names are the same without punctuation
        title_lower_no_punct = re.sub(remove_punct, ' ', title_lower)
        found_title_lower_no_punct = re.sub(remove_punct, ' ', found_title)

        #Checking for errors.  If the name in the dataset does not match the name found on the search page
        #then, it will just return the two title names, to allow for visual inspection later.
        if ((title_lower != found_title) and 
            (title_lower_no_punct != found_title_lower_no_punct)):

            return  {'title': [title],
                    'found_title': [found_title],
                     'same_title': [''],
                        'themes': [''],
                        'description': ['']}
        else:
            #If the two names match, then we are going to continue scraping info
            #we use the title id and can add it to the set URL format below.
            show_search = f'https://www.imdb.com{search_extension}/?ref_=fn_al_tt_0'

            #get HTML page
            show_page = requests.get(show_search)
            
            #turn html into beautiful soup object
            show_soup = BeautifulSoup(show_page.text, 'html.parser')
            
            #grab the description
            description = show_soup.select('.sc-16ede01-1')[0].text
            
            #set up holder list to grab the genre tags
            tags = []
            
            #grab all the tags
            tag_finder = show_soup.find_all("span", {"class": "ipc-chip__text"})
            
            #add each tag to the tags list
            for tag in tag_finder:
                tags.append(tag.text)
            
            #return dictionary with all scraped information.  
            return {'title': [title],
                    'found_title': [found_title],
                    'same_title': [''],
                        'themes': [' '.join(tags)],
                        'description': [description]}

    except Exception:
        #If there is an error while scraping, just return the title
        return {'title': [title],
        'found_title': [''],
        'same_title': [''],
            'themes': [''],
            'description': ['']}
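One subtlety in the punctuation check above: replacing each punctuation mark with a space can leave different amounts of whitespace, so "Agents of S.H.I.E.L.D" and "Agents of S.H.I.E.L.D." still compare as unequal (the second ends with an extra space). A minimal sketch of a normalizer that also collapses whitespace is shown here; `normalize_title` is a hypothetical helper, not part of the original function.

```python
import re

def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse runs of whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", title.lower())
    # split() with no arguments drops leading/trailing and repeated whitespace
    return " ".join(no_punct.split())

dataset_title = "Agents of S.H.I.E.L.D"
found_title = "Agents of S.H.I.E.L.D."
print(normalize_title(dataset_title) == normalize_title(found_title))
```

Comparing `normalize_title(title)` with `normalize_title(found_title)` would likely reduce the number of titles that need manual review, though that has not been measured here.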
    
    
   

#### Collecting the tags
Next is creating a dataframe and then appending all the scraped information to it.  It then writes this intermediate step to a csv.

In [15]:
#set up blank dataframe
df_concat = pd.DataFrame()

#loop through all titles, apply the function to each, and concatenate the result to the data frame
for title in df['title']:
    current_title = pd.DataFrame.from_dict(get_imdb_info(title))
    df_concat = pd.concat([df_concat,current_title],ignore_index= True)
    
#write to CSV                            
df_concat.to_csv('scraped_with_returned_title.csv', index= False)
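Calling `pd.concat` inside the loop copies the growing dataframe on every iteration, and the loop sends requests back to back. The sketch below shows one possible alternative: collect plain dictionaries in a list and pause between requests. The `collect_rows` helper, the `delay_seconds` parameter, and the `fake_scrape` stand-in are assumptions for illustration; in the notebook the scraping function would be `get_imdb_info`.

```python
import time

def collect_rows(titles, scrape_fn, delay_seconds=1.0):
    """Run scrape_fn on each title and return a list of flat row dictionaries."""
    rows = []
    for title in titles:
        result = scrape_fn(title)
        # the scraping functions return one-element lists, e.g. {'title': ['Dark']};
        # unwrap them so each row is a simple {'title': 'Dark', ...} mapping
        rows.append({key: value[0] for key, value in result.items()})
        # pause between requests to avoid hammering the site
        time.sleep(delay_seconds)
    return rows

def fake_scrape(title):
    """Hypothetical stand-in for get_imdb_info so the sketch runs offline."""
    return {'title': [title], 'themes': ['Drama'], 'description': ['placeholder']}

rows = collect_rows(["Breaking Bad", "Dark"], fake_scrape, delay_seconds=0)
print(rows)
```

The resulting list can be turned into a dataframe in a single step with `pd.DataFrame(rows)` before writing it to CSV.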

After running the initial TV show list through the function to get each show's tags and description, approximately 1,500 titles had some type of error.  The function accurately collected the majority of the descriptions, but for those that had issues, I visually inspected each one, marked it with a 0, 1, or 2 in the same_title column, and then ran the shows marked 1 through a second function.  A 2 means it was a match during the first scrape.  A 1 means the title in the dataset and the title found on IMDb refer to the same show, but the scrape had some sort of issue.  A 0 means the title in the dataset is not the same as the title found on IMDb, so those shows were excluded from further analysis.

In [5]:
second_round = pd.read_csv('scraped_with_returned_title_reviewed.csv')
second_round.head()

Unnamed: 0,title,found_title,same_title,themes,description
0,Breaking Bad,breaking bad,2,Crime Drama Thriller,A chemistry teacher diagnosed with inoperable ...
1,Stranger Things,stranger things,2,Drama Fantasy Horror,"When a young boy disappears, his mother, a pol..."
2,Attack on Titan,attack on titan,2,Animation Action Adventure,After his hometown is destroyed and his mother...
3,Better Call Saul,better call saul,2,Crime Drama,The trials and tribulations of criminal lawyer...
4,Dark,dark,2,Crime Drama Mystery,"A family saga with a supernatural twist, set i..."


In [9]:
#same title column: 2 means it collected the correct themes and description, 1 means it found a slightly
# different name but is the same show.  0 means the scraper did not find the correct show.
second_round_to_scrape = second_round.query('same_title == 1')

The new function is similar to the first one, but this time it does not check dataset names against the names found on IMDb.  The names found in the previous round were visually inspected and confirmed to be the same show, so now we just scrape those titles.

In [56]:
def get_imdb_info_no_name_check(title):
   

    try:
        #combine title with +
        combine_title = '+'.join(title.lower().split(' '))
        
        #search page and get HTML
        search_url = f'https://www.imdb.com/find?q={combine_title}&ref_=nv_sr_sm'
        page = requests.get(search_url)
        
        #turn HTML info into beautiful soup object
        soup = BeautifulSoup(page.text, 'html.parser')

        #find the title id from this page
        search = soup.select('li.find-title-result:nth-child(1) > div:nth-child(2) > div:nth-child(1) > a:nth-child(1)')
        search_extension = search[0].attrs.get('href')
        
        #Now find tv show's page with actual title ID                     
        show_search = f'https://www.imdb.com{search_extension}/?ref_=fn_al_tt_0'
        
        #Turn into beautiful soup object
        show_page = requests.get(show_search)
        show_soup = BeautifulSoup(show_page.text, 'html.parser')
        
        #grab the description
        description = show_soup.select('.sc-16ede01-1')[0].text
        
        #grab the tags
        tags = []
        tag_finder = show_soup.find_all("span", {"class": "ipc-chip__text"})
        for tag in tag_finder:
            tags.append(tag.text)
        
        #Return the collected information.  We are not visually inspecting anything this time
        # so we can just return the scraped info.
        return {'title': [title],
                'themes': [' '.join(tags)],
                'description': [description]}

    except Exception:
        #If there is an error, just return the title
        return {'title': [title],
                'themes': [''],
                'description': ['']}

In [26]:
second_round_df = pd.DataFrame()

for title in second_round_to_scrape['title']:
    current_title = pd.DataFrame.from_dict(get_imdb_info_no_name_check(title))
    second_round_df = pd.concat([second_round_df,current_title],ignore_index= True)
    
                            
second_round_df.to_csv('second_round_scrape.csv', index= False)

#### Merging the Datasets

In [14]:
#Merge the two datasets

first = pd.read_csv('scraped_with_returned_title.csv')
first.drop(['found_title', 'same_title'], inplace= True, axis = 1)

second = pd.read_csv('second_round_scrape.csv').fillna('')

first.head()


Unnamed: 0,title,themes,description
0,Breaking Bad,Crime Drama Thriller,A chemistry teacher diagnosed with inoperable ...
1,Stranger Things,Drama Fantasy Horror,"When a young boy disappears, his mother, a pol..."
2,Attack on Titan,Animation Action Adventure,After his hometown is destroyed and his mother...
3,Better Call Saul,Crime Drama,The trials and tribulations of criminal lawyer...
4,Dark,Crime Drama Mystery,"A family saga with a supernatural twist, set i..."


In [15]:
second.head()

Unnamed: 0,title,themes,description
0,Star Trek,Action Adventure Sci-Fi,"In the 23rd Century, Captain James T. Kirk and..."
1,"Tiger King: Murder, Mayhem and Madness",Documentary Biography Crime,A rivalry between big cat eccentrics takes a d...
2,Leyla ile Mecnun,Adventure Comedy Drama,Turkish television comedy series set in Istanb...
3,Resurrection: Ertugrul,Action Adventure Drama,"The heroic story of Ertugrul Ghazi, the father..."
4,Code Geass: Lelouch of the Rebellion,Animation Action Drama,After being given a mysterious power to contro...



There are now two datasets that need to be merged: one dataframe from each round of scraping.
The code below does a left join of the two rounds of scraping.  Where the first dataset (called first, from the first round of scraping) has NA values, the dataset called second may hold the themes and descriptions scraped for the same show. <br>

Therefore, we can fill NA values from the first DF with the respective theme and description values from the second DF.
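The fill-in step described above can be sketched on toy data. The two small dataframes below are made up for illustration and only mimic the shape of the real scrape results; the `suffixes` and `validate` arguments are one possible way to keep the column names clean and to guard against duplicate titles in the second round, not necessarily what the notebook needs.

```python
import pandas as pd

# toy stand-ins for the first and second scraping rounds (made-up rows)
first_toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "themes": ["Drama", None, None],
    "description": ["desc A", None, None],
})
second_toy = pd.DataFrame({
    "title": ["B"],
    "themes": ["Comedy"],
    "description": ["desc B"],
})

# left join keeps every first-round row; validate raises if second_toy repeats a title
merged = first_toy.merge(
    second_toy, how="left", on="title",
    suffixes=("", "_second"), validate="many_to_one",
)

# fill first-round gaps with second-round values, dropping the helper columns
for col in ["themes", "description"]:
    merged[col] = merged[col].fillna(merged.pop(f"{col}_second"))

# anything still missing after both rounds becomes an empty string
merged = merged.fillna("")
print(merged)
```

Show "B" picks up its values from the second round, while "C", which was never re-scraped, ends up with empty strings.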

In [42]:
#left join the two datasets
merge_scrapes = first.merge(second, how= 'left', on = 'title')

#where themes of first DF are blank, fill with the scraped themes of second DF
merge_scrapes['themes_x'] = merge_scrapes['themes_x'].fillna(merge_scrapes.pop('themes_y'))

#where descriptions of first DF are blank, fill with the scraped descriptions of second DF
merge_scrapes['description_x'] = merge_scrapes['description_x'].fillna(merge_scrapes.pop('description_y'))

#rename columns
merge_scrapes = merge_scrapes.rename_columns({'themes_x': 'themes', 'description_x':"description"}).fillna('')
merge_scrapes

Unnamed: 0,title,themes,description
0,Breaking Bad,Crime Drama Thriller,A chemistry teacher diagnosed with inoperable ...
1,Stranger Things,Drama Fantasy Horror,"When a young boy disappears, his mother, a pol..."
2,Attack on Titan,Animation Action Adventure,After his hometown is destroyed and his mother...
3,Better Call Saul,Crime Drama,The trials and tribulations of criminal lawyer...
4,Dark,Crime Drama Mystery,"A family saga with a supernatural twist, set i..."
...,...,...,...
5363,Paradise Islands,,
5364,Mexico Untamed,Documentary,
5365,Wild Centeral America,Documentary,Nigel Marven explores the culture and wildlife...
5366,Wild Russia,Documentary,An in-depth look at Russia's most natural wond...


Finally, we can merge scraped data to our original dataset.

In [47]:
final_dataset_all_titles = df.merge(right = merge_scrapes, how = 'left',on= 'title')

Of note, there are 82 shows across these streaming platforms that have themes on IMDb but do not have descriptions on IMDb.  They will not be included in the final dataset.

In [48]:
themes_no_description = final_dataset_all_titles.query("themes != '' and description == '' ")
len(themes_no_description)

82

Additionally, due to various scraping issues, the script failed to find 1104 descriptions, and those shows will be removed from the final dataset.

In [49]:
no_descriptions = final_dataset_all_titles.query("description == '' ")
len(no_descriptions)

1104

In [50]:
final_dataset = final_dataset_all_titles.query("description != '' ")
final_dataset.to_csv('tv_shows_themes_descriptions.csv', index= False)