# Drama Recommendations

Building on the drama reviews project I did earlier, I wanted to build a drama recommender system. There are several approaches to this, including ones based on deep learning, so I set out to explore the following areas:

1. Web scrape the drama information 
2. Combine the scraped data
3. Brief exploration of the dramas
4. Build several recommender systems using machine learning
5. Try a recommender system using deep learning

This notebook is the first in a five-part series that I have completed.

# Web Scrape Drama List

Task: Web scrape the dramas from 'https://mydramalist.com/shows/top' so that the drama information can be used for analysis in subsequent notebooks.

Information to collect:
1. Drama title
2. Year released
3. Main actors
4. Genres
5. Tags
6. Synopsis

## 1. Import libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import itertools

## 2. Use BeautifulSoup to parse html

Thousands of dramas are listed across several hundred pages (627 at the time of scraping). To access each drama, I need its URL, so I:

1. Access the website 'https://mydramalist.com/shows/top' and identify the last page
2. Read each page to get the drama titles listed on that page
3. Create the drama urls, and access each drama website

In [2]:
shows_url = 'https://mydramalist.com/shows/top'
resp = requests.get(shows_url)
soup = BeautifulSoup(resp.text, 'html.parser') # name the parser explicitly to avoid the "no parser specified" warning
# print(soup.prettify())

pages = soup.find_all('a',{'class':'page-link'})
last_page = int(pages[-1].get('href')[16:]) # get the page number of the last page - all pages will be scraped
last_page

627
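
The `[16:]` slice above only works while the href has exactly the form `/shows/top?page=N`. A slightly more defensive sketch, assuming the page number is carried in a `page` query parameter, parses the href with the standard library instead. The helper name `page_number_from_href` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs

def page_number_from_href(href, default=1):
    """Return the integer value of the `page` query parameter in href."""
    query = parse_qs(urlparse(href).query)  # e.g. {'page': ['627']}
    values = query.get('page')
    # Fall back to `default` when the link carries no page parameter
    return int(values[0]) if values else default

print(page_number_from_href('/shows/top?page=627'))
```

In the cell above this could replace the slice, for example `last_page = page_number_from_href(pages[-1].get('href'))`.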

In [3]:
pages_url = ["{}?page={}".format(shows_url, str(page)) for page in range(1, last_page + 1)] # list of pages url
big_list_title = []
for page in pages_url[530:]: # only pages 531 onwards are scraped in this run
    resp = requests.get(page)
    soup = BeautifulSoup(resp.text, 'html.parser')
    titles = soup.find_all('a',{'class':'block'}) # get list of drama titles
    list_title = []
    for title in titles:
        list_title.append(title.get('href')) # get href of drama titles
    big_list_title.append(list_title)

big_list_title = list(itertools.chain.from_iterable(big_list_title))
# print(big_list_title)
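
The slice `pages_url[530:]`, together with the output file name `drama_list6.csv` used at the end of this notebook, suggests that the pages were scraped in batches, although the notebook does not say so. A minimal sketch of one way to split the page URLs into batches follows. The helper `chunked` and the batch size of 106 are assumptions, chosen here only because that size makes the sixth batch start at index 530.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

shows_url = 'https://mydramalist.com/shows/top'
last_page = 627  # value found in the earlier cell
pages_url = ["{}?page={}".format(shows_url, page) for page in range(1, last_page + 1)]

# 106 pages per batch is an assumed value, not taken from the notebook
batches = list(chunked(pages_url, 106))
print(len(batches), len(batches[-1]))
```

Each batch could then be scraped and saved to its own CSV file, so that a failure partway through does not lose earlier work.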

In [4]:
# function to create the drama url based on drama title
def first_page_url(big_list_title):
    mydramalist_url = 'https://mydramalist.com'
    list_url = []
    for title in big_list_title:
        title_url = mydramalist_url + title # create urls of drama 
        list_url.append(title_url)
    return list_url

In [5]:
# function to read each drama url
def first_page_html(list_url):
    list_html = []
    for drama_url in list_url: 
        resp = requests.get(drama_url)
        soup = BeautifulSoup(resp.text, 'html.parser') # html of drama page
        list_html.append(soup)
    return list_html
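
`first_page_html` requests every drama page back to back and stops at the first network error. A sketch of a gentler loop is shown below, assuming a pause between requests is acceptable and that failed pages can be skipped and retried later. The function `fetch_all` and its parameters are hypothetical, and the fetch function is passed in so the loop can be tried without network access.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each url, pausing between calls.

    Returns a dict mapping each url to the fetched value, or to None on failure.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        try:
            results[url] = fetch(url)
        except Exception as exc:
            print("failed: {} ({})".format(url, exc))
            results[url] = None  # keep a placeholder so the url can be retried
    return results

# Stand-in fetch function used only for this demonstration
def fake_fetch(url):
    if 'bad' in url:
        raise ValueError('simulated network error')
    return 'html for ' + url

demo = fetch_all(['page-a', 'bad-page', 'page-c'], fake_fetch, delay_seconds=0)
print(demo)
```

With `requests`, the `fetch` argument could be something like `lambda url: requests.get(url, timeout=10).text`, with the returned text then passed to `BeautifulSoup`.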

## 3. Populate DataFrame with drama information

Data to include in the DataFrame:

1. Drama title
2. Year released - retrieved from drama title
3. Main actors - the list of support roles is too long and may not be a good distinguishing factor
4. Genres - general; each drama can have more than one genre
5. Tags - more specific than genres; they include phrases like 'Adapted from a manga'
6. Synopsis - the plot summary shown on the drama page

Each piece of data is appended to its own list, and the lists are then combined to create the DataFrame.

In [6]:
# function to retrieve drama title from html
def drama_title(list_html):
    list_titles = []
    list_years = []
    
    for soup in list_html:
        drama_title = soup.title.get_text().replace(' - MyDramaList','') # drama title
        drama_year = drama_title[-5:-1] # drama year
        list_titles.append(drama_title)
        list_years.append(drama_year)
        
    return list_titles, list_years
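
The slice `drama_title[-5:-1]` assumes every title ends with a four-digit year in parentheses, as in `Mon Jun Tra (2013)`. A sketch of a regex-based alternative that returns an empty string when no year is present is shown below. The function name `split_title_year` and the pattern are illustrative, not part of the original notebook.

```python
import re

# Four digits in parentheses at the very end of the title
YEAR_PATTERN = re.compile(r'\((\d{4})\)\s*$')

def split_title_year(page_title):
    """Return (title, year) from a page title; year is '' if none is found."""
    title = page_title.replace(' - MyDramaList', '')
    match = YEAR_PATTERN.search(title)
    year = match.group(1) if match else ''
    return title, year

print(split_title_year('Mon Jun Tra (2013) - MyDramaList'))
```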

In [7]:
# function to retrieve main actors from html
def drama_actors(list_html):
    list_main_actors = []
    
    for soup in list_html:
        roles_text = soup.find_all('small', attrs={'class':'text-muted'})
        roles = []
        for i in roles_text:
            roles.append(i.text) # list of roles of all actors in main drama page
        no_main = roles.count('Main Role') # number of main actors

        actors_text = soup.find_all('a', attrs={'class':'text-primary text-ellipsis'})
        actors = []
        for i in actors_text:
            actors.append(i.text) # list of all actors in main drama page
        main_actors = actors[:no_main] # grab the list of main actors by indexing the number of main actors
        main_actors = ', '.join(main_actors)
        list_main_actors.append(main_actors)

    return list_main_actors
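
`drama_actors` takes the first `no_main` names from the actor list, which is correct as long as main roles are listed before all other roles. A sketch of an alternative pairs each role label with its actor, assuming the two lists line up one-to-one (an assumption about the page layout that is not verified here). The function name and the support actor's name are made up for illustration.

```python
def main_actors_by_pairing(roles, actors):
    """Join the names of actors whose paired role label is 'Main Role'."""
    return ', '.join(actor for role, actor in zip(roles, actors)
                     if role == 'Main Role')

# Sample lists in which a support role appears between two main roles
roles = ['Main Role', 'Support Role', 'Main Role']
actors = ['Krit Shahkrit Yamnam', 'Made-up Support Actor', 'Margie Rasri Balenciaga']

print(main_actors_by_pairing(roles, actors))
```

On these sample lists, slicing the first two actors would instead pick up the support actor, which is the case the pairing is meant to guard against.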

In [8]:
# function to retrieve genres from html
def drama_genres(list_html):
    list_genres = []

    for soup in list_html:
        info = soup.find('ul', attrs={'class': 'list m-a-0'}) # attrs must be a dict, not a set
        ge_sep = 'Genres: '
        ta_sep = 'Tags: '
        if ge_sep in info.text: # genres exist
            rest = info.text.split(ge_sep, 1)[1] # string contains other information such as tags. split string to get genres
            genres = rest.split(ta_sep, 1)[0] # get genres
            list_genres.append(genres)
        else:
            list_genres.append('') # no genres
        
    return list_genres

In [9]:
# function to retrieve tags from html
def drama_tags(list_html):
    list_tags = []
    for soup in list_html:
        info = soup.find('ul', attrs={'class': 'list m-a-0'}) # attrs must be a dict, not a set
        ta_sep = 'Tags: '
        rest = info.text.split(ta_sep, 1)[1]
        vote_sep = '(Vote or add tags)'
        if rest.rstrip() != vote_sep: # tags exist
            tags = rest.split(vote_sep, 1)[0] # string contains other information such as vote_sep. split string to get tags
            list_tags.append(tags)
        else:
            list_tags.append('') # no tags
    return list_tags
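
Both functions rely on the same idea: the text of the details list contains the markers `Genres: `, `Tags: ` and `(Vote or add tags)`, and splitting on them isolates each field. Note that `split(ta_sep, 1)[1]` raises an `IndexError` when `Tags: ` is absent. A sketch of the same splitting with `str.partition`, which returns empty strings instead of failing, is shown below. The function name and the sample strings are made up to mimic the layout the code above expects.

```python
def split_genres_tags(info_text):
    """Return (genres, tags) extracted from the text of the details list."""
    ge_sep, ta_sep, vote_sep = 'Genres: ', 'Tags: ', '(Vote or add tags)'
    # Everything before 'Tags: ' may hold the genres, everything after the tags
    before_tags, _, after_tags = info_text.partition(ta_sep)
    _, found, genres = before_tags.partition(ge_sep)
    genres = genres.strip() if found else ''
    tags = after_tags.partition(vote_sep)[0].strip()
    return genres, tags

sample = ('Genres: Business, Romance, Drama, Family '
          'Tags: Birth Secret, Revenge, Multiple Couples (Vote or add tags)')
print(split_genres_tags(sample))
print(split_genres_tags('Tags: Gay Character (Vote or add tags)'))
```

Unlike the original functions, this variant also strips surrounding whitespace from each field.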

In [11]:
# function to retrieve synopsis from html
def drama_synopsis(list_html):
    list_synopsis = []
    
    for soup in list_html:
        synopsis = soup.find('div', attrs={'class':'show-synopsis'})
        sep = 'Edit Translation'
        rest = synopsis.text.split(sep, 1)[0] # string contains other info like sep. split string to get drama synopsis
        clean = rest.lstrip().replace('\n(Source: MyDramaList) ','').replace('\n','').replace('\r','') # remove special characters
        list_synopsis.append(clean)
        
    return list_synopsis
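
Replacing `'\n'` with an empty string joins the last word of one line to the first word of the next. A sketch of a variant that collapses every run of whitespace into a single space instead is shown below. The function name `clean_synopsis` and the sample text are made up for illustration.

```python
def clean_synopsis(raw_text):
    """Cut the text at 'Edit Translation' and normalise its whitespace."""
    sep = 'Edit Translation'
    rest = raw_text.split(sep, 1)[0]  # drop the widget text after the synopsis
    rest = rest.replace('(Source: MyDramaList)', '')
    # str.split() with no arguments splits on any whitespace, including \n and \r
    return ' '.join(rest.split())

raw = ('\n  First line of a made-up synopsis.\r\n'
       'Second line.\n(Source: MyDramaList) Edit Translation English')
print(clean_synopsis(raw))
```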

In [13]:
# function to create DataFrame from lists of information
def drama_df(list_titles, list_years, list_main_actors, list_genres, list_tags, list_synopsis):
    
    df = pd.DataFrame(list(zip(list_titles, list_years, list_main_actors,
                               list_genres, list_tags, list_synopsis)), 
                   columns =['drama_title', 'year', 'main_actors', 'genres', 'tags', 'synopsis']
                     ) # create DataFrame with these variables & column names
    
    return df

## 4. Run the code and save DataFrame to CSV

In [15]:
list_url = first_page_url(big_list_title)
list_html = first_page_html(list_url)
list_titles, list_years = drama_title(list_html)
list_main_actors = drama_actors(list_html)
# list_directors = drama_director(list_html)
list_genres = drama_genres(list_html)
list_tags = drama_tags(list_html)
list_synopsis = drama_synopsis(list_html)
df = drama_df(list_titles, list_years, list_main_actors, list_genres, list_tags, list_synopsis)

In [16]:
df.head()

| | drama_title | year | main_actors | genres | tags | synopsis |
|---|---|---|---|---|---|---|
| 0 | Mon Jun Tra (2013) | 2013 | Krit Shahkrit Yamnam, Margie Rasri Balenciaga | | Older Man/Younger Woman, Mafia, Age Gap | Sarawaree is a young reporter aiming to write ... |
| 1 | Fah Krajang Dao (2013) | 2013 | Boy Pakorn Chatborirak, Matt Peeranee Kongthai | Romance, Drama | | The n’ek (Mee) has a pretty messed up past, sh... |
| 2 | Club Friday 2: The Series (2012) | 2012 | | Drama | Gay Character | This is a series of short stories based on a r... |
| 3 | Look Poo Chai Hua Jai Petch (2002) | 2002 | Chakrabongse Chulachak, Namfon Kullanat Preeyawat | Drama | | Synopsis needed. |
| 4 | Yok Lai Mek (2009) | 2009 | Janie Tienphosuwan, Ohm Atshar Nampan, Pip Raw... | Business, Romance, Drama, Family | Birth Secret, Revenge, Multiple Couples | Katriya Ekthamrongworakul is the eldest daught... |


In [17]:
df.to_csv('drama_list6.csv',index=False)