# Drama Reviews

I love watching dramas and often browse 'https://mydramalist.com/shows/top' to find top-rated dramas to watch. Having picked up some new Python skills, I wanted to put them to use on a drama-related project with the following steps:

1. Web scrape the drama reviews
2. Preprocess the drama reviews
3. Conduct exploratory data analysis on the reviews
4. Identify topics present in the reviews
5. Uncover the sentiment of viewers

This notebook is the first of a five-part series.

# Web Scrape Drama Reviews

Task: Web scrape the drama reviews from 'https://mydramalist.com/shows/top' so that they can be analysed in subsequent notebooks.

Note: This is my first attempt at web scraping (I learnt it on the go while doing this project), so some of the code may not be the most efficient way of retrieving the required data.

## 1. Import libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import itertools

## 2. Use BeautifulSoup to parse html

The site lists dramas across many pages, and each drama can have several pages of reviews. These are the steps I took to locate the reviews:

1. Find the last page number of the list of dramas.
2. For each page, get the href of each drama title and use it to build the review URL of that drama.
3. For each drama, check whether there is more than one page of reviews. If so, find the last page number of the reviews and build the URL of every review page.
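The URL-building part of these steps can be sketched without touching the network. The helper names `last_page_number` and `paged_urls` are hypothetical and are not used elsewhere in this notebook, and the drama path in the example is made up. The sketch assumes that the last pagination link ends in the page number, which is consistent with how the cells in this section read the hrefs.

```python
import re


def last_page_number(href):
    """Return the trailing page number of a pagination href, or 1 if there is none."""
    match = re.search(r'(\d+)$', href)
    return int(match.group(1)) if match else 1


def paged_urls(base_url, last_page):
    """Build one URL per page, from page 1 up to and including last_page."""
    return [f"{base_url}?page={page}" for page in range(1, last_page + 1)]


# Example with a made-up drama path (illustrative only, not a real drama)
review_url = 'https://mydramalist.com' + '/12345-example-drama' + '/reviews'
n_pages = last_page_number('/12345-example-drama/reviews?page=3')
for url in paged_urls(review_url, n_pages):
    print(url)
```

Falling back to 1 when no number is found means a drama with a single page of reviews still yields exactly one URL.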

In [2]:
shows_url = 'https://mydramalist.com/shows/top'
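resp = requests.get(shows_url)
soup = BeautifulSoup(resp.text, 'html.parser') # name the parser explicitly so the result does not depend on which parsers are installed
# print(soup.prettify())

pages = soup.find_all('a',{'class':'page-link'})
last_href = pages[-1].get('href') # href of the last pagination link
last_page = int(re.search(r'(\d+)$', last_href).group(1)) # trailing number is the last page number - all pages will be scraped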
last_page

668

In [15]:
pages_url = ["{}?page={}".format(shows_url, page) for page in range(1, last_page + 1)] # list of pages url
big_list_title = []
for page in pages_url[550:]: # this run only covers the listing pages from index 550 onwards
    resp = requests.get(page)
    soup = BeautifulSoup(resp.text, 'html.parser')
    titles = soup.find_all('a',{'class':'block'}) # get list of drama titles
    big_list_title.extend(title.get('href') for title in titles) # collect href of drama titles in one flat list
# print(big_list_title)

In [4]:
def first_page_url(big_list_title):
    mydramalist_url = 'https://mydramalist.com'
    review = '/reviews'
    list_review_url = []
    for title in big_list_title:
        title_url = mydramalist_url + title + review # create urls of drama reviews
        list_review_url.append(title_url)
    return list_review_url

In [5]:
def first_page_html(list_review_url):
    list_review_html = []
    for drama_review_url in list_review_url: 
        resp = requests.get(drama_review_url)
        soup = BeautifulSoup(resp.text, 'html.parser') # html of drama reviews page
        list_review_html.append(soup)
    return list_review_html

In [6]:
def all_pages_url(d):
    x = []
    y = []
    
    for i in range(len(d['url'])):
        url = d['url'][i]
        html = d['html'][i]
        pages = html.find_all('a',{'class':'page-link'})

        if pages == []: # if drama has no extra pages, append url straightaway 
            x.append(url)
            
        else: # if drama has extra pages, create the url of the other pages
            last_page = pages[-1].get('href') # get the page number of the last page
            last_page = ''.join(re.findall(r'\d+$', last_page))
            pages_url = ["{}?page={}".format(url, str(page)) for page in range(1, int(last_page) + 1)]
            y.append(pages_url)
    y = list(itertools.chain.from_iterable(y))
    list_all_pages_url = x + y
    
    return list_all_pages_url

## 3. Populate DataFrame with drama reviews

Data to include in the DataFrame:

1. Drama title - taken from the title of the HTML page
2. Name of reviewer - if a drama has no reviews, it is skipped
3. Ratings (overall, story, cast, music, rewatch value)
4. Review text

Each field is appended to its own list, and the lists are then combined to create the DataFrame. The titles of dramas that have no reviews are printed along the way.
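As a minimal sketch of how the ratings in item 3 can be separated into five numbers, the snippet below parses one ratings string with a regular expression. The function name `split_ratings` and the sample string are made up for illustration. The label order (Overall, Story, Acting/Cast, Music, Rewatch Value) follows the pattern used in the cells of this section. Unlike those cells, this sketch converts the values to `float` and returns `None` when a rating is missing.

```python
import re

# Rating labels in the order they appear in a review (assumed from the notebook's regex)
RATING_LABELS = ['Overall', 'Story', 'Acting/Cast', 'Music', 'Rewatch Value']


def split_ratings(text):
    """Return {label: float} for the five ratings, or None if any label is missing."""
    ratings = {}
    for label in RATING_LABELS:
        # label, one space, then an integer or decimal number
        match = re.search(re.escape(label) + r' ([0-9]+(?:\.[0-9]+)?)', text)
        if match is None:
            return None
        ratings[label] = float(match.group(1))
    return ratings


# Made-up sample in the same shape as the scraped ratings text
sample = 'Overall 7.5 Story 7.5 Acting/Cast 7.0 Music 7.5 Rewatch Value 7.0'
print(split_ratings(sample))
```

Returning `None` instead of raising lets a caller skip a malformed review without stopping the whole scrape.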

In [7]:
def all_pages_html(list_all_pages_url):
    list_review_soups = []
    
    for url in list_all_pages_url:
        resp = requests.get(url)
        soup = BeautifulSoup(resp.text, 'html.parser') # html of each drama reviews page
        list_review_soups.append(soup)
    return list_review_soups

In [8]:
def review_titles_users(list_review_soups):
    big_list_titles = []
    big_list_users = []

    for soup in list_review_soups:
        drama_title = soup.title.get_text().replace(' Reviews - MyDramaList','') # drama title
        review_user = soup.find_all('div',{'class':'review'})
        if review_user == []: # if drama has no reviews, print drama name and continue loop to next drama
            print(drama_title, 'has no reviews')
            continue

        list_titles = []
        list_users = []

        for review in review_user:
            list_titles.append(drama_title)
            x = review.get_text().lstrip().split() # tokens of the review header
            # the user name starts at token 2 for 'Ongoing' reviews and token 1 for 'Completed' ones,
            # and ends two tokens before the word 'people'
            start = {'Ongoing': 2, 'Completed': 1}.get(x[0]) if x else None
            y = None # stays None if the header layout is not recognised
            if start is not None:
                for n_words in range(5, 0, -1): # try 5-word names first, down to 1-word names
                    people_idx = start + n_words + 1
                    if len(x) > people_idx and x[people_idx] == 'people':
                        y = " ".join(x[start:start + n_words])
                        break
            list_users.append(y)
        big_list_titles.append(list_titles)
        big_list_users.append(list_users)

    big_list_titles = list(itertools.chain.from_iterable(big_list_titles))
    big_list_users = list(itertools.chain.from_iterable(big_list_users))
    return big_list_titles, big_list_users

In [9]:
def review_ratings_reviews(list_review_soups):
    big_list_ratings = []
    big_list_reviews = []

    for soup in list_review_soups:
        review_entire = soup.find_all('div',{'class':'col-sm-12 review-body'}) # entire review without user name
        read_more = 'Read More Was this review helpful to you? Yes No Cancel' # read_more string
        # pattern of the ratings block (rating type + rating, five times)
        rating_pattern = r"\bOverall\b [0-9.]+ \bStory\b [0-9.]+ \bActing/Cast\b [0-9.]+ \bMusic\b [0-9.]+ \bRewatch Value\b [0-9.]+"
        list_ratings = []
        list_reviews = []

        for review in review_entire:
            z = review.get_text().lstrip()
            z = re.sub(' +', ' ', z) # collapse repeated spaces
            ratings = re.findall(rating_pattern, z)
            list_ratings.append(ratings[0]) # get ratings (string, with format: rating type + rating)

            reviews = re.sub(rating_pattern, '', z) # get reviews text by removing the rating details
            reviews = reviews.replace(read_more,'') # get reviews text by removing the read_more text
            list_reviews.append(reviews)
        big_list_ratings.append(list_ratings)
        big_list_reviews.append(list_reviews)

    big_list_ratings = list(itertools.chain.from_iterable(big_list_ratings))
    big_list_reviews = list(itertools.chain.from_iterable(big_list_reviews))

    return big_list_ratings, big_list_reviews

In [10]:
def review_ratings_ind(big_list_ratings):
    list_rating_overall = []
    list_rating_story = []
    list_rating_cast = []
    list_rating_music = []
    list_rating_rewatch = []
    for i in range(0,len(big_list_ratings)):
        x = re.sub("[A-Za-z/]", " ", big_list_ratings[i]) # remove rating type, leaving the numeric ratings (still stored as strings)
        x = x.split() # split into 5 ratings to append to list
        list_rating_overall.append(x[0]) # overall rating
        list_rating_story.append(x[1]) # story rating
        list_rating_cast.append(x[2]) # cast rating
        list_rating_music.append(x[3]) # music rating
        list_rating_rewatch.append(x[4]) # rewatch value rating
    return list_rating_overall, list_rating_story, list_rating_cast, list_rating_music, list_rating_rewatch

In [11]:
def drama_df(big_list_titles, big_list_users, big_list_rating_overall, 
                               list_rating_story, list_rating_cast, list_rating_music, list_rating_rewatch, 
                               big_list_reviews):
    
    df = pd.DataFrame(list(zip(big_list_titles, big_list_users, big_list_rating_overall, 
                               list_rating_story, list_rating_cast, list_rating_music, list_rating_rewatch, 
                               big_list_reviews)), 
                   columns =['drama_title', 'user_name', 'overall_rating', 
                             'story_rating', 'cast_rating', 'music_rating', 'rewatch_value_rating', 
                             'reviews']) # create DataFrame with these variables & column names
    return df


## 4. Run the code and save the DataFrame to CSV

In [16]:
list_review_url = first_page_url(big_list_title)
list_review_html = first_page_html(list_review_url)
d = {'url':list_review_url, 'html':list_review_html}
list_all_pages_url = all_pages_url(d)
list_review_soups = all_pages_html(list_all_pages_url)
big_list_titles, big_list_users = review_titles_users(list_review_soups)
big_list_ratings, big_list_reviews = review_ratings_reviews(list_review_soups)
list_rating_overall, list_rating_story, list_rating_cast, list_rating_music, list_rating_rewatch = review_ratings_ind(big_list_ratings)
df = drama_df(big_list_titles, big_list_users, list_rating_overall, 
                               list_rating_story, list_rating_cast, list_rating_music, list_rating_rewatch, 
                               big_list_reviews)

Game Loon Rak (2009) has no reviews
Hatsu Taiken (2002) has no reviews
Duang Taa Nai Duang Jai (2011) has no reviews
Shitto no Kaori (2001) has no reviews
Only Love (2014) has no reviews
Kularb Satan (2011) has no reviews
Samee (1999) has no reviews
Fah Krajang Dao (2013) has no reviews
Club Friday The Series Season 2 (2012) has no reviews
Look Poo Chai Hua Jai Petch (2002) has no reviews
Yok Lai Mek (2009) has no reviews
Sen Tai Salai Sode (2011) has no reviews
Wu Dang I (2002) has no reviews

In [17]:
df.head()

| | drama_title | user_name | overall_rating | story_rating | cast_rating | music_rating | rewatch_value_rating | reviews |
|---|---|---|---|---|---|---|---|---|
| 0 | Sao Chai Hi-Tech (2010) | gwen_narnia | 7.5 | 7.5 | 7.0 | 7.5 | 7.0 | If you like to watch comedy lakorn, this is o... |
| 1 | Namtan Mai (2009) | mahiba | 8.0 | 8.0 | 10.0 | 7.0 | 6.0 | I'm new to the thai entertainment. Of the dou... |
| 2 | Hua Jai Chocolate (2005) | asiandramafan | 6.5 | 7.0 | 5.0 | 9.0 | 2.5 | Chun is a rich kid, he's snob, he has no mann... |
| 3 | Mon Jun Tra (2013) | MysteryMel-Bookish | 8.0 | 8.5 | 9.0 | 7.0 | 7.5 | 3M - S2 is the Best of 3M series for 3 reason... |
| 4 | Mon Jun Tra (2013) | gwen_narnia | 8.0 | 8.5 | 7.0 | 8.0 | 8.0 | The story plots are amazing with many element... |


In [18]:
df.to_csv(r'C:/Users/weich/Downloads/Data Science Portfolio/drama_reviews7.csv')