# Webscraping and Data Cleaning in Python
## Top 100 Thai Dramas on MyDramaList from 2000-2023

According to [MyDramaList](https://www.mydramalist.com "MyDramaList"), "MyDramaList.com is a community-based platform where Asian drama and movie fans can create their own list and etc". The website is essentially a large database of movies, shows, and dramas from various Asian countries that lets users create lists, use forums, and discuss all things Asian entertainment. This dataset ranks the Top 100 Thai Dramas based on ratings given by the site's users.

### Content / Data Dictionary
- Rank: Ranking on the MyDramaList website based on its popularity algorithm
- Title: Name of the drama
- Info: MyDramaList category for the country's drama
- Rating: Overall average of user-specified ratings, up to 10.0
- Summary: Short synopsis of the drama
- Trailer: Whether the drama has a viewable trailer on MyDramaList
- Year: Year the drama aired
- Episodes: Number of episodes
- Main_Actor: Actor/actress listed first on the drama page
- Content_Rating: Maturity rating suggested for suitability
- Genres: Genres the drama is listed in
- Aired: Date range the drama aired on Thai networks
- Users: Number of users who contributed to the rating
- Tags: Tags the drama is listed in

### Acknowledgements
This data was taken from the following website using webscraping via BeautifulSoup in Python: [Top 100 Thai Dramas from 2000-2023](https://mydramalist.com/search?adv=titles&ty=68&co=6&re=2000,2023&st=3&so=rated)

### Inspiration
I have been a huge Thai Drama fan since watching "Full House" in 2018. It is fun to integrate what I love with my interest in data science. Some interesting questions or tasks could include the following:

- Does the number of episodes affect a drama's rating?
- What are the most popular genres and themes (tags)? Do these have higher ratings?
- Do the cast members affect the ranking?

There could be so many more!

In [1]:
# Importing packages/libraries
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import numpy as np

After importing the libraries and packages, I created a few loops to extract data from the website for my analysis.

The steps I used were as follows:

- Create a file name for the csv that holds all of the webscraped data.
- Create a *while* loop to step through the needed pages of the specified url.
- Using the specified url, set up the request and soup variables.
- Use the soup variable with the findAll function to pull the drama data from the **'div class'** found through the webpage's inspect tool.
- Use the resulting variable Drama in a for loop to pull the data and separate it into columns.
- Append the columns to a list called *thaidramas* and write them to **"Top100Thai.csv"**.
- In an embedded loop, go through each link of the specified dramas and extract additional data to append to another list called *dramacontent*.
- Import the lists *thaidramas* and *dramacontent* into pandas DataFrames for further analysis.
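
The first few steps above can be tried on a static snippet, with no network access. This is only a minimal sketch, not the scraper itself: the `SAMPLE_HTML` markup, the `/12345-example` link, and the helper names `page_urls` and `parse_cards` are hypothetical stand-ins. Only the search URL and the `div` class string come from this notebook.

```python
from bs4 import BeautifulSoup

SEARCH_URL = ("https://mydramalist.com/search?adv=titles&ty=68&co=6"
              "&re=2000,2023&st=3&so=rated&page={page}")

# Hypothetical markup that mimics one result card; the real page has more fields.
SAMPLE_HTML = """
<div class="col-xs-9 row-cell content">
  <div>#189</div>
  <h6><a href="/12345-example">Thai Cave Rescue</a></h6>
  <span>Thai Drama - 2022, 6 episodes</span>
  <p>8.7</p>
</div>
"""

def page_urls(first=1, last=5):
    """Build one search URL per results page (pages 1-5 in this notebook)."""
    return [SEARCH_URL.format(page=p) for p in range(first, last + 1)]

def parse_cards(html):
    """Return one list of stripped text values per result card."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.find_all("div", attrs={"class": "col-xs-9 row-cell content"})
    rows = []
    for card in cards:
        # Only direct children, so nested tags are not split into extra columns
        children = card.find_all(recursive=False)
        rows.append([child.text.strip() for child in children])
    return rows

urls = page_urls()
rows = parse_cards(SAMPLE_HTML)
print(len(urls), rows)
```

Keeping the parsing in a function that takes an HTML string makes it possible to check the column extraction without sending requests to the site.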

## Webscraping

In [2]:
# Webscraping the website with beautifulsoup and requests to build the lists used for the dataframes.

file = 'Top100Thai.csv'
baseurl = 'https://mydramalist.com'
# Browser-like User-Agent header sent with every request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.62"}
page = 1
thaidramas = []
dramacontent = []

# The context manager closes the csv file even if a request fails part-way through
with open(file, 'w', newline='') as f:
    show = csv.writer(f)
    while page != 6:  # search result pages 1 to 5
        url = f"https://mydramalist.com/search?adv=titles&ty=68&co=6&re=2000,2023&st=3&so=rated&page={page}"
        request = requests.get(url, headers=headers)
        soup = BeautifulSoup(request.text, 'html')

        # Each drama on the search page sits in one of these div elements
        Drama = soup.findAll("div", attrs={"class": "col-xs-9 row-cell content"})

        for drama in Drama:
            # The direct children of the div become the columns of one row
            columns = drama.findChildren(recursive=False)
            columns = [data.text.strip() for data in columns]
            show.writerow(columns)
            thaidramas.append(columns)

            # Follow each link found in the div to the drama's own page
            for link in drama.find_all('a', href=True):
                embedurl = baseurl + link['href']
                req = requests.get(embedurl, headers=headers)
                detail = BeautifulSoup(req.content, 'html')

                actors = detail.findAll('b', attrs={"itempropx": "name"})
                aired = detail.findAll('li', attrs={"class": "list-item p-a-0", "xitemprop": "datePublished"})
                genres = detail.findAll('li', attrs={"class": "list-item p-a-0 show-genres"})
                contentrate = detail.findAll('li', attrs={"class": "list-item p-a-0 content-rating"})
                tags = detail.findAll('li', attrs={"class": "list-item p-a-0 show-tags"})
                users = detail.findAll('span', attrs={"class": "hft"})

                # zip stops at the shortest list, so a page missing any of these fields adds nothing
                for actor, air, genre, content, tag, user in zip(actors, aired, genres, contentrate, tags, users):
                    extdata = actor.text + " _ " + air.text + " _ " + genre.text + " _ " + content.text + " _ " + tag.text + " _ " + user.text
                    dramacontent.append(extdata)

        page = page + 1


In [3]:
# Convert list dramacontent to a dataframe
dramas = pd.DataFrame(dramacontent)
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 5000)
dramas.head()

Unnamed: 0,0
0,"Beam Papangkorn Lerkchaleampote _ Aired: Sep 22, 2022 _ Genres: Drama _ Content Rating: G - All Ages _ Tags: Survival, Based On True Story, Rescue Mission, Brave Male Lead, Thai Mythology, Clever Male Lead, Co-produced, Biographical, Bilingual Supporting Character, Bilingual Female Lead (Vote or add tags) _ (scored by 958 users)"
1,"James Teeradon Supapunpinyo _ Aired: Sep 9, 2017 - Nov 25, 2017 _ Genres: Psychological, Youth, Drama, Sports _ Content Rating: 13+ - Teens 13 or older _ Tags: Skaters, Depression, Psychology, Self-harm, Friendship, Father-Son Relationship, Suicide, Doctor Female Lead, Skateboarding, High School (Vote or add tags) _ (scored by 1,692 users)"
2,"Bella Ranee Campen _ Aired: Feb 21, 2018 - Apr 11, 2018 _ Genres: Historical, Comedy, Romance, Supernatural _ Content Rating: G - All Ages _ Tags: Sassy Female Lead, Time Travel, Transmigration, Soulmates, Fated Love, Eccentric Female Lead, Reincarnation, Strong Female Lead, Nice Male Lead, Hidden Identity (Vote or add tags) _ (scored by 3,080 users)"
3,"Nanon Korapat Kirdpan _ Aired: Aug 5, 2018 - Nov 4, 2018 _ Genres: Thriller, Mystery, Psychological, Supernatural _ Content Rating: 13+ - Teens 13 or older _ Tags: High School, Special Power, Multiple Mains, Adapted From A Novel, Corruption, Suspense, Hidden Talent, Teamwork, Friendship, Slight Romance (Vote or add tags) _ (scored by 10,248 users)"
4,"Win Metawin Opas-iamkajorn _ Aired: Jul 15, 2023 - Aug 5, 2023 _ Genres: Mystery, Horror, Youth, Supernatural _ Content Rating: 13+ - Teens 13 or older _ Tags: Teenager Female Lead, High School, Black Magic, Student Supporting Character, Student Female Lead, School Setting, Exorcist Male Lead, Teenager Supporting Character, Age Gap [Drama Life], Age Gap [Real Life] (Vote or add tags) _ (scored by 2,304 users)"



The DataFrame came over as one large column. It needed to be split into multiple columns so that each variable is separated. This was done with a **for** loop that splits each entry on the separator **"_"**. Each new column was named with the prefix **'col'** followed by a sequence number.
Once the columns were created, the original column was dropped from the dataframe as it was no longer needed.

Each **'col'** variable was then renamed based on its entries. The last step was cleaning up string information that was not needed. Some variables had strings beginning with the variable name followed by a ':'. These were removed with .str.replace(), replacing the unwanted text with an empty string. Other variables contained parenthesized text with unnecessary information. These were removed similarly.


In [4]:
# Splitting data into multiple columns then dropping original column as it is not needed.

for i in range(dramas.shape[0]):
    split = dramas.loc[i, 0].split('_')
    for j in range(len(split)):
        colname = f'col{j+1}'
        dramas.loc[i, colname] = split[j]
dramas.drop(columns=[0], inplace=True)
dramas.head()

Unnamed: 0,col1,col2,col3,col4,col5,col6
0,Beam Papangkorn Lerkchaleampote,"Aired: Sep 22, 2022",Genres: Drama,Content Rating: G - All Ages,"Tags: Survival, Based On True Story, Rescue Mission, Brave Male Lead, Thai Mythology, Clever Male Lead, Co-produced, Biographical, Bilingual Supporting Character, Bilingual Female Lead (Vote or add tags)",(scored by 958 users)
1,James Teeradon Supapunpinyo,"Aired: Sep 9, 2017 - Nov 25, 2017","Genres: Psychological, Youth, Drama, Sports",Content Rating: 13+ - Teens 13 or older,"Tags: Skaters, Depression, Psychology, Self-harm, Friendship, Father-Son Relationship, Suicide, Doctor Female Lead, Skateboarding, High School (Vote or add tags)","(scored by 1,692 users)"
2,Bella Ranee Campen,"Aired: Feb 21, 2018 - Apr 11, 2018","Genres: Historical, Comedy, Romance, Supernatural",Content Rating: G - All Ages,"Tags: Sassy Female Lead, Time Travel, Transmigration, Soulmates, Fated Love, Eccentric Female Lead, Reincarnation, Strong Female Lead, Nice Male Lead, Hidden Identity (Vote or add tags)","(scored by 3,080 users)"
3,Nanon Korapat Kirdpan,"Aired: Aug 5, 2018 - Nov 4, 2018","Genres: Thriller, Mystery, Psychological, Supernatural",Content Rating: 13+ - Teens 13 or older,"Tags: High School, Special Power, Multiple Mains, Adapted From A Novel, Corruption, Suspense, Hidden Talent, Teamwork, Friendship, Slight Romance (Vote or add tags)","(scored by 10,248 users)"
4,Win Metawin Opas-iamkajorn,"Aired: Jul 15, 2023 - Aug 5, 2023","Genres: Mystery, Horror, Youth, Supernatural",Content Rating: 13+ - Teens 13 or older,"Tags: Teenager Female Lead, High School, Black Magic, Student Supporting Character, Student Female Lead, School Setting, Exorcist Male Lead, Teenager Supporting Character, Age Gap [Drama Life], Age Gap [Real Life] (Vote or add tags)","(scored by 2,304 users)"


In [5]:
# Renaming the generic col1-col6 columns to descriptive names

dramas.rename(columns = {'col1': 'Main_Actor', 'col2': 'Aired', 'col3': 'Genres', 'col4': 'Content_Rating', 'col5':'Tags', 'col6':'Users'}, inplace=True)

In [6]:
# Removing strings that are unneeded within the values

# regex=False treats each pattern as literal text, so the unbalanced parentheses below are safe
dramas['Content_Rating'] = dramas['Content_Rating'].str.replace('Content Rating:', '', regex=False)
dramas['Genres'] = dramas['Genres'].str.replace('Genres:', '', regex=False)
dramas['Aired'] = dramas['Aired'].str.replace('Aired:', '', regex=False)
dramas['Users'] = dramas['Users'].str.replace('(scored by', '', regex=False)
dramas['Users'] = dramas['Users'].str.replace('users)', '', regex=False)
dramas['Tags'] = dramas['Tags'].str.replace('Tags:', '', regex=False)
dramas['Tags'] = dramas['Tags'].str.replace('(Vote or add tags)', '', regex=False)
dramas.head()

Unnamed: 0,Main_Actor,Aired,Genres,Content_Rating,Tags,Users
0,Beam Papangkorn Lerkchaleampote,"Sep 22, 2022",Drama,G - All Ages,"Survival, Based On True Story, Rescue Mission, Brave Male Lead, Thai Mythology, Clever Male Lead, Co-produced, Biographical, Bilingual Supporting Character, Bilingual Female Lead",958
1,James Teeradon Supapunpinyo,"Sep 9, 2017 - Nov 25, 2017","Psychological, Youth, Drama, Sports",13+ - Teens 13 or older,"Skaters, Depression, Psychology, Self-harm, Friendship, Father-Son Relationship, Suicide, Doctor Female Lead, Skateboarding, High School",1692
2,Bella Ranee Campen,"Feb 21, 2018 - Apr 11, 2018","Historical, Comedy, Romance, Supernatural",G - All Ages,"Sassy Female Lead, Time Travel, Transmigration, Soulmates, Fated Love, Eccentric Female Lead, Reincarnation, Strong Female Lead, Nice Male Lead, Hidden Identity",3080
3,Nanon Korapat Kirdpan,"Aug 5, 2018 - Nov 4, 2018","Thriller, Mystery, Psychological, Supernatural",13+ - Teens 13 or older,"High School, Special Power, Multiple Mains, Adapted From A Novel, Corruption, Suspense, Hidden Talent, Teamwork, Friendship, Slight Romance",10248
4,Win Metawin Opas-iamkajorn,"Jul 15, 2023 - Aug 5, 2023","Mystery, Horror, Youth, Supernatural",13+ - Teens 13 or older,"Teenager Female Lead, High School, Black Magic, Student Supporting Character, Student Female Lead, School Setting, Exorcist Male Lead, Teenager Supporting Character, Age Gap [Drama Life], Age Gap [Real Life]",2304
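
The split, rename, and clean-up steps above can also be written without a row-by-row loop. The following is a sketch of one possible alternative, not the approach used in this notebook. It assumes pandas 1.4 or newer (for `str.removeprefix` and `str.removesuffix`) and uses two rows from the output above with shortened tag lists.

```python
import pandas as pd

# Two rows in the same shape as dramacontent; the tag lists are shortened here.
raw = pd.Series([
    "Beam Papangkorn Lerkchaleampote _ Aired: Sep 22, 2022 _ Genres: Drama _ "
    "Content Rating: G - All Ages _ Tags: Survival, Rescue Mission (Vote or add tags) _ "
    "(scored by 958 users)",
    "James Teeradon Supapunpinyo _ Aired: Sep 9, 2017 - Nov 25, 2017 _ "
    "Genres: Psychological, Youth, Drama, Sports _ Content Rating: 13+ - Teens 13 or older _ "
    "Tags: Skaters, Depression (Vote or add tags) _ (scored by 1,692 users)",
])
names = ["Main_Actor", "Aired", "Genres", "Content_Rating", "Tags", "Users"]

# Splitting on " _ " (with spaces) avoids leftover padding around each value
parts = raw.str.split(" _ ", expand=True, regex=False)
parts.columns = names

# Strip the leading labels and the trailing tag prompt
prefixes = {"Aired": "Aired: ", "Genres": "Genres: ",
            "Content_Rating": "Content Rating: ", "Tags": "Tags: "}
for col, prefix in prefixes.items():
    parts[col] = parts[col].str.removeprefix(prefix)
parts["Tags"] = parts["Tags"].str.removesuffix(" (Vote or add tags)")

# Keep only the digits and commas of "(scored by 1,692 users)", then convert to int
parts["Users"] = (parts["Users"].str.extract(r"([\d,]+)", expand=False)
                  .str.replace(",", "", regex=False).astype(int))
print(parts)
```

Removing a prefix or suffix, rather than replacing a substring anywhere in the value, leaves a tag or genre untouched if it happens to contain the same text.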


In [7]:
# View the shape of the data to view the total number of entries(rows) and the total amount of attributes(columns).

dramas.shape

(100, 6)


We can see that there were 100 entries across 6 columns in this first dataset. The dataset was saved to a CSV file in case it is needed for another project at a later time.


In [8]:
# Save a copy of the dramas dataframe (built from dramacontent) to a csv file.

dramas.to_csv(r'DramasExtended.csv')


After completing the import of the first DataFrame, dramas, the second list, **thaidramas**, was imported into an additional DataFrame called thai. Since the thaidramas list was also written to a csv during the webscraping step, the exported csv was reviewed to count the number of columns with data captured, which made naming the variables easy.


In [9]:
#import second list, thaidramas into a dataframe

thai = pd.DataFrame(thaidramas, columns=["Rank", "Title", "DramaData", "Rating", "Summary", "Trailer", "Online", "None"])
thai[['DramaData','Year']] = thai['DramaData'].str.split('-', expand = True)
thai[['Year','Episodes']] = thai['Year'].str.split(',', expand = True)
thai.head()

Unnamed: 0,Rank,Title,DramaData,Rating,Summary,Trailer,Online,None,Year,Episodes
0,#189,Thai Cave Rescue,Thai Drama,8.7,The Limited Series is based on 2018’s world-famous event. Twelve boys from the same football team decided to spend an afternoon with their coach exploring the Tham Luang caves in northern Thailand. When heavy rainfall…,,,,2022,6 episodes
1,#330,Project S: Skate Our Souls,Thai Drama,8.6,"Boo, a student whose grades aren't promising, knows well that he falls short of his father's standards. Quietly, Boo has been suffering from depression for a long time. He doesn't eat or sleep well and has no friends,…",Watch Trailer,,,2017,8 episodes
2,#349,Love Destiny,Thai Drama,8.6,"This is a story where karma, merit, love destiny and a moon mantra combine to fling Kadesurang, a chubby archaeologist, into the body of another woman, Karakade, during the Ayutthaya era (300 years earlier). In the past,…",Watch Trailer,,,2018,15 episodes
3,#382,The Gifted,Thai Drama,8.6,"Ritdha Wittayakom High School has a ""Gifted Program."" The program offers special classes to a handful of ""special"" students chosen by the school administration. Incredibly, Pang, a tenth-year student from the lowest…",Watch Trailer,,,2018,13 episodes
4,#391,Enigma,Thai Drama,8.6,"There is something wrong with Fa's high school; strange events have been happening around her. In which way are they related to the new teacher, Ajin?\n\n(Source: MyDramaList)",Watch Trailer,,,2023,4 episodes


In [10]:
#removing string text from entries that are unneeded

thai['Episodes'] = thai['Episodes'].str.replace('episodes', '')
thai['Rank'] = thai['Rank'].str.replace('#', '')
thai.head()

Unnamed: 0,Rank,Title,DramaData,Rating,Summary,Trailer,Online,None,Year,Episodes
0,189,Thai Cave Rescue,Thai Drama,8.7,The Limited Series is based on 2018’s world-famous event. Twelve boys from the same football team decided to spend an afternoon with their coach exploring the Tham Luang caves in northern Thailand. When heavy rainfall…,,,,2022,6
1,330,Project S: Skate Our Souls,Thai Drama,8.6,"Boo, a student whose grades aren't promising, knows well that he falls short of his father's standards. Quietly, Boo has been suffering from depression for a long time. He doesn't eat or sleep well and has no friends,…",Watch Trailer,,,2017,8
2,349,Love Destiny,Thai Drama,8.6,"This is a story where karma, merit, love destiny and a moon mantra combine to fling Kadesurang, a chubby archaeologist, into the body of another woman, Karakade, during the Ayutthaya era (300 years earlier). In the past,…",Watch Trailer,,,2018,15
3,382,The Gifted,Thai Drama,8.6,"Ritdha Wittayakom High School has a ""Gifted Program."" The program offers special classes to a handful of ""special"" students chosen by the school administration. Incredibly, Pang, a tenth-year student from the lowest…",Watch Trailer,,,2018,13
4,391,Enigma,Thai Drama,8.6,"There is something wrong with Fa's high school; strange events have been happening around her. In which way are they related to the new teacher, Ajin?\n\n(Source: MyDramaList)",Watch Trailer,,,2023,4
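
The two-step split on '-' and ',' followed by the text removal can be condensed into one regular expression. This is a hedged sketch rather than the method used above. It assumes the raw text looks like `Thai Drama - 2022, 6 episodes`, which is inferred from the split results and is not shown directly in this notebook.

```python
import pandas as pd

# Assumed raw shape of the DramaData text before any splitting
info = pd.Series(["Thai Drama - 2022, 6 episodes", "Thai Drama - 2017, 8 episodes"])

# Named groups become the column names of the extracted frame
pattern = r"^(?P<DramaData>.+?) - (?P<Year>\d{4}), (?P<Episodes>\d+) episodes?$"
fields = info.str.extract(pattern)
fields = fields.astype({"Year": int, "Episodes": int})

# The rank only needs its leading '#' removed before conversion
rank = pd.Series(["#189", "#330"]).str.lstrip("#").astype(int)
print(fields)
print(rank.tolist())
```

A row that does not match the pattern comes back as missing values, which would make the integer conversion fail and so surfaces unexpected formats early.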


In [11]:
#view the shape of the second dataframe

thai.shape

(100, 10)

## Data Cleaning of Combined Data

In [12]:
#concatenate both dataframes together horizontally

thaidramas = pd.concat([thai, dramas], axis=1)
thaidramas.head()

Unnamed: 0,Rank,Title,DramaData,Rating,Summary,Trailer,Online,None,Year,Episodes,Main_Actor,Aired,Genres,Content_Rating,Tags,Users
0,189,Thai Cave Rescue,Thai Drama,8.7,The Limited Series is based on 2018’s world-famous event. Twelve boys from the same football team decided to spend an afternoon with their coach exploring the Tham Luang caves in northern Thailand. When heavy rainfall…,,,,2022,6,Beam Papangkorn Lerkchaleampote,"Sep 22, 2022",Drama,G - All Ages,"Survival, Based On True Story, Rescue Mission, Brave Male Lead, Thai Mythology, Clever Male Lead, Co-produced, Biographical, Bilingual Supporting Character, Bilingual Female Lead",958
1,330,Project S: Skate Our Souls,Thai Drama,8.6,"Boo, a student whose grades aren't promising, knows well that he falls short of his father's standards. Quietly, Boo has been suffering from depression for a long time. He doesn't eat or sleep well and has no friends,…",Watch Trailer,,,2017,8,James Teeradon Supapunpinyo,"Sep 9, 2017 - Nov 25, 2017","Psychological, Youth, Drama, Sports",13+ - Teens 13 or older,"Skaters, Depression, Psychology, Self-harm, Friendship, Father-Son Relationship, Suicide, Doctor Female Lead, Skateboarding, High School",1692
2,349,Love Destiny,Thai Drama,8.6,"This is a story where karma, merit, love destiny and a moon mantra combine to fling Kadesurang, a chubby archaeologist, into the body of another woman, Karakade, during the Ayutthaya era (300 years earlier). In the past,…",Watch Trailer,,,2018,15,Bella Ranee Campen,"Feb 21, 2018 - Apr 11, 2018","Historical, Comedy, Romance, Supernatural",G - All Ages,"Sassy Female Lead, Time Travel, Transmigration, Soulmates, Fated Love, Eccentric Female Lead, Reincarnation, Strong Female Lead, Nice Male Lead, Hidden Identity",3080
3,382,The Gifted,Thai Drama,8.6,"Ritdha Wittayakom High School has a ""Gifted Program."" The program offers special classes to a handful of ""special"" students chosen by the school administration. Incredibly, Pang, a tenth-year student from the lowest…",Watch Trailer,,,2018,13,Nanon Korapat Kirdpan,"Aug 5, 2018 - Nov 4, 2018","Thriller, Mystery, Psychological, Supernatural",13+ - Teens 13 or older,"High School, Special Power, Multiple Mains, Adapted From A Novel, Corruption, Suspense, Hidden Talent, Teamwork, Friendship, Slight Romance",10248
4,391,Enigma,Thai Drama,8.6,"There is something wrong with Fa's high school; strange events have been happening around her. In which way are they related to the new teacher, Ajin?\n\n(Source: MyDramaList)",Watch Trailer,,,2023,4,Win Metawin Opas-iamkajorn,"Jul 15, 2023 - Aug 5, 2023","Mystery, Horror, Youth, Supernatural",13+ - Teens 13 or older,"Teenager Female Lead, High School, Black Magic, Student Supporting Character, Student Female Lead, School Setting, Exorcist Male Lead, Teenager Supporting Character, Age Gap [Drama Life], Age Gap [Real Life]",2304


In [13]:
#view columns, non null counts, and data types

thaidramas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Rank            100 non-null    object
 1   Title           100 non-null    object
 2   DramaData       100 non-null    object
 3   Rating          100 non-null    object
 4   Summary         100 non-null    object
 5   Trailer         93 non-null     object
 6   Online          2 non-null      object
 7   None            2 non-null      object
 8   Year            100 non-null    object
 9   Episodes        100 non-null    object
 10  Main_Actor      100 non-null    object
 11  Aired           100 non-null    object
 12  Genres          100 non-null    object
 13  Content_Rating  100 non-null    object
 14  Tags            100 non-null    object
 15  Users           100 non-null    object
dtypes: object(16)
memory usage: 12.6+ KB


In [14]:
#exploring variables with less than 100 non-null values to determine what they are

thaidramas['Online'].unique()

array([None, 'Watch Online'], dtype=object)

In [15]:
thaidramas['None'].unique()

array([None, ''], dtype=object)

In [16]:
thaidramas['Trailer'].unique()

array([None, 'Watch Trailer'], dtype=object)

In [17]:
#dropped unneeded columns from dataframe

thaidramas.drop(columns=['Online','None', 'DramaData'], inplace=True)

In [18]:
#converting entries into 0, 1 values

thaidramas['Trailer']=thaidramas['Trailer'].replace('Watch Trailer', 1)
thaidramas['Trailer'].fillna(0, inplace=True)
thaidramas.head()

Unnamed: 0,Rank,Title,Rating,Summary,Trailer,Year,Episodes,Main_Actor,Aired,Genres,Content_Rating,Tags,Users
0,189,Thai Cave Rescue,8.7,The Limited Series is based on 2018’s world-famous event. Twelve boys from the same football team decided to spend an afternoon with their coach exploring the Tham Luang caves in northern Thailand. When heavy rainfall…,0.0,2022,6,Beam Papangkorn Lerkchaleampote,"Sep 22, 2022",Drama,G - All Ages,"Survival, Based On True Story, Rescue Mission, Brave Male Lead, Thai Mythology, Clever Male Lead, Co-produced, Biographical, Bilingual Supporting Character, Bilingual Female Lead",958
1,330,Project S: Skate Our Souls,8.6,"Boo, a student whose grades aren't promising, knows well that he falls short of his father's standards. Quietly, Boo has been suffering from depression for a long time. He doesn't eat or sleep well and has no friends,…",1.0,2017,8,James Teeradon Supapunpinyo,"Sep 9, 2017 - Nov 25, 2017","Psychological, Youth, Drama, Sports",13+ - Teens 13 or older,"Skaters, Depression, Psychology, Self-harm, Friendship, Father-Son Relationship, Suicide, Doctor Female Lead, Skateboarding, High School",1692
2,349,Love Destiny,8.6,"This is a story where karma, merit, love destiny and a moon mantra combine to fling Kadesurang, a chubby archaeologist, into the body of another woman, Karakade, during the Ayutthaya era (300 years earlier). In the past,…",1.0,2018,15,Bella Ranee Campen,"Feb 21, 2018 - Apr 11, 2018","Historical, Comedy, Romance, Supernatural",G - All Ages,"Sassy Female Lead, Time Travel, Transmigration, Soulmates, Fated Love, Eccentric Female Lead, Reincarnation, Strong Female Lead, Nice Male Lead, Hidden Identity",3080
3,382,The Gifted,8.6,"Ritdha Wittayakom High School has a ""Gifted Program."" The program offers special classes to a handful of ""special"" students chosen by the school administration. Incredibly, Pang, a tenth-year student from the lowest…",1.0,2018,13,Nanon Korapat Kirdpan,"Aug 5, 2018 - Nov 4, 2018","Thriller, Mystery, Psychological, Supernatural",13+ - Teens 13 or older,"High School, Special Power, Multiple Mains, Adapted From A Novel, Corruption, Suspense, Hidden Talent, Teamwork, Friendship, Slight Romance",10248
4,391,Enigma,8.6,"There is something wrong with Fa's high school; strange events have been happening around her. In which way are they related to the new teacher, Ajin?\n\n(Source: MyDramaList)",1.0,2023,4,Win Metawin Opas-iamkajorn,"Jul 15, 2023 - Aug 5, 2023","Mystery, Horror, Youth, Supernatural",13+ - Teens 13 or older,"Teenager Female Lead, High School, Black Magic, Student Supporting Character, Student Female Lead, School Setting, Exorcist Male Lead, Teenager Supporting Character, Age Gap [Drama Life], Age Gap [Real Life]",2304



In the brief overview of the columns, Rank, Rating, Trailer, Year, Episodes, and Users all hold numeric values but are stored as objects. This needed to be corrected. Rank, Year, Episodes, and Users were converted to integers. Two columns needed extra handling: Trailer and Users. Trailer is categorical, with 1 being True and 0 being False, so it was converted to a Boolean. Users had a comma separating the thousands, which was removed before converting to an integer. Rating is not converted in the cell below and remains an object.


In [19]:
#converting datatypes based on observations

thaidramas['Users'] = thaidramas['Users'].str.replace(',','')
thaidramas['Users'] = thaidramas['Users'].astype(str).astype(int)
thaidramas['Rank'] = thaidramas['Rank'].astype(int)
thaidramas['Trailer'] = thaidramas['Trailer'].astype(bool)
thaidramas['Year'] = thaidramas['Year'].astype(str).astype(int)
thaidramas['Episodes'] = thaidramas['Episodes'].astype(str).astype(int)

In [20]:
#reviewing dataframe information to ensure datatypes have been resolved.

thaidramas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Rank            100 non-null    int32 
 1   Title           100 non-null    object
 2   Rating          100 non-null    object
 3   Summary         100 non-null    object
 4   Trailer         100 non-null    bool  
 5   Year            100 non-null    int32 
 6   Episodes        100 non-null    int32 
 7   Main_Actor      100 non-null    object
 8   Aired           100 non-null    object
 9   Genres          100 non-null    object
 10  Content_Rating  100 non-null    object
 11  Tags            100 non-null    object
 12  Users           100 non-null    int32 
dtypes: bool(1), int32(4), object(8)
memory usage: 8.0+ KB
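
Since Rating is still an object column at this point, here is a small sketch of how the conversions could also cover it. It uses `pd.to_numeric` on a toy frame built from values shown in the outputs above. This is an illustration under those assumptions, not a cell from the original run.

```python
import pandas as pd

# Toy frame using values shown in the outputs above
df = pd.DataFrame({
    "Rank": ["189", "330"],
    "Rating": ["8.7", "8.6"],
    "Trailer": [None, "Watch Trailer"],
    "Users": ["958", "1,692"],
})

df["Rank"] = pd.to_numeric(df["Rank"])
df["Rating"] = pd.to_numeric(df["Rating"])  # decimal strings become floats
df["Users"] = pd.to_numeric(df["Users"].str.replace(",", "", regex=False))
df["Trailer"] = df["Trailer"].eq("Watch Trailer")  # True only where a trailer exists
print(df.dtypes)
```

With Rating stored as a float, questions such as whether the number of episodes relates to the rating can be explored with numeric operations directly.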



We can see that there are three different data types within the combined dataset: 1 Boolean, 4 integers, and 8 objects, for a total of 13 columns, each with 100 entries.

Next, the 'Aired' column needed to be reformatted so the dates can be used in future analyses. The 'Aired' column showed a range of dates representing the start and end of the drama's initial airing on television in Thailand. This was split into two new columns, which were then converted to datetime format.


In [21]:
# splitting Aired Column into Start and End Aired dates.

thaidramas[['Aired_Start', 'Aired_End']] = thaidramas['Aired'].str.split(' - ', expand=True)
thaidramas.drop(columns='Aired', inplace=True)
thaidramas.head()

Unnamed: 0,Rank,Title,Rating,Summary,Trailer,Year,Episodes,Main_Actor,Genres,Content_Rating,Tags,Users,Aired_Start,Aired_End
0,189,Thai Cave Rescue,8.7,The Limited Series is based on 2018’s world-famous event. Twelve boys from the same football team decided to spend an afternoon with their coach exploring the Tham Luang caves in northern Thailand. When heavy rainfall…,False,2022,6,Beam Papangkorn Lerkchaleampote,Drama,G - All Ages,"Survival, Based On True Story, Rescue Mission, Brave Male Lead, Thai Mythology, Clever Male Lead, Co-produced, Biographical, Bilingual Supporting Character, Bilingual Female Lead",958,"Sep 22, 2022",
1,330,Project S: Skate Our Souls,8.6,"Boo, a student whose grades aren't promising, knows well that he falls short of his father's standards. Quietly, Boo has been suffering from depression for a long time. He doesn't eat or sleep well and has no friends,…",True,2017,8,James Teeradon Supapunpinyo,"Psychological, Youth, Drama, Sports",13+ - Teens 13 or older,"Skaters, Depression, Psychology, Self-harm, Friendship, Father-Son Relationship, Suicide, Doctor Female Lead, Skateboarding, High School",1692,"Sep 9, 2017","Nov 25, 2017"
2,349,Love Destiny,8.6,"This is a story where karma, merit, love destiny and a moon mantra combine to fling Kadesurang, a chubby archaeologist, into the body of another woman, Karakade, during the Ayutthaya era (300 years earlier). In the past,…",True,2018,15,Bella Ranee Campen,"Historical, Comedy, Romance, Supernatural",G - All Ages,"Sassy Female Lead, Time Travel, Transmigration, Soulmates, Fated Love, Eccentric Female Lead, Reincarnation, Strong Female Lead, Nice Male Lead, Hidden Identity",3080,"Feb 21, 2018","Apr 11, 2018"
3,382,The Gifted,8.6,"Ritdha Wittayakom High School has a ""Gifted Program."" The program offers special classes to a handful of ""special"" students chosen by the school administration. Incredibly, Pang, a tenth-year student from the lowest…",True,2018,13,Nanon Korapat Kirdpan,"Thriller, Mystery, Psychological, Supernatural",13+ - Teens 13 or older,"High School, Special Power, Multiple Mains, Adapted From A Novel, Corruption, Suspense, Hidden Talent, Teamwork, Friendship, Slight Romance",10248,"Aug 5, 2018","Nov 4, 2018"
4,391,Enigma,8.6,"There is something wrong with Fa's high school; strange events have been happening around her. In which way are they related to the new teacher, Ajin?\n\n(Source: MyDramaList)",True,2023,4,Win Metawin Opas-iamkajorn,"Mystery, Horror, Youth, Supernatural",13+ - Teens 13 or older,"Teenager Female Lead, High School, Black Magic, Student Supporting Character, Student Female Lead, School Setting, Exorcist Male Lead, Teenager Supporting Character, Age Gap [Drama Life], Age Gap [Real Life]",2304,"Jul 15, 2023","Aug 5, 2023"


In [22]:
#converting aired start and end dates to datetime format

thaidramas['Aired_Start']= pd.to_datetime(thaidramas['Aired_Start'], format='mixed')
thaidramas['Aired_End']= pd.to_datetime(thaidramas['Aired_End'], format='mixed')
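
As a sketch of the same split-and-convert step with a stricter parse, the example below passes an explicit format string instead of `format='mixed'`. It assumes every date follows the `Mon D, YYYY` pattern seen in the outputs above, which may not hold for all 100 rows.

```python
import pandas as pd

aired = pd.Series(["Sep 22, 2022", "Sep 9, 2017 - Nov 25, 2017"])

# n=1 guarantees at most two pieces; single-date rows get a missing end value
dates = aired.str.split(" - ", n=1, expand=True, regex=False)
dates.columns = ["Aired_Start", "Aired_End"]

# An explicit format raises on unexpected strings instead of guessing
for col in dates.columns:
    dates[col] = pd.to_datetime(dates[col], format="%b %d, %Y")
print(dates)
```

Rows with a single date end up with `NaT` in Aired_End, which is why a null check follows the conversion.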

In [23]:
#Checking to see if there are any Null values

thaidramas.isnull().sum()

Rank              0
Title             0
Rating            0
Summary           0
Trailer           0
Year              0
Episodes          0
Main_Actor        0
Genres            0
Content_Rating    0
Tags              0
Users             0
Aired_Start       0
Aired_End         2
dtype: int64


Two null values were found in the Aired_End variable. For these dramas, all episodes aired on a single date as a marathon, so no end date was listed. This was cleaned up by setting the end date to the start date.


In [24]:
#cleaning up Null values found within the variable Aired End. The Null value is notated as NaT which represents Not a Time.
## Capped values to Aired_Start date

# isnull() must be called so np.where receives a Boolean mask; only missing end dates are replaced
thaidramas['Aired_End'] = np.where(thaidramas['Aired_End'].isnull(), 
                                   thaidramas['Aired_Start'], thaidramas['Aired_End'])
thaidramas.head()

Unnamed: 0,Rank,Title,Rating,Summary,Trailer,Year,Episodes,Main_Actor,Genres,Content_Rating,Tags,Users,Aired_Start,Aired_End
0,189,Thai Cave Rescue,8.7,The Limited Series is based on 2018’s world-famous event. Twelve boys from the same football team decided to spend an afternoon with their coach exploring the Tham Luang caves in northern Thailand. When heavy rainfall…,False,2022,6,Beam Papangkorn Lerkchaleampote,Drama,G - All Ages,"Survival, Based On True Story, Rescue Mission, Brave Male Lead, Thai Mythology, Clever Male Lead, Co-produced, Biographical, Bilingual Supporting Character, Bilingual Female Lead",958,2022-09-22,2022-09-22
1,330,Project S: Skate Our Souls,8.6,"Boo, a student whose grades aren't promising, knows well that he falls short of his father's standards. Quietly, Boo has been suffering from depression for a long time. He doesn't eat or sleep well and has no friends,…",True,2017,8,James Teeradon Supapunpinyo,"Psychological, Youth, Drama, Sports",13+ - Teens 13 or older,"Skaters, Depression, Psychology, Self-harm, Friendship, Father-Son Relationship, Suicide, Doctor Female Lead, Skateboarding, High School",1692,2017-09-09,2017-11-25
2,349,Love Destiny,8.6,"This is a story where karma, merit, love destiny and a moon mantra combine to fling Kadesurang, a chubby archaeologist, into the body of another woman, Karakade, during the Ayutthaya era (300 years earlier). In the past,…",True,2018,15,Bella Ranee Campen,"Historical, Comedy, Romance, Supernatural",G - All Ages,"Sassy Female Lead, Time Travel, Transmigration, Soulmates, Fated Love, Eccentric Female Lead, Reincarnation, Strong Female Lead, Nice Male Lead, Hidden Identity",3080,2018-02-21,2018-04-11
3,382,The Gifted,8.6,"Ritdha Wittayakom High School has a ""Gifted Program."" The program offers special classes to a handful of ""special"" students chosen by the school administration. Incredibly, Pang, a tenth-year student from the lowest…",True,2018,13,Nanon Korapat Kirdpan,"Thriller, Mystery, Psychological, Supernatural",13+ - Teens 13 or older,"High School, Special Power, Multiple Mains, Adapted From A Novel, Corruption, Suspense, Hidden Talent, Teamwork, Friendship, Slight Romance",10248,2018-08-05,2018-11-04
4,391,Enigma,8.6,"There is something wrong with Fa's high school; strange events have been happening around her. In which way are they related to the new teacher, Ajin?\n\n(Source: MyDramaList)",True,2023,4,Win Metawin Opas-iamkajorn,"Mystery, Horror, Youth, Supernatural",13+ - Teens 13 or older,"Teenager Female Lead, High School, Black Magic, Student Supporting Character, Student Female Lead, School Setting, Exorcist Male Lead, Teenager Supporting Character, Age Gap [Drama Life], Age Gap [Real Life]",2304,2023-07-15,2023-08-05
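
The same capping can be expressed with `fillna`, which only touches the missing entries. This is a minimal sketch on a toy frame, offered as an alternative to the `np.where` call rather than as the code that produced the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "Aired_Start": pd.to_datetime(["2022-09-22", "2017-09-09"]),
    "Aired_End": pd.to_datetime([None, "2017-11-25"]),
})

# Fill only the missing end dates; existing end dates are left untouched
df["Aired_End"] = df["Aired_End"].fillna(df["Aired_Start"])
print(df)
```

Because `fillna` aligns the two columns on the index, each missing end date receives the start date from its own row.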


The final step within this webscraping project was to ensure the dataset had no more null values and no duplicates, and to confirm the final shape. No duplicates were found within the data and all the null values were resolved.

In [25]:
#reviewed nulls again to ensure properly cleaned.

thaidramas.isnull().sum()

Rank              0
Title             0
Rating            0
Summary           0
Trailer           0
Year              0
Episodes          0
Main_Actor        0
Genres            0
Content_Rating    0
Tags              0
Users             0
Aired_Start       0
Aired_End         0
dtype: int64

In [26]:
#Checked for duplicated entries

thaidramas.duplicated()

0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Length: 100, dtype: bool
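
The truncated Boolean listing above has to be scanned by eye. Summing it gives a single count instead. The sketch below uses a toy frame with one deliberately repeated row, so the data here is hypothetical and not taken from the scraped dataset.

```python
import pandas as pd

# Toy frame with one deliberately duplicated row
df = pd.DataFrame({
    "Title": ["Love Destiny", "The Gifted", "The Gifted"],
    "Year": [2018, 2018, 2018],
})

n_dupes = int(df.duplicated().sum())                        # fully identical rows
n_title_dupes = int(df.duplicated(subset=["Title"]).sum())  # repeated titles only
deduped = df.drop_duplicates().reset_index(drop=True)
print(n_dupes, n_title_dupes, len(deduped))
```

Checking a subset such as the title can catch a drama that was scraped twice even when another column differs between the two rows.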

In [27]:
#viewed shape of full dataset

thaidramas.shape

(100, 14)

The final dataset had a total of 100 entries and 14 attributes. This cleaned dataset was saved and exported into a csv file called *ThaiDramas100_Clean.csv*.


In [28]:
#save cleaned dataset to csv.

thaidramas.to_csv(r'ThaiDramas100_Clean.csv')
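
When the saved file is loaded again, the index written by `to_csv` comes back as an extra "Unnamed: 0" column and the dates come back as text. The sketch below shows one way to avoid both, using an in-memory buffer in place of a file on disk. The toy frame is an assumption for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "Title": ["Thai Cave Rescue"],
    "Aired_Start": pd.to_datetime(["2022-09-22"]),
})

buffer = io.StringIO()           # stands in for a file on disk
df.to_csv(buffer, index=False)   # index=False keeps the row index out of the file
buffer.seek(0)

# parse_dates restores the datetime dtype that plain csv text cannot carry
reloaded = pd.read_csv(buffer, parse_dates=["Aired_Start"])
print(reloaded.dtypes)
```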

## Thank you!
Thank you for taking the time to view my webscraping project. Feel free to explore the code and the compiled csv files.