# Concatenate DataFrames

Because I scraped the website in batches, the data ended up spread across several CSV files. This notebook joins them into a single DataFrame and saves it as one CSV file.

## 1. Import Pandas Library

In [1]:
import pandas as pd

## 2. Read CSV File and Concatenate All Files

The folder also contained unrelated CSV files, so I did not use `os` or `glob` to pick up every CSV file in it. Instead, I listed the file names explicitly, looped through the list to read each file, and then concatenated the resulting DataFrames.

In [7]:
# names of CSV files
csvfiles = ['drama_list1.csv','drama_list2.csv','drama_list3.csv','drama_list4.csv','drama_list5.csv','drama_list6.csv']

# loop through the files and read them in with pandas
dataframes = []  # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile)
    dataframes.append(df)

# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)
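
As an aside, when the wanted files share a naming scheme, a narrower glob pattern may be enough to skip the unrelated files. The sketch below is only an illustration: the helper name `concat_matching_csvs` and the pattern `drama_list[0-9]*.csv` are assumptions, chosen so that the numbered batch files match while a file named `drama_list.csv` does not.

```python
from pathlib import Path

import pandas as pd


def concat_matching_csvs(folder=".", pattern="drama_list[0-9]*.csv"):
    """Read every CSV in `folder` whose name matches `pattern` and stack them.

    Hypothetical helper; the default pattern is an assumption about the file names.
    """
    # sorted() gives a stable, lexicographic order (so "10" sorts before "2")
    paths = sorted(Path(folder).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder!r}")
    # read each matching file, then concatenate with a fresh 0..n-1 index
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

Calling `concat_matching_csvs()` in the notebook's folder would then stand in for the explicit list and loop, provided no unrelated file happens to match the pattern.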

Drop duplicate rows based on the title, because the last few pages of the listing repeat dramas that already appeared earlier.

In [8]:
result = result.drop_duplicates(subset=['drama_title'])
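
To see what this step does, here is a minimal sketch on made-up rows (the toy titles are hypothetical, not scraped data). It counts the repeats before dropping them and, as an optional extra that the notebook itself does not do, resets the index so the row labels are contiguous again.

```python
import pandas as pd

# toy data standing in for the concatenated batches (hypothetical rows)
toy = pd.DataFrame({
    "drama_title": ["A (2019)", "B (2018)", "A (2019)"],
    "year": [2019, 2018, 2019],
})

# duplicated() marks every repeat after the first occurrence of a title
n_dupes = toy.duplicated(subset=["drama_title"]).sum()

# keep the first occurrence of each title, then renumber the rows 0..n-1
deduped = toy.drop_duplicates(subset=["drama_title"]).reset_index(drop=True)

print(n_dupes)        # number of repeated titles
print(len(deduped))   # rows left after dropping them
```

Without `reset_index(drop=True)`, the surviving rows keep their original labels, so the index can have gaps after duplicates are removed.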

## 3. Explore Data

In [9]:
result.head()

| | drama_title | year | main_actors | genres | tags | synopsis |
|---|---|---|---|---|---|---|
| 0 | The Untamed (2019) | 2019 | Sean Xiao, Wang Yi Bo | Action, Adventure, Friendship, Historical, ... | Censored Romance, Tragic Past, Adapted From A ... | Wei Wuxian and Lan Wangji are two completely d... |
| 1 | My Mister (2018) | 2018 | Lee Sun Kyun, IU | Business, Psychological, Life, Drama, Fami... | Nice Male Lead, Strong Female Lead, Healing, I... | Park Dong Hoon is a middle-aged engineer who i... |
| 2 | Signal (2016) | 2016 | Lee Je Hoon, Kim Hye Soo, Jo Jin Woong | Suspense, Mystery, Crime, Drama, Supernatu... | Different Timelines, Criminal Profiler, Serial... | Fifteen years ago, a young girl was kidnapped ... |
| 3 | Nirvana in Fire (2015) | 2015 | Hu Ge, Tamia Liu, Wang Kai, Chen Long, Leo Wu,... | Friendship, Historical, Wuxia, Drama, Fant... | Bromance, Smart Male Lead, Political Intrigue,... | During the Datong era of the Southern Liang Dy... |
| 4 | Prison Playbook (2017) | 2017 | Park Hae Soo, Jung Kyung Ho | Friendship, Comedy, Life, Drama | Prison, Bromance, Slight Romance, Black Comedy... | Kim Je Hyuk, a famous baseball player, is conv... |


In [10]:
result.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4999 entries, 0 to 4999
Data columns (total 6 columns):
drama_title    4999 non-null object
year           4999 non-null object
main_actors    4925 non-null object
genres         4757 non-null object
tags           2908 non-null object
synopsis       4980 non-null object
dtypes: object(6)
memory usage: 273.4+ KB


There are 4999 unique drama titles in the DataFrame. Every drama has a title and a release year. Only 2908 of them (about 58%) have tags, which is understandable because tags are created by fans and go through voting.
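
The shares behind these statements can be worked out from the non-null counts that `result.info()` reports above. The sketch below simply retypes those counts into a Series; the variable names are illustrative.

```python
import pandas as pd

total_rows = 4999

# non-null counts copied from the result.info() output
non_null = pd.Series({
    "drama_title": 4999,
    "year": 4999,
    "main_actors": 4925,
    "genres": 4757,
    "tags": 2908,
    "synopsis": 4980,
})

# percentage of dramas that have a value in each column
coverage = (non_null / total_rows * 100).round(1)
print(coverage)  # tags comes out at 58.2
```

On the DataFrame itself, `result.notna().mean() * 100` should give the same percentages without retyping the counts.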

## 4. Save CSV File

In [12]:
# write the combined DataFrame to a new CSV file, leaving out the index
result.to_csv('drama_list.csv',index=False)
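
The `index=False` argument matters when the file is read back later. The sketch below uses a toy frame and temporary files (both hypothetical) to compare the columns that come back with and without it.

```python
import tempfile
from pathlib import Path

import pandas as pd

# toy frame standing in for `result` (hypothetical rows)
toy = pd.DataFrame({"drama_title": ["A (2019)", "B (2018)"], "year": [2019, 2018]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = Path(tmp) / "with_index.csv"
    without_index = Path(tmp) / "without_index.csv"

    toy.to_csv(with_index)                  # default: row labels are written too
    toy.to_csv(without_index, index=False)  # same option as the call above

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'drama_title', 'year']
print(cols_without_index)  # ['drama_title', 'year']
```

Leaving the index out keeps the saved file limited to the data columns, so reloading it does not introduce an extra `Unnamed: 0` column.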