# Data preprocessing
This notebook illustrates how I processed the raw dataset downloaded from Kaggle. For this project, I selected the Big Bang Theory scripts: https://www.kaggle.com/datasets/mitramir5/the-big-bang-theory-series-transcript

## Connect the notebook to MyDrive

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import libraries

In [3]:
# Python libraries for data processing
import pandas as pd
from sklearn.utils import shuffle

# Custom module
import sys
sys.path.insert(0,'/content/drive/MyDrive/python_project/code')
from standardiseNames import rename_characters

## Load data and get general information about the data

In [4]:
# Import the comma-separated file into a pandas DataFrame
data = pd.read_csv("drive/MyDrive/python_project/data/tbbt.csv")
# Print first 5 rows to check if everything is loaded correctly
data.head()

Unnamed: 0,episode_name,dialogue,person_scene
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Scene
1,Series 01 Episode 01 – Pilot Episode,So if a photon is directed through a plane wi...,Sheldon
2,Series 01 Episode 01 – Pilot Episode,"Agreed, what’s your point?",Leonard
3,Series 01 Episode 01 – Pilot Episode,"There’s no point, I just think it’s a good id...",Sheldon
4,Series 01 Episode 01 – Pilot Episode,Excuse me?,Leonard


In [5]:
# Check how many rows and columns our DataFrame has
data.shape

(54406, 3)

In [6]:
# Check general information about our DataFrame (column data types and whether any values are missing)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54406 entries, 0 to 54405
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   episode_name  54406 non-null  object
 1   dialogue      54404 non-null  object
 2   person_scene  54406 non-null  object
dtypes: object(3)
memory usage: 1.2+ MB
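
The `info()` output reports 54404 non-null values in `dialogue` against 54406 rows, so two lines have no text. The notebook does not drop them, and the `count()` calls used later skip missing values anyway. Should you want to remove them explicitly, a minimal sketch on a toy DataFrame (the rows below are invented for illustration) could look like this:

```python
import pandas as pd

# Toy stand-in for the real DataFrame; these rows are invented for illustration.
toy = pd.DataFrame({
    "dialogue": ["Excuse me?", None, "Agreed."],
    "person_scene": ["Leonard", "Sheldon", "Leonard"],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop rows whose dialogue is missing and renumber the index.
toy_clean = toy.dropna(subset=["dialogue"]).reset_index(drop=True)
print(toy_clean.shape)
```

Whether to drop such rows before sampling is a judgment call; with only two affected rows the effect on the final datasets is small.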


## Preprocess data
Create a copy of data

In [8]:
# Create a copy of data so we can restore the original data any time without loading it again
pd_data = data[['dialogue', 'person_scene']].copy()
# Check if copy of our DataFrame is created correctly by printing first 5 rows
pd_data.head()

Unnamed: 0,dialogue,person_scene
0,A corridor at a sperm bank.,Scene
1,So if a photon is directed through a plane wi...,Sheldon
2,"Agreed, what’s your point?",Leonard
3,"There’s no point, I just think it’s a good id...",Sheldon
4,Excuse me?,Leonard


Look at unique values in the 'person_scene' column

In [9]:
# Get a list of unique values of the column 'person_scene' in our dataset
unique_vals = pd_data.person_scene.unique().tolist()

# Print the list of unique values and the length of the list
print(unique_vals)
print(len(unique_vals))

['Scene', 'Sheldon', 'Leonard', 'Receptionist', 'Penny', '(mouths)', 'back)', 'Howard', 'Raj', 'Raj)', 'buzzer)', 'buzzer', 'Voice', 'man', 'Man', '(sings)', '(off)', 'together', '(snarkily)', '(entering)', 'likewise)', 'off)', 'door)', 'talk)', '(internally)', 'mat)', 'him)', 'ear)', 'Teleplay', 'Story', 'hallway)', 'Doug', 'Lesley', 'instructor', 'Leonard)', '(singing)', 'Waitress', 'Summer', 'Sheldon)', 'Gablehouser', 'round)', 'supplements)', 'Cooper', 'Cooper)', 'chair)', '(leaving)', 'quartettist', 'room)', 'apartment)', '(inside)', '(arriving)', 'costume)', 'All', 'Thor)', 'off-screen)', 'Girl', 'Costume', 'Kurt', 'ground)', 'entering)', 'doorway)', 'phone', 'television)', 'again)', 'Christie', 'Waiter', '(answering)', '(voice)', 'within)', 'women)', 'Koothrappali', 'Together', 'gather)', 'phone)', 'Lalita', 'Penny)', 'embarrassed)', 'mailbox)', 'glasses)', 'down)', 'floor)', 'captions)', 'two', 'one', 'tunelessly)', 'Toby', 'duvet)', 'stairs)', 'Mother', 'voice)', 'clearance)',

In [10]:
# Get top 4 characters' names with the highest number of lines
top_4_char = pd_data.groupby("person_scene")["dialogue"].count().sort_values(ascending=False)
top_4_char.head(4)

person_scene
Sheldon    11484
Leonard     9638
Penny       7476
Howard      5737
Name: dialogue, dtype: int64

The following values refer to the same character:

'Raj': 'Raj)', 'Koothrappali', 'Rai', 'Rajj', 'Ra'
'Leonard': 'Leonard)', 'Leonard-warrior', 'Hofstadter', 'Leonard:', 'Leoanard'
'Sheldon': 'Sheldon)', 'Cooper', 'Cooper)', 'Shedon', 'Shldon', 'Sehldon', 'Sgeldon', 'Sheldon-bot'
'Penny': 'Penny)', 'Penny(leaving)', 'Penny-warrior'
'Howard': 'Howard)', 'Wolowitz', 'Wolowitz)', 'Howatd'

**Standardize main characters' names**

In [11]:
# To standardize the main characters' names, we wrote the function rename_characters (Python module standardiseNames, available in the code folder)
pd_data['person_scene'] = pd_data['person_scene'].apply(rename_characters)
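
The source of `standardiseNames` is not shown in this notebook. As a rough idea of what such a function might do, here is a hypothetical sketch built from the variant lists above; the names `ALIASES`, `VARIANT_TO_NAME` and `rename_characters_sketch` are assumptions for illustration, not the actual module contents.

```python
# Hypothetical sketch; the real standardiseNames.rename_characters may differ.
# Canonical name -> variants listed earlier in this notebook.
ALIASES = {
    "Raj": ["Raj)", "Koothrappali", "Rai", "Rajj", "Ra"],
    "Leonard": ["Leonard)", "Leonard-warrior", "Hofstadter", "Leonard:", "Leoanard"],
    "Sheldon": ["Sheldon)", "Cooper", "Cooper)", "Shedon", "Shldon", "Sehldon",
                "Sgeldon", "Sheldon-bot"],
    "Penny": ["Penny)", "Penny(leaving)", "Penny-warrior"],
    "Howard": ["Howard)", "Wolowitz", "Wolowitz)", "Howatd"],
}

# Invert the mapping once so each lookup is a single dictionary access.
VARIANT_TO_NAME = {
    variant: name for name, variants in ALIASES.items() for variant in variants
}


def rename_characters_sketch(label):
    """Return the canonical character name, or the label unchanged if unknown."""
    return VARIANT_TO_NAME.get(label, label)


print(rename_characters_sketch("Cooper)"))
print(rename_characters_sketch("Scene"))
```

A function of this shape can be passed to `Series.apply` exactly as in the cell above; labels that are not listed, such as 'Scene', pass through untouched.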

Check if the changes have been implemented correctly

In [12]:
# Get a list of unique values of the column 'person_scene' in our dataset
unique_vals = pd_data.person_scene.unique().tolist()

# Print the list of unique values and the length of the list
print(unique_vals)
print(len(unique_vals))

['Scene', 'Sheldon', 'Leonard', 'Receptionist', 'Penny', '(mouths)', 'back)', 'Howard', 'Raj', 'buzzer)', 'buzzer', 'Voice', 'man', 'Man', '(sings)', '(off)', 'together', '(snarkily)', '(entering)', 'likewise)', 'off)', 'door)', 'talk)', '(internally)', 'mat)', 'him)', 'ear)', 'Teleplay', 'Story', 'hallway)', 'Doug', 'Lesley', 'instructor', '(singing)', 'Waitress', 'Summer', 'Gablehouser', 'round)', 'supplements)', 'chair)', '(leaving)', 'quartettist', 'room)', 'apartment)', '(inside)', '(arriving)', 'costume)', 'All', 'Thor)', 'off-screen)', 'Girl', 'Costume', 'Kurt', 'ground)', 'entering)', 'doorway)', 'phone', 'television)', 'again)', 'Christie', 'Waiter', '(answering)', '(voice)', 'within)', 'women)', 'Together', 'gather)', 'phone)', 'Lalita', 'embarrassed)', 'mailbox)', 'glasses)', 'down)', 'floor)', 'captions)', 'two', 'one', 'tunelessly)', 'Toby', 'duvet)', 'stairs)', 'Mother', 'voice)', 'clearance)', 'sigh)', 'teeth)', 'vaporub)', 'Dennis', '(dramatically)', 'tune)', 'Goldfarb'

For the current project, we will select only 4 main characters as classes to predict. For this reason, we will count the frequency of lines each character has in the dataset and select 4 characters with the most lines.

In [13]:
# Get top 4 characters' names with the highest number of lines
top_4_char = pd_data.groupby("person_scene")["dialogue"].count().sort_values(ascending=False)
top_4_char.head(4)

person_scene
Sheldon    11732
Leonard     9706
Penny       7484
Howard      5766
Name: dialogue, dtype: int64

Sheldon has more than twice as many lines in the dataset as Howard (11732 vs. 5766). To avoid skewed class proportions, we will select 5766 samples from each category, which is the number of lines Howard has.

In [21]:
# Get Sheldon's lines
lines_sheldon = pd_data[pd_data.person_scene=='Sheldon'].sample(n=5766).rename(columns={'person_scene': 'name'}).reset_index().drop(columns='index')
# Get Leonard's lines
lines_leonard = pd_data[pd_data.person_scene=='Leonard'].sample(n=5766).rename(columns={'person_scene': 'name'}).reset_index().drop(columns='index')
# Get Penny's lines
lines_penny = pd_data[pd_data.person_scene=='Penny'].sample(n=5766).rename(columns={'person_scene': 'name'}).reset_index().drop(columns='index')
# Get Howard's lines
lines_howard = pd_data[pd_data.person_scene=='Howard'].rename(columns={'person_scene': 'name'}).reset_index().drop(columns='index')

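# Concatenate all separately created DataFrames
# Note: .loc slicing includes the end label, so .loc[:4616] keeps 4617 rows per character (about 80%) for training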
train_df = pd.concat([lines_sheldon.loc[:4616, :], lines_leonard.loc[:4616, :], lines_penny.loc[:4616, :], lines_howard.loc[:4616, :]])
print(train_df.shape)
test_df = pd.concat([lines_sheldon.loc[4617:, :], lines_leonard.loc[4617:, :], lines_penny.loc[4617:, :], lines_howard.loc[4617:, :]])
print(test_df.shape)

(18468, 2)
(4596, 2)
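
The shapes follow from the split: 4 × 4617 = 18468 training rows and 4 × 1149 = 4596 test rows. Two details of the cell above are worth noting. The `sample(n=5766)` calls have no `random_state`, so a rerun draws different rows, and Howard's lines are not sampled, so his train/test split follows the original row order while the others are random. A possible variant that treats every class the same way is sketched below on toy data; the helper `balanced_split` and its parameters are assumptions for illustration, not part of the project code.

```python
import pandas as pd


def balanced_split(df, label_col, labels, n_per_class, n_train, seed=1):
    """Sample n_per_class rows per label, then split each sample by position."""
    train_parts, test_parts = [], []
    for label in labels:
        sampled = (
            df[df[label_col] == label]
            .sample(n=n_per_class, random_state=seed)  # seeded, so reruns match
            .reset_index(drop=True)
        )
        # iloc is position-based and excludes the end position.
        train_parts.append(sampled.iloc[:n_train])
        test_parts.append(sampled.iloc[n_train:])
    train = pd.concat(train_parts, ignore_index=True)
    test = pd.concat(test_parts, ignore_index=True)
    return train, test


# Toy data with unequal class sizes (invented for illustration).
toy = pd.DataFrame({
    "dialogue": [f"line {i}" for i in range(30)],
    "person_scene": ["A"] * 12 + ["B"] * 10 + ["C"] * 8,
})

train_toy, test_toy = balanced_split(
    toy, "person_scene", ["A", "B", "C"], n_per_class=8, n_train=6
)
print(train_toy.shape, test_toy.shape)
print(train_toy["person_scene"].value_counts().to_dict())
```

For the real data the corresponding arguments would be `n_per_class=5766` and `n_train=4617`.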


## Shuffle train and test datasets

In [22]:
# Shuffle train dataset using pandas method sample with the following parameters: frac = 1 - returns the entire DataFrame, random_state = 1 - makes the results reproducible
train_df = train_df.sample(frac = 1, random_state = 1).reset_index().drop(columns='index')
test_df = test_df.sample(frac = 1, random_state = 1).reset_index().drop(columns='index')
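
To see what `sample(frac=1, random_state=1)` does, the following small check on invented toy rows shuffles the same DataFrame twice with the same seed and compares the results. It is only a sketch of the behaviour, not part of the preprocessing itself.

```python
import pandas as pd

# Toy rows invented for illustration.
toy = pd.DataFrame({
    "dialogue": ["a", "b", "c", "d", "e", "f"],
    "name": ["Sheldon", "Sheldon", "Leonard", "Leonard", "Penny", "Penny"],
})

# frac=1 returns every row in random order; reset_index(drop=True) renumbers
# the rows without keeping the old index as a column.
shuffled_once = toy.sample(frac=1, random_state=1).reset_index(drop=True)
shuffled_again = toy.sample(frac=1, random_state=1).reset_index(drop=True)

# Same seed, same order.
same_order = shuffled_once.equals(shuffled_again)
# Shuffling only reorders rows; no row is lost or duplicated.
same_rows = sorted(shuffled_once["dialogue"]) == sorted(toy["dialogue"])
print(same_order, same_rows)
```

`reset_index(drop=True)` is a shorter equivalent of `reset_index().drop(columns='index')` used in the cell above.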


In [23]:
# Check if train and test DataFrames are correctly shuffled and the number of rows and columns remain the same as before shuffling the datasets
# For this reason we print the first 5 rows of train and test datasets and their shapes
print(train_df.head())
print(train_df.shape)
print(test_df.head())
print(test_df.shape)

                                            dialogue     name
0   I’ll say it looks good. It’s in my proprietar...  Sheldon
1   Wait, wait, wait. When did you send my mom no...  Leonard
2            I never had a beer with my dad, either.   Howard
3                      My thank you was not sincere.  Sheldon
4   I’m calling to invite you to a spontaneous da...  Sheldon
(18468, 2)
                                            dialogue     name
0                      Yeah, that’s when it started.  Leonard
1   A million dollars? God, it’s like my nuts jus...   Howard
2   Dinner’s almost ready. If you like meatloaf, ...   Howard
3   Well, Leonard, I think it’s high time you and...  Sheldon
4              All right. It is a comfortable chair.  Sheldon
(4596, 2)


## Save preprocessed training and test datasets

In [24]:
train_df.to_csv("drive/MyDrive/python_project/data/train_df.csv", index = False)
test_df.to_csv("drive/MyDrive/python_project/data/test_df.csv", index = False)
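
Passing `index = False` keeps the row index out of the file; without it, reloading with `read_csv` would produce an extra `Unnamed: 0` column. A quick round-trip check can confirm that the saved file has the expected columns. The sketch below writes to an in-memory buffer instead of Google Drive and uses invented toy rows.

```python
import io

import pandas as pd

# Toy rows invented for illustration.
toy_train = pd.DataFrame({
    "dialogue": ["Excuse me?", "Agreed, what's your point?"],
    "name": ["Leonard", "Leonard"],
})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
toy_train.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
print(reloaded.shape)
```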