### Preprocessing and Sentiment Analysis

> <sub>⚠️ **Note**: Internal links (like Table of Contents) work best when this notebook is opened in **Jupyter Notebook** or **nbviewer.org**.<br>
> GitHub does **not support scrolling to sections** inside `.ipynb` files.</sub>

---
###### Data Ingestion
######  - [Reading the data set](#Reading-the-data-set)
######  - [Preprocessing of the data](#Preprocessing-of-the-data)
###### Sentiment Analysis
######  - [Using Bing Liu Lexicon](#Using-Bing-Liu-Lexicon)
######  - [Using LM dictionary](#Using-LM-dictionary)
######  - [Using TextBlob](#Using-TextBlob)
######  - [Using Vader](#Using-Vader)
######  - [Using Obscene Words](#Using-Obscene-Words)
###### Output Extraction
######  - [Creating consolidated dataset](#Creating-consolidated-dataset)
######  - [Writing the dataset](#Writing-the-dataset)
---

#### Reading the data set
######  - [_Click here to move back to index_](#Preprocessing-and-Sentiment-Analysis)

In [85]:
import sys
sys.path.append('src') 


In [6]:
from Data_Preprocessing import load_data, analyze_character_dialogues

# Read the dialogue data for each TV series

series_1 = load_data('data/raw/friends_quotes.csv')
series_2 = load_data('data/raw/Game_of_Thrones_Script.csv')
series_3 = load_data('data/raw/Office Script by Characters.csv')
series_4 = load_data('data/raw/1_10_seasons_tbbt.csv')


In [8]:
# Initial analysis for series 1
results = analyze_character_dialogues(series_1, show_overall=True, season='1', episode='2')
for name, table in results.items():
    print(f"\n{name}\n", table)



Top 10 by Dialogue Length - Episode 01x02
      character  total_characters  percent_of_total
0         Ross              3802             27.95
1       Rachel              2389             17.56
2       Monica              1515             11.14
3     Chandler              1284              9.44
4   Mr. Geller               772              5.68
5        Barry               747              5.49
6       Phoebe               713              5.24
7  Mrs. Geller               657              4.83
8        Carol               582              4.28
9         Joey               351              2.58

Top 10 by Dialogue Rows - Episode 01x02
      character  dialogue_lines  percent_of_total
0         Ross              63             25.93
1       Rachel              38             15.64
2       Monica              28             11.52
3        Carol              21              8.64
4        Barry              16              6.58
5     Chandler              16              6.58
6       Ph

In [10]:
# Initial analysis for series 2
results = analyze_character_dialogues(series_2, show_overall=True, season='1', episode='2')
for name, table in results.items():
    print(f"\n{name}\n", table)



Top 10 by Dialogue Length - Episode 01x02
            character  total_characters  percent_of_total
0   Tyrion Lannister              2538             13.87
1   Robert Baratheon              2189             11.96
2   Cersei Lannister              1476              8.07
3       Eddard Stark              1420              7.76
4           Jon Snow              1305              7.13
5      Catelyn Stark              1240              6.78
6             Doreah              1234              6.74
7  Joffrey Lannister              1110              6.07
8    Jaime Lannister               817              4.46
9         Arya Stark               713              3.90

Top 10 by Dialogue Rows - Episode 01x02
             character  dialogue_lines  percent_of_total
0        Eddard Stark              32             11.31
1            Jon Snow              24              8.48
2          Arya Stark              23              8.13
3    Robert Baratheon              23              8.13
4    Ty

In [12]:
# Initial analysis for series 3
results = analyze_character_dialogues(series_3, show_overall=True, season='1', episode='2')
for name, table in results.items():
    print(f"\n{name}\n", table)



Top 10 by Dialogue Length - Episode 01x02
    character  total_characters  percent_of_total
0    Michael             12936             57.12
1  Mr. Brown              3201             14.13
2        Jim              2202              9.72
3     Dwight              1975              8.72
4        Pam               820              3.62
5       Ryan               436              1.93
6      Oscar               404              1.78
7      Kevin               388              1.71
8    Stanley               125              0.55
9       Toby                85              0.38

Top 10 by Dialogue Rows - Episode 01x02
    character  dialogue_lines  percent_of_total
0    Michael             103             34.68
1  Mr. Brown              52             17.51
2     Dwight              40             13.47
3        Jim              35             11.78
4        Pam              26              8.75
5      Oscar              13              4.38
6      Kevin               8              2.69

In [14]:
# Initial analysis for series 4
results = analyze_character_dialogues(series_4, show_overall=True, season='1', episode='2')
for name, table in results.items():
    print(f"\n{name}\n", table)



Top 10 by Dialogue Length - Episode 01x02
       character  total_characters  percent_of_total
0       Sheldon              5386             32.90
1       Leonard              4989             30.47
2         Penny              2635             16.09
3        Howard              1057              6.46
4         Scene               911              5.56
5  (Internally)               299              1.83
6           Raj               267              1.63
7    (Entering)               137              0.84
8          Him)               133              0.81
9         Talk)                89              0.54

Top 10 by Dialogue Rows - Episode 01x02
       character  dialogue_lines  percent_of_total
0       Leonard              83             33.47
1       Sheldon              70             28.23
2         Penny              41             16.53
3        Howard              16              6.45
4         Scene              11              4.44
5           Raj               5           
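
The per-episode tables above rank characters by total dialogue characters and report each character's share of the episode total. The following is a minimal sketch of that computation, assuming the input is a list of (character, dialogue) pairs. `dialogue_share_sketch` is a hypothetical helper for illustration, not the `analyze_character_dialogues` function from `Data_Preprocessing`, whose implementation is not shown in this notebook.

```python
from collections import defaultdict


def dialogue_share_sketch(rows, top_n=10):
    """Return up to top_n (character, total_characters, percent_of_total) tuples.

    rows is an iterable of (character, dialogue) pairs (an assumed input shape).
    """
    totals = defaultdict(int)
    for character, dialogue in rows:
        totals[character] += len(dialogue)

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    # Sort characters by total dialogue length, largest first.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, chars, round(100 * chars / grand_total, 2))
        for name, chars in ranked[:top_n]
    ]


demo_rows = [("Ross", "Hi."), ("Rachel", "Hello there"), ("Ross", "Bye now")]
demo_share = dialogue_share_sketch(demo_rows)
print(demo_share)
```

Counting rows instead of characters only requires replacing `len(dialogue)` with `1`, which corresponds to the "Dialogue Rows" tables.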

#### Preprocessing of the data
######  - [_Click here to move back to index_](#Preprocessing-and-Sentiment-Analysis)

In [17]:
from Data_Preprocessing import preprocess_dialogue

#Apply preprocessing to the dialogue column
series_1['dialogue_preprocessed'] = series_1['dialogue'].apply(preprocess_dialogue)
series_2['dialogue_preprocessed'] = series_2['dialogue'].apply(preprocess_dialogue)
series_3['dialogue_preprocessed'] = series_3['dialogue'].apply(preprocess_dialogue)
series_4['dialogue_preprocessed'] = series_4['dialogue'].apply(preprocess_dialogue)
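
The body of `preprocess_dialogue` is not shown in this notebook. As a rough illustration of the kind of cleaning it performs, the sketch below lowercases the text, strips punctuation, and drops stopwords. The function name `preprocess_sketch` and the tiny `STOPWORDS` set are hypothetical, and the sketch does not lemmatize, whereas the real output (for example "savages" becoming "savage") suggests the project function does.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
# A real pipeline would likely use a fuller list, such as NLTK's English stopwords.
STOPWORDS = {
    "a", "an", "and", "any", "as", "did", "do", "how",
    "i", "should", "the", "to", "we", "you",
}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_sketch(text):
    """Lowercase, strip punctuation, and drop stopwords (no lemmatization)."""
    if not isinstance(text, str):
        return ""  # treat missing values as empty dialogue
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    return " ".join(token for token in tokens if token not in STOPWORDS)


print(preprocess_sketch("We should head back to the wall."))
print(preprocess_sketch("Close as any man would."))
```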


In [19]:
series_2.head()

Unnamed: 0,release date,season,episode,episode name,character,dialogue,season_episode,dialogue_length,dialogue_preprocessed
0,4/17/2011,1,1,Winter is Coming,Waymar Royce,What do you expect? They're savages. One lot s...,01x01,137,expect savage one lot steal goat another lot k...
1,4/17/2011,1,1,Winter is Coming,Will,I've never seen wildlings do a thing like this...,01x01,103,never seen wildlings thing like never seen thi...
2,4/17/2011,1,1,Winter is Coming,Waymar Royce,How close did you get?,01x01,22,close get
3,4/17/2011,1,1,Winter is Coming,Will,Close as any man would.,01x01,23,close man would
4,4/17/2011,1,1,Winter is Coming,Gared,We should head back to the wall.,01x01,32,head back wall


### Sentiment Analysis

#### Using Bing Liu Lexicon
######  - [_Click here to move back to index_](#Preprocessing-and-Sentiment-Analysis)

In [23]:
from Sentiment_Analysis import count_pos_neg

# Import Bing Liu's opinion lexicon (requires a one-time nltk.download('opinion_lexicon'))
from nltk.corpus import opinion_lexicon

# Build the positive and negative word sets
pos_list_BL = set(opinion_lexicon.positive())
neg_list_BL = set(opinion_lexicon.negative())

# Add columns with the positive, negative, and net word counts from the BL lexicon,
# using the imported count_pos_neg function

series_1['poscnt_BL'], series_1['negcnt_BL'], series_1['netcnt_BL'] = count_pos_neg(series_1['dialogue'], pos_list_BL, neg_list_BL)
series_2['poscnt_BL'], series_2['negcnt_BL'], series_2['netcnt_BL'] = count_pos_neg(series_2['dialogue'], pos_list_BL, neg_list_BL)
series_3['poscnt_BL'], series_3['negcnt_BL'], series_3['netcnt_BL'] = count_pos_neg(series_3['dialogue'], pos_list_BL, neg_list_BL)
series_4['poscnt_BL'], series_4['negcnt_BL'], series_4['netcnt_BL'] = count_pos_neg(series_4['dialogue'], pos_list_BL, neg_list_BL)
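
Lexicon-based scoring of this kind counts how many tokens of each dialogue line appear in the positive and negative word sets, and takes the difference as a net score. The sketch below shows one way to do that. `count_pos_neg_sketch` is a hypothetical stand-in, and its tokenization rule is an assumption; the actual `count_pos_neg` in `Sentiment_Analysis` may differ.

```python
import re


def count_pos_neg_sketch(texts, pos_words, neg_words):
    """Return three lists (positive, negative, net counts), one entry per text."""
    pos_counts, neg_counts, net_counts = [], [], []
    for text in texts:
        # Treat missing or non-string values as empty dialogue.
        if isinstance(text, str):
            tokens = re.findall(r"[a-z']+", text.lower())
        else:
            tokens = []
        pos = sum(1 for token in tokens if token in pos_words)
        neg = sum(1 for token in tokens if token in neg_words)
        pos_counts.append(pos)
        neg_counts.append(neg)
        net_counts.append(pos - neg)
    return pos_counts, neg_counts, net_counts


pos, neg, net = count_pos_neg_sketch(
    ["I love this great show", "That was a terrible, awful idea", None],
    pos_words={"love", "great"},
    neg_words={"terrible", "awful"},
)
print(pos, neg, net)
```

Returning three parallel lists keeps the sketch compatible with the tuple-unpacking assignment used in the cell above. Passing the word lists as sets keeps each membership test constant-time.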


#### Using LM dictionary
######  - [_Click here to move back to index_](#Preprocessing-and-Sentiment-Analysis)

In [26]:
from Sentiment_Analysis import read_local_dictionary

# Load the LM positive and negative word lists from local text files
pos_list_LM = read_local_dictionary('data/raw/positive-words-LM.txt')
neg_list_LM = read_local_dictionary('data/raw/negative-words-LM.txt')

# Add columns with the positive, negative, and net word counts from the LM dictionary,
# using the count_pos_neg function imported earlier

series_1['poscnt_LM'], series_1['negcnt_LM'], series_1['netcnt_LM'] = count_pos_neg(series_1['dialogue'], pos_list_LM, neg_list_LM)
series_2['poscnt_LM'], series_2['negcnt_LM'], series_2['netcnt_LM'] = count_pos_neg(series_2['dialogue'], pos_list_LM, neg_list_LM)
series_3['poscnt_LM'], series_3['negcnt_LM'], series_3['netcnt_LM'] = count_pos_neg(series_3['dialogue'], pos_list_LM, neg_list_LM)
series_4['poscnt_LM'], series_4['negcnt_LM'], series_4['netcnt_LM'] = count_pos_neg(series_4['dialogue'], pos_list_LM, neg_list_LM)
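
A word-list loader for files like these can be quite small. The sketch below assumes one word per line, lowercases each entry, and skips blank lines and lines starting with `;`. The comment-line convention and the name `read_word_list_sketch` are assumptions for illustration, not the behavior of `read_local_dictionary`.

```python
import os
import tempfile


def read_word_list_sketch(path, encoding="utf-8"):
    """Read one word per line into a lowercase set, skipping blanks and ';' comments."""
    words = set()
    with open(path, encoding=encoding) as handle:
        for line in handle:
            word = line.strip().lower()
            if word and not word.startswith(";"):
                words.add(word)
    return words


# Demo with a throwaway file so the sketch runs without the project data.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("; header comment\nABLE\n\nbenefit\n")
    demo_path = tmp.name

demo_words = read_word_list_sketch(demo_path)
os.remove(demo_path)
print(sorted(demo_words))
```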
				

#### Using TextBlob
######  - [_Click here to move back to index_](#Preprocessing-and-Sentiment-Analysis)

In [31]:
from textblob import TextBlob
import pandas as pd

# Replace missing dialogue with empty strings and cast the rest to str,
# so that TextBlob always receives a string
series_2["dialogue"] = series_2["dialogue"].apply(lambda x: str(x) if pd.notnull(x) else '')
series_4["dialogue"] = series_4["dialogue"].apply(lambda x: str(x) if pd.notnull(x) else '')

#Mapping the sentiment scores and adding the new column for the score
series_1["score_TextBlob"] = series_1["dialogue"].str.lower().map(lambda x:TextBlob(x).sentiment.polarity)
series_2["score_TextBlob"] = series_2["dialogue"].str.lower().map(lambda x:TextBlob(x).sentiment.polarity)
series_3["score_TextBlob"] = series_3["dialogue"].str.lower().map(lambda x:TextBlob(x).sentiment.polarity)
series_4["score_TextBlob"] = series_4["dialogue"].str.lower().map(lambda x:TextBlob(x).sentiment.polarity)


#### Using Vader
######  - [_Click here to move back to index_](#Preprocessing-and-Sentiment-Analysis)

In [34]:
from Sentiment_Analysis import vader_analyzer

series_1 = vader_analyzer(series_1)
series_2 = vader_analyzer(series_2)
series_3 = vader_analyzer(series_3)
series_4 = vader_analyzer(series_4)


#### Using Obscene Words
######  - [_Click here to move back to index_](#Preprocessing-and-Sentiment-Analysis)

In [37]:
from Sentiment_Analysis import count_obs

# Load the obscene-word list from a local text file
obscene_words = read_local_dictionary('data/raw/obscene_words.txt')

# Add a column with the number of obscene words per dialogue line,
# using the imported count_obs function

series_1['obscene_words_count'] = count_obs(series_1['dialogue'], obscene_words)
series_2['obscene_words_count'] = count_obs(series_2['dialogue'], obscene_words)
series_3['obscene_words_count'] = count_obs(series_3['dialogue'], obscene_words)
series_4['obscene_words_count'] = count_obs(series_4['dialogue'], obscene_words)


#### Creating consolidated dataset
######  - [_Click here to move back to index_](#Preprocessing-and-Sentiment-Analysis)

In [40]:
series_1['Series_Name'] = 'Friends'
series_2['Series_Name'] = 'Game of Thrones'
series_3['Series_Name'] = 'The Office'
series_4['Series_Name'] = 'The Big Bang Theory'

# Keep only the columns shared by all four series; sorting makes the column order deterministic
common_columns = sorted(set(series_1.columns) & set(series_2.columns) & set(series_3.columns) & set(series_4.columns))

concatenated_df = pd.concat([series_1[common_columns], series_2[common_columns], series_3[common_columns], 
                             series_4[common_columns]], ignore_index=True)

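# Multiply each score column by 'dialogue_length'. The resulting SumProd_* columns are the
# numerators of length-weighted means (sum of products divided by sum of lengths).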
cols_to_multiply = ['negcnt_BL', 'poscnt_LM', 'poscnt_BL', 'obscene_words_count', 'negscore_Vader', 
                    'score_TextBlob', 'negcnt_LM', 'posscore_Vader', 'compound_vader']
for col in cols_to_multiply:
    if col in concatenated_df.columns:  # Check if the column exists
        concatenated_df[f'SumProd_{col}'] = concatenated_df[col] * concatenated_df['dialogue_length']
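
The `SumProd_*` columns hold score times dialogue length per row. Summing them over a group and dividing by the summed `dialogue_length` gives a length-weighted mean, so long lines count for more than short ones. The sketch below works through that arithmetic on plain lists. The helper name `weighted_mean` and the sample numbers are made up for illustration, and it is an assumption that the downstream dashboard performs the division this way.

```python
def weighted_mean(scores, lengths):
    """Length-weighted mean: sum(score * length) / sum(length)."""
    total_length = sum(lengths)
    if total_length == 0:
        return float("nan")  # no dialogue, so the mean is undefined
    sum_prod = sum(score * length for score, length in zip(scores, lengths))
    return sum_prod / total_length


# Made-up example: three dialogue lines with a sentiment score and a length each.
demo_scores = [0.5, -0.25, 0.0]
demo_lengths = [100, 40, 20]

demo_weighted = weighted_mean(demo_scores, demo_lengths)
demo_simple = sum(demo_scores) / len(demo_scores)
print(demo_weighted, demo_simple)
```

Here the weighted mean is (50 - 10 + 0) / 160 = 0.25, while the unweighted mean of the three scores is lower, because the longest line carries the highest score.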


In [42]:
# Add a 'Top_6_Flag' column marking the six characters with the most dialogue
# (by total dialogue length) in each season of each series

grouped = (
    concatenated_df.groupby(['Series_Name', 'season', 'character'], as_index=False)
      .agg({'dialogue_length': 'sum'})
)

# Rank characters by Dialogue Length within each Series-Season
grouped['Rank'] = grouped.groupby(['Series_Name', 'season'])['dialogue_length'] \
                         .rank(method='first', ascending=False)

# Flag top 6
grouped['Top_6_Flag'] = grouped['Rank'].apply(lambda x: 'Top 6' if x <= 6 else 'Other')

# Merge the flag back onto the row-level dataframe
concatenated_df = concatenated_df.merge(grouped[['Series_Name', 'season', 'character', 'Top_6_Flag']], 
              on=['Series_Name', 'season', 'character'], how='left')
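
The ranking step uses `rank(method='first')`, which breaks ties by order of appearance, so exactly six characters are flagged per series and season even when totals are equal. The same idea can be sketched in plain Python, relying on the fact that `sorted()` is stable. `flag_top_n` and the sample records are hypothetical and only illustrate the logic.

```python
from collections import defaultdict


def flag_top_n(records, n=6):
    """Flag the n characters with the largest totals in each (series, season).

    records is a list of (series, season, character, total_length) tuples.
    Returns a dict mapping (series, season, character) to 'Top n' or 'Other'.
    """
    by_group = defaultdict(list)
    for series, season, character, total in records:
        by_group[(series, season)].append((character, total))

    flags = {}
    for (series, season), members in by_group.items():
        # sorted() is stable, so tied totals keep their input order.
        ranked = sorted(members, key=lambda member: member[1], reverse=True)
        for position, (character, _total) in enumerate(ranked, start=1):
            label = f"Top {n}" if position <= n else "Other"
            flags[(series, season, character)] = label
    return flags


demo_records = [
    ("Friends", 1, "Ross", 300),
    ("Friends", 1, "Rachel", 300),
    ("Friends", 1, "Joey", 100),
    ("The Office", 1, "Michael", 500),
]
demo_flags = flag_top_n(demo_records, n=2)
print(demo_flags)
```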


#### Writing the dataset
######  - [_Click here to move back to index_](#Preprocessing-and-Sentiment-Analysis)

In [45]:
from Data_Preprocessing import write_data

# Save to processed
write_data(concatenated_df, 'Data_For_Tableau.csv')

Data written to: C:\Users\utkar\Desktop\PyCharm Projects Spring\Natural Language Processing\data\processed\Data_For_Tableau.csv


'data\\processed\\Data_For_Tableau.csv'

In [55]:
write_data(series_1, 'processed_series_1.csv')
write_data(series_2, 'processed_series_2.csv')
write_data(series_3, 'processed_series_3.csv')
write_data(series_4, 'processed_series_4.csv')

Data written to: C:\Users\utkar\Desktop\PyCharm Projects Spring\Natural Language Processing\data\processed\processed_series_1.csv
Data written to: C:\Users\utkar\Desktop\PyCharm Projects Spring\Natural Language Processing\data\processed\processed_series_2.csv
Data written to: C:\Users\utkar\Desktop\PyCharm Projects Spring\Natural Language Processing\data\processed\processed_series_3.csv
Data written to: C:\Users\utkar\Desktop\PyCharm Projects Spring\Natural Language Processing\data\processed\processed_series_4.csv


'data\\processed\\processed_series_4.csv'