# Project Part 1

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/brearenee/NLP-Project/blob/main/startrek.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/brearenee/NLP-Project/blob/main/startrek.ipynb)


In [1]:
import pandas as pd
import json
import requests
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
url = 'https://raw.githubusercontent.com/brearenee/NLP-Project/main/dataset/StarTrekDialogue_v2.json'
response = requests.get(url)

# This code block is thanks to ChatGPT :-)
if response.status_code == 200:
    json_data = json.loads(response.text)
    lines = []
    characters = []
    episodes = []
  
    # extract the information from the JSON file for the "TNG" series
    for series_name, series_data in json_data.items():
        if series_name == "TNG": 
            for episode_name, episode_data in series_data.items():
                for character_name, character_lines in episode_data.items():
                    for line_text in character_lines:
                        lines.append(line_text)
                        characters.append(character_name)
                        episodes.append(episode_name)
                     
    # Create a DataFrame from the extracted data
    df = pd.DataFrame({
        'Line': lines,
        'Character': characters,
        'Episode': episodes,
    })

    # Remove duplicate lines, keeping the first occurrence (preserving the original order)
    df = df.drop_duplicates(subset='Line', keep='first')

    # Reset the index of the DataFrame
    df.reset_index(drop=True, inplace=True)

else:
    print(f"Failed to retrieve data. Status code: {response.status_code}")
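
To make the nesting that the loop above walks through easier to see, here is a minimal sketch on a toy dictionary. It assumes the JSON is shaped as series -> episode -> character -> list of lines, which is what the loop implies. The helper name `flatten_series` and all of the toy values are hypothetical and not part of the dataset.

```python
# Toy dictionary mirroring the nesting the loop iterates over:
# series -> episode -> character -> list of lines. All values are made up.
toy_json = {
    "TNG": {
        "episode 0": {"PICARD": ["Engage.", "Make it so."], "DATA": ["Intriguing."]},
        "episode 1": {"RIKER": ["Shields up."]},
    },
    "DS9": {
        "episode 0": {"SISKO": ["On screen."]},
    },
}


def flatten_series(data, series_name):
    """Return one record per spoken line for the given series (hypothetical helper)."""
    return [
        {"Line": line, "Character": character, "Episode": episode}
        for episode, cast in data.get(series_name, {}).items()
        for character, lines in cast.items()
        for line in lines
    ]


records = flatten_series(toy_json, "TNG")
print(len(records))
```

Passing `records` to `pd.DataFrame(records)` would give the same three columns as the DataFrame built above.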


"The Next Generation" series has quite a few characters. Let's eliminate the outliers by removing characters that appear in 5 or fewer episodes.

In [3]:
episode_counts = df.groupby('Character')['Episode'].nunique()

characters_to_keep = episode_counts[episode_counts > 5].index

df = df[df['Character'].isin(characters_to_keep)]
unique_characters = df['Character'].unique().tolist()


print(unique_characters)

['PICARD', 'DATA', 'TROI', 'WORF', 'Q', 'TASHA', "O'BRIEN", 'RIKER', 'WESLEY', 'CRUSHER', 'LAFORGE', 'COMPUTER', 'SECURITY', 'WOMAN', 'MAN', 'CREWMAN', 'CHIEF', 'MEDIC', 'VOICE', 'LWAXANA', 'CREWWOMAN', 'NURSE', 'GUINAN', 'PULASKI', 'ALL', 'OGAWA', 'KLINGON', 'ROMULAN', 'ALEXANDER', 'KEIKO', 'RO']


This result is a lot better, but there are still some labels in here, like "ALL", "WOMAN", or "MAN", that don't correspond to a single character or species.
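
One way such generic labels could be dropped explicitly is sketched below. The notebook does not do this itself; the set `GENERIC_LABELS` is a hypothetical list chosen by eye from the output above, and `toy_df` is a made-up stand-in for the real DataFrame that reuses its column names.

```python
import pandas as pd

# Hypothetical set of generic speaker labels; extend it after inspecting the data.
GENERIC_LABELS = {"ALL", "WOMAN", "MAN", "VOICE", "CREWMAN", "CREWWOMAN"}

# Toy stand-in for the notebook's DataFrame (same column names, made-up rows).
toy_df = pd.DataFrame({
    "Line": ["Engage.", "Aye, sir.", "Red alert!", "Fascinating."],
    "Character": ["PICARD", "ALL", "VOICE", "DATA"],
    "Episode": ["episode 0", "episode 0", "episode 1", "episode 1"],
})

# Keep only rows whose speaker is not one of the generic labels.
named_df = toy_df[~toy_df["Character"].isin(GENERIC_LABELS)].reset_index(drop=True)
print(named_df["Character"].tolist())
```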

Let's check the character distribution.


In [4]:
df['Character'].value_counts()


Character
PICARD       10798
RIKER         6454
DATA          5699
LAFORGE       4111
WORF          3185
CRUSHER       2944
TROI          2856
WESLEY        1206
Q              535
PULASKI        487
TASHA          474
COMPUTER       471
O'BRIEN        440
GUINAN         432
LWAXANA        404
RO             304
ALEXANDER      156
OGAWA          110
KEIKO           78
CREWMAN         51
WOMAN           46
NURSE           30
CHIEF           28
VOICE           27
ROMULAN         25
MAN             22
CREWWOMAN       17
SECURITY        15
KLINGON         14
ALL             13
MEDIC            6
Name: count, dtype: int64

The counts above show the number of lines attributed to each character in our dataset. The distribution is heavily imbalanced, so our next step is to once again remove outliers.

To do this, we'll remove characters with fewer than 1000 lines. This keeps the characters who appear frequently, so the model has a substantial number of lines per character to learn patterns from.


In [5]:
character_counts = df['Character'].value_counts()

characters_to_remove = character_counts[character_counts < 1000].index
df = df[~df['Character'].isin(characters_to_remove)]
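
Even after this filter, the remaining classes are not evenly sized. The sketch below quantifies that using the line counts copied from the `value_counts()` output above, restricted to the characters with at least 1000 lines. The variable names `line_counts`, `shares`, and `imbalance_ratio` are illustrative only.

```python
# Line counts copied from the value_counts() output above, limited to the
# characters with at least 1000 lines (the ones that survive the filter).
line_counts = {
    "PICARD": 10798, "RIKER": 6454, "DATA": 5699, "LAFORGE": 4111,
    "WORF": 3185, "CRUSHER": 2944, "TROI": 2856, "WESLEY": 1206,
}

total = sum(line_counts.values())

# Share of the remaining lines spoken by each character.
shares = {name: count / total for name, count in line_counts.items()}
for name, share in shares.items():
    print(f"{name:<8} {share:6.1%}")

# Ratio between the largest and smallest class: a quick imbalance indicator.
imbalance_ratio = max(line_counts.values()) / min(line_counts.values())
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

On these counts, PICARD has roughly nine times as many lines as WESLEY, which may be worth keeping in mind when evaluating a classifier trained on this data.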


## Tokenizing 
Before exploring our observations (lines), we'll tokenize the "Line" field in our dataframe. This breaks each line into individual tokens (words and punctuation marks), which are the units our model will work with.

For now, we're retaining stop words, since they may carry stylistic nuances that matter for character prediction. However, because our dataset is sourced from the internet, we'll enforce consistency by converting all text in the "Line" field to lowercase.


In [6]:
nltk.download('punkt')
df['Line'] = df['Line'].apply(lambda x: word_tokenize(x.lower()))


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Let's print out the tokens that show up most often.

In [7]:
all_tokens = [token for sublist in df['Line'] for token in sublist]

word_frequencies = pd.DataFrame({
    'Token': all_tokens
})

top_words = word_frequencies['Token'].value_counts().head(20)

print("Top 20 Tokens:")
top_words


Top 20 Tokens:


Token
.       52781
,       25925
the     19638
to      15146
i       14973
you     12713
?        9671
a        9294
it       8228
of       7937
is       7069
that     6707
we       6562
's       6159
and      5186
in       4767
have     4663
this     3921
be       3784
do       3759
Name: count, dtype: int64

As the output shows, punctuation makes up a large share of our tokens. While many of these tokens are likely unhelpful, tokens like commas or question marks could reflect speaking styles and so help character prediction in our models. Because of this, we'll retain punctuation in our dataset for the time being.
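
If punctuation later turns out to be unhelpful, it could be filtered out at the token level. The following is a minimal sketch of that comparison, not something the notebook does: `tokenized_lines` is a toy stand-in for the tokenized "Line" column, and `is_punctuation` is a hypothetical helper built on the standard library's `string.punctuation`.

```python
import string
from collections import Counter

# Toy token lists standing in for the tokenized 'Line' column.
tokenized_lines = [
    ["make", "it", "so", "."],
    ["what", "is", "it", ",", "number", "one", "?"],
    ["it", "is", "a", "trap", "."],
]

all_tokens = [token for line in tokenized_lines for token in line]


def is_punctuation(token):
    """True if every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)


# Frequencies with punctuation kept, and with punctuation-only tokens dropped.
with_punct = Counter(all_tokens)
without_punct = Counter(t for t in all_tokens if not is_punctuation(t))

print(with_punct.most_common(3))
print(without_punct.most_common(3))
```

Note that a token such as `'s` contains a letter, so this check would keep it. Only tokens made up entirely of punctuation characters are dropped.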

What's the vocabulary size? 