# Project Part 2

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/brearenee/NLP-Project/blob/main/part2-startrek.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/brearenee/NLP-Project/blob/main/part2-startrek.ipynb)


**NLP Problem:** given a line of dialogue from *Star Trek: The Next Generation*, predict which of the 8 main characters said it.

Part 2 of my project involves creating a basic model for my NLP problem.


As mentioned in Part 1, my dataset's raw format is not convenient to work with.  
To start Part 2, I parse the raw JSON into an organized pandas DataFrame.

In [1]:
import pandas as pd
import json
import requests

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [2]:
url = 'https://raw.githubusercontent.com/brearenee/NLP-Project/main/dataset/StarTrekDialogue_v2.json'
response = requests.get(url)

# This code block is thanks to ChatGPT :-)
if response.status_code == 200:
    json_data = json.loads(response.text)
    lines = []
    characters = []
    episodes = []
  
    # extract the information from the JSON file for the "TNG" series
    for series_name, series_data in json_data.items():
        if series_name == "TNG": 
            for episode_name, episode_data in series_data.items():
                for character_name, character_lines in episode_data.items():
                    for line_text in character_lines:
                        lines.append(line_text)
                        characters.append(character_name)
                        episodes.append(episode_name)
                     
    # Create a DataFrame from the extracted data
    df = pd.DataFrame({
        'Line': lines,
        'Character': characters,
        'Episode': episodes,
    })

    # Remove duplicate lines, keeping the first occurrence (preserving the original order)
    df = df.drop_duplicates(subset='Line', keep='first')

    # Reset the index of the DataFrame
    df.reset_index(drop=True, inplace=True)

else:
    print(f"Failed to retrieve data. Status code: {response.status_code}")


Next, we clean the dataset by removing non-main characters: every character with fewer than 1,000 lines is dropped.


In [3]:
character_counts = df['Character'].value_counts()

characters_to_remove = character_counts[character_counts < 1000].index
df = df[~df['Character'].isin(characters_to_remove)]


In [4]:
df['Character'].value_counts()


Character
PICARD     10798
RIKER       6454
DATA        5699
LAFORGE     4111
WORF        3185
CRUSHER     2944
TROI        2856
WESLEY      1206
Name: count, dtype: int64
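
Before modeling, it helps to know what accuracy a trivial guesser would reach on these classes. The sketch below is plain arithmetic on the counts printed above; the variable names are my own, and the figures describe the full dataset, so the share in any particular test split may differ slightly.

```python
# Line counts per character, copied from the value_counts() output above.
counts = {
    "PICARD": 10798, "RIKER": 6454, "DATA": 5699, "LAFORGE": 4111,
    "WORF": 3185, "CRUSHER": 2944, "TROI": 2856, "WESLEY": 1206,
}

total = sum(counts.values())

# Baseline 1: pick one of the 8 characters uniformly at random.
uniform_baseline = 1 / len(counts)

# Baseline 2: always predict the most frequent speaker.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

print(f"Total lines: {total}")
print(f"Uniform random guess: {uniform_baseline:.3f}")
print(f"Always guess {majority_label}: {majority_baseline:.3f}")
```

Because the classes are imbalanced, the majority baseline (about 29%) is a much stricter bar than the uniform one (12.5%).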

# Decision Tree
For my simple model, I'll be using a Decision Tree Classifier. 


In [5]:
# Vectorize the "Line" column using a "bag of words" representation.
# This representation converts the lines of text into a numerical format
# that can be used by the DecisionTreeClassifier for prediction.
vectorizer = CountVectorizer()

# Extract the lines and their speakers from the DataFrame
lines = df['Line'].tolist()
character = df['Character'].tolist()

X = vectorizer.fit_transform(lines)
y = character

In [6]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Decision Tree model
model1 = DecisionTreeClassifier(
    random_state=42)
model1.fit(X_train, y_train)

# Make predictions on the test set and evaluate the model
y_pred = model1.predict(X_test)
accuracy1 = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy1}')

Accuracy: 0.3206281036102537


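As you can see, the model's accuracy is poor: about 32%, only slightly better than the roughly 29% you would get by always guessing PICARD, the most frequent speaker in the dataset. Let's adjust some hyperparameters to try to increase it.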
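
Instead of editing hyperparameters by hand, the search can be automated. This is only a sketch: it uses scikit-learn's `GridSearchCV` on a small hypothetical corpus (`toy_lines` and `toy_speakers` are made up for illustration, not taken from the dataset), and the grid values are arbitrary assumptions rather than recommended settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus standing in for the real lines and speakers.
toy_lines = [
    "make it so", "engage", "tea earl grey hot",
    "make it so number one", "engage warp nine", "tea please",
    "intriguing", "I am an android", "I do not use contractions",
    "intriguing captain", "I am fully functional", "I do not understand",
]
toy_speakers = ["PICARD"] * 6 + ["DATA"] * 6

X_toy = CountVectorizer().fit_transform(toy_lines)

# Candidate values to try; every combination is scored with 3-fold cross-validation.
param_grid = {
    "max_depth": [2, 5, None],
    "min_samples_split": [2, 4],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_toy, toy_speakers)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

On the real data, `X_train` and `y_train` would take the place of the toy inputs, which keeps the test set untouched during the search.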

In [7]:
model2 = DecisionTreeClassifier(
    random_state=42,
    max_depth=20, 
    #min_samples_split=5,  
    #min_samples_leaf=1,  
    #max_features='sqrt',  
    #criterion='entropy' 
)
model2.fit(X_train, y_train)

# Make predictions on the test set and evaluate the model
y_pred = model2.predict(X_test)
accuracy2 = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy2}')

Accuracy: 0.3482753992752651


After trial and error with many different hyperparameters, I realized these aren't going to help much. I'm going to try a different approach: instead of a bag-of-words representation, I'm going to use TF-IDF to represent the text data.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF vectorization instead of Bag of Words.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lines)
y = character


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Adjust hyperparameters (you can experiment with these)
model3 = DecisionTreeClassifier(
    random_state=42,
    max_depth=20,
    min_samples_split=5,
    #min_samples_leaf=2,
    #max_features='sqrt',
    #criterion='gini'
)


model3.fit(X_train, y_train)
y_pred = model3.predict(X_test)
accuracy3 = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy3}')

Accuracy: 0.34223594148436454


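Switching from bag of words to TF-IDF didn't change much. In fact, the accuracy went down slightly (from about 34.8% to 34.2%), although this run also sets `min_samples_split=5`, so the comparison is not exact. Next, we are going to try a different model while sticking with TF-IDF.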
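
For a cleaner comparison, the tree settings can be held fixed while only the vectorizer changes. The following is a sketch on a hypothetical toy corpus (`toy_lines` and `toy_speakers` are invented for illustration). It also wraps each vectorizer in a scikit-learn pipeline, so the vocabulary is learned from the training lines only; that differs from the cells above, where the vectorizer is fit before the split.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus standing in for the real lines and speakers.
toy_lines = [
    "make it so", "engage", "tea earl grey hot",
    "make it so number one", "engage warp nine", "tea please",
    "intriguing", "I am an android", "I do not use contractions",
    "intriguing captain", "I am fully functional", "I do not understand",
]
toy_speakers = ["PICARD"] * 6 + ["DATA"] * 6

# Split the raw text first so each vectorizer only sees training lines when fitting.
train_lines, test_lines, y_train_toy, y_test_toy = train_test_split(
    toy_lines, toy_speakers, test_size=0.25, random_state=42, stratify=toy_speakers
)

results = {}
for name, vec in [("bag of words", CountVectorizer()), ("tf-idf", TfidfVectorizer())]:
    # Identical tree settings in both runs; only the text representation differs.
    pipe = make_pipeline(vec, DecisionTreeClassifier(random_state=42, max_depth=20))
    pipe.fit(train_lines, y_train_toy)
    results[name] = pipe.score(test_lines, y_test_toy)

print(results)
```

The toy scores themselves are not meaningful; the point is the structure, which isolates the effect of the representation from the effect of the hyperparameters.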


# Random Forest

In [9]:
from sklearn.ensemble import RandomForestClassifier
model4 = RandomForestClassifier(random_state=42)
model4.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model4.predict(X_test)

# Evaluate the model
accuracy4 = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy4}')

Accuracy: 0.41068312978123744


It's better! 41% still isn't very accurate, but with 8 possible characters it clearly beats both a uniform random guess (12.5%) and always guessing PICARD, who accounts for about 29% of the lines.