# Project Part 1

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/brearenee/NLP-Project/blob/main/startrek.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/brearenee/NLP-Project/blob/main/startrek.ipynb)


## 1. Introduction/Background
In this notebook, I'll be working with a dataset of dialogue transcripts from various Star Trek series, which I found on [Kaggle](https://www.kaggle.com/datasets/birkoruzicka/startrekdialoguetranscripts/data). The dataset provides a large number of script lines, each accompanied by its episode, its series, and the character who delivered the line.

The objective of this project is to build a model capable of predicting which character spoke a given line from the script. This type of problem is known as speaker identification, and I'll be treating it as a text classification problem, since the model will need to learn patterns that are indicative of each character's speaking style. Because many different characters appear in the dataset, the result is a multi-class classification task.
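
To make the framing concrete, here is a minimal sketch of how speaker identification reduces to multi-class classification: each distinct character name becomes one class label. The four sample rows are copied from the dataset preview shown later in this notebook; the names `samples`, `label_to_id`, and `majority_baseline_accuracy` are illustrative and not part of the project code. The majority-class baseline is an assumption I'm adding as a reference point, not the model this project builds.

```python
from collections import Counter

# (line, character) pairs copied from the dataset preview shown later
samples = [
    ("Check the circuit.", "SPOCK"),
    ("Engage.", "PIKE"),
    ("Yeoman.", "PIKE"),
    ("Oh, I see. Thank you.", "PIKE"),
]

# Each distinct character becomes one class in the multi-class problem
classes = sorted({character for _, character in samples})
label_to_id = {name: idx for idx, name in enumerate(classes)}

# Encode the target column as integer class ids
y = [label_to_id[character] for _, character in samples]

# Majority-class baseline: always predict the most frequent speaker
majority_id, majority_count = Counter(y).most_common(1)[0]
majority_baseline_accuracy = majority_count / len(y)

print(label_to_id)
print(y)
print(majority_baseline_accuracy)
```

Any real classifier for this task should at least beat the accuracy of always guessing the most frequent speaker.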

## 2. Data Preprocessing

The data source I'm working with is structured as a deeply nested JSON file, and its original format isn't well suited to the model I'm trying to build. Because of this, I'll need to parse the file and flatten its structure into a more useful DataFrame.
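
As a rough illustration of the reshaping, the sketch below flattens a tiny hand-written dictionary that mimics the nesting I'm assuming for the file (series, then episode, then character, then a list of lines). The `nested` sample and the `flatten` helper are hypothetical; only the four-level layout mirrors the real data.

```python
# Hypothetical miniature of the assumed layout:
# series -> episode -> character -> list of lines
nested = {
    "TOS": {
        "tos_000": {
            "SPOCK": ["Check the circuit."],
            "PIKE": ["Engage.", "Yeoman."],
        }
    }
}


def flatten(data):
    """Turn the nested mapping into one flat record per spoken line."""
    return [
        {"Line": line, "Character": character, "Episode": episode, "Series": series}
        for series, episodes in data.items()
        for episode, characters in episodes.items()
        for character, lines in characters.items()
        for line in lines
    ]


records = flatten(nested)
for record in records:
    print(record)
```

A list of flat records like this can be passed to `pd.DataFrame(records)` to get one row per line.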



In [1]:
# Import the modules used in this notebook
import json

import pandas as pd
import requests

In [2]:
url = 'https://raw.githubusercontent.com/brearenee/NLP-Project/main/dataset/StarTrekDialogue.json'

# Make a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Load JSON data from the response
    json_data = json.loads(response.text)

    # Extract lines, characters, episodes, and series
    lines = []
    characters = []
    episodes = []
    series = []

    # extract the information from the JSON file
    for series_name, series_data in json_data.items():
        for episode_name, episode_data in series_data.items():
            for character_name, character_lines in episode_data.items():
                for line_text in character_lines:
                    lines.append(line_text)
                    characters.append(character_name)
                    episodes.append(episode_name)
                    series.append(series_name)

    # Create a DataFrame from the extracted data
    df = pd.DataFrame({
        'Line': lines,
        'Character': characters,
        'Episode': episodes,
        'Series': series
    })

    # Remove duplicate lines, keeping the first occurrence (preserving the original order)
    df = df.drop_duplicates(subset='Line', keep='first')

    # Reset the index of the DataFrame
    df.reset_index(drop=True, inplace=True)

else:
    # Stop here: later cells depend on df, which is only defined on success
    raise RuntimeError(f"Failed to retrieve data. Status code: {response.status_code}")
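
One consequence of deduplicating on `Line` alone is worth keeping in mind: when two characters say the same text, only the first speaker's row survives. The pure-Python sketch below imitates keep-first deduplication on hypothetical rows (the `rows` data and the `dedupe_keep_first` helper are made up for illustration) and is not the pandas implementation itself.

```python
# Hypothetical rows: the same short line spoken by two characters
rows = [
    {"Line": "Engage.", "Character": "PIKE"},
    {"Line": "Fascinating.", "Character": "SPOCK"},
    {"Line": "Engage.", "Character": "PICARD"},
]


def dedupe_keep_first(items, key_fields):
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    kept = []
    for item in items:
        key = tuple(item[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(item)
    return kept


by_line = dedupe_keep_first(rows, ["Line"])
by_line_and_speaker = dedupe_keep_first(rows, ["Line", "Character"])

print(len(by_line))              # the second "Engage." row is dropped
print(len(by_line_and_speaker))  # all three rows are kept
```

Whether that loss matters depends on how many short, shared lines the dataset contains; deduplicating on both `Line` and `Character` would be one alternative to consider.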

**Test out our new DataFrame**

## 3. Exploratory Data Analysis

In [3]:
print(df.head(100))

                                                 Line Character  Episode  \
0                                  Check the circuit.     SPOCK  tos_000   
1   It can't be the screen then. Definitely someth...     SPOCK  tos_000   
2   Their call letters check with a survey expedit...     SPOCK  tos_000   
3   Records show the Talos group has never been ex...     SPOCK  tos_000   
4               We aren't going to go, to be certain?     SPOCK  tos_000   
..                                                ...       ...      ...   
95                                            Engage.      PIKE  tos_000   
96                                            Yeoman.      PIKE  tos_000   
97   I thought I told you that when I'm on the bridge      PIKE  tos_000   
98                              Oh, I see. Thank you.      PIKE  tos_000   
99  She does a good job, all right. It's just that...      PIKE  tos_000   

   Series  
0     TOS  
1     TOS  
2     TOS  
3     TOS  
4     TOS  
..    ...  
95 