# Project Part 1

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/sgeinitz/CS39AA-project/blob/main/project_part1.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sgeinitz/CS39AA-project/blob/main/project_part1.ipynb)

This notebook is intended to serve as a template for completing Part 1 of the project. Feel free to modify this notebook as needed, but be sure to have the two main parts: a) an introductory proposal section describing what you are going to do and where the dataset originates, and b) an exploratory analysis section that has the histograms, charts, tables, etc. that are the output from your exploratory analysis.

__Note you will want to remove the text above, and in the markdown cells below, and replace it with your own text describing the dataset, task, exploratory steps, etc.__

## 1. Introduction/Background

_In this section you will describe (in English) the dataset you are using as well as the NLP problem it deals with. For example, if you are planning to use the Twitter Natural Disaster dataset, then you will describe what the data is and where it came from, as if you were explaining it to someone who does not know anything about the data. You will then describe how this is a __text classification__ problem, and that the labels are binary (e.g. a tweet either refers to a genuine/real natural disaster, or it does not)._

_Overall, this should be about a paragraph of text that someone outside of our class could read and still understand what your project is doing._

_Note that you should __not__ simply write one sentence stating, "This project is based on the Kaggle competition: Predicting Natural Disasters with Twitter."_

_If you are still looking for datasets to use, consider the following resources to explore text datasets._

* https://huggingface.co/datasets/
* https://www.kaggle.com/datasets
* https://data-flair.training/blogs/machine-learning-datasets/ 
* https://pytorch.org/text/stable/datasets.html
* https://github.com/niderhoff/nlp-datasets 
* https://medium.com/@ODSC/20-open-datasets-for-natural-language-processing-538fbfaf8e38 
* https://imerit.net/blog/25-best-nlp-datasets-for-machine-learning-all-pbm/ 


_If you instead are planning to do a more research-oriented or applied type of project, then describe what it is that you plan to do._

_If it is research, then what do you want to understand/explain better?_

_If it is applied, then what is it that you plan to build?_




## Introduction 

I'm not yet sure how this will pan out, but my current idea is this: given a line of dialogue from a Star Trek script, I'd like to be able to predict which character said it. This is a text classification problem, but the labels are multiclass rather than binary, since there are many characters. I'll work out the details soon.

I'll be using the following dataset: 
https://www.kaggle.com/datasets/birkoruzicka/startrekdialoguetranscripts/data

Currently, I'm learning how to parse the JSON file into a more useful format.
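
As a rough illustration of that parsing step, the sketch below flattens a tiny hand-made dictionary into one row per spoken line. It assumes the JSON is nested as series, then episode, then character, then a list of lines, which is the structure the loading code in Section 2 iterates over. The `sample` dictionary and the `flatten_dialogue` helper are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Tiny hand-made sample mimicking the assumed nesting:
# series -> episode -> character -> list of lines.
sample = {
    "TOS": {
        "tos_000": {
            "SPOCK": ["Check the circuit."],
            "PIKE": ["Engage.", "Yeoman."],
        }
    }
}


def flatten_dialogue(data):
    """Return one dict per spoken line (hypothetical helper for illustration)."""
    return [
        {"Line": line, "Character": character, "Episode": episode, "Series": series}
        for series, episodes in data.items()
        for episode, cast in episodes.items()
        for character, lines in cast.items()
        for line in lines
    ]


sample_df = pd.DataFrame(flatten_dialogue(sample))
print(sample_df)
```

Each row pairs a line of text with its label (`Character`), which is the shape a text classifier would eventually need.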


## 2. Exploratory Data Analysis

_You will now load the dataset and carry out some exploratory data analysis steps to better understand what the text data looks like. See the examples from class on 10/. The following links provide some good resources on exploratory analysis of text data with Python._


* https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools
* https://regenerativetoday.com/exploratory-data-analysis-of-text-data-including-visualization-and-sentiment-analysis/
* https://medium.com/swlh/text-summarization-guide-exploratory-data-analysis-on-text-data-4e22ce2dd6ad  
* https://www.kdnuggets.com/2019/05/complete-exploratory-data-analysis-visualization-text-data.html  
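
A first pass at these exploratory steps might look like the sketch below: count how many examples each label has (the class distribution) and summarize text length in words. The toy DataFrame and its column names (`Line`, `Character`) are assumptions made for illustration; substitute your own data and columns.

```python
import pandas as pd

# Toy data standing in for a real dataset (assumed columns: Line, Character).
toy_df = pd.DataFrame({
    "Line": ["Check the circuit.", "Engage.", "Yeoman.", "Oh, I see. Thank you."],
    "Character": ["SPOCK", "PIKE", "PIKE", "PIKE"],
})

# Class distribution: number of lines per label.
lines_per_character = toy_df["Character"].value_counts()

# Text length measured in whitespace-separated tokens.
toy_df["n_words"] = toy_df["Line"].str.split().str.len()

print(lines_per_character)
print(toy_df["n_words"].describe())
```

From here, `toy_df["n_words"].plot.hist()` would give a histogram of line lengths (this requires matplotlib), and a heavily skewed `lines_per_character` would be an early sign of class imbalance.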


In [1]:
# import all of the python modules/packages you'll need here
import pandas as pd
import json
import requests
# ...

#### Parse the JSON dataset into a more suitable format (a DataFrame)

In [2]:
url = 'https://raw.githubusercontent.com/brearenee/NLP-Project/main/dataset/StarTrekDialogue.json'

# Make a GET request to the URL and stop early if it failed,
# so later cells never run against a missing DataFrame
response = requests.get(url, timeout=30)
response.raise_for_status()

# Parse the JSON body: series -> episode -> character -> list of lines
json_data = response.json()

# Collect one entry per spoken line, along with its character, episode, and series
lines = []
characters = []
episodes = []
series = []

for series_name, series_data in json_data.items():
    for episode_name, episode_data in series_data.items():
        for character_name, character_lines in episode_data.items():
            for line_text in character_lines:
                lines.append(line_text)
                characters.append(character_name)
                episodes.append(episode_name)
                series.append(series_name)

# Create a DataFrame from the extracted data
df = pd.DataFrame({
    'Line': lines,
    'Character': characters,
    'Episode': episodes,
    'Series': series
})

# Remove duplicate lines, keeping the first occurrence (preserving the original order)
df = df.drop_duplicates(subset='Line', keep='first')

# Reset the index of the DataFrame
df = df.reset_index(drop=True)
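
One design choice in the cell above is worth a closer look: de-duplicating on `Line` alone keeps a repeated sentence only for the first character who says it. The sketch below uses hand-made toy rows (not taken from the dataset) to compare that with de-duplicating on the `Line` and `Character` pair. Whether short shared lines should be kept for every speaker is an open modeling decision that I still need to make.

```python
import pandas as pd

# Hand-made toy rows: two characters say the same sentence.
toy = pd.DataFrame({
    "Line": ["Engage.", "Engage.", "Yeoman.", "Engage."],
    "Character": ["PIKE", "SPOCK", "PIKE", "PIKE"],
})

# Same call as in the cell above: one row per distinct sentence, first speaker wins.
by_line = toy.drop_duplicates(subset="Line", keep="first")

# Alternative: one row per distinct (sentence, speaker) pair.
by_line_and_character = toy.drop_duplicates(subset=["Line", "Character"], keep="first")

print(by_line)
print(by_line_and_character)
```

With the first option, the second speaker's copy of the sentence disappears from the data entirely; with the second, the same text appears under more than one label.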

**Test out our new DataFrame**

In [3]:
print(df.head(100))

                                                 Line Character  Episode  \
0                                  Check the circuit.     SPOCK  tos_000   
1   It can't be the screen then. Definitely someth...     SPOCK  tos_000   
2   Their call letters check with a survey expedit...     SPOCK  tos_000   
3   Records show the Talos group has never been ex...     SPOCK  tos_000   
4               We aren't going to go, to be certain?     SPOCK  tos_000   
..                                                ...       ...      ...   
95                                            Engage.      PIKE  tos_000   
96                                            Yeoman.      PIKE  tos_000   
97   I thought I told you that when I'm on the bridge      PIKE  tos_000   
98                              Oh, I see. Thank you.      PIKE  tos_000   
99  She does a good job, all right. It's just that...      PIKE  tos_000   

   Series  
0     TOS  
1     TOS  
2     TOS  
3     TOS  
4     TOS  
..    ...  
95 