# Text Cleaning Pipeline
This notebook imports the video comments and news-source transcripts, joins them, and computes sentiment scores for both the source transcript and the cleaned comment text before analysis.

In [1]:
from _functions import *

  demoji.download_codes()
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Trevo\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## Create Model Input
Build the model input by combining each video's comments with its transcript.

In [2]:
parent_directory = Path().resolve().parent
data_directory = parent_directory / "Data"
comments_directory = data_directory / "video_comments"
transcripts_directory = data_directory / "video_transcripts"

In [3]:
# Sentinel marking sources without a valid score (these are filtered out below)
DUDNUM = 999999999

reliability_scores = {"nbc_news": 42.78, "wsj": 48.87, "bbc_news": 44.73, "bloomberg": 42.11,
                      "cnn": 42.13, "sixty_minutes": 34.19, "sky_news": 42.24,
                      "wusa": 46.67, "dw_news": DUDNUM, "forbes": 41.06,
                      "fox_news": 24.83, "podcast_cvt": DUDNUM, "econ_explained": DUDNUM}

bias_scores = {"nbc_news": -5.61, "wsj": -0.27, "bbc_news": -1.33, "bloomberg": -3.16, "cnn": -6.18,
               "sixty_minutes": -9.55, "sky_news": -0.88, "wusa": -1.50, "dw_news": DUDNUM,
               "forbes": -3.87, "fox_news": 18.50, "podcast_cvt": DUDNUM, "econ_explained": DUDNUM}

# NOTE : "wusa" holds a number (the same value as its bias score) instead of a rating label;
#        the value is kept as found because the intended label is not stated
bias_ratings = {"nbc_news": "Lean Left", "wsj": "Center", "bbc_news": "Center", "bloomberg": "Lean Left",
                "cnn": "Lean Left", "sixty_minutes": "Lean Left", "sky_news": "Lean Left",
                "wusa": -1.50, "dw_news": DUDNUM, "forbes": "Center", "fox_news": "Right",
                "podcast_cvt": DUDNUM, "econ_explained": DUDNUM}

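The three lookup tables above are maintained by hand, so a quick consistency check can catch mismatched keys or values of the wrong type before they reach the comments DataFrame. The following is a minimal sketch, not part of the original notebook: the helper name `check_source_tables` is hypothetical, and the small tables are a subset of the values above used only for illustration.

```python
DUDNUM = 999999999  # sentinel from the notebook: no valid score for this source


def check_source_tables(reliability, bias, ratings, sentinel=DUDNUM):
    """Return a list of human-readable problems found in the three lookup tables."""
    problems = []

    # Every source should appear in all three tables.
    all_sources = set(reliability) | set(bias) | set(ratings)
    tables = (("reliability", reliability), ("bias", bias), ("ratings", ratings))
    for name in sorted(all_sources):
        for label, table in tables:
            if name not in table:
                problems.append(f"{name}: missing from {label} table")

    # A rating should be a text label unless it is the sentinel.
    for name, rating in sorted(ratings.items()):
        if rating != sentinel and not isinstance(rating, str):
            problems.append(f"{name}: rating {rating!r} is not a text label")

    return problems


# Illustrative subset of the tables defined above
reliability = {"wsj": 48.87, "wusa": 46.67, "dw_news": DUDNUM}
bias = {"wsj": -0.27, "wusa": -1.50, "dw_news": DUDNUM}
ratings = {"wsj": "Center", "wusa": -1.50, "dw_news": DUDNUM}

for problem in check_source_tables(reliability, bias, ratings):
    print(problem)
```

On this subset the check reports the numeric `wusa` rating, which would otherwise end up in the `leaning` column next to text labels.
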
## Data Processing and Sentiment Analysis

This code processes video comments and transcripts, extracting relevant information and calculating sentiment scores. It performs the following steps:

1. **Initialize DataFrames**: Creates empty DataFrames for storing transcripts and comments.
2. **Iterate Over Files**: Loops through each file in the comments directory.
3. **Read Comments**: Reads the comments from CSV files, dropping any rows with missing values and unnecessary columns.
4. **Read Transcripts**: Reads the corresponding transcript files if they exist.
5. **Extract Information**: Extracts the source name and index from the file name.
6. **Filter Data**: Filters out sources with invalid reliability scores.
7. **Add Metadata**: Adds metadata such as source name, reliability score, and bias score to the comments DataFrame.
8. **Sentiment Analysis**: Calculates VADER sentiment scores for both the transcript and comments.
9. **Append Data**: Appends the processed data to the respective DataFrames.
10. **Save Data**: Saves the final DataFrames to CSV files.
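
Steps 5 and 6 can be isolated into small helpers, which is one way to address the "turn this into a function" TODO in the cell below. This is a sketch under assumptions: the helper names `parse_comment_file` and `has_valid_score` are hypothetical, and the example file name `nbc_news3.csv` is invented to match the `<source><number>.csv` pattern that the notebook's regular expressions imply.

```python
import re

DUDNUM = 999999999  # sentinel from the notebook: no valid score for this source


def parse_comment_file(file_name):
    """Split a comments file name into (source_name, index, transcript_file_name)."""
    # Same patterns as the notebook: strip the trailing digits and ".csv" to get
    # the source, and keep only the digits to get the index.
    source_name = re.sub(r"[0-9]+\.csv", "", file_name)
    index = int(re.sub(r"\D", "", file_name))
    transcript_file_name = file_name.replace(".csv", "_transcript.txt")
    return source_name, index, transcript_file_name


def has_valid_score(source_name, reliability_scores, sentinel=DUDNUM):
    """True if the source is known and its reliability score is not the sentinel."""
    # Unlike direct indexing, .get() treats an unknown source as unscored
    # instead of raising KeyError (a deliberate difference from the notebook).
    return reliability_scores.get(source_name, sentinel) != sentinel


print(parse_comment_file("nbc_news3.csv"))
print(has_valid_score("dw_news", {"dw_news": DUDNUM}))
```

One caveat of the digit-stripping pattern: a source name that itself contains digits would have those digits merged into the index, so the approach relies on source names being digit-free.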

In [4]:
transcript_df = pd.DataFrame(columns=['index', 'source', 'transcript'])

comments_df = pd.DataFrame(columns=['index', 'source', 'leaning', 'reliability_score', 'bias_score', 'vader_transcript', 'vader_comment', 'comment'])

# Iterate over each comments .csv file in the directory
for file_name in os.listdir(comments_directory):

    # Construct comments file path
    file_path = comments_directory / file_name

    # Remove ".csv" and concatenate "_transcript.txt" for file_name
    transcript_file_name = file_name.replace(".csv", "_transcript.txt")
    transcript_file_path = transcripts_directory / transcript_file_name

    # TODO : Read in the CSV file for the comments
    input_df = pd.read_csv(file_path, encoding="utf-8")
    input_df = input_df.dropna()

    # Drop 'Unnamed: 0' column if it exists, as not all csv pulls have the index
    if 'Unnamed: 0' in input_df.columns:
        input_df = input_df.drop(columns=['Unnamed: 0'])

    # TODO : Read in the transcript file and append score to dataframe
    # NOTE : Cannot make use of any data without a transcript
    if transcript_file_path.exists():
        with open(transcript_file_path, "r", encoding="utf-8") as file:
            transcript_text = file.read().strip()

        # DONE : Define unique identifiers for the data
        source_name = re.sub(r"[0-9]+\.csv", "", file_name)
        index = int(re.sub(r"\D", "", file_name))

        # TODO : Turn this into a function
        # NOTE : Cannot make use of a function without having ,
        if reliability_scores[source_name] != DUDNUM:

            # TODO : Add the news source name to the dataframe
            input_df['index'] = index
            input_df["source"] = source_name
            input_df['leaning'] = bias_ratings[source_name]
            input_df["reliability_score"] = reliability_scores[source_name]
            input_df["bias_score"] = bias_scores[source_name]

            # TODO : Add the VADER sentiment score to the dataframe
            input_df["vader_transcript"] = analyze_sentiment_vader(transcript_text)

            # TODO : Clean the comment
            input_df["clean_comment"] = input_df["comment"].apply(preprocess_text)

            # Score the cleaned comment text
            input_df["vader_comment"] = input_df["clean_comment"].apply(analyze_sentiment_vader)

            # DONE : Append the transcript as a row to the transcript dataframe
            # (a plain list keeps `index` an int; np.array would coerce every field to str)
            transcript_df.loc[len(transcript_df)] = [index, source_name, transcript_text]

            # Drop the 0th index row; errors="ignore" avoids a KeyError
            # when dropna() has already removed that row
            input_df = input_df.drop(0, errors="ignore")

            comments_df = pd.concat([comments_df, input_df])


## Determining Sarcasm of the Comment

In [5]:
import pandas as pd
from _functions import get_relative_path

sarcasm_df = pd.read_csv(get_relative_path() / "reddit_sarcasm_training.csv", encoding="utf-8")
sarcasm_df.head()

Unnamed: 0,label,comment
0,0,NC and NH.
1,0,I think a significant amount would be against ...
2,0,because it's what really bothers him... and it...
3,0,Conservatism as an ideology is for sure a reac...
4,0,"Maybe not control, but certainly that is evide..."


In [6]:
# Drop NaN values
sarcasm_df = sarcasm_df.dropna()
sarcasm_df = sarcasm_df.reset_index(drop=True)
sarcasm_df.head()

Unnamed: 0,label,comment
0,0,NC and NH.
1,0,I think a significant amount would be against ...
2,0,because it's what really bothers him... and it...
3,0,Conservatism as an ideology is for sure a reac...
4,0,"Maybe not control, but certainly that is evide..."


In [7]:
sarcasm_sample_df = sarcasm_df.sample(n=2000, random_state=8)
sarcasm_sample_df = sarcasm_sample_df.reset_index(drop=True)
sarcasm_sample_df.head()

Unnamed: 0,label,comment
0,1,But the problem is how progressives have faile...
1,0,"I'll vote against Hillary, thanks tho!"
2,1,As opposed to all of the other qualified and c...
3,0,Dribble.
4,0,"By forming a PAC, money that is donated and no..."


In [8]:
from BERT import SarcasmDetector
import pandas as pd

detector = SarcasmDetector()
detector.train(sarcasm_sample_df['comment'].tolist(), sarcasm_sample_df['label'].tolist())

# Predict on new examples
new_texts = [
    "Oh fantastic, my car broke down in the middle of nowhere!",
    "I'm really looking forward to my dentist appointment tomorrow.",
    "I just won the lottery! Best day ever!"
]
print(detector.predict(new_texts))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 0.6343585167825222
Epoch 2, Loss: 0.4652016258239746
Epoch 3, Loss: 0.24268606734462084
Validation Accuracy: 0.7225
              precision    recall  f1-score   support

           0       0.64      0.72      0.67       160
           1       0.79      0.72      0.76       240

    accuracy                           0.72       400
   macro avg       0.71      0.72      0.72       400
weighted avg       0.73      0.72      0.72       400

['Sarcastic', 'Not Sarcastic', 'Sarcastic']
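
The `macro avg` and `weighted avg` rows of the report above are derived from the per-class rows: the macro average is the plain mean over classes, while the weighted average weights each class by its support. The sketch below recomputes them from the rounded per-class values printed in the report, so the results agree with the report only up to rounding. The helper names are hypothetical.

```python
def macro_average(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


def weighted_average(values, supports):
    """Mean over classes, weighting each class by its number of samples."""
    total = sum(supports)
    return sum(value * count for value, count in zip(values, supports)) / total


# Per-class values copied from the report (class 0, class 1)
precision = [0.64, 0.79]
recall = [0.72, 0.72]
support = [160, 240]

print("weighted precision:", round(weighted_average(precision, support), 2))
print("weighted recall:   ", round(weighted_average(recall, support), 2))
print("macro recall:      ", round(macro_average(recall), 2))
```

Because the validation set is split 160 to 240 between the classes, the weighted averages lean toward class 1, which is why the weighted precision sits closer to 0.79 than the macro precision does.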


In [None]:
# Test sarcasm detection
test_text = sarcasm_sample_df["comment"].iloc[8]
print(f"Text: {test_text}\nSarcastic? {detector.predict([test_text])}")

Text: Let's just hand over the keys to the economy to a guy who can manage neither his campaign finances nor his tax returns, what could go wrong ?
Sarcastic? ['Sarcastic']


In [12]:
comments_df['sarcasm'] = detector.predict(comments_df['clean_comment'].tolist())
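
Passing every comment to the model in a single call may use a lot of memory on a large dataset. Whether `SarcasmDetector.predict` already batches internally is not shown here, so the following is only a sketch of one way to chunk the input. `predict_in_batches` is a hypothetical helper, and `fake_predict` is a stand-in rule used so the example runs without the trained model.

```python
def predict_in_batches(predict_fn, texts, batch_size=64):
    """Call predict_fn on consecutive slices of texts and join the results in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    labels = []
    for start in range(0, len(texts), batch_size):
        labels.extend(predict_fn(texts[start:start + batch_size]))
    return labels


# Stand-in for detector.predict (hypothetical rule, for illustration only)
def fake_predict(batch):
    return ["Sarcastic" if text.endswith("!") else "Not Sarcastic" for text in batch]


texts = ["Great!", "ok", "Sure!", "fine", "Wow!"]
labels = predict_in_batches(fake_predict, texts, batch_size=2)
print(labels)
```

With the real detector, the call would take the form `predict_in_batches(detector.predict, comments_df['clean_comment'].tolist())`, assuming `predict` accepts a list of strings and returns one label per string, as the earlier cells suggest.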

In [13]:
# Save data for analysis

comments_df.to_csv(get_relative_path() / "comments.csv", index=False)
transcript_df.to_csv(get_relative_path() / "transcripts.csv", index=False)