### (pavellexyr) The Reddit Climate Change Dataset: ###
Retrieved from: https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset

There are two .csv files, one containing comments from Reddit on climate change and the other containing posts on climate change. For this project, we will use only the comments .csv file, because sentiment analysis has already been performed on its 4.6 million observations.

An observation in this dataset consists of:
* Type of datapoint (comment)
* Unique ID of the comment
* Unique ID of the comment’s subreddit
* The name of the subreddit the comment was found on
* If the comment’s subreddit is NSFW
* The timestamp (UTC) of the comment
* Permalink to the comment
* Body text of the comment
* Analyzed sentiment for the comment as a continuous value in [-1, 1]
* Comment’s score (votes on Reddit)

### (cosmos98) Twitter and Reddit Sentimental analysis Dataset: ###
Retrieved from: https://www.kaggle.com/datasets/cosmos98/twitter-and-reddit-sentimental-analysis-dataset

There are two .csv files, one containing comments from Reddit (36k observations) and the other containing tweets from Twitter (162k observations). The Twitter dataset was extracted with a focus on tweets people made about the Indian Prime Minister Narendra Modi. The Reddit dataset gives no indication of the specific area or topic its comments were sourced from.

Both datasets only have two variables:
* The text of the comment or tweet
* The category/sentiment of the text {-1, 0, 1}

### (tirendazacademy) FIFA World Cup 2022 Tweets:  ###
Retrieved from: https://www.kaggle.com/datasets/tirendazacademy/fifa-world-cup-2022-tweets

There is one .csv file containing tweets regarding the 2022 FIFA World Cup, a dataset of about 22k observations.

An observation in this dataset contains:
* ID (index) of the observation
* Date the tweet was created
* Number of likes the tweet had
* Source of the tweet (Twitter for iPhone, Twitter for Android)
* The body text of the tweet
* Sentiment of the tweet as strings: “positive”, “neutral”, or “negative”


For all the datasets discussed above, there are only two variables we are concerned with: the text of the comment/tweet and the sentiment the text was already given. Because some datasets are very large, only the first 100k observations of each dataset will be used.

All datasets will undergo further data cleaning. To the best of our ability, we will filter each dataset to keep only text detected as English, using the langdetect library and regex. Other unusable observations (such as rows containing NaN values) will also be excluded. This results in slightly fewer than the upper limit of 100k observations initially taken from each dataset. In addition, the existing numeric sentiment labels that some datasets have will be changed to the string values “positive”, “negative”, and “neutral”, so that all datasets are consistent with each other and have only 3 classes for classification.

The full code for cleaning the files used up to this point is in the “COGS118A replacement data cleaning.ipynb” file in this repository. Below are code snippets, mainly of the functions created to clean the data. A full demonstration is available in that file, as it is simply too large to include here. The snippets below use example variable names rather than the names actually used in practice.


In [4]:
"""
Libraries and global variables to be used for cleaning and limiting collected data to 100k observations at most. 
"""

import numpy as np
import pandas as pd

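row_count = 1000  # rows per chunk when reading and validating
max_obv = 100     # number of chunks to keep (100 chunks x 1000 rows = 100k observations)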

#https://pypi.org/project/langdetect/
### uncomment this to install, then comment and restart kernel ###
# %%capture
# !pip install langdetect

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException
### regex for more lang checking
import re

Taking the first 100k observations of a dataset and putting it into an initial DataFrame object.

In [None]:
data_text = []
data_sentiment = []
i = 0
for chunk in pd.read_csv('dataset.csv', chunksize=row_count):
  if i < max_obv:
    data_text += chunk['text'].tolist()
    data_sentiment += chunk['sentiment'].tolist()
    i += 1
  else:
    break

In [None]:
df = pd.DataFrame(data={'text': data_text, 'sentiment': data_sentiment})
df = df.dropna()
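
As an alternative sketch, `pandas.read_csv` can cap the number of rows read through its `nrows` parameter and restrict the columns through `usecols`, which avoids the manual chunk loop. The in-memory CSV and the column names `text` and `sentiment` below are stand-ins for a real file such as 'dataset.csv', and the limit is shrunk to 3 rows so the result is easy to check by eye.

```python
import io
import pandas as pd

# Stand-in for a dataset file; a real path would be passed instead of the buffer.
csv_data = io.StringIO(
    "id,text,sentiment\n"
    "1,good game,1\n"
    "2,,0\n"
    "3,bad call,-1\n"
    "4,never read,1\n"
)

max_rows = 3  # would be 100_000 for the real datasets

# nrows stops reading after max_rows data rows; usecols keeps only the needed columns.
sample_df = pd.read_csv(csv_data, usecols=["text", "sentiment"], nrows=max_rows)
sample_df = sample_df.dropna()  # drops the row whose text is missing

print(sample_df)
```

The fourth row is never read, and the row with missing text is removed by `dropna()`, so two observations remain.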

Function to validate a text body. Valid text is a non-empty string that is not solely whitespace. Regex and langdetect are used to keep only observations that contain Latin characters and are detected as English (so that, for example, text written solely in Korean characters can be filtered out).

In [None]:
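def validate_line(line):
    # Reject missing values, non-strings, and empty or whitespace-only text.
    if not isinstance(line, str) or not line.strip():
        return np.nan

    # Reject text with no Latin letters at all (e.g. text written solely in Korean).
    if not re.search('[a-zA-Z]', line):
        return np.nan

    # Keep only text that langdetect classifies as English.
    try:
        if detect(line) != 'en':
            return np.nan
    except LangDetectException:
        # Raised when langdetect cannot extract any features from the text.
        return np.nan
    return True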
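
The Latin-letter pre-check can be tried on its own, without langdetect. This is a minimal sketch: `has_latin_letter` is a hypothetical helper name that is not part of the notebook, and the sample strings are made up for illustration.

```python
import re

def has_latin_letter(text):
    # re.search scans the whole string, including text after line breaks.
    return re.search(r"[a-zA-Z]", text) is not None

samples = ["Climate change is real", "기후 변화", "12345 !!!", "안녕 hello"]
for s in samples:
    print(repr(s), has_latin_letter(s))
```

Mixed-script text such as the last sample passes this pre-check, so deciding whether it counts as English is still left to the language detector.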

Function to check whether each text body in the dataset is English (detect returns 'en').
* text_col takes a pandas Series: df['text']
* sentiment_col takes a pandas Series: df['sentiment']

Returns 3 lists of the same length. Rows are validated in chunks of 1000; the final chunk (at most 1000 trailing observations) is never processed and is truncated from all three lists.

In [None]:
def check_en(text_col, sentiment_col):
    en_text = text_col.tolist()
    en_sentiment = sentiment_col.tolist()
    lang = []
    
    start = 0
    for i in np.arange(row_count, len(en_text), row_count):
        # the final chunk (at most 1000 observations) is never validated and is lost; impact is negligible
        # uncomment the print statement below to show progress (recommended)
        # print(start, i)
        lang += [validate_line(x) for x in en_text[start:i]]
        start = i
    print("Finished English check")
    ### all three return values should be of the same length
    return en_text[0:len(lang)], en_sentiment[0:len(lang)], lang
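
The effect of the chunk boundaries on how many rows survive can be checked with plain `range`, which yields the same stop points as `np.arange(row_count, n, row_count)`. This is only a sketch: `processed_count` is a hypothetical helper, and the row totals are illustrative.

```python
def processed_count(n, chunk_size=1000):
    """Number of leading rows covered by the slices [start:i] for each stop point i."""
    stops = list(range(chunk_size, n, chunk_size))
    return stops[-1] if stops else 0

for n in (2500, 3000, 800):
    kept = processed_count(n)
    print(f"rows={n} validated={kept} dropped={n - kept}")
```

Because the stop value itself is excluded, a total that is an exact multiple of the chunk size also loses its last full chunk, and an input shorter than one chunk is dropped entirely.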

Putting the returned lists from check_en into a new DataFrame. The 'english' column will be dropped after removing all rows with NaN values (non-English or invalid text bodies).

In [None]:
en_data_text, en_data_sentiment, en_data_lang = check_en(df['text'], df['sentiment'])

en_df = pd.DataFrame(data={'text': en_data_text, 'sentiment': en_data_sentiment, 'english': en_data_lang})
en_df = en_df.dropna()
en_df = en_df.drop(columns=['english'])

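Function to change numeric sentiment values into string values: negative numbers become "negative", positive numbers become "positive", and 0 becomes "neutral". Any other input (such as a label that is already a string) is returned unchanged.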

In [None]:
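def sentiment_to_string(sentiment):
    # Numeric labels (Python or NumPy numbers) are mapped by sign.
    if isinstance(sentiment, (int, float, np.integer, np.floating)):
        if sentiment < 0:
            return "negative"
        if sentiment > 0:
            return "positive"
        return "neutral"
    # Anything else (e.g. an existing string label) is returned unchanged.
    return sentiment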

Final clean dataset as a DataFrame after this line.

In [None]:
en_df['sentiment'] = en_df['sentiment'].apply(sentiment_to_string)
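
As a quick sanity check of the label conversion, the sign-based rule can be run on a handful of mixed inputs. This is a sketch under assumptions: `label_from_score` is a hypothetical stand-in that mirrors the rule described above, and the sample values are invented, not taken from the datasets.

```python
def label_from_score(score):
    # Hypothetical stand-in: numbers are mapped by sign, other values pass through.
    if isinstance(score, (int, float)):
        if score < 0:
            return "negative"
        if score > 0:
            return "positive"
        return "neutral"
    return score

raw_labels = [-0.82, 0, 1, 0.05, "neutral", -1]
converted = [label_from_score(x) for x in raw_labels]

allowed = {"positive", "negative", "neutral"}
print(converted)
print("only three classes:", set(converted) <= allowed)
```

Note that a continuous score only slightly above zero, such as 0.05, is labelled "positive" under this rule, since no neutral band around zero is applied.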