## README
This Jupyter Notebook performs the following tasks:

1. **Import Libraries**: Imports necessary libraries for data manipulation, text processing, and multiprocessing.
   
2. **Get Data**: Reads the input data from a CSV file.

3. **Preprocess Data**:
    - **Extract Distinct Words with Apostrophe**: Finds and lists all distinct words containing apostrophes.
    - **Find Transitional Phrases**: Identifies and counts transitional phrases in the text.
    - **General Text Counting Functions**: Includes functions for counting words, phrases, stopwords, characters, and various other text features.

4. **Feature Engineering**:
    - **Text Analysis Features**: Computes various text features such as word count, stopword count, punctuation count, sentence lengths, and others.
    - **Ratio Features**: Calculates ratios of different features like distinct words ratio, mistakes ratio, and transitional phrases ratio.
    - **Text Statistics Features**: Uses the `textstat` library to compute readability and complexity metrics such as Flesch reading ease, SMOG index, Coleman-Liau index, and others.

5. **Export Data**: Exports the processed DataFrame with the new features to a CSV file.


## Content
  - [1. Import Libraries](#1-Import-Libraries)
  - [2. Get Data](#2-Get-Data)
  - [3. Preprocess Data](#3-Preprocess-Data)
    - [Extract Distinct Words with Apostrophe](#Extract-Distinct-Words-with-Apostrophe)
    - [Find Transitional Phrases](#Find-Transitional-Phrases)
    - [General Text Counting Functions](#General-Text-Counting-Functions)
  - [4. Feature Engineering](#4-Feature-Engineering)
    - [Text Analysis Features](#Text-Analysis-Features)
    - [Ratio Features](#Ratio-Features)
    - [Text Statistics Features](#Text-Statistics-Features)
  - [5. Export Data](#5-Export-Data)

In [3]:
# 1. Import libs
import pandas as pd
import numpy as np
import re
import string
import unicodedata
import time
import multiprocessing
from collections import Counter
from joblib import Parallel, delayed
import matplotlib.pyplot as plt
import spacy
import nltk
from nltk.corpus import wordnet, stopwords
from nltk.tokenize import word_tokenize, regexp_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from spellchecker import SpellChecker
import textstat

# Get the number of available CPU cores
num_cores = multiprocessing.cpu_count()
print(f"Number of available CPU cores: {num_cores}")

# 2. Get data
df = pd.read_csv('transformed_data_v1.csv')

# Load stopwords
stop_words = set(stopwords.words('english'))

# Transitional phrases list
transitional_phrases = [
    'Above all', 'Accordingly', 'Additionally', 'After', 'After all', 'Afterward', 'All in all', 'Also', 'Alternatively', 
    'As a result', 'As an illustration', 'As long as', 'As mentioned earlier', 'As noted', 'At the same time', 'Before', 
    'Besides', 'But', 'By all means', 'Consequently', 'Conversely', 'Correspondingly', 'Despite', 'During', 'Even if', 
    'Even so', 'Especially', 'Eventually', 'Finally', 'First', 'For example', 'For instance', 'Furthermore', 'Hence', 
    'However', 'If', 'In addition', 'In brief', 'In case', 'In comparison', 'In conclusion', 'In fact', 'In contrast', 
    'In other words', 'In particular', 'In simpler terms', 'In summary', 'In the meantime', 'In the same way', 'Indeed', 
    'Instead', 'Lastly', 'Later', 'Likewise', 'Meanwhile', 'Moreover', 'More importantly', 'Namely', 'Nevertheless', 
    'Next', 'Nonetheless', 'Notably', 'Now', 'On the contrary', 'On condition that', 'On one hand', 'On the other hand', 
    'Overall', 'Particularly', 'Plus', 'Previously', 'Provided that', 'Regardless', 'Second', 'Similarly', 'Since', 
    'Specifically', 'Still', 'Subsequently', 'That is', 'Then', 'Therefore', 'Third', 'Thus', 'To clarify', 'To conclude', 
    'To demonstrate', 'To illustrate', 'To put it another way', 'To summarize', 'To sum up', 'Ultimately', 'Unless', 'Unlike', 
    'Until', 'Whereas', 'Yet', 'Above and beyond', 'According to', 'After a while', 'All things considered', 'Although', 
    'Another key point', 'As a consequence', 'As a matter of fact', 'As can be seen', 'As far as', 'As soon as', 'At first', 
    'At last', 'At length', 'At this point', 'Be that as it may', 'By and large', 'By the same token', 'Even though', 
    'For fear that', 'For that reason', 'For the most part', 'Granted', 'Henceforth', 'If by chance', 'If so', 'In a moment', 
    'In any case', 'In any event', 'In light of', 'In order to', 'In particular', 'In reality', 'In short', 'In spite of', 
    'In view of', 'It follows that', 'Least of all', 'Most importantly', 'Needless to say', 'Of course', 'On the whole', 
    'One example is', 'One reason is', 'Or', 'Over time', 'Prior to', 'Provided that', 'Seeing that', 'So as to', 'Sooner or later', 
    'Such as', 'That being said', 'The next step', 'Thereafter', 'Thereby', 'Thirdly', 'Through', 'Till', 'To be sure', 
    'To begin with', 'To illustrate', 'To reiterate', 'To the end that', 'To this end', 'Until now', 'Up to now', 'What is more', 
    'Without a doubt', 'Without delay', 'Without exception', 'Yet again'
]

# 3. Preprocess data
# Function to extract distinct words with apostrophe
def extract_distinct_words_with_apostrophe(full_text):
    pattern = r"\b\w*'\w*\b"
    words_with_apostrophe = set()
    
    if isinstance(full_text, str):
        texts = [full_text]
    else:
        texts = full_text
    
    for text in texts:
        matches = re.findall(pattern, text)
        words_with_apostrophe.update(matches)
    
    return sorted(words_with_apostrophe)

# Function to find transitional phrases in text
def find_transitional_phrases(text):
    return [phrase for phrase in transitional_phrases if phrase.lower() in text.lower()]

# General text counting functions
def count_phrases_in_list(phrases_list):
    return len(phrases_list)

def count_distinct_phrases(phrases_list):
    return len(set(phrases_list))

def count_mistakes(word_list):
    return len(word_list)

def count_distinct_mistakes(word_list):
    return len(set(word_list))

def count_words(text):
    return len(text.split())

def count_distinct_words(text):
    words = text.lower().split()
    return len(set(words))

def count_occurrences(text, pattern):
    return len(re.findall(pattern, text))

def count_character(text, character):
    return text.count(character)

def count_stopwords(text):
    words = text.lower().split()
    return len([word for word in words if word in stop_words])

def count_newlines(text):
    return len([line for line in text.splitlines() if line.strip()])

def count_capital_words(text):
    words = text.split()
    return sum(word[0].isupper() for word in words)

def count_letters(text):
    return sum(char.isalpha() for char in text)

def count_capital_letters(text):
    return sum(char.isupper() for char in text)

def sentence_lengths(text):
    sentences = re.split(r'[.!?]\s*', text)
    lengths = [len(sentence.split()) for sentence in sentences if sentence]
    return lengths

def min_sentence_length(text):
    lengths = sentence_lengths(text)
    return min(lengths) if lengths else 0

def max_sentence_length(text):
    lengths = sentence_lengths(text)
    return max(lengths) if lengths else 0

def avg_sentence_length(text):
    lengths = sentence_lengths(text)
    return np.mean(lengths) if lengths else 0

# 4. Feature Engineering
## Text Analysis Features
print("Working on Text Analysis Features")
start_time = time.time()

df['word_count_in_full_text'] = df['full_text'].str.split().apply(len)
df["stopword_count_in_full_text"] = df["full_text"].apply(count_stopwords)
df['newline_count'] = df['full_text'].str.count('\n')
df['not_count'] = df['full_text'].str.lower().str.count('not')
df['question_count'] = df['full_text'].str.count('\?')
df['comma_count'] = df['full_text'].str.count(',')
df['colon_count'] = df['full_text'].str.count(':')
df['exclamation_count'] = df['full_text'].str.count('!')
df['dash_count'] = df['full_text'].str.count('-')
df['capital_word_count'] = df['full_text'].apply(count_capital_words)
df['letters_count'] = df['full_text'].apply(count_letters)
df['capital_letters_count'] = df['full_text'].apply(count_capital_letters)
df['min_sentence_length'] = df['full_text'].apply(min_sentence_length)
df['max_sentence_length'] = df['full_text'].apply(max_sentence_length)
df['avg_sentence_length'] = df['full_text'].apply(avg_sentence_length)
df['words_with_apostrophe'] = df['preprocessed_text_part1'].apply(extract_distinct_words_with_apostrophe)
df['transitional_phrases'] = df['preprocessed_text'].apply(find_transitional_phrases)
df['mistakes_count'] = df['misspelled_words_spell_checker'].apply(count_mistakes)
df['mistakes_dist_count'] = df['misspelled_words_spell_checker'].apply(count_distinct_mistakes)
df['transitional_phrases_c'] = df['transitional_phrases'].apply(count_phrases_in_list)
df['transitional_phrases_dist_c'] = df['transitional_phrases'].apply(count_distinct_phrases)
df['words_with_apostrophe_count'] = df['words_with_apostrophe'].apply(count_phrases_in_list)
df['preprocessed_text_count'] = df['preprocessed_text'].apply(count_words)
df['preprocessed_text_dist_count'] = df['preprocessed_text'].apply(count_distinct_words)

end_time = time.time()
print(f"Elapsed time: {end_time - start_time} seconds")

## Ratio Features
print("Working on Ratio Features")
start_time = time.time()

df['text_dist_words_ratio'] = df.apply(lambda x: x['preprocessed_text_dist_count'] / x['preprocessed_text_count'], axis=1)
df['mistakes_dist_ratio'] = df.apply(lambda x: x['mistakes_dist_count'] / x['preprocessed_text_count'], axis=1)
df['mistakes_dist_dist_ratio'] = df.apply(lambda x: x['mistakes_dist_count'] / x['preprocessed_text_dist_count'], axis=1)
df['transitional_phrases_ratio'] = df.apply(lambda x: x['transitional_phrases_c'] / x['preprocessed_text_count'], axis=1)
df['transitional_dist_dist_phrases_ratio'] = df.apply(lambda x: x['transitional_phrases_dist_c'] / x['preprocessed_text_count'], axis=1)
df['transitional_dist_phrases_ratio'] = df.apply(lambda x: x['transitional_phrases_dist_c'] / x['preprocessed_text_dist_count'], axis=1)

end_time = time.time()
print(f"Elapsed time: {end_time - start_time} seconds")

## Text Statistics Features
print("Working on Text Statistics Features")
start_time = time.time()

def textstat_features(text):
    features = {}  
    features['flesch_reading_ease'] = textstat.flesch_reading_ease(text)  
    features['flesch_kincaid_grade'] = textstat.flesch_kincaid_grade(text)  
    features['smog_index'] = textstat.smog_index(text)  
    features['coleman_liau_index'] = textstat.coleman_liau_index(text)  
    features['automated_readability_index'] = textstat.automated_readability_index(text)  
    features['dale_chall_readability_score'] = textstat.dale_chall_readability_score(text)  
    features['difficult_words'] = textstat.difficult_words(text)  
    features['linsear_write_formula'] = textstat.linsear_write_formula(text)  
    features['gunning_fog'] = textstat.gunning_fog(text)  
    features['text_standard'] = textstat.text_standard(text, float_output=True)  
    features['spache_readability'] = textstat.spache_readability(text)  
    features['mcalpine_eflaw'] = textstat.mcalpine_eflaw(text)  
    features['reading_time'] = textstat.reading_time(text)  
    features['syllable_count'] = textstat.syllable_count(text)  
    features['lexicon_count'] = textstat.lexicon_count(text)  
    features['monosyllabcount'] = textstat.monosyllabcount(text)  
    features['char_count'] = textstat.char_count(text)  
    features['sentence_count'] = textstat.sentence_count(text)  
    features['polysyllabcount'] = textstat.polysyllabcount(text)  
    features['reading_time_minutes'] = textstat.reading_time(text) / 60  
    return features  

results = Parallel(n_jobs=num_cores)(delayed(textstat_features)(text) for text in df['full_text_without_non_ascii'])
features_df = pd.DataFrame(results)
df = pd.concat([df, features_df], axis=1)

end_time = time.time()
print(f"Elapsed time: {end_time - start_time} seconds")

# 5. Export Data

# Export the DataFrame to a CSV file
print("Exporting file")

df.to_csv('numeric_features_added_v1.csv', index=False)
print("File exported")

Number of available CPU cores: 32
Working on Text Analysis Features
Elapsed time: 55.57690668106079 seconds
Working on Ratio Features
Elapsed time: 2.8294968605041504 seconds
Working on Text Statistics Features
Elapsed time: 22.818469047546387 seconds
Exporting file
File exported
