# Text Feature Extraction Tool

This is a toy project for extracting linguistic features from arbitrary text. The model is simplistic, but it still gives a quick overview of a sample's style: part-of-speech densities, frequencies of selected words, reading ease, and type-token ratio.

In [1]:
import nltk
from text_classifier import TextClassifier
import pandas as pd

In [2]:
def extract_features_from_text(sentences):
    """
    Extract linguistic features from each text in a list.

    :param sentences: a list of texts to analyze (an item may contain several sentences)
    :return: a DataFrame with one row per text and one column per feature
    """
    features = {
        "adj_and_adv_frequency": [],
        "has_subordinate_words": [],
        "modal_frequency": [],
        "peculiar_words": [],
        "plural_usage": [],
        "text_reading_ease": [],
        "article_density": [],
        "preposition_density": [],
        "type_token_ratio": [],
    }

    for sentence in sentences:
        cls = TextClassifier(sentence)
        # Share of adjectives (JJ) and adverbs (RB) among the tagged tokens
        features["adj_and_adv_frequency"].append(cls.calculate_lexical_density_by_tags({"JJ", "RB"}))
        # Raw string is required: in a plain string "\b" is a backspace, not a word boundary
        features["has_subordinate_words"].append(cls.has_peculiar_expression(r"\b(But|So|Because)"))
        # Share of modal verbs (MD)
        features["modal_frequency"].append(cls.calculate_lexical_density_by_tags({"MD"}))
        features["peculiar_words"].append(cls.calculate_words_frequency({"good"}))
        # Share of plural nouns (NNS)
        features["plural_usage"].append(cls.calculate_lexical_density_by_tags({"NNS"}))
        features["text_reading_ease"].append(cls.calculate_sentence_reading_ease())
        features["article_density"].append(cls.calculate_words_frequency({"a", "an", "the"}))
        # Share of prepositions and subordinating conjunctions (IN)
        features["preposition_density"].append(cls.calculate_lexical_density_by_tags({"IN"}))
        features["type_token_ratio"].append(cls.calculate_type_token_ratio())

    return pd.DataFrame(features)
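
The computations themselves live in `TextClassifier`, which is not shown in this notebook. As a rough illustration of what two of the features could look like, here is a minimal sketch that assumes regex-based tokenization, lowercase comparison, and `re.search` for the expression check. The helpers `tokenize`, `type_token_ratio`, and `has_expression` are hypothetical names, not the class's actual methods.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (hypothetical regex tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Distinct tokens divided by total tokens; 0.0 for empty input."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def has_expression(text, pattern):
    """True if the regular expression matches anywhere in the text."""
    return re.search(pattern, text) is not None


sample = "The cat saw the dog and the dog saw a bird"
print(round(type_token_ratio(sample), 3))  # 7 distinct / 11 total -> 0.636

# The pattern must be a raw string: in a plain string, "\b" is a backspace
# character, so the word-boundary check silently never matches.
print(has_expression("So it goes.", r"\b(But|So|Because)"))  # True
print(has_expression("So it goes.", "\b(But|So|Because)"))   # False
```

A lower type-token ratio means more repeated words, which is why longer texts tend to score lower than short ones.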

In [3]:
sents = [
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.",
    "It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy.",
    "Mohandas Karamchand Gandhi has always been a very prominent figure in Indian history. From his unbeatable spirit to inspiring courage, from various controversies to his life as the father of the nation, Gandhi has always been an interesting, inspiring and impressive personality to read about. If you want to know all about Gandhi and his journey, you cannot miss out on reading 'My Experiments with the Truth', his autobiography that covers his life from early childhood till 1921. The introduction mentions how Gandhi resumed writing at the insistence of a fellow prisoner at Yerwada Central jail. The autobiography was written as weekly journals and then compiled and published as a book. From his childhood memories, his experiments with eating meat, smoking, drinking and stealing to the demise of his father, the book captures many unknown instances of Gandhi's life."
]

extract_features_from_text(sents)


| | adj_and_adv_frequency | has_subordinate_words | modal_frequency | peculiar_words | plural_usage | text_reading_ease | article_density | preposition_density | type_token_ratio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.0 | False | 0.0 | 0.0 | 6.0 | 137.755 | 90.0 | 10.0 | 0.66 |
| 1 | 12.632 | False | 2.105 | 0.0 | 4.211 | 152.765991 | 63.158 | 12.632 | 0.621053 |
| 2 | 10.828 | False | 0.637 | 0.0 | 3.822 | 205.209093 | 63.694 | 13.376 | 0.573248 |
