# Data to Text Pipeline using SimpleNLG


Data-to-text conversion turns structured records into readable sentences. This short notebook walks through the key ideas for using SimpleNLG to generate syntactically correct sentences from tabular data.

In [10]:
!pip3 install simplenlg
!pip3 install pandas


In [11]:
import simplenlg

In [12]:
from simplenlg.framework import *
from simplenlg.lexicon import *
from simplenlg.realiser.english import *
from simplenlg.phrasespec import *
from simplenlg.features import *

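SimpleNLG provides two components that this notebook relies on:

Lexicon/morphology system: The default lexicon supplies word entries and is used to compute inflected forms (morphological realisation).

Realiser: Generates text from a syntactic specification such as a clause or sentence object.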

In [13]:
lexicon = Lexicon.getDefaultLexicon()
nlgFactory = NLGFactory(lexicon)
realiser = Realiser(lexicon)

In [14]:
# Example: create a sentence from a ready-made string

s1 = nlgFactory.createSentence("my dog is happy")

In [15]:
# Once the sentence is created, it has to be realised in order to get the final text
output = realiser.realiseSentence(s1)

In [16]:
print(output)

My dog is happy.


In [17]:
# We use a mental health survey dataset and generate a few sentences from multiple column values.

In [18]:
import pandas as pd

survey = pd.read_csv("survey.csv")

In [19]:
# Understand the meaning of these columns


"""
1. Age: Age of the submitter.
2. Gender: Gender of the submitter.
3. Country: Country of the submitter.
4. Family_history: Do you have a family history of mental illness?
5. treatment: Have you sought treatment for a mental health condition?
6. Work_interfere: If you have a mental health condition, do you feel that it interferes with your work?
7. no_employees: How many employees does your company or organization have?
8. Remote_work: Do you work remotely (outside of an office) at least 50% of the time?
9. Tech Company: Is your employer primarily a tech company/organization?
10. benefits: Does your employer provide mental health benefits?
11. care_options: Do you know the options for mental health care your employer provides?
12. wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?
13. seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?
14. Anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment
15. Leave: How easy is it for you to take medical leave for a mental health condition?
16. mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?
17. phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences?
18. coworkers: Would you be willing to discuss a mental health issue with your coworkers?
19. supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
20. mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview?
21. phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview?
22. mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health?
23. obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
"""


Now that we have some understanding of the data, let's look at how age varies between people who have sought treatment for a mental health condition and people who have not.



In [20]:
# Keep only the treatment and Age columns; the age group feature is built from these
survey_age = survey[['treatment', 'Age']]

## Distribution of age of people who seek help for mental health

* Category 1: People in their early 20s [18-24]
* Category 2: People in their late 20s [25-29]
* Category 3: People in their early 30s [30-34]
* Category 4: People in their late 30s [35-39]
* Category 5: People in their early 40s [40-44]
* Category 6: People in their late 40s [45-49]
* Category 7: People aged 50 and above [50-69]
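
The categories above amount to a lookup over sorted age boundaries. The following is a minimal sketch of that idea using only the standard library; the names `BOUNDS`, `LABELS` and `age_to_group` are illustrative and not part of the notebook. The cut points are taken from the `create_age_group` function in this notebook, so 45 counts as late 40s and ages below 18 or from 70 upwards get no label.

```python
from bisect import bisect_right

# Lower bound of each age band in ascending order; 70 closes the last band.
BOUNDS = [18, 25, 30, 35, 40, 45, 50, 70]
LABELS = ["Early 20s", "Late 20s", "Early 30s", "Late 30s",
          "Early 40s", "Late 40s", "50s"]


def age_to_group(age):
    """Return the label for `age`, or None when it falls outside 18-69."""
    i = bisect_right(BOUNDS, age)  # number of boundaries that are <= age
    if i == 0 or i == len(BOUNDS):
        return None
    return LABELS[i - 1]


print([age_to_group(a) for a in (17, 18, 24, 25, 44, 45, 69, 70)])
```

Keeping the boundaries in one list makes it harder for the documented ranges and the code to drift apart. pandas also offers `pd.cut` for this kind of binning on a whole column.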




In [21]:
def create_age_group(age):
    """Map an age to its category label; ages outside 18-69 return None."""
    if 18 <= age < 25:
        return "Early 20s"
    if 25 <= age < 30:
        return "Late 20s"
    if 30 <= age < 35:
        return "Early 30s"
    if 35 <= age < 40:
        return "Late 30s"
    if 40 <= age < 45:
        return "Early 40s"
    if 45 <= age < 50:
        return "Late 40s"
    if 50 <= age < 70:
        return "50s"
    return None


# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# that is raised when a column is set directly on a slice of `survey`.
survey_age = survey_age.assign(age_group=survey['Age'].apply(create_age_group))


In [22]:
# Count rows per (age_group, treatment) pair and name the resulting column "count"
final_df = survey_age.groupby(['age_group', 'treatment']).size().reset_index(name="count")
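
The `groupby(...).size()` step simply counts how often each (age_group, treatment) pair occurs. As a rough plain-Python equivalent, the sketch below does the same with `collections.Counter`; the `rows` list is hypothetical sample data standing in for `survey_age`, not values from the survey.

```python
from collections import Counter

# Hypothetical (age_group, treatment) rows standing in for survey_age.
rows = [
    ("Early 20s", "Yes"),
    ("Early 20s", "No"),
    ("Early 20s", "Yes"),
    ("50s", "No"),
]

# Count each distinct pair, then lay the result out like final_df:
# one record per pair, sorted by age_group and then treatment.
pair_counts = Counter(rows)
table = [
    {"age_group": group, "treatment": treatment, "count": n}
    for (group, treatment), n in sorted(pair_counts.items())
]

for record in table:
    print(record)
```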

In [23]:
final_df.columns

Index(['age_group', 'treatment', 'count'], dtype='object')

In [24]:
final_df

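| | age_group | treatment | count |
|---|---|---|---|
| 0 | 50s | No | 14 |
| 1 | 50s | Yes | 17 |
| 2 | Early 20s | No | 86 |
| 3 | Early 20s | Yes | 70 |
| 4 | Early 30s | No | 173 |
| 5 | Early 30s | Yes | 174 |
| 6 | Early 40s | No | 49 |
| 7 | Early 40s | Yes | 64 |
| 8 | Late 20s | No | 188 |
| 9 | Late 20s | Yes | 172 |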
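
Raw counts are hard to compare across age groups of different sizes, so it can help to turn them into shares. The sketch below copies the counts shown in the output above into a plain dictionary and computes the percentage of each group that sought treatment; `counts` and `treatment_share` are illustrative names, not part of the notebook.

```python
# Counts copied from the final_df output above: (age_group, treatment) -> count.
counts = {
    ("50s", "No"): 14, ("50s", "Yes"): 17,
    ("Early 20s", "No"): 86, ("Early 20s", "Yes"): 70,
    ("Early 30s", "No"): 173, ("Early 30s", "Yes"): 174,
    ("Early 40s", "No"): 49, ("Early 40s", "Yes"): 64,
    ("Late 20s", "No"): 188, ("Late 20s", "Yes"): 172,
}


def treatment_share(counts):
    """Return the percentage of each age group that answered 'Yes'."""
    totals = {}
    for (group, _treatment), n in counts.items():
        totals[group] = totals.get(group, 0) + n
    return {
        group: 100 * counts[(group, "Yes")] / total
        for group, total in totals.items()
    }


shares = treatment_share(counts)
for group, pct in sorted(shares.items()):
    print(f"{group}: {pct:.1f}% sought treatment")
```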


We now have a dataframe with the number of people in each age group who did and did not seek treatment.
Let's convert this information into text.


In [25]:
"""
Required text: 1. [20] people in Age group ---- seeks help for mental illness.
               2. [30] people in Age group ---- does not seek help for mental illness.

               To create these sentences, let's define a small rule:

               Noun phrase: people
               Premodifier: [count]
               Postmodifier: in Age Group + [age group]

               Subject: Noun phrase
               Verb: seek
               Complement: help for mental illness
               
"""

def create_descriptions(row):
    noun_phrase = nlgFactory.createNounPhrase("People")
    noun_phrase.addPreModifier(str(row['count']))
    post_modifier = "in Age Group " + row['age_group']
    noun_phrase.addPostModifier(post_modifier)
    
    sentence = nlgFactory.createClause()
    sentence.setSubject(noun_phrase)
    sentence.setVerb("seek")
    
    if row['treatment'] == 'No':
        
        # This will negate the sentence
        sentence.setFeature(Feature.NEGATED, True)
    
    sentence.addComplement("help for mental illness")

    return realiser.realiseSentence(sentence)



In [26]:
# Realise one sentence per row; the function can be passed to apply() directly
final_df['text'] = final_df.apply(create_descriptions, axis=1)

In [27]:
final_df

| | age_group | treatment | count | text |
|---|---|---|---|---|
| 0 | 50s | No | 14 | 14 People in Age Group 50s does not seek help ... |
| 1 | 50s | Yes | 17 | 17 People in Age Group 50s seeks help for ment... |
| 2 | Early 20s | No | 86 | 86 People in Age Group Early 20s does not seek... |
| 3 | Early 20s | Yes | 70 | 70 People in Age Group Early 20s seeks help fo... |
| 4 | Early 30s | No | 173 | 173 People in Age Group Early 30s does not see... |
| 5 | Early 30s | Yes | 174 | 174 People in Age Group Early 30s seeks help f... |
| 6 | Early 40s | No | 49 | 49 People in Age Group Early 40s does not seek... |
| 7 | Early 40s | Yes | 64 | 64 People in Age Group Early 40s seeks help fo... |
| 8 | Late 20s | No | 188 | 188 People in Age Group Late 20s does not seek... |
| 9 | Late 20s | Yes | 172 | 172 People in Age Group Late 20s seeks help fo... |
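
The realised sentences above use singular agreement ("seeks", "does not seek") even though the subject is plural, because the noun phrase is never marked as plural. The sketch below shows one possible way to address this; it is not a tested fix. It assumes the Python port mirrors the Java SimpleNLG API, in particular that `Feature.NUMBER` and `NumberAgreement.PLURAL` are importable from `simplenlg.features` and that the default lexicon inflects "person" to "people". The helper names `plan_sentence` and `realise_plan` are hypothetical. The row is first reduced to a small plan, which keeps the data handling separate from the SimpleNLG calls.

```python
def plan_sentence(row):
    """Content step: decide what the sentence should say (no SimpleNLG needed)."""
    return {
        "count": int(row["count"]),
        "age_group": str(row["age_group"]),
        "negated": row["treatment"] == "No",
    }


def realise_plan(plan, nlg_factory, realiser):
    """Realisation step, using the factory and realiser created earlier."""
    # Assumption: these names exist in the Python port as they do in Java.
    from simplenlg.features import Feature, NumberAgreement

    # Assumption: the default lexicon turns plural "person" into "people".
    subject = nlg_factory.createNounPhrase("person")
    subject.setFeature(Feature.NUMBER, NumberAgreement.PLURAL)
    subject.addPreModifier(str(plan["count"]))
    subject.addPostModifier("in Age Group " + plan["age_group"])

    clause = nlg_factory.createClause()
    clause.setSubject(subject)
    clause.setVerb("seek")
    clause.addComplement("help for mental illness")
    if plan["negated"]:
        clause.setFeature(Feature.NEGATED, True)
    return realiser.realiseSentence(clause)


# The planning step can be checked on its own with a plain dictionary.
plan = plan_sentence({"age_group": "50s", "treatment": "No", "count": 14})
print(plan)
```

With the objects from this notebook, `realise_plan(plan, nlgFactory, realiser)` would produce the sentence. The intent is for the verb to agree with the plural subject ("seek", "do not seek"), which is worth verifying against the installed version of the library.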
