# Challenge: Text Into Data

```yaml
Course:   DS 5001 
Module:   02 Text Models
Topic:    Homework
Author:   Andrew Avitabile
Date:     27 January 2024
```

## Set up data

In [1]:
# Import packages
import pandas as pd

In [2]:
# Read local data and output paths from the shared configuration file
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']
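
The cell above expects `env.ini` to define `data_home` and `output_dir` under `[DEFAULT]`. The file itself is not shown in this notebook, so the sketch below uses placeholder paths (an assumption) purely to illustrate the layout that `configparser` would read.

```python
import configparser

# Hypothetical contents of env.ini; the real paths are not shown in this notebook.
example_ini = """
[DEFAULT]
data_home = /path/to/data
output_dir = /path/to/output
"""

example_config = configparser.ConfigParser()
example_config.read_string(example_ini)

# Keys in [DEFAULT] are looked up the same way as in the cell above.
print(example_config['DEFAULT']['data_home'])
print(example_config['DEFAULT']['output_dir'])
```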

In [3]:
# File locations of the tokenized Austen novels
sense_and_sensibility_file = f"{output_dir}/austen-sense-and-sensibility.csv"
persuasion_file = f"{output_dir}/austen-persuasion.csv"

In [4]:
# Load each novel's token table
sense_and_sensibility_df = pd.read_csv(sense_and_sensibility_file)
persuasion_df = pd.read_csv(persuasion_file)

In [5]:
# Label each row with the title of its novel
sense_and_sensibility_df['title'] = "Sense and Sensibility"
persuasion_df['title'] = "Persuasion"

In [6]:
# Append the novels into a corpus
corpus = pd.concat([sense_and_sensibility_df, persuasion_df], ignore_index=True)
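
As a quick sanity check on the append step, the sketch below uses two small made-up tables (hypothetical data, not the Austen files) to show that `pd.concat` with `ignore_index=True` keeps every row and renumbers the index from zero, so `len()` of the result is the sum of the two inputs.

```python
import pandas as pd

# Toy stand-ins for the two novels (hypothetical data)
a = pd.DataFrame({'term_str': ['the', 'sea']})
b = pd.DataFrame({'term_str': ['a', 'ship', 'sails']})
a['title'] = 'A'
b['title'] = 'B'

both = pd.concat([a, b], ignore_index=True)

# No rows are lost, and the index is rebuilt as 0..n-1
print(len(both) == len(a) + len(b))           # True
print(both.index.tolist())                    # [0, 1, 2, 3, 4]
print(both['title'].value_counts().to_dict()) # {'B': 3, 'A': 2}
```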

## Question 1: How many raw tokens are in the combined data frame?

In [7]:
print("Raw tokens in the combined data frame:", len(corpus))

Raw tokens in the combined data frame: 207896


## Question 2: How many distinct terms are there in the combined data frame (i.e. how big is the vocabulary)?

In [8]:
unique_term_str = corpus['term_str'].nunique()
print("Distinct terms in the combined data frame:", unique_term_str)

Distinct terms in the combined data frame: 8238
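
One detail worth keeping in mind when comparing this count with the raw token count: `nunique()` skips missing values by default, and `pd.read_csv` reads empty fields as missing. The sketch below uses a tiny made-up CSV (the `token_str` column and its contents are assumptions, not the course data) in which a punctuation token has an empty `term_str`.

```python
import io
import pandas as pd

# Hypothetical token table: the ';' token has an empty term_str
csv_text = "token_str,term_str\nThe,the\n;,\nsea,sea\nthe,the\n"
toy = pd.read_csv(io.StringIO(csv_text))

print(len(toy))                               # 4 rows (raw tokens)
print(int(toy['term_str'].isna().sum()))      # 1 row with a missing term
print(toy['term_str'].nunique())              # 2: 'the' and 'sea'
print(toy['term_str'].nunique(dropna=False))  # 3: missing counted as a value
```

If the real files contain rows like this, they count toward the raw token total but not toward the vocabulary size.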


## Question 3: How many more terms does the vocabulary of Sense and Sensibility have than that of Persuasion?

In [9]:
# Calculate the number of terms by title
terms_by_title = corpus.groupby('title')['term_str'].nunique().reset_index(name='Unique_Terms')

# Filter the result based on title name
target_group = 'Persuasion'
persuasion_terms = terms_by_title[terms_by_title['title'] == target_group].reset_index(drop=True)['Unique_Terms'].values
ss_terms = terms_by_title[terms_by_title['title'] != target_group].reset_index(drop=True)['Unique_Terms'].values

# Display the results
print("Number of terms in Sense and Sensibility: ", ss_terms)
print("Number of terms in Persuasion: ", persuasion_terms)
print("Difference: ", ss_terms - persuasion_terms)

Number of terms in Sense and Sensibility:  [6279]
Number of terms in Persuasion:  [5759]
Difference:  [520]
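
The results above print as one-element arrays because `.values` returns a NumPy array. A possible alternative, sketched here on made-up data, is to keep the grouped counts as a Series indexed by title and look each novel up by name; this yields plain numbers and does not rely on there being exactly two titles. The titles `'A'` and `'B'` are placeholders.

```python
import pandas as pd

# Toy corpus (hypothetical data)
toy = pd.DataFrame({
    'title':    ['A', 'A', 'A', 'A', 'B', 'B'],
    'term_str': ['the', 'sea', 'the', 'sky', 'the', 'ship'],
})

# Series of vocabulary sizes, indexed by title
vocab_size = toy.groupby('title')['term_str'].nunique()

# Look up each title by label and subtract
difference = int(vocab_size['A'] - vocab_size['B'])
print(int(vocab_size['A']), int(vocab_size['B']), difference)  # 3 2 1
```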


## Question 4: What is the average number of tokens, rounded to an integer, per chapter in the corpus?

In [10]:
# Count the tokens in each chapter of each novel
tokens_per_chapter = corpus.groupby(['title','chap_num']).size().reset_index(name='Row_Count')

# Average the chapter counts and round to the nearest whole number
average_tokens_per_chapter = tokens_per_chapter['Row_Count'].mean().round()

# Display the result
print("Average tokens per chapter:", average_tokens_per_chapter)

Average tokens per chapter: 2809.0
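
The question asks for an integer, while `.round()` on a float returns a float (hence the trailing `.0`). A minimal sketch of one way to get an integer, using made-up chapter data rather than the corpus:

```python
import pandas as pd

# Toy token table (hypothetical data): one row per token
toy = pd.DataFrame({
    'title':    ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'chap_num': [1, 1, 1, 2, 2, 1, 1],
})

# Chapter sizes are 3, 2 and 2, so the mean is 7/3
per_chapter = toy.groupby(['title', 'chap_num']).size()

# Round the mean, then convert to a plain Python int
average = int(round(per_chapter.mean()))
print(average)  # 2
```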


## Question 5: What is the average number of tokens, rounded to an integer, per paragraph in the corpus?

In [11]:
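# Count the tokens in each paragraph
# Note: this groups by title and para_num only. If para_num restarts at the
# beginning of every chapter, 'chap_num' must be added to the grouping keys;
# otherwise same-numbered paragraphs from different chapters are pooled.
tokens_per_paragraph = corpus.groupby(['title','para_num']).size().reset_index(name='Row_Count')

# Average the paragraph counts and round to the nearest whole number
average_tokens_per_paragraph = tokens_per_paragraph['Row_Count'].mean().round()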

# Display the result
print("Average tokens per paragraph:", average_tokens_per_paragraph)

Average tokens per paragraph: 1118.0
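
An average of 1118 tokens per paragraph is about 40% of the average chapter length reported above (2809), which is implausibly large for a paragraph. One likely explanation, assuming `para_num` restarts in each chapter (the column layout of the CSV files is not shown here, so this is an assumption), is that grouping by `title` and `para_num` alone pools paragraph 1 of every chapter into a single group. The sketch below shows the effect on made-up data.

```python
import pandas as pd

# Toy token table (hypothetical data): para_num restarts in each chapter
toy = pd.DataFrame({
    'title':    ['A'] * 6,
    'chap_num': [1, 1, 1, 2, 2, 2],
    'para_num': [1, 1, 2, 1, 2, 2],
})

# Grouping without chap_num pools same-numbered paragraphs across chapters
pooled = toy.groupby(['title', 'para_num']).size()

# Grouping with chap_num keeps each paragraph separate
nested = toy.groupby(['title', 'chap_num', 'para_num']).size()

print(len(pooled), pooled.mean())  # 2 3.0
print(len(nested), nested.mean())  # 4 1.5
```

If the assumption holds for the Austen files, rerunning the cell above with `['title','chap_num','para_num']` as the grouping keys should give the per-paragraph average.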
